# Preprocessing
So far, we have just over 700 scripts reduced to a bag-of-words representation, along with their ratings (on a scale from 1 to 10). Let's preprocess!
We'll read in the data file and reduce its dimensionality. We'll also convert the ratings from floats to integers so we can use random forests in modeling.

In [1]:
import pandas as pd
ratingsScriptsML = pd.read_csv('ratingsAndBagOfWords.csv')
ratingsScriptsML.head()

,Unnamed: 0,00,000,10,100,101,102,103,104,105,...,genre_mystery,genre_news,genre_noir,genre_romance,genre_sci,genre_sport,genre_thriller,genre_war,genre_western,averageRating
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,6.2
1,1,0,0,0,1,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,4.3
2,2,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,5.5
3,3,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.8
4,4,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,7.3


In [2]:
# Drop the unnamed leftover column (typically a row index saved by an earlier to_csv call)
ratingsScriptsML.drop('Unnamed: 0', axis=1, inplace=True)
ratingsScriptsML.head()

,00,000,10,100,101,102,103,104,105,106,...,genre_mystery,genre_news,genre_noir,genre_romance,genre_sci,genre_sport,genre_thriller,genre_war,genre_western,averageRating
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,6.2
1,0,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,4.3
2,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,5.5
3,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4.8
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,7.3


Our dependent variable, the rating, is continuous. For the random forest, we need integers. Since the ratings as given are rounded to the tenths place, we'll multiply them by 10. This way, our ratings will be integers ranging from 10 to almost 100.

In [16]:
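# Scale the one-decimal ratings to whole numbers (6.2 -> 62).
# Round before casting: astype(int) truncates, so a product stored
# a hair below a whole number would otherwise lose a unit.
ratings = (ratingsScriptsML['averageRating'] * 10).round().astype(int)
ratingsDF = pd.DataFrame({'avgRating': ratings})
ratingsDF.head()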

,avgRating
0,62
1,43
2,55
3,48
4,73
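
As a quick check of the conversion, the sketch below applies it to the five ratings shown in the preview. The `sample` Series is illustrative, not the full dataset. The rounding step is an extra precaution rather than something these five values require: `astype(int)` truncates toward zero, so a product that floating-point arithmetic stores just below a whole number (the familiar case is `0.29 * 100`) would otherwise come out one too small.

```python
import pandas as pd

# Illustrative sample: the five ratings from the preview above.
sample = pd.Series([6.2, 4.3, 5.5, 4.8, 7.3], name='averageRating')

# Scale to whole numbers, round, then cast to int.
scaled = (sample * 10).round().astype(int)
print(scaled.tolist())

# Why round first: casting alone truncates toward zero.
truncated = int(0.29 * 100)   # 0.29 * 100 is stored just below 29
rounded = round(0.29 * 100)
print(truncated, rounded)
```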


To reduce dimensionality, we first need to separate the target (the average rating) from the bag-of-words features.

In [14]:
ratingsScriptsML2 = ratingsScriptsML.drop('averageRating', axis=1)


In [36]:
from sklearn.feature_selection import SelectKBest, f_regression

# Keep the 27 features with the highest univariate F-scores against the ratings.
# Reference: https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/
fs = SelectKBest(score_func=f_regression, k=27)
# apply feature selection
scriptsSelected = fs.fit_transform(ratingsScriptsML2, ratings)
scriptsSelected.shape

(708, 27)

In [37]:
terms = fs.get_feature_names_out()
print(terms)

['19' 'avoid' 'buddy' 'buildings' 'dimly' 'discovery' 'families' 'gotten'
 'ledge' 'northern' 'race' 'scurry' 'slightest' 'sprint' 'thrill' 'thump'
 'wailing' 'watched' 'genre_adventure' 'genre_animation'
 'genre_documentary' 'genre_drama' 'genre_horror' 'genre_music'
 'genre_musical' 'genre_thriller' 'genre_western']


Convert the reduced bag-of-words array into a DataFrame, restoring the selected term names as column labels.

In [38]:
# fit_transform returned a bare NumPy array, so reattach the term names as column labels
scriptsSelectedDF = pd.DataFrame(scriptsSelected, columns=terms)
scriptsSelectedDF.head()

,19,avoid,buddy,buildings,dimly,discovery,families,gotten,ledge,northern,...,watched,genre_adventure,genre_animation,genre_documentary,genre_drama,genre_horror,genre_music,genre_musical,genre_thriller,genre_western
0,0,2,0,1,0,0,0,1,1,0,...,0,0,0,0,1,0,0,0,0,0
1,0,1,0,1,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,0
2,0,0,1,0,2,0,0,0,0,0,...,0,0,0,0,1,1,0,0,1,0
3,0,0,0,0,1,0,0,0,2,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,1,1,1,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0


Attach the integer ratings to the reduced bag-of-words for saving. `join` aligns rows on the index, which both objects share.

In [39]:
ratingsScriptsSelectedDF = scriptsSelectedDF.join(ratings)
ratingsScriptsSelectedDF.head()

,19,avoid,buddy,buildings,dimly,discovery,families,gotten,ledge,northern,...,genre_adventure,genre_animation,genre_documentary,genre_drama,genre_horror,genre_music,genre_musical,genre_thriller,genre_western,averageRating
0,0,2,0,1,0,0,0,1,1,0,...,0,0,0,1,0,0,0,0,0,62
1,0,1,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,43
2,0,0,1,0,2,0,0,0,0,0,...,0,0,0,1,1,0,0,1,0,55
3,0,0,0,0,1,0,0,0,2,0,...,0,0,0,0,0,0,0,0,0,48
4,0,1,1,1,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,73
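
The join works here because the feature table and the ratings carry the same default row index. The sketch below, using a hypothetical three-row miniature of both objects, shows what that alignment does and what would happen if the indexes did not match.

```python
import pandas as pd

# Hypothetical miniature of the feature table and the ratings.
features = pd.DataFrame({'avoid': [2, 1, 0], 'buddy': [0, 0, 1]})
target = pd.Series([62, 43, 55], name='averageRating')

# Both share the default index 0..2, so each rating lands on its own row.
joined = features.join(target)
print(joined)

# With a different index nothing lines up and the new column is all NaN.
shifted = pd.Series([62, 43, 55], index=[10, 11, 12], name='averageRating')
misaligned = features.join(shifted)
print(misaligned['averageRating'].isna().all())
```

Note that `join` needs the Series to have a name, which becomes the new column's label.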


Save the new dataframe for modeling.

In [40]:
ratingsScriptsSelectedDF.to_csv('ratingsAndScriptsPreprocessed.csv')
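
One thing to be aware of when reloading this file: by default `to_csv` also writes the row index, which `read_csv` then loads as a column named `Unnamed: 0`, like the one dropped at the start of this section. The sketch below uses a hypothetical two-row stand-in and an in-memory buffer instead of a real file to compare the default with `index=False`.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the preprocessed table.
df = pd.DataFrame({'avoid': [2, 1], 'averageRating': [62, 43]})

# Default: the row index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()
print(cols_default)

# index=False writes only the data columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()
print(cols_no_index)
```

Either approach works, as long as the modeling step knows whether to expect the extra column. Passing `index_col=0` to `read_csv` is another way to absorb it on load.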