## 2. Content-Based Filtering - preprocessing
Content-based recommenders suggest items that are similar to a particular item. They use item metadata (for movies: genre, director, actors, and so on) to make these recommendations. The general idea behind these systems is that if a person liked a particular item, he or she will probably also like an item that is similar to it.

### 2.1 build a CountVectorizer model for the combined information column (genres, cast, director)

In [19]:
import pandas as pd 
import numpy as np
import scipy.spatial.distance as dist
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer

In [38]:
combined = pd.read_csv('data/combined_info.csv',encoding='iso-8859-1',index_col=0)

In [39]:
cv = CountVectorizer(input='content', encoding='iso-8859-1', decode_error='ignore', analyzer='word',
                      ngram_range=(1,1))
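To make the role of the vectorizer concrete, here is a minimal sketch on three made-up `info` strings (the strings and the `toy_` names are illustrative assumptions, not rows from `combined_info.csv`). It shows how each whitespace-separated token becomes one column and each movie becomes one row of counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical 'info' strings in the same style as the combined column:
# names are squashed into single tokens so each person is one vocabulary entry.
toy_info = [
    "tomhanks timallen johnlasseter animation comedy",
    "robinwilliams kirstendunst adventure fantasy",
    "tomhanks robinwright robertzemeckis comedy drama",
]

toy_cv = CountVectorizer(analyzer="word", ngram_range=(1, 1))
toy_counts = toy_cv.fit_transform(toy_info)  # sparse matrix: rows = movies, columns = tokens

tomhanks_col = toy_cv.vocabulary_["tomhanks"]  # column index assigned to this token

print(toy_counts.shape)                        # (number of movies, number of distinct tokens)
print(toy_counts.toarray()[:, tomhanks_col])   # how often 'tomhanks' occurs in each movie
```

Movies that share cast members or genres end up with overlapping non-zero columns, which is what a similarity measure can later pick up.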

#### the full dataset is too large, so we first model a sample of the combined data (the first 2,000 rows)

In [59]:
sample = combined.iloc[:2000,:].copy()
cv_model = cv.fit_transform(sample['info'])

In [60]:
cv_model.get_shape()

(2000, 4200)

In [69]:
df_cv = pd.DataFrame(cv_model.toarray(), index=sample.index,columns=sorted(cv.vocabulary_))
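Labelling the columns with `sorted(cv.vocabulary_)` relies on the vectorizer assigning column indices in alphabetical token order when it learns the vocabulary itself. The sketch below checks that assumption on a toy vocabulary (the two input strings are made up for illustration). Recent scikit-learn releases also provide `get_feature_names_out()`, which returns the names in column order directly.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_cv = CountVectorizer()
toy_cv.fit(["drama comedy", "animation comedy"])

# Tokens ordered by the column index the vectorizer assigned to them
by_index = [tok for tok, idx in sorted(toy_cv.vocabulary_.items(), key=lambda kv: kv[1])]

# Tokens ordered alphabetically, as in sorted(cv.vocabulary_)
alphabetical = sorted(toy_cv.vocabulary_)

# If the two orderings agree, the alphabetical labels match the matrix columns
print(by_index == alphabetical)
```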

In [68]:
sample.head(2)

MovieID,adult,popularity,vote_average,vote_count,info
862,0,21.946943,7.7,5415.0,tomhanks timallen donrickles johnlasseter anim...
8844,0,17.015539,6.9,2413.0,robinwilliams jonathanhyde kirstendunst adven...


### 2.2 concatenate the numerical columns (popularity, vote_average, vote_count, adult), indexed by MovieID

In [63]:
cols = ['popularity','vote_average','vote_count','adult']
sample2 = sample.loc[:,cols].copy()
featureMatrix = pd.concat([df_cv,sample2],axis=1)
featureMatrix.head(2)

MovieID,aaroneckhart,aaronkimjohnston,aaronschwartz,abbaskiarostami,abbassayah,abdolrahmanbagheri,abelferrara,abo,abrahampolonsky,acasares,...,ðº,ðºð,ð½,ð½ð,ð¾ð,ð¾ð½ñ,popularity,vote_average,vote_count,adult
862,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,21.946943,7.7,5415.0,0
8844,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,17.015539,6.9,2413.0,0
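`pd.concat(..., axis=1)` matches rows by index label, not by position, so the token counts and the numeric columns line up as long as both frames carry the same `MovieID` index. A small sketch with made-up values (the frames `counts` and `numeric` are hypothetical) illustrates this, and also checks for duplicate column names, which could appear if a token happened to share a name with a numeric column.

```python
import pandas as pd

# Hypothetical token counts and numeric columns sharing a MovieID index,
# deliberately listed in a different row order
counts = pd.DataFrame(
    {"comedy": [1, 0], "drama": [0, 1]},
    index=pd.Index([862, 8844], name="MovieID"),
)
numeric = pd.DataFrame(
    {"popularity": [17.0, 21.9], "adult": [0, 0]},
    index=pd.Index([8844, 862], name="MovieID"),
)

combined_toy = pd.concat([counts, numeric], axis=1)  # aligned on MovieID labels

print(combined_toy)
print(combined_toy.columns.is_unique)  # False would signal a token/column name clash
```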


### 2.3 normalize the column values to the same scale

Some components of the movie profile are token counts (mostly 0 or 1, so effectively Boolean), while others are real-valued or integer-valued. We can compute the cosine distance between profile vectors, but before that we should scale the non-Boolean components appropriately, so that they neither dominate the calculation nor become irrelevant.
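The effect of scaling on cosine distance can be seen in a small sketch. The three profiles below are invented for illustration (two 0/1 token columns followed by `vote_count` and `vote_average`), and min-max scaling is used here because it is the scaler imported earlier; other scalings are possible.

```python
import numpy as np
import scipy.spatial.distance as dist
from sklearn.preprocessing import MinMaxScaler

# Hypothetical profiles: two 0/1 token columns, then vote_count and vote_average
profiles = np.array([
    [1, 0, 5415.0, 7.7],   # movie A
    [1, 0,   12.0, 7.5],   # movie B: same tokens as A, far fewer votes
    [0, 1, 5000.0, 3.0],   # movie C: different tokens, vote_count close to A
])

# Cosine distance on the raw values
raw_ab = dist.cosine(profiles[0], profiles[1])
raw_ac = dist.cosine(profiles[0], profiles[2])

# Cosine distance after mapping every column to the range [0, 1]
scaled = MinMaxScaler().fit_transform(profiles)
scaled_ab = dist.cosine(scaled[0], scaled[1])
scaled_ac = dist.cosine(scaled[0], scaled[2])

print(f"raw:    d(A,B)={raw_ab:.3f}  d(A,C)={raw_ac:.3f}")
print(f"scaled: d(A,B)={scaled_ab:.3f}  d(A,C)={scaled_ac:.3f}")
```

On the raw values, `vote_count` dominates and A looks closer to C than to B, even though A and B share all of their tokens. After scaling, the token columns carry comparable weight and A is closer to B.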

In [64]:
scaler = MinMaxScaler()
featureMatrix_norm = scaler.fit_transform(featureMatrix)

In [65]:
featureMatrix_norm = pd.DataFrame(featureMatrix_norm,index=featureMatrix.index, columns=featureMatrix.columns)

In [66]:
featureMatrix_norm.to_csv('data/movie_norm_featureMatrix_sample.csv')