# Data Pre-processing and Training Data Development

**<font color = 'teal'> After thorough exploratory data analysis, we saved two separate datasets containing data from Kaggle and IMDB. In this notebook, we carry out pre-processing to make the data ready for model training. This includes the necessary steps such as feature engineering and separating the data into training and test sets.**

In [1]:
# import necessary Python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import collections
from collections import Counter

In [2]:
# import both datasets
Kaggle = pd.read_csv("clean_datafiles/Kaggledata_ready_for_preprocessing.csv", index_col = 0)
IMDB = pd.read_csv("clean_datafiles/IMDBdata_ready_for_preprocessing.csv", index_col = 0)

In [3]:
Kaggle.head(2)

Unnamed: 0,title,cast,country,date_added,release_year,duration_min,listed_in,description
1,7:19,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016,2016,93,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,23:59,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,2018,2011,78,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."


In [4]:
IMDB.head(2)

Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
0,Miss Jerry,1894,Romance,45,USA,,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,1.0
1,The Story of the Kelly Gang,1906,"Biography, Crime, Drama",70,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,7.0


In [5]:
Kaggle.columns

Index(['title', 'cast', 'country', 'date_added', 'release_year',
       'duration_min', 'listed_in', 'description'],
      dtype='object')

In [6]:
IMDB.columns

Index(['title', 'release_year', 'genre', 'duration_min', 'country', 'language',
       'director', 'writer', 'production_company', 'actors', 'description',
       'avg_vote', 'votes', 'reviews_from_users'],
      dtype='object')

**<font color = 'teal'> Before feature engineering, we form two new datasets from the existing ones. Movies common to both datasets will be used to test the regression model, which will be trained on movies that appear only in the IMDB dataset and not in the Kaggle dataset.**

In [7]:
# Inner-merge both datasets on 'title' and 'release_year' to find the movies they share
common = IMDB.merge(Kaggle, how = 'inner', on = ['title', 'release_year'])
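
As an illustration of the merge above, the sketch below runs the same kind of inner join on two small made-up frames. The rows and the names `imdb_toy` and `kaggle_toy` are hypothetical and do not come from the datasets. It also shows a left merge with `indicator=True`, which is one possible way to label each IMDB row as shared or IMDB-only using both `title` and `release_year`.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical rows, for illustration only)
imdb_toy = pd.DataFrame({
    "title": ["Alpha", "Beta", "Gamma"],
    "release_year": [2001, 2005, 2010],
    "avg_vote": [6.1, 7.3, 5.4],
})
kaggle_toy = pd.DataFrame({
    "title": ["Alpha", "Gamma", "Gamma"],
    "release_year": [2001, 1999, 2010],
    "duration_min": [90, 80, 101],
})

# Inner merge keeps only (title, release_year) pairs present in both frames
common_toy = imdb_toy.merge(kaggle_toy, how="inner", on=["title", "release_year"])

# Left merge with an indicator column marks each IMDB row as 'both' or 'left_only'
keys = kaggle_toy[["title", "release_year"]].drop_duplicates()
flagged = imdb_toy.merge(keys, how="left", on=["title", "release_year"], indicator=True)
imdb_only_toy = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")

print(common_toy)
print(imdb_only_toy)
```

Matching on both keys keeps a remake that shares a title with an older film from being treated as the same movie.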

# Feature Engineering

In [8]:
# list_unique_values_in_column returns the split column, the set of unique values, and their count for a comma-separated categorical column
def list_unique_values_in_column(data_frame, column):
    df = data_frame[column].str.split(',')
    list1 = []
    for i in df:
        list1.extend(each.strip() for each in i)
    return df, set(list1), len(set(list1))

In [9]:
# dummie_variables creates one 0/1 dummy column per unique value of a comma-separated categorical column
def dummie_variables(dataframe, column):
    new_column = dataframe[column].str.split(',').apply(lambda x: [each.strip() for each in x])
    col_names = list_unique_values_in_column(dataframe, column)[1]
    new_dataframe = pd.DataFrame(index = new_column.index, columns = col_names)
    print(new_dataframe.shape)
    new_dataframe[:] = 0
    for i, each in new_column.items():
        new_dataframe.loc[i, each] = 1
    return new_dataframe          
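
For comparison, pandas provides `Series.str.get_dummies`, which could produce a similar 0/1 table without an explicit loop. The sketch below is a minimal example on made-up genre strings; the frame `toy` and its values are assumptions for illustration, and this is not the function used in the rest of the notebook.

```python
import pandas as pd

# Hypothetical genre strings in the same comma-separated layout as the 'genre' column
toy = pd.DataFrame(
    {"genre": ["Drama, History", "Adventure, Drama, Fantasy", "Romance"]},
    index=[3, 4, 9],
)

# Remove spaces around commas so ' History' and 'History' become the same label
cleaned = toy["genre"].str.replace(r"\s*,\s*", ",", regex=True).str.strip()

# One 0/1 column per unique label; the original index is preserved
toy_dummies = cleaned.str.get_dummies(sep=",")
print(toy_dummies)
```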

### Dummy variables for 'language' column

In [10]:
a = list_unique_values_in_column(IMDB, 'language')
print("Total number of languages: ", a[2])

Total number of languages:  261


In [11]:
a[1]

{'Abkhazian',
 'Aboriginal',
 'Afrikaans',
 'Akan',
 'Albanian',
 'Algonquin',
 'American Sign Language',
 'Amharic',
 'Ancient (to 1453)',
 'Apache languages',
 'Arabic',
 'Aragonese',
 'Aramaic',
 'Arapaho',
 'Armenian',
 'Aromanian',
 'Assamese',
 'Assyrian Neo-Aramaic',
 'Athapascan languages',
 'Australian Sign Language',
 'Awadhi',
 'Aymara',
 'Azerbaijani',
 'Bable',
 'Balinese',
 'Bambara',
 'Basque',
 'Belarusian',
 'Bemba',
 'Bengali',
 'Berber languages',
 'Bhojpuri',
 'Bicolano',
 'Bosnian',
 'Brazilian Sign Language',
 'Breton',
 'British Sign Language',
 'Bulgarian',
 'Burmese',
 'Cantonese',
 'Catalan',
 'Central American Indian languages',
 'Chechen',
 'Cheyenne',
 'Chinese',
 'Cornish',
 'Corsican',
 'Cree',
 'Creek',
 'Crimean Tatar',
 'Croatian',
 'Crow',
 'Czech',
 'Danish',
 'Dari',
 'Dinka',
 'Dutch',
 'Dyula',
 'Dzongkha',
 'Eastern Frisian',
 'Egyptian (Ancient)',
 'English',
 'Esperanto',
 'Estonian',
 'Ewe',
 'Faroese',
 'Filipino',
 'Finnish',
 'Flemish',
 'F

**Note: The unique language list contains the value "None", which is a placeholder for a missing language rather than a real one. One option is to replace it with an empty string '', create the dummies, and delete the resulting empty-string column. In the cells below, "None" is instead replaced with NaN and the affected rows are dropped.**

In [12]:
IMDB.isnull().sum()

title                 0
release_year          0
genre                 0
duration_min          0
country               0
language              0
director              0
writer                0
production_company    0
actors                0
description           0
avg_vote              0
votes                 0
reviews_from_users    0
dtype: int64

In [13]:
IMDB_None = IMDB[IMDB.language == 'None']

In [14]:
IMDB_None.shape

(53, 14)

In [15]:
IMDB_None.head(10)

Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
0,Miss Jerry,1894,Romance,45,USA,,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,1.0
1,The Story of the Kelly Gang,1906,"Biography, Crime, Drama",70,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,7.0
47,The Birth of a Nation,1915,"Drama, History, War",195,USA,,D.W. Griffith,"Thomas Dixon Jr., Thomas Dixon Jr.",David W. Griffith Corp.,"Henry B. Walthall, Lillian Gish, Mae Marsh, Mi...",The Stoneman family finds its friendship with ...,6.3,22213,368.0
136,Tarzan of the Apes,1918,"Action, Adventure",73,USA,,Scott Sidney,"Edgar Rice Burroughs, Fred Miller",National Film Corporation of America,"Elmo Lincoln, Enid Markey, True Boardman, Kath...","Reared by a childless ape, the orphaned heir o...",5.7,684,14.0
245,The Four Horsemen of the Apocalypse,1921,"Drama, Romance, War",150,USA,,Rex Ingram,"Vicente Blasco Ibáñez, June Mathis",Metro Pictures Corporation,"Pomeroy Cannon, Josef Swickard, Bridgetta Clar...",An extended family split up in France and Germ...,7.2,3058,45.0
301,The Prisoner of Zenda,1922,"Adventure, Romance",125,USA,,Rex Ingram,"Anthony Hope, Mary O'Hara",Metro Pictures Corporation,"Lewis Stone, Alice Terry, Robert Edeson, Stuar...",When the king is drugged and abducted by his a...,6.7,452,16.0
313,The Toll of the Sea,1922,"Drama, Family, Fantasy",54,USA,,Chester M. Franklin,Frances Marion,Technicolor,"Anna May Wong, Kenneth Harlan, Beatrice Bentle...","While visiting China, an American man falls in...",6.6,642,25.0
386,The Red Lily,1924,Drama,81,USA,,Fred Niblo,"Fred Niblo, Bess Meredyth",Metro-Goldwyn Pictures Corporation,"Ramon Novarro, Enid Bennett, Frank Currier, Mi...",Two young lovers escape their past lives to Pa...,7.1,415,22.0
499,The Lodger: A Story of the London Fog,1927,"Crime, Drama, Mystery",68,UK,,Alfred Hitchcock,"Marie Belloc Lowndes, Eliot Stannard",Gainsborough Pictures,"Marie Ault, Arthur Chesney, June Tripp, Malcol...",A landlady suspects her new lodger is the madm...,7.3,10024,102.0
627,La chute de la maison Usher,1928,Horror,63,France,,Jean Epstein,"Edgar Allan Poe, Luis Buñuel",Films Jean Epstein,"Jean Debucourt, Marguerite Gance, Charles Lamy...",Allan visits the sinister Usher family mansion...,7.3,3369,26.0


In [16]:
IMDB_None[['country', 'language']]

Unnamed: 0,country,language
0,USA,
1,Australia,
47,USA,
136,USA,
245,USA,
301,USA,
313,USA,
386,USA,
499,UK,
627,France,


In [17]:
IMDB.replace('None', np.nan , inplace = True)
IMDB.head()

Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
0,Miss Jerry,1894,Romance,45,USA,,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,1.0
1,The Story of the Kelly Gang,1906,"Biography, Crime, Drama",70,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,7.0
3,Cleopatra,1912,"Drama, History",100,USA,English,Charles L. Gaskill,Victorien Sardou,Helen Gardner Picture Players,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,446,25.0
4,L'Inferno,1911,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",Dante Alighieri,Milano Film,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2237,31.0
5,"From the Manger to the Cross; or, Jesus of Naz...",1912,"Biography, Drama",60,USA,English,Sidney Olcott,Gene Gauntier,Kalem Company,"R. Henderson Bland, Percy Dyer, Gene Gauntier,...","An account of the life of Jesus Christ, based ...",5.7,484,13.0


In [18]:
IMDB.dropna(inplace = True)
IMDB.head()

Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
3,Cleopatra,1912,"Drama, History",100,USA,English,Charles L. Gaskill,Victorien Sardou,Helen Gardner Picture Players,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,446,25.0
4,L'Inferno,1911,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",Dante Alighieri,Milano Film,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2237,31.0
5,"From the Manger to the Cross; or, Jesus of Naz...",1912,"Biography, Drama",60,USA,English,Sidney Olcott,Gene Gauntier,Kalem Company,"R. Henderson Bland, Percy Dyer, Gene Gauntier,...","An account of the life of Jesus Christ, based ...",5.7,484,13.0
6,Madame DuBarry,1919,"Biography, Drama, Romance",85,Germany,German,Ernst Lubitsch,"Norbert Falk, Hanns Kräly",Projektions-AG Union (PAGU),"Pola Negri, Emil Jannings, Harry Liedtke, Edua...","The story of Madame DuBarry, the mistress of L...",6.8,753,12.0
7,Quo Vadis?,1913,"Drama, History",120,Italy,Italian,Enrico Guazzoni,"Henryk Sienkiewicz, Enrico Guazzoni",Società Italiana Cines,"Amleto Novelli, Gustavo Serena, Carlo Cattaneo...","An epic Italian film ""Quo Vadis"" influenced ma...",6.2,273,7.0
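
The two cells above replace "None" everywhere in the frame and then drop every row with any missing value. A narrower variant, sketched below on made-up rows, limits both steps to the `language` column. The frame `toy` and its contents are hypothetical, and whether the narrower behaviour is preferable for this project is an open assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; 'None' is a text placeholder, not a real missing value
toy = pd.DataFrame({
    "title": ["Silent One", "Talkie", "Silent Two"],
    "language": ["None", "English", "None"],
    "writer": ["A. Writer", "None", "B. Writer"],
})

# Replace the placeholder in the 'language' column only
toy["language"] = toy["language"].replace("None", np.nan)

# Drop rows that lack a language; other columns are not considered
toy_clean = toy.dropna(subset=["language"])
print(toy_clean)
```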


In [19]:
# Replace the 'None' values with an empty string '' and check one arbitrary row
#IMDB.replace({'language': 'None'}, '', inplace = True)
#IMDB.loc[[80865]]

In [20]:
#dummies_language = dummie_variables(IMDB, 'language')
#dummies_language.shape

In [21]:
#dummies_language.head()

In [22]:
#dummies_language.shape

In [23]:
#dummies_language.drop(columns = 'None', inplace = True)

In [24]:
#dummies_language.shape

### Dummy variables for 'genre' column

In [25]:
b = list_unique_values_in_column(IMDB, 'genre')
print("Total number of genre: ", b[2])

Total number of genre:  24


In [26]:
b[1]

{'Action',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Film-Noir',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Sport',
 'Thriller',
 'War',
 'Western'}

In [27]:
dummies_genre = dummie_variables(IMDB, 'genre')
dummies_genre.head(2)

(71990, 24)


Unnamed: 0,News,Mystery,Adventure,Western,Crime,Music,Documentary,Drama,Action,Romance,...,Fantasy,Musical,War,Reality-TV,Sport,History,Horror,Animation,Film-Noir,Biography
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,1,0,0,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0


**<font color = 'teal'>When a categorical column has a very large number of unique values, dummy variables are created only for the 200 most frequent values and the rest are discarded.**

To select the top 200 values, we'll write a function top_200() and carry out the selection.

In [28]:
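def top_200(df10, column, top_items):
    # Split each comma-separated string into a list of items.
    # Note: items are not stripped, so 'Name' and ' Name' are counted as different values.
    df_split = df10[column].str.split(',')
    list10 = []
    for i in df_split:
        list10.extend(i)
    # Frequency of every item, most frequent first
    freq = pd.Series(Counter(list10)).sort_values(ascending = False)
    best_values = freq.index[:top_items]
    df_best_values = pd.DataFrame(index = df10.index, columns = best_values)
    df_best_values[:] = 0
    # Series.iteritems() was removed in pandas 2.0; items() is the equivalent call
    for i, element in df_split.items():
        for j in element:
            if j in best_values:
                df_best_values.loc[i, j] = 1
    return df_best_values, freq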
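
Because `top_200` splits on commas without trimming spaces, the same name can be counted once with and once without a leading space, which may be why 'Eric Roberts' appears twice among the actor columns in the output further down. The sketch below is a hypothetical variant, `top_n_dummies`, that strips whitespace before counting. It is an illustration on made-up names, not the function used to produce the results in this notebook.

```python
import pandas as pd

def top_n_dummies(frame, column, n):
    """Hypothetical variant of top_200 that strips whitespace before counting."""
    # Split on commas, put one item per row, then trim surrounding spaces
    items = frame[column].str.split(",").explode().str.strip()
    # Frequency of each item, most frequent first
    freq = items.value_counts()
    top = freq.index[:n]
    # Keep only the top items and collapse back to one row per original index
    kept = items[items.isin(top)]
    dummies = (
        pd.get_dummies(kept)
        .groupby(level=0)
        .max()
        .reindex(index=frame.index, columns=top, fill_value=0)
        .astype(int)
    )
    return dummies, freq

# Made-up names; note the missing space after the comma in the third row
toy = pd.DataFrame(
    {"actors": ["Ann Lee, Bob Ray", "Bob Ray, Cy Dole", "Ann Lee,Bob Ray", "Dee Fox"]},
    index=[10, 11, 12, 13],
)
dummies, freq = top_n_dummies(toy, "actors", 2)
print(dummies)
```

Rows with no item in the top list, such as index 13 here, end up as all zeros.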

### Getting dummy features for 'actors' column

In [29]:
c = list_unique_values_in_column(IMDB, 'actors')
print("Total number of actors: ", c[2])

Total number of actors:  366402


In [30]:
dummies_actors, top200actors = top_200(IMDB, 'actors', 200)
dummies_actors.head(2)

Unnamed: 0,Brahmanandam,Anupam Kher,Nassar,Prakash Raj,Henry O'Neill,John Carradine,Tanikella Bharani,Eric Roberts,Gulshan Grover,Paresh Rawal,...,Cesar Romero,William B. Davidson,Eric Roberts.1,Max von Sydow,Bernard Lee,Whit Bissell,Robert Morley,Farida Jalal,V. Jayaprakash,Ned Beatty
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Getting dummy features for 'director' column

In [31]:
d = list_unique_values_in_column(IMDB, 'director')
print("Total number of directors: ", d[2])

Total number of directors:  29605


In [32]:
dummies_director, top200director = top_200(IMDB, 'director', 200)

### Getting dummy features for 'writer' column

In [33]:
e = list_unique_values_in_column(IMDB, 'writer')
print("Total number of writers: ", e[2])

Total number of writers:  54260


In [34]:
dummies_writer, top200writer = top_200(IMDB, 'writer', 200)

### Getting dummy features for 'production_company' column

In [35]:
f = list_unique_values_in_column(IMDB, 'production_company')
print("Total number of 'production companies': ", f[2])

Total number of 'production companies':  29162


In [36]:
dummies_production_company, top200production = top_200(IMDB, 'production_company', 200)
top200production[300:330]

Cineville                                             19
Viva Films                                            19
Lira Films                                            19
Bo Ho Film Company Ltd.                               19
Elías Querejeta Producciones Cinematográficas S.L.    19
Franchise Pictures                                    19
Les Films Corona                                      19
Art Pictures Studio                                   19
Asahi Broadcasting Corporation (ABC)                  19
Mukta Arts                                            19
3B Productions                                        19
Recorded Picture Company (RPC)                        19
Central Motion Pictures                               19
Colorado Film Production                              18
Dean Film                                             18
Bac Films                                             18
Maipo Film                                            18
Sidus                          

In [37]:
dummies_language, top200language = top_200(IMDB, 'language', 200)

### Imputation for numerical feature columns: 'duration_min', 'votes', 'reviews_from_users'

**The numerical columns contain a few very large values. Although they are genuine values, they are outliers and do not represent the general population of the column. Instead of dropping them or replacing them with the mean or median, it is better to replace them with the 95th percentile, which represents them reasonably. So, in general, values above the 95th percentile are replaced with the 95th percentile value; this handles the outliers without discarding them, since they were not entered by mistake.**

In [38]:
def imputation_for_outliers(dataframe, column, percentile):
#    before_imputation = dataframe[column].describe()
    percentile_value = dataframe[column].quantile(q = percentile)
    dataframe.loc[dataframe[column] > percentile_value, column] = percentile_value
#    after_imputation = dataframe[column].describe()
    return dataframe

In [39]:
print(IMDB.duration_min.describe())
IMDB = imputation_for_outliers(IMDB, 'duration_min', 0.95)
IMDB.duration_min.describe()

count    71990.000000
mean       100.223726
std         21.983894
min         41.000000
25%         88.000000
50%         96.000000
75%        108.000000
max        808.000000
Name: duration_min, dtype: float64


count    71990.000000
mean        99.091540
std         17.814833
min         41.000000
25%         88.000000
50%         96.000000
75%        108.000000
max        140.000000
Name: duration_min, dtype: float64

### Imputation for 'votes' column

In [40]:
print(IMDB.votes.describe())
IMDB = imputation_for_outliers(IMDB, 'votes', 0.95)
IMDB.votes.describe()

count    7.199000e+04
mean     1.121390e+04
std      5.830994e+04
min      9.900000e+01
25%      2.420000e+02
50%      6.170000e+02
75%      2.425000e+03
max      2.278845e+06
Name: votes, dtype: float64


count    71990.000000
mean      4705.709557
std      10683.558482
min         99.000000
25%        242.000000
50%        617.000000
75%       2425.000000
max      43825.450000
Name: votes, dtype: float64

### Imputation for 'reviews_from_users' column

In [41]:
print(IMDB.reviews_from_users.describe())
IMDB = imputation_for_outliers(IMDB, 'reviews_from_users', 0.95)
IMDB.reviews_from_users.describe()

count    71990.000000
mean        49.232185
std        185.540540
min          1.000000
25%          4.000000
50%         10.000000
75%         29.000000
max      10472.000000
Name: reviews_from_users, dtype: float64


count    71990.000000
mean        31.007918
std         50.656511
min          1.000000
25%          4.000000
50%         10.000000
75%         29.000000
max        201.000000
Name: reviews_from_users, dtype: float64

### Scaling for numerical columns: 'duration_min', 'votes', 'reviews_from_users'

**<font color = 'teal'> From the EDA part of this project, we have seen that the three features 'duration_min', 'votes', and 'reviews_from_users' follow a normal distribution. So, after capping the outliers, we can scale them using the standard scaler.**

In [42]:
def standard_scaling(df1, col_name):
    scale = StandardScaler()
    df1[col_name] = scale.fit_transform(df1[[col_name]])
    return df1  
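
`StandardScaler` standardizes each value as z = (x - mean) / std, using the population standard deviation (`ddof=0`). The sketch below reproduces that arithmetic with pandas alone on made-up durations. As an assumption that goes beyond what this notebook does, it takes the mean and standard deviation from the training rows only and reuses them for the test rows, so that test data does not influence the scaling.

```python
import pandas as pd

# Hypothetical training and test durations in minutes
train = pd.DataFrame({"duration_min": [80.0, 90.0, 100.0, 110.0, 120.0]})
test = pd.DataFrame({"duration_min": [95.0, 130.0]})

# Statistics come from the training rows only
mean = train["duration_min"].mean()
std = train["duration_min"].std(ddof=0)  # population std, as StandardScaler uses

# Apply the same shift and scale to both sets
train_scaled = (train["duration_min"] - mean) / std
test_scaled = (test["duration_min"] - mean) / std
print(train_scaled.round(3).tolist())
print(test_scaled.round(3).tolist())
```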

In [43]:
IMDB = standard_scaling(IMDB, 'duration_min')
IMDB = standard_scaling(IMDB, 'votes')
IMDB = standard_scaling(IMDB, 'reviews_from_users')

In [44]:
IMDB.head(3)

Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
3,Cleopatra,1912,"Drama, History",0.050995,USA,English,Charles L. Gaskill,Victorien Sardou,Helen Gardner Picture Players,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,-0.398719,-0.118602
4,L'Inferno,1911,"Adventure, Drama, Fantasy",-1.745274,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",Dante Alighieri,Milano Film,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,-0.231077,-0.000156
5,"From the Manger to the Cross; or, Jesus of Naz...",1912,"Biography, Drama",-2.194341,USA,English,Sidney Olcott,Gene Gauntier,Kalem Company,"R. Henderson Bland, Percy Dyer, Gene Gauntier,...","An account of the life of Jesus Christ, based ...",5.7,-0.395162,-0.355493


In [45]:
IMDB = pd.concat([IMDB, dummies_genre, dummies_language, dummies_actors, dummies_director, dummies_writer, dummies_production_company], axis = 1)

In [46]:
# Make two datasets, one with movies from IMDB which are also in Kaggle "IMDB_Kaggle_common" and another with movies only belonging to IMDB dataset "IMDB_only"
IMDB_Kaggle_common = IMDB[IMDB.title.isin(common.title)]
print(IMDB_Kaggle_common.shape)
IMDB_only = IMDB[~(IMDB.title.isin(common.title))]
print(IMDB_only.shape)

(2585, 1038)
(69405, 1038)


In [47]:
# After adding the dummy columns for all categorical columns, drop the original columns from both IMDB_only and IMDB_Kaggle_common.
# drop() returns a new DataFrame here; calling it with inplace = True on these filtered slices raises a SettingWithCopyWarning.
columns_to_drop = ['title', 'release_year', 'country', 'genre', 'language', 'director', 'writer', 'production_company', 'actors', 'description']
IMDB_only = IMDB_only.drop(columns = columns_to_drop)
IMDB_Kaggle_common = IMDB_Kaggle_common.drop(columns = columns_to_drop)


In [48]:
print(IMDB_only.shape)
print(IMDB_Kaggle_common.shape)

(69405, 1028)
(2585, 1028)


# Save datasets for model development

In [49]:
IMDB_only.to_csv('pre-processed_dataset/IMDB_only.csv')
IMDB_Kaggle_common.to_csv('pre-processed_dataset/IMDB_Kaggle_common.csv')
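
`to_csv` does not create missing folders, so the save above fails if `pre-processed_dataset` does not exist yet. The sketch below shows one way to guard against that and to read the file back with its index intact. It writes a made-up frame into a temporary folder; the names `toy`, `base`, and `toy.csv` are hypothetical.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical frame standing in for the pre-processed data
toy = pd.DataFrame({"avg_vote": [5.2, 7.0]}, index=[3, 4])

# A temporary folder stands in for the project directory in this sketch
base = Path(tempfile.mkdtemp())
out_dir = base / "pre-processed_dataset"

# Create the output folder first; exist_ok avoids an error on reruns
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "toy.csv"
toy.to_csv(out_path)

# index_col=0 restores the saved index instead of adding an 'Unnamed: 0' column
restored = pd.read_csv(out_path, index_col=0)
print(restored)
```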

# Summary

In this notebook, data pre-processing and training data development were carried out. Because the IMDB dataset is much larger than the Kaggle dataset, the movies common to both datasets were separated from the IMDB dataset and kept for testing the machine learning (ML) model. The remaining movies in the IMDB dataset were used for training the ML model.<br>

Feature engineering was carried out on both the training and testing datasets as follows:

* Dummy variables were created for the categorical columns genre, language, actors, directors, writers, and production company. For genre, dummies were created for all 24 unique values. The language, actors, directors, writers, and production company columns were handled differently: because of their large number of unique values, dummies were created only for the top 200 values.
* Outliers in the numerical columns 'duration_min', 'votes', and 'reviews_from_users' were handled by replacing values above the 95th percentile with the 95th percentile value of the respective column.
* Standard scaling was then applied to these three numerical columns, as their distributions follow a normal distribution.

Finally, all the dummy columns were added to the dataset and the original categorical columns were dropped.

Both datasets were then saved in CSV format for model development.