# Data Pre-processing and Training Data Development

**<font color = 'teal'> After thorough exploratory data analysis, we saved two separate datasets containing data from Kaggle and IMDB. In this notebook, we carry out pre-processing to make the data ready for model training. This includes the necessary steps such as feature engineering and splitting the data into training and test sets.**

In [1]:
# import necessary Python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import collections
from collections import Counter

In [2]:
# import both datasets
Kaggle = pd.read_csv("clean_datafiles/Kaggledata_ready_for_preprocessing.csv", index_col = 0)
IMDB = pd.read_csv("clean_datafiles/IMDBdata_ready_for_preprocessing.csv", index_col = 0)

In [3]:
Kaggle.head(2)

Unnamed: 0,title,cast,country,date_added,release_year,duration_min,listed_in,description
1,7:19,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016,2016,93,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,23:59,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,2018,2011,78,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."


In [4]:
IMDB.head(2)

Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
0,Miss Jerry,1894,Romance,45,USA,,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,1.0
1,The Story of the Kelly Gang,1906,"Biography, Crime, Drama",70,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,7.0


In [5]:
Kaggle.columns

Index(['title', 'cast', 'country', 'date_added', 'release_year',
       'duration_min', 'listed_in', 'description'],
      dtype='object')

In [6]:
IMDB.columns

Index(['title', 'release_year', 'genre', 'duration_min', 'country', 'language',
       'director', 'writer', 'production_company', 'actors', 'description',
       'avg_vote', 'votes', 'reviews_from_users'],
      dtype='object')

**<font color = 'teal'> Before feature engineering, we form two new datasets from the two existing ones. Movies common to both datasets are held out for testing the regression model, which is developed on the movies that appear only in the IMDB dataset and not in the Kaggle dataset.**

In [7]:
# Inner-merge both datasets on 'title' and 'release_year' to find the movies they have in common
common = IMDB.merge(Kaggle, how = 'inner', on = ['title', 'release_year'])

In [8]:
# Build two datasets: "IMDB_Kaggle_common" holds IMDB movies whose title also appears in Kaggle, "IMDB_only" holds the remaining IMDB movies
IMDB_Kaggle_common = IMDB[IMDB.title.isin(common.title)]
print(IMDB_Kaggle_common.shape)
IMDB_only = IMDB[~(IMDB.title.isin(common.title))]
print(IMDB_only.shape)

(2585, 14)
(69458, 14)
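
The two filters above match on `title` alone, so an IMDB movie that shares its title with a Kaggle movie from a different year would also land in the common set. If the split is meant to respect both merge keys, one possible approach is sketched below on toy data. The frames `imdb` and `kaggle`, and the `_merge` helper column produced by `indicator=True`, are illustrative assumptions and not part of the original pipeline. The `.copy()` calls make the two subsets independent frames.

```python
import pandas as pd

# Toy stand-ins for the real datasets (hypothetical values)
imdb = pd.DataFrame({
    "title": ["Leap Year", "Leap Year", "Phantom"],
    "release_year": [1924, 2010, 1922],
})
kaggle = pd.DataFrame({"title": ["Leap Year"], "release_year": [2010]})

keys = ["title", "release_year"]

# A left merge against the de-duplicated Kaggle keys keeps one row per IMDB
# movie; the _merge column says whether a matching Kaggle row was found.
flagged = imdb.merge(kaggle[keys].drop_duplicates(), on=keys,
                     how="left", indicator=True)
in_both = (flagged["_merge"] == "both").to_numpy()

imdb_common = imdb[in_both].copy()   # matched on title AND release year
imdb_only = imdb[~in_both].copy()    # everything else

print(imdb_common.shape, imdb_only.shape)
```

With a title-only filter, both "Leap Year" rows would be treated as common; here only the 2010 row is.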


In [9]:
IMDB_Kaggle_common.head(3)

Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
83,Sherlock Holmes,1916,Mystery,116,USA,English,Arthur Berthelet,"H.S. Sheldon, William Gillette",The Essanay Film Manufacturing Company,"William Gillette, Marjorie Kay, Ernest Maupain...",When a couple of swindlers hold young Alice Fa...,6.4,313,10.0
253,Leap Year,1924,Comedy,56,USA,English,"James Cruze, Roscoe 'Fatty' Arbuckle","Walter Woods, Sarah Y. Mason",Paramount Pictures,"Roscoe 'Fatty' Arbuckle, Lucien Littlefield, M...","A young man, heir to his misogynistic and mill...",6.1,268,9.0
300,Phantom,1922,"Drama, Romance",125,Germany,German,F.W. Murnau,"Gerhart Hauptmann, Thea von Harbou",Uco-Film GmbH,"Alfred Abel, Frida Richard, Aud Egede-Nissen, ...",A shiftless young man becomes obsessed with a ...,6.9,1528,19.0


In [10]:
IMDB_only.head(3)

Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
0,Miss Jerry,1894,Romance,45,USA,,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,1.0
1,The Story of the Kelly Gang,1906,"Biography, Crime, Drama",70,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,7.0
3,Cleopatra,1912,"Drama, History",100,USA,English,Charles L. Gaskill,Victorien Sardou,Helen Gardner Picture Players,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,446,25.0


# Feature Engineering

In [11]:
# list_unique_values_in_column returns the split column, the set of unique values in a comma-separated categorical column, and their count
def list_unique_values_in_column(data_frame, column):
    df = data_frame[column].str.split(',')
    list1 = []
    for i in df:
        list1.extend(each.strip() for each in i)
    return df, set(list1), len(set(list1))

In [12]:
# dummie_variables creates one 0/1 dummy column for every unique value in a comma-separated categorical column
def dummie_variables(dataframe, column):
    new_column = dataframe[column].str.split(',').apply(lambda x: [each.strip() for each in x])
    col_names = list_unique_values_in_column(dataframe, column)[1]
    new_dataframe = pd.DataFrame(index = new_column.index, columns = col_names)
    print(new_dataframe.shape)
    new_dataframe[:] = 0
    for i, each in new_column.items():
        new_dataframe.loc[i, each] = 1
    return new_dataframe          
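
For comparison, the same multi-hot encoding can be sketched with built-in pandas operations. This is an alternative under stated assumptions, not the function used in this notebook: `multi_hot` and the toy frame are hypothetical names, and the column order may differ from the set-based order produced above.

```python
import pandas as pd

# Toy frame with a comma-separated categorical column (hypothetical values)
toy = pd.DataFrame({"genre": ["Drama, History", "Romance", "Drama"]},
                   index=[0, 1, 3])

def multi_hot(series):
    # One row per (movie, value) pair, with surrounding whitespace removed
    tokens = series.str.split(",").explode().str.strip()
    # One-hot encode each pair, then collapse back to one row per movie
    return pd.get_dummies(tokens, dtype=int).groupby(level=0).max()

encoded = multi_hot(toy["genre"])
print(encoded)
```

The original index is preserved, so the result can be concatenated to the source frame with `pd.concat(..., axis = 1)` in the same way as the output of `dummie_variables`.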

In [13]:
a = list_unique_values_in_column(IMDB_only, 'language')
a[1]

{'Abkhazian',
 'Aboriginal',
 'Afrikaans',
 'Akan',
 'Albanian',
 'Algonquin',
 'American Sign Language',
 'Amharic',
 'Ancient (to 1453)',
 'Apache languages',
 'Arabic',
 'Aragonese',
 'Aramaic',
 'Arapaho',
 'Armenian',
 'Aromanian',
 'Assamese',
 'Assyrian Neo-Aramaic',
 'Athapascan languages',
 'Australian Sign Language',
 'Awadhi',
 'Aymara',
 'Azerbaijani',
 'Bable',
 'Balinese',
 'Bambara',
 'Basque',
 'Belarusian',
 'Bemba',
 'Bengali',
 'Berber languages',
 'Bhojpuri',
 'Bicolano',
 'Bosnian',
 'Brazilian Sign Language',
 'Breton',
 'British Sign Language',
 'Bulgarian',
 'Burmese',
 'Cantonese',
 'Catalan',
 'Central American Indian languages',
 'Chechen',
 'Cheyenne',
 'Chinese',
 'Cornish',
 'Corsican',
 'Cree',
 'Creek',
 'Crimean Tatar',
 'Croatian',
 'Crow',
 'Czech',
 'Danish',
 'Dari',
 'Dinka',
 'Dutch',
 'Dyula',
 'Dzongkha',
 'Eastern Frisian',
 'Egyptian (Ancient)',
 'English',
 'Esperanto',
 'Estonian',
 'Ewe',
 'Faroese',
 'Filipino',
 'Finnish',
 'Flemish',
 ...

### Getting dummy features for the 'genre' column in both new datasets

In [14]:
dummies_genre = dummie_variables(IMDB_only, 'genre')
dummies_genre.head()

(69458, 24)


Unnamed: 0,Fantasy,History,Sport,Mystery,Biography,Musical,Family,Adventure,Music,Comedy,...,Film-Noir,Drama,Action,Romance,War,Crime,Horror,News,Documentary,Western
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0
5,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [15]:
dummies_genre2 = dummie_variables(IMDB_Kaggle_common, 'genre')

(2585, 21)


### Getting dummy features for the 'language' column in both new datasets

In [16]:
# Before creating dummy variables for all the languages, take a closer look at this column
IMDB_only.isnull().sum()

title                 0
release_year          0
genre                 0
duration_min          0
country               0
language              0
director              0
writer                0
production_company    0
actors                0
description           0
avg_vote              0
votes                 0
reviews_from_users    0
dtype: int64

In [17]:
IMDB_only_None = IMDB_only[IMDB_only.language == 'None']

In [18]:
IMDB_only_None.shape

(53, 14)

In [19]:
IMDB_only_None.head(3)

Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
0,Miss Jerry,1894,Romance,45,USA,,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,1.0
1,The Story of the Kelly Gang,1906,"Biography, Crime, Drama",70,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,7.0
47,The Birth of a Nation,1915,"Drama, History, War",195,USA,,D.W. Griffith,"Thomas Dixon Jr., Thomas Dixon Jr.",David W. Griffith Corp.,"Henry B. Walthall, Lillian Gish, Mae Marsh, Mi...",The Stoneman family finds its friendship with ...,6.3,22213,368.0


In [20]:
IMDB_only_None[['country', 'language']]

Unnamed: 0,country,language
0,USA,
1,Australia,
47,USA,
136,USA,
245,USA,
301,USA,
313,USA,
386,USA,
499,UK,
627,France,


In [21]:
IMDB_only_None.index

Int64Index([    0,     1,    47,   136,   245,   301,   313,   386,   499,
              627,  1309,  7410, 11095, 15129, 16833, 20764, 21054, 21458,
            21709, 25813, 27435, 28193, 29728, 31099, 32151, 33751, 38332,
            45340, 45396, 45949, 49899, 49942, 51616, 51842, 52405, 53834,
            58194, 58376, 59693, 60058, 60370, 66547, 66643, 68935, 69191,
            71588, 72130, 73067, 73773, 73804, 74870, 80865, 83234],
           dtype='int64')

In [22]:
# Replace the string 'None' in the 'language' column with an empty string ''
IMDB_only.replace({'language': 'None'}, '', inplace = True)
IMDB_only.loc[[80865]]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)


Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
80865,Akher Wahed Fina,2016,"Adventure, Drama, Fantasy",95,"Tunisia, Qatar, United Arab Emirates, Lebanon",,Alaeddine Slim,Alaeddine Slim,Exit Productions,"Jahwar Soudani, Fethi Akkari, Jihed Fourti, Sl...",N is coming from the desert to reach North Afr...,6.3,153,2.0
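
The `SettingWithCopyWarning` above appears because `IMDB_only` was created by filtering `IMDB`, so pandas cannot tell whether the in-place `replace` is meant to affect the parent frame. A common way to avoid it, sketched here on toy data with hypothetical names, is to take an explicit `.copy()` when the subset is created and to assign the result back instead of using `inplace = True`.

```python
import pandas as pd

# Toy parent frame (hypothetical values)
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "language": ["None", "English", "None"],
})

# .copy() makes the subset an independent frame, so later edits are unambiguous
subset = movies[movies["title"].isin(["A", "B"])].copy()

# Assign the replaced column back rather than modifying in place
subset["language"] = subset["language"].replace("None", "")

print(subset["language"].tolist())
print(movies["language"].tolist())  # the parent frame is unchanged
```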


In [23]:
dummies_language = dummie_variables(IMDB_only, 'language')
dummies_language.head()

(69458, 259)


Unnamed: 0,Unnamed: 1,Norwegian,Sinhalese,Soninke,Algonquin,Arabic,Brazilian Sign Language,Micmac,Athapascan languages,Corsican,...,Hmong,Chinese,Marathi,Swedish,Scottish Gaelic,Bable,Tok Pisin,Peul,American Sign Language,Lithuanian
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
dummies_language.drop(columns = '', inplace = True )

In [25]:
IMDB_Kaggle_common_None = IMDB_Kaggle_common[IMDB_Kaggle_common.language == 'None']
print(IMDB_Kaggle_common_None.shape)
IMDB_Kaggle_common.replace({'language': 'None'}, '', inplace = True)

(0, 14)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(ilocs[0], value, pi)


In [26]:
dummies_language2 = dummie_variables(IMDB_Kaggle_common, 'language')

(2585, 112)


**<font color = 'teal'> When a categorical column has a very large number of unique values, dummy variables are created only for the top 200 most frequent values and the rest are discarded.**

### Getting dummy features for the 'actors' column

In [27]:
def top_200(df10, column, top_items):
    # Split each comma-separated cell into a list of values
    df_split = df10[column].str.split(',')
    list10 = []
    for i in df_split:
        list10.extend(i)
    # Count occurrences and keep the `top_items` most frequent values
    freq = pd.Series(Counter(list10)).sort_values(ascending = False)
    best_values = freq.index[:top_items]
    df_best_values = pd.DataFrame(index = df10.index, columns = best_values)
    df_best_values[:] = 0
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for i, element in df_split.items():
        for j in element:
            if j in best_values:
                df_best_values.loc[i, j] = 1
    return df_best_values
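
Unlike `dummie_variables`, `top_200` splits on `','` without stripping whitespace, so a name that follows a comma and a space would be counted separately from the same name at the start of a cell. A variant that strips each name first is sketched below. `top_k_dummies` and the toy frame are hypothetical, and this is one possible formulation rather than the function that produced the outputs in this notebook.

```python
from collections import Counter

import pandas as pd

def top_k_dummies(frame, column, k):
    # Split on commas and strip whitespace so "Bob" and " Bob" are the same value
    split = frame[column].str.split(",").apply(
        lambda names: [name.strip() for name in names])
    counts = Counter(name for names in split for name in names)
    keep = [name for name, _ in counts.most_common(k)]
    out = pd.DataFrame(0, index=frame.index, columns=keep)
    for idx, names in split.items():
        # De-duplicate names within a row and keep only the top-k values
        hits = sorted(set(names) & set(keep))
        if hits:
            out.loc[idx, hits] = 1
    return out

toy = pd.DataFrame({"actors": ["Ann, Bob", "Bob, Ann", "Bob, Cy"]})
dummies = top_k_dummies(toy, "actors", 2)
print(dummies)
```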

In [28]:
dummies_actors = top_200(IMDB_only, 'actors', 200)
dummies_actors.head()

Unnamed: 0,Brahmanandam,Nassar,Anupam Kher,Henry O'Neill,John Carradine,Prakash Raj,Eric Roberts,Tanikella Bharani,John Wayne,Keenan Wynn,...,Gérard Depardieu,Whit Bissell,George Zucco,William Forsythe,Renji Ishibashi,Leon Ames,Frank Reicher,Wallace Shawn,Siddique,Sayaji Shinde
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
dummies_actors2 = top_200(IMDB_Kaggle_common, 'actors', 200)
dummies_actors2.head()

Unnamed: 0,Anupam Kher,Paresh Rawal,Om Puri,Johnny Lever,Boman Irani,Sharat Saxena,Akshay Kumar,Kareena Kapoor,Shah Rukh Khan,Asrani,...,Alia Shawkat,Alok Nath,Reema Lagoo,John C. Reilly,Madhavan,Laura Linney,Manoj Pahwa,Emory Cohen,Ben Kingsley,John Travolta
83,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
253,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
300,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
308,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
430,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [30]:
dummies_director = top_200(IMDB_only, 'director', 200)
dummies_director2 = top_200(IMDB_Kaggle_common, 'director', 200)

In [31]:
dummies_writer = top_200(IMDB_only, 'writer', 200)
dummies_writer2 = top_200(IMDB_Kaggle_common, 'writer', 200)

In [32]:
dummies_production_company = top_200(IMDB_only, 'production_company', 200)
dummies_production_company2 = top_200(IMDB_Kaggle_common, 'production_company', 200)

### Imputation for numerical feature columns: 'duration_min', 'votes', 'reviews_from_users'

In [33]:
# Values above the 95th percentile are replaced with the 95th-percentile value. These outliers are genuine values, not data-entry errors, so they are capped rather than dropped.

In [34]:
def imputation_for_outliers(dataframe, column, percentile):
    before_imputation = dataframe[column].describe()
    percentile_value = dataframe[column].quantile(q = percentile)
    print(percentile_value)
    dataframe.loc[dataframe[column] > percentile_value, column] = percentile_value
    after_imputation = dataframe[column].describe()
    return before_imputation, after_imputation
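
The capping step can also be written with `Series.clip`, which returns a new Series instead of modifying the frame in place. The sketch below uses toy values and hypothetical names (`cap_at_percentile`, `train_votes`, `test_votes`). It additionally shows how a threshold learned on the training data could be reused on held-out data, which is an assumption about a possible workflow and not something the notebook does.

```python
import pandas as pd

def cap_at_percentile(series, q=0.95, upper=None):
    # Learn the threshold from the data unless one is supplied
    if upper is None:
        upper = series.quantile(q)
    # Values above the threshold are set to the threshold; others are untouched
    return series.clip(upper=upper), upper

train_votes = pd.Series([1, 2, 3, 4, 100], dtype=float)
test_votes = pd.Series([5.0, 500.0])

train_capped, threshold = cap_at_percentile(train_votes)
# Reuse the training threshold so both sets share the same upper bound
test_capped, _ = cap_at_percentile(test_votes, upper=threshold)

print(threshold)
print(train_capped.tolist())
print(test_capped.tolist())
```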

In [35]:
impute_duration = imputation_for_outliers(IMDB_only, 'duration_min', 0.95)
impute_duration

140.0


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


(count    69458.000000
 mean        99.845864
 std         21.837568
 min         41.000000
 25%         88.000000
 50%         96.000000
 75%        108.000000
 max        808.000000
 Name: duration_min, dtype: float64, count    69458.000000
 mean        98.751850
 std         17.691837
 min         41.000000
 25%         88.000000
 50%         96.000000
 75%        108.000000
 max        140.000000
 Name: duration_min, dtype: float64)

### Imputation for 'votes' column

In [36]:
impute_votes = imputation_for_outliers(IMDB_only, 'votes', 0.95)
impute_votes

35295.39999999991


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


(count    6.945800e+04
 mean     9.770994e+03
 std      5.286388e+04
 min      9.900000e+01
 25%      2.370000e+02
 50%      5.890000e+02
 75%      2.207000e+03
 max      2.278845e+06
 Name: votes, dtype: float64, count    69458.000000
 mean      3900.271145
 std       8592.013543
 min         99.000000
 25%        237.000000
 50%        589.000000
 75%       2207.000000
 max      35295.400000
 Name: votes, dtype: float64)

### Imputation for 'reviews_from_users' column

In [37]:
impute_reviews_from_users = imputation_for_outliers(IMDB_only, 'reviews_from_users', 0.95)
impute_reviews_from_users

173.0


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


(count    69458.000000
 mean        44.099715
 std        166.861864
 min          1.000000
 25%          4.000000
 50%         10.000000
 75%         27.000000
 max       8869.000000
 Name: reviews_from_users, dtype: float64, count    69458.000000
 mean        27.711639
 std         43.664003
 min          1.000000
 25%          4.000000
 50%         10.000000
 75%         27.000000
 max        173.000000
 Name: reviews_from_users, dtype: float64)

### Scaling for numerical columns: 'duration_min', 'votes', 'reviews_from_users'

**<font color = 'teal'> From the EDA part of this project, we saw that the three features 'duration_min', 'votes', and 'reviews_from_users' follow an approximately normal distribution. So, after capping the outliers, we can scale them using the standard scaler.**

In [38]:
def standard_scaling(df1, df2, col_name):
    scale = StandardScaler()
    df1[col_name] = scale.fit_transform(df1[[col_name]])
    df2[col_name] = scale.transform(df2[[col_name]])
    return df1, df2   
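
`StandardScaler` subtracts the mean and divides by the population standard deviation learned during `fit`, and `transform` applies those same two numbers to new data. The sketch below reproduces that arithmetic with plain pandas on toy values so the train/test asymmetry is visible. The frame names and values are hypothetical.

```python
import pandas as pd

# Toy training and test columns (hypothetical values)
train = pd.Series([80.0, 100.0, 120.0], name="duration_min")
test = pd.Series([100.0, 140.0], name="duration_min")

# "fit": statistics come from the training data only
mu = train.mean()
sigma = train.std(ddof=0)  # population standard deviation, as StandardScaler uses

# "transform": the same statistics are applied to both sets
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma

print(train_scaled.round(3).tolist())
print(test_scaled.round(3).tolist())
```

Because the test set is scaled with the training statistics, its scaled mean and standard deviation will generally not be exactly 0 and 1.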

In [39]:
scaled_DurationMin = standard_scaling(IMDB_only, IMDB_Kaggle_common, 'duration_min')
scaled_votes = standard_scaling(IMDB_only, IMDB_Kaggle_common, 'votes')
scaled_ReviewsFromUsers = standard_scaling(IMDB_only, IMDB_Kaggle_common, 'reviews_from_users')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until

In [40]:
# Scaling output of 'duration_min' column for IMDB_only dataset
scaled_DurationMin[0].head(2)

Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
0,Miss Jerry,1894,Romance,-3.038251,USA,,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,-0.436021,-0.611759
1,The Story of the Kelly Gang,1906,"Biography, Crime, Drama",-1.62516,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,-0.385392,-0.474345


In [41]:
# Scaling output of 'duration_min' column for IMDB_Kaggle_common dataset
scaled_DurationMin[1].head(2)

Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
83,Sherlock Holmes,1916,Mystery,0.974928,USA,English,Arthur Berthelet,"H.S. Sheldon, William Gillette",The Essanay Film Manufacturing Company,"William Gillette, Marjorie Kay, Ernest Maupain...",When a couple of swindlers hold young Alice Fa...,6.4,-0.417515,-0.405638
253,Leap Year,1924,Comedy,-2.416491,USA,English,"James Cruze, Roscoe 'Fatty' Arbuckle","Walter Woods, Sarah Y. Mason",Paramount Pictures,"Roscoe 'Fatty' Arbuckle, Lucien Littlefield, M...","A young man, heir to his misogynistic and mill...",6.1,-0.422753,-0.42854


In [42]:
scaled_votes[0].head(2)

Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
0,Miss Jerry,1894,Romance,-3.038251,USA,,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,-0.436021,-0.611759
1,The Story of the Kelly Gang,1906,"Biography, Crime, Drama",-1.62516,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,-0.385392,-0.474345


In [43]:
scaled_votes[1].head(2)

Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
83,Sherlock Holmes,1916,Mystery,0.974928,USA,English,Arthur Berthelet,"H.S. Sheldon, William Gillette",The Essanay Film Manufacturing Company,"William Gillette, Marjorie Kay, Ernest Maupain...",When a couple of swindlers hold young Alice Fa...,6.4,-0.417515,-0.405638
253,Leap Year,1924,Comedy,-2.416491,USA,English,"James Cruze, Roscoe 'Fatty' Arbuckle","Walter Woods, Sarah Y. Mason",Paramount Pictures,"Roscoe 'Fatty' Arbuckle, Lucien Littlefield, M...","A young man, heir to his misogynistic and mill...",6.1,-0.422753,-0.42854


In [44]:
scaled_ReviewsFromUsers[0].head(2)

Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
0,Miss Jerry,1894,Romance,-3.038251,USA,,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,-0.436021,-0.611759
1,The Story of the Kelly Gang,1906,"Biography, Crime, Drama",-1.62516,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,-0.385392,-0.474345


In [45]:
scaled_ReviewsFromUsers[1].head(2)

Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
83,Sherlock Holmes,1916,Mystery,0.974928,USA,English,Arthur Berthelet,"H.S. Sheldon, William Gillette",The Essanay Film Manufacturing Company,"William Gillette, Marjorie Kay, Ernest Maupain...",When a couple of swindlers hold young Alice Fa...,6.4,-0.417515,-0.405638
253,Leap Year,1924,Comedy,-2.416491,USA,English,"James Cruze, Roscoe 'Fatty' Arbuckle","Walter Woods, Sarah Y. Mason",Paramount Pictures,"Roscoe 'Fatty' Arbuckle, Lucien Littlefield, M...","A young man, heir to his misogynistic and mill...",6.1,-0.422753,-0.42854


In [46]:
IMDB_only.columns

Index(['title', 'release_year', 'genre', 'duration_min', 'country', 'language',
       'director', 'writer', 'production_company', 'actors', 'description',
       'avg_vote', 'votes', 'reviews_from_users'],
      dtype='object')

In [47]:
IMDB_Kaggle_common.columns

Index(['title', 'release_year', 'genre', 'duration_min', 'country', 'language',
       'director', 'writer', 'production_company', 'actors', 'description',
       'avg_vote', 'votes', 'reviews_from_users'],
      dtype='object')

In [48]:
IMDB_only_combined = pd.concat([IMDB_only, dummies_genre, dummies_language, dummies_actors, dummies_director, dummies_writer, dummies_production_company], axis = 1)
IMDB_only_combined.shape

(69458, 1096)

In [49]:
IMDB_Kaggle_common_combined = pd.concat([IMDB_Kaggle_common, dummies_genre2, dummies_language2, dummies_actors2, dummies_director2, dummies_writer2, dummies_production_company2], axis = 1)
IMDB_Kaggle_common_combined.shape

(2585, 947)

In [50]:
# After adding the dummy columns for all categorical columns, drop the original categorical columns
IMDB_only_combined.drop(columns = ['title', 'release_year', 'country', 'genre', 'language', 'director', 'writer', 'production_company', 'actors', 'description'], inplace = True)
IMDB_Kaggle_common_combined.drop(columns = ['title', 'release_year', 'country', 'genre', 'language', 'director', 'writer', 'production_company', 'actors', 'description'], inplace = True)

In [51]:
print(IMDB_only_combined.shape)
print(IMDB_Kaggle_common_combined.shape)

(69458, 1086)
(2585, 937)
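
The two combined frames end up with different widths (1086 and 937 columns) because each set of dummies was built from the values present in its own dataset. Most models expect identical feature columns at training and test time. One way to reconcile them, shown here as a sketch on toy frames with hypothetical names, is to reindex the test frame to the training columns: columns missing from the test frame are filled with 0 and columns unknown to the training frame are dropped. The sketch assumes column names are unique within each frame.

```python
import pandas as pd

# Toy stand-ins for the combined training and test frames (hypothetical values)
train = pd.DataFrame({"avg_vote": [5.9, 6.1], "Drama": [0, 1], "News": [1, 0]})
test = pd.DataFrame({"avg_vote": [6.4], "Drama": [1], "Mystery": [1]})

# Align the test columns to the training columns
test_aligned = test.reindex(columns=train.columns, fill_value=0)

print(test_aligned)
```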


# Save datasets for model development

In [52]:
IMDB_only_combined.to_csv('pre-processed_dataset/IMDB_only_combined.csv')
IMDB_Kaggle_common_combined.to_csv('pre-processed_dataset/IMDB_Kaggle_common_combined.csv')
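
`to_csv` does not create missing directories, so the output folder has to exist before the cell above runs. A minimal round-trip sketch follows. It writes to a temporary directory, and the file name `example.csv` and the toy frame are hypothetical. Reading back with `index_col = 0` restores the original row labels, as done when loading the input files at the top of this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

frame = pd.DataFrame({"avg_vote": [5.9, 6.1]}, index=[0, 3])

# Create the output folder if it does not exist yet
out_dir = Path(tempfile.mkdtemp()) / "pre-processed_dataset"
out_dir.mkdir(parents=True, exist_ok=True)

path = out_dir / "example.csv"
frame.to_csv(path)

# index_col=0 turns the first CSV column back into the index
restored = pd.read_csv(path, index_col=0)
print(restored)
```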

# Summary

In this notebook, data pre-processing and training data development were carried out. Because the IMDB dataset is much larger than the Kaggle dataset, the movies common to both datasets were separated out of the IMDB dataset and kept for testing the machine learning (ML) model. The remaining movies in the IMDB dataset were used for training the ML model.<br>

Feature engineering was carried out on both the training and testing datasets as follows:

* Dummy variables were created for the categorical columns genre, language, actors, director, writer and production company. For genre and language, dummies were created for every unique value. The actors, director, writer and production company columns were handled differently: because of their very large number of unique values, dummies were created only for the top 200 values.
* Outliers in the numerical columns 'duration_min', 'votes' and 'reviews_from_users' of the training dataset were capped at the 95th percentile of the respective column.
* Standard scaling was applied to the numerical columns 'duration_min', 'votes' and 'reviews_from_users', as their distributions were judged to be approximately normal in the EDA.

Finally, all the dummy columns were added to the respective main datasets and the original categorical columns were dropped.

Both datasets were then saved in CSV format for model development.