# Data Pre-processing and Training Data Development

**<font color = 'teal'> After thorough exploratory data analysis, we saved two separate datasets containing data from Kaggle and IMDB. In this notebook we carry out pre-processing, which makes the data ready for model training. This includes steps such as feature engineering and splitting the data into training and test sets.</font>**

In [1]:
# import necessary Python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [2]:
# import both datasets
Kaggle = pd.read_csv("clean_datafiles/Kaggledata_ready_for_preprocessing.csv", index_col = 0)
IMDB = pd.read_csv("clean_datafiles/IMDBdata_ready_for_preprocessing.csv", index_col = 0)

In [3]:
Kaggle.head(2)

Unnamed: 0,title,cast,country,date_added,release_year,duration_min,listed_in,description
1,7:19,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,2016,2016,93,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,23:59,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,2018,2011,78,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."


In [4]:
IMDB.head(2)

Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
0,Miss Jerry,1894,Romance,45,USA,,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,1.0
1,The Story of the Kelly Gang,1906,"Biography, Crime, Drama",70,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,7.0


In [5]:
Kaggle.columns

Index(['title', 'cast', 'country', 'date_added', 'release_year',
       'duration_min', 'listed_in', 'description'],
      dtype='object')

In [6]:
IMDB.columns

Index(['title', 'release_year', 'genre', 'duration_min', 'country', 'language',
       'director', 'writer', 'production_company', 'actors', 'description',
       'avg_vote', 'votes', 'reviews_from_users'],
      dtype='object')

In [7]:
# Generate dummy features with one-hot encoding
# for the language and genre columns.
# avg_vote will be the target
# of a regression model.
# The recommendation system is built on "movie features" and "reviews_from_users". Note that there is also a "votes" column, which can also be used in the regression analysis.
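
The plan in the comments above (encode `genre`, use `avg_vote` as the target, then split for a regression model) can be sketched on a toy frame. This is a minimal sketch, not the notebook's final pipeline: the `toy` frame only borrows a few values from the `head()` outputs above, and `test_size=0.25` and `random_state=42` are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy rows loosely based on the head() outputs above (illustration only).
toy = pd.DataFrame({
    "genre": ["Romance", "Biography, Crime, Drama", "Drama, History", "Comedy"],
    "duration_min": [45, 70, 100, 56],
    "votes": [154, 589, 446, 268],
    "avg_vote": [5.9, 6.1, 5.2, 6.1],
})

# Multi-label one-hot encoding: one 0/1 column per genre.
genre_dummies = toy["genre"].str.get_dummies(sep=", ")

# Features: numeric columns plus the dummies; target: avg_vote.
X = pd.concat([toy[["duration_min", "votes"]], genre_dummies], axis=1)
y = toy["avg_vote"]

# Assumed split ratio and seed; adjust to the real dataset size.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` keeps the split reproducible between runs, which makes later model comparisons easier to interpret.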

In [8]:
common = IMDB.merge(Kaggle, how = 'inner', on = ['title', 'release_year'])

In [9]:
IMDB_Kaggle_common = IMDB[IMDB.title.isin(common.title)]
print(IMDB_Kaggle_common.shape)
IMDB_only = IMDB[~(IMDB.title.isin(common.title))]
print(IMDB_only.shape)

(2585, 14)
(69458, 14)
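
Note that the merge uses both `title` and `release_year`, while the `isin` filter matches on `title` alone, so an IMDB film that only shares its title with a matched film (for example, a remake from another year) also lands in the common set. If the intent is to split on the same two keys, one option is a left merge with `indicator=True`. The sketch below uses hypothetical toy frames (`imdb_toy`, `kaggle_toy`), not the real datasets.

```python
import pandas as pd

# Hypothetical toy data: two films share a title but differ in release year.
imdb_toy = pd.DataFrame({
    "title": ["Sherlock Holmes", "Sherlock Holmes", "Cleopatra"],
    "release_year": [1916, 1922, 1912],
    "avg_vote": [6.4, 5.8, 5.2],
})
# Assumption for illustration: the second source holds only the 1922 film.
kaggle_toy = pd.DataFrame({
    "title": ["Sherlock Holmes"],
    "release_year": [1922],
})

keys = ["title", "release_year"]

# A left merge keeps every IMDB row in its original order; the `_merge`
# column records whether the (title, release_year) pair was found on the right.
flagged = imdb_toy.merge(
    kaggle_toy[keys].drop_duplicates(), how="left", on=keys, indicator=True
)
in_both = (flagged["_merge"] == "both").to_numpy()

common_toy = imdb_toy[in_both]
imdb_only_toy = imdb_toy[~in_both]
print(common_toy.shape, imdb_only_toy.shape)
```

Dropping duplicate key pairs on the right side keeps the merged frame the same length as `imdb_toy`, so the boolean mask lines up row for row.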


In [10]:
IMDB_Kaggle_common.head()

Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
83,Sherlock Holmes,1916,Mystery,116,USA,English,Arthur Berthelet,"H.S. Sheldon, William Gillette",The Essanay Film Manufacturing Company,"William Gillette, Marjorie Kay, Ernest Maupain...",When a couple of swindlers hold young Alice Fa...,6.4,313,10.0
253,Leap Year,1924,Comedy,56,USA,English,"James Cruze, Roscoe 'Fatty' Arbuckle","Walter Woods, Sarah Y. Mason",Paramount Pictures,"Roscoe 'Fatty' Arbuckle, Lucien Littlefield, M...","A young man, heir to his misogynistic and mill...",6.1,268,9.0
300,Phantom,1922,"Drama, Romance",125,Germany,German,F.W. Murnau,"Gerhart Hauptmann, Thea von Harbou",Uco-Film GmbH,"Alfred Abel, Frida Richard, Aud Egede-Nissen, ...",A shiftless young man becomes obsessed with a ...,6.9,1528,19.0
308,Sherlock Holmes,1922,"Drama, Mystery",85,USA,English,Albert Parker,"Earle Browne, Arthur Conan Doyle",Goldwyn Pictures Corporation,"John Barrymore, Roland Young, Carol Dempster, ...",A young Sherlock Holmes seeks to bring down th...,5.8,551,18.0
430,The Monster,1925,"Comedy, Horror, Mystery",86,USA,English,Roland West,"Roland West, Crane Wilbur",Metro-Goldwyn Pictures Corporation,"Lon Chaney, Gertrude Olmstead, Hallam Cooley, ...",A meek clerk who doubles as an amateur detecti...,6.2,906,29.0


In [11]:
IMDB_only.head()

Unnamed: 0,title,release_year,genre,duration_min,country,language,director,writer,production_company,actors,description,avg_vote,votes,reviews_from_users
0,Miss Jerry,1894,Romance,45,USA,,Alexander Black,Alexander Black,Alexander Black Photoplays,"Blanche Bayliss, William Courtenay, Chauncey D...",The adventures of a female reporter in the 1890s.,5.9,154,1.0
1,The Story of the Kelly Gang,1906,"Biography, Crime, Drama",70,Australia,,Charles Tait,Charles Tait,J. and N. Tait,"Elizabeth Tait, John Tait, Norman Campbell, Be...",True story of notorious Australian outlaw Ned ...,6.1,589,7.0
3,Cleopatra,1912,"Drama, History",100,USA,English,Charles L. Gaskill,Victorien Sardou,Helen Gardner Picture Players,"Helen Gardner, Pearl Sindelar, Miss Fielding, ...",The fabled queen of Egypt's affair with Roman ...,5.2,446,25.0
4,L'Inferno,1911,"Adventure, Drama, Fantasy",68,Italy,Italian,"Francesco Bertolini, Adolfo Padovan",Dante Alighieri,Milano Film,"Salvatore Papa, Arturo Pirovano, Giuseppe de L...",Loosely adapted from Dante's Divine Comedy and...,7.0,2237,31.0
5,"From the Manger to the Cross; or, Jesus of Naz...",1912,"Biography, Drama",60,USA,English,Sidney Olcott,Gene Gauntier,Kalem Company,"R. Henderson Bland, Percy Dyer, Gene Gauntier,...","An account of the life of Jesus Christ, based ...",5.7,484,13.0


In [12]:
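def list_unique_values_in_column(data_frame, column):
    """Split a comma-separated string column and collect its distinct values.

    Returns the split Series, the set of unique values, and their count.
    """
    df = data_frame[column].str.split(',')
    list1 = []
    for i in df:
        # strip the whitespace left around each value after splitting on ','
        list1.extend(each.strip() for each in i)
    return df, set(list1), len(set(list1))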
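
For reference, the same set of unique labels can be obtained with pandas string methods alone. The sketch below is an alternative under stated assumptions, not the notebook's function: `unique_labels` is a hypothetical helper name, and missing values are simply dropped.

```python
import pandas as pd

def unique_labels(series: pd.Series) -> set:
    """Return the distinct comma-separated labels found in `series`."""
    labels = (
        series.dropna()        # skip missing entries instead of failing on them
        .str.split(",")        # "Drama, History" -> ["Drama", " History"]
        .explode()             # one label per row
        .str.strip()           # remove the space left after each comma
    )
    return set(labels)

# Toy input with one missing value (illustration only).
genres = pd.Series(["Romance", "Biography, Crime, Drama", None, "Drama, History"])
print(sorted(unique_labels(genres)))
```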

In [13]:
def dummie_variables(dataframe, column):
    """Build a 0/1 indicator frame with one column per unique value in `column`."""
    new_column = dataframe[column].str.split(',')
    col_names = list_unique_values_in_column(dataframe, column)[1]
    new_dataframe = pd.DataFrame(index = new_column.index, columns = col_names)
    new_dataframe[:] = 0
    list10 = []
    # Only the first 10 rows are encoded; the whole column could not be computed this way.
    for each in new_column[:10]:
        list10.append(each)
    # Caveat: x is a 0-based position, but .loc selects by index label. The two
    # agree only for a default RangeIndex, not for a filtered frame such as IMDB_only.
    for x, i in enumerate(list10):
        for j in i:
            element = j.strip()
            if element in col_names:
                new_dataframe.loc[x, element] = 1
    return new_dataframe
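
pandas also ships a vectorised multi-label encoder, `Series.str.get_dummies`, which processes the whole column at once and keeps the original index labels. The following is a minimal sketch of how it could stand in for the loop above; `multi_hot` is a hypothetical helper name, and normalising the separator to a bare comma is an assumption about how the labels are formatted.

```python
import pandas as pd

def multi_hot(series: pd.Series) -> pd.DataFrame:
    """One 0/1 column per comma-separated label, aligned to `series.index`."""
    # Collapse ", " and " ," variants to "," so labels need no later stripping.
    cleaned = (
        series.fillna("")
        .str.strip()
        .str.replace(r"\s*,\s*", ",", regex=True)
    )
    return cleaned.str.get_dummies(sep=",")

# Toy input with a non-contiguous index, like a frame that has been filtered.
genres = pd.Series(
    ["Romance", "Biography, Crime, Drama", "Drama, History"],
    index=[0, 1, 3],
)
encoded = multi_hot(genres)
print(encoded)
```

Because the result shares the input's index, it can be joined back to the source frame with `pd.concat([...], axis=1)` without any positional bookkeeping.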

In [14]:
#def get_dummie_variables(df, col):
#    x = df[col].str().split(',').apply(lambda p: each.strip() for each in p)
#    col_name = 
#    pd.DataFrame(index = x.index, col = col_name)

In [15]:
f_genre = dummie_variables(IMDB_only, 'genre')
f_genre.head()

Unnamed: 0,Sport,Crime,Film-Noir,Animation,Family,Biography,Musical,Fantasy,Horror,Mystery,...,Action,News,Reality-TV,Romance,Western,Documentary,Drama,Comedy,Sci-Fi,Music
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,0,1,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
5,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0


In [16]:
f_language = dummie_variables(IMDB_only, 'language')
f_language.head()

Unnamed: 0,Kannada,Hausa,Gallegan,Shanxi,Assyrian Neo-Aramaic,Washoe,Dutch,Hebrew,Creek,Akan,...,Mandarin,Aromanian,Nama,Tulu,Lithuanian,Bhojpuri,Armenian,Malayalam,Polynesian,Uighur
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
