## Data Profile

The data used for the following models is retrieved from Netflix and contains over 8,000 entries with various information for each piece of media. Since the primary goal is to classify media genres based on description, only the listed_in (genre) and description features are needed.

In [50]:
# Imports

import pandas as pd
import numpy as np
import re
import os
import csv
from google.colab import drive
from google.colab import files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import tensorflow_hub as hub
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn import metrics
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
import spacy
nltk.download('averaged_perceptron_tagger_eng')
from nltk import pos_tag

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


In [2]:
# Changing directory

data_dir = '/content/drive/My Drive/Colab Notebooks'
drive.mount('/content/drive')
os.chdir(data_dir)

Mounted at /content/drive


In [3]:
df = pd.read_csv('netflix_dataset.csv')

## EDA

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


In [5]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [6]:
df.tail()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,,,,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."
8806,s8807,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,"March 2, 2019",2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...


In [7]:
df.isnull().sum()

Unnamed: 0,0
show_id,0
type,0
title,0
director,2634
cast,825
country,831
date_added,10
release_year,0
rating,4
duration,3


There are no missing values in the features of interest, which are genre (listed_in) and description.

In [8]:
df['description'][49]

'A pair of high-powered, successful lawyers find themselves defending opposite interests of the justice system, causing a strain on their happy marriage.'

In [9]:
df['listed_in'][49]

'International TV Shows, TV Dramas'

Description appears to be in an appropriate state to be encoded, while listed_in sometimes features multiple genres. This will be handled in the pre-processing phase.

## Data Pre-processing

In [12]:
# Selecting features

df = df[['listed_in', 'description']]

Only the features listed_in (genre) and media description are needed for the following classification models.

In [13]:
# Separating listed_in into multiple occurances

df['listed_in'] = df['listed_in'].str.split(', ')

df = df.explode('listed_in')

# Re-setting index

df.reset_index(drop= True)

Unnamed: 0,listed_in,description
0,Documentaries,"As her father nears the end of his life, filmm..."
1,International TV Shows,"After crossing paths at a party, a Cape Town t..."
2,TV Dramas,"After crossing paths at a party, a Cape Town t..."
3,TV Mysteries,"After crossing paths at a party, a Cape Town t..."
4,Crime TV Shows,To protect his family from a powerful drug lor...
...,...,...
19318,Children & Family Movies,"Dragged from civilian life, a former superhero..."
19319,Comedies,"Dragged from civilian life, a former superhero..."
19320,Dramas,A scrappy but poor boy worms his way into a ty...
19321,International Movies,A scrappy but poor boy worms his way into a ty...


Since TV shows or films sometimes have multiple relevant genres that could be equally valid for classification, media will be duplicated in accordance with however many genres they list. This will be so that the following models are trained on all descriptions and related genres.

In [14]:
# Genre types and count

df['listed_in'].value_counts()

Unnamed: 0_level_0,count
listed_in,Unnamed: 1_level_1
International Movies,2752
Dramas,2427
Comedies,1674
International TV Shows,1351
Documentaries,869
Action & Adventure,859
TV Dramas,763
Independent Movies,756
Children & Family Movies,641
Romantic Movies,616


As demonstrated above, the films are duplicated with the largest class being International Movies and the smallest class being TV Shows. However, the individual classes are quite small sample wise, and there is redundancy in genres between TV shows and films. Therefore similar genres will be grouped together regardless of media format.

In [15]:
# Joining genres

df['listed_in'] = df['listed_in'].str.replace('TV Comedies', 'Comedies')

df['listed_in'] = df['listed_in'].str.replace('Romantic Movies', 'Romance')
df['listed_in'] = df['listed_in'].str.replace('Romantic TV Shows', 'Romance')

df['listed_in'] = df['listed_in'].str.replace('TV Horror', 'Horror')
df['listed_in'] = df['listed_in'].str.replace('Horror Movies', 'Horror')

df['listed_in'] = df['listed_in'].str.replace("Kids' TV", 'Children & Family')
df['listed_in'] = df['listed_in'].str.replace('Children & Family Movies', 'Children & Family')
df['listed_in'] = df['listed_in'].str.replace('Teen TV Shows', 'Children & Family')

df['listed_in'] = df['listed_in'].str.replace('Anime Series', 'Anime')
df['listed_in'] = df['listed_in'].str.replace('Anime Features', 'Anime')

df['listed_in'] = df['listed_in'].str.replace('Science & Nature TV', 'Documentaries')
df['listed_in'] = df['listed_in'].str.replace('Docuseries', 'Documentaries')

df['listed_in'] = df['listed_in'].str.replace('Stand-Up Comedy & Talk Shows', 'Stand-Up Comedy')

df['listed_in'] = df['listed_in'].str.replace('TV Sci-Fi & Fantasy', 'Sci-Fi & Fantasy')

df['listed_in'] = df['listed_in'].str.replace('Classic Movies', 'Classics')
df['listed_in'] = df['listed_in'].str.replace('Classic & Cult TV', 'Classics')

df['listed_in'] = df['listed_in'].str.replace('TV Dramas', 'Drama')
df['listed_in'] = df['listed_in'].str.replace('Dramas', 'Drama')

df['listed_in'] = df['listed_in'].str.replace('TV Action & Adventure', 'Action & Adventure')

df['listed_in'] = df['listed_in'].str.replace('International TV Shows', 'International')
df['listed_in'] = df['listed_in'].str.replace('International Movies', 'International')

df['listed_in'] = df['listed_in'].str.replace('TV Thrillers', 'Thrillers')

df['listed_in'] = df['listed_in'].str.replace('TV Mysteries', 'Mystery')

df['listed_in'] = df['listed_in'].str.replace('Crime TV Shows', 'Crime')

df['listed_in'] = df['listed_in'].str.replace('British TV Shows', 'International')

df['listed_in'] = df['listed_in'].str.replace('Korean TV Shows', 'International')

df['listed_in'] = df['listed_in'].str.replace('Spanish-Language TV Shows', 'International')

In [16]:
# Dropping unneeded categories

df = df.drop(df[df['listed_in'] == 'Classics'].index)

df = df.drop(df[df['listed_in'] == 'Cult Movies'].index)

df = df.drop(df[df['listed_in'] == 'TV Shows'].index)

df = df.drop(df[df['listed_in'] == 'Movies'].index)

df = df.drop(df[df['listed_in'] == 'Independent Movies'].index)

In addition, there are several listed_in categories that do not specify genre, such as classics, cult movies, TV shows, movies, and independent movies. These categories are too vague and unhelpful for classifying descriptions by discrete genres. Therefore these categories will be dropped.

In [17]:
# Genre types and count

df['listed_in'].value_counts()

Unnamed: 0_level_0,count
listed_in,Unnamed: 1_level_1
International,4346
Drama,2537
Comedies,1998
Documentaries,1344
Children & Family,1140
Romance,950
Action & Adventure,938
Thrillers,560
Crime,465
Stand-Up Comedy,399


Unfortunately there is insignificant data for certain genres, therefore only the top genres will be selected.

In [18]:
# Selecting genres

df = df.loc[(df['listed_in'] == 'International') | (df['listed_in'] == 'Drama') | (df['listed_in'] == 'Comedies') | (df['listed_in'] == 'Documentaries')
      | (df['listed_in'] == 'Children & Family')]

In [19]:
# Final genre types and count

df['listed_in'].value_counts()

Unnamed: 0_level_0,count
listed_in,Unnamed: 1_level_1
International,4346
Drama,2537
Comedies,1998
Documentaries,1344
Children & Family,1140


These are the final classes the models will be trained on.

In [20]:
# Final dataframe

df = pd.read_csv('netflix_dataset2.csv')

## Model 1: Logistic Regression

In [21]:
# Separate x and y

x = df['description']
y = df['listed_in']

In [22]:
# Train/test split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15, random_state=1)

In [23]:
# Encoding data

tfidf = TfidfVectorizer(stop_words='english')

x_train_tfidf = tfidf.fit_transform(x_train)

x_test_tfidf = tfidf.transform(x_test)

In [24]:
# Model instance

log_mod = LogisticRegression(max_iter=1000)

log_mod.fit(x_train_tfidf, y_train)

In [25]:
# Assessing model

log_preds= log_mod.predict(x_test_tfidf)

print(classification_report(y_test, log_preds))

                   precision    recall  f1-score   support

Children & Family       0.44      0.29      0.35       181
         Comedies       0.19      0.13      0.15       303
    Documentaries       0.62      0.44      0.52       220
            Drama       0.17      0.10      0.13       381
    International       0.34      0.55      0.42       620

         accuracy                           0.33      1705
        macro avg       0.35      0.30      0.31      1705
     weighted avg       0.32      0.33      0.31      1705



As demonstrated above, the model does a poor job of classifying the descriptions. For the next model a different encoding technique will attempt to resolve this issue.

## Model 2: Logistic Regression and GLoVe

In [26]:
# Load dataset

df = pd.read_csv('netflix_dataset2.csv')

In [27]:
# GloVe path

model_path = '/content/drive/My Drive/Colab Notebooks/glove.6B.50d.txt'

In [28]:
# Encoding data

x = df['description'].values

# Creating the vectorizer

vectorizer = CountVectorizer(stop_words='english')

# Converting the text to numeric data

X = vectorizer.fit_transform(x)

CountVectorizedData= pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
CountVectorizedData['genre']= df['listed_in']
print(CountVectorizedData.shape)
CountVectorizedData.head()

(11365, 16455)


Unnamed: 0,000,009,10,100,102,108,10th,11,112,11th,...,zulu,zumbo,zé,álex,álvaro,ángel,ömer,über,ōarai,şeref
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
# Defining an empty dictionary to store the values

GloveWordVectors = {}

# Reading Glove Data

with open(model_path, 'r', encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.array(values[1:], "float")
        GloveWordVectors[word] = vector

In [30]:
# Creating the list of words which are present in the Document term matrix

WordsVocab=CountVectorizedData.columns[:]

len(WordsVocab)

16455

In [31]:
# Function to encode text

def FunctionText2Vec(inpTextData):

    # Converting the text to numeric data
    X = vectorizer.transform(inpTextData)
    CountVecData=pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

    # Creating empty dataframe to hold sentences
    W2Vec_Data=pd.DataFrame()

    # Looping through each row for the data
    for i in range(CountVecData.shape[0]):

        # initiating a sentence with all zeros
        Sentence = np.zeros(50)

        # Looping through each word in the sentence and if it's present in Glove model then storing its vector
        for word in WordsVocab[CountVecData.iloc[i, : ]>=1]:

            #print(word)
            if word in GloveWordVectors.keys():
                Sentence=Sentence+GloveWordVectors[word]

        # Appending the sentence to the dataframe
        W2Vec_Data = pd.concat([W2Vec_Data, pd.DataFrame([Sentence])], ignore_index=True)
    return(W2Vec_Data)

In [32]:
# Calling the function to convert all the text data to Glove Vectors

W2Vec_Data= FunctionText2Vec(df['description'])

In [33]:
 # Checking the new representation for sentences

W2Vec_Data.shape

(11365, 50)

In [34]:
# Adding the target variable

W2Vec_Data.reset_index(inplace=True, drop=True)

W2Vec_Data['genre']=CountVectorizedData['genre']

# Assigning to DataForML variable

DataForML=W2Vec_Data

DataForML.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,41,42,43,44,45,46,47,48,49,genre
0,5.052582,1.71072,-3.666168,-4.108647,4.166192,6.141827,-2.512953,2.254126,-1.085915,5.185655,...,4.571316,-0.302529,-1.770962,1.744142,3.327103,2.430729,-4.10996,-2.318013,3.045191,Documentaries
1,3.03946,8.58896,-1.972457,-1.464754,1.569551,3.870489,-10.84892,1.421079,1.767305,-3.400586,...,-0.04608,2.839046,-0.343701,0.434308,-3.848396,-1.76796,-10.23378,2.803779,-3.807173,International
2,3.03946,8.58896,-1.972457,-1.464754,1.569551,3.870489,-10.84892,1.421079,1.767305,-3.400586,...,-0.04608,2.839046,-0.343701,0.434308,-3.848396,-1.76796,-10.23378,2.803779,-3.807173,Drama
3,5.009121,-6.02279,-0.086222,-1.314308,6.422839,0.877058,-2.256123,2.05291,1.560338,-0.110109,...,4.88649,1.029866,2.010515,11.894831,-1.879907,-4.035765,-4.100839,1.372283,-5.5277,International
4,-0.633594,3.111229,-3.572546,0.901594,-0.6485,2.960267,-2.584706,-0.17836,-1.198687,0.669407,...,0.226957,0.035615,4.668532,0.363104,3.036796,-1.392697,-3.474599,-2.06297,1.245539,Documentaries


In [35]:
# Train/test split

TargetVariable=DataForML.columns[-1]
Predictors=DataForML.columns[:-1]

X=DataForML[Predictors].values
y=DataForML[TargetVariable].values

# Split the data into training and testing set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1)

In [37]:
# Model instance

clf = LogisticRegression()

# Creating the model on Training Data

LOG= clf.fit(X_train,y_train)

# Generating predictions on testing data

prediction=LOG.predict(X_test)

# Measuring accuracy on Testing Data

print(metrics.classification_report(y_test, prediction))

                   precision    recall  f1-score   support

Children & Family       0.49      0.43      0.45       181
         Comedies       0.45      0.28      0.35       303
    Documentaries       0.63      0.51      0.56       220
            Drama       0.38      0.14      0.21       381
    International       0.43      0.72      0.54       620

         accuracy                           0.46      1705
        macro avg       0.47      0.42      0.42      1705
     weighted avg       0.45      0.46      0.42      1705



After the implementation of GloVe the accuracy score improved quite a bit, but it is still quite low. Next, different models will be explored.

## Random Forest Model and GloVe

In [39]:
# Model instance

rfm = RandomForestClassifier(n_estimators=380)

# Creating the model on Training Data

rfm_mod = rfm.fit(X_train,y_train)

# Generating predictions on testing data

prediction= rfm_mod.predict(X_test)

# Measuring accuracy on Testing Data

print(metrics.classification_report(y_test, prediction))

                   precision    recall  f1-score   support

Children & Family       0.37      0.27      0.31       181
         Comedies       0.09      0.07      0.08       303
    Documentaries       0.50      0.36      0.42       220
            Drama       0.04      0.03      0.04       381
    International       0.26      0.38      0.31       620

         accuracy                           0.23      1705
        macro avg       0.25      0.22      0.23      1705
     weighted avg       0.22      0.23      0.22      1705



As demonstrated with the above model, even with the addition of a pre-trained word embedding technique like GloVe, the accuracy scores are still very low for the Random Forest (and the SVM) model. The Logistic Regression model showed the most improvement and was the most efficient, therefore it will be the model of focus. Moving forward, the next potential solution is performing more rigorous pre-processing of the data.

## Pre-Processing Data Further

In [40]:
# Dataframe

df = pd.read_csv('netflix_dataset2.csv')

In [46]:
# Removing stop words with NLTK

stop = stopwords.words('english')

df['des_nostopwords'] = df['description'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

In [47]:
df.head()

Unnamed: 0,listed_in,description,des_nostopwords
0,Documentaries,"As her father nears the end of his life, filmm...","As father nears end life, filmmaker Kirsten Jo..."
1,International,"After crossing paths at a party, a Cape Town t...","After crossing paths party, Cape Town teen set..."
2,Drama,"After crossing paths at a party, a Cape Town t...","After crossing paths party, Cape Town teen set..."
3,International,To protect his family from a powerful drug lor...,"To protect family powerful drug lord, skilled ..."
4,Documentaries,"Feuds, flirtations and toilet talk go down amo...","Feuds, flirtations toilet talk go among incarc..."


In [48]:
# Lemmatizing

# Load the spaCy English language model

nlp = spacy.load("en_core_web_sm")

def lemmatize_text(text):
  doc = nlp(text)
  return " ".join([token.lemma_ for token in doc])

def apply_lemmatization_to_dataframe(df, column_name):
  df['des_lemmatized'] = df['des_nostopwords'].apply(lemmatize_text)
  return df

In [49]:
# Lemmatization

df = apply_lemmatization_to_dataframe(df, 'text')

In [51]:
# Selecting nouns

df['tag_text'] = df['description'].apply(lambda item:item.strip().split()).apply(pos_tag)
df['only_nouns'] = df['tag_text'].apply(lambda item:[w for w,t in item if t=='NN'])

In [52]:
df.head()

Unnamed: 0,listed_in,description,des_nostopwords,des_lemmatized,tag_text,only_nouns
0,Documentaries,"As her father nears the end of his life, filmm...","As father nears end life, filmmaker Kirsten Jo...","as father near end life , filmmaker Kirsten Jo...","[(As, IN), (her, PRP$), (father, NN), (nears, ...","[father, end, life,, filmmaker, death, inevita..."
1,International,"After crossing paths at a party, a Cape Town t...","After crossing paths party, Cape Town teen set...","after cross path party , Cape Town teen set pr...","[(After, IN), (crossing, VBG), (paths, NNS), (...","[party,, swimming, star, sister, birth.]"
2,Drama,"After crossing paths at a party, a Cape Town t...","After crossing paths party, Cape Town teen set...","after cross path party , Cape Town teen set pr...","[(After, IN), (crossing, VBG), (paths, NNS), (...","[party,, swimming, star, sister, birth.]"
3,International,To protect his family from a powerful drug lor...,"To protect family powerful drug lord, skilled ...","to protect family powerful drug lord , skilled...","[(To, TO), (protect, VB), (his, PRP$), (family...","[family, drug, lord,, team, violent, war.]"
4,Documentaries,"Feuds, flirtations and toilet talk go down amo...","Feuds, flirtations toilet talk go among incarc...","Feuds , flirtation toilet talk go among incarc...","[(Feuds,, NNP), (flirtations, NNS), (and, CC),...","[toilet, talk, reality, series.]"


As demonstrated above, a few columns were added to explore various pre-processing methods. Removing any unnecessary words that are a distraction for classification (stop words) will be potentially benificial. In addition, changing words to their root meaning with Lemmatization may be helpful. Selecting key words, such as nouns, from descriptions may also potentially improve genre classification accuracy.

## Logistic Regression Models: Various Pre-Processing Techniques

In [53]:
# Dataframe #2

df = pd.read_csv('netflix_dataset5.csv')

In [54]:
# Encoding data

x = df['only_nouns'].values

# Creating the vectorizer

vectorizer = CountVectorizer(stop_words='english')

# Converting the text to numeric data

X = vectorizer.fit_transform(x)

CountVectorizedData= pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
CountVectorizedData['genre']= df['listed_in']
print(CountVectorizedData.shape)
CountVectorizedData.head()

(11365, 5987)


Unnamed: 0,000,10,100,1960s,20th,3below,80s,90s,95,abandonment,...,zany,zenith,zine,zixin,zombie,zombies,zone,zoo,zoologist,ömer
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [55]:
# Defining an empty dictionary to store the values

GloveWordVectors = {}

# Reading Glove Data

with open(model_path, 'r', encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.array(values[1:], "float")
        GloveWordVectors[word] = vector

In [56]:
# Creating the list of words which are present in the Document term matrix

WordsVocab=CountVectorizedData.columns[:]

len(WordsVocab)

5987

In [57]:
# Encoding data

W2Vec_Data= FunctionText2Vec(df['only_nouns'])

In [58]:
# Adding the target variable

W2Vec_Data.reset_index(inplace=True, drop=True)

W2Vec_Data['genre']=CountVectorizedData['genre']

# Assigning to DataForML variable

DataForML=W2Vec_Data

DataForML.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,41,42,43,44,45,46,47,48,49,genre
0,1.704386,2.13432,-1.683328,-3.074941,3.160622,2.999857,-1.92656,1.34122,-1.58253,2.364455,...,2.661354,-1.431016,-1.19107,0.52499,-0.273575,-0.73179,-3.5486,-0.338446,0.381301,Documentaries
1,-0.57972,5.28226,-1.02989,1.254198,1.61532,3.365488,-3.39135,0.680099,0.19749,-0.82219,...,0.29968,-1.058654,-0.321895,0.51687,-1.299582,-1.96036,-5.32354,0.992772,-0.73333,International
2,-0.57972,5.28226,-1.02989,1.254198,1.61532,3.365488,-3.39135,0.680099,0.19749,-0.82219,...,0.29968,-1.058654,-0.321895,0.51687,-1.299582,-1.96036,-5.32354,0.992772,-0.73333,Drama
3,3.09464,-1.2152,-2.92706,-0.485373,2.996259,1.33832,-1.159574,0.565005,-0.198497,-0.57385,...,1.31681,-0.78717,0.02798,2.763911,0.97189,-2.277762,-2.664109,-0.467599,-0.76189,International
4,0.410236,1.41688,-0.288138,0.60121,-0.63138,0.74145,-0.95195,-1.18393,-0.64444,0.688537,...,-0.18642,-1.695466,1.698,0.09412,1.485126,0.610615,0.166488,-0.02682,1.12392,Documentaries


In [59]:
# Train/test split

TargetVariable=DataForML.columns[-1]
Predictors=DataForML.columns[:-1]

X=DataForML[Predictors].values
y=DataForML[TargetVariable].values

# Split the data into training and testing set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1)

In [60]:
# Model instance

clf = LogisticRegression()

# Creating the model on Training Data

LOG= clf.fit(X_train,y_train)

# Generating predictions on testing data

prediction=LOG.predict(X_test)

# Measuring accuracy on Testing Data

print(metrics.classification_report(y_test, prediction))

                   precision    recall  f1-score   support

Children & Family       0.43      0.23      0.30       181
         Comedies       0.40      0.15      0.22       303
    Documentaries       0.51      0.38      0.43       220
            Drama       0.36      0.07      0.12       381
    International       0.39      0.79      0.52       620

         accuracy                           0.40      1705
        macro avg       0.42      0.33      0.32      1705
     weighted avg       0.41      0.40      0.34      1705



As demonstrated above with the "only nouns" pre-processing approach, it seems removing stop words, lemmatizing descriptions, or only selecting nouns all have a negative impact on accuracy when used in conjunction with GloVe.

The genres that all the models consistently have trouble classifying are Comedies and Drama. Upon taking a look at the descriptions, it is evident there is no clear verbage that is unique of these genres in their descriptions.

Therefore, the classes selected will be based on the F1 scores from previous logistic regression models, regardless of sample size, with the assumption that these classes are more linguistically cohesive. For example, Drama has the second highest data count but one of the lower F1 scores and showed little improvement after the implementation of GloVe.

## Final Model

In [61]:
# Final dataframe

df = pd.read_csv('netflix_dataset10.csv')

In [62]:
df['listed_in'].value_counts()

Unnamed: 0_level_0,count
listed_in,Unnamed: 1_level_1
International,4346
Documentaries,1344
Children & Family,1140
Stand-Up Comedy,399


In [63]:
# GloVe path

model_path = '/content/drive/My Drive/Colab Notebooks/glove.6B.50d.txt'

In [64]:
# Data selection

x = df['description'].values

# Creating the vectorizer

vectorizer = CountVectorizer(stop_words='english')

# Converting the text to numeric data

X = vectorizer.fit_transform(x)

CountVectorizedData= pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
CountVectorizedData['genre']= df['listed_in']
print(CountVectorizedData.shape)
CountVectorizedData.head()

(7229, 15845)


Unnamed: 0,000,009,10,100,102,10th,11,112,11th,12,...,zumbo,zé,álex,álvaro,ángel,ömer,über,łukasz,ōarai,şeref
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [65]:
# Defining an empty dictionary to store the values

GloveWordVectors = {}

# Reading Glove Data

with open(model_path, 'r', encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.array(values[1:], "float")
        GloveWordVectors[word] = vector

In [66]:
# Creating the list of words which are present in the Document term matrix

WordsVocab=CountVectorizedData.columns[:]

len(WordsVocab)

15845

In [67]:
 # Function to encode data

 def FunctionText2Vec(inpTextData):

    # Converting the text to numeric data
    X = vectorizer.transform(inpTextData)
    CountVecData=pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

    # Creating empty dataframe to hold sentences
    W2Vec_Data=pd.DataFrame()

    # Looping through each row for the data
    for i in range(CountVecData.shape[0]):

        # initiating a sentence with all zeros
        Sentence = np.zeros(50)

        # Looping thru each word in the sentence and if it's present in the Glove model then storing its vector
        for word in WordsVocab[CountVecData.iloc[i, : ]>=1]:

            #print(word)
            if word in GloveWordVectors.keys():
                Sentence=Sentence+GloveWordVectors[word]

        # Appending the sentence to the dataframe
        W2Vec_Data = pd.concat([W2Vec_Data, pd.DataFrame([Sentence])], ignore_index=True)
    return(W2Vec_Data)

In [68]:
# Encoding data

W2Vec_Data= FunctionText2Vec(df['description'])

In [69]:
# Adding the target variable

W2Vec_Data.reset_index(inplace=True, drop=True)

W2Vec_Data['genre']=CountVectorizedData['genre']

# Assigning to DataForML variable

DataForML=W2Vec_Data

DataForML.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,41,42,43,44,45,46,47,48,49,genre
0,5.052582,1.71072,-3.666168,-4.108647,4.166192,6.141827,-2.512953,2.254126,-1.085915,5.185655,...,4.571316,-0.302529,-1.770962,1.744142,3.327103,2.430729,-4.10996,-2.318013,3.045191,Documentaries
1,3.03946,8.58896,-1.972457,-1.464754,1.569551,3.870489,-10.84892,1.421079,1.767305,-3.400586,...,-0.04608,2.839046,-0.343701,0.434308,-3.848396,-1.76796,-10.23378,2.803779,-3.807173,International
2,5.009121,-6.02279,-0.086222,-1.314308,6.422839,0.877058,-2.256123,2.05291,1.560338,-0.110109,...,4.88649,1.029866,2.010515,11.894831,-1.879907,-4.035765,-4.100839,1.372283,-5.5277,International
3,-0.633594,3.111229,-3.572546,0.901594,-0.6485,2.960267,-2.584706,-0.17836,-1.198687,0.669407,...,0.226957,0.035615,4.668532,0.363104,3.036796,-1.392697,-3.474599,-2.06297,1.245539,Documentaries
4,-0.114692,5.8426,-5.133942,-4.51287,2.124638,-3.598643,-9.37884,-0.149255,0.140741,-0.100732,...,-2.236275,-0.406165,6.42186,-0.311425,2.486897,0.120732,-2.333594,-2.014489,-0.541233,International


In [70]:
# Train/test split

TargetVariable=DataForML.columns[-1]
Predictors=DataForML.columns[:-1]

X=DataForML[Predictors].values
y=DataForML[TargetVariable].values

# Split the data into training and testing set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1)

In [71]:
# Model instance

clf = LogisticRegression()

# Creating the model on Training Data

LOG=clf.fit(X_train,y_train)

# Generating predictions on testing data

prediction=LOG.predict(X_test)

# Measuring accuracy on Testing Data

print(metrics.classification_report(y_test, prediction))

                   precision    recall  f1-score   support

Children & Family       0.76      0.54      0.63       173
    Documentaries       0.64      0.51      0.57       202
    International       0.77      0.87      0.82       646
  Stand-Up Comedy       0.69      0.70      0.70        64

         accuracy                           0.74      1085
        macro avg       0.72      0.66      0.68      1085
     weighted avg       0.74      0.74      0.73      1085



The final model has the highest accuracy when trained on the following four genres: Children & Family, Documentaries, International, and Stand-Up Comedy. International has the highest F1 score and coincidentally the highest data count of all of the genres. Meanwhile Stand-Up Comedy has one of the lowest data counts but evidently has a high rate of cohesion within its descriptions, showing consistent improvement with each model iteration as more genres were abandoned. Though Comedies and Drama have the second and third highest data counts in the dataset, they ultimately lowered all of the models' classification accuracies and ommitting these genres significantly improved the final model.