#### Project 4: Netflix NLP
#### Corey J Sinnott
# Data Cleaning, Initial EDA and Early Featurization

## Executive Summary

This report was commissioned to perform natural language processing (NLP) and analysis on the Netflix titles dataset, which contains 7,787 movies and TV shows. The problem statement is: can we classify whether a title carries an adult-content rating (TV-MA, R, or NC-17)? This notebook covers data cleaning, initial EDA, and early featurization. Conclusions and recommendations are presented after the in-depth analysis.

*See model_classification_exec_summary.ipynb for the full summary, data dictionary, and findings.*

## Contents:
- [Initial EDA & Cleaning](#Initial-EDA-&-Cleaning)
- [EDA and Featurization](#EDA-and_Featurization)

#### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
import language_tool_python
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
from textblob import Blobber
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

#### Inspecting Data

In [13]:
df = pd.read_csv('./data/netflix_titles.csv')

In [15]:
df.sample(3)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...
1595,s1596,Movie,Dangal,Nitesh Tiwari,"Aamir Khan, Sakshi Tanwar, Fatima Sana Shaikh,...",India,"June 21, 2017",2016,TV-PG,161 min,"Dramas, International Movies, Sports Movies",A once-promising wrestler pursues the gold med...
3717,s3718,Movie,"Look Out, Officer",Sze Yu Lau,"Stephen Chow, Bill Tung, Stanley Sui-Fan Fung,...",Hong Kong,"August 16, 2018",1990,TV-14,88 min,"Action & Adventure, Comedies, International Mo...",An officer killed on the job returns to Earth ...


In [28]:
df['rating'].value_counts()

TV-MA       2863
TV-14       1931
TV-PG        806
R            665
PG-13        386
TV-Y         280
TV-Y7        271
PG           247
TV-G         194
NR            84
G             39
TV-Y7-FV       6
UR             5
NC-17          3
Name: rating, dtype: int64

#### Binarizing Target for Classification
 - 0 for not adult content
 - 1 for adult content

In [30]:
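# restrict the replacement to the rating column so matching strings elsewhere are untouched
df['rating'] = df['rating'].replace(to_replace = ['TV-MA', 'R', 'NC-17'], value = 1)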

In [31]:
# NR and UR (unrated) are grouped with non-adult content for now
df['rating'] = df['rating'].replace(to_replace = ['TV-14', 'TV-PG', 'PG-13', 'TV-Y', 'TV-Y7',
                                                  'PG', 'TV-G', 'NR', 'G', 'TV-Y7-FV', 'UR'],
                                    value = 0)

In [32]:
df['rating'].value_counts()

0.0    4249
1.0    3531
Name: rating, dtype: int64
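
The binarization above can also be expressed as an explicit lookup, which makes the handling of missing or unexpected labels visible. The following is a minimal sketch, not the notebook's own method: the names `ADULT_RATINGS`, `NON_ADULT_RATINGS`, and `binarize_rating` are hypothetical, and the rating groups are copied from the cells above.

```python
# Rating groups copied from the replace calls above.
ADULT_RATINGS = {"TV-MA", "R", "NC-17"}
NON_ADULT_RATINGS = {"TV-14", "TV-PG", "PG-13", "TV-Y", "TV-Y7",
                     "PG", "TV-G", "NR", "G", "TV-Y7-FV", "UR"}

def binarize_rating(rating):
    """Return 1 for adult content, 0 for non-adult content, None otherwise."""
    if rating in ADULT_RATINGS:
        return 1
    if rating in NON_ADULT_RATINGS:
        return 0
    # Missing values (e.g. NaN) and unknown labels are left unclassified.
    return None

print([binarize_rating(r) for r in ["TV-MA", "PG", float("nan")]])  # [1, 0, None]
```

A helper like this could be applied to a single column with `df['rating'].map(binarize_rating)`, which leaves all other columns untouched.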

In [37]:
# copy of df with all columns for reference (a copy, so later in-place edits to df do not affect it)
df_all = df.copy()

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   show_id                 7787 non-null   object 
 1   type                    7787 non-null   object 
 2   title                   7787 non-null   object 
 3   director                5398 non-null   object 
 4   cast                    7069 non-null   object 
 5   country                 7280 non-null   object 
 6   date_added              7777 non-null   object 
 7   release_year            7787 non-null   int64  
 8   rating                  7780 non-null   float64
 9   duration                7787 non-null   object 
 10  listed_in               7787 non-null   object 
 11  description             7787 non-null   object 
 12  description_length      7787 non-null   int64  
 13  description_word_count  7787 non-null   int64  
dtypes: float64(1), int64(3), object(10)
memo

#### Dropping Features

In [44]:
df = df.drop(columns = ['show_id', 'director', 'cast', 'date_added',
                        'release_year'])

In [47]:
# use title as the index; drop = False keeps the title column for now
df.set_index(keys = 'title', drop = False, inplace = True)

#### Engineering Features

In [34]:
df['description_length'] = [len(i) for i in df['description']]
df['description_word_count'] = [len(i.split()) for i in df['description']]

In [48]:
df.sample(1)

title,type,title,country,rating,duration,listed_in,description,description_length,description_word_count
District 9,Movie,District 9,"South Africa, United States, New Zealand, Canada",1.0,112 min,"Action & Adventure, International Movies, Sci-...","After years of segregation and forced labor, a...",144,23


In [64]:
df['type'].value_counts()

Movie      5377
TV Show    2410
Name: type, dtype: int64

In [50]:
# average netflix episode = 42min
# converting all to minutes
df['duration'].value_counts()

1 Season      1608
2 Seasons      382
3 Seasons      184
90 min         136
93 min         131
              ... 
253 min          1
193 min          1
16 Seasons       1
203 min          1
43 min           1
Name: duration, Length: 216, dtype: int64

In [59]:
df['duration'] = df['duration'].map(lambda x: ''.join([i for i in x if i.isdigit()]))

In [60]:
df['duration'].value_counts()

1      1608
2       382
3       185
90      136
93      131
       ... 
312       1
167       1
203       1
193       1
36        1
Name: duration, Length: 206, dtype: int64
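
Stripping every non-digit character discards the unit, so a season count and a minute count with the same number become indistinguishable. This is consistent with the count for `3` rising from 184 to 185 above. A minimal sketch of a parser that keeps the unit follows; `DURATION_PATTERN` and `parse_duration` are hypothetical names, and the pattern assumes durations always look like `"90 min"`, `"1 Season"`, or `"16 Seasons"`.

```python
import re

# Matches strings such as "90 min", "1 Season", or "16 Seasons".
DURATION_PATTERN = re.compile(r"^\s*(\d+)\s+(min|Seasons?)\s*$")

def parse_duration(text):
    """Split a duration string into (value, unit), where unit is 'min' or 'season'."""
    match = DURATION_PATTERN.match(text)
    if match is None:
        raise ValueError(f"Unrecognized duration: {text!r}")
    value, unit = match.groups()
    return int(value), ("min" if unit == "min" else "season")

print(parse_duration("3 Seasons"))  # (3, 'season')
print(parse_duration("3 min"))      # (3, 'min')
```

Keeping the unit alongside the number means the later conversion to minutes does not have to rely on the `type` column alone.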

In [63]:
df['duration'] = df['duration'].astype(int)

In [67]:
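# will return to / refine this if time permits
def durationater(df):
    # TV show durations are season counts; estimate minutes as
    # seasons * 8 episodes * 42 minutes (both figures are rough assumptions)
    is_show = df['type'] == 'TV Show'
    # keep movie durations as-is and replace TV show durations with the estimate
    return df['duration'].where(~is_show, df['duration'] * 8 * 42)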
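
The conversion that `durationater` aims for (seasons × 8 × 42) can be checked on single values. This is a minimal sketch under two assumptions: the multiplier 8 is read as episodes per season, and 42 minutes per episode comes from the comment earlier in the notebook. `estimated_minutes` and the two constants are hypothetical names.

```python
EPISODES_PER_SEASON = 8    # assumption: taken from the multiplier in durationater
MINUTES_PER_EPISODE = 42   # assumption: the notebook's "average netflix episode = 42min"

def estimated_minutes(title_type, duration):
    """Return a duration in minutes; TV show durations are treated as season counts."""
    if title_type == "TV Show":
        return duration * EPISODES_PER_SEASON * MINUTES_PER_EPISODE
    # Movie durations are already expressed in minutes.
    return duration

print(estimated_minutes("TV Show", 2))  # 672
print(estimated_minutes("Movie", 98))   # 98
```

Under these assumptions a two-season show is estimated at 2 × 8 × 42 = 672 minutes. The helper could be applied row-wise, for example with `df.apply(lambda row: estimated_minutes(row['type'], row['duration']), axis = 1)`.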

#### Adding columns for polarity and subjectivity

In [69]:
df['descr_polarity'] = [TextBlob(i).polarity for i in df['description']]

In [71]:
df['descr_subjectivity'] = [TextBlob(i).subjectivity for i in df['description']]

In [73]:
df = df.drop(columns = ['title'])

In [74]:
df.sample(3)

title,type,country,rating,duration,listed_in,description,description_length,description_word_count,descr_polarity,descr_subjectivity
The Bible's Buried Secrets,TV Show,United Kingdom,0.0,1,"British TV Shows, Docuseries, Science & Nature TV",Host Francesca Stavrakopoulou travels across t...,132,18,0.144444,0.4
The Circle Brazil,TV Show,"Brazil, United Kingdom",1.0,1,"International TV Shows, Reality TV",Be yourself or someone else? In this fun reali...,150,26,0.65,0.25
The Wedding Party 2: Destination Dubai,Movie,Nigeria,0.0,98,"Comedies, International Movies, Romantic Movies","In this sequel to the 2016 hit ""The Wedding Pa...",149,26,0.0,0.0


#### Exporting to CSV for analysis notebooks

In [75]:
df.to_csv('netflix_prepped_df.csv')