# Netflix Culture
## Identifying Trends and Patterns in TV Shows and Movies


### Introduction

Netflix is one of the most popular streaming services in the world. It offers its subscribers a wide range of movies and TV shows and is known for its foreign-language, genre-specific and binge-worthy content. Its original, high-quality productions are a major reason for its success in the market. Like many companies, Netflix relies on a huge amount of data (Big Data). As a large streaming company, it collects data about its subscribers' actions, such as what they watch most, when they watch and for how long. This data also feeds its recommendation system: by analyzing subscribers' viewing history and behavior, Netflix offers content that each subscriber is most likely to be interested in. Hence, the audience stays engaged with the platform, which in turn benefits the company.


### The Aim of Our Project

Since some of our group members are Netflix subscribers, we became interested in analyzing patterns in its TV shows and movies. We chose two datasets containing different types of information, such as the titles of the movies and TV shows, the release year, the director, etc. This data gave us the opportunity to analyze patterns in the content and to present them more clearly through visualizations. The purpose of this paper is to clean, analyze and visualize the data, explaining our steps in detail. We used Python for every step.


### Datasets

As mentioned above, we have two datasets, netflix_titles.csv and imdb_top_1000.csv. Both were obtained from Kaggle.com, one of the largest data science communities and a source of reliable and useful resources.

netflix_titles.csv contains unlabelled text data on around 9000 Netflix shows and movies, along with details such as cast, release year, rating and description.

imdb_top_1000.csv is an IMDb dataset of the top 1000 movies and TV shows.

In addition to the datasets, we used a JSON file called countries.geojson.


### Data Cleaning


In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from wordcloud import WordCloud, STOPWORDS
import itertools

Above are all the libraries we imported. NumPy is a library for numerical computing in Python; it provides tools for working with arrays and matrices. Pandas is a library for data manipulation and analysis, built around the DataFrame. Seaborn is a library for statistical data visualization; it provides a high-level interface for creating statistical graphics. Matplotlib is a library for creating static, animated and interactive visualizations in Python, with tools for many types of plots. Wordcloud is a library for creating word clouds. A word cloud is a visual representation of text data in which the size of each word is proportional to its frequency in the text. Finally, itertools is a standard-library module for working with iterators, which are objects that can be looped over. It provides tools for creating, combining and manipulating iterators.


In [4]:
df = pd.read_csv("./netflix_titles.csv")
df2 = pd.read_csv("./imdb_top_1000.csv")
pd.set_option('display.max_columns', None)

This cell loads the data. The first two lines create two pandas DataFrames, df and df2, by reading the CSV files, which lets us work with the data inside the datasets. The third line tells pandas to display all columns of a DataFrame: by default pandas limits the number of columns it displays, and setting the option to None removes that limit.


### Getting basic information about dataset

In [5]:
df.isnull().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

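isnull().sum() is a pandas method chain used to find the number of missing values in each column of the DataFrame df. isnull() marks NaN and None entries as missing, and sum() counts them per column. Values such as 0 or an empty string are not treated as missing. In our data, director has the most missing values (2634), followed by country (831) and cast (825).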
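To illustrate what isnull() does and does not count, here is a minimal sketch on a small made-up DataFrame. The name toy and its contents are hypothetical and are not taken from the Netflix data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with three kinds of "empty-looking" values
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["Jane Doe", None, np.nan],  # None and NaN are missing
    "release_year": [2020, 0, 2019],         # 0 is a real value, not missing
    "country": ["", "India", "France"],      # an empty string is not missing
})

# True/False per cell, then summed per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Only the director column reports missing values here; zeros and empty strings would have to be found with an explicit comparison.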

In [6]:
df["country"].fillna("MISSING", inplace=True)
df["duration"].fillna("0 min", inplace=True)
df["director"].fillna("Unknown", inplace=True)
df["cast"].fillna("Unknown", inplace=True)
df["date_added"].fillna("Unknown", inplace=True)
df["rating"].fillna("Unknown", inplace=True)
df["duration"].fillna("Unknown", inplace=True)  # no effect: "duration" was already filled with "0 min" above
df

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,Unknown,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,Unknown,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",MISSING,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,Unknown,Unknown,MISSING,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,Unknown,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
...,...,...,...,...,...,...,...,...,...,...,...,...
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,"November 20, 2019",2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a..."
8803,s8804,TV Show,Zombie Dumb,Unknown,Unknown,MISSING,"July 1, 2019",2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g..."
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,"November 1, 2019",2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,"January 11, 2020",2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero..."


df["country"].fillna("MISSING", inplace=True) fills the missing values in the "country" column with the string "MISSING".

df["duration"].fillna("0 min", inplace=True) fills the missing values in the "duration" column with "0 min".

The next four lines fill the missing values in the "director", "cast", "date_added" and "rating" columns with "Unknown".

The last line, df["duration"].fillna("Unknown", inplace=True), has no effect, because the missing values in "duration" were already replaced with "0 min" by the second line.

In all of these cases the replacement value is a string. We pass inplace=True to indicate that we want to change the original DataFrame instead of creating a copy. In recent pandas versions, calling fillna with inplace=True on a selected column may raise a warning or leave df unchanged; assigning the result back, as in df["country"] = df["country"].fillna("MISSING"), is the more robust form.
At the end we get a DataFrame with no missing (NaN) values.
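The same kind of filling can also be written as a single call by passing a dictionary to fillna. The sketch below uses a small hypothetical DataFrame (toy) and returns a new DataFrame instead of modifying the original; it is an alternative formulation, not the code used in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "country": ["India", np.nan],
    "director": [np.nan, "Jane Doe"],
    "duration": ["90 min", np.nan],
})

# One placeholder per column; columns not listed would be left untouched
placeholders = {"country": "MISSING", "director": "Unknown", "duration": "0 min"}
filled = toy.fillna(value=placeholders)

print(filled)
print(filled.isnull().sum().sum())  # total number of missing cells left
```

Because the result is assigned to a new name, toy keeps its missing values and filled holds the cleaned copy.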

In [7]:
df.describe()

Unnamed: 0,release_year
count,8807.0
mean,2014.180198
std,8.819312
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0


describe() is a method that returns a summary of the central tendency, dispersion and shape of the distribution of the columns of our DataFrame. By default it covers only numeric columns, which is why the output contains only release_year.
The rows of the output are "count", "mean", "std", "min", "25%", "50%", "75%" and "max", computed on the DataFrame already cleaned of missing values. count is the number of non-null values in the column, mean is the average, std is the standard deviation, min is the minimum value, 25%, 50% and 75% are the 25th, 50th (median) and 75th percentiles, and max is the maximum value.
Here the release years range from 1925 to 2021, and at least half of the titles were released in 2017 or later.
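To also summarize text columns, describe() accepts an include parameter. The following is a minimal sketch on hypothetical toy data, not on the project's DataFrame.

```python
import pandas as pd

# Hypothetical toy data: one text column and one numeric column
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "release_year": [2019, 2020, 2021, 2020],
})

# Default: numeric columns only
numeric_summary = toy.describe()

# include="all": text columns get count, unique, top (most frequent) and freq
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the combined summary, statistics that do not apply to a column (for example the mean of a text column) appear as NaN.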

In [8]:
df.shape

(8807, 12)

df.shape returns a tuple with the number of rows and the number of columns in the DataFrame. shape is an attribute, not a method, so it is written without parentheses; calling df.shape() raises "TypeError: 'tuple' object is not callable". In our case the number of rows is 8807 and the number of columns is 12.
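Because shape is an attribute that holds a tuple, it can be unpacked directly, while adding parentheses tries to call the tuple and fails. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns
toy = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Attribute access, no parentheses; the tuple unpacks into rows and columns
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# Calling the tuple raises a TypeError
try:
    toy.shape()
except TypeError as err:
    message = str(err)
    print(message)
```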

In [9]:
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

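In pandas, the attribute df.columns returns the names of the columns as an Index object. We can see the names of the columns of our dataset above. The dtype shown, "object", is the type of the Index itself, which holds the names as strings; it is not the type of the data in the columns.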

In [10]:
df.count()

show_id         8807
type            8807
title           8807
director        8807
cast            8807
country         8807
date_added      8807
release_year    8807
rating          8807
duration        8807
listed_in       8807
description     8807
dtype: int64

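The count() method returns the number of non-null values in each column of a DataFrame, so it can be used to quickly identify columns with missing values. Here every column has 8807 values, which equals the number of rows and confirms that no missing values remain after the cleaning step.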

In [11]:
df.nunique()

show_id         8807
type               2
title           8807
director        4529
cast            7693
country          749
date_added      1768
release_year      74
rating            18
duration         221
listed_in        514
description     8775
dtype: int64

The nunique() method returns the number of unique values in each column of a DataFrame. For example, show_id and title have 8807 unique values, one per row, while type has only 2.
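Note that nunique() counts distinct full strings. For a column such as listed_in, where one cell can hold several comma-separated genres, that is the number of distinct combinations rather than the number of distinct genres. The sketch below shows the difference on hypothetical toy data; splitting on ", " is an assumption about the separator.

```python
import pandas as pd

# Hypothetical toy data in the style of the listed_in column
toy = pd.DataFrame({
    "listed_in": [
        "Documentaries",
        "International TV Shows, TV Dramas",
        "TV Dramas, Docuseries",
        "International TV Shows, TV Dramas",
    ]
})

# Distinct full strings, which is what nunique() counts
n_combinations = toy["listed_in"].nunique()

# Distinct individual genres: split each cell, then put each genre on its own row
genres = toy["listed_in"].str.split(", ").explode()
n_genres = genres.nunique()

print(n_combinations, n_genres)
```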

In [13]:
print(f" dtype - show_id: {df.show_id.dtype}")
print(f" dtype - type: {df.type.dtype}")
print(f" dtype - title: {df.title.dtype}")
print(f" dtype - director: {df.director.dtype}")
print(f" dtype - cast: {df.cast.dtype}")
print(f" dtype - country: {df.country.dtype}")
print(f" dtype - date_added: {df.date_added.dtype}")
print(f" dtype - release_year: {df.release_year.dtype}")
print(f" dtype - rating: {df.rating.dtype}")
print(f" dtype - duration: {df.duration.dtype}")
print(f" dtype - listed_in: {df.listed_in.dtype}")
print(f" dtype - description: {df.description.dtype}")

 dtype - show_id: object
 dtype - type: object
 dtype - title: object
 dtype - director: object
 dtype - cast: object
 dtype - country: object
 dtype - date_added: object
 dtype - release_year: int64
 dtype - rating: object
 dtype - duration: object
 dtype - listed_in: object
 dtype - description: object


Rather than explaining every line of this cell, we explain the first one; the others do the same for different columns.

print(f" dtype - show_id: {df.show_id.dtype}")
This prints the data type of the column. The text dtype - show_id: inside the f-string labels the output with the name of the column, and {df.show_id.dtype} is replaced by the data type itself.

From the output we can see that we have 11 columns of type object and 1 column of type int64 (release_year).
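The same information is available for all columns at once through the dtypes attribute. The sketch below is a shorter alternative shown on hypothetical toy data, not the cell used in this project.

```python
import pandas as pd

# Hypothetical toy data: one text column and one integer column
toy = pd.DataFrame({
    "title": ["A", "B"],
    "release_year": [2020, 2021],
})

# One dtype per column, returned as a Series indexed by column name
column_types = toy.dtypes
for name, dtype in column_types.items():
    print(f" dtype - {name}: {dtype}")

# How many columns share each dtype
type_counts = column_types.value_counts()
print(type_counts)
```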

In [None]:
df.dropna(axis="index", how="all")

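The dropna() method is used to remove missing values from a DataFrame. The axis parameter specifies whether rows ("index") or columns are removed. The how parameter is the condition for removal: with how="all", a row is dropped only if all of its values are missing. dropna() returns a new DataFrame and does not change df unless the result is assigned back.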
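The difference between how="all" and how="any" can be seen on a small example. The data below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is entirely missing, row 2 is partly missing
toy = pd.DataFrame({
    "title": ["A", np.nan, "C"],
    "director": ["Jane Doe", np.nan, np.nan],
})

# how="all": drop a row only if every value in it is missing (row 1)
dropped_all = toy.dropna(axis="index", how="all")

# how="any": drop a row if at least one value is missing (rows 1 and 2)
dropped_any = toy.dropna(axis="index", how="any")

print(dropped_all)
print(dropped_any)
```

In both cases toy itself is unchanged, because the result is stored under a new name.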

### Working with duration column and splitting it into 2 columns

In [14]:
df["duration"].unique()

array(['90 min', '2 Seasons', '1 Season', '91 min', '125 min',
       '9 Seasons', '104 min', '127 min', '4 Seasons', '67 min', '94 min',
       '5 Seasons', '161 min', '61 min', '166 min', '147 min', '103 min',
       '97 min', '106 min', '111 min', '3 Seasons', '110 min', '105 min',
       '96 min', '124 min', '116 min', '98 min', '23 min', '115 min',
       '122 min', '99 min', '88 min', '100 min', '6 Seasons', '102 min',
       '93 min', '95 min', '85 min', '83 min', '113 min', '13 min',
       '182 min', '48 min', '145 min', '87 min', '92 min', '80 min',
       '117 min', '128 min', '119 min', '143 min', '114 min', '118 min',
       '108 min', '63 min', '121 min', '142 min', '154 min', '120 min',
       '82 min', '109 min', '101 min', '86 min', '229 min', '76 min',
       '89 min', '156 min', '112 min', '107 min', '129 min', '135 min',
       '136 min', '165 min', '150 min', '133 min', '70 min', '84 min',
       '140 min', '78 min', '7 Seasons', '64 min', '59 min', '139 min',
    

The unique() method returns an array of the unique values in a DataFrame column. In our case it returns the unique values in the "duration" column, which are either a number of minutes (for example '90 min') or a number of seasons (for example '2 Seasons').

In [None]:
l_num_dur=list() #creates a new list for the numbers
l_seas_min=list() #creates a new list for the units
for i in df["duration"]: #loops over the values of the duration column
    num_dur=int(i.split()[0]) #first part as an integer, e.g. 90
    w_dur=i.split()[1] #second part, the unit: Season, Seasons or min
    l_num_dur.append(num_dur) #adds the numerical value of the duration
    l_seas_min.append(w_dur) #adds the unit
df["Number duration"]=l_num_dur #adds the numbers as a new column
df["Season/min"]=l_seas_min #adds the units as a new column
df #displays the dataframe

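The above code splits the duration column into two parts: the integer and the unit (Season, Seasons or min). Each line is explained in the code cell. Note that int() works here only because the missing durations were filled with "0 min"; a value such as "Unknown" would raise a ValueError.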
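The same split can be written without an explicit loop using the pandas string methods. This is a sketch of an alternative on hypothetical toy data, assuming every value has the form "number unit".

```python
import pandas as pd

# Hypothetical toy data in the style of the duration column
toy = pd.DataFrame({"duration": ["90 min", "2 Seasons", "1 Season", "0 min"]})

# Split each string on whitespace into two columns: 0 (number) and 1 (unit)
parts = toy["duration"].str.split(expand=True)

toy["Number duration"] = parts[0].astype(int)  # numeric part as integers
toy["Season/min"] = parts[1]                   # Season, Seasons or min

print(toy)
```

The column names match those used in the loop version, so either approach produces the same two new columns.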

In [None]:
df["rating"].unique()

In this case unique() will return an array of unique values in the "rating" column.

In [None]:
print(df.loc[df["rating"]=="74 min", "duration"])
print(df.loc[df["rating"]=="84 min", "duration"])
print(df.loc[df["rating"]=="66 min", "duration"])
#There were 3 NaN values in the duration column; the original values had been put in the rating column

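The first line returns the "duration" values of all rows where the "rating" column equals "74 min". The next two lines do the same for "84 min" and "66 min". These are the three rows whose duration was entered in the rating column by mistake.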

In [None]:
df.loc[df["rating"]=="74 min", "duration"]="74 min"
df.loc[df["rating"]=="84 min", "duration"]="84 min"
df.loc[df["rating"]=="66 min", "duration"]="66 min"
df.loc[df["rating"]=="74 min", "rating"]="Unknown"
df.loc[df["rating"]=="84 min", "rating"]="Unknown"
df.loc[df["rating"]=="66 min", "rating"]="Unknown"

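The first three lines copy the misplaced values into the "duration" column, and the last three lines replace them in the "rating" column with "Unknown". The order matters: the rows are selected by their rating value, so duration must be set before rating is overwritten.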
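The six assignments can also be expressed with one boolean mask built with isin. The sketch below uses hypothetical toy data and is only one possible way to write the fix.

```python
import pandas as pd

# Hypothetical toy data: two rows have a duration stored in the rating column
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "rating": ["74 min", "PG-13", "84 min"],
    "duration": ["0 min", "90 min", "0 min"],
})

# True for rows whose rating is really a duration
misplaced = toy["rating"].isin(["74 min", "84 min", "66 min"])

# Copy the value to duration first, then reset rating
toy.loc[misplaced, "duration"] = toy.loc[misplaced, "rating"]
toy.loc[misplaced, "rating"] = "Unknown"

print(toy)
```

The mask is computed once before any assignment, so the selection does not change when the rating column is overwritten.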