# Netflix Motivations - Data Science Project

Welcome to the Netflix Motivations project! 

**Contributors**: *Christian Rhodes* and *Drew Jepson*

# The Data

...

## Setting up the data and previewing it

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Read in the data and store in a DataFrame
titles_df = pd.read_csv("data/titles.csv")

# Print the first 5 rows of the DataFrame
titles_df.head()

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,['documentation'],['US'],1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,"['drama', 'action', 'thriller', 'european']",['US'],,tt0068473,7.7,107673.0,10.01,7.3
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['fantasy', 'action', 'comedy']",['GB'],,tt0071853,8.2,534486.0,15.461,7.811
4,tm120801,The Dirty Dozen,MOVIE,12 American military prisoners in World War II...,1967,,150,"['war', 'action']","['GB', 'US']",,tt0061578,7.7,72662.0,20.398,7.6


In [2]:
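# Plot the 20 most common genre combinations (each "genres" value is a list stored as a string)
titles_df["genres"].value_counts().head(20).plot(kind="bar")
plt.xlabel('Genre combination')
plt.ylabel('Number of titles')
plt.title('Top 20 Genre Combinations')
plt.show()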

ModuleNotFoundError: No module named 'matplotlib_inline'

# Data Cleaning

First, let's look at the structure of our various features. 

In [3]:
# Print some info about our data
titles_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5850 entries, 0 to 5849
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    5850 non-null   object 
 1   title                 5849 non-null   object 
 2   type                  5850 non-null   object 
 3   description           5832 non-null   object 
 4   release_year          5850 non-null   int64  
 5   age_certification     3231 non-null   object 
 6   runtime               5850 non-null   int64  
 7   genres                5850 non-null   object 
 8   production_countries  5850 non-null   object 
 9   seasons               2106 non-null   float64
 10  imdb_id               5447 non-null   object 
 11  imdb_score            5368 non-null   float64
 12  imdb_votes            5352 non-null   float64
 13  tmdb_popularity       5759 non-null   float64
 14  tmdb_score            5539 non-null   float64
dtypes: float64(5), int64(2), object(8)

In [4]:
# Check our data types
titles_df.dtypes

id                       object
title                    object
type                     object
description              object
release_year              int64
age_certification        object
runtime                   int64
genres                   object
production_countries     object
seasons                 float64
imdb_id                  object
imdb_score              float64
imdb_votes              float64
tmdb_popularity         float64
tmdb_score              float64
dtype: object

Several of our features that look like simple strings at first glance are stored with the generic `object` dtype. Let's see if we can simplify some data types to make processing our data more intuitive and less error-prone.

In [5]:
titles_df["id"].dtype

dtype('O')
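
As a rough sketch of what simplifying these types could look like, the snippet below casts a few columns on a small stand-in frame. `sample_df` and `typed_df` are hypothetical names, and the choice of `category` for `type` and `age_certification` and of nullable `Int64` for `seasons` is an assumption, not something the analysis above has settled on.

```python
import pandas as pd

# Hypothetical stand-in for titles_df (values copied from the preview above)
sample_df = pd.DataFrame({
    "id": ["ts300399", "tm84618"],
    "type": ["SHOW", "MOVIE"],
    "age_certification": ["TV-MA", "R"],
    "seasons": [1.0, None],
})

# Free-text identifiers become the dedicated string dtype;
# columns with a handful of repeated values become categories
typed_df = sample_df.astype({
    "id": "string",
    "type": "category",
    "age_certification": "category",
})

# seasons is float only because of missing values;
# the nullable Int64 dtype keeps whole numbers and marks gaps as <NA>
typed_df["seasons"] = typed_df["seasons"].astype("Int64")

print(typed_df.dtypes)
```

The same `astype` call should apply to `titles_df` itself, provided the columns hold the kinds of values seen in the preview.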

Next, let's look at null values ...

In [6]:
titles_df["title"].isnull().any()


True

It looks like some entries do not contain a title. Let's remove these rows.

In [7]:
# remove rows with missing values for title
titles_df = titles_df.dropna(subset=["title"])
titles_df["title"].isnull().any()

False
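
The check above covers only `title`. A minimal sketch of the same idea applied to every column at once is shown below; `sample_df`, `null_counts`, and `cleaned_df` are hypothetical names and the tiny frame is illustrative data, not rows from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of titles_df with a few gaps
sample_df = pd.DataFrame({
    "title": ["Taxi Driver", None, "Deliverance"],
    "age_certification": ["R", None, None],
    "imdb_score": [8.2, np.nan, 7.7],
})

# Count missing values per column, largest first
null_counts = sample_df.isnull().sum().sort_values(ascending=False)
print(null_counts)

# Drop only the rows that lack a title; gaps in other columns are left in place
cleaned_df = sample_df.dropna(subset=["title"])
print(len(cleaned_df))
```

Restricting `dropna` with `subset` keeps rows that are missing only optional fields such as `age_certification`.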

# Exploratory Data Analysis
To start, let's look at unique values in critical columns. 

In [8]:
titles_df["genres"].unique()

array(["['documentation']", "['drama', 'crime']",
       "['drama', 'action', 'thriller', 'european']", ...,
       "['drama', 'animation', 'music']",
       "['animation', 'family', 'scifi']",
       "['documentation', 'music', 'reality']"], dtype=object)

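The `unique()` output above lists whole genre combinations because each entry is a single string that only looks like a list. A small sketch of one way to recover the individual genres follows, assuming every entry is a valid Python list literal; `raw_genres`, `parsed`, and `individual_genres` are hypothetical names.

```python
import ast
import pandas as pd

# Raw strings copied from the unique() output above
raw_genres = pd.Series([
    "['documentation']",
    "['drama', 'crime']",
    "['drama', 'action', 'thriller', 'european']",
])

# literal_eval safely turns each string into a real Python list
parsed = raw_genres.apply(ast.literal_eval)

# explode gives one row per genre, so unique() now returns single genres
individual_genres = sorted(parsed.explode().unique())
print(individual_genres)
```

`ast.literal_eval` is used here rather than `eval` because it only accepts literals and will not execute arbitrary code found in the file.
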
# Visualizations
An important feature to consider is genres. Let's make some visualizations to look at the distribution of genres.

In [15]:
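import ast
import seaborn as sns
import plotly.express as px

# genres is stored as a string such as "['drama', 'crime']",
# so parse it into a real list first; explode then yields one row per genre
genre_df = titles_df.assign(genres=titles_df["genres"].apply(ast.literal_eval)).explode("genres")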

ModuleNotFoundError: No module named 'plotly'

In [12]:
# Create a bar chart showing the number of titles (movies and shows) in each genre
sns.countplot(data=genre_df, x='genres')

ModuleNotFoundError: No module named 'matplotlib_inline'

In [16]:
#create a pie chart
fig = px.pie(genre_df, names='genres')
fig.show()

NameError: name 'px' is not defined

In [17]:
#create a horizontal bar chart
fig = px.bar(genre_df, y='genres', orientation='h')
fig.show()

NameError: name 'px' is not defined

In [18]:
# Create a bubble chart relating the number of titles in each genre (x-axis) to the genre's average IMDb score (bubble size and color)
genre_rating_df = genre_df.groupby('genres').agg({'imdb_score': 'mean', 'title': 'count'}).reset_index()
fig = px.scatter(genre_rating_df, x='title', y='genres', size='imdb_score', color='imdb_score', hover_name='genres')
fig.show()

KeyError: "Column(s) ['rating'] do not exist"
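
The `KeyError` above names a `rating` column, which does not appear in the `info()` listing; the score columns there are `imdb_score` and `tmdb_score`. One way to make the aggregation harder to get wrong is pandas named aggregation, which gives each output column an explicit name. The sketch below runs on a trimmed, hypothetical exploded frame; `sample_genre_df`, `genre_summary`, `title_count`, and `mean_imdb_score` are illustrative names, not part of the original analysis.

```python
import pandas as pd

# Hypothetical exploded frame: one row per (title, genre) pair
sample_genre_df = pd.DataFrame({
    "title": ["Taxi Driver", "Taxi Driver", "Deliverance", "The Dirty Dozen"],
    "genres": ["drama", "crime", "drama", "war"],
    "imdb_score": [8.2, 8.2, 7.7, 7.7],
})

# Named aggregation: each keyword is the output column,
# each tuple is (source column, aggregation function)
genre_summary = (
    sample_genre_df.groupby("genres")
    .agg(
        title_count=("title", "count"),
        mean_imdb_score=("imdb_score", "mean"),
    )
    .reset_index()
)
print(genre_summary)
```

With columns named this way, a plot call could refer to `x='title_count'` and `size='mean_imdb_score'`, which reads more clearly than reusing `title` for a count.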