# Explore Dataset - Homework exercise 1 (Programming task)

## Introduction

This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.

Dataset with Netflix content available as of 2019
[Netflix content dataset](https://www.kaggle.com/shivamb/netflix-shows)

This dataset was gathered from Kaggle and was used to cluster movies together in order to find similar movies and create recommendations.

### Inspiration:

1. Understanding what content is available in different countries
2. Identifying similar content by matching text-based features
3. Network analysis of actors and directors to find interesting insights
4. Has Netflix been increasingly focusing on TV rather than movies in recent years?

### Authors:

**Group: 5**
- André Moreira, 62058
- Catarina Silva, 76399
- Luís Marques, 81526


In [None]:
#Import Pandas library
import pandas as pd

#Import Numpy library
import numpy as np

#Import Plotly library
import plotly.express as px

## Load dataset and first analysis

In [None]:
# Load dataset from the folder datasets
netflix_df = pd.read_csv("./datasets/netflix_titles.csv")
# Return the first five rows
netflix_df.head()

In [None]:
# Print a concise summary of a DataFrame
netflix_df.info()

In [None]:
# Return a tuple representing the dimensionality of the DataFrame.
netflix_df.shape

In [None]:
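# Generate descriptive statistics for the string (object) columns.
# The builtin `object` is used because the `np.object` alias was removed in recent NumPy releases.
netflix_df.describe(include=object)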

In [None]:
# Examine how often specific values occur in release_year column
netflix_df['release_year'].value_counts()

## Querying the dataset

In [None]:
netflix_df[netflix_df['release_year'] > 2019]

## Sorting

In [None]:
netflix_df.sort_values(by='release_year').head()

## Transforming Dataset

In [None]:
# Convert duration strings to minutes: 'x min' becomes x, and 'x Season(s)' is estimated from the averages below
def convert_duration(d):
    a = d.split(' ')
    if a[1] == 'min':
        return int(a[0])
    else:
        #estimate season duration
        # 12 = average number of episodes
        # 45 = average number of minutes per episode
        return int(a[0])*12*45
        
netflix_df['duration'] = netflix_df['duration'].apply(convert_duration)
netflix_df.head()
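
The conversion above can be checked on a few values before applying it to the whole column. The following is a minimal sketch, assuming the `duration` strings look like `'90 min'` or `'2 Seasons'`; the sample values and the helper name `duration_to_minutes` are hypothetical and not part of the dataset.

```python
import pandas as pd

# Hypothetical sample values in the same format as the `duration` column
sample = pd.Series(["90 min", "1 Season", "3 Seasons"])

EPISODES_PER_SEASON = 12   # assumed average, same value as in the cell above
MINUTES_PER_EPISODE = 45   # assumed average, same value as in the cell above

def duration_to_minutes(value):
    """Return the duration in minutes for an 'N min' or 'N Season(s)' string."""
    amount, unit = value.split(" ", 1)
    if unit == "min":
        return int(amount)
    # Anything else is treated as a number of seasons and estimated
    return int(amount) * EPISODES_PER_SEASON * MINUTES_PER_EPISODE

minutes = sample.apply(duration_to_minutes)
print(minutes.tolist())  # [90, 540, 1620]
```

Since every season is estimated as 540 minutes, TV shows end up far to the right of movies in any duration histogram, which is worth keeping in mind when reading the plots.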

In [None]:
# Count the number of Nulls
netflix_df.isnull().sum()

### Missing values

In [None]:
# Remove Nulls
netflix_df.dropna(inplace=True)

In [None]:
# Count the number of Nulls
netflix_df.isnull().sum()

### Convert String Date to Datetime

1. Conversion Times

In [None]:
# Conversion times with inference and without

'''
%timeit pd.to_datetime(netflix_df['date_added'], infer_datetime_format=True)

%timeit pd.to_datetime(netflix_df['date_added'], infer_datetime_format=False)
'''

![Conversion Times](images/convertion_times.png)

2. Convert date to datetime

In [None]:
netflix_df['date_added'] = pd.to_datetime(netflix_df['date_added'], infer_datetime_format=True)

In [None]:
netflix_df.info()
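
As an alternative to format inference, the date format can be given explicitly. This is a sketch only, assuming `date_added` values look like `'September 1, 2019'` and may carry stray leading whitespace; the sample values below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample values; the leading space and the missing entry are deliberate
sample = pd.Series([" September 1, 2019", "January 15, 2018", None])

# Strip whitespace, then parse with an explicit format.
# errors="coerce" turns anything that does not match into NaT instead of raising.
parsed = pd.to_datetime(sample.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed.dt.year)
```

An explicit format avoids relying on inference and makes unexpected values show up as `NaT`, which can then be counted with `isna().sum()`.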

## Visualizing

### Plot

In [None]:
# Release year plot (bar)
netflix_df["release_year"].value_counts().head(10).plot(kind="bar")

### Histogram

1. Release year histogram

In [None]:
# Release year histogram
netflix_df.hist(column='release_year', bins =20)

2. Duration histogram

In [None]:
netflix_df.hist(column='duration', bins=50)

## Top rated movies on Netflix

Because the original dataset contains no audience ratings, we decided to use the ratings available on IMDb by merging the Netflix dataset with the IMDb one.
[IMDb movies extensive dataset](https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset)

IMDb is the most popular movie website and it combines movie plot descriptions, Metascore ratings, critic and user ratings and reviews, release dates, and many more aspects.

IMDb stores information related to more than 6 million titles (of which almost 500,000 are feature films) and has been owned by Amazon since 1998.

The data has been scraped from the publicly available website https://www.imdb.com. All the movies with more than 100 votes have been scraped as of 01/01/2020.

In [None]:
#Loads IMDB ratings dataset
imdb_ratings=pd.read_csv('datasets/IMDb ratings.csv',usecols=['weighted_average_vote'])
imdb_titles=pd.read_csv('datasets/IMDb movies.csv', usecols=['title','year','genre'], dtype=str)

#Builds a dataframe containing movie's title, release year, rating and genre
ratings = pd.DataFrame({'Title':imdb_titles.title,
                    'Release Year':imdb_titles.year,
                    'Rating': imdb_ratings.weighted_average_vote,
                    'Genre':imdb_titles.genre})

#Removes duplicate values from the dataframe and rebuilds it
ratings.drop_duplicates(subset=['Title','Release Year','Rating'], inplace=True)
ratings.shape

In [None]:
imdb_titles.info()

In [None]:
imdb_ratings.info()

In [None]:
#Removes rows with empty values from the dataframe and rebuilds it
ratings.dropna(inplace=True)

#Merges the two dataframes into a new one
joint_data=ratings.merge(netflix_df,left_on='Title',right_on='title',how='inner')

#Sorts the dataframe into descending rating order
joint_data=joint_data.sort_values(by='Rating', ascending=False)

In [None]:
joint_data.info()
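
Merging on the title alone can pair a Netflix entry with every IMDb movie that shares its name (remakes, for example). The sketch below illustrates the effect and one possible mitigation, joining on title and release year. All rows, including the ratings, are made-up values for illustration, and the two sources may not always agree on the year, so this is an assumption to verify rather than a drop-in replacement.

```python
import pandas as pd

# Hypothetical IMDb-like rows (year stored as string, as in the cell that loads it)
ratings = pd.DataFrame({
    "Title": ["Aladdin", "Aladdin", "Roma"],
    "Release Year": ["1992", "2019", "2018"],
    "Rating": [8.0, 7.0, 7.7],
})

# Hypothetical Netflix-like rows (release_year stored as integer)
netflix = pd.DataFrame({
    "title": ["Aladdin", "Roma"],
    "release_year": [2019, 2018],
})

# Title-only join: both "Aladdin" rows match the single Netflix entry
title_only = ratings.merge(netflix, left_on="Title", right_on="title", how="inner")

# Title + year join: the year columns must share a dtype before merging
ratings_typed = ratings.assign(**{"Release Year": ratings["Release Year"].astype(int)})
title_year = ratings_typed.merge(
    netflix,
    left_on=["Title", "Release Year"],
    right_on=["title", "release_year"],
    how="inner",
)

print(len(title_only), len(title_year))
```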

In [None]:
#Picks the first 10 rows (the 10 highest-rated titles) from the sorted dataframe
top_rated=joint_data[0:10]

#Creates sunburst visualization with Plotly
fig = px.sunburst(
    top_rated,
    path=['title','country'],
    values='Rating',
    color='Rating')
fig.show()

## Top Movie Makers

In [None]:
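#Creates a dataframe based on the value counts of movies by 'country' column
country_count=joint_data['country'].value_counts().sort_values(ascending=False)
#Names the count column 'country' explicitly so the cells below work regardless of the pandas version
country_count=country_count.to_frame(name='country')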

#Picks the top 10 counts
topcountries=country_count[0:10]
topcountries

In [None]:
#Get countries to list format
topcountries.to_dict()['country'].keys()

In [None]:
#Get movie counts to list format
topcountries['country'].values.tolist()

In [None]:
#Converts data to dictionary to fit plotting criteria
data = dict(
    number=topcountries['country'].values.tolist(),
    country=topcountries.to_dict()['country'].keys())

#Plots funnel image
fig = px.funnel(data, x='number', y='country')
fig.show()

## Average IMDb Movie rating on Netflix

In [None]:
#Ratings count of each movie on Netflix. Ratings were floored as to set a range instead of individual values
np.floor(joint_data['Rating']).value_counts()

In [None]:
#Transforms data to dictionary to better plot the data
data = dict(
    rating = dict(np.floor(joint_data['Rating']).value_counts()).keys(),
    count = np.floor(joint_data['Rating']).value_counts()
)

'''
Creates bar chart to display how Netflix movies are rated.
We can assess a movie's quality/resonance with the general public by its rating,
thus making it possible to understand how good Netflix content is
'''
fig = px.bar(data, x='rating', y='count')
fig.show()

## Movie classification on Netflix

In [None]:
#Histogram of the Netflix classification ('rating' column, not the IMDb 'Rating' column)
fig = px.histogram(joint_data, x='rating')
fig.show()

## Content growth on Netflix over the years

To analyze the growth of Netflix content over the years, we read the dataset again and dropped only the entries with null values in a subset of features. The fields used in this analysis are the content *type* and *date_added*. The remaining features, such as *cast*, have no impact on the analysis, so entries missing that information were kept.

In [None]:
#Read CSV
netflix_new_df = pd.read_csv("./datasets/netflix_titles.csv")

#Drop null values from type and date_added subset
netflix_new_df.dropna(subset=['type', 'date_added'], inplace=True)

#Convert date_added dtype to datetime
netflix_new_df['date_added'] = pd.to_datetime(netflix_new_df['date_added'], infer_datetime_format=True)

#Display number of nulls and dtype
netflix_new_df.info()
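
The difference between dropping every row with a null and dropping only rows with nulls in the relevant columns can be seen on a small frame. This is a minimal sketch with made-up rows; only the column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical rows: two are missing `cast`, one is missing `date_added`
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "date_added": ["January 1, 2019", None, "March 3, 2018", "May 5, 2017"],
    "cast": [None, "A", "B", None],
})

# Drops any row with at least one null, in any column
strict = sample.dropna()

# Drops only rows with a null in the columns the analysis actually uses
relaxed = sample.dropna(subset=["type", "date_added"])

print(len(strict), len(relaxed))  # 1 3
```

With the subset version, rows that only lack `cast` still contribute to the yearly counts.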

In [None]:
#Retrieve cumulative sum of Netflix content by date_added
netflix_content_growth = pd.DatetimeIndex(netflix_new_df.date_added).year.sort_values(ascending=True)\
            .value_counts().sort_index().cumsum(axis=0)

#Transform cumulative sum to dictionary so we can plot it
data = dict(
    year = netflix_content_growth.keys(),
    count = netflix_content_growth.values
)

In [None]:
#Create line chart with content growth on Netflix
fig = px.line(data, x='year', y='count')
fig.show()
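
The cumulative count can also be written with the `.dt` accessor on a datetime column. The sketch below uses made-up dates and is meant only to show what the chain of `value_counts`, `sort_index` and `cumsum` produces at each year.

```python
import pandas as pd

# Hypothetical `date_added` values, already converted to datetime
dates = pd.Series(pd.to_datetime(
    ["2016-01-01", "2017-05-01", "2017-06-01", "2019-03-01"]
))

# Count titles per year, order the years chronologically, then accumulate
growth = dates.dt.year.value_counts().sort_index().cumsum()

print(growth.tolist())  # [1, 3, 4]
```

Years with no additions (2018 in this sample) do not appear in the index, so a line chart simply connects the neighbouring points.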

## Growth per content type over the years

In [None]:
#Retrieve content type on Netflix
netflix_new_df.type.unique()

In [None]:
#Create reference to netflix_new dataframe filtering data to only contain TV Shows
tvshow_df = netflix_new_df[netflix_new_df.type.str.contains('TV Show')]

#Cumulative sum of Netflix TV Shows over the years
pd.DatetimeIndex(tvshow_df.date_added).year.sort_values(ascending=True)\
            .value_counts().sort_index().cumsum(axis=0)

In [None]:
#Create reference to netflix_new dataframe filtering data to only contain Movies
movies_df = netflix_new_df[netflix_new_df.type.str.contains('Movie')]

#Cumulative sum of Netflix Movies over the years
pd.DatetimeIndex(movies_df.date_added).year.sort_values(ascending=True)\
            .value_counts().sort_index().cumsum(axis=0)

In [None]:
#Create dataframe from cumulative sum of Netflix TV Shows over the years
tvshow_df = pd.DataFrame(pd.DatetimeIndex(tvshow_df.date_added).year.sort_values(ascending=True)\
            .value_counts().sort_index().cumsum(axis=0))

In [None]:
#Added year indexes as a value on the dataframe so we can plot it
tvshow_df.insert(0, 'year', tvshow_df.index)

#Renamed feature to its correct representation
tvshow_df.rename(columns = {'date_added':'count'}, inplace = True)

#Added feature type and filled values as TV Show
tvshow_df.insert(2, 'type', 'TV Show')

In [None]:
#Display TV Show dataframe
tvshow_df

In [None]:
#Create dataframe from cumulative sum of Netflix Movies over the years
movies_df = pd.DataFrame(pd.DatetimeIndex(movies_df.date_added).year.sort_values(ascending=True)\
            .value_counts().sort_index().cumsum(axis=0))

In [None]:
#Added year indexes as a value on the dataframe so we can plot it
movies_df.insert(0, 'year', movies_df.index)

#Renamed feature to its correct representation
movies_df.rename(columns = {'date_added':'count'}, inplace = True)

#Added feature type and filled values as Movie
movies_df.insert(2, 'type', 'Movie')

In [None]:
#Display Movie dataframe
movies_df

In [None]:
#Creates new dataframe by concatenating the TV Shows and Movies dataframes
#(pd.concat is used because DataFrame.append was removed in pandas 2.0)
content_type_df = pd.concat([tvshow_df, movies_df], ignore_index=True, sort=False)

In [None]:
#Display cumulative sum per content dataframe
content_type_df

In [None]:
#Create line chart with content growth on Netflix per type
fig = px.line(content_type_df, x="year", y="count", color='type')
fig.show()
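
The per-type cumulative table built step by step above could also be produced in one pass with `groupby`. This is a sketch under the assumption that `date_added` is already a datetime column; the rows below are made up for illustration.

```python
import pandas as pd

# Hypothetical rows with the two columns the analysis uses
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show"],
    "date_added": pd.to_datetime(
        ["2017-01-05", "2018-03-10", "2018-07-01", "2018-11-20", "2019-02-14"]
    ),
})

# Number of titles added per (type, year), in long format
yearly = (
    sample.assign(year=sample["date_added"].dt.year)
          .groupby(["type", "year"])
          .size()
          .reset_index(name="count")
)

# Running total within each content type
yearly["count"] = yearly.groupby("type")["count"].cumsum()

print(yearly)
```

The resulting frame has the same `year`, `count` and `type` columns as `content_type_df`, so it could be passed to `px.line` with `color='type'` in the same way.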