### Note:

This notebook is for outlining and organizing the project. It will help us understand what we need to code in the actual Python file that will be used for deployment.

In [2]:
import pandas as pd
import numpy as np
import plotly.graph_objs as go

df = pd.read_csv('netflix_list.csv')
df2 = pd.read_csv('simple.csv', skiprows=4)
#https://www.kaggle.com/snehaanbhawal/anime-list-for-recommendation-system-june-2021

In [3]:
pd.set_option('display.max_columns', None)

In [4]:
df.head()

Unnamed: 0,imdb_id,title,popular_rank,certificate,startYear,endYear,episodes,runtime,type,orign_country,language,plot,summary,rating,numVotes,genres,isAdult,cast,image_url
0,tt4052886,Lucifer,1,15,2016.0,,93.0,42,tvSeries,United States,English,Lucifer Morningstar has decided he's had enoug...,"Lucifer Morningstar, bored from his sulking li...",8.1,250884.0,"Crime,Drama,Fantasy",0,"['Tom Ellis', 'Lauren German', 'Lesley-Ann Bra...",https://m.media-amazon.com/images/M/MV5BNzY1Yj...
1,tt0993840,Army of the Dead,2,18,2021.0,,,148,movie,United States,English,"Following a zombie outbreak in Las Vegas, a gr...","With the abandoned, walled city of Las Vegas o...",5.8,110780.0,"Action,Crime,Horror",0,"['Dave Bautista', 'Ella Purnell', 'Ana de la R...",https://m.media-amazon.com/images/M/MV5BNGY0Nz...
2,tt7255502,The Kominsky Method,3,18,2018.0,2021.0,22.0,30,tvSeries,United States,English,"An aging actor, who long ago enjoyed a brush w...",Michael Douglas plays an actor who made it big...,8.2,28795.0,"Comedy,Drama",0,"['Michael Douglas', 'Sarah Baker', 'Graham Rog...",https://m.media-amazon.com/images/M/MV5BMzA0YT...
3,tt0108778,Friends,4,13+,1994.0,2004.0,235.0,22,tvSeries,United States,English,Follows the personal and professional lives of...,"Ross Geller, Rachel Green, Monica Geller, Joey...",8.9,861843.0,"Comedy,Romance",0,"['Jennifer Aniston', 'Courteney Cox', 'Lisa Ku...",https://m.media-amazon.com/images/M/MV5BNDVkYj...
4,tt9251798,Ragnarok,5,18,2020.0,,12.0,45,tvSeries,Norway,Norwegian,A small Norwegian town experiencing warm winte...,In the small fictional town of Edda coming of ...,7.5,26606.0,"Action,Drama,Fantasy",0,"['David Stakston', 'Jonas Strand Gravli', 'Her...",https://m.media-amazon.com/images/M/MV5BODM3NT...


In [5]:
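def clean_data(dataset):
    """Cleans the Netflix data frame.

    Replaces null values in the rating column with the column mean,
    truncated to an integer.
    Drops rows where the genres column is null (2 rows in this dataset).

    Args:
        dataset (str): path to the csv data file

    Returns:
        pandas.DataFrame: the cleaned data frame

    """

    df = pd.read_csv(dataset)
    # Assign the result instead of calling fillna(inplace=True) on a column
    # selection, which can trigger chained-assignment warnings in newer pandas
    df['rating'] = df['rating'].fillna(int(df['rating'].mean()))
    df = df.dropna(how='any', subset=['genres'])

    return df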

Let's see what percentage of the rating column is missing before cleaning.

In [6]:
df.rating.isnull().sum() * 100 / len(df)

14.383561643835616
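
The same check can be run for every column at once. The following is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and not taken from `netflix_list.csv`); it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Netflix frame
toy = pd.DataFrame({
    "rating": [8.1, np.nan, 8.2, 8.9],
    "genres": ["Crime,Drama", "Action", None, "Comedy"],
})

# isnull() gives a boolean mask; its column-wise mean is the share of
# missing entries, which we scale to a percentage
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

Applied to the real `df`, this would give one percentage per column, which makes it easier to spot other columns that may need cleaning.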

About 14% of the rating column is missing, and only 2 values in the genres column are missing. Let's fill the NaN ratings with the column mean (truncated to an integer by `clean_data`) and drop the rows with missing genres.

In [7]:
df = clean_data('netflix_list.csv')
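
Note that `clean_data` wraps the mean in `int(...)`, so the fill value is rounded down rather than being the exact mean. A small sketch of the difference, using made-up ratings (the `toy` series is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with one missing value
toy = pd.Series([8.1, np.nan, 5.8, 8.9], name="rating")

mean_value = toy.mean()      # NaN is skipped when computing the mean
truncated = int(mean_value)  # the kind of fill value clean_data uses

filled_truncated = toy.fillna(truncated)
filled_exact = toy.fillna(mean_value)

print(filled_truncated.tolist())
print(filled_exact.round(2).tolist())
```

Whether the truncated or the exact mean is preferable depends on how the ratings are displayed in the dashboard; the exact mean keeps the column average unchanged.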

Now, let's organize our data for each graph that we will want in our dashboard.

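For graph 1, we will compute the average rating per country of origin. The cell below keeps the first 11 countries in alphabetical order (the order `groupby` returns) and then sorts those by rating.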

In [8]:
data = df.groupby(df.orign_country)["rating"].mean().head(11).sort_values(ascending=False)
top_countries = pd.DataFrame(data)

In [9]:
top_countries = top_countries.reset_index()
top_countries.drop(top_countries.loc[top_countries['orign_country']=='-'].index, inplace=True)

In [10]:
top_countries.sort_values(by='rating')

Unnamed: 0,orign_country,rating
10,Bahamas,5.2
9,Bulgaria,5.266667
8,Argentina,6.046032
7,Albania,6.1
6,Brazil,6.382759
4,Austria,6.588889
3,Australia,6.904938
2,Belgium,6.938095
1,Belarus,7.0
0,Cambodia,7.2
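
If the dashboard should show the highest-rated countries rather than the first ones alphabetically, one option is to sort before taking the head. A minimal sketch with made-up ratings (the `toy` frame and its values are hypothetical; only the column names match the data set):

```python
import pandas as pd

# Hypothetical ratings; '-' mimics the placeholder country in the real data
toy = pd.DataFrame({
    "orign_country": ["Albania", "Albania", "Japan", "Norway", "-", "Japan"],
    "rating": [6.0, 6.2, 8.4, 7.5, 9.9, 8.0],
})

# Average rating per country (groupby returns keys in sorted order)
avg = toy.groupby("orign_country")["rating"].mean()

top = (
    avg.drop(index="-", errors="ignore")  # remove the placeholder country
       .sort_values(ascending=False)      # rank by rating first ...
       .head(2)                           # ... then keep the top entries
       .reset_index()
)
print(top)
```

With many countries that have only one or two titles, it may also be worth requiring a minimum number of titles per country before ranking.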


Graph 2 will show us the number of new shows released each year. We can find this out by using the startYear column.

In [11]:
data2 = df.startYear.value_counts()

# rename_axis / reset_index(name=...) produces the same column names
# regardless of how the pandas version labels value_counts output
shows_year = data2.rename_axis('Year').reset_index(name='Count')
shows_year = shows_year.head(10)

In [12]:
shows_year.head(10)

Unnamed: 0,Year,Count
0,2020.0,917
1,2019.0,883
2,2018.0,834
3,2017.0,722
4,2016.0,561
5,2021.0,546
6,2015.0,399
7,2014.0,318
8,2013.0,268
9,2012.0,185
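
`value_counts` orders rows by count, so the years above are not chronological. For a chart with years on the x-axis, it may help to sort by year and cast the float years to integers. A sketch using the first six rows of the table above (the name `by_year` is illustrative):

```python
import pandas as pd

# First six rows of the shows_year output shown above
shows_year = pd.DataFrame({
    "Year": [2020.0, 2019.0, 2018.0, 2017.0, 2016.0, 2021.0],
    "Count": [917, 883, 834, 722, 561, 546],
})

by_year = (
    shows_year.astype({"Year": int})   # 2020.0 -> 2020 for cleaner axis labels
              .sort_values("Year")     # chronological order
              .reset_index(drop=True)
)
print(by_year)
```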


Graph 3 will show us the top 10 shows with the most episodes.

In [13]:
# Select the 10 rows with the largest episode counts
top_episodes = df.nlargest(10, columns=['episodes'])
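
`nlargest` does not pick rows with a missing `episodes` value ahead of rows that have one, so movies (which have no episode count in the preview above) do not need to be filtered out first. A short sketch using the five rows from the preview, with a column selection that a chart would probably need (the choice of columns is an assumption):

```python
import numpy as np
import pandas as pd

# Titles and episode counts from the df.head() preview
sample = pd.DataFrame({
    "title": ["Lucifer", "Army of the Dead", "The Kominsky Method",
              "Friends", "Ragnarok"],
    "episodes": [93.0, np.nan, 22.0, 235.0, 12.0],
})

# Keep only the columns needed for plotting
top_episodes = sample.nlargest(3, "episodes")[["title", "episodes"]]
print(top_episodes)
```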