# Introductory data analysis with a real-world dataset

In this notebook, we'll cover some introductory data analysis skills that can help you analyze new data. We've been exploring different pandas techniques in the previous problem sets. Let's see how we can apply what we've learned to clean, explore, and gain insights from a real-world dataset.

For this exercise, we'll be working with Netflix data! Specifically, we'll be working with the "Most watched Netflix original shows" dataset. Credit: [Muhmad Akmal,  Kaggle](https://www.kaggle.com/code/muhmadakmal/most-watched-netflix-original-shows-tv-time/input)


## Loading the data

As you know from previous problem sets, one of the first things we can do when encountering a new dataset is load it into pandas and glance at the first few rows.

In [426]:
import pandas as pd
df = pd.read_csv('netflix_data.csv')

Let's start with a very useful method called `df.head()`.

`df.head()` returns the first n rows of a dataframe based on position. By default, it returns the first 5 rows.


In [427]:
df.head()

Unnamed: 0,lister-item-index,lister-item-header,certificate,runtime,genre,rating,votes
0,1,Stranger Things,15,60 min,"Drama, Fantasy, Horror",8.7,1327188
1,2,13 Reasons Why,18,60 min,"Drama, Mystery, Thriller",7.5,314321
2,3,Orange Is the New Black,18,59 min,"Comedy, Crime, Drama",8.0,319342
3,4,Black Mirror,18,60 min,"Drama, Mystery, Sci-Fi",8.7,636319
4,5,Money Heist,15,60 min,"Action, Crime, Drama",8.2,529086


But let's say we were interested in seeing the first 8 rows. We can simply specify:


In [428]:
df.head(8)

Unnamed: 0,lister-item-index,lister-item-header,certificate,runtime,genre,rating,votes
0,1,Stranger Things,15,60 min,"Drama, Fantasy, Horror",8.7,1327188
1,2,13 Reasons Why,18,60 min,"Drama, Mystery, Thriller",7.5,314321
2,3,Orange Is the New Black,18,59 min,"Comedy, Crime, Drama",8.0,319342
3,4,Black Mirror,18,60 min,"Drama, Mystery, Sci-Fi",8.7,636319
4,5,Money Heist,15,60 min,"Action, Crime, Drama",8.2,529086
5,6,Lucifer,15,60 min,"Crime, Drama, Fantasy",8.1,354155
6,7,Narcos,15,50 min,"Biography, Crime, Drama",8.8,467909
7,8,Daredevil,15,60 min,"Action, Crime, Drama",8.6,472940


`df.tail()` works in a similar way. It returns the last n rows of the dataset, with the default being 5 rows. 

In [429]:
df.tail()

Unnamed: 0,lister-item-index,lister-item-header,certificate,runtime,genre,rating,votes
75,76,F Is for Family,15,30 min,"Animation, Comedy, Drama",8.0,41074
76,77,The Ranch,15,30 min,"Comedy, Drama, Western",7.5,42401
77,78,American Vandal,15,34 min,"Comedy, Crime, Drama",8.1,32985
78,79,Dead to Me,15,30 min,"Comedy, Crime, Drama",7.9,99440
79,80,Quicksand,18,45 min,"Crime, Drama, Mystery",7.4,25507


You can use `df.sample()` to randomly sample some rows that you want to take a peek at. This is especially useful in cases where you don't think the first/last few rows are indicative of the entire dataset.

In [473]:
df.sample(5)

Unnamed: 0,lister-item-index,lister-item-header,certificate,runtime,genre,rating,votes
44,45,The Society,15,58 min,"Drama, Mystery, Sci-Fi",7.1,36504
61,62,Love,15,50 min,"Comedy, Drama, Romance",7.6,45854
6,7,Narcos,15,50 min,"Biography, Crime, Drama",8.8,467909
34,35,Disenchantment,15,30 min,"Animation, Action, Adventure",7.2,72503
18,19,The Punisher,18,60 min,"Action, Crime, Drama",8.4,263516
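Because `df.sample()` is random, you'll get different rows each time you run it. If you want the same sample on every run (handy when sharing a notebook), you can pass a `random_state` seed. Here's a minimal sketch on a small hand-typed frame called `toy` (a stand-in for illustration, not the full CSV):

```python
import pandas as pd

# Small stand-in frame built from a few rows shown above (not the full dataset)
toy = pd.DataFrame({
    "lister-item-header": ["Stranger Things", "13 Reasons Why",
                           "Orange Is the New Black", "Black Mirror", "Money Heist"],
    "rating": [8.7, 7.5, 8.0, 8.7, 8.2],
})

# Using the same seed twice returns the same rows both times
first_draw = toy.sample(n=3, random_state=42)
second_draw = toy.sample(n=3, random_state=42)

print(first_draw)
print(first_draw.equals(second_draw))
```

The seed value itself (42 here) is arbitrary; any integer works as long as you reuse it.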


## Analyzing Columns & Rows

Now that we have a good idea of what this dataset looks like, let's analyze what each column here is representing! 

In our case, we can visually see the titles of all 7 columns of our dataset. But in larger datasets (say, with 100 columns) we can use `df.columns` to return all the column labels in our dataframe.


In [474]:
df.columns

Index(['lister-item-index', 'lister-item-header', 'certificate', 'runtime',
       'genre', 'rating', 'votes'],
      dtype='object')

Some of these column names are a little long. Luckily, we can use `df.rename` to rename any column we'd like.

In [432]:
new_df = df.rename(columns={'lister-item-header': 'show', 'lister-item-index': 'index'})
new_df.head()

Unnamed: 0,index,show,certificate,runtime,genre,rating,votes
0,1,Stranger Things,15,60 min,"Drama, Fantasy, Horror",8.7,1327188
1,2,13 Reasons Why,18,60 min,"Drama, Mystery, Thriller",7.5,314321
2,3,Orange Is the New Black,18,59 min,"Comedy, Crime, Drama",8.0,319342
3,4,Black Mirror,18,60 min,"Drama, Mystery, Sci-Fi",8.7,636319
4,5,Money Heist,15,60 min,"Action, Crime, Drama",8.2,529086


`index:` Describes the index of the show as listed on IMDB - begins indexing at 1

`show:` Show Name

`certificate:` Maturity Rating. For example, a certificate of 15 implies that content may be unsuitable for children under this age. 

`runtime:` The average runtime of each episode of the show

`genre:` The genres associated with the show. 

`rating:` IMDB rating for that show 

`votes:` Number of people that rated the show

For the purposes of this exercise, let's assume that all the data in this dataset is accurate. 

Let's keep going with our columns and row analysis: We can use `new_df.shape` to return the number of rows and columns of our dataframe. 

In [475]:
new_df.shape

(80, 7)

Note that `.shape` returns a tuple, which we can unpack into two separate variables (called rows and columns) for easy future access.

In [476]:
rows, columns = new_df.shape

In [477]:
rows

80

In [478]:
columns

7

There are some other helpful things we can use to describe our data! 

`new_df.info()` prints information about the dataframe, including the data types found in each column.

In [437]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   index        80 non-null     int64  
 1   show         80 non-null     object 
 2   certificate  80 non-null     object 
 3   runtime      80 non-null     object 
 4   genre        80 non-null     object 
 5   rating       80 non-null     float64
 6   votes        80 non-null     object 
dtypes: float64(1), int64(1), object(5)
memory usage: 4.5+ KB


There's a lot of things outputted here! Let's break them down: 

For each of our 7 dataframe columns, we can see `Non-Null Count` and `Dtype`. The `Non-Null Count` (as the name implies) tells us how many values in that particular column are not null.

`Dtype` tells us what the data type is for values in a column. The index column is of type `int64`, the rating column is `float64`, and the remaining columns are of type `object`. In other words, only two of our columns represent numerical values (index and rating). The show, certificate, runtime, genre, and votes columns are strings (in this DataFrame, strings are stored with the `object` dtype).


We also get overall memory usage of our dataframe. Side note: If you're interested in a more accurate measure of memory usage, feel free to try `new_df.info(memory_usage='deep')`
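To see why the "deep" measure can differ, here's a small sketch using `DataFrame.memory_usage`, which reports bytes per column. The frame `toy` below is a made-up stand-in, and the string column is forced to `object` dtype to mirror what `new_df.info()` reported:

```python
import pandas as pd

# Stand-in frame; the string column uses the object dtype, as in the notebook
toy = pd.DataFrame({
    "show": pd.Series(["Stranger Things", "Black Mirror"], dtype="object"),
    "rating": [8.7, 8.7],
})

# The default count only covers the references to the strings
shallow_bytes = toy.memory_usage().sum()

# deep=True also counts the string contents themselves
deep_bytes = toy.memory_usage(deep=True).sum()

print(shallow_bytes, deep_bytes)
```

The exact byte counts depend on your platform and pandas version, so treat the numbers as rough.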

##### ----------------------------------------------------------------------------------------------------------------------------------------------------------------
Small segue since we're talking about null values! There are many, many, many ways to check for null values in your dataset! Here's one approach:

In [481]:
new_df.isnull().values.any()

False

You can also run the following code to see __how many__ values are null in the dataset. In our case, we know the total number is 0


In [482]:
new_df.isnull().sum().sum()

0
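Our dataset has no nulls, so the checks above aren't very exciting. As a sketch of what they look like when data is missing, here's a made-up frame (`toy`, with gaps added on purpose) that counts nulls per column and pulls out the affected rows:

```python
import pandas as pd

# Made-up frame with missing values added on purpose (the real dataset has none)
toy = pd.DataFrame({
    "show": ["Stranger Things", "Black Mirror", None],
    "rating": [8.7, None, None],
})

# A single .sum() gives the number of nulls in each column
nulls_per_column = toy.isnull().sum()
print(nulls_per_column)

# Keep only the rows that contain at least one null
rows_with_nulls = toy[toy.isnull().any(axis=1)]
print(rows_with_nulls)
```

Dropping the second `.sum()` is what turns the grand total into a per-column breakdown.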

##### ----------------------------------------------------------------------------------------------------------------------------------------------------------------

Another useful method (and one of my favorites) is `df.describe()`. `df.describe()` generates descriptive statistics (e.g. total count, mean, standard deviation, min, and max) for all __numerical__ columns in your dataset by default. 

As we saw above, only the `index` and `rating` columns in our dataframe are numerical

In [440]:
new_df.describe()

Unnamed: 0,index,rating
count,80.0,80.0
mean,40.5,7.72
std,23.2379,0.71235
min,1.0,5.9
25%,20.75,7.2
50%,40.5,7.8
75%,60.25,8.225
max,80.0,8.8


It's not particularly helpful for us to learn statistics about the index column- the index just identifies each row in the dataset. 


But we did get some interesting analysis on the ratings across our Netflix shows! For example, the average rating for a show was 7.72. The standard deviation was only ~0.71! This tells us that most ratings sit fairly close to the average rating.


We also learned that the max rating achieved by any show was 8.8
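Since `describe()` skips non-numerical columns by default, it can help to know that the `include` parameter changes this. The sketch below uses `include="all"` on a small stand-in frame (`toy`, typed in by hand from rows shown earlier) to get `unique`, `top`, and `freq` for the text columns as well:

```python
import pandas as pd

# Stand-in frame using a few rows shown earlier (not the full dataset)
toy = pd.DataFrame({
    "show": ["Stranger Things", "13 Reasons Why", "Orange Is the New Black",
             "Black Mirror", "Money Heist"],
    "certificate": ["15", "18", "18", "18", "15"],
    "rating": [8.7, 7.5, 8.0, 8.7, 8.2],
})

# include="all" summarizes every column, not just the numerical ones
summary = toy.describe(include="all")
print(summary)

# For text columns: number of distinct values, most common value, and its count
print(summary.loc[["unique", "top", "freq"], "certificate"])
```

Statistics that don't apply to a column (like `mean` for `show`) show up as `NaN`.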

Let's come back to `df.describe` after we do some more data cleaning (yay...)! This method will become even more useful for us later.

## Data Cleaning

Anubha is a Netflix addict! She'd like to watch the first episode of ALL the Netflix shows in our dataframe. How many minutes will it take her to watch all of these episodes? Let's sum up all of the runtimes in our table to get the answer.



In [441]:
new_df['runtime'].sum()

'60 min60 min59 min60 min60 min60 min50 min60 min60 min56 min50 min25 min60 min60 min60 min45 min60 min60 min60 min60 min55 min55 min30 min30 min45 min50 min60 min60 min45 min60 min49 min60 min60 min30 min30 min25 min26 min60 min60 min50 min30 min30 min50 min60 min58 min60 min60 min30 min23 min30 min60 min30 min15 min30 min60 min25 min63 min60 min30 min45 min30 min50 min60 min27 min30 min60 min35 min60 min51 min52 min45 min30 min22 min46 min60 min30 min30 min34 min30 min45 min'

We've done this before in previous problem sets! What went wrong here? 

Remember that Python only lets us do arithmetic operations on numbers. When we "add" strings, Python joins them end to end instead, which is exactly what happened above. From `new_df.info()`, we learned that only `index` and `rating` are numerical columns.

To triple-check, let's just see which columns in our dataframe can actually be summed.

In [483]:
new_df.sum(numeric_only= True)

index     3240.0
rating     617.6
dtype: float64

Runtime is not one of them. 
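To make the difference concrete, here's a tiny sketch comparing `.sum()` on strings versus numbers. The two Series are made-up stand-ins, and the text one is forced to `object` dtype to match our `runtime` column:

```python
import pandas as pd

# Same three runtimes, once as text and once as numbers
runtimes_as_text = pd.Series(["60 min", "60 min", "59 min"], dtype="object")
runtimes_as_numbers = pd.Series([60, 60, 59])

text_total = runtimes_as_text.sum()       # strings get joined end to end
number_total = runtimes_as_numbers.sum()  # numbers get added

print(text_total)    # 60 min60 min59 min
print(number_total)  # 179
```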





In [460]:
#Let's take another look at what a value in df['runtime'] looks like, just for reference
df['runtime'][0]

'60 min'

To sum runtimes, we'll need to convert each value into a numerical value instead of a string. We want to go from '60 min' to  60.

Let's do this by first removing the word 'min' from each value, and then converting the result from a string to an int! The simplest (but not only) way to do this is by using `.str.replace()` and `.astype()`.


In [446]:
#The following code shows you how to do this operation for all values in our df.runtime column at once.
#It also overwrites the original runtime in new_df with the new runtime column we just created.
new_df['runtime'] = df['runtime'].str.replace(' min', '').astype('int')

In [484]:
#Great! If you run this cell, you'll see that all our runtime values are of type int64!
new_df['runtime']

0     60
1     60
2     59
3     60
4     60
      ..
75    30
76    30
77    34
78    30
79    45
Name: runtime, Length: 80, dtype: int64
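As one example of a different route, here's a sketch that pulls out the digits with `.str.extract()` and converts them with `pd.to_numeric()`. The Series `raw` is a made-up stand-in, and the `'n/a'` entry is a hypothetical messy value (our dataset doesn't contain one) included to show what `errors="coerce"` does:

```python
import pandas as pd

# Made-up stand-in; 'n/a' is a hypothetical messy value, not from our dataset
raw = pd.Series(["60 min", "59 min", "n/a"], dtype="object")

# Grab the first run of digits from each value (no digits -> missing)
digits = raw.str.extract(r"(\d+)", expand=False)

# Convert to numbers; anything that can't be converted becomes NaN
minutes = pd.to_numeric(digits, errors="coerce")

print(minutes)
print(minutes.sum())  # NaN values are skipped by default
```

The tradeoff is that the result is `float64` whenever a value is missing, whereas `.astype('int')` would raise an error on a value like `'n/a'` instead of quietly skipping it.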

Now... how many minutes will it take Anubha to watch the first episode of every show in our Netflix dataset?

In [463]:
new_df['runtime'].sum()

3741

That's a lot of minutes 😬. (This example was to show that you'll have to do some data cleaning/data manipulation with pretty much every dataset you encounter. This is a relatively "easy" dataset- it has a usability rating of 10 on Kaggle. But we still had to do some additional tricks to find the answers we're looking for.) 

Now that we've converted the runtime column into a numerical column, we can do all kinds of other fun analysis. What's the average runtime of Netflix original shows?

In [488]:
new_df['runtime'].mean()

46.7625

What about the max runtime?

In [486]:
new_df['runtime'].max()

63

Also, let's go back to `new_df.describe` and see how the output changes now that runtime is a numerical column! 

In [489]:
new_df.describe()

Unnamed: 0,index,runtime,rating
count,80.0,80.0,80.0
mean,40.5,46.7625,7.72
std,23.2379,14.156667,0.71235
min,1.0,15.0,5.9
25%,20.75,30.0,7.2
50%,40.5,50.0,7.8
75%,60.25,60.0,8.225
max,80.0,63.0,8.8


The mode (or most common value) is not included in `.describe()`. Let's see what the most __common__ rating is for a Netflix Original show.



In [491]:
new_df['rating'].mode()

0    8.2
Name: rating, dtype: float64
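Notice that `.mode()` returns a Series rather than a single number. That's because there can be a tie for the most common value. Here's a sketch with made-up ratings (`ratings` is a toy Series, not our dataset) where two values tie, plus `.value_counts()` to see how often each value appears:

```python
import pandas as pd

# Toy ratings chosen so that two values tie for most common
ratings = pd.Series([8.7, 7.5, 8.0, 8.7, 8.2, 8.2])

# Both tied values come back, sorted in ascending order
modes = ratings.mode()
print(modes.tolist())  # [8.2, 8.7]

# value_counts() shows how many times each rating occurs
counts = ratings.value_counts()
print(counts)
```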

Note that in our previous mean, max, and mode calculations, we didn't actually learn __which__ show is associated with any of those values.

Let's compute the max rating again, but this time, let's find out which show got the max rating.


In [497]:
#Which show(s) had the max rating?
new_df.loc[new_df['rating'] == new_df['rating'].max()]

Unnamed: 0,index,show,certificate,runtime,genre,rating,votes
6,7,Narcos,15,50,"Biography, Crime, Drama",8.8,467909
35,36,BoJack Horseman,18,25,"Animation, Comedy, Drama",8.8,186993


In [499]:
#You could also store the max rating as a variable (optional, but improves readability)
max_rating = new_df['rating'].max()
max_rating 

#This code does the same thing as the code from the cell above
new_df.loc[new_df['rating'] == max_rating]

Unnamed: 0,index,show,certificate,runtime,genre,rating,votes
6,7,Narcos,15,50,"Biography, Crime, Drama",8.8,467909
35,36,BoJack Horseman,18,25,"Animation, Comedy, Drama",8.8,186993
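There are other ways to get at the same answer. As a sketch, `.idxmax()` gives the row label of the first maximum, and `.nlargest()` returns the top rows by a column. The frame `toy` below is a hand-typed stand-in with four shows from our table:

```python
import pandas as pd

# Stand-in frame with four shows from the table (ratings as shown above)
toy = pd.DataFrame({
    "show": ["Stranger Things", "Narcos", "BoJack Horseman", "Money Heist"],
    "rating": [8.7, 8.8, 8.8, 8.2],
})

# idxmax() returns only the FIRST row with the max rating, so ties are missed
first_best = toy.loc[toy["rating"].idxmax(), "show"]
print(first_best)  # Narcos

# nlargest(2, ...) returns the two highest-rated rows
top_two = toy.nlargest(2, "rating")

# keep="all" returns every row tied for the top spot
all_best = toy.nlargest(1, "rating", keep="all")
print(all_best)
```

The boolean filter used in the cells above catches ties automatically, which is why it's a safe default when you want every show with the max rating.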


# Exercises 

### 1. Look up your favorite Netflix original and return its information. If it doesn't pop up in this table, feel free to pick a random show instead.

### 2. What show had the lowest ratings? What was the rating? :( 

### 3. Which Netflix show was the most popular (had the most votes)? What was the number of votes?
Hint: Votes are strings here, and need to be converted to ints! 

Another Hint: Last time, we removed the word "min" from our runtime values before converting to ints. What do we need to remove from the votes column before we can convert to ints?


### 4. Sarah loves the Mystery genre. She's curious which shows in the Mystery genre have the highest ratings. Recommend the top 3 highest rated shows in the Mystery genre to Sarah!



### 5. Leif would like to recommend shows to his 15 year old cousin, but wants to make sure that he recommends shows that are age appropriate. Return a dataframe of all the shows with a Maturity Rating less than or equal to 15.  

Hint: How should we think about handling 'PG' in this case? We could replace 'PG' with 13. 


### 6. Feel free to do your own analysis and answer other questions you are curious about!