# DS3 Kaggle Workshop - Intro to Pandas

Welcome to our Intro to Pandas Jupyter Notebook. With our interactive problems, we hope to guide you in your learning process. Here, you can practice useful pandas functions for DataFrame manipulation and analysis. Have fun!

The dataset we will be using is called “Movies on Netflix, Prime Video, Hulu and Disney+” from Kaggle. Here is the link to the dataset: https://www.kaggle.com/ruchi798/movies-on-netflix-prime-video-hulu-and-disney. For easier access, we have downloaded it into the same repository as this Jupyter Notebook for you, and named it "Movies.csv."

**Note.** The slideshow presentation will be published after the workshop. This will allow you to look back at the material covered and go over concepts that we were not able to get to during the timeframe.

# Introduction

## Importing Libraries

First, let's import pandas and NumPy, two core Python libraries for data analysis.

In [1]:
import pandas as pd
import numpy as np

## Importing Data

The file "Movies.csv" contains data on movies available on various streaming services. Let's read it into Python:

In [2]:
df = pd.read_csv('Movies.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,Directors,Genres,Country,Language,Runtime
0,0,1,Inception,2010,13+,8.8,87%,1,0,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0
1,1,2,The Matrix,1999,18+,8.7,87%,1,0,0,0,0,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0
2,2,3,Avengers: Infinity War,2018,13+,8.5,84%,1,0,0,0,0,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0
3,3,4,Back to the Future,1985,7+,8.5,96%,1,0,0,0,0,Robert Zemeckis,"Adventure,Comedy,Sci-Fi",United States,English,116.0
4,4,5,"The Good, the Bad and the Ugly",1966,18+,8.8,97%,1,0,1,0,0,Sergio Leone,Western,"Italy,Spain,West Germany",Italian,161.0


**Note.** pandas loads data into a DataFrame through a family of pd.read_ functions, one per file type: pd.read_csv, pd.read_excel, pd.read_json, etc.

For example:

movies = pd.read_excel('Movies.xls')
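
As a minimal sketch of how the pd.read_ functions share one pattern, the example below parses the same two rows from CSV text and from JSON text. The in-memory strings stand in for files on disk and are made up for illustration; they are not taken from Movies.csv.

```python
import io
import pandas as pd

# Toy data standing in for files on disk (hypothetical, not from Movies.csv)
csv_text = "Title,Year\nInception,2010\nThe Matrix,1999\n"
json_text = '[{"Title": "Inception", "Year": 2010}, {"Title": "The Matrix", "Year": 1999}]'

# Each reader returns a DataFrame; only the function name changes with the format
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv.shape, from_json.shape)
```

With real files you would pass a path such as 'Movies.csv' instead of the io.StringIO wrapper.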

## Heads or Tails

In [3]:
#feel free to play around with the integer n in df.head(n) or df.tail()
df.tail(3)

Unnamed: 0.1,Unnamed: 0,ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,Directors,Genres,Country,Language,Runtime
16741,16741,16742,Sharks of Lost Island,2013,,5.7,,0,0,0,1,0,Neil Gelinas,Documentary,United States,English,
16742,16742,16743,Man Among Cheetahs,2017,,6.6,,0,0,0,1,0,Richard Slater-Jones,Documentary,United States,English,
16743,16743,16744,In Beaver Valley,1950,,,,0,0,0,1,0,James Algar,"Documentary,Short,Family",United States,English,32.0


## Exploring the Data: DataFrame Structure and Datatypes

Take a look at the structure of your DataFrame using *df.shape*, which gives the number of rows and columns as a tuple. DataFrames often contain several data types, and it helps to know which type each column holds. Here, *df.dtypes* is especially helpful.

In [4]:
df.shape

(16744, 17)

In [5]:
df.dtypes

Unnamed: 0           int64
ID                   int64
Title               object
Year                 int64
Age                 object
IMDb               float64
Rotten Tomatoes     object
Netflix              int64
Hulu                 int64
Prime Video          int64
Disney+              int64
Type                 int64
Directors           object
Genres              object
Country             object
Language            object
Runtime            float64
dtype: object
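
To make the two attributes concrete, here is a small sketch on a toy frame. The frame and its values are invented for illustration; only shape and dtypes come from the text above.

```python
import pandas as pd

# Toy frame (hypothetical values) with one text, one integer, and one float column
toy = pd.DataFrame({
    "Title": ["Inception", "The Matrix", "In Beaver Valley"],
    "Year": [2010, 1999, 1950],
    "IMDb": [8.8, 8.7, None],
})

# shape is an attribute (no parentheses): (number of rows, number of columns)
n_rows, n_cols = toy.shape
print(n_rows, n_cols)        # 3 3

# dtypes lists one type per column; the missing IMDb value forces a float column
print(toy.dtypes)
```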

# Data Selection

## Selecting a Column

There are many different ways to select a column. **METHOD 1** consists of passing the column name in quotes inside the brackets following the dataframe.

The brackets are known as indexing operators in Python, meaning they allow you to quickly access certain parts of the given data structure.

In [6]:
df['Title']

0                             Inception
1                            The Matrix
2                Avengers: Infinity War
3                    Back to the Future
4        The Good, the Bad and the Ugly
                      ...              
16739         The Ghosts of Buxley Hall
16740                    The Poof Point
16741             Sharks of Lost Island
16742                Man Among Cheetahs
16743                  In Beaver Valley
Name: Title, Length: 16744, dtype: object

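**METHOD 2** consists of passing the column name following the dataframe and a period. This only works when the column name is a valid Python identifier, so it cannot reach a column such as 'Rotten Tomatoes' or 'Disney+'.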

In [7]:
df.Title

0                             Inception
1                            The Matrix
2                Avengers: Infinity War
3                    Back to the Future
4        The Good, the Bad and the Ugly
                      ...              
16739         The Ghosts of Buxley Hall
16740                    The Poof Point
16741             Sharks of Lost Island
16742                Man Among Cheetahs
16743                  In Beaver Valley
Name: Title, Length: 16744, dtype: object

**METHOD 3** consists of passing the column name in quotes into the *get()* method.

The get() method returns the item stored under the key passed into it, so passing a column name returns the values within that column.

In [8]:
df.get('Title')

0                             Inception
1                            The Matrix
2                Avengers: Infinity War
3                    Back to the Future
4        The Good, the Bad and the Ugly
                      ...              
16739         The Ghosts of Buxley Hall
16740                    The Poof Point
16741             Sharks of Lost Island
16742                Man Among Cheetahs
16743                  In Beaver Valley
Name: Title, Length: 16744, dtype: object

**Note.** Methods 4 and 5 are not recommended unless you also need specific rows. Right now, we are solely focusing on selecting a particular column.

**METHOD 4** consists of passing the integer position of the column into *iloc*.

**METHOD 5** consists of passing the column name into *loc*.

Both take specifications for the row(s), then the column(s), inside brackets. This means that whatever is in front of the comma specifies the row(s) and whatever is behind the comma specifies the column(s).

In [9]:
df.iloc[:,2]

0                             Inception
1                            The Matrix
2                Avengers: Infinity War
3                    Back to the Future
4        The Good, the Bad and the Ugly
                      ...              
16739         The Ghosts of Buxley Hall
16740                    The Poof Point
16741             Sharks of Lost Island
16742                Man Among Cheetahs
16743                  In Beaver Valley
Name: Title, Length: 16744, dtype: object

In [10]:
df.loc[:,"Title"]

0                             Inception
1                            The Matrix
2                Avengers: Infinity War
3                    Back to the Future
4        The Good, the Bad and the Ugly
                      ...              
16739         The Ghosts of Buxley Hall
16740                    The Poof Point
16741             Sharks of Lost Island
16742                Man Among Cheetahs
16743                  In Beaver Valley
Name: Title, Length: 16744, dtype: object

**What does the colon stand for?**

The colon specifies which slice of the dataframe we want. Since it is used before the comma, it is specifying which row(s) to get. pandas will **generally** give back the rows between the start (inclusive) and end (exclusive). There are special cases with loc, which we will see in a bit.

For example, the 0:5 slice will result in elements from indices 0 (inclusive) to 5 (exclusive).

If no start or end is specified, the slice covers **ALL elements**, from the first through the last.
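
The row-then-column pattern and the colon can be sketched on a toy frame as follows. The frame is hypothetical and much smaller than the movies dataset.

```python
import pandas as pd

# Toy frame (hypothetical) with Title at column position 1
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Title": ["Inception", "The Matrix", "Back to the Future"],
    "Year": [2010, 1999, 1985],
})

# Before the comma: rows. After the comma: columns.
all_titles = toy.iloc[:, 1]       # bare colon = every row; column at position 1
first_two = toy.iloc[0:2, 1]      # rows at positions 0 and 1 (2 is excluded)
by_label = toy.loc[:, "Title"]    # every row; column labeled "Title"

print(first_two.tolist())         # ['Inception', 'The Matrix']
```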

## Selecting Multiple Columns

You can pass a list of column names into the indexing operators (the outermost brackets).

A list in Python is formatted as comma-separated items inside opening and closing brackets.

In [11]:
df[['Title', 'Year']]

Unnamed: 0,Title,Year
0,Inception,2010
1,The Matrix,1999
2,Avengers: Infinity War,2018
3,Back to the Future,1985
4,"The Good, the Bad and the Ugly",1966
...,...,...
16739,The Ghosts of Buxley Hall,1980
16740,The Poof Point,2001
16741,Sharks of Lost Island,2013
16742,Man Among Cheetahs,2017


## Data Structures: Series & Dataframe

You might notice that the result looks different than when we selected just a single column. Now, it looks more like a table, as opposed to a list. This is because we encountered two different data structures -- series and dataframe.

*Series*: a single list with indices; a one-dimensional labeled array

*Dataframe*: a collection of one or more series that share the same index; a two-dimensional labeled data structure
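
A quick way to see the difference is to check the type of each result. This is a small sketch on an invented two-row frame.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({"Title": ["Inception", "The Matrix"], "Year": [2010, 1999]})

one_column = toy["Title"]      # single label -> Series (one-dimensional)
as_frame = toy[["Title"]]      # list of labels -> DataFrame, even with one column

print(type(one_column).__name__, one_column.shape)   # Series (2,)
print(type(as_frame).__name__, as_frame.shape)       # DataFrame (2, 1)
```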

## Selecting a Row

Like selecting a column, there are multiple ways to select a row. **METHOD 1** and **METHOD 2** consist of passing the row index into either the loc or iloc method.

In [12]:
df.loc[0]

Unnamed: 0                                        0
ID                                                1
Title                                     Inception
Year                                           2010
Age                                             13+
IMDb                                            8.8
Rotten Tomatoes                                 87%
Netflix                                           1
Hulu                                              0
Prime Video                                       0
Disney+                                           0
Type                                              0
Directors                         Christopher Nolan
Genres             Action,Adventure,Sci-Fi,Thriller
Country                United States,United Kingdom
Language                    English,Japanese,French
Runtime                                         148
Name: 0, dtype: object

In [13]:
df.iloc[0]

Unnamed: 0                                        0
ID                                                1
Title                                     Inception
Year                                           2010
Age                                             13+
IMDb                                            8.8
Rotten Tomatoes                                 87%
Netflix                                           1
Hulu                                              0
Prime Video                                       0
Disney+                                           0
Type                                              0
Directors                         Christopher Nolan
Genres             Action,Adventure,Sci-Fi,Thriller
Country                United States,United Kingdom
Language                    English,Japanese,French
Runtime                                         148
Name: 0, dtype: object

## loc vs iloc

*loc* → retrieves rows and columns based on the label (or labels) passed

*iloc* → retrieves rows and columns based on the integer position (or positions) passed

**NOTE.** A loc slice is different from a normal Python slice because it includes both the start and the stop. In this dataset the row labels are the default 0, 1, 2, ..., so labels and positions happen to coincide.
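
The difference is easier to see when row labels and row positions do not match. The sketch below uses a toy frame with made-up labels 10, 20, 30, 40; the movies dataset itself keeps the default labels.

```python
import pandas as pd

# Row labels deliberately differ from row positions (hypothetical data)
toy = pd.DataFrame(
    {"Title": ["Inception", "The Matrix", "Back to the Future", "Amy"]},
    index=[10, 20, 30, 40],
)

by_position = toy.iloc[0:2]    # positions 0 and 1; the stop is excluded
by_label = toy.loc[10:30]      # labels 10 through 30; the stop is included

print(list(by_position.index))   # [10, 20]
print(list(by_label.index))      # [10, 20, 30]
```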

**METHOD 3** consists of passing in the slice, but just within the indexing operators. This method is only recommended if you have multiple rows to select, as opposed to one.

This method returns a dataframe as opposed to a series.

0:1 slice means that we want items from index 0 to index 1 (exclusive), which is the same as asking for the item in index 0.

In [14]:
df[0:1]

Unnamed: 0.1,Unnamed: 0,ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,Directors,Genres,Country,Language,Runtime
0,0,1,Inception,2010,13+,8.8,87%,1,0,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0


## Practice Problem: iloc

Using *iloc*, how would you access the **year** Avengers: Infinity War was released (i.e., third row, fourth column)?

**NOTE.** Remember, indexing starts at 0, not 1. Also, iloc takes in **indices**, not strings.

In [15]:
df.iloc[2,3]

2018

## Selecting Multiple Rows

**METHOD 1** for selecting multiple rows is exactly like METHOD 3 for selecting a row. The only difference is that the slice covers more ground! Similarly, it returns a dataframe.

0:5 means that we want items from index 0 to index 5 (exclusive)

In [16]:
#same as df.head()
df[0:5]

Unnamed: 0.1,Unnamed: 0,ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,Directors,Genres,Country,Language,Runtime
0,0,1,Inception,2010,13+,8.8,87%,1,0,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0
1,1,2,The Matrix,1999,18+,8.7,87%,1,0,0,0,0,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0
2,2,3,Avengers: Infinity War,2018,13+,8.5,84%,1,0,0,0,0,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0
3,3,4,Back to the Future,1985,7+,8.5,96%,1,0,0,0,0,Robert Zemeckis,"Adventure,Comedy,Sci-Fi",United States,English,116.0
4,4,5,"The Good, the Bad and the Ugly",1966,18+,8.8,97%,1,0,1,0,0,Sergio Leone,Western,"Italy,Spain,West Germany",Italian,161.0


**METHOD 2** and **METHOD 3** consist of passing the row indices into iloc or loc.

**NOTE.** A loc slice is different from a normal Python slice because it includes both the start and the stop.

With the default row labels, 0:5 for iloc and 0:4 for loc return the same thing.

In [17]:
df.iloc[0:5]

Unnamed: 0.1,Unnamed: 0,ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,Directors,Genres,Country,Language,Runtime
0,0,1,Inception,2010,13+,8.8,87%,1,0,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0
1,1,2,The Matrix,1999,18+,8.7,87%,1,0,0,0,0,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0
2,2,3,Avengers: Infinity War,2018,13+,8.5,84%,1,0,0,0,0,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0
3,3,4,Back to the Future,1985,7+,8.5,96%,1,0,0,0,0,Robert Zemeckis,"Adventure,Comedy,Sci-Fi",United States,English,116.0
4,4,5,"The Good, the Bad and the Ugly",1966,18+,8.8,97%,1,0,1,0,0,Sergio Leone,Western,"Italy,Spain,West Germany",Italian,161.0


In [18]:
df.loc[0:4]

Unnamed: 0.1,Unnamed: 0,ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,Directors,Genres,Country,Language,Runtime
0,0,1,Inception,2010,13+,8.8,87%,1,0,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0
1,1,2,The Matrix,1999,18+,8.7,87%,1,0,0,0,0,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0
2,2,3,Avengers: Infinity War,2018,13+,8.5,84%,1,0,0,0,0,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0
3,3,4,Back to the Future,1985,7+,8.5,96%,1,0,0,0,0,Robert Zemeckis,"Adventure,Comedy,Sci-Fi",United States,English,116.0
4,4,5,"The Good, the Bad and the Ugly",1966,18+,8.8,97%,1,0,1,0,0,Sergio Leone,Western,"Italy,Spain,West Germany",Italian,161.0


## Practice Problem: Selecting Multiple Rows

Using .loc, return the *fifth to tenth (inclusive) rows* of the dataframe. Feel free to also try it with iloc and with plain brackets if time allows.

**NOTE.** Think about how rows correspond with indices. What index is the first row, and so on?

In [19]:
df.loc[4:9]

Unnamed: 0.1,Unnamed: 0,ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,Directors,Genres,Country,Language,Runtime
4,4,5,"The Good, the Bad and the Ugly",1966,18+,8.8,97%,1,0,1,0,0,Sergio Leone,Western,"Italy,Spain,West Germany",Italian,161.0
5,5,6,Spider-Man: Into the Spider-Verse,2018,7+,8.4,97%,1,0,0,0,0,"Bob Persichetti,Peter Ramsey,Rodney Rothman","Animation,Action,Adventure,Family,Sci-Fi",United States,"English,Spanish",117.0
6,6,7,The Pianist,2002,18+,8.5,95%,1,0,1,0,0,Roman Polanski,"Biography,Drama,Music,War","United Kingdom,France,Poland,Germany","English,German,Russian",150.0
7,7,8,Django Unchained,2012,18+,8.4,87%,1,0,0,0,0,Quentin Tarantino,"Drama,Western",United States,"English,German,French,Italian",165.0
8,8,9,Raiders of the Lost Ark,1981,7+,8.4,95%,1,0,0,0,0,Steven Spielberg,"Action,Adventure",United States,"English,German,Hebrew,Spanish,Arabic,Nepali",115.0
9,9,10,Inglourious Basterds,2009,18+,8.3,89%,1,0,0,0,0,Quentin Tarantino,"Adventure,Drama,War","Germany,United States","English,German,French,Italian",153.0


In [20]:
df.iloc[4:10]

Unnamed: 0.1,Unnamed: 0,ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,Directors,Genres,Country,Language,Runtime
4,4,5,"The Good, the Bad and the Ugly",1966,18+,8.8,97%,1,0,1,0,0,Sergio Leone,Western,"Italy,Spain,West Germany",Italian,161.0
5,5,6,Spider-Man: Into the Spider-Verse,2018,7+,8.4,97%,1,0,0,0,0,"Bob Persichetti,Peter Ramsey,Rodney Rothman","Animation,Action,Adventure,Family,Sci-Fi",United States,"English,Spanish",117.0
6,6,7,The Pianist,2002,18+,8.5,95%,1,0,1,0,0,Roman Polanski,"Biography,Drama,Music,War","United Kingdom,France,Poland,Germany","English,German,Russian",150.0
7,7,8,Django Unchained,2012,18+,8.4,87%,1,0,0,0,0,Quentin Tarantino,"Drama,Western",United States,"English,German,French,Italian",165.0
8,8,9,Raiders of the Lost Ark,1981,7+,8.4,95%,1,0,0,0,0,Steven Spielberg,"Action,Adventure",United States,"English,German,Hebrew,Spanish,Arabic,Nepali",115.0
9,9,10,Inglourious Basterds,2009,18+,8.3,89%,1,0,0,0,0,Quentin Tarantino,"Adventure,Drama,War","Germany,United States","English,German,French,Italian",153.0


In [21]:
df[4:10]

Unnamed: 0.1,Unnamed: 0,ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,Directors,Genres,Country,Language,Runtime
4,4,5,"The Good, the Bad and the Ugly",1966,18+,8.8,97%,1,0,1,0,0,Sergio Leone,Western,"Italy,Spain,West Germany",Italian,161.0
5,5,6,Spider-Man: Into the Spider-Verse,2018,7+,8.4,97%,1,0,0,0,0,"Bob Persichetti,Peter Ramsey,Rodney Rothman","Animation,Action,Adventure,Family,Sci-Fi",United States,"English,Spanish",117.0
6,6,7,The Pianist,2002,18+,8.5,95%,1,0,1,0,0,Roman Polanski,"Biography,Drama,Music,War","United Kingdom,France,Poland,Germany","English,German,Russian",150.0
7,7,8,Django Unchained,2012,18+,8.4,87%,1,0,0,0,0,Quentin Tarantino,"Drama,Western",United States,"English,German,French,Italian",165.0
8,8,9,Raiders of the Lost Ark,1981,7+,8.4,95%,1,0,0,0,0,Steven Spielberg,"Action,Adventure",United States,"English,German,Hebrew,Spanish,Arabic,Nepali",115.0
9,9,10,Inglourious Basterds,2009,18+,8.3,89%,1,0,0,0,0,Quentin Tarantino,"Adventure,Drama,War","Germany,United States","English,German,French,Italian",153.0


# Data Cleaning and Preparation

- *Data Cleaning*: correcting/removing erroneous values, filling in missing values
- Data cleaning and preparation are often estimated to take around 80% of a data scientist’s time.
- Garbage in → Garbage out (GIGO): Your analysis can only be as good as the data going into it

## Data Cleaning Part 1: Dropping Columns
- Try to find which columns are redundant or irrelevant to your analysis.
- Notice that there are initially 17 columns and 16744 rows. 

In [22]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,ID,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Type,Directors,Genres,Country,Language,Runtime
0,0,1,Inception,2010,13+,8.8,87%,1,0,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0
1,1,2,The Matrix,1999,18+,8.7,87%,1,0,0,0,0,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0
2,2,3,Avengers: Infinity War,2018,13+,8.5,84%,1,0,0,0,0,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0


In [23]:
df.shape

(16744, 17)

- The 'Unnamed: 0', 'ID', and 'Type' columns do not seem to hold any valuable information, so we can drop them using **.drop()**!
- Note the new dimensions! What changed?
- Some parameters: 
  - 'axis' parameter: tells whether to drop rows or columns. The default for **.drop()** is 0, which drops rows. We must use axis = 1 since we want to drop columns.
  - 'inplace' parameter (all lowercase): tells whether to do an operation in place or to return a copy. Doing an operation in place (inplace = True) changes the dataset, which you do not always want. However, since we want to permanently drop these columns, we will use inplace = True here!
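
Here is a minimal sketch of how axis and inplace behave, using an invented two-row frame. Note that pandas spells the parameter inplace, all lowercase.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({"ID": [1, 2], "Title": ["Inception", "The Matrix"], "Type": [0, 0]})

# Default inplace=False: drop returns a new frame and leaves toy untouched
trimmed = toy.drop(["ID", "Type"], axis=1)
print(toy.shape, trimmed.shape)     # (2, 3) (2, 1)

# columns= is an equivalent spelling of axis=1
same = toy.drop(columns=["ID", "Type"])

# inplace=True modifies toy itself and returns None
result = toy.drop(["ID", "Type"], axis=1, inplace=True)
print(result, toy.shape)            # None (2, 1)
```

Because an in-place drop returns None, writing df = df.drop(..., inplace=True) would replace the dataframe with None.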

In [24]:
df.drop(['Unnamed: 0', 'ID', 'Type'], axis = 1, inplace = True)
df.head(3)

Unnamed: 0,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime
0,Inception,2010,13+,8.8,87%,1,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0
1,The Matrix,1999,18+,8.7,87%,1,0,0,0,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0
2,Avengers: Infinity War,2018,13+,8.5,84%,1,0,0,0,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0


In [25]:
df.shape

(16744, 14)

## Data Cleaning Part 2: Dealing with Null Values
- Datasets are usually imperfect: they contain many NaN values, and it is our job as data scientists to fix that!
- We will need to use the handy **.dropna()** function. 
- Let's first use the **.dropna()** function with no parameters.

In [26]:
df.dropna().head(3)

Unnamed: 0,Title,Year,Age,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime
0,Inception,2010,13+,8.8,87%,1,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0
1,The Matrix,1999,18+,8.7,87%,1,0,0,0,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0
2,Avengers: Infinity War,2018,13+,8.5,84%,1,0,0,0,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0


In [27]:
df.dropna().shape

(3301, 14)

- This line dropped every row that had at least 1 NaN value. Out of the 16744 total rows in the dataset, only 3301 rows had no missing values. 
- 13443 (16744 - 3301) rows, around 80% of the dataset, were dropped in this line!
- This is NOT a good example of data cleaning.
- Good thing we kept the default inplace = False :)

- A better approach is to use the parameters to specify what you want to drop.
- 'how' Parameter: Dictates whether a row/column is dropped if ALL of its values are NaN or ANY of its values are NaN. The default is to drop a row/column if ANY are NaN.
- As seen by the resulting dimensions in this following line of code, there were no rows with ALL NaN values.
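
The effect of the 'how' parameter can be sketched on a toy frame with one complete row, one partly empty row, and one fully empty row. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 is missing one value, row 2 is missing everything
toy = pd.DataFrame({
    "IMDb": [8.8, np.nan, np.nan],
    "Runtime": [148.0, 32.0, np.nan],
})

print(toy.dropna().shape)             # how='any' is the default -> (1, 2)
print(toy.dropna(how="all").shape)    # only the fully empty row goes -> (2, 2)
```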

In [28]:
df.dropna(how = 'all', inplace = True)
df.shape

(16744, 14)

### Example: Create a distribution of movies comparing the IMDb ratings of movies across the major streaming platforms. 
- We will be dropping rows (axis = 0 by default) that have a missing value in the 'IMDb' column, which we name with the 'subset' parameter. We rely on the default how = 'any' because we want any row with an empty value in our specified column to be purged. With a single column in the subset, 'any' and 'all' behave the same, but they will differ in the following example!

In [29]:
df.dropna(subset = ['IMDb']).shape

(16173, 14)

### Dropping Rows in Multiple Columns:  Create a distribution of the average rating (mean of the IMDb and Rotten Tomatoes ratings) of Netflix movies.

In [30]:
df.dropna(subset = ['IMDb', 'Rotten Tomatoes']).shape

(5156, 14)
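
As a small sketch of what 'subset' does, the toy frame below has a missing value in a different column on each of two rows. The ratings are invented; only the titles echo the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical ratings)
toy = pd.DataFrame({
    "Title": ["Inception", "In Beaver Valley", "Man Among Cheetahs"],
    "IMDb": [8.8, np.nan, 6.6],
    "Rotten Tomatoes": ["87%", "90%", np.nan],
})

# Only the IMDb column is checked, so the row missing Rotten Tomatoes survives
one = toy.dropna(subset=["IMDb"])

# Both columns are checked; a row is dropped if either value is missing
both = toy.dropna(subset=["IMDb", "Rotten Tomatoes"])

print(one["Title"].tolist())    # ['Inception', 'Man Among Cheetahs']
print(both["Title"].tolist())   # ['Inception']
```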

## Data Cleaning Part 3: Renaming Columns
### Method 1: Use the method **.rename()**
- Better option for changing only a few column names
- Useful if you want to use a function on all column names, like str.lower.
- Use inplace = True if you want to permanently change the dataset
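
Passing a function instead of a dictionary applies it to every column name. A minimal sketch with str.lower on an invented one-row frame:

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({"Title": ["Inception"], "IMDb": [8.8]})

# The function is called once per column name; toy itself is left unchanged
lowered = toy.rename(columns=str.lower)

print(list(lowered.columns))    # ['title', 'imdb']
print(list(toy.columns))        # ['Title', 'IMDb']
```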

In [31]:
m1 = df.rename(columns = {'Age':'Age Rating', 'Runtime':'Runtime (min)'})
m1.head(3)

Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
0,Inception,2010,13+,8.8,87%,1,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0
1,The Matrix,1999,18+,8.7,87%,1,0,0,0,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0
2,Avengers: Infinity War,2018,13+,8.5,84%,1,0,0,0,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0


### Method 2: Use the attribute **.columns**
- Better for changing several column names
- Order is crucial!!!

In [62]:
df.columns

Index(['Title', 'Year', 'Age', 'IMDb', 'Rotten Tomatoes', 'Netflix', 'Hulu',
       'Prime Video', 'Disney+', 'Directors', 'Genres', 'Country', 'Language',
       'Runtime'],
      dtype='object')

In [33]:
df.columns = ['Title', 'Year', 'Age Rating', 'IMDb', 'Rotten Tomatoes', 'Netflix', 'Hulu',
       'Prime Video', 'Disney+', 'Directors', 'Genres', 'Country', 'Language',
       'Runtime (min)']
df.head(3)

Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
0,Inception,2010,13+,8.8,87%,1,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0
1,The Matrix,1999,18+,8.7,87%,1,0,0,0,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0
2,Avengers: Infinity War,2018,13+,8.5,84%,1,0,0,0,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0


## Data Cleaning Part 4: Data Formatting
- Data must be formatted in the correct way or your analysis may be filled with errors!
- Let's take the example of the Rotten Tomatoes values. They are written like "__%", which is a string value.
  - Intuitively speaking, 100 > 99. However, strings are compared character by character, so ‘100%’ < ‘99%’ (the first characters decide: ‘1’ comes before ‘9’)
  - If we look at the worst movies, or sort from lowest rating to highest, the movies with a Rotten Tomatoes of 100% would come first.
  - If we look at the best movies, or sort from highest rating to lowest, the movies with a Rotten Tomatoes of 99% would come first.
  - We must fix this if we want to sort by Rotten Tomatoes later!

- Procedure: Convert every value in the column to a numeric value by stripping the ‘%’ and casting the values
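
The sorting problem and the fix can be sketched on three made-up ratings. The sketch uses str.rstrip, a variant of strip that only removes the trailing '%'.

```python
import pandas as pd

# Plain Python shows the same behavior: '1' sorts before '9'
print("100%" < "99%")                      # True

# Toy ratings (hypothetical values)
ratings = pd.Series(["99%", "100%", "87%"])

# Sorted as text, 100% lands first
print(ratings.sort_values().tolist())      # ['100%', '87%', '99%']

# Strip the percent sign, convert to numbers, then sort again
numeric = pd.to_numeric(ratings.str.rstrip("%"))
print(numeric.sort_values().tolist())      # [87, 99, 100]
```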

In [34]:
df['Rotten Tomatoes'] = df['Rotten Tomatoes'].str.strip('%')
df['Rotten Tomatoes'] = pd.to_numeric(df['Rotten Tomatoes'])

In [35]:
df.head(3)

Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
0,Inception,2010,13+,8.8,87.0,1,0,0,0,Christopher Nolan,"Action,Adventure,Sci-Fi,Thriller","United States,United Kingdom","English,Japanese,French",148.0
1,The Matrix,1999,18+,8.7,87.0,1,0,0,0,"Lana Wachowski,Lilly Wachowski","Action,Sci-Fi",United States,English,136.0
2,Avengers: Infinity War,2018,13+,8.5,84.0,1,0,0,0,"Anthony Russo,Joe Russo","Action,Adventure,Sci-Fi",United States,English,149.0


# Conditionals
- Great tool for filtering your dataframe to include only pertinent information
- Python comparison operators ( ==, !=, <, <=, >, >= ) for each comparison 
- Python bitwise operators ( & for AND, | for OR) to combine comparisons
- Wrap each comparison in parentheses when combining, because & and | bind more tightly than the comparison operators
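
As a minimal sketch of combining comparisons, the toy frame below is filtered once with & and once with |. The titles and numbers are placeholders. Each comparison sits in its own parentheses because & and | bind more tightly than operators such as >= and <.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Year": [1999, 2010, 2018, 1985],
    "IMDb": [8.7, 8.8, 8.5, 8.5],
})

# AND: both conditions must hold for a row to be kept
recent_and_high = toy[(toy["Year"] >= 2000) & (toy["IMDb"] > 8.6)]

# OR: a row is kept if at least one condition holds
old_or_high = toy[(toy["Year"] < 1990) | (toy["IMDb"] > 8.6)]

print(recent_and_high["Title"].tolist())   # ['B']
print(old_or_high["Title"].tolist())       # ['A', 'B', 'D']
```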

### Example: Filter the movies dataset to include only the Disney+ movies.

In [64]:
# Boolean Series
df['Disney+'] == 1

0        False
1        False
2        False
3        False
4        False
         ...  
16739     True
16740     True
16741     True
16742     True
16743     True
Name: Disney+, Length: 16744, dtype: bool

In [37]:
# Dataframe
disney = df[df['Disney+'] == 1]
disney.head()

Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
95,Saving Mr. Banks,2013,13+,7.5,79.0,1,0,0,1,John Lee Hancock,"Biography,Comedy,Drama","United States,United Kingdom,Australia",English,125.0
103,Amy,2015,18+,7.8,95.0,1,0,1,1,,Drama,United States,English,60.0
122,Bolt,2008,7+,6.8,89.0,1,0,0,1,"Byron Howard,Chris Williams","Animation,Adventure,Comedy,Drama,Family",United States,English,96.0
125,The Princess and the Frog,2009,all,7.1,85.0,1,0,0,1,"Ron Clements,John Musker","Animation,Adventure,Comedy,Family,Fantasy,Musi...",United States,"English,French",97.0
150,Miracle,2004,7+,7.5,81.0,1,0,0,1,Gavin O'Connor,"Biography,Drama,History,Sport","Canada,United States",English,135.0


In [38]:
disney.shape

(564, 14)

- String Operations:
  - Searching for a substring? Just use the Series function **.str.contains()**

### Example: Find all the movies that were directed by Steven Spielberg and were Family movies or Action movies.
- NOTE: we need to use .str.contains() in this case because a movie may have multiple directors or genres
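
Two optional arguments of .str.contains() are worth knowing when a column has missing values or the search text contains special characters. The sketch below uses an invented three-row frame; na and regex are real parameters of .str.contains(), while the data is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical) with one missing director
toy = pd.DataFrame({
    "Directors": ["Steven Spielberg", np.nan, "Anthony Russo,Joe Russo"],
    "Age": ["7+", "18+", "all"],
})

# na=False treats a missing value as "no match" instead of propagating NaN
is_spielberg = toy["Directors"].str.contains("Steven Spielberg", na=False)
print(is_spielberg.tolist())    # [True, False, False]

# regex=False matches the text literally; by default '+' is a regex quantifier
adults = toy["Age"].str.contains("18+", regex=False)
print(adults.tolist())          # [False, True, False]
```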

In [78]:
# Multiple Conditionals: Intermediary variables help you keep track of your steps!
director = df['Directors'].str.contains('Steven Spielberg')
family_genre = df['Genres'].str.contains('Family')
action_genre = df['Genres'].str.contains('Action')
df[director & (family_genre | action_genre)]

Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
8,Raiders of the Lost Ark,1981,7+,8.4,95.0,1,0,0,0,Steven Spielberg,"Action,Adventure",United States,"English,German,Hebrew,Spanish,Arabic,Nepali",115.0
16,Indiana Jones and the Last Crusade,1989,13+,8.2,88.0,1,0,0,0,Steven Spielberg,"Action,Adventure",United States,"English,German,Greek,Arabic",127.0
36,Minority Report,2002,13+,7.6,90.0,1,0,0,0,Steven Spielberg,"Action,Crime,Mystery,Sci-Fi,Thriller",United States,"English,Swedish",145.0
44,Indiana Jones and the Temple of Doom,1984,7+,7.6,85.0,1,0,0,0,Steven Spielberg,"Action,Adventure",United States,"English,Sinhalese,Hindi",118.0
121,The Adventures of Tintin,2011,7+,7.3,74.0,1,0,0,0,Steven Spielberg,"Animation,Action,Adventure,Family,Mystery","United States,New Zealand,United Kingdom",English,107.0
186,War Horse,2011,13+,7.2,74.0,1,0,0,0,Steven Spielberg,"Action,Adventure,Drama,History,War","United States,India","English,German",146.0
219,Indiana Jones and the Kingdom of the Crystal S...,2008,13+,6.1,78.0,1,0,0,0,Steven Spielberg,"Action,Adventure",United States,"English,German,Russian",122.0
16331,The BFG,2016,7+,6.4,75.0,0,0,0,1,Steven Spielberg,"Adventure,Family,Fantasy","United States,India,United Kingdom",English,117.0


In [40]:
# Condensed into 1 line: Same result but faster to write and usually harder to debug
df[(df['Directors'].str.contains('Steven Spielberg')) &
   ((df['Genres'].str.contains('Family')) | (df['Genres'].str.contains('Action')))]

Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
8,Raiders of the Lost Ark,1981,7+,8.4,95.0,1,0,0,0,Steven Spielberg,"Action,Adventure",United States,"English,German,Hebrew,Spanish,Arabic,Nepali",115.0
16,Indiana Jones and the Last Crusade,1989,13+,8.2,88.0,1,0,0,0,Steven Spielberg,"Action,Adventure",United States,"English,German,Greek,Arabic",127.0
36,Minority Report,2002,13+,7.6,90.0,1,0,0,0,Steven Spielberg,"Action,Crime,Mystery,Sci-Fi,Thriller",United States,"English,Swedish",145.0
44,Indiana Jones and the Temple of Doom,1984,7+,7.6,85.0,1,0,0,0,Steven Spielberg,"Action,Adventure",United States,"English,Sinhalese,Hindi",118.0
121,The Adventures of Tintin,2011,7+,7.3,74.0,1,0,0,0,Steven Spielberg,"Animation,Action,Adventure,Family,Mystery","United States,New Zealand,United Kingdom",English,107.0
186,War Horse,2011,13+,7.2,74.0,1,0,0,0,Steven Spielberg,"Action,Adventure,Drama,History,War","United States,India","English,German",146.0
219,Indiana Jones and the Kingdom of the Crystal S...,2008,13+,6.1,78.0,1,0,0,0,Steven Spielberg,"Action,Adventure",United States,"English,German,Russian",122.0
16331,The BFG,2016,7+,6.4,75.0,0,0,0,1,Steven Spielberg,"Adventure,Family,Fantasy","United States,India,United Kingdom",English,117.0


## Practice Problem: Conditionals
You’re holding a club social virtually through Netflix Party next Friday, so the movie must be available on Netflix. As all of your members are adults, you choose to watch a movie that is rated ‘16+’ or ‘18+’. (This dataset doesn’t use MPAA ratings like PG-13 or R.) You also want to watch a Comedy movie that came out sometime between 2000 and 2016 (inclusive). Also, you don’t really have high expectations for this movie: anything rated above a 5.5 on IMDb is perfect.

Create a “social” dataframe that has all the possible movies according to your criteria!


In [41]:
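# regex = False matches '16+' and '18+' literally; by default '+' is treated as a regex quantifier
social = df[(df['Netflix'] == 1)
    & ((df['Age Rating'].str.contains('16+', regex = False)) | (df['Age Rating'].str.contains('18+', regex = False)))
    & (df['Genres'].str.contains('Comedy')) & ((df['Year'] >= 2000) & (df['Year'] <= 2016))
    & (df['IMDb'] > 5.5)]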
social.head(3)

Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
26,Silver Linings Playbook,2012,18+,7.7,92.0,1,0,0,0,David O. Russell,"Comedy,Drama,Romance",United States,English,122.0
73,About Time,2013,18+,7.8,68.0,1,0,0,0,Richard Curtis,"Comedy,Drama,Fantasy,Romance,Sci-Fi",United Kingdom,English,123.0
74,Kung Fu Hustle,2004,18+,7.7,90.0,1,0,0,0,Stephen Chow,"Action,Comedy,Fantasy","Hong Kong,China,United States","Cantonese,Mandarin",99.0


In [42]:
nfx_pty = df['Netflix'] == 1
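# regex = False matches '16+' and '18+' literally; by default '+' is treated as a regex quantifier
ages = (df['Age Rating'].str.contains('16+', regex = False)) | (df['Age Rating'].str.contains('18+', regex = False))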
genre = df['Genres'].str.contains('Comedy')
year_range = (df['Year'] >= 2000) & (df['Year'] <= 2016)
rating = df['IMDb'] > 5.5
social = df[nfx_pty & ages & genre & year_range & rating]
social.head(3)

Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
26,Silver Linings Playbook,2012,18+,7.7,92.0,1,0,0,0,David O. Russell,"Comedy,Drama,Romance",United States,English,122.0
73,About Time,2013,18+,7.8,68.0,1,0,0,0,Richard Curtis,"Comedy,Drama,Fantasy,Romance,Sci-Fi",United Kingdom,English,123.0
74,Kung Fu Hustle,2004,18+,7.7,90.0,1,0,0,0,Stephen Chow,"Action,Comedy,Fantasy","Hong Kong,China,United States","Cantonese,Mandarin",99.0


In [43]:
social.shape[0]

108
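
The age filter in this problem searches for the text '16+'. By default, str.contains treats its pattern as a regular expression, where + means "one or more of the previous character", so '16+' only matches the rating here by coincidence. The sketch below uses a made-up toy Series (not the movie data) to show the difference between the default behavior, a literal match with regex = False, and an exact comparison with isin.

```python
import pandas as pd

# Hypothetical toy column standing in for df['Age Rating']
ages = pd.Series(['16+', '166', '18+', 'all', None])

# Default (regex=True): '16+' means "a 1 followed by one or more 6s"
as_regex = ages.str.contains('16+', na=False)

# regex=False: look for the literal three characters "16+"
as_literal = ages.str.contains('16+', regex=False, na=False)

# isin: the whole value must equal one of the listed ratings
exact = ages.isin(['16+', '18+'])

print(as_regex.tolist())    # [True, True, False, False, False]
print(as_literal.tolist())  # [True, False, False, False, False]
print(exact.tolist())       # [True, False, True, False, False]
```

The na = False argument makes missing ratings count as "no match" instead of producing NaN in the mask.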

## Another Practice Problem!
You are babysitting your little sibling, who refuses to fall asleep until you watch a movie with him. You only have Netflix and Prime Video and want to watch critically acclaimed movies (IMDb of at least 8). You would prefer movies that were released in 2000 or later. You must show a movie suitable for “all” ages unless you want to be grounded. Since you have homework to do, the movie should be at most 1 ½ hours (90 minutes) long.

Create a “babysitting” dataframe that has all the possible movies according to your criteria!


In [44]:
babysitting = df[((df['Netflix'] == 1) | (df['Prime Video'] == 1)) & 
   (df['IMDb'] >= 8) & (df['Year'] >= 2000) & (df['Age Rating'] == 'all') & 
    (df['Runtime (min)'] <= 90)]
babysitting.head(3)

Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
536,I Am Kalam,2010,all,8.0,80.0,1,0,1,0,Nila Madhab Panda,"Comedy,Drama,Family",India,Hindi,88.0
6987,If so,2003,all,8.1,,0,0,1,0,Rajiv Whabi,Mystery,United Arab Emirates,English,80.0
8543,Fetish,2010,all,8.5,,0,0,1,0,Soopum Sohn,Thriller,United States,"English,Korean",87.0


In [45]:
platform = (df['Netflix'] == 1) | (df['Prime Video'] == 1)
rating = df['IMDb'] >= 8
year = df['Year'] >= 2000
age = df['Age Rating'] == 'all'
# Named 'runtime' rather than 'time' to avoid clashing with Python's built-in time module
runtime = df['Runtime (min)'] <= 90
babysitting = df[platform & rating & year & age & runtime]
babysitting.head(3)

Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
536,I Am Kalam,2010,all,8.0,80.0,1,0,1,0,Nila Madhab Panda,"Comedy,Drama,Family",India,Hindi,88.0
6987,If so,2003,all,8.1,,0,0,1,0,Rajiv Whabi,Mystery,United Arab Emirates,English,80.0
8543,Fetish,2010,all,8.5,,0,0,1,0,Soopum Sohn,Thriller,United States,"English,Korean",87.0


In [46]:
babysitting.shape[0]

7
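
Range checks and "on any of these platforms" checks like the ones in the two practice problems can also be written a little more compactly. The following is only a sketch on a made-up toy dataframe (the titles and values are placeholders, not rows from Movies.csv): Series.between covers an inclusive range, and comparing several columns at once followed by .any(axis=1) replaces a chain of | conditions.

```python
import pandas as pd

# Hypothetical toy data that reuses the notebook's column names
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Year': [1999, 2000, 2016, 2017],
    'Netflix': [1, 0, 0, 1],
    'Prime Video': [0, 1, 0, 0],
})

# between() includes both endpoints by default: 2000 <= Year <= 2016
in_range = toy['Year'].between(2000, 2016)

# True where at least one of the listed platform columns equals 1
on_platform = (toy[['Netflix', 'Prime Video']] == 1).any(axis=1)

picked = toy[in_range & on_platform]
print(picked['Title'].tolist())  # ['B']
```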

## Sorting a Single Column
- We can use the **.sort_values()** method!
- 'ascending' Parameter: determines whether to sort the dataframe in ascending or descending order. The default is ascending = True.


### Example: Let’s sort the movies according to Rotten Tomatoes.

In [47]:
# Rotten Movies
df.sort_values('Rotten Tomatoes', ascending = True).head(3)

Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
8133,Kickin' It Old Skool,2007,13+,4.6,2.0,0,0,1,0,Harvey Glazer,Comedy,"United States,Canada",English,108.0
6938,Strange Wilderness,2008,18+,5.3,2.0,0,0,1,0,Fred Wolf,"Adventure,Comedy",United States,English,87.0
4208,Getaway,2013,13+,4.4,2.0,0,1,0,0,Sam Peckinpah,"Action,Crime,Thriller",United States,"English,Spanish",123.0


In [48]:
# Certified Fresh!
df.sort_values(by = "Rotten Tomatoes", ascending = False).head(3)

Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
481,Dance Academy: The Movie,2017,,7.0,100.0,1,0,0,0,Jeffrey Walker,Drama,"Germany,Australia",English,101.0
828,National Bird,2016,,7.1,100.0,1,0,0,0,Sonia Kennebeck,Documentary,United States,"English,Dari",92.0
6507,The Shelter,2015,,3.6,100.0,0,0,1,0,María Lidón,"Drama,Sci-Fi",Spain,English,95.0
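
Neither result above starts with movies whose Rotten Tomatoes score is missing. That is because sort_values places NaN values at the end by default, whichever direction you sort in. The sketch below shows this on a made-up three-row dataframe, together with the na_position parameter that moves missing values to the front.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing score
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C'],
    'Rotten Tomatoes': [90.0, np.nan, 75.0],
})

# Default na_position='last': the NaN row ends up at the bottom
desc = toy.sort_values('Rotten Tomatoes', ascending=False)

# na_position='first': the NaN row moves to the top
nan_first = toy.sort_values('Rotten Tomatoes', ascending=False, na_position='first')

print(desc['Title'].tolist())       # ['A', 'C', 'B']
print(nan_first['Title'].tolist())  # ['B', 'A', 'C']
```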


## Sorting Multiple Columns

- Simply pass a list to the 'by' parameter!
- Order in the list matters! The dataframe is sorted by the first column in the list, and the second column is only used to break ties between rows that have the same value in the first column.
- Note how the difference in order makes a vast difference in the dataframe you get.

In [49]:
df.sort_values(by = ["Rotten Tomatoes", 'IMDb'], ascending = False).head(3)

Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
4473,Stop Making Sense,1984,,8.6,100.0,0,0,1,0,Jonathan Demme,"Documentary,Music",United States,English,88.0
4662,Tom Petty and the Heartbreakers: Runnin' Down ...,2007,,8.6,100.0,0,0,1,0,Peter Bogdanovich,"Documentary,Music",United States,English,239.0
4591,Mahanati,2018,7+,8.5,100.0,0,0,1,0,Nag Ashwin,"Biography,Drama",India,"Telugu,Tamil",177.0


In [50]:
df.sort_values(by = ['IMDb', 'Rotten Tomatoes'], ascending = False).head(3)

Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
1292,My Next Guest with David Letterman and Shah Ru...,2019,,9.3,,1,0,0,0,,Talk-Show,,,61.0
5110,Love on a Leash,2011,,9.3,,0,0,1,0,Fen Tian,"Comedy,Drama,Fantasy,Romance",United States,,90.0
6566,Square One,2019,,9.3,,0,0,1,0,Danny Wu,"Documentary,Drama,Music",United States,English,83.0


- What if we want to sort columns in different directions? (Ex: finding critically acclaimed classics)
- Pass a list to the 'ascending' parameter!

In [51]:
df.sort_values(by = ['Rotten Tomatoes', 'Year'], ascending = [False, True]).head(3)

Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
4467,A Trip to the Moon,1902,all,8.2,100.0,0,0,1,0,Georges Méliès,"Short,Action,Adventure,Comedy,Fantasy,Sci-Fi",France,"None,French",13.0
4470,The Cabinet of Dr. Caligari,1920,7+,8.1,100.0,0,0,1,0,Robert Wiene,"Fantasy,Horror,Mystery,Thriller",Germany,German,76.0
4844,The Golem: How He Came into the World,1920,,7.2,100.0,0,0,1,0,"Carl Boese,Paul Wegener","Fantasy,Horror",Germany,,76.0
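
With thousands of rows it can be hard to see the tie-breaking at work, so here is a small sketch on a made-up four-row dataframe. The first column is sorted from highest to lowest, and only the rows tied at 100 are then ordered by year from oldest to newest.

```python
import pandas as pd

# Hypothetical toy data: three movies tie at 100 on Rotten Tomatoes
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Rotten Tomatoes': [100, 100, 95, 100],
    'Year': [2005, 1950, 1940, 1980],
})

# One direction per column: score descending, then year ascending for ties
classics = toy.sort_values(by=['Rotten Tomatoes', 'Year'], ascending=[False, True])

print(classics['Title'].tolist())  # ['B', 'D', 'A', 'C']
```

Movie C is the oldest, but it still comes last because its score is lower; Year only matters among the rows tied at 100.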


## Practice Problem: Sorting

There are still too many movies to choose from for your social. Since all 108 of the movies fit your criteria from before, you now decide to find the movies with the best Rotten Tomatoes ratings and IMDb ratings, in that order. 

Create a “social_sorted” dataframe that has the top movies from the “social” dataframe. Then use the previous strategies for indexing and slicing to find the top 10 movies which your club can vote on!

Hint: The indices will not be in numerical order after you sort according to rating, so look for a method to reset the indices!


In [52]:
# Step 1: First sort the values accordingly!
social_sorted = social.sort_values(by = ['Rotten Tomatoes', 'IMDb'], ascending = False)
social_sorted.head(10)

Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
141,Bill Burr: I'm Sorry You Feel That Way,2014,18+,8.4,100.0,1,0,0,0,Jay Karas,Comedy,United States,English,80.0
1014,Jim Gaffigan: Obsessed,2014,16+,7.6,100.0,1,0,0,0,Jay Chapman,"Documentary,Comedy",United States,English,60.0
521,Melvin Goes to Dinner,2003,18+,6.8,100.0,1,0,0,0,Bob Odenkirk,"Comedy,Drama,Romance",United States,English,83.0
884,Hannibal Buress: Comedy Camisado,2016,18+,6.6,100.0,1,0,0,0,Lance Bangs,Comedy,United States,English,83.0
473,Aśoka,2001,18+,6.5,100.0,1,0,0,0,,Comedy,United States,English,30.0
332,Sour Grapes,2016,16+,7.3,96.0,1,0,0,0,Larry David,Comedy,United States,English,91.0
79,The Edge of Seventeen,2016,18+,7.3,94.0,1,0,0,0,Kelly Fremon Craig,"Comedy,Drama","United States,China",English,104.0
244,The Death of Mr. Lazarescu,2005,18+,7.9,93.0,1,0,0,0,Cristi Puiu,"Comedy,Drama",Romania,Romanian,153.0
26,Silver Linings Playbook,2012,18+,7.7,92.0,1,0,0,0,David O. Russell,"Comedy,Drama,Romance",United States,English,122.0
134,Frances Ha,2013,18+,7.5,92.0,1,0,0,0,Noah Baumbach,"Comedy,Drama,Romance",United States,"French,English",86.0


In [53]:
# Step 2: Try slicing!
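# This raises an error: the sorted dataframe still carries its original row labels,
# which are now out of order, so a label slice from 0 to 9 cannot be resolved.
# The line is left commented out so the notebook keeps running.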
#social_sorted.loc[0:9, 'Title']

In [54]:
# Step 3: Reset the index for slicing!
social_sorted.reset_index(drop = True, inplace = True)
social_sorted.head(3)

Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
0,Bill Burr: I'm Sorry You Feel That Way,2014,18+,8.4,100.0,1,0,0,0,Jay Karas,Comedy,United States,English,80.0
1,Jim Gaffigan: Obsessed,2014,16+,7.6,100.0,1,0,0,0,Jay Chapman,"Documentary,Comedy",United States,English,60.0
2,Melvin Goes to Dinner,2003,18+,6.8,100.0,1,0,0,0,Bob Odenkirk,"Comedy,Drama,Romance",United States,English,83.0


In [55]:
# Step 4: Now try slicing!
social_sorted.loc[0:9, 'Title']

0    Bill Burr: I'm Sorry You Feel That Way
1                    Jim Gaffigan: Obsessed
2                     Melvin Goes to Dinner
3          Hannibal Buress: Comedy Camisado
4                                     Aśoka
5                               Sour Grapes
6                     The Edge of Seventeen
7                The Death of Mr. Lazarescu
8                   Silver Linings Playbook
9                                Frances Ha
Name: Title, dtype: object

In [56]:
social_sorted.iloc[0:10, 0]

0    Bill Burr: I'm Sorry You Feel That Way
1                    Jim Gaffigan: Obsessed
2                     Melvin Goes to Dinner
3          Hannibal Buress: Comedy Camisado
4                                     Aśoka
5                               Sour Grapes
6                     The Edge of Seventeen
7                The Death of Mr. Lazarescu
8                   Silver Linings Playbook
9                                Frances Ha
Name: Title, dtype: object

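- Note the difference in indexing between loc[] and iloc[] in the previous example: loc[] selects by label and includes the end label (0:9 gives 10 rows), while iloc[] selects by integer position and excludes the end position (0:10 gives the same 10 rows). Likewise, loc[] takes the column name 'Title', while iloc[] takes its position, 0.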
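
To see why the index needed resetting before using loc[], here is a small sketch on a made-up dataframe. After sorting, the row labels travel with their rows, so slicing by label and slicing by position give different answers until the index is reset.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'Title': ['A', 'B', 'C', 'D'], 'IMDb': [6.0, 9.0, 7.0, 8.0]})

# Sorting keeps the original labels: the index is now [1, 3, 2, 0]
ranked = toy.sort_values('IMDb', ascending=False)

# iloc works by position: the first two rows, column 0
by_position = ranked.iloc[0:2, 0]

# loc works by label: every row from label 1 through label 2, inclusive
by_label = ranked.loc[1:2, 'Title']

# After resetting, labels and positions line up again
ranked_reset = ranked.reset_index(drop=True)
top_two = ranked_reset.loc[0:1, 'Title']

print(by_position.tolist())  # ['B', 'D']
print(by_label.tolist())     # ['B', 'D', 'C']
print(top_two.tolist())      # ['B', 'D']
```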

## Exporting Data
- You can export your Pandas dataframe after your analysis!
- Several options: .to_csv(), .to_excel(), .to_html(), .to_json(), .to_string()
  - The full list is in the pandas documentation!
- Export location: a plain filename is saved to your current working directory; pass a full path to save the file somewhere else.
- Scenario: Your officer team loves the data analysis you did to find a good movie! Let’s send our teammates a CSV file with the movies so they can make an easy decision next time. 

In [57]:
social_sorted.to_csv('Club_Movie_Social.csv')

In [58]:
social_sorted.to_csv('Club_Movie_Social_wo_Index.csv', index = False)

- Exporting data with the index leaves an extra column when you open it in Excel or read it into a new dataframe. With pd.read_csv, that column shows up as "Unnamed: 0".

In [59]:
df1 = pd.read_csv("Club_Movie_Social.csv")
df1.head()

Unnamed: 0.1,Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
0,0,Bill Burr: I'm Sorry You Feel That Way,2014,18+,8.4,100.0,1,0,0,0,Jay Karas,Comedy,United States,English,80.0
1,1,Jim Gaffigan: Obsessed,2014,16+,7.6,100.0,1,0,0,0,Jay Chapman,"Documentary,Comedy",United States,English,60.0
2,2,Melvin Goes to Dinner,2003,18+,6.8,100.0,1,0,0,0,Bob Odenkirk,"Comedy,Drama,Romance",United States,English,83.0
3,3,Hannibal Buress: Comedy Camisado,2016,18+,6.6,100.0,1,0,0,0,Lance Bangs,Comedy,United States,English,83.0
4,4,Aśoka,2001,18+,6.5,100.0,1,0,0,0,,Comedy,United States,English,30.0


In [60]:
df2 = pd.read_csv('Club_Movie_Social_wo_Index.csv')
df2.head()

Unnamed: 0,Title,Year,Age Rating,IMDb,Rotten Tomatoes,Netflix,Hulu,Prime Video,Disney+,Directors,Genres,Country,Language,Runtime (min)
0,Bill Burr: I'm Sorry You Feel That Way,2014,18+,8.4,100.0,1,0,0,0,Jay Karas,Comedy,United States,English,80.0
1,Jim Gaffigan: Obsessed,2014,16+,7.6,100.0,1,0,0,0,Jay Chapman,"Documentary,Comedy",United States,English,60.0
2,Melvin Goes to Dinner,2003,18+,6.8,100.0,1,0,0,0,Bob Odenkirk,"Comedy,Drama,Romance",United States,English,83.0
3,Hannibal Buress: Comedy Camisado,2016,18+,6.6,100.0,1,0,0,0,Lance Bangs,Comedy,United States,English,83.0
4,Aśoka,2001,18+,6.5,100.0,1,0,0,0,,Comedy,United States,English,30.0
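
The same round trip can be tried without creating any files. In this sketch on a made-up dataframe, to_csv is called without a filename so that it returns the CSV text, and io.StringIO lets pd.read_csv read that text back. It also shows index_col = 0, which is one way to read a file that was exported with its index without ending up with the extra column.

```python
import io
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'Title': ['A', 'B'], 'IMDb': [8.4, 7.6]})

# With no path, to_csv returns the CSV content as a string
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)

# Reading back: the exported index becomes an 'Unnamed: 0' column
back_extra = pd.read_csv(io.StringIO(with_index))
back_clean = pd.read_csv(io.StringIO(without_index))

# index_col=0 tells read_csv to use the first column as the index instead
back_as_index = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(back_extra.columns))     # ['Unnamed: 0', 'Title', 'IMDb']
print(list(back_clean.columns))     # ['Title', 'IMDb']
print(list(back_as_index.columns))  # ['Title', 'IMDb']
```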


# Thanks for joining us and good luck with all your endeavors in Data Science!