# UCL AI Society Machine Learning Tutorials
### Session 01. Introduction to Numpy, Pandas, Matplotlib

### Contents
1. Numpy
2. Pandas
3. Matplotlib
4. EDA (Exploratory Data Analysis)

### Aim
At the end of this session, you will be able to:
- Understand the basics of numpy.
- Understand the basics of pandas.
- Understand the basics of matplotlib.
- Perform an Exploratory Data Analysis (EDA).


## 2. Pandas
Pandas is another essential open-source Python library, widely used today by data scientists and analysts. It was created by Wes McKinney and is built on top of numpy. The name 'Pandas' derives from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.

### 2.1 Basics of Pandas

In [1]:
# run this cell if you haven't installed the pandas library
! pip install pandas



In [2]:
import pandas as pd
import numpy as np



In [3]:
print(pd.__version__)

0.24.2


The main data structures of pandas are **Series** and **DataFrames**, where data are stored and manipulated. A `Series` is a one-dimensional labelled array, such as a single column, and a `DataFrame` is a two-dimensional table that consists of a collection of Series sharing the same index.

In [4]:
a = pd.Series([1, 2, 3, np.nan, 5, 6])

In [5]:
print(a)

0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    6.0
dtype: float64


**Let's see how they are different!**

In [22]:
# TODO: Make Series using pandas Series()
# This is another way to initialise a pandas series
module_score_dic = {'Database': 90, 'Security': 70, 'Math': 100, 'Machine Learning': 80}
module_score = pd.Series(module_score_dic)
print("Module_score: \n", module_score, '\n')
print("type: \n", type(module_score), '\n')

# TODO: Make DataFrame using pandas DataFrame()
dataframe = pd.DataFrame(module_score, columns=['score'])
# dataframe = pd.DataFrame(module_score, index=[x for x in module_score.keys()], columns=['score'])
print("dataframe: \n", dataframe, '\n')
print("type: \n", type(dataframe))
dataframe.dtypes

Module_score: 
 Database             90
Security             70
Math                100
Machine Learning     80
dtype: int64 

type: 
 <class 'pandas.core.series.Series'> 

dataframe: 
                   score
Database             90
Security             70
Math                100
Machine Learning     80 

type: 
 <class 'pandas.core.frame.DataFrame'>


score    int64
dtype: object
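
The relationship between the two structures can be sketched in a few lines. This is a minimal illustration with a shortened version of the module scores; the variable names are chosen here for the example only.

```python
import pandas as pd

# A Series is a one-dimensional labelled array.
scores = pd.Series({'Database': 90, 'Security': 70}, name='score')

# to_frame() wraps the Series in a one-column DataFrame.
frame = scores.to_frame()

# Selecting a single column from a DataFrame gives back a Series.
column = frame['score']

print(type(frame).__name__)
print(type(column).__name__)
print(frame.shape)
```

Going from a Series to a DataFrame adds a column axis; selecting one column removes it again.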

A Series can also be thought of as a DataFrame that has only one attribute (column).  
**Now let's make a DataFrame that has multiple attributes.**

In [18]:
solar_data = {
    'Name' : ["Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"],
    'Satellite' : [0, 0, 1, 2, 79, 60, 27, 14],
    'AU' : [0.4, 0.7, 1, 1.5, 5.2, 9.5, 19.2, 30.1],
    'Diameter (in 1Kkm)' : [4.9, 12.1, 12.7, 6.8, 139.8, 116.5, 50.7, 49.2]
}

In [19]:
solar_system = pd.DataFrame(solar_data, index = [i for i in range(1, 9)])
solar_system

Unnamed: 0,Name,Satellite,AU,Diameter (in 1Kkm)
1,Mercury,0,0.4,4.9
2,Venus,0,0.7,12.1
3,Earth,1,1.0,12.7
4,Mars,2,1.5,6.8
5,Jupiter,79,5.2,139.8
6,Saturn,60,9.5,116.5
7,Uranus,27,19.2,50.7
8,Neptune,14,30.1,49.2


In [23]:
solar_system.dtypes # check data type

Name                   object
Satellite               int64
AU                    float64
Diameter (in 1Kkm)    float64
dtype: object

We can choose what to read from the DataFrame:
- `head()` : Returns the first few rows
- `tail()` : Returns the last few rows
- `index` : Returns the index (row labels)
- `columns` : Returns the column labels
- `loc` : Returns the row with the given index label
- `values` : Returns only the values, as a numpy array
- `describe()` : Outputs summary statistics of the numeric columns
- `sort_values(by, axis = 0, ascending = True, inplace = False)` : Sorts the DataFrame by the given column(s)
- `drop()` : Drops the selected rows or columns

In [24]:
df = solar_system

In [25]:
df.head() # returns the first 5 rows by default

Unnamed: 0,Name,Satellite,AU,Diameter (in 1Kkm)
1,Mercury,0,0.4,4.9
2,Venus,0,0.7,12.1
3,Earth,1,1.0,12.7
4,Mars,2,1.5,6.8
5,Jupiter,79,5.2,139.8


In [26]:
df.tail(2)

Unnamed: 0,Name,Satellite,AU,Diameter (in 1Kkm)
7,Uranus,27,19.2,50.7
8,Neptune,14,30.1,49.2


In [40]:
df.index

Int64Index([1, 2, 3, 4, 5, 6, 7, 8], dtype='int64')

**Beware! The index labels start at 1, so the element at position 0 of `df.index` is 1.**

In [39]:
df.index[0]

1

In [28]:
df.columns

Index(['Name', 'Satellite', 'AU', 'Diameter (in 1Kkm)'], dtype='object')

In [30]:
# df.loc[0] raises a KeyError: .loc looks rows up by index label, not by position as list indexing does.
# Pass one of the labels found in df.index instead.
df.loc[3]

Name                  Earth
Satellite                 1
AU                        1
Diameter (in 1Kkm)     12.7
Name: 3, dtype: object
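
The difference between label-based and position-based lookup can be sketched as follows. This is a minimal example on a hypothetical three-row table whose index starts at 1, mirroring `solar_system`; `.iloc` is the position-based counterpart of `.loc`.

```python
import pandas as pd

planets = pd.DataFrame(
    {'Name': ['Mercury', 'Venus', 'Earth']},
    index=[1, 2, 3],  # labels start at 1, as in solar_system
)

by_label = planets.loc[3]      # row whose index label is 3
by_position = planets.iloc[2]  # third row, counting positions from 0

print(by_label['Name'], by_position['Name'])

# Label 0 does not exist in the index, so .loc raises KeyError.
try:
    planets.loc[0]
except KeyError:
    print("label 0 is not in the index")
```

Both lookups return the same row here because the label 3 happens to sit at position 2.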

In [31]:
df.values

array([['Mercury', 0, 0.4, 4.9],
       ['Venus', 0, 0.7, 12.1],
       ['Earth', 1, 1.0, 12.7],
       ['Mars', 2, 1.5, 6.8],
       ['Jupiter', 79, 5.2, 139.8],
       ['Saturn', 60, 9.5, 116.5],
       ['Uranus', 27, 19.2, 50.7],
       ['Neptune', 14, 30.1, 49.2]], dtype=object)

In [32]:
df.describe()

Unnamed: 0,Satellite,AU,Diameter (in 1Kkm)
count,8.0,8.0,8.0
mean,22.875,8.45,49.0875
std,30.670775,10.853702,52.38417
min,0.0,0.4,4.9
25%,0.75,0.925,10.775
50%,8.0,3.35,30.95
75%,35.25,11.925,67.15
max,79.0,30.1,139.8


In [35]:
df.sort_values(by = 'Diameter (in 1Kkm)', ascending = False)

Unnamed: 0,Name,Satellite,AU,Diameter (in 1Kkm)
5,Jupiter,79,5.2,139.8
6,Saturn,60,9.5,116.5
7,Uranus,27,19.2,50.7
8,Neptune,14,30.1,49.2
3,Earth,1,1.0,12.7
2,Venus,0,0.7,12.1
4,Mars,2,1.5,6.8
1,Mercury,0,0.4,4.9


In [36]:
# To Do: re-sort the DataFrame by the number of satellites, in decreasing order.
df.sort_values(by = 'Satellite', ascending = False)

Unnamed: 0,Name,Satellite,AU,Diameter (in 1Kkm)
5,Jupiter,79,5.2,139.8
6,Saturn,60,9.5,116.5
7,Uranus,27,19.2,50.7
8,Neptune,14,30.1,49.2
4,Mars,2,1.5,6.8
3,Earth,1,1.0,12.7
1,Mercury,0,0.4,4.9
2,Venus,0,0.7,12.1


Before 2006, Pluto was classified as a planet of the solar system. Let's add Pluto into our DataFrame.

In [37]:
df.loc[9] = ["Pluto", 0, 39.5, 2.38]
df

Unnamed: 0,Name,Satellite,AU,Diameter (in 1Kkm)
1,Mercury,0,0.4,4.9
2,Venus,0,0.7,12.1
3,Earth,1,1.0,12.7
4,Mars,2,1.5,6.8
5,Jupiter,79,5.2,139.8
6,Saturn,60,9.5,116.5
7,Uranus,27,19.2,50.7
8,Neptune,14,30.1,49.2
9,Pluto,0,39.5,2.38


Let's reclassify Pluto as a dwarf planet by dropping its row.

In [43]:
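# Beware again! Remember that the element at position 0 of df.index is 1.
# To drop Pluto, either pass its index label, df.drop(index=9), or look the label up by position, df.drop(df.index[8]).
# drop() returns a new DataFrame; df itself is unchanged unless you reassign the result.
# df.drop(9)
df.drop(df.index[8])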

Unnamed: 0,Name,Satellite,AU,Diameter (in 1Kkm)
1,Mercury,0,0.4,4.9
2,Venus,0,0.7,12.1
3,Earth,1,1.0,12.7
4,Mars,2,1.5,6.8
5,Jupiter,79,5.2,139.8
6,Saturn,60,9.5,116.5
7,Uranus,27,19.2,50.7
8,Neptune,14,30.1,49.2
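
By default `drop()` leaves the original DataFrame untouched and returns a modified copy. A minimal sketch of this behaviour, assuming a hypothetical three-row table:

```python
import pandas as pd

planets = pd.DataFrame(
    {'Name': ['Uranus', 'Neptune', 'Pluto']},
    index=[7, 8, 9],
)

without_pluto = planets.drop(index=9)  # returns a new DataFrame
print(len(planets), len(without_pluto))

# Reassign the result (or pass inplace=True) to make the change stick.
planets = planets.drop(index=9)
print(list(planets['Name']))
```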


### 2.2 Read Data via Pandas
Pandas supports reading and writing data in various file formats, including CSV, JSON and SQL. Loaded data are converted to a DataFrame.
1. `pd.read_csv()` : Reads a CSV file
2. `pd.read_json()` : Reads a JSON file
3. `pd.read_sql_query()` : Reads the result of an SQL query run against a database
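
The three readers can be tried without any files on disk. The sketch below is an illustration under assumptions: the tiny data sets, the table name `movie` and the in-memory SQLite database are all made up for the example, and `io.StringIO` stands in for a real file.

```python
import io
import sqlite3

import pandas as pd

# CSV: StringIO stands in for a file on disk.
csv_text = "Title,Year,Rating\nSplit,2016,7.3\nSing,2016,7.2\n"
from_csv = pd.read_csv(io.StringIO(csv_text), index_col="Title")

# JSON: a list of records becomes one row per record.
json_text = '[{"Title": "Split", "Year": 2016}]'
from_json = pd.read_json(io.StringIO(json_text))

# SQL: read_sql_query runs a query against a database connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (title TEXT, year INTEGER)")
conn.execute("INSERT INTO movie VALUES ('Split', 2016)")
from_sql = pd.read_sql_query("SELECT title, year FROM movie", conn)
conn.close()

print(from_csv)
print(from_json)
print(from_sql)
```

Each call returns an ordinary DataFrame, so everything from section 2.1 applies regardless of where the data came from.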

In [61]:
# download data at https://support.spatialkey.com/spatialkey-sample-csv-data/
movie = pd.read_csv("./data/IMDB-Movie-Data.csv", index_col = "Title")   # ---> option 1
# movie = pd.read_csv("./data/IMDB-Movie-Data.csv")    #---> option 2
print(type(movie))
movie

<class 'pandas.core.frame.DataFrame'>


Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
Suicide Squad,5,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
The Great Wall,6,"Action,Adventure,Fantasy",European mercenaries searching for black powde...,Yimou Zhang,"Matt Damon, Tian Jing, Willem Dafoe, Andy Lau",2016,103,6.1,56036,45.13,42.0
La La Land,7,"Comedy,Drama,Music",A jazz pianist falls for an aspiring actress i...,Damien Chazelle,"Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....",2016,128,8.3,258682,151.06,93.0
Mindhorn,8,Comedy,A has-been actor best known for playing the ti...,Sean Foley,"Essie Davis, Andrea Riseborough, Julian Barrat...",2016,89,6.4,2490,,71.0
The Lost City of Z,9,"Action,Adventure,Biography","A true-life drama, centering on British explor...",James Gray,"Charlie Hunnam, Robert Pattinson, Sienna Mille...",2016,141,7.1,7188,8.01,78.0
Passengers,10,"Adventure,Drama,Romance",A spacecraft traveling to a distant colony pla...,Morten Tyldum,"Jennifer Lawrence, Chris Pratt, Michael Sheen,...",2016,116,7.0,192177,100.01,41.0


In [62]:
# To Do: Extract the third row
# To Do: Uncomment option 2 in the cell above, rerun it, and see which of the lines below work and which do not
movie.iloc[2]
# movie.loc["Split"]
# movie.loc[2]

Rank                                                                  3
Genre                                                   Horror,Thriller
Description           Three girls are kidnapped by a man with a diag...
Director                                             M. Night Shyamalan
Actors                James McAvoy, Anya Taylor-Joy, Haley Lu Richar...
Year                                                               2016
Runtime (Minutes)                                                   117
Rating                                                              7.3
Votes                                                            157606
Revenue (Millions)                                               138.12
Metascore                                                            62
Name: Split, dtype: object

In [63]:
# To Do: Sort the table by 'Rating', in descending order
# Do you agree with the rankings? :)
movie.sort_values(by = "Rating", ascending = False)

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
The Dark Knight,55,"Action,Crime,Drama",When the menace known as the Joker wreaks havo...,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,1791916,533.32,82.0
Inception,81,"Action,Adventure,Sci-Fi","A thief, who steals corporate secrets through ...",Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...",2010,148,8.8,1583625,292.57,74.0
Dangal,118,"Action,Biography,Drama",Former wrestler Mahavir Singh Phogat and his t...,Nitesh Tiwari,"Aamir Khan, Sakshi Tanwar, Fatima Sana Shaikh,...",2016,161,8.8,48969,11.15,
Interstellar,37,"Adventure,Drama,Sci-Fi",A team of explorers travel through a wormhole ...,Christopher Nolan,"Matthew McConaughey, Anne Hathaway, Jessica Ch...",2014,169,8.6,1047747,187.99,74.0
Kimi no na wa,97,"Animation,Drama,Fantasy",Two strangers find themselves linked in a biza...,Makoto Shinkai,"Ryûnosuke Kamiki, Mone Kamishiraishi, Ryô Nari...",2016,106,8.6,34110,4.68,79.0
The Intouchables,250,"Biography,Comedy,Drama",After he becomes a quadriplegic from a paragli...,Olivier Nakache,"François Cluzet, Omar Sy, Anne Le Ny, Audrey F...",2011,112,8.6,557965,13.18,57.0
Whiplash,134,"Drama,Music",A promising young drummer enrolls at a cut-thr...,Damien Chazelle,"Miles Teller, J.K. Simmons, Melissa Benoist, P...",2014,107,8.5,477276,13.09,88.0
The Prestige,65,"Drama,Mystery,Sci-Fi",Two stage magicians engage in competitive one-...,Christopher Nolan,"Christian Bale, Hugh Jackman, Scarlett Johanss...",2006,130,8.5,913152,53.08,66.0
The Departed,100,"Crime,Drama,Thriller",An undercover cop and a mole in the police att...,Martin Scorsese,"Leonardo DiCaprio, Matt Damon, Jack Nicholson,...",2006,151,8.5,937414,132.37,85.0
Taare Zameen Par,992,"Drama,Family,Music",An eight-year-old boy is thought to be a lazy ...,Aamir Khan,"Darsheel Safary, Aamir Khan, Tanay Chheda, Sac...",2007,165,8.5,102697,1.20,42.0


In [68]:
# To Do: Sort the table again by 'Revenue (Millions)', in ascending order, and print out the first 3 rows
movie.sort_values(by = "Revenue (Millions)", ascending = True).head(3)

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
A Kind of Murder,232,"Crime,Drama,Thriller","In 1960s New York, Walter Stackhouse is a succ...",Andy Goddard,"Patrick Wilson, Jessica Biel, Haley Bennett, V...",2016,95,5.2,3305,0.0,50.0
Into the Forest,962,"Drama,Sci-Fi,Thriller","After a massive power outage, two sisters lear...",Patricia Rozema,"Ellen Page, Evan Rachel Wood, Max Minghella,Ca...",2015,101,5.9,10220,0.01,59.0
"Love, Rosie",678,"Comedy,Romance",Rosie and Alex have been best friends since th...,Christian Ditter,"Lily Collins, Sam Claflin, Christian Cooke, Ja...",2014,102,7.2,80415,0.01,44.0


In [76]:
# The value_counts() function is used to get a Series containing counts of unique values
movie['Genre'].value_counts().head()

Action,Adventure,Sci-Fi    50
Drama                      48
Comedy,Drama,Romance       35
Comedy                     32
Drama,Romance              31
Name: Genre, dtype: int64
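
Note that `value_counts()` treats each complete string, such as `Action,Adventure,Sci-Fi`, as one category. To count individual genres instead, one possible approach (a sketch on made-up data, not part of the original exercise) is `Series.str.get_dummies()`, which splits on a separator and builds one 0/1 column per label.

```python
import pandas as pd

genre = pd.Series(
    ['Action,Adventure', 'Drama', 'Action,Drama'],
    index=['Movie A', 'Movie B', 'Movie C'],  # hypothetical titles
)

# value_counts() counts each full string as its own category.
print(genre.value_counts())

# get_dummies() splits on the separator; summing the resulting
# 0/1 columns counts how often each single genre appears.
per_genre = genre.str.get_dummies(sep=',').sum()
print(per_genre.sort_values(ascending=False))
```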

In [82]:
# This is called a "masking operation": the comparison builds a boolean Series that selects the matching rows
movie[movie['Runtime (Minutes)'] >= 170].sort_values(by="Rating", ascending=False)

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
3 Idiots,431,"Comedy,Drama",Two friends are searching for their long lost ...,Rajkumar Hirani,"Aamir Khan, Madhavan, Mona Singh, Sharman Joshi",2009,170,8.4,238789,6.52,67.0
The Wolf of Wall Street,83,"Biography,Comedy,Crime","Based on the true story of Jordan Belfort, fro...",Martin Scorsese,"Leonardo DiCaprio, Jonah Hill, Margot Robbie,M...",2013,180,8.2,865134,116.87,75.0
The Hateful Eight,89,"Crime,Drama,Mystery","In the dead of a Wyoming winter, a bounty hunt...",Quentin Tarantino,"Samuel L. Jackson, Kurt Russell, Jennifer Jaso...",2015,187,7.8,341170,54.12,68.0
La vie d'Adèle,312,"Drama,Romance","Adèle's life is changed when she meets Emma, a...",Abdellatif Kechiche,"Léa Seydoux, Adèle Exarchopoulos, Salim Kechio...",2013,180,7.8,103150,2.2,88.0
Grindhouse,829,"Action,Horror,Thriller",Quentin Tarantino and Robert Rodriguez's homag...,Robert Rodriguez,"Kurt Russell, Rose McGowan, Danny Trejo, Zoë Bell",2007,191,7.6,160350,25.03,
Cloud Atlas,268,"Drama,Sci-Fi",An exploration of how the actions of individua...,Tom Tykwer,"Tom Hanks, Halle Berry, Hugh Grant, Hugo Weaving",2012,172,7.5,298651,27.1,55.0
Inland Empire,966,"Drama,Mystery,Thriller",As an actress starts to adopt the persona of h...,David Lynch,"Laura Dern, Jeremy Irons, Justin Theroux, Karo...",2006,180,7.0,44227,,
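
What the mask looks like before it is applied can be sketched on a small table. The data and names below are hypothetical and chosen only for illustration.

```python
import pandas as pd

films = pd.DataFrame(
    {'Runtime': [95, 180, 172], 'Rating': [6.1, 8.2, 7.5]},
    index=['Film A', 'Film B', 'Film C'],  # hypothetical titles
)

mask = films['Runtime'] >= 170  # boolean Series, one entry per row
print(mask)

long_films = films[mask]        # keeps the rows where the mask is True

# Combine conditions with & (and), | (or) and ~ (not). Each condition
# needs its own parentheses because of operator precedence.
long_and_good = films[(films['Runtime'] >= 170) & (films['Rating'] >= 8.0)]
print(long_and_good)
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why the bitwise operators are used.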


In [84]:
# To Do: Extract the movies whose 'Metascore' is at least 95, and list them from the most recent to the least recent
movie[movie['Metascore'] >= 95].sort_values(by="Year", ascending=False)

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Manchester by the Sea,22,Drama,A depressed uncle is asked to take care of his...,Kenneth Lonergan,"Casey Affleck, Michelle Williams, Kyle Chandle...",2016,137,7.9,134213,47.7,96.0
Moonlight,42,Drama,"A chronicle of the childhood, adolescence and ...",Barry Jenkins,"Mahershala Ali, Shariff Earp, Duan Sanderson, ...",2016,111,7.5,135095,27.85,99.0
Carol,502,"Drama,Romance",An aspiring photographer develops an intimate ...,Todd Haynes,"Cate Blanchett, Rooney Mara, Sarah Paulson, Ky...",2015,118,7.2,77995,0.25,95.0
Boyhood,657,Drama,"The life of Mason, from early childhood to his...",Richard Linklater,"Ellar Coltrane, Patricia Arquette, Ethan Hawke...",2014,165,7.9,286722,25.36,100.0
12 Years a Slave,112,"Biography,Drama,History","In the antebellum United States, Solomon North...",Steve McQueen,"Chiwetel Ejiofor, Michael Kenneth Williams, Mi...",2013,134,8.1,486338,56.67,96.0
Gravity,510,"Drama,Sci-Fi,Thriller",Two astronauts work together to survive after ...,Alfonso Cuarón,"Sandra Bullock, George Clooney, Ed Harris, Ort...",2013,91,7.8,622089,274.08,96.0
Zero Dark Thirty,407,"Drama,History,Thriller",A chronicle of the decade-long hunt for al-Qae...,Kathryn Bigelow,"Jessica Chastain, Joel Edgerton, Chris Pratt, ...",2012,157,7.4,226661,95.72,95.0
The Social Network,325,"Biography,Drama",Harvard student Mark Zuckerberg creates the so...,David Fincher,"Jesse Eisenberg, Andrew Garfield, Justin Timbe...",2010,120,7.7,510100,96.92,95.0
Ratatouille,490,"Animation,Comedy,Family",A rat who can cook makes an unusual alliance w...,Brad Bird,"Brad Garrett, Lou Romano, Patton Oswalt,Ian Holm",2007,111,8.0,504039,206.44,96.0
Pan's Labyrinth,231,"Drama,Fantasy,War","In the falangist Spain of 1944, the bookish yo...",Guillermo del Toro,"Ivana Baquero, Ariadna Gil, Sergi López,Maribe...",2006,118,8.2,498879,37.62,98.0


In [85]:
# To Do: Extract the movies directed by one of UCL's alumni
movie[(movie['Director'] == 'Christopher Nolan')]

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Interstellar,37,"Adventure,Drama,Sci-Fi",A team of explorers travel through a wormhole ...,Christopher Nolan,"Matthew McConaughey, Anne Hathaway, Jessica Ch...",2014,169,8.6,1047747,187.99,74.0
The Dark Knight,55,"Action,Crime,Drama",When the menace known as the Joker wreaks havo...,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,1791916,533.32,82.0
The Prestige,65,"Drama,Mystery,Sci-Fi",Two stage magicians engage in competitive one-...,Christopher Nolan,"Christian Bale, Hugh Jackman, Scarlett Johanss...",2006,130,8.5,913152,53.08,66.0
Inception,81,"Action,Adventure,Sci-Fi","A thief, who steals corporate secrets through ...",Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...",2010,148,8.8,1583625,292.57,74.0
The Dark Knight Rises,125,"Action,Thriller",Eight years after the Joker's reign of anarchy...,Christopher Nolan,"Christian Bale, Tom Hardy, Anne Hathaway,Gary ...",2012,164,8.5,1222645,448.13,78.0


#### 2.2.1 Pandas Exercise

To Do: Extract the list of movies that meet the requirements below:
1. Released in or after 2010 (key = 'Year')
2. Runtime is at most 150 minutes (key = 'Runtime (Minutes)')
3. Rating is at least 8.0 (key = 'Rating')

Print out only the first 3 movies.

In [114]:
movie[
    (movie['Year'] >= 2010) & 
    (movie['Runtime (Minutes)'] <= 150) & 
    (movie['Rating'] >= 8.0)
].head(3)

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
La La Land,7,"Comedy,Drama,Music",A jazz pianist falls for an aspiring actress i...,Damien Chazelle,"Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....",2016,128,8.3,258682,151.06,93.0
Hacksaw Ridge,17,"Biography,Drama,History","WWII American Army Medic Desmond T. Doss, who ...",Mel Gibson,"Andrew Garfield, Sam Worthington, Luke Bracey,...",2016,139,8.2,211760,67.12,71.0


### 2.3 How to deal with Missing Data
To represent missing data, pandas uses `np.nan`. Data scientists and machine learning engineers sometimes simply remove missing data. However, the right approach depends heavily on which data are missing, how much is missing, and so on. You can fill the gaps with 0, with the mean value of the column, or with the mean of only the 10 nearest values in the column. It is important to choose deliberately how you are going to deal with missing data.
- `isnull()`: Returns True or False for each cell, depending on whether it is null.
- `sum()`: A trick for counting the number of True values. Once `isnull()` has been applied to the DataFrame, summing each column gives the number of cells with missing data in that column.
- `dropna()`: Drops every row that contains at least one null value.
- `fillna(value)`: Fills missing values with the given value.
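
The strategies above can be compared side by side on a small table. This is a minimal sketch with made-up numbers; filling with the column mean is shown as one option among several, not as a recommendation.

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame({
    'Revenue': [10.0, np.nan, 30.0],
    'Score': [70.0, 80.0, np.nan],
})

print(sales.isnull().sum())               # missing values per column

dropped = sales.dropna()                  # keeps only complete rows
zero_filled = sales.fillna(0)             # every gap becomes 0
mean_filled = sales.fillna(sales.mean())  # each gap gets its column mean

print(mean_filled)
```

None of these calls modifies `sales` itself; each returns a new DataFrame unless `inplace = True` is passed.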

In [115]:
movie.isnull()

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Guardians of the Galaxy,False,False,False,False,False,False,False,False,False,False,False
Prometheus,False,False,False,False,False,False,False,False,False,False,False
Split,False,False,False,False,False,False,False,False,False,False,False
Sing,False,False,False,False,False,False,False,False,False,False,False
Suicide Squad,False,False,False,False,False,False,False,False,False,False,False
The Great Wall,False,False,False,False,False,False,False,False,False,False,False
La La Land,False,False,False,False,False,False,False,False,False,False,False
Mindhorn,False,False,False,False,False,False,False,False,False,True,False
The Lost City of Z,False,False,False,False,False,False,False,False,False,False,False
Passengers,False,False,False,False,False,False,False,False,False,False,False


In [116]:
movie.isnull().sum()

Rank                    0
Genre                   0
Description             0
Director                0
Actors                  0
Year                    0
Runtime (Minutes)       0
Rating                  0
Votes                   0
Revenue (Millions)    128
Metascore              64
dtype: int64

In [117]:
movie.shape

(1000, 11)

In [120]:
# Take a look at "Take Me Home Tonight" and "Search Party"
movie.fillna(value = 0)

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
Suicide Squad,5,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
The Great Wall,6,"Action,Adventure,Fantasy",European mercenaries searching for black powde...,Yimou Zhang,"Matt Damon, Tian Jing, Willem Dafoe, Andy Lau",2016,103,6.1,56036,45.13,42.0
La La Land,7,"Comedy,Drama,Music",A jazz pianist falls for an aspiring actress i...,Damien Chazelle,"Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....",2016,128,8.3,258682,151.06,93.0
Mindhorn,8,Comedy,A has-been actor best known for playing the ti...,Sean Foley,"Essie Davis, Andrea Riseborough, Julian Barrat...",2016,89,6.4,2490,0.00,71.0
The Lost City of Z,9,"Action,Adventure,Biography","A true-life drama, centering on British explor...",James Gray,"Charlie Hunnam, Robert Pattinson, Sienna Mille...",2016,141,7.1,7188,8.01,78.0
Passengers,10,"Adventure,Drama,Romance",A spacecraft traveling to a distant colony pla...,Morten Tyldum,"Jennifer Lawrence, Chris Pratt, Michael Sheen,...",2016,116,7.0,192177,100.01,41.0


In [121]:
movie.dropna(inplace = True)

In [122]:
movie.shape

(838, 11)

After dropping the rows that contain missing data, the shape of the DataFrame has changed from (1000, 11) to (838, 11): 162 rows were removed.

### 2.4 Merging Data
Those of you who know SQL might have felt that pandas is quite similar to a query language.
What is the most common operation in relational database query languages?  
Yes! (terminology alert!) Inner JOIN, Outer JOIN, Left JOIN, Right JOIN, Full JOIN...
- `concat()` : Concatenation. Joins two or more pandas objects along an axis, either stacking rows or placing columns side by side.
- `merge()` : Joins two DataFrames on key columns, in the way of an SQL JOIN.

In [123]:
df1 = pd.DataFrame(np.random.randn(10, 2))
df1

Unnamed: 0,0,1
0,-1.292454,0.994892
1,0.40748,-0.693513
2,1.548593,0.114417
3,-0.362648,0.784441
4,-0.272858,1.829562
5,1.569398,-1.7185
6,-0.357931,-0.056753
7,0.977494,-0.576888
8,-0.510238,0.989765
9,0.993332,2.750075


In [124]:
df2 = pd.DataFrame(np.random.randn(10, 3))
df2

Unnamed: 0,0,1,2
0,0.412625,0.60596,-1.821624
1,-0.186185,0.729974,-0.510493
2,-1.883591,0.250437,1.790063
3,-0.367597,1.338071,-1.112454
4,0.928728,0.839511,-2.002918
5,0.806188,3.427185,1.479313
6,1.392479,-1.748136,-0.655155
7,0.107495,0.419968,-0.155408
8,-0.281937,-1.269672,-1.369113
9,-0.199018,-0.331693,0.584287


In [125]:
pd.concat([df1, df2])

Unnamed: 0,0,1,2
0,-1.292454,0.994892,
1,0.40748,-0.693513,
2,1.548593,0.114417,
3,-0.362648,0.784441,
4,-0.272858,1.829562,
5,1.569398,-1.7185,
6,-0.357931,-0.056753,
7,0.977494,-0.576888,
8,-0.510238,0.989765,
9,0.993332,2.750075,
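
The rows listed above all come from `df1`; with the default `axis = 0` the result should also contain the rows of `df2` underneath, and column 2 is `NaN` wherever `df1` has no value for it. A minimal sketch on two small hypothetical frames makes this easier to see:

```python
import numpy as np
import pandas as pd

top = pd.DataFrame(np.zeros((2, 2)))    # columns 0 and 1
bottom = pd.DataFrame(np.ones((2, 3)))  # columns 0, 1 and 2

stacked = pd.concat([top, bottom])      # axis=0: rows are appended
print(stacked.shape)
print(list(stacked.index))              # original labels are kept

# ignore_index=True renumbers the rows from 0.
renumbered = pd.concat([top, bottom], ignore_index=True)
print(list(renumbered.index))
```

Because the original labels are kept, the stacked result contains duplicate index labels unless `ignore_index=True` is used.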


In [128]:
pd.concat([df1, df2], axis = 1)

Unnamed: 0,0,1,0,1,2
0,-1.292454,0.994892,0.412625,0.60596,-1.821624
1,0.40748,-0.693513,-0.186185,0.729974,-0.510493
2,1.548593,0.114417,-1.883591,0.250437,1.790063
3,-0.362648,0.784441,-0.367597,1.338071,-1.112454
4,-0.272858,1.829562,0.928728,0.839511,-2.002918
5,1.569398,-1.7185,0.806188,3.427185,1.479313
6,-0.357931,-0.056753,1.392479,-1.748136,-0.655155
7,0.977494,-0.576888,0.107495,0.419968,-0.155408
8,-0.510238,0.989765,-0.281937,-1.269672,-1.369113
9,0.993332,2.750075,-0.199018,-0.331693,0.584287


In [129]:
demis = pd.DataFrame(
    {'Modules': ['Bioinformatics', 'Robotic Systems', 'Security', 'Compilers'], 'Demis' : [75, 97, 64, 81]}
)
demis

Unnamed: 0,Modules,Demis
0,Bioinformatics,75
1,Robotic Systems,97
2,Security,64
3,Compilers,81


In [130]:
sedol = pd.DataFrame(
    {'Modules': ['Bioinformatics', 'Robotic Systems', 'Security', 'Compilers'], 'Sedol' : [63, 78, 84, 95]})
sedol

Unnamed: 0,Modules,Sedol
0,Bioinformatics,63
1,Robotic Systems,78
2,Security,84
3,Compilers,95


In [131]:
pd.merge(demis, sedol, on = 'Modules')

Unnamed: 0,Modules,Demis,Sedol
0,Bioinformatics,75,63
1,Robotic Systems,97,78
2,Security,64,84
3,Compilers,81,95
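
In the example above both tables contain exactly the same modules, so every kind of join gives the same result. The sketch below uses a modified, hypothetical pair of tables that only partly overlap, to show how the `how` parameter of `merge()` corresponds to the SQL join types.

```python
import pandas as pd

demis_small = pd.DataFrame({'Modules': ['Security', 'Compilers'], 'Demis': [64, 81]})
sedol_small = pd.DataFrame({'Modules': ['Security', 'Robotics'], 'Sedol': [84, 78]})

# Inner join (the default): only modules present in both tables.
inner = pd.merge(demis_small, sedol_small, on='Modules')

# Outer join: every module from either table, NaN where a score is missing.
outer = pd.merge(demis_small, sedol_small, on='Modules', how='outer')

# Left join: every row of the left table, matched where possible.
left = pd.merge(demis_small, sedol_small, on='Modules', how='left')

print(inner)
print(outer)
print(left)
```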


In [132]:
# To Do: Define your own DataFrames and use the functions introduced above to concatenate them.

### What to do next?
The following websites would be helpful for further study of the pandas library:
- [Pandas official website](https://pandas.pydata.org)
- [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)
- [Data Wrangling with Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)