# UCL AI Society Machine Learning Tutorials
### Session 01. Introduction to Numpy, Pandas and Matplotlib Libraries

### Contents
1. Numpy
2. Pandas
3. Matplotlib
4. EDA (Exploratory Data Analysis)

### Aim
At the end of this session, you will be able to:
- Understand the basics of numpy.
- Understand the basics of pandas.
- Understand the basics of matplotlib.
- Carry out a simple EDA using the above libraries.

## 2. Pandas
Pandas is another essential open-source Python library, widely used today by data scientists and analysts. Wes McKinney built it on top of NumPy to transform and analyze data. According to Wikipedia, its name is derived from "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.

### 2.1 Basics of Pandas

In [1]:
# run this cell if you haven't installed the pandas library
! pip install pandas



In [2]:
import pandas as pd
import numpy as np

In [3]:
print(pd.__version__)

0.24.2


The main data structures of pandas are **Series** and **DataFrames**, in which data are stored and manipulated. A `Series` is a one-dimensional labeled array, essentially a single column, and a `DataFrame` is a two-dimensional table made up of a collection of Series that share an index.

In [4]:
a = pd.Series([1, 2, 3, np.nan, 5, 6])

In [5]:
print(a)

0    1.0
1    2.0
2    3.0
3    NaN
4    5.0
5    6.0
dtype: float64


In [6]:
b = {
    'Name' : ["Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"],
    'Satellite' : [0, 0, 1, 2, 79, 60, 27, 14],
    'AU' : [0.4, 0.7, 1, 1.5, 5.2, 9.5, 19.2, 30.1],
    'Diameter (in 1Kkm)' : [4.9, 12.1, 12.7, 6.8, 139.8, 116.5, 50.7, 49.2]
}

In [7]:
solar_system = pd.DataFrame(b, index = [i for i in range(1, 9)])
solar_system

Unnamed: 0,Name,Satellite,AU,Diameter (in 1Kkm)
1,Mercury,0,0.4,4.9
2,Venus,0,0.7,12.1
3,Earth,1,1.0,12.7
4,Mars,2,1.5,6.8
5,Jupiter,79,5.2,139.8
6,Saturn,60,9.5,116.5
7,Uranus,27,19.2,50.7
8,Neptune,14,30.1,49.2


In [8]:
solar_system.dtypes # check data type

Name                   object
Satellite               int64
AU                    float64
Diameter (in 1Kkm)    float64
dtype: object
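
To make the relationship between the two structures concrete, here is a minimal sketch. The `toy` frame and its values are made up for this illustration and are not part of the tutorial's data. It shows that selecting a single column from a DataFrame gives back a Series that keeps the DataFrame's index.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame(
    {'Name': ['Earth', 'Mars'], 'Satellite': [1, 2]},
    index=[3, 4],
)

column = toy['Satellite']               # selecting one column gives a Series
print(type(column).__name__)            # Series
print(column.index.equals(toy.index))   # True: the column keeps the row labels
```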

We can select what to read from the DataFrame:
- `head()` : Returns the first few rows
- `tail()` : Returns the last few rows
- `index` : Returns the index (row labels)
- `columns` : Returns the column labels
- `loc` : Selects rows (and optionally columns) by label
- `values` : Returns only the values, as a NumPy array
- `describe()` : Outputs summary statistics of the numeric columns
- `sort_values(by, axis = 0, ascending = True, inplace = False)` : Sorts the DataFrame by the given column(s)
- `drop()` : Drops the rows (or columns) with the given labels

In [9]:
df = solar_system

In [10]:
df.head() # returns the first 5 rows by default

Unnamed: 0,Name,Satellite,AU,Diameter (in 1Kkm)
1,Mercury,0,0.4,4.9
2,Venus,0,0.7,12.1
3,Earth,1,1.0,12.7
4,Mars,2,1.5,6.8
5,Jupiter,79,5.2,139.8


In [11]:
df.tail(2)

Unnamed: 0,Name,Satellite,AU,Diameter (in 1Kkm)
7,Uranus,27,19.2,50.7
8,Neptune,14,30.1,49.2


In [12]:
df.index

Int64Index([1, 2, 3, 4, 5, 6, 7, 8], dtype='int64')

In [13]:
df.columns

Index(['Name', 'Satellite', 'AU', 'Diameter (in 1Kkm)'], dtype='object')

In [14]:
df.loc[4]

Name                  Mars
Satellite                2
AU                     1.5
Diameter (in 1Kkm)     6.8
Name: 4, dtype: object

In [15]:
df.values

array([['Mercury', 0, 0.4, 4.9],
       ['Venus', 0, 0.7, 12.1],
       ['Earth', 1, 1.0, 12.7],
       ['Mars', 2, 1.5, 6.8],
       ['Jupiter', 79, 5.2, 139.8],
       ['Saturn', 60, 9.5, 116.5],
       ['Uranus', 27, 19.2, 50.7],
       ['Neptune', 14, 30.1, 49.2]], dtype=object)

In [16]:
df.describe()

Unnamed: 0,Satellite,AU,Diameter (in 1Kkm)
count,8.0,8.0,8.0
mean,22.875,8.45,49.0875
std,30.670775,10.853702,52.38417
min,0.0,0.4,4.9
25%,0.75,0.925,10.775
50%,8.0,3.35,30.95
75%,35.25,11.925,67.15
max,79.0,30.1,139.8


In [17]:
df.sort_values(by = 'Diameter (in 1Kkm)', ascending = False)

Unnamed: 0,Name,Satellite,AU,Diameter (in 1Kkm)
5,Jupiter,79,5.2,139.8
6,Saturn,60,9.5,116.5
7,Uranus,27,19.2,50.7
8,Neptune,14,30.1,49.2
3,Earth,1,1.0,12.7
2,Venus,0,0.7,12.1
4,Mars,2,1.5,6.8
1,Mercury,0,0.4,4.9


In [18]:
# TO DO: re-sort the DataFrame by the number of satellites in decreasing order.
None

Until 2006, Pluto was classified as a planet of the solar system. Let's add Pluto to our DataFrame.

In [19]:
df.loc[9] = ["Pluto", 0, 39.5, 2.38]
df

Unnamed: 0,Name,Satellite,AU,Diameter (in 1Kkm)
1,Mercury,0,0.4,4.9
2,Venus,0,0.7,12.1
3,Earth,1,1.0,12.7
4,Mars,2,1.5,6.8
5,Jupiter,79,5.2,139.8
6,Saturn,60,9.5,116.5
7,Uranus,27,19.2,50.7
8,Neptune,14,30.1,49.2
9,Pluto,0,39.5,2.38


Let's exclude it again, as astronomers now regard it as a dwarf planet.

In [20]:
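# df.index[8] looks up the label stored at position 8 (positions start from 0), which is Pluto's label, 9.
# drop() returns a new DataFrame; df itself is unchanged unless the result is assigned back.
df.drop(df.index[8])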

Unnamed: 0,Name,Satellite,AU,Diameter (in 1Kkm)
1,Mercury,0,0.4,4.9
2,Venus,0,0.7,12.1
3,Earth,1,1.0,12.7
4,Mars,2,1.5,6.8
5,Jupiter,79,5.2,139.8
6,Saturn,60,9.5,116.5
7,Uranus,27,19.2,50.7
8,Neptune,14,30.1,49.2
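
`drop()` takes index labels rather than positions, and it returns a new DataFrame instead of modifying the original. The sketch below checks both points on a small made-up `planets` frame (a hypothetical three-row excerpt, not the `df` used above).

```python
import pandas as pd

# Hypothetical three-row excerpt, used only for this illustration
planets = pd.DataFrame(
    {'Name': ['Uranus', 'Neptune', 'Pluto'], 'Satellite': [27, 14, 0]},
    index=[7, 8, 9],
)

print(planets.index[2])                        # 9: the label stored at position 2

by_position = planets.drop(planets.index[2])   # look the label up by position
by_label = planets.drop(9)                     # or pass the label directly
print(by_position.equals(by_label))            # True

print(len(planets))                            # 3: the original is unchanged
planets = planets.drop(9)                      # assign back to keep the result
print(len(planets))                            # 2
```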


### 2.2 Read Data via Pandas
Pandas supports reading and writing data in various file formats, including CSV, JSON and SQL, by converting them to and from a DataFrame.
1. `pd.read_csv()` : Reads CSV files
2. `pd.read_json()` : Reads JSON files
3. `pd.read_sql_query()` : Reads the result of a SQL query
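
The three readers can be tried without any files on disk. The sketch below is only an illustration: the CSV and JSON text and the in-memory SQLite table are made up, and `io.StringIO` stands in for a real file.

```python
import io
import sqlite3

import pandas as pd

# CSV: a path or a file-like object works; StringIO stands in for a file here
csv_text = "Title,Year,Rating\nSplit,2016,7.3\nSing,2016,7.2\n"
from_csv = pd.read_csv(io.StringIO(csv_text), index_col="Title")

# JSON: a list of records becomes one row per record
json_text = '[{"Title": "Split", "Year": 2016}, {"Title": "Sing", "Year": 2016}]'
from_json = pd.read_json(io.StringIO(json_text))

# SQL: read_sql_query needs a query string and an open database connection
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (Title TEXT, Year INTEGER)")
conn.executemany("INSERT INTO movie VALUES (?, ?)", [("Split", 2016), ("Sing", 2016)])
from_sql = pd.read_sql_query("SELECT * FROM movie", conn)
conn.close()

print(from_csv.shape, from_json.shape, from_sql.shape)
```

Note that `index_col = "Title"` moves the title out of the columns and into the index, which is why `from_csv` has two columns rather than three.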

In [21]:
# download data at https://support.spatialkey.com/spatialkey-sample-csv-data/
movie = pd.read_csv("./data/IMDB-Movie-Data.csv", index_col = "Title")
movie

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
Suicide Squad,5,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
The Great Wall,6,"Action,Adventure,Fantasy",European mercenaries searching for black powde...,Yimou Zhang,"Matt Damon, Tian Jing, Willem Dafoe, Andy Lau",2016,103,6.1,56036,45.13,42.0
La La Land,7,"Comedy,Drama,Music",A jazz pianist falls for an aspiring actress i...,Damien Chazelle,"Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....",2016,128,8.3,258682,151.06,93.0
Mindhorn,8,Comedy,A has-been actor best known for playing the ti...,Sean Foley,"Essie Davis, Andrea Riseborough, Julian Barrat...",2016,89,6.4,2490,,71.0
The Lost City of Z,9,"Action,Adventure,Biography","A true-life drama, centering on British explor...",James Gray,"Charlie Hunnam, Robert Pattinson, Sienna Mille...",2016,141,7.1,7188,8.01,78.0
Passengers,10,"Adventure,Drama,Romance",A spacecraft traveling to a distant colony pla...,Morten Tyldum,"Jennifer Lawrence, Chris Pratt, Michael Sheen,...",2016,116,7.0,192177,100.01,41.0


In [22]:
# To Do: Extract the information of the movie that has the rank of 3
None

In [23]:
# To Do: Re-sort the table by 'Rating', in decreasing order
None

In [24]:
# To Do: Re-sort the table by 'Revenue (Millions)', in increasing order, and print out the first 5.
# None
movie.sort_values(by = "Revenue (Millions)", ascending = True).head()

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
A Kind of Murder,232,"Crime,Drama,Thriller","In 1960s New York, Walter Stackhouse is a succ...",Andy Goddard,"Patrick Wilson, Jessica Biel, Haley Bennett, V...",2016,95,5.2,3305,0.0,50.0
Into the Forest,962,"Drama,Sci-Fi,Thriller","After a massive power outage, two sisters lear...",Patricia Rozema,"Ellen Page, Evan Rachel Wood, Max Minghella,Ca...",2015,101,5.9,10220,0.01,59.0
"Love, Rosie",678,"Comedy,Romance",Rosie and Alex have been best friends since th...,Christian Ditter,"Lily Collins, Sam Claflin, Christian Cooke, Ja...",2014,102,7.2,80415,0.01,44.0
Lovesong,322,Drama,The relationship between two friends deepens d...,So Yong Kim,"Riley Keough, Jena Malone, Jessie Ok Gray, Car...",2016,84,6.4,616,0.01,74.0
Wakefield,69,Drama,A man's nervous breakdown causes him to leave ...,Robin Swicord,"Bryan Cranston, Jennifer Garner, Beverly D'Ang...",2016,106,7.5,291,0.01,61.0


In [25]:
movie['Genre'].value_counts().head(10)

Action,Adventure,Sci-Fi       50
Drama                         48
Comedy,Drama,Romance          35
Comedy                        32
Drama,Romance                 31
Animation,Adventure,Comedy    27
Comedy,Drama                  27
Action,Adventure,Fantasy      27
Comedy,Romance                26
Crime,Drama,Thriller          24
Name: Genre, dtype: int64

In [26]:
movie[movie['Runtime (Minutes)'] <= 120]

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
The Great Wall,6,"Action,Adventure,Fantasy",European mercenaries searching for black powde...,Yimou Zhang,"Matt Damon, Tian Jing, Willem Dafoe, Andy Lau",2016,103,6.1,56036,45.13,42.0
Mindhorn,8,Comedy,A has-been actor best known for playing the ti...,Sean Foley,"Essie Davis, Andrea Riseborough, Julian Barrat...",2016,89,6.4,2490,,71.0
Passengers,10,"Adventure,Drama,Romance",A spacecraft traveling to a distant colony pla...,Morten Tyldum,"Jennifer Lawrence, Chris Pratt, Michael Sheen,...",2016,116,7.0,192177,100.01,41.0
Moana,14,"Animation,Adventure,Comedy","In Ancient Polynesia, when a terrible curse in...",Ron Clements,"Auli'i Cravalho, Dwayne Johnson, Rachel House,...",2016,107,7.7,118151,248.75,81.0
Colossal,15,"Action,Comedy,Drama",Gloria is an out-of-work party girl forced to ...,Nacho Vigalondo,"Anne Hathaway, Jason Sudeikis, Austin Stowell,...",2016,109,6.4,8612,2.87,70.0
The Secret Life of Pets,16,"Animation,Adventure,Comedy",The quiet life of a terrier named Max is upend...,Chris Renaud,"Louis C.K., Eric Stonestreet, Kevin Hart, Lake...",2016,87,6.6,120259,368.31,61.0
Lion,19,"Biography,Drama",A five-year-old Indian boy gets lost on the st...,Garth Davis,"Dev Patel, Nicole Kidman, Rooney Mara, Sunny P...",2016,118,8.1,102061,51.69,69.0
Arrival,20,"Drama,Mystery,Sci-Fi",When twelve mysterious spacecraft appear aroun...,Denis Villeneuve,"Amy Adams, Jeremy Renner, Forest Whitaker,Mich...",2016,116,8.0,340798,100.50,81.0
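
A comparison such as `movie['Runtime (Minutes)'] <= 120` produces a boolean Series, and indexing with it keeps only the rows marked True. Several such conditions can be combined with `&` (and) and `|` (or), provided each condition is wrapped in parentheses. The sketch below uses a made-up `films` frame with hypothetical column names and values, not the IMDB data.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
films = pd.DataFrame(
    {'Runtime': [95, 130, 110, 150], 'Rating': [6.5, 8.2, 8.4, 7.0]},
    index=['A', 'B', 'C', 'D'],
)

short = films['Runtime'] <= 120   # boolean Series, one value per row
good = films['Rating'] > 8.0

both = films[short & good]        # rows where both conditions hold
either = films[short | good]      # rows where at least one condition holds

# The same filter written inline; the parentheses are required
inline = films[(films['Runtime'] <= 120) & (films['Rating'] > 8.0)]

print(list(both.index))           # ['C']
print(list(either.index))         # ['A', 'B', 'C']
```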


In [27]:
# To Do: Extract the movie list whose 'Metascore' is greater than 70.
None

In [28]:
# To Do: Extract the movie list whose 'Director' is 'Christopher Nolan'
None

#### 2.2.1 Pandas Exercise

To Do: Extract the list of movies that meet the requirements below:
1. Released in or after 2010 (key = 'Year')
2. Runtime is shorter than 150 minutes (key = 'Runtime (Minutes)')
3. Rating is above 8.0 (key = 'Rating')

Print out only the first 3 movies.

In [29]:
None

### 2.3 Remove Missing Data
Pandas uses `np.nan` to represent missing data. Missing values can degrade the performance of a data science or machine learning model, so they are commonly removed or filled in before analysis. Pandas provides the following tools for this:
- `isnull()` : Returns True or False for each cell, depending on whether it is null.
- `sum()` : Applied to the result of `isnull()`, counts the number of nulls in each column.
- `dropna()` : Deletes every row that contains at least one null value.
- `fillna(value)` : Fills missing values with the given value.

In [30]:
movie.isnull()

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Guardians of the Galaxy,False,False,False,False,False,False,False,False,False,False,False
Prometheus,False,False,False,False,False,False,False,False,False,False,False
Split,False,False,False,False,False,False,False,False,False,False,False
Sing,False,False,False,False,False,False,False,False,False,False,False
Suicide Squad,False,False,False,False,False,False,False,False,False,False,False
The Great Wall,False,False,False,False,False,False,False,False,False,False,False
La La Land,False,False,False,False,False,False,False,False,False,False,False
Mindhorn,False,False,False,False,False,False,False,False,False,True,False
The Lost City of Z,False,False,False,False,False,False,False,False,False,False,False
Passengers,False,False,False,False,False,False,False,False,False,False,False


In [31]:
movie.isnull().sum()

Rank                    0
Genre                   0
Description             0
Director                0
Actors                  0
Year                    0
Runtime (Minutes)       0
Rating                  0
Votes                   0
Revenue (Millions)    128
Metascore              64
dtype: int64

In [32]:
movie.shape

(1000, 11)

In [33]:
movie.fillna(value = 0) # returns a filled copy; movie itself is unchanged

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
Suicide Squad,5,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
The Great Wall,6,"Action,Adventure,Fantasy",European mercenaries searching for black powde...,Yimou Zhang,"Matt Damon, Tian Jing, Willem Dafoe, Andy Lau",2016,103,6.1,56036,45.13,42.0
La La Land,7,"Comedy,Drama,Music",A jazz pianist falls for an aspiring actress i...,Damien Chazelle,"Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....",2016,128,8.3,258682,151.06,93.0
Mindhorn,8,Comedy,A has-been actor best known for playing the ti...,Sean Foley,"Essie Davis, Andrea Riseborough, Julian Barrat...",2016,89,6.4,2490,0.00,71.0
The Lost City of Z,9,"Action,Adventure,Biography","A true-life drama, centering on British explor...",James Gray,"Charlie Hunnam, Robert Pattinson, Sienna Mille...",2016,141,7.1,7188,8.01,78.0
Passengers,10,"Adventure,Drama,Romance",A spacecraft traveling to a distant colony pla...,Morten Tyldum,"Jennifer Lawrence, Chris Pratt, Michael Sheen,...",2016,116,7.0,192177,100.01,41.0


In [34]:
movie.dropna(inplace = True)

In [35]:
movie.shape

(838, 11)

After removing missing data, the shape of the `movie` DataFrame has changed from (1000, 11) to (838, 11): the 162 rows containing at least one missing value were dropped.
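
Both `fillna()` and `dropna()` return a new DataFrame by default; of the calls above, only the one with `inplace = True` modified `movie`. The sketch below checks this on a made-up `scores` frame (hypothetical values). It also shows two variations that may be useful, assuming they suit your data: filling each column with its own mean instead of 0, and restricting `dropna()` to certain columns with `subset`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for this illustration
scores = pd.DataFrame({
    'Revenue': [10.0, np.nan, 30.0, np.nan],
    'Metascore': [50.0, 60.0, np.nan, 80.0],
})

print(scores.isnull().sum())                       # missing values per column

filled = scores.fillna(scores.mean())              # fill each column with its own mean
only_revenue = scores.dropna(subset=['Revenue'])   # ignore nulls in other columns

print(filled)
print(only_revenue.shape)                          # (2, 2)
print(scores.isnull().sum().sum())                 # 3: the original still has its nulls
```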

### 2.4 Merging Data
- `concat()` : Concatenation. Stacks two or more pandas objects along an axis (rows by default, columns with `axis = 1`).
- `merge()` : Joins two DataFrames on one or more key columns, in the style of a SQL join.

In [36]:
df1 = pd.DataFrame(np.random.randn(10, 2))
df1

Unnamed: 0,0,1
0,0.896173,0.716901
1,-1.997141,0.17874
2,-0.886698,1.481461
3,0.173028,-0.493156
4,0.8387,0.147142
5,-1.2771,0.377638
6,1.498519,-0.45308
7,-1.627298,-0.1072
8,0.149254,0.216347
9,0.468412,0.805645


In [37]:
df2 = pd.DataFrame(np.random.randn(10, 3))
df2

Unnamed: 0,0,1,2
0,-1.382738,0.224039,1.716445
1,-1.009707,-1.754388,-0.271483
2,0.120054,0.266942,-0.316821
3,-1.292154,-0.608201,-0.439765
4,1.514542,-0.550726,1.660942
5,-1.092151,-1.566807,0.203487
6,1.210065,0.305923,-0.318262
7,-1.072489,-1.402229,-1.447015
8,0.210973,-0.329682,-1.75891
9,0.355878,0.205786,-1.367619


In [38]:
pd.concat([df1, df2])

Unnamed: 0,0,1,2
0,0.896173,0.716901,
1,-1.997141,0.17874,
2,-0.886698,1.481461,
3,0.173028,-0.493156,
4,0.8387,0.147142,
5,-1.2771,0.377638,
6,1.498519,-0.45308,
7,-1.627298,-0.1072,
8,0.149254,0.216347,
9,0.468412,0.805645,
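
When frames are stacked row-wise, each keeps its own index labels, so labels can repeat, and any column missing from one frame is filled with NaN. The sketch below illustrates this with two small made-up frames (`top` and `bottom` are hypothetical names) and shows `ignore_index = True` as one way to renumber the rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames, used only for this illustration
top = pd.DataFrame(np.ones((2, 2)))       # columns 0 and 1
bottom = pd.DataFrame(np.zeros((2, 3)))   # columns 0, 1 and 2

stacked = pd.concat([top, bottom])
print(stacked.shape)               # (4, 3)
print(list(stacked.index))         # [0, 1, 0, 1]: labels repeat
print(stacked[2].isnull().sum())   # 2: `top` has no column 2

renumbered = pd.concat([top, bottom], ignore_index=True)
print(list(renumbered.index))      # [0, 1, 2, 3]
```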


In [39]:
pd.concat([df1, df2], axis = 1)

Unnamed: 0,0,1,0,1,2
0,0.896173,0.716901,-1.382738,0.224039,1.716445
1,-1.997141,0.17874,-1.009707,-1.754388,-0.271483
2,-0.886698,1.481461,0.120054,0.266942,-0.316821
3,0.173028,-0.493156,-1.292154,-0.608201,-0.439765
4,0.8387,0.147142,1.514542,-0.550726,1.660942
5,-1.2771,0.377638,-1.092151,-1.566807,0.203487
6,1.498519,-0.45308,1.210065,0.305923,-0.318262
7,-1.627298,-0.1072,-1.072489,-1.402229,-1.447015
8,0.149254,0.216347,0.210973,-0.329682,-1.75891
9,0.468412,0.805645,0.355878,0.205786,-1.367619


In [40]:
demis = pd.DataFrame(
    {'Modules': ['Bioinformatics', 'Robotic Systems', 'Security', 'Compilers'], 'Demis' : [75, 97, 64, 81]}
)
demis

Unnamed: 0,Modules,Demis
0,Bioinformatics,75
1,Robotic Systems,97
2,Security,64
3,Compilers,81


In [41]:
sedol = pd.DataFrame(
    {'Modules': ['Bioinformatics', 'Robotic Systems', 'Security', 'Compilers'], 'Sedol' : [63, 78, 84, 95]})
sedol

Unnamed: 0,Modules,Sedol
0,Bioinformatics,63
1,Robotic Systems,78
2,Security,84
3,Compilers,95


In [42]:
pd.merge(demis, sedol, on = 'Modules')

Unnamed: 0,Modules,Demis,Sedol
0,Bioinformatics,75,63
1,Robotic Systems,97,78
2,Security,64,84
3,Compilers,81,95
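
In the example above, both frames contain exactly the same modules. When the keys only partly overlap, the `how` argument of `merge()` decides which rows survive, much like the join types in SQL. The sketch below uses made-up `left` and `right` frames (hypothetical names and marks) to compare the default inner join with an outer join.

```python
import pandas as pd

# Hypothetical toy frames, used only for this illustration
left = pd.DataFrame({'Modules': ['Security', 'Compilers'], 'Demis': [64, 81]})
right = pd.DataFrame({'Modules': ['Compilers', 'Robotics'], 'Sedol': [95, 78]})

inner = pd.merge(left, right, on='Modules')                # default: how='inner'
outer = pd.merge(left, right, on='Modules', how='outer')   # keep every key

print(inner)   # only 'Compilers', the key present in both frames
print(outer)   # all three modules, with NaN where a mark is missing
```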


In [43]:
# To Do: Define your own DataFrames and use the functions introduced above to combine them.

### What to do next?
The following websites would be helpful for further study of the pandas library:
- [Pandas official website](https://pandas.pydata.org)
- [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)
- [Data Wrangling with Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)