# Preliminaries

We need to load the `pandas` library/module/package. It will be abbreviated as `pd`.

Users typically load the `numpy` library/module/package at the same time (abbreviated as `np`), since it has some useful functions for data analysis, even though we will not use it here.

In [1]:
import pandas as pd
import numpy as np

---

# Load Pandas data frames from CSV files

Let's use annual data from the World Happiness Report.

CSV stands for comma-separated values. (This is what you'll see if you open the files in a browser or text editor.)

Note how we use the function `read_csv` from the `pandas` library (which we have abbreviated to `pd`). You can find details on what arguments the function accepts and how it behaves by googling it, which takes you to this page: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [2]:
whr_df_15 = pd.read_csv('WHR_2015.csv')
whr_df_19 = pd.read_csv('WHR_2019.csv')
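
To get a feel for `read_csv` without needing a file on disk, here is a minimal sketch that reads CSV text from memory. The `csv_text` string and the names `demo_df` and `small_df` are made up for illustration; `usecols` and `nrows` are two of the many optional arguments listed on the documentation page above.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk (three rows only).
csv_text = """Country,Region,Happiness Score
Switzerland,Western Europe,7.587
Iceland,Western Europe,7.561
Canada,North America,7.427
"""

# read_csv accepts a file path, a URL, or any file-like object such as StringIO.
demo_df = pd.read_csv(io.StringIO(csv_text))

# usecols keeps only the named columns; nrows limits how many data rows are read.
small_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Country", "Happiness Score"],
    nrows=2,
)

print(demo_df.shape)
print(small_df)
```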

## Describing data

### Dimensions of these tables

In [3]:
# Number of rows:
len(whr_df_15)

158

In [4]:
# Number of rows and columns
whr_df_15.shape

(158, 12)

In [5]:
whr_df_19.shape


(156, 9)

### head & tail

In [6]:
whr_df_15.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204
3,Norway,Western Europe,4,7.522,0.0388,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176


In [7]:
whr_df_15[:5]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204
3,Norway,Western Europe,4,7.522,0.0388,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176


In [8]:
whr_df_15.tail(10)

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
148,Chad,Sub-Saharan Africa,149,3.667,0.0383,0.34193,0.76062,0.1501,0.23501,0.05269,0.18386,1.94296
149,Guinea,Sub-Saharan Africa,150,3.656,0.0359,0.17417,0.46475,0.24009,0.37725,0.12139,0.28657,1.99172
150,Ivory Coast,Sub-Saharan Africa,151,3.655,0.05141,0.46534,0.77115,0.15185,0.46866,0.17922,0.20165,1.41723
151,Burkina Faso,Sub-Saharan Africa,152,3.587,0.04324,0.25812,0.85188,0.27125,0.39493,0.12832,0.21747,1.46494
152,Afghanistan,Southern Asia,153,3.575,0.03084,0.31982,0.30285,0.30335,0.23414,0.09719,0.3651,1.9521
153,Rwanda,Sub-Saharan Africa,154,3.465,0.03464,0.22208,0.7737,0.42864,0.59201,0.55191,0.22628,0.67042
154,Benin,Sub-Saharan Africa,155,3.34,0.03656,0.28665,0.35386,0.3191,0.4845,0.0801,0.1826,1.63328
155,Syria,Middle East and Northern Africa,156,3.006,0.05015,0.6632,0.47489,0.72193,0.15684,0.18906,0.47179,0.32858
156,Burundi,Sub-Saharan Africa,157,2.905,0.08658,0.0153,0.41587,0.22396,0.1185,0.10062,0.19727,1.83302
157,Togo,Sub-Saharan Africa,158,2.839,0.06727,0.20868,0.13995,0.28443,0.36453,0.10731,0.16681,1.56726


In [9]:
whr_df_15[-10:]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
148,Chad,Sub-Saharan Africa,149,3.667,0.0383,0.34193,0.76062,0.1501,0.23501,0.05269,0.18386,1.94296
149,Guinea,Sub-Saharan Africa,150,3.656,0.0359,0.17417,0.46475,0.24009,0.37725,0.12139,0.28657,1.99172
150,Ivory Coast,Sub-Saharan Africa,151,3.655,0.05141,0.46534,0.77115,0.15185,0.46866,0.17922,0.20165,1.41723
151,Burkina Faso,Sub-Saharan Africa,152,3.587,0.04324,0.25812,0.85188,0.27125,0.39493,0.12832,0.21747,1.46494
152,Afghanistan,Southern Asia,153,3.575,0.03084,0.31982,0.30285,0.30335,0.23414,0.09719,0.3651,1.9521
153,Rwanda,Sub-Saharan Africa,154,3.465,0.03464,0.22208,0.7737,0.42864,0.59201,0.55191,0.22628,0.67042
154,Benin,Sub-Saharan Africa,155,3.34,0.03656,0.28665,0.35386,0.3191,0.4845,0.0801,0.1826,1.63328
155,Syria,Middle East and Northern Africa,156,3.006,0.05015,0.6632,0.47489,0.72193,0.15684,0.18906,0.47179,0.32858
156,Burundi,Sub-Saharan Africa,157,2.905,0.08658,0.0153,0.41587,0.22396,0.1185,0.10062,0.19727,1.83302
157,Togo,Sub-Saharan Africa,158,2.839,0.06727,0.20868,0.13995,0.28443,0.36453,0.10731,0.16681,1.56726


### info()

In [10]:
whr_df_15.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158 entries, 0 to 157
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Country                        158 non-null    object 
 1   Region                         158 non-null    object 
 2   Happiness Rank                 158 non-null    int64  
 3   Happiness Score                158 non-null    float64
 4   Standard Error                 158 non-null    float64
 5   Economy (GDP per Capita)       158 non-null    float64
 6   Family                         158 non-null    float64
 7   Health (Life Expectancy)       158 non-null    float64
 8   Freedom                        158 non-null    float64
 9   Trust (Government Corruption)  158 non-null    float64
 10  Generosity                     158 non-null    float64
 11  Dystopia Residual              158 non-null    float64
dtypes: float64(9), int64(1), object(2)
memory usage: 1

### Data types

In [11]:
whr_df_19.dtypes

Overall rank                      int64
Country or region                object
Score                           float64
GDP per capita                  float64
Social support                  float64
Healthy life expectancy         float64
Freedom to make life choices    float64
Generosity                      float64
Perceptions of corruption       float64
dtype: object

### Accessing columns

In [12]:
whr_df_19['Score']

0      7.769
1      7.600
2      7.554
3      7.494
4      7.488
       ...  
151    3.334
152    3.231
153    3.203
154    3.083
155    2.853
Name: Score, Length: 156, dtype: float64

In [13]:
whr_df_19.Score

0      7.769
1      7.600
2      7.554
3      7.494
4      7.488
       ...  
151    3.334
152    3.231
153    3.203
154    3.083
155    2.853
Name: Score, Length: 156, dtype: float64
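
The two cells above return the same column. As a small sketch on a made-up two-row frame (`demo_df` is hypothetical), the bracket form is the more general one: it also works for column names that contain spaces, and a list of names selects several columns at once.

```python
import pandas as pd

# Toy frame with one simple column name and one name containing spaces.
demo_df = pd.DataFrame({
    "Score": [7.769, 7.600],
    "Country or region": ["Finland", "Denmark"],
})

# Bracket access and attribute (dot) access give the same Series here.
same = demo_df["Score"].equals(demo_df.Score)

# Names with spaces only work with brackets;
# writing demo_df.Country or region would not be valid Python.
countries = demo_df["Country or region"]

# A list of names inside the brackets returns a DataFrame with those columns.
two_cols = demo_df[["Country or region", "Score"]]

print(same)
print(two_cols)
```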

### describe() & mean()

In [14]:
whr_df_19['Score'].describe()

count    156.000000
mean       5.407096
std        1.113120
min        2.853000
25%        4.544500
50%        5.379500
75%        6.184500
max        7.769000
Name: Score, dtype: float64

In [15]:
whr_df_19['Score'].describe()['min']

2.853

In [16]:
whr_df_19['Score'].mean()

5.407096153846155

### value_counts()

In [17]:
whr_df_15['Region'].value_counts()

Region
Sub-Saharan Africa                 40
Central and Eastern Europe         29
Latin America and Caribbean        22
Western Europe                     21
Middle East and Northern Africa    20
Southeastern Asia                   9
Southern Asia                       7
Eastern Asia                        6
North America                       2
Australia and New Zealand           2
Name: count, dtype: int64

---

# Processing data

## sort_values()

In [18]:
whr_df_19

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.769,1.340,1.587,0.986,0.596,0.153,0.393
1,2,Denmark,7.600,1.383,1.573,0.996,0.592,0.252,0.410
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341
3,4,Iceland,7.494,1.380,1.624,1.026,0.591,0.354,0.118
4,5,Netherlands,7.488,1.396,1.522,0.999,0.557,0.322,0.298
...,...,...,...,...,...,...,...,...,...
151,152,Rwanda,3.334,0.359,0.711,0.614,0.555,0.217,0.411
152,153,Tanzania,3.231,0.476,0.885,0.499,0.417,0.276,0.147
153,154,Afghanistan,3.203,0.350,0.517,0.361,0.000,0.158,0.025
154,155,Central African Republic,3.083,0.026,0.000,0.105,0.225,0.235,0.035


In [19]:
whr_df_19.sort_values('Country or region')

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
153,154,Afghanistan,3.203,0.350,0.517,0.361,0.000,0.158,0.025
106,107,Albania,4.719,0.947,0.848,0.874,0.383,0.178,0.027
87,88,Algeria,5.211,1.002,1.160,0.785,0.086,0.073,0.114
46,47,Argentina,6.086,1.092,1.432,0.881,0.471,0.066,0.050
115,116,Armenia,4.559,0.850,1.055,0.815,0.283,0.095,0.064
...,...,...,...,...,...,...,...,...,...
107,108,Venezuela,4.707,0.960,1.427,0.805,0.154,0.064,0.047
93,94,Vietnam,5.175,0.741,1.346,0.851,0.543,0.147,0.073
150,151,Yemen,3.380,0.287,1.163,0.463,0.143,0.108,0.077
137,138,Zambia,4.107,0.578,1.058,0.426,0.431,0.247,0.087


In [20]:
whr_df_15

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.43630,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204
3,Norway,Western Europe,4,7.522,0.03880,1.45900,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176
...,...,...,...,...,...,...,...,...,...,...,...,...
153,Rwanda,Sub-Saharan Africa,154,3.465,0.03464,0.22208,0.77370,0.42864,0.59201,0.55191,0.22628,0.67042
154,Benin,Sub-Saharan Africa,155,3.340,0.03656,0.28665,0.35386,0.31910,0.48450,0.08010,0.18260,1.63328
155,Syria,Middle East and Northern Africa,156,3.006,0.05015,0.66320,0.47489,0.72193,0.15684,0.18906,0.47179,0.32858
156,Burundi,Sub-Saharan Africa,157,2.905,0.08658,0.01530,0.41587,0.22396,0.11850,0.10062,0.19727,1.83302


In [21]:
whr_df_15.sort_values(['Region', 'Country'])

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
9,Australia,Australia and New Zealand,10,7.284,0.04083,1.33358,1.30923,0.93156,0.65124,0.35637,0.43562,2.26646
8,New Zealand,Australia and New Zealand,9,7.286,0.03371,1.25018,1.31967,0.90837,0.63938,0.42922,0.47501,2.26425
94,Albania,Central and Eastern Europe,95,4.959,0.05013,0.87867,0.80434,0.81325,0.35733,0.06413,0.14272,1.89894
126,Armenia,Central and Eastern Europe,127,4.350,0.04763,0.76821,0.77711,0.72990,0.19847,0.03900,0.07855,1.75873
79,Azerbaijan,Central and Eastern Europe,80,5.212,0.03363,1.02389,0.93793,0.64045,0.37030,0.16065,0.07799,2.00073
...,...,...,...,...,...,...,...,...,...,...,...,...
87,Portugal,Western Europe,88,5.102,0.04802,1.15991,1.13935,0.87519,0.51469,0.01078,0.13719,1.26462
35,Spain,Western Europe,36,6.329,0.03468,1.23011,1.31379,0.95562,0.45951,0.06398,0.18227,2.12367
7,Sweden,Western Europe,8,7.364,0.03157,1.33171,1.28907,0.91087,0.65980,0.43844,0.36262,2.37119
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738


In [22]:
whr_df_15

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.43630,2.70201
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204
3,Norway,Western Europe,4,7.522,0.03880,1.45900,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176
...,...,...,...,...,...,...,...,...,...,...,...,...
153,Rwanda,Sub-Saharan Africa,154,3.465,0.03464,0.22208,0.77370,0.42864,0.59201,0.55191,0.22628,0.67042
154,Benin,Sub-Saharan Africa,155,3.340,0.03656,0.28665,0.35386,0.31910,0.48450,0.08010,0.18260,1.63328
155,Syria,Middle East and Northern Africa,156,3.006,0.05015,0.66320,0.47489,0.72193,0.15684,0.18906,0.47179,0.32858
156,Burundi,Sub-Saharan Africa,157,2.905,0.08658,0.01530,0.41587,0.22396,0.11850,0.10062,0.19727,1.83302


In [23]:
sorted_whr_df_19 = whr_df_19.sort_values(by="Freedom to make life choices", ascending=False)
sorted_whr_df_19

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
40,41,Uzbekistan,6.174,0.745,1.529,0.756,0.631,0.322,0.240
108,109,Cambodia,4.700,0.574,1.122,0.637,0.609,0.232,0.062
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341
20,21,United Arab Emirates,6.825,1.503,1.310,0.825,0.598,0.262,0.182
0,1,Finland,7.769,1.340,1.587,0.986,0.596,0.153,0.393
...,...,...,...,...,...,...,...,...,...
121,122,Mauritania,4.490,0.570,1.167,0.489,0.066,0.106,0.088
146,147,Haiti,3.597,0.323,0.688,0.449,0.026,0.419,0.110
148,149,Syria,3.462,0.619,0.378,0.440,0.013,0.331,0.141
155,156,South Sudan,2.853,0.306,0.575,0.295,0.010,0.202,0.091


In [24]:
whr_df_19

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.769,1.340,1.587,0.986,0.596,0.153,0.393
1,2,Denmark,7.600,1.383,1.573,0.996,0.592,0.252,0.410
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341
3,4,Iceland,7.494,1.380,1.624,1.026,0.591,0.354,0.118
4,5,Netherlands,7.488,1.396,1.522,0.999,0.557,0.322,0.298
...,...,...,...,...,...,...,...,...,...
151,152,Rwanda,3.334,0.359,0.711,0.614,0.555,0.217,0.411
152,153,Tanzania,3.231,0.476,0.885,0.499,0.417,0.276,0.147
153,154,Afghanistan,3.203,0.350,0.517,0.361,0.000,0.158,0.025
154,155,Central African Republic,3.083,0.026,0.000,0.105,0.225,0.235,0.035
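
Note that `whr_df_19` is unchanged after sorting: `sort_values` returns a new, sorted data frame and leaves the original alone. A minimal sketch on a three-row toy frame (`demo_df` is made up for illustration) shows this, along with the fact that the old index labels travel with the rows unless you reset them.

```python
import pandas as pd

# Toy frame, deliberately not in score order.
demo_df = pd.DataFrame({
    "Country": ["Norway", "Finland", "Denmark"],
    "Score": [7.554, 7.769, 7.600],
})

# sort_values returns a sorted copy; demo_df itself keeps its row order.
by_score = demo_df.sort_values("Score", ascending=False)

# The original index labels are carried along with each row.
print(list(by_score.index))

# reset_index(drop=True) renumbers the rows 0, 1, 2, ... after sorting.
by_score_reset = by_score.reset_index(drop=True)
print(by_score_reset)
```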


---

# Cleaning data

## Handling NAs

Let's read in a new table:

In [28]:
dirty_df = pd.read_csv('dirty_data.csv')

In [29]:
dirty_df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
5,60,'2020/12/06',102,127,300.0
6,60,'2020/12/07',110,136,374.0
7,450,'2020/12/08',104,134,253.3
8,30,'2020/12/09',109,133,195.1
9,60,'2020/12/10',98,124,269.0


In [30]:
dirty_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  32 non-null     int64  
 1   Date      31 non-null     object 
 2   Pulse     32 non-null     int64  
 3   Maxpulse  32 non-null     int64  
 4   Calories  30 non-null     float64
dtypes: float64(1), int64(3), object(1)
memory usage: 1.4+ KB


In [31]:
dirty_df.isna().sum()

Duration    0
Date        1
Pulse       0
Maxpulse    0
Calories    2
dtype: int64

Drop the rows with NAs

In [32]:
new_dirty_df = dirty_df.dropna()

In [33]:
new_dirty_df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
5,60,'2020/12/06',102,127,300.0
6,60,'2020/12/07',110,136,374.0
7,450,'2020/12/08',104,134,253.3
8,30,'2020/12/09',109,133,195.1
9,60,'2020/12/10',98,124,269.0


In [34]:
new_dirty_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 29 entries, 0 to 31
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  29 non-null     int64  
 1   Date      29 non-null     object 
 2   Pulse     29 non-null     int64  
 3   Maxpulse  29 non-null     int64  
 4   Calories  29 non-null     float64
dtypes: float64(1), int64(3), object(1)
memory usage: 1.4+ KB
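
The index still runs "0 to 31" even though only 29 rows remain, because `dropna` keeps the original row labels. Here is a small sketch on a made-up three-row frame (`demo_df` is hypothetical) comparing the default behaviour with the `subset` argument, which only looks for NAs in the named columns.

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 has a missing date, row 2 has missing calories.
demo_df = pd.DataFrame({
    "Date": ["2020/12/01", None, "2020/12/03"],
    "Calories": [409.1, 479.0, np.nan],
})

# Default: drop every row that has an NA in any column.
any_na_dropped = demo_df.dropna()

# subset: drop only rows where the listed columns are NA.
date_na_dropped = demo_df.dropna(subset=["Date"])

# dropna returns a new frame; demo_df still has all three rows.
print(len(demo_df), len(any_na_dropped), len(date_na_dropped))
print(list(date_na_dropped.index))
```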


Or, replace some NAs with imputed values:

In [35]:
dirty_df['Calories']

0     409.1
1     479.0
2     340.0
3     282.4
4     406.0
5     300.0
6     374.0
7     253.3
8     195.1
9     269.0
10    329.3
11    250.7
12    250.7
13    345.3
14    379.3
15    275.0
16    215.2
17    300.0
18      NaN
19    323.0
20    243.0
21    364.2
22    282.0
23    300.0
24    246.0
25    334.5
26    250.0
27    241.0
28      NaN
29    280.0
30    380.3
31    243.0
Name: Calories, dtype: float64

In [36]:
median_C = dirty_df['Calories'].median()
median_C

291.2

In [39]:
new2_dirty_col = dirty_df['Calories'].fillna(median_C)
new2_dirty_col

0     409.1
1     479.0
2     340.0
3     282.4
4     406.0
5     300.0
6     374.0
7     253.3
8     195.1
9     269.0
10    329.3
11    250.7
12    250.7
13    345.3
14    379.3
15    275.0
16    215.2
17    300.0
18    291.2
19    323.0
20    243.0
21    364.2
22    282.0
23    300.0
24    246.0
25    334.5
26    250.0
27    241.0
28    291.2
29    280.0
30    380.3
31    243.0
Name: Calories, dtype: float64
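
The same idea on a tiny made-up column, as a sketch: `median()` skips NAs when computing the value, and `fillna` returns a new Series, so nothing changes in the data frame until the result is assigned back. Here the result overwrites the original column rather than going into a new one; `demo_df` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value.
demo_df = pd.DataFrame({"Calories": [100.0, 300.0, np.nan, 200.0]})

# median() ignores the NaN: the median of 100, 200, 300 is 200.
median_c = demo_df["Calories"].median()

# fillna returns a new Series; demo_df is not modified by this line.
filled = demo_df["Calories"].fillna(median_c)
na_before = int(demo_df["Calories"].isna().sum())

# Assigning the result back overwrites the column in the frame.
demo_df["Calories"] = filled
na_after = int(demo_df["Calories"].isna().sum())

print(median_c, na_before, na_after)
```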

In [40]:
# Create a new dataframe called new2_dirty_df by taking a copy of dirty_df
new2_dirty_df = dirty_df.copy()

In [41]:
# Assign the imputed Calories column to new2_dirty_df (under the column name 'Cleaned Score')
new2_dirty_df['Cleaned Score'] = new2_dirty_col

In [42]:
new2_dirty_df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories,Cleaned Score
0,60,'2020/12/01',110,130,409.1,409.1
1,60,'2020/12/02',117,145,479.0,479.0
2,60,'2020/12/03',103,135,340.0,340.0
3,45,'2020/12/04',109,175,282.4,282.4
4,45,'2020/12/05',117,148,406.0,406.0
5,60,'2020/12/06',102,127,300.0,300.0
6,60,'2020/12/07',110,136,374.0,374.0
7,450,'2020/12/08',104,134,253.3,253.3
8,30,'2020/12/09',109,133,195.1,195.1
9,60,'2020/12/10',98,124,269.0,269.0


In [43]:
dirty_df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
5,60,'2020/12/06',102,127,300.0
6,60,'2020/12/07',110,136,374.0
7,450,'2020/12/08',104,134,253.3
8,30,'2020/12/09',109,133,195.1
9,60,'2020/12/10',98,124,269.0


## Add and drop a column

In [44]:
dirty_df['Today'] = 'Tuesday'

In [45]:
dirty_df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories,Today
0,60,'2020/12/01',110,130,409.1,Tuesday
1,60,'2020/12/02',117,145,479.0,Tuesday
2,60,'2020/12/03',103,135,340.0,Tuesday
3,45,'2020/12/04',109,175,282.4,Tuesday
4,45,'2020/12/05',117,148,406.0,Tuesday
5,60,'2020/12/06',102,127,300.0,Tuesday
6,60,'2020/12/07',110,136,374.0,Tuesday
7,450,'2020/12/08',104,134,253.3,Tuesday
8,30,'2020/12/09',109,133,195.1,Tuesday
9,60,'2020/12/10',98,124,269.0,Tuesday


In [46]:
del dirty_df['Today']

In [47]:
dirty_df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
5,60,'2020/12/06',102,127,300.0
6,60,'2020/12/07',110,136,374.0
7,450,'2020/12/08',104,134,253.3
8,30,'2020/12/09',109,133,195.1
9,60,'2020/12/10',98,124,269.0


## Date-times

In [48]:
dirty_df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
5,60,'2020/12/06',102,127,300.0
6,60,'2020/12/07',110,136,374.0
7,450,'2020/12/08',104,134,253.3
8,30,'2020/12/09',109,133,195.1
9,60,'2020/12/10',98,124,269.0


Let's remove rows with missing dates.

In [None]:
# dirty_df = dirty_df.dropna(subset=['Date'])

In [49]:
dirty_df.dropna(subset=['Date'], inplace=True)

In [50]:
dirty_df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0
5,60,'2020/12/06',102,127,300.0
6,60,'2020/12/07',110,136,374.0
7,450,'2020/12/08',104,134,253.3
8,30,'2020/12/09',109,133,195.1
9,60,'2020/12/10',98,124,269.0


Let's make sure dates are not messy strings, but genuine date-times:

In [51]:
dirty_df['Date'] = pd.to_datetime(dirty_df['Date'], format='mixed')

In [52]:
dirty_df.Date

0    2020-12-01
1    2020-12-02
2    2020-12-03
3    2020-12-04
4    2020-12-05
5    2020-12-06
6    2020-12-07
7    2020-12-08
8    2020-12-09
9    2020-12-10
10   2020-12-11
11   2020-12-12
12   2020-12-12
13   2020-12-13
14   2020-12-14
15   2020-12-15
16   2020-12-16
17   2020-12-17
18   2020-12-18
19   2020-12-19
20   2020-12-20
21   2020-12-21
23   2020-12-23
24   2020-12-24
25   2020-12-25
26   2020-12-26
27   2020-12-27
28   2020-12-28
29   2020-12-29
30   2020-12-30
31   2020-12-31
Name: Date, dtype: datetime64[ns]

In [53]:
dirty_df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020-12-01,110,130,409.1
1,60,2020-12-02,117,145,479.0
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.4
4,45,2020-12-05,117,148,406.0
5,60,2020-12-06,102,127,300.0
6,60,2020-12-07,110,136,374.0
7,450,2020-12-08,104,134,253.3
8,30,2020-12-09,109,133,195.1
9,60,2020-12-10,98,124,269.0


In [54]:
dirty_df.loc[30, 'Date'] - dirty_df.loc[0, 'Date'] 

Timedelta('29 days 00:00:00')
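
Once a column holds genuine date-times, date arithmetic and the `.dt` accessor become available. A minimal sketch on two made-up date strings follows; unlike the cell above it passes an explicit `format` string, which assumes every value follows the same `YYYY/MM/DD` pattern.

```python
import pandas as pd

# Two date strings in the same layout as the table above (without the quotes).
dates = pd.Series(["2020/12/01", "2020/12/30"])

# An explicit format is stricter than letting pandas infer the layout.
parsed = pd.to_datetime(dates, format="%Y/%m/%d")

# Subtracting two timestamps gives a Timedelta; .days extracts whole days.
gap = parsed[1] - parsed[0]
print(gap.days)

# The .dt accessor exposes date parts for the whole column at once.
print(list(parsed.dt.day))
```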

## Replacing values

In [55]:
dirty_df[:10]

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020-12-01,110,130,409.1
1,60,2020-12-02,117,145,479.0
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.4
4,45,2020-12-05,117,148,406.0
5,60,2020-12-06,102,127,300.0
6,60,2020-12-07,110,136,374.0
7,450,2020-12-08,104,134,253.3
8,30,2020-12-09,109,133,195.1
9,60,2020-12-10,98,124,269.0


In [56]:
dirty_df.loc[7, "Duration"]

450

In [57]:
dirty_df.loc[7, "Duration"] = 45
dirty_df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020-12-01,110,130,409.1
1,60,2020-12-02,117,145,479.0
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.4
4,45,2020-12-05,117,148,406.0
5,60,2020-12-06,102,127,300.0
6,60,2020-12-07,110,136,374.0
7,45,2020-12-08,104,134,253.3
8,30,2020-12-09,109,133,195.1
9,60,2020-12-10,98,124,269.0


## Detecting duplicates

In [58]:
dirty_df[dirty_df.duplicated()]

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
12,60,2020-12-12,100,120,250.7


In [59]:
dirty_df = dirty_df.drop_duplicates()
dirty_df

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020-12-01,110,130,409.1
1,60,2020-12-02,117,145,479.0
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.4
4,45,2020-12-05,117,148,406.0
5,60,2020-12-06,102,127,300.0
6,60,2020-12-07,110,136,374.0
7,45,2020-12-08,104,134,253.3
8,30,2020-12-09,109,133,195.1
9,60,2020-12-10,98,124,269.0
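
By default `duplicated()` flags only the second and later copies of a row, which is why a single row shows up above. The sketch below, on a made-up three-row frame (`demo_df` is hypothetical), also uses `keep=False` to see every copy before deciding what to drop.

```python
import pandas as pd

# Toy frame in which rows 1 and 2 are identical.
demo_df = pd.DataFrame({
    "Date": ["2020-12-11", "2020-12-12", "2020-12-12"],
    "Pulse": [103, 100, 100],
})

# Default keep='first': the first copy is not marked, later copies are.
dup_mask = demo_df.duplicated()

# keep=False marks all copies, which helps when inspecting duplicates.
all_copies = demo_df[demo_df.duplicated(keep=False)]

# drop_duplicates keeps the first copy and returns a new frame.
deduped = demo_df.drop_duplicates()

print(list(dup_mask))
print(all_copies)
print(deduped)
```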


## Filtering/subsetting

### Single condition

In [60]:
dirty_df[(dirty_df['Pulse'] > 110)]

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
1,60,2020-12-02,117,145,479.0
4,45,2020-12-05,117,148,406.0
23,60,2020-12-23,130,101,300.0


For pandas data frames, the `~` operator works like the `not` we saw in the previous session:

In [61]:
dirty_df[~(dirty_df['Pulse'] > 110)]

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020-12-01,110,130,409.1
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.4
5,60,2020-12-06,102,127,300.0
6,60,2020-12-07,110,136,374.0
7,45,2020-12-08,104,134,253.3
8,30,2020-12-09,109,133,195.1
9,60,2020-12-10,98,124,269.0
10,60,2020-12-11,103,147,329.3
11,60,2020-12-12,100,120,250.7


In [62]:
dirty_df[(dirty_df['Pulse'] <= 110)]

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020-12-01,110,130,409.1
2,60,2020-12-03,103,135,340.0
3,45,2020-12-04,109,175,282.4
5,60,2020-12-06,102,127,300.0
6,60,2020-12-07,110,136,374.0
7,45,2020-12-08,104,134,253.3
8,30,2020-12-09,109,133,195.1
9,60,2020-12-10,98,124,269.0
10,60,2020-12-11,103,147,329.3
11,60,2020-12-12,100,120,250.7
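
The last two cells return the same rows because negating `> 110` gives `<= 110`. A short sketch on a made-up Series (`pulse` is hypothetical) shows that `~` flips a boolean mask element by element, whereas plain `not` raises an error on a Series. The equivalence assumes there are no missing values in the column.

```python
import pandas as pd

# Toy pulse readings.
pulse = pd.Series([110, 117, 98])

# A comparison on a Series gives a boolean Series (a "mask").
mask = pulse > 110

# ~ negates the mask element by element.
negated = ~mask
print(list(negated))

# With no missing values, ~(pulse > 110) matches pulse <= 110.
print(negated.equals(pulse <= 110))

# Python's own `not` needs a single True/False, so it fails on a Series.
try:
    not mask
    not_worked = True
except ValueError:
    not_worked = False
print(not_worked)
```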


### Multiple conditions

And (`&`) 

In [63]:
dirty_df[(dirty_df['Pulse'] > 110) & (dirty_df['Calories'] > 400)]

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
1,60,2020-12-02,117,145,479.0
4,45,2020-12-05,117,148,406.0


Or (`|`)

In [64]:
dirty_df[(dirty_df['Pulse'] > 110) | (dirty_df['Calories'] > 400)]

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,2020-12-01,110,130,409.1
1,60,2020-12-02,117,145,479.0
4,45,2020-12-05,117,148,406.0
23,60,2020-12-23,130,101,300.0


In [65]:
dirty_df[(dirty_df['Pulse'] == 100) | (dirty_df['Pulse'] == 102)]

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
5,60,2020-12-06,102,127,300.0
11,60,2020-12-12,100,120,250.7
17,60,2020-12-17,100,120,300.0
25,60,2020-12-25,102,126,334.5
26,60,2020-12-26,100,120,250.0
29,60,2020-12-29,100,132,280.0
30,60,2020-12-30,102,129,380.3


`isin`

In [66]:
dirty_df[ dirty_df['Pulse'].isin( [100, 102] ) ]

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
5,60,2020-12-06,102,127,300.0
11,60,2020-12-12,100,120,250.7
17,60,2020-12-17,100,120,300.0
25,60,2020-12-25,102,126,334.5
26,60,2020-12-26,100,120,250.0
29,60,2020-12-29,100,132,280.0
30,60,2020-12-30,102,129,380.3
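
The two cells above select the same rows. As a sketch on a made-up Series (`pulse` is hypothetical): `isin` is shorthand for a chain of `==` tests joined with `|`, and it combines with `~` to select everything outside the list. The parentheses around each condition matter, because `|` and `&` bind more tightly than `==` or `>` in Python.

```python
import pandas as pd

# Toy pulse readings.
pulse = pd.Series([102, 110, 100, 117])

# Each comparison is wrapped in parentheses before combining with |.
with_or = (pulse == 100) | (pulse == 102)

# isin tests membership in a list of values in one call.
with_isin = pulse.isin([100, 102])
print(with_or.equals(with_isin))

# ~ inverts the membership test: values NOT in the list.
outside = pulse[~pulse.isin([100, 102])]
print(list(outside))
```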


## More description

In [None]:
dirty_df.corr()
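
`corr()` computes pairwise correlations between columns. The sketch below uses a made-up frame (`demo_df` is hypothetical) and passes `numeric_only=True`, which assumes a reasonably recent pandas version; it restricts the calculation to numeric columns, so a text column like `Label` is left out instead of possibly causing an error.

```python
import pandas as pd

# Toy frame: Calories rises in a straight line with Duration.
demo_df = pd.DataFrame({
    "Duration": [30, 45, 60],
    "Calories": [200.0, 300.0, 400.0],
    "Label": ["a", "b", "c"],
})

# numeric_only=True skips non-numeric columns such as Label.
corr_matrix = demo_df.corr(numeric_only=True)
print(corr_matrix)

# A single pairwise correlation can be read off with .loc.
print(corr_matrix.loc["Duration", "Calories"])
```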

---

# Quizzes

## Quiz 1

In `whr_df_15`, create a column for a boolean dummy variable, `ME_AF`, that indicates whether a given country belongs to one of two regions: "Sub-Saharan Africa" or "Middle East and Northern Africa".

In [67]:
# Robin's solution
regions_of_interest = ['Sub-Saharan Africa', 'Middle East and Northern Africa']
new_whr_df_15 = whr_df_15.copy()
new_whr_df_15['ME_AF'] = new_whr_df_15['Region'].isin(regions_of_interest).astype(int)
new_whr_df_15

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,ME_AF
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,0
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.43630,2.70201,0
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,0
3,Norway,Western Europe,4,7.522,0.03880,1.45900,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531,0
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
153,Rwanda,Sub-Saharan Africa,154,3.465,0.03464,0.22208,0.77370,0.42864,0.59201,0.55191,0.22628,0.67042,1
154,Benin,Sub-Saharan Africa,155,3.340,0.03656,0.28665,0.35386,0.31910,0.48450,0.08010,0.18260,1.63328,1
155,Syria,Middle East and Northern Africa,156,3.006,0.05015,0.66320,0.47489,0.72193,0.15684,0.18906,0.47179,0.32858,1
156,Burundi,Sub-Saharan Africa,157,2.905,0.08658,0.01530,0.41587,0.22396,0.11850,0.10062,0.19727,1.83302,1


In [68]:
# Gavin's solution
whr_df_15['ME_AF'] = (whr_df_15['Region'].isin( ['Middle East and Northern Africa', 'Sub-Saharan Africa'] ))
whr_df_15

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,ME_AF
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,False
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.43630,2.70201,False
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,False
3,Norway,Western Europe,4,7.522,0.03880,1.45900,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531,False
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
153,Rwanda,Sub-Saharan Africa,154,3.465,0.03464,0.22208,0.77370,0.42864,0.59201,0.55191,0.22628,0.67042,True
154,Benin,Sub-Saharan Africa,155,3.340,0.03656,0.28665,0.35386,0.31910,0.48450,0.08010,0.18260,1.63328,True
155,Syria,Middle East and Northern Africa,156,3.006,0.05015,0.66320,0.47489,0.72193,0.15684,0.18906,0.47179,0.32858,True
156,Burundi,Sub-Saharan Africa,157,2.905,0.08658,0.01530,0.41587,0.22396,0.11850,0.10062,0.19727,1.83302,True


In [70]:
# My other solution
whr_df_15['ME_AF'] = (whr_df_15['Region'] == 'Sub-Saharan Africa') | (whr_df_15['Region'] == "Middle East and Northern Africa")
whr_df_15

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,ME_AF
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,False
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.43630,2.70201,False
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,False
3,Norway,Western Europe,4,7.522,0.03880,1.45900,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531,False
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
153,Rwanda,Sub-Saharan Africa,154,3.465,0.03464,0.22208,0.77370,0.42864,0.59201,0.55191,0.22628,0.67042,True
154,Benin,Sub-Saharan Africa,155,3.340,0.03656,0.28665,0.35386,0.31910,0.48450,0.08010,0.18260,1.63328,True
155,Syria,Middle East and Northern Africa,156,3.006,0.05015,0.66320,0.47489,0.72193,0.15684,0.18906,0.47179,0.32858,True
156,Burundi,Sub-Saharan Africa,157,2.905,0.08658,0.01530,0.41587,0.22396,0.11850,0.10062,0.19727,1.83302,True


## Quiz 2

In `whr_df_15`, what is the difference in mean happiness between the ME-AF group and the non-ME-AF group? 

In [71]:
# Gavin's solution
whr_df_15_me = whr_df_15[(whr_df_15['ME_AF'] == True)]
whr_df_15_nme = whr_df_15[(whr_df_15['ME_AF'] == False)]
whr_df_15_me['Happiness Score'].describe()['mean'] - whr_df_15_nme['Happiness Score'].describe()['mean']

-1.2439557823129261

In [72]:
# Maria's solution
mean_happiness_me_af = whr_df_15[whr_df_15['ME_AF'] == True]['Happiness Score'].mean()
mean_happiness_non_me_af = whr_df_15[whr_df_15['ME_AF'] == False]['Happiness Score'].mean()
mean_difference = mean_happiness_me_af - mean_happiness_non_me_af
print(f"Mean Happiness (ME_AF group): {mean_happiness_me_af}")
print(f"Mean Happiness (non-ME_AF group): {mean_happiness_non_me_af}")
print(f"Mean Difference in Happiness: {mean_difference}")

Mean Happiness (ME_AF group): 4.604166666666665
Mean Happiness (non-ME_AF group): 5.848122448979591
Mean Difference in Happiness: -1.2439557823129261


In [73]:
# Doudou's solution
mean_diff = whr_df_15[whr_df_15['ME_AF'] == True]['Happiness Score'].mean() - whr_df_15[whr_df_15['ME_AF'] == False]['Happiness Score'].mean()
mean_diff

-1.2439557823129261

In [74]:
whr_df_15['ME_AF']

0      False
1      False
2      False
3      False
4      False
       ...  
153     True
154     True
155     True
156     True
157     True
Name: ME_AF, Length: 158, dtype: bool

In [75]:
# My other solution
whr_df_15[whr_df_15['ME_AF']]['Happiness Score'].mean() - whr_df_15[~whr_df_15['ME_AF']]['Happiness Score'].mean()

-1.2439557823129261
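
One more possible route, not one of the class solutions: `groupby` computes the mean of each group in a single step. This is a sketch on a made-up four-row frame (`demo_df` and its scores are hypothetical), so the numbers differ from the real answer above.

```python
import pandas as pd

# Toy frame with a boolean group column and a score column.
demo_df = pd.DataFrame({
    "ME_AF": [True, True, False, False],
    "Happiness Score": [4.0, 5.0, 6.0, 7.0],
})

# One mean per group; the result is a Series indexed by False/True.
group_means = demo_df.groupby("ME_AF")["Happiness Score"].mean()
print(group_means)

# Difference: ME_AF group minus non-ME_AF group.
mean_diff = group_means.loc[True] - group_means.loc[False]
print(mean_diff)
```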

---

# Concatenate

![Illustration of concatenating](https://miro.medium.com/v2/resize:fit:1400/0*Xhaw5NqAkkqRPxUF.png)

In [76]:
df1 = pd.DataFrame({"A": ["A0", "A1", "A2", "A3"],
                    "B": ["B0", "B1", "B2", "B3"],
                    "C": ["C0", "C1", "C2", "C3"],
                    "D": ["D0", "D1", "D2", "D3"]}, index=[0, 1, 2, 3])

In [77]:
df1

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3


In [78]:
df2 = pd.DataFrame({"A": ["A4", "A5", "A6", "A7"],
                    "B": ["B4", "B5", "B6", "B7"],
                    "C": ["C4", "C5", "C6", "C7"],
                    "D": ["D4", "D5", "D6", "D7"]}, index=[4, 5, 6, 7])

In [79]:
df2

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


In [80]:
df3 = pd.DataFrame({"E": ["E8", "E9", "E10", "E11"],
                    "F": ["F8", "F9", "F10", "F11"],
                    "G": ["G8", "G9", "G10", "G11"]}, index=[0, 1, 2, 3])

In [81]:
df3

Unnamed: 0,E,F,G
0,E8,F8,G8
1,E9,F9,G9
2,E10,F10,G10
3,E11,F11,G11


The most common way to concatenate is along the index/rows ("vertically"). This is the default:

In [82]:
pd.concat([df1, df2])

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
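In the example above the two index ranges (0 to 3 and 4 to 7) happen not to overlap. When they do overlap, `pd.concat` keeps the original labels, so the result can contain repeated index values. A minimal sketch with two hypothetical frames, `top` and `bottom`, shows how `ignore_index=True` renumbers the rows instead:

```python
import pandas as pd

top = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
bottom = pd.DataFrame({"A": ["A2", "A3"]}, index=[0, 1])  # same labels as `top`

# Default: the original index labels are kept, so 0 and 1 appear twice
stacked = pd.concat([top, bottom])

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([top, bottom], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```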


You can also concatenate along columns ("horizontally"), but you should only do this if the columns are different and you *know* the row order is *exactly* the same for both dataframes. So this is usually not the right thing to do! (Usually, you want to merge in this situation, which we will see soon.)

Let's show an example anyway:

In [83]:
pd.concat([df1, df3], axis=1)

Unnamed: 0,A,B,C,D,E,F,G
0,A0,B0,C0,D0,E8,F8,G8
1,A1,B1,C1,D1,E9,F9,G9
2,A2,B2,C2,D2,E10,F10,G10
3,A3,B3,C3,D3,E11,F11,G11
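One detail worth knowing here: with `axis=1`, `pd.concat` lines rows up by their index labels, not by their position in the table. The sketch below uses small hypothetical frames (`left`, `reversed_right`, `shifted_right`) to show what happens when the labels are in a different order or do not fully overlap:

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=[0, 1, 2])

# Same labels as `left`, but listed in reverse order
reversed_right = pd.DataFrame({"E": ["E0", "E1", "E2"]}, index=[2, 1, 0])
aligned = pd.concat([left, reversed_right], axis=1)
print(aligned)  # the row labelled 0 pairs A0 with E2, not with E0

# Labels that only partly overlap: missing cells are filled with NaN
shifted_right = pd.DataFrame({"E": ["E0", "E1", "E2"]}, index=[1, 2, 3])
padded = pd.concat([left, shifted_right], axis=1)
print(padded)
```

This is one reason a merge on an explicit key column is usually the safer choice.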


---

# Merge/Join

A very common operation is to merge/join two dataframes that share some columns (typically an identifier or key) but not others.

## Check the variables in each df

In [84]:
whr_df_15[:3]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,ME_AF
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,False
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201,False
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,False


In [85]:
whr_df_15.columns

Index(['Country', 'Region', 'Happiness Rank', 'Happiness Score',
       'Standard Error', 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual', 'ME_AF'],
      dtype='object')

Suppose the columns of interest are (1) Country, (2) Region, (3) Happiness Rank, (4) Happiness Score, (5) Economy, (6) Freedom, and (7) Trust.

In [None]:
whr_df_19[:3]

In [None]:
whr_df_19.columns

## Rename columns

In [86]:
whr_df_19.rename(columns = {
    'Overall rank':"Happiness Rank"
    , "Country or region":"Country"
    , "Score":"Happiness Score"
    , "GDP per capita": "Economy (GDP per Capita)"
    , "Freedom to make life choices":"Freedom"
    ,"Perceptions of corruption":"Trust (Government Corruption)"
}, inplace=True)

In [87]:
whr_df_19

Unnamed: 0,Happiness Rank,Country,Happiness Score,Economy (GDP per Capita),Social support,Healthy life expectancy,Freedom,Generosity,Trust (Government Corruption)
0,1,Finland,7.769,1.340,1.587,0.986,0.596,0.153,0.393
1,2,Denmark,7.600,1.383,1.573,0.996,0.592,0.252,0.410
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341
3,4,Iceland,7.494,1.380,1.624,1.026,0.591,0.354,0.118
4,5,Netherlands,7.488,1.396,1.522,0.999,0.557,0.322,0.298
...,...,...,...,...,...,...,...,...,...
151,152,Rwanda,3.334,0.359,0.711,0.614,0.555,0.217,0.411
152,153,Tanzania,3.231,0.476,0.885,0.499,0.417,0.276,0.147
153,154,Afghanistan,3.203,0.350,0.517,0.361,0.000,0.158,0.025
154,155,Central African Republic,3.083,0.026,0.000,0.105,0.225,0.235,0.035


## Merging two dfs

In [88]:
my_columns_15 = [
    'Country', "Region", 'Happiness Rank', 'Happiness Score', 'Economy (GDP per Capita)',
    'Freedom', 'Trust (Government Corruption)'
]

In [89]:
# Subset some of the columns in this dataframe
whr_df_15[my_columns_15]

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Freedom,Trust (Government Corruption)
0,Switzerland,Western Europe,1,7.587,1.39651,0.66557,0.41978
1,Iceland,Western Europe,2,7.561,1.30232,0.62877,0.14145
2,Denmark,Western Europe,3,7.527,1.32548,0.64938,0.48357
3,Norway,Western Europe,4,7.522,1.45900,0.66973,0.36503
4,Canada,North America,5,7.427,1.32629,0.63297,0.32957
...,...,...,...,...,...,...,...
153,Rwanda,Sub-Saharan Africa,154,3.465,0.22208,0.59201,0.55191
154,Benin,Sub-Saharan Africa,155,3.340,0.28665,0.48450,0.08010
155,Syria,Middle East and Northern Africa,156,3.006,0.66320,0.15684,0.18906
156,Burundi,Sub-Saharan Africa,157,2.905,0.01530,0.11850,0.10062


In [90]:
whr_df_15_new = whr_df_15[my_columns_15]

In [91]:
my_columns_19 = [
    'Country', 'Happiness Rank', 'Happiness Score', 'Economy (GDP per Capita)',
    'Freedom', 'Trust (Government Corruption)'
]

In [92]:
whr_df_19[my_columns_19]

Unnamed: 0,Country,Happiness Rank,Happiness Score,Economy (GDP per Capita),Freedom,Trust (Government Corruption)
0,Finland,1,7.769,1.340,0.596,0.393
1,Denmark,2,7.600,1.383,0.592,0.410
2,Norway,3,7.554,1.488,0.603,0.341
3,Iceland,4,7.494,1.380,0.591,0.118
4,Netherlands,5,7.488,1.396,0.557,0.298
...,...,...,...,...,...,...
151,Rwanda,152,3.334,0.359,0.555,0.411
152,Tanzania,153,3.231,0.476,0.417,0.147
153,Afghanistan,154,3.203,0.350,0.000,0.025
154,Central African Republic,155,3.083,0.026,0.225,0.035


In [93]:
whr_df_19_new = whr_df_19[my_columns_19]

In [94]:
# JOIN STEP
whr_df_15_19 = whr_df_15_new.merge(whr_df_19_new, left_on="Country", right_on="Country", how='left', suffixes=("_15", "_19"))

In [95]:
whr_df_15_19

Unnamed: 0,Country,Region,Happiness Rank_15,Happiness Score_15,Economy (GDP per Capita)_15,Freedom_15,Trust (Government Corruption)_15,Happiness Rank_19,Happiness Score_19,Economy (GDP per Capita)_19,Freedom_19,Trust (Government Corruption)_19
0,Switzerland,Western Europe,1,7.587,1.39651,0.66557,0.41978,6.0,7.480,1.452,0.572,0.343
1,Iceland,Western Europe,2,7.561,1.30232,0.62877,0.14145,4.0,7.494,1.380,0.591,0.118
2,Denmark,Western Europe,3,7.527,1.32548,0.64938,0.48357,2.0,7.600,1.383,0.592,0.410
3,Norway,Western Europe,4,7.522,1.45900,0.66973,0.36503,3.0,7.554,1.488,0.603,0.341
4,Canada,North America,5,7.427,1.32629,0.63297,0.32957,9.0,7.278,1.365,0.584,0.308
...,...,...,...,...,...,...,...,...,...,...,...,...
153,Rwanda,Sub-Saharan Africa,154,3.465,0.22208,0.59201,0.55191,152.0,3.334,0.359,0.555,0.411
154,Benin,Sub-Saharan Africa,155,3.340,0.28665,0.48450,0.08010,102.0,4.883,0.393,0.349,0.082
155,Syria,Middle East and Northern Africa,156,3.006,0.66320,0.15684,0.18906,149.0,3.462,0.619,0.013,0.141
156,Burundi,Sub-Saharan Africa,157,2.905,0.01530,0.11850,0.10062,145.0,3.775,0.046,0.220,0.180


## Check missing values

In [96]:
whr_df_15_19.isna().sum()

Country                             0
Region                              0
Happiness Rank_15                   0
Happiness Score_15                  0
Economy (GDP per Capita)_15         0
Freedom_15                          0
Trust (Government Corruption)_15    0
Happiness Rank_19                   9
Happiness Score_19                  9
Economy (GDP per Capita)_19         9
Freedom_19                          9
Trust (Government Corruption)_19    9
dtype: int64

In [97]:
len(whr_df_15_19)

158
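The counts above tell us that some rows found no match, but not which ones. One way to find out is the `indicator=True` argument of `merge`, which adds a `_merge` column recording where each row came from. The sketch below uses made-up frames (`df_a`, `df_b`) with placeholder country names rather than the real WHR data:

```python
import pandas as pd

# Placeholder data; names and scores are invented for illustration
df_a = pd.DataFrame({"Country": ["Xland", "Yland", "Zland"], "score": [7.5, 6.9, 2.8]})
df_b = pd.DataFrame({"Country": ["Xland", "Zland"], "score": [7.6, 4.1]})

checked = df_a.merge(df_b, on="Country", how="left",
                     suffixes=("_15", "_19"), indicator=True)

# Rows marked "left_only" exist in df_a but found no partner in df_b
unmatched = checked.loc[checked["_merge"] == "left_only", "Country"].tolist()
print(unmatched)
```

Applied to the real data, the unmatched names may reveal spelling differences between the two files (an assumption worth checking before dropping rows).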

## Drop NAs

In [98]:
whr_df_15_19.dropna(inplace=True)

In [99]:
whr_df_15_19.isna().sum()

Country                             0
Region                              0
Happiness Rank_15                   0
Happiness Score_15                  0
Economy (GDP per Capita)_15         0
Freedom_15                          0
Trust (Government Corruption)_15    0
Happiness Rank_19                   0
Happiness Score_19                  0
Economy (GDP per Capita)_19         0
Freedom_19                          0
Trust (Government Corruption)_19    0
dtype: int64

In [100]:
len(whr_df_15_19)

149

## There are different ways to perform joins...

![Illustration of merging](https://miro.medium.com/v2/resize:fit:1400/1*Vq8e0dAr0Xsfw0bJRz4FRg.png)
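To make the picture concrete, here is a small sketch that runs the same merge with each value of `how` and counts the resulting rows. The frames `left` and `right` and their `key` column are hypothetical:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = left.merge(right, on="key", how=how)
    row_counts[how] = len(merged)
    print(how, len(merged), sorted(merged["key"]))
```

Only `b` and `c` appear in both frames, so the inner join keeps 2 rows, the left and right joins keep 3 each, and the outer join keeps all 4 keys.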

## ... so let's inner-join our data

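Instead of `how="left"`, we'll use `how="inner"`. This keeps only the countries that appear in both data frames, so no rows with missing values are created: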

In [101]:
whr_df_15_19 = whr_df_15_new.merge(whr_df_19_new, left_on="Country", right_on="Country", how='inner', suffixes=("_15", "_19"))

In [102]:
whr_df_15_19

Unnamed: 0,Country,Region,Happiness Rank_15,Happiness Score_15,Economy (GDP per Capita)_15,Freedom_15,Trust (Government Corruption)_15,Happiness Rank_19,Happiness Score_19,Economy (GDP per Capita)_19,Freedom_19,Trust (Government Corruption)_19
0,Switzerland,Western Europe,1,7.587,1.39651,0.66557,0.41978,6,7.480,1.452,0.572,0.343
1,Iceland,Western Europe,2,7.561,1.30232,0.62877,0.14145,4,7.494,1.380,0.591,0.118
2,Denmark,Western Europe,3,7.527,1.32548,0.64938,0.48357,2,7.600,1.383,0.592,0.410
3,Norway,Western Europe,4,7.522,1.45900,0.66973,0.36503,3,7.554,1.488,0.603,0.341
4,Canada,North America,5,7.427,1.32629,0.63297,0.32957,9,7.278,1.365,0.584,0.308
...,...,...,...,...,...,...,...,...,...,...,...,...
144,Rwanda,Sub-Saharan Africa,154,3.465,0.22208,0.59201,0.55191,152,3.334,0.359,0.555,0.411
145,Benin,Sub-Saharan Africa,155,3.340,0.28665,0.48450,0.08010,102,4.883,0.393,0.349,0.082
146,Syria,Middle East and Northern Africa,156,3.006,0.66320,0.15684,0.18906,149,3.462,0.619,0.013,0.141
147,Burundi,Sub-Saharan Africa,157,2.905,0.01530,0.11850,0.10062,145,3.775,0.046,0.220,0.180


In [103]:
whr_df_15_19.isna().sum()

Country                             0
Region                              0
Happiness Rank_15                   0
Happiness Score_15                  0
Economy (GDP per Capita)_15         0
Freedom_15                          0
Trust (Government Corruption)_15    0
Happiness Rank_19                   0
Happiness Score_19                  0
Economy (GDP per Capita)_19         0
Freedom_19                          0
Trust (Government Corruption)_19    0
dtype: int64

In [104]:
len(whr_df_15_19)

149

Also, since `left_on=` and `right_on=` are the same, we can replace them with `on=` to save space:

In [105]:
whr_df_15_19 = whr_df_15_new.merge(whr_df_19_new, on="Country", how='inner', suffixes=("_15", "_19"))

In [106]:
whr_df_15_19

Unnamed: 0,Country,Region,Happiness Rank_15,Happiness Score_15,Economy (GDP per Capita)_15,Freedom_15,Trust (Government Corruption)_15,Happiness Rank_19,Happiness Score_19,Economy (GDP per Capita)_19,Freedom_19,Trust (Government Corruption)_19
0,Switzerland,Western Europe,1,7.587,1.39651,0.66557,0.41978,6,7.480,1.452,0.572,0.343
1,Iceland,Western Europe,2,7.561,1.30232,0.62877,0.14145,4,7.494,1.380,0.591,0.118
2,Denmark,Western Europe,3,7.527,1.32548,0.64938,0.48357,2,7.600,1.383,0.592,0.410
3,Norway,Western Europe,4,7.522,1.45900,0.66973,0.36503,3,7.554,1.488,0.603,0.341
4,Canada,North America,5,7.427,1.32629,0.63297,0.32957,9,7.278,1.365,0.584,0.308
...,...,...,...,...,...,...,...,...,...,...,...,...
144,Rwanda,Sub-Saharan Africa,154,3.465,0.22208,0.59201,0.55191,152,3.334,0.359,0.555,0.411
145,Benin,Sub-Saharan Africa,155,3.340,0.28665,0.48450,0.08010,102,4.883,0.393,0.349,0.082
146,Syria,Middle East and Northern Africa,156,3.006,0.66320,0.15684,0.18906,149,3.462,0.619,0.013,0.141
147,Burundi,Sub-Saharan Africa,157,2.905,0.01530,0.11850,0.10062,145,3.775,0.046,0.220,0.180


Finally, the default merge/join is an inner-join, so we can omit the `how` argument too:

In [107]:
whr_df_15_19 = whr_df_15_new.merge(whr_df_19_new, on="Country", suffixes=("_15", "_19"))

In [108]:
whr_df_15_19

Unnamed: 0,Country,Region,Happiness Rank_15,Happiness Score_15,Economy (GDP per Capita)_15,Freedom_15,Trust (Government Corruption)_15,Happiness Rank_19,Happiness Score_19,Economy (GDP per Capita)_19,Freedom_19,Trust (Government Corruption)_19
0,Switzerland,Western Europe,1,7.587,1.39651,0.66557,0.41978,6,7.480,1.452,0.572,0.343
1,Iceland,Western Europe,2,7.561,1.30232,0.62877,0.14145,4,7.494,1.380,0.591,0.118
2,Denmark,Western Europe,3,7.527,1.32548,0.64938,0.48357,2,7.600,1.383,0.592,0.410
3,Norway,Western Europe,4,7.522,1.45900,0.66973,0.36503,3,7.554,1.488,0.603,0.341
4,Canada,North America,5,7.427,1.32629,0.63297,0.32957,9,7.278,1.365,0.584,0.308
...,...,...,...,...,...,...,...,...,...,...,...,...
144,Rwanda,Sub-Saharan Africa,154,3.465,0.22208,0.59201,0.55191,152,3.334,0.359,0.555,0.411
145,Benin,Sub-Saharan Africa,155,3.340,0.28665,0.48450,0.08010,102,4.883,0.393,0.349,0.082
146,Syria,Middle East and Northern Africa,156,3.006,0.66320,0.15684,0.18906,149,3.462,0.619,0.013,0.141
147,Burundi,Sub-Saharan Africa,157,2.905,0.01530,0.11850,0.10062,145,3.775,0.046,0.220,0.180


If you omit the `suffixes` argument, it will create suffixes `_x` and `_y` for you:

In [109]:
whr_df_15_new.merge(whr_df_19_new, on="Country")

Unnamed: 0,Country,Region,Happiness Rank_x,Happiness Score_x,Economy (GDP per Capita)_x,Freedom_x,Trust (Government Corruption)_x,Happiness Rank_y,Happiness Score_y,Economy (GDP per Capita)_y,Freedom_y,Trust (Government Corruption)_y
0,Switzerland,Western Europe,1,7.587,1.39651,0.66557,0.41978,6,7.480,1.452,0.572,0.343
1,Iceland,Western Europe,2,7.561,1.30232,0.62877,0.14145,4,7.494,1.380,0.591,0.118
2,Denmark,Western Europe,3,7.527,1.32548,0.64938,0.48357,2,7.600,1.383,0.592,0.410
3,Norway,Western Europe,4,7.522,1.45900,0.66973,0.36503,3,7.554,1.488,0.603,0.341
4,Canada,North America,5,7.427,1.32629,0.63297,0.32957,9,7.278,1.365,0.584,0.308
...,...,...,...,...,...,...,...,...,...,...,...,...
144,Rwanda,Sub-Saharan Africa,154,3.465,0.22208,0.59201,0.55191,152,3.334,0.359,0.555,0.411
145,Benin,Sub-Saharan Africa,155,3.340,0.28665,0.48450,0.08010,102,4.883,0.393,0.349,0.082
146,Syria,Middle East and Northern Africa,156,3.006,0.66320,0.15684,0.18906,149,3.462,0.619,0.013,0.141
147,Burundi,Sub-Saharan Africa,157,2.905,0.01530,0.11850,0.10062,145,3.775,0.046,0.220,0.180


---

# Reshape

## Casting wide to long

`whr_df_15_19` is in a "wide" data format: there are columns repeated for two different years:

In [110]:
whr_df_15_19.shape

(149, 12)

In [111]:
whr_df_15_19.columns

Index(['Country', 'Region', 'Happiness Rank_15', 'Happiness Score_15',
       'Economy (GDP per Capita)_15', 'Freedom_15',
       'Trust (Government Corruption)_15', 'Happiness Rank_19',
       'Happiness Score_19', 'Economy (GDP per Capita)_19', 'Freedom_19',
       'Trust (Government Corruption)_19'],
      dtype='object')

Let's "cast" it to a "long" format:
* Each row will be split into 2 rows, one per year (15/19).
* The following year-specific columns will be consolidated into a single set of five columns: `Happiness Rank_15`, `Happiness Score_15`, `Economy (GDP per Capita)_15`, `Freedom_15`, `Trust (Government Corruption)_15`, `Happiness Rank_19`, `Happiness Score_19`, `Economy (GDP per Capita)_19`, `Freedom_19`, `Trust (Government Corruption)_19`.
* A new `year` column will be created.
* The identifier columns `Country` and `Region` will not be changed.

We'll use the [wide_to_long](https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html) function to do this. (Another possibility is to use the similar [melt](https://pandas.pydata.org/docs/reference/api/pandas.melt.html) function.)

In [112]:
long_df = pd.wide_to_long(whr_df_15_19
                , stubnames = [ 'Happiness Rank_', 'Happiness Score_', 'Economy (GDP per Capita)_'
                                 , 'Freedom_', 'Trust (Government Corruption)_' ]
                , i = [ 'Country', 'Region' ]
                , j = 'year'
               ).reset_index()

long_df


Unnamed: 0,Country,Region,year,Happiness Rank_,Happiness Score_,Economy (GDP per Capita)_,Freedom_,Trust (Government Corruption)_
0,Switzerland,Western Europe,15,1,7.587,1.39651,0.66557,0.41978
1,Switzerland,Western Europe,19,6,7.480,1.45200,0.57200,0.34300
2,Iceland,Western Europe,15,2,7.561,1.30232,0.62877,0.14145
3,Iceland,Western Europe,19,4,7.494,1.38000,0.59100,0.11800
4,Denmark,Western Europe,15,3,7.527,1.32548,0.64938,0.48357
...,...,...,...,...,...,...,...,...
293,Syria,Middle East and Northern Africa,19,149,3.462,0.61900,0.01300,0.14100
294,Burundi,Sub-Saharan Africa,15,157,2.905,0.01530,0.11850,0.10062
295,Burundi,Sub-Saharan Africa,19,145,3.775,0.04600,0.22000,0.18000
296,Togo,Sub-Saharan Africa,15,158,2.839,0.20868,0.36453,0.10731


You can see that the consolidated columns still have a trailing "_", so let's update the column names to remove it:

In [113]:
# Strip only the trailing underscore, so any underscore inside a name would survive
long_df.columns = [c.rstrip('_') for c in long_df.columns]
long_df

Unnamed: 0,Country,Region,year,Happiness Rank,Happiness Score,Economy (GDP per Capita),Freedom,Trust (Government Corruption)
0,Switzerland,Western Europe,15,1,7.587,1.39651,0.66557,0.41978
1,Switzerland,Western Europe,19,6,7.480,1.45200,0.57200,0.34300
2,Iceland,Western Europe,15,2,7.561,1.30232,0.62877,0.14145
3,Iceland,Western Europe,19,4,7.494,1.38000,0.59100,0.11800
4,Denmark,Western Europe,15,3,7.527,1.32548,0.64938,0.48357
...,...,...,...,...,...,...,...,...
293,Syria,Middle East and Northern Africa,19,149,3.462,0.61900,0.01300,0.14100
294,Burundi,Sub-Saharan Africa,15,157,2.905,0.01530,0.11850,0.10062
295,Burundi,Sub-Saharan Africa,19,145,3.775,0.04600,0.22000,0.18000
296,Togo,Sub-Saharan Africa,15,158,2.839,0.20868,0.36453,0.10731
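An alternative that avoids the renaming step is to tell `wide_to_long` about the separator through its `sep` argument, so the stub names can be given without the trailing underscore. A minimal sketch on a made-up frame (`wide_toy`, with invented values):

```python
import pandas as pd

# Invented example data in wide format
wide_toy = pd.DataFrame({
    "Country": ["Xland", "Yland"],
    "Score_15": [7.0, 5.0],
    "Score_19": [7.5, 4.5],
})

# sep="_" says the stub and the year are joined by an underscore
long_toy = pd.wide_to_long(
    wide_toy, stubnames=["Score"], i="Country", j="year", sep="_"
).reset_index()

print(long_toy)
```

The resulting value column is named `Score`, with no underscore left to clean up.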


## Casting long to wide

If you want to go in the other direction, you can "cast" the "long" format data frame to a "wide" format data frame (like we had originally).

Now, the "year" column will be subsumed under the value columns, in a hierarchical structure.

We'll use the [pivot](https://pandas.pydata.org/docs/reference/api/pandas.pivot.html) function to do this. (Another possibility is to use the similar [pivot_table](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html) function.)


In [114]:
wide_df = pd.pivot(long_df
                   , index = [ 'Country', 'Region' ]
                   , columns = 'year'
                  ).reset_index()

wide_df

Unnamed: 0_level_0,Country,Region,Happiness Rank,Happiness Rank,Happiness Score,Happiness Score,Economy (GDP per Capita),Economy (GDP per Capita),Freedom,Freedom,Trust (Government Corruption),Trust (Government Corruption)
year,Unnamed: 1_level_1,Unnamed: 2_level_1,15,19,15,19,15,19,15,19,15,19
0,Afghanistan,Southern Asia,153,154,3.575,3.203,0.31982,0.350,0.23414,0.000,0.09719,0.025
1,Albania,Central and Eastern Europe,95,107,4.959,4.719,0.87867,0.947,0.35733,0.383,0.06413,0.027
2,Algeria,Middle East and Northern Africa,68,88,5.605,5.211,0.93929,1.002,0.28579,0.086,0.17383,0.114
3,Argentina,Latin America and Caribbean,30,47,6.574,6.086,1.05351,1.092,0.44974,0.471,0.08484,0.050
4,Armenia,Central and Eastern Europe,127,116,4.350,4.559,0.76821,0.850,0.19847,0.283,0.03900,0.064
...,...,...,...,...,...,...,...,...,...,...,...,...
144,Venezuela,Latin America and Caribbean,23,108,6.810,4.707,1.04424,0.960,0.42908,0.154,0.11069,0.047
145,Vietnam,Southeastern Asia,75,94,5.360,5.175,0.63216,0.741,0.59444,0.543,0.10441,0.073
146,Yemen,Middle East and Northern Africa,136,151,4.077,3.380,0.54649,0.287,0.35571,0.143,0.07854,0.077
147,Zambia,Sub-Saharan Africa,85,138,5.129,4.107,0.47038,0.578,0.48827,0.431,0.12468,0.087


The above has a hierarchical column structure. That's tricky to work with. So let's collapse to a single level.

In [115]:
list(zip(['a', 'b'], ['c', 'd']))

[('a', 'c'), ('b', 'd')]

In [116]:
level_0_cnames = wide_df.columns.get_level_values(0)
level_0_cnames

Index(['Country', 'Region', 'Happiness Rank', 'Happiness Rank',
       'Happiness Score', 'Happiness Score', 'Economy (GDP per Capita)',
       'Economy (GDP per Capita)', 'Freedom', 'Freedom',
       'Trust (Government Corruption)', 'Trust (Government Corruption)'],
      dtype='object')

In [117]:
level_1_cnames = wide_df.columns.get_level_values(1)
level_1_cnames

Index(['', '', 15, 19, 15, 19, 15, 19, 15, 19, 15, 19], dtype='object', name='year')

In [118]:
new_cnames = [(a+'_'+str(b)).rstrip('_') for (a,b) in zip(level_0_cnames, level_1_cnames)]
new_cnames

['Country',
 'Region',
 'Happiness Rank_15',
 'Happiness Rank_19',
 'Happiness Score_15',
 'Happiness Score_19',
 'Economy (GDP per Capita)_15',
 'Economy (GDP per Capita)_19',
 'Freedom_15',
 'Freedom_19',
 'Trust (Government Corruption)_15',
 'Trust (Government Corruption)_19']

In [119]:
wide_df.columns = new_cnames
wide_df

Unnamed: 0,Country,Region,Happiness Rank_15,Happiness Rank_19,Happiness Score_15,Happiness Score_19,Economy (GDP per Capita)_15,Economy (GDP per Capita)_19,Freedom_15,Freedom_19,Trust (Government Corruption)_15,Trust (Government Corruption)_19
0,Afghanistan,Southern Asia,153,154,3.575,3.203,0.31982,0.350,0.23414,0.000,0.09719,0.025
1,Albania,Central and Eastern Europe,95,107,4.959,4.719,0.87867,0.947,0.35733,0.383,0.06413,0.027
2,Algeria,Middle East and Northern Africa,68,88,5.605,5.211,0.93929,1.002,0.28579,0.086,0.17383,0.114
3,Argentina,Latin America and Caribbean,30,47,6.574,6.086,1.05351,1.092,0.44974,0.471,0.08484,0.050
4,Armenia,Central and Eastern Europe,127,116,4.350,4.559,0.76821,0.850,0.19847,0.283,0.03900,0.064
...,...,...,...,...,...,...,...,...,...,...,...,...
144,Venezuela,Latin America and Caribbean,23,108,6.810,4.707,1.04424,0.960,0.42908,0.154,0.11069,0.047
145,Vietnam,Southeastern Asia,75,94,5.360,5.175,0.63216,0.741,0.59444,0.543,0.10441,0.073
146,Yemen,Middle East and Northern Africa,136,151,4.077,3.380,0.54649,0.287,0.35571,0.143,0.07854,0.077
147,Zambia,Sub-Saharan Africa,85,138,5.129,4.107,0.47038,0.578,0.48827,0.431,0.12468,0.087
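The flattening steps above can also be written as a single list comprehension over the column tuples. The sketch below repeats the whole pivot-and-flatten round trip on a small invented frame (`long_toy`) so that it runs on its own:

```python
import pandas as pd

# Invented example data in long format
long_toy = pd.DataFrame({
    "Country": ["Xland", "Xland", "Yland", "Yland"],
    "year": [15, 19, 15, 19],
    "Score": [7.0, 7.5, 5.0, 4.5],
})

wide_toy = long_toy.pivot(index="Country", columns="year").reset_index()

# Each column label is a tuple such as ("Score", 15) or ("Country", "");
# join the non-empty parts with "_" to get a single-level name
wide_toy.columns = [
    "_".join(str(part) for part in col if str(part) != "")
    for col in wide_toy.columns
]
print(wide_toy)
```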


---

# Exercise for you

<ol>
    <li>Find two data files on the internet that you think might be relevant for your research. They must relate to the same subjects: e.g., countries, or companies, or books, or movies, or celebrities, or planets, ... (In our example above, we had data describing countries.)
    <li>Load up each file into a separate data frame.
    <li>Describe each data frame.
    <li>Clean any missing or incorrect values, in a way that makes sense to you.
    <li>Merge the two data frames together. Which kind of join makes sense for the data you have? Inner? Left? Right? Or maybe you don't know until it's time to perform further analysis?
</ol>


In [None]:
# TODO