# POLI 175 - Machine Learning for Social Sciences

## Python Refresh II

---

## Load Pandas and Numpy

To get started, let us load Pandas and Numpy:

In [1]:
# My code here
import pandas as pd
import numpy as np

## Load Datasets

We will load two datasets here:

In [2]:
# My code here
PErisk = pd.read_csv("https://raw.githubusercontent.com/umbertomig/qtm150/master/datasets/PErisk.csv")
tips = pd.read_csv("https://raw.githubusercontent.com/umbertomig/qtm151/main/datasets/tips.csv")

Let's explore the datasets we just loaded.

In [3]:
# My code here
PErisk.head()

,country,courts,barb2,prsexp2,prscorr2,gdpw2
0,Argentina,0,-0.720775,1,3,9.69017
1,Australia,1,-6.907755,5,4,10.30484
2,Austria,1,-4.910337,5,4,10.10094
3,Bangladesh,0,0.775975,1,0,8.379768
4,Belgium,1,-4.617344,5,4,10.25012


In [4]:
tips.head()

,obs,totbill,tip,sex,smoker,day,time,size
0,1,16.99,1.01,F,No,Sun,Night,2
1,2,10.34,1.66,M,No,Sun,Night,3
2,3,21.01,3.5,M,No,Sun,Night,3
3,4,23.68,3.31,M,No,Sun,Night,2
4,5,24.59,3.61,F,No,Sun,Night,4


In [6]:
PErisk.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   country   62 non-null     object 
 1   courts    62 non-null     int64  
 2   barb2     62 non-null     float64
 3   prsexp2   62 non-null     int64  
 4   prscorr2  62 non-null     int64  
 5   gdpw2     62 non-null     float64
dtypes: float64(2), int64(3), object(1)
memory usage: 3.0+ KB


In [7]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   obs      244 non-null    int64  
 1   totbill  244 non-null    float64
 2   tip      244 non-null    float64
 3   sex      244 non-null    object 
 4   smoker   244 non-null    object 
 5   day      244 non-null    object 
 6   time     244 non-null    object 
 7   size     244 non-null    int64  
dtypes: float64(2), int64(2), object(4)
memory usage: 15.4+ KB


## Counting

### Counting data

To count how often each value of a variable occurs, we use `value_counts()`:

```
dat["variable"].value_counts()
```

The counts come back sorted from most to least frequent by default. Passing `sort = True` makes that explicit:

```
dat["variable"].value_counts(sort = True)
```

We can also count proportions:

```
dat["variable"].value_counts(normalize = True)
```
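
As a quick check of the calls above, here is a minimal sketch on a made-up six-row dataset. The data frame `dat` and its `day` values are invented for illustration and are not the course data.

```python
import pandas as pd

# Hypothetical toy data standing in for `dat`; not the course dataset.
dat = pd.DataFrame({"day": ["Sat", "Sun", "Sat", "Thu", "Sat", "Sun"]})

# Raw counts, most frequent value first (sort=True is the default).
counts = dat["day"].value_counts()

# Proportions instead of counts; they add up to 1.
props = dat["day"].value_counts(normalize=True)

print(counts)
print(props)
```

Here `Sat` appears three times out of six, so its proportion is one half.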

Let's try it!

### Detecting missing data

We can also detect missing data with `isna()`, which returns `True` wherever a value is missing:

```
dat.isna()
```

And if we want, we can count the missing values in each variable:

```
dat.isna().sum()
```

To remove the rows that contain missing values:

```
dat.dropna()
```

Or we can fill the missing values with a custom value (proceed with caution here!):

```
dat.fillna(0)
```
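
The missing-data calls above can be sketched on a small made-up data frame. The columns `x` and `y` are hypothetical. Note that `dropna()` and `fillna()` return new data frames and leave `dat` unchanged unless we reassign the result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
dat = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": ["a", "b", None, "d"],
})

missing_by_var = dat.isna().sum()  # number of missing values per column
complete = dat.dropna()            # keeps only rows with no missing values
filled = dat.fillna({"x": 0})      # fills column x only; y keeps its gap

print(missing_by_var)
print(len(dat), "rows before,", len(complete), "rows after dropna()")
```

Passing a dictionary to `fillna` is one way to fill a single variable instead of putting the same value everywhere.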

In [None]:
# My code here
PErisk.columns

In [10]:
PErisk["courts"].value_counts()

0    34
1    28
Name: courts, dtype: int64

**Exercise**: In the `tips` dataset, count the number of observations by weekday. Then, normalize the counts to get proportions.

In [12]:
## Your answers here!
tips.head()

,obs,totbill,tip,sex,smoker,day,time,size
0,1,16.99,1.01,F,No,Sun,Night,2
1,2,10.34,1.66,M,No,Sun,Night,3
2,3,21.01,3.5,M,No,Sun,Night,3
3,4,23.68,3.31,M,No,Sun,Night,2
4,5,24.59,3.61,F,No,Sun,Night,4


In [16]:
tips["day"].value_counts()

Sat    87
Sun    76
Thu    62
Fri    19
Name: day, dtype: int64

In [15]:
tips["day"].value_counts(ascending = False, normalize = True)

Sat    0.356557
Sun    0.311475
Thu    0.254098
Fri    0.077869
Name: day, dtype: float64

In [17]:
tips[["day", "time"]].value_counts(ascending = False, normalize = True)

day  time 
Sat  Night    0.356557
Sun  Night    0.311475
Thu  Day      0.250000
Fri  Night    0.049180
     Day      0.028689
Thu  Night    0.004098
dtype: float64

## Summary by groups

Suppose we want the mean of `gdpw2` for countries with and without courts. There are two ways:

```
# Hard way
PErisk[PErisk['courts'] == 0]['gdpw2'].mean()
PErisk[PErisk['courts'] == 1]['gdpw2'].mean()
```

Or, we can use the `groupby` method in Pandas:

```
# Easy way
PErisk.groupby("courts")["gdpw2"].mean()
```
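
To see that the two approaches agree, here is a minimal sketch on made-up numbers. The column names mirror `PErisk`, but the five rows are invented for illustration.

```python
import pandas as pd

# Hypothetical values; only the column names follow the PErisk data.
dat = pd.DataFrame({
    "courts": [0, 1, 1, 0, 1],
    "gdpw2": [8.0, 10.0, 10.5, 9.0, 9.5],
})

# Hard way: filter and summarize once per group.
mean_no_courts = dat[dat["courts"] == 0]["gdpw2"].mean()
mean_courts = dat[dat["courts"] == 1]["gdpw2"].mean()

# Easy way: one call returns both means, indexed by the group value.
by_courts = dat.groupby("courts")["gdpw2"].mean()

print(by_courts)
```

The `groupby` version also scales to variables with many categories, where filtering by hand would need one line per category.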

In [20]:
# My code here
PErisk.groupby("prsexp2")[["gdpw2"]].mean()

prsexp2,gdpw2
0,8.975829
1,8.483506
2,8.695454
3,8.613695
4,8.947314
5,10.139483


**Exercise**: In the `tips` dataset, compute the mean of tips by weekday.

In [22]:
## Your answers here!
tips.groupby("day")[["tip"]].mean()

day,tip
Fri,2.734737
Sat,2.993103
Sun,3.255132
Thu,2.771452


In [24]:
tips.groupby(["day", "time"])[["tip"]].sum()

day,time,tip
Fri,Day,16.68
Fri,Night,35.28
Sat,Night,260.4
Sun,Night,247.39
Thu,Day,168.83
Thu,Night,3.0


### Summary by groups (multiple functions)

To compute several summary statistics for each group at once, we can pass a list of functions to `.agg()`:

```
dat.groupby("var_group")["var_stat"].agg([stat1, stat2, stat3])
```
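
A minimal sketch of the pattern above, on made-up tips. Here the statistics are passed as strings (`"min"`, `"max"`, `"sum"`), which is one common way to name them; the data are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; not the course dataset.
dat = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat", "Sat"],
    "tip": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# One row per group, one column per statistic.
summary = dat.groupby("day")["tip"].agg(["min", "max", "sum"])

print(summary)
```

The result is a data frame, so a single cell can be read with `summary.loc["Sat", "sum"]`.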

### Summary by groups (multiple levels)

To group by more than one variable, we can pass a list of column names to `groupby`:

```
dat.groupby(["varlevel1", "varlevel2"])["var_stat"].mean()
```
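
Grouping by two variables returns a result with a two-level index. The sketch below, on invented data, shows how to read one combination from it and how to flatten it back into ordinary columns.

```python
import pandas as pd

# Hypothetical toy data; not the course dataset.
dat = pd.DataFrame({
    "day": ["Fri", "Fri", "Fri", "Sat", "Sat"],
    "time": ["Day", "Night", "Night", "Night", "Night"],
    "tip": [2.0, 3.0, 5.0, 1.0, 3.0],
})

# One mean per (day, time) combination that appears in the data.
by_day_time = dat.groupby(["day", "time"])["tip"].mean()

# A tuple of labels picks out a single combination.
fri_night = by_day_time.loc[("Fri", "Night")]

# reset_index() turns the two index levels back into columns.
flat = by_day_time.reset_index()

print(by_day_time)
print(flat)
```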

In [27]:
# My code here
PErisk.groupby("prscorr2")["gdpw2"].agg(["min", "max", "sum"])

prscorr2,min,max,sum
0,7.095064,8.727616,41.178301
1,7.096721,9.664151,89.571847
2,7.501082,9.84882,157.378553
3,7.029973,10.26078,101.910794
4,9.167329,10.30484,78.866357
5,9.882724,10.41018,91.690384


**Exercise**: For the `tips` dataset:

1. Compute the maximum, minimum, and sum of tips by weekday.
2. Compute the sum of tips by weekday and time of day.

In [26]:
## Your answers here!
tips.groupby("day")["tip"].agg(["max", "min"])

day,max,min
Fri,4.73,1.0
Sat,10.0,1.0
Sun,6.5,1.01
Thu,6.7,1.25


In [30]:
tips.groupby(["day", "time"])["tip"].agg(["sum"])

day,time,sum
Fri,Day,16.68
Fri,Night,35.28
Sat,Night,260.4
Sun,Night,247.39
Thu,Day,168.83
Thu,Night,3.0


## Indexing

To inspect the column labels and the row index, we use:

```
dat.columns
dat.index
```

We can set a variable as the index:

```
dat_ind = dat.set_index("var_index")
```

And to turn the index back into a regular column:

```
dat_ind.reset_index()
```

We index because it makes subsetting simple:

```
# Hard way:
PErisk[PErisk["country"].isin(["Argentina", "Austria"])]

# Easy way:
PErisk_ind.loc[["Argentina", "Austria"]]
```
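
The indexing steps above can be sketched end to end on a three-row made-up dataset. The column names follow `PErisk`, but the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow PErisk.
dat = pd.DataFrame({
    "country": ["Argentina", "Austria", "Belgium"],
    "courts": [0, 1, 1],
})

dat_ind = dat.set_index("country")  # country becomes the row label

# Easy way: select rows by label.
subset = dat_ind.loc[["Argentina", "Austria"]]

# Hard way: filter on the column; it selects the same rows.
same_rows = dat[dat["country"].isin(["Argentina", "Austria"])]

# reset_index() moves country back into a regular column.
restored = dat_ind.reset_index()

print(subset)
```

`set_index` returns a new data frame, so `dat` itself keeps its original numeric index.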

Also, index values do not need to be unique, and an index can have multiple levels.
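
Both points can be illustrated with a short sketch on invented data: a repeated label returns every matching row, and a two-level index is queried with a tuple of labels.

```python
import pandas as pd

# Hypothetical toy data; not the course dataset.
dat = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun"],
    "time": ["Day", "Night", "Day", "Night"],
    "tip": [1.0, 2.0, 3.0, 4.0],
})

# Non-unique index: "Sat" labels two rows, and .loc returns both.
by_day = dat.set_index("day")
sat_rows = by_day.loc["Sat"]

# Two-level index: a tuple of labels identifies one row.
by_day_time = dat.set_index(["day", "time"])
sun_night_tip = by_day_time.loc[("Sun", "Night"), "tip"]

print(sat_rows)
print(sun_night_tip)
```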

In [31]:
# My code here
PErisk_ind = PErisk.set_index("country")
PErisk_ind

country,courts,barb2,prsexp2,prscorr2,gdpw2
Argentina,0,-0.720775,1,3,9.690170
Australia,1,-6.907755,5,4,10.304840
Austria,1,-4.910337,5,4,10.100940
Bangladesh,0,0.775975,1,0,8.379768
Belgium,1,-4.617344,5,4,10.250120
...,...,...,...,...,...
United Kingdom,1,-6.907755,5,5,10.127270
Uruguay,0,-2.127775,2,2,9.414342
Venezuela,1,0.428845,3,2,9.848820
Zambia,0,0.965811,3,1,7.726213


In [35]:
PErisk_ind.loc[["Argentina", "Austria"]]

country,courts,barb2,prsexp2,prscorr2,gdpw2
Argentina,0,-0.720775,1,3,9.69017
Austria,1,-4.910337,5,4,10.10094


**Exercise**: In the `tips` dataset, index the data by the variable `obs`. Then, subset observations 33 and 132.

In [33]:
## Your answers here!
tips_ind = tips.set_index("obs")
tips_ind

obs,totbill,tip,sex,smoker,day,time,size
1,16.99,1.01,F,No,Sun,Night,2
2,10.34,1.66,M,No,Sun,Night,3
3,21.01,3.50,M,No,Sun,Night,3
4,23.68,3.31,M,No,Sun,Night,2
5,24.59,3.61,F,No,Sun,Night,4
...,...,...,...,...,...,...,...
240,29.03,5.92,M,No,Sat,Night,3
241,27.18,2.00,F,Yes,Sat,Night,2
242,22.67,2.00,M,Yes,Sat,Night,2
243,17.82,1.75,M,No,Sat,Night,2


In [38]:
tips_ind.loc[[33, 132]]

obs,totbill,tip,sex,smoker,day,time,size
33,15.06,3.0,F,No,Sat,Night,2
132,20.27,2.83,F,No,Thu,Day,2
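
Rows can be selected by label with `.loc` or by position with `.iloc`. When the index is a counter that starts at 1, like `obs`, label 33 sits at position 32. The sketch below uses a made-up four-row dataset to show that the two agree only while the rows stay in their original order.

```python
import pandas as pd

# Hypothetical toy data indexed by a 1-based counter.
dat = pd.DataFrame({"obs": [1, 2, 3, 4], "tip": [1.01, 1.66, 3.50, 3.31]})
dat_ind = dat.set_index("obs")

by_label = dat_ind.loc[[2, 4]]      # rows whose obs label is 2 or 4
by_position = dat_ind.iloc[[1, 3]]  # second and fourth rows, counting from 0

# After sorting, labels still find the same observations; positions do not.
shuffled = dat_ind.sort_values("tip", ascending=False)
label_after = shuffled.loc[[2, 4]]["tip"].tolist()
position_after = shuffled.iloc[[1, 3]]["tip"].tolist()

print(by_label)
print(label_after, position_after)
```

Selecting by label is usually the safer choice once the data have been sorted or filtered.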


## Plots

Now, let's create some plots!

The library we use to create plots is `matplotlib`. We can import its `pyplot` module easily in Python:

```
from matplotlib import pyplot as plt
```

### Scatterplot

To make a scatterplot, we can use `plt.scatter`:

```
plt.scatter(dat.vx, dat.vy)
plt.show()
```

If we want to add axis labels and a title:

```
plt.scatter(dat.vx, dat.vy)
plt.xlabel("X-axis name")
plt.ylabel("Y-axis name")
plt.title("Plot title")
plt.show()
```
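
Here is a self-contained sketch of a labeled scatterplot on made-up points. The data frame `dat` and its columns `vx` and `vy` are hypothetical placeholders, as in the templates above.

```python
from matplotlib import pyplot as plt
import pandas as pd

# Hypothetical toy data; vx and vy are placeholder names.
dat = pd.DataFrame({"vx": [1, 2, 3, 4], "vy": [2.0, 4.1, 5.9, 8.2]})

# plt.scatter draws one marker per (x, y) pair.
points = plt.scatter(dat["vx"], dat["vy"])
plt.xlabel("X-axis name")
plt.ylabel("Y-axis name")
plt.title("Plot title")
plt.show()
```

Pandas offers an equivalent shortcut, `dat.plot(x="vx", y="vy", kind="scatter")`, which is where the `kind` argument belongs.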

In [None]:
# My code here

### Histogram

We can make a simple histogram using the function `.hist()`:

```
dat['variable'].hist()
plt.show()
```

And if we want overlapping histograms by a category:

```
dat[dat['vcat'] == 'v1']['variable'].hist()
dat[dat['vcat'] == 'v2']['variable'].hist()
plt.legend(["v1", "v2"])
plt.show()
```
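
A minimal sketch of overlapping histograms on invented data. It adds two things that are not in the template above and are only suggestions: `alpha` makes the bars translucent so the overlap stays visible, and `label` lets `plt.legend()` pick up the category names.

```python
from matplotlib import pyplot as plt
import pandas as pd

# Hypothetical toy data with two categories.
dat = pd.DataFrame({
    "vcat": ["v1"] * 5 + ["v2"] * 5,
    "variable": [1.0, 1.5, 2.0, 2.5, 3.0, 2.0, 2.5, 3.0, 3.5, 4.0],
})

# Each call draws on the same axes, so the histograms overlap.
for category in ["v1", "v2"]:
    dat[dat["vcat"] == category]["variable"].hist(alpha=0.5, label=category)

ax = plt.gca()  # the axes both histograms were drawn on
plt.legend()
plt.xlabel("variable")
plt.show()
```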

Let's try it!

In [None]:
# My code here

**Exercise**:

In [None]:
## Your answers here!

**Great job!!!**