# More Advanced Pandas

In this notebook we look at further aspects of working with Pandas DataFrames and Series, including normalising data, aggregating data, and addressing the problem of missing values in a DataFrame. 

Firstly, we will load a dataset of country-level statistics.

In [1]:
import pandas as pd

In [2]:
# read the dataset and set the index column
df = pd.read_csv("data/world_data.csv", index_col="Country")
# look at the first few rows
df.head()

Country,Region,Population,Life Exp,Landlocked,Language
Argentina,South America,43.59,75.77,No,Spanish
Australia,Oceania,23.99,82.09,No,English
Brazil,South America,200.4,73.12,No,Portuguese
Canada,North America,35.99,80.99,No,English
Chad,Africa,11.63,49.81,Yes,Arabic


## Frequency Tables

When working with a Series of categorical values, a frequency table counts how often each distinct value occurs. The function *value_counts()* returns a new Series containing the counts of the unique values. By default, the result is sorted by count in descending order, and missing values are excluded.

https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html

We could apply this to any of the categorical columns in our DataFrame.

In [3]:
df["Region"].value_counts()

Region
Europe           6
Africa           4
North America    3
South America    3
Asia             3
Oceania          2
Name: count, dtype: int64

In [4]:
df["Language"].value_counts()

Language
English       7
Spanish       4
Portuguese    2
Arabic        2
German        2
Chinese       1
Japanese      1
French        1
Korean        1
Name: count, dtype: int64

In [5]:
df["Landlocked"].value_counts()

Landlocked
No     17
Yes     4
Name: count, dtype: int64

We can also normalise the values, to give the relative frequencies of the unique values (i.e. the fraction of entries in the Series which have a given value):

In [6]:
df["Language"].value_counts(normalize=True)

Language
English       0.333333
Spanish       0.190476
Portuguese    0.095238
Arabic        0.095238
German        0.095238
Chinese       0.047619
Japanese      0.047619
French        0.047619
Korean        0.047619
Name: proportion, dtype: float64

In [7]:
df["Region"].value_counts(normalize=True)

Region
Europe           0.285714
Africa           0.190476
North America    0.142857
South America    0.142857
Asia             0.142857
Oceania          0.095238
Name: proportion, dtype: float64

In [8]:
df["Landlocked"].value_counts(normalize=True)

Landlocked
No     0.809524
Yes    0.190476
Name: proportion, dtype: float64
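
The behaviour of *value_counts()* can be adjusted through its parameters. The following is a minimal sketch using a small, hypothetical Series (not taken from *world_data.csv*), showing how missing values could be included in the counts and how the result could be ordered by label rather than by frequency:

```python
import numpy as np
import pandas as pd

# hypothetical toy Series with one missing entry (not from world_data.csv)
languages = pd.Series(["English", "Spanish", "English", np.nan, "German", "English"])

# default behaviour: NaN is excluded, counts are sorted in descending order
counts = languages.value_counts()

# dropna=False adds an entry counting the missing values
counts_with_nan = languages.value_counts(dropna=False)

# sort_index() orders the result by label instead of by count
counts_by_label = languages.value_counts().sort_index()

print(counts)
print(counts_with_nan)
print(counts_by_label)
```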

## Aggregating Data

We can use the *groupby()* function to group data based on the values in a categorical column:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html

In [12]:
# group the countries by their region value
groups1 = df.groupby("Region")

We can now apply a range of statistical operations on the groups:

In [13]:
# get the mean of the numeric columns, per group
groups1.mean(numeric_only=True)

Region,Population,Life Exp
Africa,76.76,56.68
Asia,511.656667,80.82
Europe,36.123333,81.191667
North America,161.546667,78.183333
Oceania,14.325,81.38
South America,83.59,75.293333


In [14]:
# get the total of the numeric columns, per group
groups1.sum(numeric_only=True)

Region,Population,Life Exp
Africa,307.04,226.72
Asia,1534.97,242.46
Europe,216.74,487.15
North America,484.64,234.55
Oceania,28.65,162.76
South America,250.77,225.88


In [15]:
# use an alternative categorical variable to aggregate the data
groups2 = df.groupby("Language")
groups2.mean(numeric_only=True)

Language,Population,Life Exp
Arabic,51.0,60.145
Chinese,1357.0,74.87
English,91.777143,76.257143
French,18.05,55.13
German,44.79,81.37
Japanese,126.26,84.36
Korean,51.71,83.23
Portuguese,105.345,76.9
Spanish,56.27,77.825


In [16]:
# aggregate the data by whether or not a country is landlocked
groups2 = df.groupby("Landlocked")
groups2.mean(numeric_only=True)

Landlocked,Population,Life Exp
No,163.425294,77.358235
Yes,11.145,66.1075
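
If several statistics are needed at once, one option is the *agg()* function on the grouped data. The sketch below uses a small, hypothetical DataFrame whose column names mirror those above; the values are made up for illustration:

```python
import pandas as pd

# hypothetical toy data; only the column names mirror the notebook
toy = pd.DataFrame(
    {
        "Region": ["Europe", "Europe", "Asia", "Asia", "Asia"],
        "Population": [10.0, 30.0, 100.0, 200.0, 300.0],
        "Life Exp": [80.0, 82.0, 70.0, 75.0, 80.0],
    }
)

# several statistics for a single column, per group
pop_stats = toy.groupby("Region")["Population"].agg(["mean", "min", "max"])

# a different statistic for each column, per group
mixed = toy.groupby("Region").agg({"Population": "sum", "Life Exp": "mean"})

print(pop_stats)
print(mixed)
```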


## Cross Tabulation

*Cross tabulation* allows us to quantitatively analyse the relationship between multiple variables. In Pandas, this involves counting the frequency with which values from different columns in a DataFrame co-occur.

https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html

In the simplest case, we cross-tabulate one column against another. For example, we can compare the Region and Landlocked columns, where the Region values form the row index of the new DataFrame and the Landlocked values form its columns.

In [17]:
# compare a pair of categorical variables
pd.crosstab(df["Region"], df["Landlocked"])

Landlocked,No,Yes
Region,,
Africa,2,2
Asia,3,0
Europe,5,1
North America,3,0
Oceania,2,0
South America,2,1


In [18]:
# compare a different pair of categorical variables
pd.crosstab(df["Language"], df["Landlocked"])

Landlocked,No,Yes
Language,,
Arabic,1,1
Chinese,1,0
English,7,0
French,0,1
German,1,1
Japanese,1,0
Korean,1,0
Portuguese,2,0
Spanish,3,1


In [19]:
# compare a different pair of categorical variables
pd.crosstab(df["Region"], df["Language"])

Language,Arabic,Chinese,English,French,German,Japanese,Korean,Portuguese,Spanish
Region,,,,,,,,,
Africa,2,0,1,1,0,0,0,0,0
Asia,0,1,0,0,0,1,1,0,0
Europe,0,0,2,0,2,0,0,1,1
North America,0,0,2,0,0,0,0,0,1
Oceania,0,0,2,0,0,0,0,0,0
South America,0,0,0,0,0,0,0,1,2
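
The *crosstab()* function also accepts the parameters *margins* and *normalize*, which can add totals or convert the counts to fractions. A minimal sketch, again on hypothetical toy data rather than the dataset above:

```python
import pandas as pd

# hypothetical toy data; only the column names mirror the notebook
toy = pd.DataFrame(
    {
        "Region": ["Africa", "Africa", "Europe", "Europe", "Europe", "Asia"],
        "Landlocked": ["Yes", "No", "No", "No", "Yes", "No"],
    }
)

# margins=True appends an "All" row and an "All" column holding the totals
with_totals = pd.crosstab(toy["Region"], toy["Landlocked"], margins=True)

# normalize="index" converts each row to fractions that sum to 1
row_fractions = pd.crosstab(toy["Region"], toy["Landlocked"], normalize="index")

print(with_totals)
print(row_fractions)
```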


## Data Normalisation

Data normalisation is a preprocessing step which is often applied to numeric columns to transform their scale or range.

For instance, for country population data, we could normalise the values in this column in different ways.

We could divide each value by the maximum value in the Series:

In [20]:
df["Pop Norm"] = df["Population"] / df["Population"].max()
df.head(10)

Country,Region,Population,Life Exp,Landlocked,Language,Pop Norm
Argentina,South America,43.59,75.77,No,Spanish,0.032122
Australia,Oceania,23.99,82.09,No,English,0.017679
Brazil,South America,200.4,73.12,No,Portuguese,0.147679
Canada,North America,35.99,80.99,No,English,0.026522
Chad,Africa,11.63,49.81,Yes,Arabic,0.00857
China,Asia,1357.0,74.87,No,Chinese,1.0
Egypt,Africa,90.37,70.48,No,Arabic,0.066595
Germany,Europe,81.46,80.24,No,German,0.060029
Ireland,Europe,4.64,80.15,No,English,0.003419
Japan,Asia,126.26,84.36,No,Japanese,0.093043


Alternatively, we could subtract the mean value from each value in the column. Note that this can give negative values:

In [21]:
df["Pop Norm"] = df["Population"] - df["Population"].mean()
df.head(10)

Country,Region,Population,Life Exp,Landlocked,Language,Pop Norm
Argentina,South America,43.59,75.77,No,Spanish,-90.829524
Australia,Oceania,23.99,82.09,No,English,-110.429524
Brazil,South America,200.4,73.12,No,Portuguese,65.980476
Canada,North America,35.99,80.99,No,English,-98.429524
Chad,Africa,11.63,49.81,Yes,Arabic,-122.789524
China,Asia,1357.0,74.87,No,Chinese,1222.580476
Egypt,Africa,90.37,70.48,No,Arabic,-44.049524
Germany,Europe,81.46,80.24,No,German,-52.959524
Ireland,Europe,4.64,80.15,No,English,-129.779524
Japan,Asia,126.26,84.36,No,Japanese,-8.159524


A particularly common form of normalisation is to compute a *Z-score*, which involves subtracting the mean value of a variable from each value and then dividing the result by the variable's standard deviation:

https://en.wikipedia.org/wiki/Standard_score

In [22]:
df["Pop Norm"] = (df["Population"] - df["Population"].mean())/df["Population"].std()
df.head(10)

Country,Region,Population,Life Exp,Landlocked,Language,Pop Norm
Argentina,South America,43.59,75.77,No,Spanish,-0.311463
Australia,Oceania,23.99,82.09,No,English,-0.378674
Brazil,South America,200.4,73.12,No,Portuguese,0.226254
Canada,North America,35.99,80.99,No,English,-0.337524
Chad,Africa,11.63,49.81,Yes,Arabic,-0.421057
China,Asia,1357.0,74.87,No,Chinese,4.192348
Egypt,Africa,90.37,70.48,No,Arabic,-0.15105
Germany,Europe,81.46,80.24,No,German,-0.181603
Ireland,Europe,4.64,80.15,No,English,-0.445027
Japan,Asia,126.26,84.36,No,Japanese,-0.02798
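
One detail worth keeping in mind is that *std()* in Pandas computes the sample standard deviation (ddof=1) by default, whereas some other libraries default to the population standard deviation (ddof=0). The sketch below, on a small hypothetical Series, checks that Z-scores have a mean of 0 and a standard deviation of 1, and shows how the two conventions differ:

```python
import pandas as pd

# hypothetical toy values, chosen so that the population standard deviation is exactly 2
pop = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Series.std() uses the sample standard deviation (ddof=1) by default
z_sample = (pop - pop.mean()) / pop.std()

# passing ddof=0 uses the population standard deviation instead
z_population = (pop - pop.mean()) / pop.std(ddof=0)

# after standardising, the mean is 0 and the standard deviation is 1
# (up to floating-point rounding, and measured with the same ddof)
print(z_sample.mean(), z_sample.std())
print(z_population.mean(), z_population.std(ddof=0))
```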


Another common normalisation method is *min-max normalisation*, which rescales the range of a feature's values to [0,1], based on its minimum and maximum values. We could apply this to the life expectancy values in our dataset as follows:

In [23]:
life_min = df["Life Exp"].min()
life_max = df["Life Exp"].max()
df["Life Exp Norm"] = (df["Life Exp"]-life_min)/(life_max-life_min)
df.head(10)

Country,Region,Population,Life Exp,Landlocked,Language,Pop Norm,Life Exp Norm
Argentina,South America,43.59,75.77,No,Spanish,-0.311463,0.751375
Australia,Oceania,23.99,82.09,No,English,-0.378674,0.934298
Brazil,South America,200.4,73.12,No,Portuguese,0.226254,0.674674
Canada,North America,35.99,80.99,No,English,-0.337524,0.90246
Chad,Africa,11.63,49.81,Yes,Arabic,-0.421057,0.0
China,Asia,1357.0,74.87,No,Chinese,4.192348,0.725326
Egypt,Africa,90.37,70.48,No,Arabic,-0.15105,0.598263
Germany,Europe,81.46,80.24,No,German,-0.181603,0.880753
Ireland,Europe,4.64,80.15,No,English,-0.445027,0.878148
Japan,Asia,126.26,84.36,No,Japanese,-0.02798,1.0
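
To apply the same rescaling to several columns, the calculation could be wrapped in a small helper function. The function name *min_max* and the toy data below are hypothetical, and returning zeros for a constant column is just one possible convention:

```python
import pandas as pd


def min_max(series: pd.Series) -> pd.Series:
    """Rescale a numeric Series to the range [0, 1]."""
    value_range = series.max() - series.min()
    if value_range == 0:
        # a constant column has no spread; return zeros to avoid dividing by zero
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / value_range


# hypothetical toy data; only the column names mirror the notebook
toy = pd.DataFrame(
    {"Population": [10.0, 20.0, 40.0], "Life Exp": [50.0, 75.0, 100.0]}
)

# DataFrame.apply() calls the function once per column
scaled = toy.apply(min_max)
print(scaled)
```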


## Handling Missing Values

Many real datasets have missing values, either because the values exist but were not collected, or because they never existed.

In the example here, we consider a different dataset representing the passenger list from the Titanic. 

In [25]:
# load the data and use the passenger Id as the row index for the DataFrame
dft = pd.read_csv("data/titanic.csv", index_col="PassengerId")
dft.head(20)

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.225,,S
898,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0,,S
900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
901,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.15,,S


In [26]:
dft.shape

(418, 10)

When we load the *titanic.csv* dataset, we see that some columns have many missing values, i.e. they contain the null/empty value *NaN*.

In [27]:
# how many missing values per column?
dft.isnull().sum()

Pclass        0
Name          0
Sex           0
Age          86
SibSp         0
Parch         0
Ticket        0
Fare          1
Cabin       327
Embarked      0
dtype: int64
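
It can also be useful to express the missing values as a fraction of the rows, or to inspect the affected rows directly. A minimal sketch on a hypothetical toy DataFrame (the column names mirror the Titanic data, the values are made up):

```python
import numpy as np
import pandas as pd

# hypothetical toy data; only the column names mirror titanic.csv
toy = pd.DataFrame(
    {
        "Age": [34.5, np.nan, np.nan, 27.0],
        "Fare": [7.83, 7.0, np.nan, 8.66],
        "Cabin": [np.nan, np.nan, np.nan, "C105"],
    }
)

# the mean of a boolean column is the fraction of True values,
# so this gives the fraction of missing values per column
missing_fraction = toy.isnull().mean()

# select the rows that contain at least one missing value
rows_with_missing = toy[toy.isnull().any(axis=1)]

print(missing_fraction)
print(rows_with_missing)
```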

One option is to simply drop a feature with many missing values. For example, we could drop the "Age" column using the *drop()* function. Note that this returns a new DataFrame, so *dft* itself still contains the column:

In [28]:
dft.drop(["Age"], axis=1)

PassengerId,Pclass,Name,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,1,0,363272,7.0000,,S
894,2,"Myles, Mr. Thomas Francis",male,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...
1305,3,"Spector, Mr. Woolf",male,0,0,A.5. 3236,8.0500,,S
1306,1,"Oliva y Ocana, Dona. Fermina",female,0,0,PC 17758,108.9000,C105,C
1307,3,"Saether, Mr. Simon Sivertsen",male,0,0,SOTON/O.Q. 3101262,7.2500,,S
1308,3,"Ware, Mr. Frederick",male,0,0,359309,8.0500,,S
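
Instead of naming the columns to remove, we could let Pandas discard whatever contains missing values using *dropna()*. The sketch below uses a hypothetical toy DataFrame to contrast dropping a named column with dropping incomplete rows or columns:

```python
import numpy as np
import pandas as pd

# hypothetical toy data; only the column and index names mirror titanic.csv
toy = pd.DataFrame(
    {"Age": [34.5, np.nan, 62.0], "Fare": [7.83, 7.0, 9.69]},
    index=pd.Index([892, 893, 894], name="PassengerId"),
)

# drop() returns a new DataFrame; toy itself still has the "Age" column
without_age = toy.drop(columns=["Age"])

# dropna() removes every row that contains a missing value
complete_rows = toy.dropna()

# with axis=1 it removes every column that contains a missing value
complete_columns = toy.dropna(axis=1)

print(without_age)
print(complete_rows)
print(complete_columns)
```

Dropping rows keeps all features but reduces the number of observations, so which option is preferable depends on how the missing values are distributed.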


However, if we expect age to play an important role, then we want to keep the column and estimate the missing values in some way. A simple approach is to fill in missing values using the mean value. We can do this using the *fillna()* function.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html

In [29]:
mean_age = dft["Age"].mean()
mean_age

np.float64(30.272590361445783)

In [30]:
# replace all NaN values in the Age column with the mean value
dft["Age"] = dft["Age"].fillna(mean_age)
dft.head(20)

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.225,,S
898,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0,,S
900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
901,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.15,,S


We can confirm that the "Age" column no longer has any missing values:

In [31]:
dft.isnull().sum()

Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Ticket        0
Fare          1
Cabin       327
Embarked      0
dtype: int64
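
The overall mean is only one possible fill value. As a sketch of two alternatives, the example below fills gaps with the median, and with the mean age of passengers in the same class. It uses a hypothetical toy DataFrame, and the new column names are chosen purely for illustration:

```python
import numpy as np
import pandas as pd

# hypothetical toy data; only "Pclass" and "Age" mirror titanic.csv
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 1, 3, 3, 3],
        "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
    }
)

# the median is less sensitive to extreme values than the mean
toy["Age median"] = toy["Age"].fillna(toy["Age"].median())

# transform("mean") returns the group mean aligned to every row,
# so each gap is filled with the mean age of the same passenger class
class_mean = toy.groupby("Pclass")["Age"].transform("mean")
toy["Age by class"] = toy["Age"].fillna(class_mean)

print(toy)
```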