# More Advanced Pandas

In this notebook we look at further aspects of working with Pandas DataFrames and Series, including normalising data, aggregating data, and addressing the problem of missing values in a DataFrame. 

Firstly, we will load a dataset of country-level statistics.

In [None]:
import pandas as pd

In [None]:
# read the dataset and set the index column
df = pd.read_csv("world_data.csv", index_col="Country")
# look at the first few rows
df.head()

## Frequency Tables

When working with a Series of categorical values, a frequency table counts how often each distinct value occurs. The function *value_counts()* returns a new Series containing the count of each unique value. By default, the result is sorted in descending order of count, so the most frequent value comes first, and missing values are excluded.

https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html

We could apply this to any of the categorical columns in our DataFrame.

In [None]:
df["Region"].value_counts()

In [None]:
df["Language"].value_counts()

In [None]:
df["Landlocked"].value_counts()

We can also normalise the values, to give the relative frequencies of the unique values (i.e. the fraction of entries in the Series which have a given value):

In [None]:
df["Language"].value_counts(normalize=True)

In [None]:
df["Region"].value_counts(normalize=True)

In [None]:
df["Landlocked"].value_counts(normalize=True)
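
To make the difference between absolute and relative frequencies concrete, here is a minimal self-contained sketch. The Series below is hypothetical toy data, not taken from *world_data.csv*.

```python
import pandas as pd

# hypothetical toy data, not taken from world_data.csv
regions = pd.Series(["Europe", "Asia", "Europe", "Africa", "Europe", "Asia"])

# absolute counts, most frequent value first
counts = regions.value_counts()

# relative frequencies: each count divided by the number of entries
fractions = regions.value_counts(normalize=True)

print(counts)
print(fractions)
```

The relative frequencies always sum to 1, which makes them easier to compare across Series of different lengths.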

## Aggregating Data

We can use the *groupby()* function to group data based on the values in a categorical column:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html

In [None]:
# group the countries by their region value
groups1 = df.groupby("Region")

We can now apply a range of statistical operations on the groups:

In [None]:
# get the mean of the numeric columns, per group
groups1.mean(numeric_only=True)

In [None]:
# get the total of the numeric columns, per group
groups1.sum(numeric_only=True)

In [None]:
# use an alternative categorical variable to aggregate the data
groups2 = df.groupby("Language")
groups2.mean(numeric_only=True)

In [None]:
# aggregate the data by a third categorical variable
groups3 = df.groupby("Landlocked")
groups3.mean(numeric_only=True)
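
Several statistics can also be computed in one pass with *agg()*. The sketch below uses a hypothetical toy DataFrame whose column names only mimic those used above, so it runs without the dataset.

```python
import pandas as pd

# hypothetical toy data; the column names only mimic those used above
toy = pd.DataFrame(
    {
        "Region": ["Europe", "Europe", "Asia", "Asia", "Africa"],
        "Population": [10.0, 30.0, 100.0, 50.0, 20.0],
        "Life Exp": [80.0, 82.0, 75.0, 77.0, 65.0],
    }
)

# one row per region, one column per (variable, statistic) pair
summary = toy.groupby("Region")[["Population", "Life Exp"]].agg(["mean", "max"])

# size() counts the number of rows in each group
sizes = toy.groupby("Region").size()

print(summary)
print(sizes)
```

Checking the group sizes alongside the means is a useful habit, since a mean computed from a single row says little about the group.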

## Cross Tabulation

*Cross tabulation* allows us to quantitatively analyse the relationship between multiple variables. In Pandas, this involves counting the frequency with which values from different columns in a DataFrame co-occur.

https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html

In the simplest case, we cross-tabulate two columns: the values of the first become the row index of the new DataFrame and the values of the second become its columns. For example, we can compare the Region and Landlocked columns, where Region will be the row index.

In [None]:
# compare a pair of categorical variables
pd.crosstab(df["Region"], df["Landlocked"])

In [None]:
# compare a different pair of categorical variables
pd.crosstab(df["Language"], df["Landlocked"])

In [None]:
# compare a different pair of categorical variables
pd.crosstab(df["Region"], df["Language"])
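
The *crosstab()* function can also report totals and row-wise proportions. The following sketch assumes a hypothetical toy DataFrame with "Yes"/"No" values in the Landlocked column; the real dataset may encode this column differently.

```python
import pandas as pd

# hypothetical toy data; the Yes/No encoding is an assumption
toy = pd.DataFrame(
    {
        "Region": ["Europe", "Europe", "Asia", "Asia", "Africa", "Africa"],
        "Landlocked": ["Yes", "No", "No", "No", "Yes", "No"],
    }
)

# margins=True appends an "All" row and column holding the totals
with_totals = pd.crosstab(toy["Region"], toy["Landlocked"], margins=True)

# normalize="index" converts each row into fractions that sum to 1
row_fractions = pd.crosstab(toy["Region"], toy["Landlocked"], normalize="index")

print(with_totals)
print(row_fractions)
```

Row-wise fractions answer questions such as "what share of the countries in each region is landlocked?", which raw counts do not show directly when regions differ in size.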

## Data Normalisation

Data normalisation is a preprocessing step which is often applied to numeric columns to transform their scale or range.

For instance, for country population data, we could normalise the values in this column in different ways.

We could divide each value by the maximum value in the Series:

In [None]:
df["Pop Norm"] = df["Population"] / df["Population"].max()
df.head(10)

Alternatively, we could subtract the mean value from each value in the column. Note that this can give negative values:

In [None]:
df["Pop Norm"] = df["Population"] - df["Population"].mean()
df.head(10)

A particularly common form of normalisation is to compute a *Z-score*, which involves subtracting the mean of a variable from each of its values and then dividing by its standard deviation:

https://en.wikipedia.org/wiki/Standard_score

In [None]:
df["Pop Norm"] = (df["Population"] - df["Population"].mean())/df["Population"].std()
df.head(10)
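
As a quick sanity check, Z-scores should have a mean of 0 and a standard deviation of 1. The sketch below verifies this on hypothetical toy values. Note that *std()* in Pandas computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# hypothetical toy values, not taken from world_data.csv
values = pd.Series([2.0, 4.0, 6.0, 8.0])

# Series.std() uses the sample standard deviation (ddof=1) by default
z = (values - values.mean()) / values.std()

print(z)
print(z.mean(), z.std())
```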

Another common normalisation method is *min-max normalisation*, which rescales the range of a feature's values to [0,1], based on its minimum and maximum values. We could apply this to the life expectancy values in our dataset as follows:

In [None]:
life_min = df["Life Exp"].min()
life_max = df["Life Exp"].max()
df["Life Exp Norm"] = (df["Life Exp"]-life_min)/(life_max-life_min)
df.head(10)
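
If several columns need the same treatment, the calculation can be wrapped in a small helper and applied column by column. This is a minimal sketch: the function name *min_max* and the toy DataFrame are hypothetical.

```python
import pandas as pd


def min_max(series: pd.Series) -> pd.Series:
    """Rescale a numeric Series to the range [0, 1]."""
    return (series - series.min()) / (series.max() - series.min())


# hypothetical toy data; the column names only mimic those used above
toy = pd.DataFrame(
    {
        "Population": [10.0, 30.0, 50.0],
        "Life Exp": [65.0, 75.0, 85.0],
    }
)

# DataFrame.apply calls the helper once per column
scaled = toy.apply(min_max)
print(scaled)
```

A column in which every value is identical has a range of zero, so this helper would produce NaN values for it. Such columns need to be handled separately.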

## Handling Missing Values

Many real datasets have missing values, either because the values exist but were not collected, or because they never existed.

In the example here, we consider a different dataset representing the passenger list from the Titanic. 

In [None]:
# load the data and use the passenger Id as the row index for the DataFrame
dft = pd.read_csv("titanic.csv", index_col="PassengerId")
dft.head(20)

In [None]:
dft.shape

When we load the *titanic.csv* dataset, we see that some columns have many missing values, i.e. they contain the null/empty value *NaN*.

In [None]:
# how many missing values per column?
dft.isnull().sum()
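
Raw counts are easier to judge when expressed as a fraction of the rows. Since *isnull()* produces Boolean values, taking the mean gives the proportion missing in each column. The toy DataFrame below is hypothetical and only imitates the kind of columns found in a passenger list.

```python
import pandas as pd

# hypothetical toy data imitating a passenger list
toy = pd.DataFrame(
    {
        "Age": [22.0, None, 35.0, None],
        "Cabin": [None, "C85", None, None],
        "Fare": [7.25, 71.28, 8.05, 8.46],
    }
)

# True counts as 1 and False as 0, so the mean is the fraction missing
missing_fraction = toy.isnull().mean()
print(missing_fraction)
```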

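One option is to simply drop a feature with many missing values. For example, we could drop the "Age" column using the *drop()* function. Note that this returns a new DataFrame and leaves *dft* itself unchanged:

In [None]:
# returns a copy without the Age column; dft itself is not modified
dft.drop(["Age"], axis=1)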

However, if we expect age to play an important role, then we want to keep the column and estimate the missing values in some way. A simple approach is to fill in the missing values with the mean of the observed values (the *mean()* function ignores NaN values by default). We can do this using the *fillna()* function.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html

In [None]:
mean_age = dft["Age"].mean()
mean_age

In [None]:
# replace all NaN values in the Age column with the mean value
dft["Age"] = dft["Age"].fillna(mean_age)
dft.head(20)

Confirm that the "Age" column no longer has any missing values:

In [None]:
dft.isnull().sum()
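
The overall mean is only one possible estimate. The sketch below shows two alternatives on hypothetical toy data: filling with the median, which is less sensitive to extreme values, and filling with the mean of each passenger's own group. The "Pclass" column name is an assumption about the dataset, used here purely for illustration.

```python
import pandas as pd

# hypothetical toy data; the "Pclass" column name is an assumption
toy = pd.DataFrame(
    {
        "Pclass": [1, 1, 1, 3, 3, 3],
        "Age": [40.0, None, 50.0, 20.0, 30.0, None],
    }
)

# option 1: fill with the overall median of the observed ages
median_filled = toy["Age"].fillna(toy["Age"].median())

# option 2: fill with the mean age of the passenger's own class;
# transform("mean") returns a Series aligned with the original rows
class_means = toy.groupby("Pclass")["Age"].transform("mean")
class_filled = toy["Age"].fillna(class_means)

print(median_filled)
print(class_filled)
```

Group-wise filling keeps more of the structure in the data, but it only works if the grouping column itself has no missing values in the affected rows.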