# CM4125 Week 3: Data Manipulation

In [None]:
# This cell changes parameters of the RISE slideshow,
# such as the window width/height, and enables a scroll bar
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
              'width': 1700,
              'height': 800,
              'scroll': True,
})
# This code lets you colour the axis lines white
from matplotlib import style
style.use('dark_background')
# This code allows you to show images within the notebook
%matplotlib inline

## Coursework Clarification!
Today @ 4:00 pm, I will open a Zoom room

## Lecture objectives

1) Load a set of datasets containing different characteristics

2) Understand the best methods to manipulate them prior to analysis

3) Apply more elaborate pre-processing techniques to clean data

## Loading data (revised)

If you recall, last week we had to download and upload csv files

This is obviously not very practical!

The `Pandas` module allows us to load data from csv files that are stored online in places such as:

    Github
    Dropbox
    AWS
    etc

In [1]:
# Importing a dataset that I have in dropbox
import pandas as pd
# Copy the Dropbox link and replace dl=0 with raw=1
url = 'https://www.dropbox.com/s/ju1iemc8k1wp0p5/time.csv?raw=1'
time = pd.read_csv(url)
time 

Unnamed: 0,Year,Honor,Name,Country,Birth Year,Death Year,Title,Category,Context
0,1927,Man of the Year,Charles Lindbergh,United States,1902.0,1974.0,US Air Mail Pilot,,First Solo Transatlantic Flight
1,1928,Man of the Year,Walter Chrysler,United States,1875.0,1940.0,Founder of Chrysler,Economics,Chrysler/Dodge Merger
2,1929,Man of the Year,Owen D. Young,United States,1874.0,1962.0,Member of the German Reparations International...,Diplomacy,Young Plan
3,1930,Man of the Year,Mahatma Gandhi,India,1869.0,1948.0,,Revolution,Salt March
4,1931,Man of the Year,Pierre Laval,France,1883.0,1945.0,Prime Minister of France,Politics,
...,...,...,...,...,...,...,...,...,...
86,2012,Person of the Year,Barack Obama,United States,1961.0,,President of the United States,Politics,Presidential Election
87,2013,Person of the Year,Pope Francis,Vatican City,1936.0,,Pope of the Roman Catholic Church,Religion,Papal Conclave
88,2014,Person of the Year,The Ebola Fighters,,,,,Science,Ebola Epidemic
89,2015,Person of the Year,Angela Merkel,Germany,1954.0,,Chancellor of Germany,Politics,Debt Crisis; Refugee Crisis; Paris Terrorist A...


This dataset contains the list of people named Man/Person of the Year by Time magazine
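The link change described in the loading cell (swapping `dl=0` for `raw=1`) can be sketched as a small helper. This is a minimal sketch: the function name `to_direct_link` and the share link below are made up for illustration and do not point to a real file.

```python
def to_direct_link(share_url):
    """Turn a Dropbox share link containing dl=0 into one that serves the raw file."""
    return share_url.replace('dl=0', 'raw=1')

# Hypothetical share link, for illustration only
share_url = 'https://www.dropbox.com/s/abc123/example.csv?dl=0'
direct_url = to_direct_link(share_url)
print(direct_url)
```

The resulting string is what would be passed to `pd.read_csv`.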

## Data by Columns

Last week we added new columns to dataframes

    The dog/cat class at the end of the image repository
    The "watched" column at the end of the Netflix dataset

Today, we will learn other ways in which we can manipulate, add or relate data from columns

### Mapping Columns

Sometimes we want to convert a column of categorical data from strings (text) to integers (numbers)

We can define a `dictionary` which maps the strings (as keys) to integers (as values) using the `map` method
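As a minimal sketch of the hand-written dictionary approach, assuming a toy column of sizes (the names `sizes` and `size_codes` are illustrative and not part of the lecture's dataset):

```python
import pandas as pd

# Hypothetical toy column, not the lecture's dataset
sizes = pd.Series(['small', 'large', 'medium', 'small'])

# Hand-written dictionary: keys are the strings, values are the integers
size_codes = {'small': 0, 'medium': 1, 'large': 2}

# map looks up every entry of the column in the dictionary
encoded = sizes.map(size_codes)
print(encoded.tolist())
```

Writing the dictionary by hand is practical when there are only a few categories and their order matters.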

Another option is to define a `set` of the unique categories as follows.

In [2]:
categories = set(time['Category'])
categories

{'Diplomacy',
 'Economics',
 'Environment',
 'Media',
 'Philanthropy',
 'Politics',
 'Religion',
 'Revolution',
 'Science',
 'Space',
 'Technology',
 'War',
 nan}

Then, we can assign unique numbers to the members of this `set` and store them in a `dictionary` (defined in Python using `{}`)

In [3]:
mapping = {category : index for index, category in enumerate(categories)}
mapping

{nan: 0,
 'Environment': 1,
 'Philanthropy': 2,
 'Politics': 3,
 'Revolution': 4,
 'Religion': 5,
 'Technology': 6,
 'Economics': 7,
 'War': 8,
 'Media': 9,
 'Diplomacy': 10,
 'Space': 11,
 'Science': 12}

We now have a mapping (a `dictionary`) which allows us to convert from a category string to an `int`. For example:

In [4]:
mapping['Diplomacy']

10

In [5]:
mapping['Media']

9
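Because a `set` has no guaranteed order, the integers assigned this way can differ from one run to the next. One possible way to make the numbering reproducible is to drop the missing values and sort the categories before enumerating them. The sketch below uses a toy column; `toy_categories` and `stable_mapping` are illustrative names, not part of the lecture's notebook.

```python
import pandas as pd

# Hypothetical toy column with one missing entry
toy_categories = pd.Series(['Politics', 'Science', None, 'Politics', 'War'])

# Drop missing values, then sort so the numbering is the same on every run
unique_sorted = sorted(toy_categories.dropna().unique())
stable_mapping = {category: index for index, category in enumerate(unique_sorted)}
print(stable_mapping)

# Entries that are not keys of the dictionary (here, the missing one) become NaN
toy_encoded = toy_categories.map(stable_mapping)
print(toy_encoded)
```

With this variant, missing categories stay missing instead of receiving a number of their own.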

We can use the `map` method on the chosen column to do the conversion

The `map` method takes our `dictionary` as an argument and returns a new column with the mapping applied

In [6]:
time['Category'].map(mapping)

0      0
1      7
2     10
3      4
4      3
      ..
86     3
87     5
88    12
89     3
90     3
Name: Category, Length: 91, dtype: int64

If we wanted to overwrite the `Category` column with our new mapped values, we could do:

    time['Category'] = time['Category'].map(mapping)

However, it is often better to keep both, so we can instead store the result under a different column name

In [7]:
time['Category Ordinal'] = time['Category'].map(mapping)
time

Unnamed: 0,Year,Honor,Name,Country,Birth Year,Death Year,Title,Category,Context,Category Ordinal
0,1927,Man of the Year,Charles Lindbergh,United States,1902.0,1974.0,US Air Mail Pilot,,First Solo Transatlantic Flight,0
1,1928,Man of the Year,Walter Chrysler,United States,1875.0,1940.0,Founder of Chrysler,Economics,Chrysler/Dodge Merger,7
2,1929,Man of the Year,Owen D. Young,United States,1874.0,1962.0,Member of the German Reparations International...,Diplomacy,Young Plan,10
3,1930,Man of the Year,Mahatma Gandhi,India,1869.0,1948.0,,Revolution,Salt March,4
4,1931,Man of the Year,Pierre Laval,France,1883.0,1945.0,Prime Minister of France,Politics,,3
...,...,...,...,...,...,...,...,...,...,...
86,2012,Person of the Year,Barack Obama,United States,1961.0,,President of the United States,Politics,Presidential Election,3
87,2013,Person of the Year,Pope Francis,Vatican City,1936.0,,Pope of the Roman Catholic Church,Religion,Papal Conclave,5
88,2014,Person of the Year,The Ebola Fighters,,,,,Science,Ebola Epidemic,12
89,2015,Person of the Year,Angela Merkel,Germany,1954.0,,Chancellor of Germany,Politics,Debt Crisis; Refugee Crisis; Paris Terrorist A...,3


### Manipulating Numerical Columns

We can perform arithmetic between columns as if they were simply numbers

We will take the column `Death Year`, subtract the column `Birth Year`, and store the result in a (new) column called `Lifespan`

In [8]:
time['Lifespan'] = time['Death Year'] - time['Birth Year']
time

Unnamed: 0,Year,Honor,Name,Country,Birth Year,Death Year,Title,Category,Context,Category Ordinal,Lifespan
0,1927,Man of the Year,Charles Lindbergh,United States,1902.0,1974.0,US Air Mail Pilot,,First Solo Transatlantic Flight,0,72.0
1,1928,Man of the Year,Walter Chrysler,United States,1875.0,1940.0,Founder of Chrysler,Economics,Chrysler/Dodge Merger,7,65.0
2,1929,Man of the Year,Owen D. Young,United States,1874.0,1962.0,Member of the German Reparations International...,Diplomacy,Young Plan,10,88.0
3,1930,Man of the Year,Mahatma Gandhi,India,1869.0,1948.0,,Revolution,Salt March,4,79.0
4,1931,Man of the Year,Pierre Laval,France,1883.0,1945.0,Prime Minister of France,Politics,,3,62.0
...,...,...,...,...,...,...,...,...,...,...,...
86,2012,Person of the Year,Barack Obama,United States,1961.0,,President of the United States,Politics,Presidential Election,3,
87,2013,Person of the Year,Pope Francis,Vatican City,1936.0,,Pope of the Roman Catholic Church,Religion,Papal Conclave,5,
88,2014,Person of the Year,The Ebola Fighters,,,,,Science,Ebola Epidemic,12,
89,2015,Person of the Year,Angela Merkel,Germany,1954.0,,Chancellor of Germany,Politics,Debt Crisis; Refugee Crisis; Paris Terrorist A...,3,


Notice that where `Death Year` is `NaN` (since the person is still alive), the subtraction also produces `NaN` in the `Lifespan` column.
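The way `NaN` carries through column arithmetic can be reproduced on a tiny frame. This is a sketch with made-up rows; the frame `people` and its column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Two toy rows: one complete, one with no death year recorded
people = pd.DataFrame({
    'Born': [1902.0, 1961.0],
    'Died': [1974.0, np.nan],
})

# Arithmetic is done row by row; any NaN operand gives a NaN result
people['Lifespan'] = people['Died'] - people['Born']
print(people)
```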

Sometimes you want to do "more difficult" numerical conversions

In those cases it is more convenient to define a **function** and then use the `apply` method

In this example, we will define a function to convert years to decades to be applied in the different columns that contain years

In [9]:
def to_decade(value):
    return 10 * (value // 10)

First we should test our function on some examples to ensure the function is correct.

In [11]:
to_decade(1971)

1970
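A few more checks are worth running, in particular the boundaries of a decade and a missing value, since the year columns contain `NaN`. The sketch below repeats the function definition so that it runs on its own; the specific test years are arbitrary choices.

```python
import math

def to_decade(value):
    return 10 * (value // 10)

# First and last year of a decade, and the first year of the next one
assert to_decade(1970) == 1970
assert to_decade(1979) == 1970
assert to_decade(1980) == 1980

# A missing year (float NaN) stays missing instead of raising an error
missing_decade = to_decade(float('nan'))
print(math.isnan(missing_decade))
```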

Now we can use `apply` to transform the entire column `Birth Year` into the corresponding decade

In [12]:
time['Birth Decade'] = time['Birth Year'].apply(to_decade)
time

Unnamed: 0,Year,Honor,Name,Country,Birth Year,Death Year,Title,Category,Context,Category Ordinal,Lifespan,Birth Decade
0,1927,Man of the Year,Charles Lindbergh,United States,1902.0,1974.0,US Air Mail Pilot,,First Solo Transatlantic Flight,0,72.0,1900.0
1,1928,Man of the Year,Walter Chrysler,United States,1875.0,1940.0,Founder of Chrysler,Economics,Chrysler/Dodge Merger,7,65.0,1870.0
2,1929,Man of the Year,Owen D. Young,United States,1874.0,1962.0,Member of the German Reparations International...,Diplomacy,Young Plan,10,88.0,1870.0
3,1930,Man of the Year,Mahatma Gandhi,India,1869.0,1948.0,,Revolution,Salt March,4,79.0,1860.0
4,1931,Man of the Year,Pierre Laval,France,1883.0,1945.0,Prime Minister of France,Politics,,3,62.0,1880.0
...,...,...,...,...,...,...,...,...,...,...,...,...
86,2012,Person of the Year,Barack Obama,United States,1961.0,,President of the United States,Politics,Presidential Election,3,,1960.0
87,2013,Person of the Year,Pope Francis,Vatican City,1936.0,,Pope of the Roman Catholic Church,Religion,Papal Conclave,5,,1930.0
88,2014,Person of the Year,The Ebola Fighters,,,,,Science,Ebola Epidemic,12,,
89,2015,Person of the Year,Angela Merkel,Germany,1954.0,,Chancellor of Germany,Politics,Debt Crisis; Refugee Crisis; Paris Terrorist A...,3,,1950.0


### Manipulating String Columns

`apply` can also be used on strings

For example, we can apply the `len` function to `Name` to get the number of characters in each name.

In [13]:
time['Name Length'] = time['Name'].apply(len)
time

Unnamed: 0,Year,Honor,Name,Country,Birth Year,Death Year,Title,Category,Context,Category Ordinal,Lifespan,Birth Decade,Name Length
0,1927,Man of the Year,Charles Lindbergh,United States,1902.0,1974.0,US Air Mail Pilot,,First Solo Transatlantic Flight,0,72.0,1900.0,17
1,1928,Man of the Year,Walter Chrysler,United States,1875.0,1940.0,Founder of Chrysler,Economics,Chrysler/Dodge Merger,7,65.0,1870.0,15
2,1929,Man of the Year,Owen D. Young,United States,1874.0,1962.0,Member of the German Reparations International...,Diplomacy,Young Plan,10,88.0,1870.0,13
3,1930,Man of the Year,Mahatma Gandhi,India,1869.0,1948.0,,Revolution,Salt March,4,79.0,1860.0,14
4,1931,Man of the Year,Pierre Laval,France,1883.0,1945.0,Prime Minister of France,Politics,,3,62.0,1880.0,12
...,...,...,...,...,...,...,...,...,...,...,...,...,...
86,2012,Person of the Year,Barack Obama,United States,1961.0,,President of the United States,Politics,Presidential Election,3,,1960.0,12
87,2013,Person of the Year,Pope Francis,Vatican City,1936.0,,Pope of the Roman Catholic Church,Religion,Papal Conclave,5,,1930.0,12
88,2014,Person of the Year,The Ebola Fighters,,,,,Science,Ebola Epidemic,12,,,18
89,2015,Person of the Year,Angela Merkel,Germany,1954.0,,Chancellor of Germany,Politics,Debt Crisis; Refugee Crisis; Paris Terrorist A...,3,,1950.0,13
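`apply(len)` works here because every row has a name. If a text column contained missing values, `len` would raise a `TypeError` on `NaN`. A possible alternative is the `.str` accessor, sketched below on a toy series (the series `names` is made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy column with one missing name
names = pd.Series(['Charles Lindbergh', np.nan, 'Angela Merkel'])

# .str.len() skips missing entries and leaves them as missing
lengths = names.str.len()
print(lengths)
```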


As another example, we can write a function that extracts the initials from a name.

In [14]:
# First we define the function
def to_initials(name):
    # NaN never compares equal to anything (not even itself),
    # so we test for a missing name with pd.isna instead of ==
    if pd.isna(name):
        return name  # keep missing names missing
    initials = ""
    # split() with no argument also copes with repeated spaces
    for word in name.split():
        initials += word[0]
    return initials

# Then, we try it
to_initials("Barack Obama")

'BO'
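A missing name shows up as `NaN`, and `NaN` never compares equal to anything, itself included, so an `==` test cannot detect it. A short sketch of the difference between `==` and `pd.isna` (the variable `missing` is only for illustration):

```python
import numpy as np
import pandas as pd

missing = np.nan

# Comparing with == is always False for NaN
equal_check = (missing == np.nan)
print(equal_check)

# pd.isna is the reliable test, and it is safe to call on ordinary strings too
print(pd.isna(missing))
print(pd.isna('Barack Obama'))
```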

Applying this gives us a new column with each person's initials.

In [15]:
time['Initials'] = time['Name'].apply(to_initials)
time

Unnamed: 0,Year,Honor,Name,Country,Birth Year,Death Year,Title,Category,Context,Category Ordinal,Lifespan,Birth Decade,Name Length,Initials
0,1927,Man of the Year,Charles Lindbergh,United States,1902.0,1974.0,US Air Mail Pilot,,First Solo Transatlantic Flight,0,72.0,1900.0,17,CL
1,1928,Man of the Year,Walter Chrysler,United States,1875.0,1940.0,Founder of Chrysler,Economics,Chrysler/Dodge Merger,7,65.0,1870.0,15,WC
2,1929,Man of the Year,Owen D. Young,United States,1874.0,1962.0,Member of the German Reparations International...,Diplomacy,Young Plan,10,88.0,1870.0,13,ODY
3,1930,Man of the Year,Mahatma Gandhi,India,1869.0,1948.0,,Revolution,Salt March,4,79.0,1860.0,14,MG
4,1931,Man of the Year,Pierre Laval,France,1883.0,1945.0,Prime Minister of France,Politics,,3,62.0,1880.0,12,PL
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86,2012,Person of the Year,Barack Obama,United States,1961.0,,President of the United States,Politics,Presidential Election,3,,1960.0,12,BO
87,2013,Person of the Year,Pope Francis,Vatican City,1936.0,,Pope of the Roman Catholic Church,Religion,Papal Conclave,5,,1930.0,12,PF
88,2014,Person of the Year,The Ebola Fighters,,,,,Science,Ebola Epidemic,12,,,18,TE
89,2015,Person of the Year,Angela Merkel,Germany,1954.0,,Chancellor of Germany,Politics,Debt Crisis; Refugee Crisis; Paris Terrorist A...,3,,1950.0,13,AM


### Removing Columns (Unwanted Variables)

There are two main ways to remove columns: either we specify which columns we want to **keep**, or we specify which we want to **remove**

Here is the version where we specify, by name, which columns to **keep**. You saw this in an earlier lecture as selecting columns.

In [16]:
time[['Year', 'Honor', 'Name']]

Unnamed: 0,Year,Honor,Name
0,1927,Man of the Year,Charles Lindbergh
1,1928,Man of the Year,Walter Chrysler
2,1929,Man of the Year,Owen D. Young
3,1930,Man of the Year,Mahatma Gandhi
4,1931,Man of the Year,Pierre Laval
...,...,...,...
86,2012,Person of the Year,Barack Obama
87,2013,Person of the Year,Pope Francis
88,2014,Person of the Year,The Ebola Fighters
89,2015,Person of the Year,Angela Merkel


Here is the version where we specify, by name, which columns to **remove**

In [17]:
time.drop(columns=['Year', 'Honor', 'Name'])

Unnamed: 0,Country,Birth Year,Death Year,Title,Category,Context,Category Ordinal,Lifespan,Birth Decade,Name Length,Initials
0,United States,1902.0,1974.0,US Air Mail Pilot,,First Solo Transatlantic Flight,0,72.0,1900.0,17,CL
1,United States,1875.0,1940.0,Founder of Chrysler,Economics,Chrysler/Dodge Merger,7,65.0,1870.0,15,WC
2,United States,1874.0,1962.0,Member of the German Reparations International...,Diplomacy,Young Plan,10,88.0,1870.0,13,ODY
3,India,1869.0,1948.0,,Revolution,Salt March,4,79.0,1860.0,14,MG
4,France,1883.0,1945.0,Prime Minister of France,Politics,,3,62.0,1880.0,12,PL
...,...,...,...,...,...,...,...,...,...,...,...
86,United States,1961.0,,President of the United States,Politics,Presidential Election,3,,1960.0,12,BO
87,Vatican City,1936.0,,Pope of the Roman Catholic Church,Religion,Papal Conclave,5,,1930.0,12,PF
88,,,,,Science,Ebola Epidemic,12,,,18,TE
89,Germany,1954.0,,Chancellor of Germany,Politics,Debt Crisis; Refugee Crisis; Paris Terrorist A...,3,,1950.0,13,AM


To apply the changes, assign the result back to the same variable `time`

In [18]:
time = time.drop(columns=['Year', 'Honor', 'Name'])
time

Unnamed: 0,Country,Birth Year,Death Year,Title,Category,Context,Category Ordinal,Lifespan,Birth Decade,Name Length,Initials
0,United States,1902.0,1974.0,US Air Mail Pilot,,First Solo Transatlantic Flight,0,72.0,1900.0,17,CL
1,United States,1875.0,1940.0,Founder of Chrysler,Economics,Chrysler/Dodge Merger,7,65.0,1870.0,15,WC
2,United States,1874.0,1962.0,Member of the German Reparations International...,Diplomacy,Young Plan,10,88.0,1870.0,13,ODY
3,India,1869.0,1948.0,,Revolution,Salt March,4,79.0,1860.0,14,MG
4,France,1883.0,1945.0,Prime Minister of France,Politics,,3,62.0,1880.0,12,PL
...,...,...,...,...,...,...,...,...,...,...,...
86,United States,1961.0,,President of the United States,Politics,Presidential Election,3,,1960.0,12,BO
87,Vatican City,1936.0,,Pope of the Roman Catholic Church,Religion,Papal Conclave,5,,1930.0,12,PF
88,,,,,Science,Ebola Epidemic,12,,,18,TE
89,Germany,1954.0,,Chancellor of Germany,Politics,Debt Crisis; Refugee Crisis; Paris Terrorist A...,3,,1950.0,13,AM


### Renaming Columns

You can do this using a `dictionary`

In [19]:
time = time.rename(columns={'Birth Year' : 'Born',
                        'Death Year' : 'Died'})
time

Unnamed: 0,Country,Born,Died,Title,Category,Context,Category Ordinal,Lifespan,Birth Decade,Name Length,Initials
0,United States,1902.0,1974.0,US Air Mail Pilot,,First Solo Transatlantic Flight,0,72.0,1900.0,17,CL
1,United States,1875.0,1940.0,Founder of Chrysler,Economics,Chrysler/Dodge Merger,7,65.0,1870.0,15,WC
2,United States,1874.0,1962.0,Member of the German Reparations International...,Diplomacy,Young Plan,10,88.0,1870.0,13,ODY
3,India,1869.0,1948.0,,Revolution,Salt March,4,79.0,1860.0,14,MG
4,France,1883.0,1945.0,Prime Minister of France,Politics,,3,62.0,1880.0,12,PL
...,...,...,...,...,...,...,...,...,...,...,...
86,United States,1961.0,,President of the United States,Politics,Presidential Election,3,,1960.0,12,BO
87,Vatican City,1936.0,,Pope of the Roman Catholic Church,Religion,Papal Conclave,5,,1930.0,12,PF
88,,,,,Science,Ebola Epidemic,12,,,18,TE
89,Germany,1954.0,,Chancellor of Germany,Politics,Debt Crisis; Refugee Crisis; Paris Terrorist A...,3,,1950.0,13,AM


## Missing Data

### Fundamentals of missing data

In the previous example, you saw that some entries had `NaN` to fill in for **missing data**

Missing data is one of the most recurrent problems in data analysis

In fact, there are a lot of studies and methods on how to handle it!

We will work with a small dataset called [missing.csv](https://www.dropbox.com/s/5h4k4rszebd6p0n/missing.csv?raw=1) (it's the same as data.csv from last week, but now with missing values)

In [20]:
df = pd.read_csv('https://www.dropbox.com/s/5h4k4rszebd6p0n/missing.csv?raw=1')
df

Unnamed: 0,Name,Age,Height
0,Nick,21.0,1.85
1,Chris,29.0,Unknown
2,Tim,28.0,1.75
3,Ron,,1.81
4,Monica,35.0,Unknown
5,Cassandra,21.0,1.66


The first thing to notice is that the `Age` column is being shown with decimals, even though the values were integers in the original file

Moreover, the blank entry is displayed as `NaN`

`NaN` stands for *Not A Number* and is part of the floating point specification, typically used when a calculation has no valid numerical result

In fact, `NaN` is itself a floating point value (stored by `Pandas` as `float64`)

In [21]:
df

Unnamed: 0,Name,Age,Height
0,Nick,21.0,1.85
1,Chris,29.0,Unknown
2,Tim,28.0,1.75
3,Ron,,1.81
4,Monica,35.0,Unknown
5,Cassandra,21.0,1.66


In [22]:
type(df.at[3, 'Age'])

numpy.float64

As a result, the entire column has been turned into `float64`, instead of `int64`

This is why it was being shown as decimals!

In [23]:
df['Age']

0    21.0
1    29.0
2    28.0
3     NaN
4    35.0
5    21.0
Name: Age, dtype: float64
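If keeping whole numbers alongside missing values matters, one option is pandas' nullable integer type, requested with the string `'Int64'` (capital `I`). This is a sketch on a toy series rather than the lecture's data; `ages` and `nullable_ages` are illustrative names.

```python
import pandas as pd

# A missing entry forces the default dtype to float64
ages = pd.Series([21, 29, None, 35])
print(ages.dtype)

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>
nullable_ages = ages.astype('Int64')
print(nullable_ages)
```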

### Incorrectly Imported Missing Data

The missing value in the `Age` column was generated because the cell was blank

By default, `Pandas` will treat certain cell values, such as a *blank*, `NULL`, `NaN`, and `n/a` as missing values

The `Height` column contains some entries which we recognise as missing data (i.e. `Unknown`) but that Pandas did not

In fact, the entire column has been imported as strings, not numbers!

In [24]:
df

Unnamed: 0,Name,Age,Height
0,Nick,21.0,1.85
1,Chris,29.0,Unknown
2,Tim,28.0,1.75
3,Ron,,1.81
4,Monica,35.0,Unknown
5,Cassandra,21.0,1.66


In [25]:
print(df.at[1, 'Height'])
print(type(df.at[1, 'Height']))

Unknown
<class 'str'>


Because of this, `Pandas` imported the whole column as strings, including the entries that look like numbers!

In [26]:
print(df.at[0, 'Height'])
print(type(df.at[0, 'Height']))

1.85
<class 'str'>


What happens if we multiply this column by `100`?

In [27]:
df['Height times 100'] = df['Height'] * 100
df

Unnamed: 0,Name,Age,Height,Height times 100
0,Nick,21.0,1.85,1.851.851.851.851.851.851.851.851.851.851.851....
1,Chris,29.0,Unknown,UnknownUnknownUnknownUnknownUnknownUnknownUnkn...
2,Tim,28.0,1.75,1.751.751.751.751.751.751.751.751.751.751.751....
3,Ron,,1.81,1.811.811.811.811.811.811.811.811.811.811.811....
4,Monica,35.0,Unknown,UnknownUnknownUnknownUnknownUnknownUnknownUnkn...
5,Cassandra,21.0,1.66,1.661.661.661.661.661.661.661.661.661.661.661....


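Because the values are strings, `* 100` repeated each string 100 times instead of multiplying a number

We need to tell Pandas in advance to treat `Unknown` not as a string, but as a missing value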

Let's re-import the dataset, passing a list of the values which represent missing data as the `na_values` parameter

In [28]:
df = pd.read_csv('https://www.dropbox.com/s/5h4k4rszebd6p0n/missing.csv?raw=1', 
                 na_values=['Unknown'])
df

Unnamed: 0,Name,Age,Height
0,Nick,21.0,1.85
1,Chris,29.0,
2,Tim,28.0,1.75
3,Ron,,1.81
4,Monica,35.0,
5,Cassandra,21.0,1.66


Now the column is `float64`

In [29]:
df['Height']

0    1.85
1     NaN
2    1.75
3    1.81
4     NaN
5    1.66
Name: Height, dtype: float64
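When re-importing is not convenient, a column that has already been read as strings can also be converted after the fact. A minimal sketch using `pd.to_numeric` with `errors='coerce'`, on a toy series standing in for the `Height` column:

```python
import pandas as pd

# Toy column read in as text, with a placeholder for missing data
heights = pd.Series(['1.85', 'Unknown', '1.75'])

# Multiplying a string repeats it
repeated = heights[0] * 2
print(repeated)

# errors='coerce' turns anything that is not a number into NaN
numeric_heights = pd.to_numeric(heights, errors='coerce')
print(numeric_heights * 100)
```

Note that `errors='coerce'` silently converts every unparseable entry, so it is worth checking which values were turned into `NaN`.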

Multiplying `Height` now works as expected.

In [30]:
df['Height times 100'] = df['Height'] * 100
df

Unnamed: 0,Name,Age,Height,Height times 100
0,Nick,21.0,1.85,185.0
1,Chris,29.0,,
2,Tim,28.0,1.75,175.0
3,Ron,,1.81,181.0
4,Monica,35.0,,
5,Cassandra,21.0,1.66,166.0


In [31]:
# Another example
df['Test'] = df['Age'] + df['Height']
df

Unnamed: 0,Name,Age,Height,Height times 100,Test
0,Nick,21.0,1.85,185.0,22.85
1,Chris,29.0,,,
2,Tim,28.0,1.75,175.0,29.75
3,Ron,,1.81,181.0,
4,Monica,35.0,,,
5,Cassandra,21.0,1.66,166.0,22.66


### Methods to address missing data

As mentioned before, there are different ways to deal with missing data

The method you should use depends on your analysis, the application domain and what you want to achieve

Here are some common methods:

    1. Leave the missing values as NaN
    2. Lookup the values from another data source
    3. Delete the rows / observations
    4. Compute reasonable guesses about the values

In **option 1**, it is often reasonable to just leave the data as missing data

For example, if we wanted to calculate average `Height`, it may not matter that we are missing `Age` for one row

As for **option 2**, if we have some other ways of going back and getting correct data, we could just update the values with `df.at[3, 'Age'] = ...`

Let's assume we are going for **option 3** and want to remove any rows where we do not have `Age`

If we know the row we want to delete (`3`), we can use the `drop` method, only this time we specify rows, not columns. As in:

    df.drop(3)

However, often there are many missing values, and we don't want to check on the index of each

A good way to remove rows with missing data is to specify which rows to **keep**

The `isnull` method returns `True` for every entry that is `NaN`, so using it as a filter selects the rows with a missing `Age`.

In [32]:
df[df['Age'].isnull()]

Unnamed: 0,Name,Age,Height,Height times 100,Test
3,Ron,,1.81,181.0,


Of course, this is the *opposite* of specifying which to keep

The `notnull` method will show us which rows do not contain `NaN` in the specified column.

In [33]:
df[df['Age'].notnull()]

Unnamed: 0,Name,Age,Height,Height times 100,Test
0,Nick,21.0,1.85,185.0,22.85
1,Chris,29.0,,,
2,Tim,28.0,1.75,175.0,29.75
4,Monica,35.0,,,
5,Cassandra,21.0,1.66,166.0,22.66


Notice that the filter only checks the column we specified (`Age`), not the other columns with `NaN` values

We can apply the filter by assigning the result back to `df`

In [34]:
df = df[df['Age'].notnull()]
df

Unnamed: 0,Name,Age,Height,Height times 100,Test
0,Nick,21.0,1.85,185.0,22.85
1,Chris,29.0,,,
2,Tim,28.0,1.75,175.0,29.75
4,Monica,35.0,,,
5,Cassandra,21.0,1.66,166.0,22.66
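The same filtering can also be written with the `dropna` method and its `subset` parameter. The sketch below, on a cut-down toy frame, checks that both approaches give the same result (`people`, `by_filter` and `by_dropna` are illustrative names):

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'Name': ['Nick', 'Ron', 'Monica'],
    'Age': [21.0, np.nan, 35.0],
    'Height': [1.85, 1.81, np.nan],
})

# Keep rows where Age is present, written in two ways
by_filter = people[people['Age'].notnull()]
by_dropna = people.dropna(subset=['Age'])

# Only Ron's row is removed; Monica stays even though her Height is missing
print(by_dropna)
```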


Since the `Age` column no longer contains any missing values, it is safe to convert the column to `int64` type

We can do this with the `astype` method

In [35]:
df = df.astype({'Age': 'int64'})
df

Unnamed: 0,Name,Age,Height,Height times 100,Test
0,Nick,21,1.85,185.0,22.85
1,Chris,29,,,
2,Tim,28,1.75,175.0,29.75
4,Monica,35,,,
5,Cassandra,21,1.66,166.0,22.66


### Updating Indices

After these operations, the current data frame has the index 0, 1, 2, 4, 5

We could leave it like this, being careful to select the correct rows with `loc`. However, we can also reset the index with `reset_index`

Note that we need to specify `drop=True`. Without this, we will end up with an extra column containing the old index

In [36]:
df = df.reset_index(drop=True)
df

Unnamed: 0,Name,Age,Height,Height times 100,Test
0,Nick,21,1.85,185.0,22.85
1,Chris,29,,,
2,Tim,28,1.75,175.0,29.75
3,Monica,35,,,
4,Cassandra,21,1.66,166.0,22.66
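The effect of `drop=True` is easiest to see side by side. A small sketch on a toy frame whose index has gaps (the frame and variable names are made up for illustration):

```python
import pandas as pd

# Toy frame whose index has gaps, as after removing rows
people = pd.DataFrame({'Name': ['Nick', 'Tim', 'Monica']}, index=[0, 2, 4])

# Without drop=True the old index is kept as a new column called 'index'
kept = people.reset_index()
print(list(kept.columns))

# With drop=True the old index is discarded
dropped = people.reset_index(drop=True)
print(list(dropped.columns))
```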


Alternatively, we can set the index to a particular column if we don't want default indexing

In [37]:
df = df.set_index('Name')
df

Unnamed: 0_level_0,Age,Height,Height times 100,Test
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Nick,21,1.85,185.0,22.85
Chris,29,,,
Tim,28,1.75,175.0,29.75
Monica,35,,,
Cassandra,21,1.66,166.0,22.66


Say we want to go with **option 4** and replace missing `Height` with a computed value

In this example we simply calculate the mean height of other people in the data set (to the nearest cm)

**Note:** The `mean` method just ignores the missing data and gives us the average of the non-missing data, which is what we want

In [38]:
mean_height = df['Height'].mean()

mean_height

1.7533333333333332

In [39]:
mean_height = round(mean_height, 
                    2)
mean_height

1.75

Of course there are many more sophisticated ways to predict a replacement value!

For instance, we could fit a regression with `Age`

Or if we had categories such as `Gender` we could group by categories and do predictions for each

In this example we will just use `1.75` for every missing value.

In [40]:
df = df.fillna({'Height' : mean_height})
df

Unnamed: 0_level_0,Age,Height,Height times 100,Test
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Nick,21,1.85,185.0,22.85
Chris,29,1.75,,
Tim,28,1.75,175.0,29.75
Monica,35,1.75,,
Cassandra,21,1.66,166.0,22.66
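The group-based idea mentioned above can be sketched as follows. The dataset here has no category column, so the example assumes a hypothetical `Group` column on a toy frame and fills each gap with the mean of its own group.

```python
import numpy as np
import pandas as pd

# Toy frame with a hypothetical 'Group' column
people = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'B', 'B'],
    'Height': [1.80, np.nan, 1.60, 1.70, np.nan],
})

# transform('mean') returns, for every row, the mean height of that row's group
group_means = people.groupby('Group')['Height'].transform('mean')

# fillna only replaces the missing entries, using the value at the same row
people['Height'] = people['Height'].fillna(group_means)
print(people)
```

Whether a group mean is a sensible guess depends on the data and on what the analysis is for.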


## Joins and Aggregations

Let's create two small datasets to use as examples

In [41]:
books = pd.DataFrame({'Author' : ['J. R. R. Tolkien',
                                  'George R. R. Martin',
                                  'J. K. Rowling', 
                                  'Suzanne Collins']},
                     index = ['The Lord of the Rings',
                              'Game of Thrones',
                              'Harry Potter',
                              'The Hunger Games'])
books

Unnamed: 0,Author
The Lord of the Rings,J. R. R. Tolkien
Game of Thrones,George R. R. Martin
Harry Potter,J. K. Rowling
The Hunger Games,Suzanne Collins


In [42]:
films = pd.DataFrame({'Year of First Film' : [1999, 2001, 2001, 2012],
                      'Number of Films' : [3, 2, 8, 4]},
                     index = ['The Matrix',
                              'The Lord of the Rings',
                              'Harry Potter',
                              'The Hunger Games'])


films

Unnamed: 0,Year of First Film,Number of Films
The Matrix,1999,3
The Lord of the Rings,2001,2
Harry Potter,2001,8
The Hunger Games,2012,4


The `.join` operation is used to join the columns of two different data sets based on matching index

In [43]:
books.join(films)

Unnamed: 0,Author,Year of First Film,Number of Films
The Lord of the Rings,J. R. R. Tolkien,2001.0,2.0
Game of Thrones,George R. R. Martin,,
Harry Potter,J. K. Rowling,2001.0,8.0
The Hunger Games,Suzanne Collins,2012.0,4.0


### Types of Join

In the previous example, the `books` data set is the *'left'* dataset and `films` is the *'right'* dataset

A left join keeps all of the data from the *'left'* data set and adds in the applicable data from the *'right'* data set where the keys match up

If there is no film, the cells are populated with `NaN` (e.g. Game of Thrones)

The above operation is equivalent to

    books.join(films, how='left')

We can also do a right join. This keeps all of the films, adding the book data where applicable. If there is no book, the cells are populated with `NaN`.

In [44]:
books.join(films, how='right')

Unnamed: 0,Author,Year of First Film,Number of Films
The Matrix,,1999,3
The Lord of the Rings,J. R. R. Tolkien,2001,2
Harry Potter,J. K. Rowling,2001,8
The Hunger Games,Suzanne Collins,2012,4


This is almost equivalent to `films.join(books)` (or `films.join(books, how='left')`), with the exception of the order in which the columns appear

If we want to keep all of the data (book **OR** film), we can use an outer join

In [45]:
films.join(books, how='outer')

Unnamed: 0,Year of First Film,Number of Films,Author
Game of Thrones,,,George R. R. Martin
Harry Potter,2001.0,8.0,J. K. Rowling
The Hunger Games,2012.0,4.0,Suzanne Collins
The Lord of the Rings,2001.0,2.0,J. R. R. Tolkien
The Matrix,1999.0,3.0,


And if we only want to keep the rows that appear in both datasets (book **AND** film), we can use an inner join

In [46]:
films.join(books, how='inner')

Unnamed: 0,Year of First Film,Number of Films,Author
The Lord of the Rings,2001,2,J. R. R. Tolkien
Harry Potter,2001,8,J. K. Rowling
The Hunger Games,2012,4,Suzanne Collins


You should choose the join type based on what your resulting data table is intended to describe

For instance, the inner join gave us a table of films based on books

In contrast, the left join gave us a list of books with additional information on the film (if any)

Once again, remember that `.join` is not modifying the dataframe, so if you want to save the result, assign it to either the same or a different variable with an appropriate name

    films = films.join(books, how='left')

    books = films.join(books, how='right')
    
    films_based_on_books = films.join(books, how='inner')
    
    favourite_series = films.join(books, how='outer')

This table may be used as a reminder of the difference between the joins


| Type of Join   | Keeps Rows of Left Data | Keeps Rows of Right Data |
| :------------- | ----------------------: | -----------------------: |
| left (default) | yes                     | only if matching left    |
| right          | only if matching right  | yes                      |
| outer          | yes                     | yes                      |
| inner          | only if matching right  | only if matching left    |
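The table can be checked by counting the rows each join type produces. The sketch below rebuilds cut-down copies of the `books` and `films` frames (one column each, to keep it short) so that it runs on its own; `row_counts` is an illustrative name.

```python
import pandas as pd

books = pd.DataFrame({'Author': ['J. R. R. Tolkien',
                                 'George R. R. Martin',
                                 'J. K. Rowling',
                                 'Suzanne Collins']},
                     index=['The Lord of the Rings',
                            'Game of Thrones',
                            'Harry Potter',
                            'The Hunger Games'])

films = pd.DataFrame({'Year of First Film': [1999, 2001, 2001, 2012]},
                     index=['The Matrix',
                            'The Lord of the Rings',
                            'Harry Potter',
                            'The Hunger Games'])

# Number of rows in the result for each type of join
row_counts = {how: len(books.join(films, how=how))
              for how in ['left', 'right', 'outer', 'inner']}
print(row_counts)
```

Three titles appear in both frames, and each frame has one title of its own, which explains the counts.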

### Joining Different Columns

`.join` joins by comparing indexes of each dataframe

Sometimes the key column(s) is not the index (particularly if you are using default indexing)

If you need to join based on columns other than the index, you should use `merge`

For example, it is possible that we would encounter data with default indexes as follows:

In [47]:
books = books.reset_index()
books = books.rename(columns={'index' : 'Book Series Title'})
books

Unnamed: 0,Book Series Title,Author
0,The Lord of the Rings,J. R. R. Tolkien
1,Game of Thrones,George R. R. Martin
2,Harry Potter,J. K. Rowling
3,The Hunger Games,Suzanne Collins


In [48]:
films = films.reset_index()
films = films.rename(columns={'index' : 'Film Series Title'})
films

Unnamed: 0,Film Series Title,Year of First Film,Number of Films
0,The Matrix,1999,3
1,The Lord of the Rings,2001,2
2,Harry Potter,2001,8
3,The Hunger Games,2012,4


If we join on the index, the result is nonsense!

In [49]:
books.join(films) # WRONG

Unnamed: 0,Book Series Title,Author,Film Series Title,Year of First Film,Number of Films
0,The Lord of the Rings,J. R. R. Tolkien,The Matrix,1999,3
1,Game of Thrones,George R. R. Martin,The Lord of the Rings,2001,2
2,Harry Potter,J. K. Rowling,Harry Potter,2001,8
3,The Hunger Games,Suzanne Collins,The Hunger Games,2012,4


We could change the `Book Series Title` and `Film Series Title` to indexes and join with `.join`

Or we can use `.merge`, in which left, right, outer, and inner joins work the same way

However, it is not the index we are comparing, it is the column specified with `left_on=` and `right_on=`

In [50]:
books.merge(films,
            how='inner',
            left_on='Book Series Title',
            right_on='Film Series Title')

Unnamed: 0,Book Series Title,Author,Film Series Title,Year of First Film,Number of Films
0,The Lord of the Rings,J. R. R. Tolkien,The Lord of the Rings,2001,2
1,Harry Potter,J. K. Rowling,Harry Potter,2001,8
2,The Hunger Games,Suzanne Collins,The Hunger Games,2012,4
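When it is unclear which rows matched, `merge` accepts `indicator=True`, which adds a `_merge` column saying whether each row came from the left frame, the right frame, or both. A sketch on cut-down copies of the two frames (title columns only, rebuilt here so the example is self-contained):

```python
import pandas as pd

books = pd.DataFrame({'Book Series Title': ['The Lord of the Rings',
                                            'Game of Thrones',
                                            'Harry Potter',
                                            'The Hunger Games']})

films = pd.DataFrame({'Film Series Title': ['The Matrix',
                                            'The Lord of the Rings',
                                            'Harry Potter',
                                            'The Hunger Games']})

# Outer merge keeps every row; _merge records where each row came from
merged = books.merge(films,
                     how='outer',
                     left_on='Book Series Title',
                     right_on='Film Series Title',
                     indicator=True)
print(merged)
```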


### Data Aggregation

Before continuing, we will reset the index of the books

In [51]:
books = books.set_index('Book Series Title', drop=True)
books

Unnamed: 0_level_0,Author
Book Series Title,Unnamed: 1_level_1
The Lord of the Rings,J. R. R. Tolkien
Game of Thrones,George R. R. Martin
Harry Potter,J. K. Rowling
The Hunger Games,Suzanne Collins


Let's import a dataset of a list of books

In [52]:
volumes = pd.read_csv('https://www.dropbox.com/s/9flqjjvetgbex97/volumes.csv?raw=1')
volumes

Unnamed: 0,Series,Title,Rating,Year
0,Harry Potter,Harry Potter and the Philosopher's Stone,4.47,1997
1,Harry Potter,Harry Potter and the Chamber of Secrets,4.42,1998
2,Harry Potter,Harry Potter and the Prisoner of Azkaban,4.56,1999
3,Harry Potter,Harry Potter and the Goblet of Fire,4.55,2000
4,Harry Potter,Harry Potter and the Order of the Phoenix,4.49,2003
5,Harry Potter,Harry Potter and the Half-Blood Prince,4.57,2005
6,Harry Potter,Harry Potter and the Deathly Hallows,4.61,2007
7,The Lord of the Rings,The Fellowship of the Ring,4.36,1954
8,The Lord of the Rings,The Two Towers,4.44,1954
9,The Lord of the Rings,The Return of the King,4.53,1955


We want to summarise by series, so we give the `.groupby` method the name of the column that holds the groups

The result is a Python object which we will use for the next step

In [53]:
groups = volumes.groupby('Series')
groups

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F7E0FDFE10>
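The groupby object does not print anything useful, but we can look inside it. A minimal sketch, assuming a three-row hand-made frame called `volumes_demo` (not the real `volumes` data):

```python
import pandas as pd

# Small hand-made stand-in for the volumes data (hypothetical subset)
volumes_demo = pd.DataFrame({
    'Series': ['Harry Potter', 'Harry Potter', 'The Lord of the Rings'],
    'Title': ["Harry Potter and the Philosopher's Stone",
              'Harry Potter and the Chamber of Secrets',
              'The Fellowship of the Ring'],
    'Rating': [4.47, 4.42, 4.36],
    'Year': [1997, 1998, 1954],
})
groups_demo = volumes_demo.groupby('Series')

# Iterating gives (group key, sub-dataframe) pairs
for series_name, group in groups_demo:
    print(series_name, len(group))

# A single group can be pulled out by its key
hp = groups_demo.get_group('Harry Potter')

# .size() counts the rows in every group
group_sizes = groups_demo.size()
print(group_sizes)
```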

We will use `count` and `mean` to work out the number of books and the average rating, and use `min` to work out the first publication year.

Other operations available include `sum` and `max` (you can check others [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html))

Now we have our data summary by groups

In [54]:
summary = groups[['Rating', 'Year']].agg(['mean', 'min', 'count'])
summary

Unnamed: 0_level_0,Rating,Rating,Rating,Year,Year,Year
Unnamed: 0_level_1,mean,min,count,mean,min,count
Series,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Game of Thrones,4.372,4.13,5,2002.0,1996,5
Harry Potter,4.524286,4.42,7,2001.285714,1997,7
The Hunger Games,4.216667,4.03,3,2009.0,2008,3
The Lord of the Rings,4.443333,4.36,3,1954.333333,1954,3
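As an alternative, Pandas (version 0.25 or later) supports "named aggregation", which picks one function per output column and returns flat column names. This is a sketch on a hand-made frame; the names `volumes_demo` and `named_summary` and the output column names are assumptions for this example.

```python
import pandas as pd

# Small hand-made stand-in for the volumes data (hypothetical subset)
volumes_demo = pd.DataFrame({
    'Series': ['Harry Potter', 'Harry Potter', 'The Lord of the Rings'],
    'Rating': [4.47, 4.42, 4.36],
    'Year': [1997, 1998, 1954],
})

# Each keyword is a new column name; the tuple is (source column, function)
named_summary = volumes_demo.groupby('Series').agg(
    average_rating=('Rating', 'mean'),
    number_of_books=('Rating', 'count'),
    first_published=('Year', 'min'),
)
print(named_summary)
```

This avoids the two-level column header, at the cost of having to list every output column explicitly.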


### Joining Aggregation Data

We now have a dataframe with two sub-frames (one for `Rating` and one for `Year`), which we can easily separate

In [55]:
rating_summary = summary['Rating']
rating_summary

Unnamed: 0_level_0,mean,min,count
Series,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Game of Thrones,4.372,4.13,5
Harry Potter,4.524286,4.42,7
The Hunger Games,4.216667,4.03,3
The Lord of the Rings,4.443333,4.36,3


In [56]:
year_summary = summary['Year']
year_summary

Unnamed: 0_level_0,mean,min,count
Series,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Game of Thrones,2002.0,1996,5
Harry Potter,2001.285714,1997,7
The Hunger Games,2009.0,2008,3
The Lord of the Rings,1954.333333,1954,3


Let's rename the columns for the `Rating` aggregation, and remove the unneeded `min` column

In [57]:
rating_summary = summary['Rating'].rename(
    columns={'mean' : 'Average Rating',
             'count' : 'Number of Books'})
rating_summary = rating_summary.drop(columns=['min'])
rating_summary

Unnamed: 0_level_0,Average Rating,Number of Books
Series,Unnamed: 1_level_1,Unnamed: 2_level_1
Game of Thrones,4.372,5
Harry Potter,4.524286,7
The Hunger Games,4.216667,3
The Lord of the Rings,4.443333,3


We may also want to round off the average ratings

In [58]:
rating_summary['Average Rating'] = rating_summary['Average Rating'].round(2)

rating_summary

Unnamed: 0_level_0,Average Rating,Number of Books
Series,Unnamed: 1_level_1,Unnamed: 2_level_1
Game of Thrones,4.37,5
Harry Potter,4.52,7
The Hunger Games,4.22,3
The Lord of the Rings,4.44,3


Let's also rename the column from the `Year` aggregation and drop the other columns

In [59]:
year_summary = year_summary.rename(columns={'min' : 'First Published'})
year_summary = year_summary.drop(columns=['mean', 'count'])
year_summary

Unnamed: 0_level_0,First Published
Series,Unnamed: 1_level_1
Game of Thrones,1996
Harry Potter,1997
The Hunger Games,2008
The Lord of the Rings,1954


Now we can join the `books`, `rating_summary` and `year_summary` dataframes

Since all three have the same keys, the join type does not matter here; `.join` defaults to a left join, so every row of `books` would be kept even if it had no match

In [60]:
books.join(rating_summary).join(year_summary)

Unnamed: 0_level_0,Author,Average Rating,Number of Books,First Published
Book Series Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
The Lord of the Rings,J. R. R. Tolkien,4.44,3,1954
Game of Thrones,George R. R. Martin,4.37,5,1996
Harry Potter,J. K. Rowling,4.52,7,1997
The Hunger Games,Suzanne Collins,4.22,3,2008
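To see what the left join does when a key is missing, here is a small sketch. The frames `books_demo` and `ratings_demo` are hypothetical, and the rating summary deliberately leaves out one series.

```python
import pandas as pd

# Hypothetical two-row version of the books frame
books_demo = pd.DataFrame(
    {'Author': ['J. R. R. Tolkien', 'Suzanne Collins']},
    index=pd.Index(['The Lord of the Rings', 'The Hunger Games'],
                   name='Book Series Title'))

# Hypothetical summary with no entry for 'The Hunger Games'
ratings_demo = pd.DataFrame(
    {'Average Rating': [4.44]},
    index=pd.Index(['The Lord of the Rings'], name='Series'))

# Default how='left': every row of books_demo survives, gaps become NaN
left = books_demo.join(ratings_demo)

# how='inner': only keys present in both frames survive
inner = books_demo.join(ratings_demo, how='inner')

print(left)
print(inner)
```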


## Melt and Pivoting

### Wide vs Long Data (some real-life examples)

Wide data contains a column for each variable, and a row for each entity

The "entity" ID (in this case `Name`, but it could be an ID number etc.) is in the first column, or could be the index

| Name    | Age  | Height | Hair Colour |
| ------: | ---: | -----: | :---------- |
| Alice   |   36 |  1.68  | Blonde      |
| Bob     |   28 |  1.73  | Red         |
| Charlie |   29 |  1.60  | -           |

Long data contains a row for each observation of a variable

This is also called entity-attribute-value data

Note that rows can be omitted if there is missing data

| Entity ID | Attribute / Variable   |   Value |
| --------: | :--------------------- | ------: |
| Alice     | Age                    |      36 |
| Bob       | Age                    |      28 |
| Charlie   | Age                    |      29 |
| Alice     | Height                 |    1.68 |
| Bob       | Height                 |    1.73 |
| Charlie   | Height                 |    1.60 |
| Alice     | Hair Colour            |  Blonde |
| Bob       | Hair Colour            |     Red |
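The two tables above can be reproduced in Pandas. This is a rough sketch that uses `.melt` (covered in the next section); the names `people_wide` and `people_long` are made up, and Charlie's missing hair colour is stored as `None`.

```python
import pandas as pd

# Wide form: one row per person, one column per variable
people_wide = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [36, 28, 29],
    'Height': [1.68, 1.73, 1.60],
    'Hair Colour': ['Blonde', 'Red', None],  # Charlie's value is missing
})

# Long form: one row per (person, attribute) observation
people_long = people_wide.melt(id_vars=['Name'],
                               var_name='Attribute',
                               value_name='Value')

# Rows with missing values can simply be omitted in long data
people_long = people_long.dropna(subset=['Value'])
print(people_long)
```

Note that the `Value` column now mixes numbers and text, which is one of the trade-offs of long data.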

### Tidy Data

Tidy data was defined by Hadley Wickham in [this article](https://vita.had.co.nz/papers/tidy-data.pdf), which describes long and wide data and gives more examples of how to work with both

It is written for R users!

### Melting Wide to Long Data
![Fig. 1](https://www.dropbox.com/s/giskupj9ibff4bd/fig1.jpg?raw=1)

Once again, let's import a new dataset

In [61]:
stock = pd.read_csv('https://www.dropbox.com/s/dl0bz061v5biv4k/stock.csv?raw=1')
stock

Unnamed: 0,Company,Symbol,1980,1990,2000,2010,2020
0,Apple,AAPL,0.51,1.22,3.88,37.53,318.73
1,Google,GOOGL,,,,312.54,1518.73
2,Microsoft,MSFT,,1.03,53.31,28.21,185.38


**What kind of data is this?**

**What are the entities, and which columns are ID columns?**

**What takes the place of attributes (variables) in this case?**

Let's melt this data using the `.melt` method

In [62]:
stock.melt(id_vars = ['Company', 'Symbol'])

Unnamed: 0,Company,Symbol,variable,value
0,Apple,AAPL,1980,0.51
1,Google,GOOGL,1980,
2,Microsoft,MSFT,1980,
3,Apple,AAPL,1990,1.22
4,Google,GOOGL,1990,
5,Microsoft,MSFT,1990,1.03
6,Apple,AAPL,2000,3.88
7,Google,GOOGL,2000,
8,Microsoft,MSFT,2000,53.31
9,Apple,AAPL,2010,37.53


We can ensure that the new columns get named correctly by using the `var_name` and `value_name` options

In [63]:
stock = stock.melt(id_vars = ['Company', 'Symbol'], var_name='Year', value_name='Price (USD)')
stock

Unnamed: 0,Company,Symbol,Year,Price (USD)
0,Apple,AAPL,1980,0.51
1,Google,GOOGL,1980,
2,Microsoft,MSFT,1980,
3,Apple,AAPL,1990,1.22
4,Google,GOOGL,1990,
5,Microsoft,MSFT,1990,1.03
6,Apple,AAPL,2000,3.88
7,Google,GOOGL,2000,
8,Microsoft,MSFT,2000,53.31
9,Apple,AAPL,2010,37.53


The `NaN` values aren't really contributing anything and are just there because they were in the wide data set, so let's remove them!

In [64]:
stock = stock[stock['Price (USD)'].notnull()]
stock

Unnamed: 0,Company,Symbol,Year,Price (USD)
0,Apple,AAPL,1980,0.51
3,Apple,AAPL,1990,1.22
5,Microsoft,MSFT,1990,1.03
6,Apple,AAPL,2000,3.88
8,Microsoft,MSFT,2000,53.31
9,Apple,AAPL,2010,37.53
10,Google,GOOGL,2010,312.54
11,Microsoft,MSFT,2010,28.21
12,Apple,AAPL,2020,318.73
13,Google,GOOGL,2020,1518.73
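The same filtering can also be written with `.dropna` and its `subset` option. A small sketch on a hand-made frame (`stock_demo` is a hypothetical three-row stand-in for the melted data):

```python
import pandas as pd

# Hypothetical three-row stand-in for the melted stock data
stock_demo = pd.DataFrame({
    'Company': ['Apple', 'Google', 'Microsoft'],
    'Symbol': ['AAPL', 'GOOGL', 'MSFT'],
    'Year': ['1980', '1980', '1990'],
    'Price (USD)': [0.51, None, 1.03],
})

# Boolean mask, as in the cell above
with_mask = stock_demo[stock_demo['Price (USD)'].notnull()]

# Equivalent: drop rows where this particular column is missing
with_dropna = stock_demo.dropna(subset=['Price (USD)'])

print(with_dropna)
```

Either way the original row labels are kept, which is why the index has gaps afterwards.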


If you prefer the data sorted by entity, then variable, you can re-sort

In [65]:
stock.sort_values(['Symbol', 'Year'])

Unnamed: 0,Company,Symbol,Year,Price (USD)
0,Apple,AAPL,1980,0.51
3,Apple,AAPL,1990,1.22
6,Apple,AAPL,2000,3.88
9,Apple,AAPL,2010,37.53
12,Apple,AAPL,2020,318.73
10,Google,GOOGL,2010,312.54
13,Google,GOOGL,2020,1518.73
5,Microsoft,MSFT,1990,1.03
8,Microsoft,MSFT,2000,53.31
11,Microsoft,MSFT,2010,28.21


Once again, Pandas has not detected that the `Year` values are integers

In [66]:
stock[stock['Year'] >= 2000]  # Error!

TypeError: '>=' not supported between instances of 'str' and 'int'

As you can see, the data type is `object`, meaning these numbers are stored as `str`

In [67]:
stock['Year']

0     1980
3     1990
5     1990
6     2000
8     2000
9     2010
10    2010
11    2010
12    2020
13    2020
14    2020
Name: Year, dtype: object

The column which was originally made up of headings has become an `object` (`str`) type

We can change the data type of the `Year` column using `astype`

In [68]:
stock = stock.astype({'Year' : 'int64'})
stock

Unnamed: 0,Company,Symbol,Year,Price (USD)
0,Apple,AAPL,1980,0.51
3,Apple,AAPL,1990,1.22
5,Microsoft,MSFT,1990,1.03
6,Apple,AAPL,2000,3.88
8,Microsoft,MSFT,2000,53.31
9,Apple,AAPL,2010,37.53
10,Google,GOOGL,2010,312.54
11,Microsoft,MSFT,2010,28.21
12,Apple,AAPL,2020,318.73
13,Google,GOOGL,2020,1518.73
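Another option is `pd.to_numeric`, which can also cope with values that are not numbers. This sketch uses two hand-made series (`years` and `messy` are hypothetical), not the real `stock` data.

```python
import pandas as pd

# Years stored as text, like the melted column headings
years = pd.Series(['1980', '1990', '2000'], name='Year', dtype=object)

# Explicit cast, as in the cell above
as_int = years.astype('int64')

# to_numeric works out a suitable numeric type by itself
as_numeric = pd.to_numeric(years)

# With errors='coerce', anything that cannot be parsed becomes NaN
messy = pd.Series(['1980', 'unknown'], dtype=object)
coerced = pd.to_numeric(messy, errors='coerce')

print(as_numeric >= 1990)
print(coerced)
```

`astype` fails on a value such as `'unknown'`, so `errors='coerce'` is handy when the column may contain stray text.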


Now we are able to do numerical comparisons on the `Year` column

In [69]:
stock[stock['Year'] >= 2000]

Unnamed: 0,Company,Symbol,Year,Price (USD)
6,Apple,AAPL,2000,3.88
8,Microsoft,MSFT,2000,53.31
9,Apple,AAPL,2010,37.53
10,Google,GOOGL,2010,312.54
11,Microsoft,MSFT,2010,28.21
12,Apple,AAPL,2020,318.73
13,Google,GOOGL,2020,1518.73
14,Microsoft,MSFT,2020,185.38


### Pivoting Long to Wide Data
![Fig. 2](https://www.dropbox.com/s/x2i8xhzt0yvfip5/fig2.gif?raw=1)

Another new dataset...

In [70]:
weather = pd.read_csv('https://www.dropbox.com/s/1pyru339tll1njf/weather-canada.csv?raw=1')
weather

Unnamed: 0,Station Name,Province,Year,Mean Temperature (C),Total Precipitation (mm)
0,BEAR CREEK,BC,1971,15.4,20.9
1,COWICHAN BAY CHERRY POINT,BC,1971,17.4,12.8
2,COWICHAN LAKE FORESTRY,BC,1971,18.8,21.3
3,COWICHAN LAKE VILLAGE,BC,1971,17.7,36.4
4,DUNCAN FORESTRY,BC,1971,17.7,18.1
...,...,...,...,...,...
81757,GOOSE A,NL,2017,15.8,109.0
81758,HOPEDALE (AUT),NL,2017,11.6,83.2
81759,MARY'S HARBOUR A,NL,2017,14.5,56.9
81760,NAIN,NL,2017,10.6,38.3


This is long data; however, each row (observation) holds more than one variable

We can pivot the table to see each station with the entry corresponding to different years

In [71]:
weather_p = weather.pivot(index='Station Name',
                          columns='Year',
                          values=['Mean Temperature (C)', 
                                  'Total Precipitation (mm)'])
weather_p

Unnamed: 0_level_0,Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),...,Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm)
Year,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Station Name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
(AE) BOW SUMMIT,,,,,,,,,,,...,,,,,,,,,,
100 MILE HOUSE,16.0,15.1,15.1,13.7,17.2,14.5,13.5,15.5,16.0,14.1,...,,,,,,,,,,
100 MILE HOUSE 6NE,,,,,,,,,,,...,48.4,47.0,21.4,75.8,68.4,28.8,112.4,93.2,106.4,4.8
108 MILE HOUSE,,,15.9,,,,,,,,...,,,,,,,,,,
108 MILE HOUSE ABEL LAKE,,,,,,,,,,,...,37.6,42.8,12.0,55.4,57.2,19.9,30.4,51.8,37.6,3.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
YOYO,,,,,,,,,,,...,,,,,,,,,,
ZAMA LO,15.7,14.3,15.3,13.5,17.5,14.3,13.6,14.6,18.1,16.0,...,81.2,42.4,62.8,175.0,,,,,,
ZEBALLOS MURAUDE CREEK,,,,,,,,,,,...,,,,123.2,56.4,2.4,75.9,62.8,89.4,47.9
ZEHNER,,,,,,,,,,,...,,,,,,,,,,


You can see that the method has "grouped" the columns under each variable, mean temperature or total precipitation!

This allows us to create new dataframes based only on the required info

In [72]:
temperature = weather_p['Mean Temperature (C)']
temperature

Year,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Station Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
(AE) BOW SUMMIT,,,,,,,,,,,...,,,,,,,,,,
100 MILE HOUSE,16.0,15.1,15.1,13.7,17.2,14.5,13.5,15.5,16.0,14.1,...,,,,,,,,,,
100 MILE HOUSE 6NE,,,,,,,,,,,...,14.8,16.5,14.6,11.2,15.1,14.5,16.6,16.5,14.9,17.3
108 MILE HOUSE,,,15.9,,,,,,,,...,,,,,,,,,,
108 MILE HOUSE ABEL LAKE,,,,,,,,,,,...,16.1,18.7,16.8,14.4,18.0,17.5,18.5,18.0,16.7,18.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
YOYO,,,,,,,,,,,...,,,,,,,,,,
ZAMA LO,15.7,14.3,15.3,13.5,17.5,14.3,13.6,14.6,18.1,16.0,...,14.9,11.1,9.5,13.5,,,,,,
ZEBALLOS MURAUDE CREEK,,,,,,,,,,,...,,,,13.9,16.4,18.4,17.6,19.3,17.3,16.3
ZEHNER,,,,,,,,,,,...,,,,,,,,,,


In [73]:
temp2010 = temperature[[2010]]
temp2010 = temp2010[temp2010[2010].notnull()]
temp2010

Year,2010
Station Name,Unnamed: 1_level_1
100 MILE HOUSE 6NE,14.6
108 MILE HOUSE ABEL LAKE,16.8
ABBOTSFORD A,18.3
ABEE AGDM,15.3
ACADIA VALLEY,17.8
...,...
YOHIN,17.7
YOHO NP OHARA LAKE,10.0
YOHO PARK,12.0
YORKTON,18.5


In [74]:
temp2010.mean()

Year
2010    17.429932
dtype: float64

Let's pivot again, this time with the year as index

In [76]:
weather_p2 = weather.pivot(index='Year',
                           columns='Station Name',
                           values=['Mean Temperature (C)', 
                                   'Total Precipitation (mm)'])

weather_p2

Unnamed: 0_level_0,Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),...,Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm)
Station Name,(AE) BOW SUMMIT,100 MILE HOUSE,100 MILE HOUSE 6NE,108 MILE HOUSE,108 MILE HOUSE ABEL LAKE,150 MILE HOUSE 7N,70 MILE HOUSE,ABBEY,ABBOTSFORD,ABBOTSFORD A,...,YOHO PARK,YORK FACTORY,YORKTON,YORKTON A,YOUBOU SCHOOL,YOYO,ZAMA LO,ZEBALLOS MURAUDE CREEK,ZEHNER,ZHODA
Year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1971,,16.0,,,,15.5,,17.6,,17.5,...,,,,92.1,,,74.5,,,
1972,,15.1,,,,14.1,,,,17.1,...,,,,57.6,81.1,,48.3,,,
1973,,15.1,,15.9,,14.1,,,,16.6,...,,,,67.9,,,131.1,,,
1974,,13.7,,,,13.0,11.9,,,16.1,...,,,,31.3,,,34.7,,,
1975,,17.2,,,,,15.2,,,17.6,...,,,,36.1,,,95.6,,,
1976,,14.5,,,,,12.8,,,16.5,...,,,,34.4,,,147.5,,,
1977,,13.5,,,,,13.0,18.3,,15.8,...,,,,53.1,,,87.1,,,
1978,,15.5,,,,,15.6,18.3,,17.9,...,,,,86.2,,,34.7,,,
1979,,16.0,,,,,15.0,19.3,,17.9,...,,,,11.2,,,62.9,,,
1980,,14.1,,,,,13.1,18.4,,16.7,...,,,,62.3,,,108.0,,,


This allows us to get the mean for **every** station at once!

In [77]:
weather_p2['Mean Temperature (C)'].mean()

Station Name
(AE) BOW SUMMIT             10.380000
100 MILE HOUSE              15.217241
100 MILE HOUSE 6NE          15.243333
108 MILE HOUSE              15.900000
108 MILE HOUSE ABEL LAKE    15.883333
                              ...    
YOYO                        13.350000
ZAMA LO                     15.392500
ZEBALLOS MURAUDE CREEK      17.028571
ZEHNER                      18.083333
ZHODA                       18.825000
Length: 4808, dtype: float64
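By default `.mean()` works down each column. With `axis=1` it works across each row instead. A minimal sketch on a made-up frame (the station names and values are invented):

```python
import pandas as pd

# Hypothetical pivoted frame: years as index, stations as columns
temps = pd.DataFrame(
    {'STATION A': [10.0, 12.0], 'STATION B': [20.0, None]},
    index=pd.Index([2009, 2010], name='Year'))

# One value per column: the mean over the years for each station
per_station = temps.mean()

# One value per row: the mean over the stations for each year
# (missing values are skipped by default)
per_year = temps.mean(axis=1)

print(per_station)
print(per_year)
```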

### Pivoting with Aggregation

This next cell will give an error! **Why?**

In [78]:
weather.pivot(index='Province',
              columns='Year',
              values=['Mean Temperature (C)', 
                    'Total Precipitation (mm)'])  # Error!

ValueError: Index contains duplicate entries, cannot reshape
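One way to investigate this error is to look for repeated (index, columns) pairs. The sketch below uses a hand-made frame; `weather_demo` and its station names are invented.

```python
import pandas as pd

# Hypothetical mini version of the weather data: two stations share BC/1971
weather_demo = pd.DataFrame({
    'Station Name': ['STATION A', 'STATION B', 'STATION C'],
    'Province': ['BC', 'BC', 'NL'],
    'Year': [1971, 1971, 1971],
    'Mean Temperature (C)': [15.4, 17.4, 11.6],
})

# Rows whose (Province, Year) pair appears more than once
dupes = weather_demo[weather_demo.duplicated(subset=['Province', 'Year'],
                                             keep=False)]
print(dupes)

# Number of rows competing for each cell of the pivoted table
pairs = weather_demo.groupby(['Province', 'Year']).size()
print(pairs)

# pivot has nowhere to put two values in one cell, so it refuses
try:
    weather_demo.pivot(index='Province', columns='Year',
                       values='Mean Temperature (C)')
    raised = False
except ValueError:
    raised = True
print(raised)
```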

If we use `pivot_table` with `aggfunc`, we can tell Pandas how to combine these duplicate values; for instance, we may want the `mean`

In [79]:
weather.pivot_table(index='Province',
                    columns='Year',
                    values=['Mean Temperature (C)', 'Total Precipitation (mm)'],
                    aggfunc='mean')

Unnamed: 0_level_0,Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),Mean Temperature (C),...,Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm),Total Precipitation (mm)
Year,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
Province,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
AB,15.057655,13.122741,15.16055,14.393491,17.15816,15.045833,13.898171,15.229518,16.707165,15.106562,...,57.499465,71.784946,63.541739,90.536782,72.519481,54.610132,46.419178,55.166038,77.486829,45.0885
BC,17.1033,15.994702,15.956347,14.695879,17.289267,14.965147,15.166482,17.168539,17.047714,15.647929,...,45.344048,33.36748,16.013248,70.2168,46.386822,19.295,43.874,38.252863,50.727064,24.443781
MB,16.25,16.147368,17.946281,20.249194,20.560504,18.756303,18.335897,17.756364,19.988288,18.857273,...,91.12,79.639394,76.880952,57.208197,61.190625,77.395238,43.877778,98.643103,81.407273,49.230909
NB,18.055172,17.520968,19.972581,17.357143,19.917742,17.65082,18.02623,18.147368,19.105357,17.536842,...,90.202941,150.363636,100.396552,126.335714,46.455556,148.278571,159.2125,60.976923,83.503846,41.114815
NL,14.676596,14.482456,16.917241,13.165,17.165517,14.661818,14.688235,15.235088,15.448276,13.847692,...,66.025397,104.258333,122.348,138.456522,86.168519,93.465455,79.782,93.0,75.052381,67.087805
NS,18.052174,17.608451,19.455844,16.082895,19.432877,17.419737,17.6875,17.256164,18.333803,17.115714,...,83.421429,93.6525,96.146875,103.189189,58.54,99.411111,58.148387,75.729032,73.887097,70.881818
NT,13.304545,11.672727,14.507143,13.164286,14.565789,13.507692,12.472,11.266667,15.052,12.096,...,29.204878,43.634483,44.984211,43.134483,42.741667,46.530556,44.153333,52.686364,37.513043,41.33
NU,6.747222,4.506452,6.985294,7.532432,6.761765,6.46,6.934375,4.663636,6.081818,6.17,...,33.063889,29.728571,33.527027,24.541379,33.997561,34.493617,34.283721,30.3175,28.125,28.312821
ON,18.109699,18.820209,19.890492,19.444156,20.575974,18.717377,19.830201,19.06918,19.676431,19.504667,...,103.38908,96.332143,88.679268,63.204516,62.550641,100.649007,94.489362,57.989041,63.464964,79.549624
PE,18.7,18.006667,20.486667,17.133333,20.94,18.2,18.257143,18.371429,19.078571,17.371429,...,54.588889,139.633333,123.988889,121.822222,48.7,92.211111,53.177778,39.188889,53.957143,46.528571
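`pivot_table` with `aggfunc='mean'` is closely related to the `groupby` approach from earlier. A sketch on invented data (the frame `weather_demo` and its values are made up) comparing the two:

```python
import pandas as pd

# Hypothetical mini version of the weather data
weather_demo = pd.DataFrame({
    'Province': ['BC', 'BC', 'BC', 'NL'],
    'Year': [1971, 1971, 1972, 1971],
    'Mean Temperature (C)': [15.0, 17.0, 16.0, 11.0],
})

# Pivot and aggregate in one step
via_pivot = weather_demo.pivot_table(index='Province',
                                     columns='Year',
                                     values='Mean Temperature (C)',
                                     aggfunc='mean')

# Same idea in two steps: group and average, then move Year into the columns
via_groupby = (weather_demo
               .groupby(['Province', 'Year'])['Mean Temperature (C)']
               .mean()
               .unstack('Year'))

print(via_pivot)
print(via_groupby)
```

In both results, a province with no data for a given year gets `NaN` in that cell.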


## "Homework"

Look for data visualisations out there
    
    Plots from data/knowledge is beautiful
    Infographics
    Dashboards

Play/watch people play Among Us ([PC](https://store.steampowered.com/app/945360/Among_Us/) or [Mobile](https://play.google.com/store/apps/details?id=com.innersloth.spacemafia&hl=en_GB))

## An alternative way to run the dashboard for the Coursework Part 1
Thanks Pam!

## Lab (at last!)

I have created one notebook per topic

    Manipulating Columns
    Missing Values
    Joins & Aggregations
    Melt & Pivoting

Moreover, I have provided the guided versions (in .html)

Next week, I will release the Python solutions (.ipynb & html)

**NOTE**: All datasets can be accessed from the dropbox links, but can also be downloaded from Moodle if you prefer (including the ones used in the lecture)