# Introduction to Data Analysis with Pandas
First sign in to hub.ada.ac.uk, open a terminal and clone this repo from https://github.com/adacollege/infographic-challenge

In [None]:
# We tend to abbreviate the pandas library as pd
import pandas as pd

# You can use these options to adjust how pandas abbreviates tables to fit in the notebook
pd.options.display.max_columns = 100
pd.options.display.max_rows = 10

# Display graphs in the notebook
%matplotlib inline

## Getting the data into Python

The `pandas` library stores data in what it calls a *dataframe*, which is really just a smart table.

We use the `read_csv` function to read in our London Boroughs data.

Don't forget to run each cell when you get to it with either `ctrl`+`enter` or `shift`+`enter`.

In [None]:
# read in our csv file, and automatically change missing values (a dot or a single space in the csv) into NaN
boroughs = pd.read_csv('boroughs.csv', na_values = ['.',' '])

# Remember jupyter hub echoes the last line of a cell, so we'll see a representation of our data when we run this cell
# If we only want to see e.g. the first 5 rows, use boroughs.head(5)
boroughs
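
To see what `na_values` does without needing `boroughs.csv`, here is a minimal sketch on a made-up three-row csv held in a string. The borough names `Alpha`, `Beta` and `Gamma` and all the numbers are invented for illustration.

```python
import io
import pandas as pd

# Made-up csv text: "." and a single space stand in for missing values
csv_text = "Borough,Population,Happy\nAlpha,1000,7.2\nBeta,.,7.5\nGamma,3000, \n"

# Same na_values as in the cell above, but reading from a string instead of a file
toy = pd.read_csv(io.StringIO(csv_text), na_values=[".", " "])

# isna() shows True wherever a cell was turned into NaN
print(toy.isna())
```

The `Population` column comes back as floating point rather than integer, because `NaN` is itself a floating point value.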

Notice that Pandas adds an index column for us. We don't really need this as every row has a unique index already - the name of the borough - so let's change this. This will make it easier for us later when we want to look at individual boroughs.

In [None]:
boroughs = boroughs.set_index('Borough')
boroughs.head(5)

## Accessing data

A single column of the data (known as a *series*) is accessible using Python dot notation (e.g. `boroughs.Age`) or square brackets, a bit like with a Python list or dictionary.

In [None]:
# this is the same as boroughs.Age
boroughs['Age']

Square brackets are more flexible because we can give them a list of headings. When we do this we get another smaller dataframe returned.

In [None]:
# note the nested brackets
boroughs[['Population','Happy']]

We can also use the `loc` indexer (which uses square brackets rather than the round brackets of a function call) to *filter* the data and *locate* the index Haringey.

In [None]:
boroughs.loc['Haringey']

We can also select multiple boroughs to compare.

In [None]:
boroughs.loc[['Haringey','Hackney']]

Alternatively we can use loc to get a single piece of data in a given row and column.

In [None]:
boroughs.loc['Waltham Forest', 'Households']
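
The three uses of `loc` above return three different kinds of object. This sketch shows that on a small made-up table; the borough names and numbers are placeholders, not values from `boroughs.csv`.

```python
import pandas as pd

# Made-up table indexed by borough name
toy = pd.DataFrame(
    {"Population": [1000, 2000, 3000], "Households": [400, 900, 1100]},
    index=pd.Index(["Alpha", "Beta", "Gamma"], name="Borough"),
)

one_row = toy.loc["Beta"]                    # one label: a Series (one row)
two_rows = toy.loc[["Alpha", "Gamma"]]       # a list of labels: a smaller DataFrame
one_value = toy.loc["Gamma", "Households"]   # row label and column label: a single value

print(type(one_row).__name__, type(two_rows).__name__, one_value)
```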

## Tasks
*Make yourself some cells below this one to try out some of these activities to practice accessing data from dataframes*
- Choose your home borough (or another borough if you prefer) and make a new dataframe to include just this borough's data and the overall 'London' data (make sure you give the dataframe a new name)
- Select just the columns 'Cars', 'Cycle' and 'PublicT' and make your dataframe contain just these columns.
- Compare your borough to the London average in these areas, what do you notice?

## Sorting and filtering

Let's find out which boroughs have the highest population.

`pandas` dataframes have a `sort_values` function.

### Task

> Make the sort_values function below work, to put the boroughs in order of population
>
> Now put them in *descending* order
>
> Which borough has the largest population?

Remember in a jupyter notebook, you can put the cursor in the function brackets and hit `shift`+`tab` to bring up documentation for that function.

In [None]:
# *** broken ***
boroughs.sort_values(by='Population', ascending = False)

What if we wanted to only include **Inner London** boroughs?

In [None]:
boroughs.loc[boroughs["InnerOuter"]=='Inner London']

So we can pass a series of Booleans (one `True` or `False` per row) into those square brackets to *filter* the data. `pandas` square brackets are clearly a bit more powerful than regular Python square brackets.
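
It can help to look at the filter on its own before using it. A minimal sketch on made-up data (the names and numbers are invented):

```python
import pandas as pd

toy = pd.DataFrame(
    {"InnerOuter": ["Inner London", "Outer London", "Inner London"],
     "Population": [1000, 2000, 3000]},
    index=["Alpha", "Beta", "Gamma"],
)

# The comparison is applied to every row and gives one True/False per row
mask = toy["InnerOuter"] == "Inner London"
print(mask)

# loc keeps only the rows where the mask is True
inner = toy.loc[mask]
print(inner)
```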

### Task

> Filter the data to show only Outer London boroughs
>
> Apply `sort_values` to give the Outer London boroughs in descending order of population

If you want to combine two Booleans into one filter you'll need to put both into parentheses, because Python evaluates `|` (or) and `&` (and) before comparisons such as `==`. For example,

In [None]:
boroughs.loc[(boroughs.InnerOuter=="Inner London") | (boroughs.InnerOuter=="Outer London")]
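
Here is a small sketch of what goes wrong without the parentheses, again on made-up data. The exact exception depends on the column types, so the example catches both of the likely ones.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Population": [1000, 2000, 3000], "Households": [400, 900, 1100]},
    index=["Alpha", "Beta", "Gamma"],
)

# With parentheses: each comparison is evaluated first, then combined with &
both = toy.loc[(toy.Population > 1500) & (toy.Households < 1000)]
print(both)

# Without parentheses Python evaluates `1500 & toy.Households` first,
# which is not what we meant, and the expression raises an error
try:
    toy.loc[toy.Population > 1500 & toy.Households < 1000]
    failed = False
except (TypeError, ValueError) as error:
    failed = True
    print("Without parentheses:", type(error).__name__)
```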

It might be useful to come back to this table of *just* the individual boroughs, so let's assign that to a variable `justBoroughs`

In [None]:
justBoroughs = boroughs.loc[(boroughs.InnerOuter=="Inner London") | (boroughs.InnerOuter=="Outer London")]
justBoroughs.head()

### Note

There is a subtle catch here that is worth thinking about when you're trying to do more advanced stuff with `pandas`.

`boroughs[]` and `boroughs.loc[]` can appear to do the same thing, but they don't always, particularly when you go on to change the result. In general it is better to use `loc`.

See [this article](https://www.dataquest.io/blog/settingwithcopywarning/) later if you want more details. 
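
One common way to avoid the ambiguity, sketched here on made-up data, is to take an explicit `copy` of a filtered table before adding to it. The column name `PopulationThousands` is invented for this example.

```python
import pandas as pd

toy = pd.DataFrame(
    {"InnerOuter": ["Inner London", "Outer London", "Inner London"],
     "Population": [1000, 2000, 3000]},
    index=["Alpha", "Beta", "Gamma"],
)

# .copy() makes it explicit that `inner` is a separate table, not a window onto `toy`
inner = toy.loc[toy.InnerOuter == "Inner London"].copy()

# Use loc when assigning as well as when reading
inner.loc[:, "PopulationThousands"] = inner["Population"] / 1000

print(inner)
print(toy.columns.tolist())  # the original table is unchanged
```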

# Representing Data - Investigating relationships

We would expect there to be an obvious relationship between unemployment rates and employment rates.

In [None]:
justBoroughs.plot.scatter("Employ", "Unemploy");

Let's quantify that by asking for the correlation coefficient - this is a number between -1 and 1 which tells us whether there is a correlation, how strong it is, and whether this is negative or positive.

In [None]:
justBoroughs.Employ.corr(justBoroughs.Unemploy)
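
By default `corr` gives the Pearson correlation coefficient. As a rough check of what that number means, this sketch computes it by hand on made-up figures and compares it with the `pandas` result. The five pairs of values are invented, not taken from the boroughs data.

```python
import numpy as np
import pandas as pd

# Made-up employment and unemployment rates for five imaginary areas
employ = pd.Series([70.0, 72.5, 68.0, 75.0, 66.5])
unemploy = pd.Series([6.0, 5.5, 7.5, 4.0, 8.0])

# Distance of each value from its own mean
dx = employ - employ.mean()
dy = unemploy - unemploy.mean()

# Pearson's r: how much the two vary together, scaled by how much each varies alone
r_by_hand = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
r_pandas = employ.corr(unemploy)

print(r_by_hand, r_pandas)
```

A negative value means that one variable tends to fall as the other rises.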

### Question

> Why isn't this a perfect correlation?

The scatterplot above isn't particularly engaging - not great for an infographic. Thankfully we have alternative ways to do this. One possible library with extra options we could use is `seaborn`.

In [None]:
# pyplot is the grandparent of all python plotting packages
import matplotlib.pyplot as plt
# seaborn is based on pyplot but makes it easier to use
import seaborn as sns
# I don't know why we abbreviate seaborn as sns

Now an example,

In [None]:
# by default seaborn plots come out a bit small, so make ours 8in by 8in
plt.figure(figsize=(8,8))
# sns.scatterplot has options for controlling colour and dot size so we can use four variables on one graph
sns.scatterplot(data=justBoroughs,
                x="Employ",
                y="Unemploy",
                size="Population",
                sizes=(10,200),
                hue="NEET",
                palette="Reds")
# where to put the legend
plt.legend(loc='upper right');

### Tasks
- Using the `corr` function, look for relationships which could be strong and plot one of them using seaborn
- Size your data points according to the number of cars
- Colour the data according to whether it is an Inner London or Outer London borough

# Representing Data - Using interesting graphics

The `matplotlib` library lets us create axes as we saw above. We can plot straightforward graphs on these axes, or we can do things which are a bit more interesting...

In [None]:
fig, ax = plt.subplots()
stickman = plt.imread("stick.png")
ax.imshow(stickman, alpha=0.5,extent=(100,200,100,300))
ax.imshow(stickman, alpha=0.2,extent=(0,700,0,400))

ax.annotate('Small stickman', (100,320));
ax.annotate('Large stickman', (100,420));

ax.set_xlim(0, 800)
ax.set_ylim(0, 500)
ax.set_aspect('auto')
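
The `extent` argument is what controls where an image sits and how big it is: it is `(left, right, bottom, top)` in the units of the axes. This is a minimal sketch using a generated array as a stand-in picture, so it does not need `stick.png`; the numbers are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in image: a 20x10 grid of zeros with a vertical stripe of ones
picture = np.zeros((20, 10))
picture[:, 4:6] = 1.0

fig, ax = plt.subplots()

# extent=(left, right, bottom, top): this image spans x from 1 to 2 and y from 0 to 150
image = ax.imshow(picture, extent=(1, 2, 0, 150), cmap="gray_r")

# imshow adjusts the axis limits and aspect ratio, so set them afterwards
ax.set_xlim(0, 3)
ax.set_ylim(0, 250)
ax.set_aspect("auto")

left, right, bottom, top = image.get_extent()
print("height on the y axis:", top - bottom)
```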

### Task
Using the code above, the tooltip/documentation and any other sources you need, try to draw two stickmen on a set of axes below. The heights of the stickmen should represent the populations of Hackney and Islington respectively.

In [None]:
fig, ax = plt.subplots()
stickman = plt.imread("stick.png")

#Get the populations of Hackney and Islington from the table
hac_pop = 
is_pop = 

#Add the stickmen in so they have the correct heights

#You'll need to use and possibly adjust these lines to prevent the plot from stretching, squashing or cutting your figures off
ax.set_xlim(0, 3)
ax.set_ylim(0, 250000)
ax.set_aspect('auto')

### Challenge
Try iterating through all the boroughs in Inner London and drawing stickmen whose heights represent each population.

In [None]:
# This turns our population series into a list so we can iterate through it
# A better solution could be to use the iterrows or items methods (items replaces iteritems, which was removed in pandas 2.0)
borough_pop_list = boroughs.loc[boroughs['InnerOuter']=='Inner London'].Population.tolist()
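
As a hint for the iteration itself, here is a minimal sketch of looping over a series with `items`, which gives you the index label and the value together. The series is made up, and `enumerate` is only there to provide a running position you might use for spacing things out.

```python
import pandas as pd

# Made-up population series indexed by name
populations = pd.Series({"Alpha": 1000, "Beta": 2000, "Gamma": 3000}, name="Population")

labels = []
# items() yields (index label, value) pairs; enumerate adds a position counter
for position, (name, population) in enumerate(populations.items()):
    labels.append(f"{position}: {name} -> {population}")

print(labels)
```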


# Adding some statistical rigour

As part of your project you will need to provide summary statistics to include in a presentation/report backing up your project. The pandas dataframe object has built-in functions for statistical measures, but you need to be careful about whether using them makes sense.

In [None]:
justBoroughs['Age'].mean()

In [None]:
boroughs.loc['London', 'Age']

### Discussion

> Why is the mean of the average ages not the same as the London average age?

So use the Inner London, Outer London and London averages from the main table rather than applying `mean` to a column.
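
A small worked example shows why a plain `mean` of averages can mislead. It assumes an overall average should count every resident equally, which means larger boroughs carry more weight; the two boroughs and their figures are made up.

```python
import numpy as np
import pandas as pd

# Two imaginary boroughs: a small one with a low average age, a large one with a high average age
toy = pd.DataFrame(
    {"Age": [30.0, 40.0], "Population": [100, 300]},
    index=["Alpha", "Beta"],
)

# Treats both boroughs as equally important
simple_mean = toy["Age"].mean()

# Weights each borough's average by how many people live there
weighted_mean = np.average(toy["Age"], weights=toy["Population"])

print(simple_mean, weighted_mean)
```

Here the simple mean is (30 + 40) / 2 = 35, while the population-weighted mean is (30 × 100 + 40 × 300) / 400 = 37.5.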

### Discussion
Other statistical functions we can use with pandas include `median`, `std`, `quantile`. You've already seen `corr`.

> What does each of these functions do?
> 
> When would each of them be appropriate to use?
> 
> What other functions can you find which might be useful?

You should make sure that everything you represent with your infographic has supporting numerical statistics to give a rigorous backing to your project.

# Representing Data - Time Series

The other `csv` files all contain time series. Let's look at how recycling has changed over recent years.

In [None]:
recycling = pd.read_csv('recycling.csv')
recycling

In [None]:
pd.to_datetime(recycling.Year,format="%Y")

This time we'll make `Year` the index

In [None]:
recycling['Year']=pd.to_datetime(recycling.Year,format="%Y")
recycling = recycling.set_index("Year")
recycling

Now we can draw a time series graph

In [None]:
recycling['Barnet'].plot()
recycling["Barking and Dagenham"].plot();

It would be helpful to be able to show that Barking and Dagenham has improved by more *as a proportion* of their starting point than Barnet has.

We can make a new column, call it BarnetIndexed say, and fill it with the percentages scaled to 1 at 2004. And the same for Barking and Dagenham.

In [None]:
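# Year is now a datetime index, so look up 2004 with a Timestamp rather than the integer 2004
recycling["BarnetIndexed"] = recycling.Barnet/recycling.Barnet.loc[pd.Timestamp("2004")]
recycling["Barking and DagenhamIndexed"] = recycling["Barking and Dagenham"]/recycling["Barking and Dagenham"].loc[pd.Timestamp("2004")]

Things to note about the above

* you can make a new column just by saying `recycling["New column name"]=`
* you can divide every number in a column by the value in 2004 by just doing `recycling.Barnet/recycling.Barnet.loc[pd.Timestamp("2004")]`
* because `Year` is now a `datetime` index, we look up 2004 with `pd.Timestamp("2004")` rather than the plain integer `2004`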
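
The same scaling can be sketched on a toy table. The column name `Alpha` and its values are made up, and the lookup uses `pd.Timestamp("2004")` on the assumption that the index holds datetimes, as it does after the `to_datetime` cell above.

```python
import pandas as pd

# Made-up recycling percentages with one timestamp per year as the index
toy = pd.DataFrame(
    {"Alpha": [10.0, 12.0, 15.0]},
    index=pd.to_datetime(["2004", "2005", "2006"], format="%Y"),
)

# The single value for 2004 (the Timestamp is 1 January 2004)
base = toy.loc[pd.Timestamp("2004"), "Alpha"]

# Dividing a whole column by one number scales every row
toy["AlphaIndexed"] = toy["Alpha"] / base

print(toy)
```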

In [None]:
recycling.BarnetIndexed.plot(c="red")
recycling["Barking and DagenhamIndexed"].plot(c="blue");
# note the `c` for colour

In fact, let's go ahead and do that for all the boroughs. We can use a `for` loop over all the columns (remember that in this dataframe it's the boroughs that are columns and the years are rows.)

In [None]:
for column in recycling.columns:
    recycling["{}Indexed".format(column)] = recycling[column]/recycling[column].loc[pd.Timestamp("2004")]
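
As an alternative to the loop, `pandas` can divide every column by its own base value in one step. This is a sketch on a made-up table, not a replacement for the cell above; the names `Alpha` and `Beta` and their values are invented.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Alpha": [10.0, 12.0, 15.0], "Beta": [20.0, 30.0, 50.0]},
    index=pd.to_datetime(["2004", "2005", "2006"], format="%Y"),
)

# The 2004 row as a series: one base value per column
base_row = toy.loc[pd.Timestamp("2004")]

# div matches the base values to the columns by name; add_suffix renames the results
indexed = toy.div(base_row).add_suffix("Indexed")

# Put the original and indexed columns side by side
combined = toy.join(indexed)
print(combined)
```

Building the indexed columns in a separate table first also avoids indexing columns that are themselves already indexed.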

In [None]:
recycling["Newham"].plot(c="green")
recycling["NewhamIndexed"].plot(c="blue")
recycling["Barnet"].plot(c="orange")
recycling["BarnetIndexed"].plot(c="red")
plt.title("Recycling in Newham and Barnet");

There was a subtlety in here. If you check `recycling.dtypes` straight after `read_csv`, you'll see that `Year` is an `int64` (an integer). That is why we explicitly turned it into a `datetime` object with [`to_datetime`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) before making it the index, so `pandas` knows we're dealing with time.

Documentation for [`pandas` is here](http://pandas.pydata.org/pandas-docs/stable/).

We've installed several visualisation libraries that you might find useful:

* [`pyplot`](https://matplotlib.org/)
* [`seaborn`](https://seaborn.pydata.org/)
* [`bokeh`](https://bokeh.pydata.org/)
* [`chartify`](https://labs.spotify.com/2018/11/15/introducing-chartify-easier-chart-creation-in-python-for-data-scientists/)
* [`geopandas`](http://geopandas.org/)