<a href="https://colab.research.google.com/github/conceptbin/workshops/blob/main/DAPy01_Why_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Why bother with Python?
At first sight, programming looks like a lot more effort than using Excel or Google Sheets, and it is. Sometimes a simple spreadsheet is all you need, and sometimes you need a bit more.

This session aims to introduce you to the possibilities of working beyond spreadsheets:
1. The Colab environment.
2. Key libraries for working with data (Pandas, Seaborn).
3. Getting started with key **functions** for data analysis.
4. Applying those functions to actual data.




> "One reason I love programming is that it’s a cross between magic spells and LEGO." (Cassie Kozyrkov, ["Understanding Data"](https://medium.com/towards-data-science/what-is-data-8f94ae3a56b4))

# Introducing Colab
Google Colab is a programming environment that runs in the cloud (on a Google server). A Colab notebook allows you to write and annotate code and to run it on a self-contained virtual machine in the cloud. This means that you do not have to install a Python environment with all the necessary libraries on your own machine.

* [Introduction to Colab](https://colab.research.google.com/) (in notebook format)

# Libraries for data analysis
In Python, you can use specialist libraries to perform specific tasks. The libraries we focus on today are:
* [Pandas](https://pandas.pydata.org/docs/index.html) (for data analysis)
* [Seaborn](https://seaborn.pydata.org/tutorial/introduction.html) (for data visualisation)

# Functions
Libraries are like software programs that rely on Python to run. Instead of running them through a graphical interface, we call the **functions** that come with those libraries.

In [None]:
# This code cell loads the Pandas and Seaborn libraries, and gives them short-form names.
import pandas as pd
import seaborn as sns

# Numerical data
First, let's load some simple numerical data. Here we look at the UK 2021 Census dataset on towns and housing ([source](https://www.ons.gov.uk/peoplepopulationandcommunity/housing/datasets/townsandcitiesanalysis)). This subset of the data shows the percentage of each area's population within certain age ranges.

In [None]:
# URL of the data file:
housing_url = 'https://github.com/conceptbin/DA_Notebooks/raw/master/py-intro/data/UK2021census_town_population.csv'
# Tell pandas to create a dataframe
df_hs = pd.read_csv(housing_url)

Here we declared a *variable* (`housing_url`) and passed it to the pandas (`pd`) *function* `read_csv()`, which returned a dataframe that we stored in `df_hs`.
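
The same variable-then-function pattern can be tried without a network connection. The sketch below is only an illustration: the CSV text, the town names and the percentages are all made up, and `df_demo` is a hypothetical name that does not appear elsewhere in this notebook.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file; towns and values are invented.
csv_text = """Town,Pct aged 0-15
Alphaville,18.5
Betatown,21.0
Gammaford,16.2
"""

# read_csv() accepts a URL, a file path, or a file-like object such as StringIO.
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo)
```

The census file loaded above goes through the same function, only from a URL instead of in-memory text.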

But what have we loaded? Let's take a look at (some of) the data:

In [None]:
# Display the dataframe within the notebook
df_hs

The table above shows the top and tail (first and last five rows of data) of the **dataframe**, which is a data table in pandas. It's called a data "frame" because it's a flat, two-dimensional table of data organised in columns and rows.
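
If you want only the top or only the tail, pandas provides `head()` and `tail()`. This is a minimal sketch on a toy dataframe with invented values (`df_demo` is a hypothetical name), not on the census data.

```python
import pandas as pd

# Toy dataframe with 20 rows; the values are invented for illustration.
df_demo = pd.DataFrame({'row_id': range(20), 'value': [i * 1.5 for i in range(20)]})

first_rows = df_demo.head()   # first five rows by default
last_rows = df_demo.tail(3)   # last three rows
print(first_rows)
print(last_rows)
```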

## Describe the data using functions
Let's try some functions to summarise the data ([full tutorial here](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html)):

### `info()`

In [None]:
df_hs.info()

The `info()` function gives you the basic shape (how many rows and columns) of the dataframe, a list of columns, how many values in each column are present (i.e., are "non-null"), and the data type of each column.
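
The pieces that `info()` reports can also be requested one at a time. The sketch below assumes a small made-up dataframe with one missing value; the names and numbers are invented for illustration.

```python
import pandas as pd

# Toy dataframe with one missing value; all names and values are invented.
df_demo = pd.DataFrame({
    'Town': ['Alphaville', 'Betatown', 'Gammaford'],
    'Pct aged 0-15': [18.5, None, 16.2],
})

print(df_demo.shape)          # (rows, columns)
print(df_demo.dtypes)         # data type of each column
print(df_demo.notna().sum())  # non-null count per column
```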

### `describe()`

In [None]:
df_hs.describe().round(2)

The `describe()` function gives you descriptive statistics about the numerical columns in your dataframe. These 8 data points (count, mean, standard deviation, minimum, the 25%, 50% and 75% percentiles, and maximum) can tell you a lot about your dataset at a glance, especially how the data is *distributed* within each column.
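
Each figure in the `describe()` output can also be computed on its own. A minimal sketch, using five invented percentages rather than the census data:

```python
import pandas as pd

# Invented percentages, for illustration only.
pct = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0], name='Pct')

# All eight statistics at once.
summary = pct.describe()
print(summary)

# The same figures computed one at a time.
print(pct.count(), pct.mean(), pct.std())
print(pct.min(), pct.quantile(0.25), pct.median(), pct.quantile(0.75), pct.max())
```

The 50% row is the median, so comparing it with the mean is a quick check for a skewed distribution.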

In [None]:
# Show a list of column headings
df_hs.keys()

### `unique()`
Unique values for a specific column:

In [None]:
df_hs['Region/Country'].unique()
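
Two related functions are often useful alongside `unique()`: `nunique()` counts the distinct values, and `value_counts()` counts how often each one occurs. The sketch below uses invented region labels, not the census column.

```python
import pandas as pd

# Invented region labels, for illustration only.
regions = pd.Series(['North', 'South', 'North', 'East', 'South', 'North'])

print(regions.unique())        # distinct values, in order of first appearance
print(regions.nunique())       # number of distinct values
print(regions.value_counts())  # how often each value occurs
```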

### `sort_values()`
You can sort the dataframe by one or more columns. The code below sorts the rows in descending order, and the slice `[:10]` keeps the first 10 rows of the result.

In [None]:
# Top 10 towns/cities with the largest percentage of children in the population
df_hs.sort_values(by='Population aged 0-15 ', ascending=False)[:10]
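
Sorting by more than one column works by passing lists to `by` and `ascending`. This is a sketch on a made-up table (the towns, regions and percentages are invented, and `df_demo` and `df_sorted` are hypothetical names).

```python
import pandas as pd

# Invented data, for illustration only.
df_demo = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'South'],
    'Town': ['Alphaville', 'Betatown', 'Gammaford', 'Deltaby'],
    'Pct aged 0-15': [18.5, 21.0, 16.2, 19.4],
})

# Sort by region (A to Z), then by percentage (largest first) within each region.
df_sorted = df_demo.sort_values(by=['Region', 'Pct aged 0-15'], ascending=[True, False])

# head(2) selects the same rows as the slice [:2].
print(df_sorted.head(2))
```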

# Visualise distributions

## Histogram
A histogram of one column, drawn with Seaborn's `histplot()` and a bin width of 1 (one percentage point per bar).

In [None]:
sns.histplot(data=df_hs, x='Population aged 0-15 ', binwidth=1)

## Box plot
A box plot of the same column, drawn with Seaborn's `boxplot()`.

In [None]:
sns.boxplot(data=df_hs, y='Population aged 0-15 ')

## Scatterplot

In [None]:
# Scatterplot of the 0-15 percentage, grouped by region
sns.relplot(data=df_hs, x='Population aged 0-15 ', y='Region/Country')

Here's a scatterplot for all the data columns. We pass the whole dataframe to Seaborn, which interprets the wide-form data for plotting without any further instructions needed. ([Seaborn guidance on data structures](https://seaborn.pydata.org/tutorial/data_structure.html).)

In [None]:
# Scatterplot for all data columns.
sns.relplot(data=df_hs)

# Compare categories

To have more control over how the categories are plotted, we "melt" the dataset into "long-form" (or *tidy format*).
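
To see what melting does before applying it to the census data, here is a minimal sketch on a made-up two-town table. The column names and values are invented for illustration; only `pd.melt()` and its parameters are the real pandas API.

```python
import pandas as pd

# Wide-form toy table: one row per town, one column per age band (invented values).
df_wide = pd.DataFrame({
    'Town': ['Alphaville', 'Betatown'],
    'Aged 0-15': [18.5, 21.0],
    'Aged 65+': [20.1, 15.3],
})

# Long-form: one row per combination of town and age band.
df_long = pd.melt(df_wide, id_vars=['Town'], value_vars=['Aged 0-15', 'Aged 65+'],
                  var_name='Age Range', value_name='Pop Pct')
print(df_long)
```

The two age-band columns become a single 'Age Range' column of labels plus a 'Pop Pct' column of values, so two wide rows turn into four long rows.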

In [None]:
# Melt the population columns into an identifier ('Age Range') and value ('Pop Pct')
age_columns = ['Population aged 0-15 ', 'Population aged 16-64 ', 'Population aged 65+ ', 'Population aged 85+ ']
df_hs_melt = pd.melt(df_hs, id_vars=['TCITY15CD', 'Town/City', 'Region/Country'],
                     value_vars=age_columns, var_name='Age Range', value_name='Pop Pct')

In [None]:
df_hs_melt

This allows us to select how we plot the data (i.e., by Town/City or by Region/Country).

In [None]:
sns.catplot(data=df_hs_melt, kind='strip', x='Pop Pct', y='Region/Country', hue='Age Range', aspect=1.5)