# Python for data manipulation - DDI session - 31 Mar 2022

Charlotte Desvages

## Welcome!

Today I'll present an overview of some of the most widely used Python libraries for data manipulation and visualisation.

These are huge libraries with a lot of functionality -- the goal for today won't be to teach you everything about them!

Instead, we'll work through some examples, to illustrate and give you a flavour of the kind of problems that each of them can help with.

We will work through **interactive code examples** to demonstrate some of the functionalities of each library.

You will be able to play with the code yourselves, using a cloud service called Binder -- no need to install anything!

## Teams communication

At the top of Teams, use the [**Reactions** menu](https://support.microsoft.com/en-us/office/express-yourself-in-teams-meetings-with-live-reactions-a8323a40-3d07-4129-934b-305370a36e21#ID0EFD=Desktop).

You can also use the **chat** to ask questions.

### How do I code along?

## [bit.ly/ddi-python-31mar22](https://bit.ly/ddi-python-31mar22)

Then, click on the "**launch binder**" button. You should then see the content of these slides. This is a **Jupyter notebook** -- a Python environment that runs in your browser. Wait until it loads completely (~1min), then:
- Scroll down until you see the flags 🚩🚩🚩. Then, click on the **Python code cell** just below. You should see a green frame appearing around it.
- Click the <kbd>▶</kbd> button in the toolbar at the top (or press <kbd>Ctrl</kbd> + <kbd>Enter</kbd> on your keyboard). This will **run** the code inside the cell, and you will see the result below.

# 🚩🚩🚩 Example 1

In [None]:
import numpy as np
print(f'NumPy (version {np.__version__}) imported successfully!')

When you've run the code, `NumPy (version 1.22.1) imported successfully!` should appear **below** the code cell. If that's the case, come back to Teams and give a *thumbs up*. **Don't close your browser tab or you'll lose your progress!**

You can follow the presentation on the Teams meeting. When there are code examples you can run and change yourself, they will be flagged with 🚩🚩🚩 if you want to jump back into your notebook.

## Prerequisites

Some familiarity with Python will be assumed, but the session should also be accessible to those familiar with programming fundamentals more generally -- even in another language.

If at any point I'm going too fast, or you'd like me to explain a particular point or command in more detail, **please do say so in the chat!**

## Our toolbox

Today we'll use **5 Python libraries**:

- **NumPy** and **SciPy**, for general scientific computing using NumPy *arrays*,
- **matplotlib**, to create highly customisable plots,
- **pandas**, to work with datasets and manipulate data stored in *dataframes*,
- **seaborn**, to generate plots and data visualisations easily from pandas dataframes.

# NumPy (*Numerical Python*)

**NumPy** is an *essential* library for scientific computing. (In fact, SciPy, pandas, and many other libraries are built on top of it!)

#### Useful links

- [What is NumPy?](https://numpy.org/doc/stable/user/whatisnumpy.html#whatisnumpy)
- [NumPy: the absolute basics for beginners](https://numpy.org/doc/stable/user/absolute_beginners.html)

In [None]:
import numpy as np

## NumPy features

NumPy is built around the **`ndarray` type** (stands for "N-dimensional array"). NumPy arrays are *containers*, like lists, which allow us to store and handle vectors and matrices efficiently.

```py
# Create a vector
v = np.array([3, 4, -2.8, 0])
print(v)
print(type(v))

# Create a matrix: pass a list of lists to np.array(),
# each element of which is a row of the matrix
id_4 = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 1, 0],
                 [0, 0, 0, 1]])
print(id_4)
```
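As a small aside, every array also carries information about its layout and the type of its elements. A minimal sketch, using a made-up 2x3 matrix `m` purely for illustration:

```python
import numpy as np

v = np.array([3, 4, -2.8, 0])
m = np.array([[1, 2, 3],
              [4, 5, 6]])      # illustrative values only

# shape: size along each dimension; ndim: number of dimensions
print(v.shape, v.ndim)         # (4,) 1
print(m.shape, m.ndim)         # (2, 3) 2

# One non-integer value is enough to make the whole array floating-point
print(v.dtype)                 # float64
```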

NumPy also provides convenience functions to construct arrays with particular properties.

```py
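np.zeros(5)                 # 5 zeros
np.ones((2, 3))             # 2x3 array of ones
np.eye(4)                   # 4x4 identity matrix
np.random.random((2, 2))    # 2x2 array of random numbers in [0, 1)
np.arange(0, 10, 2)         # evenly spaced integers: 0, 2, 4, 6, 8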
```

## Array operations

Using NumPy arrays allows us to perform operations on (and between) many numbers at once, *efficiently*.

```py
x = np.linspace(1, 5, 5, dtype=int)
y = np.arange(2, 11, 2)   # any array with the same length as x works here
print(x)
print(y)

print(x + y)
print(x * y)
print(x**2)

print(y > 5)
print(y[y > 5])
```

Note that matrix multiplication is also possible!
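For instance, the `@` operator computes a matrix product, whereas `*` multiplies element by element. A minimal sketch, with two small matrices `M` and `P` made up for illustration:

```python
import numpy as np

M = np.array([[1, 2],
              [3, 4]])
P = np.array([[0, 1],
              [1, 0]])

# Element-wise product: each entry of M times the matching entry of P
print(M * P)

# Matrix product: rows of M combined with columns of P
# (here, multiplying by P swaps the two columns of M)
print(M @ P)
```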

We'll come back to NumPy throughout the session.

# Plotting data with matplotlib

The **matplotlib** library (and its **pyplot** interface) contains a very large number of built-in functions for creating figures and plots.

### Useful links

* [Matplotlib: Python plotting](https://matplotlib.org/contents.html) - Matplotlib documentation
* [Matplotlib gallery](https://matplotlib.org/gallery/index.html)
* [Pyplot tutorial](https://matplotlib.org/tutorials/introductory/pyplot.html) - Matplotlib documentation

In [None]:
import matplotlib.pyplot as plt

# 🚩🚩🚩 Example 2

In [None]:
# Create an x-axis with 100 points
x = np.linspace(-4, 4, 100)

# Evaluate the function at all these points
y = x ** 2

# Create the plot and display it
plt.plot(x, y, 'bx')
plt.show()

## Multiple plots

```py
# Create a grid of plots with 1 row, 3 columns
fig, ax = plt.subplots(1, 3, figsize=(12, 4))

# Create an x-axis of 50 points between 0 and 2
x = np.linspace(0, 2, 50)

# Plot 3 different functions
ax[0].plot(x, np.tan(x - 2), 'c-')
ax[1].plot(x, np.exp(x), 'g--')
ax[2].plot(x, np.sin(x) * np.cos(2*x), 'ro')
plt.show()
```

See [Parts of a figure](https://matplotlib.org/stable/tutorials/introductory/usage.html#parts-of-a-figure) - Matplotlib documentation

Try a 2D grid!
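One possible way to do this is sketched below. With more than one row *and* more than one column, `plt.subplots()` returns a 2D array of Axes, so each plot is selected with a `[row, column]` pair. The four functions plotted are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Create a grid of plots with 2 rows, 2 columns
fig, ax = plt.subplots(2, 2, figsize=(8, 8))

x = np.linspace(0, 2, 50)

# ax is now a 2x2 array: index it with [row, column]
ax[0, 0].plot(x, np.sin(x), 'c-')
ax[0, 1].plot(x, np.cos(x), 'g--')
ax[1, 0].plot(x, np.exp(x), 'r.')
ax[1, 1].plot(x, x**2, 'k-')
plt.show()
```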

# SciPy

The **SciPy** library provides a set of scientific computing tools for Python -- [it is actually a relative of NumPy](https://www.scipy.org/scipylib/faq.html#numpy-vs-scipy-vs-other-packages).

It's not possible to cover *everything* SciPy can do today, so we'll look at some examples.

### Useful links

* [SciPy documentation](https://docs.scipy.org/doc/scipy/reference/)
* [SciPy tutorial - Introduction](https://docs.scipy.org/doc/scipy/reference/tutorial/general.html)
* [Frequently asked questions - SciPy documentation](https://www.scipy.org/scipylib/faq.html#frequently-asked-questions)

# 🚩🚩🚩 Example 3: interpolating some noisy data

In [None]:
from scipy.interpolate import interp1d, UnivariateSpline

Let's generate some simulated data for this example.

In [None]:
# Generate simulated data: a sinusoid with some added noise
x = np.linspace(0, 10, 30)
noise = np.random.normal(0, 0.3, len(x))
y = np.sin(x) + noise



```python
%matplotlib notebook
fig, ax = plt.subplots()
ax.plot(x, y, 'k.', label='Noisy data')

ax.legend()
plt.show()
```

In [None]:
# Create an interpolating function going through all the datapoints
f_interp = interp1d(x, y, kind='cubic')

# Fit a spline using a different function, which takes a smoothing parameter
f_smooth = UnivariateSpline(x, y, s=4)

# We can use these functions to get interpolated values for y
# at a point in between our x values
print(f_interp(1.5))
print(f_smooth(1.5))

In [None]:
# Plot these splines on top of the data points
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(x, y, 'k.', label='Data points')



```py
# First, we need a denser x-axis, so the plots look smoother!
x_dense = np.linspace(0, 10, 1000)

# Use the previous functions, evaluate them at each point in x_dense
ax.plot(x_dense, f_interp(x_dense), 'c--', label='Interpolating spline')
ax.plot(x_dense, f_smooth(x_dense), 'r-.', label='Smooth spline')

ax.legend()
plt.show()
```

- Change the amount of noise (the standard deviation `0.3` passed to `np.random.normal`) and observe the results.
- Change the smoothing parameter `s` and observe the results.
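To get a feel for the smoothing parameter without plotting, one option is to loop over a few values of `s` and look at how many knots the spline uses and how far it ends up from the data. This is only a sketch: the seed and the list of `s` values are arbitrary choices, and the data is regenerated here so the block runs on its own.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Same kind of simulated data as above, with a fixed seed (an arbitrary choice)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = np.sin(x) + rng.normal(0, 0.3, len(x))

# s=0 forces the spline through every data point;
# larger values allow the curve to stray further from the data
for s in [0, 1, 4, 20]:
    spline = UnivariateSpline(x, y, s=s)
    print(f's = {s:>2}: {len(spline.get_knots())} knots, '
          f'sum of squared residuals = {spline.get_residual():.3f}')
```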

# 🚩🚩🚩 Example 4: Summary statistics

Let's continue with NumPy, SciPy, and matplotlib for now.

The file `oil_reserve_data.csv` in your binder is a slightly edited version of one that comes from [this EU data source](http://data.europa.eu/euodp/en/data/dataset/RH8xH4aVauYKO92JJxv8sA). It describes the emergency oil reserves kept by several European countries over recent years.

In [None]:
import numpy as np
from scipy import stats

In [None]:
# Use NumPy's loadtxt function to open the data file:
A = np.loadtxt('oil_reserve_data.csv',
               dtype=float,
               delimiter=',',
               skiprows=1,
               usecols=(1, 2, 3, 4))

In [None]:
# Make sure we name the columns appropriately
germany = A.T[0]
denmark = A.T[1]
belgium = A.T[2]
bulgaria = A.T[3]

In [None]:
# Summary statistics with NumPy and SciPy functions


```python
print(stats.describe(germany))
print(stats.describe(denmark))

# Find the 5% and 95% percentiles
print('\n5% and 95% percentiles:')
print(np.percentile(germany, (5, 95)))

# Find Pearson's and Spearman's correlation coefficients
# between two columns:
print('\nCorrelation coefficients for Germany/Denmark and Germany/Bulgaria:')
print(stats.pearsonr(germany, denmark))
print(stats.spearmanr(germany, bulgaria))

# We can save an array to a file using NumPy's
# savetxt() function:
np.savetxt('belgium.txt', belgium)
```

In [None]:
# Find a linear relation between the data from Germany and Denmark


```python
reg = stats.linregress(germany, denmark)
print(reg)
print(reg.slope, reg.intercept)

# Plot the data and the line of best fit
fig, ax = plt.subplots()
ax.plot(germany, denmark, 'kx', label='Data points')
ax.plot(germany, reg.slope * germany + reg.intercept, 'r-', label='Line of best fit')

ax.set(xlabel='Germany', ylabel='Denmark')
ax.legend()
plt.show()
```

# Dealing with mixed data: pandas

Pandas is a library built around the **dataframe**, an object for storing data that looks a little like a spreadsheet.

Unlike the elements of a NumPy array, the columns of a dataframe do *not* have to be of the same type.
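As a quick illustration, here is a minimal sketch of a dataframe mixing text, integers, floats, and booleans. The column names and values are made up for this example and have nothing to do with the datasets used later.

```python
import pandas as pd

# Made-up values, for illustration only
df = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'count': [3, 10, 7],
    'weight': [0.5, 1.25, 2.0],
    'flag': [True, False, True],
})

print(df)

# Each column has its own data type
print(df.dtypes)
```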

### Useful links

* [Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/).
* [A quick introduction to Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)
* There is a fantastic tutorial (also in Jupyter) [here](http://pandas.pydata.org/pandas-docs/stable/tutorials.html) (under Lessons for New Pandas Users). This is well worth working through a little if you want a longer introduction to the basic concepts in Pandas.

# 🚩🚩🚩 Example 5

Let's use pandas to read the same data file as the previous example.

In [None]:
import pandas as pd

# Use the read_csv function to read the CSV file into a dataframe
A = pd.read_csv('oil_reserve_data.csv')

# Look at what the dataframe contains
print(A)
print(type(A))

In [None]:
# Display the data for one country
print(A['Germany'])


```python
# Process the dates appropriately
A['Month'] = pd.to_datetime(A['Year/month'], format='%YM%m')
print(A.info())
```
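The `format` string describes what each label looks like: `%Y` is a four-digit year, `M` is matched literally, and `%m` is a two-digit month. A minimal sketch on a few made-up labels, assuming they follow the same pattern as the `Year/month` column:

```python
import pandas as pd

# Made-up labels, assumed to look like the entries of 'Year/month'
labels = pd.Series(['2018M01', '2018M02', '2018M12'])

# Each label becomes a timestamp on the first day of that month
months = pd.to_datetime(labels, format='%YM%m')
print(months)

# Date components are then available through the .dt accessor
print(months.dt.month.tolist())   # [1, 2, 12]
```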

In [None]:
# Plot the data for Denmark and Bulgaria


```python
# Plot the data for Denmark and Bulgaria
fig, ax = plt.subplots()
ax.plot(A['Month'], A['Denmark'], 'gx', label='Denmark')
ax.plot(A['Month'], A['Bulgaria'], 'r.', label='Bulgaria')

ax.legend()
plt.show()
```

# Seaborn: visualising dataframes

Seaborn is a library based on matplotlib, which provides very useful tools for various statistics and data visualisations.

### Useful links

- [Seaborn documentation page](https://seaborn.pydata.org/)
- [Example gallery](https://seaborn.pydata.org/examples/index.html) demonstrating the types of visualisations you can do
- [Introduction with plenty of examples](https://seaborn.pydata.org/introduction.html)
- [Tutorial/user guide](https://seaborn.pydata.org/tutorial.html)

# 🚩🚩🚩 Example 6

In [None]:
import seaborn as sns

# Visualise possible correlations between countries


```python
grid = sns.PairGrid(A)
grid.map(sns.regplot)
plt.show()
```

# Edinburgh cycle hire data

Until a few months ago (unfortunately!), Edinburgh had a bike hire scheme, where you could pick up a bike at one station and return it at another. The company which ran the scheme published anonymised data every month, containing information about all the bike trips people made that month.

All the data files (since the scheme started in September 2018) are available [here](https://edinburghcyclehire.com/open-data/historical), along with a description of the data. They are published under the [Open Government Licence (OGL)](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).

Some questions we could investigate:

- What is the average journey time and distance of bike trips on weekdays? What about weekends?
- How many stations appear in the dataset? Which were the most common starting stations?
- What were the most common times of day for journeys?

---

Useful documentation:

- [`pandas.to_datetime()`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html?highlight=to_datetime#pandas.to_datetime)
- [The `.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#dt-accessor)
- [Time and date components](https://pandas.pydata.org/docs/user_guide/timeseries.html#time-date-components)
- Relational plots: [`sns.relplot()`](https://seaborn.pydata.org/generated/seaborn.relplot.html)
- Categorical plots: [`sns.catplot()`](https://seaborn.pydata.org/generated/seaborn.catplot.html)
- Distribution plots: [`sns.displot()`](https://seaborn.pydata.org/generated/seaborn.displot.html)

## Mean journey time on weekdays/weekends

In [None]:
trips = pd.read_csv('07.csv')
trips.head()

In [None]:
# Visualise duration of trips taken on different days of the week


```python
# trips.info()
trips['started_at'] = pd.to_datetime(trips['started_at'])
trips['ended_at'] = pd.to_datetime(trips['ended_at'])

trips['day_of_week'] = trips['started_at'].dt.dayofweek

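# Weekdays/weekends (dayofweek: 0 is Monday, 6 is Sunday)
from numpy import median

sns.catplot(data=trips,
            x='day_of_week',
            y='duration',
            kind='bar',
            estimator=median)

grid = sns.catplot(data=trips,
                   kind='violin',
                   x='day_of_week',
                   y='duration')
# Limit the y-axis so that a few very long trips don't dominate the plot
grid.set(ylim=(0, 10000))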

grid = sns.displot(data=trips,
                   x='duration',
                   hue='day_of_week',
                   kind='kde',
                   log_scale=True,
                   common_norm=False)
```
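The plots compare individual days. To answer the weekday/weekend question with numbers, one possible approach is to label each trip and then average with `groupby`. The sketch below uses a tiny made-up table (`toy`) with the same `started_at` and `duration` column names as the trips data, so it runs without the data file.

```python
import numpy as np
import pandas as pd

# Tiny made-up table, for illustration only
toy = pd.DataFrame({
    'started_at': pd.to_datetime(['2021-07-02 08:00',    # Friday
                                  '2021-07-03 10:00',    # Saturday
                                  '2021-07-04 15:30',    # Sunday
                                  '2021-07-05 17:45']),  # Monday
    'duration': [600, 1800, 2400, 900],
})

# dayofweek runs from 0 (Monday) to 6 (Sunday), so 5 and 6 are the weekend
is_weekend = toy['started_at'].dt.dayofweek >= 5
toy['day_type'] = np.where(is_weekend, 'weekend', 'weekday')

# Average duration within each group
means = toy.groupby('day_type')['duration'].mean()
print(means)
```

Swapping `.mean()` for `.median()` gives a summary that is less sensitive to a handful of very long trips.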

## Most common starting stations

```python
# Getting station information with groupby
station_groups = trips.groupby('start_station_name')
station_info = pd.DataFrame()

for column in ['start_station_name', 'start_station_latitude', 'start_station_longitude']:
    station_info[column] = station_groups[column].first()

station_info['frequency'] = trips['start_station_name'].value_counts()

g = sns.relplot(data=station_info,
                kind='scatter',
                x='start_station_longitude',
                y='start_station_latitude',
                hue='frequency',
                size='frequency',
                height=12)

# Get the Axes object using g.ax
g.ax.set_aspect('equal')
```
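The other questions listed at the start of this section (how many stations there are, and when journeys tend to start) can be approached with `nunique()`, `value_counts()`, and the `.dt` accessor. This is a sketch on a small made-up table (`toy`) that reuses the `started_at` and `start_station_name` column names. The station names are placeholders.

```python
import pandas as pd

# Made-up trips, for illustration only
toy = pd.DataFrame({
    'started_at': pd.to_datetime(['2021-07-05 08:10', '2021-07-05 08:40',
                                  '2021-07-05 17:05', '2021-07-06 08:20']),
    'start_station_name': ['Station A', 'Station B', 'Station A', 'Station A'],
})

# Number of distinct starting stations
n_stations = toy['start_station_name'].nunique()
print(n_stations)

# Stations ranked by number of trips starting there (most common first)
top_stations = toy['start_station_name'].value_counts()
print(top_stations.head())

# Number of trips starting in each hour of the day
trips_per_hour = toy['started_at'].dt.hour.value_counts().sort_index()
print(trips_per_hour)
```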