# Exercise 1: Explore a dataset using pandas and seaborn

## Pandas and seaborn Refresher

Let's review using Seaborn and Pandas to load up some data and then pair plot it.

We'll be using the same tools that we used last week for this:
- [pandas](https://pandas.pydata.org) for data handling (our dataframe library)
- [seaborn](https://seaborn.pydata.org) for _nice_ data visualization

Shortly we'll also be trying out:

- [scikit-learn](https://scikit-learn.org), an extensive machine learning library.
- [numpy](https://numpy.org), a fundamental maths library best used by people with a strong maths background.  We won't explore it much today, but it does have some useful methods that we'll need.  It underlies all other mathematical and plotting tools that we use in Python.

We'll be using scikit-learn over the next few weeks, and it's well worth reading the documentation and high level descriptions.

_You will probably want to take a moment to look at the documentation of the libraries above - especially pandas_

The other useful resource is Stack Overflow. If you have a question that sounds like 'how do I do {x}' then someone will probably have answered it on SO. Questions are also tagged by library, so if you have a particular pandas question you can go to https://stackoverflow.com/questions/tagged/pandas (just replace the 'pandas' in the URL with whatever library you're trying to use).

Generally answers on SO are probably a lot closer to getting you up and running than the documentation. Once you get used to the library then the documentation is generally a quicker reference. We will cover strategies for getting help in class.

## Git links

If you want to work in pairs, use GitHub and GitKraken to share code. Here are some useful links for reference:

- GitKraken interface basics: https://support.gitkraken.com/start-here/interface
- Staging and committing (save current state -> local history): https://support.gitkraken.com/working-with-commits/commits
- Pushing and pulling (sync local history <-> GitHub history): https://support.gitkraken.com/working-with-repositories/pushing-and-pulling
- Forking and pull requests (request to sync your GitHub history <-> someone else's history - requires a _review_):
  - https://help.github.com/articles/about-forks/
  - https://help.github.com/articles/creating-a-pull-request-from-a-fork/

## Step 1: Read in the dataset

For this exercise, we will be using the Beijing PM2.5 Data Set, which contains meteorological data from Beijing Capital International Airport and measurements of atmospheric particulate matter (PM) with a diameter of less than 2.5 micrometers.

All the packages you need for the exercise are already installed, so just run the cell.

In [None]:
# Install required packages if using jupyterhub
# %pip install -r ../requirements.txt

In [None]:
import urllib.request
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

First, we need to download the data into the folder *data*. Define *filename*, which provides the path to the folder *data* and the name of the file (you can use the same name as in the URL).

In [None]:
filename =

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00381/PRSA_data_2010.1.1-2014.12.31.csv'
urllib.request.urlretrieve(url, filename)

Read the file using pandas and *filename*.
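As a minimal sketch of this step, the cell below builds a path and reads a CSV file with pandas. To keep it self-contained it writes a tiny made-up file to a temporary folder instead of using the downloaded data, so the folder, the file name `example.csv`, and the values are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the 'data' folder and the downloaded file.
data_dir = tempfile.mkdtemp()
example_filename = os.path.join(data_dir, "example.csv")

# Write a tiny made-up CSV so that the example runs on its own.
with open(example_filename, "w") as f:
    f.write("TEMP,PRES\n-11.0,1021.0\n-12.0,1020.0\n-11.0,1019.0\n")

# pd.read_csv takes the path to the file and returns a DataFrame.
example_df = pd.read_csv(example_filename)
print(example_df.shape)
```

In the exercise itself you would pass your own *filename* to `pd.read_csv` in the same way.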

## Step 2: Explore the dataset

Look at the content of the data frame.

Look at the summary statistics of each variable using pandas.

Use seaborn to plot a pairplot of all the variables in the data frame.

## Step 3: Focus on the variables of interest

This part is slightly biased, because our goal is to perform a linear regression, and it turns out two variables show a nice linear relationship.

Find those two variables and plot their distributions using pandas.

Now, plot the variation of one variable compared to the other using pandas.

Using pandas or seaborn, check the correlation between those two variables.
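For the correlation check, one possible approach with pandas is sketched here. The data frame `toy` and its columns `u` and `v` are made up for the example; replace them with the two columns you identified.

```python
import numpy as np
import pandas as pd

# Made-up data with a strong negative linear relationship.
rng = np.random.RandomState(1)
toy = pd.DataFrame({"u": rng.uniform(0, 10, 200)})
toy["v"] = -3 * toy["u"] + rng.normal(scale=1.0, size=200)

# Pearson correlation between two columns (a single number).
r = toy["u"].corr(toy["v"])

# Correlation matrix for every pair of numeric columns.
corr_matrix = toy.corr()

print(r)
print(corr_matrix)
```

Passing the correlation matrix to seaborn's `heatmap` function is one way to view it graphically.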

## Step 4: Find a linear regression with Seaborn

Now that you've seen a linear relationship between two of the variables, use Seaborn to plot the line of best fit.

There are a few different ways to do this. Try using regplot.

# Exercise 2: Linear regression with scikit-learn

Scikit-learn provides machine learning tools in several categories. These include supervised learning and unsupervised learning. We'll start working with unsupervised learning next week. Supervised learning is about finding a model that relates features that can be measured to labels that we have for the available data. If, for example, we have lithium assays and we want to try to predict lithium based on sensor data from a portable spectrometer, then the lithium assays are the labels and the measured intensities at different wavelengths are the measured features. This kind of supervised learning, where the label is a continuous number, is called regression.

There's another kind of supervised learning called classification. This is what we're doing when we want to assign observed data to different discrete classes. Regression can sometimes be used, with minor additions, to classify data as well. For example, with our lithium spectral regression model we could classify samples as being high in lithium or low in lithium simply by using a threshold value that we set. There are more sophisticated ways to classify, which will be covered in later weeks.
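To make the thresholding idea concrete, here is a minimal sketch. The predicted values and the cut-off of 1.0 are made up for illustration and do not come from a real model.

```python
import numpy as np

# Made-up predicted lithium values from a hypothetical regression model.
predicted_lithium = np.array([0.2, 1.5, 0.9, 3.1, 0.4])

# Hypothetical cut-off chosen by us, in the same units as the predictions.
threshold = 1.0

# Turn each continuous prediction into one of two discrete classes.
labels = np.where(predicted_lithium >= threshold, "high", "low")
print(labels)
```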

We use the estimator API of scikit-learn to do regression.

## The Estimator API of scikit-learn

There are a few steps to follow when using the estimator API.  These steps are the same for all methods that scikit-learn implements, not just for linear regression.

1. Choose a class of model by importing the appropriate estimator class. In our case we want to import Linear Regression. Scikit-learn's documentation might come in handy for that.

First, import LinearRegression from scikit-learn.

Now create an "instance" of the LinearRegression class.

In [None]:
model =

To check that this has worked look at the model object after it's created. It should tell you about some of its settings.

In [None]:
model

These settings are also called hyperparameters.  We'll encounter hyperparameters again next week, and will talk about them in more detail then.  They're often very important in working out whether our model is well fitted to the data.
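If the printed model object does not show many settings, `get_params()` lists them as a dictionary. The sketch below also shows that a hyperparameter is chosen when the instance is created; the variable names are just examples.

```python
from sklearn.linear_model import LinearRegression

example_model = LinearRegression()

# get_params() returns the hyperparameters as a dictionary.
params = example_model.get_params()
print(params)

# Hyperparameters are set when the instance is created, before any fitting.
no_intercept_model = LinearRegression(fit_intercept=False)
print(no_intercept_model.get_params()["fit_intercept"])
```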

2. Next we need to arrange a pandas dataframe into a features matrix and a target vector.

Search on the Internet for this, and use the two variables identified during the previous exercise.  I know that Stack Overflow will be helpful.  You will need to look at the column names in the dataframe to find the names of the two columns that are important to us.  Do this in the next cell.

The notation is a bit strange!  The double pair of square brackets, "[[ ]]", that you will see is correct.

In [None]:
x = 
y = 
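The difference between single and double brackets is easiest to see from the shapes. This is a sketch on a made-up data frame, with hypothetical column names `feature` and `target`.

```python
import pandas as pd

toy = pd.DataFrame({"feature": [1.0, 2.0, 3.0], "target": [2.0, 4.0, 6.0]})

# A list of column names gives a two-dimensional DataFrame: shape (3, 1).
X_example = toy[["feature"]]

# A single column name gives a one-dimensional Series: shape (3,).
y_example = toy["target"]

print(X_example.shape, y_example.shape)
```

scikit-learn expects the features to be two-dimensional (rows are samples, columns are features), which is why the features matrix needs the double brackets even when there is only one feature.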

3. Fit the model to your data by using the fit() method of the LinearRegression object.

Again, look at the documentation for how to apply this.  You'll need to provide your features matrix (X) and target vector (y) as parameters to the fit method.

#### Congratulations, you've trained your first machine learning model!

As this is a two-dimensional linear model, it has two parameters: the line's intercept and slope.  The notation that scikit-learn uses is a little unfriendly.  Its convention is to add a trailing underscore to the names of the parameters it learns from the data.  Also, it calls the slope "coef", so the two attributes to look for are `coef_` and `intercept_`.

After fitting the model, find the coefficient and intercept of the model.

You can also look at the coefficient of determination of the model, R<sup>2</sup>.
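Putting the fit, the parameters and R<sup>2</sup> together, a sketch on made-up data might look like this. The columns `temp` and `pres` and the numbers used to generate them are assumptions chosen for the example, not values from the Beijing data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data: a line with slope -0.5 and intercept 1020, plus some noise.
rng = np.random.RandomState(2)
toy = pd.DataFrame({"temp": rng.uniform(-20, 40, 500)})
toy["pres"] = 1020 - 0.5 * toy["temp"] + rng.normal(scale=2.0, size=500)

toy_model = LinearRegression()
toy_model.fit(toy[["temp"]], toy["pres"])

# coef_ holds one slope per feature; we only have one feature here.
slope = toy_model.coef_[0]
intercept = toy_model.intercept_

# score() returns the coefficient of determination for a regressor.
r2 = toy_model.score(toy[["temp"]], toy["pres"])

print(slope, intercept, r2)
```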

#### Now that we've trained a model, we should make predictions!

4. Make predictions!

This is also more complicated with scikit-learn than it is with Seaborn.

For a given, single value for a feature (i.e., a temperature) we can predict a label.  scikit-learn expects a two-dimensional input even for a single value, hence the double brackets.  For example, for a temperature of 20 &deg;C, we could make a prediction with:

```predicted_pressure = model.predict([[20]])```

But to find the smooth line that seaborn finds we need to explicitly tell scikit-learn that we want to do a prediction for all of the temperatures that we're interested in. To do this we use the "numpy" library and a function called linspace (which is short for linear spacing).

First we need to import numpy (we already did this at the top of the notebook, so this is just a reminder).

```import numpy as np```

While I used predicted_pressure above as an example of a predicted target array, and 20 is an example of x, I'll now switch to the usual y and x conventions used in tutorials with scikit-learn.  You can of course use any variable names you like, and in your own code it's best to use descriptive names that mean something in the domain of your industry, like "predicted_pressure" or "octane_rating".

We need to use the linspace method in numpy.  Use it like this:

```x_fit = np.linspace(-20, 40)```

This will create a collection of 50 evenly spaced temperatures (50 is the default number of points), in order, starting from -20 &deg;C up to 40 &deg;C.  This is what we need, but this collection isn't formatted correctly for scikit-learn.  To make it work with scikit-learn we next have to adjust the format and then ask for the predictions:

```
x_fit_reshaped = x_fit[:, np.newaxis]
y_fit = model.predict(x_fit_reshaped)
```

y_fit now contains our predicted pressures.  Type ```y_fit``` to see them numerically.

Try this all out in the next cell.  Take it step by step.  Don't try to run this all in one go, but build it up line by line, checking that you do not get errors after each line.
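If it helps to see the shapes involved, here is a self-contained sketch of the same steps. It fits a model called `toy_model` to made-up temperature and pressure values (assumptions for the example), so the numbers will differ from yours.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up stand-in data so that the example runs on its own.
rng = np.random.RandomState(3)
temp = rng.uniform(-20, 40, 200)
pres = 1020 - 0.5 * temp + rng.normal(scale=2.0, size=200)

# The features must be two-dimensional, so add a second axis.
toy_model = LinearRegression().fit(temp[:, np.newaxis], pres)

# A single prediction: one row, one column.
single = toy_model.predict([[20]])

# Predictions along an evenly spaced grid of temperatures.
x_fit = np.linspace(-20, 40)
x_fit_reshaped = x_fit[:, np.newaxis]
y_fit = toy_model.predict(x_fit_reshaped)

print(x_fit.shape, x_fit_reshaped.shape, y_fit.shape)
```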

Although pandas and seaborn work nicely for simple plots, we sometimes need to go back to matplotlib, which they both use in the background. Here we do that to reproduce the result we got from regplot.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 12))
df.plot(x='TEMP', y='PRES', kind='scatter', alpha=0.15, ax=ax)
ax.plot(x_fit, y_fit, color='red')

We can also predict our training data, to compare the predicted *y* to the real *y* from the data.

In [None]:
y_pred =

And compute the residuals.

In [None]:
y_res =

Use seaborn to plot the distribution of those residuals.

# Exercise 3: Perturbing perfect linear data

In this exercise, we want to look at the influence of noise and outliers on the predictions of a linear regression. To make things easier to understand, we're gonna work with synthetic data this time.

Scikit-learn has a set of functions to generate synthetic data. You can find more information about them here:

[scikit-learn.org/stable/datasets/index.html#sample-generators](https://scikit-learn.org/stable/datasets/index.html#sample-generators)

You can start playing with the generators for regression at the end of this exercise if you want, but in the meantime we're gonna use a lower-level approach with NumPy, which gives us more flexibility.

NumPy is a collection of mathematics functions which underlies all other mathematical libraries that we've been using, such as Seaborn and scikit-learn. *random* is NumPy's module for generating random numbers from distributions.

First, we need to set up the seed, which means that our results will be reproducible.

In [None]:
np.random.seed(100)

## Step 1: Generate a perfect linear data set

Now let's define a simple linear dataset using NumPy. A uniform distribution means that all of the values that may be returned are equally likely.  When we throw a fair die we are sampling from a (discrete) uniform distribution.

Tell Python that for *x* we want random numbers between 0 and 100 from a uniform distribution, and we want *n_samples* of them.

In [None]:
n_samples = 1000
x = 
a = 0.75
b = 0.75
y = a*x + b
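One function that fits the description above is `np.random.uniform`, which takes the lower bound, the upper bound and the number of samples. This is a sketch with example variable names, not necessarily the only way to fill in the cell.

```python
import numpy as np

np.random.seed(100)

n_example = 1000
# Arguments: low, high, size. Values fall in the interval [0, 100).
x_example = np.random.uniform(0, 100, n_example)

print(x_example.shape, x_example.min(), x_example.max())
```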

In [None]:
sns.regplot(x=x, y=y, line_kws={"color": "red"})

Scikit-learn is very powerful, but it can be a bit long to set up, especially for a problem as simple as this one. Fortunately Python offers other solutions. One is [statsmodels](https://www.statsmodels.org/stable/index.html), a module "that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration."

Here we're gonna use another widely used package for scientific computing, [SciPy](https://www.scipy.org/), and its [stats](https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html) module in particular.

In [None]:
from scipy import stats

We can look at the correlation coefficient (or Pearson coefficient).

In [None]:
stats.pearsonr(x, y)

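And we can fit a linear model to the data and look at a (the slope), b (the intercept), and R<sup>2</sup>. Note that linregress reports the correlation coefficient as rvalue, so square it to get R<sup>2</sup>.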

In [None]:
stats.linregress(x, y)
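The result of `linregress` can be stored and its fields read by name, which is handy for comparing fits later. A sketch on a perfect line, with example variable names:

```python
import numpy as np
from scipy import stats

np.random.seed(100)
x_demo = np.random.uniform(0, 100, 1000)
y_demo = 0.75 * x_demo + 0.75

result = stats.linregress(x_demo, y_demo)

# slope and intercept are the fitted a and b; rvalue squared is R^2.
r_squared = result.rvalue ** 2
print(result.slope, result.intercept, r_squared)

# pearsonr returns the correlation coefficient and a p-value.
r, p_value = stats.pearsonr(x_demo, y_demo)
print(r)
```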

## Step 2: Add Gaussian noise

We had a perfect linear relationship, so now let's add some noise. A normal (or Gaussian) distribution returns values which are most likely to be near the mean, falling off symmetrically to either side.  It is the "bell" curve that you've seen many times.

Here, tell Python that we want the noise that we add to our simple line to have a mean of zero, and a standard deviation of 5.

In [None]:
noise = 
y = a*x + b + noise
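`np.random.normal` is one function that matches this description. It takes the mean, the standard deviation and the number of samples. The sketch below uses example variable names and checks that the sample statistics come out close to what was requested.

```python
import numpy as np

np.random.seed(100)

# Arguments: mean (loc), standard deviation (scale), size.
noise_example = np.random.normal(0, 5, 1000)

print(noise_example.mean(), noise_example.std())
```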

In [None]:
sns.regplot(x=x, y=y, line_kws={"color": "red"})

In [None]:
stats.pearsonr(x, y)

In [None]:
stats.linregress(x, y)

Use seaborn's residplot function to plot the residuals after fitting a line to the data. With a normal distribution we expect to see these residuals evenly scattered around zero.

Try to run the code again with different numbers of samples and different levels of noise. What happens?

## Step 3: Add non-Gaussian noise

Now let's see what happens when the noise isn't normally distributed. An example of a skewed distribution with a long right tail is the gamma distribution.  This is often used to model failure likelihood for machines.  Unlike the normal distribution it is not symmetric.  In quality control applications it quickly peaks after a short lifetime, but then has a long tail that extends many years into the future.  This makes sense, as we expect most failures to be early in the life of a machine because of manufacturing faults; after that the failure time is less predictable, and we all know of machines or gadgets that seem to last forever.  Google will quickly bring up examples of the shape.

Tell Python that now we want the error to follow a gamma distribution with shape parameter k = 2 and scale parameter theta = 2.

In [None]:
noise = 
y = a*x + b + noise
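`np.random.gamma` takes the shape k, the scale theta and the number of samples, in that order. A sketch with example variable names:

```python
import numpy as np

np.random.seed(100)

# Arguments: shape (k), scale (theta), size.
gamma_noise = np.random.gamma(2, 2, 1000)

# Gamma samples are never negative, and their mean is k * theta.
print(gamma_noise.min(), gamma_noise.mean())
```

Because the mean of this noise is k * theta = 4 rather than zero, it shifts the data upwards on average. Keep that in mind when you compare the fitted intercept with *b*.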

Now, plot the data and regression line with seaborn, look at the correlation coefficient and R<sup>2</sup> with stats, and the residuals with seaborn.

## Step 4: Add outliers

Start again with normally distributed noise in *y*.

Let's add some outliers.

In [None]:
n_outliers = 5
x_outliers = np.random.uniform(0, 40, n_outliers)
y_outliers = np.random.uniform(80, 100, n_outliers)

x_outliers = np.concatenate((x, x_outliers))
y_outliers = np.concatenate((y, y_outliers))

Plot the data and regression line with seaborn, look at the correlation coefficient and R<sup>2</sup> with stats, and the residuals with seaborn.

Try to change the number of outliers and their distribution for *x* and *y*, and see what happens.
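One way to see the effect of the outliers numerically is to fit the data with and without them and compare the results side by side. This sketch regenerates its own data with the same settings as above; the variable names are examples.

```python
import numpy as np
from scipy import stats

np.random.seed(100)

# Noisy linear data, as in Step 2.
x_clean = np.random.uniform(0, 100, 1000)
y_clean = 0.75 * x_clean + 0.75 + np.random.normal(0, 5, 1000)

# The same data with five high outliers added at low x.
x_out = np.concatenate((x_clean, np.random.uniform(0, 40, 5)))
y_out = np.concatenate((y_clean, np.random.uniform(80, 100, 5)))

fit_clean = stats.linregress(x_clean, y_clean)
fit_out = stats.linregress(x_out, y_out)

print("without outliers:", fit_clean.slope, fit_clean.intercept, fit_clean.rvalue ** 2)
print("with outliers:   ", fit_out.slope, fit_out.intercept, fit_out.rvalue ** 2)
```

Outliers that sit well above the line at low *x* pull the left end of the fitted line upwards, so the slope gets smaller and R<sup>2</sup> drops.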

## Optional: Step 5, Add a second population

Let's add another population that follows a linear relationship between *x* and *y* too, but with different parameters.

In [None]:
n_spop = 100
a_spop = 1
b_spop = 2
noise_level_spop = 1
x_spop = np.random.uniform(80, 100, n_spop)
noise_spop = np.random.normal(0, noise_level_spop, n_spop)
y_spop = a_spop*x_spop + b_spop + noise_spop

x_spop = np.concatenate((x, x_spop))
y_spop = np.concatenate((y, y_spop))

Plot the data and regression line with seaborn, look at the correlation coefficient and R<sup>2</sup> with stats, and the residuals with seaborn.

## Optional: Step 6, Add heteroscedastic error

Now, how would you change this code to create a heteroscedastic error, that is, noise whose spread is not the same for all values of *x*?
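There are many possible answers. One of them, sketched here as an assumption rather than the intended solution, is to let the standard deviation of the noise grow with *x*. The factor 0.1 is arbitrary.

```python
import numpy as np

np.random.seed(100)

x_het = np.random.uniform(0, 100, 1000)

# Passing an array as the standard deviation gives each point its own spread.
noise_het = np.random.normal(0, 0.1 * x_het)
y_het = 0.75 * x_het + 0.75 + noise_het

# Compare the spread of the noise for small and large x.
low_spread = noise_het[x_het < 50].std()
high_spread = noise_het[x_het >= 50].std()
print(low_spread, high_spread)
```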

Plot the data and regression line with seaborn, look at the correlation coefficient and R<sup>2</sup> with stats, and the residuals with seaborn.