# Data Analysis in Python II

## Section 1 - Visualising Data

Let's start again by importing the modules we'll be using in this section.

In [None]:
import math

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('ggplot')

### Line Graphs

Plot a line through the points given in the following two arrays, with `X` on the X axis and `Y` on the Y axis.

In [None]:
X = [1, 2, 3, 4, 5]
Y = [10, 20, -5, 12, 24]

# Your code here


The following values are the goals scored by the top 5 Premier League teams between 2011 and 2016. Plot these all on the same graph.

Make sure you have the legend on the graph and label the X and Y axes.  
Also have your graph start from 0 on the Y axis
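
As a rough illustration of the pieces involved (several lines on one axes, a legend, axis labels and a Y axis starting at zero), here is a minimal sketch. The team names and numbers are made up for the example and are not the exercise data.

```python
import matplotlib.pyplot as plt

# Hypothetical data, unrelated to the exercise values
seasons = [2001, 2002, 2003, 2004]
team_a = [40, 55, 48, 60]
team_b = [35, 42, 58, 51]

fig, ax = plt.subplots()

# Each call adds one line; label= is what the legend will display
ax.plot(seasons, team_a, label='Team A')
ax.plot(seasons, team_b, label='Team B')

ax.set_xlabel('Season')
ax.set_ylabel('Goals scored')
ax.set_ylim(bottom=0)  # force the Y axis to start at zero
ax.legend()            # build the legend from the label= values
```

The same calls are available on `plt` directly (`plt.plot`, `plt.xlabel`, `plt.ylim`, `plt.legend`) if you prefer not to create the axes object yourself.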

In [None]:
years = [2011, 2012, 2013, 2014, 2015, 2016]
arsenal = [72, 74, 72, 68, 71, 65]
chelsea = [69, 65, 75, 71, 73, 59]
liverpool = [59, 47, 71, 101, 52, 63]
man_city = [60, 93, 66, 102, 83, 71]
man_utd = [78, 89, 86, 64, 62, 49]

# Your code here


Plot the graph for:

$ y = \dfrac{1}{x} $

For x between 0.5 and 4

In [None]:
# Your code here


On the same graph, draw the lines for

$ y = x^2 $  
and  
$ y = x^4 $

for x between -3 and 3.

Also plot markers for the individual points where X is an integer

In [None]:
# Your code here


On the same graph draw the lines for

$ y = 3 + \sqrt{6x - x^2 -8} $  
and  
$ y = 3 - \sqrt{6x - x^2 -8} $  

for 2 < x < 4.

Draw the first line as a solid green line ('-') and the second line as a dot-dashed red line ('-.')

In [None]:
# Your code here


### Scatter Plots

#### Litter Data Set

The file `data/litters.csv` has a number of litters of mice, with the pups' body and brain weights.  

Plot a scatter plot of litter size against body weight in grams.

In [None]:
litters = pd.read_csv('data/litters.csv')

litters.head()

In [None]:
# Your code here


### Boxplot

Load the iris dataset into a DataFrame as shown in the slides.

In [None]:
from sklearn import datasets
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
iris_df['species'] = iris['target']
iris_df.head()

Now draw 3 boxplots of `petal length (cm)`, one for each of species 0 (Iris setosa), 1 (Iris versicolor) and 2 (Iris virginica).

In [None]:
# Your code here


### Histograms

#### UK Driver Deaths Data Set

The file `data/uk_driver_deaths.csv` contains the number of drivers that died in the UK for each month between 1969 and 1984 inclusive.

Draw a histogram showing the distribution of deaths per month with bins between 1000 and 3000 of width 100.
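
Fixed-width bins can be passed to `hist` as an explicit list of edges. The sketch below uses randomly generated values (not the driver deaths data) and a different range, purely to show how `np.arange` builds the edges.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up sample: 500 values centred on 50
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=500)

# Edges 0, 10, ..., 100. np.arange excludes the stop value,
# so the stop is set one bin width past the last edge we want.
bin_edges = np.arange(0, 110, 10)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(values, bins=bin_edges)
```

Values that fall outside the first and last edge are simply not counted, so it is worth checking the range of the data before choosing the edges.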

In [None]:
deaths = pd.read_csv('data/uk_driver_deaths.csv')

deaths.head()

In [None]:
# Your code here


### Bar Charts

Plot a bar chart of the different types of energy fuel against their output in Gigawatt-hours (GWh)

In [None]:
energy_types = ['Nuclear', 'Hydro', 'Gas', 'Oil', 'Coal', 'Biofuel']
energy = [5, 6, 15, 22, 24, 8]

# Your code here


Plot each type of medal won at the 2012 Olympic Games against the respective country. Pick bar colours that match the medal colours.

In [None]:
medals_2012 = pd.DataFrame(
    {
        'gold': [46, 27, 26, 19, 17],
        'silver': [37, 23, 18, 18, 10],
        'bronze': [38, 17, 26, 19, 15]
    }, index = ['USA', 'GB', 'China', 'Russia', 'Germany']
)

medals_2012

In [None]:
# Your code here


## Section 2 - Data Aggregation

#### Student's Sleep Data Set

The data set below shows 10 students, and their response to two soporific (sleep-inducing) drugs, compared to a control period.  

The increase in sleep is given by the variable 'extra'.

In [None]:
sleep = pd.DataFrame({
    'extra': [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0, 2, 1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4],
    'group': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
})

Using groupby, find the mean of `extra` for `group` 1 and `group` 2. Which of the two drugs is more effective on average?
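
For reference, this is the general shape of a groupby aggregation, sketched on a small made-up Data Frame (the column names `team`, `player` and `points` are hypothetical): group on one column, select another, then reduce it.

```python
import pandas as pd

# Hypothetical data: two teams, two players, one score each
scores = pd.DataFrame({
    'team': ['red', 'red', 'blue', 'blue'],
    'player': [1, 2, 1, 2],
    'points': [10, 20, 30, 50],
})

# Mean of 'points' within each team
mean_by_team = scores.groupby('team')['points'].mean()

# Sum of 'points' for each player across both teams
total_by_player = scores.groupby('player')['points'].sum()

# Index label (here the player number) with the largest total
best_player = total_by_player.idxmax()
```

Here `mean_by_team` and `total_by_player` are Series indexed by the grouping column, so methods such as `idxmax()` or `sort_values()` can be chained on the result.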

In [None]:
# Your code here


Also using groupby, what was the total effect (sum of `extra` sleep) of the soporific drugs for each of the students (`ID`). Which ID was the most susceptible to both drugs in total?

In [None]:
# Your code here


#### Product Sales Data Set (Mocked)

The data set below describes sales of products from an online retailer. Run the cell below to see some of the sales

In [None]:
# Run this cell to create mocked data set
np.random.seed(50)
# 19 products (IDs 1 to 19), 200 transactions, 7 countries
prices = {x: np.random.randint(5, 20) for x in np.arange(1, 20)}
sales_df = pd.DataFrame({
    'Quantity': np.random.randint(1, 5, 200),
    'Product_Id': np.random.randint(1, 20, 200),
    'Country': np.random.choice(['UK', 'USA', 'France', 'Australia', 'Norway', 'Rep. Ireland', 'Netherlands'], 200)
})
sales_df['Price'] = sales_df['Product_Id'].map(lambda x: prices[x])
sales_df['Revenue'] = sales_df['Price'] * sales_df['Quantity']
sales_df.head()

Which country brings in the most revenue?

In [None]:
# Your code here


Which country had the most individual transactions?

In [None]:
# Your code here


Which country bought the most items of stock?

In [None]:
# Your code here


Which country bought the highest quantity of `Product_Id` 5?

In [None]:
# Your code here


### Nobel Data Set

The Nobel Foundation would like some information on a data set they have provided you with. It is an Excel document with two sheets named `nobel_prizes` and `population` respectively.

This file is in the data directory as `"data/nobel_prizes.xlsx"`

Read in the two Excel sheets into 2 separate Data Frames

_Hint:_ http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

In [None]:
# Recent pandas versions spell this keyword sheet_name (the older sheetname was removed)
nobel = pd.read_excel('data/nobel_prizes.xlsx', sheet_name='nobel_prizes')
population = # Fill this in

They would first like to know what proportion of winners were female and which category has the most female winners.

In [None]:
# Your code here


In [None]:
# Your code here


Which country do the most Literature Nobel Prize winners come from?

In [None]:
# Your code here


How many Nobel Prize Winners have the first name "Robert"?

In [None]:
# Your code here


We need to replace some of the values in the `"Birth Country"` column in our `nobel` Data Frame.

One of the Nobel Prize winners is listed as coming from French overseas territory `"Guadeloupe Island"`. Replace this value with `"France"`.

Trinidad and Tobago is listed as `"Trinidad"`, replace this value with `"Trinidad and Tobago"`.

`"Northern Ireland"` is listed separately from the United Kingdom. Replace this value with `"United Kingdom"`.

There was also a winner from Taiwan, but there is no population entry for Taiwan, so we will have to (perhaps controversially) assign Taiwan to China

In [None]:
# nobel['Birth Country'] = nobel['Birth Country'].replace(
#     ['Guadeloupe Island', 'Trinidad', 'Northern Ireland', 'Scotland', 'Taiwan'],
#     ['France', 'Trinidad and Tobago', 'United Kingdom', 'United Kingdom', 'China']
# )

The Nobel Foundation would like to know the 5 countries with the most prize winners per capita, counting each winner under their country of birth. Use your 2 Data Frames to calculate this.

_Hint: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html_
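
A per-capita figure needs a count from one table and a population from another, which is where `merge` comes in. The sketch below uses two tiny made-up Data Frames; the column names `country`, `population` and `n_winners` are assumptions for the example and may not match the columns in the Excel sheets.

```python
import pandas as pd

# Hypothetical inputs: one row per winner, and one row per country
winners = pd.DataFrame({'country': ['A', 'A', 'B', 'C', 'C', 'C']})
people = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'population': [1_000_000, 250_000, 6_000_000],
})

# Count winners per country, then turn the result back into a Data Frame
counts = winners.groupby('country').size().reset_index(name='n_winners')

# Join the counts to the populations on the shared 'country' column
merged = counts.merge(people, on='country', how='inner')

# Winners per million inhabitants, largest first
merged['per_million'] = merged['n_winners'] / merged['population'] * 1_000_000
top = merged.sort_values('per_million', ascending=False).head(2)
```

With `how='inner'`, countries missing from either table are dropped, which is why the country names need to be made consistent before merging.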

In [None]:
# Your code here


## Section 3 - Time Series Data

Convert the strings in the following Data Frame to datetimes using 3 separate functions. 

When you've finished, you should see that all the columns in the Data Frame are the same.

_Hint: Remember you can use http://www.strftime.org for information on date parsing codes_
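
As a hedged sketch of how format codes map onto a string, the example below parses one made-up timestamp written in three styles. The column names and the helper `parse_wordy` are hypothetical, and this is only one of several reasonable ways to do the parsing.

```python
from datetime import datetime

import pandas as pd

# The same made-up instant written three ways
raw = pd.DataFrame({
    'iso': ['2014-07-01 08:30:00'],
    'wordy': ['08:30:00, 1 July 2014'],
    'us': ['07/01/2014 08:30:00'],
})


def parse_wordy(text):
    # %B is the full month name, %d the day of the month
    return datetime.strptime(text, '%H:%M:%S, %d %B %Y')


parsed = pd.DataFrame({
    # ISO-like strings are recognised without an explicit format
    'iso': pd.to_datetime(raw['iso']),
    # Apply a plain Python function to each element
    'wordy': raw['wordy'].map(parse_wordy),
    # Month-first dates are safest with an explicit format
    'us': pd.to_datetime(raw['us'], format='%m/%d/%Y %H:%M:%S'),
})
```

Giving an explicit `format` avoids any ambiguity between day-first and month-first dates such as `01/04/2015`.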

In [None]:
# run this cell
import datetime

df = pd.DataFrame({
    'a': ['2015-01-04 15:03:00', '2016-12-30 18:18:12', '2017-02-23 09:13:04'],
    'b': ['15:03:00, 4 January 2015', '18:18:12, 30 December 2016', '09:13:04, 23 February 2017'],
    'c': ['01/04/2015 15:03:00', '12/30/2016 18:18:12', '02/23/2017 09:13:04']
})

df

In [None]:
# Your code here, make sure all columns in the data frame read the same when your code is complete


### Resampling



Let's start by reading in our csv `'distances.csv'`

This is a csv that contains mocked data for the sum of distances covered by 100 people in each minute of the day.

In [None]:
df = pd.read_csv('data/distances.csv')
df.head()

Now we can reuse the function we created in the last section to parse column `a`, this time to parse our `datetime` column.

In [None]:
# Your code here


Now let's plot a line graph of distance against time

In [None]:
plt.figure(figsize=(10,5))
# Your code here


We can make out a trend, but it's hardly pretty: this data is for every minute of the day.

Make the datetime column your Data Frame's index, then resample to an hourly granularity, summing up each of the values within each hour.
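
Resampling only works on a datetime index, so the index has to be set first. Here is a minimal sketch on made-up data (two hours of one reading per minute); the column names `when` and `value` are hypothetical.

```python
import pandas as pd

# Two hours of per-minute readings, each with the value 1
minutes = pd.date_range('2020-01-01 00:00', periods=120, freq='min')
readings = pd.DataFrame({'when': minutes, 'value': 1})

# Index by time, bucket into 60-minute windows, add up each bucket
hourly = readings.set_index('when').resample('60min').sum()
```

The rule `'60min'` is used here because the spelling of the hourly alias has changed between pandas versions. Swapping `.sum()` for `.mean()` or `.max()` gives a different summary of each window.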

In [None]:
# Your code here


Now plot a line graph of distance against time again and see how this compares.

In [None]:
plt.figure(figsize=(10,5))
# Your code here


In the first graph there would have been no way of noticing the peaks at 09:00 - 10:00 or 18:00 - 19:00. In our resampled graph we can see the same general trends, but the peaks and troughs are much easier to discern.

### Timezones

By convention, the Z in the following string denotes that this time is UTC.

Parse this string to a datetime, then `localize()` it so that it has a UTC timezone.

In [None]:
utc_dt_str = '2017-05-03T14:15:00Z'

import pytz

# Your code here


Using the Olson timezone `'America/Los_Angeles'`, `normalize()` this date so that it is now in Los Angeles time.
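
The pattern with `pytz` is: parse to a naive datetime, attach a timezone with `localize()`, then convert. The sketch below uses a different made-up timestamp and `'Asia/Tokyo'` rather than the exercise values. It calls `astimezone()` before `normalize()`, which is the combination shown in the pytz documentation.

```python
from datetime import datetime

import pytz

# Parse the string; the trailing Z is matched literally, so the result is naive
naive = datetime.strptime('2017-01-15T09:30:00Z', '%Y-%m-%dT%H:%M:%SZ')

# Attach the UTC timezone without shifting the clock time
utc_dt = pytz.utc.localize(naive)

# Convert the same instant to Tokyo time (UTC+9, no daylight saving)
tokyo = pytz.timezone('Asia/Tokyo')
tokyo_dt = tokyo.normalize(utc_dt.astimezone(tokyo))
```

Both `utc_dt` and `tokyo_dt` describe the same instant. Only the wall-clock representation differs.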

In [None]:
# Your code here


Now write a function `convert_to_la_time()` to do the above and use the `map()` function to convert all the times in the Series below to Los Angeles time.

In [None]:
datetimes = pd.Series(['2017-05-03T14:00:00Z', '2017-05-03T14:15:00Z', 
                       '2017-05-03T14:30:00Z', '2017-05-03T14:45:00Z'])

# Your code here


## Section 4 - K-nearest Neighbors Classification

In this section, we will be using a data set with lots of features about breast cancer to try and predict whether a tumour is benign or malignant.

We will be using K-nearest Neighbors to classify unseen observations into benign (0) or malignant (1).

Let's start by loading the data and taking a look at it.

In [None]:
# run this cell

breast_cancer_df = pd.read_csv('data/breast_cancer.csv')
breast_cancer_df.head()

So we have 30 feature columns that describe our tumours, including "mean radius", "mean texture", "mean perimeter"...

There is a single column (you need to scroll right in the output of the cell above) named `'malignancy'` and this tells us whether a tumour is benign (0) or malignant (1).

Let's start by splitting our data out into test data and training data.

Create a variable `X` that is a Data Frame with all the columns except `"malignancy"`.

Create a variable `y` that is the series of the column `"malignancy"`.

Then split `X` and `y` into test and training data, giving the output data the names:
* `X_train` - This data will be the features you use to train the models.
* `X_test` - This data will be the features that act as unseen data to make predictions against.
* `y_train` - This data will be the labels for training the model.
* `y_test` - This will be the unseen labels you score your predictions against.
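
To show the order in which `train_test_split` returns its four outputs, here is a small sketch on made-up arrays. The variable names and the `test_size` and `random_state` values are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up observations with two features each, and alternating labels
features = np.arange(20).reshape(10, 2)
labels = np.array([0, 1] * 5)

# Outputs come back as: train features, test features, train labels, test labels
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)
```

With `test_size=0.2`, two of the ten rows are held back as test data. Fixing `random_state` makes the shuffle repeatable.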

In [None]:
# Create variables X and y here



from sklearn.model_selection import train_test_split

# Split your data here into X_train, X_test, y_train and y_test



Note that some of our columns have values over 1000 while others have values below 1. Features on the larger scales would therefore influence the model disproportionately, because K-nearest Neighbors relies on distances between observations.

Therefore we must scale our data.

Here we will use the `MinMaxScaler()` to scale our data. For more details on this, see my [blog post](http://benalexkeen.com/feature-scaling-with-scikit-learn/) or the [official docs](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html).

We first fit the scaler to our training data, then we use it to scale our training data. We will later also use it to scale our test data.
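
To make the fit-then-transform split concrete, here is a toy sketch with made-up numbers. The scaler learns each column's minimum and maximum from the training rows only, then applies $ (x - min) / (max - min) $ to whatever it is given.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training rows: column 0 spans 1 to 5, column 1 spans 100 to 500
train = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]])
# One made-up unseen row
unseen = np.array([[2.0, 600.0]])

toy_scaler = MinMaxScaler()
toy_scaler.fit(train)  # learn the per-column min and max from the training rows

scaled_train = toy_scaler.transform(train)    # every column now runs from 0 to 1
scaled_unseen = toy_scaler.transform(unseen)  # reuses the training min and max
```

The unseen value 600 is larger than anything in the training column, so it scales to 1.25. Test data can land outside the 0 to 1 range, which is expected when the scaler is fitted only on training data.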

In [None]:
# Run this cell
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)

# Returns a 2D array
scaled_X_train = scaler.transform(X_train)
# Convert back to a DataFrame
X_train = pd.DataFrame(scaled_X_train, columns=X_train.columns)

Now we create a K Neighbors Classifier model object that uses the 6 nearest neighbors.

Fit this model to your training data (`X_train` and `y_train`)

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=6)

# Your code here


Use the scaler to transform your `X_test` data as we did for the `X_train` data above (only transform, not fit, we only ever fit on our training data).

In [None]:
# Your code here


Now use your model to predict the labels from the scaled `X_test` data and store these predictions in the variable `y_predict`

In [None]:
# Your code here


Use `accuracy_score` to compare how your prediction performed against the unseen `y_test` labels.
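
For orientation, the fit, predict and score steps look like this on a tiny made-up one-feature data set (the names and the choice of 3 neighbours are illustrative only). By default `accuracy_score` returns the fraction of predicted labels that match the true labels.

```python
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: small values are class 0, large values are class 1
toy_features = [[0], [1], [2], [10], [11], [12]]
toy_labels = [0, 0, 0, 1, 1, 1]

toy_knn = KNeighborsClassifier(n_neighbors=3)
toy_knn.fit(toy_features, toy_labels)

# Two unseen points with known true labels
new_points = [[1.5], [10.5]]
true_labels = [0, 1]

predicted = toy_knn.predict(new_points)
toy_accuracy = accuracy_score(true_labels, predicted)
```

Each new point takes the majority label of its 3 nearest training points, so both are classified correctly here and the accuracy is 1.0.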

In [None]:
from sklearn.metrics import accuracy_score

# Your code here


How accurate was the prediction from your model?

## Section 5 - Linear Regression

In this section we will be using some factors indicative of economic wellbeing to predict the price of a Big Mac.

We'll start by reading in the data. This data has the price of a Big Mac in US dollars, the GDP per capita in dollars, the life expectancy in years and the unemployment rate as a percentage.

In [None]:
df = pd.read_csv('data/bigmac.csv')

df.head()

Create a scatter plot of GDP per capita against the price of a Big Mac to see if there is any correlation.

Create the variables `X` and `y`. 

`X` will be a Data Frame of the columns `'gdp_per_capita'`, `'life_expectancy'` and `'unemployment'`.  
`y` will be the `big_mac_price` Series from our Data Frame.

In [None]:
# Your code here



# random_state defined for repeatability
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)

Create a linear regression model and fit it on our training data.

In [None]:
from sklearn.linear_model import LinearRegression

# Your code here


Using the `score()` method from our model and our `X_test` and `y_test` test data sets, calculate the $ R^2 $ value.

In [None]:
# Your code here

This $ R^2 $ value may seem small, but remember that an $ R^2 $ value of 0.5 corresponds to a correlation coefficient of about 0.71 ($ \sqrt{0.5} $).

Also note that there are far more factors that might affect a Big Mac's price, including obesity rates, taxation on fast food and shipping costs.

Find the coefficients from the model's `coef_` attribute and the intercept from the model's `intercept_` attribute. 

Use these to construct an equation in the form:

$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 $

In [None]:
# Your code here


We can use the `predict()` method of our linear regression model to predict a new `y` value as follows:

`regression_model.predict([[x1, x2, x3]])`

It is a list of lists as you could provide multiple lists of `[x1, x2, x3]` to get multiple `y` predictions.
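
As an illustration of the list-of-lists input, the sketch below fits a model to made-up data generated from $ Y = 1 + 2 X_1 + 3 X_2 $, so the fitted intercept and coefficients can be checked by eye. The names `toy_model`, `inputs` and `targets` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up observations with two features each
inputs = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]])
# Targets built exactly from y = 1 + 2*x1 + 3*x2
targets = 1 + 2 * inputs[:, 0] + 3 * inputs[:, 1]

toy_model = LinearRegression().fit(inputs, targets)

# The fit recovers the intercept (about 1) and coefficients (about 2 and 3)
toy_intercept = toy_model.intercept_
toy_coefficients = toy_model.coef_

# One inner list per observation, so two predictions come back
predictions = toy_model.predict([[1, 1], [2, 0]])
```

The two predictions are approximately 6 and 5, matching $ 1 + 2 \cdot 1 + 3 \cdot 1 $ and $ 1 + 2 \cdot 2 + 3 \cdot 0 $.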

McDonald's currently has no restaurants in Macedonia. 

Macedonia has a GDP per capita of \$14,500, a life expectancy of 76.02 years and an unemployment rate of 23.1%. What would we expect to pay for a Big Mac in Macedonia, in \$, if it were sold there?

In [None]:
# Your code here


## Section 6 - Clustering

A telephone company has decided to erect 7 more telephone masts in Cornwall, UK.

It has obtained some GPS data from the mobile phones of people that visited Cornwall on holiday.

Start by reading in the data for these phones

In [None]:
# Run this cell
df = pd.read_csv('data/cornwall_phones.csv')

df.head()

Now create a K-means clustering model with 7 clusters and fit it to the data.

In [None]:
from sklearn.cluster import KMeans

# Your code here


The model now has 7 centroids (cluster centers). Use the `cluster_centers_` attribute of the fitted model to determine where these are.

Store the centroids in a variable named `centroids`
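
To show what `cluster_centers_` holds, here is a sketch on six made-up points that form two obvious groups. The `n_init` and `random_state` values are arbitrary choices to keep the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up points: three near the origin, three near (10, 10)
points = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# One row per cluster, one column per input feature
toy_centroids = toy_kmeans.cluster_centers_
```

Each row of `toy_centroids` is the mean of the points assigned to that cluster, in the same column order as the input data. The order of the rows themselves is not guaranteed.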

In [None]:
# Your code here


Run the cells below to create a map with your centroids plotted on. 

This may take a minute or two. 


Explore the map and see where the best positions for new masts to deal with the influx of summer tourists are.

In [None]:
# Ask your instructor for the access token
mapbox_access_token = 

In [None]:
# Run this cell

# plotly >= 4 no longer ships plotly.plotly (it moved to the separate
# chart_studio package), so this builds and displays the figure locally
import plotly.graph_objects as go

data = [
    go.Scattermapbox(
        # Assumes each centroid is ordered (latitude, longitude)
        lat=[x[0] for x in centroids],
        lon=[x[1] for x in centroids],
        mode='markers',
        marker=dict(
            size=14
        ),
        # One hover label per centroid
        text=['Centroid'] * len(centroids),
    )
]

layout = go.Layout(
    autosize=True,
    hovermode='closest',
    mapbox=dict(
        accesstoken=mapbox_access_token,
        bearing=0,
        center=dict(
            lat=50.2,
            lon=-5
        ),
        pitch=0,
        zoom=7
    ),
)

fig = go.Figure(data=data, layout=layout)
fig.show()