# Pandas and NumPy

## Section 1 - NumPy: Arrays and Matrices

We'll start by importing the packages we'll be using in this section.

In [None]:
import numpy as np
import pandas as pd

Create two 1D NumPy arrays (using `np.array([])`) as follows and subtract the second from the first. Do you get the same answer as shown below?

$$
\begin{bmatrix} -1 \\ 12 \\ 14 \\ 2 \end{bmatrix} - \begin{bmatrix} 10 \\ 3 \\ 0 \\ 2\end{bmatrix} = 
\begin{bmatrix} -11 \\ 9 \\ 14 \\ 0\end{bmatrix}
$$

Now try multiplying the first array by a scalar. Do you get the following result?

$$
\begin{bmatrix} -1 \\ 12 \\ 14 \\ 2 \end{bmatrix} * 4 = 
\begin{bmatrix} -4 \\ 48 \\ 56 \\ 8\end{bmatrix}
$$

Carry out an element-wise multiplication on the following matrices and check that you get the same answer as below:
$$
\begin{bmatrix} -1 & 2 \\ 3 & -4 \end{bmatrix} * \begin{bmatrix} 13 & 25 \\ 16 & -4 \end{bmatrix} = 
\begin{bmatrix} -13 & 50 \\ 48 & 16\end{bmatrix}
$$
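As a rough sketch of the three operations above, the snippet below uses two made-up vectors `u` and `v` (not the ones from the exercise) to show that `-` and `*` on NumPy arrays act element by element.

```python
import numpy as np

# Made-up example vectors, deliberately different from the exercise
u = np.array([3, 7, -2])
v = np.array([1, 4, 5])

difference = u - v   # element-wise subtraction
scaled = u * 2       # every element multiplied by the scalar 2
product = u * v      # element-wise product, not a dot product

print(difference)    # [ 2  3 -7]
print(scaled)        # [ 6 14 -4]
print(product)       # [  3  28 -10]
```

The same operators work unchanged on 2D arrays, as long as the shapes match or can be broadcast.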

Create a 3x3 matrix with numbers of your choice and carry out a dot product (`np.dot(m1, m2)`) with the matrix below:  

$\begin{bmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{bmatrix}$  

How does the first matrix relate to the dot product?
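To make the difference between `*` and `np.dot` concrete, here is a small sketch with two made-up 2x2 matrices `a` and `b`. It is an illustration of the two operations, not the answer to the question above.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[0, 1],
              [1, 0]])   # multiplying on the right by b swaps the columns of a

elementwise = a * b             # multiplies matching positions only
matrix_product = np.dot(a, b)   # rows of a combined with columns of b

print(elementwise)      # [[0 2]
                        #  [3 0]]
print(matrix_product)   # [[2 1]
                        #  [4 3]]
```

For 2D arrays, `a @ b` gives the same result as `np.dot(a, b)`.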

Now that we're comfortable with vectors and matrices, we'll get started with pandas.

## Section 2 -  pandas: Series and Data Frames

Series will carry out vector arithmetic aligned on their indices.  

Run the cell below and see how the two Series are added together, not by the order they are in but by their indices.

In [None]:
s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['c', 'd', 'b', 'a'])
print(s1)
print(s2)
print(s1 + s2)
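Index alignment also matters when the two indices do not contain the same labels. The sketch below uses two made-up Series, `left` and `right`, to show what happens in that case.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
right = pd.Series([10, 20, 30], index=['z', 'y', 'w'])

# Labels present in only one Series produce NaN in the result
total = left + right
print(total)

# Series.add with fill_value treats a missing label as 0 instead
filled = left.add(right, fill_value=0)
print(filled)
```

Here `total` has values only for `'y'` (22) and `'z'` (13), because those are the only labels shared by both Series.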

Have a go at this yourself. Create two Series:

* one with the values [5, 10, 15, 20] 
* one with values [0.1, 1, 10, 100]  

and assign the indices such that when you multiply the two Series together, the resulting Series' values are [50, 1000, 1.5, 20].

Run the cell below to assign the Series to s3 and print the result

In [None]:
s3_index = ['alpha', 'beta', 'gamma', 'delta', 'epsilon', 'zeta', 'eta', 'theta', 'iota', 'kappa']
s3 = pd.Series([1, 14, 5, 12, 9, 52, 40, 100, 15, 37], index=s3_index)
print(s3)

Using a list of index names, select the values 12, 52, 15 and 37

Using a list of booleans, select values with the labels `delta`, `epsilon`, `zeta` and `iota`

Using a comparison statement, select all the values that are less than 10

Using another comparison statement, select all values that are greater than the value of `'eta'`

Using another comparison statement, select all the negative values
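The selection styles used above can be sketched on a small made-up Series (`temps` is an assumption for illustration, not part of the exercise):

```python
import pandas as pd

temps = pd.Series([12, 18, 25, 9], index=['mon', 'tue', 'wed', 'thu'])

by_label = temps[['tue', 'thu']]                 # list of index labels
by_mask = temps[[True, False, True, False]]      # list of booleans, one per element
by_comparison = temps[temps > 10]                # comparison builds the boolean mask
above_tue = temps[temps > temps['tue']]          # compare against one element's value

print(by_label)
print(by_mask)
print(by_comparison)
print(above_tue)
```

A comparison that matches nothing simply returns an empty Series rather than raising an error.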

Create your own Series using the notation `pd.Series(<array here>, index=<index here>)` and print your Series

Now construct the same Series using a dict
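As a minimal sketch with made-up values, the two construction styles give identical Series when the dict keys match the index labels:

```python
import pandas as pd

# Values and labels passed separately
from_lists = pd.Series([3, 5], index=['apples', 'pears'])

# Keys become the index, values become the data
from_dict = pd.Series({'apples': 3, 'pears': 5})

print(from_lists.equals(from_dict))   # True
```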

#### Manitoba Lakes Data Set
The following data describes the 9 largest lakes in Manitoba, Canada. Run this cell to see the Data Frame.

The elevation is given in metres and the area is given in km<sup>2</sup>.

In [None]:
lakes = pd.DataFrame({
    'elevation': [217, 254, 248, 254, 253, 227, 178, 207, 217],
    'area': [24387, 5374, 4624, 2247, 1353, 1223, 1151, 755, 657]
    }, index=['Winnipeg', 'Winnipegosis', 'Manitoba', 'SouthernIndian', 'Cedar', 'Island', 'Gods', 'Cross', 'Playgreen']
)
print(lakes)

Print the elevation of Cedar Lake

Select the data from the lakes whose area is greater than 1000 but less than 2000 km<sup>2</sup>

How much bigger in km<sup>2</sup> is Lake Manitoba than the combined area of Southern Indian Lake, Gods Lake and Cross Lake?

Select from the `lakes` Data Frame just Lake Winnipegosis, Island Lake and Playgreen Lake

It has been decided that Cross Lake is actually a pond. Drop this lake from the Data Frame.

New statistics show that due to erosion, the size of Lake Manitoba is now 4750 km<sup>2</sup>.  
Make this change in your Data Frame

Thanks to some significant tectonic activity, Southern Indian Lake is now at an altitude of 265 m. Make this change in your Data Frame.
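The operations in this set of questions (label lookup, filtering on a range, selecting several rows, dropping a row and updating a cell) can be sketched on a made-up Data Frame. The `ponds` data below is invented purely for illustration.

```python
import pandas as pd

ponds = pd.DataFrame(
    {'depth': [2.0, 3.5, 1.2], 'area': [150, 420, 90]},
    index=['North', 'East', 'South'],
)

# Single value: row label first, then column label
east_depth = ponds.loc['East', 'depth']

# Combine two conditions with & and wrap each in parentheses
mid_sized = ponds[(ponds['area'] > 100) & (ponds['area'] < 200)]

# Several rows at once, using a list of labels
subset = ponds.loc[['North', 'South']]

# drop returns a new Data Frame, so assign the result back
ponds = ponds.drop('South')

# Update one cell in place
ponds.loc['East', 'area'] = 450

print(ponds)
```

Using `.loc[row, column]` for the update avoids the chained-assignment pitfalls of `ponds['area']['East'] = 450`.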

#### Sugar Data Set
The following data set gives the weight of sugar (in mg) extracted from 3 different genetically modified sugar plants, as well as from an unmodified (control; `con`) version of the plant.

In [None]:
sugar = pd.DataFrame({
    'weight': [82, 97.8, 69.9, 58.3, 67.9, 59.3, 68.1, 70.8, 63.6, 50.7, 47.1, 48.9],
    'treatments': ['con', 'con', 'con', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']
})
print(sugar)

First we want to drill into treatment `B`. Select just the plants that were genetically modified with alteration `B`.

We want to find the best performing plants. Select just those plants from which more than 65 mg of sugar was extracted.

Now we want to compare group `C` to the control (`con`) group. Select the plants that were either in the control group or genetically modified with alteration `C`.

We would like to compare which of the alterations performed best. Select all of the plants that are not in the control group.
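These four selections are all boolean filters on a column. The sketch below shows the same patterns on a made-up `trials` Data Frame (the group names and scores are invented):

```python
import pandas as pd

trials = pd.DataFrame({
    'score': [10, 14, 9, 17, 12, 8],
    'group': ['ctl', 'ctl', 'X', 'X', 'Y', 'Y'],
})

only_x = trials[trials['group'] == 'X']               # equality test
high = trials[trials['score'] > 11]                   # numeric comparison
ctl_or_y = trials[trials['group'].isin(['ctl', 'Y'])] # membership in a list
not_ctl = trials[trials['group'] != 'ctl']            # exclusion

print(only_x)
print(high)
print(ctl_or_y)
print(not_ctl)
```

`isin` is usually tidier than chaining several `==` tests with `|`, although both give the same rows.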

#### London Terminal Stations

The data set below describes 7 popular London terminal stations. Run this cell.  

In [None]:
stations = pd.DataFrame({
    'Station': ['Waterloo', 'Victoria', 'Liverpool Street', 'London Bridge', 'Euston', 'Kings Cross', 'Charing Cross'],
    'Opened': [1848, 1862, 1874, 1836, 1837, 1852, 1864],
    'Passengers': [99148338, 81151418, 66556690,  53850938,  41677870, 33361696, 28998152],
    'Operator': ['Network Rail', 'Network Rail', 'Network Rail', 'Network Rail', 'Network Rail', 'Network Rail', 'Network Rail'],
    'Platforms': [22, 19, 18, 15, 18, 12, 6]
    }
)
stations = stations.set_index('Station')
print(stations)

Use the `describe()` method for the `stations` Data Frame to describe the data

What is the total annual passenger figure for stations in the data set that were opened before 1860?

What is the average opening year for stations with more than 6 million passengers annually?

Sort the Data Frame by the number of platforms each station has.
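A minimal sketch of the same steps on invented data: `describe()` for summary statistics, a filtered aggregate, and `sort_values` for ordering. The `shops` Data Frame and its column names are assumptions for illustration only.

```python
import pandas as pd

shops = pd.DataFrame({
    'opened': [1990, 2005, 2012],
    'visitors': [1200, 800, 1500],
    'tills': [4, 2, 6],
}, index=['Alpha', 'Beta', 'Gamma'])

# count, mean, std, min, quartiles and max for each numeric column
summary = shops.describe()

# Filter rows with a condition, pick a column, then aggregate
early_total = shops.loc[shops['opened'] < 2010, 'visitors'].sum()
busy_mean_year = shops.loc[shops['visitors'] > 1000, 'opened'].mean()

# Ascending by default; pass ascending=False to reverse
by_tills = shops.sort_values('tills')

print(summary)
print(early_total)      # 2000
print(busy_mean_year)   # 2001.0
print(by_tills)
```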

## Section 3 -  Vectorised Functions

#### Energy of a Photon

The Data Frame below gives the wavelengths of light in the visible spectrum

In [None]:
light = pd.DataFrame([
    {'colour': 'violet', 'wavelength': 400 * 10 ** -9},
    {'colour': 'blue', 'wavelength': 450 * 10 ** -9},
    {'colour': 'green', 'wavelength': 500 * 10 ** -9},
    {'colour': 'yellow', 'wavelength': 580 * 10 ** -9},
    {'colour': 'orange', 'wavelength': 600 * 10 ** -9},
    {'colour': 'red', 'wavelength': 650 * 10 ** -9},
])
light = light.set_index('colour')
print(light)

Assign a new column "`energy_of_photon`" to the Data Frame that is the energy of a photon as calculated by:  

$ E = \dfrac{hc}{\lambda} $  

Where:  
* E is the energy of a photon
* h is the Planck constant
* c is the speed of light
* $ \lambda $ is the wavelength

The speed of light is approximately $3 \times 10^8$ m/s and the Planck constant is approximately $6.62 \times 10^{-34}$ J s.

It takes 4.18 joules to heat 1 gram of water by 1 degree Celsius. 

How many photons of red light would be needed to heat 1 gram of water by 1 degree Celsius?
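The calculation is vectorised: dividing a constant by a column applies the formula to every row at once. The sketch below applies $E = hc/\lambda$ to two made-up infrared wavelengths (not colours from the `light` Data Frame), taking $c \approx 3 \times 10^8$ m/s and $h \approx 6.62 \times 10^{-34}$ J s.

```python
import pandas as pd

PLANCK = 6.62e-34     # J s, approximate
LIGHT_SPEED = 3e8     # m/s, approximate

# Hypothetical wavelengths in metres, chosen only for illustration
infrared = pd.DataFrame(
    {'wavelength': [1000e-9, 2000e-9]},
    index=['ir_short', 'ir_long'],
)

# One expression computes the energy for every row
infrared['energy_of_photon'] = PLANCK * LIGHT_SPEED / infrared['wavelength']

# Number of photons that together carry 1 joule
infrared['photons_per_joule'] = 1 / infrared['energy_of_photon']

print(infrared)
```

Doubling the wavelength halves the energy per photon, so twice as many photons are needed to deliver the same energy.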

#### FBI Crime Data Set
The following cell provides FBI data on crime statistics in the USA at five-year intervals from 1995 to 2015. Run the cell to see the Data Frame.

In [None]:
crime = pd.DataFrame([
    {'Population': 262803276, 'Violent_Crimes': 1798792, 'Murders': 21606},
    {'Population': 281421906, 'Violent_Crimes': 1425486, 'Murders': 15586},
    {'Population': 296507061, 'Violent_Crimes': 1390745, 'Murders': 16740},
    {'Population': 309330219, 'Violent_Crimes': 1251248, 'Murders': 14722},
    {'Population': 321444981, 'Violent_Crimes': 1197704, 'Murders': 15696}
], columns=['Population', 'Violent_Crimes', 'Murders'], index=[1995, 2000, 2005, 2010, 2015])
print(crime)

The number of murders went up slightly from 2000 to 2015, but what happened to the murder rate (murders per 100,000 population)?

Print just the data for the years that the violent crime rate was between 400 and 500 violent crimes per 100,000 citizens
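A rate per 100,000 is the count divided by the population, multiplied by 100,000. Here is a sketch on invented figures (the `towns` data and its column names are assumptions):

```python
import pandas as pd

towns = pd.DataFrame({
    'population': [50000, 200000, 125000],
    'thefts': [100, 900, 250],
}, index=[2001, 2002, 2003])

# Column arithmetic is applied row by row
towns['theft_rate'] = towns['thefts'] / towns['population'] * 100000

# between() is inclusive at both ends by default
in_band = towns[towns['theft_rate'].between(300, 500)]

print(towns)
print(in_band)
```

Comparing rates rather than raw counts removes the effect of population growth between years.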

#### Employee Salary Data Set (Mock)

The following is mocked HR data for employees in a large corporation. The employees are assigned a tier based on how long they've been with the company and the level of their position.   

Run the cell below to see the Data Frame.

In [None]:
employees = pd.DataFrame([
    {"name": "Peter Butler", "salary": 87031, "title": "Financial Analyst", "tier": 3}, 
    {"name": "Amanda Wood", "salary": 80277, "title": "Senior Sales Associate", "tier": 1},
    {"name": "Stephanie Stanley", "salary": 72947, "title": "Senior Financial Analyst", "tier": 4}, 
    {"name": "Todd Rice", "salary": 64779, "title": "Web Developer I", "tier": 2}, 
    {"name": "Victor Dixon", "salary": 24377, "title": "Instructor", "tier": 3}, 
    {"name": "Charles Wood", "salary": 79613, "title": "Database Administrator I", "tier": 2}, 
    {"name": "Ryan Moreno", "salary": 95183, "title": "General Manager", "tier": 1},
    {"name": "Edward Cook", "salary": 17525, "title": "Research Assistant III", "tier": 3},
    {"name": "Amanda Stephens", "salary": 69428, "title": "General Manager", "tier": 3}, 
    {"name": "Joseph Green", "salary": 38846, "title": "Recruitment Specialist", "tier": 2},
    {"name": "Stephen Morris", "salary": 221440, "title": "Automation Specialist I", "tier": 1}
])

print(employees)

The company is doing well and has decided that each employee will get a one-off bonus based on their salary and tier as follows:
* Tier 1 employees get 10% of their salary, up to a maximum of £7500
* Tier 2 employees get 7.5% of their salary, up to a maximum of £7500
* Tier 3 employees get 5% of their salary, or £1500, whichever is greater
* Tier 4 employees get 5% of their salary, or £1000, whichever is greater

Write a function and apply the function to the Data Frame to calculate the bonuses for each employee

Who has the highest bonus, and who has the lowest?
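One way to apply a row-wise rule is `DataFrame.apply` with `axis=1`, which passes each row to a function. The sketch below uses an invented commission rule and a made-up `staff` Data Frame, so it shows the pattern (a cap with `min`, a floor with `max`) rather than the bonus rules above.

```python
import pandas as pd

def commission(row):
    """Hypothetical rule: band 'A' earns 10% capped at 500;
    every other band earns 2% or 100, whichever is greater."""
    if row['band'] == 'A':
        return min(row['sales'] * 0.10, 500)
    return max(row['sales'] * 0.02, 100)

staff = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy'],
    'sales': [3000, 8000, 2000],
    'band': ['A', 'A', 'B'],
})

# axis=1 calls commission once per row
staff['commission'] = staff.apply(commission, axis=1)

# idxmax / idxmin return the index label of the largest / smallest value
top = staff.loc[staff['commission'].idxmax(), 'name']
bottom = staff.loc[staff['commission'].idxmin(), 'name']

print(staff)
print(top, bottom)   # Bob Cy
```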

#### Premier League 2015/16

The data set below is the final league table of the Premier League for 2015/16, Leicester's famous league winning season.

Run the cell below to see the league table

In [None]:
prem = pd.DataFrame([
    {'D': 9,'GA': 67, 'GF': 45, 'L': 18, 'Pld': 38, 'Team': 'AFC Bournemouth', 'W': 11}, 
    {'D': 11, 'GA': 36, 'GF': 65, 'L': 7, 'Pld': 38, 'Team': 'Arsenal', 'W': 20}, 
    {'D': 8, 'GA': 76, 'GF': 27, 'L': 27, 'Pld': 38, 'Team': 'Aston Villa', 'W': 3}, 
    {'D': 14, 'GA': 53, 'GF': 59, 'L': 12, 'Pld': 38, 'Team': 'Chelsea', 'W': 12}, 
    {'D': 9, 'GA': 51, 'GF': 39, 'L': 18, 'Pld': 38, 'Team': 'Crystal Palace', 'W': 11}, 
    {'D': 14, 'GA': 55, 'GF': 59, 'L': 13, 'Pld': 38, 'Team': 'Everton', 'W': 11}, 
    {'D': 12, 'GA': 36, 'GF': 68, 'L': 3, 'Pld': 38, 'Team': 'Leicester City', 'W': 23}, 
    {'D': 12, 'GA': 50, 'GF': 63, 'L': 10, 'Pld': 38, 'Team': 'Liverpool', 'W': 16}, 
    {'D': 9, 'GA': 41, 'GF': 71, 'L': 10, 'Pld': 38, 'Team': 'Manchester City', 'W': 19}, 
    {'D': 9, 'GA': 35, 'GF': 49, 'L': 10, 'Pld': 38, 'Team': 'Manchester United', 'W': 19}, 
    {'D': 10, 'GA': 65, 'GF': 44, 'L': 19, 'Pld': 38, 'Team': 'Newcastle United', 'W': 9}, 
    {'D': 7, 'GA': 67, 'GF': 39, 'L': 22, 'Pld': 38, 'Team': 'Norwich City', 'W': 9}, 
    {'D': 9, 'GA': 41, 'GF': 59, 'L': 11, 'Pld': 38, 'Team': 'Southampton', 'W': 18}, 
    {'D': 9, 'GA': 55, 'GF': 41, 'L': 15, 'Pld': 38, 'Team': 'Stoke City', 'W': 14}, 
    {'D': 12, 'GA': 62, 'GF': 48, 'L': 17, 'Pld': 38, 'Team': 'Sunderland', 'W': 9}, 
    {'D': 11, 'GA': 52, 'GF': 42, 'L': 15, 'Pld': 38, 'Team': 'Swansea City', 'W': 12}, 
    {'D': 13, 'GA': 35, 'GF': 69, 'L': 6, 'Pld': 38, 'Team': 'Tottenham Hotspur', 'W': 19}, 
    {'D': 9, 'GA': 50, 'GF': 40, 'L': 17, 'Pld': 38, 'Team': 'Watford', 'W': 12}, 
    {'D': 13, 'GA': 48, 'GF': 34, 'L': 15, 'Pld': 38, 'Team': 'West Bromwich Albion', 'W': 10}, 
    {'D': 14, 'GA': 51, 'GF': 65, 'L': 8, 'Pld': 38, 'Team': 'West Ham United', 'W': 16}
], columns = ['Team', 'Pld', 'W', 'D', 'L', 'GF', 'GA'])
print(prem)

Notice that the data is in alphabetical order by team name.

Order the league table by the number of points, then by goal difference, to show how the final table looked.

The goal difference is the difference between the number of goals scored and the number of goals conceded.
The number of points is as follows:
* 3 points for a win
* 1 point for a draw
* 0 points for a loss
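The points and goal-difference rules translate directly into column arithmetic, and `sort_values` accepts a list of columns to break ties. A sketch on a made-up three-team table (team names and results are invented):

```python
import pandas as pd

mini = pd.DataFrame({
    'Team': ['Ants', 'Bees', 'Crows'],
    'W': [2, 2, 1],
    'D': [1, 1, 0],
    'L': [0, 0, 2],
    'GF': [5, 7, 2],
    'GA': [3, 2, 9],
})

mini['Pts'] = 3 * mini['W'] + mini['D']   # losses contribute 0 points
mini['GD'] = mini['GF'] - mini['GA']      # goals scored minus goals conceded

# Sort by points first, then by goal difference, both descending
table = mini.sort_values(['Pts', 'GD'], ascending=False).reset_index(drop=True)

print(table)
```

Ants and Bees both finish on 7 points, so goal difference decides that Bees are placed first.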

## Section 4 - Cleaning, Transforming and Merging Data

#### Used Car Dealership (Mocked Data)

A used car dealership sells Toyotas and Fords.  

They keep the details of the cars and the prices these cars sold for in two separate files:
* data/car_details.csv
* data/car_prices.csv

Load the car details and car prices csvs as two separate Data Frames

Using `.head()`, what does each of these Data Frames look like?

What is the average price paid for a Ford Mustang?

What is the average price paid for a 2006 Toyota RAV4?

What is the minimum price paid for a car built in 2007? What is the make and model of this car?

A man comes in looking for a Ford Mustang, but only has £20,000 to spend.   

Looking at the historical data, what year of Ford Mustang can this man afford?

A family are looking to buy either a Ford Focus or a Toyota Verso. They don't have a preference between the two, but the oldest they would want to buy is 2013. Which car would cost them more money?

What year of car has the dealership sold the most of?

*Hint: You can count values in a column using `Series.value_counts()`*
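The questions in this set combine a merge, grouped averages and `value_counts()`. Because the column names in the CSV files are not shown here, the sketch below builds two small Data Frames inline; `car_id`, `model`, `year` and `price` are hypothetical names and may not match the real files.

```python
import pandas as pd

# Stand-ins for the two CSV files; with real files use pd.read_csv(path)
details = pd.DataFrame({
    'car_id': [1, 2, 3, 4],
    'model': ['Hatch', 'Hatch', 'Estate', 'Estate'],
    'year': [2010, 2012, 2012, 2012],
})
prices = pd.DataFrame({
    'car_id': [1, 2, 3, 4],
    'price': [4000, 5000, 7000, 9000],
})

# Join the two tables on their shared key column
cars = details.merge(prices, on='car_id')

# Average price per model
mean_by_model = cars.groupby('model')['price'].mean()

# Row with the lowest price among cars from one year
idx = cars.loc[cars['year'] == 2012, 'price'].idxmin()
cheapest_2012 = cars.loc[idx]

# Most frequent year: value_counts sorts by count, largest first
common_year = cars['year'].value_counts().idxmax()

print(mean_by_model)
print(cheapest_2012)
print(common_year)   # 2012
```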

#### Pokemon Dataset

There is a dataset of Pokemon stored in `data/pokemon.csv`.

Load this dataset and view the top 10 rows using the method `.head(10)`

Notice that some Pokemon have variants with the same Pokedex number (`'#'`). We will count these as duplicates. Remove the duplicates, keeping only the first instance of each Pokemon.

Some of the `Sp. Def` statistics are missing. For those missing `Sp. Def`, assume that the value is the same as `Sp. Atk` and fill the missing values with the values from `Sp. Atk`.

Create a new column `Total` with the sum of the columns `HP`, `Attack`, `Defense`, `Sp. Atk`, `Sp. Def` and `Speed`

Which Pokemon from Generation 1 is the most powerful in terms of total stats?

Which Pokemon makes the best punching bag (`HP` + `Defense` + `Sp. Def`)?
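The cleaning steps above (dropping duplicates on a key, filling gaps from another column, summing columns) can be sketched on a tiny invented table. The `monsters` rows are made up; only the column names `#`, `Sp. Atk` and `Sp. Def` are taken from the text.

```python
import numpy as np
import pandas as pd

monsters = pd.DataFrame({
    '#': [1, 2, 2, 3],
    'Name': ['Aaa', 'Bbb', 'Mega Bbb', 'Ccc'],
    'Sp. Atk': [50, 60, 90, 70],
    'Sp. Def': [40, np.nan, 95, np.nan],
})

# Keep the first row for each value of '#'; copy so later edits are safe
unique = monsters.drop_duplicates(subset='#', keep='first').copy()

# fillna with a Series fills each gap from the row with the same index label
unique['Sp. Def'] = unique['Sp. Def'].fillna(unique['Sp. Atk'])

# Row-wise sum across the chosen columns
unique['Total'] = unique[['Sp. Atk', 'Sp. Def']].sum(axis=1)

strongest = unique.loc[unique['Total'].idxmax(), 'Name']

print(unique)
print(strongest)   # Ccc
```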

#### NHS England Admission Data

There are two datasets for NHS England Admission Data
* data/ae_2014.csv
* data/ae_2013.csv

Load these two data sets into Data Frames and concatenate the Data Frames into one Data Frame

`Total Attendence > 4 hours` is the number of patients that had to wait >4 hours from arrival to admission.  

What is the average percentage of patients that had to wait >4 hours for admission?

The target for admission in less than 4 hours for the NHS is 95%.

How many weeks in 2013 and 2014, respectively, did they not meet this target?
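As a sketch of the concatenate-then-compare pattern, the example below stacks two invented yearly tables and counts the weeks in each year where more than 5% of patients waited over 4 hours (that is, where a 95% target was missed). The column names `week`, `total` and `over_4h` are hypothetical and will differ from the real CSV headers.

```python
import pandas as pd

year_a = pd.DataFrame({
    'week': ['2013-01-06', '2013-01-13'],
    'total': [1000, 2000],
    'over_4h': [40, 120],
})
year_b = pd.DataFrame({
    'week': ['2014-01-05', '2014-01-12'],
    'total': [1500, 1000],
    'over_4h': [90, 30],
})

# Stack the rows and renumber the index from 0
both = pd.concat([year_a, year_b], ignore_index=True)

# Percentage of attendances that waited more than 4 hours
both['pct_over_4h'] = both['over_4h'] / both['total'] * 100

# Extract the year so the weeks can be grouped
both['year'] = pd.to_datetime(both['week']).dt.year

# Weeks missing the target, counted per year
missed = both[both['pct_over_4h'] > 5].groupby('year').size()

print(both['pct_over_4h'].mean())
print(missed)
```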

## Section 5 - Exercises

### Vector Magnitudes

The magnitude of a vector is the square root of the sum of the squares of its components, i.e. the magnitude of the vector $\bar{a}$ below is:

$$
\bar{a} = \begin{bmatrix} x \\ y \\ z \end{bmatrix}
$$


$ |\bar{a}| = \sqrt{x^2 + y^2 + z^2} $

Add a column to the following DataFrame called `magnitude` that is the magnitude of each of the vectors a-j. 

Round this column to 2 decimal places.
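As an illustration of the formula on different data, the sketch below computes magnitudes for three made-up vectors whose answers are whole numbers. The `points` Data Frame is an assumption, separate from the exercise data.

```python
import numpy as np
import pandas as pd

points = pd.DataFrame(
    {'x': [3, 1, 2], 'y': [4, 2, 3], 'z': [0, 2, 6]},
    index=['p', 'q', 'r'],
)

# Square every component, sum across each row, then take the square root
squared_sum = (points[['x', 'y', 'z']] ** 2).sum(axis=1)
points['magnitude'] = np.sqrt(squared_sum).round(2)

print(points)
```

For example, row `p` gives $\sqrt{3^2 + 4^2 + 0^2} = 5$.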

In [None]:
# Run this cell first
vectors = pd.DataFrame({
    'x': [1, 3, -3, 4, 8, -7, 10, -2, -5, 9],
    'y': [-1, 13, 0, -2, 4, 6, 7, 19, 3, 12],
    'z': [3, 4, 7, 4, 2, -5, 0, 12, 8, -6]
}, index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])

In [None]:
# Write your code here


In [None]:
# Now test your function
test_1 = 'magnitude' in vectors.columns
test_2 = vectors.loc['a', 'magnitude'] == 3.32
test_3 = vectors.loc['h', 'magnitude'] == 22.56
test_4 = vectors.loc['i', 'magnitude'] == 9.90
test_5 = vectors.loc['c', 'magnitude'] == 7.62
test_6 = vectors.loc['d', 'magnitude'] == 6.00

if all([test_1, test_2, test_3, test_4, test_5, test_6]):
    print("PASSED")
else:
    print("FAILED")

### Calculating $R^2$

The $R^2$ statistic is a common method of measuring how well a regression model performs at predicting outcomes.  

$R^2$ is the proportion of the variance in data explained by the model, given a number of independent variable inputs.

$ y $ is the observed (actual) output y

$ \hat{y} $ is our prediction of the output y

$ \bar{y} $ is the mean of the observed output y

The total sum of squares (TSS) is defined as follows:

$TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2 $

The residual sum of squares (RSS), also known as the sum of squared error, is defined as follows:

$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $

$R^2$ is then calculated as:

$R^2 = 1 - \dfrac{RSS}{TSS} $

Calculate the $R^2$ value from the Data Frame below and store it in a variable `r_squared`

In [None]:
# Run this cell first

df = pd.DataFrame({
    'y_predicted': [220, 165, 140, 130, 160, 178, 200, 123, 130, 140, 162, 157],
    'y_observed': [195, 160, 120, 155, 161, 180, 185, 123, 128, 160, 180, 182]
})

In [None]:
# Write your code here

tss = np.sum((df['y_observed'] - np.mean(df['y_observed'])) ** 2)
rss = np.sum((df['y_observed'] - df['y_predicted']) ** 2)
r_squared = 1 - (rss / tss)

In [None]:


r_squared

In [None]:
# Now test your function

test_1 = round(r_squared, 2) == 0.54

if test_1:
    print("PASSED")
else:
    print("FAILED")

`pd.read_html()` will return a list of tables on an html page.  
The documentation for this can be found at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html

Download the table at https://simple.wikipedia.org/wiki/List_of_U.S._states   

Then use pandas to read in this data and find how many states there are.

Now sort by the date they became a state and display just the oldest 5 states.

Remember you'll have to parse the dates given as strings
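Once the table is loaded, the remaining steps are date parsing, sorting and a string test. The sketch below shows these on an invented table; the region names, dates and capitals are made up, and the `'%B %d, %Y'` format is an assumption about how the dates are written, so check it against the real column.

```python
import pandas as pd

regions = pd.DataFrame({
    'name': ['Northland', 'Eastmarch', 'Southvale'],
    'founded': ['March 4, 1801', 'December 7, 1795', 'June 1, 1810'],
    'capital': ['Oakham', 'Dover', 'Olney'],
})

# Convert strings to real dates so that sorting is chronological
regions['founded'] = pd.to_datetime(regions['founded'], format='%B %d, %Y')

# Earliest dates first, then keep the top rows
oldest_two = regions.sort_values('founded').head(2)

# .str gives vectorised string methods; True counts as 1 when summed
o_capitals = regions['capital'].str.startswith('O').sum()

print(len(regions))   # number of rows
print(oldest_two)
print(o_capitals)     # 2
```

Sorting the original strings would order them alphabetically by month name, which is why the conversion comes first.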

How many states have a capital city starting with the letter "O"?