## <center>Scientific Programming - 7MRI0020 - 2021/2022</center>


## <center>Week 05 - Scientific Libraries - Part 02 - Exercises </center>


### <center>School of Biomedical Engineering & Imaging Sciences</center>
### <center>King's College London</center>

The purpose of this section is to practice the use of Matplotlib and Pandas.

### Exercise 2.1:
The `mgrid` object in Numpy gives us a mesh grid of values for given dimensions. We can use the values it produces to draw an image in which each pixel represents its distance from the image center. Use your Numpy skills to calculate this 2D image and draw it with Matplotlib:

In [5]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt


y,x=np.mgrid[-15:15,-10:10]

print(y[:,0])

# draw a nice picture

[-15 -14 -13 -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2
   3   4   5   6   7   8   9  10  11  12  13  14]
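One possible approach, offered as a sketch rather than the intended solution: `x` and `y` already hold each pixel's offset from the grid origin, so the Euclidean distance is `sqrt(x**2 + y**2)`. The colour map and colour bar label are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Same grid as in the cell above: 30 rows (y) by 20 columns (x)
y, x = np.mgrid[-15:15, -10:10]

# Euclidean distance of every pixel from the grid origin (0, 0)
dist = np.hypot(x, y)

plt.imshow(dist, cmap='viridis')  # colour map is an arbitrary choice
plt.colorbar(label='distance from center (pixels)')
```

Because the ranges are `-15:15` and `-10:10`, the zero-distance pixel sits at row 15, column 10 rather than exactly in the middle of the array.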


### Exercise 2.2:
Let's bring back the cat:

In [None]:
im = plt.imread('chelsea.png')
plt.imshow(im)

As we know, this image has red, green, and blue channels. These aren't terribly distinct as there aren't any bright colours in the image, but they are different. We can convert to a proper greyscale by summing the channels after multiplying by weights; for example, the weights for sRGB conversion are `0.2126, 0.7152, 0.0722`. Use these values to scale the channels to produce a greyscale version of the picture, and draw this in a four-part image alongside the three individual channels, each interpreted as grey:
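A minimal sketch of the weighted sum, assuming the image is an array of shape `(rows, cols, channels)` with channels in R, G, B order. A small random array stands in for the cat here so the example does not depend on `chelsea.png`; the panel layout and titles are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for the cat: a small random RGB image.
# With the real data, use im = plt.imread('chelsea.png') instead.
rng = np.random.default_rng(0)
im = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)

# sRGB weights quoted in the text, in R, G, B order
weights = np.array([0.2126, 0.7152, 0.0722])

# Weighted sum over the channel axis gives one grey value per pixel;
# slicing with :3 ignores an alpha channel if one is present
grey = im[..., :3] @ weights

# Four panels: the weighted grey image plus each channel shown as grey
fig, axes = plt.subplots(2, 2, figsize=(8, 8))
panels = [grey, im[..., 0], im[..., 1], im[..., 2]]
titles = ['weighted grey', 'red', 'green', 'blue']
for ax, panel, title in zip(axes.flat, panels, titles):
    ax.imshow(panel, cmap='gray')
    ax.set_title(title)
    ax.axis('off')
```

The three weights sum to 1, so the grey image stays in the same value range as the input channels.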

### Exercise 2.3:
We haven't touched on Scipy yet but we'll use a few routines from its `ndimage` module. Scipy is a collection of many mathematical functions for interpolation, integration, optimization, linear algebra, statistics, signal processing, and some image manipulation.

Let's import `ndimage`:

In [None]:
import scipy.ndimage as ndimage

There is a function called [correlate](https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.correlate.html#scipy.ndimage.correlate) in this module for correlating a 2D image with a kernel. This allows us to define filters for images. Use this function to implement the [Sobel operator](https://en.wikipedia.org/wiki/Sobel_operator) and apply it to the cat's red channel.

In [None]:
# edge-detected cat here
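As a hedged sketch of the idea, the two 3x3 Sobel kernels can be passed to `ndimage.correlate` and the results combined into a gradient magnitude. A synthetic step image stands in for the red channel; whether a kernel is applied by correlation or convolution only flips the sign of each response, so the magnitude is unaffected.

```python
import numpy as np
import scipy.ndimage as ndimage

# Stand-in for the red channel: dark left half, bright right half.
# correlate returns the input's dtype by default, so cast integer
# images to float first to keep negative responses.
channel = np.zeros((8, 8))
channel[:, 4:] = 1.0

# Sobel kernels for intensity changes along x and along y
kx = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]], dtype=float)
ky = kx.T

gx = ndimage.correlate(channel, kx)  # responds to vertical edges
gy = ndimage.correlate(channel, ky)  # responds to horizontal edges

# Gradient magnitude combines both directions into one edge image
edges = np.hypot(gx, gy)
```

For this step image only the two columns either side of the step respond; everywhere else `edges` is zero. Passing `edges` to `plt.imshow` with a grey colour map would display the result.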

## Boston House Prices Dataset

We will be using the Boston house prices dataset (1,2) for this exercise which is used for various machine learning examples. This is given to you in a `boston.csv` file which stores a table of numbers with the following columns:

 * CRIM - per capita crime rate by town
 * ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
 * INDUS - proportion of non-retail business acres per town.
 * CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
 * NOX - nitric oxides concentration (parts per 10 million)
 * RM - average number of rooms per dwelling
 * AGE - proportion of owner-occupied units built prior to 1940
 * DIS - weighted distances to five Boston employment centres
 * RAD - index of accessibility to radial highways
 * TAX - full-value property-tax rate per \$10,000
 * PTRATIO - pupil-teacher ratio by town
 * LSTAT - \% lower status of the population
 * MEDV - Median value of owner-occupied homes in $1000's
 
(1) Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.

(2) Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.
 
 
We'll import this data using Pandas:

In [None]:
import pandas as pd

df=pd.read_csv('boston.csv')
df

We can get the pairwise correlation matrix automatically with this call:

In [None]:
df.corr()

### Exercise 2.4:
This is hard to digest just as a table so figure out a way with Matplotlib to render this in a more understandable form (start with `plt.imshow` maybe?). Sorting the columns in some way to group correlated values might help.

In [None]:
# nice looking correlation here.
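One way this could look, sketched on synthetic data so it runs without `boston.csv`: draw the matrix with `imshow`, pin the colour range to [-1, 1], and label the ticks with the column names. The column names `A` to `D` and the colour map are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the real table; column names are hypothetical
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'A': a,
    'B': a + 0.1 * rng.normal(size=200),   # strongly correlated with A
    'C': -a + 0.1 * rng.normal(size=200),  # strongly anti-correlated with A
    'D': rng.normal(size=200),             # unrelated noise
})
corr = demo.corr()

fig, ax = plt.subplots(figsize=(5, 4))
# Fixing the range to [-1, 1] puts zero at the centre of the diverging map
image = ax.imshow(corr.to_numpy(), cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(image, ax=ax, label='Pearson correlation')
```

With the real table, reordering rows and columns by their correlation with one column of interest, for example `order = corr['MEDV'].sort_values().index` followed by `corr.loc[order, order]`, is one simple way to group related columns.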

### Exercise 2.5:
Display a histogram of the house prices with 20 bins (hint: `plt.hist`):

In [None]:
# histogram here

The RAD column uses its maximum value of 24 to indicate out-of-band data.

In [None]:
plt.plot(df['RAD'])

### Exercise 2.6:
Convert the RAD column to a numpy array, replace all 24 values with 0, and replot:

In [None]:
# your code here
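A minimal sketch of the replacement step, using a short made-up column in place of the real `RAD` data: convert to a Numpy array, then assign through a boolean mask.

```python
import numpy as np
import pandas as pd

# Small stand-in for the RAD column; with the real data use df['RAD']
demo = pd.DataFrame({'RAD': [1, 4, 24, 5, 24, 8]})

# Copy so the edit below cannot touch the DataFrame's own storage
rad = demo['RAD'].to_numpy().copy()

# Boolean-mask assignment: every element equal to 24 becomes 0
rad[rad == 24] = 0
print(rad)  # [1 4 0 5 0 8]
```

The resulting array can be passed to `plt.plot` in the same way as the column in the cell above.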

### Exercise 2.7:
Plot a scatter plot with x as `AGE` and y as `DIS` ([scatter documentation](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.scatter.html)):

In [None]:
# your plot here using plt.scatter

### Exercise 2.8:
We would like to compare the median value of homes (MEDV) versus the percentage of the population of lower status (LSTAT). Use the [DataFrame.plot.hexbin](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.hexbin.html?highlight=hexbin#pandas.DataFrame.plot.hexbin) method to plot a correlation between these two values in `df`. Choose values for the arguments which effectively represent the data.

In [None]:
# your plot here using df.plot.hexbin
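For orientation, a hexbin call might look like the sketch below. The data are synthetic and the relationship between the two columns is made up; the `gridsize` and `cmap` values are only starting points to adjust.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the two columns; the relationship is made up
rng = np.random.default_rng(0)
lstat = rng.uniform(2, 38, size=500)
medv = 40 - 0.8 * lstat + rng.normal(scale=4, size=500)
demo = pd.DataFrame({'LSTAT': lstat, 'MEDV': medv})

# gridsize sets the number of hexagons along x; fewer, larger
# hexagons smooth sparse data, more hexagons show finer structure
ax = demo.plot.hexbin(x='LSTAT', y='MEDV', gridsize=20, cmap='viridis')
ax.set_xlabel('LSTAT (% lower status)')
ax.set_ylabel('MEDV (thousands of USD)')
```

The colour of each hexagon encodes how many points fall inside it, which keeps dense regions readable where a plain scatter plot would overplot.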

### Exercise 2.9:
There might be a correlation between NOX concentrations and the amount of industry per district. Plot the two variables in some way which helps show this correlation, either with one or multiple figures.

In [None]:
# figure(s) here

## Extra 2: California House Price Data Set

This is a set of house values and associated attributes derived from houses sold in California in 1990. We'll use this for a set of extra questions below.

The data columns are:

* longitude - A measure of how far west a house is; a more negative value is farther west
* latitude - A measure of how far north a house is; a higher value is farther north
* housing_median_age - Median age of a house within a block; a lower number is a newer building
* total_rooms - Total number of rooms within a block
* total_bedrooms - Total number of bedrooms within a block
* population - Total number of people residing within a block
* households - Total number of households, a group of people residing within a home unit, for a block
* median_income - Median income for households within a block of houses (measured in tens of thousands of US Dollars)
* median_house_value - Median house value for households within a block (measured in US Dollars)
* ocean_proximity - Location of the house w.r.t ocean/sea


We will read the data in from the CSV file, remove the `ocean_proximity` field as it is categorical, and remove the `total_bedrooms` field as it has missing entries. We also drop the rows whose median house value sits at the maximum of $500,001, which indicates the actual value was above this threshold and was not recorded.

In [None]:
df=pd.read_csv('california.csv')
del df['total_bedrooms'] # delete columns with this syntax
del df['ocean_proximity']

# df.median_house_value.eq(A) produces a boolean array with True for every entry equal to A
df=df[~df.median_house_value.eq(df.median_house_value.max())]
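The masking step can be seen in isolation on a tiny made-up table: `eq` builds the boolean mask, `~` inverts it, and indexing with the result keeps only the rows where the inverted mask is `True`.

```python
import pandas as pd

# Tiny stand-in table; 500001 plays the role of the capped value
demo = pd.DataFrame({'median_house_value': [120000, 500001, 340000, 500001]})

cap = demo.median_house_value.max()
is_capped = demo.median_house_value.eq(cap)  # True where the value equals the cap
print(is_capped.tolist())  # [False, True, False, True]

# ~ inverts the mask, so indexing keeps only the rows below the cap
kept = demo[~is_capped]
print(kept.median_house_value.tolist())  # [120000, 340000]
```

Note that the filtered frame keeps the original row labels, so its index has gaps where rows were dropped.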

In [None]:
df

### Extra 2.1:
We used Matplotlib above to plot the histogram of a field; here, use the [DataFrame.hist](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html) method instead to plot a useful graph of the median house values. The default number of bins isn't going to be useful, nor is the default style particularly attractive, so adjust the parameters to produce an insightful plot:

### Extra 2.2:
We can view the correlation matrix with [DataFrame.corr](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html?highlight=corr#pandas.DataFrame.corr):

In [None]:
corr=df.corr()
corr

Without colour it is hard to grasp which members are correlated with which. Using the [DataFrame.style](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.style.html#pandas.DataFrame.style) attribute you can build an HTML representation of the DataFrame object. Methods are given for the [Styler](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.formats.style.Styler.html) object this produces; see what you can do with them to add colour or other styling to our correlation matrix:

A map of California was included with the materials:

In [None]:
cali=plt.imread('california.png')

plt.figure(figsize=(15,15))
# the extent property represents the lat/long area the map covers
plt.imshow(cali, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5)

### Extra 2.3:
Plot each house with the map using multiple Matplotlib calls and colour the markers by median house value. Be sure to choose an appropriate marker, size, and colour map to display the data effectively:
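A rough sketch of layering a scatter plot over an image, assuming longitude goes on the x axis and latitude on the y axis so the points line up with the `extent` used above. The blank image, the random points, and the marker settings are all placeholders rather than the real map or data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins: a blank "map" and random points inside its extent
extent = [-124.55, -113.80, 32.45, 42.05]  # same lon/lat extent as above
blank_map = np.ones((10, 10, 3))
rng = np.random.default_rng(0)
lon = rng.uniform(extent[0], extent[1], 300)
lat = rng.uniform(extent[2], extent[3], 300)
value = rng.uniform(15000, 500000, 300)  # made-up house values

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(blank_map, extent=extent, alpha=0.5)

# Small, semi-transparent markers reduce overplotting in dense areas;
# c maps each point's value onto the colour map
points = ax.scatter(lon, lat, c=value, s=4, cmap='viridis', alpha=0.6)
fig.colorbar(points, ax=ax, label='median house value (USD)')
```

With the real data, the three arrays would come from the `longitude`, `latitude`, and `median_house_value` columns, and `blank_map` from the `cali` image loaded above.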

### Extra 2.4:
Reload the data from the CSV file and don't discard the `ocean_proximity` column. Plot each house on the map with a distinct colour for each category: