<a href="https://colab.research.google.com/github/faizuddin/ISB46703/blob/main/supervised_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Supervised learning
In general we have two types of supervised learning:

1. Regression (given training data + desired labeled continuous outputs)
2. Classification (given training data + desired labeled categorical outputs)
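
The difference between the two types is easiest to see in the labels. The sketch below is only an illustration: the ages, heights and group names are made-up values, and `LinearRegression` / `LogisticRegression` from scikit-learn are used as one convenient example of each type, not as the methods this notebook relies on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy feature: age in years (made-up values, for illustration only)
ages = np.array([[2], [5], [8], [20], [30], [40]])

# Regression: the label is a continuous number (height in cm)
heights = np.array([85.0, 108.0, 127.0, 170.0, 172.0, 171.0])
regressor = LinearRegression().fit(ages, heights)
predicted_height = regressor.predict([[10]])[0]

# Classification: the label is a category
groups = np.array(["child", "child", "child", "adult", "adult", "adult"])
classifier = LogisticRegression().fit(ages, groups)
predicted_group = classifier.predict([[10]])[0]

print(type(predicted_height).__name__, predicted_height)
print(type(predicted_group).__name__, predicted_group)
```

The regressor returns a number on a continuous scale, while the classifier returns one of the category names it was trained on.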

## Simple Linear Regression

We will use a simple two-dimensional dataset from [here](https://archive.org/download/ages-and-heights/AgesAndHeights.pkl). Download the dataset into this notebook using [wget](https://www.gnu.org/software/wget/) and visualise it.

The dataset is in [pickle](https://docs.python.org/3/library/pickle.html) format.
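
As a rough sketch of what the pickle format does, the snippet below writes a small DataFrame to a pickle file and reads it back. The records and the file name `example.pkl` are made up for this illustration and are unrelated to the real dataset.

```python
import os
import tempfile

import pandas as pd

# Made-up records, only to demonstrate the round trip
original = pd.DataFrame({"Age": [3.0, 7.5, 12.0], "Height": [95.0, 124.0, 150.0]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.pkl")
    original.to_pickle(path)          # serialise the DataFrame to disk
    restored = pd.read_pickle(path)   # load it back

print(restored.equals(original))
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from sources you trust.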

In [None]:
!wget https://archive.org/download/ages-and-heights/AgesAndHeights.pkl

In [None]:
# important data structure library!
import pandas as pd

# read pickle format
rawdataset = pd.read_pickle("AgesAndHeights.pkl")

### Visualise data using histogram
Use Pandas [``hist()``](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html) to visualise an approximate representation of the distribution of the numerical data. The Y-axis in both plots shows frequency, and the X-axes show *Age* and *Height* respectively.

In [None]:
rawdataset.hist()

### Data Cleaning

We are going to build the model using valid data only, so we first remove records that cannot be accounted for (empty/null or out-of-range values). In the histograms above, a few entries have an age less than zero, which is meaningless. We therefore remove those entries to get better accuracy.

In [None]:
cleandataset = rawdataset[rawdataset["Age"]>0]
cleandataset.hist()
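
Before filtering, it can help to count how many records are affected. The sketch below uses a made-up four-row DataFrame (not the real dataset) to show one possible way of counting null and non-positive ages, and why a single `> 0` filter removes both.

```python
import numpy as np
import pandas as pd

# Made-up records containing one missing age and one negative age
toy = pd.DataFrame({
    "Age": [4.0, -1.0, np.nan, 10.0],
    "Height": [102.0, 40.0, 120.0, 138.0],
})

n_missing = int(toy["Age"].isna().sum())   # count of null ages
n_invalid = int((toy["Age"] <= 0).sum())   # count of non-positive ages

# A comparison with NaN evaluates to False, so this filter drops both kinds of rows
toy_clean = toy[toy["Age"] > 0]

print(n_missing, n_invalid, len(toy_clean))
```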

### Visualise data using scatter plot
We will represent *Age* on the X-axis and *Height* on the Y-axis. The points in the plot refer to data from `cleandataset`. We use Pandas [`plot.scatter()`](https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.DataFrame.plot.scatter.html) to plot.

In [None]:
cleandataset.head()

In [None]:
cleandataset.shape

In [None]:
# Draw a scatter plot
ax = cleandataset.plot.scatter(x = 'Age', y = 'Height');

# Set x and y axis labels
ax.set_ylabel("Height (cm)")
ax.set_xlabel("Age (year)")

### Build the model and train it

Here we are going to use simple [linear regression](https://en.wikipedia.org/wiki/Linear_regression) to train our *Age-Height* model.

Steps:

1. Learn parameters
2. Train model using the learned (optimal) parameters
3. Evaluate prediction

### Learning parameters
We create a function called `y_height()` that uses the basic straight-line equation and returns `y`, in our case *Height*. If we call `y_height()` with arbitrarily chosen parameters, the height predicted for a given age does not match the data. Hence, we use `learnpars()` to learn the parameters that best fit the data.

Straight line equation: `Y = a + bX` (a = alpha, b=beta)

In [None]:
pars = {"alpha" : 40, "beta" : 4}

def y_height(age, pars):
  alpha = pars["alpha"]
  beta = pars["beta"]
  return alpha + beta * age

y_height(5, pars)

In [None]:
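def learnpars(data, pars):
  # Closed-form least-squares estimates for a straight line,
  # computed from the `data` argument rather than a global variable
  x, y = data["Age"].to_numpy(), data["Height"].to_numpy()
  x_bar, y_bar = x.mean(), y.mean()
  # slope: sum of cross-deviations divided by sum of squared x-deviations
  beta = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
  # intercept: the fitted line passes through (x_bar, y_bar)
  alpha = y_bar - beta * x_bar
  pars["alpha"] = alpha
  pars["beta"] = beta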
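
For reference, the quantities computed in `learnpars()` correspond to the usual closed-form least-squares estimates for a straight line. The hatted symbols below are notation introduced here, with `alpha` and `beta` as in the straight-line equation above, and a bar denoting the sample mean:

```latex
\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^{2}},
\qquad
\hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}
```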

In [None]:
# Find the correct parameters
newpars = {"alpha" : 0, "beta" : 0}
learnpars(cleandataset,newpars)

# The optimal alpha and beta parameters
newpars
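
One way to sanity-check parameters learned like this is to compare the closed-form estimates with an independent fit. The sketch below does so on synthetic data generated around a known line; the data, the random seed and the use of `np.polyfit` as a cross-check are all assumptions made for this illustration.

```python
import numpy as np

# Synthetic data drawn around a known line (alpha=50, beta=6); values are made up
rng = np.random.default_rng(0)
x = rng.uniform(1, 18, size=200)
y = 50 + 6 * x + rng.normal(0, 2, size=200)

# Closed-form least-squares estimates
x_bar, y_bar = x.mean(), y.mean()
beta = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
alpha = y_bar - beta * x_bar

# np.polyfit with degree 1 returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

print(alpha, beta)
print(intercept, slope)
```

Both approaches should agree to within floating-point error, and both should land close to the line used to generate the data.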

### Dummy dataset
This demonstrates how parameter selection can impact the model.

In [None]:
# Create a dummy list of age
dummy_ages = list(range(19))

# Predict height using unoptimised parameters
untrained_predicted_heights = [y_height(age, pars) for age in dummy_ages]

# Create a dataframe
untrained = pd.DataFrame(list(zip(dummy_ages, untrained_predicted_heights)), columns =['Age', 'Height'])

untrained

In [None]:
import matplotlib.pyplot as plt

# Draw a scatter plot
plt.scatter(untrained["Age"], untrained["Height"], label="Untrained - dummy");
plt.scatter(cleandataset["Age"], cleandataset["Height"], label="Raw - cleaned");

# Set x and y axis labels
plt.xlabel("Age (year)")
plt.ylabel("Height (cm)")
plt.legend()

### Regression using trained parameters

We do a regression over `cleandataset` using the trained parameters (`newpars`) by calling `y_height()` and we plot the prediction results.

In [None]:
# run regression
trained_predicted_heights = [y_height(age, newpars) for age in cleandataset["Age"]]

# Create a dataframe
trained = pd.DataFrame(list(zip(cleandataset["Age"], trained_predicted_heights)), columns =['Age', 'Height'])

In [None]:
# Draw a scatter plot
plt.scatter(untrained["Age"], untrained["Height"], label="Untrained - dummy");
plt.scatter(cleandataset["Age"], cleandataset["Height"], label="Raw - cleaned");
# Regression results
plt.plot(trained["Age"], trained["Height"], label="Predictions", color="red");

# Set x and y axis labels
plt.xlabel("Age (year)")
plt.ylabel("Height (cm)")
plt.legend()

trained

### Evaluate prediction performance

We need to measure how well our model predicts height. We use the [Root Mean Squared Error (RMSE)](https://en.wikipedia.org/wiki/Root-mean-square_deviation) to measure how far, on average, the predicted values are from the actual values.

In [None]:
from sklearn.metrics import mean_squared_error 

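# root mean squared error: how far predicted from the actual value, on average
# (square root taken explicitly, because the `squared` argument is deprecated in newer scikit-learn releases)
rmse = mean_squared_error(cleandataset["Height"], trained["Height"]) ** 0.5

print("RMSE: " + str(rmse) + " cm")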
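
RMSE can also be computed directly from its definition. The following is a minimal sketch with three made-up height values, intended only to show the arithmetic.

```python
import numpy as np

actual = np.array([100.0, 110.0, 120.0])      # made-up heights in cm
predicted = np.array([102.0, 108.0, 123.0])   # made-up predictions in cm

errors = predicted - actual                    # residuals: 2, -2, 3
rmse_manual = np.sqrt(np.mean(errors ** 2))    # sqrt((4 + 4 + 9) / 3)

print(round(rmse_manual, 3))  # 2.38
```

Because the errors are squared before averaging, large misses weigh more heavily than small ones, and the result is in the same unit as the target (cm here).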

### Predicting unseen data

We can test our model by predicting the height for completely new, unseen data.

In [None]:
age_input = int(input("Enter an 'age' to predict height: "))
y_height(age_input, newpars)

## Multiple Linear Regression
In the previous simple linear regression example, we dealt with only one independent variable (age) and one dependent variable (height).

When we have more than one independent variable (feature) and one dependent variable, we call it a multiple linear regression problem.

To do multiple linear regression, we first extend the standard linear equation:

`y = a + Bx`

to

`y = a + B1x1 + B2x2 + ... + Bnxn`

where `a` is the intercept and `B1, B2, ... ,Bn` are the coefficients (slopes) of the independent variables `x1, x2, ..., xn`. Each coefficient tells us how much `y` changes when the corresponding variable increases by 1 unit while the others are held fixed: `B1` for `x1`, and likewise `B2, ... ,Bn` for `x2, ..., xn`.
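
The interpretation of the coefficients can be checked on synthetic data. In this sketch the data are generated, without noise, from `y = 1 + 2*x1 + 3*x2`; the variable names (`X_demo`, `y_demo`, `model`) and the chosen coefficients are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from y = 1 + 2*x1 + 3*x2
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 1 + 2 * X_demo[:, 0] + 3 * X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)

# Raise x1 by one unit while holding x2 fixed
base = np.array([[4.0, 5.0]])
bumped = np.array([[5.0, 5.0]])
change = model.predict(bumped)[0] - model.predict(base)[0]

print(model.intercept_, model.coef_, change)
```

The change in the prediction equals the first coefficient, which is what "one unit of `x1` changes `y` by `B1`" means in practice.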

The next example demonstrates this case. We will use the 2016 Air Quality dataset, which consists of 9538 instances.

### AirQuality file dataset column information

0. Date the reading was recorded on
1. Time of the day the reading was recorded on
2. Concentration of CO in milligram/m^3
3. Sensor response for Tin oxide
4. Concentration of non-methane hydrocarbons (NMHC) in microg/m^3
5. Concentration of Benzene in microg/m^3
6. Sensor response for titania
7. Concentration of NOx in parts per billion
8. Sensor response for Tungsten Oxide (Targeting NOx)
9. Concentration of NO2 in microg/m^3
10. Sensor response for Tungsten Oxide (Targeting NO2)
11. Sensor response for Indium Oxide
12. Temperature at the time of the reading (°C)
13. Relative Humidity (%)
14. Absolute Humidity

A full description of this dataset can be found [here](https://gist.github.com/shreyasiitr/57f8fa30fa20b049359fb567cc6407d0).


In [None]:
# Download dataset

!wget "https://raw.githubusercontent.com/faizuddin/ISB46703/main/data/AirQualityUCI.csv"

In [None]:
# important data structure library!
import pandas as pd

# Read raw dataset file into Pandas dataframe
df = pd.read_csv("AirQualityUCI.csv")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# 12 independent variables
X = df.drop(["Date", "Time", "C6H6"], axis=1) 

# dependent variable
y = df["C6H6"]

In [None]:
import matplotlib.pyplot as plt

f = plt.figure(figsize=(20,6))

# Plot the first six independent variables against the dependent variable.
# Reading the label from X.columns keeps each x-axis label in step with the
# column actually plotted ("C6H6" was dropped from X above).
for i in range(6):
  plt.subplot(2,3,i+1)
  plt.scatter(X.iloc[:,i], y)
  # Set x and y axis labels
  plt.xlabel(X.columns[i])
  plt.ylabel("C6H6(GT)")

plt.tight_layout()



### Training and testing dataset split

In [None]:
# importing train_test_split from sklearn
from sklearn.model_selection import train_test_split

# splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=1)
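
To see what `test_size` and `random_state` do, the sketch below splits a made-up array of 10 samples using the same settings as above. The arrays `X_demo` and `y_demo` are placeholders for this illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # 10 made-up samples, 2 features
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)

# Same seed, same split
_, _, _, y_te_again = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)
print(np.array_equal(y_te, y_te_again))  # True
```

With `test_size = 0.3`, 30% of the rows are held out for testing, and fixing `random_state` makes the shuffle reproducible between runs.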

### Training

In [None]:
# importing module
from sklearn.linear_model import LinearRegression

# creating an object of LinearRegression class
LR = LinearRegression()

# fitting the training data
mlr = LR.fit(X_train,y_train)

### Testing

In [None]:
y_pred =  mlr.predict(X_test)

### Model evaluation

In [None]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error 

# r2 score
score = r2_score(y_test, y_pred)

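# root mean squared error: how far predicted from the actual value, on average
# (square root taken explicitly, because the `squared` argument is deprecated in newer scikit-learn releases)
rmse = mean_squared_error(y_test, y_pred) ** 0.5

print("r2 score: ", score)
print("RMSE: ", rmse, "microg/m^3")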
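
The r2 score reported above is the coefficient of determination, which compares the model's squared error with that of always predicting the mean. As a minimal sketch, with four made-up observations and predictions, it can be computed by hand and compared against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # made-up observations
y_hat = np.array([2.5, 5.5, 7.0, 9.5])    # made-up predictions

ss_res = np.sum((y_true - y_hat) ** 2)            # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```

A score of 1 means perfect predictions, and 0 means the model does no better than predicting the mean of the observed values.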