# Predicting Yacht Resistance with Linear Regression

## Introduction

This notebook is a simple demonstration of how to use scikit-learn to build a linear regression model. It uses a dataset of 308 experiments and their various attributes. The goal is to predict the residuary resistance per unit weight of displacement from those attributes.

## The Data

The data has been taken from [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml) and the raw data and information can be found [here](https://archive.ics.uci.edu/ml/datasets/Yacht+Hydrodynamics). 

The columns are as follows:

1. Longitudinal position of the center of buoyancy, adimensional.
2. Prismatic coefficient, adimensional.
3. Length-displacement ratio, adimensional.
4. Beam-draught ratio, adimensional.
5. Length-beam ratio, adimensional.
6. Froude number, adimensional.
7. Residuary resistance per unit weight of displacement, adimensional. 

Column 7 is the target variable we are looking to predict.

We import the Python libraries we need

In [18]:
import pandas as pd
import numpy as np

We read in the data we've saved; the CSV's header row supplies the column names

In [19]:
yacht = pd.read_csv("../c3o-experiments-main/kmeans.csv")

Let's check out the first few rows of data

In [20]:
yacht.head()

| | instance_count | machine_type | slots | memory | data_size_MB | features | observations | k | gross_runtime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | m4.2xlarge | 16 | 64000 | 8000 | 5 | 150000000 | 3 | 1126 |
| 1 | 2 | m4.2xlarge | 16 | 64000 | 8000 | 5 | 150000000 | 3 | 1127 |
| 2 | 2 | m4.2xlarge | 16 | 64000 | 8000 | 5 | 150000000 | 3 | 1151 |
| 3 | 2 | m4.2xlarge | 16 | 64000 | 8000 | 5 | 150000000 | 3 | 1156 |
| 4 | 2 | m4.2xlarge | 16 | 64000 | 8000 | 5 | 150000000 | 3 | 1216 |


We can quickly check if we have any null values in our data

In [21]:
yacht.isnull().values.any()

False

We don't: the check returns False. Let's use the "describe" method to confirm this and to see other interesting information

In [22]:
yacht.describe()

| | instance_count | slots | memory | data_size_MB | features | observations | k | gross_runtime |
|---|---|---|---|---|---|---|---|---|
| count | 900.0 | 900.0 | 900.0 | 900.0 | 900.0 | 900.0 | 900.0 | 900.0 |
| mean | 7.0 | 55.066667 | 413116.666667 | 18408.888889 | 7.0 | 136666700.0 | 5.8 | 533.698889 |
| std | 3.417549 | 27.460883 | 209825.534684 | 5900.465554 | 3.561005 | 36382590.0 | 2.287462 | 791.443029 |
| min | 2.0 | 8.0 | 61000.0 | 8000.0 | 5.0 | 100000000.0 | 3.0 | 94.0 |
| 25% | 4.0 | 32.0 | 244000.0 | 13200.0 | 5.0 | 100000000.0 | 3.0 | 182.0 |
| 50% | 7.0 | 48.0 | 366000.0 | 18600.0 | 5.0 | 125000000.0 | 5.0 | 246.0 |
| 75% | 10.0 | 80.0 | 610000.0 | 21300.0 | 10.0 | 175000000.0 | 7.0 | 434.5 |
| max | 12.0 | 96.0 | 732000.0 | 30500.0 | 15.0 | 200000000.0 | 9.0 | 4556.0 |


The count row shows 900 values in every numeric column, which agrees with the null check above: nothing is missing. Had there been missing values, we could deal with them in a few different ways. The simplest solution is to remove the affected rows, though we lose examples in doing so. Alternatively, we could impute the values, replacing the NaN values with an average (mean or median). For the purpose of this simple notebook, we keep the removal step; on this data it drops nothing.

In [23]:
yacht = yacht.dropna()
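
For comparison, the imputation alternative mentioned above can be sketched as follows. This is a minimal illustration on a hypothetical toy frame (the name `toy` and its values are assumptions, not the notebook's data), using the column median as the replacement value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column (not the notebook's data)
toy = pd.DataFrame({
    "slots": [8.0, 16.0, np.nan, 32.0],
    "memory": [61000.0, np.nan, 244000.0, 366000.0],
})

# Replace each NaN with the median of its own column
imputed = toy.fillna(toy.median())

print(imputed)
```

In a real workflow the medians would normally be computed on the training split only and then reused for the test split, so that no information leaks from the test data.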

## Train & Test Data

The purpose of splitting the data is to be able to assess the quality of a predictive model when it is used on unseen data. When training, you will try to build a model that fits to the data as closely as possible, to be able to most accurately make a prediction. However, without a test set you run the risk of overfitting - the model works very well for the data it has seen but not for new data.

The split ratio is often debated and in practice you might split your data into three sets: train, validation and test. You would use the training data to fit the candidate models; the validation set to test on whilst tweaking parameters; and the test set to get an understanding of how your final model would work in practice. Furthermore, there are techniques such as K-Fold cross validation that make the estimate less dependent on one particular split.

For the purpose of this demonstration, we will only be randomly splitting our data into test and train, with an 80/20 split.
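
As a rough illustration of K-Fold cross validation, the sketch below scores a linear regression on five folds. The data is synthetic and the names `X_demo`, `y_demo` and `mae_per_fold` are assumptions made for this example; it is not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption for illustration): y is linear in 3 features plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 5 folds: each fold is held out once while the model trains on the other four
cv = KFold(n_splits=5, shuffle=True, random_state=123)
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         cv=cv, scoring="neg_mean_absolute_error")

# scikit-learn negates error metrics so that a higher score is always better
mae_per_fold = -scores
print(mae_per_fold.mean())
```

Averaging over the folds gives an error estimate that every example has contributed to as held-out data exactly once.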

We import the required library from scikit-learn, [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [24]:
from sklearn.model_selection import train_test_split

We wish to train on all numeric features, so we take every column except the target "gross_runtime" and the categorical "machine_type"

In [25]:
X = yacht.drop(["gross_runtime","machine_type" ], axis=1)

The column "gross_runtime" is our target variable, so we set y as this column

In [26]:
y = yacht["gross_runtime"]

We use the *train_test_split* function to create the appropriate train and test data for our features ("X_train" and "X_test" respectively) and target data ("y_train" and "y_test"). We are specifying our test data to be 20% of the total data. We are also providing a seed to be able to reproduce this split.

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

We can check the number of examples we have in each of our train and test data sets using "shape"

In [28]:
X_train.shape

(720, 7)

In [29]:
X_test.shape

(180, 7)

## Standardisation

All remaining features are numeric, since we dropped the categorical "machine_type" column, so we do not need to worry about converting categorical data with techniques such as one-hot encoding. However, we will demonstrate how to standardise our data. Standardisation rescales each attribute so it has a mean of 0 and a standard deviation of 1. It does not strictly require a Gaussian distribution, though it tends to work better when the data is roughly Gaussian. Alternatively, normalisation can be used to rescale values to the range 0 to 1.

We use scikit-learn's [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

In [30]:
from sklearn.preprocessing import StandardScaler

We create the scaler, leaving parameters as default

In [31]:
scaler = StandardScaler()

We fit the scaler on the training data and, in the same call, transform that data, storing the result in a variable named "train_scaled"

In [32]:
train_scaled = scaler.fit_transform(X_train)

We then transform our test data with the same fitted scaler

In [33]:
test_scaled = scaler.transform(X_test)
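
To see what the scaler is doing, the sketch below reproduces its output by hand. The arrays `train_demo` and `test_demo` are synthetic stand-ins for "X_train" and "X_test" (an assumption for this example); the point is that the test data is rescaled with the mean and standard deviation learned from the training data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X_train and X_test (assumed values, not the notebook's data)
rng = np.random.default_rng(0)
train_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
test_demo = rng.normal(loc=50.0, scale=10.0, size=(20, 2))

demo_scaler = StandardScaler().fit(train_demo)

# z = (x - mean) / std, using the training mean and population std (ddof=0)
manual = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)

# Compare the hand-computed values with the scaler's output
print(np.allclose(manual, demo_scaler.transform(test_demo)))
```

Fitting the scaler on the training split alone keeps the test split unseen, which matches how the model would meet new data in practice.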

## Linear Regression

Linear regression fits a hyperplane to your dataset by choosing the coefficients that minimise the sum of squared differences between the predicted and the observed target values. It is most suitable when the relationships between the features and the target are approximately linear.
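
To make the fitting criterion concrete, here is a minimal sketch that solves the same least-squares problem directly with NumPy and compares the result with scikit-learn. The data is synthetic and the names `X_demo`, `y_demo`, `weights` and `intercept` are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption): intercept 3.0, weights 1.5 and -2.0, small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 + X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=50)

# Append a column of ones so the intercept is estimated together with the weights
A = np.column_stack([X_demo, np.ones(len(X_demo))])
solution, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
weights, intercept = solution[:2], solution[2]

# scikit-learn's LinearRegression solves the same ordinary least-squares problem
reference = LinearRegression().fit(X_demo, y_demo)
print(np.allclose(weights, reference.coef_),
      np.isclose(intercept, reference.intercept_))
```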

We are using scikit-learn's [Linear Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

In [34]:
from sklearn.linear_model import LinearRegression

We create a Linear Regression model

In [35]:
model = LinearRegression()

We train it with our scaled training data and target values

In [36]:
model.fit(train_scaled, y_train)

LinearRegression()

## Model Evaluation

We wish to understand how good our model is; there are a few different metrics we can use. We will evaluate mean squared error (MSE) and mean absolute error (MAE)

We import [scikit-learn's mean squared error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error) and [scikit-learn's mean absolute error](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html#sklearn.metrics.mean_absolute_error)

In [37]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

We calculate the errors for our training data

In [42]:
mse = mean_squared_error(y_train, model.predict(train_scaled))
mae = mean_absolute_error(y_train, model.predict(train_scaled))

In [39]:
from math import sqrt

In [40]:
print("mse = ",mse," & mae = ",mae," & rmse = ", sqrt(mse))

mse =  387067.79286411323  & mae =  410.03726598324056  & rmse =  622.1477259173365


The easier metric to understand is the mean absolute error: on average, our prediction was about 410 away from the true value, which is large compared with the mean "gross_runtime" of about 534. Mean squared error, and consequently root mean squared error (RMSE), squares each error before averaging, so predictions further from the true value are punished more.

We can calculate the same on the test data to understand how well the model generalises.
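
The difference between the metrics is easier to see on a small worked example. The values below are made up for illustration (an assumption, not results from this notebook); one prediction is deliberately far off.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values: three close predictions and one that is off by 100
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 300.0, 500.0])

errors = y_pred - y_true                 # [10, -10, 0, 100]
mae_manual = np.mean(np.abs(errors))     # (10 + 10 + 0 + 100) / 4 = 30
mse_manual = np.mean(errors ** 2)        # (100 + 100 + 0 + 10000) / 4 = 2550
rmse_manual = np.sqrt(mse_manual)        # about 50.5

# The hand-computed values agree with scikit-learn's functions
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_pred)),
      np.isclose(mse_manual, mean_squared_error(y_true, y_pred)))
```

The single error of 100 contributes 10000 of the 10200 total squared error, which is why the RMSE (about 50.5) sits well above the MAE (30).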

In [43]:
predictions = model.predict(test_scaled)
test_mse = mean_squared_error(y_test, predictions)
test_mae = mean_absolute_error(y_test, predictions)
print(y_test, predictions)
print("mse = ",test_mse," & mae = ",test_mae," & rmse = ", sqrt(test_mse))

613     156
524     240
690     170
457     140
85     1042
       ... 
279     434
196     350
246     402
221     478
239     384
Name: gross_runtime, Length: 180, dtype: int64 [-126.15658139  513.35436477  -30.3607599   123.30085896 1080.55123964
  820.09218981 1005.56191358   14.43948408   18.81038806 1220.48273594
  -23.25459531  198.68901773 -451.0021805   -98.64275408 -143.04416535
   66.68471953  706.19566568  906.48195806  109.5707364   984.75541815
  343.65598718  762.81180511  485.84053746 -355.20635902 -180.73824473
 1205.11054637  263.89692442  381.35006656  485.84053746 -295.85495898
  810.68613658  -68.05483929  852.75111995 -317.51227963 -102.19262455
  933.33121621  109.5707364  1215.29079845  967.86783419   31.02511623
  -60.94867469  393.56479473 1217.32527454 1008.71937498  188.50876565
  640.98775899 1177.59671907  -30.3607599   815.05704056  448.14645808
  506.24820017  735.29797781 1156.78638067  735.29797781  366.97347836
 -337.91994234 -317.51227963 1193.633477

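Comparing the test errors with the training errors tells us how well the model generalises: similar values suggest it is not overfitting. Note that several predictions are negative (for example, about -126 against a true value of 156), which cannot occur for a runtime and hints that a purely linear fit is a poor match for this target.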

## Linear Regression Parameters

More information on Generalized Linear Models can be found in the scikit-learn documentation [here](http://scikit-learn.org/stable/modules/linear_model.html)

There are a number of parameters that can be tuned and that should be explored when trying to improve Linear Regression models. A common approach is to test many different parameter values, building multiple models and comparing their errors to find the best combination.

### Parameters
For Linear Regression, the [scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) lists the parameters that can be passed by the user; changing these is likely to have an impact on the performance of the model.

Here is high-level information on the parameters, the documentation has more details:
- fit_intercept : default True
    - whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (e.g. data is expected to be already centered).

- normalize : default False (deprecated in scikit-learn 1.0 and removed in 1.2)
    - This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. In current versions, standardize with sklearn.preprocessing.StandardScaler before calling fit instead.

- copy_X : default True
    - If True, X will be copied; else, it may be overwritten.

- n_jobs : default None (treated as 1 unless a joblib parallel backend is active)
    - The number of jobs to use for the computation. If -1 all CPUs are used. This will only provide speedup for n_targets > 1 and sufficiently large problems.

### Grid Search

To search for the best hyper-parameters for your algorithm and data, grid search cross validation is commonly used. The [scikit-learn documentation](http://scikit-learn.org/stable/modules/grid_search.html) provides more thorough information on how to use this. 
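
A minimal sketch of a grid search is shown below. It runs on synthetic data, and the names `X_demo`, `y_demo`, `pipeline` and `param_grid` are assumptions for this example. Because Linear Regression has few tunable parameters, the grid only toggles `fit_intercept`; the scaler is placed inside a pipeline so that it is refitted on each training fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): the true relationship has a non-zero intercept
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 5.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {"model__fit_intercept": [True, False]}

# 5-fold cross validation for every parameter combination in the grid
search = GridSearchCV(pipeline, param_grid, cv=5,
                      scoring="neg_mean_absolute_error")
search.fit(X_demo, y_demo)

print(search.best_params_)
```

After fitting, `search.best_estimator_` holds the pipeline refitted on all of the data with the best parameters, and `search.cv_results_` holds the per-combination scores.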

#### Data Citation

Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. 