# Regression using scikit-learn

### What is regression?

**Regression** is when the feature to be predicted contains continuous values. Regression refers to the process of predicting a dependent variable by analyzing its relationship with other, independent variables. Several well-known algorithms help uncover these relationships so that the value can be predicted more accurately.

### scikit-learn 

In this notebook, we'll use [scikit-learn](https://scikit-learn.org/stable/) to predict values. Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities. 

To help visualize what we are doing, we'll create plots with the *matplotlib* Python library.

### Data
 
We'll continue to use the [`insurance.csv`](https://www.kaggle.com/noordeen/insurance-premium-prediction/download) file from your project assets. If you have not already [`downloaded this file`](https://www.kaggle.com/noordeen/insurance-premium-prediction/download) to your local machine and uploaded it to your project, do that now.

<a id="top"></a>
## Table of Contents

1. [Load libraries](#load_libraries)
2. [Load data](#load_data)
3. [Prepare data for building regression model](#prepare_data)
4. [Build and test a multiple linear regression model](#model_lrc)

### Quick set of instructions to work through the notebook

If you are new to Notebooks, here's a quick overview of how to work in this environment.

1. The notebook has 2 types of cells - markdown (text) such as this and code such as the one below. 
2. Each cell with code can be executed independently or together (see options under the Cell menu). When working in this notebook, we will be running one cell at a time because we need to make code changes to some of the cells.
3. To run a cell, place the cursor in the code cell and click the Run (arrow) icon. The cell is running while you see the * next to it. Some cells have printable output.
4. Work through this notebook by reading the instructions and executing code cell by cell. Some cells will require modifications before you run them. 

<a id="load_libraries"></a>
## 1. Load libraries
[Top](#top)

Install Python modules

**Note:** The shell command `pip install` is used to install Python modules. Some installs require a kernel restart to complete. To avoid confusing errors, run the following cell once and then use the Kernel menu to restart the kernel before proceeding.

In [None]:
!pip install -U scikit-learn
!pip install pandas==0.24.2
!pip install --user pandas_ml==0.6.1
!pip install matplotlib==3.1.0

In the cell below, we import the general-purpose Python libraries that will be used throughout the notebook.

In [None]:
import pandas as pd, numpy as np
import sys
import io

<a id="load_data"></a>
## 2. Load data
[Top](#top)

A lot of data is **structured data**, which is data that is organized and formatted so it is easily readable, for example a table with variables as columns and records as rows, or key-value pairs in a noSQL database. As long as the data is formatted consistently and has multiple records with numbers, text and dates, you can probably read the data with [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html), an open-source Python package providing high-performance data manipulation and analysis.

### 2.1 Load our data as a pandas data frame

**<font color='red'><< FOLLOW THE INSTRUCTIONS BELOW TO LOAD THE DATASET >></font>**

* Highlight the cell below by clicking it.
* Click the `10/01` "Find data" icon in the upper right of the notebook.
* Add the locally uploaded file `insurance.csv` by choosing the `Files` tab. Then choose the `insurance.csv`. Click `Insert to code` and choose `Insert Pandas DataFrame`.
* The code to bring the data into the notebook environment and create a Pandas DataFrame will be added to the cell below.
* Run the cell
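
If you are working in an environment that does not offer the `Insert to code` helper, a plain `pandas.read_csv` call is one way to end up with an equivalent DataFrame. The sketch below reads a few made-up rows from an in-memory string so that it runs without the real file. The column names other than `bmi` and `expenses` are assumptions about the layout of `insurance.csv`, and the variable name `df_data_1` only mirrors the kind of name the generated code uses.

```python
import io
import pandas as pd

# Made-up sample rows for illustration; the real insurance.csv has many more records.
# Column names other than 'bmi' and 'expenses' are assumed, not confirmed by this notebook.
sample_csv = """age,sex,bmi,children,smoker,region,expenses
19,female,27.9,0,yes,southwest,16884.92
18,male,33.8,1,no,southeast,1725.55
28,male,33.0,3,no,southeast,4449.46
"""

# With the real file you would pass its path instead, e.g. pd.read_csv("insurance.csv")
df_data_1 = pd.read_csv(io.StringIO(sample_csv))
print(df_data_1.head())
```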

In [None]:
# Place cursor below and insert the Pandas DataFrame for the Insurance Expense data


### 2.2 Update the variable for our Pandas dataframe

We'll use the name `df_pd` for our DataFrame in the rest of the notebook. Make sure that the cell below assigns the variable created by the generated code above. For the locally uploaded file, it should look like `df_data_1`, `df_data_2`, or `df_data_x`.

**<font color='red'><< UPDATE THE VARIABLE ASSIGNMENT TO THE VARIABLE GENERATED ABOVE. >></font>**

In [None]:
df_pd = df_data_1

<a id="prepare_data"></a>
## 3. Prepare data for building regression model
[Top](#top)

Data preparation is a very important step in building a machine learning model, because a model can perform well only when the data it is trained on is good and well prepared. As a result, this step consumes the bulk of the time a data scientist spends building models.

### 3.1 Explore data

Now let's have a look at the data that was loaded into the notebook. We will use pandas to explore and understand the dataset. More detailed data exploration options were discussed in the previous module.

In [None]:
print("The dataset contains columns of the following data types : \n" +str(df_pd.dtypes))

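Verify that there are no missing values in the columns. `count()` reports the number of non-null entries per column, so every column should show the same count.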

In [None]:
print("The dataset contains following number of records for each of the columns : \n" +str(df_pd.count()))

You can also identify if there are any missing values by running the following cell. In this case, we do not have any missing entries. 

In [None]:
df_pd.isnull().any()
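
To see what these checks report when values really are missing, the sketch below runs them on a small made-up frame (not the insurance data). It also uses `isnull().sum()`, a common companion to `isnull().any()` that reports how many entries are missing per column.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing numeric value and one missing string value
toy = pd.DataFrame({
    "bmi": [27.9, np.nan, 33.0],
    "smoker": ["yes", "no", None],
    "expenses": [16884.92, 1725.55, 4449.46],
})

has_missing = toy.isnull().any()     # True for columns with at least one missing entry
missing_counts = toy.isnull().sum()  # number of missing entries per column
non_null_counts = toy.count()        # number of non-null entries per column

print(has_missing)
print(missing_counts)
print(non_null_counts)
```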

### 3.2 Prepare categorical columns

During this process, we identify categorical columns in the dataset. 

In [None]:
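# Defining the categorical columns (string columns are stored with the 'object' dtype).
# The np.object alias was removed from recent NumPy releases, so we pass the dtype name instead.
categoricalColumns = df_pd.select_dtypes(include=['object']).columns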

print("Categorical columns : " )
print(categoricalColumns)

Categories need to be indexed, which means the string labels are converted to indices or numbers. These label indices are then one-hot encoded into a binary vector with at most a single one-value, indicating the presence of a specific feature value from among the set of all feature values. This encoding allows algorithms that expect continuous features to use categorical features. We use the **OneHotEncoder** class from the *sklearn.preprocessing* module to implement this.

In [None]:
from sklearn.preprocessing import OneHotEncoder

onehot_categorical =  OneHotEncoder(handle_unknown='ignore')
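
As a rough illustration of what the encoder produces, the sketch below fits it on a single made-up `smoker` column. The toy data is an assumption for demonstration only. Note how `handle_unknown='ignore'` turns a category that was not seen during fitting into a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up single-column frame
toy = pd.DataFrame({"smoker": ["yes", "no", "yes"]})

encoder = OneHotEncoder(handle_unknown="ignore")
# fit_transform returns a sparse matrix by default; convert it for display
encoded = encoder.fit_transform(toy).toarray()

print(encoder.categories_)  # learned categories, one array per input column
print(encoded)

# A value that was not present during fit is encoded as all zeros
unseen = encoder.transform(pd.DataFrame({"smoker": ["sometimes"]})).toarray()
print(unseen)
```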

We use the *SimpleImputer* class as an imputation transformer for completing missing values. Since this dataset has no missing values, it has no effect in this example. We include it to show how values can be imputed using sklearn.

In [None]:
from sklearn.impute import SimpleImputer

impute_categorical = SimpleImputer(strategy="most_frequent")
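
Because the insurance data has nothing to impute, the effect is easier to see on a made-up column with a gap. The following sketch, which assumes a toy `region` column, fills the missing entry with the most frequent value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up column with one missing entry
toy = pd.DataFrame({"region": ["southeast", "southwest", np.nan, "southeast"]})

imputer = SimpleImputer(strategy="most_frequent")
filled = imputer.fit_transform(toy)  # returns a NumPy array, not a DataFrame

print(filled.ravel())
```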

scikit-learn offers an API to sequentially apply a list of transforms and a final estimator - *sklearn.pipeline.Pipeline*. We create a pipeline and assemble the transformations we intend to apply on the categorical columns as shown below. 

In [None]:
from sklearn.pipeline import Pipeline

categorical_transformer = Pipeline(steps=[('impute',impute_categorical),('onehot',onehot_categorical)])

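Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. We define a *ColumnTransformer* and pass it the pipeline we created in the cell above along with the list of categorical columns we identified earlier. Setting `remainder="passthrough"` keeps the remaining columns unchanged instead of dropping them.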

In [None]:
from sklearn.compose import ColumnTransformer

preprocessorForCategoricalColumns = ColumnTransformer(transformers=[('cat', categorical_transformer, categoricalColumns)],
                                            remainder="passthrough")

The transformation normally happens inside the pipeline. We run the cell below explicitly only to show what the intermediate values look like.

In [None]:
df_pd_temp = preprocessorForCategoricalColumns.fit_transform(df_pd)
print("Categorical Data after transforming :")
print(df_pd_temp)
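
The printed array can be hard to read on the full dataset, so here is a minimal sketch on a made-up two-column frame. It illustrates the column layout: the transformed columns come first, followed by the columns that `remainder="passthrough"` lets through untouched. The toy data and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up frame with one categorical and one numerical column
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no"],
    "bmi": [27.9, 33.8, 33.0],
})

ct = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["smoker"])],
    remainder="passthrough",
)
out = ct.fit_transform(toy)

# ColumnTransformer may return a sparse matrix; convert it for display if so
out = out.toarray() if hasattr(out, "toarray") else np.asarray(out)

# Columns: one-hot 'no', one-hot 'yes', then the untouched 'bmi' values
print(out)
```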

### 3.3 Prepare numerical columns

During this process, we identify numerical columns in the dataset.

In [None]:
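# Defining the numerical columns, excluding the target column 'expenses'.
# The np.float and np.int aliases were removed from recent NumPy releases; np.number covers all numeric dtypes.
numericalColumns = [col for col in df_pd.select_dtypes(include=[np.number]).columns if col not in ['expenses']]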

print("Numerical columns : " )
print(numericalColumns)

The following cell uses the *StandardScaler* class from the *sklearn.preprocessing* module. Standardization of numerical fields refers to the process of removing the mean and scaling to unit variance.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler_numerical = StandardScaler()
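
To make "removing the mean and scaling to unit variance" concrete, the sketch below compares the scaler's output with the same computation written out in NumPy. The three input values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single numerical column
ages = np.array([[20.0], [30.0], [40.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(ages)

# Same computation by hand: subtract the column mean, divide by the column
# standard deviation (population standard deviation, ddof=0)
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```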

The three cells below assemble the steps into a pipeline and create a column transformer that covers both the categorical and the numerical columns. The steps are very similar to those in section 3.2 above.

In [None]:
numerical_transformer = Pipeline(steps=[('scale',scaler_numerical)])

In [None]:
preprocessorForAllColumns = ColumnTransformer(transformers=[('cat', categorical_transformer, categoricalColumns),('num',numerical_transformer,numericalColumns)],
                                            remainder="passthrough")

In [None]:
df_pd_temp_2 = preprocessorForAllColumns.fit_transform(df_pd)
print("Data after transforming :")
print(df_pd_temp_2)

### 3.4 Prepare data frame for splitting data into train and test datasets

We first extract the *features* from the dataframe, that is, the input columns that will be used to predict the final value.

In [None]:
# Keep every column except the target as an input feature
features = df_pd.drop(['expenses'], axis=1)
print('value of features : ' + str(features))

We then separate the column to be predicted and mark it as the *label*.

In [None]:
# Select the target column as a pandas Series
label = df_pd['expenses']

print(" value of label : " + str(label))

Now we will use *train_test_split* to split the feature and label dataframes into random train and test subsets. Unless explicit parameters are passed, the default is to split the dataset into 75% for training and 25% for testing. 
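
The default proportions, and how to override them, can be checked on synthetic arrays. In the sketch below the data is made up, and `test_size=0.2` is just one example of an explicit parameter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 feature columns
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Default split: 75% train, 25% test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
print(X_tr.shape, X_te.shape)

# Explicit split: 80% train, 20% test
X_tr80, X_te20, y_tr80, y_te20 = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_tr80.shape, X_te20.shape)
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs give the same split.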

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features,label , random_state=0)

print("Dimensions of datasets that will be used for training : Input features"+str(X_train.shape)+ 
      " Output label" + str(y_train.shape))
print("Dimensions of datasets that will be used for testing : Input features"+str(X_test.shape)+ 
      " Output label" + str(y_test.shape))

<a id="model_lrc"></a>
## 4. Build and test a multiple linear regression model
[Top](#top)


In a simple linear regression model, the variable to be predicted depends on only one other variable. The prediction is calculated with the same formula that describes a straight line.

y = w0 + w1*x1

In the above equation, y refers to the target variable and x1 refers to the independent variable. w1 is the coefficient that expresses the relationship between y and x1; it is also known as the slope. w0 is the constant coefficient, also known as the intercept. It is the constant offset added to y regardless of the value of the independent variable.


Multiple linear regression is an extension of simple linear regression. In this setup, the target value depends on more than one variable. The number of variables depends on the use case at hand. Usually a subject matter expert is involved in identifying the fields that will contribute to better predictions of the output feature.

y = w0 + w1*x1 + w2*x2 + .... + wn*xn
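
As a small sanity check of this formula, the sketch below generates noise-free synthetic data from known weights and confirms that `LinearRegression` recovers them. The weights, the random inputs, and the variable names are all made up for illustration and are unrelated to the insurance data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic inputs: 50 rows, 2 independent variables (x1, x2)
X = rng.normal(size=(50, 2))

# Known weights: y = w0 + w1*x1 + w2*x2 with no noise added
w0 = 1.0
w = np.array([2.0, -3.0])
y = w0 + X @ w

model = LinearRegression().fit(X, y)

print("Intercept (w0):", model.intercept_)
print("Coefficients (w1, w2):", model.coef_)
```

With real data the relationship is never exactly linear, so the fitted weights are estimates and not exact values.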

### 4.1 Define linear regression model 

Since multiple linear regression assumes that the output depends on more than one variable, we assume here that it depends on all 6 features. The data has already been split into training and test sets. As an experiment, you can try removing a few features and check whether the model performs any better.

The cell below shows how to define a linear regression model using sklearn's *LinearRegression* class.

In [None]:
from sklearn.linear_model import LinearRegression

model_name = 'Multiple Linear Regression'

mlRegressor = LinearRegression()

We then add the model defined above to the pipeline which will be executed in the next cell.

In [None]:
mlr_model = Pipeline(steps=[('preprocessorAll',preprocessorForAllColumns),('regressor', mlRegressor)])

### 4.2 Fit linear regression model

We will now fit this linear model by passing in the train data obtained from the section above. This essentially runs the sequence of steps defined in the pipeline. 

In [None]:
mlr_model.fit(X_train,y_train)
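
To illustrate what "runs the sequence of steps" means, the sketch below fits a small pipeline on made-up data and then repeats the same steps by hand. The toy frame, the target values, and the helper `make_preprocessor` are hypothetical and only loosely mirror the notebook's setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up features and target
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "yes"],
    "bmi": [27.9, 33.8, 33.0, 22.7, 28.9, 25.7],
})
target = pd.Series([16000.0, 1700.0, 4400.0, 21000.0, 3800.0, 3700.0])

def make_preprocessor():
    # Hypothetical helper: one-hot encode 'smoker', standardize 'bmi'
    return ColumnTransformer(transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["smoker"]),
        ("num", StandardScaler(), ["bmi"]),
    ])

# One call: the pipeline fits the preprocessor, transforms the data,
# then fits the regressor on the transformed data
pipe = Pipeline(steps=[("prep", make_preprocessor()), ("reg", LinearRegression())])
pipe.fit(toy, target)

# The same sequence spelled out step by step
prep = make_preprocessor()
reg = LinearRegression().fit(prep.fit_transform(toy), target)

# Both routes should give the same predictions
same = np.allclose(pipe.predict(toy), reg.predict(prep.transform(toy)))
print(same)
```

Keeping preprocessing inside the pipeline also means that scaling and encoding parameters are learned from the training data only and then reused unchanged when predicting on test data.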

### 4.3 Predict insurance price using multiple linear regression model

Now it's time to run predictions with the model trained above. We do this by calling the *predict* method and passing the test features dataset. The output is stored in an array named *y_pred_mlr*.

In [None]:
y_pred_mlr= mlr_model.predict(X_test)

### 4.4 Evaluate multiple linear regression model

The next few cells show how to quantify the quality of the predictions.

The following attributes of the fitted regressor can be used to inspect the model.

**intercept_** float or array of shape (n_targets,)
Independent term in the linear model. Set to 0.0 if fit_intercept = False.

**coef_** array of shape (n_features, ) or (n_targets, n_features)
Estimated coefficients for the linear regression problem. If multiple targets are passed during the fit (y 2D), this is a 2D array of shape (n_targets, n_features), while if only one target is passed, this is a 1D array of length n_features.

In [None]:
print(mlRegressor)
print('Intercept: \n',mlRegressor.intercept_)
print('Coefficients: \n', mlRegressor.coef_)

The [sklearn.metrics](https://scikit-learn.org/stable/modules/model_evaluation.html) API offers several functions for evaluating a model. We will look at *explained_variance_score* and *mean_squared_error* here as examples.

In [None]:
from sklearn.metrics import explained_variance_score,mean_squared_error

def model_metrics(regressor,y_test,y_pred):
    mse = mean_squared_error(y_test,y_pred)
    print("Mean squared error: %.2f"
      % mse)
    
    e_v_s = explained_variance_score(y_test, y_pred)
    print('Explained variance score: %.2f' % e_v_s )
    return [mse, e_v_s]

For explained variance, the best possible score is 1.0, and lower values are worse. For mean squared error, lower values are better, and 0 indicates a perfect fit.
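
To show what these scores measure, the sketch below recomputes them with NumPy on four made-up values and compares the results with the sklearn functions. It also computes *r2_score*, which this notebook uses in the plot title further down. The two scores differ only when the residuals have a non-zero mean.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score

# Made-up actual and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
residuals = y_true - y_pred

# Mean squared error: average of the squared residuals
mse_manual = np.mean(residuals ** 2)

# Explained variance: 1 - Var(residuals) / Var(y_true)
evs_manual = 1.0 - np.var(residuals) / np.var(y_true)

# R^2: 1 - (sum of squared residuals) / (total sum of squares)
r2_manual = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MSE:", mse_manual, mean_squared_error(y_true, y_pred))
print("Explained variance:", evs_manual, explained_variance_score(y_true, y_pred))
print("R^2:", r2_manual, r2_score(y_true, y_pred))
```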

In [None]:
mlrMetrics = model_metrics(mlRegressor,y_test,y_pred_mlr)

Finally, we use *matplotlib* to visualize the actual vs predicted values of the insurance expenses.

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score 

def two_d_compare(X_test,y_test,y_pred,model_name):
    # Two side-by-side scatter plots of one feature against the actual and predicted values
    plt.subplots(ncols=2, figsize=(10,4))
    plt.suptitle('Actual vs Predicted data : ' +model_name + '. R^2 score: %.2f' % r2_score(y_test, y_pred))

    plt.subplot(121)
    plt.scatter(X_test, y_test, alpha=0.8, color='#8CCB9B')
    plt.title('Actual')

    plt.subplot(122)
    plt.scatter(X_test, y_pred,alpha=0.8, color='#E5E88B')
    plt.title('Predicted')

    plt.show()
    

In [None]:
two_d_compare(X_test['bmi'],y_test,y_pred_mlr,model_name)

<p><font size=-1 color=gray>
&copy; Copyright 2019 IBM Corp. All Rights Reserved.
<p>
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file
except in compliance with the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the
License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied. See the License for the specific language governing permissions and
limitations under the License.
</font></p>