-----------------------------
## Practice Hands-on case study: Linear Regression

-----------------------------

Welcome to the hands-on case study on Linear Regression. In this case study, we aim to construct a linear model that explains the relationship between a car's mileage (mpg) and its other attributes.

-----------------------------
## Dataset:
-----------------------------
There are 8 variables in the data:

- mpg: miles per gallon
- cyl: number of cylinders
- disp: engine displacement (cu. inches) or engine size
- hp: horsepower
- wt: vehicle weight (lbs.)
- acc: time taken to accelerate from 0 to 60 mph (sec.)
- yr: model year
- car name: car model name


- Also provided are the car labels (types)
- Missing data values are marked by a question mark ('?').

## Import Libraries

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from google.colab import drive
drive.mount('/content/drive', force_remount=True)
import os

In [None]:

# the base Google Drive directory
root_dir = "/content/drive/My Drive/"

# choose where you want your project files to be saved
project_folder = "Colab Notebooks/My Project Folder/"

def create_and_set_working_directory(project_folder):
  # check if your project folder exists. if not, it will be created (including parent folders).
  if not os.path.isdir(root_dir + project_folder):
    os.makedirs(root_dir + project_folder)
    print(root_dir + project_folder + ' did not exist but was created.')

  # change the OS to use your project folder as the working directory
  os.chdir(root_dir + project_folder)

# call the function so that the working directory is actually changed
create_and_set_working_directory(project_folder)

## Load and review data

In [3]:
data = pd.read_csv("auto_mpg.csv")
data.shape

FileNotFoundError: [Errno 2] No such file or directory: 'auto_mpg.csv'

In [None]:
data.head()

In [None]:
# drop the car name column; it is not used as a predictor
data = data.drop('car name', axis=1)
# replace the numeric origin codes with the region names
data['origin'] = data['origin'].replace({1: 'america', 2: 'europe', 3: 'asia'})
data.head()

## Create Dummy Variables
Values like 'america' cannot be used directly in an equation. Substituting numbers such as 1 for america, 2 for europe and 3 for asia would imply that European cars fall exactly halfway between American and Asian cars. We do not want to impose such a baseless assumption!

So we create 3 simple true-or-false columns with titles equivalent to "Is this car American?", "Is this car European?" and "Is this car Asian?". These will be used as independent variables without imposing any kind of ordering between the three regions.
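As a small illustration of the idea above, the sketch below one-hot encodes a made-up `origin` column. The toy DataFrame and the `dtype=int` choice are assumptions for illustration and are not part of the case-study data.

```python
import pandas as pd

# Hypothetical toy data; the real notebook uses the 'origin' column of the car data
toy = pd.DataFrame({"mpg": [18.0, 26.0, 31.0, 24.0],
                    "origin": ["america", "europe", "asia", "europe"]})

# One indicator column per region, encoded as 0/1 integers
toy_dummies = pd.get_dummies(toy, columns=["origin"], dtype=int)
print(toy_dummies)

# Each row belongs to exactly one region, so the indicators in a row sum to 1
indicator_cols = [c for c in toy_dummies.columns if c.startswith("origin_")]
row_sums = toy_dummies[indicator_cols].sum(axis=1)
```

Because the indicators always sum to 1, any one of them is implied by the other two. That is one reason to leave a region out of the predictors when the model also has an intercept.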




In [None]:
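# dtype=int gives 0/1 columns rather than booleans, which keeps every column numeric for the models below
data = pd.get_dummies(data, columns=['origin'], dtype=int)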
data.head()

## Dealing with Missing Values

In [None]:
#A quick summary of the data columns
data.describe()

In [None]:
# horsepower is missing from the summary because it is not recognized as a numerical column!
data.dtypes

### Q.2 Which method is used to check whether an entry of a column is a numerical value or is missing?


In [None]:
# if the string is made up of digits, store True, else False. Hint: use isdigit()
hpIsDigit = pd.DataFrame(data.horsepower.str._________())


data[hpIsDigit['horsepower'] == False]   # from data, take only those rows where horsepower is not made up of digits
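To make the digit check concrete, here is a minimal sketch on a made-up string column. The values are hypothetical; only `Series.str.isdigit()` is taken from the hint above.

```python
import pandas as pd

# Hypothetical horsepower-like column stored as strings, with '?' marking gaps
hp = pd.Series(["130", "165", "?", "95", "?"], name="horsepower")

# True where the whole string consists of digits, False otherwise
is_digit = hp.str.isdigit()

# Rows that fail the check are the candidates for missing values
suspect = hp[~is_digit]
print(suspect)
```

Note that this check also flags strings such as "95.5" or "-3", so it is only a reasonable test when valid entries are non-negative whole numbers.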


In [None]:
# Missing values are marked with '?'
# Replace them with NaN
data = data.replace('?', np.nan)
data[hpIsDigit['horsepower'] == False]

There are various ways to handle missing values: drop the rows, replace the missing values with the median, and so on. Of the 398 rows, 6 have NaN in the hp column. We could drop those 6 rows, but that is not a good idea in every situation.


In [None]:
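# instead of dropping the rows, let's replace the missing values with the median value.
# horsepower is still stored as strings, so convert it to float before computing medians
data['horsepower'] = data['horsepower'].astype('float64')
data.median()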

### Filling the missing values with median value

In [None]:
# replace the missing values with the median value.
# Note: we do not need to specify the column names below;
# every column's missing values are replaced with that column's median (axis=0 means column-wise)

# convert the hp column from object/string type to float first, so that its median can be computed
data['horsepower'] = data['horsepower'].astype('float64')

medianFiller = lambda x: x.fillna(x.median())
data = data.apply(medianFiller, axis=0)
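The two options discussed above, filling with the median or dropping rows, can be compared on a small made-up frame. The column names and values below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'hp' arrives as strings with '?' marking gaps
toy = pd.DataFrame({"hp": ["100", "?", "150", "200", "?"],
                    "wt": [2000, 2500, 3000, 3500, 4000]})

# Turn the marker into NaN, then make the column numeric
toy = toy.replace("?", np.nan)
toy["hp"] = toy["hp"].astype("float64")

hp_median = toy["hp"].median()          # median of the observed values only
filled = toy.fillna({"hp": hp_median})  # option 1: fill the gaps with the median
dropped = toy.dropna()                  # option 2: discard incomplete rows

print(filled)
print(dropped)
```

Filling keeps all five rows, while dropping keeps only three. On a small dataset that loss of rows is the main argument for imputing instead.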


## BiVariate Plots

A bivariate analysis of the variables can be done with a scatter-matrix plot. Seaborn's `pairplot` draws a grid of pairwise scatter plots, with the distribution of each variable on the diagonal. The result can be saved as a .png file.

In [None]:
data_attr = data.iloc[:, 0:7]
sns.pairplot(data_attr, diag_kind='kde')   # to plot density curve instead of histogram on the diag

The scatter plots of 'mpg' against the other attributes indicate that the relationships are not strictly linear. However, the plots also indicate that a linear model would still capture quite a bit of the useful pattern. Several assumptions of classical linear regression seem to be violated, including homoscedasticity (constant variance of the errors).


## Split Data

In [None]:
# let's build our linear model
# independent variables
X = data.drop(columns=['mpg', 'origin_europe'])
# the dependent variable
y = data['mpg']

In [None]:
# sklearn's model_selection module provides train_test_split(), which splits the data into train and test (out-of-sample) sets
from sklearn.model_selection import train_test_split


# Split X and y into training and test sets (out-of-sample data) in a 70:30 ratio

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
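To see what the 70:30 split does to the row counts, here is a minimal sketch on placeholder arrays with the same number of rows as the case-study data (398). The arrays themselves are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with 398 rows; the values carry no meaning
X_demo = np.arange(398 * 2).reshape(398, 2)
y_demo = np.arange(398)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1)

print(X_tr.shape, X_te.shape)
```

Fixing `random_state` is what makes the later train and test metrics reproducible from run to run.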

### Q.3 & 4 Create a linear regression model using statsmodels OLS and interpret the coefficients

In [None]:
# import libraries for building linear regression model
#using statsmodel

from statsmodels.graphics.gofplots import ProbPlot
from statsmodels.formula.api import ols
import statsmodels.api as sm

# let's add the intercept to data
X_train_ols = sm.add_constant(X_train)
X_test_ols=sm.add_constant(X_test)

# create the model

#remove ________ and define ols model and complete the code
model1 = ___________________

# get the model summary
#remove ________ and print summary and complete the code.
model1.__________

- Not all the variables are statistically significant predictors of the outcome variable. To check which ones are, we need to look at the `p-value` of each independent variable.
- **Interpreting the Regression Results:**

1. **Adj. R-squared**: It reflects the fit of the model, adjusted for the number of predictors.
    - R-squared values range from 0 to 1, where a higher value generally indicates a better fit, assuming certain conditions are met. Adjusted R-squared additionally penalizes predictors that do not improve the fit.
    
2. **coef**: It represents the expected change in the output Y for a one-unit increase in the variable (everything else held constant).
3. **std err**: It is the estimated standard error of the coefficient and reflects how precisely the coefficient is estimated.
    - The lower it is, the more precise the estimate.
4. **P>|t|**: It is the p-value.
   
   * For each independent feature there is a null hypothesis and an alternative hypothesis:

    Ho : The coefficient of the feature is zero (the feature is not significant)
   
    Ha : The coefficient of the feature is not zero (the feature is significant)
    
   * A p-value of less than 0.05 is conventionally considered statistically significant.

   
5. **Confidence Interval**: The `[0.025` and `0.975]` columns give a 95% confidence interval for each coefficient, that is, a range of plausible values for the true coefficient.

* To be able to make statistical inferences from our model, **we will have to test the significance of the regression coefficients and the linear regression assumptions.**

### Checking the performance of the model on the train and test data set

In [None]:
# RMSE
def rmse(predictions, targets):
    return np.sqrt(((targets - predictions) ** 2).mean())


# MAPE
def mape(predictions, targets):
    return np.mean(np.abs((targets - predictions)) / targets) * 100


# MAE
def mae(predictions, targets):
    return np.mean(np.abs((targets - predictions)))


# Model Performance on test and train data
def model_pref(olsmodel, x_train, x_test, y_train,y_test):

    # Insample Prediction
    y_pred_train = olsmodel.predict(x_train)
    y_observed_train = y_train

    # Prediction on test data
    y_pred_test = olsmodel.predict(x_test)
    y_observed_test = y_test

    print(
        pd.DataFrame(
            {
                "Data": ["Train", "Test"],
                "RMSE": [
                    rmse(y_pred_train, y_observed_train),
                    rmse(y_pred_test, y_observed_test),
                ],
                "MAE": [
                    mae(y_pred_train, y_observed_train),
                    mae(y_pred_test, y_observed_test),
                ],
                "MAPE": [
                    mape(y_pred_train, y_observed_train),
                    mape(y_pred_test, y_observed_test),
                ],
            }
        )
    )


# Checking model performance
model_pref(model1, X_train_ols, X_test_ols,y_train,y_test)
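As a quick sanity check of the three metric formulas, the sketch below evaluates them by hand on three made-up target and prediction pairs.

```python
import numpy as np

# Hypothetical values chosen so the arithmetic is easy to follow
targets = np.array([10.0, 20.0, 40.0])
predictions = np.array([12.0, 18.0, 44.0])

errors = targets - predictions                        # [-2, 2, -4]
rmse_demo = np.sqrt((errors ** 2).mean())             # sqrt((4 + 4 + 16) / 3) = sqrt(8)
mae_demo = np.abs(errors).mean()                      # (2 + 2 + 4) / 3
mape_demo = (np.abs(errors) / targets).mean() * 100   # mean(0.2, 0.1, 0.1) * 100

print(rmse_demo, mae_demo, mape_demo)
```

RMSE weights the largest error more heavily than MAE does, and MAPE divides by the target, so it is undefined when a target is zero.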

**Observations:**

* RMSE, MAE, and MAPE of train and test data are not very different, indicating that the **model is not overfitting and has generalized well.**

### Question 5: Performing cross validation and comparing its average performance to OLS performance

In [None]:
# import the required function
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# build the regression model using Sklearn Linear regression
linearregression = LinearRegression()

# remove ________, fill in the cross_val_score arguments and complete the code
cv_Score11 = cross_val_score(____________) # cv=10 means the data is divided into 10 folds.
cv_Score12 = cross_val_score(____________________,
                             scoring = 'neg_mean_squared_error')


print("RSquared: %0.3f (+/- %0.3f)" % (cv_Score11.mean(), cv_Score11.std() * 2))
print("Mean Squared Error: %0.3f (+/- %0.3f)" % (-1*cv_Score12.mean(), cv_Score12.std() * 2))
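The reason the mean squared error is multiplied by -1 above is that scikit-learn scorers follow a "higher is better" convention. A minimal sketch on synthetic data (an assumption for illustration, not the car data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Synthetic regression data with a strong linear signal
X_demo = rng.normal(size=(150, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=150)

# Default scoring for a regressor is R-squared; one score per fold
r2_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=10)

# MSE is returned negated, so flip the sign to report it
neg_mse_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=10,
                                 scoring="neg_mean_squared_error")
mse_scores = -neg_mse_scores

print(r2_scores.mean(), mse_scores.mean())
```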

### Get model Coefficients in a pandas dataframe with column 'Feature' having all the features and column 'Coefs' with all the corresponding Coefs. Write the regression equation.

In [None]:
coef = model1.params
coef

In [None]:
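# Let us write the equation of the fit
# the target is mpg itself (no log transform was applied above)
Equation = "mpg ="
print(Equation, end='\t')
# join the terms with ' + ' so that no '+' is left dangling at the end
print(' + '.join('( {} ) * {}'.format(coef.iloc[i], coef.index[i]) for i in range(len(coef))))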
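One possible way to get the 'Feature' and 'Coefs' table described in the heading is sketched below. The parameter values are hypothetical placeholders; in the notebook they would come from `model1.params`.

```python
import pandas as pd

# Hypothetical fitted parameters standing in for model1.params
params = pd.Series({"const": 1.5, "wt": -0.006, "yr": 0.75})

# Two-column table: feature name and its coefficient
coef_df = pd.DataFrame({"Feature": params.index, "Coefs": params.values})
print(coef_df)

# Build "mpg = (b0) * const + (b1) * x1 + ..." as a single string
terms = [f"({value}) * {name}" for name, value in params.items()]
equation = "mpg = " + " + ".join(terms)
print(equation)
```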

### Building Decision Tree

In [None]:
#importing Decision tree regressor using sklearn

from sklearn.tree import DecisionTreeRegressor

In [None]:
# splitting the data in 70:30 ratio of train to test data
# separate the dependent and independent variables
Y1 = data['mpg']
X1 = data.drop(columns=['mpg', 'origin_europe'])
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, Y1, test_size=0.30, random_state=1)

### Question 6: Building Decision tree and Checking its performance

In [None]:
# defining the Decision Tree regressor
# remove ________, define the decision tree and complete the code
dt = ________________

# fitting the Decision Tree regressor to the train dataset
# remove ________, fit the decision tree and complete the code
dt.fit(______________)

Checking model performance on the train and test datasets

In [None]:
model_pref(dt, X_train1, X_test1,y_train1,y_test1)

**Observations:**

- **The model seems to overfit the data**, as the RMSE, MAE and MAPE on the train data are 0, while the values on the test data are much higher.
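The pattern described above, zero error on the train data and a much larger error on the test data, is typical of a fully grown tree. The sketch below reproduces it on synthetic data (an assumption for illustration) and contrasts it with a depth-limited tree; `max_depth=3` is an arbitrary, untuned choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)

# Synthetic noisy data
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=1)

def rmse_demo(model, X, y):
    # root mean squared error of the model's predictions on (X, y)
    return np.sqrt(((y - model.predict(X)) ** 2).mean())

# No depth limit: the tree keeps splitting until it memorizes the train set
full_tree = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)
# Depth limit: the tree can no longer memorize individual points
small_tree = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_tr, y_tr)

print("full :", rmse_demo(full_tree, X_tr, y_tr), rmse_demo(full_tree, X_te, y_te))
print("small:", rmse_demo(small_tree, X_tr, y_tr), rmse_demo(small_tree, X_te, y_te))
```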

In [None]:
from sklearn.tree import plot_tree

In [None]:
features = list(X1.columns)

plt.figure(figsize=(35,25))
# class_names only applies to classifiers, so it is left out for this regressor
plot_tree(dt, max_depth=4, feature_names=features, filled=True, fontsize=12, node_ids=True)
plt.show()

#### Let's plot the feature importance for each variable in the dataset and analyze the variables

### Checking Feature importance

In [None]:
#remove ________ and find feature importance decision tree and complete the code

importances = dt.______________

columns=X1.columns
importance_df=pd.DataFrame(importances,index=columns,columns=['Importance']).sort_values(by='Importance',ascending=False)
plt.figure(figsize=(8,4))
sns.barplot(x=importance_df.Importance, y=importance_df.index)  # recent seaborn versions require x and y as keyword arguments
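To show what the importance values look like, the sketch below fits a small tree on synthetic data in which one feature dominates by construction. The feature names and coefficients are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)

# Synthetic data: the target is driven almost entirely by 'strong'
demo = pd.DataFrame({"strong": rng.normal(size=300),
                     "weak": rng.normal(size=300),
                     "irrelevant": rng.normal(size=300)})
y_demo = 5.0 * demo["strong"] + 0.5 * demo["weak"] + rng.normal(scale=0.1, size=300)

tree = DecisionTreeRegressor(max_depth=4, random_state=1).fit(demo, y_demo)

# Impurity-based importances: non-negative and normalized to sum to 1
importance_demo = pd.Series(tree.feature_importances_, index=demo.columns)
print(importance_demo.sort_values(ascending=False))
```

Since the values sum to 1, each one can be read as the share of the tree's total impurity reduction attributed to that feature.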

### Building Random Forest

In [None]:
# importing the random forest regressor from sklearn

from sklearn.ensemble import RandomForestRegressor

#### Parameters for regression
**n_estimators**: The number of trees in the forest.

**min_samples_split**: The minimum number of samples required to split an internal node.

**max_depth**: The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

**max_features{“sqrt”, “log2”, None}**: The number of features to consider when looking for the best split.

- If “sqrt”, then max_features=sqrt(n_features).

- If “log2”, then max_features=log2(n_features).

- If None, then max_features=n_features.

- “auto” was accepted by older scikit-learn releases, where for `RandomForestRegressor` it meant max_features=n_features. It was deprecated in version 1.1 and removed in version 1.3.
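The sketch below shows where each of these parameters goes when constructing the regressor. The data is synthetic and the parameter values are arbitrary, untuned placeholders, not recommendations for the case study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)

# Synthetic data with 4 features
X_demo = rng.normal(size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.2, size=200)

forest = RandomForestRegressor(
    n_estimators=50,        # number of trees in the forest
    max_depth=5,            # cap on the depth of each tree
    min_samples_split=4,    # minimum samples needed to split an internal node
    max_features="sqrt",    # features considered per split: sqrt(4) = 2
    random_state=1,
).fit(X_demo, y_demo)

# score() returns R-squared on the data passed in (here, the training data)
print(len(forest.estimators_), forest.score(X_demo, y_demo))
```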

In [None]:
# defining the Random Forest regressor
# remove ________, define the random forest regressor and complete the code
rf=________________________

# We have chosen the hyperparameters arbitrarily for now, but they can be tuned to get the best model.

# fitting the model
# remove ________, fit the random forest and complete the code
rf.fit(__________________)

### Q.7 Check performance of Random Forest

In [None]:
# checking model performance on test dataset
rf.score(__________________)

In [None]:
model_pref(rf, X_train1, X_test1,y_train1,y_test1)

### Question 8 & 9: Checking the feature importance of each variable in Random Forest and comparing to Decision Tree

In [None]:
#remove ________ and print feature importance Random forest and complete the code
importances = rf.________________

columns=X1.columns
importance_df=pd.DataFrame(importances,index=columns,columns=['Importance']).sort_values(by='Importance',ascending=False)
plt.figure(figsize=(8,4))
sns.barplot(x=importance_df.Importance, y=importance_df.index)  # recent seaborn versions require x and y as keyword arguments

### Question 10: Comparing the results of the three models

In [None]:
print("Linear Regression")
model_pref(model1, X_train_ols, X_test_ols,y_train,y_test)
print("Decision tree")
model_pref(dt, X_train1, X_test1,y_train1,y_test1)
print("Random Forest")
model_pref(rf, X_train1, X_test1,y_train1,y_test1)