# Regression

<font color='steelblue'>

<font size = 5>
    <strong>Linear Regression Example</strong><br><br>
    Predict the profit of a startup<br>
</font>
    
<font size = 4>
    <br>
    
**The following steps are covered:**
    
- `Load` dataset from a CSV file
- `Explore` Data
- `Set up` the dataframe
- `Create` training and test dataset
- `Train` a Linear Regression Model
- `Explore` trained model performance
- `Make` predictions using test dataset
- `Explore` model performance comparing actual vs. predicted values
- `Use Decision Tree Regressor` to make predictions

</font>

</font>

## Import required libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
#plt.style.use('seaborn-whitegrid')    # grids in the plots
import warnings
warnings.filterwarnings('ignore')

## Load the dataset from a CSV file

In [None]:
df = pd.read_csv('../datasets/50_Startups.csv')

In [None]:
df.head()

## Display the std deviation, mean, min, max, etc of the dataset

In [None]:
# Get the data description e.g. count, mean, standard deviation, etc.
pd.set_option('display.precision', 2)

# another way of doing transpose (or use .transpose())
df.describe().T

In [None]:
df.describe(include = 'object')

In [None]:
df['State'].value_counts()

In [None]:
states = df['State'].values

In [None]:
states

## Display the data types of features and target

In [None]:
# display the data types
df.info()

### Handle State which is categorical value (Perform One Hot Encoding)

In [None]:
df = pd.get_dummies(df, columns = ['State'], drop_first = False)
df.head()
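
To see what the one-hot encoding step produces, here is a minimal sketch on a made-up toy frame (the values and the `Spend` column are assumptions for illustration, not taken from the startup dataset). It also shows the effect of `drop_first = True`, which drops the first category and is sometimes preferred for linear models to avoid perfectly collinear indicator columns.

```python
import pandas as pd

# Toy frame with made-up values (not the startup dataset)
toy = pd.DataFrame({'State': ['New York', 'California', 'Florida', 'California'],
                    'Spend': [1.0, 2.0, 3.0, 4.0]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=['State'], drop_first=False, dtype=int)

# drop_first=True removes the first category (in sorted order), so the
# remaining indicators no longer sum to one on every row
encoded_drop = pd.get_dummies(toy, columns=['State'], drop_first=True, dtype=int)

print(list(encoded.columns))
print(list(encoded_drop.columns))
```

With `drop_first = False`, exactly one indicator is set on every row, so the indicator columns always sum to 1.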

In [None]:
# We want the target column at the end (the previous step added new
# columns to the end of our dataframe)
profit = df.pop('Profit')

In [None]:
type(profit)

In [None]:
# Add it as the last column in our dataframe
df['Profit'] = profit
df.head()

## Create scatter plots of Profit against the numeric features

In [None]:
cols = df.columns
cols = cols.drop('Profit')
cols

In [None]:
plt.rc('figure', figsize=(14, 5))
fig, axs = plt.subplots(1, 3)
axs[0].scatter(df['Profit'], df[cols[0]])
axs[0].set_title(f'Profit vs. {cols[0]}')
axs[1].scatter(df['Profit'], df[cols[1]])
axs[1].set_title(f'Profit vs. {cols[1]}')
axs[2].scatter(df['Profit'], df[cols[2]])
axs[2].set_title(f'Profit vs. {cols[2]}')

plt.show()

In [None]:
plt.rc('figure', figsize=(20, 10))
toplot = cols.drop("State_California")
toplot = toplot.drop("State_New York")
toplot = toplot.drop("State_Florida")
print(toplot)
sns.pairplot(df[toplot])
plt.show()

### Check correlation between features and profit

In [None]:
# Correlation
corr = df.corr()
sns.set(font_scale=1.4)
f, ax = plt.subplots(figsize=(11,9))

# Draw the heatmap with annotated cells and a square aspect ratio
sns.heatmap(corr, cmap = "Blues", vmax=.9, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot = True);

## Display the null count for each column

In [None]:
# check if there are any null values in our features
df.isnull().sum()

## Define X & y

In [None]:
cols = list(df.columns)
cols

In [None]:
# feature columns
cols.remove('Profit')
cols

## Create Training and Test Data

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[cols], df['Profit'], 
                                                    test_size = 0.20, 
                                                    random_state = 2345)

In [None]:
X_train.head(3)

In [None]:
X_train.shape, X_test.shape

In [None]:
y_train.shape, y_test.shape

## Data Leakage

<font size = 3>
<font color = 'grey'>

- Create the training and test split before preprocessing such as scaling or one-hot encoding.
- To prevent data leakage, fit the preprocessing on the training dataset only, because training happens on the training dataset. The test dataset is then transformed with the scaling parameters learned from the training dataset. See [`Data Leak or Leakage`](https://en.wikipedia.org/wiki/Leakage_(machine_learning)).

</font>
</font>
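
One way to make the fit-on-train-only rule hard to break is to wrap the scaler and the model in a scikit-learn `Pipeline`. The following is a minimal sketch on synthetic stand-in data; the column names `spend_a`, `spend_b`, `flag`, and all `demo`-style variable names are assumptions for illustration and are not part of the startup dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical columns, not the startup dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({'spend_a': rng.normal(100, 20, 40),
                     'spend_b': rng.normal(50, 5, 40),
                     'flag': rng.integers(0, 2, 40)})
demo['target'] = 3 * demo['spend_a'] + 2 * demo['spend_b'] + rng.normal(0, 1, 40)

# Split first, so the test rows never influence the scaling parameters
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo.drop(columns='target'), demo['target'], test_size=0.2, random_state=0)

# Scale only the numeric spend columns; pass the indicator column through
pre = ColumnTransformer([('scale', StandardScaler(), ['spend_a', 'spend_b'])],
                        remainder='passthrough')
pipe = Pipeline([('pre', pre), ('model', LinearRegression())])

# fit() learns the scaler's mean and std from the training rows only;
# score() and predict() reuse those parameters on the test rows
pipe.fit(Xd_train, yd_train)

fitted_scaler = pipe.named_steps['pre'].named_transformers_['scale']
print(fitted_scaler.mean_)
print(pipe.score(Xd_test, yd_test))
```

The means stored in the fitted scaler match the training-set column means, which confirms that the test rows played no part in the scaling.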

## Standardize the features that require scaling

In [None]:
from sklearn.preprocessing import StandardScaler
tostd = ['R&D Spend', 'Marketing Spend', 'Administration']
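scaler = StandardScaler()
# work on copies so the assignments below do not trigger SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()
# fit on the training data only, then apply the same transform to both sets
scalerModel = scaler.fit(X_train[tostd])
X_train[tostd] = scalerModel.transform(X_train[tostd])
X_test[tostd] = scalerModel.transform(X_test[tostd])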

In [None]:
X_train.head(3)

In [None]:
X_test.head(3)

## Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

In [None]:
# the `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2;
# since we have already standardized the data, the defaults are sufficient
linReg = LinearRegression()

In [None]:
# perform training (scikit-learn accepts DataFrames and Series as well as NumPy arrays)
linRegModel = linReg.fit(X_train, y_train.values)

In [None]:
# intercept on y-axis
linRegModel.intercept_

In [None]:
coeff = list(linRegModel.coef_)
coeff

In [None]:
# Sorted dataframe by coefficients
pd.set_option('display.precision', 2)
coeff_df = pd.DataFrame(coeff, index=cols, columns=['Coefficient'])
sortcoeff = coeff_df.sort_values('Coefficient', ascending = False)
sortcoeff

In [None]:
print("R Squared on training data: {}".format(linRegModel.score(X_train, y_train)))

In [None]:
y_pred = linRegModel.predict(X_test)

In [None]:
# R-Squared value
linRegModel.score(X_test, y_test)

In [None]:
# use a separate name so the original dataframe `df` is not overwritten
compare_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
compare_df.head(10)

In [None]:
# scatter of actual vs. predicted profit; points on the red line are perfect predictions
print(X_test.size)
plt.figure(figsize = (6,6))
a = plt.axes(aspect='equal')
plt.scatter(y_test, y_pred)
plt.xlabel('True Values [$]')
plt.ylabel('Predictions [$]')
# for the line
plt.plot([50000,200000], [50000,200000], 'r');

In [None]:
print("R Squared on predictions: {}".format(r2_score(y_test, y_pred)))

In [None]:
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error on predictions {:.2f}".format(mse))

In [None]:
print("Root Mean Squared Error on predictions {:.2f}".format(np.sqrt(mse)))

In [None]:
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error {mae:.2f}")
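
To make the metrics above concrete, the sketch below recomputes them by hand with NumPy on a tiny made-up example and checks the results against scikit-learn. The arrays `y_true_demo` and `y_pred_demo` are invented for illustration and are not model output.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration only
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.0, 8.0, 8.5])

residuals = y_true_demo - y_pred_demo

# MSE: mean of squared residuals; RMSE: its square root (same units as y)
mse_manual = np.mean(residuals ** 2)
rmse_manual = np.sqrt(mse_manual)

# MAE: mean of absolute residuals
mae_manual = np.mean(np.abs(residuals))

# R squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, rmse_manual, mae_manual, r2_manual)
print(np.isclose(mse_manual, mean_squared_error(y_true_demo, y_pred_demo)))
print(np.isclose(mae_manual, mean_absolute_error(y_true_demo, y_pred_demo)))
print(np.isclose(r2_manual, r2_score(y_true_demo, y_pred_demo)))
```

RMSE and MAE are in the same units as the target (here, dollars of profit), which usually makes them easier to interpret than MSE.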

## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree

In [None]:
# limit the depth of the tree (e.g. max_depth = 3) to keep the plotted tree readable
# try with and without the max_depth argument and compare the scores
#dt = DecisionTreeRegressor(max_depth = 3)
dt = DecisionTreeRegressor()

In [None]:
# train model
dtModel = dt.fit(X_train, y_train)

In [None]:
dtModel.score(X_test, y_test)

In [None]:
# get tree depth
dtModel.get_depth()

In [None]:
dtModel.get_n_leaves()

In [None]:
fig = plt.figure(figsize = (25,20))
tree.plot_tree(dtModel, feature_names = cols, filled = True)
plt.savefig("DecisionTree.png")
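
The effect of `max_depth` can be explored with a small comparison. The following is a rough sketch on synthetic data (the arrays and variable names are assumptions, not the startup dataset): an unrestricted tree typically fits the training data perfectly, so its training score alone says little about how it generalizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy linear relationship in the first feature
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(60, 2))
y_demo = 4 * X_demo[:, 0] + rng.normal(0, 2, 60)

Xa, Xb, ya, yb = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Same data, two trees: one capped at depth 3, one grown without a limit
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(Xa, ya)
deep = DecisionTreeRegressor(random_state=0).fit(Xa, ya)

for name, model in [('max_depth=3', shallow), ('unrestricted', deep)]:
    print(name,
          'depth:', model.get_depth(),
          'leaves:', model.get_n_leaves(),
          'train R2:', round(model.score(Xa, ya), 3),
          'test R2:', round(model.score(Xb, yb), 3))
```

Comparing the training and test scores side by side gives a quick indication of whether the deeper tree is overfitting.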