# 0. Review
## 0.A Scikit-Learn

Scikit-Learn is a machine learning package for Python. It exposes machine learning algorithms through **object-oriented programming**.
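
As a minimal sketch of what the object-oriented interface looks like in practice: a model is an object that is created, trained with `fit`, and then queried with `predict`. The numbers below are made up for illustration and are unrelated to the avocado data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# made-up data that follows y = 1 + 2x exactly (illustration only)
X = np.array([[0.0], [1.0], [2.0], [3.0]])   # one column = one explanatory variable
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()        # 1. create the model object
model.fit(X, y)                   # 2. train it; learned values are stored on the object
y_new = model.predict([[4.0]])    # 3. use the trained object to predict

print(y_new)
```

The same create, fit, predict pattern is what the rest of this notebook applies to the avocado data.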

## 0.B Data Set

We will be using a dataset of avocado prices.

We would like to learn the price of an avocado given its brand, the location where it was sold, total volume, etc.

## 0.C Load Data

Now we load our training and test sets. Run the code below to load them.

In [None]:
import pandas as pd

# load explanatory variables
avocado_training_set  = pd.read_csv("datasets/avocado_training_set",index_col=0)
avocado_training_set['year'] = avocado_training_set['year'].astype(str)
avocado_training_set['Month'] = avocado_training_set['Month'].astype(str)

avocado_test_set = pd.read_csv("datasets/avocado_test_set",index_col=0)
avocado_test_set['year'] = avocado_test_set['year'].astype(str)
avocado_test_set['Month'] = avocado_test_set['Month'].astype(str)

# load response variable (prices)
prices_training_set = pd.read_csv("datasets/avocado_prices_training_set",index_col=0)
prices_test_set = pd.read_csv("datasets/avocado_prices_test_set",index_col=0)

**In this section, we will be learning a linear regression model from our training data. Using the attributes of the avocado sold, this regression estimates the average price of the avocado.**

# 9. Train model: Linear Regression

2. Then, train a **machine learning model** using **labeled data**

    - "Labeled data" has been labeled with the outcome
    - "Machine learning model" learns the relationship between the attributes of the data and its outcome

Linear regression assumes that there is a *linear* relationship between the explanatory variables and the outcome. 

In our case, a linear relationship means that as an explanatory variable increases, price either goes up or goes down (never both) at a constant rate.

Thus, linear regression assumes a linear model of price:

$$\text{Price} = \beta_0 + \beta_1 \text{Total Volume} + \beta_2 \text{Year} + \beta_3 \text{Month} + \beta_4 \text{Type}.$$

In a linear regression model, we aim to learn the coefficients $\beta_0 ,\beta_1 ,\beta_2 ,\beta_3,\beta_4$ that minimize the mean squared error between the model's predictions and the true response variables (prices).

That is,
$$\min_{\beta_0 ,\beta_1 ,\beta_2 ,\beta_3,\beta_4} \sum_{i=1}^{N}\left(y_i - \left(\beta_0 + \beta_1 \text{Total Volume}_i + \beta_2 \text{Year}_i + \beta_3 \text{Month}_i + \beta_4 \text{Type}_i\right)\right)^2$$

By learning a model, we will be able to predict future prices and study how the explanatory variables affect price.
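
To make the minimization concrete, here is a minimal sketch that solves a one-variable least-squares problem directly with NumPy. The data are synthetic (generated from an assumed line plus noise), and the helper `sum_squared_error` is a name introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data (an assumption for illustration): y = 2.5 - 0.1 x + noise
x = rng.uniform(5, 15, size=200)
y = 2.5 - 0.1 * x + rng.normal(scale=0.05, size=200)

# design matrix: a column of ones (for beta_0) next to x (for beta_1)
A = np.column_stack([np.ones_like(x), x])

# least-squares solution of min ||y - A beta||^2
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

def sum_squared_error(b):
    """Sum of squared differences between y and the line defined by b."""
    return np.sum((y - A @ b) ** 2)

print(beta)                      # estimated (beta_0, beta_1)
print(sum_squared_error(beta))   # the minimized objective
```

Any other choice of coefficients gives a sum of squared errors at least as large as the one at `beta`, which is what "minimize" means here.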
    
## 9.A Check Training Set

Let's check if we loaded the correct dataset.

In [None]:
#print head of prices_training_set
prices_training_set.head()

In [None]:
#print head of  avocado_training_set
avocado_training_set.head()

## 9.B Linear Regression: Price vs. Total Volume

### 9.B.1 Plot of Price vs. Total Volume

In [None]:
import matplotlib.pyplot as plt
import numpy as np

#scatter plot price vs. total volume
plt.figure(figsize=(15,5))
plt.scatter(avocado_training_set['Total Volume'],prices_training_set,s=5)
plt.xlabel('log(Total Volume)')
plt.ylabel('Price')
plt.show()

### 9.B.2 Constructing Linear Model

#### I. Initialize Linear Model Object

In [None]:
from sklearn.linear_model import LinearRegression

# initialize LinearRegression(fit_intercept=True) and store as linearmodel
# (the older normalize argument was removed in scikit-learn 1.2)


#### II. Train Linear Model Object
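
Scikit-Learn estimators expect the explanatory variables as a two-dimensional array with one row per observation and one column per variable, which is why a single column has to be reshaped before fitting. A minimal sketch on a made-up column (the name `column` and its values are illustrative assumptions, not part of the avocado data):

```python
import numpy as np
import pandas as pd

# a made-up column standing in for a single explanatory variable
column = pd.Series([11.2, 9.8, 12.5], name="Total Volume")

values = column.values            # 1-D NumPy array, shape (3,)
arr = values.reshape(-1, 1)       # 2-D array, shape (3, 1): 3 rows, 1 column

print(values.shape, arr.shape)
```

In `reshape(-1, 1)`, the `-1` asks NumPy to infer the number of rows from the length of the data.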

In [None]:
# store avocado_training_set['Total Volume'] as total_volume_



In [None]:
#  convert total_volume_ to numpy array using total_volume_.values
# store as total_volume_values



In [None]:
# reshape total_volume_values with reshape(-1, 1)
# print the reshaped array
# store as total_volume_arr



In [None]:
#now fit linearmodel with total_volume_arr, prices_training_set


#### III.  Plotting Predicted Values

In [None]:
# predict prices for the training set from total_volume_arr with the predict method
# store as prices_predicted_



In [None]:
#scatter plot price vs. total volume
plt.figure(figsize=(15,5))
plt.scatter(avocado_training_set['Total Volume'],prices_training_set,s=5)
plt.plot(total_volume_,prices_predicted_,c='r')

plt.xlabel('log(Total Volume)')
plt.ylabel('Price')
plt.show()

### 9.B.3 Getting attributes of the linear model

After being fit, a ```LinearRegression``` object stores the coefficients as ```coef_``` and the intercept as ```intercept_```.
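
A minimal sketch of reading these attributes, using made-up data that lie exactly on a known line so the result is easy to check (`toy_model` is a name used only in this illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# made-up data lying exactly on the line y = 3 - 0.5 x
X = np.array([[0.0], [2.0], [4.0], [6.0]])
y = 3.0 - 0.5 * X[:, 0]

toy_model = LinearRegression().fit(X, y)

print(toy_model.coef_)        # one slope per explanatory variable
print(toy_model.intercept_)   # the constant term
```

The shapes of `coef_` and `intercept_` depend on the shape of the response: with a one-dimensional `y` as above, `coef_` has one entry per column of `X`; with a two-dimensional response (such as a one-column DataFrame), both gain an extra dimension.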

In [None]:
# print linearmodel.coef_

# print linearmodel.intercept_


Our model is then
$$\text{Price} = 2.5597 -0.1021 \times \text{Log(Total Volume)}$$

The ```LinearRegression``` object can also compute the $R^2$ value of the model with its ```score``` method, which takes the explanatory variables and the response variables as arguments.
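
As a sketch of what that number means, the example below computes $R^2 = 1 - \text{SS}_{\text{res}} / \text{SS}_{\text{tot}}$ by hand on synthetic data and compares it with `score`. The data and the names `toy_model`, `residual_ss`, and `total_ss` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# synthetic data: a noisy line
X = rng.uniform(0, 10, size=(50, 1))
y = 1.0 + 0.3 * X[:, 0] + rng.normal(scale=0.2, size=50)

toy_model = LinearRegression().fit(X, y)

residual_ss = np.sum((y - toy_model.predict(X)) ** 2)   # unexplained variation
total_ss = np.sum((y - y.mean()) ** 2)                   # total variation
r2_by_hand = 1 - residual_ss / total_ss

print(r2_by_hand, toy_model.score(X, y))
```

The two printed values agree, and a value close to 1 means the line explains most of the variation in the response.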

In [None]:
# compute linearmodel.score(total_volume_arr,prices_training_set)


### 9.B.4 Training Error and Test error

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$\text{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(\text{true}_i - \text{pred}_i)^2}.$$


The RMSE is used to measure the fidelity of the model to the training set and the test set.
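
A minimal worked example of the definition, on made-up values chosen so the arithmetic can be followed by hand:

```python
import numpy as np

# made-up true and predicted values (illustration only)
true_values = np.array([1.0, 2.0, 3.0, 4.0])
predicted_values = np.array([1.5, 2.0, 2.0, 4.5])

squared_errors = (true_values - predicted_values) ** 2   # [0.25, 0.0, 1.0, 0.25]
rmse = np.sqrt(squared_errors.mean())                     # sqrt(1.5 / 4)

print(rmse)
```

Taking the square root of `sklearn.metrics.mean_squared_error` on the same two arrays gives the same number; the RMSE is in the same units as the response (here, price).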

#### I. Training Error

In [None]:
# training error
from sklearn.metrics import mean_squared_error

# compute mean_squared_error(prices_training_set, prices_predicted_) and take sqrt



#### II. Testing Error

In [None]:
# format test set
# store  avocado_test_set['Total Volume'].values.reshape(-1, 1) as total_volume_test_


In [None]:
# get predicted prices for total_volume_test_ using  linearmodel.predict
# store as prices_predicted_test_


In [None]:
#scatter plot of price vs. total volume in the test set
plt.figure(figsize=(15,5))
plt.scatter(total_volume_test_,prices_test_set,s=5)
plt.plot(total_volume_test_,prices_predicted_test_,c='r')
plt.xlabel('log(Total Volume)')
plt.ylabel('Price')
plt.show()

In [None]:
# test error
# compute mean_squared_error(prices_test_set, prices_predicted_test_) and take sqrt


### 9.C Constructing Linear Model with the categorical variable, ```type```.

```Scikit-Learn```'s ```LinearRegression``` works on numeric arrays, so it cannot use categorical (text) variables directly.


This is a pain because we have four categorical variables:
- Month
- year
- region
- type.

In this section, we are building a linear model with the variables ```Total Volume``` and ```type```.

#### 9.C.1 One-hot encoding

For ```Scikit-Learn``` to interpret categorical variables, we have to encode them into binary.

For the explanatory variable, ```type```,  we have two categories.

In [None]:
# print unique entries
# avocado_training_set['type'].unique()


Using the pandas function ```get_dummies```, we encode this as one column in which
- conventional = 0
- organic = 1.


We call this column, ```type_organic```.
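
A minimal sketch of this encoding on a made-up four-row frame (the frame `toy` is an assumption for illustration, not the avocado data):

```python
import pandas as pd

toy = pd.DataFrame({"type": ["conventional", "organic", "organic", "conventional"]})

# drop_first=True drops the first category ('conventional'),
# leaving a single indicator column named 'organic'
dummies = pd.get_dummies(data=toy["type"], drop_first=True)

# store the indicator as 0/1 integers in a new column
toy["type_organic"] = dummies["organic"].astype(int)

print(toy)
```

Recent pandas versions return the indicator as `True`/`False`; the `astype(int)` call converts it to the 0/1 coding described above. Either form is numeric enough for a linear model.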

In [None]:
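# copy avocado_training_set into avocado_training_set_cleaned
# run pd.get_dummies(data=avocado_training_set['type'], drop_first=True)
# store the resulting 'organic' column as avocado_training_set_cleaned['type_organic']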


In [None]:
# print head of avocado_training_set_cleaned


Repeating on the test set,

In [None]:
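# copy avocado_test_set into avocado_test_set_cleaned
# run pd.get_dummies(data=avocado_test_set['type'], drop_first=True)
# store the resulting 'organic' column as avocado_test_set_cleaned['type_organic']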


In [None]:
# print head of avocado_test_set_cleaned


#### 9.C.2 Model Interpretation

If we fit a linear model with the ```type_organic``` and log(Total Volume) columns, the linear model becomes:

$$\text{Price} = \beta_0 + \beta_1 \times \text{Log(Total Volume)} + \beta_2 \times \text{type_organic}.$$

Let's interpret this model. 

Given that the observation is a conventional avocado, $\text{type_organic} = 0$ and the linear model becomes:

$$\text{Price} = \beta_0 + \beta_1 \times \text{Log(Total Volume)} .$$


Given that the observation is an organic avocado, $\text{type_organic} = 1$ and the linear model becomes:
$$\text{Price} = \beta_0+ \beta_2 + \beta_1 \times \text{Log(Total Volume)} .$$
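
In other words, the two types share the same slope and differ only by a constant shift of $\beta_2$. A minimal sketch with hypothetical coefficients (the values of `beta_0`, `beta_1`, `beta_2` and the helper `price` are made up for illustration and are not fitted to the avocado data):

```python
import numpy as np

# hypothetical coefficients, chosen only for illustration
beta_0, beta_1, beta_2 = 2.0, -0.1, 0.3

def price(log_total_volume, type_organic):
    """Evaluate the linear model for a given log volume and 0/1 organic flag."""
    return beta_0 + beta_1 * log_total_volume + beta_2 * type_organic

log_volume = np.array([8.0, 10.0, 12.0])
conventional = price(log_volume, 0)
organic = price(log_volume, 1)

print(organic - conventional)   # the gap equals beta_2 at every volume
```

So $\beta_2$ can be read as the estimated price difference between an organic and a conventional avocado at the same volume.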

#### 9.C.3 Exercise: Model of Price vs organic type and log(Total Volume)

Following the steps above, create a linear model as ```linearmodel``` with explanatory variables, ```"Total Volume","type_organic"```. 
- Get the coefficients of the model 
- Compute $R^2$ value for the training set and the test set
- Compute the training error
- Compute the test error.

In [None]:
explanatory_variables  = avocado_training_set_cleaned[['Total Volume',"type_organic"]]
test_explanatory_variables_= avocado_test_set_cleaned[['Total Volume',"type_organic"]]

from sklearn.linear_model import LinearRegression

# enter solution here



The model is then

$$\text{Price} = 1.7324 -0.04391 \times \text{Log(Total Volume)} + 0.3356 \times \text{type_organic}.$$

The code below plots the linear model for the two groups of avocados on the training set.

In [None]:
predict_training_prices_ = linearmodel.predict(explanatory_variables)
predict_test_prices_ = linearmodel.predict(test_explanatory_variables_)

type_conventional = [typeo == 0 for typeo in avocado_training_set_cleaned['type_organic']]
type_organic = avocado_training_set_cleaned['type_organic'].astype(bool)

# get data for conventional avocado
training_total_volume_conventional_ = avocado_training_set_cleaned.loc[type_conventional,"Total Volume"]
training_price_conventional_ = prices_training_set[type_conventional]
predict_training_prices_conventional_ = predict_training_prices_[type_conventional]

# initialize for plot
plt.figure(figsize=(15,5))

# scatter plot of the volume for conventional avocado
plt.scatter(training_total_volume_conventional_, training_price_conventional_ ,
            s=5,alpha=0.3,color='r',label='conventional avocado')
# plot of linear model for conventional avocado
plt.plot(training_total_volume_conventional_, predict_training_prices_conventional_,c='r',
         label='linear model, conventional')

# get data for organic avocado
training_total_volume_organic_ = avocado_training_set_cleaned.loc[type_organic,"Total Volume"]
training_price_organic_ = prices_training_set[type_organic]
predict_training_prices_organic_ = predict_training_prices_[type_organic]

#scatter plot of price vs. total volume in the training set for organic avocado
plt.scatter(training_total_volume_organic_, training_price_organic_,
            s=5,alpha=0.3,color='b',label='organic avocado')
# plot of linear model for organic avocado
plt.plot(training_total_volume_organic_, predict_training_prices_organic_,c='b',
         label='linear model, organic')
# labels are attached to each artist above, so legend entries always match what was drawn
plt.legend()
plt.xlabel('log(Total Volume)')
plt.ylabel('Price')
plt.show()

The code below plots the linear model for the two groups of avocados on the test set.

In [None]:
predict_test_prices_ = linearmodel.predict(test_explanatory_variables_)

type_conventional = [typeo == 0 for typeo in avocado_test_set_cleaned['type_organic']]
type_organic = avocado_test_set_cleaned['type_organic'].astype(bool)

# get data for conventional avocado
test_total_volume_conventional_ = avocado_test_set_cleaned.loc[type_conventional,"Total Volume"]
test_price_conventional_ = prices_test_set[type_conventional]
predict_test_prices_conventional_ = predict_test_prices_[type_conventional]

# initialize for plot
plt.figure(figsize=(15,5))

# scatter plot of the volume for conventional avocado
plt.scatter(test_total_volume_conventional_, test_price_conventional_ ,
            s=5,alpha=0.3,color='r',label='conventional avocado')
# plot of linear model for conventional avocado
plt.plot(test_total_volume_conventional_, predict_test_prices_conventional_,c='r',
         label='linear model, conventional')

# get data for organic avocado
test_total_volume_organic_ = avocado_test_set_cleaned.loc[type_organic,"Total Volume"]
test_price_organic_ = prices_test_set[type_organic]
predict_test_prices_organic_ = predict_test_prices_[type_organic]

#scatter plot of price vs. total volume in the test set for organic avocado
plt.scatter(test_total_volume_organic_, test_price_organic_,
            s=5,alpha=0.3,color='b',label='organic avocado')
# plot of linear model for organic avocado
plt.plot(test_total_volume_organic_, predict_test_prices_organic_,c='b',
         label='linear model, organic')
# labels are attached to each artist above, so legend entries always match what was drawn
plt.legend()
plt.xlabel('log(Total Volume)')
plt.ylabel('Price')
plt.show()

### 9.D Constructing Linear Model with the categorical variable, ```year```.

In this section, we are building a linear model with the variables ```Total Volume``` and ```year```.

#### 9.D.1 One-hot encoding

Again, for ```Scikit-Learn``` to interpret categorical variables, we have to encode them into binary.

For the explanatory variable, ```year```,  we have four categories.

In [None]:
# print avocado_training_set['year'].unique()



How do we one-hot encode if we have four categories? Answer: similar to above!

We build a model with additional variables, ```year_2016```, ```year_2017``` and ```year_2018```. The model is then,

$$\text{Price} = \beta_0 + \beta_1 \times \text{Log(Total Volume)} + \beta_2\times \text{year_2016} + \beta_3\times\text{year_2017} + \beta_4\times\text{year_2018},$$
where 
  $$\text{year_2016} = \begin{cases} 0 & \text{if not 2016}\\1 & \text{if 2016}  \end{cases}$$
  $$\text{year_2017} = \begin{cases} 0 & \text{if not 2017}\\1 & \text{if 2017}  \end{cases}$$
  $$\text{year_2018} = \begin{cases} 0 & \text{if not 2018}\\1 & \text{if 2018}  \end{cases}.$$

You must be wondering. What about 2015?! 

2015 is already accounted for. This is our "base" model. 

If $\text{year}$ = 2015, then the model is 
$$\text{Price} = \beta_0 + \beta_1 \times \text{Log(Total Volume)}.$$

If $\text{year}$ = 2016, then the model is 
$$\text{Price} = \beta_0 + \beta_2 + \beta_1 \times \text{Log(Total Volume)}.$$

If $\text{year}$ = 2017, then the model is 
$$\text{Price} = \beta_0 + \beta_3 + \beta_1 \times \text{Log(Total Volume)}.$$

If $\text{year}$ = 2018, then the model is 
$$\text{Price} = \beta_0 + \beta_4 + \beta_1 \times \text{Log(Total Volume)}.$$

If we were to add a column for 2015 as well, the four year columns would always sum to 1, which duplicates the intercept term. The data matrix would then be singular (rank deficient), and the coefficients of the linear model would no longer be uniquely determined.
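
A minimal sketch of this issue on a made-up list of years: with an intercept column, keeping all four year indicators gives a matrix with more columns than its rank, while dropping the first indicator restores full column rank. The variable names here are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# made-up years, stored as strings like in the avocado data
years = pd.Series(["2015", "2016", "2017", "2018", "2016", "2015"])

all_dummies = pd.get_dummies(years).astype(float)                       # 4 columns
dropped_dummies = pd.get_dummies(years, drop_first=True).astype(float)  # 3 columns

intercept = np.ones((len(years), 1))
X_all = np.hstack([intercept, all_dummies.values])          # 5 columns
X_dropped = np.hstack([intercept, dropped_dummies.values])  # 4 columns

print(np.linalg.matrix_rank(X_all), X_all.shape[1])          # rank < number of columns
print(np.linalg.matrix_rank(X_dropped), X_dropped.shape[1])  # rank == number of columns
```

This is why `drop_first=True` is used when splitting the year column into indicators.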

In [None]:
# split categories using  pd.get_dummies(data=avocado_training_set['year'], drop_first=True)
# store in avocado_year_split_



# print head



In [None]:
# merge data frames pd.concat([avocado_training_set,avocado_year_split_],axis=1, sort=False)
# store in avocado_training_set_cleaned



# print head of avocado_training_set_cleaned



In [None]:
#current columns, avocado_training_set_cleaned.columns


In [None]:
# rename columns
avocado_training_set_cleaned.rename(columns={'2016':'year_2016','2017':'year_2017','2018':'year_2018'},inplace=True)

In [None]:
#new columns, avocado_training_set_cleaned.columns



Doing the same with the test data,

In [None]:
# split categories
avocado_year_split_ = pd.get_dummies(data=avocado_test_set['year'], drop_first=True)

# merge data frames
avocado_test_set_cleaned = pd.concat([avocado_test_set,avocado_year_split_],
                                        axis=1, sort=False)

# rename columns 
avocado_test_set_cleaned.rename(columns={'2016':'year_2016','2017':'year_2017','2018':'year_2018'},inplace=True)

In [None]:
avocado_test_set_cleaned.columns

#### 9.D.2 Exercise: Model of Price vs year and log(Total Volume)

Following the steps above, create a linear model named ```linearmodel``` with explanatory variables ```"Total Volume","year_2016","year_2017","year_2018"```.

- Get the coefficients of the model 
- Compute $R^2$ value for the training set

Run the code below to plot the data and curves.

In [None]:
explanatory_variables  = avocado_training_set_cleaned[["Total Volume","year_2016","year_2017","year_2018"]]

# enter solution here



The model is then, 

$$\text{Price} = 2.5260 -0.1040 \times \text{Log(Total Volume)}-0.0106 \times \text{year_2016} + 0.1767\times\text{year_2017}+ 0.04261\times\text{year_2018}.$$


The code below creates a scatter plot of the training set and plots the linear model applied to the training set.

In [None]:
# predicted variables
predict_training_prices_ = linearmodel.predict(explanatory_variables)

# create boolean masks for observations from 2015, 2016, 2017, 2018
year_2016 = avocado_training_set_cleaned['year_2016'].astype(bool)
year_2017 = avocado_training_set_cleaned['year_2017'].astype(bool)
year_2018 = avocado_training_set_cleaned['year_2018'].astype(bool)
year_2015 = (year_2016 == False)&(year_2017 == False)&(year_2018 == False)

In [None]:
# initialize for plot
plt.figure(figsize=(15,5))
#initialize plot parameters
list_years = [year_2015,year_2016,year_2017,year_2018]
years = ["2015","2016","2017","2018"]
colors = ['r','b','g','k']
for i in range(4):
    indices = list_years[i]
    # get data
    training_total_volume_year_ = avocado_training_set_cleaned.loc[indices,"Total Volume"]
    training_price_year_ = prices_training_set[indices]
    predicted_training_prices_year = predict_training_prices_[indices]

    # scatter plot of the volume
    plt.scatter(training_total_volume_year_, training_price_year_ , s=5,
                alpha=0.3,color=colors[i],label='data for ' + years[i])
    # plot of linear model
    plt.plot(training_total_volume_year_, predicted_training_prices_year,c=colors[i],
             label='predicted model for ' + years[i])
plt.xlabel('log(Total Volume)')
plt.ylabel('Price')
# labels are attached to each artist above, so legend entries always match what was drawn
plt.legend()
plt.show()

The code below creates a scatter plot of the test set and plots the linear model applied to the test set.

In [None]:
# predicted variables
test_explanatory_variables  = avocado_test_set_cleaned[['Total Volume',"year_2016","year_2017","year_2018"]]
predict_test_prices_ = linearmodel.predict(test_explanatory_variables)

# create boolean masks for observations from 2015, 2016, 2017, 2018
year_2016 = avocado_test_set_cleaned['year_2016'].astype(bool)
year_2017 = avocado_test_set_cleaned['year_2017'].astype(bool)
year_2018 = avocado_test_set_cleaned['year_2018'].astype(bool)
year_2015 = (year_2016 == False)&(year_2017 == False)&(year_2018 == False)

In [None]:
# initialize for plot
plt.figure(figsize=(15,5))
#initialize plot parameters
list_years = [year_2015,year_2016,year_2017,year_2018]
years = ["2015","2016","2017","2018"]
colors = ['r','b','g','k']
for i in range(4):
    indices = list_years[i]
    # get data
    test_total_volume_year_ = avocado_test_set_cleaned.loc[indices,"Total Volume"]
    test_price_year_ = prices_test_set[indices]
    predicted_test_prices_year = predict_test_prices_[indices]

    # scatter plot of the volume
    plt.scatter(test_total_volume_year_, test_price_year_ , s=5,
                alpha=0.3,color=colors[i],label='data for ' + years[i])
    # plot of linear model
    plt.plot(test_total_volume_year_, predicted_test_prices_year,c=colors[i],
             label='predicted model for ' + years[i])
plt.xlabel('log(Total Volume)')
plt.ylabel('Price')
# labels are attached to each artist above, so legend entries always match what was drawn
plt.legend()
plt.show()