### Linear Regression Model

In [36]:
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt 
import matplotlib.mlab as mlab
import seaborn as sns
%matplotlib inline 
import sklearn as sk

### Example: Recommend a marketing plan, based on advertising data, that leads to high product sales.


### Predictors, Features, or Independent Variables
- TV : advertising budget (in thousands of dollars) spent on TV ads for a single product in a market.
- Radio : advertising budget (in thousands of dollars) spent on Radio ads.
- Newspaper : advertising budget (in thousands of dollars) spent on Newspaper ads.

### Response or Dependent Variable
- Sales : sales of a single product in a given market (in thousands of units)

In [37]:
# read data from local disk
sales_data = pd.read_csv('Advertising.csv')
sales_data.head()

Unnamed: 0,TV,Radio,Newspaper,Area,Prod_type,Sales
0,230.1,37.8,69.2,suburban,Basic,38.0
1,44.5,39.3,45.1,rural,Higher,10.4
2,17.2,45.9,69.3,rural,Medium,9.3
3,151.5,41.3,58.5,suburban,Medium,18.5
4,180.8,10.8,58.4,suburban,Higher,12.9


In [38]:
## Data Cleaning
sales_data.isnull().sum()

TV           0
Radio        2
Newspaper    2
Area         0
Prod_type    0
Sales        0
dtype: int64

In [40]:
# KNN imputation of the missing Radio / Newspaper values.
# Assumption: scikit-learn's KNNImputer is used here in place of fancyimpute's KNN,
# because fancyimpute was not installed when this cell was run (ModuleNotFoundError).
from sklearn.impute import KNNImputer

num_cols = ['TV', 'Radio', 'Newspaper']
imputer = KNNImputer(n_neighbors=3)   # average over the 3 nearest markets
sales_data[num_cols] = imputer.fit_transform(sales_data[num_cols])

sales_data.isnull().sum()   # confirm that no missing values remain

In [None]:
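# Fill gaps with the Area-group median (Area is an assumed grouping key; Radio is continuous and has gaps itself)
sales_data[['Radio', 'Newspaper']] = sales_data[['Radio', 'Newspaper']].fillna(sales_data.groupby('Area')[['Radio', 'Newspaper']].transform('median'))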

In [None]:
from fancyimpute import KNN

In [None]:
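def knn_impute(target, attributes, k_neighbors, aggregation_method="mean", numeric_distance="euclidean",
               categorical_distance="jaccard", missing_neighbors_threshold=0.5):
    # The function body is not included in this notebook; only the signature is recorded.
    raise NotImplementedError("knn_impute is only a signature here")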

### Scatter plots to visualize the relationship between different ads and sales.

In [None]:
fig, axs = plt.subplots(1, 3, sharey=True)
sales_data.plot(kind='scatter', x='TV', y='Sales', ax=axs[0], figsize=(16, 8))
sales_data.plot(kind='scatter', x='Radio', y='Sales', ax=axs[1])
sales_data.plot(kind='scatter', x='Newspaper', y='Sales', ax=axs[2])

In [None]:
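# Pearson correlation is defined only for numeric columns, so select them explicitly
sales_data[['TV', 'Radio', 'Newspaper', 'Sales']].corr(method='pearson')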
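
To make the Pearson coefficient less of a black box, here is a minimal sketch that computes it by hand with NumPy. The helper name `pearson_r` is hypothetical, and the inputs are only the five TV and Sales values shown in the data preview, so the resulting number says nothing about the full dataset.

```python
import numpy as np

def pearson_r(a, b):
    """Hypothetical helper: Pearson correlation of two equal-length sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()   # centre each variable on its mean
    b_c = b - b.mean()
    # covariance divided by the product of the standard deviations
    return float((a_c @ b_c) / np.sqrt((a_c @ a_c) * (b_c @ b_c)))

# First five rows of the preview above (TV budget, Sales)
tv = [230.1, 44.5, 17.2, 151.5, 180.8]
sales = [38.0, 10.4, 9.3, 18.5, 12.9]

r = pearson_r(tv, sales)
print(round(r, 3))
```

The result always lies between -1 and 1 and should agree with `np.corrcoef(tv, sales)[0, 1]`.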

In [None]:
sales_data.shape  # 200 markets: ad budgets and the related sales

#### Simple Linear Regression Model

In [None]:
from sklearn.linear_model import LinearRegression
lmreg = LinearRegression()
x = sales_data[['TV']]
y = sales_data['Sales']

lmreg.fit(x, y)

print('Intercept:', lmreg.intercept_)
print('Regression coefficients:', lmreg.coef_)


#### Model: Sales = 7.34 + 0.055 * TV

Interpreting the **intercept** ($\beta_0$):

- It is the value of $y$ when $x = 0$.
- Thus, it is the estimated sales (about 7.34K units) when the TV advertising budget is zero.

Interpreting the **"TV" coefficient** ($\beta_1$):

- It is the change in $y$ divided by the change in $x$, or the "slope".
- Thus, each additional 1K dollars of TV advertising budget is **associated with** an increase in sales of about 55 units (0.055K).
- This is not a statement of causation.
- $\beta_1$ is **positive** when an increase in TV advertising budget is associated with an **increase** in product sales, as it is here.
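
As a quick worked check of this interpretation, the fitted line can be evaluated by hand. This is a minimal sketch that hard-codes the rounded coefficients from the model heading (7.34 and 0.055); the helper name `predict_sales` and the 50K budget are illustrative assumptions, not part of the notebook.

```python
# Rounded coefficients copied from the model: Sales = 7.34 + 0.055 * TV
INTERCEPT = 7.34   # expected sales (thousands of units) at a TV budget of zero
SLOPE_TV = 0.055   # change in sales (thousands of units) per extra 1K dollars of TV budget

def predict_sales(tv_budget_k):
    """Hypothetical helper: predicted sales (thousands of units) for a TV budget in thousands of dollars."""
    return INTERCEPT + SLOPE_TV * tv_budget_k

baseline = predict_sales(0)                          # equals the intercept
at_50k = predict_sales(50)                           # 7.34 + 0.055 * 50
one_more_k = predict_sales(51) - predict_sales(50)   # equals the slope

print(baseline, round(at_50k, 2), round(one_more_k, 3))
```

The difference between any two predictions that are 1K dollars apart is always the slope, which is what "associated with an increase of 0.055K units" means for a straight-line model.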

In [None]:
# Scatter plot TV vs Sales
x_new = pd.DataFrame({'TV': [sales_data.TV.min(), sales_data.TV.max()]})
y_pred = lmreg.predict(x_new)

sales_data.plot(kind='scatter', x='TV', y='Sales')
plt.plot(x_new, y_pred, c='red', linewidth=2);        # Model :  Sales = 7.34 +  0.055 * TV

#### Multiple Linear Regression Model using sklearn Regression Model

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [None]:
lmreg = LinearRegression()

In [None]:
x = sales_data.drop(['Sales'], axis=1)
y = sales_data['Sales']

In [None]:
x.head()

In [None]:
# Categorical encoding - ordinal encoding (Basic < Medium < Higher)
x['Prod_type'] = pd.Categorical(x['Prod_type'], categories=['Basic', 'Medium', 'Higher'], ordered=True).codes
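
A minimal sketch of how ordinal codes are assigned, using `pd.Categorical` on a toy column. The toy values are made up; only the three tier names come from the data.

```python
import pandas as pd

# Toy column with the three product tiers seen in the data preview
prod_type = pd.Series(['Basic', 'Higher', 'Medium', 'Medium', 'Higher'])

# Declare the intended order explicitly; each code is the label's position in `tiers`
tiers = ['Basic', 'Medium', 'Higher']
codes = pd.Categorical(prod_type, categories=tiers, ordered=True).codes

print(list(codes))   # Basic maps to 0, Medium to 1, Higher to 2

# A label that is not listed in `tiers` gets the code -1
unknown_code = pd.Categorical(['Premium'], categories=tiers, ordered=True).codes[0]
print(unknown_code)
```

Because the model treats these codes as numbers, a single coefficient then describes the effect of moving up one tier, which presumes the steps Basic to Medium and Medium to Higher have the same effect.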

In [None]:
# Categorical variable 'Area' - one-hot encoding
x = pd.get_dummies(x, columns=['Area'])
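
For illustration, a small sketch of what `pd.get_dummies` produces for a column like `Area`. The three-row frame is a made-up stand-in, and `drop_first=True` is shown only as an option; the notebook itself keeps all indicator columns.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column
toy = pd.DataFrame({'TV': [230.1, 44.5, 17.2],
                    'Area': ['suburban', 'rural', 'urban']})

# One indicator column per Area level; the original 'Area' column is dropped
encoded = pd.get_dummies(toy, columns=['Area'])
print(encoded.columns.tolist())

# drop_first=True keeps k-1 indicators, so the dropped level becomes the baseline
encoded_ref = pd.get_dummies(toy, columns=['Area'], drop_first=True)
print(encoded_ref.columns.tolist())
```

With all indicators kept alongside an intercept, the indicator columns always sum to 1, so their individual coefficients are only identified relative to one another. Dropping one level makes each remaining coefficient a difference from that baseline.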

In [None]:
x.head()

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.3)

In [None]:
lmreg.fit(x_train, y_train) # Fit the model

In [None]:
print('Intercept:', lmreg.intercept_)
print('Regression coefficients:', lmreg.coef_)

#### Regression Model: Sales = 7.36 + 0.05*TV + 0.10*Radio + 0.03*Newspaper - 2.50*Prod_type - 0.24*Area_rural + 0.24*Area_suburban - 0.21*Area_urban


#### Interpreting the coefficients:

- Holding all other features fixed, an increase of 1K dollars in the **TV advertising budget** is associated with a **sales increase of 0.05K units**.
- Holding all other features fixed, an increase of 1K dollars in the **Radio advertising budget** is associated with a **sales increase of 0.10K units**.
- Holding all other features fixed, each one-step move up the product tiers (for example, from **Basic** to **Medium**) is associated with a **sales decrease of 2.50K units**, because `Prod_type` is ordinal-encoded as 0, 1, 2.
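
Reading values off a bare `coef_` array is error-prone because the array carries no labels. One way to pair each coefficient with its feature is sketched below on synthetic data. The data, the true coefficients (0.05 and 0.10), and the name `coef_table` are assumptions made for the example, not results from the advertising dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for two advertising features (values are made up)
X = pd.DataFrame({
    'TV': rng.uniform(0, 300, 100),
    'Radio': rng.uniform(0, 50, 100),
})
# Noise-free response, so the fit should recover 7.0, 0.05 and 0.10
y = 7.0 + 0.05 * X['TV'] + 0.10 * X['Radio']

model = LinearRegression().fit(X, y)

# Label each coefficient with the column it belongs to
coef_table = pd.Series(model.coef_, index=X.columns)
print(coef_table.round(3))
print('intercept:', round(model.intercept_, 3))
```

The same pattern, `pd.Series(lmreg.coef_, index=x_train.columns)`, would label the coefficients of the model fitted in this notebook.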

In [None]:
y_test_pred = lmreg.predict(x_test)

In [None]:
comparision_df = pd.DataFrame()
comparision_df['Actual_Sales'] = y_test
comparision_df['Predicted_Sales'] = y_test_pred
comparision_df.head()

In [None]:
from sklearn import metrics

In [None]:
# Model predictive performance on the training data - RMSE using the metrics module from sklearn
y_train_pred = lmreg.predict(x_train)
print('Train Predictive Error RMSE: ' , np.sqrt(metrics.mean_squared_error(y_train, y_train_pred)))

In [None]:
# Model predictive performance on the test data - RMSE using the metrics module from sklearn
y_test_pred = lmreg.predict(x_test)
print('Test Predictive Error RMSE: ' , np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))
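
For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below computes it directly with NumPy. The helper name `rmse` and the three data points are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up actual and predicted sales (thousands of units)
actual = [10.0, 12.0, 14.0]
predicted = [11.0, 12.0, 12.0]

error = rmse(actual, predicted)   # sqrt((1 + 0 + 4) / 3)
print(round(error, 3))
```

RMSE is expressed in the units of the response, here thousands of units sold. A test RMSE far above the training RMSE would suggest that the model does not generalize well.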

In [None]:
x_train.head()

In [None]:
# intercept       :  6.885522963179389
# Reg Coefficients :  [ 0.05081039  0.12335519  0.01805394 -2.51284757  0.14701552  0.49787306
#                      -0.64488858]

## Comparing linear regression with other models

Advantages of linear regression:

- Simple to explain
- Highly interpretable
- Model training and prediction are fast
- No tuning is required (excluding regularization)
- Features don't need scaling
- Can perform well with a small number of observations
- Well-understood

Disadvantages of linear regression:

- Presumes a linear relationship between the features and the response
- Performance is (generally) not competitive with the best supervised learning methods due to high bias
- Can't automatically learn feature interactions
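
The last point can be illustrated with a small sketch: a linear model only captures an interaction if the product term is supplied as an extra feature. The data below is synthetic, with a made-up TV x Radio interaction, and `fit_rmse` is a hypothetical helper built on `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(1)
tv = rng.uniform(0, 300, 200)
radio = rng.uniform(0, 50, 200)

# Made-up response that contains a genuine TV x Radio interaction
sales = 5.0 + 0.02 * tv + 0.03 * radio + 0.001 * tv * radio

def fit_rmse(features):
    """Least-squares fit with an intercept; returns the in-sample RMSE."""
    A = np.column_stack([np.ones(len(sales))] + features)
    coef, *_ = np.linalg.lstsq(A, sales, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - sales) ** 2)))

rmse_additive = fit_rmse([tv, radio])                  # main effects only
rmse_interaction = fit_rmse([tv, radio, tv * radio])   # adds the product term by hand

print(rmse_additive, rmse_interaction)
```

The additive model leaves a clear residual error, while the model with the hand-built product column fits this synthetic response almost exactly.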

### Regression Modeling - Using a Statistical Package (statsmodels)

In [None]:
# Regression modeling with the statsmodels formula API
import statsmodels.formula.api as smf

In [None]:
sales_data_train, sales_data_test = train_test_split(sales_data, test_size=0.3, random_state=1)

In [None]:
sales_data_train.head()

In [None]:
lm_multreg = smf.ols(formula='Sales ~ TV + Radio + Newspaper + Area + Prod_type', data=sales_data_train).fit()

# Print the coefficients
lm_multreg.params

In [None]:
# model summary
print(lm_multreg.summary())

In [None]:
sales_test_pred = lm_multreg.predict(sales_data_test)

In [None]:
comparision_df = pd.DataFrame()
comparision_df['Actual_Sales'] = sales_data_test.Sales
comparision_df['Predicted_Sales'] = sales_test_pred
comparision_df.head()

In [None]:
from sklearn import metrics

In [None]:
# Model predictive performance on the training data - RMSE using the metrics module from sklearn
sales_train_pred = lm_multreg.predict(sales_data_train)
print('Train Predictive Error RMSE: ' , np.sqrt(metrics.mean_squared_error(sales_data_train.Sales, sales_train_pred)))

In [None]:
# Model predictive performance on the test data - RMSE using the metrics module from sklearn
print('Test Predictive Error RMSE: ' , np.sqrt(metrics.mean_squared_error(sales_data_test.Sales, sales_test_pred)))