# Working notebook 4

# **Goals:**

* Discover the key attributes that drive, and correlate strongly with, home value.

* Use those attributes to develop a machine learning model to predict home value.

    * Carefully select features that will prevent data leakage. 


## Imports

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import wrangle as w
import model as m
import explore as e

# Acquire:

In [None]:
# acquire zillow data
df = w.get_zillow_data()

* Data acquired from the Codeup database on 11/17/22

* It contained 52,441 rows and 10 columns before cleaning

* Each row represents a single-family home:
    * properties from 2017 with current transactions
    * located in the Californian counties of Los Angeles, Orange, or Ventura

* Each column represents a feature of the single-family residence.

###                                                        <h1><center>Data Dictionary</center></h1>     


| Feature | Description |
| :---------------: | :---------------------------------- |
| home_value (target) | The total tax-assessed value of the parcel |
| squarefeet | Calculated total finished living area of the home |
| bathrooms | Number of bathrooms in the home, including fractional bathrooms |
| bedrooms | Number of bedrooms in the home |
| yearbuilt | The year the principal residence was built |
| fireplace | Fireplace on property (1 if any) |
| deck | Deck on property (1 if any) |
| pool | Pool on property (1 if any) |
| garage | Garage on property (1 if any) |
| county | FIPS code for Californian counties: 6111 Ventura County, 6059 Orange County, 6037 Los Angeles County |
| home_age | The age of the home in 2017 |
| optional_features | 1 if the home has any of the following: fireplace, deck, pool, garage |
| additional features | Encoded values for categorical data |
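
The two derived columns in the dictionary can be sketched on toy rows. This is only an illustration of the definitions above, not the code inside `w.zillow_prep`; the `homes` frame and its values are made up.

```python
import pandas as pd

# Toy rows standing in for the prepared Zillow data (values are made up).
homes = pd.DataFrame({
    'yearbuilt': [1950, 1987, 2010],
    'fireplace': [0, 1, 0],
    'deck':      [0, 0, 0],
    'pool':      [0, 1, 0],
    'garage':    [0, 0, 1],
})

# home_age: age of the home in 2017, the year of the transactions.
homes['home_age'] = 2017 - homes['yearbuilt']

# optional_features: 1 if the home has any of the four optional features.
optional_cols = ['fireplace', 'deck', 'pool', 'garage']
homes['optional_features'] = (homes[optional_cols].sum(axis=1) > 0).astype(int)
```

Summing the four 0/1 flags and checking for a positive total keeps the rule in one line and makes it easy to add another flag later.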

# Prepare:

In [None]:
# prepare data 
df = w.zillow_prep(df)

In [None]:
# split data: train, validate and test
train, validate, test = w.split_data(df)

Prepare actions:
* After the following steps I retained 95.9% of the original data:
    * Outliers were removed (to better fit the definition of a single-family property):
    
        * Beds above 6 
        * Baths above 6 
        * Home values above 1_750_000
        * Rows with both 0 beds and 0 baths 
        
    * For the following features it was assumed null values meant the structure did not exist on property:
        * fireplace (45198)
        * deck (52052)
        * pool (41345)
        * garage (34425)
            
    * The following null values were dropped:
        * home_value (1)
        * squarefeet (82)
        * yearbuilt (116)

* Encoded categorical variables
* Split data into train, validate and test 
    * Approximately: train 56%, validate 24%, test 20%
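
The outlier rules and the 56/24/20 split can be sketched as follows. This is a minimal sketch on random toy data; the real logic lives in `w.zillow_prep` and `w.split_data`, whose internals are not shown here, so the two-step `train_test_split` and the `random_state` are my assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random toy data standing in for the Zillow rows (values are made up).
rng = np.random.default_rng(123)
n = 1000
toy = pd.DataFrame({
    'bedrooms': rng.integers(0, 9, n),
    'bathrooms': rng.integers(0, 9, n),
    'home_value': rng.integers(50_000, 3_000_000, n),
})

# Outlier rules from the list above.
keep = (
    (toy.bedrooms <= 6)
    & (toy.bathrooms <= 6)
    & (toy.home_value <= 1_750_000)
    & ~((toy.bedrooms == 0) & (toy.bathrooms == 0))
)
clean = toy[keep]

# 20% test first, then 30% of the remaining 80% for validate:
# 0.8 * 0.3 = 24% validate, which leaves 56% for train.
train_validate, test_toy = train_test_split(clean, test_size=0.2, random_state=123)
train_toy, validate_toy = train_test_split(train_validate, test_size=0.3, random_state=123)
```

Splitting in two steps is what produces the uneven 56/24/20 proportions quoted above.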
  


# Looking at the data

In [None]:
train.head(10)

# Data Summary

In [None]:
train.describe().T

# Explore:

## How do optional home features influence home value?

  * optional features refers to fireplace, garage, pool, and deck

In [None]:
# obtain lollipop plot
e.lolipop_plot(train)

#### Homes with a deck have a higher average home value than homes with any other optional feature. Homes with no optional features have the lowest average home value.

# Does more house equal more home value?


In [None]:
# obtain bed, bath and squarefeet graph
e.home_scatterplot(train)

#### It is clear that more bedrooms, more bathrooms and more square feet are associated with a higher home value.

# Does county make a difference in home value?

FIPS codes: 6111 Ventura County, 6059 Orange County, 6037 Los Angeles County

In [None]:
# obtain counties and home value box_plot
e.get_boxplot_county_vs_homevalue(train)

* **It seems that the counties have different mean home values.**

**I will now conduct an ANOVA test for significant differences between the mean home values of the three counties**

* The confidence level is 95%
* Alpha is set to 0.05
* The p-value will be compared to alpha


$H_0$: The mean home value is the same in all three counties.

$H_a$: The mean home value differs in at least one of the three counties.

In [None]:
e.anova_county_test(train)

The p-value is less than alpha, so I reject the null hypothesis. There is evidence that the mean home value differs in at least one of the three counties. Based on this finding, I believe county is a driver of home value. Adding an encoded version of this feature to the model will likely improve its performance.
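
A one-way ANOVA like the one described above can be sketched with `scipy.stats.f_oneway`. I am assuming `e.anova_county_test` does something similar; the three arrays below are made-up samples, not the Zillow data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)

# Made-up home values for three groups; the means are chosen to differ.
los_angeles = rng.normal(400_000, 100_000, 500)
orange = rng.normal(500_000, 100_000, 500)
ventura = rng.normal(450_000, 100_000, 500)

alpha = 0.05
f_stat, p_value = stats.f_oneway(los_angeles, orange, ventura)

# Compare the p-value to alpha to decide whether to reject H0.
if p_value < alpha:
    print('Reject H0: at least one county mean differs.')
else:
    print('Fail to reject H0.')
```

Rejecting $H_0$ only says that at least one mean differs; it does not say which county is different.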

# Is home age a driver of home value?

In [None]:
  
# List1
Name = ['tom', 'krish', 'nick', 'juli']
  
# List2
Age = [25, 30, 26, 22]
  
# get the list of tuples from two lists.
# and merge them by using zip().
list_of_tuples = list(zip(Name, Age))
  
# Assign data to tuples.
list_of_tuples
  
  
# Converting lists of tuples into
# pandas Dataframe (named example_df so it
# does not overwrite the Zillow df above).
example_df = pd.DataFrame(list_of_tuples,
                          columns=['Name', 'Age'])
  
# Print data.
example_df

In [None]:
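# average home value for each home age
# (the earlier loop assigned the result to `age`, so age_avghv stayed empty,
# and DataFrame.append was removed in pandas 2.0)
age_avghv = (train.groupby('home_age', as_index=False)['home_value'].mean()
                  .rename(columns={'home_age': 'age', 'home_value': 'avg_$'}))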

    

In [None]:
# scatterplot has no x_bins option, and hue needs data= to resolve the column name
sns.scatterplot(data=train, x='home_age', y='home_value', hue='county')

In [None]:
sns.barplot(data=train, x='home_age', y='home_value', palette='PiYG')

In [None]:
train_viz = train.sample(frac=0.04, replace=True, random_state=123)


In [None]:
sns.kdeplot(y=train.home_value,hue='county',data=train);

In [None]:
# home age on the x-axis, mean home value on the y-axis
sns.lineplot(data=train, x='home_age', y='home_value');

In [None]:
train.county.value_counts()

In [None]:
sns.set_style('whitegrid')
Los_Angeles= train[train.county=='Los Angeles']
Orange = train[train.county=='Orange']
Ventura = train[train.county=='Ventura']


plt.figure(figsize=(8,6))

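# fill= and thresh= replace the deprecated shade= and shade_lowest=
sns.kdeplot(x=Los_Angeles.home_age, y=Los_Angeles.home_value,
            cmap="Blues", fill=True, thresh=0.05)
sns.kdeplot(x=Orange.home_age, y=Orange.home_value,
            cmap="Reds", fill=True, thresh=0.05)
sns.kdeplot(x=Ventura.home_age, y=Ventura.home_value,
            cmap="Greens", fill=True, thresh=0.05)

plt.xlabel('Home age', fontsize=14)
plt.ylabel(r'Home value (\$)', fontsize=14)

# label positions are in axes-fraction units so they do not depend on the data scale
plt.annotate("Los Angeles", (0.70, 0.92), xycoords='axes fraction', color='blue', fontsize=16, fontweight='bold')
plt.annotate("Orange", (0.70, 0.84), xycoords='axes fraction', color='red', fontsize=16, fontweight='bold')
plt.annotate("Ventura", (0.70, 0.76), xycoords='axes fraction', color='green', fontsize=16, fontweight='bold');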

In [None]:
p = sns.lmplot(x='home_age', y='home_value', 
               data=train_viz[train_viz.county.isin(['Los Angeles', 'Orange', 'Ventura'])], 
               hue='county', 
               order=2,
            )
plt.xlabel('home_age', fontsize=18)
plt.ylabel('home_value', fontsize=18)
sns.set_style('white')
p._legend.remove()
plt.legend(fontsize=16)
plt.xticks([])
plt.yticks([])
plt.tight_layout();

## What does the average home look like?

In [None]:
# median home value, in print order: no features, fireplace, garage, any feature, pool, deck
print(train[train.optional_features==0].home_value.median(),
train[train.fireplace==1].home_value.median(),
train[train.garage==1].home_value.median(), 
train[train.optional_features==1].home_value.median(), 
train[train.pool==1].home_value.median(),
train[train.deck==1].home_value.median())

# What does the most common build look like?

In [None]:
import math
print (math.floor(train.bedrooms.median()),
math.floor(train.bathrooms.median()),
math.floor(train.squarefeet.median()),
math.floor(train.home_age.median()))

In [None]:
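import math
# mode() returns a Series (there can be ties), so take the first value
print (math.floor(train.bedrooms.mode().iloc[0]),
math.floor(train.bathrooms.mode().iloc[0]),
math.floor(train.squarefeet.mode().iloc[0]),
math.floor(train.home_age.mode().iloc[0]))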

In [None]:
import matplotlib.pyplot as plt
import numpy as np

def add_mean_line(data, var=None, **kws):
    
    # If no variable provided skip adding mean line
    if not var: return
    
    #Calculate mean for each group
    m = np.mean(data[var])
    
    #Get current axis
    ax = plt.gca()
    
    #add line at group mean
    ax.axvline(m, color='maroon', lw=3, ls='--')
    
    #annotate group mean
    x_pos=0.65
    if m > 5000: x_pos=0.2
    ax.text(x_pos, 0.7, f'mean={m:.0f}', 
            transform=ax.transAxes,   #transforms positions to range from (0,0) to (1,1)
            color='maroon', fontweight='bold', fontsize=12)

In [None]:
g = sns.FacetGrid(train, col='bedrooms')
g.map_dataframe(sns.scatterplot, x='squarefeet', y='home_value');

In [None]:
train.bedrooms.mean()

In [None]:
def plot_variable_pair(df):
    # every feature is plotted against the target; home_value itself is skipped
    columns = ['squarefeet', 'bathrooms', 'bedrooms', 'yearbuilt',
       'fireplace', 'deck', 'pool', 'garage', 'county', 'home_age',
       'optional_features', 'los_angeles_county', 'orange_county',
       'ventura_county']
    for col in columns:
        sns.lmplot(data=df, x=col, y='home_value', line_kws={'color':'red'})
        plt.show()

In [None]:
train.columns

In [None]:
plot_variable_pair(train)

# Exploration Summary

* A
* B
* C

# Features that will be included in my model

* **A**  has a significant statistical relationship to 
* **B**  has a significant statistical relationship to 
* **C**  has a significant statistical relationship to 


# Features that will not be included in my model

* **D** did not ..
* **Other features** have ..

# Modeling:

## Scaling

# Prepare  data for models

In [None]:
# prepare data for modeling
X_train, y_train, X_validate, y_validate, X_test, y_test = m.model_data_prep(train, validate, test)

# Model

# OLS

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:


#scores.loc[len(scores.index)] = [key, f, RMSE_baseline, RMSE, R2, RMSE_val, R2_val, diff]

In [None]:
# set up dataframe for predictions, add actual values
train_pred = pd.DataFrame({
    'actual': train.home_value
}) 
validate_pred = pd.DataFrame({
    'actual': validate.home_value
}) 

## Baseline

In [None]:
# add a baseline model
# both baselines come from train only, so nothing from validate leaks into the prediction
train_pred['baseline_mean'] = train.home_value.mean()
validate_pred['baseline_mean'] = train.home_value.mean()

train_pred['baseline_median'] = train.home_value.median()
validate_pred['baseline_median'] = train.home_value.median()

In [None]:
train.columns

# custom

In [None]:
custom = ['squarefeet','bathrooms','bedrooms','yearbuilt','pool','orange_county','optional_features']

In [None]:

# 1. make the thing
lm = LinearRegression()
# 2. fit the thing
lm.fit(X_train[custom], y_train)
# 3. use the thing (make predictions)

train_pred['CUS_Model'] = lm.predict(X_train[custom])
validate_pred['CUS_Model'] = lm.predict(X_validate[custom])

# OLS

In [None]:

# 1. make the thing
lm = LinearRegression()
# 2. fit the thing
lm.fit(X_train, y_train)
# 3. use the thing (make predictions)

train_pred['OLS_Model'] = lm.predict(X_train)
validate_pred['OLS_Model'] = lm.predict(X_validate)

In [None]:
train_pred

In [None]:
validate_pred

## Using Kbest 7 features

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression

# parameters: f_regression stats test, give me 7 features
f_selector = SelectKBest(f_regression, k=7)

# find the top 7 X's correlated with y
f_selector.fit(X_train, y_train)

# boolean mask of whether the column was selected or not. 
feature_mask = f_selector.get_support()

# get list of top K features. 
f_feature = X_train.iloc[:,feature_mask].columns.tolist()


In [None]:
X_train[f_feature]

In [None]:
X_train.columns.to_list()

In [None]:
# 1. make the thing
lm = LinearRegression()
# 2. fit the thing
lm.fit(X_train[f_feature], y_train)
# 3. use the thing (make predictions)

train_pred['OLS_Model_f7'] = lm.predict(X_train[f_feature])
validate_pred['OLS_Model_f7'] = lm.predict(X_validate[f_feature])

In [None]:
train_pred

In [None]:
# 1. make the thing
lm = LinearRegression()
# 2. fit the thing
# the column is named _cus, so fit on the custom feature list rather than the k-best list
lm.fit(X_train[custom], y_train)
# 3. use the thing (make predictions)

train_pred['OLS_Model_cus'] = lm.predict(X_train[custom])
validate_pred['OLS_Model_cus'] = lm.predict(X_validate[custom])

# Using Kbest 4 features

In [None]:
# parameters: f_regression stats test, give me 4 features
f_selector = SelectKBest(f_regression, k=4)

# find the top 4 X's correlated with y
f_selector.fit(X_train, y_train)

# boolean mask of whether the column was selected or not. 
feature_mask = f_selector.get_support()

# get list of top K features. 
f_feature = X_train.iloc[:,feature_mask].columns.tolist()

# 1. make the thing
lm = LinearRegression()
# 2. fit the thing
lm.fit(X_train[f_feature], y_train)
# 3. use the thing (make predictions)

train_pred['OLS_Model_f4'] = lm.predict(X_train[f_feature])
validate_pred['OLS_Model_f4'] = lm.predict(X_validate[f_feature])

In [None]:
X_train[f_feature]

In [None]:
train_pred

# Using Kbest 3 features

In [None]:
# parameters: f_regression stats test, give me 3 features
f_selector = SelectKBest(f_regression, k=3)

# find the top 3 X's correlated with y
f_selector.fit(X_train, y_train)

# boolean mask of whether the column was selected or not. 
feature_mask = f_selector.get_support()

# get list of top K features. 
f_feature = X_train.iloc[:,feature_mask].columns.tolist()

# 1. make the thing
lm = LinearRegression()
# 2. fit the thing
lm.fit(X_train[f_feature], y_train)
# 3. use the thing (make predictions)

train_pred['OLS_Model_f3'] = lm.predict(X_train[f_feature])
validate_pred['OLS_Model_f3'] = lm.predict(X_validate[f_feature])

In [None]:
X_train[f_feature]

In [None]:
train_pred

# OLS_ RFE  features = 7  

In [None]:
columns = X_train.columns.to_list()

In [None]:
X_train

In [None]:
from sklearn.feature_selection import RFE
lm = LinearRegression()


# 1. Transform our X
rfe = RFE(lm, n_features_to_select=7)
rfe.fit(X_train, y_train)
print('selected top 7 features:', X_train.columns[rfe.support_])
X_train_rfe = rfe.transform(X_train)
# 2. Use the transformed x in our model
lm.fit(X_train_rfe, y_train)
#convert to DF
X_train_rfe = pd.DataFrame(X_train_rfe, columns = X_train.columns[rfe.support_], index = X_train.index)

train_pred['OLS_rfe7'] = lm.predict(X_train_rfe)

In [None]:
# 3. Make predictions


X_validate_rfe = rfe.transform(X_validate)
#Convert to df
X_validate_rfe = pd.DataFrame(X_validate_rfe, columns = X_validate.columns[rfe.support_], index = X_validate.index)

validate_pred['OLS_rfe7'] = lm.predict(X_validate_rfe)

In [None]:
train_pred

# OLS_RFE 4 features

In [None]:
lm = LinearRegression()


# 1. Transform our X
rfe = RFE(lm, n_features_to_select=4)
rfe.fit(X_train, y_train)
print('selected top 4 features:', X_train.columns[rfe.support_])
X_train_rfe = rfe.transform(X_train)
# 2. Use the transformed x in our model
lm.fit(X_train_rfe, y_train)
#convert to DF
X_train_rfe = pd.DataFrame(X_train_rfe, columns = X_train.columns[rfe.support_], index = X_train.index)

train_pred['OLS_rfe4'] = lm.predict(X_train_rfe)
X_validate_rfe = rfe.transform(X_validate)
#Convert to df
X_validate_rfe = pd.DataFrame(X_validate_rfe, columns = X_validate.columns[rfe.support_], index = X_validate.index)

validate_pred['OLS_rfe4'] = lm.predict(X_validate_rfe)

# OLS_RFE 3 features

In [None]:
lm = LinearRegression()


# 1. Transform our X
rfe = RFE(lm, n_features_to_select=3)
rfe.fit(X_train, y_train)
print('selected top 3 features:', X_train.columns[rfe.support_])
X_train_rfe = rfe.transform(X_train)
# 2. Use the transformed x in our model
lm.fit(X_train_rfe, y_train)
#convert to DF
X_train_rfe = pd.DataFrame(X_train_rfe, columns = X_train.columns[rfe.support_], index = X_train.index)

train_pred['OLS_rfe3'] = lm.predict(X_train_rfe)
X_validate_rfe = rfe.transform(X_validate)
#Convert to df
X_validate_rfe = pd.DataFrame(X_validate_rfe, columns = X_validate.columns[rfe.support_], index = X_validate.index)

validate_pred['OLS_rfe3'] = lm.predict(X_validate_rfe)

# OLS_RFE 2 features

In [None]:
lm = LinearRegression()


# 1. Transform our X
rfe = RFE(lm, n_features_to_select=2)
rfe.fit(X_train, y_train)
print('selected top 2 features:', X_train.columns[rfe.support_])
X_train_rfe = rfe.transform(X_train)
# 2. Use the transformed x in our model
lm.fit(X_train_rfe, y_train)
#convert to DF
X_train_rfe = pd.DataFrame(X_train_rfe, columns = X_train.columns[rfe.support_], index = X_train.index)

train_pred['OLS_rfe2'] = lm.predict(X_train_rfe)
X_validate_rfe = rfe.transform(X_validate)
#Convert to df
X_validate_rfe = pd.DataFrame(X_validate_rfe, columns = X_validate.columns[rfe.support_], index = X_validate.index)

validate_pred['OLS_rfe2'] = lm.predict(X_validate_rfe)

In [None]:
train_pred

# Polynomial

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Degree 2

In [None]:
# 1. Generate Polynomial Features
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
poly.fit(X_train)
X_train_poly = pd.DataFrame(
    poly.transform(X_train),
    columns=poly.get_feature_names_out(X_train.columns),
    index=train.index,
)
X_train_poly.head()

# 2. Use the features
lm = LinearRegression()
lm.fit(X_train_poly, y_train)


train_pred['poly_d2'] = lm.predict(X_train_poly)

X_validate_poly = poly.transform(X_validate)
validate_pred['poly_d2'] = lm.predict(X_validate_poly)


# Degree 2 interactions ONLY

In [None]:
# 1. Generate Polynomial Features
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
poly.fit(X_train)
X_train_poly = pd.DataFrame(
    poly.transform(X_train),
    columns=poly.get_feature_names_out(X_train.columns),
    index=train.index,
)
X_train_poly.head()

# 2. Use the features
lm = LinearRegression()
lm.fit(X_train_poly, y_train)


train_pred['Ipoly_d2'] = lm.predict(X_train_poly)

X_validate_poly = poly.transform(X_validate)
validate_pred['Ipoly_d2'] = lm.predict(X_validate_poly)

In [None]:
validate_pred

# Degree 3

In [None]:
# 1. Generate Polynomial Features
poly = PolynomialFeatures(degree=3, include_bias=False, interaction_only=False)
poly.fit(X_train)
X_train_poly = pd.DataFrame(
    poly.transform(X_train),
    columns=poly.get_feature_names_out(X_train.columns),
    index=train.index,
)
X_train_poly.head()

# 2. Use the features
lm = LinearRegression()
lm.fit(X_train_poly, y_train)


train_pred['poly_d3'] = lm.predict(X_train_poly)

X_validate_poly = poly.transform(X_validate)
validate_pred['poly_d3'] = lm.predict(X_validate_poly)

# DEGREE 3 Interactions Only

In [None]:
# 1. Generate Polynomial Features
# degree=3 to match the heading (degree=2 would repeat the Ipoly_d2 model)
poly = PolynomialFeatures(degree=3, include_bias=False, interaction_only=True)
poly.fit(X_train)
X_train_poly = pd.DataFrame(
    poly.transform(X_train),
    columns=poly.get_feature_names_out(X_train.columns),
    index=train.index,
)
X_train_poly.head()

# 2. Use the features
lm = LinearRegression()
lm.fit(X_train_poly, y_train)


train_pred['Ipoly_d3'] = lm.predict(X_train_poly)

X_validate_poly = poly.transform(X_validate)
validate_pred['Ipoly_d3'] = lm.predict(X_validate_poly)

# Degree 4

In [None]:
# 1. Generate Polynomial Features
poly = PolynomialFeatures(degree=4, include_bias=False, interaction_only=False)
poly.fit(X_train)
X_train_poly = pd.DataFrame(
    poly.transform(X_train),
    columns=poly.get_feature_names_out(X_train.columns),
    index=train.index,
)
X_train_poly.head()

# 2. Use the features
lm = LinearRegression()
lm.fit(X_train_poly, y_train)


train_pred['poly_d4'] = lm.predict(X_train_poly)

X_validate_poly = poly.transform(X_validate)
validate_pred['poly_d4'] = lm.predict(X_validate_poly)

# DEGREE 4 interaction Only

In [None]:
# 1. Generate Polynomial Features
# degree=4 to match the heading (degree=2 would repeat the Ipoly_d2 model)
poly = PolynomialFeatures(degree=4, include_bias=False, interaction_only=True)
poly.fit(X_train)
X_train_poly = pd.DataFrame(
    poly.transform(X_train),
    columns=poly.get_feature_names_out(X_train.columns),
    index=train.index,
)
X_train_poly.head()

# 2. Use the features
lm = LinearRegression()
lm.fit(X_train_poly, y_train)


train_pred['Ipoly_d4'] = lm.predict(X_train_poly)

X_validate_poly = poly.transform(X_validate)
validate_pred['Ipoly_d4'] = lm.predict(X_validate_poly)
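
The six polynomial cells above differ only in `degree` and `interaction_only`, so the same comparison can be sketched as one loop. This is a minimal sketch on made-up data; `X_toy`, `y_toy` and the `results` dict are hypothetical names, and wrapping the steps in `make_pipeline` is my choice rather than what the notebook does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up data: a linear term plus one interaction, with a little noise.
rng = np.random.default_rng(123)
X_toy = rng.uniform(0, 1, size=(300, 3))
y_toy = 2 * X_toy[:, 0] + X_toy[:, 1] * X_toy[:, 2] + rng.normal(0, 0.05, 300)
X_fit, X_val = X_toy[:200], X_toy[200:]
y_fit, y_val = y_toy[:200], y_toy[200:]

# Validation RMSE for every (degree, interaction_only) combination.
results = {}
for degree in (2, 3, 4):
    for interaction_only in (False, True):
        model = make_pipeline(
            PolynomialFeatures(degree=degree, include_bias=False,
                               interaction_only=interaction_only),
            LinearRegression(),
        )
        model.fit(X_fit, y_fit)
        rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
        results[(degree, interaction_only)] = rmse
```

Passing `degree` through the loop variable avoids the copy-paste slip where a heading and its `degree=` argument drift apart.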

# Evaluate Models

In [None]:
train_pred

In [None]:
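from sklearn.metrics import mean_squared_error, r2_score

def evaluate_metrics(df, col, actual):
    MSE = mean_squared_error(actual, df[col])
    SSE = MSE * len(df)
    RMSE = MSE ** .5
    ESS = ((df[col] - actual.mean())**2).sum()
    # total sum of squares computed directly; ESS + SSE only equals it for OLS on train
    TSS = ((actual - actual.mean())**2).sum()
    # r2_score is the true R2; explained_variance_score ignores bias in the predictions
    R2 = r2_score(actual, df[col])
    return MSE, SSE, RMSE, ESS, TSS, R2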

In [None]:
col = train_pred.columns.to_list()

In [None]:
from sklearn.metrics import mean_squared_error, explained_variance_score

# collect one row of metrics per model, then build the frame once
# (DataFrame.append was removed in pandas 2.0)
rows = []
for i in col:
    MSE, SSE, RMSE, ESS, TSS, R2 = evaluate_metrics(train_pred, i, y_train)
    rows.append({'model': i, 'MSE': MSE, 'SSE': SSE, 'RMSE': RMSE,
                 'ESS': ESS, 'TSS': TSS, 'R2': R2})
metric_df = pd.DataFrame(rows, columns=['model','MSE','SSE','RMSE','ESS','TSS','R2'])

In [None]:
metric_df

In [None]:
metric_df[['model','RMSE','R2']]


In [None]:
metric_df[['model','RMSE','R2']].sort_values(by='R2',ascending=False)

In [None]:
# validate metrics: metric_val must exist before the train/validate differences below
col = validate_pred.columns.to_list()
rows = []
for i in col:
    MSE, SSE, RMSE, ESS, TSS, R2 = evaluate_metrics(validate_pred, i, y_validate)
    rows.append({'model': i, 'MSE': MSE, 'SSE': SSE, 'RMSE': RMSE,
                 'ESS': ESS, 'TSS': TSS, 'R2': R2})
metric_val = pd.DataFrame(rows, columns=['model','MSE','SSE','RMSE','ESS','TSS','R2'])

In [None]:
# validate - train
metric_val['diff_RMSE'] = metric_val['RMSE'] - metric_df['RMSE']

In [None]:
metric_val['diff_R2'] = metric_val['R2'] - metric_df['R2']

In [None]:
# worked example: relative gap when train RMSE is 200_000 and validate RMSE is 250_000
1-200_000/250_000

In [None]:
# relative RMSE gap, 1 - train/validate (the small constant guards against division by zero)
metric_val['%diff_RMSE'] = 1 - (metric_df['RMSE'] / (metric_val['RMSE'] + 1e-12))

In [None]:
metric_val[['model','RMSE','R2','diff_R2','diff_RMSE','%diff_RMSE']].sort_values(by='%diff_RMSE',ascending=False)


In [None]:
metric_val[['model','RMSE','R2']].sort_values(by='R2',ascending=False)

In [None]:
validate_pred

In [None]:
def baseline_mean_errors(y):
    '''
    Takes in the actual target and returns the baseline SSE, MSE and RMSE.
    y: actual target values
    Returns:
        * SSE: baseline sum of squared error
        * MSE: baseline mean square error
        * RMSE: baseline root mean square error
    '''
    # set baseline
    baseline = np.repeat(y.mean(), len(y))
    # calculations
    MSE = mean_squared_error(y, baseline)
    SSE = MSE * len(y)
    RMSE = MSE**.5
    
    return SSE ,MSE, RMSE

In [None]:
baseline_mean_errors(y_train)

In [None]:
def better_than_baseline(y, yhat):
    '''
    Prints whether the model's SSE beats the mean-baseline SSE.
    y: actual target values
    yhat: predicted target values
    '''
    # model error (kept separate so the baseline numbers do not overwrite it)
    SSE = mean_squared_error(y, yhat) * len(y)

    # baseline error: predicting the mean of y for every row
    SSE_baseline, MSE_baseline, RMSE_baseline = baseline_mean_errors(y)

    if SSE < SSE_baseline:
        print('My OLS model performs better than baseline')
    else:
        print('My OLS model performs worse than baseline. :(')

In [None]:
train_pred.columns

In [None]:
# any prediction column works here; the full OLS model is used as an example
better_than_baseline(y_train, train_pred['OLS_Model'])

In [None]:
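# residuals for the full OLS model and the mean baseline, kept in a separate
# frame so train_pred still holds only predictions
resid_df = pd.DataFrame({'y': train_pred['actual'],
                         'yhat_baseline': train_pred['baseline_mean'],
                         'yhat': train_pred['OLS_Model']})

resid_df['residual'] = resid_df['yhat'] - resid_df['y']
resid_df['residual_baseline'] = resid_df['yhat_baseline'] - resid_df['y']

resid_df['residual^2'] = resid_df.residual ** 2

resid_df['residual_baseline^2'] = resid_df.residual_baseline ** 2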

In [None]:
def plot_residuals(y, yhat,df):
    '''
    plot_residuals takes in the actual target values, the name of a prediction column
    and the dataframe holding it, and shows a scatter plot of the residuals.
    y: actual target values
    yhat: name of the prediction column in df
    df: dataframe holding the predictions
    '''
    # calculate residuals
    residuals = y - df[yhat]
    
    # create scatter plot
    plt.scatter(x=y, y=residuals)

    # create labels for axis and title
    plt.xlabel('Home Value')
    plt.ylabel('Residuals')
    plt.title('Residual vs Home Value Plot')

    # show plot
    plt.show()

In [None]:
col = train_pred.columns.to_list()
col

In [None]:
for i in col:  
    print(i)
    plot_residuals(y_train, i, train_pred)





In [None]:
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt
def regression_errors(actual, yhat, df):
    '''
    Prints RMSE and R2 for one prediction column.
    actual: actual target values
    yhat: name of the prediction column in df
    df: dataframe holding the predictions
    '''
    rmse = sqrt(mean_squared_error(actual, df[yhat]))
    # r2_score is the true R2; explained_variance_score ignores bias in the predictions
    r2 = r2_score(actual, df[yhat])
    print(f"""
    RMSE: {round(rmse,2)}

    R2: {r2}
    """)

In [None]:
for i in col:  
    print(i)
    regression_errors(y_validate, i, validate_pred)
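
R² and explained variance are easy to confuse: the two agree only when the residuals average to zero. A small made-up example shows the gap when every prediction is off by the same amount; the arrays below are illustrative and not taken from the Zillow data.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_biased = y_true + 50.0  # every prediction is off by the same amount

# Explained variance only looks at the spread of the residuals, which is zero here.
evs = explained_variance_score(y_true, y_biased)  # 1.0

# R2 also penalizes the constant offset: 1 - 10_000 / 50_000.
r2 = r2_score(y_true, y_biased)  # 0.8
```

This matters most for rows like the median baseline, whose residuals do not average to zero.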



* metric

In [None]:
# prep data for modeling
x_train,y_train,x_validate,y_validate, x_test, y_test = m.model_prep(train,validate,test)

**The ....** 

# Comparing Models

* All ....

# Model on Test data

In [None]:
m.get_logit_model(x_train,y_train,x_test,y_test, True)

## Modeling Summary

* A
* B

# Conclusion

## Exploration



* A
* B

## Modeling

**The final model performed....**

## Recommendations

* A
* B
* C

## Next Steps

* A
* B
* C