# Working notebook 3

# **Goals:**

* Discover the key attributes that drive home value and correlate strongly with it.

* Use those attributes to develop a machine learning model to predict home value.

    * Carefully select features that will prevent data leakage. 


## Imports

In [1]:
import pandas as pd
import numpy as np


import wrangle as w

# Acquire:

In [2]:
# acquire zillow data
df = w.get_zillow_data()

* Data acquired from the Codeup database on 11/17/22

* It contained 52441 rows and 10 columns before cleaning

* Each row represents a single-family property:
    * properties from 2017 with current transactions
    * located in the California counties of 'Los Angeles', 'Orange', or 'Ventura'

* Each column represents a feature of the single-family property.

In [None]:
df.isnull().sum()

In [None]:
52441 - 50446 

In [None]:
(1995/52441) *100

In [None]:
100 -((1995/52441) *100)
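
The retention arithmetic in the cells above could be wrapped in a small helper. This is only a sketch: `pct_retained` is a hypothetical name, and the row counts are the ones recorded in this notebook.

```python
def pct_retained(original_rows: int, current_rows: int) -> float:
    """Return the percentage of the original rows that are still present."""
    return current_rows / original_rows * 100


original_rows = 52441    # rows before cleaning
after_outliers = 50446   # rows left after the outlier cutoffs

rows_removed = original_rows - after_outliers
print(rows_removed)                                            # 1995
print(round(pct_retained(original_rows, after_outliers), 1))   # 96.2
```

The same helper can be reused after each later cleaning step instead of retyping the division.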

In [None]:
# a total of 1995 rows were removed as outliers, which still retains 96.2% of the original data
df = w.handle_outliers(df)

In [None]:
df.isnull().sum()

In [None]:
50446 

In [None]:
# properties with no bathrooms and no bedrooms (75 rows); about 96% of the original data is retained after the drop
df[(df.bathrooms==0) & (df.bedrooms ==0)]

In [None]:
def no_beds_and_baths(df):
    """Drop rows where bathrooms or bedrooms is 0 (a zero in either column is enough)."""
    df = df[~(df.bathrooms == 0) & ~(df.bedrooms == 0)]

    return df

In [None]:
# drop rows with 0 beds or 0 baths (a zero in either column removes the row)
df= df[~(df.bathrooms==0) & ~(df.bedrooms ==0)]
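
The filter above keeps a row only when neither count is zero, which is stricter than dropping rows where both are zero. A toy sketch of the difference, where the `toy` frame and its values are made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "bathrooms": [0, 0, 2, 1],
    "bedrooms":  [0, 3, 0, 2],
})

# Same form as the filter in this notebook: keep rows where neither count
# is zero, so a zero in either column drops the row.
neither_zero = toy[~(toy.bathrooms == 0) & ~(toy.bedrooms == 0)]

# Alternative: drop a row only when both counts are zero.
not_both_zero = toy[~((toy.bathrooms == 0) & (toy.bedrooms == 0))]

print(neither_zero.index.tolist())   # [3]
print(not_both_zero.index.tolist())  # [1, 2, 3]
```

Which version is appropriate depends on whether a home with bedrooms but no recorded bathroom should count as a valid single-family property.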

In [None]:
df.shape

In [None]:
50326/52441

In [None]:
df.isnull().sum()

In [None]:
w.process_luxury_features(df)

In [None]:
df.isnull().sum()

In [None]:
df.shape

In [None]:
# drop nulls: a total of 40 rows; at this point we have retained 95.9% of the original data
df = df.dropna()

In [None]:
df.shape

In [None]:
50326-50286

In [None]:
50286/52441

In [None]:
def process_fancy_features(df):
    """Fill missing values in the luxury feature columns with 0."""
    columns = ['fireplace', 'deck', 'pool', 'garage']
    for feature in columns:
        # treat empty or whitespace-only strings as missing
        df[feature] = df[feature].replace(r"^\s*$", np.nan, regex=True)
        # fill with 0, assuming that a feature that was not marked does not exist
        df[feature] = df[feature].fillna(0)
    return df
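
To see what the two steps in `process_fancy_features` do, here is a small sketch on a made-up `pool` column (the `toy` frame is an assumption, not project data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"pool": [1.0, np.nan, "  ", ""]})

# Whitespace-only or empty strings become NaN, then every NaN becomes 0.
cleaned = toy["pool"].replace(r"^\s*$", np.nan, regex=True).fillna(0)

# Cast to float so the result has a single numeric dtype.
cleaned = cleaned.astype(float)

print(cleaned.tolist())  # [1.0, 0.0, 0.0, 0.0]
```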

In [None]:
def handle_outliers(df):
    """Manually handle outliers: keep rows with at most 6 baths, at most 6 beds, and a home value up to 1,750,000."""
    df = df[df.bathrooms <= 6]
    
    df = df[df.bedrooms <= 6]
    
    df = df[df.home_value <= 1_750_000]
    
    return df

In [None]:
def zillow_prep(df):
    
    # remove outliers
    df = handle_outliers(df)
    
    # remove rows with 0 beds or 0 baths
    df = df[~(df.bathrooms==0) & ~(df.bedrooms ==0)]
    
    # process nulls in luxury features:
    df = process_fancy_features(df)
    
    # drop nulls
    df = df.dropna()

    return df

In [None]:
# FIPS codes: 6111 Ventura County, 6059 Orange County, 6037 Los Angeles County
df.county.value_counts()

In [None]:
df.isnull().sum()

In [None]:
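def new_features(df):
    # create a home_age column from yearbuilt (age in 2017), cast as float
    df['home_age'] = 2017 - df['yearbuilt']
    df['home_age'] = df['home_age'].astype('float')

    # flag homes with any optional feature; use > 0 because the raw columns
    # can hold counts or codes above 1 (e.g. garage == 2) before encoding
    df['optional_features'] = (df.garage > 0) | (df.deck > 0) | (df.pool > 0) | (df.fireplace > 0)

    return df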
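
A toy sketch of the two derived columns. The `toy` frame is made up, and the `> 0` comparison is my assumption about the intent: before encoding, a two-car garage is stored as 2, which an `== 1` check would miss.

```python
import pandas as pd

toy = pd.DataFrame({
    "yearbuilt": [1956.0, 2006.0],
    "garage":    [2.0, 0.0],   # raw counts before encoding
    "deck":      [0.0, 0.0],
    "pool":      [0.0, 0.0],
    "fireplace": [0.0, 0.0],
})

# Age of the home in 2017, stored as float.
toy["home_age"] = (2017 - toy["yearbuilt"]).astype(float)

amenities = ["garage", "deck", "pool", "fireplace"]
# Any positive value in an amenity column counts as "present".
toy["optional_features"] = (toy[amenities] > 0).any(axis=1).astype(int)

print(toy["home_age"].tolist())           # [61.0, 11.0]
print(toy["optional_features"].tolist())  # [1, 0]
```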
    
    

In [None]:
def encode_features(df):
    # collapse counts and codes into binary flags (1 = present)
    df.fireplace = df.fireplace.replace({2: 1, 3: 1, 4: 1, 5: 1})
    df.deck = df.deck.replace({66: 1})
    df.garage = df.garage.replace({2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 13: 1, 14: 1})
    df.optional_features = df.optional_features.replace({False: 0, True: 1})
    # one-hot encode county, keeping one column per FIPS code
    temp = pd.get_dummies(df['county'], drop_first=False)
    df = pd.concat([df, temp], axis=1)
    return df
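
The dummy columns produced above are named after the raw FIPS codes (`6037.0`, `6059.0`, `6111.0`). One option, sketched here as an assumption rather than what `wrangle` actually does, is to map the codes to county names first. `FIPS_TO_COUNTY` and the `toy` frame are hypothetical.

```python
import pandas as pd

# FIPS codes listed in this notebook, mapped to readable names.
FIPS_TO_COUNTY = {6037.0: "los_angeles", 6059.0: "orange", 6111.0: "ventura"}

toy = pd.DataFrame({"county": [6037.0, 6059.0, 6111.0, 6037.0]})

# One-hot encode the mapped names; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(toy["county"].map(FIPS_TO_COUNTY), dtype=int)
encoded = pd.concat([toy, dummies], axis=1)

print(encoded.columns.tolist())
# ['county', 'los_angeles', 'orange', 'ventura']
```

String column names also avoid the float-named columns that are awkward to reference later in modeling code.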

In [None]:
df.head()

In [None]:
df =new_features(df)

In [None]:
df.head()

In [None]:
df=encode_features(df)

In [None]:
df.head(5)

# Prepare:

In [3]:
# prepare data 
df = w.zillow_prep(df)

Prepare actions:
* After the following steps I retained 95.9% of the original data:
    * Outliers were removed
    (to better fit the definition of Single Family Property):
    
        * Beds above 6 
        * Baths above 6 
        * Home values above 1_750_000
        * Rows with 0 beds or 0 baths 
        
    * For the following features it was assumed that null values meant the structure did not exist on the property:
        * fireplace (45198)
        * deck (52052)
        * pool (41345)
        * garage (34425)
            
    * The following null values were dropped:
        * home_value (1)
        * squarefeet (82)
        * yearbuilt (116)

* Encoded categorical variables
* Split data into train, validate and test 
    * Approximately: train 56%, validate 24%, test 20%


In [4]:
df 

Unnamed: 0,home_value,squarefeet,bathrooms,bedrooms,yearbuilt,fireplace,deck,pool,garage,county,home_age,optional_features,6037.0,6059.0,6111.0
0,434855.0,1570.0,2.0,3.0,1956.0,0.0,0.0,0.0,0.0,6037.0,61.0,0,1,0,0
1,218089.0,981.0,1.0,2.0,1939.0,0.0,0.0,0.0,0.0,6037.0,78.0,0,1,0,0
2,161802.0,1484.0,1.0,2.0,1913.0,0.0,0.0,0.0,0.0,6037.0,104.0,0,1,0,0
3,635000.0,3108.0,3.0,5.0,2006.0,0.0,0.0,0.0,0.0,6037.0,11.0,0,1,0,0
4,424414.0,1518.0,2.0,3.0,1947.0,0.0,0.0,0.0,0.0,6037.0,70.0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
52435,469096.0,1122.0,1.0,3.0,1950.0,0.0,0.0,1.0,0.0,6037.0,67.0,1,1,0,0
52436,368765.0,1528.0,2.0,2.0,1961.0,0.0,0.0,0.0,0.0,6037.0,56.0,0,1,0,0
52438,468871.0,1336.0,2.0,3.0,1943.0,0.0,0.0,0.0,0.0,6037.0,74.0,0,1,0,0
52439,150291.0,860.0,1.0,2.0,1934.0,0.0,0.0,0.0,0.0,6037.0,83.0,0,1,0,0


In [5]:
# split data: train, validate and test
train, validate, test = w.split_data(df)

In [7]:
train.head()

Unnamed: 0,home_value,squarefeet,bathrooms,bedrooms,yearbuilt,fireplace,deck,pool,garage,county,home_age,optional_features,6037.0,6059.0,6111.0
32899,346258.0,1026.0,1.0,2.0,1924.0,0.0,0.0,0.0,0.0,6037.0,93.0,0,1,0,0
4511,520000.0,1728.0,2.0,3.0,1987.0,0.0,0.0,0.0,0.0,6037.0,30.0,0,1,0,0
29470,217589.0,1840.0,2.0,4.0,1973.0,0.0,0.0,0.0,1.0,6059.0,44.0,0,0,1,0
15398,210507.0,2581.0,3.0,4.0,1994.0,0.0,0.0,1.0,0.0,6037.0,23.0,1,1,0,0
14156,294263.0,902.0,2.0,2.0,1950.0,0.0,0.0,0.0,0.0,6037.0,67.0,0,1,0,0


In [8]:
train.shape, validate.shape, test.shape

((28159, 15), (12069, 15), (10058, 15))
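
The shapes above work out to roughly 56% train, 24% validate and 20% test. `w.split_data` is not shown in this notebook, so the following is only a sketch of one way to get those proportions with pandas alone; `three_way_split`, the seed, and the toy frame are assumptions.

```python
import pandas as pd


def three_way_split(df, seed=123):
    """Split into roughly 56% train, 24% validate, 20% test.

    Assumes the index of `df` is unique, since rows are removed by index label.
    """
    test = df.sample(frac=0.20, random_state=seed)        # 20% of all rows
    rest = df.drop(test.index)                            # remaining 80%
    validate = rest.sample(frac=0.30, random_state=seed)  # 30% of 80% = 24%
    train = rest.drop(validate.index)                     # 70% of 80% = 56%
    return train, validate, test


toy = pd.DataFrame({"home_value": range(100)})
toy_train, toy_validate, toy_test = three_way_split(toy)
print(len(toy_train), len(toy_validate), len(toy_test))  # 56 24 20
```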

# Data Dictionary


| Feature | Description |
| :---------------: | :---------------------------------- |
| home_value (target) | The total tax-assessed value of the parcel |
| squarefeet | Calculated total finished living area of the home |
| bathrooms | Number of bathrooms in the home, including fractional bathrooms |
| bedrooms | Number of bedrooms in the home |
| yearbuilt | The year the principal residence was built |
| fireplace | Fireplace on property (if any = 1) |
| deck | Deck on property (if any = 1) |
| pool | Pool on property (if any = 1) |
| garage | Garage on property (if any = 1) |
| county | FIPS code for California counties: 6111 Ventura County, 6059 Orange County, 6037 Los Angeles County |
| home_age | The age of the home in 2017 |
| optional_features | 1 if the home has any of the following: fireplace, deck, pool, garage |
| additional features | Encoded values for categorical data |

# Looking at the data

In [9]:
train.head(10)

Unnamed: 0,home_value,squarefeet,bathrooms,bedrooms,yearbuilt,fireplace,deck,pool,garage,county,home_age,optional_features,6037.0,6059.0,6111.0
32899,346258.0,1026.0,1.0,2.0,1924.0,0.0,0.0,0.0,0.0,6037.0,93.0,0,1,0,0
4511,520000.0,1728.0,2.0,3.0,1987.0,0.0,0.0,0.0,0.0,6037.0,30.0,0,1,0,0
29470,217589.0,1840.0,2.0,4.0,1973.0,0.0,0.0,0.0,1.0,6059.0,44.0,0,0,1,0
15398,210507.0,2581.0,3.0,4.0,1994.0,0.0,0.0,1.0,0.0,6037.0,23.0,1,1,0,0
14156,294263.0,902.0,2.0,2.0,1950.0,0.0,0.0,0.0,0.0,6037.0,67.0,0,1,0,0
32788,241475.0,1719.0,2.5,3.0,1992.0,1.0,0.0,0.0,1.0,6111.0,25.0,1,0,0,1
19187,108271.0,2018.0,3.0,3.0,1960.0,0.0,0.0,1.0,0.0,6037.0,57.0,1,1,0,0
29240,243917.0,2542.0,3.0,3.0,1955.0,0.0,0.0,0.0,0.0,6037.0,62.0,0,1,0,0
24385,482506.0,1668.0,2.0,3.0,1979.0,0.0,0.0,1.0,1.0,6059.0,38.0,1,0,1,0
46165,413000.0,1351.0,2.0,3.0,1954.0,0.0,0.0,0.0,0.0,6037.0,63.0,0,1,0,0


# Data Summary

In [11]:
train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
home_value,28159.0,435017.576015,332865.944266,3254.0,187537.5,362951.0,583900.5,1750000.0
squarefeet,28159.0,1832.225576,803.292556,300.0,1256.0,1633.0,2227.0,8251.0
bathrooms,28159.0,2.228506,0.884596,1.0,2.0,2.0,3.0,6.0
bedrooms,28159.0,3.273483,0.886781,1.0,3.0,3.0,4.0,6.0
yearbuilt,28159.0,1963.118719,22.663952,1878.0,1950.0,1960.0,1978.0,2015.0
fireplace,28159.0,0.141376,0.348415,0.0,0.0,0.0,0.0,1.0
deck,28159.0,0.006925,0.082929,0.0,0.0,0.0,0.0,1.0
pool,28159.0,0.20402,0.402991,0.0,0.0,0.0,0.0,1.0
garage,28159.0,0.344259,0.475135,0.0,0.0,0.0,1.0,1.0
county,28159.0,6049.246031,21.220257,6037.0,6037.0,6037.0,6059.0,6111.0


# Explore:

## Does contract type affect churn?

In [None]:
# Obtain plot for contract type vs churn
e.get_plot_contract(train)

* **It seems that customers with two-year contracts churn less than customers with month-to-month contracts.**

**I will now conduct a chi-square test to determine if there is an association between contract type and churn.**

* The confidence level is 95%
* Alpha is set to 0.05 

$H_0$: There is **no** relationship between contract type and churn.

$H_a$: There is a relationship between contract type and churn.

In [None]:
# Obtain chi-square on Contract type
e.get_chi2_contract(train)

The p-value is less than alpha. **There is evidence to support that contract type has an association with churn.** I believe that contract type is a driver of churn. Adding an encoded version of this feature to the model will likely increase the model's accuracy.
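
The helper `e.get_chi2_contract` is not shown in this notebook. As a rough sketch of what a chi-square test of independence at alpha = 0.05 looks like, assuming `scipy` is installed, here is a version on made-up counts (the `observed` table is illustrative only and is not a result from this project):

```python
import pandas as pd
from scipy import stats

# Made-up contingency table for illustration only.
observed = pd.DataFrame(
    {"stayed": [450, 640], "churned": [350, 20]},
    index=["month-to-month", "two-year"],
)

alpha = 0.05
chi2, p, dof, expected = stats.chi2_contingency(observed)

print(f"chi2 = {chi2:.2f}, dof = {dof}")
if p < alpha:
    print("Reject the null hypothesis: the two variables appear related.")
else:
    print("Fail to reject the null hypothesis.")
```

A chi-square test fits two categorical variables; a continuous target would need a different test.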

# Exploration Summary

* A
* B
* C

# Features that will be included in my model

* **A**  has a significant statistical relationship to 
* **B**  has a significant statistical relationship to 
* **C**  has a significant statistical relationship to 


# Features that will be not included in my model

* **D** did not ..
* **Other features** have ..

# Modeling:

* metric

In [None]:
# prep data for modeling
x_train,y_train,x_validate,y_validate, x_test, y_test = m.model_prep(train,validate,test)

## Model !

In [None]:
# Get Decision Tree results
m.get_tree_model(x_train,y_train,x_validate,y_validate)

**The ....** 

# Comparing Models

* All ....

# Model on Test data

In [None]:
m.get_logit_model(x_train,y_train,x_test,y_test, True)

## Modeling Summary

* A
* B

# Conclusion

## Exploration



* A
* B

## Modeling

**The final model performed....**

## Recommendations

* A
* B
* C

## Next Steps

* A
* B
* C