# Working notebook 3

# **Goals:**

* Discover key attributes that drive and have a high correlation with home value.

* Use those attributes to develop a machine learning model to predict home value.

    * Carefully select features that will prevent data leakage. 


## Imports

In [1]:
import pandas as pd
import numpy as np


import wrangle as w

# Acquire:

* Data acquired from the Codeup database on 11/17/22

* It contained 52441 rows and 10 columns before cleaning

* Each row represents a single-family property:
    * properties from 2017 with current transactions
    * located in the California counties of Los Angeles, Orange, or Ventura

* Each column represents a feature of the single-family residence.

In [6]:
df.isnull().sum()

home_value        1
squarefeet       82
bathrooms         0
bedrooms          0
yearbuilt       116
fireplace     45198
deck          52052
pool          41345
garage        34426
county            0
dtype: int64

In [2]:
# acquire Zillow data
df = w.get_zillow_data()

In [3]:
52441 - 50446 

1995

In [None]:
(1995/52441) *100

In [None]:
100 -((1995/52441) *100)

In [3]:
# a total of 1995 rows were removed as outliers; 96.2% of the original data is retained
df = w.handle_outliers(df)

In [None]:
df.isnull().sum()

In [None]:
50446 

In [7]:
# properties with no bathrooms and no bedrooms (75 rows); dropping them still retains about 96% of the original data
df[(df.bathrooms==0) & (df.bedrooms ==0)]

Unnamed: 0,home_value,squarefeet,bathrooms,bedrooms,yearbuilt,fireplace,deck,pool,garage,county
1601,830145.0,,0.0,0.0,,,,,,6059.0
2714,643406.0,,0.0,0.0,,,,,,6059.0
3137,963472.0,280.0,0.0,0.0,1953.0,,,1.0,,6037.0
3342,185161.0,1208.0,0.0,0.0,1990.0,,,,,6037.0
3423,168828.0,1378.0,0.0,0.0,,,,,,6037.0
...,...,...,...,...,...,...,...,...,...,...
50077,327761.0,,0.0,0.0,,,,,,6059.0
50262,3248800.0,,0.0,0.0,,,,,,6059.0
50637,34124.0,892.0,0.0,0.0,,,,,,6037.0
50903,499000.0,2307.0,0.0,0.0,1948.0,,,,,6037.0


In [None]:
def no_beds_and_baths(df):
    """Keep only rows with a nonzero bathroom count and a nonzero bedroom count."""
    df = df[~(df.bathrooms == 0) & ~(df.bedrooms == 0)]
    return df

In [None]:
# drop 0 beds and 0 baths
df= df[~(df.bathrooms==0) & ~(df.bedrooms ==0)]
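
One detail about the filter above is worth checking. `~(A) & ~(B)` is equivalent to `~(A | B)`, so it keeps only rows where both counts are nonzero; a home with 0 bathrooms but 3 bedrooms is dropped as well. If the intent is to drop only rows where both are 0, the negation has to wrap the whole conjunction. A minimal sketch on made-up rows (the `toy` frame is hypothetical, not project data):

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "bathrooms": [0.0, 0.0, 2.0, 1.0],
    "bedrooms":  [0.0, 3.0, 0.0, 2.0],
})

# Mask used in the notebook: drops a row if EITHER count is 0
kept_either = toy[~(toy.bathrooms == 0) & ~(toy.bedrooms == 0)]

# Alternative: drops a row only if BOTH counts are 0
kept_both = toy[~((toy.bathrooms == 0) & (toy.bedrooms == 0))]

print(len(kept_either), len(kept_both))
```

On these four rows the first mask keeps one row and the second keeps three.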

In [None]:
df.shape

In [None]:
50326/52441

In [None]:
df.isnull().sum()

In [None]:
w.process_luxury_features(df)

In [None]:
df.isnull().sum()

In [None]:
df.shape

In [None]:
# drop nulls: a total of 40 rows; at this point 95.9% of the original data is retained
df = df.dropna()

In [None]:
df.shape

In [None]:
50326-50286

In [None]:
50286/52441
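
The row counts recorded in the cells above can be combined into one retention summary. This is a small sketch that uses only the counts already shown in this notebook; the variable names are my own.

```python
# Row counts taken from the cells above
original_rows = 52441
after_outliers = 50446
after_zero_rooms = 50326
after_dropna = 50286

steps = {
    "outliers removed": after_outliers,
    "0 bed / 0 bath rows removed": after_zero_rooms,
    "remaining nulls dropped": after_dropna,
}

retained_pct = {}
previous = original_rows
for step, rows in steps.items():
    # share of the ORIGINAL data still present after this step
    retained_pct[step] = round(rows / original_rows * 100, 1)
    print(f"{step}: dropped {previous - rows}, retained {retained_pct[step]}%")
    previous = rows
```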

In [None]:
def process_fancy_features(df):
    """Treat blank or null luxury-feature values as 0 (feature not present)."""
    columns = ['fireplace', 'deck', 'pool', 'garage']
    for feature in columns:
        # convert empty or whitespace-only strings to NaN
        df[feature] = df[feature].replace(r"^\s*$", np.nan, regex=True)
        # fill with 0 on the assumption that an unmarked feature does not exist
        df[feature] = df[feature].fillna(0)
    return df

In [None]:
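def handle_outliers(df):
    """Manually handle outliers."""
    # keep homes with at most 6 bathrooms
    df = df[df.bathrooms <= 6]

    # keep homes with at most 6 bedrooms
    df = df[df.bedrooms <= 6]

    # keep homes valued at 1.75 million or less
    df = df[df.home_value <= 1_750_000]

    return df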

In [None]:
def zillow_prep(df):
    
    # remove outliers
    df = handle_outliers(df)
    
    # removed rows with 0 beds and 0 baths
    df = df[~(df.bathrooms==0) & ~(df.bedrooms ==0)]
    
    # process nulls in luxury features:
    df = process_fancy_features(df)
    
    # drop nulls
    df = df.dropna()

    return df

In [3]:
df=w.zillow_prep(df)

In [13]:
# FIPS code 6111 Ventura County, 6059  Orange County, 6037 Los Angeles County
df.county.value_counts()

6037.0    33910
6059.0    14136
6111.0     4395
Name: county, dtype: int64
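
For readability in later plots, the FIPS codes can be mapped to county names. A minimal sketch, assuming a new `county_name` column (hypothetical, not part of the original frame) and the code-to-name pairs from the comment in the cell above:

```python
import pandas as pd

# FIPS code -> county name, per the comment in the cell above
fips_to_name = {6037: "Los Angeles", 6059: "Orange", 6111: "Ventura"}

# Hypothetical sample of the county column (stored as floats in this dataset)
sample = pd.DataFrame({"county": [6037.0, 6059.0, 6111.0, 6037.0]})

# Cast to int so the values match the integer dictionary keys
sample["county_name"] = sample["county"].astype(int).map(fips_to_name)
print(sample["county_name"].tolist())
```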

In [9]:
df.isnull().sum()

home_value        1
squarefeet       82
bathrooms         0
bedrooms          0
yearbuilt       116
fireplace     45198
deck          52052
pool          41345
garage        34426
county            0
dtype: int64

In [4]:
def new_features(df):
    """Add home_age and optional_features columns."""
    # home age in 2017, derived from yearbuilt and cast to float
    df['home_age'] = 2017 - df['yearbuilt']
    df['home_age'] = df['home_age'].astype('float')

    # True when the garage, deck, pool, or fireplace value equals 1
    df['optional_features'] = (df.garage == 1) | (df.deck == 1) | (df.pool == 1) | (df.fireplace == 1)

    return df

In [5]:
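def encode_features(df):
    """Collapse feature counts to 0/1 flags and one-hot encode county."""
    # any fireplace count above 1 becomes 1 (feature present)
    df.fireplace = df.fireplace.replace({2:1, 3:1, 4:1, 5:1})
    # a deck value of 66 is mapped to 1
    df.deck = df.deck.replace({66:1})
    # any garage count above 1 becomes 1
    df.garage = df.garage.replace({2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, 13:1, 14:1})
    # convert the boolean flag to 0/1
    df.optional_features = df.optional_features.replace({False:0, True:1})
    # one dummy column per county FIPS code
    temp = pd.get_dummies(df['county'], drop_first=False)
    df = pd.concat([df, temp], axis=1)
    return df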

In [6]:
df.head()

Unnamed: 0,home_value,squarefeet,bathrooms,bedrooms,yearbuilt,fireplace,deck,pool,garage,county
0,434855.0,1570.0,2.0,3.0,1956.0,0.0,0.0,0.0,0.0,6037.0
1,218089.0,981.0,1.0,2.0,1939.0,0.0,0.0,0.0,0.0,6037.0
2,161802.0,1484.0,1.0,2.0,1913.0,0.0,0.0,0.0,0.0,6037.0
3,635000.0,3108.0,3.0,5.0,2006.0,0.0,0.0,0.0,0.0,6037.0
4,424414.0,1518.0,2.0,3.0,1947.0,0.0,0.0,0.0,0.0,6037.0


In [7]:
df =new_features(df)

In [8]:
df.head()

Unnamed: 0,home_value,squarefeet,bathrooms,bedrooms,yearbuilt,fireplace,deck,pool,garage,county,home_age,optional_features
0,434855.0,1570.0,2.0,3.0,1956.0,0.0,0.0,0.0,0.0,6037.0,61.0,False
1,218089.0,981.0,1.0,2.0,1939.0,0.0,0.0,0.0,0.0,6037.0,78.0,False
2,161802.0,1484.0,1.0,2.0,1913.0,0.0,0.0,0.0,0.0,6037.0,104.0,False
3,635000.0,3108.0,3.0,5.0,2006.0,0.0,0.0,0.0,0.0,6037.0,11.0,False
4,424414.0,1518.0,2.0,3.0,1947.0,0.0,0.0,0.0,0.0,6037.0,70.0,False


In [9]:
df=encode_features(df)
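
One ordering detail is worth checking here: `new_features` runs before `encode_features`, so `(df.garage == 1)` sees the raw garage counts and matches only homes with a value of exactly 1. A home whose only extra is a garage recorded as 2 would be flagged `False`. A sketch of an order-independent variant that compares against 0 instead (the `toy` frame and the `has_any_extra` name are hypothetical):

```python
import pandas as pd

extras = ["fireplace", "deck", "pool", "garage"]

# Hypothetical rows: raw counts, nulls already filled with 0
toy = pd.DataFrame({
    "fireplace": [0.0, 0.0, 1.0],
    "deck":      [0.0, 0.0, 0.0],
    "pool":      [0.0, 0.0, 0.0],
    "garage":    [0.0, 2.0, 0.0],
})

# Equality check as in new_features: misses the garage value of 2 in row 1
flag_eq_one = (toy[extras] == 1).any(axis=1)

# Any positive value counts as "has the feature"
has_any_extra = (toy[extras] > 0).any(axis=1).astype(int)

print(flag_eq_one.tolist(), has_any_extra.tolist())
```

Running the feature flag after the counts are collapsed to 0/1 would be another way to get the same result.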

In [12]:
df.head(5)

Unnamed: 0,home_value,squarefeet,bathrooms,bedrooms,yearbuilt,fireplace,deck,pool,garage,county,home_age,optional_features,6037.0,6059.0,6111.0
0,434855.0,1570.0,2.0,3.0,1956.0,0.0,0.0,0.0,0.0,6037.0,61.0,0,1,0,0
1,218089.0,981.0,1.0,2.0,1939.0,0.0,0.0,0.0,0.0,6037.0,78.0,0,1,0,0
2,161802.0,1484.0,1.0,2.0,1913.0,0.0,0.0,0.0,0.0,6037.0,104.0,0,1,0,0
3,635000.0,3108.0,3.0,5.0,2006.0,0.0,0.0,0.0,0.0,6037.0,11.0,0,1,0,0
4,424414.0,1518.0,2.0,3.0,1947.0,0.0,0.0,0.0,0.0,6037.0,70.0,0,1,0,0


# Prepare:

Prepare actions:
* After the following steps I retained 95.9% of the original data:
    * Outliers were removed (to better fit the definition of a single-family property):

        * Beds above 6
        * Baths above 6
        * Home values above 1_750_000
        * Rows with both 0 beds and 0 baths

    * For the following features it was assumed that null values meant the structure did not exist on the property:
        * fireplace (45198)
        * deck (52052)
        * pool (41345)
        * garage (34426)

    * Rows with null values in the following columns were dropped:
        * home_value (1)
        * squarefeet (82)
        * yearbuilt (116)

* Encoded categorical variables
* Split data into train, validate and test 
    * Approximately: train 56%, validate 24%, test 20%
    * Stratified on 'churn'
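
The 56/24/20 proportions follow from two successive splits: hold out 20% for test, then take 30% of the remaining 80% for validate (0.8 × 0.3 = 0.24), which leaves 0.8 × 0.7 = 0.56 for train. The split function in `wrangle` is not shown in this notebook, so the following is only a sketch of that arithmetic using `DataFrame.sample`. The function name `split_three_ways` and the seed are assumptions, and the sketch does not stratify.

```python
import pandas as pd

def split_three_ways(df, seed=123):
    """Split df into train (~56%), validate (~24%) and test (~20%)."""
    # hold out 20% of all rows for test
    test = df.sample(frac=0.20, random_state=seed)
    train_validate = df.drop(test.index)
    # 30% of the remaining 80% becomes validate
    validate = train_validate.sample(frac=0.30, random_state=seed)
    train = train_validate.drop(validate.index)
    return train, validate, test

# Hypothetical frame with 1,000 rows, for illustration only
toy = pd.DataFrame({"home_value": range(1000)})
train, validate, test = split_three_ways(toy)
print(len(train), len(validate), len(test))
```

The sketch relies on the frame having a unique index, since rows are removed with `drop` by index label.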


### <h1><center>Data Dictionary</center></h1>


| Feature | Description |
| :---------------: | :---------------------------------- |
| home_value (target) | The total tax-assessed value of the parcel |
| squarefeet | Calculated total finished living area of the home |
| bathrooms | Number of bathrooms in the home, including fractional bathrooms |
| bedrooms | Number of bedrooms in the home |
| yearbuilt | The year the principal residence was built |
| fireplace | Fireplace on property (if any = 1) |
| deck | Deck on property (if any = 1) |
| pool | Pool on property (if any = 1) |
| garage | Garage on property (if any = 1) |
| county | FIPS code for California counties: 6111 Ventura County, 6059 Orange County, 6037 Los Angeles County |
| home_age | The age of the home in 2017 |
| optional_features | If a home has any of the following: fireplace, deck, pool, garage, it is noted as 1 |
| additional features | Encoded values for categorical data |

In [None]:
# cleaning data
df = w.prep_telco(df)

# split data: train, validate and test
train, validate, test = w.split_telco_data(df)

# Looking at the data

In [None]:
train.head(10)

# Data Summary

In [None]:
train.describe()

# Explore:

## How often does churn occur?

In [None]:
e.get_churn_mean_bar(train)

 * **It appears that about 27% of Telco customers churn.**

## Do customers who churn have higher monthly charges?

In [None]:
# Obtain boxplot displaying mean of monthly charges
e.get_monthly_charges(train)

* **The mean monthly charges of customers who churn is slightly higher than the mean monthly charges of customers who do not churn.** 

**I will now conduct a t-test to determine whether the mean monthly charges of customers who churn is significantly higher than the mean monthly charges of customers who do not churn.**

* The confidence level is 95%
* Alpha is set to 0.05 
* p/2 will be compared to alpha

$H_0$: Mean monthly charges of Telco customers who churn <= mean monthly charges of Telco customers who do not churn.

$H_a$: Mean monthly charges of Telco customers who churn > mean monthly charges of Telco customers who do not churn.

In [None]:
# Stats T-Test result
e.get_ttest_monthly_charges(train)

The p-value/2 is less than alpha. **There is evidence to support that customers who churn pay higher monthly charges on average than customers who do not churn.** Based on this statistical finding I believe that monthly charges is a driver of customer churn. Adding an encoded version of this feature to the model will likely increase the model's accuracy.
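
The helper `e.get_ttest_monthly_charges` is not shown in this notebook. As a rough sketch of the comparison described above, here is a one-tailed two-sample t-test built on `scipy.stats.ttest_ind`. The two samples are randomly generated placeholders, not Telco data, and the use of Welch's version (`equal_var=False`) is my assumption.

```python
import numpy as np
from scipy import stats

# Placeholder samples for illustration only (NOT Telco data)
rng = np.random.default_rng(42)
churned = rng.normal(loc=75, scale=25, size=200)
stayed = rng.normal(loc=60, scale=30, size=500)

# Two-sided Welch t-test; the one-tailed p-value is p / 2
t_stat, p_two_sided = stats.ttest_ind(churned, stayed, equal_var=False)

alpha = 0.05
# Reject H0 only if the statistic has the expected sign AND p / 2 < alpha
reject_null = bool((t_stat > 0) and (p_two_sided / 2 < alpha))
print(f"t={t_stat:.2f}, p/2={p_two_sided / 2:.4f}, reject H0: {reject_null}")
```

For a one-tailed test the halved p-value supports the alternative only when the t statistic has the expected sign, which is why both conditions are checked.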


## Is the mean tenure of customers who churn lower?

In [None]:
# Obtain boxplot on tenure vs churn
e.get_boxplot_tenure(train)

* **We can see that the mean tenure of customers who churn is less than the mean tenure of customers who do not churn.**

**I will now conduct a t-test to determine whether customers who churn have, on average, a lower tenure than customers who do not churn.**

* The confidence level is 95%
* Alpha is set to 0.05 
* p/2 will be compared to alpha

$H_0$: Mean tenure of Telco customers who churn >= mean tenure of Telco customers who do not churn.

$H_a$: Mean tenure of Telco customers who churn < mean tenure of Telco customers who do not churn.

In [None]:
# obtain T-test for tenure vs churn
e.get_ttest_tenure(train)

The p-value/2 is less than alpha. **Therefore we have evidence to support that customers who churn have a lower average tenure than customers who do not churn.** Based on this statistical finding I believe that tenure is a driver of customer churn. Adding an encoded version of this feature to the model will likely increase the model's accuracy.

## Does having Senior Citizen status affect churn?

In [None]:
# Obtain bar graph for senior Citizen count
e.get_bar_senior(train)

* **We can see that the population count of churned senior citizens is closer to the total population of senior citizens.**

**I will now conduct a chi-square test to determine if there is an association between senior citizen status and churn.**

* The confidence level is 95%
* Alpha is set to 0.05 

$H_0$: There is **no** relationship between a customer's senior citizen status and churn.

$H_a$: There is a relationship between a customer's senior citizen status and churn.

In [None]:
# Obtain chi-square test
e.get_chi2_senior(train)

The p-value is less than alpha. **Therefore there is evidence to support that a customer's senior citizen status has an association with churn.** I believe that senior citizen status is a driver of churn. Adding an encoded version of this feature to the model will likely increase the model's accuracy.
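
The helper `e.get_chi2_senior` is not shown in this notebook. A minimal sketch of a chi-square test of independence with `scipy.stats.chi2_contingency`, using a made-up contingency table (the counts are placeholders, not Telco data):

```python
import pandas as pd
from scipy import stats

# Placeholder counts for illustration only (NOT Telco data)
observed = pd.DataFrame(
    {"no_churn": [400, 50], "churn": [100, 60]},
    index=["not_senior", "senior"],
)

# Returns the statistic, p-value, degrees of freedom and expected counts
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

alpha = 0.05
reject_null = bool(p_value < alpha)
print(f"chi2={chi2:.2f}, dof={dof}, reject H0: {reject_null}")
```

With real data, the table would typically come from `pd.crosstab` of the two categorical columns.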

## Does contract type affect churn?

In [None]:
# Obtain plot for contract type vs churn
e.get_plot_contract(train)

* **It seems that customers with two-year contracts churn less than customers with month-to-month contracts.**

**I will now conduct a chi-square test to determine if there is an association between contract type and churn.**

* The confidence level is 95%
* Alpha is set to 0.05 

$H_0$: There is **no** relationship between contract type and churn.

$H_a$: There is a relationship between contract type and churn.

In [None]:
# Obtain chi-square on Contract type
e.get_chi2_contract(train)

The p-value is less than alpha. **There is evidence to support that contract type has an association with churn.** I believe that contract type is a driver of churn. Adding an encoded version of this feature to the model will likely increase the model's accuracy.

# Exploration Summary

* Monthly Charges is a driver of churn
* Senior Citizen status is a driver of churn
* Tenure is a driver of churn
* Contract type is a driver of churn
* Partner is a driver of churn
* Gender is not a driver of churn

# Features that will be included in my model

* **Monthly charges**  has a significant statistical relationship to churn
* **Senior Citizen**  has a significant statistical relationship to churn
* **Tenure**  has a significant statistical relationship to churn
* **Contract type** has a significant statistical relationship to churn
* **Partner**  has a significant statistical relationship to churn

# Features that will be not included in my model

* **Gender** did not have a statistically significant relationship to churn.
* **Other features** have unknown significance to churn at the moment
    * Given more time I would determine whether these other features would result in any model gains

# Modeling:

* Accuracy is the metric used to evaluate the models.
    * Accuracy measures the percentage of correct predictions
* Churned customers make up 27% of the data 
* Since non-churned customers make up 73% of the data 
    * 73% will be the baseline
* I will evaluate the best Decision Tree, KNN, and Logistic Regression models on train and validate data
* The model that performs the best will then be evaluated on test data
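
The baseline here is the accuracy of always predicting the majority class. A small sketch, using a placeholder target series built to match the 73/27 proportions stated above:

```python
import pandas as pd

# Placeholder target: 73 non-churn (0) and 27 churn (1), matching the stated proportions
y = pd.Series([0] * 73 + [1] * 27, name="churn")

# Always predicting the most common class yields this accuracy
baseline_accuracy = y.value_counts(normalize=True).max()
print(f"baseline accuracy: {baseline_accuracy:.2f}")
```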

In [None]:
# prep data for modeling
x_train,y_train,x_validate,y_validate, x_test, y_test = m.model_prep(train,validate,test)

## Decision Tree

In [None]:
# Get Decision Tree results
m.get_tree_model(x_train,y_train,x_validate,y_validate)

**The accuracy of the Decision Tree model is above the baseline in both train and validate.** 

## KNN

In [None]:
# Get KNN model results
m.get_knn_model(x_train,y_train,x_validate,y_validate)

**The accuracy of the KNN is above the baseline in both train and validate.** 

## Logistic Regression

In [None]:
# Get Logistic Regression model results
m.get_logit_model(x_train,y_train,x_validate,y_validate)

**The accuracy of the Logistic Regression model is above the baseline in both train and validate.** 

# Comparing Models

* All Models performed above the baseline in both train and validate data
* Since all models performed well I will select the model with the smallest accuracy difference between train and validate
* I will select the Logistic Regression as the final model.
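
The selection rule above (smallest gap between train and validate accuracy) can be written out as follows. The accuracy values are placeholders for illustration only; the real scores come from the `m.get_*_model` calls.

```python
# Placeholder (train, validate) accuracies, NOT the notebook's actual results
scores = {
    "decision_tree": (0.82, 0.77),
    "knn": (0.81, 0.78),
    "logistic_regression": (0.80, 0.79),
}

# Pick the model whose train and validate accuracies are closest
best_model = min(scores, key=lambda name: abs(scores[name][0] - scores[name][1]))
print(best_model)
```

A small gap suggests the model is not overfitting to train, though it says nothing by itself about how high the accuracy is.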

# Logistic regression on Test data

In [None]:
m.get_logit_model(x_train,y_train,x_test,y_test, True)

## Modeling Summary

* Decision Tree, KNN and Logistic Regression models all performed above the baseline

* Logistic Regression Model performed 7% above the baseline in terms of accuracy.

# Conclusion

## Exploration



* About 27% of Telco customers churn.
* Customers who churn tend to:
    * have a higher monthly charge
    * have a lower mean tenure
* Contract type, partner status and senior status have an association with churn 
* Gender did not show a statistically significant relationship with churn 

## Modeling

**The final model outperformed the baseline by 7% in terms of accuracy.**

## Recommendations

* Have appealing incentives for customers to sign a two-year contract.
* Run a promotion to lower monthly charges for new customers.
* Give discounts to senior citizens

## Next Steps

* Explore the statistical significance of other features in regards to churn.
* Use bivariate data to explore if other factors are causing senior citizens to churn.
* Use bivariate data to explore what other services  are utilized by customers with two-year contracts.