In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gs
import seaborn as sns
from eda_module import *
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
%load_ext autoreload
%autoreload 2

In [None]:
data = pd.read_csv('data/churn_data.csv')
df = data.copy()

In [None]:
df.info()

In [None]:
df.head()

In [None]:
# Make all column names lowercase
df.columns = df.columns.str.lower()
df.columns

In [None]:
# Print the number of rows in the DataFrame
print(f"Rows: {df.shape[0]} \n")
# Print the number of columns in the DataFrame
print(f"Features: {df.shape[1]} \n")
# Print the total number of missing values
print(f"Missing Values: {df.isnull().values.sum()} \n")
# Print the number of unique values per column
print(f"Unique Values: {df.nunique()} \n")

# **Our Data**
We have 7,043 rows (each representing a unique customer) and 21 columns, including 19 features and 1 target (Churn).
### **Target:**
- Churn (binary)
    
### **Numerical Features:**
- Tenure
- MonthlyCharges
- TotalCharges
    
### **Categorical Features:**
- Gender
- SeniorCitizen
- Partner
- Dependents
- PhoneService
- MultipleLines
- InternetService
- OnlineSecurity
- OnlineBackup
- DeviceProtection
- TechSupport
- StreamingTV
- StreamingMovies
- Contract
- PaperlessBilling
- PaymentMethod
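
One way to cross-check a numerical/categorical split like the one above is to look at each column's dtype and cardinality. The sketch below runs on a toy frame with made-up values. `split_features` and its `max_categories` threshold are hypothetical and are not part of `eda_module`.

```python
import pandas as pd

def split_features(frame, target="churn", max_categories=10):
    """Return (numerical, categorical) column-name lists (hypothetical helper)."""
    numerical, categorical = [], []
    for col in frame.columns:
        if col == target:
            continue
        is_numeric = pd.api.types.is_numeric_dtype(frame[col])
        # Low-cardinality numeric columns (e.g. 0/1 flags) count as categorical
        if is_numeric and frame[col].nunique() > max_categories:
            numerical.append(col)
        else:
            categorical.append(col)
    return numerical, categorical

# Toy data for illustration only; not taken from churn_data.csv
toy = pd.DataFrame({
    "tenure": range(12),
    "monthlycharges": [20.0 + 5 * i for i in range(12)],
    "seniorcitizen": [0, 1] * 6,
    "contract": ["Month-to-month", "One year", "Two year"] * 4,
    "churn": ["Yes", "No"] * 6,
})
numerical, categorical = split_features(toy)
print("Numerical:", numerical)
print("Categorical:", categorical)
```

On the real data, a numeric column stored as strings would land in the categorical list until it is cast to a number, and any identifier column would need to be dropped first.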
    

# **1. General EDA**

## 1.1 Target: Churn

In [None]:
# Missing totalcharges values are stored as blank strings (" "); replace them with 0
df.totalcharges = df.totalcharges.replace(" ", 0)
# Change totalcharges type from string to float
df.totalcharges = df.totalcharges.astype('float32')
# Save the cleaned data
df.to_csv('data/reg_data.csv')
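
Before overwriting the blanks, it can help to know how many there are. The sketch below is an alternative route to the same result using `pd.to_numeric`, shown on made-up values; it is not the notebook's own cleaning step.

```python
import pandas as pd

# Made-up values; " " mimics the blank entries in totalcharges
toy = pd.DataFrame({"totalcharges": ["29.85", " ", "1889.5", " ", "108.15"]})

# errors="coerce" turns unparseable strings into NaN instead of raising
as_number = pd.to_numeric(toy["totalcharges"], errors="coerce")
n_blank = int(as_number.isna().sum())
print(f"Blank entries: {n_blank}")

# Fill the blanks with 0 and cast, matching the replace-then-astype approach
toy["totalcharges"] = as_number.fillna(0).astype("float32")
print(toy.dtypes)
```

Counting the coerced values first makes it visible how many rows are affected by the decision to impute 0.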

In [None]:
plot_target_dist(df)

# 2. Numerical Feature EDA

## 2.1 Kernel Density Estimations - Tenure, Monthly Charges, Total Charges

In [None]:
plot_kde(df, 'tenure')
plot_kde(df, 'monthlycharges')
plot_kde(df, 'totalcharges')

## 2.2 Grouping Tenure

In [None]:
# Assign each customer (row) to a tenure group
df['grouped_tenure'] = df.apply(tenure_groups, axis=1)

In [None]:
tenure_group_counts(df)
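
The binning logic of `tenure_groups` lives in `eda_module` and is not shown here. As a rough illustration of the same idea, the sketch below bins tenure with `pd.cut`. The bin edges, the labels, and the 72-month upper bound are assumptions, and the data is made up.

```python
import pandas as pd

# Toy tenures in months; illustration only
toy = pd.DataFrame({"tenure": [0, 1, 12, 13, 24, 30, 48, 60, 72]})

# Assumed edges: one-year bands up to four years, then one band up to 72 months
edges = [0, 12, 24, 36, 48, 72]
labels = ["0-12", "13-24", "25-36", "37-48", "49-72"]

# Bins are right-inclusive; include_lowest=True keeps tenure == 0 in the first band
toy["grouped_tenure"] = pd.cut(
    toy["tenure"], bins=edges, labels=labels, include_lowest=True
)

# sort_index orders the counts by band rather than by frequency
counts = toy["grouped_tenure"].value_counts().sort_index()
print(counts)
```

A vectorized `pd.cut` avoids the row-wise `apply`, which may matter on larger datasets.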

In [None]:
plot_numerical_averages(df, 'monthlycharges')

In [None]:
plot_numerical_averages(df, 'totalcharges')

**Numeric Variable Conclusions:**
- Tenure: Customers with shorter service tenures (newer clients) have a higher probability of churning than those with long tenures.
- Monthly Charges: In general, the likelihood of churning increases as monthly charges increase, and it is highest once monthly charges exceed 60 dollars. Customers who do not churn most often have bills around 20 dollars, followed by bills just over 80 dollars.
- Total Charges: The distributions are mostly too broad to reveal this feature's impact (monthly charges are most likely more important).
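
The KDE plots can be complemented with a numeric check of churn rate per monthly-charge band. The sketch below uses made-up values and assumed band edges (chosen so that one edge falls at 60 dollars); it is not output from the real dataset.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "monthlycharges": [20, 25, 45, 55, 65, 70, 85, 95],
    "churn": ["No", "No", "No", "Yes", "Yes", "No", "Yes", "Yes"],
})

# Encode the target as 0/1 so that the mean of a group is its churn rate
toy["churned"] = (toy["churn"] == "Yes").astype(int)

# Assumed band edges; each band is right-inclusive, e.g. (30, 60]
toy["charge_band"] = pd.cut(toy["monthlycharges"], bins=[0, 30, 60, 90, 120])

rate_by_band = toy.groupby("charge_band", observed=True)["churned"].mean()
print(rate_by_band)
```

Running the same grouping on `df` would show whether the jump around 60 dollars seen in the plots also appears in the rates.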

# 3. Categorical Features

## 3.1 Gender

In [None]:
plot_gender_dist(df)

**Gender Conclusion:** Both genders are about equally represented in our dataset, and gender does not appear to be an indicator of churn.

## 3.2 Age

In [None]:
plot_age_dist(df)

**Age Conclusion:**
- Our dataset has far fewer senior citizens than non-seniors (only 16% are senior)
- In *absolute numbers*, more non-senior citizens churn than senior citizens
- A higher *proportion* of senior citizens churn than non-senior citizens
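
The difference between absolute counts and proportions can be made explicit with `pd.crosstab`. The sketch below uses a small made-up sample in which the minority group churns less in count but more in rate; the numbers are illustrative, not taken from the dataset.

```python
import pandas as pd

# Toy data: 8 non-seniors (0) and 2 seniors (1); illustration only
toy = pd.DataFrame({
    "seniorcitizen": [0] * 8 + [1] * 2,
    "churn": ["Yes", "Yes", "No", "No", "No", "No", "No", "No", "Yes", "No"],
})

# Absolute counts per group
counts = pd.crosstab(toy["seniorcitizen"], toy["churn"])

# normalize="index" divides each row by its total, giving within-group rates
rates = pd.crosstab(toy["seniorcitizen"], toy["churn"], normalize="index")

print(counts)
print(rates)
```

Reading both tables side by side shows how a group can contribute fewer churners overall while still churning at a higher rate.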

## 3.3 Partner & Dependents

In [None]:
df[df.churn == 'Yes'].partner.value_counts()

In [None]:
plot_partner_dependents(df)

**Partner/Dependent Conclusions:**
- Overall, those without partners are more likely to churn than those with partners
- Customers without dependents are more likely to churn than those with dependents

## 3.4 Phone Services & Line Quantities

In [None]:
plot_services(df)

**Phone Service Conclusions:**
- Customers with only phone service are retained in significantly greater numbers than other customers
- People with only phone service churn ~25% of the time
- Customers with phone services only pay a higher average monthly charge
- Customers with multiple lines churn at approximately the same rate as those with a single line
- Customers with multiple lines more often pay a higher monthly charge than those with a single phone line

## 3.5 Internet Service

In [None]:
plot_internet_services(df)

## 3.6 Add-On Services

In [None]:
plot_services(df)

In [None]:
plot_service_charges(df)

**Add-On Service Conclusions:**
- Customers with TV streaming and/or movie streaming services churn more than customers with any other add-on service.
- Churn for customers in most categories peaks around a monthly charge of $100

## 3.7 Contracts

In [None]:
plot_contracts(df)

**Contract Conclusions:**
- More than half of customers are on a monthly plan
- Significantly more customers churn on monthly plans than on longer contracts
- The longer the plan, the lower the churn rate
- Monthly charges are generally higher the longer the contract is
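
The contract findings combine three quantities per contract type: share of customers, churn rate, and average monthly charge. A minimal sketch of computing them in one pass with named aggregation follows, on made-up data; the column names follow the lowercased names used in this notebook.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "contract": ["Month-to-month"] * 4 + ["One year"] * 2 + ["Two year"] * 2,
    "monthlycharges": [50.0, 70.0, 80.0, 60.0, 65.0, 75.0, 70.0, 90.0],
    "churn": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No"],
})

summary = (
    toy.assign(churned=toy["churn"].eq("Yes"))  # boolean churn flag
       .groupby("contract")
       .agg(
           customers=("churned", "size"),         # group size
           churn_rate=("churned", "mean"),        # share of churners
           avg_monthly=("monthlycharges", "mean"),
       )
)
# Share of all customers on each contract type
summary["share"] = summary["customers"] / summary["customers"].sum()
print(summary)
```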

## 3.8 Paperless Billing & Payment Methods

In [None]:
plot_paperless(df)

In [None]:
plot_pay_methods(df)

**Payments Conclusions:**
- Customers with paperless billing churn almost 15% more than non-paperless customers
- Paperless customers churn at rates similar to non-paperless customers when the monthly charge is below 60 dollars; above 60 dollars, more paperless customers churn than non-paperless customers
- Customers who pay by e-check churn over 10% more than customers using any other payment method
- Customers who pay by credit card have consistent churn rates regardless of monthly charge, whereas customers paying by bank transfer, e-check, or mailed check all see an uptick in churn once monthly charges rise above 60 dollars.
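
Statements such as "churn 15% more" can be read either as a gap in percentage points or as a relative increase. The sketch below computes both for a pair of rates. `churn_gap` is a hypothetical helper and the rates are made up, so the output says nothing about which reading applies to the figures above.

```python
def churn_gap(rate_a, rate_b):
    """Return (percentage-point gap, relative gap in %) of rate_a over rate_b."""
    point_gap = (rate_a - rate_b) * 100
    relative_gap = (rate_a - rate_b) / rate_b * 100
    return point_gap, relative_gap

# Made-up churn rates for illustration only
point_gap, relative_gap = churn_gap(0.30, 0.20)
print(f"Percentage-point gap: {point_gap:.1f}")
print(f"Relative gap: {relative_gap:.1f}%")
```

Stating which of the two is meant removes the ambiguity when comparing groups with very different base rates.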
