# Business Understanding
- Why are you using machine learning rather than a simpler approach?
- What is it about the problem/data that is suitable for logistic regression? 
- Objective: 
  - Predict which customers will "churn" (leave a business service), given the training data associated with each subscriber to SyriaTel's phone plan. Identifying these customers before they churn should give us a chance to find ways to retain them.

# Data Understanding

| Variable | Definition | Key |
| -------- | -------- | -------- |
| churn | Has customer ceased doing business with SyriaTel | False = has not churned, True = has churned |
| state | US State | |
| account length | Num digits: indicates account age | |
| area code | Phone number area code | |
| phone number | Phone number | |
| international plan | Customer has intl. plan | 'yes', 'no' |
| voice mail plan | Customer has voice mail plan | 'yes', 'no' |
| number vmail messages | | |
| total day minutes | | |
| total day calls | | |
| total day charge | | |
| total eve minutes | | |
| total eve calls | | |
| total eve charge | | |
| total night minutes | | |
| total night calls | | |
| total night charge | | |
| total intl minutes | | |
| total intl calls | | |
| total intl charge | | |
| customer service calls | | |

# Get Data and Import Libraries: 

In [17]:
# Import Required Python Libraries:
import pandas as pd
import numpy as np
import math

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.preprocessing import OneHotEncoder, StandardScaler

from sklearn.impute import MissingIndicator, SimpleImputer

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_selection import SelectFromModel

# plot_confusion_matrix and plot_roc_curve were added in scikit-learn 0.22, deprecated in 1.0, and removed in 1.2.
# On scikit-learn >= 1.0, use ConfusionMatrixDisplay.from_estimator and RocCurveDisplay.from_estimator instead.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

In [18]:
# Import Data:
df = pd.read_csv('./data.csv')

# Initial EDA:

In [19]:
df.head()

Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [20]:
df.describe()

Unnamed: 0,account length,area code,number vmail messages,total day minutes,total day calls,total day charge,total eve minutes,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls
count,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0
mean,101.064806,437.182418,8.09901,179.775098,100.435644,30.562307,200.980348,100.114311,17.08354,200.872037,100.107711,9.039325,10.237294,4.479448,2.764581,1.562856
std,39.822106,42.37129,13.688365,54.467389,20.069084,9.259435,50.713844,19.922625,4.310668,50.573847,19.568609,2.275873,2.79184,2.461214,0.753773,1.315491
min,1.0,408.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23.2,33.0,1.04,0.0,0.0,0.0,0.0
25%,74.0,408.0,0.0,143.7,87.0,24.43,166.6,87.0,14.16,167.0,87.0,7.52,8.5,3.0,2.3,1.0
50%,101.0,415.0,0.0,179.4,101.0,30.5,201.4,100.0,17.12,201.2,100.0,9.05,10.3,4.0,2.78,1.0
75%,127.0,510.0,20.0,216.4,114.0,36.79,235.3,114.0,20.0,235.3,113.0,10.59,12.1,6.0,3.27,2.0
max,243.0,510.0,51.0,350.8,165.0,59.64,363.7,170.0,30.91,395.0,175.0,17.77,20.0,20.0,5.4,9.0


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

## Thoughts on Data (Consider the business problem when choosing features)
- Area codes (and, by association, phone numbers) do not match State (415 is not an area code in Kansas)
- "State" may be a useful geographical feature to consider, but lots of people live in states that don't match their phone #'s area code, so area code isn't a reliable indicator of location.
- There are no nulls
- Numeric vs. categorical:
  - To decide, ask: "Is an increase of 2 in this variable twice as much as an increase of 1?" If not, treat the variable as categorical.
- Categorical variables (besides the target, which is `churn`):
  - State
- Boolean columns, which don't need to be one-hot encoded, just converted from yes/no to 1/0:
  - international plan
  - voice mail plan
- Ordinal values -- there are none
- To Drop:
  - Area Code (the numbers are labels, so an increase of 1 has no numeric meaning)
  - Phone number (a unique identifier, so an increase of 1 has no numeric meaning)
- Calls vs. Minutes
  - More calls don't necessarily mean more minutes, so we will keep both calls and minutes (they are not redundant)
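
The cleaning decisions above could be sketched roughly as follows. This is a minimal illustration, not the notebook's final code: the `sample` frame is rebuilt by hand from a few columns of the `df.head()` output, and the names `sample` and `cleaned` are assumptions made for the example.

```python
import pandas as pd

# Small sample rebuilt from the df.head() output shown earlier
# (only a few columns are kept for brevity).
sample = pd.DataFrame({
    "state": ["KS", "OH", "NJ", "OH", "OK"],
    "area code": [415, 415, 415, 408, 415],
    "phone number": ["382-4657", "371-7191", "358-1921", "375-9999", "330-6626"],
    "international plan": ["no", "no", "no", "yes", "yes"],
    "voice mail plan": ["yes", "yes", "no", "no", "no"],
    "total day minutes": [265.1, 161.6, 243.4, 299.4, 166.7],
    "churn": [False, False, False, False, False],
})

# Drop the identifier-like columns that carry no numeric meaning.
cleaned = sample.drop(columns=["area code", "phone number"])

# Convert the two yes/no columns to 1/0 instead of one-hot encoding them.
for col in ["international plan", "voice mail plan"]:
    cleaned[col] = cleaned[col].map({"yes": 1, "no": 0})

print(cleaned.dtypes)
```

`state` is left untouched here because it still needs one-hot encoding, which should be fit on the training split only.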
  
### Ideas for later:
- How could we tease apart "international plan" and "total intl calls"/minutes to see how likely a customer with many international minutes but no international plan is to churn?

# EDA
- Pattern of error?

- Cannibalize the violin plot function from this notebook for EDA:
  - 41-classification_workflow-completed.ipynb

### Choose most important features by finding ones highest correlated with target:
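
One possible way to rank features by correlation with the target is sketched below. The `demo` frame is synthetic and the relationship between `customer service calls` and `churn` is invented purely so the example has something to find; it says nothing about the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the real DataFrame (hypothetical values).
demo = pd.DataFrame({
    "customer service calls": rng.integers(0, 10, size=n),
    "total day minutes": rng.normal(180, 54, size=n),
    "total night calls": rng.integers(33, 176, size=n),
})
# Invented target: churn is made more likely by many service calls.
demo["churn"] = (demo["customer service calls"] + rng.normal(0, 2, size=n)) > 6

# Correlation of each numeric feature with the 0/1 target, ranked by magnitude.
corr_with_target = (
    demo.drop(columns="churn")
        .corrwith(demo["churn"].astype(int))
        .abs()
        .sort_values(ascending=False)
)
top_features = corr_with_target.head(3).index.tolist()
print(corr_with_target)
```

To avoid leaking information, the same ranking would be computed on the training split only once the data has been split.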

### See if people with no international plan and high international minutes/calls are more or less likely to churn
- Is having no intl. plan combined with high intl. minutes correlated with churn?
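
A grouped churn rate is one simple way to explore this question. The sketch below uses a tiny hypothetical `toy` frame, and the median cut-off for "high" minutes is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical toy rows; swap in the real DataFrame in the notebook.
toy = pd.DataFrame({
    "international plan": ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
    "total intl minutes": [5.0, 6.0, 14.0, 15.0, 5.5, 6.5, 14.5, 15.5],
    "churn": [False, False, True, False, False, True, True, True],
})

# Flag "high" usage as above the median (an arbitrary cut-off for this sketch).
median_minutes = toy["total intl minutes"].median()
toy["high intl minutes"] = toy["total intl minutes"] > median_minutes

# The mean of a boolean column is the churn rate within each group.
churn_rate = toy.groupby(["international plan", "high intl minutes"])["churn"].mean()
print(churn_rate)
```

Comparing the ("no", True) group against the others shows whether heavy international users without a plan stand out.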

# PREDICTIVE! Eyes on the Prize!
- Predictive Findings:
  - How well your model is able to predict target
  - Which features are most important to model
- Predictive Recommendations:
  - Context and situation where predictions would be useful
  - Suggest to data engineers how data can be transformed upon ingestion

# Train/Test/Split
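
A minimal sketch of the split, assuming a synthetic feature frame `X` and an imbalanced target `y` (both hypothetical). `stratify=y` keeps the churn rate similar in both splits, which matters when churners are the minority class.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 200

# Synthetic features standing in for the real columns.
X = pd.DataFrame({
    "total day minutes": rng.normal(180, 54, size=n),
    "customer service calls": rng.integers(0, 10, size=n),
})
# Imbalanced hypothetical target: 30 churners out of 200 customers.
y = pd.Series([True] * 30 + [False] * 170, name="churn")

# Hold out 25% for the final test; stratify to preserve the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())
```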

## One-Hot Encoding
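
A small sketch of one-hot encoding `state`, fitting the encoder on the training rows only. The two tiny frames are hypothetical; `handle_unknown="ignore"` is one reasonable choice so that a state missing from train does not raise an error at transform time.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/test rows for illustration.
train_states = pd.DataFrame({"state": ["KS", "OH", "NJ", "OH"]})
test_states = pd.DataFrame({"state": ["OK", "KS"]})  # "OK" is unseen in train

# handle_unknown="ignore" encodes unseen categories as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_states)  # fit on train only

train_encoded = encoder.transform(train_states).toarray()
test_encoded = encoder.transform(test_states).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(test_encoded)
```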

# Scaling
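
A minimal scaling sketch with made-up numbers: the scaler learns the mean and standard deviation from the training rows and reuses them, unchanged, on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features (two columns).
X_train = np.array([[100.0, 1.0], [200.0, 3.0], [300.0, 5.0]])
X_test = np.array([[200.0, 3.0], [400.0, 7.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print(scaler.mean_)
print(X_test_scaled)
```

Scaling matters for regularized logistic regression because the penalty treats all coefficients on the same footing.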

# Summary of Iterative Modeling Process:
1. Dummy model
2. Evaluate on appropriate classification metrics (bias/variance?)
3. Proceed to next model (provide justification!)
- Ultimate goal is to create a classifier that beats the dummy model; it doesn't have to be perfect
  - When your model isn't improving and you've tuned a couple, you can stop

# Iterative Modeling Process with Cross-Validation:
### Process below is from this lecture (also has lots on how to appropriately use Regularization): ***36-Regularization_Lecture-completed.ipynb***

So, now our modeling process has an added step: cross-validation.

1. Get Data
2. EDA
3. Cleaning
4. Feature Engineering
5. Train/Test split
6. Model training using `train` split
7. Cross Validation (Once you are happy with the model, then do step 8)
8. Model testing using `test` split

Please note, this is **NOT** a linear process.

You will repeat steps 3 through 7 many times. 

You only use the `test` split when you are satisfied with your model's performance as judged by the cross-validation.
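
Steps 5 through 8 might look roughly like this. The data comes from `make_classification`, which is only a stand-in for the churn data, and recall is used as the score purely as an example of a metric that suits an imbalanced target.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for the churn data.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=0
)

# Step 5: split once, up front.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)

# Steps 6-7: judge the model with cross-validation on the training split only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="recall")
print(cv_scores.mean())

# Step 8: only once satisfied, fit on all of train and score on test one time.
model.fit(X_train, y_train)
test_recall = recall_score(y_test, model.predict(X_test))
print(test_recall)
```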

# See this lecture for example of entire logistic regression (classification) workflow:
### **41-classification_workflow-completed.ipynb**
- Adapt the `ModelwithCV` class from that notebook. Copying it is fine, just cite the source.

# Minimal Viable Product:
- Use logistic regression, no need for any other algorithm
- Fit the transformer on the training data and use it to transform both the train and test
- Never fit anything to test
- Don't use the target as a feature, and don't use a numeric target
- Don't REDUCE regularization on a model that is overfitting. INCREASE regularization if model is overfitting.
- Report the model's performance on the TEST data, not the training data.
- Tune at least 1 hyperparameter in a justifiable way without any major errors
- Focus on specific metrics that are important to business case (not just displaying `classification_report` and/or confusion matrix -- you wouldn't want to try and discuss ALL evaluation metrics, and you also wouldn't want to just display the metrics without discussion)

# Dummy Model & Initial feature selection:
Choosing features is an iterative process:
- Dummy model: it simply predicts the majority class, so it ignores the features.
- First logistic regression (a type of linear model): start with all columns, or take the top 3-5 columns most correlated with the target.
- Subsequent models: try different columns.
- If the model is underfit, increase complexity (variance), for example with polynomial features.
- If the model is overfit, a good approach is increasing regularization.
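
A baseline comparison could be sketched as below, again on synthetic data from `make_classification` rather than the real churn set. It also illustrates why accuracy alone flatters the dummy model on an imbalanced target.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic training data with roughly 15% positives (hypothetical).
X_train, y_train = make_classification(
    n_samples=500, n_features=6, weights=[0.85, 0.15], random_state=1
)

dummy = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
logreg = LogisticRegression(max_iter=1000)

# The dummy looks decent on accuracy but never finds a single churner.
dummy_accuracy = cross_val_score(dummy, X_train, y_train, cv=5, scoring="accuracy").mean()
dummy_recall = cross_val_score(dummy, X_train, y_train, cv=5, scoring="recall").mean()
logreg_recall = cross_val_score(logreg, X_train, y_train, cv=5, scoring="recall").mean()

print(dummy_accuracy, dummy_recall, logreg_recall)
```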

# Evaluate
- Evaluate based on TEST data
- Bias/Variance - overfitting vs. underfitting
- Recall
- Precision
- Accuracy
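- MSE (better than R-squared for explaining to stakeholders, but both are regression metrics; for this classification task rely on the classification metrics in this list)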
- Confusion Matrix
- `classification_report`
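
As a worked example of how these metrics relate, the sketch below computes them from ten hypothetical test-set labels and predictions (1 = churned).

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions for ten customers.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = recall_score(y_true, y_pred)        # tp / (tp + fn): churners we caught
precision = precision_score(y_true, y_pred)  # tp / (tp + fp): flagged customers who churn
accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total

print(tn, fp, fn, tp, recall, precision, accuracy)
```

Here recall and precision are both 2/3 while accuracy is 0.8, which shows how accuracy can look better than the model's ability to find churners.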

## Tuning! Make justifiable changes in successive models:
- DO NOT reduce regularization on a model that is overfitting.
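  - The regularization parameter (lambda) penalizes all coefficients except the intercept, which helps the model generalize instead of overfitting. In scikit-learn's `LogisticRegression`, the `C` hyperparameter is the inverse of regularization strength, so a smaller `C` means stronger regularization.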
- Threshold: Calculate from ROC Curve:
  - https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/
- Tune Hyperparameters and Grid Search:
  - https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/
- Manipulate target?
- Maximize validation scores
  - As a validation score, accuracy is inappropriate for imbalanced classification problems
  - Use precision, recall, F-Measure for imbalanced classification: https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/
- Cross validation: https://machinelearningmastery.com/how-to-configure-k-fold-cross-validation/
- Consider Bias/Variance tradeoff
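
The tuning ideas above could be combined as in the following sketch. It assumes synthetic data, an arbitrary grid of `C` values, F1 as the cross-validation score, and Youden's J statistic (TPR minus FPR) as the threshold criterion; each of these is an illustrative choice, not a recommendation from the linked articles. The threshold is picked on a separate validation split so that the test split stays untouched.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the churn data.
X, y = make_classification(
    n_samples=800, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, random_state=0
)

# In scikit-learn, C is the INVERSE of regularization strength:
# a smaller C means stronger regularization.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="f1",
    cv=5,
)
grid.fit(X_train, y_train)
best_C = grid.best_params_["C"]

# Threshold moving: pick the ROC point that maximizes TPR - FPR.
probs = grid.predict_proba(X_val)[:, 1]
fpr, tpr, thresholds = roc_curve(y_val, probs)
best_threshold = thresholds[np.argmax(tpr - fpr)]

# Classify with the tuned threshold instead of the default 0.5.
y_pred = (probs >= best_threshold).astype(int)
print(best_C, best_threshold)
```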

# Conclusion:
- After refining models, provide 1-3 paragraphs discussing final model and at least 1 overall model metric
- Model Limitations Discussion:
  - Records/Instances where model performance was worse (Question: Does this mean an individual row?)
  - If used in production, what kinds of problems would pop up?
  - Connect metrics to real-world implications 
    - What should stakeholders do with this information?

This can include rows/individuals in the training set that are outliers; your model may perform especially badly on those. Basically, the "limitations" part of your final discussion should address weaknesses in the data and the algorithm.

# Optional:
- Cross Validation
- Feature Engineering
- Pipelines
  - Only use pipelines later if you have time
  - Pipelines help prevent data leakage
  - They transform the test set exactly how train was transformed
  - Pipelines are the best-practice approach to data preparation that avoids leakage, but they can get complicated very quickly. We therefore do not recommend that you use pipelines in your initial modeling approach, but rather that you refactor to use pipelines if you have time.
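
If there is time for the refactor, a pipeline might look something like this sketch. The `train` and `new_rows` frames are hypothetical, and the step names (`onehot`, `scale`, `preprocess`, `model`) are arbitrary labels chosen for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training rows (made-up values).
train = pd.DataFrame({
    "state": ["KS", "OH", "NJ", "OH", "OK", "KS", "NJ", "OK"],
    "total day minutes": [150.0, 162.0, 240.0, 305.0, 170.0, 120.0, 310.0, 95.0],
    "churn": [0, 0, 0, 1, 0, 0, 1, 0],
})

# Each column gets its own transformer; all are fit only when pipe.fit is called.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["state"]),
    ("scale", StandardScaler(), ["total day minutes"]),
])
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

pipe.fit(train[["state", "total day minutes"]], train["churn"])

# New data is transformed exactly as train was, including an unseen state.
new_rows = pd.DataFrame({"state": ["TX"], "total day minutes": [200.0]})
predictions = pipe.predict(new_rows)
print(predictions)
```

Because the whole pipeline is a single estimator, it can be passed directly to `cross_val_score` or `GridSearchCV`, and the transformers are then refit inside each fold.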