### Is this new observation A or B (or C, D, or E) (Classification)
- [Classification](https://ds.codeup.com/classification/overview/) is a **supervised learning task**. That means we train on data w/ answers/labels
- We train with answers/labels to produce a `decision rule` we'll use to classify future data.
![image.png](attachment:image.png)
- With classification, we use labeled data to train algorithms to classify future data points.
- The training data allows us to train an algorithm to produce a decision rule
- Using a boundary between points or a distance between points, we classify new datapoints into A or B (or C or D or E)
- **Classification is a Supervised Machine Learning technique.**
    - uses labeled data from a training dataset to learn rules for making future predictions on unseen data.
    - **Classification is used to predict the category membership of the categorical target value or label.**
    - determines the probability that an observation belongs to a certain class, expressed as a value between 0 and 1. **A probability close to 1 means the observation is *very likely* to be part of a group or category.**



### Vocab
- **Classifier**: An algorithm that maps the input data to a specific category.

- **Classification Model**: A series of steps that takes the patterns of input variables, generalizes those patterns, and applies them to new data in order to predict the class.

- **Feature**: A feature, aka input/independent variable, is an individual measurable property of a phenomenon being observed.

#### Types of classification
- **Binary Classification**: Classification with two possible outcomes. Uses a decision rule to predict an observation to be a member of one of only two groups: churn/not churn, pass/fail, male/female, smoker/non-smoker, healthy/sick.

- **Multiclass Classification**: Classification with more than two classes, where each sample is assigned to one and only one target label, e.g. Grade levels of students in school (1st-12th). Uses a decision rule to predict an observation to be a member of one of three or more possible groups or categories: A/B/C, hot/warm/cold, Python/Java/C++/Go/C

#### Uses for classification
- Medical Diagnosis
- Spam Detection
- Credit Approval
- Targeted Marketing

### Difference between Classification and Regression

![image.png](attachment:image.png)

Regression predicts a continuous variable while **classification predicts a *categorical* variable.**

### Common Classification Algorithms
- **Logistic Regression**
    - (`sklearn.linear_model.LogisticRegression`)
    - goal is to find the values for the coefficients that weight each input variable
    - used to predict binary outcomes
    - output is a value btwn 0 and 1 that represents the probability of one class over the other
- **Decision Tree** 
    - (`sklearn.tree.DecisionTreeClassifier`)
    - sequence of rules used to classify 2 or more classes
    - each node represents a single *input variable (x)* and a split point or class of that variable.
    - leaf nodes of the tree contain an *output variable (y)* (used to make a prediction).
    - predictions are made by walking the splits of the tree until arriving at a leaf node and output the class value at the leaf node
- **Naive Bayes** 
    - (`sklearn.naive_bayes.BernoulliNB`)
    - assumes every pair of features (input variables) is independent, which is often not the case
    - comprised of 2 types of probabilities that can be calculated directly from your training data:
        - probability of each class
        - conditional probability of each x value given each class
- **K-Nearest Neighbors**
    - (`sklearn.neighbors.KNeighborsClassifier`)
    - makes predictions based on how close a *new* data point is to known data points
    - measures distances between data points
    - predictions are made for a new data point by searching through the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances
- **Random Forest**
    - (`sklearn.ensemble.RandomForestClassifier`)
    - similar to decision tree w/ whole bunch of trees w/ randomness
        - outcome = aggregate of all the trees (ensemble algorithm)
- **Support Vector Machine** 
    - (`sklearn.svm.SVC`)
    - technique that uses higher dimensions to best separate data points into two classes
    - **margin**: distance btwn the hyperplane and the closest data points
    - the best or optimal hyperplane that can separate the two classes is the line that has the largest margin
    - **support vectors**: points that are relevant in defining the hyperplane and in the construction of the classifier
- **Stochastic Gradient Descent** 
    - (`sklearn.linear_model.SGDClassifier`)
- **AdaBoost** 
    - (`sklearn.ensemble.AdaBoostClassifier`)
- **Bagging** 
    - (`sklearn.ensemble.BaggingClassifier`)
- **Gradient Boosting**
    - (`sklearn.ensemble.GradientBoostingClassifier`)

# Data Acquisition

### From a Database:
Create a Dataframe using a SQL query to access a database

In [None]:
# Import pandas, plus private info from env.py to keep it secret in public files.
import pandas as pd
from env import host, password, user

# Test query in Sequel Pro and save to a variable.
sql_query = 'write your sql query here; test it in Sequel Pro first!'

# Save connection url to a variable for use with pandas `read_sql()` function.
connection_url = f'mysql+pymysql://{user}:{password}@{host}/database_name'

# Python function to read data from database into a DataFrame.
pd.read_sql(sql_query, connection_url)

### From Files:

In [None]:
# Create Dataframe from a local csv file
# **Note**: if you are working with a folder within your directory,
# you have to specify 'folder_name/file_name'
df = pd.read_csv('file_path_or_folder/file_name.csv')

#or
df = pd.read_csv('file_name.csv')



# Create DataFrame from an AWS S3 file. (amazon web services)
df = pd.read_csv('https://s3.amazonaws.com/bucket_and_or_file_name.csv')




# Create DataFrame from a Google sheet using its Share url.
sheet_url = 'https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357'

csv_export_url = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')

df = pd.read_csv(csv_export_url)

### From your Clipboard
Read copy-pasted tabular data and parse it into a Dataframe

In [None]:
# Default: whitespace-separated; extra keyword arguments are passed on to pd.read_csv
df = pd.read_clipboard(sep=r'\s+')

# Some examples of options I have.
columns = ['column_1', 'column_2', 'column_3']
df = pd.read_clipboard(sep=',', header=None, names=columns)

### From an Excel Sheet
Create a Dataframe based on the contents of an excel spreadsheet

In [None]:
#Step 1: Download as Microsoft Excel file in Google Sheets
#Step 2: Move downloaded file to desired folder in order to load/read data
pd.read_excel('file_name.xlsx')

#specify sheet name
pd.read_excel('file_name.xlsx', sheet_name='sheet_name') 

#specify sheet name and columns
pd.read_excel('file_name.xlsx', sheet_name='sheet_name', usecols=['col_name1', 'col_name2'])

### From modules (Pydataset or Seaborn  or Sklearn)

In [None]:
#pydataset
from pydataset import data

#show documentation
data('dataset_name', show_doc=True)

#show data
df = data('dataset_name')
df.head()

In [None]:
#seaborn
import seaborn as sns

#show data
df = sns.load_dataset('dataset_name')
df.head()

In [None]:
#Sklearn- dictionary-like object

from sklearn import datasets

iris = datasets.load_iris()

#show data
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df.head()

### Important to:
- use pandas methods and attributes to do some **initial summarization** and exploration of your data.
    - `.head()`
    - `.shape`
    - `.info()`
    - `.columns`
    - `.dtypes`
    - `.describe()`
    - `.value_counts()`
    
- create functions that acquire data from database
- save the data locally to CSV files (cache your data)
- check for CSV files upon subsequent use.
- create python module, acquire.py, that holds your functions that acquire the data you want to use and can be imported and called in other notebooks and scripts.





- **Imports**:
    - `import pandas as pd`
    - `import numpy as np`
    - `import os`

- visualize:
   - `import seaborn as sns`
   - `import matplotlib.pyplot as plt`
   - `plt.rc('figure', figsize=(11, 9))`
   - `plt.rc('font', size=13)`

- turn off pink warning boxes:
   - `import warnings`
   - `warnings.filterwarnings("ignore")`

- acquire:
   - `from env import host, user, password`
   - `from pydataset import data`

In [None]:
#transposed summary statistics for each of the numeric variables
df.describe().T

In [None]:
# getting value counts for each column
for column in df.columns:
    print(column.upper())
    print(df[column].value_counts())
    print("-------------------------------------")

In [None]:
#individual plots for individual variables
df['monthly_charges'].hist(color='gold')

plt.title('Distribution of Monthly Charges at Telco')
plt.show()

In [None]:
#always show findings

In [None]:
# Create helper function to get the necessary connection url.
def get_connection(db, user=user, host=host, password=password):
    '''
    This function uses my info from my env file to
    create a connection url to access the Codeup db.
    '''
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'

# Use the above helper function and a sql query in a single function.
def get_db_data():
    '''
    This function reads data from the Codeup db into a df.
    '''
    
    #create SQL query
    sql_query = '''
    write your sql query here:
    SELECT * FROM ___
    JOIN ___ USING(something)
    WHERE ___ AND
    test it in Sequel Pro first;
    '''
    
    #read in dataframe
    df = pd.read_sql(sql_query, get_connection('database_name'))
    
    return df

### Caching Data

- Caching or storing data you've retrieved from a database or website makes accessing it later much faster. Basically, cached data reduces load times.

- We can design our acquire functions to get our data for us faster by reading in a csv file, if one exists, and if not, acquiring our data and creating a csv file for later use.

- The `os.path.isfile()` method in Python is used to check whether a specified path is an existing file or not. It returns a boolean value.

- https://github.com/CodeupClassroom/easley-classification-exercises/blob/main/classification_acquire_lesson.ipynb
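The caching pattern described above can be sketched roughly as follows. The function name `get_cached_data`, the `fetch` argument, and the file name are hypothetical; in a real acquire.py, `fetch` would be a function that wraps a `pd.read_sql` call such as `get_db_data`. A fake query is used here so the sketch runs without a database.

```python
import os
import tempfile

import pandas as pd


def get_cached_data(csv_path, fetch):
    """
    Return a DataFrame from csv_path if that file exists;
    otherwise call fetch() to acquire the data and cache it as a csv.
    `fetch` is a hypothetical stand-in for a database query function.
    """
    if os.path.isfile(csv_path):
        # Cache hit: read the local copy instead of querying again.
        return pd.read_csv(csv_path)

    # Cache miss: acquire the data, then write it out for next time.
    df = fetch()
    df.to_csv(csv_path, index=False)
    return df


# Demo with a fake "query" that records how often it is called.
calls = []


def fake_query():
    calls.append(1)
    return pd.DataFrame({'customer_id': [1, 2, 3], 'churn': [0, 1, 0]})


cache_file = os.path.join(tempfile.mkdtemp(), 'telco.csv')
first = get_cached_data(cache_file, fake_query)   # runs the query, writes the csv
second = get_cached_data(cache_file, fake_query)  # reads the csv, no query
```

Writing with `index=False` keeps the csv from gaining an extra unnamed index column when it is read back in.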

# Data Preparation
https://ds.codeup.com/classification/prep/
https://github.com/aliciag92/classification-exercises/blob/main/classification_prepare_lesson.ipynb
## What are we doing and why
**What**: Clean and tidy our data so that it is ready for exploration, analysis and modeling

**Why**: Set ourselves up for certainty!

- 1) Ensure that our observations will be sound:
    - Validity of statistical and human observations
- 2) Ensure that we will not have computational errors:
    - non numerical data cells, nulls/NaNs
- 3) Protect against overfitting:
    - Ensure that we have a split data structure prior to drawing conclusions

**Input**: An acquired dataset (One Pandas Dataframe) ------> **Output**: Tidied and cleaned data split into Train, Validate, and Test sets (Three Pandas Dataframes)


**Processes**: Summarize the data ---> Clean the data ---> Split the data

### Summarize
- head(), describe(), info(), isnull(), value_counts(), shape, ...
- plt.hist(), plt.boxplot()
- document takeaways (nulls, datatypes to change, outliers, ideas for features, etc.)
- imports:
    - `import pandas as pd`
    - `import numpy as np`
    - `import matplotlib.pyplot as plt`

    - `from sklearn.model_selection import train_test_split`
    - `from sklearn.impute import SimpleImputer`

    - `import warnings`
    - `warnings.filterwarnings("ignore")`

    - `import acquire`

### Clean
- **missing values**: drop columns with too many missing values, drop rows with too many missing values, fill with zero where it makes sense, and then make note of any columns you want to impute missing values in (you will need to do that on split data).
- **outlier**: an observation point that is distant from other observations https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/
- **outliers**: ignore, drop rows, snap to a selected max/min value, create bins (cut, qcut)
- **data errors**: drop the rows/observations with the errors, or correct them to what was intended
- address **text normalization** issues...e.g. deck 'C' 'c'. (correct and standardize the text)
- **tidy data**: getting your data in the shape it needs to be for modeling and exploring. Every row should be an observation and every column should be a feature/attribute/variable. You want 1 observation per row, and 1 row per observation. If you want to predict customer churn, each row should be a customer and each customer should be on only 1 row. (address duplicates, aggregate, melt, reshape, ...)
- **creating new variables out of existing variables** (e.g. z = x - y)
- **rename columns**
- **datatypes**: need numeric data to be able to feed into model (**dummy vars, factor vars, manual encoding**)
- **scale numeric data**: so that continuous variables have the same weight, are on the same units, if algorithm will be used that will be affected by the differing weights, or if data needs to be scaled to a gaussian/normal distribution for statistical testing. (linear scalers and non-linear scalers)
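As a rough illustration of the scaling bullet, here is a minimal sketch using a linear min-max scaler. The choice of `MinMaxScaler` and the toy values are assumptions for the example, not a recommendation from these notes. The scaler is fit on train only and then applied to the other samples.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for the train and validate samples (hypothetical values).
train = pd.DataFrame({'tenure': [1, 12, 24, 72],
                      'monthly_charges': [20.0, 55.5, 80.0, 110.0]})
validate = pd.DataFrame({'tenure': [6, 36],
                         'monthly_charges': [35.0, 95.0]})

cols = ['tenure', 'monthly_charges']

# Fit on train only, so min and max are learned from in-sample data.
scaler = MinMaxScaler().fit(train[cols])

# Transform each sample with the parameters learned from train.
train_scaled = pd.DataFrame(scaler.transform(train[cols]),
                            columns=cols, index=train.index)
validate_scaled = pd.DataFrame(scaler.transform(validate[cols]),
                               columns=cols, index=validate.index)
```

Each scaled train column now runs from 0 to 1; validate values are placed on that same scale and may fall outside it if they exceed the train range.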

### Train, Validate, Test Split
- split our data into train, validate and test sample dataframes
- Why? **overfitting**: model is not generalizable. It fits the data you've trained it on "too well". 3 points does not necessarily mean a parabola.
- **train**: *in-sample*, explore, impute mean, scale numeric data (max() - min()...), fit our ml algorithms, test our models. **Train model to make predictions**
- **validate, test**: (*out-of-sample*) represents future, unseen data. 
- **validate**: *out-of-sample*, confirm our top models have not overfit, test our top n models on unseen data. Using validate performance results, we pick the top **1** model. **Evaluate your model's performance on unseen data and ensure that it didn't learn too much from your train set causing it to 'overfit' on a particular set of data**
    - use validate to tune features/parameters in train
    - repeat the process for each model you create, and choose the best model to use w/ test dataset.
    - Be Aware: You may need to stratify your split on a particular feature, so that the proportion of a feature's values is the same in your train, validate, and test datasets. You can pass the stratify parameter as an option to your train_test_split.
- **test**: *out-of-sample*, how we expect our top model to perform in production, on unseen data in the future. **ONLY USED ON 1 MODEL**.
- You want to do all the prep that can be done on the full dataset before you split. Go through, work on DF for all you need to, then move to train when it's time. So you don't have to go back and forth, because that leads to errors and inconsistencies in data.

> Should I do this on the full dataset or on the train sample?

*this*: the action, method, function, step you are about to take on your data.

- Are you comparing, looking at the relationship or summary stats or visualizations with 2+ variables?
- Are you using an sklearn method?
- Are you moving into the explore stage of the pipeline?


If ONE or more of these is yes, then you should be doing it on your train sample. If ALL are no, then the entire dataset is fine.

>> **Be Aware: You may need to stratify your split on a particular feature, so that the proportion of a feature's values is the same in your train, validate, and test datasets. You can pass the stratify parameter as an option to your train_test_split.**

In [None]:
#generic split function
from sklearn.model_selection import train_test_split

def split(df, stratify_by=None):
    """
    Crude train, validate, test split
    To stratify, send in a column name
    """
    
    if stratify_by is None:
        # hold out 20% as test, then 30% of the remainder as validate
        train, test = train_test_split(df, test_size=.2, random_state=123)
        train, validate = train_test_split(train, test_size=.3, random_state=123)
    else:
        train, test = train_test_split(df, test_size=.2, random_state=123, stratify=df[stratify_by])
        train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=train[stratify_by])
    
    return train, validate, test

### Option for Missing Values: Impute
We can impute values using the mean, median, mode (most frequent), or a constant value. We will use `sklearn.impute.SimpleImputer` to do this.

1. Create the `SimpleImputer` object, which we will store in the variable `imputer`. In the creation of the object, we specify the strategy to use (`mean`, `median`, or `most_frequent`, i.e. the mode). Essentially, this is creating the instructions and assigning them to a variable we will reference.
2. Fit to train. This means compute the mean, median, or most_frequent (i.e. mode) for each of the columns that will be imputed. Store that value in the imputer object.
3. Transform train: fill missing values in train dataset with that value identified
4. Transform validate and test: fill missing values with that same value identified from train
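The imputation steps can be sketched as follows. The `age` column and its values are made up for illustration, and the median strategy is just one of the options listed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train and test samples with missing ages (hypothetical values).
train = pd.DataFrame({'age': [22.0, 38.0, np.nan, 35.0, 54.0]})
test = pd.DataFrame({'age': [np.nan, 4.0]})

# Create the imputer object and choose the strategy.
imputer = SimpleImputer(strategy='median')

# Fit to train: the median of the observed train ages is stored in the imputer.
imputer = imputer.fit(train[['age']])

# Transform train: fill its missing values with the stored median.
train[['age']] = imputer.transform(train[['age']])

# Transform test: fill with the median learned from train, not from test.
test[['age']] = imputer.transform(test[['age']])
```

The fitted value is available as `imputer.statistics_`, which is handy for documenting what was imputed.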

### What are Data Imputation and Data Encoding?
**Imputation**: process of replacing missing data with substituted values. You might choose to impute the mean, median, or mode of a given column to fill any holes or cells missing data. Maybe you want to do something a little more complex like use a linear regression model to predict the missing values that you will impute. Depending on the situation, you might just decide to drop rows or columns that have more than a certain percentage of missing values. Whatever you decide to do, you first need to inspect your dataset for Null values.

**Encoding** is when you convert a string to an integer representation making a categorical value useable in a ML model.

Both imputing and encoding our data are part of **preparing** it for use in Machine Learning models.

# Examining df and cleanup df

In [None]:
#see cols in form of a list and their values
df.head().T 

In [None]:
#check total missing values by column
df.isna().sum()

In [None]:
#check nulls
df.isnull().sum()

In [None]:
#fill in missing numbers
df = df.fillna(0)

In [None]:
#set id or any column as index
df = df.set_index('customer_id')

In [None]:
#shorten col names
df = df.rename(columns={"name_of_col": "new_name", 
                       "another_name_of_col": "new_name", 
                       "one_more_col": "new_name"})

In [None]:
#change dtypes to integers
df.col = df.col.astype('int')

#replace blank spaces w/ '0' (the replacement must be a string) and convert to float
df['col_name'] = df['col_name'].str.replace(' ', '0').astype(float)

In [None]:
#adjust column variables to have values to your liking 
  
df.col = df.col.replace({'some': 0, 
                         'stuff': 1})

### Creating columns

In [None]:
#using .map() function
df['new_column'] = df.column_name_needing_encoding.map({'No': 0, 'Yes': 1})

In [None]:
#create df that holds dummy cols for the column (pass the column name as a string)
dummies = pd.get_dummies(df[['column_name_needing_encoding']], drop_first=True)

#add new dummy cols to original dataframe
df = pd.concat([df, dummies], axis=1)

In [None]:
#combine column values
df['new_col'] = df['col1'] + df['col2']

### Removing unnecessary columns

In [None]:
df = df.drop(columns = ['this', 'that', 
                       'and', 
                       'this', 
                       'as', 'well'])

### create prepare.py file w/ clean data (stuff done to clean the data), return the df and include split function as well

In [None]:
#start from scratch with our original dataframe
import acquire
import prepare

df = acquire.get_telco_data()
df.head(1)

In [None]:
#grab cleaned data frame using the clean_telco function in prepare.py
df = prepare.clean_telco(df)
df.head()

In [None]:
#split the data using the split function in prepare.py
train, validate, test = prepare.split(df, stratify_by="churn")

In [None]:
#check split datasets
print('overall shape of dataframe:', df.shape)
print('train:', train.shape)
print('validate:', validate.shape)
print('test:', test.shape)

#work w/ training data
train.head()

# Tidy Data
### Vocab
- **Value**: every value belongs to a variable and an observation.
- **Variable**: a variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units.
- **Observation**: an observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.





- Each **variable** forms a column.
- Each **observation** forms a row.
- Each cell has a single **value**.





- Data is TABULAR, i.e. made up of rows and columns



### Reshaping Data:
- **Wide data** (a lot of columns spread across) --> **Long Data** format (requires a *melt*)

- **Long data** (a lot of rows spread down) --> **Wide Data** format (requires a *pivot* or *spread*)


### Tidying Messy datasets
1. Column headers are values, not variable names.
2. Multiple variables are stored in one column.
3. Variables are stored in both rows and columns.
4. Multiple types of observational units are stored in the same table.
5. A single observational unit is stored in multiple tables.

## `pd.melt` arguments
- `id_vars` = columns you want to **keep** (not melt)
- `var_name` = name of **new column** you **created by melting columns**
- `value_name` = column **name for resulting values**

## `pd.pivot_table` arguments
- `index` = columns you want to keep (not pivot)
- `columns` = column you want **to pivot**
- `values` = **values** we want to populate in the **new columns**
- `aggfunc` = how you want to **aggregate** the **duplicate rows**
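A small sketch of both reshaping directions, using the arguments listed above. The student scores are made-up data for illustration.

```python
import pandas as pd

# Wide data: one column per year (column headers are values, not variable names).
wide = pd.DataFrame({
    'student': ['Ann', 'Bo'],
    '2020': [90, 75],
    '2021': [95, 80],
})

# Wide -> long: keep `student`, melt the year columns into two new columns.
long = pd.melt(wide, id_vars='student', var_name='year', value_name='score')

# Long -> wide: one row per student, one column per year.
# aggfunc decides how duplicate (student, year) rows would be combined.
wide_again = pd.pivot_table(long, index='student', columns='year',
                            values='score', aggfunc='mean')
```

After the melt, `long` has one row per student per year; the pivot table brings back one column per year, with `student` as the index.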

https://github.com/aliciag92/classification-exercises

# Data Exploration

The goals of exploration are to understand the signals in the data, their strength, the features that drive the outcome, and other features to construct through questions and hypotheses, in order to walk away with modeling strategies (feature selection, algorithm selection, evaluation methods, e.g.) and actionable insight.

In general, we'll be exploring our target variable against the independent, or predictor, variables.


We may explore **individual** variables *before* splitting our data into train, validate, and test datasets, so that we can look at distributions, identify outliers, Nulls, etc.

However, **when looking at interactions of variables, your data should first be split before you explore**. You should also split your data before you scale. 

>This is because your validate and test data should remain unseen as much as possible as it is supposed to be unknown at this stage.

>An important component of Data Science is that peers must be able to replicate what you have done to your data, especially if you are going to deploy your model to be used in the future. By splitting your data into train, validate, and test sets early on in the pipeline, you are also confirming that any processing you completed on your train set is repeatable on your validate and test datasets as well as unseen future sets.



### How to choose the right chart? Go [here](https://eazybi.com/blog/data-visualization-and-chart-types)

In [None]:
# set uniform chart and font sizes at the top of your notebook (optional)
import matplotlib.pyplot as plt

plt.rc('figure', figsize=(11, 9))  # (width, height) in inches

plt.rc('font', size=13)

In [None]:
#subplots (x and y are assumed to be defined already)
# plot 1 of 2; everything up until the next subplot call is drawn on the first plot
plt.subplot(211) # (2 rows, 1 column, plot 1)

plt.plot(x, y)

# plot 2 of 2
plt.subplot(212) # (2 rows, 1 column, plot 2)

plt.plot(y, x)

# call show once, after both subplots are drawn, so they share one figure
plt.show()

### Subplots Using Matplotlib Object-Oriented API
A figure in matplotlib is divided into two different objects:

- The Figure Class: It can contain one or more axes objects.

- The Axes Object: It represents one plot inside of a figure object.

In [None]:
# create figure and axes -> subplots(nrows, ncols)

fig, axes = plt.subplots(2,1)

# plot 1 of 2
axes[0].plot(x, y)

# set components of axes 1/plot 1
axes[0].set(title='My Title')

# plot 2 of 2
axes[1].plot(y, x)

# set components of axes 2/plot 2
axes[1].set(title='My Title')

# manipulate labels of x-ticks (here: rotate them 45 degrees)
axes[1].tick_params(axis='x', labelrotation=45)

# auto-adjust layout
fig.tight_layout()

### Explore the target (train data) and identify features related to whatever it is you stratified by answering key questions. See how features react and interact with one another.

https://seaborn.pydata.org/api.html#


In [None]:
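#visualize correlation of variables with heatmap
#numeric_only=True (pandas 1.5+) skips non-numeric columns, which newer pandas versions no longer drop silently
corr = train.corr(numeric_only=True)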

plt.figure(figsize=(12,12)) #set up figure


sns.heatmap(corr,
            vmin=-1, #set min value for color scale at -1
            vmax=1, #set max value for color scale at 1
            center=0, #center the color scale at 0
            cmap='GnBu', #change default color
            linewidths=.5) #space out each square

plt.show()

In [None]:
#create subplots to visualize initial hypotheses if a feature affects rate of churn
plt.figure(figsize=(10,8))

#Are senior citizens more likely to churn than non-senior citizens?
plt.subplot(221)
sns.countplot(data=train, x='senior', hue="churn")

#Is there a difference in the rate of churn for customers who have streaming services vs customers who do not?
plt.subplot(222)
sns.countplot(data=train, x='streaming_services', hue="churn")

#Does contract type play a role in churn rate?
plt.subplot(223)
sns.countplot(data=train, x='contract_type', hue="churn")

#Do customers with add-ons churn more than customers without any add-ons?
plt.subplot(224)
sns.countplot(data=train, x='add_ons', hue="churn")

plt.show()

In [None]:
#ALWAYS state findings

### Explore categorical and continuous values
[here](https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/)



### Univariate Exploration
- one variable at a time. (not necessary to split, but can still be done w/ train data).
- `plt.hist()` - Visualize the distribution of a single variable.
- `sns.countplot()`
- `sns.displot()`
- `sns.boxplot()`


Examine the distribution of a variable; use the `hue` argument to make it a bivariate exploration
- `sns.scatterplot()`
- `sns.swarmplot()`


### Bivariate & Multivariate Exploration
- looking at relationships between variables
- data **must be split** and use train dataset to keep your unseen data in validate and test unseen. 
- use independent variables (features) to explain the dependent (target) variable
- **Bivariate** looks at two paired data sets, studying whether a relationship exists between them
- **Multivariate** uses 2 or more variables and analyzes which, if any, are correlated w/ a specific outcome. The goal is to determine which variables influence or cause the outcome.

Allow you to see a lot quickly:
- `sns.pairplot()`
- `sns.PairGrid()`

Subplots
- `.hist()`
- `sns.countplot()`



Others
- `sns.scatterplot()` - look for correlation btwn 2 continuous variables; hue allows to bring in a categorical variable
- `sns.catplot()` - explore continuous and a categorical variable; hue allows to bring in another categorical variable and add different parameters
- `sns.lmplot()` - check for a linear fit btwn 2 continuous variables; bring in other dimensions w/ `col` or `hue`
- `sns.boxplot()`
- `sns.swarmplot()`
- `sns.violinplot()`



Use `.groupby()` to explore different aggregations of data

Run Statistical Tests and ALWAYS share findings
- T-test compares the mean of a continuous variable between the groups of a categorical variable
- Correlation (Pearson's Correlation Coefficient) compares two continuous variables
- Chi Square ($\chi^2$) compares categorical vs categorical variables
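A minimal sketch of the three tests using `scipy.stats`. The data is randomly generated stand-in data, with column names borrowed from the Telco examples; it is not real Telco data, so the results mean nothing beyond showing the calls.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for a train sample (hypothetical values).
rng = np.random.default_rng(123)
n = 200
train = pd.DataFrame({
    'churn': rng.integers(0, 2, n),
    'senior': rng.integers(0, 2, n),
    'tenure': rng.integers(1, 73, n),
})
train['monthly_charges'] = 20 + train.tenure * 0.5 + rng.normal(0, 5, n)

# T-test: categorical (churn) vs continuous (monthly_charges).
t, p_t = stats.ttest_ind(train[train.churn == 1].monthly_charges,
                         train[train.churn == 0].monthly_charges)

# Pearson's correlation: continuous vs continuous.
r, p_r = stats.pearsonr(train.tenure, train.monthly_charges)

# Chi-square: categorical vs categorical, run on a contingency table.
observed = pd.crosstab(train.senior, train.churn)
chi2, p_chi, degf, expected = stats.chi2_contingency(observed)
```

Each test returns a p-value to compare against the chosen alpha (commonly 0.05) before stating a finding.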

# Data Modeling

#### Steps
1. Set up X inputs and y target variables for each split
2. Evaluate on training (in-sample) dataset
3. Establish baseline accuracy to determine if having a model is better than no model
4. Fit/transform/evaluate using various classification algorithms (Decision Trees, Random Forests, KNN, and/or Logistic Regression (but many more))
5. Specify different feature selection/hyper-parameters
6. Get scores on validate to compare to the best train models’ scores
7. Use validate scores to tune features/parameters
8. Select best model and test it ONCE
9. Test final model on testing (out-of-sample) dataset
10. Summarize, interpret and document the results

In [None]:
#set up X inputs and y target variable for each split
X_train = train.drop(columns=['churn'])
y_train = train.churn

X_validate = validate.drop(columns=['churn'])
y_validate = validate.churn

X_test = test.drop(columns=['churn'])
y_test = test.churn

In [None]:
#baseline prediction: the most prevalent class in training dataset(the mode)
train.churn.value_counts()

In [None]:
#baseline model would be to predict 0 since it is most prevalent
#baseline accuracy:
baseline_accuracy = (train.churn == 0).mean()

print(f'baseline accuracy: {baseline_accuracy: .2}')

#### Then create models
[Examples](https://github.com/aliciag92/classification-project/blob/main/telco-churn-report.ipynb)

## Decision Tree
- use the training data to train the tree to find a decision boundary to use as a **decision rule** for future data.
- like playing "20 Questions" w/ your features used to predict the target. Each question is a "yes" or "no". The number of questions is the **depth** of your tree.
- Given enough depth, decision trees are **overfitting machines**

- A sequence of rules that can be used to classify 2 or more classes

- Each node represents a single input variable (x) and a split point or class of that variable

- The leaf nodes of the tree contain an output variable (y) which is used to make a prediction.

- Predictions are made by walking the splits of the tree until arriving at a leaf node and output the class value at that leaf node.

**Pros**: Simple to understand/visualize/explain the output. Requires little data preparation (we don't need to encode our *target* variable). Performs well for a broad range of problems.

**Cons**: Can create complex trees that do not generalise well. Can be unstable because small variations in the data might lead to overfitting.


**Classification algorithms use training data to measure the distance between points or the distance around boundaries between points.**

By "learning" the pattern recognition around sets of labeled points, the classifier produces a **decision rule** to use to apply to classify new incoming data.
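A minimal decision tree sketch. The iris dataset from sklearn stands in for real project data, and `max_depth=3` is an arbitrary choice to limit the number of "questions"; both are assumptions for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data: iris features and labels, split into train and validate.
iris = load_iris()
X_train, X_validate, y_train, y_validate = train_test_split(
    iris.data, iris.target, test_size=.3, random_state=123,
    stratify=iris.target)

# max_depth caps how many questions the tree may ask, which limits overfitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=123)
tree.fit(X_train, y_train)

# Print the learned sequence of rules (the decision rule) as text.
print(export_text(tree, feature_names=iris.feature_names))

# Compare in-sample and out-of-sample accuracy to check for overfitting.
train_acc = tree.score(X_train, y_train)
validate_acc = tree.score(X_validate, y_validate)
```

A large gap between `train_acc` and `validate_acc` would suggest the tree is too deep for the data.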

## Random Forest
- a type of **Ensemble** Machine Learning algorithm built on Bootstrap Aggregation, or bagging.

- **Bootstrapping** is a statistical method for estimating a quantity from a data sample, e.g. mean. You take lots of samples of your data, calculate the mean, then average all of your mean values to give you a better estimation of the true mean value. 
- In **bagging**, the same approach is used for estimating entire statistical models, such as decision trees. Multiple samples of your training data are taken and models are constructed for each sample set. When you need to make a prediction for new data, each model makes a prediction and the **predictions are averaged** to give a better estimate of the true output value.

- Random forest is a tweak on this approach where decision trees are created so that rather than selecting optimal split points, suboptimal splits are made by introducing randomness. The models created for each sample of the data are therefore more different than they otherwise would be, but still accurate in their unique and different ways. Combining their predictions results in a better estimate of the true underlying output value.

**Pros**: Reduction in over-fitting, More accurate than decision trees in most cases, Naturally performs feature selection

**Cons**: Slow real time prediction, Difficult to implement, Complex algorithm so difficult to explain
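A minimal random forest sketch, again with iris as stand-in data. The hyperparameter values are illustrative assumptions, not tuned choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data: iris features and labels, split into train and validate.
iris = load_iris()
X_train, X_validate, y_train, y_validate = train_test_split(
    iris.data, iris.target, test_size=.3, random_state=123,
    stratify=iris.target)

# bootstrap=True trains each tree on a resampled copy of train (bagging);
# max_features='sqrt' adds randomness to the splits each tree considers.
rf = RandomForestClassifier(n_estimators=100, max_depth=3, bootstrap=True,
                            max_features='sqrt', random_state=123)
rf.fit(X_train, y_train)

# The forest's prediction aggregates the predictions of all its trees.
validate_acc = rf.score(X_validate, y_validate)

# Feature importances hint at which inputs the trees relied on most.
importances = dict(zip(iris.feature_names, rf.feature_importances_))
```

The individual trees are available in `rf.estimators_` if you want to inspect how they differ.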

## K-Nearest Neighbor
- Makes predictions based on how close a new data point is to known data points.

- Considered a lazy algorithm in that it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple **majority vote** of the k nearest neighbours of each point.

- Predictions are made for a new data point by searching through the entire training set for the **K** most similar instances (the **neighbors**) and summarizing the output variable for those K instances. For regression problems, this might be the mean output variable. **For classification problems this might be the mode (or most common) class value.**

- It is important to define a metric to measure how similar data instances are. Euclidean distance can be used if attributes are all on the same scale (or you convert them to the same scale).
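
The majority vote described above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements `KNeighborsClassifier`; the function name `knn_predict` and the toy points are hypothetical, and the features are assumed to already be on the same scale.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, new_point, k=3):
    """Classify new_point by majority vote of its k nearest training points."""
    # Euclidean distance from the new point to every stored training point
    distances = np.linalg.norm(X_train - new_point, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # The most common class among those neighbors (the mode) is the prediction
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy training data: two well-separated groups, made up for illustration
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])

pred_a = knn_predict(X_train, y_train, np.array([0.5, 0.5]))
pred_b = knn_predict(X_train, y_train, np.array([5.5, 5.5]))
print(pred_a, pred_b)
```

Note that nothing is "trained" here: the function just keeps the training data and does all of its work at prediction time, which is what makes the algorithm lazy.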

**Pros**: Simple to implement, Robust to noisy training data, Effective if training data is large, Performs calculations "just in time", i.e. when a prediction is needed (as opposed to ahead of time), Training instances can be updated and curated over time to keep predictions accurate.

**Cons**: You need to determine the value of K. The computation cost is high because the distance from each new instance to every training sample must be computed, so you need to hang on to your entire training dataset. Distance can break down in very high dimensions, negatively affecting performance. This is known as the "Curse of dimensionality". To alleviate it, only use the input variables that are most relevant to predicting the output variable.


## Logistic Regression
- uses the logistic (sigmoid) function to map any real value to a number between 0 and 1, which is read as the probability that an observation is in the positive class, 1.
- used for predicting discrete outcomes (binomial and multinomial)
- Overall, makes a great baseline model because of the quick and easy implementation and ease of interpretation.
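
To make the "real value to a number between 0 and 1" mapping concrete, here is a small sketch of the sigmoid function. The intercept, coefficients, and observation values are hypothetical numbers chosen for illustration, not fitted values from any model.

```python
import numpy as np


def sigmoid(z):
    """Map any real value to a number between 0 and 1."""
    return 1 / (1 + np.exp(-z))


# Hypothetical fitted intercept and coefficients for two features
intercept = -1.0
coefficients = np.array([0.8, -0.5])

# One hypothetical observation with two feature values
observation = np.array([2.0, 1.0])

# Linear combination of the features (the log-odds)
log_odds = intercept + coefficients @ observation

# Probability that the observation belongs to the positive class, 1
probability = sigmoid(log_odds)

# With a threshold of 0.5, probabilities at or above it are classified as 1
predicted_class = int(probability >= 0.5)
print(f"log-odds: {log_odds:.2f}, probability: {probability:.3f}, class: {predicted_class}")
```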

**Pros**: Easy to interpret, fast to train and predict making this a great first classification model to try.

**Cons**: Not as easy to show in a picture as a Decision Tree Classifier, assumes the X predictors are not highly correlated with one another, multi-class classification gets more complicated to interpret and explain.

[interview questions about Logistic Regression](https://medium.com/analytics-vidhya/interview-questions-on-logistic-regression-1ebd1666bbbd)

# Data Evaluation

**confusion matrix**: cross-tabulation of our model's predictions against the actual values.
- **True Positive**: number of occurrences where y is true and y is predicted true.
- **True Negative**: number of occurrences where y is false and y is predicted false.
- **False Positive**: number of occurrences where y is false and y is predicted true.
- **False Negative**: number of occurrences where y is true and y is predicted false.
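
As a quick illustration of the four counts, the sketch below builds a confusion matrix with scikit-learn. The `y_true` and `y_pred` lists are made-up labels, with 1 standing for the positive class.

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual values and columns are predictions, with labels sorted,
# so for binary labels 0/1 the layout is [[TN, FP], [FN, TP]]
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```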

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
## Common Evaluation Metrics
### Accuracy: 
- tells you how often your classifier is *predicting correctly overall*. 
- the ratio of all of your correct predictions over all of your observations.
> (TP + TN ) / (TP + TN + FP + FN) --> total number correct / total number of data points

****
    
    
### Recall/Sensitivity/True Positive Rate: 
- tells you how often your classifier is catching the positive cases in your dataset. 
- describes how good the model is at predicting the *positive* class when the actual outcome is *positive*. 
- tells you what percentage of the time your model is identifying the relevant instance in your dataset
> TP / (TP + FN) --> total number of correct positive predictions / total number of actual positive observations in dataset
- You have to decide which is more important in your situation. Is it more expensive in your specific context to miss a positive instance? If so, optimize for Recall.
- **The Higher the Recall Score == The better your Classifier is at catching the actual positive cases in your dataset.**
- **The Lower the Recall Score == The more your Classifier is making Type II Errors / Misses / False Negatives.**
    
>>As you decrease your threshold, the Recall of your model increases. This is a good idea when you want to decrease your Type II errors or False Negatives. When it's more costly to miss a positive, you might decrease your threshold a bit.

- **If your Recall score is high**, your model didn’t miss a lot of positives; it's good at catching positive observations or instances.
>For example, if churn is your positive class, and your Recall score is high, your model is good at identifying customers who are positive for churn or actually churning.

- As **your Recall score gets lower**, your model is catching fewer of the positives that are actually there; you are missing Positive observations or instances.
>For example, if churn is your positive class, and your Recall score is low, your model is not good at identifying customers who are churning. It is predicting a lot of False Negatives; you are missing the opportunity to find and woo customers who are going to churn. These are Misses, missed opportunities to identify and keep customers who are actually positive for churn. These are **Type II Errors**.

**You want to optimize for recall when missed positives (False Negatives) are expensive.**
    
****    
    
### Precision/Positive Predictive Value:
- tells you how often your model was able to **predict positives correctly**
- the proportion of observations your model predicts to be positive that were *actually positive* and not false alarms or false positives.
- If your model's Precision score goes up, the cost is usually that your model's Recall score goes down. 
- You have to decide which is more important in your situation. Is it more costly to falsely predict that an instance or observation is the positive class (false alarm) than to miss a Positive instance? If so, optimize for precision.
> TP / (TP + FP) --> total number of correct positive predictions / total number of observations predicted as positive by model
- **The Higher the Precision Score == The better your Classifier is at predicting positives correctly in your dataset.**
- **The Lower the Precision Score == The more your Classifier predicted a lot of positives where there were none, False Positives / Type I Errors / False Alarms**
    
>>As you increase your threshold, the Precision of your model increases. This is a good idea when you want to decrease your Type I errors or False Positives. When it's more costly to falsely identify an observation as a positive case that is actually a negative case, you might want to increase your threshold a bit.    

- The **higher your precision score** is, the better your model is at **Predicting Positives Correctly**! It avoids predicting a lot of False Positives (false alarms), but it is missing more of the Actual Positives, too.
- As your **Precision score gets lower**, your model predicted a lot of False Positives (false alarms) or positives where there were none. These are **Type I Errors**.

**You want to optimize for Precision when False Positives are more expensive than False Negatives.**
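
The threshold trade-off described in the Recall and Precision sections can be seen in a small sketch. The labels and predicted probabilities below are made up so the effect is easy to follow; on real data the movement is usually less tidy, and precision in particular is not guaranteed to rise at every step.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up actual labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
probs = np.array([0.1, 0.2, 0.4, 0.45, 0.55, 0.6, 0.8, 0.9])

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict positive whenever the probability meets the threshold
    y_pred = (probs >= threshold).astype(int)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    results[threshold] = (precision, recall)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy data, lowering the threshold to 0.3 catches every positive (fewer Type II errors) at the price of more false alarms, while raising it to 0.7 removes the false alarms (fewer Type I errors) but misses half of the positives.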

****
    
### F1-Score:
- The balanced harmonic mean of Recall and Precision, giving both metrics equal weight. 
> 2 * (Precision * Recall) / (Precision + Recall)
- The higher the F1-Score is, the better when you are looking to optimize for both Recall and Precision.
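
The metrics above can all be worked out by hand from the four confusion matrix counts. The counts below are hypothetical, chosen only so the arithmetic is easy to check.

```python
# Hypothetical confusion matrix counts, chosen only for illustration
tp, fp, fn, tn = 40, 10, 20, 30

# Accuracy: all correct predictions over all observations
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Recall: correct positive predictions over all actual positives
recall = tp / (tp + fn)

# Precision: correct positive predictions over all predicted positives
precision = tp / (tp + fp)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}, "
      f"precision={precision:.2f}, f1={f1:.2f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, the F1-Score here sits below the simple average of precision and recall.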
    
    
****    
    
### Support
- the number of actual occurrences of each class in the data being evaluated
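
Support shows up as the last column of scikit-learn's `classification_report`. The sketch below uses made-up labels with five actual negatives and three actual positives.

```python
from sklearn.metrics import classification_report

# Made-up labels: five actual 0s and three actual 1s
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 1, 0]

# The text report lists precision, recall, f1-score, and support per class
print(classification_report(y_true, y_pred))

# output_dict=True returns the same numbers as a nested dictionary
report = classification_report(y_true, y_pred, output_dict=True)
support_by_class = {label: report[label]["support"] for label in ("0", "1")}
print(support_by_class)
```

Support depends only on the actual labels, so it stays the same no matter what the model predicts.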

****


### ROC Curve (Receiver Operating Characteristic Curve): 
- summarizes the trade-off between the true positive rate and false positive rate for a predictive model using different probability thresholds. 
- The higher the area under the curve, the better your model is at separating or predicting positive and negative classes. 
- ROC curves are appropriate when the observations are balanced between each class in a binary classification problem.
- It is a plot of the false positive rate (x-axis) versus the true positive rate (y-axis) for a number of different candidate threshold values between 0.0 and 1.0. Put another way, it plots the false alarm rate versus the hit rate.
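
As a minimal sketch, scikit-learn's `roc_curve` returns the false positive rate and true positive rate at each candidate threshold, and `roc_auc_score` gives the area under that curve. The labels and probabilities below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up actual labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
probs = np.array([0.1, 0.2, 0.4, 0.45, 0.55, 0.6, 0.8, 0.9])

# One (false positive rate, true positive rate) pair per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, probs)

# Area under the ROC curve: closer to 1 means better separation of the classes
auc = roc_auc_score(y_true, probs)

for rate_fp, rate_tp in zip(fpr, tpr):
    print(f"false positive rate={rate_fp:.2f}, true positive rate={rate_tp:.2f}")
print(f"AUC={auc:.4f}")
```

Plotting `fpr` on the x-axis against `tpr` on the y-axis, for example with `plt.plot(fpr, tpr)`, draws the curve itself.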





# Deliver

### Generate CSV
1. Create a predictions dataframe to display the probability of churn
2. Apply appropriate column names for readability
3. Add the customer_id from the original dataframe to the new predictions dataframe
4. Convert to a csv file

### Conclusion
- key takeaways
- recommendations based on what was found

In [None]:
#get probabilities from test sample w/ specified features
probs = knn2.predict_proba(X_test[features])

#create a dataframe of the probabilities
predictions = pd.DataFrame(probs)

#add customer_id index
predictions.index = X_test.index

#rename columns (predict_proba orders its columns by knn2.classes_, assumed here to be no churn, then churn)
predictions.columns=["probability_no_churn", "probability_churn"]

#create a new column w/ predictions from test data
predictions["predict_churn"] = y_pred_test

#take a look at dataframe of customer_id, probability of churn, and prediction of churn
predictions.head()

In [None]:
#convert dataframe to a csv
predictions.to_csv("predictions.csv")

## Important imports 


In [2]:
import numpy as np 
import pandas as pd

# visualize
import matplotlib.pyplot as plt
import seaborn as sns 
import graphviz
from graphviz import Graph

# turn off pink warning boxes
import warnings
warnings.filterwarnings("ignore")

# .py modules to acquire and prep the data
#import acquire
#import prepare

# hypothesis tests for data exploration
from scipy.stats import chi2_contingency
from scipy.stats import ttest_1samp
from scipy.stats import ttest_ind

# train, validate, test
from sklearn.model_selection import train_test_split

# evaluating models
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_fscore_support 

# creating models for classification ML:
# Decision Tree  
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz

# Random Forest
from sklearn.ensemble import RandomForestClassifier

# K-Nearest Neighbor(KNN)  
from sklearn.neighbors import KNeighborsClassifier

# Logistic Regression
from sklearn.linear_model import LogisticRegression