### Machine Learning Workflow & Intro to Data Preparation & Feature Engineering

Slides Modified from Monique.

#### Agenda

1. Machine Learning Workflow
2. Data Preparation
    - Dealing with Outliers
    - Dealing with Null Values
    - Variable Transformations
3. Feature Engineering Overview and Examples

------------
## Machine Learning Workflow

<img src='imgs/data-science-explore.png' width=800>

- Iterative process
- Non-linear process
- Lots of judgement and refining along the way
- Lots of time spent in data prep
- "Big data": a lot of time can be spent in data retrieval

Source: Practical Machine Learning with Python, Apress/Springer

---------------
**Machine Learning Workflow**

<img src='imgs/data-science-explore.png' width=600>

### Data Retrieval
- SQL, APIs, Web Scraping, csv, Excel...
- Could include combining some of the above
- Also called "Data Ingestion"
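
As a minimal sketch of combining two sources, the example below reads one table from CSV text and another from a SQL query, then joins them with `pandas`. The table names, columns, and values are made up for illustration; an in-memory string and an in-memory SQLite database stand in for a real file and a real database.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV source (an in-memory string stands in for a file on disk)
csv_text = "store_id,city\n1,Toronto\n2,Ottawa\n"
stores = pd.read_csv(io.StringIO(csv_text))

# Hypothetical SQL source (an in-memory SQLite database stands in for a real one)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10.0), (1, 5.5), (2, 7.25)])
sales = pd.read_sql_query("SELECT store_id, amount FROM sales", conn)
conn.close()

# Combine the two sources on their shared key
combined = sales.merge(stores, on="store_id", how="left")
print(combined)
```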

------------

**Machine Learning Workflow**

<img src='imgs/data-science-explore.png' width=600>

### Data Preparation
- **Processing and Wrangling**: You became `pandas` experts last week.
- **Feature extraction and engineering**: Will go over this today. What features (i.e., variables, `x`) do I need for my problem?
- **Feature selection**: To be covered later today.

------------

**Machine Learning Workflow**

<img src='imgs/data-science-explore.png' width=600>

### Modeling (i.e., machine learning)
- `scikit-learn` being the main basic package
- Other packages for deep learning
- Supervised vs. unsupervised learning
- "Build a model"

------------

**Machine Learning Workflow**

<img src='imgs/data-science-explore.png' width=600>

### Machine Learning Algorithm
- **"Algorithm"**: series of steps based on rules that a computer takes to calculate something
- Within supervised:
    - Regression: `y` is a continuous number (e.g., price)
    - Classification: `y` is discrete (e.g., customer retained or not)
- Examples: decision trees, linear regression, neural networks
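
To make the regression vs. classification distinction concrete, here is a small sketch using `scikit-learn` on made-up data. The numbers are invented for illustration only; the point is that the same `x` can feed a model with a continuous `y` or a discrete `y`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Made-up data: a single feature x
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Regression: y is a continuous number (e.g., price)
y_price = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
reg = LinearRegression().fit(X, y_price)
pred_price = reg.predict([[7.0]])

# Classification: y is discrete (e.g., customer retained = 1, not retained = 0)
y_retained = np.array([0, 0, 0, 1, 1, 1])
clf = DecisionTreeClassifier(random_state=0).fit(X, y_retained)
pred_class = clf.predict([[7.0]])

print("Predicted price:", pred_price[0])
print("Predicted class:", pred_class[0])
```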
    

------------

**Machine Learning Workflow**

<img src='imgs/data-science-explore.png' width=600>

### Model Evaluation & Tuning
- Our first model will probably not be the best model; we need to compare candidates and pick one
- **Evaluation**: Using metrics to pick the best model for the use case
- **Tuning**: Besides picking between algorithms, there are 'knobs' / settings to 'tune' a model for a specific algorithm
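
A rough sketch of evaluation and tuning together is shown below. It assumes the iris toy dataset and a decision tree purely for illustration (neither is part of this lesson's demos), tunes one 'knob' (`max_depth`) with cross-validation, and then scores the chosen model on held-out data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset; any labelled dataset would do
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Tuning: try several values of one setting and compare them with 5-fold cross-validation
param_grid = {"max_depth": [1, 2, 3, 5]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

# Evaluation: score the best setting on data the model has not seen
test_accuracy = search.score(X_test, y_test)
print("Best settings:", search.best_params_)
print("Held-out accuracy:", test_accuracy)
```

The metric (`accuracy` here) is itself a choice that depends on the use case.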

------------

**Machine Learning Workflow**

<img src='imgs/data-science-explore.png' width=600>

### Deployment & Monitoring
- We picked a model and it's ready for use by our users
- Be careful about concept drift
- Models sometimes need to be re-trained

-----------

## Types of Questions


| Type of question | Description | Example |
|:---|:--------------------------|:----------------|
| **Descriptive** | Summarize a characteristic of a set of data| Proportion of males, the mean number of servings of fresh fruits and vegetables per day |
| **Exploratory** | Analyze the data to see if there are patterns, trends, or relationships between variables; “hypothesis-generating” analyses|If you had a general thought that diet was linked somehow to viral illnesses, start by examining relationships between a range of dietary factors and viral illnesses|
| **Inferential** | Testing a hypothesis, statistically |Analyzing data for a subset / sample of the population and generalizing insights for the general population; Is there a higher incidence of cancer for women than for men?|
| **Predictive**  | Predicting a value, not necessarily figuring out why| Predicting cancer diagnosis from x-rays using computer vision|
| **Causal**      | Whether changing one factor will change another factor | Does changing diet lead to higher incidence of cancer?|
| **Mechanistic** | Understanding *how* one factor changes another | How does diet lead to higher incidence of cancer? |

-------------------

## Data Preparation

The main goal of this phase is to prepare the data for exploratory data analysis, inferential analysis, or prediction (modelling). In other words, we make sure our data is in good shape: we have treated missing values, dealt with weird data, and cleaned it up.

Common Data Preparation Techniques:
- Outlier Detection
- Handling Null Values
- Variable Transformation


### Outlier Detection and Handling Outliers

- Data is not always right
- Could be human error, could be system error
- **Outlier**: an observation point that is distant from other observations
- Outliers help point us to what might be wrong
- **Some errors are obvious; many require interviewing the domain experts to figure out**

Note: Before deleting outliers, ask yourself whether removing them is necessary. It depends on your use case and your business problem; in some problems the outliers are the important part. An example is fraud detection, where the unusual records are exactly what we are looking for.

### Outlier Detection: demo
- Docs: https://scikit-learn.org/stable/datasets/index.html#boston-dataset
- [Example source](https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba)

In [None]:
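# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this demo needs an older scikit-learn version.
from sklearn.datasets import load_boston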
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")




- CRIM per capita crime rate by town

- ZN proportion of residential land zoned for lots over 25,000 sq.ft.

- INDUS proportion of non-retail business acres per town

- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

- NOX nitric oxides concentration (parts per 10 million)

- RM average number of rooms per dwelling

- AGE proportion of owner-occupied units built prior to 1940

- DIS weighted distances to five Boston employment centres

- RAD index of accessibility to radial highways

- TAX full-value property-tax rate per $10,000

- PTRATIO pupil-teacher ratio by town

- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

- LSTAT % lower status of the population

- MEDV Median value of owner-occupied homes in $1000’s



In [None]:
# documentation: https://scikit-learn.org/stable/datasets/toy_dataset.html#boston-dataset
boston = load_boston()
x = boston.data
y = boston.target
columns = boston.feature_names

#create the dataframe
boston_df = pd.DataFrame(boston.data)
boston_df.columns = columns
boston_df.head()

**Method 1: Summary of the data**

- Use your intuition
- Ask a domain expert

In [None]:
boston_df.describe()

**Method 2: Visualizing a Single Variable**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot(x=boston_df['DIS']);
plt.show()
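
By default, the points a boxplot draws beyond its whiskers are the values more than 1.5 times the interquartile range (IQR) outside the box. The sketch below applies that same rule numerically to a made-up series, so the flagged values can be listed rather than just seen on a plot.

```python
import pandas as pd

# Made-up values with one suspiciously large observation
values = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 12.0])

# The box spans the first to third quartile
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Default whisker limits: 1.5 * IQR beyond the box
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)
```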

**Method 3: Visualizing Multi-Variables**

In [None]:
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(boston_df['CRIM'], y)
ax.set_xlabel('Per capita crime rate by town')
ax.set_ylabel('Median value of owner-occupied homes in $1000’s')
plt.show()

**Method 4: Z-Score**

One way to detect outliers is to flag values whose z-score has an absolute value greater than 3. The z-score measures how many standard deviations a value lies from the mean.

- A z-score of 0 means the value equals the mean.
- A z-score of 1 means the value is 1 standard deviation above the mean (-1 means 1 standard deviation below).
- A z-score of 2 means the value is 2 standard deviations above the mean.
- A z-score of 3 means the value is 3 standard deviations above the mean.
- **A z-score above 3 (or below -3) means the value is more than 3 standard deviations from the mean. Data scientists often label such values as outliers.**

In [None]:
from scipy import stats

#Finding Z Score on Column
stats.zscore(boston_df['ZN'])

#Turning Absolute
np.abs(stats.zscore(boston_df['ZN']))

#(np.abs(stats.zscore(boston_df['ZN'])) > 3) 
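
The steps above can be put together into a filter. The sketch below uses synthetic data (200 roughly normal values plus one planted extreme value) rather than the Boston data, so the column name and numbers are illustrative only.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: 200 values around 50, plus one planted extreme value
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": np.append(rng.normal(50, 5, size=200), 500.0)})

# Absolute z-score for every row
z = np.abs(stats.zscore(df["value"].to_numpy()))

# Boolean mask: True where the value is more than 3 standard deviations from the mean
is_outlier = z > 3

print("Rows flagged:", int(is_outlier.sum()))
print(df[is_outlier])

# Keep only the rows that were not flagged
df_no_outliers = df[~is_outlier]
```

Note that the mean and standard deviation are themselves pulled by extreme values, so this method tends to work best when outliers are few.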

## Solution?

- Can drop the observation
- Can replace the outlier
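
Both options can be sketched on a small made-up table. The threshold of 100 below is a hypothetical rule standing in for whatever detection method was used; capping and median replacement are just two of several possible replacement strategies.

```python
import pandas as pd

# Made-up prices with one extreme value
df = pd.DataFrame({"price": [10.0, 12.0, 11.0, 13.0, 500.0]})

# Hypothetical detection rule: anything above 100 is treated as an outlier
is_outlier = df["price"] > 100

# Option 1: drop the observation
dropped = df[~is_outlier]

# Option 2a: replace by capping at the threshold
capped = df.copy()
capped["price"] = capped["price"].clip(upper=100)

# Option 2b: replace with the median of the non-outlier values
median_replaced = df.copy()
median_replaced.loc[is_outlier, "price"] = df.loc[~is_outlier, "price"].median()

print(dropped, capped, median_replaced, sep="\n\n")
```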

-----------
## Handling Null Values

We will often be handed data with missing or corrupted values. Most commonly, missing data is represented as NaN ("Not a Number"), which is how pandas marks a blank element.

- Missing values can come from a system error, or the value may simply never have been captured.
- There are techniques to deal with missing data, but all of them are imperfect.

Resource: https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/

### Null values: Demo
- Dataset: https://www.kaggle.com/uciml/pima-indians-diabetes-database/data#

In [None]:
import pandas as pd
diabetes_df = pd.read_csv('diabetes.csv')
diabetes_df

In [None]:
diabetes_df.isnull().sum()

In [None]:
#Check percentage of data missing for each feature/column
round(100*(diabetes_df.isnull().sum()/len(diabetes_df)),2)

In [None]:
diabetes_df.info()

### Null values: Summary of the data
- Sometimes null values aren't exactly NaNs
- They are encoded as -1 or 9999 etc.
- Sometimes it's 0. 
- Does 0 make sense for some of these categories??

In [None]:
diabetes_df.describe()
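
When missing values are encoded with sentinel numbers such as -1 or 9999, one option is to convert them to real NaNs so that summary statistics ignore them. The sketch below uses a made-up table; which codes count as "missing" is an assumption that has to be checked against the data's documentation.

```python
import numpy as np
import pandas as pd

# Made-up data where -1 and 9999 were used to mean "unknown"
raw = pd.DataFrame({
    "age": [34, -1, 51, 9999],
    "income": [52000, 61000, -1, 48000],
})

# The raw mean is badly distorted by the sentinel values
print("Mean age before:", raw["age"].mean())

# Replace the sentinel codes with NaN
cleaned = raw.replace([-1, 9999], np.nan)

print(cleaned.isnull().sum())
print("Mean age after:", cleaned["age"].mean())
```

If the codes are known up front, `pd.read_csv` also accepts an `na_values` argument that does this conversion while loading.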

### Null values: Encoding true NaNs as NaNs
- Won't be used in summary calculations (e.g., average, count)
- Some columns have a lot of what we think could be missing values

In [None]:
cols_missing_vals = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI'] # cols with missing values
(diabetes_df[cols_missing_vals] == 0).sum() # count number of 0s

In [None]:
diabetes_df[cols_missing_vals] = diabetes_df[cols_missing_vals].replace(0, np.nan) # replace 0's with NaNs (np.NaN was removed in NumPy 2.0)
diabetes_df.isnull().sum()

### Null values: Dropping Missing Values
- Could be a good idea if there aren't too many records removed
- Let's do this for Glucose and BMI columns

In [None]:
print("Shape before dropping NAs", diabetes_df.shape)

diabetes_df = diabetes_df.dropna(subset=['Glucose', 'BMI']) # drop rows where Glucose or BMI is NaN

print("Shape after dropping NAs for Glucose and BMI columns", diabetes_df.shape)

### Null values: using the average

In [None]:
# Fill in missing values with the average
diabetes_df['SkinThickness'] = diabetes_df['SkinThickness'].fillna(value=diabetes_df['SkinThickness'].mean())
diabetes_df.isnull().sum()
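
The average is sensitive to extreme values, so the median is a common alternative when a column is skewed. The sketch below compares the two on a made-up column and also shows `SimpleImputer` from `scikit-learn`, which does the same job; the column name and numbers are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up skewed column with one missing value
df = pd.DataFrame({"insulin": [80.0, 90.0, np.nan, 100.0, 600.0]})

# Mean imputation: pulled upward by the 600
mean_filled = df["insulin"].fillna(df["insulin"].mean())

# Median imputation: closer to a typical value
median_filled = df["insulin"].fillna(df["insulin"].median())

# Equivalent median imputation with scikit-learn (returns a NumPy array)
imputer = SimpleImputer(strategy="median")
sk_filled = imputer.fit_transform(df[["insulin"]])

print("Mean fill:", mean_filled.iloc[2])
print("Median fill:", median_filled.iloc[2])
```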

## Data Preparation: Variable Transformation

- Basic transformations (e.g., logarithmic (making it more normally distributed))
- Binning (e.g., grouping numbers into bins)
- Scaling (e.g., setting everything between 0 and 1)
- Dummy variables (e.g., turning categories into multiple columns of binary variables) - BE CAREFUL

We will learn more when we get into the `scikit-learn` library and dive into unsupervised and supervised learning.
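
As a quick preview, the sketch below applies a log transform, 0-to-1 scaling, and dummy variables to a made-up table using only `pandas` and `numpy`. One common reason to be careful with dummy variables (an assumption about what the warning above refers to) is that keeping a column for every category makes the columns perfectly redundant, which can cause problems for linear models; `drop_first=True` is one way to avoid that.

```python
import numpy as np
import pandas as pd

# Made-up data: a skewed numeric column and a categorical column
df = pd.DataFrame({
    "income": [20000.0, 40000.0, 60000.0, 1000000.0],
    "color": ["red", "green", "blue", "green"],
})

# Log transform: compresses the long right tail (log1p computes log(1 + x))
df["log_income"] = np.log1p(df["income"])

# Min-max scaling: rescales values to lie between 0 and 1
income_min, income_max = df["income"].min(), df["income"].max()
df["income_scaled"] = (df["income"] - income_min) / (income_max - income_min)

# Dummy variables: one binary column per category, dropping the first category
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)

print(df)
print(dummies)
```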

In [None]:
#Binning with q-cut
pd.qcut(diabetes_df['Age'], q = 4).value_counts()

In [None]:
#Binning with cut
pd.cut(diabetes_df['Age'], bins = 4).value_counts()
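
The difference between the two is easier to see on a small made-up series: `pd.cut` makes bins of equal width, while `pd.qcut` makes bins that hold roughly the same number of rows.

```python
import pandas as pd

# Made-up ages: mostly young, with one much older value
ages = pd.Series([21, 22, 23, 24, 25, 30, 40, 81])

# Equal-width bins: each bin covers the same range of ages
equal_width = pd.cut(ages, bins=4)

# Equal-frequency bins: each bin holds about the same number of rows
equal_freq = pd.qcut(ages, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

With skewed data, equal-width bins can end up nearly empty, whereas quantile bins stay balanced but cover very different ranges.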

### Cool Data Analysis Tool - Pandas Profiling

Very useful! Great for exploratory data analysis.

`conda install -c conda-forge pandas-profiling`

Alternatives to Pandas Profiling: Sweetviz

Check out more here: https://towardsdatascience.com/data-frame-eda-packages-comparison-pandas-profiling-sweetviz-and-pandasgui-bbab4841943b

In [None]:
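# Note: pandas-profiling has been renamed ydata-profiling; with the newer
# package, use `from ydata_profiling import ProfileReport` instead.
from pandas_profiling import ProfileReport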
prof = ProfileReport(diabetes_df)
prof.to_file(output_file='output.html')

## Feature Engineering

- A key part of any data science job is figuring out which parts of the data are relevant to our desired outcome.
- The goal is to build the simplest model possible with the highest predictive power.
- Example: If we determine that sales at a cafe are driven by two variables, price and the weather, we have a lot more predictive power and leverage than with a model with thousands of variables.
- However, sometimes a model with a thousand variables is needed to explain the data.

- Feature engineering is like making an argument in an essay. There are a lot of things with varying relevance that could be included; the hard part is choosing the most relevant and correct ones and synthesizing different arguments into one.

- The best features are domain and problem specific. 
- Good features ideally:
    - Capture most important aspects of a problem
    - Allow learning with a few examples
    - Generalize to new scenarios

**Examples:**

1. Taking a date and extracting out the week number, weekday, month etc.
    - Sales are often based on seasonality. 
2. Taking freeform text (tweets) and extracting the number of words, hashtags, emojis, counts of specific words, etc.
    - Text "metadata" can sometimes help with sentiment analysis.
3. Taking geographical coordinates and deriving continent, country, urban vs. rural.
    - Housing price can depend on features extracted from geographical coordinates.
4. When predicting NBA games, we might extract the stats of the players and coaches, look at recent games, and note whether a game is home or away.
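
The first two examples can be sketched with `pandas` on a made-up table. The column names and tweets below are invented for illustration.

```python
import pandas as pd

# Made-up data: a date column and a freeform text column
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-25", "2021-07-04"]),
    "tweet": ["Loving the new #coffee blend #yum", "Rainy day again"],
})

# Date features: useful when the outcome is seasonal
df["month"] = df["date"].dt.month
df["weekday"] = df["date"].dt.day_name()
df["week_number"] = df["date"].dt.isocalendar().week
df["is_weekend"] = df["date"].dt.dayofweek >= 5  # Monday=0, ..., Sunday=6

# Text "metadata" features
df["word_count"] = df["tweet"].str.split().str.len()
df["hashtag_count"] = df["tweet"].str.count("#")

print(df)
```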



**Feature Engineering vs. Feature Selection**

Through feature engineering we usually add more features to our data, which makes it more complex. In feature selection, we try to choose the best features and remove those that do not add anything to our model. One common method is to remove features that have a low variance.
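
A minimal sketch of low-variance feature removal, using `VarianceThreshold` from `scikit-learn` on a made-up table, is shown below. The threshold of 0.2 is an arbitrary value chosen for this example.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Made-up features: one constant, one nearly constant, one that varies
X = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1],
    "almost_constant": [0, 0, 0, 0, 1],
    "informative": [1, 5, 2, 8, 4],
})

# Keep only features whose variance is above the threshold
selector = VarianceThreshold(threshold=0.2)
selector.fit(X)

kept = X.columns[selector.get_support()].tolist()
print("Variances:", selector.variances_)
print("Kept features:", kept)
```

Variance depends on the scale of each column, so a threshold that suits one feature may not suit another measured in different units.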

## Feature Engineering Exercise (15 minutes)

You're presented with the data below.

Think of at least 5 features you might add.

**Note: For this exercise you will be creating new columns.**

In [None]:
import pandas as pd
import numpy as np

retail_df = pd.DataFrame([['Protein Bar', '25-01-2021', 2.99, 1024, 1],
                          ['Oat Milk', '25-01-2021', 3.99, 729, 1],
                          ['Banana', '25-01-2021', 1.99, 256, 1]],
                         columns=['Item', 'Date', 'Price', 'Sales', 'Store Id'])

# Dates are day-month-year, so pass the format explicitly to avoid ambiguous parsing
retail_df['Date'] = pd.to_datetime(retail_df['Date'], format='%d-%m-%Y')
retail_df.head()