# Coaching Session

Notes, aspects, questions covered during the coaching session.

# Business Perspective

- Frame your problem in business terms
- From a business perspective
    - What are your requirements?
    - What are the implications of a working/failing model?
    - ...
- ...

Put your work in a context.

# Project Report Template

Do not forget: There is a minimal template for the project report
    
https://github.com/caichinger/MLMNC2020/tree/master/projects#report
    
Feel free to devise your own, but make sure you cover the aspects mentioned in the template.

# Notebooks

When you share a notebook with us, make sure it can be accessed by anyone with the link.

# Inference vs. Prediction

- Inference = Understand what is going on?
- Prediction = What happens if?

Albeit related, these are different questions.
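
The difference between the two questions can be made concrete with a small sketch. The data below is synthetic and the choice of a linear model is an assumption made purely for illustration: inspecting the fitted coefficients is an inference-style question, while calling `predict` on a new point is a prediction-style question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3 * x1 - 2 * x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# Inference: what is going on? Inspect the fitted relationship.
coefficients = model.coef_

# Prediction: what happens if? Ask the model about a new, unseen point.
x_new = np.array([[1.0, 1.0]])
y_new = model.predict(x_new)
```

The same fitted model serves both purposes here, but the way you evaluate it differs: for inference you care about whether the coefficients are trustworthy, for prediction you care about the error on unseen data.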

# Feature Scaling

This note applies to our context, not in general.

Feature scaling is relevant if you use a linear model and want to infer variable importance based on the coefficients.

For kNN see below.
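
For the linear-model case, a minimal sketch of why scaling matters when reading coefficients as importance. The feature names `income` and `age` and all numbers are hypothetical. By construction, both features contribute equally per standard deviation, yet the raw coefficients differ by orders of magnitude.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on very different scales
income = rng.normal(50_000, 10_000, size=500)  # large numbers
age = rng.normal(40, 10, size=500)             # small numbers
X = np.column_stack([income, age])

# By construction, one standard deviation of either feature moves y by about 1
y = (income - 50_000) / 10_000 + (age - 40) / 10 + rng.normal(scale=0.1, size=500)

# Coefficients on the raw features depend on the units of each feature
raw_coef = LinearRegression().fit(X, y).coef_

# Coefficients on standardized features are comparable to each other
X_scaled = StandardScaler().fit_transform(X)
scaled_coef = LinearRegression().fit(X_scaled, y).coef_
```

Comparing `raw_coef` with `scaled_coef` shows that only the scaled coefficients reflect the equal contribution of the two features.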

# Non-Numerical Data & Categorical Data I

Clearly, we need to convert non-numerical variables into numerical variables.
There are different approaches to doing that.

Whenever a variable only takes discrete values, we say it is categorical.

Some models, such as tree-based models, can handle categorical data directly.
Others, such as models based on linear regression, cannot. In the latter case, we need one-hot-encoding.

For kNN see below.

---

Although similar, turning non-numerical data into numerical data and encoding categorical data are not the same.

# kNN

The "near" in k nearest neighbors implies that we are able to compute distances between points. How do we compute distances when we have a mix of real-valued and integer- or boolean-valued variables? That is a bit tricky. If you want to use kNN, keep it simple: use one-hot-encoding and feature scaling, and stick to the model's default metric.
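
One possible way to wire this recipe together is a scikit-learn pipeline. This is a sketch under assumptions: the toy data, the column names, and `n_neighbors=3` are made up for illustration, and `KNeighborsClassifier` is left on its default metric.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one real-valued and one categorical column
X = pd.DataFrame({
    'real_valued': [0.1, 0.2, 0.3, 10.0, 9.5, 0.15],
    'string_valued': [':)', ':)', ':(', ':|', ':|', ':)'],
})
y = [0, 0, 0, 1, 1, 0]

# Scale the real-valued column, one-hot-encode the categorical column
preprocess = ColumnTransformer([
    ('scale', StandardScaler(), ['real_valued']),
    ('one_hot', OneHotEncoder(handle_unknown='ignore'), ['string_valued']),
])

# kNN with its default metric on the preprocessed features
knn = make_pipeline(preprocess, KNeighborsClassifier(n_neighbors=3))
knn.fit(X, y)

x_new = pd.DataFrame({'real_valued': [9.0], 'string_valued': [':|']})
prediction = knn.predict(x_new)
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are fitted only on the data passed to `fit`, which matters once you cross-validate.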

# Variable Selection

- In a prediction (!, not inference) task it is fine to use all variables available as long as the model
  is properly devised and evaluated.
- Run your first analysis using all (reasonable) variables available.
- If you decide to drop a variable, provide an explanation.
- Assess the impact of dropping a variable by comparing model performance with and without the variable.
- There may well be reasons to prefer a smaller variable set, discuss these as you see fit.
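
The with/without comparison from the list above could be sketched as follows. The data is synthetic, the column names `useful` and `noise` are hypothetical, and 5-fold cross-validated R² with a linear model is just one possible choice of evaluation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Hypothetical data: 'useful' drives the target, 'noise' does not
X = pd.DataFrame({
    'useful': rng.normal(size=300),
    'noise': rng.normal(size=300),
})
y = 2 * X['useful'] + rng.normal(scale=0.5, size=300)


def mean_cv_score(columns):
    """Mean 5-fold cross-validated R^2 using only the given columns."""
    return cross_val_score(LinearRegression(), X[columns], y, cv=5).mean()


score_all = mean_cv_score(['useful', 'noise'])
score_without_noise = mean_cv_score(['useful'])
score_without_useful = mean_cv_score(['noise'])
```

If dropping a variable leaves the score essentially unchanged (as with `noise` here), that supports dropping it; a clear drop in the score (as with `useful`) argues against it.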

# Model Choice

- At least 2 models, ideally 3 for comparison.
- If you want to apply more models, feel free to do that but do not use models without 
  proper hyperparameter optimization and evaluation.
- Start with a simple model to go through the entire process, then iterate.
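
A minimal sketch of comparing two models, each with its own hyperparameter search. The synthetic data, the two model choices, and the parameter grids are assumptions for illustration only; substitute your own data and candidates.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for your project data
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each candidate model comes with its own (hypothetical) hyperparameter grid
candidates = {
    'logistic_regression': (
        LogisticRegression(max_iter=1000), {'C': [0.1, 1.0, 10.0]}
    ),
    'decision_tree': (
        DecisionTreeClassifier(random_state=0), {'max_depth': [2, 4, 8]}
    ),
}

results = {}
for name, (estimator, param_grid) in candidates.items():
    # Hyperparameters are tuned by cross-validation on the training set only
    search = GridSearchCV(estimator, param_grid, cv=5)
    search.fit(X_train, y_train)
    results[name] = {
        'best_params': search.best_params_,
        'cv_score': search.best_score_,
        # The held-out test set is used once, for the final comparison
        'test_score': search.score(X_test, y_test),
    }
```

The scores here are plain accuracy; in a project you would pick a metric that reflects the business implications of a working or failing model.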

---

We focus on a sound procedure and proper evaluation of the models from a business point of view. We do not aim for the "perfect" model.

# Non-Numerical Data & Categorical Data II

In [1]:
import pandas as pd


df = pd.DataFrame({
    'real_valued': [0.1, 0.2, 0.3, 10], 
    'string_valued': [':)', ':)', ':(', ':|'],
    'integer_valued': [-1, 0, 1, 2],
    'boolean_values': [True, False, True, False],
})
df

|   | real_valued | string_valued | integer_valued | boolean_values |
|---|---|---|---|---|
| 0 | 0.1 | :) | -1 | True |
| 1 | 0.2 | :) | 0 | False |
| 2 | 0.3 | :( | 1 | True |
| 3 | 10.0 | :\| | 2 | False |


In [2]:
df.dtypes

real_valued       float64
string_valued      object
integer_valued      int64
boolean_values       bool
dtype: object

## Non-Numerical --> Numerical

If you have any non-numerical variables in your data, 
you need to perform this conversion.

In [3]:
# one way to turn non-numerical data into numerical data
# as an example, we specify all mappings at once
# for boolean_values we do not need to perform the conversion this way but it is okay
value_mappings = {
    'string_valued': {':)': 1, ':|': 0, ':(': -1}, 
    'boolean_values': {True: 1, False: 0}  
}
df_numerical = df.copy()
for column, value_mapping in value_mappings.items():
    # map() applies the dictionary element-wise; values missing from the mapping would become NaN
    df_numerical[column] = df_numerical[column].map(value_mapping)
df_numerical

|   | real_valued | string_valued | integer_valued | boolean_values |
|---|---|---|---|---|
| 0 | 0.1 | 1 | -1 | 1 |
| 1 | 0.2 | 1 | 0 | 0 |
| 2 | 0.3 | -1 | 1 | 1 |
| 3 | 10.0 | 0 | 2 | 0 |


In [4]:
df_numerical.dtypes  # note that all but real_valued are categorical data

real_valued       float64
string_valued       int64
integer_valued      int64
boolean_values      int64
dtype: object

## Categorical --> One-Hot-Encoded

Does your model require one-hot-encoded data?

In [5]:
from sklearn.preprocessing import OneHotEncoder

In [6]:
# DO NOT DO THIS IF YOU HAVE MIXED REAL VALUED AND CATEGORICAL DATA
enc = OneHotEncoder(drop='first')
one_hot = enc.fit_transform(df_numerical)
one_hot

<4x9 sparse matrix of type '<class 'numpy.float64'>'
	with 11 stored elements in Compressed Sparse Row format>

In [7]:
df_one_hot = pd.DataFrame(
    one_hot.toarray(), 
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    columns=enc.get_feature_names_out(df_numerical.columns)
)
# as you see, one-hot-encoding everything converts real_valued as well
df_one_hot

|   | real_valued_0.2 | real_valued_0.3 | real_valued_10.0 | string_valued_0 | string_valued_1 | integer_valued_0 | integer_valued_1 | integer_valued_2 | boolean_values_1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 3 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |


In [8]:
# DO THIS
reals = ['real_valued']
categoricals = ['string_valued', 'integer_valued', 'boolean_values']

df_real = df_numerical[reals]
df_categorical = df_numerical[categoricals]

In [9]:
enc = OneHotEncoder(drop='first')
one_hot = enc.fit_transform(df_categorical)
df_categorical_one_hot = pd.DataFrame(
    one_hot.toarray(), 
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    columns=enc.get_feature_names_out(df_categorical.columns)
)
# now only the categorical columns are one-hot-encoded
df_categorical_one_hot

|   | string_valued_0 | string_valued_1 | integer_valued_0 | integer_valued_1 | integer_valued_2 | boolean_values_1 |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |


In [10]:
df_transformed = pd.concat([df_real, df_categorical_one_hot], axis=1)
df_transformed

|   | real_valued | string_valued_0 | string_valued_1 | integer_valued_0 | integer_valued_1 | integer_valued_2 | boolean_values_1 |
|---|---|---|---|---|---|---|---|
| 0 | 0.1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.2 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 3 | 10.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
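
As an alternative to splitting and concatenating by hand, scikit-learn's `ColumnTransformer` can apply the encoder to the categorical columns only. The sketch below rebuilds the numerical frame from above so that it is self-contained. Note one difference from the manual approach: the passed-through `real_valued` column ends up last, not first.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same values as df_numerical above, rebuilt so the block is self-contained
df_numerical = pd.DataFrame({
    'real_valued': [0.1, 0.2, 0.3, 10],
    'string_valued': [1, 1, -1, 0],
    'integer_valued': [-1, 0, 1, 2],
    'boolean_values': [1, 0, 1, 0],
})
categoricals = ['string_valued', 'integer_valued', 'boolean_values']

transformer = ColumnTransformer(
    [('one_hot', OneHotEncoder(drop='first'), categoricals)],
    remainder='passthrough',  # leave real_valued untouched
    sparse_threshold=0,       # always return a dense array
)
# One-hot-encoded columns come first, passed-through columns last
transformed = transformer.fit_transform(df_numerical)
```

A fitted transformer can later be applied to new data with `transformer.transform(...)`, so the same encoding is reused for training and test sets.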
