# Scaling values
<div style='display: flex;'>
    <div style='width: 3%;'></div>
    <div>
        <h1 style='text-align: center;'>Values before scaling</h1>
        <img style='border: 1px black solid; margin: auto;' src='https://raw.githubusercontent.com/puneettrainer/pics/main/before-scaling.png'>
    </div>
    <div style='width: 3%;'></div>
    <div>
        <h1 style='text-align: center;'>Values after scaling</h1>
        <img style='border: 1px black solid; margin: auto;' src='https://raw.githubusercontent.com/puneettrainer/pics/main/after-scaling.png'>
    </div>
    <div style='width: 3%;'></div>
</div>

Scaling is a method through which we transform numeric values in our data so that they fall in the same range and are still proportional to each other.

### Why scale values?

Scaling numeric values makes sure that features which have larger values (for example `Profit`, `Sales`) don't overshadow/take importance from features which have smaller values (for example `Discount`, `Quantity`).

Scaling can also reduce the time it takes to train a machine learning model: when all the values fall on a similar scale, the algorithm can weigh each feature on an equal footing and converge on patterns more quickly. If the values are not scaled, the algorithm may struggle to fit an accurate relation and take longer to arrive at an optimal one.
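
To make the "overshadowing" concrete, the sketch below measures how much of the squared distance between two records comes from a large-valued feature before and after scaling. The three records and the helper `sales_share` are made up for illustration; only the field names `Sales` and `Discount` come from the text above.

```python
import numpy as np

# toy records (hypothetical values); columns: Sales, Discount
data = np.array([[1000.0, 0.1],
                 [1100.0, 0.9],
                 [1200.0, 0.5]])

def sales_share(a, b):
    """Fraction of the squared distance between a and b contributed by Sales."""
    squared_diff = (a - b) ** 2
    return squared_diff[0] / squared_diff.sum()

# before scaling: Sales dominates the distance almost entirely
raw_share = sales_share(data[0], data[1])

# after min-max scaling each column to the range 0 - 1
scaled = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
scaled_share = sales_share(scaled[0], scaled[1])

print(f'Share of Sales before scaling: {raw_share:.4f}')
print(f'Share of Sales after scaling: {scaled_share:.4f}')
```

Before scaling, nearly all of the distance is driven by `Sales`; after scaling, `Discount` contributes as well.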

### Common Scaling Methods

#### Z-score Normalization

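Transforms the values so that they have an average of 0 and a standard deviation of 1. It changes the location and spread of the values, not the shape of their distribution.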

## $value_{scaled}$ = $\frac{x - avg(x)}{std(x)}$

- many models, including regularized regression models, work best when the features are centered and on a comparable scale; `z-score` normalization provides this, but it does not turn a skewed feature into a normally distributed one.
- as it uses `average` and `standard deviation`, it is still impacted by outliers, but less impacted compared to `min-max` normalization. Outliers pull the `average` and inflate the `standard deviation`, which compresses the `z-scores` of the remaining values.
- works best when the feature(s) is/are roughly normally distributed; for heavily skewed feature(s), the `z-scores` remain skewed.
- requires the `scaler` to be re-trained when the distribution of the data changes.
- `z-scores` are relative to the `average` and `standard deviation` of the dataset the `scaler` was trained on, so `z-scores` computed separately on two datasets cannot be compared directly.
- does not have a fixed range of values.

To use `z-score` normalization, we simply import the `StandardScaler` class from the `preprocessing` sub-module of `sklearn`. This class is used to instantiate an object of the type `StandardScaler`, which we then use to scale the values.


Creating an object of the `StandardScaler` class and training it using numeric fields in the training data:
```
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(training_data[numeric_fields])
```

Using the scaler to scale the data:
```
training_data[numeric_fields] = scaler.transform(training_data[numeric_fields])
```
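As a quick check of the formula, the sketch below computes the z-scores by hand with NumPy and compares them with the output of `StandardScaler`. The five input values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy feature (hypothetical values), one column
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# z-score by hand: subtract the average, divide by the standard deviation
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# the same transformation using sklearn
scaled = StandardScaler().fit_transform(x)

print(manual.ravel())
print(f'Matches StandardScaler: {np.allclose(manual, scaled)}')
print(f'Average after scaling: {scaled.mean():.2f}, standard deviation: {scaled.std():.2f}')
```

Both `np.std` and `StandardScaler` use the population standard deviation by default, which is why the two results agree.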

#### Min-Max Normalization

Transforms the values to a fixed range, 0 to 1.

## $value_{scaled}$ = $\frac{x-min(x)}{max(x)-min(x)}$

- useful when the data does not contain outliers
- it is simpler to compute.
- since it uses `min` and `max`, this method of normalization is heavily impacted by outliers.

To use `min-max` normalization, we simply import the `MinMaxScaler` class from the `preprocessing` sub-module of `sklearn`. This class is used to instantiate an object of the type `MinMaxScaler`, which we then use to scale the values.


Creating an object of the `MinMaxScaler` class and training it using numeric fields in the training data:
```
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(training_data[numeric_fields])
```

Using the scaler to scale the data:
```
training_data[numeric_fields] = scaler.transform(training_data[numeric_fields])
```
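The effect of an outlier on this method can be sketched with a small made-up feature: one extreme value takes the top of the range and squeezes every other value close to 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy feature (hypothetical values) with one outlier, 1000
x = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

# min-max by hand: (x - min) / (max - min)
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# the same transformation using sklearn
scaled = MinMaxScaler().fit_transform(x)

print(scaled.ravel().round(4))
print(f'Matches MinMaxScaler: {np.allclose(manual, scaled)}')
```

The four regular values all land below 0.05, so most of the 0 - 1 range is spent on the gap between them and the outlier.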

#### Robust Scaling

Scales the value on the basis of median and the interquartile range.

## $value_{scaled}$ = $\frac{x-median}{IQR}$

- useful when data has outliers; since it scales the values on the basis of the `median` and the `interquartile range`, the effect outliers have on the scaling is greatly reduced.
- may not be the right choice when the outliers in the feature(s) carry meaningful information; in some cases, outliers may be important to consider while training the model.
- if the feature(s) is/are already normally distributed, `z-score` scaling is usually the better fit.

To use `robust scaling`, we simply import the `RobustScaler` class from the `preprocessing` sub-module of `sklearn`. This class is used to instantiate an object of the type `RobustScaler`, which we then use to scale the values.


Creating an object of the `RobustScaler` class and training it using numeric fields in the training data:
```
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler().fit(training_data[numeric_fields])
```

Using the scaler to scale the data:
```
training_data[numeric_fields] = scaler.transform(training_data[numeric_fields])
```
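A minimal sketch of the same formula by hand, using made-up values with one outlier. The `median` and the `interquartile range` are taken from NumPy and the result is compared with `RobustScaler`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# toy feature (hypothetical values) with one outlier, 1000
x = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

# robust scaling by hand: (x - median) / IQR
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q3 - q1)

# the same transformation using sklearn
scaled = RobustScaler().fit_transform(x)

print(scaled.ravel())
print(f'Matches RobustScaler: {np.allclose(manual, scaled)}')
```

The regular values keep a sensible spread between -1 and 0.5, because the `median` and the `interquartile range` are barely affected by the outlier. The outlier itself is not removed and still receives a large scaled value.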

#### MaxAbs Scaling

Scales the feature(s) as a ratio of the max absolute value.

## $value_{scaled}$ = $\frac{x}{max(|x|)}$

- does not shift the center of the feature(s), so zero values stay zero; more suited for feature(s) which have both positive and negative values, or which are sparse (mostly zero). The scaled values fall in the range -1 to 1.
- efficient to compute.
- since it uses `max` for computation, it is heavily impacted by outliers.

To use `MaxAbs scaling`, we simply import the `MaxAbsScaler` class from the `preprocessing` sub-module of `sklearn`. This class is used to instantiate an object of the type `MaxAbsScaler`, which we then use to scale the values.

Creating an object of the `MaxAbsScaler` class and training it using numeric fields in the training data:
```
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler().fit(training_data[numeric_fields])
```

Using the scaler to scale the data:
```
training_data[numeric_fields] = scaler.transform(training_data[numeric_fields])
```
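The formula can be checked by hand on a small made-up feature that has both negative and positive values.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# toy feature (hypothetical values) with negative, zero and positive entries
x = np.array([[-50.0], [-10.0], [0.0], [20.0], [100.0]])

# max-abs scaling by hand: x / max(|x|)
manual = x / np.abs(x).max(axis=0)

# the same transformation using sklearn
scaled = MaxAbsScaler().fit_transform(x)

print(scaled.ravel())
print(f'Matches MaxAbsScaler: {np.allclose(manual, scaled)}')
```

The signs are preserved and the zero stays at zero, since the values are only divided and never shifted.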

# Encoding Values

Since `logistic regression` is a linear model, it expects input features to be numeric. In order to use categorical fields, we need to convert or `encode` the categorical values.

This can be done using `encoders` available in the `preprocessing` sub-module of the `sklearn` library.

Some common encoders are:
- `One Hot Encoder`: encodes the values such that each value is represented in different columns and the value of these columns is $0$ or $1$ (`True` or `False`).

For example,

| ID | State |
| --- | --- |
| 1 | Haryana |
| 5 | Delhi |
| 6 | Kerala |

is converted to:

| ID | is_Delhi | is_Haryana | is_Kerala |
| --- | --- | --- | --- |
| 1 | 0 | 1 | 0 |
| 5 | 1 | 0 | 0 |
| 6 | 0 | 0 | 1 |

| Advantages | Disadvantages |
| --- | --- |
| retains information of the data allowing the model to identify and learn patterns at a deeper level of the data  | increases the size of the training data, making it resource intensive |
| does not add any form of `ordinality` | loses ordinal information |
| automatically handles missing values | may cause overfitting with simpler model |
|  | may not handle rare values as they may not be available in the training data (or at the time of training the model), causing the model to not understand these values |
|  | encoded columns may be correlated to each other |
|  | not suitable for large datasets |

Using `One Hot Encoder` from `sklearn.preprocessing`:
```
from sklearn.preprocessing import OneHotEncoder

# training the encoder on the categorical fields
encoder = OneHotEncoder().fit(training_data[categorical_fields])
# adding the encoded columns to the training data
training_data.loc[:, encoder.get_feature_names_out()] = encoder.transform(training_data[categorical_fields]).toarray()
```
#### NOTE: since `OneHotEncoder` creates a boolean pivot of the original columns, it outputs multiple fields. We use `encoder.get_feature_names_out()` to get the names of the new columns generated by the `OneHotEncoder` and add them to the training dataset.

- `Label Encoding`: encodes the values by assigning a unique integer to them.

| Advantages | Disadvantages |
| --- | --- |
| does not increase the size of the data - computationally less resource intensive | may not be suitable in some cases as the model may mistake the encoded values as continuous values |
| retains information of `ordinality` | ordinality retained may not be true to the actual ordinality |
| automatically handles missing values | may introduce multicollinearity; this may impact the relations identified |
| suitable for large datasets | can only encode one field at a time |

Using `Label Encoder` from `sklearn.preprocessing`:
```
from sklearn.preprocessing import LabelEncoder

# LabelEncoder works on a single field at a time
encoder = LabelEncoder().fit(training_data[categorical_field])
training_data.loc[:, categorical_field] = encoder.transform(training_data[categorical_field])
```
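A small sketch of what the encoded values look like, using made-up state names. The integers follow the sorted order of the values, which illustrates why the retained ordinality may not match any real ordering.

```python
from sklearn.preprocessing import LabelEncoder

# toy values (hypothetical)
states = ['Haryana', 'Delhi', 'Kerala', 'Delhi']

encoder = LabelEncoder().fit(states)

# classes are stored in sorted order; the position is the encoded integer
classes = list(encoder.classes_)
codes = encoder.transform(states)

# the original values can be recovered from the integers
decoded = list(encoder.inverse_transform(codes))

print(classes)
print(codes)
print(decoded)
```

Here `Delhi` becomes 0, `Haryana` 1 and `Kerala` 2 purely because of alphabetical order.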

- `Mean Encoding`: replaces the original value with the mean of the target values that belong to this class.

| Advantages | Disadvantages |
| --- | --- |
| does not increase the size of the data - computationally less resource intensive | may not be suitable in some cases as the model may mistake the encoded values as continuous values |
| adds more value between the categorical field and the target field | patterns identified and learned by the model depend on the sample, so it may be biased to the sample and not useful in case of real-world application |
| automatically handles missing values | may introduce multicollinearity; this may impact the relations identified |
| suitable when there are a wide set of distinct values in the categorical field | since it uses mean, patterns identified may be exaggerated by outliers |
| easy to interpret | |

Using `Mean Encoder` (`TargetEncoder`) from `sklearn.preprocessing`:
```
from sklearn.preprocessing import TargetEncoder

encoder = TargetEncoder().fit(training_data[categorical_fields], training_data[target_field])
training_data.loc[:, categorical_fields] = encoder.transform(training_data[categorical_fields])
```
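The idea behind mean encoding can be sketched by hand with `pandas`. The data below is made up, and this is only the plain per-category mean: `TargetEncoder` additionally smooths the category means toward the overall mean, so its output will generally differ from these raw values.

```python
import pandas as pd

# toy data (hypothetical values)
data = pd.DataFrame({'country': ['France', 'France', 'Spain', 'Spain', 'Spain'],
                     'churn': [1, 0, 1, 1, 0]})

# mean of the target for each category
means = data.groupby('country')['churn'].mean()

# replacing each category with the mean of its target values
data['country_encoded'] = data['country'].map(means)

print(means)
print(data)
```

Each `France` row is replaced by the share of `France` customers who churned, and likewise for `Spain`.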


### Creating a model to predict whether a customer will churn

In [None]:
import pandas as pd
import warnings
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

warnings.filterwarnings('ignore')

In [None]:
churn = pd.read_csv(r"https://raw.githubusercontent.com/puneettrainer/datasets/main/customer_churn.csv")

### Segregating `target` and `input` fields

In [None]:
target_field = 'churn'
fields = list(churn.columns)
fields.remove(target_field)
input_fields = fields

### Segregating `numeric` and `non-numeric` input fields

In [None]:
categorical_fields = list(churn[input_fields].select_dtypes(exclude='number').columns)
numeric_fields = list(churn[input_fields].select_dtypes(include='number').columns)

### Splitting dataset into `training` and `test` data

In [None]:
training_data, test_data = train_test_split(churn
                                           ,test_size=0.3
                                           ,random_state=4)

### Scaling numeric fields using `RobustScaler`

In [None]:
scaler = RobustScaler().fit(training_data[numeric_fields])
training_data.loc[:, numeric_fields] = scaler.transform(training_data[numeric_fields])
test_data.loc[:, numeric_fields] = scaler.transform(test_data[numeric_fields])

### Encoding categorical fields

In [None]:
encoder = OneHotEncoder().fit(training_data[categorical_fields])
training_data.loc[:, encoder.get_feature_names_out()] = encoder.transform(training_data[categorical_fields]).toarray()
test_data.loc[:,  encoder.get_feature_names_out()] = encoder.transform(test_data[categorical_fields]).toarray()

In [None]:
# modifying list of input fields after encoding
input_fields = list(encoder.get_feature_names_out()) + numeric_fields

In [None]:
model = LogisticRegression().fit(training_data[input_fields], training_data[target_field])
predictions = model.predict(test_data[input_fields])

### Evaluating the model

In [None]:
print(f'Accuracy score: {accuracy_score(test_data[target_field], predictions)}')
print(f'Precision score: {precision_score(test_data[target_field], predictions)}')
print(f'Sensitivity score: {recall_score(test_data[target_field], predictions)}')
print(f'F1 score: {f1_score(test_data[target_field], predictions)}')

As per the above evaluation metrics:
- `Accuracy score` is $82.03\% \implies$ the model makes correct predictions $82.03\%$ of the time, counting both `True Positives` and `True Negatives`. Fair enough for an un-optimized model.
- `Precision score` is $60.08\% \implies$ out of the customers the model predicted to churn, $60.08\%$ actually churned. This level of correctness is not good enough to make a final decision on whether a new customer would churn or not. If the precision score was higher, we could use this model to predict whether to accept a new customer or not.
- `Sensitivity score` (recall) is $22.94\% \implies$ out of the customers who actually churned, the model correctly identified only $22.94\%$. This level of correctness is also not good enough to identify customers who may potentially churn. If the sensitivity score was higher, we could use this model to identify customers who may potentially churn and try to strategize on how to prevent them from churning.
- `F1 score` is $33.20\% \implies$ the harmonic mean of `precision` and `recall`, which summarizes both in a single score. It is low here because the recall is low. If the `F1 score` was higher, we could use this model in our decision making process.
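
The four metrics can be traced back to the counts of correct and incorrect predictions. The sketch below uses ten made-up labels (not the churn data) to compute each metric by hand, compares the results with `sklearn`, and then recomputes the `F1 score` from the precision and sensitivity reported above.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# toy labels (hypothetical): 1 = churn, 0 = no churn
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # churners predicted as churn
fp = sum(t == 0 and p == 1 for t, p in pairs)  # non-churners predicted as churn
fn = sum(t == 1 and p == 0 for t, p in pairs)  # churners the model missed
tn = sum(t == 0 and p == 0 for t, p in pairs)  # non-churners predicted correctly

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted churners, how many churned
recall = tp / (tp + fn)      # of the actual churners, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f'By hand: {accuracy:.4f}, {precision:.4f}, {recall:.4f}, {f1:.4f}')
print(f'sklearn: {accuracy_score(y_true, y_pred):.4f}, {precision_score(y_true, y_pred):.4f}, '
      f'{recall_score(y_true, y_pred):.4f}, {f1_score(y_true, y_pred):.4f}')

# F1 recomputed from the precision and sensitivity reported above
reported_f1 = 2 * 0.6008 * 0.2294 / (0.6008 + 0.2294)
print(f'F1 from reported precision and sensitivity: {reported_f1:.4f}')
```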

### Logistic Regression Coefficients

Just like the `LinearRegression` class, the `LogisticRegression` class provides the `.coef_` attribute. We can use it to view the weights the model has assigned to the different input features and decide which input fields are relevant to keep, and which fields can be dropped.

In [None]:
model.coef_

In [None]:
# storing weights in a dataframe
weights = pd.DataFrame({'feature': model.feature_names_in_.reshape(-1)
                       ,'weight': model.coef_.reshape(-1)})

# sorting features by their weight
weights.sort_values(by=['weight'], ascending=False)

Based on the above weights, we can conclude that the model is assigning more importance to `age`, `balance`, `country`, `gender`, `active_member`. The remaining fields don't significantly impact the model performance, so we can remove those remaining fields from the input features.

### Saving the model for future use

In [None]:
target_field = 'churn'
input_fields = ['age', 'balance', 'country', 'gender', 'active_member']

categorical_fields = list(churn[input_fields].select_dtypes(exclude='number').columns)
numeric_fields = list(churn[input_fields].select_dtypes(include='number').columns)

training_data, test_data = train_test_split(churn
                                           ,test_size=0.3
                                           ,random_state=4)

scaler = RobustScaler().fit(training_data[numeric_fields])
training_data.loc[:, numeric_fields] = scaler.transform(training_data[numeric_fields])
test_data.loc[:, numeric_fields] = scaler.transform(test_data[numeric_fields])

encoder = OneHotEncoder().fit(training_data[categorical_fields])
training_data.loc[:, encoder.get_feature_names_out()] = encoder.transform(training_data[categorical_fields]).toarray()
test_data.loc[:,  encoder.get_feature_names_out()] = encoder.transform(test_data[categorical_fields]).toarray()

# modifying list of input fields after encoding
model_input_fields = list(encoder.get_feature_names_out()) + numeric_fields

model = LogisticRegression().fit(training_data[model_input_fields], training_data[target_field])
predictions = model.predict(test_data[model_input_fields])

In [None]:
import joblib as jb

churn_model = {'target_field': target_field
              ,'input_fields': input_fields
              ,'categorical_fields': categorical_fields
              ,'numeric_fields': numeric_fields
              ,'model_inputs': model_input_fields
              ,'scaler': scaler
              ,'encoder': encoder
              ,'model': model}

jb.dump(churn_model, 'churn_model.joblib')

### Reusing the model

In [None]:
import joblib as jb
import pandas as pd

saved_model = jb.load('churn_model.joblib')

# creating an empty dictionary to store input values
input_values = {}

for feature in saved_model['input_fields']:
    # fetching input values from user
    value = input(f'Enter the {feature.lower()}: ')

    # inserting input values into input_values dictionary
    input_values.update({feature:value})

# converting numeric inputs to numbers
# (field lists are read from the saved model so this cell works in a new session)
for field in saved_model['numeric_fields']:
    input_values[field] = float(input_values[field])

# converting data into a dataframe
input_record = pd.DataFrame(input_values, index=[0])

# scaling numeric values
input_record[saved_model['numeric_fields']] = saved_model['scaler'].transform(input_record[saved_model['numeric_fields']])

# encoding categorical values
input_record.loc[:, saved_model['encoder'].get_feature_names_out()] = saved_model['encoder'].transform(input_record[saved_model['categorical_fields']]).toarray()

# making prediction more readable
if saved_model['model'].predict(input_record[saved_model['model_inputs']]) == [1]:
    prediction = 'Churn'
else:
    prediction = "Won't Churn"

# displaying predicted value
print(f'Predicted status for provided values is: {prediction}')

### Creating a model to predict whether customer is satisfied or not

In [None]:
# importing data
airline = pd.read_csv(r"E:\data\airline_satisfaction.csv")

# saving target and input field names
target_field = 'satisfaction'

input_fields = list(airline.columns)
input_fields.remove(target_field)

# segregating categorical and numeric fields
categorical_fields = list(airline[input_fields].select_dtypes(exclude='number').columns)
numeric_fields = list(airline[input_fields].select_dtypes(include='number').columns)

# splitting dataset into training and testing data
training_data, test_data = train_test_split(airline
                                           ,test_size=0.3
                                           ,random_state=4)

# scaling numeric fields in training and test data
scaler = RobustScaler().fit(training_data[numeric_fields])
training_data.loc[:, numeric_fields] = scaler.transform(training_data[numeric_fields])
test_data.loc[:, numeric_fields] = scaler.transform(test_data[numeric_fields])

# encoding categorical fields in training and test data
encoder = OneHotEncoder().fit(training_data[categorical_fields])
training_data.loc[:, encoder.get_feature_names_out()] = encoder.transform(training_data[categorical_fields]).toarray()
test_data.loc[:,  encoder.get_feature_names_out()] = encoder.transform(test_data[categorical_fields]).toarray()

# modifying list of input fields after encoding
input_fields = list(encoder.get_feature_names_out()) + numeric_fields

# instantiating model and computing predictions
model = LogisticRegression().fit(training_data[input_fields], training_data[target_field])
predictions = model.predict(test_data[input_fields])

# displaying evaluation metrics
print(f'Accuracy score: {accuracy_score(test_data[target_field], predictions)}')
print(f'Precision score: {precision_score(test_data[target_field], predictions, pos_label="satisfied")}')
print(f'Sensitivity score: {recall_score(test_data[target_field], predictions, pos_label="satisfied")}')
print(f'F1 score: {f1_score(test_data[target_field], predictions, pos_label="satisfied")}')