# 1.0 IMPORTS

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sklearn.datasets as datasets
import sklearn.model_selection as ms
import sklearn.tree as tree
import sklearn.metrics as metrics

from sklearn.neighbors import KNeighborsClassifier

# 2.0 CREATE SYNTHETIC DATA

In [2]:
# Define data generator parameters
n_samples = 5000
n_features = 5
n_informative = 2
n_redundant = 3
n_classes = 2
random_state = 44

# Create data
X, y = datasets.make_classification(n_samples=n_samples, n_features=n_features, n_informative=n_informative, 
                                    n_redundant=n_redundant, random_state=random_state, n_classes=n_classes)

# Show what the dataset looks like
df1 = pd.concat([pd.DataFrame(X), pd.DataFrame(y, columns=['response'])], axis=1)
df1.head()

We now split the data into two parts: the available data, which we will use for training, validation and testing, and a held-out set that simulates the data the model will meet in production.

In [3]:
# Split datasets
X, X_prod, y, y_prod = ms.train_test_split(X, y, test_size=0.2)

# 3.0 FIRST SCENARIO (NO SPLITTING)

### 3.1 Prediction over training

In [4]:
# Create decision tree classifier object
model = tree.DecisionTreeClassifier(max_depth=9)

# Fit the model
model.fit(X, y)

# Make predictions
y_hat = model.predict(X)

# Check metrics
acc = metrics.accuracy_score(y, y_hat)
prec = metrics.precision_score(y, y_hat, average='weighted')
rec = metrics.recall_score(y, y_hat, average='weighted')

print('Accuracy over training: {}'.format(acc))
print('Precision over training: {}'.format(prec))
print('Recall over training: {}'.format(rec))

### 3.2 Prediction over production

In [5]:
# Make predictions
y_hat_prod = model.predict(X_prod)

# Check metrics
acc_prod = metrics.accuracy_score(y_prod, y_hat_prod)
prec_prod = metrics.precision_score(y_prod, y_hat_prod, average='weighted')
rec_prod = metrics.recall_score(y_prod, y_hat_prod, average='weighted')

print('Accuracy over production: {}'.format(acc_prod))
print('Precision over production: {}'.format(prec_prod))
print('Recall over production: {}'.format(rec_prod))

Here we used all the available data for training. When we then ran the model on production data, performance was about 8% lower than on the training data. This gap indicates that the model is overfitted, although only moderately: it still generalizes reasonably well, and 80% is not a bad result. Even so, a gap like this can hurt the business, causing loss of revenue or reputation, because the training score promised more than the model delivers.

To mitigate this issue and estimate the model's real generalization capacity before going into production, we turn to the train/test strategy.
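The gap between the score on data the model was fitted on and the score on unseen data can be measured directly. The following is a minimal sketch, not the notebook's own code: the helper `generalization_gap` and the small demo dataset are assumptions introduced only for illustration.

```python
import sklearn.datasets as datasets
import sklearn.metrics as metrics
import sklearn.model_selection as ms
import sklearn.tree as tree


def generalization_gap(model, X_fit, y_fit, X_new, y_new):
    """Return accuracy on the fitting data, on unseen data, and their difference."""
    acc_fit = metrics.accuracy_score(y_fit, model.predict(X_fit))
    acc_new = metrics.accuracy_score(y_new, model.predict(X_new))
    return acc_fit, acc_new, acc_fit - acc_new


# Illustrative data only (hypothetical, not the notebook's dataset).
X_demo, y_demo = datasets.make_classification(
    n_samples=2000, n_features=5, n_informative=2, n_redundant=3, random_state=0
)
X_seen, X_unseen, y_seen, y_unseen = ms.train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# An unrestricted tree is free to memorize the data it was fitted on.
deep_tree = tree.DecisionTreeClassifier(random_state=0).fit(X_seen, y_seen)
acc_fit, acc_new, gap = generalization_gap(deep_tree, X_seen, y_seen, X_unseen, y_unseen)
print('Seen: {:.3f}  Unseen: {:.3f}  Gap: {:.3f}'.format(acc_fit, acc_new, gap))
```

A large positive gap suggests that the score on the fitting data overstates what the model will deliver on new data.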

# 4.0 SECOND SCENARIO - TRAIN/TEST

In [6]:
# Splitting data
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, test_size=0.2, random_state=random_state)

### 4.1 Train on the training data, then predict and measure performance on the test data

In [7]:
# Create decision tree classifier object
model = tree.DecisionTreeClassifier(max_depth=15)

# Fit the model
model.fit(X_train, y_train)

# Prediction with test data
y_hat_test = model.predict(X_test)

# Check metrics
acc_test = metrics.accuracy_score(y_test, y_hat_test)
prec_test = metrics.precision_score(y_test, y_hat_test, average='weighted')
rec_test = metrics.recall_score(y_test, y_hat_test, average='weighted')

print('Accuracy over test: {}'.format(acc_test))
print('Precision over test: {}'.format(prec_test))
print('Recall over test: {}'.format(rec_test))

### 4.2 Find the best parameter

In [8]:
# Create a list that will serve as parameter
values = [i for i in range(1, 40)]

# Create an empty list to save the results
test_scores = []

for i in values:
    
    # Create decision tree classifier object
    model = tree.DecisionTreeClassifier(max_depth=i)

    # Fit the model
    model.fit(X_train, y_train)

    # Make predictions
    y_hat_test = model.predict(X_test)
    acc_test = metrics.accuracy_score(y_test, y_hat_test)
    
    test_scores.append(acc_test)

In [9]:
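# Plot results against the max_depth values (not the list index)
plt.figure(figsize=(10, 6))
plt.plot(values, test_scores, '-o', label='test')
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.legend()
plt.show()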

The best parameter is **3**
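Instead of reading the best value off the plot, it can also be picked programmatically. This is a small sketch under stated assumptions: `best_param` is a hypothetical helper, and the example scores are made up for illustration rather than taken from the run above.

```python
import numpy as np


def best_param(values, scores):
    """Return the parameter value with the highest score (the first one wins on ties)."""
    return values[int(np.argmax(scores))]


# Made-up scores, one per candidate value.
example_values = [1, 2, 3, 4, 5]
example_scores = [0.81, 0.86, 0.89, 0.89, 0.84]
print(best_param(example_values, example_scores))  # prints 3
```

Because `np.argmax` returns the first maximum, ties resolve to the smaller parameter value when the candidates are sorted in ascending order, which for a tree means the shallower and simpler model.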

### 4.3 Publish model into production with the best parameter

In [10]:
model_prod = tree.DecisionTreeClassifier(max_depth=3)

# Concatenate train and test data
X = np.concatenate((X_train, X_test))
y = np.concatenate((y_train, y_test))

# Train the model
model_prod.fit(X, y)

# Prediction over production 
y_hat_prod = model_prod.predict(X_prod)

acc_prod = metrics.accuracy_score(y_prod, y_hat_prod)
prec_prod = metrics.precision_score(y_prod, y_hat_prod, average='weighted')
rec_prod = metrics.recall_score(y_prod, y_hat_prod, average='weighted')

print('Accuracy over production: {}'.format(acc_prod))
print('Precision over production: {}'.format(prec_prod))
print('Recall over production: {}'.format(rec_prod))

In this scenario, we trained the model on the training data and made predictions on the test data. The results were lower than the initial training results and matched the performance we saw on the production data in the first scenario.

We also saw that the model could be improved by searching for the best value of the **max_depth** parameter. To do this, we iterated over a range of values, training the algorithm on the training data at each iteration, making predictions, and measuring performance on the test data. Each iteration produced a performance result, which we saved in a list and then plotted.

Having found the best value, we retrained with it. Now suppose that, when measuring the result, we obtained a worse result on the production data than on the test data. Why would this happen?

The reason is that the test set was used multiple times to make decisions about the model, when its purpose is to stand in for data the model has never seen. The model is never fitted on the test data, but each time we score a candidate on it and keep the best **max_depth**, information about the test set leaks into our choice. The selected parameter is therefore the one that happens to suit this particular test set, and the test score becomes an optimistic estimate: the chosen model looks better on the test data than it will on genuinely new data.
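The effect of choosing a model by its test score can be shown with a toy experiment. This sketch is an assumption-laden illustration, not part of the original analysis: the labels are pure noise, and the candidate "models" are hypothetical random linear classifiers, so no candidate can truly beat 50% accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Labels are independent of the features: true accuracy of any model is 50%.
X_test_demo = rng.normal(size=(100, 5))
y_test_demo = rng.integers(0, 2, size=100)
X_fresh = rng.normal(size=(10_000, 5))
y_fresh = rng.integers(0, 2, size=10_000)

# 200 hypothetical candidates: random linear classifiers.
weights = rng.normal(size=(200, 5))


def accuracy(w, X, y):
    """Accuracy of the classifier that predicts 1 when X @ w > 0."""
    return float(np.mean((X @ w > 0).astype(int) == y))


# Select the candidate with the best score on the small test set...
test_accs = [accuracy(w, X_test_demo, y_test_demo) for w in weights]
best = int(np.argmax(test_accs))
best_test_acc = test_accs[best]

# ...then score that same candidate on data it was not selected on.
fresh_acc = accuracy(weights[best], X_fresh, y_fresh)
print('Selected on test: {:.3f}  Fresh data: {:.3f}'.format(best_test_acc, fresh_acc))
```

The winning candidate looks better than chance on the set used to select it, while its score on fresh data stays close to 50%. The same mechanism, in a milder form, inflates the test score of a tuned **max_depth**.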

One way to mitigate this is the **train-validation-test** strategy.
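A three-way split can be built from two calls to `train_test_split`. The helper below, `train_val_test_split`, is a hypothetical convenience function written for this sketch; its name and default fractions are assumptions, not a scikit-learn API.

```python
import numpy as np
import sklearn.model_selection as ms


def train_val_test_split(X, y, val_size=0.2, test_size=0.2, random_state=None):
    """Split into train, validation and test parts.

    test_size is a fraction of the full data; val_size is a fraction of
    what remains after the test part has been removed.
    """
    X_rest, X_test, y_rest, y_test = ms.train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    X_train, X_val, y_train, y_val = ms.train_test_split(
        X_rest, y_rest, test_size=val_size, random_state=random_state
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Toy data: 1000 rows, each row uniquely identified by its first column.
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000) % 2
X_tr, X_va, X_te, y_tr, y_va, y_te = train_val_test_split(X_demo, y_demo, random_state=0)
print(len(X_tr), len(X_va), len(X_te))
```

The validation part is used to choose parameters, and the test part is touched only once, at the end, to estimate performance.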

# 5.0 THIRD SCENARIO - TRAIN/VALIDATION/TEST

**Reminder: we already have a test dataset.**

In [11]:
# Data splitting
X_train, X_val, y_train, y_val = ms.train_test_split(X_train, y_train, test_size=0.2)

In [12]:
# Find the best parameter
# Create a list that will serve as parameter
values = [i for i in range(1, 40)]

# Create an empty list to save the validation results
val_scores = []

for i in values:
    
    # Create decision tree classifier object
    model = tree.DecisionTreeClassifier(max_depth=i)

    # Fit the model
    model.fit(X_train, y_train)

    # Make predictions on the validation data
    y_hat_val = model.predict(X_val)
    acc_val = metrics.accuracy_score(y_val, y_hat_val)    
    val_scores.append(acc_val)
    
# Plot against the max_depth values (not the list index)
plt.plot(values, val_scores, '-o', label='validation')
plt.show()

The best parameter is **2**

In [13]:
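# Refit with the best parameter: after the loop, `model` is the last tree (max_depth=39)
model = tree.DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# Prediction over validation
y_hat_val = model.predict(X_val)
acc_val = metrics.accuracy_score(y_val, y_hat_val)
print('Accuracy over validation {}'.format(acc_val))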

In [14]:
# Model trained with the best parameter (train+val)
model_train = tree.DecisionTreeClassifier(max_depth=2)
model_train.fit(np.concatenate((X_train, X_val)), np.concatenate((y_train, y_val)))

# Performance over test
y_hat_test = model_train.predict(X_test)
acc_test = metrics.accuracy_score(y_test, y_hat_test)
print('Accuracy over test {}'.format(acc_test))

# Performance over production
y_hat_prod = model_train.predict(X_prod)
acc_prod = metrics.accuracy_score(y_prod, y_hat_prod)
print('Accuracy over production {}'.format(acc_prod))

In this scenario, we set aside a portion of the training data for validation and used only that portion to iterate and find the best parameter. We then combined the training and validation data into a new training set and trained the model with the best parameter. Finally, we applied the model to the test data and measured its performance there.

We obtained an accuracy of 88.2% on the test data. We then checked the performance on the production data and obtained 89.9%. The model's performance in production is therefore very close to its performance in the test environment.
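The whole workflow (select on validation, refit on train plus validation, score once on test) fits in one function. This is a sketch under stated assumptions: `select_refit_evaluate` is a hypothetical helper, and the demo data is generated here rather than taken from the cells above.

```python
import numpy as np
import sklearn.datasets as datasets
import sklearn.metrics as metrics
import sklearn.model_selection as ms
import sklearn.tree as tree


def select_refit_evaluate(X_train, y_train, X_val, y_val, X_test, y_test, depths):
    """Pick max_depth on validation, refit on train+val, score once on test."""
    val_scores = []
    for depth in depths:
        candidate = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
        candidate.fit(X_train, y_train)
        val_scores.append(metrics.accuracy_score(y_val, candidate.predict(X_val)))

    # The test data plays no part in this choice.
    best_depth = depths[int(np.argmax(val_scores))]

    final_model = tree.DecisionTreeClassifier(max_depth=best_depth, random_state=0)
    final_model.fit(np.concatenate((X_train, X_val)), np.concatenate((y_train, y_val)))
    test_acc = metrics.accuracy_score(y_test, final_model.predict(X_test))
    return best_depth, test_acc


# Illustrative data only (hypothetical, not the notebook's dataset).
X_demo, y_demo = datasets.make_classification(
    n_samples=2000, n_features=5, n_informative=2, n_redundant=3, random_state=0
)
X_rest, X_te, y_rest, y_te = ms.train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
X_tr, X_va, y_tr, y_va = ms.train_test_split(X_rest, y_rest, test_size=0.2, random_state=0)

depths = list(range(1, 11))
best_depth, test_acc = select_refit_evaluate(X_tr, y_tr, X_va, y_va, X_te, y_te, depths)
print('Best max_depth: {}  Test accuracy: {:.3f}'.format(best_depth, test_acc))
```

Since the test set is scored only once, after the parameter has been fixed, its accuracy is a fair estimate of what to expect in production.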

# 6.0 TRAIN/VALIDATION/TEST STRATEGY FOR KNN

In [15]:
# Load data
df = pd.read_csv('../datasets/train.csv')

# Features
features = df.select_dtypes(exclude='object').columns.to_list()
target = 'limite_adicional'

# Define datasets
X = df.loc[:, features]
y = df.loc[:, target].values.ravel()

# Split train-test
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, test_size = 0.2, random_state=42)

# Split train-validation
X_train, X_val, y_train, y_val = ms.train_test_split(X_train, y_train, test_size = 0.05, random_state=42)

# Test the best parameter (assumes the numeric features contain no missing values)

values = [i for i in range(1, 40, 2)]   
scores = []

for i in values:
    
    model = KNeighborsClassifier(n_neighbors=i)
    
    model.fit(X_train, y_train)
    
    y_hat_val = model.predict(X_val)
    acc_val = metrics.accuracy_score(y_val, y_hat_val)
    
    scores.append(acc_val)

In [16]:
len(X_train)

In [17]:
X_val