# Final Activity

In this activity you will improve the results obtained by the logistic regression model for predicting whether a person has diabetes. 

As you will see shortly, some lines say `"INSERT YOUR CODE HERE"`. In many cases you will literally replace that placeholder with a single line of code; in other cases you will need several lines for your implementation to work properly. Do whatever is needed to make things work the way they should, but in general each implementation should take only a few lines.

Now that we have that out of the way, let us load the libraries you will be using.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

To evaluate your model you will be using the following function. Please do not modify it. Keep in mind that it receives a dataframe with two columns: the `observed` column that contains the true labels of the test dataset, and the `prediction` column that holds the predictions of the model.

In [None]:
def performance_metrics(results):
    
    positives = results[['observed', 'prediction']][results['observed'] == 1]
    negatives = results[['observed', 'prediction']][results['observed'] == 0]
    
    true_negatives = negatives[negatives['observed'] == negatives['prediction']].shape[0]
    false_positives = negatives[negatives['observed'] != negatives['prediction']].shape[0]
    true_positives = positives[positives['observed'] == positives['prediction']].shape[0]
    false_negatives = positives[positives['observed'] != positives['prediction']].shape[0]
    
    confusion_matrix = {'actual positives' : [true_positives, false_negatives], 
                        'actual negatives' : [false_positives, true_negatives]}
    
    confusion_matrix_df = pd.DataFrame.from_dict(confusion_matrix, orient='index', 
                                                 columns=['predicted positives', 'predicted negatives'])
    
    accuracy = (true_positives + true_negatives) / (true_positives + false_positives + true_negatives + false_negatives)
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1_score = 2 * (precision * recall) / (precision + recall)
    
    metrics = {'Accuracy' : accuracy, 'Precision' : precision, 'Recall' : recall, 'F1 Score' : f1_score}
    
    metrics_df = pd.DataFrame.from_dict(metrics, orient='index', columns=['Metrics'])
    
    return confusion_matrix_df, metrics_df   
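
To make the four metrics concrete, here is a small sketch that computes them by hand and compares them with the equivalent functions in `sklearn.metrics`. The label arrays are hypothetical and made up for illustration only; they are not taken from the diabetes dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels, invented for this illustration.
observed   = np.array([1, 1, 1, 0, 0, 0, 0, 1])
prediction = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# Count the four cells of the confusion matrix.
tp = int(np.sum((observed == 1) & (prediction == 1)))
fn = int(np.sum((observed == 1) & (prediction == 0)))
fp = int(np.sum((observed == 0) & (prediction == 1)))
tn = int(np.sum((observed == 0) & (prediction == 0)))

# Same formulas as in performance_metrics.
accuracy  = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * (precision * recall) / (precision + recall)

# Cross-check against scikit-learn.
matches_sklearn = np.allclose(
    [accuracy, precision, recall, f1],
    [accuracy_score(observed, prediction),
     precision_score(observed, prediction),
     recall_score(observed, prediction),
     f1_score(observed, prediction)],
)
print(tp, fn, fp, tn, matches_sklearn)
```

Note that precision is undefined when the model predicts no positives at all, and recall is undefined when the test set has no positives, since the denominators are then zero.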

You will be working, once again, with the diabetes dataset.

In [None]:
data = pd.read_csv('diabetes-dataset.csv')
data.head()

It turns out that this dataset has a lot of repeated observations, so we will remove those extra rows using the `drop_duplicates` method.

In [None]:
data = data.drop_duplicates()
data.shape

The original dataset had 2000 observations, and now we have 744, so there was a lot of redundancy in the data.

Now we create the target variable $y$ and a dataframe $X$ in which we will store the values of all the observations of the independent variables.

In [None]:
y = data['Outcome']
X = data.drop(columns='Outcome')

For models such as logistic regression and neural networks, it is good practice to standardize features by removing the mean and scaling to unit variance. That is, each feature $x_i$ is transformed as follows

$$\hat{x}_i=\frac{x_i-\bar{x}_i}{s_i},$$

where $\bar{x}_i$ and $s_i$ are the mean and the standard deviation of the values of the variable $x_i$, respectively. This tends to improve the convergence of the model during training. The standardization of features is implemented in the `StandardScaler` class of `scikit-learn`.

In [None]:
scaler = StandardScaler()
X = scaler.fit_transform(X)
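
As a quick check of the formula above, the following sketch standardizes a small hypothetical array by hand and compares the result with `StandardScaler`. The array `values` is invented for illustration. One detail worth knowing is that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also the default of `np.std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three observations of two features.
values = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

# Manual standardization, column by column (np.std defaults to ddof=0).
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# The same transformation with scikit-learn.
scaled = StandardScaler().fit_transform(values)

same = np.allclose(manual, scaled)
print(same)
```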

Now we are ready to create the **training** and **testing** datasets. For this task we have the rather famous `train_test_split` function from `sklearn`.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Note that, to make results reproducible, it is convenient to set the `random_state` parameter to an arbitrary fixed number. This guarantees that each time you run this cell you get the same training and test sets, even though the elements are selected at random.
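
A minimal sketch of this behavior, using hypothetical toy arrays (`toy_X` and `toy_y` are not part of the activity): two calls with the same `random_state` return exactly the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 observations, 2 features.
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

# Two independent calls with the same seed.
a_train, a_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

# Both calls select the same rows for the test set.
identical = np.array_equal(a_test, b_test) and np.array_equal(a_train, b_train)
print(a_train.shape, a_test.shape, identical)
```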

At this point we have all we need to train the model. As expected, there is an implementation of logistic regression in `sklearn`. Train the model by running the following cell. 

In [None]:
logistic_model = LogisticRegression(penalty=None).fit(X_train, y_train)

Create a variable called `y_pred` and store the predictions of the model for the test set.

In [None]:
y_pred = "INSERT YOUR CODE HERE"

For evaluation purposes, we will create a dataframe that contains both the true and predicted labels of the test set.

In [None]:
labels = {'observed': y_test, 'prediction': y_pred}
labels_df = pd.DataFrame(labels)
labels_df.head()

Evaluate the model using the `performance_metrics` function.

In [None]:
confusion_matrix, metrics = "INSERT YOUR CODE HERE"
confusion_matrix

In [None]:
metrics

How did it go? Recall is probably not that good, but it can be improved. Train the model again, but include the parameter `class_weight` and set it correctly. This addition tends to improve the performance of the model when we are dealing with unbalanced datasets. Do not forget to set the `penalty` parameter to `None`.

In [None]:
logistic_model_balanced = "INSERT YOUR CODE HERE"
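
For background on what a class weighting does, scikit-learn documents a `'balanced'` mode that weights each class inversely to its frequency, using `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that computation on hypothetical labels (`toy_labels` is invented for illustration). Whether this is the setting the activity expects is an assumption.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: six negatives, two positives.
toy_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1])
classes = np.array([0, 1])

# Weights as computed by scikit-learn in 'balanced' mode.
weights = compute_class_weight(class_weight='balanced', classes=classes, y=toy_labels)

# The same weights from the documented formula.
manual = len(toy_labels) / (len(classes) * np.bincount(toy_labels))

print(weights, manual)
```

The minority class receives the larger weight, so misclassifying a positive case costs more during training, which is why recall tends to go up.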

Save the predictions in the variable called `y_pred_balanced`.

In [None]:
y_pred_balanced = "INSERT YOUR CODE HERE"

Once again, we will create a dataframe that contains the true and predicted labels of the test set.

In [None]:
labels_balanced = {'observed': y_test, 'prediction': y_pred_balanced}
labels_balanced_df = pd.DataFrame(labels_balanced)
labels_balanced_df.head()

Evaluate the new model using the `performance_metrics` function.

In [None]:
confusion_matrix, metrics = "INSERT YOUR CODE HERE"
confusion_matrix

In [None]:
metrics

You should see an improvement in the recall metric.

## Data Imputation

Data imputation is a statistical and data preprocessing technique used to fill in missing or incomplete values in a data set. When working with real-world data, it is common to encounter missing data points, which can result from a variety of reasons, such as data collection errors, sensor failures, or survey non-response. Data imputation aims to estimate or predict these missing values, making the data set more complete and suitable for analysis or modeling.

Depending on the nature of the data and the specific problem, one or more data imputation methods are chosen. Some common imputation methods are:

- **Mean, Median, or Mode Imputation:** Replacing missing values with the mean (average), median, or mode of the observed values for that variable.
- **Regression Imputation:** Using regression models to predict missing values based on other variables in the dataset.
- **K-Nearest Neighbors (K-NN) Imputation:** Estimating missing values by considering the values of the nearest neighbors in the dataset.
- **Interpolation:** Filling in missing values by estimating them based on adjacent data points in a time series or spatial dataset.
- **Random Imputation:** Replacing missing values with randomly generated values, often drawn from the same distribution as the observed data.

Data imputation is an essential step in preprocessing and preparing data for various data analysis and machine learning tasks. Imputed datasets allow you to retain more data for analysis and modeling, which can lead to better insights and more accurate predictions. However, it is important to be careful when applying data imputation and to choose the most appropriate method for the specific dataset and research problem, as improper imputation can lead to bias or inaccurate results.
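
As a small illustration of the simplest methods in the list, the sketch below fills missing values with the column mean and with the column median using `SimpleImputer` from `sklearn.impute`. The array `toy` is hypothetical, and here the missing values are encoded as `NaN` rather than as zeros.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column.
toy = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan],
                [5.0, 60.0]])

# Replace each NaN with the mean of the observed values in its column.
mean_filled = SimpleImputer(strategy='mean').fit_transform(toy)

# Replace each NaN with the median of the observed values in its column.
median_filled = SimpleImputer(strategy='median').fit_transform(toy)

print(mean_filled)
print(median_filled)
```

In the second column the mean (30) and the median (20) differ noticeably because of the large value 60, which shows how the choice of method can matter when a variable is skewed.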

### Mean Imputation

As mentioned before, for some variables, the missing values of the dataset were replaced by zeros. Let us count how many zeros we have for each feature.

In [None]:
data.isin([0]).sum()

We can see that `SkinThickness` and `Insulin` are missing 215 and 359 values, respectively. We will deal with those later. For now, you will perform mean imputation for the other variables. Keep in mind, however, that imputation only makes sense for some of these variables, so choose wisely.

Mean imputation will be carried out in two steps. First, compute the mean of each variable separately for two groups: the people who do not have diabetes and the people who do. Second, replace each missing value with the mean of the matching group. For instance, if a missing `BMI` value belongs to a person without diabetes, substitute the zero with the mean `BMI` of the people who do not have the disease.
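
The idea of group-wise mean imputation can be sketched on a toy dataframe. The column names `group` and `value` are hypothetical stand-ins, and excluding the zeros when computing the group means is an assumption of this sketch, not something the activity prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical data: zeros in 'value' stand for missing measurements.
toy = pd.DataFrame({'group': [0, 0, 0, 1, 1, 1],
                    'value': [20.0, 0.0, 30.0, 40.0, 50.0, 0.0]})

# Mark zeros as missing so they do not distort the means (assumption).
as_missing = toy['value'].replace(0, np.nan)

# Mean of the observed values within each group, aligned row by row.
group_means = as_missing.groupby(toy['group']).transform('mean')

# Fill each missing value with the mean of its own group.
toy['value'] = as_missing.fillna(group_means)
print(toy)
```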

Before going any further, let us create a copy of the dataset. 

In [None]:
new_data = data.copy()

Use the following cell to code your implementation of mean imputation. It should not take too many lines of code, by the way.

In [None]:
"INSERT YOUR CODE HERE"

You just wrote a few lines of code, didn't you? 

### Regression Imputation

You will carry out regression imputation to replace the missing values of the features `SkinThickness` and `Insulin`. You will perform this process twice, once for each of these two variables.

Let us begin with `SkinThickness`. First, you will have to create the data for training the linear regression model. In this case, the target variable is the non-zero values of `SkinThickness`. As predictors, you can use all the other variables, except for `Insulin`. Yes, I know, `Outcome` is a categorical variable and we did not talk about categorical variables in class, but in this case you do not have to do anything about this, so all good.  
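
Before you start, here is a minimal sketch of regression imputation on a toy dataframe. The column names `feature` and `target` are hypothetical, and the data were chosen so that the known points lie exactly on the line `target = 2 * feature`. It illustrates the general pattern only, not the intended solution for the diabetes data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: zeros in 'target' stand for missing values.
toy = pd.DataFrame({'feature': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'target':  [2.0, 4.0, 0.0, 8.0, 0.0]})

# Rows where the target was actually observed.
known = toy['target'] != 0

# Fit the regression on the observed rows only.
model = LinearRegression().fit(toy.loc[known, ['feature']], toy.loc[known, 'target'])

# Predict the target for the rows where it is missing and write it back.
toy.loc[~known, 'target'] = model.predict(toy.loc[~known, ['feature']])
print(toy)
```

Nothing in a linear model restricts its predictions to a plausible range, so imputed values can fall outside the range of the observed data and should be checked afterwards.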

Use the following cell to create the training data of the first linear regression model.

In [None]:
variables = ['SkinThickness', 'Insulin']
X_linear_1 = "INSERT YOUR CODE HERE"
y_linear_1 = "INSERT YOUR CODE HERE"

Train the first linear model in the next cell.

In [None]:
linear_model_1 = "INSERT YOUR CODE HERE"

Create a test set with the observations for which the values of `SkinThickness` are equal to zero. You will use this set to obtain the predictions that will replace the missing values.

In [None]:
X_linear_test_1 = "INSERT YOUR CODE HERE"

Create a new column called `Predictions` for the `X_linear_test_1` dataframe and store the predictions.

In [None]:
X_linear_test_1['Predictions'] = "INSERT YOUR CODE HERE"

Replace the missing values with your predictions in the `new_data` dataframe.

In [None]:
new_data.loc["INSERT YOUR CODE HERE"] = X_linear_test_1['Predictions']

Note that we do not have zeros in the `SkinThickness` variable anymore, which is good news. Now repeat the same imputation for the `Insulin` variable in the following cell. By the way, keep the command `new_data.describe()` as the last line of code of the next cell.

In [None]:
X_linear_2 = "INSERT YOUR CODE HERE"
y_linear_2 = "INSERT YOUR CODE HERE"
linear_model_2 = "INSERT YOUR CODE HERE"
X_linear_test_2 = "INSERT YOUR CODE HERE"
X_linear_test_2['Predictions'] = "INSERT YOUR CODE HERE"
new_data.loc["INSERT YOUR CODE HERE"] = X_linear_test_2['Predictions']
new_data.describe()

As you can see, there must be some observations for which the predicted value of `Insulin` was a negative number. We should not allow that. Let us inspect that.

In [None]:
new_data[new_data['Insulin'] < 0]

So it is just one observation, no big deal. We can replace the negative number with the mean of `Insulin`, but only considering the people who do not have diabetes. Do that in the next cell.

In [None]:
"INSERT YOUR CODE HERE"

Sanity check!

In [None]:
new_data.describe()

All is well. Now we are ready for some logistic regression. Repeat the training of the model with the `new_data` dataframe. Remember to set the `penalty` parameter to `None` and include the proper setting of the `class_weight` parameter.  

In [None]:
y = new_data['Outcome']
X = new_data.drop(columns='Outcome')
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
logistic_model = "INSERT YOUR CODE HERE"
y_pred_balanced = "INSERT YOUR CODE HERE"
labels_balanced = {'observed': y_test, 'prediction': y_pred_balanced}
labels_balanced_df = pd.DataFrame(labels_balanced)

And now the final results.

In [None]:
confusion_matrix, metrics = performance_metrics(labels_balanced_df)
confusion_matrix

In [None]:
metrics

Recall remained pretty much the same, but the other metrics improved a tiny bit. So, was it worth it? If this were a Kaggle competition, absolutely!