<a href="https://colab.research.google.com/github/carighi/al_ml_workshop/blob/main/Mark_and_Remove_Missing_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Mark and Remove Missing Data**

In this tutorial, you will learn:

* How to mark invalid or corrupt values as missing in your dataset.
* How to confirm that the presence of marked missing values causes problems for learning algorithms.
* How to remove rows with missing data from your dataset and evaluate a learning algorithm on the transformed dataset.

Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).

#Diabetes Dataset
The dataset classifies each patient record as either showing an onset of diabetes within five years or not. It has 768 examples and eight input variables, and it is a binary classification problem.

You can learn more about the dataset here:

* Diabetes Dataset File ([pima-indians-diabetes.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv))
* Diabetes Dataset Details ([pima-indians-diabetes.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names))

A description of the Diabetes Dataset can be found [here](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database).

##Download Diabetes data files

In [None]:
!wget "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv" -O pima-indians-diabetes.csv
!wget "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names" -O pima-indians-diabetes.names
!head pima-indians-diabetes.csv

In [None]:
# load and summarize the dataset
import pandas as pd
# load the dataset
dataset = pd.read_csv('pima-indians-diabetes.csv', header=None)
dataset.columns = ['Number of times pregnant', 'Plasma glucose concentration', 'Diastolic blood pressure', 'Triceps skinfold thickness', '2-Hour serum insulin', 'Body mass index', 'Diabetes pedigree function', 'Age', 'Class variable (0 or 1)']
# summarize the dataset (Hint: .describe())
#Your code here

We can see that there are columns that have a minimum value of zero (0).
On some columns, a value of zero does not make sense and indicates an invalid or missing value.

Specifically, the following columns have an invalid zero minimum value:
1. Plasma glucose concentration
1. Diastolic blood pressure
1. Triceps skinfold thickness
1. 2-Hour serum insulin
1. Body mass index

We can confirm this by looking at the raw data and printing out the first 20 rows of data.

In [None]:
# summarize the first 20 rows of data
# your code here

We can count the number of zero (missing) values in each of these columns.

In [None]:
# Summarizing the number of missing values for each variable
# count the number of missing values for each column
num_missing = (dataset[['Plasma glucose concentration', 'Diastolic blood pressure', 'Triceps skinfold thickness', '2-Hour serum insulin', 'Body mass index']] == 0).sum()
# report the results
print(num_missing)

We can see that the columns at indexes 1, 2 and 5 (plasma glucose concentration, diastolic blood pressure
and body mass index) have just a few zero values, whereas the columns at indexes 3 and 4 (triceps skinfold
thickness and 2-hour serum insulin) show a lot more, nearly half of the rows. This highlights that different
missing value strategies may be needed for different columns, e.g. to ensure that there are still enough
records left to train a predictive model.
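
As a rough sketch of how to compare columns on the same scale, the proportion of zeros can be computed instead of the raw count. The small DataFrame below is made-up toy data (the column names `glucose` and `insulin` are placeholders, not the real dataset), so the numbers only illustrate the technique.

```python
import pandas as pd

# Toy data (hypothetical values), with 0 standing in for "not recorded".
toy = pd.DataFrame({
    "glucose": [148, 85, 0, 89, 137, 116],
    "insulin": [0, 0, 0, 94, 168, 0],
})

# Comparing with 0 gives a True/False mask; the mean of a boolean
# mask is the fraction of True entries, i.e. the fraction of zeros.
zero_fraction = (toy == 0).mean()
print(zero_fraction.round(2))
```

A column with a small fraction of zeros loses few rows when those rows are removed, while a column with a large fraction may call for a different strategy.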

In Python, specifically in Pandas, NumPy and Scikit-Learn, we mark missing values as NaN.
NaN values are ignored by operations like sum and count. We can mark values
as NaN easily in a Pandas DataFrame by using the replace() function on the subset of
columns we are interested in. After we have marked the missing values, we can use the
isnull() function to flag every NaN value in the dataset as True, and then sum the result
to get a count of the missing values for each column.
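
A minimal sketch of this pattern on a made-up DataFrame may help before applying it to the real data. The column names `a`, `b` and `c` and their values are hypothetical, and zero is assumed to be invalid only in `b` and `c`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): 0 means "not recorded" in columns b and c only.
toy = pd.DataFrame({
    "a": [1, 0, 3, 4],
    "b": [0, 5, 0, 7],
    "c": [2.5, 0.0, 1.5, 3.0],
})

# Mark zeros as missing only in the columns where zero is invalid.
cols = ["b", "c"]
toy[cols] = toy[cols].replace(0, np.nan)

# isnull() returns a True/False mask; summing it counts missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Reductions such as sum() and count() skip NaN values by default.
print(toy["b"].sum(), toy["b"].count())
```

Column `a` keeps its zero because it was not included in the subset passed to replace().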

In [None]:
# Marking missing values with nan values
from numpy import nan
from pandas import read_csv
# load the dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# replace '0' values with 'nan'
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)
# count the number of nan values in each column. Hint: use .isnull().sum()
# your code here

We can confirm by printing out the first 20 rows of data.

In [None]:
# Review data with missing values marked with a nan
from numpy import nan
from pandas import read_csv
# load the dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# replace '0' values with 'nan'
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)
# summarize the first 20 rows of data
# your code here

#Missing Values Cause Problems
Having missing values in a dataset can cause errors with some machine learning algorithms. We are going to try classifying diabetes versus non-diabetes using Linear Discriminant Analysis (LDA). LDA is a classification algorithm that finds a linear combination of features that characterizes or separates two or more classes of objects or events. It assumes that the input variables have a Gaussian distribution and the same variance. This algorithm is sensitive to missing data.
When you run the next cell, you will get an error because of this.

In [None]:
# example where missing values cause errors
from numpy import nan
from pandas import read_csv
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# load the dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# replace '0' values with 'nan'
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)
# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]
# define the model
model = LinearDiscriminantAnalysis()
# define the model evaluation procedure using k-fold cross-validation
cv = KFold(n_splits=3, shuffle=True, random_state=1)
# evaluate the model accuracy score
result = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
# report the mean performance
print('Accuracy: %.3f' % result.mean())

Here is a brief explanation of what each part of the code does:

* It imports the necessary libraries and functions.
* It reads the dataset from a CSV file using the pandas read_csv function.
* It replaces all 0 values in columns 1 to 5 with NaN (Not a Number), as these represent missing data.
* It separates the dataset into input features (X) and the target variable (y).
* It creates an LDA model.
* It sets up k-fold cross-validation with 3 splits.
* It tries to evaluate the model using cross-validation. This is the step that fails, because LDA cannot be fit on inputs that contain NaN.
* Finally, it is meant to print the mean accuracy of the model.

Note that this cell does not remove the rows with missing values. That is done with the dropna function in the next section.
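
The failure can be reproduced in isolation on a tiny array. This is only a sketch with made-up values: it calls fit() directly rather than going through cross-validation, and it captures the error message instead of letting the cell crash. The exact wording of the message depends on the installed scikit-learn version.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy data (hypothetical): two features, two classes, one missing entry.
X = np.array([[1.0, 2.0], [1.5, np.nan], [3.0, 4.0], [3.5, 4.5]])
y = np.array([0, 0, 1, 1])

error_message = None
try:
    LinearDiscriminantAnalysis().fit(X, y)
except ValueError as exc:
    # scikit-learn validates the inputs and rejects NaN before fitting the model
    error_message = str(exc)

print(error_message)
```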



#Remove Rows With Missing Values

We need to address the missing values to be able to rerun LDA.

The simplest approach for dealing with missing values is to remove entire predictor(s)
and/or sample(s) that contain missing values.

We can do this by creating a new Pandas DataFrame with the rows containing missing values
removed. Pandas provides the **dropna**() function that can be used to drop either columns or
rows with missing data. We can use **dropna**() to remove all rows with missing data.
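
The difference between dropping rows and dropping columns can be sketched on a small made-up DataFrame (the values and column names are hypothetical). This version returns new DataFrames rather than modifying the original in place.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one NaN in column b and one in column c.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, 5.0, 6.0, 7.0],
    "c": [8.0, 9.0, np.nan, 10.0],
})

# Default (axis=0): drop every row that contains at least one NaN.
rows_dropped = toy.dropna()
# axis=1: drop every column that contains at least one NaN.
cols_dropped = toy.dropna(axis=1)

print(toy.shape, rows_dropped.shape, cols_dropped.shape)
```

Dropping rows keeps all predictors but loses samples, while dropping columns keeps all samples but loses predictors.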

In [None]:
# example of removing rows that contain missing values
from numpy import nan
from pandas import read_csv
# load the dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# summarize the shape of the raw data
print(dataset.shape)
# replace '0' values with 'nan'
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)
# drop rows with missing values. Hint: use .dropna(inplace=True)
# Your code here
# summarize the shape of the data with missing rows removed
# Your code here

We now have a dataset that we could use to evaluate an algorithm sensitive to missing values
like LDA. Let's try again!

In [None]:
# evaluate model on data after rows with missing data are removed
from numpy import nan
from pandas import read_csv
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# load the dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# replace '0' values with 'nan'
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)
# drop rows with missing values
dataset.dropna(inplace=True)
# split dataset into inputs and outputs
values = dataset.values
# In this case, values is a 2D array. The : character means "all rows" and 0:8 means "columns from 0 to 7". So, X = values[:,0:8] is selecting all rows and the first 8 columns of the data to be assigned to X. This is typically done when you want to separate your features (input for your model) from your target variable.
X = values[:,0:8]
# What values are then assigned to y? What is y?
y = values[:,8]
# define the model
model = LinearDiscriminantAnalysis()

# define the model evaluation procedure. Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.
# The KFold divides all the samples into 'k' groups of samples, called folds. Here, it's dividing the data into 3 folds.
cv = KFold(n_splits=3, shuffle=True, random_state=1)

# evaluate the model accuracy score. For each split, it fits the LDA model on the training folds,
# makes predictions on the held-out test fold, and then calculates the accuracy of those predictions
result = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

# report the mean performance. This line is printing the mean accuracy of the model across all folds of the cross-validation.
# The %.3f is a placeholder for a floating-point number, with 3 digits after the decimal point.
# The % operator is then used to insert the mean accuracy (result.mean()) into this placeholder
print('Accuracy: %.3f' % result.mean())
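
To see what the k-fold procedure does with the rows, the sketch below applies the same KFold settings to a made-up array of 9 samples and prints the row indexes in each split. The array is a placeholder and not the diabetes data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy inputs (hypothetical): 9 samples with 2 features each.
X = np.arange(18).reshape(9, 2)

cv = KFold(n_splits=3, shuffle=True, random_state=1)

fold_sizes = []
seen = []
for fold, (train_idx, test_idx) in enumerate(cv.split(X)):
    # Each split holds out one fold for testing and trains on the other two.
    fold_sizes.append((len(train_idx), len(test_idx)))
    seen.extend(test_idx.tolist())
    print(f"fold {fold}: train={train_idx}, test={test_idx}")
```

Every sample lands in the test fold exactly once, which is why the mean of the per-fold accuracies uses all of the data for evaluation.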