# Coding Exercise 2: Handling Missing Data in a Dataset for Machine Learning

Instructions:

- Import the necessary Python libraries for data preprocessing, including the `SimpleImputer` class from the scikit-learn library.

- Load the dataset into a pandas DataFrame using the `read_csv` function from the pandas library. The dataset file is named `pima-indians-diabetes.csv`.

- Identify missing data in your dataset. Print out the number of missing entries in each column. Analyze its potential impact on machine learning model training. This step is crucial as missing data can lead to inaccurate and misleading results.

- Choose a strategy for handling missing data that suits the nature of your dataset. In this exercise, replace each missing value with the mean of its column. Other strategies include dropping the rows or columns that contain missing data, or replacing missing values with the median or a constant.

- Configure an instance of the `SimpleImputer` class to replace missing values with the mean value of the column.

- Call the `fit` method of the `SimpleImputer` instance on the numerical columns of your matrix of features so that it learns each column's mean.

- Use the `transform` method of the `SimpleImputer` class to replace missing data in the specified numerical columns.

- Update the matrix of features by assigning the result of the `transform` method to the correct columns.

- Print your updated matrix of features to verify the success of the missing data replacement.
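
The strategies listed in the instructions can be compared side by side on a small example. The following is a minimal sketch, not part of the exercise solution: the `toy` DataFrame and its column names `a` and `b` are made up for illustration and have nothing to do with the Pima dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data (not the Pima dataset), with one NaN per column
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 8.0],
    "b": [10.0, 20.0, np.nan, 40.0],
})

# Strategy 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Strategies 2-4: fill missing values with the column mean, the column
# median, or a fixed constant
results = {}
for strategy, kwargs in [("mean", {}), ("median", {}), ("constant", {"fill_value": 0.0})]:
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy, **kwargs)
    filled = imputer.fit_transform(toy)  # returns a NumPy array
    results[strategy] = pd.DataFrame(filled, columns=toy.columns)

print(dropped)
for name, frame in results.items():
    print(name)
    print(frame)
```

Dropping rows shrinks the dataset, while imputation keeps every row at the cost of introducing estimated values. Which trade-off is acceptable depends on how much data is missing and why.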


# Data Preprocessing

## Importing the libraries

In [5]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

## Importing the dataset

In [7]:
# Dataset: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database?resource=download
dataset = pd.read_csv('pima-indians-diabetes.csv')

## Identify missing data (assumes that missing data is represented as NaN)

In [9]:
missing_data = dataset.isnull().sum()
print("Number of missing entries in each column:")
print(missing_data)

Number of missing entries in each column:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
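
Every count is zero because `isnull()` only detects `NaN`. The printed data further down contains zeros in `SkinThickness` and `Insulin`, and in this dataset such zeros are commonly treated as placeholders for missing measurements. That interpretation is an assumption and is not stated in the exercise. Under that assumption, a sketch of how the zeros could be converted to `NaN` before counting looks like this; the `sample` frame reuses a few columns from the first five printed rows, and the list `zero_as_missing` is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# A few columns from the first five rows of the printed dataset
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns marks a missing measurement
zero_as_missing = ["Glucose", "SkinThickness", "Insulin", "BMI"]

marked = sample.copy()
for col in zero_as_missing:
    # Replace the placeholder 0 with NaN so that isnull() can detect it
    marked[col] = marked[col].replace(0, np.nan)

missing_counts = marked.isnull().sum()
print(missing_counts)
```

Columns such as `Pregnancies` or `Outcome` are deliberately left out of the list, since zero is a legitimate value there.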


## Configure an instance of the SimpleImputer class


In [11]:
# Use np.nan; the np.NaN alias was removed in NumPy 2.0
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

## Fit the imputer on the DataFrame


In [13]:
imputer.fit(dataset)

## Apply the transform to the DataFrame


In [15]:
dataset_imputed = imputer.transform(dataset)
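
The instructions ask for the imputer to be fitted on the matrix of features, whereas the cells above fit it on the whole DataFrame, target column included. A minimal sketch of the feature-only variant follows. It assumes that `Outcome` is the target column and uses a hypothetical `toy` frame rather than the real file; `fit_transform` combines the `fit` and `transform` steps in one call.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame: two feature columns plus the assumed target column
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0],
    "BMI": [33.6, 26.6, np.nan, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# Separate the matrix of features from the target
X = toy.drop(columns="Outcome")
y = toy["Outcome"]

# Learn the column means from X and fill the gaps in one step
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = pd.DataFrame(
    imputer.fit_transform(X), columns=X.columns, index=X.index
)

print(imputer.statistics_)  # the per-column means that were learned
print(X_imputed)
```

Keeping the target out of the imputer means the labels are never altered. When the data is later split into training and test sets, fitting the imputer on the training portion only avoids leaking test-set statistics into the model.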

## Print your updated matrix of features


In [17]:
# Convert the result back to a DataFrame
dataset_imputed = pd.DataFrame(dataset_imputed, columns=dataset.columns)


print("Updated matrix of features after imputation:")
print(dataset_imputed)


Updated matrix of features after imputation:
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6.0    148.0           72.0           35.0      0.0  33.6   
1            1.0     85.0           66.0           29.0      0.0  26.6   
2            8.0    183.0           64.0            0.0      0.0  23.3   
3            1.0     89.0           66.0           23.0     94.0  28.1   
4            0.0    137.0           40.0           35.0    168.0  43.1   
..           ...      ...            ...            ...      ...   ...   
763         10.0    101.0           76.0           48.0    180.0  32.9   
764          2.0    122.0           70.0           27.0      0.0  36.8   
765          5.0    121.0           72.0           23.0    112.0  26.2   
766          1.0    126.0           60.0            0.0      0.0  30.1   
767          1.0     93.0           70.0           31.0      0.0  30.4   

     DiabetesPedigreeFunction   Age  Outcome  
0                  