# **Data Cleaning**

## Objectives
* Evaluate blank values and handle as required
* Explore correlation and PPS scores

## Inputs
* outputs/datasets/collection/titanic_passengers.csv

## Outputs
* Generate cleaned Train and Test sets, saved in outputs/datasets/cleaned



---

## Set up the Working Directory

In [None]:
import os
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
current_dir

## Load Collected Data

In [None]:
import pandas as pd
df_raw_path = "outputs/datasets/collection/titanic_passengers.csv"
df = pd.read_csv(df_raw_path)
df.head()

---

## Data Exploration

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

There are 3 variables with missing data: Age, Cabin and Embarked.
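
The check above can be illustrated on a small toy frame. This is only a sketch: the values are hypothetical and only the column names mirror the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 8.05, 51.86],
})

# Count missing values per column, then keep only columns with at least one.
missing_counts = toy.isna().sum()
cols_with_missing = toy.columns[missing_counts > 0].to_list()

print(missing_counts)
print(cols_with_missing)
```

`Fare` has no missing values in the toy frame, so it is excluded from the resulting list.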

As in the previous stage, a ydata-profiling report gives more detail about the nature of the data.

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

## Correlation and Predictive Power Score Analysis

Correlation and PPS are used to further investigate how the target variable 'Survived' relates to the other features.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps
%matplotlib inline
import warnings
warnings.filterwarnings("ignore", category=UserWarning)


def heatmap_corr(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        mask[abs(df) < threshold] = True

        fig, axes = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                    linewidth=0.5
                    )
        axes.set_yticklabels(df.columns, rotation=0)
        plt.ylim(len(df.columns), 0)
        plt.show()


def heatmap_pps(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < threshold] = True
        fig, ax = plt.subplots(figsize=figsize)
        ax = sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                         mask=mask, cmap='rocket_r', annot_kws={"size": font_annot},
                         linewidth=0.05, linecolor='grey')
        plt.ylim(len(df.columns), 0)
        plt.show()


def CalculateCorrAndPPS(df):
    df_corr_spearman = df.corr(method="spearman",numeric_only=True)
    df_corr_pearson = df.corr(method="pearson",numeric_only=True)

    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

    pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
    print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
    print(pps_score_stats.round(3))

    return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix, CorrThreshold, PPS_Threshold,
                      figsize=(20, 12), font_annot=8):

    print("\n")
    print("* Analyse how the target variable for your ML models is correlated with other variables (features and target)")
    print("* Analyse multicollinearity, that is, how the features are correlated among themselves")

    print("\n")
    print("*** Heatmap: Spearman Correlation ***")
    print("It evaluates monotonic relationship \n")
    heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Pearson Correlation ***")
    print("It evaluates the linear relationship between two continuous variables \n")
    heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Predictive Power Score (PPS) ***")
    print(f"PPS detects linear or non-linear relationships between two columns.\n"
          f"The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
    heatmap_pps(df=pps_matrix, threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)

In [None]:
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)
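
The summary printed by `CalculateCorrAndPPS` is meant to guide the choice of `PPS_Threshold`. The sketch below shows one possible way to read a threshold off the quartiles. The scores are made-up numbers, and using the upper quartile as the cut-off is an assumption for illustration, not a rule taken from this notebook.

```python
import pandas as pd

# Hypothetical PPS values, mimicking the 'ppscore' column returned by pps.matrix.
# These numbers are invented for illustration.
scores = pd.Series([0.0, 0.0, 0.05, 0.1, 0.2, 0.3, 0.6, 1.0], name="ppscore")

# Exclude self-prediction scores of exactly 1, as the notebook's query does.
off_diagonal = scores[scores < 1]

q1, q3 = off_diagonal.quantile([0.25, 0.75])
iqr = q3 - q1

# Assumption: use the upper quartile as the heatmap threshold.
pps_threshold = q3
print(f"Q1={q1:.3f}, Q3={q3:.3f}, IQR={iqr:.3f}, threshold={pps_threshold:.3f}")
```

A higher threshold hides more cells in the heatmap, so the value is a trade-off between readability and completeness.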

In [None]:
DisplayCorrAndPPS(df_corr_pearson = df_corr_pearson,
                  df_corr_spearman = df_corr_spearman, 
                  pps_matrix = pps_matrix,
                  CorrThreshold = 0.4, PPS_Threshold =0.2,
                  figsize=(12,10), font_annot=10)

The heatmaps, above, reveal the following:
* There is a moderate monotonic relationship between Fare and each of Pclass (ticket class), SibSp (number of siblings/spouses aboard) and Parch (number of parents/children aboard).
* There is a moderate linear relationship between Fare and Pclass, and between SibSp and Parch.
    * The first means that 1st class tickets are likely to be more expensive than 2nd and 3rd class tickets.
    * The second makes intuitive sense, since both variables count family members travelling together.
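
The difference between a monotonic (Spearman) and a linear (Pearson) relationship can be seen on synthetic data. This is a minimal sketch using an invented cubic relationship, not the Titanic columns.

```python
import numpy as np
import pandas as pd

# Synthetic data: y grows monotonically with x, but not linearly.
x = np.arange(1, 21)
toy = pd.DataFrame({"x": x, "y": x ** 3})

# Pearson measures linear association; Spearman measures rank (monotonic) association.
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because the ranks of `x` and `y` agree exactly, the Spearman coefficient is 1, while the Pearson coefficient is lower because the relationship is curved.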
    
There are a number of significant conclusions to be drawn from the PPS heatmap.

1. Survived has a moderate PPS (0.52) for predicting Sex.
2. Fare has a very strong PPS (0.9) for predicting Pclass.
3. Fare also has a strong PPS (0.74) for predicting Embarked.




## Data Cleaning

### Missing Data

Below is a custom function that displays missing data levels in a DataFrame. It is originally from Code Institute's Walkthrough Project: Churnometer.

In [None]:
def EvaluateMissingData(df):
    missing_data_absolute = df.isnull().sum()
    missing_data_percentage = round(missing_data_absolute/len(df)*100, 2)
    df_missing_data = (pd.DataFrame(
                            data={"RowsWithMissingData": missing_data_absolute,
                                   "PercentageOfDataset": missing_data_percentage,
                                   "DataType": df.dtypes}
                                    )
                          .sort_values(by=['PercentageOfDataset'], ascending=False)
                          .query("PercentageOfDataset > 0")
                          )

    return df_missing_data

In [None]:
EvaluateMissingData(df)

### Cleaning Summary

This project will handle missing data as follows:

* Drop `['Cabin', 'PassengerId', 'Ticket', 'Name']`. PassengerId, Ticket and Name are identifiers that are unique or nearly unique to each passenger and carry little predictive information. Cabin is also nearly unique to each passenger and has a large share (>70%) of missing values.
* Mean Median Imputer `'Age'`. The mean or median (depending on the distribution) will be used to impute missing values.
* Categorical Imputer `'Embarked'`. Missing values will be imputed with the value 'Missing'.

The pipeline is created incrementally.
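
As a rough illustration of what the three steps do, here is a pandas-only sketch on a toy frame. The rows are hypothetical, and the notebook itself uses feature_engine transformers rather than these direct calls.

```python
import numpy as np
import pandas as pd

# Toy rows with hypothetical values; column names follow the Titanic dataset.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["A", "B", "C", "D"],
    "Ticket": ["T1", "T2", "T3", "T4"],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Age": [20.0, np.nan, 30.0, 70.0],
    "Embarked": ["S", None, "C", "S"],
})

# 1. Drop identifier-like and mostly-empty columns.
cleaned = toy.drop(columns=["Cabin", "PassengerId", "Ticket", "Name"])

# 2. Fill missing Age with the median of the observed ages.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())

# 3. Fill missing Embarked with the literal label 'Missing'.
cleaned["Embarked"] = cleaned["Embarked"].fillna("Missing")

print(cleaned)
```

Unlike this sketch, a fitted pipeline stores the learned median so the same value can be reused on new data.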

### Drop Variables

In [None]:
from sklearn.pipeline import Pipeline
from feature_engine.selection import DropFeatures

pipeline = Pipeline([
    ('drop', DropFeatures(features_to_drop=['Cabin','PassengerId','Ticket','Name']))
])

pipeline

### Impute Age

First, it is necessary to establish whether Age is normally distributed, starting with a visual inspection via a histogram and a Q-Q plot.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

age_data = df['Age'].dropna()

plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
sns.histplot(age_data, kde=True)
plt.title("Histogram of Age")

plt.subplot(1, 2, 2)
stats.probplot(age_data, plot=plt)
plt.title("Q-Q Plot")

plt.tight_layout()
plt.show()

These plots show that the shape looks relatively normal but with some deviation, particularly at the lower end.

The Shapiro-Wilk test can be used to test for normality. Its null hypothesis is that the data are drawn from a normal distribution.

In [None]:

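alpha = 0.05  # significance level
stat, p = stats.shapiro(age_data)

print(f"Shapiro-Wilk test statistic: {stat:.6f}")
print(f"P-value: {p:.6f}")
# Normality is rejected when the p-value falls below alpha
print(f"Reject normality at alpha={alpha}: {p < alpha}")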

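The p-value is below the 0.05 significance level, so the null hypothesis of normality is rejected. Age is therefore not normally distributed, and missing values will be imputed with the median, which is less sensitive to skew than the mean.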
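
The decision rule can be wrapped in a small helper. This is a sketch under stated assumptions: `choose_imputation_method` is a hypothetical function name, and the sample is synthetic right-skewed data rather than the Titanic ages.

```python
import numpy as np
import pandas as pd
from scipy import stats


def choose_imputation_method(values, alpha=0.05):
    """Return 'mean' if Shapiro-Wilk does not reject normality, else 'median'."""
    observed = pd.Series(values).dropna()
    _, p_value = stats.shapiro(observed)
    return "mean" if p_value >= alpha else "median"


# Synthetic, strongly right-skewed sample (not Titanic data).
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=10.0, size=200)

print(choose_imputation_method(skewed))
```

The returned string matches the values accepted by the `imputation_method` argument of `MeanMedianImputer`.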

In [None]:
from feature_engine.imputation import MeanMedianImputer

pipeline = Pipeline([
    ('drop', DropFeatures(features_to_drop=['Cabin', 'PassengerId', 'Ticket', 'Name'])),
    ('median', MeanMedianImputer(variables=['Age'], imputation_method='median')),
])

pipeline

### Impute Embarked

In [None]:
from feature_engine.imputation import CategoricalImputer

pipeline = Pipeline([
    ('drop', DropFeatures(features_to_drop=['Cabin', 'PassengerId', 'Ticket', 'Name'])),
    ('median', MeanMedianImputer(variables=['Age'], imputation_method='median')),
    ('categorical_imputer', CategoricalImputer(imputation_method='missing',
                                               fill_value='Missing',
                                               variables=['Embarked'])),
])

pipeline

### Fit and Transform Dataset

In [None]:
pipeline.fit(df)

In [None]:
df = pipeline.transform(df)

## Split Train and Test Sets

In [None]:
from sklearn.model_selection import train_test_split
TrainSet, TestSet, _, __ = train_test_split(
                                        df,
                                        df['Survived'],
                                        test_size=0.2,
                                        random_state=0)

print(f"TrainSet shape: {TrainSet.shape} \nTestSet shape: {TestSet.shape}")

## Push Cleaned Data to Repo

In [None]:
import os
# exist_ok=True avoids an error if the folder already exists
os.makedirs('outputs/datasets/cleaned', exist_ok=True)

In [None]:
TrainSet.to_csv("outputs/datasets/cleaned/TrainSetCleaned.csv", index=False)

In [None]:
TestSet.to_csv("outputs/datasets/cleaned/TestSetCleaned.csv", index=False)