# Data Preparation

Data preparation refers to transforming raw data into a form that is better suited to predictive modeling.

On a predictive modeling project, such as classification or regression, raw data typically cannot be used directly.

There are four main reasons why this is the case:

 - <strong>Data Types</strong>: Most machine learning algorithms require numeric input.
 - <strong>Data Requirements</strong>: Some machine learning algorithms impose requirements on the data.
 - <strong>Data Errors</strong>: Statistical noise and errors in the data may need to be corrected.
 - <strong>Data Complexity</strong>: Complex nonlinear relationships may be teased out of the data.

The raw data must be pre-processed prior to being used to fit and evaluate a machine learning model.

Preprocessing includes:

 - <strong>Data Cleaning</strong>: Identifying and correcting mistakes or errors in the data.
 - <strong>Feature Selection</strong>: Identifying those input variables that are most relevant to the task.
 - <strong>Data Transforms</strong>: Changing the scale or distribution of variables.
 - <strong>Feature Engineering</strong>: Deriving new variables from available data.
 - <strong>Dimensionality Reduction</strong>: Creating compact projections of the data.

## 1. Identify and deal with missing values
<ul>
<li>Identify whether the dataset has any missing values.</li>
<li>Decide whether to keep or delete the affected record or feature.</li>
</ul>
Filling missing values with substitute data is called <strong>data imputation</strong>. A popular approach is to calculate a statistic for each column (such as the mean) and replace every missing value in that column with it.<br>
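
As a minimal sketch of both steps, the snippet below counts the missing values per column and then fills each gap with the column mean. The toy DataFrame and its column names are made up for illustration and are not part of the dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame({
    "height": [1.60, np.nan, 1.75, 1.80],
    "weight": [55.0, 70.0, np.nan, np.nan],
})

# Step 1: identify missing values per column.
missing_counts = toy.isna().sum()
print(missing_counts)

# Step 2: impute each column with its own mean (computed over the observed values).
imputed = toy.fillna(toy.mean())
print(imputed)
```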

### Types of missing data
<strong>Missing Completely At Random (MCAR)</strong>: A variable is MCAR if the probability of being missing is the same for all observations. There is no relationship between whether a data point is missing and any values in the data set, missing or observed, so the missing values are randomly distributed across all observations. A quick check is to split the data into two parts, one with missing observations and one without, and compare them with a t-test. If the means of the two samples do not differ significantly, MCAR is a plausible assumption, although the test cannot prove it.
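
The quick check described above might look like the following sketch. It assumes SciPy is installed and uses simulated data with hypothetical `age` and `income` columns, where `income` is made missing uniformly at random. A fully observed variable is split by whether `income` is missing, and the two group means are compared.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: 'income' will have gaps, 'age' is fully observed.
n = 500
toy = pd.DataFrame({
    "age": rng.normal(40, 10, size=n),
    "income": rng.normal(50_000, 8_000, size=n),
})

# Remove roughly 20% of 'income' uniformly at random (MCAR by construction).
mcar_mask = rng.random(n) < 0.2
toy.loc[mcar_mask, "income"] = np.nan

# Split the observed variable by the missingness of 'income' and compare means.
age_when_missing = toy.loc[toy["income"].isna(), "age"]
age_when_observed = toy.loc[toy["income"].notna(), "age"]
t_stat, p_value = stats.ttest_ind(age_when_missing, age_when_observed, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

A large p-value is consistent with MCAR but does not establish it, since the check only looks at one observed variable.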

<strong>Missing At Random (MAR)</strong>: The key difference between MCAR and MAR is that under MAR the data is not missing randomly across all observations, but is missing randomly only within sub-samples of data. For example, if high school GPA data is missing randomly across all schools in a district, that data will be considered MCAR. However, if data is randomly missing for students in specific schools of the district, then the data is MAR. The propensity for a data point to be missing is not related to the missing data, but it is related to some of the observed data.

<strong>Not Missing At Random (NMAR)</strong>: When the missingness has a structure that the observed data cannot explain, we cannot treat it as missing at random. In the above example, if the data was missing for all students from specific schools, then the data cannot be treated as MAR.
<pre>
                      Data explains pattern?
Pattern present?      Yes         No

Yes                   MAR         NMAR
No                    --          MCAR
</pre>
If there is a pattern to a variable's missingness and the data we have cannot explain it, we have NMAR. If the data we have (i.e. other variables in our data set) can explain it, we have MAR. If there is no pattern to the missingness, it is MCAR.
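
To make the three mechanisms concrete, here is a small simulation sketch. The variables `age` and `income`, the thresholds, and the missingness rates are all assumptions chosen for illustration. It shows how the mean of the observed values shifts depending on what drives the missingness.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
age = rng.uniform(20, 60, size=n)
income = 1_000 * age + rng.normal(0, 5_000, size=n)  # income rises with age

# MCAR: every value has the same 30% chance of being missing.
mcar = rng.random(n) < 0.3
# MAR: missingness depends only on the fully observed variable 'age'.
mar = age > 50
# NMAR: missingness depends on the unobserved value of 'income' itself.
nmar = income > np.quantile(income, 0.7)

true_mean = income.mean()
for name, mask in [("MCAR", mcar), ("MAR", mar), ("NMAR", nmar)]:
    observed_mean = income[~mask].mean()
    print(f"{name}: observed mean = {observed_mean:.0f}, true mean = {true_mean:.0f}")
```

Under MCAR the observed mean stays close to the true mean. Under MAR and NMAR it is biased, but only in the MAR case can the bias be explained by a variable we actually have (`age`).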

In [None]:
%pip install xgboost

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

In [None]:
# Assumes `df` was loaded earlier: the first 27 columns are the features, the 28th is the target
X = df.iloc[:, :27]
y = df.iloc[:, 27]
display(X.shape, y.shape)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
display(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

In [None]:
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

In [None]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

In [None]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

In [None]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

In [None]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)

In [None]:
sv = SVC()
sv.fit(X_train, y_train)

In [None]:
lr = LogisticRegression()
lr.fit(X_train, y_train)

In [None]:
ada = AdaBoostClassifier()
ada.fit(X_train, y_train)


In [None]:
bag = BaggingClassifier()
bag.fit(X_train, y_train)

In [None]:
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)

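Most scikit-learn estimators raise an error when fit on data that contains missing values, which prevents us from evaluating LDA, Random Forest, Decision Tree, KNN, GaussianNB, SVC, Logistic Regression, AdaBoost, Bagging, and Gradient Boosting classifiers on this dataset as is. The exact behavior depends on the scikit-learn version, since recent releases let some tree-based estimators accept NaN. XGBoost, by contrast, handles missing values natively.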
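
The failure can be reproduced in isolation on a small array. The sketch below uses made-up data and `LogisticRegression` as one representative estimator. It first shows the error, then wraps the same estimator in a `Pipeline` with a `SimpleImputer` so that it can be trained.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data with one missing entry.
X_toy = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, 7.0]])
y_toy = np.array([0, 0, 1, 1])

# Fitting directly on data with NaN raises a ValueError.
try:
    LogisticRegression().fit(X_toy, y_toy)
    fit_failed = False
except ValueError as err:
    fit_failed = True
    print(f"Fit failed: {err}")

# Imputing inside a Pipeline lets the same estimator train on the data.
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("clf", LogisticRegression()),
])
model.fit(X_toy, y_toy)
preds = model.predict(X_toy)
print(preds)
```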

In [None]:
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
print(f"Accuracy Score of XGBoost {accuracy_score(y_test, y_pred)}")

## Statistical imputation for missing values

Common statistics calculated include:
<ul>
    <li>The column mean value.</li>
    <li>The column median value.</li>
    <li>The column mode value.</li>
    <li>A constant value.</li>
</ul>
Mean and median are suitable for numerical variables.<br>
If the data follows a normal distribution, the mean and median are approximately the same. If the data is skewed, the median is the better choice.<br><br>
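
A small worked example, using a made-up right-skewed sample, shows why the median is usually the safer fill value when the data is skewed: a single large value drags the mean far from the bulk of the data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample with one missing entry.
values = pd.Series([1.0, 2.0, 2.0, 3.0, 100.0, np.nan])

mean_value = values.mean()      # (1 + 2 + 2 + 3 + 100) / 5 = 21.6
median_value = values.median()  # middle of [1, 2, 2, 3, 100] = 2.0
print(mean_value, median_value)

# The imputed value differs sharply depending on the statistic chosen.
filled_mean = values.fillna(mean_value)
filled_median = values.fillna(median_value)
print(filled_mean.iloc[-1], filled_median.iloc[-1])
```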


In [None]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd  # needed for pd.read_csv and pd.DataFrame below
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#strategies = ['mean', 'median', 'most_frequent', 'constant']
titanic_df = pd.read_csv("titanic-train.csv", usecols=['Age', 'Fare', 'Survived', 'Embarked'])
titanic_df.head()

In [None]:
titanic_df.isna().sum()

In [None]:
titanic_df['Age'].plot(kind='kde')

## For Numerical features
### 1. Statistical imputation

In [None]:
impute = SimpleImputer(strategy='median')
temp_df = pd.DataFrame(impute.fit_transform(titanic_df[['Survived', 'Age','Fare']]), columns = ['Survived', 'Age', 'Fare'])

In [None]:
titanic_df['Age'].plot(kind='kde', color='c', label='Age Before Imputation')
temp_df['Age'].plot(kind='kde', color='m', label='Age After Median Imputation')
plt.legend()
plt.show()

### 2. Arbitrary value imputation

In [None]:
impute = SimpleImputer(strategy='constant', fill_value=99)
temp_df = pd.DataFrame(impute.fit_transform(titanic_df[['Survived', 'Age', 'Fare']]), columns = ['Survived', 'Age', 'Fare'])
titanic_df['Age'].plot(kind='kde', color='c', label='Age Before Imputation')
temp_df['Age'].plot(kind='kde', color='m', label='Age After Imputation')
plt.legend()
plt.show()

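### 3. End of tail imputation

Missing values are replaced with a value from the far end of the distribution, here the mean plus three standard deviations.
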
In [None]:
extreme = titanic_df.Age.mean()+3*titanic_df.Age.std()

In [None]:
def impute_nan_endtail(df, col, extreme):
    """Add end-of-tail and median imputed copies of `col` to `df` in place."""
    df[col + "_extreme"] = df[col].fillna(extreme)
    df[col + "_median"] = df[col].fillna(df[col].median())

In [None]:
impute_nan_endtail(titanic_df, 'Age', extreme)

In [None]:
titanic_df.head()

In [None]:
titanic_df['Age'].plot(kind='kde', color ='g', label='Before Imputation')
titanic_df['Age_median'].plot(kind='kde', color ='r', label='Median Imputation')
titanic_df['Age_extreme'].plot(kind='kde', color ='b', label='End tail Imputation')
plt.legend()
plt.show()

## For categorical features
### 4. Most frequent Category imputation

In [None]:
impute=SimpleImputer(strategy='most_frequent')
temp_df = pd.DataFrame(impute.fit_transform(titanic_df[['Embarked']]), columns=['Embarked'])
print(temp_df['Embarked'].value_counts())
print(titanic_df['Embarked'].value_counts(dropna=False))

### 5. Missing category imputation
Treats missing values as a category.

In [None]:
impute=SimpleImputer(strategy='constant', fill_value='Missing')
temp_df = pd.DataFrame(impute.fit_transform(titanic_df[['Embarked']]), columns=['Embarked'])
print(temp_df['Embarked'].value_counts())
print(titanic_df['Embarked'].value_counts(dropna=False))

## For both numerical and categorical features
### 6. Complete Case Analysis

Only include observations with no missing data.

In [None]:
temp_df = titanic_df.dropna()

In [None]:
print(temp_df.shape, titanic_df.shape,temp_df.isna().sum(), sep='\n')

### 7. Random Sample Imputation

Take random samples from observations and use the randomly selected values to fill in the missing ones.

In [None]:
random_sample = titanic_df['Age'].dropna().sample(titanic_df['Age'].isnull().sum())

In [None]:
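def impute_nan_randomsample(df, col):
    """Add a `col`_random column to `df`, filling gaps with values sampled from the observed data."""
    # Draw as many observed values as there are missing entries
    random_sample = df[col].dropna().sample(df[col].isnull().sum())
    df[col + "_random"] = df[col]
    # Align the sampled values with the index of the missing rows before assigning
    random_sample.index = df[df[col].isnull()].index
    df.loc[df[col].isnull(), col + "_random"] = random_sample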

In [None]:
impute_nan_randomsample(titanic_df, 'Age')

In [None]:
titanic_df['Age'].plot(kind='kde', color ='g', label='Before Imputation')
titanic_df['Age_median'].plot(kind='kde', color ='r', label='Median Imputation')
titanic_df['Age_extreme'].plot(kind='kde', color ='b', label='End tail Imputation')
titanic_df['Age_random'].plot(kind='kde', color ='y', label='Random sample Imputation')
plt.legend()
plt.show()

### 8. Missing Indicator
A missing indicator is an additional binary variable that indicates whether the data was
missing for an observation (1) or not (0). The goal here is to capture observations
where data is missing.
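
The same idea can be sketched directly in pandas, which gives the 1/0 encoding described above. The toy column and the `_NA_IND` suffix are illustrative choices. The indicator is created before imputing so that it still records where the gaps were.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries.
toy = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan]})

# 1 where the value was missing, 0 where it was observed.
toy["Age_NA_IND"] = toy["Age"].isna().astype(int)

# Impute afterwards; the indicator preserves the location of the gaps.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
print(toy)
```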



In [None]:
from sklearn.impute import MissingIndicator
indicator = MissingIndicator()
temp_df = pd.DataFrame(indicator.fit_transform(titanic_df))
ind_cols = [col+"_NA_IND" for col in titanic_df.columns[indicator.features_] ]
temp_df.columns = ind_cols

In [None]:
# Number of rows flagged as missing in each indicator column
temp_df.sum()

### 9. KNN Imputer

 - Works with numerical features only
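
A minimal sketch on a made-up array illustrates how the imputer fills each gap: it takes the mean of that feature over the nearest rows, with distances measured on the features both rows have observed. The array and the choice of `n_neighbors=2` are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data with two missing entries.
X_toy = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Fill each gap with the mean of that feature over the 2 nearest rows.
knn_toy = KNNImputer(n_neighbors=2)
X_filled = knn_toy.fit_transform(X_toy)
print(X_filled)
```

Here the gap in the first row is filled with 4.0 (the mean of 3 and 5) and the gap in the third row with 5.5 (the mean of 3 and 8).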

In [None]:
from sklearn.impute import KNNImputer
knnimp = KNNImputer()
# With a single column there are no other features to measure distance on,
# so the missing entries are in effect filled with the column mean.
temp_df = pd.DataFrame(knnimp.fit_transform(titanic_df[['Age']]), columns=['Age'])

temp_df

In [None]:
# Raises ValueError: KNNImputer cannot convert the string categories to float
knnimp.fit_transform(titanic_df[['Embarked']])

In [None]:
# Encode the categories as numbers so that KNNImputer can process the column
titanic_df['Embarked'] = titanic_df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
titanic_df['Embarked'].value_counts(dropna=False)


In [None]:
temp_df = pd.DataFrame(knnimp.fit_transform(titanic_df[['Embarked']]), columns=['Embarked'])
temp_df.Embarked.value_counts(dropna=False)

### 10. Iterative Imputer

In [None]:
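from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the experimental API)
from sklearn.impute import IterativeImputer

# IterativeImputer needs numeric input, so restrict to the numeric columns
num_df = titanic_df.select_dtypes(include='number')
itimpute = IterativeImputer()
print(num_df.isna().sum())
temp_df = pd.DataFrame(itimpute.fit_transform(num_df), columns=num_df.columns.tolist())
print(temp_df.isna().sum())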
