# Exercise - Machine Learning Workflow - Basic

Apply the workflow to a different dataset. Understand the dataset and try to predict whether a package will be delivered on time.

The places you need to fill in are marked with <u>**FILL HERE**</u>.  
Update the code so that it works on the new dataset.

The details of the dataset can be found here: https://www.kaggle.com/prachi13/customer-analytics

## Machine Learning Workflow Components

0. **Data Description, Tasks & Observations**
0. **Import Libraries** & other settings
1. **Data Import**
2. **Exploratory Data Analysis (EDA)**
3. **Data Preparation**
4. **Model Training**
5. **Model Evaluation**
6. **Save Model**

## Data Description, Tasks & Observations

### Data Description 
<u>**FILL HERE**</u>  
**Dataset Summary**: *Summarize the Dataset*  
**Index Column**: *Find Index Column*  
**Target Column**: *Find Target Column*  
**Features**: *Which features will you use?*  
**Data Source**: *What is the data source? Find it from the Dataset link*  
**Other Comments**: ???  

### Tasks:
1. Understand the data and select the columns to use as features
2. Prepare the data for the model
3. Train a basic model to predict whether a package will be delivered on time
4. Evaluate the model
5. Save the best-performing model

### Observations:
<u>**FILL HERE**</u>  
1. *Fill any observations you find here*

## Import Libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
sns.__version__

## Data Import

### Read CSV File to Dataframe

In [None]:
#If running on Kaggle
path = "../input/customer-analytics/Train.csv"

#If Running from repo
# path = "./data/ecommerce.csv"

df = pd.read_csv(path)

### Basic Dataset Information

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

## Exploratory Data Analysis (EDA)

### Find Data Types

In [None]:
df.dtypes

### Find Null Counts

In [None]:
#Find count of nulls in each column
df.isna().sum()

In [None]:
#Find % of null values in each column
(df.isna().sum() / df.shape[0]) * 100
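The null percentage can also be computed with `isna().mean()`, since the mean of a boolean column is the fraction of `True` values. A minimal sketch on a toy frame follows; the values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, standing in for the real dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, 8.05, 53.10, np.nan],
    "Sex": ["male", "female", "female", "male", "male"],
})

# isna().mean() is the fraction of nulls per column; multiply by 100 for percent
null_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(null_pct)
```

Sorting in descending order puts the columns that need the most attention at the top.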

#### Observations
1. Age column has around 20% null values

### Target Column

In [None]:
target_column = 'Survived'

In [None]:
df[[target_column]].describe().transpose()

In [None]:
df[[target_column]].astype('object').describe().T

In [None]:
sns.countplot(data=df, x=target_column)

#### Observations
1. There are 2 categories in the target column
1. The target column is numeric, with 0 representing a non-survived passenger and 1 a survived passenger
1. The ratio of non-survived to survived is around 6:4
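A class ratio like the one noted above can be read directly from `value_counts(normalize=True)` rather than estimated from the count plot. The sketch below uses a made-up target series, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up binary target for illustration
toy_target = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 1, 0], name="Survived")

# normalize=True returns the share of each class instead of raw counts
class_share = toy_target.value_counts(normalize=True)
print(class_share)
```

Knowing the class shares up front helps when interpreting accuracy later: a model that always predicts the majority class already scores as high as the majority share.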


### Numeric Columns

In [None]:
df.select_dtypes(include=np.number).columns

In [None]:
numeric_columns = ['Age', 'SibSp', 'Fare', 'Pclass']

In [None]:
df[numeric_columns].head()

In [None]:
df[numeric_columns].describe()

In [None]:
sns.displot(data=df, x='Age', hue='Survived')

In [None]:
sns.scatterplot(data=df, x='Age', y='Fare')

In [None]:
sns.jointplot(data=df, x='Age', y='Fare')

In [None]:
sns.pairplot(df[numeric_columns])

In [None]:
sns.displot(data=df, x='Fare', hue=target_column, kind='kde')

In [None]:
sns.displot(data=df, x='Age', hue=target_column, kind='kde')

#### Observations
1. SibSp seems to behave more like a categorical column than a numeric column
1. Both Age and Fare seem relevant
1. A higher Fare generally corresponds to a higher chance of survival
1. A lower Age generally corresponds to a higher chance of survival

### Categorical Columns

In [None]:
df.head()

In [None]:
df.columns

In [None]:
#np.object was removed in NumPy 1.24; the string 'object' selects the same columns
df.select_dtypes(include='object').columns

In [None]:
categorical_columns = ['Pclass', 'Sex', 'SibSp', 'Embarked']

In [None]:
df[categorical_columns].head()

In [None]:
df[categorical_columns].astype('object').describe()

In [None]:
sns.countplot(data=df, x='Pclass', hue=target_column)

In [None]:
sns.countplot(data=df, hue='Pclass', x=target_column)

In [None]:
sns.countplot(data=df, x='Sex', hue=target_column)

In [None]:
sns.countplot(data=df, hue='Sex', x=target_column)

In [None]:
sns.countplot(data=df, x='Embarked', hue=target_column)

In [None]:
sns.countplot(data=df, x='SibSp', hue=target_column)

#### Observations
1. Sex, Pclass and Embarked seem the most relevant for prediction
2. Female passengers have a higher survival rate than male passengers
2. Passengers with Embarked value C have a higher survival rate than those with other Embarked values
2. The lower the Pclass number, the higher the survival rate
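Rate comparisons like the ones listed above can be backed with numbers by grouping on the categorical column and taking the mean of the 0/1 target. This is a sketch on made-up rows, reusing the column names from the cells above.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 1],
})

# The mean of a 0/1 target within a group is the share of 1s in that group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

The same pattern should work for any categorical column once the target is encoded as 0 and 1.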

## Data Preparation

### Feature Selection

In [None]:
target_column = 'Survived'

#Manually selecting relevant features based on EDA
feature_columns = ['Age', 'Fare', 'Pclass', 'Sex', 'Embarked']

keep_columns = [target_column] + feature_columns

In [None]:
#Copy so that later column assignments do not raise SettingWithCopyWarning
df = df[keep_columns].copy()

In [None]:
df.head()

### Fill Null Values

In [None]:
df['Age'].median()

In [None]:
df['Age'] = df['Age'].fillna(df['Age'].median())

In [None]:
df['Embarked'].mode()

In [None]:
#mode() returns a Series, so take its first value to fill with a scalar
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
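One detail worth checking when filling with the mode: `Series.mode()` returns a Series (there can be ties), and `fillna` aligns a Series argument on the index, so only rows whose index matches are filled. The sketch below shows the difference on a made-up column.

```python
import numpy as np
import pandas as pd

# Made-up column with two missing values
toy = pd.DataFrame({"Embarked": ["S", np.nan, "C", "S", np.nan]})

# Passing the mode Series directly: it only has index 0, so the nulls
# at index 1 and 4 are left untouched
misaligned = toy["Embarked"].fillna(toy["Embarked"].mode())
print(misaligned.isna().sum())

# Taking the first entry gives a scalar, which fills every null
most_common = toy["Embarked"].mode()[0]
filled = toy["Embarked"].fillna(most_common)
print(filled.tolist())
```

Re-running `isna().sum()` after each fill step is a cheap way to confirm the nulls are actually gone.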

### One-Hot Encoding of Text Columns

In [None]:
df = pd.get_dummies(df)

In [None]:
df.head()
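`pd.get_dummies` creates one indicator column per category value. For a two-valued column the second indicator carries no extra information, so one option is to pass `drop_first=True`. The sketch below uses made-up rows; whether to drop a column is a design choice, not something the workflow requires.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
})

# Encode only the listed column; drop_first=True drops the first category
# (alphabetically), leaving a single indicator for a binary column
encoded = pd.get_dummies(toy, columns=["Sex"], drop_first=True)
print(encoded.columns.tolist())
```

Listing the columns explicitly with `columns=` also avoids accidentally encoding a text column that was meant to be dropped.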

### Create X & y

In [None]:
df.columns

In [None]:
X = df[['Age', 'Fare', 'Pclass', 'Sex_female', 'Sex_male']]
y = df[target_column]

### Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
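When the classes are not evenly balanced, a plain random split can leave the train and test sets with slightly different class ratios. Passing `stratify=y` asks `train_test_split` to preserve the ratio in both parts. This is a sketch on made-up data with a 60:40 split between classes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 50 rows, 30 of class 0 and 20 of class 1
X_toy = pd.DataFrame({"feature": np.arange(50)})
y_toy = pd.Series([0] * 30 + [1] * 20, name="target")

# stratify=y_toy keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```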

In [None]:
X_train

In [None]:
y_train

## Model Training & Evaluation

### Model Training

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression(solver='liblinear', random_state=0)

model.fit(X_train,y_train)

### Model Predictions on Test Data

In [None]:
y_pred = model.predict(X_test)

In [None]:
y_pred

In [None]:
df_test = X_test.copy()
df_test['y_actual'] = y_test
df_test['y_pred'] = y_pred

df_test.head()

### Model Evaluation

#### Classification Report

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
eval_report = classification_report(y_test, y_pred, 
                      target_names=['0: Not-Survived', '1: Survived'],
                      output_dict=True
                     )

pd.DataFrame(eval_report).T

#### Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
cf_matrix = confusion_matrix(y_test, y_pred)

In [None]:
sns.heatmap(cf_matrix, annot=True)

In [None]:
sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=True, 
            fmt='.2%', cmap='Blues')
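Dividing by the grand total, as above, shows each cell as a share of all test rows. An alternative is row normalization with `normalize='true'`, which shows for each actual class how its rows were split across predictions, so the diagonal holds the per-class recall. The labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# normalize='true' divides each row by its total
cm_row = confusion_matrix(y_true_toy, y_pred_toy, normalize="true")
print(cm_row)
```

The resulting array can be passed to `sns.heatmap` in the same way as `cf_matrix`.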

## Save Model

In [None]:
#TODO
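One common way to fill in this step is `joblib`, which is installed alongside scikit-learn. The sketch below is only a suggestion: it trains a tiny model on made-up data so that it runs on its own, and the file name `model.joblib` is a hypothetical choice. In the notebook, the fitted `model` from the training step would be dumped instead.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set so the example runs on its own
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
toy_model = LogisticRegression(solver="liblinear", random_state=0).fit(X_toy, y_toy)

# Hypothetical file name; a temp directory keeps the example free of side effects
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(toy_model, model_path)

# Load the model back and check that it predicts the same as the original
restored = joblib.load(model_path)
same = np.array_equal(toy_model.predict(X_toy), restored.predict(X_toy))
print(same)
```

A saved model should be loaded with the same scikit-learn version that produced it, and only from files you trust, since loading executes pickled code.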