# Import Python libraries
 

In [2]:
import pandas as pd
import time
import numpy as np
pd.set_option('display.max_columns', None) #Show all columns

import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

**You can now import the dataset you worked on in Data Cleaning and Data visualization with pandas.** <br>

First, here is how to save a dataframe to a CSV file from your Data Cleaning / Data Visualization notebooks.

```
# Create a dataframe with pd.DataFrame()
df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
                   'mask': ['red', 'purple'],
                   'weapon': ['sai', 'bo staff']})
                   
# Save the dataframe to a csv format                   
df.to_csv("new_dataset.csv", index=False)
```

Then, use `dataset = pd.read_csv("new_dataset.csv")` in this notebook to load the `"new_dataset.csv"` csv file.

**If you don't want to use the dataframe from Data Cleaning and Data Visualization, then run this code.**

In [3]:
path=r'~/hfactory_magic_folders/course/Dataset/dataset_train.csv'

# Import the csv file
dataset = pd.read_csv(path,encoding='latin-1',sep=';')

# Clean the dataframe 
to_drop=['Customer Email','Customer Fname','Customer Lname','Customer Password','Customer Street','Order Zipcode','Product Description']
dataset=dataset.drop(to_drop,axis=1)
dataset=dataset.dropna()

# Import sklearn models, preprocessing tools, metrics
Here we've provided a list of scikit-learn models you can use. 

**Note**: <br>
You don't have to use these specific models to train your data. You can try other scikit-learn models. <br>
Part of a data scientist's job is to try out and discover new ones.

In [4]:
# Classification models
from sklearn.linear_model import LogisticRegression,LinearRegression
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import ExtraTreesClassifier

# Preprocessing tools
from sklearn.preprocessing import OneHotEncoder, LabelEncoder,StandardScaler
from sklearn.model_selection import train_test_split

# Performance metrics
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.metrics import precision_recall_curve, classification_report, explained_variance_score
from sklearn.metrics import make_scorer, mean_absolute_error, mean_absolute_percentage_error

# Improve your model
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

**Target value**: <br>
You have two options for the target variable: 
- `Late_delivery_risk` (binary classification) 
- `Delivery_Status` (multi-class classification)

**Note**: If you pick `Late_delivery_risk`, `Delivery_Status` should be deleted from the dataset (and vice versa). <br>
If you need help doing this, the next cell will do it for you.
- Set `binary_classification=True` if you pick binary classification.
- Set `binary_classification=False` if you pick multi-class classification.

In [5]:
# Choose your fighter
binary_classification=False

if binary_classification:
    dataset=dataset.drop(columns=['Delivery Status'])
    name_label=['Late_delivery_risk']
else:
    # dataset=dataset.drop(columns=['Late_delivery_risk'])
    # These one-hot column names only exist once 'Delivery Status' has been one-hot encoded
    name_label=['Delivery Status_Advance shipping', 'Delivery Status_Late delivery','Delivery Status_Shipping canceled', 'Delivery Status_Shipping on time']

## OneHotEncoding:
**Question 1:** <br>**Select categorical variables and transform them into numerical variables using OneHotEncoding (OHE).**

**Hint**: 
- Select the categorical variables by dtype (object) or by number of unique values (categorical variables have a small number of unique values). <br>
- You can go to this [page](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for more info on how to use One Hot Encoding, or go to `Data_Science_crash_course.ipynb` in the Pre-bootcamp folder.


In [6]:
# If you need help selecting variables to one-hot encode, you can run the following code.
df_num=dataset.select_dtypes(exclude=np.dtype('O'))
df_alphabetic=dataset.select_dtypes(include=np.dtype('O'))

nb_unique_value_max=10
to_OHE=[key for key in df_alphabetic.keys() if len(df_alphabetic[key].drop_duplicates())<nb_unique_value_max]
to_label_encoder=[key for key in df_alphabetic.keys() if key not in to_OHE]

print(to_OHE)
print(to_label_encoder)

['Type', 'Delivery Status', 'Customer Country', 'Customer Segment', 'Market', 'Order Status', 'Shipping Mode']
['Category Name', 'Customer City', 'Customer State', 'Department Name', 'Order City', 'Order Country', 'order date (DateOrders)', 'Order Region', 'Order State', 'Product Image', 'Product Name', 'shipping date (DateOrders)']


In [7]:
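# One-hot encode the low-cardinality columns selected above
enc = OneHotEncoder(sparse_output=False)
enc.fit(dataset[to_OHE])

# Keep the encoder's column names and the original index (needed to align rows in pd.concat)
dataset_ohe = pd.DataFrame(enc.transform(dataset[to_OHE]),
                           columns=enc.get_feature_names_out(), index=dataset.index)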
#dataset_clean_ohe = pd.concat([dataset_ohe], axis=1)
dataset_ohe.head()
#dataset_clean_ohe.tail()

**Question 2**: <br>
**Concatenate the numerical variables with the variables you've transformed with One Hot Encoding, then output the dataframe.** <br>

<u> Hint </u>: You can use pandas' `pd.concat()` function to concatenate dataframes together.

You can also apply other data preprocessing methods after this question if you want (label encoding, normalization, feature selection, ...).

**In this next cell, we have split the dataset into features (X_dataset) and target variable (y_dataset)**

In [6]:
X_dataset=dataset.drop(name_label,axis=1) 
y_dataset=dataset[name_label]


**Question 3**: <br>
**Now, split X and y into training and validation sets. <br>**

<u> Hint </u>: You can use scikit-learn's `train_test_split` function.
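A minimal sketch of the split, using synthetic arrays in place of `X_dataset` and `y_dataset`. The names `X_toy` and `y_toy`, the 80/20 ratio and the `random_state` value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the features and a binary target (100 rows, 4 features)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = rng.integers(0, 2, size=100)

# Hold out 20% of the rows for validation; stratify keeps the class proportions similar
X_train, X_val, y_train, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

print(X_train.shape, X_val.shape)
```

Fixing `random_state` makes the split reproducible, which helps when comparing several models on the same validation set.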

# Fit your model

Train Logistic Regression, Decision Tree and Random Forest models using `fit` (you can find the different parameters on the scikit-learn website).
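The three fits could look roughly like this. This is only a sketch on synthetic data from `make_classification`, not on the course dataset, and the hyperparameters (`max_iter`, `max_depth`, `n_estimators`) are arbitrary illustrative values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the course dataset
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0)

# Logistic regression
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
log_reg_score = log_reg.score(X_val, y_val)  # mean accuracy on the validation split

# Decision tree
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_train, y_train)
tree_score = tree.score(X_val, y_val)

# Random forest
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)
forest_score = forest.score(X_val, y_val)

print(log_reg_score, tree_score, forest_score)
```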

It's more or less the same code every time! Let's make a pipeline:

Goal: create a function that takes an initialized, untrained model as an argument and returns its score.

Train at least three other models with this function
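One possible shape for such a function is sketched below. The name `fit_and_score`, its signature and the three example models are assumptions made for illustration; the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier


def fit_and_score(model, X_train, y_train, X_val, y_val):
    """Fit an initialized, untrained estimator and return its validation score."""
    model.fit(X_train, y_train)
    # For scikit-learn classifiers, score() is the mean accuracy
    return model.score(X_val, y_val)


# Synthetic data standing in for the course dataset
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0)

candidates = {
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "extra_trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
}
results = {name: fit_and_score(model, X_train, y_train, X_val, y_val)
           for name, model in candidates.items()}
print(results)
```

Because every scikit-learn estimator exposes the same `fit` and `score` methods, the function works unchanged for any classifier you pass in.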

Display the ROC curve for different classifiers
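A sketch of how ROC curves for several classifiers might be drawn on one figure, assuming a binary target and synthetic data. For a multi-class target, one common option is to compute one curve per class (one-vs-rest). `roc_auc_score` is an extra import not listed earlier in this notebook.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the course dataset
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1)

classifiers = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=1),
}

fig, ax = plt.subplots()
aucs = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    # Probability of the positive class, used as the ranking score
    proba = clf.predict_proba(X_val)[:, 1]
    fpr, tpr, _ = roc_curve(y_val, proba)
    aucs[name] = roc_auc_score(y_val, proba)
    ax.plot(fpr, tpr, label=f"{name} (AUC = {aucs[name]:.2f})")

# Diagonal reference line: a classifier that guesses at random
ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("ROC curves on the validation split")
ax.legend()
```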

Create the confusion matrix function
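A minimal sketch of such a function. The name `confusion_matrix_df` and the `true_`/`pred_` label prefixes are illustrative choices; the labels below are hand-written toy values.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix


def confusion_matrix_df(y_true, y_pred, labels):
    """Return the confusion matrix as a labelled dataframe (rows: true, columns: predicted)."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    return pd.DataFrame(cm,
                        index=[f"true_{label}" for label in labels],
                        columns=[f"pred_{label}" for label in labels])


# Toy labels: one sample of class 1 is wrongly predicted as class 0
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]
cm_df = confusion_matrix_df(y_true, y_pred, labels=[0, 1])
print(cm_df)
```

If you prefer a plot, the resulting dataframe can be passed to `sns.heatmap(cm_df, annot=True)`.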

# Prediction

Use your trained models to predict the test labels (the dataset is in the hfactory).



# Upgrade your model!

Use Grid Search, Cross validation and/or Random Search.

Be careful: this part can be very heavy in terms of computing resources. Write efficient code!
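A small sketch of cross-validation and grid search on synthetic data. The grid values are arbitrary and deliberately tiny to keep the number of fits low; they are not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic data standing in for the course dataset
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# 5-fold cross-validation of a single configuration
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=30, random_state=0), X_toy, y_toy, cv=5)
print(cv_scores.mean())

# Small grid: 2 x 2 = 4 candidates, each evaluated on 3 folds (12 fits plus one final refit)
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```

The cost grows with the product of the grid sizes and the number of folds, so start with a coarse grid. `GridSearchCV` also accepts `n_jobs=-1` to run fits in parallel, and `RandomizedSearchCV` evaluates only a fixed number of sampled candidates.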