## Rachael DATA CHALLENGE

You are provided with a dataset containing the characteristics of different mushrooms (*mushrooms.csv*) and are tasked with predicting whether a mushroom is poisonous (class=p) or edible (class=e). You also have a dataset (*mushrooms_validation.csv*) in which the mushrooms are not labeled: run your algorithm on this dataset and provide a *predicted_labels.csv* file (keeping the indexes in *mushrooms_validation.csv*).

As you implement your code, answer the following questions:
1) How do you deal with missing data?
2) How would you predict the values of the class column?
3) What features are the most important in predicting the class of a mushroom?
4) What would you say is the most important metric in assessing your model performance? Accuracy, precision, recall...?
5) How would selecting only three features to use in a model impact performance? Is this acceptable?

# Configuration and Imports

---
I load the CSV files into pandas DataFrames.


In [1]:
# pandas and NumPy for data manipulation, imported under the usual aliases 'pd' and 'np'
import pandas as pd
import numpy as np

# Matplotlib and seaborn visualization
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn import preprocessing
# No warnings about setting value on copy of slice
pd.options.mode.chained_assignment = None

# As seen in class, this line displays the figures inline.
%matplotlib inline 

# Set default font size
plt.rcParams['font.size'] = 15

# During hyperparameter tuning, many warnings cluttered the output and made it
# hard to focus on the model-selection reports (too much scrolling).
# Searching the web, I found the following lines to silence them.
import warnings
warnings.filterwarnings('ignore')
# - 'ignore' hides the warnings
# - 'always' shows the warnings


# 1. Data Acquisition and Preprocessing
<hr/>
This is the first step we need to accomplish before going any further. The dataset will be downloaded and loaded as usual. 
Some preprocessing such as checking the missing values and duplication will be addressed.

## Load the dataset
We use a pandas DataFrame to read and load our dataset.

In [3]:
dataFrame = pd.read_csv(filepath_or_buffer="/content/mushrooms.csv", 
                        sep=',',
                        index_col= None)

In [4]:
dataFrame.shape

(7625, 23)

In [None]:
dataFrame

In [6]:
val_dataFrame = pd.read_csv(filepath_or_buffer="/content/mushrooms_validation.csv", 
                        sep=',',
                        index_col= None)

In [7]:
val_dataFrame.shape

(500, 22)

In [None]:
val_dataFrame

## Preprocessing/Cleaning

In this section we go through these processes: 
* Checking duplicate records
* Handling records that contain missing values (imputation/drop)
* Checking unique values for each column
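
The duplicate and unique-value checks listed above could be sketched as follows. This is only an illustration on a small made-up frame (the name `toy` and its values are assumptions, not the real data); on the real data the same calls would be applied to `dataFrame`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the mushroom data
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p", "e"],
    "odor": ["p", "a", "a", "p", None],
    "veil-type": ["p", "p", "p", "p", "p"],
})

# Number of fully duplicated records (first occurrence is not counted)
n_duplicates = int(toy.duplicated().sum())

# Number of missing values per column
n_missing = toy.isnull().sum()

# Number of distinct values per column (NaN is not counted);
# a column with a single distinct value cannot help separate the classes
n_unique = toy.nunique()

print(n_duplicates)
print(n_missing.to_dict())
print(n_unique.to_dict())
```

A column such as the toy `veil-type` above, with a single distinct value, is a natural candidate for removal later on.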


## Handling missing values 
We consider two approaches to handling missing values:
1. **Imputation** (completing missing values via imputation)
2. **Drop** (dropping records that are containing missing values)

### 1. **Imputation**

In the imputation approach, we **fill the missing values** with **the most common class in the column** that contains the missing value.

In [9]:
# Creating a copy of data set to be used for imputation method
df_imputed = dataFrame.copy()

In [10]:
# for each column, get value counts in decreasing order and take the index (value) of most common class
df_imputed = df_imputed.apply(lambda x: x.fillna(x.value_counts().index[0]))
df_imputed

Unnamed: 0.1,Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,0,p,x,s,n,t,p,f,c,n,...,s,w,w,p,w,o,p,k,s,u
1,1,e,x,s,y,t,a,f,c,b,...,s,w,w,p,w,o,p,n,n,g
2,2,e,b,s,w,t,l,f,c,b,...,s,w,w,p,w,o,p,n,n,m
3,3,p,x,y,w,t,p,f,c,n,...,s,w,w,p,w,o,p,k,s,u
4,4,e,x,s,g,f,n,f,w,b,...,s,w,w,p,w,o,e,n,a,g
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7620,7620,p,k,s,n,f,y,f,c,n,...,s,w,w,p,w,o,e,w,v,l
7621,7621,p,k,y,n,f,y,f,c,n,...,s,w,p,p,w,o,e,w,v,l
7622,7622,e,b,f,g,f,n,f,w,b,...,s,w,w,p,w,t,p,w,n,g
7623,7623,e,k,s,w,f,n,f,w,b,...,s,w,w,p,w,t,p,w,s,g


We do the same for the validation set.


In [11]:
# Creating a copy of data set to be used for imputation method
val_imputed = val_dataFrame.copy()

In [None]:
# for each column, get value counts in decreasing order and take the index (value) of most common class
val_imputed = val_dataFrame.apply(lambda x: x.fillna(x.value_counts().index[0]))
val_imputed
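
A variation worth considering, shown here only as a hedged sketch on made-up data, is to compute the most frequent value of each column on the training set and reuse those same values to fill the validation set, so that both sets are imputed consistently. The frames `train_toy` and `val_toy` are hypothetical and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frames standing in for the training and validation data
train_toy = pd.DataFrame({
    "odor": ["n", "n", "f", None],
    "habitat": ["d", "g", "d", "d"],
})
val_toy = pd.DataFrame({
    "odor": [None, "a"],
    "habitat": ["g", None],
})

# Most frequent value of each column, learned from the training data only
fill_values = train_toy.mode().iloc[0]

# Fill both frames with the same per-column values
train_filled = train_toy.fillna(fill_values)
val_filled = val_toy.fillna(fill_values)

print(fill_values.to_dict())
print(val_filled)
```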

Now we check the result of the imputation on each column.


In [13]:
df_imputed.isnull().sum()

Unnamed: 0                  0
class                       0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-shape                 0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
population                  0
habitat                     0
dtype: int64

As the results above show, no records were dropped and all missing values were filled.

I drop the `Unnamed: 0` column from both dataframes, because it only holds the row index and is useless as a feature.

In [14]:
df_imputed.drop(["Unnamed: 0"], axis = 1, inplace = True)
val_imputed.drop(["Unnamed: 0"], axis = 1, inplace = True)

Rename the `class` column to `target` (for a better understanding of the concept).

In [15]:
df_imputed = df_imputed.rename(columns={'class': 'target'})

Here we write some helper functions to find the outliers in each column.

In [16]:
# Given a dataset and a column, this function returns
# the lower bound and upper bound of the values in that column.

# Values outside these bounds are considered outliers.

# Length of the whiskers in a box plot:
# The whiskers mark the largest/smallest points inside the range defined by the 1st or 3rd quartile plus 1.5 times the IQR.
# The upper whisker is the largest observation that is <= 3rd quartile + 1.5 * IQR
# The lower whisker is the smallest observation that is >= 1st quartile - 1.5 * IQR
def outlier_bound(data, column):
  q1 = np.percentile(data[column], 25)
  q3 = np.percentile(data[column], 75)
  iqr = q3 - q1
  upper_whisker = q3 + 1.5 * iqr
  lower_whisker = q1 - 1.5 * iqr

  return [round(lower_whisker, 2), round(upper_whisker, 2)]

# This function given a dataset and a column in that dataset will return 
# the records that have outliers in that column 
def outliers_info(data, column):

  lower_whisker, upper_whisker = outlier_bound(data, column)
  print(f"Lower-bound: \'{lower_whisker}\', Upper-bound: \'{upper_whisker}\'")
  print('-'*30)
  outliers = data[(data[column]<lower_whisker) | (data[column]>upper_whisker) ] 
  print(f'Number of outliers in \'{column}\' Column is : \'{len(outliers)}\' \n\n')
  return outliers
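
To make the whisker rule concrete, here is a minimal, self-contained sketch on a made-up numeric column. The helper `iqr_bounds`, the frame `toy`, and the column `size` are hypothetical names used only for this illustration; the helper follows the same formula as `outlier_bound` without the rounding.

```python
import numpy as np
import pandas as pd

def iqr_bounds(values):
    """Return the (lower, upper) whisker limits: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1 = np.percentile(values, 25)
    q3 = np.percentile(values, 75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical numeric column with one obviously extreme value
toy = pd.DataFrame({"size": [1, 2, 3, 4, 5, 100]})

lower, upper = iqr_bounds(toy["size"])
outliers = toy[(toy["size"] < lower) | (toy["size"] > upper)]

print(lower, upper)  # -1.5 8.5
print(outliers)      # only the row with size == 100
```

Here Q1 = 2.25 and Q3 = 4.75, so the IQR is 2.5 and the bounds are 2.25 - 3.75 = -1.5 and 4.75 + 3.75 = 8.5.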

# 3. Feature Engineering

Feature engineering refers to a process of selecting and transforming features in our dataset when creating a predictive model using machine learning.

Therefore, we have to extract the features from the __raw dataset__ that we have; otherwise, it will be hard to gain good insights into our dataset.
Several strategies can be applied.

## Dropping useless features
As **mentioned in the data exploration** section, the two features **(veil-type, gill-attachment)** do not contribute enough to distinguish between the two classes of our target (`Unnamed: 0` was already dropped above). <br/>

Therefore, we can safely drop these two features from both dataframes.

In [17]:
df_imputed.drop(["veil-type"], axis = 1, inplace = True)
df_imputed.drop(["gill-attachment"], axis = 1, inplace = True)
val_imputed.drop(["veil-type"], axis = 1, inplace = True)
val_imputed.drop(["gill-attachment"], axis = 1, inplace = True)


Check that these two columns were removed.

In [18]:
df_imputed.head()

Unnamed: 0,target,cap-shape,cap-surface,cap-color,bruises,odor,gill-spacing,gill-size,gill-color,stalk-shape,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,c,n,k,e,s,s,w,w,w,o,p,k,s,u
1,e,x,s,y,t,a,c,b,k,t,s,s,w,w,w,o,p,n,n,g
2,e,b,s,w,t,l,c,b,n,e,s,s,w,w,w,o,p,n,n,m
3,p,x,y,w,t,p,c,n,n,e,s,s,w,w,w,o,p,k,s,u
4,e,x,s,g,f,n,w,b,k,t,s,s,w,w,w,o,e,n,a,g


## Handle Outliers

In [19]:
df = df_imputed.drop(columns=['target'])

Given a dataset, the function below **winsorizes** the outliers in its numerical columns.

In [20]:
from sklearn.preprocessing import FunctionTransformer
from scipy.stats.mstats import winsorize

# Winsorizing is a technique to deal with outliers and is named after Charles Winsor.
# In effect, Winsorization clips outliers to given percentiles in a symmetric fashion.
# For instance, we can clip to the 5th and 95th percentile.
# SciPy has a winsorize() function, which performs this procedure.
def outlier_winsorization(df):

  for column in df.columns:
    lower, upper = outlier_bound(df, column)

    limit_lower = df[(df[column] < lower)].shape[0] / df.shape[0]
    limit_upper = df[(df[column] > upper)].shape[0] / df.shape[0]

    df[column] = winsorize(df[column], limits=[limit_lower, limit_upper])
  
  return df
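
As a small illustration of what `winsorize` does, the sketch below clips the top 10% of a made-up array. The array `values` is an assumption chosen for the example and is unrelated to the mushroom data.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical numeric values with one extreme observation
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# limits=[0, 0.1]: leave the low end alone and clip the top 10% (one value here)
# to the largest remaining value. winsorize returns a masked array, so it is
# converted back to a plain ndarray.
clipped = np.asarray(winsorize(values, limits=[0, 0.1]))

print(clipped.tolist())  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 9]
```

The `limits` argument is a pair of fractions, which is why `outlier_winsorization` converts the number of out-of-bound rows into a share of all rows before calling `winsorize`.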

# 4. Learning and Model Selection

A Machine Learning pipeline is a way to automate the workflow it takes to produce a Machine Learning model. Machine learning pipelines consist of multiple sequential steps that do everything from data extraction and preprocessing to model training and deployment.

For any Machine Learning task, testing is an important phase for assessing how good our predictive model is. As we do not have a separate labeled test set, we need to use some dataset entries to perform the evaluation task.
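
One way to use some dataset entries for evaluation is a stratified hold-out split. The sketch below is only an illustration on synthetic data: the arrays `X` and `y`, the toy labeling rule, and the split size are assumptions. It reports recall and precision for the poisonous class, on the assumption that missing a poisonous mushroom is the costlier mistake.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and labels (assumption)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 6))
y = np.where(X[:, 0] == 1, "p", "e")  # toy rule: the first feature decides the class

# Keep 25% of the labeled rows aside, preserving the class balance
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fit, y_fit)
y_hat = model.predict(X_hold)

# Recall: share of truly poisonous mushrooms that were flagged as poisonous
recall_p = recall_score(y_hold, y_hat, pos_label="p")
# Precision: share of mushrooms flagged as poisonous that really are
precision_p = precision_score(y_hold, y_hat, pos_label="p")

print(f"recall(p)={recall_p:.3f}  precision(p)={precision_p:.3f}")
```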



In [21]:
df_imputed.head()

Unnamed: 0,target,cap-shape,cap-surface,cap-color,bruises,odor,gill-spacing,gill-size,gill-color,stalk-shape,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,c,n,k,e,s,s,w,w,w,o,p,k,s,u
1,e,x,s,y,t,a,c,b,k,t,s,s,w,w,w,o,p,n,n,g
2,e,b,s,w,t,l,c,b,n,e,s,s,w,w,w,o,p,n,n,m
3,p,x,y,w,t,p,c,n,n,e,s,s,w,w,w,o,p,k,s,u
4,e,x,s,g,f,n,w,b,k,t,s,s,w,w,w,o,e,n,a,g


I use the entire labeled dataset as the training set and the provided validation set as the test set.

In [22]:
X_train, y_train = df_imputed.drop(["target"],axis=1), df_imputed["target"]

I noticed that, in several columns, the validation set contains values that do not appear in the training set.

There are many such values, and they must be handled because the whole dataframe has to be encoded.

There are several ways to encode the data, but the main problem is that most features have extra distinct values compared with the training set, which causes errors when using a one-hot encoder function.

Label encoding does not work either, because the new values raise errors as unseen labels. (If unknown values are simply ignored, we lose a large amount of data that may be useful for the prediction.)

Finally, I merged the two datasets, encoded the object columns, and then split the validation set off again. This way, the encoding is performed on the whole dataset, and the one-hot encoder creates the same number of new features for both sets.
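
The idea can be illustrated on a tiny made-up example. The frames `train_toy` and `val_toy` below are hypothetical and only show why encoding the concatenated data keeps the columns aligned.

```python
import pandas as pd

# Hypothetical toy frames: "f" appears only in the validation data
train_toy = pd.DataFrame({"odor": ["a", "n", "n"]})
val_toy = pd.DataFrame({"odor": ["n", "f"]})

# Encoding the two frames separately yields different columns
print(list(pd.get_dummies(train_toy).columns))  # ['odor_a', 'odor_n']
print(list(pd.get_dummies(val_toy).columns))    # ['odor_f', 'odor_n']

# Encoding the concatenation, then splitting by position, keeps them aligned
combined = pd.get_dummies(pd.concat([train_toy, val_toy], ignore_index=True))
train_enc = combined.iloc[:len(train_toy)]
val_enc = combined.iloc[len(train_toy):].reset_index(drop=True)

print(list(combined.columns))                   # ['odor_a', 'odor_f', 'odor_n']
print(train_enc.shape, val_enc.shape)           # (3, 3) (2, 3)
```

The column `odor_f` is all zeros in the training part, so the model cannot learn anything from it, but both parts now have the same width.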


In [23]:
# concatenating two dataframes
data = [X_train,val_imputed]
total_df = pd.concat(data,ignore_index=True)

In [24]:
# encoding the whole data by using one hot encoder (get_dummies of pandas)
import pandas as pd
import numpy as np
total_df = pd.get_dummies(data=total_df[total_df.columns], prefix=total_df.columns, prefix_sep="_")

In [25]:
total_df.shape

(8125, 109)

Now, after encoding, I split off the validation set, which has been encoded consistently with the whole data.

In [26]:
test = pd.DataFrame(total_df.tail(500))

In [27]:
test

Unnamed: 0,cap-shape_b,cap-shape_c,cap-shape_f,cap-shape_k,cap-shape_s,cap-shape_x,cap-surface_f,cap-surface_g,cap-surface_s,cap-surface_y,...,population_s,population_v,population_y,habitat_d,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
7625,0,0,0,1,0,0,0,0,1,0,...,0,1,0,0,0,1,0,0,0,0
7626,0,0,1,0,0,0,0,0,0,1,...,0,1,0,1,0,0,0,0,0,0
7627,0,0,0,0,0,1,0,0,1,0,...,1,0,0,0,1,0,0,0,0,0
7628,0,0,0,0,0,1,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
7629,0,0,0,1,0,0,0,0,1,0,...,0,1,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8120,0,0,0,1,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
8121,0,0,0,0,0,1,0,0,1,0,...,0,1,0,0,0,1,0,0,0,0
8122,0,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
8123,0,0,0,1,0,0,0,0,0,1,...,0,1,0,0,0,1,0,0,0,0


In [28]:
# reset the index and discard the old one
test = test.reset_index(drop=True)

In [29]:
# Assigning the encoded train set
X_train = total_df.iloc[:7625]

In [30]:
test.shape

(500, 109)

In [31]:
X_train.shape

(7625, 109)

In [32]:
X_train

Unnamed: 0,cap-shape_b,cap-shape_c,cap-shape_f,cap-shape_k,cap-shape_s,cap-shape_x,cap-surface_f,cap-surface_g,cap-surface_s,cap-surface_y,...,population_s,population_v,population_y,habitat_d,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
0,0,0,0,0,0,1,0,0,1,0,...,1,0,0,0,0,0,0,0,1,0
1,0,0,0,0,0,1,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,1,0,0,0,1,...,1,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,1,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7620,0,0,0,1,0,0,0,0,1,0,...,0,1,0,0,0,1,0,0,0,0
7621,0,0,0,1,0,0,0,0,0,1,...,0,1,0,0,0,1,0,0,0,0
7622,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
7623,0,0,0,1,0,0,0,0,1,0,...,1,0,0,0,1,0,0,0,0,0


In [33]:
test

Unnamed: 0,cap-shape_b,cap-shape_c,cap-shape_f,cap-shape_k,cap-shape_s,cap-shape_x,cap-surface_f,cap-surface_g,cap-surface_s,cap-surface_y,...,population_s,population_v,population_y,habitat_d,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
0,0,0,0,1,0,0,0,0,1,0,...,0,1,0,0,0,1,0,0,0,0
1,0,0,1,0,0,0,0,0,0,1,...,0,1,0,1,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,1,0,...,1,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,1,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,0,1,0,0,0,0,1,0,...,0,1,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,0,0,0,1,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
496,0,0,0,0,0,1,0,0,1,0,...,0,1,0,0,0,1,0,0,0,0
497,0,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
498,0,0,0,1,0,0,0,0,0,1,...,0,1,0,0,0,1,0,0,0,0


In [34]:
total_df.shape

(8125, 109)

In [35]:
y_train

0       p
1       e
2       e
3       p
4       e
       ..
7620    p
7621    p
7622    e
7623    e
7624    p
Name: target, Length: 7625, dtype: object

**NOTE:** It is time to train our models, because all feature engineering has been done. (Handling the outliers will be addressed in the pipeline.)

In this part, we will train four models:

  -   [1. Logistic Regression](#LR)
  -   [2. Naive Bayes](#NB)
  -   [3. Random Forest](#RF)
  -   [4. Support Vector Machine](#SVM)


## Method I: Logistic Regression<font><a name=LR></a>

### LR: Pipeline and Prediction

In [36]:
from sklearn.preprocessing import MaxAbsScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import FunctionTransformer

pipeline = Pipeline(steps=[
                           ('outlier', FunctionTransformer(outlier_winsorization)) ,
                           ('scaler',MaxAbsScaler()), 
                           ('model', LogisticRegression())
                           ])

In [37]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('outlier',
                 FunctionTransformer(func=<function outlier_winsorization at 0x7f626971f290>)),
                ('scaler', MaxAbsScaler()), ('model', LogisticRegression())])

In [38]:
y_pred_lr = pipeline.predict(test)

In [39]:
y_pred_lr

array(['p', 'p', 'p', 'e', 'p', 'e', 'e', 'p', 'p', 'p', 'p', 'p', 'e',
       'p', 'p', 'p', 'e', 'e', 'e', 'p', 'e', 'p', 'p', 'p', 'p', 'e',
       'p', 'e', 'e', 'p', 'p', 'p', 'p', 'p', 'e', 'e', 'p', 'p', 'e',
       'p', 'e', 'e', 'p', 'p', 'p', 'p', 'p', 'p', 'p', 'e', 'e', 'p',
       'p', 'p', 'e', 'p', 'p', 'p', 'e', 'p', 'p', 'e', 'p', 'p', 'e',
       'e', 'p', 'p', 'e', 'p', 'p', 'p', 'p', 'p', 'e', 'e', 'p', 'e',
       'e', 'p', 'e', 'e', 'p', 'p', 'p', 'e', 'e', 'p', 'e', 'p', 'p',
       'p', 'e', 'e', 'p', 'p', 'p', 'e', 'p', 'e', 'e', 'e', 'p', 'p',
       'p', 'e', 'p', 'p', 'p', 'p', 'p', 'p', 'e', 'e', 'p', 'p', 'e',
       'e', 'p', 'e', 'e', 'e', 'e', 'p', 'e', 'e', 'p', 'p', 'e', 'e',
       'p', 'p', 'p', 'p', 'p', 'e', 'p', 'e', 'e', 'p', 'p', 'e', 'p',
       'p', 'p', 'e', 'p', 'p', 'e', 'p', 'p', 'p', 'e', 'p', 'p', 'e',
       'p', 'p', 'e', 'p', 'p', 'e', 'p', 'p', 'e', 'e', 'e', 'e', 'p',
       'p', 'p', 'e', 'p', 'e', 'p', 'p', 'p', 'p', 'p', 'e', 'p

## Method II: Naive Bayes<font><a name=NB></a>

### NB: Pipeline and Prediction

In [40]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import FunctionTransformer

pipeline = Pipeline(steps=[
                           ('outlier', FunctionTransformer(outlier_winsorization)) ,
                           ('model', BernoulliNB())
                           ])

In [41]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('outlier',
                 FunctionTransformer(func=<function outlier_winsorization at 0x7f626971f290>)),
                ('model', BernoulliNB())])

In [42]:
y_pred_nb = pipeline.predict(test)

In [43]:
y_pred_nb

array(['p', 'p', 'p', 'e', 'p', 'p', 'e', 'p', 'p', 'p', 'p', 'p', 'e',
       'p', 'p', 'p', 'e', 'p', 'e', 'p', 'e', 'p', 'p', 'p', 'p', 'p',
       'p', 'e', 'e', 'p', 'p', 'p', 'p', 'p', 'e', 'e', 'p', 'p', 'e',
       'p', 'e', 'e', 'p', 'p', 'p', 'p', 'p', 'p', 'p', 'e', 'e', 'p',
       'p', 'p', 'e', 'p', 'p', 'p', 'e', 'p', 'p', 'p', 'p', 'p', 'e',
       'e', 'p', 'p', 'e', 'p', 'p', 'p', 'p', 'p', 'e', 'e', 'p', 'e',
       'e', 'p', 'e', 'e', 'p', 'p', 'p', 'p', 'e', 'p', 'e', 'p', 'p',
       'p', 'e', 'e', 'p', 'p', 'p', 'e', 'p', 'e', 'e', 'e', 'p', 'p',
       'p', 'e', 'p', 'p', 'p', 'p', 'p', 'p', 'e', 'e', 'p', 'p', 'p',
       'e', 'p', 'e', 'e', 'e', 'p', 'p', 'e', 'p', 'p', 'p', 'e', 'e',
       'p', 'p', 'p', 'p', 'p', 'e', 'p', 'e', 'e', 'p', 'p', 'e', 'p',
       'p', 'p', 'e', 'p', 'p', 'e', 'p', 'p', 'p', 'e', 'p', 'p', 'e',
       'p', 'p', 'e', 'p', 'p', 'p', 'p', 'p', 'e', 'p', 'e', 'e', 'p',
       'p', 'p', 'e', 'p', 'e', 'p', 'p', 'p', 'p', 'p', 'e', 'p

## Method III: Random Forest<font><a name=RF></a>

### RF: Pipeline and Prediction

In [44]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import FunctionTransformer

pipeline = Pipeline(steps=[
                           ('outlier', FunctionTransformer(outlier_winsorization)) ,
                           ('model', RandomForestClassifier(n_estimators=100, random_state=42))
                           ])

In [45]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('outlier',
                 FunctionTransformer(func=<function outlier_winsorization at 0x7f626971f290>)),
                ('model', RandomForestClassifier(random_state=42))])

In [46]:
y_pred_rf = pipeline.predict(test)

In [47]:
y_pred_rf

array(['p', 'p', 'e', 'e', 'p', 'p', 'e', 'e', 'e', 'p', 'p', 'p', 'e',
       'p', 'p', 'p', 'e', 'e', 'e', 'e', 'e', 'p', 'p', 'p', 'p', 'e',
       'e', 'e', 'e', 'p', 'p', 'p', 'p', 'p', 'e', 'e', 'p', 'p', 'e',
       'p', 'e', 'e', 'p', 'p', 'p', 'p', 'p', 'e', 'p', 'e', 'e', 'p',
       'p', 'p', 'e', 'p', 'e', 'p', 'e', 'p', 'p', 'e', 'p', 'p', 'e',
       'e', 'p', 'p', 'e', 'p', 'p', 'p', 'p', 'p', 'e', 'e', 'p', 'e',
       'e', 'p', 'e', 'e', 'p', 'p', 'p', 'p', 'e', 'p', 'e', 'p', 'p',
       'p', 'e', 'e', 'p', 'p', 'e', 'e', 'p', 'e', 'e', 'e', 'p', 'p',
       'p', 'e', 'p', 'p', 'p', 'p', 'p', 'p', 'e', 'e', 'p', 'p', 'e',
       'e', 'p', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'p', 'p', 'e', 'e',
       'p', 'p', 'p', 'p', 'p', 'e', 'p', 'e', 'e', 'e', 'p', 'e', 'p',
       'p', 'p', 'e', 'p', 'p', 'e', 'p', 'p', 'p', 'e', 'e', 'p', 'e',
       'p', 'p', 'e', 'e', 'p', 'p', 'p', 'p', 'e', 'p', 'e', 'e', 'p',
       'p', 'p', 'e', 'p', 'e', 'p', 'p', 'p', 'p', 'p', 'e', 'p
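
Regarding question 3, a fitted random forest exposes impurity-based importances through its `feature_importances_` attribute. The sketch below is a hedged illustration on synthetic data: the frame `X_toy`, its column names, and the toy labeling rule are assumptions, not results from the mushroom dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary features with hypothetical one-hot style names
rng = np.random.default_rng(42)
X_toy = pd.DataFrame(
    rng.integers(0, 2, size=(500, 5)),
    columns=["odor_f", "odor_n", "cap-shape_x", "habitat_d", "bruises_t"],
)
# Toy rule: the label depends only on the first column
y_toy = np.where(X_toy["odor_f"] == 1, "p", "e")

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_toy, y_toy)

# Importances sum to 1; larger values mean the feature reduced impurity more
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
top3 = importances.sort_values(ascending=False).head(3)

print(top3)
```

With the real pipeline, the fitted forest would be reached through `pipeline.named_steps['model']`, and a ranking of this kind is one way to choose the three features mentioned in question 5.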

## Method IV: Support Vector Machine<font><a name=SVM></a>

### SVM: Pipeline and Prediction

In [48]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import FunctionTransformer

pipeline = Pipeline(steps=[
                           ('outlier', FunctionTransformer(outlier_winsorization)) ,
                           ('model', SVC())
                           ])

In [49]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('outlier',
                 FunctionTransformer(func=<function outlier_winsorization at 0x7f626971f290>)),
                ('model', SVC())])

In [50]:
y_pred_svm = pipeline.predict(test)

In [51]:
y_pred_svm

array(['p', 'p', 'e', 'e', 'p', 'p', 'e', 'e', 'e', 'p', 'p', 'p', 'e',
       'p', 'p', 'p', 'e', 'e', 'e', 'e', 'e', 'p', 'p', 'p', 'p', 'e',
       'e', 'e', 'e', 'p', 'p', 'p', 'p', 'p', 'e', 'e', 'p', 'p', 'e',
       'p', 'e', 'e', 'p', 'p', 'p', 'p', 'p', 'e', 'p', 'e', 'e', 'p',
       'p', 'p', 'e', 'p', 'e', 'p', 'e', 'p', 'p', 'e', 'p', 'p', 'e',
       'e', 'p', 'p', 'e', 'p', 'p', 'p', 'p', 'p', 'e', 'e', 'p', 'e',
       'e', 'p', 'e', 'e', 'p', 'p', 'p', 'p', 'e', 'p', 'e', 'p', 'p',
       'p', 'e', 'e', 'p', 'p', 'e', 'e', 'p', 'e', 'e', 'e', 'p', 'p',
       'p', 'e', 'p', 'p', 'p', 'p', 'p', 'p', 'e', 'e', 'p', 'p', 'e',
       'e', 'p', 'e', 'e', 'e', 'e', 'e', 'e', 'e', 'p', 'p', 'e', 'e',
       'p', 'p', 'p', 'p', 'p', 'e', 'p', 'e', 'e', 'e', 'p', 'e', 'p',
       'p', 'p', 'e', 'p', 'p', 'e', 'p', 'p', 'p', 'e', 'e', 'p', 'e',
       'p', 'p', 'e', 'e', 'p', 'p', 'p', 'p', 'e', 'p', 'e', 'e', 'p',
       'p', 'p', 'e', 'p', 'e', 'p', 'p', 'p', 'p', 'p', 'e', 'p

# 5. Checking results

After training the models on this dataset, I found that the best model is Random Forest; however, assuming that it may be overfitted by about one percent, I consider SVM to be the best model.
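
A comparison of this kind can be backed by cross-validation. The following is a minimal sketch on synthetic data: the arrays `X_toy` and `y_toy`, the toy labeling rule, and the choice of recall on the poisonous class as the score are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC

# Synthetic binary features standing in for the one-hot encoded data
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(300, 8))
# Toy rule: poisonous when at least one of the first two features is set
y_toy = np.where((X_toy[:, 0] + X_toy[:, 1]) >= 1, "p", "e")

models = {
    "LR": LogisticRegression(),
    "NB": BernoulliNB(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(),
}

# Recall on the poisonous class as the score
recall_p = make_scorer(recall_score, pos_label="p")

# Mean score over 5 stratified folds for each model
scores = {
    name: float(cross_val_score(model, X_toy, y_toy, cv=5, scoring=recall_p).mean())
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: mean recall(p) = {score:.3f}")
```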


### LR: Counting the Distinct Values


In [54]:
y_pred_lr = pd.DataFrame(y_pred_lr)

In [55]:
y_pred_lr.value_counts()

p    278
e    222
dtype: int64

### NB: Counting the Distinct Values


In [52]:
y_pred_nb = pd.DataFrame(y_pred_nb)

In [53]:
y_pred_nb.value_counts()

p    298
e    202
dtype: int64

### RF: Counting the Distinct Values


In [57]:
y_pred_rf = pd.DataFrame(y_pred_rf)

In [58]:
y_pred_rf.value_counts()

p    260
e    240
dtype: int64

### SVM: Counting the Distinct Values


In [59]:
y_pred_svm = pd.DataFrame(y_pred_svm,columns=["class"])

In [60]:
y_pred_svm.value_counts()

class
p        260
e        240
dtype: int64

In [64]:
# keeping the validation_set indexes
y_pred_svm.index = y_pred_svm.index + 7624

In [66]:
y_pred_svm

Unnamed: 0,class
7624,p
7625,p
7626,e
7627,e
7628,p
...,...
8119,e
8120,e
8121,e
8122,p


# Creating Predicted_labels CSV file

In [67]:
# Writing DataFrame to CSV
y_pred_svm.to_csv('/content/predicted_labels.csv', index=True)
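
Instead of adding a fixed offset, the indexes could be taken directly from the validation file. The sketch below is a hedged illustration: the CSV text, the index values, and the array `predictions` are made up, and it assumes that the first (unnamed) column of *mushrooms_validation.csv* holds the original row indexes.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for mushrooms_validation.csv; the first (unnamed)
# column is assumed to hold the original row indexes
csv_text = ",cap-shape,odor\n7625,x,n\n7626,b,f\n7627,k,a\n"
val_toy = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Stand-in for the model output, one label per validation row
predictions = np.array(["e", "p", "e"])

# Reuse the validation index so each label stays attached to its row
labels = pd.DataFrame({"class": predictions}, index=val_toy.index)

# Write to an in-memory buffer here; a file path works the same way
buffer = io.StringIO()
labels.to_csv(buffer, index=True)
print(buffer.getvalue())
```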