### <span style="color:orange">1.Introduction</span>

#### <span style="color:green">1.1. Project purpose</span>

The dataset for this competition (both train and test) was generated from a deep learning model trained on the UCI Mushroom dataset. Feature distributions are close to, but not exactly the same as, the original. Feel free to use the original dataset as part of this competition, both to explore differences and to see whether incorporating the original in training improves model performance.

Note: Unlike many previous Tabular Playground datasets, data artifacts have not been cleaned up. There are categorical values in the dataset that are not found in the original. It is up to the competitors how to handle this.

Files
* train.csv - the training dataset; class is the binary target (either e or p)
* test.csv - the test dataset; your objective is to predict target class for each row
* sample_submission.csv - a sample submission file in the correct format

#### <span style="color:green">1.2. Data source and description</span>

In [1]:
match_type = dict({
    'class'                : ['e', 'p'],
    'cap-diameter'         : 'Numerical',
    'cap-shape'            : ['b', 'c', 'x', 'f', 'k', 's'],
    'cap-surface'          : ['f', 'g', 'y', 's'],
    'cap-color'            : ['n', 'b', 'c', 'g', 'r', 'p', 'u', 'e', 'w', 'y'],
    'does-bruise-or-bleed' : ['f', 't'],
    'gill-attachment'      : ['a', 'f', 'd', 'n'],
    'gill-spacing'         : ['c', 'w', 'd'],
    'gill-color'           : ['k', 'n', 'b', 'h', 'g', 'r', 'o', 'p', 'u', 'e', 'w', 'y'],
    'stem-height'          : 'Numerical',
    'stem-width'           : 'Numerical',
    'stem-root'            : ['b', 'c', 'u', 'e', 'z', 'r', '?'],
    'stem-surface'         : ['f', 'y', 'k', 's'],
    'stem-color'           : ['n', 'b', 'c', 'g', 'o', 'p', 'e', 'w', 'y'],
    'veil-type'            : ['p', 'u'],
    'veil-color'           : ['n', 'o', 'w', 'y'],
    'has-ring'             : ['f', 't'],
    'ring-type'            : ['c', 'l', 'e', 'n', 'f', 'p', 's', 'z'],
    'spore-print-color'    : ['k', 'n', 'b', 'h', 'r', 'o', 'u', 'w', 'y'],
    'habitat'              : ['g', 'l', 'm', 'p', 'u', 'w', 'd'],
    'season'               : ['a', 'u', 'w', 's']
})
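Because the competition notes warn that some categorical values do not appear in the original dataset, a mapping like `match_type` can be used to flag them. The following is a minimal sketch, not part of the original notebook: `find_unexpected_values`, `allowed`, and `toy_df` are hypothetical names, and the data is a toy frame rather than the competition files.

```python
import pandas as pd

def find_unexpected_values(df, allowed):
    """Return {column: sorted values that are not in the allowed list}."""
    unexpected = {}
    for column, valid in allowed.items():
        # Skip numerical features and columns absent from this frame
        if not isinstance(valid, list) or column not in df.columns:
            continue
        observed = set(df[column].dropna().unique())
        extra = sorted(observed - set(valid), key=str)
        if extra:
            unexpected[column] = extra
    return unexpected

# Toy data for illustration only (assumption, not the real dataset)
allowed = {
    "cap-diameter": "Numerical",
    "cap-surface": ["f", "g", "y", "s"],
    "season": ["a", "u", "w", "s"],
}
toy_df = pd.DataFrame({
    "cap-diameter": [4.00, 4.07, 4.04],
    "cap-surface": ["t", "s", "y"],
    "season": ["u", "a", "u"],
})

report = find_unexpected_values(toy_df, allowed)
print(report)
```

How to treat the flagged values (keep them, merge them into an "other" bucket, or treat them as missing) is a modelling decision left open here.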

https://www.kaggle.com/competitions/playground-series-s4e8/data
- Train_data: uses the file “train.csv” (train/train.csv)
- Test_data: uses the file “test.csv” (test/test.csv)

#### <span style="color:green">1.3. Goals</span>

Achieve the highest possible prediction accuracy.
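As a rough illustration of how this goal could be measured, the sketch below scores a toy set of predictions with `accuracy_score` and `matthews_corrcoef`, both of which the notebook imports later. The labels are made up for the example. The Matthews correlation coefficient is shown alongside accuracy on the assumption that a second, imbalance-aware metric is useful, since the two classes are not evenly split.

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Toy labels for illustration only: 'e' = edible, 'p' = poisonous
y_true = ["e", "p", "p", "e"]
y_pred = ["e", "p", "e", "e"]

acc = accuracy_score(y_true, y_pred)     # fraction of exact matches
mcc = matthews_corrcoef(y_true, y_pred)  # correlation between truth and prediction

print(f"accuracy = {acc:.2f}, MCC = {mcc:.3f}")
```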

### <span style="color:orange">2.Import Libraries</span>

#### <span style="color:green">2.1. Required Python packages</span>

In [2]:
import numpy as np

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from copy import deepcopy

from sklearn.preprocessing import OneHotEncoder  # Encode feature
from sklearn.preprocessing import OrdinalEncoder # Encode feature
from sklearn.preprocessing import MinMaxScaler   # Scale feature
from sklearn.preprocessing import LabelEncoder   # Encode target
# from scipy.stats import boxcox # Normalized feature

from sklearn.feature_selection import mutual_info_classif # Feature selection (mutual information)
from sklearn.model_selection   import train_test_split
from sklearn.model_selection   import RepeatedKFold
from sklearn.model_selection   import GridSearchCV
from sklearn.model_selection   import validation_curve
from sklearn.model_selection   import learning_curve


from sklearn.naive_bayes  import MultinomialNB
from sklearn.naive_bayes  import BernoulliNB
from sklearn.naive_bayes  import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree         import DecisionTreeClassifier
from sklearn.ensemble     import RandomForestClassifier
from sklearn.neighbors    import KNeighborsClassifier
from sklearn.svm          import SVC
from xgboost              import XGBClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import matthews_corrcoef
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import classification_report

#### <span style="color:green">2.2. Configuration and display settings</span>

In [3]:
import sys
sys.path.append("../scripts")

import script

### <span style="color:orange">3.Data Loading</span>

#### <span style="color:green">3.1. Loading the dataset</span>

In [4]:
train_df = pd.read_csv("../data/train_cleaned.csv", index_col = "id")
test_df  = pd.read_csv("../data/test_cleaned.csv" , index_col = "id")

#### <span style="color:green">3.2. Displaying first few rows</span>

In [5]:
train_df

id,class,cap-diameter,cap-shape,cap-surface,cap-color,does-bruise-or-bleed,gill-attachment,gill-spacing,gill-color,stem-height,...,stem-root,stem-surface,stem-color,veil-type,veil-color,has-ring,ring-type,spore-print-color,habitat,season
54,e,4.00,f,t,n,f,a,c,o,6.06,...,b,s,w,u,w,f,f,k,d,u
188,p,4.07,x,t,n,f,a,c,n,5.61,...,b,y,w,u,w,t,e,k,g,u
287,e,4.04,x,t,r,f,d,c,r,5.67,...,b,s,w,u,w,f,f,k,d,u
484,p,4.73,x,t,r,f,a,c,u,6.00,...,b,y,w,u,w,t,e,u,m,u
644,p,4.53,x,s,y,f,a,c,y,5.67,...,b,s,o,u,w,f,f,k,g,a
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3116054,p,4.25,x,s,y,f,a,c,y,5.77,...,b,s,o,u,w,f,f,k,g,u
3116595,p,4.47,x,t,r,f,a,c,u,5.84,...,b,y,w,u,w,t,e,u,g,u
3116738,e,4.06,x,t,n,f,d,c,o,6.09,...,b,s,o,u,w,f,f,k,d,a
3116885,e,4.12,f,t,n,f,d,c,o,6.21,...,b,s,w,u,w,f,f,k,d,a


In [6]:
test_df

id,cap-diameter,cap-shape,cap-surface,cap-color,does-bruise-or-bleed,gill-attachment,gill-spacing,gill-color,stem-height,stem-width,stem-root,stem-surface,stem-color,veil-type,veil-color,has-ring,ring-type,spore-print-color,habitat,season
3116945,8.64,x,t,n,t,a,c,w,11.13,17.12,b,s,w,u,w,t,f,k,d,a
3116946,6.90,x,t,n,f,a,c,y,1.27,10.75,b,s,n,u,w,f,f,k,d,a
3116947,2.00,b,g,n,f,a,c,n,6.18,3.14,b,s,n,u,w,f,f,k,d,s
3116948,3.47,x,t,n,f,a,c,n,4.98,8.51,b,s,w,u,n,t,z,k,d,u
3116949,6.17,x,t,y,f,a,c,y,6.73,13.70,b,s,y,u,y,t,f,k,d,u
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5194904,0.88,x,g,w,f,a,d,w,2.67,1.35,b,s,e,u,w,f,f,k,d,u
5194905,3.12,x,s,w,f,d,c,w,2.69,7.38,b,s,w,u,w,f,f,k,g,a
5194906,5.73,x,t,e,f,a,c,w,6.16,9.74,b,s,y,u,w,t,z,k,d,a
5194907,5.03,b,g,n,f,a,d,g,6.00,3.46,b,s,g,u,w,f,f,k,d,a


#### <span style="color:green">3.3. Data summary</span>

In [7]:
train_df.shape

(25527, 21)

In [8]:
test_df.shape

(2076378, 20)

In [9]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 25527 entries, 54 to 3116888
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   class                 25527 non-null  object 
 1   cap-diameter          25527 non-null  float64
 2   cap-shape             25527 non-null  object 
 3   cap-surface           25527 non-null  object 
 4   cap-color             25527 non-null  object 
 5   does-bruise-or-bleed  25527 non-null  object 
 6   gill-attachment       25527 non-null  object 
 7   gill-spacing          25527 non-null  object 
 8   gill-color            25527 non-null  object 
 9   stem-height           25527 non-null  float64
 10  stem-width            25527 non-null  float64
 11  stem-root             25527 non-null  object 
 12  stem-surface          25527 non-null  object 
 13  stem-color            25527 non-null  object 
 14  veil-type             25527 non-null  object 
 15  veil-color           

In [10]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2076378 entries, 3116945 to 5194908
Data columns (total 20 columns):
 #   Column                Dtype  
---  ------                -----  
 0   cap-diameter          float64
 1   cap-shape             object 
 2   cap-surface           object 
 3   cap-color             object 
 4   does-bruise-or-bleed  object 
 5   gill-attachment       object 
 6   gill-spacing          object 
 7   gill-color            object 
 8   stem-height           float64
 9   stem-width            float64
 10  stem-root             object 
 11  stem-surface          object 
 12  stem-color            object 
 13  veil-type             object 
 14  veil-color            object 
 15  has-ring              object 
 16  ring-type             object 
 17  spore-print-color     object 
 18  habitat               object 
 19  season                object 
dtypes: float64(3), object(17)
memory usage: 332.7+ MB


### <span style="color:orange">5.Exploratory Data Analysis (EDA)</span>

#### <span style="color:green">5.1. Explore Categorical Variables</span>

##### <span style="color:tomato">Summary Statistics</span>

In [11]:
categorical = train_df.select_dtypes(include='object').columns
print(f'There are {len(categorical)} categorical variables\n')
print('The categorical variables are :\n\n', categorical)

There are 18 categorical variables

The categorical variables are :

 Index(['class', 'cap-shape', 'cap-surface', 'cap-color',
       'does-bruise-or-bleed', 'gill-attachment', 'gill-spacing', 'gill-color',
       'stem-root', 'stem-surface', 'stem-color', 'veil-type', 'veil-color',
       'has-ring', 'ring-type', 'spore-print-color', 'habitat', 'season'],
      dtype='object')


Frequency counts

In [12]:
script.frequency_counts(train_df, categorical)

class
p    14570/25527
e    10957/25527
Name: count, dtype: object

cap-shape
x    16605/25527
f     3673/25527
c     2936/25527
b     1583/25527
s      730/25527
Name: count, dtype: object

cap-surface
t    22228/25527
s     2479/25527
y      806/25527
g       14/25527
Name: count, dtype: object

cap-color
n    15316/25527
y     6980/25527
r     1700/25527
w      690/25527
e      449/25527
u      220/25527
p       91/25527
g       63/25527
b       18/25527
Name: count, dtype: object

does-bruise-or-bleed
f    25272/25527
t      255/25527
Name: count, dtype: object

gill-attachment
a    16944/25527
d     8127/25527
f      456/25527
Name: count, dtype: object

gill-spacing
c    25303/25527
d      224/25527
Name: count, dtype: object

gill-color
y    7249/25527
o    6877/25527
r    3878/25527
n    3628/25527
g    1187/25527
w     987/25527
u     850/25527
k     350/25527
p     244/25527
e     216/25527
b      61/25527
Name: count, dtype: object

stem-root
b    25520/25527
r        7/2552
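The `script` module is not shown in this notebook, so the exact implementation of `script.frequency_counts` is unknown. A minimal sketch that produces output of the same `count/total` shape, assuming it wraps `Series.value_counts`, might look like this (the function body and `toy_df` are assumptions for illustration):

```python
import pandas as pd

def frequency_counts(df, columns):
    """Print and return 'count/total' strings for each value of each column."""
    total = len(df)
    results = {}
    for column in columns:
        counts = df[column].value_counts()          # sorted by descending count
        formatted = counts.astype(str) + f"/{total}"
        results[column] = formatted
        print(formatted, end="\n\n")
    return results

# Toy data for illustration only
toy_df = pd.DataFrame({
    "class": ["p", "p", "e"],
    "season": ["u", "a", "u"],
})
counts = frequency_counts(toy_df, ["class", "season"])
```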

Frequency distributions

In [13]:
script.frequency_distributions(train_df, categorical)

class
p    57.076821%
e    42.923179%
Name: count, dtype: object

cap-shape
x    65.048772%
f    14.388686%
c    11.501547%
b     6.201277%
s     2.859717%
Name: count, dtype: object

cap-surface
t    87.076429%
s     9.711286%
y     3.157441%
g     0.054844%
Name: count, dtype: object

cap-color
n    59.999217%
y    27.343597%
r     6.659615%
w     2.703020%
e     1.758922%
u     0.861833%
p     0.356485%
g     0.246798%
b     0.070514%
Name: count, dtype: object

does-bruise-or-bleed
f    99.001058%
t     0.998942%
Name: count, dtype: object

gill-attachment
a    66.376778%
d    31.836879%
f     1.786344%
Name: count, dtype: object

gill-spacing
c    99.122498%
d     0.877502%
Name: count, dtype: object

gill-color
y    28.397383%
o    26.940103%
r    15.191758%
n    14.212403%
g     4.649978%
w     3.866494%
u     3.329808%
k     1.371097%
p     0.955851%
e     0.846163%
b     0.238963%
Name: count, dtype: object

stem-root
b    99.972578%
r     0.027422%
Name: count, dtype: object
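As with the counts helper, `script.frequency_distributions` is not shown. The sketch below is one hypothetical way to obtain percentages with six decimals, assuming the helper divides `value_counts()` by the number of rows; the names and toy data are illustrative only.

```python
import pandas as pd

def frequency_distributions(df, columns):
    """Print and return the percentage share of each value of each column."""
    total = len(df)
    results = {}
    for column in columns:
        share = df[column].value_counts() / total * 100
        formatted = share.map("{:.6f}%".format)  # e.g. 75.000000%
        results[column] = formatted
        print(formatted, end="\n\n")
    return results

# Toy data for illustration only
toy_df = pd.DataFrame({"class": ["p", "p", "p", "e"]})
shares = frequency_distributions(toy_df, ["class"])
```

The output above shows several strongly dominated features (for example `stem-root`, where one value covers more than 99.9% of rows); such columns carry little information and may be candidates for removal.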


#### <span style="color:green">5.2. Explore Numerical Variables</span>

##### <span style="color:tomato">Summary Statistics</span>

In [14]:
numerical = train_df.select_dtypes(include='number').columns
print(f'There are {len(numerical)} numerical variables')
print('The numerical variables are :\n\n', numerical)

There are 3 numerical variables
The numerical variables are :

 Index(['cap-diameter', 'stem-height', 'stem-width'], dtype='object')


Frequency counts

In [15]:
script.frequency_counts(train_df, numerical)

cap-diameter
4.04    956/25527
3.97    751/25527
3.98    716/25527
4.07    620/25527
4.06    565/25527
          ...    
4.60    134/25527
4.48    109/25527
4.72     98/25527
4.70     68/25527
4.67     68/25527
Name: count, Length: 78, dtype: object

stem-height
5.96    682/25527
6.11    650/25527
5.99    626/25527
6.12    618/25527
5.93    605/25527
          ...    
5.95    137/25527
5.56    137/25527
5.57    127/25527
5.74    119/25527
5.70    109/25527
Name: count, Length: 70, dtype: object

stem-width
7.11    462/25527
7.06    428/25527
6.93    398/25527
7.23    394/25527
7.08    381/25527
          ...    
7.61     54/25527
7.19     48/25527
7.81     42/25527
6.64     37/25527
7.64     32/25527
Name: count, Length: 130, dtype: object



Frequency distributions

In [16]:
script.frequency_distributions(train_df, numerical)

cap-diameter
4.04    3.745054%
3.97    2.941983%
3.98    2.804873%
4.07    2.428801%
4.06    2.213343%
          ...    
4.60    0.524934%
4.48    0.426999%
4.72    0.383907%
4.70    0.266385%
4.67    0.266385%
Name: count, Length: 78, dtype: object

stem-height
5.96    2.671681%
6.11    2.546324%
5.99    2.452305%
6.12    2.420966%
5.93    2.370040%
          ...    
5.95    0.536687%
5.56    0.536687%
5.57    0.497512%
5.74    0.466173%
5.70    0.426999%
Name: count, Length: 70, dtype: object

stem-width
7.11    1.809848%
7.06    1.676656%
6.93    1.559133%
7.23    1.543464%
7.08    1.492537%
          ...    
7.61    0.211541%
7.19    0.188036%
7.81    0.164532%
6.64    0.144945%
7.64    0.125357%
Name: count, Length: 130, dtype: object



### <span style="color:orange">6.Data Visualization</span>

#### <span style="color:green">6.1. Distribution Plots</span>

##### <span style="color:tomato">Histograms</span>

In [17]:
# # cols_not_included = ["class"]
# num_cols = len(numerical) + len(categorical)

# # Choose a sensible number of rows and columns
# ncols = 3  # Number of plots per row
# nrows = int(np.ceil(num_cols / ncols))  # Compute the number of rows needed

# fig, axes = plt.subplots(
#     ncols   = ncols, 
#     nrows   = nrows, 
#     figsize = (5*ncols, 4*nrows)
# )
# axes = axes.flatten()  # flatten the 2-D array to 1-D for easier iteration

# for i,column in enumerate(numerical.append(categorical)):
#     sns.histplot(
#         data = train_df, 
#         x    = column,
#         kde  = True,
#         ax   = axes[i]
#     )
#     axes[i].set_title(f'Histogram of {column}')
#     axes[i].set_xlabel(column)
#     axes[i].set_ylabel('Frequency')
#     axes[i].grid(True)

# # Hide any unused subplots
# for j in range(i+1, len(axes)):
#     axes[j].axis('off')

# plt.tight_layout()
# plt.show()
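The grid sizing in the commented-out cell (a fixed number of columns, enough rows to fit every feature, and leftover axes hidden) can be isolated into a small helper. This is a sketch only; `grid_shape` is a hypothetical name and not part of the notebook or the `script` module.

```python
import math

def grid_shape(n_plots, ncols=3):
    """Return (nrows, ncols, unused) for a subplot grid holding n_plots axes."""
    if n_plots < 1 or ncols < 1:
        raise ValueError("n_plots and ncols must be positive")
    nrows = math.ceil(n_plots / ncols)   # enough rows to fit every plot
    unused = nrows * ncols - n_plots     # trailing axes to switch off
    return nrows, ncols, unused

# 3 numerical + 18 categorical columns, as reported in section 5
nrows, ncols, unused = grid_shape(3 + 18)
print(nrows, ncols, unused)
```

The `unused` count corresponds to the trailing axes that the cell hides with `axis('off')`.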

##### <span style="color:tomato">Box Plots</span>

In [18]:
# script.plot_Outlier(train_df, numerical.append(categorical.drop("class")), target="class")

#### <span style="color:green">6.2. Relationship Plots</span>

##### <span style="color:tomato">Scatter</span>

In [19]:
# sns.pairplot(
#     data = train_df,
#     hue  = "class"
# )

##### <span style="color:tomato">Bar Charts</span>

In [20]:
# # cols_not_included = ["class"]
# num_cols = len(categorical) + len(numerical)

# # Choose a sensible number of rows and columns
# ncols = 3  # Number of plots per row
# nrows = int(np.ceil(num_cols / ncols))  # Compute the number of rows needed

# fig, axes = plt.subplots(
#     ncols   = ncols, 
#     nrows   = nrows, 
#     figsize = (5*ncols, 4*nrows)
# )
# axes = axes.flatten()  # flatten the 2-D array to 1-D for easier iteration

# for i,column in enumerate(categorical.append(numerical)):
#     sns.barplot(
#         data = train_df, 
#         x    = column,
#         y    = "class", 
#         ax   = axes[i]
#     )
#     axes[i].set_title(f'Bar chart of {column}')
#     axes[i].set_xlabel(column)
#     axes[i].set_ylabel('class')
#     axes[i].grid(True)

# # Hide any unused subplots
# for j in range(i+1, len(axes)):
#     axes[j].axis('off')

# plt.tight_layout()
# plt.show()
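A bar chart needs a numeric quantity on one axis, while `class` holds the strings `e` and `p`. One option, sketched below under that assumption, is to compute the share of poisonous rows per category first and plot that series. `poisonous_share` and `toy_df` are hypothetical names used only for illustration.

```python
import pandas as pd

def poisonous_share(df, column, target="class", positive="p"):
    """Fraction of rows labelled `positive` within each value of `column`."""
    is_positive = df[target] == positive           # boolean Series
    share = is_positive.groupby(df[column]).mean() # mean of booleans = fraction
    return share.sort_values(ascending=False)

# Toy data for illustration only
toy_df = pd.DataFrame({
    "class":  ["p", "p", "e", "e", "p", "e"],
    "season": ["u", "u", "u", "a", "a", "a"],
})
share = poisonous_share(toy_df, "season")
print(share)
```

The resulting series can be passed to a plotting call such as `share.plot.bar()`, giving one bar per category with a height between 0 and 1.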

#### <span style="color:green">6.3. Annotated Visualizations</span>