# 🍄🍄 Binary Prediction of Poisonous Mushrooms 🍄🍄

![image.png](attachment:cfe04934-1c97-4ce8-8a5d-1f1215ba4473.png)

### **Data Features**

<style>
  table {
    font-size: 30px;
    text-align: left;
    width: 100%;
  }
  th {
    font-weight: bold;
    padding: 8px;
    background-color: #f2f2f2;
    font-size: 30px;
  }
  td {
    padding: 8px;
  }
</style>

<table>
  <tr>
    <th>Field</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>id</td>
    <td>Unique Identifier</td>
  </tr>
  <tr>
    <td>class</td>
    <td>Indicates if the mushroom is <strong>poisonous (p)</strong> or <strong>edible (e)</strong></td>
  </tr>
  <tr>
    <td>cap-diameter</td>
    <td>Diameter of the cap (in cm)</td>
  </tr>
  <tr>
    <td>cap-shape</td>
    <td>Shape of the cap</td>
  </tr>
  <tr>
    <td>cap-surface</td>
    <td>Surface texture of the cap</td>
  </tr>
  <tr>
    <td>cap-color</td>
    <td>Color of the cap</td>
  </tr>
  <tr>
    <td>does-bruise-or-bleed</td>
    <td>Indicates if the mushroom bruises or bleeds</td>
  </tr>
  <tr>
    <td>gill-attachment</td>
    <td>Attachment of the gills to the stem</td>
  </tr>
  <tr>
    <td>gill-spacing</td>
    <td>Spacing between the gills</td>
  </tr>
  <tr>
    <td>gill-color</td>
    <td>Color of the gills</td>
  </tr>
  <tr>
    <td>stem-height</td>
    <td>Height of the stem (in cm)</td>
  </tr>
  <tr>
    <td>stem-width</td>
    <td>Width of the stem (in mm)</td>
  </tr>
  <tr>
    <td>stem-root</td>
    <td>Structure of the stem's base</td>
  </tr>
  <tr>
    <td>stem-surface</td>
    <td>Surface texture of the stem</td>
  </tr>
  <tr>
    <td>stem-color</td>
    <td>Color of the stem</td>
  </tr>
  <tr>
    <td>veil-type</td>
    <td>Type of the veil on the cap</td>
  </tr>
  <tr>
    <td>veil-color</td>
    <td>Color of the veil</td>
  </tr>
  <tr>
    <td>has-ring</td>
    <td>Indicates if there is a ring on the stem</td>
  </tr>
  <tr>
    <td>ring-type</td>
    <td>Type of ring on the stem</td>
  </tr>
  <tr>
    <td>spore-print-color</td>
    <td>Color of the spore print left by the mushroom</td>
  </tr>
  <tr>
    <td>habitat</td>
    <td>The environment where the mushroom is found</td>
  </tr>
  <tr>
    <td>season</td>
    <td>The season in which the mushroom was observed</td>
  </tr>
</table>


In [None]:
# Step 2: Upload your Kaggle API key
from google.colab import files
files.upload()

# Step 3: Move the API key to the correct location
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Step 4: Download the dataset from the competition
!kaggle competitions download -c playground-series-s4e8

# Step 5: Unzip the dataset
!unzip playground-series-s4e8.zip


In [None]:
import numpy as np
import pandas as pd
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report, confusion_matrix, matthews_corrcoef
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import VotingClassifier
from lightgbm import LGBMClassifier
import lightgbm as lgb
import matplotlib.pyplot as plt
import seaborn as sns
import os
import gc
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the datasets
df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

In [None]:
df.head(2)

In [None]:
df.info()

# Data Cleaning

First, we don't need the `id` column, so I will drop it.

In [None]:
# Drop the 'id' column from both datasets
df.drop(columns=['id'], inplace=True)
test_df.drop(columns=['id'], inplace=True)

In [None]:
df.duplicated().sum()

Good, there are no duplicated rows.

Let's check the missing values.

## Missing Values

In [None]:
missing_values = df.isnull().sum()
# Calculate the percentage of missing values for each column
missing_percentage = (df.isnull().sum() / len(df)) * 100

# Display the missing values count and percentage for each column
missing_info = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})
print(missing_info)


**The dataset has a significant amount of missing data in several columns, which requires careful consideration**

#### 1. Assess the Severity of Missing Data

Columns like veil-type, veil-color, stem-root, spore-print-color, gill-spacing, stem-surface, and cap-surface **have a large percentage of missing values.**

These columns with more than **50% missing** data might not contribute meaningfully to your model and could be candidates for removal.

#### 2. Decide on a Strategy

**Drop columns:** If a column has too many missing values (e.g., more than 50%), consider dropping it, especially if it is unlikely to provide valuable information.

**Impute missing values:** For columns with fewer missing values, we can fill in the missing data (imputation) using different strategies:

- For numerical columns like cap-diameter: use the mean, median, or mode.
- For categorical columns like cap-shape and cap-color: use the mode (most frequent value).

**Custom imputation:** Once EDA has provided some insight, we can use more sophisticated imputation techniques, such as filling values based on other related columns.
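
The drop-or-impute strategy described above can be sketched as follows. This is only an illustration on a made-up toy frame: the helper name `handle_missing` and the `drop_threshold` parameter are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

def handle_missing(df, drop_threshold=0.5):
    """Drop columns whose missing share exceeds drop_threshold, then impute the rest."""
    missing_share = df.isnull().mean()
    to_drop = missing_share[missing_share > drop_threshold].index
    out = df.drop(columns=to_drop)
    for col in out.columns:
        if out[col].isnull().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                # Numerical column: fill with the median
                out[col] = out[col].fillna(out[col].median())
            else:
                # Categorical column: fill with the most frequent value
                out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Toy data (made up for illustration): 'veil-type' is 75% missing
toy = pd.DataFrame({
    "cap-diameter": [5.0, None, 7.0, 9.0],
    "cap-shape": ["x", "x", None, "f"],
    "veil-type": [None, None, None, "u"],
})
cleaned = handle_missing(toy)
print(cleaned)
```

On the toy frame, `veil-type` is dropped, the missing `cap-diameter` becomes the median (7.0), and the missing `cap-shape` becomes the mode (`x`).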

In [None]:
missing_values = df.isnull().mean() * 100
missing_values = missing_values[missing_values >0]
missing_values = missing_values.sort_values(ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x=missing_values.index, y=missing_values.values, palette='viridis')
plt.xticks(rotation=90)
plt.xlabel('Features')
plt.ylabel('Percentage of Missing Values')
plt.title('Missing Values Distribution in df_train')
plt.show()

**I will remove these columns:**

In [None]:
columns_to_drop = ['veil-type', 'veil-color', 'stem-root', 'spore-print-color',]
df = df.drop(columns=columns_to_drop)
test_df = test_df.drop(columns=columns_to_drop)

In [None]:
'''
# For categorical features
categorical_cols = ['cap-shape', 'cap-surface', 'cap-color', 'does-bruise-or-bleed',
                    'gill-attachment', 'gill-spacing', 'gill-color',
                    'stem-surface', 'stem-color',
                    'has-ring', 'ring-type', 'habitat']

for col in categorical_cols:
    df[col].fillna(df[col].mode()[0], inplace=True)  # Replace NaN with the mode

# For numerical features
numerical_cols = ['cap-diameter']

for col in numerical_cols:
    df[col].fillna(df[col].median(), inplace=True)  # Replace NaN with the median
'''

#### **Because I am working with XGBoost, which handles missing values natively, I will not impute them myself. I tried imputing, but the model did better when the missing values were left to XGBoost.**

## Class Distribution (Is the Data Imbalanced?)

To understand whether our dataset is imbalanced, we can check the distribution of the target variable (`class`) to see how many mushrooms are labeled edible (e) and how many are labeled poisonous (p).

#### **1. Check Class Distribution**

In [None]:
# Get the counts of each class
class_counts = df['class'].value_counts()

# Calculate the percentage of each class
class_percentage = df['class'].value_counts(normalize=True) * 100

# Display the counts and percentages
class_distribution = pd.DataFrame({'Count': class_counts, 'Percentage': class_percentage})
print(class_distribution)


#### **2. Interpret the Results**

If the classes are relatively balanced (close to 50-50), you may not need to do anything special.

If the classes are imbalanced (e.g., 90-10), you'll need to consider strategies to address this, such as:

- Resampling: either oversample the minority class or undersample the majority class.
- Choice of algorithm: tree ensembles such as XGBoost or Random Forest expose class-weighting options and often cope with imbalance better.
- Class weights: assign higher weights to the minority class during model training.

In our data the distribution is clearly good, **around 50% for each class**, so we don't need to do anything here.
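
For reference, here is a minimal sketch of how a class weight could be derived if the data were imbalanced. The helper `class_balance` and the 90-10 toy labels are made up for illustration. The negative-to-positive ratio is the value commonly passed to XGBoost's `scale_pos_weight` parameter; it is not needed for this dataset.

```python
import pandas as pd

def class_balance(labels, positive="p"):
    """Return the positive-class share and the negative/positive count ratio."""
    counts = labels.value_counts()
    n_pos = counts[positive]
    n_neg = counts.sum() - n_pos
    return {
        "share_positive": n_pos / counts.sum(),
        "scale_pos_weight": n_neg / n_pos,
    }

# Hypothetical imbalanced target: 10 poisonous, 90 edible
toy_labels = pd.Series(["p"] * 10 + ["e"] * 90)
info = class_balance(toy_labels)
print(info)
```

With a roughly 50-50 split the ratio is close to 1, which is why no reweighting is applied in this notebook.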

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the class distribution
sns.countplot(x='class', data=df)
plt.title('Class Distribution of Mushrooms (Edible vs Poisonous)')
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()


## Outlier Check

Checking for outliers is an essential step in data preprocessing, especially for numerical features, as outliers can significantly impact your model's performance.

**Visualizing Outliers Using Box Plots**

Box plots are a great way to visually identify outliers. Here’s how you can create box plots for the numerical features in your dataset:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# List of numerical columns to check for outliers
numerical_columns = ['cap-diameter', 'stem-height', 'stem-width']

# Plotting box plots for each numerical feature
plt.figure(figsize=(15, 5))
for i, column in enumerate(numerical_columns, 1):
    plt.subplot(1, len(numerical_columns), i)
    sns.boxplot(x=df[column])
    plt.title(f'Box Plot of {column}')
plt.show()


In [None]:
# Function to detect outliers using the IQR method
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data[column] < lower_bound) | (data[column] > upper_bound)]

# Apply the function to each numerical column
for column in numerical_columns:
    outliers = detect_outliers_iqr(df, column)
    print(f"Number of outliers in {column}: {outliers.shape[0]}")


Okay, it looks like we have outliers here, but we can't treat all of these points as outliers because some of them are plausible real values.

#### **How to Handle Outliers**
Once you've identified outliers, you have several options:

**Remove outliers**: If the outliers are likely errors or not relevant, you can remove them.

**Cap outliers**: Replace outliers with the upper or lower bound of the non-outlier data.

For now I will drop only the most extreme outliers, computing the bounds from the 1st and 99th percentiles (0.01, 0.99) instead of the quartiles.
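
The capping option is not used in this notebook, but a minimal sketch of it could look like this. The helper `cap_outliers` and the toy values are assumptions made for illustration; values outside the chosen percentiles are clipped to them rather than removed.

```python
import pandas as pd

def cap_outliers(series, lower_q=0.01, upper_q=0.99):
    """Clip values below/above the given quantiles to those quantiles."""
    lower, upper = series.quantile([lower_q, upper_q])
    return series.clip(lower=lower, upper=upper)

# Toy data: values 1..99 plus one extreme value
toy = pd.Series([float(v) for v in range(1, 100)] + [1000.0])
capped = cap_outliers(toy)

print("max before:", toy.max())
print("max after: ", capped.max())
print("rows kept: ", len(capped))
```

Unlike removal, capping keeps every row, which matters when the other columns of an outlier row still carry useful information.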

In [None]:
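# Detect only extreme outliers: same 1.5 * range rule as before, but the range
# is taken between the 1st and 99th percentiles instead of the quartiles.
# The function name is kept so that later cells can keep calling it.
def detect_outliers_iqr(data, column):
    q_low = data[column].quantile(0.01)
    q_high = data[column].quantile(0.99)
    spread = q_high - q_low
    lower_bound = q_low - 1.5 * spread
    upper_bound = q_high + 1.5 * spread
    return data[(data[column] < lower_bound) | (data[column] > upper_bound)]

# Count the extreme outliers in each numerical column
for column in numerical_columns:
    outliers = detect_outliers_iqr(df, column)
    print(f"Number of outliers in {column}: {outliers.shape[0]}")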


In [None]:
# Removing outliers from each numerical column
for column in numerical_columns:
    df = df[~df.index.isin(detect_outliers_iqr(df, column).index)]


# Encoding

Let's first check the number of unique values in each categorical column.

In [None]:
cat_cols = [col for col in df.columns if df[col].dtypes == "O"]
df[cat_cols].nunique()

#### **Okay, now there is a problem. We usually use one-hot encoding, but here it would produce a very large number of features, which would be hard to interpret.**

#### **So we can perform rare encoding followed by either one-hot encoding or label encoding, based on the number of unique classes in each categorical variable.**

**threshold=0.01: This defines the threshold for rare encoding, meaning any class that appears in less than 1% of the data will be labeled as 'Rare'.**
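
To make the threshold concrete, here is a small worked example on a made-up column of 200 values. The helper name `mark_rare` is hypothetical. The class `z` appears once (0.5% of rows), so it falls below the 1% threshold and is relabeled.

```python
import pandas as pd

def mark_rare(series, threshold=0.01):
    """Replace categories whose relative frequency is below threshold with 'Rare'."""
    freq = series.value_counts(normalize=True)
    rare_classes = freq.index[freq < threshold]
    return series.where(~series.isin(rare_classes), "Rare")

# Toy column: 'n' = 60%, 'w' = 39.5%, 'z' = 0.5%
colors = pd.Series(["n"] * 120 + ["w"] * 79 + ["z"] * 1)
encoded = mark_rare(colors)
print(encoded.value_counts())
```

The two frequent classes are untouched, and the single `z` row now carries the label `Rare`.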

In [None]:
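def rare_encoding(train, test, threshold=0.01):
    """Label categories rarer than `threshold` in the training data as 'Rare' in both datasets."""
    for column in train.select_dtypes(include='object').columns:
        # Learn the frequent categories from the training data only,
        # so that train and test end up with the same set of categories
        freq = train[column].value_counts(normalize=True)
        frequent_classes = freq.index[freq >= threshold]
        for data in (train, test):
            # Keep frequent categories and missing values; relabel everything else
            keep = data[column].isin(frequent_classes) | data[column].isna()
            data[column] = data[column].where(keep, 'Rare')
    return train, test

# Separate the target column
target = df['class']
df_features = df.drop(columns=['class'])

# Apply rare encoding to both datasets
df_features, test_df = rare_encoding(df_features, test_df)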

# Align columns
common_cols = df_features.columns.intersection(test_df.columns)
df_features = df_features[common_cols]
test_df = test_df[common_cols]

# Reattach the target column
df = pd.concat([df_features, target], axis=1)

# Verify columns
print("df columns:", df.columns)
print("test_df columns:", test_df.columns)


In [None]:
print(df.shape)
print(test_df.shape)

In [None]:
df[cat_cols].nunique()

#### **Choose Encoding Method**

After rare encoding, we can decide between one-hot encoding and label encoding based on the number of unique classes.
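
Whichever method is chosen, the mapping from category to number should be the same for the training and test sets. Here is a minimal sketch of that idea, assuming scikit-learn's `OrdinalEncoder` as a stand-in for `LabelEncoder` (it is not what this notebook uses) and a made-up toy column. The encoder is fitted on the training data only, and categories that appear only in the test data get the code -1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data: "k" appears in the test set but never in the training set
train = pd.DataFrame({"gill-color": ["w", "n", "w", "y"]})
test = pd.DataFrame({"gill-color": ["n", "y", "k"]})

# Fit on train only; unseen test categories are encoded as -1
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train[["gill-color"]])
test_codes = encoder.transform(test[["gill-color"]])

print("categories:", encoder.categories_[0])
print("train codes:", train_codes.ravel())
print("test codes: ", test_codes.ravel())
```

Fitting a separate encoder on each dataset can assign different integers to the same category whenever the two sets of categories differ, which would silently corrupt the test predictions.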

In [None]:
def encode_categorical(train, test, max_unique_classes=10):
    """One-hot encode low-cardinality columns and label encode the rest, identically for both datasets."""
    # Encode train and test together so both get the same dummy columns and label codes
    combined = pd.concat([train, test], keys=['train', 'test'])
    for column in combined.select_dtypes(include='object').columns:
        if combined[column].nunique() <= max_unique_classes:
            combined = pd.get_dummies(combined, columns=[column], drop_first=True)
        else:
            combined[column] = LabelEncoder().fit_transform(combined[column])
    return combined.xs('train'), combined.xs('test')

# Separate the target column
target = df['class']
df_features = df.drop(columns=['class'])

# Apply encoding
df_features, test_df = encode_categorical(df_features, test_df)

# Ensure both datasets have the same columns
common_cols = df_features.columns.intersection(test_df.columns)
df_features = df_features[common_cols]
test_df = test_df[common_cols]

# Reattach the target column
df = pd.concat([df_features, target], axis=1)

# Verify columns
print("df columns:", df.columns)
print("test_df columns:", test_df.columns)


In [None]:
print(df.shape)
print(test_df.shape)

In [None]:
# Map the target 'class' column to 0 and 1
df['class'] = df['class'].map({'e': 0, 'p': 1})

# Feature Selection

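Feature selection involves identifying the most relevant features that contribute to the model's predictive power.
I will consider four ways to do that: correlation analysis, univariate feature selection, recursive feature elimination, and feature importance from tree-based models.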

## **Correlation Analysis**

In [None]:
plt.figure(figsize=(18, 12))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()


Okay, it's clear that we can't learn anything from this heatmap, so we will use the other methods.

## **Univariate Feature Selection**


Techniques like **SelectKBest** select features based on **statistical** tests: each feature is scored independently against the target, and the k highest-scoring features are kept.

In [None]:
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.impute import SimpleImputer

# Separate features and target
X = df.drop('class', axis=1)
y = df['class']

# chi2 cannot handle NaN, so impute missing values with the column median
# (note: chi2 also requires non-negative feature values)
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)

# Convert imputed data back to a DataFrame
X_imputed_df = pd.DataFrame(X_imputed, columns=X.columns)

# Perform feature selection
selector = SelectKBest(score_func=chi2, k=10)
X_new = selector.fit_transform(X_imputed_df, y)

# Get selected feature names
selected_features = X.columns[selector.get_support()]
print("Selected features:", selected_features)


In [None]:
'''Selected= ['class', 'cap-diameter', 'stem-width', 'cap-shape_b', 'gill-attachment_a',
       'gill-attachment_e', 'gill-attachment_p', 'stem-surface_g',
       'stem-color_w', 'ring-type_z', 'season_w']
testSelected= ['cap-diameter', 'stem-width', 'cap-shape_b', 'gill-attachment_a',
       'gill-attachment_e', 'gill-attachment_p', 'stem-surface_g',
       'stem-color_w', 'ring-type_z', 'season_w']
df_SELECTED = df[Selected]
df.shape
test_df_SELECTED = test_df[testSelected]'''

### **It doesn't seem logical to me that it picks some classes of a feature and leaves out the others!**
### I tried it anyway, but of course I got worse accuracy than without deleting the features.
### **So I will leave the columns without selection for now.**

**I will try later:**

**Recursive Feature Elimination (RFE):** RFE selects features by recursively considering smaller sets of features: it fits a model, drops the least important features, and repeats until the desired number remains.

**Feature Importance from Tree-Based Models:** Tree-based models like Random Forest or XGBoost can provide feature importance directly.
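
As a rough sketch of both ideas, the example below runs scikit-learn's `RFE` with a decision tree on synthetic data and then reads the tree's feature importances. It is not run on the mushroom data; the toy arrays `X_toy` and `y_toy` are made up so that only the first two columns carry signal.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 6 features, the target depends only on features 0 and 1
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(500, 6))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Recursively drop the least important feature until two remain
rfe = RFE(estimator=DecisionTreeClassifier(random_state=42),
          n_features_to_select=2, step=1)
rfe.fit(X_toy, y_toy)

print("Kept feature indices:", np.flatnonzero(rfe.support_))
# Importances of the final tree, fitted on the kept features only
print("Importances of kept features:", rfe.estimator_.feature_importances_)
```

On the real data, the estimator could be swapped for a gradient-boosted model, at the cost of refitting it once per elimination step.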



# Modeling


In [None]:
# Separate features and target
X_train = df.drop('class', axis=1)
y_train = df['class']

# Test data (no target)
X_test = test_df.copy()

In [None]:
# Compare the column sets of the training and test data
train_columns = set(df.columns)
test_columns = set(test_df.columns)
train_only_columns = train_columns - test_columns
test_only_columns = test_columns - train_columns
print("Columns in train_df but not in test_df:")
print(train_only_columns)
print("\nColumns in test_df but not in train_df:")
print(test_only_columns)


In [None]:
numeric_features = ['cap-diameter', 'stem-height', 'stem-width']

# Standardize the numeric features. The scaler is fitted on the training data only
# and applied to X_train and X_test, the frames that are actually passed to the models
# (tree-based models do not need scaling, so this step is optional).
scaler = StandardScaler()
X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])
X_test[numeric_features] = scaler.transform(X_test[numeric_features])

In [None]:
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)

In [None]:
# Matthews correlation coefficient (MCC), the evaluation metric for this competition
def mcc_metric(y_pred, dmatrix):
    y_true = dmatrix.get_label()
    y_pred = (y_pred > 0.5).astype(int)
    mcc = matthews_corrcoef(y_true, y_pred)
    return 'mcc', mcc
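
For readers unfamiliar with MCC, here is a small worked example on made-up labels. It computes the coefficient by hand from the confusion matrix, using MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), and compares the result with scikit-learn's `matthews_corrcoef`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Toy labels: 4 positives and 4 negatives, with one error of each kind
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

mcc_manual = (tp * tn - fp * fn) / np.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)
mcc_sklearn = matthews_corrcoef(y_true, y_pred)

print(f"manual:  {mcc_manual:.4f}")   # 0.5000
print(f"sklearn: {mcc_sklearn:.4f}")  # 0.5000
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, so it penalizes both kinds of error even when the classes are unbalanced.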

## XGBOOST

In [None]:
#parameters={'n_estimators': 297, 'max_depth': 16, 'learning_rate': 0.03906159386409017, 'subsample': 0.6935900010487451, 'colsample_bytree': 0.5171160704967471, 'gamma': 0.00013710778966124443, 'lambda': 0.0017203271581656767, 'alpha': 8.501510750413265e-06, 'scale_pos_weight': 1.0017942891559255,'enable_categorical': True,'tree_method': 'hist'}


parameters={'n_estimators': 297, 'max_depth': 19, 'learning_rate': 0.028333382496137323, 'subsample': 0.9947997083813288,
            'colsample_bytree': 0.5336230391923533,
            'gamma': 0.16126940334635828,
            'early_stopping_rounds': 50 }


xgb_optuna_params = {
    'n_estimators': 10000,
    'alpha': 0.0002,
    'subsample': 0.60,
    'colsample_bytree': 0.4,
    'max_depth': 13,
    'min_child_weight': 10,
    'learning_rate': 0.002,
    'gamma': 5.6e-08,
    'early_stopping_rounds': 10,
    # 'tree_method': 'gpu_hist',
    # 'device': "cuda"
}

In [None]:
# Initialize the XGBClassifier with the specified parameters
xgb_model = XGBClassifier(**xgb_optuna_params)

In [None]:

# Train the model with the evaluation set
xgb_model.fit(
    X_train_split, y_train_split,
    eval_set=[(X_val_split, y_val_split)],
    verbose=200
)

In [None]:
# Predict on the test data
test_predictions = xgb_model.predict(X_test)

# Predict on the held-out validation split created with train_test_split
y_val_pred = xgb_model.predict(X_val_split)

# Evaluate the model on the validation set
accuracy = accuracy_score(y_val_split, y_val_pred)
mcc = matthews_corrcoef(y_val_split, y_val_pred)
print(f'Validation MCC: {mcc:.4f}')
print(f'Accuracy: {accuracy:.4f}')
print('Classification Report:')
print(classification_report(y_val_split, y_val_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_val_split, y_val_pred))


In [None]:
submission = pd.read_csv("/kaggle/input/playground-series-s4e8/sample_submission.csv")
submission["class"] = test_predictions


In [None]:
# Map the predicted 0/1 labels back to 'e' and 'p'
submission['class'] = submission['class'].map({0: 'e', 1: 'p'})

In [None]:
submission.to_csv('submission.csv',index=False)

In [None]:
# Define the confusion matrix
conf_matrix = confusion_matrix(y_val_split, y_val_pred)
# Convert the confusion matrix to a DataFrame for better visualization with seaborn
conf_matrix_df = pd.DataFrame(conf_matrix,
                              index=['Actual Negative', 'Actual Positive'],
                              columns=['Predicted Negative', 'Predicted Positive'])
# Create the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_df, annot=True, fmt='d', cmap='Blues',
            linewidths=0.5, linecolor='black',
            cbar_kws={'label': 'Number of Predictions'})
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()


## LGBMClassifier

In [None]:
# Still a work in progress

# Define the hyperparameters for LGBMClassifier (adjusted for LGBM)
#parameters = {'n_estimators': 1869, 'max_depth': 32, 'learning_rate': 0.010217690029650325,
#              'subsample': 0.847713364798533, 'colsample_bytree': 0.9861945128452118,
#              'min_child_weight': 3.584741970207093, 'reg_alpha': 0.5182335134716664, 'reg_lambda': 0.10566374380137711}

# Define the parameters for LGBMClassifier

lgb_params = {
    'n_estimators': 2500,
    'random_state': 42,
    'max_bin': 1024,
    'colsample_bytree': 0.6,
    'reg_lambda': 80,
    'verbosity': -1
}


In [None]:
# Initialize the LGBMClassifier with the specified parameters
lgb_model = LGBMClassifier(**lgb_params)

In [None]:
# Train the model with the evaluation set
lgb_model.fit(
    X_train_split, y_train_split,
    eval_set=[(X_val_split, y_val_split)],

)

In [None]:
# Predict on the validation data
y_val_pred = lgb_model.predict(X_val_split)

# Evaluate the model on the validation set
accuracy = accuracy_score(y_val_split, y_val_pred)
mcc = matthews_corrcoef(y_val_split, y_val_pred)
print(f'Validation MCC: {mcc:.4f}')
print(f'Accuracy: {accuracy:.4f}')
print('Classification Report:')
print(classification_report(y_val_split, y_val_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_val_split, y_val_pred))

## Ensemble Model

In [None]:
# Soft-voting ensemble: average the predicted probabilities of the two models.
# The already fitted models are reused because VotingClassifier would refit clones
# without an eval_set, which XGBoost rejects when early_stopping_rounds is set.
xgb_val_proba = xgb_model.predict_proba(X_val_split)[:, 1]
lgb_val_proba = lgb_model.predict_proba(X_val_split)[:, 1]
ensemble_val_proba = (xgb_val_proba + lgb_val_proba) / 2

# Predict on the validation set
y_val_pred = (ensemble_val_proba > 0.5).astype(int)

# Evaluate the ensemble model
accuracy = accuracy_score(y_val_split, y_val_pred)
mcc = matthews_corrcoef(y_val_split, y_val_pred)
print(f'Validation MCC: {mcc:.4f}')
print(f'Accuracy: {accuracy:.4f}')
print('Classification Report:')
print(classification_report(y_val_split, y_val_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_val_split, y_val_pred))


## XGBOOST Optuna hyperparameter tuning

In [None]:
'''
import optuna
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, matthews_corrcoef, classification_report, confusion_matrix
from optuna.samplers import TPESampler

# Separate features and target
X_train = df.drop('class', axis=1)
y_train = df['class']

# Split the data into training and validation sets
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)

# Scale the features
scaler = StandardScaler()
X_train_split_scaled = scaler.fit_transform(X_train_split)
X_val_split_scaled = scaler.transform(X_val_split)
X_test_scaled = scaler.transform(X_test)

# Define the objective function for Optuna
def objective(trial):
    # Define the hyperparameters to tune
    param = {
        'objective': 'binary:logistic',
        'eval_metric': 'logloss',
        'use_label_encoder': False,
        'random_state': 42,
        'n_estimators': trial.suggest_int('n_estimators', 200, 300),
        'max_depth': trial.suggest_int('max_depth', 12, 20),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'gamma': trial.suggest_float('gamma', 0, 0.3)
    }

    # Initialize and train the XGBoost model
    model = xgb.XGBClassifier(**param)
    model.fit(X_train_split_scaled, y_train_split)

    # Predict on the validation set
    y_val_pred = model.predict(X_val_split_scaled)

    # Calculate the MCC
    mcc = matthews_corrcoef(y_val_split, y_val_pred)

    return mcc

# Create a study object and optimize the objective function
study = optuna.create_study(direction='maximize', sampler=TPESampler())
study.optimize(objective, n_trials=50)

# Get the best hyperparameters
best_params = study.best_params

# Train the best model on the entire training data
best_model = xgb.XGBClassifier(**best_params, objective='binary:logistic', eval_metric='logloss', use_label_encoder=False, random_state=42)
best_model.fit(X_train_split_scaled, y_train_split)

# Predict on the validation set with the best model
y_val_pred = best_model.predict(X_val_split_scaled)
'''


#### **I got these as the best parameters after tuning; I am still working to improve them.**

Trial 34 finished with value:  and parameters: {'n_estimators': 297, 'max_depth': 19, 'learning_rate': 0.028333382496137323, 'subsample': 0.9947997083813288, 'colsample_bytree': 0.5336230391923533, 'gamma': 0.16126940334635828}. Best is trial 34

## LGBMClassifier Optuna hyperparameter tuning

In [None]:
'''import optuna
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, matthews_corrcoef, classification_report, confusion_matrix
from optuna.samplers import TPESampler

# Separate features and target
X_train = df.drop('class', axis=1)
y_train = df['class']

# Split the data into training and validation sets
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)

# Scale the features
scaler = StandardScaler()
X_train_split_scaled = scaler.fit_transform(X_train_split)
X_val_split_scaled = scaler.transform(X_val_split)
X_test_scaled = scaler.transform(X_test)  # Scale the test data as well

# Define the objective function for Optuna
def objective(trial):
    # Define the hyperparameters to tune
    param = {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'boosting_type': 'gbdt',
        'random_state': 42,
        'n_estimators': trial.suggest_int('n_estimators', 200, 2000),
        'max_depth': trial.suggest_int('max_depth', 9, 35),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2),
        'subsample': trial.suggest_float('subsample', 0.8, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.8, 1.0),
        'min_child_weight': trial.suggest_float('min_child_weight', 0.001, 10),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 1.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.0, 1.0)
    }

    # Initialize and train the LGBM model
    model = lgb.LGBMClassifier(**param)
    model.fit(X_train_split_scaled, y_train_split)

    # Predict on the validation set
    y_val_pred = model.predict(X_val_split_scaled)

    # Calculate the MCC
    mcc = matthews_corrcoef(y_val_split, y_val_pred)

    return mcc

# Create a study object and optimize the objective function
study = optuna.create_study(direction='maximize', sampler=TPESampler())
study.optimize(objective, n_trials=50)

# Get the best hyperparameters
best_params = study.best_params

# Train the best model on the entire training data
best_model = lgb.LGBMClassifier(**best_params, objective='binary', metric='binary_logloss', boosting_type='gbdt', random_state=42)
best_model.fit(X_train_split_scaled, y_train_split)

# Predict on the validation set with the best model
y_val_pred = best_model.predict(X_val_split_scaled)

# Evaluate the best model
accuracy = accuracy_score(y_val_split, y_val_pred)
mcc = matthews_corrcoef(y_val_split, y_val_pred)

print(f'Best Hyperparameters: {best_params}')
print(f'Validation Accuracy: {accuracy:.4f}')
print(f'Validation MCC: {mcc:.4f}')
print('Classification Report:')
print(classification_report(y_val_split, y_val_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_val_split, y_val_pred))

# Predict on the test data
test_predictions = best_model.predict(X_test_scaled)'''


#### **I got these as the best parameters for LGBMClassifier after tuning; I am still working to improve them.**
Trial 49 finished with
and parameters: {'n_estimators': 1869, 'max_depth': 32, 'learning_rate': 0.010217690029650325, 'subsample': 0.847713364798533, 'colsample_bytree': 0.9861945128452118, 'min_child_weight': 3.584741970207093, 'reg_alpha': 0.5182335134716664, 'reg_lambda': 0.10566374380137711}. Best is trial 46

# Submission

In [None]:
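submission = pd.read_csv("/kaggle/input/playground-series-s4e8/sample_submission.csv")
# Use the XGBoost test-set predictions (y_val_pred only covers the validation split)
# and map the 0/1 labels back to 'e' and 'p'
submission["class"] = pd.Series(test_predictions).map({0: 'e', 1: 'p'}).values
submission.to_csv('submission.csv', index=False)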

#### **As you can see, the accuracy with XGBoost is 0.992.**

#### **The other metrics (F1-score, recall, precision) are okay.**

#### **I am still updating the notebook every day, tuning hyperparameters and trying other things to share with you.**
### **So please, if you found it useful, UPVOTE it.**
### **Thanks**