# Setting Up Work Environment

In [None]:


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



In [None]:
data_train = pd.read_csv("/kaggle/input/math482-2024-2025-1-hw-02-v2/train.csv") #reading train data
data_test = pd.read_csv("/kaggle/input/math482-2024-2025-1-hw-02-v2/test.csv") #reading test data

# Overview of the Data

**Training Data**

The training dataset contains 35,000 rows and 28 features.

**Features**: The column names are abstract and reveal nothing about what the data represents. All of the columns are numerical.

**Target Variable**: Four abstract class labels: "0", "1", "2" and "3".

**Missing Values**: feature_08, feature_09, feature_21 and feature_25 have missing values.

In [None]:
data_train.info()

In [None]:
# Check for missing values in the dataset
missing_values = data_train.isnull().sum()

# Display only columns with missing values
missing_columns = missing_values[missing_values > 0]

if missing_columns.empty:
    print("No missing values in the dataset.")
else:
    print("Missing values found:")
    print(missing_columns)

**Test Data**

The test dataset has no missing values and matches the structure of the training data (the same feature columns, without the target). The exact row and column counts are shown in the `data_test.info()` output below.

In [None]:
data_test.info()

In [None]:
# Check for missing values in the dataset
missing_values = data_test.isnull().sum()

# Display only columns with missing values
missing_columns = missing_values[missing_values > 0]

if missing_columns.empty:
    print("No missing values in the dataset.")
else:
    print("Missing values found:")
    print(missing_columns)

# Data Preprocessing

Missing values have to be handled first so that the later steps run without errors. Since the rows with missing values are few compared to the size of the dataset, I will simply drop them rather than impute them, because I don't want imputation to introduce unintended synthetic relations within the data.

In [None]:
data_train.dropna(inplace=True)
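
One way to back up the "so few" claim is to measure the share of rows that `dropna` removes before actually dropping them. The sketch below uses a small made-up frame (`toy`, with hypothetical column names) instead of the competition data, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data_train; the column names are illustrative only
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "feature_b": [0.5, 0.1, 0.3, 0.9, 0.7],
    "target": [0, 1, 2, 3, 0],
})

# A row is lost if it has at least one missing value
rows_with_nan = toy.isnull().any(axis=1)
dropped_fraction = rows_with_nan.mean()
print(f"Rows removed by dropna: {rows_with_nan.sum()} ({dropped_fraction:.1%})")

# Dropping rows can also shift the class balance, so compare it before and after
balance_before = toy["target"].value_counts(normalize=True).sort_index()
balance_after = toy.dropna()["target"].value_counts(normalize=True).sort_index()
print(balance_before)
print(balance_after)
```

If the dropped share is small and the class balance barely moves, dropping rows is a reasonable choice; otherwise imputation may be worth revisiting.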

The correlation matrix below suggests there are no highly correlated features. However, a correlation matrix only shows the correlation coefficient between pairs of features. If a feature is a linear combination of two or more other features, this method can't capture that relation, so we need a better method to find redundant features.

In [None]:
# Correlation matrix
correlation_matrix = data_train.loc[:, data_train.columns != "id"].corr()

# Heatmap of correlations
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=False, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()


# Keep only the upper triangle of the correlation matrix (to avoid duplicate pairs)
correlation_pairs = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))

# Unstack the upper triangle into a Series of absolute correlations, sorted in descending order
correlated_features = correlation_pairs.unstack().dropna().abs().sort_values(ascending=False)

# Filter pairs with high correlation
high_correlation_pairs = correlated_features[correlated_features > 0.3]

# Display the most correlated feature pairs
print("Most Correlated Feature Pairs:")
print(high_correlation_pairs)
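
To illustrate why pairwise correlations can miss redundancy, here is a minimal sketch on synthetic data (not the competition data). A feature built as the exact sum of 25 independent features has only a weak correlation with each of them, yet it is perfectly predictable from all of them together. The names `parts` and `combo` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_parts = 5000, 25

# 25 independent standard-normal features
parts = rng.normal(size=(n_samples, n_parts))
# A redundant feature: the exact sum of the 25 features above
combo = parts.sum(axis=1)

# Pairwise correlation of the redundant feature with each component
pairwise = np.array([np.corrcoef(parts[:, k], combo)[0, 1] for k in range(n_parts)])
print("largest pairwise |r|:", np.abs(pairwise).max())

# R^2 from regressing the redundant feature on all components at once
coef, *_ = np.linalg.lstsq(parts, combo, rcond=None)
residual = combo - parts @ coef
r_squared = 1 - residual.var() / combo.var()
print("R^2 of the joint regression:", r_squared)
```

Each pairwise correlation stays below the 0.3 threshold used above, while the joint regression explains essentially all of the variance.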



I will use the variance inflation factor (VIF) to find features that are linearly dependent on other features. To compute the VIF of a feature $X_j$, regress it on all the other features:

$X_j = \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_{j-1} X_{j-1} + \beta_{j+1} X_{j+1} + \dots + \beta_p X_p + \epsilon$

Then the VIF of $X_j$ is:

$\mathrm{VIF}_j = \frac{1}{1-R^2}$

where $R^2$ is the coefficient of determination of that regression, calculated as:

$R^2 = SS_e / SS_t$

$SS_e$ = explained sum of squares of $X_j$, calculated as $SS_e = \sum\limits_{i=1}^{n} (\hat{X}_{j,i} - \bar{X}_j)^2$ ($\hat{X}_{j,i}$ is the predicted value of $X_j$ in observation $i$ using the regression above)

$SS_t$ = total sum of squares of $X_j$, calculated as $SS_t = \sum\limits_{i=1}^{n} (X_{j,i} - \bar{X}_j)^2$ ($X_{j,i}$ is the actual value of $X_j$ in observation $i$)

If $R^2$ is 1, $X_j$ can be predicted completely from the other features (it is linearly dependent on them) and offers no new information; the VIF then grows without bound. If $R^2$ is 0, $X_j$ can't be predicted from the other features at all, and the VIF equals 1. A high VIF therefore means the feature is likely to be a linear combination of other features and redundant.
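
The definition above can be sketched directly with NumPy. This is an illustrative implementation on synthetic data, not the statsmodels routine used in the next cell; the helper name `vif_manual` and the variables `x1` to `x4` are made up for this example, and the sketch adds an intercept column, which is an assumption of mine.

```python
import numpy as np

def vif_manual(X, j):
    """VIF of column j: regress it on the remaining columns and return 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    # Add an intercept column so the regression does not rely on centered data
    design = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ beta
    ss_e = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
    ss_t = np.sum((y - y.mean()) ** 2)       # total sum of squares
    r_squared = ss_e / ss_t
    return 1.0 / (1.0 - r_squared)

rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)
x2 = rng.normal(size=2000)
x3 = rng.normal(size=2000)                   # independent of the others
x4 = x1 + x2 + 0.1 * rng.normal(size=2000)   # nearly a linear combination of x1 and x2
X = np.column_stack([x1, x2, x3, x4])

vifs = [vif_manual(X, j) for j in range(X.shape[1])]
for name, value in zip(["x1", "x2", "x3", "x4"], vifs):
    print(f"{name}: VIF = {value:.2f}")
```

In this toy setup `x3` gets a VIF close to 1, while `x4` and also its components `x1` and `x2` get large values, so a high VIF flags a group of related features rather than a single culprit.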


In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler

X_train = data_train.drop(columns=['target','id'])

# Standardize the features so that features with large scale don't dominate the result
scaler = StandardScaler()
scaled_features = scaler.fit_transform(X_train)

# Create a DataFrame for scaled features
scaled_data = pd.DataFrame(scaled_features, columns=X_train.columns)

# Calculate VIF for each feature
def calculate_vif(dataframe):
    vif_data = pd.DataFrame()
    vif_data["Feature"] = dataframe.columns
    vif_data["VIF"] = [variance_inflation_factor(dataframe.values, i) for i in range(dataframe.shape[1])]
    return vif_data

vif_results = calculate_vif(scaled_data)

# Filter features with VIF > 10
high_vif_features = vif_results[vif_results["VIF"] > 10]["Feature"]

# Display features with high VIF values
print("Features with VIF > 10:")
print(high_vif_features)


The features listed above have high VIF values, so I will remove all of them because they bring no new information.

In [None]:
final_data = scaled_data.drop(columns = ["feature_05","feature_20","feature_08","feature_01"])

I will check for outliers in the remaining columns using boxplots. If there are features with many outliers, I will remove those outliers using the IQR method. I don't see any problem in removing the outliers because none of the features is highly correlated with the target.

In [None]:
# List of features to visualize
features_to_plot = final_data.columns

# Create boxplots for each feature
plt.figure(figsize=(12, 24))
for i, feature in enumerate(features_to_plot, 1):
    plt.subplot(8, 3, i)  # Arrange plots in a grid (8 rows, 3 columns)
    sns.boxplot(data=data_train[feature])
    plt.title(f"Boxplot for {feature}")

# Adjust layout for better visualization
plt.tight_layout()
plt.show()

Most of the features have the middle 50% of their data concentrated in a fairly compact range, with many outliers extending across the rest of the range. This pattern is consistent with roughly Gaussian distributions, where values in the tails are expected, so I take back what I said about removing the outliers. They may reflect natural variation in the data. Given that I don't have any domain knowledge about the data, I will keep them. If the model accuracy turns out to be poor, I can experiment with removing them to improve the model.
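
As a rough check on this reasoning, the sketch below applies the usual 1.5 × IQR rule to a purely Gaussian synthetic sample. Even with no anomalies at all, about 0.7% of points fall outside the fences, which would be a couple of hundred points per feature in a 35,000-row dataset. The helper `iqr_outlier_mask` is a hypothetical name, and the same mask could be used later if removing outliers is worth trying.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask that is True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

rng = np.random.default_rng(0)
gaussian_sample = rng.normal(size=100_000)   # synthetic data, not a competition feature

mask = iqr_outlier_mask(gaussian_sample)
outlier_share = mask.mean()
print(f"Share flagged as outliers: {outlier_share:.2%}")

# Keeping only the non-flagged points is how the IQR method would remove them
trimmed = gaussian_sample[~mask]
```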

# Building the Model

I will use RandomForestClassifier for the first model as tree-based models are a solid choice for data including outliers.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

model_1 = RandomForestClassifier(random_state=0)  # Initialize the model

X_train = final_data
y_train = data_train["target"]

I will use RandomizedSearchCV to find the best parameters among the given candidates. I don't use grid search because the dataset is quite large and trying every combination would be too expensive. Randomized search works like grid search, but instead of trying every combination it evaluates a fixed number of randomly sampled combinations.
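
The cost difference can be made concrete with a little arithmetic. The sketch below counts model fits for an exhaustive grid versus a randomized search, using the same candidate values, fold count and `n_iter` as the search configured in this notebook; the variable names `grid_fits` and `random_fits` are made up for the illustration.

```python
from math import prod

# Same candidate values as the search space used in this notebook
param_distributions = {
    "n_estimators": [50, 100, 200, 300],
    "max_depth": [None, 10, 20, 30, 50],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}
cv_folds = 5
n_iter = 10

n_combinations = prod(len(values) for values in param_distributions.values())
grid_fits = n_combinations * cv_folds   # exhaustive grid search
random_fits = n_iter * cv_folds         # randomized search with n_iter sampled combinations

print(n_combinations, grid_fits, random_fits)  # 360 1800 50
```

The trade-off is that randomized search only covers 10 of the 360 combinations, so the best combination it reports is not guaranteed to be the best one in the grid.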

In [None]:

# Define the parameter grid for RandomizedSearchCV
param_distributions = {
    'n_estimators': [50, 100, 200, 300],  # Number of trees in the forest
    'max_depth': [None, 10, 20, 30, 50],  # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],      # Minimum samples to split an internal node
    'min_samples_leaf': [1, 2, 4],        # Minimum samples at a leaf node
    'bootstrap': [True, False]            # Whether bootstrap samples are used
}

# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=model_1,
    param_distributions=param_distributions,
    n_iter=10,           # Number of random combinations to try
    cv=5,                # 5-fold cross-validation
    scoring='accuracy',  # Use accuracy as the scoring metric
    n_jobs=-1,           # Use all processors
    random_state=0,      
)

# Fit the model to find the best parameters
random_search.fit(X_train, y_train)

# Store the 'id' column separately, then drop it from the test features
data_test_ids = data_test['id']
data_test = data_test.drop(columns=["id"])

# Standardize the test features using the scaler fitted on the training data
scaled_features_test = scaler.transform(data_test)
scaled_data_test = pd.DataFrame(scaled_features_test, columns=data_test.columns)

# Drop the same high-VIF columns that were removed from the training data
final_test_data = scaled_data_test.drop(columns=["feature_05","feature_20","feature_08","feature_01"])


best_rf = random_search.best_estimator_
y_pred_1 = best_rf.predict(final_test_data)


# Create the submission DataFrame
submission = pd.DataFrame({
    'id': data_test_ids,  # Use the stored 'id' values
    'target': y_pred_1  # Same column name as the training target and the second submission
})

# Save the submission file
submission.to_csv('submission1.csv', index=False)

print("Submission file created: submission1.csv")

For the second model I could use something like an XGBoost classifier, but since the remaining features appear to follow Gaussian distributions, I want to take the opportunity to use the Gaussian Naive Bayes (GNB) classifier. GNB assumes that each feature is Gaussian within each class and that the features are conditionally independent given the class. The second assumption doesn't hold completely in our case, but given that I removed some of the correlated features and the features weren't strongly correlated to begin with, I think that assumption won't hurt much.
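
To make the two assumptions concrete, here is a minimal NumPy sketch of what a Gaussian Naive Bayes classifier computes: one mean and one variance per class and feature, then a sum of per-feature log-likelihoods (this sum is where the independence assumption enters). It is an illustration on synthetic data with made-up helper names (`gnb_fit`, `gnb_predict`), not scikit-learn's implementation.

```python
import numpy as np

def gnb_fit(X, y):
    """Estimate per-class priors, feature means and feature variances."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant keeps the variances strictly positive
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def gnb_predict(X, classes, priors, means, variances):
    """Pick the class with the highest log prior plus summed per-feature Gaussian log-likelihood."""
    diff = X[:, None, :] - means[None, :, :]          # shape: (n_samples, n_classes, n_features)
    log_pdf = -0.5 * (np.log(2 * np.pi * variances)[None, :, :] + diff ** 2 / variances[None, :, :])
    log_posterior = np.log(priors)[None, :] + log_pdf.sum(axis=2)
    return classes[np.argmax(log_posterior, axis=1)]

rng = np.random.default_rng(0)
# Two well-separated synthetic classes with three features each
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(200, 3)), rng.normal(4.0, 1.0, size=(200, 3))])
y_toy = np.array([0] * 200 + [1] * 200)

params = gnb_fit(X_toy, y_toy)
toy_accuracy = (gnb_predict(X_toy, *params) == y_toy).mean()
print(f"Training accuracy on the toy data: {toy_accuracy:.3f}")
```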

In [None]:
from sklearn.naive_bayes import GaussianNB

# Initialize Gaussian Naive Bayes
gnb = GaussianNB()

# Train the model
gnb.fit(X_train, y_train)

# Make predictions
y_pred_2 = gnb.predict(final_test_data)


# Create the submission DataFrame
submission = pd.DataFrame({
    'id': data_test_ids,
    'target': y_pred_2
})

# Save the submission file
submission.to_csv('submission2.csv', index=False)

print("Submission file created: submission2.csv")

Since GNB is a simple model, it needs little hyperparameter tuning compared to other models, so initializing it with the default settings and fitting it is sufficient here.

# Conclusion

The accuracy score for the RandomForestClassifier is 86%, while for the Gaussian Naive Bayes (GNB), it is 66%. This likely indicates that the independence assumption of the GNB does not align well with our data. However, GNB is significantly more computationally efficient than the RandomForestClassifier. Therefore, in scenarios where computational efficiency is a priority and high accuracy is not critical, GNB might still be a feasible choice.
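
The notebook does not show how these accuracy scores were measured. One common way to get a comparable local estimate is k-fold cross-validation on the training data. The sketch below runs it for both model types on a synthetic 4-class dataset from `make_classification`, which stands in for the competition data, so the scores it prints say nothing about the 86% and 66% figures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic 4-class data standing in for the competition set
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gaussian_nb": GaussianNB(),
}

# Mean accuracy over 5 folds for each model
cv_accuracy = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in cv_accuracy.items():
    print(f"{name}: {score:.3f}")
```

With the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`.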