# **Machine Learning Project: Dataset Analysis and Modeling**
## **Importing Libraries**

This code cell imports the libraries used throughout the notebook: pandas for data handling, numpy for numerical operations, and, from scikit-learn, train_test_split (sklearn.model_selection) for splitting the dataset and precision_score, recall_score, and f1_score (sklearn.metrics) for evaluating model performance.

In [33]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score


## **Data Loading**

This code cell reads the training and test datasets from CSV files using the pandas library. The training dataset is loaded into the 'train_data' DataFrame, and the test dataset is loaded into the 'test_data' DataFrame. The CSV files are assumed to be named 'Train_Data.csv' and 'Test_Data.csv', respectively.

In [34]:
train_data = pd.read_csv("Train_Data.csv")
test_data = pd.read_csv("Test_Data.csv")


## **Data Preprocessing**

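This code cell preprocesses the training and test datasets: it checks for missing values, fills missing numeric values with the column mean, one-hot encodes categorical variables, and standardizes the numeric features m0 to m14 with a scaler fitted on the training data.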

In [35]:
from sklearn.preprocessing import StandardScaler

# Checking for missing values (printed, since a bare expression mid-cell shows nothing)
print(train_data.isnull().sum())

# Handling missing values (if any): fill numeric columns with the column mean
train_data.fillna(train_data.mean(numeric_only=True), inplace=True)
test_data.fillna(test_data.mean(numeric_only=True), inplace=True)

# Dropping test rows that still contain missing (non-numeric) values
test_data = test_data.dropna()

# Encoding categorical variables (if any)
train_data = pd.get_dummies(train_data)
test_data = pd.get_dummies(test_data)

# Scaling numeric features m0..m14: fit on the training data, reuse on the test data
metric_cols = [f"m{i}" for i in range(15)]
scaler = StandardScaler()
train_data[metric_cols] = scaler.fit_transform(train_data[metric_cols])
test_data[metric_cols] = scaler.transform(test_data[metric_cols])
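
The scaling step fits the scaler on the training columns and reuses the fitted statistics on the test columns. A minimal sketch of that pattern follows, using small made-up arrays rather than the notebook's data; the names `train_demo`, `test_demo`, and `scaler_demo` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for one numeric feature column
train_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
test_demo = np.array([[2.5], [10.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean and std from the training values
test_scaled = scaler_demo.transform(test_demo)        # reuses the training mean and std

# The test values are shifted and scaled by the training statistics, not their own
expected = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)
print(np.allclose(test_scaled, expected))
```

Calling `fit_transform` on the test data instead would standardize it with its own mean and standard deviation, so the same raw value could map to different scaled values in the two datasets.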




## **Train-Validation Split**

This code cell separates the target column 'pred' from the features and holds out 20% of the training data as a validation set (random_state=42 makes the split reproducible).
X_train and y_train are the training features and target; X_val and y_val are the validation features and target.

In [36]:
features = train_data.drop(['pred'], axis=1)  # Dropping the target column from the features
target = train_data['pred']  # Setting the target variable

X_train, X_val, y_train, y_val = train_test_split(features, target, test_size=0.2, random_state=42)
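
To make the effect of `test_size=0.2` concrete, here is a small sketch on made-up data. It also passes `stratify`, an optional `train_test_split` argument that keeps the class ratio similar in both splits; whether the notebook's target is imbalanced enough to need it is an assumption, and the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 50 rows, 2 features, 10 positive labels
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.array([1] * 10 + [0] * 40)

X_tr_demo, X_va_demo, y_tr_demo, y_va_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr_demo.shape, X_va_demo.shape)  # 40 training rows, 10 validation rows
print(y_tr_demo.sum(), y_va_demo.sum())  # positives in each split
```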


## **Model Initialization and Training**

In this code cell, three classifiers are initialized with scikit-learn's default hyperparameters and trained on the training data (X_train and y_train): a logistic regression (model1), a random forest (model2), and a support vector classifier (model3).
After this code cell, the three models are ready to make predictions on unseen data.

In [37]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

model1 = LogisticRegression()
model1.fit(X_train, y_train)

model2 = RandomForestClassifier()
model2.fit(X_train, y_train)

model3 = SVC()
model3.fit(X_train, y_train)


## **Prediction on Validation Set**

This code cell applies the trained models (model1, model2, and model3) to make predictions on the validation set (X_val).
These predictions can be used to evaluate the performance of the models and compare their results with the actual target values (y_val) for validation purposes.

In [38]:
pred1 = model1.predict(X_val)
pred2 = model2.predict(X_val)
pred3 = model3.predict(X_val)


## **Model Evaluation**

This code cell calculates precision, recall, and F1 score for each of the trained models (model1, model2, and model3) using the predictions made on the validation set (pred1, pred2, and pred3).

In [39]:
import warnings
from sklearn.exceptions import UndefinedMetricWarning

warnings.filterwarnings('ignore', category=UndefinedMetricWarning)

# Calculating precision, recall, and F1 score; zero_division=1 is set for precision and recall only
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    precision1 = precision_score(y_val, pred1, zero_division=1)
    recall1 = recall_score(y_val, pred1, zero_division=1)
    f1_score1 = f1_score(y_val, pred1)

    precision2 = precision_score(y_val, pred2, zero_division=1)
    recall2 = recall_score(y_val, pred2, zero_division=1)
    f1_score2 = f1_score(y_val, pred2)

    precision3 = precision_score(y_val, pred3, zero_division=1)
    recall3 = recall_score(y_val, pred3, zero_division=1)
    f1_score3 = f1_score(y_val, pred3)
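
As a worked illustration of the three metrics and of what `zero_division=1` does, the sketch below uses made-up labels (`y_true_demo`, `y_pred_demo`, and `y_none_demo` are hypothetical, not the notebook's data). With one true positive, one false positive, and two false negatives, precision is 1/2, recall is 1/3, and F1 is 0.4.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up labels: 3 actual positives, 2 predicted positives, 1 of them correct
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 0, 0, 0]

precision_demo = precision_score(y_true_demo, y_pred_demo, zero_division=1)
recall_demo = recall_score(y_true_demo, y_pred_demo, zero_division=1)
f1_demo = f1_score(y_true_demo, y_pred_demo)
print(precision_demo, recall_demo, f1_demo)

# A model that never predicts the positive class: precision is undefined (0/0),
# so zero_division=1 reports it as 1.0, while recall and F1 are 0.0
y_none_demo = [0] * 8
precision_none = precision_score(y_true_demo, y_none_demo, zero_division=1)
recall_none = recall_score(y_true_demo, y_none_demo, zero_division=1)
f1_none = f1_score(y_true_demo, y_none_demo, zero_division=0)
print(precision_none, recall_none, f1_none)
```

The second case is why a precision of 1.0 obtained with `zero_division=1` should be read together with recall and F1 rather than on its own.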




## **Selecting the Best Model**

This code cell compares the validation F1 scores of the trained models (model1, model2, and model3) and selects the one with the highest F1 score as the best model.
The F1 score is the harmonic mean of precision and recall, so it favors models that balance the two rather than maximizing one at the expense of the other.

In [40]:
best_model = None
best_f1_score = -1.0  # start below any valid F1 score so that a model is always selected

# Comparing the F1 scores of all the models and selecting the one with the highest F1 score
if f1_score1 > best_f1_score:
    best_model = model1
    best_f1_score = f1_score1

if f1_score2 > best_f1_score:
    best_model = model2
    best_f1_score = f1_score2

if f1_score3 > best_f1_score:
    best_model = model3
    best_f1_score = f1_score3
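
The same selection can be written as a lookup over a dictionary of scores. This is only a sketch with hypothetical score values and placeholder model names, not the results of this notebook.

```python
# Hypothetical validation F1 scores keyed by model name
f1_by_model = {"model1": 0.12, "model2": 0.31, "model3": 0.27}

# Pick the key whose score is highest
best_name = max(f1_by_model, key=f1_by_model.get)
best_score = f1_by_model[best_name]
print(best_name, best_score)

# max() always returns one of the keys, including when all scores tie;
# in that case the first key in insertion order wins
tied_scores = {"model1": 0.0, "model2": 0.0, "model3": 0.0}
tied_best = max(tied_scores, key=tied_scores.get)
print(tied_best)
```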


## **Test Data Preprocessing and Prediction**

This code cell aligns the columns of the test data with the training features and makes predictions on the test data using the best model selected.
These predictions are the model's output on unseen data; the code in this notebook does not score them against test labels.

In [41]:

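# test_data was already imputed, encoded, and scaled above; these two lines are a safeguard
test_data = test_data.fillna(0)
test_data = pd.get_dummies(test_data)

# Add columns seen in training but missing from the test data, then match the column order
missing_cols = set(X_train.columns) - set(test_data.columns)
for col in missing_cols:
    test_data[col] = 0
test_data = test_data[X_train.columns]

test_predictions = best_model.predict(test_data)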
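
The column alignment can also be expressed with `DataFrame.reindex`. The sketch below uses a made-up frame and a hypothetical list of training columns (`train_cols_demo`) to show the effect: absent columns are added as zeros, extra columns are dropped, and the order follows the training columns.

```python
import pandas as pd

# Made-up example: the test frame lacks "color_blue" and has an extra "color_green"
train_cols_demo = ["m0", "color_blue", "color_red"]
test_frame_demo = pd.DataFrame(
    {"m0": [0.5, -1.2], "color_red": [1, 0], "color_green": [0, 1]}
)

# Keep only the training columns, in training order; fill absent ones with 0
aligned_demo = test_frame_demo.reindex(columns=train_cols_demo, fill_value=0)
print(aligned_demo.columns.tolist())
print(aligned_demo)
```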


## **Creating Submission File**

This code cell creates a submission file containing the predictions made on the test data using the best model.

In [42]:
submission = pd.DataFrame({'pred': test_predictions})
submission.to_csv("predictions.csv", index=False)


## **Printing Best Model F1 Score**

This code cell prints the F1 score of the best model selected based on the comparison of F1 scores from the previous code cells.

In [43]:
print("Best Model F1 Score:", best_f1_score)


Best Model F1 Score: 0.008307372793354102
