First, the code loads the training and test data from CSV files using the Pandas library.

Then, the features and labels are split into separate variables. The features include passenger characteristics such as age, gender, and ticket fare, while the label is whether the passenger survived or not.

Next, the code preprocesses the data to prepare it for the machine learning algorithms. Outliers in the 'Fare' and 'Age' columns (values more than three standard deviations from the mean) are first replaced with the column mean. Missing values are then filled in: 'Age' and 'Fare' with the column median, and 'Embarked' with the column mode. The 'Sex' and 'Embarked' columns are then mapped to numerical values: in 'Sex', 'female' is mapped to 0 and 'male' to 1; in 'Embarked', 'S', 'C', and 'Q' are mapped to 0, 1, and 2 respectively.


Finally, the preprocessed data is ready for use in a machine learning model to predict whether a passenger survived or not.
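
As a minimal sketch of the imputation and encoding steps described above, the following applies them to a small made-up table. The rows are illustrative and not taken from train.csv, and the helper name `preprocess` is hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the values are made up, not from train.csv
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0, 40.0],
    'Fare': [7.25, 71.28, np.nan, 8.05],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', None, 'S'],
})

def preprocess(df):
    """Hypothetical helper mirroring the steps described above."""
    out = df.copy()
    # Numeric columns: fill missing values with the column median
    out['Age'] = out['Age'].fillna(out['Age'].median())
    out['Fare'] = out['Fare'].fillna(out['Fare'].median())
    # Categorical column: fill missing values with the most frequent value
    out['Embarked'] = out['Embarked'].fillna(out['Embarked'].mode()[0])
    # Encode the categories as integers
    out['Sex'] = out['Sex'].map({'female': 0, 'male': 1})
    out['Embarked'] = out['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
    return out

clean = preprocess(toy)
print(clean)
```

Working on a copy keeps the original frame intact, which makes it easy to compare the data before and after preprocessing.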

In [None]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Define a function to handle outliers using Z-score method
def handle_outliers_zscore(series, threshold=3):
    """
    Replaces outliers in a pandas Series with the series mean, using the Z-score method.

    Parameters:
    series (pandas Series): The series containing the data to be cleaned.
    threshold (int or float): The absolute Z-score beyond which a value is considered an outlier.

    How it works:
    The Z-score of a data point is the number of standard deviations it lies
    away from the mean of the series. Any value whose absolute Z-score exceeds
    the threshold is treated as an outlier and replaced with the mean.
    Missing values are left untouched, since their Z-score is NaN.

    Returns:
    series (pandas Series): A cleaned copy of the series.
    """
    series = series.copy()  # work on a copy so the caller's data is not modified in place
    mean = series.mean()
    std = series.std()
    z_scores = (series - mean) / std
    outliers = z_scores.abs() > threshold
    series[outliers] = mean
    return series

# Load the data
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Split the data into features and labels
X_train = train_df.drop(['Survived', 'PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
y_train = train_df['Survived']
X_test = test_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

# Handle outliers in 'Fare' feature
X_train['Fare'] = handle_outliers_zscore(X_train['Fare'], threshold=3)
X_test['Fare'] = handle_outliers_zscore(X_test['Fare'], threshold=3)

# Handle outliers in 'Age' feature
X_train['Age'] = handle_outliers_zscore(X_train['Age'], threshold=3)
X_test['Age'] = handle_outliers_zscore(X_test['Age'], threshold=3)

# Preprocess the data
X_train['Age'] = X_train['Age'].fillna(X_train['Age'].median())
X_train['Fare'] = X_train['Fare'].fillna(X_train['Fare'].median())
X_train['Embarked'] = X_train['Embarked'].fillna(X_train['Embarked'].mode()[0])
X_train['Sex'] = X_train['Sex'].map({'female': 0, 'male': 1})
X_train['Embarked'] = X_train['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

X_test['Age'] = X_test['Age'].fillna(X_test['Age'].median())
X_test['Fare'] = X_test['Fare'].fillna(X_test['Fare'].median())
X_test['Embarked'] = X_test['Embarked'].fillna(X_test['Embarked'].mode()[0])
X_test['Sex'] = X_test['Sex'].map({'female': 0, 'male': 1})
X_test['Embarked'] = X_test['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
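
To see what the Z-score replacement does in practice, here is a small sketch on made-up numbers. `replace_outliers_zscore` is a hypothetical variant of the function above that always works on a copy, and the fare values are invented for illustration.

```python
import pandas as pd

def replace_outliers_zscore(series, threshold=3):
    """Return a copy where values with |Z-score| > threshold are replaced by the mean."""
    mean = series.mean()
    std = series.std()
    z_scores = (series - mean) / std
    cleaned = series.astype(float).copy()
    cleaned[z_scores.abs() > threshold] = mean
    return cleaned

# Made-up fares: 29 ordinary values and one extreme value
fares = pd.Series([10.0] * 29 + [500.0])
cleaned_fares = replace_outliers_zscore(fares)

print('Before:', fares.iloc[-1])
print('After: ', round(cleaned_fares.iloc[-1], 2))
```

Note that the replacement value is the mean computed with the outlier still included, so it is pulled towards the extreme value. Replacing with the median instead would be one way to reduce that effect.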



# First: the KNN model


The code performs the following steps:

1. Scales the training and test data using the StandardScaler() class from scikit-learn.

2. Splits the training data into training and validation sets using the train_test_split() function from scikit-learn.

    A validation set is a portion of the training data that is set aside and not used during training. It is used to estimate the performance of a machine learning model and to tune its hyperparameters before testing it on new, unseen data.

3. Uses cross-validation to select the best value of the hyperparameter k.

    When we train a KNN (K-Nearest Neighbors) model, we need to set a hyperparameter called "k", which is the number of nearest neighbors to consider when making a prediction. The value of k can have a significant impact on the performance of the model.

    To select the best value of k, we use cross-validation, a way of testing the performance of a model on different subsets of the data. The data is split into several subsets (folds); the model is trained on all folds but one and tested on the held-out fold. This is repeated so that each fold serves as the test fold exactly once, and the resulting scores are averaged.

    Here, the code evaluates every k from 1 to 20 with 5-fold cross-validation on the scaled training data and keeps the k with the highest mean score.

4. Trains the KNN model on the training set using the best k value and evaluates it on the validation set (accuracy and precision).

5. Prints the confusion matrix.
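
The selection of k by cross-validation can be sketched as follows. This is an illustrative variant, not the notebook's exact procedure: it runs on synthetic data from `make_classification` (an assumption, not the Titanic features) and wraps the scaler and the classifier in a pipeline so that the scaling is re-fit on the training folds only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic features)
X_knn, y_knn = make_classification(n_samples=300, n_features=6, random_state=0)

k_values = list(range(1, 21))
mean_scores = []
for k_candidate in k_values:
    # Scaling lives inside the pipeline, so it is re-fit on each training fold
    pipe = make_pipeline(StandardScaler(),
                         KNeighborsClassifier(n_neighbors=k_candidate))
    fold_scores = cross_val_score(pipe, X_knn, y_knn, cv=5)
    mean_scores.append(fold_scores.mean())

# Pick the k with the highest mean cross-validation accuracy
demo_best_k = k_values[int(np.argmax(mean_scores))]
print('Best k:', demo_best_k, '| mean CV accuracy:', max(mean_scores))
```

Indexing `k_values` with the argmax position avoids relying on the range starting at 1, which is what the `+ 1` offset in a manual loop implicitly assumes.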




In [None]:
# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Split the training data into training and validation sets
X_train_split, X_valid_split, y_train_split, y_valid_split = train_test_split(X_train_scaled, y_train, test_size=0.2, random_state=42)

# Train the KNN model using cross-validation
k_range = range(1, 21)
cv_scores = []
for k in k_range:
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    cv_scores.append(np.mean(scores))

# Select the best k value based on the highest cross-validation score
best_k = np.argmax(cv_scores) + 1
print('Best k:', best_k)

# Train the KNN model on the training set using the best k value
model = KNeighborsClassifier(n_neighbors=best_k, metric='euclidean')
model.fit(X_train_split, y_train_split)

# Evaluate the model on the validation set
y_pred_valid = model.predict(X_valid_split)

# Print evaluation metrics
print('Accuracy:', accuracy_score(y_valid_split, y_pred_valid))
print('Precision:', precision_score(y_valid_split, y_pred_valid))

print(y_pred_valid)
# Print confusion matrix
cm = confusion_matrix(y_valid_split, y_pred_valid)
print('Confusion matrix:\n', cm)

# Second: the Naive Bayes model

1. We first initialize the Naive Bayes model by creating an instance of the GaussianNB() class. We then use the cross_val_score() function to perform 5-fold cross-validation on the training set. Cross-validation is a technique for estimating how well a model performs on data it was not trained on.

2. The cross_val_score() function returns an array with one score per fold. We print these scores and their mean, computed with np.mean().

3. We train the Naive Bayes model on the training set using the fit() method, which estimates the parameters of the Naive Bayes algorithm from the training data.

4. We evaluate the trained Naive Bayes model on the validation set. We use the predict() method to predict the labels of the validation set, then calculate the accuracy of the predictions using the accuracy_score() function. The accuracy score is the proportion of correct predictions out of the total number of predictions.

5. We calculate the confusion matrix for the validation set predictions. The confusion matrix is a table that summarizes the performance of a classification model by comparing the predicted labels with the true labels.

6. We apply the trained Naive Bayes model to the test set to generate predictions.
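
To make step 3 more concrete, the sketch below fits GaussianNB on a tiny made-up dataset and inspects what fit() has estimated. The six data points are invented for illustration; `theta_` and `class_prior_` are fitted attributes of scikit-learn's GaussianNB.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up one-feature data: class 0 clusters near 1.0, class 1 near 5.0
X_nb = np.array([[0.8], [1.0], [1.2], [4.8], [5.0], [5.2]])
y_nb = np.array([0, 0, 0, 1, 1, 1])

nb = GaussianNB()
nb.fit(X_nb, y_nb)

# fit() stores one mean per class and feature, plus the class priors
print('Per-class feature means:', nb.theta_.ravel())
print('Class priors:', nb.class_prior_)

# A new point is assigned to the class with the highest posterior probability
nb_predictions = nb.predict(np.array([[1.1], [4.9]]))
print('Predictions for 1.1 and 4.9:', nb_predictions)
```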

In [24]:
# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Split the training data into training and validation sets
X_train_split, X_valid_split, y_train_split, y_valid_split = train_test_split(X_train_scaled, y_train, test_size=0.2, random_state=42)

# Train the Naive Bayes model using cross-validation
model = GaussianNB()
scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
print('Cross-validation scores:', scores)
print('Mean cross-validation score:', np.mean(scores))

# Train the Naive Bayes model on the training set
model.fit(X_train_split, y_train_split)

# Evaluate the model on the validation set
y_pred_valid = model.predict(X_valid_split)
valid_acc = accuracy_score(y_valid_split, y_pred_valid)
print('Validation accuracy:', valid_acc)

# Calculate the confusion matrix for the validation set predictions
conf_matrix_valid = confusion_matrix(y_valid_split, y_pred_valid)
print('Confusion Matrix (Validation Set):\n', conf_matrix_valid)

# Apply the trained model to the test set for predictions
y_pred_test = model.predict(X_test_scaled)
#test_df['Survived'] = y_pred_test

print(y_pred_test)

Cross-validation scores: [0.72625698 0.76404494 0.79213483 0.79775281 0.7752809 ]
Mean cross-validation score: 0.7710940932772581
Validation accuracy: 0.7597765363128491
Confusion Matrix (Validation Set):
 [[80 25]
 [18 56]]
[0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 1 1 1 0 1 1 1 0 1 0 0 0 0 0 1 1 1 0 1
 1 0 0 0 0 0 1 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 1 1 0 1 0
 1 0 0 1 0 1 0 1 0 0 0 0 1 1 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0
 0 0 1 0 0 1 0 0 1 1 0 1 1 0 1 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 0 1 1 0 1
 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 1 0 1 0 0 0 1 1 0 0 1 1 1 0 1 0
 1 0 1 1 0 1 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 0 0 1 1 0 1 1 1 0 1 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 1 0 0 0 1 0 1 0 0
 1 0 0 0 0 0 0 0 1 1 1 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 1 0 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1 0
 0 1 0 0 1 1 0 0 1 0 0 0 
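
The validation confusion matrix printed above is enough to derive the other common metrics by hand. The sketch below does the arithmetic, assuming scikit-learn's layout (rows are true classes, columns are predicted classes) and that class 1 means "survived".

```python
# Validation confusion matrix from the output above
# Layout (scikit-learn convention): rows = true class, columns = predicted class
tn, fp = 80, 25
fn, tp = 18, 56

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the predicted survivors, how many survived
recall = tp / (tp + fn)      # of the actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy={accuracy:.4f} precision={precision:.4f} '
      f'recall={recall:.4f} f1={f1:.4f}')
```

The accuracy computed this way reproduces the validation accuracy printed above (about 0.7598), which is a quick check that the matrix has been read in the right orientation.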

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.neural_network import MLPClassifier

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create an MLPClassifier model
mlp = MLPClassifier(hidden_layer_sizes=(100, 50), activation='relu', solver='adam', max_iter=1000)

# Train the model
mlp.fit(X_train_scaled, y_train)

# Evaluate the model on the training set
y_train_pred = mlp.predict(X_train_scaled)
train_acc = accuracy_score(y_train, y_train_pred)
print("Training Accuracy:", train_acc)

# Apply the trained model to the testing set for predictions
y_test_pred = mlp.predict(X_test_scaled)

# Print the predicted values
print("Predicted values for test data:")
print(y_test_pred)

# Calculate the confusion matrix for the training set predictions
conf_matrix_train = confusion_matrix(y_train, y_train_pred)
print('Confusion Matrix (Training Set):\n', conf_matrix_train)

# A confusion matrix for the test set needs the true test labels (y_test).
# test.csv is loaded here without a 'Survived' column, so y_test is undefined;
# uncomment the lines below only once the actual test labels are available.
# conf_matrix_test = confusion_matrix(y_test, y_test_pred)
# print('Confusion Matrix (Test Set):\n', conf_matrix_test)
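
The cell above measures accuracy only on the data the MLP was trained on. As a hedged sketch of how a held-out estimate could be obtained instead, the following trains the same architecture on synthetic data (an assumption, not the Titanic features) and reports both training and validation accuracy. The `demo_` names and `random_state` values are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic features)
X_ann, y_ann = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_ann, y_ann, test_size=0.2, random_state=42)

# Fit the scaler on the training part only, then apply it to both parts
demo_scaler = StandardScaler().fit(X_tr)
X_tr_s = demo_scaler.transform(X_tr)
X_val_s = demo_scaler.transform(X_val)

demo_mlp = MLPClassifier(hidden_layer_sizes=(100, 50), activation='relu',
                         solver='adam', max_iter=1000, random_state=0)
demo_mlp.fit(X_tr_s, y_tr)

demo_train_acc = accuracy_score(y_tr, demo_mlp.predict(X_tr_s))
y_val_pred = demo_mlp.predict(X_val_s)
demo_val_acc = accuracy_score(y_val, y_val_pred)
demo_val_cm = confusion_matrix(y_val, y_val_pred)

print('Training accuracy:', demo_train_acc)
print('Validation accuracy:', demo_val_acc)
print('Confusion matrix (validation set):\n', demo_val_cm)
```

A large gap between the two accuracies would suggest overfitting, which the training accuracy alone cannot reveal.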

# Comparative Analysis


To compare the performance of the three algorithms (KNN, Naive Bayes, and ANN), we look at the evaluation metrics obtained for each one. Here is a summary:

KNN:

Validation accuracy: 0.804
Confusion matrix (validation set):

[[239   8]
 [ 38 133]]



Naive Bayes:

Validation accuracy: 0.760
Confusion matrix (validation set):
 [[80 25]
 [18 56]]

ANN:
Training accuracy: 0.843
Confusion matrix (training set):
 [[529  20]
 [ 73 269]]


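Based on these metrics, the KNN model has the highest validation accuracy, followed by the Naive Bayes model. The ANN figure is a training accuracy, measured on the same data the model was fit on, so it is not directly comparable with the two validation accuracies and may overstate how well the ANN generalizes. Evaluating the ANN on a held-out validation set would allow a fair comparison, and the model may still improve with further optimization.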


In terms of strengths and weaknesses, the KNN algorithm is simple and easy to understand, and can work well with non-linear and complex data. However, it may not perform well with high-dimensional data and can be sensitive to outliers. The Naive Bayes algorithm is also simple and fast, and can handle high-dimensional data well. However, it assumes that features are independent and may not work well with correlated features. The ANN algorithm is highly flexible and can learn complex patterns in the data, but it requires more data and computational resources to train, and can be prone to overfitting.


Overall, based on the evaluation metrics and the strengths and weaknesses of each algorithm, the KNN and Naive Bayes algorithms perform similarly well on the Titanic dataset, with KNN slightly ahead on validation accuracy. The ANN cannot yet be ranked against them because only its training accuracy was measured, though it may have potential for improvement with further optimization.
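
One way to put the three algorithms on an equal footing is to score each of them with the same cross-validation folds. The sketch below illustrates the idea on synthetic data (an assumption, not the Titanic features). The hyperparameters, including `n_neighbors=5`, are placeholder choices rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Titanic features)
X_cmp, y_cmp = make_classification(n_samples=300, n_features=6, random_state=0)

candidates = {
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Naive Bayes': GaussianNB(),
    'ANN': MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=1000,
                         random_state=0),
}

cv_results = {}
for name, estimator in candidates.items():
    # Same preprocessing and the same 5 folds for every model
    pipe = make_pipeline(StandardScaler(), estimator)
    cv_results[name] = cross_val_score(pipe, X_cmp, y_cmp, cv=5).mean()

# Print the models from best to worst mean accuracy
for name, score in sorted(cv_results.items(), key=lambda item: item[1],
                          reverse=True):
    print(f'{name}: {score:.3f}')
```

Because every model is scored on identical held-out folds, the resulting numbers can be compared directly, unlike a mix of training and validation accuracies.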





