# **CS 351 Lab 3**
## Introduction to Constraint Satisfaction Problems (CSP)


## **Student Details**

### **Name:** Muhammad Abdullah
### **Reg:** 2022323


## **Lab Task**

**Scenario:**
You are hired as a data scientist for a university. The university wants to predict whether passengers survived the Titanic disaster based on factors such as their age, gender, ticket class, and fare paid. You will use the k-NN and Decision Tree algorithms to build models that predict whether a passenger survived.

**Part 1: Data Exploration and Preprocessing**

1. Explore the Dataset:
   - Load the dataset and display the first few rows.
   - Visualize the distribution of key features (like `Pclass`, `Age`, `Sex`, etc.).
   - Check for any missing values or outliers.
2. Data Preprocessing:
   - Handle missing values by either filling them (e.g., with the median) or removing records with missing data.
   - Encode categorical variables like `Sex` and `Embarked` into numerical values.
   - Standardize or normalize the numerical features like `Age` and `Fare`.

**Part 2: Implementing k-NN and Decision Trees**

1. Model Training:
   - Split the dataset into training and testing sets (70% training, 30% testing).
   - Implement the k-Nearest Neighbors (k-NN) algorithm and train the model using the training set.
   - Implement a Decision Tree algorithm and train it using the same training set.
2. Model Evaluation:
   - Use the test set to make predictions for both models.
   - Evaluate the performance of each model using accuracy, precision, recall, and F1-score.
   - Compare the results and discuss which model performed better.

**Part 3: Visualization**

1. Decision Boundaries:
   - Create visualizations to display the decision boundaries of both models (k-NN and Decision Tree) using two features from the dataset.
   - Plot the data points along with the decision boundaries to show how each model classifies the data.
2. Performance Visualization:
   - Plot a bar chart showing the performance metrics (accuracy, precision, recall, F1-score) of both models for easy comparison.

**Dataset Source:**
For this lab, you will use the publicly available Titanic dataset from Kaggle. Download it from the following link:
https://www.kaggle.com/c/titanic/data

**How to Load the Dataset in Python:**
Use the following code to load the dataset:
```python
import pandas as pd

# Load the dataset (train.csv downloaded from the Kaggle page linked above)
titanic_data = pd.read_csv('train.csv')
print(titanic_data.head())
```


## **Solution**

**Part 1: Data Exploration and Preprocessing**

Loading and Exploring the Dataset: Load `train.csv`, display the first rows, and check the column types and non-null counts.


In [None]:

import pandas as pd
# Load the dataset
titanic_data = pd.read_csv('train.csv')
print(titanic_data.head())
# Check for missing values and data types
# (info() prints its summary directly and returns None, so no print() is needed)
titanic_data.info()
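
The task also asks for missing values and outliers to be checked explicitly. The following is a minimal sketch, assuming the 1.5×IQR rule as the outlier criterion (the lab does not prescribe one). The helper name `iqr_outlier_mask` and the toy frame are illustrative stand-ins for `titanic_data`, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy stand-in for titanic_data (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 512.33],
})

missing_per_column = toy.isna().sum()               # NaN count per column
fare_outliers = toy[iqr_outlier_mask(toy["Fare"])]  # rows with extreme fares

print(missing_per_column)
print(fare_outliers)
```

On the real data, the same two calls would be applied to `titanic_data` and to columns such as `Fare` or `Age`.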

Visualize Key Features: You can plot distributions for features like Pclass, Age, Sex, etc.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='Pclass', data=titanic_data)
plt.show()

sns.histplot(titanic_data['Age'].dropna(), bins=30)
plt.show()

sns.countplot(x='Sex', data=titanic_data)
plt.show()

Handling Missing Values: You can fill missing values in the Age column with the median and drop rows with missing values in the Embarked column.


In [None]:
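# Assign the results back instead of using inplace=True on a column selection,
# which is deprecated in recent pandas versions and may not modify the DataFrame
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].median())
titanic_data = titanic_data.dropna(subset=['Embarked'])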

Encoding Categorical Variables: Encode Sex and Embarked columns.

In [None]:
titanic_data['Sex'] = titanic_data['Sex'].map({'male': 0, 'female': 1})
titanic_data = pd.get_dummies(titanic_data, columns=['Embarked'], drop_first=True)
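
To make the effect of the encoding visible, here is a small sketch on a made-up three-row frame (a stand-in for the real columns). It shows which columns `pd.get_dummies` produces when `drop_first=True`.

```python
import pandas as pd

# Toy stand-in for the two categorical columns (values are made up)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)

# 'C' sorts first among the categories, so it becomes the dropped baseline:
# a row with Embarked_Q and Embarked_S both false embarked at 'C'.
print(encoded.columns.tolist())  # → ['Sex', 'Embarked_Q', 'Embarked_S']
```

Dropping one dummy column avoids a redundant feature, since the dropped category is implied when the remaining indicators are all zero.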

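Feature Scaling: Standardize numerical features such as Age and Fare. k-NN is distance-based, so an unscaled feature with a large range (such as Fare) would dominate the distance computation; decision trees are not affected by scaling. Note that the cell below fits the scaler on the full dataset before the split; fitting it on the training rows only would avoid leaking test-set statistics into training.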


In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
titanic_data[['Age', 'Fare']] = scaler.fit_transform(titanic_data[['Age', 'Fare']])
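
One possible way to keep test-set statistics out of the scaling step is to wrap the scaler and the classifier in a scikit-learn pipeline. The sketch below is not the notebook's method; it uses synthetic data as a stand-in for `Age` and `Fare`, and all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric features on different scales
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(30, 14, 200), rng.normal(32, 50, 200)])
y_demo = (X_demo[:, 0] + 0.1 * X_demo[:, 1] > 33).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The pipeline fits the scaler on the training rows only, then reuses
# those training means and standard deviations on the test rows.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_tr, y_tr)

scaler_step = scaled_knn.named_steps["standardscaler"]
print(scaler_step.mean_)             # means learned from X_tr only
print(scaled_knn.score(X_te, y_te))  # accuracy on the held-out rows
```

With this arrangement the split happens first and the unscaled columns are passed to the pipeline, so no separate `fit_transform` on the whole DataFrame is needed.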

**Part 2: Implementing k-NN and Decision Trees**

Split the Data: Select the feature columns and the target, then hold out 30% of the rows for testing.

In [None]:
from sklearn.model_selection import train_test_split

# Features and target
X = titanic_data[['Pclass', 'Sex', 'Age', 'Fare']]
y = titanic_data['Survived']

# 70% training, 30% testing; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
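
Because the two survival classes are not equally frequent, a stratified split may be worth considering. This is an optional variation rather than something the lab requires. The sketch below uses synthetic labels with a 60/40 ratio, chosen only for illustration, to show that `stratify` keeps the class ratio the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 60 negatives, 40 positives (illustrative only)
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Both splits keep the 60/40 class ratio of the full data
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```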


Implement k-NN: Train the k-NN model and make predictions.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
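
The value `n_neighbors=5` is the scikit-learn default. One hedged way to choose it more deliberately is cross-validation on the training set, sketched below on synthetic data from `make_classification` (a stand-in for `X_train` and `y_train`); the candidate list is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for (X_train, y_train)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=42
)

candidate_ks = [1, 3, 5, 7, 9, 11]
mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean 5-fold cross-validated accuracy for this k
    mean_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best k:", best_k)
```

Selecting k this way uses only the training data, so the test set remains untouched until the final evaluation.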


Implement Decision Tree: Train a Decision Tree classifier.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# random_state makes tie-breaking between equally good splits reproducible
dtree = DecisionTreeClassifier(random_state=42)
dtree.fit(X_train, y_train)
y_pred_dtree = dtree.predict(X_test)
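
A decision tree with no depth limit typically grows until it fits the training data very closely. The sketch below compares an unrestricted tree with a depth-limited one and prints the learned rules with `export_text`. It uses synthetic data; `max_depth=3` and the feature names `f0` to `f3` are placeholder assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the Titanic features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

for name, tree in [("unrestricted", full_tree), ("max_depth=3", shallow_tree)]:
    print(name, "depth:", tree.get_depth(),
          "train acc:", round(tree.score(X_tr, y_tr), 3),
          "test acc:", round(tree.score(X_te, y_te), 3))

# Text rendering of the learned rules; the feature names are placeholders
print(export_text(shallow_tree, feature_names=["f0", "f1", "f2", "f3"]))
```

A large gap between training and test accuracy for the unrestricted tree would suggest overfitting, in which case limiting the depth is one common remedy.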


Evaluate Models: Evaluate both models using accuracy, precision, recall, and F1-score.

In [None]:
from sklearn.metrics import classification_report

print("k-NN Classification Report:\n", classification_report(y_test, y_pred_knn))
print("Decision Tree Classification Report:\n", classification_report(y_test, y_pred_dtree))
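
`classification_report` prints the metrics as text. When the individual numbers are needed, for example for the bar chart in Part 3, they can be computed directly. A minimal sketch follows; the helper name `summarize_metrics` is hypothetical and the labels are hand-made so that the result can be checked by hand.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def summarize_metrics(y_true, y_pred):
    """Return the four metrics, computed for the positive class (label 1)."""
    return {
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1-score": f1_score(y_true, y_pred),
    }

# Hand-made labels for illustration: TP=2, FP=1, FN=1, TN=2
y_true_demo = [1, 1, 1, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0]

demo_metrics = summarize_metrics(y_true_demo, y_pred_demo)
print(demo_metrics)  # accuracy, precision, recall and F1 are all 2/3 here
```

In the notebook, the same helper could be called as `summarize_metrics(y_test, y_pred_knn)` and `summarize_metrics(y_test, y_pred_dtree)`.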


**Part 3: Visualization**

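Decision Boundaries: You can visualize the decision boundaries using two features (e.g., Pclass and Fare). The model passed to the function must have been trained on exactly those two features, because the mesh it predicts on has only two columns; the four-feature `knn` and `dtree` models from Part 2 cannot be passed in directly.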

In [None]:
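import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def plot_decision_boundary(X, y, model, title=None):
    """Plot the decision regions of a model trained on the two columns of X."""
    X = np.asarray(X)  # accept DataFrames as well as NumPy arrays
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA'])
    cmap_bold = ListedColormap(['#FF0000', '#00FF00'])

    h = .02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Predict the class of every mesh point, then reshape back to the grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=cmap_light)

    # Overlay the data points, coloured by their true label
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, edgecolors='k')
    if title:
        plt.title(title)
    plt.show()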
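
The notebook defines the plotting function but does not show the models being prepared for it. The sketch below illustrates the preparation step under stated assumptions: synthetic data stands in for two columns such as `Pclass` and scaled `Fare`, the helper `predict_on_grid` is hypothetical, and `max_depth=3` is an arbitrary choice. It fits separate two-feature models and evaluates each on a mesh, which is the array a `contourf` call would draw.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def predict_on_grid(model, X2, step=0.05):
    """Evaluate a fitted two-feature model on a mesh covering X2."""
    x_min, x_max = X2[:, 0].min() - 1, X2[:, 0].max() + 1
    y_min, y_max = X2[:, 1].min() - 1, X2[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    return xx, yy, Z

# Synthetic stand-in for two columns (a class-like integer and a scaled value)
rng = np.random.default_rng(0)
X2_demo = np.column_stack([rng.integers(1, 4, 150), rng.normal(0, 1, 150)])
y_demo = (X2_demo[:, 1] - 0.5 * X2_demo[:, 0] > -1).astype(int)

# Separate models fitted on exactly the two features to be plotted
knn_2d = KNeighborsClassifier(n_neighbors=5).fit(X2_demo, y_demo)
tree_2d = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X2_demo, y_demo)

for name, model in [("k-NN", knn_2d), ("Decision Tree", tree_2d)]:
    xx, yy, Z = predict_on_grid(model, X2_demo)
    print(name, "grid shape:", Z.shape, "share predicted 1:", round(Z.mean(), 3))
```

Decision trees produce axis-aligned rectangular regions, while k-NN regions follow the local arrangement of the training points, so the two plots usually look quite different even at similar accuracy.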


Performance Metrics Visualization: Plot a bar chart comparing accuracy, precision, recall, and F1-score for both models.

In [None]:
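import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-score']
metric_functions = [accuracy_score, precision_score, recall_score, f1_score]

# Compute the scores from the test-set predictions made in Part 2
# instead of hard-coding example values
knn_scores = [f(y_test, y_pred_knn) for f in metric_functions]
dtree_scores = [f(y_test, y_pred_dtree) for f in metric_functions]

X_axis = np.arange(len(metrics))
plt.bar(X_axis - 0.2, knn_scores, 0.4, label='k-NN')
plt.bar(X_axis + 0.2, dtree_scores, 0.4, label='Decision Tree')
plt.xticks(X_axis, metrics)
plt.ylabel('Score')
plt.legend()
plt.show()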


By following these steps, you can predict the survival of Titanic passengers and compare the performance of k-NN and Decision Tree classifiers.