🎯 Section 3 Goal
You’ll train a k-nearest neighbors (k-NN) classifier to predict whether a passenger survived the Titanic disaster based on features like age, fare, gender, and passenger class. You'll:
Prepare features and labels
Split data into training and test sets
Train the model
Evaluate its accuracy

✅ Cell 1: Import pandas and load the data
📌 What it does:
Loads the cleaned Titanic dataset.
Displays the first 5 rows for inspection.

In [13]:
import pandas as pd

# Load dataset
df = pd.read_csv("titanic.csv")

# Preview
df.head()


,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
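The preview above shows empty Cabin values, and rows with missing values are dropped later in this section. A quick way to see how much data is missing is to count NaNs per column. The sketch below uses a small made-up frame (`toy`, built from three of the preview rows) rather than the real `df`, so treat it as an illustration only:

```python
import numpy as np
import pandas as pd

# Toy stand-in built from three rows of the preview (not the full dataset)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
    "Cabin": [np.nan, "C85", np.nan],
})

# isna() marks missing cells; sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running the same `isna().sum()` call on `df` would show which of the selected feature columns contain gaps.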


✅ Cell 2: Prepare Features (X) and Labels (y)
🧠 What this does:
Selects the feature columns (Pclass, Age, Fare, Sex, Embarked) as X and the Survived column as the label y.
The text columns (Sex, Embarked) are still strings at this point; they are converted to numbers in the next cell.

In [14]:
# Define features and label
X = df[["Pclass", "Age", "Fare", "Sex", "Embarked"]]
y = df["Survived"]


🧼 Cell 2b: Encode categorical variables
🧠 What it does:
Converts text data (e.g., "male", "female") into integer codes so the model can process them, then drops rows with missing values while keeping X and y aligned.

In [15]:
from sklearn.preprocessing import LabelEncoder

# Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
X = X.copy()

# Encode 'Sex' and 'Embarked' as integer codes (fit_transform refits the encoder for each column)
le = LabelEncoder()
X["Sex"] = le.fit_transform(X["Sex"])
X["Embarked"] = le.fit_transform(X["Embarked"])

# Drop rows with NaNs from X and keep aligned y values
X_clean = X.dropna()
y_clean = y.loc[X_clean.index]
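To make the encoding step concrete, the sketch below shows what `LabelEncoder` produces on a tiny made-up frame, alongside a one-hot alternative built with `pd.get_dummies`. The one-hot variant is not what this notebook uses. It is shown because integer codes imply an ordering between categories (for example between embarkation ports), which can distort distance-based models such as k-NN. The frame `toy` and its values are assumptions for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with the same column names as the notebook (values are made up)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# LabelEncoder sorts the classes, so "female" -> 0 and "male" -> 1
le = LabelEncoder()
sex_codes = le.fit_transform(toy["Sex"])
print(list(le.classes_), list(sex_codes))

# One-hot alternative: one 0/1 column per category, with no implied ordering
one_hot = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)
print(one_hot.columns.tolist())
```

For a two-category column like Sex the two approaches carry the same information; the difference matters for columns with three or more categories.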



🔎 Cell 3: Split into training and testing sets
🧠 What it does:
Splits the dataset into a training set (to train the model) and a test set (to evaluate it).

In [16]:
from sklearn.model_selection import train_test_split

# 80% training, 20% testing, using the cleaned data so no NaNs reach the model
X_train, X_test, y_train, y_test = train_test_split(X_clean, y_clean, test_size=0.2, random_state=42)
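The split above is a plain random split. When the classes are imbalanced, a random split can leave the test set with a different survivor ratio than the training set. As an optional variation (not what this notebook does), `train_test_split` accepts a `stratify` argument that preserves the class proportions. The sketch below uses made-up data (`X_demo`, `y_demo`) to show the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 3 features, 40% positive labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 40 + [0] * 60)

# stratify=y_demo keeps the 40/60 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

In the notebook this would correspond to passing `stratify=y_clean`, assuming the cleaned labels are used for the split.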


🤖 Cell 4: Train and evaluate k-NN
🧠 What it does:
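Loops through k=1 to 10, trains a k-NN model for each value, and prints its accuracy on the test set so you can compare the number of neighbors. In the run below, k=9 scores highest (0.70), although the differences between values of k are small.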

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Split the cleaned data (rows with NaNs removed)
X_train, X_test, y_train, y_test = train_test_split(X_clean, y_clean, test_size=0.2, random_state=42)

# Try different values of k
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    acc = accuracy_score(y_test, preds)
    print(f"k={k}: Accuracy = {acc:.2f}")


k=1: Accuracy = 0.66
k=2: Accuracy = 0.67
k=3: Accuracy = 0.69
k=4: Accuracy = 0.68
k=5: Accuracy = 0.64
k=6: Accuracy = 0.68
k=7: Accuracy = 0.68
k=8: Accuracy = 0.67
k=9: Accuracy = 0.70
k=10: Accuracy = 0.66
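Two refinements are worth considering for this loop; neither is part of the notebook as written. First, k-NN relies on distances, so a wide-ranging feature such as Fare (7.25 to 71.28 in the preview alone) can dominate a narrow one such as Pclass unless the features are scaled. Second, picking k by test accuracy makes the reported test score optimistic; cross-validation on the training data avoids that. The sketch below combines both using synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, so the numbers it prints say nothing about the Titanic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption, not the Titanic set)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)

cv_scores = {}
for k in range(1, 11):
    # The pipeline refits the scaler inside each fold, so no information leaks
    # from the validation fold into the scaling step
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(pipe, X_demo, y_demo, cv=5).mean()

# Pick the k with the highest mean cross-validated accuracy
best_k = max(cv_scores, key=cv_scores.get)
print(f"best k = {best_k}, mean CV accuracy = {cv_scores[best_k]:.2f}")
```

With this approach the test set is used only once, to score the pipeline refit with `best_k`.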
