Four hands-on projects that help you master core machine learning skills, including:

✅ Data cleaning & preprocessing
✅ Feature engineering
✅ Train/test split
✅ Logistic regression or random forest
✅ Model evaluation (accuracy, confusion matrix)

Project 1: Titanic Survival Prediction
🎯 Objective:
Predict whether a passenger survived the Titanic disaster using features like age, gender, and class.

🔧 Skills Practiced:
    Data cleaning & missing value handling
    Categorical encoding
    Feature selection/engineering
    Train/test split
    Random Forest / Logistic Regression
    Model evaluation (accuracy, confusion matrix)
🛠 Tools Used:
    Python
    pandas, numpy
    scikit-learn
    Jupyter Notebook
    Dataset from Kaggle/Stanford University 

In [18]:
#Import necessary libraries (pandas for data manipulation)
import pandas as pd

#Read the CSV file into a DataFrame
df = pd.read_csv("C:/Users/festu/ML05102025/myML/titanic.csv")

#Display the first 5 rows of the DataFrame in a more readable format
print(df.head().to_string())

#Display last 5 rows of the DataFrame in a more readable format
print(df.tail().to_string())

   Survived  Pclass                                                Name     Sex   Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare
0         0       3                              Mr. Owen Harris Braund    male  22.0                        1                        0   7.2500
1         1       1  Mrs. John Bradley (Florence Briggs Thayer) Cumings  female  38.0                        1                        0  71.2833
2         1       3                               Miss. Laina Heikkinen  female  26.0                        0                        0   7.9250
3         1       1         Mrs. Jacques Heath (Lily May Peel) Futrelle  female  35.0                        1                        0  53.1000
4         0       3                             Mr. William Henry Allen    male  35.0                        0                        0   8.0500
     Survived  Pclass                            Name     Sex   Age  Siblings/Spouses Aboard  Parents/Children Aboard   Fare
882  

**Step 2: Cleaning and Preprocessing**

***Understanding what we are dealing with***

In [19]:
#Display the shape of the DataFrame (number of rows and columns)
print(df.shape)

#Display the data types of each column in the DataFrame
print(df.dtypes)

(887, 8)
Survived                     int64
Pclass                       int64
Name                        object
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object


In [20]:
#Drop the attributes that are not useful for this analysis; here that is only 'Name'
df.drop(columns=['Name'], inplace=True) # inplace=True updates df itself instead of returning a new DataFrame

#Display the first 5 rows of the modified DataFrame and convert it to a string for better readability
print(df.head().to_string())

   Survived  Pclass     Sex   Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare
0         0       3    male  22.0                        1                        0   7.2500
1         1       1  female  38.0                        1                        0  71.2833
2         1       3  female  26.0                        0                        0   7.9250
3         1       1  female  35.0                        1                        0  53.1000
4         0       3    male  35.0                        0                        0   8.0500


In [21]:
#Let's check if there are any missing values in the dataset
print(df.isnull().sum()) #0 missing values in the dataset

Survived                   0
Pclass                     0
Sex                        0
Age                        0
Siblings/Spouses Aboard    0
Parents/Children Aboard    0
Fare                       0
dtype: int64
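
This copy of the dataset has no gaps, but other versions of the Titanic data (for example the Kaggle competition file) are known to have missing `Age` values. The sketch below shows one common way to handle that case, median imputation, on a toy DataFrame. The values and the names `toy` and `toy_filled` are made up for illustration and are not part of this notebook's data.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; these values are invented for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Fare": [7.25, 71.2833, np.nan, 53.10, 8.05],
})

# Count missing values per column, same check as in the cell above
print(toy.isnull().sum())

# Fill each numeric column with its own median (robust to outliers such as very high fares)
toy_filled = toy.fillna(toy.median(numeric_only=True))
print(toy_filled.isnull().sum())
```

The median is only one option; the mean, a constant, or dropping rows can be more appropriate depending on how much data is missing and why.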


In [22]:
#Let's import the encoder for converting categorical data to numerical (Sex)
from sklearn.preprocessing import LabelEncoder #encodes categorical values as integers; requires scikit-learn (pip install -U scikit-learn)

le = LabelEncoder() #create an instance of the LabelEncoder class
#Encode the sex column (categorical variable) into numerical values
df["Sex"] = le.fit_transform(df['Sex']) #fit_transform() method fits the encoder and transforms the data in one step

#Display the first 5 rows of the DataFrame after encoding
print(df.head().to_string())

   Survived  Pclass  Sex   Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare
0         0       3    1  22.0                        1                        0   7.2500
1         1       1    0  38.0                        1                        0  71.2833
2         1       3    0  26.0                        0                        0   7.9250
3         1       1    0  35.0                        1                        0  53.1000
4         0       3    1  35.0                        0                        0   8.0500
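
The output above shows `male` encoded as 1 and `female` as 0. `LabelEncoder` assigns codes in sorted label order, and the mapping can be read from its `classes_` attribute. The following is a minimal sketch on a toy Series; the names `sex`, `le_demo`, and `mapping` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for the 'Sex' column
sex = pd.Series(["male", "female", "female", "male"])

le_demo = LabelEncoder()
codes = le_demo.fit_transform(sex)

# classes_ holds the labels in sorted order; each label's position is its code
mapping = {str(label): code for code, label in enumerate(le_demo.classes_)}
print(mapping)
print(codes)
```

Keeping the fitted encoder around also allows `le_demo.inverse_transform(codes)` to recover the original labels later.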



**Step 3: Train/Test Split**

In [None]:
#Import the function needed to split the data before training the model
from sklearn.model_selection import train_test_split #for splitting the dataset into training and testing sets

#Drop the Survived column from the DataFrame to create the feature set (X)
X = df.drop(columns=['Survived']) #X will contain all columns except 'Survived'
#Create the target variable (y) by selecting the Survived column
y = df['Survived'] #y will contain only the 'Survived' column

#Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) #random_state ensures reproducibility of the split
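
With 887 rows and `test_size=0.2`, the split yields 709 training rows and 178 test rows, which matches the support total in the evaluation report. The sketch below checks those sizes on synthetic data and also shows the optional `stratify` argument, which keeps the class ratio similar in both sets. The notebook's own split does not use `stratify`; the data and the class ratio here are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 887 rows, two features, an invented class ratio
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(887, 2))
y_demo = (rng.random(887) < 0.39).astype(int)

# stratify=y_demo keeps the share of each class close to equal across the two sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("positive share (all, train, test):", y_demo.mean(), y_tr.mean(), y_te.mean())
```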

**Step 4: Train Model**

In [25]:
# We will use the Random Forest Classifier for our model
from sklearn.ensemble import RandomForestClassifier #for the Random Forest Classifier

#Create an instance of the RandomForestClassifier with 100 trees
model = RandomForestClassifier(n_estimators=100, random_state=42) #n_estimators specifies the number of trees in the forest
#Train the model using the training data
model.fit(X_train, y_train) #fit() method trains the model on the training data
#Make predictions on the test data
y_pred = model.predict(X_test) #predict() method generates predictions for the test data
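
The skills list for this project also names logistic regression as an alternative to the random forest. The following is a minimal sketch of that option, assuming a synthetic six-feature dataset in place of the Titanic features. The pipeline with `StandardScaler` is one reasonable setup, not something this notebook uses.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with six features (not the real Titanic data)
X_syn, y_syn = make_classification(
    n_samples=887, n_features=6, n_informative=4, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Scaling first helps the solver converge when features have very different ranges
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_tr, y_tr)

print("Test accuracy:", logreg.score(X_te, y_te))
```

Swapping in the real `X_train` and `y_train` would allow a direct comparison with the random forest on the same split.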


**Step 5: Printing Predictions vs Actual Values**

In [29]:
# Predict on the test data
y_pred = model.predict(X_test) #predict() method generates predictions for the test data

#comparing the first 10 predictions with the actual values
comp_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred}) #create a DataFrame to compare actual and predicted values

#Print the first 10 rows of the comparison DataFrame
print(comp_df.head(10).to_string())

     Actual  Predicted
296       1          0
682       0          0
535       0          0
644       1          0
623       0          0
39        1          1
529       0          0
585       0          0
723       1          1
141       1          0
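
To make the mismatches in the table above easier to spot, a `Correct` flag can be added to the comparison DataFrame. The sketch below re-enters the ten printed rows by hand; `comp_demo` and `misses` are illustrative names.

```python
import pandas as pd

# The ten rows printed above, re-entered by hand
comp_demo = pd.DataFrame(
    {
        "Actual":    [1, 0, 0, 1, 0, 1, 0, 0, 1, 1],
        "Predicted": [0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
    },
    index=[296, 682, 535, 644, 623, 39, 529, 585, 723, 141],
)

# Flag rows where the prediction matches the actual value
comp_demo["Correct"] = comp_demo["Actual"] == comp_demo["Predicted"]

# Keep only the rows the model got wrong
misses = comp_demo[~comp_demo["Correct"]]
print(misses.to_string())
print("Correct in this sample:", int(comp_demo["Correct"].sum()), "of", len(comp_demo))
```

In this ten-row sample, every miss is a survivor predicted as a non-survivor. Ten rows are too few to generalize from, so the full test set still needs to be evaluated.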


**Step 6: Evaluating the Model and Checking the Accuracy**

In [31]:
#Import the functions for evaluating the model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score #classification report, confusion matrix, and accuracy

preds = model.predict(X_test) #make predictions on the test data
#Generate the classification report
print(classification_report(y_test, preds)) #classification_report() generates a report showing the main classification metrics

print("Accuracy:", accuracy_score(y_test, preds)) #accuracy_score() calculates the accuracy of the model

print("Confusion Matrix:\n", confusion_matrix(y_test, preds)) #confusion_matrix() generates a confusion matrix to evaluate the performance of the model

              precision    recall  f1-score   support

           0       0.81      0.82      0.81       111
           1       0.69      0.67      0.68        67

    accuracy                           0.76       178
   macro avg       0.75      0.75      0.75       178
weighted avg       0.76      0.76      0.76       178

Accuracy: 0.7640449438202247
Confusion Matrix:
 [[91 20]
 [22 45]]
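
The figures in the report can be reproduced by hand from the confusion matrix, which helps show what each metric means. In scikit-learn's convention, rows are actual classes and columns are predicted classes. The sketch below recomputes accuracy plus precision, recall, and F1 for class 1 (survived); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted
cm = np.array([[91, 20],
               [22, 45]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all test passengers classified correctly
precision_1 = tp / (tp + fp)      # of predicted survivors, how many actually survived
recall_1 = tp / (tp + fn)         # of actual survivors, how many the model found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
# prints: accuracy=0.7640 precision=0.69 recall=0.67 f1=0.68
```

The 22 false negatives (survivors predicted as non-survivors) explain why recall for class 1 is lower than for class 0.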
