
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# URL of the Titanic passenger dataset (raw CSV hosted on GitHub)
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'

try:
    # Load the data directly from the URL
    df = pd.read_csv(url)
    print("Dataset loaded successfully!")
    print(df.head())
except Exception as e:
    print(f"An error occurred: {e}")

Dataset loaded successfully!
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450

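Next, we clean up the data: fill in missing ages, encode the text columns 'Sex' and 'Embarked' as numbers, and drop the columns that a basic model cannot use directly.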

In [3]:
# Fill missing ages with the mean age
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Encode 'Sex' as 0 (male) or 1 (female)
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# Map 'Embarked' ports to integer codes; missing values become 0 (the code for 'S')
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).fillna(0)

# Drop columns that are not useful for a basic model
df = df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

print(df.head())

   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
0         0       3    0  22.0      1      0   7.2500       0.0
1         1       1    1  38.0      1      0  71.2833       1.0
2         1       3    1  26.0      0      0   7.9250       0.0
3         1       1    1  35.0      1      0  53.1000       0.0
4         0       3    0  35.0      0      0   8.0500       0.0
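The integer codes for 'Embarked' imply an ordering (S < C < Q) that the ports do not really have. One common alternative is one-hot encoding. The sketch below uses a tiny made-up frame called `sample` (an assumption for illustration, not the real data) and fills the missing port with the most frequent value before encoding.

```python
import pandas as pd

# Tiny stand-in for the 'Embarked' column (hypothetical rows, not the real data)
sample = pd.DataFrame({"Embarked": ["S", "C", "Q", None, "S"]})

# Fill the missing port with the most frequent value in this sample
most_common = sample["Embarked"].mode()[0]
sample["Embarked"] = sample["Embarked"].fillna(most_common)

# One 0/1 column per port instead of a single ordered integer code
encoded = pd.get_dummies(sample, columns=["Embarked"], prefix="Embarked", dtype=int)

print(encoded)
```

Whether this helps on the Titanic data is something to check by experiment; for a linear model it mainly removes the artificial ordering between ports.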


Using this data, we will build a logistic regression model to predict whether a passenger survived. We will split the data into a training set (data the model learns from) and a testing set (data held back to evaluate the model's predictions).

In [4]:
from sklearn.model_selection import train_test_split

# Define our features (X) and target (y)
X = df.drop('Survived', axis=1)  # All columns except 'Survived'
y = df['Survived']               # The 'Survived' column

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

X_train shape: (712, 7)
y_train shape: (712,)
X_test shape: (179, 7)
y_test shape: (179,)
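A plain random split can leave the training and testing sets with slightly different survival rates. Passing `stratify` to `train_test_split` keeps the class proportions the same in both. The sketch below shows this on synthetic stand-in data; `X_demo` and `y_demo` are assumptions for illustration, not the Titanic features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 30% positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo preserves the 30/70 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```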


Now that the data is properly split, we can build our logistic regression model.

In [5]:
from sklearn.linear_model import LogisticRegression

# Create an instance of the model
model = LogisticRegression(max_iter=200)

# Train the model using the training data
model.fit(X_train, y_train)

print("Model training complete.")

Model training complete.
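A fitted logistic regression can report a probability for each class, not just a 0/1 label. The sketch below fits a model on a small synthetic one-feature dataset (`X_demo`, `y_demo`, and `demo_model` are illustrative names, not part of the notebook) and reads the probabilities with `predict_proba` and the learned weights with `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: one feature where larger values make class 1 more likely
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

demo_model = LogisticRegression(max_iter=200)
demo_model.fit(X_demo, y_demo)

# Column 0 is P(class 0), column 1 is P(class 1); each row sums to 1
probs = demo_model.predict_proba(X_demo)
print(probs.round(3))

# The learned weight for the feature and the intercept
print("coef:", demo_model.coef_, "intercept:", demo_model.intercept_)
```

On the Titanic model, the same calls would be `model.predict_proba(X_test)` and `model.coef_`, with one coefficient per feature column.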


Now that the model is trained, let's test its mettle and see how its predictions stack up against the actual y_test values.

In [6]:
from sklearn.metrics import accuracy_score

# Make predictions on the test data
predictions = model.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, predictions)

print("Model Accuracy: {:.2f}%".format(accuracy * 100))

Model Accuracy: 79.89%
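An accuracy figure is easier to judge next to a baseline and a breakdown of the errors. The sketch below uses made-up labels and predictions (`y_true` and `y_pred` are hypothetical, not the notebook's results) to compare against a classifier that always predicts the most frequent class, and to build a confusion matrix.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, not the notebook's actual results
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# Baseline that ignores the features and always predicts the most frequent class
dummy_X = np.zeros((len(y_true), 1))
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(dummy_X, y_true)
baseline_acc = accuracy_score(y_true, baseline.predict(dummy_X))

model_acc = accuracy_score(y_true, y_pred)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

print("Baseline accuracy:", baseline_acc)
print("Model accuracy:", model_acc)
print(cm)
```

With the real model, `y_test` and `predictions` would take the place of the made-up arrays.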


For the data we have been given, this accuracy is reasonable: the model is right about four times out of five, well above a 50-50 guess. A real-world application would almost always call for improvement, though, and there are a few options we can try. Let's start with feature engineering. We will create a new feature ('FamilySize') from the sum of two existing features ('SibSp' and 'Parch'), plus one for the passenger themselves.

In [7]:
# Create a new feature for FamilySize
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1  # Add 1 for the passenger themselves

# Now we can drop the original SibSp and Parch columns, as they are now combined
df = df.drop(['SibSp', 'Parch'], axis=1)

print("New columns after feature engineering:")
print(df.head())

New columns after feature engineering:
   Survived  Pclass  Sex   Age     Fare  Embarked  FamilySize
0         0       3    0  22.0   7.2500       0.0           2
1         1       1    1  38.0  71.2833       1.0           2
2         1       3    1  26.0   7.9250       0.0           1
3         1       1    1  35.0  53.1000       0.0           2
4         0       3    0  35.0   8.0500       0.0           1
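For a model to see 'FamilySize', the features and target need to be rebuilt from the updated frame and split again. A minimal sketch of that step follows, using a small made-up frame `demo_df` with the same column names (the rows and the `_new` variable names are assumptions for illustration).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small hypothetical frame with the same column names as the cleaned data
demo_df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 1, 3, 3, 2, 3, 1, 2],
    "SibSp":    [1, 1, 0, 1, 0, 0, 0, 3, 0, 1],
    "Parch":    [0, 0, 0, 0, 0, 0, 2, 1, 0, 0],
})

# Same feature engineering as in the cell above
demo_df["FamilySize"] = demo_df["SibSp"] + demo_df["Parch"] + 1
demo_df = demo_df.drop(["SibSp", "Parch"], axis=1)

# Rebuild X and y from the updated frame, then split again
X_new = demo_df.drop("Survived", axis=1)
y_new = demo_df["Survived"]
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(
    X_new, y_new, test_size=0.2, random_state=42
)

print(list(X_train_new.columns))
```

Reusing the same `random_state` keeps the row assignment identical to the earlier split, so any change in accuracy can be attributed to the new feature.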


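Next, let's tune the model's regularization through the parameter C. In scikit-learn, C is the inverse of the regularization strength, so a smaller C means stronger regularization (the default is C=1). Let's try a few values and compare the model's performance. Note that X_train and X_test were created before 'FamilySize' was added to df, so the models below are still trained on the original seven features.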

In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# List of C values to test
c_values = [0.01, 0.1, 1, 10, 100]

print("Testing different C values:")

# Loop through each C value and train a new model
for c in c_values:
    # Create and train the model with the current C value
    model = LogisticRegression(max_iter=200, C=c)
    model.fit(X_train, y_train)

    # Make predictions and calculate accuracy
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

    print(f"  - C = {c:<5}: Accuracy = {accuracy * 100:.2f}%")

Testing different C values:
  - C = 0.01 : Accuracy = 73.74%
  - C = 0.1  : Accuracy = 80.45%
  - C = 1    : Accuracy = 79.89%
  - C = 10   : Accuracy = 79.89%
  - C = 100  : Accuracy = 79.89%


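It looks like C=0.1 is our best bet, at 80.45%! Keep in mind that its margin over the default C=1 (79.89%) amounts to a single passenger out of the 179 in the test set, so this is a tentative pick, not a clear win.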
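Choosing C by test-set accuracy lets the test set influence the model. A common alternative is to pick C by cross-validation on the training data and use the test set only once at the end. The sketch below does this with `GridSearchCV` on synthetic data from `make_classification` (a stand-in, not the Titanic set), so the numbers it prints say nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (not the Titanic set)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 5-fold cross-validation over the same C grid, using only the training split
search = GridSearchCV(
    LogisticRegression(max_iter=200),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("Best C:", search.best_params_["C"])
print("Mean CV accuracy:", round(search.best_score_, 4))

# The test split is used once, after C has been chosen
print("Test accuracy:", round(search.score(X_te, y_te), 4))
```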