[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%208%20Notebooks/GDAN%205400%20-%20Week%208%20Notebooks%20%28IX%29%20-%20Task%209%20-%20Train%20Logistic%20Model.ipynb)

This notebook provides a mini-tutorial on how to train and evaluate a logistic regression model on the Titanic training dataset and then use it to generate predictions for a Kaggle submission.

In [5]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

Current date and time :  2025-02-25 17:23:52 

CPU times: user 133 µs, sys: 9 µs, total: 142 µs
Wall time: 146 µs


# Load Packages and Set Working Directory
Import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials.

In [6]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include these lines at the top of all my notebooks.

In [7]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 250)
pd.set_option('display.max_info_columns', 500)

### Read in the Titanic Training Data

In [8]:
import numpy as np
import pandas as pd

train_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Titanic/train.csv'
train = pd.read_csv(train_url)
print('# of rows in training dataset:', len(train), '\n')
train[:2]

# of rows in training dataset: 891 



Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C


# Train and Evaluate a Logistic Regression Model 
Now we are ready to train our model. That is the focus of Task 9 in the fourth coding assignment, which has the following requirements:

- Train a **logistic regression model** using the training data.  
- The predictor variables (`X`) used in this model are **Age, Female, and Fare**.  
- Use the trained model to make **predictions** on the held-out portion of the train-test split (called the validation set below).  
- Evaluate the model’s performance using the **accuracy** score.  
- *Hint*: Make sure to include the proper import statements from `sklearn`


### What is Logistic Regression?  
Logistic Regression is a commonly used machine learning algorithm for **binary classification** problems, meaning it predicts one of two possible outcomes.  

For the Titanic dataset, the model will predict:  
- **`1` (Survived)**  
- **`0` (Did not survive)**  

Since our target variable (`Survived`) is binary, logistic regression is a suitable choice.
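
To make the idea concrete, here is a minimal sketch of the logistic (sigmoid) function that logistic regression uses to turn a weighted sum of the features into a probability between 0 and 1. The helper name `sigmoid` and the example scores are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative linear scores (intercept + weighted sum of features); made-up values
scores = np.array([-3.0, 0.0, 3.0])
probabilities = sigmoid(scores)

# Classify as 1 (Survived) when the probability exceeds 0.5
labels = (probabilities > 0.5).astype(int)
print(probabilities)
print(labels)
```

Large negative scores map to probabilities near 0, large positive scores map to probabilities near 1, and a score of exactly 0 maps to 0.5.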

---

### Step 1: Import Required Libraries  
We need to import:  
- **`LogisticRegression`** from `sklearn.linear_model` – to create the model.  
- **`accuracy_score`** from `sklearn.metrics` – to evaluate performance.  

```python
from sklearn.linear_model import LogisticRegression  
from sklearn.metrics import accuracy_score  
```

---

### Step 2: Train the Logistic Regression Model  

We train the model using the **training dataset (`X_train`, `y_train`)**.  

```python
model = LogisticRegression()  
model.fit(X_train, y_train)  
```

### What’s Happening Here?
- `LogisticRegression()` initializes the model.  
- `.fit(X_train, y_train)` trains the model on our features (`Age`, `Female`, `Fare`) and labels (`Survived`).  
- The model **learns patterns** from the training data to predict survival.  

---

### Step 3: Make Predictions  

Now, we use the trained model to predict survival on the **validation set (`X_val`)**.  

```python
val_predictions = model.predict(X_val)  
```

- `model.predict(X_val)` generates predictions for unseen data.  
- The output is an array of `0`s and `1`s, representing survival predictions.  
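
If you want to see the probabilities behind those `0`/`1` predictions, scikit-learn's `predict_proba` method returns them. The sketch below uses a small synthetic dataset (an assumption, so that the block runs on its own) rather than the Titanic data, and the names `X_demo`, `y_demo`, and `demo_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: one feature, label is 1 when the feature is positive
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_model = LogisticRegression()
demo_model.fit(X_demo, y_demo)

# Column 0 holds P(class 0); column 1 holds P(class 1)
demo_probs = demo_model.predict_proba(X_demo)
demo_preds = demo_model.predict(X_demo)

# predict() picks the class with the higher probability
manual_preds = demo_probs.argmax(axis=1)
print(demo_probs[:3])
print((manual_preds == demo_preds).all())
```

Looking at probabilities can be useful when you want to know how confident the model is, not just which class it chose.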

---

### Step 4: Evaluate Model Performance  

We assess the model’s accuracy using `accuracy_score`:  

```python
print("Accuracy:", accuracy_score(y_val, val_predictions), '\n')  
```

- **Accuracy** = (Correct Predictions / Total Predictions)  
- A higher accuracy means the model correctly predicts survival more often.  
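
As a quick illustration of that formula, the sketch below computes accuracy by hand and compares it with `accuracy_score`. The label arrays are made up for eight hypothetical passengers.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for eight passengers
y_true_demo = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Accuracy by hand: share of positions where the prediction equals the truth
manual_accuracy = (y_true_demo == y_pred_demo).mean()

print(manual_accuracy)
print(accuracy_score(y_true_demo, y_pred_demo))
```

Six of the eight predictions match, so both lines print 0.75.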

For a more detailed analysis, we can also print a **classification report**:  

```python
from sklearn.metrics import classification_report

print(classification_report(y_val, val_predictions))  
```

- This report provides:
  - **Precision** (How many predicted survivors actually survived?)  
  - **Recall** (How many actual survivors were correctly identified?)  
  - **F1-score** (Balance of precision and recall).  
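
These three metrics can be derived from the confusion matrix. The following is a small sketch with made-up labels (not the Titanic results) that computes each one by hand; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true labels and predictions for eight passengers
y_true_toy = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred_toy = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# For binary 0/1 labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # of predicted survivors, how many survived
recall = tp / (tp + fn)     # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```

The hand-computed values should agree with `precision_score`, `recall_score`, and `f1_score` for the positive class.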

---

### Interpreting the Results  

#### Good Model Performance If:
- Accuracy is **above 70%** (varies depending on features used).  
- Precision and recall scores are **balanced** across both classes.  

#### Potential Issues If:
- The model predicts **only 0s or 1s** → It may not be learning properly.  
- Accuracy is **too low (<60%)** → Consider adding more features or tuning the model.  
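
Accuracy thresholds like these are easier to judge against a baseline that always predicts the most common class. Below is a minimal sketch using scikit-learn's `DummyClassifier`; the label counts are made up for illustration and are not the actual Titanic counts.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Illustrative labels: 62 non-survivors and 38 survivors (made-up counts)
y_base = np.array([0] * 62 + [1] * 38)
X_base = np.zeros((len(y_base), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_base, y_base)
baseline_preds = baseline.predict(X_base)

baseline_accuracy = accuracy_score(y_base, baseline_preds)
print(baseline_accuracy)
```

If a trained model does not clearly beat this kind of baseline, its features are probably not adding much.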

---

### Conclusion  
Logistic regression is a **simple but effective** model for predicting survival. Now that we’ve trained and evaluated our first model, we can refine it in the next steps to improve performance!

---

Below I will walk you through the code. First, we need to fill in the missing values in `Age`, create `Female`, define `X` and `y`, and generate the train-test split. 

In [15]:
train['Age'] = train['Age'].fillna(train["Age"].median())
train["Female"] = train["Sex"].map({"female": 1, "male": 0})

In [16]:
X = train[['Age', 'Female', 'Fare']]
y = train['Survived'] 

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
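
As a quick sanity check on the split sizes, here is a sketch that runs the same `train_test_split` call on placeholder arrays with 891 rows (the number of rows in the training file). The placeholder names are hypothetical and the arrays contain only zeros.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the training file (891)
X_placeholder = np.zeros((891, 3))
y_placeholder = np.zeros(891, dtype=int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_placeholder, y_placeholder, test_size=0.2, random_state=42
)

# Number of rows in the training and validation portions
print(len(X_tr), len(X_va))
```

With `test_size=0.2`, the validation portion is rounded up, so 891 rows split into 712 training rows and 179 validation rows.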

Next, we import the relevant packages.

In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Next, we *initialize* the model (we are calling it simply `model`) and *train* the model on our three features (`Age`, `Female`, `Fare`) and our outcome variable (`Survived`).  

In [19]:
# Train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

After the model is trained, we generate predictions on the validation set, the 20% of held-out data.

In [21]:
val_predictions = model.predict(X_val)  # Predicted 0/1 labels for the validation set

Lastly, we evaluate model performance using `accuracy_score`.

In [22]:
print("Accuracy:", accuracy_score(y_val, val_predictions), '\n')

Accuracy: 0.776536312849162 



<br>We can also print the model *coefficients* to understand the influence of each variable.  

In [40]:
print("Model coefficients (Based on Age, Female, and Fare):\n")
print('\tIntercept:', model.intercept_[0])
print('\tAge coefficient:', model.coef_[0][0])
print('\tFemale coefficient:', model.coef_[0][1])
print('\tFare coefficient:', model.coef_[0][2], '\n')

Model coefficients (Based on Age, Female, and Fare):

	Intercept: -1.5639624479395506
	Age coefficient: -0.005799829744918314
	Female coefficient: 2.335279581705797
	Fare coefficient: 0.01016203209179087 
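
One way to read these coefficients is to exponentiate them into odds ratios. The sketch below does this with the values copied from the output above and then computes a predicted probability for a hypothetical passenger (a 30-year-old woman paying a fare of 20, chosen purely for illustration).

```python
import numpy as np

# Coefficients copied from the printed output above
intercept = -1.5639624479395506
coefs = {
    "Age": -0.005799829744918314,
    "Female": 2.335279581705797,
    "Fare": 0.01016203209179087,
}

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that feature, holding the others fixed
odds_ratios = {name: np.exp(value) for name, value in coefs.items()}
print(odds_ratios)

# Hypothetical passenger used only for illustration
passenger = {"Age": 30, "Female": 1, "Fare": 20}
log_odds = intercept + sum(coefs[k] * passenger[k] for k in coefs)
probability = 1 / (1 + np.exp(-log_odds))
print(round(probability, 3))
```

An odds ratio above 1 means the feature raises the odds of survival and a value below 1 means it lowers them. Here the `Female` odds ratio is roughly 10, which makes it by far the strongest of the three predictors.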



### **Task 10: Make Predictions on `test.csv` and Generate Submission File**  
- Load the **Kaggle test dataset** (`test.csv`) from the provided GitHub URL.  
- Apply the **same transformations** used on the training data:  
  - Convert `Sex` into a numeric `Female` column (female = 1, male = 0).  
  - Fill in missing values in `Age`.
- **Hint:** Before making predictions, **ensure there are no missing values** in the variables used for training (`Age`, `Female`, and `Fare`). Double-check all variables and apply any necessary transformations before proceeding.  
- Select the same predictor variables (`Age`, `Female`, `Fare`) used in training.  
- Use the trained model to make **predictions on the test set**.  
- **Important:** The submission file **must match Kaggle’s format exactly**—every `PassengerId` must have a prediction, and no values can be missing.  
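
One way to keep the training and test transformations in sync is to put them in a single helper. This is only a sketch under assumptions: `prepare_features` is a hypothetical function, and the tiny data frames are made up so the block runs on its own.

```python
import pandas as pd

def prepare_features(df, age_median, fare_median):
    """Return the Age, Female, and Fare columns with missing values filled."""
    out = pd.DataFrame(index=df.index)
    out["Age"] = df["Age"].fillna(age_median)
    out["Female"] = df["Sex"].map({"female": 1, "male": 0})
    out["Fare"] = df["Fare"].fillna(fare_median)
    return out

# Tiny made-up frames standing in for train.csv and test.csv
toy_train = pd.DataFrame(
    {"Age": [20.0, 30.0, 40.0], "Sex": ["male", "female", "male"], "Fare": [10.0, 20.0, 30.0]}
)
toy_test = pd.DataFrame(
    {"Age": [None, 25.0], "Sex": ["female", "male"], "Fare": [15.0, None]}
)

# Medians come from the training data only and are reused on the test data
age_med = toy_train["Age"].median()
fare_med = toy_train["Fare"].median()

toy_X_test = prepare_features(toy_test, age_med, fare_med)
print(toy_X_test)
print(toy_X_test.isnull().sum())
```

Computing the medians from the training data and reusing them on the test data means both sets are filled with the same values the model saw during training.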

In [41]:
test_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Titanic/test.csv'
test = pd.read_csv(test_url)
print(len(test))
test.head()

418


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [43]:
#Look for missing values
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


In [45]:
# Apply the same transformations as in training
test['Female'] = test['Sex'].map({'female': 1, 'male': 0}).fillna(0)  # Convert Sex to numeric
test['Age'] = test['Age'].fillna(train['Age'].median())  # Fill missing Age values with the training-set median

In [46]:
# Fill in the 1 missing value in Fare
test['Fare'] = test['Fare'].fillna(train['Fare'].median())  # Fill missing Fare values with the training-set median

In [47]:
# Verify no missing values before making predictions
print("\nMissing values in test dataset:")
print(test[['Age', 'Female', 'Fare']].isnull().sum())  # Check for missing values in model variables


Missing values in test dataset:
Age       0
Female    0
Fare      0
dtype: int64


In [48]:
# Select the same predictor variables as in training
X_test = test[['Age', 'Female', 'Fare']]

In [50]:
# Generate predictions for Kaggle test set
test_predictions = model.predict(X_test)
print('# of predictions:', len(test_predictions))
test_predictions[:5]

# of predictions: 418


array([0, 1, 0, 0, 1])

In [41]:
# Create submission file
submission_df = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": test_predictions})
submission_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Survived     418 non-null    int64
dtypes: int64(2)
memory usage: 6.7 KB


In [42]:
#If you want to see the predicted frequencies
submission_df['Survived'].value_counts()

Survived
0    260
1    158
Name: count, dtype: int64

In [38]:
#Save file
submission_df.to_csv("submission.csv", index=False)
print("Submission file 'submission.csv' created successfully!")

Submission file 'submission.csv' created successfully!
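
Before uploading, it can help to read the file back and confirm it matches the format requirements described above. The sketch below does this on a made-up three-row stand-in written to a temporary folder; for the real file you would read `submission.csv` and expect 418 rows.

```python
import os
import tempfile
import pandas as pd

# Made-up stand-in for the real submission (the real file has 418 rows)
toy_submission = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})

# Write the file and read it back, as Kaggle would see it
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    toy_submission.to_csv(path, index=False)
    check = pd.read_csv(path)

# Checks mirroring the format requirements
assert list(check.columns) == ["PassengerId", "Survived"]
assert check.notnull().all().all()
assert check["Survived"].isin([0, 1]).all()
assert check["PassengerId"].is_unique
print("Submission format looks OK:", check.shape)
```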


---

## **Deliverables**
1. Submit the link to your Google Colab notebook in the assignment area in Canvas.
2. Include comments in your code to explain each step.

---

## Bonus Points

If you beat my Kaggle score you can earn an additional 10% (1 point out of 10). 

To claim the bonus, upload a screenshot showing your submission with your *Public Score*. Below is a screenshot of my submission with a score of 0.28117.

![](https://github.com/gdsaxton/GDAN5400/blob/main/Titanic/Titanic%20submission.png?raw=true)

##### Here is code you can use to upload your screenshot

In [None]:
from google.colab import files
from IPython.display import display, Image

# Upload the screenshot
uploaded = files.upload()  # Prompts file upload dialog

# Display the uploaded image (assuming it's the first uploaded file)
for filename in uploaded.keys():
    display(Image(filename))
    break  # Display only the first uploaded image