[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%208%20Notebooks/GDAN%205400%20-%20Week%208%20Notebooks%20%28X%29%20-%20Task%2010%20-%20Predict%20and%20Submit.ipynb)

This notebook provides a mini-tutorial on how to use our trained machine learning model to make predictions on the Titanic test dataset and generate a Kaggle submission file.

In [25]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

Current date and time :  2025-02-25 22:10:41 

CPU times: user 962 µs, sys: 1.27 ms, total: 2.23 ms
Wall time: 2.88 ms


# Load Packages and Set Working Directory
Import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials.

In [26]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks.

In [27]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('max_colwidth', 250)
pd.set_option('display.max_info_columns', 500)

### Read in the Titanic Training Data

In [28]:
import numpy as np
import pandas as pd

train_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Titanic/train.csv'
train = pd.read_csv(train_url)
print('# of rows in training dataset:', len(train), '\n')
train[:2]

# of rows in training dataset: 891 



Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C


# Make Predictions on `test.csv` and Generate Submission File
We have *trained* our model and validated its `accuracy`. Now we will `deploy` the model. There are a variety of terms you might hear for this step, but for the Titanic dataset, `inference`, `prediction`, and `scoring` are the most relevant, since we are applying the logistic regression model to a new dataset (`test.csv`) to predict survival outcomes. The final task of coding assignment #4 is to deploy the model on `test.csv`. Specifically, the task involves the following steps:

- Load the **Kaggle test dataset** (`test.csv`) from the provided GitHub URL.  
- Apply the **same transformations** used on the training data:  
  - Convert `Sex` into a numeric column called `Female` (`female` = 1, `male` = 0).  
  - Fill in missing values in `Age`.
- **Hint:** Before making predictions, **ensure there are no missing values** in the variables used for training (`Age`, `Female`, and `Fare`). Double-check all variables and apply any necessary transformations before proceeding.  
- Select the same predictor variables (`Age`, `Female`, `Fare`) used in training.  
- Use the trained model to make **predictions on the test set**.  
- *Important:* The submission file *must match Kaggle’s format exactly*—every `PassengerId` must have a prediction, and no values can be missing.  


---

### Why Make Predictions on the Test Set?  
Now that we have trained our logistic regression model, we use it to predict survival outcomes for new passengers in *Kaggle’s test dataset* (`test.csv`).  

Since Kaggle does *not* provide survival labels for this dataset, our goal is to generate predictions and submit them in the correct format.  

---

### Step 1: Load the Kaggle Test Dataset  
We first load the test dataset from the provided GitHub URL:  

```python
test_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Titanic/test.csv'  
test = pd.read_csv(test_url)  

# Confirm that the dataset loaded correctly
print(len(test))  
test.head()  
```

---

### Step 2: Check for Missing Values  
Before applying transformations, we need to check for missing values in the dataset:  

```python
test.info()  
```

This helps us determine whether we need to *fill in missing values* before making predictions.  
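
As a complement to `test.info()`, it can help to put the missing-value counts for the training and testing data side by side. The following is a minimal sketch using two small made-up dataframes (`toy_train` and `toy_test` are hypothetical stand-ins, not the real Titanic files):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train.csv and test.csv (made-up values)
toy_train = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, 7.93],
})
toy_test = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Age": [34.5, np.nan, np.nan],
    "Fare": [7.83, np.nan, 9.69],
})

model_inputs = ["Sex", "Age", "Fare"]

# Missing-value counts per variable, one column per dataset
missing = pd.DataFrame({
    "train": toy_train[model_inputs].isnull().sum(),
    "test": toy_test[model_inputs].isnull().sum(),
})
print(missing)

# Variables that are complete in training but incomplete in testing
test_only = missing[(missing["train"] == 0) & (missing["test"] > 0)].index.tolist()
print(test_only)
```

A variable that shows up in `test_only` is one that needed no cleaning during training but still needs attention before we can make predictions.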

---

### Step 3: Apply the Same Transformations as Training  
To ensure consistency, we must apply *the same preprocessing steps* as we did for `train.csv`:  

#### Convert `Sex` to Numeric (`Female`)  
```python
test['Female'] = test['Sex'].map({'female': 1, 'male': 0}).fillna(0)  
```

#### Fill Missing `Age` Values with the Training Median  
```python
test['Age'] = test['Age'].fillna(train['Age'].median())  
```

#### Fill the Missing `Fare` Value  
This variable was not missing any data in the *training* dataset, but it is missing for one observation in the *testing* dataset.

```python
test['Fare'] = test['Fare'].fillna(train['Fare'].median())  
```
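
One way to keep the training and testing transformations in sync is to wrap them in a small function and pass in the fill values computed from the training data. This is only a sketch: `prepare_features`, `toy_train`, and `toy_test` are hypothetical names with made-up values, not part of the assignment code.

```python
import numpy as np
import pandas as pd

def prepare_features(df, age_fill, fare_fill):
    """Return a copy of df with Female created and Age/Fare gaps filled."""
    out = df.copy()
    out["Female"] = out["Sex"].map({"female": 1, "male": 0})
    out["Age"] = out["Age"].fillna(age_fill)
    out["Fare"] = out["Fare"].fillna(fare_fill)
    return out

# Toy stand-ins for the real datasets (made-up values)
toy_train = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [20.0, 30.0, 40.0],
    "Fare": [10.0, 20.0, 60.0],
})
toy_test = pd.DataFrame({
    "Sex": ["female", "male"],
    "Age": [np.nan, 50.0],
    "Fare": [8.0, np.nan],
})

# Fill values come from the TRAINING data only
age_fill = toy_train["Age"].median()
fare_fill = toy_train["Fare"].median()

toy_test_ready = prepare_features(toy_test, age_fill, fare_fill)
print(toy_test_ready[["Age", "Female", "Fare"]])
```

Taking the medians from the training data means the test set is filled with the same values the model saw during training.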

---

### Step 4: Verify No Missing Values  
Before making predictions, we **double-check** that all features (`Age`, `Female`, `Fare`) contain no missing values:  

```python
print("\nMissing values in test dataset:")  
print(test[['Age', 'Female', 'Fare']].isnull().sum())  
```

If the output shows `0` for all variables, we are ready to proceed.  
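
If you would rather have the notebook stop with an error than rely on reading the printed counts, the same check can be turned into a small guard. This is a sketch under assumptions: `check_no_missing` is a hypothetical helper, and the dataframes are made up.

```python
import numpy as np
import pandas as pd

def check_no_missing(df, columns):
    """Raise a ValueError naming any listed column that still has missing values."""
    counts = df[columns].isnull().sum()
    bad = counts[counts > 0]
    if not bad.empty:
        raise ValueError(f"Missing values remain: {bad.to_dict()}")
    return True

features = ["Age", "Female", "Fare"]

# Made-up examples: one clean dataframe, one with a gap in Fare
clean = pd.DataFrame({"Age": [28.0, 34.5], "Female": [1, 0], "Fare": [7.0, 7.83]})
dirty = pd.DataFrame({"Age": [28.0, 34.5], "Female": [1, 0], "Fare": [7.0, np.nan]})

print(check_no_missing(clean, features))

try:
    check_no_missing(dirty, features)
except ValueError as err:
    message = str(err)
    print(message)
```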

---

### Step 5: Select Predictor Variables  
We extract the same features used in training and save them in a new dataframe called `X_test`:  

```python
X_test = test[['Age', 'Female', 'Fare']]  
```

Note that we do *not* need to save a `y` because there is no `Survived` column in the test dataset – instead, we are going to *predict* y using our machine learning model!
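
A simple habit that helps here is to define the list of predictors once and reuse it for both datasets, so the columns line up in name and order. A minimal sketch with made-up data (`FEATURES`, `toy_train`, and `toy_test` are hypothetical names):

```python
import pandas as pd

# Define the predictors once and reuse the list everywhere
FEATURES = ["Age", "Female", "Fare"]

# Made-up dataframes whose columns are stored in different orders
toy_train = pd.DataFrame({
    "Fare": [7.25, 71.28],
    "Age": [22.0, 38.0],
    "Female": [0, 1],
    "Survived": [0, 1],
})
toy_test = pd.DataFrame({
    "Female": [0, 1],
    "Fare": [7.83, 7.00],
    "Age": [34.5, 47.0],
})

X = toy_train[FEATURES]
X_test = toy_test[FEATURES]

# Both now have identical column names in identical order
print(list(X.columns) == list(X_test.columns))
```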

---

### Step 6: Generate Predictions  
Now, we use our trained logistic regression model to predict survival outcomes for the test set:  

```python
test_predictions = model.predict(X_test)  
print('# of predictions:', len(test_predictions))  
test_predictions[:5]  # Display first five predictions  
```

The output will be an array of `0`s and `1`s, representing survival predictions.  
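
Behind each `0` or `1` is a predicted probability. The sketch below fits a throwaway model on made-up data (`toy_X`, `toy_y`, and `toy_model` are hypothetical, not our Titanic model) to illustrate that, for a two-class logistic regression, `predict()` agrees with cutting the `predict_proba()` output at 0.5:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up training data with the same three predictors
toy_X = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0],
    "Female": [0, 1, 1, 1, 0, 0, 1, 1],
    "Fare": [7.25, 71.28, 7.93, 53.10, 51.86, 21.08, 11.13, 30.07],
})
toy_y = pd.Series([0, 1, 1, 1, 0, 0, 1, 1])

toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_X, toy_y)

labels = toy_model.predict(toy_X)             # hard 0/1 predictions
probs = toy_model.predict_proba(toy_X)[:, 1]  # estimated probability of class 1

# Cutting the probabilities at 0.5 reproduces the hard predictions
manual = (probs > 0.5).astype(int)
print((labels == manual).all())
```

Kaggle only wants the hard `0`/`1` labels for this competition, so `model.predict()` is all we need for the submission file.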

---

### Step 7: Create the Submission File  
Kaggle requires a submission file in the following format:  
- **`PassengerId`** – The passenger’s ID from the test set.  
- **`Survived`** – The model’s prediction (`0 = Did Not Survive`, `1 = Survived`).  

We create the submission file using the `PassengerId` column from our test dataset and `Survived` from the predictions we have just made and saved in `test_predictions`:  

```python
submission_df = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": test_predictions})  
submission_df.info()  
```

To check the predicted survival distribution:  

```python
submission_df['Survived'].value_counts()  
```
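
If proportions are easier to read than raw counts, `value_counts(normalize=True)` returns shares instead. A small sketch on a made-up series of predictions (`toy_predictions` is hypothetical):

```python
import pandas as pd

# Made-up predictions: three 0s and two 1s
toy_predictions = pd.Series([0, 0, 0, 1, 1], name="Survived")

# Share of each predicted class instead of raw counts
shares = toy_predictions.value_counts(normalize=True)
print(shares)

# Because the values are 0/1, the mean is the predicted survival rate
predicted_rate = toy_predictions.mean()
print(predicted_rate)
```

Comparing the predicted survival rate with the survival rate in the training data is a quick sanity check: a large gap can be a sign that a transformation was applied differently to the two datasets.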

---

### Step 8: Save the Submission File  
Finally, we save the predictions in **CSV format** so they can be submitted to Kaggle:  

```python
submission_df.to_csv("submission.csv", index=False)  
print("Submission file 'submission.csv' created successfully!")  
```
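
It is worth reading the saved file back in to confirm it looks the way Kaggle expects. The sketch below writes a tiny made-up submission to a temporary folder (the folder and `toy_submission` are assumptions for illustration) and checks its columns and shape:

```python
import os
import tempfile

import pandas as pd

# Tiny made-up submission
toy_submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# Save to a temporary folder so nothing in the working directory is overwritten
path = os.path.join(tempfile.mkdtemp(), "submission.csv")
toy_submission.to_csv(path, index=False)

# Read the file back exactly as Kaggle would see it
check = pd.read_csv(path)
print(check.columns.tolist())
print(check.shape)
```

If `index=False` is left out of `to_csv()`, the file gains an extra unnamed index column, which is a common reason for a rejected submission.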

---

### Final Checklist Before Submission  
- Does the submission file contain exactly the same number of rows as `test.csv`?
- Does every `PassengerId` have a corresponding prediction (`0` or `1`)?
- Are all transformations applied consistently (no missing values in `Age`, `Female`, `Fare`)?  
- Is the file named `submission.csv`?

Once confirmed, *upload the file to Kaggle* and check your model’s performance!
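
Most of the checklist above can also be checked in code. The following is a sketch only: `validate_submission` is a hypothetical helper, the IDs and predictions are made up, and it does not check the file name.

```python
import pandas as pd

def validate_submission(submission, expected_ids):
    """Return a list of problems found; an empty list means all checks passed."""
    if list(submission.columns) != ["PassengerId", "Survived"]:
        return ["columns must be exactly PassengerId, Survived"]
    problems = []
    if len(submission) != len(expected_ids):
        problems.append("row count differs from the test set")
    if set(submission["PassengerId"]) != set(expected_ids):
        problems.append("PassengerId values do not match the test set")
    if submission["Survived"].isnull().any():
        problems.append("some predictions are missing")
    elif not submission["Survived"].isin([0, 1]).all():
        problems.append("predictions must be 0 or 1")
    return problems

# Made-up IDs standing in for test["PassengerId"]
expected_ids = [892, 893, 894]

good = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
bad = pd.DataFrame({"PassengerId": [892, 893], "Survived": [0, 2]})

good_problems = validate_submission(good, expected_ids)
bad_problems = validate_submission(bad, expected_ids)
print(good_problems)
print(bad_problems)
```

With the real data, the call would look like `validate_submission(submission_df, test["PassengerId"])`.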

---

Below I will walk you through the code. First, we need to fill in the missing values in `Age`, create `Female`, create `X` and `y`, generate the train/validation split, and train and validate our model.

In [29]:
train['Age'] = train['Age'].fillna(train["Age"].median())
train["Female"] = train["Sex"].map({"female": 1, "male": 0})

In [30]:
X = train[['Age', 'Female', 'Fare']]
y = train['Survived'] 

In [31]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression()
model.fit(X_train, y_train)

val_predictions = model.predict(X_val)

print("Accuracy:", accuracy_score(y_val, val_predictions), '\n')

Accuracy: 0.776536312849162 



In [33]:
#Read in test dataset
test_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Titanic/test.csv'
test = pd.read_csv(test_url)
print(len(test))
test.head()

418


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [36]:
#Look for missing values
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


In [37]:
# Apply the same transformations as in training
test['Female'] = test['Sex'].map({'female': 1, 'male': 0}).fillna(0)  # Convert Sex to numeric
test['Age'] = test['Age'].fillna(train['Age'].median())  # Fill missing Age values with median

In [38]:
#Fill in the 1 missing value on Fare
test['Fare'] = test['Fare'].fillna(train['Fare'].median())  # Fill missing Fare values with median

In [39]:
# Verify no missing values before making predictions
print("\nMissing values in test dataset:")
print(test[['Age', 'Female', 'Fare']].isnull().sum())  # Check for missing values in model variables


Missing values in test dataset:
Age       0
Female    0
Fare      0
dtype: int64


In [40]:
# Select the same predictor variables as in training
X_test = test[['Age', 'Female', 'Fare']]

In [41]:
# Generate predictions for Kaggle test set
test_predictions = model.predict(X_test)
print('# of predictions:', len(test_predictions))
test_predictions[:5]

# of predictions: 418


array([0, 1, 0, 0, 1])

In [42]:
# Create submission file
submission_df = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": test_predictions})
submission_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Survived     418 non-null    int64
dtypes: int64(2)
memory usage: 6.7 KB


In [43]:
#If you want to see the predicted frequencies
submission_df['Survived'].value_counts()

Survived
0    260
1    158
Name: count, dtype: int64

In [38]:
#Save file
submission_df.to_csv("submission.csv", index=False)
print("Submission file 'submission.csv' created successfully!")

Submission file 'submission.csv' created successfully!


---

## **Deliverables**
1. Submit the link to your Google Colab notebook in the assignment area in Canvas.
2. Include comments in your code to explain each step.

---

## Bonus Points

If you beat my Kaggle score you can earn an additional 10% (1 point out of 10). 

To claim the bonus, upload a screenshot showing your submission with your *Public Score*. Below is a screenshot of my submission with a score of 0.76555.

![](https://github.com/gdsaxton/GDAN5400/blob/main/Titanic/Titanic%20submission.png?raw=true)

##### Here is code you can use to upload your screenshot

In [None]:
from google.colab import files
from IPython.display import display, Image

# Upload the screenshot
uploaded = files.upload()  # Prompts file upload dialog

# Display the uploaded image (assuming it's the first uploaded file)
for filename in uploaded.keys():
    display(Image(filename))
    break  # Display only the first uploaded image