[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%208%20Notebooks/GDAN%205400%20-%20Week%208%20Notebooks%20%28VIII%29%20-%20Task%208%20-%20Modeling%20Preparation%20Steps.ipynb)

This notebook provides a mini-tutorial on how to prepare the Titanic training dataset for our machine learning model.

In [None]:
%%time
import datetime
print("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

# Load Packages and Set Working Directory
Import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks.

In [None]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_info_columns', 500)

### Read in the Titanic Training Data

In [None]:
import numpy as np
import pandas as pd

train_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Titanic/train.csv'
train = pd.read_csv(train_url)
print('# of rows in training dataset:', len(train), '\n')
train[:2]

# Preparing the Data for Modeling  
Before training our machine learning model, we must prepare the training data in specific ways. To practice preparing such datasets, Task 8 in the fourth coding assignment has the following requirements:

- Select the **predictor variables (`X`)** and the **target variable (`y`)**.  
- Use `Age`, `Female`, and `Fare` as the features for prediction.  
- Split the data into **training (`X_train, y_train`)** and **testing (`X_test, y_test`)** sets using a standard 80/20 split. (This notebook calls the held-out 20% the *validation* set and names it `X_val, y_val`.)  
- Set `random_state=42` to ensure reproducibility.  


## Why Do We Prepare the Data?  
Before training our machine learning model, we must:  
1. **Select relevant features (`X`)** that will help predict survival.  
2. **Define the target variable (`y`)**, which is what we want to predict.  
3. **Split the data into training and validation sets** to evaluate the model's performance.  

Holding out part of the data lets us estimate how well the model generalizes to new, unseen data.

---

### Step 1: Select Predictor Variables (`X`) and Target Variable (`y`)  

We use three predictor variables:  
- **`Age`** – Passenger’s age.  
- **`Female`** – Whether the passenger is female (`1 = Female, 0 = Male`).  
- **`Fare`** – The fare paid for the ticket.  

The target variable (`y`) is **`Survived`**, which indicates whether a passenger survived (`1`) or not (`0`).  

### Define `X` and `y`:  
```python
X = train[['Age', 'Female', 'Fare']]  
y = train['Survived']  
```


--- 


### Step 2: Split the Data into Training and Validation Sets  

To build a reliable model, we divide the data into two sets:  
- **Training Set (`X_train, y_train`)** – Used to train the model.  
- **Validation Set (`X_val, y_val`)** – Used to test the model’s performance on unseen data.  

We use an **80/20 split**:  
- **80%** of the data goes to training.  
- **20%** of the data goes to validation.  

To ensure the split is reproducible, we set `random_state=42`.  

### Perform the Split:  
```python
from sklearn.model_selection import train_test_split  

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)  
```
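
To see what `random_state` buys us, here is a minimal sketch on a made-up 20-row table (the `toy_X` and `toy_y` names and their values are assumptions for illustration, not Titanic data). Splitting twice with the same seed should select exactly the same validation rows:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the feature table (hypothetical values, not real passengers)
toy_X = pd.DataFrame({"Age": np.arange(20, 40), "Fare": np.linspace(5, 100, 20)})
toy_y = pd.Series([0, 1] * 10, name="Survived")

# Split twice with the same seed
X_tr_a, X_val_a, y_tr_a, y_val_a = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)
X_tr_b, X_val_b, y_tr_b, y_val_b = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

# Identical row labels, in identical order, means the two splits match
same_seed_identical = X_val_a.index.equals(X_val_b.index)
print("Same seed, same validation rows:", same_seed_identical)
```

If you drop `random_state`, the rows chosen will generally change from run to run, which makes results harder to compare.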

---

### Step 3: Verify the Split  

We can check the number of samples in each set to confirm the 80/20 ratio:  

```python
print(f"Training set size: {X_train.shape[0]} rows")  
print(f"Validation set size: {X_val.shape[0]} rows")  
```

With **891 passengers** in the dataset, we expect:  
- **Training set:** 712 rows  
- **Validation set:** 179 rows  
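
The row counts follow from simple arithmetic. As a rough sketch, assuming (as I understand scikit-learn's behavior) that a fractional `test_size` is rounded up to a whole number of rows and the remainder goes to training:

```python
import math

n_total = 891      # rows in the Titanic training file
test_size = 0.2    # fraction held out for validation

# 0.2 * 891 = 178.2, which is rounded up to a whole row count
n_val = math.ceil(test_size * n_total)
n_train = n_total - n_val

print(n_train, n_val)  # 712 179
```

So the split is very close to, but not exactly, 80/20: 179 of 891 rows is about 20.1%.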

---

## Why Do We Split the Data?  
- **Detects overfitting** – Evaluating on data the model has not seen shows how well it generalizes.  
- **Ensures fair evaluation** – If we trained and tested on the same data, performance would look better than it really is, and the model could still fail on new data.  
- **Enables reproducibility** – Setting `random_state=42` ensures we get the same split every time.  

Once the data is prepared, we are ready to train a logistic regression model!

---

Below I will walk you through the code. First, we need to fix the missing values in `Age` and create `Female`:

In [None]:
train['Age'] = train['Age'].fillna(train["Age"].median())
train["Female"] = train["Sex"].map({"female": 1, "male": 0})
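
To make these two lines concrete, here is a small sketch on a four-row toy table (the `toy` DataFrame and its values are made up for illustration). It fills the missing age with the median of the observed ages and maps `Sex` to a 0/1 indicator, then confirms no missing values remain:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset with one missing age
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0],
    "Sex": ["male", "female", "female", "male"],
})

# The median ignores NaN, so it is computed from 22, 26, and 38
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Any value of Sex not in the dictionary would become NaN, so check afterwards
toy["Female"] = toy["Sex"].map({"female": 1, "male": 0})

print(toy)
print(toy[["Age", "Female"]].isna().sum())
```

Running the same `isna().sum()` check on `train[['Age', 'Female', 'Fare']]` is a quick way to confirm the features are complete before modeling.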

Now define `X` and `y`:

In [None]:
X = train[['Age', 'Female', 'Fare']]
y = train['Survived'] 

Next, import `train_test_split` from `sklearn` and split the data into training and validation sets.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

We can use the `shape` attribute to output the *dimensions* (number of rows and columns) of the training and validation datasets after performing the `train_test_split`.

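#### Breaking It Down:
Each `.shape` attribute returns a **tuple**. For a DataFrame such as `X_train` it has the form `(number_of_rows, number_of_columns)`; for a Series such as `y_train` it is the one-element tuple `(number_of_rows,)`. Here: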
- `X_train.shape` → Shape of the training feature set (`X_train`), which contains the predictor variables (`Age`, `Female`, `Fare`).
- `X_val.shape` → Shape of the validation feature set (`X_val`), which contains the same predictor variables but for validation.
- `y_train.shape` → Shape of the training target set (`y_train`), which contains the `Survived` labels for training.
- `y_val.shape` → Shape of the validation target set (`y_val`), which contains the `Survived` labels for validation.

#### Our output below (for 891 total passengers with an 80/20 split):
```
(712, 3) (179, 3) (712,) (179,)
```
This means:
- `X_train` has **712 rows and 3 columns** (predictor variables).
- `X_val` has **179 rows and 3 columns**.
- `y_train` has **712 rows** (one per passenger, only the target variable).
- `y_val` has **179 rows**.

#### Why is this Useful? 
- **Confirms the 80/20 split** worked correctly.
- **Ensures features (`X_train`) and target labels (`y_train`) have matching row counts**.
- **Verifies that all predictor variables are included** (in this case, 3: `Age`, `Female`, `Fare`).

This check is helpful to prevent shape mismatches when training the model.

In [None]:
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)
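
If you would rather have the notebook fail loudly than inspect the numbers by eye, the same checks can be written as assertions. This is only a sketch on synthetic data: the `demo` table is randomly generated with 891 rows to mirror the Titanic file, and the names `X_demo`, `y_demo`, `X_tr`, and `X_v` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the training data
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.uniform(1, 80, 891),
    "Female": rng.integers(0, 2, 891),
    "Fare": rng.uniform(5, 500, 891),
    "Survived": rng.integers(0, 2, 891),
})
X_demo = demo[["Age", "Female", "Fare"]]
y_demo = demo["Survived"]
X_tr, X_v, y_tr, y_v = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Features and labels must line up row for row
assert X_tr.index.equals(y_tr.index)
assert X_v.index.equals(y_v.index)
# All three predictors are present in both feature sets
assert X_tr.shape[1] == X_v.shape[1] == 3
# No rows were lost or duplicated by the split
assert X_tr.shape[0] + X_v.shape[0] == len(demo)

print(X_tr.shape, X_v.shape, y_tr.shape, y_v.shape)
```

Swapping in the real `X_train`, `X_val`, `y_train`, and `y_val` gives the same checks for the Titanic split.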