[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%208%20Notebooks/GDAN%205400%20-%20Week%208%20Notebooks%20%28VII%29%20-%20Task%207%20-%20Create%20Female%20Variable.ipynb)

This notebook provides a mini-tutorial on creating a binary variable in the Titanic training dataset.

In [None]:
%%time
import datetime
print("Current date and time:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), "\n")

# Load Packages and Set Working Directory
Import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various display options that control how data is shown when you inspect it. I like to be able to see all of the columns, so I typically include these lines at the top of all my notebooks.

In [None]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_info_columns', 500)
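
If you would rather not change the display settings for the whole notebook, pandas also offers `pd.option_context`, which applies options only inside a `with` block. The following is a minimal sketch of that alternative, not part of the assignment; the variable names `before`, `inside`, and `after` are illustrative.

```python
import pandas as pd

# Remember the current setting so we can compare afterwards
before = pd.get_option("display.max_columns")

# Options set here apply only inside the with-block
with pd.option_context("display.max_columns", None, "display.max_colwidth", 250):
    inside = pd.get_option("display.max_columns")

# Outside the block the previous setting is restored automatically
after = pd.get_option("display.max_columns")
print(before, inside, after)
```

This is handy when only one cell needs the wide display.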

### Read in the Titanic Training Data

In [None]:
import numpy as np
import pandas as pd

train_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Titanic/train.csv'
train = pd.read_csv(train_url)
print('# of rows in training dataset:', len(train), '\n')
train[:2]

# Convert `Sex` to a Numeric Variable called `Female`

Most machine learning models require numeric inputs, so we cannot directly use a text-valued categorical variable such as `Sex`, whose values are `'male'` and `'female'`. Instead, we convert it into a *binary variable* (0 or 1) that can be used in calculations. To give you practice with this process, task 7 in the fourth coding assignment has the following requirements:
- Convert the `Sex` column into a **numeric format** for modeling.  
- Call the new variable `Female` and map `female` to `1` and `male` to `0`.  
- Fill any missing values with `0` (though there should be none in this dataset). 

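### Why Convert `Sex` to a Numeric Variable?  

Model-fitting routines operate on numbers, so a text column such as `Sex` has to be encoded as `0`/`1` before it can enter a model. We will:  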
1. **Explore the `Sex` variable** to confirm its values.  
2. **Create a new numeric column (`Female`)**, mapping:  
   - `"female"` → `1`  
   - `"male"` → `0`  
3. **Check for missing values** and fill them with `0` if necessary (though none should exist).  
4. **Verify the conversion** to ensure it was applied correctly.

---

### Step 1: Explore the `Sex` Variable  

Before modifying the data, let’s check what values exist in the `Sex` column.  

```python
train["Sex"].value_counts()
```

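Expected output (pandas 2.x; older versions omit the `Sex` header line and print `Name: Sex` on the last line):  
```
Sex
male      577
female    314
Name: count, dtype: int64
```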
- We confirm that there are **only two categories**: `"male"` and `"female"`.  
- We now convert this column into a binary variable.  
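
The Titanic `Sex` column is clean, but other datasets often contain stray whitespace, mixed capitalization, or missing entries that a plain `value_counts()` can hide. As a hedged illustration, the sketch below uses a small made-up Series (`messy` is hypothetical, not the Titanic data) to show how `dropna=False` and basic string cleanup expose and fix such problems before mapping.

```python
import pandas as pd

# Hypothetical messy labels, NOT taken from the Titanic data
messy = pd.Series(["male", "Female ", "female", None, " MALE"], name="Sex")

# dropna=False makes missing values show up as their own row
raw_counts = messy.value_counts(dropna=False)
print(raw_counts)

# Strip whitespace and lowercase so the labels collapse to two categories
cleaned = messy.str.strip().str.lower()
clean_counts = cleaned.value_counts(dropna=False)
print(clean_counts)
```

After cleaning, only `male`, `female`, and the missing entry remain, so a dictionary with two keys is enough for the mapping step.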

---

### Step 2: Convert `Sex` into `Female`  

To ensure clarity in our dataset, we create a new column called **`Female`** instead of replacing `Sex`.  
- The name **`Female`** is used instead of `Sex` or `Gender` because:  
  - Binary variables should be **named after the category assigned `1`** (in this case, `"female"`).  
  - This makes interpretation easier (e.g., `1 = Female` instead of `1 = Sex`).  

Use the following code to map values:  

```python
train["Female"] = train["Sex"].map({"female": 1, "male": 0})
```
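
`map()` is not the only way to build a 0/1 column. The sketch below compares it with a boolean comparison on a small made-up DataFrame (`demo` and the column names `Female_map` and `Female_eq` are illustrative). The point is how each approach treats values that are neither `"female"` nor `"male"`.

```python
import pandas as pd

# Hypothetical data with one unexpected label and one missing value
demo = pd.DataFrame({"Sex": ["male", "female", "unknown", None]})

# Option A (used in this tutorial): unmapped or missing labels become NaN
demo["Female_map"] = demo["Sex"].map({"female": 1, "male": 0})

# Option B: boolean comparison; anything that is not "female" becomes 0
demo["Female_eq"] = (demo["Sex"] == "female").astype(int)

print(demo)
```

With `map()`, problem rows surface as `NaN` and can be inspected before deciding how to fill them; the comparison silently codes them as `0`. Which behavior is preferable depends on whether you want to be alerted to unexpected values.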

---

### Step 3: Handle Missing Values (if any)  

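Even though there should be **no missing values**, it’s good practice to check. Because `map()` returns `NaN` for any value that is not a key in the dictionary, this check also catches unexpected labels in `Sex`:  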

```python
train["Female"].isnull().sum()
```

If the output is `0`, that means no missing values exist. If there were any, we would fill them using:  

```python
train["Female"] = train["Female"].fillna(0)
```
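
One side effect worth knowing: if the mapped column contains any `NaN`, pandas stores it as floating point (`1.0`, `0.0`), and `fillna(0)` alone does not change that. The sketch below, using a small made-up Series, shows one way to get back to integers by adding `astype(int)` after filling. This extra step is a suggestion, not an assignment requirement.

```python
import pandas as pd

# Hypothetical column with one missing entry, NOT the Titanic data
sex = pd.Series(["female", "male", None], name="Sex")

# The missing entry maps to NaN, which forces a float dtype
female = sex.map({"female": 1, "male": 0})
print(female.dtype)

# Fill the gap with 0, then convert back to integers
filled = female.fillna(0).astype(int)
print(filled.tolist())
```

In the Titanic training data there are no missing values in `Sex`, so the mapped column is already integer-typed and this conversion is not needed.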

---

### Step 4: Verify the Conversion  

Let’s confirm that our transformation worked correctly:  

```python
train[["Sex", "Female"]].head(5)
```

Expected output:  

| Sex    | Female |  
|--------|--------|  
| male   | 0      |  
| female | 1      |  
| female | 1      |  
| female | 1      |  
| male   | 0      |  

Additionally, check the counts to ensure we preserved the original distribution:  

```python
train["Female"].value_counts()
```

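Expected output (pandas 2.x; older versions omit the `Female` header line and print `Name: Female` on the last line):  
```
Female
0    577
1    314
Name: count, dtype: int64
```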
This confirms that:  
- All `"female"` values were converted to `1`.  
- All `"male"` values were converted to `0`.  
- The total number of passengers remains the same.  
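
Eyeballing a few rows and comparing counts works well here. For a stricter check, a cross-tabulation shows every combination of `Sex` and `Female` at once. The sketch below runs on a five-row made-up DataFrame (`check` is illustrative); on the real data you would pass `train["Sex"]` and `train["Female"]` instead.

```python
import pandas as pd

# Small stand-in for the training data
check = pd.DataFrame({"Sex": ["male", "female", "female", "male", "male"]})
check["Female"] = check["Sex"].map({"female": 1, "male": 0})

# Rows are Sex labels, columns are Female codes; off-diagonal cells should be 0
table = pd.crosstab(check["Sex"], check["Female"])
print(table)

# Row-by-row check that the code matches the label
all_match = bool((check["Female"] == (check["Sex"] == "female").astype(int)).all())
print(all_match)
```

If any `male` row had been coded `1`, or any `female` row coded `0`, it would appear as a nonzero off-diagonal cell in the table.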

---

## Conclusion  
Converting categorical variables to numeric values is an essential step in preparing data for machine learning. The `Female` variable lets our model **use passenger sex as a predictor of survival**, which we will explore in later tasks.  

Once we have transformed `Sex` into a numeric format, our dataset is one step closer to being model-ready!

Below I implement the code steps mentioned above:

---

Show frequencies

In [None]:
train['Sex'].value_counts()

<br>Create new variable

In [None]:
train["Female"] = train["Sex"].map({"female": 1, "male": 0})

<br>
Check for missing values. If there were any, we could run this code:

```python
train["Female"] = train["Female"].fillna(0)
```

In [None]:
train["Female"].isnull().sum()

<br>Confirm that our transformation worked correctly on a random sample of 10 rows (note that we use the `sample()` method here rather than `head()`). Every row with `male` in `Sex` should have `0` in `Female`, and every row with `female` in `Sex` should have `1` in `Female`.

In [None]:
train[["Sex", "Female"]].sample(10)

<br>Lastly, check the counts to ensure we preserved the original distribution:

In [None]:
train['Female'].value_counts()