[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%208%20Notebooks/GDAN%205400%20-%20Week%208%20Notebooks%20%28II%29%20-%20Task%202%20-%20Load%20Training%20Data.ipynb)

This notebook provides a mini-tutorial on loading the training data from the GDAN 5400 GitHub repository.

In [None]:
%%time
import datetime
print("Current date and time:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

# Load Packages
Import the Python packages we need. We will use the <a href="http://pandas.pydata.org/">Python Data Analysis Library</a>, or <i>PANDAS</i>, extensively for data manipulation in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns, so I typically include these lines at the top of all my notebooks.

In [None]:
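# Options reference: http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)       # show every column when printing a DataFrame
pd.set_option('display.max_colwidth', 250)       # show up to 250 characters per cell
pd.set_option('display.max_info_columns', 500)   # let .info() list up to 500 columns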
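
To confirm that an option took effect, you can read it back. The sketch below uses `pd.get_option` for that and `pd.option_context` to change a setting only temporarily. The variable names (`max_cols`, `before`, `inside`, `after`) are illustrative and not part of the assignment.

```python
import pandas as pd

# Apply a display setting, then read it back.
pd.set_option('display.max_columns', None)
max_cols = pd.get_option('display.max_columns')  # None means "show every column"

# option_context changes a setting only inside the `with` block.
before = pd.get_option('display.max_colwidth')
with pd.option_context('display.max_colwidth', 20):
    inside = pd.get_option('display.max_colwidth')
after = pd.get_option('display.max_colwidth')

print(max_cols, inside, before == after)
```

The temporary form is handy when you want a narrow display for one cell without changing the notebook-wide setting.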

# Loading the Training Data from GitHub

Most of our work will use the training data. Task 2 in the fourth coding assignment has the following requirements:

- Import the necessary libraries (**NumPy** and **PANDAS**).  
- Load the Titanic training dataset directly from a **GitHub URL** into a Pandas DataFrame. Name the dataset `train`.
- Print the **number of rows** in the dataset.  
- Display the **first two rows** to verify the data has loaded correctly.  

### What is a Training Dataset?  

When building a machine learning model, we need **data** to help the model learn patterns and make predictions. The **training dataset** is the portion of the data that includes both the input features (e.g., passenger details like age, ticket class, and gender) and the correct output (whether the passenger survived).  

The goal of training a model is to find relationships between these features and the survival outcome so that we can use the model to make predictions on new, unseen data.  

In the Titanic dataset:  
- Each **row** represents a passenger.  
- Each **column** provides information about that passenger, such as their age, class, fare paid, and whether they survived (1 = survived, 0 = did not survive).  
- The **Survived** column is our **target variable**, meaning it’s what we want our model to predict.  
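
To make the distinction between features and target concrete, here is a minimal sketch that uses a small hand-made table rather than the real file. The three rows and the names `toy`, `features`, and `target` are illustrative assumptions. Only the column names come from the Titanic data.

```python
import pandas as pd

# Tiny made-up table with a few Titanic-style columns (not real passengers).
toy = pd.DataFrame({
    'Pclass':   [3, 1, 2],
    'Sex':      ['male', 'female', 'female'],
    'Age':      [30.0, 40.0, 25.0],
    'Survived': [0, 1, 1],
})

# The target is the column we want to predict; the features are everything else.
target = toy['Survived']
features = toy.drop(columns='Survived')

print(features.columns.tolist())
print(target.tolist())
```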

The Titanic dataset is split into two parts:  
- **Training Data (`train.csv`)** – This contains both passenger details and their survival outcome. We use this data to train the model.  
- **Test Data (`test.csv`)** – This contains only passenger details (no survival outcome). After training our model, we use this dataset to make predictions and submit them to Kaggle.  

### Details on `train.csv`
**train.csv** contains the details of a subset of the passengers on board (891 passengers, to be exact, each in a separate row of the table). Importantly, this dataset reveals whether they survived or not, also known as the **"ground truth"**.


To investigate this data, go to the Kaggle competition page and click on the name of the file on the left of the screen.  Once you've done this, you can view all of the data in the window.  

![](https://i.imgur.com/cYsdt0n.png)

The values in the second column (**"Survived"**) can be used to determine whether each passenger survived or not:
- if it's a "1", the passenger survived.
- if it's a "0", the passenger died.

For instance, the first passenger listed in **train.csv** is Mr. Owen Harris Braund.  He was 22 years old when he died on the Titanic.
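
Because the column is coded as 0 and 1, a couple of one-liners summarize it. The sketch below uses five made-up values, not the real data, and the names `survived`, `counts`, and `survival_rate` are illustrative.

```python
import pandas as pd

# Made-up Survived values for five hypothetical passengers.
survived = pd.Series([0, 1, 1, 0, 0], name='Survived')

# Count how many passengers fall under each code.
counts = survived.value_counts()

# With 0/1 coding, the mean equals the share of passengers who survived.
survival_rate = survived.mean()

print(counts)
print(survival_rate)
```

The same two calls should work on `train['Survived']` once the real data is loaded.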

As a reference, here is a data dictionary describing the variables you will find in the dataset:


**Data Dictionary**

| **Variable**      | **Definition**                        | **Key**                                   |
|-------------------|---------------------------------------|-------------------------------------------|
| Survived          | Survival                              | 0 = No, 1 = Yes                          |
| Pclass            | Ticket class                          | 1 = 1st, 2 = 2nd, 3 = 3rd                |
| Sex               | Sex                                   |                                           |
| Age               | Age in years                          |                                           |
| SibSp             | # of siblings/spouses aboard Titanic |                                           |
| Parch             | # of parents/children aboard Titanic |                                           |
| Ticket            | Ticket number                         |                                           |
| Fare              | Passenger fare                        |                                           |
| Cabin             | Cabin number                          |                                           |
| Embarked          | Port of Embarkation                   | C = Cherbourg, Q = Queenstown, S = Southampton |
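
The codes in the Key column can be turned into readable labels with `Series.map`. This is a sketch on three made-up rows. The lookup dictionaries copy the table above, while the new column names `EmbarkedName` and `PclassName` are hypothetical.

```python
import pandas as pd

# Code-to-label lookups taken from the Key column of the data dictionary.
embarked_names = {'C': 'Cherbourg', 'Q': 'Queenstown', 'S': 'Southampton'}
pclass_names = {1: '1st', 2: '2nd', 3: '3rd'}

# Three made-up rows, used only to show the mapping.
toy = pd.DataFrame({'Pclass': [3, 1, 2], 'Embarked': ['S', 'C', 'Q']})

# map() looks up each value in the dictionary; unmatched values become NaN.
toy['EmbarkedName'] = toy['Embarked'].map(embarked_names)
toy['PclassName'] = toy['Pclass'].map(pclass_names)

print(toy)
```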

---

### Reading in the Data from GitHub
To load the training data into a PANDAS DataFrame, we will import two Python libraries:
- **Pandas** – For working with tabular data.  
- **NumPy** – For numerical operations (we import it as a best practice).  

Next, we will read in the data. To make access easy, I have downloaded the data and placed it in the `Titanic` folder of the `GDAN5400` repository on GitHub.

Here is how I got the URL for the file:
- Locate the CSV file in the GitHub repository.
- Click on the file to view its contents.
- Click the "Raw" button, typically found in the upper right corner of the file view. This will display the raw CSV data.
- Copy the URL of the raw CSV file.
- In your Google Colab notebook, use the pandas library to read the CSV file directly from the URL.
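
If you prefer not to click through to the "Raw" button, the raw URL can usually be derived from the file-view URL by string manipulation. The helper below is a hypothetical sketch, not something from the course materials. It assumes the file-view URL has the form `https://github.com/<user>/<repo>/blob/<branch>/<path>`, and the example URL is inferred from the repository layout rather than copied from GitHub.

```python
def github_blob_to_raw(blob_url: str) -> str:
    """Convert a github.com file-view URL into its raw.githubusercontent.com form."""
    prefix = 'https://github.com/'
    if not blob_url.startswith(prefix) or '/blob/' not in blob_url:
        raise ValueError(
            'expected a URL like https://github.com/<user>/<repo>/blob/<branch>/<path>'
        )
    # Split "<user>/<repo>" from "<branch>/<path>" at the "/blob/" marker.
    user_repo, branch_and_path = blob_url[len(prefix):].split('/blob/', 1)
    return f'https://raw.githubusercontent.com/{user_repo}/{branch_and_path}'


# Assumed file-view URL for the training data.
raw_url = github_blob_to_raw(
    'https://github.com/gdsaxton/GDAN5400/blob/main/Titanic/train.csv'
)
print(raw_url)
```

This produces the shorter `main/` form of the address. The URL copied from the "Raw" button may include `refs/heads/` before the branch name, and both forms should point to the same file.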

In [None]:
import numpy as np
import pandas as pd

train_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Titanic/train.csv'
train = pd.read_csv(train_url)
print('# of rows in training dataset:', len(train), '\n')
train.head(2)  # display the first two rows
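
After loading, a few quick checks can confirm that the DataFrame looks as expected. The sketch below runs the checks on a two-row inline sample so that it works without a network connection. The first row echoes the passenger mentioned earlier, the second row is a made-up placeholder, and the names `sample`, `expected_columns`, and `missing` are illustrative.

```python
import io
import pandas as pd

# Two-row stand-in for the downloaded file (second row is a made-up placeholder).
sample_csv = io.StringIO(
    'PassengerId,Survived,Pclass,Name,Sex,Age\n'
    '1,0,3,"Braund, Mr. Owen Harris",male,22\n'
    '2,1,1,"Placeholder, Mrs. Example",female,38\n'
)
sample = pd.read_csv(sample_csv)

# Check the dimensions and confirm that the columns we rely on are present.
n_rows, n_cols = sample.shape
expected_columns = {'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age'}
missing = expected_columns - set(sample.columns)

print(n_rows, n_cols, missing)
print(sample.isna().sum())  # number of missing values per column
```

Run against the real `train` DataFrame, `train.shape` should report 891 rows, matching the description of `train.csv` above.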