<a href="https://colab.research.google.com/github/brendanpshea/data-science/blob/main/Data_Science_03_EDA_Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploratory Data Analysis on the Titanic
This chapter will delve deeper into the exciting world of data science by focusing on the crucial step of exploratory data analysis (EDA). EDA is a fundamental part of any data science project, allowing us to understand the data we're working with and make informed decisions about how to proceed with our analysis.

Specifically, we'll be learning how to handle common challenges that arise when dealing with real-world data, such as missing data, outliers, and inconsistent data entries. Understanding and addressing these issues is vital for ensuring the reliability and validity of our analysis. We'll learn about various techniques for identifying and handling these issues, and apply them to a real dataset.

Speaking of datasets, the one we'll be using throughout this notebook is the Titanic dataset, a classic dataset in the data science world. This dataset contains passenger information from the infamous Titanic ship that sank in 1912 after hitting an iceberg. The data includes various details about each passenger, such as their age, sex, passenger class, fare, and most notably, whether or not they survived the sinking.

In addition to these practical skills, we'll also delve into data science's philosophical side. Specifically, we'll discuss the "problem of induction," a philosophical issue that deals with our ability to make generalizations or predictions based on specific observations. This issue is particularly relevant to data science, where we often need to make general predictions based on specific datasets.

By the end of this notebook, you will better understand the steps involved in preparing a dataset for analysis, and you will be more aware of the philosophical considerations underpinning our work as data scientists. So, let's dive in and get started!

## Loading the Data
As introduced above, our dataset describes the passengers who were aboard the Titanic. For each passenger it records details such as age, sex, and ticket class, and, importantly, whether they survived.

Let's begin by loading our data.

### Importing Necessary Libraries
Before we load our data, we must import the necessary libraries. Libraries are collections of functions and methods that allow us to perform many actions without writing a lot of code.

For this task, we are going to use the Pandas library, one of the most powerful and commonly used tools for data manipulation and analysis in Python.

In [17]:
# Importing the pandas library
import pandas as pd

### Loading the Data
With the necessary library imported, we can now load our dataset. Our data is stored in a **CSV (Comma Separated Values)** file, a type of file that stores tabular data. It is a simple and popular format among data scientists because of its easy readability and wide support.

Pandas provides a function, `read_csv()`, which reads a CSV file and converts it into a DataFrame. A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet or SQL table, or a dictionary of Series objects.
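To make the idea of a DataFrame more concrete, here is a minimal sketch that builds the same small table two ways: from a dictionary of columns, and by parsing CSV text with `read_csv()`. The variable names (`from_dict`, `csv_text`, `from_csv`) are made up for this illustration, and the three rows are copied from the first few passengers in the dataset.

```python
import io
import pandas as pd

# A DataFrame can be built from a dictionary mapping column names to values.
from_dict = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Allen, Mr. William Henry"],
    "Age": [22.0, 26.0, 35.0],
    "Survived": [0, 1, 0],
})

# read_csv() also accepts a file-like object, so we can parse CSV text
# held in memory instead of downloading a file.
csv_text = '''Name,Age,Survived
"Braund, Mr. Owen Harris",22.0,0
"Heikkinen, Miss. Laina",26.0,1
"Allen, Mr. William Henry",35.0,0
'''
from_csv = pd.read_csv(io.StringIO(csv_text))

# Each column is a Series with its own data type.
print(from_csv.dtypes)
print(type(from_csv["Age"]))
```

Note that the names are wrapped in double quotes in the CSV text. Without the quotes, the comma inside each name would be read as a column separator.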

Let's load our data now.

In [18]:
# Load the data
url = "https://github.com/brendanpshea/data-science/raw/main/data/titanic_train.csv"
titanic_df = pd.read_csv(url)

### Inspecting the Data
Great! We've now loaded our data. But how do we know if it was loaded correctly? And what does our data look like?

Pandas provides a method, `head()`, that allows us to inspect the first few rows of our DataFrame. By default, `head()` displays the first five rows.

In [19]:
# Display the first few rows of the DataFrame
titanic_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


The DataFrame you see is the first five rows of our Titanic dataset. Each row represents a passenger onboard the Titanic, and the columns contain various pieces of information about each passenger.
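If five rows are not what you want, `head()` accepts the number of rows to show, and `tail()` does the same from the end of the DataFrame. The sketch below uses a small made-up DataFrame (`sample`, with toy values) rather than the real data, so it runs without a download.

```python
import pandas as pd

# A toy DataFrame with ten rows (values are illustrative only).
sample = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
})

first_three = sample.head(3)   # first three rows instead of the default five
last_two = sample.tail(2)      # last two rows

print(first_three)
print(last_two)
```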

Here is a brief overview of what each column represents:

-   `PassengerId:` This is a unique number assigned to each passenger. It's simply a way to number each row in the dataset, and it doesn't have any real-world meaning about the passenger themselves.

-   `Survived:` This is our target variable, which we would like to predict if we were building a machine learning model. It indicates whether the passenger survived (1) or did not survive (0).

-   `Pclass:` This refers to the passenger's ticket class, a proxy for socio-economic status. It's an ordinal integer feature where 1 = 1st class (Upper), 2 = 2nd class (Middle), and 3 = 3rd class (Lower).

-   `Name:` The name of the passenger.

-   `Sex:` The gender of the passenger. It's a categorical feature with two values, male or female.

-   `Age:` The age of the passenger. Some values in this column are missing. Pandas displays a missing value as NaN (Not a Number), as you can see in the Cabin column of the first row above.

-   `SibSp:` This is the number of siblings or spouses the passenger had aboard the Titanic.

-   `Parch:` This is the number of parents or children the passenger had aboard the Titanic.

-   `Ticket:` This is the ticket number of the passenger.

-   `Fare:` This is how much the passenger paid for their ticket.

-   `Cabin:` This is the cabin number of the passenger. Like the Age column, this column may also have many missing values.

-   `Embarked:` This is the port at which the passenger boarded the Titanic. It's a categorical feature with three possible values: C = Cherbourg, Q = Queenstown, S = Southampton.
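A quick way to confirm what values a categorical column actually contains is to count them. The following sketch assumes a small DataFrame (`first_rows`) holding four columns from the five rows shown above. The `class_labels` mapping and the `PclassLabel` column are names invented for this example.

```python
import pandas as pd

# Four columns from the first five passengers shown above.
first_rows = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# value_counts() counts how often each distinct value appears.
sex_counts = first_rows["Sex"].value_counts()
print(sex_counts)

# unique() lists the distinct values in a column.
embarked_values = sorted(first_rows["Embarked"].unique())
print(embarked_values)

# map() translates coded values into readable labels.
class_labels = {1: "1st (Upper)", 2: "2nd (Middle)", 3: "3rd (Lower)"}
first_rows["PclassLabel"] = first_rows["Pclass"].map(class_labels)
print(first_rows[["Pclass", "PclassLabel"]])
```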

In the rest of this chapter, we will delve into this dataset, clean it, and explore it.

## Introduction to Exploratory Data Analysis (EDA)

As budding data scientists, you've just started a journey that will often lead you through a jungle of data. The path you forge to understand this wilderness is known as Exploratory Data Analysis (EDA).

EDA is a critical early step in the data science process, allowing you to dive into the heart of your dataset and emerge with valuable insights. It's a chance to roll up your sleeves and get hands-on with your data. So, let's understand a bit more about what EDA is and why it's important.

EDA is the practice of using visual and quantitative methods to understand and summarize a dataset before committing to assumptions about its contents. In EDA, we look for patterns and relationships in our data, often with the help of visualizations.

The importance of EDA cannot be overstated. It enables us to:

-   Identify patterns and relationships in the data, which could lead to hypotheses for later testing.

-   Detect anomalies and outliers that might distort our later analysis.

-   Check the assumptions related to our chosen data analysis methods.

-   Select appropriate statistical tools and techniques for analysis.

-   Create a foundational understanding of the data, making it easier to communicate your results and findings to others.

To explore our data, we'll be asking a series of questions and seeking their answers by using a variety of statistical and visualization techniques. However, remember that EDA is not a rigid process. It's more of a creative and iterative process, allowing you to dig deeper as you uncover more about your data.

Also, while EDA helps us to understand the data's underlying structure and extract valuable insights, it's important to note that EDA doesn't directly involve making predictions or testing hypotheses. It simply helps us to comprehend the data better and guides us in building suitable predictive models or conducting statistical analysis.

In the next section, we'll dive right into EDA, using our Titanic dataset to guide us. We'll be addressing some key questions to help you understand basic concepts and techniques of data cleaning and EDA. By the end, you'll have a solid grasp of this critical phase of the data science process, and be well-equipped to tackle your own data exploration in the future. Let's get started!

### Exploring the Shape of the Dataset
The **shape** of the dataset refers to the number of rows (instances) and columns (features) it contains. In a DataFrame, the `.shape` attribute returns a tuple representing the dimensionality of the DataFrame. The first element of the tuple is the number of rows and the second element is the number of columns.

Knowing the shape of your dataset can provide insight into the volume of data you have, which is an essential factor in determining your data analysis approach.

Here's how to find the shape of our DataFrame:


In [20]:
# Get the shape of the DataFrame
titanic_df.shape

(891, 12)

This statement outputs the number of rows (891) and columns (12) in our DataFrame. This is a vital first step in understanding the structure and size of our dataset. It gives us an idea of how much data we're working with, which will inform our choices as we move through the data analysis process.

Remember, big datasets aren't necessarily better, and small datasets aren't necessarily worse. But the size of your dataset will impact what you can do with it, so it's good to know this right at the start of your EDA journey.
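Because `.shape` is an ordinary Python tuple, it can be unpacked into separate variables. Here is a minimal sketch on a made-up three-row DataFrame (`small_df`); the same pattern should work on `titanic_df`.

```python
import pandas as pd

# A toy DataFrame with 3 rows and 2 columns.
small_df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
})

# Unpack the (rows, columns) tuple into two variables.
n_rows, n_cols = small_df.shape
print(f"{n_rows} rows and {n_cols} columns")

# len() gives the number of rows; .size gives rows * columns.
print(len(small_df))
print(small_df.size)
```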

###  Understanding Data Types and Checking for Missing Values
As we continue our exploratory journey, the next key aspect of our dataset to understand is the types of data we're working with and where we might have missing information. For this purpose, the `.info()` method provided by pandas is extremely handy.

The `.info()` method offers a concise summary of our DataFrame. It provides essential information about the data types of our columns, the number of non-null entries (i.e., entries that are not missing), and memory usage. Understanding the data types is critical because certain operations and visualizations are only applicable to certain types of data.

Let's invoke this method on our titanic_df DataFrame:

In [21]:
# Print information about the DataFrame
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The `.info()` method gives us a summary of our DataFrame. Let's discuss the output and understand the key information it provides:

The output begins with confirming that `titanic_df` is a DataFrame object with 891 entries, ranging from index 0 to 890. We can see that the DataFrame contains 12 columns in total.

The data types of the columns are broken down into three categories: `int64`, `float64`, and `object`.

-   `int64`: This data type represents integer values. In our DataFrame, the columns 'PassengerId', 'Survived', 'Pclass', 'SibSp', and 'Parch' are of this type.

-   `float64`: This data type is used for floating-point numbers (numbers that have decimal points). The 'Age' and 'Fare' columns in our DataFrame are represented as `float64`.

-   `object`: This data type typically represents strings, but it can also be used to store different types of data. In our DataFrame, 'Name', 'Sex', 'Ticket', 'Cabin', and 'Embarked' columns are of this type.

Now, let's move to the Non-Null Count. This tells us the number of entries in each column that are not missing (non-null).

-   Most of our columns like 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'SibSp', 'Parch', 'Ticket', and 'Fare' have 891 non-null entries, meaning they don't have any missing values.

-   However, the 'Age', 'Cabin', and 'Embarked' columns have fewer than 891 non-null entries, which means they have missing values. Subtracting each non-null count from 891 gives the number missing: 'Age' has 177 missing values (891 − 714), 'Cabin' has 687 (891 − 204), and 'Embarked' has 2 (891 − 889).

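The `memory usage: 83.7+ KB` line tells us approximately how much memory our DataFrame uses. The `+` indicates that this figure is a lower bound, because by default pandas does not measure the full size of the strings stored in `object` columns.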

Understanding the data types and the location of missing values in our dataset will guide us in our next steps of data cleaning and exploratory data analysis.
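Rather than subtracting non-null counts by hand, we can ask pandas to count missing values directly. The sketch below uses a tiny made-up DataFrame (`toy`) so it runs on its own, and then repeats the subtraction using the counts reported by `.info()` above. The variable names are assumptions for this example.

```python
import pandas as pd

# A toy DataFrame with some missing values (None becomes NaN/missing).
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "S"],
})

# isna() marks each missing cell as True; sum() counts the True values per column.
missing_counts = toy.isna().sum()
print(missing_counts)

# mean() of the same True/False table gives the share of missing values per column.
missing_share = toy.isna().mean()
print(missing_share)

# The same arithmetic applied to the non-null counts from .info() above.
total_rows = 891
non_null = {"Age": 714, "Cabin": 204, "Embarked": 889}
missing_from_info = {col: total_rows - count for col, count in non_null.items()}
print(missing_from_info)
```

On the real data, `titanic_df.isna().sum()` should report the same per-column counts as the subtraction, provided it is run before any cleaning.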

### Handling Missing Values


The task of handling missing values is an essential step in the data cleaning process. Data can be missing for various reasons: maybe an individual chose not to share their age, a technical glitch didn't record the cabin number, or perhaps the port of embarkation data was lost. Whatever the reason, our task is to make a strategic choice about how to handle these gaps in our dataset.

There are three fundamental strategies for handling missing data:

1.  **Imputation:** This strategy involves filling missing data with some value. The choice of value can significantly affect our analysis, and it usually depends on the nature of our data. Common choices are:

    -   **Mean:** The mean or average is the sum of all values divided by the number of values. It works well when the data is normally distributed, but it can be misleading if there are outliers in the data.

    -   **Median:** The median is the middle value in a dataset. It separates the data into two halves and is less affected by outliers and skewed data.

    -   **Mode:** The mode is the most frequently occurring value in a dataset. It is often used for categorical data.

2.  **Deletion:** In this strategy, we remove the rows or columns with missing values. Dropping rows makes sense when only a few rows are affected, so little information is lost. Dropping a column makes sense when most of its values are missing, so there is little information to keep.

3.  **Prediction:** Advanced methods involve predicting missing values, using machine learning algorithms or other techniques. However, for our beginner's guide, we will stick to imputation and deletion.
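The difference between these choices is easiest to see on a small example. The sketch below uses made-up values (`ages` and `ports`) with one extreme age, and compares mean, median, and mode imputation with deletion. None of the numbers come from the Titanic data.

```python
import pandas as pd

# Toy ages with one outlier (80) and one missing value.
ages = pd.Series([22.0, 24.0, 26.0, 28.0, 80.0, None])

mean_age = ages.mean()      # pulled upward by the outlier
median_age = ages.median()  # the middle value, unaffected by the outlier
print(mean_age, median_age)

# Imputation: fillna() returns a new Series with the gaps filled.
filled_mean = ages.fillna(mean_age)
filled_median = ages.fillna(median_age)

# Mode imputation for a categorical column.
ports = pd.Series(["S", "S", "C", None, "Q", "S"])
mode_port = ports.mode()[0]  # mode() returns a Series; take its first entry
filled_ports = ports.fillna(mode_port)

# Deletion: dropna() removes the missing entries instead of filling them.
dropped = ages.dropna()
print(len(ages), len(dropped))
```

Here the mean is 36.0 while the median is 26.0, so a single 80-year-old shifts the mean by ten years but leaves the median where most of the data sits.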

Let's address the missing values in our data:

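-   Age: The 'Age' column is numerical data. It may contain outliers (for instance, a few very old passengers). Thus, using the median might be a good choice here as it is less sensitive to extreme values compared to the mean. We can use the Pandas method `fillna()` to fill the missing values. `fillna()` returns a new Series, so we assign the result back to the 'Age' column. (In recent versions of pandas, calling `fillna(..., inplace=True)` on a selected column triggers a warning and may not modify the original DataFrame, so assignment is the safer pattern.)

In [22]:
# Fill missing ages with the median age and store the result back in the DataFrame
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].median())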

- Cabin: The 'Cabin' column has a large number of missing values. Rather than filling in this extensive missing data, we could drop the column entirely using the `drop()` method. The argument `axis=1` indicates we want to drop a column (not a row), and `inplace=True` applies the change to our DataFrame.

In [23]:
titanic_df.drop('Cabin', axis=1, inplace=True)

- Embarked: For the 'Embarked' column, only two values are missing. As it is a small fraction of the dataset, we might decide to drop these rows. We use the `dropna()` method here, with `subset=['Embarked']` so that only rows missing an 'Embarked' value are removed.

In [24]:
titanic_df.dropna(subset=['Embarked'], inplace=True)

Remember, these are choices that we made for this specific analysis, given the nature of our data and our objective. We could have made different decisions, such as filling missing ages with the mean age, or replacing missing embarkation points with the mode. The essential point is to understand your data and make informed decisions about handling missing values.
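To see all three decisions together, and to check that they leave no gaps, here is a sketch on a made-up four-row DataFrame (`toy`). It is written with plain assignment rather than `inplace=True`, which is a stylistic assumption of this example, and it also shows the alternative of filling 'Embarked' with the mode instead of dropping rows.

```python
import pandas as pd

# Toy data with gaps in all three columns (values are illustrative only).
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", None, "S"],
})

# The choices made in this section: median for Age, drop Cabin, drop rows missing Embarked.
cleaned = toy.copy()
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
cleaned = cleaned.drop(columns="Cabin")
cleaned = cleaned.dropna(subset=["Embarked"])

# Verify that no missing values remain.
print(cleaned.isna().sum())

# An alternative choice: keep every row and fill Embarked with its mode.
alt = toy.drop(columns="Cabin")
alt["Embarked"] = alt["Embarked"].fillna(alt["Embarked"].mode()[0])
print(alt["Embarked"].tolist())
```

The first approach ends with three rows and the second keeps all four, which is the trade-off between deletion and imputation in miniature.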

To conclude this section, let's see what our data set looks like now:

In [25]:
titanic_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | S |


## Feature Selection: Choosing the Right Columns

When we work with a dataset, especially one with many columns like the Titanic dataset, we often don't need to use every single column for our analysis. Feature selection, or choosing which features (columns) to include in our analysis, is a critical step in the data cleaning and preparation process.

The features that we include will depend on what sort of question(s) we are interested in answering. For example, suppose we are interested in the research question, **"How did age, gender, and passenger class affect survival on the Titanic?"** This question is interesting for several reasons, including:

1.  The sinking of the Titanic is a significant event in history, and understanding the factors that influenced survival can shed light on societal norms of that era, like "women and children first."

2. The factors influencing survival in a disaster like the sinking of the Titanic could offer insights applicable to emergency planning and response today.

3. This question also provides a fantastic opportunity to learn and practice data cleaning, EDA, and data analysis techniques.

To answer this question, the 'Age', 'Sex', 'Pclass', and 'Survived' columns are particularly relevant:

- **Age:** Age could have been a factor in survival – perhaps younger or older individuals were less likely to survive.

- **Sex:** It's often said that in maritime disasters, the protocol was "women and children first" for lifeboats. The 'Sex' column can help us investigate whether this was indeed the case on the Titanic.

- **Pclass:** The 'Pclass' (passenger class) column can indicate socioeconomic status. First-class passengers had cabins closer to the deck and might have had better access to lifeboats, possibly influencing survival chances.

- **Survived:** Obviously, to understand what factors influenced survival, we need to know who survived and who didn't.



To keep only the columns we're interested in, we can use the following syntax:

In [26]:
titanic_df = titanic_df[['Survived', 'Pclass', 'Sex', 'Age']]
titanic_df.head()

| | Survived | Pclass | Sex | Age |
|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 |
| 1 | 1 | 1 | female | 38.0 |
| 2 | 1 | 3 | female | 26.0 |
| 3 | 1 | 1 | female | 35.0 |
| 4 | 0 | 3 | male | 35.0 |


The first line redefines `titanic_df` so that it contains only the selected columns. The outer pair of square brackets is the indexing operator, and the inner pair creates a Python list of column names, each enclosed in quotes and separated by commas. Passing a list of names returns a DataFrame with just those columns, so running this code simplifies our DataFrame to the four columns we need.
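The difference between single and double brackets is worth seeing once. In this sketch, `df`, `wanted`, and the other variable names are made up for illustration, and the values are the first three rows shown earlier.

```python
import pandas as pd

df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Fare": [7.25, 71.2833, 7.925],
})

# Single brackets with one name return a Series (a single column).
one_column = df["Sex"]
print(type(one_column))

# Double brackets (a list of names) return a DataFrame, even for one column.
one_column_frame = df[["Sex"]]
print(type(one_column_frame))

# The list can be stored in a variable first, which makes the two steps explicit.
wanted = ["Survived", "Pclass", "Sex"]
subset = df[wanted]
print(subset.columns.tolist())
```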

### Why Not Keep Everything?
But you may ask, "Why not just keep all the columns? What's the harm?" Here's why feature selection is important:

-   **Simplicity:** Reducing the number of features makes your dataset easier to understand. It also simplifies any models or visualizations you create, making your results more interpretable.

-   **Efficiency:** Fewer features means less data, which can speed up computations and analysis.

-  **Quality:** Some features may not be useful or may even introduce noise or bias to your analysis. By focusing on the most relevant features, you can improve the quality of your analysis.

The choice of features to include in your analysis should be guided by your understanding of the data and the specific research question you are trying to answer. If a feature does not contribute to that question, it may be a candidate for removal.
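The efficiency point can be checked directly by comparing memory use before and after selecting columns. This is a rough sketch on a made-up three-row DataFrame (`full`), so the absolute numbers are tiny. The point is only that the subset is smaller.

```python
import pandas as pd

# A toy DataFrame that includes two text-heavy columns we do not need.
full = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Allen, Mr. William Henry"],
    "Ticket": ["A/5 21171", "STON/O2. 3101282", "373450"],
    "Survived": [0, 1, 0],
    "Pclass": [3, 3, 3],
    "Sex": ["male", "female", "male"],
    "Age": [22.0, 26.0, 35.0],
})

subset = full[["Survived", "Pclass", "Sex", "Age"]]

# memory_usage(deep=True) measures text columns fully; sum() totals all columns.
full_bytes = full.memory_usage(deep=True).sum()
subset_bytes = subset.memory_usage(deep=True).sum()
print(full_bytes, subset_bytes)
```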

## Activity: Research Questions and Choice of Factors

Suppose that, instead of examining the relationship between gender, age, class and survival, we were interested in a *different* question. For each of the following research questions, identify which features (columns) might be relevant.

1. Question: How did the fare price affect the survival rate on the Titanic?

2. Question: Did people with families onboard have a higher survival rate?

3. Question: Were passengers from certain embarkation ports more likely to survive?

4. Question: Did the cabin location (as determined by cabin number) influence survival rate?

5. Question: Did the title of passengers (which can be extracted from their names) influence survival chances?
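Some of these questions depend on information that is not stored in a column of its own but can be derived from existing columns. As a hedged illustration, the sketch below extracts a title from each name and adds up family members aboard. The column names `Title` and `FamilySize` and the regular expression are assumptions made for this example, not part of the dataset.

```python
import pandas as pd

# Three passengers from the first rows of the dataset.
people = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
})

# Names follow the pattern "Surname, Title. Given names".
# The pattern captures the text between the comma and the first period.
people["Title"] = people["Name"].str.extract(r",\s*([^.]+)\.", expand=False)

# Relatives aboard: siblings/spouses plus parents/children.
people["FamilySize"] = people["SibSp"] + people["Parch"]

print(people[["Title", "FamilySize"]])
```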


## My Answers
For each research question above, identify the factors/columns that might be relevant:

1.

2.

3.

4.

5.