# Exploratory Data Analysis with the Titanic Dataset

This dataset is the training dataset from Kaggle's ["Titanic - Machine Learning from Disaster"](https://www.kaggle.com/c/titanic) competition.

## Import module

In [None]:
# import the module to use pandas and give it the alias "pd"

import pandas as pd

## Import the data

In [None]:
# The dataset is contained in a CSV file, which is read here directly from a URL.
# Use pandas read_csv function to import the data into a dataframe.

df = pd.read_csv('https://raw.githubusercontent.com/benjum/UCLAX-24Fall-EDA/main/Data/titanic.csv')

## Look at the data

* look at snapshots of the dataframe
  * `df`, `df.head()`, `df.tail()`, `df.sample()`
* look at the sizes
  * `df.shape`: look at the size of the data
* look at column names
  * `df.columns`: look at column names
* look at summary information
  * `df.describe()`: statistical summary info
  * `df.info()`: data types, sizes, column labels, null values

In [None]:
# Put the dataframe variable by itself on a line and execute the cell
# You should see an abbreviated output of the dataframe contents

df

In [None]:
# What happens when you print the dataframe with the print function?

print(df)

In [None]:
# Look at the first 5 rows

df.head()

In [None]:
# Look at the last 5 rows

df.tail(5)

In [None]:
# Look at 5 sample rows

df.sample(5)

In [None]:
# Look at the number of rows and columns

df.shape

Look at the description and details of the training data on the data page:
https://www.kaggle.com/competitions/titanic/data?select=train.csv

Do your number of rows and columns match with the description/details?

-> There are 891 passengers, 12 columns of features

In [None]:
# Look at the column names
# Do these match your expectations based on the documentation? (included below)

df.columns

-> These are the column names

Let's consult information from the Kaggle site to get more information.

| Variable | Definition | Key| 
| :-- | :-- | :-- |
| survival | Survival | 0 = No, 1 = Yes| 
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd| 
| sex | Sex | | 
| Age | Age in years | | 
| sibsp | # of siblings / spouses aboard the Titanic | | 
| parch | # of parents / children aboard the Titanic | | 
| ticket | Ticket number | | 
| fare | Passenger fare | | 
| cabin | Cabin number | | 
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton| 

**Variable Notes**

pclass: A proxy for socio-economic status (SES)
* 1st = Upper
* 2nd = Middle
* 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...
* Sibling = brother, sister, stepbrother, stepsister
* Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
* Parent = mother, father
* Child = daughter, son, stepdaughter, stepson
* Some children travelled only with a nanny, therefore parch=0 for them.

In [None]:
# Use the describe method to get summary statistical information about the quantitative data

df.describe()

* What information does this show?
  * What is the average survival rate?
  * What is the age range?
  * What is the mean age?
  * How many have siblings or spouses?
  * How does the standard deviation of the fare compare with its mean value?

Are the answers to the above reasonable?
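
One way to sanity-check these answers is to compute the same quantities directly. The sketch below uses a small made-up dataframe (the values are hypothetical, not taken from the Titanic data) that borrows a few of the column names; on the real data the same expressions would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data that mimics a few Titanic columns
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Age': [22.0, 38.0, 26.0, 35.0],
    'SibSp': [1, 1, 0, 0],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# The mean of a 0/1 column is the fraction of 1s, i.e. the survival rate
survival_rate = toy['Survived'].mean()

# Age range and mean age
age_min, age_max = toy['Age'].min(), toy['Age'].max()
mean_age = toy['Age'].mean()

# Number of passengers with at least one sibling or spouse aboard
n_with_sibsp = (toy['SibSp'] > 0).sum()

# Ratio of the standard deviation of Fare to its mean
fare_ratio = toy['Fare'].std() / toy['Fare'].mean()

print(survival_rate, age_min, age_max, mean_age, n_with_sibsp, fare_ratio)
```

A ratio well above 1 for Fare would hint that a few large values are stretching the spread, which is worth keeping in mind when plotting.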

## Aside: Checking categorical variables

In [None]:
# You can use the describe method to also get summary information about the categorical data

df.describe(include='all')

In [None]:
df['Survived'].nunique()

In [None]:
df['Pclass'].nunique()

In [None]:
df['Embarked'].nunique()

In [None]:
df['Survived'].unique()

In [None]:
df['Pclass'].unique()

In [None]:
df['Embarked'].unique()

Note that NaNs will show up as values in the output of `unique`, while `nunique` ignores them by default.

`nunique` and `unique` also work for numerical columns, though they are less useful there than the summaries produced by `describe`.

In [None]:
df['Age'].unique()

In [None]:
df['Age'].sort_values().unique()
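
To make the NaN behaviour of `unique` and `nunique` concrete, here is a minimal sketch on a made-up Series (the port codes are hypothetical sample values). It also shows `value_counts`, which is not used elsewhere in this notebook but is a common companion for categorical columns.

```python
import numpy as np
import pandas as pd

# Hypothetical port codes with one missing value
embarked = pd.Series(['S', 'C', 'S', np.nan, 'Q', 'S'])

# unique() lists NaN alongside the real categories
print(embarked.unique())

# nunique() ignores NaN unless dropna=False is passed
n_without_nan = embarked.nunique()
n_with_nan = embarked.nunique(dropna=False)
print(n_without_nan, n_with_nan)

# value_counts() gives the count per category; dropna=False adds a row for NaN
counts = embarked.value_counts(dropna=False)
print(counts)
```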

## Back to the original EDA notebook

In [None]:
# Use the "info" method to get a summary description of the dataframe's contents.

df.info()

-> Which columns have null values?  And what is the percentage of nulls for those that do?

-> Do the data types make sense? (The below table describes data types for reference)

<table class="table table-striped">
  <thead>
    <tr>
      <th>Pandas Type</th>
      <th>Native Python Type</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>object</td>
      <td>string</td>
      <td>The most general dtype. Will be assigned to your column if column has mixed types (numbers and strings).</td>
    </tr>
    <tr>
      <td>int64</td>
      <td>int</td>
      <td>Whole numbers. 64 refers to the number of bits of memory used to store each value.</td>
    </tr>
    <tr>
      <td>float64</td>
      <td>float</td>
      <td>Numbers with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, because NaN is itself a floating-point value.</td>
    </tr>
    <tr>
      <td>datetime64, timedelta[ns]</td>
      <td>N/A (but see the <a href="http://doc.python.org/2/library/datetime.html">datetime</a> module in Python’s standard library)</td>
      <td>Values meant to hold time data. Look into these for time series experiments.</td>
    </tr>
  </tbody>
</table>
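
The rules in the table can be checked on a tiny made-up dataframe. This is only an illustrative sketch; the column names and values are hypothetical and were chosen so that each column triggers a different dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical columns chosen to trigger each dtype
demo = pd.DataFrame({
    'whole_numbers': [1, 2, 3],        # integers only -> int64
    'with_missing': [1, np.nan, 3],    # integers plus NaN -> float64
    'mixed': [1, 'two', 3.0],          # numbers and strings -> object
})

print(demo.dtypes)
```

This is why the Age column, which has missing values, is expected to show up as float64 in `df.info()`.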

Let's change a column's datatype from int to string (which becomes an object to pandas):

In [None]:
df['Survived'].astype(str)

In [None]:
# Use "info" again to see whether that changed anything

df.info()

Whoops!  The astype method returned a new Series, but it didn't change the underlying dataframe.  To do that, we need to explicitly assign the returned column back into the `df['Survived']` column.

In [None]:
df['Survived'] = df['Survived'].astype(str)

In [None]:
# Let's look again
df.info()

In [None]:
# We'll change two other columns too
df['PassengerId'] = df['PassengerId'].astype(str)
df['Pclass'] = df['Pclass'].astype(str)

## Visualization

Now for some fun stuff.  Let's try to make some simple plots to see what observations we can make.

In [None]:
# List the values of the "Fare" column
# It's ok if the output is abbreviated

df['Fare']

In [None]:
# Use the "plot" method to generate the default plot of "Fare" values

df['Fare'].plot()

This shows Index vs Fare, i.e., what the value of every Fare was.  We can get a sense of what all the fares were from this, but really we probably want to see a distribution of values.

In [None]:
# Use the "plot" method again, but now set the "kind" input parameter of plot to be equal to "hist"
# This should generate a histogram of Fare values.

df['Fare'].plot(kind='hist')

It looks like there are a bunch of low cost tickets, or maybe just a few very *very* expensive tickets.

**Our first look at potentially suspicious values:**  Are there any 0 values?

In [None]:
# Use "loc" and a boolean condition to output those rows that have a 0 value for Fare

df.loc[df['Fare']==0]

A brief search of some names will show that Mr Lionel Leonard, William Cahoone Johnson Jr., Alfred Johnson, and William Henry Tornquist were American Line employees.  It may make sense that they would have traveled on a complimentary fare.

## Aside: drop some unwanted rows

### Method 1: simply make a copy of what you want to use

In [None]:
df_nonzerofare = df.loc[df['Fare']!=0]

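Double-checking that the zero-fare rows are absent (the row counts of the two dataframes should differ by the number of zero-fare passengers listed above).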

In [None]:
df_nonzerofare.shape

In [None]:
df.shape

With this method, you may want to reset the index, because the filtered dataframe keeps the original index labels.  Looking at the zero-fare rows above, we can see that our previous row 822 had a zero-fare passenger, so that label is now missing.

In [None]:
df_nonzerofare.loc[820:825]

To reset the index:

In [None]:
# drop = True means to discard the existing index
# otherwise a new column of the previous indices will be added to the dataframe
# inplace = True means to change the underlying dataframe
# otherwise reset_index will return a new dataframe and leave the original unchanged

df_nonzerofare.reset_index(drop=True,inplace=True)

In [None]:
df_nonzerofare.loc[820:825]

In [None]:
df_nonzerofare.head(2)

In [None]:
df_nonzerofare.tail(2)
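
The same filter-then-renumber pattern can also be written without `inplace=True`, by assigning the result of `reset_index`. The sketch below shows this on a small made-up dataframe (hypothetical names and fares) so the index labels are easy to follow.

```python
import pandas as pd

# Hypothetical toy data: two of the five rows have a zero fare
toy = pd.DataFrame({'Name': ['A', 'B', 'C', 'D', 'E'],
                    'Fare': [7.25, 0.0, 8.05, 0.0, 13.0]})

# Filtering keeps the original index labels of the surviving rows (0, 2, 4)
filtered = toy.loc[toy['Fare'] != 0]
print(filtered.index.tolist())

# reset_index returns a new dataframe with labels 0, 1, 2;
# assigning the result avoids the need for inplace=True
renumbered = toy.loc[toy['Fare'] != 0].reset_index(drop=True)
print(renumbered.index.tolist())
```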

### Method 2: drop the rows from your dataframe

Simply assign the filtered result back into df!

Do remember that this is destructive.  If you want to analyze the zero-fare rows later, you'll need to make the original dataframe again.

In [None]:
df = df.loc[df['Fare']!=0]

Checking shape:

In [None]:
df.shape

Checking index:

In [None]:
df.loc[820:825]

Resetting the index, this time without drop=True, to show that the previous index is kept as a new column:

In [None]:
df.reset_index(inplace=True)

In [None]:
df.loc[820:825]

### Another way to drop the rows

We don't have the original df anymore, so reinitialize it:

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/benjum/UCLAX-24Fall-EDA/main/Data/titanic.csv')

Get indices of rows that have zero fare:

In [None]:
df.loc[df['Fare'] == 0].index

In [None]:
# Get indices of rows that have zero fare
indexNames = df.loc[df['Fare'] == 0].index

# Delete these row indices from df
df.drop(indexNames, inplace=True)

In [None]:
df.loc[820:825]

We don't have the original df anymore, so reinitialize it:

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/benjum/UCLAX-24Fall-EDA/main/Data/titanic.csv')

## Back to the original EDA notebook

... more investigation may be warranted...  But let's look at the columns that have 'NaN'.

In [None]:
# if you use the "isna" method of the dataframe, what does that output?

df.isna()

In [None]:
# you can get the count of null values for any column by taking the sum of the True/False values of isna
# That is, look at the output of the following:

df.isna().sum()

In [None]:
# Use the shape attribute (a tuple) and an index to get the number of rows

df.shape[0]

In [None]:
# Divide df.isna().sum() by the number of rows to find the fraction of null values for all columns
# (multiply by 100 to express it as a percentage)

df.isna().sum() / df.shape[0]

* What percentage of age data is missing?
* What percentage of cabin data is missing?
* What percentage of embarked data is missing?
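
A compact alternative to dividing the sum by the row count is to take the mean of the boolean mask. The sketch below assumes a small made-up dataframe (hypothetical values) rather than the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two columns
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# isna() gives True/False; the mean of booleans is the fraction of True values
fraction_missing = toy.isna().mean()

# Multiply by 100 to express it as a percentage
percent_missing = fraction_missing * 100
print(percent_missing)
```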

If we want to use those data columns, we would potentially stop here and try to figure out how we need/want to deal with the values that are missing.  For example, we could:
* drop the column completely
* drop the rows with NaNs
* fill the NaNs with other values (a useful value like mean or median, the previous or next row's value, a constant, or the result of an operation)
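
As a rough illustration of the fill options listed above, the sketch below applies each one to a short made-up Series of ages (hypothetical values). `ffill` and `bfill` are the pandas methods for carrying the previous or next row's value into a gap.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
ages = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0])

filled_median = ages.fillna(ages.median())   # median of 22, 26, 35 is 26
filled_previous = ages.ffill()               # carry the previous row's value forward
filled_next = ages.bfill()                   # carry the next row's value backward
filled_constant = ages.fillna(0)             # a fixed placeholder value

print(pd.DataFrame({'original': ages, 'median': filled_median,
                    'previous': filled_previous, 'next': filled_next,
                    'constant': filled_constant}))
```

Which option is appropriate depends on the column: carrying values between rows only makes sense when neighbouring rows are related, which is not obviously the case for passenger ages.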

## Aside: Cleaning NaNs

Options:
* Drop records that have missing values
  * `pd.DataFrame.dropna()`
  * default is to drop rows.  This can be explicitly specified with `pd.DataFrame.dropna(axis=0)`
* Drop an entire feature that has lots of missing values
  * `pd.DataFrame.drop(<feature_name>, axis=1)`
  * `<feature_name>` is the name of the column to drop
* Fill in missing values with something else
  * Example: Impute the mean/median (if quantitative) or most common class (if categorical) for all missing values.
    * `pd.DataFrame.fillna(value=x.mean())`

To demonstrate taking care of NaNs, let's create a copy of the first 5 rows of the dataframe that have zero fares -- this will make it easier to see exactly what each option does.

Reinitialize df so that we start again from the original data:

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/benjum/UCLAX-24Fall-EDA/main/Data/titanic.csv')

In [None]:
df_sample = df.loc[df['Fare']==0][0:5].copy()

In [None]:
df_sample

In [None]:
df_sample.dropna()

In [None]:
df_sample

In [None]:
df_sample.drop('Cabin',axis=1)

In [None]:
df_sample

Here we explicitly calculate the median.  The first fillna call returns a new dataframe and leaves `df_sample` unchanged; the second changes the underlying dataframe with inplace=True.

In [None]:
median = df_sample['Age'].median()

In [None]:
df_sample.fillna({'Age': median})

In [None]:
df_sample

In [None]:
df_sample.fillna({'Age': median}, inplace=True)

In [None]:
df_sample

Or we could do the calculation and pass it into the "value" parameter of fillna.

In [None]:
# reinitialize
df_sample = df.loc[df['Fare']==0][0:5].copy()

In [None]:
df_sample

In [None]:
df_sample['Age'].fillna(value=df_sample['Age'].median())

In [None]:
df_sample['Age'] = df_sample['Age'].fillna(value=df_sample['Age'].median())

In [None]:
df_sample

...or... we could just assign any specific value we choose:

In [None]:
# reinitialize one more time
df_sample = df.loc[df['Fare']==0][0:5].copy()

In [None]:
df_sample

In [None]:
df_sample['Age'] = df_sample['Age'].fillna(value=8008)

In [None]:
df_sample

## Back to the original EDA notebook

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/benjum/UCLAX-24Fall-EDA/main/Data/titanic.csv')

Further analysis: let's see how Age is related to Survived.

Here are the variables we might like to look at:
* `df.loc[df['Survived'] == '0', 'Age']`: the Age values of those who did not survive
* `df.loc[df['Survived'] == '1', 'Age']`: the Age values of those who did survive

Let's use matplotlib to do a histogram of these.

In [None]:
# Use the "hist()" method of dataframes to make a histogram plot of Age values for those who survived

df.loc[df['Survived'] == '1', 'Age'].hist()

## Another aside about re-initializing df

* Importing the data again into df was not sufficient to re-initialize everything
  * We didn't change the datatype of the Survived column like before, so it is an integer column again and the comparison with the string `'1'` matches no rows!!
* If you have definite sequences of operations that you _definitely_ always want to perform when importing, encapsulate them in a function or into some saved workflow steps


In [None]:
def importtitanic():
    df_titanic = pd.read_csv('https://raw.githubusercontent.com/benjum/UCLAX-24Fall-EDA/main/Data/titanic.csv')
    df_titanic['Survived'] = df_titanic['Survived'].astype(str)
    df_titanic['PassengerId'] = df_titanic['PassengerId'].astype(str)
    df_titanic['Pclass'] = df_titanic['Pclass'].astype(str)
    return df_titanic
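
An alternative to converting columns after loading is to request the types up front through the `dtype` parameter of `read_csv`. The sketch below is an illustration only: it reads a hypothetical three-row CSV from a string (so it runs offline) instead of the real file.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the CSV file, so the example runs offline
csv_text = "PassengerId,Survived,Pclass,Age\n1,0,3,22\n2,1,1,38\n3,1,3,26\n"

# dtype maps column names to the type pandas should use while parsing
toy = pd.read_csv(io.StringIO(csv_text),
                  dtype={'PassengerId': str, 'Survived': str, 'Pclass': str})

print(toy.dtypes)

# Comparisons against the string '1' now match as intended
survivor_ages = toy.loc[toy['Survived'] == '1', 'Age'].tolist()
print(survivor_ages)
```

Either approach works; the point is that the type conversion lives in one place and is applied every time the data is loaded.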

## Let's get back into the EDA notebook with our new functionality

In [None]:
df = importtitanic()

In [None]:
df.info()

All looks good?

How do we know? -> double-check that datatypes are as we want them.

In [None]:
# Make another histogram plot of Age values for those who did not survive

df.loc[df['Survived'] == '0', 'Age'].hist()

In [None]:
# What happens if you put the commands to make both histograms here and execute the cell?

df.loc[df['Survived'] == '0', 'Age'].hist()
df.loc[df['Survived'] == '1', 'Age'].hist()

It would be nice to plot the bars next to each other too to directly compare them.

We can tie in a little bit of another Python plotting package, Matplotlib, to help.

In [None]:
# Execute this cell
import matplotlib.pyplot as plt

In [None]:
# Execute this cell
a = df.loc[df['Survived'] == '0', 'Age']
b = df.loc[df['Survived'] == '1', 'Age']
plt.hist([a,b]);

In [None]:
# Copy the above commands here
# And insert another condition so that you plot data only for Age values > 18
# You'll need to use the "&" symbol to combine two conditions with "and"

a = df.loc[(df['Survived'] == '0') & (df['Age'] > 18), 'Age']
b = df.loc[(df['Survived'] == '1') & (df['Age'] > 18), 'Age']
plt.hist([a,b]);

In [None]:
# Now try again for Age < 18

a = df.loc[(df['Survived'] == '0') & (df['Age'] < 18), 'Age']
b = df.loc[(df['Survived'] == '1') & (df['Age'] < 18), 'Age']
plt.hist([a,b]);
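
The way `&` combines conditions can be seen more easily on a handful of rows. This is a minimal sketch on made-up data (hypothetical values), using the same string-typed Survived column as above.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'Survived': ['0', '1', '1', '0', '1'],
                    'Age': [40.0, 12.0, 30.0, 5.0, 18.0]})

# Each comparison produces a boolean Series; & combines them element by element.
# The parentheses are required because & binds more tightly than == and >.
adult_survivors = toy.loc[(toy['Survived'] == '1') & (toy['Age'] > 18), 'Age']
child_survivors = toy.loc[(toy['Survived'] == '1') & (toy['Age'] < 18), 'Age']

print(adult_survivors.tolist())
print(child_survivors.tolist())
```

Note that with `> 18` and `< 18` a passenger aged exactly 18 falls into neither group; use `>=` or `<=` on one side if those rows should be included.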

## Adding in the groupby functionality

Let's say we want to look at Embarked now too... this is a categorical variable.

In [None]:
df.groupby(['Embarked'])['Survived'].count()

In [None]:
df.groupby(['Embarked'])['Survived'].count().plot.bar()

This returns only the total number of rows for each port (more precisely, the number of non-null Survived values), not separate counts of Survived and Not Survived.

To get the grouping by Survived too, we need to include that column in the groupby:

In [None]:
df.groupby(['Embarked','Survived'])['Survived'].count()

In [None]:
df.groupby(['Embarked','Survived'])['Survived'].count().plot.bar()

The side-by-side bar plots are easier to make if we first filter on the 0/1 values of Survived, followed then by the grouping.

In [None]:
a = df.loc[df['Survived'] == '0'].groupby(['Embarked'])['Survived'].count()
b = df.loc[df['Survived'] == '1'].groupby(['Embarked'])['Survived'].count()

In [None]:
a

In [None]:
b

To generalize this to arbitrary numbers of categories, and make the side-by-side plots, we'll use `numpy` and `matplotlib`.

In [None]:
import numpy as np

In [None]:
len(a)

In [None]:
np.arange(len(a))

In [None]:
X_axis = np.arange(len(a))

In [None]:
plt.bar(X_axis - 0.2, a, width=0.3, label = 'Did Not Survive')
plt.bar(X_axis + 0.2, b, width=0.3, label = 'Survived')
plt.xticks(X_axis, a.index)
plt.legend()
plt.show()
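
Another possible route to the same side-by-side layout is to reshape the two-level groupby result into a table with one column per outcome. The sketch below uses `size` and `unstack` on a small made-up dataframe (hypothetical values); it is an alternative to the manual bar positioning above, not a replacement for it.

```python
import pandas as pd

# Hypothetical toy data with the same column names and a string-typed Survived
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'C', 'C', 'Q', 'S'],
    'Survived': ['0', '1', '1', '1', '0', '0'],
})

# size() counts rows per (Embarked, Survived) pair;
# unstack() moves the Survived level into columns, one column per outcome
counts = toy.groupby(['Embarked', 'Survived']).size().unstack(fill_value=0)
print(counts)

# counts.plot.bar() would then draw one group of bars per port
```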

## Exercise

Try to make some plots on your own now.

1. Make a boxplot of Fare
  * Remember that the `kind` parameter can make it easy to make a variety of elementary plots with, for example, `kind='box'`
  * After doing this for all Fare values, make another boxplot that only plots Fares < 100
1. Make a histogram of Fare
  * Do this again for all Fares as well as for only Fares < 100
1. Make a bar chart for the Parch variable that shows the count of each unique Parch value
1. If you get to this step, try making plots of a couple other variables like Pclass, etc.