## Titanic Data Exploration ##

***

Over the past several weeks, you've learned how to write code to explore and manipulate a dataset. Now it's time to practice what you've learned on a real-world dataset.

***

### Titanic Dataset

The Titanic dataset holds information about the passengers on the Titanic. This includes each passenger's name, characteristics, and whether they survived the accident. The dataset has the following columns:

    * pclass = passenger class; 1 = first class, 2 = second class, 3 = third class
    * survived = passenger survival; 1 = survived, 0 = did not survive
    * name = passenger name
    * sex = sex of passenger
    * age = age of passenger
    * sibsp = # of siblings / spouses aboard the Titanic
    * parch = # of parents / children aboard the Titanic
    * ticket = ticket number
    * fare = fare paid by passenger
    * cabin = passenger cabin
    * embarked = port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton
    * boat = lifeboat assignment 
    * body = recovered body number
    * home.dest = anticipated home destination
    
If you need some additional motivation before starting, please visit: https://www.youtube.com/watch?v=3gK_2XdjOdY

### How to work through the dataset:

Follow the prompts below to explore, manipulate, and visualize aspects of the dataset. Working with data takes time, so take your time as you start with a messy dataset and turn it into something that shows meaningful visualizations. 

***


### Import Libraries and Dataset

* Review the entire notebook to determine what you will be expected to do - then, import the necessary libraries
* Import the titanic.xlsx dataset

In [8]:
import pandas as pd
import numpy as np
import scipy.stats as stats 
import statsmodels.formula.api as sm 

In [9]:
df = pd.read_excel("titanic.xlsx")

In [10]:
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


### Determine the Characteristics of the Dataset

   * How many columns are in this dataset?
   * How many rows are in this dataset?
   * What types of data are in each column? Does this make sense with what you know about that column?
   * Which variables are numeric? Which variables are categorical? What other variables are left outside of these two groups?
   * Which variable could be considered a 'dependent' variable?

In [12]:
df.info()
df.shape
# Dependent variable: survived
# Numeric dtypes: pclass, survived, age, sibsp, parch, fare, and body
# (pclass and survived are numeric codes that stand for categories)
# Object (text) dtypes: name, sex, ticket, cabin, embarked, boat, and home.dest

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   pclass     1309 non-null   int64   
 1   survived   1309 non-null   int64   
 2   name       1309 non-null   object  
 3   sex        1309 non-null   object  
 4   age        1046 non-null   float64 
 5   sibsp      1309 non-null   int64   
 6   parch      1309 non-null   int64   
 7   ticket     1309 non-null   object  
 8   fare       1308 non-null   float64 
 9   cabin      295 non-null    object  
 10  embarked   1307 non-null   object  
 11  boat       486 non-null    object  
 12  body       121 non-null    float64 
 13  home.dest  745 non-null    object  
 14  age group  1046 non-null   category
dtypes: category(1), float64(3), int64(4), object(7)
memory usage: 144.8+ KB


(1309, 15)
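
The numeric versus non-numeric split can also be read off the dtypes directly. The sketch below is only an illustration on a small made-up frame (`sample` and its values are assumptions, not rows from `titanic.xlsx`); it uses `select_dtypes` to list the columns stored as numbers and the ones that are not. The dtype alone does not settle the question: a column such as `pclass` is stored as an integer but still describes a category.

```python
import pandas as pd

# Toy frame that mimics a few Titanic columns (values are made up for illustration)
sample = pd.DataFrame({
    "pclass": [1, 3, 2],
    "survived": [1, 0, 0],
    "sex": ["female", "male", "male"],
    "age": [29.0, None, 41.5],
    "ticket": ["24160", "A/5 21171", "113781"],
})

# Columns stored as numbers (int or float)
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else (text columns in this toy frame)
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("other:  ", other_cols)
```

The same two calls should work on `df` once it has been loaded.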

### Identify the Missing Data in the Dataset

   * Is there any missing data?
   * Which columns have any missing data?
   * Which column has the most missing information? Which column has the least?

In [13]:
df.isnull().sum()
# Columns with missing data: age, fare, cabin, embarked, boat, body, and home.dest
# body has the most missing values (1188); fare has the fewest (1)

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
age group     263
dtype: int64
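
To answer the "most" and "least" questions without scanning the output by eye, the counts can be filtered and sorted. This is a minimal sketch on made-up data (the `sample` frame is an assumption); it keeps only the columns that have gaps, ranks them, and adds the share of rows that are missing.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in three of its four columns (values are made up)
sample = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, np.nan],
    "fare": [211.34, 151.55, np.nan, 7.25],
    "body": [np.nan, np.nan, np.nan, 135.0],
    "sex": ["female", "male", "female", "male"],
})

# Count missing values per column, keep only columns with at least one gap,
# and sort from most to least missing
missing = sample.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

# Share of rows missing in each of those columns
share = (missing / len(sample)).round(2)
print(pd.DataFrame({"missing": missing, "share": share}))

# Column names with the most and the fewest gaps
most_missing = missing.idxmax()
least_missing = missing.idxmin()
print(most_missing, least_missing)
```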

### Handling the Missing Data in the Dataset

   * Remove the columns with excessive missing data (any column missing greater than 500 rows)
   * When there is very little missing data, we can make replacements. Replace the missing data for the "embarked" column with the most common embarkation point. 
   * Replace the missing data in "fare" with the average fare of the entire sample. 
   * Remove the rows in the dataset that have missing "age" data. 
   * Recheck whether any data is still missing in the dataset. 

In [14]:
df.drop(columns = ["home.dest", "cabin", "boat", "body"], inplace = True)


df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,age group
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,S,adult
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,S,infant
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,S,child
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,S,adult
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,S,adult


In [15]:
# fill with the most common embarkation point rather than a hard-coded value
df["embarked"].fillna(df["embarked"].mode()[0], inplace = True)

df

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,age group
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,S,adult
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,S,infant
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,S,child
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,S,adult
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,S,adult
...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,C,teen
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,C,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,C,adult
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,C,adult


In [16]:
print(df["fare"].mean())

df["fare"].fillna(df["fare"].mean(), inplace = True)

df

33.29547928134572


Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,age group
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,S,adult
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,S,infant
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,S,child
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,S,adult
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,S,adult
...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,C,teen
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,C,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,C,adult
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,C,adult


In [17]:
df.dropna(inplace = True)

df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,age group
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,S,adult
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,S,infant
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,S,child
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,S,adult
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,S,adult


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1046 entries, 0 to 1308
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   pclass     1046 non-null   int64   
 1   survived   1046 non-null   int64   
 2   name       1046 non-null   object  
 3   sex        1046 non-null   object  
 4   age        1046 non-null   float64 
 5   sibsp      1046 non-null   int64   
 6   parch      1046 non-null   int64   
 7   ticket     1046 non-null   object  
 8   fare       1046 non-null   float64 
 9   embarked   1046 non-null   object  
 10  age group  1046 non-null   category
dtypes: category(1), float64(2), int64(4), object(4)
memory usage: 91.1+ KB
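
The cleaning steps in this section can also be written so that the thresholds and fill values are computed rather than typed in. The sketch below is one possible version on a tiny made-up frame; `sample`, `max_missing`, and `cleaned` are hypothetical names, and the threshold of 2 stands in for the 500 used in the exercise. It uses `dropna(subset=["age"])` so that only missing ages decide which rows are removed.

```python
import numpy as np
import pandas as pd

# Toy frame (values are made up); "cabin" is mostly empty
sample = pd.DataFrame({
    "age": [29.0, np.nan, 2.0, 30.0],
    "fare": [10.0, 20.0, np.nan, 30.0],
    "embarked": ["S", "S", None, "C"],
    "cabin": [None, None, None, "B5"],
})

# Step 1: drop columns with more missing values than the threshold
max_missing = 2  # hypothetical threshold for this toy frame; the exercise uses 500
too_sparse = sample.columns[sample.isnull().sum() > max_missing]
cleaned = sample.drop(columns=too_sparse)

# Step 2: fill "embarked" with its most common value
cleaned["embarked"] = cleaned["embarked"].fillna(cleaned["embarked"].mode()[0])

# Step 3: fill "fare" with the mean fare
cleaned["fare"] = cleaned["fare"].fillna(cleaned["fare"].mean())

# Step 4: drop only the rows where "age" is missing
cleaned = cleaned.dropna(subset=["age"])

# Step 5: recheck
print(cleaned.isnull().sum())
```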


### Creating Columns and Replacing Labels

   * Create descriptive labels for the categorical columns: pclass, survived, and embarked. Instead of the coding that shows in the dataset, create labels to describe what each category represents (e.g., in the embarked column S = Southampton)
   * Create a new column called "Titanic Passenger" and make all values 1
   * Create a new column called "Family Size" - this column should equal the total number of family members each passenger was traveling with.
   * Create a column called "Travel Alone" - this column should be 1 if the passenger was traveling alone, and 0 if the passenger was traveling with family. 
   * Create a column called "Has Caregiver" - this column should have a value of 1 if a passenger is less than 13-years old AND the passenger is traveling with at least one family member, otherwise the value should be 0. 
   * Create a column called "Crew" - this column should be 1 if the passenger paid 0 dollars for their ticket, and 0 otherwise. 
   * Create a column called "Age Group" to group passengers by their age (create five categories: infant, child, teen, adult, senior). You can use bins to complete this (or any other method you like). You define the cutoff points for each group you create. 
   
After creating the new columns, replace the basic "0/1" coding with meaningful labels. 

In [19]:
df["pclass"].replace([1, 2, 3], ["First Class", "Second Class", "Third Class"], inplace = True)

df

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,age group
0,First Class,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,S,adult
1,First Class,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,S,infant
2,First Class,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,S,child
3,First Class,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,S,adult
4,First Class,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,S,adult
...,...,...,...,...,...,...,...,...,...,...,...
1301,Third Class,0,"Youseff, Mr. Gerious",male,45.5000,0,0,2628,7.2250,C,adult
1304,Third Class,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,C,teen
1306,Third Class,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,C,adult
1307,Third Class,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,C,adult


In [20]:
df["survived"].replace([1, 0], ["survived", "did not survived"], inplace = True)

df

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,age group
0,First Class,survived,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,S,adult
1,First Class,survived,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,S,infant
2,First Class,did not survived,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,S,child
3,First Class,did not survived,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,S,adult
4,First Class,did not survived,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,S,adult
...,...,...,...,...,...,...,...,...,...,...,...
1301,Third Class,did not survived,"Youseff, Mr. Gerious",male,45.5000,0,0,2628,7.2250,C,adult
1304,Third Class,did not survived,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,C,teen
1306,Third Class,did not survived,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,C,adult
1307,Third Class,did not survived,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,C,adult


In [21]:
df["embarked"].replace(["c", "Q", "S"], ["Cherbourg", "Queenstown", "Southampton"], inplace = True)

df

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,age group
0,First Class,survived,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,Southampton,adult
1,First Class,survived,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,Southampton,infant
2,First Class,did not survived,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,Southampton,child
3,First Class,did not survived,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,Southampton,adult
4,First Class,did not survived,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,Southampton,adult
...,...,...,...,...,...,...,...,...,...,...,...
1301,Third Class,did not survived,"Youseff, Mr. Gerious",male,45.5000,0,0,2628,7.2250,C,adult
1304,Third Class,did not survived,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,C,teen
1306,Third Class,did not survived,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,C,adult
1307,Third Class,did not survived,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,C,adult
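
One thing worth checking in the cell above: the codes in `embarked` are uppercase (`C`, `Q`, `S`), so the lowercase `"c"` in the replacement list matches nothing, and the Cherbourg rows keep the value `C`, as the last rows of the output show. The sketch below reproduces the effect on a small made-up Series and shows one possible alternative, a dictionary passed to `map`; the names `port_names`, `wrong`, and `labeled` are assumptions for illustration.

```python
import pandas as pd

# Toy Series of embarkation codes (made up for illustration)
embarked = pd.Series(["S", "C", "Q", "S", "C"])

# Lowercase "c" does not match the uppercase code, so "C" is left untouched
wrong = embarked.replace(["c", "Q", "S"], ["Cherbourg", "Queenstown", "Southampton"])
print(wrong.tolist())

# A dict keeps each code next to its label, which makes a mismatch easier to spot
port_names = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# map() turns any code missing from the dict into NaN, so typos become visible
labeled = embarked.map(port_names)
print(labeled.tolist())
```

Checking `value_counts()` right after relabeling is a quick way to confirm that no raw codes are left.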


In [22]:
df["Titanic Passenger"] = 1

df

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,age group,Titanic Passenger
0,First Class,survived,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,Southampton,adult,1
1,First Class,survived,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,Southampton,infant,1
2,First Class,did not survived,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,Southampton,child,1
3,First Class,did not survived,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,Southampton,adult,1
4,First Class,did not survived,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,Southampton,adult,1
...,...,...,...,...,...,...,...,...,...,...,...,...
1301,Third Class,did not survived,"Youseff, Mr. Gerious",male,45.5000,0,0,2628,7.2250,C,adult,1
1304,Third Class,did not survived,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,C,teen,1
1306,Third Class,did not survived,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,C,adult,1
1307,Third Class,did not survived,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,C,adult,1


In [23]:
df["Family Size"] = df["sibsp"] + df["parch"] + 1

df

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,age group,Titanic Passenger,Family Size
0,First Class,survived,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,Southampton,adult,1,1
1,First Class,survived,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,Southampton,infant,1,4
2,First Class,did not survived,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,Southampton,child,1,4
3,First Class,did not survived,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,Southampton,adult,1,4
4,First Class,did not survived,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,Southampton,adult,1,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1301,Third Class,did not survived,"Youseff, Mr. Gerious",male,45.5000,0,0,2628,7.2250,C,adult,1,1
1304,Third Class,did not survived,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,C,teen,1,2
1306,Third Class,did not survived,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,C,adult,1,1
1307,Third Class,did not survived,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,C,adult,1,1


In [24]:
df["Travel Alone"] = np.where((df["Family Size"] == 1), 1, 0)

df


Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,age group,Titanic Passenger,Family Size,Travel Alone
0,First Class,survived,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,Southampton,adult,1,1,1
1,First Class,survived,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,Southampton,infant,1,4,0
2,First Class,did not survived,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,Southampton,child,1,4,0
3,First Class,did not survived,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,Southampton,adult,1,4,0
4,First Class,did not survived,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,Southampton,adult,1,4,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1301,Third Class,did not survived,"Youseff, Mr. Gerious",male,45.5000,0,0,2628,7.2250,C,adult,1,1,1
1304,Third Class,did not survived,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,C,teen,1,2,0
1306,Third Class,did not survived,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,C,adult,1,1,1
1307,Third Class,did not survived,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,C,adult,1,1,1


In [25]:
df["Has caregiver"] = np.where((df["age"] < 13) & (df["Family Size"] >= 2), 1, 0)

df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,age group,Titanic Passenger,Family Size,Travel Alone,Has caregiver
0,First Class,survived,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,Southampton,adult,1,1,1,0
1,First Class,survived,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,Southampton,infant,1,4,0,1
2,First Class,did not survived,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,Southampton,child,1,4,0,1
3,First Class,did not survived,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,Southampton,adult,1,4,0,0
4,First Class,did not survived,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,Southampton,adult,1,4,0,0


In [26]:
df["crew"] = np.where((df["fare"]== 0), 1, 0)
df

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,age group,Titanic Passenger,Family Size,Travel Alone,Has caregiver,crew
0,First Class,survived,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,Southampton,adult,1,1,1,0,0
1,First Class,survived,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,Southampton,infant,1,4,0,1,0
2,First Class,did not survived,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,Southampton,child,1,4,0,1,0
3,First Class,did not survived,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,Southampton,adult,1,4,0,0,0
4,First Class,did not survived,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,Southampton,adult,1,4,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1301,Third Class,did not survived,"Youseff, Mr. Gerious",male,45.5000,0,0,2628,7.2250,C,adult,1,1,1,0,0
1304,Third Class,did not survived,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,C,teen,1,2,0,0,0
1306,Third Class,did not survived,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,C,adult,1,1,1,0,0
1307,Third Class,did not survived,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,C,adult,1,1,1,0,0


In [27]:
### convert the age into different groups

# (0, 1]    = infant
# (1, 12]   = child
# (12, 19]  = teen
# (19, 55]  = adult
# (55, 120] = seniors
#### STEP 1: create the bin limits ####

bins = [0, 1, 12, 19, 55, 120]

## the bin limits are the cutoff points for the values
## each number is the upper edge of one group and the lower edge of the next (0-1, 1-12, 12-19...)
## by default pd.cut includes the right edge, so an age of exactly 12 counts as "child"

bin_labels = ["infant", "child", "teen", "adult", "seniors"]

## the bin labels are the group names that will be created
## there should always be one fewer label than bin limits

#### STEP 2: apply your bins to a specific column (or create new column) in dataset  ####
## new column = pd.cut(column to apply to, bin cutoff list, labels = list of bin labels)
# pd.cut function segments and organizes values into the appropriate bin

df["age group"] = pd.cut(df["age"], bins, labels = bin_labels)


#### STEP 3: check changes  ####

df

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,age group,Titanic Passenger,Family Size,Travel Alone,Has caregiver,crew
0,First Class,survived,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,Southampton,adult,1,1,1,0,0
1,First Class,survived,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,Southampton,infant,1,4,0,1,0
2,First Class,did not survived,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,Southampton,child,1,4,0,1,0
3,First Class,did not survived,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,Southampton,adult,1,4,0,0,0
4,First Class,did not survived,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,Southampton,adult,1,4,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1301,Third Class,did not survived,"Youseff, Mr. Gerious",male,45.5000,0,0,2628,7.2250,C,adult,1,1,1,0,0
1304,Third Class,did not survived,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,C,teen,1,2,0,0,0
1306,Third Class,did not survived,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,C,adult,1,1,1,0,0
1307,Third Class,did not survived,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,C,adult,1,1,1,0,0
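
It can help to test the bin edges on a handful of boundary ages before trusting the grouping. The sketch below reuses the same `bins` and `bin_labels` on a made-up Series (`ages` is an assumption). With the default settings, `pd.cut` builds intervals that exclude the left edge and include the right edge, so an age of exactly 0 would fall outside the first bin unless `include_lowest=True` is passed.

```python
import pandas as pd

# Boundary ages chosen to probe each cutoff (made up for illustration)
ages = pd.Series([0.9167, 1.0, 2.0, 12.0, 14.5, 19.0, 30.0, 55.0, 60.0])

bins = [0, 1, 12, 19, 55, 120]
bin_labels = ["infant", "child", "teen", "adult", "seniors"]

# Default behavior: each bin is (left, right], so 1.0 is "infant" and 12.0 is "child"
groups = pd.cut(ages, bins, labels=bin_labels)
print(pd.DataFrame({"age": ages, "group": groups}))

# An age of exactly 0 is outside (0, 1] and becomes NaN...
zero_default = pd.cut(pd.Series([0.0]), bins, labels=bin_labels)

# ...unless the lowest edge is explicitly included
zero_included = pd.cut(pd.Series([0.0]), bins, labels=bin_labels, include_lowest=True)
print(zero_default.tolist(), zero_included.tolist())
```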


In [28]:
df["Travel Alone"].replace([1, 0], ["Travelling alone", "Travelling with family"], inplace = True)

df

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,age group,Titanic Passenger,Family Size,Travel Alone,Has caregiver,crew
0,First Class,survived,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,Southampton,adult,1,1,Travelling alone,0,0
1,First Class,survived,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,Southampton,infant,1,4,Travelling with family,1,0
2,First Class,did not survived,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,Southampton,child,1,4,Travelling with family,1,0
3,First Class,did not survived,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,Southampton,adult,1,4,Travelling with family,0,0
4,First Class,did not survived,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,Southampton,adult,1,4,Travelling with family,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1301,Third Class,did not survived,"Youseff, Mr. Gerious",male,45.5000,0,0,2628,7.2250,C,adult,1,1,Travelling alone,0,0
1304,Third Class,did not survived,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,C,teen,1,2,Travelling with family,0,0
1306,Third Class,did not survived,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,C,adult,1,1,Travelling alone,0,0
1307,Third Class,did not survived,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,C,adult,1,1,Travelling alone,0,0


In [30]:
df["Has caregiver"].replace([1, 0], ["Passenger < 13 years and travel with family", "Passenger >= 13 and travel alone"], 
                                     inplace = True)

df

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,age group,Titanic Passenger,Family Size,Travel Alone,Has caregiver,crew
0,First Class,survived,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,Southampton,adult,1,1,Travelling alone,Passenger >= 13 and travel alone,0
1,First Class,survived,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,Southampton,infant,1,4,Travelling with family,Passenger < 13 years and travel with family,0
2,First Class,did not survived,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,Southampton,child,1,4,Travelling with family,Passenger < 13 years and travel with family,0
3,First Class,did not survived,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,Southampton,adult,1,4,Travelling with family,Passenger >= 13 and travel alone,0
4,First Class,did not survived,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,Southampton,adult,1,4,Travelling with family,Passenger >= 13 and travel alone,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1301,Third Class,did not survived,"Youseff, Mr. Gerious",male,45.5000,0,0,2628,7.2250,C,adult,1,1,Travelling alone,Passenger >= 13 and travel alone,0
1304,Third Class,did not survived,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,C,teen,1,2,Travelling with family,Passenger >= 13 and travel alone,0
1306,Third Class,did not survived,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,C,adult,1,1,Travelling alone,Passenger >= 13 and travel alone,0
1307,Third Class,did not survived,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,C,adult,1,1,Travelling alone,Passenger >= 13 and travel alone,0


In [33]:
df["crew"].replace([1, 0], ["Paid 0 dollar for ticket", "Paid for ticket"], inplace = True)

df

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,age group,Titanic Passenger,Family Size,Travel Alone,Has caregiver,crew
0,First Class,survived,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,Southampton,adult,1,1,Travelling alone,Passenger >= 13 and travel alone,Paid for ticket
1,First Class,survived,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,Southampton,infant,1,4,Travelling with family,Passenger < 13 years and travel with family,Paid for ticket
2,First Class,did not survived,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,Southampton,child,1,4,Travelling with family,Passenger < 13 years and travel with family,Paid for ticket
3,First Class,did not survived,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,Southampton,adult,1,4,Travelling with family,Passenger >= 13 and travel alone,Paid for ticket
4,First Class,did not survived,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,Southampton,adult,1,4,Travelling with family,Passenger >= 13 and travel alone,Paid for ticket
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1301,Third Class,did not survived,"Youseff, Mr. Gerious",male,45.5000,0,0,2628,7.2250,C,adult,1,1,Travelling alone,Passenger >= 13 and travel alone,Paid for ticket
1304,Third Class,did not survived,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,C,teen,1,2,Travelling with family,Passenger >= 13 and travel alone,Paid for ticket
1306,Third Class,did not survived,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,C,adult,1,1,Travelling alone,Passenger >= 13 and travel alone,Paid for ticket
1307,Third Class,did not survived,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,C,adult,1,1,Travelling alone,Passenger >= 13 and travel alone,Paid for ticket


### Determine Frequencies of Groups

* How many passengers fall into each category? Determine how many passengers fall into each group for <b>each</b> categorical variable (including the ones you just created). 

In [34]:
df["pclass"].value_counts()

Third Class     501
First Class     284
Second Class    261
Name: pclass, dtype: int64

In [35]:
df["survived"].value_counts()

did not survived    619
survived            427
Name: survived, dtype: int64

In [38]:
df["sex"].value_counts()

male      658
female    388
Name: sex, dtype: int64

In [39]:
df["sibsp"].value_counts()

0    685
1    280
2     36
4     22
3     16
5      6
8      1
Name: sibsp, dtype: int64

In [40]:
df["parch"].value_counts()

0    768
1    160
2     97
3      8
5      6
4      5
6      2
Name: parch, dtype: int64

In [41]:
df["embarked"].value_counts()

Southampton    784
C              212
Queenstown      50
Name: embarked, dtype: int64

In [42]:
df["age group"].value_counts()

adult      762
teen       131
child       72
seniors     59
infant      22
Name: age group, dtype: int64

In [45]:
df["Travel Alone"].value_counts()

Travelling alone          590
Travelling with family    456
Name: Travel Alone, dtype: int64

In [44]:
df["Has caregiver"].value_counts()

Passenger >= 13 and travel alone               955
Passenger < 13 years and travel with family     91
Name: Has caregiver, dtype: int64

In [43]:
df["crew"].value_counts()

Paid for ticket             1038
Paid 0 dollar for ticket       8
Name: crew, dtype: int64
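
Running `value_counts()` once per column works, but the repetition can be folded into a loop. This is a minimal sketch on a made-up frame; `sample`, `categorical_cols`, and `counts` are hypothetical names. It also shows `normalize=True`, which reports shares instead of raw counts.

```python
import pandas as pd

# Toy frame with two categorical columns (values are made up)
sample = pd.DataFrame({
    "pclass": ["First Class", "Third Class", "Third Class", "Second Class"],
    "sex": ["female", "male", "male", "female"],
})

# List the categorical columns once, then count each in a loop
categorical_cols = ["pclass", "sex"]
counts = {col: sample[col].value_counts() for col in categorical_cols}

for col, col_counts in counts.items():
    print(col_counts, "\n")

# Shares instead of counts: each value divided by the number of rows
shares = sample["pclass"].value_counts(normalize=True)
print(shares)
```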

### Determine the Distribution of Numeric Data

* What are the summary statistics for <b>each</b> numeric variable in the dataset? Summary statistics include:
    * Mean
    * Median
    * Mode
    * Standard Deviation
    * Range

In [49]:
df["age"].mean()

29.8811345124283

In [57]:
df["age"].median()

28.0

In [58]:
df["age"].mode()

0    24.0
dtype: float64

In [60]:
df["age"].std()

14.413499699923594

In [62]:
age_range = df['age'].max() - df['age'].min()

print(age_range)

79.8333


In [51]:
df["fare"].mean()

36.68283879472418

In [63]:
df["fare"].median()

15.8

In [64]:
df["fare"].mode()

0    13.0
dtype: float64

In [65]:
df["fare"].std()

55.70595916349577

In [66]:
fare_range = df['fare'].max() - df['fare'].min()

print(fare_range)

512.3292


In [52]:
df["Family Size"].mean()

1.9235181644359465

In [67]:
df["Family Size"].median()

1.0

In [68]:
df["Family Size"].mode()

0    1
dtype: int64

In [69]:
df["Family Size"].std()

1.4528906850592564

In [54]:
df["parch"].mean()

0.42065009560229444

In [78]:
df["parch"].median()

0.0

In [79]:
df["parch"].mode()

0    0
dtype: int64

In [80]:
df["parch"].std()

0.8397504516166859

In [81]:
parch_range = df['parch'].max() - df['parch'].min()

print(parch_range)

6


In [56]:
df["sibsp"].mean()

0.502868068833652

In [82]:
df["sibsp"].median()

0.0

In [83]:
df["sibsp"].mode()

0    0
dtype: int64

In [84]:
df["sibsp"].std()

0.912167299664662
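
All five statistics can also be collected into one table, which makes it harder to skip one for a given column. The sketch below is an illustration on made-up numbers (`sample` and `summary` are assumptions): `agg` computes the mean, median, standard deviation, minimum, and maximum, the range is derived from the last two, and the mode is taken as the first value returned by `mode()`.

```python
import pandas as pd

# Toy numeric columns (values are made up for illustration)
sample = pd.DataFrame({
    "age": [29.0, 2.0, 30.0, 25.0, 30.0],
    "fare": [211.0, 151.0, 151.0, 151.0, 7.0],
})

# One row per column, one statistic per column of the summary table
summary = sample.agg(["mean", "median", "std", "min", "max"]).T

# Range is the distance between the largest and smallest value
summary["range"] = summary["max"] - summary["min"]

# mode() can return several values per column; keep the first one
summary["mode"] = sample.mode().iloc[0]

print(summary)
```

On the notebook's data, the column list could be `["age", "fare", "Family Size", "parch", "sibsp"]`.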

### Relationships between Variables

* Determine the relationship between each variable and the variable "survived". This is our primary variable of interest -- did this passenger survive the accident? Did the characteristics of the passenger have any relationship with their survival?
    * <b>pclass</b>: how many survivors are in each passenger class? does a pattern emerge? which class has the most survivors? which has the least?
    * <b>sex</b>: how many survivors are in each variable group? does a pattern emerge? which group has the most survivors? which has the least?
    * <b>age</b>: how does the average age of the passenger differ based on survival group? 
    * <b>age group</b>: how many survivors are in each variable group? does a pattern emerge? which group has the most survivors? which has the least?
    * <b>family size</b>: how many survivors are in each variable group? does a pattern emerge? which group has the most survivors? which has the least?
    * <b>travel alone</b>: how many survivors are in each variable group? does a pattern emerge? which group has the most survivors? which has the least?
    * <b>crew</b>: how many survivors are in each variable group? does a pattern emerge? which group has the most survivors? which has the least?
    * <b>has caregiver</b>: how many survivors are in each variable group? does a pattern emerge? which group has the most survivors? which has the least?
    * <b>fare</b>: how does the average fare the passenger paid differ based on survival group? 
    * <b>embarked</b>: how many survivors are in each variable group? does a pattern emerge? which group has the most survivors? which has the least?
    
Based on what you learn working through this section, make two (2) statements about which passenger characteristics most influenced survival.
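
For the "how many survivors are in each group" questions, a cross-tabulation is one possible tool, and for the "how does the average differ" questions a `groupby` on the survival column does the job. The sketch below uses a small made-up frame; the `sample` data and its label strings are assumptions for illustration only. Passing `normalize="index"` turns the counts into survival rates within each group, which is often easier to compare when the groups differ in size.

```python
import pandas as pd

# Toy frame (values and labels are made up for illustration)
sample = pd.DataFrame({
    "pclass": ["First Class", "First Class", "Third Class", "Third Class", "Third Class"],
    "survived": ["survived", "did not survive", "did not survive", "did not survive", "survived"],
    "age": [29.0, 30.0, 26.5, 27.0, 14.5],
})

# Counts: rows are passenger classes, columns are survival outcomes
counts = pd.crosstab(sample["pclass"], sample["survived"])
print(counts)

# Rates: each row sums to 1, so classes of different sizes can be compared
rates = pd.crosstab(sample["pclass"], sample["survived"], normalize="index")
print(rates)

# Average of a numeric column within each survival group
mean_age = sample.groupby("survived")["age"].mean()
print(mean_age)
```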

In [94]:
df.drop(["Titanic Passenger", "name", "sibsp", "parch", "ticket"], axis = 1, inplace = True)

In [96]:
df

Unnamed: 0,pclass,survived,sex,age,fare,embarked,age group,Family Size,Travel Alone,Has caregiver,crew
0,First Class,survived,female,29.0000,211.3375,Southampton,adult,1,Travelling alone,Passenger >= 13 and travel alone,Paid for ticket
1,First Class,survived,male,0.9167,151.5500,Southampton,infant,4,Travelling with family,Passenger < 13 years and travel with family,Paid for ticket
2,First Class,did not survived,female,2.0000,151.5500,Southampton,child,4,Travelling with family,Passenger < 13 years and travel with family,Paid for ticket
3,First Class,did not survived,male,30.0000,151.5500,Southampton,adult,4,Travelling with family,Passenger >= 13 and travel alone,Paid for ticket
4,First Class,did not survived,female,25.0000,151.5500,Southampton,adult,4,Travelling with family,Passenger >= 13 and travel alone,Paid for ticket
...,...,...,...,...,...,...,...,...,...,...,...
1301,Third Class,did not survived,male,45.5000,7.2250,C,adult,1,Travelling alone,Passenger >= 13 and travel alone,Paid for ticket
1304,Third Class,did not survived,female,14.5000,14.4542,C,teen,2,Travelling with family,Passenger >= 13 and travel alone,Paid for ticket
1306,Third Class,did not survived,male,26.5000,7.2250,C,adult,1,Travelling alone,Passenger >= 13 and travel alone,Paid for ticket
1307,Third Class,did not survived,male,27.0000,7.2250,C,adult,1,Travelling alone,Passenger >= 13 and travel alone,Paid for ticket


In [98]:
## new library alert! ##

## Import the StatsModels library for our regression analyses

import statsmodels.formula.api as sm

In [106]:
## create the regression model
## "survived" now holds text labels, so rebuild a 0/1 outcome for the model to predict
df["survived_code"] = np.where(df["survived"] == "survived", 1, 0)

## column names that contain spaces must be wrapped in Q("...") inside the formula
result = sm.ols('survived_code ~ pclass + sex + age + Q("age group") + fare + embarked + Q("Has caregiver") + crew', data = df).fit()

## print the regression model summary
result.summary()


### Visualize your Results

* Using the most interesting (from your POV) results from the above section, create (3) visualizations to illustrate the results. 
* Create a barplot to show the variation in average age across passenger class. On average, which passenger class has the oldest passengers?
* Create a violin plot to show the distribution of age across passenger class. 
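
The two required plots can be sketched with plain matplotlib, as shown below. This is only one possible approach: matplotlib is an assumption here (the notebook does not import a plotting library), the `sample` frame is made up, and a library such as seaborn would work as well. The bar chart plots the average age per class, and the violin plot shows the full age distribution for the same classes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame (values are made up for illustration)
sample = pd.DataFrame({
    "pclass": ["First Class"] * 3 + ["Second Class"] * 3 + ["Third Class"] * 3,
    "age": [29.0, 45.0, 58.0, 24.0, 30.0, 36.0, 2.0, 21.0, 28.0],
})

# Average age per passenger class
mean_age = sample.groupby("pclass")["age"].mean()

fig, (ax_bar, ax_violin) = plt.subplots(1, 2, figsize=(10, 4))

# Bar plot: one bar per class, height = average age
ax_bar.bar(mean_age.index, mean_age.values)
ax_bar.set_ylabel("Average age")
ax_bar.set_title("Average age by passenger class")

# Violin plot: one violin per class, built from the raw ages
classes = list(mean_age.index)
ages_by_class = [sample.loc[sample["pclass"] == c, "age"].to_numpy() for c in classes]
ax_violin.violinplot(ages_by_class, showmedians=True)
ax_violin.set_xticks(range(1, len(classes) + 1))
ax_violin.set_xticklabels(classes)
ax_violin.set_ylabel("Age")
ax_violin.set_title("Age distribution by passenger class")

fig.tight_layout()

# Class with the highest average age
oldest_class = mean_age.idxmax()
print(oldest_class)
```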