# BUSINESS ISSUE UNDERSTANDING

#1. What is the most important question you want to answer with this analysis? 
    "What effect did class have on Titanic survival rates?"
    
#2. Who is the audience for this analysis?
    Jr. High students learning about the Titanic for the first time
    
#3. What is the timeline for this project?
    One week
    
#4. What metrics and categories are critical for this analysis?
    List of people on the Titanic, each person's class, and each person's survival status
    
#5. How do I connect to the data?
    I found the dataset on Kaggle and downloaded it as a locally-stored CSV
    
#6. Is there a data dictionary?
    Yes, on Kaggle

In [1]:
#import libraries

import pandas as pd

In [2]:
#load data

df = pd.read_csv("titanic/tit_train.csv")

df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


# DATA UNDERSTANDING

#7. source = https://www.kaggle.com/datasets/dbdmobile/tita111?select=tit_train.csv
    
#8. Why was the data collected?
    For machine learning practice

#9. Who authored or collected the dataset?
    A user on Kaggle took an older dataset and split it into training and testing data
    
#10. What is the context of the dataset? Is there supplemental documentation?
    It was put together for machine learning practice. Because the original dataset was split into training and testing files, each file contains only part of the passenger list
    
#11. What does the data represent?
    The data represents individual passengers aboard the Titanic and some information about each of their journeys
    
#12. What is the granularity?
    One record per passenger, described by 12 columns (PassengerId plus 11 attributes)

#13. What does each row represent?
    One Titanic passenger
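
    A quick way to check the one-row-per-passenger claim is to confirm that no identifier repeats. This is a minimal sketch on a stand-in series; `ids` is a hypothetical name, and in the notebook the same calls would be made on df["PassengerId"].

```python
import pandas as pd

# Stand-in for df["PassengerId"] (illustration only).
ids = pd.Series([1, 2, 3, 4, 5], name="PassengerId")

# True when every identifier appears exactly once.
print(ids.is_unique)

# Number of rows whose identifier already appeared earlier.
print(ids.duplicated().sum())
```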
    
#14. What does each value mean?
    **PassengerId** = unique identifier for each passenger
    **Survived** = whether the passenger survived (1) or not (0)
    **Pclass** = passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
    **Name** = name of the passenger
    **Sex** = gender of the passenger
    **Age** = age of the passenger (in years)
    **SibSp** = number of siblings or spouses aboard the Titanic
    **Parch** = number of parents or children aboard the Titanic
    **Ticket** = ticket number
    **Fare** = passenger fare
    **Cabin** = cabin number
    **Embarked** = port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
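
    For a younger audience, the coded values above may read better as labels. The sketch below shows one way to decode them with lookup dictionaries built from the data dictionary. The rows in `sample` are toy values for illustration, and the label wording is an assumption.

```python
import pandas as pd

# Toy rows for illustration only; the codes follow the data dictionary above.
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 2],
    "Embarked": ["S", "C", "Q"],
})

# Lookup tables taken from the data dictionary (label wording is my own).
survived_labels = {0: "did not survive", 1: "survived"}
class_labels = {1: "1st class", 2: "2nd class", 3: "3rd class"}
port_labels = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# assign() returns a new DataFrame, so the coded original stays untouched.
decoded = sample.assign(
    Survived=sample["Survived"].map(survived_labels),
    Pclass=sample["Pclass"].map(class_labels),
    Embarked=sample["Embarked"].map(port_labels),
)
print(decoded)
```

    Keeping the numeric codes in the working DataFrame and decoding only for display leaves the 0/1 Survived column available for arithmetic.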
    
# DATA PREPARATION

#15. How do I validate the data?
    Research information such as the number of passengers aboard the Titanic, number of survivors, etc. and compare to the data
    Research tells me that ~1300 passengers boarded and ~500 survived.

In [3]:
print("===")
print("INFO")
print(df.info())
print("===")
print("SHAPE")
print(df.shape)
print("===")
print("SURVIVAL VALUES")
df["Survived"].value_counts()

===
INFO
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
===
SHAPE
(891, 12)
===
SURVIVAL VALUES


0    549
1    342
Name: Survived, dtype: int64

In [4]:
#The previous cell tells me I'm missing some data. I have ~400 fewer rows than I expect and ~160 fewer survivors. 
#I have to decide if this is good enough for my analysis.

print(f'The Titanic is believed to have had a {round(500/1300,2)*100}% survival rate.')
print(f'The provided dataset has a {round(342/891,2)*100}% survival rate.')

The Titanic is believed to have had a 38.0% survival rate.
The provided dataset has a 38.0% survival rate.


#16. Are there any known issues?
    Yes, some data is missing. However, the dataset's survival rate (38%) matches the researched rate, which suggests the sample may be representative enough to be useful.
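
    The same comparison can be computed from the data instead of typing the counts in by hand. This is a sketch, not the notebook's method: `survived` is a stand-in for df["Survived"] rebuilt from the value counts printed earlier, and the two reference constants are the approximate researched figures.

```python
import pandas as pd

# Approximate reference figures from the research noted earlier.
EXPECTED_PASSENGERS = 1300
EXPECTED_SURVIVORS = 500

# Stand-in for df["Survived"], rebuilt from the printed value counts (549 zeros, 342 ones).
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# Survived is coded 0/1, so its mean is the share of survivors.
dataset_rate = survived.mean()
expected_rate = EXPECTED_SURVIVORS / EXPECTED_PASSENGERS

print(f"Dataset survival rate:  {dataset_rate:.0%}")
print(f"Expected survival rate: {expected_rate:.0%}")
print(f"Difference: {abs(dataset_rate - expected_rate):.3f}")
```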

#17. If a project requires multiple data sources, what is the best way to combine them?
    This project does not require additional data sources. For this project, if I needed additional data, I would likely
    use CSVs stored locally and join them in Python.
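
    As a rough sketch of what that join could look like, the example below merges two small tables on PassengerId. Both tables are toy data, and `extra`, its `ExtraField` column, and the placeholder values are hypothetical. In practice each table would come from its own pd.read_csv call.

```python
import pandas as pd

# Toy tables for illustration only.
passengers = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
})

# Hypothetical second source keyed on the same identifier.
extra = pd.DataFrame({
    "PassengerId": [2, 3],
    "ExtraField": ["value_a", "value_b"],
})

# A left join keeps every passenger even when the second source has no match.
# validate="one_to_one" raises an error if either side repeats a PassengerId.
combined = passengers.merge(extra, on="PassengerId", how="left", validate="one_to_one")
print(combined)
```

    A left join is used here so that a gap in the second source shows up as a missing value rather than silently dropping a passenger.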

#18. How frequently does the data update?
    This dataset never updates.

In [6]:
#Drop irrelevant columns

df = df.drop(columns = ["Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"])
df.head(20)

Unnamed: 0,PassengerId,Survived,Pclass
0,1,0,3
1,2,1,1
2,3,1,3
3,4,1,1
4,5,0,3
5,6,0,3
6,7,0,1
7,8,0,3
8,9,1,3
9,10,1,2


In [None]:
df.tail(20)

In [7]:
#Drop duplicate rows (drop_duplicates returns a new DataFrame, so assign the result)
df = df.drop_duplicates()
df

Unnamed: 0,PassengerId,Survived,Pclass
0,1,0,3
1,2,1,1
2,3,1,3
3,4,1,1
4,5,0,3
...,...,...,...
886,887,0,2
887,888,1,1
888,889,0,3
889,890,1,1


In [8]:
#Drop missing/null values
print(df['PassengerId'].isna().sum())
print(df['Survived'].isna().sum())
print(df['Pclass'].isna().sum())
df = df.dropna()  #dropna returns a new DataFrame, so assign the result
df

0
0
0


Unnamed: 0,PassengerId,Survived,Pclass
0,1,0,3
1,2,1,1
2,3,1,3
3,4,1,1
4,5,0,3
...,...,...,...
886,887,0,2
887,888,1,1
888,889,0,3
889,890,1,1
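
One detail worth keeping in mind for these cleaning steps: by default, pandas methods such as drop_duplicates and dropna return a new DataFrame and leave the original alone. The sketch below demonstrates this on a toy frame (illustration only, not the Titanic data).

```python
import pandas as pd

# Toy frame with one exact duplicate row and one missing value.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 2, 3],
    "Survived": [0, 1, 1, None],
})

# The result is discarded here, so toy itself is unchanged.
toy.drop_duplicates()
print(len(toy))

# Assigning the result keeps the cleaned rows.
cleaned = toy.drop_duplicates().dropna()
print(len(cleaned))
```

In this dataset the distinction makes no visible difference, because the null counts above are all zero and every PassengerId in the preview is distinct.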


# DATA ANALYSIS

In [9]:
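#create function to return value extrapolated out to full Titanic passenger population
#the dataset's 891 rows are roughly 68% of the ~1300 passengers who boarded

SAMPLE_PERCENT = 68

def extrapolate(val):
    return int(round((val/SAMPLE_PERCENT)*100,0))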

#get number of passengers for each class extrapolated to full population
passengers_1 = extrapolate(len(df[df["Pclass"]==1]))
passengers_2 = extrapolate(len(df[df["Pclass"]==2]))
passengers_3 = extrapolate(len(df[df["Pclass"]==3]))

#get total number of passengers extrapolated to full population
total_passengers = extrapolate(len(df))

#get number of survivors for each class extrapolated to full population
survived_1 = extrapolate(len(df[(df["Survived"]==1) & (df["Pclass"]==1)]))
survived_2 = extrapolate(len(df[(df["Survived"]==1) & (df["Pclass"]==2)]))
survived_3 = extrapolate(len(df[(df["Survived"]==1) & (df["Pclass"]==3)]))

#get total number of survivors extrapolated to full population
total_survived = survived_1+survived_2+survived_3

#get the percentage of survivors for each class
survived_1_rate = round(survived_1/passengers_1,2)*100
survived_2_rate = round(survived_2/passengers_2,2)*100
survived_3_rate = round(survived_3/passengers_3,2)*100

#get the average survival rate for the classes
avg_survival = (survived_1_rate+survived_2_rate+survived_3_rate)/3

#get the median survival rate for the classes (the middle of the three sorted values)
survivors = [survived_1_rate,survived_2_rate,survived_3_rate]
med_survival = sorted(survivors)[1]

#create a function to calculate whether a class's survival rate was greater than or less than average.
def survival_math(srv):
    if srv > avg_survival:
        return "greater than the average survival rate."
    else:
        return "less than the average survival rate."
    

print(f"""This dataset looks at the survival rate of {total_passengers} passengers aboard the Titanic. According to the data: \n
    * Approximately {total_survived} passengers survived the wreck, a survival rate of {round(total_survived/total_passengers,2)*100}%.\n
    * The average survival rate of the three classes was {round(avg_survival,2)}%\n
    * The median survival rate of the three classes was {round(med_survival,2)}% \n
    * {survived_1} of {passengers_1}, or {survived_1_rate}% of first-class passengers survived, {survival_math(survived_1_rate)}\n
    * {survived_2} of {passengers_2}, or {survived_2_rate}% of second-class passengers survived, {survival_math(survived_2_rate)}\n
    * {survived_3} of {passengers_3}, or {survived_3_rate}% of third-class passengers survived, {survival_math(survived_3_rate)}""")

This dataset looks at the survival rate of 1310 passengers aboard the Titanic. According to the data: 

    * Approximately 503 passengers survived the wreck, a survival rate of 38.0%.

    * The average survival rate of the three classes was 44.67%

    * The median survival rate of the three classes was 47.0% 

    * 200 of 318, or 63.0% of first-class passengers survived, greater than the average survival rate.

    * 128 of 271, or 47.0% of second-class passengers survived, greater than the average survival rate.

    * 175 of 722, or 24.0% of third-class passengers survived, less than the average survival rate.
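
The per-class rates can also be computed in one step with a groupby. This is an alternative sketch rather than the method used in the cell above. It runs on toy data, so the numbers it prints are not Titanic results, and `toy` and `summary` are hypothetical names.

```python
import pandas as pd

# Toy data for illustration only; not the real passenger counts.
toy = pd.DataFrame({
    "Pclass":   [1] * 4 + [2] * 4 + [3] * 8,
    "Survived": [1, 1, 1, 0] + [1, 1, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0],
})

# Survived is coded 0/1, so its mean within each class is that class's survival rate.
summary = toy.groupby("Pclass")["Survived"].agg(
    passengers="count", survivors="sum", rate="mean"
)
print(summary)

# The overall rate weights every passenger equally, while the average of the
# class rates weights every class equally, so the two differ when class sizes differ.
print("Overall rate:", toy["Survived"].mean())
print("Average of class rates:", summary["rate"].mean())
print("Median of class rates:", summary["rate"].median())
```

This also explains why the report's overall survival rate and its average across classes are different numbers: third class is by far the largest group, so it pulls the overall rate down.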


#19. Does the analysis make sense?
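    Yes. The extrapolated totals (about 1310 passengers and 503 survivors) and the 38% overall survival rate match the figures found during validation.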
    
#20. Is the analysis answering the business question?
    Yes, the question was whether a passenger's class affected their likelihood of surviving the wreck
    
#21. Does the data prep or project need to be revisited?
    Maybe. We may need more complete data
    
#22. Is there meaningful insight?
    Yes, we can see that passengers' survival rates correlated with their class
    
#23. What is the next question someone looking at this analysis would ask?
    What other factors contributed to a person's likelihood of survival?
    Was the relationship between survival and class causal?
    
#24. Is this a one-time request?
    Yes, this is for a lesson plan, and it uses data that will never be updated. There is no reason to maintain the analysis.
    
# DATA PRODUCTION

