# Titanic Dataset


The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS
Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough
lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers
and crew. While there was some element of luck involved in surviving, it seems some
groups of people were more likely to survive than others.


Feature Details:

Variable - Definition

1. survival - Survival (0 = No, 1 = Yes)
2. pclass - Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
3. sex - Sex (0 = Female, 1 = Male)
4. Age - Age in years
5. sibsp - Number of siblings / spouses aboard the Titanic
6. parch - Number of parents / children aboard the Titanic
7. ticket - Ticket number
8. fare - Passenger fare
9. cabin - Cabin number
10. embarked - Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
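The exercises rely on sex being stored as 0/1. Some copies of the data may store it as text instead; that is an assumption, not something stated here. In that case, a minimal sketch of the conversion could look like this. The frame `raw` and the column `sex_code` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical raw rows with text labels (NOT taken from titanic.csv).
raw = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

# Map the text labels onto the 0/1 coding listed above (0 = Female, 1 = Male).
raw["sex_code"] = raw["sex"].map({"female": 0, "male": 1})

print(raw["sex_code"].tolist())
```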


1. Mean Age of Non-Survivors :

Given File 'titanic.csv'
Problem Statement:
From the dataset, find the mean age of the people who did not survive.
In the dataset:
Survived = 0 means the passenger did not survive.

Survived = 1 means the passenger survived.


Output Format:
Print the mean rounded off to two decimal places.

In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv('titanic.csv')

# Filter data where 'Survived' is equal to 0 (not survived)
not_survived_data = data[data['Survived'] == 0]

# Calculate the mean age of people who did not survive
mean_age_not_survived = not_survived_data['Age'].mean()

# Print the mean age rounded off to two decimal places
print(round(mean_age_not_survived, 2))
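As a cross-check, the same figure can be read from a grouped mean, which reports both survival groups at once. The sketch below uses a small made-up frame named `toy` (an assumption for illustration, not rows from titanic.csv).

```python
import pandas as pd

# Toy data for illustration only (NOT taken from titanic.csv).
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1],
    "Age": [20.0, 40.0, None, 10.0, 30.0, 44.0],
})

# Mean age per survival group; mean() skips missing ages (NaN) by default.
mean_age_by_group = toy.groupby("Survived")["Age"].mean().round(2)

print(mean_age_by_group.loc[0])  # non-survivors in the toy data
print(mean_age_by_group.loc[1])  # survivors in the toy data
```

Grouping avoids writing one filter per group and makes it harder to mix up the 0 and 1 cases.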


2. Percentage of Survivors :

Given File 'titanic.csv'
Problem Statement:
From the dataset, find the following:


1. The percentage of passengers who survived.
2. Of the passengers who survived, the percentage who are male.
3. Of the passengers who survived, the percentage who are female.


In the dataset:
Survived = 0 means the passenger did not survive.

Survived = 1 means the passenger survived.

Sex = 1 means the passenger is male.

Sex = 0 means the passenger is female.


Output Format:
Print the percentages, rounded off to two decimal places, on separate lines.

In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv('titanic.csv')

# Total number of passengers
total_passengers = len(data)

# Number of passengers who survived
survived_passengers = len(data[data['Survived'] == 1])

# Number of male passengers who survived
male_survived_passengers = len(data[(data['Survived'] == 1) & (data['Sex'] == 1)])

# Number of female passengers who survived
female_survived_passengers = len(data[(data['Survived'] == 1) & (data['Sex'] == 0)])

# Calculate percentages
percentage_survived = (survived_passengers / total_passengers) * 100
percentage_male_survived = (male_survived_passengers / survived_passengers) * 100
percentage_female_survived = (female_survived_passengers / survived_passengers) * 100

# Print the percentages rounded off to two decimal places
print(round(percentage_survived, 2))
print(round(percentage_male_survived, 2))
print(round(percentage_female_survived, 2))
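Because both columns are coded 0/1, the same percentages can also be obtained from proportions rather than explicit counts. This is a minimal sketch on a made-up frame `toy` (an assumption for illustration, not rows from titanic.csv).

```python
import pandas as pd

# Toy data for illustration only (NOT taken from titanic.csv).
toy = pd.DataFrame({
    "Survived": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "Sex":      [0, 0, 0, 1, 1, 1, 1, 1, 0, 1],
})

# The mean of a 0/1 column is the proportion of 1s.
pct_survived = round(toy["Survived"].mean() * 100, 2)

# Sex distribution among survivors only, as fractions that sum to 1.
sex_share = toy.loc[toy["Survived"] == 1, "Sex"].value_counts(normalize=True)
pct_male = round(sex_share.get(1, 0.0) * 100, 2)
pct_female = round(sex_share.get(0, 0.0) * 100, 2)

print(pct_survived)
print(pct_male)
print(pct_female)
```

Using `.get(..., 0.0)` keeps the sketch from failing if one sex is absent among the survivors.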

3. Highest Correlation : 

Given File 'titanic.csv'


Problem Statement:
From the dataset, find the variable having the highest correlation with the Survival rate.


Note: Survival rate is denoted by the "Survived" column.


Output Format:
Print the column name having the highest correlation with the Survived column.
Print the absolute value of the correlation of this column rounded off to two decimal places.
These values should be separated by a new line.


In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv('titanic.csv')

# Correlate every numeric column with 'Survived'. numeric_only=True skips text
# columns, which make corr() raise an error on pandas 2.0 and later.
# Drop 'Survived' itself so its self-correlation of 1.0 cannot be selected.
correlations = (
    data.corr(numeric_only=True)['Survived']
    .drop('Survived')
    .abs()
    .sort_values(ascending=False)
)

# Get the column name with the highest absolute correlation
highest_correlation_column = correlations.index[0]

# Get the absolute value of the correlation rounded off to two decimal places
correlation_value = round(correlations.iloc[0], 2)

# Print the results separated by a new line
print(highest_correlation_column)
print(correlation_value)
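To see the selection step in isolation, here is a small sketch on a made-up frame `toy` (an assumption for illustration, not rows from titanic.csv). It includes one text column to show why restricting the correlation to numeric columns matters; the `numeric_only` argument assumes pandas 1.5 or later.

```python
import pandas as pd

# Toy data for illustration only (NOT taken from titanic.csv).
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1],
    "Sex":      [1, 1, 1, 0, 0, 1],
    "Pclass":   [3, 2, 3, 1, 3, 2],
    "Ticket":   ["A1", "B2", "C3", "D4", "E5", "F6"],  # text column, skipped
})

# Absolute correlation of each numeric column with Survived, excluding itself.
corr = toy.corr(numeric_only=True)["Survived"].drop("Survived").abs()

best_column = corr.idxmax()        # label of the strongest correlation
best_value = round(corr.max(), 2)  # its absolute value

print(best_column)
print(best_value)
```

Dropping the target column by name, rather than skipping the first sorted entry, does not depend on how ties are ordered.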


4. Calculate IQR :

Given File 'titanic.csv'


Problem Statement:
From the dataset, calculate the interquartile range (IQR) of the Age column.
The Age column contains some null values. Remove those first, then calculate the IQR.


Output Format:
Print the IQR of Age, rounded off to two decimal places.

In [None]:
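import pandas as pd

# Load the dataset
df = pd.read_csv('titanic.csv')

# Remove rows where 'Age' is missing
df = df.dropna(subset=['Age'])

# IQR = third quartile (75th percentile) minus first quartile (25th percentile)
age_iqr = df['Age'].quantile(0.75) - df['Age'].quantile(0.25)

# Print the IQR rounded off to two decimal places
print(round(age_iqr, 2))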
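The calculation is easier to verify by hand on a few values. The sketch below uses a made-up series `ages` (an assumption for illustration, not values from titanic.csv) and relies on the default linear interpolation of `quantile`.

```python
import pandas as pd

# Toy ages for illustration only (NOT taken from titanic.csv).
ages = pd.Series([10.0, 20.0, None, 30.0, 40.0, 50.0])

clean = ages.dropna()       # remove the missing value first
q1 = clean.quantile(0.25)   # first quartile
q3 = clean.quantile(0.75)   # third quartile
iqr = q3 - q1

print(round(iqr, 2))
```

With five remaining values, the quartiles fall exactly on the second and fourth sorted values (20 and 40), so the IQR is 20.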


5. Analyze the Hypothesis : 

Given File 'titanic.csv'


Problem Statement:
From the dataset, find out whether there is a significant difference in the mean age between the passengers who survived and the passengers who did not survive.


In the dataset:
Survived = 0 means the passenger did not survive.

Survived = 1 means the passenger survived.


Output Format:
Print "Yes" if there is a significant difference; otherwise print "No".
Here, a difference counts as significant when the two means differ by at least 2 points.

In [None]:
import pandas as pd
from scipy.stats import ttest_ind

# Load the dataset
data = pd.read_csv('titanic.csv')

# Extract ages of passengers who survived and did not survive
age_survived = data[data['Survived'] == 1]['Age'].dropna()
age_not_survived = data[data['Survived'] == 0]['Age'].dropna()

# Perform t-test to compare means
t_stat, p_value = ttest_ind(age_survived, age_not_survived)

# The difference is significant if the means differ by at least 2 points
# and the t-test p-value is below 0.05
if abs(age_survived.mean() - age_not_survived.mean()) >= 2 and p_value < 0.05:
    print("Yes")
else:
    print("No")
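The decision rule can be wrapped in a small function so it is easy to try on known inputs. This is only a sketch: the function name `significant_difference` and its parameters are hypothetical, the ages are made up, and it swaps in Welch's t-test (`equal_var=False`) rather than the default equal-variance test used above.

```python
import pandas as pd
from scipy.stats import ttest_ind

def significant_difference(a, b, min_gap=2.0, alpha=0.05):
    """Return "Yes" if the means differ by at least min_gap and p < alpha."""
    a, b = pd.Series(a).dropna(), pd.Series(b).dropna()
    gap = abs(a.mean() - b.mean())
    # equal_var=False selects Welch's t-test, which does not assume
    # that both groups have the same variance.
    _, p_value = ttest_ind(a, b, equal_var=False)
    return "Yes" if gap >= min_gap and p_value < alpha else "No"

# Toy ages for illustration only (NOT taken from titanic.csv).
far_apart = significant_difference([30, 31, 32, 33, 34], [20, 21, 22, 23, 24])
close = significant_difference([30, 31, 32, 33, 34], [30.5, 31.5, 32.5, 33.5, 34.5])

print(far_apart)
print(close)
```

The first pair differs by 10 points with little spread, so both conditions hold; the second pair differs by only 0.5 points, so the 2-point threshold alone rules it out.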
