# TITANIC DATASET

The Titanic dataset contains detailed information about passengers of the RMS Titanic, which struck an iceberg late on April 14, 1912, during its maiden voyage and sank in the early hours of April 15. The dataset is often used for classification tasks that aim to predict passenger survival from various attributes.

### Key Features of the Titanic Dataset:

1. **PassengerId**: Unique identifier for each passenger.
2. **Survived**: Survival status (0 = No, 1 = Yes) indicating whether the passenger survived the disaster.
3. **Pclass**: Passenger's class (1 = 1st, 2 = 2nd, 3 = 3rd), which serves as a proxy for socio-economic status.
4. **Name**: Full name of the passenger, which can be used to extract titles and infer family relations.
5. **Sex**: Gender of the passenger (male or female).
6. **Age**: Age of the passenger in years. Some values may be missing and require imputation.
7. **SibSp**: Number of siblings and spouses the passenger had aboard the Titanic.
8. **Parch**: Number of parents and children the passenger had aboard the Titanic.
9. **Ticket**: Ticket number, which can provide insights into the socio-economic status and travel groupings.
10. **Fare**: Amount of money the passenger paid for the ticket, which also indicates socio-economic status.
11. **Cabin**: Cabin number, which can be used to deduce the passenger's location on the ship. This field has many missing values.
12. **Embarked**: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton), indicating the port where the passenger boarded the ship.
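
The note on **Name** says titles can be extracted from it. A minimal sketch of one way to do that, assuming names follow the "Surname, Title. Given names" pattern seen in the sample rows of this notebook; the regular expression and the variable names are illustrative choices, not part of the dataset:

```python
import pandas as pd

# Three names copied from the sample rows displayed in this notebook.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# Assumption: the title is the text between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

On the full dataset the same call would be applied to the `Name` column; names that do not follow this pattern would yield missing values and should be checked.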



### Problem Statement

**Objective:**
Conduct a comprehensive Exploratory Data Analysis (EDA) of the Titanic dataset to uncover patterns, relationships, and insights that help in understanding the factors influencing passenger survival.

**Goals:**
1. **Data Cleaning and Preparation:**
   - Identify and handle missing values in the dataset.
   - Convert categorical features to appropriate numerical formats if necessary.
   - Create new features (feature engineering) that may provide additional insights.

2. **Descriptive Statistics and Visualization:**
   - Summarize the dataset using descriptive statistics to understand the central tendencies and dispersion of features.
   - Visualize the distribution of key features such as Age, Fare, and Pclass.
   - Visualize the relationship between different features and survival rate using plots such as bar charts, histograms, and box plots.

3. **Univariate Analysis:**
   - Analyze individual features to understand their distribution and characteristics.
   - Identify any outliers and decide on appropriate treatments.

4. **Bivariate and Multivariate Analysis:**
   - Examine the relationships between pairs of features and their combined influence on survival.
   - Use heatmaps, pair plots, and correlation matrices to visualize these relationships.

5. **Segmentation and Patterns:**
   - Segment the data based on features such as Pclass, Sex, Age, and Embarked to uncover patterns in survival rates.
   - Identify any significant differences in survival rates across different passenger groups.

6. **Insights and Conclusions:**
   - Summarize the key findings from the analysis.
   - Draw conclusions about the factors that most strongly influenced survival on the Titanic.
   - Discuss any limitations of the dataset and the analysis.
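
The encoding and feature-engineering steps in Goal 1 could look roughly like the sketch below. It runs on a small hand-made frame that borrows the dataset's column names; `SexCode`, `FamilySize`, and `IsAlone` are hypothetical feature names chosen for illustration, and the 0/1 mapping for `Sex` is an arbitrary convention.

```python
import pandas as pd

# Small illustrative frame; values mirror the first five sample rows.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
})

# Convert a categorical feature to a numeric code (assumed mapping).
sample["SexCode"] = sample["Sex"].map({"male": 0, "female": 1})

# Hypothetical engineered features: relatives aboard plus the passenger.
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1
sample["IsAlone"] = (sample["FamilySize"] == 1).astype(int)

print(sample)
```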

### Deliverables:
- A detailed report or Jupyter notebook containing:
  - Steps of data cleaning and preparation.
  - Descriptive statistics and visualizations.
  - Univariate, bivariate, and multivariate analyses.
  - Key insights and conclusions drawn from the analysis.

### Example Questions to Explore:
- How does the survival rate vary with passenger class (Pclass)?
- What is the age distribution of the passengers, and how does age relate to survival?
- Are there differences in survival rates between males and females?
- How does the number of siblings/spouses (SibSp) or parents/children (Parch) aboard affect survival?
- Does the port of embarkation (Embarked) influence the chances of survival?
- How does the fare paid by passengers correlate with their survival?
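
Several of these questions reduce to comparing the mean of `Survived` across groups. A hedged sketch using only the ten sample rows displayed in this notebook (the first and last five); ten rows are far too few to draw conclusions, so this only illustrates the mechanics:

```python
import pandas as pd

# Survived, Pclass, and Sex for the ten sample rows shown in this notebook.
rows = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 1, 3, 2, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male",
            "male", "female", "female", "male", "male"],
})

# Because Survived is 0/1, its mean within a group is that group's survival rate.
by_class = rows.groupby("Pclass")["Survived"].mean()
by_sex = rows.groupby("Sex")["Survived"].mean()

print(by_class)
print(by_sex)
```

Replacing `rows` with the full DataFrame gives the survival rate per class or per sex for all 891 passengers.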

### Learning Outcomes:
- Gain proficiency in handling and cleaning real-world data.
- Develop skills in using descriptive statistics and visualizations to understand data.
- Learn to identify and analyze relationships between features.
- Practice drawing meaningful insights and conclusions from data analysis.

This problem statement provides a clear framework for conducting EDA on the Titanic dataset, ensuring that students can systematically explore and analyze the data to derive meaningful insights.

In [None]:
# Import the data-handling and plotting libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [1]:
from google.colab import drive
drive.mount('/content/drive')

ValueError: mount failed

### Read the Data

In [None]:
titanic = pd.read_csv("/content/drive/MyDrive/kapil/Statistics/Descriptive/EDA/data/titanic.csv")

In [None]:
titanic.head()


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [None]:
titanic.tail()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |


In [None]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [None]:
titanic.dtypes

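| Column | Dtype |
|---|---|
| PassengerId | int64 |
| Survived | int64 |
| Pclass | int64 |
| Name | object |
| Sex | object |
| Age | float64 |
| SibSp | int64 |
| Parch | int64 |
| Ticket | object |
| Fare | float64 |
| Cabin | object |
| Embarked | object |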


In [None]:
titanic.shape

(891, 12)

In [None]:
titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [None]:
# Count the null values in each column
titanic.isnull().sum()


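| Column | Null count |
|---|---|
| PassengerId | 0 |
| Survived | 0 |
| Pclass | 0 |
| Name | 0 |
| Sex | 0 |
| Age | 177 |
| SibSp | 0 |
| Parch | 0 |
| Ticket | 0 |
| Fare | 0 |
| Cabin | 687 |
| Embarked | 2 |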


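Three columns contain null values: Age (177 missing), Cabin (687 missing), and Embarked (2 missing), as the non-null counts from `titanic.info()` confirm. They will need to be filled with appropriate values, or otherwise handled, later on.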
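
One common way to fill these gaps is sketched below on a small hand-made frame with the same three column names. The choices here are assumptions for illustration, not the method this notebook prescribes: median for Age, most frequent value for Embarked, and a hypothetical `HasCabin` flag in place of imputing the sparse Cabin column.

```python
import numpy as np
import pandas as pd

# Illustrative values only; not rows from the real dataset.
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", np.nan, "S"],
})

# Numeric column: fill with the median, which is robust to outliers.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Categorical column with few gaps: fill with the most frequent value.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Mostly-missing column: record presence instead of guessing values.
df["HasCabin"] = df["Cabin"].notna().astype(int)

print(df.isnull().sum())
```

Other strategies, such as imputing Age per passenger class or per title, may fit the data better and are worth comparing during the analysis.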

**Feature types:** The Titanic dataset has roughly the following types of features:

- **Categorical/Nominal**: Variables that fall into multiple categories with no inherent order or priority. E.g. Embarked (C = Cherbourg; Q = Queenstown; S = Southampton).
- **Binary**: A subtype of categorical features in which the variable has only two categories. E.g. Sex (male/female).
- **Ordinal**: Similar to categorical features, but the categories have an order (i.e. they can be sorted). E.g. Pclass (1, 2, 3).
- **Continuous**: Can take any value between the minimum and maximum values in a column. E.g. Age, Fare.
- **Count**: Non-negative integers that count something. E.g. SibSp, Parch.
- **Low-value in raw form**: Features that contribute little to the final outcome of an ML model as they stand. PassengerId, Name, Cabin, and Ticket might fall into this category, although Name, Ticket, and Cabin can still feed engineered features such as titles, travel groupings, and location on the ship.
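
These feature types can be made explicit in pandas so that later steps treat each column appropriately. A minimal sketch on a hand-made frame, assuming the dataset's column names; using an ordered categorical for Pclass and dropping PassengerId are illustrative choices:

```python
import pandas as pd

# Illustrative frame using the dataset's column names.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Ordinal: an ordered categorical keeps the 1 < 2 < 3 ordering of the labels.
df["Pclass"] = pd.Categorical(df["Pclass"], categories=[1, 2, 3], ordered=True)

# Nominal and binary: unordered categoricals.
df["Sex"] = df["Sex"].astype("category")
df["Embarked"] = df["Embarked"].astype("category")

# Identifier column: drop it from the feature set.
features = df.drop(columns=["PassengerId"])

print(features.dtypes)
```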