# **Introduction**
# The Titanic dataset is popular for data analysis and machine learning. 
# It contains information about the passengers onboard the Titanic, including features like age, gender, fare, cabin, and survival status. 
# We will perform exploratory data analysis (EDA) on the Titanic dataset using Python in this project.

# ***Studying the Dataset***
# The first step in any data analysis project is to look at the data. 
# We need to see how many observations/rows, how many features/columns are contained, what these columns mean, and so on. 
# This will help us warm up and get familiar with the dataset, and might even help us to evaluate which features are important and which aren't. 
# To do this, we import the relevant Python libraries and read in the train.csv CSV file as a data frame.

In [78]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [79]:
sns.set_style('dark')

In [80]:
df = pd.read_csv('train.csv')

**Exploring the Data:**

In [81]:
df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [82]:
'''
We can count 12 features describing each person on board the Titanic. 
Our target feature (also known as the independent variable) is the Survived column, 
which is 1 if the person survived and 0 if not. In this case, 
it is easy to infer that the Survived column only contains numerical 1's and 0's. 
For other features (or dependent variables) however, it may not be possible to deduce the type of data contained at first glance. 
For instance, the Ticket column contains a mixture of letters and numbers for some rows but other rows contain only numbers. 
In order to quickly generate a table of data types contained within each column, we use the .info() method.
'''

"\nWe can count 12 features describing each person on board the Titanic. \nOur target feature (also known as the independent variable) is the Survived column, \nwhich is 1 if the person survived and 0 if not. In this case, \nit is easy to infer that the Survived column only contains numerical 1's and 0's. \nFor other features (or dependent variables) however, it may not be possible to deduce the type of data contained at first glance. \nFor instance, the Ticket column contains a mixture of letters and numbers for some rows but other rows contain only numbers. \nIn order to quickly generate a table of data types contained within each column, we use the .info() method.\n"

In [83]:
# Info about data frame dimensions, column types, and file size
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [84]:
# Check the dimensions of the dataset
print(df.shape)
print('____________________________________________________________________________________')
# Get summary statistics of numerical variables
print(df.describe())
print('____________________________________________________________________________________')
# Check the data types of variables
print(df.dtypes)
print('____________________________________________________________________________________')
# Check for missing values
print(df.isnull().sum())

(891, 12)
____________________________________________________________________________________
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000 

In [85]:
'''Now we know there are 7 numerical columns (floats and integers), 5 non-numerical columns, and their names as well. The output also tells us there are 891 rows/entries and 12 columns. For numerical columns such as Age, Fare, etc., we would like to find out their mean, maximum and minimum values to see if the data is reasonably distributed or if there are any anomalies or mistakes, such as percentages above 100%. To draw up a column of statistics for each column, we use the .describe() method.'''

'Now we know there are 7 numerical columns (floats and integers), 5 non-numerical columns, and their names as well. The output also tells us there are 891 rows/entries and 12 columns. For numerical columns such as Age, Fare, etc., we would like to find out their mean, maximum and minimum values to see if the data is reasonably distributed or if there are any anomalies or mistakes, such as percentages above 100%. To draw up a column of statistics for each column, we use the .describe() method.'

In [86]:
# Get summary statistics of numerical variables
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [87]:
df.nunique()

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

In [88]:
'''For the variables with less than 10 levels, we can list what these levels are using the method .unique()'''

'For the variables with less than 10 levels, we can list what these levels are using the method .unique()'

In [89]:
lessthanten = []
for col in df.columns:
    lessthanten.append(df[col].nunique() < 10)

for col in df[df.columns[lessthanten]]:
    print(col, df[col].unique())

Survived [0 1]
Pclass [3 1 2]
Sex ['male' 'female']
SibSp [1 0 3 4 2 5 8]
Parch [0 1 2 5 3 4 6]
Embarked ['S' 'C' 'Q' nan]


In [90]:
'''Another useful method for exploration is the .value_counts() method, which counts the number of occurrences for each unique category within a column.'''

'Another useful method for exploration is the .value_counts() method, which counts the number of occurrences for each unique category within a column.'

In [91]:
df['Embarked'].value_counts()

Embarked
S    644
C    168
Q     77
Name: count, dtype: int64

In [92]:
pd.DataFrame({'Integer': ['Survived','Pclass','SibSp, Parch','-'], 
              'Float': ['-','-','-','Age, Fare'], 
              'Object': ['Sex, Name, Ticket, Cabin, Embarked','-','-','-']}, 
              index = ['Nominal','Ordinal','Discrete','Continuous'])

Unnamed: 0,Integer,Float,Object
Nominal,Survived,-,"Sex, Name, Ticket, Cabin, Embarked"
Ordinal,Pclass,-,-
Discrete,"SibSp, Parch",-,-
Continuous,-,"Age, Fare",-


**Data Cleaning:**
Data cleaning is an essential step in EDA. We must handle missing values, outliers, and inconsistencies in the dataset. Some common data-cleaning tasks include:

In [103]:
df.isna().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       2
dtype: int64

**From above it is observed that there are 177 vissing values in the age column. this can be handel by imputing the mean age**

In [97]:
df['Age'].fillna(df['Age'].mean(),inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Age'].fillna(df['Age'].mean(),inplace=True)


**for the column cabin out of 891 observation 687 values are missing so, we can drop the column**

In [102]:
df.drop(columns=['Cabin'], inplace=True)

**since Embarked columns have 2 missing values so, we can replace this with most frequest occuring value**

In [107]:
df['Embarked'].fillna(df['Embarked'].mod()[0],inplace=True)

TypeError: Series.mod() missing 1 required positional argument: 'other'

In [106]:
print(df.isnull().sum())

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       2
dtype: int64


In [121]:
charlist=[]
for char in range(len(df['Embarked'])):
    charlist.append(df['Embarked'][char])
print(charlist)    

['S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S', 'S', 'C', 'S', 'S', 'S', 'S', 'S', 'S', 'Q', 'S', 'S', 'C', 'S', 'S', 'Q', 'S', 'S', 'S', 'C', 'S', 'Q', 'S', 'C', 'C', 'Q', 'S', 'C', 'S', 'C', 'S', 'S', 'C', 'S', 'S', 'C', 'C', 'Q', 'S', 'Q', 'Q', 'C', 'S', 'S', 'S', 'C', 'S', 'C', 'S', 'S', 'C', 'S', 'S', 'C', nan, 'S', 'S', 'C', 'C', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'C', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'Q', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'C', 'C', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'Q', 'S', 'C', 'S', 'S', 'C', 'S', 'Q', 'S', 'C', 'S', 'S', 'S', 'C', 'S', 'S', 'C', 'Q', 'S', 'C', 'S', 'C', 'S', 'S', 'S', 'S', 'C', 'S', 'S', 'S', 'C', 'C', 'S', 'S', 'Q', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'C', 'Q', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'Q', 'S', 'S', 'C', 'S', 'S', 'C', 'S', 'S', 'S', 'C', 'S', 'S', 'S', 'S', 'Q', 'S', 'Q', 'S', 'S', 'S', 'S', 'S', 'C', 'C', 'Q', 'S', 'Q', 'S',

In [113]:
s = pd.Series([10, 20, 30, 40, 50])
result = s.mod(3)
print(result)

0    1
1    2
2    0
3    1
4    2
dtype: int64
