<a href="https://colab.research.google.com/github/alishbas11/data_winner/blob/main/clean_titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Titanic Dataset Cleaning

This project focuses on cleaning and preparing the Titanic dataset for analysis. The dataset contains information about passengers on the ill-fated Titanic voyage, including their demographics, travel details, and survival status.

## Dataset Features

The dataset includes the following features:

- `Sr`: Serial number of the passenger.
- `Passenger_class(1-3)`: Passenger class (1st, 2nd, or 3rd).
- `Survived(0,1)`: Survival status (0 = Did not survive, 1 = Survived).
- `Name`: Passenger's name.
- `Gender`: Passenger's gender.
- `Age`: Passenger's age.
- `Family`: Number of siblings/spouses and parents/children aboard.
- `Fare`: The fare paid for the ticket.
- `Embarked`: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
- `Date`: Date of embarkation.

## Cleaning Techniques Used

The following data cleaning techniques were applied:

- **Handling Missing Values:** Placeholder tokens (`?` in `Age`, `**` in `Fare`) were first converted to `NaN`. Missing values in `Age` and `Fare` were then filled with the column mean, those in `Family` and `Embarked` with the column mode, and the single missing `Gender` value with `Male`.
- **Converting Data Types:** Columns like `Sr`, `Survived(0,1)`, `Age`, `Family`, `Fare`, and `Passenger_class(1-3)` were converted to appropriate numeric data types for analysis.
- **Replacing Inconsistent Values:** Values in the `Embarked` and `Gender` columns were replaced with standardized terms: the port codes `S`, `C`, and `Q` became `Southampton`, `Cherbourg`, and `Queenstown`, and `male`/`female` became `Male`/`Female`.
- **Formatting Data:** The `Date` column was converted to datetime objects and then formatted consistently.
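- **Handling Duplicates:** Duplicate rows were identified with `df.duplicated()`. The cells shown here flag them but do not include a `drop_duplicates()` call, so the repeated `Mr. Anthony` row still appears in the final output.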
- **Removing Unnecessary Data:** The initial row containing header information was removed.
- **Resetting Index:** The index was reset and the `Sr` column was regenerated from it, so serial numbers run consecutively after the row removal.

This cleaned dataset is now ready for further exploration and analysis to uncover insights into factors that influenced survival on the Titanic.

In [3]:
import pandas as pd
import numpy as np


import warnings
warnings.filterwarnings("ignore")

In [42]:
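# Load the CSV from Colab's working directory (/content); the preview below shows the cleaned data
df = pd.read_csv("/content/downloadsdata_titanic.csv")
df.head(120)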

Unnamed: 0,Sr,Passenger_class(1-3),"Survived(0,1)",Name,Gender,Age,Family,Fare,Embarked,Date
0,1,3,0,Mr. Anthony,Male,42.00000,0.0,7.550000,Southampton,1990-01-01
1,2,3,0,Mr. Anthony,Male,42.00000,0.0,7.550000,Southampton,1990-01-01
2,3,3,0,Master. Eugene Joseph,Male,29.97883,2.0,20.250000,Southampton,1990-01-02
3,4,2,0,"Abbott, Mr. Rossmore Edward",Male,29.97883,2.0,33.462182,Southampton,1990-01-03
4,5,3,1,"Abbott, Mr. Rossmore Edward",Female,35.00000,2.0,20.250000,Southampton,1990-01-04
...,...,...,...,...,...,...,...,...,...,...
115,116,3,0,"Bengtsson, Mr. John Viktor",Male,26.00000,0.0,7.775000,Southampton,1990-04-25
116,117,2,1,"Bentham, Miss. Lilian W",Female,19.00000,0.0,13.000000,Southampton,1990-04-26
117,118,3,0,"Berglund, Mr. Karl Ivar Sven",Male,22.00000,0.0,9.350000,Southampton,1990-04-27
118,119,2,0,"Berriman, Mr. William John",Male,23.00000,0.0,13.000000,Southampton,1990-04-28


In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1301 entries, 0 to 1300
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Sr                    1301 non-null   int64  
 1   Passenger_class(1-3)  1301 non-null   int64  
 2   Survived(0,1)         1301 non-null   int64  
 3   Name                  1301 non-null   object 
 4   Gender                1301 non-null   object 
 5   Age                   1301 non-null   float64
 6   Family                1301 non-null   float64
 7   Fare                  1301 non-null   float64
 8   Embarked              1301 non-null   object 
 9   Date                  1301 non-null   object 
dtypes: float64(3), int64(3), object(4)
memory usage: 101.8+ KB


In [5]:
df.shape

(1302, 10)

In [6]:
df.columns = ['Sr','passenger_class','survived','name','gender', 'age','family', 'fare', 'embarked','date']
df.head(22)

Unnamed: 0,Sr,passenger_class,survived,name,gender,age,family,fare,embarked,date
0,sn,pclass,survived,,gender,age,family,fare,embarked,date
1,1,3,0,Mr. Anthony,male,42,0,7.55,,01-Jan-90
2,1,3,0,Mr. Anthony,male,42,0,7.55,,01-Jan-90
3,2,3,0,Master. Eugene Joseph,male,?,2,20.25,S,02-Jan-90
4,3,2,0,"Abbott, Mr. Rossmore Edward",,,2,**,S,03-Jan-90
5,4,3,1,"Abbott, Mr. Rossmore Edward",female,35,2,20.25,S,04-Jan-90
6,5,3,1,"Abelseth, Miss. Karen Marie",female,16,0,7.65,S,05-Jan-90
7,6,3,1,"Abelseth, Mr. Olaus Jorgensen",male,25,0,7.65,S,06-Jan-90
8,7,2,0,"Abelson, Mr. Samuel",male,30,1,24,C,07-Jan-90
9,8,2,1,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28,1,24,C,08-Jan-90


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1302 entries, 0 to 1301
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Sr               1302 non-null   object
 1   passenger_class  1302 non-null   object
 2   survived         1302 non-null   object
 3   name             1301 non-null   object
 4   gender           1301 non-null   object
 5   age              1045 non-null   object
 6   family           1300 non-null   object
 7   fare             1300 non-null   object
 8   embarked         1296 non-null   object
 9   date             1302 non-null   object
dtypes: object(10)
memory usage: 101.8+ KB


In [8]:
df.head()

Unnamed: 0,Sr,passenger_class,survived,name,gender,age,family,fare,embarked,date
0,sn,pclass,survived,,gender,age,family,fare,embarked,date
1,1,3,0,Mr. Anthony,male,42,0,7.55,,01-Jan-90
2,1,3,0,Mr. Anthony,male,42,0,7.55,,01-Jan-90
3,2,3,0,Master. Eugene Joseph,male,?,2,20.25,S,02-Jan-90
4,3,2,0,"Abbott, Mr. Rossmore Edward",,,2,**,S,03-Jan-90


In [9]:
# Row 0 repeats the column labels (sn, pclass, ...) rather than holding passenger data, so drop it
df.drop(index=0, inplace=True)
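
The header-row removal above can be tried on a small in-memory example. This is a minimal sketch, assuming a hypothetical three-column CSV whose second line repeats the labels; the names `raw`, `toy`, and `toy_skipped` are illustrative and not part of the notebook. It also shows one possible alternative, skipping the stray line at read time so the columns are parsed as numbers directly.

```python
import io
import pandas as pd

# Hypothetical raw text: a real header line followed by a stray second label row.
raw = io.StringIO(
    "Sr,passenger_class,survived\n"
    "sn,pclass,survived\n"
    "1,3,0\n"
    "2,2,1\n"
)
toy = pd.read_csv(raw)

# Drop the stray label row by index label, then renumber the index from 0.
toy = toy.drop(index=0).reset_index(drop=True)

# Alternative: skip file line 1 (0-based) while reading, so no text row
# forces the numeric columns to be read as strings.
raw.seek(0)
toy_skipped = pd.read_csv(raw, skiprows=[1])

print(toy)
print(toy_skipped.dtypes)
```

With the first approach the remaining values are still strings and need a numeric conversion afterwards, which is what the later cells in this notebook do.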

In [10]:
df.columns

Index(['Sr', 'passenger_class', 'survived', 'name', 'gender', 'age', 'family',
       'fare', 'embarked', 'date'],
      dtype='object')

In [11]:
print(df.isnull().sum())


Sr                   0
passenger_class      0
survived             0
name                 0
gender               1
age                257
family               2
fare                 2
embarked             6
date                 0
dtype: int64


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1301 entries, 1 to 1301
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Sr               1301 non-null   object
 1   passenger_class  1301 non-null   object
 2   survived         1301 non-null   object
 3   name             1301 non-null   object
 4   gender           1300 non-null   object
 5   age              1044 non-null   object
 6   family           1299 non-null   object
 7   fare             1299 non-null   object
 8   embarked         1295 non-null   object
 9   date             1301 non-null   object
dtypes: object(10)
memory usage: 101.8+ KB


In [13]:
df["embarked"].head(4)

Unnamed: 0,embarked
1,
2,
3,S
4,S


In [14]:
df['age'] = df['age'].replace('?',np.nan)

In [15]:
df['Sr'] = pd.to_numeric(df['Sr'])
df['survived'] = pd.to_numeric(df['survived'])
df['family'] = pd.to_numeric(df['family'])

In [16]:
df["age"] = pd.to_numeric(df['age'])

In [17]:
# Strip '*' characters from both ends; a cell holding only '**' becomes an empty string
df['fare'] = df['fare'].str.strip('*')

In [19]:
df['fare']= df['fare'].replace('',np.nan)

In [20]:
df['fare'] = pd.to_numeric(df['fare'])

In [21]:
df['passenger_class'] = pd.to_numeric(df['passenger_class'])
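
The placeholder handling and numeric conversion in the cells above can also be expressed in one step. The following is a sketch on toy data, not the notebook's own method: it assumes the only non-numeric tokens are the `?` and `**` placeholders seen earlier, and uses the `errors="coerce"` option of `pd.to_numeric`, which turns anything unparseable into `NaN`.

```python
import pandas as pd

# Toy columns with the same placeholder tokens seen in the raw data ('?' and '**').
toy = pd.DataFrame({
    "age": ["42", "?", "35"],
    "fare": ["7.55", "**", "20.25"],
})

# errors="coerce" replaces every value that cannot be parsed as a number with NaN.
for col in ["age", "fare"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
print(toy.isna().sum())
```

One trade-off: coercion hides unexpected tokens as well as the known ones, so it is worth comparing the null counts before and after the conversion.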

In [22]:
df.duplicated()

Unnamed: 0,0
1,False
2,True
3,False
4,False
5,False
...,...
1297,False
1298,False
1299,False
1300,False
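
`df.duplicated()` only flags rows that repeat an earlier row; removing them takes a separate call. A minimal sketch on toy data follows (the frame `toy` and the variable names are illustrative). Applying the same call to `df` at this point would change the row counts shown in the later cells.

```python
import pandas as pd

# Toy frame where the second row is an exact copy of the first.
toy = pd.DataFrame({
    "Sr": [1, 1, 2],
    "name": ["Mr. Anthony", "Mr. Anthony", "Master. Eugene Joseph"],
    "fare": [7.55, 7.55, 20.25],
})

# Count rows that are identical to an earlier row.
n_dupes = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes)
print(deduped)
```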


In [23]:
df['embarked']= df['embarked'].replace('S','Southampton')
df['embarked']= df['embarked'].replace('C','Cherbourg')
df['embarked']= df['embarked'].replace('Q','Queenstown')
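
The three replacements above can be collapsed into a single dictionary lookup. This is a sketch on a toy Series; `port_names` is a hypothetical name for the mapping, and missing values are left untouched so they can be imputed later.

```python
import pandas as pd

# Mapping from port code to full port name, as listed in the feature description.
port_names = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}

toy = pd.Series(["S", "C", None, "Q"], name="embarked")

# A dict passed to replace() substitutes each key with its value in one pass.
full = toy.replace(port_names)

print(full)
```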

In [24]:
df['gender'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 1301 entries, 1 to 1301
Series name: gender
Non-Null Count  Dtype 
--------------  ----- 
1300 non-null   object
dtypes: object(1)
memory usage: 10.3+ KB


In [25]:
df.head()


Unnamed: 0,Sr,passenger_class,survived,name,gender,age,family,fare,embarked,date
1,1,3,0,Mr. Anthony,male,42.0,0.0,7.55,,01-Jan-90
2,1,3,0,Mr. Anthony,male,42.0,0.0,7.55,,01-Jan-90
3,2,3,0,Master. Eugene Joseph,male,,2.0,20.25,Southampton,02-Jan-90
4,3,2,0,"Abbott, Mr. Rossmore Edward",,,2.0,,Southampton,03-Jan-90
5,4,3,1,"Abbott, Mr. Rossmore Edward",female,35.0,2.0,20.25,Southampton,04-Jan-90


In [26]:

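# Parse strings such as '01-Jan-90' into datetime values
df['date'] = pd.to_datetime(df['date'])
# Render as day-month-year text (this turns the column back into strings)
df['date'] = df['date'].dt.strftime('%d-%m-%Y')

In [27]:
# Parse the formatted text again so the column ends up with a datetime64 dtype
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')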
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1301 entries, 1 to 1301
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Sr               1301 non-null   int64         
 1   passenger_class  1301 non-null   int64         
 2   survived         1301 non-null   int64         
 3   name             1301 non-null   object        
 4   gender           1300 non-null   object        
 5   age              1043 non-null   float64       
 6   family           1299 non-null   float64       
 7   fare             1298 non-null   float64       
 8   embarked         1295 non-null   object        
 9   date             1301 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(3), object(3)
memory usage: 101.8+ KB
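
For reference, the date handling can be sketched with an explicit input format, which avoids relying on format inference. This assumes every raw value follows the `01-Jan-90` pattern seen in the earlier output and that month abbreviations are in English; the two-digit year `90` is read as 1990, matching the results above. A formatted string column is only needed for display, so the datetime column itself can be kept as parsed.

```python
import pandas as pd

toy = pd.Series(["01-Jan-90", "02-Jan-90"])

# %d = day, %b = abbreviated month name, %y = two-digit year.
parsed = pd.to_datetime(toy, format="%d-%b-%y")

# Optional: a day-month-year text rendering for display purposes only.
as_text = parsed.dt.strftime("%d-%m-%Y")

print(parsed.dtype)
print(as_text.tolist())
```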


In [28]:
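# Standardize capitalization of the two gender labels
df['gender'] = df['gender'].replace({'male': 'Male', 'female': 'Female'})
# Fill the one missing gender with 'Male' (the affected row is 'Abbott, Mr. Rossmore Edward')
df['gender'] = df['gender'].fillna('Male')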
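
A slightly more general way to standardize text labels is to normalize whitespace and case with the `.str` accessor before filling gaps. This is a sketch on toy data; the padded `" Female "` value is a made-up example of an inconsistency, not something observed in this dataset.

```python
import pandas as pd

toy = pd.Series(["male", "female", None, " Female "], name="gender")

# Trim surrounding whitespace, then capitalize the first letter only.
cleaned = toy.str.strip().str.capitalize()

# Missing entries pass through the string methods unchanged.
n_missing = int(cleaned.isna().sum())

# Fill what is left with a chosen default ('Male' mirrors the cell above).
filled = cleaned.fillna("Male")

print(filled.tolist())
```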

In [29]:
# Fill missing ages with the mean of the observed ages
avg_age = df['age'].mean()
df['age'] = df['age'].fillna(avg_age)

In [30]:
df['embarked']= df['embarked'].fillna(df['embarked'].mode()[0])

In [31]:
df.head()


Unnamed: 0,Sr,passenger_class,survived,name,gender,age,family,fare,embarked,date
1,1,3,0,Mr. Anthony,Male,42.0,0.0,7.55,Southampton,1990-01-01
2,1,3,0,Mr. Anthony,Male,42.0,0.0,7.55,Southampton,1990-01-01
3,2,3,0,Master. Eugene Joseph,Male,29.97883,2.0,20.25,Southampton,1990-01-02
4,3,2,0,"Abbott, Mr. Rossmore Edward",Male,29.97883,2.0,,Southampton,1990-01-03
5,4,3,1,"Abbott, Mr. Rossmore Edward",Female,35.0,2.0,20.25,Southampton,1990-01-04


In [32]:
df['family'] =df['family'].fillna(df['family'].mode()[0])

In [33]:
df['family'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 1301 entries, 1 to 1301
Series name: family
Non-Null Count  Dtype  
--------------  -----  
1301 non-null   float64
dtypes: float64(1)
memory usage: 10.3 KB


In [34]:
df['fare'] =df['fare'].fillna(df['fare'].mean())
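
The imputation steps in the preceding cells (mean for `age` and `fare`, mode for `family` and `embarked`) can be gathered into one `fillna` call with a per-column dictionary. The sketch below uses a toy frame and a hypothetical `fill_values` mapping; it illustrates the pattern rather than reproducing the notebook's exact numbers.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [42.0, np.nan, 35.0],
    "family": [0.0, 0.0, np.nan],
    "embarked": ["Southampton", None, "Southampton"],
})

# Mean for continuous columns, mode (most frequent value) for discrete ones.
fill_values = {
    "age": toy["age"].mean(),
    "family": toy["family"].mode()[0],
    "embarked": toy["embarked"].mode()[0],
}

# A dict passed to fillna() applies each value to the column of the same name.
imputed = toy.fillna(value=fill_values)

print(imputed)
```

Computing all fill values before filling keeps each statistic based on the observed data only.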

In [35]:
df.head(5)

Unnamed: 0,Sr,passenger_class,survived,name,gender,age,family,fare,embarked,date
1,1,3,0,Mr. Anthony,Male,42.0,0.0,7.55,Southampton,1990-01-01
2,1,3,0,Mr. Anthony,Male,42.0,0.0,7.55,Southampton,1990-01-01
3,2,3,0,Master. Eugene Joseph,Male,29.97883,2.0,20.25,Southampton,1990-01-02
4,3,2,0,"Abbott, Mr. Rossmore Edward",Male,29.97883,2.0,33.462182,Southampton,1990-01-03
5,4,3,1,"Abbott, Mr. Rossmore Edward",Female,35.0,2.0,20.25,Southampton,1990-01-04


In [36]:
df = df.reset_index(drop=True)
df['Sr'] = df.index + 1
display(df.head())

Unnamed: 0,Sr,passenger_class,survived,name,gender,age,family,fare,embarked,date
0,1,3,0,Mr. Anthony,Male,42.0,0.0,7.55,Southampton,1990-01-01
1,2,3,0,Mr. Anthony,Male,42.0,0.0,7.55,Southampton,1990-01-01
2,3,3,0,Master. Eugene Joseph,Male,29.97883,2.0,20.25,Southampton,1990-01-02
3,4,2,0,"Abbott, Mr. Rossmore Edward",Male,29.97883,2.0,33.462182,Southampton,1990-01-03
4,5,3,1,"Abbott, Mr. Rossmore Edward",Female,35.0,2.0,20.25,Southampton,1990-01-04


In [37]:
df.columns = ['Sr','Passenger_class(1-3)','Survived(0,1)','Name','Gender', 'Age','Family', 'Fare', 'Embarked','Date']
df.head()

Unnamed: 0,Sr,Passenger_class(1-3),"Survived(0,1)",Name,Gender,Age,Family,Fare,Embarked,Date
0,1,3,0,Mr. Anthony,Male,42.0,0.0,7.55,Southampton,1990-01-01
1,2,3,0,Mr. Anthony,Male,42.0,0.0,7.55,Southampton,1990-01-01
2,3,3,0,Master. Eugene Joseph,Male,29.97883,2.0,20.25,Southampton,1990-01-02
3,4,2,0,"Abbott, Mr. Rossmore Edward",Male,29.97883,2.0,33.462182,Southampton,1990-01-03
4,5,3,1,"Abbott, Mr. Rossmore Edward",Female,35.0,2.0,20.25,Southampton,1990-01-04
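
Before saving, a few quick checks can confirm the cleaning had the intended effect. The helper below is a hypothetical sketch (`check_clean` is not part of the notebook) and is run here on a toy frame that uses the same column names; the specific checks are assumptions about what a clean frame should satisfy.

```python
import pandas as pd

def check_clean(frame: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if frame.isna().any().any():
        problems.append("missing values remain")
    if not frame["Sr"].is_unique:
        problems.append("Sr is not unique")
    if not frame["Survived(0,1)"].isin([0, 1]).all():
        problems.append("Survived(0,1) has values other than 0/1")
    return problems

toy = pd.DataFrame({
    "Sr": [1, 2],
    "Survived(0,1)": [0, 1],
    "Embarked": ["Southampton", "Cherbourg"],
})

problems = check_clean(toy)
print(problems)
```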


In [38]:
file_name = "data_titanic.csv"
# An absolute path also works, e.g. "C:/Users/YourUser/Documents/my_dataset.csv" (Windows)
# or "/home/youruser/documents/my_dataset.csv" (Linux/macOS).
# A relative path saves into the current working directory.
# Note: no path separator is added here, so the result is the single file name
# "downloadsdata_titanic.csv" in the working directory, not a file inside a "downloads" folder.
file_path = "downloads" + file_name

# Save the DataFrame to a CSV file without the index column
df.to_csv(file_path, index=False)


In [39]:
print(file_path)

downloadsdata_titanic.csv
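
The printed path shows that plain string concatenation adds no separator between the folder prefix and the file name. If the intent is to save into a subfolder, `pathlib` can build the path instead. This is a sketch under that assumption; `out_dir` is a hypothetical name, and any cell that reads the file back would need to use the same path.

```python
from pathlib import Path

file_name = "data_titanic.csv"
out_dir = Path("downloads")  # hypothetical subfolder of the working directory

# The / operator joins path parts with the correct separator for the platform.
file_path = out_dir / file_name

# out_dir.mkdir(parents=True, exist_ok=True) would create the folder before
# calling df.to_csv(file_path, index=False).
print(file_path.as_posix())
```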
