Data Cleansing with the Titanic Dataset in Google Colab
This lab walks you step by step through practical data-cleaning operations on the Titanic dataset. It is designed for hands-on use in Google Colab.

Objectives

Load and inspect the Titanic dataset.

Identify and handle missing or problematic data.

Apply data cleaning techniques using pandas.

In [1]:


# Step 1: Set Up the Environment
# Install pandas if not available (usually pre-installed in Colab)
!pip install pandas





In [2]:


# Step 2: Load the Titanic Dataset
import pandas as pd

url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)
print(df.head())


   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


In [3]:


# Step 3: Preliminary Data Inspection
# General info and statistics
print(df.info())
print(df.describe())

# Check for missing values
print(df.isnull().sum())



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.48659
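
The non-null counts reported by `df.info()` can be turned into a share of missing values per column, which makes it easier to decide which columns to drop and which to fill. The sketch below is a minimal illustration that copies the counts for `Age`, `Cabin`, and `Embarked` from the output above rather than reloading the data; the variable names are assumptions for illustration.

```python
import pandas as pd

# Non-null counts copied from the df.info() output (891 rows in total).
n_rows = 891
non_null = pd.Series({"Age": 714, "Cabin": 204, "Embarked": 889})

# Share of missing values per column, as a percentage rounded to one decimal.
missing_pct = ((n_rows - non_null) / n_rows * 100).round(1)

# Largest share of missing values first.
print(missing_pct.sort_values(ascending=False))
```

On the loaded DataFrame, `df.isnull().mean() * 100` should give the same kind of summary for every column at once.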

In [4]:


# Step 4: Data Cleansing Tasks
# 4.1: Handling Missing Values

# Drop 'Cabin': only 204 of its 891 values are non-null
df = df.drop(columns=['Cabin'])

# Fill missing 'Age' values with the median.
# Assign the result back: calling fillna(..., inplace=True) on df['Age']
# acts on an intermediate copy and raises a FutureWarning in recent pandas.
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing 'Embarked' values with the mode (the most frequent value)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
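
As an alternative way to write the same fill step, `DataFrame.fillna` also accepts a dictionary that maps column names to fill values, so several columns can be handled in one call. The sketch below uses a small made-up frame (the values are illustrative, not taken from the Titanic data) so it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for two Titanic columns (illustrative values only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
})

# One fill value per column: the median for the numeric column, the mode for the categorical one.
fill_values = {
    "Age": toy["Age"].median(),
    "Embarked": toy["Embarked"].mode()[0],
}

# fillna returns a new DataFrame; assign it back rather than relying on inplace=True.
toy = toy.fillna(fill_values)

print(toy)
print(toy.isnull().sum())
```

The median is used for `Age` because it is less sensitive to extreme values than the mean; whether that is the right choice depends on the analysis that follows.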


In [5]:


# 4.2: Remove Duplicates (if any)
df = df.drop_duplicates()
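
`drop_duplicates()` removes rows silently, so it can help to count the duplicates first and report how many rows were dropped. The following is a minimal sketch on a made-up frame with one repeated row; the frame and variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up frame in which the second and third rows are identical.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 2, 3],
    "Name": ["A", "B", "B", "C"],
})

n_before = len(toy)

# duplicated() marks every repeat of an earlier row as True.
n_dupes = int(toy.duplicated().sum())

# Drop the repeats and renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"{n_dupes} duplicate row(s) removed; {n_before} -> {len(deduped)} rows")
```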


In [None]:


# 4.3: Convert Categorical Variables to Numeric (optional)
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
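
To see what this encoding step produces, the sketch below applies the same two calls to a three-row made-up frame. With `drop_first=True`, the first category in sorted order (`C` here) is dropped and becomes the implicit baseline. Note also that `map` returns `NaN` for any label missing from the dictionary, so a quick check afterwards is worthwhile.

```python
import pandas as pd

# Made-up frame with the two categorical columns used in this step.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", "Q"],
})

# Binary mapping; labels not listed in the dictionary would become NaN.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
assert not toy["Sex"].isnull().any(), "unmapped label in 'Sex'"

# One-hot encode 'Embarked'; drop_first=True drops the first category ('C').
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)

print(encoded.columns.tolist())  # ['Sex', 'Embarked_Q', 'Embarked_S']
```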


In [None]:


# Step 5: Verify the Clean Data
print(df.isnull().sum())
print(df.head())
df.info()  # info() prints its report directly and returns None
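
Printed summaries have to be read by eye; the same checks can also be written as assertions that fail loudly if something was missed. The helper below, `check_clean`, is a hypothetical name introduced for this sketch, and it is exercised on a small made-up frame rather than on the Titanic data.

```python
import pandas as pd

def check_clean(frame: pd.DataFrame) -> None:
    """Raise AssertionError if the frame has missing values or duplicate rows."""
    missing = frame.isnull().sum()
    assert missing.sum() == 0, f"missing values remain:\n{missing[missing > 0]}"
    assert not frame.duplicated().any(), "duplicate rows remain"

# Made-up frame that is already clean.
toy = pd.DataFrame({"Age": [22.0, 28.0], "Sex": [0, 1]})
check_clean(toy)
print("all checks passed")
```

In the notebook, calling `check_clean(df)` after Step 4 would be one way to confirm the cleaning steps had the intended effect.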



Step 6: Reflection Questions
In your Colab notebook, answer:

What were the main issues you found during the cleaning process?

Why is it important to handle missing values appropriately?

What are potential risks if data cleansing is skipped before modeling/analysis?

**What were the main issues you found during the cleaning process?**

The main issues found during the cleaning process were:
1. **Missing values:** The 'Age', 'Cabin', and 'Embarked' columns had missing values. 'Cabin' had a significant number of missing values.
2. **Data types:** The 'Sex' and 'Embarked' columns were categorical and needed to be converted to numerical formats for potential modeling.
3. **Potential duplicates:** No duplicate rows were found in this run (each row has a unique 'PassengerId'), but checking for them is standard practice.


**Why is it important to handle missing values appropriately?**


Handling missing values appropriately is important because:
1. **Impact on analysis and modeling:** Many analytical and machine learning techniques either cannot handle missing values at all or produce biased or incorrect results when they are present.
2. **Data integrity:** Missing data can distort the distribution of variables and lead to inaccurate conclusions about the dataset.
3. **Model performance:** Models trained on data with unhandled missing values can have poor performance and generalization capabilities.
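
The points above can be illustrated with a small made-up series of ages (not Titanic data). The sketch shows two effects: a plain NumPy mean returns `nan` as soon as one value is missing, and filling every gap with the median keeps the median unchanged but shrinks the spread of the column.

```python
import numpy as np
import pandas as pd

# Made-up ages with three missing entries (illustrative values only).
ages = pd.Series([2.0, 20.0, 30.0, 40.0, 80.0, np.nan, np.nan, np.nan])

# NumPy's mean does not skip NaN, so a single missing value poisons the result.
print("NumPy mean with NaN:", np.mean(ages.to_numpy()))

# Median imputation: every gap receives the same central value.
imputed = ages.fillna(ages.median())

# pandas skips NaN by default, so the first figure uses only the observed values.
print("std of observed values:", ages.std())
print("std after imputation:  ", imputed.std())
```

The smaller standard deviation after imputation is a reminder that a simple fill is a modeling choice, not a neutral repair.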


**What are potential risks if data cleansing is skipped before modeling/analysis?**

Potential risks if data cleansing is skipped before modeling/analysis include:
1. **Inaccurate results:** Analysis or models built on dirty data will likely produce flawed or misleading results.
2. **Model errors:** Many algorithms will fail to run or produce errors when encountering dirty data (e.g., missing values, incorrect data types).
3. **Biased models:** Systematic issues in the data can lead to models that are biased and do not accurately reflect the underlying patterns.
4. **Misinterpretation of findings:** Conclusions drawn from analysis on unclean data can be incorrect, leading to poor decision-making.
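
As one concrete instance of the "incorrect data types" risk, the sketch below uses a made-up column of numbers stored as text. Sorting the raw strings orders them character by character, which gives a misleading result; converting with `pd.to_numeric` first restores numeric order. The series names are assumptions for illustration.

```python
import pandas as pd

# Made-up values stored as text, as can happen after a careless import.
fares_as_text = pd.Series(["10", "9", "100"])

# String sorting is lexicographic: "10" < "100" < "9".
print(fares_as_text.sort_values().tolist())

# Convert to numbers before analysis.
fares = pd.to_numeric(fares_as_text)
print(fares.sort_values().tolist())
```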
