In [3]:
import pandas as pd
import seaborn as sns
import numpy as np

titanic = sns.load_dataset("titanic")
iris = sns.load_dataset("iris")

In [4]:
# Look at basic structure
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [5]:
# Look at basic structure
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [6]:
# Count missing values
print("Missing in Titanic:")
print(titanic.isnull().sum())

print("\nMissing in Iris:")
print(iris.isnull().sum())

# Check for duplicate rows
print("Titanic duplicates:", titanic.duplicated().sum())
print("Iris duplicates:", iris.duplicated().sum())

Missing in Titanic:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

Missing in Iris:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
Titanic duplicates: 107
Iris duplicates: 1


# Initial Data Quality Overview

A quick scan for missing values and duplicate rows shows the following:

## Titanic Dataset

| Column        | Missing Values |
|---------------|----------------|
| age           | 177            |
| deck          | 688            |
| embarked      | 2              |
| embark_town   | 2              |
| All others    | 0              |

- **Age** is missing for ~20% of passengers (177 of 891). It is an important feature for survival analysis, so it should be imputed rather than dropped (e.g., median age by class).
- **Deck** is missing in about 77% of records (688 of 891), which makes it unreliable for modeling unless it is reduced to a categorical indicator (e.g., "known vs unknown").
- **Embarked** and **embark_town** each have 2 missing values. With so few gaps, filling them with the mode (most common value) is a reasonable choice.
- **107 duplicate rows** exist and will be removed to avoid skewing results. The dataset has no passenger identifier, so some of these rows may be distinct passengers who happen to share every recorded attribute.
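
The missing-value shares and the "known vs unknown" idea for `deck` can be sketched as follows. This is a minimal illustration on a made-up toy frame rather than the real Titanic data, and the `deck_known` column name is a hypothetical choice, not something the notebook defines.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (values are made up for illustration)
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": ["C", None, None, "E"],
})

# Share of missing values per column, as a fraction between 0 and 1
missing_share = toy.isnull().mean()

# Hypothetical indicator column: True where a deck was recorded
toy["deck_known"] = toy["deck"].notna()

print(missing_share)
print(toy)
```

The same two calls should work on the `titanic` frame, since `notna()` also applies to categorical columns such as `deck`.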

---

## Iris Dataset

- No missing values in any feature.
- **1 duplicate row** detected. This is minimal, but it will be dropped for cleanliness.
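
Before dropping duplicates from either dataset, it can help to look at the rows involved. The sketch below uses a small made-up frame; `duplicated(keep=False)` flags every member of a duplicate group, whereas the default flags only the later copies.

```python
import pandas as pd

# Toy frame with one repeated row (values are made up for illustration)
toy = pd.DataFrame({
    "sepal_length": [5.8, 5.1, 5.8],
    "species": ["virginica", "setosa", "virginica"],
})

# keep=False marks every member of a duplicate group, not just the later copies
dupes = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each group by default
deduped = toy.drop_duplicates()

print(dupes)
print(len(toy), "->", len(deduped))
```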

## Action Plan

| Task                            | Action                                      |
|---------------------------------|---------------------------------------------|
| Titanic age missing             | Fill by median per `pclass`                 |
| Titanic deck missing            | Drop, or reduce to a known/unknown indicator |
| Titanic embarked / embark_town  | Fill with most common value                 |
| Titanic duplicates              | Drop duplicates                             |
| Iris duplicates                 | Drop duplicate row                          |

Cleaning these issues ensures the data is structured, consistent, and ready for trustworthy analysis.

In [9]:
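# Drop exact duplicate rows first, so imputation cannot create new duplicates
titanic.drop_duplicates(inplace=True)

# Fill missing ages with the median age of each passenger class
titanic["age"] = titanic.groupby("pclass")["age"].transform(
    lambda x: x.fillna(x.median())
)

# Fill missing embarkation values with the most common value
titanic["embarked"] = titanic["embarked"].fillna(titanic["embarked"].mode()[0])
titanic["embark_town"] = titanic["embark_town"].fillna(titanic["embark_town"].mode()[0])

# Normalize port codes to upper case
titanic["embarked"] = titanic["embarked"].str.upper()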

In [12]:
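# Simulate a missing value (row 3, sepal_width) to exercise the cleaning steps
iris.iloc[3, 1] = np.nan

# Drop the row with the simulated NaN, then the duplicate row
iris.dropna(inplace=True)
iris.drop_duplicates(inplace=True)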

In [13]:
titanic.to_csv("cleaned_titanic.csv", index=False)
iris.to_csv("iris_clean.csv", index=False)

## Titanic Insight

Filling missing ages with the median of each passenger class keeps every row available for modeling and preserves the age differences between classes, which a single global median or mean would blur.
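
To make the difference concrete, here is a minimal sketch comparing a global median fill with a per-class median fill. The toy ages are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy data: first class skews older, third class younger (made-up values)
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [40.0, 50.0, np.nan, 20.0, 24.0, np.nan],
})

# One median for everyone: both gaps receive the same value
global_fill = toy["age"].fillna(toy["age"].median())

# One median per class: each gap receives its own class's median
class_median = toy.groupby("pclass")["age"].transform("median")
class_fill = toy["age"].fillna(class_median)

print(pd.DataFrame({"global": global_fill, "per_class": class_fill}))
```

In this toy case the global fill assigns 32.0 to both missing ages, while the per-class fill assigns 45.0 in first class and 22.0 in third class.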

## Iris Insight

After simulating a missing value and dropping that row, the dataset is nearly unchanged: it loses only the row with the simulated NaN and the one duplicate row, and it is ready for training with no NaNs or duplicates.
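
A quick way to confirm that a cleaned frame really has no NaNs or duplicates is to assert it directly. The sketch below runs on a made-up toy frame; the same two assertions could be applied to `iris` or `titanic` after cleaning.

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN row and one duplicate row (made-up values)
toy = pd.DataFrame({
    "sepal_width": [3.5, np.nan, 3.0, 3.0],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

# Same steps as in the notebook: drop NaN rows, then drop duplicates
cleaned = toy.dropna().drop_duplicates().reset_index(drop=True)

# Sanity checks: these raise AssertionError if any problem remains
assert not cleaned.isnull().any().any()
assert not cleaned.duplicated().any()

print(len(toy), "->", len(cleaned))
```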