# ai03bTasks
# Machine Learning: Decision Trees
## Data Cleaning

**Instructions:**
- Complete each task below by running the code cells
- Fill in the blanks and answer questions in markdown cells
- Save your work when finished
- Push this file to your GitHub "Machine Learning" Repo under the appropriate folder.

---
## Setup: Import Libraries and Load Data

Run this cell first to set up your environment.

In [1]:
import pandas as pd

# Load the Titanic dataset if it's not already
df = pd.read_csv("Titanic Dataset.csv")

print("✓ Data loaded successfully!")
print(f"Original shape: {df.shape}")
print(f"\nOriginal columns: {df.columns.tolist()}")

✓ Data loaded successfully!
Original shape: (1309, 14)

Original columns: ['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest']


---
## Task 1: Understand the Original Data

Before cleaning, let's see what we're working with.

### 1a. How many rows and columns are in the original dataset?

In [3]:
# TODO: Print the shape of the DataFrame
print(f"Rows: {df.shape[0]}")
print(f"Columns: {df.shape[1]}")

Rows: 1309
Columns: 14


### 1b. Display the first few rows

In [4]:
# TODO: Use .head() to display the first 5 rows
print("First 5 rows of the dataset:")
print(df.head())


First 5 rows of the dataset:
   pclass  survived                                             name     sex  \
0       1         1                    Allen, Miss. Elisabeth Walton  female   
1       1         1                   Allison, Master. Hudson Trevor    male   
2       1         0                     Allison, Miss. Helen Loraine  female   
3       1         0             Allison, Mr. Hudson Joshua Creighton    male   
4       1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female   

     age  sibsp  parch  ticket      fare    cabin embarked boat   body  \
0  29.00      0      0   24160  211.3375       B5        S    2    NaN   
1   0.92      1      2  113781  151.5500  C22 C26        S   11    NaN   
2   2.00      1      2  113781  151.5500  C22 C26        S  NaN    NaN   
3  30.00      1      2  113781  151.5500  C22 C26        S  NaN  135.0   
4  25.00      1      2  113781  151.5500  C22 C26        S  NaN    NaN   

                         home.dest  
0       

### 1c. Check for missing values in the original data

In [5]:
# TODO: Use .isnull().sum() to count missing values per column
print("Missing values in original data:")
df.isnull().sum()

Missing values in original data:


pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

**Q: Which columns have the most missing values?**

A: Body

---
## Task 2: Select Useful Features

We'll keep only the columns that help predict survival.

### 2a. Keep only these 8 columns: pclass, survived, sex, age, sibsp, parch, fare, embarked

In [7]:
# TODO: Select only the useful columns
# Hint: df = df[['column1', 'column2', ...]]
df = df[['pclass', 'survived', 'sex', 'age', 
         'sibsp', 'parch', 'fare', 'embarked']]

print("✓ Columns selected!")
print(f"New shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")

✓ Columns selected!
New shape: (1309, 8)

Columns: ['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']


### 2b. Explain why we dropped certain columns

**Q: Why did we drop the 'name' column?**.  (Refer to the lesson if you don't recall - or ask a neighbor!)

A: It contains too many unique values. Because almost every name is unique, the model cannot find a pattern (like "People named John survive more often"). In Machine Learning, we generally drop identifiers that are unique to a single row unless we extract features from them first (like extracting "Mr." or "Ms." from the name).

**Q: Why did we drop the 'cabin' column?**

A: It has too many missing values. In the Titanic dataset, the cabin column is empty (NaN) for the vast majority of passengers (often >75%). 

**Q: Why did we drop the 'boat' column?**

A: It causes "Data Leakage" (it gives away the answer). The boat column lists the lifeboat identifier. If a passenger has a boat number, it implies they survived. If we include this, the model isn't predicting survival based on passenger characteristics; it's just checking if they were assigned a boat. 

---
## Task 3: Check for Missing Values

Now let's see which of our selected columns have missing values.

In [9]:
# TODO: Check for missing values in the cleaned dataset
print("Missing values after feature selection:")
print(df.isnull().sum())


Missing values after feature selection:
pclass        0
survived      0
sex           0
age         263
sibsp         0
parch         0
fare          1
embarked      2
dtype: int64


**Q: How many missing values are in the 'age' column?**

A: 263

**Q: How many missing values are in the 'fare' column?**

A: 1

**Q: How many missing values are in the 'embarked' column?**

A: 2

---
## Task 4: Handle Missing Age Values

Age has many missing values. We'll fill them with the median age.

### 4a. Calculate the median age

In [25]:
# TODO: Calculate the median age
# TODO: Write code to find the total number of rows

# Used here Ai because I had a df erorr i dont know why because previous 1,2,3 tasks df worked


# Load the file using the exact name you found
df = pd.read_csv('Titanic Dataset.csv')
median_age = df['age'].median()
print(f"Median age: {median_age}")

Median age: 28.0


### 4b. Fill missing ages with the median

In [26]:
# TODO: Fill missing ages with median_age

# Load the file using the exact name you found
df = pd.read_csv('Titanic Dataset.csv')
# Hint: Use .fillna(value, inplace=True)
# df['age'].fillna(median_age, inplace=True) doesnt work
df['age'] = df['age'].fillna(median_age)

print("✓ Missing ages filled with median!")
print(f"Missing ages now: {df['age'].isnull().sum()}")

✓ Missing ages filled with median!
Missing ages now: 0


### 4c. Compare median vs mean for age

In [7]:
# Calculate both median and mean
print(f"Median age: {df['age'].median():.2f}")
print(f"Mean age: {df['age'].mean():.2f}")

Median age: 28.00
Mean age: 29.50


**Q: Which is larger, the median or mean? Why might this be?**

A: Mean larger Median age: 28.00 Mean age: 29.50 because the age distribution is right-skewed.

---
## Task 5: Handle Missing Fare Values

Fare has only 1 missing value. We'll fill it with the median fare.

In [20]:
# TODO: Calculate median fare
median_fare = df['fare'].median()
print(f"Median fare: ${median_fare:.2f}")

# TODO: Fill missing fare with median
df['fare'].fillna(median_fare, inplace=True)

print("✓ Missing fare filled!")
print(f"Missing fares now: {df['fare'].isnull().sum()}")

Median fare: $14.45
✓ Missing fare filled!
Missing fares now: 0


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['fare'].fillna(median_fare, inplace=True)


---
## Task 6: Handle Missing Embarked Values

Embarked has only 2 missing values. Since this is so few, we'll drop those rows.

### 6a. How many rows before dropping?

In [8]:
rows_before = len(df)
print(f"Rows before dropping: {rows_before}")

Rows before dropping: 1309


### 6b. Drop rows with missing embarked values

In [16]:
# Drop rows where embarked is missing
df.dropna(subset=['embarked'], inplace=True)

# Verify
print(f"Missing embarked: {df['embarked'].isnull().sum()}")
print(f"Rows remaining: {len(df)}")

Missing embarked: 0
Rows remaining: 1307


**Q: What percentage of data did we lose by dropping these rows?**

A: 0.15%

---
## Task 7: Verify All Missing Values Are Gone

Let's do a final check to make sure our data is completely clean.

In [19]:
# TODO: Check for any remaining missing values
print("Final missing value check:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

Final missing value check:
pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        0
boat          823
body         1186
home.dest     563
dtype: int64

Total missing values: 3850


**Q: Are there any missing values remaining? (Should be 0!)**

A: 

---
## Task 8: Summary Statistics

Now that our data is clean, let's look at summary statistics.

In [None]:
# Display summary statistics
print("Summary statistics after cleaning:")
df.describe()

**Q: What is the average age after filling missing values?**

A: 

**Q: What is the average fare?**

A: 

---
## Task 9: Save the Cleaned Data
Save your cleaned data to a new CSV file so you can use it in the next lesson.

In [None]:
# Save cleaned data
# Writes the DataFrame to a new CSV file without adding extra row numbers (index)
df.to_csv("Titanic_Cleaned.csv", index=False)     

# Prints a confirmation message so the user knows the save was successful
print("✓ Cleaned data saved to 'Titanic_Cleaned.csv'")   

---
## Reflection Questions

Answer these questions based on your work:

**1. Why is it important to check for missing values before building a model?**

Answer: 

**2. When should you fill missing values vs. drop rows?**

Answer: 

**3. Why did we use median instead of mean to fill missing ages?**

Answer: 

**4. What could happen if we trained a model on data with missing values?**

Answer: 

**5. Name one real-world scenario where missing data might occur.**

Answer: 

---
## Lesson Complete!

You've successfully cleaned the Titanic dataset!

**Summary of what you did:**
- Selected 8 useful features from 15 columns
- Filled missing ages with median
- Filled missing fares with median
- Dropped 2 rows with missing embarked values
- Verified all missing values are gone

Save this notebook and push to GitHub.

**Next lesson**: Convert categorical data to numbers