# 🛠️ Mini-Project: Real-world Data Processing & Visualization

### Overview:
- We will work with a real-world CSV dataset.
- Steps: **Load → Clean → Group → Plot**
- This is a common workflow in data analysis.

### Why is this important?
- Real data is messy and needs cleaning.
- Grouping helps summarize data.
- Visualization makes data insights clear.

---

## Step 1: Load Data (CSV) 📥

We load data from a CSV file into a pandas DataFrame.
We use the famous 'Titanic' dataset here for demonstration.

- Columns include: PassengerId, Survived, Pclass, Name, Sex, Age, Fare, etc.

In [1]:
# Import pandas
import pandas as pd

# Load the Titanic dataset from a URL (or from a local CSV file if you have one)
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
df = pd.read_csv(url)

# Show first 5 rows
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


### Notes:
- `pd.read_csv()` reads a CSV file (from a local path or a URL) into a DataFrame.
- `df.head()` displays the first 5 rows by default.
- Real datasets usually have many columns and rows.
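
Right after loading, it often helps to check the size and column types before doing anything else. The sketch below is one way to do that. It assumes a small inline sample (`sample_csv`, a hypothetical name) built from a subset of the columns in the rows previewed above, so it runs without a network connection.

```python
import io

import pandas as pd

# Assumption: a tiny inline sample (three rows, six columns) taken from the
# preview above, so the example does not need to download anything.
sample_csv = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22.0,7.25
2,1,1,female,38.0,71.2833
3,1,3,female,26.0,7.925
"""

# read_csv accepts file-like objects as well as paths and URLs
sample = pd.read_csv(io.StringIO(sample_csv))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # data type of each column
sample.info()         # non-null counts per column and memory usage
```

The same three calls work on the full `df` loaded from the URL.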

## Step 2: Data Cleaning 🧹

- Real data can have missing or incorrect values.
- Let's check for missing values and handle them.

- We will fill missing 'Age' values with the mean age.
- We will drop the columns we don't need for this project ('PassengerId', 'Name', 'Ticket', 'Cabin', 'Embarked').

In [None]:
# Check missing values
df.isnull().sum()

### Handle missing values in 'Age'

In [None]:
# Fill missing Age values with the mean age
# (assign the result back; inplace=True on a single column is unreliable in newer pandas)
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)

# Drop columns that won't be used
df_clean = df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', 'Embarked'])

# Check again for missing values
df_clean.isnull().sum()

### Notes:
- `.fillna()` fills missing values.
- `.drop()` removes unwanted columns.
- Always check for missing data before analysis.
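
The mean is only one possible fill value. As a rough sketch of the alternatives, the example below fills a numeric column with its median and a categorical column with its mode. The `toy` DataFrame and its values are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap in a numeric and one in a text column
toy = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0, 80.0],
    'Embarked': ['S', 'C', 'S', None, 'S'],
})

# Fraction of missing values per column (mean of True/False flags)
missing_share = toy.isnull().mean()
print(missing_share)

# Numeric column: the median is less affected by outliers (like 80) than the mean
toy['Age'] = toy['Age'].fillna(toy['Age'].median())

# Categorical column: use the most frequent value (mode)
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])

print(toy)
```

Which fill value is appropriate depends on the column; a skewed numeric column is usually a better fit for the median than the mean.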

## Step 3: Grouping and Aggregation 🔍

- Let's analyze survival rate by gender.
- We'll group by 'Sex' and calculate average survival.

- We can also group by 'Pclass' (passenger class) to see survival trends.

In [None]:
# Group by Sex and calculate mean survival rate
grouped_sex = df_clean.groupby('Sex')['Survived'].mean()
print('Survival Rate by Gender:')
print(grouped_sex)

# Group by Passenger Class and calculate mean survival rate
grouped_class = df_clean.groupby('Pclass')['Survived'].mean()
print('\nSurvival Rate by Passenger Class:')
print(grouped_class)

### Notes:
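- `.groupby()` splits the rows into groups that share the same value in a column.
- `.mean()` calculates the average of a numeric column. Because 'Survived' is 0 or 1, its mean is the fraction of passengers who survived.
- This gives us a compact summary of each group.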
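
Grouping also works with more than one column and more than one statistic at a time. The sketch below groups by two columns and computes both a mean and a count. The `toy` data is hypothetical, and the output column names `survival_rate` and `passengers` are chosen here only for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'female', 'male', 'male'],
    'Pclass': [3, 1, 3, 1, 3, 1],
    'Survived': [0, 1, 1, 1, 0, 0],
})

# Group by two columns, then compute two statistics with named aggregation
summary = (
    toy.groupby(['Sex', 'Pclass'])['Survived']
       .agg(survival_rate='mean', passengers='count')
       .reset_index()
)

print(summary)
```

Showing the count next to the mean is useful because a rate computed from only a few passengers is less trustworthy than one computed from many.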

## Step 4: Visualization 📈

We will use **matplotlib** and **seaborn** to visualize the grouped data.

In [None]:
# Import visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Plot survival rate by gender
grouped_sex.plot(kind='bar', color=['skyblue', 'lightpink'])
plt.title('Survival Rate by Gender')
plt.ylabel('Survival Rate')
plt.ylim(0,1)
plt.show()

In [None]:
# Convert grouped_class to DataFrame for seaborn
grouped_class_df = grouped_class.reset_index()
grouped_class_df.columns = ['Pclass', 'Survival Rate']

# Seaborn barplot for survival by passenger class
sns.barplot(data=grouped_class_df, x='Pclass', y='Survival Rate', palette='viridis')
plt.title('Survival Rate by Passenger Class')
plt.ylim(0,1)
plt.show()

### Notes:
- `.plot(kind='bar')` creates bar plots in pandas/matplotlib.
- Seaborn uses DataFrames and offers better aesthetics by default.
- Fixing the y-axis to the range 0 to 1 makes survival rates directly comparable between plots.
- Titles and axis labels tell the reader what each plot shows.
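
Since both plots repeat the same title, label, and axis-limit steps, one option is to wrap them in a small function. The sketch below shows one possible way. `plot_rates` is a hypothetical helper, not part of pandas or matplotlib, and `example_rates` holds made-up numbers rather than results from the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_rates(rates, title, xlabel):
    """Hypothetical helper: draw a bar chart of rates in [0, 1] and return the Axes."""
    fig, ax = plt.subplots(figsize=(5, 4))
    rates.plot(kind='bar', ax=ax, color='steelblue')
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel('Survival Rate')
    ax.set_ylim(0, 1)                      # same scale for every plot
    ax.tick_params(axis='x', rotation=0)   # keep category labels horizontal
    return ax


# Made-up rates for illustration only
example_rates = pd.Series({'group A': 0.25, 'group B': 0.75})
ax = plot_rates(example_rates, 'Survival Rate by Group', 'Group')
```

In the notebook, the same helper could be called as `plot_rates(grouped_sex, 'Survival Rate by Gender', 'Sex')`, followed by `plt.show()` when running outside a notebook.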

---
## 🎯 Tasks for Students
1. Load a CSV file of your choice and check for missing data.
2. Fill missing values appropriately (mean, median or mode).
3. Group data by one or two categorical columns and calculate mean or count.
4. Plot the grouped data using matplotlib or seaborn.
5. Add meaningful titles and labels to your plots.

## ✅ MCQs

<span style='color:green;font-weight:bold;'>Q1:</span> Which pandas method helps to fill missing values?
- a) `.dropna()` ❌
- b) `.fillna()` ✅ ✔️
- c) `.groupby()` ❌
- d) `.plot()` ❌

<span style='color:green;font-weight:bold;'>Q2:</span> To calculate average survival rate grouped by gender, we use?
- a) `df.groupby('Sex')['Survived'].mean()` ✅ ✔️
- b) `df.mean()` ❌
- c) `df.fillna()` ❌
- d) `df.plot()` ❌

<span style='color:green;font-weight:bold;'>Q3:</span> Which library provides better default aesthetics for plots?
- a) Matplotlib ❌
- b) Seaborn ✅ ✔️
- c) Pandas ❌
- d) NumPy ❌

<span style='color:green;font-weight:bold;'>Q4:</span> What does `df.drop(columns=[...])` do?
- a) Deletes rows ❌
- b) Removes columns ✅ ✔️
- c) Fills missing data ❌
- d) Groups data ❌

---