<a href="https://colab.research.google.com/github/chebil/stat/blob/main/part1/ch01_assignment.ipynb" target="_blank" rel="noopener noreferrer"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 1 Assignment: Exploring Global Life Expectancy Data

## Overview

In this assignment, you will apply the concepts learned in Chapter 1 to analyze a real-world dataset. You will:

1. **Load and explore** a public dataset
2. **Clean the data** by handling missing values and data types
3. **Create visualizations** (bar charts, histograms)
4. **Calculate descriptive statistics** (mean, median, standard deviation)
5. **Draw conclusions** based on your analysis

## Dataset: Gapminder Life Expectancy Data

We will use the **Gapminder** dataset, which contains information about countries including:
- Life expectancy
- GDP per capita
- Population
- Continent

This dataset is publicly available and widely used for teaching data analysis.

**Source**: https://www.gapminder.org/data/

---

## Instructions

- Complete all the tasks marked with **TODO**
- Write your code in the provided cells
- Answer the questions in markdown cells
- Make sure your visualizations have proper labels and titles

---

## Part 1: Loading and Exploring the Data (15 points)

First, let's import the necessary libraries and load the dataset.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

In [None]:
# Load the Gapminder dataset from a public URL
url = "https://raw.githubusercontent.com/plotly/datasets/master/gapminderDataFiveYear.csv"
df = pd.read_csv(url)

# Display the first 10 rows
print("First 10 rows of the dataset:")
df.head(10)

### Task 1.1: Explore the Dataset Structure (5 points)

**TODO**: Use appropriate pandas methods to answer the following questions:
1. How many rows and columns does the dataset have?
2. What are the data types of each column?
3. Are there any missing values?
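
As a warm-up, the three inspection calls can be tried on a small made-up table before applying them to `df`. This is only a sketch: the `toy` DataFrame and its values are invented for illustration and are not part of the Gapminder data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from Gapminder) with one missing value
toy = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "temp_c": [21.5, np.nan, 19.0, 25.2],
    "visits": [3, 7, 2, 5],
})

n_rows, n_cols = toy.shape            # (rows, columns) tuple
col_types = toy.dtypes                # one dtype per column
missing_per_col = toy.isnull().sum()  # number of NaN values per column

print(f"{n_rows} rows, {n_cols} columns")
print(col_types)
print(missing_per_col)
```

The same three calls work unchanged on any DataFrame, including `df`.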

In [None]:
# TODO: Find the shape of the dataset (rows, columns)
# Hint: Use df.shape


In [None]:
# TODO: Display data types and info about the dataset
# Hint: Use df.info() or df.dtypes


In [None]:
# TODO: Check for missing values in each column
# Hint: Use df.isnull().sum()


### Task 1.2: Understand the Variables (5 points)

**TODO**: Find the unique values of the categorical columns, the range of years covered, and the number of distinct countries.

In [None]:
# TODO: Find unique continents in the dataset
# Hint: Use df['continent'].unique()


In [None]:
# TODO: Find the range of years covered in the dataset
# Hint: Use df['year'].min() and df['year'].max()


In [None]:
# TODO: How many unique countries are in the dataset?
# Hint: Use df['country'].nunique()


### Task 1.3: Filter Data for Analysis (5 points)

For the rest of this assignment, we will focus on the most recent year in the dataset.

**TODO**: Create a new DataFrame containing only the data from year 2007.

In [None]:
# TODO: Filter the dataset to include only year 2007
# Hint: df_2007 = df[df['year'] == 2007]

df_2007 = None  # Replace with your code

# Display the shape of the filtered dataset
print(f"Number of countries in 2007: {len(df_2007) if df_2007 is not None else 'Complete the TODO'}")

---

## Part 2: Data Cleaning (15 points)

Real-world data often contains issues that need to be addressed before analysis. In this section, you will practice data cleaning techniques.

### Task 2.1: Introduce and Handle Missing Values (10 points)

Let's simulate a real-world scenario by introducing some missing values, then handle them appropriately.
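
Filling a gap with the median of the row's own group is the core idea of this task. The sketch below shows one possible way to do it on invented data; the `toy` frame and the `group`, `score`, and `score_filled` column names are hypothetical. It passes the string `"median"` to `transform`, which is an alternative to a lambda-based approach.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two groups, each with one missing score
toy = pd.DataFrame({
    "group": ["x", "x", "x", "y", "y", "y"],
    "score": [10.0, np.nan, 30.0, 1.0, 2.0, np.nan],
})

# transform returns one value per original row: the median of that row's group
# (NaN values are skipped when the median is computed)
group_median = toy.groupby("group")["score"].transform("median")

# fillna aligns on the index, so each gap receives its own group's median
toy["score_filled"] = toy["score"].fillna(group_median)

print(toy)
```

Writing the result to a new column keeps the original values available for comparison; overwriting the original column works the same way.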

In [None]:
# Create a copy of the 2007 data for cleaning practice
df_clean = df_2007.copy()

# Introduce some missing values (simulating real-world data issues)
np.random.seed(42)  # For reproducibility
missing_indices = np.random.choice(df_clean.index, size=10, replace=False)
df_clean.loc[missing_indices[:5], 'lifeExp'] = np.nan
df_clean.loc[missing_indices[5:], 'gdpPercap'] = np.nan

print("Missing values introduced:")
print(df_clean.isnull().sum())

In [None]:
# TODO: Identify which countries have missing life expectancy values
# Hint: df_clean[df_clean['lifeExp'].isnull()]['country']


In [None]:
# TODO: Fill missing 'lifeExp' values with the median life expectancy of their continent
# Hint: Use groupby and transform with a lambda function
# Example: df_clean['lifeExp'] = df_clean.groupby('continent')['lifeExp'].transform(
#              lambda x: x.fillna(x.median()))


In [None]:
# TODO: Fill missing 'gdpPercap' values with the median GDP per capita of their continent


In [None]:
# TODO: Verify that there are no more missing values
# Hint: Use df_clean.isnull().sum()


### Task 2.2: Create a Derived Column (5 points)

**TODO**: Create a new column called `pop_millions` that contains the population in millions (divide population by 1,000,000).

In [None]:
# TODO: Create a new column 'pop_millions' = population / 1,000,000
# Hint: Access the column as df_clean['pop']; df_clean.pop is a DataFrame method, not the column
# Round to 2 decimal places using .round(2)


# Display sample of the result
# df_clean[['country', 'pop', 'pop_millions']].head(10)

---

## Part 3: Visualization (35 points)

Now let's create visualizations to understand our data better.

### Task 3.1: Bar Chart - Countries per Continent (10 points)

**TODO**: Create a bar chart showing the number of countries in each continent.

Requirements:
- Add a title: "Number of Countries per Continent (2007)"
- Label the x-axis: "Continent"
- Label the y-axis: "Number of Countries"
- Add value labels on top of each bar
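
The counting and labeling steps can be practiced on unrelated data first. The sketch below is one possible approach, assuming an invented `fruit` series; it uses the Axes interface (`ax.bar`, `ax.text`), and `plt.bar` with `plt.text` follows the same pattern.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: one label per observation
fruit = pd.Series(["apple", "pear", "apple", "fig", "apple", "pear"])
counts = fruit.value_counts()  # sorted from most to least frequent

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(counts.index, counts.values)

# Place each count just above the top center of its bar
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width() / 2, height,
            f"{int(height)}", ha="center", va="bottom")

ax.set_title("Toy Example: Items per Category")
ax.set_xlabel("Category")
ax.set_ylabel("Count")
fig.tight_layout()
```

In a notebook the figure is displayed at the end of the cell; in a plain script, add `plt.show()`.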

In [None]:
# TODO: Create a bar chart showing countries per continent
# Step 1: Count countries per continent using value_counts()
# Step 2: Create the bar chart using plt.bar()
# Step 3: Add labels and title
# Step 4: Add value labels on bars

plt.figure(figsize=(10, 6))

# Your code here


plt.tight_layout()
plt.show()

### Task 3.2: Histogram - Life Expectancy Distribution (10 points)

**TODO**: Create a histogram showing the distribution of life expectancy across all countries in 2007.

Requirements:
- Use 10 bins
- Add a title: "Distribution of Life Expectancy (2007)"
- Label the x-axis: "Life Expectancy (years)"
- Label the y-axis: "Number of Countries"
- Add a vertical line showing the mean life expectancy
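
A histogram with a marked mean can be sketched on synthetic numbers as follows. The `values` array is randomly generated for illustration and has nothing to do with life expectancy; the axis labels are placeholders.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=200)

fig, ax = plt.subplots(figsize=(6, 4))

# hist returns the bar heights and the bin edges (one more edge than bins)
counts, bin_edges, _ = ax.hist(values, bins=10, edgecolor="black")

# Dashed vertical line at the mean, with a label for the legend
mean_value = values.mean()
ax.axvline(mean_value, color="red", linestyle="--",
           label=f"Mean = {mean_value:.1f}")

ax.set_title("Toy Example: Distribution with Mean")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.legend()
fig.tight_layout()
```

The `label` argument on `axvline` is what makes the mean appear in the legend.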

In [None]:
# TODO: Create a histogram of life expectancy
# Step 1: Create histogram using plt.hist()
# Step 2: Calculate mean life expectancy
# Step 3: Add vertical line at the mean using plt.axvline()
# Step 4: Add labels, title, and legend

plt.figure(figsize=(10, 6))

# Your code here


plt.tight_layout()
plt.show()

### Task 3.3: Conditional Histogram - Life Expectancy by Continent (15 points)

**TODO**: Create separate histograms of life expectancy for each continent to compare distributions.

Requirements:
- Create a figure with 5 subplots (one per continent)
- Use the same x-axis range for all (for example, 35 to 85 years, so that values just below 40 are not cut off)
- Add appropriate titles and labels
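
Looping over groups and drawing one histogram per subplot can be sketched on synthetic data like this. The `toy` frame, its three groups, and the 0 to 30 range are all invented for illustration; the grid here is 2 x 2, so one panel is left over and hidden.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: three groups centered at different values
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 30),
    "value": np.concatenate([
        rng.normal(10, 2, 30),
        rng.normal(15, 2, 30),
        rng.normal(20, 2, 30),
    ]),
})

groups = toy["group"].unique()
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes = axes.flatten()

# zip stops after the last group, so extra axes are not touched here
for ax, name in zip(axes, groups):
    subset = toy.loc[toy["group"] == name, "value"]
    # Same bin range and x limits everywhere make the panels comparable
    ax.hist(subset, bins=8, range=(0, 30), edgecolor="black")
    ax.set_xlim(0, 30)
    ax.set_title(f"Group {name}")
    ax.set_xlabel("Value")
    ax.set_ylabel("Count")

# Hide any unused panels
for ax in axes[len(groups):]:
    ax.set_visible(False)

fig.tight_layout()
```

Fixing both `range` in `hist` and the x limits keeps the bin widths identical across panels, which makes the shapes easier to compare.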

In [None]:
# TODO: Create conditional histograms by continent
# Hint: Use plt.subplots() to create multiple plots
# Loop through continents and create a histogram for each

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()  # Flatten to make indexing easier

continents = df_clean['continent'].unique()

# Your code here - loop through continents and create histograms


# Hide the 6th subplot (we only have 5 continents)
axes[5].set_visible(False)

plt.suptitle('Life Expectancy Distribution by Continent (2007)', fontsize=14)
plt.tight_layout()
plt.show()

---

## Part 4: Descriptive Statistics (20 points)

Calculate and interpret key statistics for the data.

### Task 4.1: Summary Statistics (10 points)

**TODO**: Calculate the following statistics for life expectancy in 2007:
- Mean
- Median
- Standard deviation
- Minimum and Maximum
- Range (Max - Min)
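
One detail worth checking before computing these values: NumPy and pandas use different defaults for the standard deviation. The sketch below demonstrates this on a small invented `sample`; the numbers are chosen only so that the results are easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical sample chosen so the arithmetic is easy to check
sample = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean_val = sample.mean()
median_val = sample.median()
value_range = sample.max() - sample.min()

# pandas divides by n - 1 by default (sample standard deviation, ddof=1)
std_sample = sample.std()

# NumPy divides by n by default (population standard deviation, ddof=0)
std_population = np.std(sample.to_numpy())

print(f"Mean: {mean_val}, Median: {median_val}, Range: {value_range}")
print(f"Sample std (ddof=1):     {std_sample:.3f}")
print(f"Population std (ddof=0): {std_population:.3f}")
```

Passing `ddof` explicitly, for example `np.std(x, ddof=1)`, makes either library produce the other's result.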

In [None]:
# TODO: Calculate descriptive statistics for life expectancy
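# Use numpy or pandas methods: np.mean(), np.median(), np.std(), etc.
# Note: np.std() defaults to the population formula (ddof=0), while pandas .std() defaults to the sample formula (ddof=1)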

life_exp = df_clean['lifeExp']

# Calculate statistics
mean_life = None      # TODO: Calculate mean
median_life = None    # TODO: Calculate median
std_life = None       # TODO: Calculate standard deviation
min_life = None       # TODO: Calculate minimum
max_life = None       # TODO: Calculate maximum
range_life = None     # TODO: Calculate range

print("Life Expectancy Statistics (2007):")
print("="*40)
print(f"Mean:               {mean_life:.2f} years" if mean_life is not None else "Mean:               TODO")
print(f"Median:             {median_life:.2f} years" if median_life is not None else "Median:             TODO")
print(f"Standard Deviation: {std_life:.2f} years" if std_life is not None else "Standard Deviation: TODO")
print(f"Minimum:            {min_life:.2f} years" if min_life is not None else "Minimum:            TODO")
print(f"Maximum:            {max_life:.2f} years" if max_life is not None else "Maximum:            TODO")
print(f"Range:              {range_life:.2f} years" if range_life is not None else "Range:              TODO")

### Task 4.2: Statistics by Continent (10 points)

**TODO**: Calculate the mean and standard deviation of life expectancy for each continent.
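
Grouped statistics and error bars fit together naturally, because the table produced by `agg` can feed the bar heights and the `yerr` argument directly. The following is a sketch on invented data; the `toy` frame and the `team` and `score` names are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: two teams with three scores each
toy = pd.DataFrame({
    "team": ["r", "r", "r", "g", "g", "g"],
    "score": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# One row per team, with a 'mean' column and a 'std' column (sample std, ddof=1)
team_stats = toy.groupby("team")["score"].agg(["mean", "std"])
print(team_stats)

fig, ax = plt.subplots(figsize=(6, 4))
# yerr draws a line of +/- one standard deviation around each bar top
ax.bar(team_stats.index, team_stats["mean"], yerr=team_stats["std"], capsize=5)
ax.set_title("Toy Example: Mean Score per Team")
ax.set_xlabel("Team")
ax.set_ylabel("Mean score")
fig.tight_layout()
```

The `capsize` argument only adds small horizontal caps to the error bars and can be omitted.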

In [None]:
# TODO: Calculate mean and std of life expectancy by continent
# Hint: Use df_clean.groupby('continent')['lifeExp'].agg(['mean', 'std'])

# continent_stats = ...

# Display the results sorted by mean life expectancy
# continent_stats.sort_values('mean', ascending=False)

In [None]:
# TODO: Create a bar chart comparing mean life expectancy across continents
# Include error bars showing the standard deviation
# Hint: Use plt.bar() with yerr parameter

plt.figure(figsize=(10, 6))

# Your code here


plt.tight_layout()
plt.show()

---

## Part 5: Analysis and Conclusions (15 points)

Based on your analysis, answer the following questions.

### Question 5.1: Distribution Shape (5 points)

**TODO**: Look at your histogram from Task 3.2. Describe the shape of the life expectancy distribution.

Consider:
- Is it symmetric or skewed?
- Is it unimodal (one peak) or bimodal (two peaks)?
- Are there any outliers?

**Your Answer:**

*Write your answer here (double-click to edit)*



### Question 5.2: Mean vs Median (5 points)

**TODO**: Compare the mean and median life expectancy you calculated.

- Which one is larger?
- What does this tell you about the distribution?
- Which measure would you use to describe the "typical" life expectancy and why?

**Your Answer:**

*Write your answer here*



### Question 5.3: Continental Differences (5 points)

**TODO**: Based on your analysis of life expectancy by continent:

1. Which continent has the highest average life expectancy?
2. Which continent has the most variability (highest standard deviation)?
3. What factors might explain these differences?

**Your Answer:**

*Write your answer here*



---

## Bonus Challenge (10 extra points)

For extra credit, complete the following challenge.

### Bonus: Temporal Analysis

**TODO**: Create a line plot showing how the average life expectancy has changed over time for each continent.

Requirements:
- Calculate mean life expectancy by year and continent
- Create a line plot with one line per continent
- Use different colors for each continent
- Add a legend
- Add appropriate title and labels
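
Grouping by two columns and reshaping the result into one column per line is one common way to prepare data for this kind of plot. The sketch below uses an invented `toy` frame with hypothetical `year`, `region`, and `value` columns; other approaches, such as looping over groups, work as well.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: two regions observed in two years
toy = pd.DataFrame({
    "year":   [2000, 2000, 2000, 2000, 2005, 2005, 2005, 2005],
    "region": ["n", "n", "s", "s", "n", "n", "s", "s"],
    "value":  [1.0, 3.0, 10.0, 20.0, 3.0, 5.0, 20.0, 40.0],
})

# Mean per (year, region), then pivot regions into columns:
# rows are years, columns are regions
mean_by_year = toy.groupby(["year", "region"])["value"].mean().unstack("region")
print(mean_by_year)

fig, ax = plt.subplots(figsize=(6, 4))
# One line per column; matplotlib assigns a different color to each line
for region in mean_by_year.columns:
    ax.plot(mean_by_year.index, mean_by_year[region], marker="o", label=region)

ax.set_title("Toy Example: Mean Value over Time")
ax.set_xlabel("Year")
ax.set_ylabel("Mean value")
ax.legend(title="Region")
fig.tight_layout()
```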

In [None]:
# BONUS: Create a line plot of life expectancy over time by continent
# Hint: Use the original df (not df_2007)
# Group by year and continent, then plot

plt.figure(figsize=(12, 6))

# Your code here


plt.tight_layout()
plt.show()

### Bonus Question

What trends do you observe in the line plot? Has the gap between continents increased or decreased over time?

**Your Answer:**

*Write your answer here*



---

## Submission Checklist

Before submitting, make sure you have:

- [ ] Completed all TODO tasks
- [ ] Run all cells from top to bottom without errors
- [ ] Added titles and labels to all visualizations
- [ ] Written answers to all analysis questions
- [ ] Saved your notebook

**Total Points: 100 (+ 10 bonus)**

| Section | Points |
|---------|--------|
| Part 1: Loading and Exploring | 15 |
| Part 2: Data Cleaning | 15 |
| Part 3: Visualization | 35 |
| Part 4: Descriptive Statistics | 20 |
| Part 5: Analysis and Conclusions | 15 |
| Bonus | 10 |

---

**Good luck!** 🎉