
# Chapter 1 Assignment: Exploring Global Life Expectancy Data

## **SOLUTION KEY** 🔑

---

## Overview

In this assignment, you will apply the concepts learned in Chapter 1 to analyze a real-world dataset. You will:

1. **Load and explore** a public dataset
2. **Clean the data** by handling missing values and data types
3. **Create visualizations** (bar charts, histograms)
4. **Calculate descriptive statistics** (mean, median, standard deviation)
5. **Draw conclusions** based on your analysis

## Dataset: Gapminder Life Expectancy Data

We will use the **Gapminder** dataset, which contains information about countries including:
- Life expectancy
- GDP per capita
- Population
- Continent

This dataset is publicly available and widely used for teaching data analysis.

**Source**: https://www.gapminder.org/data/

---

## Part 1: Loading and Exploring the Data (15 points)

First, let's import the necessary libraries and load the dataset.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

In [None]:
# Load the Gapminder dataset from a public URL
url = "https://raw.githubusercontent.com/plotly/datasets/master/gapminderDataFiveYear.csv"
df = pd.read_csv(url)

# Display the first 10 rows
print("First 10 rows of the dataset:")
df.head(10)

### Task 1.1: Explore the Dataset Structure (5 points)

**TODO**: Use appropriate pandas methods to answer the following questions:
1. How many rows and columns does the dataset have?
2. What are the data types of each column?
3. Are there any missing values?

In [None]:
# SOLUTION: Find the shape of the dataset (rows, columns)
print(f"Dataset shape: {df.shape}")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

In [None]:
# SOLUTION: Display data types and info about the dataset
print("Data types:")
print(df.dtypes)
print("\n" + "="*50 + "\n")
print("Dataset info:")
df.info()

In [None]:
# SOLUTION: Check for missing values in each column
print("Missing values per column:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

### Task 1.2: Understand the Variables (5 points)

**TODO**: For each categorical column, find the unique values.

In [None]:
# SOLUTION: Find unique continents in the dataset
print("Unique continents:")
print(df['continent'].unique())
print(f"\nNumber of continents: {df['continent'].nunique()}")

In [None]:
# SOLUTION: Find the range of years covered in the dataset
print(f"Minimum year: {df['year'].min()}")
print(f"Maximum year: {df['year'].max()}")
# .tolist() converts NumPy integers to plain Python ints so the list prints cleanly
print(f"\nAll years in dataset: {sorted(df['year'].unique().tolist())}")

In [None]:
# SOLUTION: How many unique countries are in the dataset?
print(f"Number of unique countries: {df['country'].nunique()}")

### Task 1.3: Filter Data for Analysis (5 points)

For the rest of this assignment, we will focus on the most recent year in the dataset.

**TODO**: Create a new DataFrame containing only the data from year 2007.

In [None]:
# SOLUTION: Filter the dataset to include only year 2007
df_2007 = df[df['year'] == 2007].copy()

# Display the shape of the filtered dataset
print(f"Number of countries in 2007: {len(df_2007)}")
print("\nFirst 5 rows of 2007 data:")
df_2007.head()

---

## Part 2: Data Cleaning (15 points)

Real-world data often contains issues that need to be addressed before analysis. In this section, you will practice data cleaning techniques.

### Task 2.1: Introduce and Handle Missing Values (10 points)

Let's simulate a real-world scenario by introducing some missing values, then handle them appropriately.

In [None]:
# Create a copy of the 2007 data for cleaning practice
df_clean = df_2007.copy()

# Introduce some missing values (simulating real-world data issues)
np.random.seed(42)  # For reproducibility
missing_indices = np.random.choice(df_clean.index, size=10, replace=False)
df_clean.loc[missing_indices[:5], 'lifeExp'] = np.nan
df_clean.loc[missing_indices[5:], 'gdpPercap'] = np.nan

print("Missing values introduced:")
print(df_clean.isnull().sum())

In [None]:
# SOLUTION: Identify which countries have missing life expectancy values
print("Countries with missing life expectancy:")
print(df_clean[df_clean['lifeExp'].isnull()]['country'].tolist())

In [None]:
# SOLUTION: Fill missing 'lifeExp' values with the median life expectancy of their continent
df_clean['lifeExp'] = df_clean.groupby('continent')['lifeExp'].transform(
    lambda x: x.fillna(x.median())
)

print("Missing lifeExp values after filling:")
print(df_clean['lifeExp'].isnull().sum())

In [None]:
# SOLUTION: Fill missing 'gdpPercap' values with the median GDP of their continent
df_clean['gdpPercap'] = df_clean.groupby('continent')['gdpPercap'].transform(
    lambda x: x.fillna(x.median())
)

print("Missing gdpPercap values after filling:")
print(df_clean['gdpPercap'].isnull().sum())

In [None]:
# SOLUTION: Verify that there are no more missing values
print("Remaining missing values:")
print(df_clean.isnull().sum())
print(f"\nTotal missing values: {df_clean.isnull().sum().sum()}")
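
The continent-wise fill above can also be written without a lambda. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and its group labels are invented for illustration, not taken from Gapminder). It assumes the same idea as the solution: each missing value is replaced by the median of its own group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two groups, one missing value in each
toy = pd.DataFrame({
    "continent": ["A", "A", "A", "B", "B", "B"],
    "lifeExp": [50.0, np.nan, 60.0, 70.0, 80.0, np.nan],
})

# Group median repeated for every row of the group (NaNs are skipped)
group_median = toy.groupby("continent")["lifeExp"].transform("median")

# Fill only the missing entries; observed values stay unchanged
toy["lifeExp_filled"] = toy["lifeExp"].fillna(group_median)

print(toy)
```

Because `transform` returns a Series aligned to the original index, `fillna` can match each missing row with the median of its own group.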

### Task 2.2: Data Type Validation (5 points)

**TODO**: Create a new column called `pop_millions` that contains the population in millions (divide population by 1,000,000).

In [None]:
# SOLUTION: Create a new column 'pop_millions' = population / 1,000,000
df_clean['pop_millions'] = (df_clean['pop'] / 1_000_000).round(2)

# Display sample of the result
print("Population in millions:")
df_clean[['country', 'pop', 'pop_millions']].head(10)

---

## Part 3: Visualization (35 points)

Now let's create visualizations to understand our data better.

### Task 3.1: Bar Chart - Countries per Continent (10 points)

**TODO**: Create a bar chart showing the number of countries in each continent.

Requirements:
- Add a title: "Number of Countries per Continent (2007)"
- Label the x-axis: "Continent"
- Label the y-axis: "Number of Countries"
- Add value labels on top of each bar

In [None]:
# SOLUTION: Create a bar chart showing countries per continent
plt.figure(figsize=(10, 6))

# Count countries per continent
continent_counts = df_clean['continent'].value_counts()

# Create bar chart
bars = plt.bar(continent_counts.index, continent_counts.values, color='steelblue', edgecolor='black')

# Add labels and title
plt.xlabel('Continent', fontsize=12)
plt.ylabel('Number of Countries', fontsize=12)
plt.title('Number of Countries per Continent (2007)', fontsize=14)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.5,
             f'{int(height)}',
             ha='center', va='bottom', fontsize=11)

plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

### Task 3.2: Histogram - Life Expectancy Distribution (10 points)

**TODO**: Create a histogram showing the distribution of life expectancy across all countries in 2007.

Requirements:
- Use 10 bins
- Add a title: "Distribution of Life Expectancy (2007)"
- Label the x-axis: "Life Expectancy (years)"
- Label the y-axis: "Number of Countries"
- Add a vertical line showing the mean life expectancy

In [None]:
# SOLUTION: Create a histogram of life expectancy
plt.figure(figsize=(10, 6))

# Create histogram
plt.hist(df_clean['lifeExp'], bins=10, edgecolor='black', alpha=0.7, color='steelblue')

# Calculate mean life expectancy
mean_life_exp = df_clean['lifeExp'].mean()

# Add vertical line at the mean
plt.axvline(x=mean_life_exp, color='red', linestyle='--', linewidth=2, 
            label=f'Mean: {mean_life_exp:.1f} years')

# Add labels, title, and legend
plt.xlabel('Life Expectancy (years)', fontsize=12)
plt.ylabel('Number of Countries', fontsize=12)
plt.title('Distribution of Life Expectancy (2007)', fontsize=14)
plt.legend(fontsize=11)
plt.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

### Task 3.3: Conditional Histogram - Life Expectancy by Continent (15 points)

**TODO**: Create separate histograms of life expectancy for each continent to compare distributions.

Requirements:
- Create a figure with 5 subplots (one per continent)
- Use the same x-axis range for all (40 to 85 years)
- Add appropriate titles and labels

In [None]:
# SOLUTION: Create conditional histograms by continent
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()  # Flatten to make indexing easier

continents = df_clean['continent'].unique()
colors = ['steelblue', 'orange', 'green', 'red', 'purple']

# Loop through continents and create histograms
for i, continent in enumerate(continents):
    continent_data = df_clean[df_clean['continent'] == continent]['lifeExp']
    
    axes[i].hist(continent_data, bins=8, edgecolor='black', alpha=0.7, color=colors[i])
    axes[i].set_xlim(40, 85)
    axes[i].set_xlabel('Life Expectancy (years)')
    axes[i].set_ylabel('Number of Countries')
    axes[i].set_title(f'{continent} (n={len(continent_data)})')
    axes[i].grid(axis='y', alpha=0.3)
    
    # Add mean line
    mean_val = continent_data.mean()
    axes[i].axvline(x=mean_val, color='black', linestyle='--', linewidth=1.5,
                    label=f'Mean: {mean_val:.1f}')
    axes[i].legend(fontsize=9)

# Hide the 6th subplot (we only have 5 continents)
axes[5].set_visible(False)

plt.suptitle('Life Expectancy Distribution by Continent (2007)', fontsize=14)
plt.tight_layout()
plt.show()
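
One caveat with `bins=8`: matplotlib chooses the bin edges separately for each subplot, so bin widths differ between continents even though the x-axis limits match. A minimal sketch of one way around this, assuming fixed edges from 40 to 85 in 5-year steps (the `sample_a` and `sample_b` values are hypothetical and only illustrate the mechanics):

```python
import numpy as np

# Shared edges: 40, 45, ..., 85 -> nine 5-year bins
bin_edges = np.arange(40, 90, 5)

# Hypothetical life-expectancy samples for two groups
sample_a = np.array([42.0, 47.5, 51.0, 53.2, 58.9])
sample_b = np.array([71.0, 73.4, 76.8, 79.9, 81.2])

# Both calls use the same edges, so counts are comparable bin by bin
counts_a, _ = np.histogram(sample_a, bins=bin_edges)
counts_b, _ = np.histogram(sample_b, bins=bin_edges)

print("Edges:  ", bin_edges)
print("Group A:", counts_a)
print("Group B:", counts_b)
```

Passing the same array as `bins=bin_edges` to each `axes[i].hist(...)` call would apply identical edges in every subplot.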

---

## Part 4: Descriptive Statistics (20 points)

Calculate and interpret key statistics for the data.

### Task 4.1: Summary Statistics (10 points)

**TODO**: Calculate the following statistics for life expectancy in 2007:
- Mean
- Median
- Standard deviation
- Minimum and Maximum
- Range (Max - Min)

In [None]:
# SOLUTION: Calculate descriptive statistics for life expectancy
life_exp = df_clean['lifeExp']

# Calculate statistics
mean_life = np.mean(life_exp)
median_life = np.median(life_exp)
std_life = np.std(life_exp, ddof=1)  # sample std (ddof=1), consistent with pandas' .std() in Task 4.2
min_life = np.min(life_exp)
max_life = np.max(life_exp)
range_life = max_life - min_life

print("Life Expectancy Statistics (2007):")
print("="*40)
print(f"Mean:               {mean_life:.2f} years")
print(f"Median:             {median_life:.2f} years")
print(f"Standard Deviation: {std_life:.2f} years")
print(f"Minimum:            {min_life:.2f} years")
print(f"Maximum:            {max_life:.2f} years")
print(f"Range:              {range_life:.2f} years")
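
A detail worth noting for the standard deviation: `np.std` defaults to the population formula (`ddof=0`), while pandas' `Series.std` defaults to the sample formula (`ddof=1`), so the two can disagree slightly on the same data. A minimal sketch on hypothetical values chosen to keep the arithmetic simple:

```python
import numpy as np
import pandas as pd

# Hypothetical values: mean is 5, squared deviations sum to 32
arr = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_std = np.std(arr)             # ddof=0: divide by n
sample_std = np.std(arr, ddof=1)  # ddof=1: divide by n - 1
pandas_std = pd.Series(arr).std() # pandas default is ddof=1

print(f"Population std (ddof=0): {pop_std:.4f}")
print(f"Sample std (ddof=1):     {sample_std:.4f}")
print(f"pandas Series.std():     {pandas_std:.4f}")
```

With around 140 countries the difference is small, but using the same `ddof` throughout keeps the overall figure comparable with the per-continent values from `groupby`.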

### Task 4.2: Statistics by Continent (10 points)

**TODO**: Calculate the mean and standard deviation of life expectancy for each continent.

In [None]:
# SOLUTION: Calculate mean and std of life expectancy by continent
continent_stats = df_clean.groupby('continent')['lifeExp'].agg(['mean', 'std', 'min', 'max', 'count'])
continent_stats.columns = ['Mean', 'Std Dev', 'Min', 'Max', 'Count']

# Display the results sorted by mean life expectancy
print("Life Expectancy by Continent (2007):")
print("="*60)
continent_stats.sort_values('Mean', ascending=False).round(2)

In [None]:
# SOLUTION: Create a bar chart comparing mean life expectancy across continents
plt.figure(figsize=(10, 6))

# Sort by mean for better visualization
stats_sorted = continent_stats.sort_values('Mean', ascending=True)

# Create bar chart with error bars
bars = plt.barh(stats_sorted.index, stats_sorted['Mean'], 
                xerr=stats_sorted['Std Dev'],
                color='steelblue', edgecolor='black', alpha=0.7,
                capsize=5)

# Add labels and title
plt.xlabel('Life Expectancy (years)', fontsize=12)
plt.ylabel('Continent', fontsize=12)
plt.title('Mean Life Expectancy by Continent (2007)\nError bars show Standard Deviation', fontsize=14)

# Add value labels on bars
for i, (mean, std) in enumerate(zip(stats_sorted['Mean'], stats_sorted['Std Dev'])):
    plt.text(mean + std + 1, i, f'{mean:.1f}', va='center', fontsize=10)

plt.grid(axis='x', alpha=0.3)
plt.xlim(0, 90)

plt.tight_layout()
plt.show()

---

## Part 5: Analysis and Conclusions (15 points)

Based on your analysis, answer the following questions.

### Question 5.1: Distribution Shape (5 points)

**TODO**: Look at your histogram from Task 3.2. Describe the shape of the life expectancy distribution.

Consider:
- Is it symmetric or skewed?
- Is it unimodal (one peak) or bimodal (two peaks)?
- Are there any outliers?

**SOLUTION:**

The life expectancy distribution in 2007 shows the following characteristics:

1. **Skewness**: The distribution is **left-skewed (negatively skewed)**. There is a longer tail on the left side, with most countries having higher life expectancies.

2. **Modality**: The distribution appears to be **bimodal** (two peaks). There is one peak around 70-80 years (representing developed countries, primarily in Europe, the Americas, and Oceania) and another smaller peak around 50-60 years (representing many African countries).

3. **Outliers**: There are some countries with notably low life expectancies (below 45 years), which could be considered outliers. These are primarily countries affected by the HIV/AIDS epidemic, civil conflicts, or extreme poverty.
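
Whether a low value counts as an outlier can be checked with a rule of thumb rather than by eye. A minimal sketch using the common 1.5 × IQR rule on hypothetical values (this rule is one convention among several, and the numbers below are not from the dataset):

```python
import numpy as np

# Hypothetical life expectancies with one very low value
values = np.array([39.0, 68.0, 70.0, 71.0, 72.0, 73.0, 74.0, 75.0, 76.0, 78.0])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print(f"Fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print(f"Flagged values: {outliers}")
```

Running the same check on `df_clean['lifeExp']` would show whether the countries below 45 years fall outside the fences or are simply part of a long left tail.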

### Question 5.2: Mean vs Median (5 points)

**TODO**: Compare the mean and median life expectancy you calculated.

- Which one is larger?
- What does this tell you about the distribution?
- Which measure would you use to describe the "typical" life expectancy and why?

**SOLUTION:**

1. **Comparison**: The median (71.94 years) is about 5 years **larger** than the mean (66.91 years).

2. **What this tells us**: When the median is greater than the mean, it indicates a **left-skewed distribution**. This means there are some countries with very low life expectancies that pull the mean downward, while more than half of the countries have life expectancies above the mean.

3. **Which measure to use**: For this data, the **median** would be a better measure of "typical" life expectancy because:
   - The distribution is skewed, making the mean sensitive to extreme values
   - The median is more robust to outliers (countries with very low life expectancy)
   - The median better represents what a "typical" country experiences
   
   However, the mean is still worth reporting alongside the median: because it uses every country's value, it reflects how far the low-life-expectancy countries fall below the rest.
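
The effect of a low tail on the two measures can be seen on a small hypothetical sample (values invented for illustration, not drawn from the dataset):

```python
import numpy as np

# Hypothetical left-skewed sample: two low values, five high ones
sample = np.array([40.0, 45.0, 70.0, 72.0, 74.0, 76.0, 78.0])

mean_val = sample.mean()
median_val = np.median(sample)

# Make the lowest value more extreme and recompute both measures
shifted = sample.copy()
shifted[0] = 20.0
shifted_mean = shifted.mean()
shifted_median = np.median(shifted)

print(f"Original: mean = {mean_val:.2f}, median = {median_val:.2f}")
print(f"Shifted:  mean = {shifted_mean:.2f}, median = {shifted_median:.2f}")
```

The mean drops when the lowest value is pushed further down, while the median stays where it was, which is the robustness property described above.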

### Question 5.3: Continental Differences (5 points)

**TODO**: Based on your analysis of life expectancy by continent:

1. Which continent has the highest average life expectancy?
2. Which continent has the most variability (highest standard deviation)?
3. What factors might explain these differences?

**SOLUTION:**

1. **Highest average life expectancy**: **Oceania** has the highest average life expectancy (~80.7 years), followed by Europe (~77.6 years). However, Oceania includes only 2 countries (Australia and New Zealand), so its average rests on very little data and Europe's is the more robust figure.

2. **Most variability**: **Africa** has the highest standard deviation (~9.6 years), followed by Asia (~7.9 years). This means life expectancy varies widely within these continents.

3. **Factors explaining differences**:
   - **Healthcare access**: Developed continents (Europe, Oceania) have better healthcare systems
   - **Economic development**: Higher GDP per capita correlates with better nutrition, sanitation, and medical care
   - **Disease burden**: Africa faces high HIV/AIDS prevalence and tropical diseases
   - **Political stability**: Civil conflicts and wars reduce life expectancy
   - **Education**: Higher education levels lead to better health decisions
   - **Infrastructure**: Access to clean water, sanitation, and electricity
   - **Within-continent diversity**: Asia's high variability reflects the mix of wealthy nations (Japan, Singapore) and developing countries (Afghanistan, Cambodia)

---

## Bonus Challenge (10 extra points)

For extra credit, complete the following challenge.

### Bonus: Temporal Analysis

**TODO**: Create a line plot showing how the average life expectancy has changed over time for each continent.

Requirements:
- Calculate mean life expectancy by year and continent
- Create a line plot with one line per continent
- Use different colors for each continent
- Add a legend
- Add appropriate title and labels

In [None]:
# SOLUTION: Create a line plot of life expectancy over time by continent
plt.figure(figsize=(12, 6))

# Calculate mean life expectancy by year and continent
temporal_stats = df.groupby(['year', 'continent'])['lifeExp'].mean().unstack()

# Define colors for each continent
colors = {'Africa': 'brown', 'Americas': 'green', 'Asia': 'red', 
          'Europe': 'blue', 'Oceania': 'purple'}

# Plot each continent
for continent in temporal_stats.columns:
    plt.plot(temporal_stats.index, temporal_stats[continent], 
             marker='o', linewidth=2, markersize=6,
             color=colors[continent], label=continent)

# Add labels and title
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average Life Expectancy (years)', fontsize=12)
plt.title('Average Life Expectancy Over Time by Continent (1952-2007)', fontsize=14)
plt.legend(title='Continent', fontsize=10)
plt.grid(True, alpha=0.3)

# Set x-axis ticks to show all years
plt.xticks(temporal_stats.index, rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Additional analysis: Calculate the change in life expectancy from 1952 to 2007
print("Change in Life Expectancy by Continent (1952 → 2007):")
print("="*55)

for continent in temporal_stats.columns:
    start = temporal_stats.loc[1952, continent]
    end = temporal_stats.loc[2007, continent]
    change = end - start
    # The '+' format flag prints the sign of the change instead of assuming it is positive
    print(f"{continent:12}: {start:.1f} → {end:.1f} years (Change: {change:+.1f} years)")

### Bonus Question:

What trends do you observe in the line plot? Has the gap between continents increased or decreased over time?

**SOLUTION:**

**Observed Trends:**

1. **All continents improved**: Every continent shows a clear upward trend in life expectancy from 1952 to 2007, demonstrating global improvements in health and living conditions.

2. **Fastest improvement**: Asia showed the most dramatic improvement, going from about 46 years in 1952 to about 70 years in 2007 (an increase of about 24 years). This reflects rapid development in countries like China, South Korea, and India.

3. **Gap analysis**:
   - In 1952, the gap between the highest (Oceania ~69 years) and lowest (Africa ~39 years) was about **30 years**
   - In 2007, the gap between highest (Oceania ~81 years) and lowest (Africa ~55 years) is about **26 years**
   - The gap has **slightly decreased**, suggesting some convergence

4. **Africa's trajectory**: Africa shows slower progress, especially visible in the 1990s-2000s period where growth stagnated or reversed in some years. This is largely due to the HIV/AIDS epidemic.

5. **Convergence in developed regions**: Europe, the Americas, and Oceania have converged toward similar high life expectancies (roughly 74-81 years) by 2007.

6. **Asia catching up**: Asia has narrowed the gap with Europe and the Americas significantly, especially after 1970.
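
The gap figures in item 3 follow from row-wise arithmetic on a year-by-continent table. A minimal sketch using only the rounded continent means quoted above, so the table is an approximation and not `temporal_stats` itself (the name `approx_means` is hypothetical):

```python
import pandas as pd

# Rounded means quoted in the solution text (approximate values)
approx_means = pd.DataFrame(
    {"Africa": [39.0, 55.0], "Asia": [46.0, 70.0], "Oceania": [69.0, 81.0]},
    index=pd.Index([1952, 2007], name="year"),
)

# Gap between the highest and lowest continent in each year
gap = approx_means.max(axis=1) - approx_means.min(axis=1)

# Change per continent between the first and last year
change = approx_means.loc[2007] - approx_means.loc[1952]

print("Gap by year:")
print(gap)
print("\nChange 1952 to 2007:")
print(change.sort_values(ascending=False))
```

Applied to the full table, `temporal_stats.max(axis=1) - temporal_stats.min(axis=1)` would give the gap for every survey year rather than only the two endpoints.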

---

## Grading Rubric

**Total Points: 100 (+ 10 bonus)**

| Section | Points | Key Criteria |
|---------|--------|-------------|
| Part 1: Loading and Exploring | 15 | Correct use of shape, info, dtypes, unique, nunique |
| Part 2: Data Cleaning | 15 | Proper handling of missing values with groupby/transform |
| Part 3: Visualization | 35 | Correct charts with titles, labels, and value annotations |
| Part 4: Descriptive Statistics | 20 | Accurate calculations and meaningful groupby analysis |
| Part 5: Analysis and Conclusions | 15 | Thoughtful interpretation of results |
| Bonus | 10 | Complete temporal analysis with proper visualization |

---

**End of Solution Key** 🔑