![](images/logo.png){style="width: 1000px; align: center;"}

# **Data Acquisition and Statistical Data Visualization with Seaborn** {style="text-align: center;"}

**Presenter:** Olusola Timothy Ogundepo  
**Institution:** African Institute for Mathematical Sciences, Rwanda  
**Date:** October 6, 2025

> End-to-end walkthrough: (1) ethically acquiring data (incl. a mini web scraping demo) → (2) quick exploratory profiling → (3) building interpretable statistical graphics with Seaborn.

---

## Table of Contents
1. Motivation & Goals
2. Ways to Obtain Data
3. Web Scraping Overview
4. Python Libraries for Scraping
5. Mini Scraping Demo: Books to Scrape
6. Scraped Sample: Quick Glimpse
7. Load Local Teaching Datasets
8. Automated Exploratory Profile (Optional)
9. Seaborn Fundamentals: Styles & Themes
10. Univariate Distributions
11. Correlation & Matrix Views
12. Categorical vs Numeric
13. Saving Figures
14. Practice Activity 1: Africa COVID-19 Dataset
15. Practice Activity 2: Penguins (Basics)
16. Practice Activity 3: Penguins (Intermediate)
17. Further Reading & Design Checklist

## 1. Motivation & Goals

Effective data visualization is essential for uncovering insights, communicating uncertainty and patterns transparently, and supporting evidence-based decision making. Thoughtful visualizations clarify complex data, reveal trends, and highlight areas for further investigation.

Seaborn, built on top of Matplotlib, specializes in visualizing statistical relationships. In this session, we will develop a layered approach—beginning with simple plots, then progressively adding semantic elements such as color (hue), size, and style, before exploring multivariate patterns.

> Guiding Principle: Each plot should answer a specific question with clarity and precision.


## 2. Ways to Obtain Data

There are several reliable methods for acquiring data, each suited to different needs and contexts:

1. **Published Open Data:** Downloadable files such as CSV or Parquet from official portals, government sites, or research repositories.  
    *Examples:* [World Bank Data](https://data.worldbank.org/), [Kaggle Datasets](https://www.kaggle.com/datasets), [UN Data Portal](https://data.un.org/)

2. **Public or Authenticated APIs:** Programmatic access to data via endpoints that return structured formats like JSON, often requiring registration or authentication.  
    *Examples:* [Twitter API](https://developer.twitter.com/en/docs), [OpenWeatherMap API](https://openweathermap.org/api), [COVID-19 API](https://covid19api.com/)

3. **Internal Databases and Data Warehouses:** Secure, organization-specific sources accessed through SQL queries or business intelligence tools.  
    *Examples:* Company sales databases, hospital patient record systems, university research data warehouses

4. **Bulk Data Dumps:** Large datasets shared on platforms like Kaggle or GitHub, typically for research or community use.  
    *Examples:* [ImageNet](https://www.image-net.org/), [Common Crawl](https://commoncrawl.org/), [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)

5. **Web Scraping:** Automated extraction from websites when no structured or official access is available; use only when other options are exhausted.  
    *Examples:* Scraping book prices from [Books to Scrape](https://books.toscrape.com/), extracting job listings from [Indeed](https://www.indeed.com/), collecting product reviews from e-commerce sites

**Key considerations before acquiring data:**
- Is there an official API? Use it whenever possible for reliability and structure.
- Are licensing terms and conditions compatible with your intended use? Always review them carefully.
- Is reproducibility important? Automate the acquisition process with scripts.
- Will scraping place a noticeable load on the source? Keep the request volume small, add polite delays, and respect site policies.

Next, we’ll outline the scope of web scraping before demonstrating a practical example.

## 3. Web Scraping Overview

Web scraping refers to the automated collection of information from web pages, typically in situations where no official API or structured data endpoint is available.

**Key Python tools:**
- `requests`: Retrieves raw HTML content from websites.
- `BeautifulSoup`: Parses and navigates the HTML structure to extract relevant data.
- `pandas`: Organizes extracted data into clean, tabular formats for analysis.

**When web scraping is inappropriate:**
- If the data is protected by paywalls or requires login and you lack explicit permission.
- When scraping would violate website rate limits or terms of service.
- In cases where legal or ethical guidelines prohibit automated data extraction.

**Responsible scraping practices:**
- Always set a clear and honest User-Agent string to identify your script.
- Introduce delays between requests to avoid overwhelming servers.
- Cache downloaded pages locally to minimize repeated requests.
- Limit scraping to only the data necessary for your analysis; avoid large-scale or indiscriminate crawling.
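
The practices above can be combined in a couple of small helpers. The sketch below is one possible arrangement and is not part of the original notebook: `allowed_by_robots` and `cached_fetch` are hypothetical names, the cache folder is an assumption, and `download` stands for whatever function actually performs the HTTP request. Checking `robots.txt` is an extra courtesy that goes slightly beyond the list above.

```python
import hashlib
import time
from pathlib import Path
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def cached_fetch(url, download, cache_dir="page_cache", delay=1.0):
    """Return the page text for url, downloading it at most once.

    download is any callable that maps a URL to text (hypothetical hook).
    """
    file_name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    cache_path = Path(cache_dir) / file_name
    if cache_path.exists():
        # Cache hit: no request is sent, so no delay is needed
        return cache_path.read_text(encoding="utf-8")
    time.sleep(delay)  # polite pause before contacting the server
    text = download(url)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(text, encoding="utf-8")
    return text
```

Passing the download function in as an argument keeps the caching logic independent of any particular HTTP library, which also makes it easy to try out with a stub before pointing it at a real site.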

## 4. Python Libraries for Scraping
Install (if needed):
```bash
pip install requests beautifulsoup4
```
Optional extras:
- lxml (faster parser)
- selenium (dynamic JS pages)
- scrapy (large-scale frameworks)

We'll use a small demo site (Books to Scrape) ideal for practice.

In [1]:
import requests
from bs4 import BeautifulSoup

## 5. Mini Scraping Demo: Books to Scrape

**Objective:** Demonstrate ethical, small-scale web scraping by collecting a sample of book data (title, price, rating, and stock status) from the "Books to Scrape" website. This exercise is for illustration only, not for large-scale data extraction.

**Workflow:**
1. Retrieve the HTML content of the target web page.
2. Identify and parse individual product blocks within the page.
3. Extract relevant attributes for each book: title, price, rating, and stock availability.
4. Convert rating descriptions (e.g., "Three") into numeric values for analysis.
5. Optionally, save the collected data to a CSV file for further exploration.

Next, we’ll implement helper functions to automate these steps and ensure the process is clear and reproducible.

In [2]:
import time
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

In [3]:
BASE_URL = "https://books.toscrape.com/"
HEADERS = {"User-Agent": "SeabornDataVizLecture/1.0 (contact: example@example.com)"}
STAR_MAP = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

In [4]:
def fetch(url: str) -> str:
    """
        Fetch a URL and return the response body as text.
        Uses a 10-second timeout and raises requests.HTTPError on a 4xx/5xx response.
    """
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.text

In [5]:

def parse_book(article):
    """
        Parse a book article element to extract title, price, rating, and stock status.
    """
    title = article.h3.a['title'].strip()
    price = re.sub(r"[^0-9,.]", "", article.select_one('.price_color').get_text(strip=True))
    rating_class = article.select_one('p.star-rating')['class']
    rating_word = next(c for c in rating_class if c != 'star-rating')
    stars = STAR_MAP.get(rating_word)
    stock_text = article.select_one('.availability').get_text(strip=True)
    in_stock = 'In stock' in stock_text
    return {"title": title, "price_gbp": price, "stars": stars, "in_stock": in_stock}

In [6]:
def scrape_pages(n_pages=2):
    """
        Scrape multiple pages of the book catalog and return a DataFrame of book records.
    """
    records = []
    for p in range(1, n_pages+1):
        url = BASE_URL + (f"catalogue/page-{p}.html" if p>1 else "")
        html = fetch(url)
        soup = BeautifulSoup(html, 'html.parser')
        for art in soup.select('article.product_pod'):
            records.append(parse_book(art))
        time.sleep(0.6)  # polite delay
    return pd.DataFrame(records)

In [7]:
books_df = scrape_pages(n_pages=3)

In [8]:
books_df.head()

|   | title | price_gbp | stars | in_stock |
|---|-------|-----------|-------|----------|
| 0 | A Light in the Attic | 51.77 | 3 | True |
| 1 | Tipping the Velvet | 53.74 | 1 | True |
| 2 | Soumission | 50.1 | 1 | True |
| 3 | Sharp Objects | 47.82 | 4 | True |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | 5 | True |
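
Step 5 of the workflow (saving to CSV) has no cell of its own in the notebook. A minimal sketch follows, assuming the file name `books_sample.csv` (hypothetical) and rebuilding a three-row frame from the titles shown above so that the block runs without network access; the prices are written as strings because that is what `parse_book` returns.

```python
import pandas as pd

# Three rows rebuilt from the scraped sample shown above
sample = pd.DataFrame(
    {
        "title": ["A Light in the Attic", "Tipping the Velvet", "Soumission"],
        "price_gbp": ["51.77", "53.74", "50.10"],
        "stars": [3, 1, 1],
        "in_stock": [True, True, True],
    }
)

# index=False keeps the row index out of the file, so reloading does not
# create an extra "Unnamed: 0" column
sample.to_csv("books_sample.csv", index=False)

# read_csv infers column types again, so the price strings come back numeric
reloaded = pd.read_csv("books_sample.csv")
print(reloaded.dtypes)
```

Saving the raw scrape once and working from the file afterwards also avoids sending repeated requests to the site.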


## 6. Scraped Sample: Quick Glimpse

In [9]:
books_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      60 non-null     object
 1   price_gbp  60 non-null     object
 2   stars      60 non-null     int64 
 3   in_stock   60 non-null     bool  
dtypes: bool(1), int64(1), object(2)
memory usage: 1.6+ KB


In [10]:
books_df["price_gbp"] = pd.to_numeric(books_df["price_gbp"], errors='coerce')  # unparseable values become NaN
# Alternative: raises ValueError if any value cannot be parsed
# books_df["price_gbp"] = books_df["price_gbp"].astype(float)

books_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   title      60 non-null     object 
 1   price_gbp  60 non-null     float64
 2   stars      60 non-null     int64  
 3   in_stock   60 non-null     bool   
dtypes: bool(1), float64(1), int64(1), object(1)
memory usage: 1.6+ KB
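
The two conversion options behave differently when a value cannot be parsed. The sketch below uses made-up price strings (the last one is deliberately malformed) to show the contrast; it is an illustration, not output from the scraped data.

```python
import pandas as pd

# Hypothetical price strings; the last one is deliberately malformed
raw = pd.Series(["51.77", "53.74", "n/a"])

# errors="coerce" converts what it can and marks the rest as NaN
coerced = pd.to_numeric(raw, errors="coerce")
print("values that failed to parse:", coerced.isna().sum())

# astype(float) stops at the first bad value and raises instead
try:
    raw.astype(float)
except ValueError as exc:
    print("astype(float) failed:", exc)
```

`errors="coerce"` is convenient for messy scrapes, but it is worth counting the resulting NaN values so that silent parsing failures do not go unnoticed.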


## 7. Load Local Teaching Datasets
We'll now pivot to local CSVs for Seaborn demonstrations:
- tips.csv (restaurant tipping behavior)
- penguins.csv (species morphology)
- Africa COVID-19 summary

Load and inspect baseline structure.

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

The code above imports NumPy, Pandas, Matplotlib's pyplot module, and Seaborn, each under its conventional alias so that their functions are shorter to call: `np` for numpy, `pd` for pandas, `plt` for pyplot, and `sns` for seaborn.

### Tips Dataset: Variable Glossary

The tips dataset originates from a restaurant in Nairobi, Kenya, where patrons pay for their meals and leave gratuities based on the total bill. Each row in the dataset represents a dining party and includes the following variables:

- **total_bill**: The total cost of the meal, recorded in Kenyan Shillings (KES).
- **tip**: The amount of gratuity left by the payer, also in Kenyan Shillings.
- **gender**: The gender of the individual who paid for the meal.
- **smoker**: Indicates whether anyone in the party was a smoker.
- **day**: The day of the week on which the meal was served.
- **time**: Specifies whether the meal occurred during lunch or dinner.
- **size**: The number of people in the dining party.

This dataset enables analysis of tipping behavior in relation to meal cost, party characteristics, and temporal factors.


### Loading the Tips Dataset into a Pandas DataFrame

To begin our analysis, we will load the restaurant tipping dataset into a Pandas DataFrame. This step provides a structured format for efficient data exploration and visualization.

In [12]:
tips_data = pd.read_csv("datasets/tips.csv")

In [13]:
tips_data.head()

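|   | total_bill | tip | gender | smoker | day | time | size |
|---|------------|-----|--------|--------|-----|------|------|
| 0 | 2125.5 | 360.79 | Male | No | Thur | Lunch | 1 |
| 1 | 2727.18 | 259.42 | Female | No | Sun | Dinner | 5 |
| 2 | 1066.02 | 274.68 | Female | Yes | Thur | Dinner | 4 |
| 3 | 3493.45 | 337.9 | Female | No | Sun | Dinner | 1 |
| 4 | 3470.56 | 567.89 | Male | Yes | Sun | Lunch | 6 |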


In [14]:
tips_data.tail()

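|   | total_bill | tip | gender | smoker | day | time | size |
|---|------------|-----|--------|--------|-----|------|------|
| 739 | 3164.27 | 645.28 | Male | No | Sat | Dinner | 3 |
| 740 | 2962.62 | 218.0 | Female | Yes | Sat | Dinner | 2 |
| 741 | 2471.03 | 218.0 | Male | Yes | Sat | Dinner | 2 |
| 742 | 1942.38 | 190.75 | Male | No | Sat | Dinner | 2 |
| 743 | 2047.02 | 327.0 | Female | No | Thur | Dinner | 2 |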


In [15]:
tips_data.columns

Index(['total_bill', 'tip', 'gender', 'smoker', 'day', 'time', 'size'], dtype='object')

In [16]:
tips_data.shape

(744, 7)

## 8. Automated Exploratory Profile (Optional)

Automated data profiling provides a rapid overview of your dataset’s structure, distributions, and potential issues. Use profiling tools to generate initial hypotheses, identify missing values, spot outliers, and detect data types. However, treat these reports as a starting point for deeper analysis—not as definitive conclusions. Profiling is most effective for guiding further exploration and highlighting areas that require closer attention.
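
Before reaching for a profiling package, the same first-pass checks can be done with plain pandas. The sketch below is an assumption-labelled illustration: the first five rows reuse values from the `head()` output above, while the sixth row and the missing tip are made up to give the checks something to find.

```python
import numpy as np
import pandas as pd

# Illustrative frame: the last row and the NaN are invented for the demo
df = pd.DataFrame(
    {
        "total_bill": [2125.5, 2727.18, 1066.02, 3493.45, 3470.56, 9800.0],
        "tip": [360.79, 259.42, np.nan, 337.9, 567.89, 300.0],
        "day": ["Thur", "Sun", "Thur", "Sun", "Sun", "Sat"],
    }
)

# One row per column: detected type, missing values, distinct values
profile = pd.DataFrame(
    {
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    }
)
print(profile)

# Summary statistics for the numeric columns only
print(df.describe())
```

A table like `profile` is quick to scan for columns with unexpected types or many missing values, which is usually where the closer look should start.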

### Installation:
```bash
pip install "ydata-profiling[notebook]"
```

You will need to restart your kernel after installation for the package to be recognized in your Jupyter notebook environment.

In [17]:
from ydata_profiling import ProfileReport

In [18]:
ProfileReport(tips_data, title="Tips Dataset Profiling Report", explorative=True)




## 9. Seaborn Fundamentals: Styles & Themes

Seaborn offers several built-in visual styles—`darkgrid`, `whitegrid`, `dark`, `white`, and `ticks`—that control background, gridlines, and overall plot appearance. Setting a style at the beginning of your workflow ensures visual consistency across all figures. For this session, we’ll use the `whitegrid` style, which provides a clean background with subtle gridlines to aid comparison without distraction.

For presentations, consider adjusting the plot context with `sns.set_context('talk')` to scale fonts and elements for better readability. We’ll set the style now and explore palette customization in later sections.


In [19]:
sns.set_style("whitegrid")

## 10. Univariate Distributions

Begin by examining each variable independently to understand its basic characteristics before exploring relationships between variables.

**Key visualization options:**
- **Histogram:** Displays frequency counts across bins; sensitive to bin width and placement.
- **Kernel Density Estimate (KDE):** Shows a smoothed approximation of the distribution; may obscure details in small samples.
- **Empirical Cumulative Distribution Function (ECDF):** Illustrates cumulative proportions without binning, providing an unbiased view of distribution.
- **Box, Violin, Strip, or Swarm Plots:** Reveal distribution shape, central tendency, and outliers; each offers a different perspective on variability and data structure.

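The histogram, density, and box plot views of `total_bill` appear under "Histograms & Density Plots" and "Box Plot" later in the notebook. We first plot `tip` against `total_bill` as a scatter plot to get an overall picture of the data.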

In [20]:
sns.relplot(x='total_bill', y='tip', data=tips_data)

<seaborn.axisgrid.FacetGrid at 0x13d9cc2fa10>

Add a categorical semantic (hue) to reveal subgroup structure (gender).

In [21]:
sns.relplot(x= 'total_bill', y='tip', hue='gender', data= tips_data)

<seaborn.axisgrid.FacetGrid at 0x13d9d54a350>

**Interpretation note:**  
When data points overlap or cluster densely in a plot, important patterns may be obscured. To address this, consider using transparency (adjusting point alpha), jitter, or alternative visual encodings to reveal underlying distributions and ensure that dense regions remain interpretable.
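
As a hedged illustration of the transparency option, the sketch below plots 2,000 synthetic, heavily overlapping points (not the tips data) with translucent markers. The alpha value of 0.2 is an arbitrary starting point that usually needs tuning to the sample size.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic, heavily overlapping points (not the tips data)
rng = np.random.default_rng(1)
demo = pd.DataFrame({"x": rng.normal(size=2000), "y": rng.normal(size=2000)})

# alpha < 1 makes each marker translucent, so dense regions look darker;
# linewidth=0 removes marker outlines that would otherwise mask the effect
ax = sns.scatterplot(data=demo, x="x", y="y", alpha=0.2, linewidth=0)
ax.set_title("Translucent markers show where points pile up")
```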

Axis labelling & titling: Provide units, clarify transformations (log, %), and keep titles question-oriented.

Examples below relabel axes & add an interpretive title.

In [22]:
sns.relplot(x= 'total_bill', y='tip', hue='gender', data= tips_data)

plt.xlabel("Total bill (KES)")

plt.ylabel("Tip (KES)")

plt.title("tips vs total bill")

Text(0.5, 1.0, 'tips vs total bill')

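**Important note**:

End the last line of a plotting cell with a semicolon (`;`) to suppress the text representation of the returned object (such as `Text(0.5, 1.0, 'tips vs total bill')` above), so that only the figure is displayed. For example,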

In [23]:
sns.relplot(x= 'total_bill', y='tip', hue='gender', data= tips_data)

plt.xlabel("Total bill for the meal")

plt.ylabel("Tips given (in KES)")

plt.title("tips vs total bill");
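
The cells above label the plot through `plt.xlabel` and `plt.title`, which act on whichever axes is currently active. An alternative, sketched below, is to label through the object each seaborn function returns: `relplot` is a figure-level function that returns a `FacetGrid`, while `scatterplot` is an axes-level function that returns a Matplotlib `Axes`. The ten rows are copied from the `head()` and `tail()` outputs shown earlier so that the block runs without the CSV file.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Ten rows copied from the head() and tail() outputs shown earlier
tips_sample = pd.DataFrame(
    {
        "total_bill": [2125.5, 2727.18, 1066.02, 3493.45, 3470.56,
                       3164.27, 2962.62, 2471.03, 1942.38, 2047.02],
        "tip": [360.79, 259.42, 274.68, 337.9, 567.89,
                645.28, 218.0, 218.0, 190.75, 327.0],
        "gender": ["Male", "Female", "Female", "Female", "Male",
                   "Male", "Female", "Male", "Male", "Female"],
    }
)

# Figure-level: relplot creates its own figure and returns a FacetGrid
g = sns.relplot(data=tips_sample, x="total_bill", y="tip", hue="gender")
g.set_axis_labels("Total bill (KES)", "Tip (KES)")
g.ax.set_title("Tip vs total bill")

# Axes-level: scatterplot draws on the Axes it is given and returns it
fig, ax = plt.subplots()
sns.scatterplot(data=tips_sample, x="total_bill", y="tip", hue="gender", ax=ax)
ax.set(xlabel="Total bill (KES)", ylabel="Tip (KES)", title="Tip vs total bill")
```

Labelling through the returned object avoids surprises when several figures are open, because it never depends on which axes happens to be current.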

Regression line helps assess linear trend and variance structure; hide CI when focusing on central tendency only.

In [24]:
sns.lmplot(x= 'total_bill', y='tip', data= tips_data);

To hide the confidence band around the regression line, pass `ci=None` to `sns.lmplot()`.

In [25]:
sns.lmplot(x= 'total_bill', y='tip', data= tips_data, ci = None);

Pair plot reveals pairwise relationships + diagonals (univariate distributions). Use `corner=True` or subset numeric columns to reduce clutter.

In [26]:
sns.pairplot(tips_data);

Joint plot combines a bivariate plot (center) with marginal univariate plots. Useful for quickly spotting non-linear structure or heteroskedasticity.

In [27]:
sns.jointplot(x = 'total_bill', y = 'tip', data = tips_data)

<seaborn.axisgrid.JointGrid at 0x13d9cc2f8c0>

## 11. Correlation & Matrix Views
Correlation caveats:
- Only linear association
- Sensitive to outliers
- Categorical encodings must be numeric to appear
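
The first two caveats are easy to verify on toy numbers. The sketch below uses made-up values and introduces Spearman rank correlation, which is not used elsewhere in the notebook, purely as a contrast; it is computed here as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

# Caveat 1: a perfect but non-linear relationship can have zero
# Pearson correlation
x = pd.Series(np.arange(-3, 4), dtype=float)
y = x ** 2
pearson_nonlinear = x.corr(y)
print("Pearson for y = x^2:", round(pearson_nonlinear, 3))

# Caveat 2: a single extreme value can pull Pearson correlation down
a = pd.Series(np.arange(1, 11), dtype=float)
b = a.copy()
b.iloc[-1] = 100.0  # one outlier; the ordering of the values is unchanged
pearson_outlier = a.corr(b)

# Spearman correlation depends only on the ordering of the values
spearman_outlier = a.rank().corr(b.rank())
print("Pearson:", round(pearson_outlier, 3), "Spearman:", round(spearman_outlier, 3))
```

When the two coefficients disagree strongly, a scatter plot of the pair is usually the fastest way to see whether an outlier or a curved relationship is responsible.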

We'll compute Pearson correlation among numeric columns and visualize with a heatmap + formatted annotations.

In [28]:
tips_data.corr(numeric_only=True)

|   | total_bill | tip | size |
|---|------------|-----|------|
| total_bill | 1.0 | 0.214756 | 0.096942 |
| tip | 0.214756 | 1.0 | 0.090766 |
| size | 0.096942 | 0.090766 | 1.0 |


Before plotting, inspect numeric-only subset; limited columns may suggest engineered features (e.g., percent tip) for richer correlation structure.

In [29]:
sns.heatmap(tips_data.corr(numeric_only=True), annot=True, fmt=".2f", cmap='coolwarm', vmin=-1, vmax=1);

### Histograms & Density Plots

A histogram displays the frequency of observations within specified intervals (bins), allowing you to see how values are distributed across the range of a variable. Adjusting the `bins=` parameter can reveal finer or broader patterns in the data. Kernel Density Estimate (KDE) plots provide a smoothed curve that approximates the underlying distribution, but may be misleading when sample sizes are small or data are highly skewed.

For clear insights, combine histograms and KDE plots thoughtfully—avoid layering too many visual elements, which can obscure key patterns and make interpretation difficult.

In [30]:
# For histogram only use:

sns.histplot(x = tips_data['total_bill'], bins=30, kde= False);

In [31]:
# For both histogram and density plot use:

sns.histplot(x = tips_data['total_bill'], kde = True);

### Box Plot: Summarizing Distributions and Detecting Outliers

A box plot provides a concise visual summary of a variable’s distribution, highlighting the median, interquartile range (IQR), and potential outliers. It is especially useful for comparing central tendency and spread across multiple groups. However, box plots may obscure multimodal patterns or subtle distributional features—consider pairing with violin or swarm plots when it’s important to reveal the full shape of the data.
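
The quantities a box plot draws can be computed directly. The sketch below uses made-up numbers with one deliberately large value and applies the 1.5 × IQR rule, which is the default whisker setting in seaborn and Matplotlib.

```python
import pandas as pd

# Made-up values with one unusually large observation
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 40])

# Box edges and center line
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Whiskers reach the most extreme points inside these fences;
# anything beyond them is drawn as an individual outlier marker
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1:", q1, "median:", median, "Q3:", q3, "IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers.tolist())
```

Being flagged by this rule only means a point is far from the middle half of the data; whether it is an error or a genuine extreme still requires looking at the record itself.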

In [32]:
sns.boxplot(y='total_bill', data=tips_data);

The distribution of the total bill appears roughly symmetric rather than skewed, with most values clustered around the center. Symmetry alone does not establish normality; a histogram or Q-Q plot is a better check for that. The point above the upper whisker is an outlier: an unusually high total bill that lies far from the rest of the data.

In [33]:
sns.boxplot(x = 'tip', data = tips_data);

## 12. Categorical vs Numeric

**Objective:**  
Effectively compare the distribution or central tendency of a numeric variable across different categorical groups.

**Chart Selection Guide:**
- **Countplot:** Visualizes the frequency of each category, helping you understand group sizes.
- **Barplot:** Displays the mean (or median, using `estimator=`) of a numeric variable for each category, often with confidence intervals to show uncertainty.
- **Box, Violin, or Swarm Plot:** Reveals the full distribution, including spread, central tendency, and outliers, for each group.
- **Pointplot:** Highlights trends or changes in a numeric variable across ordered categories, useful for time series or ordinal data.

**Best Practices:**  
Use the `hue` parameter thoughtfully to introduce a secondary categorical distinction, but avoid excessive encodings—too many colors or styles can make plots confusing and reduce interpretability.

In [34]:
sns.countplot(x = "day", data = tips_data);

The plot draws one bar per category in the variable. We can use `.value_counts()` to confirm the counts.

In [35]:
tips_data["day"].value_counts()

day
Sat     165
Sun     140
Thur    134
Fri      79
Tues     78
Mon      75
Wed      73
Name: count, dtype: int64

Ordering categorical variables thoughtfully enhances the clarity and impact of your visualizations. Use a logical sequence—such as chronological order for days of the week or ranking by frequency or magnitude—rather than default alphabetical sorting. This approach ensures that patterns and comparisons are immediately meaningful to your audience, making your charts easier to interpret and more informative.
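
Besides passing `order=` to each plotting call, the intended sequence can be stored on the column itself as an ordered categorical. The sketch below uses a short toy column with the same day labels as the tips data; `day_order` is a name introduced here for illustration.

```python
import pandas as pd

day_order = ["Mon", "Tues", "Wed", "Thur", "Fri", "Sat", "Sun"]

# Toy column using the same day labels as the tips data
days = pd.Series(
    ["Sat", "Mon", "Sun", "Sat", "Thur", "Fri", "Tues", "Wed", "Sat", "Sun"]
)

# Plain string sorting is alphabetical and scrambles the week
print(sorted(days.unique()))

# An ordered categorical remembers the intended sequence
days_cat = days.astype(pd.CategoricalDtype(categories=day_order, ordered=True))
counts = days_cat.value_counts().sort_index()
print(counts)
```

Seaborn's categorical plots follow the category order of a categorical column when no explicit `order=` is given, so setting the dtype once keeps every later chart consistent.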

In [36]:
sns.countplot(x = "day", order = ["Mon","Tues", "Wed", "Thur", "Fri", "Sat", "Sun"], data = tips_data)

plt.xlabel("Day of the week")

plt.ylabel("Number of visitors");

More parties visited the restaurant on Saturday than on any other day of the week.

We can also compare day of the week and gender using the `hue` parameter as follows:

In [37]:
sns.countplot(x = "day", hue="gender", data = tips_data)

plt.xlabel("Day of the week")

plt.ylabel("Number of visitors");

In [38]:
sns.countplot(x='gender', data=tips_data, hue='size');

For both male and female payers, the most common party size is 2.

### Mean Comparison vs Distribution

Bar plots display group averages but conceal variability within each group. To provide a clearer picture, pair bar plots with box or violin plots, or include error bars (confidence intervals, standard deviation, or standard error). For small samples, consider showing individual data points to highlight the underlying distribution.
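
The numbers behind a bar and its error bar can be inspected with a plain `groupby`. The sketch below uses the ten rows shown in the earlier `head()` and `tail()` outputs, so the results describe only that tiny sample and not the full dataset.

```python
import pandas as pd

# Ten rows copied from the head() and tail() outputs shown earlier
tips_sample = pd.DataFrame(
    {
        "total_bill": [2125.5, 2727.18, 1066.02, 3493.45, 3470.56,
                       3164.27, 2962.62, 2471.03, 1942.38, 2047.02],
        "smoker": ["No", "No", "Yes", "No", "Yes",
                   "No", "Yes", "Yes", "No", "No"],
    }
)

# Bar height (mean) next to the spread measures an error bar could show:
# std is the standard deviation, sem is the standard error of the mean
summary = tips_sample.groupby("smoker")["total_bill"].agg(
    ["count", "mean", "median", "std", "sem"]
)
print(summary.round(1))
```

In seaborn 0.12 and later, the `errorbar` parameter of `barplot` selects which of these is drawn, for example `errorbar="sd"`, `errorbar="se"`, or `errorbar=("ci", 95)`.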

In [39]:
sns.barplot(x='smoker' , y='total_bill', data=tips_data);

Parties with a smoker have a slightly higher average total bill than parties without one.

Use `errorbar=None` to remove the error bars, or specify `estimator=np.median` to plot medians instead of means. (In seaborn versions before 0.12 the equivalent argument was `ci=None`, which is now deprecated.)

In [40]:
sns.barplot(x='smoker' , y='tip', errorbar=None, data=tips_data);


**Interpretation:**  
The observation that non-smokers tend to leave higher tips may be influenced by differences in group size or the total bill amount. To ensure a fair comparison between smokers and non-smokers, consider normalizing tips by calculating the tip as a percentage of the total bill.
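
The normalization suggested above takes one line. The sketch below computes it on the ten rows shown in the earlier `head()` and `tail()` outputs; `tip_pct` is a column name introduced here, and the group means describe only this small sample.

```python
import pandas as pd

# Ten rows copied from the head() and tail() outputs shown earlier
tips_sample = pd.DataFrame(
    {
        "total_bill": [2125.5, 2727.18, 1066.02, 3493.45, 3470.56,
                       3164.27, 2962.62, 2471.03, 1942.38, 2047.02],
        "tip": [360.79, 259.42, 274.68, 337.9, 567.89,
                645.28, 218.0, 218.0, 190.75, 327.0],
        "smoker": ["No", "No", "Yes", "No", "Yes",
                   "No", "Yes", "Yes", "No", "No"],
    }
)

# Tip as a percentage of the bill removes the effect of bill size
tips_sample["tip_pct"] = 100 * tips_sample["tip"] / tips_sample["total_bill"]

# Compare raw and normalized tips side by side for each group
by_smoker = tips_sample.groupby("smoker")[["tip", "tip_pct"]].mean()
print(by_smoker.round(2))
```

If the ranking of the groups changes between the `tip` and `tip_pct` columns, the raw difference was at least partly driven by bill size rather than by tipping behavior.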

Introducing the `hue` parameter adds a second categorical variable to the visualization, allowing for deeper subgroup analysis. However, avoid adding additional dimensions such as size or style unless they directly address a specific analytical question, as too many encodings can make the plot difficult to interpret.

In [41]:
sns.barplot(x = "smoker" , y= "tip", hue = "time",  data=tips_data);

Tips appear to be higher at dinner than at lunch, regardless of whether the party includes a smoker.

### Pie Charts: Use Sparingly and Interpret with Caution

Pie charts can illustrate the proportion of each category within a dataset, but they often make it challenging to compare segment sizes accurately—especially when differences are subtle or categories are numerous. For most analytical purposes, bar charts are preferable, as they present category frequencies or proportions in a way that is easier to interpret and compare. Reserve pie charts for situations where you have a small number of distinct categories and the differences between them are visually pronounced. Always consider your audience and the clarity of your message when choosing this type of visualization.
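
As a sketch of the bar-chart alternative, the block below converts the day counts from the earlier `value_counts()` output into percentage shares and draws them as sorted horizontal bars. Friday, Tuesday, and Monday differ by only a few parties, which is the kind of gap that is hard to judge from pie slices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Day counts copied from the value_counts() output shown earlier
day_counts = pd.Series(
    {"Sat": 165, "Sun": 140, "Thur": 134, "Fri": 79,
     "Tues": 78, "Mon": 75, "Wed": 73}
)

# Convert counts to percentage shares and sort for easy comparison
shares = (100 * day_counts / day_counts.sum()).sort_values()

fig, ax = plt.subplots(figsize=(6, 4))
ax.barh(shares.index, shares.values)
ax.set_xlabel("Share of dining parties (%)")
ax.set_title("Share of parties by day of the week")
```

Sorting by share suits a ranking question; if the question is about the weekly rhythm, chronological order would be the better choice.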

In [42]:
tips_data["gender"].value_counts().plot(kind = "pie");

To highlight a particular slice, use the `explode` parameter, which offsets each slice from the center by the given fraction of the radius.

In [43]:
# Explode 1st slice

tips_data["gender"].value_counts().plot(kind = "pie", explode = (0.05, 0));

In [44]:
tips_data["size"].value_counts().plot(kind = "pie", explode = (0.05, 0, 0, 0.1 ,0 ,0))

plt.title("Size of the party")

plt.ylabel("");

### Grouped Box Plot: Comparing Distributions Across Categories

Grouped box plots provide a clear visual comparison of how the central tendency and variability of a numeric variable shift across different categories. By introducing the `hue` parameter, you can highlight interactions between two categorical variables, revealing nuanced patterns and subgroup differences. For optimal readability and accessibility, always select a colorblind-friendly palette and ensure that group labels are clearly annotated. This approach makes it easy to detect changes, outliers, and distributional differences between groups, supporting more informed and transparent analysis.

In [45]:
sns.boxplot(x='smoker', y='total_bill', data=tips_data);



### Violin Plot

A violin plot is a powerful visualization that merges the summary features of a box plot with a mirrored kernel density estimate, offering a comprehensive view of the distribution’s shape for each category. This dual representation makes it especially effective for detecting multimodal patterns, assessing differences in spread or skewness, and revealing subtle distributional features that box plots alone may miss. When your data contains extreme outliers, you can use the cut= parameter to trim the tails, allowing you to focus on the main body of the distribution and improve interpretability. Violin plots are ideal for comparing distributions across groups, providing both statistical summaries and detailed insights into the underlying data structure.

In [46]:
sns.violinplot(x='gender', y='total_bill', data= tips_data);



## 13. Saving Figures

To export your visualizations for reports or publications, use `plt.savefig("filename.png", dpi=300, bbox_inches='tight')`. This ensures high-resolution output with neatly cropped edges.

For graphics containing lines or text, vector formats such as SVG or PDF are ideal, as they scale without loss of quality. Use PNG for raster images or when sharing online.
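
A minimal sketch of exporting one figure in several formats follows. The `figures` folder, the file name `example`, and the plotted numbers are placeholders introduced for illustration; Matplotlib picks the output format from the file extension.

```python
from pathlib import Path

import matplotlib.pyplot as plt

out_dir = Path("figures")  # hypothetical output folder
out_dir.mkdir(exist_ok=True)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([1, 2, 3, 4], [2, 1, 3, 5], marker="o")  # placeholder data
ax.set(xlabel="x", ylabel="y", title="Placeholder figure")

# Same figure, three formats: PNG is raster, SVG and PDF are vector
for ext in ("png", "svg", "pdf"):
    fig.savefig(out_dir / f"example.{ext}", dpi=300, bbox_inches="tight")

plt.close(fig)  # release the figure once it has been written
```

Calling `savefig` on the figure object rather than on `plt` makes it explicit which figure is being written, which matters once a notebook has several figures open.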

Below is an example of saving a count plot as an image file.

In [47]:
sns.countplot(x = "size", data = tips_data)

plt.xlabel("Size of the party")

plt.ylabel("Number of parties")

plt.savefig("size.png", dpi=300, bbox_inches="tight")



Check your current working directory for the `size.png` image.

(If image not visible inline, ensure path exists or refresh Jupyter render.)

## 14. Practice Activity 1: Africa COVID-19 Dataset
Objective: Reinforce bar plot interpretation & ordering.
Tasks:
1. Import libraries
2. Load dataset (handle encoding if needed)
3. Bar plot: Region vs Total Cases (horizontal for readability)
4. Remove error bars (`errorbar=None`; `ci=None` in seaborn versions before 0.12)
5. Identify highest case region

Extension: Normalize per capita if population available.

### Solution (Activity 1)

In [48]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [49]:
covid = pd.read_csv("activity_datasets/Africa COVID-19 Dec 6, 2020.csv", encoding = "ISO-8859-1")


In [None]:
sns.barplot(y = "Region", x = "Total Cases", data = covid, errorbar = None);

Interpretation: Region with highest total cases may reflect population density, testing rates, or outbreak timing—avoid causal claims without further context.

## 15. Practice Activity 2: Penguins (Basics)
Objective: Explore summary stats & simple univariate / boxplot visuals.
Tasks:
1. Load penguins
2. Summarize numeric stats
3. Boxplot for bill_length_mm
4. Histogram for body_mass_g

Extension: Color by species to reveal morphological differences.

### Solutions (Activity 2)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### Solution 1: Load the dataset

In [None]:
penguins = pd.read_csv("activity_datasets/penguins.csv")

#### Solution 2: Summarize numeric columns

In [None]:
penguins.describe()

#### Solution 3: Box plot of bill_length_mm

In [None]:
sns.boxplot(x = "bill_length_mm", data = penguins);

or

In [None]:
sns.boxplot(y = "bill_length_mm", data = penguins);

#### Solution 4: Histogram of body_mass_g

In [None]:
sns.histplot(penguins["body_mass_g"], kde = True);

## 16. Practice Activity 3: Penguins (Intermediate)
Objective: Multi-variable comparisons & ordering.
Tasks:
1. Load dataset
2. Boxplot: species vs flipper_length_mm
3. Reorder species logically (Adelie, Chinstrap, Gentoo)
4. Add hue=sex
5. Scatter: bill_length_mm vs flipper_length_mm (trend)
6. Pairplot for quick multivariate scan

Extension: Compute derived feature (bill_ratio = bill_length_mm / bill_depth_mm) and visualize.

### Solutions (Activity 3)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### Solution 1: Load the dataset

In [None]:
penguins = pd.read_csv("activity_datasets/penguins.csv")

#### Solution 2: Box plot of flipper_length_mm by species

In [None]:
sns.boxplot(x = "species", y = "flipper_length_mm", data = penguins)

plt.ylabel("Flipper length (mm)");

#### Solution 3: Reorder species

Reordering reveals monotonic trends more clearly and reduces cognitive load.

In [None]:
sns.boxplot(x = "species", y = "flipper_length_mm", data = penguins, order = ["Adelie", "Chinstrap", "Gentoo"]);

#### Solution 4: Add hue=sex

Adding hue introduces an interaction view; ensure the color palette is distinguishable (e.g., `palette='Set2'`).

In [None]:
sns.boxplot(x = "species", y = "flipper_length_mm", data = penguins, order = ["Adelie", "Chinstrap", "Gentoo"], hue = "sex");

#### Solution 5: Scatter plot with trend line

A regression line shows the overall linear trend. Species clusters become easier to see when you color by species (`hue="species"`) or facet by species.

In [None]:
sns.lmplot(x = "bill_length_mm", y = "flipper_length_mm", data = penguins, ci = None);

## 17. Further Reading & Design Checklist
Further Reading:
- Seaborn docs: https://seaborn.pydata.org/
- Data Viz Principles: "Fundamentals of Data Visualization" (Claus Wilke)
- Color: https://colorbrewer2.org
- Statistical Graphics: Tukey, Cleveland

Design Checklist:

- [x] What question does this plot answer?
- [x] Is the chosen mark (point, bar, area) appropriate?
- [x] Are scales clear (units, transformations)?
- [x] Are colors accessible (colorblind-safe)?
- [x] Is ordering meaningful, not arbitrary?
- [x] Is uncertainty represented (CI / distribution) if relevant?
- [x] Is annotation needed for key takeaway?
- [x] Can any ink be removed without losing information?

> Good visualization = maximum insight, minimum noise.