**A Note on Data:** The code examples in this module often assume a Pandas DataFrame named `df_canada`, loaded and preprocessed as detailed in Module 1: countries as the index, one column per year from 1980 to 2013 (labeled with integers or strings), a 'Total' column holding each country's total immigration, and a 'Continent' column. They also use a `years_for_plotting` list, typically defined as `years_for_plotting = [y for y in range(1980, 2014)]` if the year columns are integers, or `[str(y) for y in range(1980, 2014)]` if they are strings.

# Module 2: Core Visualization Techniques with Matplotlib & Pandas

This module dives into creating essential plot types using Matplotlib and Pandas' convenient plotting functions. [Source: 1] These visualizations are fundamental in data analysis, each designed to reveal different aspects of your dataset. [Source: 2]

## 2.1. Area Plots: Visualizing Magnitude and Proportion Over Time

**Area plots** are graphical displays of quantitative data, particularly useful for showing how one or more quantities change over a continuous axis, usually time. [Source: 3] They are like line plots, but the space between the axis and the line is filled with color. This fill helps to emphasize the volume or cumulative magnitude of the data over the progression. [Source: 4]

[*Image: Example of a simple area plot showing website traffic over a month, with the area under the line filled.*]

**Definition and Use Cases:** [Source: 5]
Area plots shine when you want to:
* Compare the change in two or more quantities over time. [Source: 5]
* Depict cumulative data, like tracking stock market performance, changes in population demographics, or resource distribution over time. [Source: 6]
* Visually represent magnitude, making it easy to see changes in volume. [Source: 7]

**Creation with Pandas:**
Pandas DataFrames make creating area plots simple with the `.plot(kind='area')` method. [Source: 8]

* **Data Structure:** Often, your data needs to be structured correctly. For example, if you want to plot trends over years (with years on the x-axis), and your DataFrame has years as columns and categories (like countries) as rows, you'll likely need to **transpose** it. This makes years the index and categories the columns. [Source: 9, 10]
* **Transparency (`alpha`):** The `alpha` parameter sets the transparency of the filled areas (a value between 0 for fully transparent and 1 for fully opaque). This is handy when plotting multiple series that might overlap, especially if they are not stacked. [Source: 11]
* **Stacked Area Plots:** By default, when you plot multiple columns, Pandas creates a **stacked area plot**. [Source: 12] In these plots, the values for each category are stacked on top of one another. [Source: 13] This is excellent for showing both the total magnitude and the proportional contribution of each category to that total over time. [Source: 14]
    [*Diagram: Illustration of a stacked area plot with 3 categories. Category A is at the bottom, Category B is stacked on A, and Category C is stacked on B. The top line represents the total.*]
* **Unstacked Area Plots:** You can create unstacked plots by passing `stacked=False`, but be careful with overlapping areas, as they can hide data from series underneath. [Source: 15]
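
To make the transpose step and the stacked/unstacked distinction concrete, here is a minimal sketch on a hypothetical three-country table (the names `df_wide`, `df_by_year`, and `band_tops` are illustrative and not part of the course data). It assumes that the top edge of each band in a stacked plot is the running total across columns, which is what `cumsum(axis=1)` computes.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data: countries as rows, years as columns
df_wide = pd.DataFrame(
    {1980: [10, 20, 5], 1981: [12, 18, 7], 1982: [15, 25, 9]},
    index=["A", "B", "C"],
)

# Transpose so that years become the index (x-axis) and countries the columns
df_by_year = df_wide.transpose()

# In a stacked plot, the top edge of each band is the running total across columns
band_tops = df_by_year.cumsum(axis=1)
print(band_tops)

# Draw the same data stacked and unstacked, side by side
fig, (ax_stacked, ax_unstacked) = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
df_by_year.plot(kind="area", stacked=True, alpha=0.75, ax=ax_stacked, title="Stacked")
df_by_year.plot(kind="area", stacked=False, alpha=0.4, ax=ax_unstacked, title="Unstacked")
ax_stacked.set_xlabel("Year")
ax_unstacked.set_xlabel("Year")
plt.show()
```

In the unstacked panel every series starts from zero, so a lower `alpha` is used to keep overlapped series visible.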

**Code Example (Top 5 Countries Immigration Trend - Stacked Area Plot):**
This example shows immigration trends for the top 5 countries to Canada. [Source: 16]

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np # For dummy data generation

# --- Assume df_canada is loaded and preprocessed from Module 1 ---
# For this example, let's create a dummy df_canada
countries = ['India', 'China', 'UK', 'Philippines', 'Pakistan', 'USA', 'Iran', 'Sri Lanka', 'South Korea', 'Poland']
years_for_plotting = [y for y in range(1980, 2014)] # Assuming integer year columns
data = {}
for year in years_for_plotting:
    data[year] = np.random.randint(1000, 30000, size=len(countries))
df_canada = pd.DataFrame(data, index=countries)
df_canada['Total'] = df_canada[years_for_plotting].sum(axis=1)
df_canada['Continent'] = ['Asia', 'Asia', 'Europe', 'Asia', 'Asia', 'North America', 'Asia', 'Asia', 'Asia', 'Europe']
# --- End dummy df_canada setup ---

# Get top 5 countries by 'Total' immigration
df_top5 = df_canada.sort_values(by='Total', ascending=False).head(5)

# Select only the year columns for these top 5 countries
df_top5_years = df_top5[years_for_plotting]

# Transpose the DataFrame: years become the index, countries become columns [Source: 17, 18]
df_top5_T = df_top5_years.transpose()

# Ensure the index (years) is of integer type (it should be if years_for_plotting are ints)
df_top5_T.index = df_top5_T.index.map(int) # [Source: 18]

# Create the area plot using Pandas
ax = df_top5_T.plot(kind='area',
                    alpha=0.75,  # Adjust transparency [Source: 11, 18]
                    stacked=True,  # Default, but explicit here [Source: 12, 18]
                    figsize=(12, 7)) # Set the figure size [Source: 19]

# Customize the plot using Matplotlib methods on the Axes object
ax.set_title('Immigration Trend of Top 5 Countries to Canada (1980-2013)') # [Source: 19]
ax.set_xlabel('Year') # [Source: 19]
ax.set_ylabel('Number of Immigrants') # [Source: 19]
ax.legend(title='Countries') # [Source: 19]
plt.show()

*Explanation:* Data preparation, like transposing, is often needed to match the desired plot orientation. [Source: 19, 20] Stacked area plots clearly show how part-to-whole relationships change. [Source: 21] However, while the total and the bottom-most series are easy to read against their stable baselines, accurately judging the magnitude or trend of upper series can be tricky because their baselines shift. [Source: 22] For comparing exact magnitudes of individual, non-cumulative series, other charts or unstacked area plots (with careful transparency) might be better. [Source: 23]

## 2.2. Histograms: Understanding Data Distributions

**Histograms** are visual representations of the frequency distribution of a single continuous numerical dataset. [Source: 24] They offer valuable insights into the underlying structure and shape of your data. [Source: 25]

[*Image: Example of a histogram showing the distribution of exam scores, with bins on the x-axis and frequency on the y-axis.*]

**Definition and Use Cases:** [Source: 26]
A histogram works by:
1.  Dividing the range of a numeric variable into a series of intervals called **"bins"**. [Source: 26]
2.  Counting how many data points fall into each bin. [Source: 27]
3.  Drawing a bar for each bin, where the height of the bar represents the frequency (or count) of data points in that interval. [Source: 28]
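
The three steps above can be sketched with `numpy.histogram`, which performs the binning and counting without drawing anything. The ten values below are made up for illustration.

```python
import numpy as np

# Hypothetical sample of ten values
values = np.array([1, 2, 2, 3, 5, 6, 7, 7, 8, 10])

# Steps 1 and 2: split the range [1, 10] into 3 equal-width bins and count values per bin
counts, bin_edges = np.histogram(values, bins=3)

# Step 3 would draw one bar per bin; here we just print what each bar's height would be
last = len(counts) - 1
for i, (left, right, count) in enumerate(zip(bin_edges[:-1], bin_edges[1:], counts)):
    closing = "]" if i == last else ")"  # only the last bin includes its right edge
    print(f"[{left:.0f}, {right:.0f}{closing}: {count}")
```

Every bin is half-open except the last, which also includes its right edge. That is why the value 10 is counted in the final bin.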

Histograms are used to:
* Visualize the **shape** of the data distribution (e.g., symmetric/bell-shaped, skewed left or right, bimodal, uniform). [Source: 29]
* Identify the **central tendency** (e.g., mean, median) and **spread** (variability) of the data. [Source: 29]
* Detect **outliers**, gaps in the data, or unusual concentrations/clusters. [Source: 30]

**The Importance of Binning:** [Source: 31]
The choice of the number of bins (or bin width) is critical:
* **Too few bins:** Can oversimplify the distribution, hiding important features. [Source: 32]
    [*Diagram: A histogram with too few bins, making a potentially complex distribution look like a simple block.*]
* **Too many bins:** Can create a "noisy" plot with many small fluctuations, making it hard to see the overall shape. [Source: 33]
    [*Diagram: A histogram with too many bins, showing excessive detail and obscuring the overall trend.*]
Choosing the right number of bins often involves experimentation or using statistical rules (like Sturges' formula, Scott's rule, or Freedman-Diaconis rule). [Source: 34]
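
As a rough illustration of those rules, NumPy's `histogram_bin_edges` accepts rule names such as `'sturges'`, `'scott'`, and `'fd'` (Freedman-Diaconis). The sketch below compares the bin counts they suggest for a synthetic, normally distributed sample. The sample is an assumption made for the example, not course data.

```python
import numpy as np

# Synthetic sample (assumption for illustration): 1000 draws from a normal distribution
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=1000)

# Ask NumPy how many bins each rule suggests for this sample
bin_counts = {}
for rule in ("sturges", "scott", "fd"):
    edges = np.histogram_bin_edges(sample, bins=rule)
    bin_counts[rule] = len(edges) - 1
    print(f"{rule:>8}: {bin_counts[rule]} bins")
```

The resulting edges can be passed to the `bins` parameter of `.plot(kind='hist')`, in the same way as the `bin_edges` array in the code example further down.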

**Creation with Pandas:**
Generate histograms from a Pandas Series or DataFrame column using `.plot(kind='hist')`. [Source: 35]
* `bins` parameter: Controls the number of bins. [Source: 36]
* `xticks` parameter: Can be used with bin edges (e.g., from `numpy.histogram()`) for precise tick alignment. [Source: 36]

**Code Example (Distribution of 2013 Immigration Data):** [Source: 37]

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np # For dummy data and np.histogram

# --- Assume df_canada is loaded (using dummy from above) ---
# For this example, let's ensure the 2013 column exists and is numeric
if 2013 not in df_canada.columns: # If year columns are integers
    df_canada[2013] = np.random.randint(100, 5000, size=len(df_canada))
# --- End dummy data adjustment ---

year_to_analyze = 2013 # Assuming integer column name [Source: 37]

if year_to_analyze in df_canada.columns:
    # Ensure data is numeric and handle potential NaNs
    immigration_data_2013 = pd.to_numeric(df_canada[year_to_analyze], errors='coerce').dropna() #[Source: 37]

    # Plotting the histogram using Pandas
    ax = immigration_data_2013.plot(kind='hist',
                                    figsize=(10, 6),
                                    bins=15,  # Specify number of bins [Source: 38, 39]
                                    color='skyblue',
                                    edgecolor='black') # For better bin separation [Source: 39]
    ax.set_title(f'Distribution of Immigration to Canada from Various Countries in {year_to_analyze}') #[Source: 39]
    ax.set_xlabel('Number of Immigrants') # [Source: 40]
    ax.set_ylabel('Number of Countries (Frequency)') # [Source: 40]
    ax.grid(axis='y', alpha=0.75) # [Source: 40]
    plt.show()

    # Example using NumPy to define bin edges for more control [Source: 40]
    count, bin_edges = np.histogram(immigration_data_2013, bins=10) # [Source: 40]

    ax_np = immigration_data_2013.plot(kind='hist',
                                       figsize=(10, 6),
                                       bins=bin_edges,    # Use pre-calculated bin edges [Source: 41]
                                       xticks=bin_edges,  # Set x-axis ticks to bin edges [Source: 42]
                                       color='lightcoral',
                                       edgecolor='black') # [Source: 43]
    ax_np.set_title(f'Distribution of Immigration to Canada in {year_to_analyze} (NumPy Bins)') # [Source: 43]
    ax_np.set_xlabel('Number of Immigrants') # [Source: 43]
    ax_np.set_ylabel('Number of Countries (Frequency)') # [Source: 43]
    ax_np.grid(axis='y', alpha=0.75) # [Source: 43]
    plt.xticks(rotation=45) # Rotate x-axis labels if they overlap [Source: 43]
    plt.show()
else:
    print(f"Column for year {year_to_analyze} not found in df_canada.") # [Source: 43]

**Histograms vs. Bar Charts:** [Source: 43]
It's crucial to distinguish them:
* **Histograms:** Show the distribution of *continuous numerical data*. The width of each bar (bin) represents an interval and is meaningful. Bars typically have no gaps. [Source: 44]
* **Bar Charts:** Compare *discrete categories*. The width of bars is generally arbitrary, and there are usually distinct gaps between them. [Source: 45]

The "optimal" number of bins isn't fixed; it depends on dataset size and distribution characteristics. [Source: 46] While visual experimentation is common, statistical rules can offer a data-driven starting point for bin selection to avoid misinterpretations. [Source: 47]

## 2.3. Bar Charts: Comparing Categorical Data

**Bar charts** are widely used for comparing the values of a variable across different discrete categories or groups. [Source: 48] The length or height of each bar is proportional to the value it represents. [Source: 49]

[*Image: Example of a vertical bar chart comparing sales figures for 4 different products.*]

**Definition and Use Cases:** [Source: 50]
Bar charts are ideal for:
* Visualizing data that can be easily categorized and ranked. [Source: 50]
* Showing how different categories contribute to a measure or how they compare. [Source: 50]
* Common examples:
    * Comparing sales across products. [Source: 51]
    * Showing vote counts for candidates. [Source: 51]
    * Visualizing survey responses by demographic groups. [Source: 52]
    * Tracking a metric (like number of immigrants) for an entity over discrete time periods (e.g., years, treated as categories). [Source: 52]
* Bars can be **vertical** (`kind='bar'`) or **horizontal** (`kind='barh'`). [Source: 53]

**Creation with Pandas:**
Use `.plot(kind='bar')` for vertical bars or `.plot(kind='barh')` for horizontal bars. [Source: 53]
* Customizations: `color` (single or list), `edgecolor`. [Source: 54]

**Code Example (Iceland Immigration & Top 5 Countries Horizontal Bar):** [Source: 55]
1.  Plots Iceland's annual immigration to Canada (vertical bar chart). [Source: 55]
2.  Shows total immigration from the top 5 countries (horizontal bar chart). [Source: 56]

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np # For dummy data

# --- Assume df_canada and years_for_plotting are defined (using dummy from above) ---
# Ensure Iceland exists for the example
if 'Iceland' not in df_canada.index:
    iceland_data = {year: np.random.randint(10, 500) for year in years_for_plotting}
    iceland_df = pd.DataFrame(iceland_data, index=['Iceland'])
    iceland_df['Total'] = iceland_df[years_for_plotting].sum(axis=1)
    iceland_df['Continent'] = 'Europe'
    df_canada = pd.concat([df_canada, iceland_df])
# Year column labels used for selection. The dummy df_canada uses integer labels;
# if your columns are strings, use [str(y) for y in years_for_plotting] instead.
years_data_columns = years_for_plotting
# --- End dummy data adjustment ---

if 'Iceland' in df_canada.index:
    # --- Vertical Bar Chart for Iceland's yearly immigration ---
    df_iceland_yearly = df_canada.loc['Iceland', years_data_columns]

    # Convert index (years) to string for categorical display on x-axis if they are numbers
    df_iceland_yearly.index = df_iceland_yearly.index.map(str) # [Source: 57]

    ax_iceland = df_iceland_yearly.plot(kind='bar',
                                        figsize=(12, 7),
                                        color='skyblue',
                                        edgecolor='black') # [Source: 57, 58]
    ax_iceland.set_title('Annual Immigration from Iceland to Canada (1980-2013)') # [Source: 59]
    ax_iceland.set_xlabel('Year') # [Source: 59]
    ax_iceland.set_ylabel('Number of Immigrants') # [Source: 59]
    ax_iceland.tick_params(axis='x', rotation=45) # Rotate x-axis labels [Source: 59]
    ax_iceland.grid(axis='y', linestyle='--', alpha=0.7) # [Source: 59]
    plt.tight_layout() # Adjust layout
    plt.show()

    # --- Horizontal Bar Chart for Total Immigration from Top 5 Countries ---
    # Assuming 'Total' column exists
    df_top5_total = df_canada.sort_values(by='Total', ascending=False).head(5)

    ax_top5 = df_top5_total['Total'].plot(kind='barh', # Plot only the 'Total' series [Source: 59]
                                          figsize=(10, 7),
                                          color='lightcoral',
                                          edgecolor='black') # [Source: 60, 61]
    ax_top5.set_title('Total Immigration from Top 5 Countries to Canada (1980-2013)') # [Source: 61]
    ax_top5.set_xlabel('Total Number of Immigrants') # [Source: 61]
    ax_top5.set_ylabel('Country') # [Source: 61]
    ax_top5.invert_yaxis() # Show highest value at the top for barh [Source: 61]
    ax_top5.grid(axis='x', linestyle='--', alpha=0.7) # [Source: 61]
    plt.tight_layout() # Adjust layout
    plt.show()
else:
    print("Country 'Iceland' not found in df_canada index for the first example.") # [Source: 61]

**Horizontal Bar Charts (`barh`):** [Source: 61]
* Often better than vertical when category labels are long (e.g., full country names). [Source: 62]
* Provide more space for text, improving readability. [Source: 63]
    [*Image: Example of a horizontal bar chart with long category labels that are easily readable.*]

**Ordering Bars:** [Source: 63]
* Sorting bars by value (ascending or descending) makes comparisons easier and helps quickly identify min/max and rankings. [Source: 64]
* Generally preferred over arbitrary or alphabetical order unless categories have a natural sequence. [Source: 65]
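
A minimal sketch of value-ordered bars, using a hypothetical Series of totals (the country names and numbers are placeholders). Because `barh` draws the first row at the bottom, sorting in ascending order puts the largest bar at the top without needing `invert_yaxis()`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical totals per category
totals = pd.Series({"Country A": 320, "Country B": 910, "Country C": 150, "Country D": 540})

# Ascending sort: barh draws the first row at the bottom, so the largest ends up on top
totals_sorted = totals.sort_values(ascending=True)

ax = totals_sorted.plot(kind="barh", figsize=(8, 4), color="lightcoral", edgecolor="black")
ax.set_xlabel("Total")
ax.set_ylabel("Country")
ax.set_title("Bars ordered by value")
plt.tight_layout()
plt.show()
```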

## 2.4. Pie Charts: Illustrating Proportions

**Pie charts** are circular graphics divided into segments ("slices") to show numerical proportions. [Source: 66] The size (angle or area) of each slice is proportional to the quantity it represents relative to the whole. [Source: 67]

[*Image: Example of a pie chart showing market share of 3 companies, with percentages labeled on slices.*]

**Definition and Use Cases:** [Source: 68]
* Primarily for showing the percentage breakdown of categories that make up a whole (e.g., party-wise election seats, market share). [Source: 68]
* **Most effective for a small number of categories (typically 2-3)** where differences are distinct. [Source: 69]

**Creation with Pandas:**
Generate from a Pandas Series (often created by `groupby().sum()` or `groupby().count()`) using `.plot(kind='pie')`. [Source: 69]
Common parameters:
* `autopct='%1.1f%%'`: Formats and displays percentages on slices. [Source: 70] (`%1.1f%%` means float with 1 decimal, followed by '%')
* `explode`: List/tuple of offset values to "explode" slices outwards for emphasis. [Source: 71]
* `colors`: List of colors for slices. [Source: 72]
* `startangle`: Rotates the starting point of the first slice (e.g., `startangle=90` starts the first slice at the top). [Source: 72]
* `shadow`: Adds a shadow. [Source: 73]
* `labels`: Can provide slice labels, but a legend is often cleaner if labels clutter. [Source: 73]
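
The snippet below is a small sketch of how two of these parameters are typically prepared. The `autopct` string uses Python's printf-style formatting, so its effect can be checked directly. The `explode` list is built by label rather than by position, which avoids mistakes when the slice order changes. The `shares` Series and the choice of 'Europe' are illustrative assumptions.

```python
import pandas as pd

# Hypothetical part-to-whole data
shares = pd.Series({"Asia": 60, "Europe": 30, "Africa": 10})

# autopct uses printf-style formatting: one decimal place, then a literal percent sign
label_text = "%1.1f%%" % 23.456
print(label_text)

# Build explode by label: offset only the slice we want to emphasize
explode = [0.1 if continent == "Europe" else 0.0 for continent in shares.index]
print(explode)
```

These values could then be passed along as `shares.plot(kind='pie', autopct='%1.1f%%', explode=explode)`.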

**Criticisms and Alternatives:** [Source: 74]
Pie charts are heavily criticized: [Source: 74]
* Often misused and can make it hard to accurately compare segment sizes, especially with many slices or similar proportions. [Source: 75]
* The human eye is better at judging lengths (bar charts) than angles or areas. [Source: 75]
* **3D pie charts are especially bad** as perspective distorts proportions further. [Source: 76]
    [*Diagram: A 3D pie chart vs. a 2D pie chart of the same data, highlighting how 3D distorts perception.*]
* For many pie chart scenarios, a **bar chart is clearer and more accurate**. [Source: 76]
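
One possible way to present the same part-to-whole information as a bar chart is to convert the totals to percentages and plot them as sorted horizontal bars. This is a sketch with made-up continent totals, not the course dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical totals per continent
totals = pd.Series({"Asia": 600, "Europe": 250, "Africa": 100, "Americas": 50})

# Convert to percentages of the whole and sort so the largest bar is drawn on top
shares = (totals / totals.sum() * 100).sort_values()

ax = shares.plot(kind="barh", figsize=(8, 4), color="skyblue", edgecolor="black")
ax.set_xlabel("Share of total (%)")
ax.set_title("Share by continent as a bar chart")

# Label each bar with its percentage, just past the end of the bar
for position, value in enumerate(shares):
    ax.text(value + 0.5, position, f"{value:.1f}%", va="center")

plt.tight_layout()
plt.show()
```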

**Code Example (Proportion of Immigration by Continent):** [Source: 77]

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np # For dummy data

# --- Assume df_canada is loaded (using dummy from Area Plots section) ---
# Ensure 'Continent' and 'Total' columns exist
if 'Continent' not in df_canada.columns:
    df_canada['Continent'] = np.random.choice(['Asia', 'Europe', 'Africa', 'Oceania', 'Americas'], size=len(df_canada))
if 'Total' not in df_canada.columns:
    df_canada['Total'] = np.random.randint(10000, 1000000, size=len(df_canada))
# --- End dummy data setup ---

# Group data by 'Continent' and sum the 'Total' immigration [Source: 78]
df_continents_total = df_canada.groupby('Continent')['Total'].sum()

explode_list = [0.0] * len(df_continents_total) # No explosion by default [Source: 78]

if not df_continents_total.empty:
    # Example: Explode the slice with the minimum total (if you want to emphasize the smallest)
    # This is just one way to choose; often you might explode a specific category.
    min_continent_val = df_continents_total.min()
    min_continent_idx_pos = df_continents_total[df_continents_total == min_continent_val].index
    if not min_continent_idx_pos.empty:
        # Get the integer position of this index
        idx_loc = df_continents_total.index.get_loc(min_continent_idx_pos[0])
        explode_list[idx_loc] = 0.1  # Set a small explosion for that slice [Source: 79]


ax_pie = df_continents_total.plot(kind='pie',
                                  figsize=(12, 8), # Made figure slightly wider for legend
                                  autopct='%1.1f%%', # Display percentages [Source: 79, 80]
                                  startangle=90,     # Start first slice at the top [Source: 80]
                                  shadow=True,       # [Source: 80]
                                  labels=None,       # Remove default labels, use legend [Source: 81]
                                  pctdistance=0.8,   # Distance of % text from center [Source: 81]
                                  explode=explode_list) # [Source: 82]

ax_pie.set_title('Proportion of Immigration to Canada by Continent (1980-2013)', y=1.05, fontsize=14) # [Source: 82]
ax_pie.set_ylabel('') # Remove default ylabel (often the Series name) [Source: 82]
plt.axis('equal')    # Ensures pie is circular [Source: 82]

# Add a legend [Source: 82]
plt.legend(labels=df_continents_total.index, loc='upper left', bbox_to_anchor=(-0.15, 1))
plt.tight_layout() # Adjust layout to prevent overlap
plt.show()

*Note on the `explode_list` logic in the example: `df_continents_total` is a Series (the result of `groupby().sum()` on a single column), not a DataFrame. The example finds the minimum value, looks up its index label, and converts that label to an integer position in `explode_list`. Calling `df_continents_total.idxmin()` would return the same label in one step.*

**Caution with Pie Charts:** [Source: 82]
Use pie charts sparingly. [Source: 83] They are best for simple part-to-whole relationships with very few categories (ideally no more than three or four). [Source: 83] Bar charts are almost always better for comparing categories because people judge lengths more accurately than angles or areas. [Source: 84] Customizations like `explode` or `autopct` try to mitigate these issues but can add clutter ("chartjunk") instead of fixing the underlying perception problem. [Source: 85, 86] If a chart needs many such helpers to be readable, a simpler chart type is likely the better choice. [Source: 87]

## 2.5. Box Plots: Summarizing Data Distributions Statistically

**Box plots** (or box-and-whisker plots) offer a standardized way to visually represent the distribution of numerical data through its key summary statistics. [Source: 88]

[*Diagram: A clear, labeled diagram of a box plot.
    - Label the Minimum (bottom whisker end)
    - Label Q1 (bottom of the box)
    - Label Median (Q2, line inside the box)
    - Label Q3 (top of the box)
    - Label Maximum (top whisker end)
    - Indicate the Interquartile Range (IQR) as the height of the box.
    - Show potential outliers as individual points beyond the whiskers.*]

**Definition and Components:** [Source: 89]
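A box plot summarizes data with a five-number summary (minimum, Q1, median, Q3, maximum) plus a few derived components:
* **Minimum:** Smallest data point that is not treated as an outlier. Under the common rule, this is the lowest value at or above Q1 - 1.5 * IQR, and the lower whisker ends here. [Source: 89, 90]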
* **First Quartile (Q1):** 25th percentile; 25% of data falls below this. [Source: 91]
* **Median (Q2):** Middle value (50th percentile); divides data in half. [Source: 92]
* **Third Quartile (Q3):** 75th percentile; 75% of data falls below this. [Source: 93]
* **Maximum:** Largest data point that is not treated as an outlier. Under the common rule, this is the highest value at or below Q3 + 1.5 * IQR, and the upper whisker ends here. [Source: 94]
* **Interquartile Range (IQR):** Range between Q1 and Q3 (IQR = Q3 - Q1). Represents the spread of the middle 50% of data. The "box" spans this. [Source: 95]
* **Whiskers:** Lines extending from the box to show the range of the non-outlier data. Commonly, each whisker reaches the most extreme data point that lies within 1.5 * IQR of Q1 or Q3. [Source: 96, 97]
* **Outliers:** Data points beyond whiskers are often plotted as individual dots, considered potential outliers. [Source: 97, 98]
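
The components above can be computed by hand. The sketch below uses a made-up sample with one extreme value and applies the common 1.5 * IQR rule. It assumes NumPy's default (linear interpolation) percentile method. Other quartile conventions give slightly different numbers.

```python
import numpy as np

# Hypothetical sample with one extreme value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# Quartiles and interquartile range
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences: values beyond these are flagged as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that lie inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

Note that the whiskers stop at actual data points (1 and 9 here), not at the fence values themselves.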

**Use Cases:** [Source: 98]
Box plots are great for:
* Comparing distributions of a continuous variable across categories (e.g., exam scores for different classes). [Source: 99]
* Examining data spread (dispersion) and skewness. [Source: 99]
* Visually identifying quartiles and the median. [Source: 100]
* Detecting potential outliers. [Source: 100]
* Comparing multiple distributions side-by-side compactly. [Source: 101]

**Creation with Pandas:**
Use `.plot(kind='box')` on a DataFrame or Series, or `.boxplot()` on a DataFrame. [Source: 101] Data might need restructuring (e.g., transposing) to compare distributions across groups. [Source: 102]

**Code Example (Japanese Immigration Distribution & Comparison):** [Source: 103]
1.  Single box plot for Japan's annual immigration. [Source: 104]
2.  Compares immigration distributions for the top 3 countries. [Source: 104]

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np # For dummy data

# --- Assume df_canada and years_for_plotting are defined (using dummy from Area Plots section) ---
# Ensure Japan exists for the example
if 'Japan' not in df_canada.index:
    japan_data = {year: np.random.randint(500, 2000) for year in years_for_plotting}
    japan_df = pd.DataFrame(japan_data, index=['Japan'])
    japan_df['Total'] = japan_df[years_for_plotting].sum(axis=1)
    japan_df['Continent'] = 'Asia'
    df_canada = pd.concat([df_canada, japan_df])
# Ensure years_for_plotting contains the correct column names (integers or strings)
years_data_columns = years_for_plotting # if columns are integers
# years_data_columns = [str(y) for y in years_for_plotting] # if columns are strings
# --- End dummy data adjustment ---


if 'Japan' in df_canada.index:
    # --- Single Box Plot for Japan's Yearly Immigration ---
    df_japan_yearly = df_canada.loc['Japan', years_data_columns]
    df_japan_yearly_numeric = pd.to_numeric(df_japan_yearly, errors='coerce').dropna() # [Source: 105]

    ax_japan_box = df_japan_yearly_numeric.plot(kind='box',
                                                figsize=(8, 6),
                                                vert=False, # Horizontal box plot [Source: 106]
                                                patch_artist=True, # Fills box with color [Source: 107]
                                                medianprops={'color':'black'}) # Customize median line [Source: 107]
    ax_japan_box.set_title('Distribution of Annual Japanese Immigration to Canada (1980-2013)') # [Source: 107]
    ax_japan_box.set_xlabel('Number of Immigrants per Year') # [Source: 107]
    ax_japan_box.grid(axis='x', linestyle='--', alpha=0.7) # [Source: 107, 108]
    plt.tight_layout()
    plt.show()

    # --- Comparing Box Plots for Top 3 Countries ---
    df_top3_countries = df_canada.sort_values(by='Total', ascending=False).head(3)
    # Select year columns and transpose so countries are columns, years are index
    df_top3_yearly_data = df_top3_countries[years_data_columns].transpose() # [Source: 108]

    # Ensure year index is numeric (if it became string due to mixed types earlier)
    df_top3_yearly_data.index = df_top3_yearly_data.index.map(int) # [Source: 108]
    # Ensure all data in these columns is numeric
    df_top3_yearly_data = df_top3_yearly_data.apply(pd.to_numeric, errors='coerce') # [Source: 108]

    ax_compare_box = df_top3_yearly_data.plot(kind='box',
                                              figsize=(10, 7),
                                              patch_artist=True) # Allows filling boxes [Source: 109]
    ax_compare_box.set_title('Annual Immigration Distribution for Top 3 Countries (1980-2013)') #[Source: 109, 110]
    ax_compare_box.set_xlabel('Country') # Pandas uses column names (countries here) for x-labels [Source: 110]
    ax_compare_box.set_ylabel('Number of Immigrants per Year') # [Source: 110]
    ax_compare_box.grid(axis='y', linestyle='--', alpha=0.7) # [Source: 110]
    plt.tight_layout()
    plt.show()
else:
    print("Country 'Japan' not found in df_canada index for the first example.") # [Source: 110]

Box plots concisely summarize key distribution characteristics (median, spread/IQR, skewness, potential outliers) without showing the full shape like a histogram. [Source: 111] This compactness is great for comparing many distributions side-by-side. [Source: 111, 112] Remember, the common $1.5 \times IQR$ rule for whiskers and outliers identifies *potential* outliers. [Source: 112, 113] Interpret them with domain knowledge, not just visually. [Source: 115]

## 2.6. Scatter Plots: Exploring Relationships Between Variables

**Scatter plots** display values for two numerical variables, with each data point shown as a dot (or other marker) at its Cartesian coordinates (one variable on x-axis, one on y-axis). [Source: 115]

[*Image: Example of a scatter plot showing height vs. weight for a group of people, with a general upward trend visible.*]

**Definition and Use Cases:** [Source: 116]
Scatter plots primarily examine the relationship or **correlation** between two continuous variables. [Source: 116] They help:
* Identify patterns or trends (e.g., positive, negative, linear, non-linear). [Source: 117]
* Detect outliers or unusual observations deviating from the pattern. [Source: 117]
* Visualize clusters of data points. [Source: 118]
* Sense the strength and direction of a correlation. [Source: 118]
* Example: Plotting advertising spend vs. sales to see if higher spending correlates with higher sales. [Source: 119]
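
To put a number on the "strength and direction" that a scatter plot suggests, one common option is the Pearson correlation coefficient, available as `Series.corr`. The advertising figures below are invented for illustration.

```python
import pandas as pd

# Hypothetical advertising spend vs. sales figures
df_ads = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "sales": [25, 41, 62, 79, 103],
})

# Pearson correlation: +1 is a perfect positive linear relationship, -1 a perfect negative one
r = df_ads["ad_spend"].corr(df_ads["sales"])
print(f"Pearson r = {r:.3f}")
```

A value close to +1, as here, matches a tight upward-sloping cloud of points. It says nothing about causation, and a single outlier can shift it noticeably.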

**Creation:**
* **Pandas:** `.plot(kind='scatter', x='x_col', y='y_col')` on a DataFrame. [Source: 119]
* **Matplotlib:** `ax.scatter(x_data, y_data)`. [Source: 120]

**Customization:**
* `s`: Marker size (in Matplotlib, the marker area in points squared). Can be a single value or an array/Series (for bubble plots). [Source: 120, 121]
* `c`: Marker color. Single color or array/Series mapped to a colormap. [Source: 122]
* `marker`: Marker shape (e.g., 'o' circle, 's' square, '^' triangle). [Source: 123]
* `alpha`: Transparency, useful for overlapping points. [Source: 124]

**Regression Line Concept:** [Source: 125]
Often, a **regression line** is overlaid on a scatter plot to visually summarize the general trend (typically linear) between the two variables. [Source: 125] This line represents the "best fit" to the data according to a chosen regression model. [Source: 126]
[*Diagram: A scatter plot with data points and a regression line drawn through them, showing the general trend.*]
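
As one simple way to obtain such a line without Seaborn, a degree-1 least-squares fit with `numpy.polyfit` returns the slope and intercept, which can then be drawn over the scatter points. The data below is synthetic: a known line `y = 2x + 1` plus a small alternating disturbance, chosen so the fitted values are easy to sanity-check.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data (assumption): y = 2x + 1 with a small alternating disturbance
x = np.arange(10)
noise = 0.5 * (-1.0) ** x
y = 2 * x + 1 + noise

# Least-squares straight-line fit: polyfit returns coefficients from highest degree down
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")

# Overlay the fitted line on the scatter plot
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(x, y, color="darkcyan", label="Data")
ax.plot(x, slope * x + intercept, color="red", linewidth=2, label="Linear fit")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
plt.show()
```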

**Code Example (Total Immigration to Canada Over Years):** [Source: 127]
Plots total number of immigrants to Canada per year from 1980-2013.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np # For dummy data
import seaborn as sns # For regression plot [Source: 130]

# --- Assume df_canada and years_for_plotting are defined (using dummy from Area Plots) ---
# Ensure years_for_plotting matches column names (integers or strings)
years_data_columns = years_for_plotting # Adjust if columns are strings
# ---

# Calculate total immigration across all countries for each year
# .sum(axis=0) sums each column (year) vertically [Source: 127]
total_immigration_per_year = df_canada[years_data_columns].sum(axis=0)

# Ensure the index (years) is numeric for plotting
total_immigration_per_year.index = total_immigration_per_year.index.map(int) # [Source: 127]

# Create a DataFrame for plotting [Source: 127]
df_total_trend = pd.DataFrame({
    'Year': total_immigration_per_year.index,
    'Total_Immigrants': total_immigration_per_year.values
})

# --- Scatter plot using Pandas ---
ax_scatter_pd = df_total_trend.plot(kind='scatter',
                                    x='Year',
                                    y='Total_Immigrants',
                                    figsize=(10, 6),
                                    color='darkcyan',
                                    s=50) # Marker size [Source: 128, 129, 130]
ax_scatter_pd.set_title('Total Immigration to Canada (1980-2013) - Scatter Plot') # [Source: 130]
ax_scatter_pd.set_xlabel('Year') # [Source: 130]
ax_scatter_pd.set_ylabel('Total Number of Immigrants') # [Source: 130]
ax_scatter_pd.grid(True) # [Source: 130]
plt.show()

# --- Scatter plot with regression line using Seaborn ---
# Seaborn (covered later) is convenient for regression lines [Source: 130]
plt.figure(figsize=(10, 6))
ax_seaborn_reg = sns.regplot(x='Year',
                             y='Total_Immigrants',
                             data=df_total_trend, # [Source: 131]
                             color='darkcyan',
                             scatter_kws={'s': 50, 'alpha': 0.7}, # Customize scatter points [Source: 131]
                             line_kws={'color': 'red', 'linewidth': 2}) # Customize regression line [Source: 132]
ax_seaborn_reg.set_title('Total Immigration to Canada with Regression Line (1980-2013)') # [Source: 132]
ax_seaborn_reg.set_xlabel('Year') # [Source: 132]
ax_seaborn_reg.set_ylabel('Total Number of Immigrants') # [Source: 132]
ax_seaborn_reg.grid(True) # [Source: 132]
plt.show()

Scatter plots focus on the *relationship* and potential *correlation* between two numerical variables. The collective pattern is key. [Source: 132, 133] Outliers can disproportionately influence perceived relationships and calculated correlations/regression lines, so visual inspection is a crucial first step. [Source: 134, 135]
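
To make the point about outliers concrete, here is a minimal sketch that compares the Pearson correlation and the least-squares slope with and without a single extreme point. The arrays are made up for illustration (they are not the immigration data), and `summarize` is a hypothetical helper built on the standard NumPy calls `np.corrcoef` and `np.polyfit`.

```python
import numpy as np

# Hypothetical data: y rises steadily with x (values made up for illustration)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.1])

# Same data plus one extreme point far below the trend
x_out = np.append(x, 9.0)
y_out = np.append(y, 0.0)

def summarize(xs, ys):
    """Return (Pearson r, slope of the least-squares line). Hypothetical helper."""
    r = np.corrcoef(xs, ys)[0, 1]
    slope, intercept = np.polyfit(xs, ys, deg=1)
    return r, slope

r_clean, slope_clean = summarize(x, y)
r_outlier, slope_outlier = summarize(x_out, y_out)

print(f"Without outlier: r = {r_clean:.3f}, slope = {slope_clean:.3f}")
print(f"With outlier:    r = {r_outlier:.3f}, slope = {slope_outlier:.3f}")
```

A single added point lowers both the correlation and the fitted slope noticeably, which is why plotting the data before trusting a summary number is worthwhile.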

## 2.7. Bubble Plots: Adding a Third Dimension to Scatter Plots

**Bubble plots** extend scatter plots by allowing visualization of a third numerical variable through varying marker (bubble) size. [Source: 136]

[*Image: Example of a bubble plot. X-axis: GDP per capita, Y-axis: Life Expectancy, Bubble Size: Population of country.*]

**Definition and Use Cases:** [Source: 137]
* X and Y coordinates represent two numerical variables (like scatter plots). [Source: 137]
* The **area** of the bubble is proportional to a third numerical variable. [Source: 138]
* Useful for visualizing relationships among three variables simultaneously on a 2D plane. [Source: 139]
* Example: Countries plotted with GDP/capita (x), life expectancy (y), and population (bubble size). [Source: 140]

**Creation:** [Source: 141]
Use Matplotlib's `ax.scatter()` or Pandas' `.plot(kind='scatter')` with the `s` parameter. [Source: 141]
* `s` parameter: Takes an array or Pandas Series whose values determine relative bubble sizes. [Source: 142]
    * **Important:** For accurate visual proportions, the third variable should map to bubble *area*, not radius. In Matplotlib, `s` is the marker area in points squared, so values passed to `s` scale the area directly. [Source: 143, 157]
* `alpha`: Useful for transparency if bubbles overlap. [Source: 144]

**Code Example (Conceptual - Extending Total Immigration Scatter Plot):** [Source: 145]
Builds on the previous total immigration scatter plot, adding a hypothetical third variable (e.g., "Economic Impact Score") for bubble size.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# --- Assuming df_total_trend DataFrame from scatter plot example is available ---
# (If not, recreate it based on the previous example)
# Recreating for self-containment:
np.random.seed(42) # Seed before any random draws so all dummy data is reproducible
years_for_plotting = [y for y in range(1980, 2014)]
# Dummy total_immigration_per_year if not carried over
total_immigration_per_year_values = np.random.randint(150000, 350000, size=len(years_for_plotting))
total_immigration_per_year = pd.Series(total_immigration_per_year_values, index=years_for_plotting)

df_total_trend = pd.DataFrame({
    'Year': total_immigration_per_year.index,
    'Total_Immigrants': total_immigration_per_year.values
})
# --- End recreation ---

# Add a dummy third variable for bubble size [Source: 146]
# Scale this variable to produce reasonably sized bubbles
# This scaling (values from 250 up to 1250) is arbitrary and needs adjustment based on data values
df_total_trend['Economic_Impact_Score'] = (np.random.rand(len(df_total_trend)) * 20 + 5) * 50 # Adjusted scaling [Source: 146]


ax_bubble = df_total_trend.plot(kind='scatter',
                                x='Year',
                                y='Total_Immigrants',
                                s=df_total_trend['Economic_Impact_Score'], # Bubble size [Source: 147]
                                alpha=0.6,  # Transparency [Source: 147, 148]
                                figsize=(12, 7),
                                c='Total_Immigrants', # Optional: color by another variable [Source: 149]
                                cmap='viridis',     # Colormap if 'c' is continuous [Source: 149]
                                edgecolors='black', # Add edgecolors [Source: 149]
                                linewidth=0.5) # [Source: 150]
ax_bubble.set_title('Total Immigration to Canada (Bubble Size: Conceptual Economic Impact)') # [Source: 150]
ax_bubble.set_xlabel('Year') # [Source: 150]
ax_bubble.set_ylabel('Total Number of Immigrants') # [Source: 150]
ax_bubble.grid(True) # [Source: 150]

# Creating a custom legend for bubble sizes can be complex. [Source: 151]
# For simplicity, add a few representative bubbles as legend entries (manual approach). [Source: 152, 153, 154]
sizes_for_legend = [df_total_trend['Economic_Impact_Score'].min(),
                    df_total_trend['Economic_Impact_Score'].median(),
                    df_total_trend['Economic_Impact_Score'].max()]
labels_for_legend = [f"{s:.0f} Impact Score" for s in sizes_for_legend]

# Empty scatter calls draw no points; they only provide correctly sized legend handles
legend_handles = []
for size, label in zip(sizes_for_legend, labels_for_legend):
    legend_handles.append(ax_bubble.scatter([], [], s=size, label=label,
                                            c='grey', alpha=0.6, edgecolors='black', linewidth=0.5))

# Attach the legend to the bubble plot's own axes rather than relying on the current axes
ax_bubble.legend(handles=legend_handles, title='Economic Impact Score (Size)',
                 scatterpoints=1, frameon=False, labelspacing=1.5, loc='lower right')
plt.tight_layout()
plt.show()

**Challenges with Bubble Plots:**
* **Occlusion:** Larger bubbles can hide smaller ones, especially in dense plots or with wide size variations. [Source: 154, 155] Transparency (`alpha`) helps but isn't always a perfect fix. [Source: 155]
* **Scaling Bubble Sizes (`s`):** The sizing variable must map to bubble *area* for accurate visual proportions. Raw values are often too small or too large to use directly, so the sizing variable usually needs rescaling first. [Source: 156, 157, 158]
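
One possible way to preprocess a sizing variable is to rescale it so the largest value gets a chosen maximum marker area while areas stay proportional to the data. This sketch assumes Matplotlib semantics, where `s` is the marker area in points squared. The helper name `area_sizes` and the `max_area` default are hypothetical choices, and the helper assumes non-negative values with a positive maximum.

```python
import numpy as np

def area_sizes(values, max_area=1500.0):
    """Scale values so the largest maps to max_area and area stays proportional to value.
    Hypothetical helper; assumes non-negative values with a positive maximum."""
    values = np.asarray(values, dtype=float)
    return values / values.max() * max_area

# Made-up values for a third variable
raw = np.array([10.0, 20.0, 40.0])

# Area tracks the value: doubling the value doubles the bubble area
s_area = area_sizes(raw)

# For contrast: if the *radius* tracked the value, the area would track value squared
s_radius = area_sizes(raw ** 2)

print("Area ratios (area ~ value):   ", s_area / s_area[0])
print("Area ratios (radius ~ value): ", s_radius / s_radius[0])
```

The second set of ratios grows much faster than the data does, which is the visual exaggeration that mapping the variable to radius would cause. The result of `area_sizes` can be passed as `s` to `ax.scatter()` or `.plot(kind='scatter')`.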

## Table: Summary of Basic Plot Types and Use Cases

[Source: 160]

| Plot Type     | Primary Use Case                                           | Key Characteristics                                               | Pandas `kind`     |
| :------------ | :--------------------------------------------------------- | :---------------------------------------------------------------- | :---------------- |
| Area Plot     | Show cumulative magnitude/proportion over a continuous axis | Filled area below line(s), good for trends & volume changes       | `area`            |
| Histogram     | Show frequency distribution of one numerical variable      | Bars represent frequency in bins (intervals), shows data shape    | `hist`            |
| Bar Chart     | Compare values across discrete categories                  | Bars represent magnitude, can be vertical or horizontal           | `bar`, `barh`     |
| Pie Chart     | Show proportion of categories making up a whole (use cautiously) | Circular, slices represent percentages (few categories ideal)   | `pie`             |
| Box Plot      | Summarize/compare distributions (median, quartiles, outliers) | Box shows IQR, whiskers show range, dots for potential outliers   | `box` or `.boxplot()`|
| Scatter Plot  | Examine relationship/correlation between 2 numerical vars  | Points show individual observations, reveals patterns/clusters      | `scatter`         |
| Bubble Plot   | Examine relationship between 3 numerical vars simultaneously | Scatter plot where marker size represents the 3rd variable        | `scatter` (with `s` arg) |

This table is a quick reference for the core purpose of each plot type. [Source: 161] Choosing the right chart is fundamental to effective data visualization. [Source: 162]
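
As a quick way to try the `kind` values from the table, the sketch below draws one small plot per row from a made-up DataFrame. The column names, values, and output filename are hypothetical, and the `Agg` backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: three numeric columns for five hypothetical items
df = pd.DataFrame(
    {"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 6], "weight": [50, 200, 400, 100, 300]},
    index=["A", "B", "C", "D", "E"],
)

fig, axes = plt.subplots(2, 4, figsize=(16, 7))
axes = axes.ravel()

df[["x", "y"]].plot(kind="area", ax=axes[0], title="area")   # stacked by default
df["y"].plot(kind="hist", bins=3, ax=axes[1], title="hist")
df["y"].plot(kind="bar", ax=axes[2], title="bar")
df["y"].plot(kind="barh", ax=axes[3], title="barh")
df["y"].plot(kind="pie", ax=axes[4], title="pie")
df[["x", "y"]].plot(kind="box", ax=axes[5], title="box")
df.plot(kind="scatter", x="x", y="y", ax=axes[6], title="scatter")
# Bubble plot: same scatter call, with marker area driven by a third column
df.plot(kind="scatter", x="x", y="y", s=df["weight"], alpha=0.6,
        ax=axes[7], title="bubble")

fig.tight_layout()
fig.savefig("plot_kinds.png")  # Hypothetical output filename
```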

***

## 2.8. Module 2 Practice Questions

Here are practice questions covering Module 2 topics.

**Section A: Multiple Choice Questions (MCQ)**

1.  Area plots are most similar to which other plot type, with the addition of a filled region?
    a)  Bar charts
    b)  Line plots
    c)  Scatter plots
    d)  Histograms

2.  By default, when plotting multiple columns with `df.plot(kind='area')`, Pandas creates what kind of area plot? [Source: 12]
    a)  Unstacked area plot
    b)  Overlapping area plot
    c)  Stacked area plot
    d)  Percentage area plot

3.  What is the primary purpose of a histogram? [Source: 24, 25]
    a)  To compare values between discrete categories.
    b)  To show the relationship between two continuous variables.
    c)  To display the frequency distribution of a single continuous numerical dataset.
    d)  To show proportions of a whole.

4.  In a histogram, what does the term "bins" refer to? [Source: 26]
    a)  The height of the bars.
    b)  The labels on the x-axis.
    c)  The intervals into which the range of a numeric variable is divided.
    d)  The total count of data points.

5.  Which scenario is most appropriate for using a horizontal bar chart (`kind='barh'`) over a vertical bar chart? [Source: 62, 63]
    a)  When comparing a very small number of categories.
    b)  When the category labels are long and numerous.
    c)  When you want to show a trend over time.
    d)  When the data values are all percentages.

6.  Pie charts are generally criticized for being ineffective when: [Source: 75]
    a)  There are only two categories to display.
    b)  Displaying exact numerical values.
    c)  There are many slices or when proportions are very similar.
    d)  The data represents a time series.

7.  Which part of a box plot represents the Interquartile Range (IQR)? [Source: 95]
    a)  The length of the whiskers.
    b)  The line indicating the median.
    c)  The height (or width, if horizontal) of the box itself.
    d)  The individual points plotted beyond the whiskers.

8.  What do the "whiskers" in a standard box plot typically extend to? [Source: 96, 97]
    a)  The absolute minimum and maximum values in the dataset.
    b)  One standard deviation from the mean.
    c)  The most extreme data points within $1.5 \times IQR$ of Q1 and Q3.
    d)  The 5th and 95th percentiles.

9.  Scatter plots are primarily used to: [Source: 116]
    a)  Show the distribution of a single variable.
    b)  Compare proportions of categories.
    c)  Examine the relationship or correlation between two numerical variables.
    d)  Display cumulative data over time.

10. In a bubble plot, what does the size of the bubble typically represent? [Source: 138]
    a)  The category of the data point.
    b)  The x-axis value.
    c)  A third numerical variable.
    d)  The density of points in that region.

11. The `alpha` parameter in plotting functions is used to control: [Source: 11, 124, 144]
    a)  The size of markers.
    b)  The color of the plot elements.
    c)  The transparency of plot elements.
    d)  The angle of rotation for labels.

12. If you want to emphasize a particular slice in a pie chart, which parameter would you use? [Source: 71]
    a)  `startangle`
    b)  `autopct`
    c)  `shadow`
    d)  `explode`

13. Which of these is a key difference between histograms and bar charts? [Source: 44, 45]
    a)  Histograms can be horizontal, but bar charts cannot.
    b)  Bar charts have gaps between bars by default; histogram bins typically do not.
    c)  Histograms are for categorical data; bar charts are for continuous data.
    d)  Bar charts always show percentages; histograms show counts.

14. A regression line on a scatter plot is used to: [Source: 125, 126]
    a)  Connect all the data points in sequence.
    b)  Visually summarize the general trend between the two variables.
    c)  Divide the data into quartiles.
    d)  Highlight outliers.

15. What is a common issue with bubble plots, especially with many data points or large variations in bubble size? [Source: 154, 155]
    a)  Inability to show negative values.
    b)  Difficulty in labeling axes.
    c)  Occlusion (larger bubbles obscuring smaller ones).
    d)  They can only represent two variables.

16. To generate a histogram from a Pandas Series `s` with 20 bins, you would use: [Source: 35, 36]
    a)  `s.plot(kind='histogram', bins=20)`
    b)  `s.plot(kind='hist', bins=20)`
    c)  `s.plot(kind='bar', bins=20)`
    d)  `s.plot(kind='distribution', num_bins=20)`

17. In a box plot, the line inside the box represents the: [Source: 92]
    a)  Mean
    b)  Mode
    c)  Median (Q2)
    d)  First Quartile (Q1)

18. For an area plot showing the contribution of different product sales to total sales over time, which type is most suitable? [Source: 14]
    a)  Unstacked area plot with high transparency.
    b)  Stacked area plot.
    c)  A series of pie charts, one for each time period.
    d)  A grouped bar chart.

19. When creating a scatter plot with Pandas `df.plot(kind='scatter', x='col_A', y='col_B', s=df['col_C'])`, what does `s=df['col_C']` achieve? [Source: 121, 142]
    a)  Sets the color of markers based on 'col_C'.
    b)  Sets the shape of markers based on 'col_C'.
    c)  Sets the size of markers based on 'col_C', creating a bubble plot.
    d)  Sorts the data by 'col_C' before plotting.

20. The `transpose()` method on a Pandas DataFrame is often used before creating area plots when: [Source: 10]
    a)  The DataFrame has too many rows.
    b)  The desired x-axis (e.g., time) is currently in columns, and categories are in rows.
    c)  You want to normalize the data.
    d)  You need to calculate cumulative sums.

**Section B: True/False Questions**

1.  Area plots are unsuitable for showing cumulative data over time. (T/F) [Source: 4, 6]
2.  Choosing too few bins in a histogram can hide important features of the data distribution. (T/F) [Source: 32]
3.  Vertical bar charts are always more readable than horizontal bar charts, regardless of label length. (T/F) [Source: 62, 63]
4.  Pie charts are highly recommended by visualization experts for comparing more than 5 categories. (T/F) [Source: 69, 75, 83]
5.  The "box" in a box plot visually represents the range between the minimum and maximum values of the dataset. (T/F) [Source: 95]
6.  Outliers in a box plot are always errors in the data and should be removed. (T/F) [Source: 98, 115] (Outliers are *potential* and require investigation)
7.  A scatter plot can help visualize the correlation between two categorical variables. (T/F) [Source: 116] (Scatter plots are for numerical variables)
8.  In a bubble plot, the `s` parameter should ideally map to the radius of the bubble for accurate visual proportion. (T/F) [Source: 143, 157] (Should map to area)
9.  A stacked area plot makes it easy to read the exact magnitude of an upper series against a stable baseline. (T/F) [Source: 22]
10. The `autopct` parameter in Pandas pie plots is used to automatically select colors for the slices. (T/F) [Source: 70] (It's for displaying percentage values)
11. Histograms typically have gaps between bars, while bar charts do not. (T/F) [Source: 44, 45] (Opposite is true)
12. Seaborn's `regplot()` function can be used to easily add a regression line to a scatter plot. (T/F) [Source: 130, 131]
13. Sorting bars in a bar chart by value generally improves its interpretability. (T/F) [Source: 64]
14. The median is always located exactly in the middle of the box in a box plot. (T/F) (Only if the distribution of the middle 50% is symmetric)
15. Occlusion is not a concern when creating bubble plots. (T/F) [Source: 154, 155]

**Section C: Short Answer / Explanation Questions**

1.  Explain what a stacked area plot is and in what scenario it is particularly useful. What is one challenge in interpreting stacked area plots? [Source: 12-14, 21, 22]
2.  Why is the choice of the number of bins crucial when creating a histogram? Describe the consequences of too few or too many bins. [Source: 31-33]
3.  When would you choose a horizontal bar chart over a vertical bar chart? [Source: 62, 63]
4.  List two major criticisms of pie charts. What is a generally recommended alternative for comparing proportions across categories? [Source: 74-76, 84]
5.  Describe all the components of a standard box plot (Median, Q1, Q3, IQR, Whiskers, Potential Outliers). [Source: 89-98]
6.  How can a scatter plot help in understanding the relationship between two numerical variables? What is a regression line in this context? [Source: 116-118, 125, 126]
7.  What is a bubble plot, and how does it extend the capability of a standard scatter plot? What is one key consideration when setting bubble sizes? [Source: 136-139, 143, 157]
8.  Explain the difference between a histogram and a bar chart, focusing on the type of data they represent and the meaning of their bars. [Source: 44, 45]
9.  If you have a DataFrame where rows are countries, columns are years (1980-2013) with immigration figures, and you want to create an area plot showing the trend of immigration for several countries over the years (years on x-axis), what data preprocessing step is likely necessary? Why? [Source: 10, 17]
10. What does the `explode` parameter do in a pie chart, and why might you use it? [Source: 71]
11. How do box plots help in identifying potential outliers? What does the $1.5 \times IQR$ rule refer to? [Source: 97, 98, 113]
12. Describe a scenario where an unstacked area plot might be preferred over a stacked one, and what visual parameter would be important to adjust. [Source: 11, 15, 23]
13. Why is sorting bars in a bar chart often a good practice? When might you choose not to sort them by value? [Source: 64, 65]
14. What information about a dataset's distribution can you glean from a box plot (e.g., symmetry, skewness)?
15. Explain the concept of "occlusion" in bubble plots and one way to mitigate it. [Source: 154, 155]

**Section D: Code Interpretation / "What's the Output?" / "Identify the Error"**

1.  **Code Snippet (Pandas):**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
s = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5, 6])
s.plot(kind='hist', bins=3, edgecolor='black')
plt.show()

Roughly describe what the resulting histogram might look like. How many bars will there be?

2.  **Identify the Plot Type:** You are given a plot showing the average rainfall for 12 different months. Each month has a bar, and the height of the bar represents the rainfall amount. What type of plot is this? [Source: 48-50]

3.  **Code Snippet (Pandas):**

In [None]:
import matplotlib.pyplot as plt

# Assumes df_sales is an existing DataFrame with columns 'Product_Category' and 'Revenue'
# We want to see total revenue per product category
df_grouped = df_sales.groupby('Product_Category')['Revenue'].sum()
df_grouped.plot(kind='pie', autopct='%1.0f%%')
plt.title('Revenue by Category')
plt.ylabel('') # Why is this line often added for pie charts? [Source: 82]
plt.show()

Explain the purpose of `plt.ylabel('')`.

4.  **Conceptual Question:** If you have a dataset of student test scores for three different schools and want to compare the distribution of scores (median, spread, outliers) for each school side-by-side, which plot type from this module would be most effective? [Source: 99, 101]

5.  **Code Snippet (Matplotlib/Pandas):**

In [None]:
# df_stocks is indexed by 'Date' and has columns 'StockA_Price' and 'StockB_Price'.
# We want to plot the trend of StockA_Price and StockB_Price over time on the same plot.
# Which of these is most likely to produce a stacked area plot of the two stocks?
# a) df_stocks[['StockA_Price', 'StockB_Price']].plot(kind='line')
# b) df_stocks[['StockA_Price', 'StockB_Price']].plot(kind='area', stacked=True)
# c) df_stocks[['StockA_Price', 'StockB_Price']].plot(kind='hist', alpha=0.5)
# d) df_stocks.plot(kind='scatter', x='StockA_Price', y='StockB_Price')

Choose the correct option. [Source: 8, 12]

**Section E: "Choose the Right Plot" Scenarios**

For each scenario, choose the most appropriate plot type from Module 2 (Area Plot, Histogram, Bar Chart, Pie Chart, Box Plot, Scatter Plot, Bubble Plot) and briefly justify your choice.

1.  You want to visualize the distribution of ages of customers visiting your store to understand different age groups.
2.  You need to show the proportion of a company's budget allocated to five different departments (Marketing, R&D, Sales, HR, Operations). The differences in allocation are quite distinct.
3.  You want to compare the annual sales figures of your company's top 10 products for the last year. The product names are quite long.
4.  You are analyzing a dataset of houses and want to see if there's a relationship between the square footage of a house and its selling price.
5.  You want to track the cumulative number of users signing up for your new app month over month for the past year.
6.  You have collected data on car models, including their horsepower (HP), fuel efficiency (MPG), and engine size (cubic inches). You want to visualize the relationship between HP and MPG, while also representing engine size.
7.  You are comparing the performance of five different email marketing campaigns by looking at their click-through rates. You want to see the median click-through rate, the spread of rates, and any unusually high or low performing campaigns (outliers) for each.

***

This set of notes and questions should give you a thorough understanding of the core visualization techniques covered in Module 2. Remember to practice generating these plots with your own datasets!