---

**Module 4: Data Grouping and Aggregation**

Grouping and aggregation are fundamental operations for summarizing data and extracting insights. They allow you to segment your data based on certain criteria and then calculate summary statistics for each segment. Pandas provides powerful and flexible tools for these tasks.


**4.1 GroupBy Operations (`.groupby()`)**

The `groupby()` operation is a cornerstone of data analysis in Pandas. It follows a "split-apply-combine" paradigm:

1.  **Splitting:** The data is split into groups based on some criteria (e.g., values in one or more columns).
2.  **Applying:** A function is applied independently to each group. This function can be an aggregation (like `sum()`, `mean()`, `count()`), a transformation (like standardizing data within a group), or a filtration.
3.  **Combining:** The results of the function applications are combined into a new data structure (usually a `Series` or `DataFrame`).
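
As a rough sketch of these three steps, the block below performs split, apply, and combine by hand and compares the result with `groupby()`. The `scores` DataFrame is made up for illustration. It also shows a transformation and a filtration, which return rows rather than one value per group.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the three steps
scores = pd.DataFrame({
    'Team': ['A', 'B', 'A', 'B', 'C'],
    'Points': [10, 12, 8, 15, 9],
})

# Split: one sub-DataFrame per distinct Team value
pieces = {team: scores[scores['Team'] == team] for team in scores['Team'].unique()}

# Apply: compute one number per piece
sums = {team: piece['Points'].sum() for team, piece in pieces.items()}

# Combine: collect the per-group results into a single Series
manual_total = pd.Series(sums, name='Points').sort_index()
manual_total.index.name = 'Team'

# groupby() performs the same three steps in one expression
grouped_total = scores.groupby('Team')['Points'].sum()
print(manual_total)
print(grouped_total)

# Transformation: the result has one value per input row (here, points minus team mean)
centered = scores['Points'] - scores.groupby('Team')['Points'].transform('mean')
print(centered)

# Filtration: keep only the rows of groups that have more than one member
multi_member = scores.groupby('Team').filter(lambda g: len(g) > 1)
print(multi_member)
```

An aggregation returns one row per group, a transformation returns one row per input row, and a filtration returns a subset of the original rows.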

* **Concept:** `groupby()` allows you to group rows of a DataFrame that share the same values in one or more specified columns. You can then perform calculations on these groups.
* **Grouping by a Single Column:**

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Creating a sample DataFrame for group operations
data_group_raw = {
    'Team': ['A', 'B', 'A', 'B', 'A', 'C', 'C', 'A', 'B'],
    'Player': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'P9'],
    'Points': [10, 12, 8, 15, 12, 9, 11, 10, 14],
    'Rebounds': [5, 7, 4, 8, 6, 5, 7, 4, 9]
}

In [3]:
df_group = pd.DataFrame(data_group_raw)
print("Original DataFrame for grouping:\n", df_group)

Original DataFrame for grouping:
   Team Player  Points  Rebounds
0    A     P1      10         5
1    B     P2      12         7
2    A     P3       8         4
3    B     P4      15         8
4    A     P5      12         6
5    C     P6       9         5
6    C     P7      11         7
7    A     P8      10         4
8    B     P9      14         9


In [4]:
# Grouping by the 'Team' column
grouped_by_team = df_group.groupby('Team')

In [5]:
# Checking the type of grouped object
print("\nType of grouped_by_team object:", type(grouped_by_team))


Type of grouped_by_team object: <class 'pandas.core.groupby.generic.DataFrameGroupBy'>


In [10]:
# Displaying first entry of each group (for illustration purposes)
print("\nGroups formed (first entry of each group):")
for team_name, group_df in grouped_by_team:
    print(f"\nTeam: {team_name}")
    print(group_df.iloc[0])  # Show first row for the team


Groups formed (first entry of each group):

Team: A
Team         A
Player      P1
Points      10
Rebounds     5
Name: 0, dtype: object

Team: B
Team         B
Player      P2
Points      12
Rebounds     7
Name: 1, dtype: object

Team: C
Team         C
Player      P6
Points       9
Rebounds     5
Name: 5, dtype: object


* `[Diagram: Visual representation of the split phase: Original DataFrame on left, arrows pointing to three smaller DataFrames on right, one for Team A, one for Team B, one for Team C.]`

* **Grouping by Multiple Columns:**

In [None]:
# Add a 'Position' column for multi-column grouping
df_group['Position'] = ['Fwd', 'Fwd', 'Ctr', 'Grd', 'Fwd', 'Ctr', 'Grd', 'Ctr', 'Grd']
print("\nDataFrame with 'Position' column added:\n", df_group)

# Group by 'Team' and then by 'Position'
grouped_by_team_position = df_group.groupby(['Team', 'Position'])

print("\nGroups formed by Team and Position (first entry of each group):")
for name_tuple, group_df in grouped_by_team_position:
    print(f"\nGroup (Team, Position): {name_tuple}")
    print(group_df.head(1))

* **Applying Aggregation Functions:**
    * Once data is grouped, you can apply various aggregation functions to summarize each group.
    * Common built-in functions: `sum()`, `mean()`, `median()`, `count()` (non-NaN values), `size()` (includes NaN, returns group size), `min()`, `max()`, `std()` (standard deviation), `var()` (variance), `first()`, `last()`, `nunique()` (number of unique values).
    * *Reference: [51]*
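
The difference between `size()` and `count()` only shows up when a group contains missing values. The following sketch uses a small made-up DataFrame (`df_missing` is not part of the module's data) to put the two side by side, along with `nunique()`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing Points values
df_missing = pd.DataFrame({
    'Team': ['A', 'A', 'B', 'B', 'B'],
    'Points': [10, np.nan, 12, 15, np.nan],
})

by_team = df_missing.groupby('Team')['Points']

summary = pd.DataFrame({
    'size': by_team.size(),        # rows per group, NaN included
    'count': by_team.count(),      # non-NaN values per group
    'nunique': by_team.nunique(),  # distinct non-NaN values per group
})
print(summary)
```

Here `size` reports 2 and 3 rows for teams A and B, while `count` reports 1 and 2 because one value in each team is missing.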

In [None]:
# Mean points per team
mean_points_team = grouped_by_team['Points'].mean() # Select 'Points' column from grouped object, then aggregate
print("\nMean points per team:\n", mean_points_team)
# Output:
# Team
# A    10.000000
# B    13.666667
# C    10.000000
# Name: Points, dtype: float64

# Total points, average rebounds, and number of players per team
agg_stats_team = grouped_by_team.agg(
    Total_Points=('Points', 'sum'),
    Avg_Rebounds=('Rebounds', 'mean'),
    Num_Players=('Player', 'count') # or 'size'
)
print("\nAggregated sum of points, mean rebounds, and player count per team:\n", agg_stats_team)
# Output:
#       Total_Points  Avg_Rebounds  Num_Players
# Team
# A              40          4.75            4
# B              41          8.00            3
# C              20          6.00            2

# Apply aggregation to the multi-level group (Team and Position)
# Sum of points and mean/std of rebounds per team AND position using the .agg() method
agg_team_pos = grouped_by_team_position.agg({
    'Points': 'sum',      # Sum of Points for each (Team, Position) group
    'Rebounds': ['mean', 'std'] # Mean and Std Dev of Rebounds for each group
})
print("\nAggregated stats by team and position (sum Points, mean/std Rebounds):\n", agg_team_pos)
# Output has hierarchical columns: (Points, sum), (Rebounds, mean), (Rebounds, std)
# std is NaN for any group that contains a single row

# Applying multiple aggregation functions to a single column
points_stats_team = grouped_by_team['Points'].agg(['sum', 'mean', 'count', 'min', 'max'])
print("\nPoints (sum, mean, count, min, max) per team:\n", points_stats_team)

* **The `.agg()` Method:** Highly flexible.
    * Can take a single function name (e.g., `'sum'`).
    * Can take a list of function names (e.g., `['sum', 'mean']`) to apply multiple functions to selected columns.
    * Can take a dictionary to apply specific functions to specific columns (e.g., `{'Points': 'sum', 'Rebounds': 'mean'}`).
    * Can use named aggregations for clearer output column names, as shown with `agg_stats_team`.
* **`.reset_index()`**: After aggregation, the grouping columns often become the index of the resulting DataFrame. `reset_index()` can convert these index levels back into regular columns, which is often useful for further processing or plotting.

In [None]:
mean_points_team_df = mean_points_team.reset_index()
print("\nMean points per team (DataFrame with Team as column):\n", mean_points_team_df)

**4.2 Pivot Tables (`.pivot_table()`)**

Pivot tables are a powerful tool for summarizing and reshaping data in a spreadsheet-like format. They allow you to aggregate data and view it from different perspectives by specifying which columns become rows, columns, and values in the resulting table.

* **`pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, margins_name='All')`**: *[53]*
    * `data`: The DataFrame to use.
    * `values`: Column(s) whose data will populate the cells of the pivot table (the values to be aggregated).
    * `index`: Column(s) whose unique values will become the rows of the pivot table.
    * `columns`: Column(s) whose unique values will become the columns of the pivot table.
    * `aggfunc`: The aggregation function(s) to apply to the `values`. Default is `'mean'`. Can be a single function, a list of functions, or a dictionary mapping columns to functions. *[53]*
    * `fill_value`: Value to replace missing values (`NaN`) in the resulting pivot table (after aggregation). *[53]*
    * `margins=True/False`: If `True`, adds row and column subtotals and a grand total (applies `aggfunc` across all). *[53]*
    * `margins_name='All'` (default): Name of the row/column that will contain the totals when `margins=True`.

In [None]:
# Using df_group from the previous section
print("\nOriginal DataFrame for pivot table:\n", df_group)

# Pivot table: Mean points for each Team (index) and Position (columns)
pivot_points_mean = pd.pivot_table(df_group,
                                   values='Points',    # Values to aggregate
                                   index='Team',       # Rows
                                   columns='Position', # Columns
                                   aggfunc='mean')     # Aggregation function
print("\nPivot table (mean points by Team and Position):\n", pivot_points_mean)
# Output:
# Position  Ctr   Fwd   Grd
# Team
# A         9.0  11.0   NaN   (NaN because Team A has no Grd players)
# B         NaN  12.0  14.5
# C         9.0   NaN  11.0

# Pivot table with sum of Points and mean/min of Rebounds, indexed by Team, for each Position
# String names are used for aggfunc; recent pandas versions warn when NumPy functions such as np.sum are passed
pivot_multi_agg_complex = pd.pivot_table(df_group,
                                         index='Team',
                                         columns='Position',
                                         values=['Points', 'Rebounds'], # Multiple value columns
                                         aggfunc={'Points': 'sum', 'Rebounds': ['mean', 'min']}) # Different agg for different values
print("\nPivot table (sum Points, mean/min Rebounds by Team and Position):\n", pivot_multi_agg_complex)
# Output will have hierarchical columns: (Points, sum), (Rebounds, mean), (Rebounds, min), each split further by Position

# Pivot table with fill_value for NaNs and margins for totals
pivot_with_fill_margins = pd.pivot_table(df_group,
                                         values='Points',
                                         index='Team',
                                         columns='Position',
                                         aggfunc='sum',       # Sum of points
                                         fill_value=0,        # Replace NaN with 0
                                         margins=True,        # Add row/column totals
                                         margins_name='TotalPoints') # Name for totals
print("\nPivot table with fill_value=0, margins=True, and custom margins_name:\n", pivot_with_fill_margins)
# Output:
# Position     Ctr  Fwd  Grd  TotalPoints
# Team
# A             18   22    0           40
# B              0   12   29           41
# C              9    0   11           20
# TotalPoints   27   34   40          101

* `[Image: A visual representation of a pivot table, showing 'Team' as rows, 'Position' as columns, and aggregated 'Points' (e.g., sum) in the cells. Include a 'Total' row and 'Total' column if margins=True.]`
* Pivot tables are extremely useful for creating summary reports and exploring multi-dimensional relationships.
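
One way to see what a pivot table computes is to rebuild it from `groupby()`. The sketch below recreates the Team, Position, and Points columns of `df_group` (under the shorter name `df`) so that it runs on its own. It then compares `pd.pivot_table()` with grouping on both keys and calling `unstack()`.

```python
import pandas as pd

# Same Team / Position / Points values as df_group, rebuilt so the block is self-contained
df = pd.DataFrame({
    'Team': ['A', 'B', 'A', 'B', 'A', 'C', 'C', 'A', 'B'],
    'Position': ['Fwd', 'Fwd', 'Ctr', 'Grd', 'Fwd', 'Ctr', 'Grd', 'Ctr', 'Grd'],
    'Points': [10, 12, 8, 15, 12, 9, 11, 10, 14],
})

via_pivot = pd.pivot_table(df, values='Points', index='Team',
                           columns='Position', aggfunc='sum', fill_value=0)

# Group on both keys, aggregate, then move 'Position' from the index to the columns
via_groupby = (df.groupby(['Team', 'Position'])['Points']
                 .sum()
                 .unstack('Position', fill_value=0))

print(via_pivot)
print(via_groupby)
```

Both tables hold the same numbers. `pivot_table()` is usually the more readable choice when the goal is a two-dimensional summary, while `groupby()` is more convenient when the result feeds further computation.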

**4.3 Cross-Tabulation (`pd.crosstab()`)**

Cross-tabulation (or contingency table) is used to compute a frequency table of two or more categorical factors. It helps in understanding the relationship and distribution between these factors.

* **`pd.crosstab(index, columns, values=None, aggfunc=None, normalize=False, margins=False, margins_name='All')`**:
    * `index`: Values to group by in the rows (can be a Series, array, or list of arrays/Series).
    * `columns`: Values to group by in the columns (can be a Series, array, or list of arrays/Series).
    * `values` (optional): Array of values to aggregate according to the factors. If not specified, computes a frequency table of the factors.
    * `aggfunc` (optional): If `values` is specified, this function is used for aggregation.
    * `normalize`: If `True` or set to `'index'`, `'columns'`, or `'all'`, normalizes the table by dividing all values by their respective sums (row sums, column sums, or grand total sum), giving proportions instead of counts.
        * `normalize=True` or `normalize='all'`: Normalize over all values.
        * `normalize='index'`: Normalize over each row.
        * `normalize='columns'`: Normalize over each column.
    * `margins=True/False`: If `True`, adds row/column totals (frequencies).
    * `margins_name='All'` (default): Name for the margin row/column.

In [None]:
# Example data for crosstab
data_crosstab_raw = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
                     'Product': ['A', 'B', 'A', 'A', 'C', 'B', 'B', 'A', 'C'],
                     'Region': ['North', 'South', 'East', 'North', 'West', 'South', 'East', 'West', 'North'],
                     'Quantity': [10, 15, 12, 8, 20, 10, 18, 14, 22]}
df_crosstab_data = pd.DataFrame(data_crosstab_raw)
print("\nOriginal DataFrame for crosstab:\n", df_crosstab_data)

# Crosstab: Frequency of Product purchased by Gender
xtab_prod_gender_freq = pd.crosstab(index=df_crosstab_data['Gender'],
                                    columns=df_crosstab_data['Product'])
print("\nCrosstab (Product by Gender - Frequencies):\n", xtab_prod_gender_freq)
# Output:
# Product  A  B  C
# Gender
# Female   2  2  0
# Male     2  1  2

# Crosstab: Frequency of Product by Gender, normalized by 'all' (proportions of grand total)
xtab_prod_gender_prop_all = pd.crosstab(df_crosstab_data['Gender'],
                                        df_crosstab_data['Product'],
                                        normalize='all')
print("\nCrosstab (Product by Gender - Proportions of total):\n", xtab_prod_gender_prop_all)
# Output: Values will be fractions (e.g., Female & A = 2/9 = 0.222...)

# Crosstab: Frequency of Product by Gender, normalized by 'index' (proportions within each Gender)
xtab_prod_gender_prop_index = pd.crosstab(df_crosstab_data['Gender'],
                                          df_crosstab_data['Product'],
                                          normalize='index')
print("\nCrosstab (Product by Gender - Proportions within each Gender):\n", xtab_prod_gender_prop_index)
# Output: For Female row, sum of proportions will be 1.0. For Male row, sum will be 1.0.

# Crosstab with a third variable (Region) in columns and margins for totals
xtab_gender_product_region_freq = pd.crosstab(index=df_crosstab_data['Gender'],
                                              columns=[df_crosstab_data['Product'], df_crosstab_data['Region']],
                                              margins=True,
                                              margins_name='TotalCount')
print("\nCrosstab (Gender by Product and Region - Frequencies with margins):\n", xtab_gender_product_region_freq)

# Crosstab with values and aggfunc: Average Quantity by Gender and Product
xtab_avg_quantity = pd.crosstab(index=df_crosstab_data['Gender'],
                                columns=df_crosstab_data['Product'],
                                values=df_crosstab_data['Quantity'],
                                aggfunc='mean') # Calculate mean quantity
print("\nCrosstab (Average Quantity by Gender and Product):\n", xtab_avg_quantity)

* Cross-tabulation is particularly useful for examining the relationship between two or more categorical variables and is often a precursor to statistical tests like the Chi-Square test of independence.
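
The `normalize='columns'` option can be checked by hand. As a small sketch, the block below rebuilds the Gender and Product values from `df_crosstab_data` as two standalone Series (`gender` and `product` are names chosen here). It then divides each count by its column total and compares the result with what `pd.crosstab()` returns.

```python
import pandas as pd

# Same Gender / Product values as df_crosstab_data, rebuilt so the block is self-contained
gender = pd.Series(['Male', 'Female', 'Male', 'Female', 'Male',
                    'Female', 'Male', 'Female', 'Male'], name='Gender')
product = pd.Series(['A', 'B', 'A', 'A', 'C', 'B', 'B', 'A', 'C'], name='Product')

counts = pd.crosstab(gender, product)
col_share = pd.crosstab(gender, product, normalize='columns')

# Dividing each count by its column total reproduces normalize='columns'
by_hand = counts / counts.sum(axis=0)

print(counts)
print(col_share.round(3))
print(by_hand.round(3))
```

Each column of `col_share` sums to 1, so a cell reads as "the share of this product's purchases made by this gender".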

---

**Module 4: Practice Questions**

61. **Concept:** What are the three main steps in a "split-apply-combine" strategy used by `groupby()`?
62. **Coding:** Given `df_group` (from module notes), write code to calculate the total 'Points' and total 'Rebounds' for each 'Team'.
63. **Coding:** Using `df_group`, group the data by 'Team' and then find the player with the maximum 'Points' within each team. (Hint: `idxmax()` might be useful after grouping, or sort and take `first()`).
64. **MCQ:** When using `groupby()`, what type of object is initially returned before an aggregation function is applied?
    * A) DataFrame
    * B) Series
    * C) DataFrameGroupBy
    * D) NumPy array
65. **`.agg()` Method:** How can you apply multiple aggregation functions (e.g., `sum` and `mean`) to the 'Points' column after grouping `df_group` by 'Team'?
66. **Function:** What is the primary purpose of `pd.pivot_table()`?
67. **`pivot_table` Parameters:** In `pd.pivot_table(data, values='A', index='B', columns='C', aggfunc='sum')`, which column's unique values will form the rows of the resulting table?
68. **Coding:** Using `df_group`, create a pivot table that shows the *maximum* 'Rebounds' for each 'Team' (as index) against each 'Position' (as columns). Fill any missing combinations with 0.
69. **`pivot_table` Parameter:** What does `margins=True` do in `pd.pivot_table()`?
70. **Function:** What is `pd.crosstab()` primarily used for?
71. **`crosstab` Parameter:** In `pd.crosstab()`, what does the `normalize` parameter achieve? List one specific value it can take (other than `True` or `False`).
72. **Coding:** Using `df_crosstab_data` (from module notes), create a cross-tabulation showing the frequency count of 'Gender' against 'Region'. Include row and column totals.
73. **Coding:** Using `df_crosstab_data`, create a cross-tabulation that shows the *sum* of 'Quantity' for each combination of 'Gender' (rows) and 'Product' (columns).
74. **Difference:** Briefly explain the difference between `groupby().size()` and `groupby().count()`.
75. **`reset_index()`:** After performing a `groupby()` and aggregation, the grouping keys often form the index. How can you convert these index levels back into regular columns?
76. **Critical Thinking:** You have sales data with columns 'Category', 'SubCategory', and 'SalesAmount'. You want to see the total 'SalesAmount' for each 'Category' and also for each 'SubCategory' within those 'Categories'. Would `groupby()` or `pivot_table()` be more direct for creating a nicely formatted summary table for this? Explain your choice.

---

**Module 5: Data Visualization with Matplotlib and Seaborn**

Data visualization is crucial for understanding data distributions, identifying patterns, relationships, and outliers, and communicating findings effectively to others. Matplotlib provides the foundational plotting capabilities, while Seaborn offers a higher-level interface for creating more statistically sophisticated and aesthetically pleasing plots with less code.

**5.1 Introduction to Matplotlib**

Matplotlib is a comprehensive and highly versatile plotting library. The `matplotlib.pyplot` module is a collection of functions that make Matplotlib work like MATLAB, providing a convenient state-based interface to create plots and visualizations. *[6]*

* **Core Concepts:**
    * **Figure:** The top-level container for all plot elements (titles, axes, legends, etc.). A single `Figure` can contain multiple `Axes` (subplots). *[6]*
    * **Axes:** An `Axes` object represents an individual plot or chart within a `Figure`. It's the region where data is actually plotted. An `Axes` has its own coordinate system and contains `Axis` objects (for x-axis, y-axis, and potentially z-axis) which handle the data limits, ticks, and tick labels. Each `Axes` can have a title, an x-label, and a y-label. *[6]*
    * **Artist:** Virtually everything you see on a Matplotlib figure is an `Artist` (e.g., `Text` objects for labels, `Line2D` objects for lines, `Patch` objects for shapes). This provides a high degree of control. *[6]*
* **Basic Plotting Workflow (using `pyplot`):**
    1.  **Import:** `import matplotlib.pyplot as plt`
    2.  **Prepare Data:** Have your data ready (e.g., in NumPy arrays or Pandas Series/DataFrames).
    3.  **(Optional but Recommended) Create Figure and Axes:**
        * `fig, ax = plt.subplots()` (for a single plot)
        * `fig, axes = plt.subplots(nrows=..., ncols=...)` (for multiple subplots)
    4.  **Plot Data:** Use `Axes` methods to plot data (e.g., `ax.plot()`, `ax.scatter()`, `ax.bar()`, `ax.hist()`). Or, use `pyplot` functions directly (e.g., `plt.plot()`), which act on the "current" axes.
    5.  **Customize:** Add titles, labels, legends, change colors, styles, etc. (e.g., `ax.set_title()`, `ax.set_xlabel()`, `ax.legend()`, `plt.title()`).
    6.  **Show Plot:** `plt.show()` (displays the plot). In Jupyter Notebooks with `%matplotlib inline`, `plt.show()` is often not strictly necessary for the plot to appear, but it's good practice, especially in scripts.
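
The workflow above can be sketched with the explicit `Figure`/`Axes` interface as follows. The sine-wave data is chosen only for illustration, and the numbered comments refer to the steps in the list.

```python
import numpy as np
import matplotlib.pyplot as plt   # 1. Import

# 2. Prepare data (values chosen only for illustration)
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# 3. Create a Figure containing a single Axes
fig, ax = plt.subplots(figsize=(8, 4))

# 4. Plot using an Axes method
ax.plot(x, y, label='sin(x)')

# 5. Customize
ax.set_title('Sine Wave')
ax.set_xlabel('x (radians)')
ax.set_ylabel('sin(x)')
ax.legend()

# 6. Show
plt.show()
```

Holding on to `fig` and `ax` makes it explicit which plot each call modifies, which matters once a figure has more than one subplot.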

**5.2 Introduction to Seaborn**

Seaborn is built on top of Matplotlib and provides a high-level interface for drawing attractive and informative statistical graphics. It integrates well with Pandas DataFrames and often simplifies the creation of complex visualizations that would require more code in Matplotlib. *[9]*

* **Key Advantages over Matplotlib alone:**
    * **Simplified Syntax:** Easier syntax for common statistical plots (e.g., creating a regression plot with confidence intervals is one line).
    * **Better Default Aesthetics:** Comes with more visually appealing default styles, themes, and color palettes. *[9]*
    * **Pandas Integration:** Works seamlessly with Pandas DataFrames, allowing you to easily map DataFrame columns to plot aesthetics like x, y, hue, size, style. *[9]*
    * **Built-in Statistical Estimation:** Many Seaborn functions can automatically perform statistical estimation (like means, medians, confidence intervals, kernel density estimates) and display them on the plot.

* **Basic Seaborn Workflow:**
    1.  **Import:** `import seaborn as sns` (and usually `import matplotlib.pyplot as plt` as well for customization or showing plots).
    2.  **Prepare Data:** Typically a Pandas DataFrame.
    3.  **(Optional) Set Theme/Style:** `sns.set_theme(style="whitegrid")` or `sns.set_style("darkgrid")`.
    4.  **Plot Data:** Use Seaborn's high-level functions (e.g., `sns.histplot()`, `sns.boxplot()`, `sns.scatterplot()`, `sns.lineplot()`, `sns.heatmap()`), often passing the DataFrame to the `data` parameter and column names as strings to `x`, `y`, `hue`, etc.
    5.  **Customize (often using Matplotlib functions):** Add titles, labels, adjust limits using `plt` or `Axes` object methods.
    6.  **Show Plot:** `plt.show()`.
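
A minimal sketch of this workflow is shown below. The `demo_df` DataFrame and its `height`, `weight`, and `group` columns are invented for the example, and the numbered comments refer to the steps in the list.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns             # 1. Import

# 2. Prepare data: a small made-up DataFrame
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'height': rng.normal(170, 10, 60),
    'group': np.repeat(['G1', 'G2'], 30),
})
demo_df['weight'] = 0.9 * demo_df['height'] - 85 + rng.normal(0, 5, 60)

# 3. Set a global theme
sns.set_theme(style='whitegrid')

# 4. Plot: the DataFrame goes to `data`, column names are passed as strings
ax = sns.scatterplot(data=demo_df, x='height', y='weight', hue='group')

# 5. Customize with Matplotlib Axes methods
ax.set_title('Weight vs. Height by Group')
ax.set_xlabel('Height')
ax.set_ylabel('Weight')

# 6. Show
plt.show()
```

`sns.scatterplot()` returns the Matplotlib `Axes` it drew on, which is why the customization step can use ordinary `Axes` methods.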

**5.3 Common Plot Types**

Let's prepare some sample data for plotting:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# Sample data for plotting demonstrations
np.random.seed(42) # for reproducibility
plot_data_numeric = np.random.randn(200) # More data points for better visuals
plot_data_numeric_positive = np.abs(np.random.randn(200) * 50 + 10) # Non-negative, right-skewed data
plot_df = pd.DataFrame({
    'value_norm': plot_data_numeric,
    'value_pos': plot_data_numeric_positive,
    'category': np.random.choice(['Alpha', 'Bravo', 'Charlie', 'Delta'], 200),
    'value2_corr': plot_data_numeric * 0.7 + np.random.randn(200) * 0.5, # Correlated with value_norm
    'group': np.random.choice(['G1', 'G2'], 200)
})
print("Sample DataFrame for plotting (first 5 rows):\n", plot_df.head())

* **Histograms:** Visualize the distribution of a single numerical variable. They divide the data range into bins and show the frequency (count) of observations falling into each bin.
    * **Matplotlib:** `ax.hist(data, bins=..., ...)` or `plt.hist(...)`
    * **Seaborn:** `sns.histplot(data=df, x='column_name', bins=..., kde=True, hue='category_column')`
        * `kde=True` overlays a Kernel Density Estimate, a smooth curve representing the estimated probability density function. *[55]*
        * **Choosing Bins:** The number of bins (or `binwidth`) significantly impacts the histogram's appearance and interpretation. Too few bins can hide important features; too many can make the plot noisy and over-emphasize minor variations. Seaborn's `histplot` often chooses a reasonable default, but you can customize with `bins` (number of bins) or `binwidth`. *[56]*
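
To get a feel for how the bin count changes the picture without drawing anything, the counts can be computed directly with `np.histogram()`. This is only a sketch on made-up normal data, and the three bin counts are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(size=200)   # made-up data, for illustration only

for n_bins in (3, 20, 100):
    # counts[i] is the number of observations between edges[i] and edges[i + 1]
    counts, edges = np.histogram(sample, bins=n_bins)
    bin_width = edges[1] - edges[0]
    print(f"bins={n_bins:>3}  width={bin_width:.2f}  "
          f"largest bin={counts.max():>3}  empty bins={(counts == 0).sum()}")
```

With very few bins almost all observations land in one or two wide bars. With very many, most bins hold only a handful of points and some are empty, which is the noisy appearance described above.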

In [None]:
plt.figure(figsize=(10, 6))
# With `hue`, bar colors come from the palette, so no single `color` is passed
ax = sns.histplot(data=plot_df, x='value_norm', bins=20, kde=True, hue='group')
plt.title('Histogram of Normally Distributed Values (Seaborn)')
plt.xlabel('Value (Normalized)')
plt.ylabel('Frequency')
# histplot draws its own legend for `hue`; a bare plt.legend() call can replace it
# with an empty one, so the existing legend is retitled instead
ax.get_legend().set_title('Group')
plt.show()
# [Image: sns_histogram_value_norm.png - A Seaborn histogram of 'value_norm' with a KDE curve. Colors for different groups if 'hue' is used. Clear title and labels.]

* **Box Plots (Box-and-Whisker Plots):** Display the distribution of numerical data, highlighting its quartiles, median, and potential outliers. Excellent for comparing distributions across different categories.
    * **Seaborn:** `sns.boxplot(data=df, x='categorical_column', y='numerical_column', hue='another_category')` or `sns.boxplot(data=df, y='numerical_column')` for a single distribution. *[57]*
    * **Interpretation:** *[59]*
        * **Box:** Represents the Interquartile Range (IQR), stretching from the first quartile (Q1, 25th percentile) to the third quartile (Q3, 75th percentile). The length of the box is IQR = Q3 - Q1.
        * **Median (Q2):** The line inside the box, representing the 50th percentile.
        * **Whiskers:** By default, each whisker extends to the most extreme data point that lies within 1.5 * IQR of the box (below Q1 for the lower whisker, above Q3 for the upper whisker). They capture the bulk of the data.
        * **Outliers:** Individual points plotted beyond the whiskers, indicating observations that are unusually far from the central part of the distribution.
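
The quantities a box plot draws can be computed directly. The sketch below assumes the common 1.5 * IQR rule and NumPy's default percentile interpolation, and uses a made-up sample (`values`) containing one unusually large observation.

```python
import numpy as np

# Made-up sample with one unusually large value
values = np.array([2, 4, 5, 6, 7, 8, 9, 10, 12, 30])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: points outside these limits are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Whiskers end at the most extreme data points that are still inside the fences
lower_whisker, upper_whisker = inside.min(), inside.max()

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences: {lower_fence} to {upper_fence}")
print(f"whiskers: {lower_whisker} to {upper_whisker}, outliers: {outliers}")
```

For this sample the box spans 5.25 to 9.75, the whiskers stop at the data points 2 and 12, and 30 is the only value drawn as an outlier.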

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(data=plot_df, x='category', y='value_pos', hue='group', palette='pastel')
plt.title('Box Plot of Positive Values by Category and Group (Seaborn)')
plt.xlabel('Category')
plt.ylabel('Value (Positive)')
plt.xticks(rotation=15)
plt.legend(title='Group')
plt.tight_layout()
plt.show()
# [Image: sns_boxplot_value_by_category_group.png - Seaborn box plots for 'value_pos' for each 'category', further split by 'group' using different colors. Clear title, labels, and legend.]

* `[Diagram: A detailed breakdown of a single box plot, labeling Q1, Median (Q2), Q3, IQR, Whiskers, and Outliers.]`

* **Scatter Plots:** Show the relationship (or lack thereof) between two numerical variables. Each point on the plot represents an observation (a row from your DataFrame).
    * **Matplotlib:** `ax.scatter(x_values, y_values, ...)` or `plt.scatter(...)`
    * **Seaborn:** `sns.scatterplot(data=df, x='column1_num', y='column2_num', hue='categorical_column', size='numerical_column_for_size', style='categorical_column_for_style')`. *[61]*
        * `hue`, `size`, and `style` parameters can add more dimensions to the plot by varying color, point size, and point marker style based on other columns.

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=plot_df, x='value_norm', y='value2_corr', hue='category', size='value_pos', style='group',
                palette='viridis', sizes=(20, 200), alpha=0.7)
plt.title('Scatter Plot of Correlated Values, Colored by Category, Sized by Positive Value, Styled by Group (Seaborn)')
plt.xlabel('Value (Normalized)')
plt.ylabel('Value 2 (Correlated)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left') # Place legend outside plot
plt.tight_layout(rect=[0, 0, 0.85, 1]) # Adjust layout to make space for legend
plt.grid(True)
plt.show()
# [Image: sns_scatterplot_value_vs_value2.png - Seaborn scatter plot showing points colored by 'category', sized by 'value_pos', and styled by 'group'. Clear title, labels, and legend.]

* **Line Plots:** Typically used to visualize trends over time or the relationship between ordered variables. Points are connected by lines.
    * **Matplotlib:** `ax.plot(x_values, y_values, ...)` or `plt.plot(...)`
    * **Seaborn:** `sns.lineplot(data=df, x='ordered_column_or_time', y='value_column', hue='category_column', style='another_category', markers=True, dashes=False)`
        * Seaborn's `lineplot` can automatically aggregate multiple y-values for the same x-value (e.g., plot the mean) and show a confidence interval around the line.

In [None]:
# Sample time series data for line plot
num_days = 30
time_data = pd.DataFrame({
    'date': pd.to_datetime(['2025-01-01'] * num_days * 2) + pd.to_timedelta(np.tile(np.arange(num_days), 2), unit='D'),
    'metric_value': np.concatenate([np.random.rand(num_days).cumsum() + 10,
                                    np.random.rand(num_days).cumsum() + 15]),
    'sensor_type': ['SensorX'] * num_days + ['SensorY'] * num_days
})

plt.figure(figsize=(12, 6))
sns.lineplot(data=time_data, x='date', y='metric_value', hue='sensor_type', style='sensor_type', markers=True, dashes=False)
plt.title('Line Plot of Metric Value Over Time by Sensor Type (Seaborn)')
plt.xlabel('Date')
plt.ylabel('Metric Value')
plt.xticks(rotation=45)
plt.legend(title='Sensor Type')
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
# [Image: sns_lineplot_time_value.png - Seaborn line plot showing 'metric_value' over 'date', with different lines/colors for 'sensor_type'. Markers on points. Clear title, labels, rotated x-ticks.]

* **Bar Plots:** Compare numerical values across different discrete categories. The height (or length) of the bar represents the magnitude of the value.
    * **Matplotlib:** `ax.bar(categories, values, ...)` (vertical) or `ax.barh(...)` (horizontal) or `plt.bar(...)`
    * **Seaborn:**
        * `sns.barplot(data=df, x='categorical_column', y='numerical_column', hue='another_category', estimator=np.mean, errorbar=('ci', 95))`: Displays an estimate of central tendency (default is mean) for a numerical variable for each category. It can also show error bars (e.g., 95% confidence interval `('ci', 95)` or standard deviation `errorbar='sd'`).
        * `sns.countplot(data=df, x='categorical_column', hue='another_category')`: Specifically for showing the counts of observations in each category (like a histogram for categorical data).

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(data=plot_df, x='category', y='value_pos', hue='group',
            estimator=np.median, errorbar='sd', palette='muted', capsize=0.1)
plt.title('Bar Plot of Median Positive Value by Category and Group (Seaborn with StdDev Error Bars)')
plt.xlabel('Category')
plt.ylabel('Median Positive Value')
plt.legend(title='Group')
plt.xticks(rotation=15)
plt.tight_layout()
plt.show()
# [Image: sns_barplot_mean_value_by_category.png - Seaborn bar plot showing median 'value_pos' for each 'category', split by 'group'. Error bars indicating std dev. Clear title, labels.]

plt.figure(figsize=(8, 5))
sns.countplot(data=plot_df, x='category', hue='group', palette='Set2')
plt.title('Count Plot of Categories by Group (Seaborn)')
plt.xlabel('Category')
plt.ylabel('Count')
plt.legend(title='Group')
plt.show()
# [Image: sns_countplot_category_group.png - Seaborn count plot showing frequencies of each 'category', with stacked or side-by-side bars for 'group'.]

* **Heatmaps:** Visualize matrix-like data where values are represented by colors. Excellent for displaying correlation matrices, confusion matrices, or pivot tables.
    * **Seaborn:** `sns.heatmap(matrix_data, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5, cbar=True)`
        * `annot=True`: Displays the data values on the cells of the heatmap.
        * `cmap`: Colormap to use (e.g., `'coolwarm'`, `'viridis'`, `'YlGnBu'`).
        * `fmt=".2f"`: String formatting code to use when `annot=True` (e.g., `".2f"` for two decimal places).
        * `linewidths`: Width of lines that will divide each cell.
        * `cbar=True` (default): Whether to draw a colorbar.
    * *Reference for cmap and other features: [63], [108]*

In [None]:
# Create a sample correlation matrix from plot_df numerical columns
correlation_matrix_sample = plot_df[['value_norm', 'value_pos', 'value2_corr']].corr()
print("\nSample Correlation Matrix:\n", correlation_matrix_sample)

plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix_sample, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5, vmin=-1, vmax=1)
plt.title('Heatmap of Correlation Matrix (Seaborn)')
plt.show()
# [Image: sns_heatmap_correlation.png - Seaborn heatmap of the correlation_matrix_sample. Values annotated in cells. A colormap like 'coolwarm' used. Title included.]
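
Heatmaps are not limited to correlation matrices. As a minimal sketch, a pivot table can be passed to `sns.heatmap` in the same way. The `sales` table and its column names (`region`, `quarter`, `revenue`) are hypothetical and the values are made up.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical sales records (made-up values for illustration)
sales = pd.DataFrame({
    'region':  ['North', 'North', 'South', 'South', 'East', 'East'],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2'],
    'revenue': [120, 150, 90, 110, 60, 95],
})

# Rows = region, columns = quarter, cells = total revenue
pivot = sales.pivot_table(index='region', columns='quarter',
                          values='revenue', aggfunc='sum')

fig, ax = plt.subplots(figsize=(6, 4))
# A sequential colormap suits non-negative magnitudes; '.0f' prints whole numbers
sns.heatmap(pivot, annot=True, fmt='.0f', cmap='YlGnBu', linewidths=.5, ax=ax)
ax.set_title('Revenue by Region and Quarter')
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards. A diverging colormap such as `'coolwarm'` fits data centered on zero (like correlations), while a sequential one such as `'YlGnBu'` is usually easier to read for totals.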

**5.4 Customizing Plots**

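Both Matplotlib and Seaborn offer extensive customization options to make your plots clear, informative, and visually appealing. Seaborn's axes-level functions (such as `sns.scatterplot`, `sns.barplot`, and `sns.heatmap`) return Matplotlib `Axes` objects, so you can use Matplotlib methods for fine-tuning. Figure-level functions (such as `sns.relplot` and `sns.catplot`) return a `FacetGrid` instead.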

* **Titles and Labels:**
    * `plt.title('My Plot Title', fontsize=16)` or `ax.set_title('My Plot Title', fontsize=16)`
    * `plt.xlabel('X-axis Label', fontsize=12)` or `ax.set_xlabel('X-axis Label', fontsize=12)`
    * `plt.ylabel('Y-axis Label', fontsize=12)` or `ax.set_ylabel('Y-axis Label', fontsize=12)`
* **Legends:**
    * `plt.legend(title='Legend Title', loc='upper right')` or `ax.legend(...)`
    * `loc` parameter controls position (e.g., `'best'`, `'upper left'`, `'center right'`).
    * For plots with `hue`, Seaborn usually adds a legend automatically.
* **Colors and Styles:**
    * **Matplotlib:** `color='red'`, `linestyle='--'` (dashed), `marker='o'` (circle markers).
    * **Seaborn:** `color='red'` for a single color. `palette='viridis'` or other Seaborn/Matplotlib colormap names when using `hue` to get a range of colors.
* **Figure Size:** `plt.figure(figsize=(width_inches, height_inches))` (call this *before* creating the plot).
* **Grid Lines:** `plt.grid(True, linestyle=':', alpha=0.7)` or `ax.grid(True, ...)`
* **Axis Limits & Ticks:**
    * `plt.xlim(min_val, max_val)` or `ax.set_xlim(...)`
    * `plt.ylim(min_val, max_val)` or `ax.set_ylim(...)`
    * `plt.xticks(ticks_list, labels_list, rotation=45)` or `ax.set_xticks(...)`, `ax.set_xticklabels(...)`
* **Seaborn Global Styles/Themes:**
    * `sns.set_theme(style='whitegrid', palette='pastel')`: Sets a global theme for all subsequent Seaborn plots. Styles include `'darkgrid'`, `'whitegrid'`, `'dark'`, `'white'`, `'ticks'`. Palettes control default color schemes.
    * `sns.set_style('ticks')`: Another way to set just the style.
    * `sns.despine()`: Removes the top and right spines from a plot by default, which reduces visual clutter. Pass `left=True` or `bottom=True` to remove those spines as well.

In [None]:
# Example of a customized plot using Seaborn and Matplotlib
sns.set_theme(style="whitegrid", palette="pastel") # Set a global theme for this example

plt.figure(figsize=(12, 7)) # Set figure size

# Create the plot using Seaborn
scatter_ax = sns.scatterplot(data=plot_df,
                             x='value_norm',
                             y='value2_corr',
                             hue='category',
                             size='value_pos',
                             style='group',
                             sizes=(30, 250), # Control range of point sizes
                             alpha=0.8,       # Point transparency
                             palette='Set1')  # Overrides the theme's 'pastel' palette for this plot

# Customize using Matplotlib methods on the returned Axes object (or plt)
scatter_ax.set_title('Highly Customized Scatter Plot', fontsize=20, fontweight='bold')
scatter_ax.set_xlabel('Normalized Feature X (units)', fontsize=14)
scatter_ax.set_ylabel('Correlated Feature Y (units)', fontsize=14)

scatter_ax.legend(title='Categories & Groups', loc='best', fontsize=10, title_fontsize=12, frameon=True, shadow=True)
scatter_ax.grid(True, linestyle='--', linewidth=0.5, color='gray') # Customize grid

# Add annotations (example)
# scatter_ax.text(x=0, y=0, s="Origin Point", fontsize=12, color='blue')

# Top and right spines are removed by default; left=True and bottom=True remove the other two as well
sns.despine(left=True, bottom=True)

plt.tight_layout() # Adjust plot to ensure everything fits
plt.show()
# [Image: sns_customized_scatter.png - A scatter plot with numerous customizations: specific theme, palette, figure size, title font, label fonts, legend styling, custom grid, despine effect.]
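
The `ax.`-style alternatives listed in the customization bullets can also be used without Seaborn. The following is a minimal sketch of that object-oriented approach in plain Matplotlib; the sine-wave data and all label text are placeholders chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data for illustration
x = np.linspace(0, 10, 50)
y = np.sin(x)

# plt.subplots returns the Figure (whole canvas) and one Axes (the plotting area)
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(x, y, color='red', linestyle='--', marker='o', markersize=3, label='sin(x)')

# Titles and labels
ax.set_title('My Plot Title', fontsize=16)
ax.set_xlabel('X-axis Label', fontsize=12)
ax.set_ylabel('Y-axis Label', fontsize=12)

# Axis limits and ticks
ax.set_xlim(0, 10)
ax.set_ylim(-1.5, 1.5)
ax.set_xticks([0, 2.5, 5, 7.5, 10])
ax.set_xticklabels(['0', '2.5', '5', '7.5', '10'], rotation=45)

# Grid and legend
ax.grid(True, linestyle=':', alpha=0.7)
ax.legend(title='Legend Title', loc='upper right')

fig.tight_layout()
```

Working through `fig` and `ax` rather than `plt` makes it explicit which plot each call modifies, which matters once a figure holds more than one `Axes`. In a script, call `plt.show()` afterwards to display the figure.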

Effective data visualization is an art and a science. It requires choosing the right plot type for your data and message, and then customizing it thoughtfully to ensure it is clear, accurate, and engaging.

---
**Module 5: Practice Questions**

77. **Core Concepts:** What is the difference between a Matplotlib `Figure` and an `Axes`?
78. **Import:** What is the conventional way to import the `matplotlib.pyplot` module?
79. **Seaborn Advantage:** List two advantages of using Seaborn over Matplotlib for statistical plotting.
80. **Plot Type:** Which type of plot would be most suitable for visualizing the distribution of a single numerical variable, showing frequencies within specified intervals?
81. **`histplot` Parameter:** In Seaborn's `sns.histplot()`, what does the `kde=True` parameter do?
82. **Box Plot Interpretation:** What does the line inside the box of a box plot represent?
83. **Plot Type:** If you want to compare the distribution of a numerical variable across several categories, which plot type is often a good choice (e.g., 'Salary' across 'Department')?
84. **Scatter Plot Use:** When is a scatter plot typically used? What can the `hue` parameter add to a scatter plot?
85. **Line Plot Use:** For what type of data are line plots most commonly used?
86. **Bar Plot vs. Count Plot:** What is the main difference between `sns.barplot()` and `sns.countplot()` in Seaborn?
87. **Heatmap Application:** Name two common applications where a heatmap would be a useful visualization.
88. **Customization:** How can you set the title for a plot created using `matplotlib.pyplot`? Write the function call.
89. **Customization:** How do you change the figure size for a Matplotlib/Seaborn plot?
90. **Seaborn Theme:** How can you set a global theme (e.g., "darkgrid") for all your Seaborn plots?
91. **Coding:** Using `plot_df` (from module notes), create a scatter plot of 'value_norm' (x-axis) vs 'value2_corr' (y-axis). Color the points based on the 'category' column. Add a title "Value Comparison" and label the axes appropriately.
92. **Coding:** Using `plot_df`, create a set of box plots showing the distribution of 'value_pos' for each 'category'.
93. **Coding:** Using `plot_df`, create a histogram for the 'value_pos' column with 25 bins and a KDE overlay.
94. **Critical Thinking:** You have a DataFrame showing monthly sales for three different product lines over the past five years. What type of plot would be most effective to visualize and compare the sales trends of these product lines over time? Explain your choice.
95. **`annot` Parameter:** What is the purpose of the `annot=True` parameter in `sns.heatmap()`?

---
