

# Chapter 3: Introduction to Data Visualization with Seaborn – Categorical Plots

In this chapter, we focus on **visualizations involving categorical variables**. The primary types of categorical plots we will cover are **count plots** and **bar plots**. 

---

## 1. Count Plots and Bar Plots

- **Count Plots** display the number of observations in each category.
- **Bar Plots** display the mean value of a quantitative variable for each category.
- Both are referred to as **categorical plots** in Seaborn.
- Categorical variables have a fixed, usually small number of possible values (categories).
- Useful for **comparing groups**.

> Example from masculinity survey: Most men report feeling "somewhat" or "very" masculine.
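
To make the difference between the two plot types concrete, here is a minimal sketch of the numbers each one draws. The DataFrame `df` and its columns `group` and `score` are made-up names for illustration, not part of the survey data.

```python
import pandas as pd

# Hypothetical toy data standing in for a survey DataFrame
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b", "c"],
    "score": [1.0, 3.0, 2.0, 4.0, 6.0, 5.0],
})

# What a count plot draws: the number of rows in each category
counts = df["group"].value_counts().sort_index()

# What a bar plot draws: the mean of the quantitative variable in each category
means = df.groupby("group")["score"].mean()

print(counts.to_dict())  # {'a': 2, 'b': 3, 'c': 1}
print(means.to_dict())   # {'a': 2.0, 'b': 4.0, 'c': 5.0}
```

A count plot needs only the categorical column, while a bar plot needs a second, quantitative column to average.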

---

## 2. `catplot()` for Categorical Plots

- `catplot()` is Seaborn's flexible function for **creating various categorical plots**.
- Works similarly to `relplot()`.
- Supports **subplots** using `col` and `row` parameters.
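
As a rough illustration of the `col` parameter, the sketch below builds a small hypothetical DataFrame (`toy`, with made-up columns `answer` and `age_group`) and asks `catplot()` for one count plot per age group.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data; the column names are made up for illustration
toy = pd.DataFrame({
    "answer": ["Yes", "No", "Yes", "Yes", "No", "Yes"],
    "age_group": ["<21", "<21", "<21", "21+", "21+", "21+"],
})

# One count plot per distinct value of "age_group", arranged in columns
g = sns.catplot(x="answer", data=toy, kind="count", col="age_group")

# The returned grid holds one row and two columns of subplots
print(g.axes.shape)  # (1, 2)
```

Add `plt.show()` (after importing `matplotlib.pyplot as plt`) to display the figure, as in the other examples in this chapter. Using `row="age_group"` instead should stack the subplots vertically.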

---

## 3. `countplot()` vs. `catplot()`

### Using `countplot()`
```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(x="how_masculine", data=masculinity_data)
plt.show()
```
![image.png](attachment:image.png)


**Output:**

```
Bar plot showing counts of each category in "how_masculine" column
```

**Explanation:**

* `import matplotlib.pyplot as plt` → Imports Matplotlib for plotting.
* `import seaborn as sns` → Imports Seaborn for statistical visualizations.
* `sns.countplot(...)` → Creates a count plot of the categorical variable `how_masculine`.
* `plt.show()` → Displays the plot.
* **Significance:** Shows how many observations fall into each category.

---

### Using `catplot()` as a Count Plot

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.catplot(x="how_masculine", data=masculinity_data, kind="count")
plt.show()
```
![image-2.png](attachment:image-2.png)
 
 
**Output:**

```
Same bar plot as countplot, showing counts per category
```

**Explanation:**

* `kind="count"` specifies the type of categorical plot.
* Functionally equivalent to `countplot()`, but allows **additional flexibility** like subplots.
* **Significance:** Provides a unified interface for different categorical plot types.
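
One practical difference between the two calls is what they return. The sketch below uses a hypothetical one-column DataFrame (`toy`) to show that `countplot()` hands back a Matplotlib Axes, while `catplot()` hands back a seaborn `FacetGrid`, which is the object that manages subplots.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data for illustration only
toy = pd.DataFrame({"answer": ["Yes", "No", "Yes", "Yes"]})

# Axes-level function: draws onto a single Matplotlib Axes and returns it
ax = sns.countplot(x="answer", data=toy)

# Figure-level function: creates its own figure and returns a FacetGrid
g = sns.catplot(x="answer", data=toy, kind="count")

print(isinstance(g, sns.FacetGrid))  # True
print(g.ax.get_xlabel())             # answer
```

Because the `FacetGrid` owns the whole figure, follow-up customization usually goes through `g` rather than through a single Axes.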

---

## 4. Changing the Order of Categories

```python
import matplotlib.pyplot as plt
import seaborn as sns

category_order = ["No answer", "Not at all", "Not very", "Somewhat", "Very"]

sns.catplot(
    x="how_masculine",
    data=masculinity_data,
    kind="count",
    order=category_order
)
plt.show()
```
![image-3.png](attachment:image-3.png)

**Output:**

```
Bar plot with categories displayed in specified logical order
```

**Explanation:**

* `category_order` → Defines the desired order of categories.
* `order=category_order` → Applies this order to the plot.
* Ensures **logical or meaningful display** of categorical data.
* **Significance:** Improves readability and interpretation.
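
Before passing a hand-written order list to a plot, it can help to compare it against the labels actually present in the data. This is a small sketch with a hypothetical stand-in Series (`responses`); the real `masculinity_data` is not reproduced here.

```python
import pandas as pd

# Hypothetical stand-in for the "how_masculine" column
responses = pd.Series(
    ["Very", "Somewhat", "Very", "Not very", "No answer", "Somewhat"]
)

category_order = ["No answer", "Not at all", "Not very", "Somewhat", "Very"]

# Labels present in the data but absent from the order list (e.g. typos)
missing_from_order = sorted(set(responses.unique()) - set(category_order))

# Labels in the order list that never occur in the data
unused_in_data = sorted(set(category_order) - set(responses.unique()))

print(missing_from_order)  # []
print(unused_in_data)      # ['Not at all']
```

An empty first list suggests every response has a place in the plot; a non-empty second list is often harmless but may point to a spelling mismatch.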

---

## 5. Bar Plots

* Display **mean of a quantitative variable per category**.
* Example: Using the `tips` dataset to show **average total bill per day**.

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.catplot(
    x="day",
    y="total_bill",
    data=tips,
    kind="bar"
)
plt.show()
```
![image-4.png](attachment:image-4.png)

**Output:**

```
Bar plot showing average total_bill for each day
Bars represent mean values; error bars show 95% confidence intervals
```

**Explanation:**

* `x="day"` → Categorical variable on x-axis.
* `y="total_bill"` → Quantitative variable on y-axis.
* `kind="bar"` → Creates a bar plot.
* **Significance:** Visualizes the mean values and allows comparison across categories.

---

## 6. Confidence Intervals

* Seaborn **automatically shows 95% confidence intervals** on bar plots.
* Indicates **uncertainty about the mean** estimate.
* Assumes data is a random sample.
* Example: Average bill per day with error bars representing CI.
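
Seaborn estimates these intervals by bootstrapping, that is, by repeatedly resampling the data. The sketch below shows the general idea with NumPy on hypothetical bill amounts; it is an illustration of a percentile bootstrap, not seaborn's exact internal code.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical bills for a single category (for example, one day of the week)
bills = np.array([10.3, 21.0, 23.7, 24.6, 25.3, 8.8, 26.9, 15.0, 14.8, 10.3])

# Resample with replacement many times and record the mean of each resample
n_boot = 1000
boot_means = np.array([
    rng.choice(bills, size=bills.size, replace=True).mean()
    for _ in range(n_boot)
])

# The middle 95% of the bootstrap means gives a percentile confidence interval
lower, upper = np.percentile(boot_means, [2.5, 97.5])

print(f"mean={bills.mean():.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

Categories with fewer observations or more spread tend to produce wider intervals, which is why error bars differ in length from bar to bar.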

---

### Turning Off Confidence Intervals

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.catplot(
    x="day",
    y="total_bill",
    data=tips,
    kind="bar",
    ci=None  # seaborn 0.12+ deprecates ci; the newer spelling is errorbar=None
)
plt.show()
```
![image-6.png](attachment:image-6.png)

**Output:**

```
Bar plot showing mean total_bill per day without confidence intervals
```

**Explanation:**

* `ci=None` → Removes confidence interval bars.
* **Significance:** Useful when CI is not needed for visualization clarity.
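
Since seaborn 0.12 the `ci` parameter has been deprecated in favor of `errorbar`, so notes written against older releases may trigger a warning. One way to keep a notebook working across versions is a small helper like the one sketched here; `no_interval_kwargs` is a hypothetical name, not a seaborn function.

```python
def no_interval_kwargs(seaborn_version):
    """Return the keyword argument that hides error bars for a seaborn version string."""
    major, minor = (int(part) for part in seaborn_version.split(".")[:2])
    if (major, minor) >= (0, 12):
        return {"errorbar": None}  # newer spelling
    return {"ci": None}            # older spelling, as used in this chapter

print(no_interval_kwargs("0.11.2"))  # {'ci': None}
print(no_interval_kwargs("0.13.2"))  # {'errorbar': None}
```

In a notebook this could be used as `sns.catplot(x="day", y="total_bill", data=tips, kind="bar", **no_interval_kwargs(sns.__version__))`.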

---

### Changing Orientation of Bars

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.catplot(
    x="total_bill",
    y="day",
    data=tips,
    kind="bar"
)
plt.show()
```
![image-7.png](attachment:image-7.png)

**Output:**

```
Horizontal bar plot showing average total_bill per day
```

**Explanation:**

* Swaps x and y for **horizontal orientation**.
* Common practice: keep categorical variable on x-axis.
* **Significance:** Orientation can improve readability depending on category labels.

---

## 7. Practice

* Apply `catplot()` to different categorical plots.
* Experiment with `count` and `bar` kinds.
* Adjust **order, orientation, and confidence intervals** to see effects.



### Exercise: Count plots

In this exercise, we'll return to exploring our dataset that contains the responses to a survey sent out to young people. We might suspect that young people spend a lot of time on the internet, but how much do they report using the internet each day? Let's use a count plot to break down the number of survey responses in each category and then explore whether it changes based on age.

As a reminder, to create a count plot, we'll use the catplot() function and specify the name of the categorical variable to count (x=____), the pandas DataFrame to use (data=____), and the type of plot (kind="count").

Seaborn has been imported as sns and matplotlib.pyplot has been imported as plt.

Instructions 1/3
Use sns.catplot() to create a count plot using the survey_data DataFrame with "Internet usage" on the x-axis.

```python
# Create count plot of internet usage
sns.catplot(x="Internet usage", data=survey_data,
            kind="count")

# Show plot
plt.show()
```
![image.png](attachment:image.png)

Instructions 2/3
Make the bars horizontal instead of vertical.
```python
# Change the orientation of the plot
sns.catplot(y="Internet usage", data=survey_data,
            kind="count")

# Show plot
plt.show()
```
![image-2.png](attachment:image-2.png)


Instructions 3/3
Separate this plot into two side-by-side column subplots based on "Age Category", which separates respondents into those that are younger than 21 vs. 21 and older.

```python
# Separate into column subplots based on age category


sns.catplot(y="Internet usage", data=survey_data,
            kind="count",
            col='Age Category')

# Show plot
plt.show()


```
![image-4.png](attachment:image-4.png)

### Exercise: Bar plots with percentages

Let's continue exploring the responses to a survey sent out to young people. The variable "Interested in Math" is True if the person reported being interested or very interested in mathematics, and False otherwise. What percentage of young people report being interested in math, and does this vary based on gender? Let's use a bar plot to find out.

As a reminder, we'll create a bar plot using the catplot() function, providing the name of the categorical variable to put on the x-axis (x=____), the name of the quantitative variable to summarize on the y-axis (y=____), the pandas DataFrame to use (data=____), and the type of categorical plot (kind="bar").

Seaborn has been imported as sns and matplotlib.pyplot has been imported as plt.

Instructions
Use the survey_data DataFrame and sns.catplot() to create a bar plot with "Gender" on the x-axis and "Interested in Math" on the y-axis.
```python
# Create a bar plot of interest in math, separated by gender
sns.catplot(x="Gender", y="Interested in Math",
            data=survey_data, kind="bar")

# Show plot
plt.show()
```
![image.png](attachment:image.png)
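
The bar heights in this exercise can be read as percentages because the mean of a True/False column is the share of True values. A minimal sketch with hypothetical responses (`toy_survey` is made up; the real `survey_data` is not reproduced here):

```python
import pandas as pd

# Hypothetical responses using the same column names as the exercise
toy_survey = pd.DataFrame({
    "Gender": ["female"] * 4 + ["male"] * 4,
    "Interested in Math": [True, False, False, False, True, True, False, False],
})

# True counts as 1 and False as 0, so the mean is the proportion of True values
share_interested = toy_survey.groupby("Gender")["Interested in Math"].mean()

print(share_interested.to_dict())  # {'female': 0.25, 'male': 0.5}
```

Multiplying these proportions by 100 gives the percentage of each group that reported being interested.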


### Exercise: Customizing bar plots

In this exercise, we'll explore data from students in secondary school. The "study_time" variable records each student's reported weekly study time as one of the following categories: "<2 hours", "2 to 5 hours", "5 to 10 hours", or ">10 hours". Do students who report higher amounts of studying tend to get better final grades? Let's compare the average final grade among students in each category using a bar plot.

Seaborn has been imported as sns and matplotlib.pyplot has been imported as plt.

Instructions 1/3
Use sns.catplot() to create a bar plot with "study_time" on the x-axis and final grade ("G3") on the y-axis, using the student_data DataFrame.

```python
# Create bar plot of average final grade in each study category
sns.catplot(x='study_time', y='G3', data=student_data, kind='bar')

# Show plot
plt.show()
```
![image.png](attachment:image.png)

Instructions 2/3

Using the order parameter and the category_order list that is provided, rearrange the bars so that they are in order from lowest study time to highest.

```python
# List of categories from lowest to highest
category_order = ["<2 hours", 
                  "2 to 5 hours", 
                  "5 to 10 hours", 
                  ">10 hours"]

# Rearrange the categories
sns.catplot(x="study_time", y="G3",
            data=student_data,
            kind="bar", order=category_order)

# Show plot
plt.show()
```
![image-2.png](attachment:image-2.png)

Instructions 3/3
Update the plot so that it no longer displays confidence intervals.

```python
# List of categories from lowest to highest
category_order = ["<2 hours", 
                  "2 to 5 hours", 
                  "5 to 10 hours", 
                  ">10 hours"]

# Turn off the confidence intervals
sns.catplot(x="study_time", y="G3",
            data=student_data,
            kind="bar",
            order=category_order,
            ci=None)

# Show plot
plt.show()
```
![image-3.png](attachment:image-3.png)

# Introduction to Data Visualization with Seaborn: Creating a Box Plot

Source: Waskom, M. L. (2021). seaborn: statistical data visualization. https://seaborn.pydata.org/

---

## 1) Creating a box plot

- Goal: Learn how to create a categorical box plot in Seaborn.
- Context: Box plots are a type of categorical plot showing the distribution of a quantitative variable across categories.

---

## 2) What is a box plot?

- Purpose:
  - Shows the distribution of quantitative data.
  - Highlights median, spread, skewness, and outliers.
  - Facilitates comparison between groups (categories).

- Key elements:
  - Box: Interquartile range (IQR), from the 25th percentile (Q1) to the 75th percentile (Q3).
  - Median: Line inside the box (50th percentile).
  - Whiskers: Extend beyond the box to indicate spread (by default up to 1.5 × IQR; configurable).
  - Outliers: Points beyond the whiskers.

- Example interpretation (tips dataset; total_bill across days):
  - Median total bill is higher on Saturday and Sunday.
  - Spread is larger on weekend days.
  - Box plots make such comparisons fast and clear.
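
The key elements listed above can be computed by hand for a single category. The sketch below does so with NumPy on hypothetical values; it follows the usual 1.5 × IQR rule and is meant as an illustration, not as seaborn's internal code.

```python
import numpy as np

# Hypothetical values for one category
values = np.array([3.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 30.0])

# Box: first quartile, median, and third quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme observations within 1.5 * IQR of the box
low_limit, high_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inside = values[(values >= low_limit) & (values <= high_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the whiskers is drawn as an individual outlier point
outliers = values[(values < low_limit) | (values > high_limit)]

print(q1, median, q3)             # 8.0 10.0 12.0
print(whisker_low, whisker_high)  # 3.0 13.0
print(outliers)                   # [30.]
```

Here the value 30 lies more than 1.5 × IQR above Q3, so it would appear as a separate point above the upper whisker.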

---

## 3) How to create a box plot

Below we use seaborn.catplot with kind="box" to create a box plot of total_bill by time (Lunch vs Dinner) from the tips dataset.

### Code

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Load example dataset
tips = sns.load_dataset("tips")

# Create a box plot using catplot
g = sns.catplot(
    x="time",          # categorical variable on x-axis
    y="total_bill",    # quantitative variable on y-axis
    data=tips,         # dataset
    kind="box"         # specify box plot
)

plt.show()
```
![image.png](attachment:image.png)

### Expected Output (visual description)

- A figure with two box plots side-by-side:
  - x-axis categories: Lunch, Dinner.
  - y-axis: total_bill (continuous).
  - Each box shows:
    - Median line.
    - IQR box.
    - Whiskers at default 1.5 × IQR from Q1/Q3 (capped at data range).
    - Outlier points beyond whiskers.

### Line-by-line explanation

- `import matplotlib.pyplot as plt`
  - What: Imports Matplotlib’s plotting module.
  - Why: Needed to control figure display via plt.show() and other Matplotlib functions.
  - Result: plt namespace available.

- `import seaborn as sns`
  - What: Imports Seaborn.
  - Why: Provides high-level statistical plotting functions like catplot.
  - Result: sns namespace available.

- `tips = sns.load_dataset("tips")`
  - What: Loads Seaborn’s built-in “tips” dataset into a pandas DataFrame.
  - Why: Provides example data with variables like total_bill, time, day, etc.
  - Result: tips contains the dataset used for plotting.

- `g = sns.catplot(x="time", y="total_bill", data=tips, kind="box")`
  - What: Creates a categorical plot with box plots of total_bill for each time category.
  - Why: catplot is a figure-level function that can easily add facets (row/col) if needed; kind="box" selects a box plot.
  - Result: g is a FacetGrid object that holds the figure; the plot is drawn but not yet shown.

- `plt.show()`
  - What: Displays the plot.
  - Why: Ensures the figure is rendered in environments where auto-display is not enabled.
  - Result: The box plot appears.

### Significance of the output

- You get a clear comparison of the distribution of total bills between Lunch and Dinner:
  - Median differences suggest central tendency differences between groups.
  - IQR and whiskers reveal variability and potential skew.
  - Outliers show unusually high or low bills.

---

## 4) Change the order of categories

- You can control the order of categorical levels on the axis for better storytelling or consistency.

### Code

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

g = sns.catplot(
    x="time",
    y="total_bill",
    data=tips,
    kind="box",
    order=["Dinner", "Lunch"]  # force Dinner to appear before Lunch
)

plt.show()
```
![image-2.png](attachment:image-2.png)

### Expected Output (visual description)

- Same box plots as before but the x-axis order is now: Dinner, then Lunch.

### Line-by-line explanation

- `order=["Dinner", "Lunch"]`
  - What: Explicit category ordering for the x-axis.
  - Why: The default order may be alphabetical or data-driven; this allows narrative or conventional ordering (e.g., evening before midday).
  - Result: Axis categories appear in the specified order.

### Significance

- Ordering affects readability and interpretability (e.g., chronological order, magnitude order), which can reduce cognitive load for the audience.

---

## 5) Omitting the outliers using sym

- You may want to hide outlier markers to focus on central distribution or avoid clutter.

### Code

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

g = sns.catplot(
    x="time",
    y="total_bill",
    data=tips,
    kind="box",
    sym=""  # empty string hides outlier markers; if your seaborn version rejects sym, try fliersize=0
)

plt.show()
```
![image-3.png](attachment:image-3.png)

### Expected Output (visual description)

- Box plots for Lunch and Dinner without any individual outlier points drawn.
- The box and whiskers remain; points beyond whiskers are not shown.

### Line-by-line explanation

- `sym=""`
  - What: Sets the marker style for outliers.
  - Why: In Seaborn’s box plot (Matplotlib under the hood), an empty string tells it not to draw the outlier markers.
  - Result: Outlier points are omitted, making the plot cleaner.
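
If `sym` is not accepted by the seaborn version you have installed, Matplotlib's `showfliers=False` option is one alternative that seaborn forwards to the underlying box plot. The sketch below assumes that forwarding and uses a hypothetical DataFrame (`toy_tips`) rather than the real tips data.

```python
import pandas as pd
import seaborn as sns

# Hypothetical bills with one unusually large value per group
toy_tips = pd.DataFrame({
    "time": ["Lunch"] * 6 + ["Dinner"] * 6,
    "total_bill": [10, 11, 12, 13, 14, 40, 15, 17, 19, 21, 23, 70],
})

# showfliers=False is passed through to Matplotlib, which skips the outlier markers
g = sns.catplot(
    x="time",
    y="total_bill",
    data=toy_tips,
    kind="box",
    showfliers=False,
)

print(g.ax.get_xlabel(), g.ax.get_ylabel())  # time total_bill
```

Add `plt.show()` to display the result. The boxes and whiskers should look the same as with the markers visible; only the individual points are left out.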

### Significance

- Useful when outliers distract from the main story or when a dense plot would be cluttered with many points.
- Note: Hiding the markers does not change the data, the box, or the whiskers; the outlying observations are simply not drawn.

---

## 6) Changing the whiskers using whis

- Default whiskers: extend to 1.5 × IQR beyond Q1 and Q3 (bounded by min/max data).
- You can redefine whiskers:
  - Scalar: e.g., whis=2.0 → whiskers at 2.0 × IQR.
  - Percentiles: e.g., whis=[5, 95] → whiskers at 5th and 95th percentiles.
  - Min/Max: whis=[0, 100] → whiskers at min and max values.
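
The three forms of `whis` can be compared numerically. The helper below, `whisker_limits`, is a hypothetical function written for this note; it mirrors the common convention (as in Matplotlib, to my understanding) that whisker ends snap to the most extreme observations inside the computed limits.

```python
import numpy as np

def whisker_limits(values, whis=1.5):
    """Return (low, high) whisker ends for a 1-D sequence under a `whis` setting."""
    values = np.asarray(values, dtype=float)
    if np.isscalar(whis):
        # Scalar: limits lie whis * IQR beyond the quartiles
        q1, q3 = np.percentile(values, [25, 75])
        reach = whis * (q3 - q1)
        low_limit, high_limit = q1 - reach, q3 + reach
    else:
        # Pair: limits are the given lower and upper percentiles
        low_limit, high_limit = np.percentile(values, whis)
    # Whiskers end at the most extreme observations inside the limits
    inside = values[(values >= low_limit) & (values <= high_limit)]
    return inside.min(), inside.max()

data = [3.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 30.0]
default_ends = whisker_limits(data)           # 1.5 * IQR: ends at 3.0 and 13.0
pct_ends = whisker_limits(data, [5, 95])      # percentiles: ends at 7.0 and 13.0
full_ends = whisker_limits(data, [0, 100])    # min and max: ends at 3.0 and 30.0
print(default_ends, pct_ends, full_ends)
```

With `[0, 100]` every observation falls inside the whiskers, which is why no outlier points remain in that setting.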

### Code example A: 2.0 × IQR

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

g = sns.catplot(
    x="time",
    y="total_bill",
    data=tips,
    kind="box",
    whis=2.0  # extend whiskers to 2.0 × IQR
)

plt.show()
```
![image-4.png](attachment:image-4.png)

### Expected Output (visual description)

- Similar box plots, but whiskers reach further compared to 1.5 × IQR, reducing the number of points flagged as outliers.

### Line-by-line explanation

- `whis=2.0`
  - What: Sets whisker reach to 2.0 × IQR.
  - Why: Looser definition of outliers; fewer points beyond whiskers get marked as outliers.
  - Result: Whiskers extend farther; fewer outlier points.

---

### Code example B: Percentile-based whiskers [5, 95]

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

g = sns.catplot(
    x="time",
    y="total_bill",
    data=tips,
    kind="box",
    whis=[5, 95]  # whiskers at 5th and 95th percentiles
)

plt.show()
```
![image-5.png](attachment:image-5.png)

### Expected Output (visual description)

- Whiskers drawn exactly at the 5th and 95th percentiles.
- Points below 5th or above 95th percentiles appear as outliers.

### Line-by-line explanation

- `whis=[5, 95]`
  - What: Specifies lower and upper percentiles for whiskers.
  - Why: Percentile whiskers provide a fixed coverage (90% of data inside whiskers).
  - Result: Consistent coverage of the central 90% of data.

---

### Code example C: Min/Max whiskers [0, 100]

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

g = sns.catplot(
    x="time",
    y="total_bill",
    data=tips,
    kind="box",
    whis=[0, 100]  # whiskers at min and max values
)

plt.show()
```
![image-6.png](attachment:image-6.png)

### Expected Output (visual description)

- Whiskers extend to the minimum and maximum observed total_bill within each category.
- There are no outlier points, since whiskers cover the entire range.

### Line-by-line explanation

- `whis=[0, 100]`
  - What: Sets whiskers to the 0th and 100th percentiles (min and max).
  - Why: To show the full range explicitly and suppress outlier markers.
  - Result: No outlier markers; full range is enclosed by whiskers.

### Significance

- Changing whis alters what is considered an “outlier” and how much of the tails are captured by whiskers.
- Choose based on analysis goals:
  - Robust to extremes (1.5 × IQR).
  - Controlled coverage ([5,95]).
  - Full range ([0,100]).

---

## 7) Example recap: whiskers at min and max (no outliers)

- When `whis=[0, 100]`, whiskers span the entire data range; thus, outlier markers do not appear.
- This can be useful when you want a compact summary without emphasizing extreme points as outliers.

---

## 8) Let’s practice!

- Try:
  - Faceting with col or row in catplot to compare distributions across multiple categorical dimensions.
  - Combining order with whis to standardize presentation across figures.
  - Toggling sym to see how highlighting/hiding outliers affects interpretation.

---

## Appendix: Notes on Seaborn and catplot

- Seaborn is a Python data visualization library based on Matplotlib providing a high-level interface for informative statistical graphics.
- catplot is a figure-level function that supports multiple plot kinds (box, violin, swarm, etc.) and easy faceting (row/col).
- For more, see Seaborn’s tutorials and API reference at https://seaborn.pydata.org/.

### Exercise: Create and interpret a box plot

Let's continue using the student_data dataset. In an earlier exercise, we explored the relationship between studying and final grade by using a bar plot to compare the average final grade ("G3") among students in different categories of "study_time".

In this exercise, we'll try using a box plot to look at this relationship instead. As a reminder, to create a box plot you'll need to use the catplot() function and specify the name of the categorical variable to put on the x-axis (x=____), the name of the quantitative variable to summarize on the y-axis (y=____), the pandas DataFrame to use (data=____), and the type of plot (kind="box").

We have already imported matplotlib.pyplot as plt and seaborn as sns.

Instructions 1/2

Use sns.catplot() and the student_data DataFrame to create a box plot with "study_time" on the x-axis and "G3" on the y-axis. Set the ordering of the categories to study_time_order.

```python
# Specify the category ordering
study_time_order = ["<2 hours", "2 to 5 hours", 
                    "5 to 10 hours", ">10 hours"]

# Create a box plot and set the order of the categories
sns.catplot(x="study_time", y="G3", data=student_data,
            kind="box", order=study_time_order)

# Show plot
plt.show()

```
![image.png](attachment:image.png)


**Question:** Which of the following is a correct interpretation of this box plot?

Possible answers:

- The 75th percentile of grades is highest among students who study more than 10 hours a week.
- There are no outliers plotted for these box plots.
- The 5th percentile of grades among students studying less than 2 hours is 5.0.
- The median grade among students studying less than 2 hours is 10.0.

**Answer:** The last option: the median grade among students studying less than 2 hours is 10.0.


### Exercise: Omitting outliers

Now let's use the student_data dataset to compare the distribution of final grades ("G3") between students who have internet access at home and those who don't. To do this, we'll use the "internet" variable, which is a binary (yes/no) indicator of whether the student has internet access at home.

Since internet may be less accessible in rural areas, we'll add subgroups based on where the student lives. For this, we can use the "location" variable, which is an indicator of whether a student lives in an urban ("Urban") or rural ("Rural") location.

Seaborn has already been imported as sns and matplotlib.pyplot has been imported as plt. As a reminder, you can omit outliers in box plots by setting the sym parameter equal to an empty string ("").

Instructions

- Use sns.catplot() to create a box plot with the student_data DataFrame, putting "internet" on the x-axis and "G3" on the y-axis.
- Add subgroups so each box plot is colored based on "location".
- Do not display the outliers.



```python
# Create a box plot with subgroups and omit the outliers

sns.catplot(x="internet", y="G3", data=student_data,
            kind="box", hue="location", sym="")


# Show plot
plt.show()

```
![image-2.png](attachment:image-2.png)

### Exercise: Adjusting the whiskers

In the lesson we saw that there are multiple ways to define the whiskers in a box plot. In this set of exercises, we'll continue to use the student_data dataset to compare the distribution of final grades ("G3") between students who are in a romantic relationship and those that are not. We'll use the "romantic" variable, which is a yes/no indicator of whether the student is in a romantic relationship.

Let's create a box plot to look at this relationship and try different ways to define the whiskers.

We've already imported Seaborn as sns and matplotlib.pyplot as plt.

Instructions 1/3

Adjust the code to make the box plot whiskers extend to 0.5 * IQR. Recall: the IQR is the interquartile range.

```python
# Set the whiskers to 0.5 * IQR
sns.catplot(x="romantic", y="G3",
            data=student_data,
            kind="box",
            whis=0.5)

# Show plot
plt.show()

```
![image.png](attachment:image.png)


Instructions 2/3

Change the code to set the whiskers to extend to the 5th and 95th percentiles.



```python
# Extend the whiskers to the 5th and 95th percentile
sns.catplot(x="romantic", y="G3",
            data=student_data,
            kind="box",
            whis=[5, 95]
)

# Show plot
plt.show()
```
![image-2.png](attachment:image-2.png)

Instructions 3/3

Change the code to set the whiskers to extend to the min and max values.


```python
# Set the whiskers at the min and max values
sns.catplot(x="romantic", y="G3",
            data=student_data,
            kind="box",
            whis=[0, 100])

# Show plot
plt.show()

```
![image-3.png](attachment:image-3.png)

# Introduction to Data Visualization with Seaborn: Point Plots

Source: seaborn: statistical data visualization, seaborn 0.13.2 documentation (https://seaborn.pydata.org/).

---

## 1) Point plots

- We’ve seen several categorical plots (count, bar, box). This lesson introduces point plots, another categorical plot type in Seaborn.

---

## 2) What are point plots?

- What they show:
  - A single point per category representing the mean of a quantitative variable.
  - Vertical error bars representing 95% confidence intervals (CIs) for that mean.
- Interpretation:
  - CIs communicate uncertainty around the mean estimate.
  - Under random sampling, we can be 95% confident the population mean lies within the CI for each group.
- Example context:
  - Tips dataset: average total bill among smokers vs. non-smokers, with 95% CIs.

---

## 3) Point plots vs. line plots

- Similarities:
  - Both display means and 95% CIs.
- Key difference:
  - Line plots are relational: both x and y axes are quantitative (often time on x).
  - Point plots are categorical: one axis (usually x) is categorical, so it’s a categorical plot.

---

## 4) Point plots vs. bar plots

- Similarities:
  - Both show the mean and 95% CIs per category (and subcategory via hue).
- When point plots can be preferable:
  - Easier to compare subgroup points stacked vertically.
  - Easier to assess differences in “slope” between categories when points are connected (or to compare within-category levels when disconnected).

---
![image.png](attachment:image.png)


## 5) Creating a point plot

We’ll create a point plot using Seaborn’s figure-level function catplot with kind="point". Example uses a “masculinity” survey dataset (masculinity_data) with age groups on x, a response variable on y, and subgroups by a hue variable.

### Code

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Example dataset: replace with your DataFrame 'masculinity_data'
# Expected columns:
# - "age": categorical age group (e.g., '18-29', '30-44', ...)
# - "masculinity_important": numeric response (e.g., percent or score)
# - "feel_masculine": categorical subgroup (e.g., 'Yes'/'No')

# Create the point plot
sns.catplot(
    x="age",
    y="masculinity_important",
    data=masculinity_data,
    hue="feel_masculine",
    kind="point"
)

plt.show()
```
![image-2.png](attachment:image-2.png)

### Expected Output (visual description)

- A figure with:
  - x-axis: age groups (categorical).
  - y-axis: average of masculinity_important.
  - Points for each hue subgroup per age.
  - Lines connecting points across age (by hue) and vertical error bars (≈95% CIs).

### Line-by-line explanation

- `import matplotlib.pyplot as plt`
  - What: Imports Matplotlib’s plotting interface.
  - Why: Needed for rendering and controlling display with plt.show().
  - Result: plt is available to show the figure.

- `import seaborn as sns`
  - What: Imports Seaborn.
  - Why: Provides high-level statistical plotting functions (catplot, point plot).
  - Result: sns is available.

- `sns.catplot(..., kind="point")`
  - What: Creates a categorical plot showing point estimates per category.
  - Why: catplot is a figure-level function; kind="point" selects a point plot and enables easy faceting if needed later.
  - Parameters:
    - `x="age"`: categorical axis.
    - `y="masculinity_important"`: quantitative variable to summarize (mean).
    - `data=masculinity_data`: DataFrame source.
    - `hue="feel_masculine"`: subgroup split within each age category.
  - Result: Generates a figure with points, connecting lines, and 95% CIs.

- `plt.show()`
  - What: Displays the plot.
  - Why: Ensures rendering in environments that don’t auto-display.
  - Result: Figure appears.

### Significance of the output

- Summarizes group means with uncertainty, enabling:
  - Within-age comparisons of subgroups (hue).
  - Across-age trends for each subgroup (via connecting lines).
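
The points in such a plot are group means, one per combination of x category and hue level. The sketch below computes them with pandas for a hypothetical DataFrame (`toy`) that reuses the column names assumed above; the values are invented.

```python
import pandas as pd

# Hypothetical data using the column names assumed in the example above
toy = pd.DataFrame({
    "age": ["18-34"] * 4 + ["35-64"] * 4,
    "feel_masculine": ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No"],
    "masculinity_important": [4, 2, 3, 1, 5, 3, 2, 2],
})

# One mean per (age, hue) pair: rows are x positions, columns are the connected lines
point_means = (
    toy.groupby(["age", "feel_masculine"])["masculinity_important"]
    .mean()
    .unstack("feel_masculine")
)

print(point_means)
```

Reading down a column of `point_means` follows one line across age groups; reading across a row compares the hue subgroups within a single age group.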

---

## 6) Disconnecting the points

You can remove the connecting lines when you want to emphasize within-category comparisons without implying continuity across categories.

### Code

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.catplot(
    x="age",
    y="masculinity_important",
    data=masculinity_data,
    hue="feel_masculine",
    kind="point",
    join=False  # do not draw lines between category points (seaborn 0.13+ suggests linestyle="none" instead)
)

plt.show()
```
![image-3.png](attachment:image-3.png)

### Expected Output (visual description)

- Same points and CIs as before, but no lines connecting points across age groups.

### Line-by-line explanation

- `join=False`
  - What: Disables line segments connecting points within each hue level.
  - Why: Avoids suggesting a continuous relationship across categorical x levels; focuses on per-category comparisons.
  - Result: Cleaner, category-focused visualization.

### Significance

- Reduces visual implication of trends across nominal categories.
- Useful when x-axis is unordered or when slopes could be misleading.

---

## 7) Displaying the mean (default) on tips: smokers vs. non-smokers

By default, point plots show the mean and its 95% CI.

### Code

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Load example dataset
tips = sns.load_dataset("tips")

sns.catplot(
    x="smoker",
    y="total_bill",
    data=tips,
    kind="point"
)

plt.show()
```
![image-4.png](attachment:image-4.png)

### Expected Output (visual description)

- Two points (No, Yes) for smoker status on x.
- y-axis: mean total_bill.
- Vertical 95% CI bars; a line connecting the two points.

### Line-by-line explanation

- `tips = sns.load_dataset("tips")`
  - What: Loads Seaborn’s built-in tips dataset into a pandas DataFrame.
  - Why: Provides example data to demonstrate the plot.
  - Result: tips DataFrame with columns including smoker and total_bill.

- `sns.catplot(x="smoker", y="total_bill", data=tips, kind="point")`
  - What: Plots mean total_bill for smokers vs. non-smokers with 95% CIs.
  - Why: Illustrates default behavior of point plot (mean + CI).
  - Result: Figure depicting group means and uncertainty.

- `plt.show()`
  - What/Why/Result: Displays the figure.

### Significance

- Quickly compares average spending between smokers and non-smokers, with uncertainty intervals to avoid overinterpretation of small differences.

---

## 8) Displaying the median instead of the mean

You can change the summary statistic using the estimator parameter. Median is more robust to outliers.

### Code

```python
import matplotlib.pyplot as plt
import seaborn as sns
from numpy import median

tips = sns.load_dataset("tips")

sns.catplot(
    x="smoker",
    y="total_bill",
    data=tips,
    kind="point",
    estimator=median  # use median instead of mean
)

plt.show()
```
![image-5.png](attachment:image-5.png)

### Expected Output (visual description)

- Similar layout, but points represent the median total_bill per group.
- Error bars represent an uncertainty interval around the median (Seaborn computes intervals via bootstrapping by default).

### Line-by-line explanation

- `from numpy import median`
  - What: Imports the median function.
  - Why: To pass as the estimator to compute per-group summaries.
  - Result: median callable available.

- `estimator=median`
  - What: Replaces default mean with median as the point estimate.
  - Why: Median reduces sensitivity to extreme values and skew.
  - Result: Points reflect central tendency robust to outliers.

### Significance

- Choose median when distributions are skewed or contain outliers; it may better represent a “typical” value than the mean.
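
A small numeric sketch of that robustness, using invented bill amounts and the standard library's `statistics` module (chosen here only to keep the example dependency-free; the plotting code above uses NumPy's `median`):

```python
from statistics import mean, median

# Hypothetical bills; the second list adds one very large bill
bills = [12.0, 14.0, 15.0, 16.0, 18.0]
bills_with_outlier = bills + [120.0]

print(mean(bills), median(bills))                            # 15.0 15.0
print(mean(bills_with_outlier), median(bills_with_outlier))  # 32.5 15.5
```

A single extreme bill more than doubles the mean but moves the median by only 0.5, which is the behavior the `estimator=median` option takes advantage of.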

---

## 9) Customizing the confidence intervals: capsize

Add “caps” at the ends of CI bars to improve readability.

### Code

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

sns.catplot(
    x="smoker",
    y="total_bill",
    data=tips,
    kind="point",
    capsize=0.2  # width of the end caps on CI bars
)

plt.show()
```
![image-6.png](attachment:image-6.png)

### Expected Output (visual description)

- Same mean points and CI bars as default, but with small horizontal caps at the ends of the error bars.

### Line-by-line explanation

- `capsize=0.2`
  - What: Sets the width of the caps, relative to the spacing between categories.
  - Why: Enhances visual clarity of CI endpoints.
  - Result: Error bars have small horizontal ticks at their ends.

### Significance

- Caps can make CI endpoints easier to see, especially in dense or small figures.

---

## 10) Turning off confidence intervals

Disable the CI display to declutter the plot, or when uncertainty is reported elsewhere.

### Code

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")

sns.catplot(
    x="smoker",
    y="total_bill",
    data=tips,
    kind="point",
    ci=None  # turn off confidence intervals
)

plt.show()
```
![image-7.png](attachment:image-7.png)

### Expected Output (visual description)

- Points (and connecting line) without vertical error bars.

### Line-by-line explanation

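- `ci=None`
  - What: Disables computation and drawing of confidence intervals.
  - Why: Reduces visual clutter, speeds up plotting (no bootstrapping), or suits cases where uncertainty is addressed elsewhere.
  - Result: Only point estimates are shown.
  - Note: In Seaborn 0.12 and later, `ci` is deprecated in favor of `errorbar`; `errorbar=None` has the same effect there.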

### Significance

- Useful in dashboards or small multiples where many overlays would overwhelm the viewer, or where CIs are unnecessary.
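
If the same notebook has to run on both older and newer Seaborn releases, one option is to choose the keyword argument from the installed version. The sketch below assumes that `errorbar` replaced `ci` in Seaborn 0.12. The helper name `interval_off_kwargs` is hypothetical and not part of Seaborn.

```python
def interval_off_kwargs(seaborn_version):
    """Return the keyword that turns off intervals for a given Seaborn version.

    Illustrative helper, assuming `errorbar` replaced `ci` in Seaborn 0.12.
    """
    # Compare only the major and minor parts, e.g. "0.13.2" -> (0, 13).
    major, minor = (int(part) for part in seaborn_version.split(".")[:2])
    if (major, minor) >= (0, 12):
        return {"errorbar": None}
    return {"ci": None}


print(interval_off_kwargs("0.11.2"))
print(interval_off_kwargs("0.13.2"))
```

In a plotting call, the result can be unpacked as `sns.catplot(x="smoker", y="total_bill", data=tips, kind="point", **interval_off_kwargs(sns.__version__))`.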

---

## 11) Let’s practice!

Try the following to reinforce learning:
- Use `hue` and `join=False` to explore subgroup differences without implying trends.
- Switch `estimator` between mean and median to see robustness effects.
- Adjust `capsize` and `ci` to tailor uncertainty presentation.
- Explore Seaborn’s tutorials and API reference to deepen understanding.

### Exercise: Customizing point plots

Let's continue to look at data from students in secondary school, this time using a point plot to answer the question: does the quality of the student's family relationship influence the number of absences the student has in school? Here, we'll use the `"famrel"` variable, which describes the quality of a student's family relationship from 1 (very bad) to 5 (very good).

As a reminder, to create a point plot, use the `catplot()` function and specify the name of the categorical variable to put on the x-axis (`x=____`), the name of the quantitative variable to summarize on the y-axis (`y=____`), the pandas DataFrame to use (`data=____`), and the type of categorical plot (`kind="point"`).

We've already imported Seaborn as `sns` and `matplotlib.pyplot` as `plt`.

**Instructions**

1. Use `sns.catplot()` and the `student_data` DataFrame to create a point plot with `"famrel"` on the x-axis and number of absences (`"absences"`) on the y-axis.

```python
# Create a point plot of family relationship vs. absences
sns.catplot(x="famrel", y="absences",
            data=student_data,
            kind="point")

# Show plot
plt.show()
```
![image.png](attachment:image.png)

2. Add "caps" to the end of the confidence intervals with size 0.2.



```python
# Add caps to the confidence interval
sns.catplot(x="famrel", y="absences",
            data=student_data,
            kind="point", capsize=0.2)

# Show plot
plt.show()
```
![image-2.png](attachment:image-2.png)

3. Remove the lines joining the points in each category.

```python
# Remove the lines joining the points
sns.catplot(x="famrel", y="absences",
            data=student_data,
            kind="point",
            capsize=0.2, join=False)

# Show plot
plt.show()
```
![image-3.png](attachment:image-3.png)
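
Newer Seaborn releases emit a deprecation warning for `join=False`. The sketch below assumes that, starting with Seaborn 0.13, the lines are removed with `linestyle="none"` instead, and picks the keyword from a version string. The helper name `unjoined_kwargs` is hypothetical and not part of Seaborn.

```python
def unjoined_kwargs(seaborn_version):
    """Return the keyword that removes the connecting lines in a point plot.

    Illustrative helper, assuming `join` was deprecated in Seaborn 0.13
    in favor of `linestyle="none"`.
    """
    # Compare only the major and minor parts, e.g. "0.13.2" -> (0, 13).
    major, minor = (int(part) for part in seaborn_version.split(".")[:2])
    if (major, minor) >= (0, 13):
        return {"linestyle": "none"}
    return {"join": False}


print(unjoined_kwargs("0.12.2"))
print(unjoined_kwargs("0.13.0"))
```

The result can be unpacked into the call from step 3, for example `sns.catplot(x="famrel", y="absences", data=student_data, kind="point", capsize=0.2, **unjoined_kwargs(sns.__version__))`.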

### Exercise: Point plots with subgroups

Let's continue exploring the dataset of students in secondary school. This time, we'll ask the question: is being in a romantic relationship associated with higher or lower school attendance? And does this association differ by which school the students attend? Let's find out using a point plot.

We've already imported Seaborn as `sns` and `matplotlib.pyplot` as `plt`.

**Instructions**

1. Use `sns.catplot()` and the `student_data` DataFrame to create a point plot with relationship status (`"romantic"`) on the x-axis and number of absences (`"absences"`) on the y-axis. Color the points based on the school that they attend (`"school"`).

```python
# Create a point plot that uses color to create subgroups
sns.catplot(x="romantic", y="absences",
            data=student_data,
            kind="point", hue="school")

# Show plot
plt.show()
```
![image.png](attachment:image.png)

2. Turn off the confidence intervals for the plot.
 
 
```python
# Turn off the confidence intervals for this plot
sns.catplot(x="romantic", y="absences",
            data=student_data,
            kind="point",
            hue="school",
            ci=None)

# Show plot
plt.show()
```
![image-2.png](attachment:image-2.png)

3. Since there may be outliers of students with many absences, use the median function that we've imported from numpy to display the median number of absences instead of the average.

```python
# Import median function from numpy
from numpy import median

# Plot the median number of absences instead of the mean
sns.catplot(x="romantic", y="absences",
            data=student_data,
            kind="point",
            hue="school",
            ci=None,
            estimator=median)

# Show plot
plt.show()
```
![image-3.png](attachment:image-3.png)
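
Each point in this last plot is the median of one (`romantic`, `school`) subgroup. The sketch below reproduces that grouping by hand using only the standard library. The rows are made up for illustration and are not taken from `student_data`.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows: (romantic, school, absences).
rows = [
    ("no", "GP", 2), ("no", "GP", 4), ("no", "GP", 30),
    ("yes", "GP", 6), ("yes", "GP", 8),
    ("no", "MS", 0), ("no", "MS", 2),
    ("yes", "MS", 4), ("yes", "MS", 5), ("yes", "MS", 9),
]

# Collect the absences of each (romantic, school) subgroup.
groups = defaultdict(list)
for romantic, school, absences in rows:
    groups[(romantic, school)].append(absences)

# One median per subgroup corresponds to one point in the plot.
medians = {key: median(values) for key, values in groups.items()}
for key, value in sorted(medians.items()):
    print(key, value)
```

In the made-up `("no", "GP")` subgroup, the single value of 30 absences does not pull the median away from 4, which is the reason for choosing the median here.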



End of Chapter