
# Introduction to Data Visualization with Seaborn


---

## 1. What is Seaborn?
Seaborn is a powerful **Python data visualization library** designed to make it easy to create the most common types of plots with minimal code.

- **Purpose**: Simplifies statistical and categorical data visualization.
- **Reference**: Waskom, M. L. (2021). *seaborn: statistical data visualization*. [https://seaborn.pydata.org](https://seaborn.pydata.org)
- **Example**: The plots in this course can be created with just a few lines of Seaborn code.

---

## 2. Why is Seaborn useful?
Data visualization is a key part of:
- **Data exploration**: Understanding data trends and patterns.
- **Communication of results**: Presenting insights effectively.

---

## 3. Advantages of Seaborn
- **Easy to use**: Handles much of the complexity behind the scenes.
- **Integrates with pandas**: Works directly with DataFrames and Series.
- **Built on Matplotlib**: Retains flexibility while simplifying usage.

---

## 4. Getting Started

```python
import seaborn as sns
import matplotlib.pyplot as plt
```

**Explanation:**

* `import seaborn as sns`: Imports Seaborn under its conventional alias `sns`, the initials of Samuel Norman Seaborn, a character from the TV show *The West Wing*.
* `import matplotlib.pyplot as plt`: Imports Matplotlib's plotting functions with alias `plt`.

---

## 5. Example 1: Scatter Plot

### Code:

```python
import seaborn as sns
import matplotlib.pyplot as plt

height = [62, 64, 69, 75, 66, 68, 65, 71, 76, 73]
weight = [120, 136, 148, 175, 137, 165, 154, 172, 200, 187]

sns.scatterplot(x=height, y=weight)
plt.show()

```
![image-2.png](attachment:image-2.png)


### Output:

*(This plot will appear in Jupyter as a scatter plot.)*

**Graph description**:

* **X-axis**: Height (in inches)
* **Y-axis**: Weight (in pounds)
* Shows that taller people tend to weigh more.

**Line-by-line explanation**:

1. `height = [...]` and `weight = [...]`: Lists containing heights and weights of 10 people.
2. `sns.scatterplot(x=height, y=weight)`: Creates a scatter plot mapping heights to x-axis and weights to y-axis.
3. `plt.show()`: Displays the plot.

**Significance**:

* Visual correlation between height and weight.
* Useful for identifying trends and relationships in numeric data.
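
To put a number on the trend the scatter plot suggests, one option is the Pearson correlation coefficient. The sketch below computes it with the standard library only, for the same `height` and `weight` lists; the helper name `pearson_r` is introduced here for illustration and is not part of Seaborn.

```python
from math import sqrt
from statistics import mean

height = [62, 64, 69, 75, 66, 68, 65, 71, 76, 73]
weight = [120, 136, 148, 175, 137, 165, 154, 172, 200, 187]


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of products of deviations from the means
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)


r = pearson_r(height, weight)
print(f"Pearson r = {r:.2f}")
```

A value close to +1 agrees with the upward trend visible in the plot; a value near 0 would indicate no linear relationship.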

---

## 6. Example 2: Count Plot

### Code:

```python
import seaborn as sns
import matplotlib.pyplot as plt

gender = ["Female", "Female", "Female", "Female", "Male", "Male", "Male", "Male", "Male", "Male"]

sns.countplot(x=gender)
plt.show()
```
![image.png](attachment:image.png)

### Output:

*(This plot will appear in Jupyter as a bar chart.)*

**Graph description**:

* **X-axis**: Gender categories (Male, Female)
* **Y-axis**: Count of occurrences
* Bars show: Male = 6, Female = 4.

**Line-by-line explanation**:

1. `gender = [...]`: List containing gender labels for each person.
2. `sns.countplot(x=gender)`: Creates a bar chart showing frequency of each gender.
3. `plt.show()`: Displays the plot.

**Significance**:

* Quickly summarizes distribution of categorical variables.
* Helps in identifying imbalances or patterns in category counts.
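
The bar heights in a count plot are simply tallies of each label. As a quick cross-check that does not need Seaborn, the same counts can be reproduced with `collections.Counter` from the standard library:

```python
from collections import Counter

gender = ["Female", "Female", "Female", "Female",
          "Male", "Male", "Male", "Male", "Male", "Male"]

# Tally how many times each label occurs; these are the bar heights
counts = Counter(gender)
print(counts["Female"], counts["Male"])  # prints: 4 6
```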

---

## 7. Summary

In this lesson:

* Learned what Seaborn is and why it’s useful.
* Imported Seaborn and Matplotlib under their standard aliases.
* Created two visualizations:

  1. **Scatter plot**: Showed correlation between height and weight.
  2. **Count plot**: Showed distribution of gender in dataset.
* Seaborn’s strength lies in simplicity, pandas integration, and Matplotlib foundation.

### Exercise: Making a scatter plot with lists

In this exercise, we'll use a dataset that contains information about 227 countries. It includes each country's birth rate, death rate, and gross domestic product (GDP). GDP is the value of all the goods and services produced in a year, expressed here as dollars per person.

Three lists have been created from this dataset to get you started:

* `gdp`: GDP per country, in dollars per person.
* `phones`: the number of mobile phones per 1,000 people in each country.
* `percent_literate`: the percent of each country's population that can read and write.

1. Import Matplotlib and Seaborn using the standard naming convention.

```python
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
```

2. Create a scatter plot of GDP (`gdp`) vs. number of phones per 1,000 people (`phones`).

```python
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create scatter plot with GDP on the x-axis and number of phones on the y-axis
sns.scatterplot(x=gdp, y=phones)
```

3. Display the plot.

```python
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create scatter plot with GDP on the x-axis and number of phones on the y-axis
sns.scatterplot(x=gdp, y=phones)

# Show plot
plt.show()
```
![image.png](attachment:image.png)

4. Change the scatter plot so it displays the percent of the population that can read and write (`percent_literate`) on the y-axis.

```python
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Change this scatter plot to have percent literate on the y-axis
sns.scatterplot(x=gdp, y=percent_literate)

# Show plot
plt.show()
```
![image-2.png](attachment:image-2.png)

### Exercise: Making a count plot with a list

In the last exercise, we explored a dataset that contains information about 227 countries. Let's explore it further: how many countries are in each region of the world?

To answer this, we'll use a count plot. A count plot takes a categorical list and draws one bar per category, with each bar's length showing the number of list entries in that category. You can create one here from the list of each country's region, stored in the variable `region`.

1. Import Matplotlib and Seaborn using the standard naming conventions.
2. Use Seaborn to create a count plot with `region` on the y-axis.
3. Display the plot.


```python
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create count plot with region on the y-axis
sns.countplot(y=region)

# Show plot
plt.show()
```

![image.png](attachment:image.png)




# Using pandas with Seaborn

## 1. Introduction
Data scientists commonly use **pandas** to perform data analysis, and it's a big advantage that **Seaborn** works seamlessly with pandas data structures.

We will learn:
- What pandas is
- How to work with DataFrames
- How to use Seaborn’s `countplot()` with a DataFrame
- The difference between **tidy** and **untidy** data

---

## 2. What is pandas?

**pandas**:
- A Python library for data analysis.
- Can read datasets from various file types (CSV, TXT, Excel, etc.).
- Supports several data structures; the most common is the **DataFrame**.
- A **DataFrame** is a 2D labeled data structure with rows and columns.

When you read a dataset into pandas, it becomes a DataFrame.

---

## 3. Working with DataFrames

### Example: Reading a CSV file into a DataFrame

```python
import pandas as pd

# Read CSV file into a DataFrame
df = pd.read_csv("masculinity.csv")

# Display the first 5 rows
df.head()
```

**Expected Output:**

| participant\_id | age     | how\_masculine | how\_important |
| --------------- | ------- | -------------- | -------------- |
| 0               | 18 - 34 | Somewhat       | Somewhat       |
| 1               | 18 - 34 | Somewhat       | Somewhat       |
| 2               | 18 - 34 | Very           | Not very       |
| 3               | 18 - 34 | Very           | Not very       |
| 4               | 18 - 34 | Very           | Very           |

**Line-by-line explanation:**

1. `import pandas as pd`

   * Imports the pandas library and aliases it as `pd` for convenience.
2. `pd.read_csv("masculinity.csv")`

   * Reads the file `masculinity.csv` into a DataFrame.
   * The file must be in the current working directory; otherwise, pass a relative or absolute path to it.
3. `df.head()`

   * Displays the first 5 rows of the DataFrame for inspection.
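
To try `read_csv` without having the survey file on disk, the CSV text can be supplied from memory. This is a minimal sketch: the two data rows in `csv_text` are invented for illustration and are not taken from `masculinity.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """participant_id,age,how_masculine
1,18 - 34,Somewhat
2,35 - 64,Very
"""

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows (up to 5)
print(df.dtypes)   # data type inferred for each column
```

Printing `dtypes` alongside `head()` shows at a glance which columns were read as numbers and which as text.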

**Significance of Output:**

* Each row is one participant’s response.
* Columns:

  * `participant_id`: Unique identifier for each participant.
  * `age`: Age category.
  * `how_masculine`: Response to “How masculine or ‘manly’ do you feel?”
  * `how_important`: Response to “How important is it to you that others see you as masculine?”

---

## 4. Using DataFrames with `countplot()`

### Example: Creating a count plot with Seaborn

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
df = pd.read_csv("masculinity.csv")

# Create a count plot
sns.countplot(x="how_masculine", data=df)

# Display the plot
plt.show()
```
![image.png](attachment:image.png)

**Expected Output:**

* A bar chart showing counts of each category in `how_masculine`:

  * "Somewhat" → tallest bar (most frequent)
  * "Very" → second tallest
  * Others (if present) → smaller bars

**Line-by-line explanation:**

1. `import pandas as pd` – Imports pandas for data handling.
2. `import matplotlib.pyplot as plt` – Imports Matplotlib for plotting controls.
3. `import seaborn as sns` – Imports Seaborn for statistical data visualization.
4. `pd.read_csv("masculinity.csv")` – Reads the survey data into a DataFrame.
5. `sns.countplot(x="how_masculine", data=df)`

   * Creates a bar plot showing the frequency of each category in the `how_masculine` column.
   * `x="how_masculine"` → The column to visualize.
   * `data=df` → The DataFrame where the column exists.
6. `plt.show()` – Renders the plot.

**Significance of Output:**

* Visualizes the distribution of answers to the “how masculine” question.
* Seaborn automatically uses the column name as the x-axis label.
* Helps quickly identify the most common responses.
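
The numbers behind the bars can be read directly from the DataFrame with `value_counts()`. The sketch below uses a small stand-in DataFrame whose responses are invented for illustration; with the real survey data, the same call would be made on `df["how_masculine"]`.

```python
import pandas as pd

# Hypothetical responses, invented for illustration
df = pd.DataFrame({
    "how_masculine": ["Somewhat", "Very", "Somewhat", "Not very", "Somewhat", "Very"]
})

# value_counts() tallies each category, most frequent first
counts = df["how_masculine"].value_counts()
print(counts)
```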

---

## 5. "Tidy" Data

**Definition:**

* Each observation has its own row.
* Each variable has its own column.

**In our example:**

* Each row = one participant’s survey response.
* Columns:

  * Age
  * How masculine
  * How important others see them as masculine
* Works perfectly with Seaborn because each column’s values correspond to a single variable.

---

## 6. "Untidy" Data

**Example characteristics:**

* Rows contain mixed or inconsistent data types.
* Different rows store different types of information.
* Example:

  * Row 0 contains age categories.
  * Row 1 contains question text.
  * Other rows contain summary data.

**Why it’s a problem for Seaborn:**

* Seaborn expects each column to contain one type of variable.
* Untidy data can’t be directly plotted without transformation.

**Note:** Transforming untidy → tidy is possible but outside this course’s scope.

---

## 7. Practice

Now that you know:

* How pandas reads data into DataFrames.
* How Seaborn can work directly with DataFrame columns.
* The importance of tidy data for visualization.

You’re ready to practice making your own plots using pandas with Seaborn.

---




### Exercise: "Tidy" vs. "untidy" data

Here, we have a sample dataset from a survey of children about their favorite animals. Can we use this dataset as-is with Seaborn? Let's use pandas to import the CSV file with the survey data and determine whether it is tidy, which is essential for it to work well with Seaborn.

To get you started, the filepath to the CSV file has been assigned to the variable `csv_filepath`. Because `csv_filepath` is a Python variable, you do not need to put quotation marks around it when you read the file.

1. Read the CSV file located at `csv_filepath` into a DataFrame named `df`.
2. Print the head of `df` to show the first five rows.

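```python
# Import pandas
import pandas as pd

# Create a DataFrame from csv file
df = pd.read_csv(csv_filepath)

# Print the head of df
print(df.head())
```

**Output:**

```
  Unnamed: 0               How old are you?
0     Marion                             12
1      Elroy                             16
2        NaN  What is your favorite animal?
3     Marion                            dog
4      Elroy                            cat
```

This dataset is untidy: the single column `How old are you?` mixes ages, the text of a second question, and animal names, so it does not hold one variable per column.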
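
One rough way to check tidiness in code is to test whether a column that should hold a single variable parses as a single type. The sketch below rebuilds the five printed rows by hand (an assumption: the real file may contain more rows) and counts how many values in the age column are actually numeric.

```python
import pandas as pd

# Rows copied from the printed output above
df = pd.DataFrame({
    "Unnamed: 0": ["Marion", "Elroy", None, "Marion", "Elroy"],
    "How old are you?": ["12", "16", "What is your favorite animal?", "dog", "cat"],
})

# Try to read the column as numbers; anything non-numeric becomes NaN
as_numbers = pd.to_numeric(df["How old are you?"], errors="coerce")

print(as_numbers.notna().sum(), "of", len(df), "values are numeric")
```

A column that is meant to hold ages but only partly parses as numbers is a sign that several variables share one column.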
    

### Exercise: Making a count plot with a DataFrame

In this exercise, we'll look at the responses to a survey sent out to young people. Our primary question is: how many of the young people surveyed report being scared of spiders? Participants were asked to agree or disagree with the statement "I am afraid of spiders". Responses range from 1 to 5, where 1 is "Strongly disagree" and 5 is "Strongly agree".

To get you started, the filepath to the CSV file with the survey data has been assigned to the variable `csv_filepath`. Because `csv_filepath` is a Python variable, you do not need to put quotation marks around it when you read the file.

**Instructions**

1. Import Matplotlib, pandas, and Seaborn using the standard names.
2. Create a DataFrame named `df` from the CSV file located at `csv_filepath`.
3. Use the `countplot()` function with the `x=` and `data=` arguments to create a count plot with the `"Spiders"` column values on the x-axis.
4. Display the plot.

```python
# Import Matplotlib, pandas, and Seaborn
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Create a DataFrame from csv file
df = pd.read_csv(csv_filepath)

# Create a count plot with "Spiders" on the x-axis
sns.countplot(x='Spiders', data=df)

# Display the plot
plt.show()
```
![image.png](attachment:image.png)

# Adding a Third Variable with `hue`


---

## 1) Adding a third variable with `hue`

**Idea:** Seaborn works seamlessly with pandas DataFrames and lets you add a third variable to a plot by mapping it to color via the `hue` parameter. This helps you compare groups within the same chart.

**Why it matters:** Color lets you distinguish subgroups (e.g., smokers vs. non-smokers) without creating separate plots.

---

## 2) Tips dataset

We’ll use Seaborn’s built-in `tips` dataset.

### Code

```python
import pandas as pd
import seaborn as sns

tips = sns.load_dataset("tips")
tips.head()
```

### Expected Output (first 5 rows)

```
   total_bill   tip     sex smoker   day    time  size
0       16.99  1.01  Female     No   Sun  Dinner     2
1       10.34  1.66    Male     No   Sun  Dinner     3
2       21.01  3.50    Male     No   Sun  Dinner     3
3       23.68  3.31    Male     No   Sun  Dinner     2
4       24.59  3.61  Female     No   Sun  Dinner     4
```

### Line-by-line explanation

* `import pandas as pd`

  * **What:** Imports pandas and aliases it as `pd`.
  * **Why:** DataFrames are the primary data structure for tabular data.
  * **Result:** You can call pandas functions via `pd`.

* `import seaborn as sns`

  * **What:** Imports Seaborn and aliases it as `sns`.
  * **Why:** We’ll use Seaborn’s plotting functions and datasets.
  * **Result:** `sns` is the handle for Seaborn.

* `tips = sns.load_dataset("tips")`

  * **What:** Loads a built-in dataset named `"tips"`.
  * **Why:** Provides a ready dataset with restaurant bills, tips, party size, etc.
  * **Result:** `tips` is a pandas DataFrame with the data.

* `tips.head()`

  * **What:** Shows the first five rows.
  * **Why:** Quick sanity check to understand columns and data types.
  * **Result:** The table above, confirming columns like `total_bill`, `tip`, `sex`, `smoker`, `day`, `time`, `size`.

**Significance:** We confirm the dataset’s structure before plotting (what variables exist and how they’re stored).
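
Beyond `head()`, the `shape` and `dtypes` attributes give a quick structural check. The sketch below types in the five rows shown above as a stand-in DataFrame named `tips_sample` (a hypothetical name), so it runs without downloading the dataset; on the real data, the same attributes would be read from `tips`.

```python
import pandas as pd

# The five rows printed above, typed in by hand
tips_sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "smoker": ["No", "No", "No", "No", "No"],
    "day": ["Sun"] * 5,
    "time": ["Dinner"] * 5,
    "size": [2, 3, 3, 2, 4],
})

print(tips_sample.shape)   # (number of rows, number of columns)
print(tips_sample.dtypes)  # numeric vs. text columns
```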

---

## 3) A basic scatter plot

We’ll visualize the relationship between `total_bill` (x) and `tip` (y).

### Code

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x="total_bill", y="tip", data=tips)
plt.show()
```
![image.png](attachment:image.png)

### Expected Output

* A scatter plot with:

  * **x-axis:** `total_bill` (dollars)
  * **y-axis:** `tip` (dollars)
  * Points showing a positive trend: larger bills tend to have larger tips.

### Line-by-line explanation

* `import matplotlib.pyplot as plt`

  * **What:** Imports Matplotlib’s plotting interface.
  * **Why:** Needed to render/show figures (`plt.show()`).
  * **Result:** `plt` controls figure display.

* `import seaborn as sns`

  * **What/Why/Result:** Same as before; ensures `sns` is available here too.

* `sns.scatterplot(x="total_bill", y="tip", data=tips)`

  * **What:** Creates a scatter plot using DataFrame columns.
  * **Why:** To explore the relationship between bill size and tip amount.
  * **Result:** A figure with points; expected positive association.

* `plt.show()`

  * **What:** Renders the plot to the output.
  * **Why:** Required in scripts to display the figure; Jupyter notebooks usually render figures automatically, but calling it explicitly keeps the code portable.
  * **Result:** The scatter plot appears.

**Significance:** Confirms a general pattern—higher bills → higher tips—before adding subgroup information.

---

## 4) A scatter plot with `hue`

Add a third variable by mapping `smoker` to color.

### Code

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(
    x="total_bill",
    y="tip",
    data=tips,
    hue="smoker"   # color points by smoker status
)
plt.show()
```
![image-2.png](attachment:image-2.png)


### Expected Output

* Same scatter, but points are **colored by smoker status** (`Yes` vs `No`).
* An **automatic legend** labeling the color mapping.

### Line-by-line explanation

* `sns.scatterplot(..., hue="smoker")`

  * **What:** Instructs Seaborn to color points by the `smoker` column.
  * **Why:** To compare distributions/patterns of smokers vs. non-smokers in the same plot.
  * **Result:** Two distinct colors with a legend; you can visually compare groups.

* `plt.show()`

  * **As above:** Renders the colored scatter plot.

**Significance:** You can immediately see if one group tips differently or has different bill ranges.

> **Note:** If you weren’t using a DataFrame, you could pass a list/array to `hue` directly, but using column names is more convenient with pandas.
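
As a sketch of the list-based form mentioned in the note, the example below passes plain Python lists to `x`, `y`, and `hue`. The `group` labels are invented for illustration and do not come from any dataset in these notes.

```python
import seaborn as sns

height = [62, 64, 69, 75, 66, 68]
weight = [120, 136, 148, 175, 137, 165]
# Hypothetical group label for each person, invented for illustration
group = ["A", "A", "B", "B", "A", "B"]

# With no DataFrame, pass the sequences themselves to x, y, and hue
ax = sns.scatterplot(x=height, y=weight, hue=group)

# Read the legend entries back from the Axes
legend_labels = ax.get_legend_handles_labels()[1]
print(legend_labels)
```

All three lists must have the same length, since each position describes one point. Call `plt.show()` afterward, as in the examples above, to display the figure.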

---

## 5) Setting `hue_order`

Control the order in which the `hue` categories appear in the legend. The order also determines which default color each category receives.

### Code

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(
    x="total_bill",
    y="tip",
    data=tips,
    hue="smoker",
    hue_order=["Yes", "No"]  # force “Yes” to appear before “No”
)
plt.show()
```
![image-3.png](attachment:image-3.png)


### Expected Output

* Scatter colored by smoker status.
* **Legend order:** `Yes` appears before `No`.

### Line-by-line explanation

* `hue_order=["Yes", "No"]`

  * **What:** Specifies explicit ordering of the categorical levels.
  * **Why:** For consistent presentation (e.g., putting “Yes” first to match narrative or slides).
  * **Result:** The legend lists `Yes` first, and `Yes` takes the first color of the default palette.

**Significance:** Consistent ordering improves readability and comparability across plots/reports.
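
One way to confirm the effect of `hue_order` without looking at the figure is to read the legend entries back from the returned Axes. This is a minimal sketch on a four-row DataFrame invented for illustration, not on the real tips data.

```python
import pandas as pd
import seaborn as sns

# Hypothetical mini-dataset, invented for illustration
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.5, 3.0, 4.0, 6.5],
    "smoker": ["No", "Yes", "No", "Yes"],
})

ax = sns.scatterplot(x="total_bill", y="tip", data=df,
                     hue="smoker", hue_order=["Yes", "No"])

# Read the legend entries back from the Axes to confirm the order
legend_labels = ax.get_legend_handles_labels()[1]
print(legend_labels)
```

Although `No` appears first in the data, the legend entries should follow the order given in `hue_order`.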

---

## 6) Specifying `hue` colors with `palette` (named colors)

Choose specific colors for categories via a dictionary.

### Code

```python
import matplotlib.pyplot as plt
import seaborn as sns

hue_colors = {"Yes": "black", "No": "red"}

sns.scatterplot(
    x="total_bill",
    y="tip",
    data=tips,
    hue="smoker",
    palette=hue_colors  # map each level to a chosen color
)
plt.show()
```

![image-4.png](attachment:image-4.png)

### Expected Output

* Smokers (`Yes`) plotted in **black**.
* Non-smokers (`No`) plotted in **red**.
* Legend reflects these colors.

### Line-by-line explanation

* `hue_colors = {"Yes": "black", "No": "red"}`

  * **What:** A Python dict mapping category → color name.
  * **Why:** Full control over which color represents which group.
  * **Result:** Later passed to `palette` to enforce colors.

* `palette=hue_colors`

  * **What:** Tells Seaborn to use your mapping.
  * **Why:** Ensures consistency across plots/documents (e.g., brand colors or accessibility palettes).
  * **Result:** Plot uses black for `Yes`, red for `No`.

* Other lines behave as in previous examples.

**Significance:** Custom colors support consistency and better visual contrast for your audience.

> **Color options:** Matplotlib supports a set of **named colors** and **one-letter abbreviations** (e.g., `"k"` for black, `"r"` for red). When you need precise/brand colors, use **hex codes** (next section).
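
To see that a full color name and its one-letter abbreviation refer to the same color, both can be converted to hex with Matplotlib's `to_hex` helper. A small sketch:

```python
from matplotlib.colors import to_hex

# Full color names and one-letter abbreviations resolve to the same color
print(to_hex("black"), to_hex("k"))
print(to_hex("red"), to_hex("r"))
```

Either spelling can therefore be used as a value in the `palette` dictionary.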

---

## 7) Using HTML hex color codes with `hue`

Hex codes allow any color (e.g., gray and lime).

### Code

```python
import matplotlib.pyplot as plt
import seaborn as sns

hue_colors = {"Yes": "#808080", "No": "#00FF00"}  # gray and lime

sns.scatterplot(
    x="total_bill",
    y="tip",
    data=tips,
    hue="smoker",
    palette=hue_colors
)
plt.show()
```
![image-5.png](attachment:image-5.png)


### Expected Output

* Smokers (`Yes`) in **#808080** (gray).
* Non-smokers (`No`) in **#00FF00** (lime green).
* Legend matches these hex colors.

### Line-by-line explanation

* `hue_colors = {"Yes": "#808080", "No": "#00FF00"}`

  * **What:** Dict mapping each category to a hex color code (strings with `#`).
  * **Why:** Precise color control beyond named colors.
  * **Result:** Clean, reproducible, exact color mapping.

* The rest is identical to the previous example, but using hex codes.

**Significance:** Hex codes are the standard for precise color selection (branding, accessibility, style guides).
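
A hex code packs three channels (red, green, blue) into pairs of hexadecimal digits. The sketch below decodes the two codes used above with plain Python; the helper name `hex_to_rgb` is introduced here for illustration and is not a Seaborn function.

```python
def hex_to_rgb(code):
    """Convert a '#RRGGBB' string to a tuple of three 0-255 integers."""
    code = code.lstrip("#")
    # Each pair of hex digits encodes one channel: red, green, blue
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


print(hex_to_rgb("#808080"))  # (128, 128, 128): equal channels give gray
print(hex_to_rgb("#00FF00"))  # (0, 255, 0): full green only
```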

---

## 8) Using `hue` with count plots

`hue` works across most Seaborn plot types, not just scatter plots.

### Code

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(
    x="smoker",
    data=tips,
    hue="sex"    # subdivide bars by sex within each smoker category
)
plt.show()
```
![image-6.png](attachment:image-6.png)


### Expected Output

* A **count plot** with two groups on the x-axis, one for each `smoker` value (non-smokers and smokers).
* Within each group, the bars are placed **side by side** (grouped, not stacked), one per `sex`, with a legend.
* You should see **males outnumber females** in both smoker and non-smoker groups for this dataset.

### Line-by-line explanation

* `sns.countplot(x="smoker", data=tips, hue="sex")`

  * **What:** Plots counts of rows per `smoker` value, subdivided by `sex`.
  * **Why:** To compare subgroup sizes within each category; helpful for understanding sample composition.
  * **Result:** Multi-colored bars indicating counts for male vs. female within `No` and `Yes`.

* `plt.show()`

  * **What/Why/Result:** Renders the chart.

**Significance:** Understanding subgroup counts prevents misinterpretation of patterns seen in other plots (e.g., if one subgroup dominates, trends could reflect sample size rather than behavior).
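
The subgroup counts that the bars represent can also be tabulated with `pd.crosstab`. The sketch below uses six rows invented for illustration (not the real tips data); with the real dataset, the same call would take `tips["smoker"]` and `tips["sex"]`.

```python
import pandas as pd

# Hypothetical rows, invented for illustration (not the real tips data)
df = pd.DataFrame({
    "smoker": ["Yes", "No", "No", "Yes", "No", "No"],
    "sex": ["Male", "Female", "Male", "Male", "Male", "Female"],
})

# Rows are smoker values, columns are sex values, cells are row counts
counts = pd.crosstab(df["smoker"], df["sex"])
print(counts)
```

Each cell of the table corresponds to one bar in the grouped count plot.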

---

## 9) Quick Reference & Takeaways

* **`hue` adds a third variable via color** to most Seaborn plots (scatter, count, bar, line, etc.).
* **`hue_order`** ensures categories appear in a specific order (legend order and color assignment).
* **`palette`** accepts:

  * A dict mapping category → color (named colors, one-letter abbreviations, or hex codes).
  * Built-in palette names (e.g., `"Set2"`, `"deep"`, etc.) if you’re not mapping manually.
* **Hex codes** (`"#RRGGBB"`) provide exact color control; always wrap in quotes.
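
Passing a palette name such as `palette="Set2"` lets Seaborn choose the colors. To inspect which colors a name stands for, `sns.color_palette` can be called directly. This is a minimal sketch; requesting two colors is an assumption chosen to match a two-level `hue` variable.

```python
import seaborn as sns

# Ask for the first two colors of the built-in "Set2" palette
palette = sns.color_palette("Set2", 2)

print(len(palette))      # number of colors returned
print(palette.as_hex())  # the same colors as hex strings
```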

---

## 10) Practice Prompt

* Recreate each plot above on your machine.
* Try different `hue_order` arrangements and custom `palette` mappings.
* Swap `hue` variables (e.g., `sex`, `day`, or `time`) to see how patterns change.


### Exercise: Hue and scatter plots

Earlier, we saw how `hue` lets us easily create subgroups within Seaborn plots. Let's try it out by exploring data from students in secondary school. We have a lot of information about each student, such as their age, where they live, their study habits, and their extracurricular activities.

For now, we'll look at the relationship between the number of absences they have in school and their final grade in the course, segmented by where the student lives (rural vs. urban area).

1. Create a scatter plot with `"absences"` on the x-axis and final grade (`"G3"`) on the y-axis using the DataFrame `student_data`. Color the plot points based on `"location"` (urban vs. rural).

```python
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create a scatter plot of absences vs. final grade, colored by location
# (student_data is the DataFrame provided by the exercise environment)
sns.scatterplot(x="absences", y="G3", data=student_data, hue="location")

# Show plot
plt.show()
```
![image-2.png](attachment:image-2.png)

2. Make "Rural" appear before "Urban" in the plot legend.

```python
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Change the legend order in the scatter plot
sns.scatterplot(
    x="absences",
    y="G3",
    data=student_data,
    hue="location",
    hue_order=["Rural", "Urban"],
)

# Show plot
plt.show()
```

![image-3.png](attachment:image-3.png)

### Exercise: Hue and count plots

Let's continue exploring our dataset of secondary school students by looking at a new variable. The `"school"` column holds the initials of the school the student attended, either `"GP"` or `"MS"`.

In the last exercise, we created a scatter plot where the plot points were colored based on whether the student lived in an urban or rural area. How many students live in urban vs. rural areas, and does this vary based on which school the student attends? Let's make a count plot with subgroups to find out.

1. Fill in the `palette_colors` dictionary to map the `"Rural"` location value to the color `"green"` and the `"Urban"` location value to the color `"blue"`.
2. Create a count plot with `"school"` on the x-axis using the `student_data` DataFrame.
3. Add subgroups to the plot using the `"location"` variable, and use the `palette_colors` dictionary to make the location subgroups green and blue.


```python
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create a dictionary mapping subgroup values to colors
palette_colors = {"Rural": "green", "Urban": "blue"}

# Create a count plot of school with location subgroups
sns.countplot(x="school", hue="location", palette=palette_colors, data=student_data)

# Display plot
plt.show()
```
![image.png](attachment:image.png)