# (PART) DATA VISUALIZATION {-}


# What are common data types in Python and R?

## Explanation

Before you clean, visualize, or model data, it's important to understand what types of values you're working with — numeric, text, logical, or otherwise.

These **data types** affect how values are stored, displayed, and processed in both Python and R — and they play a major role in how functions behave.

---

## Common Data Types in Python and R

| Concept           | Python (`pandas` / base) | R (`base`)             | Notes |
|:------------------|:--------------------------|:------------------------|:-------|
| Integer           | `int`                    | `integer`              | Use `astype(int)` or `as.integer()` |
| Decimal Number    | `float`                  | `numeric`, `double`    | `numeric` is typically `double` in R |
| Text / String     | `str`, `object` (pandas) | `character`            | Use `astype(str)` or `as.character()` |
| Logical / Boolean | `bool`                   | `logical`              | `True/False` in Python, `TRUE/FALSE` in R |
| Date / Time       | `datetime64[ns]`         | `Date`, `POSIXct`      | Use `pd.to_datetime()` or `as.Date()` |
| Category          | `category`               | `factor`               | Ideal for grouping and modeling |
| Missing Values    | `NaN`                    | `NA`                   | Use `pd.isna()` or `is.na()` |
| Complex Numbers   | `complex`                | `complex`              | Rare in typical data work |
| List              | `list`                   | `list`                 | Flexible containers |
| Dictionary        | `dict`                   | `named list`, `list()` | R lists can mimic dictionaries |
| Tuple             | `tuple`                  | `c()`, `list()`        | No exact match — use vectors or lists |
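
To make the table concrete, here is a minimal pandas sketch that applies several of the conversions from the Notes column. The DataFrame, its column names, and its values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy data: most columns start out as text
raw = pd.DataFrame({
    "quantity": ["1", "2", "3"],
    "price": ["9.99", "19.50", None],
    "in_stock": [True, False, True],
    "added_on": ["2024-01-05", "2024-02-10", "2024-03-15"],
    "grade": ["low", "high", "medium"],
})

typed = raw.assign(
    quantity=raw["quantity"].astype(int),        # text -> integer
    price=pd.to_numeric(raw["price"]),           # text -> float (missing value becomes NaN)
    added_on=pd.to_datetime(raw["added_on"]),    # text -> datetime
    grade=raw["grade"].astype("category"),       # text -> category
)

print(typed.dtypes)
print(pd.isna(typed["price"]))  # flags the missing price
```

The same pattern carries over to R with `as.integer()`, `as.numeric()`, `as.Date()`, `as.factor()`, and `is.na()`.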

> ✅ Knowing the common data types — and how to interpret them — lays the foundation for all future data work.

# How do you inspect variable types in a dataset?

## Explanation

Once you’ve loaded your dataset, the next step is to **inspect the structure** and confirm the variable types. This helps you:

- Understand what you're working with  
- Catch mismatches (e.g., numbers stored as strings)  
- Decide whether conversions are needed

---

## Python Code



```python
import seaborn as sns
import pandas as pd

# Load and sample the diamonds dataset
df_full = sns.load_dataset("diamonds")
df = df_full.sample(n=500, random_state=42)

# Inspect the shape
print("📐 Dataset Shape:", df.shape)

# Preview the first few rows
print("\n🔍 Dataset Preview:")
print(df.head())

# Check data types
print("\n🔠 Variable Types:")
print(df.dtypes)
```




## R Code
```{r}
library(ggplot2)
library(dplyr)

# Load and sample the diamonds dataset
set.seed(42)
df <- ggplot2::diamonds %>% sample_n(500)

# Check dimensions
cat("📐 Dataset Dimensions:", dim(df)[1], "rows x", dim(df)[2], "columns\n\n")

# Preview the dataset
cat("🔍 Dataset Preview:\n")
print(head(df))

# Inspect variable types
cat("\n🔠 Variable Types:\n")
str(df)
```

---

> ✅ Always check the variable types before analysis — it helps prevent errors, ensures correct plotting and modeling behavior, and guides you in converting variables where needed (e.g., from text to category or numeric).

# How do you convert variable types in a dataset?

## Explanation

Before visualizing or modeling your data, it’s important to ensure that each variable has the correct type. For example:

- Categorical variables (like `cut`, `color`, `clarity`) should be treated as factors or categories  
- Numerical variables accidentally stored as strings should be converted to numeric types

In this example, we’ll use a **sample of 500 diamonds** to demonstrate how to inspect and convert variable types where needed — a crucial step for grouped plots and modeling accuracy.
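
The diamonds data does not contain numbers stored as text, so the second case is sketched separately here. The `price` values below are made up, and `errors="coerce"` is one possible policy (unparseable entries become `NaN`), not the only reasonable one.

```python
import pandas as pd

# Hypothetical column: prices read from a messy file as text
prices = pd.Series(["326", "334", "n/a", "2757"], name="price")
print(prices.dtype)  # a text dtype, so numeric summaries are not meaningful yet

# errors="coerce" turns unparseable entries into NaN instead of raising an error
prices_num = pd.to_numeric(prices, errors="coerce")
print(prices_num.dtype)
print(prices_num.mean())  # NaN is skipped when computing the mean
```

After coercion it is worth counting the resulting missing values (`prices_num.isna().sum()`) to see how many entries failed to parse.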

## Python Code



```python
import seaborn as sns
import pandas as pd

# Load and sample the diamonds dataset
df_full = sns.load_dataset("diamonds")
df = df_full.sample(n=500, random_state=42)

# Convert selected columns to categorical
categorical_cols = ["cut", "color", "clarity"]
for col in categorical_cols:
    df[col] = df[col].astype("category")

# Confirm data types
print("🔠 Updated Variable Types:\n")
print(df.dtypes)
```

## R Code

```{r}
library(ggplot2)
library(dplyr)

# Load and sample the diamonds dataset
set.seed(42)
df <- ggplot2::diamonds %>%
  sample_n(500)

# Convert selected columns to factor
df <- df %>%
  mutate(
    cut = as.factor(cut),
    color = as.factor(color),
    clarity = as.factor(clarity)
  )

# Confirm structure
cat("🔠 Updated Variable Types:\n")
str(df)
```

> ✅ Ensuring correct variable types improves how your data is visualized, summarized, and modeled — especially when working with grouped plots or categorical aesthetics.

# What is the difference between categorical and numerical variables?

## Explanation

In data analysis, variables are typically classified into two major types:

### 🔷 Categorical Variables  
These represent groups, labels, or categories. They describe **qualities**, not quantities.

- **Nominal**: Categories with no natural order  
  _Example_: `"red"`, `"blue"`, `"green"`
  
- **Ordinal**: Categories with a meaningful order  
  _Example_: `"low"`, `"medium"`, `"high"`

### 🔶 Numerical Variables  
These represent measurable quantities and describe **how much** or **how many**.

- **Discrete**: Countable, whole-number values  
  _Example_: Number of children, cars, books
  
- **Continuous**: Measurable on a scale; can take any value within a range  
  _Example_: Height, weight, temperature

---

Correctly identifying variable types is critical. It informs the choice of:

- Statistical methods (e.g., t-tests vs chi-squared tests)
- Visualizations (e.g., histograms for continuous vs bar plots for categorical)
- Feature encoding in machine learning (e.g., one-hot encoding for nominal variables)
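
The nominal/ordinal distinction shows up directly in how a variable is encoded. The sketch below uses a small made-up DataFrame: the ordinal column gets an explicit category order, while the nominal column is one-hot encoded.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],   # nominal: no natural order
    "level": ["low", "high", "medium", "low"],   # ordinal: low < medium < high
})

# Ordinal: declare the order explicitly so sorting and comparisons respect it
df["level"] = pd.Categorical(
    df["level"], categories=["low", "medium", "high"], ordered=True
)
print(df["level"].min(), "<", df["level"].max())
print(df["level"].cat.codes.tolist())  # integer codes follow the declared order

# Nominal: one-hot encode, because the categories cannot be ranked
dummies = pd.get_dummies(df["color"], prefix="color")
print(dummies.columns.tolist())
```

In R, the counterparts are `factor(x, levels = ..., ordered = TRUE)` for ordinal variables and `model.matrix()` for one-hot style encoding.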


# How do you summarize numerical and categorical variables?

## Explanation

Summarizing variables helps you quickly understand data distribution, central tendency, and variation — essential before any visualization.

- **Numerical variables**: We summarize using measures like mean, median, standard deviation, min, max, and percentiles.  
- **Categorical variables**: We summarize by counting the frequency of each category.

Here we use a **sample of 500 diamonds** for fast, clear summaries.

## Python Code



```python
import seaborn as sns
import pandas as pd

# Load and sample the diamonds dataset
df_full = sns.load_dataset("diamonds")
df = df_full.sample(n=500, random_state=42)

# Summary of numerical variables
print("📊 Summary of Numerical Variables:\n")
print(df.describe())

# Frequency count of categorical variables
print("\n🔠 Frequency of Categorical Variables:\n")
for col in ["cut", "color", "clarity"]:
    print(f"\n{col}:\n", df[col].value_counts())
```

## R Code

```{r}
library(ggplot2)
library(dplyr)

# Load and sample the diamonds dataset
set.seed(42)
df <- ggplot2::diamonds %>%
  sample_n(500)

# Summary of numerical variables
cat("📊 Summary of Numerical Variables:\n")
summary(select(df, where(is.numeric)))

# Frequency count of categorical variables
cat("\n🔠 Frequency of Categorical Variables:\n")
# One frequency table per column; lapply() returns a named list that prints each table in full
df %>%
  select(cut, color, clarity) %>%
  lapply(table)
```

> ✅ Summarizing your variables helps reveal patterns, detect outliers, and identify potential problems — all before you create your first plot.

# What are the most effective plots for comparing values across categories?

## Explanation

Before diving into specific charts, it’s helpful to understand the **landscape of visualization tools** available when comparing a **numeric variable** across different **groups** (i.e., categorical levels).

Depending on what insight you're after — distribution shape, summary statistics, or raw values — the best choice will vary.

### 🔹 Common Visualization Types {-}

| Plot Type           | Shows Distribution | Best For                                 |
|---------------------|--------------------|-------------------------------------------|
| **Bar Plot**        | ❌ No              | Comparing total counts or group means     |
| **Histogram**       | ✅ Yes (1 group)   | Viewing frequency distribution            |
| **Density Plot**    | ✅ Yes             | Smooth shape of distribution              |
| **Box Plot**        | ✅ Yes             | Spread, center, and outliers              |
| **Violin Plot**     | ✅ Yes             | Shape + quartiles                         |
| **Ridge Plot**      | ✅ Yes             | Comparing shapes across many groups       |
| **Strip Plot**      | ✅ Yes             | Raw points (ideal for small datasets)     |
| **Swarm Plot**      | ✅ Yes             | Raw data with spacing                     |
| **Dot Plot**        | ❌ Summary only    | Central tendency with optional error bars |
| **Bar + Error Bars**| ❌ Summary only    | Means with uncertainty (CI/SE bars)       |

---

### 📌 Choosing the Right Plot {-}

Ask yourself:

- How many **groups** are you comparing?  
- Is the dataset **small or large**?  
- Do you want to show **individual points** or a **summary**?  
- Is **distribution shape** important?

Start with simple visuals like bar plots and box plots. Use richer plots (like violin or ridge) when you want to uncover deeper patterns.

---

> ✅ **Takeaway:** Choosing the right visual makes patterns across categories easier to see — and ensures your insights are both accurate and easy to communicate.

# How do you visualize category counts using a bar plot?

## Explanation

A **count plot** (also known as a categorical bar plot) shows how many observations fall into each category. It is ideal for quickly assessing the distribution of categorical variables such as species, cut, or class.

You can enhance the plot by:

- Adding color (`hue`) to show subgroup breakdown  
- Applying professional palettes like `Set2`, `viridis`, or `rainbow`  
- Ordering bars by count or category level  
- Adding labels to each bar for clarity  
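
Ordering and labeling can be sketched without any dataset on disk. The example below uses a made-up `cut` column and plain matplotlib; `Axes.bar_label` needs matplotlib 3.4 or later. With seaborn, the same ordering can be passed to `countplot` through its `order` argument (for example `order=counts.index`).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical category column standing in for a real dataset
cuts = pd.Series(["Ideal", "Good", "Ideal", "Premium", "Ideal", "Good"], name="cut")

# value_counts() sorts categories from most to least frequent by default
counts = cuts.value_counts()

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(counts.index, counts.values, color="steelblue")
ax.bar_label(bars)  # writes each bar's height above it
ax.set_xlabel("Cut")
ax.set_ylabel("Count")
ax.set_title("Counts ordered from most to least frequent")
fig.tight_layout()
plt.show()
```

Sorting by count suits nominal categories; for ordinal categories it is usually clearer to keep the natural level order instead.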

---

## Python Code




```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
iris = pd.read_csv("data/iris.csv")

# Set aesthetic style
sns.set(style="whitegrid")

# Base count plot; mapping species to hue lets the palette apply
# without seaborn's "palette without hue" deprecation warning
plt.figure(figsize=(8, 6))
sns.countplot(data=iris, x="species", hue="species", palette="Set2", dodge=False)
plt.title("Count Plot of Iris Species", fontsize=14)
plt.xlabel("Species")
plt.ylabel("Count")
plt.legend([], [], frameon=False)  # Hides the redundant legend
plt.tight_layout()
plt.show()

# Count plot with hue: bin sepal_length into three equal-width groups
iris["sepal_length_bin"] = pd.cut(iris["sepal_length"], bins=3, labels=["Short", "Medium", "Long"])

plt.figure(figsize=(8, 6))
sns.countplot(data=iris, x="species", hue="sepal_length_bin", palette="viridis")
plt.title("Count Plot with Hue (Sepal Length Bin)", fontsize=14)
plt.xlabel("Species")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
```


## R Code

```{r}
library(readr)
library(ggplot2)
library(dplyr)
library(viridis)

# Load dataset
iris <- read_csv("data/iris.csv")

# Base count plot
ggplot(iris, aes(x = species)) +
  geom_bar(fill = "steelblue") +
  theme_minimal() +
  labs(title = "Count Plot of Iris Species", x = "Species", y = "Count")

# Count plot with hue: bin sepal_length
iris <- iris %>%
  mutate(sepal_length_bin = cut(sepal_length, breaks = 3, labels = c("Short", "Medium", "Long")))

ggplot(iris, aes(x = species, fill = sepal_length_bin)) +
  geom_bar(position = "dodge") +
  scale_fill_viridis_d(option = "C") +
  theme_minimal() +
  labs(title = "Count Plot with Hue (Sepal Length Bin)",
       x = "Species", y = "Count", fill = "Sepal Length Bin")
```

---

> ✅ **Count plots** provide a clear and colorful summary of category sizes. Using hue and palettes enhances clarity, making group comparisons more informative and visually appealing.

# How do you compare group distributions using a boxplot?

## Explanation

A **boxplot** is a standard way to visualize the distribution of a numerical variable across categories. It summarizes key statistics:

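- **Median** (line inside the box)
- **First and third quartiles** (box edges; the box height is the interquartile range, IQR)
- **Whiskers** (by default in seaborn and ggplot2, the most extreme values within 1.5 × IQR of the box edges)
- **Outliers** (points beyond the whiskers)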

Boxplots are ideal for detecting:

- Differences in **central tendency**
- Variation in **spread**
- Presence of **outliers**
- Asymmetry or skewness in the distribution

Adding color and overlaying raw data (e.g., strip plots) improves interpretability.
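
The statistics a boxplot draws can also be computed by hand, which helps when reading one. This sketch uses made-up values and NumPy's default percentile method; plotting libraries may use slightly different quantile definitions, so treat the numbers as illustrative.

```python
import numpy as np

# Hypothetical measurements with one unusually large value
values = np.array([4.9, 5.0, 5.1, 5.4, 5.8, 6.0, 6.3, 9.5])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme observations within 1.5 * IQR of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1, median, Q3:", q1, median, q3)
print("Whiskers:", whisker_low, whisker_high)
print("Outliers:", outliers)
```

Here the upper whisker stops at 6.3 rather than at the maximum, and 9.5 is reported as an outlier.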

---

## Python Code




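```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
iris = pd.read_csv("data/iris.csv")

# Set style
sns.set(style="whitegrid")

# Basic boxplot; hue is set so the palette applies without a deprecation warning
plt.figure(figsize=(8, 6))
sns.boxplot(data=iris, x="species", y="sepal_length", hue="species", palette="viridis", dodge=False)
plt.title("Boxplot: Sepal Length by Species", fontsize=14)
plt.xlabel("Species")
plt.ylabel("Sepal Length")
plt.legend([], [], frameon=False)  # Hides the redundant legend
plt.tight_layout()
plt.show()

# Boxplot with overlaid strip plot (raw points)
plt.figure(figsize=(8, 6))
# showfliers=False avoids drawing outliers twice (once by the box, once by the strip plot)
sns.boxplot(data=iris, x="species", y="sepal_length", hue="species", palette="viridis",
            dodge=False, width=0.6, showfliers=False)
sns.stripplot(data=iris, x="species", y="sepal_length", color="black", alpha=0.5, jitter=True)
plt.title("Boxplot with Raw Points: Sepal Length by Species", fontsize=14)
plt.xlabel("Species")
plt.ylabel("Sepal Length")
plt.legend([], [], frameon=False)  # Hides the redundant legend
plt.tight_layout()
plt.show()
```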




## R Code

```{r}
library(readr)
library(ggplot2)

# Load dataset
iris <- read_csv("data/iris.csv")

# Basic boxplot
ggplot(iris, aes(x = species, y = sepal_length)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  theme_minimal() +
  labs(title = "Boxplot: Sepal Length by Species", x = "Species", y = "Sepal Length")

# Boxplot with overlaid jittered points
ggplot(iris, aes(x = species, y = sepal_length)) +
  geom_boxplot(fill = "lightgreen", outlier.shape = NA) +
  # height = 0 keeps the measured values exact; only the horizontal position is jittered
  geom_jitter(color = "black", width = 0.2, height = 0, alpha = 0.5) +
  theme_minimal() +
  labs(title = "Boxplot with Raw Points: Sepal Length by Species", x = "Species", y = "Sepal Length")
```

---

> ✅ **Boxplots** offer a compact summary of distribution and spread for each category. When enhanced with color and raw points, they reveal both statistical structure and individual variation clearly.

# How do you compare distribution shape and summary stats using a violin plot?

## Explanation

A **violin plot** combines the benefits of a **boxplot** and a **density plot**. It shows:

- The **kernel density estimate** of the data distribution (mirrored on both sides)
- **Median** and **IQR** through an embedded boxplot
- The relative frequency of values through the **width** of the violin

This makes violin plots ideal when you want to explore both:

- **Shape and modality** of the distribution  
- **Statistical summaries** like median and quartiles

Using color palettes and overlaying a boxplot improves clarity and visual appeal.
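
The violin outline is a kernel density estimate, which can be sketched in a few lines of NumPy. The helper name `gaussian_kde_1d`, the synthetic data, and the bandwidth of 0.3 are illustrative assumptions; seaborn and ggplot2 choose a bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample (illustrative helper)."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

rng = np.random.default_rng(42)
samples = rng.normal(loc=5.8, scale=0.8, size=150)  # synthetic "sepal lengths"

grid = np.linspace(samples.min() - 3, samples.max() + 3, 400)
density = gaussian_kde_1d(samples, grid, bandwidth=0.3)

# A density curve encloses an area of approximately 1
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A violin mirrors this curve around a vertical axis, so wider regions mark values where observations are concentrated. A smaller bandwidth gives a bumpier outline; a larger one smooths detail away.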

---

## Python Code



```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
iris = pd.read_csv("data/iris.csv")

# Set theme
sns.set(style="whitegrid")

# Violin plot with boxplot in the center; hue is set so the palette applies without a warning
plt.figure(figsize=(8, 6))
sns.violinplot(data=iris, x="species", y="sepal_length", hue="species",
               inner="box", palette="Set2", dodge=False)
plt.title("Violin Plot with Boxplot: Sepal Length by Species", fontsize=14)
plt.xlabel("Species")
plt.ylabel("Sepal Length")
plt.legend([], [], frameon=False)  # Hides the redundant legend
plt.tight_layout()
plt.show()
```


## R Code

```{r}
library(readr)
library(ggplot2)

# Load dataset
iris <- read_csv("data/iris.csv")

# Violin plot with embedded boxplot
ggplot(iris, aes(x = species, y = sepal_length, fill = species)) +
  geom_violin(trim = FALSE, color = "gray40") +
  geom_boxplot(width = 0.1, color = "black", outlier.shape = NA) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "Violin Plot with Boxplot: Sepal Length by Species",
       x = "Species", y = "Sepal Length")
```

---

> ✅ **Violin plots** are powerful for visualizing both distribution shape and group-level statistics. The embedded boxplot helps interpret quartiles, while the violin shape reveals modality and spread.

# How do you visualize overlapping group distributions using a ridge plot?

## Explanation

A **ridge plot** (also called a **joyplot**) displays **smoothed density curves** for a numerical variable across different groups. The curves are stacked and partially overlapping, making it easy to:

- Compare the **shape** of distributions
- Detect **skewness**, **modality**, and **spread**
- Handle **many groups** in a compact space

These plots are especially useful in **Exploratory Data Analysis (EDA)** when you want to:

- Compare distributions across levels of a categorical variable
- Reveal subtle differences in group behavior
- Highlight the overall distribution pattern clearly
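
The defining feature of a ridge plot is the vertical offset between the density curves. The sketch below shows that layout with plain NumPy and matplotlib on synthetic groups; the helper `kde`, the group names, the bandwidth, and the `step` spacing are all illustrative choices rather than a standard recipe.

```python
import numpy as np
import matplotlib.pyplot as plt

def kde(samples, grid, bandwidth=0.25):
    """Gaussian kernel density estimate on a grid (illustrative helper)."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return (np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))).mean(axis=1)

# Synthetic groups standing in for species; the values are made up
rng = np.random.default_rng(0)
groups = {
    "group A": rng.normal(5.0, 0.35, 50),
    "group B": rng.normal(5.9, 0.50, 50),
    "group C": rng.normal(6.6, 0.60, 50),
}
names = list(groups)

grid = np.linspace(3.5, 9.0, 300)
step = 0.6  # vertical gap between baselines; smaller values increase overlap

fig, ax = plt.subplots(figsize=(8, 5))
# Draw the top row first so that lower rows are painted over it
for row in reversed(range(len(names))):
    baseline = row * step
    density = kde(groups[names[row]], grid)
    ax.fill_between(grid, baseline, baseline + density, alpha=0.7)

ax.set_yticks([row * step for row in range(len(names))])
ax.set_yticklabels(names)
ax.set_xlabel("Value")
ax.set_title("Ridge layout: one density per row, offset vertically")
fig.tight_layout()
plt.show()
```

Each group gets its own baseline, so the y-axis identifies the group instead of measuring density. That is why ridge plots stay readable with many groups.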

---

## Python Code


```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
iris = pd.read_csv("data/iris.csv")

# Set theme
sns.set(style="white")

# seaborn has no dedicated ridge plot function, so this approximates one by
# overlaying a filled KDE curve per species on a shared axis
# (a true ridge plot would also offset each curve vertically)
plt.figure(figsize=(8, 6))

for species in iris["species"].unique():
    subset = iris[iris["species"] == species]
    sns.kdeplot(
        x=subset["sepal_length"],
        fill=True,
        label=species,
        linewidth=1.5,
        alpha=0.7,
        clip=(4, 8),
    )

plt.title("Ridge-style KDE Plot: Sepal Length by Species", fontsize=14)
plt.xlabel("Sepal Length")
plt.ylabel("Density")
plt.legend(title="Species")
plt.tight_layout()
plt.show()
```


## R Code

```{r}
library(readr)
library(ggplot2)
library(ggridges)
library(viridis)

# Load dataset
iris <- read_csv("data/iris.csv")

# Ridge plot using ggridges
ggplot(iris, aes(x = sepal_length, y = species, fill = species)) +
  geom_density_ridges(scale = 1.2, alpha = 0.7, color = "white") +
  scale_fill_viridis_d(option = "D") +
  theme_minimal() +
  labs(title = "Ridge Plot: Sepal Length by Species",
       x = "Sepal Length", y = "Species")
```

---

> ✅ **Ridge plots** provide a smooth, elegant comparison of multiple distributions. They're especially useful when working with several categories and aiming to uncover differences in shape or spread.

# How do you display individual data points by category using a swarm plot?

## Explanation

A **swarm plot** displays individual data points while spacing them to avoid overlap. Unlike strip plots, which jitter points randomly and can still leave points on top of each other, swarm plots shift each point along the categorical axis just far enough that it does not overlap its neighbors.

They are especially helpful when:

- The dataset is **small to medium-sized**
- You want to show **raw observations**
- Identifying **clusters**, **gaps**, or **outliers** is important

Combining swarm plots with color (`hue`) and category grouping enhances clarity and storytelling.

---

## Python Code




```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
iris = pd.read_csv("data/iris.csv")

# Set style
sns.set(style="whitegrid")

# Swarm plot
plt.figure(figsize=(8, 6))
sns.swarmplot(data=iris, x="species", y="sepal_length", hue="species", palette="Set2", dodge=False, size=6)
plt.title("Swarm Plot: Sepal Length by Species", fontsize=14)
plt.xlabel("Species")
plt.ylabel("Sepal Length")
plt.tight_layout()
plt.show()
```


## R Code

```{r}
library(readr)
library(ggplot2)
library(ggbeeswarm)

# Load dataset
iris <- read_csv("data/iris.csv")

# Swarm-style plot using ggbeeswarm::geom_quasirandom
# (geom_beeswarm() from the same package gives a strict non-overlapping layout)
ggplot(iris, aes(x = species, y = sepal_length, color = species)) +
  geom_quasirandom(size = 2.5, width = 0.25) +
  scale_color_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "Swarm Plot: Sepal Length by Species",
       x = "Species", y = "Sepal Length")
```

---

> ✅ **Swarm plots** reveal individual data points without overlap, making them ideal for exploring real observations, spotting outliers, and understanding group patterns in moderate-sized datasets.

# How do you show raw observations by group using a strip plot?

## Explanation

A **strip plot** is a simple yet powerful way to show every individual data point for a numerical variable grouped by a categorical variable. Unlike boxplots or violin plots, which summarize data, strip plots highlight **raw measurements**.

They are best used when:

- You want **complete visibility** of individual observations  
- Your dataset is **small or moderate** in size  
- You want to explore **variation and outliers** without summary overlays

Adding **jitter** (slight random displacement) and using vibrant **palettes** makes the visualization more readable and visually engaging.
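
Jitter itself is simple: each point's position on the categorical axis gets a small random offset. The sketch below shows the idea with NumPy on made-up group positions; the width of 0.2 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)

# Category positions on the x-axis: three groups encoded as 0, 1, 2 (made-up data)
x_positions = np.repeat([0, 1, 2], 5)

# Jitter: shift each point sideways by a small uniform amount
jitter_width = 0.2
x_jittered = x_positions + rng.uniform(-jitter_width, jitter_width, size=x_positions.size)

print(np.round(x_jittered, 2))
```

Only the categorical axis is jittered here, so the measured values on the other axis stay exact. Keeping the width well below 0.5 ensures points from neighboring categories never mix.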

---

## Python Code




```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
iris = pd.read_csv("data/iris.csv")

# Set style
sns.set(style="whitegrid")

# Warning-free strip plot with hue and palette
plt.figure(figsize=(8, 6))
sns.stripplot(
    data=iris,
    x="species",
    y="sepal_length",
    hue="species",
    jitter=True,
    palette="Set2",
    dodge=False,
    size=6,
    alpha=0.8
)
plt.title("Strip Plot: Sepal Length by Species", fontsize=14)
plt.xlabel("Species")
plt.ylabel("Sepal Length")
plt.legend([],[], frameon=False)  # Hides duplicate legend
plt.tight_layout()
plt.show()
```

## R Code

```{r}
library(readr)
library(ggplot2)

# Load dataset
iris <- read_csv("data/iris.csv")

# Strip plot with jitter and color
ggplot(iris, aes(x = species, y = sepal_length, color = species)) +
  # height = 0 keeps the measured values exact; only the horizontal position is jittered
  geom_jitter(width = 0.2, height = 0, size = 2.5, alpha = 0.8) +
  scale_color_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "Strip Plot: Sepal Length by Species",
       x = "Species", y = "Sepal Length")
```

---

> ✅ **Strip plots** offer a direct view of all data points in each category. They are perfect for spotting data spread, clusters, or outliers—especially when combined with color and jitter for clarity.

# How do you show group means and variability using a bar plot?

## Explanation

A **bar plot with error bars** summarizes numerical data by showing:

- **Mean** value for each category as a bar  
- **Error bars** representing variability (e.g., standard deviation or standard error)

This is ideal for comparing central tendencies between groups, especially when:

- You’ve already summarized your data  
- You want to **highlight differences in means**  
- Distribution details (e.g., skewness, modality) are **less important**  

Mapping the grouping variable to `hue` together with a palette colors each bar by group and avoids seaborn's deprecation warning for passing `palette` without `hue`.
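
Standard deviation and standard error answer different questions, so it helps to compute both before choosing what the error bars show. This is a minimal pandas sketch on made-up values; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical measurements for two groups
df = pd.DataFrame({
    "group": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "value": [4.0, 5.0, 6.0, 5.0, 7.0, 9.0, 8.0, 8.0],
})

summary = df.groupby("group")["value"].agg(mean="mean", sd="std", n="count")

# Standard error of the mean: SD divided by the square root of the group size
summary["se"] = summary["sd"] / summary["n"] ** 0.5
print(summary)
```

The SD describes how spread out the observations are, while the SE describes how precisely the mean is estimated and shrinks as the group grows. In seaborn 0.12 or later, `errorbar="sd"` and `errorbar="se"` select between the two.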

---

## Python Code




```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
iris = pd.read_csv("data/iris.csv")

# Set style
sns.set(style="whitegrid")

# Bar plot with error bars (standard deviation), hue to avoid warnings
plt.figure(figsize=(8, 6))
sns.barplot(
    data=iris,
    x="species",
    y="sepal_length",
    hue="species",
    palette="Set2",
    errorbar="sd",  # replaces the deprecated ci="sd" (seaborn >= 0.12)
    capsize=0.1
)
plt.title("Bar Plot with Error Bars: Sepal Length by Species", fontsize=14)
plt.xlabel("Species")
plt.ylabel("Mean Sepal Length ± SD")
plt.legend([],[], frameon=False)  # Hides duplicate legend
plt.tight_layout()
plt.show()
```


## R Code

```{r}
library(readr)
library(ggplot2)
library(dplyr)

# Load dataset
iris <- read_csv("data/iris.csv")

# Summarize mean and standard deviation
summary_df <- iris %>%
  group_by(species) %>%
  summarise(
    mean_val = mean(sepal_length),
    sd_val = sd(sepal_length),
    .groups = "drop"
  )

# Bar plot with error bars
ggplot(summary_df, aes(x = species, y = mean_val, fill = species)) +
  geom_bar(stat = "identity", width = 0.6) +
  geom_errorbar(aes(ymin = mean_val - sd_val, ymax = mean_val + sd_val), width = 0.2) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "Bar Plot with Error Bars: Sepal Length by Species",
       x = "Species", y = "Mean Sepal Length ± SD")
```

---

> ✅ **Bar plots** with error bars are great for summarizing group-level differences in means. Using `hue` or `fill` ensures correct color mapping and avoids warnings in modern plotting libraries.

# How do you show group summaries using a dot plot?

## Explanation

A **dot plot** is a simple yet effective way to compare **group-level summary statistics**, such as the **mean price** of diamonds per quality grade. It’s particularly helpful when:

- You want to emphasize **central values** without clutter  
- The number of groups is **moderate**  
- You want a clean alternative to a bar chart  

Dot plots are enhanced with color, size, and error bars for visual clarity. They're perfect for **summary comparisons** like mean or median ± standard deviation.

---

## Python Code




```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
diamonds = pd.read_csv("data/diamonds_sample.csv")

# Compute group summary
summary_df = diamonds.groupby("cut", as_index=False).agg(
    mean_price=("price", "mean"),
    sd_price=("price", "std")
)

# Create dot plot
sns.set(style="whitegrid")
plt.figure(figsize=(8, 6))
sns.pointplot(
    data=summary_df,
    x="cut",
    y="mean_price",
    errorbar=None,
    linestyle="none",  # no connecting line (seaborn >= 0.13; older versions use join=False)
    markers="o"
)

# Add error bars manually; x positions follow the row order of summary_df
plt.errorbar(
    x=range(len(summary_df)),
    y=summary_df["mean_price"],
    yerr=summary_df["sd_price"],
    fmt='none',
    capsize=5,
    color='black'
)

plt.title("Dot Plot with Error Bars: Diamond Price by Cut", fontsize=14)
plt.xlabel("Cut")
plt.ylabel("Mean Price ± SD")
plt.tight_layout()
plt.show()
```



## R Code

```{r}
library(readr)
library(ggplot2)
library(dplyr)

# Load dataset
diamonds <- read_csv("data/diamonds_sample.csv")

# Compute mean and SD
summary_df <- diamonds %>%
  group_by(cut) %>%
  summarise(
    mean_price = mean(price),
    sd_price = sd(price),
    .groups = "drop"
  )

# Dot plot with error bars
ggplot(summary_df, aes(x = cut, y = mean_price, color = cut)) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = mean_price - sd_price, ymax = mean_price + sd_price), width = 0.2) +
  scale_color_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "Dot Plot with Error Bars: Diamond Price by Cut",
       x = "Cut", y = "Mean Price ± SD")
```

---

> ✅ **Dot plots** are a compact, precise way to compare summary statistics across groups. When used with color and error bars, they highlight differences in group means with clarity and elegance.

# How do you show frequency patterns using a histogram?

## Explanation

A **histogram** is used to show the **frequency distribution** of a numerical variable by grouping values into bins. It helps you:

- Understand the **range and shape** of a distribution  
- Detect **skewness** or **multi-modality**  
- Compare **group-level differences** using color or faceting

For grouped comparisons (e.g., price by cut), it's common to:

- Use **transparent fills** (alpha blending)  
- Use **facets** to separate overlapping plots  
- Choose appropriate **bin width** and palettes
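
The number of bins changes how a histogram looks, so it is worth comparing a fixed count with a data-driven rule. This sketch uses synthetic right-skewed values in place of real prices and NumPy's built-in Freedman-Diaconis option (`bins="fd"`); neither choice is a general recommendation.

```python
import numpy as np

rng = np.random.default_rng(1)
prices = rng.lognormal(mean=8.0, sigma=0.8, size=500)  # synthetic right-skewed "prices"

# Fixed number of equal-width bins
counts_10, edges_10 = np.histogram(prices, bins=10)

# Data-driven bin edges using the Freedman-Diaconis rule
edges_fd = np.histogram_bin_edges(prices, bins="fd")
counts_fd, _ = np.histogram(prices, bins=edges_fd)

print(len(edges_10) - 1, "bins vs", len(edges_fd) - 1, "bins")
```

Both versions count every observation exactly once; only the resolution differs. Too few bins hide skewness and multiple peaks, while too many turn the histogram into noise.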

---

## Python Code




```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
diamonds = pd.read_csv("data/diamonds_sample.csv")

# Set style
sns.set(style="whitegrid")

# Histogram with hue
plt.figure(figsize=(10, 6))
sns.histplot(data=diamonds, x="price", hue="cut", element="step", stat="density", common_norm=False,
             palette="Set2", bins=50, alpha=0.6)
plt.title("Histogram of Diamond Price by Cut", fontsize=14)
plt.xlabel("Price")
plt.ylabel("Density")
plt.tight_layout()
plt.show()
```


## R Code

```{r}
library(readr)
library(ggplot2)

# Load dataset
diamonds <- read_csv("data/diamonds_sample.csv")

# Histogram with color fill and transparency
ggplot(diamonds, aes(x = price, fill = cut)) +
  geom_histogram(position = "identity", bins = 50, alpha = 0.6, color = "black") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "Histogram of Diamond Price by Cut",
       x = "Price", y = "Count")
```

---

> ✅ **Histograms** are ideal for visualizing frequency and shape. By using color or faceting, you can explore how distributions vary across groups like diamond cut.

# How do you visualize a smoothed distribution with a density plot?

## Explanation

A **density plot** shows the **probability density** of a continuous variable using a smoothed curve. Compared to histograms, it offers:

- A more **refined view** of distribution shape  
- Insight into **skewness**, **peaks**, and **spread**  
- Easy group comparison using **hue** or **facets**

For grouped densities (e.g., price by cut), you can:

- Use **fill and hue** for comparison  
- Overlay multiple groups for contrast  
- Normalize to show **relative densities**

---

## Python Code




```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
diamonds = pd.read_csv("data/diamonds_sample.csv")

# Set style
sns.set(style="white")

# Density plot with hue
plt.figure(figsize=(10, 6))
sns.kdeplot(data=diamonds, x="price", hue="cut", fill=True, alpha=0.6, palette="Set2", common_norm=False)
plt.title("Density Plot of Diamond Price by Cut", fontsize=14)
plt.xlabel("Price")
plt.ylabel("Density")
plt.tight_layout()
plt.show()
```


## R Code

```{r}
library(readr)
library(ggplot2)

# Load dataset
diamonds <- read_csv("data/diamonds_sample.csv")

# Density plot
ggplot(diamonds, aes(x = price, fill = cut)) +
  geom_density(alpha = 0.6) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "Density Plot of Diamond Price by Cut",
       x = "Price", y = "Density")
```

---

> ✅ **Density plots** provide a smooth view of distributions. They are excellent for comparing shape, spread, and modality across groups—especially when overlaid with vibrant palettes.

# How do you visualize two categorical variables with a grouped bar plot?

## Explanation

A **grouped bar plot** allows you to compare two categorical variables simultaneously by showing side-by-side bars within each group.

This is ideal when:

- You want to analyze **proportions or counts** across two categorical dimensions  
- You need a clean comparison without stacking  
- Each group has a **manageable number of levels**
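
Before plotting, the counts and proportions behind a grouped bar plot can be inspected as a cross-tabulation. This sketch uses a tiny made-up table; the column names mirror the diamonds data, but the rows are hypothetical.

```python
import pandas as pd

# Hypothetical records with two categorical columns
df = pd.DataFrame({
    "clarity": ["SI1", "SI1", "VS1", "VS1", "VS1", "SI1"],
    "cut": ["Ideal", "Good", "Ideal", "Ideal", "Good", "Ideal"],
})

# Counts for every clarity/cut combination
counts = pd.crosstab(df["clarity"], df["cut"])

# Proportions within each clarity level (each row sums to 1)
shares = pd.crosstab(df["clarity"], df["cut"], normalize="index")

print(counts)
print(shares.round(2))
```

Each row of `counts` corresponds to one cluster of side-by-side bars, and `counts.plot(kind="bar")` draws exactly that grouped layout.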

---

## Python Code




```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
diamonds = pd.read_csv("data/diamonds_sample.csv")

# Set style
sns.set(style="whitegrid")

# Grouped bar plot: count of cut across clarity levels
plt.figure(figsize=(10, 6))
sns.countplot(data=diamonds, x="clarity", hue="cut", palette="Set2")
plt.title("Grouped Bar Plot: Diamond Cut by Clarity", fontsize=14)
plt.xlabel("Clarity")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
```


## R Code

```{r}
library(readr)
library(ggplot2)

# Load dataset
diamonds <- read_csv("data/diamonds_sample.csv")

# Grouped bar plot
ggplot(diamonds, aes(x = clarity, fill = cut)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "Grouped Bar Plot: Diamond Cut by Clarity",
       x = "Clarity", y = "Count")
```

---

> ✅ **Grouped bar plots** allow clean comparison across two categorical dimensions. They’re especially useful for understanding distribution patterns in grouped data.

# How do you visualize trends across ordered groups using a line plot?

## Explanation

A **line plot** is typically used for time series, but you can also use it to show changes over any **ordered numeric** variable. In this case, we’ll group diamonds by **carat bins** and compute the **mean price**.

Connecting the bin means with a line reveals how the **average price changes with carat**, even though carat is not a time variable.

This type of plot is useful for:

- Showing **trends** or **gradual change** across bins  
- Comparing **multiple features** over a common x-axis  
- Visualizing **aggregated patterns** from large datasets  

---

## Python Code




```python
import pandas as pd
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv("data/diamonds_sample.csv")

# Bin carat into equal-width intervals
df["carat_bin"] = pd.cut(df["carat"], bins=10)

# Compute mean price per carat bin (observed=True skips bins with no diamonds)
mean_price = df.groupby("carat_bin", observed=True)["price"].mean().reset_index()

# Convert bin labels to numeric midpoints for plotting
mean_price["carat_mid"] = mean_price["carat_bin"].apply(lambda x: x.mid).astype(float)

# Plot
plt.plot(mean_price["carat_mid"], mean_price["price"], marker="o")
plt.xlabel("Carat (binned)")
plt.ylabel("Mean Price")
plt.title("Trend of Price by Carat (Binned)")
plt.grid(True)
plt.tight_layout()
plt.show()
```


## R Code
```{r}
library(readr)
library(dplyr)
library(ggplot2)

# Load data
df <- read_csv("data/diamonds_sample.csv")

# Bin carat into equal-width intervals and compute mean price
df_summary <- df %>%
  mutate(carat_bin = cut(carat, breaks = 10)) %>%
  group_by(carat_bin) %>%
  summarise(mean_price = mean(price), .groups = "drop")

# Convert interval labels "(lower,upper]" to numeric midpoints: (lower + upper) / 2
df_summary <- df_summary %>%
  mutate(carat_mid = (as.numeric(sub("\\((.+),.+\\]", "\\1", carat_bin)) +
                      as.numeric(sub(".+,(.+)\\]", "\\1", carat_bin))) / 2)

# Plot
ggplot(df_summary, aes(x = carat_mid, y = mean_price)) +
  geom_line() +
  geom_point() +
  labs(
    x = "Carat (binned)",
    y = "Mean Price",
    title = "Trend of Price by Carat (Binned)"
  ) +
  theme_minimal()
```
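
To make the binning step concrete, here is a minimal standard-library sketch of equal-width bins, bin midpoints, and per-bin means. The `(carat, price)` pairs are hypothetical, and the binning is simplified: `pd.cut` uses right-closed intervals and pads the range slightly, so its edges will differ a little.

```python
from statistics import mean

# Hypothetical (carat, price) pairs; illustrative values only
points = [(0.25, 400), (0.5, 1000), (0.75, 2200), (1.0, 4400),
          (1.5, 9000), (1.75, 11000), (2.0, 14000), (2.25, 18000)]

n_bins = 2
lowest = min(carat for carat, _ in points)
highest = max(carat for carat, _ in points)
width = (highest - lowest) / n_bins

# Assign each point to a bin index; the maximum value falls in the last bin
prices_by_bin = {}
for carat, price in points:
    idx = min(int((carat - lowest) / width), n_bins - 1)
    prices_by_bin.setdefault(idx, []).append(price)

# One (bin midpoint, mean price) pair per bin: these are the plotted points
summary = [
    (lowest + (idx + 0.5) * width, mean(prices))
    for idx, prices in sorted(prices_by_bin.items())
]
print(summary)
```

Plotting the midpoints against the means, in order, gives the line shown in the charts above.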

> ✅ **Line plots** are great for visualizing aggregated trends over an ordered numeric variable — not just time. Binning continuous values helps reveal smooth relationships when raw scatterplots are noisy.

# How do you visualize trends for multiple groups using a line plot?

## Explanation

A **line plot** is ideal for visualizing trends over an ordered variable. In this example, we compute the **average sepal length** for each **species** across **sepal width bins**.

This helps reveal **group-specific patterns** — for example, whether one species has consistently longer sepals as sepal width increases.

---

## Python Code



```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv("data/iris.csv")

# Bin sepal_width to simulate order
df["width_bin"] = pd.cut(df["sepal_width"], bins=5)

# Group by bin and species, compute mean sepal length
# (observed=False keeps empty bins and avoids a pandas FutureWarning)
grouped = (
    df.groupby(["width_bin", "species"], observed=False)["sepal_length"]
    .mean()
    .reset_index()
)

# Convert bin to string for plotting
grouped["width_bin"] = grouped["width_bin"].astype(str)

# Line plot
plt.figure(figsize=(8, 5))
sns.lineplot(data=grouped, x="width_bin", y="sepal_length", hue="species", marker="o")
plt.title("Average Sepal Length by Sepal Width Bin and Species")
plt.xlabel("Sepal Width Bin")
plt.ylabel("Mean Sepal Length")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```


## R Code
```{r}
library(dplyr)
library(ggplot2)
library(readr)

# Load dataset
df <- read_csv("data/iris.csv")

# Bin sepal width into 5 intervals
df <- df %>%
  mutate(width_bin = cut(sepal_width, breaks = 5))

# Compute mean sepal length by bin and species
grouped <- df %>%
  group_by(width_bin, species) %>%
  summarise(mean_length = mean(sepal_length), .groups = "drop")

# Line plot
ggplot(grouped, aes(x = width_bin, y = mean_length, group = species, color = species)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Average Sepal Length by Sepal Width Bin and Species",
    x = "Sepal Width Bin",
    y = "Mean Sepal Length"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

> ✅ **Takeaway:**  
> Line plots with **grouped trends** help you compare patterns side by side — especially when studying how one variable behaves across subgroups. Use color or faceting to highlight these comparisons.

# How do you show overall trend patterns using a smoothed line?

## Explanation

A **smoothed trend line** is used to show the underlying relationship between two continuous variables. It's useful when data is noisy and you want to see:

- General **direction of change**  
- Nonlinear **patterns**  
- Local **averages** (using LOESS or regression smoothing)

---

## Python Code




```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
diamonds = pd.read_csv("data/diamonds_sample.csv")

# Smoothed trend line (lowess=True requires the statsmodels package)
plt.figure(figsize=(10, 6))
sns.regplot(
    data=diamonds, x="carat", y="price", lowess=True,
    scatter_kws={"alpha": 0.4}, line_kws={"color": "#D55E00"},
)
plt.title("Smoothed Trend: Carat vs Price", fontsize=14)
plt.xlabel("Carat")
plt.ylabel("Price")
plt.tight_layout()
plt.show()
```


## R Code

```{r}
library(readr)
library(ggplot2)

# Load dataset
diamonds <- read_csv("data/diamonds_sample.csv")

# LOESS smoothed line
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", se = TRUE, color = "#D55E00") +
  theme_minimal() +
  labs(title = "Smoothed Trend: Carat vs Price",
       x = "Carat", y = "Price")
```
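
The idea of a "local average" can be illustrated with a centered moving average. This is a much cruder stand-in for LOESS (which fits weighted local regressions), shown here only as a sketch on hypothetical values:

```python
from statistics import mean

# Hypothetical noisy y-values, already sorted by x; illustrative only
y = [2, 8, 4, 10, 6, 12, 8, 14]

def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - half)
        stop = min(len(values), i + half + 1)
        smoothed.append(mean(values[start:stop]))
    return smoothed

smooth = moving_average(y)
print([round(v, 2) for v in smooth])
```

The smoothed series rises steadily while the raw series zigzags, which is the same effect a LOESS curve has on a noisy scatter plot.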

---

> ✅ **Smoothed trend lines** reveal underlying patterns in noisy data. Use them to identify nonlinear growth or saturation points that raw scatter plots may hide.

# (PART) PATTERN RECOGNITION AND RELATIONSHIPS {-}

# How do you visualize patterns and relationships in multivariate data?

## Explanation

Once you've explored **individual variables** and **group-based comparisons**, the next step is to examine how variables relate to one another across the entire dataset. This enables you to uncover:

- **Patterns** in how multiple features interact  
- **Clustering** or separation between groups (e.g., diamond cuts)  
- **Correlations** that indicate redundancy or strong associations  

Understanding these relationships is essential for:

- **Feature selection** — identifying which variables offer unique insight  
- **Model design** — anticipating relationships a model might capture  
- **Data structure** — assessing whether groups are well-separated or overlapping  

### Key tools for visualizing relationships

| Method                                      | Purpose                                                    |
|---------------------------------------------|-------------------------------------------------------------|
| **Pair plots**                              | Visualize all-vs-all numeric relationships                  |
| **Facet plots (e.g., histograms, KDEs)**    | Compare distributions side by side across group levels      |
| **Scatter plots with trend lines**          | Show numeric relationships with group coloring and smoothing|
| **Heatmaps**                                | Quantify strength of correlation between features           |
| **Parallel coordinates**                    | View high-dimensional feature profiles per case             |
| **Dimensionality reduction (PCA, UMAP, t-SNE)** | Project complex data into 2D to visualize structure       |

---

### 👇 Core Questions Explored in This Section

- **How do you uncover relationships between multiple variables using a pair plot?**  
- **How do you compare distributions across groups using facet plots?**  
- **How do you enhance scatter plots by adding group color and trend lines?**  
- **How do you quantify linear relationships between numerical variables using a correlation heatmap?**  
- **How do you visualize patterns across multiple numeric features using a parallel coordinates plot?**  
- **How do you uncover structure in high-dimensional data using a PCA plot?**  
- **How do you visualize clustering patterns in high-dimensional data using a t-SNE plot?**  
- **How do you explore complex patterns in high-dimensional data using a UMAP plot?**

> Each method helps reveal a different aspect of your dataset's internal structure. Proceed through the Q&A to explore them interactively.

# How do you uncover relationships between multiple variables using a pair plot?

## Explanation

A **pair plot** (or scatterplot matrix) allows you to explore relationships between several numeric variables at once. It shows:

- **Scatter plots** for every variable pair  
- **Histograms** or **density plots** on the diagonal  
- Optional color (`hue`) to separate groups

It’s useful for spotting **correlations**, **clusters**, and **outliers** in multivariate data.

---

## Python Code



```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
diamonds = pd.read_csv("data/diamonds_sample.csv")

# Select relevant numeric variables + categorical hue
subset = diamonds[["carat", "depth", "table", "price", "cut"]]

# Pair plot
sns.set(style="ticks")
sns.pairplot(subset, hue="cut", palette="Set2", corner=True)
plt.suptitle("Pair Plot of Diamond Attributes by Cut", y=1.02)
plt.show()
```



## R Code

```{r}
library(readr)
library(GGally)
library(dplyr)

# Load dataset
diamonds <- read_csv("data/diamonds_sample.csv")

# Pair plot
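# Select columns by name so the plot does not depend on column order in the CSV
ggpairs(diamonds, aes(color = cut), columns = c("carat", "depth", "table", "price"),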
        upper = list(continuous = wrap("points", alpha = 0.5)),
        diag = list(continuous = wrap("densityDiag")),
        lower = list(continuous = wrap("smooth", alpha = 0.3))) +
  theme_minimal()
```

---

> ✅ **Pair plots** are ideal for detecting multivariate patterns. Using color (`hue`) reveals how groups differ in structure and correlation.


# How do you compare distributions across groups using facet plots?

## Explanation

**Facet plots** allow you to split your data into **multiple panels** based on a categorical variable, making it easier to compare group-specific distributions or relationships without overlap.

Unlike standard plots that layer everything into one axis, facet plots:
- Create **one plot per group**, arranged side by side or in a grid  
- Highlight **differences in shape, spread, or skew** between categories  
- Work well with **histograms**, **KDE plots**, **scatter plots**, and more

They’re useful for:
- Comparing **distributions** (e.g., KDE or histograms across species or cut)
- Analyzing **trends** across subgroups
- Preventing overplotting in **dense datasets**

---

## Python Code


```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load data
diamonds = pd.read_csv("data/diamonds_sample.csv")

# Faceted histogram by 'cut'
sns.displot(data=diamonds, x="price", col="cut", bins=30, color="steelblue", aspect=0.8)
plt.tight_layout()
plt.show()

# Faceted KDE plot with fill
sns.displot(data=diamonds, x="price", col="cut", kind="kde", fill=True, height=3, aspect=1, color="orchid")
plt.tight_layout()
plt.show()
```


## R Code

```{r}
library(ggplot2)
library(readr)

# Load data
diamonds <- read_csv("data/diamonds_sample.csv")

# Histogram faceted by cut
ggplot(diamonds, aes(x = price)) +
  geom_histogram(fill = "steelblue", bins = 30, color = "white") +
  facet_wrap(~cut, scales = "free_y") +
  theme_minimal() +
  labs(title = "Price Distribution by Cut")

# KDE faceted by cut
ggplot(diamonds, aes(x = price)) +
  geom_density(fill = "orchid", alpha = 0.6) +
  facet_wrap(~cut, scales = "free_y") +
  theme_minimal() +
  labs(title = "Smoothed Price Distribution by Cut")
```

---

> ✅ **Facet plots** are ideal for comparing group-specific patterns across a categorical variable. They prevent clutter and make distribution differences easier to detect than overlapping in a single plot.

# How do you enhance scatter plots by adding group color and trend lines?

## Explanation

Scatter plots are a go-to tool for visualizing the relationship between two numerical variables. But they become far more insightful when enhanced with:

- **Group-based coloring** (e.g., by species or cut)
- **Trend lines** to show linear or nonlinear patterns
- **Smoothers** (like LOESS or regression fits)
- **Transparency** to handle overplotting in dense data

These enhancements help:
- Detect **direction and strength** of relationships
- Compare **group-level trends** side by side
- Spot **outliers** or overlapping clusters

---

## Python Code




```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load iris data
iris = pd.read_csv("data/iris.csv")

# Scatter with group color and regression lines
sns.lmplot(data=iris, x="sepal_length", y="petal_length", hue="species",
           palette="Set2", height=5, aspect=1.2, markers=["o", "s", "D"])
plt.title("Relationship Between Sepal Length and Petal Length by Species")
plt.tight_layout()
plt.show()
```

## R Code

```{r}
library(ggplot2)
library(readr)

# Load iris data
iris <- read_csv("data/iris.csv")

# Scatter with group color and regression lines
ggplot(iris, aes(x = sepal_length, y = petal_length, color = species)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(title = "Relationship Between Sepal Length and Petal Length by Species")
```
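
Each straight trend line in the plots above is an ordinary least-squares fit computed separately for one group. A minimal sketch of that calculation, assuming two hypothetical groups with made-up values (`fit_line` is an illustrative helper, not a library function):

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one group."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical groups; illustrative values only
groups = {
    "group_a": ([1, 2, 3, 4], [2, 4, 6, 8]),
    "group_b": ([1, 2, 3, 4], [5, 5.5, 6, 6.5]),
}

# One fitted line per group, as in a scatter plot colored by group
fits = {name: fit_line(xs, ys) for name, (xs, ys) in groups.items()}
for name, (slope, intercept) in fits.items():
    print(f"{name}: slope={slope:.2f}, intercept={intercept:.2f}")
```

Comparing the slopes across groups is what the colored trend lines let you do visually.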

> ✅ **Enhancing scatter plots** with color and trend lines reveals both overall relationships and how those relationships vary across groups — a key part of visual EDA.

# How do you quantify linear relationships between numerical variables using a correlation heatmap?

## Explanation

A **correlation heatmap** visually represents the strength and direction of linear relationships between numeric variables using **Pearson’s correlation coefficient** (r):

- Values range from **-1** (perfect negative) to **+1** (perfect positive)
- Darker or more saturated colors indicate stronger correlations
- Symmetric across the diagonal (correlation with self = 1)

It's a compact way to assess **multicollinearity**, **feature redundancy**, or **predictive potential**.

---

## Python Code



```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
diamonds = pd.read_csv("data/diamonds_sample.csv")

# Select numerical columns only
num_df = diamonds[["carat", "depth", "table", "price", "x", "y", "z"]]

# Compute correlation matrix
corr = num_df.corr(numeric_only=True)

# Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Correlation Heatmap of Diamond Variables", fontsize=14)
plt.tight_layout()
plt.show()
```


## R Code

```{r}
library(readr)
library(dplyr)  # provides %>% and select()
library(corrplot)

# Load dataset
diamonds <- read_csv("data/diamonds_sample.csv")

# Compute correlation matrix
num_vars <- diamonds %>% select(carat, depth, table, price, x, y, z)
corr_matrix <- cor(num_vars, use = "complete.obs")

# Plot correlation heatmap
corrplot(corr_matrix, method = "color", type = "upper", addCoef.col = "black",
         tl.cex = 0.8, number.cex = 0.7, col = colorRampPalette(c("blue", "white", "red"))(200))
```
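
Every cell in the heatmap is one Pearson coefficient. As a sketch of what that number measures, here is the formula written out with the standard library on small hypothetical vectors (`pearson_r` is an illustrative helper; in practice `DataFrame.corr()` and `cor()` do this for you):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values; illustrative only
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # y rises in step with x
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # y falls in step with x
r_flat = pearson_r(x, [3, 1, 4, 1, 3])   # no linear trend

print(round(r_up, 3), round(r_down, 3), round(r_flat, 3))
```

Note that a coefficient near zero only rules out a *linear* relationship; the third series is clearly patterned but has no linear trend.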

---

> ✅ **Correlation heatmaps** are a fast and effective way to explore linear relationships between numerical variables and flag redundant or strongly associated features.

# How do you visualize patterns across multiple numeric features using a parallel coordinates plot?

## Explanation

A **parallel coordinates plot** lets you explore high-dimensional patterns by plotting each feature on a vertical axis. Each observation is a line crossing all axes.

It helps you:

- Spot **group patterns** across all variables  
- Detect **outliers** and **overlaps**  
- Understand **feature trends** in classification problems

Coloring by category (e.g., `cut`) helps distinguish clusters or groups.

---

## Python Code




```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Load dataset
diamonds = pd.read_csv("data/diamonds_sample.csv")

# Sample for performance
subset = diamonds[["carat", "depth", "table", "price", "x", "y", "z", "cut"]].sample(300, random_state=42)

# Normalize numeric features
normalized = subset.copy()
for col in ["carat", "depth", "table", "price", "x", "y", "z"]:
    normalized[col] = (subset[col] - subset[col].min()) / (subset[col].max() - subset[col].min())

# Parallel coordinates plot
plt.figure(figsize=(12, 6))
parallel_coordinates(normalized, class_column="cut", color=["#66c2a5", "#fc8d62", "#8da0cb", "#e78ac3", "#a6d854"])
plt.title("Parallel Coordinates Plot: Diamond Features by Cut")
plt.ylabel("Normalized Value")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```


## R Code

```{r}
library(readr)
library(GGally)
library(dplyr)

# Load and sample dataset
diamonds <- read_csv("data/diamonds_sample.csv")
subset <- diamonds %>% select(carat, depth, table, price, x, y, z, cut) %>% sample_n(500)

# Normalize numeric columns
subset_norm <- subset %>%
  mutate(across(c(carat, depth, table, price, x, y, z), ~ (. - min(.)) / (max(.) - min(.))))

# Parallel coordinates plot
ggparcoord(data = subset_norm,
           columns = 1:7,
           groupColumn = 8,
           scale = "uniminmax",
           alphaLines = 0.5) +
  scale_color_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "Parallel Coordinates Plot: Diamond Features by Cut",
       x = "Features", y = "Normalized Value")
```
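
Both examples rescale every feature to the 0 to 1 range before plotting, because the axes share one vertical scale. A minimal sketch of why that matters, assuming two hypothetical columns on very different scales:

```python
def min_max(values):
    """Rescale to [0, 1]; assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical columns; illustrative values only
carat = [0.5, 1.0, 1.5, 2.5]
price = [500, 1500, 2500, 4500]

carat_scaled = min_max(carat)
price_scaled = min_max(price)
print(carat_scaled)
print(price_scaled)
```

Without rescaling, the `price` axis would dominate the plot and the `carat` lines would be squashed flat; after rescaling, both columns occupy the same range and their shapes can be compared directly.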

---

> ✅ **Parallel coordinates plots** help you detect patterns across multiple features simultaneously, especially when colored by category.

# How do you uncover structure in high-dimensional data using a PCA plot?

## Explanation

**Principal Component Analysis (PCA)** reduces high-dimensional data into 2 or 3 principal axes (components) that preserve the most variance. It helps:

- Reveal **clusters or overlaps** in feature space  
- Understand **group separation**  
- Prepare for **clustering or modeling**

It’s most useful for numeric data and can be colored by group (e.g., `cut`).

---

## Python Code




```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt

# Load and prepare data
diamonds = pd.read_csv("data/diamonds_sample.csv")
subset = diamonds[["carat", "depth", "table", "price", "x", "y", "z", "cut"]].sample(500, random_state=1)

# Standardize features
X = StandardScaler().fit_transform(subset.drop("cut", axis=1))

# PCA transformation
pca = PCA(n_components=2)
pca_result = pca.fit_transform(X)

# Combine with labels
pca_df = pd.DataFrame(pca_result, columns=["PC1", "PC2"])
pca_df["cut"] = subset["cut"].values

# Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(data=pca_df, x="PC1", y="PC2", hue="cut", palette="Set2", alpha=0.7)
plt.title("PCA Plot: Diamond Features Colored by Cut")
plt.tight_layout()
plt.show()
```


## R Code

```{r}
library(readr)
library(ggplot2)
library(dplyr)

# Load and sample
diamonds <- read_csv("data/diamonds_sample.csv")
subset <- diamonds %>% select(carat, depth, table, price, x, y, z, cut) %>% sample_n(500)

# PCA
features <- subset %>% select(-cut)
features_scaled <- scale(features)
pca_result <- prcomp(features_scaled)

# Combine for plotting
pca_df <- data.frame(pca_result$x[,1:2], cut = subset$cut)

# Plot
ggplot(pca_df, aes(x = PC1, y = PC2, color = cut)) +
  geom_point(alpha = 0.7) +
  scale_color_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "PCA Plot: Diamond Features by Cut")
```
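
How much a 2D PCA plot can be trusted depends on how much variance the plotted components carry. For just two features this can be worked out by hand from the eigenvalues of the 2×2 covariance matrix. The sketch below assumes standardized features and hypothetical correlation values; `explained_variance_2d` is an illustrative helper, not part of any library.

```python
from math import sqrt

def explained_variance_2d(var_x, var_y, cov_xy):
    """Share of total variance on PC1 and PC2 for the covariance
    matrix [[var_x, cov_xy], [cov_xy, var_y]] (closed-form eigenvalues)."""
    centre = (var_x + var_y) / 2
    spread = sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    eig1, eig2 = centre + spread, centre - spread
    total = eig1 + eig2
    return eig1 / total, eig2 / total

# Standardized features have variance 1, so the covariance equals the correlation r
shares = {r: explained_variance_2d(1.0, 1.0, r) for r in (0.0, 0.5, 0.9)}
for r, (pc1, pc2) in shares.items():
    print(f"r = {r}: PC1 share = {pc1:.2f}, PC2 share = {pc2:.2f}")
```

The more correlated the features, the more variance the first component captures. With scikit-learn, the same shares for a fitted model are available as `pca.explained_variance_ratio_`.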

---

> ✅ **PCA—Principal Component Analysis** reduces complexity while preserving patterns. When plotted in 2D, it can reveal clustering, separation, or overlap between groups.

# How do you visualize clustering patterns in high-dimensional data using a t-SNE plot?

## Explanation

**t-SNE (t-distributed Stochastic Neighbor Embedding)** is a nonlinear technique that transforms high-dimensional data into **2D or 3D** for visualization.

It excels at:
- Revealing **local clusters** and **grouping structures**
- Displaying **complex, non-linear relationships**
- Visualizing **high-dimensional feature space**

t-SNE works best on pre-scaled data and is often used after initial filtering or sampling due to its high computational cost.

---

## Python Code


```python
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt

# Load and sample data
diamonds = pd.read_csv("data/diamonds_sample.csv")
subset = diamonds[["carat", "depth", "table", "price", "x", "y", "z", "cut"]].sample(500, random_state=1)

# Standardize numeric features
X = StandardScaler().fit_transform(subset.drop("cut", axis=1))

# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, random_state=42)
embedding = tsne.fit_transform(X)

# Prepare dataframe for plotting
tsne_df = pd.DataFrame(embedding, columns=["TSNE1", "TSNE2"])
tsne_df["cut"] = subset["cut"].values

# Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(data=tsne_df, x="TSNE1", y="TSNE2", hue="cut", palette="Set2", alpha=0.7)
plt.title("t-SNE Projection of Diamond Features by Cut")
plt.tight_layout()
plt.show()
```



## R Code

```{r}
library(readr)
library(dplyr)
library(ggplot2)
library(Rtsne)

# Load and sample
diamonds <- read_csv("data/diamonds_sample.csv")
subset <- diamonds %>% select(carat, depth, table, price, x, y, z, cut) %>% sample_n(500)

# Standardize numeric features
X <- scale(subset %>% select(-cut))

# Apply t-SNE
set.seed(42)
tsne_result <- Rtsne(X, dims = 2, perplexity = 30)

# Combine with labels
tsne_df <- data.frame(tsne_result$Y)
tsne_df$cut <- subset$cut
colnames(tsne_df) <- c("TSNE1", "TSNE2", "cut")

# Plot
ggplot(tsne_df, aes(x = TSNE1, y = TSNE2, color = cut)) +
  geom_point(alpha = 0.7) +
  scale_color_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "t-SNE Projection of Diamond Features by Cut")
```

---

> ✅ **t-SNE** is powerful for uncovering group-level clusters in high-dimensional data. While slower than PCA or UMAP, it's excellent for detailed structure exploration in smaller samples.

# How do you explore complex patterns in high-dimensional data using a UMAP plot?

## Explanation

**UMAP (Uniform Manifold Approximation and Projection)** is a nonlinear technique that preserves **local** neighborhood structure and often retains more **global structure** than t-SNE, while capturing nonlinear patterns that PCA cannot. It's excellent for:

- Revealing **clusters**, **manifolds**, or **nonlinear groupings**  
- Visualizing **high-dimensional feature behavior** in 2D  
- Exploring potential for **classification** or **clustering**

---

## Python Code




```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
import umap  # provided by the umap-learn package
import seaborn as sns
import matplotlib.pyplot as plt

# Load and sample data
diamonds = pd.read_csv("data/diamonds_sample.csv")
subset = diamonds[["carat", "depth", "table", "price", "x", "y", "z", "cut"]].sample(500, random_state=1)

# Standardize numeric features
X = StandardScaler().fit_transform(subset.drop("cut", axis=1))

# Run UMAP
reducer = umap.UMAP(random_state=42)
embedding = reducer.fit_transform(X)

# Plot
umap_df = pd.DataFrame(embedding, columns=["UMAP1", "UMAP2"])
umap_df["cut"] = subset["cut"].values

plt.figure(figsize=(8, 6))
sns.scatterplot(data=umap_df, x="UMAP1", y="UMAP2", hue="cut", palette="Set2", alpha=0.7)
plt.title("UMAP Projection of Diamond Features by Cut")
plt.tight_layout()
plt.show()
```


## R Code (UMAP via `uwot`)

```{r}
library(readr)
library(dplyr)
library(ggplot2)
library(uwot)

# Load and sample
diamonds <- read_csv("data/diamonds_sample.csv")
subset <- diamonds %>% select(carat, depth, table, price, x, y, z, cut) %>% sample_n(500)

# Standardize features
X <- scale(subset %>% select(-cut))

# Apply UMAP
set.seed(42)
embedding <- umap(X, n_neighbors = 15, min_dist = 0.1)

# Combine with labels
umap_df <- data.frame(embedding, cut = subset$cut)
colnames(umap_df)[1:2] <- c("UMAP1", "UMAP2")

# Plot
ggplot(umap_df, aes(x = UMAP1, y = UMAP2, color = cut)) +
  geom_point(alpha = 0.7) +
  scale_color_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "UMAP Projection of Diamond Features by Cut")
```

---

> ✅ **UMAP** captures nonlinear patterns in complex datasets, helping you visualize hidden structure, group separations, and feature interactions that PCA may miss.

# How do you visualize simple proportions using a pie chart?

## Explanation

A **pie chart** represents parts of a whole as slices of a circle. Each slice's size is proportional to its value, making it easy to visualize **category proportions** at a glance.

- Best used when comparing a **small number of categories** (≤5)
- Labels or percentages should be clearly shown
- Not ideal for precise comparisons — bar charts are usually better

Use pie charts in:
- Survey responses (e.g., favorite colors, device usage)
- Market share or budget composition
- Simple storytelling visuals

## Python Code


```python
import pandas as pd
import matplotlib.pyplot as plt

data = pd.Series([40, 30, 20, 10], index=["A", "B", "C", "D"])
plt.figure(figsize=(5, 5))
data.plot.pie(autopct='%1.1f%%', startangle=90)
plt.title("Category Proportions")
plt.ylabel("")
plt.tight_layout()
plt.show()
```

## R Code

```{r fig.width=7, fig.height=7}
data <- c(A = 40, B = 30, C = 20, D = 10)
pie(data, main = "Category Proportions", col = rainbow(length(data)))
```
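
Behind the chart, each slice is just a share of the total converted to an angle. A minimal sketch of that arithmetic, reusing the four category values from the examples above:

```python
# Same category values as the pie chart examples
values = {"A": 40, "B": 30, "C": 20, "D": 10}
total = sum(values.values())

# Share of the whole and the corresponding slice angle in degrees
shares = {label: v / total for label, v in values.items()}
angles = {label: 360 * v / total for label, v in values.items()}

for label in values:
    print(f"{label}: {shares[label]:.1%} of the whole, {angles[label]:.0f} degrees")
```

Because readers judge angles less accurately than lengths, slices with similar shares are hard to tell apart, which is why bar charts are usually preferred for precise comparisons.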

> ✅ Pie charts are best used when showing a few categories and emphasizing part-to-whole relationships in a visually simple way.

# How do you create a donut chart to show part-to-whole proportions?

## Explanation

A **donut chart** is a variation of the pie chart with a central hole. It helps communicate part-to-whole relationships in a slightly more readable way than standard pie charts.

- Ideal for categorical variables with a few levels
- Central space can be used for annotations or percentages

## Python Code

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/iris.csv")
species_counts = df["species"].value_counts()
colors = plt.cm.Set2.colors

plt.pie(species_counts, labels=species_counts.index, colors=colors,
        autopct='%1.1f%%', startangle=90, wedgeprops={'width': 0.4})
plt.title("Iris Species Distribution")
plt.axis("equal")
plt.show()
```

## R Code

```{r}
library(ggplot2)
library(dplyr)

df <- readr::read_csv("data/iris.csv")
df_counts <- df %>%
  count(species) %>%
  mutate(prop = n / sum(n))

ggplot(df_counts, aes(x = 2, y = prop, fill = species)) +
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar(theta = "y") +
  xlim(0.5, 2.5) +
  theme_void() +
  # position_stack() centers each label in its own slice, matching the bar stacking order
  geom_text(aes(label = scales::percent(prop)),
            position = position_stack(vjust = 0.5), color = "white") +
  ggtitle("Iris Species Distribution (Donut Chart)")
```

> ✅ Donut charts are more stylish than pie charts, but they carry the same limitations—use them only for small, clear part-to-whole comparisons.

# How do you visualize hierarchical part-to-whole relationships using a treemap?

## Explanation

A **treemap** displays hierarchical data as nested rectangles, where:

- **Size** represents a value (e.g., frequency or count)
- **Color** can encode an additional group
- **Nested rectangles** reflect categories and subcategories


## Python Code — Interactive

```python
import pandas as pd
import plotly.express as px

data = pd.DataFrame({
    "group": ["Setosa", "Setosa", "Versicolor", "Versicolor", "Virginica", "Virginica"],
    "subgroup": ["Short", "Long", "Short", "Long", "Short", "Long"],
    "value": [20, 30, 25, 25, 15, 35]
})

fig = px.treemap(data, path=["group", "subgroup"], values="value", color="group",
                 title="Interactive Treemap of Iris Subgroups by Plotly")
fig.show()
```

> ⚠️ Interactive treemaps using plotly.express don’t appear in PDFs. Use static alternatives for printable outputs.

## Python Code — Static

For use in **PDF or non-browser reports**, static treemaps built with `squarify` are ideal.


```python
import pandas as pd
import matplotlib.pyplot as plt
import squarify

data = pd.DataFrame({
    "label": ["Setosa Short", "Setosa Long", "Versicolor Short", "Versicolor Long", "Virginica Short", "Virginica Long"],
    "value": [20, 30, 25, 25, 15, 35]
})

colors = plt.cm.viridis_r([i / float(len(data)) for i in range(len(data))])
plt.figure(figsize=(10, 6))
squarify.plot(sizes=data["value"], label=data["label"], color=colors, alpha=0.8)
plt.axis("off")
plt.title("Static Treemap of Iris Subgroups")
plt.tight_layout()
plt.show()
```

## R Code

```{r}
library(treemap)

data <- data.frame(
  group = c("Setosa", "Setosa", "Versicolor", "Versicolor", "Virginica", "Virginica"),
  subgroup = c("Short", "Long", "Short", "Long", "Short", "Long"),
  value = c(20, 30, 25, 25, 15, 35)
)

treemap(data, index = c("group", "subgroup"), vSize = "value",
        type = "index", title = "Treemap of Iris Subgroups")
```
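
In a hierarchical treemap, each parent rectangle is sized by the sum of its children. A minimal sketch of that roll-up, reusing the group, subgroup, and value entries from the examples above:

```python
# Same values as the treemap examples
rows = [
    ("Setosa", "Short", 20), ("Setosa", "Long", 30),
    ("Versicolor", "Short", 25), ("Versicolor", "Long", 25),
    ("Virginica", "Short", 15), ("Virginica", "Long", 35),
]

# Parent rectangle size = sum of its children
group_totals = {}
for group, _subgroup, value in rows:
    group_totals[group] = group_totals.get(group, 0) + value

grand_total = sum(group_totals.values())

# Fraction of the parent rectangle occupied by each child
child_share = {
    (group, subgroup): value / group_totals[group]
    for group, subgroup, value in rows
}

print(group_totals, grand_total)
print(child_share[("Virginica", "Long")])
```

Here all three parent rectangles have the same area, and the differences appear only in how each one is split between its "Short" and "Long" children.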

> ✅ Static treemaps are ideal for printed or PDF outputs and help visualize complex category structures compactly.

# How do you visualize overlaps using a Venn diagram?

## Explanation

A **Venn diagram** shows the overlap and differences between sets. It's useful for comparing shared vs unique elements among groups.

- Works best for **2 or 3 sets**
- Clearly shows commonality and uniqueness
- Use for membership, features, item inclusion

## Python Code



```python
from matplotlib_venn import venn2
import matplotlib.pyplot as plt

set1 = set(["A", "B", "C", "D"])
set2 = set(["C", "D", "E", "F"])

plt.figure(figsize=(6, 4))
venn2([set1, set2], set_labels=("Group 1", "Group 2"))
plt.title("Venn Diagram of Two Sets")
plt.show()
```

## R Code

```{r}
library(VennDiagram)

venn.plot <- draw.pairwise.venn(
  area1 = 4, area2 = 4, cross.area = 2,
  category = c("Group 1", "Group 2"),
  fill = c("lightblue", "pink"),
  ind = FALSE
)

grid.draw(venn.plot)
```
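
The R call takes region sizes (`area1`, `area2`, `cross.area`) rather than the sets themselves. A short sketch of how those numbers follow from the two sets used in the Python example, using only built-in set operations:

```python
set1 = {"A", "B", "C", "D"}
set2 = {"C", "D", "E", "F"}

shared = set1 & set2   # elements in both groups (the overlap)
only_1 = set1 - set2   # elements unique to Group 1
only_2 = set2 - set1   # elements unique to Group 2

# Sizes in the form expected by draw.pairwise.venn()
area1 = len(set1)
area2 = len(set2)
cross_area = len(shared)

print(sorted(only_1), sorted(shared), sorted(only_2))
print(area1, area2, cross_area)
```

Note that `area1` and `area2` are the full set sizes, including the overlap, not the counts of unique elements.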

> ✅ Use Venn diagrams for small set-based comparisons when the goal is to highlight overlap, not quantity.

# VIZ Summary {-}

You’ve successfully completed the **Data Visualization (VIZ)** layer of the CDI Learning System — working hands-on in both **Python** and **R** to explore a wide range of visual techniques that transform raw data into meaningful insight.

This layer focused on building your **data storytelling** skills — helping you present information clearly, detect patterns, and support analysis through compelling visuals.

---

### 🎨 What You’ve Accomplished {-}

- ✅ Created basic plots: histograms, bar charts, boxplots, and scatter plots  
- ✅ Enhanced your plots with color, grouping, faceting, and trend lines  
- ✅ Visualized multivariate relationships: pair plots, heatmaps, and parallel coordinates  
- ✅ Explored dimensionality reduction techniques (PCA, t-SNE, UMAP)  
- ✅ Used part-to-whole and structural plots like pie charts, donut charts, treemaps, and Venn diagrams  
- ✅ Practiced on both small (iris) and large (diamonds) datasets  
- ✅ Built fluency across `matplotlib`, `seaborn`, `ggplot2`, `GGally`, `plotly`, and more

---

## 📐 What Comes After Visualization? {-}

Now that you’ve developed your **visual intuition**, the next step is to **quantify relationships** — using statistics to draw valid conclusions from your data.

In the next stages of your journey, you’ll dive into:

- 📐 **Statistical Analysis (STATS)** — measure, test, and explain key patterns  
- 🤖 **Machine Learning (ML)** — learn from data and make predictions with real-world models

Each layer builds on what you’ve already learned — using the same datasets and dual-language structure to deepen your understanding.

---

## 🚀 Continue Learning with CDI {-}

Ready to take your next step?

📚 **[Explore All CDI Products →](https://complexdatainsights.com/explore-products/)**

> ✅ With your visualization skills in place, you're now prepared to move from **insightful graphics** to **statistical reasoning** and **predictive modeling** — with confidence in both Python and R.