---
title: Multivariable Data  
subtitle: "IN2039: Data Visualization for Decision Making"
author: 
  - name: Alan R. Vazquez
    affiliations:
      - name: Department of Industrial Engineering
format: 
  revealjs:
    chalkboard: true
    multiplex: true
    footer: "Tecnologico de Monterrey"
    logo: IN2039_logo.png
    css: style.css
    slide-number: True
execute:
  echo: true  
editor: visual
---


## Agenda

1.  Data with more than one variable

2.  Two numerical variables

3.  One numerical and one categorical variable

4.  Two categorical variables

5.  Three or more variables

## Load Libraries

Let's import `pandas`, `matplotlib`, and `seaborn` in Google Colab before starting.


In [None]:
#| echo: true
#| output: false

# These libraries are normally preinstalled in Google Colab.
# If one is missing, install it first, e.g.: !pip install seaborn

import pandas as pd      
import matplotlib.pyplot as plt  
import seaborn as sns    

## Multivariate data

</br>

Multivariate data consists of datasets that contain observations of two or more variables.

::: incremental
-   Variables can be numerical or categorical.

-   Variables may or may not depend on each other.
:::

. . .

The **goal** is to determine whether the variables are related and, if so, what type of relationship they have.

## Example 1

Consider data from 392 cars, including miles per gallon, number of cylinders, horsepower, weight, acceleration, year, origin, among other variables.

The data is stored in the file "auto_dataset.xlsx".


In [None]:
#| echo: true

auto_data = pd.read_excel("auto_dataset.xlsx")
auto_data.head()

## Principle 1: Formulate the question

In the context of multiple-variable data, typical questions to study include:

::: incremental
-   How are variable $X$ and variable $Y$ related?

-   Is the distribution of variable $X$ the same across all subgroups defined by variable $Z$?

-   Are there any unusual observations in the combination of values for variables $X$ and $Y$?

-   Are there any unusual observations in $X$ for a subgroup of variable $Z$?
:::

## Principle 2: Turn data into information

There are various types of graphs that help us explore relationships between two or more variables.

::: center
| Variable types | Graph type                                |
|:---------------|:------------------------------------------|
| Numerical      | Scatter plot, line graph                  |
| Categorical    | Side-by-side bar chart, stacked bar chart |
| Mixed          | Side-by-side box plot, bubble chart       |
:::

::: notes
For two features, the combination of types (both quantitative, both qualitative, or a mix) matters.
:::

# Two Numerical Variables

## Independent and dependent variables

When investigating the relationship between two variables (numerical or categorical), we use specific terminology.

. . .

One variable is called the *dependent* or *response variable*, denoted by the letter $Y$.

. . .

The other variable is called the *independent* or *predictor variable*, denoted by the letter $X$.

. . .

> Our goal is to determine whether changes in variable $X$ are associated with changes in variable $Y$, and the nature of this association.

## Scatter Plot

</br>

The most common graph for examining the relationship between two numerical variables is the [***scatter plot***]{style="color:#174062;"}.

. . .

Variables $X$ and $Y$ are placed on the horizontal and vertical axes, respectively. Each point on the graph represents a pair of $X$ and $Y$ values.

. . .

> The goal is to explore linear or non-linear relationships between variables.

## Scatterplot in Python

To create scatter plots in **seaborn**, we use the function `scatterplot()`.

. . .

For example, let's create a plot to explore the relationship between a car's weight (`weight`) and its fuel efficiency in miles per gallon (`mpg`).


In [None]:
#| fig-align: center
#| echo: true
#| output: false

plt.figure(figsize=(6, 6))
sns.scatterplot(data = auto_data, x = "weight", y = "mpg")
plt.show()

## 


In [None]:
#| fig-align: center
#| echo: true
#| output: true

plt.figure(figsize=(6, 6))
sns.scatterplot(data=auto_data, x="weight", y="mpg", color="blue", s=50)
plt.show()

## Principle 3: Apply graphic design principles

Following Principle 3, we can modify the default function values to define different colors or shapes for the points in the graph.

Specifically, you can change the color, shape, and size of the points using the arguments `color`, `marker`, and `s`, respectively.

</br>

`sns.scatterplot(data=data_set, x=X, y=Y, color=..., marker=..., s=...)`

## 


In [None]:
#| fig-pos: center
#| echo: true

plt.figure(figsize=(6, 6))
sns.scatterplot(data=auto_data, x="weight", y="mpg", color="blue", 
                marker="x", s=100)
plt.show()

## Possible Point Shapes

To change the symbols used for points in a scatter plot, set the `marker` parameter to a matplotlib marker code such as `"o"`, `"s"`, `"^"`, or `"x"`. The chart below illustrates possible point shapes.

![](images/FIG-SCATTER-SHAPES-CHART-1.png){fig-align="center" width="530" height="344"}
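
As a minimal sketch of how point symbols can be changed in seaborn, the cell below passes four standard matplotlib marker codes to the `marker` argument of `scatterplot()`. The small data frame `demo` and its values are made up for illustration and are not part of the car dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical toy data, used only to illustrate marker codes
demo = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 3, 5]})

# A few standard matplotlib marker codes and a plain-language name for each
markers = {"o": "circle", "s": "square", "^": "triangle", "D": "diamond"}

# One panel per marker code, all showing the same points
fig, axes = plt.subplots(1, len(markers), figsize=(12, 3))
for ax, (code, name) in zip(axes, markers.items()):
    sns.scatterplot(data=demo, x="x", y="y", marker=code, s=100, ax=ax)
    ax.set_title(f'marker="{code}" ({name})')

plt.tight_layout()
plt.show()
```

Drawing the same points in every panel makes the marker the only thing that changes, so the shapes are easy to compare.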

## 

Continuing with Principle 3, you can use previously seen functions to further improve the chart's appearance.


In [None]:
#| fig-pos: center
#| echo: true
#| code-fold: true


# Create the scatter plot with custom color and size
sns.scatterplot(data=auto_data, x="weight", y="mpg", color="darkblue", s=50)

# Customize the plot
plt.title("Relationship between car weight and miles per gallon", fontsize=25)
plt.xlabel("Weight (lb)", fontsize=20)
plt.ylabel("Miles per gallon", fontsize=20)
plt.tick_params(axis='both', labelsize=20)

# Show the plot
plt.show() 

## Include Zero

In the previous chart, the vertical axis starts at around 10. To make it start at 0, we set the axis limits with `plt.ylim()`.


In [None]:
#| fig-pos: center
#| echo: true

# Create the scatter plot with custom color and size
sns.scatterplot(data=auto_data, x="weight", y="mpg", color="darkblue", s=50)

# Set y-axis limits
plt.ylim(0, 50)

# Customize the plot
plt.title("Relationship between car weight and miles per gallon", fontsize=25)
plt.xlabel("Weight (lb)", fontsize=20)
plt.ylabel("Miles per gallon", fontsize=20)
plt.tick_params(axis='both', labelsize=20)

# Show the plot
plt.show()

## 

If necessary, we can use `plt.xlim()` so that the horizontal axis also starts at 0.


In [None]:
#| fig-pos: center
#| echo: true

# Create the scatter plot with custom color and size
sns.scatterplot(data=auto_data, x="weight", y="mpg", color="darkblue", s=50)

# Set x-axis limits
plt.xlim(0, 5500)

# Customize the plot
plt.title("Relationship between car weight and miles per gallon", fontsize=25)
plt.xlabel("Weight (lb)", fontsize=20)
plt.ylabel("Miles per gallon", fontsize=20)
plt.tick_params(axis='both', labelsize=20)

# Show the plot
plt.show()

## Individual Graphs

Individual variable graphs (such as histograms) do not allow us to study the relationship between two variables. They only provide information on the *distribution* of each variable.

::::: columns
::: {.column width="50%"}

In [None]:
#| fig-pos: center
#| echo: true
#| code-fold: true

# Create the histogram
sns.histplot(data=auto_data, x="mpg", color="darkblue", kde=False, edgecolor="black")

# Customize the plot
plt.title("Distribution of miles per gallon", fontsize=25)
plt.xlabel("Miles per gallon", fontsize=20)
plt.ylabel("Frequency", fontsize=20)
plt.tick_params(axis='both', labelsize=20)

# Show the plot
plt.show()

:::

::: {.column width="50%"}

In [None]:
#| fig-pos: center
#| echo: true
#| code-fold: true

# Create the histogram
sns.histplot(data=auto_data, x="weight", color="darkblue", kde=False, edgecolor="black")

# Customize the plot
plt.title("Distribution of weight", fontsize=25)
plt.xlabel("Weight (lb)", fontsize=20)
plt.ylabel("Frequency", fontsize=20)
plt.tick_params(axis='both', labelsize=20)

# Show the plot
plt.show()

:::
:::::

## Line Graph

A line graph is a visual representation of data where data points are connected by a line. Axes:

-   $X$ (horizontal): Represents time or the independent variable.
-   $Y$ (vertical): Represents the dependent variable.

Each point represents a value at a given moment.

. . .

> The objective is to explore trends over time or the evolution of a continuous variable.

## Example 2

Consider the data in the file "spotify.xlsx". This dataset contains the global daily streams of the top five most popular songs on the music streaming service Spotify in 2017.


In [None]:
#| echo: true

# Read the Excel file
spotify_data = pd.read_excel("spotify.xlsx")

# View the first 3 rows
spotify_data.head(3)

## 

We will focus on the song *Despacito* by Luis Fonsi. To construct line graphs in **seaborn**, we use the function `lineplot()`.


In [None]:
#| echo: true

# Create the line plot
sns.lineplot(data=spotify_data, x="Date", y="Despacito")

# Show the plot
plt.show()

## Applying Principle 3

We can change various aspects of the graph using the additional arguments `linestyle`, `linewidth`, and `color`.

</br>

`sns.lineplot(data=data_set, x=X, y=Y, linestyle=..., linewidth=..., color=...)`

## 


In [None]:
#| echo: true
#| code-fold: true


# Create the line plot
plt.figure(figsize=(10, 6))
sns.lineplot(data=spotify_data, x="Date", y="Despacito", color="darkblue", linewidth=1.3)

# Customize the plot
plt.title("Popularity of the song Despacito by Luis Fonsi", fontsize=25)
plt.xlabel("Date", fontsize=18)
plt.ylabel("Number of streams on Spotify", fontsize=18)
plt.tick_params(axis='both', labelsize=20)

# Show the plot
plt.show()

## Line Types

To change the line type, set the `linestyle` parameter to a matplotlib style name such as `"solid"`, `"dashed"`, `"dotted"`, or `"dashdot"`. The chart below illustrates common line types.

![](images/clipboard-2759814636.png){fig-align="center"}

# A Categorical and a Numerical Variable

## Divide the Data into Groups!

To examine the relationship between a numerical and a categorical variable, we use the categorical variable to divide the data into groups. This way, we **compare the distribution** of the numerical variable among these groups.

. . .

In this context:

-   $X$ is the categorical variable.
-   $Y$ is the numerical variable.

. . .

The [side-by-side boxplot]{style="color:#D70040;"} is the most effective way to study the relationship between a categorical and a numerical variable.

## Boxplot by Groups

The side-by-side boxplot compares the distribution of a variable across different groups.

</br>

The plot is obtained using the function:

`sns.boxplot(data=dataset, x=X, y=Y)`.

## 

For example, if we want to compare the distributions of miles per gallon of cars built in America, Europe, or Japan, we use the following command:


In [None]:
#| fig-pos: center
#| echo: true

# Create the boxplot
sns.boxplot(data=auto_data, x="origin", y="mpg")

# Show the plot
plt.show() 

## Applying Principle 3


In [None]:
#| fig-pos: center
#| echo: true
#| code-fold: true


# Create the boxplot
plt.figure(figsize=(8, 6))
sns.boxplot(data=auto_data, x="origin", y="mpg", color="lightblue", fliersize=5, linewidth=1.5, boxprops=dict(edgecolor="black"))

# Customize the plot
plt.title("Distribution of miles per gallon by origin", fontsize=25)
plt.xlabel("Origin", fontsize=20)
plt.ylabel("Miles per gallon", fontsize=20)
plt.tick_params(axis='both', labelsize=20)

# Show the plot
plt.show()

## 

We can also change the format of outlier points by passing a dictionary to the `flierprops` argument, with keys such as `markerfacecolor`, `marker`, and `markersize`.


In [None]:
#| fig-pos: center
#| echo: true
#| code-fold: true

# Create the boxplot
plt.figure(figsize=(8, 6))
sns.boxplot(data=auto_data, x="origin", y="mpg", color="lightblue", 
            linewidth=1.5, fliersize=8, boxprops=dict(edgecolor="black"), 
            flierprops=dict(marker="o", markerfacecolor="red", markersize=8, markeredgewidth=2))

# Customize the plot
plt.title("Distribution of miles per gallon by origin", fontsize=25)
plt.xlabel("Origin", fontsize=20)
plt.ylabel("Miles per gallon", fontsize=20)
plt.tick_params(axis='both', labelsize=20)

# Show the plot
plt.show()

## Plotting Statistical Summaries by Groups

Alternatively, we can summarize the values of the numerical variable $Y$ for each category of the variable $X$ using the median or the mean.

For example, let's plot the average miles per gallon of cars produced in America, Europe, and Japan. First, we calculate the average for each category using the pandas methods `groupby()` and `mean()`.
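
A minimal sketch of this per-category average, assuming a small hypothetical data frame `cars` whose values are invented for illustration (the column name `avg_mpg` is also just a placeholder):

```python
import pandas as pd

# Hypothetical miniature version of the car data (values invented for illustration)
cars = pd.DataFrame({
    "origin": ["American", "American", "European", "European", "Japanese", "Japanese"],
    "mpg": [18.0, 22.0, 26.0, 30.0, 29.0, 33.0],
})

# Average mpg per origin: the result has one row per category
avg_mpg = cars.groupby("origin", as_index=False)["mpg"].mean()
avg_mpg = avg_mpg.rename(columns={"mpg": "avg_mpg"})

print(avg_mpg)
```

Using `as_index=False` keeps `origin` as a regular column, which is the layout that plotting functions such as `sns.barplot()` expect.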


In [None]:
#| echo: true
#| output: true

import numpy as np

# Example data
data = [10, 20, 30, 40, 50]

# Calculate some summary statistics
mean_value = np.mean(data)
median_value = np.median(data)
std_dev = np.std(data)

print("Mean:", mean_value)
print("Median:", median_value)
print("Standard Deviation:", std_dev)

## 

The data to be plotted are:


In [None]:
#| echo: true
#| output: true

# Example dataset (replace with actual 'auto_data' dataframe)
auto_data = pd.DataFrame({
    'mpg': [21, 22, 23, 24, 25],
    'weight': [3000, 3200, 3400, 3600, 3800],
    'origin': ['USA', 'Europe', 'USA', 'Japan', 'Europe']
})

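# Average mpg per origin (one row per category)
resumen_autos = auto_data.groupby('origin', as_index=False)['mpg'].mean()
resumen_autos = resumen_autos.rename(columns={'mpg': 'Promedio.mpg'})
print(resumen_autos)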

Two common visualization types for plotting a numerical and a categorical variable when there is only one value per category are:

-   Cleveland dot plot
-   Bar chart

## Cleveland Dot Plot

The Cleveland dot plot encodes quantitative data across different categories. It is an alternative to a bar chart. In seaborn, it can be drawn with `scatterplot()`, placing the categories on the vertical axis.


In [None]:
#| fig-pos: center
#| echo: true
#| code-fold: false

# Example summary data (replace with actual 'resumen_autos' dataframe)
resumen_autos = pd.DataFrame({
    'origin': ['USA', 'Europe', 'Japan'],
    'Promedio.mpg': [22.5, 25.0, 27.5]
})

# Create the scatter plot
sns.scatterplot(data=resumen_autos, x='Promedio.mpg', y='origin')

# Customize the plot
plt.title("Average miles per gallon by origin", fontsize=20)
plt.xlabel("Average miles per gallon", fontsize=15)
plt.ylabel("Origin", fontsize=15)

# Show the plot
plt.show() 

## Improving the Plot

We apply Principle 3 to improve the plot.


In [None]:
#| fig-pos: center
#| echo: true
#| code-fold: true

# Example summary data (replace with actual 'resumen_autos' dataframe)
resumen_autos = pd.DataFrame({
    'origin': ['USA', 'Europe', 'Japan'],
    'Promedio.mpg': [22.5, 25.0, 27.5]
})

# Apply a grid style (similar to theme_bw in R); set it before creating the figure
sns.set_style("whitegrid")

# Create the scatter plot
plt.figure(figsize=(8, 6))
sns.scatterplot(data=resumen_autos, x='Promedio.mpg', y='origin', s=100, color='pink')

# Customize the plot
plt.title("Comparison of cars from different regions", fontsize=20)
plt.xlabel("Average miles per gallon", fontsize=20)
plt.ylabel("Origin", fontsize=20)

# Set x-axis limits
plt.xlim(0, 35)

# Show the plot
plt.show()

## Bar Chart

To create a bar chart where the bar length equals a specific value, we use the function `barplot()` from the **seaborn** library.


In [None]:
#| fig-pos: center
#| echo: true
#| code-fold: false

# Example summary data (replace with actual 'resumen_autos' dataframe)
resumen_autos = pd.DataFrame({
    'origin': ['USA', 'Europe', 'Japan'],
    'Promedio.mpg': [22.5, 25.0, 27.5]
})

# Create the column plot (bar plot)
plt.figure(figsize=(8, 6))
sns.barplot(data=resumen_autos, x='origin', y='Promedio.mpg', color='lightblue')

# Customize the plot
plt.title("Average miles per gallon by origin", fontsize=20)
plt.xlabel("Origin", fontsize=15)
plt.ylabel("Average miles per gallon", fontsize=15)

# Show the plot
plt.show() 

## 

We can use commands similar to those for the Cleveland dot plot to improve the bar chart.


In [None]:
#| fig-pos: center
#| echo: true
#| code-fold: true

# Example summary data (replace with actual 'resumen_autos' dataframe)
resumen_autos = pd.DataFrame({
    'origin': ['USA', 'Europe', 'Japan'],
    'Promedio.mpg': [22.5, 25.0, 27.5]
})

# Apply a grid style (similar to theme_bw() in R); set it before creating the figure
sns.set_style("whitegrid")

# Create the bar plot (column plot)
plt.figure(figsize=(8, 6))
sns.barplot(data=resumen_autos, x='origin', y='Promedio.mpg', color='pink')

# Customize the plot (labels match the axes: origin on x, average on y)
plt.title("Comparison of cars from different regions", fontsize=20)
plt.xlabel("Origin", fontsize=20)
plt.ylabel("Average miles per gallon", fontsize=20)

# Show the plot
plt.show()

# Two Categorical Variables

## Divide the Data into Groups!

With two categorical variables, we compare the distribution of one variable across subgroups defined by the other variable.

That is, we hold one variable fixed and plot the distribution of the other.

. . .

To do this, the most popular charts are extensions of bar graphs:

-   Stacked bar charts
-   Side-by-side bar charts

## Example 3

As an example, let's consider the data in the file "penguins.xlsx".


In [None]:
#| echo: true

# Read the Excel file
penguins_data = pd.read_excel("penguins.xlsx")

# Display the first few rows of the dataset
print(penguins_data.head())

## 

The data has two categorical variables:

-   The species of penguins (`species`).
-   The island they come from (`island`).

Make sure they are stored with the `category` data type in pandas!


In [None]:
#| echo: true

# Convert specified columns to categorical (factor in R)
penguins_data[['species', 'island', 'sex']] = penguins_data[['species', 'island', 'sex']].astype('category')

## Stacked Bar Chart

In seaborn, bar charts of counts for two categorical variables assign one variable to `x` and the other to the `hue` argument.

To stack the bars, the `histplot()` function accepts the argument `multiple="stack"`.

</br>

. . .

For example, to study the distribution of penguin species across the three different islands, we use the following:


In [None]:
#| fig-pos: center
#| echo: true
#| output: false

# Example dataset (replace with actual 'penguins_data' dataframe)
penguins_data = pd.DataFrame({
    'species': ['Adelie', 'Chinstrap', 'Gentoo', 'Adelie', 'Chinstrap'],
    'island': ['Torgersen', 'Dream', 'Biscoe', 'Torgersen', 'Dream']
})

# Create the stacked bar plot with 'species' on the x-axis and 'island' for color fill
plt.figure(figsize=(8, 6))
sns.histplot(data=penguins_data, x='species', hue='island', multiple='stack', shrink=0.8, palette='Set1')

# Customize the plot
plt.title("Distribution of Penguin Species by Island", fontsize=20)
plt.xlabel("Species", fontsize=15)
plt.ylabel("Count", fontsize=15)

# Show the plot
plt.show()

## 

The chart shows the frequency of each species, separated by island name.


In [None]:
#| fig-pos: center
#| echo: true

# Example dataset (replace with actual 'penguins_data' dataframe)
penguins_data = pd.DataFrame({
    'species': ['Adelie', 'Chinstrap', 'Gentoo', 'Adelie', 'Chinstrap'],
    'island': ['Torgersen', 'Dream', 'Biscoe', 'Torgersen', 'Dream']
})

# Create the stacked bar plot with 'species' on the x-axis and 'island' for color fill
plt.figure(figsize=(8, 6))
sns.histplot(data=penguins_data, x='species', hue='island', multiple='stack', shrink=0.8, palette='Set1')

# Customize the plot
plt.title("Distribution of Penguin Species by Island", fontsize=20)
plt.xlabel("Species", fontsize=15)
plt.ylabel("Count", fontsize=15)

# Show the plot
plt.show()

## Side-by-Side Bar Chart

An alternative to the previous chart is to place the bars side by side for the categories of the $X$ variable.

In seaborn, we use the `countplot()` function with the `hue` argument and `dodge=True`.


In [None]:
#| fig-pos: center
#| echo: true
#| output: false

# Example dataset (replace with actual 'penguins_data' dataframe)
penguins_data = pd.DataFrame({
    'species': ['Adelie', 'Chinstrap', 'Gentoo', 'Adelie', 'Chinstrap'],
    'island': ['Torgersen', 'Dream', 'Biscoe', 'Torgersen', 'Dream']
})

# Create the bar plot with 'species' on the x-axis and 'island' for color fill, with dodging
plt.figure(figsize=(8, 6))
sns.countplot(data=penguins_data, x='species', hue='island', dodge=True, palette='Set1')

# Customize the plot
plt.title("Distribution of Penguin Species by Island", fontsize=20)
plt.xlabel("Species", fontsize=15)
plt.ylabel("Count", fontsize=15)

# Show the plot
plt.show()

## 


In [None]:
#| fig-pos: center
#| echo: true

# Example dataset (replace with actual 'penguins_data' dataframe)
penguins_data = pd.DataFrame({
    'species': ['Adelie', 'Chinstrap', 'Gentoo', 'Adelie', 'Chinstrap'],
    'island': ['Torgersen', 'Dream', 'Biscoe', 'Torgersen', 'Dream']
})

# Create the bar plot with 'species' on the x-axis and 'island' for color fill, with dodging
plt.figure(figsize=(8, 6))
sns.countplot(data=penguins_data, x='species', hue='island', dodge=True, palette='Set1')

# Customize the plot
plt.title("Distribution of Penguin Species by Island", fontsize=20)
plt.xlabel("Species", fontsize=15)
plt.ylabel("Count", fontsize=15)

# Show the plot
plt.show()

## Stacked or Side-by-Side?

The main difference between stacked and side-by-side bar charts is that the side-by-side chart shows values in separate bars within a category.

Advantages of **stacked** bars:

-   Easier to see how a whole is divided among segments.

-   Visually adds up the segments to the total.

## 

Advantages of **side-by-side** bars:

-   Easier to compare the heights of each individual entity.

-   Better for comparing between groups.
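
To compare both layouts on the same counts, here is a minimal sketch that builds a contingency table with `pd.crosstab()` and draws it with the pandas `plot()` method, once stacked and once side by side. The tiny `penguins_demo` data frame is made up for illustration, and pandas plotting is used here only as one possible way to draw both layouts.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data (invented for illustration)
penguins_demo = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo"],
    "island": ["Torgersen", "Dream", "Biscoe", "Dream", "Biscoe", "Biscoe"],
})

# Contingency table: rows are islands, columns are species, cells are counts
counts = pd.crosstab(penguins_demo["island"], penguins_demo["species"])
print(counts)

# Same table drawn twice: stacked on the left, side by side on the right
fig, (ax_stacked, ax_dodged) = plt.subplots(1, 2, figsize=(12, 5), sharey=True)
counts.plot(kind="bar", stacked=True, ax=ax_stacked, title="Stacked")
counts.plot(kind="bar", stacked=False, ax=ax_dodged, title="Side by side")
ax_stacked.set_ylabel("Count")

plt.tight_layout()
plt.show()
```

The stacked panel makes the total per island easy to read, while the side-by-side panel makes it easier to compare individual species within an island.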

## Statistical Summaries

For categorical variables, the most common statistical summaries are frequency and relative frequency.

With **pandas**, we calculate frequencies using `groupby()` followed by `size()`, which counts the rows for each combination of values of one or more variables.


In [None]:
#| echo: true
#| output: true

# Example dataset (replace with actual 'penguins_data' dataframe)
penguins_data = pd.DataFrame({
    'species': ['Adelie', 'Chinstrap', 'Gentoo', 'Adelie', 'Chinstrap'],
    'island': ['Torgersen', 'Dream', 'Biscoe', 'Torgersen', 'Dream']
})

# Count occurrences of species per island and group by island
count_data = penguins_data.groupby(['island', 'species']).size().reset_index(name='count')

# Show the result
print(count_data)

## 

To calculate relative frequency, we use `groupby()` along with `transform()`, dividing each count by the total of its group.


In [None]:
#| fig-pos: center
#| echo: true
#| output: true

# Example dataset (replace with actual 'penguins_data' dataframe)
penguins_data = pd.DataFrame({
    'species': ['Adelie', 'Chinstrap', 'Gentoo', 'Adelie', 'Chinstrap'],
    'island': ['Torgersen', 'Dream', 'Biscoe', 'Torgersen', 'Dream']
})

# Count occurrences of species per island and group by island
count_data = penguins_data.groupby(['island', 'species']).size().reset_index(name='count')

# Group by 'island' and calculate the proportion within each group
count_data['Proporción'] = count_data.groupby('island')['count'].transform(lambda x: x / x.sum())

# Show the result
print(count_data)

# Three or More Variables

## Charts for Three Variables

-   When examining a distribution or relationship, we often want to compare it across data subgroups.

-   This process of conditioning on additional variables leads to visualizations involving three or more variables.

-   Here we explain how to create charts to visualize multiple variables.

## Scatter Plot by Color

For two numerical variables and one categorical variable, we can color the points of a scatter plot by category using the `hue` argument.


In [None]:
#| fig-pos: center
#| echo: true

# Example dataset (replace with actual 'auto_data' dataframe)
auto_data = pd.DataFrame({
    'mpg': [21, 22, 23, 24, 25],
    'weight': [2000, 2500, 3000, 3500, 4000],
    'origin': ['USA', 'Europe', 'USA', 'Europe', 'Asia']
})

# Create the scatter plot with 'mpg' on the y-axis, 'weight' on the x-axis, and color based on 'origin'
plt.figure(figsize=(8, 6))
sns.scatterplot(data=auto_data, x='weight', y='mpg', hue='origin', palette='Set1')

# Customize the plot
plt.title("Scatter plot of MPG vs. Weight by Origin", fontsize=20)
plt.xlabel("Weight (lb)", fontsize=15)
plt.ylabel("Miles per Gallon (MPG)", fontsize=15)

# Show the plot
plt.show() 

## Faceted or Lattice Plot

A faceted plot visualizes the relationship or distribution of one or two variables for each subgroup defined by a third variable $Z$.

. . .

**Idea:** Create a chart for each subgroup of $Z$.

. . .

To create the plot, use seaborn's `FacetGrid` class with the following syntax:


In [None]:
#| fig-pos: center
#| echo: true
#| code-fold: false
#| output: false

# Example dataset (replace with actual 'auto_data' dataframe)
auto_data = pd.DataFrame({
    'mpg': [21, 22, 23, 24, 25],
    'weight': [2000, 2500, 3000, 3500, 4000],
    'origin': ['USA', 'Europe', 'USA', 'Europe', 'Asia']
})

# Create the scatter plot and facet by 'origin'
g = sns.FacetGrid(auto_data, col='origin', height=5, aspect=1)
g.map(sns.scatterplot, 'weight', 'mpg')

# Customize the plot
g.set_axis_labels("Weight (lb)", "Miles per Gallon (MPG)")
g.set_titles("{col_name} Origin")

# Show the plot
plt.show()

## 

With `col='origin'`, the function produces a grid with 1 row and 3 columns of charts. Each column accommodates one category of `origin`.


In [None]:
#| fig-pos: center
#| echo: true

# Example dataset (replace with actual 'auto_data' dataframe)
auto_data = pd.DataFrame({
    'mpg': [21, 22, 23, 24, 25],
    'weight': [2000, 2500, 3000, 3500, 4000],
    'origin': ['USA', 'Europe', 'USA', 'Europe', 'Asia']
})

# Create the scatter plot and facet by 'origin'
g = sns.FacetGrid(auto_data, col='origin', height=5, aspect=1)
g.map(sns.scatterplot, 'weight', 'mpg')

# Customize the plot
g.set_axis_labels("Weight (lb)", "Miles per Gallon (MPG)")
g.set_titles("{col_name} Origin")

# Show the plot
plt.show()

## 

If we pass `origin` to the `row` argument of `FacetGrid` instead, we get a grid with three rows and one column of charts.


In [None]:
#| fig-pos: center
#| echo: true

# Example dataset (replace with actual 'auto_data' dataframe)
auto_data = pd.DataFrame({
    'mpg': [21, 22, 23, 24, 25],
    'weight': [2000, 2500, 3000, 3500, 4000],
    'origin': ['USA', 'Europe', 'USA', 'Europe', 'Asia']
})

# Create the scatter plot and facet by 'origin'
g = sns.FacetGrid(auto_data, row='origin', height=5, aspect=1)
g.map(sns.scatterplot, 'weight', 'mpg')

# Customize the plot
g.set_axis_labels("Weight (lb)", "Miles per Gallon (MPG)")
g.set_titles("{row_name} Origin")

# Show the plot
plt.show()

## Multiple Line Charts

We can use `lineplot()` and `FacetGrid` to visualize the evolution of play counts for the 5 songs in the file "spotify.xlsx" over time.

</br>

However, we need to manipulate the data to obtain the format required by these functions.

## The Required Format

For a multiple line chart, we need to merge the columns `Shape of You`, `Despacito`, `Something Just Like This`, `HUMBLE.`, and `Unforgettable` into two columns.

One column will contain the number of plays, and the other will contain the song title.

Both columns will be ordered by the variable `Date`.

## A New Library: tidyr

::::: columns
::: {.column width="50%"}
![](images/tidyverse.jpeg){fig-align="center" width="491" height="374"}
:::

::: {.column width="50%"}
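-   **tidyr** is the R library for reshaping and regrouping a dataset. It is part of a collection of data science packages called *tidyverse*.

-   In Python, **pandas** provides the same kind of reshaping through its `melt()` method.

-   <https://tidyr.tidyverse.org/>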
:::
:::::

In Google Colab, load **pandas** with the following code:


```{python}
#| echo: true
#| output: false

import pandas as pd
```


## 

To format the data, we use the `melt()` method from **pandas**, the counterpart of the `pivot_longer()` function from the **tidyr** library.


In [None]:
#| fig-pos: center
#| echo: true

# Example dataset (replace with actual 'spotify_data' dataframe)
spotify_data = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03'],
    'Shape of You': [100, 150, 120],
    'Despacito': [200, 180, 190],
    'Something Just Like This': [120, 130, 140],
    'HUMBLE.': [300, 280, 290],
    'Unforgettable': [250, 260, 240]
})

# Pivot longer: melt the dataframe
data_lines = spotify_data.melt(id_vars=["Date"], 
                                value_vars=["Shape of You", "Despacito", "Something Just Like This", "HUMBLE.", "Unforgettable"], 
                                var_name="Cancion", 
                                value_name="Reproducciones")

# Show the first few rows
print(data_lines.head())

## 

Now, we apply similar functions to the `data_lines` object.


In [None]:
#| fig-pos: center
#| echo: true

# Example dataset (replace with actual 'data_lines' dataframe)
import pandas as pd
data_lines = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
    'Cancion': ['Shape of You', 'Shape of You', 'Shape of You', 'Despacito', 'Despacito', 'Despacito'],
    'Reproducciones': [100, 150, 120, 200, 180, 190]
})

# Create the line plot and facet by 'Cancion'
g = sns.FacetGrid(data_lines, col='Cancion', height=5, aspect=1)
g.map(sns.lineplot, 'Date', 'Reproducciones')

# Customize the plot
g.set_axis_labels("Date", "Streams")
g.set_titles("{col_name} Song")

# Rotate the x-axis tick labels for better readability
for ax in g.axes.flat:
    ax.tick_params(axis='x', rotation=45)

# Show the plot
plt.show()

## 

Or we can plot all lines on a single chart.


In [None]:
#| fig-pos: center
#| echo: true

# Example dataset (replace with actual 'data_lines' dataframe)
import pandas as pd
data_lines = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
    'Cancion': ['Shape of You', 'Shape of You', 'Shape of You', 'Despacito', 'Despacito', 'Despacito'],
    'Reproducciones': [100, 150, 120, 200, 180, 190]
})

# Create the line plot with color mapped to 'Cancion'
sns.lineplot(x='Date', y='Reproducciones', hue='Cancion', data=data_lines)

# Customize the plot
plt.title("Streams by Song Over Time")
plt.xlabel("Date")
plt.ylabel("Streams")
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability

# Show the plot
plt.show()

## Area Chart

An area chart is a specialized form of a line chart, where points are connected with a continuous line, and the region beneath the line is filled with a solid color. In Python, it can be built by combining seaborn's `lineplot()` with matplotlib's `fill_between()`.


In [None]:
#| fig-pos: center
#| echo: true

# Example dataset (replace with actual 'data_lines' dataframe)
data_lines = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-01', '2023-01-02', '2023-01-03'],
    'Cancion': ['Shape of You', 'Shape of You', 'Shape of You', 'Despacito', 'Despacito', 'Despacito'],
    'Reproducciones': [100, 150, 120, 200, 180, 190]
})

# Set the Date column to datetime format
data_lines['Date'] = pd.to_datetime(data_lines['Date'])

# Create the area plot using seaborn
plt.figure(figsize=(10, 6))
sns.lineplot(x='Date', y='Reproducciones', hue='Cancion', data=data_lines, linewidth=2)

# Fill the area under each line (no label here, to avoid duplicate legend entries)
for song in data_lines['Cancion'].unique():
    song_data = data_lines[data_lines['Cancion'] == song]
    plt.fill_between(song_data['Date'], song_data['Reproducciones'], alpha=0.3)

# Customize the plot
plt.title("Streams by Song Over Time")
plt.xlabel("Date")
plt.ylabel("Streams")
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.legend(title="Song")

# Show the plot
plt.show()

## Applying Principle 3

In addition to the previously seen functions, we can make the areas transparent using the `alpha` parameter.


In [None]:
#| fig-pos: center
#| code-fold: true
#| echo: true

# Example dataset (replace with your actual 'data_lines' DataFrame)
data_lines = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-02', '2023-01-03',
             '2023-01-01', '2023-01-02', '2023-01-03'],
    'Cancion': ['Shape of You', 'Shape of You', 'Shape of You',
                'Despacito', 'Despacito', 'Despacito'],
    'Reproducciones': [100, 150, 120, 200, 180, 190]
})

# Ensure that Date is a datetime type
data_lines['Date'] = pd.to_datetime(data_lines['Date'])

# Use a style similar to theme_bw (a clean white background with gridlines)
sns.set_style("whitegrid")

# Create the figure and axis
fig, ax = plt.subplots(figsize=(10, 6))

# Choose a color palette for the different songs
unique_songs = data_lines['Cancion'].unique()
palette = sns.color_palette("husl", len(unique_songs))
color_map = dict(zip(unique_songs, palette))

# Plot the area for each song with a fill (alpha=0.5)
for song in unique_songs:
    # Get data for the song and sort by Date to ensure proper plotting
    song_data = data_lines[data_lines['Cancion'] == song].sort_values('Date')
    
    # Plot the line for the song
    ax.plot(song_data['Date'], song_data['Reproducciones'],
            color=color_map[song], label=song)
    
    # Fill the area under the line with the same color and specified transparency
    ax.fill_between(song_data['Date'], song_data['Reproducciones'],
                    color=color_map[song], alpha=0.5)

# Customize the axes labels and title
ax.set_xlabel("Date", fontsize=14)
ax.set_ylabel("Streams", fontsize=14)
ax.set_title("Streams Over Time by Song", fontsize=16)
plt.xticks(rotation=45)

# Add a legend
ax.legend(title="Song")

# Show the plot
plt.tight_layout()
plt.show()

## Charts for Four Variables

A common chart for four variables is the scatter plot, where the color and size of the symbols depend on two [**categorical**]{style="color:#8B8000;"} variables.
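
As a minimal sketch, seaborn's `scatterplot()` can also take a categorical column directly in `size`, so that the legend shows the category names. The toy data frame `demo` is hypothetical, with values invented for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical toy data (values invented for illustration)
demo = pd.DataFrame({
    "bill_depth_mm": [18.7, 17.4, 19.5, 14.3, 15.1, 18.9],
    "bill_length_mm": [39.1, 39.5, 49.0, 46.1, 50.0, 51.3],
    "species": ["Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe", "Biscoe", "Dream"],
})

# Four variables in one chart: x, y, color (species), and point size (island)
fig, ax = plt.subplots(figsize=(7, 5))
sns.scatterplot(data=demo, x="bill_depth_mm", y="bill_length_mm",
                hue="species", size="island", sizes=(50, 200), ax=ax)

ax.set_xlabel("Bill depth (mm)")
ax.set_ylabel("Bill length (mm)")
plt.show()
```

Because point size suggests an ordering, it may be worth checking whether the size legend reads naturally for an unordered variable such as `island`.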

</br>


In [None]:
#| fig-pos: center
#| echo: true
#| output: false

# Assuming penguins_data is already loaded as a DataFrame with columns:
# 'bill_length_mm', 'bill_depth_mm', 'island', and 'species'

# Convert the 'island' categorical variable to numeric codes for size mapping.
# (Alternatively, you could define a custom mapping if desired.)
penguins_data['island_code'] = penguins_data['island'].astype('category').cat.codes

# Create the scatter plot.
# x-axis: bill_depth_mm, y-axis: bill_length_mm,
# Color (hue) mapped to 'species' and size mapped to the numeric codes from 'island'.
sns.scatterplot(
    data=penguins_data,
    x='bill_depth_mm',
    y='bill_length_mm',
    hue='species',
    size='island_code',
    sizes=(50, 200)  # Adjust the minimum and maximum point sizes as needed
)

# Customize the plot labels and title.
plt.xlabel("Bill Depth (mm)")
plt.ylabel("Bill Length (mm)")
plt.title("Penguin Bill Dimensions by Species and Island")

# Display the plot.
plt.show()

## 


In [None]:
#| fig-pos: center
#| echo: true

# Assume penguins_data is already loaded as a DataFrame with the required columns.
# For example:
# penguins_data = pd.read_csv("penguins.csv")

# Convert the 'island' column to categorical codes for size mapping.
penguins_data['island_code'] = penguins_data['island'].astype('category').cat.codes

# Create the scatter plot:
# - x-axis: bill_depth_mm
# - y-axis: bill_length_mm
# - Color (hue) mapped to species
# - Size mapped to the numeric codes for island
sns.scatterplot(
    data=penguins_data,
    x='bill_depth_mm',
    y='bill_length_mm',
    hue='species',
    size='island_code',
    sizes=(50, 200)  # Adjust the size range as needed
)

# Set labels and title (customize as needed)
plt.xlabel("Bill Depth (mm)")
plt.ylabel("Bill Length (mm)")
plt.title("Penguin Bill Dimensions by Species and Island")

# Show the plot
plt.show()

# More Charts

<https://www.mosaic-web.org/ggformula/articles/pkgdown/ggformula-long.html>

# [Return to Main Page](https://alanrvazquez.github.io/TEC-IN2039-Website/TEC-IN2039-Website.html)