# Tutorial 6. EDA II

In this notebook we are picking up from where we left off creating visualizations to aid in exploratory data analysis. You have seen variations of the histogram and density plots. In this tutorial we focus on box plots, violin plots, 2D histograms and scatter plots. 
As you work through this module, we recommend that you open the [Altair Data Transformations documentation](https://altair-viz.github.io/user_guide/transform/index.html) in another tab. It will be a useful resource if at any point you'd like more details or want to see what other transformations are available.

## Learning Goals
Those who actively work through this notebook will be able to:
- Create box plots and explain what each part of the plot encodes
- Create violin plots
- Use `transform_regression` to create trend lines that support our understanding of correlations
- Extend histograms and scatter plots to support EDA tasks



## Dataset and Environment Setup

### Movies Dataset

We will be working with a table of data about motion pictures, taken from the [vega-datasets](https://vega.github.io/vega-datasets/) collection. The data includes variables such as the film name, director, genre, release date, ratings, and gross revenues. However, _be careful when working with this data_: the films are from unevenly sampled years, using data combined from multiple sources. If you dig in you will find issues with missing values and even some subtle errors! Nevertheless, the data should prove interesting to explore.

Let's point to the JSON data file from the vega-datasets collection and read it into a Pandas data frame so that we can inspect its contents. We also keep only the films that carry one of the four main MPAA ratings (G, PG, PG-13, R).


| Column Name | Data Type | Description |
|-------------|-----------|-------------|
| Title  | Text | Movie title |
| US_Gross | Quantitative | USA box office revenue in USD|
| Worldwide_Gross | Quantitative | Global box office revenue in USD|
| US_DVD_Sales | Quantitative | DVD/Physical sales in USD |
| Production_Budget | Quantitative | Movie production costs in USD |
| Release_Date | Date | Theatrical release date |
| MPAA_Rating | Ordinal | Movie rating (G, PG, PG-13, R, etc.) |
| Major_Genre | Nominal | Primary movie genre |
| Running_Time_min | Quantitative | Movie length (minutes) |
| Distributor | Nominal | Distribution company |
| IMDB_Rating | Quantitative | IMDB user ratings (1-10 scale) |
| Rotten_Tomatoes_Rating | Quantitative | Rotten Tomatoes user ratings (1-100 scale) |
| Director | Nominal | Movie director name |

In [None]:
import pandas as pd
import altair as alt
import numpy as np


In [None]:
movies_url = 'https://cdn.jsdelivr.net/npm/vega-datasets@1/data/movies.json'
movies = pd.read_json(movies_url)
movies = movies.query("MPAA_Rating in ['G', 'PG', 'PG-13', 'R']")


## Box Plot
After all the visualizations that you have been exposed to thus far, the box plot may seem simple, but it packs a punch. The `boxplot` mark in Altair is very different from other marks: it is a composite mark (i.e., a combination of marks, NOT an atomic mark) that summarizes multiple data points rather than representing individual data items. The box plot encodes the median, the quartiles, and the interquartile range. It is less detailed than the histogram: rather than grouping data into bins, it reduces the distribution to a handful of summary statistics. The spacing between the different parts depicts the spread and skewness of the data. Outliers are also encoded. The image below, sourced from the [DataVizCatalogue](https://datavizcatalogue.com/methods/box_plot.html), shows the structure of a box plot.

<img width="200px" img-height="300px" src="https://datavizcatalogue.com/methods/images/anatomy/box_plot.png" />



Let's start by using our dataset, the `boxplot` mark and encoding `IMDB_Rating` on the `y` channel and see what happens.

In [None]:
alt.Chart(movies).mark_boxplot().encode(
    y='IMDB_Rating:Q'
)

Note that you can also use the `x` channel to encode the quantitative data.
Let's customize the box plot by:
- changing the color of the median to red so it is more noticeable,
- setting how far the whiskers extend and adding ticks at their ends,
- changing the axis name,
- and removing the scale restriction so the axis no longer defaults to starting at zero.

<div style="border-left: 5px solid #007BFF; padding: 1em; background-color: #F0F8FF;">

<h3><b>Viz Task: Boxplot of IMDB Ratings</b></h3>

<ul>
<li>Use the <code>boxplot</code> mark to show the distribution of IMDB ratings.</li>
<li>Mark options:
<ul>
<li><code>median={"color": "red"}</code>: highlights the median in red.</li>
<li><code>extent=1.5</code>: sets the maximum whisker length to 1.5 × IQR.</li>
<li><code>ticks=True</code>: shows tick marks at the ends of the whiskers.</li>
</ul>
</li>
<li>Encode:
<ul>
<li><code>IMDB_Rating</code> on the <b>x channel</b> as quantitative.</li>
<li>Use <code>scale=alt.Scale(zero=False, padding=5)</code> to adjust the axis for better visualization.</li>
</ul>
</li>
</ul>
</div>


In [None]:
alt.Chart(movies).mark_boxplot(
    median={"color": "red"},
    extent=1.5,
    ticks=True,
).encode(
    alt.X('IMDB_Rating:Q', scale=alt.Scale(zero=False, padding=5), title="IMDB Rating"),
)

In Altair (and Vega-Lite), the **`extent`** parameter controls how far the **whiskers** of a boxplot extend beyond the box:

* The box represents the **interquartile range (IQR)**: the range between the 25th percentile (Q1) and 75th percentile (Q3).
* The whiskers extend to the smallest and largest data values that fall within the bounds:

$$
\text{Lower bound} = Q1 - (\text{extent} \times IQR)
$$

$$
\text{Upper bound} = Q3 + (\text{extent} \times IQR)
$$

* Any points **outside this range** are considered **outliers** and are drawn separately as individual points.

**Example:**

* `extent=1.5` (default) → whiskers extend 1.5 × IQR from the quartiles.
* `extent=0` → whiskers end exactly at Q1 and Q3 (all points outside the box are outliers).
* Increasing `extent` → longer whiskers, fewer points marked as outliers.
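
To make the arithmetic concrete, here is a minimal sketch that computes the whisker ends and the outliers by hand with NumPy. The ratings are made-up toy values and `whisker_summary` is a hypothetical helper, not part of Altair; Vega-Lite's own quartile calculation may differ slightly on real data.

```python
import numpy as np

def whisker_summary(values, extent=1.5):
    """Return (lower whisker, upper whisker, outliers) for a 1-D sequence."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - extent * iqr
    upper_bound = q3 + extent * iqr
    inside = (values >= lower_bound) & (values <= upper_bound)
    # Whiskers stop at the most extreme data points that are still inside the bounds
    return values[inside].min(), values[inside].max(), values[~inside]

toy_ratings = [1, 5, 6, 6, 7, 7, 8, 8, 9]  # hypothetical values, not from the movies data

low, high, outliers = whisker_summary(toy_ratings, extent=1.5)
wide_low, wide_high, wide_outliers = whisker_summary(toy_ratings, extent=3)
print(low, high, outliers)
print(wide_low, wide_high, wide_outliers)
```

Here Q1 = 6 and Q3 = 8, so the IQR is 2. With `extent=1.5` the bounds are 3 and 11: the rating of 1 is an outlier and the lower whisker stops at 5. With `extent=3` the bounds widen to 0 and 14, so the whisker reaches 1 and no outliers remain.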

---

So in your chart:

```python
alt.Chart(movies).mark_boxplot(
    median={"color": "red"},
    extent=1.5,
    ticks=True,
)
```

* The whiskers extend up to **1.5 × IQR** from the box, `ticks=True` caps each whisker with a tick, and points outside this range are drawn as outliers.
* This is the standard setting for identifying statistical outliers.



Take a few minutes to change the extent to make sure you understand how it works. 

<br>
Currently our boxplot is showing the distribution for one quantitative attribute. <br>
We can extend it to show a nominal attribute on the `y` channel. 
<br>
Let's encode the genre on the `y` channel. 


In [None]:
alt.Chart(movies).mark_boxplot(
    median={"color": "red"},
    extent=1.5,
    ticks=True,
).encode(
    alt.Y('Major_Genre:N'),
    alt.X('IMDB_Rating:Q', scale=alt.Scale(zero=False, padding=5), title="IMDB Rating"),
).properties(height = 350)

**Analysis Questions:**
1. Which genre has the highest median rating?
2. Which genre shows the most variation in ratings?
3. Are there genres with particularly notable outliers?


<div style="border-left: 5px solid #007BFF; padding: 1em; background-color: #F0F8FF;">

<h3><b>Viz Task: Boxplot of IMDB Ratings by MPAA Rating and Genre</b></h3>

<ul>
<li>Use the <code>boxplot</code> mark to show the distribution of IMDB ratings across MPAA ratings and genres.</li>
<li>Mark options:
<ul>
<li><code>median={"color": "red"}</code>: highlights the median rating in red.</li>
<li><code>extent=3</code>: extends the whiskers to 3 × IQR from the quartiles, capturing more values before labeling outliers.</li>
<li><code>ticks=True</code>: shows tick marks at the ends of the whiskers.</li>
</ul>
</li>
<li>Encode:
<ul>
<li><code>IMDB_Rating</code> on the <b>x channel</b> as quantitative, with <code>scale=alt.Scale(zero=False, padding=5)</code>.</li>
<li><code>MPAA_Rating</code> on the <b>y channel</b> as nominal, to create separate rows for each rating category.</li>
<li><code>Major_Genre</code> on the <b>color channel</b> as nominal, to differentiate genres visually.</li>
</ul>
</li>
<li>Data transformation:
<ul>
<li>Use <code>transform_filter()</code> with <code>alt.FieldOneOfPredicate</code> to include only selected genres: <code>['Action', 'Comedy', 'Documentary', 'Horror']</code>.</li>
</ul>
</li>
<li>Set chart <b>width=500</b> and <b>height=120</b> for readability.</li>
<li>Add a descriptive title: <i>"Boxplot of IMDB Ratings by MPAA Rating and Genre"</i>.</li>
</ul>
</div>


In [None]:
alt.Chart(movies).transform_filter(
    alt.FieldOneOfPredicate(field='Major_Genre', oneOf=['Action', 'Comedy', 'Documentary', 'Horror'])
).mark_boxplot(
    median={"color": "red"},
    extent=3,
    ticks= True,
).encode(
    alt.X('IMDB_Rating', scale=alt.Scale(zero=False, padding=5), title= "IMDB Rating" ),
    alt.Y('MPAA_Rating:N').title('MPAA Rating'),
    alt.Color('Major_Genre:N').title('Genre'), 
).properties(
    width=500,
    height=120,
    title = "Boxplot of IMDB Ratings by MPAA Rating and Genre"
)


Okay, at this point it is a mess: the box plots sit on top of each other, and for each rating it is hard to tell where the box plot for a given genre ends.

So why don't we use our position channels differently? We have the ability to facet the chart, so that we have multiple x or multiple y axes. So let's use that superpower.

Taking the chart we had in the previous step, let's also encode `Major_Genre`, the source of the mess, on the `row` channel.

What do you notice? <br>
What is helpful? <br>
What is counter-productive?

In [None]:
alt.Chart(movies).transform_filter(
    alt.FieldOneOfPredicate(field='Major_Genre', oneOf=['Action', 'Comedy', 'Documentary', 'Horror'])
).mark_boxplot(
    median={"color": "red"},
    extent=3,
    ticks= True,
).encode(
    alt.X('IMDB_Rating:Q', scale=alt.Scale(zero=False, padding=5), title="IMDB Rating"),
    alt.Y('MPAA_Rating:N').title('MPAA Rating'),
    alt.Color('Major_Genre:N').title('Genre'), 
    alt.Row('Major_Genre:N')
).properties(
    width=500,
    height=120,
    title = "Boxplot of IMDB Ratings by MPAA Rating and Genre"
)

So we have the `x` channel and the `y` channel, and now we also have the `row` and `column` channels.
We have enhanced our ability to use the position channels. 
Clap for us. 

## Violin Plot

The violin plot is an extension of the box plot that combines the statistical summary of a box plot with the shape information of a density plot. While a box plot encodes key statistical measures (median, quartiles, outliers, and interquartile ranges), the violin plot adds the probability density distribution, showing the actual shape of the data distribution.

The "violin" shape is created by mirroring a kernel density estimation on both sides of the traditional box plot elements. This allows viewers to see not just where the majority of data points lie, but also whether the distribution is unimodal or multimodal, skewed, or has multiple peaks.

Violin plots are particularly useful when:
- You want to compare distributions across multiple categories
- The shape of the distribution matters for your analysis
- You have enough data points to make density estimation meaningful
- You need to identify whether data is normally distributed or has unusual patterns

The width of the violin at any point represents the density of data at that value - wider sections indicate more data points at that level, while narrower sections show fewer observations.
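
To see where that width comes from, here is a minimal sketch of a Gaussian kernel density estimate computed by hand with NumPy. The ratings, the fixed `bandwidth=0.5`, and the `gaussian_kde` helper are all assumptions made for illustration; Altair's `transform_density` chooses its own bandwidth and sampling steps.

```python
import numpy as np

def gaussian_kde(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at every grid point."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

toy_ratings = [5.5, 6.0, 6.2, 6.5, 6.8, 7.0, 7.1, 7.5, 8.0]  # hypothetical values
grid = np.linspace(1, 10, 181)  # rating values at which the density is sampled
density = gaussian_kde(toy_ratings, grid, bandwidth=0.5)

# Centering the area mirrors it around a middle baseline:
# half of the density goes to the left, half to the right.
left_edge = -density / 2
right_edge = density / 2

print(grid[density.argmax()])  # rating at which the violin is widest
```

The violin's total width at each rating is `right_edge - left_edge`, which is exactly the estimated density at that rating.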



<div style="border-left: 5px solid #007BFF; padding: 1em; background-color: #F0F8FF;">

<h3><b>Viz Task: Horizontal Density of IMDB Ratings by MPAA Rating</b></h3>

<ul>
<li>Use the <code>area</code> mark with <code>orient='horizontal'</code> to create horizontal density plots.</li>
<li>Encode:
<ul>
<li><code>IMDB_Rating</code> on the <b>y channel</b> as quantitative.</li>
<li><code>density</code> on the <b>x channel</b> as quantitative, with <code>stack='center'</code> to center the density area and form the violin shape.</li>
<li><code>MPAA_Rating</code> on the <b>color channel</b> as nominal to differentiate ratings visually.</li>
<li>Facet by <code>MPAA_Rating</code> using <code>column</code>, with labels positioned at the bottom.</li>
</ul>
</li>
<li>Data transformation:
<ul>
<li>Use <code>transform_density()</code> to compute density:
<ul>
<li><code>'IMDB_Rating'</code> (the first argument, <code>density</code>): the numeric variable to calculate density for.</li>
<li><code>groupby=['MPAA_Rating']</code>: compute separate density curves for each MPAA rating.</li>
<li><code>as_=['IMDB_Rating', 'density']</code>: the output field names for the sampled rating values (y-axis) and their estimated densities (x-axis).</li>
<li><code>extent=[1, 10]</code>: limit IMDB ratings considered in the density calculation.</li>
</ul>
</li>
</ul>
</li>
<li>Chart properties:
<ul>
<li><b>width=100</b> for each facet</li>
<li>No facet spacing (<code>configure_facet(spacing=0)</code>)</li>
<li>No stroke around the view (<code>configure_view(stroke=None)</code>)</li>
<li>X-axis hides its labels and grid lines and keeps a single tick at the baseline (0)</li>
</ul>
</li>
</ul>
</div>


In [None]:
alt.Chart(movies).transform_density(
    'IMDB_Rating',
    as_=['IMDB_Rating', 'density'],
    extent=[1, 10],
    groupby=['MPAA_Rating']
).mark_area(orient='horizontal').encode(
    y=alt.Y('IMDB_Rating:Q', title="IMDB Rating"),
    color='MPAA_Rating:N',
    x=alt.X(
        'density:Q',
        stack='center',
        impute=None,
        title=None,
        axis=alt.Axis(labels=False, values=[0], grid=False, ticks=True),
    ),
    column=alt.Column(
        'MPAA_Rating:N',
        header=alt.Header(
            titleOrient='bottom',
            labelOrient='bottom',
            labelPadding=0,
        ),
    )
).properties(
    width=100
).configure_facet(
    spacing=0
).configure_view(
    stroke=None
)

### **1. `mark_area` for Violin Plots**

* A **violin plot** is essentially a **smoothed density plot** mirrored along an axis.
* In Altair, we use **`mark_area()`** to create the filled shape of the density.
* The **height/width** of the area represents the **density of values** at that point.

---

### **2. `orient='horizontal'`**

* By default, the area extends **vertically**.
* Setting **`orient='horizontal'`** flips the plot:

  * The **y-channel** shows the variable of interest (e.g., `IMDB_Rating`).
  * The **x-channel** represents the density.
* This makes the violin "lie on its side," which is useful when faceting by categories along the horizontal axis.

---

### **3. `x` Channel**

* When `orient='horizontal'`, the **x-channel encodes the density** of the variable.
* Often, we set `stack='center'` to make the density symmetrical around the baseline, forming the characteristic violin shape.

---

### **4. `column` Channel**

* The **`column`** channel is used to **facet the chart into multiple small multiples**, one for each category.
* For example, `column='MPAA_Rating'` creates a separate horizontal violin for each rating category.
* This allows students to **compare distributions across groups** easily.

---

**What would the chart look like if we didn't have the `column` channel encode any attribute?**

We are encoding the MPAA rating on both the `column` and the `color` channels.
This is redundant, so let's use the `color` channel to represent the genre instead.

In [None]:
alt.Chart(movies).transform_density(
    'IMDB_Rating',
    as_=['IMDB_Rating', 'density'],
    extent=[1, 10],
    groupby=['MPAA_Rating', 'Major_Genre']
).mark_area(orient='horizontal', opacity=0.6).encode(
    y=alt.Y('IMDB_Rating:Q', title="IMDB Rating"),
    color='Major_Genre:N',
    x=alt.X(
        'density:Q',
        stack='center',
        impute=None,
        title=None,
        axis=alt.Axis(labels=False, values=[0], grid=False, ticks=True),
    ),
    column=alt.Column(
        'MPAA_Rating:N',
        header=alt.Header(
            titleOrient='bottom',
            labelOrient='bottom',
            labelPadding=0,
        ),
    ),
).properties(
    width=100
).configure_facet(
    spacing=0
).configure_view(
    stroke=None
)

**Once again, just because you can, doesn't mean you should.**

## Correlation Detection Through Visualization 

### Beyond Basic Scatter Plots: Detecting Relationships

You've seen scatter plots before, but now we'll focus specifically on **using them to detect and interpret correlations.** This is a crucial EDA skill: understanding when two variables are related and how strongly.


<div class="alert alert-info" style="color:black">

**VISUALIZATION TASK:** Start with a basic correlation detection example between IMDB ratings and Rotten Tomatoes ratings.

**Chart Specifications:**
- **Mark:** `mark_circle()` with opacity=0.5 and size=30
- **X channel:** `IMDB_Rating` (quantitative), title "IMDB Rating"
- **Y channel:** `Rotten_Tomatoes_Rating` (quantitative), title "Rotten Tomatoes Rating"
- **Chart Properties:**
  - Width: 350px, Height: 250px  
  - Title: "Movie Ratings Correlation"

</div>


In [None]:
# Start with a basic correlation detection example
movie_correlation = alt.Chart(movies).mark_circle(
    opacity=0.5,
    size=30
).encode(
    x=alt.X('IMDB_Rating:Q', title='IMDB Rating'),
    y=alt.Y('Rotten_Tomatoes_Rating:Q', title='Rotten Tomatoes Rating'),
).properties(
    width=350,
    height=250,
    title="Movie Ratings Correlation"
)

movie_correlation

**🔍 Visual Correlation Assessment:**
- **Strong positive correlation**: Points form a clear upward line
- **Weak correlation**: Points loosely follow a trend but are widely scattered around it
- **Non-linear correlation**: Clear pattern that isn't a straight line
- **No correlation**: Random scatter with no discernible pattern

**🤔 Analysis Question:** Looking at the movie ratings plot above, how would you describe the correlation? Is it what you expected?

### Understanding Correlation Strength Visually

Let's enhance our correlation detection with additional visual elements:

<div class="alert alert-info" style="color:black">

**VISUALIZATION TASK:** Add a regression trend line to highlight the correlation between IMDB ratings and Rotten Tomatoes ratings.

**Chart Specifications:**
- **Mark:** `mark_line()` with color = red and size = 3
- **Transform:** `transform_regression('IMDB_Rating', 'Rotten_Tomatoes_Rating')`
- **X channel:** `IMDB_Rating` (quantitative),
- **Y channel:** `Rotten_Tomatoes_Rating` (quantitative)
- **Chart Properties:**
  - Overlays on top of scatter plot
  - Red line provides clear visual cue of correlation

</div>


In [None]:
# Add trend line to make correlation more visible
trend_line = alt.Chart(movies).mark_line(
    color='red',
    size=3
).transform_regression(
    'IMDB_Rating', 'Rotten_Tomatoes_Rating'
).encode(
    x='IMDB_Rating:Q',
    y='Rotten_Tomatoes_Rating:Q'
)

trend_line


Notice how the trend line extends below 0. Is it possible to have a negative rating on Rotten Tomatoes or on IMDB?
I don't think so. `transform_regression` fits a straight line and draws it across the full x-domain of the data, and nothing constrains that line to stay within the valid rating range, so at low IMDB ratings the predicted Rotten Tomatoes rating drops below 0. To address this we would have to clip the line ourselves. We won't do that now, but it is possible.
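
As a small illustration of why a straight-line fit can leave the valid range, the sketch below fits a least-squares line with `np.polyfit` to four made-up (IMDB, Rotten Tomatoes) pairs. The numbers are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical (IMDB, Rotten Tomatoes) pairs, not taken from the movies data
imdb = np.array([2.0, 4.0, 6.0, 8.0])
rotten_tomatoes = np.array([5.0, 35.0, 60.0, 90.0])

# Least-squares fit of a degree-1 polynomial: y = slope * x + intercept
slope, intercept = np.polyfit(imdb, rotten_tomatoes, 1)

# The line has no idea that ratings cannot be negative
predicted_at_one = slope * 1.0 + intercept
zero_crossing = -intercept / slope  # IMDB rating at which the line hits 0

print(slope, intercept, predicted_at_one, zero_crossing)
```

For these toy values the fitted line is roughly y = 14x - 22.5, so it predicts a negative Rotten Tomatoes rating for any IMDB rating below about 1.6.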

### Combine the charts
Combine the scatter plot and trend line. 
Overlay the trend line chart on top of the scatter plot. 

In [None]:

# Combine scatter plot and trend line
correlation_with_trend = (
    movie_correlation + trend_line
)

correlation_with_trend

Let's calculate and display the correlation coefficient.


In [None]:
correlation_coef = movies['IMDB_Rating'].corr(movies['Rotten_Tomatoes_Rating'])
print(f"Correlation coefficient: {correlation_coef:.3f}")


**📊 Interpreting Correlation Strength:**
- **r > 0.7**: Strong positive correlation
- **0.3 < r < 0.7**: Moderate positive correlation  
- **-0.3 < r < 0.3**: Weak or no correlation
- **-0.7 < r < -0.3**: Moderate negative correlation
- **r < -0.7**: Strong negative correlation

**💡 Visual vs. Statistical:** The trend line helps you see the relationship direction and strength, while the correlation coefficient gives you a precise measure.
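
For reference, the coefficient reported above is Pearson's r. The sketch below computes it from its definition with NumPy, skipping any pair that has a missing value, which is how pandas treats missing data when computing correlations. The `pearson_r` helper and the toy ratings are assumptions made for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation of two sequences, ignoring pairs with a NaN."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))
    dx = x[keep] - x[keep].mean()
    dy = y[keep] - y[keep].mean()
    # Covariance divided by the product of the standard deviations
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical ratings; the last film has no IMDB rating and is skipped
toy_imdb = [2.0, 4.0, 6.0, 8.0, np.nan]
toy_rotten_tomatoes = [5.0, 35.0, 60.0, 90.0, 50.0]

r = pearson_r(toy_imdb, toy_rotten_tomatoes)
print(f"Correlation coefficient: {r:.3f}")
```

Because r only measures how well a straight line describes the relationship, a clearly curved pattern can still produce a value close to 0. This is one reason to look at the scatter plot as well as the number.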



## 2D Histogram

We have looked at how we can explore the distribution of one quantitative attribute across multiple nominal attributes. What if we have two quantitative attributes? To explore the relationship between them we typically start with a scatter plot, but let's go a step further.

Let's start by returning to the unaggregated data to get a sense of the relationship between ratings from Rotten Tomatoes and from IMDB users.


<div style="border-left: 5px solid #007BFF; padding: 1em; background-color: #F0F8FF;">

<h3><b>Viz Task: Scatterplot of Rotten Tomatoes vs IMDB Ratings</b></h3>

<ul>
<li>Use the <code>circle</code> mark to create a scatterplot showing the relationship between Rotten Tomatoes and IMDB ratings.</li>
<li>Encode:
<ul>
<li><code>Rotten_Tomatoes_Rating</code> on the <b>x channel</b> as quantitative.</li>
<li><code>IMDB_Rating</code> on the <b>y channel</b> as quantitative.</li>
</ul>
</li>

</ul>
</div>


In [None]:
alt.Chart(movies).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q'),
    alt.Y('IMDB_Rating:Q')
)

To summarize this data, we can *bin* a data field to group numeric values into discrete groups. Here we bin along the x-axis by adding `bin=alt.BinParams(maxbins=20)` to the `x` encoding channel. The result is a set of at most 20 bins of equal step size, each corresponding to a fixed span of ratings points.

In [None]:
alt.Chart(movies).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q')
)

Here's what happens if we bin *both* axes of our original plot.

In [None]:
alt.Chart(movies).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
)

Detail is lost due to *overplotting*, with many points drawn directly on top of each other.

To form a two-dimensional histogram we can add a `count` aggregate as before. As both the `x` and `y` encoding channels are already taken, we must use a different encoding channel to convey the counts. Here is the result of using circular area by adding a *size* encoding channel.

In [None]:
size_hist = alt.Chart(movies).mark_circle().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Size('count()')
)

size_hist
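
Behind the `count()` aggregate, each circle's size is simply the number of films that fall into one (x-bin, y-bin) cell. A minimal sketch of that counting step with `np.histogram2d` follows; the six rating pairs and the coarse bin edges are hypothetical and chosen only to keep the result readable.

```python
import numpy as np

# Hypothetical ratings for six films
rotten_tomatoes = np.array([10, 15, 55, 60, 62, 90])
imdb = np.array([3.0, 3.5, 6.0, 6.2, 6.4, 8.5])

# Two bins per axis: [0, 50) and [50, 100] for x, [0, 5) and [5, 10] for y
rt_edges = np.array([0, 50, 100])
imdb_edges = np.array([0, 5, 10])

counts, _, _ = np.histogram2d(rotten_tomatoes, imdb, bins=[rt_edges, imdb_edges])
print(counts)  # counts[i, j] = films in x-bin i and y-bin j
```

Two films land in the low/low cell and four in the high/high cell, while the two mixed cells stay empty. The chart then maps each cell's count to a visual channel such as circle size.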

Alternatively, we can encode counts using the `color` channel and change the mark type to `bar`. The result is a two-dimensional histogram in the form of a [*heatmap*](https://en.wikipedia.org/wiki/Heat_map).

In [None]:
color_hist = alt.Chart(movies).mark_bar().encode(
    alt.X('Rotten_Tomatoes_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Y('IMDB_Rating:Q', bin=alt.BinParams(maxbins=20)),
    alt.Color('count()')
)
color_hist

In [None]:
size_hist | color_hist

Compare the *size* and *color*-based 2D histograms above. <br>Which encoding do you think should be preferred? <br>Why? In which plot can you more precisely compare the magnitude of individual values? <br> In which plot can you more accurately see the overall density of ratings?

## Scatter Matrix
A scatter matrix (or scatterplot matrix) is a grid of scatter plots in which each plot depicts the bivariate relationship between one pair of attributes. In Altair, a scatter matrix is created by calling `.repeat()`, which produces a `RepeatChart`.
Note that the main difference between the matrix below and the plot from the preceding section is the use of `alt.repeat` for the X and Y channels.

In [None]:
alt.Chart(movies).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative'),
).properties(
    width=120,
    height=120
).repeat(
    row=['Running_Time_min', 'Worldwide_Gross', 'Production_Budget', 'IMDB_Rating'],
    column=['Running_Time_min', 'Worldwide_Gross', 'Production_Budget', 'IMDB_Rating']
)

Let's talk about the new bits of code we see above

```python
encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative'),
)
```

* Uses **repeated encodings** to dynamically assign columns and rows for the scatterplots.
* `alt.repeat("column")` → the variable used on the x-axis will come from the **current column** in the `repeat` specification.
* `alt.repeat("row")` → the variable used on the y-axis will come from the **current row** in the `repeat` specification.
* `type='quantitative'` ensures Altair treats all these variables as numeric values.

---

```python
.repeat(
    row=['Running_Time_min', 'Worldwide_Gross', 'Production_Budget', 'IMDB_Rating'],
    column=['Running_Time_min', 'Worldwide_Gross', 'Production_Budget', 'IMDB_Rating']
)
```

* Creates a **matrix of scatterplots** comparing each variable with every other variable.
* **Rows** → y-axis variables.
* **Columns** → x-axis variables.
* For example (written as x vs y):

  * Top-left plot: `Running_Time_min` vs `Running_Time_min`
  * Top-right plot: `IMDB_Rating` vs `Running_Time_min`
  * Bottom-left plot: `Running_Time_min` vs `IMDB_Rating`
  * …and so on, producing a **4×4 grid** of scatterplots.


While there are 16 scatter plots, only 6 of them present unique information. The 4 scatter plots on the diagonal show a perfect positive correlation but give us no new information, because both the x and y channels encode the same attribute. In addition, observe that the 6 scatter plots above the diagonal and the 6 below correspond to each other; the only difference is that the x and y channels are switched. The scatter plot matrix is beneficial when we want to understand the bivariate relationships. An alternative representation that presents a high-level overview of just the correlation between the attributes is the correlation matrix.
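
The count of 6 can be checked by enumerating the unordered pairs of the four fields. The snippet below is a small sketch using only the standard library.

```python
from itertools import combinations

fields = ['Running_Time_min', 'Worldwide_Gross', 'Production_Budget', 'IMDB_Rating']

# Unordered pairs of distinct fields: each one appears twice in the grid,
# once above the diagonal and once below it
unique_pairs = list(combinations(fields, 2))

n_cells = len(fields) ** 2   # every cell in the 4x4 grid
n_diagonal = len(fields)     # a field plotted against itself

for x_field, y_field in unique_pairs:
    print(x_field, 'vs', y_field)
print(n_cells, n_diagonal, len(unique_pairs))
```

In general, n quantitative attributes give n(n-1)/2 unique pairs, so 10 attributes would already require reading 45 distinct scatter plots.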

## Correlation Matrix

The correlation and scatterplot matrices are both useful when we want to visualize associations between two or more quantitative attributes. Imagine a situation in which we had 10 quantitative attributes. Creating the scatterplot matrix would take no more effort than it did with 4 quantitative attributes, but it would result in a huge amount of data being represented, which contributes to cognitive overload. The correlation matrix, or correlogram (as some describe it), is a visualization technique that presents the magnitude and direction of each bivariate relationship. It shows the strength of each relationship and allows us to quickly assess which relationships are worth exploring further.

With Pandas we can calculate the correlation between two attributes.


<div style="background-color: #fff3cd; border-left:6px solid #ffecb5; padding:15px; margin:15px 0;">

<h2>Data Task </h2>
<h3>Step 1: Compute the correlation matrix</h3>
<p>
We only want to look at numeric columns (since correlation only makes sense for numeric data). 
First, select the numeric columns and then calculate their correlations.  
</p>
<p><strong>Why?</strong>  
This gives us a square table showing how strongly each pair of numeric variables is related. 
But right now it's in a wide, grid format that isn't easy to analyze or plot.
</p>

<h3>Step 2: Reshape the correlation matrix into long format</h3>
<p>
Next, we use <code>stack()</code> to reshape the correlation matrix into a tall, tidy table, and reset the index so the variable names become proper columns.  
</p>
<p><strong>Why?</strong>  
This turns the wide square matrix into a three-column table: one column for the first variable, one for the second variable, and one for their correlation value. 
That's a more useful structure for analysis and visualization.
</p>

<h3>Step 3: Clean up column names and add readable labels</h3>
<p>
Finally, rename the columns to meaningful names and add a column with the correlation rounded to two decimals for easier reading.  
</p>
<p><strong>Why?</strong>  
Clearer column names make the table easier to understand, and the formatted labels are especially useful when you want to annotate correlations on a plot later.
</p>

</div>


In [None]:
# Step 1: Select only numeric columns and compute correlation matrix
corr_matrix = movies.select_dtypes(include='number').corr()

# Step 2: Reshape correlation matrix into long format
cor_data = corr_matrix.stack().reset_index()

# Step 3: Rename columns and add formatted correlation labels
cor_data = cor_data.rename(columns={0: 'correlation',
                                    'level_0': 'variable1',
                                    'level_1': 'variable2'})
cor_data['correlation_label'] = cor_data['correlation'].map('{:.2f}'.format)

cor_data.sample(10)
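
If the reshaping is hard to picture, the same three steps can be run on a tiny, made-up table. In this sketch the column names and values are hypothetical; the point is only to show the wide correlation matrix turning into one row per pair of variables.

```python
import pandas as pd

# Hypothetical mini dataset: two numeric columns and one text column
toy = pd.DataFrame({
    'budget': [10, 20, 30, 40],
    'gross': [25, 35, 70, 80],
    'genre': ['Drama', 'Comedy', 'Action', 'Drama'],
})

# Step 1: keep numeric columns only and compute the square correlation matrix
wide = toy.select_dtypes(include='number').corr()

# Step 2: stack into long format, one row per (variable1, variable2) pair
tidy = wide.stack().reset_index()

# Step 3: give the columns readable names and add a rounded text label
tidy.columns = ['variable1', 'variable2', 'correlation']
tidy['correlation_label'] = tidy['correlation'].map('{:.2f}'.format)

print(wide.shape, tidy.shape)
print(tidy)
```

The 2×2 matrix becomes a table with four rows, one for each cell of the matrix, including the diagonal and both orderings of each pair.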



To visualize the correlations, we will use the `X` and `Y` channels to encode the names of the quantitative variables.
We will then use color to encode the correlation. Because there are no negative correlations in the dataset, we will use a sequential [single hue scheme](https://vega.github.io/vega/docs/schemes/) to represent the continuous variable `correlation`.

In [None]:
alt.Chart(cor_data).mark_rect().encode(
    alt.X('variable1:O'),
    alt.Y('variable2:O'),
    alt.Color('correlation').scale(scheme='purples'),
    alt.Tooltip(['correlation_label']),
).interactive().properties(width=200, height=200)

### Think Deeply
What quantitative attributes have the strongest correlation?
Does it make sense?
Which ones have the weakest correlation?
Play around with the color scheme and investigate which one is most helpful to support making sense of the correlations. 


## Summary
In this notebook, you have been exposed to various visualization techniques used in exploratory data analysis.
While this tutorial is far from exhaustive, it does provide you with common visualizations that can help you make sense of the data before proceeding to more complex or targeted data analysis tasks.

Interested in learning more about this topic?
- Skim through Chapters 7-9 and 16 in Claus O. Wilke's book [Fundamentals of Data Visualization](https://clauswilke.com/dataviz/)
- Explore visualizations by Function in the [Data Visualization Catalogue](https://datavizcatalogue.com/)

### Beyond the Scope
Here is one solution that clips the regression line so that it does not drop below 0.

In [None]:


# Scatter plot
scatter = alt.Chart(movies).mark_point().encode(
    x='IMDB_Rating:Q',
    y='Rotten_Tomatoes_Rating:Q'
)

# Compute regression manually
mask = movies['IMDB_Rating'].notna() & movies['Rotten_Tomatoes_Rating'].notna()
x = movies.loc[mask, 'IMDB_Rating']
y = movies.loc[mask, 'Rotten_Tomatoes_Rating']
slope, intercept = np.polyfit(x, y, 1)

# Calculate y-values at min and max x
y_at_min = slope * x.min() + intercept
y_at_max = slope * x.max() + intercept

# Clip the line to not go below 0
x_start = x.min()
y_start = y_at_min

# If y would be below 0 at x.min(), find where y = 0
if y_start < 0:
    x_start = -intercept / slope  # Solve for x when y = 0
    y_start = 0

# Create a DataFrame for the line
regression_df = pd.DataFrame({
    'IMDB_Rating': [x_start, x.max()],
    'Rotten_Tomatoes_Rating': [y_start, y_at_max]
})

# Regression line chart
regression_line = alt.Chart(regression_df).mark_line(color='red', size=3).encode(
    x='IMDB_Rating:Q',
    y=alt.Y('Rotten_Tomatoes_Rating:Q', scale=alt.Scale(domain=[0, y.max()]))
)

# Combine
chart = scatter + regression_line
chart