 # Lecture 10: 1D and Data Mapping



 In this notebook, we will explore how to visualize data in a **1D** context and how to map data to basic visual elements. We will start by showing some older plotting methods using **matplotlib** and **pandas**, just for context, and then we will move on to **Altair**.



 The focus here is on taking a **gradual approach** to Altair. We'll begin with very simple examples, then we will progressively introduce more advanced features like `alt.X()` and `alt.Y()`.



 **Dataset**: We'll use the `heart_data.csv` dataset containing medical examination information. Let’s load it and inspect the first few rows.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import altair as alt

# By default Altair raises a MaxRowsError for datasets with more than 5,000 rows; lift that limit:
alt.data_transformers.disable_max_rows()

# Read the dataset
df = pd.read_csv("../../Datasets/heart_data.csv")
df.head()


 The dataset columns are:



 - **id**: Patient identifier.

 - **age**: Age in days (we might convert it to years if needed).

 - **gender**: 1 (women), 2 (men).

 - **height**: Height in centimeters.

 - **weight**: Weight in kilograms.

 - **ap_hi**: Systolic blood pressure.

 - **ap_lo**: Diastolic blood pressure.

 - **cholesterol**: 1 (normal), 2 (above normal), 3 (well above normal).

 - **gluc**: Glucose level (1, 2, 3 with similar meaning as cholesterol).

 - **smoke**: Binary (0 if not smoking, 1 if smoking).

 - **alco**: Binary (0 if not an alcoholic, 1 if alcoholic).

 - **active**: Binary (0 if not physically active, 1 if physically active).

 - **cardio**: Binary (0 if no cardiovascular disease, 1 if it is present).
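Since `age` is stored in days, a derived column in years is usually easier to read. The snippet below is a minimal sketch of that conversion. The `sample` DataFrame and the `age_years` column name are made up for illustration (in the notebook you would apply the same line to `df`), and dividing by 365.25 is an assumption that averages over leap years.

```python
import pandas as pd

# Hypothetical mini-sample; the real values come from heart_data.csv.
sample = pd.DataFrame({"age": [18393, 20228, 14791]})

# Assumption: 365.25 days per year, to average over leap years.
sample["age_years"] = (sample["age"] / 365.25).round(1)
print(sample)
```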

 ---

 ## 1. A Quick Look at Matplotlib and Pandas



 Although our main focus is Altair, let's briefly look at some older methods to understand the evolution of plotting libraries in Python.

In [None]:
# Simple histogram of 'weight' using matplotlib
plt.figure(figsize=(8, 4))
plt.hist(df['weight'], bins=20, color='lightblue', edgecolor='black')
plt.title('Weight Distribution (Matplotlib)')
plt.xlabel('Weight (kg)')
plt.ylabel('Frequency')
plt.show()


 Pandas can also do quick visualizations since it integrates with matplotlib under the hood.

In [None]:
df['height'].plot(kind='box', title='Height Box Plot (Pandas)', ylabel='Height (cm)')
plt.show()


 As you can see, matplotlib and pandas are powerful but can sometimes be verbose or less straightforward for interactive or layered visualizations.



 ---

 ## 2. Introduction to Altair



 **Altair** is a declarative statistical visualization library. Here are some key ideas before we start:



 - **Data**: The data source (typically a pandas DataFrame).

 - **Mark**: The basic graphical shape (like `mark_bar()`, `mark_circle()`, etc.).

 - **Encoding**: A mapping between your data columns and visual properties (e.g., color, size, and position).

 - **Data Types**: Altair uses short type codes when encoding data:

   - **Q**: Quantitative (numeric values, e.g., `height`, `weight`).

   - **T**: Temporal (time or date).

   - **O**: Ordinal (discrete data with a meaningful order, e.g., `cholesterol` levels 1 < 2 < 3).

   - **N**: Nominal (categorical data with no intrinsic order, e.g., `gender`).



 Let’s start **very simply**. We'll plot every patient's `height` as a point along a single axis.

In [None]:
# A minimal Altair chart:

basic_1dpoints = alt.Chart(df).mark_point().encode(
    # We can specify just the field names if we don't need special config yet.
    x='height',
)
basic_1dpoints


 Basic, right? We just plotted the `height` column as a 1D scatter plot by mapping it to the x-axis.

 Now, let's create a bar chart counting the number of patients for each `gender`.

In [None]:
basic_bar = alt.Chart(df).mark_bar().encode(
    # Altair does not aggregate on its own: we request 'count()' on the y-axis explicitly.
    x='gender',
    y='count()'
)

basic_bar


 **How does this work?**

 - `alt.Chart(df)`: We create a new Chart object using our DataFrame.

 - `.mark_bar()`: We want bar marks.

 - `.encode(x='gender', y='count()')`: We encode the `gender` column on the x-axis and the aggregate `count()` on the y-axis, so each bar’s height represents the number of rows for that `gender`.
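To make the `count()` aggregation concrete, here is a rough pandas equivalent of the numbers the bars encode. The `sample` DataFrame is a made-up stand-in for `df`. This is only a sketch of the same tally, not how Altair computes it internally.

```python
import pandas as pd

# Hypothetical mini-sample standing in for the heart_data.csv DataFrame.
sample = pd.DataFrame({"gender": [1, 2, 1, 1, 2]})

# Number of rows per gender code: the quantity each bar's height represents.
counts = sample["gender"].value_counts().sort_index()
print(counts)
```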



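 Note that we didn’t specify any data types (`N`, `Q`, etc.) explicitly. Altair infers them from the DataFrame's column dtypes, which means the integer-coded `gender` is treated as quantitative rather than categorical. Let's be explicit so that the chart treats it as a category.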

In [None]:
# Let's do the same chart but specifying types:
basic_bar_explicit = alt.Chart(df).mark_bar().encode(
    x='gender:N',      # Treat gender as Nominal (categorical)
    y='count()'        # A built-in aggregation for counting rows
)

basic_bar_explicit


 ## 3. Understanding Data Types in Altair



 Altair recognizes several types of data and uses these type codes within the encoding channels:



 - **Quantitative (Q)**: Numeric values, such as `height`, `weight`, or any continuous measurement.

 - **Temporal (T)**: Time or date values, often used for time-series data.

 - **Ordinal (O)**: Data with a logical order or ranking (e.g., small < medium < large).

 - **Nominal (N)**: Categorical data with no intrinsic order (e.g., `gender`, `cholesterol` levels if treated as categories).



 By default, Altair often *infers* these data types, but being explicit can help avoid confusion—especially when customizing how data is displayed.



 Let’s explore a histogram of the `weight` column using Altair. Because `weight` is quantitative (Q), we can `bin` the values to see their distribution.

In [None]:
hist_weight = alt.Chart(df).mark_bar().encode(
    alt.X('weight:Q', bin=True, title='Weight (kg)'),  # Bin the weight values
    alt.Y('count()', title='Count of Patients')
).properties(
    width=400,
    height=200,
    title='Histogram of Weight'
)

hist_weight


## 4. Composing Plots
Altair allows us to combine multiple plots into a single visualization. We can use the `|` operator to place them side by side or the `&` operator to stack them vertically.

Let's create a box plot of `height` for each gender and place the two charts side by side.


In [None]:

box_height_gender1 = alt.Chart(df.query("gender == 1")).mark_boxplot().encode(
    y='height:Q'
).properties(
    width=300,
    title='Distribution of Height for Gender 1'
)

box_height_gender2 = alt.Chart(df.query("gender == 2")).mark_boxplot().encode(
    y='height:Q'
).properties(
    width=300,
    title='Distribution of Height for Gender 2'
)

                               

box_height_gender1|box_height_gender2



 ## 5. Customizing Encodings

 Altair allows us to map columns to various visual channels such as:

 - `alt.X()` and `alt.Y()` for positions on horizontal/vertical axes.
 - `alt.Color()` for color.
 - `alt.Size()` for size.
 - `alt.Opacity()`, `alt.Shape()`, etc.

 We can also customize scales, legends, and axis labels.

 Let’s try something a bit more interesting: a **box plot** of `height` by `gender`. Since gender is nominal, we want to ensure we specify that, and let Altair do the rest.

 ### 5.1 Box Plot of Height Grouped by Gender


In [None]:
box_height_gender = alt.Chart(df).mark_boxplot().encode(
    x='gender:N',       # Nominal
    y='height:Q',        # Quantitative
    # color="gender:N",
).properties(
    width=300,
    title='Distribution of Height by Gender'
)

box_height_gender


 This gives us a quick statistical summary of the `height` distribution for each gender.
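The box in a box plot spans the first to the third quartile, with a line at the median. As a rough cross-check of what the chart summarizes, the same quartiles can be computed in pandas. The `sample` DataFrame is a made-up stand-in for `df`, and Altair's own quantile method may differ slightly from pandas' default interpolation.

```python
import pandas as pd

# Hypothetical mini-sample standing in for the heart_data.csv DataFrame.
sample = pd.DataFrame({
    "gender": [1, 1, 1, 2, 2, 2],
    "height": [156, 160, 165, 168, 172, 178],
})

# Quartiles of height per gender code: box edges (25%, 75%) and median (50%).
summary = sample.groupby("gender")["height"].describe()[["25%", "50%", "75%"]]
print(summary)
```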



 ---

 ## 6. Plotting Multiple Variables: Scatter Plots



 One of the most common ways to look for relationships between two quantitative variables is through a **scatter plot**. In our dataset, `weight` and `height` are both good examples of continuous (Q) variables.



 ### 6.1 Basic Scatter Plot of Weight vs. Height

In [None]:
scatter_wh = alt.Chart(df).mark_point().encode(
    x='height:Q',
    y='weight:Q'
).properties(
    width=400,
    height=300,
    title='Scatter Plot: Weight vs. Height'
)

scatter_wh


 This basic scatter plot shows how height (x-axis) relates to weight (y-axis), but it's just a simple cloud of points.



 ### 6.2 Adding Color to Represent a Third Variable



 Often, we want to see how a third variable might influence the relationship between two numeric variables. For example, let's color our points by `gender`.



 - `gender` is categorical, so we use `:N` or `:O`.

 - We can directly specify `alt.Color('gender:N')`.

In [None]:
scatter_wh_color = alt.Chart(df).mark_point(
    filled=True,  # Filled circles
    size=50       # Larger size
).encode(
    x='height:Q',
    y='weight:Q',
    color='gender:N'  # Color points by gender
).properties(
    width=400,
    height=300,
    title='Weight vs. Height Colored by Gender'
)

scatter_wh_color


 Now we can see how male and female patients (in this dataset, labeled 2 and 1, respectively) may occupy different regions in the height-weight space.



 **Tip**: If you want to make the legend or color scale more descriptive, you can replace the numeric codes (1 and 2) with actual labels. One way is to create a mapping in your DataFrame before plotting:



 ```python

 gender_map = {1: 'Female', 2: 'Male'}

 df['gender_label'] = df['gender'].map(gender_map)

 ```



 Then encode:



 ```python

 color='gender_label:N'

 ```



 ---

 ## 7. Putting It All Together



 Below is a final example combining some of these concepts:



 - **Histogram** to see the distribution of `height`.

 - **Scatter plot** of `height` vs. `weight` with color by `gender`.

 - We also add some descriptive properties like titles and axis labels.

In [None]:
# Histogram of height (binned)
hist_height = alt.Chart(df).mark_bar().encode(
    alt.X('height:Q', bin=True, title='Height (cm)'),
    alt.Y('count()', title='Count')
).properties(
    width=300,
    height=200,
    title='Histogram of Height'
)

# Scatter plot of height vs. weight colored by gender
scatter_hw_gender = alt.Chart(df).mark_point().encode(
    alt.X('height:Q', title='Height (cm)'),
    alt.Y('weight:Q', title='Weight (kg)'),
    alt.Color('gender:N', title='Gender')
).properties(
    width=300,
    height=200,
    title='Scatter: Height vs Weight'
)

# Combine the two charts side by side
combined_charts = hist_height | scatter_hw_gender
combined_charts



 ## 8. Cholesterol Levels vs. Cardiovascular Disease

 High cholesterol levels have long been associated with an increased risk of heart disease. In this chart, we compare the proportion of cardiovascular disease (`cardio`) across three categories of cholesterol: **Normal**, **Above Normal**, and **Well Above Normal**.

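 The chart below is a stacked bar chart showing the proportion of patients with and without CVD in each cholesterol category. The x-axis shows the dataset's numeric codes (1, 2, 3); descriptive labels can be created with a mapping, in the same way as the `gender_map` tip in Section 6.2.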

In [None]:
cholesterol_cvd_chart = (
    alt.Chart(df)
    .mark_bar()
    .encode(
        x=alt.X('cholesterol:O', title='Cholesterol Level'),
        # Use stack='normalize' on the y-axis to see proportions instead of raw counts
        y=alt.Y('count()', stack='normalize', title='Proportion of Patients'),
        color=alt.Color('cardio:N', title='Has CVD?'),
        tooltip=['count()']  # Optional: see raw counts on hover
    )
    .properties(title='Proportion of Cardiovascular Disease by Cholesterol Level')
)

cholesterol_cvd_chart
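The proportions that `stack='normalize'` displays can be approximated in pandas with a row-normalized cross-tabulation. The sketch below also adds descriptive cholesterol labels. The `sample` DataFrame, the `chol_labels` mapping, and the `cholesterol_label` column name are all made up for illustration; the label text follows the column description at the top of the notebook.

```python
import pandas as pd

# Hypothetical mini-sample standing in for the heart_data.csv DataFrame.
sample = pd.DataFrame({
    "cholesterol": [1, 1, 1, 1, 2, 2, 3, 3],
    "cardio":      [0, 0, 0, 1, 0, 1, 1, 1],
})

# Assumed label text, based on the dataset description.
chol_labels = {1: "Normal", 2: "Above Normal", 3: "Well Above Normal"}
sample["cholesterol_label"] = sample["cholesterol"].map(chol_labels)

# normalize="index" divides each row by its total, so every row sums to 1.
proportions = pd.crosstab(
    sample["cholesterol_label"], sample["cardio"], normalize="index"
)
print(proportions)
```

If the labels are used on the chart's x-axis, they will sort alphabetically unless an explicit order is given, for example by passing a list to the `sort` argument of `alt.X`.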