In [1]:
# Initialize Otter
import otter
grader = otter.Notebook("pcp5.ipynb")

# Programming Checkpoint 5 - EDA I

## Visualizing distributions of birds

In this quiz we are going to work with a dataset of hawks that we saw in class.  
You can view a short description of some of the data frame columns in this table:

| Column       | Description                                                                                      |
|--------------|--------------------------------------------------------------------------------------------------|
| month        | Month of capture                                                                                 |
| year         | Year of capture                                                                                  |
| species      | CH=Cooper's or SS=Sharp-Shinned                                                       |
| age          | A=Adult or I=Immature                                                                            |
| sex          | F=Female or M=Male                                                                               |
| wing         | Length (in mm) of primary wing feather from tip to wrist it attaches to                          |
| weight       | Body weight (in gm)                                                                              |
| culmen       | Length (in mm) of the upper bill from the tip to where it bumps into the fleshy part of the bird |
| hallux       | Length (in mm) of the killing talon                                                              |
| tail         | Measurement (in mm) related to the length of the tail                                            |

<div class="alert alert-success" style="color: black; padding: 15px; border-radius: 8px; background-color: #d4edda;">
  <h3>Autograding System</h3>
  <p>This notebook uses <code>otter-grader</code> for immediate feedback. Run tests like:</p>
  <pre>grader.check("Task_A")</pre>
  <p>You will not receive feedback during the session; hidden tests will run after submission on PrairieLearn.</p>
</div>

<div class="alert alert-info" style="color: black; padding: 15px; border-radius: 8px; background-color: #eaf4ff;">
  <h3>Instructions</h3>
  <ul>
    <li>Complete all tasks within the time limit</li>
    <li>Test your code as you go to ensure it runs without errors</li>
    <li>Focus on working solutions rather than perfect optimization</li>
    <li>Use pandas documentation if needed, but work efficiently</li>
  </ul>

  <h4>Submission Requirements:</h4>
  <ul>
    <li>Answer all questions and save your work</li>
    <li><strong>Before submitting:</strong> restart the kernel and rerun all cells (click the ▶▶ button)</li>
    <li>Click on the question title in the teal bar at the top to return to PrairieLearn, then click "Save and Grade"</li>
    <li>Don't change given variable names, move cells around, or include package installation code</li>
    <li>Submission may take 1-2 minutes to process</li>
  </ul>
</div>

## Dataset and Environment Setup

Let's import the libraries and load our hawks dataset.

In [2]:
import pandas as pd
import altair as alt

alt.renderers.enable("default")

RendererRegistry.enable('default')

In [3]:
filepath = 'data/hawks.csv'
hawks = pd.read_csv(filepath)

hawks.shape

(908, 20)

<div style="background-color: #e8edf1ff; border-left:6px solid #d2d8deff; padding:15px; margin:15px 0;">

<h2>Data Preparation </h2>
<h4>Do not change anything in the cell below; it prepares the <code>hawks</code> dataset.</h4>

We have done some pre-processing for you that is a bit different from the one we did in class, so please make sure you read through the cell below before you start answering questions. 
    
</div>

In [4]:

cols_to_drop = [
    'Unnamed: 0', 'ReleaseTime', 'StandardTail', 'Tarsus',
    'KeelFat', 'Crop', 'BandNumber', 'CaptureTime', 'WingPitFat', 'Day'
]

hawks = (
    hawks
    .drop(columns=cols_to_drop, errors='ignore')
    .rename(columns=lambda x: x.strip().lower())
    .dropna(subset=['wing', 'sex'])
)


print(hawks.shape)



(332, 10)
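
The preparation chain above can be tried on a small frame to see what each step does. This is a minimal sketch on a hypothetical toy DataFrame: the column names and values are invented for illustration and are not taken from `hawks.csv`.

```python
import pandas as pd

# Hypothetical toy frame (not the hawks data); names are made up for illustration
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Wing ": [200.0, None, 180.0],
    "Sex": ["F", "M", None],
    "Weight": [400.0, 150.0, None],
})

clean = (
    raw
    # "Day" is not in this toy frame; errors="ignore" skips missing labels
    .drop(columns=["Unnamed: 0", "Day"], errors="ignore")
    # strip whitespace and lowercase every column name, e.g. "Wing " -> "wing"
    .rename(columns=lambda x: x.strip().lower())
    # keep only rows where both wing and sex are present
    .dropna(subset=["wing", "sex"])
)

print(clean.columns.tolist())
print(clean.shape)
```

Note that `dropna(subset=...)` only checks the listed columns, so other columns can still contain missing values after this step.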


## Statistical Data Exploration
Explore the `hawks` dataset to understand the completeness of the data and the distribution of wing lengths. You will examine missing values and calculate summary statistics for wing lengths across different categorical variables: age, sex, and species.


<div style="background-color:#fff3cd; border-left:6px solid #ffecb5; padding:15px; margin:15px 0;">
  
<h2>DATA TASK: Hawk Wing Length Analysis</h2>

<strong>Missing data landscape:</strong>
<ol>
  <li>Calculate the number of missing values per column in the <code>hawks</code> dataset and extract only the columns that have missing values.</li>
  <li>Compute the total number of missing values in the dataset.</li>
  <li>Calculate the percentage of complete rows (rows without any missing values).</li>
</ol>

<strong>Group-wise statistical analysis:</strong>
<ol>
  <li>Perform group-wise statistical analysis for <code>wing</code> by <code>age</code>.</li>
  <li>Perform group-wise statistical analysis for <code>wing</code> by <code>sex</code>.</li>
  <li>Perform group-wise statistical analysis for <code>wing</code> by <code>species</code>.</li>
</ol>
Hint: use <code>describe()</code> for the statistical analysis.

</div>
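
The two kinds of summaries described above can be sketched on a hypothetical toy DataFrame. The frame and all variable names below (`toy`, `na_counts`, `length_by_group`, and so on) are assumptions made for illustration; this shows one common pandas pattern, not the expected solution for `hawks`.

```python
import pandas as pd

# Hypothetical toy frame (not the hawks data)
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "length": [10.0, 12.0, 20.0, 24.0],
    "mass": [1.0, None, 3.0, None],
    "beak": [0.5, 0.6, None, 0.8],
})

# Missing values per column, then only the columns that have any
na_counts = toy.isna().sum()
na_only = na_counts[na_counts > 0]

# Total number of missing cells in the frame
total_na = int(na_counts.sum())

# Percentage of rows in which every column is present
pct_complete = toy.notna().all(axis=1).mean() * 100

# Summary statistics of one numeric column within each group
length_by_group = toy.groupby("group")["length"].describe()

print(na_only)
print(total_na, pct_complete)
print(length_by_group)
```

A row counts as complete only if none of its values is missing, so the percentage of complete rows cannot be derived from the total number of missing cells alone.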


_Points:_ 18

In [5]:
# 1. Missing data landscape
missing_summary = 

missing_values_column = 
print("Missing values per column:")
print(missing_values_column)

total_missing_values = 
print('Total missing values is:', total_missing_values)

percentage_complete = 
print('Percentage complete is:', percentage_complete)


# 2. Group-wise statistical analysis
# Hypothesis: Does age explain the distribution?
age_stats = 
print("Wing length statistics by age:")
print(age_stats)

# Hypothesis: Does sex explain the distribution?
sex_stats = 
print("Wing length statistics by sex:")
print(sex_stats)

# Hypothesis: Does species explain the distribution?
specie_stats = 
print("Wing length statistics by species:")
print(specie_stats)

Missing values per column:
weight    5
culmen    4
hallux    3
dtype: int64
Total missing values is: 12
Percentage complete is: 97.89156626506023
Wing length statistics by age:
     count        mean        std    min    25%    50%    75%    max
age                                                                 
A    101.0  205.495050  41.852748  145.0  172.0  198.0  230.0  425.0
I    231.0  197.359307  40.859312  143.0  167.5  194.0  205.0  410.0
Wing length statistics by sex:
     count        mean        std    min    25%    50%     75%    max
sex                                                                  
F    174.0  213.643678  40.262140  143.0  194.0  200.0  209.00  425.0
M    158.0  184.626582  36.888105  155.0  162.0  169.0  193.75  381.0
Wing length statistics by species:
         count        mean        std    min     25%    50%     75%    max
species                                                                   
CH        68.0  244.308824  32.342504  145.0  226.5

---

<div style="background-color: #d2e6f8ff; border-left:6px solid #9dcffaff; padding:15px; margin:15px 0;">

<h2>VIZ TASK: Hawk Wing Length Distribution</h2>

<h4>Chart Specifications:</h4>
<ul>
  <li><b>Mark:</b> <code>bar</code></li>
  <li><b>X channel:</b> <code>wing:Q</code>, maximum bins should be 30, title = "Wing (mm)".</li>
  <li><b>Y channel:</b> <code>count()</code>, title = "Number of Hawks".</li>
</ul>

<h4>Styling Specifications:</h4>
<ul>
  <li><b>Chart Properties:</b> Width = 400px, Height = 200px.</li>
  <li><b>Mark Styling:</b> binSpacing = 0</li>
  <li><b>Title:</b> <i>"Histogram: Hawk Wing Length Distribution"</i>.</li>
</ul>

</div>


_Points:_ 16

In [6]:
hawk_histogram = 

# show the chart
hawk_histogram

In [7]:
grader.check("Task_VIZ_B")

---
<div style="background-color: #d2e6f8ff; border-left:6px solid #9dcffaff; padding:15px; margin:15px 0;">

<h2>VIZ TASK: Density Plot - Wing Distribution</h2>

<h4>Chart Specifications:</h4>
<ul>
  <li><b>Transform:</b> density transform on <code>wing</code>.</li>
  <li><b>Mark:</b> <code>area</code>.</li>
  <li><b>X channel:</b> <code>wing:Q</code>, title = "Wing (mm)".</li>
  <li><b>Y channel:</b> <code>density:Q</code>, title = "Density".</li>
 </ul>

<h4>Styling Specifications:</h4>
<ul>
  <li><b>Chart Properties:</b> Width = 400px, Height = 250px.</li>
  <li><b>Title:</b> <i>"Density Plot: Distribution of Hawk Wing Length"</i>.</li>
  <li><b>Mark Styling:</b> color=<code>'black'</code>.</li>
</ul>

</div>


_Points:_ 15

In [8]:
hawk_density = 
# Show the plot
hawk_density 

In [9]:
grader.check("Task_VIZ_C")

---

<div style="background-color: #d2e6f8ff; border-left:6px solid #9dcffaff; padding:15px; margin:15px 0;">

<h2>VIZ TASK: Density Plots by Categorical Attributes (Age, Sex, Species) </h2>

<p>This question works a little differently. First, create a base chart that includes the mark and the encodings detailed below.</p>
    
<h4>Chart Specifications:</h4>
<ul>
  <li><b>Base chart:</b> Create a density chart of <code>wing</code> with <code>mark_area()</code> and <code>opacity=0.7</code></li>
  <li><b>X channel:</b> <code>wing</code> with title "Wing (mm)"</li>
  <li><b>Y channel:</b> <code>density</code> with title "Density"</li>
  <li><b>Base Chart Properties:</b> Width = 400px, Height = 200px</li>
</ul>

    
<h4>Individual Charts</h4>
<p>Next, create three charts that each start from the base chart. For each chart, add the density transform, encode the color channel, and set the title. Each chart should have the following structure:</p>
    <pre><code>
    X_chart = density_base.transform_density(
        '...',
        as_=['...', '...'],
        groupby= ['...']
    ).encode(
        color=...
    ).properties(
        title="..."
    )
    </code></pre>
<p>The styling for each of the three charts is described below:</p>
<ul>
<li><b>sex</b> - Title: "Density of Hawk Wing Lengths by Sex", color range is <code>['dodgerblue', 'gold']</code></li>
<li><b>age</b> - Title: "Density of Hawk Wing Lengths by Age", color range is <code>['cyan', 'brown']</code></li>
<li><b>species</b> - Title: "Density of Hawk Wing Lengths by Species", color range is <code>['salmon', 'violet', 'steelblue']</code></li>
</ul>
    
    
<b>Also make sure that the color legend title is set appropriately for each chart.</b>
  
</div>



_Points:_ 25

In [10]:
# Base density chart
density_base = alt.Chart(hawks).mark_area(
    opacity = 0.7
).encode(
    alt.X("wing:Q", title = "Wing (mm)"),
    alt.Y("density:Q", title = "Density")
).properties(
    width = 400,
    height = 200)

sex_chart = density_base

age_chart = density_base
species_chart = density_base

# Show all three charts
(sex_chart & age_chart & species_chart).resolve_scale(
   color='independent'
)

In [11]:
grader.check("Task_VIZ_D")

<div style="background-color: #d2e6f8ff; border-left:6px solid #9dcffaff; padding:15px; margin:15px 0;">

<h2>VIZ TASK: Faceted Density Plot - Wing distribution of hawks</h2>

<h3> You will first create the <code>density_base</code> chart as specified below and then you will create <code>density_facets_color</code> chart that builds on the base chart </h3>

<h4>DENSITY_BASE Chart Specifications:</h4>
<ul>
  <li><b>Transform:</b> density transform on <code>wing</code>, grouped by the <code>species</code> and <code>sex</code> attributes</li>
  <li><b>Mark:</b> <code>mark_area()</code> with opacity of 0.6</li>
  <li><b>X channel:</b> <code>wing</code> with title "Wing (mm)"</li>
  <li><b>Y channel:</b> <code>density</code> with title "Density"</li>
  <li><b>Color:</b> <code>sex</code> with title 'Sex', using the <code>set2</code> color scheme</li>
  <li><b>Chart Properties:</b> Width = 300px, Height = 150px (per facet)</li>
</ul>

<h4>DENSITY_FACETS_COLOR Chart Specifications: </h4>

<ul>
  <li><b>Facet:</b> <code>species</code> in rows (one column)</li>
  <li><b>Title:</b> 'Density of Hawk Wing Lengths by Species and Sex' (one title for entire view)</li>
  <li><b>View configuration:</b> remove the stroke from the faceted charts</li>
</ul>

 The structure of the <code>density_facets_color</code> chart should be as follows:
    <pre><code>
    density_facets_color = density_base.facet(...).configure_view(...).properties(...)
    </code></pre>

</div>


_Points:_ 26

In [12]:
# Base density chart
density_base = alt.Chart(hawks).transform_density(
    "wing",
    as_ = ["wing", "density"],
    groupby = ["species", "sex"]
).mark_area(
    opacity = 0.6
).encode(
    alt.X("wing:Q", title = "Wing (mm)"),
    alt.Y("density:Q", title = "Density"),
    alt.Color("sex:N", title = "Sex", scale = alt.Scale(scheme = "set2"))
).properties(
    width = 300,
    height = 150)

# Facet by species
density_facets_color = density_base.facet(
    "species:N",
    columns = 1
).configure_view(
    stroke=None
).properties(
    title = 'Density of Hawk Wing Lengths by Species and Sex'
)

# Show chart
density_facets_color

In [13]:
grader.check("Task_VIZ_E")

<div class="alert alert-success" style="color: black; padding: 15px; border-radius: 8px; background-color: #d4edda;">
<h3>Programming Checkpoint Complete!</h3>
<h4>Final Submission Steps:</h4>
<ul>
<li><strong>Restart and run all cells:</strong> Click the ▶▶ button or go to <code>Kernel → Restart Kernel and Run All Cells...</code> in the menu to ensure there are no errors</li>
<li><strong>Save your file:</strong> Make sure your work is saved</li>
<li><strong>Submit your assessment:</strong> Return to the main PL assessment page for the Quiz and submit your entire assessment</li>
</ul>
</div>