# Week 1 Exploratory Data Analysis

By the end of this worksheet, you will be able to:

1.	Highlight the importance and objectives of exploratory data analysis (EDA). 
2.	Evaluate which statistical measure is most appropriate in a given scenario, including measures of central tendency and variability. 
3.	Identify the most appropriate visualization to investigate a given variable (or a set of variables) in a data set and to communicate an idea. 
4.	Investigate the relationship between two or more variables through correlation. 
5.	Review the usage and execution of certain types of visualizations.
6.	Use computer code to effectively visualize data, including understanding how different layers can be utilized to improve a visualization.
7.	Critically analyse a visualization and highlight potential sources of improvement. 
8.	Recognize how the choice of visualization can result in potential misinterpretations or biased representation of the data. 
9.	Discuss how decisions taken during EDA can affect the subsequent data analysis pipeline.
10.	Recognize the role of sample splitting in addressing challenges in data analyses. 

## Getting Started

Before beginning the worksheet, let us load the packages we will be using throughout the worksheet.

In [None]:
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(corrr))
suppressPackageStartupMessages(library(themis))
suppressPackageStartupMessages(library(kknn))

In this worksheet we will be working with the [Spotify songs dataset from Kaggle](https://www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs/data). This dataset contains 32,833 songs with 23 features, including track name, artist, track popularity, track length, and release date. It also contains several musical characteristics, such as danceability, energy, loudness, speechiness, acousticness, and tempo. You can visit the link above to read more about each variable and how it was measured. For simplicity, we have randomly sampled 3,000 songs from this dataset for this worksheet.

<div style="text-align: center;">
  <img 
    src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Spotify_logo_with_text.svg/1118px-Spotify_logo_with_text.svg.png" 
    alt="Spotify Logo" 
    style="width: 400px; max-width: 100%;"
  />
</div>

In [None]:
#Reading the data
spotify_data <- read_csv("data/spotify_songs.csv")

In [None]:
head(spotify_data)

The goal of this worksheet is to perform some exploratory data analysis (EDA) on this dataset. This includes visualizing the distributions of different variables, calculating measures of central tendency, examining the covariability between variables, and understanding how decisions we make during EDA can influence downstream inference and modelling. Before we dive into the EDA, we need to do some basic data wrangling. For simplicity, we will extract only the year from the release date and keep only the columns we are interested in. 

In [None]:
# Extract year
spotify_data <- spotify_data |> 
 mutate(year = lubridate::year(as.Date(track_album_release_date, format = "%Y-%m-%d")))

spotify_data |>
    select(year) |>
    table()
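
The strict `"%Y-%m-%d"` format only parses complete dates. As a minimal sketch, assuming (hypothetically) that some release dates are recorded as a year or a year and month only, the made-up vector below shows that such entries become `NA`, and `table()` omits `NA` values by default. `release_dates` and its values are illustrative and are not taken from the dataset.

```r
# Made-up release dates: one complete, two partial (assumption for illustration)
release_dates <- c("2019-06-14", "2012", "1998-03")

# Strict parsing: anything that is not a full YYYY-MM-DD date becomes NA
parsed <- as.Date(release_dates, format = "%Y-%m-%d")
year_strict <- as.integer(format(parsed, "%Y"))

# One possible fallback: take the first four characters as the year
year_prefix <- as.integer(substr(release_dates, 1, 4))

data.frame(release_dates, year_strict, year_prefix)
```

Counting missing values, for example with `sum(is.na(spotify_data$year))`, shows how many tracks would be affected before deciding how to handle them.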

In [None]:
#Selecting only the columns we need
spotify_data <- spotify_data |> 
  select(track_name, track_artist, year, 
         track_popularity, duration_ms, 
         playlist_genre,
         danceability, energy, key, 
         loudness, mode, speechiness, 
         acousticness, instrumentalness, 
         liveness, valence, tempo) |> 
  rename(name = track_name, artists = track_artist,
         genre = playlist_genre, popularity = track_popularity)

#Checking the structure of the data
glimpse(spotify_data) 

Alright, looks like we are ready to dive into some EDA! This is also a good opportunity to refresh some of the basics of using `ggplot2` to visualize data.

---

### Warm-up

**Question 1**

For each of the following questions, assign the correct option to the given variable in quotes. For example, if the correct answer is option B, enter `answer1.1 <- "B"` in the answer cell.


1.1. Which of the following creates a histogram of the `popularity` variable? (1 point)

    A. `ggplot(spotify_data, aes(x = popularity)) + geom_bar(stat = "identity")`   
    B. `ggplot(spotify_data, aes(x = popularity)) + geom_histogram()`  
    C. `ggplot(spotify_data, aes(x = popularity)) + geom_point()`  
    D. `ggplot(spotify_data, aes(y = popularity)) + geom_line()`  


In [None]:
#assign the answer to answer1.1
# answer1.1 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()


In [None]:
library(digest)
stopifnot("type of answer1.1 is not character"= setequal(digest(paste(toString(class(answer1.1)), "d3ced")), "a279991fd55d4a12d2d38c7c9a04aae9"))
stopifnot("length of answer1.1 is not correct"= setequal(digest(paste(toString(length(answer1.1)), "d3ced")), "4bcb0f1404e642bf70c1e5f62538b416"))
stopifnot("value of answer1.1 is not correct"= setequal(digest(paste(toString(tolower(answer1.1)), "d3ced")), "47fd44c3b5b7306df9af5a04fa9a4bd1"))
stopifnot("letters in string value of answer1.1 are correct but case is not correct"= setequal(digest(paste(toString(answer1.1), "d3ced")), "01c49b9545ecb02882243b426d8caddb"))

print('Success!')

1.2. What is the default statistical transformation applied by `geom_bar()`? (1 point)

    A. `stat = "identity"`  
    B. `stat = "density"`  
    C. `stat = "bin"`  
    D. `stat = "count"`  

In [None]:
#assign the answer to answer1.2
# answer1.2 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()

In [None]:
library(digest)
stopifnot("type of answer1.2 is not character"= setequal(digest(paste(toString(class(answer1.2)), "cf85e")), "f85c8da4e00f67eefffcbe167deef57e"))
stopifnot("length of answer1.2 is not correct"= setequal(digest(paste(toString(length(answer1.2)), "cf85e")), "ebb2db096f8873ea13911456609631b5"))
stopifnot("value of answer1.2 is not correct"= setequal(digest(paste(toString(tolower(answer1.2)), "cf85e")), "e5dfe49f7eed04500f124a76c787a11e"))
stopifnot("letters in string value of answer1.2 are correct but case is not correct"= setequal(digest(paste(toString(answer1.2), "cf85e")), "f30fcee93b231a16f7032bb4c4d970a2"))

print('Success!')

1.3. Which of the following `geom_*()` functions creates a scatterplot between two numerical variables? (1 point)

    A. `geom_dots()`  
    B. `geom_point()`  
    C. `geom_dotplot()`  
    D. `geom_scatter()`  

In [None]:
#assign the answer to answer1.3
# answer1.3 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()



In [None]:
library(digest)
stopifnot("type of answer1.3 is not character"= setequal(digest(paste(toString(class(answer1.3)), "61ae8")), "31f3f5b23db1f0c426ac8542d0a4fa0d"))
stopifnot("length of answer1.3 is not correct"= setequal(digest(paste(toString(length(answer1.3)), "61ae8")), "057356ffb5427cdc022170f5d8c30e77"))
stopifnot("value of answer1.3 is not correct"= setequal(digest(paste(toString(tolower(answer1.3)), "61ae8")), "6bd4e0b13cd1c7eb14e4ff1bbb68f5c4"))
stopifnot("letters in string value of answer1.3 are correct but case is not correct"= setequal(digest(paste(toString(answer1.3), "61ae8")), "bf9a6dbe58ba7b0aefa085ccf3b0997b"))

print('Success!')

1.4. In a scatterplot comparing two variables, which aesthetic is *typically* mapped to identify groups based on a categorical variable? (1 point)

    A. `x`  
    B. `y`  
    C. `color`  
    D. `size`  

In [None]:
#assign the answer to answer1.4
# answer1.4 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()


In [None]:
library(digest)
stopifnot("type of answer1.4 is not character"= setequal(digest(paste(toString(class(answer1.4)), "8735a")), "41400864452d76545d71940b54e1f10b"))
stopifnot("length of answer1.4 is not correct"= setequal(digest(paste(toString(length(answer1.4)), "8735a")), "629e27c833ebf3e9254706062c516e72"))
stopifnot("value of answer1.4 is not correct"= setequal(digest(paste(toString(tolower(answer1.4)), "8735a")), "8c805f5ca4fd61e7ec33aee6f8893799"))
stopifnot("letters in string value of answer1.4 are correct but case is not correct"= setequal(digest(paste(toString(answer1.4), "8735a")), "012a29c08edc23dd5f690837e88ff4ad"))

print('Success!')

1.5. What does `facet_wrap(~genre)` do in a ggplot? (1 point)

    A. Applies a color gradient to genres  
    B. Adds multiple plots based on the genre variable  
    C. Combines multiple aesthetic mappings into one plot  
    D. Filters the dataset to only include the genre variable  

In [None]:
#assign the answer to answer1.5
# answer1.5 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()


In [None]:
library(digest)
stopifnot("type of answer1.5 is not character"= setequal(digest(paste(toString(class(answer1.5)), "36ce7")), "3d111022d033c9552e3c26738cfc11b6"))
stopifnot("length of answer1.5 is not correct"= setequal(digest(paste(toString(length(answer1.5)), "36ce7")), "b96716821859bea08d1dfdc21b853cdf"))
stopifnot("value of answer1.5 is not correct"= setequal(digest(paste(toString(tolower(answer1.5)), "36ce7")), "d34cf363414528afd95ecb912f929386"))
stopifnot("letters in string value of answer1.5 are correct but case is not correct"= setequal(digest(paste(toString(answer1.5), "36ce7")), "197aa8f6ea7cca78cde934c7a928a5a0"))

print('Success!')

---
**Question 2**

Let's begin to visualize some of the variables in the dataset. Plot the distribution of the `popularity` of the tracks as a histogram. Replace `...` with the appropriate code and assign the plot to the variable `answer2`. (1 point)


In [None]:
 

# answer2 <- ggplot(spotify_data, aes(x = ...)) +
#     ...(binwidth = ..., fill = "steelblue", color = "white") +
#     labs(title = "Distribution of Track Popularity",
#              x = "Popularity",
#              y = "Count") +
#     theme_minimal()

 # YOUR CODE HERE
 fail()

print(answer2)

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(answer2$layers)), function(i) {c(class(answer2$layers[[i]]$geom))[1]})), "15899")), "7f10d9d743578cdf6e863bd3da666021"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(answer2$layers)), function(i) {rlang::get_expr(c(answer2$layers[[i]]$mapping, answer2$mapping)$x)}), as.character))), "15899")), "626a8cd5a3562c42c49fc664e9440faf"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(answer2$layers)), function(i) {rlang::get_expr(c(answer2$layers[[i]]$mapping, answer2$mapping)$y)}), as.character))), "15899")), "2e630cb70f33ce54dead062165e87dce"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer2$layers[[1]]$mapping, answer2$mapping)$x)!= answer2$labels$x), "15899")), "36f34933d232d6b738bc2fbe043d0fb6"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer2$layers[[1]]$mapping, answer2$mapping)$y)!= answer2$labels$y), "15899")), "2e630cb70f33ce54dead062165e87dce"))
stopifnot("incorrect colour variable in answer2, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(answer2$layers[[1]]$mapping, answer2$mapping)$colour)), "15899")), "2e630cb70f33ce54dead062165e87dce"))
stopifnot("incorrect shape variable in answer2, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(answer2$layers[[1]]$mapping, answer2$mapping)$shape)), "15899")), "2e630cb70f33ce54dead062165e87dce"))
stopifnot("the colour label in answer2 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer2$layers[[1]]$mapping, answer2$mapping)$colour) != answer2$labels$colour), "15899")), "2e630cb70f33ce54dead062165e87dce"))
stopifnot("the shape label in answer2 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer2$layers[[1]]$mapping, answer2$mapping)$colour) != answer2$labels$shape), "15899")), "2e630cb70f33ce54dead062165e87dce"))
stopifnot("fill variable in answer2 is not correct"= setequal(digest(paste(toString(quo_name(answer2$mapping$fill)), "15899")), "c1d71728344ba4b147cf06273a8c2f35"))
stopifnot("fill label in answer2 is not informative"= setequal(digest(paste(toString((quo_name(answer2$mapping$fill) != answer2$labels$fill)), "15899")), "2e630cb70f33ce54dead062165e87dce"))
stopifnot("position argument in answer2 is not correct"= setequal(digest(paste(toString(class(answer2$layers[[1]]$position)[1]), "15899")), "869d318e3c7f38e683ccc2fe92876f19"))

print('Success!')

Now let's see how this changes across different genres.

**Question 3**

Choose an appropriate method to plot the distribution of `popularity` across different genres. Assign the plot to the variable `answer3`. (1 point)

In [None]:
# YOUR CODE HERE
fail()

print(answer3)

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(answer3$layers)), function(i) {c(class(answer3$layers[[i]]$geom))[1]})), "de7d2")), "3e77ad90c56f16a2bf3e31205c81a16b"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(answer3$layers)), function(i) {rlang::get_expr(c(answer3$layers[[i]]$mapping, answer3$mapping)$x)}), as.character))), "de7d2")), "164245a1a1880351bbf8a354251034c5"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(answer3$layers)), function(i) {rlang::get_expr(c(answer3$layers[[i]]$mapping, answer3$mapping)$y)}), as.character))), "de7d2")), "8f437213ef9f3a45210e33e8bbd96a83"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer3$layers[[1]]$mapping, answer3$mapping)$x)!= answer3$labels$x), "de7d2")), "96b321a610040ff4c12d10ca1e149310"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer3$layers[[1]]$mapping, answer3$mapping)$y)!= answer3$labels$y), "de7d2")), "96b321a610040ff4c12d10ca1e149310"))
stopifnot("incorrect colour variable in answer3, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(answer3$layers[[1]]$mapping, answer3$mapping)$colour)), "de7d2")), "6bcbfd076378a4c77d15f28b5bfbc2ce"))
stopifnot("incorrect shape variable in answer3, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(answer3$layers[[1]]$mapping, answer3$mapping)$shape)), "de7d2")), "6bcbfd076378a4c77d15f28b5bfbc2ce"))
stopifnot("the colour label in answer3 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer3$layers[[1]]$mapping, answer3$mapping)$colour) != answer3$labels$colour), "de7d2")), "6bcbfd076378a4c77d15f28b5bfbc2ce"))
stopifnot("the shape label in answer3 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer3$layers[[1]]$mapping, answer3$mapping)$colour) != answer3$labels$shape), "de7d2")), "6bcbfd076378a4c77d15f28b5bfbc2ce"))
stopifnot("fill variable in answer3 is not correct"= setequal(digest(paste(toString(quo_name(answer3$mapping$fill)), "de7d2")), "164245a1a1880351bbf8a354251034c5"))
stopifnot("fill label in answer3 is not informative"= setequal(digest(paste(toString((quo_name(answer3$mapping$fill) != answer3$labels$fill)), "de7d2")), "6bcbfd076378a4c77d15f28b5bfbc2ce"))
stopifnot("position argument in answer3 is not correct"= setequal(digest(paste(toString(class(answer3$layers[[1]]$position)[1]), "de7d2")), "8bec279e70b85aa0991b443f633cad29"))

print('Success!')

Do you see any pattern between genre and track popularity? Describe what you are seeing.

*** YOUR ANSWER HERE ***

In [None]:
mean_d <- spotify_data |> 
    pull(popularity) |> 
    mean(na.rm = TRUE)

median_d <- spotify_data |> 
    pull(popularity) |> 
    median(na.rm = TRUE)


#Base R's mode() returns an object's storage mode, not the statistical mode,
#so we create a custom function to calculate it (note that this masks base::mode)
mode <- function(data, col) {
    data |>
        count({{col}}) |>
        arrange(desc(n)) |>
        slice(1) |>
        pull({{col}})
}

mode_d <- mode(spotify_data, popularity)

ggplot(spotify_data, aes(x = popularity)) +
  geom_histogram(binwidth = 5, fill = "lightgray", color = "black") +
  geom_vline(aes(xintercept = mean_d), color = "blue", linetype = "dashed", linewidth = 1.2) +
  geom_vline(aes(xintercept = median_d), color = "darkgreen", linetype = "dashed", linewidth = 1.2) +
  geom_vline(aes(xintercept = mode_d), color = "red", linetype = "dashed", linewidth = 1.2) +
  labs(title = "Distribution of Track Popularity",
       subtitle = paste("Mean =", round(mean_d, 1), 
                        "| Median =", round(median_d, 1), 
                        "| Mode =", mode_d),
       x = "Popularity", y = "Count")

The plot above shows the distribution of track popularity scores as a histogram, with three vertical dashed lines marking the mean (blue), median (dark green), and mode (red). Most tracks cluster around a central range, and the mean and median are close to each other, which suggests that the bulk of the distribution is roughly symmetric. The mode, however, is 0: the single most common popularity score belongs to tracks with no popularity at all. Note that this does not mean the majority of tracks are unpopular, only that 0 occurs more often than any other individual score. This visualization lets us compare the central tendency measures at a glance and understand the overall pattern of track popularity in the dataset.

Now let us repeat the above for the `valence` variable.

In [None]:
mean_val <- spotify_data |> 
    pull(valence) |> 
    mean(na.rm = TRUE)

median_val <- spotify_data |> 
    pull(valence) |> 
    median(na.rm = TRUE)

mode_val <- mode(spotify_data, valence)

ggplot(spotify_data, aes(x = valence)) +
  geom_histogram(binwidth = 0.05, fill = "lightgray", color = "black") +
  geom_vline(aes(xintercept = mean_val), color = "blue", linetype = "dashed", linewidth = 1.2) +
  geom_vline(aes(xintercept = median_val), color = "darkgreen", linetype = "dashed", linewidth = 1.2) +
  geom_vline(aes(xintercept = mode_val), color = "red", linetype = "dashed", linewidth = 1.2) +
  labs(title = "Distribution of Track Valence",
       subtitle = paste("Mean =", round(mean_val, 2), 
                        "| Median =", round(median_val, 2), 
                        "| Mode =", round(mode_val, 2)),
       x = "Valence", y = "Count")

The plot above displays the distribution of the `valence` variable, which measures the musical positiveness of a track. Unlike the `popularity` distribution, the valence distribution appears **more symmetric** and **less skewed**, with the mean, median, and mode lying close to one another. Understanding and choosing the right central tendency measure is a key step in exploratory data analysis and helps ensure meaningful summaries and comparisons. 
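
As a small illustration of how the three measures relate in a symmetric distribution, the sketch below uses a made-up vector `x` (not taken from the dataset) and base R only. The statistical mode is computed with `table()` and `which.max()`, which is one of several possible approaches and returns only the first value in case of ties.

```r
# Made-up, perfectly symmetric sample centred on 4 (illustration only)
x <- c(2, 3, 3, 4, 4, 4, 5, 5, 6)

mean_x   <- mean(x)
median_x <- median(x)

# Statistical mode: the value with the highest count in the frequency table
mode_x <- as.numeric(names(which.max(table(x))))

c(mean = mean_x, median = median_x, mode = mode_x)
```

In this symmetric sample all three measures coincide. Real variables rarely behave this neatly, which is why it helps to look at the histogram alongside the summary values.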

Alongside these, measures like **standard deviation (SD)** and **interquartile range (IQR)** help describe the spread of the data around these central points. Let's plot the distribution of `tempo` as a boxplot.

In [None]:
# Boxplot for tempo
ggplot(spotify_data, aes(y = tempo)) +
    geom_boxplot(fill = "orange", alpha = 0.7) +
    labs(title = "Boxplot of Tempo",
             y = "Tempo") +
    theme_minimal()

The boxplot above visualizes the distribution of the `tempo` variable for the tracks in the dataset. The central line within the box is the **median** (the 50th percentile), and the lower and upper edges of the box are the 25th and 75th percentiles. The height of the box is therefore the **interquartile range (IQR)**, which contains the middle 50% of tempo values. The "whiskers" extend from the box to the most extreme values that lie within 1.5 × IQR of its edges (the `geom_boxplot()` default), and any points beyond the whiskers are drawn individually as potential outliers. From the plot, we can observe the typical tempo for most tracks, the spread of tempo values, and whether there are any unusually fast or slow tracks. This helps us quickly assess the variability and central tendency of tempo in the dataset.
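
To make the pieces of a boxplot concrete, here is a minimal sketch that computes them by hand in base R. The vector `tempo_toy` is made up for illustration, and the 1.5 × IQR rule follows the `geom_boxplot()` default described above.

```r
# Made-up tempo values in beats per minute (illustration only)
tempo_toy <- c(60, 90, 100, 110, 120, 125, 130, 140, 150, 210)

# Box edges: 25th and 75th percentiles; the box height is the IQR
q <- unname(quantile(tempo_toy, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences: points beyond these limits are flagged as potential outliers
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences
whisker_low  <- min(tempo_toy[tempo_toy >= lower_fence])
whisker_high <- max(tempo_toy[tempo_toy <= upper_fence])
outliers <- tempo_toy[tempo_toy < lower_fence | tempo_toy > upper_fence]

list(median = median(tempo_toy), box = q, iqr = iqr,
     whiskers = c(whisker_low, whisker_high), outliers = outliers)
```

Note that the whiskers end at actual observations rather than at the fences themselves, so their lengths need not be equal.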

Now let us plot the distribution of `tempo` as a density plot.

In [None]:
# Density plot for tempo
ggplot(spotify_data, aes(x = tempo)) +
    geom_density(fill = "orange", alpha = 0.7) +
    labs(title = "Density distribution of Tempo",
             x = "Tempo",
             y = "Density") +
    theme_minimal()

The density plot for tempo provides a smooth estimate of the distribution, highlighting where most tempo values are concentrated and revealing the overall shape (e.g., unimodal, skewed, or multimodal). Compared to the boxplot, which summarizes the distribution using quartiles and highlights outliers, the density plot gives more detail about the distribution's modality and spread.
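
The sketch below illustrates why modality can be invisible in a quartile summary. It simulates a made-up bimodal sample (`tempo_sim`, not the worksheet data) and compares its quartiles with a kernel density estimate from base R's `density()`. The helper `dens_at()` is a hypothetical convenience function defined here for illustration.

```r
set.seed(1)

# Simulated two-cluster sample: slow tracks near 90, fast tracks near 140
tempo_sim <- c(rnorm(500, mean = 90, sd = 5),
               rnorm(500, mean = 140, sd = 5))

# The quartiles alone give no hint that there are two separate groups
quantile(tempo_sim, c(0.25, 0.5, 0.75))

# Kernel density estimate using the default Gaussian kernel and bandwidth
dens <- density(tempo_sim)

# Hypothetical helper: interpolate the estimated density at chosen values
dens_at <- function(x) approx(dens$x, dens$y, xout = x)$y

# Density near each cluster centre versus the gap between them
dens_at(c(90, 115, 140))
```

The estimated density is high near both cluster centres and drops in the gap between them, a two-peaked shape that a boxplot of the same data would not reveal.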

**Question 4**

Violin plots offer a more detailed view of the data distribution than a boxplot, while still retaining summary information. Replace the `...` in the code below with the appropriate geom function to create such a plot. Assign the result to `answer4`. Hint: Explore plots that show distribution shape. (1 point)

In [None]:
# answer4 <- ggplot(spotify_data, aes(x = "", y = tempo)) +
#     ...(fill = "orange", alpha = 0.7) +
#     geom_boxplot(width = 0.1, fill = "white") +
#     labs(y = "Tempo") +
#     theme_minimal()

# YOUR CODE HERE
fail()

print(answer4)

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(answer4$layers)), function(i) {c(class(answer4$layers[[i]]$geom))[1]})), "a6c06")), "c536be52c25b78dfa26a3dc61989e4a7"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(answer4$layers)), function(i) {rlang::get_expr(c(answer4$layers[[i]]$mapping, answer4$mapping)$x)}), as.character))), "a6c06")), "b5711a7b8439d23690179e636179ca9d"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(answer4$layers)), function(i) {rlang::get_expr(c(answer4$layers[[i]]$mapping, answer4$mapping)$y)}), as.character))), "a6c06")), "a58a365e05cc31dc4deb88f07612a706"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer4$layers[[1]]$mapping, answer4$mapping)$x)!= answer4$labels$x), "a6c06")), "810da8b33b6468267e1bc12d814f2b6a"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer4$layers[[1]]$mapping, answer4$mapping)$y)!= answer4$labels$y), "a6c06")), "c300d159cb55401414b3667e9293bcf5"))
stopifnot("incorrect colour variable in answer4, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(answer4$layers[[1]]$mapping, answer4$mapping)$colour)), "a6c06")), "810da8b33b6468267e1bc12d814f2b6a"))
stopifnot("incorrect shape variable in answer4, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(answer4$layers[[1]]$mapping, answer4$mapping)$shape)), "a6c06")), "810da8b33b6468267e1bc12d814f2b6a"))
stopifnot("the colour label in answer4 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer4$layers[[1]]$mapping, answer4$mapping)$colour) != answer4$labels$colour), "a6c06")), "810da8b33b6468267e1bc12d814f2b6a"))
stopifnot("the shape label in answer4 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer4$layers[[1]]$mapping, answer4$mapping)$colour) != answer4$labels$shape), "a6c06")), "810da8b33b6468267e1bc12d814f2b6a"))
stopifnot("fill variable in answer4 is not correct"= setequal(digest(paste(toString(quo_name(answer4$mapping$fill)), "a6c06")), "2e746ae740db813e95dac978fd11b796"))
stopifnot("fill label in answer4 is not informative"= setequal(digest(paste(toString((quo_name(answer4$mapping$fill) != answer4$labels$fill)), "a6c06")), "810da8b33b6468267e1bc12d814f2b6a"))
stopifnot("position argument in answer4 is not correct"= setequal(digest(paste(toString(class(answer4$layers[[1]]$position)[1]), "a6c06")), "3f141ea66b7eec8e1241313448caf6a3"))

print('Success!')

**Question 5**

For each of the following questions, assign the correct option to the given variable in quotes. For example, if the correct answer is option B, enter `answer5.1 <- "B"` in the answer cell.

5.1. Which of the following measures is most affected by extreme values (outliers)? (1 point)

    A. Standard deviation  
    B. Mode  
    C. Median  
    D. Interquartile Range  

In [None]:
#assign the answer to answer5.1
# answer5.1 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()


In [None]:
library(digest)
stopifnot("type of answer5.1 is not character"= setequal(digest(paste(toString(class(answer5.1)), "33671")), "e282a638575e1ba6f69f952dc99d5bdd"))
stopifnot("length of answer5.1 is not correct"= setequal(digest(paste(toString(length(answer5.1)), "33671")), "8c6be8e6e74a8d28b44bd9a706e04eb3"))
stopifnot("value of answer5.1 is not correct"= setequal(digest(paste(toString(tolower(answer5.1)), "33671")), "04979692f337a261fba8d14cf757b4da"))
stopifnot("letters in string value of answer5.1 are correct but case is not correct"= setequal(digest(paste(toString(answer5.1), "33671")), "d58b058c84056f5827f78e9f68040b53"))

print('Success!')

5.2. A dataset of song lengths has a mean of 3.78 minutes and a standard deviation of 1.02 minutes. What does the SD tell us? (1 point)

    A. Most songs are exactly 3.78 minutes  
    B. Half the songs are shorter than 3.78 minutes  
    C. Songs vary by about 1.02 minutes from the mean  
    D. There are no songs longer than 4.8 minutes  

In [None]:
#assign the answer to answer5.2
# answer5.2 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()


In [None]:
library(digest)
stopifnot("type of answer5.2 is not character"= setequal(digest(paste(toString(class(answer5.2)), "c1551")), "6aff9e7f5ef0b9ea1c733333303b5cbc"))
stopifnot("length of answer5.2 is not correct"= setequal(digest(paste(toString(length(answer5.2)), "c1551")), "b690e8a6a00a3c6c93b077b5e5a748d0"))
stopifnot("value of answer5.2 is not correct"= setequal(digest(paste(toString(tolower(answer5.2)), "c1551")), "d6ad651c197d366f3417a853330c737a"))
stopifnot("letters in string value of answer5.2 are correct but case is not correct"= setequal(digest(paste(toString(answer5.2), "c1551")), "7204639bc673b7aa645cd11375b2d7e1"))

print('Success!')

5.3. The IQR represents the range between which two percentiles? (1 point)

    A. 10th and 90th  
    B. 0th and 100th  
    C. 25th and 75th  
    D. 5th and 95th  

In [None]:
#assign the answer to answer5.3
# answer5.3 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()


In [None]:
library(digest)
stopifnot("type of answer5.3 is not character"= setequal(digest(paste(toString(class(answer5.3)), "e6b38")), "5d1ee98293927ab4c5fa453ecbd4db51"))
stopifnot("length of answer5.3 is not correct"= setequal(digest(paste(toString(length(answer5.3)), "e6b38")), "ccc4451ba6cf289d6daf4a06968da2fc"))
stopifnot("value of answer5.3 is not correct"= setequal(digest(paste(toString(tolower(answer5.3)), "e6b38")), "c2ff55c75d5f81a026ede1b515267f18"))
stopifnot("letters in string value of answer5.3 are correct but case is not correct"= setequal(digest(paste(toString(answer5.3), "e6b38")), "38c0a00b91d759c866ec9fd2df3fd0a9"))

print('Success!')

5.4. What would be an appropriate central tendency measure for a highly skewed distribution? (1 point)

    A. Mean  
    B. Median  
    C. Mode  
    D. Standard Deviation  

In [None]:
#assign the answer to answer5.4
# answer5.4 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()


In [None]:
library(digest)
stopifnot("type of answer5.4 is not character"= setequal(digest(paste(toString(class(answer5.4)), "d3436")), "4ce49c29529d937569571db88eb1c26f"))
stopifnot("length of answer5.4 is not correct"= setequal(digest(paste(toString(length(answer5.4)), "d3436")), "9cc127e18635a9604fb9f49a661d1e86"))
stopifnot("value of answer5.4 is not correct"= setequal(digest(paste(toString(tolower(answer5.4)), "d3436")), "3bfaffbb2dab0efa4c3e5bf5fab0af71"))
stopifnot("letters in string value of answer5.4 are correct but case is not correct"= setequal(digest(paste(toString(answer5.4), "d3436")), "a5dd7659d2dceb9faa44dd91e8dba2ce"))

print('Success!')

5.5. In a right-skewed distribution (e.g., Speechiness), which of the following typically holds? (1 point)

    A. Mean < Median   
    B. Mean > Median  
    C. Mean = Median  
    D. Not enough information

In [None]:
#assign the answer to answer5.5
# answer5.5 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()


In [None]:
library(digest)
stopifnot("type of answer5.5 is not character"= setequal(digest(paste(toString(class(answer5.5)), "14618")), "9c038a0c631f583d007b80ba255baf0f"))
stopifnot("length of answer5.5 is not correct"= setequal(digest(paste(toString(length(answer5.5)), "14618")), "379df039244e74ab292445e89862f2ac"))
stopifnot("value of answer5.5 is not correct"= setequal(digest(paste(toString(tolower(answer5.5)), "14618")), "8473fcc081f8d77f04b41526e8f98653"))
stopifnot("letters in string value of answer5.5 are correct but case is not correct"= setequal(digest(paste(toString(answer5.5), "14618")), "bce330ab56c265e61872e126536de3eb"))

print('Success!')

Now let's delve into the relationships between different variables within this dataset using correlation analysis. Correlation analysis shows how two numeric variables are related. A positive correlation indicates that as one variable increases, so does the other; a negative correlation means one increases while the other decreases. The **Pearson correlation coefficient** (ranging from -1 to 1) is commonly used to quantify linear relationships.
*   1: perfect positive linear relationship
*   -1: perfect negative linear relationship
*   0: no linear relationship

Examples include:
1. Number of hours studied and final exam score (**likely positive**)
2. Annual income of a household and number of cars owned (**likely positive**)
3. Average screen time per day and sleep hours per night (**likely negative**)
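
As a minimal sketch of how the Pearson coefficient is computed, the example below uses made-up values for hours studied and exam scores (not real data). It compares base R's `cor()` with a direct calculation from the definition: the sum of the products of deviations from the means, divided by the square root of the product of the summed squared deviations.

```r
# Made-up data for illustration only
hours <- c(1, 2, 3, 4, 5, 6)
score <- c(52, 58, 61, 70, 74, 79)

# Built-in Pearson correlation (the default method of cor())
r_builtin <- cor(hours, score)

# Manual calculation from the definition
dev_h <- hours - mean(hours)
dev_s <- score - mean(score)
r_manual <- sum(dev_h * dev_s) / sqrt(sum(dev_h^2) * sum(dev_s^2))

# Reversing the direction of one variable flips the sign of the coefficient
r_flipped <- cor(hours, -score)

c(builtin = r_builtin, manual = r_manual, flipped = r_flipped)
```

The two calculations agree, and the coefficient is close to 1 because the made-up scores rise almost linearly with hours studied.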

**Question 6**

Create a scatterplot to compare the `loudness` and `energy` variables. HINT: To visualize the trend, use `geom_smooth(method = "lm")`, which adds a linear regression line to the plot. Assign the plot to the variable `answer6`. (1 point)

In [None]:
# answer6 <- "..."

# YOUR CODE HERE
fail()


In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(answer6$layers)), function(i) {c(class(answer6$layers[[i]]$geom))[1]})), "be1ff")), "788d3720e75d5dcdad4a48d3f5b5acee"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(answer6$layers)), function(i) {rlang::get_expr(c(answer6$layers[[i]]$mapping, answer6$mapping)$x)}), as.character))), "be1ff")), "632735ac36305c6ef143c9de22679bf7"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(answer6$layers)), function(i) {rlang::get_expr(c(answer6$layers[[i]]$mapping, answer6$mapping)$y)}), as.character))), "be1ff")), "cc9486e1a5751aa443d47f5da86acfb5"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer6$layers[[1]]$mapping, answer6$mapping)$x)!= answer6$labels$x), "be1ff")), "9707dad30b1a04ce9960b5d75af42dad"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer6$layers[[1]]$mapping, answer6$mapping)$y)!= answer6$labels$y), "be1ff")), "9707dad30b1a04ce9960b5d75af42dad"))
stopifnot("incorrect colour variable in answer6, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(answer6$layers[[1]]$mapping, answer6$mapping)$colour)), "be1ff")), "7e82ff55876945318456463472d86a1c"))
stopifnot("incorrect shape variable in answer6, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(answer6$layers[[1]]$mapping, answer6$mapping)$shape)), "be1ff")), "7e82ff55876945318456463472d86a1c"))
stopifnot("the colour label in answer6 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer6$layers[[1]]$mapping, answer6$mapping)$colour) != answer6$labels$colour), "be1ff")), "7e82ff55876945318456463472d86a1c"))
stopifnot("the shape label in answer6 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer6$layers[[1]]$mapping, answer6$mapping)$colour) != answer6$labels$shape), "be1ff")), "7e82ff55876945318456463472d86a1c"))
stopifnot("fill variable in answer6 is not correct"= setequal(digest(paste(toString(quo_name(answer6$mapping$fill)), "be1ff")), "8c64c05ee936987c15d49c4b899c2dcf"))
stopifnot("fill label in answer6 is not informative"= setequal(digest(paste(toString((quo_name(answer6$mapping$fill) != answer6$labels$fill)), "be1ff")), "7e82ff55876945318456463472d86a1c"))
stopifnot("position argument in answer6 is not correct"= setequal(digest(paste(toString(class(answer6$layers[[1]]$position)[1]), "be1ff")), "c2d0ec367abb45655f062356621fa410"))

print('Success!')

Do you see any trend between the loudness and energy of the tracks? Describe what you are seeing.

*** YOUR ANSWER HERE ***

Now let us plot the relationship between energy and acousticness.

In [None]:
ggplot(spotify_data, aes(x = energy, y = acousticness)) +
    geom_point(alpha = 0.5, color = "steelblue") +
    geom_smooth(method = "lm", se = FALSE) +
    labs(title = "Scatterplot of Energy vs Acousticness",
         x = "Energy",
         y = "Acousticness") +
    theme_minimal()

Describe what you see in the above plot in the following cell.

*** YOUR ANSWER HERE ***

**Question 7**


Now let's calculate the Pearson correlation coefficient for these relationships. Use `correlate()` from the `corrr` package to compute the correlation coefficient between `loudness` and `energy`. Replace the `...` with appropriate code and assign the value to `answer7`. (1 point)

In [None]:
# answer7 <- spotify_data |> 
#     select(..., ...) |> 
#     ... |> 
#     filter(term == "loudness") |> 
#     pull(...)

# YOUR CODE HERE
fail()

In [None]:
library(digest)
stopifnot("type of answer7 is not numeric"= setequal(digest(paste(toString(class(answer7)), "31d87")), "71db11afa5dd2afeb888a8f437eb8387"))
stopifnot("value of answer7 is not correct (rounded to 2 decimal places)"= setequal(digest(paste(toString(round(answer7, 2)), "31d87")), "2d27e2cc52b6ec41f0f56cfdcbc7554e"))
stopifnot("length of answer7 is not correct"= setequal(digest(paste(toString(length(answer7)), "31d87")), "6692c11d9956db7117bdd27aae4d9d9c"))
stopifnot("values of answer7 are not correct"= setequal(digest(paste(toString(sort(round(answer7, 2))), "31d87")), "2d27e2cc52b6ec41f0f56cfdcbc7554e"))

print('Success!')

We can run the correlation analysis on all the numeric variables in the dataset by calling `correlate()` without selecting any particular variables.

In [None]:
spotify_data |> 
    correlate() |> 
    fashion()

**Question 8**

Use a plotting function from the `corrr` package to visualize this correlation table. Replace the `...` in the code below with that function, and assign the plot to the variable `answer8`. Note that `correlate()` automatically removes non-numeric variables. (1 point)

In [None]:
options(warn = -1)
# answer8 <- spotify_data |> 
#     correlate() |> 
#     ...(colors = c("darkred", "red", "white",  "blue", "darkblue")) +
#     theme(axis.text.x = element_text(angle = 90))

# YOUR CODE HERE
fail()

print(answer8)

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(answer8$layers)), function(i) {c(class(answer8$layers[[i]]$geom))[1]})), "2070e")), "7c8484e09b6e7cf2d1bd7f09e5ced51f"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(answer8$layers)), function(i) {rlang::get_expr(c(answer8$layers[[i]]$mapping, answer8$mapping)$x)}), as.character))), "2070e")), "8637ccc210161a955c437678472743f2"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(answer8$layers)), function(i) {rlang::get_expr(c(answer8$layers[[i]]$mapping, answer8$mapping)$y)}), as.character))), "2070e")), "3a197d7879f6df10f5f5f83a475975c5"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer8$layers[[1]]$mapping, answer8$mapping)$x)!= answer8$labels$x), "2070e")), "d3230e289a59c5d67e40d1e9b9be1be8"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer8$layers[[1]]$mapping, answer8$mapping)$y)!= answer8$labels$y), "2070e")), "d3230e289a59c5d67e40d1e9b9be1be8"))
stopifnot("incorrect colour variable in answer8, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(answer8$layers[[1]]$mapping, answer8$mapping)$colour)), "2070e")), "5055b3e0c8008db586cd7c6efafa6380"))
stopifnot("incorrect shape variable in answer8, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(answer8$layers[[1]]$mapping, answer8$mapping)$shape)), "2070e")), "c1453c4c2a89f140d0e4131fb352d324"))
stopifnot("the colour label in answer8 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer8$layers[[1]]$mapping, answer8$mapping)$colour) != answer8$labels$colour), "2070e")), "c1453c4c2a89f140d0e4131fb352d324"))
stopifnot("the shape label in answer8 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer8$layers[[1]]$mapping, answer8$mapping)$colour) != answer8$labels$shape), "2070e")), "c1453c4c2a89f140d0e4131fb352d324"))
stopifnot("fill variable in answer8 is not correct"= setequal(digest(paste(toString(quo_name(answer8$mapping$fill)), "2070e")), "bd2e1c244c161c822cf8817880464e8a"))
stopifnot("fill label in answer8 is not informative"= setequal(digest(paste(toString((quo_name(answer8$mapping$fill) != answer8$labels$fill)), "2070e")), "c1453c4c2a89f140d0e4131fb352d324"))
stopifnot("position argument in answer8 is not correct"= setequal(digest(paste(toString(class(answer8$layers[[1]]$position)[1]), "2070e")), "dc38c50e3efc757511e7fc34b3cfef0b"))

print('Success!')

The heatmap reveals several key trends:  
- There is a strong positive correlation between `energy` and `loudness`, indicating that louder tracks tend to be more energetic.
- `Acousticness` shows a strong negative correlation with both `energy` and `loudness`, suggesting that more acoustic tracks are generally less loud and less energetic.
- `Danceability` and `valence` are mildly positively correlated, implying that more danceable tracks also tend to be more positive or cheerful.
- Most other correlations are weak, indicating little linear relationship between those variable pairs.

It's important to distinguish between **correlation** and **causation** when analyzing relationships in this dataset. Correlation simply means that two variables move together (e.g., energy and loudness are positively correlated), but it does not imply that one causes the other. In the Spotify dataset, just because tracks with higher energy tend to be louder, it doesn't mean that increasing loudness will cause a track to be more energetic. Other factors may be involved, and it is possible that both are influenced by a third variable (such as production style, genre, or tempo).
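
The following simulation is a small sketch of that last point. It uses entirely made-up data, not the Spotify tracks: a hypothetical confounder called `production` drives two simulated variables, `sim_energy` and `sim_loudness`, which have no direct effect on each other. All names and coefficients here are assumptions chosen for illustration.

```r
set.seed(2024)  # arbitrary seed so the simulation is reproducible
n <- 1000

# Hypothetical confounder, e.g. a "production intensity" score
production <- rnorm(n)

# Both variables depend on the confounder, but not on each other
sim_energy <- 0.8 * production + rnorm(n, sd = 0.5)
sim_loudness <- 0.8 * production + rnorm(n, sd = 0.5)

# The raw correlation is strong even though neither variable causes the other
raw_cor <- cor(sim_energy, sim_loudness)

# Remove the confounder's linear effect from each variable,
# then correlate what is left over
energy_resid <- resid(lm(sim_energy ~ production))
loudness_resid <- resid(lm(sim_loudness ~ production))
adjusted_cor <- cor(energy_resid, loudness_resid)

print(round(c(raw = raw_cor, adjusted = adjusted_cor), 2))
```

Once the shared influence of `production` is removed, the correlation between the two simulated variables should shrink to roughly zero, which is what we would expect if the original association came from the third variable alone.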

**Question 9**

Now let's see how music has evolved over the years. To begin with, let's plot the distribution of music genres over time. Raw counts can be hard to compare when the number of tracks differs between years, so plot the proportion of each genre for every year in the dataset. Assign the plot to the variable `answer9`. (1 point)

In [None]:
# answer9 <- ggplot(spotify_data, aes(x = ..., fill = ...)) +
#     ...(position = "fill") +
#     labs(title = "Proportion of Genres by Year") +
#     theme_minimal() +
#     theme(axis.text.x = element_text(angle = 90, hjust = 1))

# YOUR CODE HERE
fail()

print(answer9)

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(answer9$layers)), function(i) {c(class(answer9$layers[[i]]$geom))[1]})), "a6b44")), "a4db0bb1adf8a3c4beb6200010742250"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(answer9$layers)), function(i) {rlang::get_expr(c(answer9$layers[[i]]$mapping, answer9$mapping)$x)}), as.character))), "a6b44")), "1c5e32bbc60baa5653aea44573e266c1"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(answer9$layers)), function(i) {rlang::get_expr(c(answer9$layers[[i]]$mapping, answer9$mapping)$y)}), as.character))), "a6b44")), "317b25d45010c4ef703897df63423dcf"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer9$layers[[1]]$mapping, answer9$mapping)$x)!= answer9$labels$x), "a6b44")), "317b25d45010c4ef703897df63423dcf"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer9$layers[[1]]$mapping, answer9$mapping)$y)!= answer9$labels$y), "a6b44")), "317b25d45010c4ef703897df63423dcf"))
stopifnot("incorrect colour variable in answer9, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(answer9$layers[[1]]$mapping, answer9$mapping)$colour)), "a6b44")), "317b25d45010c4ef703897df63423dcf"))
stopifnot("incorrect shape variable in answer9, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(answer9$layers[[1]]$mapping, answer9$mapping)$shape)), "a6b44")), "317b25d45010c4ef703897df63423dcf"))
stopifnot("the colour label in answer9 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer9$layers[[1]]$mapping, answer9$mapping)$colour) != answer9$labels$colour), "a6b44")), "317b25d45010c4ef703897df63423dcf"))
stopifnot("the shape label in answer9 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer9$layers[[1]]$mapping, answer9$mapping)$colour) != answer9$labels$shape), "a6b44")), "317b25d45010c4ef703897df63423dcf"))
stopifnot("fill variable in answer9 is not correct"= setequal(digest(paste(toString(quo_name(answer9$mapping$fill)), "a6b44")), "5656b6b9ff8ae66a1c62541bd3979514"))
stopifnot("fill label in answer9 is not informative"= setequal(digest(paste(toString((quo_name(answer9$mapping$fill) != answer9$labels$fill)), "a6b44")), "317b25d45010c4ef703897df63423dcf"))
stopifnot("position argument in answer9 is not correct"= setequal(digest(paste(toString(class(answer9$layers[[1]]$position)[1]), "a6b44")), "9d52c112554bed918d6e28a68e19527c"))

print('Success!')

The plot above shows how the proportion of different music genres has changed over the years in the dataset. Some genres increase in prevalence while others decrease or remain relatively stable. For example, genres like pop and edm have become more prevalent in recent years, while rock and r&b appear to decrease.
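
To make the idea of a within-year proportion concrete, here is a minimal base R sketch on a hypothetical eight-row data frame called `toy_tracks` (made-up values, not the Spotify sample). It computes the same kind of per-year shares that a filled bar chart displays.

```r
# Hypothetical toy data: release year and genre for eight tracks
toy_tracks <- data.frame(
  year = c(2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019),
  genre = c("pop", "rock", "rock", "rock", "pop", "pop", "pop", "rock")
)

# Counts of each genre within each year (rows = years, columns = genres)
genre_counts <- table(toy_tracks$year, toy_tracks$genre)

# margin = 1 divides each row (year) by its total, so every row sums to 1
genre_props <- prop.table(genre_counts, margin = 1)

print(genre_props)
```

Because every year is rescaled to sum to 1, years with many tracks and years with few tracks become directly comparable.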

**Question 10**

Visualize how the energy of the tracks has evolved over the years for each genre. To plot this, calculate the mean track energy for each year within each genre using `group_by()` and `summarise()`. Be sure to separate each genre into its own subplot.

> Hint: Use `scales = "free_x"` within `facet_wrap()` to account for the different range of years in each genre, and make sure to `group_by()` both `year` and `genre` before calculating the mean energy of all the tracks per year (for each genre). Assign the answer to the variable `answer10`. (1 point)

In [None]:
# answer10 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()


print(answer10)

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(answer10$layers)), function(i) {c(class(answer10$layers[[i]]$geom))[1]})), "102ef")), "ecea05090560329f3d481cdc13aa296f"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(answer10$layers)), function(i) {rlang::get_expr(c(answer10$layers[[i]]$mapping, answer10$mapping)$x)}), as.character))), "102ef")), "59adad3851f4421dabb960b96ae83c94"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(answer10$layers)), function(i) {rlang::get_expr(c(answer10$layers[[i]]$mapping, answer10$mapping)$y)}), as.character))), "102ef")), "947b3c9520591d3af37be9c233bbc296"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer10$layers[[1]]$mapping, answer10$mapping)$x)!= answer10$labels$x), "102ef")), "b028817bc8e6a4a4aef00a23708fc112"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer10$layers[[1]]$mapping, answer10$mapping)$y)!= answer10$labels$y), "102ef")), "2fa699dc98530ba254005206e34bb30a"))
stopifnot("incorrect colour variable in answer10, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(answer10$layers[[1]]$mapping, answer10$mapping)$colour)), "102ef")), "a93fec8fc297e6f17ad2be0460573abe"))
stopifnot("incorrect shape variable in answer10, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(answer10$layers[[1]]$mapping, answer10$mapping)$shape)), "102ef")), "b028817bc8e6a4a4aef00a23708fc112"))
stopifnot("the colour label in answer10 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer10$layers[[1]]$mapping, answer10$mapping)$colour) != answer10$labels$colour), "102ef")), "2fa699dc98530ba254005206e34bb30a"))
stopifnot("the shape label in answer10 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer10$layers[[1]]$mapping, answer10$mapping)$colour) != answer10$labels$shape), "102ef")), "b028817bc8e6a4a4aef00a23708fc112"))
stopifnot("fill variable in answer10 is not correct"= setequal(digest(paste(toString(quo_name(answer10$mapping$fill)), "102ef")), "c262b58fff85757c0d6fcc02cdadc460"))
stopifnot("fill label in answer10 is not informative"= setequal(digest(paste(toString((quo_name(answer10$mapping$fill) != answer10$labels$fill)), "102ef")), "b028817bc8e6a4a4aef00a23708fc112"))
stopifnot("position argument in answer10 is not correct"= setequal(digest(paste(toString(class(answer10$layers[[1]]$position)[1]), "102ef")), "ce2ac850a143a9d2a95a9f5ef86fbd19"))

print('Success!')

Describe the trends you see in the above plot.

*** YOUR ANSWER HERE ***

---
Now let us focus on why EDA is important and how the downstream analysis and data modelling can be affected by the decisions made during EDA.


Your final goal with this Spotify dataset is to build a classifier to predict whether or not a song belongs to the EDM (Electronic Dance Music) genre. Let us prepare the dataset to help achieve this goal.

In [None]:
classification_data <- spotify_data |>
    select(genre, popularity, danceability, energy, loudness, acousticness, instrumentalness, liveness, valence) |> 
    mutate(genre = case_when(
        genre == "edm" ~ "EDM",
        TRUE ~ "Non-EDM"
    )) |>
    mutate(genre = as.factor(genre))

head(classification_data)

**Question 11**

Examine the distribution of EDM vs. non-EDM tracks in the dataset. Create a plot that shows how many tracks fall into each category, and store the plot in the variable `answer11`. (1 point)

In [None]:
#answer11 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()

#answer11

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(answer11$layers)), function(i) {c(class(answer11$layers[[i]]$geom))[1]})), "1e95b")), "0736467f4d134decef45f1450d40edd9"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(answer11$layers)), function(i) {rlang::get_expr(c(answer11$layers[[i]]$mapping, answer11$mapping)$x)}), as.character))), "1e95b")), "2d09de8dbfd8f56e5a07a73328cd7ef6"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(answer11$layers)), function(i) {rlang::get_expr(c(answer11$layers[[i]]$mapping, answer11$mapping)$y)}), as.character))), "1e95b")), "477c896096bff73cc0592b38c3ea9e18"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer11$layers[[1]]$mapping, answer11$mapping)$x)!= answer11$labels$x), "1e95b")), "befb6e57583b264ccc1f3de17b4c20ea"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer11$layers[[1]]$mapping, answer11$mapping)$y)!= answer11$labels$y), "1e95b")), "477c896096bff73cc0592b38c3ea9e18"))
stopifnot("incorrect colour variable in answer11, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(answer11$layers[[1]]$mapping, answer11$mapping)$colour)), "1e95b")), "477c896096bff73cc0592b38c3ea9e18"))
stopifnot("incorrect shape variable in answer11, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(answer11$layers[[1]]$mapping, answer11$mapping)$shape)), "1e95b")), "477c896096bff73cc0592b38c3ea9e18"))
stopifnot("the colour label in answer11 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer11$layers[[1]]$mapping, answer11$mapping)$colour) != answer11$labels$colour), "1e95b")), "477c896096bff73cc0592b38c3ea9e18"))
stopifnot("the shape label in answer11 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(answer11$layers[[1]]$mapping, answer11$mapping)$colour) != answer11$labels$shape), "1e95b")), "477c896096bff73cc0592b38c3ea9e18"))
stopifnot("fill variable in answer11 is not correct"= setequal(digest(paste(toString(quo_name(answer11$mapping$fill)), "1e95b")), "68923e7c2bb712e84dfb86c8fa891929"))
stopifnot("fill label in answer11 is not informative"= setequal(digest(paste(toString((quo_name(answer11$mapping$fill) != answer11$labels$fill)), "1e95b")), "477c896096bff73cc0592b38c3ea9e18"))
stopifnot("position argument in answer11 is not correct"= setequal(digest(paste(toString(class(answer11$layers[[1]]$position)[1]), "1e95b")), "beafdc309691d0f62079a49f4b793ae4"))

print('Success!')

Notice that EDM tracks make up only a small fraction of the dataset, creating a class imbalance in the target variable.

To investigate how this imbalance might impact classification, you train a k-Nearest Neighbors (KNN) model using two different approaches:

1. Original dataset (no upsampling)  
   You perform a train/test split on the original dataset without making any changes to the class distribution.

2. Balanced dataset (with upsampling)  
   Before training the model, you apply random upsampling to the minority class (EDM) in the training data so that both classes are represented equally.
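
As a rough sketch of what random upsampling does, the base R cell below rebalances a hypothetical training set called `toy_train` (made-up data with 10 EDM and 90 Non-EDM rows). This is an illustration of the idea only, written by hand under those assumptions; it is not how a recipe step is implemented.

```r
set.seed(1)  # arbitrary seed for a reproducible resample

# Hypothetical imbalanced training set: 10 EDM rows and 90 Non-EDM rows
toy_train <- data.frame(
  genre = factor(c(rep("EDM", 10), rep("Non-EDM", 90))),
  energy = runif(100)
)

minority <- toy_train[toy_train$genre == "EDM", ]
majority <- toy_train[toy_train$genre == "Non-EDM", ]

# Draw minority rows with replacement until the two classes are the same size
n_extra <- nrow(majority) - nrow(minority)
extra_rows <- minority[sample(nrow(minority), n_extra, replace = TRUE), ]
toy_balanced <- rbind(toy_train, extra_rows)

print(table(toy_train$genre))     # class counts before upsampling
print(table(toy_balanced$genre))  # class counts after upsampling
```

The added rows are duplicates of existing EDM rows, so no new information is created. The test set is left untouched so that it still reflects the original class distribution.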

Your task is to evaluate the performance of both models based on accuracy and recall, and to consider how class imbalance and choices made during EDA affect your results. Code for the model without upsampling is provided below. 

In [None]:
# Method 1: KNN classification without upsampling
# Step 1: Train-test split
set.seed(123)
spotify_split <- initial_split(classification_data, strata = genre)
spotify_train <- training(spotify_split)
spotify_test <- testing(spotify_split)

# Step 2: Recipe without upsampling
knn_recipe <- recipe(genre ~ ., data = spotify_train) |> 
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors()) 

# Step 3: Specify KNN model
knn_model <- nearest_neighbor(
  neighbors = 5,          # You can tune this
  weight_func = "rectangular"
) |>
  set_engine("kknn") |> 
  set_mode("classification")

# Step 4: Workflow
knn_workflow <- workflow() |> 
  add_recipe(knn_recipe) |> 
  add_model(knn_model)

# Step 5: Fit the model
knn_fit <- fit(knn_workflow, data = spotify_train)

# Step 6: Predict and evaluate
knn_preds <- predict(knn_fit, spotify_test, type = "prob") |> 
  bind_cols(predict(knn_fit, spotify_test), spotify_test)

# Step 7: Metrics + Confusion Matrix
knn_preds |> 
  metrics(truth = genre, estimate = .pred_class) |> 
  filter(.metric == "accuracy")

knn_preds |> 
  conf_mat(truth = genre, estimate = .pred_class)

knn_preds |>
  recall(truth = genre, estimate = .pred_class, event_level = "first")

**Question 12**

Now repeat the same KNN classification with upsampling of the minority class. Use the `step_upsample()` function in the recipe to achieve this, with `over_ratio = 1`. After fitting the model, evaluate it using the same metrics as above. (1 point)

In [None]:
set.seed(123) ### DO NOT CHANGE

# YOUR CODE HERE
fail()

After upsampling the EDM class, your model's test accuracy decreased, but recall increased. What does this suggest about the model trained on the imbalanced data?

    A. The model is performing worse overall  
    B. The model is overfitting the majority class  
    C. The model is better at identifying EDM songs  
    D. The test set must be imbalanced

In [None]:
#assign the answer to answer12
# answer12 <- "FILL_THIS_IN"

# YOUR CODE HERE
fail()

In [None]:
library(digest)
stopifnot("type of answer12 is not character"= setequal(digest(paste(toString(class(answer12)), "3ddfc")), "ce2c6ae59a0c4cb2f7d87a059a27dbcf"))
stopifnot("length of answer12 is not correct"= setequal(digest(paste(toString(length(answer12)), "3ddfc")), "c9e464074acc1b2ae3aab57d5c13115b"))
stopifnot("value of answer12 is not correct"= setequal(digest(paste(toString(tolower(answer12)), "3ddfc")), "50c8930fe5211d8102e257696fd1ee82"))
stopifnot("letters in string value of answer12 are correct but case is not correct"= setequal(digest(paste(toString(answer12), "3ddfc")), "2a263d3bfda12b52c22ed8560d90df88"))

print('Success!')

The key takeaway from comparing the two models is that class imbalance can significantly affect model performance. The model trained on the imbalanced data (without upsampling) tends to favor the majority class (Non-EDM), resulting in higher overall accuracy but poor recall for the minority class (EDM). After upsampling, the model becomes better at identifying EDM tracks (higher recall), but overall accuracy may decrease. This highlights the importance of addressing class imbalance when building classifiers, especially when the minority class is of particular interest. It also shows how relying on accuracy as the only metric can be deceiving when judging which model is "better", especially when the classes in the dataset are imbalanced.
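
A small worked example illustrates why accuracy alone can mislead. The labels below are hypothetical (15 EDM and 85 Non-EDM tracks, not the worksheet's actual test set), and the "classifier" simply predicts the majority class for every track.

```r
# Hypothetical test set: 15 EDM tracks and 85 Non-EDM tracks
truth <- factor(c(rep("EDM", 15), rep("Non-EDM", 85)),
                levels = c("EDM", "Non-EDM"))

# A "classifier" that ignores its input and always predicts the majority class
always_non_edm <- factor(rep("Non-EDM", 100), levels = c("EDM", "Non-EDM"))

# Accuracy: share of all tracks labelled correctly
accuracy <- mean(always_non_edm == truth)

# Recall for EDM: share of true EDM tracks that were predicted as EDM
edm_recall <- sum(always_non_edm == "EDM" & truth == "EDM") / sum(truth == "EDM")

print(c(accuracy = accuracy, edm_recall = edm_recall))
```

This do-nothing classifier reaches 85% accuracy while finding none of the EDM tracks, so its recall for the class of interest is 0.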