# MIDTERM EXAM: Comprehensive R Data Wrangling Assessment

**Student Name:** Vivianna Cowan

**Student ID:** kxb432

**Date:** 10/19/2025

**Time Limit:** No Time Limit

---

## Exam Overview

This comprehensive midterm exam assesses your mastery of ALL R data wrangling skills covered in Lessons 1-8:

- **Lesson 1:** R Basics and Data Import
- **Lesson 2:** Data Cleaning (Missing Values & Outliers)
- **Lesson 3:** Data Transformation Part 1 (select, filter, arrange)
- **Lesson 4:** Data Transformation Part 2 (mutate, summarize, group_by)
- **Lesson 5:** Data Reshaping (pivot_longer, pivot_wider)
- **Lesson 6:** Combining Datasets (joins)
- **Lesson 7:** String Manipulation & Date/Time
- **Lesson 8:** Advanced Wrangling & Best Practices

## Business Scenario

You are a data analyst for a retail company. The executive team needs a comprehensive analysis of:
- Sales performance across products and regions
- Customer behavior and segmentation
- Data quality issues and recommendations
- Strategic insights for business growth

## Instructions

1. **Set your working directory** to where your data files are located
2. Complete ALL tasks in order
3. Write code in the TODO sections
4. Use the pipe operator (%>%) to chain operations
5. Add comments explaining your logic
6. Run all cells to verify your code works
7. Answer all reflection questions

## Grading

- **Code Correctness (40%)**: All tasks completed correctly
- **Code Quality (20%)**: Clean, well-commented code
- **Business Understanding (20%)**: Demonstrates understanding of context
- **Analysis & Insights (15%)**: Meaningful insights and recommendations
- **Reflection Questions (5%)**: Thoughtful answers

## Academic Integrity

This is an individual exam. You may use:
- Course notes and lesson materials
- R documentation and help files
- Your previous homework assignments

You may NOT:
- Collaborate with other students
- Use AI assistants or online forums
- Share code or solutions

---

**Good luck! 🎓**

## Part 1: R Basics and Data Import (Lesson 1)

**Skills Assessed:** Variables, data types, data import, working directory

**Your Tasks:**
1. Set working directory
2. Load required packages
3. Import multiple datasets
4. Examine data structures

In [2]:
# Task 1.1: Set Working Directory
# TODO: Set your working directory to where your data files are located
# IMPORTANT: Students must set their own path!
# Example: setwd("/Users/yourname/GitHub/ai-homework-grader-clean/data")

# Your code here:
getwd()
setwd("/workspaces/assignment-1-version-3-vivicowan/data")
getwd()

# Verify working directory
cat("Current working directory:", getwd(), "\n")

Current working directory: /workspaces/assignment-1-version-3-vivicowan/data 


In [3]:
# Task 1.2: Load Required Packages
# TODO: Load tidyverse (includes dplyr, tidyr, stringr, ggplot2)
library(tidyverse)

# TODO: Load lubridate for date operations
library(lubridate)

cat("✅ Packages loaded successfully!\n")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


✅ Packages loaded successfully!


In [6]:
# Task 1.3: Import Datasets
# TODO: Import the following CSV files using read_csv():
#   - company_sales_data.csv -> sales_data
#   - customers.csv -> customers
#   - products.csv -> products
#   - orders.csv -> orders
#   - order_items.csv -> order_items

# Your code here:
sales_data <- read_csv("company_sales_data.csv")

customers <- read_csv("customers.csv")

products <- read_csv("products.csv")

orders <- read_csv("orders.csv")

order_items <- read_csv("order_items.csv")


# Display import summary
cat("✅ Data imported successfully!\n")
cat("Sales data:", nrow(sales_data), "rows\n")
cat("Customers:", nrow(customers), "rows\n")
cat("Products:", nrow(products), "rows\n")
cat("Orders:", nrow(orders), "rows\n")
cat("Order items:", nrow(order_items), "rows\n")

Rows: 300 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Sales_Rep_Name, Region, Product_Category
dbl  (4): TransactionID, Revenue, Cost, Units_Sold
date (1): Sale_Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 100 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Name, Email, City
dbl  (1): CustomerID
date (1): Registration_Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Rows: 50 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Product_Name, Category
dbl (2): ProductID, Supplier_ID

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 250 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (3): OrderID, CustomerID, Total_Amount
date (1): Order_Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 400 Columns: 4
── Column specification

✅ Data imported successfully!
Sales data: 300 rows
Customers: 100 rows
Products: 50 rows
Orders: 250 rows
Order items: 400 rows


## Part 2: Data Cleaning - Missing Values & Outliers (Lesson 2)

**Skills Assessed:** Identifying NAs, handling missing data, detecting outliers

**Your Tasks:**
1. Check for missing values in sales_data
2. Handle missing values appropriately
3. Identify outliers in Revenue column
4. Create a cleaned dataset

In [9]:
# Task 2.1: Check for Missing Values
# TODO: Create 'missing_summary' that shows count of NAs in each column of sales_data

# colSums() over the logical NA matrix gives one NA count per column
missing_summary <- colSums(is.na(sales_data))


cat("========== MISSING VALUES SUMMARY ==========\n")
print(missing_summary)
cat("\nTotal missing values:", sum(missing_summary), "\n")

   TransactionID   Sales_Rep_Name           Region Product_Category 
               0                0                0                0 
         Revenue             Cost       Units_Sold        Sale_Date 
               0                0                0                0 

Total missing values: 0 


In [10]:
# Task 2.2: Handle Missing Values
# TODO: Create 'sales_clean' by removing rows with ANY missing values


sales_clean <- na.omit(sales_data)


cat("========== DATA CLEANING RESULTS ==========\n")
cat("Original rows:", nrow(sales_data), "\n")
cat("Cleaned rows:", nrow(sales_clean), "\n")
cat("Rows removed:", nrow(sales_data) - nrow(sales_clean), "\n")

Original rows: 300 
Cleaned rows: 300 
Rows removed: 0 


In [21]:
# Task 2.3: Detect Outliers in Revenue
# TODO: Calculate outlier thresholds using IQR method
#   - Calculate Q1 (25th percentile) and Q3 (75th percentile) of Revenue
#   - Calculate IQR = Q3 - Q1
#   - Lower bound = Q1 - 1.5 * IQR
#   - Upper bound = Q3 + 1.5 * IQR
# TODO: Create 'outlier_analysis' dataframe with these values

Q1 <- quantile(sales_clean$Revenue, 0.25)
Q3 <- quantile(sales_clean$Revenue, 0.75)
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

outlier_analysis <- data.frame(
  Metric = c("Q1", "Q3", "IQR", "Lower Bound", "Upper Bound"),
  Value = c(Q1, Q3, IQR_value, lower_bound, upper_bound)
)

cat("========== OUTLIER ANALYSIS ==========\n")
print(outlier_analysis)

# Count outliers
outlier_count <- sum(sales_clean$Revenue < lower_bound | sales_clean$Revenue > upper_bound)
cat("\nNumber of outliers detected:", outlier_count, "\n")

       Metric     Value
1          Q1  15034.29
2          Q3  37707.71
3         IQR  22673.42
4 Lower Bound -18975.84
5 Upper Bound  71717.84

Number of outliers detected: 0 
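
The fences reported above can be checked by hand: IQR = 37707.71 − 15034.29 = 22673.42, so the lower bound is 15034.29 − 1.5 × 22673.42 = −18975.84 and the upper bound is 37707.71 + 1.5 × 22673.42 = 71717.84. The following is a minimal sketch of the same rule wrapped in a reusable helper. The function name `flag_iqr_outliers` and the toy vector are assumptions made for illustration and are not part of the exam data.

```r
# Hypothetical helper: TRUE for values outside the k * IQR fences.
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  x < lower | x > upper
}

# Made-up revenue values; 500 sits far above the rest.
toy_revenue <- c(10, 12, 14, 15, 16, 18, 20, 500)
is_outlier <- flag_iqr_outliers(toy_revenue)

# Values flagged as outliers
toy_revenue[is_outlier]
```

Whether a flagged value should be removed, capped, or kept is a business decision; the rule only marks candidates for review.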


## Part 3: Data Transformation Part 1 (Lesson 3)

**Skills Assessed:** select(), filter(), arrange(), pipe operator

**Your Tasks:**
1. Select specific columns
2. Filter data by conditions
3. Sort data
4. Chain operations with pipe
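
One detail worth keeping in mind for the sorting and chaining tasks: each `arrange()` call sorts the whole table again, so piping `arrange(Region)` into `arrange(desc(Revenue))` produces an order driven by Revenue, not a Region-then-Revenue order. Passing both keys to a single call gives the nested ordering. The sketch below uses a made-up tibble named `toy`, which is an assumption for illustration only.

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(
  Region  = c("Europe", "Asia Pacific", "Europe", "Asia Pacific"),
  Revenue = c(300, 100, 200, 400)
)

# Two separate arrange() calls: the second sort determines the final order.
chained <- toy %>%
  arrange(Region) %>%
  arrange(desc(Revenue))

# One arrange() call with two keys: Region A-Z, then Revenue high-to-low
# within each Region.
nested <- toy %>%
  arrange(Region, desc(Revenue))

chained$Region
nested$Region
```

In `chained` the regions are interleaved, while in `nested` all Asia Pacific rows come first, each block sorted by descending Revenue.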

In [22]:
# Task 3.1: Select Specific Columns
# TODO: Create 'sales_summary' with only these columns from sales_clean:
#   Region, Product_Category, Revenue, Units_Sold, Sale_Date


sales_summary <- sales_clean %>%
  # Your code here:
  select(Region, Product_Category, Revenue, Units_Sold, Sale_Date)

cat("========== SELECTED COLUMNS ==========\n")
cat("Columns:", names(sales_summary), "\n")
cat("Rows:", nrow(sales_summary), "\n")
head(sales_summary, 5)

Columns: Region Product_Category Revenue Units_Sold Sale_Date 
Rows: 300 


Region,Product_Category,Revenue,Units_Sold,Sale_Date
<chr>,<chr>,<dbl>,<dbl>,<date>
Latin America,Services,20750.92,78,2023-04-24
Europe,Hardware,32359.98,13,2023-06-09
Europe,Services,39268.4,34,2023-03-25
Europe,Hardware,28865.09,90,2023-04-11
Latin America,Software,3932.36,63,2023-08-26


In [25]:
# Task 3.2: Filter High Revenue Sales
# TODO: Create 'high_revenue_sales' by filtering sales_clean for Revenue > 20000


high_revenue_sales <- sales_clean %>%
  # Your code here:
  filter(Revenue > 20000)

cat("========== HIGH REVENUE SALES ==========\n")
cat("Total high revenue transactions:", nrow(high_revenue_sales), "\n")
cat("Total revenue from these sales: $", sum(high_revenue_sales$Revenue), "\n")

Total high revenue transactions: 194 
Total revenue from these sales: $ 6671906 


In [24]:
# Task 3.3: Sort by Revenue
# TODO: Create 'top_sales' by arranging sales_clean by Revenue in descending order
#       and keeping only the top 10 rows


top_sales <- sales_clean %>%
  # Your code here:
  arrange(desc(Revenue)) %>%
  slice_head(n = 10)  # keep only the 10 highest-revenue rows

cat("========== TOP 10 SALES ==========\n")
print(top_sales %>% select(Region, Product_Category, Revenue, Units_Sold))

# A tibble: 10 × 4
   Region        Product_Category Revenue Units_Sold
   <chr>         <chr>              <dbl>      <dbl>
 1 Asia Pacific  Consulting        49956.         88
 2 Europe        Software          49867.         96
 3 Europe        Consulting        49857.          1
 4 Asia Pacific  Consulting        49239.         72
 5 Asia Pacific  Hardware          48997.         92
 6 North America Services          48884.         62
 7 Europe        Software          48794.         77
 8 North America Hardware          48772.         16
 9 North America Consulting        48748.         63
10 Europe        Consulting        4

In [36]:
# Task 3.4: Chain Multiple Operations
# TODO: Create 'regional_top_sales' by:
#   1. Filtering for Revenue > 15000
#   2. Selecting: Region, Product_Category, Revenue
#   3. Arranging by Region (ascending) then Revenue (descending)
#   4. Keeping top 15 rows
# Use the pipe operator to chain all operations

regional_top_sales <- sales_clean %>%
  # Your code here:
  filter(Revenue > 15000) %>%
  select(Region, Product_Category, Revenue) %>%
  # A single arrange() with both keys: Region ascending, then Revenue descending
  arrange(Region, desc(Revenue)) %>%
  slice_head(n = 15)

cat("========== REGIONAL TOP SALES ==========\n")
print(regional_top_sales)

## Part 4: Data Transformation Part 2 (Lesson 4)

**Skills Assessed:** mutate(), summarize(), group_by()

**Your Tasks:**
1. Create calculated columns with mutate()
2. Calculate summary statistics
3. Perform grouped analysis
4. Generate business metrics
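
As a rough illustration of the grouped analysis these tasks call for, the sketch below runs `group_by()` and `summarise()` on a made-up tibble and adds a share-of-total column. The `toy_sales` data and the `revenue_share_pct` column are assumptions introduced for this example and do not appear in the exam.

```r
library(dplyr)

# Made-up data for illustration
toy_sales <- tibble(
  Region  = c("Europe", "Europe", "Asia Pacific", "Asia Pacific", "Asia Pacific"),
  Revenue = c(100, 300, 50, 150, 400)
)

regional <- toy_sales %>%
  group_by(Region) %>%
  summarise(
    total_revenue     = sum(Revenue),
    avg_revenue       = mean(Revenue),
    transaction_count = n(),
    .groups = "drop"   # return an ungrouped tibble
  ) %>%
  # After ungrouping, sum(total_revenue) is the grand total across regions
  mutate(revenue_share_pct = 100 * total_revenue / sum(total_revenue)) %>%
  arrange(desc(total_revenue))

regional
```

Dropping the grouping before the `mutate()` step matters here: on a grouped tibble, `sum(total_revenue)` would be computed per group and every share would come out as 100.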

In [40]:
# Task 4.1: Create Calculated Columns
# TODO: Add these new columns to sales_clean using mutate():
#   - revenue_per_unit: Revenue / Units_Sold
#   - high_value: "Yes" if Revenue > 20000, else "No"
# Store result in 'sales_enhanced'

sales_enhanced <- sales_clean %>%
  mutate(
    # Your code here:
    revenue_per_unit = Revenue / Units_Sold,
    # The task asks for the labels "Yes"/"No", not a logical TRUE/FALSE
    high_value = if_else(Revenue > 20000, "Yes", "No")
  )

cat("========== ENHANCED SALES DATA ==========\n")
cat("New columns added: revenue_per_unit, high_value\n")
head(sales_enhanced %>% select(Revenue, Units_Sold, revenue_per_unit, high_value), 5)

New columns added: revenue_per_unit, high_value


Revenue,Units_Sold,revenue_per_unit,high_value
<dbl>,<dbl>,<dbl>,<chr>
20750.92,78,266.03744,Yes
32359.98,13,2489.22923,Yes
39268.4,34,1154.95294,Yes
28865.09,90,320.72322,Yes
3932.36,63,62.41841,No


In [41]:
# Task 4.2: Calculate Overall Summary Statistics
# TODO: Create 'overall_summary' with these metrics from sales_enhanced:
#   - total_revenue: sum of Revenue
#   - avg_revenue: mean of Revenue
#   - total_units: sum of Units_Sold
#   - transaction_count: count using n()


overall_summary <- sales_enhanced %>%
  
    # Your code here:
    summarise(
        total_revenue = sum(Revenue),
        avg_revenue = mean(Revenue),
        total_units = sum(Units_Sold),
        transaction_count = n()

    )
  

cat("========== OVERALL SUMMARY ==========\n")
print(overall_summary)

# A tibble: 1 × 4
  total_revenue avg_revenue total_units transaction_count
          <dbl>       <dbl>       <dbl>             <int>
1      7771711.      25906.       16169               300


In [42]:
# Task 4.3: Regional Performance Analysis
# TODO: Create 'regional_summary' by grouping sales_enhanced by Region
#       and calculating:
#   - total_revenue: sum of Revenue
#   - avg_revenue: mean of Revenue
#   - transaction_count: count using n()
# Then arrange by total_revenue descending
# Hint: Use group_by() %>% summarize() %>% arrange()

regional_summary <- sales_enhanced %>%
  # Your code here:
  group_by(Region) %>%
  summarise(
    total_revenue = sum(Revenue),
    avg_revenue = mean(Revenue),
    transaction_count = n()
  ) %>%
  arrange(desc(total_revenue))

cat("========== REGIONAL SUMMARY ==========\n")
print(regional_summary)

# A tibble: 4 × 4
  Region        total_revenue avg_revenue transaction_count
  <chr>                 <dbl>       <dbl>             <int>
1 Europe             2224182.      27124.                82
2 Latin America      2112037.      25446.                83
3 Asia Pacific       1804243.      26929.                67
4 North America      1631248.      23989.                68


In [43]:
# Task 4.4: Product Category Analysis
# TODO: Create 'category_summary' by grouping by Product_Category
#       and calculating the same metrics as regional_summary
#       Then arrange by total_revenue descending

category_summary <- sales_enhanced %>%
  # Your code here:
  group_by(Product_Category) %>%
  summarise(
    total_revenue = sum(Revenue),
    avg_revenue = mean(Revenue),
    transaction_count = n()
  ) %>%
  arrange(desc(total_revenue))

cat("========== CATEGORY SUMMARY ==========\n")
print(category_summary)

# A tibble: 4 × 4
  Product_Category total_revenue avg_revenue transaction_count
  <chr>                    <dbl>       <dbl>             <int>
1 Consulting            1978840.      26037.                76
2 Services              1961565.      27244.                72
3 Hardware              1951325.      26730.                73
4 Software              1879981.      23797.                79


## Part 5: Data Reshaping with tidyr (Lesson 5)

**Skills Assessed:** pivot_longer(), pivot_wider(), tidy data principles

**Your Tasks:**
1. Reshape data from wide to long format
2. Reshape data from long to wide format
3. Create analysis-ready datasets
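
The wide/long round trip in these tasks can be sketched on a small made-up table. The tibble `toy_long` and its values are assumptions for illustration; only the column names mirror the exam.

```r
library(dplyr)
library(tidyr)

# Made-up long-format data: one row per Region/Product_Category pair
toy_long <- tibble(
  Region           = c("Europe", "Europe", "Asia Pacific", "Asia Pacific"),
  Product_Category = c("Hardware", "Software", "Hardware", "Software"),
  total_revenue    = c(10, 20, 30, 40)
)

# Long -> wide: one row per Region, one column per Product_Category
toy_wide <- toy_long %>%
  pivot_wider(names_from = Product_Category, values_from = total_revenue)

# Wide -> long: every column except Region goes back into name/value pairs
toy_back <- toy_wide %>%
  pivot_longer(
    cols      = -Region,
    names_to  = "Product_Category",
    values_to = "total_revenue"
  )

toy_wide
toy_back
```

Wide tables tend to suit side-by-side comparison in a report, while long tables are usually what `group_by()` and `ggplot2` expect.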

In [44]:
# Task 5.1: Create Wide Format Data
# First, create a summary by Region and Product_Category
region_category_revenue <- sales_enhanced %>%
  group_by(Region, Product_Category) %>%
  summarize(total_revenue = sum(Revenue), .groups = 'drop')

cat("========== REGION-CATEGORY DATA (LONG FORMAT) ==========\n")
print(head(region_category_revenue, 10))

# A tibble: 10 × 3
   Region        Product_Category total_revenue
   <chr>         <chr>                    <dbl>
 1 Asia Pacific  Consulting             759641.
 2 Asia Pacific  Hardware               271979.
 3 Asia Pacific  Services               331826.
 4 Asia Pacific  Software               440797.
 5 Europe        Consulting             390670.
 6 Europe        Hardware               777044.
 7 Europe        Services               513507.
 8 Europe        Software               542961.
 9 Latin America Consulting             433397.
10 Latin America Hardware

In [48]:
# Task 5.2: Reshape to Wide Format
# TODO: Create 'revenue_wide' by pivoting region_category_revenue
#       so that Product_Category values become column names
#       with total_revenue as the values


revenue_wide <- region_category_revenue %>%
  # Your code here:
  pivot_wider(
    names_from = "Product_Category",
    values_from = "total_revenue"
  )

cat("========== REVENUE DATA (WIDE FORMAT) ==========\n")
print(revenue_wide)

# A tibble: 4 × 5
  Region        Consulting Hardware Services Software
  <chr>              <dbl>    <dbl>    <dbl>    <dbl>
1 Asia Pacific     759641.  271979.  331826.  440797.
2 Europe           390670.  777044.  513507.  542961.
3 Latin America    433397.  474257.  644772.  559611.
4 North America    395132.  428046.  471460.  336611.


In [55]:
# Task 5.3: Reshape Back to Long Format
# TODO: Create 'revenue_long' by pivoting revenue_wide back to long format
#       Column names (except Region) should go into 'Product_Category'
#       Values should go into 'revenue'


revenue_long <- revenue_wide %>%
  # Your code here:
  pivot_longer(
    cols = -Region,
    names_to = "Product_Category",
    values_to = "revenue"
  )

cat("========== REVENUE DATA (BACK TO LONG FORMAT) ==========\n")
print(head(revenue_long, 10))

# A tibble: 10 × 3
   Region        Product_Category revenue
   <chr>         <chr>              <dbl>
 1 Asia Pacific  Consulting       759641.
 2 Asia Pacific  Hardware         271979.
 3 Asia Pacific  Services         331826.
 4 Asia Pacific  Software         440797.
 5 Europe        Consulting       390670.
 6 Europe        Hardware         777044.
 7 Europe        Services         513507.
 8 Europe        Software         542961.
 9 Latin America Consulting       433397.
10 Latin America Hardware         474257.


## Part 6: Combining Datasets with Joins (Lesson 6)

**Skills Assessed:** left_join(), inner_join(), data integration

**Your Tasks:**
1. Join customers with orders
2. Join orders with order_items
3. Create integrated dataset

In [56]:
# Task 6.1: Join Customers and Orders
# TODO: Create 'customer_orders' by left joining customers with orders
#       Join on CustomerID


customer_orders <- left_join(customers, orders, by = "CustomerID")


cat("========== CUSTOMER ORDERS ==========\n")
cat("Total rows:", nrow(customer_orders), "\n")
cat("Columns:", ncol(customer_orders), "\n")

Total rows: 200 
Columns: 8 


In [57]:
# Task 6.2: Join Orders and Order Items
# TODO: Create 'orders_with_items' by inner joining orders with order_items
#       Join on OrderID


orders_with_items <- inner_join(orders, order_items, by = "OrderID")


cat("========== ORDERS WITH ITEMS ==========\n")
cat("Total rows:", nrow(orders_with_items), "\n")
head(orders_with_items, 5)

Total rows: 400 


OrderID,CustomerID,Order_Date,Total_Amount,ProductID,Quantity,Unit_Price
<dbl>,<dbl>,<date>,<dbl>,<dbl>,<dbl>,<dbl>
1,87,2023-08-30,424.3,2,3,115.72
1,87,2023-08-30,424.3,22,5,206.62
1,87,2023-08-30,424.3,26,5,61.75
3,37,2024-03-19,549.07,19,1,474.92
6,101,2023-07-22,189.85,32,4,272.64
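
To make the difference between the two join types concrete, here is a minimal sketch on made-up tables. The tibbles `toy_customers` and `toy_orders` are assumptions for illustration, and the `anti_join()` step is an extra check that the exam tasks do not ask for.

```r
library(dplyr)

# Made-up data for illustration
toy_customers <- tibble(
  CustomerID = c(1, 2, 3),
  Name       = c("Ana", "Ben", "Cy")
)
toy_orders <- tibble(
  OrderID      = c(10, 11, 12),
  CustomerID   = c(1, 1, 4),   # customer 4 has no row in toy_customers
  Total_Amount = c(50, 20, 70)
)

# left_join keeps every customer; customers without orders get NA order columns.
left_result <- left_join(toy_customers, toy_orders, by = "CustomerID")

# inner_join keeps only rows whose CustomerID appears in both tables.
inner_result <- inner_join(toy_customers, toy_orders, by = "CustomerID")

# anti_join lists orders whose CustomerID is missing from the customer table.
orphan_orders <- anti_join(toy_orders, toy_customers, by = "CustomerID")

nrow(left_result)
nrow(inner_result)
orphan_orders
```

Comparing row counts before and after a join, and checking for unmatched keys with `anti_join()`, is a quick way to spot records that a join silently dropped or duplicated.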


## Part 7: String Manipulation & Date/Time Operations (Lesson 7)

**Skills Assessed:** stringr functions, lubridate functions

**Your Tasks:**
1. Clean text data
2. Parse dates
3. Extract date components

In [63]:
# Task 7.1: Clean Text Data
# TODO: Add these columns to sales_enhanced using mutate():
#   - region_clean: Region with trimmed whitespace and Title Case
#   - category_clean: Product_Category with trimmed whitespace and Title Case


sales_enhanced <- sales_enhanced %>%
  mutate(
    # Your code here:
    # Trim first, then title-case the trimmed value, in one expression per column
    region_clean = str_to_title(str_trim(Region)),
    
    category_clean = str_to_title(str_trim(Product_Category))

  )

cat("========== CLEANED TEXT DATA ==========\n")
head(sales_enhanced %>% select(Region, region_clean, Product_Category, category_clean), 5)



Region,region_clean,Product_Category,category_clean
<chr>,<chr>,<chr>,<chr>
Latin America,Latin America,Services,Services
Europe,Europe,Hardware,Hardware
Europe,Europe,Services,Services
Europe,Europe,Hardware,Hardware
Latin America,Latin America,Software,Software


In [68]:
# Task 7.2: Parse Dates and Extract Components
# TODO: Add these date-related columns using mutate():
#   - date_parsed: Parse Sale_Date column (use ymd(), mdy(), or dmy() as appropriate)
#   - sale_month: Extract month name from date_parsed
#   - sale_weekday: Extract weekday name from date_parsed


sales_enhanced <- sales_enhanced %>%
  mutate(
    # Your code here:
    date_parsed = ymd(Sale_Date),
    sale_month = month(date_parsed, label = TRUE),
    sale_weekday = wday(date_parsed, label = TRUE)
  )

cat("========== DATE COMPONENTS ==========\n")
head(sales_enhanced %>% select(Sale_Date, date_parsed, sale_month, sale_weekday), 5)



Sale_Date,date_parsed,sale_month,sale_weekday
<date>,<date>,<ord>,<ord>
2023-04-24,2023-04-24,Apr,Mon
2023-06-09,2023-06-09,Jun,Fri
2023-03-25,2023-03-25,Mar,Sat
2023-04-11,2023-04-11,Apr,Tue
2023-08-26,2023-08-26,Aug,Sat
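
The exam data already looks clean, so the effect of the text and date steps is easier to see on deliberately messy input. The sketch below uses a made-up tibble `toy` (an assumption for illustration) and requests full month and weekday names through `abbr = FALSE`.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Made-up data with stray whitespace and inconsistent case; dates stored as text
toy <- tibble(
  Region    = c("  latin america ", "EUROPE", "asia pacific"),
  Sale_Date = c("2023-04-24", "2023-06-09", "2023-03-25")
)

toy_clean <- toy %>%
  mutate(
    # Trim whitespace first, then convert to Title Case
    region_clean = str_to_title(str_trim(Region)),
    # Parse year-month-day text into Date values
    date_parsed  = ymd(Sale_Date),
    # Full names; the label text depends on the session locale
    sale_month   = month(date_parsed, label = TRUE, abbr = FALSE),
    sale_weekday = wday(date_parsed, label = TRUE, abbr = FALSE)
  )

toy_clean
```

Nesting `str_trim()` inside `str_to_title()` keeps both steps in one expression, so a later assignment cannot overwrite the trimmed result.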


## Part 8: Advanced Wrangling & Business Intelligence (Lesson 8)

**Skills Assessed:** case_when(), complex logic, KPIs

**Your Tasks:**
1. Create business categories with case_when()
2. Calculate KPIs
3. Generate executive summary

In [74]:
# Task 8.1: Create Performance Categories
# TODO: Add 'performance_tier' column using case_when():
#   - "High" if Revenue > 25000
#   - "Medium" if Revenue > 15000
#   - "Low" otherwise

sales_enhanced <- sales_enhanced %>%
  mutate(
    performance_tier = case_when(
      # Your code here:
      Revenue > 25000 ~ "High",
      Revenue > 15000 ~ "Medium",
      # Catch-all so that Revenue of exactly 15000 is "Low" rather than NA
      TRUE ~ "Low"
      
    )
  )

cat("========== PERFORMANCE TIERS ==========\n")
table(sales_enhanced$performance_tier)




  High    Low Medium 
   154     74     72 
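
`case_when()` checks its conditions from top to bottom and returns `NA` for any value that matches none of them, which is why a catch-all final branch matters at boundary values. The sketch below shows this on a made-up vector; `toy_revenue` is an assumption for illustration.

```r
library(dplyr)

# Made-up revenue values, including one exactly on the 15000 boundary
toy_revenue <- c(30000, 20000, 15000, 9000)

# Without a fallback, 15000 matches no condition and becomes NA.
no_default <- case_when(
  toy_revenue > 25000 ~ "High",
  toy_revenue > 15000 ~ "Medium",
  toy_revenue < 15000 ~ "Low"
)

# A final TRUE condition catches everything the earlier conditions missed.
with_default <- case_when(
  toy_revenue > 25000 ~ "High",
  toy_revenue > 15000 ~ "Medium",
  TRUE ~ "Low"
)

no_default
with_default
```

Because the first matching condition wins, the conditions should run from most to least restrictive, as they do here.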

In [76]:
# Task 8.2: Calculate Business KPIs
# TODO: Create 'business_kpis' with these metrics:
#   - total_revenue: sum of Revenue
#   - total_transactions: count of rows
#   - avg_transaction_value: mean of Revenue
#   - high_value_pct: percentage where high_value = "Yes"

business_kpis <- sales_enhanced %>%
  summarize(
    # Your code here:
    total_revenue = sum(Revenue),
    total_transactions = n(),
    avg_transaction_value = mean(Revenue),
    # Revenue > 20000 is the condition that defines high_value, so testing it
    # directly gives the percentage regardless of how high_value is stored
    high_value_pct = mean(Revenue > 20000, na.rm = TRUE) * 100

  )

cat("========== BUSINESS KPIs ==========\n")
print(business_kpis)

# A tibble: 1 × 4
  total_revenue total_transactions avg_transaction_value high_value_pct
          <dbl>              <int>                 <dbl>          <dbl>
1      7771711.                300                25906.           64.7


## Part 9: Reflection Questions

Answer the following questions based on your analysis.

### Question 9.1: Data Cleaning Impact

**How did handling missing values and outliers affect your analysis? Why is data cleaning important before performing business analysis?**

Your answer here:



### Question 9.2: Grouped Analysis Value

**What insights did you gain from the regional and category summaries that you couldn't see in the raw data? How can businesses use this type of grouped analysis?**

Your answer here:



### Question 9.3: Data Reshaping Purpose

**Why would you need to reshape data between wide and long formats? Provide a business scenario where each format would be useful.**

Your answer here:



### Question 9.4: Joining Datasets

**What is the difference between left_join() and inner_join()? When would you use each one in a business context?**

Your answer here:



### Question 9.5: Skills Integration

**Which R data wrangling skill (from Lessons 1-8) do you think is most valuable for business analytics? Why?**

Your answer here:



## Exam Complete!

### What You've Demonstrated

✅ **Lesson 1:** R basics and data import
✅ **Lesson 2:** Data cleaning (missing values & outliers)
✅ **Lesson 3:** Data transformation (select, filter, arrange)
✅ **Lesson 4:** Advanced transformation (mutate, summarize, group_by)
✅ **Lesson 5:** Data reshaping (pivot_longer, pivot_wider)
✅ **Lesson 6:** Combining datasets (joins)
✅ **Lesson 7:** String manipulation & date/time operations
✅ **Lesson 8:** Advanced wrangling & business intelligence

### Submission Checklist

Before submitting, ensure:
- [ ] All code cells run without errors
- [ ] All TODO sections completed
- [ ] All required dataframes created with correct names
- [ ] All 5 reflection questions answered
- [ ] Student name and ID filled in at top

**Good work! 🎉**