# R Programming & Machine Learning Fundamentals - Homework Assignment

## Course: Business Analytics with R

**Total Points: 183 + 15 Bonus = 198 possible points**

### Instructions
1. Complete all tasks by writing R code in the provided cells
2. Run each cell to verify your code works correctly
3. Use the `mlba` package datasets (WestRoxbury, ToyotaCorolla)
4. Comment your code to explain your reasoning
5. Show your output - do not suppress results

### Skills Tested
- Data Import and Exploration (including `class()`)
- Descriptive Statistics (including `pastecs::stat.desc()`)
- Data Transformation and Recoding (including `attach()`/`detach()`)
- Correlation Analysis (Pearson and Spearman)
- Data Visualization (including correlation heatmaps)
- Normality Testing
- Sampling Techniques
- Data Preprocessing
- Data Partitioning
- Predictive Modeling
- File I/O (`read.csv()`, `write.csv()`)

---

## Setup: Load Required Libraries

Run this cell first to load all necessary packages.

In [1]:
# Load required libraries
library(mlba)
library(tidyverse)
library(Hmisc)
library(psych)
library(pastecs)
library(caret)
library(fastDummies)
library(e1071)

# Disable scientific notation
options(scipen=999)

Loading required package: caret

Loading required package: ggplot2

Loading required package: lattice

Loading required package: forecast

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ lubridate 1.9.4     ✔ tibble    3.3.1
✔ purrr     1.2.1     ✔ tidyr     1.3.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ purrr::lift()   masks caret::lift()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to f

---

## Part 1: Data Import and Exploration (23 points)

In this section, you will load data and explore its structure using R's built-in functions.

### Task 1.1: Load the ToyotaCorolla Dataset (3 points)

Load the `ToyotaCorolla` dataset from the `mlba` package and store it in a variable called `toyota.df`.

In [None]:
# YOUR CODE HERE

### Task 1.2: Check Dataset Dimensions (3 points)

Use the appropriate functions to display:
1. The number of rows and columns in the dataset
2. The column names

In [None]:
# YOUR CODE HERE

### Task 1.3: Examine Data Structure (4 points)

Use `str()` to display the structure of the dataset. Then identify and list:
1. How many numeric variables are there?
2. How many character/factor variables are there?

In [None]:
# YOUR CODE HERE

### Task 1.4: Preview Data (5 points)

1. Display the first 8 rows of the dataset
2. Display the last 5 rows of the dataset
3. Display rows 10-15 and columns 1-5 only
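As a reminder of the syntax involved, here is a minimal sketch on R's built-in `mtcars` data rather than the assignment dataset. The object names (`cars_demo`, `first_rows`, `last_rows`, `block`) are illustrative only.

```r
# Illustration on built-in data; mtcars stands in for the assignment dataset
cars_demo <- mtcars

first_rows <- head(cars_demo, 3)    # first 3 rows
last_rows  <- tail(cars_demo, 2)    # last 2 rows
block      <- cars_demo[4:6, 1:2]   # rows 4 to 6, columns 1 and 2 only

print(first_rows)
print(last_rows)
print(block)
```

The `[rows, columns]` form accepts any index vectors, so the same pattern works for other row and column ranges.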

In [None]:
# YOUR CODE HERE

### Task 1.5: Column Access (5 points)

Using the `$` notation:
1. Display the first 10 values of the `Price` column
2. Calculate the length (number of values) of the `Price` column
3. Calculate the mean of the `Price` column

In [None]:
# YOUR CODE HERE

### Task 1.6: Check Data Type with class() (3 points)

Use the `class()` function to check the data type of:
1. The entire `toyota.df` dataframe
2. The `Price` column
3. The `Fuel_Type` column
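For reference, `class()` reports different types for a whole data frame and for its individual columns. The sketch below uses the built-in `iris` data as a stand-in; the assignment columns may report other classes.

```r
# Illustration on built-in data, not the assignment dataset
df_class      <- class(iris)               # the data frame as a whole
numeric_class <- class(iris$Sepal.Length)  # a numeric measurement column
factor_class  <- class(iris$Species)       # a categorical column stored as a factor

# A character vector reports "character" until it is converted with factor()
labels_chr <- c("a", "b", "a")
chr_class  <- class(labels_chr)
fct_class  <- class(factor(labels_chr))

print(c(df_class, numeric_class, factor_class, chr_class, fct_class))
```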

In [None]:
# YOUR CODE HERE


---

## Part 2: Descriptive Statistics (25 points)

Calculate and interpret summary statistics using multiple methods.

### Task 2.1: Basic Summary Statistics (5 points)

Use the `summary()` function on the `Price` column. What are the mean and median values?

In [None]:
# YOUR CODE HERE

### Task 2.2: Five-Number Summary (5 points)

Use `fivenum()` to get the five-number summary for the `KM` (mileage) column. What is the IQR (interquartile range)?
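One detail worth knowing: `fivenum()` returns Tukey's hinges, which can differ slightly from the quartiles that `IQR()` and `quantile()` use by default. The sketch below, on the built-in `mtcars$mpg` column with illustrative variable names, computes the spread both ways so the two can be compared.

```r
# Illustration on built-in data, not the assignment dataset
x <- mtcars$mpg

five <- fivenum(x)  # minimum, lower hinge, median, upper hinge, maximum
names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")

hinge_spread <- five[["upper_hinge"]] - five[["lower_hinge"]]  # spread from fivenum()
quartile_iqr <- IQR(x)                                         # spread from default quantiles

print(five)
print(c(hinge_spread = hinge_spread, quartile_iqr = quartile_iqr))
```

Either value is a defensible answer as long as you state which method produced it.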

In [None]:
# YOUR CODE HERE

### Task 2.3: Detailed Statistics with Hmisc (5 points)

Use `Hmisc::describe()` to get detailed statistics for the `HP` (horsepower) column. How many missing values are there?

In [None]:
# YOUR CODE HERE

### Task 2.4: Skewness and Kurtosis (5 points)

Use `psych::describe()` to calculate statistics for the `Price` column. Based on the skewness value:
1. Is the distribution symmetric, moderately skewed, or highly skewed?
2. Would you recommend using mean or median to describe the typical price?

In [None]:
# YOUR CODE HERE

### Task 2.5: Detailed Statistics with pastecs (5 points)

Use `pastecs::stat.desc()` to get comprehensive statistics for the `KM` column. Identify the coefficient of variation (`coef.var`) from the output.

In [None]:
# YOUR CODE HERE

---

## Part 3: Data Transformation (25 points)

Create new variables and transform existing ones.

### Task 3.1: Create Calculated Variables (5 points)

Using the `$` notation, create a new column called `Price_Per_KM` that calculates the price per kilometer driven (Price / KM). Display the first 5 values of this new column.

In [None]:
# YOUR CODE HERE

### Task 3.2: Transform with Multiple Variables (5 points)

Using `transform()`, create two new columns in a new dataframe called `toyota_transformed`:
1. `Total_Value` = Price + (HP * 50)  (estimating value based on horsepower)
2. `Age_Years` = Age_08_04 / 12  (`Age_08_04` is recorded in months as of August 2004, so this converts age to years)

Display the names of the new dataframe to verify the columns were added.
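A short sketch of `transform()` on the built-in `mtcars` data, with made-up column names (`Power_To_Weight`, `Weight_Lbs`), shows how several columns can be added in one call. Within a single `transform()` call the new columns are computed from the original columns, so one new column cannot refer to another.

```r
# Illustration on built-in data, not the assignment dataset
cars_transformed <- transform(
  mtcars,
  Power_To_Weight = hp / wt,    # horsepower per 1000 lbs
  Weight_Lbs      = wt * 1000   # wt is recorded in units of 1000 lbs
)

# The new columns are appended after the original ones
print(names(cars_transformed))
```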

In [None]:
# YOUR CODE HERE

### Task 3.3: Recode Binary Variable (5 points)

Using `ifelse()`, create a new column called `High_Mileage` that contains:
- "High" if KM > 100000
- "Low" if KM <= 100000

Display a table showing the count of each category.
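The recode-then-count pattern looks roughly like this on the built-in `mtcars` data. The column name `Power_Class` and the 150 hp cutoff are arbitrary choices for illustration.

```r
# Illustration on built-in data, not the assignment dataset
cars_demo <- mtcars

# ifelse() is vectorized: it tests every row and returns one label per row
cars_demo$Power_Class <- ifelse(cars_demo$hp > 150, "High", "Low")

# table() counts how many rows fall into each label
power_counts <- table(cars_demo$Power_Class)
print(power_counts)
```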

In [None]:
# YOUR CODE HERE

### Task 3.4: Recode Multi-Level Variable (5 points)

Create a new column called `Price_Tier` with three categories based on Price:
- "Budget" if Price < 8000
- "Mid-Range" if Price >= 8000 AND Price < 15000
- "Premium" if Price >= 15000

Display a table of the distribution.

In [None]:
# YOUR CODE HERE

### Task 3.5: Using attach() and detach() (5 points)

Use the `attach()` and `detach()` functions to create a new column called `HP_Per_Weight` that calculates HP divided by Weight, without using the `$` notation or dataframe prefix inside the calculation.

**Note:** Remember to detach after you're done!

In [None]:
# YOUR CODE HERE

---

## Part 4: Correlation Analysis (20 points)

Analyze relationships between variables using correlation methods.

### Task 4.1: Pearson Correlation Matrix (5 points)

Select the numeric columns `Price`, `Age_08_04`, `KM`, and `HP` and calculate the Pearson correlation matrix. Round to 3 decimal places.

In [None]:
# YOUR CODE HERE

### Task 4.2: Interpret Correlations (5 points)

Based on your correlation matrix, answer these questions in a comment:
1. Which variable has the strongest correlation with Price?
2. Is this correlation positive or negative? What does this mean in business terms?
3. Which pair of variables has the weakest correlation?

In [None]:
# YOUR ANSWERS HERE (as comments)

### Task 4.3: Statistical Significance Test (5 points)

Use `cor.test()` to test if the correlation between `Price` and `Age_08_04` is statistically significant. Report the p-value and state your conclusion (α = 0.05).

In [None]:
# YOUR CODE HERE

### Task 4.4: Spearman Correlation (5 points)

Calculate the Spearman (rank-based) correlation matrix for the same variables as in Task 4.1 (`Price`, `Age_08_04`, `KM`, `HP`). Compare the results to the Pearson matrix. Are there notable differences?

In [None]:
# YOUR CODE HERE
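The correlation calls used in this part can be sketched on the built-in `mtcars` data. The variable selection (`mpg`, `wt`, `hp`) and object names are illustrative and unrelated to the assignment columns.

```r
# Illustration on built-in data, not the assignment dataset
vars <- mtcars[, c("mpg", "wt", "hp")]

# Same call, different method: Pearson uses raw values, Spearman uses ranks
pearson  <- round(cor(vars, method = "pearson"), 3)
spearman <- round(cor(vars, method = "spearman"), 3)
print(pearson)
print(spearman)

# cor.test() tests H0: the true correlation is zero (Pearson by default)
mpg_wt_test <- cor.test(mtcars$mpg, mtcars$wt)
print(mpg_wt_test$estimate)
print(mpg_wt_test$p.value)
```

A p-value below the chosen α leads to rejecting the null hypothesis of zero correlation.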

---

## Part 5: Data Visualization (20 points)

Create visualizations to explore data distributions and relationships.

### Task 5.1: Boxplot for Outlier Detection (5 points)

Create a boxplot for the `Price` variable. Add a meaningful title and axis label.

In [None]:
# YOUR CODE HERE

### Task 5.2: Density Plot (5 points)

Create a density plot for the `KM` (mileage) variable. Add an appropriate title and axis labels. Based on the shape, describe the distribution (symmetric, left-skewed, or right-skewed).

In [None]:
# YOUR CODE HERE

### Task 5.3: Scatterplot with Smoothing (5 points)

Create a scatterplot showing the relationship between `Age_08_04` (x-axis) and `Price` (y-axis). Use `scatter.smooth()`, which draws the points and adds a loess smoothing line in one call.

In [None]:
# YOUR CODE HERE

### Task 5.4: Correlation Heatmap (5 points)

Create a correlation heatmap visualization for the numeric variables `Price`, `Age_08_04`, `KM`, `HP`, and `Weight`. Use the `image()` function with a color palette that shows negative correlations in blue and positive correlations in red.

In [None]:
# YOUR CODE HERE

---

## Part 6: Normality Testing (10 points)

Test whether variables follow a normal distribution.

### Task 6.1: Shapiro-Wilk Test (5 points)

Perform a Shapiro-Wilk normality test on the `Price` variable. State the null hypothesis, report the p-value, and state your conclusion at α = 0.05.
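To show how the test output is read, the sketch below runs `shapiro.test()` on two simulated samples (one drawn from a normal distribution, one from a strongly right-skewed exponential distribution). The data are simulated for illustration only and say nothing about the assignment variables.

```r
# Simulated data for illustration only
set.seed(1)
normal_draws <- rnorm(200)   # drawn from a normal distribution
skewed_draws <- rexp(200)    # drawn from a right-skewed distribution

# H0 for Shapiro-Wilk: the sample comes from a normal distribution
normal_test <- shapiro.test(normal_draws)
skewed_test <- shapiro.test(skewed_draws)

# W close to 1 is consistent with normality; a p-value below alpha rejects H0
print(c(W = unname(normal_test$statistic), p = normal_test$p.value))
print(c(W = unname(skewed_test$statistic), p = skewed_test$p.value))
```

Note that `shapiro.test()` only accepts between 3 and 5000 non-missing values.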

In [None]:
# YOUR CODE HERE

### Task 6.2: Compare Normality of Two Variables (5 points)

Test both `HP` and `Weight` for normality. Which variable is more likely to be normally distributed? Support your answer with the test statistics.

In [None]:
# YOUR CODE HERE

---

## Part 7: Sampling Techniques (15 points)

Apply different sampling methods for data analysis.

### Task 7.1: Simple Random Sampling (5 points)

Using `set.seed(123)` for reproducibility, take a random sample of 10 rows from the toyota dataset. Display the sample.

In [None]:
# YOUR CODE HERE

### Task 7.2: Weighted Sampling (5 points)

Create a weighted sample of 10 observations that oversamples cars with HP > 100: give each such car a sampling weight of 0.9 and every other car a weight of 0.1. Use `set.seed(123)`.
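A sketch of weighted sampling on the built-in `mtcars` data follows. The 150 hp cutoff and the object names are illustrative. The `prob` argument of `sample()` takes one weight per row, and the weights do not need to sum to 1.

```r
# Illustration on built-in data, not the assignment dataset
set.seed(123)

# One weight per row: rows meeting the condition are far more likely to be drawn
row_weights <- ifelse(mtcars$hp > 150, 0.9, 0.1)

# Draw 5 distinct row indices using those weights
picked <- sample(nrow(mtcars), 5, prob = row_weights)
weighted_sample <- mtcars[picked, ]

print(weighted_sample[, c("mpg", "hp")])
```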

In [None]:
# YOUR CODE HERE

### Task 7.3: Class Rebalancing with Upsampling (5 points)

1. First, convert `Fuel_Type` to a factor and display the class distribution using `table()`
2. Use `caret::upSample()` to balance the classes
3. Display the new distribution
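The idea behind upsampling is to resample the smaller classes with replacement until every class is as large as the biggest one. The sketch below does this by hand in base R on a deliberately unbalanced subset of `iris`. It is a conceptual stand-in, not the `caret::upSample()` call the task asks for, and all object names are illustrative.

```r
# Conceptual base-R sketch; the task itself should use caret::upSample()
set.seed(123)

# Build an unbalanced example: 50 setosa, 20 versicolor, 10 virginica
demo <- iris[c(1:50, 51:70, 101:110), ]
print(table(demo$Species))

target <- max(table(demo$Species))                      # size of the largest class
idx_by_class <- split(seq_len(nrow(demo)), demo$Species)

# Keep every original row, then add resampled rows until each class reaches target
balanced_idx <- unlist(lapply(idx_by_class, function(idx) {
  c(idx, sample(idx, target - length(idx), replace = TRUE))
}))

balanced <- demo[balanced_idx, ]
print(table(balanced$Species))
```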

In [None]:
# YOUR CODE HERE

---

## Part 8: Data Preprocessing (15 points)

Prepare data for machine learning models.

### Task 8.1: Handle Missing Values (5 points)

1. First, simulate missing data by setting 15 random `Age_08_04` values to NA (use `set.seed(42)`)
2. Show the summary to confirm NA values exist
3. Impute the missing values using the median
4. Verify no NAs remain
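The simulate, check, impute, verify sequence can be sketched on a copy of the built-in `mtcars$mpg` column. The number of values removed (4) and the object names are arbitrary choices for illustration.

```r
# Illustration on built-in data, not the assignment dataset
set.seed(42)
mpg_demo <- mtcars$mpg

# Step 1: blank out a few randomly chosen positions
mpg_demo[sample(length(mpg_demo), 4)] <- NA

# Step 2: confirm the missing values exist
n_missing <- sum(is.na(mpg_demo))
print(n_missing)

# Step 3: compute the median from the observed values and fill the gaps
fill_value <- median(mpg_demo, na.rm = TRUE)
mpg_demo[is.na(mpg_demo)] <- fill_value

# Step 4: verify nothing is missing any more
print(sum(is.na(mpg_demo)))
```

Without `na.rm = TRUE`, `median()` returns `NA` whenever the vector contains a missing value.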

In [None]:
# YOUR CODE HERE

### Task 8.2: Convert to Factor (5 points)

Convert the `Fuel_Type` column to a factor. Display the structure and levels of the converted variable.

In [None]:
# YOUR CODE HERE

### Task 8.3: Create Dummy Variables (5 points)

Using `fastDummies::dummy_cols()`, create dummy variables for `Fuel_Type`. Remove the original column and the first dummy (reference category). Display the new column names to verify.
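To see what dropping the reference category means, here is a base-R sketch using `model.matrix()` on a small hypothetical vector. This is a stand-in for illustration, not the `fastDummies::dummy_cols()` call the task asks for, and the column names it produces follow `model.matrix()` conventions.

```r
# Hypothetical toy vector, not the assignment column
fuel <- factor(c("Petrol", "Diesel", "CNG", "Petrol", "Diesel"))
print(levels(fuel))   # levels are sorted alphabetically; the first is the reference

# model.matrix() builds an intercept plus one 0/1 column per non-reference level;
# dropping column 1 removes the intercept and leaves only the dummies
dummies <- model.matrix(~ fuel)[, -1]
print(colnames(dummies))
print(dummies)
```

A row with zeros in every dummy column belongs to the reference category, which is why one level can be dropped without losing information.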

In [None]:
# YOUR CODE HERE

---

## Part 9: Data Partitioning (10 points)

Split data properly for model training and evaluation.

### Task 9.1: Holdout Partition (5 points)

Using `set.seed(1)`, create a 70/30 train/test split:
1. Randomly select 70% of row indices for training
2. Use the remaining 30% for the holdout set
3. Print the number of rows in each set
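The index-based split described in these steps looks roughly like this on the built-in `mtcars` data (object names are illustrative):

```r
# Illustration on built-in data, not the assignment dataset
set.seed(1)
n <- nrow(mtcars)

train_rows   <- sample(n, round(0.7 * n))          # 70% of the row indices, chosen at random
holdout_rows <- setdiff(seq_len(n), train_rows)    # every index not used for training

train_df   <- mtcars[train_rows, ]
holdout_df <- mtcars[holdout_rows, ]

print(c(train = nrow(train_df), holdout = nrow(holdout_df)))
```

Using `setdiff()` guarantees the two sets share no rows and together cover the whole dataset.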

In [None]:
# YOUR CODE HERE

### Task 9.2: Stratified Partitioning with caret (5 points)

Use `caret::createDataPartition()` to create a stratified 60/40 split based on the `Price` variable. This ensures the distribution of Price is similar in both sets. Use `set.seed(1)`.

In [None]:
# YOUR CODE HERE

---

## Part 10: Predictive Modeling (10 points)

Build and evaluate a simple predictive model.

### Task 10.1: Build Linear Regression Model (5 points)

Using the WestRoxbury housing dataset:
1. Load and preprocess the data (convert REMODEL to factor, remove TAX column)
2. Create a 60/40 train/test split using `set.seed(1)`
3. Build a linear regression model to predict `TOTAL.VALUE` using all other variables
4. Display the first 5 predicted values on the training set

In [None]:
# YOUR CODE HERE

### Task 10.2: Evaluate Model Performance (5 points)

1. Make predictions on the holdout set
2. Calculate RMSE and MAE for both training and holdout sets
3. Compare the metrics - is there evidence of overfitting?
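RMSE and MAE are simple enough to compute by hand. The sketch below fits a small model on the built-in `mtcars` data and compares both metrics on training and holdout rows. The helper functions `rmse()` and `mae()` are defined here for illustration and do not come from any package; the model formula and split size are arbitrary.

```r
# Illustration on built-in data, not the WestRoxbury dataset
set.seed(1)
train_rows <- sample(nrow(mtcars), 20)
train_df   <- mtcars[train_rows, ]
holdout_df <- mtcars[-train_rows, ]

fit <- lm(mpg ~ wt + hp, data = train_df)

# Illustrative helpers: root mean squared error and mean absolute error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

train_pred   <- predict(fit, newdata = train_df)
holdout_pred <- predict(fit, newdata = holdout_df)

metrics <- rbind(
  train   = c(RMSE = rmse(train_df$mpg, train_pred),
              MAE  = mae(train_df$mpg, train_pred)),
  holdout = c(RMSE = rmse(holdout_df$mpg, holdout_pred),
              MAE  = mae(holdout_df$mpg, holdout_pred))
)
print(round(metrics, 3))
```

Holdout errors far above the training errors would point to overfitting, while similar values suggest the model generalizes reasonably well.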

In [None]:
# YOUR CODE HERE

---

## Part 11: File Input/Output (10 points)

Working with files is fundamental to data analysis workflows.

### Task 11.1: Read CSV File (5 points)

The assignment data come from the `mlba` package rather than from a file, so demonstrate the `read.csv()` syntax with a round trip:
1. Export the toyota data to a CSV file using `write.csv()`
2. Read it back using `read.csv()` with `header = TRUE`
3. Verify the data was read correctly by checking the dimensions

In [None]:
# YOUR CODE HERE

### Task 11.2: Export Processed Data (5 points)

After creating a subset of the data with only `Price`, `Age_08_04`, `KM`, and `HP` columns:
1. Use `write.csv()` to export to a file called "toyota_subset.csv"
2. Set `row.names = FALSE` to avoid adding row numbers
3. Print a confirmation message

In [None]:
# YOUR CODE HERE

---

## Bonus: Complete ML Pipeline (15 points)

### Task B.1: End-to-End Analysis (15 points)

Using the ToyotaCorolla dataset, complete the following pipeline in a single code block:

1. Load the data and select only these columns: `Price`, `Age_08_04`, `KM`, `HP`, `cc`, `Doors`, `Weight`, `Fuel_Type`
2. Handle any missing values by removing rows with NA
3. Convert `Fuel_Type` to dummy variables
4. Create a stratified 70/30 train/test split (use `set.seed(42)`)
5. Build a linear regression model to predict `Price`
6. Calculate and compare RMSE on training vs holdout sets
7. Print a brief conclusion about model performance

In [None]:
# YOUR CODE HERE

---

## Submission Checklist

Before submitting, verify:

- [ ] All cells run without errors
- [ ] All tasks have been attempted
- [ ] Code includes comments explaining your approach
- [ ] Interpretation questions are answered in comments or markdown
- [ ] Output is visible (not suppressed)

**Good luck!**