
---

# Data Analysis: Prices vs. Automobile Pre-Owned Condition

## Objective
This project analyzes how the condition of a pre-owned automobile relates to its price, and how that relationship differs across car brands. The goal is to give the automotive industry actionable, data-driven insights.

## Scope
- Analysis uses a [publicly available dataset](https://www.kaggle.com/datasets/ankits29/used-car-price-data) of used car listings from diverse car brands and manufacturers.
- Uses R for statistical analysis and data visualization.

## Deliverables
- A detailed Jupyter Notebook containing the analysis and findings.
- Graphs and charts summarizing key discoveries and recommendations for stakeholders.
- Linear regression models (fitted with R's `lm`), one per car model, that support the findings.

---


## Data Analysis Project Roadmap

### ASK PHASE
- **Objective:** Understand how the relationship between a car's age and its used price differs across car brands.
- **End Goal:** Create predictive models for future price trends based on car brand.

### PREPARE PHASE
- **Data Source:** [Kaggle dataset](https://www.kaggle.com/datasets/ankits29/used-car-price-data).
- **Data Description:** Includes variables like model, selling price, kilometers driven, year of manufacture, owner, fuel type, transmission, insurance, and car condition.

### PROCESS PHASE
*Objective:* Prepare the dataset for analysis by cleaning, formatting, and transforming variables.

1. **Import Data:** Load the raw data from a CSV file.
2. **Group Data by Model:** Group the data by the car model to segment it for individual model analysis.
3. **Variable Formatting:**
   - `Model`: Convert it to a character type for consistency.
   - `Car Condition`: Convert it to a numeric type for numerical analysis.
   - `Year`: Convert it to a numeric type for numerical analysis.
   - `Owner`: Convert it to a character type, then map the ownership label to a numeric code (0 for a first owner through 4 for a fifth owner).


In [None]:
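library(dplyr)

# Import data from CSV file.
# check.names = FALSE keeps headers such as "Car Condition" exactly as written
# (this assumes the CSV headers contain spaces, as the backticked names below imply);
# the read.csv default would rename them to Car.Condition.
data <- read.csv("./data-cars/car_data.csv", check.names = FALSE)

# Format variables for analysis and group data by car model
data <- data %>%
	# Convert Model before grouping: dplyr cannot modify a grouping column inside mutate()
	mutate(Model = as.character(Model)) %>%
	group_by(Model) %>%
	mutate(
		`Car Condition` = as.numeric(`Car Condition`),
		Year = as.numeric(Year),
		# Map ownership labels to numeric codes; labels matching no prefix become NA
		Owner = as.character(Owner),
		Owner = case_when(
			startsWith(Owner, "First") ~ 0,
			startsWith(Owner, "Second") ~ 1,
			startsWith(Owner, "Third") ~ 2,
			startsWith(Owner, "Fourth") ~ 3,
			startsWith(Owner, "Fifth") ~ 4
		)
	)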
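
The `Owner` mapping can be checked in isolation before it is applied to the full dataset. The sketch below uses base R only and a toy vector of labels; the exact label spellings, including `"Test Drive Car"`, are assumptions made for illustration and may not match the dataset.

```r
# Toy ownership labels; the spellings are assumed, not taken from the dataset.
owner_raw <- c("First Owner", "Second Owner", "Third Owner",
               "Fourth & Above Owner", "Test Drive Car")

# Ordered prefixes: position minus one gives the numeric code (0 to 4).
prefixes <- c("First", "Second", "Third", "Fourth", "Fifth")

# For each label, take the first prefix it starts with; NA when none match.
owner_code <- vapply(owner_raw, function(label) {
  hit <- which(startsWith(label, prefixes))
  if (length(hit) == 0) NA_real_ else hit[1] - 1
}, numeric(1), USE.NAMES = FALSE)

print(owner_code)

# Labels that map to NA would drop out of any regression that uses Owner.
print(owner_raw[is.na(owner_code)])
```

Listing the unmatched labels this way makes it easy to decide whether they should get their own code or be excluded on purpose.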

### ANALYSIS PHASE
*Objective:* Analyze each car model's data to calculate usedness and evaluate linear regression models.

1. **Function for Summarizing Linear Models:** Create a function `summarize_lm` to extract key information from linear regression models.
2. **Calculate Usedness:** Compute a `usedness` score for each listing as `Kilometers Driven * (current_year - Year) * (Owner + 1) / Car Condition`, so that higher mileage, greater age, and more previous owners raise the score, while a higher `Car Condition` value lowers it.
3. **Prepare Results Data Frame:** Create an empty data frame `results_df` to store analysis results.
4. **Model Analysis Loop:**
   - For each unique car model:
     - Subset the data for the specific model.
     - Fit a linear regression model between `Selling Price` and `usedness`.
     - Summarize the linear regression model using `summarize_lm`.
     - Calculate various metrics like MAE, MSE, RMSE, R-squared, and Adjusted R-squared.
     - Add the metrics and the coefficients of the linear model to `results_df`.
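
To make the `usedness` score concrete, here is a small worked example in base R. The three listings are invented for illustration, and the column names `km`, `year`, `owner`, and `condition` are shorthand stand-ins for the dataset's columns.

```r
current_year <- 2023  # same reference year as the analysis code

# Hypothetical listings, invented for illustration only.
cars_toy <- data.frame(
  km        = c(50000, 50000, 120000),
  year      = c(2018, 2018, 2013),
  owner     = c(0, 1, 2),   # 0 = first owner
  condition = c(4, 2, 3)    # numeric condition rating
)

# usedness = kilometers * age in years * (owner code + 1) / condition
cars_toy$usedness <- with(
  cars_toy,
  km * (current_year - year) * (owner + 1) / condition
)

print(cars_toy)
# usedness column: 62500, 250000, 1200000
```

Two edge cases follow directly from the formula: a car manufactured in `current_year` gets a usedness of 0 regardless of mileage, and a `condition` of 0 produces an infinite score.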

In [None]:
library(stats)

# Function to summarize lm object
summarize_lm <- function(lm_object) {
	summary_info <- summary(lm_object)
	lm_coef <- coef(summary_info)	# Coefficients
	r_squared <- summary_info$r.squared	# R-squared value
	adj_r_squared <- summary_info$adj.r.squared	# Adjusted R-squared value
	return(list(lm_coef = lm_coef, r_squared = r_squared, adj_r_squared = adj_r_squared))
}

# Calculate usedness
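current_year <- 2023
refactored_data <- data %>%
	ungroup() %>%	# usedness is computed row by row; ungrouping also stops select() from re-adding Model
	mutate(usedness = (
		`Kilometers Driven` * (current_year - Year) * (Owner + 1) * (1 / `Car Condition`)
	))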

# Create an empty data frame to store the results
results_df <- data.frame(
	Model = character(0),
	MAE = numeric(0),
	MSE = numeric(0),
	RMSE = numeric(0),
	R_squared = numeric(0),
	Adj_R_squared = numeric(0),
	Coefficients = character(0)	# Coefficient table flattened to a string (estimates, std. errors, t values, p values)
)

# Analysis for each car model
unique_models <- unique(data$Model)

for (model in unique_models) {
	model_data <- refactored_data %>%
		filter(Model == model) %>%
		select(`Selling Price`, usedness)

	# Create linear regression model
	lm_model <- lm(`Selling Price` ~ usedness, data = model_data)

	# Summarize the lm object
	lm_summary <- summarize_lm(lm_model)

	# Extract coefficients and R-squared value
	lm_coef <- lm_summary$lm_coef
	r_squared <- lm_summary$r_squared
	adj_r_squared <- lm_summary$adj_r_squared

	# Calculate error metrics on the fitted data
	# (na.rm = TRUE skips rows that lm() dropped because of missing values)
	actual_values <- model_data$`Selling Price`
	predictions <- predict(lm_model, newdata = model_data)
	errors <- predictions - actual_values
	mae <- mean(abs(errors), na.rm = TRUE)
	mse <- mean(errors^2, na.rm = TRUE)
	rmse <- sqrt(mse)

	# Add metrics and lm summary to the results_df data frame
	results_df <- results_df %>%
		add_row(
			Model = model,
			MAE = mae,
			MSE = mse,
			RMSE = rmse,
			R_squared = r_squared,
			Adj_R_squared = adj_r_squared,
			Coefficients = toString(lm_coef)
		)
}
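
The metrics computed in the loop can be verified on a tiny dataset. The sketch below uses base R and made-up numbers; it computes MAE, MSE, and RMSE by hand, confirms that `1 - SS_residual / SS_total` matches the R-squared reported by `summary()`, and pulls out the p-value of the slope.

```r
# Made-up data: price falls roughly linearly as usedness rises.
toy <- data.frame(
  usedness = c(1, 2, 3, 4, 5),
  price    = c(9.8, 8.1, 6.2, 3.9, 2.0)
)

fit  <- lm(price ~ usedness, data = toy)
pred <- predict(fit, newdata = toy)
err  <- toy$price - pred

# Error metrics, all in the units of the response (MSE in squared units).
mae  <- mean(abs(err))
mse  <- mean(err^2)
rmse <- sqrt(mse)

# R-squared from its definition, for a model with an intercept.
ss_total    <- sum((toy$price - mean(toy$price))^2)
ss_residual <- sum(err^2)
r_squared_manual <- 1 - ss_residual / ss_total

# p-value of the usedness slope from the coefficient table.
slope_p <- coef(summary(fit))["usedness", "Pr(>|t|)"]

print(c(MAE = mae, MSE = mse, RMSE = rmse))
print(all.equal(r_squared_manual, summary(fit)$r.squared))
print(slope_p)
```

Because these metrics are computed on the same rows the model was fitted to, they describe in-sample fit rather than predictive accuracy on new listings.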


### SHARE PHASE
*Objective:* Create scatter plots and linear regression plots for each car model.

1. **Load Required Libraries:** Import the necessary libraries for data visualization.
2. **Function to Create Plots:** Create a function `create_plots` that generates scatter plots and linear regression plots for a specific car model.
3. **Plot Generation Loop:**
   - For each unique car model:
     - Subset the data for the specific model.
     - Fit a linear regression model between `Selling Price` and `usedness`.
     - Create a scatter plot of `Selling Price` against `usedness`.
     - Create a linear regression plot with a regression line.
     - Save both plots as separate image files.
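
Since each plot is saved under a file name built from the model name, names containing spaces or slashes can produce awkward or invalid paths. A minimal sketch of one way to handle this is shown below; `safe_filename` is a hypothetical helper, not part of the original analysis, and the model name in the example is invented.

```r
# Hypothetical helper: replace every run of characters other than
# letters, digits, underscore, or hyphen with a single underscore.
safe_filename <- function(model_name, suffix) {
  stem <- gsub("[^A-Za-z0-9_-]+", "_", model_name)
  paste0(stem, suffix)
}

print(safe_filename("Maruti Swift/Dzire", "_scatter.png"))
# [1] "Maruti_Swift_Dzire_scatter.png"
```

The result could be passed to `ggsave` in place of the raw model name.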

In [None]:
library(ggplot2)

# Function to create scatter and line plots for a car model
create_plots <- function(model_name) {
	model_data <- refactored_data %>%
		filter(Model == model_name) %>%
		select(`Selling Price`, usedness)

	# No explicit lm() fit is needed here: geom_smooth(method = "lm") below
	# fits the same Selling Price ~ usedness model when drawing the line

	# Create scatter plot
	scatter_plot <- ggplot(model_data, aes(x = usedness, y = `Selling Price`)) +
		geom_point() +
		labs(title = paste("Scatter Plot for", model_name),
				x = "Usedness",
				y = "Selling Price") +
		theme_minimal()

	# Create line plot for linear regression model
	line_plot <- ggplot(model_data, aes(x = usedness, y = `Selling Price`)) +
		geom_point() +
		geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "red") +
		labs(title = paste("Linear Regression Plot for", model_name),
				x = "Usedness",
				y = "Selling Price") +
		theme_minimal()

	return(list(scatter_plot = scatter_plot, line_plot = line_plot))
}

# Create and save plots for each unique car model
plots_list <- list()

for (model in unique_models) {
	plots <- create_plots(model)
	plots_list[[model]] <- plots
}

# Save plots to separate files (model names are used verbatim in the file names)
for (model in unique_models) {
	ggsave(paste0(model, "_scatter.png"), plot = plots_list[[model]]$scatter_plot)
	ggsave(paste0(model, "_line.png"), plot = plots_list[[model]]$line_plot)
}

# Preview the results_df
head(results_df)

### ACT PHASE

*Objective:* Summarize the findings from the analysis and provide recommendations or insights.

1. **Analyze Results:**
   - Examine the analysis results stored in `results_df` to understand the performance of linear regression models for each car model.
   - Identify models with high and low R-squared values, MAE, MSE, and RMSE.
   - Observe coefficients to understand the relationship between usedness and selling price.

2. **Overall Analysis:**
   - Assess the overall performance of linear regression models on the dataset.
   - Identify patterns or trends in the data, such as whether usedness is a significant predictor of selling price.
   - Consider the adjusted R-squared values to account for model complexity.

3. **Recommendations:**
   - Based on the analysis, provide recommendations for car sellers or buyers:
     - For sellers: Highlight models with a strong correlation between usedness and selling price (expected to be negative, since a higher usedness score means a more worn car), and suggest pricing strategies based on usedness.
     - For buyers: Identify models where usedness has a less pronounced impact on selling price, potentially offering better value for older vehicles.

4. **Future Work:**
   - Suggest possible areas for further analysis or improvement:
     - Explore more advanced regression techniques beyond linear regression, such as polynomial regression or machine learning algorithms.
     - Collect additional data, such as maintenance history or market demand, to improve prediction accuracy (mileage is already captured by `Kilometers Driven`).

5. **Final Thoughts:**
   - Summarize the main takeaways and key findings from the analysis.
   - Emphasize the importance of data-driven decision-making in the automotive market.
   - Encourage stakeholders to leverage the insights gained to make informed choices when buying or selling cars.

6. **Report or Presentation:**
   - Consider creating a formal report or presentation to communicate the analysis results and recommendations to stakeholders, clients, or colleagues.
   - Include clear visuals, such as plots and summary statistics, to support the findings.
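
As a starting point for step 1 above, the results table can be ranked by R-squared to separate well-fitting models from poorly fitting ones. The sketch below uses base R on a toy table with the same column names as `results_df`; the model names, metric values, and the 0.3 cutoff are all assumptions chosen for illustration.

```r
# Toy stand-in for results_df; every value here is invented.
results_toy <- data.frame(
  Model     = c("Model A", "Model B", "Model C"),
  RMSE      = c(52000, 31000, 88000),
  R_squared = c(0.62, 0.81, 0.07),
  stringsAsFactors = FALSE
)

# Sort from best to worst fit.
ranked <- results_toy[order(-results_toy$R_squared), ]

# Flag models where usedness explains little of the price variation.
# The 0.3 threshold is arbitrary and should be tuned to the use case.
weak_fit <- ranked$Model[ranked$R_squared < 0.3]

print(ranked)
print(weak_fit)
```

MAE, MSE, and RMSE are expressed in price units, so they are harder to compare across models with very different price levels than R-squared is.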
