# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing.  Your goal is to understand what factors make a car more or less expensive.  As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

### Task

The task of identifying key drivers of used car prices can be reframed as a supervised regression problem. Specifically, the goal is to develop a predictive model that estimates the price of a used car from a set of independent variables such as manufacturer, year, fuel type, odometer reading, and transmission type. The objective is to determine the statistical significance and magnitude of each feature's influence on the dependent variable (price), thereby identifying the key factors that drive used car pricing.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

1. Initial Data Exploration

Load the Dataset: Start by loading the dataset into a suitable environment (e.g., Python with pandas) for analysis.
Inspect the Structure: Use functions like head(), tail(), and sample() to view a few rows of the dataset and understand its structure.

Check Data Types: Use info() to check the data types of each column and ensure they are appropriate for the data they contain (e.g., numeric types for prices, categorical types for manufacturers).

Summary Statistics: Generate summary statistics using describe() for numerical data and value_counts() for categorical data to understand the distribution and central tendencies of the features.
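
The inspection calls named above can be sketched as follows. This is a minimal illustration on a hypothetical five-row sample (the `cars` frame and its values are invented for the example and are not taken from `data/vehicles.csv`); the same calls would apply to the full dataset.

```python
import pandas as pd

# Hypothetical toy sample standing in for the real vehicles data
cars = pd.DataFrame({
    "price": [15000, 7200, 31000, 0, 18500],
    "year": [2015.0, 2008.0, 2020.0, 2012.0, None],
    "manufacturer": ["ford", "toyota", "bmw", "ford", None],
    "odometer": [60000.0, 142000.0, 21000.0, 99000.0, 75000.0],
})

print(cars.head())   # first rows, to see the overall structure
cars.info()          # column dtypes and non-null counts

# Summary statistics for numeric columns (count, mean, std, quartiles)
numeric_summary = cars.describe()

# Frequency of each category, keeping missing values visible
manufacturer_counts = cars["manufacturer"].value_counts(dropna=False)

print(numeric_summary)
print(manufacturer_counts)
```

A zero price or a missing year, as in this toy sample, is the kind of quality issue these first calls tend to surface.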

2. Missing Data Analysis

Identify Missing Values: Calculate the percentage of missing values in each column to identify any features with significant data gaps.

Assess the Impact of Missing Data: Determine how missing data might impact the analysis, such as whether important features are affected or if certain rows should be excluded.

Handling Missing Data: Decide on a strategy for handling missing data, such as imputation, deletion, or leaving as is depending on the context.
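
A short sketch of the missing-data steps above, again on a hypothetical toy frame. The 50% column threshold mirrors the one used later in this notebook; the median imputation is only one possible strategy and is shown here as an assumption, not as the approach the notebook takes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in several columns
cars = pd.DataFrame({
    "price": [15000, 7200, 31000, 9900],
    "condition": ["good", None, "excellent", None],
    "size": [None, None, None, "full-size"],
    "odometer": [60000.0, np.nan, 21000.0, 99000.0],
})

# Percentage of missing values per column
missing_pct = cars.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))

# Option 1: drop columns that are mostly empty (more than 50% missing)
sparse_cols = missing_pct[missing_pct > 50].index
reduced = cars.drop(columns=sparse_cols)

# Option 2: impute a numeric column with its median
imputed = reduced.assign(
    odometer=reduced["odometer"].fillna(reduced["odometer"].median())
)

# Option 3: keep only rows with no missing values at all
complete_rows = reduced.dropna()
```

Comparing `len(complete_rows)` with `len(cars)` shows how much data a row-wise deletion would cost, which is worth checking before committing to it.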

3. Outlier Detection
Visualize Distributions: Use histograms, boxplots, and scatter plots to identify any outliers or anomalies in the data (e.g., unusually high prices, unrealistic odometer readings).

Statistical Methods: Apply statistical methods such as z-scores or the IQR (Interquartile Range) method to detect and potentially filter out outliers.
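
The IQR method mentioned above can be sketched like this. The prices are made up for illustration, and the 1.5 multiplier is the conventional choice rather than something the dataset dictates.

```python
import pandas as pd

# Hypothetical prices with one implausible listing at the end
prices = pd.Series(
    [4500, 7200, 9900, 12500, 15000, 18500, 21000, 1234567], name="price"
)

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences, then filter them out
is_outlier = (prices < lower) | (prices > upper)
filtered = prices[~is_outlier]

print(f"fences: [{lower}, {upper}]")
print(prices[is_outlier])
```

On heavily skewed data such as listing prices, it can be worth inspecting the flagged rows before removing them, since some high values may be legitimate.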

4. Data Consistency and Integrity
Check for Duplicates: Identify and handle any duplicate rows in the dataset that could skew the analysis.

Consistency of Categorical Variables: Ensure consistency in categorical variables (e.g., manufacturer names should be standardized—"BMW" vs "Bmw").

Range Checks: Validate that numerical values fall within expected ranges (e.g., car years should be reasonable, odometer readings should not be negative).
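
The three consistency checks above might look like the following sketch. The toy rows and the accepted year range (1900 to 2024) are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical rows with inconsistent casing, duplicates, and bad ranges
cars = pd.DataFrame({
    "manufacturer": ["BMW", "Bmw ", "ford", "ford", "toyota"],
    "year": [2018, 2018, 2012, 2012, 1890],
    "odometer": [30000, 30000, 88000, 88000, -5],
    "price": [27000, 27000, 9500, 9500, 4000],
})

# Standardize categorical labels so "BMW" and "Bmw " are the same category
cars["manufacturer"] = cars["manufacturer"].str.strip().str.lower()

# Remove exact duplicate rows (done after standardizing the labels)
deduped = cars.drop_duplicates()

# Keep rows whose year and odometer fall within plausible ranges
in_range = deduped["year"].between(1900, 2024) & (deduped["odometer"] >= 0)
valid = deduped[in_range]

print(valid)
```

Standardizing labels before de-duplicating matters here: the two BMW rows only become identical once the casing and whitespace are normalized.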

5. Exploratory Data Analysis (EDA)
Correlations: Calculate and visualize correlations between numeric features (e.g., price and year) to identify any strong relationships.

Feature Distributions: Examine the distribution of key features (e.g., price, odometer, year) to understand their spread and central tendency.

Cross-tabulations: Use cross-tabulations or group-by operations to explore relationships between categorical variables and the target variable (price).
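
A minimal sketch of the correlation and group-by steps above, using invented values. The median is used for the grouped summary as an assumption, because it is less sensitive to extreme prices than the mean.

```python
import pandas as pd

# Hypothetical listings for illustration only
cars = pd.DataFrame({
    "price": [30000, 24000, 15000, 9000, 5000, 12000],
    "year": [2021, 2019, 2015, 2010, 2005, 2013],
    "odometer": [15000, 40000, 90000, 140000, 200000, 110000],
    "fuel": ["gas", "diesel", "gas", "gas", "diesel", "gas"],
})

# Pairwise Pearson correlations between the numeric columns
corr = cars[["price", "year", "odometer"]].corr()
print(corr["price"])

# Median price per category of a categorical feature
median_by_fuel = (
    cars.groupby("fuel")["price"].median().sort_values(ascending=False)
)
print(median_by_fuel)
```

With seaborn available, `sns.heatmap(corr, annot=True)` is one common way to visualize the correlation matrix.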

6. Feature Understanding
Identify Key Variables: Based on initial exploration, identify which features are likely to be important for predicting car prices.

Assess Feature Redundancy: Check for redundant or highly correlated features that may not add value to the analysis.

7. Business Understanding Alignment
Link Data Insights to Business Goals: Relate the patterns and insights found in the data back to the original business question of identifying key drivers for car prices.

Document Findings: Maintain thorough documentation of findings, assumptions, and decisions made during data exploration to ensure a clear connection between data insights and business objectives.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.compose import make_column_transformer
from sklearn import set_config

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 
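
One of the transformations mentioned above, taking a logarithm of the target, can be sketched as follows. This is an illustration on synthetic data, not a step the notebook performs: the exponential price decay, the noise level, and the variable names are all assumptions. `TransformedTargetRegressor` fits the model on `log1p(price)` and converts predictions back to the original price scale with `expm1`.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: price decays roughly exponentially with age (an assumption)
car_age = rng.uniform(0, 20, size=200)
price = 30000 * np.exp(-0.12 * car_age) * rng.lognormal(0, 0.1, size=200)
X = car_age.reshape(-1, 1)

# Fit a linear model to log1p(price); predictions are mapped back via expm1
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X, price)

pred = log_model.predict(X)
print(pred[:5])
```

A log-transformed target keeps predictions positive and reduces the influence of very expensive listings, which can help when prices are right-skewed.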

In [None]:
df = pd.read_csv('data/vehicles.csv')
df.head()

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer


# Step 1: Measure missing values as the percentage of rows missing in each column
missing_values = df.isnull().mean() * 100

# Step 2: Drop columns with more than 50% missing values,
# then drop the remaining rows that still contain missing values
columns_to_drop = missing_values[missing_values > 50].index
df_cleaned = df.drop(columns=columns_to_drop)

df_cleaned = df_cleaned.dropna()

df_cleaned['year'] = df_cleaned['year'].astype(int)

summary_stats = df_cleaned.describe()

grouped_by_manufacturer = df_cleaned.groupby('manufacturer')['price'].mean().sort_values(ascending=False)
grouped_by_year = df_cleaned.groupby('year')['price'].mean().sort_values(ascending=False)
grouped_by_condition = df_cleaned.groupby('condition')['price'].mean().sort_values(ascending=False)
grouped_by_fuel = df_cleaned.groupby('fuel')['price'].mean().sort_values(ascending=False)
grouped_by_transmission = df_cleaned.groupby('transmission')['price'].mean().sort_values(ascending=False)


# Save the findings to summarize
findings = {
    "missing_values_percentage": missing_values,
    "summary_statistics": summary_stats,
    "grouped_by_manufacturer": grouped_by_manufacturer,
    "grouped_by_year": grouped_by_year,
    "grouped_by_condition": grouped_by_condition,
    "grouped_by_fuel": grouped_by_fuel,
    "grouped_by_transmission": grouped_by_transmission
}

df_cleaned.head()
findings



In [None]:
# Step 3: Feature engineering and selection
# Create car_age from the model year (2024 is used as the reference year),
# then define the features and target used for modeling
df_cleaned['car_age'] = 2024 - df_cleaned['year']


features = ['manufacturer', 'car_age', 'condition', 'fuel', 'odometer', 'transmission']
target = 'price'

# Step 4: Data Transformation
# We will use OneHotEncoder for categorical variables and StandardScaler for numeric variables

# Define the preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['car_age', 'odometer']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['manufacturer', 'condition', 'fuel', 'transmission'])
    ])

# Applying the preprocessing pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

# Transforming the features
X = df_cleaned[features]
y = df_cleaned[target]

# Fit and transform the data
X_prepared = pipeline.fit_transform(X)

# Final dataset ready for modeling
final_dataset = pd.DataFrame(X_prepared.toarray())
final_dataset['price'] = y.reset_index(drop=True)

final_dataset.head()

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.
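
As a sketch of exploring parameters with cross-validation, the example below tunes the `alpha` of a Ridge model with `GridSearchCV`. It runs on synthetic data (the `toy` frame, its coefficients, and the alpha grid are assumptions for illustration). The preprocessing sits inside the `Pipeline`, so the scaler and encoder are refit on each training fold and no information from the validation fold leaks into them.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 300

# Synthetic listings; the price formula is invented for the example
toy = pd.DataFrame({
    "car_age": rng.integers(0, 25, size=n),
    "odometer": rng.uniform(5_000, 250_000, size=n),
    "fuel": rng.choice(["gas", "diesel", "electric"], size=n),
})
toy["price"] = (
    35_000
    - 900 * toy["car_age"]
    - 0.05 * toy["odometer"]
    + toy["fuel"].map({"gas": 0, "diesel": 2_000, "electric": 5_000})
    + rng.normal(0, 1_500, size=n)
)

X_train, X_test, y_train, y_test = train_test_split(
    toy.drop(columns="price"), toy["price"], test_size=0.2, random_state=42
)

# Preprocessing lives inside the pipeline so it is fit per CV fold
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["car_age", "odometer"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
])
model = Pipeline([("preprocess", preprocess), ("ridge", Ridge())])

# 5-fold cross-validated search over the regularization strength
search = GridSearchCV(
    model,
    param_grid={"ridge__alpha": [0.1, 1.0, 10.0, 100.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

cv_rmse = -search.best_score_
test_rmse = float(np.sqrt(np.mean((y_test - search.predict(X_test)) ** 2)))
print(search.best_params_, cv_rmse, test_rmse)
```

The same pattern extends to other estimators by swapping the final pipeline step and the parameter grid.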

In [24]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(final_dataset.drop('price', axis=1), final_dataset['price'], test_size=0.2, random_state=42)

# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
    'Random Forest Regressor': RandomForestRegressor(n_estimators=100, random_state=42)
}

# Train and cross-validate models
results = {}
for name, model in models.items():
    # Perform cross-validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=20, scoring='neg_mean_squared_error')
    cv_rmse = np.sqrt(-cv_scores)
    
    # Train the model on the full training set
    model.fit(X_train, y_train)
    
    # Predict on the test set
    y_pred = model.predict(X_test)
    
    # Calculate test RMSE
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    
    # Store results
    results[name] = {
        'Cross-Validation RMSE (Mean)': np.mean(cv_rmse),
        'Test RMSE': test_rmse
    }


results



{'Linear Regression': {'Cross-Validation RMSE (Mean)': 16333999300941.49,
  'Test RMSE': 11110.724776125791},
 'Ridge Regression': {'Cross-Validation RMSE (Mean)': 10512.271317352011,
  'Test RMSE': 10460.836356998867},
 'Lasso Regression': {'Cross-Validation RMSE (Mean)': 10511.13084146781,
  'Test RMSE': 10462.303856084822},
 'Random Forest Regressor': {'Cross-Validation RMSE (Mean)': 6421.105211895923,
  'Test RMSE': 6080.058318591932}}

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

### Reflection on Model Quality and Business Objective Alignment

#### Model Quality Considerations

A high-quality model for predicting used car prices should have the following characteristics:

Accuracy: The model should predict prices with minimal error, as measured by metrics such as RMSE (Root Mean Squared Error). This reflects the model's ability to closely approximate the true prices.

Generalization: The model should perform well on unseen data (i.e., the test set), indicating that it has not overfitted to the training data. This is typically assessed through cross-validation and test set performance.

Interpretability: Given the business context, it’s crucial that the model not only provides accurate predictions but also offers insights into the factors that drive those predictions. This allows for actionable recommendations to the client.
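
For reference, the RMSE metric named under Accuracy is the square root of the mean squared difference between actual and predicted prices. A minimal sketch with made-up numbers (the `rmse` helper and the three prices are hypothetical):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

actual = [10000, 15000, 20000]
predicted = [11000, 14000, 23000]

# Errors are -1000, 1000, -3000; the largest miss dominates the result
error = rmse(actual, predicted)
print(round(error, 2))
```

Because errors are squared before averaging, RMSE is expressed in the same units as price but weights large misses more heavily than small ones.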

### Review of Business Objective

The primary business objective was to identify key factors that influence used car prices, with the goal of providing actionable insights to a used car dealership. The following insights were sought:

1. What features make a car more expensive?
2. How can the dealership optimize its inventory based on these insights?

#### Modeling Insights and Findings
Feature Importance: From the models built, certain features consistently showed strong influence on car prices. For instance, the manufacturer (e.g., high-end brands like Ferrari or Tesla), car age, and condition were key drivers of price. This aligns with common business intuition but quantifies the impact of each factor.

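Performance Metrics: The Random Forest Regressor achieved the lowest error, with a mean cross-validation RMSE of about 6,421 and a test RMSE of about 6,080. Ridge and Lasso performed almost identically to each other, with cross-validation RMSE near 10,510 and test RMSE near 10,460. Plain Linear Regression reached a test RMSE of about 11,111, but its mean cross-validation RMSE was enormous (about 1.6e13), which suggests numerical instability on at least one fold, plausibly caused by collinear or rare one-hot encoded categories. The regularized linear models remain useful for interpretability, which is crucial for understanding feature importance, while the Random Forest is the stronger choice for predictive accuracy.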

Data Quality and Preparation: The preprocessing steps, including handling missing data, feature engineering, and data transformation, were crucial in preparing a dataset that was ready for modeling. This ensured that the models were trained on a clean and relevant dataset, contributing to the validity of the insights generated.
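
The feature-importance findings above could be quantified with permutation importance, which measures how much a model's held-out score drops when one input column is shuffled. The sketch below is an illustration on synthetic data, not the analysis behind the findings: the `toy` frame and its price formula are assumptions, constructed so that `car_age` matters most and `transmission` has no effect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 400

# Synthetic listings: price depends on age and mileage, not transmission
toy = pd.DataFrame({
    "car_age": rng.integers(0, 25, size=n),
    "odometer": rng.uniform(5_000, 250_000, size=n),
    "transmission": rng.choice(["automatic", "manual"], size=n),
})
toy["price"] = (
    30_000
    - 1_000 * toy["car_age"]
    - 0.02 * toy["odometer"]
    + rng.normal(0, 1_000, size=n)
)

X_train, X_test, y_train, y_test = train_test_split(
    toy.drop(columns="price"), toy["price"], test_size=0.25, random_state=0
)

# One-hot encode the categorical column, pass numeric columns through
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["transmission"])],
    remainder="passthrough",
)
forest = Pipeline([
    ("preprocess", preprocess),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
])
forest.fit(X_train, y_train)

# Shuffle each original column on held-out data and record the score drop
result = permutation_importance(
    forest, X_test, y_test, n_repeats=5, random_state=0
)
importances = pd.Series(
    result.importances_mean, index=X_test.columns
).sort_values(ascending=False)
print(importances)
```

Because the pipeline receives the raw columns, the importances are reported per original feature (for example, one value for `transmission`) rather than per one-hot encoded column, which is easier to communicate to a dealership.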

#### Next Steps and Recommendations

Model Refinement: Given the preliminary findings, further refinement of the models, particularly tuning of hyperparameters and perhaps the exploration of ensemble methods, could enhance predictive accuracy.

Feature Exploration: Additional features could be explored or engineered, such as regional economic factors or trends in the automotive market, to further enhance the model.

Revisitation of Phases: If the client requires even deeper insights or if initial findings suggest additional areas of interest (e.g., the impact of market fluctuations on used car prices), revisiting earlier phases with additional data sources or more granular analysis could be beneficial.

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

### Report: Key Drivers of Used Car Prices
#### Introduction
This report aims to provide insights into the factors that most significantly influence the pricing of used cars. By analyzing a dataset containing information on 426,880 used cars, we have identified key variables that determine car prices and developed models to predict these prices. These findings will help your dealership optimize inventory and pricing strategies to align with market demand.

#### Data Overview
The dataset includes various features such as the manufacturer, year, condition, fuel type, odometer reading, and transmission type of used cars. After cleaning and processing the data, we built several regression models to understand and predict used car prices. The final dataset included approximately 61,000 entries with relevant features.

#### Key Findings
**Manufacturer as a Primary Price Driver:**

High-end brands such as Ferrari, Aston Martin, and Tesla consistently command higher prices. This suggests that stocking premium brands can significantly impact revenue. More common brands like Ford, Chevrolet, and Toyota have lower average prices but may offer higher sales volume, depending on demand.

**Impact of Vehicle Age:**

As expected, newer vehicles tend to have higher prices, with a gradual decline as the car ages. Classic cars or those with historical value can also command high prices, though these are exceptions rather than the rule.

**Condition of the Vehicle:**

Vehicles in "new" or "like new" condition are valued significantly higher than those in "fair" or "salvage" condition. This underscores the importance of maintaining a high-quality inventory. The dealership might consider investing in reconditioning vehicles to improve their condition, thereby increasing their market value.

**Fuel Type Preferences:**

Diesel and electric vehicles are priced higher, reflecting consumer preferences for efficiency and modern technology. Traditional gas-powered vehicles remain moderately priced and continue to dominate the market, but a shift towards electric vehicles is evident.

**Transmission Type:**

Automatic transmissions are generally preferred and command higher prices compared to manual transmissions. Specialized transmissions, categorized as "other," also show higher price points, potentially indicating high-performance or luxury features.

#### Modeling Approach
We employed multiple regression models, including Linear Regression, Ridge Regression, Lasso Regression, and Random Forest Regressor, to predict used car prices. Each model was evaluated by its prediction error (RMSE), both under cross-validation and on a held-out test set.

Linear Regression and Ridge Regression are the most interpretable of these models, which helps in understanding which factors are most influential. Their test RMSE was roughly 10,500 to 11,100.

Random Forest Regressor was the most accurate, with a test RMSE of about 6,080. This makes it the stronger candidate for price estimation, though it requires more computational resources.

#### Recommendations for Inventory Management
**Focus on High-Value Brands:**

Consider stocking a selection of high-end vehicles, as they offer significant profit margins. However, balance this with the demand and turnover rate in your specific market.

**Prioritize Newer and Well-Maintained Vehicles:**

Vehicles in newer model years and those in excellent condition should be prioritized. Investing in maintenance and reconditioning can also enhance the value of your inventory.

**Stay Ahead of Market Trends:**

With the rise of electric vehicles, consider expanding your inventory to include more electric and hybrid options. This will position your dealership as forward-thinking and aligned with consumer trends.

**Leverage Model Insights for Pricing:**

Use the predictive models to guide pricing strategies, ensuring that prices reflect the true value of the vehicles based on their features. This can help in setting competitive prices that maximize both sales and profitability.

#### Conclusion
The insights gained from this analysis offer actionable recommendations that can help fine-tune your dealership’s inventory and pricing strategies. By focusing on the key drivers of used car prices, such as brand, vehicle age, condition, and fuel type, your dealership can better meet consumer demand and optimize revenue. Further refinement of the models and continuous monitoring of market trends will be crucial in maintaining a competitive edge in the used car market.