# Expediting Car Evaluations with ML 
###  Data Science and Advanced Analytics
###  Machine Learning 1st Phase Delivery

**Group 18**  
- Diogo Tibério         20250341
- José Montez           20250351
- Henrique Figueiredo   20250433
- Sebastião Jerónimo    20240660


<a id="index"></a>
## Index

1. [Introduction to the Project](#introduction)  
   - 1.1. [Project Overview & Objectives](#project_overview)
   - 1.2. [Dataset Import and Data Exploration](#data_exploration)  
      - 1.2.1. [Import Libraries](#import_libraries)
      - 1.2.2. [Dataset Import and Initial Checks](#dataset_import)
         - 1.2.2.1. [Verify Dataset Integrity](#check_imports)
      - 1.2.3. [Dataset Metadata Description](#metadata_description)
      - 1.2.4. [Descriptive Statistics](#descriptive_statistics)
      - 1.2.5. [Exploratory Data Analysis (EDA)](#eda)
         - 1.2.5.1. [Univariate Analysis](#univariate_analysis)
         - 1.2.5.2. [Multivariate Relationships](#multivariate_analysis)  

2. [Data Preparation](#data_preparation)  
   - 2.1. [Handling Missing Values](#handling_missing_values)  
   - 2.2. [Outlier Detection and Treatment](#outliers)  
   - 2.3. [Categorical Variable Encoding](#categorical_encoding)  
   - 2.4. [Feature Engineering](#feature_engineering)  
   - 2.5. [Data Scaling and Normalization](#data_scaling)  

3. [Feature Selection](#feature_selection)  
   - 3.1. [Feature Selection Strategy](#selection_strategy)  
   - 3.2. [Implementation and Results](#selection_implementation)  
   - 3.3. [Final Feature Set Justification](#feature_justification)  

4. [Model Building and Evaluation](#model_building)  
   - 4.1. [Problem Type Identification](#problem_type)  
   - 4.2. [Algorithm Selection](#algorithm_selection)  
   - 4.3. [Model Assessment Strategy](#assessment_strategy)  
   - 4.4. [Model Training and Prediction](#model_training)  
   - 4.5. [Performance Metrics and Interpretation](#performance_metrics)  


<hr>
<a class="anchor" id="introduction"></a>

# 1. Introduction to the Project

Cars 4 You is an online car resale company that sells cars from multiple brands.

The company wants to replace its manual evaluation process with a predictive model that estimates a car's price from the information the user provides, so the car no longer needs to be taken to a mechanic.

This project aims to build that model.

<hr>
<a class="anchor" id="project_overview"></a>

## 1.1. Project Overview & Objectives

1. Regression Benchmarking: We aim to develop a regression model that accurately predicts car prices (`price`).

2. Model Optimization: While selecting the best model (or set of models), we will explore ways to improve performance.

3. Additional Insights: We also aim to explore the following:

   a. Analyze and discuss the importance of the features across different values of the target variable and how they contribute to the prediction.

   b. Ablation Study: Measure the contribution of each element of the pipeline.

   c. Create an analytics interface that returns a prediction when new input data is provided.

   d. Test whether the best performance is achieved with a general model (trained on data from all brands, models, fuel types, and so on) or with models specific to a brand, model, or fuel type.

   e. Determine whether training a Deep Learning network from scratch is more effective than fine-tuning a pre-trained model.
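As a rough illustration of the comparison in item 3.d, the sketch below scores a "general" predictor against a "group-specific" one on a validation set. It is only a sketch under stated assumptions: median-price baselines stand in for real regression models, the `brand` column name and the helper `compare_general_vs_grouped` are hypothetical, and the toy values are invented.

```python
import pandas as pd


def compare_general_vs_grouped(train, valid, group_col, target="price"):
    """Return validation MAE for a global-median and a per-group-median predictor."""
    # "General model" stand-in: one median price for every car
    global_median = train[target].median()
    general_pred = pd.Series(global_median, index=valid.index)

    # "Group-specific model" stand-in: one median price per group (e.g. per brand);
    # groups unseen in training fall back to the global median
    group_medians = train.groupby(group_col)[target].median()
    grouped_pred = valid[group_col].map(group_medians).fillna(global_median)

    def mae(pred):
        return (valid[target] - pred).abs().mean()

    return {"general": mae(general_pred), "grouped": mae(grouped_pred)}


# Toy data, invented purely for illustration (column names are assumptions)
train = pd.DataFrame({
    "brand": ["A", "A", "B", "B"],
    "price": [10000, 12000, 30000, 34000],
})
valid = pd.DataFrame({
    "brand": ["A", "B", "C"],
    "price": [11000, 32000, 20000],
})

scores = compare_general_vs_grouped(train, valid, group_col="brand")
print(scores)
```

The same structure would apply with fitted models in place of the medians: train once on all data, train once per group, and compare both on the same validation rows.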

<hr>
<a class="anchor" id="data_exploration"></a>

## 1.2. Dataset Import and Data Exploration

1. Import the dataset and explore the data (3 points):
   
   a. Check data contents, provide descriptive statistics and check for inconsistencies in the data.
   
   b. Explore data visually and extract relevant insights. Explain your rationale and findings. Do not forget to analyse multivariate relationships.
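The checks in item a could start from something like the sketch below. It assumes nothing about the real dataset: the helper `summarize_columns` is hypothetical, and the toy frame (including the column names `brand` and `mileage`) is invented to show the kinds of inconsistencies worth looking for.

```python
import pandas as pd


def summarize_columns(df):
    """Per-column overview: dtype, missing values and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(),
    })


# Toy frame with invented values; the real columns come from data/train.csv
toy = pd.DataFrame({
    "brand": ["Ford", "ford", "FORD ", None],
    "mileage": [42000.0, -5.0, 61000.0, 18000.0],
    "price": [9500, 7200, 8100, 15000],
})

summary = summarize_columns(toy)
print(summary)

# Numeric descriptive statistics; a negative minimum mileage would flag an inconsistency
print(toy.describe())

# Normalising case and whitespace exposes categories that differ only in spelling
normalized_brand = toy["brand"].str.strip().str.lower()
print(normalized_brand.value_counts())
```

In the toy frame, the three spellings of the brand count as three distinct values before normalisation and as one afterwards, which is the kind of finding that would feed into the data preparation step.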

<hr>
<a class="anchor" id="import_libraries"></a>

### 1.2.1. Import Libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
from rapidfuzz import process, fuzz
import re
import numpy as np
from typing import Sequence, Mapping, Optional

<hr>
<a class="anchor" id="dataset_import"></a>

### 1.2.2. Dataset Import and Initial Checks

Import the train, test and sample submission data.

In [None]:
# Load the train, test and sample submission CSV files into pandas DataFrames.
# Any loading error is reported and re-raised, so the cells below never run
# against undefined variables.
try:
    train_car_data_original = pd.read_csv('data/train.csv')
    test_car_data_original = pd.read_csv('data/test.csv')
    sample_submission_car_data_original = pd.read_csv('data/sample_submission.csv')
# For when a file or its directory is not found
except FileNotFoundError as f:
    print(f"File not found: {f.filename}")
    raise
# For any other error (e.g. parsing or encoding problems)
except Exception as e:
    print(f"An error occurred: {e}")
    raise

# Work on copies so the originally loaded DataFrames stay unmodified
train_car_data = train_car_data_original.copy()
test_car_data = test_car_data_original.copy()
sample_submission_car_data = sample_submission_car_data_original.copy()


<hr>
<a class="anchor" id="check_imports"></a>

#### 1.2.2.1. Verify Dataset Integrity

Check that the train, test and sample submission datasets were imported correctly.

In [None]:
train_car_data.head()

In [None]:
test_car_data.head()

In [None]:
sample_submission_car_data.head()
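Beyond inspecting the first rows, a few programmatic checks can confirm that the three files fit together. The sketch below is one possible approach, not the project's actual validation: the helper `check_import_consistency` and the toy column names (`car_id`, `mileage`) are hypothetical, and it assumes that the target is `price`, that the test set carries the same features as the train set without the target, and that the sample submission has one row per test row.

```python
import pandas as pd


def check_import_consistency(train, test, submission, target="price"):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for name, df in [("train", train), ("test", test), ("submission", submission)]:
        if df.empty:
            problems.append(f"{name} is empty")
    if target not in train.columns:
        problems.append(f"train is missing the target column '{target}'")
    if target in test.columns:
        problems.append(f"test unexpectedly contains the target column '{target}'")
    # Every feature available at training time should also be available at test time
    missing_in_test = set(train.columns) - {target} - set(test.columns)
    if missing_in_test:
        problems.append(f"test is missing columns: {sorted(missing_in_test)}")
    # Assumption: one predicted row is expected per test row
    if len(submission) != len(test):
        problems.append(f"submission has {len(submission)} rows but test has {len(test)}")
    return problems


# Toy frames with invented values and hypothetical column names
train = pd.DataFrame({"car_id": [1, 2], "mileage": [42000, 61000], "price": [9500, 8100]})
test = pd.DataFrame({"car_id": [3, 4], "mileage": [18000, 73000]})
submission = pd.DataFrame({"car_id": [3, 4], "price": [0, 0]})

ok_problems = check_import_consistency(train, test, submission)
bad_problems = check_import_consistency(
    train, test.drop(columns=["mileage"]), submission.iloc[:1]
)
print(ok_problems)
print(bad_problems)
```

With the real data, the call would be `check_import_consistency(train_car_data, test_car_data, sample_submission_car_data)`, adjusting the checks if the files follow a different layout.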

<hr>
<a class="anchor" id="metadata_description"></a>

### 1.2.3. Dataset Metadata Description

<hr>
<a class="anchor" id="descriptive_statistics"></a>

### 1.2.4. Descriptive Statistics

<hr>
<a class="anchor" id="eda"></a>

### 1.2.5. Exploratory Data Analysis (EDA)

<hr>
<a class="anchor" id="univariate_analysis"></a>

#### 1.2.5.1. Univariate Analysis

<hr>
<a class="anchor" id="multivariate_analysis"></a>

#### 1.2.5.2. Multivariate Relationships

<hr>
<a class="anchor" id="data_preparation"></a>

# 2. Data Preparation

<hr>
<a class="anchor" id="handling_missing_values"></a>

## 2.1. Handling Missing Values

<hr>
<a class="anchor" id="outliers"></a>

## 2.2. Outlier Detection and Treatment

<hr>
<a class="anchor" id="categorical_encoding"></a>

## 2.3. Categorical Variable Encoding

<hr>
<a class="anchor" id="feature_engineering"></a>

## 2.4. Feature Engineering

<hr>
<a class="anchor" id="data_scaling"></a>

## 2.5. Data Scaling and Normalization

<hr>
<a class="anchor" id="feature_selection"></a>

# 3. Feature Selection

<hr>
<a class="anchor" id="selection_strategy"></a>

## 3.1. Feature Selection Strategy

<hr>
<a class="anchor" id="selection_implementation"></a>

## 3.2. Implementation and Results

<hr>
<a class="anchor" id="feature_justification"></a>

## 3.3. Final Feature Set Justification

<hr>
<a class="anchor" id="model_building"></a>

# 4. Model Building and Evaluation

<hr>
<a class="anchor" id="problem_type"></a>

## 4.1. Problem Type Identification

<hr>
<a class="anchor" id="algorithm_selection"></a>

## 4.2. Algorithm Selection

<hr>
<a class="anchor" id="assessment_strategy"></a>

## 4.3. Model Assessment Strategy

<hr>
<a class="anchor" id="model_training"></a>

## 4.4. Model Training and Prediction

<hr>
<a class="anchor" id="performance_metrics"></a>

## 4.5. Performance Metrics and Interpretation