# Used Car Price Prediction

Buying a used car is a daunting experience. Beyond hidden issues such as past accidents and mechanical or maintenance problems, the buyer must navigate a wide range of options, which is time consuming and offers no guarantee of a positive outcome. Some car-domain expertise is typically needed to make an informed decision and improve the chance of a good purchase.
This project presents an ML model that can be used to predict the price of a used car.

### Business Understanding

From the business perspective, the profitability of the used car business hinges on acquiring well-maintained vehicles, precisely evaluating their worth, and providing competitive prices.

A crucial aspect of the used car business is the detailed identification of key features that determine the price of a used car. Factors such as the vehicle's wear and tear measured using its age, mileage, condition, and service history play a significant role in its valuation. The make and model of the car, its popularity, and regional market demand also influence the price.

From the consumer perspective, especially for buyers who lack car-specific expertise, the ability to make an informed purchase that results in a positive outcome is of great importance for future purchases.


## DataSet

The dataset is a tabular, comma-delimited (.csv) file in which each row represents a specific used car listing and each column represents a feature or attribute of the car.

### Data Source:

<table>
    <tr>
        <th>Filename</th>
        <th>Type</th>
        <th>Number of Rows</th>
        <th>Number of Features</th>
        <th>Delimiter</th>
    </tr>
    <tr>
        <td>'data/vehicles.csv'</td>
        <td>CSV</td>
        <td>426,880</td>
        <td>18</td>
        <td>,</td>
    </tr>
</table>

### Features:

<table>
    <tr>
        <th>Feature</th>
        <th>Type</th>
        <th>Description</th>
        <th>Sample Values</th>
        <th>Non-Null Count</th>
    </tr>
    <tr>
        <td>id</td>
        <td>Numeric</td>
        <td>assigned vehicle ID</td>
        <td>7222695916, ...</td>
        <td>426880</td>
    </tr>
    <tr>
        <td>region</td>
        <td>String</td>
        <td>Vehicle roaming area</td>
        <td>hudson valley, el paso, ...</td>
        <td>426880</td>
    </tr>
    <tr>
        <td>price</td>
        <td>Numeric</td>
        <td>Vehicle listed price</td>
        <td>21000, 15995, ...</td>
        <td>426880</td>
    </tr>
    <tr>
        <td>year</td>
        <td>Numeric</td>
        <td>Manufacture year of the vehicle</td>
        <td>2014, 2018, ...</td>
        <td>425675</td>
    </tr>
    <tr>
        <td>manufacturer</td>
        <td>String</td>
        <td>Vehicle manufacturer</td>
        <td>gmc, chevrolet, toyota, ...</td>
        <td>409234</td>
    </tr>
    <tr>
        <td>model</td>
        <td>String</td>
        <td>Vehicle model</td>
        <td>f-150 xlt, tacoma, cherokee</td>
        <td>421603</td>
    </tr>
    <tr>
        <td>condition</td>
        <td>String</td>
        <td>Vehicle condition</td>
        <td>good, excellent, ...</td>
        <td>252776</td>
    </tr>
    <tr>
        <td>cylinders</td>
        <td>String</td>
        <td>Number of engine cylinders</td>
        <td>8 cylinders, 6 cylinders, ...</td>
        <td>249202</td>
    </tr>
    <tr>
        <td>fuel</td>
        <td>String</td>
        <td>Vehicle fuel type</td>
        <td>gas, diesel, ...</td>
        <td>423867</td>
    </tr>
    <tr>
        <td>odometer</td>
        <td>Numeric</td>
        <td>Mileage driven, in miles</td>
        <td>12102, 80328, ...</td>
        <td>422480</td>
    </tr>
    <tr>
        <td>title_status</td>
        <td>String</td>
        <td>Vehicle title status</td>
        <td>clean, rebuilt, ...</td>
        <td>418638</td>
    </tr>
    <tr>
        <td>transmission</td>
        <td>String</td>
        <td>Vehicle transmission type</td>
        <td>automatic, manual, ...</td>
        <td>424324</td>
    </tr>
    <tr>
        <td>VIN</td>
        <td>String</td>
        <td>Vehicle Identification Number</td>
        <td>5TFTX4CN6CX015282, ...</td>
        <td>265838</td>
    </tr>
    <tr>
        <td>drive</td>
        <td>String</td>
        <td>Vehicle Drive type</td>
        <td>fwd, rwd, ...</td>
        <td>296313</td>
    </tr>
    <tr>
        <td>size</td>
        <td>String</td>
        <td>Vehicle size classification</td>
        <td>mid-size, full-size, ...</td>
        <td>120519</td>
    </tr>
    <tr>
        <td>type</td>
        <td>String</td>
        <td>Vehicle type classification</td>
        <td>pickup, SUV, truck, ...</td>
        <td>334022</td>
    </tr>
    <tr>
        <td>paint_color</td>
        <td>String</td>
        <td>Vehicle Color</td>
        <td>white, red, blue, ...</td>
        <td>296677</td>
    </tr>
    <tr>
        <td>state</td>
        <td>String</td>
        <td>Lower-case abbreviation of the state where the vehicle is registered</td>
        <td>ma, ca, or, ...</td>
        <td>426880</td>
    </tr>
</table>
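
A per-column summary like the one above can be produced directly with pandas. The sketch below uses a small made-up frame rather than the real file (the rows and the names `toy` and `summary` are assumptions for illustration) and reports both the non-null count and the number of distinct values per column, since the two are easy to confuse.

```python
import pandas as pd

# Toy stand-in for the listings file; the values are invented for illustration.
toy = pd.DataFrame({
    "manufacturer": ["gmc", "toyota", None, "toyota"],
    "condition":    ["good", None, None, "excellent"],
    "price":        [21000, 15995, 0, 8900],
})

summary = pd.DataFrame({
    "dtype":    toy.dtypes.astype(str),
    "non_null": toy.notna().sum(),   # rows where a value is present
    "unique":   toy.nunique(),       # distinct non-missing values
})
print(summary)
```

On the real data the same three expressions would be applied to the frame returned by `pd.read_csv`.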


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import random
import seaborn as sns

In [None]:
#
# Data Cleaning
#   - Remove rows with a null in any field
#   - Remove rows with no price listed (price == 0)

df_vehicle_ = pd.read_csv('data/vehicles.csv')
df_vehicle_.info()   # info() prints its summary itself and returns None
print()
# keep only listings with a positive price, then drop rows with any missing value
df_vehicle = df_vehicle_[ df_vehicle_['price'] > 0 ].dropna()
df_vehicle.info()


### Data Understanding

This step prepares the data for modeling.  Only valid data are allowed to proceed:
1.  Remove rows with missing or null values.
2.  Remove bad data, where price = 0.
3.  Standardize the entire dataset to lower case.
4.  Remove unusable features.

<table>
    <tr>
        <th>Usable Rows</th>
        <th>32,496</th>
        <th>Features</th>
        <th>18</th>
    </tr>
</table>
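
The four preparation steps can be sketched on a handful of invented rows. This is only an illustration under stated assumptions: the frame `raw`, its values, and the choice of `type` and `state` as the text columns to lower-case are all hypothetical, and the real notebook may apply the steps in a different order.

```python
import pandas as pd

# Invented rows standing in for the listings file.
raw = pd.DataFrame({
    "id":    [1, 2, 3, 4],
    "VIN":   ["A1", "B2", "C3", "D4"],
    "price": [21000, 0, 15995, 8900],
    "type":  ["SUV", "Pickup", None, "Truck"],
    "state": ["MA", "CA", "OR", "CA"],
})

clean = raw.dropna()                          # 1. remove rows with missing values
clean = clean[clean["price"] > 0].copy()      # 2. remove listings with price = 0
for col in ["type", "state"]:                 # 3. lower-case the text columns
    clean[col] = clean[col].str.lower()
clean = clean.drop(columns=["id", "VIN"])     # 4. drop identifier columns

print(clean)
```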

This section analyzes the dataset's features and evaluates the correlation and influence of each feature against the target feature.  

* target feature: price

The id and VIN features are unique to each car and carry no information about its price, so they are removed from the dataset.

Furthermore, to reduce the number of features used for regression, a <b>Mutual Information Score</b> is computed for every feature against the target feature, price, and shown in the following diagram.  An odometer-range breakdown against car price is also shown.
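
As a rough illustration of why mutual information is a useful screening score, the sketch below compares it with the Pearson correlation on synthetic data. The data, the column names `u_shaped` and `noise`, and the seed are all assumptions made for this example. One column drives the target through a squared relationship, which a linear correlation largely misses but mutual information should pick up.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 1000
X_toy = pd.DataFrame({
    "u_shaped": rng.uniform(-1, 1, n),   # drives the target non-linearly
    "noise":    rng.uniform(-1, 1, n),   # unrelated to the target
})
y_toy = X_toy["u_shaped"] ** 2 + rng.normal(0, 0.01, n)

mi = pd.Series(
    mutual_info_regression(X_toy, y_toy, random_state=0),
    index=X_toy.columns,
)
corr = X_toy.corrwith(y_toy)

print(pd.DataFrame({"mutual_info": mi, "pearson_r": corr}))
```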

### Module: make_mi_scores(), plot_mi_scores()

make_mi_scores():  Compute the Mutual Information between the data frame's features and the target
                   The computed score is returned

plot_mi_scores():  Plot the computed Mutual Information scores of each feature against the target

In [None]:
from sklearn.feature_selection import mutual_info_regression

# Compute the Mutual Information score between the features and the target
def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

def plot_mi_scores(title,scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.xlabel('score')
    plt.grid(True)
    title_ = "Mutual Information Scores VS " + title
    plt.title( title_ )
    

In [None]:
# Data range encoder: map each value to the 1-based position of the range it falls in
def encode_range( df, feature, data_ranges ):
#    print(f'encode_range: type: {type(df[feature])}, name: {feature}')
    def map_to_ordinal( value ):
        for i,r in enumerate(data_ranges, start=1):
            if r[0] <= value <= r[1]:
                return i

        return len(data_ranges)  # case value outside of range
    
    df[feature+"_encoded"] = df[feature].apply(map_to_ordinal)
    return df


### Model: run_model

A module that performs GridSearchCV with the supplied hyper-parameters to find the estimator with the best score.  The module assumes that the numeric columns have already been standardized, and does not perform any further scaling.
The module consists of the following steps:
1. split the data into training / development / testing sets: 80%, 15%, 5%
2. generate PolynomialFeatures from the data's features based on the hyper-parameter supplied
3. perform SequentialFeatureSelector, using LinearRegression as the estimator, and iterate with the hyper-parameters supplied
4. perform model regularization based on the "regressor" specified
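
An 80% / 15% / 5% split takes two calls to `train_test_split`, and the second `test_size` is a fraction of the held-out portion rather than of the full dataset. A minimal sketch on placeholder arrays (the names `X_demo` and `y_demo` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000)

# First split: hold out 20% of all rows.
X_train, X_hold, y_train, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42)

# Second split: test_size applies to the 20% hold-out, so 0.25 * 20% = 5% of all rows.
X_dev, X_test, y_dev, y_test = train_test_split(
    X_hold, y_hold, test_size=0.25, random_state=42)

print(len(X_train), len(X_dev), len(X_test))
```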

GridSearchCV is configured with:
1. PolynomialFeatures degrees: [1,2,3]
2. SequentialFeatureSelector's n_features_to_select: [3,5,9]
3. Regularization Alpha: [0.01,0.1,1.0,5.0,10.0]
4. K-Fold validation with cv = 5
5. r2 is used as the scoring method

The module returns the best_estimator, the best_estimator's coefficients, and the grid_search object to the calling routine.

In [None]:
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import  mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.utils import shuffle

from collections import defaultdict
def run_model( df, df_target, model_name, base_model, param_grid ):
    
    X = df_ = df.copy()
    y = df_target.copy()
    
    # 80% train; the remaining 20% is split 75/25 into dev (15%) and test (5%)
    X_train, X_combo, y_train, y_combo = train_test_split(X, y, test_size=0.2, random_state=42)
    X_dev,   X_test,  y_dev,   y_test  = train_test_split(X_combo, y_combo, test_size=0.25, random_state=42)

    print(f'************  Model: {model_name}  **************')

    # NOTE: the param_grid argument is not used; the search grid is fixed here
    pgrid = {
        'pfeatures__degree':[1,2,3],  #[2,3]
        'feature_selector__n_features_to_select': [3,5,9],  #[3,5]
        'feature_selector__direction' : ['forward'],
        'regressor__alpha': [0.01,0.1,1.0,5.0,10.0]  #[0.1,1.0,10.0]
    }
    pipeline = Pipeline([
        ('pfeatures',PolynomialFeatures(include_bias=False)),
        ('feature_selector',SequentialFeatureSelector(estimator=LinearRegression())),
        ('regressor', base_model)
    ])
    grid_search = GridSearchCV(pipeline, pgrid, cv=5, scoring='r2', error_score='raise')
    grid_search.fit(X_train, y_train)
        
    y_dev_preds  = grid_search.predict(X_dev )
    y_test_preds = grid_search.predict(X_test)

    # scikit-learn metrics take (y_true, y_pred)
    mse_y_dev  = mean_squared_error( y_dev,  y_dev_preds  )
    mse_y_test = mean_squared_error( y_test, y_test_preds )

    best_model           = grid_search.best_estimator_
    best_model_coef      = best_model.named_steps['regressor'].coef_
    best_model_intercept = best_model.named_steps['regressor'].intercept_
    best_model_features  = best_model.named_steps['feature_selector'].get_support()
    
    poly_feature_names = best_model.named_steps['pfeatures'].get_feature_names_out(input_features=X.columns)
    # Select the names of the selected features
    selected_feature_names = np.array(poly_feature_names)[best_model_features]
    print(f'selected_feature_names: {selected_feature_names}, Coef: {best_model_coef}')

    # coef_ is 1-D for a single target; ravel() keeps one coefficient per selected feature
    coef_df = pd.DataFrame({'Feature': selected_feature_names, 'Coefficient': np.ravel(best_model_coef)})
    
    print("Best parameters:", grid_search.best_params_)
    print("Best r2 score:", grid_search.best_score_)
    #    print(f'mse-y_dev : {mse_y_dev}')
    #    print(f'mse-y_test: {mse_y_test}')
    print(f'{coef_df}')
    # Plotting
    plt.figure(figsize=(10, 6))
    # Plot the mean cross-validated r2 of every parameter combination against its alpha
    alphas  = [params['regressor__alpha'] for params in grid_search.cv_results_['params']]
    mean_r2 = grid_search.cv_results_['mean_test_score']
    plt.plot(alphas, mean_r2, marker='o', linestyle='', label=model_name)
    plt.xscale('log')
    plt.xlabel('Alpha')
    plt.ylabel('Mean CV r2')
    plt.title(f'{model_name} Regression')
    plt.legend()
    plt.show()
    
    return best_model, coef_df, grid_search


### Compute Mutual Information Scores

This is the first step in understanding the data and how each feature relates to the target feature.  All non-numeric fields are label encoded.  The odometer feature is range encoded for group computation.

From the Mutual Information scores, and not surprisingly, the odometer, the model, and the year of the vehicle are the top features that influence the price of the car.  
Vehicles with lower odometer mileage tend to have higher prices.
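
Range encoding of this kind can also be expressed with `pd.cut`. The sketch below is an alternative illustration, not the notebook's own `encode_range` helper; the bin edges are assumed from the odometer ranges used in this project and the sample readings are invented. One difference to keep in mind: `pd.cut` returns NaN for values outside the bins instead of assigning them to the last range.

```python
import pandas as pd

# Bin edges assumed from the odometer ranges used in this project.
edges = [0, 10_000, 30_000, 50_000, 70_000, 100_000,
         150_000, 200_000, 250_000, 1_000_000]
odo = pd.Series([5_000, 12_102, 80_328, 260_000], name="odometer")

# right=False makes each bin half-open: [0, 10000), [10000, 30000), ...
# labels=False returns 0-based bin positions, so add 1 for ordinals starting at 1.
odo_encoded = pd.cut(odo, bins=edges, right=False, labels=False) + 1

print(odo_encoded.tolist())
```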

In [None]:
# setup Modeling Dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.preprocessing import FunctionTransformer

target_feature = 'price'

X = df_vehicle.copy()
Z = df_vehicle.copy()
X = X.drop(columns=[target_feature,'VIN','id'])
y = df_vehicle[target_feature]

for colname in X.select_dtypes("object"):
    X[colname], _ = X[colname].factorize()

# All discrete features should now have integer dtypes (double-check this before using MI!)
discrete_features = X.dtypes == int
mi_scores = make_mi_scores(X, y, discrete_features)
mi_scores[::3]  # show a few features with their MI scores

plt.figure(dpi=100, figsize=(8, 5))
plot_mi_scores(target_feature,mi_scores)
plt.savefig('images/mi_scores.png')

X_odo = X[ 'odometer' ]
# cleanup Odometer column to ensure all numeric values
X['odometer'] = pd.to_numeric( X['odometer'],errors='coerce' )
odo_ranges_encoding = [
    (0, 9999),
    (10000,29999),
    (30000,49999),
    (50000,69999),
    (70000,99999),
    (100000,149999),
    (150000,199999),
    (200000,249999),
    (250000,999999)
]
odo_features = ['0-9,999', '10,000-29,999','30,000-49,999','50,000-69,999','70,000-99,999','100,000-149,999','150,000-199,999','200,000-249,999','250,000-999,999']
odo_range = {
    '0-9999' : (0, 9999),
    '10000-29999'   : (10000, 29999),
    '30000-49999'   : (30000,49999),
    '50000-69999'   : (50000,69999),
    '70000-99999'   : (70000,99999),
    '100000-149999' : (100000,149999),
    '150000-199999' : (150000,199999),
    '200000-249999' : (200000,249999),
    '250000-1000000': (250000,1000000)
}
X_odo = encode_range( Z, 'odometer', odo_ranges_encoding )
##print(f'\n\n{X_odo.columns}')
##print(f'\n{X_odo.head(10)}')

plot = sns.lmplot(data=X_odo,x='odometer_encoded',y='price')
# encode_range numbers the ranges from 1, so the ticks must start at 1 as well
plot.set(xticks=range(1, len(odo_features) + 1),xticklabels=odo_features)
plt.xticks(rotation=45)
plt.savefig('images/odometer_range_price.png')
plt.show()

X_model = X_odo.groupby('model')['model'].count()

car_models, n_car_models = np.unique(X_odo['model'],return_counts=True)


### Price Prediction Model
1. Column Transformer Pipeline
    a. Range encode the 'odometer' feature for standardization
    b. Ordinal encode the non-numeric fields, assigning each unique value an integer code (OrdinalEncoder numbers the categories in sorted order)
    c. Drop the id and VIN features from the data frame
2. Run the model pipeline with Ridge() and Lasso() and compare the results
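
The column-transformer step can be sketched on a few invented rows. The frame `toy` and its values are assumptions for illustration, and `"passthrough"` is used here in place of an identity `FunctionTransformer`. The sketch shows two behaviours worth knowing: columns that are not listed are dropped by default, and `OrdinalEncoder` assigns codes by sorted category order.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({
    "id":    [101, 102, 103],
    "year":  [2014, 2018, 2011],
    "fuel":  ["gas", "diesel", "gas"],
    "drive": ["rwd", "fwd", "4wd"],
})

pre = ColumnTransformer(
    transformers=[
        ("num", "passthrough", ["year"]),
        ("ord", OrdinalEncoder(), ["fuel", "drive"]),
    ],
    remainder="drop",   # the default: columns not listed (here, id) are discarded
)

encoded = pd.DataFrame(pre.fit_transform(toy), columns=["year", "fuel", "drive"])
print(encoded)
print(pre.named_transformers_["ord"].categories_)
```

Because the codes follow sorted order, they carry no notion of quality or rank unless the categories are passed explicitly through the encoder's `categories` parameter.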

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

## Model the entire USA
#
from collections import defaultdict

nation_models       = defaultdict()
nation_gmodels      = defaultdict()
nation_model_coef   = defaultdict()
df_vehicle = df_vehicle.dropna(axis=1)

print(f'\n\n####################   National Model   ####################\n')
X  = df_vehicle.copy()
X0 = df_vehicle.copy()
Z  = X.copy()
y  = df_vehicle['price'].copy()

# cleanup Odometer column to ensure all numeric values
X['odometer'] = pd.to_numeric( X['odometer'],errors='coerce' )
odo_ranges_encoding = [
    (0, 9999),
    (10000,29999),
    (30000,49999),
    (50000,69999),
    (70000,99999),
    (100000,149999),
    (150000,199999),
    (200000,249999),
    (250000,999999)
]
odo_features = ['0-9,999', '10,000-29,999','30,000-49,999','50,000-69,999','70,000-99,999','100,000-149,999','150,000-199,999','200,000-249,999','250,000-999,999']
odo_range = {
    '0-9999' : (0, 9999),
    '10000-29999'   : (10000, 29999),
    '30000-49999'   : (30000,49999),
    '50000-69999'   : (50000,69999),
    '70000-99999'   : (70000,99999),
    '100000-149999' : (100000,149999),
    '150000-199999' : (150000,199999),
    '200000-249999' : (200000,249999),
    '250000-1000000': (250000,1000000)
}

features_target    = ['price']
features_untouched = ['year']
features_ordinal   = ['state','model','transmission','drive','size','condition', 'cylinders','title_status','manufacturer','paint_color','fuel','type']
features_onehot    = []
features_label     = []
features_numeric   = ['year']
features_range     = ['odometer']

ordinal_transformer  = Pipeline(steps=[ ('ordinal', OrdinalEncoder()) ])
onehot_transformer   = Pipeline(steps=[ ('onehot',  OneHotEncoder() ) ])
label_transformer    = Pipeline(steps=[ ('label',   LabelEncoder()  ) ])
numeric_transformer  = FunctionTransformer( lambda x: x)
identify_transformer = FunctionTransformer( lambda x: x)
copy_transformer     = FunctionTransformer( lambda x: x)

preprocessor = ColumnTransformer(
    transformers = [
    ('num',     copy_transformer,    features_numeric ),
    ('range',   copy_transformer,    features_range   ),
    ('ord',     ordinal_transformer, features_ordinal ),
])

X1 = preprocessor.fit_transform(X)
X1_dense = X1.copy()

all_features = features_numeric + features_range + features_ordinal 
try:
    X2 = pd.DataFrame(X1_dense, columns=all_features)
    X2 = encode_range(X2, 'odometer', odo_ranges_encoding)
    X2 = X2.drop(columns=features_range)
    y2 = y.copy()
except ValueError as e:
    print(f'Error creating DataFrame: {e}')
    X2 = None

param_grid = {
    'regressor__alpha': np.logspace(-3, 3, 3),  # alpha values for Ridge regression
    'regressor__max_iter' : [ 10 ]
}

ridge_model, ridge_coef, gridsearch_model_ridge = run_model( X2, y2, 'Ridge', Ridge(), param_grid)
lasso_model, lasso_coef, gridsearch_model_lasso = run_model( X2, y2, 'Lasso', Lasso(), param_grid)
    

# Create model for each state and calculate fit score
us_states = df_vehicle['state'].unique()


state_ridge_models = defaultdict()
ridge_gmodels      = defaultdict()
model_ridge_coef   = defaultdict()

state_lasso_models = defaultdict()
lasso_gmodels      = defaultdict()
model_lasso_coef   = defaultdict()

df_vehicle = df_vehicle.dropna(axis=1)
for state in us_states:
    print(f'\n\n##############################   {state}   ##############################\n')
    X  = df_vehicle[ df_vehicle['state'] == state ].copy()
    X0 = df_vehicle[ df_vehicle['state'] == state ].copy()
    Z  = X.copy()
    y  = df_vehicle[ df_vehicle['state'] == state ]['price'].copy()

    # cleanup Odometer column to ensure all numeric values
    X['odometer'] = pd.to_numeric( X['odometer'],errors='coerce' )
    odo_ranges_encoding = [
        (0, 9999),
        (10000,29999),
        (30000,49999),
        (50000,69999),
        (70000,99999),
        (100000,149999),
        (150000,199999),
        (200000,249999),
        (250000,999999)
    ]
    odo_features = ['0-9,999', '10,000-29,999','30,000-49,999','50,000-69,999','70,000-99,999','100,000-149,999','150,000-199,999','200,000-249,999','250,000-999,999']
    odo_range = {
        '0-9999' : (0, 9999),
        '10000-29999'   : (10000, 29999),
        '30000-49999'   : (30000,49999),
        '50000-69999'   : (50000,69999),
        '70000-99999'   : (70000,99999),
        '100000-149999' : (100000,149999),
        '150000-199999' : (150000,199999),
        '200000-249999' : (200000,249999),
        '250000-1000000': (250000,1000000)
    }

    features_target    = ['price']
    features_untouched = ['year']
    features_ordinal   = ['state','transmission','drive','size','condition', 'cylinders','title_status','manufacturer','paint_color','fuel','type']
    features_onehot    = []
    features_numeric   = ['year']
    features_range     = ['odometer']

    ordinal_transformer  = Pipeline(steps=[ ('ordinal', OrdinalEncoder()) ])
    onehot_transformer   = Pipeline(steps=[ ('onehot',  OneHotEncoder() ) ])
    label_transformer    = Pipeline(steps=[ ('label',   LabelEncoder()  ) ])
    numeric_transformer  = FunctionTransformer( lambda x: x)
    identify_transformer = FunctionTransformer( lambda x: x)
    copy_transformer     = FunctionTransformer( lambda x: x)

    preprocessor = ColumnTransformer(
        transformers = [
        ('num',     copy_transformer,    features_numeric ),
        ('range',   copy_transformer,    features_range   ),
        ('ord',     ordinal_transformer, features_ordinal ),
    ])

    X1 = preprocessor.fit_transform(X)
    X1_dense = X1.copy()

    all_features = features_numeric + features_range + features_ordinal
    try:
        X2 = pd.DataFrame(X1_dense, columns=all_features)
        X2 = encode_range(X2, 'odometer', odo_ranges_encoding)
        X2 = X2.drop(columns=features_range)
        y2 = y.copy()
    except ValueError as e:
        print(f'Error creating DataFrame: {e}')
        X2 = None

    param_grid = {
        'regressor__alpha': np.logspace(-3, 3, 3),  # alpha values for Ridge regression
        'regressor__max_iter' : [ 10 ]
    }
    state_model, state_coef, gridsearch_model = run_model( X2, y2, 'Ridge', Ridge(), param_grid)
    
    state_ridge_models[state] = state_model
    model_ridge_coef[state]   = state_coef
    ridge_gmodels[state]      = gridsearch_model

    state_model, state_coef, gridsearch_model = run_model( X2, y2, 'Lasso', Lasso(), param_grid)
    
    state_lasso_models[state] = state_model
    model_lasso_coef[state]   = state_coef
    lasso_gmodels[state]      = gridsearch_model
