## Deliverable 1

Student IDs:

Andreea Roica: 20250361

Beatriz Varela: 20250367

Barbara Franco: 20250388

Marisa Esteves: 20250348

We will follow the CRISP-DM methodology.

## Business Understanding

The goal of this project is to build a regression model that predicts a car's price from its details. This includes:

- Regression Benchmarking
- Model Optimization
- Additional Insights

## Data Understanding

Metadata:

- **carID** : An attribute that contains an identifier for each car.
- **Brand** : The car’s main brand (e.g. Ford, Toyota).
- **model** : The car model.
- **year** : The year of registration of the car.
- **transmission** : Type of transmission of the car (e.g. Manual, Automatic, Semi-Automatic).
- **mileage** : The total reported distance travelled by the car (in miles).
- **tax** : The amount of road tax (in £) that, in 2020, was applicable to the car in question.
- **fuelType** : Type of Fuel used by the car (Diesel, Petrol, Hybrid, Electric).
- **mpg** : Average Miles per Gallon.
- **engineSize** : Size of Engine in liters (Cubic Decimeters).
- **paintQuality%** : The mechanic’s assessment of the car’s overall paint quality and hull integrity (filled by the mechanic during evaluation).
- **previousOwners** : Number of previous registered owners of the vehicle.
- **hasDamage** : Boolean marker filled by the seller at the time of registration stating whether the car is damaged or not.
- **price** : The car’s price when purchased by Cars 4 You (in £).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import openpyxl 
from math import ceil

from sklearn.model_selection import train_test_split

# Import fuzzywuzzy to correct the typos in 'Brand', 'fuelType' and 'transmission'
from fuzzywuzzy import fuzz

# Import get_close_matches to identify and group similar words for typo correction in 'model'
from difflib import get_close_matches

# Import to perform the Chi-squared test
from scipy.stats import chi2_contingency


In [None]:
# Set seed
np.random.seed(33)

In [None]:
#Reading the data
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

### Characteristics of our data

In [None]:
df_train.head()

We can already see that there are null values (at least in tax) and strange values, such as negative previous owners.

In [None]:
df_test.head()

In [None]:
df_train.info()

Identified Problems:

- year and previousOwners are float when they should be integer
- hasDamage is float when it should be boolean

In [None]:
df_test.info()

### Duplicates

In [None]:
print('Check duplicates:')
print(f'Train: {df_train.duplicated().sum()}\nTest: {df_test.duplicated().sum()}')
print('\nCheck duplicates in carID:')
# Double quotes outside so the inner 'carID' quotes also parse on Python versions before 3.12
print(f"Train: {df_train.duplicated(subset='carID').sum()}\nTest: {df_test.duplicated(subset='carID').sum()}")

### Categorical Variables

In [None]:
categorical_features = ['Brand', 'model', 'transmission', 'fuelType']

In [None]:
df_train.describe(include='object')

In [None]:
for feat in categorical_features:
    print(f'{feat} :' )
    print(f'{pd.concat([df_train[feat], df_test[feat]]).unique().tolist()}\n')

**Histograms**

In [None]:
sns.set(style="whitegrid")

fig, axes = plt.subplots(
    2, 
    ceil(len(categorical_features) / 2), 
    figsize=(20, 11)
    )

for ax, feat in zip(axes.flatten(), categorical_features):
    sns.countplot(x=df_train[feat], ax=ax, 
                  order=df_train[feat].value_counts().index, color = 'hotpink') 
    ax.set_title(feat)
    ax.tick_params(axis='x', rotation=90)  # rotate the labels on the x axis


plt.suptitle("Categorical Variables' Absolute Counts", fontsize=25)

plt.tight_layout()
plt.show()

With these plots we can visualize the problems in the categorical variables: a huge number of classes, most of them with very low frequency and similar names. We will fix this in Data Preparation.

**Association between variables**

To evaluate the association between categorical (nominal) variables we will perform the chi-squared test of independence.

H0: The two variables are independent (there is no association between them).

H1: There is an association between the two variables.

Interpretation:
- if p_value < 0.05 (significance level): Reject H0, so there is a statistically significant association between the {var1} and {var2}.
- if p_value >= 0.05 (significance level): Do not reject H0, so there is no evidence of a statistically significant association.

In [None]:
association_results = pd.DataFrame(columns=categorical_features, index=categorical_features)

def color_pvalues(val):

    #Applies color coding to p-values.
    #Green: p-values < 0.05 (significant association).
    #Red: p-values >= 0.05 (no significant association).

    if val < 0.05:
        return 'background-color: lightgreen; color:black; border: 1px solid black;'
    else:
        return 'background-color: lightcoral; color:black; border: 1px solid black;'

for var1 in categorical_features:
    for var2 in categorical_features:
        #Chi-square test between {var1} and {var2}

        contingency_table = pd.crosstab(df_train[var1], df_train[var2]) # Create the contingency table
        result = chi2_contingency(contingency_table) # Perform the Chi-square test
                                                    #Chi-square Statistic: result[0]
                                                    #p-value: result[1]
                                                    #Degrees of Freedom: result[2]
                                                    #Expected Frequencies: result[3]
        association_results.loc[var1, var2] = result[1]

association_results= association_results.style.applymap(color_pvalues)
display(association_results)

To measure the strength of their association we will use Cramér's V (suitable for nominal variables).

$$V = \sqrt{\frac{\chi^2 / N}{\min(C-1,\ R-1)}}$$

$\chi^2$: chi-squared statistic;

$N$: total number of observations;

$C$: number of columns in the contingency table;

$R$: number of rows in the contingency table.
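
The definition can be checked on made-up data before applying it to the real columns. The sketch below is only illustrative: `cramers_v` is a hypothetical helper, the toy series are invented, and `correction=False` is an assumption made so that SciPy's Yates correction for 2x2 tables does not change the statistic.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(x, y):
    """Cramér's V between two nominal series (hypothetical helper)."""
    table = pd.crosstab(x, y)
    # correction=False: skip the Yates correction SciPy applies to 2x2 tables
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()          # observations counted in the table
    min_dim = min(table.shape) - 1      # min(R - 1, C - 1)
    return float(np.sqrt((chi2 / n) / min_dim))


# Made-up data: 'colour' fully determines 'code'; 'coin' is unrelated to 'side'
colour = pd.Series(["red", "green", "blue"] * 10)
code = colour.map({"red": "R", "green": "G", "blue": "B"})
coin = pd.Series(["H", "H", "T", "T"] * 5)
side = pd.Series(["L", "R", "L", "R"] * 5)

v_perfect = cramers_v(colour, code)   # perfect association
v_none = cramers_v(coin, side)        # no association
```

A perfect one-to-one mapping gives a value of 1 and two unrelated variables give 0, which are the two ends of the scale used to read the table.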

In [None]:
cramer_v_table = pd.DataFrame(columns=categorical_features, index = categorical_features)

for var1 in cramer_v_table.columns:
    for var2 in cramer_v_table.index:

        contingency_table = pd.crosstab(df_train[var1], df_train[var2])
        result = chi2_contingency(contingency_table)        

        # Calculate Cramer's V
        X2 = result[0]
        # Use the observations actually counted in the table (crosstab drops rows with NaN)
        n = contingency_table.to_numpy().sum()
        minimum_dimension = min(contingency_table.shape) - 1
        cramer_v_table.loc[var1, var2] = np.sqrt((X2/n) / minimum_dimension)

def color_cramervalues(val):
    if val > 0.6:
        return 'background-color: lightgreen; color:black; border: 1px solid black;'
    else:
        return 'background-color: lightcoral; color:black; border: 1px solid black;'


cramer_v_table= cramer_v_table.style.applymap(color_cramervalues)
display(cramer_v_table)

The strength of the associations is low, probably because of the number of typos in the categories.

### Numerical Variables

In [None]:
df_train = df_train.set_index ('carID')
df_test = df_test.set_index ('carID')

In [None]:
numeric_features = df_train.columns.drop(categorical_features)

In [None]:
df_train.describe()

**Histograms**

In [None]:
# We will put all the numeric variables' histograms in one figure
# Prepare figure. Create individual axes where each histogram will be placed
fig, axes = plt.subplots(2, ceil(len(numeric_features) / 2 ), figsize = (20, 11))

for ax, feat in zip(axes.flatten(), numeric_features):
    ax.hist(df_train[feat], color = 'hotpink')
    ax.set_title(feat, y = -0.13)

# Delete empty plots
for ax in axes.flatten()[len(numeric_features):]:
    ax.axis('off')  

# Add a centered title to the figure:
plt.suptitle("Numeric Variables' Histograms", fontsize=25)

plt.tight_layout()
plt.show()

**Boxplots**

In [None]:
# We will put all the numeric variables' box plots in one figure
# Prepare figure. Create individual axes where each box plot will be placed
fig, axes = plt.subplots(2, ceil(len(numeric_features) / 2 ), figsize = (20, 11))

for ax, feat in zip(axes.flatten(), numeric_features):
    sns.boxplot(x=df_train[feat], ax=ax, color='hotpink')

# Delete empty plots
for ax in axes.flatten()[len(numeric_features):]:
    ax.axis('off')

# Add a centered title to the figure:
plt.suptitle("Numeric Variables' Box Plots", fontsize=25)

plt.tight_layout()
plt.show()

paintQuality% and hasDamage appear to be the only features without outliers. hasDamage is drawn as a straight line because it only takes the values 0 or NA.

The remaining features have a lot of outliers, some of them very extreme, which is why the boxes look so compressed.
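
The points drawn beyond the whiskers follow the usual 1.5 × IQR rule, which is the default in these box plots. A minimal sketch of counting them, assuming that rule and using a hypothetical helper `iqr_outlier_mask` on made-up values:

```python
import pandas as pd


def iqr_outlier_mask(s, k=1.5):
    """True where a value lies more than k * IQR outside the quartiles (hypothetical helper)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Made-up values: quartiles are 2 and 4, so the upper fence is 4 + 1.5 * 2 = 7
toy = pd.Series([1, 2, 3, 4, 100])
mask = iqr_outlier_mask(toy)
n_outliers = int(mask.sum())
```

Applied per numeric column, the same mask would give the share of points flagged in each box plot.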

**Correlations**

To measure the correlation between numerical variables we decided to use Spearman's coefficient, as it captures monotonic associations, not just linear ones. 
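
The difference between the two coefficients shows up on a relationship that is monotonic but not linear. The sketch below uses made-up data (`y = x ** 3`), not the project's features:

```python
import pandas as pd

# Made-up data: y always increases with x, but not along a straight line
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

spearman = toy.corr(method="spearman").loc["x", "y"]  # based on ranks
pearson = toy.corr(method="pearson").loc["x", "y"]    # based on raw values
```

Spearman works on ranks, so it reaches 1 for any strictly increasing relationship, while Pearson stays below 1 here because the points do not lie on a line.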

In [None]:
corr = df_train[numeric_features].corr(method="spearman")
corr = corr.round(2)
corr

In [None]:
# Prepare figure
fig = plt.figure(figsize=(12, 8))

mask_annot = np.absolute(corr.values) >= 0.5 
annot = np.where(mask_annot, corr.values, np.full(corr.shape,"")) 


# Plot heatmap of the correlation matrix
sns.heatmap(data=corr, 
            annot=annot, # Specify custom annotation
            fmt='s', # The annotation matrix now has strings, so we need to explicitly say this
            vmin=-1, vmax=1, 
            center=0, # Center the colormap at zero
            square=True, # Make each cell square-shaped
            linewidths=.5, # Add lines between cells
            cmap='PiYG' # Diverging color map
            )

plt.show()


'previousOwners' and 'paintQuality%' present almost no correlation with the remaining variables, which suggests they may be of limited relevance.

'mileage' and 'year' show strong correlation (in opposite directions).

'mpg' and 'tax' show medium correlation (in opposite directions).

'mpg' and 'year', 'tax' and 'mileage', 'tax' and 'year' show medium-low correlation.

The correlation values are probably affected by the number of errors still present in the data.

**Bivariate plots**

## Data Preparation

### Hold Out Implementation

In [None]:
target = df_train['price']
data = df_train.drop(['price'], axis=1)

In [None]:
X_train, X_val, y_train, y_val = train_test_split(data, 
                                                 target, 
                                                 test_size=0.2, 
                                                 random_state=15, 
                                                 shuffle=True)

### Categorical Variables Treatment

##### *Brands*

In [None]:
# Pre-processing the data set to make it easier to find clusters:
# remove spaces (at the beginning and end) and uppercase all letters
df_train['Brand'] = df_train['Brand'].where(df_train['Brand'].isna(), df_train['Brand'].astype(str).str.strip().str.upper())

# NaNs are left untouched
df_test['Brand']  = df_test['Brand'].where(df_test['Brand'].isna(), df_test['Brand'].astype(str).str.strip().str.upper())

In [None]:
brands = pd.concat([df_train['Brand'], df_test['Brand']]).dropna().unique().tolist()
print(f' Typos in brands: {brands}')

In Brands we have the value 'W' which could mean VW or BMW. 
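
One way to resolve an ambiguous 'W' is to look up the brand that other rows record for the same model. The sketch below shows that idea on made-up rows; it builds the lookup once and takes the most frequent brand per model, which is an assumption of this sketch (the function in the next cell scans row by row and returns the first match).

```python
import pandas as pd

# Made-up rows: two ambiguous 'W' entries with a known model, one without a model
toy = pd.DataFrame({
    "Brand": ["VW", "W", "BMW", "W", "W"],
    "model": ["GOLF", "GOLF", "X1", "X1", None],
})

# Lookup model -> most frequent unambiguous brand
known = toy[toy["Brand"] != "W"].dropna(subset=["model"])
lookup = known.groupby("model")["Brand"].agg(lambda s: s.mode().iloc[0])

# Replace 'W' only where the model is found in the lookup; otherwise keep 'W'
is_w = toy["Brand"] == "W"
toy.loc[is_w, "Brand"] = toy.loc[is_w, "model"].map(lookup).fillna("W")
```

Rows whose model is missing cannot be matched and keep the value 'W', so they still need a separate decision.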

In [None]:
def correct_brand_w(df, brand, model):

    '''
    Replaces the ambiguous brand 'w' / 'W' with 'BMW' or 'VW', depending on the brand that other observations
    record for the same model. The function is applied to one element at a time: one brand and the corresponding model.

    Parameters
    -----------
    df : DataFrame
        the DataFrame whose columns are to be fixed
        
    brand : string
        the brand 

    model : string
        the corresponding model

    
    Returns
    -----------
    brand : string
        the corrected brand, which will be 'BMW' or 'VW' if the brand is 'w' or 'W' and a match is found, and the input brand otherwise
    

    '''

    # If the brand is a string and, in lower case, equals 'w'
    if isinstance(brand, str) and brand.lower() == 'w':

        # Loop over the brands and corresponding models in the DataFrame
        for brand_in_column, model_in_column in zip(df['Brand'], df['model']):

            # If the same model is found, then return the corresponding brand
            if isinstance(brand_in_column, str) and model_in_column == model and brand_in_column.lower() != 'w':
                print (brand_in_column)
                return brand_in_column
            
    # If the brand is not 'w' or 'W', it remains the same           
    return brand

# Correct the 'w' and 'W' values in the columns 'Brand' by applying the function correct_brand_w to all elements in the column
df_train['Brand'] = df_train.apply(lambda row: correct_brand_w(df_train, row['Brand'], row['model']), axis = 1)

In [None]:
df_test['Brand'] = df_test.apply(lambda row: correct_brand_w(df_test, row['Brand'], row['model']), axis = 1)

In [None]:
df_train[(df_train['Brand']=='w') | (df_train['Brand']=='W')]

In [None]:
df_test[(df_test['Brand']=='w') | (df_test['Brand']=='W')]

We can see that we fixed the problem with the brand 'W' except for the cases where the model is null, which are only 5 observations. All the fixed ones turned out to be 'VW', so we will assume the same for these.

In [None]:
df_train.loc[
    df_train['Brand'] =='W' ,
    'Brand'
] = 'VW'

df_train[df_train['Brand'] =='W']

In [None]:
df_test.loc[
    df_test['Brand'] =='W' ,
    'Brand'
] = 'VW'

df_test[df_test['Brand'] =='W']

In [None]:
brands = pd.concat([df_train['Brand'], df_test['Brand']]).dropna().unique().tolist()
print(f' Typos in brands: {brands}')

TheFuzz uses the Levenshtein edit distance to calculate the degree of closeness between two strings.

**Levenshtein distance** = at a minimum, how many edits are required to change one string into the other.

[https://www.datacamp.com/tutorial/fuzzy-string-python]
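
To make the definition concrete, here is a minimal pure-Python sketch of the edit distance itself. It is not the library's implementation: the scores returned by `fuzz` are similarity percentages rather than this raw count, and the function name `levenshtein` is just an illustrative choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


d_ford = levenshtein("FORD", "FOR")          # one missing character
d_classic = levenshtein("kitten", "sitting")
```

A typo such as a dropped final letter is one edit away from the correct name, which is why close strings end up in the same cluster.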

In [None]:
set([len(brand) for brand in brands])

[https://stackoverflow.com/questions/31806695/when-to-use-which-fuzz-function-to-compare-2-strings]

[https://medium.com/@laxmi17sarki/string-matching-using-fuzzywuzzy-24be9e85c88d]

**Ratio choice**: fuzz.WRatio, most robust method

In [None]:
from fuzzywuzzy import fuzz

# Creates clusters with similar brands
def create_clusters(brand_list, column, threshold=86): # groups strings with a similarity greater than or equal to threshold %
    clusters = []
    for brand in brand_list:
        found = False
        for cluster in clusters:
            # evaluates if brand is similar to any cluster
            if any(fuzz.WRatio(str(brand).lower(), str(b).lower()) >= threshold for b in cluster):
                cluster.append(brand)
                found = True
                break
        #if it doesn't find a match --> new cluster
        if not found:
            clusters.append([brand])

    # Gives the clusters names- chooses the most freq name
    mapping = {}
    counts = df_train[column].value_counts()
    for cluster in clusters:
        mode = max(cluster, key=lambda x: counts.get(x,0))  #finds the "max" in the cluster according to the key --> mode
        for brand in cluster:
            mapping[brand] = str(mode.upper())

    return clusters, mapping

clusters, mapping = create_clusters(brands, 'Brand', threshold=85)

print("Clusters:")
for c in clusters:
    print(c)

df_train['Brand_cleaned'] = df_train['Brand'].map(mapping)
df_test['Brand_cleaned'] = df_test['Brand'].map(mapping)

print("\nCleaned brand - Train:")
print(df_train['Brand_cleaned'].dropna().unique())

print("\nCleaned brand - Test:")
print(df_test['Brand_cleaned'].dropna().unique())



#### *Models*

In [None]:
# Pre-processing the data set to make it easier to find clusters:
# remove spaces (at the beginning and end) and uppercase all letters
df_train['model'] = df_train['model'].where(df_train['model'].isna(), df_train['model'].astype(str).str.strip().str.upper())

# NaNs are left untouched
df_test['model']  = df_test['model'].where(df_test['model'].isna(), df_test['model'].astype(str).str.strip().str.upper())

In [None]:
models= pd.concat([df_train['model'], df_test['model']]).dropna().unique().tolist()
print(f'Nº of unique values: {len(models)}')

In [None]:
set([len(str(model)) for model in models])

Fuzzywuzzy wasn't able to group the same models in the column 'model', so for this case we will use get_close_matches from difflib:

In [None]:
def similar_models(models):

    # This list, which starts empty, will store the groups of similar strings
    similar_groups = []

    # Go over all the values in models
    for model in models:

        # Flatten the list of lists into a single list with all the values in the sublists
        similar_groups_flat = [item for sublist in similar_groups if sublist is not None for item in sublist]

        # If the model is already in similar_groups_flat, it already has its similarity group, no need to search for more
        if model in similar_groups_flat:
            continue

        # Compute the similarity between model and the other observations and keep the ones with a similarity of at least 0.85
        close_matches = get_close_matches(model, models, cutoff=0.85)

        model_prefix = model.split(" ")[0]

        # For the models with more than one word it is necessary to evaluate the prefix in order to separate them well
        if " " in model:

            # Only keep the models with the same model code/prefix. Different model codes belong to different models
            close_matches = [match for match in close_matches if match.split(" ")[0] == model_prefix]

        # Add the close matches to the list of similar groups
        similar_groups.append(close_matches)

    return similar_groups

clusters = similar_models (models)

print("Clusters:")
for c in clusters:
    print(c)
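
The matching step used above can be tried in isolation. A minimal sketch on made-up model names (the candidate list is invented for illustration):

```python
from difflib import get_close_matches

# Made-up candidates: a truncated name should match only its complete form
candidates = ["FOCUS", "FOCUS ST", "FIESTA"]

# 'FOCU' vs 'FOCUS' shares 4 of 9 characters twice over: 2 * 4 / 9 = 0.89 >= 0.85
# 'FOCU' vs 'FOCUS ST': 2 * 4 / 12 = 0.67, below the cutoff
matches = get_close_matches("FOCU", candidates, cutoff=0.85)
```

Note that `get_close_matches` returns at most `n=3` matches unless `n` is raised, so a model with many misspelled variants may need a larger `n`.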

In [None]:
def correct_column_model(model, similar_groups):

    # If the element is NA or an empty string then nothing is changed
    if pd.isna(model) or model.strip() == '':
        return model
    
    # If the model has only one character then it is not possible to associate it with any model so return NA
    elif len(model) == 1:
        return np.nan
    
    # Put the model in upper case and remove spaces at the beginning or end of the string
    model = model.upper().strip()

    # Go over all the sublists in similar_groups, the word similarity groups
    for group in similar_groups:

        # Find the group which contains model
        if model in group:

            # Return the longest match, since it will be the complete one (in our data set typos always come from missing characters, never from extra ones)
            return max(group, key=len)
        
    return model

df_train['model_cleaned'] = df_train['model'].apply(lambda x: correct_column_model(x, clusters))
df_test['model_cleaned'] =  df_test['model'].apply(lambda x: correct_column_model(x, clusters))


In [None]:

print("\nCleaned model - Train:")
print(df_train['model_cleaned'].dropna().unique())

print("\nCleaned model - Test:")
print(df_test['model_cleaned'].dropna().unique())

#### *Fuel Types*

In [None]:
# Pre-processing the data set to make it easier to find clusters:
# remove spaces (at the beginning and end), uppercase all letters and replace inner spaces/hyphens with '_'
df_train['fuelType'] = df_train['fuelType'].where(df_train['fuelType'].isna(), df_train['fuelType'].astype(str).str.strip().str.upper().str.replace(r'[\s\-]+', '_', regex=True))

# NaNs are left untouched (raw strings avoid invalid-escape warnings for \s)
df_test['fuelType']  = df_test['fuelType'].where(df_test['fuelType'].isna(), df_test['fuelType'].astype(str).str.strip().str.upper().str.replace(r'[\s\-]+', '_', regex=True))

In [None]:
fuel_types=pd.concat([df_train['fuelType'], df_test['fuelType']]).dropna().unique().tolist()
print(f' Typos in fuel types: {fuel_types}')

In [None]:
# Creates clusters with similar fuel types
clusters, mapping = create_clusters(fuel_types, 'fuelType', threshold=85)

df_train['fuelType_cleaned'] = df_train['fuelType'].map(mapping)
df_test['fuelType_cleaned'] = df_test['fuelType'].map(mapping)

print("Clusters:")
for c in clusters:
    print(c)

print("\nCleaned fuel type - Train:")
print(df_train['fuelType_cleaned'].dropna().unique())

print("\nCleaned fuel type - Test:")
print(df_test['fuelType_cleaned'].dropna().unique())

#### *Transmission*

In [None]:
# Pre-processing the data set to make it easier to find clusters:
# remove spaces (at the beginning and end) and uppercase all letters
df_train['transmission'] = df_train['transmission'].where(df_train['transmission'].isna(), df_train['transmission'].astype(str).str.strip().str.upper())

# NaNs are left untouched
df_test['transmission']  = df_test['transmission'].where(df_test['transmission'].isna(), df_test['transmission'].astype(str).str.strip().str.upper())

In [None]:
transmission_types = pd.concat([df_train['transmission'], df_test['transmission']]).dropna().unique().tolist()
print(f' Typos in transmission types: {transmission_types}')

In [None]:
# Creates clusters with similar transmission types
clusters, mapping = create_clusters(transmission_types, 'transmission', threshold=86)

df_train['transmission_cleaned'] = df_train['transmission'].map(mapping)
df_test['transmission_cleaned'] = df_test['transmission'].map(mapping)

print("Clusters:")
for c in clusters:
    print(c)

print("\nCleaned transmission - Train:")
print(df_train['transmission_cleaned'].dropna().unique())

print("\nCleaned transmission - Test:")
print(df_test['transmission_cleaned'].dropna().unique())

### Numerical Variables Treatment

Bea --- Remove the invalid values, add the percentages, etc...

### Outlier Treatment

### New Visualizations

### Missing Values Treatment and Typecasting

NaNs will be treated as a new category.

Let's compare the zeros with the NaNs in hasDamage to see whether the NaNs are associated with damaged or undamaged cars:


In [None]:
cols = ['Brand_cleaned','model','fuelType_cleaned','transmission_cleaned', 'year']
distinctive_cols = [ 'price', 'mileage', 'tax', 'paintQuality%', 'previousOwners', 'hasDamage']

df_train_temp = df_train.dropna(subset=cols) [cols + distinctive_cols] 
                                            #selects the rows with no NaNs in cols 
df_train_temp = df_train_temp [ df_train_temp.duplicated(subset=cols, keep=False) ].sort_values(cols) 
                                            #keeps the rows whose combination of cols appears more than once
groups_filtered = df_train_temp.groupby(cols).filter(lambda group: group['hasDamage'].eq(0).any() and group['hasDamage'].isna().any())
                                            #groups the rows by cols
                                            #.filter: function applied to the groups created by groupby
                                            #keeps the group if there is at least one obs with 0 damage and another with NaN damage
group_dict = {name: group for name, group in groups_filtered.groupby(cols)} 

*Note*: groups_filtered.groupby(cols) is a GroupBy object. Saves the name (brand, model, etc.) of the group and the observations in each group.
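
The group selection can be seen on a small made-up table. This sketch uses a single invented key column (`model`) instead of the full `cols` list:

```python
import numpy as np
import pandas as pd

# Made-up rows: only model 'A' has both a 0 and a NaN in hasDamage
toy = pd.DataFrame({
    "model": ["A", "A", "B", "B", "C"],
    "hasDamage": [0.0, np.nan, 0.0, 0.0, np.nan],
})

# Keep whole groups that contain at least one 0 and at least one NaN
mixed = toy.groupby("model").filter(
    lambda g: g["hasDamage"].eq(0).any() and g["hasDamage"].isna().any()
)

# Iterating a GroupBy yields (name, sub-DataFrame) pairs
groups = {name: g for name, g in mixed.groupby("model")}
```

Groups with only zeros ('B') or only NaNs ('C') are dropped, so every remaining group allows a like-for-like comparison between the two cases.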

In [None]:
print(f"Number of groups: {len(group_dict)}")

In [None]:
#Analyze the tendency of the price, mileage, tax, paintQuality% and previousOwners in the groups with missing hasDamage
diffs_list = []

for name, group in group_dict.items():
    zeros = group[group['hasDamage'] == 0]
    nans = group[group['hasDamage'].isna()]

    if zeros.empty or nans.empty:
            continue

    diffs = {col: zeros[col].mean() - nans[col].mean() for col in ['price', 'mileage', 'tax', 'paintQuality%', 'previousOwners']}
    diffs_list.append(diffs)

diffs_df = pd.DataFrame(diffs_list)
    
diffs_df.median().to_frame(name='median_diff') #median, across groups, of the per-group mean differences

<div class="alert alert-block alert-danger">

We decided NaNs will be **damaged cars?** Now let's correct the data type of the variable:

</div>

In [None]:
#Correct data type: NaNs are mapped to True (damaged) explicitly before casting,
#because astype(bool) alone would silently turn NaN into True and leave nothing to fill
df_train['hasDamage'] = df_train['hasDamage'].fillna(1).astype(bool)
df_test['hasDamage'] = df_test['hasDamage'].fillna(1).astype(bool)

In [None]:
#Check data type:
df_train.hasDamage.dtype
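
Casting a float column with `astype(bool)` turns NaN into `True`, because NaN is not zero, so no NaN is left to fill afterwards. A minimal sketch on made-up values showing the implicit conversion next to an explicit one:

```python
import numpy as np
import pandas as pd

# Made-up column with the two kinds of values seen in hasDamage: 0 and NaN
has_damage = pd.Series([0.0, np.nan, 0.0])

implicit = has_damage.astype(bool)            # NaN silently becomes True
explicit = has_damage.fillna(1).astype(bool)  # same result, but the rule is visible
```

Both give the same column here; the explicit form simply states the decision (NaN means damaged) in the code, and changing `fillna(1)` to `fillna(0)` would encode the opposite decision.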

#### *previousOwners*

In [None]:
# Double quotes outside so the inner column names also parse on Python versions before 3.12
print(f"Proportion of observations with negative owners:\nTrain:{round((df_train['previousOwners'] < 0).mean(), 5)}\nTest:{round((df_test['previousOwners'] < 0).mean(), 5)}")
print(f"\nProportion of observations with 0 owners:\nTrain:{round((df_train['previousOwners'] == 0).mean(), 5)}\nTest:{round((df_test['previousOwners'] == 0).mean(), 5)}")

Invalid observations - Positives vs Negatives: 

In [None]:
neg_owners = df_train[df_train['previousOwners']<0]
pos_owners = df_train[df_train['previousOwners']>0]

In [None]:
neg_owners.describe().T

In [None]:
pos_owners.describe().T

Comparing the positives with the negatives:

##### Visualization

In [None]:
df_train['OwnerStatus'] = np.where(df_train['previousOwners'] > 0, 1, 0)
    #1 if previousOwners>0 , 0 if previousOwners<=0

In [None]:
categorical_features = ['Brand_cleaned', 'model','fuelType_cleaned', 'transmission_cleaned', 'hasDamage']
not_used=['Brand', 'transmission', 'fuelType', 'model_cleaned', 'previousOwners', 'OwnerStatus']
numeric_features = df_train.columns.drop(categorical_features + not_used)
numeric_features

In [None]:
df_train_temp = df_train[df_train['previousOwners']!=0]

#Numeric Variables
for col in numeric_features:
    if col not in df_train.columns:
        continue
    plt.figure(figsize=(7, 4))
    df_train_temp.boxplot(column=col, by='OwnerStatus', grid=False, ax=plt.gca())  # draw on the figure created above
    plt.title(f"{col} by OwnerStatus")
    plt.suptitle("")
    plt.xlabel("OwnerStatus (0=Negative, 1=Positive)")
    plt.ylabel(col)
    plt.xticks([1, 2], ['Negative', 'Positive'])
    plt.tight_layout()
    plt.show()

#Categorical Variables
for col in categorical_features:
    if col not in df_train.columns:
        continue
    plt.figure(figsize=(7, 4))
    pd.crosstab(df_train['OwnerStatus'], df_train[col], normalize='index').plot(
        kind='bar', stacked=True, ax=plt.gca(), colormap='tab20'
    )
    plt.title(f"Distribution of {col} by OwnerStatus")
    plt.xlabel("OwnerStatus (0=Negative, 1=Positive)")
    plt.ylabel("Proportion")
    plt.legend(title=col, bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.show()

df_train.drop('OwnerStatus', axis=1, inplace=True)

<div class="alert alert-block alert-danger">

Negatives and positives exhibit similar behavior in almost all features, except for tax and engine size, where the difference is quite significant.

Therefore, the population of positives and negatives cannot be equated.

We will treat *negatives* and *zeros* as NaNs. 

</div>
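
Turning the invalid values into NaN can be done with `Series.where`, which keeps values where the condition holds and writes NaN elsewhere. A minimal sketch on made-up values (not applied to `df_train` here):

```python
import numpy as np
import pandas as pd

# Made-up previousOwners values, including a negative, a zero and a missing one
owners = pd.Series([2.0, -1.0, 0.0, 3.0, np.nan])

# Keep strictly positive values; negatives, zeros and existing NaNs end up as NaN
owners_clean = owners.where(owners > 0)
```

The same pattern would apply to any column with a known valid range by changing the condition.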

#### *paintQuality%*

In [None]:
print(f"Proportion of observations outside of a meaningful range: {round((df_train['paintQuality%'] > 100).mean(), 5)}")

In [None]:
df_train [df_train ['paintQuality%']>100] ['paintQuality%'].unique()

##### Visualization

In [None]:
df_train['PaintQualityStatus'] = np.where(df_train['paintQuality%'] <= 100, 1, 0)
    #1 if paintQuality%<=100 , 0 if paintQuality%>100

In [None]:
categorical_features = ['Brand_cleaned', 'model','fuelType_cleaned', 'transmission_cleaned', 'hasDamage']
not_used=['Brand', 'transmission', 'fuelType', 'model_cleaned', 'paintQuality%', 'PaintQualityStatus']
numeric_features = df_train.columns.drop(categorical_features + not_used)

In [None]:
#Numeric Variables
for col in numeric_features:
    if col not in df_train.columns:
        continue
    plt.figure(figsize=(7, 4))
    df_train.boxplot(column=col, by='PaintQualityStatus', grid=False, ax=plt.gca())  # draw on the figure created above
    plt.title(f"{col} by PaintQualityStatus")
    plt.suptitle("")
    plt.xlabel("PaintQualityStatus (0=Not Valid, 1=Valid)")
    plt.ylabel(col)
    plt.xticks([1, 2], ['Not Valid', 'Valid'])
    plt.tight_layout()
    plt.show()

#Categorical Variables
for col in categorical_features:
    if col not in df_train.columns:
        continue
    plt.figure(figsize=(7, 4))
    pd.crosstab(df_train['PaintQualityStatus'], df_train[col], normalize='index').plot(
        kind='bar', stacked=True, ax=plt.gca(), colormap='tab20'
    )
    plt.title(f"Distribution of {col} by PaintQualityStatus")
    plt.xlabel("PaintQualityStatus (0=Not Valid, 1=Valid)")
    plt.ylabel("Proportion")
    plt.legend(title=col, bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.show()

Looking at the box plots and bar charts, there is no apparent reason for these invalid observations.

##### Let us analyze these observations in more detail: 

In [None]:
#Average paint quality by brand
df_train.groupby('Brand_cleaned')['paintQuality%'].mean().sort_values(ascending=False)

No brand stands out.

In [None]:
#Already seen in the histogram, but just to confirm.
print('Of the invalid observations, how many are in each brand (in %):\n')
print(df_train[df_train['PaintQualityStatus'] == 0]['Brand_cleaned'].value_counts(normalize=True).sort_values(ascending=False))
print('\nOf the valid observations, how many are in each brand (in %):\n')
print(df_train[df_train['PaintQualityStatus'] == 1]['Brand_cleaned'].value_counts(normalize=True).sort_values(ascending=False))

We confirm the distribution is very similar. 

Let's compare the valid and invalid observations within the Ford group:

In [None]:
df_train_temp = df_train[df_train['Brand_cleaned']=='FORD']  # cleaned brands are stored in upper case

#Numeric Variables
for col in numeric_features:
    if col not in df_train_temp.columns:
        continue
    plt.figure(figsize=(7, 4))
    df_train_temp.boxplot(column=col, by='PaintQualityStatus', grid=False, ax=plt.gca())  # draw on the figure created above
    plt.title(f"{col} by PaintQualityStatus - Ford Group")
    plt.suptitle("")
    plt.xlabel("PaintQualityStatus (0=Not Valid, 1=Valid)")
    plt.ylabel(col)
    plt.xticks([1, 2], ['Not Valid', 'Valid'])
    plt.tight_layout()
    plt.show()

#Categorical Variables
for col in categorical_features:
    if col not in df_train_temp.columns:
        continue
    plt.figure(figsize=(7, 4))
    pd.crosstab(df_train_temp['PaintQualityStatus'], df_train_temp[col], normalize='index').plot(
        kind='bar', stacked=True, ax=plt.gca(), colormap='tab20'
    )
    plt.title(f"Distribution of {col} by PaintQualityStatus- Ford Group")
    plt.xlabel("PaintQualityStatus (0=Not Valid, 1=Valid)")
    plt.ylabel("Proportion")
    plt.legend(title=col, bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.show()

In [None]:
df_train.drop('PaintQualityStatus', axis=1, inplace=True)

<div class="alert alert-block alert-danger">

I would consider the invalid observations as NaN's. 

</div>

#### *engineSize*

In [None]:
print(f"Proportion of negative sizes: {round((df_train['engineSize'] < 0).mean(), 5)}")
print(f"Proportion of size 0: {round((df_train['engineSize'] == 0).mean(), 5)}")

Comparing the positives with the negatives:

##### Visualization

In [None]:
df_train['EngineSizeStatus'] = np.where(df_train['engineSize'] >= 0, 1, 0)
    #1 if valid , 0 if invalid

In [None]:
categorical_features = ['Brand_cleaned', 'model','fuelType_cleaned', 'transmission_cleaned', 'hasDamage']
not_used=['Brand', 'transmission', 'fuelType', 'model_cleaned', 'engineSize', 'EngineSizeStatus']
numeric_features = df_train.columns.drop(categorical_features + not_used)

In [None]:
df_train_temp = df_train[df_train['engineSize']!=0]

#Numeric Variables
for col in numeric_features:
    if col not in df_train_temp.columns:
        continue
    plt.figure(figsize=(7, 4))
    df_train_temp.boxplot(column=col, by='EngineSizeStatus', grid=False, ax=plt.gca())  # draw on the figure created above
    plt.title(f"{col} by EngineSizeStatus")
    plt.suptitle("")
    plt.xlabel("EngineSizeStatus (0=Negative, 1=Positive)")
    plt.ylabel(col)
    plt.xticks([1, 2], ['Negative', 'Positive'])
    plt.tight_layout()
    plt.show()

#Categorical Variables
for col in categorical_features:
    if col not in df_train_temp.columns:
        continue
    plt.figure(figsize=(7, 4))
    pd.crosstab(df_train_temp['EngineSizeStatus'], df_train_temp[col], normalize='index').plot(
        kind='bar', stacked=True, ax=plt.gca(), colormap='tab20'
    )
    plt.title(f"Distribution of {col} by EngineSizeStatus")
    plt.xlabel("EngineSizeStatus (0=Negative, 1=Positive)")
    plt.ylabel("Proportion")
    plt.legend(title=col, bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.show()

df_train.drop('EngineSizeStatus', axis=1, inplace=True)

<div class="alert alert-block alert-danger">

The negatives and positives show similar behavior. Should we map the negative values to their positive counterparts (absolute value)?

Treat the zeros as NaNs.

</div>
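
A minimal sketch of what the note above could translate to, assuming that "correspondence" means mirroring negative sizes to their absolute value. The helper name `clean_engine_size` and the toy values are hypothetical and only illustrate the two steps.

```python
import numpy as np
import pandas as pd


def clean_engine_size(sizes: pd.Series) -> pd.Series:
    """Mirror negative sizes to positive values and mark zeros as missing."""
    # Assumption: a negative sign is a data-entry error, not a different car
    cleaned = sizes.abs()
    # A size of 0 litres is not meaningful, so it becomes NaN
    return cleaned.mask(cleaned == 0)


# Toy values standing in for df_train['engineSize']
toy_sizes = pd.Series([1.6, -2.0, 0.0, np.nan, 1.0])
cleaned_toy = clean_engine_size(toy_sizes)
print(cleaned_toy.tolist())
```

If the team decides against mirroring, replacing `sizes.abs()` with `sizes.where(sizes >= 0)` would turn the negatives into NaN instead.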

#### *mpg*

In [None]:
print(f"% of negative mpg: {(df_train['mpg'] < 0).mean():.3%}")
print(f"% of 0 mpg: {(df_train['mpg'] == 0).mean():.3%}")

In [None]:
mpg_out_train = (df_train['mpg'] < 10) | (df_train['mpg'] > 70)  # OR: a value cannot be below 10 and above 70 at once
mpg_out_test = (df_test['mpg'] < 10) | (df_test['mpg'] > 70)
print(f"% of observations outside of meaningful interval (10-70) - Train: {mpg_out_train.mean():.3%}")
print(f"% of observations outside of meaningful interval (10-70) - Test: {mpg_out_test.mean():.3%}")

<div class="alert alert-block alert-danger">

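Observations count as outside the range [10, 70] when `mpg < 10` or `mpg > 70`; a check that joins the two bounds with AND can never match and always reports 0. Confirm the result with the OR condition before concluding that no observations fall outside the range.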

</div>
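
A sketch of the range check on toy data, with the AND variant shown for contrast. The helper name `share_outside` and the toy values are hypothetical; the bounds 10 and 70 are the ones used in this section.

```python
import numpy as np
import pandas as pd


def share_outside(values: pd.Series, low: float, high: float) -> float:
    """Fraction of rows whose value is below `low` or above `high`.

    Missing values count as rows but are never flagged.
    """
    outside = (values < low) | (values > high)
    return float(outside.mean())


# Toy values standing in for df_train['mpg']
toy_mpg = pd.Series([5.0, 35.0, 55.4, 80.0, np.nan])

share_or = share_outside(toy_mpg, 10, 70)
# The AND version can never be true for a single value
share_and = float(((toy_mpg < 10) & (toy_mpg > 70)).mean())

print(f"OR condition:  {share_or:.1%}")
print(f"AND condition: {share_and:.1%}")
```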

In [None]:
# Electric cars have no meaningful mpg, so mark their mpg as unknown (NaN)
df_train.loc[df_train['fuelType']=='Electric', 'mpg'] = np.nan

Comparing the positives with the negatives:

##### Visualization:

In [None]:
# 1 if valid (non-negative), 0 if invalid (negative); NaN also falls into 0 because NaN >= 0 is False
df_train['mpgStatus'] = np.where(df_train['mpg'] >= 0, 1, 0)

In [None]:
categorical_features = ['Brand_cleaned', 'model','fuelType_cleaned', 'transmission_cleaned', 'hasDamage']
not_used=['Brand', 'transmission', 'fuelType', 'model_cleaned', 'mpg', 'mpgStatus']
numeric_features = df_train.columns.drop(categorical_features + not_used)

In [None]:
#Numeric Variables
for col in numeric_features:
    if col not in df_train.columns:
        continue
    plt.figure(figsize=(7, 4))
    # Pass ax so pandas draws on the figure created above instead of opening a new one
    df_train.boxplot(column=col, by='mpgStatus', grid=False, ax=plt.gca())
    plt.title(f"{col} by mpgStatus")
    plt.suptitle("")
    plt.xlabel("mpgStatus (0=Negative, 1=Positive)")
    plt.ylabel(col)
    plt.xticks([1, 2], ['Negative', 'Positive'])
    plt.tight_layout()
    plt.show()

#Categorical Variables
for col in categorical_features:
    if col not in df_train.columns:
        continue
    plt.figure(figsize=(7, 4))
    pd.crosstab(df_train['mpgStatus'], df_train[col], normalize='index').plot(
        kind='bar', stacked=True, ax=plt.gca(), colormap='tab20'
    )
    plt.title(f"Distribution of {col} by mpgStatus")
    plt.xlabel("mpgStatus (0=Negative, 1=Positive)")
    plt.ylabel("Proportion")
    plt.legend(title=col, bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.show()

The negative and positive groups are clearly not similar.

In [None]:
print('Of the invalid observations, how many are in each brand (in %):\n')
print(df_train[df_train['mpgStatus'] == 0]['Brand_cleaned'].value_counts(normalize=True).sort_values(ascending=False))

Most invalid observations are concentrated in the Ford and Mercedes groups.

In [None]:
df_train_temp = df_train[df_train['Brand_cleaned']=='Ford']
#Numeric Variables
for col in numeric_features:
    if col not in df_train_temp.columns:
        continue
    plt.figure(figsize=(7, 4))
    # Pass ax so pandas draws on the figure created above instead of opening a new one
    df_train_temp.boxplot(column=col, by='mpgStatus', grid=False, ax=plt.gca())
    plt.title(f"{col} by mpgStatus - Ford Group")
    plt.suptitle("")
    plt.xlabel("mpgStatus (0=Negative, 1=Positive)")
    plt.ylabel(col)
    plt.xticks([1, 2], ['Negative', 'Positive'])
    plt.tight_layout()
    plt.show()

#Categorical Variables
for col in categorical_features:
    if col not in df_train_temp.columns:
        continue
    plt.figure(figsize=(7, 4))
    pd.crosstab(df_train_temp['mpgStatus'], df_train_temp[col], normalize='index').plot(
        kind='bar', stacked=True, ax=plt.gca(), colormap='tab20'
    )
    plt.title(f"Distribution of {col} by mpgStatus - Ford Group")
    plt.xlabel("mpgStatus (0=Negative, 1=Positive)")
    plt.ylabel("Proportion")
    plt.legend(title=col, bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.show()

**FORD GROUP**

The range of years is wider for invalid observations, and their average price and mileage are slightly higher.

The invalid observations are concentrated in some specific models, so check again after cleaning the models.

In [None]:
df_train_temp = df_train[df_train['Brand_cleaned']=='Mercedes']
#Numeric Variables
for col in numeric_features:
    if col not in df_train_temp.columns:
        continue
    plt.figure(figsize=(7, 4))
    # Pass ax so pandas draws on the figure created above instead of opening a new one
    df_train_temp.boxplot(column=col, by='mpgStatus', grid=False, ax=plt.gca())
    plt.title(f"{col} by mpgStatus - Mercedes Group")
    plt.suptitle("")
    plt.xlabel("mpgStatus (0=Negative, 1=Positive)")
    plt.ylabel(col)
    plt.xticks([1, 2], ['Negative', 'Positive'])
    plt.tight_layout()
    plt.show()

#Categorical Variables
for col in categorical_features:
    if col not in df_train_temp.columns:
        continue
    plt.figure(figsize=(7, 4))
    pd.crosstab(df_train_temp['mpgStatus'], df_train_temp[col], normalize='index').plot(
        kind='bar', stacked=True, ax=plt.gca(), colormap='tab20'
    )
    plt.title(f"Distribution of {col} by mpgStatus - Mercedes Group")
    plt.xlabel("mpgStatus (0=Negative, 1=Positive)")
    plt.ylabel("Proportion")
    plt.legend(title=col, bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.show()

**MERCEDES GROUP**

The invalid observations are concentrated in some specific models, so check again after cleaning the models.
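
The same pair of plotting loops is repeated for the full data and for each brand group. A possible way to factor them out is sketched below; the function name `plot_by_status`, its parameters, and the toy data are hypothetical, not part of the original notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_by_status(df, status_col, numeric_cols, categorical_cols, title_suffix=""):
    """Compare the two status groups: boxplots for numeric columns,
    stacked proportion bars for categorical columns."""
    figures = []
    x_label = f"{status_col} (0=Negative, 1=Positive)"

    for col in numeric_cols:
        fig, ax = plt.subplots(figsize=(7, 4))
        # Passing ax keeps pandas from opening a second, separate figure
        df.boxplot(column=col, by=status_col, grid=False, ax=ax)
        fig.suptitle("")  # remove the automatic "Boxplot grouped by ..." title
        ax.set_title(f"{col} by {status_col}{title_suffix}")
        ax.set_xlabel(x_label)
        ax.set_ylabel(col)
        fig.tight_layout()
        figures.append(fig)

    for col in categorical_cols:
        fig, ax = plt.subplots(figsize=(7, 4))
        # Row-normalised crosstab: share of each category within a status group
        shares = pd.crosstab(df[status_col], df[col], normalize="index")
        shares.plot(kind="bar", stacked=True, ax=ax, colormap="tab20")
        ax.set_title(f"Distribution of {col} by {status_col}{title_suffix}")
        ax.set_xlabel(x_label)
        ax.set_ylabel("Proportion")
        ax.legend(title=col, bbox_to_anchor=(1.05, 1), loc="upper left")
        figures.append(fig)

    return figures


# Toy data standing in for a brand subset of df_train
toy = pd.DataFrame({
    "mpgStatus": [0, 0, 1, 1, 1],
    "price": [9000, 12000, 15000, 11000, 20000],
    "Brand_cleaned": ["Ford", "Ford", "Mercedes", "Ford", "Mercedes"],
})
figures = plot_by_status(toy, "mpgStatus", ["price"], ["Brand_cleaned"], " - toy data")
```

With such a helper, each brand section would reduce to one call on the filtered DataFrame, for example with `title_suffix=" - Ford Group"`.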

<div class="alert alert-block alert-danger">

The best option is to treat the invalid observations as NaNs.

Alternatively (after cleaning the models), try to impute these observations from their model group.

For the other brands, should we simply treat them as NaNs?

</div>
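
A minimal sketch of the second option in the note above, assuming the invalid values are first set to NaN and then filled with the median mpg of the same model. The column names `mpg_clean` and `mpg_imputed` and the toy rows are hypothetical.

```python
import pandas as pd

# Toy rows standing in for df_train after the models have been cleaned
toy = pd.DataFrame({
    "model": ["Focus", "Focus", "Focus", "C Class", "C Class"],
    "mpg": [60.0, -50.0, 50.0, 40.0, -40.0],
})

# Step 1: negative mpg values become NaN
toy["mpg_clean"] = toy["mpg"].where(toy["mpg"] >= 0)

# Step 2: fill each NaN with the median of the valid values of the same model
model_median = toy.groupby("model")["mpg_clean"].transform("median")
toy["mpg_imputed"] = toy["mpg_clean"].fillna(model_median)

print(toy)
```

In a real pipeline the medians would be computed on the training set only and then reused for the test set, so that no information leaks from the test data.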

In [None]:
df_train.drop('mpgStatus', axis=1, inplace=True)

#### *tax*

In [None]:
print(f"% of negative tax: {(df_train['tax'] < 0).mean():.3%}")
print(f"% of 0 tax: {(df_train['tax'] == 0).mean():.3%}")

Comparing the positives with the negatives:

##### Visualization:

In [None]:
# 1 if valid (non-negative), 0 if invalid (negative); NaN also falls into 0 because NaN >= 0 is False
df_train['taxStatus'] = np.where(df_train['tax'] >= 0, 1, 0)

In [None]:
categorical_features = ['Brand_cleaned', 'model','fuelType_cleaned', 'transmission_cleaned', 'hasDamage']
not_used=['Brand', 'transmission', 'fuelType', 'model_cleaned', 'tax', 'taxStatus']
numeric_features = df_train.columns.drop(categorical_features + not_used)

In [None]:
df_train_temp = df_train[df_train['tax']!=0]

#Numeric Variables
for col in numeric_features:
    if col not in df_train_temp.columns:
        continue
    plt.figure(figsize=(7, 4))
    # Pass ax so pandas draws on the figure created above instead of opening a new one
    df_train_temp.boxplot(column=col, by='taxStatus', grid=False, ax=plt.gca())
    plt.title(f"{col} by taxStatus")
    plt.suptitle("")
    plt.xlabel("taxStatus (0=Negative, 1=Positive)")
    plt.ylabel(col)
    plt.xticks([1, 2], ['Negative', 'Positive'])
    plt.tight_layout()
    plt.show()

#Categorical Variables
for col in categorical_features:
    if col not in df_train_temp.columns:
        continue
    plt.figure(figsize=(7, 4))
    pd.crosstab(df_train_temp['taxStatus'], df_train_temp[col], normalize='index').plot(
        kind='bar', stacked=True, ax=plt.gca(), colormap='tab20'
    )
    plt.title(f"Distribution of {col} by taxStatus")
    plt.xlabel("taxStatus (0=Negative, 1=Positive)")
    plt.ylabel("Proportion")
    plt.legend(title=col, bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.show()

df_train.drop('taxStatus', axis=1, inplace=True)

<div class="alert alert-block alert-danger">

The two groups are too different. Treat negatives and zeros as NaNs.

</div>

#### *mileage*

In [None]:
print(f"% of negative mileage: {(df_train['mileage'] < 0).mean():.3%}")
print(f"% of 0 mileage: {(df_train['mileage'] == 0).mean():.3%}")

Comparing the positives with the negatives:

##### Visualization:

In [None]:
# 1 if valid (non-negative), 0 if invalid (negative); NaN also falls into 0 because NaN >= 0 is False
df_train['mileageStatus'] = np.where(df_train['mileage'] >= 0, 1, 0)

In [None]:
categorical_features = ['Brand_cleaned', 'model','fuelType_cleaned', 'transmission_cleaned', 'hasDamage']
not_used=['Brand', 'transmission', 'fuelType', 'model_cleaned', 'mileage', 'mileageStatus']
numeric_features = df_train.columns.drop(categorical_features + not_used)

In [None]:
df_train_temp = df_train[df_train['mileage']!=0]

#Numeric Variables
for col in numeric_features:
    if col not in df_train_temp.columns:
        continue
    plt.figure(figsize=(7, 4))
    # Pass ax so pandas draws on the figure created above instead of opening a new one
    df_train_temp.boxplot(column=col, by='mileageStatus', grid=False, ax=plt.gca())
    plt.title(f"{col} by mileageStatus")
    plt.suptitle("")
    plt.xlabel("mileageStatus (0=Negative, 1=Positive)")
    plt.ylabel(col)
    plt.xticks([1, 2], ['Negative', 'Positive'])
    plt.tight_layout()
    plt.show()

#Categorical Variables
for col in categorical_features:
    if col not in df_train_temp.columns:
        continue
    plt.figure(figsize=(7, 4))
    pd.crosstab(df_train_temp['mileageStatus'], df_train_temp[col], normalize='index').plot(
        kind='bar', stacked=True, ax=plt.gca(), colormap='tab20'
    )
    plt.title(f"Distribution of {col} by mileageStatus")
    plt.xlabel("mileageStatus (0=Negative, 1=Positive)")
    plt.ylabel("Proportion")
    plt.legend(title=col, bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.show()

df_train.drop('mileageStatus', axis=1, inplace=True)

<div class="alert alert-block alert-danger">

The two groups are too different. Treat negatives as NaNs.

Are these errors related to the errors in tax and mpg?

</div>
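
One way to explore the question in the note above is to cross-tabulate the error flags and run a Chi-squared test of independence with `chi2_contingency`. This is a sketch on hypothetical toy data; the flag names are illustrative, and with counts this small the test itself is not reliable.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy rows standing in for df_train[['mileage', 'tax']]
toy = pd.DataFrame({
    "mileage": [-10, 5000, -20, 12000, 30000, -5, 8000, 15000],
    "tax": [-30, 145, -20, 150, 20, 125, -145, 30],
})

# One boolean flag per variable: True where the value is negative
flags = pd.DataFrame({
    "mileage_negative": toy["mileage"] < 0,
    "tax_negative": toy["tax"] < 0,
})

# 2x2 contingency table of the two flags
table = pd.crosstab(flags["mileage_negative"], flags["tax_negative"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.3f}, p-value={p_value:.3f}, dof={dof}")
```

A small p-value on the real data would suggest that negative mileage and negative tax tend to occur in the same rows; the same table can be built for the mpg flag.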

#### *year*

In [None]:
df_train['year'].unique()

In [None]:
df_train[df_train['year']==1970]
# These observations are outliers, and the second one is implausible (automatic transmission with year=1970), so drop them

In [None]:
df_train = df_train[df_train['year']!=1970]

In [None]:
df_train['year'].unique()

### Feature Engineering

### Feature Selection

## Modeling

## Deployment