# USING REGULAR EXPRESSIONS IN FEATURE ENGINEERING - USED CAR PRICE PREDICTION

# Introduction

Regular expressions are a useful tool for extracting structured data from unstructured text, which makes them well suited to the feature engineering stage of data science projects. They allow concise, flexible pattern matching and reduce the need for long chains of string-splitting code. This simplifies data cleaning and transformation and makes the code easier to read and maintain.

Regular expressions can also pull several distinct features out of a single text field, giving a model more informative inputs. Because one pattern can cover many variations of a text format, they tend to shorten the feature engineering process.

# 1. Import Libraries

In [6]:
import pandas as pd
import re

# 2. Load Data

In [8]:
df_main_train = pd.read_csv("train.csv")
df = df_main_train.copy()
df.columns

Index(['id', 'brand', 'model', 'model_year', 'milage', 'fuel_type', 'engine',
       'transmission', 'ext_col', 'int_col', 'accident', 'clean_title',
       'price'],
      dtype='object')

# 3. Practice in 01 - Regression Problem - Used Car Price Prediction

## 3.1. Extracting "HP" data from "engine" column

In [11]:
# # practice in 01 - Regression Problem - Used Car Price Prediciton.ipynb:
# #df["hp"] = df["engine"].apply(lambda x: x.split(" ")[0] if x.split(" ")[0][-2:]=="HP" else np.nan)
# df["hp"] = df[df["hp"].notnull()].hp.apply(lambda x: x.split(".")[0])
# hp_temp = df[df["hp"].notnull()]
# hp_temp["hp"] = hp_temp["hp"].astype(int)
# hp_temp2 = hp_temp.groupby("model")[["hp"]].mean().reset_index()
# df = pd.merge(df, hp_temp2, on="model", how = "left")
# df.loc[df['hp_x'].isna(), 'hp_x'] = df['hp_y']
# df.rename(columns={'hp_x': 'hp'}, inplace=True)
# df.drop("hp_y", axis=1 , inplace=True)
# hp_temp3 = df[df.hp.notna()]
# hp_temp3["hp"] = hp_temp3.hp.astype(float)
# hp_temp4 = hp_temp3.groupby("brand")[["hp"]].mean().reset_index()
# df = pd.merge(df, hp_temp4, on="brand", how = "left")
# df.loc[df['hp_x'].isna(), 'hp_x'] = df['hp_y']
# df.rename(columns={'hp_x': 'hp'}, inplace=True)
# df.drop("hp_y", axis=1 , inplace=True)
# hp_temp5 = df[df.hp.notna()]
# hp_temp5["hp"] = hp_temp5.hp.astype(float)
# hp_temp6 = hp_temp5.groupby("brand_category")[["hp"]].mean().reset_index()
# df = pd.merge(df, hp_temp6, on="brand_category", how = "left")
# df.loc[df['hp_x'].isna(), 'hp_x'] = df['hp_y']
# df.rename(columns={'hp_x': 'hp'}, inplace=True)
# df.drop("hp_y", axis=1 , inplace=True)
# df.hp = df.hp.astype(float)
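
The commented-out cell above takes the leading token when it ends in "HP" and then fills the gaps with the mean per model, then per brand. The same idea can be sketched more compactly with a regex extraction followed by `groupby(...).transform("mean")`. This is a minimal sketch under stated assumptions: the `toy` frame and its engine strings are illustrative, `fill_hp_by_group` is a hypothetical helper name, and the `brand_category` step is left out because that column is not among the `train.csv` columns listed earlier.

```python
import pandas as pd

# Toy rows in the style of the "engine" column; the values are illustrative only.
toy = pd.DataFrame({
    "brand": ["MINI", "MINI", "Lincoln", "Lincoln"],
    "model": ["Cooper S Base", "Cooper S Base", "LS V8", "Navigator"],
    "engine": [
        "172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel",
        "1.6L I4 16V GDI DOHC Turbo",
        "252.0HP 3.9L 8 Cylinder Engine Gasoline Fuel",
        "5.4L V8",
    ],
})

# Capture the number directly in front of "HP"; rows without a match become NaN.
hp_text = toy["engine"].str.extract(r"(\d+(?:\.\d+)?)\s*HP", expand=False)
toy["hp"] = pd.to_numeric(hp_text)

def fill_hp_by_group(frame: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Fill missing hp with the group mean, trying each key in order (hypothetical helper)."""
    out = frame.copy()
    for key in keys:
        # transform("mean") returns one value per row, aligned with the original index
        group_mean = out.groupby(key)["hp"].transform("mean")
        out["hp"] = out["hp"].fillna(group_mean)
    return out

filled = fill_hp_by_group(toy, ["model", "brand"])
print(filled[["model", "hp"]])
```

Using `transform` avoids the merge, rename, and drop steps, because the group means come back already aligned with the rows they belong to.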

## 3.2. Extracting "cylinder" data from "engine" column

In [13]:
# df["cylinder"] = df["engine"].apply(lambda x: x.split(" ")[1] if len(x.split(" ")) > 1 and x.split(" ")[1][-1:] == "L"
#                                     else (x.split(" ")[0] if len(x.split(" ")) > 1 and x.split(" ")[0][-1:] == "L" 
#                                           else (x if x[-1:]=="L" else np.nan)))
# df.engine.str.contains("Liter").sum()
# for index, value in df.engine.items():
#     if "Liter" in value:
#         df.at[index, 'cylinder'] = value.split(" ")[0]
# df['cylinder'] = df['cylinder'].str.replace('L', '', regex=False)
# df['cylinder'] = df['cylinder'].astype(float)
# df.cylinder.fillna(df[df["cylinder"].notna()].cylinder.mean(), inplace=True)
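
The cell above locates the displacement by token position and treats the "Liter" spelling as a special case. A single pattern can cover both forms. The sketch below uses only the standard library; the `LITERS` and `extract_liters` names are hypothetical, and the "3.0 Liter" style string is assumed from the `"Liter" in value` check rather than taken from the data.

```python
import re

# A number followed by "L" or "Liter(s)"; the trailing \b stops a digit followed
# by some other word starting with "l" from being read as a displacement.
LITERS = re.compile(r"(\d+(?:\.\d+)?)\s*(?:liters?|l)\b", re.IGNORECASE)

def extract_liters(engine: str):
    """Return the displacement in liters as a float, or None when nothing matches."""
    match = LITERS.search(engine)
    return float(match.group(1)) if match else None

samples = [
    "172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel",
    "3.0 Liter Turbo",   # assumed format, for illustration
    "Electric Motor",    # assumed format, for illustration
]
print([extract_liters(s) for s in samples])  # [1.6, 3.0, None]
```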

# 4. Using Regular Expression 

In [15]:
import re
import pandas as pd

def decode_engine(s: str):
    """Parse an engine description into numeric and boolean features."""
    # Missing or non-string values have no .lower(); return empty features instead
    if not isinstance(s, str):
        return None, None, None, False, False, False, False
    s = s.lower()

    # Extract horsepower, e.g. "172.0hp"
    hp_match = re.search(r'(\d+(?:\.\d+)?)\s*hp', s)
    engine_hp = float(hp_match.group(1)) if hp_match else None

    # Extract displacement in liters, e.g. "1.6l"
    # (kept under the name engine_cc, although the unit is liters rather than cc)
    cc_match = re.search(r'(\d+(?:\.\d+)?)\s*l', s)
    engine_cc = float(cc_match.group(1)) if cc_match else None

    # Extract cylinder count, e.g. "4 cylinder"
    cyl_match = re.search(r'(\d+)\s*cylinder', s)
    engine_cyl = int(cyl_match.group(1)) if cyl_match else None

    # Boolean feature extraction (turbo, flex fuel, hybrid, electric)
    features = {
        "turbo": re.search(r'turbo', s) is not None,
        "flex_fuel": re.search(r'flex', s) is not None,  # "flex" also covers "flex fuel"
        "hybrid": re.search(r'hybrid', s) is not None,
        "electric": re.search(r'electric', s) is not None
    }

    return engine_hp, engine_cc, engine_cyl, features["turbo"], features["flex_fuel"], features["hybrid"], features["electric"]

# Applying the function to the DataFrame
df[['engine_hp', 'engine_cc', 'engine_cyl', 'engine_turbo', 'engine_flexfuel',
    'engine_hybrid', 'electric']] = df['engine'].apply(lambda x: pd.Series(decode_engine(x)))
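
Building a `pd.Series` for every row inside `apply` can be slow on a large frame. A possible alternative is to let the pandas `.str` accessor do the matching column by column, which also leaves missing values as NaN. This is a sketch under stated assumptions: the `toy` frame and its strings are illustrative, and the patterns mirror the ones in `decode_engine` without being tested against the full dataset.

```python
import re
import pandas as pd

# Illustrative rows, including a missing value.
toy = pd.DataFrame({"engine": [
    "172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel",
    "300.0HP 5.0L 8 Cylinder Engine Flex Fuel",
    "Electric Motor",
    None,
]})

engine = toy["engine"]
num = r"(\d+(?:\.\d+)?)"  # integer or decimal number, captured as the only group

# str.extract returns the captured text (NaN when there is no match);
# pd.to_numeric then converts it to float.
toy["engine_hp"] = pd.to_numeric(
    engine.str.extract(num + r"\s*hp", flags=re.IGNORECASE, expand=False))
toy["engine_cc"] = pd.to_numeric(
    engine.str.extract(num + r"\s*(?:liters?|l)\b", flags=re.IGNORECASE, expand=False))
toy["engine_cyl"] = pd.to_numeric(
    engine.str.extract(r"(\d+)\s*cylinder", flags=re.IGNORECASE, expand=False))

# Keyword flags; na=False maps missing engine strings to False.
flag_patterns = {
    "engine_turbo": "turbo",
    "engine_flexfuel": "flex",
    "engine_hybrid": "hybrid",
    "electric": "electric",
}
for column, pattern in flag_patterns.items():
    toy[column] = engine.str.contains(pattern, case=False, na=False)

print(toy.drop(columns="engine"))
```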

In [16]:
df.columns

Index(['id', 'brand', 'model', 'model_year', 'milage', 'fuel_type', 'engine',
       'transmission', 'ext_col', 'int_col', 'accident', 'clean_title',
       'price', 'engine_hp', 'engine_cc', 'engine_cyl', 'engine_turbo',
       'engine_flexfuel', 'engine_hybrid', 'electric'],
      dtype='object')

In [17]:
len(df.columns)

20

In [18]:
df.head()

| | id | brand | model | model_year | milage | fuel_type | engine | transmission | ext_col | int_col | accident | clean_title | price | engine_hp | engine_cc | engine_cyl | engine_turbo | engine_flexfuel | engine_hybrid | electric |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | MINI | Cooper S Base | 2007 | 213000 | Gasoline | 172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel | A/T | Yellow | Gray | None reported | Yes | 4200 | 172.0 | 1.6 | 4.0 | False | False | False | False |
| 1 | 1 | Lincoln | LS V8 | 2002 | 143250 | Gasoline | 252.0HP 3.9L 8 Cylinder Engine Gasoline Fuel | A/T | Silver | Beige | At least 1 accident or damage reported | Yes | 4999 | 252.0 | 3.9 | 8.0 | False | False | False | False |
| 2 | 2 | Chevrolet | Silverado 2500 LT | 2002 | 136731 | E85 Flex Fuel | 320.0HP 5.3L 8 Cylinder Engine Flex Fuel Capab... | A/T | Blue | Gray | None reported | Yes | 13900 | 320.0 | 5.3 | 8.0 | False | True | False | False |
| 3 | 3 | Genesis | G90 5.0 Ultimate | 2017 | 19500 | Gasoline | 420.0HP 5.0L 8 Cylinder Engine Gasoline Fuel | Transmission w/Dual Shift Mode | Black | Black | None reported | Yes | 45000 | 420.0 | 5.0 | 8.0 | False | False | False | False |
| 4 | 4 | Mercedes-Benz | Metris Base | 2021 | 7388 | Gasoline | 208.0HP 2.0L 4 Cylinder Engine Gasoline Fuel | 7-Speed A/T | Black | Beige | None reported | Yes | 97500 | 208.0 | 2.0 | 4.0 | False | False | False | False |
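
In the `df.head()` output above, `engine_cyl` is displayed as 4.0 and 8.0 even though `decode_engine` returns an `int`. A likely explanation is that rows without a match return `None`, which pandas stores as NaN and which forces the column to float. If whole-number cylinder counts are preferred, one option is the nullable `Int64` dtype, sketched here on a small stand-in Series (the values are illustrative):

```python
import pandas as pd

# A None among integers makes pandas fall back to float64: 4.0, 8.0, NaN
cyl = pd.Series([4, 8, None])
print(cyl.dtype)       # float64

# The nullable integer dtype keeps whole numbers and represents the gap as <NA>
cyl_int = cyl.astype("Int64")
print(cyl_int.dtype)   # Int64
```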


# 5. Conclusion

By using regular expressions in feature engineering, we extracted more detailed information from the engine column. Previously, only horsepower (HP) and engine displacement (stored in the cylinder column) were captured, using positional string splitting followed by mean imputation.

The regex-based approach also extracts the number of cylinders (engine_cyl), the presence of a turbocharger (engine_turbo), flex-fuel capability (engine_flexfuel), and whether the engine is hybrid (engine_hybrid) or electric (electric).

This expanded feature set gives the model more signal to work with, which may reduce the risk of underfitting. The extraction also takes fewer lines of code, which makes it easier to read and maintain. Unlike the earlier cells, decode_engine does not impute missing values, so that step still has to be handled separately.
