# ETL
## Objectives and steps
The goal of this part of the project is to turn raw data from a single CSV file into a working relational database before continuing with further analysis in Python or SQL.

The main steps are:
1. Extract data from CSV file
2. Understand gaps in data and correct when possible
3. Define a sensible relationship diagram (ERD)
4. Split the data into separate dataframes and CSV files. The CSV files can be used to load the data into a more advanced database system such as PostgreSQL or MySQL.
5. Load data into an SQLite database

## Dependencies

In [1]:
import pandas as pd

## Extraction

In [2]:
vehicles_df = pd.read_csv('raw_data/vehicles.csv')

### Dataset information

In [3]:
vehicles_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 26 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   url           426880 non-null  object 
 2   region        426880 non-null  object 
 3   region_url    426880 non-null  object 
 4   price         426880 non-null  int64  
 5   year          425675 non-null  float64
 6   manufacturer  409234 non-null  object 
 7   model         421603 non-null  object 
 8   condition     252776 non-null  object 
 9   cylinders     249202 non-null  object 
 10  fuel          423867 non-null  object 
 11  odometer      422480 non-null  float64
 12  title_status  418638 non-null  object 
 13  transmission  424324 non-null  object 
 14  VIN           265838 non-null  object 
 15  drive         296313 non-null  object 
 16  size          120519 non-null  object 
 17  type          334022 non-null  object 
 18  pain

In [4]:
print('Missing data (%)')
missing_data_percent = 100* (len(vehicles_df) - vehicles_df.count())/len(vehicles_df)
missing_data_percent.sort_values(ascending=False).round(0)

Missing data (%)


county          100.0
size             72.0
cylinders        42.0
condition        41.0
VIN              38.0
drive            31.0
paint_color      31.0
type             22.0
manufacturer      4.0
title_status      2.0
lat               2.0
long              2.0
model             1.0
odometer          1.0
fuel              1.0
transmission      1.0
year              0.0
description       0.0
image_url         0.0
posting_date      0.0
url               0.0
price             0.0
state             0.0
region_url        0.0
region            0.0
id                0.0
dtype: float64

### Observations
- 426880 rows in total
- 1 identification column
- 25 columns containing information about listing, vehicle, price, location

Some data is missing, in particular:
- No information at all on county -> the column can be deleted
- 72% of size data missing, but this could be deducted from other similar vehicles
- 42% of info about number of cyclinders is missing, but this could be deducted from other similar vehicles
- 41% of info about condition is missing, which could be a problem for any model. This will only be specified as 'Unknown'
- 38% of VIN info is missing. The impact is TBD so this could only be specified as 'Unknown' for the time being.
- 31% of info about drive is missing, but this could be deducted from other similar vehicles
- 31% of paint color info is missing. This will only be specified as 'Unknown'
- 22% of vehicle type info is missing, but this could be deducted from other similar vehicles
- 4% of manufacturer information is missing but this could be deducted from the model if available.
- The other columns miss 2% of data or less

### Decisions
1. The county column will be drop (congtains no information)
2. The following columns will use a value of 'Unknown': condition, VIN, paint color
3. The following columns will use a value of 'TBC': size, cyclinders, drive, type, manufacturer
4. All the rows containing NA in the other columns will be dropped and a decision to use them or not will be made based on the size of the remaining dataset

## Transformation
### Cleaning

In [5]:
# List all columns
df_columns = vehicles_df.columns
print(df_columns)
print(f"--\nNumber of columns: {len(df_columns)}")

Index(['id', 'url', 'region', 'region_url', 'price', 'year', 'manufacturer',
       'model', 'condition', 'cylinders', 'fuel', 'odometer', 'title_status',
       'transmission', 'VIN', 'drive', 'size', 'type', 'paint_color',
       'image_url', 'description', 'county', 'state', 'lat', 'long',
       'posting_date'],
      dtype='object')
--
Number of columns: 26


In [6]:
# Columns to drop
df_columns_drop = ['county']

# Columns to change to 'Unknown'
df_columns_unknown = ['condition', 'VIN', 'paint_color']

# Column to change to 'TBC'
df_columns_tbc = ['size', 'cylinders', 'drive', 'type', 'manufacturer']

In [7]:
# Create a clean dataframe
cleaned_vehicles_df = vehicles_df.copy()

# Drop columns
cleaned_vehicles_df = vehicles_df.drop(columns=df_columns_drop)

In [8]:
# Assign value of Unknown or TBC to columns
na_to_uknown = {k: 'Unknown' for k in df_columns_unknown}
na_to_tbc = {k: 'TBC' for k in df_columns_tbc}

# Change NA to 'Unknown' and 'TBC'
cleaned_vehicles_df = cleaned_vehicles_df.fillna(na_to_uknown)
cleaned_vehicles_df = cleaned_vehicles_df.fillna(na_to_tbc)

In [9]:
# Check missing values
print('Missing data (%)')
missing_data_percent = 100* (len(cleaned_vehicles_df) - cleaned_vehicles_df.count())/len(cleaned_vehicles_df)
missing_data_percent.sort_values(ascending=False).round(0)

Missing data (%)


title_status    2.0
long            2.0
lat             2.0
model           1.0
odometer        1.0
fuel            1.0
transmission    1.0
year            0.0
description     0.0
image_url       0.0
posting_date    0.0
cylinders       0.0
url             0.0
condition       0.0
VIN             0.0
drive           0.0
size            0.0
type            0.0
paint_color     0.0
manufacturer    0.0
price           0.0
state           0.0
region_url      0.0
region          0.0
id              0.0
dtype: float64

In [10]:
# Drop all other NA
cleaned_vehicles_no_nan_df = cleaned_vehicles_df.dropna(how='any')

In [11]:
# Check remaining DataFrame
cleaned_vehicles_no_nan_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 399275 entries, 27 to 426879
Data columns (total 25 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            399275 non-null  int64  
 1   url           399275 non-null  object 
 2   region        399275 non-null  object 
 3   region_url    399275 non-null  object 
 4   price         399275 non-null  int64  
 5   year          399275 non-null  float64
 6   manufacturer  399275 non-null  object 
 7   model         399275 non-null  object 
 8   condition     399275 non-null  object 
 9   cylinders     399275 non-null  object 
 10  fuel          399275 non-null  object 
 11  odometer      399275 non-null  float64
 12  title_status  399275 non-null  object 
 13  transmission  399275 non-null  object 
 14  VIN           399275 non-null  object 
 15  drive         399275 non-null  object 
 16  size          399275 non-null  object 
 17  type          399275 non-null  object 
 18  pai

In [12]:
# Calculate data loss
original_size = len(vehicles_df)
new_size = len(cleaned_vehicles_no_nan_df)

print(f"Original dataset: {original_size:,.0f} rows")
print(f"Cleaned dataset: {new_size:,.0f} rows")
print(f"Deleted rows: {original_size-new_size:,.0f} rows ({100*(original_size-new_size)/original_size:.1f}%)")

Original dataset: 426,880 rows
Cleaned dataset: 399,275 rows
Deleted rows: 27,605 rows (6.5%)


### Conclusions on cleaning process
- All missing values have been addressed
- Potentially interesting missing values have been replaced by 'Unkonwn'
- Values that could be infered from other information have been marked 'TBC'
- Values that represented a small portion of a column's total row count (<2%) have been removed
- The remaining dataset is only 6.5% smaller than the original one

### Identify manufacturer based on model
We assume that some (if not all) rows with a known model but a missing manufacturer will have a corresponding number of rows where the same model is specified but with a defined manufacturer. Unless there are conflicts (e.g. same model name for different manufacturers), it will be possible to use the same manufacturer for the TBC row.

In [13]:
# Get a DataFrame of manufaturer and model where the manufacturer is TBC
tbc_manufacturer = cleaned_vehicles_no_nan_df.loc[cleaned_vehicles_no_nan_df['manufacturer']=='TBC',['manufacturer','model']]
tbc_manufacturer.head()

Unnamed: 0,manufacturer,model
97,TBC,Scion iM Hatchback 4D
122,TBC,blue bird bus
135,TBC,Scion iM Hatchback 4D
137,TBC,1966 C-30 1 ton
155,TBC,smart fortwo Passion Hatchback


In [14]:
# Get the list of unique models
models_w_tbc_manufacturer = tbc_manufacturer['model'].unique()
models_w_tbc_manufacturer

array(['Scion iM Hatchback 4D', 'blue bird bus', '1966 C-30 1 ton', ...,
       'ATI', '96 Suburban', 'Paige Glenbrook Touring'], dtype=object)

In [15]:
# Count the number of unique models
tbc_manufacturer['model'].nunique()

5759

In [16]:
# Count number of each unique model
tbc_manufacturer['model'].value_counts()

Scion iM Hatchback 4D             640
smart fortwo Passion Hatchback    168
Genesis G80 3.8 Sedan 4D          163
Freightliner Cascadia             153
International 4300                138
                                 ... 
F15O LIMITED EDITION                1
MITS. OUTLANDER SUV                 1
Mac Value                           1
Freightliner fl70                   1
Paige Glenbrook Touring             1
Name: model, Length: 5759, dtype: int64

In [17]:
# As a first test, look for the most common model with TBC manufacturer in the complete dataset
model_name = models_w_tbc_manufacturer[4]
model_name

'F-350'

In [18]:
# NOTE: this cell can take up to 2 min to run

count_manufacturers = []

for i, model_name in enumerate(models_w_tbc_manufacturer):

    print(f"{i} / {len(models_w_tbc_manufacturer)}", end='\r')

    # Retrieve a dataframe with the manufacturer and model for the specified model
    this_model_df = cleaned_vehicles_no_nan_df.loc[cleaned_vehicles_no_nan_df['model']==model_name,['manufacturer', 'model']]

    # List the number of unique manufacturer (we expect to get something other than TBC so a value greater than 1)
    count_manufacturers.append(this_model_df['manufacturer'].nunique())

5758 / 5759

In [19]:
# Get number of manufaturers >1
unknwon = len([c for c in count_manufacturers if c == 1])

# Get number of manufaturers = 2 (can be resolved)
identified = len([c for c in count_manufacturers if c == 2])

# Get number of manufaturers > 2 (cannot be resolved)
unresolved = len([c for c in count_manufacturers if c > 2])

print(f"{unknwon} manufacturers cannot be known (only 'TBC').")
print(f"{identified} manufacturers can be identifed (1 non-TBC value).")
print(f"{unresolved} manufacturers cannot be resolved (more than 1 non-TBC values).")

5623 manufacturers cannot be known (only 'TBC').
110 manufacturers can be identifed (1 non-TBC value).
26 manufacturers cannot be resolved (more than 1 non-TBC values).


In [20]:
# Get the indices of the manufacturers that can be identified
indices = [i for i, x in enumerate(count_manufacturers) if x == 2]

In [21]:
count_manufacturers = []

for i in indices:
    # Retrieve a dataframe with the manufacturer and model for the specified model
    this_model_df = cleaned_vehicles_no_nan_df.loc[cleaned_vehicles_no_nan_df['model']==models_w_tbc_manufacturer[i],['manufacturer', 'model']]

    # List the number of unique manufacturer (we expect to get something other than TBC so a value greater than 1)
    manufacturer_list = this_model_df['manufacturer'].unique().tolist()
    
    # Replace the manufacturer in the DataFrame by the non-TBC value
    for m in manufacturer_list:
        if m != 'TBC':
            cleaned_vehicles_no_nan_df.loc[cleaned_vehicles_no_nan_df['model']==models_w_tbc_manufacturer[i],'model'] = m

### Check that the manufacturers that could be identified are no longer marked as TBC

In [22]:
# NOTE: this cell can take up to 2 min to run

count_manufacturers = []

for i, model_name in enumerate(models_w_tbc_manufacturer):

    print(f"{i} / {len(models_w_tbc_manufacturer)}", end='\r')

    # Retrieve a dataframe with the manufacturer and model for the specified model
    this_model_df = cleaned_vehicles_no_nan_df.loc[cleaned_vehicles_no_nan_df['model']==model_name,['manufacturer', 'model']]

    # List the number of unique manufacturer (we expect to get something other than TBC so a value greater than 1)
    count_manufacturers.append(this_model_df['manufacturer'].nunique())

# Get number of manufaturers >1
unknwon = len([c for c in count_manufacturers if c == 1])

# Get number of manufaturers = 2 (can be resolved)
identified = len([c for c in count_manufacturers if c == 2])

# Get number of manufaturers > 2 (cannot be resolved)
unresolved = len([c for c in count_manufacturers if c > 2])

5758 / 5759

In [23]:
print(f"{unknwon} manufacturers cannot be known (only 'TBC').")
print(f"{identified} manufacturers can be identifed (1 non-TBC value).")
print(f"{unresolved} manufacturers cannot be resolved (more than 1 non-TBC values).")

5623 manufacturers cannot be known (only 'TBC').
0 manufacturers can be identifed (1 non-TBC value).
26 manufacturers cannot be resolved (more than 1 non-TBC values).


## Loading

### Prepare ERD diagram
We use this code to list all the columns, split them in the relevant tables and get the code to use on https://app.quickdatabasediagrams.com/

In [24]:
for c in cleaned_vehicles_no_nan_df.columns:
    print(c)

id
url
region
region_url
price
year
manufacturer
model
condition
cylinders
fuel
odometer
title_status
transmission
VIN
drive
size
type
paint_color
image_url
description
state
lat
long
posting_date


In [25]:
table_names = ['listing', 'location', 'vehicle', 'usage']

listing_columns = ['url', 'price', 'image_url', 'description', 'posting_date']
location_columns = ['region', 'region_url', 'state', 'lat', 'long']
vehicle_columns = ['manufacturer', 'year', 'model', 'cylinders', 'fuel', 'transmission', 'drive', 'size', 'type']
usage_columns = ['condition', 'odometer', 'title_status', 'VIN', 'paint_color']

### Create the location dataset

In [26]:
# Keep only the relevant columns
location_df = cleaned_vehicles_no_nan_df[location_columns]

# Use lowercase everywhere
location_df['region'] = location_df['region'].str.lower()
location_df['state'] = location_df['state'].str.lower()

location_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  location_df['region'] = location_df['region'].str.lower()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  location_df['state'] = location_df['state'].str.lower()


Unnamed: 0,region,region_url,state,lat,long
27,auburn,https://auburn.craigslist.org,al,32.590000,-85.480000
28,auburn,https://auburn.craigslist.org,al,32.590000,-85.480000
29,auburn,https://auburn.craigslist.org,al,32.590000,-85.480000
30,auburn,https://auburn.craigslist.org,al,32.590000,-85.480000
31,auburn,https://auburn.craigslist.org,al,32.592000,-85.518900
...,...,...,...,...,...
426875,wyoming,https://wyoming.craigslist.org,wy,33.786500,-84.445400
426876,wyoming,https://wyoming.craigslist.org,wy,33.786500,-84.445400
426877,wyoming,https://wyoming.craigslist.org,wy,33.779214,-84.411811
426878,wyoming,https://wyoming.craigslist.org,wy,33.786500,-84.445400


In [27]:
# Drop duplicates based on region, state and url
location_df = location_df.drop_duplicates(subset=['region','state', 'region_url'])
len(location_df)

426

In [28]:
location_df = location_df.sort_values('region').reset_index()
location_df['index'] = pd.Series([f"loc_{x}" for x in range(len(location_df))])

In [29]:
# Rename id column
location_df = location_df.rename(columns={'index': 'location_id'})

In [30]:
# Reorder columns
location_df = location_df[['location_id', 'region', 'state', 'lat', 'long', 'region_url']]
location_df

Unnamed: 0,location_id,region,state,lat,long,region_url
0,loc_0,abilene,tx,30.649697,-97.867069,https://abilene.craigslist.org
1,loc_1,akron / canton,oh,41.020675,-81.549647,https://akroncanton.craigslist.org
2,loc_2,albany,ny,42.652600,-73.756200,https://albany.craigslist.org
3,loc_3,albany,ga,31.570000,-84.170000,https://albanyga.craigslist.org
4,loc_4,albuquerque,nm,35.058537,-106.877873,https://albuquerque.craigslist.org
...,...,...,...,...,...,...
421,loc_421,york,pa,39.794300,-76.981200,https://york.craigslist.org
422,loc_422,youngstown,oh,41.277007,-80.842552,https://youngstown.craigslist.org
423,loc_423,yuba-sutter,ca,39.363221,-121.686417,https://yubasutter.craigslist.org
424,loc_424,yuma,az,32.709300,-114.490500,https://yuma.craigslist.org


### Create the vehicle dataset

In [31]:
# Keep only the relevant columns
vehicle_df = cleaned_vehicles_no_nan_df[vehicle_columns]
vehicle_df

Unnamed: 0,manufacturer,year,model,cylinders,fuel,transmission,drive,size,type
27,gmc,2014.0,sierra 1500 crew cab slt,8 cylinders,gas,other,TBC,TBC,pickup
28,chevrolet,2010.0,silverado 1500,8 cylinders,gas,other,TBC,TBC,pickup
29,chevrolet,2020.0,silverado 1500 crew,8 cylinders,gas,other,TBC,TBC,pickup
30,toyota,2017.0,tundra double cab sr,8 cylinders,gas,other,TBC,TBC,pickup
31,ford,2013.0,f-150 xlt,6 cylinders,gas,automatic,rwd,full-size,truck
...,...,...,...,...,...,...,...,...,...
426875,nissan,2019.0,maxima s sedan 4d,6 cylinders,gas,other,fwd,TBC,sedan
426876,volvo,2020.0,s60 t5 momentum sedan 4d,TBC,gas,other,fwd,TBC,sedan
426877,cadillac,2020.0,xt4 sport suv 4d,TBC,diesel,other,TBC,TBC,hatchback
426878,lexus,2018.0,es 350 sedan 4d,6 cylinders,gas,other,fwd,TBC,sedan


In [32]:
def lowercase_column(df, col_names):
    """ Set a given column in a DataFrame to lowercase

    Args:
        - df (DataFrame): DataFrame to operate on
        - col_name (str): Column to change

    Returns:
        - DataFrame: updated DataFrame (new object)
    """

    # Create a copy of the DataFrame
    df_out = df.copy()

    for col_name in col_names:
        # Use lowercase everywhere
        df_out[col_name] = df[col_name].str.lower()

    return df_out

In [33]:
cols =['manufacturer', 'model', 'cylinders', 'fuel', 'transmission', 'drive', 'size', 'type']
vehicle_df = lowercase_column(vehicle_df, cols)

In [34]:
len(vehicle_df)

399275

In [35]:
# Drop duplicates
vehicle_df = vehicle_df.drop_duplicates()
len(vehicle_df)

147208

In [36]:
vehicle_df.sort_values('manufacturer')

Unnamed: 0,manufacturer,year,model,cylinders,fuel,transmission,drive,size,type
2418,acura,2006.0,tsx,tbc,gas,automatic,tbc,tbc,tbc
28344,acura,2003.0,mdx,tbc,gas,automatic,4wd,tbc,suv
22529,acura,2003.0,cl type s,6 cylinders,gas,manual,fwd,compact,coupe
114015,acura,2020.0,tlx,6 cylinders,gas,other,4wd,compact,sedan
62919,acura,2018.0,rdx sport utility 4d,6 cylinders,gas,other,fwd,tbc,other
...,...,...,...,...,...,...,...,...,...
96169,volvo,2013.0,xc90 3.2 sport utility 4d,tbc,gas,other,fwd,tbc,suv
96165,volvo,2012.0,s60 t6,6 cylinders,gas,automatic,4wd,tbc,sedan
11518,volvo,2016.0,xc90 t6 momentum sport,tbc,gas,other,tbc,tbc,other
11458,volvo,2016.0,s60,4 cylinders,gas,automatic,fwd,tbc,tbc


In [37]:
# Change year from float to int
vehicle_df['year'] = vehicle_df['year'].astype('int')

In [38]:
vehicle_df = vehicle_df.sort_values('model')
vehicle_df

Unnamed: 0,manufacturer,year,model,cylinders,fuel,transmission,drive,size,type
317864,ford,1913,"""t""",4 cylinders,gas,other,rwd,compact,convertible
384480,mercedes-benz,2018,"$362.47, $1000 down, oac, 2.9%apr $362.47,luxu...",4 cylinders,gas,automatic,rwd,tbc,sedan
5847,tbc,2002,%,6 cylinders,gas,automatic,fwd,full-size,mini-van
396112,chevrolet,2006,& altima,6 cylinders,gas,automatic,4wd,full-size,mini-van
161362,tbc,1950,'50 business coupe,6 cylinders,gas,automatic,rwd,tbc,coupe
...,...,...,...,...,...,...,...,...,...
140276,kia,2014,​​sorento lx,tbc,gas,automatic,tbc,tbc,tbc
214747,tbc,2020,♦all tades welcome!♦,tbc,gas,other,tbc,tbc,tbc
139801,chrysler,2006,♿ vmi,6 cylinders,gas,automatic,fwd,full-size,mini-van
99400,tbc,2004,𝓜𝓮𝓻𝓬𝓮𝓭𝓮𝓼 𝓫𝓮𝓷𝔃 𝓶𝓵 350,6 cylinders,gas,automatic,tbc,full-size,suv


In [39]:
# Reorder columns
vehicle_df = vehicle_df[['model', 'manufacturer', 'year', 'cylinders', 'fuel', 'transmission', 'drive', 'size', 'type']].reset_index()
vehicle_df

Unnamed: 0,index,model,manufacturer,year,cylinders,fuel,transmission,drive,size,type
0,317864,"""t""",ford,1913,4 cylinders,gas,other,rwd,compact,convertible
1,384480,"$362.47, $1000 down, oac, 2.9%apr $362.47,luxu...",mercedes-benz,2018,4 cylinders,gas,automatic,rwd,tbc,sedan
2,5847,%,tbc,2002,6 cylinders,gas,automatic,fwd,full-size,mini-van
3,396112,& altima,chevrolet,2006,6 cylinders,gas,automatic,4wd,full-size,mini-van
4,161362,'50 business coupe,tbc,1950,6 cylinders,gas,automatic,rwd,tbc,coupe
...,...,...,...,...,...,...,...,...,...,...
147203,140276,​​sorento lx,kia,2014,tbc,gas,automatic,tbc,tbc,tbc
147204,214747,♦all tades welcome!♦,tbc,2020,tbc,gas,other,tbc,tbc,tbc
147205,139801,♿ vmi,chrysler,2006,6 cylinders,gas,automatic,fwd,full-size,mini-van
147206,99400,𝓜𝓮𝓻𝓬𝓮𝓭𝓮𝓼 𝓫𝓮𝓷𝔃 𝓶𝓵 350,tbc,2004,6 cylinders,gas,automatic,tbc,full-size,suv


In [208]:
import hashlib

def get_n_hashed(input_str, n):
    return hashlib.sha256(input_str.encode('utf-8')).hexdigest()[:n]

def get_vehicle_hash(df, index_name):
    # Create custom vehicle id
    # Combine the vehicles columns together
    df[index_name] = df['manufacturer'] \
        + df['year'].astype('int').astype('str') \
        + df['model'] \
        + df['cylinders'] \
        + df['fuel'] \
        + df['transmission'] \
        + df['drive'] \
        + df['size'] \
        + df['type']

    hashed_list = []

    for index, row in df.iterrows():
        hashed_list.append(get_n_hashed(row[index_name], 10))

    df[index_name] = hashed_list
    return df

def get_listing_hash(df, index_name):
    # Create custom listing id based on URL and posing date
    # Combine the listing columns together
    df[index_name] = df['url'] + df['posting_date'].astype('str').str.lower()


    hashed_list = []

    for index, row in df.iterrows():
        hashed_list.append(get_n_hashed(row[index_name], 10))

    df[index_name] = hashed_list
    return df

In [127]:
# Rename id column
vehicle_df = vehicle_df.rename(columns={'index': 'vehicle_id'})

vehicle_df = get_vehicle_hash(vehicle_df, 'vehicle_id')
vehicle_df

Unnamed: 0,vehicle_id,model,manufacturer,year,cylinders,fuel,transmission,drive,size,type
0,700354c8d7,"""t""",ford,1913,4 cylinders,gas,other,rwd,compact,convertible
1,9c1a74d8f4,"$362.47, $1000 down, oac, 2.9%apr $362.47,luxu...",mercedes-benz,2018,4 cylinders,gas,automatic,rwd,tbc,sedan
2,2fb4869cb3,%,tbc,2002,6 cylinders,gas,automatic,fwd,full-size,mini-van
3,c88e828be2,& altima,chevrolet,2006,6 cylinders,gas,automatic,4wd,full-size,mini-van
4,ac93697e04,'50 business coupe,tbc,1950,6 cylinders,gas,automatic,rwd,tbc,coupe
...,...,...,...,...,...,...,...,...,...,...
147203,1176fe1964,​​sorento lx,kia,2014,tbc,gas,automatic,tbc,tbc,tbc
147204,d2ac883a1f,♦all tades welcome!♦,tbc,2020,tbc,gas,other,tbc,tbc,tbc
147205,1936b8f3c4,♿ vmi,chrysler,2006,6 cylinders,gas,automatic,fwd,full-size,mini-van
147206,c9238a440a,𝓜𝓮𝓻𝓬𝓮𝓭𝓮𝓼 𝓫𝓮𝓷𝔃 𝓶𝓵 350,tbc,2004,6 cylinders,gas,automatic,tbc,full-size,suv


In [128]:
# Check that ID is unique
nb_rows = vehicle_df['vehicle_id'].size
nb_ids = vehicle_df['vehicle_id'].unique().size
print(f"Unique ID: {nb_rows - nb_ids == 0} ({nb_ids:,.0f} unique ID for {nb_rows:,.0f} rows)")

Unique ID: True (147,208 unique ID for 147,208 rows)


### Create the listing dataset
- Step 1: replace location data by location id (normalisation)
- Step 2: keep only listing data
- Setp 3: add listing_id

In [209]:
# Display location data
location_df.head()

Unnamed: 0,location_id,region,state,lat,long,region_url
0,loc_0,abilene,tx,30.649697,-97.867069,https://abilene.craigslist.org
1,loc_1,akron / canton,oh,41.020675,-81.549647,https://akroncanton.craigslist.org
2,loc_2,albany,ny,42.6526,-73.7562,https://albany.craigslist.org
3,loc_3,albany,ga,31.57,-84.17,https://albanyga.craigslist.org
4,loc_4,albuquerque,nm,35.058537,-106.877873,https://albuquerque.craigslist.org


In [210]:
# Display listing data
listing_df = cleaned_vehicles_no_nan_df[location_columns+listing_columns]
listing_df.head()

Unnamed: 0,region,region_url,state,lat,long,url,price,image_url,description,posting_date
27,auburn,https://auburn.craigslist.org,al,32.59,-85.48,https://auburn.craigslist.org/ctd/d/auburn-uni...,33590,https://images.craigslist.org/00R0R_lwWjXSEWNa...,Carvana is the safer way to buy a car During t...,2021-05-04T12:31:18-0500
28,auburn,https://auburn.craigslist.org,al,32.59,-85.48,https://auburn.craigslist.org/ctd/d/auburn-uni...,22590,https://images.craigslist.org/00R0R_lwWjXSEWNa...,Carvana is the safer way to buy a car During t...,2021-05-04T12:31:08-0500
29,auburn,https://auburn.craigslist.org,al,32.59,-85.48,https://auburn.craigslist.org/ctd/d/auburn-uni...,39590,https://images.craigslist.org/01212_jjirIWa0y0...,Carvana is the safer way to buy a car During t...,2021-05-04T12:31:25-0500
30,auburn,https://auburn.craigslist.org,al,32.59,-85.48,https://auburn.craigslist.org/ctd/d/auburn-uni...,30990,https://images.craigslist.org/00x0x_1y9kIOzGCF...,Carvana is the safer way to buy a car During t...,2021-05-04T10:41:31-0500
31,auburn,https://auburn.craigslist.org,al,32.592,-85.5189,https://auburn.craigslist.org/cto/d/auburn-uni...,15000,https://images.craigslist.org/00404_l4loxHvdQe...,2013 F-150 XLT V6 4 Door. Good condition. Leve...,2021-05-03T14:02:03-0500


In [211]:
len(listing_df)

399275

In [212]:
# Use lower case
cols =['state', 'region', 'region_url']
listing_df = lowercase_column(listing_df, cols)

In [213]:
# Create an empty column for the location id
listing_df['location_id'] = ''

# Iterate through location_df
for index, row in location_df.iterrows():
    # Add location id based on region, state and region_url values
    listing_df.loc[
        (listing_df['region'] == row['region']) & (listing_df['state'] == row['state']) & (listing_df['region_url'] == row['region_url']),
        'location_id'] = row['location_id']

In [214]:
# Check that all location id have been assigned
print('No empty location_id:')
len(listing_df.loc[listing_df['location_id'] == ''][['region', 'location_id']]) == 0

No empty location_id:


True

In [216]:
location_columns

['region', 'region_url', 'state', 'lat', 'long']

In [217]:
# Drop location columns
listing_df = listing_df.drop(columns=location_columns)

In [218]:
# Reorder columns
listing_df = listing_df[['posting_date', 'price', 'location_id', 'url', 'image_url', 'description']].sort_values('posting_date').reset_index()
listing_df.head()

Unnamed: 0,index,posting_date,price,location_id,url,image_url,description
0,40980,2021-04-04T00:10:40-0700,1000,loc_236,https://monterey.craigslist.org/cto/d/soledad-...,https://images.craigslist.org/01717_fZzCp0a4sv...,2003 Chevy Venture Handy Cap Van for sale run...
1,69115,2021-04-04T01:03:25-0700,6500,loc_396,https://ventura.craigslist.org/cto/d/newbury-p...,https://images.craigslist.org/00B0B_dLRfQD689D...,"2002 F150 Ford original owner 211,000 original..."
2,263503,2021-04-04T01:10:12-0600,24900,loc_324,https://santafe.craigslist.org/ctd/d/denver-20...,https://images.craigslist.org/00Q0Q_l2MEYifGtL...,"2012 *Toyota* *Sienna* LE - $24,900Call or Tex..."
3,82872,2021-04-04T01:10:23-0600,24900,loc_411,https://westslope.craigslist.org/ctd/d/denver-...,https://images.craigslist.org/00Q0Q_l2MEYifGtL...,"2012 *Toyota* *Sienna* LE - $24,900Call or Tex..."
4,246131,2021-04-04T01:10:34-0600,24900,loc_328,https://scottsbluff.craigslist.org/ctd/d/denve...,https://images.craigslist.org/00Q0Q_l2MEYifGtL...,"2012 *Toyota* *Sienna* LE - $24,900Call or Tex..."


In [219]:
# Rename id column
listing_df = listing_df.rename(columns={'index': 'listing_id'})

# Create hashed id
listing_df = get_listing_hash(listing_df, 'listing_id')

listing_df

Unnamed: 0,listing_id,posting_date,price,location_id,url,image_url,description
0,5307a40771,2021-04-04T00:10:40-0700,1000,loc_236,https://monterey.craigslist.org/cto/d/soledad-...,https://images.craigslist.org/01717_fZzCp0a4sv...,2003 Chevy Venture Handy Cap Van for sale run...
1,c1f374abaa,2021-04-04T01:03:25-0700,6500,loc_396,https://ventura.craigslist.org/cto/d/newbury-p...,https://images.craigslist.org/00B0B_dLRfQD689D...,"2002 F150 Ford original owner 211,000 original..."
2,c5652c42d3,2021-04-04T01:10:12-0600,24900,loc_324,https://santafe.craigslist.org/ctd/d/denver-20...,https://images.craigslist.org/00Q0Q_l2MEYifGtL...,"2012 *Toyota* *Sienna* LE - $24,900Call or Tex..."
3,d0596f03ec,2021-04-04T01:10:23-0600,24900,loc_411,https://westslope.craigslist.org/ctd/d/denver-...,https://images.craigslist.org/00Q0Q_l2MEYifGtL...,"2012 *Toyota* *Sienna* LE - $24,900Call or Tex..."
4,046f52dd0c,2021-04-04T01:10:34-0600,24900,loc_328,https://scottsbluff.craigslist.org/ctd/d/denve...,https://images.craigslist.org/00Q0Q_l2MEYifGtL...,"2012 *Toyota* *Sienna* LE - $24,900Call or Tex..."
...,...,...,...,...,...,...,...
399270,1707d72611,2021-05-04T23:05:52-0400,32489,loc_385,https://tricities.craigslist.org/ctd/d/milliga...,https://images.craigslist.org/00d0d_c3XK6uvJEP...,Most common questions about this vehicle: Wan...
399271,f0d6c3af87,2021-05-04T23:07:07-0500,4000,loc_228,https://milwaukee.craigslist.org/cto/d/milwauk...,https://images.craigslist.org/00P0P_4DM4IyYWOE...,Really good car had it since 2009.It’s a daily...
399272,ae377fc4db,2021-05-04T23:12:08-0500,8975,loc_180,https://racine.craigslist.org/ctd/d/franksvill...,https://images.craigslist.org/00n0n_beXYXJBJOd...,VIEW MORE INVENTORY AND PICS OF THIS 2010 FORD...
399273,1155ec05a0,2021-05-04T23:24:09-0500,8975,loc_228,https://milwaukee.craigslist.org/ctd/d/franksv...,https://images.craigslist.org/00x0x_10HAOGNVki...,VIEW MORE INVENTORY AND PICS OF THIS 2010 FORD...


In [220]:
# Check that ID is unique
nb_rows = listing_df['listing_id'].size
nb_ids = listing_df['listing_id'].unique().size
print(f"Unique ID: {nb_rows - nb_ids == 0} ({nb_ids:,.0f} unique ID for {nb_rows:,.0f} rows)")

Unique ID: True (399,275 unique ID for 399,275 rows)


### Create the usage dataset

In [221]:
# Create a new DataFrame, drop the location columns (already normalised within listing)
usage_df = cleaned_vehicles_no_nan_df.drop(columns=location_columns)
usage_df.head(2)

Unnamed: 0,id,url,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,VIN,drive,size,type,paint_color,image_url,description,posting_date
27,7316814884,https://auburn.craigslist.org/ctd/d/auburn-uni...,33590,2014.0,gmc,sierra 1500 crew cab slt,good,8 cylinders,gas,57923.0,clean,other,3GTP1VEC4EG551563,TBC,TBC,pickup,white,https://images.craigslist.org/00R0R_lwWjXSEWNa...,Carvana is the safer way to buy a car During t...,2021-05-04T12:31:18-0500
28,7316814758,https://auburn.craigslist.org/ctd/d/auburn-uni...,22590,2010.0,chevrolet,silverado 1500,good,8 cylinders,gas,71229.0,clean,other,1GCSCSE06AZ123805,TBC,TBC,pickup,blue,https://images.craigslist.org/00R0R_lwWjXSEWNa...,Carvana is the safer way to buy a car During t...,2021-05-04T12:31:08-0500


In [222]:
# List columns
usage_df.columns

Index(['id', 'url', 'price', 'year', 'manufacturer', 'model', 'condition',
       'cylinders', 'fuel', 'odometer', 'title_status', 'transmission', 'VIN',
       'drive', 'size', 'type', 'paint_color', 'image_url', 'description',
       'posting_date'],
      dtype='object')

In [223]:
for v in vehicle_columns:
    print(f"'{v}', ")

'manufacturer', 
'year', 
'model', 
'cylinders', 
'fuel', 
'transmission', 
'drive', 
'size', 
'type', 


In [224]:
# Use lower case in vehicle columns
cols =['manufacturer', 'model', 'cylinders', 'fuel', 'transmission', 'drive', 'size', 'type']
usage_df = lowercase_column(usage_df, cols)

In [225]:
for v in listing_columns:
    print(f"'{v}', ")

'url', 
'price', 
'image_url', 
'description', 
'posting_date', 


In [226]:
# Use lower case in listing columns
cols =['url', 'image_url', 'description', 'posting_date']
usage_df = lowercase_column(usage_df, cols)

#### Using bins

In [227]:
listing_columns

['url', 'price', 'image_url', 'description', 'posting_date']

In [228]:
# Create a copy to experiment
usage_x_df = usage_df.copy()

# Use lower case for vehicle columns
cols =['manufacturer', 'model', 'cylinders', 'fuel', 'transmission', 'drive', 'size', 'type']
usage_x_df = lowercase_column(usage_x_df, cols)

# Create hash ids for vehicle
usage_x_df = get_vehicle_hash(usage_x_df, 'vehicle_id')

# Drop the vehicle columns
usage_x_df = usage_x_df.drop(columns=vehicle_columns)


In [229]:
# Create hashed id
usage_x_df = get_listing_hash(usage_x_df, 'listing_id')

In [230]:
usage_x_df.head(2)

Unnamed: 0,id,url,price,condition,odometer,title_status,VIN,paint_color,image_url,description,posting_date,vehicle_id,listing_id
27,7316814884,https://auburn.craigslist.org/ctd/d/auburn-uni...,33590,good,57923.0,clean,3GTP1VEC4EG551563,white,https://images.craigslist.org/00r0r_lwwjxsewna...,carvana is the safer way to buy a car during t...,2021-05-04t12:31:18-0500,20cdca5c84,2050e4a22d
28,7316814758,https://auburn.craigslist.org/ctd/d/auburn-uni...,22590,good,71229.0,clean,1GCSCSE06AZ123805,blue,https://images.craigslist.org/00r0r_lwwjxsewna...,carvana is the safer way to buy a car during t...,2021-05-04t12:31:08-0500,392342d70b,6bcecf8ffb


In [234]:
# Drop the listing columns
usage_x_df = usage_x_df.drop(columns=listing_columns)

### Check that all vehicle id and listing ids are valid

In [240]:
import numpy as np

# Number of test runs
runs = 10000

all_vehicles_id = vehicle_df['vehicle_id'].to_list()
all_listing_id = listing_df['listing_id'].to_list()

In [241]:
# Get random indices and check ID validity
nb_records = len(usage_x_df['vehicle_id'].to_list())

for i in np.random.randint(0, nb_records, runs):
    if usage_x_df['vehicle_id'].iloc[i] not in all_vehicles_id:
        print(f"ID {usage_x_df['vehicle_id'].iloc[i]} not found.")

print('Test completed.')

Test completed.


In [242]:
# Get random indices and check ID validity
nb_records = len(usage_x_df['listing_id'].to_list())

for i in np.random.randint(0, nb_records, runs):
    if usage_x_df['listing_id'].iloc[i] not in all_listing_id:
        print(f"ID {usage_x_df['listing_id'].iloc[i]} not found.")

print('Test completed.')

Test completed.


### Save tables as CSV

#### Usage

In [244]:
usage_x_df.head()

Unnamed: 0,id,condition,odometer,title_status,VIN,paint_color,vehicle_id,listing_id
27,7316814884,good,57923.0,clean,3GTP1VEC4EG551563,white,20cdca5c84,2050e4a22d
28,7316814758,good,71229.0,clean,1GCSCSE06AZ123805,blue,392342d70b,6bcecf8ffb
29,7316814989,good,19160.0,clean,3GCPWCED5LG130317,red,41b558090c,8f577e58f9
30,7316743432,good,41124.0,clean,5TFRM5F17HX120972,red,37bedcde79,f41e6df83c
31,7316356412,excellent,128000.0,clean,Unknown,black,878bdd10fd,426e5571c2


In [245]:
usage_x_df.to_csv('cleaned_csv/usage.csv', index=False)

#### Listings

In [246]:
listing_df.head()

Unnamed: 0,listing_id,posting_date,price,location_id,url,image_url,description
0,5307a40771,2021-04-04T00:10:40-0700,1000,loc_236,https://monterey.craigslist.org/cto/d/soledad-...,https://images.craigslist.org/01717_fZzCp0a4sv...,2003 Chevy Venture Handy Cap Van for sale run...
1,c1f374abaa,2021-04-04T01:03:25-0700,6500,loc_396,https://ventura.craigslist.org/cto/d/newbury-p...,https://images.craigslist.org/00B0B_dLRfQD689D...,"2002 F150 Ford original owner 211,000 original..."
2,c5652c42d3,2021-04-04T01:10:12-0600,24900,loc_324,https://santafe.craigslist.org/ctd/d/denver-20...,https://images.craigslist.org/00Q0Q_l2MEYifGtL...,"2012 *Toyota* *Sienna* LE - $24,900Call or Tex..."
3,d0596f03ec,2021-04-04T01:10:23-0600,24900,loc_411,https://westslope.craigslist.org/ctd/d/denver-...,https://images.craigslist.org/00Q0Q_l2MEYifGtL...,"2012 *Toyota* *Sienna* LE - $24,900Call or Tex..."
4,046f52dd0c,2021-04-04T01:10:34-0600,24900,loc_328,https://scottsbluff.craigslist.org/ctd/d/denve...,https://images.craigslist.org/00Q0Q_l2MEYifGtL...,"2012 *Toyota* *Sienna* LE - $24,900Call or Tex..."


In [247]:
listing_df.to_csv('cleaned_csv/listing.csv', index=False)

#### Vehicles

In [248]:
vehicle_df.head()

Unnamed: 0,vehicle_id,model,manufacturer,year,cylinders,fuel,transmission,drive,size,type
0,700354c8d7,"""t""",ford,1913,4 cylinders,gas,other,rwd,compact,convertible
1,9c1a74d8f4,"$362.47, $1000 down, oac, 2.9%apr $362.47,luxu...",mercedes-benz,2018,4 cylinders,gas,automatic,rwd,tbc,sedan
2,2fb4869cb3,%,tbc,2002,6 cylinders,gas,automatic,fwd,full-size,mini-van
3,c88e828be2,& altima,chevrolet,2006,6 cylinders,gas,automatic,4wd,full-size,mini-van
4,ac93697e04,'50 business coupe,tbc,1950,6 cylinders,gas,automatic,rwd,tbc,coupe


In [249]:
vehicle_df.to_csv('cleaned_csv/vehicle.csv', index=False)

#### Locations

In [250]:
location_df.head()

Unnamed: 0,location_id,region,state,lat,long,region_url
0,loc_0,abilene,tx,30.649697,-97.867069,https://abilene.craigslist.org
1,loc_1,akron / canton,oh,41.020675,-81.549647,https://akroncanton.craigslist.org
2,loc_2,albany,ny,42.6526,-73.7562,https://albany.craigslist.org
3,loc_3,albany,ga,31.57,-84.17,https://albanyga.craigslist.org
4,loc_4,albuquerque,nm,35.058537,-106.877873,https://albuquerque.craigslist.org


In [251]:
location_df.to_csv('cleaned_csv/location.csv', index=False)

#### Compare size

In [261]:
print(f"Location: {len(location_df):,.0f} rows")
print(f"Vehicle: {len(vehicle_df):,.0f} rows")
print(f"Lisiting: {len(listing_df):,.0f} rows")
print(f"Usage: {len(usage_x_df):,.0f} rows")

Location: 426 rows
Vehicle: 147,208 rows
Lisiting: 399,275 rows
Usage: 399,275 rows


## Complete schemas

In [288]:
def change_type(col_type):

    col_type = 'VARCHAR' if col_type == 'object' else col_type
    col_type = 'FLOAT' if col_type == 'float64' else col_type
    col_type = 'INT' if col_type == 'int64' else col_type

    return col_type

In [309]:
def get_schema(df, table_name):
    print(table_name)
    print('--')
    for col in df.columns:
        col_type = df[col].dtype

        col_type = change_type(col_type)

        print(f"{col} {col_type}")

In [313]:
get_schema(location_df, 'location')
print('')
get_schema(vehicle_df, 'vehicle')
print('')
get_schema(usage_x_df, 'usage')
print('')
get_schema(listing_df, 'listing')

location
--
location_id VARCHAR
region VARCHAR
state VARCHAR
lat FLOAT
long FLOAT
region_url VARCHAR

vehicle
--
vehicle_id VARCHAR
model VARCHAR
manufacturer VARCHAR
year int32
cylinders VARCHAR
fuel VARCHAR
transmission VARCHAR
drive VARCHAR
size VARCHAR
type VARCHAR

usage
--
id INT
condition VARCHAR
odometer FLOAT
title_status VARCHAR
VIN VARCHAR
paint_color VARCHAR
vehicle_id VARCHAR
listing_id VARCHAR

listing
--
listing_id VARCHAR
posting_date VARCHAR
price INT
location_id VARCHAR
url VARCHAR
image_url VARCHAR
description VARCHAR


# Load to SQLite

In [1]:
# Engine
from sqlalchemy import create_engine

# ORM session tool
from sqlalchemy.orm import Session

# Get data types used in columns (class attributes)
from sqlalchemy import Column, Integer, String, Float, Boolean

# Get a preset of system classes
from sqlalchemy.ext.declarative import declarative_base

In [2]:
Base = declarative_base()

## Create classes for each table

In [None]:
class Location(Base):
    __tablename__ = "location"
    location_id = Column(String, primary_key = True)
    region = Column(String)
    state = Column(String)
    lat = Column(Float)
    long = Column(Float)
    region_url = Column(String)

In [None]:
class Listing(Base):
    __tablename__ = "listing"
    listing_id = Column(String, primary_key = True)
    posting_date = Column(String)
    price = Column(Integer)
    location_id = Column(String)
    url = Column(String)
    image_url = Column(String)
    description = Column(String)

In [28]:
class Vehicle(Base):
    __tablename__ = "vehicle"
    vehicle_id = Column(String, primary_key = True)
    model = Column(String)
    manufacturer = Column(String)
    year = Column(Integer)
    cylinders = Column(String)
    fuel = Column(String)
    transmission = Column(String)
    drive = Column(String)
    size = Column(String)
    type = Column(String)

## Open session

In [33]:
# Create database connection
# NOTE: Create the database if it does not exist
engine = create_engine("sqlite:///database/UsedCars.sqlite")

# Create all tables and columns (from class)
# NOTE: If the table already exists, it will use the existing table
Base.metadata.create_all(engine)

# Start session
session = Session(bind=engine)

## Add records

In [21]:
# Open CSV file
import pandas as pd

location_df = pd.read_csv('cleaned_csv/location.csv')

In [22]:
print(location_df.columns)

Index(['location_id', 'region', 'state', 'lat', 'long', 'region_url'], dtype='object')


In [23]:
nb_rows = len(location_df)
counter = 0

for index, row in location_df.iterrows():
    counter += 1
    print(f"{counter} / {nb_rows}", end='\r')

    new_record = Location(
        location_id = row['location_id'],
        region = row['region'],
        state = row['state'],
        lat = row['lat'],
        long = row['long'],
        region_url = row['region_url']
    )

    # Add new instance to session
    session.add(new_record)

# Commit all changes to database
session.commit()

426 / 426

In [24]:
listing_df = pd.read_csv('cleaned_csv/listing.csv')
print(listing_df.columns)

Index(['listing_id', 'posting_date', 'price', 'location_id', 'url',
       'image_url', 'description'],
      dtype='object')


In [25]:
nb_rows = len(listing_df)
counter = 0

for index, row in listing_df.iterrows():
    counter += 1
    print(f"{counter} / {nb_rows}", end='\r')

    new_record = Listing(
        listing_id = row['listing_id'],
        posting_date = row['posting_date'],
        price = row['price'],
        location_id = row['location_id'],
        url = row['url'],
        image_url = row['image_url'],
        description = row['description']
    )

    # Add new instance to session
    session.add(new_record)

399275 / 399275

In [34]:
# Commit all changes to database
session.commit()

In [29]:
vehicle_df = pd.read_csv('cleaned_csv/vehicle.csv')
print(vehicle_df.columns)

Index(['vehicle_id', 'model', 'manufacturer', 'year', 'cylinders', 'fuel',
       'transmission', 'drive', 'size', 'type'],
      dtype='object')


In [35]:
nb_rows = len(vehicle_df)
counter = 0

for index, row in vehicle_df.iterrows():
    counter += 1
    print(f"{counter} / {nb_rows}", end='\r')

    new_record = Vehicle(
        vehicle_id = row['vehicle_id'],
        model = row['model'],
        manufacturer = row['manufacturer'],
        year = row['year'],
        cylinders = row['cylinders'],
        fuel =row['fuel'],
        transmission = row['transmission'],
        drive = row['drive'],
        size = row['size'],
        type = row['type']
    )

    # Add new instance to session
    session.add(new_record)

147208 / 147208

In [31]:
# Commit all changes to database
session.commit()

OperationalError: (sqlite3.OperationalError) no such table: vehicle
[SQL: INSERT INTO vehicle (vehicle_id, model, manufacturer, year, cylinders, fuel, transmission, drive, size, type) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)]
[parameters: (('700354c8d7', '"t"', 'ford', 1913, '4 cylinders', 'gas', 'other', 'rwd', 'compact', 'convertible'), ('9c1a74d8f4', '$362.47, $1000 down, oac, 2.9%apr $362.47,luxury low miles $1000 down, only 40k miles', 'mercedes-benz', 2018, '4 cylinders', 'gas', 'automatic', 'rwd', 'tbc', 'sedan'), ('2fb4869cb3', '%', 'tbc', 2002, '6 cylinders', 'gas', 'automatic', 'fwd', 'full-size', 'mini-van'), ('c88e828be2', '& altima', 'chevrolet', 2006, '6 cylinders', 'gas', 'automatic', '4wd', 'full-size', 'mini-van'), ('ac93697e04', "'50 business coupe", 'tbc', 1950, '6 cylinders', 'gas', 'automatic', 'rwd', 'tbc', 'coupe'), ('b172cc1453', "'99 h1 hummer", 'tbc', 1999, '8 cylinders', 'diesel', 'automatic', '4wd', 'full-size', 'offroad'), ('bba6f233d9', '(210)', 'chevrolet', 1955, '8 cylinders', 'gas', 'automatic', 'rwd', 'full-size', 'sedan'), ('0f4d61c8cf', '(300)', 'chrysler', 2006, '8 cylinders', 'gas', 'automatic', '4wd', 'full-size', 'other')  ... displaying 10 of 147208 total bound parameter sets ...  ('c9238a440a', '𝓜𝓮𝓻𝓬𝓮𝓭𝓮𝓼 𝓫𝓮𝓷𝔃 𝓶𝓵 350', 'tbc', 2004, '6 cylinders', 'gas', 'automatic', 'tbc', 'full-size', 'suv'), ('691ea1b3aa', '🔥gmc sierra 1500 sle🔥 4x4 🔥', 'tbc', 2004, '8 cylinders', 'gas', 'automatic', '4wd', 'tbc', 'truck'))]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

## Close session

In [32]:
# Close Session
session.close()