# Exploring eBay Car Sales Data

In this project, we will work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

The original dataset and data dictionary can be found [here](https://www.kaggle.com/orgesleka/used-cars-database/data), but the version used in this project has been modified as follows:

* 50,000 data points were sampled from the full dataset of more than 370,000 observations
* The dataset was "dirtied" a bit to more closely resemble what we would expect from a scraped dataset. The original version of the dataset was cleaned before being uploaded to Kaggle.

The purpose of this project is to clean the data and analyze the included used car listings.

In [1]:
import numpy as np
import pandas as pd

In [2]:
autos = pd.read_csv("autos.csv", encoding="Latin-1")

In [3]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

In [4]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [5]:
autos.isnull().sum()

dateCrawled               0
name                      0
seller                    0
offerType                 0
price                     0
abtest                    0
vehicleType            5095
yearOfRegistration        0
gearbox                2680
powerPS                   0
model                  2758
odometer                  0
monthOfRegistration       0
fuelType               4482
brand                     0
notRepairedDamage      9829
dateCreated               0
nrOfPictures              0
postalCode                0
lastSeen                  0
dtype: int64

We can see that several columns have missing values: "vehicleType", "gearbox", "model", "fuelType", and "notRepairedDamage". Many of the values appear to be written in German. Additionally, the column names are written in camelCase rather than snake_case. We will fix these issues below.

First, we will convert the column names to snake_case and reword some of the column names to make them more descriptive.
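To put the missing-value counts in proportion, the fraction of missing entries per column can be computed directly. For example, 9,829 missing values out of 50,000 means roughly 20% of rows lack "notRepairedDamage". The following is a minimal sketch on a made-up toy frame (the `toy` data is an assumption for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", None],
    "gearbox": ["manuell", "automatik", None, "manuell"],
    "brand": ["ford", "bmw", "opel", "audi"],
})

# isnull() yields a boolean frame; the column-wise mean of booleans
# is the fraction of missing values in each column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real dataframe, the same expression would be `autos.isnull().mean()`.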

In [6]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [7]:
new_columns = ["date_crawled", "name", "seller", "offer_type", "price", "abtest", "vehicle_type", "registration_year",
               "gearbox", "power_ps", "model", "odometer", "registration_month", "fuel_type", "brand",
               "unrepaired_damage", "ad_created", "number_of_pictures", "postal_code", "last_seen"]
autos.columns = new_columns
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


After making some column names more descriptive and changing all of them from camelCase to snake_case, the column names are much easier to read and understand.
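The renaming above was done by hand, which is needed for the reworded names (for example "yearOfRegistration" became "registration_year"). The purely mechanical camelCase to snake_case step could also be automated. Here is a small sketch using a regular expression; the helper name `camel_to_snake` is hypothetical and not part of the original notebook:

```python
import re

def camel_to_snake(name):
    """Convert a camelCase identifier to snake_case."""
    # Insert "_" at every boundary where a lowercase letter or digit
    # is followed by an uppercase letter, then lowercase everything.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

original = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "abtest"]
converted = [camel_to_snake(col) for col in original]
print(converted)
```

Names that need rewording would still have to be mapped manually after this step.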

Now, let's do some basic data exploration to determine what other cleaning tasks need to be done.

In [8]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-11 22:38:16,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


* It appears that the 'price' and 'odometer' columns are numeric values stored as text. We will need to convert these columns to numeric data types.
* The 'seller' and 'offer_type' columns seem like they could be dropped from the dataframe - each contains only 2 unique values and all but 1 row have the same value.
* The 'registration_year', 'registration_month', 'number_of_pictures', and 'postal_code' columns need to be investigated further. Because they are numeric, describe() reports NaN for their number of unique values.

First, let's investigate 'registration_year' and 'registration_month'.
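Since `describe()` leaves the "unique" row empty for numeric columns, `nunique()` is one way to get the count for every column at once. A minimal sketch on invented toy data (the values below are assumptions for illustration only):

```python
import pandas as pd

# Toy frame mixing numeric and text columns; values are invented.
toy = pd.DataFrame({
    "registration_month": [3, 6, 0, 3],
    "number_of_pictures": [0, 0, 0, 0],
    "seller": ["privat", "privat", "gewerblich", "privat"],
})

# nunique() counts distinct non-null values per column,
# regardless of whether the column is numeric or text.
unique_counts = toy.nunique()
print(unique_counts)
```

A column with a single unique value, like "number_of_pictures" here, carries no information and is a candidate for dropping.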

In [9]:
print(autos["registration_year"].unique().shape)
autos["registration_year"].unique()

(97,)


array([2004, 1997, 2009, 2007, 2003, 2006, 1995, 1998, 2000, 2017, 2010,
       1999, 1982, 1990, 2015, 2014, 1996, 1992, 2005, 2002, 2012, 2011,
       2008, 1985, 2016, 1994, 1986, 2001, 2018, 2013, 1972, 1993, 1988,
       1989, 1967, 1973, 1956, 1976, 4500, 1987, 1991, 1983, 1960, 1969,
       1950, 1978, 1980, 1984, 1963, 1977, 1961, 1968, 1934, 1965, 1971,
       1966, 1979, 1981, 1970, 1974, 1910, 1975, 5000, 4100, 2019, 1959,
       9996, 9999, 6200, 1964, 1958, 1800, 1948, 1931, 1943, 9000, 1941,
       1962, 1927, 1937, 1929, 1000, 1957, 1952, 1111, 1955, 1939, 8888,
       1954, 1938, 2800, 5911, 1500, 1953, 1951, 4800, 1001], dtype=int64)

In [10]:
print(autos["registration_month"].unique().shape)
autos["registration_month"].unique()

(13,)


array([ 3,  6,  7,  4,  8, 12, 10,  0,  9, 11,  5,  2,  1], dtype=int64)

* A few values in the "registration_year" column look suspect - 4500, 5000, 4100, 9996, 9999, 6200, 1800, 9000, 1000, 1111, 8888, 2800, 4800, 1001
* The "registration_month" column has 13 unique values rather than the expected 12. It appears some rows have a value of 0 for registration month.

Next, let's investigate the "number_of_pictures" and "postal_code" columns.

In [11]:
print(autos["number_of_pictures"].unique().shape)
autos["number_of_pictures"].unique()

(1,)


array([0], dtype=int64)

In [12]:
print(autos["postal_code"].unique().shape)
autos["postal_code"].unique()

(7014,)


array([79588, 71034, 35394, ..., 34317, 97502, 84385], dtype=int64)

* The "number_of_pictures" column appears to contain nothing but 0 for all entries. This column can be dropped.
* The "postal_code" column does not appear to have any obvious issues: there are over 7,000 unique values, and we don't know enough about German postal codes to tell whether any of them are invalid.
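One detail worth noting: German postal codes are five digits long and can begin with 0, so storing them as integers drops the leading zero (the minimum of 1067 reported by describe() is consistent with this). A hedged sketch of restoring the five-digit form, using a few illustrative values:

```python
import pandas as pd

# Illustrative values; 1067 stands in for a code that lost its leading zero.
postal = pd.Series([79588, 1067, 35394])

# Convert to text and left-pad with zeros to a fixed width of five.
postal_str = postal.astype(str).str.zfill(5)
print(postal_str.tolist())
```

This is optional for the analysis that follows, which does not use the postal codes.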

Next, we will clean the "price" and "odometer" columns by removing non-numeric characters and converting the columns to a numeric data type.

In [13]:
autos["price"].head(10)

0    $5,000
1    $8,500
2    $8,990
3    $4,350
4    $1,350
5    $7,900
6      $300
7    $1,990
8      $250
9      $590
Name: price, dtype: object

In [14]:
autos["price"] = (autos["price"]
                  .str.replace("$", "", regex=False)  # "$" is a regex anchor, so match it literally
                  .str.replace(",", "", regex=False)
                  .astype(float)
                 )

In [15]:
autos["price"].head(10)

0    5000.0
1    8500.0
2    8990.0
3    4350.0
4    1350.0
5    7900.0
6     300.0
7    1990.0
8     250.0
9     590.0
Name: price, dtype: float64

In [16]:
autos["odometer"].head(10)

0    150,000km
1    150,000km
2     70,000km
3     70,000km
4    150,000km
5    150,000km
6    150,000km
7    150,000km
8    150,000km
9    150,000km
Name: odometer, dtype: object

In [17]:
autos["odometer"] = (autos["odometer"]
                  .str.replace("km", "")
                  .str.replace(",", "")
                  .astype(float)
                 )

In [18]:
autos["odometer"].head(10)

0    150000.0
1    150000.0
2     70000.0
3     70000.0
4    150000.0
5    150000.0
6    150000.0
7    150000.0
8    150000.0
9    150000.0
Name: odometer, dtype: float64

In [19]:
autos.rename({"odometer":"odometer_km"}, axis=1, inplace=True)
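The "price" and "odometer" columns were cleaned with the same pattern: strip a few literal substrings, then convert to a number. That pattern could be wrapped in a helper. The sketch below is one possible version; the function name `to_number` is hypothetical, and it uses `pd.to_numeric(..., errors="coerce")` so that unparseable entries become NaN instead of raising, which is a different choice from the `astype(float)` used above:

```python
import pandas as pd

def to_number(series, strip_chars):
    """Remove each literal substring in strip_chars, then parse as numeric."""
    cleaned = series
    for chars in strip_chars:
        # regex=False treats characters such as "$" literally.
        cleaned = cleaned.str.replace(chars, "", regex=False)
    # Entries that still cannot be parsed become NaN.
    return pd.to_numeric(cleaned, errors="coerce")

price = to_number(pd.Series(["$5,000", "$300", "n/a"]), ["$", ","])
odometer = to_number(pd.Series(["150,000km", "70,000km"]), ["km", ","])
print(price.tolist())
print(odometer.tolist())
```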

Now, let's take a closer look at the "price" and "odometer_km" columns.

In [20]:
autos["odometer_km"].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [21]:
autos["odometer_km"].unique().shape

(13,)

In [22]:
autos["odometer_km"].value_counts()

150000.0    32424
125000.0     5170
100000.0     2169
90000.0      1757
80000.0      1436
70000.0      1230
60000.0      1164
50000.0      1027
5000.0        967
40000.0       819
30000.0       789
20000.0       784
10000.0       264
Name: odometer_km, dtype: int64

In [23]:
autos["price"].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [24]:
autos["price"].unique().shape

(2357,)

In [25]:
autos["price"].value_counts().sort_index(ascending=True)

0.0           1421
1.0            156
2.0              3
3.0              1
5.0              2
8.0              1
9.0              1
10.0             7
11.0             2
12.0             3
13.0             2
14.0             1
15.0             2
17.0             3
18.0             1
20.0             4
25.0             5
29.0             1
30.0             7
35.0             1
40.0             6
45.0             4
47.0             1
49.0             4
50.0            49
55.0             2
59.0             1
60.0             9
65.0             5
66.0             1
              ... 
151990.0         1
155000.0         1
163500.0         1
163991.0         1
169000.0         1
169999.0         1
175000.0         1
180000.0         1
190000.0         1
194000.0         1
197000.0         1
198000.0         1
220000.0         1
250000.0         1
259000.0         1
265000.0         1
295000.0         1
299000.0         1
345000.0         1
350000.0         1
999990.0         1
999999.0    

The values in the "odometer_km" column look reasonable. However, many values in the "price" column seem unreasonable: the minimum price is 0.0 (free car!) and the maximum is 100 million (1.0e+08). Let's look at the summary statistics for the "price" column with these outliers removed. We will limit the column to cars with prices between 200 and 500,000.

In [26]:
autos["price"][autos["price"].between(200,500000)].describe()

count     47645.000000
mean       6000.707273
std        9110.783444
min         200.000000
25%        1300.000000
50%        3190.000000
75%        7500.000000
max      350000.000000
Name: price, dtype: float64

These values look much more reasonable. Let's go ahead and remove the rows outside this price range from our dataframe. Doing so will reduce the size of our dataset from 50,000 rows to 47,645 rows.

In [27]:
autos = autos[autos["price"].between(200,500000)]
autos.shape

(47645, 20)
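Note that `between()` includes both endpoints by default, so listings priced at exactly 200 or 500,000 are kept. A short sketch with made-up prices illustrates the behavior:

```python
import pandas as pd

# Made-up prices around the two cut-offs.
prices = pd.Series([0.0, 199.0, 200.0, 7500.0, 500000.0, 99999999.0])

# between() builds a boolean mask; both bounds are inclusive by default.
mask = prices.between(200, 500000)
kept = prices[mask]
print(kept.tolist())
```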

Let's now turn our attention to the date columns in the data. These include:

* "date_crawled"
* "last_seen"
* "ad_created"
* "registration_month"
* "registration_year"

Per the data dictionary, the "date_crawled" and "last_seen" columns were generated by the web crawler, and the other fields were taken from the website. As we can see from running describe() on the date columns, "date_crawled", "last_seen", and "ad_created" are stored as strings, while the two registration columns are numeric. To analyze the string columns, we will look at the date portion (the first 10 characters) of each value.

In [28]:
autos[["date_crawled", "last_seen", "ad_created", "registration_month", "registration_year"]].describe(include="all")

Unnamed: 0,date_crawled,last_seen,ad_created,registration_month,registration_year
count,47645,47645,47645,47645.0,47645.0
unique,46029,37809,76,,
top,2016-03-16 21:50:53,2016-04-07 06:17:27,2016-04-03 00:00:00,,
freq,3,8,1857,,
mean,,,,5.8225,2004.800084
std,,,,3.667104,88.423872
min,,,,0.0,1000.0
25%,,,,3.0,1999.0
50%,,,,6.0,2004.0
75%,,,,9.0,2008.0


In [29]:
autos["date_crawled"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.025354
2016-03-06    0.014062
2016-03-07    0.035995
2016-03-08    0.033120
2016-03-09    0.033036
2016-03-10    0.032322
2016-03-11    0.032700
2016-03-12    0.036877
2016-03-13    0.015699
2016-03-14    0.036562
2016-03-15    0.034232
2016-03-16    0.029447
2016-03-17    0.031546
2016-03-18    0.012824
2016-03-19    0.034610
2016-03-20    0.037800
2016-03-21    0.037360
2016-03-22    0.032700
2016-03-23    0.032385
2016-03-24    0.029321
2016-03-25    0.031420
2016-03-26    0.032217
2016-03-27    0.031210
2016-03-28    0.035030
2016-03-29    0.033980
2016-03-30    0.033897
2016-03-31    0.031840
2016-04-01    0.033813
2016-04-02    0.035681
2016-04-03    0.038724
2016-04-04    0.036562
2016-04-05    0.013139
2016-04-06    0.003169
2016-04-07    0.001364
Name: date_crawled, dtype: float64

In [30]:
autos["last_seen"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.001091
2016-03-06    0.004303
2016-03-07    0.005373
2016-03-08    0.007178
2016-03-09    0.009613
2016-03-10    0.010473
2016-03-11    0.012299
2016-03-12    0.023906
2016-03-13    0.008899
2016-03-14    0.012530
2016-03-15    0.015720
2016-03-16    0.016287
2016-03-17    0.028083
2016-03-18    0.007283
2016-03-19    0.015615
2016-03-20    0.020653
2016-03-21    0.020506
2016-03-22    0.021408
2016-03-23    0.018491
2016-03-24    0.019624
2016-03-25    0.019079
2016-03-26    0.016686
2016-03-27    0.015511
2016-03-28    0.020758
2016-03-29    0.022185
2016-03-30    0.024578
2016-03-31    0.023864
2016-04-01    0.022899
2016-04-02    0.024808
2016-04-03    0.025102
2016-04-04    0.024620
2016-04-05    0.125470
2016-04-06    0.222437
2016-04-07    0.132669
Name: last_seen, dtype: float64

In [31]:
autos["ad_created"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000042
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000063
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000042
2016-02-05    0.000042
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000042
2016-02-14    0.000042
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000042
2016-02-19    0.000063
2016-02-20    0.000042
2016-02-21    0.000063
                ...   
2016-03-09    0.033141
2016-03-10    0.032029
2016-03-11    0.033015
2016-03-12    0.036688
2016-03-13    0.017106
2016-03-14    0.035177
2016-03-15    0.033959
2016-03-16    0.029951
2016-03-17    0.031210
2016-03-18    0.013454
2016-03-19    0.033519
2016-03-20    0.037884
2016-03-21 

By selecting the first 10 characters of each value and sorting the counts by index, we can see the dates in chronological order for each column. The "date_crawled" and "last_seen" columns each range from 2016-03-05 to 2016-04-07. However, the "ad_created" column includes dates ranging from 2015-06-11 to 2016-04-07.
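String slicing works here because the timestamps share one fixed format. An alternative, sketched below on the five "date_crawled" values shown in the head() output earlier, is to parse the strings with `pd.to_datetime` and group by calendar date; this also makes the minimum and maximum directly available:

```python
import pandas as pd

# Sample timestamps in the same "YYYY-MM-DD HH:MM:SS" format as the dataset.
crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
    "2016-03-12 16:58:10",
    "2016-04-01 14:38:50",
])

# Parse to datetime64 using an explicit format string.
parsed = pd.to_datetime(crawled, format="%Y-%m-%d %H:%M:%S")

# Share of rows per calendar day, in chronological order.
daily_share = parsed.dt.date.value_counts(normalize=True).sort_index()
print(daily_share)
print(parsed.min(), parsed.max())
```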

In [32]:
autos["registration_year"].describe()

count    47645.000000
mean      2004.800084
std         88.423872
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

Looking at the distribution of the "registration_year" column above, it is clear that there are problems with its range (as previously discussed). It includes cars which were supposedly registered as early as the year 1000 and as late as the year 9999. Because the "last_seen" column's date range ends in 2016, we know that any row with a "registration_year" value greater than 2016 is definitely inaccurate. Let's see how many rows fall outside the range 1900-2016 to decide whether it's safe to drop those rows entirely.

In [33]:
(~autos["registration_year"].between(1900, 2016)).sum()

1859

In [34]:
(~autos["registration_year"].between(1900, 2016)).mean()

0.03901773533424284

After filtering the "registration_year" column to the range 1900-2016, we still have 45,786 rows left, with 1,859 rows being dropped. This is about 3.9% of the 47,645 rows that remained after the price filter. This seems reasonable, so we will go ahead and remove these rows from our dataset.

In [35]:
autos = autos[autos["registration_year"].between(1900,2016)]

In [36]:
autos["registration_year"].describe()

count    45786.000000
mean      2002.993098
std          7.113188
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64

Next, let's explore how many different car brands are represented in our dataset.

In [38]:
autos["brand"].value_counts().shape

(40,)

In [57]:
autos["brand"].value_counts()

volkswagen        9672
bmw               5094
opel              4849
mercedes_benz     4470
audi              4011
ford              3147
renault           2131
peugeot           1371
fiat              1149
seat               832
skoda              756
nissan             700
mazda              693
smart              658
citroen            647
toyota             592
hyundai            463
sonstige_autos     437
volvo              421
mini               407
mitsubishi         374
honda              363
kia                327
alfa_romeo         306
porsche            279
suzuki             266
chevrolet          262
chrysler           163
dacia              123
daihatsu           114
jeep               106
land_rover          98
subaru              96
saab                76
jaguar              70
daewoo              68
rover               61
trabant             59
lancia              48
lada                27
Name: brand, dtype: int64

As an exercise, we will use a for loop and a dictionary to calculate the mean price for the top 20 car brands represented in our dataset. Pandas has functions to make this much easier (such as groupby), but we will not use those methods here.

In [58]:
mean_price={}

for brand in autos["brand"].value_counts().iloc[:20].index:
    mean_price[brand] = round(autos["price"][autos["brand"] == brand].mean(), 2)
    
print(mean_price)

{'volkswagen': 5506.44, 'bmw': 8402.67, 'opel': 3077.58, 'mercedes_benz': 8691.72, 'audi': 9406.09, 'ford': 3883.29, 'renault': 2552.52, 'peugeot': 3142.02, 'fiat': 2925.95, 'seat': 4505.46, 'skoda': 6451.04, 'nissan': 4829.11, 'mazda': 4204.59, 'smart': 3596.4, 'citroen': 3818.96, 'toyota': 5175.56, 'hyundai': 5422.44, 'sonstige_autos': 12929.3, 'volvo': 5016.28, 'mini': 10665.35}


Of the 20 most common car brands in our dataset, "sonstige_autos" is the category with the highest average list price. It turns out that this is not the name of a car manufacturer, but simply means "Other Autos" in German. One plausible reading is that these cars are custom built or uncommon enough not to warrant their own brand category. Of the cars which do have a brand name, the four brands with the most expensive listings on average are Mini, Audi, Mercedes-Benz, and BMW. This makes sense, as these are all brands known for making high-end vehicles. The brands with the cheapest average list price are Renault, Fiat, Opel, and Peugeot.

Another large factor which will affect vehicle price is mileage. Let's compare the average mileage of the same top 20 vehicle brands in our dataset to see if there is any relationship between mileage, brand, and price. We will do so by building a mean_mileage dictionary (the same method we used to aggregate price), and then converting both of our dictionaries to pandas Series objects. We can then combine these Series into a single dataframe to make comparison easier.
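For comparison, the groupby approach mentioned earlier can produce the same kind of per-brand mean without an explicit loop. The sketch below uses a tiny invented frame (the `toy` data is an assumption, and only the top 2 brands are kept to keep it short):

```python
import pandas as pd

# Invented listings; not taken from the dataset.
toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "opel", "opel", "opel", "lada"],
    "price": [8000.0, 9000.0, 3000.0, 3500.0, 2500.0, 1000.0],
})

# The most frequent brands, analogous to value_counts().iloc[:20] above.
top_brands = toy["brand"].value_counts().iloc[:2].index

# Keep only those brands, then average price within each brand.
mean_price_by_brand = (
    toy[toy["brand"].isin(top_brands)]
    .groupby("brand")["price"]
    .mean()
    .round(2)
    .sort_values(ascending=False)
)
print(mean_price_by_brand)
```

On the real data, replacing `toy` with `autos` and `iloc[:2]` with `iloc[:20]` should give the same figures as the dictionary, returned as a Series sorted by price.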

In [61]:
mean_mileage={}

for brand in autos["brand"].value_counts().iloc[:20].index:
    mean_mileage[brand] = round(autos["odometer_km"][autos["brand"] == brand].mean(), 2)
    
print(mean_mileage)

{'volkswagen': 128774.81, 'bmw': 132792.5, 'opel': 129231.8, 'mercedes_benz': 131091.72, 'audi': 129260.78, 'ford': 124095.96, 'renault': 128052.56, 'peugeot': 126929.25, 'fiat': 116949.52, 'seat': 121604.57, 'skoda': 110998.68, 'nissan': 118178.57, 'mazda': 124076.48, 'smart': 99734.04, 'citroen': 119814.53, 'toyota': 116106.42, 'hyundai': 106792.66, 'sonstige_autos': 90652.17, 'volvo': 138527.32, 'mini': 88513.51}


In [64]:
mean_price_series = pd.Series(mean_price)
mean_mileage_series = pd.Series(mean_mileage)

price_mileage = pd.DataFrame(mean_price_series, columns=["mean_price"])
price_mileage["mean_mileage"] = mean_mileage_series

price_mileage

Unnamed: 0,mean_price,mean_mileage
volkswagen,5506.44,128774.81
bmw,8402.67,132792.5
opel,3077.58,129231.8
mercedes_benz,8691.72,131091.72
audi,9406.09,129260.78
ford,3883.29,124095.96
renault,2552.52,128052.56
peugeot,3142.02,126929.25
fiat,2925.95,116949.52
seat,4505.46,121604.57


It appears that, with a few exceptions, the average mileage of each of the top brands in our dataset is about the same. It is notable that the two categories with the lowest average mileage, Mini and "Sonstige Autos", were also the two with the highest average price. Thus, mileage is likely a significant factor in why these categories had higher average prices than the other brands in the dataset.
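To make this comparison easier to check, the two dictionaries printed above can be combined into one dataframe, sorted by mileage, and summarized with a correlation coefficient. This is a rough sketch: a correlation across 20 brand-level averages is only a coarse indicator and says nothing about individual listings.

```python
import pandas as pd

# Brand-level means copied from the dictionaries printed above.
mean_price = {
    "volkswagen": 5506.44, "bmw": 8402.67, "opel": 3077.58,
    "mercedes_benz": 8691.72, "audi": 9406.09, "ford": 3883.29,
    "renault": 2552.52, "peugeot": 3142.02, "fiat": 2925.95,
    "seat": 4505.46, "skoda": 6451.04, "nissan": 4829.11,
    "mazda": 4204.59, "smart": 3596.4, "citroen": 3818.96,
    "toyota": 5175.56, "hyundai": 5422.44, "sonstige_autos": 12929.3,
    "volvo": 5016.28, "mini": 10665.35,
}
mean_mileage = {
    "volkswagen": 128774.81, "bmw": 132792.5, "opel": 129231.8,
    "mercedes_benz": 131091.72, "audi": 129260.78, "ford": 124095.96,
    "renault": 128052.56, "peugeot": 126929.25, "fiat": 116949.52,
    "seat": 121604.57, "skoda": 110998.68, "nissan": 118178.57,
    "mazda": 124076.48, "smart": 99734.04, "citroen": 119814.53,
    "toyota": 116106.42, "hyundai": 106792.66, "sonstige_autos": 90652.17,
    "volvo": 138527.32, "mini": 88513.51,
}

# Build both columns at once; the Series are aligned on brand name.
price_mileage = pd.DataFrame({
    "mean_price": pd.Series(mean_price),
    "mean_mileage": pd.Series(mean_mileage),
})

# Lowest-mileage brands first.
by_mileage = price_mileage.sort_values("mean_mileage")
print(by_mileage.head())

# Pearson correlation between the two brand-level averages.
correlation = price_mileage["mean_price"].corr(price_mileage["mean_mileage"])
print(round(correlation, 2))
```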