# Exploring eBay Car Sales Data
In this project, we clean and analyze a dataset of used car listings from *eBay Kleinanzeigen*, the classifieds section of the German eBay website. We use pandas to make this process quick and easy.

The [original dataset](https://www.kaggle.com/orgesleka/used-cars-database) is available from Kaggle. The version we use is a sample of 50,000 data points that Dataquest prepared and deliberately made messier.

In [2]:
import numpy as np
import pandas as pd

autos = pd.read_csv("autos.csv", encoding="Latin-1")

In [3]:
autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null obj

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


We have 20 columns in the dataset, of which 15 have no nulls. A few columns stored as strings, including `price` and `odometer`, can be converted to numeric. The `notRepairedDamage` column can be converted to boolean, and the column names could be cleaned up a bit.

## Cleaning Column Names
First we convert the column names from camelCase to snake_case, which is the preferred style in Python.
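
The mechanical part of this conversion can also be automated. The following is a minimal sketch, not the approach this notebook takes: `camel_to_snake` and `overrides` are hypothetical names, and the overrides simply mirror the hand-picked renames used in this notebook (for example `yearOfRegistration` becoming `registration_year`).

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each capital that follows a lowercase
    letter or digit, then lowercase the whole name."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

# Hypothetical override table for names that are reworded, not just re-cased.
overrides = {
    "yearOfRegistration": "registration_year",
    "monthOfRegistration": "registration_month",
    "notRepairedDamage": "unrepaired_damage",
    "dateCreated": "ad_created",
}

original = ["dateCrawled", "powerPS", "yearOfRegistration",
            "notRepairedDamage", "abtest"]
renamed = [overrides.get(col, camel_to_snake(col)) for col in original]
print(renamed)
```

Typing the list by hand, as done here, is perfectly reasonable for 20 columns; a helper like this mostly pays off on wider tables.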

In [4]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [5]:
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']

In [6]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


## Initial Exploring and Cleaning

In [7]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-30 19:48:02,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


The `nr_of_pictures` column contains only zeroes, so it is not useful and can be dropped. The `seller` and `offer_type` columns hold the same value (`privat` and `Angebot`, respectively) for all but one listing each, so these columns can be dropped as well.

The minimum and maximum values of `registration_year` and `power_ps` look implausible (for example, a minimum registration year of 1000 and a minimum power of 0) and should be investigated for incorrect entries and outliers.

As noted earlier, the `odometer` and `price` columns hold numeric data stored as strings and should be converted to a numeric type.

First we drop the three columns we deemed not useful.

In [8]:
autos = autos.drop(['nr_of_pictures','seller','offer_type'],axis = 1)

Now let's take a quick look at `registration_year` and `power_ps`.

In [9]:
print(autos[~autos["registration_year"].between(1900,2018)]["registration_year"].value_counts())


5000    4
9999    4
2019    3
9000    2
1800    2
9996    1
4100    1
1000    1
8888    1
1001    1
1500    1
2800    1
5911    1
4500    1
1111    1
6200    1
4800    1
Name: registration_year, dtype: int64


We have a few cars with registration years in the future or before 1900, which seems unlikely. These should be cleaned up so as not to skew the data.

In [10]:
autos[~autos["power_ps"].between(50,1000)]["power_ps"].value_counts()

0        5500
45        397
41         57
44         52
40         42
26         39
34         27
43         20
39         18
48         16
5          13
46          9
33          9
47          8
37          7
42          7
18          6
15          5
1           5
29          5
27          5
23          4
11          4
4           4
20          4
19          3
10          3
6           3
30          3
1800        3
         ... 
6226        1
1401        1
1779        1
1771        1
14009       1
3750        1
1367        1
1103        1
1055        1
1011        1
1003        1
9011        1
15016       1
6512        1
2018        1
1986        1
1398        1
1202        1
1090        1
1082        1
15001       1
4400        1
14          1
6045        1
1781        1
5867        1
1753        1
3500        1
1405        1
1793        1
Name: power_ps, Length: 89, dtype: int64

There are a lot of cars listed with 0 horsepower (5,500 of them), which should probably be converted to NaN. There are also several cars with unrealistically low or high power values that should be cleaned up.

We'll clean up `registration_year` at a later step in the process, and leave cleaning `power_ps` as a next step.
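
As a rough sketch of what that later `power_ps` cleanup could look like, the snippet below replaces out-of-range values with NaN. The function name `mask_implausible_power`, the toy values, and the bounds of 1 and 1000 are all assumptions made for illustration, not choices made in this project.

```python
import pandas as pd

def mask_implausible_power(power, low=1, high=1000):
    """Return a copy of `power` in which values outside [low, high] are NaN.

    The default bounds are placeholders, not values validated on the data.
    """
    return power.where(power.between(low, high))

# Made-up values, not rows from the dataset.
toy_power = pd.Series([0, 45, 105, 15016])
cleaned = mask_implausible_power(toy_power)
print(cleaned.tolist())
```

Using NaN instead of dropping rows keeps the other columns of those listings available for analysis.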

First, let's convert the price and odometer columns to numeric and take a look at their values.

In [11]:
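# '$' is also a regex anchor, so replace it as a literal string (regex=False)
autos['price'] = (autos['price']
                      .str.replace('$', '', regex=False)
                      .str.replace(',', '', regex=False)
                      .astype(int)
                  )
autos['odometer'] = (autos['odometer']
                         .str.replace(',', '', regex=False)
                         .str.replace('km', '', regex=False)
                         .astype(int)
                    )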
autos.rename(columns = {"odometer": "odometer_km"}, inplace=True)
autos[['price','odometer_km']].head()

Unnamed: 0,price,odometer_km
0,5000,150000
1,8500,150000
2,8990,70000
3,4350,70000
4,1350,150000
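
One detail of this kind of conversion is worth illustrating: `$` is also the regular-expression end-of-string anchor, so it is only stripped reliably when it is treated as a literal character or placed inside a character class. The sketch below uses made-up strings rather than rows from the dataset.

```python
import re
import pandas as pd

raw = pd.Series(["$5,000", "$0", "$1,350"])  # made-up sample values

# As a regex, "$" matches the empty position at the end of the string,
# so the dollar sign itself is left in place.
untouched = re.sub("$", "", "$5,000")

# Option 1: literal replacement of each unwanted character.
literal = (raw.str.replace("$", "", regex=False)
              .str.replace(",", "", regex=False)
              .astype(int))

# Option 2: a single regex character class covering both characters.
char_class = raw.str.replace(r"[$,]", "", regex=True).astype(int)

print(untouched, literal.tolist(), char_class.tolist())
```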


## Exploring the Odometer and Price Columns
Let's take a look at our newly numeric columns to see if there are any unrealistic values we need to handle. 

### Odometer

In [10]:
print(autos['odometer_km'].unique().shape)
print(autos['odometer_km'].describe())

(13,)
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64


The values seem to be in a normal range: 5000 km would be a barely used car, and 150000 km isn't that much for a used car. It does seem weird that the median, the third quartile, and the max are all 150000 km. Also, there are only 13 unique values. Perhaps these values represent ranges, with 150000 meaning the car has over 150000 km on its odometer. Let's dig into this.

In [11]:
autos['odometer_km'].value_counts().sort_index()

5000        967
10000       264
20000       784
30000       789
40000       819
50000      1027
60000      1164
70000      1230
80000      1436
90000      1757
100000     2169
125000     5170
150000    32424
Name: odometer_km, dtype: int64

It does seem that the odometer values represent ranges, with 5000 meaning under 5000 km, 10000 meaning 5000 to 10000 km, and so on up to 150000 meaning over 150000 km. This would also explain why the vast majority of the used cars are listed with 150000, as cars can often still be driven up to 300000 km.

There are no outliers we need to get rid of in this column.
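
To put a number on how dominant the top bucket is, the counts from the `value_counts()` output above can be turned into shares. This is only a small worked example on the figures already shown; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output shown above.
odometer_counts = pd.Series({
    5000: 967, 10000: 264, 20000: 784, 30000: 789, 40000: 819,
    50000: 1027, 60000: 1164, 70000: 1230, 80000: 1436, 90000: 1757,
    100000: 2169, 125000: 5170, 150000: 32424,
})

total = odometer_counts.sum()
share_at_cap = odometer_counts.loc[150000] / total
print(total, round(share_at_cap, 3))
```

Roughly 65% of the listings sit in the 150000 bucket, which is why the median and third quartile both equal the maximum.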

### Price

In [12]:
print(autos['price'].unique().shape)
print(autos['price'].describe())

(2357,)
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64


With a maximum price of 100 million euros, there are clearly some outliers we need to take care of.

In [13]:
autos['price'].value_counts().sort_index()

0           1421
1            156
2              3
3              1
5              2
8              1
9              1
10             7
11             2
12             3
13             2
14             1
15             2
17             3
18             1
20             4
25             5
29             1
30             7
35             1
40             6
45             4
47             1
49             4
50            49
55             2
59             1
60             9
65             5
66             1
            ... 
151990         1
155000         1
163500         1
163991         1
169000         1
169999         1
175000         1
180000         1
190000         1
194000         1
197000         1
198000         1
220000         1
250000         1
259000         1
265000         1
295000         1
299000         1
345000         1
350000         1
999990         1
999999         2
1234566        1
1300000        1
3890000        1
10000000       1
11111111       2
12345678      

Some of the very high prices seem to just be silly numbers entered by users. Examples: 99999999, 1234566, 12345678, etc.

There are also a lot of very low prices, but this isn't necessarily unusual for used cars: some may be very old or non-functional, and the owner just wants to get rid of them. I believe this is what explains our spike at a price of 0 euros.

Let's take a closer look at some of our less silly outliers to see if they make sense.

In [14]:
autos[autos['price'] > 200000].sort_values(by='price')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
37840,2016-03-21 10:50:12,Porsche_997,privat,Angebot,220000,test,coupe,2008,manuell,415,911,30000,7,benzin,porsche,nein,2016-03-21 00:00:00,0,69198,2016-04-06 04:46:14
38299,2016-03-28 22:25:25,Glas_BMW_mit_Wasser,privat,Angebot,250000,test,,2015,,0,x_reihe,5000,0,,bmw,,2016-03-28 00:00:00,0,60489,2016-03-28 22:25:25
47337,2016-04-05 10:25:38,BMW_Z8_roadster,privat,Angebot,259000,test,cabrio,2001,manuell,400,z_reihe,20000,6,benzin,bmw,nein,2016-04-05 00:00:00,0,61462,2016-04-05 12:07:32
12682,2016-03-28 22:48:01,Porsche_GT3_RS__PCCB__Lift___grosser_Exklusiv_...,privat,Angebot,265000,control,coupe,2016,automatik,500,911,5000,3,benzin,porsche,nein,2016-03-28 00:00:00,0,70193,2016-04-05 03:44:51
35923,2016-04-03 07:56:23,Porsche_911_Targa_Exclusive_Edition__1_von_15_...,privat,Angebot,295000,test,cabrio,2015,automatik,400,911,5000,6,benzin,porsche,nein,2016-04-03 00:00:00,0,74078,2016-04-03 08:56:20
34723,2016-03-23 16:37:29,Porsche_Porsche_911/930_Turbo_3.0__deutsche_Au...,privat,Angebot,299000,test,coupe,1977,manuell,260,911,100000,7,benzin,porsche,nein,2016-03-23 00:00:00,0,61462,2016-04-06 16:44:50
14715,2016-03-30 08:37:24,Rolls_Royce_Phantom_Drophead_Coupe,privat,Angebot,345000,control,cabrio,2012,automatik,460,,20000,8,benzin,sonstige_autos,nein,2016-03-30 00:00:00,0,73525,2016-04-07 00:16:26
36818,2016-03-27 18:37:37,Porsche_991,privat,Angebot,350000,control,coupe,2016,manuell,500,911,5000,3,benzin,porsche,nein,2016-03-27 00:00:00,0,70499,2016-03-27 18:37:37
37585,2016-03-29 11:38:54,Volkswagen_Jetta_GT,privat,Angebot,999990,test,limousine,1985,manuell,111,jetta,150000,12,benzin,volkswagen,ja,2016-03-29 00:00:00,0,50997,2016-03-29 11:38:54
43049,2016-03-21 19:53:52,2_VW_Busse_T3,privat,Angebot,999999,test,bus,1981,manuell,70,transporter,150000,1,benzin,volkswagen,,2016-03-21 00:00:00,0,99880,2016-03-28 17:18:28


Some of these cars are clearly luxury cars and could be worth a lot. However, we start to see nonsensical listings at around 1 million euros. No VW Jetta or Ford Focus is going to be worth this much, so we'll get rid of all entries with a price over 500,000 euros.

In [15]:
autos = autos[autos['price'] < 500000]
autos.price.describe()

count     49986.000000
mean       5721.525167
std        8983.617820
min           0.000000
25%        1100.000000
50%        2950.000000
75%        7200.000000
max      350000.000000
Name: price, dtype: float64

This got rid of 14 high-price entries (50,000 rows before the filter, 49,986 after).
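
A quick way to check how many rows a filter like this removes is to compare row counts before and after. The sketch below does so on a made-up frame (`toy` is not the real dataset) and then repeats the arithmetic with the counts reported by the two `describe()` outputs.

```python
import pandas as pd

# Made-up prices, not rows from the dataset.
toy = pd.DataFrame({"price": [0, 1350, 350000, 999990, 12345678]})

keep = toy["price"] < 500000          # same condition as the filter above
filtered = toy[keep]
removed = len(toy) - len(filtered)
print(removed)

# The same arithmetic with the counts from the describe() outputs.
rows_before, rows_after = 50000, 49986
print(rows_before - rows_after)
```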

## Exploring the Date Columns

We have three date columns stored as strings: `date_crawled`, `ad_created`, and `last_seen`. We'll look at the distribution of each by extracting the date portion of the string.

##### date_crawled

In [16]:
autos['date_crawled'].str[:10].value_counts(dropna=False).sort_index()

2016-03-05    1269
2016-03-06     697
2016-03-07    1798
2016-03-08    1663
2016-03-09    1660
2016-03-10    1606
2016-03-11    1624
2016-03-12    1838
2016-03-13     778
2016-03-14    1831
2016-03-15    1699
2016-03-16    1475
2016-03-17    1575
2016-03-18     653
2016-03-19    1745
2016-03-20    1891
2016-03-21    1874
2016-03-22    1645
2016-03-23    1619
2016-03-24    1455
2016-03-25    1587
2016-03-26    1624
2016-03-27    1552
2016-03-28    1742
2016-03-29    1707
2016-03-30    1681
2016-03-31    1595
2016-04-01    1690
2016-04-02    1770
2016-04-03    1934
2016-04-04    1824
2016-04-05     655
2016-04-06     159
2016-04-07      71
Name: date_crawled, dtype: int64

All entries were scraped between March 5 and April 7, 2016.
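
Slicing the first ten characters works because every timestamp shares the `YYYY-MM-DD HH:MM:SS` layout. As a hedged alternative, the strings can be parsed with `pd.to_datetime` first; the sketch below compares both routes on three made-up timestamps and uses `normalize=True` to show shares instead of raw counts.

```python
import pandas as pd

# Made-up timestamps in the same layout as the date columns.
stamps = pd.Series([
    "2016-03-26 17:47:46",
    "2016-03-26 18:57:24",
    "2016-04-04 13:38:56",
])

# Route 1: string slicing, as used in this notebook.
by_slice = stamps.str[:10].value_counts(normalize=True).sort_index()

# Route 2: parse to datetime, then format the date part back to text.
by_parse = (pd.to_datetime(stamps)
              .dt.strftime("%Y-%m-%d")
              .value_counts(normalize=True)
              .sort_index())

print(by_slice)
```

Parsing is slower but fails loudly on malformed values, whereas slicing silently keeps whatever the first ten characters happen to be.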

##### ad_created

In [17]:
autos['ad_created'].str[:10].value_counts().sort_index()

2015-06-11       1
2015-08-10       1
2015-09-09       1
2015-11-10       1
2015-12-05       1
2015-12-30       1
2016-01-03       1
2016-01-07       1
2016-01-10       2
2016-01-13       1
2016-01-14       1
2016-01-16       1
2016-01-22       1
2016-01-27       3
2016-01-29       1
2016-02-01       1
2016-02-02       2
2016-02-05       2
2016-02-07       1
2016-02-08       1
2016-02-09       2
2016-02-11       1
2016-02-12       3
2016-02-14       2
2016-02-16       1
2016-02-17       1
2016-02-18       2
2016-02-19       3
2016-02-20       2
2016-02-21       3
              ... 
2016-03-09    1661
2016-03-10    1593
2016-03-11    1639
2016-03-12    1830
2016-03-13     846
2016-03-14    1761
2016-03-15    1687
2016-03-16    1500
2016-03-17    1559
2016-03-18     686
2016-03-19    1692
2016-03-20    1893
2016-03-21    1884
2016-03-22    1638
2016-03-23    1609
2016-03-24    1454
2016-03-25    1594
2016-03-26    1628
2016-03-27    1545
2016-03-28    1748
2016-03-29    1705
2016-03-30  

Most of the ads were created around the time they were scraped. However, a few were created earlier, going back as far as June 2015.

##### last_seen

In [18]:
autos['last_seen'].str[:10].value_counts()

2016-04-06    11046
2016-04-07     6546
2016-04-05     6212
2016-03-17     1396
2016-04-03     1268
2016-04-02     1244
2016-03-30     1242
2016-04-04     1231
2016-03-31     1191
2016-03-12     1190
2016-04-01     1155
2016-03-29     1116
2016-03-22     1079
2016-03-28     1042
2016-03-21     1036
2016-03-20     1035
2016-03-24      978
2016-03-25      960
2016-03-23      929
2016-03-26      848
2016-03-16      822
2016-03-27      801
2016-03-15      794
2016-03-19      787
2016-03-14      640
2016-03-11      626
2016-03-10      538
2016-03-09      492
2016-03-13      449
2016-03-08      379
2016-03-18      371
2016-03-07      268
2016-03-06      221
2016-03-05       54
Name: last_seen, dtype: int64

Most ads were last seen towards the end of the crawling period, indicating that a lot of ads were still up. Ads with an earlier last-seen date were probably taken down, for example because the car was sold.

Let's also take a quick look at the `registration_year` column.

In [19]:
autos['registration_year'].describe()

count    49986.000000
mean      2005.075721
std        105.727161
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

Half of the cars were first registered between 1999 and 2008, with a median of 2003. However, as we noted earlier, there are a few entries with ridiculous registration years, which are either in the future or from before car registration existed.

## Dealing with Incorrect Registration Year Data

Let's take a look at our weird `registration_year` values from earlier. Since the listings were crawled in 2016, anything outside of the 1900 to 2016 range is definitely inaccurate.

In [20]:
print(autos[~autos["registration_year"].between(1900,2016)]["registration_year"].value_counts())

2017    1452
2018     491
9999       4
5000       4
2019       3
9000       2
1800       2
6200       1
4500       1
8888       1
4800       1
2800       1
1001       1
1000       1
1111       1
1500       1
9996       1
5911       1
4100       1
Name: registration_year, dtype: int64


It looks safe to remove most of the rows with these bad values. However, we do have many rows with a registration year of 2017 or 2018. Perhaps these are 2017 or 2018 model-year cars for which the seller filled in the `registration_year` field incorrectly. Let's keep these cars so as not to skew results too much, but note that these values still exist in our data.

In [12]:
autos = autos[autos['registration_year'].between(1900,2018)]
autos['registration_year'].value_counts(normalize=True).head(15)

2000    0.067116
2005    0.060333
1999    0.060032
2004    0.054770
2003    0.054569
2006    0.054189
2001    0.054089
2002    0.050687
1998    0.049087
2007    0.046105
2008    0.044644
2009    0.041983
1997    0.040582
2011    0.032698
2010    0.031957
Name: registration_year, dtype: float64

As expected, most of the cars were first registered in the late 1990s and the 2000s.

## Exploring Price by Brand

Let's aggregate on car brand to take a look at prices.

In [22]:
autos['brand'].value_counts()

volkswagen        10679
opel               5457
bmw                5427
mercedes_benz      4731
audi               4283
ford               3477
renault            2404
peugeot            1456
fiat               1307
seat                940
skoda               785
mazda               757
nissan              754
smart               701
citroen             699
toyota              617
sonstige_autos      538
hyundai             488
volvo               456
mini                424
mitsubishi          404
honda               398
kia                 356
alfa_romeo          328
porsche             294
suzuki              293
chevrolet           283
chrysler            181
dacia               129
daihatsu            128
jeep                109
subaru              108
land_rover           99
daewoo               79
saab                 79
trabant              77
jaguar               77
rover                69
lancia               57
lada                 31
Name: brand, dtype: int64

The top five manufacturers on the list are all German brands, with Volkswagen by far the most popular.

We'll limit the analysis to brands with over 2% representation in our data set so that each sample is reasonably large.

In [23]:
brand_counts = autos['brand'].value_counts(normalize=True)
brand_list = brand_counts[brand_counts > .02].index

average_price = {}

for brand in brand_list:
    average_price[brand] = autos[autos['brand'] == brand]['price'].mean()
    
average_price    

{'audi': 8965.560354891431,
 'bmw': 8028.474479454579,
 'fiat': 2697.6771231828616,
 'ford': 3626.5429968363533,
 'mercedes_benz': 8380.637920101459,
 'opel': 2842.8246289169874,
 'peugeot': 3010.8688186813188,
 'renault': 2351.301996672213,
 'volkswagen': 5159.401629366045}

There are three apparent tiers of prices among our top brands:
- Audi, BMW, and Mercedes-Benz are all more expensive. This makes sense as they are luxury brands.
- Fiat, Ford, Opel, Peugeot, and Renault are all relatively inexpensive.
- Volkswagen, our most popular brand in the listings, occupies a middle ground between the two other groups.
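
The per-brand loop used for these averages can also be written with `groupby`. The following is a minimal sketch on a made-up frame: `toy`, its values, and the 20% threshold are placeholders chosen so the tiny example has something to filter out.

```python
import pandas as pd

# Made-up listings, not rows from the dataset.
toy = pd.DataFrame({
    "brand": ["audi", "audi", "fiat", "fiat", "fiat", "lada"],
    "price": [9000, 7000, 2000, 3000, 4000, 500],
})

# Keep brands above a share threshold (20% here, 2% in the notebook).
share = toy["brand"].value_counts(normalize=True)
common = share[share > 0.2].index

# One grouped aggregation replaces the explicit loop over brands.
mean_price = (toy[toy["brand"].isin(common)]
                .groupby("brand")["price"]
                .mean()
                .sort_values(ascending=False))
print(mean_price.to_dict())
```

Sorting the result makes price tiers like the ones listed above easier to spot at a glance.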

## Storing Aggregate Data in a DataFrame: Exploring Mileage

We'll use pandas Series and DataFrame constructors to explore average price and mileage of our top 9 brands from the last step.

First we'll get the mileage data into a dictionary.

In [24]:
average_mileage = {}

for brand in brand_list:
    average_mileage[brand] = autos[autos['brand'] == brand]['odometer_km'].mean()
    
average_mileage

{'audi': 129643.9411627364,
 'bmw': 132540.99871015293,
 'fiat': 117012.24177505738,
 'ford': 124153.00546448087,
 'mercedes_benz': 130933.20651025153,
 'opel': 129361.37071651091,
 'peugeot': 127352.33516483517,
 'renault': 128223.79367720465,
 'volkswagen': 129006.46127914598}

Convert the mileage and price dicts to Series and use these to create a DataFrame.

In [25]:
mean_price = pd.Series(average_price)
mean_mileage = pd.Series(average_mileage)
top_autos_stats = pd.DataFrame(mean_price,columns=['mean_price'])
top_autos_stats['mean_mileage'] = mean_mileage

top_autos_stats

Unnamed: 0,mean_price,mean_mileage
audi,8965.560355,129643.941163
bmw,8028.474479,132540.99871
fiat,2697.677123,117012.241775
ford,3626.542997,124153.005464
mercedes_benz,8380.63792,130933.20651
opel,2842.824629,129361.370717
peugeot,3010.868819,127352.335165
renault,2351.301997,128223.793677
volkswagen,5159.401629,129006.461279


Mean mileage is fairly consistent across our top brands, ranging from about 117,000 km (Fiat) to about 133,000 km (BMW). This could well be skewed by the fact that all cars above 150,000 km are recorded as having exactly 150,000 km in our data. If one brand's cars lasted much longer past 150,000 km than the others', this would not be discernible in this data.
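
The effect of that cap can be shown with a small worked example. The mileages below are invented, and the brands `a` and `b` are hypothetical; the point is only that two brands with different true mileages can end up with identical capped means.

```python
import pandas as pd

CAP_KM = 150000  # top odometer bucket in the dataset

# Invented "true" mileages for two hypothetical brands.
true_km = pd.DataFrame({
    "brand": ["a", "a", "a", "b", "b", "b"],
    "odometer_km": [100000, 200000, 300000, 100000, 160000, 170000],
})

true_means = true_km.groupby("brand")["odometer_km"].mean()

# Apply the same cap the listings use, then average again.
capped = true_km.assign(odometer_km=true_km["odometer_km"].clip(upper=CAP_KM))
capped_means = capped.groupby("brand")["odometer_km"].mean()

print(true_means.round(0).to_dict())
print(capped_means.round(0).to_dict())
```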

## Next Steps

Data Cleaning:
- Identify categorical data that uses German words, translate them, and map the values to their English counterparts.
- Convert the dates to be uniform numeric data, so "2016-03-21" becomes the integer 20160321.
- See if there are particular keywords in the name column that you can extract as new columns.
- Clean the `power_ps` column.
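
Two of these cleaning steps can be sketched briefly. The frame `toy` and the `translations` dictionary below are illustrative assumptions covering only a handful of values; a real mapping would need every category in each column, because `Series.map` turns unmapped values into NaN.

```python
import pandas as pd

# Made-up rows using values that appear in the dataset.
toy = pd.DataFrame({
    "gearbox": ["manuell", "automatik", None],
    "unrepaired_damage": ["nein", "ja", None],
    "ad_created": ["2016-03-21 00:00:00", "2016-04-04 00:00:00",
                   "2016-03-12 00:00:00"],
})

# Partial, illustrative German-to-English mappings per column.
translations = {
    "gearbox": {"manuell": "manual", "automatik": "automatic"},
    "unrepaired_damage": {"nein": "no", "ja": "yes"},
}
for column, mapping in translations.items():
    toy[column] = toy[column].map(mapping)

# "2016-03-21 00:00:00" -> 20160321
toy["ad_created"] = (toy["ad_created"]
                       .str[:10]
                       .str.replace("-", "", regex=False)
                       .astype(int))
print(toy)
```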

Analysis:
- Find the most common brand/model combinations
- Split `odometer_km` into groups, and use aggregation to see if average prices follow any patterns based on the mileage.
- How much cheaper are cars with damage than their non-damaged counterparts?
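
As a starting point for the mileage question, `pd.cut` can bin `odometer_km` before aggregating. This is a sketch under stated assumptions: the rows are made up, and the bin edges and labels are arbitrary choices for illustration rather than a recommendation.

```python
import pandas as pd

# Made-up listings, not rows from the dataset.
toy = pd.DataFrame({
    "odometer_km": [5000, 40000, 90000, 125000, 150000, 150000],
    "price": [20000, 12000, 6000, 4000, 2000, 1000],
})

# Arbitrary illustrative bins; each interval includes its right edge.
bins = [0, 50000, 100000, 150000]
labels = ["0-50k", "50k-100k", "100k-150k"]
toy["km_group"] = pd.cut(toy["odometer_km"], bins=bins, labels=labels)

mean_by_group = toy.groupby("km_group", observed=True)["price"].mean()
print(mean_by_group)
```

The same grouped mean, applied to `unrepaired_damage` instead of `km_group`, would be one way to approach the damage question.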