# Exploring the eBay Kleinanzeigen Used Cars Database

In this project I will explore and clean a dataset of 50,000 data points sampled from the used cars database from the German eBay Kleinanzeigen.<br>
The original dataset can be found [here](https://www.kaggle.com/orgesleka/used-cars-database/data) on Kaggle. <br>
Compared with the original data, the dataset I'll use contains a few extra errors, in order to simulate more closely a "real" dataset with dirty data.

In [15]:
#imports
import pandas as pd
import numpy as np

In [16]:
#load data
autos = pd.read_csv("autos.csv", encoding="Latin-1")

In [17]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null obj

In [18]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


### Changing column names

As shown below, the column names are in camel case. I'll convert them to snake case, as per Python naming conventions, and rename a few of them to be more descriptive.

In [19]:
columns = autos.columns
print(columns)

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')


In [20]:
snake_names = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_PS', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']

In [21]:
autos.columns = snake_names
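
The hand-written list above could also be derived programmatically. The following is a minimal sketch, not the method used in this notebook: `camel_to_snake`, `clean_column` and the `overrides` dictionary are hypothetical names, and the overrides simply reproduce the renames chosen above that a plain case conversion would not produce.

```python
import re

# Renames from the list above that are more than a case conversion.
overrides = {
    "yearOfRegistration": "registration_year",
    "monthOfRegistration": "registration_month",
    "notRepairedDamage": "unrepaired_damage",
    "dateCreated": "ad_created",
    "powerPS": "power_PS",
}

def camel_to_snake(name):
    # Insert "_" wherever a lowercase letter or digit is followed by a capital,
    # then lowercase the whole name.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

def clean_column(name):
    # Prefer an explicit rename, fall back to the generic conversion.
    return overrides.get(name, camel_to_snake(name))

original = ["dateCrawled", "offerType", "yearOfRegistration", "powerPS", "nrOfPictures"]
print([clean_column(c) for c in original])
```

On the real dataframe this would be applied as `autos.columns = [clean_column(c) for c in autos.columns]`.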

In [22]:
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_PS,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [23]:
autos.describe()

Unnamed: 0,registration_year,power_PS,registration_month,nr_of_pictures,postal_code
count,50000.0,50000.0,50000.0,50000.0,50000.0
mean,2005.07328,116.35592,5.72336,0.0,50813.6273
std,105.712813,209.216627,3.711984,0.0,25779.747957
min,1000.0,0.0,0.0,0.0,1067.0
25%,1999.0,70.0,3.0,0.0,30451.0
50%,2003.0,105.0,6.0,0.0,49577.0
75%,2008.0,150.0,9.0,0.0,71540.0
max,9999.0,17700.0,12.0,0.0,99998.0


In [24]:
autos.describe(include=['O'])

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,gearbox,model,odometer,fuel_type,brand,unrepaired_damage,ad_created,last_seen
count,50000,50000,50000,50000,50000,50000,44905,47320,47242,50000,45518,50000,40171,50000,50000
unique,48213,38754,2,2,2357,2,8,2,245,13,7,40,2,76,39481
top,2016-03-16 21:50:53,Ford_Fiesta,privat,Angebot,$0,test,limousine,manuell,golf,"150,000km",benzin,volkswagen,nein,2016-04-03 00:00:00,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,36993,4024,32424,30107,10687,35232,1946,8


### Numeric columns

The **registration_year** column ranges from 1000 to 9999, so there are clearly some errors in the data. According to [this entry on Wikipedia](https://en.wikipedia.org/wiki/Vehicle_registration_plates_of_Germany#History), registration of licence plates in Germany started in 1906, and there obviously can't be cars registered in the future. <br>
The **power_PS** column, representing the horsepower of the car, has a minimum value of zero and a maximum value of 17700, which is far too high even for the most powerful car in the world. <br>
The **registration_month** column ranges from 0 to 12, which gives 13 distinct values, so 0 probably represents a missing month. <br>
The **nr_of_pictures** column contains only 0 in every row, so it can safely be dropped. <br>
The **postal_code** column stores postal codes as numbers, but a postal code has no numeric meaning: it makes no sense, for example, to calculate the mean of a column of postal codes. It could be converted to an object (string) column.
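
The clean-up suggested by these observations is not carried out in this notebook. As a rough sketch on a small made-up dataframe (`toy` and its values are illustrative only), it might look like this. The five-character zero padding assumes that German postal codes always have five digits and that leading zeros were lost when the column was read as integers.

```python
import pandas as pd

# Toy stand-in for the real dataframe (values are illustrative only).
toy = pd.DataFrame({
    "registration_month": [0, 3, 12],
    "nr_of_pictures": [0, 0, 0],
    "postal_code": [1067, 79588, 4356],
})

# A column with a single value carries no information.
toy = toy.drop(columns=["nr_of_pictures"])

# Store postal codes as strings, restoring any lost leading zero.
toy["postal_code"] = toy["postal_code"].astype(str).str.zfill(5)

# Treat month 0 as missing; the column becomes float because of NaN.
toy["registration_month"] = toy["registration_month"].mask(toy["registration_month"] == 0)

print(toy)
```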

In [25]:
autos.registration_month.value_counts()

0     5075
3     5071
6     4368
5     4107
4     4102
7     3949
10    3651
12    3447
9     3389
11    3360
1     3282
8     3191
2     3008
Name: registration_month, dtype: int64

About 10% of the cars in the database (5,075) have no registration month, if we take 0 as a sign of missing data.

In [26]:
autos.power_PS.value_counts().sort_index(ascending = False).head(10)

17700    1
16312    1
16011    1
15016    1
15001    1
14009    1
9011     1
8404     1
7511     1
6512     1
Name: power_PS, dtype: int64
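
The implausible power values listed above are left untouched in this notebook. One possible way to handle them, sketched here on a made-up series, is to blank out anything outside a plausible range. The bounds `MIN_PS` and `MAX_PS` are assumptions chosen for illustration, not values taken from the data or from any reference.

```python
import pandas as pd

# Toy values, including the 0 and 17700 extremes seen in the data.
power = pd.Series([0, 71, 158, 799, 17700], name="power_PS")

# Hypothetical plausibility bounds (assumption, tune as needed).
MIN_PS, MAX_PS = 1, 2000

# between() is inclusive on both ends; where() replaces the rest with NaN.
plausible = power.between(MIN_PS, MAX_PS)
power_clean = power.where(plausible)

print(power_clean)
```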

### Object Columns

The **price** column contains a $ sign and thousands separators, so it was read as strings; its values are clearly numeric, so it needs to be converted to a numeric type. <br>
The **odometer** column is also made of strings, because "km" is attached to every value, but it can be converted to numeric data too. <br>
The **seller** column always contains the same value, privat, except in one row.

### Cleaning the odometer and price columns

In [27]:
#showing that seller is of little value for analysis,
#it contains almost only one value
autos.seller.value_counts()

privat        49999
gewerblich        1
Name: seller, dtype: int64
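
Columns like this one can also be found automatically instead of one by one. The sketch below uses a made-up dataframe and flags every column whose most frequent value covers at least a given share of the rows; the `THRESHOLD` value and the `toy` data are assumptions for illustration.

```python
import pandas as pd

# Toy data: "seller" is almost constant, "nr_of_pictures" is constant.
toy = pd.DataFrame({
    "seller": ["privat"] * 9 + ["gewerblich"],
    "nr_of_pictures": [0] * 10,
    "brand": ["bmw", "audi", "opel", "ford", "fiat"] * 2,
})

# Hypothetical cut-off: flag columns dominated by one value.
THRESHOLD = 0.85

# Share of rows taken by the most frequent value of each column.
top_share = toy.apply(lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])
near_constant = top_share[top_share >= THRESHOLD].index.tolist()

print(near_constant)
```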

In [28]:
# removing $ and commas from price columns and converting to int
new_price = autos["price"].str.split("$").str[1].str.replace(',','').astype(int)

In [29]:
#replace old column
autos["price"] = new_price

In [30]:
#confirm data type has changed
print(autos.price.dtype)

int32


In [31]:
#remove "km" from odometer data
new_odometer = autos["odometer"].str.split("km").str[0].str.replace(",", "").astype(int)

In [32]:
autos["odometer"] = new_odometer

In [33]:
#rename the odometer column to odometer_km
autos.rename({"odometer": "odometer_km"}, axis = 1, inplace=True)
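
The same cleaning can be expressed with one regular expression per column. This is an alternative sketch on made-up series rather than the approach used above: it strips every non-digit character and casts to an explicit `int64`, which avoids the platform-dependent `int32` shown in the output above.

```python
import pandas as pd

# Toy samples in the same format as the raw columns.
raw_price = pd.Series(["$5,000", "$8,500", "$0"])
raw_odometer = pd.Series(["150,000km", "70,000km", "5,000km"])

# \D matches any non-digit, so "$", "," and "km" are all removed.
price = raw_price.str.replace(r"\D", "", regex=True).astype("int64")
odometer_km = raw_odometer.str.replace(r"\D", "", regex=True).astype("int64")

print(price.tolist(), odometer_km.tolist())
```

Removing every non-digit is only safe here because the values have no decimal part or minus sign.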

In [34]:
autos.head(2)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_PS,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08


In [35]:
autos.odometer_km.value_counts().sort_index(ascending=False)

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
40000       819
30000       789
20000       784
10000       264
5000        967
Name: odometer_km, dtype: int64

In [36]:
autos.gearbox.value_counts()

manuell      36993
automatik    10327
Name: gearbox, dtype: int64

In [37]:
autos.odometer_km.unique().shape

(13,)

There are 13 distinct values for odometer_km; the most common is 150,000 km.

In [38]:
autos.price.describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [39]:
#index of the most expensive car
autos.price.idxmax()

39705

In [40]:
autos.loc[autos.price.idxmax()]

date_crawled                  2016-03-22 14:58:27
name                  Tausch_gegen_gleichwertiges
seller                                     privat
offer_type                                Angebot
price                                    99999999
abtest                                    control
vehicle_type                            limousine
registration_year                            1999
gearbox                                 automatik
power_PS                                      224
model                                    s_klasse
odometer_km                                150000
registration_month                              9
fuel_type                                  benzin
brand                               mercedes_benz
unrepaired_damage                             NaN
ad_created                    2016-03-22 00:00:00
nr_of_pictures                                  0
postal_code                                 73525
last_seen                     2016-04-06 05:15:30


In this database there is a limousine registered in 1999 and priced at 99,999,999 dollars, which is probably an error: a brief online search shows that even the most expensive limousine costs only about 4 million dollars. I'm going to remove the row with this value.

In [41]:
autos.drop(39705, axis = 0, inplace = True)

In [42]:
autos.sort_values("price", ascending=False).head(10)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_PS,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
42221,2016-03-08 20:39:05,Leasinguebernahme,privat,Angebot,27322222,control,limousine,2014,manuell,163,c4,40000,2,diesel,citroen,,2016-03-08 00:00:00,0,76532,2016-03-08 20:39:05
47598,2016-03-31 18:56:54,Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar...,privat,Angebot,12345678,control,limousine,2001,manuell,101,vectra,150000,3,benzin,opel,nein,2016-03-31 00:00:00,0,4356,2016-03-31 18:56:54
39377,2016-03-08 23:53:51,Tausche_volvo_v40_gegen_van,privat,Angebot,12345678,control,,2018,manuell,95,v40,150000,6,,volvo,nein,2016-03-08 00:00:00,0,14542,2016-04-06 23:17:31
27371,2016-03-09 15:45:47,Fiat_Punto,privat,Angebot,12345678,control,,2017,,95,punto,150000,0,,fiat,,2016-03-09 00:00:00,0,96110,2016-03-09 15:45:47
24384,2016-03-21 13:57:51,Schlachte_Golf_3_gt_tdi,privat,Angebot,11111111,test,,1995,,0,,150000,0,,volkswagen,,2016-03-21 00:00:00,0,18519,2016-03-21 14:40:18
2897,2016-03-12 21:50:57,Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000,privat,Angebot,11111111,test,limousine,1973,manuell,48,escort,50000,3,benzin,ford,nein,2016-03-12 00:00:00,0,94469,2016-03-12 22:45:27
11137,2016-03-29 23:52:57,suche_maserati_3200_gt_Zustand_unwichtig_laufe...,privat,Angebot,10000000,control,coupe,1960,manuell,368,,100000,1,benzin,sonstige_autos,nein,2016-03-29 00:00:00,0,73033,2016-04-06 21:18:11
47634,2016-04-04 21:25:21,Ferrari_FXX,privat,Angebot,3890000,test,coupe,2006,,799,,5000,7,,sonstige_autos,nein,2016-04-04 00:00:00,0,60313,2016-04-05 12:07:37
7814,2016-04-04 11:53:31,Ferrari_F40,privat,Angebot,1300000,control,coupe,1992,,0,,50000,12,,sonstige_autos,nein,2016-04-04 00:00:00,0,60598,2016-04-05 11:34:11
22947,2016-03-22 12:54:19,Bmw_530d_zum_ausschlachten,privat,Angebot,1234566,control,kombi,1999,automatik,190,,150000,2,diesel,bmw,,2016-03-22 00:00:00,0,17454,2016-04-02 03:17:32


Some prices are clearly just placeholders or errors, like 12345678 or 11111111. I'll remove these entries as well, keeping only cars priced below 11,111,111.

In [43]:
autos = autos[autos["price"] < 11111111]
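
A range filter can express both ends of the plausible price interval at once. In the sketch below the upper bound mirrors the cut used above, while the lower bound of 1 is an assumption added to illustrate how listings priced at $0 could be excluded too; `toy`, `MIN_PRICE` and `MAX_PRICE` are illustrative names.

```python
import pandas as pd

# Toy prices, including a $0 listing and two placeholder values.
toy = pd.DataFrame({"price": [0, 450, 8500, 3890000, 12345678, 99999999]})

# Upper bound mirrors the filter above; lower bound is an assumption.
MIN_PRICE, MAX_PRICE = 1, 11_111_110

# between() keeps rows with MIN_PRICE <= price <= MAX_PRICE.
kept = toy[toy["price"].between(MIN_PRICE, MAX_PRICE)]

print(kept)
```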

### Dates

The date columns are represented as strings in the dataframe. <br>
The data in this dataset were collected from a website by a crawler.
The **date_crawled** column represents the date on which the crawler first found the ad for the car.<br>
The **last_seen** column represents the last time the same record was seen by the crawler.<br>
The **ad_created** column is the date the ad was first created on the website. <br>
I'm going to explore the distribution of these dates.

In [44]:
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.025384
2016-03-06    0.013942
2016-03-07    0.035965
2016-03-08    0.033265
2016-03-09    0.033205
2016-03-10    0.032124
2016-03-11    0.032485
2016-03-12    0.036765
2016-03-13    0.015562
2016-03-14    0.036625
2016-03-15    0.033985
2016-03-16    0.029504
2016-03-17    0.031524
2016-03-18    0.013062
2016-03-19    0.034905
2016-03-20    0.037825
2016-03-21    0.037505
2016-03-22    0.032925
2016-03-23    0.032385
2016-03-24    0.029104
2016-03-25    0.031744
2016-03-26    0.032485
2016-03-27    0.031044
2016-03-28    0.034845
2016-03-29    0.034185
2016-03-30    0.033625
2016-03-31    0.031904
2016-04-01    0.033805
2016-04-02    0.035405
2016-04-03    0.038685
2016-04-04    0.036525
2016-04-05    0.013102
2016-04-06    0.003180
2016-04-07    0.001420
Name: date_crawled, dtype: float64

Looking at the **date_crawled** column, it seems the website was crawled daily for about a month, between March 5th, 2016 and April 7th, 2016.
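
Slicing the first ten characters works because the timestamps share a fixed format. An alternative, sketched here on three made-up timestamps in that format, is to parse the strings into real datetimes, which also makes the first and last dates easy to read off.

```python
import pandas as pd

# Toy timestamps in the same format as the date columns.
crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
])

# Parse with an explicit format so malformed values raise an error.
crawled_dt = pd.to_datetime(crawled, format="%Y-%m-%d %H:%M:%S")

# Share of records per calendar day, in chronological order.
daily_share = crawled_dt.dt.date.value_counts(normalize=True).sort_index()

print(daily_share)
print(crawled_dt.min().date(), crawled_dt.max().date())
```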

In [45]:
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.001080
2016-03-06    0.004421
2016-03-07    0.005361
2016-03-08    0.007581
2016-03-09    0.009841
2016-03-10    0.010762
2016-03-11    0.012522
2016-03-12    0.023803
2016-03-13    0.008981
2016-03-14    0.012802
2016-03-15    0.015882
2016-03-16    0.016442
2016-03-17    0.027924
2016-03-18    0.007421
2016-03-19    0.015742
2016-03-20    0.020703
2016-03-21    0.020723
2016-03-22    0.021583
2016-03-23    0.018583
2016-03-24    0.019563
2016-03-25    0.019203
2016-03-26    0.016962
2016-03-27    0.016022
2016-03-28    0.020863
2016-03-29    0.022343
2016-03-30    0.024843
2016-03-31    0.023823
2016-04-01    0.023103
2016-04-02    0.024903
2016-04-03    0.025364
2016-04-04    0.024623
2016-04-05    0.124297
2016-04-06    0.220991
2016-04-07    0.130938
Name: last_seen, dtype: float64

The last_seen column covers the same time span as the date_crawled column.

In [46]:
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2015-06-11    0.000020
2015-08-10    0.000020
2015-09-09    0.000020
2015-11-10    0.000020
2015-12-05    0.000020
                ...   
2016-04-03    0.038925
2016-04-04    0.036885
2016-04-05    0.011842
2016-04-06    0.003260
2016-04-07    0.001280
Name: ad_created, Length: 76, dtype: float64

The ad creation dates range between June 2015 and April 2016.

In [47]:
autos.registration_year.describe()

count    49993.000000
mean      2005.073650
std        105.720065
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

As stated previously, the registration_year column has some strange values: the minimum is 1000, long before cars existed, and the maximum, 9999, is far in the future.

In [48]:
autos[autos["registration_year"] < 1906]

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_PS,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
10556,2016-04-01 06:02:10,UNFAL_Auto,privat,Angebot,450,control,,1800,,1800,,5000,2,,mitsubishi,nein,2016-04-01 00:00:00,0,63322,2016-04-01 09:42:30
22316,2016-03-29 16:56:41,VW_Kaefer.__Zwei_zum_Preis_von_einem.,privat,Angebot,1500,control,,1000,manuell,0,kaefer,5000,0,benzin,volkswagen,,2016-03-29 00:00:00,0,48324,2016-03-31 10:15:28
24511,2016-03-17 19:45:11,Trabant__wartburg__Ostalgie,privat,Angebot,490,control,,1111,,0,,5000,0,,trabant,,2016-03-17 00:00:00,0,16818,2016-04-07 07:17:29
32585,2016-04-02 16:56:39,UNFAL_Auto,privat,Angebot,450,control,,1800,,1800,,5000,2,,mitsubishi,nein,2016-04-02 00:00:00,0,63322,2016-04-04 14:46:21
35238,2016-03-26 13:45:20,Suche_Skoda_Fabia____Skoda_Fabia_Combi_mit_Klima,privat,Angebot,0,control,,1500,,0,,5000,0,benzin,skoda,,2016-03-26 00:00:00,0,15517,2016-04-04 00:16:54
49283,2016-03-15 18:38:53,Citroen_HY,privat,Angebot,7750,control,,1001,,0,andere,5000,0,,citroen,,2016-03-15 00:00:00,0,66706,2016-04-06 18:47:20


We saw that the ads were crawled in 2016, so registration years later than 2016 are incorrect. 1906 is generally considered the starting year for car registration, so I'm going to keep only cars with registration years from 1906 to 2016.

In [49]:
print(autos[autos["registration_year"] < 1906].shape)

(6, 20)


In [50]:
print(autos[autos["registration_year"] > 2019].shape)

(18, 20)


In [51]:
autos.loc[autos["registration_year"] > 2019, "registration_year"].value_counts()

9999    4
5000    4
9000    2
6200    1
5911    1
4500    1
2800    1
9996    1
8888    1
4100    1
4800    1
Name: registration_year, dtype: int64

In [52]:
autos = autos[(autos["registration_year"]>= 1906) &(autos["registration_year"]<= 2016)]
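
The two comparisons in the filter above can also be written as a single inclusive range check. This is an equivalent sketch on a made-up column (`toy` and `valid` are illustrative names).

```python
import pandas as pd

# Toy registration years, including the boundary values.
toy = pd.DataFrame({"registration_year": [1000, 1906, 1999, 2016, 2018, 9999]})

# between() is inclusive, so 1906 and 2016 themselves are kept.
valid = toy["registration_year"].between(1906, 2016)

print(toy[valid])
print("rows dropped:", (~valid).sum())
```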

In [53]:
autos.shape

(48023, 20)

In [54]:
autos.registration_year.value_counts(normalize=True).head(15)

2000    0.069842
2005    0.062782
1999    0.062449
2004    0.056994
2003    0.056785
2006    0.056390
2001    0.056265
2002    0.052746
1998    0.051080
2007    0.047977
2008    0.046457
2009    0.043687
1997    0.042230
2011    0.034025
2010    0.033255
Name: registration_year, dtype: float64

The 15 most common registration years all fall between 1997 and 2011, and together they account for roughly three quarters of the vehicles.

### Car Brands

Let's see which brands are the most popular in the database. Since the data come from a German website, I expect to find a significant share of German brands.

In [55]:
car_brands = autos.brand.value_counts(ascending=False, normalize=True)
print(car_brands)

volkswagen        0.212128
bmw               0.110031
opel              0.108157
mercedes_benz     0.095350
audi              0.086396
ford              0.069779
renault           0.047352
peugeot           0.029528
fiat              0.025863
seat              0.018179
skoda             0.016034
mazda             0.015139
nissan            0.015097
smart             0.013910
citroen           0.013910
toyota            0.012473
sonstige_autos    0.010953
hyundai           0.009849
volvo             0.009246
mini              0.008642
mitsubishi        0.008142
honda             0.007850
kia               0.007101
alfa_romeo        0.006622
porsche           0.006101
suzuki            0.005914
chevrolet         0.005706
chrysler          0.003665
dacia             0.002561
daihatsu          0.002561
jeep              0.002249
subaru            0.002186
land_rover        0.002041
saab              0.001603
jaguar            0.001583
trabant           0.001562
daewoo            0.001499
r

In [56]:
car_brands[0:5].sum()

0.6120608874914103

As expected, the five brands with the most cars are all German: Volkswagen, BMW, Opel, Mercedes-Benz and Audi. Together they account for about 61% of all the cars in the dataset. <br>
Let's see the mean price for these five brands.

In [57]:
top_brands = list(car_brands.index[0:5])
top_brands

['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi']

### Top brands mean prices

In [58]:
top_brand_mean_price = {}
for brand in top_brands:
    mean_price = autos.loc[autos["brand"] == brand, "price"].mean()
    top_brand_mean_price[brand] = mean_price
for key, value in top_brand_mean_price.items():
    print("{} : {}".format(key, value))

volkswagen : 5426.382546382644
bmw : 8334.645155185466
opel : 2876.716403542549
mercedes_benz : 8485.239571958942
audi : 9093.65003615329


Among the top brands, Audi has the highest mean price, while Opel has by far the lowest.

In [59]:
bmp_series = pd.Series(top_brand_mean_price)
df = pd.DataFrame(bmp_series, columns=['mean_price'])
df

Unnamed: 0,mean_price
volkswagen,5426.382546
bmw,8334.645155
opel,2876.716404
mercedes_benz,8485.239572
audi,9093.650036


### Top brands mean mileage

In [60]:
top_brand_mean_km = {}
for brand in top_brands:
    mean_km = autos.loc[autos["brand"] == brand, "odometer_km"].mean()
    top_brand_mean_km[brand] = mean_km
for key, value in top_brand_mean_km.items():
    print("{} : {}".format(key, value))

volkswagen : 128728.28114263277
bmw : 132434.70855412565
opel : 129223.14208702349
mercedes_benz : 130856.0821139987
audi : 129287.78018799711


In [61]:
top_mileage = pd.Series(top_brand_mean_km).sort_values(ascending = False)

In [62]:
top_mileage

bmw              132434.708554
mercedes_benz    130856.082114
audi             129287.780188
opel             129223.142087
volkswagen       128728.281143
dtype: float64

In [63]:
top_price =  pd.Series(top_brand_mean_price).sort_values(ascending = False)

In [64]:
top_price

audi             9093.650036
mercedes_benz    8485.239572
bmw              8334.645155
volkswagen       5426.382546
opel             2876.716404
dtype: float64

In [65]:
df_top_price_km = pd.DataFrame(top_price, columns = ["mean_price"])

In [66]:
df_top_price_km

Unnamed: 0,mean_price
audi,9093.650036
mercedes_benz,8485.239572
bmw,8334.645155
volkswagen,5426.382546
opel,2876.716404


In [67]:
df_top_price_km["mean_km"] = top_mileage

In [68]:
df_top_price_km

Unnamed: 0,mean_price,mean_km
audi,9093.650036,129287.780188
mercedes_benz,8485.239572,130856.082114
bmw,8334.645155,132434.708554
volkswagen,5426.382546,128728.281143
opel,2876.716404,129223.142087
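
The loops and intermediate series used to build this table can be condensed into a single `groupby` aggregation. The following is a sketch of that alternative on a small made-up dataframe, so the numbers it prints are not those of the real dataset.

```python
import pandas as pd

# Toy stand-in for the cleaned dataframe (values are illustrative only).
toy = pd.DataFrame({
    "brand": ["audi", "audi", "opel", "opel", "bmw", "fiat"],
    "price": [9000, 10000, 2500, 3500, 8000, 1000],
    "odometer_km": [125000, 150000, 150000, 100000, 150000, 90000],
})

# The two most frequent brands (the notebook uses the top five).
top_brands = toy["brand"].value_counts().index[:2]

# Mean price and mean mileage per brand in one pass.
summary = (
    toy[toy["brand"].isin(top_brands)]
    .groupby("brand")
    .agg(mean_price=("price", "mean"), mean_km=("odometer_km", "mean"))
    .sort_values("mean_price", ascending=False)
)

print(summary)
```

Computing both means in one aggregation keeps them aligned on the brand index, so no separate column assignment is needed.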
