# Cleaning and Analyzing Used Car Data

## In this project we'll work with a dataset of used cars from *eBay Kleinanzeigen*, the classifieds section of the German eBay website.

## The aim of this project is to clean the data and analyze the used car listings.

### Start by loading the data and taking a look at it:

In [171]:
# import modules
import numpy as np
import pandas as pd

In [172]:
# read in the data using pandas
# need to use Latin-1 encoding
autos = pd.read_csv("autos.csv",encoding="Latin-1")

In [173]:
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
5,2016-03-21 13:47:45,Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...,privat,Angebot,"$7,900",test,bus,2006,automatik,150,voyager,"150,000km",4,diesel,chrysler,,2016-03-21 00:00:00,0,22962,2016-04-06 09:45:21
6,2016-03-20 17:55:21,VW_Golf_III_GT_Special_Electronic_Green_Metall...,privat,Angebot,$300,test,limousine,1995,manuell,90,golf,"150,000km",8,benzin,volkswagen,,2016-03-20 00:00:00,0,31535,2016-03-23 02:48:59
7,2016-03-16 18:55:19,Golf_IV_1.9_TDI_90PS,privat,Angebot,"$1,990",control,limousine,1998,manuell,90,golf,"150,000km",12,diesel,volkswagen,nein,2016-03-16 00:00:00,0,53474,2016-04-07 03:17:32
8,2016-03-22 16:51:34,Seat_Arosa,privat,Angebot,$250,test,,2000,manuell,0,arosa,"150,000km",10,,seat,nein,2016-03-22 00:00:00,0,7426,2016-03-26 18:18:10
9,2016-03-16 13:47:02,Renault_Megane_Scenic_1.6e_RT_Klimaanlage,privat,Angebot,$590,control,bus,1997,manuell,90,megane,"150,000km",7,benzin,renault,nein,2016-03-16 00:00:00,0,15749,2016-04-06 10:46:35


In [174]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object

In [175]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


### This dataset contains 50,000 observations with 20 columns (most of which are strings) describing the car postings.  Several of the variables (vehicleType, gearbox, model, fuelType, and notRepairedDamage) have missing values.
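The missing values noted above can be quantified per column. The following is a minimal sketch on a made-up four-row table (the `toy` frame and its values are assumptions for illustration, not rows from `autos.csv`); the same two calls should work on `autos` directly.

```python
import pandas as pd

# Toy stand-in for the listings table; the values are invented for illustration.
toy = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", "limousine"],
    "gearbox": ["manuell", "automatik", None, None],
    "brand": ["peugeot", "bmw", "ford", "seat"],
})

# Number of missing values per column, and the same as a share of all rows.
missing_count = toy.isnull().sum()
missing_share = toy.isnull().mean()

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("missing", ascending=False))
```

Sorting by the count puts the columns that need the most attention at the top.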

## Rename the columns in snake_case and make the names more descriptive:

In [176]:
# print column names
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [177]:
# rename columns
autos.columns=['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'powerPS', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_pictures', 'postal_code',
       'last_seen']
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,powerPS,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
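The mechanical part of the rename (camelCase to snake_case) could also be done with a regular expression instead of typing every name by hand. This is only a sketch: `camel_to_snake` is a hypothetical helper, and the more descriptive renames (for example `yearOfRegistration` to `registration_year`) would still need an explicit mapping.

```python
import re

def camel_to_snake(name):
    """Insert "_" wherever a lowercase letter or digit is followed by a capital, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

original = ["dateCrawled", "offerType", "vehicleType", "powerPS", "postalCode"]
converted = [camel_to_snake(c) for c in original]
print(converted)
```

Note that this rule turns `powerPS` into `power_ps`, whereas the manual rename above leaves that column as `powerPS`.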


### Explore the data to determine what other cleaning tasks need to be done

In [178]:
# look at summary stats
autos.describe(include='all') # include both numeric and categorical variables

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,powerPS,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-04-02 11:37:04,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


In [179]:
autos["seller"].value_counts()

privat        49999
gewerblich        1
Name: seller, dtype: int64

In [180]:
autos["offer_type"].value_counts()

Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64

In [181]:
autos["num_pictures"].value_counts()

0    50000
Name: num_pictures, dtype: int64

In [182]:
autos["postal_code"].value_counts()

10115    109
65428    104
66333     54
45888     50
44145     48
48599     47
65933     45
65719     44
15344     43
37154     42
50354     42
52525     42
38518     40
44339     40
32791     40
51065     40
77933     40
45881     40
30419     40
25524     39
33378     39
32257     39
46325     39
53773     38
61169     38
85055     38
60386     37
65929     37
52477     37
21423     37
        ... 
76872      1
88486      1
82405      1
74249      1
7927       1
29379      1
86759      1
19300      1
56182      1
31618      1
29571      1
17349      1
31714      1
76776      1
91233      1
40670      1
17509      1
74921      1
54455      1
97506      1
95491      1
9481       1
56598      1
91489      1
54647      1
23942      1
83365      1
95683      1
97794      1
67585      1
Name: postal_code, Length: 7014, dtype: int64

In [183]:
autos["registration_year"].value_counts()

2000    3354
2005    3015
1999    3000
2004    2737
2003    2727
2006    2708
2001    2703
2002    2533
1998    2453
2007    2304
2008    2231
2009    2098
1997    2028
2011    1634
2010    1597
2017    1453
1996    1444
2012    1323
2016    1316
1995    1313
2013     806
2014     666
1994     660
2018     492
1993     445
2015     399
1990     395
1992     391
1991     356
1989     181
        ... 
1950       3
1955       2
9000       2
1954       2
1800       2
1957       2
1941       2
1951       2
1934       2
4100       1
4800       1
1953       1
1111       1
1927       1
6200       1
4500       1
1943       1
5911       1
1939       1
1938       1
2800       1
8888       1
1000       1
1500       1
1948       1
1931       1
1929       1
1001       1
9996       1
1952       1
Name: registration_year, Length: 97, dtype: int64

### Observations:

- The variables `seller` and `offer_type` take the same value in all but one observation, and `num_pictures` is always zero. These variables carry almost no information and can be removed from the analysis.

- The variables `price` and `odometer` are stored as text and should be converted to numeric variables.

- Some values for `postal_code` contain fewer than 5 digits (most likely because leading zeros were lost when the column was read as integers), and some of the values for `registration_year` do not make sense (e.g. 1000, 6200, etc.).
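Two of these observations can be acted on with a few lines of pandas. The sketch below uses a made-up four-row table; the 0.75 threshold is an arbitrary choice that only suits such a tiny example (on the full data a cutoff near 0.99 would be more sensible), and zero-padding assumes every postal code is meant to be a 5-digit German code.

```python
import pandas as pd

# Toy stand-in for the listings table; the values are invented for illustration.
toy = pd.DataFrame({
    "seller": ["privat", "privat", "privat", "gewerblich"],
    "num_pictures": [0, 0, 0, 0],
    "postal_code": [79588, 7426, 33729, 4552],
})

# Share of rows taken up by the most common value in each column.
top_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Columns dominated by a single value carry little information.
to_drop = top_share[top_share >= 0.75].index.tolist()
cleaned = toy.drop(columns=to_drop)

# Integer postal codes lose leading zeros; restore them as 5-character strings.
cleaned["postal_code"] = cleaned["postal_code"].astype(str).str.zfill(5)
print(to_drop)
print(cleaned)
```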

In [184]:
# clean up price variable
# regex=False treats "$" as a literal character rather than a regex anchor
price_cleaned = ( autos["price"].str.replace("$","",regex=False)
                                .str.replace(",","",regex=False)
                                .astype(float))
autos["price"]=price_cleaned
autos["price"].value_counts()

0.0           1421
500.0          781
1500.0         734
2500.0         643
1200.0         639
1000.0         639
600.0          531
800.0          498
3500.0         498
2000.0         460
999.0          434
750.0          433
900.0          420
650.0          419
850.0          410
700.0          395
4500.0         394
300.0          384
2200.0         382
950.0          379
1100.0         376
1300.0         371
3000.0         365
550.0          356
1800.0         355
5500.0         340
1250.0         335
350.0          335
1600.0         327
1999.0         322
              ... 
2225.0           1
69997.0          1
139997.0         1
69999.0          1
4780.0           1
8930.0           1
21599.0          1
15911.0          1
10000000.0       1
5180.0           1
919.0            1
1247.0           1
5998.0           1
27020.0          1
21888.0          1
46500.0          1
2001.0           1
2459.0           1
345000.0         1
34940.0          1
2785.0           1
5248.0      

In [185]:
# clean up odometer variable: strip the "km" suffix and the thousands separators
odometer_cleaned = ( autos["odometer"].str.replace("km","")
                                      .str.replace(",","")
                                      .astype(float))
autos["odometer"]=odometer_cleaned
autos["odometer"].value_counts()

150000.0    32424
125000.0     5170
100000.0     2169
90000.0      1757
80000.0      1436
70000.0      1230
60000.0      1164
50000.0      1027
5000.0        967
40000.0       819
30000.0       789
20000.0       784
10000.0       264
Name: odometer, dtype: int64

In [186]:
autos.rename(columns={"odometer":"odometer_km"},index=str,inplace=True)
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'powerPS', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

### Continue cleaning the `odometer_km` and `price` columns (in particular remove outlying values)

In [187]:
# explore odometer_km values
autos["odometer_km"].unique().shape

(13,)

In [188]:
autos["odometer_km"].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

### The values for `odometer_km` all seem reasonable

In [189]:
# explore price values
autos["price"].unique().shape

(2357,)

In [190]:
autos["price"].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [191]:
# lowest price values
# use .iloc so that 2000 is a position in the sorted series, not an index label
autos["price"].sort_values(ascending=True).iloc[2000] # 150
autos.loc[autos["price"] < 150,:]

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,powerPS,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_pictures,postal_code,last_seen
25,2016-03-21 21:56:18,Ford_escort_kombi_an_bastler_mit_ghia_ausstattung,privat,Angebot,90.0,control,kombi,1996,manuell,116,,150000.0,4,benzin,ford,ja,2016-03-21 00:00:00,0,27574,2016-04-01 05:16:49
27,2016-03-27 18:45:01,Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE,privat,Angebot,0.0,control,,2005,,0,,150000.0,0,,ford,,2016-03-27 00:00:00,0,66701,2016-03-27 18:45:01
30,2016-03-14 11:47:31,Peugeot_206_Unfallfahrzeug,privat,Angebot,80.0,test,kleinwagen,2002,manuell,60,2_reihe,150000.0,6,benzin,peugeot,ja,2016-03-14 00:00:00,0,57076,2016-03-14 11:47:31
55,2016-03-07 02:47:54,Mercedes_E320_AMG_zu_Tauschen!,privat,Angebot,1.0,test,,2017,automatik,224,e_klasse,125000.0,7,benzin,mercedes_benz,nein,2016-03-06 00:00:00,0,22111,2016-03-08 05:45:44
64,2016-04-05 07:36:19,Autotransport__Abschlepp_Schlepper,privat,Angebot,40.0,test,,2011,,0,5er,150000.0,5,,bmw,,2016-04-05 00:00:00,0,40591,2016-04-07 12:16:01
71,2016-03-28 19:39:35,Suche_Opel_Astra_F__Corsa_oder_Kadett_E_mit_Re...,privat,Angebot,0.0,control,,1990,manuell,0,,5000.0,0,benzin,opel,,2016-03-28 00:00:00,0,4552,2016-04-07 01:45:48
80,2016-03-09 15:57:57,Nissan_Primera_Hatchback_1_6_16v_73_Kw___99Ps_...,privat,Angebot,0.0,control,coupe,1999,manuell,99,primera,150000.0,3,benzin,nissan,ja,2016-03-09 00:00:00,0,66903,2016-03-09 16:43:50
87,2016-03-29 23:37:22,Bmw_520_e39_zum_ausschlachten,privat,Angebot,0.0,control,,2000,,0,5er,150000.0,0,,bmw,,2016-03-29 00:00:00,0,82256,2016-04-06 21:18:15
99,2016-04-05 09:48:54,Peugeot_207_CC___Cabrio_Bj_2011,privat,Angebot,0.0,control,cabrio,2011,manuell,0,2_reihe,60000.0,7,diesel,peugeot,nein,2016-04-05 00:00:00,0,99735,2016-04-07 12:17:34
118,2016-03-12 05:03:00,VW_Sharan_V6_204_PS_Karosse_Rohkarosse_mit_Pap...,privat,Angebot,0.0,control,bus,2001,manuell,204,sharan,150000.0,7,benzin,volkswagen,ja,2016-03-12 00:00:00,0,15370,2016-03-12 21:44:23


In [192]:
# highest price values
autos["price"].sort_values(ascending=False).head(50)

39705    99999999.0
42221    27322222.0
39377    12345678.0
47598    12345678.0
27371    12345678.0
2897     11111111.0
24384    11111111.0
11137    10000000.0
47634     3890000.0
7814      1300000.0
22947     1234566.0
43049      999999.0
514        999999.0
37585      999990.0
36818      350000.0
14715      345000.0
34723      299000.0
35923      295000.0
12682      265000.0
47337      259000.0
38299      250000.0
37840      220000.0
40918      198000.0
43668      197000.0
28090      194000.0
20351      190000.0
17140      180000.0
11433      175000.0
32840      169999.0
18509      169000.0
22673      163991.0
45387      163500.0
10500      155000.0
33638      151990.0
49668      145000.0
32185      139997.0
2454       137999.0
14268      135000.0
49815      130000.0
1878       129000.0
8232       128000.0
2751       120000.0
44406      120000.0
43282      119900.0
38814      119500.0
33884      116000.0
21783      115991.0
7402       115000.0
22060      114400.0
49391      109999.0


In [193]:
# keep only listings priced between $500 and $350,000 (inclusive);
# anything cheaper or more expensive is treated as an outlier and dropped
autos=autos[autos["price"].between(500,350000)]

In [194]:
autos.shape

(45097, 20)
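The filter leaves 45,097 of the original 50,000 rows, so it removes 4,903 listings (about 9.8%). `Series.between` includes both endpoints by default, which the following sketch on made-up prices illustrates (the `prices` series is an assumption for illustration only).

```python
import pandas as pd

# Invented prices that straddle both cutoffs.
prices = pd.Series([0.0, 1.0, 499.0, 500.0, 2950.0, 350000.0, 12345678.0])

# between() is inclusive at both ends by default, so 500 and 350000 are kept.
mask = prices.between(500, 350000)
kept = prices[mask]

# Share of rows that the filter would remove.
removed_share = 1 - mask.mean()
print(kept.tolist(), round(removed_share, 3))
```

Computing the removed share before overwriting the DataFrame is a cheap way to check that a cutoff is not discarding more data than intended.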

### Explore the date columns, which are stored as strings

In [195]:
# look at the date columns
autos[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37
3,2016-03-12 16:58:10,2016-03-12 00:00:00,2016-03-15 03:16:28
4,2016-04-01 14:38:50,2016-04-01 00:00:00,2016-04-01 14:38:50


In [196]:
# investigate date_crawled
autos["date_crawled"].str[:10].value_counts(normalize=True,dropna=False).sort_index(ascending=True)

2016-03-05    0.025567
2016-03-06    0.014125
2016-03-07    0.036189
2016-03-08    0.033173
2016-03-09    0.032907
2016-03-10    0.032707
2016-03-11    0.033018
2016-03-12    0.037320
2016-03-13    0.015522
2016-03-14    0.036300
2016-03-15    0.034016
2016-03-16    0.029359
2016-03-17    0.031155
2016-03-18    0.012883
2016-03-19    0.034747
2016-03-20    0.038073
2016-03-21    0.037741
2016-03-22    0.033018
2016-03-23    0.032397
2016-03-24    0.028982
2016-03-25    0.031089
2016-03-26    0.032641
2016-03-27    0.031177
2016-03-28    0.034836
2016-03-29    0.033262
2016-03-30    0.033328
2016-03-31    0.031665
2016-04-01    0.033905
2016-04-02    0.035767
2016-04-03    0.038827
2016-04-04    0.036610
2016-04-05    0.013172
2016-04-06    0.003171
2016-04-07    0.001353
Name: date_crawled, dtype: float64

### The `date_crawled` values cover every day between 3/5/2016 and 4/7/2016.

In [197]:
# investigate ad_created
autos["ad_created"].str[:10].value_counts(normalize=True,dropna=False).sort_index(ascending=True)

2015-06-11    0.000022
2015-08-10    0.000022
2015-09-09    0.000022
2015-11-10    0.000022
2015-12-05    0.000022
2015-12-30    0.000022
2016-01-03    0.000022
2016-01-07    0.000022
2016-01-10    0.000044
2016-01-13    0.000022
2016-01-14    0.000022
2016-01-16    0.000022
2016-01-22    0.000022
2016-01-27    0.000067
2016-01-29    0.000022
2016-02-01    0.000022
2016-02-02    0.000044
2016-02-05    0.000044
2016-02-07    0.000022
2016-02-08    0.000022
2016-02-09    0.000022
2016-02-11    0.000022
2016-02-12    0.000044
2016-02-14    0.000044
2016-02-16    0.000022
2016-02-17    0.000022
2016-02-18    0.000044
2016-02-19    0.000067
2016-02-20    0.000044
2016-02-21    0.000044
                ...   
2016-03-09    0.032996
2016-03-10    0.032441
2016-03-11    0.033328
2016-03-12    0.037098
2016-03-13    0.016963
2016-03-14    0.034880
2016-03-15    0.033794
2016-03-16    0.029847
2016-03-17    0.030822
2016-03-18    0.013504
2016-03-19    0.033616
2016-03-20    0.038207
2016-03-21 

### The ads were created anywhere between 6/11/2015 and 4/7/2016.

In [198]:
# investigate last_seen
autos["last_seen"].str[:10].value_counts(normalize=True,dropna=False).sort_index(ascending=True)

2016-03-05    0.001087
2016-03-06    0.004169
2016-03-07    0.005211
2016-03-08    0.007007
2016-03-09    0.009468
2016-03-10    0.010289
2016-03-11    0.012041
2016-03-12    0.023904
2016-03-13    0.008870
2016-03-14    0.012285
2016-03-15    0.015677
2016-03-16    0.016165
2016-03-17    0.027674
2016-03-18    0.007406
2016-03-19    0.015411
2016-03-20    0.020423
2016-03-21    0.020667
2016-03-22    0.021243
2016-03-23    0.018405
2016-03-24    0.019536
2016-03-25    0.018582
2016-03-26    0.016476
2016-03-27    0.015456
2016-03-28    0.020534
2016-03-29    0.021354
2016-03-30    0.024148
2016-03-31    0.023438
2016-04-01    0.022862
2016-04-02    0.024880
2016-04-03    0.024946
2016-04-04    0.024303
2016-04-05    0.126616
2016-04-06    0.225314
2016-04-07    0.134155
Name: last_seen, dtype: float64

### The last_seen dates are between 3/5/2016 and 4/7/2016, the same as for date_crawled.
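The cells above slice the first ten characters of each string to get the date. An alternative is to parse the columns into real datetimes, which makes date arithmetic possible. The sketch below assumes the `YYYY-MM-DD HH:MM:SS` layout seen in the sample rows; the `days_listed` column is a hypothetical name introduced here for illustration.

```python
import pandas as pd

# Two rows copied from the sample output above.
toy = pd.DataFrame({
    "ad_created": ["2016-03-26 00:00:00", "2016-03-12 00:00:00"],
    "last_seen": ["2016-04-06 06:45:54", "2016-03-15 03:16:28"],
})

# Parse the strings into datetime64 values using an explicit format.
fmt = "%Y-%m-%d %H:%M:%S"
for col in ["ad_created", "last_seen"]:
    toy[col] = pd.to_datetime(toy[col], format=fmt)

# Whole days between the ad being created and the crawler last seeing it.
toy["days_listed"] = (toy["last_seen"] - toy["ad_created"]).dt.days
print(toy)
```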

### Take a look at the distribution of registration year:

In [199]:
autos["registration_year"].describe()

count    45097.000000
mean      2005.064173
std         89.652017
min       1000.000000
25%       2000.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

### As mentioned above, some of these values are implausible.

- In particular, since the listings were last seen on the website in 2016, registration years after 2016 cannot be correct.

- Registration years prior to 1900 are also implausible.
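The number and share of listings that fall outside the 1900 to 2016 interval can be computed directly from a boolean mask. This is a minimal sketch on made-up years (the `years` series is an assumption for illustration); on the real data the same expressions would be applied to `autos["registration_year"]`.

```python
import pandas as pd

# Invented registration years, including a few implausible ones.
years = pd.Series([1000, 1910, 1999, 2004, 2016, 2017, 2018, 9999])

# True where the year lies inside the plausible interval (inclusive).
valid = years.between(1900, 2016)

# Count and share of rows outside the interval.
n_invalid = int((~valid).sum())
share_invalid = (~valid).mean()
print(n_invalid, share_invalid)
```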

In [200]:
# inspect the earliest registration years (as a share of all listings)
# to spot values below the 1900 - 2016 interval
autos["registration_year"].value_counts(normalize=True).sort_index(ascending=True).head(10)

1000    0.000022
1001    0.000022
1910    0.000044
1927    0.000022
1929    0.000022
1931    0.000022
1934    0.000044
1937    0.000089
1938    0.000022
1939    0.000022
Name: registration_year, dtype: float64

In [201]:
autos["registration_year"].value_counts(normalize=True).sort_index(ascending=False).head(20)

9999    0.000067
9000    0.000022
8888    0.000022
6200    0.000022
5911    0.000022
5000    0.000044
4800    0.000022
4500    0.000022
4100    0.000022
2800    0.000022
2019    0.000022
2018    0.010245
2017    0.028738
2016    0.021066
2015    0.008071
2014    0.014458
2013    0.017673
2012    0.028982
2011    0.035856
2010    0.035169
Name: registration_year, dtype: float64

In [202]:
autos=autos[autos["registration_year"].between(1900,2016)]
autos["registration_year"].value_counts(normalize=True).sort_index(ascending=True).head(10)

1910    0.000046
1927    0.000023
1929    0.000023
1931    0.000023
1934    0.000046
1937    0.000092
1938    0.000023
1939    0.000023
1941    0.000046
1943    0.000023
Name: registration_year, dtype: float64

In [203]:
autos["registration_year"].value_counts(normalize=True).sort_index(ascending=False).head(10)

2016    0.021928
2015    0.008402
2014    0.015050
2013    0.018397
2012    0.030169
2011    0.037324
2010    0.036609
2009    0.047965
2008    0.050920
2007    0.052397
Name: registration_year, dtype: float64

### Investigate aggregating data by car brand

In [204]:
# check out unique brand values
autos["brand"].unique()

array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',
       'renault', 'audi', 'sonstige_autos', 'mazda', 'porsche', 'mini',
       'mercedes_benz', 'seat', 'toyota', 'opel', 'dacia', 'nissan',
       'jeep', 'saab', 'volvo', 'jaguar', 'fiat', 'skoda', 'subaru',
       'kia', 'citroen', 'mitsubishi', 'chevrolet', 'hyundai', 'honda',
       'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada',
       'rover', 'daihatsu', 'lancia'], dtype=object)

In [205]:
# let's look at brands that make up at least 2% of the observations
brand_shares = autos["brand"].value_counts(normalize=True)
brands = brand_shares[brand_shares>0.02].index.tolist()
brands

['volkswagen',
 'bmw',
 'mercedes_benz',
 'opel',
 'audi',
 'ford',
 'renault',
 'peugeot',
 'fiat']

In [206]:
brand_mean_prices = {}
for brand in brands:
    mean_price = autos.loc[autos["brand"]==brand,"price"].mean()
    brand_mean_prices[brand] = mean_price
brand_mean_prices

{'audi': 9613.64779393012,
 'bmw': 8582.261288380494,
 'fiat': 3256.152109911678,
 'ford': 4291.666312433581,
 'mercedes_benz': 8766.902708803611,
 'opel': 3394.0395675178283,
 'peugeot': 3360.9205974842766,
 'renault': 2819.059411146162,
 'volkswagen': 5783.622984749455}

### It looks like Audi is the most expensive and Renault the least expensive of the brands that make up at least 2% of the observations.

### Generally, Audi, BMW, and Mercedes-Benz are the most expensive, and Renault, Fiat, and Peugeot are the least expensive. Volkswagen, Opel, and Ford are mid-range.

In [209]:
brand_mean_km = {}
for brand in brands:
    mean_km = autos.loc[autos["brand"]==brand,"odometer_km"].mean()
    brand_mean_km[brand] = mean_km
brand_mean_km

{'audi': 128909.7169089518,
 'bmw': 132865.7435279952,
 'fiat': 114416.09421000982,
 'ford': 123494.50938717676,
 'mercedes_benz': 131019.18735891648,
 'opel': 128012.42236024844,
 'peugeot': 126073.11320754717,
 'renault': 126351.20925341746,
 'volkswagen': 128234.74945533769}

In [212]:
# convert dictionaries to series and create a data frame with info
# from both dictionaries
price_series = pd.Series(brand_mean_prices)
df = pd.DataFrame(price_series,columns=['mean_price'])
km_series = pd.Series(brand_mean_km)
df["mean_km"]=km_series
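The two loops and the dictionary-to-series conversion could probably be collapsed into a single `groupby`. The sketch below is an alternative formulation, not the approach used in the cells above; it runs on a made-up five-row table, and the 25% share threshold is chosen only so that the toy data has a brand to exclude (the notebook uses 2%).

```python
import pandas as pd

# Toy stand-in for the cleaned listings; the values are invented for illustration.
toy = pd.DataFrame({
    "brand": ["audi", "audi", "fiat", "fiat", "lada"],
    "price": [10000.0, 8000.0, 3000.0, 3500.0, 1500.0],
    "odometer_km": [125000.0, 150000.0, 100000.0, 125000.0, 50000.0],
})

# Brands whose share of listings exceeds the threshold.
shares = toy["brand"].value_counts(normalize=True)
common = shares[shares > 0.25].index

# Mean price and mean mileage per brand, computed in one pass.
summary = (toy[toy["brand"].isin(common)]
           .groupby("brand")[["price", "odometer_km"]]
           .mean()
           .rename(columns={"price": "mean_price", "odometer_km": "mean_km"})
           .sort_values("mean_price", ascending=False))
print(summary)
```

Sorting by `mean_price` makes it easier to see whether the more expensive brands also tend to have higher or lower mileage.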
