# Car Listing Analysis 

In this project, we'll work with a dataset of used cars from __"eBay Kleinanzeigen"__, a classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to Kaggle by user orgesleka.
The original dataset isn't available on Kaggle anymore, but you can find it [here](https://data.world/data-society/used-cars-data).

We've made a few modifications from the original dataset:

We sampled **50,000** data points from the full dataset, to ensure your code runs quickly in our hosted environment
We dirtied the dataset a bit to more closely resemble what you would expect from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with)
The data dictionary provided with data is as follows:

| name | defintion |
| :----: | :--- | 
|`dateCrawled` |When this ad was first crawled. All field-values are taken from this date |
|`name` | Name of the car |
|`seller` | Whether the seller is private or a dealer|
|`offerType` | The type of listing|
|`price` | The price on the ad to sell the car|
| `abtest` | Whether the listing is included in an A/B test | 
|`vehicleType` | The vehicle Type |
|`yearOfRegistration`| The year in which the car was first registered |
|`gearbox`| The transmission type|
|`powerPS`| The power of the car in PS|
|`model`| The car model name|
|`kilometer`|How many kilometers the car has driven|
|`monthOfRegistration`|The month in which the car was first registered|
|`fuelType` |What type of fuel the car uses|
|`brand`|The brand of the car|
|`notRepairedDamage`|If the car has a damage which is not yet repaired|
|`dateCreated`|The date on which the eBay listing was created|
|`nrOfPictures`|The number of pictures in the ad|
|`postalCode`|The postal code for the location of the vehicle|
|`lastSeenOnline`|When the crawler saw this ad last online|

Now we are going to import the data 

In [1]:
import pandas as pd 
autos = pd.read_csv("autos.csv", encoding= "Latin-1")

In [2]:
autos 

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


In [3]:
print(autos.info())
print(autos.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

From the work we did in the last screen, we can make the following observations:

1. The dataset contains 20 columns, most of which are strings.
2. Some columns have null values, but none have more than ~20% null values.
3. The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores.


In [4]:
autos.columns = autos.columns.str.lower() 
autos.rename(columns={"yearofregistration" : "registration_year" , "monthofregistration" : "registration_month"
                     , "notrepaireddamage" : "unreparied_damage" , "datecreated" : "ad_created"}, inplace = True)

In [5]:
autos.columns

Index(['datecrawled', 'name', 'seller', 'offertype', 'price', 'abtest',
       'vehicletype', 'registration_year', 'gearbox', 'powerps', 'model',
       'odometer', 'registration_month', 'fueltype', 'brand',
       'unreparied_damage', 'ad_created', 'nrofpictures', 'postalcode',
       'lastseen'],
      dtype='object')

In [6]:
autos.head(5)

Unnamed: 0,datecrawled,name,seller,offertype,price,abtest,vehicletype,registration_year,gearbox,powerps,model,odometer,registration_month,fueltype,brand,unreparied_damage,ad_created,nrofpictures,postalcode,lastseen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


The purpose of doing this, is to make the column easier to acccess by making them all lower case data. 

In [7]:
autos.describe(include="all") 

Unnamed: 0,datecrawled,name,seller,offertype,price,abtest,vehicletype,registration_year,gearbox,powerps,model,odometer,registration_month,fueltype,brand,unreparied_damage,ad_created,nrofpictures,postalcode,lastseen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-12 16:06:22,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


## DATA CLEANING PROCESS

From data above, things that need to be considered: 
1. Any columns that have mostly one value that are candidates to be dropped: 
     * seller 
     * offertype
2. Any columns that need more investigation. 
    * registration year  powerps
    * regristration_month 
    * postalcode 
3. Any examples of numeric data stored as text that needs to be cleaned.
    * price 
    * odometer

In [8]:
#dropping the seller and offertype because they offer the samethings 
autos.drop(["seller","offertype"],inplace = True, axis = 1 )

In [9]:
#Changing the type of the column so that it can be analyzed easily
autos["price"] = autos["price"].str.replace("$", "").str.replace(",","").astype(float)
autos["odometer"] = autos["odometer"].str.replace(",","").str.replace("km","").astype(float)
autos.rename({"price" : "price_in_$" , "odometer" : "odometer_km"}, inplace = True, axis = 1)

In [10]:
autos.describe()

Unnamed: 0,price_in_$,registration_year,powerps,odometer_km,registration_month,nrofpictures,postalcode
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,9840.044,2005.07328,116.35592,125732.7,5.72336,0.0,50813.6273
std,481104.4,105.712813,209.216627,40042.211706,3.711984,0.0,25779.747957
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1100.0,1999.0,70.0,125000.0,3.0,0.0,30451.0
50%,2950.0,2003.0,105.0,150000.0,6.0,0.0,49577.0
75%,7200.0,2008.0,150.0,150000.0,9.0,0.0,71540.0
max,100000000.0,9999.0,17700.0,150000.0,12.0,0.0,99998.0


In [11]:
autos["odometer_km"].unique().shape
autos["odometer_km"].describe()
autos["odometer_km"].value_counts().sort_values(ascending=False)
#No outlier needed to be dropped because it is already propotional and makes sense  

150000.0    32424
125000.0     5170
100000.0     2169
90000.0      1757
80000.0      1436
70000.0      1230
60000.0      1164
50000.0      1027
5000.0        967
40000.0       819
30000.0       789
20000.0       784
10000.0       264
Name: odometer_km, dtype: int64

In [12]:
autos["price_in_$"].unique().shape


(2357,)

In [13]:
autos["price_in_$"].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_in_$, dtype: float64

In [14]:
autos["price_in_$"].value_counts().sort_index(ascending=False).head(20)

99999999.0    1
27322222.0    1
12345678.0    3
11111111.0    2
10000000.0    1
3890000.0     1
1300000.0     1
1234566.0     1
999999.0      2
999990.0      1
350000.0      1
345000.0      1
299000.0      1
295000.0      1
265000.0      1
259000.0      1
250000.0      1
220000.0      1
198000.0      1
197000.0      1
Name: price_in_$, dtype: int64

In [15]:
# The used car for those amount of price seems really impossible 
outliers_in_price = autos[autos["price_in_$"].between(999990,99999999)]
autos.drop(outliers_in_price.index, axis = 0,inplace = True )
autos["price_in_$"].value_counts().sort_index(ascending=False).head(20)

350000.0    1
345000.0    1
299000.0    1
295000.0    1
265000.0    1
259000.0    1
250000.0    1
220000.0    1
198000.0    1
197000.0    1
194000.0    1
190000.0    1
180000.0    1
175000.0    1
169999.0    1
169000.0    1
163991.0    1
163500.0    1
155000.0    1
151990.0    1
Name: price_in_$, dtype: int64

In [16]:
#It is impossible to sell the used car for free, therefore, it needs to be dropped 
autos.drop(autos[autos["price_in_$"] == 0.0].index , inplace = True )

In [17]:
autos["price_in_$"].value_counts()

500.0      781
1500.0     734
2500.0     643
1000.0     639
1200.0     639
          ... 
3129.0       1
69900.0      1
6202.0       1
18310.0      1
4349.0       1
Name: price_in_$, Length: 2346, dtype: int64

In [18]:
autos["datecrawled"].str[:10].value_counts(normalize = 1, dropna = 0).sort_index()

2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: datecrawled, dtype: float64

It is found that the most frequent date that the data is crawled is in 2016-04-04, but it is actually evenly distributed 


In [19]:
autos["ad_created"].str[:10].value_counts(normalize= 1, dropna= 0).sort_values(ascending = False)

2016-04-03    0.038855
2016-03-20    0.037949
2016-03-21    0.037579
2016-04-04    0.036858
2016-03-12    0.036755
                ...   
2016-02-22    0.000021
2016-01-03    0.000021
2016-01-22    0.000021
2016-01-14    0.000021
2016-02-11    0.000021
Name: ad_created, Length: 76, dtype: float64

In [20]:
autos["lastseen"].str[:10].value_counts(normalize= 1, dropna= 0).sort_values(ascending = False)

2016-04-06    0.221806
2016-04-07    0.131947
2016-04-05    0.124761
2016-03-17    0.028086
2016-04-03    0.025203
2016-04-02    0.024915
2016-03-30    0.024771
2016-04-04    0.024483
2016-03-12    0.023783
2016-03-31    0.023783
2016-04-01    0.022794
2016-03-29    0.022341
2016-03-22    0.021373
2016-03-28    0.020859
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-23    0.018532
2016-03-26    0.016802
2016-03-16    0.016452
2016-03-15    0.015876
2016-03-19    0.015834
2016-03-27    0.015649
2016-03-14    0.012602
2016-03-11    0.012375
2016-03-10    0.010666
2016-03-09    0.009595
2016-03-13    0.008895
2016-03-08    0.007413
2016-03-18    0.007351
2016-03-07    0.005395
2016-03-06    0.004324
2016-03-05    0.001071
Name: lastseen, dtype: float64

In [21]:
autos["registration_year"].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

It is found to be really impossible to have a 9999 as a year, therefore, there must be errors in inputting the data. In year 1000, the cars were still not invented yet.  

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.

Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.

In [22]:
outlier_min = autos[autos["registration_year"].between(1000,1889)]
outlier_max = autos[autos["registration_year"].between(2017,9999)]
autos.drop(outlier_min.index,inplace = True)
autos.drop(outlier_max.index, inplace = True)

In [23]:
autos["registration_year"].value_counts(normalize = 1).sort_values(ascending= False)

2000    0.067608
2005    0.062895
1999    0.062060
2004    0.057904
2003    0.057818
          ...   
1931    0.000021
1929    0.000021
1943    0.000021
1953    0.000021
1952    0.000021
Name: registration_year, Length: 78, dtype: float64

It is found that, the most car listed comes from cars produced in year 2000 

Now, we are going to change the data that use german words

In [24]:
autos["fueltype"].unique()

array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro',
       'andere'], dtype=object)

In [25]:
mapping_words = {"lpg" : "lpg", 
                 "benzin" : "diesel",
                 "diesel" : "diesel", 
                 "cng" : "cng", 
                 "hybrid" : "hybrid",
                 "elektro" : "electro", 
                 "andere" : "other"
                }
autos["fueltype"]= autos["fueltype"].map(mapping_words)

In [26]:
autos["fueltype"].unique()

array(['lpg', 'diesel', nan, 'cng', 'hybrid', 'electro', 'other'],
      dtype=object)

Cleaning the date data 

Starting from the data from datecrawled 

In [27]:
autos = autos.rename(columns= {"datecrawled":"timecrawled"})

In [28]:
autos["datecrawled"] = autos["timecrawled"].str[:10]

In [29]:
autos["datecrawled"] = autos["datecrawled"].str.replace("-","").astype(int)
autos["timecrawled"] = autos["timecrawled"].str[11:]

In [30]:
autos.describe(include = "all")

Unnamed: 0,timecrawled,name,price_in_$,abtest,vehicletype,registration_year,gearbox,powerps,model,odometer_km,registration_month,fueltype,brand,unreparied_damage,ad_created,nrofpictures,postalcode,lastseen,datecrawled
count,46681,46681,46681.0,46681,43977,46681.0,44571,46681.0,44488,46681.0,46681.0,43363,46681,38374,46681,46681.0,46681.0,46681,46681.0
unique,20127,35812,,2,8,,2,,244,,,6,40,2,74,,,37146,
top,20:36:19,Volkswagen_Golf_1.4,,test,limousine,,manuell,,golf,,,diesel,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27,
freq,13,75,,24062,12598,,34715,,3707,,,42572,9862,33834,1821,,,8,
mean,,,5977.716801,,,2002.910756,,117.892933,,125586.855466,5.827125,,,,,0.0,51097.434181,,20160330.0
std,,,9177.909479,,,7.185103,,184.922911,,39852.528628,3.6703,,,,,0.0,25755.387192,,31.92964
min,,,1.0,,,1910.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,,20160300.0
25%,,,1250.0,,,1999.0,,75.0,,100000.0,3.0,,,,,0.0,30827.0,,20160310.0
50%,,,3100.0,,,2003.0,,109.0,,150000.0,6.0,,,,,0.0,49828.0,,20160320.0
75%,,,7500.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71732.0,,20160330.0


now is the date created 

In [31]:
autos = autos.rename(columns = {"ad_created":"ad_time_created"})

In [32]:
autos["ad_date_created"] = autos["ad_time_created"].str[:10]


In [33]:
autos["ad_date_created"] = autos["ad_date_created"].str.replace("-","").astype(int)

In [34]:
autos["ad_time_created"] = autos["ad_time_created"].str[11:]

In [35]:
autos.describe(include = "all")

Unnamed: 0,timecrawled,name,price_in_$,abtest,vehicletype,registration_year,gearbox,powerps,model,odometer_km,registration_month,fueltype,brand,unreparied_damage,ad_time_created,nrofpictures,postalcode,lastseen,datecrawled,ad_date_created
count,46681,46681,46681.0,46681,43977,46681.0,44571,46681.0,44488,46681.0,46681.0,43363,46681,38374,46681,46681.0,46681.0,46681,46681.0,46681.0
unique,20127,35812,,2,8,,2,,244,,,6,40,2,1,,,37146,,
top,20:36:19,Volkswagen_Golf_1.4,,test,limousine,,manuell,,golf,,,diesel,volkswagen,nein,00:00:00,,,2016-04-07 06:17:27,,
freq,13,75,,24062,12598,,34715,,3707,,,42572,9862,33834,46681,,,8,,
mean,,,5977.716801,,,2002.910756,,117.892933,,125586.855466,5.827125,,,,,0.0,51097.434181,,20160330.0,20160330.0
std,,,9177.909479,,,7.185103,,184.922911,,39852.528628,3.6703,,,,,0.0,25755.387192,,31.92964,110.8499
min,,,1.0,,,1910.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,,20160300.0,20150610.0
25%,,,1250.0,,,1999.0,,75.0,,100000.0,3.0,,,,,0.0,30827.0,,20160310.0,20160310.0
50%,,,3100.0,,,2003.0,,109.0,,150000.0,6.0,,,,,0.0,49828.0,,20160320.0,20160320.0
75%,,,7500.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71732.0,,20160330.0,20160330.0


In [36]:
autos.rename(columns={"lastseen":"time_lastseen"}, inplace = True) 

In [37]:
autos["date_lastseen"] = autos["time_lastseen"].str[:10]

In [38]:
autos["date_lastseen"] = autos["date_lastseen"].str.replace("-","").astype(int)

In [39]:
autos["time_lastseen"] = autos["time_lastseen"].str[11:]

In [40]:
autos.describe(include= "all")

Unnamed: 0,timecrawled,name,price_in_$,abtest,vehicletype,registration_year,gearbox,powerps,model,odometer_km,...,fueltype,brand,unreparied_damage,ad_time_created,nrofpictures,postalcode,time_lastseen,datecrawled,ad_date_created,date_lastseen
count,46681,46681,46681.0,46681,43977,46681.0,44571,46681.0,44488,46681.0,...,43363,46681,38374,46681,46681.0,46681.0,46681,46681.0,46681.0,46681.0
unique,20127,35812,,2,8,,2,,244,,...,6,40,2,1,,,17204,,,
top,20:36:19,Volkswagen_Golf_1.4,,test,limousine,,manuell,,golf,,...,diesel,volkswagen,nein,00:00:00,,,06:16:52,,,
freq,13,75,,24062,12598,,34715,,3707,,...,42572,9862,33834,46681,,,12,,,
mean,,,5977.716801,,,2002.910756,,117.892933,,125586.855466,...,,,,,0.0,51097.434181,,20160330.0,20160330.0,20160370.0
std,,,9177.909479,,,7.185103,,184.922911,,39852.528628,...,,,,,0.0,25755.387192,,31.92964,110.8499,42.18126
min,,,1.0,,,1910.0,,0.0,,5000.0,...,,,,,0.0,1067.0,,20160300.0,20150610.0,20160300.0
25%,,,1250.0,,,1999.0,,75.0,,100000.0,...,,,,,0.0,30827.0,,20160310.0,20160310.0,20160320.0
50%,,,3100.0,,,2003.0,,109.0,,150000.0,...,,,,,0.0,49828.0,,20160320.0,20160320.0,20160400.0
75%,,,7500.0,,,2008.0,,150.0,,150000.0,...,,,,,0.0,71732.0,,20160330.0,20160330.0,20160410.0


## Data Analysis

### Top listed brand and price

In [41]:
brands = autos["brand"].value_counts(normalize = True)
brands

volkswagen        0.211264
bmw               0.110045
opel              0.107581
mercedes_benz     0.096463
audi              0.086566
ford              0.069900
renault           0.047150
peugeot           0.029841
fiat              0.025642
seat              0.018273
skoda             0.016409
nissan            0.015274
mazda             0.015188
smart             0.014160
citroen           0.014010
toyota            0.012703
hyundai           0.010025
sonstige_autos    0.009811
volvo             0.009147
mini              0.008762
mitsubishi        0.008226
honda             0.007840
kia               0.007069
alfa_romeo        0.006641
porsche           0.006127
suzuki            0.005934
chevrolet         0.005698
chrysler          0.003513
dacia             0.002635
daihatsu          0.002506
jeep              0.002271
subaru            0.002142
land_rover        0.002099
saab              0.001649
jaguar            0.001564
daewoo            0.001500
trabant           0.001392
r

It is found from the data above that, the number of brand vehicles that has top 5% of the total ammount of car listed are volkswagen, bmw, opel , mercedes_benz, audi and ford. Therefore, we are going to conduct aggregate mean for the selected brands

In [42]:
fivepercent_brands = brands[brands>=0.05].index
fivepercent_brands 

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')

In [43]:
Top_brands_price = {} 
for data in fivepercent_brands: 
    means = autos.loc[autos["brand"] == data, "price_in_$"].mean()
    Top_brands_price[data] = means 
Top_brands_price

{'volkswagen': 5402.410261610221,
 'bmw': 8332.820517811953,
 'opel': 2975.2419354838707,
 'mercedes_benz': 8628.450366422385,
 'audi': 9336.687453600594,
 'ford': 3749.4695065890287}

we aggregated across brands to understand mean price. We observed that in the top 6 brands, there's a distinct price gap.

* Audi, BMW and Mercedes Benz are more expensive
* Ford and Opel are less expensive
* Volkswagen is in between

For the top 6 brands, let's use aggregation to understand the average mileage for those cars and if there's any visible link with mean price. While our natural instinct may be to display both aggregated series objects and visually compare them, this has a few limitations:

* it's difficult to compare more than two aggregate series objects if we want to extend to more columns
* we can't compare more than a few rows from each series object
* we can only sort by the index (brand name) of both series objects so we can easily make visual comparisons

Instead, we can combine the data from both series objects into a single dataframe (with a shared index) and display the dataframe directly.

In [44]:
Top_brand_millage = {} 
for data in fivepercent_brands: 
    mean = round(autos.loc[autos["brand"]== data,"odometer_km"].mean() , 2)
    Top_brand_millage[data] = mean
Top_brand_millage

{'volkswagen': 128707.16,
 'bmw': 132572.51,
 'opel': 129310.04,
 'mercedes_benz': 130788.36,
 'audi': 129157.39,
 'ford': 124266.01}

In [45]:
mean_price = pd.Series(Top_brands_price)
mean_mileage = pd.Series(Top_brand_millage)
price_odometer_comparison = pd.DataFrame(mean_price, columns=["mean_price"])
price_odometer_comparison["mean_mileage"] = mean_mileage

In [46]:
price_odometer_comparison.sort_values(by = "mean_mileage")

Unnamed: 0,mean_price,mean_mileage
ford,3749.469507,124266.01
volkswagen,5402.410262,128707.16
audi,9336.687454,129157.39
opel,2975.241935,129310.04
mercedes_benz,8628.450366,130788.36
bmw,8332.820518,132572.51


## Most Listed Brand and Model

In [54]:
autos.groupby(["brand","model"]).size().sort_values(ascending = False).head(5)

brand       model 
volkswagen  golf      3707
bmw         3er       2615
volkswagen  polo      1609
opel        corsa     1592
volkswagen  passat    1349
dtype: int64

This conclude the top 5 car listed is from European brand.</br>  
The he most car listed is **volkswagen** , **golf** type 

### Price and Mileage Correlation 

In [50]:
autos.groupby("odometer_km")["price_in_$"].mean()

odometer_km
5000.0       8873.515924
10000.0     20550.867220
20000.0     18448.477089
30000.0     16608.836842
40000.0     15499.568381
50000.0     13812.173212
60000.0     12385.004433
70000.0     10927.182814
80000.0      9721.947636
90000.0      8465.025105
100000.0     8132.697279
125000.0     6214.022030
150000.0     3767.927107
Name: price_in_$, dtype: float64

This conclude that the car that has travelled more distance, is cheaper

### Damaged vs Undamaged Car Price 

In [51]:
autos["unreparied_damage"].value_counts()

nein    33834
ja       4540
Name: unreparied_damage, dtype: int64

In [52]:
autos.groupby("unreparied_damage")["price_in_$"].mean()

unreparied_damage
ja      2241.146035
nein    7164.033103
Name: price_in_$, dtype: float64

In [53]:
car_undamaged = autos.query("unreparied_damage == 'nein'")["price_in_$"].mean()
car_damaged = autos.loc[autos["unreparied_damage"] == "ja", "price_in_$"].mean()
car_damaged - car_undamaged

-4922.887067553713

This means that, the damaged car is much less in price compared to the undamaged car 