## Exploring Ebay Car Sales Data

In this project I explore the `Ebay Car Sales Dataset` from [Ebay Kleinanzeigen](https://www.ebay-kleinanzeigen.de/), the classified-advertisements section of the German eBay website, with categories devoted to jobs, housing, services, community service, gigs, and cars. It is similar to Craigslist in the United States.

In [2]:
import pandas as pd
import numpy as np

In [3]:
autos = pd.read_csv("autos.csv", encoding = "Latin-1")

In [4]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

In [5]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


After looking at the DataFrame description, we can see that the dataset contains some NaNs, mostly in the `notRepairedDamage` column, followed by `vehicleType` and `fuelType`. The `brand` column, however, is complete, which can help when filling in missing values in the `model` column. Another problem is that sellers may not have the complete technical information for their cars; for example, the `gearbox` column has many null values. One key column is `name`: since the metadata may be incomplete, the missing information can sometimes be derived from the title of the advertisement.
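The missing-value counts read off `autos.info()` can also be computed directly. The following is a minimal sketch on a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `autos`; the values are illustrative only.
toy = pd.DataFrame({
    "brand": ["opel", "bmw", "ford", "audi"],
    "gearbox": ["manuell", np.nan, "automatik", np.nan],
    "fuel_type": ["benzin", "diesel", np.nan, "benzin"],
})

# Number of missing values per column, largest first.
missing = toy.isnull().sum().sort_values(ascending=False)

# Share of missing values per column (the mean of a boolean mask).
share = toy.isnull().mean()

print(missing)
print(share)
```

Applied to the real data, the same two calls on `autos` should give the per-column gaps that the `Non-Null Count` column only shows indirectly.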

### Convert the column names from camelcase to snakecase and reword some of the column names based on the data dictionary

Here we see that the column names follow the camelCase format, but we want them in snake_case, since that is the more common convention in the Python community. We use the `DataFrame.rename()` method for this, so we first create a helper variable containing a dictionary that maps the old names to the new ones.

In [6]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [7]:
newcols = {"yearOfRegistration":"registration_year",
           "monthOfRegistration":"registration_month",
           "notRepairedDamage":"unrepaired_damage",
           "dateCreated":"ad_created",
           "dateCrawled": "date_crawled",
          "offerType": "offer_type",
          "vehicleType":"vehicle_type",
          "powerPS":"power_ps",
          "nrOfPictures":"nr_of_pictures",
          "postalCode":"postal_code",
          "lastSeen":"last_seen",
          "fuelType":"fuel_type"}

In [8]:
autos.rename(newcols, axis = 1, inplace=True)
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
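For the columns that only need a change of case, the mapping could also be generated instead of typed by hand. This is a sketch of one possible helper (`camel_to_snake` is a hypothetical name, not part of pandas). It does not cover the reworded names such as `registration_year`, which still need an explicit dictionary:

```python
import re

def camel_to_snake(name):
    """Insert "_" wherever a lowercase letter or digit is followed by a capital, then lowercase."""
    # "powerPS" -> "power_PS" -> "power_ps"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.lower()

columns = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "postalCode"]
converted = [camel_to_snake(c) for c in columns]
print(converted)
```

A dictionary built with `{c: camel_to_snake(c) for c in columns}` could then be passed to `DataFrame.rename()` in the same way as `newcols`.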


### Exploring the dataframe further with `DataFrame.describe()` and `Series.value_counts()`

In [9]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-08 10:40:35,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


In this analysis we can see the following:
* The `seller` column is almost entirely "privat" (49999 of 50000 rows) and has only 2 unique values. The same applies to the `offer_type` column, so we can remove both. Another column that could be removed is `nr_of_pictures`: every statistic for it is 0, so all of its values appear to be 0.
* The date-related columns may need more investigation. For example, `registration_year` is numeric but not in a datetime format, so we cannot directly calculate statistics over periods of time.
* Other columns need to be processed and converted to numbers, for example `price` and `odometer`, which are stored as text.

### Remove columns that are not so important

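The candidates here are `nr_of_pictures`, `seller` and `offer_type`. Note that the result of `drop()` below is only displayed and not assigned back, so `autos` itself keeps these columns; `seller` and `offer_type` are used again in the translation step later on.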

In [10]:
autos.drop(["nr_of_pictures", "seller", "offer_type"], axis=1).head()

Unnamed: 0,date_crawled,name,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,39218,2016-04-01 14:38:50
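The difference between previewing a drop and persisting it can be seen on a small example. This is a sketch with made-up data (`toy` is an assumption for illustration):

```python
import pandas as pd

# Toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "seller": ["privat", "privat"],
    "price": [5000, 8500],
    "nr_of_pictures": [0, 0],
})

# drop() returns a new DataFrame and leaves the original untouched.
preview = toy.drop(["seller", "nr_of_pictures"], axis=1)
print(list(preview.columns))
print(list(toy.columns))

# To persist the change, assign the result to a name.
trimmed = toy.drop(columns=["seller", "nr_of_pictures"])
print(list(trimmed.columns))
```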


### Convert the `odometer` and `price` to numerical values

In [11]:
print(autos["odometer"].unique())
print(autos["price"].unique())

['150,000km' '70,000km' '50,000km' '80,000km' '10,000km' '30,000km'
 '125,000km' '90,000km' '20,000km' '60,000km' '5,000km' '100,000km'
 '40,000km']
['$5,000' '$8,500' '$8,990' ... '$385' '$22,200' '$16,995']


In [12]:
autos["odometer"] = autos["odometer"].str.replace("km", "").str.replace(",", "").astype("float")
autos["odometer"].head()

0    150000.0
1    150000.0
2     70000.0
3     70000.0
4    150000.0
Name: odometer, dtype: float64

In [13]:
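# regex=False treats "$" as a literal character instead of a regex end-of-string anchor
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype("float")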
autos["price"].head()

0    5000.0
1    8500.0
2    8990.0
3    4350.0
4    1350.0
Name: price, dtype: float64
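The cleaning pattern used for both columns can be checked in isolation. This is a minimal sketch on a made-up Series; the only assumption beyond the notebook is passing `regex=False` explicitly so that `$` is always treated as a literal character:

```python
import pandas as pd

# Illustrative raw strings in the same format as the `price` column.
raw = pd.Series(["$5,000", "$8,500", "$0"])

clean = (
    raw.str.replace("$", "", regex=False)   # strip the currency symbol
       .str.replace(",", "", regex=False)   # strip the thousands separator
       .astype(float)
)
print(clean)
```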

Rename the `odometer` column to `odometer_km`

In [14]:
autos.rename({"odometer":"odometer_km"},axis = 1, inplace=True)
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

In [15]:
autos["odometer_km"].unique().shape

(13,)

In [16]:
autos["odometer_km"].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [17]:
autos["odometer_km"].value_counts()

150000.0    32424
125000.0     5170
100000.0     2169
90000.0      1757
80000.0      1436
70000.0      1230
60000.0      1164
50000.0      1027
5000.0        967
40000.0       819
30000.0       789
20000.0       784
10000.0       264
Name: odometer_km, dtype: int64

In [18]:
autos["price"].unique().shape

(2357,)

In [19]:
autos["price"].describe()

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [20]:
autos["price"].value_counts().sort_index(ascending = False).head(15)

99999999.0    1
27322222.0    1
12345678.0    3
11111111.0    2
10000000.0    1
3890000.0     1
1300000.0     1
1234566.0     1
999999.0      2
999990.0      1
350000.0      1
345000.0      1
299000.0      1
295000.0      1
265000.0      1
Name: price, dtype: int64

As we can see, there are very unrealistic prices at the top of the distribution. The prices start to become unrealistic above 350K, so we will remove the values beyond that boundary. The filter below uses an upper limit of 400,000, which has the same effect because no listing is priced between 350,000 and 400,000. Prices of 0 may not be unrealistic, since a seller may simply be starting an auction, which can actually start at 0.

In [21]:
autos = autos[autos["price"].between(0,400000)]
autos["price"].describe()

count     49986.000000
mean       5721.525167
std        8983.617820
min           0.000000
25%        1100.000000
50%        2950.000000
75%        7200.000000
max      350000.000000
Name: price, dtype: float64

## Analyzing registration dates

In [22]:
autos["registration_year"].describe()

count    49986.000000
mean      2005.075721
std        105.727161
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

In [23]:
autos.loc[autos["registration_year"] == 9999]

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
8012,2016-03-23 16:43:29,Opel_GT_Karosserie_mit_Brief!,privat,Angebot,700.0,test,,9999,,0,andere,10000.0,0,,opel,,2016-03-23 00:00:00,0,21769,2016-04-05 20:16:15
14341,2016-03-23 01:36:20,Hole_kostenlos_ab,privat,Angebot,0.0,test,,9999,,0,,10000.0,0,,bmw,,2016-03-23 00:00:00,0,32689,2016-03-23 08:47:00
33950,2016-03-23 21:52:25,58er_karmann_ghia_lowlight_Kaefer__zum_restaur...,privat,Angebot,7999.0,test,,9999,,0,kaefer,10000.0,0,,volkswagen,,2016-03-23 00:00:00,0,47638,2016-04-06 03:46:40
38076,2016-04-04 22:54:47,Mercedes_Benz_A180,privat,Angebot,18000.0,test,,9999,,0,a_klasse,10000.0,0,benzin,mercedes_benz,,2016-04-04 00:00:00,0,51379,2016-04-07 02:44:52


In [24]:
np.sort(autos["registration_year"].unique())

array([1000, 1001, 1111, 1500, 1800, 1910, 1927, 1929, 1931, 1934, 1937,
       1938, 1939, 1941, 1943, 1948, 1950, 1951, 1952, 1953, 1954, 1955,
       1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966,
       1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977,
       1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988,
       1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,
       2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010,
       2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2800, 4100,
       4500, 4800, 5000, 5911, 6200, 8888, 9000, 9996, 9999])

Here we can see that the registration year has a lot of errors, since a car cannot have been registered in the future or as far back as the 1800s. For example, there are some cars with a registration year of 9999, including one with no price that seems to be somebody giving something away for free (index 14341), as well as an Opel GT body frame.

### Removing wrong registration years

As mentioned above, cars registered after 2019 or before 1910 cannot be right. One of the first cars accessible to the masses was the [1908 Model T](https://en.wikipedia.org/wiki/Car), an American car manufactured by the Ford Motor Company, so registrations well before that are not plausible. We can also check the cars registered from 1910 to 1950 and see whether the matching models make sense.

In [25]:
autos[autos["registration_year"].between(1910,1950)]

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
1171,2016-03-29 17:53:03,Seat_Leon_Spielzeug_Auto,privat,Angebot,2.0,control,limousine,1950,automatik,5,leon,5000.0,0,diesel,seat,,2016-03-29 00:00:00,0,26919,2016-04-06 03:45:23
2221,2016-03-15 14:57:07,Sehr_seltener_Oldtimer_Opel_1210_zum_Restaurieren,privat,Angebot,3350.0,control,andere,1934,manuell,0,andere,5000.0,0,benzin,opel,ja,2016-03-15 00:00:00,0,49828,2016-04-06 06:17:51
2573,2016-03-19 22:51:25,Hanomag_rekord_15k_Suche_ersatz_teile,privat,Angebot,3000.0,test,andere,1934,,0,,90000.0,1,benzin,sonstige_autos,nein,2016-03-19 00:00:00,0,90489,2016-03-19 22:51:25
3679,2016-04-04 00:36:17,Suche_Auto,privat,Angebot,1.0,test,,1910,,0,,5000.0,0,,sonstige_autos,,2016-04-04 00:00:00,0,40239,2016-04-04 07:49:15
11047,2016-03-08 20:50:10,Andere_Simca_5_Fourgonette_Kombilimousine,privat,Angebot,17500.0,control,kombi,1948,manuell,0,,60000.0,6,benzin,sonstige_autos,nein,2016-03-08 00:00:00,0,47546,2016-04-05 21:15:42
11246,2016-03-26 19:49:59,Ford_Model_A_Roadster_Deluxe_1931,privat,Angebot,27500.0,control,cabrio,1931,manuell,39,andere,10000.0,7,benzin,ford,nein,2016-03-26 00:00:00,0,9322,2016-04-06 09:46:59
11585,2016-03-11 21:48:36,Volkswagen__VW_Typ_82,privat,Angebot,41900.0,test,cabrio,1943,,0,andere,100000.0,7,,volkswagen,ja,2016-03-11 00:00:00,0,84174,2016-03-21 13:18:05
13963,2016-03-20 17:51:49,Mercedes_Benz_L1500S_Wehrmacht_/_Luftwaffe___F...,privat,Angebot,26900.0,test,andere,1941,manuell,60,andere,60000.0,7,benzin,mercedes_benz,nein,2016-03-20 00:00:00,0,38723,2016-04-07 01:17:51
14020,2016-03-19 11:52:47,Oldtimeraufloesung,privat,Angebot,10000.0,test,coupe,1950,manuell,130,andere,5000.0,1,benzin,alfa_romeo,nein,2016-03-19 00:00:00,0,34128,2016-04-06 14:46:35
15898,2016-03-08 10:50:05,Tausch_alles_aus_meinen_Anzeigen_gegen_Auto,privat,Angebot,0.0,test,,1910,,0,,5000.0,0,,sonstige_autos,,2016-03-08 00:00:00,0,6108,2016-03-08 17:47:19


As we can see, some of these cars are marked as oldtimers (classic cars), except for those with a registration year of 1910. So we can also remove those from the dataset and start from 1927 with the **Essex Super Six from 1927**, a car that really is from that era. See the [Essex Motor Company](https://en.wikipedia.org/wiki/Essex_(automobile)), whose cars were produced until 1933.

In [26]:
autos = autos[autos["registration_year"].between(1927,2019)]
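Before dropping rows, it can help to look at what the filter would reject. The sketch below uses a made-up frame (`toy` and its rows are assumptions for illustration) and relies on `Series.between()` being inclusive on both ends by default:

```python
import pandas as pd

# Toy frame; names and years are illustrative only.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "registration_year": [1000, 1910, 1934, 1999, 2005, 2016, 2019, 9999],
})

# Boolean mask: True where the year lies in [1927, 2019].
in_range = toy["registration_year"].between(1927, 2019)

rejected = toy[~in_range]           # rows the filter would remove
kept = toy[in_range]                # rows the filter would keep
share_removed = 1 - in_range.mean() # fraction of rows removed

print(rejected)
print(share_removed)
```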

In [59]:
autos["registration_year"].value_counts(normalize = True, dropna = False, ascending = False)[:5]

2000    0.067143
2005    0.060357
1999    0.060016
2004    0.054792
2003    0.054591
Name: registration_year, dtype: float64

Here we can observe that the most common registration years among the listed cars are 2000, 2005 and 1999, at the top of the distribution.

### Exploring Brands

In [28]:
autos["brand"].describe()

count          49953
unique            40
top       volkswagen
freq           10679
Name: brand, dtype: object

As the column description shows, Volkswagen is the most common brand among the listings. Let us select only the top 10 brands.

In [29]:
top10 = autos["brand"].value_counts(normalize = True)[:10]
print(top10)

volkswagen       0.213781
opel             0.109203
bmw              0.108642
mercedes_benz    0.094709
audi             0.085741
ford             0.069605
renault          0.048105
peugeot          0.029147
fiat             0.026165
seat             0.018818
Name: brand, dtype: float64


Here we can see that from Renault onward, a brand's share of the listings no longer exceeds 5%, and that the most listed cars are from German brands.
Now we can calculate the mean prices for the top 10.

In [30]:
# using dictionaries
mean_prices_by_brand = {}

for brand in top10.index:
    mean_price = autos.loc[autos["brand"] == brand, "price"].mean()
    mean_prices_by_brand[brand] = mean_price
    

In [65]:
mean_prices_by_brand

{'volkswagen': 5159.401629366045,
 'opel': 2843.769752520623,
 'bmw': 8028.474479454579,
 'mercedes_benz': 8380.637920101459,
 'audi': 8965.560354891431,
 'ford': 3626.5429968363533,
 'renault': 2352.031210986267,
 'peugeot': 3010.8688186813188,
 'fiat': 2697.6771231828616,
 'seat': 4223.654255319149}

In [71]:
# we could also use a groupby approach with aggregation
# selecting the column first avoids averaging the non-numeric columns

autos.groupby("brand")["price"].mean().loc[top10.index]

volkswagen       5159.401629
opel             2843.769753
bmw              8028.474479
mercedes_benz    8380.637920
audi             8965.560355
ford             3626.542997
renault          2352.031211
peugeot          3010.868819
fiat             2697.677123
seat             4223.654255
Name: price, dtype: float64

The same can be done with the mileage.

In [32]:
# using dictionaries

mean_milage_by_brand = {}

for brand in top10.index:
    mean_milage = autos.loc[autos["brand"] == brand, "odometer_km"].mean()
    mean_milage_by_brand[brand] = mean_milage

In [72]:
mean_milage_by_brand

{'volkswagen': 129006.46127914598,
 'opel': 129362.96975252063,
 'bmw': 132540.99871015293,
 'mercedes_benz': 130933.20651025153,
 'audi': 129643.9411627364,
 'ford': 124153.00546448087,
 'renault': 128275.07282563462,
 'peugeot': 127352.33516483517,
 'fiat': 117012.24177505738,
 'seat': 122186.17021276595}

In [73]:
# we could also use a groupby approach with aggregation
# selecting the column first avoids averaging the non-numeric columns

autos.groupby("brand")["odometer_km"].mean().loc[top10.index]

volkswagen       129006.461279
opel             129362.969753
bmw              132540.998710
mercedes_benz    130933.206510
audi             129643.941163
ford             124153.005464
renault          128275.072826
peugeot          127352.335165
fiat             117012.241775
seat             122186.170213
Name: odometer_km, dtype: float64

Now we can create two new Series with a shared index to summarize all this in one DataFrame.

In [34]:
mean_prices_series = pd.Series(mean_prices_by_brand)

df = pd.DataFrame(mean_prices_series, columns=["mean_price"])

df["mean_milage"] = pd.Series(mean_milage_by_brand)

print(df)

print(df.describe())

                mean_price    mean_milage
volkswagen     5159.401629  129006.461279
opel           2843.769753  129362.969753
bmw            8028.474479  132540.998710
mercedes_benz  8380.637920  130933.206510
audi           8965.560355  129643.941163
ford           3626.542997  124153.005464
renault        2352.031211  128275.072826
peugeot        3010.868819  127352.335165
fiat           2697.677123  117012.241775
seat           4223.654255  122186.170213
        mean_price    mean_milage
count    10.000000      10.000000
mean   4928.861854  127046.640286
std    2575.705724    4661.122481
min    2352.031211  117012.241775
25%    2885.544519  124952.837890
50%    3925.098626  128640.767052
75%    7311.206267  129573.698310
max    8965.560355  132540.998710


From the DataFrame above we can see that:
* The most expensive cars on average come from German premium brands such as Audi, Mercedes-Benz and BMW.
* All brands have very similar average mileage; the mean of the brand means is about 127,000 km.
* BMWs have the highest average mileage, and Fiats the lowest.
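The summary table could also be built in a single `groupby` call rather than from two dictionaries. This is a sketch on made-up numbers (`toy` and the column name `mean_mileage` are assumptions for illustration):

```python
import pandas as pd

# Toy listings; the values are illustrative only.
toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "fiat", "fiat", "opel"],
    "price": [8000.0, 10000.0, 2000.0, 3000.0, 2800.0],
    "odometer_km": [150000.0, 125000.0, 100000.0, 150000.0, 90000.0],
})

summary = (
    toy.groupby("brand")[["price", "odometer_km"]]
       .mean()                                   # one mean per brand and column
       .rename(columns={"price": "mean_price",
                        "odometer_km": "mean_mileage"})
       .sort_values("mean_price", ascending=False)
)
print(summary)
```

On the real data, the result would still need to be restricted to the top 10 brands, for example with `.loc[top10.index]`.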

## Additional Steps

### Identify categorical data that uses German words

In [35]:
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

Here we just want to check which columns contain categorical data that can be translated from German.
The columns that I will change are the following:
* `offer_type`
* `gearbox`
* `fuel_type` 
* `unrepaired_damage` 
* `seller`

For this task I will use the `Series.map()` method. We therefore need to create beforehand the dictionaries containing the respective translations from German to English. With `Series.unique()` we can extract the keywords needed for each dictionary.

In [36]:
print(autos["offer_type"].unique())

offer_type_map = {"Angebot":"Offer", "Gesuch": "Searching For"}

print(autos["gearbox"].unique())

gearbox_type_map = {"manuell":"Manual Transmission", "automatik":"Automatic"}

print(autos["fuel_type"].unique())

fuel_type_map = {"benzin":"Gasoline", "elektro":"electric", "andere":"other",'lpg':'lpg','diesel':'diesel','cng':'cng', 'hybrid':'hybrid' }

print(autos["unrepaired_damage"].unique())

unrepaired_damage_map = {"nein":"no", "ja": "yes"}

print(autos["seller"].unique())

seller_map = {"privat":"private", "gewerblich":"commercial"}


['Angebot' 'Gesuch']
['manuell' 'automatik' nan]
['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere']
['nein' nan 'ja']
['privat' 'gewerblich']


To make the code more compact, we can loop over the dictionaries to replace all columns of the DataFrame at once. We pack all our dictionaries into a list of dictionaries. For the loop we also need the names of the columns we are looping through.
> Note: The dictionaries and the column names have to be in the same order, and both lists must have the same length, since the loop selects a column and then looks up its values in the paired dictionary. If the order differs, the mapping will not work for that column.

In [37]:
map_list = [offer_type_map, gearbox_type_map, fuel_type_map, unrepaired_damage_map, seller_map]
map_columns_list = ["offer_type", "gearbox", "fuel_type", "unrepaired_damage", "seller" ]

In [38]:
# From here we can work with a copy of the german dataset
autos_en = autos.copy()

To loop over both lists at the same time, we can use the built-in `zip()` function in the loop definition, which makes it easier to pair each column with its map.

In [39]:
for col, dic in zip(map_columns_list, map_list):
    autos_en[col] = autos_en[col].map(dic)

In [40]:
autos_en.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,private,Offer,5000.0,control,bus,2004,Manual Transmission,158,andere,150000.0,3,lpg,peugeot,no,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,private,Offer,8500.0,control,limousine,1997,Automatic,286,7er,150000.0,6,Gasoline,bmw,no,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,private,Offer,8990.0,test,limousine,2009,Manual Transmission,102,golf,70000.0,7,Gasoline,volkswagen,no,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,private,Offer,4350.0,control,kleinwagen,2007,Automatic,71,fortwo,70000.0,6,Gasoline,smart,no,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,private,Offer,1350.0,test,kombi,2003,Manual Transmission,0,focus,150000.0,7,Gasoline,ford,no,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
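The ordering requirement noted earlier can be avoided by keying the translations by column name. This is a sketch of that alternative on a small made-up frame (`toy` and `translations` are hypothetical names). It also shows that `Series.map()` leaves missing values as NaN:

```python
import pandas as pd

# Toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "gearbox": ["manuell", "automatik", None],
    "unrepaired_damage": ["nein", "ja", "nein"],
})

# One dictionary keyed by column name, so column and map cannot get out of step.
translations = {
    "gearbox": {"manuell": "Manual Transmission", "automatik": "Automatic"},
    "unrepaired_damage": {"nein": "no", "ja": "yes"},
}

toy_en = toy.copy()
for col, mapping in translations.items():
    # Values without an entry in the mapping (including missing ones) become NaN.
    toy_en[col] = toy_en[col].map(mapping)

print(toy_en)
```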


### Convert the dates to be uniform numeric data

In [41]:
date_cols = ["date_crawled", "ad_created", "last_seen"]

for col in date_cols:
    autos_en[col]=autos_en[col].str.replace("-", "").str[:8].astype("int")

In [42]:
autos_en[date_cols].head()

Unnamed: 0,date_crawled,ad_created,last_seen
0,20160326,20160326,20160406
1,20160404,20160404,20160406
2,20160326,20160326,20160406
3,20160312,20160312,20160315
4,20160401,20160401,20160401
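An alternative to integer dates, not used in this notebook, is to parse the strings with `pd.to_datetime()`, which makes date arithmetic possible. The sketch below uses the `ad_created` and `last_seen` values of rows 0 and 3 shown earlier; the variable names are assumptions for illustration:

```python
import pandas as pd

ad_created = pd.Series(["2016-03-26 00:00:00", "2016-03-12 00:00:00"])
last_seen = pd.Series(["2016-04-06 06:45:54", "2016-03-15 03:16:28"])

# Integer form, as in the cell above: keep the date part and drop the dashes.
as_int = ad_created.str[:10].str.replace("-", "", regex=False).astype(int)

# Datetime form: differences between columns become timedeltas.
created_dt = pd.to_datetime(ad_created)
seen_dt = pd.to_datetime(last_seen)
days_online = (seen_dt - created_dt).dt.days

print(as_int.tolist())
print(days_online.tolist())
```

The integer form is convenient for sorting and counting per day, while the datetime form is what enables statistics over periods of time.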


### Extract particular keywords from the name column

One common keyword found in the `name` column is "TÜV", the name of the best-known German inspection agency. Having "TÜV" means that the car passed the mandatory technical inspection. People tend to prefer a car with a valid TÜV, since it can be seen as an indicator that the car has passed all technical inspections and is safe to drive on the street. Since there is no specific column addressing this, we can search for the keyword in the `name` column. For this I used `Series.str.contains()` to return a boolean Series, assigned it to a new column, and used the `Series.replace()` method to change the `True` and `False` values to strings indicating the status of the TÜV.

In [43]:
autos_en["Vehicle_inspection"] = autos_en["name"].str.contains("tüv|TÜV|Tüv")
autos_en["Vehicle_inspection"] = autos_en["Vehicle_inspection"].replace({False: 'Not Specified', True: 'Valid'})

In [44]:
# First 5 matching cars with Valid TÜV
autos_en[autos_en["Vehicle_inspection"] == "Valid"][["name", "Vehicle_inspection"]].head()

Unnamed: 0,name,Vehicle_inspection
4,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,Valid
20,Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*...,Valid
32,Corsa_mit_TÜV_5.2016,Valid
66,Opel_Corsa_1.2_16V_Edition_Saphirschwarz_TÜV_neu,Valid
73,VW_T3_Doka_1_7D_TÜV_07/2017,Valid
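Instead of listing the spelling variants in the pattern, the search can be made case-insensitive. This is a sketch on made-up titles, assuming `case=False` and `na=False` are acceptable here (missing names then count as "Not Specified"):

```python
import pandas as pd

# Illustrative ad titles, including a missing one.
names = pd.Series(["Corsa_mit_TÜV_5.2016", "Golf_tüv_neu", "BMW_3er", None])

# case=False matches "TÜV", "Tüv", "tüv", ...; na=False maps missing names to False.
has_tuev = names.str.contains("tüv", case=False, na=False)

status = has_tuev.map({True: "Valid", False: "Not Specified"})
print(status)
```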


### Most common brand/model combinations


Here we want to see which brand/model combinations are the most common. I will take the top 10 as a metric. Since `Series.value_counts()` operates on a single Series, we need to reduce the two columns to one string. We can do this by concatenating both column strings, with a space in between for readability, and then applying the method to the newly created `model_brand` Series.

In [45]:
model_brand = autos_en["brand"] + " " + autos_en["model"]

In [46]:
model_brand.value_counts().head(10)

volkswagen golf           4022
bmw 3er                   2761
volkswagen polo           1757
opel corsa                1733
opel astra                1454
volkswagen passat         1425
audi a4                   1291
bmw 5er                   1183
mercedes_benz c_klasse    1172
mercedes_benz e_klasse    1001
dtype: int64

Here we can observe that the **Volkswagen Golf**, the **BMW 3 Series** and the **Volkswagen Polo** are the most common cars listed on the platform.

We can also reach the same result using more advanced methods not covered in the course, like the [`DataFrame.groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html?highlight=groupby#pandas.DataFrame.groupby) method. With this method we can group a DataFrame by one or more columns and apply another method to the resulting object.

We then use the `GroupBy.size()` method and sort the result to get the top 10.

In [47]:
# Groupby approach
autos_en.groupby(["model","brand"]).size().sort_values(ascending = False).head(10)

model     brand        
golf      volkswagen       4022
3er       bmw              2761
polo      volkswagen       1757
corsa     opel             1733
astra     opel             1454
passat    volkswagen       1425
a4        audi             1291
5er       bmw              1183
c_klasse  mercedes_benz    1172
e_klasse  mercedes_benz    1001
dtype: int64
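In recent pandas versions (1.1 and later), a third option is `DataFrame.value_counts()`, which counts unique rows across several columns. This is a sketch on made-up data (`toy` is an assumption for illustration):

```python
import pandas as pd

# Toy listings; the values are illustrative only.
toy = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "volkswagen", "bmw", "bmw", "opel"],
    "model": ["golf", "golf", "golf", "3er", "3er", "corsa"],
})

# Counts each distinct (brand, model) pair, sorted from most to least frequent.
counts = toy[["brand", "model"]].value_counts()
print(counts.head(10))
```

By default, rows with a missing value in any of the selected columns are left out of the counts.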

### Explore the `odometer_km` column and search for patterns 

Here we want to see if there is a relationship between the mileage and the average price. To simplify the code, we divide the dataset into three groups: **5K to 20K km**, **20K to 80K km** and **80K to 150K km**. Then we compute the mean price for each group and add it to a dictionary.

In [48]:
autos_en["odometer_km"].unique()

array([150000.,  70000.,  50000.,  80000.,  10000.,  30000., 125000.,
        90000.,  20000.,  60000.,   5000., 100000.,  40000.])

In [49]:
five_to_20 = autos_en[autos_en["odometer_km"].between(5000, 20000)]

twenty_to_80 = autos_en[autos_en["odometer_km"].between(20001, 80000)]

eighty_to_150 = autos_en[autos_en["odometer_km"].between(80001, 150000)]

In [50]:
patterns_price = {}
groups = [five_to_20, twenty_to_80, eighty_to_150]
group_names = ["5K to 20K", "20K to 80K", "80K to 150K"]


for group, name in zip(groups, group_names):
    mean = group["price"].mean()
    patterns_price[name] = mean

In [51]:
pd.Series(patterns_price)

5K to 20K      13038.603327
20K to 80K     12435.780409
80K to 150K     4327.563375
dtype: float64

As we can observe, there is a negative relationship between the mileage and the price: the higher the mileage of a car, the lower the average price.
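The same three groups could be formed with `pd.cut()` and a single `groupby`, instead of three separate filters. This is a sketch on made-up data (`toy`, `bins` and `mileage_group` are assumptions for illustration); the bin edges are chosen so that 20,000 and 80,000 fall into the lower group, matching the `between()` filters above:

```python
import pandas as pd

# Toy listings; the values are illustrative only.
toy = pd.DataFrame({
    "odometer_km": [5000.0, 20000.0, 30000.0, 80000.0, 90000.0, 150000.0],
    "price": [15000.0, 13000.0, 12000.0, 10000.0, 5000.0, 3000.0],
})

# Right-inclusive bins: (0, 20000], (20000, 80000], (80000, 150000]
bins = [0, 20000, 80000, 150000]
labels = ["5K to 20K", "20K to 80K", "80K to 150K"]
toy["mileage_group"] = pd.cut(toy["odometer_km"], bins=bins, labels=labels)

mean_price = toy.groupby("mileage_group", observed=True)["price"].mean()
print(mean_price)
```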

### Comparison between damaged and non damaged cars

This is a tricky question, since it is difficult to generalize to all car models in the dataset: some models do not have a damaged counterpart to compare with. Therefore, for the sake of simplicity, we take the top 10 car models and investigate those further.

In [52]:
top_10_models = list(autos_en["model"].value_counts().head(10).index)

In [53]:
top_10_not_damaged = autos_en[(autos_en["model"].isin(top_10_models)) & (autos_en["unrepaired_damage"] == "no")]

In [54]:
top_10_not_damaged[["price", "model", "unrepaired_damage"]].head()

Unnamed: 0,price,model,unrepaired_damage
0,5000.0,andere,no
2,8990.0,golf,no
7,1990.0,golf,no
19,4150.0,andere,no
24,48500.0,5er,no


In [55]:
top_10_damaged = autos_en[(autos_en["model"].isin(top_10_models)) & (autos_en["unrepaired_damage"] == "yes")]

In [56]:
top_10_damaged[["price", "model", "unrepaired_damage"]].head()

Unnamed: 0,price,model,unrepaired_damage
97,800.0,andere,yes
136,1499.0,andere,yes
148,500.0,polo,yes
163,600.0,andere,yes
197,888.0,polo,yes


In [57]:
Not_damaged = {}
Damaged = {}

for model in top_10_models:
    avg_no_damage = top_10_not_damaged.loc[top_10_not_damaged["model"] == model, "price" ].mean()
    Not_damaged[model] = avg_no_damage
    
    avg_damage = top_10_damaged.loc[top_10_damaged["model"] == model, "price" ].mean()
    Damaged[model] = avg_damage
    

In [58]:
Damage_vs_price = pd.DataFrame(pd.Series(Not_damaged), columns=["Not Damaged"])
Damage_vs_price["Damaged"] = pd.Series(Damaged)

Damage_vs_price

Unnamed: 0,Not Damaged,Damaged
golf,5923.202199,1703.131965
andere,8258.812179,2329.746269
3er,6799.020894,2135.497925
polo,3126.468832,936.949495
corsa,2234.716742,1127.076923
astra,3890.461692,1249.960265
passat,5636.043265,2057.065574
a4,8002.556139,2762.545455
5er,8755.436426,3672.212963
c_klasse,7626.058378,3683.986842


To conclude, we can see that unrepaired damage is associated with a lower average price. For example, the Volkswagen Golf, which also happens to be the most listed model on the platform, has an average price of **5923 \$** without damage. The average price for the same model drops to **1703 \$** with damage, about **71\%** less on average. Of course this is a simple comparison and does not go very deep. For a more accurate result, one would have to repeat the analysis while comparing only cars with similar mileage and the same year of registration, for example.
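The relative price drop can be computed for every model at once. This is a minimal sketch using three rows copied from the table above; the column name `drop_pct` is an assumption for illustration:

```python
import pandas as pd

# Three rows taken from the Damage_vs_price table above.
damage_vs_price = pd.DataFrame(
    {
        "Not Damaged": [5923.202199, 3126.468832, 2234.716742],
        "Damaged": [1703.131965, 936.949495, 1127.076923],
    },
    index=["golf", "polo", "corsa"],
)

# Percentage by which the mean price is lower for damaged cars.
damage_vs_price["drop_pct"] = (
    1 - damage_vs_price["Damaged"] / damage_vs_price["Not Damaged"]
) * 100

print(damage_vs_price.round(1))
```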