# Analysis of eBay car sales data

Goal: The project aims to clean and analyze used-car listings from the German eBay website.

Method: The main analysis is data cleaning using the NumPy and pandas libraries.

Data set: ~50,000 data points from *eBay Kleinanzeigen* are available [here](https://data.world/data-society/used-cars-data).

Key parameter: ...

## Open the data

In [1]:
#import libraries
import numpy as np
import pandas as pd

#open file
autos = pd.read_csv("autos.csv", encoding = "Latin-1") #"Windows-1252"

Inside of the data set:

In [2]:
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


In [3]:
#get info about data set

autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

### Primary observations about data cleaning

The columns with NaN entries:

- 6   `vehicleType`
- 8   `gearbox`
- 10  `model`
- 13  `fuelType`
- 15  `notRepairedDamage`

No column has more than ~20% NaN values.
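The NaN share per column can be checked directly rather than read off the `info()` output. The following is a minimal sketch on a hypothetical toy frame (the name `toy` and its values are assumptions for illustration, not the real `autos` data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `autos`
toy = pd.DataFrame({
    "vehicleType": ["bus", np.nan, "kombi", "cabrio", np.nan],
    "gearbox": ["manuell", "automatik", np.nan, "manuell", "manuell"],
    "brand": ["audi", "bmw", "ford", "opel", "fiat"],
})

# isna() gives a boolean frame; the column-wise mean is the NaN fraction
nan_share = toy.isna().mean().sort_values(ascending=False)
print(nan_share)
```

Applied to the real data set, the same expression would be `autos.isna().mean()`.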

The columns that require data type conversion:
- 4   `price`    - from "object" to "int64" by excluding "$" & ","
- 11  `odometer` - from "object" to "int64" by excluding "km" & ","



The column names need renaming: they use [camelCase](https://en.wikipedia.org/wiki/Camel_case) instead of Python's preferred [snake_case](https://en.wikipedia.org/wiki/Snake_case), and some of them are not descriptive.

In [4]:
#initial column names
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [5]:
#rename column names
autos.rename(
    columns={
        "yearOfRegistration": "registration_year",
        "monthOfRegistration": "registration_month",
        "notRepairedDamage": "unrepaired_damage",
        "dateCreated": "ad_created",
        "powerPS": "power_ps",
    },
    inplace=True,
)

#import regular expression operations
#replace symbols

import re 
def camel_to_snake(col):
    return re.sub("([A-Z])", "_\\1", col).lower()

new_columns = []
for col in autos.columns:
    col = camel_to_snake(col)
    new_columns.append(col)
    
autos.columns = new_columns

#modified column names
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

The columns were renamed from camelCase to snake_case: every capital letter matched by `([A-Z])` is prefixed with `_` and the whole name is lowercased. Columns with poorly descriptive names were renamed manually first.
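To make the effect of the regular expression concrete, here is a small self-contained sketch that applies the same `camel_to_snake` function to a few of the original column names. The list `examples` is chosen only for illustration.

```python
import re

def camel_to_snake(col):
    # Prefix every capital letter with "_" and lowercase the result
    return re.sub("([A-Z])", "_\\1", col).lower()

examples = ["dateCrawled", "offerType", "nrOfPictures", "price"]
converted = [camel_to_snake(c) for c in examples]
print(converted)  # ['date_crawled', 'offer_type', 'nr_of_pictures', 'price']

# Consecutive capitals are split one by one, which is one reason
# to rename a column such as "powerPS" by hand
print(camel_to_snake("powerPS"))  # power_p_s
```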

### Converting object columns to numeric data 

The columns that require data type conversion:
- 4   `price`    - from "object" to "int64" by excluding "$" & ","
- 11  `odometer` - from "object" to "int64" by excluding "km" & ","


In [6]:
#explore column 4 price
autos["price"].value_counts()

$0         1421
$500        781
$1,500      734
$2,500      643
$1,200      639
           ... 
$17,520       1
$28,399       1
$35,700       1
$6,755        1
$16,960       1
Name: price, Length: 2357, dtype: int64

In [7]:
#remove '$' & ',' symbols (regex=False: '$' is a regex anchor, so match it literally)
autos["price"] = autos["price"].str.replace('$','', regex=False).str.replace(',','', regex=False).astype(int)
autos.rename({"price":"price_$"},axis=1,inplace = True)
autos["price_$"].value_counts()

0        1421
500       781
1500      734
2500      643
1000      639
         ... 
20790       1
8970        1
846         1
2895        1
33980       1
Name: price_$, Length: 2357, dtype: int64

In [8]:
#explore column 11 odometer
autos["odometer"].value_counts()

150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64

In [9]:
#remove 'km' & ',' symbols
autos["odometer"] = autos["odometer"].str.replace('km','').str.replace(',','').astype(int)
autos.rename({"odometer":"odometer_km"},axis=1,inplace = True)
autos["odometer_km"].value_counts()

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64
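The `price` and `odometer` conversions follow the same pattern: strip a few literal tokens, then cast to integer. As a sketch, this could be wrapped in one helper; `strip_to_int` is a hypothetical name introduced here, not part of pandas or of the original notebook.

```python
import pandas as pd

def strip_to_int(series, tokens):
    """Remove each literal token from a string Series and cast to int."""
    for token in tokens:
        # regex=False treats the token as plain text ('$' would be a regex anchor)
        series = series.str.replace(token, "", regex=False)
    return series.astype(int)

# Hypothetical sample values in the same format as the raw columns
prices = pd.Series(["$5,000", "$8,500", "$0"])
odometer = pd.Series(["150,000km", "70,000km", "5,000km"])

print(strip_to_int(prices, ["$", ","]).tolist())     # [5000, 8500, 0]
print(strip_to_int(odometer, ["km", ","]).tolist())  # [150000, 70000, 5000]
```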

### Numeric data cleaning

In [10]:
#inside the numeric data entries
#for string data use (include='all')
autos.describe()

Unnamed: 0,price_$,registration_year,power_ps,odometer_km,registration_month,nr_of_pictures,postal_code
count,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0,50000.0
mean,9840.044,2005.07328,116.35592,125732.7,5.72336,0.0,50813.6273
std,481104.4,105.712813,209.216627,40042.211706,3.711984,0.0,25779.747957
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1100.0,1999.0,70.0,125000.0,3.0,0.0,30451.0
50%,2950.0,2003.0,105.0,150000.0,6.0,0.0,49577.0
75%,7200.0,2008.0,150.0,150000.0,9.0,0.0,71540.0
max,100000000.0,9999.0,17700.0,150000.0,12.0,0.0,99998.0


Cleaning plan:

1. The column `nr_of_pictures` can be removed as it does not contain any data (all values are 0).

2. The column `registration_year` has unrealistic entries (min_year = 1000, max_year = 9999) that should be removed/replaced.

3. The column `power_ps` has unrealistic entries (max_power = 17700) that should be removed/replaced.

4. The column `registration_month` has unrealistic entries (min_month = 0) that should be removed partly or entirely.

5. The column `price_$` has unrealistic entries (min_price = 0, max_price = 100000000) that should be verified/removed.

1. The column `nr_of_pictures` can be removed as it does not contain any data.

In [11]:
#remove nr_of_pictures column
autos = autos.drop(columns = "nr_of_pictures")
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price_$,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350,control,kleinwagen,2007,automatik,71,fortwo,70000,6,benzin,smart,nein,2016-03-12 00:00:00,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350,test,kombi,2003,manuell,0,focus,150000,7,benzin,ford,nein,2016-04-01 00:00:00,39218,2016-04-01 14:38:50


2. The column `registration_year` has unrealistic entries (min_year = 1000, max_year = 9999) that should be removed/replaced.

In [12]:
#explore unrealistic data entries in registration_year column
autos["registration_year"].value_counts(ascending=True)

1952       1
9996       1
1001       1
1929       1
1931       1
        ... 
2003    2727
2004    2737
1999    3000
2005    3015
2000    3354
Name: registration_year, Length: 97, dtype: int64

In [13]:
#detect unrealistic years
#keep only years between 1885 (when the first German car was built) and 2016 (the year the data was crawled)
autos = autos[autos['registration_year'].between(1885,2016)]
autos["registration_year"].describe()  

count    48028.00000
mean      2002.80351
std          7.31085
min       1910.00000
25%       1999.00000
50%       2003.00000
75%       2008.00000
max       2016.00000
Name: registration_year, dtype: float64
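The year filter relies on `Series.between`, which by default includes both end points. A minimal sketch with hypothetical year values shows which entries survive the 1885-2016 window:

```python
import pandas as pd

# Hypothetical registration years, including obviously wrong ones
years = pd.Series([1000, 1910, 2003, 2016, 2017, 9999])

# between() is inclusive on both ends by default
mask = years.between(1885, 2016)
kept = years[mask]
print(kept.tolist())  # [1910, 2003, 2016]
```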

In [14]:
#create a distribution of registration_year
autos["registration_year"].value_counts(normalize=True)

2000    0.069834
2005    0.062776
1999    0.062464
2004    0.056988
2003    0.056779
          ...   
1939    0.000021
1927    0.000021
1929    0.000021
1948    0.000021
1952    0.000021
Name: registration_year, Length: 78, dtype: float64

The distribution of `registration_year` shows that registrations peak around the early 2000s.

3. The column `power_ps` has unrealistic entries (max_power = 17700) that should be removed/replaced.
Here PS stands for Pferdestärke (literally, 'horsepower').

In [15]:
#explore unrealistic data entries in power_ps column
autos["power_ps"].value_counts(ascending=True)

16312       1
454         1
1103        1
262         1
16          1
         ... 
140      1823
150      1985
60       2084
75       3004
0        4989
Name: power_ps, Length: 441, dtype: int64

In [16]:
#select unrealistic power
#power_ps that >700 (limit for powerful but not race cars)
print(autos[autos["power_ps"] > 700].describe())
print('\n')
autos[autos["power_ps"] > 700]

            price_$  registration_year      power_ps    odometer_km  \
count  5.800000e+01          58.000000     58.000000      58.000000   
mean   7.130403e+04        2002.827586   3229.034483  113448.275862   
std    5.102690e+05           6.548437   4247.054166   53411.731550   
min    0.000000e+00        1972.000000    740.000000    5000.000000   
25%    9.925000e+02        2000.000000   1001.500000   82500.000000   
50%    2.095000e+03        2003.000000   1400.000000  150000.000000   
75%    5.150000e+03        2006.000000   2546.250000  150000.000000   
max    3.890000e+06        2016.000000  17700.000000  150000.000000   

       registration_month   postal_code  
count           58.000000     58.000000  
mean             6.224138  40903.068966  
std              4.116902  23618.057665  
min              0.000000   1594.000000  
25%              3.000000  24107.500000  
50%              6.000000  39860.000000  
75%             10.000000  53229.500000  
max             12.00000

Unnamed: 0,date_crawled,name,seller,offer_type,price_$,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
1699,2016-04-04 19:49:19,Opel_Corsa_1.0_Motor_ecotek,privat,Angebot,1200,test,limousine,2001,manuell,6512,corsa,150000,12,benzin,opel,,2016-04-04 00:00:00,47198,2016-04-06 22:16:46
2220,2016-03-30 17:56:27,Ford_ka_top_zustand,privat,Angebot,850,control,,2005,manuell,1003,ka,5000,12,benzin,ford,nein,2016-03-30 00:00:00,45891,2016-04-05 06:17:55
2670,2016-04-02 15:47:00,Verkaufe_Ford_Focus_!,privat,Angebot,360,control,kleinwagen,1999,,1988,focus,150000,2,benzin,ford,,2016-04-02 00:00:00,54459,2016-04-06 14:44:28
2876,2016-03-19 06:36:23,Golf_3_Cabrio_voll_fahrbereit_tuev,privat,Angebot,1990,control,cabrio,1998,manuell,900,,150000,3,benzin,volkswagen,nein,2016-03-19 00:00:00,87549,2016-04-06 06:16:32
3753,2016-04-03 18:47:14,VW_Polo_9n,privat,Angebot,4700,control,kleinwagen,2009,manuell,6045,polo,125000,12,benzin,volkswagen,nein,2016-04-03 00:00:00,48565,2016-04-05 19:17:39
4279,2016-03-29 19:51:19,Citroen_C4_1.6Hdi__94.500km,privat,Angebot,2850,control,limousine,2005,manuell,900,c4,100000,2,diesel,citroen,nein,2016-03-29 00:00:00,46459,2016-04-04 05:16:35
4405,2016-03-21 19:53:24,VW_Golf_Automatik_Grau,privat,Angebot,2200,test,limousine,1998,automatik,1781,golf,150000,10,benzin,volkswagen,ja,2016-03-21 00:00:00,47198,2016-04-06 22:46:46
4464,2016-03-29 16:47:47,Zu_verkaufen_Mercedes_A_160_mit_neu_TÜV,privat,Angebot,1650,control,kleinwagen,2000,automatik,1001,a_klasse,150000,3,benzin,mercedes_benz,nein,2016-03-29 00:00:00,89134,2016-04-04 01:18:34
4777,2016-04-03 12:45:25,Audi_tt_bj_2000_Unfall.._laeuft_top..,privat,Angebot,2200,control,coupe,2000,manuell,1793,tt,150000,4,,audi,ja,2016-04-03 00:00:00,16248,2016-04-07 14:57:35
7556,2016-03-23 23:56:04,Kaufe_alle_A4_B5_egal_in_welchem_zustand!!,privat,Angebot,100,control,,1995,,999,a4,150000,8,,audi,ja,2016-03-23 00:00:00,33034,2016-03-24 07:42:59


There are 58 cars with `power_ps` over 700. Most of them are cheap (the 75th percentile of their price is ~5000), which suggests that these are false entries that can be removed. Fewer than 25% of them have a plausible combination of power and price. Moreover, the names of these 58 cars mostly do not correspond to race/high-end cars.
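The plausibility argument above (very high power combined with a low price) can be expressed as a single number: the share of flagged cars below a price threshold. This is a sketch on hypothetical toy values; the threshold of 5000 is taken from the text, everything else is an assumption.

```python
import pandas as pd

# Hypothetical toy listings: power in PS and price
toy = pd.DataFrame({
    "power_ps": [6512, 1003, 900, 75, 150, 740],
    "price": [1200, 850, 1990, 3000, 9000, 3890000],
})

# Cars flagged as suspicious by the power threshold
suspicious = toy[toy["power_ps"] > 700]

# Share of flagged cars that are cheap, i.e. unlikely to be real high-power cars
share_cheap = (suspicious["price"] < 5000).mean()
print(len(suspicious), share_cheap)  # 4 0.75
```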

In [17]:
#delete unrealistic entries
autos = autos[autos['power_ps'].between(0,700)]
autos.describe()

Unnamed: 0,price_$,registration_year,power_ps,odometer_km,registration_month,postal_code
count,47970.0,47970.0,47970.0,47970.0,47970.0,47970.0
mean,9510.628,2002.803481,113.307776,125558.786742,5.767209,50947.997874
std,484350.4,7.311785,70.520388,40086.524007,3.696279,25792.460894
min,0.0,1910.0,0.0,5000.0,0.0,1067.0
25%,1150.0,1999.0,71.0,100000.0,3.0,30519.0
50%,2990.0,2003.0,107.0,150000.0,6.0,49716.0
75%,7400.0,2008.0,150.0,150000.0,9.0,71672.0
max,100000000.0,2016.0,696.0,150000.0,12.0,99998.0


4. The column `registration_month` has unrealistic entries (min_month = 0) that should be removed/replaced.

In [18]:
#explore unrealistic data entries in registration_month column
autos["registration_month"].value_counts(ascending=True)

2     2913
8     3082
1     3161
11    3268
9     3299
12    3314
10    3552
7     3807
5     3945
4     3950
6     4204
0     4580
3     4895
Name: registration_month, dtype: int64

Since many entries have a `registration_month` of 0, this column could be deleted as misleading and unnecessary. However, as the remaining months are distributed roughly uniformly, we will **replace month = 0 with NaN values** for the sake of future analysis.

In [19]:
#option 1
#remove registration_month column
#autos = autos.drop(columns = "registration_month")

#option 2 (used here)
#select month = 0
month_0 = autos["registration_month"] == 0
#convert month = 0 to NaN
autos.loc[month_0,"registration_month"] = np.nan
autos["registration_month"].describe()

count    43390.000000
mean         6.375962
std          3.350099
min          1.000000
25%          3.000000
50%          6.000000
75%          9.000000
max         12.000000
Name: registration_month, dtype: float64
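An equivalent way to express the replacement of month = 0 is `Series.mask`, which sets every value to NaN where the condition holds. The sketch below uses hypothetical month values and is not the notebook's own approach:

```python
import pandas as pd

# Hypothetical registration months, where 0 means "unknown"
months = pd.Series([0, 3, 6, 0, 12])

# mask() replaces values with NaN wherever the condition is True;
# the integer column becomes float because NaN is a float value
cleaned = months.mask(months == 0)
print(cleaned.count())  # 3 non-NaN values remain
print(cleaned.min())    # 3.0
```

As in the cell above, `describe()` and `count()` then ignore the NaN entries, so the statistics only reflect the known months.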

5. The column `price_$` has unrealistic entries (min_price = 0, max_price = 100000000) that should be verified/removed.

In [20]:
autos["price_$"].unique().shape

(2331,)

In [21]:
print(autos["price_$"].describe())
print("\n")
print("About an individual entries:")
autos["price_$"].value_counts(ascending=True) #.head(20)

count    4.797000e+04
mean     9.510628e+03
std      4.843504e+05
min      0.000000e+00
25%      1.150000e+03
50%      2.990000e+03
75%      7.400000e+03
max      1.000000e+08
Name: price_$, dtype: float64


About an individual entries:


13290       1
27299       1
2671        1
16998       1
28850       1
         ... 
1200      605
2500      614
1500      694
500       756
0        1334
Name: price_$, Length: 2331, dtype: int64

As the majority of cars have a price of ~10^3, cars priced above 10^5 are suspiciously expensive and their brands should be verified. In addition, 1334 cars have price = 0 and should be removed.

In [22]:
#select unrealistic price and examine the car labels
#explore price_$ that >100.000$ (seems quite expensive)
high_price = autos["price_$"] > 100000
print("Amount of highly priced cars: ", len(autos[high_price]))
autos[high_price].head(5)

Amount of highly priced cars:  50


Unnamed: 0,date_crawled,name,seller,offer_type,price_$,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
514,2016-03-17 09:53:08,Ford_Focus_Turnier_1.6_16V_Style,privat,Angebot,999999,test,kombi,2009,manuell,101,focus,125000,4.0,benzin,ford,nein,2016-03-17 00:00:00,12205,2016-04-06 07:17:35
1878,2016-03-12 16:58:37,Porsche_911_Turbo,privat,Angebot,129000,control,coupe,1995,manuell,408,911,125000,9.0,benzin,porsche,nein,2016-03-12 00:00:00,70180,2016-04-05 04:49:19
2454,2016-03-21 22:51:29,Porsche_911_GT3,privat,Angebot,137999,control,coupe,2010,manuell,435,911,20000,7.0,benzin,porsche,nein,2016-03-21 00:00:00,80636,2016-04-07 05:45:39
2751,2016-03-15 10:52:35,Porsche_911___993_4S,privat,Angebot,120000,control,coupe,1998,manuell,286,911,125000,3.0,benzin,porsche,nein,2016-03-15 00:00:00,25488,2016-04-05 19:47:31
2897,2016-03-12 21:50:57,Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000,privat,Angebot,11111111,test,limousine,1973,manuell,48,escort,50000,3.0,benzin,ford,nein,2016-03-12 00:00:00,94469,2016-03-12 22:45:27


Not all cars priced above 100.000$ belong to brands that match this price range, for example `Ford Focus`. As we intend to analyze the bulk of the market offers, it is reasonable to cut the tail of the distribution and exclude such cars.
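Before cutting both ends of the price distribution, it can be useful to count how many rows each bound removes. A minimal sketch with hypothetical prices, using the bounds 1 and 100000 discussed in the text:

```python
import pandas as pd

# Hypothetical prices including free and implausibly expensive listings
prices = pd.Series([0, 1, 500, 2990, 99900, 129000, 999999])

removed_low = int((prices < 1).sum())        # listings with price = 0
removed_high = int((prices > 100000).sum())  # expensive tail
kept = prices[prices.between(1, 100000)]

print(removed_low, removed_high, len(kept))  # 1 2 4
```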

In [23]:
#keep only rows with price between 1 and 100.000 (drops price = 0 and the expensive tail)
autos = autos[autos["price_$"].between(1,100000)]
autos.describe()

Unnamed: 0,price_$,registration_year,power_ps,odometer_km,registration_month,postal_code
count,46586.0,46586.0,46586.0,46586.0,42568.0,46586.0
mean,5840.789035,2002.909157,114.135599,125669.514446,6.377326,51102.739557
std,7590.321984,7.177158,69.470882,39757.358905,3.351918,25755.66673
min,1.0,1910.0,0.0,5000.0,1.0,1067.0
25%,1250.0,1999.0,75.0,100000.0,3.0,30827.0
50%,3100.0,2003.0,109.0,150000.0,6.0,49828.0
75%,7500.0,2008.0,150.0,150000.0,9.0,71738.5
max,99900.0,2016.0,696.0,150000.0,12.0,99998.0


### Date data cleaning

The column `registration_month` is already cleaned: entries with month = 0 were replaced with NaN.
The column `registration_year` is already cleaned.

The columns `date_crawled`, `last_seen`, `ad_created` should be converted to a date type to analyze the range they cover as percentages.

In [24]:
autos[['date_crawled','last_seen','ad_created']][0:5]

Unnamed: 0,date_crawled,last_seen,ad_created
0,2016-03-26 17:47:46,2016-04-06 06:45:54,2016-03-26 00:00:00
1,2016-04-04 13:38:56,2016-04-06 14:45:08,2016-04-04 00:00:00
2,2016-03-26 18:57:24,2016-04-06 20:15:37,2016-03-26 00:00:00
3,2016-03-12 16:58:10,2016-03-15 03:16:28,2016-03-12 00:00:00
4,2016-04-01 14:38:50,2016-04-01 14:38:50,2016-04-01 00:00:00


To understand the date range, we will extract just the date part `year-month-day` (the first 10 characters) and generate its distribution for each column.
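An alternative to slicing the first 10 characters is to parse the full timestamp and then drop the time of day with `.dt.normalize()`. The sketch below, on two timestamps in the same format as the data, checks that both routes give the same dates:

```python
import pandas as pd

# Two timestamps in the same format as the raw columns
stamps = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])

# Route 1: keep the first 10 characters, then parse
by_slice = pd.to_datetime(stamps.str[:10])

# Route 2: parse the full timestamp, then reset the time to midnight
by_normalize = pd.to_datetime(stamps).dt.normalize()

print((by_slice == by_normalize).all())
```

The second route does not depend on the date occupying exactly 10 characters, at the cost of parsing the time part as well.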

In [25]:
import datetime as dt
#leave only 10 symbols in the string
autos['date_crawled'] = autos['date_crawled'].str[:10]
autos['last_seen'] = autos['last_seen'].str[:10]
autos['ad_created'] = autos['ad_created'].str[:10]
#convert type from object to date
autos[['date_crawled','last_seen','ad_created']] = autos[['date_crawled','last_seen','ad_created']].apply(pd.to_datetime)

In [26]:
#generate distribution for the 'date_crawled' column
print('Distribution for the date_crawled')
print(autos['date_crawled'].
      value_counts(normalize=True, dropna=False).
      sort_index()
     )
print("\n")
print('Distribution for the last_seen')
#generate distribution for the 'last_seen' column
print(autos['last_seen'].
      value_counts(normalize=True, dropna=False).
      sort_index()
     )
print("\n")
print('Distribution for the ad_created')
#generate distribution for the 'ad_created' column
print(autos['ad_created'].
      value_counts(normalize=True, dropna=False).
      sort_index()
     )

Distribution for the date_crawled
2016-03-05    0.025158
2016-03-06    0.014189
2016-03-07    0.036320
2016-03-08    0.033508
2016-03-09    0.033229
2016-03-10    0.032284
2016-03-11    0.032520
2016-03-12    0.036814
2016-03-13    0.015906
2016-03-14    0.036320
2016-03-15    0.034367
2016-03-16    0.029494
2016-03-17    0.031812
2016-03-18    0.012772
2016-03-19    0.034710
2016-03-20    0.037994
2016-03-21    0.037264
2016-03-22    0.032778
2016-03-23    0.032220
2016-03-24    0.029494
2016-03-25    0.031533
2016-03-26    0.032070
2016-03-27    0.030782
2016-03-28    0.034560
2016-03-29    0.034066
2016-03-30    0.033766
2016-03-31    0.031834
2016-04-01    0.033787
2016-04-02    0.035547
2016-04-03    0.038746
2016-04-04    0.036642
2016-04-05    0.013008
2016-04-06    0.003091
2016-04-07    0.001417
Name: date_crawled, dtype: float64


Distribution for the last_seen
2016-03-05    0.001073
2016-03-06    0.004121
2016-03-07    0.005388
2016-03-08    0.007470
2016-03-09    0.009767
2

The distribution of `date_crawled` shows that the data set was crawled roughly uniformly between 5 March and 7 April 2016.

The distribution of `ad_created` shows that the ads were created between June 2015 and April 2016.

The distribution of `last_seen` shows that the ads were last seen within the crawling period, i.e. around March & April 2016, and mainly in the last 3 days, 5-7 April 2016.



### Brand - Price & Mileage relation

Here we explore the `brand` column (column 14) and analyze variations across different car brands.
As a result, we would like to see how a brand relates to its mean price and mean mileage.

In [27]:
#explore unique brands and count their appearances
print("Number of unique brands", len(autos["brand"].unique()))
brands = autos["brand"].value_counts(normalize=True)
brands.sort_values(ascending = False)

Number of unique brands 40


volkswagen        0.211480
bmw               0.110140
opel              0.107607
mercedes_benz     0.096467
audi              0.086614
ford              0.069849
renault           0.047203
peugeot           0.029859
fiat              0.025694
seat              0.018246
skoda             0.016443
nissan            0.015305
mazda             0.015219
smart             0.014167
citroen           0.013996
toyota            0.012686
hyundai           0.010046
sonstige_autos    0.009724
volvo             0.009144
mini              0.008779
mitsubishi        0.008243
honda             0.007835
kia               0.007084
alfa_romeo        0.006654
suzuki            0.005925
chevrolet         0.005710
porsche           0.005581
chrysler          0.003520
dacia             0.002640
daihatsu          0.002511
jeep              0.002275
subaru            0.002147
land_rover        0.002104
saab              0.001653
jaguar            0.001567
daewoo            0.001503
trabant           0.001395
r

As there are 40 unique brands and some appear only rarely, we keep only the brands that each account for more than 1% of the total number of entries and aggregate over them.
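The selection rule can be illustrated on a small example: compute each brand's share with `value_counts(normalize=True)` and keep the brands above the 1% threshold. The brand counts below are hypothetical.

```python
import pandas as pd

# Hypothetical brand column with 200 listings
brand = pd.Series(["vw"] * 120 + ["bmw"] * 60 + ["audi"] * 19 + ["saab"] * 1)

# Share of each brand among all listings
share = brand.value_counts(normalize=True)

# Keep brands that each account for more than 1% of the listings
top = share[share > 0.01].index
covered = share[top].sum()

print(list(top))  # ['vw', 'bmw', 'audi']
print(covered)    # share of listings covered by the selected brands
```

Note that the threshold applies to each brand separately, so the share of listings covered by the selection depends on how many rare brands there are.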

In [28]:
#select brands that each account for more than 1% of entries
top_brands = brands[brands > 0.01].index

#find mean price and mean mileage for those top brands
#store them as a dictionary
mean_top_brands = {}

for top_brand in top_brands:
    brand = autos[autos["brand"] == top_brand]  #select rows of this brand
    mean_price = brand["price_$"].mean()        #mean price of this brand
    mean_mile = brand["odometer_km"].mean()     #mean mileage of this brand
    mean_top_brands[top_brand] = (int(mean_price),int(mean_mile)) #key = brand, value = (mean price, mean mileage)

print("Brand - mean price and mileage relations")
#convert to DataFrame from dictionary
mean_top_brands = pd.DataFrame.from_dict(mean_top_brands,orient='index', columns=["mean_price_$","mean_mile_km"])
mean_top_brands.sort_values("mean_price_$", ascending = False) 

Brand - mean price and mileage relations


Unnamed: 0,mean_price_$,mean_mile_km
audi,9285,129206
mercedes_benz,8536,130843
bmw,8203,132623
skoda,6368,110848
volkswagen,5403,128718
hyundai,5365,106442
toyota,5166,116277
nissan,4743,118330
seat,4410,121270
mazda,4112,124464
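The per-brand loop above can also be written with `groupby`, which computes both means in one pass. This is a sketch on hypothetical toy data that reuses the column names of the notebook; it is an alternative formulation, not the code that produced the table above.

```python
import pandas as pd

# Hypothetical toy listings with the notebook's column names
toy = pd.DataFrame({
    "brand": ["audi", "audi", "bmw", "bmw", "ford"],
    "price_$": [9000, 11000, 8000, 8400, 3000],
    "odometer_km": [125000, 150000, 150000, 100000, 150000],
})

# Mean price and mean mileage per brand, most expensive brand first
summary = (
    toy.groupby("brand")[["price_$", "odometer_km"]]
    .mean()
    .rename(columns={"price_$": "mean_price_$", "odometer_km": "mean_mile_km"})
    .sort_values("mean_price_$", ascending=False)
)
print(summary)
```

To restrict the result to the selected brands, the frame could be filtered first, for example with `autos[autos["brand"].isin(top_brands)]`.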


In [29]:
print("Mean price across all brands:")
print(int(sum(mean_top_brands["mean_price_$"])/len(mean_top_brands)))

Mean price across all brands:
4942


The brand - mean price relation is presented in descending order, starting with the most expensive car brand. The mean price ranges from ~10.000 to ~2.500 \\$. Four of the five most expensive car brands are also among the five most common brands in the data set.

The most interesting finding is that among the 6 most popular car brands:
 - Audi, Mercedes Benz & BMW are more expensive
 - Ford & Opel are less expensive
 - Volkswagen is in between (being the most popular)
 
The mean mileage of those brands has no significant effect on their popularity.

### Conclusions

The mean price of Volkswagen cars (~5.500 \\$) is close to the mean price across the top brands (~5.000 \\$). Perhaps this is one of the reasons why it is the most popular brand.