![Imgur](https://image.freepik.com/free-vector/automobiles-models-icon-collection_74855-5435.jpg)



# Exploring eBay Car Sales Data

## Introduction


In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to Kaggle by user orgesleka. The original dataset isn't available on Kaggle anymore, but you can find it [here](https://data.world/data-society/used-cars-data).

This is the short version, with 50,000 rows.

The aim of this project is **to clean the data** and analyze the included used car listings **answering the following questions**:

- What is the average mileage of the vehicles offered on this page?

- In which period were the most vehicles offered for sale on the site?

- Which brand is offered the most, and which the least?

- Which brand has the highest average number of kilometers?

- What is the most common fuel type?

- Which are the most offered brands, and which models does each of them offer?


## Data dictionary:

- `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.


- `name` - Name of the car.


- `seller` - Whether the seller is private or a dealer.


- `offerType` - The type of listing.


- `price` - The price on the ad to sell the car.


- `abtest` - Whether the listing is included in an A/B test.


- `vehicleType` - The vehicle type.


- `yearOfRegistration` - The year in which the car was first registered.


- `gearbox` - The transmission type.


- `powerPS` - The power of the car in PS.


- `model` - The car model name.


- `odometer` - How many kilometers the car has driven.


- `monthOfRegistration` - The month in which the car was first registered.


- `fuelType` - What type of fuel the car uses.


- `brand` - The brand of the car.


- `notRepairedDamage` - Whether the car has damage that has not yet been repaired.


- `dateCreated` - The date on which the eBay listing was created.


- `nrOfPictures` - The number of pictures in the ad.


- `postalCode` - The postal code for the location of the vehicle.


- `lastSeen` - When the crawler last saw this ad online.


# <a id='0'>Index</a>


### <a href='#1'>1. </a>

### <a href='#2'>2. </a>

### <a href='#3'>3.  </a>

### <a href='#4'>4. </a>

### <a href='#5'>5. </a>

### <a href='#6'>6. </a>

### <a href='#7'>7.  </a>

### <a href='#8'>8. </a>

### <a href='#9'>9. </a>

### <a href='#10'>10. </a>

In [1]:
import numpy as np
import pandas as pd

In [2]:
! file -k autos.csv

autos.csv: CSV text\012- , Non-ISO extended-ASCII text


In [3]:
! file -i autos.csv

autos.csv: application/csv; charset=unknown-8bit


In [4]:
try:
    print( 'hello' / 'there')
except TypeError:
    print('TypeError thrown')
except NameError:
    print('NameError thrown')
except Exception as e:
    print('An exception occurred that was not NameError or TypeError')

TypeError thrown


In [7]:
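# Candidate encodings to try, in order; `file -i` above shows the file is not UTF-8.
lista = ['Latin-1']  # other candidates: 'iso-8859-2', 'iso-8859-8'

for index in lista:
    print(index)
    try:
        # Use the loop variable so that each candidate is actually tested
        autos = pd.read_csv("autos.csv", encoding=index)
        break  # stop at the first encoding that reads the file
    except Exception as e:
        print(e)

Latin-1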
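The loop above tries candidate encodings until the file can be read. A minimal sketch of the same idea on raw bytes, using only the standard library; the helper name `first_working_encoding` and the sample word are illustrative assumptions, not part of the notebook:

```python
def first_working_encoding(raw, candidates):
    """Return the first candidate encoding that decodes `raw` without errors."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding; try the next one
        return name
    return None


# Hypothetical sample: a German word with an umlaut, stored as Latin-1 bytes.
sample = "Zubehör".encode("latin-1")

# Latin-1 maps every byte to a character, so it never fails; keep it last.
chosen = first_working_encoding(sample, ["utf-8", "latin-1"])
print(chosen)
```

Because Latin-1 accepts any byte sequence, a successful read only means the bytes were decodable, not that every character is displayed as intended, so it is worth checking a few names with umlauts afterwards.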


## 1. Exploring Data

In [8]:
autos.head(3)

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37


In [9]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

### Dataset observations:

- There are 50,000 rows and 20 columns.

- The column names are written in camelCase, so we need to convert them to snake_case.

- Some of the columns are missing data.

- Some of the columns do not have the appropriate data type: some columns that should be integers are stored as text, and some columns stored as text should be dates.

Some columns have null values, but none have more than ~20% null values.
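The share of null values per column can be checked directly instead of reading it off the `info()` output. A small sketch on a made-up frame (the `sample` data is hypothetical and only stands in for `autos`):

```python
import pandas as pd

# Hypothetical miniature frame standing in for `autos`.
sample = pd.DataFrame({
    "brand": ["bmw", "opel", "fiat", "ford"],
    "gearbox": ["manuell", None, "automatik", "manuell"],
    "unrepaired_damage": ["nein", None, None, "ja"],
})

# isnull() gives booleans; their mean is the fraction of missing values.
null_share = sample.isnull().mean().sort_values(ascending=False)
print(null_share)
```

Applied to the real data, `autos.isnull().mean()` would give the same kind of per-column fraction.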






## 2. Cleaning Column Names

Working on fixing the column names:

   - `yearOfRegistration`  to `registration_year`
   - `monthOfRegistration` to `registration_month`
   - `notRepairedDamage`   to `unrepaired_damage`
   - `dateCreated`         to `ad_created`
   


In [10]:
column_name = autos.columns
column_name

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [11]:
autos.rename({'yearOfRegistration':'registration_year','monthOfRegistration':'registration_month',
             'notRepairedDamage':'unrepaired_damage','dateCreated':'ad_created',},axis = 1, inplace = True)

- The rest of the column names are converted from camelCase to snake_case.

In [12]:
autos.rename({'dateCrawled':'date_crawled','price':'price_in_dollars',
              'offerType':'offer_type','vehicleType':'vehicle_type',
              'powerPS':'CV','fuelType':'fuel_type',
              'nrOfPictures':'nr_pictures','postalCode':'postal_code',
              'lastSeen':'last_seen',},
             axis = 1, 
             inplace = True) 

In [13]:
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price_in_dollars',
       'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'CV', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

## Changes made to the column names

The column names were converted to snake_case, and `powerPS` was renamed to `CV` (cavalli vapore), which means 'horsepower' in the metric system used in Europe, where the data comes from (Germany).

This is based on this [link](https://www.autoweek.com/news/technology/a1820831/what-ps-metric-horsepower-autoweek-explains/).
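The renaming above was done by hand with explicit dictionaries. As an alternative sketch, a generic camelCase to snake_case conversion can be written with a regular expression. The helper `camel_to_snake` is an illustrative assumption, and it produces mechanical names (for example `year_of_registration`), not the hand-picked ones used in this notebook (`registration_year`):

```python
import re


def camel_to_snake(name):
    """Insert an underscore before each capital that follows a lowercase letter or digit."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


for original in ["dateCrawled", "offerType", "nrOfPictures", "postalCode"]:
    print(original, "->", camel_to_snake(original))
```

With pandas this could be applied in one step as `autos.rename(columns=camel_to_snake)`, since `rename` also accepts a function.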

## 3. Initial Exploration and Cleaning.

Now let's do some basic data exploration to determine what other cleaning tasks need to be done. 
Initially we will look for:

- 1. Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis.


- 2. Examples of numeric data stored as text which can be cleaned and converted.

In [14]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   date_crawled        50000 non-null  object
 1   name                50000 non-null  object
 2   seller              50000 non-null  object
 3   offer_type          50000 non-null  object
 4   price_in_dollars    50000 non-null  object
 5   abtest              50000 non-null  object
 6   vehicle_type        44905 non-null  object
 7   registration_year   50000 non-null  int64 
 8   gearbox             47320 non-null  object
 9   CV                  50000 non-null  int64 
 10  model               47242 non-null  object
 11  odometer            50000 non-null  object
 12  registration_month  50000 non-null  int64 
 13  fuel_type           45518 non-null  object
 14  brand               50000 non-null  object
 15  unrepaired_damage   40171 non-null  object
 16  ad_created          50

In [15]:
autos.describe(include = 'all')

Unnamed: 0,date_crawled,name,seller,offer_type,price_in_dollars,abtest,vehicle_type,registration_year,gearbox,CV,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-08 10:40:35,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


### - 1. Text columns where all or almost all values are the same. 

Some columns have only one or two distinct values and aren't relevant for our purposes, such as:

- `seller`
- `offer_type`
- `nr_pictures`
- `abtest`

Let's look at the content of each of these columns in more detail.

In [None]:
selected = ['seller', 'offer_type', 'nr_pictures', 'abtest']

for columna in selected:
    print(autos[columna].value_counts(), '\n')

### Dataset observations:

Candidate columns to be dropped:

- `seller`: privat (private owner) = 49999 and gewerblich (commercial) = 1
- `offer_type`: Angebot (offer) = 49999 and Gesuch (wanted) = 1
- `nr_pictures`: the number of pictures is 0 for every listing
- `abtest`: not useful for this analysis
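Columns dominated by a single value can also be found programmatically. A minimal sketch on a made-up frame; the `dominant_share` helper, the sample data and the 0.9 threshold are assumptions for illustration, not values from the notebook:

```python
import pandas as pd

# Hypothetical miniature frame standing in for `autos`.
sample = pd.DataFrame({
    "seller": ["privat"] * 19 + ["gewerblich"],
    "nr_pictures": [0] * 20,
    "brand": ["bmw", "opel", "fiat", "ford", "audi"] * 4,
})


def dominant_share(column):
    """Fraction of rows taken by the most frequent value (NaN counts as a value)."""
    return column.value_counts(normalize=True, dropna=False).iloc[0]


threshold = 0.9  # assumed cut-off, not taken from the notebook
shares = {name: dominant_share(sample[name]) for name in sample.columns}
candidates = [name for name, share in shares.items() if share >= threshold]
print(candidates)
```

A column such as `abtest` would not be caught this way, because its two values are roughly balanced; dropping it is a judgment about relevance, not about variance.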
        
       

### - 2. Numeric and date data stored as text that needs to be cleaned:

- `price_in_dollars`

- `odometer`

- `ad_created`

- `date_crawled`

- `last_seen`

In [16]:
autos['price_in_dollars']

0         $5,000
1         $8,500
2         $8,990
3         $4,350
4         $1,350
          ...   
49995    $24,900
49996     $1,980
49997    $13,200
49998    $22,900
49999     $1,250
Name: price_in_dollars, Length: 50000, dtype: object

In [17]:
autos['odometer']

0        150,000km
1        150,000km
2         70,000km
3         70,000km
4        150,000km
           ...    
49995    100,000km
49996    150,000km
49997      5,000km
49998     40,000km
49999    150,000km
Name: odometer, Length: 50000, dtype: object

### Character deletion and integer conversion

In [18]:
# regex=False treats '$' and ',' as literal characters rather than regex patterns
autos['price_in_dollars'] = autos['price_in_dollars'].str.replace('$', '', regex=False)
autos['price_in_dollars'] = autos['price_in_dollars'].str.replace(',', '', regex=False)
autos['price_in_dollars'] = autos['price_in_dollars'].astype(int)
autos['price_in_dollars']

0         5000
1         8500
2         8990
3         4350
4         1350
         ...  
49995    24900
49996     1980
49997    13200
49998    22900
49999     1250
Name: price_in_dollars, Length: 50000, dtype: int64

### Since the unit characters are removed from the values, the unit is indicated in the column name instead

In [19]:
autos.rename({'odometer':'odometer_km'}, axis = 1, inplace = True)
autos['odometer_km'] = autos['odometer_km'].str.replace(',','')
autos['odometer_km'] = autos['odometer_km'].str.replace('km','')
autos['odometer_km'] = autos['odometer_km'].astype(int)
autos['odometer_km']

0        150000
1        150000
2         70000
3         70000
4        150000
          ...  
49995    100000
49996    150000
49997      5000
49998     40000
49999    150000
Name: odometer_km, Length: 50000, dtype: int64
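Both conversions above follow the same pattern: strip the literal unit characters, then cast to integer. A sketch of that pattern as a reusable helper; the function name `strip_to_int` and the sample values are illustrative assumptions:

```python
import pandas as pd


def strip_to_int(series, tokens):
    """Remove each literal token from the strings, then cast to int."""
    for token in tokens:
        # regex=False so that characters such as '$' are matched literally
        series = series.str.replace(token, "", regex=False)
    return series.astype(int)


prices = pd.Series(["$5,000", "$8,500", "$0"])
odometer = pd.Series(["150,000km", "70,000km", "5,000km"])

clean_prices = strip_to_int(prices, ["$", ","])
clean_odometer = strip_to_int(odometer, [",", "km"])
print(clean_prices.tolist())
print(clean_odometer.tolist())
```

The cast to `int` raises an error if any unexpected text is left over, which is a useful check that every value followed the expected format.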

## 4. Exploring the `odometer_km` and `price_in_dollars` columns

#### 1. Exploring `odometer_km`:

- Number of different values
- Descriptive statistics
- Top 10 km groups

#### Return unique values of `odometer_km`

In [21]:
autos['odometer_km'].unique()

array([150000,  70000,  50000,  80000,  10000,  30000, 125000,  90000,
        20000,  60000,   5000, 100000,  40000])

#### Descriptive statistics 'odometer_km' 

In [22]:
autos['odometer_km'].describe()

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

#### Upper limit `odometer_km`

In [23]:
autos['odometer_km'].value_counts().sort_index(ascending = False).head(10)

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
40000       819
30000       789
Name: odometer_km, dtype: int64

#### Lower limit `odometer_km`

In [24]:
autos['odometer_km'].value_counts().sort_index(ascending = True).head(10)

5000      967
10000     264
20000     784
30000     789
40000     819
50000    1027
60000    1164
70000    1230
80000    1436
90000    1757
Name: odometer_km, dtype: int64

#### 2. Exploring `price_in_dollars`:

- Number of different values
- Descriptive statistics
- Top 10 price groups

#### Return unique values of `price_in_dollars`

In [25]:
autos['price_in_dollars'].unique().shape

#### Descriptive statistics 'price_in_dollars' 

In [26]:
autos['price_in_dollars'].describe().round()

count       50000.0
mean         9840.0
std        481104.0
min             0.0
25%          1100.0
50%          2950.0
75%          7200.0
max      99999999.0
Name: price_in_dollars, dtype: float64

#### Upper limit `price_in_dollars`

In [27]:
autos['price_in_dollars'].value_counts().sort_index(ascending = False).head(10)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
Name: price_in_dollars, dtype: int64

#### Lower limit `price_in_dollars`

In [28]:
autos['price_in_dollars'].value_counts().sort_index(ascending = True).head(10)

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
Name: price_in_dollars, dtype: int64

#### Note: 

Let's take a closer look at the top of the column and choose a value that, although high, is plausible enough to serve as a reference point.

### Exploring `price_in_dollars` outliers on the top of the list:

In [29]:
autos['price_in_dollars'].value_counts().sort_index(ascending = False).head(15)

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
Name: price_in_dollars, dtype: int64

Let's see which vehicle the price of $265,000 corresponds to.

If this price is consistent with the vehicle, we can take it as a reference.

In [30]:
autos[autos["price_in_dollars"] == 265000]

Unnamed: 0,date_crawled,name,seller,offer_type,price_in_dollars,abtest,vehicle_type,registration_year,gearbox,CV,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_pictures,postal_code,last_seen
12682,2016-03-28 22:48:01,Porsche_GT3_RS__PCCB__Lift___grosser_Exklusiv_...,privat,Angebot,265000,control,coupe,2016,automatik,500,911,5000,3,benzin,porsche,nein,2016-03-28 00:00:00,0,70193,2016-04-05 03:44:51


The price is consistent with this vehicle, so we will take it as our reference and review every listing priced at or above it.

https://www.carindigo.com/used-cars/porsche-911-gt3-rs

As there are not too many values to check, it seems sensible to start from this model and see what we can eliminate and what we cannot.

Let's see which vehicles are priced above this one.

In [31]:
top_List = autos[(autos["price_in_dollars"] >= 265000)].sort_index(ascending = False)
top_List

Unnamed: 0,date_crawled,name,seller,offer_type,price_in_dollars,abtest,vehicle_type,registration_year,gearbox,CV,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_pictures,postal_code,last_seen
47634,2016-04-04 21:25:21,Ferrari_FXX,privat,Angebot,3890000,test,coupe,2006,,799,,5000,7,,sonstige_autos,nein,2016-04-04 00:00:00,0,60313,2016-04-05 12:07:37
47598,2016-03-31 18:56:54,Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar...,privat,Angebot,12345678,control,limousine,2001,manuell,101,vectra,150000,3,benzin,opel,nein,2016-03-31 00:00:00,0,4356,2016-03-31 18:56:54
43049,2016-03-21 19:53:52,2_VW_Busse_T3,privat,Angebot,999999,test,bus,1981,manuell,70,transporter,150000,1,benzin,volkswagen,,2016-03-21 00:00:00,0,99880,2016-03-28 17:18:28
42221,2016-03-08 20:39:05,Leasinguebernahme,privat,Angebot,27322222,control,limousine,2014,manuell,163,c4,40000,2,diesel,citroen,,2016-03-08 00:00:00,0,76532,2016-03-08 20:39:05
39705,2016-03-22 14:58:27,Tausch_gegen_gleichwertiges,privat,Angebot,99999999,control,limousine,1999,automatik,224,s_klasse,150000,9,benzin,mercedes_benz,,2016-03-22 00:00:00,0,73525,2016-04-06 05:15:30
39377,2016-03-08 23:53:51,Tausche_volvo_v40_gegen_van,privat,Angebot,12345678,control,,2018,manuell,95,v40,150000,6,,volvo,nein,2016-03-08 00:00:00,0,14542,2016-04-06 23:17:31
37585,2016-03-29 11:38:54,Volkswagen_Jetta_GT,privat,Angebot,999990,test,limousine,1985,manuell,111,jetta,150000,12,benzin,volkswagen,ja,2016-03-29 00:00:00,0,50997,2016-03-29 11:38:54
36818,2016-03-27 18:37:37,Porsche_991,privat,Angebot,350000,control,coupe,2016,manuell,500,911,5000,3,benzin,porsche,nein,2016-03-27 00:00:00,0,70499,2016-03-27 18:37:37
35923,2016-04-03 07:56:23,Porsche_911_Targa_Exclusive_Edition__1_von_15_...,privat,Angebot,295000,test,cabrio,2015,automatik,400,911,5000,6,benzin,porsche,nein,2016-04-03 00:00:00,0,74078,2016-04-03 08:56:20
34723,2016-03-23 16:37:29,Porsche_Porsche_911/930_Turbo_3.0__deutsche_Au...,privat,Angebot,299000,test,coupe,1977,manuell,260,911,100000,7,benzin,porsche,nein,2016-03-23 00:00:00,0,61462,2016-04-06 16:44:50


In [32]:
top_List[["price_in_dollars","brand","model","vehicle_type"]].sort_index(ascending = False)

Unnamed: 0,price_in_dollars,brand,model,vehicle_type
47634,3890000,sonstige_autos,,coupe
47598,12345678,opel,vectra,limousine
43049,999999,volkswagen,transporter,bus
42221,27322222,citroen,c4,limousine
39705,99999999,mercedes_benz,s_klasse,limousine
39377,12345678,volvo,v40,
37585,999990,volkswagen,jetta,limousine
36818,350000,porsche,911,coupe
35923,295000,porsche,911,cabrio
34723,299000,porsche,911,coupe


This view is clearer: some vehicles plausibly belong on this list, while others clearly do not. For example:

Strange values such as 1234566 for a BMW, 11111111 for a Volkswagen, or even a Citroën C4 priced at 27322222.

Clearly, these cases can be eliminated as errors.
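Several of these values look like keyboard placeholders (one repeated digit, or digits typed in order). A rough heuristic sketch for flagging them; the function `looks_like_placeholder` and its `min_digits` parameter are hypothetical, and the rule deliberately misses values such as 1234566 or 27322222, so it complements the manual review and does not replace it:

```python
def looks_like_placeholder(price, min_digits=6):
    """Flag long prices made of one repeated digit or an ascending run like 12345678."""
    digits = str(price)
    if len(digits) < min_digits:
        return False
    repeated = len(set(digits)) == 1      # e.g. 11111111 or 999999
    ascending = digits in "123456789"     # e.g. 12345678
    return repeated or ascending


for price in [99999999, 12345678, 11111111, 999999, 27322222, 265000]:
    print(price, looks_like_placeholder(price))
```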

However, several vehicles are listed under the brand 'sonstige_autos' ('other cars' in German). We don't know what they are, and since they carry significant weight in the list, we need to clarify whether or not to eliminate them.

In [33]:
top_List[top_List["brand"] == "sonstige_autos"]

Unnamed: 0,date_crawled,name,seller,offer_type,price_in_dollars,abtest,vehicle_type,registration_year,gearbox,CV,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_pictures,postal_code,last_seen
47634,2016-04-04 21:25:21,Ferrari_FXX,privat,Angebot,3890000,test,coupe,2006,,799,,5000,7,,sonstige_autos,nein,2016-04-04 00:00:00,0,60313,2016-04-05 12:07:37
14715,2016-03-30 08:37:24,Rolls_Royce_Phantom_Drophead_Coupe,privat,Angebot,345000,control,cabrio,2012,automatik,460,,20000,8,benzin,sonstige_autos,nein,2016-03-30 00:00:00,0,73525,2016-04-07 00:16:26
11137,2016-03-29 23:52:57,suche_maserati_3200_gt_Zustand_unwichtig_laufe...,privat,Angebot,10000000,control,coupe,1960,manuell,368,,100000,1,benzin,sonstige_autos,nein,2016-03-29 00:00:00,0,73033,2016-04-06 21:18:11
7814,2016-04-04 11:53:31,Ferrari_F40,privat,Angebot,1300000,control,coupe,1992,,0,,50000,12,,sonstige_autos,nein,2016-04-04 00:00:00,0,60598,2016-04-05 11:34:11


### High-end vehicles with a very high market value appear in the list and need to be reviewed.

| name | price in dollars | registration_year | source |
| :--- | :--- | :--- | :--- |
| Ferrari_FXX | 4,000,000 | 2006 | https://www.autoblog.com/2006/06/14/for-sale-2006-ferrari-fxx-slightly-used/?guccounter=1 |
| Rolls_Royce_Phantom_Drophead_Coupe | From 450,000 | 2012 | https://www.cars.com/shopping/rolls_royce-phantom_drophead_coupe-2012/ |
| Ferrari_F40 | 1,959,900 | 1992 | https://www.dupontregistry.com/autos/listing/1992/ferrari/f40/2418434 |
| Maserati 3200 GT | | | See the **Note** below |

**Note**: Regarding the Maserati, there are several things to consider.


- The [3200 GT](https://es.wikipedia.org/wiki/Maserati_3200_GT) was in production from 1998 to 2001, so the 1960 registration year in the listing does not match. However, between 1957 and 1964 the brand produced only one model, the [3500 GT](https://en.wikipedia.org/wiki/Maserati_3500_GT#:~:text=The%20Maserati%203500%20GT%20(Tipo,Maserati%20between%201957%20and%201964.), with only 2,222 units built across the Coupé and Spider versions.

- The name field also contains the following description:

     ***"Zustand_unwichtig_laufe... // Condition_unimportant_running..."***


- The person who placed the advert was looking for a Maserati in any condition (running or not), which suggests that the car in question is not the newer model and probably refers to the 1960s one.

- On top of that, the listed price is 10,000,000 dollars, which is surely wrong: searching for the value of this car, [I found that it ranges between 151,290.57 USD and 863,170.50 USD](https://www.el-parking.es/coches-usados/maserati-3500-gt.html#!/coches-usados/maserati-3500-gt.html%3Ftri%3Dprix_decroissant).

Therefore, the best option I can think of is to assign the <strong>'average between prices'</strong> to our vehicle.
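The 'average between prices' is simply the midpoint of the two quoted values. A short sketch of the arithmetic, parsing the figures in the European format in which the source page shows them; the helper `parse_european_number` is an illustrative assumption:

```python
def parse_european_number(text):
    """Convert '863.170,50' (dot for thousands, comma for decimals) to a float."""
    return float(text.replace(".", "").replace(",", "."))


low = parse_european_number("151.290,57")
high = parse_european_number("863.170,50")

# Midpoint of the two quoted prices, truncated to whole dollars.
midpoint = (low + high) / 2
print(int(midpoint))
```

Truncating the midpoint gives 507230, the value assigned to the listing in this notebook.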


In [34]:
autos.iloc[11137,:]

date_crawled                                        2016-03-29 23:52:57
name                  suche_maserati_3200_gt_Zustand_unwichtig_laufe...
seller                                                           privat
offer_type                                                      Angebot
price_in_dollars                                               10000000
abtest                                                          control
vehicle_type                                                      coupe
registration_year                                                  1960
gearbox                                                         manuell
CV                                                                  368
model                                                               NaN
odometer_km                                                      100000
registration_month                                                    1
fuel_type                                                       

Set the price of the Maserati to the **'average between prices'**: $507,230

In [35]:
# Select by row label and column name instead of hard-coded positions
autos.loc[11137, 'price_in_dollars'] = 507230

### So... finally what do we have to remove from this top list?

Every listing whose brand is neither 'sonstige_autos' nor 'porsche'.

In [36]:
remove_bad_price_bool = ((top_List["brand"] != 'sonstige_autos') & (top_List["brand"] != 'porsche'))

In [37]:
bad_cars = top_List[remove_bad_price_bool]  # DataFrame with the rows to remove from autos
bad_cars

Unnamed: 0,date_crawled,name,seller,offer_type,price_in_dollars,abtest,vehicle_type,registration_year,gearbox,CV,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_pictures,postal_code,last_seen
47598,2016-03-31 18:56:54,Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar...,privat,Angebot,12345678,control,limousine,2001,manuell,101,vectra,150000,3,benzin,opel,nein,2016-03-31 00:00:00,0,4356,2016-03-31 18:56:54
43049,2016-03-21 19:53:52,2_VW_Busse_T3,privat,Angebot,999999,test,bus,1981,manuell,70,transporter,150000,1,benzin,volkswagen,,2016-03-21 00:00:00,0,99880,2016-03-28 17:18:28
42221,2016-03-08 20:39:05,Leasinguebernahme,privat,Angebot,27322222,control,limousine,2014,manuell,163,c4,40000,2,diesel,citroen,,2016-03-08 00:00:00,0,76532,2016-03-08 20:39:05
39705,2016-03-22 14:58:27,Tausch_gegen_gleichwertiges,privat,Angebot,99999999,control,limousine,1999,automatik,224,s_klasse,150000,9,benzin,mercedes_benz,,2016-03-22 00:00:00,0,73525,2016-04-06 05:15:30
39377,2016-03-08 23:53:51,Tausche_volvo_v40_gegen_van,privat,Angebot,12345678,control,,2018,manuell,95,v40,150000,6,,volvo,nein,2016-03-08 00:00:00,0,14542,2016-04-06 23:17:31
37585,2016-03-29 11:38:54,Volkswagen_Jetta_GT,privat,Angebot,999990,test,limousine,1985,manuell,111,jetta,150000,12,benzin,volkswagen,ja,2016-03-29 00:00:00,0,50997,2016-03-29 11:38:54
27371,2016-03-09 15:45:47,Fiat_Punto,privat,Angebot,12345678,control,,2017,,95,punto,150000,0,,fiat,,2016-03-09 00:00:00,0,96110,2016-03-09 15:45:47
24384,2016-03-21 13:57:51,Schlachte_Golf_3_gt_tdi,privat,Angebot,11111111,test,,1995,,0,,150000,0,,volkswagen,,2016-03-21 00:00:00,0,18519,2016-03-21 14:40:18
22947,2016-03-22 12:54:19,Bmw_530d_zum_ausschlachten,privat,Angebot,1234566,control,kombi,1999,automatik,190,,150000,2,diesel,bmw,,2016-03-22 00:00:00,0,17454,2016-04-02 03:17:32
2897,2016-03-12 21:50:57,Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000,privat,Angebot,11111111,test,limousine,1973,manuell,48,escort,50000,3,benzin,ford,nein,2016-03-12 00:00:00,0,94469,2016-03-12 22:45:27


In [38]:
len(bad_cars)

11

In [39]:
# drop() accepts the whole index at once, so no explicit loop is needed
autos.drop(bad_cars.index, inplace=True)
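The filter-and-drop step above can be sketched on a small made-up frame. The `listings` data below is hypothetical and only mimics a few rows of `top_List`; `isin` is used here as a compact alternative to chaining `!=` comparisons:

```python
import pandas as pd

# Hypothetical miniature frame standing in for `top_List`.
listings = pd.DataFrame(
    {
        "brand": ["porsche", "opel", "sonstige_autos", "volvo"],
        "price_in_dollars": [350000, 12345678, 3890000, 12345678],
    },
    index=[36818, 47598, 47634, 39377],
)

trusted_brands = ["sonstige_autos", "porsche"]
# ~isin(...) is True for every row whose brand is NOT in the trusted list.
bad_rows = listings[~listings["brand"].isin(trusted_brands)]

# Dropping by index labels removes exactly the flagged rows.
cleaned = listings.drop(bad_rows.index)
print(cleaned)
```

Because the rows are dropped by index label, the same labels can be reused to drop them from any other frame that shares the original index.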

### Exploring price outliers at the bottom of the list:

In [40]:
autos["price_in_dollars"].describe().round()

count      49989.0
mean        5835.0
std        20520.0
min            0.0
25%         1100.0
50%         2950.0
75%         7200.0
max      3890000.0
Name: price_in_dollars, dtype: float64

The minimum value is 0, and a full 25% of the listings are priced at $1,100 or less.

In [41]:
autos["price_in_dollars"].value_counts().sort_index(ascending = False)

3890000       1
1300000       1
507230        1
350000        1
345000        1
           ... 
5             2
3             1
2             3
1           156
0          1421
Name: price_in_dollars, Length: 2350, dtype: int64


In [42]:
autos[autos['price_in_dollars'].between(0,1100)].describe()

Unnamed: 0,price_in_dollars,registration_year,CV,odometer_km,registration_month,nr_pictures,postal_code
count,12538.0,12538.0,12538.0,12538.0,12538.0,12538.0,12538.0
mean,556.422396,2002.788563,74.677461,136162.466103,4.831552,0.0,47992.659595
std,339.147087,146.343951,174.629327,34988.383373,3.947714,0.0,25833.421398
min,0.0,1111.0,0.0,5000.0,0.0,0.0,1069.0
25%,300.0,1996.0,45.0,150000.0,1.0,0.0,27376.25
50%,599.0,1999.0,71.0,150000.0,4.0,0.0,46240.0
75%,850.0,2001.0,101.0,150000.0,8.0,0.0,66482.0
max,1100.0,9999.0,15016.0,150000.0,12.0,0.0,99988.0


Next, I want to know the total sum of the prices of these vehicles in the different price ranges of this bottom quarter.

First, the top range, from the 75th percentile of this subset (850) up to its maximum (1100):

In [43]:
seventy_five = autos[autos['price_in_dollars'].between(850,1100)]
seventy_five.loc[:,'price_in_dollars'].sum()

3227534

Then the range between the median (599) and the 75th percentile (850) of this subset:

In [44]:
fifty_percent = autos[autos['price_in_dollars'].between(599,850)]
fifty_percent.loc[:,'price_in_dollars'].sum()

2462601

And the range between the 25th percentile (300) and the median (599) of this subset:

In [45]:
cuart = autos[autos['price_in_dollars'].between(300,599)]
cuart.loc[:,'price_in_dollars'].sum()

1431649

Finally, the sum of prices for the rest of the dataset, leaving out the bottom 25% (that is, prices of 1100 and above):

In [46]:
sum_total_dataset = autos[autos['price_in_dollars'].between(1100,10000000)]
sum_total_dataset.loc[:,'price_in_dollars'].sum()

285130563

This lets us compare the weight of each range against the total computed above (285130563).


| price range | sum of prices in dollars | percentage of total |
| :---          | :-----  | :-----      |
| [850 ~ 1100] | 3227534 | 1.13 % |
| [599 ~ 850]   | 2462601 | 0.86 % |
| [300 ~ 599]   | 1431649 | 0.50 % |
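The comparison above can be sketched in a single loop. The `prices` values below are made up and only stand in for `autos['price_in_dollars']`, so the printed shares are illustrative and not the figures of this dataset:

```python
import pandas as pd

# Hypothetical prices standing in for autos['price_in_dollars'].
prices = pd.Series([0, 300, 500, 850, 1000, 1100, 5000, 20000])

# Reference total: everything at or above the bottom-quartile ceiling.
reference_total = prices[prices >= 1100].sum()

range_sums = {}
for low, high in [(850, 1100), (599, 850), (300, 599)]:
    # between() includes both endpoints, as in the cells above.
    range_sums[(low, high)] = prices[prices.between(low, high)].sum()

for (low, high), range_sum in range_sums.items():
    share = 100 * range_sum / reference_total
    print(f"[{low} ~ {high}] sum={range_sum} share={share:.2f}%")
```

Since `between` is inclusive on both sides, a price that sits exactly on a boundary (850 here) is counted in two adjacent ranges; passing `inclusive="left"` (available in recent pandas versions) would avoid the double count.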

These ranges carry little weight in the total, so in order to keep as much information as possible I will work with a range that goes from the maximum we defined earlier down to a minimum price of **$850**.

Now that the prices have been cleaned at both the top and the bottom, we make a clean copy of our dataframe and call it:


- `auto_clean`

In [47]:
auto_clean = autos[autos['price_in_dollars'].between(850,10000000)].copy()

In [48]:
auto_clean['price_in_dollars'].describe().round()

count      40786.0
mean        7060.0
std        22537.0
min          850.0
25%         1950.0
50%         3999.0
75%         8500.0
max      3890000.0
Name: price_in_dollars, dtype: float64

## 5. Exploring the date columns

In [49]:
auto_clean[['date_crawled','ad_created','last_seen']].describe()

Unnamed: 0,date_crawled,ad_created,last_seen
count,40786,40786,40786
unique,39611,76,32824
top,2016-03-09 11:54:38,2016-04-03 00:00:00,2016-04-06 21:17:51
freq,3,1608,7


In [50]:
print(auto_clean['date_crawled'].str[:10].value_counts(normalize=True,
                                                       dropna=False).sort_index(ascending = True))

2016-03-05    0.025573
2016-03-06    0.014147
2016-03-07    0.035600
2016-03-08    0.032683
2016-03-09    0.032462
2016-03-10    0.033075
2016-03-11    0.032658
2016-03-12    0.037415
2016-03-13    0.016035
2016-03-14    0.036434
2016-03-15    0.033639
2016-03-16    0.029177
2016-03-17    0.030770
2016-03-18    0.012921
2016-03-19    0.035110
2016-03-20    0.038028
2016-03-21    0.037439
2016-03-22    0.032805
2016-03-23    0.032315
2016-03-24    0.028956
2016-03-25    0.030721
2016-03-26    0.032879
2016-03-27    0.031310
2016-03-28    0.035208
2016-03-29    0.033713
2016-03-30    0.033100
2016-03-31    0.031261
2016-04-01    0.034424
2016-04-02    0.036238
2016-04-03    0.039156
2016-04-04    0.036802
2016-04-05    0.013264
2016-04-06    0.003212
2016-04-07    0.001471
Name: date_crawled, dtype: float64


In [51]:
print(auto_clean['ad_created'].str[:10].sort_index(ascending = False).value_counts(normalize=True,
                                                                                   dropna=False))

2016-04-03    0.039425
2016-03-20    0.038150
2016-03-21    0.037684
2016-03-12    0.037268
2016-04-04    0.037194
                ...   
2016-02-20    0.000025
2016-02-16    0.000025
2016-02-22    0.000025
2016-01-29    0.000025
2016-01-16    0.000025
Name: ad_created, Length: 76, dtype: float64


The highest concentration of vehicle ads was created between March and April 2016.
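To see the monthly concentration directly, the same string-slicing approach can be applied with 7 characters instead of 10. A sketch on made-up timestamps (the `ad_created` values below are hypothetical):

```python
import pandas as pd

# Hypothetical timestamps in the same text format as `ad_created`.
ad_created = pd.Series([
    "2016-03-20 00:00:00",
    "2016-03-21 00:00:00",
    "2016-04-03 00:00:00",
    "2016-01-16 00:00:00",
])

# The first 7 characters are 'YYYY-MM', so this counts ads per month.
per_month = ad_created.str[:7].value_counts(normalize=True).sort_index()
print(per_month)
```

Sorting by index keeps the months in chronological order, which works because the 'YYYY-MM' text sorts the same way as the dates it represents.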

In [52]:
print(auto_clean['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending = True))

2016-03-05    0.001128
2016-03-06    0.003800
2016-03-07    0.004781
2016-03-08    0.006522
2016-03-09    0.009121
2016-03-10    0.010126
2016-03-11    0.011622
2016-03-12    0.022753
2016-03-13    0.008581
2016-03-14    0.012038
2016-03-15    0.015103
2016-03-16    0.015618
2016-03-17    0.026700
2016-03-18    0.007478
2016-03-19    0.014932
2016-03-20    0.019811
2016-03-21    0.020301
2016-03-22    0.020988
2016-03-23    0.018266
2016-03-24    0.018928
2016-03-25    0.018143
2016-03-26    0.015961
2016-03-27    0.014858
2016-03-28    0.019786
2016-03-29    0.021061
2016-03-30    0.023537
2016-03-31    0.022998
2016-04-01    0.022998
2016-04-02    0.025425
2016-04-03    0.024567
2016-04-04    0.023537
2016-04-05    0.129554
2016-04-06    0.231550
2016-04-07    0.137425
Name: last_seen, dtype: float64


In [53]:
auto_clean['registration_year'].describe()

count    40786.000000
mean      2005.389864
std         84.381634
min       1000.000000
25%       2000.000000
50%       2005.000000
75%       2009.000000
max       9999.000000
Name: registration_year, dtype: float64

In [54]:
auto_clean['registration_year'].value_counts().sort_index(ascending = False).head(25)

9999       2
9000       1
8888       1
6200       1
5911       1
5000       2
4500       1
4100       1
2800       1
2019       1
2018     435
2017    1113
2016     733
2015     362
2014     649
2013     795
2012    1306
2011    1617
2010    1585
2009    2076
2008    2205
2007    2261
2006    2660
2005    2816
2004    2645
Name: registration_year, dtype: int64

In [55]:
auto_clean['registration_year']

0        2004
1        1997
2        2009
3        2007
4        2003
         ... 
49995    2011
49996    1996
49997    2014
49998    2013
49999    1996
Name: registration_year, Length: 40786, dtype: int64

The minimum (1000) and maximum (9999) registration years are clearly wrong, so we need to find a plausible range to work with.

Years beyond 2019 make no sense at all. Moreover, the data was crawled in 2016, so a car cannot have been first registered after that year. The values 2017, 2018 and 2019 are therefore equally wrong, and we should cap the registration year at 2016.

In [56]:
auto_clean['registration_year'].value_counts().sort_index(ascending = False).tail(5)

1931    1
1929    1
1927    1
1001    1
1000    1
Name: registration_year, dtype: int64

At the bottom, the smallest plausible value is 1927, so we keep only registration years between 1927 and 2016.

In [57]:
auto_clean = auto_clean[auto_clean['registration_year'].between(1927,2016)].copy()
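Before dropping rows, it can help to check how much data the filter removes (in this notebook the count goes from 40,786 to 39,224 rows, roughly 3.8%). The sketch below shows the idea on a small hypothetical series of years, not on the real data.

```python
import pandas as pd

# Hypothetical registration years, including implausible values
years = pd.Series([1000, 1927, 1999, 2005, 2016, 2017, 2019, 9999])

# between() is inclusive on both ends by default
valid = years.between(1927, 2016)

# The mean of a boolean mask is the share of True values
removed_share = 1 - valid.mean()
print(f"Rows outside 1927-2016: {removed_share:.1%}")

cleaned = years[valid].copy()
print(cleaned.tolist())
```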

In [58]:
auto_clean['registration_year'].describe()

count    39224.000000
mean      2003.713619
std          7.022507
min       1927.000000
25%       2000.000000
50%       2004.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64

After removing the implausible values, the 75th percentile is 2008: three quarters of the vehicles were first registered in 2008 or earlier, and the median registration year is 2004.
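The `75%` row of `describe()` is a percentile, not a share of vehicles registered in one particular year. A minimal sketch with made-up years illustrates the difference:

```python
import pandas as pd

# Hypothetical registration years
years = pd.Series([1998, 2000, 2002, 2004, 2004, 2006, 2008, 2008, 2010, 2014])

# 75th percentile, the same value describe() reports in its "75%" row
q75 = years.quantile(0.75)

# Share of vehicles registered in that year or earlier
share_at_or_below = (years <= q75).mean()

# Share of vehicles registered exactly in that year
share_exactly = (years == q75).mean()

print(q75, share_at_or_below, share_exactly)
```

At least 75% of the values lie at or below the percentile, while the share registered exactly in that year is much smaller.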

In [59]:
auto_clean['registration_year'].value_counts(normalize=True,
                                             dropna=False, 
                                             bins = 20).sort_index(ascending = False).round(3)

(2011.55, 2016.0]     0.098
(2007.1, 2011.55]     0.191
(2002.65, 2007.1]     0.330
(1998.2, 2002.65]     0.225
(1993.75, 1998.2]     0.104
(1989.3, 1993.75]     0.024
(1984.85, 1989.3]     0.011
(1980.4, 1984.85]     0.004
(1975.95, 1980.4]     0.005
(1971.5, 1975.95]     0.002
(1967.05, 1971.5]     0.003
(1962.6, 1967.05]     0.002
(1958.15, 1962.6]     0.001
(1953.7, 1958.15]     0.000
(1949.25, 1953.7]     0.000
(1944.8, 1949.25]     0.000
(1940.35, 1944.8]     0.000
(1935.9, 1940.35]     0.000
(1931.45, 1935.9]     0.000
(1926.91, 1931.45]    0.000
Name: registration_year, dtype: float64

This is the distribution of registration years split into 20 equal-width bins. The bin covering roughly 2003 to 2007 holds the largest share of registrations, about 33%.
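Equal-width bins produce fractional edges such as `2002.65`, which are hard to read. One possible alternative, sketched here on hypothetical years with hand-picked edges, is `pd.cut` with explicit boundaries:

```python
import pandas as pd

# Hypothetical registration years
years = pd.Series([1995, 2001, 2003, 2004, 2006, 2007, 2009, 2012, 2016])

# Hand-picked edges (an assumption); intervals are right-closed, e.g. (2000, 2005]
edges = [1990, 1995, 2000, 2005, 2010, 2016]
binned = pd.cut(years, bins=edges)

# Share of vehicles per interval, in chronological order
share = binned.value_counts(normalize=True).sort_index()
print(share.round(3))
```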

## 7. Exploring Price by Brand

Brands with the highest number of vehicles on the sales list

In [60]:
# Number of listings per brand (all brands, most frequent first)
brand_counts = auto_clean['brand'].value_counts()
brand_counts

volkswagen        8249
bmw               4761
mercedes_benz     4258
audi              3724
opel              3656
ford              2398
renault           1559
peugeot           1105
fiat               868
skoda              721
seat               701
smart              635
mazda              559
toyota             558
citroen            555
nissan             554
hyundai            413
mini               406
sonstige_autos     399
volvo              358
kia                296
honda              294
mitsubishi         281
porsche            278
alfa_romeo         255
chevrolet          252
suzuki             232
chrysler           131
dacia              122
jeep               105
land_rover          98
daihatsu            77
subaru              73
jaguar              69
saab                57
daewoo              42
trabant             36
rover               33
lancia              31
lada                25
Name: brand, dtype: int64

In [61]:
selected_brands=auto_clean['brand'].value_counts().index[:10]

### Average price per vehicle brand

In [62]:
brands_price = {}

for brand in selected_brands:
    sel_brand = auto_clean[auto_clean['brand'] == brand ]
    brands_price[brand] = sel_brand['price_in_dollars'].mean().round()

brands_price_sorted = sorted(brands_price.items(),key = lambda kv: kv[1],reverse=True)
brands_price_sorted

[('audi', 10088.0),
 ('mercedes_benz', 9094.0),
 ('bmw', 8952.0),
 ('skoda', 6739.0),
 ('volkswagen', 6364.0),
 ('ford', 4939.0),
 ('opel', 3911.0),
 ('peugeot', 3772.0),
 ('fiat', 3710.0),
 ('renault', 3297.0)]

### Average number of kilometers per brand 

In [63]:
brands_km = {}

for brand in sorted(selected_brands):
    sel_brand = auto_clean[auto_clean['brand'] == brand ]
    brands_km[brand] = sel_brand['odometer_km'].mean().round()

brands_km_sorted = sorted(brands_km.items(),key = lambda kv: kv[1], reverse=True)
brands_km_sorted

[('bmw', 132200.0),
 ('mercedes_benz', 130461.0),
 ('audi', 128012.0),
 ('volkswagen', 126561.0),
 ('opel', 125495.0),
 ('peugeot', 123362.0),
 ('renault', 123307.0),
 ('ford', 121253.0),
 ('fiat', 110478.0),
 ('skoda', 110430.0)]

### List of car brands by average price and average mileage

In [64]:
listado = {}

template_string = "Mean price {money:.2f}$ and {km:.2f} mean Kilometers"

for vehiculo in brands_price:
    mean_price = brands_price[vehiculo]
    mean_km = brands_km[vehiculo]
    output = template_string.format(money = mean_price, km = mean_km )
    listado[vehiculo] = output
    

listado

{'volkswagen': 'Mean price 6364.00$ and 126561.00 mean Kilometers',
 'bmw': 'Mean price 8952.00$ and 132200.00 mean Kilometers',
 'mercedes_benz': 'Mean price 9094.00$ and 130461.00 mean Kilometers',
 'audi': 'Mean price 10088.00$ and 128012.00 mean Kilometers',
 'opel': 'Mean price 3911.00$ and 125495.00 mean Kilometers',
 'ford': 'Mean price 4939.00$ and 121253.00 mean Kilometers',
 'renault': 'Mean price 3297.00$ and 123307.00 mean Kilometers',
 'peugeot': 'Mean price 3772.00$ and 123362.00 mean Kilometers',
 'fiat': 'Mean price 3710.00$ and 110478.00 mean Kilometers',
 'skoda': 'Mean price 6739.00$ and 110430.00 mean Kilometers'}

In [65]:
brands_price

{'volkswagen': 6364.0,
 'bmw': 8952.0,
 'mercedes_benz': 9094.0,
 'audi': 10088.0,
 'opel': 3911.0,
 'ford': 4939.0,
 'renault': 3297.0,
 'peugeot': 3772.0,
 'fiat': 3710.0,
 'skoda': 6739.0}

In [66]:
brands_km

{'audi': 128012.0,
 'bmw': 132200.0,
 'fiat': 110478.0,
 'ford': 121253.0,
 'mercedes_benz': 130461.0,
 'opel': 125495.0,
 'peugeot': 123362.0,
 'renault': 123307.0,
 'skoda': 110430.0,
 'volkswagen': 126561.0}

In [67]:
# Combine both dictionaries into one DataFrame, aligned on brand name
frame = {'brands average price':brands_price,
         'brands average kilometer':brands_km}

output = pd.DataFrame(frame)
output

brand,brands average price,brands average kilometer
volkswagen,6364.0,126561.0
bmw,8952.0,132200.0
mercedes_benz,9094.0,130461.0
audi,10088.0,128012.0
opel,3911.0,125495.0
ford,4939.0,121253.0
renault,3297.0,123307.0
peugeot,3772.0,123362.0
fiat,3710.0,110478.0
skoda,6739.0,110430.0


Car brands ordered by number of listings, with their average price and average mileage.
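The loops above build one dictionary per metric and then combine them. As a sketch of a more compact route, `groupby` with named aggregation can produce the same kind of table in one step. The rows below are hypothetical, and only the column names `brand`, `price_in_dollars` and `odometer_km` come from the notebook.

```python
import pandas as pd

# Hypothetical listings using the notebook's column names
cars = pd.DataFrame({
    "brand": ["audi", "audi", "bmw", "bmw", "bmw", "fiat"],
    "price_in_dollars": [10000, 12000, 9000, 8000, 10000, 3000],
    "odometer_km": [125000, 150000, 150000, 100000, 125000, 90000],
})

# Keep the most frequent brands (two here, ten in the notebook)
top_brands = cars["brand"].value_counts().index[:2]

summary = (
    cars[cars["brand"].isin(top_brands)]
    .groupby("brand")
    .agg(mean_price=("price_in_dollars", "mean"),
         mean_kilometers=("odometer_km", "mean"))
    .round()
    .sort_values("mean_price", ascending=False)
)
print(summary)
```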

## 8. Storing Aggregate Data in a DataFrame

In [68]:
brands_price_series = pd.Series(brands_price)
print(brands_price_series)

volkswagen        6364.0
bmw               8952.0
mercedes_benz     9094.0
audi             10088.0
opel              3911.0
ford              4939.0
renault           3297.0
peugeot           3772.0
fiat              3710.0
skoda             6739.0
dtype: float64


In [69]:
brands_km_series = pd.Series(brands_km)
print(brands_km_series)

audi             128012.0
bmw              132200.0
fiat             110478.0
ford             121253.0
mercedes_benz    130461.0
opel             125495.0
peugeot          123362.0
renault          123307.0
skoda            110430.0
volkswagen       126561.0
dtype: float64


In [70]:
# Combine both Series into one DataFrame; rows are aligned on brand name
frame = {'mean_price':brands_price_series,
        'mean_kilometers':brands_km_series}

output = pd.DataFrame(frame)
output

brand,mean_price,mean_kilometers
audi,10088.0,128012.0
bmw,8952.0,132200.0
fiat,3710.0,110478.0
ford,4939.0,121253.0
mercedes_benz,9094.0,130461.0
opel,3911.0,125495.0
peugeot,3772.0,123362.0
renault,3297.0,123307.0
skoda,6739.0,110430.0
volkswagen,6364.0,126561.0


Car brands in alphabetical order, with their average price and average mileage.

## 9. Next Steps

### 9.1 Identify categorical data 

Several categorical columns use German labels, some of which may be unfamiliar. We translate them by mapping each value to its equivalent, in this case in Spanish.

In [71]:
auto_clean['vehicle_type'].value_counts(dropna = False)

limousine     11059
kombi          7891
kleinwagen     7613
bus            3797
cabrio         2893
coupe          2243
suv            1937
NaN            1482
andere          309
Name: vehicle_type, dtype: int64

In [72]:
auto_clean['vehicle_type'].value_counts()  # NaN is excluded by default

limousine     11059
kombi          7891
kleinwagen     7613
bus            3797
cabrio         2893
coupe          2243
suv            1937
andere          309
Name: vehicle_type, dtype: int64

In [73]:
categorical_vehicle_type = auto_clean['vehicle_type'].value_counts().index[:]

In [74]:
category_translator = {'bus':'monovolumen','limousine':'sedan', 'kleinwagen':'compacto','kombi':'familiar',
                       'coupe':'coupe','suv':'suv','cabrio':'cabrio','andere':'otros'}

In [75]:
for category in categorical_vehicle_type:
    bool_category = auto_clean['vehicle_type'] == category
    auto_clean.loc[bool_category,'vehicle_type'] = category_translator[category]
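The loop above rewrites one category at a time. A possible alternative is `Series.map` with the same dictionary, shown here on a small hypothetical series. Note that `map` turns any value missing from the dictionary into `NaN`, so the sketch also checks for unmapped labels.

```python
import pandas as pd

# Hypothetical sample of the vehicle_type column, including a missing value
vehicle_type = pd.Series(["limousine", "kombi", None, "bus", "andere"])

# Same mapping as in the notebook
translator = {
    "bus": "monovolumen", "limousine": "sedan", "kleinwagen": "compacto",
    "kombi": "familiar", "coupe": "coupe", "suv": "suv",
    "cabrio": "cabrio", "andere": "otros",
}

# Translate every value in one pass; missing values stay missing
translated = vehicle_type.map(translator)

# Labels that existed before but have no entry in the dictionary
unmapped = vehicle_type.notna() & translated.isna()

print(translated.tolist())
print("Unmapped labels:", int(unmapped.sum()))
```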

In [76]:
auto_clean.head(5)

Unnamed: 0,date_crawled,name,seller,offer_type,price_in_dollars,abtest,vehicle_type,registration_year,gearbox,CV,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000,control,monovolumen,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500,control,sedan,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990,test,sedan,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350,control,compacto,2007,automatik,71,fortwo,70000,6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350,test,familiar,2003,manuell,0,focus,150000,7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [77]:
auto_clean['vehicle_type'].describe()

count     37742
unique        8
top       sedan
freq      11059
Name: vehicle_type, dtype: object

### fuel_type column

In [78]:
auto_clean['fuel_type'].value_counts(dropna = False)

benzin     22913
diesel     13571
NaN         2026
lpg          586
cng           63
hybrid        37
elektro       17
andere        11
Name: fuel_type, dtype: int64

In [79]:
auto_clean['fuel_type'].value_counts()  # NaN is excluded by default

benzin     22913
diesel     13571
lpg          586
cng           63
hybrid        37
elektro       17
andere        11
Name: fuel_type, dtype: int64

In [80]:
auto_clean['fuel_type'].describe()

count      37198
unique         7
top       benzin
freq       22913
Name: fuel_type, dtype: object

In [None]:
#auto_clean.loc['fuel_type'] = auto_clean['fuel_type'].value_counts().dropna().copy()

In [None]:
auto_clean[auto_clean['fuel_type'] == 'andere']

In [None]:
categorical_type_fuel = auto_clean['fuel_type'].value_counts().index[:]
categorical_type_fuel.value_counts()

These are the fuel categories present in our dataset.

In [None]:
categorical_type_fuel_translator = {'benzin':'gasolina', 'diesel':'diesel', 'lpg':'lpg', 'cng':'cng',
                                    'hybrid':'híbrido', 'elektro':'electrico', 'andere':'otros'}

In [None]:
for category in categorical_type_fuel:
    bool_fuel = auto_clean['fuel_type'] == category
    auto_clean.loc[bool_fuel,'fuel_type'] = categorical_type_fuel_translator[category]

In [None]:
auto_clean['fuel_type'].value_counts()

In [None]:
condicion_1 = auto_clean['fuel_type'] == 'híbrido' # testing hybrid category
auto_clean[condicion_1]

### 9.2 Convert dates into numerical data.

In [None]:
auto_clean.info()

In [None]:
replace_columns = ['date_crawled', 'ad_created', 'last_seen']

for column in replace_columns:
    # Keep the date part (YYYY-MM-DD), drop the dashes and convert to a number
    auto_clean[column] = auto_clean[column].str[:10].str.replace('-','')
    auto_clean[column] = pd.to_numeric(auto_clean[column])
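The same `YYYYMMDD` encoding can also be reached through real timestamps, which validates the dates along the way. This is only a sketch on two sample timestamps in the notebook's format.

```python
import pandas as pd

# Two sample timestamps in the format used by the date columns
stamps = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])

# Parse, format as YYYYMMDD text, then convert to integers
as_int = pd.to_datetime(stamps).dt.strftime("%Y%m%d").astype(int)
print(as_int.tolist())

# The integers can be turned back into dates when needed
back = pd.to_datetime(as_int.astype(str), format="%Y%m%d")
print(back.dt.year.tolist())
```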

In [None]:
auto_clean

In [None]:
all_brands= auto_clean['brand'].unique()
sorted(all_brands)

### 9.3 Find the most common make/model combinations

The three brands with the highest average price (Audi, Mercedes-Benz and BMW) each have their own most frequent models, as can be seen below.

In [None]:
def combi_marca_modelo(marca):
    # Print the five most frequent models for the given brand
    bool_marca = auto_clean['brand'] == marca
    print(auto_clean.loc[bool_marca,'model'].value_counts()[:5]) # top 5

In [None]:
combi_marca_modelo('audi')

In [None]:
combi_marca_modelo('mercedes_benz')

In [None]:
combi_marca_modelo('bmw')

In [None]:
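def combi_marca_modelo_top(marca):
    # Most frequent model for the given brand (None if the brand has no model values)
    bool_marca = auto_clean['brand'] == marca
    model_counts = auto_clean.loc[bool_marca,'model'].value_counts()
    return model_counts.index[0] if len(model_counts) else None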

In [None]:
marc_modelo = {}

for brand in sorted(all_brands):
    marc_modelo[brand] = combi_marca_modelo_top(brand)

marc_modelo = pd.Series(marc_modelo)
marc_modelo

### Most frequent model within each brand in our dataset, ordered by brand name.
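As a sketch of another way to get the most frequent model per brand, `groupby` can apply `value_counts` within each brand. The rows below are hypothetical, and brands without any model value are dropped first.

```python
import pandas as pd

# Hypothetical listings; the last brand has no model information
cars = pd.DataFrame({
    "brand": ["audi", "audi", "audi", "bmw", "bmw", "bmw", "sonstige_autos"],
    "model": ["a4", "a4", "a3", "3er", "3er", "5er", None],
})

# value_counts() sorts by frequency, so index[0] is the most common model
top_model = (
    cars.dropna(subset=["model"])
    .groupby("brand")["model"]
    .agg(lambda s: s.value_counts().index[0])
)
print(top_model)
```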

### 9.4 Divide the odometer into groups 

Use aggregation to see if the prices follow any pattern in mileage.

In [None]:
km_group = auto_clean['odometer_km'].value_counts()
km_group

In [None]:
km_group = auto_clean['odometer_km'].value_counts().index[:]

In [None]:
price_group = auto_clean['price_in_dollars'].value_counts().index[:]
price_group

In [None]:
avg_price_by_km_non_damage = {}

# Note: no damage filter is applied here, so each value is the mean price
# of all listings at that mileage, damaged or not.
for km in km_group:
    selected_km = auto_clean[auto_clean['odometer_km'] == km ]
    avg_price_by_km_non_damage[km] = selected_km['price_in_dollars'].mean().round()

In [None]:
avg_price_by_km_non_damage

In [None]:
avg_price_by_km = {}

for km in km_group:
    selected_km = auto_clean[auto_clean['odometer_km'] == km ]
    mean = selected_km['price_in_dollars'].mean().round()
    vehiculos = selected_km['price_in_dollars'].value_counts(normalize = True, sort = True)
    combined = (mean, vehiculos)
    avg_price_by_km[km] = combined

In [None]:
avg_price_by_km

A car with 150,000 km priced this high looks suspicious, so let's check it.

In [None]:
auto_clean[auto_clean['price_in_dollars'] == 69993]

Clearly it is not an outlier. 😅

In [None]:
avg_price_by_km_damage = {}

for km in km_group:
    selected_km = auto_clean[(auto_clean['odometer_km'] == km) &
                             (auto_clean['unrepaired_damage'] == 'ja') ]
    mean = selected_km['price_in_dollars'].mean()
    avg_price_by_km_damage[km] = mean.round()

In [None]:
avg_price_by_km_damage

In [None]:
series_nondamage = pd.Series(avg_price_by_km_non_damage) 
df_nondamage = pd.DataFrame(series_nondamage)
df_nondamage = df_nondamage.rename(columns = {0:'price_not_damage'})

series_damage = pd.Series(avg_price_by_km_damage)
df_damage = pd.DataFrame(series_damage)
df_damage = df_damage.rename(columns = {0:'price_damage'})

series_diference = series_nondamage - series_damage
df_diference = pd.DataFrame(series_diference)
df_diference = df_diference.rename(columns = {0:'price_difference'})

# Damaged-car price as a percentage of the comparison price, rounded to 2 decimals
series_diference_percent = ((series_damage * 100) / series_nondamage).round(2)
df_diference_percent = pd.DataFrame(series_diference_percent)
df_diference_percent = df_diference_percent.rename(columns = {0:'%difference_damage_notdamage'})


df = pd.concat([df_nondamage, df_damage, df_diference, df_diference_percent], axis = 1)
df
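A comparable table can be sketched with a single `groupby` over mileage and damage status. The example assumes `unrepaired_damage` holds `'ja'` (damaged) and `'nein'` (not damaged), and filters on both explicitly. The rows are hypothetical and the percentage column name is illustrative.

```python
import pandas as pd

# Hypothetical listings using the notebook's column names
cars = pd.DataFrame({
    "odometer_km": [50000, 50000, 50000, 150000, 150000, 150000],
    "unrepaired_damage": ["nein", "nein", "ja", "nein", "ja", None],
    "price_in_dollars": [10000, 12000, 5500, 4000, 1000, 3000],
})

# Mean price per mileage and damage status; rows with unknown status are dropped
mean_price = (
    cars.dropna(subset=["unrepaired_damage"])
    .groupby(["odometer_km", "unrepaired_damage"])["price_in_dollars"]
    .mean()
    .unstack("unrepaired_damage")
    .rename(columns={"nein": "price_not_damage", "ja": "price_damage"})
)

mean_price["price_difference"] = (
    mean_price["price_not_damage"] - mean_price["price_damage"]
)
# Damaged price as a percentage of the undamaged price
mean_price["damage_pct_of_not_damage"] = (
    100 * mean_price["price_damage"] / mean_price["price_not_damage"]
).round(2)
print(mean_price)
```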

## Conclusions

We can draw several conclusions from our study:

 1. About 75% of the vehicles for sale have a mileage of up to 150,000 km. Among vehicles with a selling price of at least 850, about 75% are priced at up to 8,750.

 2. After removing the implausible registration years, 75% of the vehicles were registered in 2008 or earlier, and the largest share of registrations falls roughly between 2003 and 2007.

 3. The most offered vehicle brand is Volkswagen.

 4. Among the ten most offered brands, Audi has the highest average price, followed by Mercedes-Benz and BMW.

 5. Among the same brands, BMW has the highest average mileage, followed by Mercedes-Benz and Audi.

 6. Most of the vehicles are sedans, and gasoline is the most common fuel, followed by diesel and LPG.


For the three brands with the highest average price, the most frequent models are:

   - Audi: the A4, followed by the A3 and the A6.

   - Mercedes-Benz: the C-Class, followed by the E-Class and the A-Class.

   - BMW: the 3 Series, followed by the 5 Series and the 1 Series.
   

Finally, comparing the average price by mileage for damaged and non-damaged cars, we observe that the largest price differences appear for vehicles with 40,000 km, 150,000 km and 60,000 km.