![Imgur](https://image.freepik.com/free-vector/automobiles-models-icon-collection_74855-5435.jpg)

<a href='https://www.freepik.com/'>models icon @freepik.com </a>


# Exploring Ebay Car Sales Data 

## Introduction

In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to Kaggle by user orgesleka. The original dataset isn't available on Kaggle anymore, but you can find it [here](https://data.world/data-society/used-cars-data).

This is the short version, with 50,000 rows.

The aim of this project is **to clean the data** and analyze the included used car listings.


### Data dictionary:

- `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.


- `name` - Name of the car.


- `seller` - Whether the seller is private or a dealer.


- `offerType` - The type of listing


- `price` - The price on the ad to sell the car.


- `abtest` - Whether the listing is included in an A/B test.


- `vehicleType` - The vehicle Type.


- `yearOfRegistration` - The year in which the car was first registered.


- `gearbox` - The transmission type.


- `powerPS` - The power of the car in PS.


- `model` - The car model name.


- `kilometer` - How many kilometers the car has driven.


- `monthOfRegistration` - The month in which the car was first registered.


- `fuelType` - What type of fuel the car uses.


- `brand` - The brand of the car.


- `notRepairedDamage` - Whether the car has damage that has not yet been repaired.


- `dateCreated` - The date on which the eBay listing was created.


- `nrOfPictures` - The number of pictures in the ad.


- `postalCode` - The postal code for the location of the vehicle.


- `lastSeenOnline` - When the crawler saw this ad last online.




# <a id='0'>Index</a>


### 1. Basic Exploration Data

- #### 1.2. Cleaning column names 

- #### 1.3. Initial Exploration Data and cleaning

- #### 1.4. Candidate columns to be dropped

- #### 1.5. Columns that need more investigation.

### 2. Numeric data stored as text that needs to be cleaned

- #### 2.1 Character deletion and integer conversion

### 3 . What does our dataset look like?

- #### 3.1 Exploring the odometer_km

 - #### 3.1.1 Upper limit odometer_km

 - #### 3.1.2 Lower limit odometer_km

- #### 3.2 Exploring the price_in_dollar

 - #### 3.2.1 Upper limit price_in_dollars
 
 - ##### 3.2.1.1 Exploring price outliers on the top of the list:
 
 - #### 3.2.2 Exploring price outliers on the bottom of the list

### 4. The date columns 

- #### 4.1 Text to datetime format 

 - #### 4.1.1 Distribution of values in series

 - #### 4.1.2 Upper and lower limit registration_year
 
### 5. Price by Brand

- #### 5.1 Average price per vehicle brand

- #### 5.2 Average number of kilometers per vehicle brand

- #### 5.3 List of car brands by average price and average mileage

### 6. Storing Aggregate Data in a DataFrame

- #### 6.1 Identify categorical data

- #### 6.2 Finding the most common brand/model combi

- #### 6.2.1 Dividing the odometer in groups
 
### 7. How much cheaper are damaged cars than their undamaged counterparts?

### 8. Conclusions

* * *

In [1]:
import numpy as np
import pandas as pd
import chardet # Character encoding auto-detection in Python
import matplotlib.pyplot as plt
%matplotlib inline

### checking our csv file

In [2]:
! file -k csv/autos.csv

csv/autos.csv: CSV text\012- , Non-ISO extended-ASCII text


In [3]:
! file -i csv/autos.csv

csv/autos.csv: application/csv; charset=unknown-8bit


In [4]:
with open("csv/autos.csv", 'rb') as file:
    print(chardet.detect(file.read()))

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}


In [5]:
autos = pd.read_csv("csv/autos.csv", encoding='Windows-1252') # Windows-1252 is close to ISO-8859-1 but maps bytes 0x80-0x9F to printable characters

 https://stackoverflow.com/questions/7048745/what-is-the-difference-between-utf-8-and-iso-8859-1
 
 
- ASCII: 7 bits. 128 code points.

- ISO-8859-1: 8 bits. 256 code points.

- UTF-8: 8-32 bits (1-4 bytes). 1,112,064 code points.

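Both ISO-8859-1 and UTF-8 are backwards compatible with ASCII, but UTF-8 is not backwards compatible with ISO-8859-1: characters above code point 127 are stored as one byte in ISO-8859-1 and as two or more bytes in UTF-8.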
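As a small illustration of the encoding notes above, the sketch below decodes and encodes a couple of sample bytes with Python's built-in codecs. The sample bytes are chosen for illustration only and are not taken from `autos.csv`.

```python
# Byte 0x80 is the euro sign in Windows-1252 but a control character in ISO-8859-1.
raw = b"\x80"
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

# 'é' takes one byte in ISO-8859-1 and two bytes in UTF-8.
e_latin1 = "é".encode("latin-1")
e_utf8 = "é".encode("utf-8")

# The single-byte ISO-8859-1 form is not valid UTF-8, so decoding it fails.
try:
    e_latin1.decode("utf-8")
    utf8_accepts_latin1 = True
except UnicodeDecodeError:
    utf8_accepts_latin1 = False

print(repr(as_cp1252), repr(as_latin1), e_latin1, e_utf8, utf8_accepts_latin1)
```

This is why reading the file with the wrong encoding either raises an error or silently produces the wrong characters.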


## 1. Basic Exploration Data

In [6]:
autos.head(3)

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37


In [7]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

### Observations

- The dataset has 50,000 rows and 20 columns. Some rows are complete and others are not, because several columns have missing data.


- The column names are written in [lowerCamelCase](https://en.wikipedia.org/wiki/Camel_case), so we need to convert them to snake_case.


- Some columns do not have the appropriate data type.


- Some columns have null values, but none have more than ~20% null values.

### 1.2 Cleaning column names

The column names use [camelcase](https://en.wikipedia.org/wiki/Camel_case) instead of Python's preferred [snakecase](https://en.wikipedia.org/wiki/Snake_case), which means we can't just replace spaces with underscores.

Working on fixing the column names:

   - `yearOfRegistration`  to `registration_year`
   - `monthOfRegistration` to `registration_month`
   - `notRepairedDamage`   to `unrepaired_damage`
   - `dateCreated`         to `ad_created`

In [8]:
column_name = autos.columns
column_name

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [9]:
autos.rename({'yearOfRegistration':'registration_year','monthOfRegistration':'registration_month',
             'notRepairedDamage':'unrepaired_damage','dateCreated':'ad_created',
              'nrOfPictures':'nr_pictures'  
             },axis = 1, inplace = True)

and the rest of columns

In [10]:
# 'nrOfPictures' was already renamed in the previous cell, so it is not repeated here
autos.rename({'dateCrawled':'date_crawled','price':'price_in_dollars',
              'offerType':'offer_type','vehicleType':'vehicle_type',
              'powerPS':'CV','fuelType':'fuel_type',
              'postalCode':'postal_code',
              'lastSeen':'last_seen'},
             axis = 1, 
             inplace = True) 

In [11]:
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price_in_dollars',
       'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'CV', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

The column names are now in snake_case, and **powerPS** has been renamed to **CV** (*cavalli vapore* in Italian), which means metric horsepower, the unit used in Europe, where the data comes from (Germany). PS is the German abbreviation for the same unit (*Pferdestärke*).

Based on this [link](https://www.autoweek.com/news/technology/a1820831/what-ps-metric-horsepower-autoweek-explains/)
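The renaming above was done by hand with chosen names. For comparison, here is a minimal sketch of a mechanical camelCase to snake_case conversion. The helper `camel_to_snake` is hypothetical and not part of the notebook, and its output differs from the hand-picked names (for example `year_of_registration` rather than `registration_year`).

```python
import re

def camel_to_snake(name):
    """Insert '_' before an uppercase letter that follows a lowercase letter
    or a digit, then lowercase the whole name."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

examples = ["dateCrawled", "offerType", "powerPS", "yearOfRegistration"]
converted = {name: camel_to_snake(name) for name in examples}
print(converted)
```

A function like this could be passed to `DataFrame.rename(columns=...)`, which accepts a callable, when the automatic names are good enough.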

### 1.3 Initial Exploration Data and cleaning

Cleaning tasks that need to be done: 


- 1. Text columns where all or almost all values are the same. These can often **be dropped as they don't have useful** information for analysis.


- 2. Examples of numeric **data stored as text which can be cleaned and converted**.

In [12]:
column_name

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

### 1.4 Candidate columns to be dropped:

Because their content has little relevance for this analysis, the following columns can be deleted:

- `seller`
- `offer_type`
- `abtest`
- `nr_pictures`
- `postal_code`

In [13]:
autos['seller'].unique()

array(['privat', 'gewerblich'], dtype=object)

In [14]:
autos['offer_type'].unique() # Angebot = Offer / Gesuch = Request

array(['Angebot', 'Gesuch'], dtype=object)

In [15]:
autos['abtest'].value_counts()

test       25756
control    24244
Name: abtest, dtype: int64

In [16]:
autos['nr_pictures'].value_counts()

0    50000
Name: nr_pictures, dtype: int64

In [17]:
autos['postal_code'].unique()[:10]

array([79588, 71034, 35394, 33729, 39218, 22962, 31535, 53474,  7426,
       15749])

### Dropping columns.

In [18]:
columns_todrop = ['seller','offer_type','abtest','nr_pictures','postal_code']

# drop all the listed columns in a single call
autos.drop(columns=columns_todrop, inplace=True)

In [None]:
len(autos.columns) # number columns after

In [20]:
autos.head(0)

Unnamed: 0,date_crawled,name,price_in_dollars,vehicle_type,registration_year,gearbox,CV,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,last_seen


### 1.5 Columns that need more investigation.

`describe` shows **NaN** values, so I want to find out the reason.

In [None]:
autos['unrepaired_damage'].value_counts(dropna=False)

## 2. Numeric data stored as text that needs to be cleaned and converted

A quick look at **all the text-type** columns to determine which ones should be cleaned and have their data type changed.

In [None]:
for index in autos.columns:
    condition = autos[index].dtype
    if  condition == 'object':
        print("\n" )
        print(autos[index][0:2])

In [None]:
%%html
<style>
table {float:left}
</style>

|text column names|to clean|
|:---|:---|
|date_crawled|datetime|
|name|**ok**|
|price_in_dollars|clean $ sign, to number|
|vehicle_type|**ok**|
|gearbox|**ok**|
|model|**ok**|
|odometer|clean km, to number and rename column to odometer_km|
|fuel_type|**ok**|
|brand|**ok**|
|unrepaired_damage|**ok**|
|ad_created|**ok**|

### 2.1 Character deletion and integer conversion

Let's modify only the `price_in_dollars` and `odometer` columns.

In [None]:
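# regex=False treats '$' as a literal character instead of a regex end-of-string anchor
autos['price_in_dollars'] = autos['price_in_dollars'].str.replace('$', '', regex=False)
autos['price_in_dollars'] = autos['price_in_dollars'].str.replace(',', '', regex=False)
autos['price_in_dollars'] = autos['price_in_dollars'].astype(int)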
autos['price_in_dollars']

In [None]:
autos.rename({'odometer':'odometer_km'}, axis = 1, inplace = True)
autos['odometer_km'] = autos['odometer_km'].str.replace(',','')
autos['odometer_km'] = autos['odometer_km'].str.replace('km','')
autos['odometer_km'] = autos['odometer_km'].astype(int)
autos['odometer_km']
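The two cleaning cells above follow the same pattern: remove some literal substrings, then convert to integers. A minimal sketch of that pattern as a reusable function follows. The helper `text_to_int` and the three-row sample frame are hypothetical; the sample values are copied from the `head` output shown earlier.

```python
import pandas as pd

def text_to_int(series, unwanted):
    """Remove each literal substring in `unwanted`, then convert to int."""
    cleaned = series
    for token in unwanted:
        cleaned = cleaned.str.replace(token, "", regex=False)
    return cleaned.astype(int)

sample = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$8,990"],
    "odometer": ["150,000km", "150,000km", "70,000km"],
})
sample["price"] = text_to_int(sample["price"], ["$", ","])
sample["odometer_km"] = text_to_int(sample["odometer"], [",", "km"])
print(sample[["price", "odometer_km"]])
```

Passing `regex=False` keeps the behaviour the same across pandas versions, since `$` has a special meaning in regular expressions.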

## 3. What does our dataset look like?

In [None]:
autos.describe() # numeric columns

### 3.1 Exploring the `odometer_km`

In [None]:
autos['odometer_km'].unique()

In [None]:
autos['odometer_km'].unique().shape

In [None]:
autos['odometer_km'].value_counts().sort_index(ascending = False)

#### Upper limit `odometer_km`

In [None]:
autos['odometer_km'].value_counts().sort_index(ascending = False).head(5)

#### Lower limit `odometer_km`

In [None]:
autos['odometer_km'].value_counts().sort_index(ascending = True).head(5)

In [None]:
autos["odometer_km"].describe().round()

There do not appear to be outliers in the `odometer_km` column.

### 3.2 Exploring the `price_in_dollars`

In [None]:
autos['price_in_dollars'].unique().shape

###  3.2.1 Upper limit `price_in_dollars`: detecting outliers

In [None]:
autos['price_in_dollars'].value_counts().sort_index(ascending = False).head(15)

Let's see which vehicle the value of 265000$ corresponds to. 

In the event that this price is consistent with the vehicle, we can take it as a reference.

In [None]:
autos[autos["price_in_dollars"] == 265000]

It makes sense, so we will use this value as the reference and examine everything priced at or above it.

https://www.carindigo.com/used-cars/porsche-911-gt3-rs

As there are not too many values to check, it seems sensible to start from this model and see what we can eliminate and what we cannot.

Let's see which vehicles are priced at or above this one, creating a new DataFrame variable `top_list` to have a look.

In [None]:
top_list = autos[(autos["price_in_dollars"] >= 265000)].sort_index(ascending = False)

In [None]:
top_list[["price_in_dollars","brand","model","vehicle_type"]].sort_index(ascending = False)

### 3.2.1.1 Exploring price outliers on the top of the list:

Some vehicles make sense on this list and others clearly do not. For example: 

Strange values such as 1234566 for a BMW, 11111111 for a Volkswagen, or even a C4 priced at 27322222. 

Clearly, these cases can be eliminated as errors.

However, some vehicles belong to a brand called **'sonstige_autos'** (German for **other cars**) that we know nothing about. Because they carry considerable weight in the list, it is important to clarify whether or not we should eliminate them.  

In [None]:
top_list[top_list["brand"] == "sonstige_autos"]

### Let's check if these prices correspond to reality

| name | price in Dollars |registration_year | source |
| :--- | :----------- |:------ | :--- |
| Ferrari_FXX | 4,000,000| 2006 |https://www.autoblog.com/2006/06/14/for-sale-2006-ferrari-fxx-slightly-used/?guccounter=1 |
| Rolls_Royce_Phantom_Drophead_Coupe | From 450,000 | 2012 | https://www.cars.com/shopping/rolls_royce-phantom_drophead_coupe-2012/ |
|Ferrari_F40| 1,959,900 |1992 | https://www.dupontregistry.com/autos/listing/1992/ferrari/f40/2418434 |
|Maserati 3200 GT|  | |  **Note** | 

**Note**: In relation to the Maserati, there are several things to consider.


- The [3200 GT](https://es.wikipedia.org/wiki/Maserati_3200_GT) was in production from 1998 to 2001, so our registration year of 1960 does not match. However, between 1957 and 1964 the brand produced the [3500 GT](https://en.wikipedia.org/wiki/Maserati_3500_GT#:~:text=The%20Maserati%203500%20GT%20(Tipo,Maserati%20between%201957%20and%201964.), of which only about 2,200 units were built across the Coupé and Spyder versions.

- The description in the `name` cell also helps:

     ***" Zustand_unwichtig_laufe... // Condition_unimportant_running..."***


- The person who placed the advert was looking for a Maserati in any condition (running or not), which suggests that the model in question is not the newer one, so the advert probably refers to the 1960s model.

- Finally, the listed price of 10,000,000 dollars is surely wrong: searching for the value of this car, [I found that it ranges between 151,290.57 USD and 863,170.50 USD](https://www.el-parking.es/coches-usados/maserati-3500-gt.html#!/coches-usados/maserati-3500-gt.html%3Ftri%3Dprix_decroissant)

Therefore, the best thing I can think of is to assign the <strong>'average between prices'</strong>, that is, the midpoint of that range, to our vehicle.
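The 'average between prices' is simply the midpoint of the two quoted prices. A quick sketch of that arithmetic, with the two bounds copied from the text above and the variable names chosen for illustration:

```python
low_usd = 151290.57    # lower quoted price for the 3500 GT
high_usd = 863170.50   # upper quoted price for the 3500 GT

midpoint = (low_usd + high_usd) / 2
replacement_price = int(midpoint)  # truncate to whole dollars

print(replacement_price)
```

Truncating rather than rounding matches the whole-dollar value used in the next cells.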


In [None]:
autos.iloc[11137,:]

Set the price of the Maserati to the **'average between prices'** = 507230$

In [None]:
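# assign by column label: after the earlier drops, position 4 is registration_year, not the price
autos.loc[11137, 'price_in_dollars'] = 507230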

### Removing rows from top list

#### So... what do we have to remove from this top list?

Everything other than **'sonstige_autos'** and **'porsche'**: those two brands remain, and we eliminate the others, which are the ones that do not interest us.

In [None]:
removed_bad_price_bool = ((top_list["brand"] != 'sonstige_autos') & (top_list["brand"] != 'porsche'))

These are the rows that we must eliminate

In [None]:
bad_cars = top_list[removed_bad_price_bool]
bad_cars

#### Using the index of each row, we eliminate the vehicles from the main dataset `autos`

In [None]:
# drop all the flagged rows in one call, using their index labels
autos.drop(bad_cars.index, inplace = True)

###  3.2.2 Exploring price outliers on the bottom of the list:

In [None]:
autos["price_in_dollars"].describe().round()

### The importance of a specific band in our data set.

#### The minimum value is 0, and for 25% of the dataset the price does not reach 1100 dollars.

Let's see if prices vary a lot or if they are very spread out within this range (the lowest 25%). 

In [None]:
autos[autos['price_in_dollars'].between(0,1100)].describe()

Next, I want the total sum of the prices of these vehicles **in different price ranges**,

i.e. how much each part of this lowest 25% contributes to the total.

In [None]:
seventy_five = autos[autos['price_in_dollars'].between(850,1100)] #75% of the total 25%
seventy_five.loc[:,'price_in_dollars'].sum()

In [None]:
fifty_percent = autos[autos['price_in_dollars'].between(599,850)] #50% of the total 25%
fifty_percent.loc[:,'price_in_dollars'].sum()

In [None]:
cuart = autos[autos['price_in_dollars'].between(300,599)] #25% of the total 25%
cuart.loc[:,'price_in_dollars'].sum()

In this way the importance in the overall data set can be compared.


| price range | total price in dollars | percentage of total |
| :---          | :-----  | :-----      | 
| [850 ~  1100] | 3227534 | 1.095 % |
| [599 ~ 850]   | 2462601 | 0.836 % |
| [300 ~ 599]   | 1431649 | 0.486 % |

These bands do not seem too relevant, so in order to keep as much information as possible I will use a range that goes from the maximum we defined earlier down to **850 dollars.**
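The percentages in the table are each band's summed price divided by the summed price of all listings. A minimal sketch of that calculation on a toy series follows; the helper `band_share` and the toy prices are hypothetical and only illustrate the arithmetic.

```python
import pandas as pd

def band_share(prices, low, high):
    """Percentage of the total summed price contributed by listings
    priced between `low` and `high` (both ends inclusive)."""
    in_band = prices[prices.between(low, high)]
    return 100 * in_band.sum() / prices.sum()

toy_prices = pd.Series([0, 300, 500, 900, 1000, 2300, 5000])
share = band_share(toy_prices, 850, 1100)
print(share)
```

Note that `between` includes both ends, so a listing priced exactly at a shared boundary (for example 850) is counted in two adjacent bands.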

Once we have proper data at the top and at the bottom, we make a clean copy of our dataframe and call it:

- `auto_clean`

In [None]:
auto_clean = autos[autos['price_in_dollars'].between(850,10000000)].copy()

In [None]:
auto_clean['price_in_dollars'].describe().round()

## 4. Date columns

These are the four columns that represent dates:

- `date_crawled`

- `ad_created`

- `last_seen`

- `registration_year`

In [None]:
auto_clean['registration_year'].dtype

In [None]:
auto_clean['date_crawled'].dtype

These columns have two different data types: numeric and object (`'O'`).

The first character specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters. 

The item size must correspond to an existing type, or an error will be raised. The supported kinds are:

|dtype|type|
|:--|:--|
|'b'       |boolean|
|'i'       |(signed) integer|
|'u'       |unsigned integer|
|'f'       |floating-point|
|'c'       |complex-floating point|
|**'O'**       |**(Python) objects**|
|'S', 'a'  |(byte-)string|
|'U'       |Unicode|
|'V'       |raw data (void)|

Columns of this type usually hold text strings.

https://docs.scipy.org/doc/numpy-1.10.1/reference/arrays.dtypes.html

In [None]:
auto_clean[['date_crawled','ad_created','last_seen']].describe() # type text

### 4.1 Text to datetime format

We need to convert the text data into a time format so we can treat it quantitatively and understand its distribution.

In [None]:
replace_columns = ['date_crawled', 'ad_created', 'last_seen']

for column in replace_columns:
    auto_clean[column] = auto_clean[column].str[:10]
    auto_clean[column] = auto_clean[column].str.replace('-','')
    auto_clean[column] = auto_clean[column].astype(int)
    #auto_clean[column] = auto_clean[column].value_counts(normalize=True, dropna=False)
    
auto_clean[['date_crawled','ad_created','last_seen','registration_year']].describe()

The first three columns, which we have just modified, give unclear information. The reason is that the date values include the month and the day, and because pandas treats them as plain integers, the summary statistics are shown in scientific notation.

On the other hand, in the `registration_year` column, even though it is of int type, both the minimum and the maximum value are implausible as vehicle registration years.

We will work with each of the series individually.
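As an alternative to going through integers, the timestamp text can be parsed directly. The sketch below assumes the `YYYY-MM-DD HH:MM:SS` layout visible in the `head` output and uses three sample values copied from it; it is not the approach taken in the notebook's cells.

```python
import pandas as pd

stamps = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
])

# Parse the text, then set the time of day to midnight so rows group by calendar day.
days = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S").dt.normalize()

# Percentage of rows per day, in chronological order.
share_per_day = days.value_counts(normalize=True).sort_index() * 100
print(share_per_day)
```

With a datetime column, `describe()` reports the earliest and latest dates instead of numbers in scientific notation.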

### 4.1.1 Distribution of values in the `date_crawled`, `ad_created`, and `last_seen` columns

In [None]:
auto_clean['date_crawled'][2] # content series looks like...

In [None]:
import datetime as dt

### date_crawled

In [None]:
var_date_crawled = auto_clean['date_crawled']

In [None]:
auto_clean['date_crawled'] = pd.to_datetime(var_date_crawled,format='%Y%m%d')

In [None]:
auto_clean['date_crawled'].describe()

In [None]:
auto_clean['date_crawled'].value_counts(normalize=True, dropna=False)*100

### ad_created

In [None]:
var_date_crawled = auto_clean['ad_created']

In [None]:
auto_clean['ad_created'] = pd.to_datetime(var_date_crawled,format='%Y%m%d')

In [None]:
auto_clean['ad_created'].describe()

In [None]:
auto_clean['ad_created'].value_counts(normalize=True, dropna=False)*100

### last_seen

In [None]:
var_date_crawled = auto_clean['last_seen']

In [None]:
auto_clean['last_seen'] = pd.to_datetime(var_date_crawled,format='%Y%m%d')

In [None]:
auto_clean['last_seen'].describe()

In [None]:
auto_clean['last_seen'].value_counts(normalize=True, dropna=False)*100

### 4.1.2 Selecting the upper and lower limit `registration_year`

In the column of when the cars were registered it is convenient to look at the upper and lower limits to determine the time interval in which we are working.

In [None]:
auto_clean['registration_year'].value_counts().sort_index(ascending = False).head(20)

In [None]:
auto_clean['registration_year'].value_counts().sort_index(ascending = False).tail(30)

The maximum and minimum years are wrong.

Although this column shows values later than 2016, the data was crawled in 2016, so a car cannot have been first registered after that year. Any vehicle with a registration year above 2016 is inaccurate, and we should limit the highest year to 2016.

Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s, but here we will use a narrower interval, from 1959 to 2016.

Let's count the number of listings with cars that fall outside the 1959 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.
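Counting the listings outside the interval can be sketched as follows on a toy series. The years below are made up for illustration; the bounds 1959 and 2016 are the ones chosen in the text.

```python
import pandas as pd

toy_years = pd.Series([1000, 1950, 1999, 2005, 2016, 2018, 9999])

valid = toy_years.between(1959, 2016)   # inclusive on both ends
share_outside = 100 * (~valid).mean()   # percentage of rows outside the interval

# Keep the valid years and turn the rest into NaN.
cleaned = toy_years.where(valid)

print(round(share_outside, 2), cleaned.count())
```

If the share outside the interval is small, dropping or blanking those rows loses little information.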

In [None]:
years = auto_clean['registration_year'].between(1959,2016)

In [None]:
# years outside the interval become NaN
auto_clean['registration_year'] = auto_clean['registration_year'].where(years)

In [None]:
auto_clean['registration_year'].value_counts(normalize=True, dropna=False)*100 # % of frequency

### Summary

In [None]:
auto_clean[['date_crawled','ad_created','last_seen']].describe()

Between March and April 2016 there was the highest activity on the site.

In [None]:
auto_clean['registration_year'].describe()

75% of the vehicles were registered in 2008 or earlier.
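The `75%` row of `describe()` is a percentile: the value at or below which three quarters of the rows fall. A toy sketch of how to read it, with made-up years:

```python
import pandas as pd

toy_years = pd.Series([1998, 2002, 2005, 2008, 2014])

q75 = toy_years.quantile(0.75)                 # the value describe() reports as '75%'
share_at_or_below = (toy_years <= q75).mean()  # fraction of rows up to that value
share_exactly = (toy_years == q75).mean()      # fraction of rows equal to that value

print(q75, share_at_or_below, share_exactly)
```

Only one of the five toy cars was registered in the percentile year itself, while four of the five were registered in that year or earlier.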

In [None]:
auto_clean['registration_year'].value_counts(normalize=True,
                                             dropna=False, 
                                             bins = 10).sort_index(ascending = False).round(3)*100

This is the distribution of registration years grouped into 10 equal-width bins, in which we can see that the period between 2002 and 2007 has the highest number of registrations. 

## 5. Price by Brand

Brands with the highest number of vehicles on the sales list

In [None]:
top_ten_brands=auto_clean['brand'].value_counts()
top_ten_brands

In [None]:
selected_brands=auto_clean['brand'].value_counts().index[:10]
selected_brands

I choose only the 10 first brands

### 5.1 Average price per vehicle brand

In [None]:
brands_price = {}

for brand in selected_brands:
    sel_brand = auto_clean[auto_clean['brand'] == brand ]
    brands_price[brand] = sel_brand['price_in_dollars'].mean().round()

brands_price_sorted = sorted(brands_price.items(),key = lambda kv: kv[1],reverse=True)
brands_price_sorted

### 5.2 Average number of kilometers per vehicle brand 

In [None]:
brands_km = {}

for brand in sorted(selected_brands):
    sel_brand = auto_clean[auto_clean['brand'] == brand ]
    brands_km[brand] = sel_brand['odometer_km'].mean().round()

brands_km_sorted = sorted(brands_km.items(),key = lambda kv: kv[1], reverse=True)
brands_km_sorted

### 5.3 List of car brands by average price and average mileage

In [None]:
listado = {}

template_string = "Mean price {money:.2f}$ and {km:.2f} mean Kilometers"

for vehiculo in brands_price:
    mean_price = brands_price[vehiculo]
    mean_km = brands_km[vehiculo]
    output = template_string.format(money = mean_price, km = mean_km )
    listado[vehiculo] = output
    
listado

In [None]:
frame = {'brands average price':brands_price,
         'brands average kilometer':brands_km}

output = pd.DataFrame(frame)
output

## 6. Storing Aggregate Data in a DataFrame

In [None]:
brands_price_series = pd.Series(brands_price)

In [None]:
brands_price_series

In [None]:
brands_km_series = pd.Series(brands_km)

In [None]:
brands_km_series

In [None]:
auto_clean['registration_year']

frame = {'mean_price':brands_price_series,
        'mean_kilometers':brands_km_series}

output = pd.DataFrame(frame)
output

List of car brands by highest price, their average price and average mileage.
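The same kind of table can also be sketched with `groupby`, which computes both means in one pass instead of looping over brands. The toy frame below is made up for illustration; the column names match the ones used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["audi", "audi", "bmw", "bmw", "opel"],
    "price_in_dollars": [10000, 8000, 9000, 7000, 3000],
    "odometer_km": [100000, 150000, 125000, 150000, 150000],
})

summary = (
    toy.groupby("brand")[["price_in_dollars", "odometer_km"]]
       .mean()
       .round()
       .rename(columns={"price_in_dollars": "mean_price",
                        "odometer_km": "mean_kilometers"})
       .sort_values("mean_price", ascending=False)
)
print(summary)
```

To restrict the result to the most common brands, the frame could first be filtered with `isin` on the selected brand labels.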

### 6.1 Identify categorical data

The categories are German words, so we translate them by mapping the values to their equivalents, in this case in Spanish, because some of the German categories are not self-explanatory. 

In [None]:
auto_clean['vehicle_type'].value_counts(dropna = False)

In [None]:
auto_clean['vehicle_type'].value_counts().dropna()

In [None]:
categorical_vehicle_type = auto_clean['vehicle_type'].value_counts().dropna().index[:]

In [None]:
category_translator = {'bus':'monovolumen','limousine':'sedan', 'kleinwagen':'compacto','kombi':'familiar',
                       'coupe':'coupe','suv':'suv','cabrio':'cabrio','andere':'otros'}

for category in categorical_vehicle_type:
    bool_category = auto_clean['vehicle_type'] == category
    auto_clean.loc[bool_category,'vehicle_type'] = category_translator[category]

In [None]:
auto_clean['vehicle_type'].value_counts()

In [None]:
categorical_type_fuel = auto_clean['fuel_type'].value_counts().index[:]
categorical_type_fuel.value_counts()

In [None]:
categorical_type_fuel_translator = {'benzin':'gasolina', 'diesel':'diesel', 'lpg':'lpg', 'cng':'cng',
                                    'hybrid':'híbrido', 'elektro':'electrico', 'andere':'otros'}

for category in categorical_type_fuel:
    bool_fuel = auto_clean['fuel_type'] == category
    auto_clean.loc[bool_fuel,'fuel_type'] = categorical_type_fuel_translator[category]

In [None]:
auto_clean['fuel_type'].value_counts()

In [None]:
condicion_1 = auto_clean['fuel_type'] == 'híbrido' # testing hybrid category
auto_clean[condicion_1].head(3) # checking

In [None]:
auto_clean.info()

In [None]:
all_brands= auto_clean['brand'].unique()
sorted(all_brands)

### 6.2 Finding the most common brand/model combi.

The top three brands on our list also have their own favourite models, as can be seen below.

In [None]:
def modelos_brand(marca):
    bool_marca = auto_clean['brand'] == marca                              # boolean filter by brand
    print(auto_clean.loc[bool_marca,'model'].value_counts()[:5])           # top five model brand 

In [None]:
modelos_brand('alfa_romeo')

In [None]:
modelos_brand('mercedes_benz')

In [None]:
modelos_brand('bmw')

In [None]:
def top_model(marca):
    boolean_filter = auto_clean['brand'] == marca
    # count the models once; the most frequent one comes first
    model_counts = auto_clean.loc[boolean_filter,'model'].value_counts()
    if model_counts.index[0] == 'andere':        # 'andere' = other, so take the next one
        return model_counts.index[1]
    else:
        return model_counts.index[0]

For each brand, we look for the model that appears most often in our data set; the output is ordered by vehicle brand.
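A variant of this idea that avoids exceptions is sketched below on a toy frame. The helper `most_common_model` is hypothetical; it skips the `andere` placeholder and returns `None` when a brand has no usable model value.

```python
import pandas as pd

def most_common_model(models, placeholder="andere"):
    """Most frequent model name, ignoring missing values and the placeholder."""
    named = models.dropna()
    counts = named[named != placeholder].value_counts()
    return counts.index[0] if len(counts) else None

toy = pd.DataFrame({
    "brand": ["bmw"] * 6 + ["audi"] * 3 + ["trabant"],
    "model": ["andere", "andere", "andere", "3er", "3er", "5er",
              "a4", "a4", "a3",
              None],
})

# Iterating over a grouped column yields (brand, Series of models) pairs.
top = {brand: most_common_model(models)
       for brand, models in toy.groupby("brand")["model"]}
print(top)
```

When two models are tied for first place, which one is returned depends on the order produced by `value_counts`.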

In [None]:
bool_alfa = auto_clean['brand'] == 'alfa_romeo'   # boolean filter for Alfa Romeo

In [None]:
x = auto_clean.loc[bool_alfa,'model']
x.value_counts().index[0:1]

In [None]:
top_model('bmw')

In [None]:
marc_modelo = {}

for brand in sorted(all_brands):
    try:
        marc_modelo[brand] = top_model(brand)
    except IndexError:   # raised when the brand has no usable model value
        print(brand,"doesn't have a model")

marc_modelo = pd.Series(marc_modelo)
marc_modelo

### 6.2.1 Odometer in groups 

Use aggregation to see if the prices follow any pattern in mileage.

In [None]:
km_group = auto_clean['odometer_km'].value_counts()
km_group

In [None]:
km_group = auto_clean['odometer_km'].value_counts().index[:]
km_group

In [None]:
price_group = auto_clean['price_in_dollars'].value_counts().index[:]
price_group

In [None]:
avg_price_by_km_non_damage = {}

# note: no damage filter is applied here, so these averages cover every listing
for km in km_group:
    selected_km = auto_clean[auto_clean['odometer_km'] == km ]
    avg_price_by_km_non_damage[km] = selected_km['price_in_dollars'].mean().round()

In [None]:
avg_price_by_km_non_damage

In [None]:
marklist = sorted(avg_price_by_km_non_damage.items(),key=lambda x:x[1])
sort_avg_price_by_km_non_damage = dict(marklist)
print(sort_avg_price_by_km_non_damage)

In [None]:
avg_price_by_km = {}

for km in km_group:
    selected_km = auto_clean[auto_clean['odometer_km'] == km ]
    mean = selected_km['price_in_dollars'].mean().round()
    vehiculos = selected_km['price_in_dollars'].value_counts(normalize = False, sort = True)
    combined = (mean, vehiculos)
    avg_price_by_km[km] = combined

In [None]:
avg_price_by_km[150000]

In [None]:
auto_clean[auto_clean['price_in_dollars'] == 69993]

## 7. How much cheaper are damaged cars than their undamaged counterparts?

We reuse the dictionary `avg_price_by_km_non_damage` built earlier, which averages over all listings at each mileage, and create another dictionary that covers only the vehicles that do have unrepaired damage.

In [None]:
avg_price_by_km_damage = {}

for km in km_group:
    selected_km = auto_clean[(auto_clean['odometer_km'] == km) &
                             (auto_clean['unrepaired_damage'] == 'ja') ]
    mean = selected_km['price_in_dollars'].mean()
    avg_price_by_km_damage[km] = mean.round()

In [None]:
avg_price_by_km_damage

In [None]:
marklist = sorted(avg_price_by_km_damage.items(),key=lambda x:x[1])
sort_avg_price_by_km_damage = dict(marklist)
print(sort_avg_price_by_km_damage)

In [None]:
plt.figure(figsize = (15,4))

plt.subplot(1,2,1)

data_nodamage_1 = range(len(sort_avg_price_by_km_non_damage))
data_nodamage_2 = list(sort_avg_price_by_km_non_damage.values())
km_nodamage = list(sort_avg_price_by_km_non_damage.keys())   # mileage labels for the x axis

plt.bar(data_nodamage_1,data_nodamage_2,label="avg price non damage cars")
plt.legend()

plt.xticks(data_nodamage_1, km_nodamage,rotation=30)
plt.xlabel("Km")
plt.ylabel("avg_price")

plt.subplot(1,2,2)


data_damage_1 = range(len(sort_avg_price_by_km_damage))
data_damage_2 = list(sort_avg_price_by_km_damage.values())
km_damage = list(sort_avg_price_by_km_damage.keys())         # mileage labels for the x axis

plt.bar(data_damage_1,data_damage_2,label="avg price damage cars")
plt.legend()

plt.xticks(data_damage_1, km_damage,rotation=30)
plt.xlabel("Km")
plt.ylabel("avg_price")
plt.show()

In [None]:
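plt.figure(figsize = (15,4))

plt.title('avg price / Km')

# plot both series in the same mileage order so that overlapping bars refer to the same Km
km_sorted = sorted(avg_price_by_km_non_damage)
positions = range(len(km_sorted))

plt.bar(positions,[avg_price_by_km_non_damage[km] for km in km_sorted],label="avg price non damage cars")

plt.bar(positions,[avg_price_by_km_damage[km] for km in km_sorted],label="avg price damage cars")
plt.xticks(positions,km_sorted,rotation=30)
plt.legend()
plt.xlabel("Km")
plt.ylabel("$")

plt.show()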

In [None]:
series_nondamage = pd.Series(avg_price_by_km_non_damage) 
df_nondamage = pd.DataFrame(series_nondamage)
df_nondamage = df_nondamage.rename(columns = {0:'price_not_damage'})

series_damage = pd.Series(avg_price_by_km_damage)
df_damage = pd.DataFrame(series_damage)
df_damage = df_damage.rename(columns = {0:'price_damage'})

series_diference = series_nondamage - series_damage
df_diference = pd.DataFrame(series_diference)
df_diference = df_diference.rename(columns = {0:'price_difference'})

series_diference_percent = (series_damage * 100) / series_nondamage.round(2)
df_diference_percent = pd.DataFrame(series_diference_percent)
df_diference_percent = df_diference_percent.rename(columns = {0:'%_damage_vs_notdamage'})


df = pd.concat([df_nondamage, df_damage, df_diference, df_diference_percent], axis = 1)
df
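A comparison of this kind can also be sketched with `groupby` and `unstack`, which splits the listings explicitly into damaged (`ja`) and undamaged (`nein`) groups and leaves out rows where the damage status is missing. The toy frame and the output column name `damaged_as_pct_of_undamaged` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "odometer_km": [50000, 50000, 50000, 150000, 150000, 150000],
    "unrepaired_damage": ["nein", "ja", None, "nein", "ja", "nein"],
    "price_in_dollars": [10000, 6000, 9000, 4000, 1000, 2000],
})

# Mean price per mileage and damage status, with one column per status.
mean_price = (
    toy.dropna(subset=["unrepaired_damage"])
       .groupby(["odometer_km", "unrepaired_damage"])["price_in_dollars"]
       .mean()
       .unstack()
)
mean_price["damaged_as_pct_of_undamaged"] = 100 * mean_price["ja"] / mean_price["nein"]
print(mean_price)
```

Separating the two groups this way keeps the undamaged baseline from being mixed with damaged cars or with listings of unknown status.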

In [None]:
df.index

In [None]:
plt.figure(figsize = (15,4))

plt.plot(series_nondamage,label="price_not_damage")
plt.plot(series_damage,label="price_damage")
plt.plot(series_diference,label="price_difference")

plt.xlabel("Km")
plt.ylabel("$")
plt.legend()

plt.show()

In [None]:
plt.figure(figsize = (15,6))
plt.plot(df_diference_percent,label="Km vs %_damage_vs_notdamage")

plt.xlabel("Km")
plt.ylabel("%")

plt.legend()
plt.show()

## 8. Conclusions

The conclusions we can draw from our study are several:

 1. 75% of the vehicles for sale have a mileage of up to 150,000 km, and among the vehicles with a minimum selling price of 850, 75% are priced at 8750 or less.

 2. Once the values are corrected, 75% of the vehicles were registered in 2008 or earlier, and most registrations fall in the period 2002 to 2007.

 3. The most offered vehicle brand is Volkswagen.

 4. The highest average price belongs to Audi, followed by Mercedes and then BMW. 

 5. BMW has the highest average mileage per brand, followed by Mercedes and Audi.

 6. Most of the vehicles are sedans and run on gasoline, followed by diesel and LPG cars.


Following the order of average price per vehicle brand, 
the most common models are:

   - Audi: the A4, followed by the A3 and the A6.

   - Mercedes: the C Class, followed by the E Class and the A Class.

   - BMW: the 3 Series, with the 5 Series in second place and the 1 Series in third place.
   

Finally, comparing the average price per mileage of damaged and non-damaged cars, we observe that the biggest price differences occur in vehicles with 40,000 km, 150,000 km and 60,000 km.