![Imgur](https://image.freepik.com/free-vector/automobiles-models-icon-collection_74855-5435.jpg)

<a href='https://www.freepik.com/'>models icon @freepik.com </a>

# Exploring eBay Car Sales Data

## Introduction


In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to Kaggle by user orgesleka.

The original dataset isn't available on Kaggle anymore, but you can find it [here](https://data.world/data-society/used-cars-data).

This is the short version, with 50,000 rows.

The aim of this project is **to clean the data** and analyze the included used car listings to **answer the following questions**:

- What is the average mileage of the vehicles sold on this page?

- What was the period in which the most vehicles were offered for sale on the site?

- Which brand is offered the most, and which the least?

- Which brand has the highest average number of kilometers?

- What is the most used type of fuel?

- Which are the most offered brands, and which is the most common model of each?


## Data dictionary:

- `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.


- `name` - Name of the car.


- `seller` - Whether the seller is private or a dealer.


- `offerType` - The type of listing


- `price` - The price on the ad to sell the car.


- `abtest` - Whether the listing is included in an A/B test.


- `vehicleType` - The vehicle Type.


- `yearOfRegistration` - The year in which the car was first registered.


- `gearbox` - The transmission type.


- `powerPS` - The power of the car in PS.


- `model` - The car model name.


- `kilometer` - How many kilometers the car has driven.


- `monthOfRegistration` - The month in which the car was first registered.


- `fuelType` - What type of fuel the car uses.


- `brand` - The brand of the car.


- `notRepairedDamage`- If the car has a damage which is not yet repaired.


- `dateCreated` - The date on which the eBay listing was created.


- `nrOfPictures` - The number of pictures in the ad.


- `postalCode` - The postal code for the location of the vehicle.


- `lastSeenOnline` - When the crawler saw this ad last online.


# <a id='0'>Index</a>


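### <a href='#1'>1. Exploring Data</a>

### <a href='#2'>2. Cleaning Column Names</a>

### <a href='#3'>3. Initial Exploration and Cleaning</a>

### <a href='#4'>4. Exploring the `odometer_km` and `price` columns</a>

### <a href='#5'>5. Exploring the date columns</a>

### <a href='#6'>6. </a>

### <a href='#7'>7. Exploring Price by Brand</a>

### <a href='#8'>8. Storing Aggregate Data in a DataFrame</a>

### <a href='#9'>9. Next Steps</a>

### <a href='#10'>10. </a>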

In [1]:
import numpy as np
import pandas as pd

In [38]:
! file -k autos.csv

autos.csv: CSV text\012- , Non-ISO extended-ASCII text


In [39]:
! file -i autos.csv

autos.csv: application/csv; charset=unknown-8bit


In [15]:
try:
    print( 'hello' / 'there')
except TypeError:
    print('TypeError thrown')
except NameError:
    print('NameError thrown')
except Exception:
    print('An exception occurred that was not NameError or TypeError')

TypeError thrown


In [26]:
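# Candidate encodings to try ('Latin-1' and 'iso-8859-2' are other options)
lista = ['iso-8859-8']


for index in lista:
    print(index)
    try:
        autos = pd.read_csv("autos.csv", encoding=index)
    except Exception as e:
        # Report the decoding error and move on to the next candidate
        print(e)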

iso-8859-8
'charmap' codec can't decode byte 0xdc in position 1059: character maps to <undefined>
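
The search for a working encoding can be sketched as a loop that stops at the first candidate that decodes the file. This is only an illustration: the helper name `read_csv_with_fallback` and the small demo file are hypothetical and not part of the original notebook. `Latin-1` assigns a character to every possible byte, so it never fails to decode, although it may still produce the wrong characters if the file was written in another encoding.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings):
    """Return (DataFrame, encoding) for the first encoding that decodes the file."""
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding), encoding
        except UnicodeDecodeError as error:
            # Report the failure and try the next candidate
            print(f"{encoding}: {error}")
    raise ValueError(f"none of {encodings} could decode {path}")


# Demo on a tiny hypothetical file that contains the byte 0xdc ('Ü' in Latin-1).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as handle:
    handle.write("name,price\nÜberholt,100\n".encode("latin-1"))
    demo_path = handle.name

demo_df, used_encoding = read_csv_with_fallback(demo_path, ["utf-8", "Latin-1"])
os.remove(demo_path)
print(used_encoding)
```

With the real file, the call would look like `read_csv_with_fallback("autos.csv", ["utf-8", "Latin-1"])`.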


## 1. Exploring Data

In [17]:
autos.head(3)

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37


In [None]:
autos.info()

### Dataset observations:

- There are 50,000 rows and 20 columns.

- The column names are written in camelCase, so we need to convert them to snake_case.

- Some of the columns are missing data (TODO: indicate the approximate percentage that is missing).

- Some of the columns do not have the appropriate data type.

## 2. Cleaning Column Names

Working on fixing the column names:

   - `yearOfRegistration`  to `registration_year`
   - `monthOfRegistration` to `registration_month`
   - `notRepairedDamage`   to `unrepaired_damage`
   - `dateCreated`         to `ad_created`
   


In [None]:
column_name = autos.columns
column_name

In [None]:
autos.rename({'yearOfRegistration':'registration_year','monthOfRegistration':'registration_month',
             'notRepairedDamage':'unrepaired_damage','dateCreated':'ad_created',},axis = 1, inplace = True)

- The rest of the column names are converted from camelCase to snake_case.

In [None]:
autos.rename({'dateCrawled':'date_crawled','price':'price_in_dollars',
              'offerType':'offer_type','vehicleType':'vehicle_type',
              'powerPS':'CV','fuelType':'fuel_type',
              'nrOfPictures':'nr_pictures','postalCode':'postal_code',
              'lastSeen':'last_seen',},axis = 1, inplace = True) 

In [None]:
autos.columns

### Changes made to the column names

The columns were renamed to snake_case, and `powerPS` became `CV` (cavalli vapore, i.e. metric horsepower), the unit used in Europe, where the data comes from (Germany).

Based on this [link](https://www.autoweek.com/news/technology/a1820831/what-ps-metric-horsepower-autoweek-explains/).
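
For the columns that only need a mechanical camelCase to snake_case conversion, a small helper can do the work instead of a hand-written dictionary. The function name `camel_to_snake` is hypothetical, and this is a sketch rather than the method used in this notebook: the manual renames above choose different names on purpose (for example `registration_year` and `CV`), which a purely mechanical rule cannot produce.

```python
import re


def camel_to_snake(name):
    """Insert an underscore where a lowercase letter or digit meets an uppercase letter."""
    # 'nrOfPictures' -> 'nr_Of_Pictures' -> 'nr_of_pictures'
    with_breaks = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return with_breaks.lower()


sample_columns = ["dateCrawled", "offerType", "nrOfPictures", "powerPS", "price"]
converted = [camel_to_snake(column) for column in sample_columns]
print(converted)
```

Applied to the whole table, this could be written as `autos.columns = [camel_to_snake(c) for c in autos.columns]`.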

## 3. Initial Exploration and Cleaning.

Now let's do some basic data exploration to determine what other cleaning tasks need to be done. 
Initially we will look for:

- 1. Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis.


- 2. Examples of numeric data stored as text which can be cleaned and converted.

In [None]:
autos.info()

In [None]:
autos.describe(include = 'all')

### - 1. Text columns where all or almost all values are the same. 

Some columns have only one or two distinct values and aren't relevant for our purposes, such as:

- `seller`
- `offer_type`
- `nr_pictures`
- `abtest`

Let's look at the content of each of these columns in more detail.

In [None]:
selected = ['seller','offer_type','nr_pictures','abtest']

for columna in selected:
    print(autos[columna].value_counts(), '\n')

### Dataset observations:

Columns candidates to be dropped:

   
   `seller`:

        privat (private owner) = 49999 and gewerblich (commercial) = 1
    
    
   `offer_type`:
   
       Angebot(Offer) = 49999 and Gesuch(Wanted) = 1
        
   `nr_pictures`:
   
       the number of pictures is 0 for every listing

   `abtest`:
   
       not useful for dataframe study
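
One possible way to spot such columns automatically is to measure how much of each column is taken by its most frequent value. The sketch below runs on a hypothetical miniature table, and the 85% threshold is an assumption chosen for illustration. A column with evenly split values would not be flagged by this rule, so a column dropped for lack of relevance (as argued for `abtest`) still has to be listed by hand.

```python
import pandas as pd

# Hypothetical miniature of the listing table, for illustration only.
demo = pd.DataFrame({
    "seller": ["privat"] * 9 + ["gewerblich"],
    "nr_pictures": [0] * 10,
    "brand": ["bmw", "audi", "fiat", "bmw", "opel", "audi", "ford", "bmw", "seat", "opel"],
})


def dominant_share(column):
    """Share of rows taken by the most frequent value of a column."""
    return column.value_counts(normalize=True, dropna=False).iloc[0]


# Flag columns where a single value covers more than 85% of the rows (assumed threshold).
shares = demo.apply(dominant_share)
near_constant = shares[shares > 0.85].index.tolist()
trimmed = demo.drop(columns=near_constant)
print(near_constant)
```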
        
       

### -2  Numeric data stored as text that needs to be cleaned:

- `price_in_dollars`

- `odometer`

- `ad_created`

- `date_crawled`

- `last_seen`

In [None]:
autos['price_in_dollars']

In [None]:
autos['odometer']

### Character deletion and integer conversion

In [None]:
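# regex=False treats '$' as a literal character rather than the regex end-of-string anchor
autos['price_in_dollars'] = autos['price_in_dollars'].str.replace('$', '', regex=False)
autos['price_in_dollars'] = autos['price_in_dollars'].str.replace(',', '', regex=False)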
autos['price_in_dollars'] = autos['price_in_dollars'].astype(int)
autos['price_in_dollars']

### With the unit characters removed, the unit can be indicated in the column name instead

In [None]:
autos.rename({'odometer':'odometer_km'}, axis = 1, inplace = True)
autos['odometer_km'] = autos['odometer_km'].str.replace(',','')
autos['odometer_km'] = autos['odometer_km'].str.replace('km','')
autos['odometer_km'] = autos['odometer_km'].astype(int)
autos['odometer_km']
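
Both conversions follow the same pattern: remove a few literal tokens, then cast to integer. A minimal sketch of that pattern as a reusable helper follows; the name `text_to_int` and the sample values are hypothetical.

```python
import pandas as pd


def text_to_int(series, tokens):
    """Remove each literal token from a text column and cast the result to int."""
    cleaned = series
    for token in tokens:
        # regex=False keeps '$' a literal character instead of a regex anchor
        cleaned = cleaned.str.replace(token, "", regex=False)
    return cleaned.astype(int)


# Hypothetical sample values in the same format as the raw columns.
prices = text_to_int(pd.Series(["$5,000", "$8,500", "$8,990"]), ["$", ","])
odometer = text_to_int(pd.Series(["150,000km", "70,000km"]), [",", "km"])
print(prices.tolist(), odometer.tolist())
```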

## 4. Exploring the `odometer_km` and `price_in_dollars` columns

#### 1. Exploring `odometer_km`:

- Number of different values
- Descriptive statistics.
- List of the top 10 km groups

#### Number of unique values of `odometer_km`

In [None]:
autos['odometer_km'].unique().shape

#### Descriptive statistics 'odometer_km' 

In [None]:
autos['odometer_km'].describe()

#### Upper limit `odometer_km`

In [None]:
autos['odometer_km'].value_counts().sort_index(ascending = False).head(10)

#### Lower limit `odometer_km`

In [None]:
autos['odometer_km'].value_counts().sort_index(ascending = True).head(10)

#### 2. Exploring `price_in_dollars`:

- Number of different values
- Descriptive statistics.
- List of the top 10 price groups

#### Number of unique values of `price_in_dollars`

In [None]:
autos['price_in_dollars'].unique().shape

#### Descriptive statistics 'price_in_dollars' 

In [None]:
autos['price_in_dollars'].describe().round()

#### Upper limit `price_in_dollars`

In [None]:
autos['price_in_dollars'].value_counts().sort_index(ascending = False).head(10)

#### Lower limit `price_in_dollars`

In [None]:
autos['price_in_dollars'].value_counts().sort_index(ascending = True).head(10)

#### Note: 

Let's take a closer look at the top of the column and choose a value that, even though it is high, allows us to have a reference from which to start.

### Exploring `price_in_dollars` outliers on the top of the list:

In [None]:
autos['price_in_dollars'].value_counts().sort_index(ascending = False).head(15)

Let's see to which vehicle the value of 265000$ corresponds. 

In the event that this price is consistent with the vehicle, we can take it as a reference.

In [None]:
autos[autos["price_in_dollars"] == 265000]

The price makes sense, so we will use this value as our threshold and check everything priced at or above it.

https://www.carindigo.com/used-cars/porsche-911-gt3-rs

As there are not too many values to check, it seems sensible to start from this model and see what we can eliminate and what we cannot.

Let's see which vehicles are priced at or above this one.

In [None]:
top_List = autos[(autos["price_in_dollars"] >= 265000)].sort_index(ascending = False)
top_List

In [None]:
top_List[["price_in_dollars","brand","model","vehicle_type"]].sort_index(ascending = False)

This view is clearer: some vehicles plausibly belong on this list, while others clearly do not. For example:

There are strange values such as 1234566 for a BMW, 11111111 for a Volkswagen, and even a C4 priced at 27322222.

These cases can clearly be eliminated as errors.

However, there is a group of vehicles labelled 'sonstige_autos' (meaning 'other cars' in German) that we know nothing about. Because they carry significant weight in the list, it is important to clarify whether or not we should eliminate them.

In [None]:
top_List[top_List["brand"] == "sonstige_autos"]

### High-end vehicles with a very high market value appear in the list and have to be reviewed.

| name | price in Dollars |registration_year | source |
| :--- | :----------- |:------ | :--- |
| Ferrari_FXX | 4,000,000| 2006 |https://www.autoblog.com/2006/06/14/for-sale-2006-ferrari-fxx-slightly-used/?guccounter=1 |
| Rolls_Royce_Phantom_Drophead_Coupe | From 450,000 | 2012 | https://www.cars.com/shopping/rolls_royce-phantom_drophead_coupe-2012/ |
|Ferrari_F40| 1,959,900 |1992 | https://www.dupontregistry.com/autos/listing/1992/ferrari/f40/2418434 |
|Maserati 3200 GT|  | |  **Note** | 

**Note**: In relation to the Maserati, there are several things to consider.


- The [3200 GT](https://es.wikipedia.org/wiki/Maserati_3200_GT) model was in production from 1998 to 2001, so the registration year we have, 1960, does not match. However, between 1957 and 1964 the brand produced only one model, the [3500 GT](https://en.wikipedia.org/wiki/Maserati_3500_GT#:~:text=The%20Maserati%203500%20GT%20(Tipo,Maserati%20between%201957%20and%201964.), of which only 2,222 units were built across the Coupe and Spider versions.

- We should also take into account the description given in the name field:

     ***" Zustand_unwichtig_laufe... // Condition_unimportant_running..."***


- The person who placed the advert was looking for a Maserati in any condition (running or not), which suggests that the model in question is not the newer one, so it probably refers to the 1960s model.

- On top of that, the listed price is 10,000,000 dollars, which is surely wrong: searching for the value of this car, [I found that it ranges between 863,170.50 USD and 151,290.57 USD](https://www.el-parking.es/coches-usados/maserati-3500-gt.html#!/coches-usados/maserati-3500-gt.html%3Ftri%3Dprix_decroissant).

Therefore, the best option I can think of is to assign the <strong>'average between prices'</strong> to our vehicle.


In [None]:
autos.iloc[11137,:]

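Set the price of the Maserati to the **'average between prices'**, 507230$:

In [None]:
# Look the column up by name instead of relying on its position
autos.iloc[11137, autos.columns.get_loc('price_in_dollars')] = 507230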
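
The value 507230 is the midpoint of the two prices quoted above, truncated to whole dollars. The arithmetic can be checked as follows (the variable names are only for illustration):

```python
# Price range found for the Maserati 3500 GT, as quoted above (USD).
highest_price = 863170.50
lowest_price = 151290.57

# Midpoint of the range, truncated to whole dollars.
midpoint = int((highest_price + lowest_price) / 2)
print(midpoint)  # 507230
```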

### So, what do we finally have to remove from this top list?

Everything other than `sonstige_autos` and `porsche`.

In [None]:
remove_bad_price_bool = ((top_List["brand"] != 'sonstige_autos') & (top_List["brand"] != 'porsche'))

In [None]:
bad_cars = top_List[remove_bad_price_bool] # type dataframe! This is the dataframe to remove from check_price
bad_cars

In [None]:
len(bad_cars)

In [None]:
# Drop all the flagged rows in a single call
autos.drop(bad_cars.index, inplace = True)

### Exploring price outliers at the bottom of the list:

In [None]:
autos["price_in_dollars"].describe().round()

The minimum value is 0, and the price does not reach 1100$ until the 25% mark of the dataset.

In [None]:
autos["price_in_dollars"].value_counts().sort_index(ascending = False)

Let's see if prices vary a lot or if they are very spread out within this range (25%). 

In [None]:
autos[autos['price_in_dollars'].between(0,1100)].describe()

To gauge the weight of these vehicles in the dataset, we sum the prices within each price band of the bottom quartile.

First, the band from the 75% value of the bottom quartile up to its maximum (850 to 1100):

In [None]:
seventy_five = autos[autos['price_in_dollars'].between(850,1100)]
seventy_five.loc[:,'price_in_dollars'].sum()

Next, the band between the 50% and 75% values of the bottom quartile (599 to 850):

In [None]:
fifty_percent = autos[autos['price_in_dollars'].between(599,850)]
fifty_percent.loc[:,'price_in_dollars'].sum()

Then, the band between the 25% and 50% values of the bottom quartile (300 to 599):

In [None]:
cuart = autos[autos['price_in_dollars'].between(300,599)]
cuart.loc[:,'price_in_dollars'].sum()

Finally, the sum of the prices of the rest of the dataset, leaving out the bottom quartile (prices from 1100 upwards):

In [None]:
sum_total_dataset = autos[autos['price_in_dollars'].between(1100,10000000)]
sum_total_dataset.loc[:,'price_in_dollars'].sum()

In this way the importance in the overall data set can be compared.


| price range | sum of prices in dollars | percentage over total |
| :---          | :-----  | :-----  |
| [850 ~ 1100]  | 3227534 | 1.095 % |
| [599 ~ 850]   | 2462601 | 0.836 % |
| [300 ~ 599]   | 1431649 | 0.486 % |
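
The percentages in the table can be reproduced by dividing each band's sum by a reference total. The sketch below assumes that the reference is the sum of all prices from 1100 upwards, as computed in the last cell; the notebook does not state the denominator explicitly, and the prices used here are hypothetical.

```python
import pandas as pd

# Hypothetical prices, for illustration only.
demo_prices = pd.Series([0, 300, 500, 700, 900, 1000, 4000, 9000, 15000, 30000])

bands = {"[850 ~ 1100]": (850, 1100), "[599 ~ 850]": (599, 850), "[300 ~ 599]": (300, 599)}
# Assumed reference total: every listing priced at 1100 or more.
reference_total = demo_prices[demo_prices.between(1100, 10_000_000)].sum()

share_by_band = {}
for label, (low, high) in bands.items():
    # between() is inclusive on both ends, so a price on a boundary counts in both bands
    band_sum = demo_prices[demo_prices.between(low, high)].sum()
    share_by_band[label] = round(band_sum * 100 / reference_total, 3)

print(share_by_band)
```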

These bands carry little weight in the overall total, so in order to keep as much information as possible I will use a range that goes from the maximum defined earlier down to a minimum price of **850$**.

Once we have the data concerning the prices worked, both at the top and at the bottom, we make a clean copy of our dataframe and call it:


- `auto_clean`

In [None]:
auto_clean = autos[autos['price_in_dollars'].between(850,10000000)].copy()

In [None]:
auto_clean['price_in_dollars'].describe().round()

## 5. Exploring the date columns

In [None]:
auto_clean[['date_crawled','ad_created','last_seen']].describe()

In [None]:
print(auto_clean['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending = True))

In [None]:
print(auto_clean['ad_created'].str[:10].value_counts(normalize=True, dropna=False))

The highest concentration of listings was created between March and April 2016.

In [None]:
print(auto_clean['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending = True))
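
Slicing the first ten characters works because every timestamp has the same text layout. An alternative sketch, on hypothetical sample values, parses the text into real timestamps first and then counts listings per calendar day:

```python
import pandas as pd

# Hypothetical timestamps in the same 'YYYY-MM-DD HH:MM:SS' format as the dataset.
crawled = pd.Series([
    "2016-03-26 17:47:46",
    "2016-04-04 13:38:56",
    "2016-03-26 18:57:24",
    "2016-04-06 20:15:37",
])

# Parse once, then reduce each timestamp to its calendar day.
days = pd.to_datetime(crawled, format="%Y-%m-%d %H:%M:%S").dt.date
share_per_day = days.value_counts(normalize=True).sort_index()
print(share_per_day)
```

Parsing has the side benefit of failing loudly on malformed dates, which plain string slicing would silently pass through.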

In [None]:
auto_clean['registration_year'].describe()

In [None]:
auto_clean['registration_year'].value_counts().sort_index(ascending = False).head(25)

In [None]:
auto_clean['registration_year']

The maximum and minimum years are wrong, so we must find out which range of values to work with.

Beyond 2019 the registration years make no sense. However, the data was crawled in 2016, so any year after 2016 is equally wrong, and we should limit the highest year to 2016.

In [None]:
auto_clean['registration_year'].value_counts().sort_index(ascending = False).tail(5)

At the bottom, the minimum meaningful value is 1927, so our dataset should be between 1927 and 2016.

In [None]:
auto_clean = auto_clean[auto_clean['registration_year'].between(1927,2016)].copy()
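
Before applying a filter like this one, it can be useful to check how many rows it keeps. A minimal sketch with hypothetical registration years:

```python
import pandas as pd

years = pd.Series([1000, 1927, 1999, 2005, 2016, 2017, 2019, 9999])  # hypothetical values

# between() is inclusive on both ends, so 1927 and 2016 are kept.
valid = years.between(1927, 2016)
print(f"kept {valid.sum()} of {len(years)} rows ({valid.mean():.1%})")
kept_years = years[valid]
```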

In [None]:
auto_clean['registration_year'].describe()

Once the incorrect values are removed, 75% of the vehicles were registered in 2008 or earlier.

In [None]:
auto_clean['registration_year'].value_counts(normalize=True, dropna=False, bins = 20).sort_index(ascending = False).round(3)

This is the distribution of registration years split into 20 bins, where we can see that the period between 2002 and 2007 has the highest number of registrations.

## 7. Exploring Price by Brand

Brands with the highest number of vehicles on the sales list

In [None]:
top_ten_brands=auto_clean['brand'].value_counts()
top_ten_brands

In [None]:
selected_brands=auto_clean['brand'].value_counts().index[:10]

### Average price per vehicle brand

In [None]:
brands_price = {}

for brand in selected_brands:
    sel_brand = auto_clean[auto_clean['brand'] == brand ]
    brands_price[brand] = sel_brand['price_in_dollars'].mean().round()

brands_price_sorted = sorted(brands_price.items(),key = lambda kv: kv[1],reverse=True)
brands_price_sorted

### Average number of kilometers per brand 

In [None]:
brands_km = {}

for brand in sorted(selected_brands):
    sel_brand = auto_clean[auto_clean['brand'] == brand ]
    brands_km[brand] = sel_brand['odometer_km'].mean().round()

brands_km_sorted = sorted(brands_km.items(),key = lambda kv: kv[1], reverse=True)
brands_km_sorted
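
The two loops above can also be expressed as a single `groupby` aggregation. This is an alternative sketch on hypothetical data, not the approach used in this notebook:

```python
import pandas as pd

# Hypothetical listings, for illustration only.
demo = pd.DataFrame({
    "brand": ["bmw", "bmw", "audi", "audi", "opel"],
    "price_in_dollars": [8000, 10000, 9000, 11000, 3000],
    "odometer_km": [150000, 125000, 150000, 100000, 150000],
})

# One pass over the data: mean price and mean mileage for every brand.
brand_means = (
    demo.groupby("brand")[["price_in_dollars", "odometer_km"]]
    .mean()
    .round()
    .sort_values("price_in_dollars", ascending=False)
)
print(brand_means)
```

To restrict the result to the ten most common brands, the table could be filtered first with `demo[demo["brand"].isin(selected_brands)]`.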

### List of car brands by average price and average mileage

In [None]:
listado = {}

template_string = "Mean price {money:.2f}$ and {km:.2f} mean Kilometers"

for vehiculo in brands_price:
    mean_price = brands_price[vehiculo]
    mean_km = brands_km[vehiculo]
    output = template_string.format(money = mean_price, km = mean_km )
    listado[vehiculo] = output
    

listado

In [None]:
brands_price

In [None]:
brands_km

In [None]:
# Combine both dictionaries into one table, one row per brand
frame = {'brands average price':brands_price,
         'brands average kilometer':brands_km}

output = pd.DataFrame(frame)
output

List of car brands by number of vehicles, with their average price and average mileage.

## 8. Storing Aggregate Data in a DataFrame

In [None]:
brands_price_series = pd.Series(brands_price)
print(brands_price_series)

In [None]:
brands_km_series = pd.Series(brands_km)
print(brands_km_series)

In [None]:
# Combine both series into one table, aligned on the brand name
frame = {'mean_price':brands_price_series,
        'mean_kilometers':brands_km_series}

output = pd.DataFrame(frame)
output

List of car brands by highest price, their average price and average mileage.

## 9. Next Steps

### 9.1 Identify categorical data 

Some categorical columns use German words. We translate them and map the values to their equivalents, in this case in Spanish, because some of the categories are otherwise unfamiliar to us.

In [None]:
auto_clean['vehicle_type'].value_counts(dropna = False)

In [None]:
auto_clean['vehicle_type'].value_counts().dropna()

In [None]:
categorical_vehicle_type = auto_clean['vehicle_type'].value_counts().index[:]

In [None]:
category_translator = {'bus':'monovolumen','limousine':'sedan', 'kleinwagen':'compacto','kombi':'familiar',
                       'coupe':'coupe','suv':'suv','cabrio':'cabrio','andere':'otros'}

In [None]:
for category in categorical_vehicle_type:
    bool_category = auto_clean['vehicle_type'] == category
    auto_clean.loc[bool_category,'vehicle_type'] = category_translator[category]
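
The same translation can be done without a loop using `Series.map`. The sketch below uses a shortened dictionary and hypothetical values. One difference to keep in mind: with `map`, a category that is missing from the dictionary becomes a missing value, whereas the loop above raises a `KeyError`.

```python
import pandas as pd

category_translator = {"bus": "monovolumen", "limousine": "sedan", "kleinwagen": "compacto"}
vehicle_type = pd.Series(["bus", "limousine", None, "kleinwagen", "bus"])  # hypothetical values

# Series.map looks every value up in the dictionary; missing values stay missing.
translated = vehicle_type.map(category_translator)
print(translated.tolist())
```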

In [None]:
auto_clean.head(5)

In [None]:
auto_clean['vehicle_type'].describe()

### fuel_type column

In [None]:
auto_clean['fuel_type'].value_counts(dropna = False)

In [None]:
auto_clean['fuel_type'].value_counts().dropna()

In [None]:
auto_clean['fuel_type'].describe()

In [None]:
#auto_clean.loc['fuel_type'] = auto_clean['fuel_type'].value_counts().dropna().copy()

In [None]:
auto_clean[auto_clean['fuel_type'] == 'andere']

In [None]:
categorical_type_fuel = auto_clean['fuel_type'].value_counts().index[:]
categorical_type_fuel.value_counts()

These are the categories we have in our dataset.

In [None]:
categorical_type_fuel_translator = {'benzin':'gasolina', 'diesel':'diesel', 'lpg':'lpg', 'cng':'cng',
                                    'hybrid':'híbrido', 'elektro':'electrico', 'andere':'otros'}

In [None]:
for category in categorical_type_fuel:
    bool_fuel = auto_clean['fuel_type'] == category
    auto_clean.loc[bool_fuel,'fuel_type'] = categorical_type_fuel_translator[category]

In [None]:
auto_clean['fuel_type'].value_counts()

In [None]:
condicion_1 = auto_clean['fuel_type'] == 'híbrido' # testing hybrid category
auto_clean[condicion_1]

### 9.2 Convert dates into numerical data.

In [None]:
auto_clean.info()

In [None]:
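replace_columns = ['date_crawled', 'ad_created', 'last_seen']

for column in replace_columns:
    # Keep only the date part, e.g. '2016-03-26' -> '20160326'
    auto_clean[column] = auto_clean[column].str[:10]
    auto_clean[column] = auto_clean[column].str.replace('-','')
    # Cast to integer so the column really holds numerical data
    auto_clean[column] = auto_clean[column].astype(int)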
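
An alternative sketch for the same conversion parses the text as timestamps and formats the result as `YYYYMMDD` before casting to integer. The sample values are hypothetical.

```python
import pandas as pd

last_seen = pd.Series(["2016-04-06 06:45:54", "2016-04-06 14:45:08", "2016-03-26 00:00:00"])

# Parse, render as YYYYMMDD text, then cast so the column holds integers.
as_int = pd.to_datetime(last_seen).dt.strftime("%Y%m%d").astype(int)
print(as_int.tolist())
```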

In [None]:
auto_clean

In [None]:
all_brands= auto_clean['brand'].unique()
sorted(all_brands)

### 9.3 Find the most common make/model combinations

The top three brands on our list also have their own favourite models, as can be seen below.

In [None]:
def combi_marca_modelo(marca):
    bool_fiat = auto_clean['brand'] == marca 
    print(auto_clean.loc[bool_fiat,'model'].value_counts()[:5]) # top 5

In [None]:
combi_marca_modelo('audi')

In [None]:
combi_marca_modelo('mercedes_benz')

In [None]:
combi_marca_modelo('bmw')

In [None]:
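def combi_marca_modelo_top(marca):
    bool_marca = auto_clean['brand'] == marca
    counts = auto_clean.loc[bool_marca,'model'].value_counts()
    # value_counts sorts by frequency, so the first label is the most common model
    return counts.index[0] if not counts.empty else None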

In [None]:
marc_modelo = {}

for brand in sorted(all_brands):
    marc_modelo[brand] = combi_marca_modelo_top(brand)

marc_modelo = pd.Series(marc_modelo)
marc_modelo

### Most frequent model within each brand in our data set, ordered by brand name.
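
The most frequent model per brand can also be obtained in one step with `groupby`. This is a sketch on hypothetical data; a brand whose models are all missing would need separate handling, because its `value_counts()` result is empty.

```python
import pandas as pd

# Hypothetical listings, for illustration only.
demo = pd.DataFrame({
    "brand": ["audi", "audi", "audi", "bmw", "bmw", "bmw", "bmw"],
    "model": ["a4", "a4", "a3", "3er", "5er", "3er", "1er"],
})

# For each brand, value_counts() sorts models by frequency; take the first label.
top_model = demo.groupby("brand")["model"].agg(lambda models: models.value_counts().index[0])
print(top_model)
```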

### 9.4 Divide the odometer into groups 

Use aggregation to see if the prices follow any pattern in mileage.

In [None]:
km_group = auto_clean['odometer_km'].value_counts()
km_group

In [None]:
km_group = auto_clean['odometer_km'].value_counts().index[:]

In [None]:
price_group = auto_clean['price_in_dollars'].value_counts().index[:]
price_group

In [None]:
# Average price for each mileage value, over all listings (damaged or not)
avg_price_by_km_non_damage = {}

for km in km_group:
    selected_km = auto_clean[auto_clean['odometer_km'] == km ]
    avg_price_by_km_non_damage[km] = selected_km['price_in_dollars'].mean().round()

In [None]:
avg_price_by_km_non_damage

In [None]:
avg_price_by_km = {}

for km in km_group:
    selected_km = auto_clean[auto_clean['odometer_km'] == km ]
    mean = selected_km['price_in_dollars'].mean().round()
    vehiculos = selected_km['price_in_dollars'].value_counts(normalize = True, sort = True)
    combined = (mean, vehiculos)
    avg_price_by_km[km] = combined

In [None]:
avg_price_by_km
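
The aggregation by mileage can be written more compactly with `groupby`, which also makes it easy to show how many listings stand behind each average. A sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical listings, for illustration only.
demo = pd.DataFrame({
    "odometer_km": [150000, 150000, 100000, 100000, 50000],
    "price_in_dollars": [3000, 5000, 8000, 10000, 15000],
})

# Mean price and number of listings for each mileage value, highest mileage first.
price_by_km = (
    demo.groupby("odometer_km")["price_in_dollars"]
    .agg(["mean", "count"])
    .sort_index(ascending=False)
)
print(price_by_km)
```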

A car with 150,000 km that is worth that much money? 

In [None]:
auto_clean[auto_clean['price_in_dollars'] == 69993]

Clearly it is not an outlier. 😅

### 9.5 How much cheaper are damaged cars than their undamaged counterparts?

We reuse the `avg_price_by_km_non_damage` dictionary built earlier (which averages over all listings at each mileage, damaged or not) and create another dictionary that covers only the vehicles with unrepaired damage.


In [None]:
avg_price_by_km_damage = {}

for km in km_group:
    selected_km = auto_clean[(auto_clean['odometer_km'] == km) &
                             (auto_clean['unrepaired_damage'] == 'ja') ]
    mean = selected_km['price_in_dollars'].mean()
    avg_price_by_km_damage[km] = mean.round()

In [None]:
avg_price_by_km_damage

In [None]:
series_nondamage = pd.Series(avg_price_by_km_non_damage)
series_damage = pd.Series(avg_price_by_km_damage)

# One row per mileage value; the series are aligned on their index
df = pd.DataFrame({
    'price_not_damage': series_nondamage,
    'price_damage': series_damage,
    'price_difference': series_nondamage - series_damage,
    # Damaged price expressed as a percentage of the baseline price
    '%difference_damage_notdamage': (series_damage * 100) / series_nondamage,
})
df
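
For a comparison that separates damaged from undamaged listings explicitly, a pivot table on the damage flag is one option. The sketch below uses hypothetical data and assumes the flag takes the values 'ja' (damaged) and 'nein' (not damaged), as in the raw dataset; the column names `price_difference` and `discount_percent` are chosen for illustration.

```python
import pandas as pd

# Hypothetical listings, for illustration only.
demo = pd.DataFrame({
    "odometer_km": [150000, 150000, 150000, 100000, 100000],
    "unrepaired_damage": ["nein", "ja", "nein", "nein", "ja"],
    "price_in_dollars": [4000, 2000, 6000, 9000, 6000],
})

# Rows: mileage; columns: damage flag; cells: mean price.
comparison = demo.pivot_table(
    index="odometer_km", columns="unrepaired_damage",
    values="price_in_dollars", aggfunc="mean",
)
comparison["price_difference"] = comparison["nein"] - comparison["ja"]
comparison["discount_percent"] = (comparison["price_difference"] * 100 / comparison["nein"]).round(1)
print(comparison)
```

Listings with a missing damage flag are left out of both groups here, which is a design choice worth stating when reporting the result.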

## Conclusions

We can draw several conclusions from our study:

 1. 75% of the vehicles for sale have a mileage of up to about 150,000 km and, counting only the vehicles with a minimum selling price of 850, 75% of them are priced at up to 8750.

 2. Once the incorrect values are removed, 75% of the vehicles were registered in 2008 or earlier, and most of the registrations fall in the period 2002 to 2007.

 3. The most offered vehicle brand is Volkswagen.

 4. The highest average price belongs to Audi, followed by Mercedes and then BMW.

 5. The average number of kilometers per brand is led by BMW, followed by Mercedes and Audi.

 6. Most of the vehicles are sedans, and most run on gasoline, followed by diesel and LPG.


Following the order of average price per brand, the most common models are:

   - Audi: the A4, followed by the A3 and the A6.

   - Mercedes: the C Class, followed by the E Class and the A Class.

   - BMW: the 3 Series, with the 5 Series in second place and the 1 Series in third place.
   

Finally, comparing the average price per mileage group for damaged and non-damaged cars, we observe that the biggest price differences appear for vehicles with 40,000 km, 150,000 km and 60,000 km.