# Exploring eBay Car Sales Data

In this project, we are going to analyse the prices of used cars from eBay Kleinanzeigen, the [classifieds](https://en.wikipedia.org/wiki/Classified_advertising) section of the German eBay website.

The dataset we are going to use was originally [scraped](https://en.wikipedia.org/wiki/Web_scraping) and uploaded to Kaggle by user [orgesleka](https://www.kaggle.com/orgesleka). The original dataset is no longer available on Kaggle, but you can find it [here](https://data.world/data-society/used-cars-data).

The data dictionary provided with the data is as follows:

|Column Name           |Description                                                                  |
|----------------------|-----------------------------------------------------------------------------|
| dateCrawled          | When this ad was first crawled. All field values are taken from this date.  |
| name                 | Name of the car.                                                            |
| seller               | Whether the seller is private or a dealer.                                  |
| offerType            | The type of listing                                                         |
| price                | The price on the ad to sell the car.                                        |
| abtest               | Whether the listing is included in an A/B test.                             |
| vehicleType          | The vehicle Type.                                                           |
| yearOfRegistration   | The year in which the car was first registered.                             |
| gearbox              | The transmission type.                                                      |
| powerPS              | The power of the car in PS.                                                 |
| model                | The car model name.                                                         |
| kilometer            | How many kilometers the car has driven.                                     |
| monthOfRegistration  | The month in which the car was first registered.                            |
| fuelType             | What type of fuel the car uses.                                             |
| brand                | The brand of the car.                                                       |
| notRepairedDamage    | If the car has a damage which is not yet repaired.                          |
| dateCreated          | The date on which the eBay listing was created.                             |
| nrOfPictures         | The number of pictures in the ad.                                           |
| postalCode           | The postal code for the location of the vehicle.                            |
| lastSeenOnline       | When the crawler saw this ad last online.                                   |

In this project, there will be an element of data cleaning before we analyse the data. pandas and NumPy will take centre stage, so our first step is to import those libraries. Then we will read in our dataset as a DataFrame. We need to use "Latin-1" encoding to ensure that pandas can read the file without error.

We will make use of a neat feature of Jupyter: for a large DataFrame, it automatically displays only the first and last 5 rows.

In [1]:
import pandas as pd
import numpy as np

autos = pd.read_csv("autos.csv", encoding="Latin-1")
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371523,2016-03-14 17:48:27,Suche_t4___vito_ab_6_sitze,privat,Angebot,2200,test,,2005,,0,,20000,1,,sonstige_autos,,2016-03-14 00:00:00,0,39576,2016-04-06 00:46:52
371524,2016-03-05 19:56:21,Smart_smart_leistungssteigerung_100ps,privat,Angebot,1199,test,cabrio,2000,automatik,101,fortwo,125000,3,benzin,smart,nein,2016-03-05 00:00:00,0,26135,2016-03-11 18:17:12
371525,2016-03-19 18:57:12,Volkswagen_Multivan_T4_TDI_7DC_UY2,privat,Angebot,9200,test,bus,1996,manuell,102,transporter,150000,3,diesel,volkswagen,nein,2016-03-19 00:00:00,0,87439,2016-04-07 07:15:26
371526,2016-03-20 19:41:08,VW_Golf_Kombi_1_9l_TDI,privat,Angebot,3400,test,kombi,2002,manuell,100,golf,150000,6,diesel,volkswagen,,2016-03-20 00:00:00,0,40764,2016-03-24 12:45:21


Since the original dataset was crawled from a German website, some of the entries are in German. This is most noticeable in the gearbox column, where we have entries such as "manuell" and "automatik".

For now, we will continue exploring an overview of the data using the .info() method.

In [2]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64 
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64 
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64 
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64 
 12  monthOfRegistration  371528 non-null  int64 
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-nu

Some observations:

- There are 20 total columns in this dataset.
- Some of the columns contain null values, but none have more than 20%.
- The columns use [camelcase](https://en.wikipedia.org/wiki/Camel_case) instead of Python's preferred [snakecase](https://en.wikipedia.org/wiki/Snake_case).
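
The "none have more than 20%" observation can be checked from the non-null counts reported by .info() and the total of 371,528 rows. The following is only a worked calculation with those counts copied in by hand; the variable names are illustrative and not part of the original notebook.

```python
# Counts copied from the .info() output; total_rows is the RangeIndex size
total_rows = 371528
non_null = {
    "vehicleType": 333659,
    "gearbox": 351319,
    "model": 351044,
    "fuelType": 338142,
    "notRepairedDamage": 299468,
}

# Percentage of missing values per column
null_share = {col: 100 * (1 - n / total_rows) for col, n in non_null.items()}

for col, share in null_share.items():
    print(f"{col}: {share:.1f}% null")

worst = max(null_share, key=null_share.get)
print("Column with most nulls:", worst)
```

On the DataFrame itself, an expression such as `autos.isnull().mean()` should give the same shares as fractions.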

Our first step is to clean the column names by converting them into snakecase.

## Cleaning Column Names

Rather than manually typing out each column name to create a list, we can take a shortcut: use the DataFrame.columns attribute to print an array of the existing column names, then edit that array so the names are in snakecase. A few columns are given more descriptive names rather than a direct conversion:

- yearOfRegistration to registration_year
- monthOfRegistration to registration_month
- notRepairedDamage to unrepaired_damage
- dateCreated to ad_created
- powerPS to car_horsepower
- nrOfPictures to number_of_pictures

Then we reassign these columns back to the DataFrame.columns attribute and then use the .head() method to check the current state of the DataFrame.
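
As an alternative to editing the names by hand, the conversion could be scripted. The sketch below is an assumption about how one might do it, not the approach used in this notebook: `to_snake_case`, `clean_column` and `CUSTOM_NAMES` are hypothetical names, and the overrides simply mirror the names chosen in the next cell.

```python
import re

# Hand-picked names that a plain camelcase-to-snakecase conversion would not produce
CUSTOM_NAMES = {
    "abtest": "ab_test",
    "yearOfRegistration": "registration_year",
    "gearbox": "gear_box",
    "powerPS": "car_horsepower",
    "monthOfRegistration": "registration_month",
    "notRepairedDamage": "unrepaired_damage",
    "dateCreated": "ad_created",
    "nrOfPictures": "number_of_pictures",
}

def to_snake_case(name):
    # Insert "_" wherever a lowercase letter or digit is followed by an uppercase letter
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

def clean_column(name):
    # Prefer a hand-picked name, otherwise fall back to the generic conversion
    return CUSTOM_NAMES.get(name, to_snake_case(name))

original = ["dateCrawled", "name", "seller", "offerType", "price", "abtest",
            "vehicleType", "yearOfRegistration", "gearbox", "powerPS", "model",
            "kilometer", "monthOfRegistration", "fuelType", "brand",
            "notRepairedDamage", "dateCreated", "nrOfPictures", "postalCode",
            "lastSeen"]
cleaned = [clean_column(col) for col in original]
print(cleaned)
```

With a DataFrame loaded, the result could be assigned in the same way as the hand-written list, for example `autos.columns = cleaned`.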

In [3]:
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gear_box', 'car_horsepower', 'model',
       'kilometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'number_of_pictures', 'postal_code',
       'last_seen']
autos.head()

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gear_box,car_horsepower,model,kilometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,number_of_pictures,postal_code,last_seen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


As we can see, the columns are now in snakecase. Next we will explore the data itself.

## Initial Exploration and Cleaning

We will be using the .describe() method to display some basic descriptive statistics about each of the columns. Although we can pass the argument *include='all'*, this would show both numerical and non-numerical columns together, each with their own set of descriptive statistics. For improved clarity we will analyse them separately starting with the numerical columns:
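
To illustrate the difference between the two kinds of summary, here is a minimal sketch on a three-row toy DataFrame (the values are taken from the first rows of the dataset; the name `toy` is illustrative). It uses `exclude=["number"]` to select the non-numerical columns, which is one of several ways to make that selection.

```python
import pandas as pd

toy = pd.DataFrame({
    "price": [480, 18300, 9800],
    "brand": ["volkswagen", "audi", "jeep"],
})

# Default: numerical columns only (count, mean, std, min, quartiles, max)
numeric_summary = toy.describe()

# Non-numerical columns only (count, unique, top, freq)
text_summary = toy.describe(exclude=["number"])

print(numeric_summary)
print(text_summary)
```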

In [4]:
autos.describe()

Unnamed: 0,price,registration_year,car_horsepower,kilometer,registration_month,number_of_pictures,postal_code
count,371528.0,371528.0,371528.0,371528.0,371528.0,371528.0,371528.0
mean,17295.14,2004.577997,115.549477,125618.688228,5.734445,0.0,50820.66764
std,3587954.0,92.866598,192.139578,40112.337051,3.712412,0.0,25799.08247
min,0.0,1000.0,0.0,5000.0,0.0,0.0,1067.0
25%,1150.0,1999.0,70.0,125000.0,3.0,0.0,30459.0
50%,2950.0,2003.0,105.0,150000.0,6.0,0.0,49610.0
75%,7200.0,2008.0,150.0,150000.0,9.0,0.0,71546.0
max,2147484000.0,9999.0,20000.0,150000.0,12.0,0.0,99998.0


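We can already see some clear anomalies in the data. The maximum price is 2,147,483,647, which is the largest value a signed 32-bit integer can hold and therefore looks like a placeholder or overflow rather than a real price. There are also registration years of 1,000 and 9,999. This is not surprising, as such values could be explained by errors in the scraping process or in the listings themselves.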

Next we will explore the non-numerical columns:

In [5]:
autos.describe(include=['O'])

Unnamed: 0,date_crawled,name,seller,offer_type,ab_test,vehicle_type,gear_box,model,fuel_type,brand,unrepaired_damage,ad_created,last_seen
count,371528,371528,371528,371528,371528,333659,351319,351044,338142,371528,299468,371528,371528
unique,280500,233531,2,2,2,8,2,251,7,40,2,114,182806
top,2016-03-24 14:49:47,Ford_Fiesta,privat,Angebot,test,limousine,manuell,golf,benzin,volkswagen,nein,2016-04-03 00:00:00,2016-04-07 06:45:59
freq,7,657,371525,371516,192585,95894,274214,30070,223857,79640,263182,14450,17


We can see that some of the columns contain information relating to the scraping program used to collect this data, such as:

- date_crawled
- offer_type

These are the columns we can safely ignore for the purposes of our analysis.

For clarity we will rename the "kilometer" column to "odometer_km" as this column relates to the mileage of the car.

In [6]:
autos.rename({"kilometer": "odometer_km"}, axis=1, inplace=True)

Next we will explore the most important columns, odometer and price, in more detail.

## Exploring Odometer and Price Columns

As we saw in the previous section, the price column clearly contains outliers. We will also inspect the odometer column to see whether it contains outliers as well.

In [7]:
print(f"There are {autos['price'].unique().shape[0]:,} price entries.")
print(f"There are {autos['odometer_km'].unique().shape[0]:,} odometer entries.")

autos[['price', 'odometer_km']].describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))

There are 5,597 price entries.
There are 13 odometer entries.


Unnamed: 0,price,odometer_km
count,371528.0,371528.0
mean,17295.141865,125618.688228
std,3587953.74441,40112.337051
min,0.0,5000.0
25%,1150.0,125000.0
50%,2950.0,150000.0
75%,7200.0,150000.0
max,2147483647.0,150000.0


Although the minimum odometer value of 5,000 could be considered an outlier, it is not an anomaly in the context of our analysis, since it is entirely possible for a used car to have a mileage of 5,000 km. So we will not remove any outliers from the odometer column.

With the price column, we will remove outliers using the interquartile range rule: any value that lies more than 1.5 times the interquartile range below $Q_{1}$ or above $Q_{3}$ is removed. The interquartile range is defined as $ Q_{3} - Q_{1} $, where $Q_{3}$ is the 75th percentile and $Q_{1}$ is the 25th percentile.

In the case of the price column, $ Q_{3} = 7200 $ and $ Q_{1} = 1150 $, so $ Q_{3} - Q_{1} = 6050 $. The only range of price values we will consider is:<br><br>

<center>$[1150 - 1.5 \times 6050,\ 7200 + 1.5 \times 6050] = [-7925,\ 16275]$</center>
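
As a rough sketch of the rule above, the bounds can be computed from the quartiles rather than typed in by hand. The helper name `iqr_bounds` and the small sample series are illustrative assumptions, not part of the original notebook; the sample is chosen so that its quartiles match those of the price column.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) limits lying k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

# Illustrative sample whose 25th and 75th percentiles are 1150 and 7200
sample_prices = pd.Series([0, 1150, 2950, 7200, 100000])
lower, upper = iqr_bounds(sample_prices)
print(lower, upper)  # -7925.0 16275.0

# Keep only the values inside the bounds (inclusive on both ends)
kept = sample_prices[sample_prices.between(lower, upper)]
print(kept.tolist())  # [0, 1150, 2950, 7200]
```

On the real data, the equivalent call would presumably be `iqr_bounds(autos["price"])`, which avoids copying the quartiles from the .describe() output.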

In [8]:
autos_cleaned = autos[autos["price"].between(-7925,16275)]
autos_cleaned[['price', 'odometer_km']].describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))

Unnamed: 0,price,odometer_km
count,343420.0,343420.0
mean,3980.284025,129429.619708
std,3851.319988,36950.758687
min,0.0,5000.0
25%,1000.0,125000.0
50%,2550.0,150000.0
75%,5900.0,150000.0
max,16270.0,150000.0


This process has removed 28,108 rows, leaving us with 343,420 rows. Because the lower bound is negative, only high prices were removed; listings with a price of 0 are still present.

## Exploring the Date columns

Next we will analyse the date columns in our data. Currently the "date_crawled", "last_seen" and "ad_created" columns are all stored as strings by pandas. We need to convert these strings into a numerical representation so that we can examine them quantitatively using the .describe() method.

Let's inspect the formatting of the strings in those columns:

In [9]:
autos_cleaned[['date_crawled','ad_created','last_seen']][0:5]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-24 11:52:17,2016-03-24 00:00:00,2016-04-07 03:16:57
2,2016-03-14 12:52:21,2016-03-14 00:00:00,2016-04-05 12:47:46
3,2016-03-17 16:54:04,2016-03-17 00:00:00,2016-03-17 17:40:17
4,2016-03-31 17:25:20,2016-03-31 00:00:00,2016-04-06 10:17:21
5,2016-04-04 17:36:23,2016-04-04 00:00:00,2016-04-06 19:17:07


As we can see, the first 10 characters represent the day in yyyy-mm-dd format. Fortunately we can extract the first 10 characters of each string using Series.str[:10]. We will analyse the distribution of values in each of the columns, starting with the "date_crawled" column:

In [10]:
autos_cleaned["date_crawled"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.025700
2016-03-06    0.014495
2016-03-07    0.035743
2016-03-08    0.033562
2016-03-09    0.034232
2016-03-10    0.032797
2016-03-11    0.032881
2016-03-12    0.036393
2016-03-13    0.015675
2016-03-14    0.036241
2016-03-15    0.033458
2016-03-16    0.030374
2016-03-17    0.031970
2016-03-18    0.013141
2016-03-19    0.035406
2016-03-20    0.036314
2016-03-21    0.035880
2016-03-22    0.032730
2016-03-23    0.031981
2016-03-24    0.029774
2016-03-25    0.032933
2016-03-26    0.032016
2016-03-27    0.030147
2016-03-28    0.035088
2016-03-29    0.034104
2016-03-30    0.033420
2016-03-31    0.031742
2016-04-01    0.033775
2016-04-02    0.034762
2016-04-03    0.038478
2016-04-04    0.037345
2016-04-05    0.012707
2016-04-06    0.003133
2016-04-07    0.001602
Name: date_crawled, dtype: float64

The dates range from 5 March 2016 to 7 April 2016, which suggests that this is the period in which the web crawler was active.

Next we analyse the "ad_created" column:

In [11]:
autos_cleaned["ad_created"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2014-03-10    0.000003
2015-03-20    0.000003
2015-06-18    0.000003
2015-08-07    0.000003
2015-08-10    0.000003
                ...   
2016-04-03    0.038588
2016-04-04    0.037499
2016-04-05    0.011531
2016-04-06    0.003122
2016-04-07    0.001543
Name: ad_created, Length: 109, dtype: float64

There is a much greater range of dates compared to "date_crawled". This is not surprising, since some ads would have been placed well before the scraping took place.

Next we analyse the "last_seen" column:

In [12]:
autos_cleaned["last_seen"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

2016-03-05    0.001351
2016-03-06    0.004359
2016-03-07    0.005486
2016-03-08    0.008442
2016-03-09    0.010430
2016-03-10    0.012038
2016-03-11    0.013511
2016-03-12    0.024431
2016-03-13    0.008855
2016-03-14    0.012699
2016-03-15    0.016959
2016-03-16    0.016933
2016-03-17    0.029704
2016-03-18    0.007160
2016-03-19    0.016968
2016-03-20    0.020538
2016-03-21    0.020858
2016-03-22    0.021111
2016-03-23    0.018624
2016-03-24    0.019932
2016-03-25    0.019775
2016-03-26    0.016671
2016-03-27    0.017486
2016-03-28    0.023109
2016-03-29    0.023898
2016-03-30    0.024504
2016-03-31    0.024693
2016-04-01    0.024565
2016-04-02    0.025523
2016-04-03    0.025919
2016-04-04    0.026093
2016-04-05    0.122512
2016-04-06    0.210148
2016-04-07    0.124719
Name: last_seen, dtype: float64

The range of dates is exactly the same as in the "date_crawled" column, because both columns relate to when the web crawler was active. The main difference is that the distribution of dates in this column is skewed towards the end of the crawling period, whereas "date_crawled" is distributed more uniformly and only drops off as it nears the end of the crawling period.

From these two columns we can infer that the web crawler both gathered new listings and updated existing entries in its database. Since a steady number of entries were crawled between 5 March 2016 and 4 April 2016, the crawler appears to have scraped the site periodically over this period, possibly to avoid being blocked by eBay. The drop-off in "date_crawled" corresponds closely with the drop-off in "ad_created", suggesting that by then the crawler had collected all the available ads. Rather than scraping new data, the crawler was then mostly updating existing entries, hence the large spike in "last_seen" during the last three days.

Now we will convert these dates from the string format "yyyy-mm-dd" to a uniform numeric format "yyyymmdd", by first converting each string into a datetime object and then using the .dt.strftime("%Y%m%d") method to produce the desired format.

In [13]:
# Silence SettingWithCopyWarning: autos_cleaned is a filtered slice of autos
pd.options.mode.chained_assignment = None

# Keep only the date part, then store each date as a yyyymmdd integer
for col in ["date_crawled", "ad_created", "last_seen"]:
    dates = pd.to_datetime(autos_cleaned[col].str[:10])
    autos_cleaned[col] = dates.dt.strftime("%Y%m%d").astype(int)

print(autos_cleaned[['date_crawled','ad_created','last_seen']].info())
autos_cleaned[['date_crawled','ad_created','last_seen']][0:5]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 343420 entries, 0 to 371526
Data columns (total 3 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   date_crawled  343420 non-null  int32
 1   ad_created    343420 non-null  int32
 2   last_seen     343420 non-null  int32
dtypes: int32(3)
memory usage: 6.6 MB
None


Unnamed: 0,date_crawled,ad_created,last_seen
0,20160324,20160324,20160407
2,20160314,20160314,20160405
3,20160317,20160317,20160317
4,20160331,20160331,20160406
5,20160404,20160404,20160406


We have successfully converted the date data into integers of the form "yyyymmdd", and all three columns have been cast to an integer type.
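
One design trade-off is worth noting: yyyymmdd integers sort and summarise correctly, but they do not support date arithmetic. An alternative, which is not what this notebook does, would be to keep the columns as pandas datetimes. The sketch below assumes a small hand-built sample (the first three rows shown earlier) and an illustrative variable name, `days_listed`.

```python
import pandas as pd

sample = pd.DataFrame({
    "ad_created": ["2016-03-24 00:00:00", "2016-03-14 00:00:00", "2016-03-17 00:00:00"],
    "last_seen": ["2016-04-07 03:16:57", "2016-04-05 12:47:46", "2016-03-17 17:40:17"],
})

# Parse the strings and drop the time of day, keeping a datetime64 dtype
for col in ["ad_created", "last_seen"]:
    sample[col] = pd.to_datetime(sample[col]).dt.normalize()

# With real datetimes, subtraction yields the number of days between the two dates
days_listed = (sample["last_seen"] - sample["ad_created"]).dt.days
print(days_listed.tolist())  # [14, 22, 0]
```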

We will now explore the registration year:

In [14]:
autos_cleaned["registration_year"].describe()

count    343420.000000
mean       2004.034669
std          89.309235
min        1000.000000
25%        1999.000000
50%        2003.000000
75%        2007.000000
max        9999.000000
Name: registration_year, dtype: float64

As we can see, there are some anomalies in the data: a registration year of 1,000, which is long before the first cars were invented, and a registration year of 9,999, which is in the future with respect to when the data was scraped. Next we will deal with the incorrect registration year data.

## Dealing with Incorrect Registration Year Data

We know that the data was crawled in 2016, so any registration year beyond 2016 is incorrect. We are also very confident that registration years before 1900 are incorrect. We will analyse these values to get a better understanding of the affected rows:

In [15]:
autos_cleaned.loc[(autos_cleaned["registration_year"] > 2016) | (autos_cleaned["registration_year"] < 1900), ["registration_year", "price"]].describe()

Unnamed: 0,registration_year,price
count,14423.0,14423.0
mean,2044.043472,2966.70547
std,432.611633,3071.314765
min,1000.0,0.0
25%,2017.0,950.0
50%,2017.0,1850.0
75%,2018.0,3900.0
max,9999.0,16200.0


Roughly 4.2% of the rows have a registration year that is before 1900 or after 2016. The price data for those rows seems valid, so in the interest of keeping as many rows as possible for our analysis, we will preserve them. Rather than deleting those rows outright, we will replace the registration year with a NaN value as an indicator that it is incorrect.

Because we are not deleting any rows, we can afford to restrict the lowest acceptable value more strictly. According to the [Wikipedia](https://en.wikipedia.org/wiki/Vehicle_registration_plates_of_Germany) article on vehicle registration plates of Germany, the current system was introduced in 1956. So we will null all registration years before 1956 and after 2016.
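
The same idea can be expressed compactly with `Series.where`, which keeps the values that satisfy a condition and replaces the rest with NaN. This is a minimal sketch on made-up sample years, shown as an alternative formulation rather than the code used in the next cell.

```python
import pandas as pd

# Illustrative values only, including the two extremes seen in the data
years = pd.Series([1000, 1993, 2011, 2017, 9999])

# where() keeps values for which the condition holds and sets the rest to NaN
valid_years = years.where(years.between(1956, 2016))
print(valid_years.tolist())  # [nan, 1993.0, 2011.0, nan, nan]
```

Note that introducing NaN turns an integer column into a float column, which is why the years appear as 1993.0 and 2011.0.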

In [16]:
pd.options.mode.chained_assignment = None
autos_cleaned.loc[(autos_cleaned["registration_year"] > 2016) | (autos_cleaned["registration_year"] < 1956), "registration_year"] = np.nan
counts = autos_cleaned["registration_year"].value_counts()
percent = counts / counts.sum()
fmt = '{:.3%}'.format
pd.DataFrame({'Counts': counts, 'Percentage': percent.map(fmt)})

Unnamed: 0,Counts,Percentage
2000.0,24460,7.440%
1999.0,22702,6.905%
2005.0,21934,6.672%
2001.0,20062,6.102%
2003.0,19680,5.986%
...,...,...
1961.0,38,0.012%
1959.0,30,0.009%
1956.0,24,0.007%
1958.0,20,0.006%


A large proportion of cars were registered in the late 1990s and early 2000s. The least common registration years are those close to 1956, when Germany introduced the current registration system.

## Exploring Price by Brand

Next we will use the concept of "aggregation" (applying a statistical operation to a group of data) to understand the average price of cars by brand. First we will explore the brand distribution in our dataset:

In [17]:
counts = autos_cleaned["brand"].value_counts()
percent = counts / counts.sum()
fmt = '{:.3%}'.format
pd.DataFrame({'Counts': counts, 'Percentage': percent.map(fmt)})

Unnamed: 0,Counts,Percentage
volkswagen,74744,21.765%
opel,39622,11.537%
bmw,34956,10.179%
mercedes_benz,30165,8.784%
audi,27648,8.051%
ford,24739,7.204%
renault,17844,5.196%
peugeot,10908,3.176%
fiat,9606,2.797%
seat,6771,1.972%


We notice that a small minority of brands make up the majority of our dataset. The top 10 brands (out of 39) account for a little over 80% of the dataset. Interestingly, this loosely echoes the Pareto principle, "with 80% of the results (or effect) coming from 20% of the cause". In light of this, we will consider only the top 10 brands in our analysis.
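
The "little over 80%" figure is simply the sum of the ten percentages in the table above. As a quick check, with the values copied by hand from that table (the dictionary name is illustrative):

```python
# Percentages copied from the brand distribution table above
top_10_share = {
    "volkswagen": 21.765,
    "opel": 11.537,
    "bmw": 10.179,
    "mercedes_benz": 8.784,
    "audi": 8.051,
    "ford": 7.204,
    "renault": 5.196,
    "peugeot": 3.176,
    "fiat": 2.797,
    "seat": 1.972,
}

total = sum(top_10_share.values())
print(f"Top 10 brands: {total:.3f}% of listings")
```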

The first step in our analysis is to find the average price for each of the top 10 brands. We will:

- First extract the top 10 brands into a list
- Loop through that list and find the average price for each brand
- Save the brand as the key and the average price as the value in a dictionary
- Also create a list of lists, so that we can display the results in a table
- Include in each row of that table the average price, the number of cars and the proportion
- Sort the result by the average price

In [18]:
top_10 = autos_cleaned["brand"].value_counts().head(10).to_frame(name="counts")
unique_brands = top_10.index.tolist()
data = []
brand_average_price = {}

for brand in unique_brands:
    selected_rows = autos_cleaned[autos_cleaned["brand"] == brand]
    prices = selected_rows["price"]
    avg_price = round(prices.mean(),2)
    data.append([brand, avg_price, selected_rows.shape[0]])
    brand_average_price[brand] = avg_price
    
table = pd.DataFrame(data, columns=["Brand", "Average Price", "Number of Cars"]).sort_values(by=['Average Price'], ascending=False).set_index("Brand")
table["Proportion"] = table["Number of Cars"] / table["Number of Cars"].sum()
table['Proportion'] = pd.Series(["{0:.2f}%".format(val * 100) for val in table['Proportion']], index = table.index)
table

Brand,Average Price,Number of Cars,Proportion
bmw,5494.14,34956,12.62%
audi,5464.63,27648,9.98%
mercedes_benz,5144.72,30165,10.89%
volkswagen,3966.32,74744,26.98%
seat,3714.26,6771,2.44%
peugeot,2991.18,10908,3.94%
ford,2911.95,24739,8.93%
fiat,2635.64,9606,3.47%
opel,2609.99,39622,14.30%
renault,2194.67,17844,6.44%


As we can see there is a distinct price gap between the brands, with:

- BMW, Audi and Mercedes-Benz occupying the high price range
- Volkswagen and Seat in the mid-price range
- Peugeot, Ford, Fiat, Opel and Renault occupying the low price range

Next we want to combine this with the average mileage to see if there are any visible links with the mean price.

## Storing Aggregate Data in a DataFrame

For easier analysis, we will combine the separate series into one DataFrame. First we create a dictionary containing the brands as the keys and the average mileage as the values. We can easily convert a dictionary into a series using the pd.Series() constructor. Lastly, we can combine series with a shared index by first constructing a DataFrame from one of the series and then assigning the other series as a new column in this DataFrame.

In [19]:
bap_series = pd.Series(brand_average_price)
brand_mileage = {}

for brand in unique_brands:
    selected_rows = autos_cleaned[autos_cleaned["brand"] == brand]
    mileage = selected_rows["odometer_km"]
    avg_mileage = round(mileage.mean(),2)
    brand_mileage[brand] = avg_mileage

bm_series = pd.Series(brand_mileage)
df = pd.DataFrame(bap_series, columns=["mean_price"])
df["mean_mileage"] = bm_series
df.sort_values(by=['mean_price'], ascending=False)

Unnamed: 0,mean_price,mean_mileage
bmw,5494.14,138894.75
audi,5464.63,138590.49
mercedes_benz,5144.72,137317.59
volkswagen,3966.32,132625.7
seat,3714.26,124354.6
peugeot,2991.18,125842.96
ford,2911.95,126094.83
fiat,2635.64,117294.4
opel,2609.99,129950.15
renault,2194.67,128619.42


Interestingly, the average price appears to be positively correlated with the average mileage across these brands. We would normally expect cars with higher mileage to sell for less, which is contrary to the overall pattern here. This suggests that the brand plays a much more important role in the price than the mileage of the car.
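
Both aggregate tables above were built with explicit loops. pandas also offers `groupby`, which can compute the same kind of per-brand means in one step. The sketch below runs on a small made-up DataFrame (`toy` and its values are illustrative, not taken from the dataset) and is offered as an alternative, not as the method used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "opel", "opel", "opel"],
    "price": [6000, 5000, 2000, 3000, 2500],
    "odometer_km": [150000, 125000, 150000, 100000, 125000],
})

# One row per brand, with the mean of each column as a named output column
summary = (
    toy.groupby("brand")
       .agg(mean_price=("price", "mean"), mean_mileage=("odometer_km", "mean"))
       .round(2)
       .sort_values("mean_price", ascending=False)
)
print(summary)
```

On the real data, one would presumably first restrict the rows to the top 10 brands, for example with `autos_cleaned[autos_cleaned["brand"].isin(unique_brands)]`, before grouping.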

# Conclusion

Based on the data scraped between 5 March 2016 and 7 April 2016, the average price of a car is about 3,980 (presumably in euros, as the listings come from a German site) after excluding the price outliers. The brand of the car appears to play a bigger role in the average price than the average mileage does, with brands such as BMW, Audi and Mercedes-Benz costing the most and Renault, Opel and Fiat costing the least.