# Exploring eBay Car Sales Data

---

## 1. Introduction

In this project, we'll work with a dataset of used cars from [eBay Kleinanzeigen](https://www.ebay-kleinanzeigen.de/), a classifieds section of the German eBay website. The dataset was originally scraped and uploaded to Kaggle. The version we are working with is a sample of 50,000 data points prepared by [Dataquest](https://www.dataquest.io/), which also modified the data to simulate a less-cleaned version.

The objective is to clean the data and analyze the included used car listings to answer the following:
- **What are the most popular car brands and why?**
- **How is the selling price affected by the condition of the used car?**

---

## 2. Open and Read the Data

The data dictionary provided is as follows:

| Column | Description | 
| - | - |
| `dateCrawled` | When this ad was first crawled. All field-values are taken from this date. |
| `name` | Name of the car |
| `seller` | Whether the seller is private or a dealer |
| `offerType` | The type of listing | 
| `price` | The price on the ad to sell the car |
| `abtest` | Whether the listing is included in an A/B test |
| `vehicleType` | The vehicle type |
| `yearOfRegistration` | The year in which the car was first registered |
| `gearbox` | The transmission type |
| `powerPS` | The power of the car in PS |
| `model` | The car model name |
| `odometer` | How many kilometers the car has driven |
| `monthOfRegistration` | The month in which the car was first registered |
| `fuelType` | What type of fuel the car uses |
| `brand` | The brand of the car |
| `notRepairedDamage` | If the car has a damage which is not yet repaired |
| `dateCreated` | The date on which the eBay listing was created |
| `nrOfPictures` | The number of pictures in the ad |
| `postalCode` | The postal code for the location of the vehicle |
| `lastSeen` | When the crawler saw this ad last online |

Let's start by reading the dataset into pandas.

In [1]:
import numpy as np
import pandas as pd

autos = pd.read_csv('autos.csv', encoding = 'Windows-1252')
autos

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,2016-03-27 14:38:19,Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon,privat,Angebot,"$24,900",control,limousine,2011,automatik,239,q5,"100,000km",1,diesel,audi,nein,2016-03-27 00:00:00,0,82131,2016-04-01 13:47:40
49996,2016-03-28 10:50:25,Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...,privat,Angebot,"$1,980",control,cabrio,1996,manuell,75,astra,"150,000km",5,benzin,opel,nein,2016-03-28 00:00:00,0,44807,2016-04-02 14:18:02
49997,2016-04-02 14:44:48,Fiat_500_C_1.2_Dualogic_Lounge,privat,Angebot,"$13,200",test,cabrio,2014,automatik,69,500,"5,000km",11,benzin,fiat,nein,2016-04-02 00:00:00,0,73430,2016-04-04 11:47:27
49998,2016-03-08 19:25:42,Audi_A3_2.0_TDI_Sportback_Ambition,privat,Angebot,"$22,900",control,kombi,2013,manuell,150,a3,"40,000km",11,diesel,audi,nein,2016-03-08 00:00:00,0,35683,2016-04-05 16:45:07


---

## 3. Inspect and Clean the Data

We will now inspect the dataframe to identify areas for cleaning.

**a. Rename the columns**

In [2]:
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

It is observed that:
- The dataset contains 20 columns, most of which are stored as string objects.
- Five columns have null values, but no column has more than 20% of its values missing.
- The column names use camelCase instead of Python's preferred snake_case.

We will convert the column names from camelcase to snakecase and reword some column names to be more descriptive.

In [3]:
print('Original column names: \n', autos.columns)

autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']

print('New column names: \n', autos.columns)

Original column names: 
 Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
New column names: 
 Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
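The mechanical part of this renaming can be automated. The following is a minimal sketch, assuming a hypothetical helper `camel_to_snake` (not part of the notebook) that inserts an underscore wherever a lowercase letter or digit is followed by an uppercase letter. It does not reproduce the manual rewordings above, such as `yearOfRegistration` to `registration_year` or `abtest` to `ab_test`.

```python
import re

def camel_to_snake(name):
    """Hypothetical helper: convert a camelCase name to snake_case."""
    # Insert '_' at each boundary between a lowercase letter or digit
    # and an uppercase letter, then lowercase the whole name
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name).lower()

original = ['dateCrawled', 'offerType', 'powerPS', 'nrOfPictures', 'abtest']
converted = [camel_to_snake(c) for c in original]
print(converted)
```

Names that need rewording rather than reformatting still have to be handled by hand, which is why the notebook assigns the full list explicitly.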


**b. Drop columns which largely consist of the same values**

Text columns which largely consist of the same value can be dropped, as they carry no useful information for analysis.

In [4]:
autos.describe(include = 'all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-23 18:39:34,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


We will drop the `seller` and `offer_type` columns, as 49999 of the 50000 rows in each contain the same value.

In [5]:
autos.drop(['seller', 'offer_type'], axis = 1, inplace = True)
autos.describe(include = 'all')

Unnamed: 0,date_crawled,name,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-23 18:39:34,Ford_Fiesta,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,
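Instead of reading the `freq` row by eye, near-constant columns can be flagged programmatically. This is a sketch on made-up toy data; the helper name `near_constant_columns` and the 0.99 threshold are assumptions for illustration, not values taken from the notebook.

```python
import pandas as pd

def near_constant_columns(df, threshold=0.99):
    """Hypothetical helper: list columns whose most common value
    accounts for at least `threshold` of all rows."""
    flagged = []
    for col in df.columns:
        # Share of rows taken by the single most frequent value (NaN counted too)
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Toy data: 'seller' is almost constant, 'brand' is not
toy = pd.DataFrame({
    'seller': ['privat'] * 199 + ['gewerblich'],
    'brand': ['bmw', 'audi'] * 100,
})
flagged = near_constant_columns(toy)
print(flagged)
```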


**c. Clean and convert numeric data stored as text**

Next, the data in the `price` and `odometer` columns can be cleaned and converted to numeric values for further computation.

In [6]:
# Remove non-numeric characters and convert to numeric data type
# regex = False treats '$' as a literal character rather than the end-of-string anchor
autos['price'] = autos['price'].str.replace('$', '', regex = False).str.replace(',', '', regex = False)
autos['odometer'] = autos['odometer'].str.replace('km', '', regex = False).str.replace(',', '', regex = False)

autos['price'] = autos['price'].astype(float)
autos['odometer'] = autos['odometer'].astype(float)

# Rename odometer column to indicate unit of measurement
autos.rename(columns = {'odometer': 'odometer_km'}, inplace = True)
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 18 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   date_crawled        50000 non-null  object 
 1   name                50000 non-null  object 
 2   price               50000 non-null  float64
 3   ab_test             50000 non-null  object 
 4   vehicle_type        44905 non-null  object 
 5   registration_year   50000 non-null  int64  
 6   gearbox             47320 non-null  object 
 7   power_ps            50000 non-null  int64  
 8   model               47242 non-null  object 
 9   odometer_km         50000 non-null  float64
 10  registration_month  50000 non-null  int64  
 11  fuel_type           45518 non-null  object 
 12  brand               50000 non-null  object 
 13  unrepaired_damage   40171 non-null  object 
 14  ad_created          50000 non-null  object 
 15  nr_of_pictures      50000 non-null  int64  
 16  post
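The same cleaning can be written with one regular expression per column and `pd.to_numeric`. The sketch below uses a few strings in the format seen in the dataset; the variable names are illustrative only.

```python
import pandas as pd

# Sample strings in the same format as the raw price and odometer columns
raw_price = pd.Series(['$5,000', '$8,500', '$0'])
raw_odometer = pd.Series(['150,000km', '70,000km', '5,000km'])

# Inside a character class, '$' is a literal dollar sign
price = pd.to_numeric(raw_price.str.replace(r'[$,]', '', regex=True))
# Remove the 'km' suffix and the thousands separator in one pass
odometer_km = pd.to_numeric(raw_odometer.str.replace(r'km|,', '', regex=True))

print(price.tolist())
print(odometer_km.tolist())
```

`pd.to_numeric` raises an error on any value it cannot parse, which helps surface unexpected formats early.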

**d. Remove unrealistic numeric outliers**

Let's investigate the `price` column to look for any values that seem unrealistically high or low.

In [7]:
# Investigate price column statistics
print('Summary statistics of price column')
print(autos['price'].describe())
print('\n')

print('20 highest prices')
print('Price:      Frequency')
print(autos['price'].value_counts().sort_index(ascending = False).head(20))
print('\n')

print('20 lowest prices')
print('Price:  Frequency')
print(autos['price'].value_counts().sort_index(ascending = True).head(20))

Summary statistics of price column
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64


20 highest prices
Price:      Frequency
99999999.0    1
27322222.0    1
12345678.0    3
11111111.0    2
10000000.0    1
3890000.0     1
1300000.0     1
1234566.0     1
999999.0      2
999990.0      1
350000.0      1
345000.0      1
299000.0      1
295000.0      1
265000.0      1
259000.0      1
250000.0      1
220000.0      1
198000.0      1
197000.0      1
Name: price, dtype: int64


20 lowest prices
Price:  Frequency
0.0     1421
1.0      156
2.0        3
3.0        1
5.0        2
8.0        1
9.0        1
10.0       7
11.0       2
12.0       3
13.0       2
14.0       1
15.0       2
17.0       3
18.0       1
20.0       4
25.0       5
29.0       1
30.0       7
35.0       1
Name: price, dtype: int64


We will set the realistic price range as follows:
- a lower bound of 1 dollar, as the zero values could be placeholders for missing data.
- an upper bound of 350,000 dollars, as prices jump sharply above this value (the next listed price is 999,990) and include obvious placeholders such as 12,345,678 and 11,111,111.

Let's replace the outliers outside this range with null values and update the column statistics.

In [8]:
# Replace outliers with null values
autos.loc[autos['price'] == 0, 'price'] = np.nan
autos.loc[autos['price'] > 350000, 'price'] = np.nan

# Updated price column statistics
print('Summary statistics of price column')
print(autos['price'].describe())
print('\n')

print('20 highest prices')
print('Price:      Frequency')
print(autos['price'].value_counts().sort_index(ascending = False).head(20))
print('\n')

print('20 lowest prices')
print('Price:  Frequency')
print(autos['price'].value_counts().sort_index(ascending = True).head(20))

Summary statistics of price column
count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64


20 highest prices
Price:      Frequency
350000.0    1
345000.0    1
299000.0    1
295000.0    1
265000.0    1
259000.0    1
250000.0    1
220000.0    1
198000.0    1
197000.0    1
194000.0    1
190000.0    1
180000.0    1
175000.0    1
169999.0    1
169000.0    1
163991.0    1
163500.0    1
155000.0    1
151990.0    1
Name: price, dtype: int64


20 lowest prices
Price:  Frequency
1.0     156
2.0       3
3.0       1
5.0       2
8.0       1
9.0       1
10.0      7
11.0      2
12.0      3
13.0      2
14.0      1
15.0      2
17.0      3
18.0      1
20.0      4
25.0      5
29.0      1
30.0      7
35.0      1
40.0      6
Name: price, dtype: int64
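Both bounds can also be applied in a single step with `Series.between` and `Series.where`. The sketch below uses a handful of made-up prices; `between` is inclusive on both ends by default, so 1 and 350,000 are kept.

```python
import pandas as pd

# Made-up prices covering both kinds of outlier
prices = pd.Series([0.0, 1.0, 2950.0, 350000.0, 99999999.0])

# Keep values inside [1, 350000]; everything else becomes NaN
cleaned = prices.where(prices.between(1, 350_000))
print(cleaned)
```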


**e. Inspect datetime columns stored as text values**

The `date_crawled`, `ad_created` and `last_seen` columns represent datetime values, but are stored as string objects. 

Let's first understand how the values in these three string columns are formatted. 

In [9]:
autos[['date_crawled','ad_created','last_seen']][0:3]

Unnamed: 0,date_crawled,ad_created,last_seen
0,2016-03-26 17:47:46,2016-03-26 00:00:00,2016-04-06 06:45:54
1,2016-04-04 13:38:56,2016-04-04 00:00:00,2016-04-06 14:45:08
2,2016-03-26 18:57:24,2016-03-26 00:00:00,2016-04-06 20:15:37


To understand the date range, we will extract the date portion (the first 10 characters) of each value and compute the relative frequency of each date.

In [10]:
# Value counts of date crawled
date_crawled_value_counts = 100 * autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = True)
print('Date Crawled: Relative Frequency (%)')
print(date_crawled_value_counts.head(10))
print('\n')

# Value counts of ad_created
ad_created_value_counts = 100 * autos['ad_created'].str[:10].value_counts(normalize = True, dropna = True)
print('Ad Created (Date): Relative Frequency (%)')
print(ad_created_value_counts.head(10))
print('\n')

# Value counts of last_seen
last_seen_value_counts = 100 * autos['last_seen'].str[:10].value_counts(normalize = True, dropna = True)
print('Last Seen (Date): Relative Frequency (%)')
print(last_seen_value_counts.head(10))

Date Crawled: Relative Frequency (%)
2016-04-03    3.868
2016-03-20    3.782
2016-03-21    3.752
2016-03-12    3.678
2016-03-14    3.662
2016-04-04    3.652
2016-03-07    3.596
2016-04-02    3.540
2016-03-19    3.490
2016-03-28    3.484
Name: date_crawled, dtype: float64


Ad Created (Date): Relative Frequency (%)
2016-04-03    3.892
2016-03-20    3.786
2016-03-21    3.772
2016-04-04    3.688
2016-03-12    3.662
2016-03-14    3.522
2016-04-02    3.508
2016-03-28    3.496
2016-03-07    3.474
2016-03-29    3.414
Name: ad_created, dtype: float64


Last Seen (Date): Relative Frequency (%)
2016-04-06    22.100
2016-04-07    13.092
2016-04-05    12.428
2016-03-17     2.792
2016-04-03     2.536
2016-04-02     2.490
2016-03-30     2.484
2016-04-04     2.462
2016-03-31     2.384
2016-03-12     2.382
Name: last_seen, dtype: float64


It is observed that:
- the distributions of the `date_crawled` and `ad_created` dates are relatively uniform
- there are significantly more `last_seen` dates from 2016-04-05 to 2016-04-07. We can confirm that these are the last three days of the data collection period, so this is not an anomaly.

In [11]:
print('Last Seen (Date): Relative Frequency (%)')
print(last_seen_value_counts.sort_index().tail(5))

Last Seen (Date): Relative Frequency (%)
2016-04-03     2.536
2016-04-04     2.462
2016-04-05    12.428
2016-04-06    22.100
2016-04-07    13.092
Name: last_seen, dtype: float64
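String slicing works here because every value shares one fixed format. An alternative is to parse the strings with `pd.to_datetime`, which also validates them. The sketch below uses the three `date_crawled` values shown earlier; the format string is an assumption based on those samples.

```python
import pandas as pd

stamps = pd.Series([
    '2016-03-26 17:47:46',
    '2016-04-04 13:38:56',
    '2016-03-26 18:57:24',
])

# Parse to datetime64; a value that does not match the format raises an error
parsed = pd.to_datetime(stamps, format='%Y-%m-%d %H:%M:%S')

# Relative frequency (%) of each calendar date, in chronological order
share = 100 * parsed.dt.strftime('%Y-%m-%d').value_counts(normalize=True).sort_index()
print(share)
```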


**f. Inspect datetime columns stored as numeric values**

The `registration_month` and `registration_year` columns represent datetime values, but are stored as numeric data. We can understand the distribution without any extra data processing.

In [12]:
print('Value counts of registration_month column')
print('Month: Frequency')
print(autos['registration_month'].value_counts(dropna = False).sort_index())

Value counts of registration_month column
Month: Frequency
0     5075
1     3282
2     3008
3     5071
4     4102
5     4107
6     4368
7     3949
8     3191
9     3389
10    3651
11    3360
12    3447
Name: registration_month, dtype: int64


The values in the `registration_month` column correspond to the months from January (1) to December (12), except for 5075 rows with a value of zero. These may be placeholders for missing data, so we will replace them with null values.

In [13]:
autos.loc[autos['registration_month'] == 0, 'registration_month'] = np.nan

print('Value counts of registration_month column')
print('Month: Frequency')
print(autos['registration_month'].value_counts(dropna = False).sort_index())

Value counts of registration_month column
Month: Frequency
1.0     3282
2.0     3008
3.0     5071
4.0     4102
5.0     4107
6.0     4368
7.0     3949
8.0     3191
9.0     3389
10.0    3651
11.0    3360
12.0    3447
NaN     5075
Name: registration_month, dtype: int64
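Assigning `np.nan` converts the column to floats, which is why the months now print as `1.0`, `2.0` and so on. If integer months are preferred, pandas' nullable `Int64` dtype is one option. This is a sketch on toy values, not a step the notebook performs.

```python
import pandas as pd

# Toy month values where 0 stands for "unknown"
months = pd.Series([0, 3, 6, 0, 12])

# Convert to the nullable integer dtype, then blank out the zeros
months_nullable = months.astype('Int64')
months_nullable[months == 0] = pd.NA

print(months_nullable)
```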


Now, we will explore the values in the `registration_year` column.

In [14]:
print('10 lowest values of registration_year column')
print('Year: Frequency')
print(autos['registration_year'].value_counts().sort_index().head(10))
print('\n')

print('15 highest values of registration_year column')
print('Year: Frequency')
print(autos['registration_year'].value_counts().sort_index().tail(15))

10 lowest values of registration_year column
Year: Frequency
1000    1
1001    1
1111    1
1500    1
1800    2
1910    9
1927    1
1929    1
1931    1
1934    2
Name: registration_year, dtype: int64


15 highest values of registration_year column
Year: Frequency
2016    1316
2017    1453
2018     492
2019       3
2800       1
4100       1
4500       1
4800       1
5000       4
5911       1
6200       1
8888       1
9000       2
9996       1
9999       4
Name: registration_year, dtype: int64


There seem to be erroneous `registration_year` values that are either too far in the past (e.g. year 1000) or in the future (e.g. 2800). We will limit this column to the range 1900 - 2020 and replace any values outside it with null values.

In [15]:
# Replace outliers with null values
autos.loc[autos['registration_year'] < 1900, 'registration_year'] = np.nan
autos.loc[autos['registration_year'] > 2020, 'registration_year'] = np.nan

print('10 lowest values of registration_year column')
print('Year: Frequency')
print(autos['registration_year'].value_counts().sort_index().head(10))
print('\n')

print('15 highest values of registration_year column')
print('Year: Frequency')
print(autos['registration_year'].value_counts().sort_index().tail(15))

10 lowest values of registration_year column
Year: Frequency
1910.0    9
1927.0    1
1929.0    1
1931.0    1
1934.0    2
1937.0    4
1938.0    1
1939.0    1
1941.0    2
1943.0    1
Name: registration_year, dtype: int64


15 highest values of registration_year column
Year: Frequency
2005.0    3015
2006.0    2708
2007.0    2304
2008.0    2231
2009.0    2098
2010.0    1597
2011.0    1634
2012.0    1323
2013.0     806
2014.0     666
2015.0     399
2016.0    1316
2017.0    1453
2018.0     492
2019.0       3
Name: registration_year, dtype: int64
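The output still contains registration years from 2017 to 2019, although every listing was crawled in 2016. A stricter alternative would be to cap each row at the year it was crawled. The sketch below shows that idea on toy rows; it rests on the assumption that a car cannot be first registered after its listing was crawled, and it is not the rule applied in this notebook.

```python
import pandas as pd

# Toy rows: one valid year, one future year, one ancient year, one boundary case
toy = pd.DataFrame({
    'date_crawled': ['2016-03-26 17:47:46', '2016-04-04 13:38:56',
                     '2016-03-12 16:58:10', '2016-04-01 14:38:50'],
    'registration_year': [2004, 2018, 1000, 2016],
})

# Upper bound per row: the year in which the listing was crawled
crawl_year = pd.to_datetime(toy['date_crawled']).dt.year
plausible = toy['registration_year'].between(1900, crawl_year)

# Keep plausible years, replace the rest with NaN
toy['registration_year'] = toy['registration_year'].where(plausible)
print(toy)
```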


**g. Translate text categories from German to English**

The `vehicle_type` and `unrepaired_damage` columns contain categorical data in German.

In [16]:
german_columns = ['vehicle_type', 'unrepaired_damage']

for column in german_columns:
    print(f'In German, the {column} categories are:')
    print([i for i in autos[column].unique()])

In German, the vehicle_type categories are:
['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv', 'cabrio', 'andere']
In German, the unrepaired_damage categories are:
['nein', nan, 'ja']


We will create a dictionary to map and translate the German names in these two columns to English words.

In [17]:
# Mapping dictionary
german_to_english = {'kleinwagen': 'small car',
                    'kombi': 'station wagon',
                    'cabrio': 'convertible',
                    'andere': 'other',
                    'manuell': 'manual',
                    'automatik': 'automatic',
                    'nein': 'no',
                    'ja': 'yes',
                    'Bus': 'bus'}

# Map and translate columns
german_columns = ['vehicle_type', 'unrepaired_damage']

for column in german_columns:
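    # Assign the result back; calling replace with inplace = True on a selected column may not update the dataframe
    autos[column] = autos[column].replace(german_to_english)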
    print(f'In English, the {column} categories are:')
    print([i for i in autos[column].unique()])

In English, the vehicle_type categories are:
['bus', 'limousine', 'small car', 'station wagon', nan, 'coupe', 'suv', 'convertible', 'other']
In English, the unrepaired_damage categories are:
['no', nan, 'yes']


---

## 4. Analyze the Data

**a. What are the most popular car brands and why?**

Let's start by identifying the 5 most popular car brands from the cleaned dataset.

In [18]:
print('Brand            Number of Listings')
print(autos['brand'].value_counts(sort = True)[:5])

Brand            Number of Listings
volkswagen       10687
opel              5461
bmw               5429
mercedes_benz     4734
audi              4283
Name: brand, dtype: int64


**Volkswagen is by far the most popular brand**, with almost twice as many listings as its nearest competitors (Opel and BMW).

We will then compute the mean price of each car brand to understand how price relates to popularity.

In [19]:
# Save top 5 brand names
top_5_brands = autos['brand'].value_counts(sort = True).index[:5]

# Compute mean price for each brand 
brand_mean_prices = {}

for b in top_5_brands:
    selected_rows = autos[autos['brand'] == b]
    mean_price = selected_rows['price'].mean()
    brand_mean_prices[b] = mean_price

print('Brand: Mean Price')
for brand, mean_price in brand_mean_prices.items():
    print(f'{brand}: ${mean_price:.2f}')
print('\n')

# Print statistics of all price data
print('Price Data Statistics')
autos['price'].describe()

Brand: Mean Price
volkswagen: $5332.48
opel: $2944.61
bmw: $8261.38
mercedes_benz: $8536.03
audi: $9212.93


Price Data Statistics


count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64

Based on the mean prices above, the top 5 car brands can be classified into a few broad categories:
- Expensive - BMW, Mercedes-Benz, Audi
- Mid-range - Volkswagen
- Affordable - Opel

The popularity of Volkswagen may be attributed to the fact that it is a **mid-range option that offers the best of both worlds** i.e. more premium features than affordable brands, but more economical than expensive brands.

We will further investigate common brand and vehicle type combinations to test this explanation.

In [20]:
vehicle_types = autos['vehicle_type'].unique()
brand_vehicle_type_list = []

for b in top_5_brands:
    for vt in vehicle_types:
        selected_rows = autos[(autos['brand'] == b) & (autos['vehicle_type'] == vt)]
        
        mean_price = selected_rows['price'].mean()
        rows_count = len(selected_rows)
        
        brand_vehicle_type_list.append([b, vt, mean_price, rows_count])
        
brand_vehicle_type_df = pd.DataFrame(brand_vehicle_type_list, columns = ['Brand', 'Vehicle Type', 'Mean Price', 'Sales'])
brand_vehicle_type_df.sort_values('Sales', ascending = False, ignore_index = True).head(10)

Unnamed: 0,Brand,Vehicle Type,Mean Price,Sales
0,volkswagen,limousine,5648.234405,2700
1,bmw,limousine,7849.314607,2539
2,volkswagen,small car,2516.969984,2503
3,mercedes_benz,limousine,6835.104133,1894
4,volkswagen,station wagon,4935.372696,1779
5,opel,small car,2066.718905,1648
6,audi,station wagon,9156.446945,1586
7,audi,limousine,7537.691799,1550
8,volkswagen,bus,9285.413768,1400
9,bmw,station wagon,7510.137566,1159


From the 10 most common brand and vehicle type combinations (where `Sales` is the number of listings):
- the expensive brands (BMW, Mercedes-Benz, Audi) appear most often as limousines and station wagons
- the mid-range brand (Volkswagen) is well represented among limousines, small cars, station wagons and even buses
- the affordable brand (Opel) appears mostly as small cars

We can infer that **Volkswagen is the top car brand as it sells a wide range of vehicle types**, from small cheap cars to larger and more premium models. This also explains why the mean price of its vehicles is in the mid-range.
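The nested loops used to build the brand and vehicle type table can also be written as a single `groupby` with named aggregation. The sketch below runs on a small made-up dataframe; the column names follow the notebook, while `mean_price` and `listings` are illustrative output names. Using `'size'` counts every row in a group, including rows with a missing price, which matches `len(selected_rows)` in the loop version.

```python
import pandas as pd

# Made-up listings, including one row with a missing price
toy = pd.DataFrame({
    'brand': ['volkswagen', 'volkswagen', 'volkswagen', 'opel', 'opel', 'bmw'],
    'vehicle_type': ['limousine', 'limousine', 'small car',
                     'small car', 'small car', 'limousine'],
    'price': [5000.0, 6000.0, 2500.0, 2000.0, None, 8000.0],
})

summary = (
    toy.groupby(['brand', 'vehicle_type'])['price']
       .agg(mean_price='mean', listings='size')   # mean skips NaN, size does not
       .sort_values('listings', ascending=False)
       .reset_index()
)
print(summary)
```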

---

**b. How is the selling price affected by the condition of the used car?**

We will aggregate mileage data by brands and explore whether this affects the mean price.

In [21]:
# Compute mean mileage for each brand 
brand_mean_mileage = {}

for b in top_5_brands:
    selected_rows = autos[autos['brand'] == b]
    mean_mileage = selected_rows['odometer_km'].mean()
    brand_mean_mileage[b] = mean_mileage

# Combine mean price and mean mileage for each brand in single dataframe
bmp_series = pd.Series(brand_mean_prices)
bmm_series = pd.Series(brand_mean_mileage)

bmp_df = pd.DataFrame(bmp_series, columns = ['mean_price'])
bmm_df = pd.DataFrame(bmm_series, columns = ['mean_mileage'])

brand_means_df = pd.concat([bmp_df, bmm_df], axis = 1)
brand_means_df

Unnamed: 0,mean_price,mean_mileage
volkswagen,5332.478425,128955.272761
opel,2944.607542,129298.663248
bmw,8261.382442,132521.643028
mercedes_benz,8536.027085,130886.142797
audi,9212.930662,129643.941163


**The mean mileages do not differ much across the top car brands, although the more expensive brands tend to have slightly higher mileages**.

Alternatively, let's aggregate mileage data into intervals and confirm if the trend is the same regardless of brand.

In [22]:
# Split odometer_km into intervals and aggregate to see if the average price follows any pattern based on mileage.

intervals = autos['odometer_km'].value_counts(bins = 10).index
mileage_price_list = []

for i in intervals:
    selected_rows = autos[autos['price'].between(i.left, i.right)]
    mean_price = selected_rows['price'].mean()
    mileage_price_list.append([(f'{i.left:.0f} - {i.right:.0f}'), mean_price])

mileage_price_df = pd.DataFrame(mileage_price_list, columns = ['Mileage Interval (km)', 'Mean Price'])
mileage_price_df.sort_values('Mean Price', ignore_index = True)

Unnamed: 0,Mileage Interval (km),Mean Price
0,4855 - 19500,9609.993091
1,19500 - 34000,24667.709126
2,34000 - 48500,39548.060274
3,48500 - 63000,53954.009524
4,63000 - 77500,69173.612245
5,77500 - 92000,83094.529412
6,92000 - 106500,98842.5
7,106500 - 121000,116754.444444
8,121000 - 135500,130500.0
9,135500 - 150000,140998.666667


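Note that the filter in the cell above is applied to `price` rather than `odometer_km`, so each row of the table is simply the mean of the prices that fall inside that interval's numeric bounds. The steady increase follows from the filter itself, and **the table does not show how price varies with mileage**. To measure that relationship, the rows must be selected with `autos['odometer_km'].between(i.left, i.right)` and the table recomputed.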
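Grouping prices by mileage interval means binning on `odometer_km` and averaging `price` within each bin. A minimal sketch with `pd.cut` and `groupby` follows; the toy values are invented purely to show the mechanics and say nothing about the real dataset.

```python
import pandas as pd

# Invented mileage and price pairs, for illustration only
toy = pd.DataFrame({
    'odometer_km': [5000, 40000, 70000, 100000, 125000, 150000, 150000, 150000],
    'price': [20000, 15000, 11000, 8000, 6000, 4000, 3000, 3500],
})

# Bin on mileage (not on price), using 3 equal-width intervals
toy['mileage_bin'] = pd.cut(toy['odometer_km'], bins=3)

# Mean price and number of listings per mileage interval
by_mileage = toy.groupby('mileage_bin', observed=True)['price'].agg(['mean', 'count'])
print(by_mileage)
```

Reporting the count next to the mean helps show which intervals rest on only a few listings.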

Let's also explore whether, and by how much, cars with unrepaired damage are cheaper than cars without.

In [23]:
damage_status = autos['unrepaired_damage'].dropna().unique()

for status in damage_status:
    selected_rows = autos[autos['unrepaired_damage'] == status]
    mean_price = selected_rows['price'].mean()
    
    # 'no' means the car has no unrepaired damage
    if status == 'no':
        print(f'The mean price of cars without unrepaired damage is ${mean_price:.2f}')
    else:
        print(f'The mean price of cars with unrepaired damage is ${mean_price:.2f}')

The mean price of cars without unrepaired damage is $7086.80
The mean price of cars with unrepaired damage is $2221.89


On average, **damaged cars cost only about 31% of the price of non-damaged cars**. Hence, there may be value in purchasing a damaged car and paying for the repair if the repair cost is not prohibitive.
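The percentage follows directly from the two group means printed above, taking $2,221.89 as the mean for the `yes` group (unrepaired damage) and $7,086.80 for the `no` group. A short worked calculation:

```python
# Group means taken from the output above
mean_price_damaged = 2221.89      # unrepaired_damage == 'yes'
mean_price_undamaged = 7086.80    # unrepaired_damage == 'no'

# Damaged price as a share of the undamaged price, and the implied discount
ratio = mean_price_damaged / mean_price_undamaged
discount = 1 - ratio

print(f'Damaged cars cost {ratio:.1%} of the undamaged mean price')
print(f'That is a discount of {discount:.1%}')
```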

---

## 5. Conclusion

What are the most popular car brands and why?
- The top 5 car brands are Volkswagen, Opel, BMW, Mercedes-Benz and Audi
- **Volkswagen is the top selling brand** since it is a **mid-range** option that offers a **wide range of popular vehicle types** 

How is the selling price affected by the condition of the used car?
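- Mean mileage barely differs across the top 5 brands (roughly 129,000 to 132,500 km), so it does not explain their price differences
- The mileage interval table was filtered on `price` rather than `odometer_km`, so **the effect of mileage on price remains to be measured**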
- On average, **damaged cars are around 70% cheaper than non-damaged cars**