# Analyzing eBay Car Sales Data
by @antosnj

# Introduction
This project aims to analyze the included used car listings from eBay Kleinanzeigen, a classifieds section of the German eBay website. 

In [None]:
%autosave 2

import pandas as pd
import numpy as np
import csv

# Read in the data

In [None]:
autos = pd.read_csv('../input/autos.csv', encoding='Latin-1')
print(autos.info())
autos.head()

# Cleaning

In [None]:
autos.columns

Let's change the column names from camelCase to Python's preferred snake_case, and reword some of them based on the data dictionary so they are more descriptive.

In [None]:
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_PS', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'number_of_pictures', 'postal_code',
       'date_last_seen']
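The mechanical part of that renaming (camelCase to snake_case) could also be done programmatically. Below is a minimal sketch, assuming a hypothetical helper `camel_to_snake` and a made-up sample of camelCase names; the descriptive rewording still has to be done by hand.

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each interior capital letter, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

# Hypothetical sample of camelCase column names, for illustration only
sample_columns = ["dateCrawled", "offerType", "yearOfRegistration", "nrOfPictures"]
snake_columns = [camel_to_snake(col) for col in sample_columns]
print(snake_columns)
```

With a real dataframe, the same list comprehension could be assigned back to `autos.columns`.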

In [None]:
autos.head()

In order to clean the data a little more, let's take a closer look.

In [None]:
autos.describe(include='all')

Based on the statistics above, we can say a couple of things:

- Columns 'seller', 'offer_type', 'abtest', 'gearbox', 'unrepaired_damage' have mostly one value, so they are candidates to be dropped.
- The 'price' and 'odometer' columns are stored as strings (object dtype) rather than numbers.
- The 'number_of_pictures' column is all zeros, so we can drop it.

Let's start by converting the 'price' and 'odometer' columns to a numeric (float) dtype after removing any non-numeric characters.

In [None]:
#'price' column: regex=False so that "$" is removed as a literal character
autos["price"] = autos["price"].str.replace("$","",regex=False).str.replace(",","",regex=False)
autos["price"] = autos["price"].str.strip().astype(float)

#'odometer' column
autos["odometer"] = autos["odometer"].str.replace("km","",regex=False).str.replace(",","",regex=False)
autos["odometer"] = autos["odometer"].str.strip().astype(float)

#Rename both
autos = autos.rename(columns={"price": "price_dollars","odometer": "odometer_km"})

autos.head()
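As a hedged alternative to chaining one replacement per character, here is a small sketch that strips everything except digits and the decimal point in one pass, then lets `pd.to_numeric` turn anything unparseable into NaN. The values in `raw` are made up to mirror the "$5,000" and "150,000km" formats; they are not taken from the dataset.

```python
import pandas as pd

# Toy values, invented for illustration; "n/a" stands in for an unexpected entry
raw = pd.DataFrame({"price": ["$5,000", "$8,500", "n/a"],
                    "odometer": ["150,000km", "70,000km", "5,000km"]})

def to_number(column):
    """Keep only digits and '.', then convert; unparseable values become NaN."""
    digits_only = column.str.replace(r"[^0-9.]", "", regex=True)
    return pd.to_numeric(digits_only, errors="coerce")

cleaned = pd.DataFrame({"price_dollars": to_number(raw["price"]),
                        "odometer_km": to_number(raw["odometer"])})
print(cleaned)
```

The trade-off is that bad entries are silently converted to NaN instead of raising an error, so it is worth counting the missing values afterwards.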

Next, let's find outliers in the two numerical columns, 'price_dollars' and 'odometer_km'. To do so, I will take a look at some basic statistics using Series.describe() and use Series.value_counts() to see each value's frequency.

In [None]:
autos["price_dollars"].describe()

In [None]:
autos["odometer_km"].describe()

In [None]:
autos["price_dollars"].value_counts(ascending=False)

Based on the observations and each value's frequency, prices between 100 and 100000 dollars look like a reasonable interval. Anything outside that interval will be excluded as an outlier.

In [None]:
price_interval = [100,100000]

autos = autos.loc[autos["price_dollars"].between(price_interval[0],price_interval[1])]
autos["odometer_km"].value_counts(ascending=False)

The 'odometer_km' values seem reasonable for their nature, so I'll leave them as they are.
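Before dropping rows with an interval filter like the price one above, it can help to check how much data the filter removes. The sketch below does that on a handful of made-up prices; the variable names `low`, `high` and `share_dropped` are only illustrative.

```python
import pandas as pd

# Invented prices, including a few obvious outliers at both ends
prices = pd.Series([0.0, 1.0, 99.0, 100.0, 2500.0, 100000.0, 12345678.0])
low, high = 100, 100000

# between() is inclusive on both ends by default
in_range = prices.between(low, high)

# The mean of a boolean mask is the fraction of True values
share_dropped = 1 - in_range.mean()
kept = prices[in_range]

print(f"rows kept: {in_range.sum()}, share dropped: {share_dropped:.1%}")
```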

Finally, we can drop the 'number_of_pictures' column.

In [None]:
autos.drop('number_of_pictures', axis=1, inplace=True)

# Analysis

Let's start out by taking a look at the date columns: "date_crawled", "ad_created", "date_last_seen", "registration_month" and "registration_year". First, by using value_counts() I have created a normalized distribution of the values in each column, in order to use percentages instead of counts:

In [None]:
date_crawled_dist = autos["date_crawled"].value_counts(normalize=True, dropna=False)
ad_created_dist = autos["ad_created"].value_counts(normalize=True, dropna=False)
last_seen_dist = autos["date_last_seen"].value_counts(normalize=True, dropna=False)

print(date_crawled_dist,ad_created_dist,last_seen_dist)

Now, let's sort the distributions by index:

In [None]:
sorted_date_crawled = date_crawled_dist.sort_index()
sorted_ad_created = ad_created_dist.sort_index()
sorted_last_seen = last_seen_dist.sort_index()

print(sorted_date_crawled,sorted_ad_created,sorted_last_seen)

Based on the observations above, we can see that the listings were crawled (date_crawled) fairly uniformly across March and April 2016, whereas the date_last_seen column mostly contains values from April. Ads were created as far back as August 2015, but the majority of them were also created in March and April 2016.

Another important observation is that the ad_created timestamps do not include the actual time of day at which the ad was created.
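Since the crawl and last-seen columns carry full timestamps, a per-day distribution can be easier to read than one row per timestamp. Here is a minimal sketch, assuming the "YYYY-MM-DD HH:MM:SS" layout and using made-up timestamps: the first 10 characters are the date part.

```python
import pandas as pd

# Made-up timestamps in the assumed "YYYY-MM-DD HH:MM:SS" layout
crawled = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56",
                     "2016-03-26 18:57:24", "2016-03-12 16:58:10"])

# Keep only the date part, then compute each day's share, oldest day first
day = crawled.str[:10]
daily_share = day.value_counts(normalize=True, dropna=False).sort_index()
print(daily_share)
```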

Now, let's take a look at the registration year column:

In [None]:
autos["registration_year"].describe()

We can see that the minimum and maximum year values do not make sense at all: cars had not been invented in the year 1000, and we cannot have data from the year 9999.

Also, because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. For that reason, let's count the number of listings whose registration year falls inside the 1900 - 2016 interval (which sounds like a reasonable range considering when cars were invented) and see if it's safe to remove the remaining rows entirely.

In [None]:
autos.loc[autos["registration_year"].between(1900,2016)].shape

Out of 50000, only about 3700 cars fall outside the interval, so I would say it is safe to remove those rows.

In [None]:
autos.drop(autos[~autos["registration_year"].between(1900,2016)].index, inplace=True)

Let's now see what the registration year distribution looks like after cleaning:

In [None]:
autos["registration_year"].value_counts(normalize=True)

Most of the registration years date from the 1990s onward, with the year 2000 leading at close to 7% of registered cars.

Now, let's explore variations across the car brands that each account for over 5% of the listings.

In [None]:
autos["brand"].value_counts(normalize=True)>0.05

As seen above, the selected brands to aggregate on are 'volkswagen','bmw','opel','mercedes_benz','audi' and 'ford'. Let's take a look at the mean price for those brands:

In [None]:
aggregate_brands = ['volkswagen','bmw','opel','mercedes_benz','audi','ford']
brand_mean_prices = {}
brand_mean_mileage = {}

for brand in aggregate_brands:
    analyzed_brand = autos.loc[autos["brand"]==brand]
    mean_price = analyzed_brand["price_dollars"].mean()
    mean_mileage = (analyzed_brand["odometer_km"].mean())*0.621371 #Convert to miles
    brand_mean_prices[brand] = mean_price
    brand_mean_mileage[brand] = mean_mileage
    
print(brand_mean_prices,"\n\n",brand_mean_mileage)

We can see that, on average, high-end/luxury car brands such as audi, bmw or mercedes_benz have a higher price (8000-9000 dollars), whereas more affordable brands like ford, opel and volkswagen have lower prices (3000-5000 dollars), with volkswagen sitting between the two groups. However, as shown before in the brand distribution, the difference in price does not seem to matter much to buyers, since both groups of brands account for similar shares of the listings.

Let's put the mean prices and the mean mileage of each brand together in a new dataframe, for an easier analysis.

In [None]:
brand_mean_prices = pd.Series(brand_mean_prices)
brand_mean_mileage = pd.Series(brand_mean_mileage)

new_df = pd.DataFrame(brand_mean_prices, columns=["mean_price"])
new_df["mean_mileage"] = brand_mean_mileage

new_df

We can see that, regardless of price, every brand has an average mileage of around 80000 miles.
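The same brand table can also be built without an explicit loop. The following is a sketch on made-up data, not the notebook's actual numbers: it picks the brands above a share threshold and aggregates them with groupby. The 0.15 threshold is an assumption chosen only because the toy frame has four brands; the analysis above uses 0.05.

```python
import pandas as pd

KM_TO_MILES = 0.621371

# Toy listings, invented for illustration
toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford", "ford",
              "opel", "opel", "opel", "opel", "fiat"],
    "price_dollars": [9000.0, 7000.0, 3000.0, 4000.0, 5000.0,
                      2000.0, 3000.0, 4000.0, 3000.0, 1000.0],
    "odometer_km": [100000.0, 150000.0, 150000.0, 125000.0, 100000.0,
                    150000.0, 150000.0, 125000.0, 125000.0, 50000.0],
})

# Brands whose share of listings exceeds the (toy) threshold
share = toy["brand"].value_counts(normalize=True)
top_brands = share[share > 0.15].index

# One row per brand with mean price and mean odometer reading
summary = (toy[toy["brand"].isin(top_brands)]
           .groupby("brand")
           .agg(mean_price=("price_dollars", "mean"),
                mean_km=("odometer_km", "mean")))
summary["mean_mileage"] = summary["mean_km"] * KM_TO_MILES  # km to miles
print(summary)
```

Selecting the brands from the boolean mask avoids typing the brand list by hand, so the table stays correct if the cleaning steps change the shares.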

Next, let's clean the data a little more, starting by translating any German words.

In [None]:
autos.head()

In [None]:
#'seller' column
autos["seller"].unique()

In [None]:
autos.loc[autos["seller"]=='privat',"seller"] = 'private'
autos.loc[autos["seller"]=='gewerblich',"seller"] = 'commercial'

In [None]:
#'offer_type' column
autos["offer_type"].unique()

In [None]:
autos["offer_type"] = 'offer'

In [None]:
#'vehicle_type' column
autos["vehicle_type"].unique()

In [None]:
autos.loc[autos["vehicle_type"]=='kleinwagen',"vehicle_type"] = 'compact car'
autos.loc[autos["vehicle_type"]=='kombi',"vehicle_type"] = 'station wagon'
autos.loc[autos["vehicle_type"]=='andere',"vehicle_type"] = 'other'

In [None]:
#'gearbox' column
autos["gearbox"].unique()

In [None]:
autos.loc[autos["gearbox"]=='manuell',"gearbox"] = 'manual'
autos.loc[autos["gearbox"]=='automatik',"gearbox"] = 'automatic'

In [None]:
#'model' column
autos.loc[autos["model"]=='andere',"model"] = 'other'

In [None]:
#'fuel_type' column
autos["fuel_type"].unique()

In [None]:
autos.loc[autos["fuel_type"]=='benzin',"fuel_type"] = 'gasoline'
autos.loc[autos["fuel_type"]=='elektro',"fuel_type"] = 'electric'
autos.loc[autos["fuel_type"]=='andere',"fuel_type"] = 'other'

In [None]:
#'unrepaired_damage' column
autos["unrepaired_damage"].unique()

In [None]:
autos.loc[autos["unrepaired_damage"]=='nein',"unrepaired_damage"] = 'no'
autos.loc[autos["unrepaired_damage"]=='ja',"unrepaired_damage"] = 'yes'

Let's see what the dataframe looks like after translating from German:

In [None]:
autos.head()
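The translation steps above follow one pattern: look up a German word in a column and replace it with its English equivalent. As a sketch of a more compact way to express that, a nested dictionary can be passed to `DataFrame.replace`. The toy frame and the `translations` name below are assumptions for illustration, and only two columns are shown.

```python
import pandas as pd

# Toy rows, invented for illustration; None stands in for missing values
toy = pd.DataFrame({"gearbox": ["manuell", "automatik", None, "manuell"],
                    "unrepaired_damage": ["nein", "ja", "nein", None]})

# Outer keys are column names, inner dicts map German to English
translations = {
    "gearbox": {"manuell": "manual", "automatik": "automatic"},
    "unrepaired_damage": {"nein": "no", "ja": "yes"},
}

# replace() returns a new dataframe; values without a mapping are left as-is
translated = toy.replace(translations)
print(translated)
```

Keeping all mappings in one dictionary makes it easy to review the translations in a single place.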

Next, let's convert the dates to be uniform numeric data, starting by splitting date and time into two different columns.

In [None]:
#Separate date and time into two different columns
autos["date_crawled"] = autos["date_crawled"].astype(str)
autos["ad_created"] = autos["ad_created"].astype(str)
autos["date_last_seen"] = autos["date_last_seen"].astype(str)

autos[["date_crawled","time_crawled"]] = autos["date_crawled"].str.split(expand=True)
autos[["ad_created","time_ad_created"]] = autos["ad_created"].str.split(expand=True)
autos[["date_last_seen","time_last_seen"]] = autos["date_last_seen"].str.split(expand=True)

Since in the 'time_ad_created' column the time does not give us any relevant information (all zeros), we can drop it.

In [None]:
autos.drop("time_ad_created", axis=1, inplace=True)

Lastly, let's convert the 'date_crawled', 'ad_created' and 'date_last_seen' columns to integer datatypes.

In [None]:
autos["date_crawled"] = autos["date_crawled"].str.replace("-","").astype(int)
autos["ad_created"] = autos["ad_created"].str.replace("-","").astype(int)
autos["date_last_seen"] = autos["date_last_seen"].str.replace("-","").astype(int)
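An alternative worth knowing, sketched here on made-up timestamps rather than on the dataset, is to parse the strings with `pd.to_datetime`. The format string is an assumption based on the "YYYY-MM-DD HH:MM:SS" layout; the parsed values give direct access to parts such as the month, and the same YYYYMMDD integers can still be derived from them.

```python
import pandas as pd

# Made-up timestamps in the assumed layout
stamps = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])

# Parse into real datetimes; an explicit format avoids ambiguous guessing
parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")

# Date parts are available through the .dt accessor
months = parsed.dt.month

# Same integer encoding as removing the dashes from the date string
as_int = parsed.dt.strftime("%Y%m%d").astype(int)
print(as_int)
```

Integers in YYYYMMDD form sort correctly, but differences between them are not day counts, which is one reason to keep a datetime version around.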

After this last cleaning step, let's take a look at what the dataframe looks like:

In [None]:
autos

Now, in terms of analysis, it is of interest to figure out the following things:

- What the most common brand/model combinations are.
- Whether average prices follow any pattern based on the mileage.
- How much cheaper cars with damage are than their non-damaged counterparts.

Let's start out by finding what the most common brand/model combinations are. In order to do this, we need to find how many times each brand is combined with a certain model.

In [None]:
aux_df = pd.concat([autos["brand"],autos["model"]], axis=1)
aux_df = aux_df.groupby(['brand','model']).size().reset_index().rename(columns={0:'count'})
aux_df

Column 'count' contains the values we want. In order for the analysis to be easier, let's sort the new dataframe created in descending order by column 'count' to find the most common brand/model combinations. 

In [None]:
aux_df.sort_values('count', ascending=False)

Now we can tell that Volkswagen's most common model is the Golf, with 3684 units, BMW's is the 3 Series with 2602 units, and so on.
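The count-and-sort steps above can be combined into a single chain. Here is a small sketch on invented rows: `reset_index(name="count")` names the count column directly, so no rename is needed. Note that groupby drops rows with a missing brand or model by default.

```python
import pandas as pd

# Invented brand/model pairs, for illustration only
toy = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "bmw", "volkswagen",
              "bmw", "opel", "volkswagen"],
    "model": ["golf", "golf", "3er", "polo", "3er", "corsa", "golf"],
})

# Count each brand/model pair and list the most common pairs first
combo_counts = (toy.groupby(["brand", "model"])
                   .size()
                   .sort_values(ascending=False)
                   .reset_index(name="count"))
print(combo_counts.head())
```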

Next, let's find out whether average prices follow any pattern based on the mileage, starting by splitting the 'odometer_km' column into the following groups:

- (g1) 0 - 50000 km
- (g2) 50000 - 100000 km
- (g3) 100000 - 125000 km
- (g4) 125000 - 150000 km

In [None]:
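#Note: Series.between() includes both endpoints by default, so cars listed at
#exactly 100000 km or 125000 km are counted in two adjacent groups
g1 = autos.loc[autos["odometer_km"]<50000,'odometer_km']
g2 = autos.loc[autos["odometer_km"].between(50000,100000),'odometer_km']
g3 = autos.loc[autos["odometer_km"].between(100000,125000),'odometer_km']
g4 = autos.loc[autos["odometer_km"].between(125000,150000),'odometer_km']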

Now, using aggregation, let's calculate average prices for each group:

In [None]:
mileage_avg_prices = {}
groups = [g1,g2,g3,g4]
group_number = 1

for group in groups:
    av_price = autos.loc[group.index, "price_dollars"].mean()
    mileage_avg_prices['g'+str(group_number)] = av_price
    group_number+=1
    
mileage_avg_prices
    

We can see that average prices clearly depend on the mileage. On average, cars with under 50000 km (typically newer ones) cost about 15000 dollars, whereas those between 50000 and 100000 km cost 10000 dollars, those between 100000 and 125000 km cost close to 7000 dollars, and finally those with over 125000 km cost 4000 dollars.

Therefore, we can conclude that the price follows a significant pattern based on the mileage, where the higher the mileage, the lower the price.
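A different way to build mileage groups, shown here as a sketch on made-up data and not as the method used above, is `pd.cut`. With `right=False` each bin is closed on the left and open on the right, so every car lands in exactly one group. The last edge is set to infinity (an assumption of this sketch) so that the top group has no upper limit.

```python
import pandas as pd

# Toy listings, invented for illustration
toy = pd.DataFrame({
    "odometer_km": [5000.0, 40000.0, 50000.0, 90000.0,
                    100000.0, 125000.0, 150000.0, 150000.0],
    "price_dollars": [20000.0, 16000.0, 12000.0, 10000.0,
                      8000.0, 6000.0, 4000.0, 2000.0],
})

# Bin edges: [0, 50000), [50000, 100000), [100000, 125000), [125000, inf)
bins = [0, 50000, 100000, 125000, float("inf")]
labels = ["g1", "g2", "g3", "g4"]
toy["mileage_group"] = pd.cut(toy["odometer_km"], bins=bins,
                              labels=labels, right=False)

# Mean price per non-overlapping mileage group
group_means = toy.groupby("mileage_group", observed=True)["price_dollars"].mean()
print(group_means)
```

Because the bins do not overlap, the group means from this approach can differ from those of groups that share boundary values.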

Finally, let's find out how much cheaper than their non-damaged counterparts cars with damage are.

In [None]:
#Find what models are/are not damaged
damaged_models = autos.loc[autos['unrepaired_damage']=='yes',"model"]
non_damaged_models = autos.loc[autos['unrepaired_damage']=='no',"model"]

#Get the price for those models
price_damaged = autos.loc[damaged_models.index,"price_dollars"]
non_damaged_price = autos.loc[non_damaged_models.index,"price_dollars"]

print("\n\nDAMAGED:\n\n",price_damaged.describe(),"\n\n\nNON-DAMAGED:\n\n",non_damaged_price.describe())

Based on the analysis above, we can conclude that unrepaired damage can significantly lower car prices. On average, damaged cars are priced at around 2000 dollars, with a maximum close to 45000, while non-damaged cars average around 7000 dollars, with a maximum of 100000.
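To put a single number on "how much cheaper", the two averages can be compared as a relative discount. The sketch below uses invented prices, so its output says nothing about the real dataset; `discount` is a hypothetical name for one minus the ratio of the two means.

```python
import pandas as pd

# Invented listings; None marks a car whose damage status is unknown
toy = pd.DataFrame({
    "unrepaired_damage": ["yes", "no", "no", "yes", "no", None],
    "price_dollars": [2000.0, 7000.0, 9000.0, 1000.0, 5000.0, 3000.0],
})

# Summary per damage status; rows with a missing status are left out by groupby
by_damage = toy.groupby("unrepaired_damage")["price_dollars"].agg(["mean", "median", "count"])

# Relative discount of damaged cars versus non-damaged ones
discount = 1 - by_damage.loc["yes", "mean"] / by_damage.loc["no", "mean"]
print(by_damage)
print(f"damaged cars are about {discount:.0%} cheaper on average")
```

A fairer comparison would match cars on brand, model and age first, since damaged cars may also be older on average.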