# Project: Exploring eBay Car Sales Data

In this project, we explore a dataset of used cars from eBay _Kleinanzeigen_, a classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database/data). We've made a few modifications from the original dataset that was uploaded to Kaggle:

- We sampled 50,000 data points from the full dataset, to ensure your code runs quickly in our hosted environment.
- We dirtied the dataset a bit to more closely resemble what you would expect from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with).

The aim of this project is to clean the data and analyze the included used car listings.

The data dictionary provided with the data is as follows:

    dateCrawled - When this ad was first crawled. All field-values are taken from this date.
    name - Name of the car.
    seller - Whether the seller is private or a dealer.
    offerType - The type of listing
    price - The price on the ad to sell the car.
    abtest - Whether the listing is included in an A/B test.
    vehicleType - The vehicle Type.
    yearOfRegistration - The year in which the car was first registered.
    gearbox - The transmission type.
    powerPS - The power of the car in PS.
    model - The car model name.
    kilometer - How many kilometers the car has driven.
    monthOfRegistration - The month in which the car was first registered.
    fuelType - What type of fuel the car uses.
    brand - The brand of the car.
    notRepairedDamage - Whether the car has damage that has not yet been repaired.
    dateCreated - The date on which the eBay listing was created.
    nrOfPictures - The number of pictures in the ad.
    postalCode - The postal code for the location of the vehicle.
    lastSeenOnline - When the crawler saw this ad last online.

In [None]:
# import libraries
import pandas as pd
import numpy as np

## capture data and get basic information

In [None]:
bmp_series = pd.Series(brand_mean_price)
bmp_series

In [None]:
bmm_series = pd.Series(brand_mean_mileage)
bmm_series

Create a dataframe from the mean price series (bmp_series) via the DataFrame constructor

In [None]:
# convert from float to int
brands = pd.DataFrame(bmp_series.astype(int), columns=['mean_price'])
brands

Assign the other series as a new column in this dataframe
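
Assigning a series as a new column matches rows by index label rather than by position. The sketch below illustrates this with hypothetical toy values (the names `toy_mean_price`, `toy_mean_mileage` and `toy_brands` are made up for illustration). It uses `Series.to_frame` as an alternative to the DataFrame constructor.

```python
import pandas as pd

# Hypothetical toy values, not taken from the dataset.
toy_mean_price = {'bmw': 8000.7, 'audi': 9000.2, 'opel': 3000.9}
# Deliberately listed in a different brand order than the prices.
toy_mean_mileage = {'opel': 129000.4, 'bmw': 132000.6, 'audi': 129000.1}

price_series = pd.Series(toy_mean_price)
mileage_series = pd.Series(toy_mean_mileage)

# astype(int) truncates the decimals; it does not round.
toy_brands = price_series.astype(int).to_frame('mean_price')

# The new column is aligned on the brand labels, not on row position.
toy_brands['mean_mileage_km'] = mileage_series.astype(int)
print(toy_brands)
```

Because the alignment is label-based, the order in which brands appear in each dictionary does not matter.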

In [None]:
# convert from float to int
brands['mean_mileage_km'] = bmm_series.astype(int)
brands

Analyze dataframe

In [None]:
brands

In [None]:
# brands with a mean mileage below 100,000 km
brands[brands['mean_mileage_km'] < 100000]

In [None]:
# sort by mean mileage
brands.sort_values(by='mean_mileage_km')

In [None]:
# sort by mean price
brands.sort_values(by='mean_price')

Limiting our focus to the top 6 brands

In [None]:
autos['odometer_km'].describe()

In [None]:
autos['odometer_km'].value_counts().sort_index(ascending=True)

In [None]:
# % of cars with odometer reading of 150,000 km
(31414/48565) * 100

The odometer readings range from 5,000 km to 150,000 km, with a majority (64.68%) at the maximum. Users were probably given a set of preset options to pick from, which would explain the nicely rounded numbers. There are a lot of high-mileage vehicles.
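
Rather than typing the counts in by hand, the share at the maximum reading could be computed from the column itself. This is a minimal sketch on hypothetical toy readings (`toy_odometer` is a made-up name). The last line re-checks the arithmetic from the cell above.

```python
import pandas as pd

# Hypothetical toy readings, not the real odometer_km column.
toy_odometer = pd.Series([150000, 150000, 150000, 5000])

# normalize=True returns fractions instead of counts.
share_pct = toy_odometer.value_counts(normalize=True) * 100
max_share = share_pct.loc[toy_odometer.max()]
print("{:.2f}% of toy readings are at the maximum".format(max_share))

# The figure quoted in the notebook, recomputed from its two counts.
notebook_share = 31414 / 48565 * 100
print("{:.2f}%".format(notebook_share))
```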

---

## Explore Date Columns

There are several date columns in the data:
- date_crawled
- ad_created
- registration_month
- registration_year
- last_seen

Some of these came directly from the website and others were created by the web crawler.

In [None]:
dates = ['date_crawled', 'ad_created', 'last_seen', 'registration_month', 'registration_year']
autos[dates].describe(include='all')

In [None]:
autos[dates].info()

**Observation:**
registration_month and registration_year are numeric data types. registration_month has values ranging from 0 to 12, representing the 12 months of the year. Perhaps the 0 means there was no registration month. registration_year ranges from 1000 to 9999. Some of these values do not look valid.


date_crawled, ad_created and last_seen are dates identified as strings. They need to be converted into a numerical representation. We will split the date from the timestamp and look at the distribution of the values for each column.
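
As a minimal sketch of that split, assuming timestamps in the `YYYY-MM-DD HH:MM:SS` format, the first 10 characters hold the date. The values in `toy_crawled` are hypothetical.

```python
import pandas as pd

# Hypothetical timestamps in the assumed 'YYYY-MM-DD HH:MM:SS' format.
toy_crawled = pd.Series([
    '2016-03-05 10:15:00',
    '2016-03-05 18:40:00',
    '2016-03-06 09:00:00',
    '2016-04-07 23:59:00',
])

# Keep only the date part of each string.
crawl_dates = toy_crawled.str[:10]

# Share of rows per date; ISO-formatted date strings sort chronologically.
date_share = crawl_dates.value_counts(normalize=True, dropna=False).sort_index()
print(date_share)
```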

In [None]:
autos[dates].head()

**date_crawled**

In [None]:
autos['unrepaired_damage'].value_counts()

In [None]:
autos['registration_month'].value_counts()

In [None]:
autos['registration_year'].value_counts()

In [None]:
autos['odometer'].value_counts()

In [None]:
autos['abtest'].value_counts()

In [None]:
autos['offer_type'].value_counts()

In [None]:
autos['seller'].value_counts()

In [None]:
autos['vehicle_type'].value_counts()

### drop columns

We observed a few columns with only a couple of unique values, where almost all rows had the same value:
- offer_type
- seller

We also observed that all values in the num_photos column were 0. So none of the cars in our dataset had any photos included.

We can drop these 3 columns
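
One way to spot such columns without reading every `value_counts()` output is to compute the share of the most common value per column. The sketch below uses a hypothetical toy frame and an arbitrary 75% threshold; both are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame, not the real dataset.
toy = pd.DataFrame({
    'seller': ['privat', 'privat', 'privat', 'privat'],
    'offer_type': ['Angebot', 'Angebot', 'Angebot', 'Gesuch'],
    'brand': ['bmw', 'audi', 'opel', 'ford'],
})

# value_counts sorts by frequency, so iloc[0] is the most common value's share.
top_share = toy.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
)

# Arbitrary illustrative threshold: flag columns dominated by one value.
near_constant = top_share[top_share >= 0.75].index.tolist()
trimmed = toy.drop(columns=near_constant)
print(near_constant)
```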

In [None]:
autos = autos.drop(['offer_type', 'seller', 'num_photos'], axis=1)

### convert numeric values stored as text to numeric values - price and odometer
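
A minimal sketch of this conversion follows. The odometer strings mirror the `km` and comma cleanup used in this notebook, while the `$5,000` price format and the helper name `text_to_int` are assumptions. Passing `regex=False` matters for `$`, which would otherwise be read as a regular-expression anchor in pandas versions where `str.replace` defaults to regex matching.

```python
import pandas as pd

# Hypothetical raw strings; the '$5,000' price format is an assumption.
toy = pd.DataFrame({
    'price': ['$5,000', '$8,500', '$0'],
    'odometer': ['150,000km', '70,000km', '5,000km'],
})

def text_to_int(column, unit):
    """Strip a literal unit marker and thousands separators, then cast to int."""
    return (column
                .str.replace(unit, '', regex=False)
                .str.replace(',', '', regex=False)
                .astype(int))

toy['price'] = text_to_int(toy['price'], '$')
toy['odometer_km'] = text_to_int(toy['odometer'], 'km')
print(toy[['price', 'odometer_km']])
```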

In [None]:
autos['registration_year'].describe()

In [None]:
(autos['registration_year']
     .value_counts()
     .sort_index(ascending=False)
)

Registration year is the year that the car was first registered. This would be close to, if not the same as, the year the car was built.

The minimum (1000) and maximum (9999) years are odd. There were no cars in the year 1000 and 9999 is way into the future. If the web crawler captured ads during March and April of 2016, we would expect the registration years to range from 19XX to 2016.

## Cleaning up registration years

We observed a few odd values in the registration_year column. We need to dive deeper to clean up the column. 

A car can't be first registered after its listing was seen, so any car with a registration year above 2016 is inaccurate. We also need to look at the earlier dates to see what we can remove.

In [None]:
years = [2016, 2017, 2018, 2019, 2020]
for year in years:
    num = autos[autos['registration_year'] > year].shape[0]
    print('There are {0} vehicles with registration years higher than {1}'.format(num, year))

In [None]:
years = [2016, 2017, 2018, 2019, 2020]
for year in years:
    num = autos[autos['registration_year'] == year].shape[0]
    print('There are {0} vehicles with registration years in {1}'.format(num, year))

There are almost 1900 cars with registration years higher than 2016, the year in which the web crawler captured this data. Most of these have registration years of 2017 or 2018, and 2 are from 2019. Regardless, these will be removed.

In [None]:
years = [1900, 1950, 2000, 2010, 2020]
for year in years:
    num = autos[autos['registration_year'] < year].shape[0]
    print('There are {0} vehicles with registration years lower than {1}'.format(num, year))

In [None]:
(autos['registration_year']
     .value_counts()
     .sort_index(ascending=True)
     .head(10)
)

There are a handful of cars with registration years lower than 1900. This seems very unlikely. Looking at those years further (1800, 1111, 1001, 1000), cars did not even exist during this time! These should be removed.
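
Both kinds of invalid year can be counted with boolean masks, since summing a mask counts its `True` values. The years in `toy_years` are hypothetical.

```python
import pandas as pd

# Hypothetical toy years, including the kinds of outliers described above.
toy_years = pd.Series([1000, 1800, 1995, 2005, 2016, 2017, 2018, 9999])

too_early = int((toy_years < 1900).sum())  # before cars plausibly existed
too_late = int((toy_years > 2016).sum())   # after the crawl year
print("{} too early, {} too late".format(too_early, too_late))
```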

Create brand_mean_price and brand_mean_mileage
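
As an alternative to an explicit loop over brands, the same per-brand means could be sketched with `groupby`. The toy frame below is hypothetical and keeps only the top 2 brands to stay small.

```python
import pandas as pd

# Hypothetical toy listings, not the real dataset.
toy_autos = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'opel', 'opel', 'opel', 'lada'],
    'price': [8000, 10000, 2000, 3000, 4000, 500],
    'odometer_km': [150000, 100000, 150000, 125000, 100000, 50000],
})

# value_counts sorts by frequency, so the first labels are the top brands.
top_brands = toy_autos['brand'].value_counts().index[:2]

# One pass over the data: mean price and mean mileage per remaining brand.
brand_means = (toy_autos[toy_autos['brand'].isin(top_brands)]
                   .groupby('brand')[['price', 'odometer_km']]
                   .mean())
print(brand_means)
```

The result is already a dataframe indexed by brand, so no intermediate dictionaries are needed with this approach.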

In [None]:
# capture top 20 brands
top_20_brands = (autos['brand']
                     .value_counts()
                     .sort_values(ascending=False)
                     .index[:20]
                )

In [None]:
# calculate mean mileage and mean price for top 20 brands
brand_mean_price = {}
brand_mean_mileage = {}

for brand in top_20_brands:
    # select only the listings for this brand
    brand_autos = autos[autos['brand'] == brand]
    # store the brand's mean price and mean mileage
    brand_mean_price[brand] = brand_autos['price'].mean()
    brand_mean_mileage[brand] = brand_autos['odometer_km'].mean()

In [None]:
brand_mean_price

In [None]:
autos.columns

In [None]:
brand_mean_mileage

Convert both dictionaries to series objects via the Series constructor

### Explore odometer_km

In [None]:
autos['odometer_km'].unique()

In [None]:
# convert odometer column to int type after cleaning non-numeric values
autos['odometer'] = (autos['odometer']
                         .str.replace('km', '')
                         .str.replace(',', '')
                         .astype(int)
                    )

In [None]:
# rename odometer column name to include distance information
autos.rename({'odometer': 'odometer_km'}, axis=1, inplace=True)

In [None]:
autos.info()

## Continue Data Exploration

### Explore prices

In [None]:
autos['price'].unique().shape

In [None]:
autos['price'].describe()

In [None]:
autos['price'].value_counts().head(20)

In [None]:
autos['price'].value_counts().sort_index(ascending=False).head(20)

In [None]:
autos['price'].value_counts().sort_index(ascending=True).head(20)

**Observations:**
- There are 2357 unique prices
- 1421 entries have a price of \$0
- The most expensive is \$99M
- The most expensive 20 are \$197,000 and up
- The 1st quartile is \$1,100 and the 3rd quartile is \$7,200 with the 2nd quartile being \$2,950
- Prices go up to \$350,000 before jumping up to \$1 M and then much higher
- Because eBay is a bidding platform, prices can start as low as \$1

In [None]:
autos.shape[0]

In [None]:
autos[autos['registration_year'].between(1900, 2016)].shape[0]

In [None]:
(46681/48565) * 100

In [None]:
(46681/50000) * 100

We will retain 96% of the current dataset by removing cars with registration years earlier than 1900 and later than 2016. From the original dataset of 50,000, we will retain ~93%.
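
The retained share can also be read directly from a boolean mask, because the mean of a mask is the fraction of `True` values. The sketch below uses hypothetical toy years and then re-checks the two percentages quoted above from the notebook's own counts.

```python
import pandas as pd

# Hypothetical toy years; between() is inclusive on both ends by default.
toy_years = pd.Series([1000, 1995, 2005, 2016, 2017, 9999, 2010, 1999])
keep = toy_years.between(1900, 2016)
retained_pct = keep.mean() * 100
print("{:.1f}% of toy rows retained".format(retained_pct))

# The notebook's figures, recomputed from its counts.
pct_of_current = 46681 / 48565 * 100
pct_of_original = 46681 / 50000 * 100
print("{:.2f}% of current, {:.2f}% of original".format(pct_of_current, pct_of_original))
```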

In [None]:
# get distribution of brands ordered into descending order
(autos['brand']
         .value_counts(normalize=True)
         .sort_values(ascending=False)
)

In [None]:
top_20_brands

In [None]:
# combined share of the top 5 brands
autos['brand'].value_counts(normalize=True).sort_values(ascending=False)[:5].sum()

**Observations**

- The top 4 brands account for 50% of all listings.
- 23 brands account for less than 1% each with 13 of these accounting for less than 0.5% each
- The top 5 brands are all German car manufacturers, accounting for over 60% of all cars, with the 6th being American (Ford), followed by a French brand (Renault).

Let's calculate the average brand price:

In [None]:
# calculate average brand price
average_brand_price = {}

for brand in top_20_brands:
    # select only the listings for this brand
    brand_autos = autos[autos['brand'] == brand]
    # store the brand's mean price
    average_brand_price[brand] = brand_autos['price'].mean()

print('Average Brand Price:')
average_brand_price

In [None]:
sorted(average_brand_price.items(), key=lambda x: x[1])

**Observations**

Looking at the top 6 brands, BMW, Audi, Mercedes Benz were amongst the most expensive brands. Opel and Ford were amongst the cheaper with the former being the third cheapest brand. Volkswagen was in the middle of the pack.

The most expensive brands were Sonstige Autos (German for "other cars") and Mini (both had average prices over \$10,000). This may be because both are more of a niche. Mini has relatively few models, whereas Volkswagen has many different car models in various price ranges. The same applies to the other top brands, as they are much larger car companies.

## Use aggregation to understand average mileage of top brands

In [None]:
autos = autos[autos['registration_year'].between(1900, 2016)]

Let's look at the distribution of the remaining registration years:

In [None]:
(autos['registration_year']
     .value_counts(normalize=True)
     .sort_values()
)

In [None]:
(autos['registration_year']
     .value_counts(normalize=True)
     .sort_index(ascending=True)
)

Looking through the distribution, we can see that most cars were registered from the 1990s onwards. Cars registered in the earlier years (1910 - 1970) make up a small percentage. We can look at the actual number of registered cars in groups of year periods:

In [None]:
# (start, end) pairs: start is inclusive and end is exclusive,
# so the last period ends at 2017 to include cars registered in 2016
years = [(1900, 1950), (1950, 1960), (1960, 1970), (1970, 1980),
         (1980, 1990), (1990, 2000), (2000, 2010), (2010, 2017)]

for start, end in years:
    filtered_autos = autos[((autos['registration_year'] >= start) & (autos['registration_year'] < end))]
    num = filtered_autos.shape[0]
    print("There were {} registered cars between {} and {}".format(num, start, end - 1))
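
The same kind of bucketing could be sketched with `pd.cut`, which assigns each year to a period in one call. The years, bin edges and labels below are hypothetical. With `right=False` each bin includes its left edge and excludes its right edge, so the last edge is 2017 to keep 2016 in range.

```python
import pandas as pd

# Hypothetical toy years and coarser periods than the loop above.
toy_years = pd.Series([1948, 1985, 1995, 1999, 2004, 2016])
edges = [1900, 1950, 1990, 2000, 2010, 2017]
labels = ['1900-1949', '1950-1989', '1990-1999', '2000-2009', '2010-2016']

# right=False makes each bin [left, right).
periods = pd.cut(toy_years, bins=edges, right=False, labels=labels)

# sort_index keeps the periods in chronological (category) order.
period_counts = periods.value_counts().sort_index()
print(period_counts)
```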

In [None]:
before_90 = autos[autos['registration_year'] < 1990]
before_90_num = before_90.shape[0]
before_90_percent = (before_90_num/autos.shape[0]) * 100
print("{:.2f}% of cars were registered before 1990".format(before_90_percent))

# after 1990
after_90 = autos[autos['registration_year'] >= 1990]
after_90_num = after_90.shape[0]
after_90_percent = (after_90_num/autos.shape[0]) * 100
print("{:.2f}% of cars were registered in 1990 or later".format(after_90_percent))

**Observations**

We can see more clearly that most of the cars were registered in the past 20 years. Less than 3% of cars were registered before 1990, while over 97% of cars were registered in 1990 or later. Looking at the distribution, since 1994, each year had at least 1% of registered cars with the exception of 2015.

## Exploring Price By Brand

We will explore variations across different car brands by using aggregation.

In [None]:
(autos['date_crawled']
     .str[:10]
     .value_counts(normalize=True, dropna=False)
     .sort_index(ascending=True)
     .count()
)

There are 34 dates where the web crawler captured data.

In [None]:
(autos['date_crawled']
     .str[:10]
     .value_counts(normalize=True, dropna=False)
     .sort_index(ascending=True)
)

In [None]:
(autos['date_crawled']
     .str[:10]
     .value_counts(normalize=True, dropna=False)
     .sort_values()
)

The web crawler captured data on a daily basis over the course of a month and a few days. Specifically, from March 5th 2016 to April 7th 2016. The distribution is fairly uniform.

**ad_created**

In [None]:
(autos['ad_created']
     .str[:10]
     .value_counts(normalize=True, dropna=False)
     .sort_index(ascending=True)
     .count()
)

In [None]:
(autos['ad_created']
     .str[:10]
     .value_counts(normalize=True, dropna=False)
     .sort_index(ascending=True)
)

In [None]:
(autos['ad_created']
     .str[:10]
     .value_counts(normalize=True, dropna=False)
     .sort_values()
)

There are 76 dates when ads were created. These dates span from June 11th 2015 to April 7th 2016. The distribution is scattered - there are a number of dates where very few ads were placed and other days when lots of ads were placed.

**last_seen**

In [None]:
(autos['last_seen']
     .str[:10]
     .value_counts(normalize=True, dropna=False)
     .sort_index(ascending=True)
     .count()
)

In [None]:
(autos['last_seen']
     .str[:10]
     .value_counts(normalize=True, dropna=False)
     .sort_index(ascending=True)
)

In [None]:
(autos['last_seen']
     .str[:10]
     .value_counts(normalize=True, dropna=False)
     .sort_values()
)

The last_seen date records when the crawler last saw the ad online. Ideally, this would be the day the car was sold and the seller removed the listing. The distribution is fairly consistent across the days until the last 3 days, where it jumps to 6 to 12 times the level of the preceding days. Unless there were massive sales during those 3 days, this may be related to the last days of the web crawler's activity.

**registration_year**

In [None]:
# confirm brand column has no missing data
autos.info()

We can confirm that the 'brand' column has no missing data.

In [None]:
autos['brand'].value_counts()

Keeping entries with prices between \$1 and \$350,000, we retain 48,565 records. This is 97.13% of the data.

In [None]:
autos = autos[autos['price'].between(1, 350000)]

In [None]:
autos = pd.read_csv('autos.csv', delimiter=',', encoding='Latin-1')

In [None]:
autos

**Observations:**
The following columns have missing data:
- vehicleType
- gearbox
- model
- fuelType
- notRepairedDamage

Columns are either object (15/20) or int (5/20) data types.
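
The share of missing values per column can be sketched with `isnull().mean()`, since the mean of a boolean column is the fraction of `True` values. The toy frame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with one missing vehicleType value.
toy = pd.DataFrame({
    'vehicleType': ['bus', None, 'kombi', 'limousine'],
    'brand': ['bmw', 'audi', 'opel', 'ford'],
})

# Percentage of missing values in each column.
null_pct = toy.isnull().mean() * 100

# Keep only the columns that have any missing data.
columns_with_nulls = null_pct[null_pct > 0].index.tolist()
print(null_pct)
```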

In [None]:
autos.head()

**Observations:**
- The data isn't sorted in any way (e.g. not in order of date crawled).
- The name column follows a format of 
    - car_manufacturer_car_model_engine...
- The data is not in English (unsurprisingly, since it comes from a German website)
- monthOfRegistration column is represented in numbers
- Some columns look like they have been cleaned up and are in lower case format
- Units are included (odometer data includes km)
- dateCreated includes time portion which looks like it is set to 00:00:00

**More Observations:**
- The dataset contains 20 columns, most of which are strings.
- Some columns have null values, but none have more than ~20% null values.
- The column names use [camelcase](https://en.wikipedia.org/wiki/Camel_case) instead of Python's preferred [snakecase](https://en.wikipedia.org/wiki/Snake_case), which means we can't just replace spaces with underscores.

## convert column names from camelcase to snakecase
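
A minimal sketch of one way to do this conversion follows. The helper names (`camel_to_snake`, `clean_column`) are hypothetical. The `overrides` mapping is inferred from the snakecase column names used elsewhere in this notebook, where some columns were given more descriptive names than a mechanical conversion would produce.

```python
import re

def camel_to_snake(name):
    """Insert an underscore at each lower-to-upper boundary, then lowercase."""
    return re.sub(r'([a-z0-9])([A-Z])', r'\1_\2', name).lower()

# Hand-picked names, inferred from the columns used elsewhere in the notebook.
overrides = {
    'yearOfRegistration': 'registration_year',
    'monthOfRegistration': 'registration_month',
    'notRepairedDamage': 'unrepaired_damage',
    'dateCreated': 'ad_created',
    'nrOfPictures': 'num_photos',
    'lastSeenOnline': 'last_seen',
}

def clean_column(name):
    """Use the hand-picked name if there is one, else convert mechanically."""
    return overrides.get(name, camel_to_snake(name))

raw_columns = ['dateCrawled', 'offerType', 'vehicleType', 'powerPS',
               'yearOfRegistration', 'nrOfPictures']
new_columns = [clean_column(c) for c in raw_columns]
print(new_columns)
```

Applied to the dataframe, this would look like `autos.columns = [clean_column(c) for c in autos.columns]`.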

In [None]:
# remove any entries with prices below $1 and above $350,000
autos[autos['price'].between(1, 350000)].shape

Investigate a few columns in more detail:

In [None]:
autos['num_photos'].value_counts()