# Analyzing Used Car Listings on eBay Kleinanzeigen

We will work with a dataset of used cars from *eBay Kleinanzeigen*, a classifieds section of the German eBay website.

This dataset was originally scraped and uploaded to Kaggle. It is no longer available on Kaggle, but can be found [here](https://data.world/data-society/used-cars-data).

The version of the dataset we will work with is a sample of 50,000 data points prepared by [Dataquest](https://www.dataquest.io/), who also modified it to simulate a less-cleaned version of the data.

The data dictionary provided with the data is as follows:

- `dateCrawled` - When this ad was first crawled. All field-values are taken from this date.
- `name` - Name of the car.
- `seller` - Whether the seller is private or a dealer.
- `offerType` - The type of listing.
- `price` - The price on the ad to sell the car.
- `abtest` - Whether the listing is included in an A/B test.
- `vehicleType` - The vehicle type.
- `yearOfRegistration` - The year in which the car was first registered.
- `gearbox` - The transmission type.
- `powerPS` - The power of the car in PS.
- `model` - The car model name.
- `kilometer` - How many kilometers the car has driven.
- `monthOfRegistration` - The month in which the car was first registered.
- `fuelType` - What type of fuel the car uses.
- `brand` - The brand of the car.
- `notRepairedDamage` - Whether the car has damage that has not yet been repaired.
- `dateCreated` - The date on which the eBay listing was created.
- `nrOfPictures` - The number of pictures in the ad.
- `postalCode` - The postal code for the location of the vehicle.
- `lastSeenOnline` - When the crawler saw this ad last online.

During this project, we will focus on cleaning the dataset and analyzing the included car listings.

In [None]:
import os
import sys
import logging
from pathlib import Path

import numpy as np

%load_ext autoreload
%autoreload 2

import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)

logging.basicConfig(level=logging.INFO, stream=sys.stdout)

## Load Data

In [None]:
from carsales.datasets import load_autos

In [None]:
autos = load_autos()

In [None]:
autos.info()

In [None]:
autos.head()

Most of the data is represented as strings. Some attributes (`vehicleType`, `gearbox`, `model`, `fuelType` and `notRepairedDamage`) have missing values, which we will need to deal with later.

We will start by cleaning the column names to make the data easier to work with: we will use Python's preferred [snake case](https://en.wikipedia.org/wiki/Snake_case) style and rename some fields to make them easier to understand.
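As an illustration of the snake case convention, here is a minimal sketch of an automatic conversion. The helper `camel_to_snake` is hypothetical and is not part of this notebook or of `carsales`; it only handles the simple camelCase pattern seen in the original column names.

```python
import re


def camel_to_snake(name: str) -> str:
    """Convert a camelCase identifier to snake_case (hypothetical helper)."""
    # Insert an underscore between a lowercase letter or digit and the
    # uppercase letter that follows it, then lowercase everything.
    with_underscores = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    return with_underscores.lower()


original = ["dateCrawled", "offerType", "powerPS", "nrOfPictures"]
converted = [camel_to_snake(column) for column in original]
print(converted)
```

An automatic conversion cannot change the meaning of a name (for example `kilometer` to `odometer`) or split `abtest` into `ab_test`, so the names are assigned by hand in this notebook.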

## Rename Columns

In [None]:
autos.columns

In [None]:
autos.columns = [
    'date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test', 'vehicle_type',
    'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month',
    'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'n_pictures',
    'postal_code', 'last_seen'
]

In [None]:
autos.head()

We will now investigate the data in more detail.

First of all, we should remove text columns where all or almost all values are the same, as they often carry no useful information for further analysis.

It is also important to check for numeric data stored as text: such columns can be cleaned and converted to an appropriate numeric type.

In [None]:
autos.describe(include='all')

In [None]:
autos['odometer'].value_counts().sort_index()

The following columns have mostly one value and can be safely dropped:
- `seller`: all ads but one are associated with a private seller,
- `offer_type`: all ads but one have the same value, `Angebot`,
- `n_pictures`: all listings seem to have no pictures.
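One way to spot such near-constant columns programmatically is to compute, for each column, the share of rows taken by its most frequent value. The sketch below runs on a small made-up frame; the helper `dominant_share`, the 0.8 threshold, and the sample values (other than `Angebot`) are assumptions for illustration only.

```python
import pandas as pd


def dominant_share(df: pd.DataFrame) -> pd.Series:
    """Share of rows held by the most frequent value of each column."""
    # value_counts sorts in descending order, so the first entry is the mode.
    return df.apply(
        lambda col: col.value_counts(normalize=True, dropna=False).iloc[0]
    )


# Made-up sample; the seller and brand values are placeholders.
sample = pd.DataFrame({
    "seller": ["privat"] * 9 + ["gewerblich"],
    "offer_type": ["Angebot"] * 10,
    "brand": ["bmw", "audi", "vw", "opel", "ford"] * 2,
})

shares = dominant_share(sample)
# Assumed threshold: flag columns where one value covers over 80% of rows.
near_constant = shares[shares > 0.8].index.tolist()
print(near_constant)
```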

There are also a few columns that need more investigation:
- `price` has unrealistically low and high values: some ads display a price of 0 or 1 USD, and others more than 999,000 USD. Moreover, the data is stored as strings; a numeric type would be more convenient and appropriate,
- `registration_year` has the same issue, with unrealistic values (1000, 1001, 1111, 1500, 1800, and some values above 2800),
- `registration_month` has values ranging from 0 to 12, that is, 13 distinct values for only 12 months. We need to look more carefully at this issue and decide which value to get rid of (0 or 12).
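A quick way to quantify these issues is to count rows outside a plausible year window and to tabulate the month values. The sketch below uses made-up sample data, and the bounds `MIN_YEAR` and `MAX_YEAR` are assumptions chosen for illustration, not values derived from the dataset.

```python
import pandas as pd

# Made-up sample covering both plausible and implausible values.
sample = pd.DataFrame({
    "registration_year": [1000, 1998, 2005, 2016, 2800, 9999],
    "registration_month": [0, 3, 12, 7, 0, 1],
})

# Assumed plausible window for the first registration year.
MIN_YEAR, MAX_YEAR = 1900, 2016

# between() is inclusive on both ends by default.
year_ok = sample["registration_year"].between(MIN_YEAR, MAX_YEAR)
n_bad_years = int((~year_ok).sum())

# Frequency of each month value, ordered by month, to inspect 0 and 12.
month_counts = sample["registration_month"].value_counts().sort_index()

print(n_bad_years)
print(month_counts)
```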

The `odometer` attribute needs to be converted to a numerical type (currently represented as strings).

Let's first drop the attributes mentioned above that are not interesting for further analysis.

In [None]:
autos.drop(columns=['seller', 'offer_type', 'n_pictures'], inplace=True)

We now investigate the `price` attribute and we will:
- remove any non-numeric character,
- convert the column to a numeric dtype,
- get rid of instances that have unrealistic values.

In [None]:
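# regex=False treats '$' as a literal character rather than a regex end-of-string anchor
autos['price'] = autos['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)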

In [None]:
autos['price'].head()

In [None]:
autos['price'] = autos['price'].astype(float)

In [None]:
autos['price'].head()
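The remaining step for `price`, discarding unrealistic values, can be sketched as a range filter. The bounds `LOW` and `HIGH` below are hypothetical and would need to be chosen after inspecting the actual price distribution; the sample prices are made up.

```python
import pandas as pd

# Made-up prices, already converted to floats.
prices = pd.Series([0.0, 1.0, 500.0, 7900.0, 24900.0, 999990.0], name="price")

# Hypothetical bounds for a realistic used-car price.
LOW, HIGH = 100.0, 350_000.0

# Keep only rows whose price falls inside the inclusive [LOW, HIGH] range.
kept = prices[prices.between(LOW, HIGH)]
print(kept.tolist())
```

On the full DataFrame, the same boolean mask could be applied as `autos[autos['price'].between(LOW, HIGH)]`.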

For the `odometer`, we will also remove all non-numeric characters and convert the column to a numeric dtype. All of its values appear reasonable.
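A minimal sketch of that conversion follows. It assumes the raw strings look like `150,000km`, which is an assumption about the format and should be checked against the actual values; the name `odometer_km` is also just illustrative.

```python
import pandas as pd

# Made-up values in the assumed raw format.
odometer = pd.Series(["150,000km", "70,000km", "5,000km"], name="odometer")

# Remove every non-digit character, then convert the result to integers.
odometer_km = odometer.str.replace(r"\D", "", regex=True).astype(int)
print(odometer_km.tolist())
```

Stripping all non-digits at once handles both the thousands separator and the unit suffix, at the cost of silently accepting any other stray characters.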