# eBay Kleinanzeigen Car Sales
### Objective - To clean and analyze used car sales data from the German equivalent of eBay Motors.

The data dictionary provided with data is as follows:

- name - Name of the car.
- seller - Whether the seller is private or a dealer.
- offerType - The type of listing.
- price - The price on the ad to sell the car.
- abtest - Whether the listing is included in an A/B test.
- vehicleType - The vehicle type.
- yearOfRegistration - The year in which the car was first registered.
- gearbox - The transmission type.
- powerPS - The power of the car in PS.
- model - The car model name.
- kilometer - How many kilometers the car has driven.
- monthOfRegistration - The month in which the car was first registered.
- fuelType - What type of fuel the car uses.
- brand - The brand of the car.
- notRepairedDamage - Whether the car has damage that has not yet been repaired.
- dateCreated - The date on which the eBay listing was created.
- nrOfPictures - The number of pictures in the ad.
- postalCode - The postal code for the location of the vehicle.
- lastSeenOnline - When the crawler saw this ad last online.

In [70]:
import pandas as pd

## Import

The data file is stored on GitHub.

In [71]:
autos = pd.read_csv('https://raw.githubusercontent.com/aaronsang/guided-data-science-projects/master/data/autos.csv', encoding='Latin-1')

## Data Evaluation & Cleaning

The data contains 50,000 rows and 20 columns, mostly string data with some integers. The column names are in camelCase, and some of them are not descriptive enough.

### Cleaning Todo:

- [x] Standardize casing across column names
- [x] Rename columns whose names are not descriptive enough
- [x] Clean price column: (convert to int, remove outliers)
- [x] Clean odometer column: (convert to int, remove outliers)
- [ ] Identify and remediate any columns with null values
  - [ ] vehicle_type
  - [ ] gearbox
  - [ ] model
  - [ ] fuel_type
  - [ ] unrepaired_damage
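
One way to start on the unchecked items is to measure how many values are missing in each column. The sketch below uses a small made-up frame (`sample` and its values are hypothetical, not taken from the data set); the same calls should apply to `autos` once its columns have been renamed.

```python
import pandas as pd

# Hypothetical sample rows; None marks a missing value.
sample = pd.DataFrame({
    "vehicle_type": ["bus", None, "kombi", "limousine"],
    "gearbox": ["manuell", "automatik", None, "manuell"],
    "fuel_type": ["lpg", "benzin", "benzin", None],
    "brand": ["peugeot", "bmw", "ford", "smart"],
})

null_counts = sample.isna().sum()        # missing values per column
null_share = sample.isna().mean() * 100  # the same, as a percentage of rows

summary = pd.DataFrame({"nulls": null_counts, "percent": null_share})
# Keep only the columns that actually contain missing values.
print(summary[summary["nulls"] > 0])
```

Whether to drop, fill, or keep the missing values is a separate decision that depends on how each column is used later.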


In [72]:
autos.head()

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


In [73]:
autos.describe(include="all")

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-16 21:50:53,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


In [74]:
# show_counts replaces null_counts, which was deprecated in pandas 1.2 and removed in 2.0
autos.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

In [75]:
# Rename columns from camelCase to snake_case
autos.columns
new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'number_of_pictures', 'postal_code',
       'last_seen']
autos.columns = new_columns

In [76]:
autos.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'number_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

I converted every column name to snake_case for consistency and renamed a few columns so that they describe their data more accurately:

- yearOfRegistration to registration_year
- monthOfRegistration to registration_month
- notRepairedDamage to unrepaired_damage
- dateCreated to ad_created
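
The notebook assigns the new names by hand. As an alternative, the conversion could be sketched programmatically: a regex inserts an underscore at each lowercase-to-uppercase boundary, and an explicit dictionary covers the four renames listed above. The helper names `to_snake_case`, `rename_column`, and `overrides` are hypothetical, and names such as `abtest` or `nrOfPictures` would need their own override entries to match the hand-written list.

```python
import re
import pandas as pd

# Explicit renames taken from the list above.
overrides = {
    "yearOfRegistration": "registration_year",
    "monthOfRegistration": "registration_month",
    "notRepairedDamage": "unrepaired_damage",
    "dateCreated": "ad_created",
}

def to_snake_case(name):
    """Insert "_" between a lowercase letter or digit and an uppercase letter, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

def rename_column(name):
    """Prefer an explicit override; otherwise fall back to the generic conversion."""
    return overrides.get(name, to_snake_case(name))

# Empty demo frame that only carries a few of the original column names.
demo = pd.DataFrame(columns=["dateCrawled", "offerType", "powerPS", "yearOfRegistration"])
demo = demo.rename(columns=rename_column)
print(list(demo.columns))
```

For a one-off data set with 20 columns the hand-written list is just as reasonable; the function mainly helps if the same cleanup has to be repeated on new exports.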

In [77]:
# Convert the odometer column to integer after removing the "km" label and the thousands separator
autos['odometer'] = autos['odometer'].str.replace('km', '', regex=False).str.replace(',', '', regex=False).astype('int64')
# Remove "$" and ","; regex=False makes "$" a literal character rather than a regex anchor
autos['price'] = autos['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype('int64')
# Rename the odometer column to odometer_km
autos.rename(columns={'odometer':'odometer_km'}, inplace=True)


In [78]:
# Verification of column renaming and dtype
autos[['price','odometer_km']].dtypes

price          int64
odometer_km    int64
dtype: object

In [79]:
autos['price'].unique().shape[0]

2357

In [80]:
print(autos['price'].describe())
print()
print(autos['price'].value_counts())

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

0        1421
500       781
1500      734
2500      643
1000      639
         ... 
20790       1
8970        1
846         1
2895        1
33980       1
Name: price, Length: 2357, dtype: int64


In [108]:
# Evaluation of low prices
print(autos['price'].value_counts().sort_index(ascending=True).head())
# Evaluation of high prices
print(autos['price'].value_counts().sort_index(ascending=False).head(15))

0    1421
1     156
2       3
3       1
5       2
Name: price, dtype: int64
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
Name: price, dtype: int64


In [113]:
low_prices = autos.loc[autos['price'] < 1].shape[0]
high_prices = autos.loc[autos['price'] > 350_000].shape[0]
print(f"Rows with prices < 1: {low_prices}")
print(f"Rows with prices > 350k: {high_prices}")
print(f"low and high prices combined: {low_prices + high_prices}")

Rows with prices < 1: 1421
Rows with prices > 350k: 14
low and high prices combined: 1435


There are 1,421 rows with a price below 1. At the high end, prices jump from 350,000 straight to 999,990, and only 14 rows are priced above 350,000. Because so few rows fall above that gap, I will remove the rows with prices below 1 or above 350,000.

In [114]:
autos = autos[autos['price'].between(1, 350_000)]
print(autos.shape)

(48565, 20)


The number of low- and high-priced rows removed (1,435) matches the difference between the starting 50,000 rows and the 48,565 rows remaining.
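
The filter-then-verify step can be wrapped in a small helper that returns both the kept rows and the number dropped, so the check does not rely on reading two outputs side by side. This is only a sketch: `keep_in_range` and the `toy` frame are hypothetical, and the toy prices are illustrative values rather than a sample of the real data.

```python
import pandas as pd

def keep_in_range(df, column, low, high):
    """Return the rows where low <= df[column] <= high, plus the number of rows dropped."""
    kept = df[df[column].between(low, high)]
    return kept, len(df) - len(kept)

# Toy prices for illustration only.
toy = pd.DataFrame({"price": [0, 1, 500, 350_000, 999_990, 99_999_999]})
kept, dropped = keep_in_range(toy, "price", 1, 350_000)
print(len(kept), dropped)

# The same arithmetic with the counts reported in the notebook.
remaining = 50_000 - (1_421 + 14)
print(remaining)
```

`Series.between` is inclusive on both ends by default, so prices of exactly 1 and exactly 350,000 are kept.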

In [125]:
autos['odometer_km'].value_counts()

150000    31414
125000     5057
100000     2115
90000      1734
80000      1415
70000      1217
60000      1155
50000      1012
5000        836
40000       815
30000       780
20000       762
10000       253
Name: odometer_km, dtype: int64

The odometer values show no obvious outliers, and they appear to be rounded, since there are only 13 distinct values. Most of the vehicles are listed at 150,000 km.
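
To put a number on "most", the counts from the `value_counts()` output can be turned into shares. The sketch below rebuilds those counts by hand as a Series (the variable names are hypothetical); on the real frame, `autos['odometer_km'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
odometer_counts = pd.Series({
    150000: 31414, 125000: 5057, 100000: 2115, 90000: 1734, 80000: 1415,
    70000: 1217, 60000: 1155, 50000: 1012, 5000: 836, 40000: 815,
    30000: 780, 20000: 762, 10000: 253,
})

total = odometer_counts.sum()    # should equal the row count after the price filter
share = odometer_counts / total  # proportion of listings per odometer value

print(total)
print(round(share.loc[150000] * 100, 1))
```

The total also doubles as a consistency check against the row count left after the price filter.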