In [1]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


# Project: Analysis of the Used Cars Data

## Table of Contents
- <a href = "#intro"> Introduction </a>
    - <a href = "#imports"> Notebook Imports </a>
    - <a href = "#functions"> Function Definitions </a>
- Data Wrangling
    - <a href = "#gathering"> Data Gathering </a>
    - <a href = "#assessing"> Data Assessing </a>
    - <a href = "#cleaning"> Data Cleaning </a>
- <a href = "#analysis"> Data Analysis and Visualization </a>
- <a href = "#conclusions"> Conclusions </a>

<a id='intro'></a>
## Introduction

> Over 370000 used cars scraped with Scrapy from Ebay-Kleinanzeigen.

> Fields in the data set:
> - `dateCrawled` : when this ad was first crawled, all field-values are taken from this date
> - `name` : "name" of the car
> - `seller` : private or dealer
> - `offerType` : type of listing
> - `price` : the price on the ad to sell the car
> - `abtest` : whether the listing is included in an A/B test
> - `vehicleType` : the vehicle type
> - `yearOfRegistration` : at which year the car was first registered
> - `gearbox` : transmission type
> - `powerPS` : power of the car in PS
> - `model` : car model name
> - `kilometer` : how many kilometers the car has driven
> - `monthOfRegistration` : at which month the car was first registered
> - `fuelType` : type of fuel the car uses
> - `brand` : brand of the car
> - `notRepairedDamage` : if the car has a damage which is not repaired yet
> - `dateCreated` : the date for which the ad at ebay was created
> - `nrOfPictures` : number of pictures in the ad
> - `postalCode` : postal code for the location of the vehicle
> - `lastSeenOnline` : when the crawler saw this ad last online

> The content of the data is in German, so one has to translate it first if one can not speak German. The fields lastSeen and dateCrawled could be used to estimate how long a car will be at least online before it is sold.

Source: [Kaggle](https://www.kaggle.com/orgesleka/used-cars-database)
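The dataset description suggests using the crawl and last-seen timestamps to estimate how long an ad stayed online. A minimal sketch of that idea follows, using two timestamp pairs copied from sample rows shown later in this notebook. The `days_online` column name is an assumption introduced for illustration, and the result is only a lower bound because the ad may have existed before it was first crawled.

```python
import pandas as pd

# Two timestamp pairs copied from sample rows shown later in this notebook.
ads = pd.DataFrame({
    "dateCrawled": ["2016-03-08 18:36:43", "2016-04-04 13:47:22"],
    "lastSeen": ["2016-03-08 18:36:43", "2016-04-06 15:15:53"],
})

# Parse the text columns into datetimes so they can be subtracted.
crawled = pd.to_datetime(ads["dateCrawled"])
last_seen = pd.to_datetime(ads["lastSeen"])

# Lower bound on how long each ad stayed online, in whole days.
# `days_online` is a hypothetical column name, not part of the dataset.
ads["days_online"] = (last_seen - crawled).dt.days
print(ads["days_online"].tolist())  # [0, 2]
```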

<a id='imports'></a>
### Notebook Imports

In [2]:
import pandas as pd  # For Data Manipulation
import numpy as np  # For Array Manipulation

import calendar  # For month manipulation

import matplotlib.pyplot as matpy  # For Data Visualization
import seaborn as sb  # For Data Visualization
%matplotlib inline

## Data Wrangling

<a id='gathering'></a>
#### Data Gathering

In [3]:
# Reading file
cars = pd.read_csv("autos.csv", encoding="Latin-1")

<a id='assessing'></a>
#### Data Assessing

In [4]:
# Sampling five random rows to assess the data
cars.sample(5)

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
62472,2016-03-08 18:36:43,Mercedes_E_Klasse_W124_E320_Automatik,privat,Angebot,500,test,limousine,1994,automatik,224,e_klasse,150000,1,benzin,mercedes_benz,ja,2016-03-08 00:00:00,0,12101,2016-03-08 18:36:43
342672,2016-04-04 13:47:22,Caddy_14d_Projekt_zu_verkaufen,privat,Angebot,800,control,,1990,,0,,5000,0,,volkswagen,,2016-04-04 00:00:00,0,82515,2016-04-06 15:15:53
88625,2016-03-10 16:36:43,Audi_A3_1.8_TFSI_Sportback_Ambition,privat,Angebot,7990,control,kombi,2007,manuell,160,a3,150000,10,benzin,audi,nein,2016-03-10 00:00:00,0,77960,2016-03-22 08:47:26
126720,2016-03-31 18:38:12,Fiat_Panda_169__1.2_51.300_Km_Super_Kleinwagen,privat,Angebot,3699,control,kleinwagen,2005,manuell,0,,60000,0,,fiat,,2016-03-31 00:00:00,0,91189,2016-04-04 11:17:04
97562,2016-03-13 16:42:01,Audi_a6_2_5_tdi,privat,Angebot,2800,test,limousine,2003,manuell,163,a6,150000,10,diesel,audi,nein,2016-03-13 00:00:00,0,33689,2016-03-15 18:15:55


In [5]:
# Checking information of the dataset
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64 
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64 
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64 
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64 
 12  monthOfRegistration  371528 non-null  int64 
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-nu

In [6]:
# Checking how many NA values are in the dataset
cars.isna().sum()

dateCrawled                0
name                       0
seller                     0
offerType                  0
price                      0
abtest                     0
vehicleType            37869
yearOfRegistration         0
gearbox                20209
powerPS                    0
model                  20484
kilometer                  0
monthOfRegistration        0
fuelType               33386
brand                      0
notRepairedDamage      72060
dateCreated                0
nrOfPictures               0
postalCode                 0
lastSeen                   0
dtype: int64

In [7]:
# Getting a clearer picture of the missing values
cars_na = cars.isna().sum()
cars_na_perc = (cars_na * 100 / len(cars))
print(round(cars_na_perc,2).sort_values(ascending = False))

notRepairedDamage      19.40
vehicleType            10.19
fuelType                8.99
model                   5.51
gearbox                 5.44
kilometer               0.00
postalCode              0.00
nrOfPictures            0.00
dateCreated             0.00
brand                   0.00
monthOfRegistration     0.00
dateCrawled             0.00
name                    0.00
powerPS                 0.00
yearOfRegistration      0.00
abtest                  0.00
price                   0.00
offerType               0.00
seller                  0.00
lastSeen                0.00
dtype: float64
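The same percentages can be obtained in one step, because the mean of a boolean mask is the share of `True` values. The sketch below uses a small toy frame (its values are illustrative, not taken from the dataset) rather than the full `cars` data.

```python
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only.
toy = pd.DataFrame({
    "gearbox": ["manuell", None, "automatik", "manuell"],
    "price": [480, 999, 450, 300],
})

# isna() gives booleans; their mean is the fraction of missing values per column.
na_share = (toy.isna().mean() * 100).round(2).sort_values(ascending=False)
print(na_share)
```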


In [8]:
cars[["vehicleType", "gearbox", "model", "fuelType", "notRepairedDamage"]]

Unnamed: 0,vehicleType,gearbox,model,fuelType,notRepairedDamage
0,,manuell,golf,benzin,
1,coupe,manuell,,diesel,ja
2,suv,automatik,grand,diesel,
3,kleinwagen,manuell,golf,benzin,nein
4,kleinwagen,manuell,fabia,diesel,nein
...,...,...,...,...,...
371523,,,,,
371524,cabrio,automatik,fortwo,benzin,nein
371525,bus,manuell,transporter,diesel,nein
371526,kombi,manuell,golf,diesel,


In [9]:
cars["vehicleType"].value_counts()

limousine     95894
kleinwagen    80023
kombi         67564
bus           30201
cabrio        22898
coupe         19015
suv           14707
andere         3357
Name: vehicleType, dtype: int64

In [10]:
cars["gearbox"].value_counts()

manuell      274214
automatik     77105
Name: gearbox, dtype: int64

In [11]:
cars["model"].value_counts()

golf               30070
andere             26400
3er                20567
polo               13092
corsa              12573
                   ...  
serie_2                8
rangerover             6
serie_3                4
serie_1                2
discovery_sport        1
Name: model, Length: 251, dtype: int64

In [12]:
cars["fuelType"].value_counts()

benzin     223857
diesel     107746
lpg          5378
cng           571
hybrid        278
andere        208
elektro       104
Name: fuelType, dtype: int64

In [13]:
cars["notRepairedDamage"].value_counts()

nein    263182
ja       36286
Name: notRepairedDamage, dtype: int64

In [14]:
cars["monthOfRegistration"].value_counts()

0     37675
3     36170
6     33167
4     30918
5     30631
7     28958
10    27337
11    25489
12    25380
9     25074
1     24561
8     23765
2     22403
Name: monthOfRegistration, dtype: int64

In [15]:
cars[cars["monthOfRegistration"] == 0]

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
9,2016-03-17 10:53:50,VW_Golf_4_5_tuerig_zu_verkaufen_mit_Anhaengerk...,privat,Angebot,999,test,kleinwagen,1998,manuell,101,golf,150000,0,,volkswagen,,2016-03-17 00:00:00,0,27472,2016-03-31 17:17:06
15,2016-03-11 21:39:15,KA_Lufthansa_Edition_450_VB,privat,Angebot,450,test,kleinwagen,1910,,0,ka,5000,0,benzin,ford,,2016-03-11 00:00:00,0,24148,2016-03-19 08:46:47
16,2016-04-01 12:46:46,Polo_6n_1_4,privat,Angebot,300,test,,2016,,60,polo,150000,0,benzin,volkswagen,,2016-04-01 00:00:00,0,38871,2016-04-01 12:46:46
36,2016-03-11 11:50:37,Opel_Kadett_E_CC,privat,Angebot,1600,control,andere,1991,manuell,75,kadett,70000,0,,opel,,2016-03-11 00:00:00,0,2943,2016-04-07 03:46:09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371460,2016-04-03 13:46:24,Polo_g40_auch_Tausch_vag...no_vr6_gti_1.8t,privat,Angebot,3500,control,,1995,,0,polo,150000,0,,volkswagen,,2016-04-03 00:00:00,0,74579,2016-04-05 12:44:38
371473,2016-03-15 19:57:11,Subaru_Allrad,privat,Angebot,400,control,kombi,1991,manuell,0,legacy,150000,0,benzin,subaru,,2016-03-15 00:00:00,0,24558,2016-03-19 15:49:00
371482,2016-03-31 19:36:18,Peugeot_206,privat,Angebot,1300,control,kleinwagen,1999,manuell,75,2_reihe,125000,0,,peugeot,,2016-03-31 00:00:00,0,35102,2016-04-06 13:44:44
371486,2016-03-30 20:55:30,Zu_verkaufen,privat,Angebot,350,control,kleinwagen,1996,,65,punto,150000,0,,fiat,,2016-03-30 00:00:00,0,25436,2016-04-07 13:50:41


In [16]:
cars["nrOfPictures"].value_counts()

0    371528
Name: nrOfPictures, dtype: int64

In [17]:
# Counting fully duplicated rows (summing the boolean mask gives the count)
cars.duplicated().sum()

4

In [18]:
cars["seller"].value_counts()

privat        371525
gewerblich         3
Name: seller, dtype: int64

In [19]:
cars["offerType"].value_counts()

Angebot    371516
Gesuch         12
Name: offerType, dtype: int64

In [20]:
cars.describe(include="all")

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
count,371528,371528,371528,371528,371528.0,371528,333659,371528.0,351319,371528.0,351044,371528.0,371528.0,338142,371528,299468,371528,371528.0,371528.0,371528
unique,280500,233531,2,2,,2,8,,2,,251,,,7,40,2,114,,,182806
top,2016-03-24 14:49:47,Ford_Fiesta,privat,Angebot,,test,limousine,,manuell,,golf,,,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:45:59
freq,7,657,371525,371516,,192585,95894,,274214,,30070,,,223857,79640,263182,14450,,,17
mean,,,,,17295.14,,,2004.577997,,115.549477,,125618.688228,5.734445,,,,,0.0,50820.66764,
std,,,,,3587954.0,,,92.866598,,192.139578,,40112.337051,3.712412,,,,,0.0,25799.08247,
min,,,,,0.0,,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,,,,1150.0,,,1999.0,,70.0,,125000.0,3.0,,,,,0.0,30459.0,
50%,,,,,2950.0,,,2003.0,,105.0,,150000.0,6.0,,,,,0.0,49610.0,
75%,,,,,7200.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71546.0,


##### Quality Issues

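- Column names are in camelCase rather than snake_case
- Incorrect data types (`dateCrawled`, `dateCreated`, `lastSeen`, `postalCode`)
- Duplicated rows need to be dropped
- The `nrOfPictures` column contains only 0 and can be removed
- `monthOfRegistration` holds month numbers that should be changed to month names
- `monthOfRegistration` contains 0, which is not a valid month and needs investigating
- Underscores in the `name` column should be replaced with spaces
- Some columns contain German values that should be changed to English
- `yearOfRegistration` contains implausible values (minimum 1000) and needs investigating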

<a id='cleaning'></a>
#### Data Cleaning

In [21]:
cars_clean = cars.copy()

#### Renaming Columns
Changing camelCase column names to snake_case for consistency with Python naming conventions and easier readability

In [22]:
cars_clean.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'kilometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [23]:
cars_clean.rename({
    "dateCrawled": "date_crawled",
    "offerType": "offer_type",
    "vehicleType": "vehicle_type",
    "yearOfRegistration": "year_of_registration",
    "powerPS": "power_ps",
    "monthOfRegistration": "month_of_registration",
    "fuelType": "fuel_type",
    "notRepairedDamage": "not_repaired_damage",
    "dateCreated": "date_created",
    "nrOfPictures": "num_of_pictures",
    "postalCode": "postal_code",
    "lastSeen": "last_seen"
}, axis=1, inplace=True)

In [24]:
cars_clean.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'year_of_registration', 'gearbox', 'power_ps', 'model',
       'kilometer', 'month_of_registration', 'fuel_type', 'brand',
       'not_repaired_damage', 'date_created', 'num_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
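For wider tables, the camelCase to snake_case conversion could also be generated rather than typed out. The sketch below is one possible approach, not the method used in this notebook; `to_snake_case` is a hypothetical helper introduced here for illustration.

```python
import re

def to_snake_case(name: str) -> str:
    """Insert an underscore at each lowercase-to-uppercase boundary, then lowercase."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", "_", name).lower()

for original in ["dateCrawled", "yearOfRegistration", "powerPS", "abtest"]:
    print(original, "->", to_snake_case(original))
```

Note that this helper would turn `nrOfPictures` into `nr_of_pictures`, whereas the manual mapping above deliberately chose `num_of_pictures`, so an explicit dictionary remains the better fit when names are also being reworded.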

#### Changing incorrect datatypes
Converting the `date_crawled`, `date_created` and `last_seen` columns to datetime, and `postal_code` to the object dtype, since postal codes are identifiers rather than quantities

In [25]:
cars_clean.date_crawled = pd.to_datetime(cars_clean.date_crawled)

In [26]:
cars_clean.date_created = pd.to_datetime(cars_clean.date_created)

In [27]:
cars_clean.last_seen = pd.to_datetime(cars_clean.last_seen)

In [28]:
cars_clean["postal_code"] = cars_clean["postal_code"].astype(object)

In [29]:
cars_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   date_crawled           371528 non-null  datetime64[ns]
 1   name                   371528 non-null  object        
 2   seller                 371528 non-null  object        
 3   offer_type             371528 non-null  object        
 4   price                  371528 non-null  int64         
 5   abtest                 371528 non-null  object        
 6   vehicle_type           333659 non-null  object        
 7   year_of_registration   371528 non-null  int64         
 8   gearbox                351319 non-null  object        
 9   power_ps               371528 non-null  int64         
 10  model                  351044 non-null  object        
 11  kilometer              371528 non-null  int64         
 12  month_of_registration  371528 non-null  int6
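Casting `postal_code` to `object` keeps each value as an integer, so a code such as 2943 still lacks its leading zero. German postal codes have five digits, so one option is to store them as zero-padded text. The sketch below uses a standalone series with values that appear in this notebook's outputs; applying it to `cars_clean` is left as an assumption about what a later step might do.

```python
import pandas as pd

# Postal codes as they look after being read from the CSV as integers.
postal = pd.Series([70435, 2943, 1067])

# Convert to text and left-pad with zeros to the five digits of a German postal code.
postal_text = postal.astype(str).str.zfill(5)
print(postal_text.tolist())  # ['70435', '02943', '01067']
```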

#### Dropping duplicates
Removing all duplicated rows; only 4 of the 371,528 rows were duplicates

In [30]:
print(f"Shape of data before dropping duplicates: {cars_clean.shape}")
cars_clean = cars_clean.drop_duplicates()
print(f"Shape of data after dropping duplicates: {cars_clean.shape}")

Shape of data before dropping duplicates: (371528, 20)
Shape of data after dropping duplicates: (371524, 20)
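The relationship between counting and dropping duplicates can be checked on a small example. The toy frame below is illustrative only; it shows that `duplicated()` flags every repeat after the first occurrence, which are exactly the rows `drop_duplicates()` removes with its default settings.

```python
import pandas as pd

# Toy frame with one row repeated three times; values are illustrative only.
toy = pd.DataFrame({
    "name": ["Golf 3", "Golf 3", "Polo", "Golf 3"],
    "price": [480, 480, 300, 480],
})

# duplicated() flags every repeat after the first occurrence of a row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() removes exactly those flagged rows.
deduped = toy.drop_duplicates()
print(n_duplicates, len(toy) - len(deduped))  # 2 2
```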


#### Removing `num_of_pictures` column
This column contains a single value (0) in every row, so it carries no information and can be dropped

In [31]:
cars_clean.drop("num_of_pictures", axis=1, inplace=True)

In [32]:
cars_clean.columns

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'year_of_registration', 'gearbox', 'power_ps', 'model',
       'kilometer', 'month_of_registration', 'fuel_type', 'brand',
       'not_repaired_damage', 'date_created', 'postal_code', 'last_seen'],
      dtype='object')

#### Changing the `month_of_registration` values
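Changing numeric month values to their abbreviated month names. Note that `calendar.month_abbr[0]` is an empty string, so rows with month 0 end up with an empty string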

In [33]:
cars_clean["month_of_registration"] = cars_clean["month_of_registration"].apply(lambda x: calendar.month_abbr[x])

In [34]:
cars_clean["month_of_registration"]

0            
1         May
2         Aug
3         Jun
4         Jul
         ... 
371523    Jan
371524    Mar
371525    Mar
371526    Jun
371527    Aug
Name: month_of_registration, Length: 371524, dtype: object

In [35]:
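# The column now holds strings, so no row equals the integer 0 anymore;
# rows that had month 0 now contain an empty string (see the first row below)
cars_clean[cars_clean["month_of_registration"] == 0]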

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,year_of_registration,gearbox,power_ps,model,kilometer,month_of_registration,fuel_type,brand,not_repaired_damage,date_created,postal_code,last_seen


In [36]:
cars_clean.iloc[[0]]

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,year_of_registration,gearbox,power_ps,model,kilometer,month_of_registration,fuel_type,brand,not_repaired_damage,date_created,postal_code,last_seen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,,benzin,volkswagen,,2016-03-24,70435,2016-04-07 03:16:57
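Because month 0 silently becomes an empty string, those rows no longer show up as missing in `isna()`. One alternative, sketched below under the assumption that month 0 means "unknown", is to map only the twelve valid months and let everything else become `NaN`. The `month_names` dictionary is introduced here for illustration.

```python
import calendar
import pandas as pd

# Toy month numbers, including the invalid value 0.
months = pd.Series([0, 5, 8, 12])

# calendar.month_abbr[0] is an empty string, so only months 1-12 are mapped.
month_names = {i: calendar.month_abbr[i] for i in range(1, 13)}

# map() turns values missing from the dictionary (here 0) into NaN.
month_labels = months.map(month_names)
print(month_labels.isna().tolist())  # [True, False, False, False]
```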


#### Replace underscore with spaces in the `name` column
Replacing the underscores in car names with spaces

In [37]:
cars_clean["name"] = cars_clean["name"].str.replace("_", " ", regex=False)

In [38]:
cars_clean["name"]

0                                           Golf 3 1.6
1                                 A5 Sportback 2.7 Tdi
2                       Jeep Grand Cherokee "Overland"
3                                   GOLF 4 1 4  3TÜRER
4                       Skoda Fabia 1.4 TDI PD Classic
                              ...                     
371523                      Suche t4   vito ab 6 sitze
371524           Smart smart leistungssteigerung 100ps
371525              Volkswagen Multivan T4 TDI 7DC UY2
371526                          VW Golf Kombi 1 9l TDI
371527    BMW M135i vollausgestattet NP 52.720    Euro
Name: name, Length: 371524, dtype: object

#### Standardization of Language
Changing German to English

In [39]:
# Mapping German values to English, column by column.
# Assigning the result back avoids the chained `inplace=True` calls,
# which do not reliably modify `cars_clean` in recent pandas versions.
translations = {
    "seller": {"privat": "private", "gewerblich": "commercial"},
    "offer_type": {"Angebot": "offer", "Gesuch": "request"},
    "gearbox": {"manuell": "manual", "automatik": "automatic"},
    "not_repaired_damage": {"ja": "yes", "nein": "no"},
    "fuel_type": {"benzin": "petrol", "andere": "others", "elektro": "electric"},
    "vehicle_type": {
        "kleinwagen": "small car",
        "cabrio": "convertible",
        "kombi": "station wagon",
        "andere": "others",
    },
}

for column, mapping in translations.items():
    cars_clean[column] = cars_clean[column].replace(mapping)

In [40]:
cars_clean[["seller", "offer_type", "gearbox", "not_repaired_damage", "fuel_type", "vehicle_type"]]

Unnamed: 0,seller,offer_type,gearbox,not_repaired_damage,fuel_type,vehicle_type
0,private,offer,manual,,petrol,
1,private,offer,manual,yes,diesel,coupe
2,private,offer,automatic,,diesel,suv
3,private,offer,manual,no,petrol,small car
4,private,offer,manual,no,diesel,small car
...,...,...,...,...,...,...
371523,private,offer,,,,
371524,private,offer,automatic,no,petrol,convertible
371525,private,offer,manual,no,diesel,bus
371526,private,offer,manual,,diesel,station wagon


#### Investigate `year_of_registration` column

In [42]:
cars_clean["year_of_registration"].describe()

count    371524.000000
mean       2004.578033
std          92.867097
min        1000.000000
25%        1999.000000
50%        2003.000000
75%        2008.000000
max        9999.000000
Name: year_of_registration, dtype: float64
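A minimum of 1000 and a maximum of 9999 cannot be real registration years. One way to flag such rows is a plausibility window, sketched below on a toy series. The lower bound of 1900 is an assumption chosen for illustration; the upper bound of 2016 is the year the ads were crawled. A window like this will not catch every error (a 1910 registration for a modern model would still pass), so it is a first filter rather than a full validation.

```python
import pandas as pd

# Toy registration years, including the extremes reported by describe().
years = pd.Series([1000, 1910, 1993, 2016, 2018, 9999])

# Assumed plausible window: 1900 up to 2016, the year the ads were crawled.
MIN_YEAR, MAX_YEAR = 1900, 2016

# between() is inclusive on both ends by default.
plausible = years.between(MIN_YEAR, MAX_YEAR)
print(years[plausible].tolist())  # [1910, 1993, 2016]
```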