# Exploring and Analyzing eBay Car Sales Data

The version of the dataset we are working with is a sample of 371,528 data points. We will clean and aggregate a lot of data in this project, using different indexing methods, boolean masks, and other data cleaning techniques.

The data dictionary provided with the data is as follows:

- dateCrawled - When this ad was first crawled. All field-values are taken from this date.
- name - Name of the car.
- seller - Whether the seller is private or a dealer.
- offerType - The type of listing.
- price - The price on the ad to sell the car.
- abtest - Whether the listing is included in an A/B test.
- vehicleType - The vehicle Type.
- yearOfRegistration - The year in which the car was first registered.
- gearbox - The transmission type.
- powerPS - The power of the car in PS.
- model - The car model name.
- kilometer - How many kilometers the car has driven.
- monthOfRegistration - The month in which the car was first registered.
- fuelType - What type of fuel the car uses.
- brand - The brand of the car.
- notRepairedDamage - If the car has a damage which is not yet repaired.
- dateCreated - The date on which the eBay listing was created.
- nrOfPictures - The number of pictures in the ad.
- postalCode - The postal code for the location of the vehicle.
- lastSeen - When the crawler last saw this ad online.

**The aim of this project is to clean the data and analyze the included used car listings.**




In [1]:
import numpy as np
import pandas as pd

# read csv file using pandas

autos=pd.read_csv("autos.csv", encoding='Latin-1')

In [2]:
autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
dateCrawled            371528 non-null object
name                   371528 non-null object
seller                 371528 non-null object
offerType              371528 non-null object
price                  371528 non-null int64
abtest                 371528 non-null object
vehicleType            333659 non-null object
yearOfRegistration     371528 non-null int64
gearbox                351319 non-null object
powerPS                371528 non-null int64
model                  351044 non-null object
kilometer              371528 non-null int64
monthOfRegistration    371528 non-null int64
fuelType               338142 non-null object
brand                  371528 non-null object
notRepairedDamage      299468 non-null object
dateCreated            371528 non-null object
nrOfPictures           371528 non-null int64
postalCode             371528 non-null int64
lastSeen              

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21


Our dataset contains 20 columns, most of which are stored as strings. Five columns (vehicleType, gearbox, model, fuelType and notRepairedDamage) contain null values. Three columns (dateCrawled, dateCreated and lastSeen) contain timestamps stored as strings, while yearOfRegistration and monthOfRegistration hold date parts as integers.

Let's start by cleaning the column names, converting them from camelCase to snake_case and making some of them more descriptive.

In [3]:
autos.columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'kilometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [4]:
col=['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
       'last_seen']

autos.columns=col

autos.head(3)

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_photos,postal_code,last_seen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
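The renaming above was done by hand. As a rough sketch of how the purely mechanical part (camelCase to snake_case) could be automated, the helper below uses a regular expression. The function name `camel_to_snake` and the tiny `sample` frame are hypothetical and only serve the illustration. A mechanical conversion would not produce the reworded names chosen above (for example `registration_year` or `odometer`), so those would still need a manual mapping.

```python
import re

import pandas as pd


def camel_to_snake(name):
    """Insert "_" before a capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


# Hypothetical empty frame that reuses a few of the original column names.
sample = pd.DataFrame(columns=["dateCrawled", "offerType", "powerPS", "nrOfPictures"])
sample.columns = [camel_to_snake(c) for c in sample.columns]
print(list(sample.columns))
```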


## Screening of data to determine useful columns for our analysis

- Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis.
- Examples of numeric data stored as text which can be cleaned and converted.
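One way to spot columns where almost all values are the same is to compute, for each column, the share of rows taken by its most frequent value. The snippet below is a minimal sketch on a made-up frame. The helper name `top_value_share` and the sample values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Small illustrative frame; the values are invented for this example.
sample = pd.DataFrame({
    "seller": ["privat", "privat", "privat", "gewerblich"],
    "gearbox": ["manuell", "automatik", "manuell", None],
    "num_photos": [0, 0, 0, 0],
})


def top_value_share(df):
    """Return, per column, the fraction of rows taken by the most frequent value."""
    return pd.Series({
        col: df[col].value_counts(normalize=True, dropna=False).iloc[0]
        for col in df.columns
    })


shares = top_value_share(sample)
# Columns with a share close to 1.0 are candidates for dropping.
print(shares.sort_values(ascending=False))
```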



In [5]:
autos.describe(include='all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,num_photos,postal_code,last_seen
count,371528,371528,371528,371528,371528.0,371528,333659,371528.0,351319,371528.0,351044,371528.0,371528.0,338142,371528,299468,371528,371528.0,371528.0,371528
unique,280500,233531,2,2,,2,8,,2,,251,,,7,40,2,114,,,182806
top,2016-03-24 14:49:47,Ford_Fiesta,privat,Angebot,,test,limousine,,manuell,,golf,,,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:45:59
freq,7,657,371525,371516,,192585,95894,,274214,,30070,,,223857,79640,263182,14450,,,17
mean,,,,,17295.14,,,2004.577997,,115.549477,,125618.688228,5.734445,,,,,0.0,50820.66764,
std,,,,,3587954.0,,,92.866598,,192.139578,,40112.337051,3.712412,,,,,0.0,25799.08247,
min,,,,,0.0,,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,,,,1150.0,,,1999.0,,70.0,,125000.0,3.0,,,,,0.0,30459.0,
50%,,,,,2950.0,,,2003.0,,105.0,,150000.0,6.0,,,,,0.0,49610.0,
75%,,,,,7200.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71546.0,



**Our initial observations:**
- The seller and offer_type columns each have only two unique values, and nearly every row shares the same one (371,525 and 371,516 of 371,528 rows respectively). So, we can drop these columns from our analysis.
- The statistics for the num_photos column are all zero, which suggests the column carries no information; we'll explore this column further.
- Some columns have a count lower than the total number of entries (371,528), which means they contain missing values. These columns need to be investigated so the missing values can be removed or filled:
    - vehicle_type,
    - gearbox,
    - model,
    - fuel_type,
    - unrepaired_damage
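To quantify the missing values rather than reading them off the `count` row, we could count nulls per column and express them as a percentage. This is a minimal sketch on a made-up frame; the `sample` data and the `summary` name are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Invented rows that mimic a few of the columns with gaps.
sample = pd.DataFrame({
    "vehicle_type": ["coupe", np.nan, "suv", "kleinwagen"],
    "gearbox": ["manuell", "manuell", np.nan, np.nan],
    "price": [480, 18300, 9800, 1500],
})

missing = sample.isnull().sum()
missing_pct = (missing / len(sample) * 100).round(1)
summary = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})

# Keep only the columns that actually have gaps.
with_gaps = summary[summary["missing"] > 0]
print(with_gaps)
```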

In [6]:
autos["num_photos"].value_counts()

0    371528
Name: num_photos, dtype: int64

The value counts show that every entry in the num_photos column is 0. So, we shall drop this column from our dataframe.

In [7]:
autos=autos.drop(["num_photos","seller","offer_type"],axis=1)

In [8]:
autos.head()

Unnamed: 0,date_crawled,name,price,ab_test,vehicle_type,registration_year,gearbox,power_ps,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,postal_code,last_seen
0,2016-03-24 11:52:17,Golf_3_1.6,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,60437,2016-04-06 10:17:21


In [9]:

autos["odometer"].unique()

array([150000, 125000,  90000,  40000,  30000,  70000,   5000, 100000,
        60000,  20000,  80000,  50000,  10000], dtype=int64)

In [10]:
autos.rename({"odometer":"odometer_km"},inplace=True, axis=1)

In [11]:
autos.columns


Index(['date_crawled', 'name', 'price', 'ab_test', 'vehicle_type',
       'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km',
       'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
       'ad_created', 'postal_code', 'last_seen'],
      dtype='object')

In [12]:

autos["odometer_km"].describe()

count    371528.000000
mean     125618.688228
std       40112.337051
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [13]:
autos["odometer_km"].value_counts()


150000    240797
125000     38067
100000     15920
90000      12523
80000      11053
70000       9773
60000       8669
50000       7615
5000        7069
40000       6376
30000       6041
20000       5676
10000       1949
Name: odometer_km, dtype: int64

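There are far more high-mileage than low-mileage vehicles: 240,797 of the 371,528 listings (about 65%) are recorded at 150,000 km, the highest value in the column.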

In [14]:
autos["price"].unique()

array([  480, 18300,  9800, ..., 18429, 24895, 10985], dtype=int64)

In [15]:
autos["price"].describe()

count    3.715280e+05
mean     1.729514e+04
std      3.587954e+06
min      0.000000e+00
25%      1.150000e+03
50%      2.950000e+03
75%      7.200000e+03
max      2.147484e+09
Name: price, dtype: float64

In [16]:
autos["price"].value_counts().sort_index(ascending=False)


2147483647        1
99999999         15
99000000          1
74185296          1
32545461          1
27322222          1
14000500          1
12345678          9
11111111         10
10010011          1
10000000          8
9999999           3
3895000           1
3890000           1
2995000           1
2795000           1
1600000           2
1300000           1
1250000           2
1234566           1
1111111           2
1010010           1
1000000           5
999999           13
999990            1
911911            1
849000            1
820000            1
780000            1
745000            2
              ...  
35               18
33                1
32                1
30               55
29                2
27                1
26                1
25               33
24                1
21                1
20               51
19                3
18                3
17                5
16                2
15               27
14                5
13                7
12                8


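The prices in this column seem rounded; however, given that there are 5597 unique values in the column, that may just be people's tendency to round prices on the site.

There are 10,778 cars listed with a price of $0, and we might consider removing these rows. At the other end, some cars are priced in the tens of millions of dollars, and the maximum of 2,147,483,647 is exactly the largest 32-bit signed integer, so these values look like placeholders or data-entry errors rather than real prices.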
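If we did decide to remove the implausible prices, a boolean mask built with `Series.between` would be one way to do it. The sketch below runs on a made-up frame, and the bounds `MIN_PRICE` and `MAX_PRICE` are hypothetical values chosen only for illustration; the notebook itself does not fix a cutoff.

```python
import pandas as pd

# Invented prices, including a zero and two extreme values.
sample = pd.DataFrame({"price": [0, 480, 18300, 9800, 99999999, 2147483647]})

# Hypothetical bounds, not taken from the original analysis.
MIN_PRICE = 1
MAX_PRICE = 350_000

in_range = sample["price"].between(MIN_PRICE, MAX_PRICE)
cleaned = sample[in_range]

print(f"Removed {(~in_range).sum()} of {len(sample)} rows")
print(cleaned["price"].describe())
```

Printing the number of removed rows before committing to the filter makes it easy to check that the chosen bounds do not discard a large share of the data.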

## Exploring the date columns

The date_crawled, last_seen, and ad_created columns are all stored as strings. Let's first explore them, so we can convert them to a format suited to our analysis.
 

In [17]:
autos[['date_crawled','ad_created','last_seen']][0:5]

|   | date_crawled | ad_created | last_seen |
|---|---|---|---|
| 0 | 2016-03-24 11:52:17 | 2016-03-24 00:00:00 | 2016-04-07 03:16:57 |
| 1 | 2016-03-24 10:58:45 | 2016-03-24 00:00:00 | 2016-04-07 01:46:50 |
| 2 | 2016-03-14 12:52:21 | 2016-03-14 00:00:00 | 2016-04-05 12:47:46 |
| 3 | 2016-03-17 16:54:04 | 2016-03-17 00:00:00 | 2016-03-17 17:40:17 |
| 4 | 2016-03-31 17:25:20 | 2016-03-31 00:00:00 | 2016-04-06 10:17:21 |


In [18]:
autos["date_crawled"].value_counts(normalize=True, dropna=False).sort_index().head(10)

2016-03-05 14:06:22    0.000003
2016-03-05 14:06:23    0.000003
2016-03-05 14:06:24    0.000008
2016-03-05 14:06:25    0.000005
2016-03-05 14:06:26    0.000003
2016-03-05 14:06:27    0.000005
2016-03-05 14:06:28    0.000003
2016-03-05 14:06:29    0.000005
2016-03-05 14:06:30    0.000005
2016-03-05 14:06:40    0.000003
Name: date_crawled, dtype: float64
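Counting full timestamps gives one bucket per second, which is too fine-grained to show a pattern. A possible next step, sketched below on a few made-up rows, is to parse the strings with `pd.to_datetime` and aggregate by calendar day. The `sample` frame and the variable names are assumptions for illustration.

```python
import pandas as pd

# A handful of timestamps in the same format as the date_crawled column.
sample = pd.DataFrame({
    "date_crawled": [
        "2016-03-24 11:52:17",
        "2016-03-24 10:58:45",
        "2016-03-14 12:52:21",
        "2016-03-17 16:54:04",
    ]
})

# Parse the strings into real timestamps.
crawled = pd.to_datetime(sample["date_crawled"], format="%Y-%m-%d %H:%M:%S")

# Reduce each timestamp to its day and compute the share of listings per day.
per_day = crawled.dt.strftime("%Y-%m-%d").value_counts(normalize=True).sort_index()
print(per_day)
```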

In [19]:
autos["registration_year"].describe()

count    371528.000000
mean       2004.577997
std          92.866598
min        1000.000000
25%        1999.000000
50%        2003.000000
75%        2008.000000
max        9999.000000
Name: registration_year, dtype: float64

The year in which the car was first registered indicates roughly how old the car is. Looking at this column, we note some odd values. The minimum value is 1000, long before cars were invented, and the maximum is 9999, many years into the future.

## Dealing with Incorrect Registration Year Data
Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.

One option is to remove the listings with these values. Let's determine what percentage of our data has invalid values in this column:

In [20]:
(~autos["registration_year"].between(1900,2016)).sum()/autos.shape[0]

0.03969552765874981


Given that this is less than 4% of our data, we will remove these rows.

In [21]:
autos=autos[autos["registration_year"].between(1900,2016)]
autos["registration_year"].value_counts(normalize=True).head(10)

2000    0.068813
1999    0.063812
2005    0.062548
2006    0.056702
2001    0.056668
2003    0.055701
2004    0.055345
2002    0.053784
1998    0.050314
2007    0.049535
Name: registration_year, dtype: float64

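It looks like most vehicles were registered in the past 20 years: the ten most common registration years all fall between 1998 and 2007 and together account for roughly 57% of the remaining listings.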

In [25]:
brand_counts = autos["brand"].value_counts(normalize=True)
common_brands = brand_counts[brand_counts > .05].index

brand_mean_price={}



for brand in common_brands:
    brand_only = autos[autos["brand"] == brand]
    mean_price = brand_only["price"].mean()
    brand_mean_price[brand] = int(mean_price)
        
brand_mean_price   

{'volkswagen': 13643,
 'bmw': 14798,
 'opel': 3248,
 'mercedes_benz': 17614,
 'audi': 16218,
 'ford': 8702}

**Observations**
- audi, mercedes_benz and bmw are the most expensive brands.
- volkswagen is a medium-priced brand.
- ford and then opel are the least expensive brands.

In [27]:

brand_mean_mileage={}



for brand in common_brands:
    brand_only = autos[autos["brand"] == brand]
    mean_mileage = brand_only["odometer_km"].mean()
    brand_mean_mileage[brand] = int(mean_mileage)
        
brand_mean_mileage   

{'volkswagen': 128337,
 'bmw': 132657,
 'opel': 128755,
 'mercedes_benz': 130580,
 'audi': 129491,
 'ford': 123618}

In [33]:
mean_mileage = pd.Series(brand_mean_mileage).sort_values(ascending=False)
mean_prices = pd.Series(brand_mean_price).sort_values(ascending=False)

brand_info = pd.DataFrame(mean_mileage,columns=['mean_mileage'])

brand_info["mean_price"] = mean_prices
brand_info


|   | mean_mileage | mean_price |
|---|---|---|
| bmw | 132657 | 14798 |
| mercedes_benz | 130580 | 17614 |
| audi | 129491 | 16218 |
| opel | 128755 | 3248 |
| volkswagen | 128337 | 13643 |
| ford | 123618 | 8702 |
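The two loops and the manual assembly of `brand_info` could also be expressed as a single `groupby` aggregation. The sketch below shows that alternative on a made-up frame; it is not the approach used above. Because the sample has only six rows, it uses a 20% brand-share cutoff instead of the 5% used on the full data, and all values are invented.

```python
import pandas as pd

# Invented listings for four brands.
sample = pd.DataFrame({
    "brand": ["bmw", "bmw", "opel", "opel", "ford", "lada"],
    "price": [15000, 13000, 3000, 3500, 8000, 1200],
    "odometer_km": [150000, 125000, 150000, 100000, 125000, 50000],
})

# Keep brands above an illustrative 20% share (the notebook uses 5% on the full data).
brand_share = sample["brand"].value_counts(normalize=True)
common_brands = brand_share[brand_share > 0.2].index

brand_info = (
    sample[sample["brand"].isin(common_brands)]
    .groupby("brand")[["odometer_km", "price"]]
    .mean()
    .rename(columns={"odometer_km": "mean_mileage", "price": "mean_price"})
    .sort_values("mean_mileage", ascending=False)
)
print(brand_info)
```

Computing both means in one pass keeps mileage and price aligned on the same index, so there is no need to build two dictionaries and join them afterwards.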


## Conclusion
We have utilized pandas and NumPy to quickly and easily clean, sort, and manipulate a large set of data. There were many things that required cleaning, as well as things that were technically fine, but needed to be changed for improved readability. We made heavy use of DataFrames in this project, as well as boolean index masking and aggregation.

In terms of cars, it should be no surprise that there are usually three "tiers": cheap, average, and expensive. Some cheap cars have low mileage, for example Trabant and Lada. Some expensive cars such as Sonstige Autos and Mini also have rather decent mileage for their price tag.

There are other factors at play as well. Well-known premium brands such as Porsche will typically command higher prices, and cars with unrepaired damage will typically be cheaper.

Finally, mean mileage varies far less by brand than mean price does: for the top brands, all mean mileages fall within 10% of one another. There is a slight trend toward the more expensive brands having higher mileage and the less expensive brands having lower mileage, although opel, the cheapest of the top brands, is an exception.