# Exploring eBay Car Sales Data

The goal of this project is to use data cleaning skills to get the car sales data into a usable form. The data was acquired from Kaggle; this is a smaller sample with 50,000 rows.

The dataset's columns include the name of the car, seller, offer type, price, vehicle type, model, odometer, brand, number of pictures and more.

In [1]:
import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv', encoding='Latin-1')

#importing pandas, numpy and the dataset

In [2]:
autos.info()

# analyzing the columns for null values and data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  date

In [3]:
autos.head()

#viewing the first 5 rows

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,odometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


### Analyzing Data Issues
Columns with null values include notRepairedDamage, fuelType, vehicleType, model, and gearbox.

Issues with data types include: price is stored as text with a dollar sign and thousands separators, and odometer is stored as text with a km suffix. Both need to be converted to numeric types.
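
As a rough illustration of how the null counts can be read off directly, here is a minimal sketch on a toy frame. The `sample` DataFrame and its values are assumptions made up for this example, not rows from the real file.

```python
import pandas as pd

# Toy frame standing in for autos; the values are illustrative only.
sample = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$0"],
    "odometer": ["150,000km", "70,000km", "5,000km"],
    "gearbox": ["manuell", None, "automatik"],
})

# Count missing values per column, largest first.
null_counts = sample.isnull().sum().sort_values(ascending=False)
print(null_counts)
```

Running the same two calls on `autos` should list the columns with missing values at the top.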

In [4]:
autos.columns

#print existing columns

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [5]:
updated_headers = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'powerPS', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']

autos.columns = updated_headers
autos.head()

#updating column names to snake_case and renaming a few for clarity

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,powerPS,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,"$5,000",control,bus,2004,manuell,158,andere,"150,000km",3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,"$8,500",control,limousine,1997,automatik,286,7er,"150,000km",6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,"$8,990",test,limousine,2009,manuell,102,golf,"70,000km",7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,"$4,350",control,kleinwagen,2007,automatik,71,fortwo,"70,000km",6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,"$1,350",test,kombi,2003,manuell,0,focus,"150,000km",7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


Updating the column names keeps the formatting consistent and makes the variables easier to identify for the user or anyone interpreting the analysis.
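
The camelCase to snake_case part of the renaming could also be done programmatically. The sketch below uses a hypothetical helper, `camel_to_snake`, which is not part of the notebook. It only handles the mechanical conversion, so custom names such as registration_year or unrepaired_damage would still need an explicit mapping.

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each capital that follows a lowercase
    letter or digit, then lowercase the result."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

original = ["dateCrawled", "offerType", "nrOfPictures", "postalCode"]
converted = [camel_to_snake(col) for col in original]
print(converted)
# ['date_crawled', 'offer_type', 'nr_of_pictures', 'postal_code']
```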

In [6]:
autos.describe(include = 'all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,powerPS,model,odometer,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
count,50000,50000,50000,50000,50000,50000,44905,50000.0,47320,50000.0,47242,50000,50000.0,45518,50000,40171,50000,50000.0,50000.0,50000
unique,48213,38754,2,2,2357,2,8,,2,,245,13,,7,40,2,76,,,39481
top,2016-03-19 17:36:18,Ford_Fiesta,privat,Angebot,$0,test,limousine,,manuell,,golf,"150,000km",,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-07 06:17:27
freq,3,78,49999,49999,1421,25756,12859,,36993,,4024,32424,,30107,10687,35232,1946,,,8
mean,,,,,,,,2005.07328,,116.35592,,,5.72336,,,,,0.0,50813.6273,
std,,,,,,,,105.712813,,209.216627,,,3.711984,,,,,0.0,25779.747957,
min,,,,,,,,1000.0,,0.0,,,0.0,,,,,0.0,1067.0,
25%,,,,,,,,1999.0,,70.0,,,3.0,,,,,0.0,30451.0,
50%,,,,,,,,2003.0,,105.0,,,6.0,,,,,0.0,49577.0,
75%,,,,,,,,2008.0,,150.0,,,9.0,,,,,0.0,71540.0,


Based on the descriptive statistics above, registration_year has some incorrect values, as the min is 1000 and the max is 9999. registration_month probably encodes missing values as 0, since the min is 0. nr_of_pictures is 0 in every row, so this column could likely be dropped. Additionally, almost all values in seller and offer_type are the same (49,999 of 50,000 rows share the top value).
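
One possible way to flag such low-information columns is to look at the share of rows taken by each column's most frequent value. This is only a sketch: the toy `sample` frame and the 75% threshold are assumptions chosen for the example.

```python
import pandas as pd

# Toy frame; values are illustrative only.
sample = pd.DataFrame({
    "seller": ["privat", "privat", "privat", "other"],
    "nr_of_pictures": [0, 0, 0, 0],
    "brand": ["bmw", "ford", "audi", "opel"],
})

# Share of rows taken by the most frequent value in each column.
top_share = sample.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Hypothetical threshold: flag columns where one value covers at least 75% of rows.
low_information = top_share[top_share >= 0.75].index.tolist()
print(low_information)
# ['seller', 'nr_of_pictures']
```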

### Cleaning Data

In [7]:
autos["price"] = autos["price"].str.replace('$','', regex=False)
autos["price"] = autos["price"].str.replace(',','', regex=False)
autos["price"] = autos["price"].astype(int)

#cleaning the price column and converting to an integer

In [8]:
autos["odometer"] = autos["odometer"].str.replace('km','', regex=False)
autos["odometer"] = autos["odometer"].str.replace(',','', regex=False)
autos["odometer"] = autos["odometer"].astype(int)

#cleaning the odometer column and converting to an integer
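
The two cleaning cells above follow the same pattern, so they could be folded into one small function. The sketch below uses a hypothetical helper, `to_int`, on made-up Series. Passing `regex=False` makes sure `$` is treated as a literal character rather than a regular-expression anchor.

```python
import pandas as pd

def to_int(series, tokens):
    """Strip each literal token from a text column, then cast to int."""
    cleaned = series
    for token in tokens:
        # regex=False treats "$" as a plain character, not an end-of-string anchor.
        cleaned = cleaned.str.replace(token, "", regex=False)
    return cleaned.astype(int)

# Illustrative inputs in the same format as the price and odometer columns.
prices = to_int(pd.Series(["$5,000", "$8,990", "$0"]), ["$", ","])
distances = to_int(pd.Series(["150,000km", "70,000km"]), ["km", ","])
print(prices.tolist(), distances.tolist())
# [5000, 8990, 0] [150000, 70000]
```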

In [9]:
autos.rename(columns={'odometer' : 'odometer_km'}, inplace=True)

#renaming the odometer column to include the unit of measure

In [10]:
autos.head()

#confirming updates are correct

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,powerPS,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,nr_of_pictures,postal_code,last_seen
0,2016-03-26 17:47:46,Peugeot_807_160_NAVTECH_ON_BOARD,privat,Angebot,5000,control,bus,2004,manuell,158,andere,150000,3,lpg,peugeot,nein,2016-03-26 00:00:00,0,79588,2016-04-06 06:45:54
1,2016-04-04 13:38:56,BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik,privat,Angebot,8500,control,limousine,1997,automatik,286,7er,150000,6,benzin,bmw,nein,2016-04-04 00:00:00,0,71034,2016-04-06 14:45:08
2,2016-03-26 18:57:24,Volkswagen_Golf_1.6_United,privat,Angebot,8990,test,limousine,2009,manuell,102,golf,70000,7,benzin,volkswagen,nein,2016-03-26 00:00:00,0,35394,2016-04-06 20:15:37
3,2016-03-12 16:58:10,Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...,privat,Angebot,4350,control,kleinwagen,2007,automatik,71,fortwo,70000,6,benzin,smart,nein,2016-03-12 00:00:00,0,33729,2016-03-15 03:16:28
4,2016-04-01 14:38:50,Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...,privat,Angebot,1350,test,kombi,2003,manuell,0,focus,150000,7,benzin,ford,nein,2016-04-01 00:00:00,0,39218,2016-04-01 14:38:50


### Exploring Outliers 

In [11]:
autos["odometer_km"].unique().shape
autos["odometer_km"].describe()
autos["odometer_km"].value_counts(ascending = False)

#examining the descriptive statistics and value counts of odometer_km

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64

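Based on the data above, the odometer readings take only 13 distinct values, all multiples of 5,000, which suggests they were rounded or chosen from preset options. There are no evident issues with the numbers. Most of the vehicles are high mileage: 32,424 of the 50,000 rows sit at 150,000 km.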
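
The rounding observation can be checked directly rather than by eye. A minimal sketch, using a made-up `odometer_km` Series in place of the real column:

```python
import pandas as pd

# Illustrative readings; the real column has 13 distinct values.
odometer_km = pd.Series([150000, 150000, 125000, 5000, 150000, 90000])

# True if every reading is an exact multiple of 5,000.
all_multiples = (odometer_km % 5000 == 0).all()

# Share of rows taken by the single most common reading.
top_share = odometer_km.value_counts(normalize=True).iloc[0]
print(all_multiples, top_share)
```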

In [12]:
autos["price"].unique().shape
autos["price"].describe()

#examining the number of unique values and descriptive statistics of price

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

The minimum and maximum look odd: the minimum is 0 and the maximum is roughly 100 million. This needs to be investigated further, as such values could skew the analysis.

In [13]:
autos["price"].value_counts().head(10)

#reviewing the 10 most common prices

0       1421
500      781
1500     734
2500     643
1000     639
1200     639
600      531
800      498
3500     498
2000     460
Name: price, dtype: int64

A price of 0 shows up 1,421 times, which we may want to remove. The most common prices are all round numbers, so it looks like the prices may have been rounded as well.

In [14]:
autos["price"].value_counts().sort_index(ascending=False).head(20)

#shows the 20 highest prices with their counts

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

In [15]:
autos["price"].value_counts().sort_index(ascending=True).head(20)

#shows the 20 lowest prices with their counts

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64

There are some unreasonably high and low prices, with many more low than high. Anything above 350,000 looks too high to be legitimate. At the low end, eBay is an auction site, so it is possible that cars were listed at $1. I will keep prices from 1 to 350,000 and remove everything outside this range.
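
To make the effect of the range filter concrete, here is a small sketch on a handful of prices taken from the value counts above. `Series.between` is inclusive on both ends by default, so 1 and 350,000 are both kept.

```python
import pandas as pd

# A few prices from the extremes of the distribution shown above.
prices = pd.Series([0, 1, 500, 350000, 999990, 99999999])

# Boolean mask: True where the price lies in [1, 350000].
keep = prices.between(1, 350000)
filtered = prices[keep]
removed = len(prices) - len(filtered)
print(filtered.tolist(), removed)
# [1, 500, 350000] 3
```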

In [16]:
autos = autos[autos["price"].between(1,350000)]
autos["price"].describe()

#removing outlier data and reviewing the updated descriptive stats

count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64

### Exploring Date Data

In [17]:
date_cols = autos[['date_crawled','ad_created','last_seen']][0:5]

In [18]:
print(autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).sort_index())

#shows dates crawled with percentages sorted by date

2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64


All date_crawled values fall between March 5 and April 7, 2016, with a roughly similar share of listings crawled on most days.

In [19]:
print(autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index())

#shows dates ad_created with percentages sorted by date

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
                ...   
2016-04-03    0.038855
2016-04-04    0.036858
2016-04-05    0.011819
2016-04-06    0.003253
2016-04-07    0.001256
Name: ad_created, Length: 76, dtype: float64


Some ads were created as early as June 2015, well before the crawl period in March and April of 2016, although there are very few of them based on the percentages.

In [20]:
print(autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index())

#shows dates last_seen with percentages sorted by date

2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64


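All last_seen values also fall between March 5 and April 7, 2016. The last three days account for roughly 48% of the values, which more likely reflects the end of the crawl period than a sudden spike in sales.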
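
The cells above slice the first 10 characters of each timestamp string. An alternative, sketched here under the assumption that every value follows the `YYYY-MM-DD HH:MM:SS` pattern, is to parse the column with `pd.to_datetime` and group by calendar date. The `last_seen` Series below is made up for the example.

```python
import pandas as pd

# Illustrative timestamps in the same format as the date columns.
last_seen = pd.Series([
    "2016-03-05 14:45:46",
    "2016-04-06 06:45:54",
    "2016-04-06 20:15:37",
    "2016-04-07 06:17:27",
])

# Parse the strings into real timestamps.
parsed = pd.to_datetime(last_seen, format="%Y-%m-%d %H:%M:%S")

# Share of rows per calendar day, in chronological order.
daily_share = parsed.dt.date.value_counts(normalize=True).sort_index()
print(daily_share)
print(parsed.min(), parsed.max())
```

Parsed timestamps also make the overall date range available through `min()` and `max()` without scanning the full distribution.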

In [21]:
autos["registration_year"].describe()

count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

Based on the descriptive statistics above, there are incorrect values in the data, as the max is 9999 and the min is 1000. I will keep 1900 through 2016: a car cannot have been registered after its listing was crawled in 2016, and 1900 is a generous lower bound.

In [22]:
autos = autos[autos["registration_year"].between(1900,2016)]
 
#removing inaccurate registration data

In [23]:
print(len(autos))

#updated total row count

46681


Removing the inaccurate registration data dropped 1,884 of the 48,565 remaining rows (about 4%), which should not be an issue.
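
The size of the reduction follows from the two row counts reported above. As a quick check of the arithmetic:

```python
rows_after_price_filter = 48565   # count reported by price.describe() after the price filter
rows_after_year_filter = 46681    # len(autos) after the registration_year filter

removed = rows_after_price_filter - rows_after_year_filter
share_removed = removed / rows_after_price_filter
print(removed, f"{share_removed:.1%}")
# 1884 3.9%
```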

In [24]:
autos["registration_year"].value_counts(normalize=True)

#shows the updated registration year data with percentages

2000    0.067608
2005    0.062895
1999    0.062060
2004    0.057904
2003    0.057818
          ...   
1938    0.000021
1948    0.000021
1927    0.000021
1931    0.000021
1952    0.000021
Name: registration_year, Length: 78, dtype: float64

### Exploring Car Brands

In [25]:
autos["brand"].value_counts(normalize=True)

volkswagen        0.211264
bmw               0.110045
opel              0.107581
mercedes_benz     0.096463
audi              0.086566
ford              0.069900
renault           0.047150
peugeot           0.029841
fiat              0.025642
seat              0.018273
skoda             0.016409
nissan            0.015274
mazda             0.015188
smart             0.014160
citroen           0.014010
toyota            0.012703
hyundai           0.010025
sonstige_autos    0.009811
volvo             0.009147
mini              0.008762
mitsubishi        0.008226
honda             0.007840
kia               0.007069
alfa_romeo        0.006641
porsche           0.006127
suzuki            0.005934
chevrolet         0.005698
chrysler          0.003513
dacia             0.002635
daihatsu          0.002506
jeep              0.002271
subaru            0.002142
land_rover        0.002099
saab              0.001649
jaguar            0.001564
daewoo            0.001500
trabant           0.001392
r

In [26]:
top_10_brands = autos["brand"].value_counts(normalize=True).head(10).index

#takes the 10 most common brands by share, then keeps only the brand names via the index attribute

In [27]:
mean_price = {}

for brand in top_10_brands:
    price = autos.loc[autos["brand"]== brand, "price"].mean()
    mean_price[brand] = price
    
mean_price

#calculating the mean price by auto brand

{'volkswagen': 5402.410261610221,
 'bmw': 8332.820517811953,
 'opel': 2975.2419354838707,
 'mercedes_benz': 8628.450366422385,
 'audi': 9336.687453600594,
 'ford': 3749.4695065890287,
 'renault': 2474.8646069968195,
 'peugeot': 3094.0172290021537,
 'fiat': 2813.748538011696,
 'seat': 4397.230949589683}

In [29]:
price_series = pd.Series(mean_price).sort_values(ascending = False)
print(price_series)

#converting the dictionary to a series to sort

audi             9336.687454
mercedes_benz    8628.450366
bmw              8332.820518
volkswagen       5402.410262
seat             4397.230950
ford             3749.469507
peugeot          3094.017229
opel             2975.241935
fiat             2813.748538
renault          2474.864607
dtype: float64


Volkswagen is the most popular brand, making up 21% of all listings (nearly twice the share of any other brand), and its mean price sits in the middle of the top 10 brands. Audi, Mercedes-Benz and BMW are more expensive but still have a high share of listings. Ford, Fiat and Seat are on the lower end in price and less popular, while Opel is low-priced yet the third most common brand.
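
The loop-and-dictionary approach above works well. For comparison, the same per-brand mean can be sketched with `groupby`, shown here on a toy frame with made-up prices and only the two most common brands kept.

```python
import pandas as pd

# Toy listings; brands and prices are illustrative only.
sample = pd.DataFrame({
    "brand": ["audi", "audi", "opel", "opel", "fiat"],
    "price": [9000, 11000, 2000, 4000, 2500],
})

# Names of the two most common brands in the toy data.
top_brands = sample["brand"].value_counts().head(2).index

# Mean price per brand, restricted to those brands, highest first.
mean_price = (
    sample[sample["brand"].isin(top_brands)]
    .groupby("brand")["price"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_price)
```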

In [30]:
mean_mileage = {}

for brand in top_10_brands:
    mileage = autos.loc[autos["brand"]== brand, "odometer_km"].mean()
    mean_mileage[brand] = mileage
    
mean_mileage

#creating a top 10 common brand mean mileage data dictionary

{'volkswagen': 128707.15879132022,
 'bmw': 132572.51313996495,
 'opel': 129310.0358422939,
 'mercedes_benz': 130788.36331334666,
 'audi': 129157.38678544914,
 'ford': 124266.01287159056,
 'renault': 128071.33121308497,
 'peugeot': 127153.62526920316,
 'fiat': 117121.9715956558,
 'seat': 121131.30128956624}

In [31]:
mileage_series = pd.Series(mean_mileage)
combined_df = pd.DataFrame(mileage_series, columns=['mean_mileage'])
combined_df["price"] = price_series

#converts the mean mileage dictionary to a series, the series to a dataframe, then adds the price series as a second column

In [35]:
combined_df.sort_values("price")

Unnamed: 0,mean_mileage,price
renault,128071.331213,2474.864607
fiat,117121.971596,2813.748538
opel,129310.035842,2975.241935
peugeot,127153.625269,3094.017229
ford,124266.012872,3749.469507
seat,121131.30129,4397.23095
volkswagen,128707.158791,5402.410262
bmw,132572.51314,8332.820518
mercedes_benz,130788.363313,8628.450366
audi,129157.386785,9336.687454


The mean mileage is fairly similar across brands, with Fiat and Seat somewhat lower than the rest. If I had to buy a car based on this data, I would go with a Volkswagen because it is about in the middle of the pack price-wise and it is by far the most common brand listed.
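
The combined table can also be built in a single step with named aggregation, which avoids the intermediate dictionaries and series. This is a sketch on a toy frame; the values are assumptions made up for the example.

```python
import pandas as pd

# Toy listings; values are illustrative only.
sample = pd.DataFrame({
    "brand": ["bmw", "bmw", "fiat", "fiat"],
    "price": [8000, 9000, 2500, 3500],
    "odometer_km": [150000, 125000, 100000, 125000],
})

# One row per brand with both means, sorted from cheapest to most expensive.
summary = (
    sample.groupby("brand")
    .agg(mean_mileage=("odometer_km", "mean"), mean_price=("price", "mean"))
    .sort_values("mean_price")
)
print(summary)
```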