# Exploring Car Sales Listings - Provided by eBay

**Data Cleaning and Observation Project**

eBay Kleinanzeigen, the classifieds section of the German eBay website, has provided data on 371,528 car listings. The goal of this project is to clean that data as thoroughly as possible. The data is initially presented in German and contains errors.

## Brief Information about Dataset

In [2]:
# Import pandas library
import pandas as pd

# Open file and import dataset as a dataframe
autos = pd.read_csv('autos.csv', encoding = "Latin-1")

# Display information about the dataset
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64 
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64 
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64 
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64 
 12  monthOfRegistration  371528 non-null  int64 
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-nu

In [3]:
# Display the first 3 rows
autos.head(3)

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46


From the information presented above, we can conclude that the columns hold two data types, `int64` and `object`. Five columns contain null values: `model`, `vehicleType`, `gearbox`, `fuelType` and `notRepairedDamage`, with `20,484`, `37,869`, `20,209`, `33,386` and `72,060` null values respectively. As mentioned in the project README, the dataset also contains German words that need to be translated into English. In addition, the column names are written in camelcase and will be converted to snakecase.
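The null counts quoted above can also be computed directly instead of being read off the `info()` output. The following is a minimal sketch on a small hypothetical dataframe (the name `sample` and its values are illustrative and not taken from the dataset); on the real data the same `isnull().sum()` call would presumably be applied to `autos`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the autos dataframe
sample = pd.DataFrame({
    "vehicleType": [None, "coupe", "suv", "kleinwagen"],
    "gearbox": ["manuell", "manuell", "automatik", None],
    "price": [480, 18300, 9800, 1500],
})

# Count missing values per column and keep only the columns that have any
null_counts = sample.isnull().sum()
null_counts = null_counts[null_counts > 0]
print(null_counts)

# Cross-check: nulls = total rows - non-null count, which is what info() reports
manual_check = len(sample) - sample.count()
print(manual_check["gearbox"])
```

The cross-check mirrors the arithmetic behind the figures above: for `model`, 371,528 total entries minus 351,044 non-null entries gives 20,484 nulls.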


## Column Replacement - Snakecase

In [4]:
# Print the autos dataframe columns
print(autos.columns)

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'kilometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')


In [5]:
# Rename the autos dataframe columns
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'n_of_pictures', 'postal_code', 'last_seen']

# Print snakecase version of autos dataframe columns
print(autos.columns)

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'n_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')


In [6]:
autos.head(4)

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,n_of_pictures,postal_code,last_seen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17


The original column names were written in camelcase. As the dataframe above shows, they have been renamed and now follow the preferred snakecase format.
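The renaming above was done by hand, which also allowed some names to be reworded (for example `yearOfRegistration` became `registration_year`). For a purely mechanical camelcase-to-snakecase conversion, a small helper can do the work. This is a sketch only; `camel_to_snake` is a hypothetical helper, not part of the project, and it cannot reproduce the reworded names.

```python
import re

def camel_to_snake(name):
    """Insert an underscore before a capital that follows a lowercase letter or digit, then lowercase."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

# A few of the original column names
columns = ["dateCrawled", "offerType", "powerPS", "nrOfPictures", "abtest"]
converted = [camel_to_snake(c) for c in columns]
print(converted)
```

Note that the mechanical result for `nrOfPictures` is `nr_of_pictures`, whereas the manual list above uses `n_of_pictures`.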

## Brief Exploration of Dataset

Below we will take a further look at the data to see what additional tasks need to be carried out to complete cleaning this dataset.

In [7]:
autos.describe(include = 'all')

Unnamed: 0,date_crawled,name,seller,offer_type,price,abtest,vehicle_type,registration_year,gearbox,power_ps,model,odometer_km,registration_month,fuel_type,brand,unrepaired_damage,ad_created,n_of_pictures,postal_code,last_seen
count,371528,371528,371528,371528,371528.0,371528,333659,371528.0,351319,371528.0,351044,371528.0,371528.0,338142,371528,299468,371528,371528.0,371528.0,371528
unique,280500,233531,2,2,,2,8,,2,,251,,,7,40,2,114,,,182806
top,2016-03-24 14:49:47,Ford_Fiesta,privat,Angebot,,test,limousine,,manuell,,golf,,,benzin,volkswagen,nein,2016-04-03 00:00:00,,,2016-04-06 13:45:54
freq,7,657,371525,371516,,192585,95894,,274214,,30070,,,223857,79640,263182,14450,,,17
mean,,,,,17295.14,,,2004.577997,,115.549477,,125618.688228,5.734445,,,,,0.0,50820.66764,
std,,,,,3587954.0,,,92.866598,,192.139578,,40112.337051,3.712412,,,,,0.0,25799.08247,
min,,,,,0.0,,,1000.0,,0.0,,5000.0,0.0,,,,,0.0,1067.0,
25%,,,,,1150.0,,,1999.0,,70.0,,125000.0,3.0,,,,,0.0,30459.0,
50%,,,,,2950.0,,,2003.0,,105.0,,150000.0,6.0,,,,,0.0,49610.0,
75%,,,,,7200.0,,,2008.0,,150.0,,150000.0,9.0,,,,,0.0,71546.0,


The first cell below uses `value_counts()` to look at the columns one at a time. Only one column appears because the analysis has already been completed; the now obsolete line is kept, commented out, to document the step. The second cell uses `apply(type)` to display the Python type of each value in the `price` column.

In [8]:
#autos['postal_code'].value_counts()

In [9]:
#autos['price'].apply(type)
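To illustrate what these two inspection steps reveal, here is a minimal sketch on hypothetical data (the series `codes` and `prices` are made up for the example). It reports type names through a small lambda rather than the raw type objects so that the result is easy to index.

```python
import pandas as pd

# Hypothetical postal codes: value_counts() shows how often each value occurs
codes = pd.Series([70435, 66954, 70435])
code_counts = codes.value_counts()
print(code_counts)

# Hypothetical column mixing integers and strings, as can happen with messy data
prices = pd.Series([480, "18300", 9800, "1,500"])

# Name of the Python type of every individual value, then how often each occurs
type_counts = prices.apply(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

A column whose values are all of one type produces a single row in `type_counts`; more than one row signals that a conversion is needed.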

After this analysis, I have made a list of the changes that need to be made to this dataset:
* Convert each column to the proper datatype: `price`, `vehicle_type`, `model`, `odometer_km`, `fuel_type`, `unrepaired_damage`
* Columns that have mostly one value and are candidates to be dropped: `seller`, `offer_type`
* Separate/format time and date: `ad_created`, `date_crawled`; format time and date: `last_seen`
* Address special columns with formatting errors: `price`, `power_ps`
* Convert German words to English: `seller`, `offer_type`, `gearbox`, `fuel_type`, `unrepaired_damage`

## Data Cleaning

**Convert each column to proper datatype `price`, `vehicle_type`, `model`, `odometer_km`, `fuel_type`, `unrepaired_damage`**

In [None]:
# Convert column to a string and remove additional characters - Convert to an integer datatype
# regex=False treats "$" as a literal character rather than a regex end-of-string anchor
autos['price'] = autos['price'].astype(str).str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(int)

autos['odometer_km'] = autos['odometer_km'].astype(str).str.replace("km", "", regex=False).str.replace(",", "", regex=False).astype(int)

# Convert datatype to a string
# Note: depending on the pandas version, astype(str) may turn missing values into the literal string 'nan'
str_columns = ['vehicle_type', 'model', 'fuel_type', 'unrepaired_damage']
autos[str_columns] = autos[str_columns].astype(str)

#autos.loc[:,'model'] = autos.loc[:,'model'].astype(str)
#autos.loc[:,'fuel_type'] = autos.loc[:,'fuel_type'].astype(str)
#autos.loc[:,'unrepaired_damage'] = autos['unrepaired_damage'].astype(str)
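In pandas versions where `str.replace` treats its pattern as a regular expression by default, `"$"` matches the end of the string rather than a dollar sign, so the symbol is never removed. The sketch below, on hypothetical price strings (the real `price` column is already `int64`), strips the characters literally by passing `regex=False`.

```python
import pandas as pd

# Hypothetical raw price strings, for illustration only
raw_prices = pd.Series(["$5,000", "$8,500", "$1,150"])

cleaned = (
    raw_prices
    .str.replace("$", "", regex=False)   # literal dollar sign, not the regex anchor
    .str.replace(",", "", regex=False)   # thousands separator
    .astype(int)
)
print(cleaned.tolist())
```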

**Columns that have mostly one value and are candidates to be dropped `seller`, `offer_type`**

In [11]:
# Determine unique values in each column
s = autos['seller'].unique()
ot = autos['offer_type'].unique()

# Determine distribution for each unique value in each column
s_dict = {}
for i in autos['seller']:
    if i not in s_dict:
        s_dict[i] = 1
    else:
        s_dict[i] += 1

ot_dict = {}
for i in autos['offer_type']:
    if i not in ot_dict:
        ot_dict[i] = 1
    else:
        ot_dict[i] += 1

# Display Results
print("'Seller': {0} Count: ".format(s) + str(s_dict.values())) 
print("'Offer Type': {0} Count: ".format(ot) + str(ot_dict.values()))

'Seller': ['privat' 'gewerblich'] Count: dict_values([371525, 3])
'Offer Type': ['Angebot' 'Gesuch'] Count: dict_values([371516, 12])


Given that the counts for "gewerblich" and "Gesuch" in the `seller` and `offer_type` columns are far below 1% of all records, these records will be removed from the dataset.
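The counting loops above could also be expressed with `value_counts()`, which returns the count per unique value and, with `normalize=True`, the share of each value. The following is a sketch on a small hypothetical series (`seller` here is made up), followed by the percentages implied by the counts printed above.

```python
import pandas as pd

# Hypothetical column with one dominant value
seller = pd.Series(["privat"] * 9 + ["gewerblich"])

counts = seller.value_counts()                 # absolute count per value
shares = seller.value_counts(normalize=True)   # fraction of all rows per value
print(counts)
print(shares)

# Percentages of the rare values, using the counts reported in the output above
gewerblich_pct = 3 / 371528 * 100
gesuch_pct = 12 / 371528 * 100
print(f"gewerblich: {gewerblich_pct:.4f}%  Gesuch: {gesuch_pct:.4f}%")
```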

In [12]:
# Remove all 'gewerblich' records from the defined column - Display Results
autos = autos[autos['seller'] != 'gewerblich']
autos['seller'].unique()

array(['privat'], dtype=object)

In [None]:
# Remove all records from the defined column - Display Results
autos = autos[autos['offer_type'] != 'Gesuch']
autos['offer_type'].unique()
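Once the rare records are gone, `seller` and `offer_type` each hold a single value, which is what makes them candidates to be dropped. A minimal sketch of both steps on a hypothetical miniature dataframe (the name `sample` and its rows are illustrative, and whether to drop the columns remains a judgment call):

```python
import pandas as pd

# Hypothetical miniature stand-in for the autos dataframe
sample = pd.DataFrame({
    "seller": ["privat", "gewerblich", "privat", "privat"],
    "offer_type": ["Angebot", "Angebot", "Gesuch", "Angebot"],
    "price": [480, 18300, 9800, 1500],
})

# Keep only private offers, combining both conditions in one boolean mask
mask = (sample["seller"] != "gewerblich") & (sample["offer_type"] != "Gesuch")
filtered = sample[mask]

# Columns left with a single unique value no longer carry information
constant_cols = [c for c in filtered.columns if filtered[c].nunique() == 1]
reduced = filtered.drop(columns=constant_cols)
print(constant_cols)
print(reduced)
```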

**Separate/Format time and date `ad_created`, `date_crawled`; format time and date `last_seen`**

In [None]:
# Separate the date from the time in the defined column and parse it as a datetime
# (pandas, imported above as pd, handles the parsing, so the datetime library is not needed)
def separate_date_time(column_choice):
    # Keep only the date portion that precedes the space, e.g. "2016-03-24"
    choice_date = column_choice.astype(str).str.split().str[0]
    return pd.to_datetime(choice_date, format="%Y-%m-%d")

# Call function and store the results back in the dataframe
autos["ad_created"] = separate_date_time(autos["ad_created"])
autos["date_crawled"] = separate_date_time(autos["date_crawled"])

# Format defined column by date and time
autos["last_seen"] = pd.to_datetime(autos["last_seen"], format="%Y-%m-%d %H:%M:%S")

# Display Results
print(autos["date_crawled"])
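For reference, the date and time parts of a timestamp column can be separated after parsing through the `.dt` accessor. This is a sketch under the assumption that the values follow the `YYYY-MM-DD HH:MM:SS` layout seen in the rows above; the series `last_seen` here is a hypothetical two-row sample.

```python
import pandas as pd

# Hypothetical sample in the same layout as the dataset's timestamp columns
last_seen = pd.Series(["2016-04-07 03:16:57", "2016-04-05 12:47:46"])

# Parse the strings into datetime values
parsed = pd.to_datetime(last_seen, format="%Y-%m-%d %H:%M:%S")

dates = parsed.dt.date   # calendar date only
times = parsed.dt.time   # time of day only
print(dates.tolist())
print(times.tolist())
```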

**Address special columns - formatting errors `price`, `power_ps`**
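The summary statistics earlier show a minimum of 0 for both `price` and `power_ps`, and the first listing has a `power_ps` of 0. A first step could be to count such entries. The sketch below uses a hypothetical miniature dataframe; treating 0 as a placeholder rather than a real value is an assumption, not something the dataset confirms.

```python
import pandas as pd

# Hypothetical miniature stand-in for the autos dataframe
sample = pd.DataFrame({
    "price": [480, 0, 9800, 1500],
    "power_ps": [0, 190, 163, 75],
})

# Number of zero entries per column
zero_counts = (sample[["price", "power_ps"]] == 0).sum()
print(zero_counts)

# Rows where both values are non-zero
plausible = sample[(sample["price"] > 0) & (sample["power_ps"] > 0)]
print(plausible)
```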

**Convert German words to English language `seller`, `offer_type`, `gearbox`, `fuel_type`, `unrepaired_damage`**

In [None]:
autos.
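One possible way to carry out the translation step is a nested dictionary passed to `DataFrame.replace`, mapping each column to its German-to-English pairs. This is a sketch only: the `translations` dictionary, the choice of English wording and the `sample` dataframe are illustrative assumptions, and only German values that appear in the outputs above are covered.

```python
import pandas as pd

# Hypothetical miniature stand-in for the autos dataframe
sample = pd.DataFrame({
    "seller": ["privat", "gewerblich"],
    "offer_type": ["Angebot", "Gesuch"],
    "gearbox": ["manuell", "automatik"],
    "fuel_type": ["benzin", "diesel"],
    "unrepaired_damage": ["ja", "nein"],
})

# Column name -> {German value: English value}; unlisted values are left unchanged
translations = {
    "seller": {"privat": "private", "gewerblich": "commercial"},
    "offer_type": {"Angebot": "offer", "Gesuch": "request"},
    "gearbox": {"manuell": "manual", "automatik": "automatic"},
    "fuel_type": {"benzin": "petrol"},
    "unrepaired_damage": {"ja": "yes", "nein": "no"},
}

translated = sample.replace(translations)
print(translated)
```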