# Data Scrubbing and Observation Report - Provided by eBay

**Data Cleaning and Observation Project**

eBay Kleinanzeigen, the classifieds section of the German eBay website, has provided data on vehicle listings; the file loaded below contains 371,528 rows. The first goal of this project is to clean that data. The data is initially presented in German and contains errors.

## Brief Information about Dataset

In [1]:
# Import pandas library
import pandas as pd

# Open file and import dataset as a dataframe
autos = pd.read_csv('autos.csv', encoding = "Latin-1")

# Display information about the dataset
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64 
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64 
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64 
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64 
 12  monthOfRegistration  371528 non-null  int64 
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-nu

In [None]:
# Display the first 3 rows
autos.head(3)

From the information presented above, we can conclude that the dataset contains two data types, `int64` and `object`. Five columns contain null values: `model`, `vehicleType`, `gearbox`, `fuelType` and `notRepairedDamage`, with `20,484`, `37,869`, `20,209`, `33,386` and `72,060` null values respectively. As mentioned in the project README, the German words also need to be translated into English. In addition, the column names are written in camelcase and will be converted to snakecase.
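The null counts quoted above can also be read off directly instead of subtracting the non-null counts from the row total. A minimal sketch, using a small made-up frame (`sample` is hypothetical and only stands in for `autos`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the autos dataframe
sample = pd.DataFrame({
    "vehicleType": ["limousine", np.nan, "kombi", np.nan],
    "gearbox": ["manuell", "automatik", np.nan, "manuell"],
    "brand": ["bmw", "audi", "opel", "ford"],
})

# Count missing values per column and keep only the columns that have any
null_counts = sample.isnull().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)
```

Running the same two lines on `autos` should list the five columns named above.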


## Column Replacement - Snakecase

In [2]:
# Rename the autos dataframe columns
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'n_of_pictures', 'postal_code', 'last_seen']

# Print snakecase version of autos dataframe columns
print(autos.columns)

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'n_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')


In [None]:
autos.head(4)

The original column names were written in camelcase. As the output above shows, they have been replaced with names in the preferred snakecase format.
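The renaming above was done by hand, which also allowed some names to be reworded (for example `yearOfRegistration` became `registration_year`). For names that only need a mechanical conversion, a regular expression can do the work. This is a sketch, and `camel_to_snake` is a hypothetical helper that is not part of the notebook:

```python
import re

def camel_to_snake(name):
    """Insert an underscore wherever a lowercase letter or digit is followed
    by a capital letter, then lowercase the whole name."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

original = ["dateCrawled", "offerType", "vehicleType", "powerPS", "fuelType"]
print([camel_to_snake(column) for column in original])
```

A mechanical conversion would not reproduce the reworded names, so the two approaches are best combined: convert first, then rename the few exceptions explicitly.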

## Brief Exploration of Dataset

Below we will take a further look at the data to see what additional tasks need to be carried out to complete cleaning this dataset.

In [None]:
autos.describe(include = 'all')

The commented-out statements below were used to inspect each column individually. Only one column appears in each statement because the analysis has already been completed; the lines are kept to document the steps taken in this project. The first statement counts how often each value occurs in a column, and the second displays the Python type of each value in a column.

In [None]:
#autos['postal_code'].value_counts()

In [None]:
#autos['price'].apply(type)
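Counting the types returned by `apply(type)` is a compact way to spot a column that mixes numbers and text. A minimal sketch on a made-up series (`prices` is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical column mixing integers and strings
prices = pd.Series([480, "18,300", 9800, "$1,500"], name="price")

# Count how many values of each Python type the column holds
type_counts = prices.apply(type).value_counts()
print(type_counts)
```

A single entry in the result suggests the column is already uniform; several entries point to values that need cleaning before a type conversion.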

After this analysis, I have made a list of changes that need to be made to this dataset:
* Convert each column to the proper datatype: `price`, `vehicle_type`, `model`, `odometer_km`, `fuel_type`, `unrepaired_damage`
* Columns that have mostly one value and are candidates to be dropped: `seller`, `offer_type`
* Separate and format time and date: `ad_created`, `date_crawled`; format time and date: `last_seen`
* Address special columns with formatting errors: `price`, `power_ps`
* Convert German words to English: `seller`, `offer_type`, `gearbox`, `fuel_type`, `unrepaired_damage`

## Data Cleaning

**Convert each column to proper datatype `price`, `vehicle_type`, `model`, `odometer_km`, `fuel_type`, `unrepaired_damage`, `gearbox`**

In [27]:
# Strip currency symbols and thousands separators, then convert to an integer datatype
autos['price'] = (autos['price'].astype(str)
                  .str.replace("$", "", regex=False)
                  .str.replace(",", "", regex=False)
                  .astype(int))

autos['odometer_km'] = (autos['odometer_km'].astype(str)
                        .str.replace("km", "", regex=False)
                        .str.replace(",", "", regex=False)
                        .astype(int))

# Convert datatype to a string (missing values become the text "nan")
for column in ['vehicle_type', 'model', 'fuel_type', 'unrepaired_damage', 'gearbox']:
    autos[column] = autos[column].astype(str)
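One detail of the string cleaning is worth spelling out: `"$"` is a special character in regular expressions, and how pandas interprets a bare `"$"` pattern has depended on the installed version. Passing `regex=False` removes the ambiguity. A minimal sketch on made-up price strings (`raw` is hypothetical):

```python
import re

import pandas as pd

# Hypothetical price strings with a currency symbol and a thousands separator
raw = pd.Series(["$5,000", "$18,300"])

# Read as a regular expression, "$" only matches the end of the string,
# so the dollar sign itself is left in place
print(re.sub("$", "", "$5,000"))

# regex=False makes pandas treat the pattern as literal text
cleaned = (raw.str.replace("$", "", regex=False)
              .str.replace(",", "", regex=False)
              .astype(int))
print(cleaned.tolist())
```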



**Columns that have mostly one value and are candidates to be dropped `seller`, `offer_type`**

In [4]:
# Determine unique values in each column
s = autos['seller'].unique()
ot = autos['offer_type'].unique()

# Determine distribution for each unique value in each column
s_dict = {}
for i in autos['seller']:
    if i not in s_dict:
        s_dict[i] = 1
    else:
        s_dict[i] += 1

ot_dict = {}
for i in autos['offer_type']:
    if i not in ot_dict:
        ot_dict[i] = 1
    else:
        ot_dict[i] += 1

# Display Results
print("'Seller': {0} Count: ".format(s) + str(s_dict.values())) 
print("'Offer Type': {0} Count: ".format(ot) + str(ot_dict.values()))

'Seller': ['privat' 'gewerblich'] Count: dict_values([371525, 3])
'Offer Type': ['Angebot' 'Gesuch'] Count: dict_values([371516, 12])


Given that the counts for "gewerblich" and "Gesuch" in the `seller` and `offer_type` columns are far below 1% of all records, these records will be removed from the dataset.
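The same check can be expressed as a share of all rows rather than as raw counts. The sketch below uses a made-up column (`seller` here is a hypothetical series, not the real data), and the 1% cut-off is an assumption taken from the sentence above:

```python
import pandas as pd

# Hypothetical column dominated by one value
seller = pd.Series(["privat"] * 997 + ["gewerblich"] * 3, name="seller")

# Share of each value as a percentage of all rows
share = seller.value_counts(normalize=True) * 100
print(share.round(2))

# Values that make up less than 1% of the rows
rare = share[share < 1].index.tolist()
print(rare)
```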

In [5]:
# Remove all records from the defined column - Display Results
autos = autos[autos['seller'] != 'gewerblich']
autos['seller'].unique()

array(['privat'], dtype=object)

In [6]:
# Remove all records from the defined column - Display Results
autos = autos[autos['offer_type'] != 'Gesuch']
autos['offer_type'].unique()

array(['Angebot'], dtype=object)

**Separate/Format time and date `ad_created`, `date_crawled`; format time and date `last_seen`**

In [None]:
# Import datetime library
import datetime as dt

# Separate the date from the time and parse it
def separate_date_time(column_choice):
    date_part = column_choice.str.split().str[0]
    return pd.to_datetime(date_part, format="%Y-%m-%d")

# Call function and store the result back in the dataframe
autos["ad_created"] = separate_date_time(autos["ad_created"])
autos["date_crawled"] = separate_date_time(autos["date_crawled"])

# Format defined column by date and time
autos["last_seen"] = pd.to_datetime(autos["last_seen"])

# Display Results
#autos.loc[0:2,["date_crawled", "ad_created", "last_seen"]]

The intent of the cell above is to reduce `ad_created` and `date_crawled` to a date only, and to keep `last_seen` as a full date and time.
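Once a column holds real timestamps, the date and the time of day can be pulled apart with the `.dt` accessor. A minimal sketch, assuming the timestamps follow a `YYYY-MM-DD HH:MM:SS` layout (the values in `stamps` are made up):

```python
import pandas as pd

# Hypothetical timestamps; the layout is an assumption about the data
stamps = pd.Series(["2016-03-24 11:52:17", "2016-04-07 03:16:57"])

parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")

# Date only (time reset to midnight) and time of day as separate series
date_only = parsed.dt.normalize()
time_only = parsed.dt.time
print(date_only.tolist())
print(time_only.tolist())
```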

**Address special columns - Additional Formatting `price`, `power_ps`**

In [9]:
# Display Statistical Report for each column
def min_max(column1, column2 , name1, name2):
    min_value_1 = column1.min()
    min_value_2 = column2.min()

    max_value_1 = column1.max()
    max_value_2 = column2.max()

    print(str(name1) + " - Statistical Report\n\nMinimum Value: {0}\nMaximum Value: {1}\n\n".format(min_value_1, max_value_1) + str(column1.describe()))
    print("\n\n" + str(name2) + " - Statistical Report\n\nMinimum Value: {0}\nMaximum Value: {1}\n\n".format(min_value_2, max_value_2) + str(column2.describe()))

# Display unformatted results
min_max(autos["price"], autos["power_ps"], "Price", "Power(PS)")

Price - Statistical Report

Minimum Value: 0
Maximum Value: 2147483647

count    3.715130e+05
mean     1.729570e+04
std      3.588026e+06
min      0.000000e+00
25%      1.150000e+03
50%      2.950000e+03
75%      7.200000e+03
max      2.147484e+09
Name: price, dtype: float64


Power(PS) - Statistical Report

Minimum Value: 0
Maximum Value: 20000

count    371513.000000
mean        115.552689
std         192.142534
min           0.000000
25%          70.000000
50%         105.000000
75%         150.000000
max       20000.000000
Name: power_ps, dtype: float64


A quick observation shows that the `price` summary is printed in scientific notation, while the `power_ps` summary carries six decimal places. Below we format the output using the `options.display.float_format` option in pandas, which displays each float with 2 decimal places and a thousands separator. This changes only how values are printed, not the stored data.

In [10]:
# Formatted columns 
pd.options.display.float_format = '{:,.2f}'.format

# Display Results
autos.loc[0:1,["price", "power_ps"]].describe()

          price  power_ps
count       2.0       2.0
mean     9390.0      95.0
std    12600.64    134.35
min       480.0       0.0
25%      4935.0      47.5
50%      9390.0      95.0
75%     13845.0     142.5
max     18300.0     190.0
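Setting `pd.options.display.float_format` affects every float printed for the rest of the session. If the format is only wanted for one report, `pd.option_context` can limit it to a block. A minimal sketch on made-up values (`values` is hypothetical):

```python
import pandas as pd

# Hypothetical prices, including one very large value
values = pd.Series([1150.0, 2950.0, 2147483647.0], name="price")

# Apply the two-decimal format only inside this block
with pd.option_context("display.float_format", "{:,.2f}".format):
    formatted = values.describe().to_string()

print(formatted)

# The stored values are unchanged; only their printed form differs
print(values.max())
```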


**Convert German words to English language `seller`, `offer_type`, `gearbox`, `fuel_type`, `unrepaired_damage`**

In [32]:
# Replace all German words with their English translation
# (whole-value matches only; the result is assigned back to each column)
autos["seller"] = autos["seller"].replace({"privat": "Private"})
autos["offer_type"] = autos["offer_type"].replace({"Angebot": "Offer"})
autos["gearbox"] = autos["gearbox"].replace({"manuell": "Manual", "automatik": "Automatic"})
autos["fuel_type"] = autos["fuel_type"].replace({"benzin": "Gasoline", "diesel": "Diesel"})
autos["unrepaired_damage"] = autos["unrepaired_damage"].replace({"ja": "Yes", "nan": "Nan", "nein": "No"})

# Display Results
print(autos.loc[0:5,["seller", "offer_type", "gearbox", "fuel_type", "unrepaired_damage"]])

    seller offer_type    gearbox fuel_type unrepaired_damage
0  Private      Offer     Manual  Gasoline               Nan
1  Private      Offer     Manual    Diesel               Yes
2  Private      Offer  Automatic    Diesel               Nan
3  Private      Offer     Manual  Gasoline                No
4  Private      Offer     Manual    Diesel                No
5  Private      Offer     Manual  Gasoline               Yes
