# Data Scrubbing and Observation Report - Provided by eBay

**Data Cleaning and Observation Project**

eBay Kleinanzeigen, the classifieds section of the German eBay website, has provided data on 50,000 listings. The first goal of this project is to clean that data. The dataset is initially presented in German and contains errors.

## Brief Information about Dataset

In [None]:
# Import pandas library
import pandas as pd

# Open file and import dataset as a dataframe
autos = pd.read_csv('autos.csv', encoding = "Latin-1")

# Display information about the dataset
autos.info()

In [None]:
# Display the first 3 rows
print(autos.head(3))

From the information presented above, we can conclude that there are two different data types, int64 and object. A few columns contain null values: `model`, `vehicleType`, `gearbox`, `fuelType` and `notRepairedDamage`, with approximately `20,484`, `37,869`, `20,209`, `33,386` and `72,060` null values respectively. As mentioned in the project README, there are also German words that need to be translated into English, and the column names are written in camelCase.
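The null counts quoted above can also be computed directly rather than read off `info()`. The following is a minimal sketch on a small made-up dataframe (the `sample` frame and its values are illustrative assumptions, not the real listings):

```python
import pandas as pd

# Toy frame standing in for the real listings (values are made up).
sample = pd.DataFrame({
    "model": ["golf", None, "fiesta", None],
    "gearbox": ["manuell", "automatik", None, "manuell"],
    "price": [500, 1200, 3000, 800],
})

# Count missing values per column, then express them as a share of all rows.
null_counts = sample.isnull().sum()
null_share = (null_counts / len(sample) * 100).round(1)

print(pd.DataFrame({"nulls": null_counts, "percent": null_share}))
```

Running the same two lines against `autos` would give the per-column counts for the real dataset.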


## Column Replacement - Snakecase

In [None]:
# Rename the autos dataframe columns
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'n_of_pictures', 'postal_code', 'last_seen']

# Print snakecase version of autos dataframe columns
print(autos.columns)

In [None]:
autos.head(4)

The original column names were written in camelCase. As the dataframe above shows, they have been renamed and are now presented in the preferred snake_case format.
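For names that only need a mechanical conversion, the renaming could in principle be automated. Below is a rough sketch with a hypothetical helper, `camel_to_snake`; `fuelType` and `notRepairedDamage` come from the output above, while `vehicleType` is assumed. Note that the manual list used in this project also shortens and rewords some names, which a helper like this would not do.

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each interior capital letter, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# Two names taken from the dataset summary, one assumed (vehicleType).
original = ["fuelType", "notRepairedDamage", "vehicleType"]
converted = [camel_to_snake(col) for col in original]
print(converted)
```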

## Brief Exploration of Dataset

Below we will take a further look at the data to see what additional tasks need to be carried out to complete cleaning this dataset.

In [None]:
autos.describe(include = 'all')

The first commented-out line below was used to inspect each column individually with `value_counts()`. Only one column remains because the analysis has already been completed; the line is kept, rather than removed, to document the steps taken in this project. The second commented-out line displays the Python type of each value in the `price` column.

In [None]:
#autos['postal_code'].value_counts()

In [None]:
#autos['price'].apply(type)

After this analysis, I made a list of changes that need to be made to this dataset:
* Convert each column to the proper datatype: `price`, `vehicle_type`, `model`, `odometer_km`, `fuel_type`, `unrepaired_damage`, `gearbox`
* Columns that have mostly one value and are candidates to be dropped: `seller`, `offer_type`
* Separate/format time and date: `ad_created`, `date_crawled`; format time and date: `last_seen`
* Address special columns (formatting errors): `price`, `power_ps`
* Convert German words to English: `seller`, `offer_type`, `gearbox`, `fuel_type`, `unrepaired_damage`, `vehicle_type`, `abtest`
* Regulate `registration_year` min and max values

## Data Cleaning

**Convert each column to the proper datatype: `price`, `vehicle_type`, `model`, `odometer_km`, `fuel_type`, `unrepaired_damage`, `gearbox`**

In [None]:
# Convert column to a string and remove additional characters - Convert to an integer datatype
# regex=False makes the literal match explicit ("$" is otherwise a regex metacharacter).
# Plain column assignment is used so that the column takes on the new dtype.
autos['price'] = autos['price'].astype(str).str.replace("$", "", regex=False).str.replace(",", "", regex=False)
autos['price'] = autos['price'].astype(int)

autos['odometer_km'] = autos['odometer_km'].astype(str).str.replace("km", "", regex=False).str.replace(",", "", regex=False)
autos['odometer_km'] = autos['odometer_km'].astype(int)

# Convert datatype to a string (missing values become the literal string "nan")
autos['vehicle_type'] = autos['vehicle_type'].astype(str)
autos['model'] = autos['model'].astype(str)
autos['fuel_type'] = autos['fuel_type'].astype(str)
autos['unrepaired_damage'] = autos['unrepaired_damage'].astype(str)
autos['gearbox'] = autos['gearbox'].astype(str)
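To illustrate the string-to-integer conversion in isolation, here is a small sketch on made-up values (the `raw` frame is an assumption for illustration). It passes `regex=False` so that `"$"` is matched as a literal character regardless of the pandas version in use:

```python
import pandas as pd

# Made-up raw values in the same style as the listing data.
raw = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$0"],
    "odometer_km": ["150,000km", "70,000km", "5,000km"],
})

# Strip the currency symbol and thousands separators, then cast to int.
raw["price"] = (
    raw["price"].str.replace("$", "", regex=False)
                .str.replace(",", "", regex=False)
                .astype(int)
)

# Strip the unit suffix and thousands separators, then cast to int.
raw["odometer_km"] = (
    raw["odometer_km"].str.replace("km", "", regex=False)
                      .str.replace(",", "", regex=False)
                      .astype(int)
)

print(raw.dtypes)
```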

**Columns that have mostly one value and are candidates to be dropped `seller`, `offer_type`**

In [None]:
# Determine unique values in each column
s = autos['seller'].unique()
ot = autos['offer_type'].unique()

# Determine distribution for each unique value in each column
s_dict = {}
for i in autos['seller']:
    if i not in s_dict:
        s_dict[i] = 1
    else:
        s_dict[i] += 1

ot_dict = {}
for i in autos['offer_type']:
    if i not in ot_dict:
        ot_dict[i] = 1
    else:
        ot_dict[i] += 1

# Display Results
print("'Seller': {0} Count: {1}".format(s, list(s_dict.values())))
print("'Offer Type': {0} Count: {1}".format(ot, list(ot_dict.values())))

Given that the records marked "gewerblich" in the `seller` column and "Gesuch" in the `offer_type` column each account for far less than 1% of the dataset, these records will be removed.
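The same distribution check can be expressed more compactly with `value_counts`. This is only a sketch on a made-up column (199 "privat" entries and 1 "gewerblich" entry are invented numbers), with the 1% cut-off taken from the reasoning above:

```python
import pandas as pd

# Made-up column: 199 private sellers and 1 commercial seller.
seller = pd.Series(["privat"] * 199 + ["gewerblich"])

# Absolute counts and percentage share of each unique value.
counts = seller.value_counts()
shares = seller.value_counts(normalize=True) * 100
print(pd.DataFrame({"count": counts, "percent": shares.round(2)}))

# Values that make up less than 1% of the rows are candidates for removal.
rare = shares[shares < 1].index.tolist()
print(rare)
```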

In [None]:
# Remove all 'gewerblich' records from the seller column - Display Results
autos = autos[autos['seller'] != 'gewerblich']
autos['seller'].unique()

In [None]:
# Remove all 'Gesuch' records from the offer_type column - Display Results
autos = autos[autos['offer_type'] != 'Gesuch']
autos['offer_type'].unique()

**Separate/format time and date: `ad_created`, `date_crawled`; format time and date: `last_seen`**

In [None]:
# Import datetime library
import datetime as dt

# Keep only the date portion (the first 10 characters) of each timestamp
autos['date_crawled'] = autos['date_crawled'].str[:10]
autos['ad_created'] = autos['ad_created'].str[:10]
last_seen_copy = autos['last_seen'].copy()
autos['last_seen'] = autos['last_seen'].str[:10]
# The time portion starts at index 11, right after the separator that follows the date
autos['last_seen_hour'] = last_seen_copy.str[11:]

# Display Results
autos[['date_crawled', 'ad_created', 'last_seen', 'last_seen_hour']]

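In the results above, the three timestamp columns now hold only the date portion, and the time portion of `last_seen` is stored separately in the new `last_seen_hour` column.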
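As an alternative to slicing by character position, the timestamps could be parsed and then split. The sketch below assumes a `YYYY-MM-DD HH:MM:SS` layout (inferred from the 10-character date slice) and uses two made-up timestamps:

```python
import pandas as pd

# Made-up timestamps in the assumed "YYYY-MM-DD HH:MM:SS" layout.
stamps = pd.DataFrame({"last_seen": ["2016-04-06 06:45:54", "2016-03-15 18:06:22"]})

# Parse once, then derive separate date and time strings from the parsed values.
parsed = pd.to_datetime(stamps["last_seen"], format="%Y-%m-%d %H:%M:%S")
stamps["last_seen"] = parsed.dt.strftime("%Y-%m-%d")
stamps["last_seen_hour"] = parsed.dt.strftime("%H:%M:%S")

print(stamps)
```

Parsing has the side benefit of raising an error on malformed timestamps, which plain string slicing would silently pass through.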

**Address special columns - additional formatting: `price`, `power_ps`**

In [None]:
# Display statistical report for each column
def min_max(column1, column2, name1, name2):
    min_value_1 = column1.min()
    min_value_2 = column2.min()

    max_value_1 = column1.max()
    max_value_2 = column2.max()

    print(str(name1) + " - Statistical Report\n\nMinimum Value: {0}\nMaximum Value: {1}\n\n".format(min_value_1, max_value_1) + str(column1.describe()))
    print("\n\n" + str(name2) + " - Statistical Report\n\nMinimum Value: {0}\nMaximum Value: {1}\n\n".format(min_value_2, max_value_2) + str(column2.describe()))

# Display unformatted results
min_max(autos["price"], autos["power_ps"], "Price", "Power(PS)")

A quick observation shows that the `price` statistics are displayed in scientific notation, while the `power_ps` statistics carry many decimal places. Below we format the output using the `options.display.float_format` setting in pandas, which displays each value with 2 decimal places. This changes only how values are displayed, not the stored data.
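To make the effect of the format string concrete, here is a short sketch. The two numbers in `stats` are made up, and `pd.option_context` is used so the setting applies only inside the `with` block instead of globally:

```python
import pandas as pd

# Made-up summary values: one very large, one with many decimal places.
stats = pd.Series([1.0e7, 116.49608], index=["price_max", "power_ps_mean"])

# '{:,.2f}' adds thousands separators and rounds to 2 decimal places.
fmt = "{:,.2f}".format
print(fmt(1.0e7))
print(fmt(116.49608))

# Apply the formatter to pandas output temporarily.
with pd.option_context("display.float_format", fmt):
    print(stats)
```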

In [None]:
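# Set the display format for floats: thousands separators and 2 decimal places
pd.options.display.float_format = '{:,.2f}'.format

# Display Results (all rows, so the statistics match the report above)
autos[["price", "power_ps"]].describe()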

**Convert German words to English: `seller`, `offer_type`, `gearbox`, `fuel_type`, `unrepaired_damage`, `vehicle_type`, `abtest`**

In [None]:
# Replace all German words with their English translation
autos["seller"] = autos["seller"].replace("privat", "Private")
autos["offer_type"] = autos["offer_type"].replace("Angebot", "Offer")
autos["gearbox"] = autos["gearbox"].replace("manuell", "Manual").replace("automatik", "Automatic")
autos["fuel_type"] = autos["fuel_type"].replace("benzin", "Gasoline").replace("diesel", "Diesel")
autos["unrepaired_damage"] = autos["unrepaired_damage"].replace("ja", "Yes").replace("nan", "Nan").replace("nein", "No")
autos["vehicle_type"] = autos["vehicle_type"].replace("kleinwagen", "Mini").replace("coupe", "Coupe").replace("suv", "SUV").replace("limousine", "Limousine").replace("cabrio", "Convertible").replace("bus", "Bus").replace("kombi", "Combination").replace("andere", "Other")
autos["abtest"] = autos["abtest"].replace("test", "Test").replace("control", "Control")

# Display Results
autos.loc[30:60]

Now that most of the words in this dataframe have been translated into English, each record and its data points are easier to understand. The `model` and `brand` columns were left alone because each contains at least 50 unique values and, for the most part, those names are the same in English.
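The chained `replace` calls could also be written as one dictionary per column, which keeps each translation table in one place and makes it easy to spot values that were not translated. A minimal sketch, using the `gearbox` translations from above on a made-up series:

```python
import pandas as pd

# Translation table taken from the gearbox replacements above.
gearbox_map = {"manuell": "Manual", "automatik": "Automatic"}

# Made-up sample, including the "nan" string left behind by astype(str).
gearbox = pd.Series(["manuell", "automatik", "nan", "manuell"])

# Values missing from the mapping are left unchanged by replace().
translated = gearbox.replace(gearbox_map)
print(translated.tolist())

# List any values that the mapping did not cover.
untranslated = sorted(set(translated) - set(gearbox_map.values()))
print(untranslated)
```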

**Regulate `registration_year` min and max values**

In [None]:
# autos["registration_year"].min()
# autos["registration_year"].max()
# autos["registration_year"].unique()

greater_2016 = autos.loc[autos["registration_year"] > 2016, "registration_year"]
# less_1990 = autos.loc[autos["registration_year"] < 1990, "registration_year"] == True
autos.loc[:30,"registration_year"]

# autos[autos["registration_year"] > 9999] == True
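One possible way to finish this step is to keep only the rows whose registration year falls inside a plausible range. The sketch below is an assumption rather than the project's final method: the upper bound of 2016 follows the check above, while the lower bound of 1900 and the sample years are made up for illustration.

```python
import pandas as pd

# Made-up years, including implausible values such as 1000 and 9999.
years = pd.DataFrame({"registration_year": [1000, 1995, 2004, 2016, 2019, 9999]})

# LOWER is an assumed cut-off; UPPER follows the "> 2016" check above.
LOWER, UPPER = 1900, 2016

# between() is inclusive on both ends by default.
in_range = years["registration_year"].between(LOWER, UPPER)
outside = int((~in_range).sum())
print("Rows outside the range: {0} of {1}".format(outside, len(years)))

# Keep only the plausible rows.
cleaned = years[in_range]
print(cleaned["registration_year"].tolist())
```

Before dropping rows in the real dataset, it is worth checking what share of the records falls outside the chosen range.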