# Data cleaning

This notebook outlines the initial data preparation before building a price prediction model.
The data used here was scraped from the Auto Trader website. The notebook does the following:

- [Load the vehicle features and sellers tables from the database](#read_data)
- [Assess dataset](#assess_data)
- [Outline the cleaning process prior to EDA](#clean_data)
- [Output a cleaner dataset for EDA](#pickle_data)

In [None]:
import pandas as pd
from data_cleaning import *

## Read data tables into pandas <a id='read_data'></a>

Connect to the database and join the vehicle features with the sellers information.

In [None]:
full_df = get_data_from_database()
full_df.head(10)

## Assess data <a id='assess_data'></a>

In [None]:
full_df.shape

Some of the columns contain nulls and some of the data types are incorrect.

In [None]:
full_df.info()

Percentage of missing values in each column in descending order.

In [None]:
get_percentage_nulls(full_df)
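
`get_percentage_nulls` is defined in the `data_cleaning` module, which is not shown in this notebook. A minimal sketch of how such a helper could be written follows, assuming it simply reports the share of nulls per column from highest to lowest. The name `percentage_nulls_sketch` and the toy data are illustrative only.

```python
import numpy as np
import pandas as pd


def percentage_nulls_sketch(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, highest first."""
    # isnull().mean() gives the fraction of nulls in each column
    return (frame.isnull().mean() * 100).sort_values(ascending=False)


# toy data, not taken from the scraped dataset
toy = pd.DataFrame({
    "price": [5000, 7500, 9000, 12000],
    "trim": ["se", None, None, "sport"],
    "top_speed": [np.nan, np.nan, np.nan, 120.0],
})
null_pct = percentage_nulls_sketch(toy)
```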

Check for duplicate entries.

In [None]:
full_df['advert_id'].duplicated().sum()
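
If the count above is non-zero, the repeated adverts can be inspected before deciding what to drop. The sketch below uses made-up values; only the `advert_id` column name comes from the dataset.

```python
import pandas as pd

# toy data, not taken from the scraped dataset
toy = pd.DataFrame({
    "advert_id": [101, 102, 102, 103, 103, 103],
    "price": [5000, 7500, 7500, 9000, 9000, 9100],
})

# number of rows whose advert_id has already appeared earlier in the frame
n_duplicates = toy["advert_id"].duplicated().sum()

# keep=False marks every member of a duplicated group, which helps inspection
duplicate_rows = toy[toy["advert_id"].duplicated(keep=False)]

# keep only the first occurrence of each advert_id
deduplicated = toy.drop_duplicates(subset="advert_id", keep="first")
```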

## Clean data <a id='clean_data'></a>

Columns that will not be used for the price prediction model are dropped.
The following columns identify individual car ads and could lead to overfitting in the model:
- advert_id
- derivative_id
- vehicle_registration_mark 
- seller_id (not uniquely but still identifies car ads)
- seller_name (not uniquely but still identifies car ads)

The following columns are not available at the time predictions are made and can lead to data leakage:
- price_deviation
- price_deviation_type
- price_excluding_fees
- no_admin_fees
- price_rating
- price_rating_label

The following columns have a lot of data points missing:
- max_loading_weight
- gross_vehicle_weight
- zero_to_sixty_two
- zero_to_sixty
- minimum_kerb_weight

The following columns are not useful as they only have one category:
- car_condition (all 'used')
- is_dealer_trusted (all False)
- country (all 'GB')

The following columns are unlikely to be good predictors of price:
- ad_description
- page_url
- primary_contact_number
- dealer_website
- date_scraped
- time_scraped

The information in the following columns is already represented by another column:
- manufactured_year_identifier (year)
- seller_address_one (lat and long)
- seller_address_two (lat and long)
- vehicle_location_postcode (lat and long)
- seller_postcode (lat and long)

The following columns contain inaccurate data:
- average_mileage (average based on full dataset on website but only a fraction was scraped)
- mileage_deviation
- mileage_deviation_type 


In [None]:
df = drop_columns(full_df)
df.columns
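
The single-category columns listed above can be found programmatically with `nunique`. The sketch below uses toy data and a hypothetical helper name, `drop_constant_columns`. The notebook's own `drop_columns` function is defined in `data_cleaning` and may work differently.

```python
import pandas as pd


def drop_constant_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop columns that hold at most one distinct non-null value."""
    n_unique = frame.nunique(dropna=True)
    constant = n_unique[n_unique <= 1].index
    return frame.drop(columns=constant)


# toy data, not taken from the scraped dataset
toy = pd.DataFrame({
    "car_condition": ["used", "used", "used"],
    "country": ["GB", "GB", "GB"],
    "make": ["ford", "bmw", "audi"],
    "price": [5000, 7500, 9000],
})
reduced = drop_constant_columns(toy)
```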

### Cleaning issues

- vehicle_location_longitude and vehicle_location_latitude contain the same data as seller_longlat but have more missing values
- there are two columns with CO2 emissions
- manufactured_year, doors, seats, engine_power, wheelbase, height, length, valves, cylinders, top_speed, tax and CO2 emissions are all floats
- the trim column, along with others, has missing values
- the emission_scheme column only has one category, 'ULEZ' (convert to a boolean column)

#### Combine latitude and longitudes into one column.

In [None]:
# check missing values in column
df[(df.vehicle_location_longitude.isnull()) & (~df.seller_longlat.isnull())][['vehicle_location_longitude', 'vehicle_location_latitude', 'seller_longlat']]

In [None]:
df.count()[['vehicle_location_longitude', 'vehicle_location_latitude', 'seller_longlat']]

In [None]:
df2 = combine_latitudes_and_longitudes(df)

In [None]:
# non-null counts should be higher than before, i.e. there are fewer missing values
df2.count()[['vehicle_location_longitude', 'vehicle_location_latitude']]
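
The implementation of `combine_latitudes_and_longitudes` is not shown here. One way the idea could be sketched is below, under two loudly stated assumptions: that `seller_longlat` is a `"longitude,latitude"` string, and that the column is dropped once its values have been used. The helper name `combine_coordinates_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd


def combine_coordinates_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill missing vehicle coordinates from seller_longlat, then drop it."""
    out = frame.copy()
    # ASSUMPTION: seller_longlat is a "longitude,latitude" string
    parts = out["seller_longlat"].str.split(",", expand=True)
    seller_long = pd.to_numeric(parts[0], errors="coerce")
    seller_lat = pd.to_numeric(parts[1], errors="coerce")
    # only rows that are null in the vehicle columns are filled
    out["vehicle_location_longitude"] = out["vehicle_location_longitude"].fillna(seller_long)
    out["vehicle_location_latitude"] = out["vehicle_location_latitude"].fillna(seller_lat)
    return out.drop(columns="seller_longlat")


# toy data, not taken from the scraped dataset
toy = pd.DataFrame({
    "vehicle_location_longitude": [-0.12, np.nan, np.nan],
    "vehicle_location_latitude": [51.50, np.nan, np.nan],
    "seller_longlat": ["-0.12,51.50", "-2.24,53.48", None],
})
combined = combine_coordinates_sketch(toy)
```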

#### Combine CO2 emissions into one column.

In [None]:
df2[(df2.co2_emissions.isnull()) & (~df2.co2_emission.isnull())][['co2_emission', 'co2_emissions']]

In [None]:
df2.count()[['co2_emission', 'co2_emissions']]

In [None]:
df2 = combine_CO2_columns(df2)

In [None]:
df2.count()['co2_emissions']
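
A sketch of one way to merge the two CO2 columns, assuming `co2_emissions` is kept as the primary column and `co2_emission` only fills its gaps. The actual `combine_CO2_columns` helper may resolve conflicts differently.

```python
import numpy as np
import pandas as pd

# toy data, not taken from the scraped dataset
toy = pd.DataFrame({
    "co2_emissions": [120.0, np.nan, np.nan, 99.0],
    "co2_emission": [120.0, 135.0, np.nan, np.nan],
})

# prefer co2_emissions and fall back to co2_emission where it is null
toy["co2_emissions"] = toy["co2_emissions"].combine_first(toy["co2_emission"])
combined = toy.drop(columns="co2_emission")
```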

#### Convert emission_scheme column to boolean to show if a car is ULEZ compliant or not.

In [None]:
# check unique entries in column
df2['emission_scheme'].unique()

In [None]:
df2 = convert_emission_scheme_to_boolean(df2)

In [None]:
df2.ulez.unique()
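
The conversion could look roughly like the sketch below, assuming a null `emission_scheme` means the car is not ULEZ compliant. The output column name `ulez` matches the cell above; the rest is illustrative.

```python
import pandas as pd

# toy data, not taken from the scraped dataset
toy = pd.DataFrame({"emission_scheme": ["ULEZ", None, "ULEZ", None]})

# ASSUMPTION: anything other than 'ULEZ', including null, counts as non-compliant
toy["ulez"] = toy["emission_scheme"].eq("ULEZ")
converted = toy.drop(columns="emission_scheme")
```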

#### Fix datatypes of columns, and make data entries lowercase

In [None]:
df2 = clean_round1(df2)

In [None]:
df2.head()

In [None]:
df2.dtypes
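
`clean_round1` is not shown, so the following is only a sketch of the two steps named in the heading. It assumes float columns holding whole numbers are cast to the nullable `Int64` dtype so that missing values survive the conversion. The `make` column name is an assumption.

```python
import numpy as np
import pandas as pd

# toy data, not taken from the scraped dataset
toy = pd.DataFrame({
    "make": ["Ford", "BMW", None],
    "doors": [5.0, 3.0, np.nan],
    "seats": [5.0, 4.0, 5.0],
})

# lowercase text entries; nulls are left untouched by the .str accessor
toy["make"] = toy["make"].str.lower()

# the nullable Int64 dtype stores whole numbers while keeping missing values
toy = toy.astype({"doors": "Int64", "seats": "Int64"})
```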

#### Fill missing values in trim column.

In [None]:
df2 = fill_trim_missing_values(df2)

In [None]:
df2.trim.isnull().any()

#### Deal with missing values.

In [None]:
df2 = drop_rows_with_small_percentage_of_missing_values(df2)
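
As a rough illustration of this step, rows can be dropped only where the affected column has few nulls, so little data is lost. The helper name `drop_rows_for_sparse_nulls` and the `threshold` value are hypothetical; the cut-off used in `data_cleaning` is not stated in this notebook.

```python
import numpy as np
import pandas as pd


def drop_rows_for_sparse_nulls(frame: pd.DataFrame, threshold: float = 0.05) -> pd.DataFrame:
    """Drop rows that are null in columns whose null share is below threshold."""
    null_share = frame.isnull().mean()
    # columns with some nulls, but few enough that dropping rows is cheap
    columns = null_share[(null_share > 0) & (null_share < threshold)].index
    return frame.dropna(subset=list(columns))


# toy data: mileage is 10% null, number_of_owners is 50% null
toy = pd.DataFrame({
    "mileage": [np.nan] + [10000.0] * 9,
    "number_of_owners": [1.0, np.nan] * 5,
})
trimmed = drop_rows_for_sparse_nulls(toy, threshold=0.2)
```

Columns with a large share of nulls, such as `number_of_owners` in the toy data, are left alone so they can be imputed instead.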

In [None]:
# fill missing values by imputing mode of groups by make, model, trim and year
df3 = fill_columns_with_missing_values(df2)
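
The group-wise mode imputation described in the comment above can be sketched as follows. The helper name `fill_with_group_mode` is hypothetical, the `make` and `model` column names are assumptions, and ties between modes are resolved by taking the first one, which may differ from the notebook's own helper.

```python
import numpy as np
import pandas as pd


def fill_with_group_mode(frame, column, group_columns):
    """Fill nulls in `column` with the most frequent value in each group."""
    def first_mode(values):
        modes = values.mode()  # mode() ignores nulls by default
        return modes.iloc[0] if not modes.empty else np.nan

    # transform broadcasts each group's mode back onto the group's rows
    group_mode = frame.groupby(group_columns)[column].transform(first_mode)
    out = frame.copy()
    out[column] = out[column].fillna(group_mode)
    return out


# toy data, not taken from the scraped dataset
toy = pd.DataFrame({
    "make": ["ford"] * 4 + ["bmw"] * 2,
    "model": ["fiesta"] * 4 + ["x1"] * 2,
    "trim": ["zetec"] * 4 + ["sport"] * 2,
    "manufactured_year": [2015] * 6,
    "doors": [5.0, 5.0, 3.0, np.nan, np.nan, np.nan],
})
filled = fill_with_group_mode(
    toy, "doors", ["make", "model", "trim", "manufactured_year"]
)
```

Groups with no observed value at all, like the `bmw` rows in the toy data, stay null and have to be handled by a later step.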

In [None]:
# drop missing values in mileage column to use for number of owners predictions in next cell
df3 = df3.dropna(subset=['mileage'])

In [None]:
df4 = fill_number_of_owners_with_predictions(df3)
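
The model behind `fill_number_of_owners_with_predictions` is not shown in this notebook. As a stand-in, the sketch below fits a straight line of owner count against mileage with `numpy.polyfit` and uses it to fill the gaps. The column name `number_of_owners`, the helper name and the choice of a linear fit are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def fill_owners_with_linear_fit(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill null number_of_owners using a straight-line fit on mileage."""
    out = frame.copy()
    known = out["number_of_owners"].notnull()
    # fit owners = slope * mileage + intercept on rows with a known owner count
    slope, intercept = np.polyfit(
        out.loc[known, "mileage"], out.loc[known, "number_of_owners"], deg=1
    )
    predicted = slope * out.loc[~known, "mileage"] + intercept
    # owner counts are whole numbers of at least one
    out.loc[~known, "number_of_owners"] = predicted.round().clip(lower=1)
    return out


# toy data, not taken from the scraped dataset; mileage must have no nulls
toy = pd.DataFrame({
    "mileage": [10000.0, 20000.0, 30000.0, 40000.0, 31000.0, 2000.0],
    "number_of_owners": [1.0, 2.0, 3.0, 4.0, np.nan, np.nan],
})
filled = fill_owners_with_linear_fit(toy)
```

This is why rows with a null `mileage` are dropped beforehand: the predictor column has to be complete for every row that needs a prediction.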

In [None]:
df4 = df4.dropna()

In [None]:
df4 = fix_int_datatypes(df4)

## Output dataset to file <a id='pickle_data'></a>

Save this dataset to a pickle file, which preserves the datatypes. There are still nulls in this dataset to preserve as much data as possible
before performing exploratory data analysis. The effect of dropping nulls vs imputing values will be evaluated for the prediction model.

In [None]:
df4.to_pickle('eda/car_data.pkl')
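
A quick way to confirm that a pickle round trip keeps the datatypes is sketched below on toy data. The file name and the `fuel_type` column are hypothetical, and a temporary directory is used so nothing is written next to the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# toy data with dtypes that a plain CSV export would not preserve
toy = pd.DataFrame({
    "doors": pd.array([5, 3, None], dtype="Int64"),
    "ulez": [True, False, True],
    "fuel_type": pd.Categorical(["petrol", "diesel", "petrol"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_car_data.pkl"
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# compare the dtype of every column before and after the round trip
dtypes_preserved = list(restored.dtypes) == list(toy.dtypes)
```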