# Data cleaning

This notebook outlines the initial data preparation carried out before building a price prediction model.
The data was scraped from the Autotrader website. The notebook does the following:

- [Load the vehicle features and sellers tables from the database](#read_data)
- [Assess dataset](#assess_data)
- [Outline the cleaning process prior to EDA](#clean_data)
- [Output a cleaner dataset for EDA](#pickle_data)

In [87]:
import pandas as pd
from data_cleaning import *
from sklearn.linear_model import LinearRegression


## Read data tables into pandas <a id='read_data'></a>

Connect to the database and join the vehicle features with the seller information.

In [2]:
full_df = get_data_from_database()
full_df.head(10)

Unnamed: 0,advert_id,date_scraped,time_scraped,make,model,trim,manufactured_year,manufactured_year_identifier,body_type,mileage,...,total_reviews,region,county,town,country,seller_postcode,seller_address_one,seller_address_two,dealer_website,primary_contact_number
0,202107165101873,2021-08-22,0 days 00:55:25,DS AUTOMOBILES,DS 3,Prestige,2018.0,18,Hatchback,31537.0,...,13067.0,SOUTH EAST,KENT,ADDINGTON,GB,ME19 5PL,A20 London Road,,https://www.bigmotoringworld.co.uk/autotraderv...,1634215708
1,202009113616600,2021-08-22,0 days 00:55:29,Vauxhall,Astra,SXi,2007.0,7,Hatchback,70000.0,...,18.0,LONDON,MIDDLESEX,HOUNSLOW,GB,TW4 6JQ,"VISTA BUSINESS CENTRE, SALISBURY ROAD",,https://www.motorpedia.uk/,2080337311
2,202107305627204,2021-08-22,0 days 00:55:29,Volkswagen,Polo,Moda,2010.0,60,Hatchback,89000.0,...,125.0,LONDON,HERTFORDSHIRE,BARNET,GB,EN5 4RY,BENTLEY HEATH LANE,,http://www.mynextcar.co.uk,2081152043
3,202108206481776,2021-08-22,0 days 00:55:30,Vauxhall,Astra,SXi,2008.0,8,Hatchback,117000.0,...,7.0,LONDON,ESSEX,ILFORD,GB,IG3 8RW,777-779 High Road,Seven Kings,https://dmsgateway.autotrader.co.uk/api/advert...,7441907724
4,202108206471965,2021-08-22,0 days 00:55:30,Mazda,Mazda3,Sport,2007.0,7,Hatchback,93500.0,...,18.0,SOUTH WEST,WILTSHIRE,SWINDON,GB,SN1 2PG,"UNIT 10-11, ISIS TRADING ESTATE",,https://dmsgateway.autotrader.co.uk/api/advert...,7537125171
5,202108136200337,2021-08-22,0 days 00:55:30,Vauxhall,Astra,Club,2004.0,54,Hatchback,73888.0,...,10.0,SOUTH EAST,SURREY,WORCESTER PARK,,,,,,7971223786
6,202108045818203,2021-08-22,0 days 00:55:30,Volvo,S40,SE,2005.0,55,Saloon,160000.0,...,15.0,LONDON,MIDDLESEX,LONDON,GB,,,,https://dmsgateway.autotrader.co.uk/api/advert...,7537165960
7,202108206446218,2021-08-22,0 days 00:55:31,Ford,Fiesta,Ghia,2006.0,6,Hatchback,63000.0,...,13.0,SOUTH EAST,EAST SUSSEX,UCKFIELD,,,,,,7971243842
8,202108206442132,2021-08-22,0 days 00:55:31,MINI,Hatch,Cooper,2004.0,4,Hatchback,63977.0,...,37.0,LONDON,SURREY,WANDSWORTH,GB,SW18 4QA,"152-156, Penwith Road",,,7537165675
9,202105243041206,2021-08-22,0 days 00:55:31,Vauxhall,Corsa,SE,2009.0,9,Hatchback,100843.0,...,128.0,NORTH WEST,CHESHIRE,NORTHWICH,GB,CW8 1BE,Unit 5 RIVERSIDE TRADING ESTATE,NAVIGATION ROAD,https://dmsgateway.autotrader.co.uk/api/advert...,7537164943


## Assess data <a id='assess_data'></a>

In [3]:
full_df.shape

(27808, 77)

Some of the columns contain nulls and some of the data types are incorrect.

In [4]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27808 entries, 0 to 27807
Data columns (total 77 columns):
 #   Column                        Non-Null Count  Dtype          
---  ------                        --------------  -----          
 0   advert_id                     27808 non-null  int64          
 1   date_scraped                  27808 non-null  object         
 2   time_scraped                  27808 non-null  timedelta64[ns]
 3   make                          27808 non-null  object         
 4   model                         27808 non-null  object         
 5   trim                          26792 non-null  object         
 6   manufactured_year             27787 non-null  float64        
 7   manufactured_year_identifier  27508 non-null  object         
 8   body_type                     27790 non-null  object         
 9   mileage                       27761 non-null  float64        
 10  engine_size                   27565 non-null  object         
 11  transmission   

Percentage of missing values in each column in descending order.

In [5]:
get_percentage_nulls(full_df)

max_loading_weight      82.771145
zero_to_sixty_two       66.218354
gross_vehicle_weight    65.330121
seller_address_two      61.388809
number_of_owners        45.857307
                          ...    
date_scraped             0.000000
model                    0.000000
make                     0.000000
time_scraped             0.000000
advert_id                0.000000
Length: 77, dtype: float64
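
`get_percentage_nulls` comes from the local `data_cleaning` module, whose source is not shown in this notebook. The sketch below is one possible equivalent, assuming the helper takes the per-column null fraction, scales it to a percentage and sorts it in descending order. The name `percentage_nulls_sketch` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def percentage_nulls_sketch(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    # isnull().mean() gives the fraction of nulls in each column.
    return (df.isnull().mean() * 100).sort_values(ascending=False)


# Toy data, not taken from the scraped dataset.
toy = pd.DataFrame(
    {
        "make": ["ford", "mini", "volvo", "mazda"],
        "mileage": [63000.0, np.nan, 160000.0, 93500.0],
        "max_loading_weight": [np.nan, np.nan, np.nan, 500.0],
    }
)
nulls = percentage_nulls_sketch(toy)
print(nulls)
```

Because the values are percentages rather than fractions, a threshold such as `0.1` applied to this output means 0.1%, not 10%.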

Check for duplicate adverts.

In [6]:
full_df['advert_id'].duplicated().sum()

0

## Clean data <a id='clean_data'></a>

Columns that will not be used by the price prediction model are dropped.
The following columns identify individual car ads and would encourage the model to overfit:
- advert_id
- derivative_id
- vehicle_registration_mark 
- seller_id (not uniquely but still identifies car ads)
- seller_name (not uniquely but still identifies car ads)

The following columns are not available at the time predictions are made and would cause data leakage:
- price_deviation
- price_deviation_type
- price_excluding_fees
- no_admin_fees
- price_rating
- price_rating_label

The following columns have a lot of data points missing:
- max_loading_weight
- gross_vehicle_weight
- zero_to_sixty_two
- zero_to_sixty
- minimum_kerb_weight

The following columns are not useful as they only have one category:
- car_condition (all 'used')
- is_dealer_trusted (all False)
- country (all 'GB')

The following columns are unlikely to be good predictors of price:
- ad_description
- page_url
- primary_contact_number
- dealer_website
- date_scraped
- time_scraped

The information in the following columns is already represented by another column:
- manufactured_year_identifier (year)
- seller_address_one (lat and long)
- seller_address_two (lat and long)
- vehicle_location_postcode (lat and long)
- seller_postcode (lat and long)

The following columns contain inaccurate data:
- average_mileage (the average is based on the full dataset on the website, but only a fraction of it was scraped)
- mileage_deviation
- mileage_deviation_type 


In [7]:
df = drop_columns(full_df)
df.columns

Index(['make', 'model', 'trim', 'manufactured_year', 'body_type', 'mileage',
       'engine_size', 'transmission', 'fuel_type', 'doors', 'seats',
       'number_of_owners', 'emission_scheme', 'vehicle_location_latitude',
       'vehicle_location_longitude', 'imported', 'price', 'number_of_photos',
       'co2_emission', 'tax', 'top_speed', 'cylinders', 'valves',
       'engine_power', 'engine_torque', 'height', 'length', 'wheelbase',
       'width', 'fuel_tank_capacity', 'boot_space_seats_up',
       'boot_space_seats_down', 'urban', 'extra_urban', 'combined',
       'co2_emissions', 'insurance_group', 'seller_longlat', 'seller_segment',
       'seller_rating', 'total_reviews', 'region', 'county', 'town'],
      dtype='object')
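
The body of `drop_columns` is not shown here. A minimal sketch of the idea, assuming the columns are grouped by the reason for dropping them and removed in one call. `COLUMNS_TO_DROP` holds only a hypothetical subset of the lists above, and `drop_columns_sketch` is an illustrative name.

```python
import pandas as pd

# Hypothetical subset of the columns listed above, grouped by reason.
COLUMNS_TO_DROP = {
    "identifiers": ["advert_id", "seller_id"],
    "leakage": ["price_rating", "price_deviation"],
    "single_category": ["country"],
}


def drop_columns_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the columns named in COLUMNS_TO_DROP."""
    to_drop = [name for group in COLUMNS_TO_DROP.values() for name in group]
    # errors="ignore" skips names that are absent from this frame.
    return df.drop(columns=to_drop, errors="ignore")


# Toy data, not taken from the scraped dataset.
toy = pd.DataFrame(
    {
        "advert_id": [1, 2],
        "make": ["ford", "mini"],
        "price": [2500, 3100],
        "price_rating": ["good", "fair"],
        "country": ["GB", "GB"],
    }
)
kept = drop_columns_sketch(toy)
print(list(kept.columns))
```

Keeping the reason next to each column name makes it easier to revisit a decision later, for example if a dropped column turns out to be useful during EDA.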

### Cleaning issues

- vehicle_location_longitude and vehicle_location_latitude contain the same data as seller_longlat but have more missing values
- there are two columns with CO2 emissions (co2_emission and co2_emissions)
- manufactured_year, doors, seats, engine_power, wheelbase, height, length, valves, cylinders, top_speed, tax and CO2 emissions are all stored as floats but should be integers
- make, model and trim are interlinked, so they could be combined into one column
- the emission_scheme column has only one category, 'ULEZ' (convert to a boolean column)

#### Combine the vehicle and seller coordinates into one latitude column and one longitude column.

In [8]:
df[(df.vehicle_location_longitude.isnull()) & (~df.seller_longlat.isnull())][['vehicle_location_longitude', 'vehicle_location_latitude', 'seller_longlat']]

Unnamed: 0,vehicle_location_longitude,vehicle_location_latitude,seller_longlat
841,,,"51.631083,-0.761748"
1204,,,"51.5283026,-0.1302148"
1209,,,"53.7201316,-0.4299832"
1225,,,"52.9229051,-1.4600297"
1358,,,"51.5283026,-0.1302148"
...,...,...,...
27097,,,"51.5283026,-0.1302148"
27101,,,"51.5283026,-0.1302148"
27193,,,"51.5283026,-0.1302148"
27430,,,"51.5283026,-0.1302148"


In [9]:
df.count()[['vehicle_location_longitude', 'vehicle_location_latitude', 'seller_longlat']]

vehicle_location_longitude    22914
vehicle_location_latitude     22914
seller_longlat                23256
dtype: int64

In [10]:
df2 = combine_latitudes_and_longitudes(df)

In [11]:
df2.count()[['vehicle_location_longitude', 'vehicle_location_latitude']]

vehicle_location_longitude    23259
vehicle_location_latitude     23259
dtype: int64
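
The source of `combine_latitudes_and_longitudes` is not shown. The sketch below is one way to get the same effect, assuming (from the sample rows above) that `seller_longlat` stores latitude first and longitude second despite its name, and that the column is dropped once its values have been used. `combine_lat_long_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def combine_lat_long_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing vehicle coordinates from seller_longlat, then drop it."""
    df = df.copy()
    # Split "lat,long" strings into two columns; missing strings stay null.
    parts = df["seller_longlat"].str.split(",", expand=True)
    latitude = pd.to_numeric(parts[0], errors="coerce")
    longitude = pd.to_numeric(parts[1], errors="coerce")
    # Only rows where the vehicle coordinate is missing are overwritten.
    df["vehicle_location_latitude"] = df["vehicle_location_latitude"].fillna(latitude)
    df["vehicle_location_longitude"] = df["vehicle_location_longitude"].fillna(longitude)
    return df.drop(columns="seller_longlat")


# Toy data: row 1 can be filled from the seller, row 2 cannot.
toy = pd.DataFrame(
    {
        "vehicle_location_latitude": [51.29, np.nan, np.nan],
        "vehicle_location_longitude": [0.38, np.nan, np.nan],
        "seller_longlat": ["51.29,0.38", "51.5283026,-0.1302148", None],
    }
)
combined = combine_lat_long_sketch(toy)
print(combined)
```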

#### Combine CO2 emissions into one column.

In [12]:
df2[(df2.co2_emissions.isnull()) & (~df2.co2_emission.isnull())][['co2_emission', 'co2_emissions']]

Unnamed: 0,co2_emission,co2_emissions
74,220.0,
7896,142.0,
22593,258.0,


In [13]:
df2.count()[['co2_emission', 'co2_emissions']]

co2_emission     27329
co2_emissions    27336
dtype: int64

In [14]:
df2 = combine_CO2_columns(df2)

In [15]:
df2.count()['co2_emissions']

27339
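
The count rises from 27336 to 27339 because three rows have a value only in `co2_emission`. A minimal sketch of such a merge, assuming `co2_emissions` is kept as the primary column and `co2_emission` is only used as a fallback. `combine_co2_sketch` is a hypothetical stand-in for `combine_CO2_columns`, whose source is not shown.

```python
import numpy as np
import pandas as pd


def combine_co2_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Fill gaps in co2_emissions from co2_emission, then drop the latter."""
    df = df.copy()
    # Keep co2_emissions where present, otherwise fall back to co2_emission.
    df["co2_emissions"] = df["co2_emissions"].fillna(df["co2_emission"])
    return df.drop(columns="co2_emission")


# Toy data: the first two rows only have a value in co2_emission.
toy = pd.DataFrame(
    {
        "co2_emission": [220.0, 142.0, np.nan, 94.0],
        "co2_emissions": [np.nan, np.nan, 146.0, 94.0],
    }
)
combined = combine_co2_sketch(toy)
print(combined)
```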

#### Convert emission_scheme column to boolean to show if a car is ULEZ compliant or not.

In [16]:
df2['emission_scheme'].unique()

array(['ULEZ', None], dtype=object)

In [17]:
df2 = convert_emission_scheme_to_boolean(df2)

In [18]:
df2.ulez.unique()

array([1, 0])
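
A sketch of the conversion, assuming a missing `emission_scheme` is treated as "not ULEZ compliant" (it could also mean "unknown", which this encoding cannot distinguish). `emission_scheme_to_flag_sketch` is a hypothetical name; the real helper is defined in `data_cleaning`.

```python
import pandas as pd


def emission_scheme_to_flag_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Replace emission_scheme with a 0/1 ulez column."""
    df = df.copy()
    # 'ULEZ' is the only category, so "not null" is equivalent to "is ULEZ".
    df["ulez"] = df["emission_scheme"].notnull().astype("int64")
    return df.drop(columns="emission_scheme")


# Toy data, not taken from the scraped dataset.
toy = pd.DataFrame({"emission_scheme": ["ULEZ", None, "ULEZ"]})
flagged = emission_scheme_to_flag_sketch(toy)
print(flagged)
```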

#### Fix datatypes of columns, and make data entries lowercase

In [19]:
df2 = clean_round1(df2)

In [20]:
df2.head()

Unnamed: 0,make,model,trim,manufactured_year,body_type,mileage,engine_size,transmission,fuel_type,doors,...,combined,co2_emissions,insurance_group,seller_segment,seller_rating,total_reviews,region,county,town,ulez
0,ds-automobiles,ds-3,prestige,2018,hatchback,31537,1.6,manual,diesel,3,...,78.5,94,26e,independent,4.7,13067,south_east,kent,addington,1
1,vauxhall,astra,sxi,2007,hatchback,70000,1.4,manual,petrol,3,...,46.3,146,10e,independent,5.0,18,london,middlesex,hounslow,1
2,volkswagen,polo,moda,2010,hatchback,89000,1.2,manual,petrol,3,...,51.4,128,05e,independent,4.8,125,london,hertfordshire,barnet,1
3,vauxhall,astra,sxi,2008,hatchback,117000,1.4,manual,petrol,5,...,46.3,146,10e,independent,4.4,7,london,essex,ilford,1
4,mazda,mazda3,sport,2007,hatchback,93500,2.0,manual,petrol,5,...,35.8,189,20e,independent,5.0,18,south_west,wiltshire,swindon,1


In [21]:
df2.dtypes

make                            object
model                           object
trim                            object
manufactured_year                Int64
body_type                     category
mileage                          Int64
engine_size                    float64
transmission                  category
fuel_type                     category
doors                            Int64
seats                            Int64
number_of_owners                 Int64
vehicle_location_latitude      float64
vehicle_location_longitude     float64
imported                         int64
price                            int64
number_of_photos                 int64
tax                              Int64
top_speed                        Int64
cylinders                        Int64
valves                           Int64
engine_power                     Int64
engine_torque                  float64
height                           Int64
length                           Int64
wheelbase                
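
The `Int64` entries above are pandas' nullable integer type, which stores whole numbers while still allowing missing values. A minimal sketch of this kind of conversion, assuming whole-number float columns are cast to `Int64` and text columns are lowercased and stored as categories. `clean_types_sketch` and its arguments are hypothetical and do not reproduce `clean_round1`.

```python
import numpy as np
import pandas as pd


def clean_types_sketch(df, int_columns, category_columns):
    """Cast whole-number floats to Int64 and text columns to lowercase categories."""
    df = df.copy()
    for column in int_columns:
        # Nullable Int64 keeps missing values as <NA> instead of forcing floats.
        df[column] = df[column].astype("Int64")
    for column in category_columns:
        df[column] = df[column].str.lower().astype("category")
    return df


# Toy data, not taken from the scraped dataset.
toy = pd.DataFrame(
    {
        "manufactured_year": [2018.0, np.nan, 2010.0],
        "doors": [3.0, 5.0, np.nan],
        "body_type": ["Hatchback", "Saloon", None],
    }
)
cleaned = clean_types_sketch(toy, ["manufactured_year", "doors"], ["body_type"])
print(cleaned.dtypes)
```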

#### Combine make, model and trim into one column.

In [22]:
df2 = combine_make_model_trim(df2)

In [23]:
df2.head()

Unnamed: 0,make_model_trim,manufactured_year,body_type,mileage,engine_size,transmission,fuel_type,doors,seats,number_of_owners,...,combined,co2_emissions,insurance_group,seller_segment,seller_rating,total_reviews,region,county,town,ulez
0,ds-automobiles_ds-3_prestige,2018,hatchback,31537,1.6,manual,diesel,3,5,1.0,...,78.5,94,26e,independent,4.7,13067,south_east,kent,addington,1
1,vauxhall_astra_sxi,2007,hatchback,70000,1.4,manual,petrol,3,5,,...,46.3,146,10e,independent,5.0,18,london,middlesex,hounslow,1
2,volkswagen_polo_moda,2010,hatchback,89000,1.2,manual,petrol,3,5,,...,51.4,128,05e,independent,4.8,125,london,hertfordshire,barnet,1
3,vauxhall_astra_sxi,2008,hatchback,117000,1.4,manual,petrol,5,5,3.0,...,46.3,146,10e,independent,4.4,7,london,essex,ilford,1
4,mazda_mazda3_sport,2007,hatchback,93500,2.0,manual,petrol,5,5,2.0,...,35.8,189,20e,independent,5.0,18,south_west,wiltshire,swindon,1
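
A sketch of how the three columns could be joined, assuming the parts are separated by underscores and a missing `trim` is simply skipped. How `combine_make_model_trim` really handles a missing `trim` is not shown, so that behaviour is an assumption, and `combine_make_model_trim_sketch` is a hypothetical name.

```python
import pandas as pd


def combine_make_model_trim_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Replace make, model and trim with one make_model_trim column."""
    df = df.copy()
    parts = ["make", "model", "trim"]
    # Join the non-null parts of each row with underscores.
    combined = df[parts].apply(lambda row: "_".join(row.dropna()), axis=1)
    df = df.drop(columns=parts)
    df.insert(0, "make_model_trim", combined)
    return df


# Toy data: the second row has no trim.
toy = pd.DataFrame(
    {
        "make": ["vauxhall", "ford"],
        "model": ["astra", "fiesta"],
        "trim": ["sxi", None],
        "price": [1500, 2200],
    }
)
joined = combine_make_model_trim_sketch(toy)
print(joined)
```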


In [110]:
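# Iterating over a DataFrame yields its column labels, not its rows, so this
# prints the column names of the subset where body_type is null.
for column in df2[df2.body_type.isnull()]:
    print(column)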

make_model_trim
manufactured_year
body_type
mileage
engine_size
transmission
fuel_type
doors
seats
number_of_owners
vehicle_location_latitude
vehicle_location_longitude
imported
price
number_of_photos
tax
top_speed
cylinders
valves
engine_power
engine_torque
height
length
wheelbase
width
fuel_tank_capacity
boot_space_seats_up
boot_space_seats_down
urban
extra_urban
combined
co2_emissions
insurance_group
seller_segment
seller_rating
total_reviews
region
county
town
ulez


In [98]:
df2.groupby(['make_model_trim', 'manufactured_year'])['fuel_type'].agg(pd.Series.mode)

2.0

In [77]:
get_percentage_nulls(df2)[25:]

engine_size          0.873849
mileage              0.169016
doors                0.140247
manufactured_year    0.075518
body_type            0.064730
valves               0.025173
cylinders            0.025173
engine_power         0.010788
transmission         0.003596
fuel_type            0.003596
make_model_trim      0.000000
number_of_photos     0.000000
price                0.000000
imported             0.000000
ulez                 0.000000
dtype: float64

In [79]:
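def drop_rows_with_small_percentage_of_missing_values(df):
    """Drop rows that have nulls in columns where under 0.1% of values are missing."""
    df = df.copy()
    # get_percentage_nulls returns percentages, so 0.1 means 0.1%, not 10%.
    columns = [
        column
        for column, percentage in get_percentage_nulls(df).items()
        if percentage < 0.1
    ]
    return df.dropna(subset=columns)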


In [83]:
df2 = drop_rows_with_small_percentage_of_missing_values(df2)

In [88]:
get_percentage_nulls(df2).index

Index(['number_of_owners', 'boot_space_seats_down', 'urban', 'extra_urban',
       'seller_rating', 'total_reviews', 'vehicle_location_longitude',
       'vehicle_location_latitude', 'boot_space_seats_up', 'seller_segment',
       'county', 'region', 'town', 'fuel_tank_capacity', 'combined', 'tax',
       'engine_torque', 'insurance_group', 'top_speed', 'width', 'wheelbase',
       'height', 'length', 'seats', 'co2_emissions', 'engine_size', 'mileage',
       'doors', 'make_model_trim', 'manufactured_year', 'engine_power',
       'valves', 'cylinders', 'number_of_photos', 'price', 'imported',
       'fuel_type', 'transmission', 'body_type', 'ulez'],
      dtype='object')

In [None]:
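def fill_columns_with_missing_values(df):
    # Work in progress: no fill strategy has been chosen yet, so this
    # currently returns an unmodified copy of df.
    columns = ['boot_space_seats_down', 'urban', 'extra_urban',
     'boot_space_seats_up', 'fuel_tank_capacity', 'combined', 'tax',
     'engine_torque', 'insurance_group', 'top_speed', 'width', 'wheelbase',
     'height', 'length', 'seats', 'co2_emissions', 'engine_size', 'mileage',
     'doors']

    df = df.copy()
    for column in columns:
        # Placeholder: selects the missing entries of each column without filling them.
        df.loc[df[column].isnull(), column]
    return df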

In [153]:
b = df2.manufactured_year.mode()

In [155]:
df2.groupby(['make_model_trim','manufactured_year'], sort=False)['doors'].apply(lambda x: x.fillna(x.mode()))

0        3
1        3
2        3
3        5
4        5
        ..
27803    5
27804    5
27805    5
27806    5
27807    5
Name: doors, Length: 27761, dtype: Int64

In [137]:
df2.groupby(['make_model_trim']).mode().iloc[0]

AttributeError: 'DataFrameGroupBy' object has no attribute 'mode'
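
`DataFrameGroupBy` has no `mode` method, and `fillna(x.mode())` does not fill as intended either, because `fillna` aligns a Series argument on its index and `mode()` returns a Series indexed from 0. One possible workaround, sketched under the assumption that the first mode of each group is an acceptable fill value, is to reduce each group to a single value and broadcast it with `transform`. `first_mode` and `fill_with_group_mode` are hypothetical helpers.

```python
import numpy as np
import pandas as pd


def first_mode(values: pd.Series):
    """Return the most frequent non-null value, or NaN if the group has none."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else np.nan


def fill_with_group_mode(df: pd.DataFrame, column: str, keys: list) -> pd.DataFrame:
    """Fill nulls in column with the mode of the rows sharing the same keys."""
    df = df.copy()
    # transform broadcasts each group's mode back onto the original index.
    group_modes = df.groupby(keys)[column].transform(first_mode)
    df[column] = df[column].fillna(group_modes)
    return df


# Toy data: the last group has no known value, so it stays null.
toy = pd.DataFrame(
    {
        "make_model_trim": ["vauxhall_astra_sxi"] * 3
        + ["ford_fiesta_ghia"] * 2
        + ["mini_hatch_cooper"],
        "doors": [3.0, 3.0, np.nan, 5.0, np.nan, np.nan],
    }
)
filled = fill_with_group_mode(toy, "doors", ["make_model_trim"])
print(filled)
```

Groups with no known value are left null, so a fallback such as a coarser grouping would still be needed for those rows.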

In [127]:
df2['boot_space_seats_down'].fillna(df2.groupby(['make_model_trim', 'manufactured_year'])['boot_space_seats_down'].transform('mean'))

TypeError: cannot safely cast non-equivalent object to int64
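
This `TypeError` is probably caused by the group mean being fractional while the column is stored as nullable `Int64`. A minimal sketch of one way around it, assuming a rounded group mean is an acceptable fill value: do the arithmetic in `float64`, then round and cast back. `fill_with_group_mean` is a hypothetical helper.

```python
import pandas as pd


def fill_with_group_mean(df: pd.DataFrame, column: str, keys: list) -> pd.DataFrame:
    """Fill nulls in an Int64 column with the rounded mean of its group."""
    df = df.copy()
    # Work in float64 so fractional means are representable (<NA> becomes NaN).
    df[column] = df[column].astype("float64")
    group_means = df.groupby(keys)[column].transform("mean")
    # Round before casting back, since Int64 cannot hold fractional values.
    df[column] = df[column].fillna(group_means).round().astype("Int64")
    return df


# Toy data, not taken from the scraped dataset.
toy = pd.DataFrame(
    {
        "make_model_trim": ["a", "a", "a", "b", "b"],
        "boot_space_seats_down": pd.array([1000, None, 1100, 900, None], dtype="Int64"),
    }
)
filled = fill_with_group_mean(toy, "boot_space_seats_down", ["make_model_trim"])
print(filled)
```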

In [None]:
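def predict_no_of_owners(df):
    # Draft, adapted from a Titanic age-imputation template: fit a linear
    # regression on the rows where number_of_owners is known, then predict
    # it for the rows where it is missing.
    # Assumption: this feature list is a placeholder, not a final choice.
    features = ["manufactured_year", "mileage"]

    data = df[features + ["number_of_owners"]].dropna(subset=features)
    data = data.astype("float64")

    test_data = data[data["number_of_owners"].isnull()]
    train_data = data.dropna()

    y_train = train_data["number_of_owners"]
    X_train = train_data.drop("number_of_owners", axis=1)
    X_test = test_data.drop("number_of_owners", axis=1)

    model = LinearRegression()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    return pd.Series(y_pred, index=X_test.index, name="number_of_owners")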

## Output dataset to file <a id='pickle_data'></a>

Save the dataset to a pickle file, which preserves the column data types. Nulls are left in the dataset to preserve as much data as possible
before exploratory data analysis. The effect of dropping nulls versus imputing values will be evaluated for the prediction model.

In [24]:
df2.to_pickle("car_data.pkl")
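
To illustrate why a pickle is used here rather than a CSV, the sketch below round-trips a small hypothetical frame through both formats and compares the resulting dtypes. The file names and toy data are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# Toy frame using the same kinds of dtypes as the cleaned dataset.
toy = pd.DataFrame(
    {
        "doors": pd.array([3, None, 5], dtype="Int64"),
        "body_type": pd.Categorical(["hatchback", "saloon", "hatchback"]),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = os.path.join(tmp, "toy.pkl")
    csv_path = os.path.join(tmp, "toy.csv")
    toy.to_pickle(pickle_path)
    toy.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pickle_path)
    from_csv = pd.read_csv(csv_path)

# The pickle keeps Int64 and category; the CSV falls back to inferred dtypes.
print(from_pickle.dtypes)
print(from_csv.dtypes)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.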