# Data cleaning

The data in this notebook was scraped from the autotrader website. The notebook does the following:

- Read the vehicle features table and the sellers table from the database created by the scraper [here](#read_data)
- Explore initial descriptive information about the dataset [here](#explore_data)
- Outline the cleaning process prior to EDA [here](#clean_data)

In [1]:
import pandas as pd
import mysql.connector
from autotrader_scraper.autotrader_scraper.config import mysql_details

## Read data tables into pandas <a id='read_data'></a>

Connect to the database and join the vehicle features with the sellers information.

In [2]:
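DB_NAME = 'autotrader_adverts'

# Open a connection; dictionary=True returns each row as a dict keyed by column name
cnx = mysql.connector.connect(**mysql_details)
cursor = cnx.cursor(dictionary=True)

# Select the database, then join every advert to its seller.
# LEFT JOIN keeps adverts that have no matching row in sellers.
cursor.execute("USE {}".format(DB_NAME))
cursor.execute('''SELECT * 
                  FROM vehicle_features as vf
                  LEFT JOIN sellers as s
                  ON vf.seller_id=s.seller_id
                  ORDER BY date_scraped ASC, time_scraped ASC''')
full_results = cursor.fetchall()

# Release the cursor and the connection once the rows are in memory
cursor.close()
cnx.close()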
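
One thing to watch with `SELECT *` on a join is that both tables contribute a `seller_id` column, and a cursor that keys rows by column name can hold only one value per name. The sketch below shows one possible alternative, naming the columns explicitly. It is not the notebook's method: it uses an in-memory SQLite database as a stand-in for MySQL, `pd.read_sql_query` instead of a cursor, and made-up rows; only the table and column names are taken from the notebook.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stand-in for the MySQL database (assumption for illustration).
# Table and column names mirror the notebook; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vehicle_features (advert_id INTEGER, make TEXT, seller_id INTEGER);
    CREATE TABLE sellers (seller_id INTEGER, town TEXT);
    INSERT INTO vehicle_features VALUES (1, 'Ford', 10), (2, 'Volvo', 99);
    INSERT INTO sellers VALUES (10, 'SWINDON');
""")

# Listing columns explicitly keeps a single seller_id column in the result.
query = """
    SELECT vf.advert_id, vf.make, vf.seller_id, s.town
    FROM vehicle_features AS vf
    LEFT JOIN sellers AS s ON vf.seller_id = s.seller_id
    ORDER BY vf.advert_id
"""
demo_df = pd.read_sql_query(query, conn)
conn.close()
```

In this toy result the advert with `seller_id` 99 has no matching seller, so it is kept by the `LEFT JOIN` with a missing `town`, while its `seller_id` from `vehicle_features` is preserved.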

In [3]:
full_df = pd.DataFrame(full_results)
full_df.head(10)

Unnamed: 0,advert_id,date_scraped,time_scraped,make,model,trim,manufactured_year,manufactured_year_identifier,body_type,mileage,...,total_reviews,region,county,town,country,seller_postcode,seller_address_one,seller_address_two,dealer_website,primary_contact_number
0,202107165101873,2021-08-22,0 days 00:55:25,DS AUTOMOBILES,DS 3,Prestige,2018.0,18,Hatchback,31537.0,...,13067.0,SOUTH EAST,KENT,ADDINGTON,GB,ME19 5PL,A20 London Road,,https://www.bigmotoringworld.co.uk/autotraderv...,1634215708
1,202009113616600,2021-08-22,0 days 00:55:29,Vauxhall,Astra,SXi,2007.0,7,Hatchback,70000.0,...,18.0,LONDON,MIDDLESEX,HOUNSLOW,GB,TW4 6JQ,"VISTA BUSINESS CENTRE, SALISBURY ROAD",,https://www.motorpedia.uk/,2080337311
2,202107305627204,2021-08-22,0 days 00:55:29,Volkswagen,Polo,Moda,2010.0,60,Hatchback,89000.0,...,125.0,LONDON,HERTFORDSHIRE,BARNET,GB,EN5 4RY,BENTLEY HEATH LANE,,http://www.mynextcar.co.uk,2081152043
3,202108206481776,2021-08-22,0 days 00:55:30,Vauxhall,Astra,SXi,2008.0,8,Hatchback,117000.0,...,7.0,LONDON,ESSEX,ILFORD,GB,IG3 8RW,777-779 High Road,Seven Kings,https://dmsgateway.autotrader.co.uk/api/advert...,7441907724
4,202108206471965,2021-08-22,0 days 00:55:30,Mazda,Mazda3,Sport,2007.0,7,Hatchback,93500.0,...,18.0,SOUTH WEST,WILTSHIRE,SWINDON,GB,SN1 2PG,"UNIT 10-11, ISIS TRADING ESTATE",,https://dmsgateway.autotrader.co.uk/api/advert...,7537125171
5,202108136200337,2021-08-22,0 days 00:55:30,Vauxhall,Astra,Club,2004.0,54,Hatchback,73888.0,...,10.0,SOUTH EAST,SURREY,WORCESTER PARK,,,,,,7971223786
6,202108045818203,2021-08-22,0 days 00:55:30,Volvo,S40,SE,2005.0,55,Saloon,160000.0,...,15.0,LONDON,MIDDLESEX,LONDON,GB,,,,https://dmsgateway.autotrader.co.uk/api/advert...,7537165960
7,202108206446218,2021-08-22,0 days 00:55:31,Ford,Fiesta,Ghia,2006.0,6,Hatchback,63000.0,...,13.0,SOUTH EAST,EAST SUSSEX,UCKFIELD,,,,,,7971243842
8,202108206442132,2021-08-22,0 days 00:55:31,MINI,Hatch,Cooper,2004.0,4,Hatchback,63977.0,...,37.0,LONDON,SURREY,WANDSWORTH,GB,SW18 4QA,"152-156, Penwith Road",,,7537165675
9,202105243041206,2021-08-22,0 days 00:55:31,Vauxhall,Corsa,SE,2009.0,9,Hatchback,100843.0,...,128.0,NORTH WEST,CHESHIRE,NORTHWICH,GB,CW8 1BE,Unit 5 RIVERSIDE TRADING ESTATE,NAVIGATION ROAD,https://dmsgateway.autotrader.co.uk/api/advert...,7537164943


## Explore data <a id='explore_data'></a>

In [4]:
full_df.shape

(27808, 77)

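Some of the columns contain nulls (for example `trim` and `manufactured_year`), and some of the data types are incorrect: for example, `date_scraped` is stored as `object` and `manufactured_year` as `float64`.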

In [5]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27808 entries, 0 to 27807
Data columns (total 77 columns):
 #   Column                        Non-Null Count  Dtype          
---  ------                        --------------  -----          
 0   advert_id                     27808 non-null  int64          
 1   date_scraped                  27808 non-null  object         
 2   time_scraped                  27808 non-null  timedelta64[ns]
 3   make                          27808 non-null  object         
 4   model                         27808 non-null  object         
 5   trim                          26792 non-null  object         
 6   manufactured_year             27787 non-null  float64        
 7   manufactured_year_identifier  27508 non-null  object         
 8   body_type                     27790 non-null  object         
 9   mileage                       27761 non-null  float64        
 10  engine_size                   27565 non-null  object         
 11  transmission   
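
As a minimal sketch of how the data type issues visible in this output could be addressed, the example below builds a small toy frame with the same dtypes for three columns and converts them. The values are illustrative, the `scraped_at` column name is hypothetical, and the toy frame stores dates as strings, which may differ from the objects the database connector returns.

```python
import pandas as pd

# Toy frame reproducing the dtypes reported by full_df.info(); values are illustrative.
demo = pd.DataFrame({
    "date_scraped": ["2021-08-22", "2021-08-22"],
    "time_scraped": pd.to_timedelta(["00:55:25", "00:55:29"]),
    "manufactured_year": [2018.0, None],
})

# Parse the date column, then add the time-of-day offset to get a full timestamp.
# "scraped_at" is a hypothetical column name, not one from the notebook.
demo["date_scraped"] = pd.to_datetime(demo["date_scraped"])
demo["scraped_at"] = demo["date_scraped"] + demo["time_scraped"]

# The nullable integer dtype stores years as whole numbers and keeps gaps as <NA>.
demo["manufactured_year"] = demo["manufactured_year"].astype("Int64")
```

The nullable `Int64` dtype is one option for year-like columns with gaps; plain `int64` would fail here because it cannot represent missing values.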

Percentage of missing values in each column in descending order.

In [6]:
print(pd.DataFrame((full_df.isnull().sum()/len(full_df))*100).sort_values(0, ascending=False).to_string())

                                      0
max_loading_weight            82.771145
zero_to_sixty_two             66.218354
gross_vehicle_weight          65.330121
seller_address_two            61.388809
number_of_owners              45.857307
price_deviation               35.399885
price_deviation_type          35.399885
price_rating                  35.399885
price_rating_label            35.399885
boot_space_seats_down         35.209292
emission_scheme               33.076812
zero_to_sixty                 29.433976
urban                         25.722814
extra_urban                   25.672468
seller_rating                 22.292146
total_reviews                 22.292146
dealer_website                21.364356
minimum_kerb_weight           20.386220
vehicle_location_postcode     17.599252
vehicle_location_latitude     17.599252
vehicle_location_longitude    17.599252
seller_postcode               16.369390
seller_address_one            16.369390
seller_longlat                16.369390
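
A percentage table like the one above can also be used to flag sparse columns for review. The sketch below is one way to do that on a toy frame: the column names come from the notebook, but the values and the `THRESHOLD` cut-off are assumptions chosen only for illustration, not a rule the notebook applies.

```python
import pandas as pd

# Toy frame; column names come from the notebook, values are made up.
demo = pd.DataFrame({
    "make": ["Ford", "Volvo", "MINI", "Mazda"],
    "number_of_owners": [1, None, None, 3],
    "max_loading_weight": [None, None, None, 750.0],
})

# isna().mean() gives the fraction of missing values per column;
# multiplying by 100 turns it into a percentage.
missing_pct = demo.isna().mean().mul(100).sort_values(ascending=False)

# Hypothetical cut-off for "too sparse to keep", for illustration only.
THRESHOLD = 60
sparse_cols = missing_pct[missing_pct > THRESHOLD].index.tolist()
```

Here `isna().mean()` is equivalent to dividing `isnull().sum()` by the number of rows, as done in the cell above.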


## Clean data <a id='clean_data'></a>