# Project: Cleaning And Analyzing Data
## Introduction:
Hi! I am Mujibur Rahman Bhuniyan. I am a physicist and data scientist. This project demonstrates my data cleaning and analysis skills.
In this project, I clean and analyze the Automobile dataset from the UCI Machine Learning Repository. The dataset describes the features of each car along with its price. The features can be used to predict the price, but not all of them are equally useful. I will use pandas in Python to clean and analyze the data and find out which features matter for predicting car prices. We can then use machine learning to build a model, and later evaluate how effective it is. The steps I am going to take in this project are the following:
1. Extract the data from the database and get a basic insight of the data
2. Data Wrangling to convert it into a suitable format for better analysis
3. Exploratory Data Analysis to find meaningful features in the data
4. Developing a model based on what we learned in the previous step
5. Evaluate the model with different evaluation techniques to find how effective the model is

So, be patient while I uncover the mysteries of the data.

## 1. Getting the data
This data is hosted at the UCI Machine Learning Repository in CSV format, so we won't have to get our hands dirty writing queries against a database. The source for the data is https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
I will use the pandas library in Python to load it into a DataFrame.

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

filepath = 'https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data'

In [16]:
# The file has no header row, so we set header=None
df = pd.read_csv(filepath, header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


To get the header, we can read about the dataset at https://archive.ics.uci.edu/ml/datasets/Automobile, which lists the name of each field. We take that list and assign it as the column names of our dataframe.

In [17]:
names=[ 'symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 'drive-wheels',
       'engine-location', 'wheel-base', 'length', 'width','height', 'curb-weight', 'engine-type' ,'num-of-cylinders', 'engine-size',
       'fuel-system', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']

df.columns = names
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


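Let's get a statistical summary of the fields. By default, describe() covers only the numeric columns; passing include='all' adds the non-numeric ones.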

In [18]:
df.describe()

Unnamed: 0,symboling,wheel-base,length,width,height,curb-weight,engine-size,compression-ratio,city-mpg,highway-mpg
count,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0
mean,0.834146,98.756585,174.049268,65.907805,53.724878,2555.565854,126.907317,10.142537,25.219512,30.75122
std,1.245307,6.021776,12.337289,2.145204,2.443522,520.680204,41.642693,3.97204,6.542142,6.886443
min,-2.0,86.6,141.1,60.3,47.8,1488.0,61.0,7.0,13.0,16.0
25%,0.0,94.5,166.3,64.1,52.0,2145.0,97.0,8.6,19.0,25.0
50%,1.0,97.0,173.2,65.5,54.1,2414.0,120.0,9.0,24.0,30.0
75%,2.0,102.4,183.1,66.9,55.5,2935.0,141.0,9.4,30.0,34.0
max,3.0,120.9,208.1,72.3,59.8,4066.0,326.0,23.0,49.0,54.0
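
Columns such as price and horsepower are absent from the summary above because they still contain '?' entries and were therefore read as text. A minimal sketch of this behaviour on a toy frame (the values and the name `toy` are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame: 'price' contains a '?' placeholder,
# so pandas stores the column as text rather than numbers.
toy = pd.DataFrame({
    "engine-size": [130, 152, 109],
    "price": ["13495", "?", "13950"],
})

numeric_summary = toy.describe()             # numeric columns only
full_summary = toy.describe(include="all")   # numeric and non-numeric columns

print(list(numeric_summary.columns))
print(list(full_summary.columns))
```

Only `engine-size` appears in the default summary; `price` shows up once `include="all"` is passed.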


In [19]:
df.describe(include='all')

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
count,205.0,205,205,205,205,205,205,205,205,205.0,...,205.0,205,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205
unique,,52,22,2,2,3,5,3,2,,...,,8,39.0,37.0,,60.0,24.0,,,187
top,,?,toyota,gas,std,four,sedan,fwd,front,,...,,mpfi,3.62,3.4,,68.0,5500.0,,,?
freq,,41,32,185,168,114,96,120,202,,...,,94,23.0,20.0,,19.0,37.0,,,4
mean,0.834146,,,,,,,,,98.756585,...,126.907317,,,,10.142537,,,25.219512,30.75122,
std,1.245307,,,,,,,,,6.021776,...,41.642693,,,,3.97204,,,6.542142,6.886443,
min,-2.0,,,,,,,,,86.6,...,61.0,,,,7.0,,,13.0,16.0,
25%,0.0,,,,,,,,,94.5,...,97.0,,,,8.6,,,19.0,25.0,
50%,1.0,,,,,,,,,97.0,...,120.0,,,,9.0,,,24.0,30.0,
75%,2.0,,,,,,,,,102.4,...,141.0,,,,9.4,,,30.0,34.0,


From the summary of the dataframe, we can see that many fields contain '?' entries. These mark missing values, which we will have to deal with. First we replace each '?' with NaN so that pandas can identify them as missing.

In [20]:
df.replace('?', np.nan, inplace=True)
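
As an alternative sketch, the column names and the '?' placeholder can also be handled while loading, using the `names` and `na_values` parameters of `read_csv`. The example below uses two made-up rows with only four columns (`raw` and `sample_names` are illustrative names), so it runs without downloading the file:

```python
import io
import pandas as pd

# Two hypothetical rows in the same headerless, comma-separated layout,
# shortened to four columns to keep the example readable.
raw = io.StringIO(
    "3,?,alfa-romero,13495\n"
    "2,164,audi,?\n"
)
sample_names = ["symboling", "normalized-losses", "make", "price"]

# names= supplies the header; na_values= turns '?' into NaN while parsing,
# so the affected columns are parsed as numbers instead of text.
sample = pd.read_csv(raw, header=None, names=sample_names, na_values="?")

print(sample)
print(sample.dtypes)
```

With this approach the numeric columns arrive as floats, so a separate conversion step may not be needed.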

In [21]:
df.describe(include='all')

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
count,205.0,164.0,205,205,205,203,205,205,205,205.0,...,205.0,205,201.0,201.0,205.0,203.0,203.0,205.0,205.0,201.0
unique,,51.0,22,2,2,2,5,3,2,,...,,8,38.0,36.0,,59.0,23.0,,,186.0
top,,161.0,toyota,gas,std,four,sedan,fwd,front,,...,,mpfi,3.62,3.4,,68.0,5500.0,,,7898.0
freq,,11.0,32,185,168,114,96,120,202,,...,,94,23.0,20.0,,19.0,37.0,,,2.0
mean,0.834146,,,,,,,,,98.756585,...,126.907317,,,,10.142537,,,25.219512,30.75122,
std,1.245307,,,,,,,,,6.021776,...,41.642693,,,,3.97204,,,6.542142,6.886443,
min,-2.0,,,,,,,,,86.6,...,61.0,,,,7.0,,,13.0,16.0,
25%,0.0,,,,,,,,,94.5,...,97.0,,,,8.6,,,19.0,25.0,
50%,1.0,,,,,,,,,97.0,...,120.0,,,,9.0,,,24.0,30.0,
75%,2.0,,,,,,,,,102.4,...,141.0,,,,9.4,,,30.0,34.0,


### Deal with the missing data

In [22]:
nan_values = df.isnull()

for i in nan_values.columns:
    if nan_values[i].sum() >0:
        print(i,' has ',nan_values[i].sum(), ' missing values')
    

normalized-losses  has  41  missing values
num-of-doors  has  2  missing values
bore  has  4  missing values
stroke  has  4  missing values
horsepower  has  2  missing values
peak-rpm  has  2  missing values
price  has  4  missing values
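
The same count can be sketched without an explicit loop, since `isna().sum()` returns the number of missing values per column. The toy frame below is hypothetical and only stands in for df:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df.
toy = pd.DataFrame({
    "bore": [3.47, np.nan, 3.19],
    "make": ["audi", "bmw", "mazda"],
    "price": [np.nan, np.nan, 13950.0],
})

missing_counts = toy.isna().sum()                     # NaNs per column
missing_counts = missing_counts[missing_counts > 0]   # keep only columns with gaps

print(missing_counts)
```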


Since our goal is to build a model that predicts the price of a car, we only need the rows where the price is available. So, we drop every row with a missing value in that field.

In [23]:
print(df.shape)
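df.dropna(subset=['price'], axis=0, inplace=True)
# reset_index returns a new DataFrame; the result is not assigned here,
# so df keeps its original row labels
df.reset_index(drop=True)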
print(df.shape)
df.head()

(205, 26)
(201, 26)


Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
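
Note that `reset_index` returns a new DataFrame rather than changing df in place, so it only takes effect when the result is assigned back. A small sketch of the difference on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing price.
toy = pd.DataFrame({
    "make": ["audi", "bmw", "mazda"],
    "price": [13950.0, np.nan, 6795.0],
})

dropped = toy.dropna(subset=["price"])        # row label 1 is removed
relabelled = dropped.reset_index(drop=True)   # labels renumbered from 0

print(list(dropped.index))      # [0, 2]
print(list(relabelled.index))   # [0, 1]
```

Whether to renumber is a design choice: keeping the original labels makes it easy to trace a row back to the source file, while renumbering gives a gap-free index.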


For the rows with a missing number of doors, we can look at the body style to infer the value.

In [24]:
df[df['num-of-doors'].isnull()]['body-style']

27    sedan
63    sedan
Name: body-style, dtype: object

Both are sedans, which typically have four doors, so we fill in 'four' for them.
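
One way to sanity-check this kind of assumption is to tabulate door counts per body style and take the most common value. The sketch below uses made-up rows (`toy`, `door_counts`, and `most_common` are illustrative names); on the real data the same calls would be applied to df:

```python
import pandas as pd

# Hypothetical rows; the real counts come from df.
toy = pd.DataFrame({
    "body-style": ["sedan", "sedan", "sedan", "hatchback", "hatchback"],
    "num-of-doors": ["four", "four", "two", "two", "two"],
})

# Contingency table: rows are body styles, columns are door counts.
door_counts = pd.crosstab(toy["body-style"], toy["num-of-doors"])

# Most frequent door count for each body style.
most_common = toy.groupby("body-style")["num-of-doors"].agg(
    lambda s: s.mode().iloc[0]
)

print(door_counts)
print(most_common)
```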

In [25]:
# Assign the result back: an inplace call on df['num-of-doors'] may act on a temporary copy
df['num-of-doors'] = df['num-of-doors'].fillna('four')

nan_values = df.isnull()
for i in nan_values.columns:
    if nan_values[i].sum() >0:
        print(i,' has ',nan_values[i].sum(), ' missing values')

normalized-losses  has  37  missing values
bore  has  4  missing values
stroke  has  4  missing values
horsepower  has  2  missing values
peak-rpm  has  2  missing values


We replace the missing values in normalized-losses with the column mean.

In [32]:
df['normalized-losses']= df['normalized-losses'].astype('float')
norm_loss_avg = df['normalized-losses'].mean()
norm_loss_avg

122.0

In [34]:
# Assign the result back instead of calling an inplace method on a column selection
df['normalized-losses'] = df['normalized-losses'].fillna(norm_loss_avg)

Missing values in bore, stroke, horsepower, and peak-rpm can likewise be replaced with their column means.

In [45]:
# Convert the columns from text to float so that their means can be computed
df[['bore', 'stroke', 'horsepower', 'peak-rpm']] = df[['bore', 'stroke', 'horsepower', 'peak-rpm']].astype('float')
bore_avg = df['bore'].mean()
stroke_avg = df['stroke'].mean()
horsepower_avg = df['horsepower'].mean()
peak_rpm_avg = df['peak-rpm'].mean()

# Fill each column's NaNs with its own mean and assign the result back
df['bore'] = df['bore'].fillna(bore_avg)
df['stroke'] = df['stroke'].fillna(stroke_avg)
df['horsepower'] = df['horsepower'].fillna(horsepower_avg)
df['peak-rpm'] = df['peak-rpm'].fillna(peak_rpm_avg)
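
A more compact sketch of the same idea handles all columns at once: `fillna` accepts a Series of per-column means, so each column is filled with its own average. The toy frame and the name `cols` are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: numbers stored as text, with gaps.
toy = pd.DataFrame({
    "bore": ["3.47", np.nan, "3.19"],
    "horsepower": ["111", "154", np.nan],
})
cols = ["bore", "horsepower"]

# Convert text to numbers column by column.
toy[cols] = toy[cols].apply(pd.to_numeric)

# toy[cols].mean() is a Series indexed by column name, so each
# column's NaNs are filled with that column's own mean.
toy[cols] = toy[cols].fillna(toy[cols].mean())

print(toy)
```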

Let's check the missing values again. If the loop prints nothing, no missing values remain.

In [48]:
nan_values = df.isnull()
for i in nan_values.columns:
    if nan_values[i].sum() >0:
        print(i,' has ',nan_values[i].sum(), ' missing values')
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,122.0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495
1,3,122.0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500
2,1,122.0,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450
