### I. Importing needed libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings("ignore")

### II. Prepping the dataset

In [3]:
dataset = pd.read_csv("auto-mpg.data.csv")
dataset.head(10)

Unnamed: 0,"18.0 8 307.0 130.0 3504. 12.0 70 1	""chevrolet chevelle malibu"""
0,15.0 8 350.0 165.0 3693. 11...
1,18.0 8 318.0 150.0 3436. 11...
2,16.0 8 304.0 150.0 3433. 12...
3,17.0 8 302.0 140.0 3449. 10...
4,15.0 8 429.0 198.0 4341. 10...
5,14.0 8 454.0 220.0 4354. 9...
6,14.0 8 440.0 215.0 4312. 8...
7,14.0 8 455.0 225.0 4425. 10...
8,15.0 8 390.0 190.0 3850. 8...
9,15.0 8 383.0 170.0 3563. 10...


In [4]:
# checking size and shape of the dataset
print("Dataset size: " + str(dataset.size))
print("Dataset shape: " + str(dataset.shape))

Dataset size: 397
Dataset shape: (397, 1)


As we can see, there are three problems with this dataset:<br>
<ul>
    <li>1: the first row of the data was automatically used as the column header</li>
    <li>2: each record holds nine attributes (eight numeric values plus the car name), but the output above shows that everything was read into a single column</li>
    <li>3: proper column names are missing</li>
</ul>
        
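To see why the default call produces a single column, here is a minimal sketch on two in-memory lines. The sample text is an assumption modelled on the rows shown above (whitespace between the numbers, a tab before the quoted name); it is not read from the real file.

```python
import io
import pandas as pd

# Two illustrative lines in the same whitespace-delimited layout as the file
# (values copied from the preview above; exact spacing is an assumption)
raw = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
    '15.0   8   350.0      165.0      3693.      11.5   70  1\t"buick skylark 320"\n'
)

# Default settings: comma separator, first line consumed as the header
default_read = pd.read_csv(io.StringIO(raw))

# Whitespace separator and no header row
fixed_read = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)

print(default_read.shape)
print(fixed_read.shape)
```

With no commas in the text, the default separator never splits a line, and the first line is taken as the header, so one record goes missing from the row count.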

In [5]:
# treating the first line as data (header=None) and splitting on whitespace
# so that each attribute lands in its own column
dataset = pd.read_csv("auto-mpg.data.csv", sep=r'\s+', header=None)
dataset.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino
5,15.0,8,429.0,198.0,4341.0,10.0,70,1,ford galaxie 500
6,14.0,8,454.0,220.0,4354.0,9.0,70,1,chevrolet impala
7,14.0,8,440.0,215.0,4312.0,8.5,70,1,plymouth fury iii
8,14.0,8,455.0,225.0,4425.0,10.0,70,1,pontiac catalina
9,15.0,8,390.0,190.0,3850.0,8.5,70,1,amc ambassador dpl


In [6]:
# replacing column index with names
dataset.columns = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model-year", "origin", "name"]
dataset.head(10)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model-year,origin,name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino
5,15.0,8,429.0,198.0,4341.0,10.0,70,1,ford galaxie 500
6,14.0,8,454.0,220.0,4354.0,9.0,70,1,chevrolet impala
7,14.0,8,440.0,215.0,4312.0,8.5,70,1,plymouth fury iii
8,14.0,8,455.0,225.0,4425.0,10.0,70,1,pontiac catalina
9,15.0,8,390.0,190.0,3850.0,8.5,70,1,amc ambassador dpl


In [7]:
# checking size and shape of the dataset
print("Dataset size: " + str(dataset.size))
print("Dataset shape: " + str(dataset.shape))

Dataset size: 3582
Dataset shape: (398, 9)


All three problems are fixed :) <br>
<ul>
    <li>1: the first row is back in the data (398 rows instead of 397) and no longer serves as the header</li>
    <li>2: the dataset has 9 columns, one per attribute</li>
    <li>3: the columns have proper names</li>
</ul>
        

##### Now that the outline of the dataset is in good shape (row and column index, shape, and column names), let's take a quick look at the individual attributes (columns)

In [8]:
# checking for any null values 
dataset.isnull().sum()

mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model-year      0
origin          0
name            0
dtype: int64

In [9]:
# checking the datatype of the attributes
dataset.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower       object
weight          float64
acceleration    float64
model-year        int64
origin            int64
name             object
dtype: object

Every attribute is numerical except "horsepower" and "name", which are objects. There is no need to modify the "name" attribute, because we will not use it as a feature when we begin modelling. So let's look at the 'horsepower' attribute:

In [10]:
# checking if 'horsepower' has any missing values
dataset["horsepower"].isnull().sum()

0

In [11]:
# check if 'horsepower' has any special characters (checking random segments)
print(dataset["horsepower"].values[0:40])
print("\n=================================")
print(dataset["horsepower"].values[100:140])
print("\n=================================")
print(dataset["horsepower"].values[300:340])

['130.0' '165.0' '150.0' '150.0' '140.0' '198.0' '220.0' '215.0' '225.0'
 '190.0' '170.0' '160.0' '150.0' '225.0' '95.00' '95.00' '97.00' '85.00'
 '88.00' '46.00' '87.00' '90.00' '95.00' '113.0' '90.00' '215.0' '200.0'
 '210.0' '193.0' '88.00' '90.00' '95.00' '?' '100.0' '105.0' '100.0'
 '88.00' '100.0' '165.0' '175.0']

['88.00' '95.00' '46.00' '150.0' '167.0' '170.0' '180.0' '100.0' '88.00'
 '72.00' '94.00' '90.00' '85.00' '107.0' '90.00' '145.0' '230.0' '49.00'
 '75.00' '91.00' '112.0' '150.0' '110.0' '122.0' '180.0' '95.00' '?'
 '100.0' '100.0' '67.00' '80.00' '65.00' '75.00' '100.0' '110.0' '105.0'
 '140.0' '150.0' '150.0' '140.0']

['90.00' '70.00' '70.00' '65.00' '69.00' '90.00' '115.0' '115.0' '90.00'
 '76.00' '60.00' '70.00' '65.00' '90.00' '88.00' '90.00' '90.00' '78.00'
 '90.00' '75.00' '92.00' '75.00' '65.00' '105.0' '65.00' '48.00' '48.00'
 '67.00' '67.00' '67.00' '?' '67.00' '62.00' '132.0' '100.0' '88.00' '?'
 '72.00' '84.00' '84.00']


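We can clearly see some "?" values in the 'horsepower' attribute. This also explains why the null check above reported 0: "?" is an ordinary string, not a missing value. Let's filter only the rows where 'horsepower' == '?':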

In [12]:
dataset[dataset["horsepower"] == "?"]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model-year,origin,name
32,25.0,4,98.0,?,2046.0,19.0,71,1,ford pinto
126,21.0,6,200.0,?,2875.0,17.0,74,1,ford maverick
330,40.9,4,85.0,?,1835.0,17.3,80,2,renault lecar deluxe
336,23.6,4,140.0,?,2905.0,14.3,80,1,ford mustang cobra
354,34.5,4,100.0,?,2320.0,15.8,81,2,renault 18i
374,23.0,4,151.0,?,3035.0,20.5,82,1,amc concord dl
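Scanning random segments works here, but it can miss placeholders other than "?". A more systematic check, sketched below on a small made-up column rather than the real data, is to let `pd.to_numeric` with `errors="coerce"` flag every entry that does not parse as a number.

```python
import pandas as pd

# Small illustrative column (not the real data): numeric text plus "?" placeholders
horsepower = pd.Series(["130.0", "165.0", "?", "95.00", "?"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN
as_numbers = pd.to_numeric(horsepower, errors="coerce")

# Entries that were present as text but failed to parse are the suspicious ones
bad_mask = as_numbers.isna() & horsepower.notna()

print(horsepower[bad_mask].unique())  # the distinct non-numeric tokens
print(int(bad_mask.sum()))            # how many rows are affected
```

Applied to `dataset["horsepower"]`, the same mask should select the rows listed above, assuming "?" is the only non-numeric token in the column.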


Instead of removing these rows, let us fix them through <strong>imputation</strong> (the process of replacing missing data with substituted values). Because the column is stored as text, mixing numeric strings with "?", we cannot compute a mean on it directly, so we do a little pre-work first:

In [13]:
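# splitting the dataset into two datasets: 
# ds_mark --> dataset that has "?" in horsepower attribute
# ds_wo_mark --> dataset without the "?" in horsepower attribute
# .copy() makes each subset an independent frame, so the assignments
# in the next cell do not operate on a view of `dataset`
ds_mark = dataset[dataset["horsepower"] == "?"].copy()
ds_wo_mark = dataset[dataset["horsepower"] != "?"].copy()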

# BEFORE imputation
print("ds_mark dataset BEFORE imputation:")
ds_mark["horsepower"]

ds_mark dataset BEFORE imputation:


32     ?
126    ?
330    ?
336    ?
354    ?
374    ?
Name: horsepower, dtype: object

In [14]:
# converting the 'horsepower' column of ds_wo_mark to float type
ds_wo_mark["horsepower"] = ds_wo_mark["horsepower"].astype(float)

# actual imputation:
# we are taking the mean value of the "horsepower" attribute and use it as a substitute 
# value to impute the "?" values
ds_wo_mean = ds_wo_mark.horsepower.mean()

# performing the imputation on ds_mark dataset
ds_mark["horsepower"] = ds_wo_mean

# AFTER imputation
print("ds_mark dataset AFTER imputation: ")
ds_mark["horsepower"]

ds_mark dataset AFTER imputation: 


32     104.469388
126    104.469388
330    104.469388
336    104.469388
354    104.469388
374    104.469388
Name: horsepower, dtype: float64

Now that we have imputed the 'horsepower' attribute, let's combine the two datasets to get our working dataframe (this time with imputed values):

In [15]:
# concatenating the two dataframes into one
frames = [ds_wo_mark, ds_mark]
dataset = pd.concat(frames)

# verifying the shape again
dataset.shape

(398, 9)
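One side effect worth knowing: `pd.concat` stacks the frames in the order given, so the imputed rows end up at the bottom while keeping their original index labels. If the original row order matters, `sort_index()` restores it. The toy frame below is illustrative only and uses hypothetical variable names.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative)
toy = pd.DataFrame({"horsepower": ["130.0", "?", "150.0", "?"]})

with_mark = toy[toy["horsepower"] == "?"].copy()
without_mark = toy[toy["horsepower"] != "?"].copy()

# Rows from the second frame are appended after the first
combined = pd.concat([without_mark, with_mark])
print(list(combined.index))

# Sorting by index label puts the rows back in their original order
restored = combined.sort_index()
print(list(restored.index))
```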

If the steps above were not entirely clear, here is a summary of what we did:

<img src="imputation_explained.png" style="width:500px;height:650px">

1. split the dataset into two sub-sets (one with "?" and one without "?")
2. calculate mean of ds_wo_mark["horsepower"]
3. impute ds_mark dataset's missing (or "?") values with the mean calculated above
4. concatenate both the datasets
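The four steps above can also be sketched in a more compact form. This is an alternative to the split-and-concatenate approach, not what the notebook does: it assumes "?" is the only placeholder and tells `read_csv` to treat it as missing via `na_values`, then fills the gaps with `fillna`. The in-memory sample is made up for illustration.

```python
import io
import pandas as pd

# Tiny whitespace-delimited sample with one "?" placeholder (illustrative values)
raw = "mpg horsepower\n18.0 130.0\n25.0 ?\n15.0 170.0\n"

# na_values="?" makes pandas read the placeholder as NaN,
# so the column is parsed as float from the start
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", na_values="?")

# mean() skips NaN by default, so it only uses the known values
hp_mean = df["horsepower"].mean()

# Replace the missing entries with that mean
df["horsepower"] = df["horsepower"].fillna(hp_mean)

print(df)
```

This keeps the rows in their original order and avoids the intermediate frames, at the cost of hiding the individual steps that the split-based version makes explicit.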

In [16]:
# verifying if our "horsepower" problem has been fixed in the final dataset
dataset[dataset["horsepower"] == "?"]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model-year,origin,name


In [17]:
# dtypes one final time 
dataset.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower      float64
weight          float64
acceleration    float64
model-year        int64
origin            int64
name             object
dtype: object

In [39]:
# dropping the 'name' attribute and finalizing the dataset
dataset = dataset.drop(columns="name")
dataset.head(10)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model-year,origin
0,18.0,8,307.0,130.0,3504.0,12.0,70,1
1,15.0,8,350.0,165.0,3693.0,11.5,70,1
2,18.0,8,318.0,150.0,3436.0,11.0,70,1
3,16.0,8,304.0,150.0,3433.0,12.0,70,1
4,17.0,8,302.0,140.0,3449.0,10.5,70,1
5,15.0,8,429.0,198.0,4341.0,10.0,70,1
6,14.0,8,454.0,220.0,4354.0,9.0,70,1
7,14.0,8,440.0,215.0,4312.0,8.5,70,1
8,14.0,8,455.0,225.0,4425.0,10.0,70,1
9,15.0,8,390.0,190.0,3850.0,8.5,70,1
