#### I. Importing needed libraries

In [226]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings("ignore")

#### II. Prepping the dataset

In [2]:
dataset = pd.read_csv("auto-mpg.data.csv")
dataset.head(10)

Unnamed: 0,"18.0 8 307.0 130.0 3504. 12.0 70 1	""chevrolet chevelle malibu"""
0,15.0 8 350.0 165.0 3693. 11...
1,18.0 8 318.0 150.0 3436. 11...
2,16.0 8 304.0 150.0 3433. 12...
3,17.0 8 302.0 140.0 3449. 10...
4,15.0 8 429.0 198.0 4341. 10...
5,14.0 8 454.0 220.0 4354. 9...
6,14.0 8 440.0 215.0 4312. 8...
7,14.0 8 455.0 225.0 4425. 10...
8,15.0 8 390.0 190.0 3850. 8...
9,15.0 8 383.0 170.0 3563. 10...


As we can see, there are three problems with this dataset:<br>
<ul>
    <li>1: the first row of the data has been read as the column header</li>
    <li>2: the dataset has nine attributes, but the values of each row are not split into one column per attribute, because the file is whitespace-delimited and pandas expects commas by default</li>
    <li>3: the numeric column index needs to be replaced with column names</li>
</ul>

In [227]:
# treating the first line as data (header=None) and splitting the attributes into separate columns on whitespace
dataset = pd.read_csv("auto-mpg.data.csv", sep=r'\s+', header=None)
dataset.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino
5,15.0,8,429.0,198.0,4341.0,10.0,70,1,ford galaxie 500
6,14.0,8,454.0,220.0,4354.0,9.0,70,1,chevrolet impala
7,14.0,8,440.0,215.0,4312.0,8.5,70,1,plymouth fury iii
8,14.0,8,455.0,225.0,4425.0,10.0,70,1,pontiac catalina
9,15.0,8,390.0,190.0,3850.0,8.5,70,1,amc ambassador dpl
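The effect of `sep` and `header=None` can be checked on a small in-memory sample. The sketch below is only an illustration: the two lines in `sample` reuse values from the rows shown above, and their exact spacing is an assumption about how the file is laid out.

```python
import io

import pandas as pd

# Two lines in a layout assumed to resemble auto-mpg.data.csv
# (values taken from the first two rows shown above).
sample = (
    '18.0   8   307.0   130.0   3504.   12.0   70  1\t"chevrolet chevelle malibu"\n'
    '15.0   8   350.0   165.0   3693.   11.5   70  1\t"buick skylark 320"\n'
)

# Default settings: comma separator, first line used as the header.
default_df = pd.read_csv(io.StringIO(sample))

# Whitespace separator, first line kept as data.
fixed_df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)

print(default_df.shape)  # one data row, one column
print(fixed_df.shape)    # two data rows, nine columns
```

With the defaults, the first car becomes the header and everything lands in a single column; with the whitespace separator, the quoted car name stays together as one field.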


In [228]:
# replacing column index with names
dataset.columns = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model-year", "origin", "name"]
dataset.head(10)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model-year,origin,name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino
5,15.0,8,429.0,198.0,4341.0,10.0,70,1,ford galaxie 500
6,14.0,8,454.0,220.0,4354.0,9.0,70,1,chevrolet impala
7,14.0,8,440.0,215.0,4312.0,8.5,70,1,plymouth fury iii
8,14.0,8,455.0,225.0,4425.0,10.0,70,1,pontiac catalina
9,15.0,8,390.0,190.0,3850.0,8.5,70,1,amc ambassador dpl
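As an alternative to assigning `dataset.columns` afterwards, the names can be passed to `read_csv` directly through its `names` parameter. This is a minimal sketch on a one-line in-memory sample (the spacing of that line is an assumption), not a replacement for the cells above.

```python
import io

import pandas as pd

column_names = ["mpg", "cylinders", "displacement", "horsepower", "weight",
                "acceleration", "model-year", "origin", "name"]

# One line in a layout assumed to resemble auto-mpg.data.csv.
sample = '18.0 8 307.0 130.0 3504. 12.0 70 1\t"chevrolet chevelle malibu"\n'

# names= supplies the header, so the first line is kept as data.
named_df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, names=column_names)

print(named_df.columns.tolist())
```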


In [229]:
# checking the datatype of the attributes
dataset.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower       object
weight          float64
acceleration    float64
model-year        int64
origin            int64
name             object
dtype: object

Every attribute is numerical except 'horsepower' and 'name', which are stored as objects (strings). 'name' is free text and has to stay that way, but 'horsepower' should be of numerical type.
Before converting it, it's worth inspecting this attribute for missing values or special characters that would explain why it was read as an object.

In [230]:
# checking if 'horsepower' has any missing values
dataset["horsepower"].isnull().sum()

0

In [231]:
# checking if 'horsepower' has any special characters
dataset["horsepower"].values

array(['130.0', '165.0', '150.0', '150.0', '140.0', '198.0', '220.0',
       '215.0', '225.0', '190.0', '170.0', '160.0', '150.0', '225.0',
       '95.00', '95.00', '97.00', '85.00', '88.00', '46.00', '87.00',
       '90.00', '95.00', '113.0', '90.00', '215.0', '200.0', '210.0',
       '193.0', '88.00', '90.00', '95.00', '?', '100.0', '105.0', '100.0',
       '88.00', '100.0', '165.0', '175.0', '153.0', '150.0', '180.0',
       '170.0', '175.0', '110.0', '72.00', '100.0', '88.00', '86.00',
       '90.00', '70.00', '76.00', '65.00', '69.00', '60.00', '70.00',
       '95.00', '80.00', '54.00', '90.00', '86.00', '165.0', '175.0',
       '150.0', '153.0', '150.0', '208.0', '155.0', '160.0', '190.0',
       '97.00', '150.0', '130.0', '140.0', '150.0', '112.0', '76.00',
       '87.00', '69.00', '86.00', '92.00', '97.00', '80.00', '88.00',
       '175.0', '150.0', '145.0', '137.0', '150.0', '198.0', '150.0',
       '158.0', '150.0', '215.0', '225.0', '175.0', '105.0', '100.0',
       '100.0',

The output contains "?" entries in the 'horsepower' attribute. These are plain strings rather than NaN, which is why isnull() reported 0. Filtering only those rows where 'horsepower' == '?':

In [235]:
# keeping only the rows where 'horsepower' holds the '?' placeholder
horsepower_df = dataset[dataset["horsepower"] == "?"]
horsepower_df

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model-year,origin,name
32,25.0,4,98.0,?,2046.0,19.0,71,1,ford pinto
126,21.0,6,200.0,?,2875.0,17.0,74,1,ford maverick
330,40.9,4,85.0,?,1835.0,17.3,80,2,renault lecar deluxe
336,23.6,4,140.0,?,2905.0,14.3,80,1,ford mustang cobra
354,34.5,4,100.0,?,2320.0,15.8,81,2,renault 18i
374,23.0,4,151.0,?,3035.0,20.5,82,1,amc concord dl
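Scanning the raw array works for a small column, but the same check can be done programmatically. The sketch below uses `pd.to_numeric` with `errors="coerce"` to surface every entry that does not parse as a number, without knowing the placeholder in advance. The short series `hp` is a made-up stand-in for `dataset["horsepower"]`.

```python
import pandas as pd

# Hypothetical stand-in for dataset["horsepower"].
hp = pd.Series(["130.0", "165.0", "?", "95.00"], name="horsepower")

# Entries that cannot be parsed as numbers become NaN.
as_numbers = pd.to_numeric(hp, errors="coerce")

# Keep the original strings that failed to parse (and were not already missing).
bad_entries = hp[as_numbers.isna() & hp.notna()]

print(bad_entries.unique())
```

On the full column, the index of `bad_entries` would point at the same rows as the filter on `"?"` above, assuming "?" is the only non-numeric entry.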


We will fix this before moving further with the dataset.