#### I. Importing needed libraries

In [188]:
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings("ignore")

#### II. Prepping the dataset

In [189]:
dataset = pd.read_csv("auto-mpg.data.csv")
dataset.head(10)

Unnamed: 0,"18.0 8 307.0 130.0 3504. 12.0 70 1	""chevrolet chevelle malibu"""
0,15.0 8 350.0 165.0 3693. 11...
1,18.0 8 318.0 150.0 3436. 11...
2,16.0 8 304.0 150.0 3433. 12...
3,17.0 8 302.0 140.0 3449. 10...
4,15.0 8 429.0 198.0 4341. 10...
5,14.0 8 454.0 220.0 4354. 9...
6,14.0 8 440.0 215.0 4312. 8...
7,14.0 8 455.0 225.0 4425. 10...
8,15.0 8 390.0 190.0 3850. 8...
9,15.0 8 383.0 170.0 3563. 10...


As we can see, there are three problems with this dataset:<br>
<ul>
    <li>1: the first row of the data has automatically become the column header</li>
    <li>2: the dataset has nine attributes, but the output above shows that they are not split into one column each: the eight numeric attributes (mpg, cylinders, weight, ...) and the car name are crammed together</li>
    <li>3: the column indices need to be replaced with column names</li>
</ul>

In [197]:
# treating the first row as data (not as the header), and splitting the attributes into their respective columns
dataset = pd.read_csv("auto-mpg.data.csv", sep=r'\s+', header=None)
dataset.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino
5,15.0,8,429.0,198.0,4341.0,10.0,70,1,ford galaxie 500
6,14.0,8,454.0,220.0,4354.0,9.0,70,1,chevrolet impala
7,14.0,8,440.0,215.0,4312.0,8.5,70,1,plymouth fury iii
8,14.0,8,455.0,225.0,4425.0,10.0,70,1,pontiac catalina
9,15.0,8,390.0,190.0,3850.0,8.5,70,1,amc ambassador dpl
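To see what the two arguments change without needing the file on disk, here is a minimal sketch that parses two rows from an in-memory string. The `sample` text and its spacing are assumptions about the raw file layout (whitespace between the numbers, a tab before the quoted car name), not a copy of the actual file.

```python
import io
import pandas as pd

# Two rows in the assumed raw layout: whitespace-separated numbers,
# then a tab and the quoted car name.
sample = (
    '18.0   8   307.0   130.0   3504.   12.0   70  1\t"chevrolet chevelle malibu"\n'
    '15.0   8   350.0   165.0   3693.   11.5   70  1\t"buick skylark 320"\n'
)

# Defaults: comma as separator, first row consumed as the header.
raw = pd.read_csv(io.StringIO(sample))

# Any run of whitespace separates fields; no row is treated as a header.
parsed = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)

print(raw.shape)     # shape with the default settings
print(parsed.shape)  # shape with sep and header set explicitly
```

The quoted car name stays in one field even though it contains spaces, because the parser treats the quoted text as a single value.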


In [199]:
# replacing the integer column indices with descriptive names
dataset.columns = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model-year", "origin", "name"]
dataset.head(10)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model-year,origin,name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino
5,15.0,8,429.0,198.0,4341.0,10.0,70,1,ford galaxie 500
6,14.0,8,454.0,220.0,4354.0,9.0,70,1,chevrolet impala
7,14.0,8,440.0,215.0,4312.0,8.5,70,1,plymouth fury iii
8,14.0,8,455.0,225.0,4425.0,10.0,70,1,pontiac catalina
9,15.0,8,390.0,190.0,3850.0,8.5,70,1,amc ambassador dpl
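As an alternative sketch, the same names could be supplied at read time through the `names` argument of `pd.read_csv`, which avoids the separate assignment step. The one-row `sample` string is an assumed stand-in for the file, used here only so the snippet runs on its own.

```python
import io
import pandas as pd

column_names = ["mpg", "cylinders", "displacement", "horsepower", "weight",
                "acceleration", "model-year", "origin", "name"]

# Assumed raw layout of one row, standing in for the real file.
sample = '18.0 8 307.0 130.0 3504. 12.0 70 1\t"chevrolet chevelle malibu"\n'

# names= labels the columns while reading; header=None keeps the first row as data.
named = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, names=column_names)

print(named.columns.tolist())
```

Both approaches should give the same frame; assigning `dataset.columns` afterwards, as done above, is just as valid.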


In [201]:
# checking if there is any null data (for imputation)
dataset.isnull().any()

mpg             False
cylinders       False
displacement    False
horsepower      False
weight          False
acceleration    False
model-year      False
origin          False
name            False
dtype: bool

The check reports no missing values in any column. Now that the dataset is ready, let's assign it to <strong>X</strong> and <strong>y</strong> variables to proceed further.
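One caveat worth a quick check: `isnull()` only flags values that pandas parsed as NaN. If a file marks missing entries with a placeholder token, those entries are read as ordinary text and pass the check. The sketch below uses made-up rows and a hypothetical `?` placeholder to illustrate the idea; whether this particular file contains such tokens is an assumption that would need to be verified.

```python
import io
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up rows for illustration only; "?" is a hypothetical missing-value marker.
sample = "mpg horsepower\n18.0 130.0\n25.0 ?\n"

# Read as-is: the "?" is kept as text, so no NaN is reported.
plain = pd.read_csv(io.StringIO(sample), sep=r"\s+")
print(plain.isnull().any().any())
print(is_numeric_dtype(plain["horsepower"]))  # a non-numeric column is a warning sign

# Declaring the placeholder via na_values turns it into NaN.
flagged = pd.read_csv(io.StringIO(sample), sep=r"\s+", na_values="?")
print(flagged["horsepower"].isnull().sum())
```

A quick look at `dataset.dtypes` serves the same purpose: a column that should be numeric but is not parsed as such may be hiding placeholder values.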