# Exercise 3

The purpose of this exercise is to help you learn how to use some of the preprocessing functions available in Python.

**Step 1.** Download the Auto MPG dataset from the UCI data repository website (https://archive.ics.uci.edu/ml/datasets/Auto+MPG). Load the auto-mpg.data file into a pandas DataFrame and display its first 5 records. Note that some columns contain missing values, which are denoted as '?' in the original data. You need to replace the '?' with None before inserting the row into the pandas DataFrame object.

***Hint:*** You can use a tab character as the delimiter (separator) to split car_name from the other attributes, and then use whitespace as the delimiter to split the remaining columns (attributes). Refer to the Exercise 1 solution for how to apply a function that parses each row of a DataFrame.

In [1]:
import pandas as pd
import re

colnames = ['mpg','cylinders','displacement','horsepower','weight','acceleration','model_year','origin','car_name']

### Load the data into a DataFrame and replace the missing values ('?') with None ###
def getData(Row):
    # Split the row on runs of whitespace and map '?' to None
    regex = r'\s+'
    fields = pd.Series([None if x=='?' else x for x in re.split(regex, Row)])
    return fields

data_temp = pd.read_csv("auto-mpg.data",sep='\t',header=None)
data = pd.DataFrame(data_temp[0].apply(getData))
data.columns = colnames[:-1]
data[colnames[-1]] = data_temp[1]

data.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |
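
As an alternative to the row-parsing function above, pandas can usually do the splitting and the '?' replacement in a single call. The following is a minimal sketch rather than the exercise's intended solution: `raw` is a hypothetical stand-in for the real file, and its two lines only imitate the layout described in the hint (whitespace-separated numbers, then a tab and a quoted car name).

```python
import io
import pandas as pd

colnames = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
            'acceleration', 'model_year', 'origin', 'car_name']

# Two illustrative lines in the assumed layout of auto-mpg.data
raw = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"ford pinto"\n'
)

# sep=r'\s+' splits on any run of whitespace (tabs included); the quoted
# car name survives as one field, and na_values='?' marks '?' as missing
df = pd.read_csv(io.StringIO(raw), sep=r'\s+', names=colnames, na_values='?')
print(df)
print(df.dtypes)
```

With this approach the numeric columns are parsed as numbers, whereas the solution above keeps every parsed field as a string.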


**Step 2.** Use the describe() function to obtain statistics of the dataframe.

In [2]:
data.describe()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| count | 398.0 | 398 | 398.0 | 392.0 | 398.0 | 398.0 | 398 | 398 | 398 |
| unique | 129.0 | 5 | 82.0 | 93.0 | 351.0 | 96.0 | 13 | 3 | 305 |
| top | 13.0 | 4 | 97.0 | 150.0 | 2130.0 | 14.5 | 73 | 1 | ford pinto |
| freq | 20.0 | 204 | 21.0 | 22.0 | 4.0 | 23.0 | 40 | 249 | 6 |
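
The summary above lists count, unique, top and freq rather than mean and standard deviation because the parsed values are strings. The sketch below illustrates the difference on a small hypothetical Series (`as_text` and `as_number` are names made up for this example).

```python
import pandas as pd

as_text = pd.Series(['18.0', '15.0', '18.0', '16.0'])  # values kept as strings
as_number = as_text.astype(float)                      # same values as floats

text_stats = as_text.describe()      # categorical-style summary
number_stats = as_number.describe()  # numeric summary

print(text_stats)
print(number_stats)
```

Converting the numeric columns with astype(float) or pd.to_numeric before calling describe() gives the numeric summary.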


**Step 3.** Based on the results in Step 2, identify which column(s) contain missing values and display the rows that contain missing values.

**Answer:** The horsepower column contains missing values: its count is 392, whereas every other column has a count of 398.

In [3]:
missing_index = list(data[data['horsepower'].isnull()].index)  # Get the index of the rows with missing values
data.loc[missing_index]

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 32 | 25.0 | 4 | 98.0 | | 2046.0 | 19.0 | 71 | 1 | ford pinto |
| 126 | 21.0 | 6 | 200.0 | | 2875.0 | 17.0 | 74 | 1 | ford maverick |
| 330 | 40.9 | 4 | 85.0 | | 1835.0 | 17.3 | 80 | 2 | renault lecar deluxe |
| 336 | 23.6 | 4 | 140.0 | | 2905.0 | 14.3 | 80 | 1 | ford mustang cobra |
| 354 | 34.5 | 4 | 100.0 | | 2320.0 | 15.8 | 81 | 2 | renault 18i |
| 374 | 23.0 | 4 | 151.0 | | 3035.0 | 20.5 | 82 | 1 | amc concord dl |
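
Missing values can also be located without reading the describe() output. The sketch below counts them per column and selects every row that has at least one; `toy` is a small hypothetical DataFrame used only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'mpg': ['25.0', '21.0', '18.0'],
    'horsepower': [None, '85.0', None],
})

# Number of missing entries in each column
missing_per_column = toy.isnull().sum()

# Rows in which any column is missing
rows_with_missing = toy[toy.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```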


**Step 4.** Replace the missing values with the median value of the column. Display the rows that contained the missing values after imputation by the median.

In [4]:
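# The parsed columns hold strings, so convert horsepower to numbers before taking the median
data['horsepower'] = pd.to_numeric(data['horsepower'])
data['horsepower'] = data['horsepower'].fillna(data['horsepower'].median())
data.loc[missing_index]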

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 32 | 25.0 | 4 | 98.0 | 93.5 | 2046.0 | 19.0 | 71 | 1 | ford pinto |
| 126 | 21.0 | 6 | 200.0 | 93.5 | 2875.0 | 17.0 | 74 | 1 | ford maverick |
| 330 | 40.9 | 4 | 85.0 | 93.5 | 1835.0 | 17.3 | 80 | 2 | renault lecar deluxe |
| 336 | 23.6 | 4 | 140.0 | 93.5 | 2905.0 | 14.3 | 80 | 1 | ford mustang cobra |
| 354 | 34.5 | 4 | 100.0 | 93.5 | 2320.0 | 15.8 | 81 | 2 | renault 18i |
| 374 | 23.0 | 4 | 151.0 | 93.5 | 3035.0 | 20.5 | 82 | 1 | amc concord dl |
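
Median imputation can be checked by hand on a small example. In the sketch below, `hp` is a hypothetical column with one missing entry. With an even number of observed values the median is the average of the two middle ones, which is how a non-integer value such as 93.5 can arise.

```python
import pandas as pd

hp = pd.Series(['130.0', None, '75.0', '90.0', '97.0'])  # strings plus one missing value

hp_numeric = pd.to_numeric(hp)       # strings become floats, None becomes NaN
median = hp_numeric.median()         # NaN is skipped: median of 75, 90, 97, 130
filled = hp_numeric.fillna(median)   # only the missing entry is replaced

print(median)
print(filled)
```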


**Step 5.** Create a dataframe object named data2 that contains only the toyota corolla and chevrolet impala records. Next, create 3 samples of size 4 each from data2:

- sample1, which is obtained via sampling without replacement.
- sample2, which is obtained via sampling with replacement.
- sample3, which is obtained via stratified sampling, where each stratum corresponds to the number of cylinders of the car (i.e., column 2).

To ensure repeatability, set the random_state of the sampling function to 1.

In [5]:
# DataFrame.append was removed in pandas 2.0; pd.concat keeps the same row order
data2 = pd.concat([data[data.car_name=="toyota corolla"], data[data.car_name=="chevrolet impala"]])
data2

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 167 | 29.0 | 4 | 97.0 | 75.0 | 2171.0 | 16.0 | 75 | 3 | toyota corolla |
| 205 | 28.0 | 4 | 97.0 | 75.0 | 2155.0 | 16.4 | 76 | 3 | toyota corolla |
| 321 | 32.2 | 4 | 108.0 | 75.0 | 2265.0 | 15.2 | 80 | 3 | toyota corolla |
| 356 | 32.4 | 4 | 108.0 | 75.0 | 2350.0 | 16.8 | 81 | 3 | toyota corolla |
| 382 | 34.0 | 4 | 108.0 | 70.0 | 2245.0 | 16.9 | 82 | 3 | toyota corolla |
| 6 | 14.0 | 8 | 454.0 | 220.0 | 4354.0 | 9.0 | 70 | 1 | chevrolet impala |
| 38 | 14.0 | 8 | 350.0 | 165.0 | 4209.0 | 12.0 | 71 | 1 | chevrolet impala |
| 62 | 13.0 | 8 | 350.0 | 165.0 | 4274.0 | 12.0 | 72 | 1 | chevrolet impala |
| 103 | 11.0 | 8 | 400.0 | 150.0 | 4997.0 | 14.0 | 73 | 1 | chevrolet impala |


In [6]:
sample1 = data2.sample(n=4,replace=False, random_state=1)    # sampling without replacement
sample1

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 103 | 11.0 | 8 | 400.0 | 150.0 | 4997.0 | 14.0 | 73 | 1 | chevrolet impala |
| 321 | 32.2 | 4 | 108.0 | 75.0 | 2265.0 | 15.2 | 80 | 3 | toyota corolla |
| 38 | 14.0 | 8 | 350.0 | 165.0 | 4209.0 | 12.0 | 71 | 1 | chevrolet impala |
| 62 | 13.0 | 8 | 350.0 | 165.0 | 4274.0 | 12.0 | 72 | 1 | chevrolet impala |


In [7]:
sample2 = data2.sample(n=4,replace=True, random_state=1)     # sampling with replacement
sample2

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 6 | 14.0 | 8 | 454.0 | 220.0 | 4354.0 | 9.0 | 70 | 1 | chevrolet impala |
| 103 | 11.0 | 8 | 400.0 | 150.0 | 4997.0 | 14.0 | 73 | 1 | chevrolet impala |
| 6 | 14.0 | 8 | 454.0 | 220.0 | 4354.0 | 9.0 | 70 | 1 | chevrolet impala |
| 167 | 29.0 | 4 | 97.0 | 75.0 | 2171.0 | 16.0 | 75 | 3 | toyota corolla |
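
The practical difference between the two sampling modes shows up when the requested sample is larger than the data. The sketch below uses a hypothetical three-row DataFrame named `toy`: drawing 5 rows works with replacement (some rows must repeat), while the same request without replacement raises a ValueError.

```python
import pandas as pd

toy = pd.DataFrame({'car': ['a', 'b', 'c']})

# With replacement: a row may be drawn more than once, so n may exceed len(toy)
with_repl = toy.sample(n=5, replace=True, random_state=1)

# Without replacement: each row can be drawn at most once
without_repl = toy.sample(n=3, replace=False, random_state=1)

# Asking for more rows than exist without replacement is an error
try:
    toy.sample(n=5, replace=False, random_state=1)
    oversample_failed = False
except ValueError:
    oversample_failed = True

print(with_repl)
print(without_repl)
print(oversample_failed)
```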


In [8]:
# Stratified sampling: the two car names coincide with the two cylinder strata (4 and 8),
# so draw 2 rows from each stratum
sample3 = pd.concat([data2[data2.car_name=="toyota corolla"].sample(n=2,random_state=1),
                     data2[data2.car_name=="chevrolet impala"].sample(n=2,random_state=1)])
sample3

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 321 | 32.2 | 4 | 108.0 | 75.0 | 2265.0 | 15.2 | 80 | 3 | toyota corolla |
| 205 | 28.0 | 4 | 97.0 | 75.0 | 2155.0 | 16.4 | 76 | 3 | toyota corolla |
| 103 | 11.0 | 8 | 400.0 | 150.0 | 4997.0 | 14.0 | 73 | 1 | chevrolet impala |
| 62 | 13.0 | 8 | 350.0 | 165.0 | 4274.0 | 12.0 | 72 | 1 | chevrolet impala |
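
Stratified sampling can also be written directly against the stratum column, so that no stratum is named by hand. The sketch below is one possible approach, assuming pandas 1.1 or later (where groupby objects provide a sample method). `cars` is a hypothetical reduced copy of data2 that keeps only the index, cylinders and car_name.

```python
import pandas as pd

cars = pd.DataFrame(
    {
        'cylinders': [4, 4, 4, 4, 4, 8, 8, 8, 8],
        'car_name': ['toyota corolla'] * 5 + ['chevrolet impala'] * 4,
    },
    index=[167, 205, 321, 356, 382, 6, 38, 62, 103],
)

# Draw 2 rows without replacement from every cylinders group (stratum)
stratified = cars.groupby('cylinders').sample(n=2, random_state=1)

print(stratified)
print(stratified['cylinders'].value_counts())
```

Each stratum contributes exactly two rows, so the sample of size 4 preserves an even split between the two groups regardless of their sizes in data2.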


**Step 6.** For this question, you will apply different discretization methods to the ***mpg*** attribute.

- Apply equal width discretization to produce 5 bins. Display the first 5 discretized values by using the head() function.

In [9]:
bins = pd.cut(data['mpg'].astype(float),5)
bins.head()

0    (16.52, 24.04]
1    (8.962, 16.52]
2    (16.52, 24.04]
3    (8.962, 16.52]
4    (16.52, 24.04]
Name: mpg, dtype: category
Categories (5, interval[float64]): [(8.962, 16.52] < (16.52, 24.04] < (24.04, 31.56] < (31.56, 39.08] < (39.08, 46.6]]
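
The bin edges above can be reproduced by hand: equal width discretization divides the range from the minimum to the maximum into intervals of the same length. The sketch below assumes a minimum of 9.0 and a maximum of 46.6, as suggested by the category labels above, and uses a few hypothetical mpg values with that range.

```python
import numpy as np
import pandas as pd

mpg = pd.Series([9.0, 18.0, 15.0, 26.0, 33.0, 46.6])  # hypothetical values, same min and max
n_bins = 5

# Equal width: n_bins + 1 evenly spaced edges between min and max
edges = np.linspace(mpg.min(), mpg.max(), n_bins + 1)

# pd.cut computes the same edges, but lowers the first one slightly
# so that the minimum value falls inside the first (left-open) interval
binned, cut_edges = pd.cut(mpg, n_bins, retbins=True)

print(edges)
print(cut_edges)
print(binned)
```

Each interval is (46.6 - 9.0) / 5 = 7.52 wide, which matches the spacing of the categories shown above.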

- Apply equal frequency discretization to produce 5 bins. Make sure you choose the appropriate quantiles that will produce exactly 5 bins.

In [10]:
bins = pd.qcut(data['mpg'].astype(float),[0,0.2,0.4,0.6,0.8,1])
bins.head()

0     (16.0, 20.0]
1    (8.999, 16.0]
2     (16.0, 20.0]
3    (8.999, 16.0]
4     (16.0, 20.0]
Name: mpg, dtype: category
Categories (5, interval[float64]): [(8.999, 16.0] < (16.0, 20.0] < (20.0, 25.0] < (25.0, 31.0] < (31.0, 46.6]]
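
Equal frequency discretization places roughly the same number of records in each bin, so the bin widths vary with the data. The sketch below, using a hypothetical Series of the integers 1 to 20, shows that passing the integer 5 to pd.qcut assigns values to the same bins as the explicit quantile list, and that every bin receives the same number of values.

```python
import pandas as pd

values = pd.Series(range(1, 21))  # hypothetical data: 1, 2, ..., 20

by_list = pd.qcut(values, [0, 0.2, 0.4, 0.6, 0.8, 1])  # explicit quantiles
by_int = pd.qcut(values, 5)                            # 5 equal-frequency bins

# Compare the integer bin codes rather than the interval labels
same_assignment = by_list.cat.codes.equals(by_int.cat.codes)
counts = by_int.value_counts().sort_index()

print(same_assignment)
print(counts)
```

On real data with many tied values the counts are only approximately equal, because identical values always fall into the same bin.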