# CAIS++ Linear Regression Part 2: Real-World Data

In this section, we'll learn to read real-world data from delimited text files (such as .csv files) and to apply linear regression using sklearn's built-in functionality. While the previous part focused more on theory, this part is more representative of the work you'll do on your projects and in the real world.

First, download the auto-mpg dataset from [here](https://archive.ics.uci.edu/ml/datasets/auto+mpg). (Go to "Data Folder", click on auto-mpg.data, press Ctrl+S, and save the file to your current working directory as auto-mpg.txt, which is the filename the code below reads.)

## Reading in the data: Pandas

In [81]:
import pandas as pd

In [82]:
headers = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin', 'name']

Since the file is whitespace-delimited, we'll have to set the delimiter manually instead of relying on the default comma (most of the files we'll work with are CSV, i.e. comma-separated values, files).

In [83]:
# sep=r'\s+' splits on runs of whitespace (delim_whitespace=True is deprecated in recent pandas)
mpg_df = pd.read_csv('auto-mpg.txt', names=headers, sep=r'\s+')
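
To see what whitespace splitting does without needing the downloaded file, here is a minimal sketch that parses a tiny in-memory sample. The two rows and the three-column layout (`sample`, `sample_headers`, `sample_df`) are made up for illustration and are not the real file's format. It uses `sep=r'\s+'`, which the pandas documentation describes as equivalent to `delim_whitespace=True`.

```python
import io
import pandas as pd

# Hypothetical two-row sample: columns separated by spaces or tabs,
# with the multi-word car name wrapped in double quotes.
sample = io.StringIO(
    '18.0   8\t"chevrolet chevelle malibu"\n'
    '15.0   8\t"buick skylark 320"\n'
)
sample_headers = ['mpg', 'cylinders', 'name']

# Any run of whitespace counts as one delimiter; quoted names stay in one field.
sample_df = pd.read_csv(sample, names=sample_headers, sep=r'\s+')

print(sample_df)
print(sample_df.dtypes)
```

Printing `dtypes` right after loading is a quick way to spot columns that were read in as text rather than numbers.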

In [84]:
mpg_df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino


We don't really care about the name of the model for these prediction purposes, so we can go ahead and drop it.

In [85]:
mpg_df = mpg_df.drop('name', axis=1)

"Origin" is a categorical value (1: America, 2: Europe, 3: Asia), so we'll need to figure out what to do with that. What we can do is split the "origin" variable into three separate variables (one for each region of origin), each of which will either have a 0 or 1 value (depending on where the car's origin is). Then, each region of origin will be assigned its own weight.

In [86]:
mpg_df['origin'] = mpg_df['origin'].replace({1: 'america', 2: 'europe', 3: 'asia'})

In [87]:
# dtype=int keeps the indicator columns as 0/1 (pandas 2.0+ defaults to True/False)
mpg_df = pd.get_dummies(mpg_df, columns=['origin'], dtype=int)

In [88]:
mpg_df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin_america,origin_asia,origin_europe
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,0,0
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,0,0
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,0,0
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,0,0
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,0,0
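
This technique is commonly called one-hot encoding. As a small sketch of what it does, the toy dataframe below (the `toy` and `encoded` names and the four rows are invented for illustration) goes through the same two steps: map the numeric codes to labels, then expand them into indicator columns.

```python
import pandas as pd

# Hypothetical toy column using the same 1/2/3 origin codes as the dataset.
toy = pd.DataFrame({'origin': [1, 3, 2, 1]})

# Map the numeric codes to readable labels first, so the new columns get useful names.
toy['origin'] = toy['origin'].map({1: 'america', 2: 'europe', 3: 'asia'})

# One indicator column per category; dtype=int requests 0/1 rather than booleans.
encoded = pd.get_dummies(toy, columns=['origin'], dtype=int)

print(encoded)
```

Each row has exactly one 1 across the three `origin_*` columns, which is what lets the regression learn a separate weight per region.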


From the dataset description online: "Missing Attribute Values: horsepower has 6 missing values". We need to deal with those entries before training; here, we'll simply remove the rows with missing horsepower values. First, let's see what the missing values show up as in the dataset.

In [89]:
mpg_df['horsepower'].unique()

array(['130.0', '165.0', '150.0', '140.0', '198.0', '220.0', '215.0',
       '225.0', '190.0', '170.0', '160.0', '95.00', '97.00', '85.00',
       '88.00', '46.00', '87.00', '90.00', '113.0', '200.0', '210.0',
       '193.0', '?', '100.0', '105.0', '175.0', '153.0', '180.0', '110.0',
       '72.00', '86.00', '70.00', '76.00', '65.00', '69.00', '60.00',
       '80.00', '54.00', '208.0', '155.0', '112.0', '92.00', '145.0',
       '137.0', '158.0', '167.0', '94.00', '107.0', '230.0', '49.00',
       '75.00', '91.00', '122.0', '67.00', '83.00', '78.00', '52.00',
       '61.00', '93.00', '148.0', '129.0', '96.00', '71.00', '98.00',
       '115.0', '53.00', '81.00', '79.00', '120.0', '152.0', '102.0',
       '108.0', '68.00', '58.00', '149.0', '89.00', '63.00', '48.00',
       '66.00', '139.0', '103.0', '125.0', '133.0', '138.0', '135.0',
       '142.0', '77.00', '62.00', '132.0', '84.00', '64.00', '74.00',
       '116.0', '82.00'], dtype=object)

In [90]:
import numpy as np

mpg_df = mpg_df.replace('?', np.nan)
mpg_df = mpg_df.dropna()

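Now, the '?' entries should be gone. Note that the remaining values are still stored as strings (dtype=object), because the column was read in as text.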

In [91]:
mpg_df['horsepower'].unique()

array(['130.0', '165.0', '150.0', '140.0', '198.0', '220.0', '215.0',
       '225.0', '190.0', '170.0', '160.0', '95.00', '97.00', '85.00',
       '88.00', '46.00', '87.00', '90.00', '113.0', '200.0', '210.0',
       '193.0', '100.0', '105.0', '175.0', '153.0', '180.0', '110.0',
       '72.00', '86.00', '70.00', '76.00', '65.00', '69.00', '60.00',
       '80.00', '54.00', '208.0', '155.0', '112.0', '92.00', '145.0',
       '137.0', '158.0', '167.0', '94.00', '107.0', '230.0', '49.00',
       '75.00', '91.00', '122.0', '67.00', '83.00', '78.00', '52.00',
       '61.00', '93.00', '148.0', '129.0', '96.00', '71.00', '98.00',
       '115.0', '53.00', '81.00', '79.00', '120.0', '152.0', '102.0',
       '108.0', '68.00', '58.00', '149.0', '89.00', '63.00', '48.00',
       '66.00', '139.0', '103.0', '125.0', '133.0', '138.0', '135.0',
       '142.0', '77.00', '62.00', '132.0', '84.00', '64.00', '74.00',
       '116.0', '82.00'], dtype=object)
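
The output above still reports `dtype=object`, meaning the horsepower values are strings rather than numbers. One alternative way to handle both problems at once, sketched here on a made-up three-row dataframe (`toy` is hypothetical, not the real data), is `pd.to_numeric` with `errors='coerce'`, which turns anything unparseable (like '?') into NaN and converts the rest to floats.

```python
import pandas as pd

# Hypothetical toy data: horsepower read in as text, with '?' marking a missing value.
toy = pd.DataFrame({
    'mpg': [18.0, 25.0, 24.0],
    'horsepower': ['130.0', '?', '95.00'],
})

# Unparseable entries become NaN; everything else becomes a float.
toy['horsepower'] = pd.to_numeric(toy['horsepower'], errors='coerce')

# Drop only the rows where horsepower is missing.
toy = toy.dropna(subset=['horsepower'])

print(toy)
print(toy.dtypes)
```

Another option worth knowing about is passing `na_values='?'` to `pd.read_csv`, which marks the missing entries as NaN at load time.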

Finally, we'll separate the target variable (mpg) from the rest of the dataframe to create our X (features) and y (target).

In [92]:
# Cast every feature to float: horsepower was read in as strings (dtype=object)
X = mpg_df.drop('mpg', axis=1).astype(float)
y = mpg_df[['mpg']]

## Training the model

Now that we have our X and y good to go, we can apply sklearn's built-in LinearRegression model, just as we did before with the Boston house price dataset.

In [93]:
from sklearn.model_selection import train_test_split

# Split X and y into training (75%) and test (25%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
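
To get a feel for what `test_size` and `random_state` do, here is a small sketch on made-up arrays (`X_demo` and `y_demo` are placeholders, not the mpg data). With 8 rows and `test_size=0.25`, 2 rows go to the test set and 6 to the training set, and fixing `random_state` makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical demo data: 8 samples with 2 features each.
X_demo = np.arange(16).reshape(8, 2)
y_demo = np.arange(8)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state gives the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)
print(np.array_equal(X_te, X_te2))
```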

In [94]:
from sklearn.linear_model import LinearRegression

regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [95]:
regression_model.coef_

array([[-0.24633756,  0.02387034, -0.00601724, -0.00733643,  0.21897778,
         0.78518011, -1.76249341,  0.80962692,  0.95286649]])

In [96]:
regression_model.intercept_

array([-19.80918385])
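
A bare array of coefficients is hard to read, so it can help to pair each weight with its column name. The sketch below does this on a made-up dataset (`X_toy`, `y_toy`, and the relationship `2*a - 3*b + 1` are invented for illustration) where we know the right answer in advance. Because the target is passed as a one-column dataframe, as in our notebook, `coef_` is 2-D and the weights sit in its first row.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data generated from target = 2*a - 3*b + 1 (no noise).
X_toy = pd.DataFrame({
    'a': [0.0, 1.0, 2.0, 3.0, 4.0],
    'b': [1.0, 0.0, 2.0, 1.0, 3.0],
})
y_toy = pd.DataFrame({'target': 2.0 * X_toy['a'] - 3.0 * X_toy['b'] + 1.0})

model = LinearRegression().fit(X_toy, y_toy)

# coef_ has shape (1, n_features) here, so take row 0 and label it by column.
coef_by_feature = pd.Series(model.coef_[0], index=X_toy.columns)

print(coef_by_feature)
print(model.intercept_)
```

The same one-liner should work on our model with `regression_model.coef_[0]` and `X_train.columns`. Keep in mind that the features are on very different scales (weight is in the thousands, cylinders is a single digit), so the size of a weight alone does not tell you how important a feature is.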

In [97]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, regression_model.predict(X_test))

12.230963834602667
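
The mean squared error is measured in squared units (here, mpg squared), which makes it hard to interpret directly. Taking the square root gives the root mean squared error (RMSE) in the original units: the square root of 12.23 is about 3.5, so our predictions are off by roughly 3.5 mpg in that sense. The sketch below computes MSE, RMSE, and the R² score on a few made-up numbers (`y_true` and `y_pred` are illustrative, not results from our model).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions, chosen to be easy to check by hand.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

mse = mean_squared_error(y_true, y_pred)  # mean of squared errors: (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                       # back in the units of the target
r2 = r2_score(y_true, y_pred)             # 1 - SS_res / SS_tot = 1 - 5 / 8

print(mse, rmse, r2)
```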

## Additional Datasets

Now, you should have a general idea of how to import a dataset, get it ready for a machine learning model, and then apply a built-in model of your choice (e.g. simple linear regression, regression trees) to make predictions on new data.
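
Since sklearn models share the same `fit`/`predict` interface, trying a different model is mostly a matter of swapping one line. Here is a rough sketch that compares linear regression with a regression tree on synthetic data. The data, the variable names (`X_syn`, `y_syn`, `test_mse`), and the `max_depth=4` setting are all arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for this sketch): a noisy linear relationship.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(200, 3))
y_syn = 1.5 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(0, 1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=1
)

# Both models expose the same fit/predict interface, so one loop handles both.
models = {
    'linear regression': LinearRegression(),
    'regression tree': DecisionTreeRegressor(max_depth=4, random_state=1),
}
test_mse = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    test_mse[name] = mean_squared_error(y_te, model.predict(X_te))

print(test_mse)
```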

Here are a bunch of regression-oriented datasets that you can use on your own: https://archive.ics.uci.edu/ml/datasets.html?format=&task=reg&att=&area=&numAtt=&numIns=&type=&sort=nameUp&view=table

Each dataset will come with its own challenges (e.g. missing data, categorical inputs, etc.), so select one that seems interesting to you, and see what you can do with it!