<h1 style='color: #C9C9C9'>Machine Learning with Python<img style="float: right; margin-top: 0;" width="240" src="../../Images/cf-logo.png" /></h1> 
<p style='color: #C9C9C9'>&copy; Coding Fury 2022 - all rights reserved</p>

<hr style='color: #C9C9C9' />

# Linear Regression - Automobiles

* Load the Automobiles Dataset
* Drop any rows for which we don't have a price
* Drop the 'symboling' and 'normalised_losses' columns

In [None]:
import numpy as np
import pandas as pd
# ensure that we can see all columns when we display a dataframe
pd.set_option('display.max_columns', None)

# read the automobiles dataset into a dataframe
auto_df = pd.read_csv('../../Data/automobiles.csv')

# drop rows that are missing a value for price
auto_df = auto_df.dropna(subset=['price']) 

# drop the symboling and normalised_losses columns
auto_df = auto_df.drop(['symboling', 'normalised_losses'], axis=1)
auto_df



# Impute Missing Values


In [None]:
from sklearn.impute import SimpleImputer

In [None]:
# "most_frequent" is the same as the mode
imp_mode = SimpleImputer(strategy='most_frequent', missing_values=np.nan)
# fit_transform returns a 2D array, so flatten it before assigning to a single column
auto_df['num_of_doors'] = imp_mode.fit_transform(auto_df[['num_of_doors']]).ravel()

In [None]:
imp_mean = SimpleImputer(strategy='mean', missing_values=np.nan)
auto_df[['bore', 'stroke']] = imp_mean.fit_transform(auto_df[['bore','stroke']])

In [None]:
imp_median = SimpleImputer(strategy='median', missing_values=np.nan)
# use the median imputer defined above, not imp_mean
auto_df[['horsepower','peak_rpm']] = imp_median.fit_transform(auto_df[['horsepower','peak_rpm']])

# Standardise All the Numeric Columns

Note that I have to name the columns explicitly, because I don't want to standardise 'price' (the target) or the categorical columns that will be turned into dummies.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

cols = ['wheel_base','length','width','height','curb_weight', 'engine_size', 'bore','stroke','compression_ratio','horsepower','peak_rpm','city_mpg','highway_mpg']
features = auto_df[cols]

ct = ColumnTransformer([
        ('ct_scaler', StandardScaler(), cols)
    ], remainder='passthrough')

auto_df[cols] = ct.fit_transform(features)
auto_df
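
To illustrate the point about naming columns, here is a minimal sketch on a small made-up frame. The names `toy_df`, `num_cols` and `ct_toy` and all of the values are hypothetical and are not part of the automobiles dataset. It assumes the default `remainder='drop'`, so the transformer returns only the scaled columns, which are then written back over the originals.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame: two numeric columns and one categorical column
toy_df = pd.DataFrame({
    'length': [150.0, 160.0, 170.0, 180.0],
    'width': [60.0, 62.0, 64.0, 66.0],
    'fuel_type': ['gas', 'diesel', 'gas', 'gas'],
})

# Only the columns named here are standardised
num_cols = ['length', 'width']

# With the default remainder='drop', the output contains just the scaled columns
ct_toy = ColumnTransformer([('scaler', StandardScaler(), num_cols)])

# Write the scaled values back; 'fuel_type' is left exactly as it was
toy_df[num_cols] = ct_toy.fit_transform(toy_df)
```

After this runs, `length` and `width` each have a mean of zero and unit variance, while `fuel_type` still holds its original strings and is ready for `pd.get_dummies`.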



# Create Dummy Columns

In [None]:
auto_dummies_df = pd.get_dummies(auto_df, drop_first=True)
auto_dummies_df

# Train and Score the Model

In [None]:
X = auto_dummies_df.drop('price', axis=1).values

In [None]:
y = auto_dummies_df['price'].values

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [None]:
model = LinearRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)

# Discuss

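Note that we took a shortcut in this example. We should really have split the data into training and test sets first, then fitted the imputers and the scaler on the training set only, and applied those fitted transformers to the test set. Fitting them on the full dataset lets information from the test rows leak into training, which can make the score look better than it should.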

We've skipped this step in the interest of time. 
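
For reference, here is a rough sketch of the split-first approach. It uses synthetic data from `make_regression` as a stand-in for the automobiles features, so the names `X_demo`, `y_demo`, `pipe` and `demo_score` are illustrative assumptions rather than part of the notebook above. Using a scikit-learn pipeline is one possible way to do this, not the only one.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the automobiles features and prices (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Split first, so the test rows play no part in fitting the scaler
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# fit() learns the scaler's means and standard deviations from the training rows only;
# score() reuses those same training statistics to transform the test rows
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
demo_score = pipe.score(X_te, y_te)
```

Because the scaler lives inside the pipeline, there is no separate step in which the test set could be scaled using its own statistics by mistake. An imputer could be added to the same pipeline in the same way.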