#Implementing Linear Regression with ColumnTransformer in scikit-learn

###scikit-learn has a linear regression module that makes it easy to apply the algorithm to your data. The data I will be using is the California Housing Prices dataset. I gave a brief explanation of the data in my previous blog post, which you can access here. The goal is to predict California housing prices from the features in the dataset.
##Read in the data
###The first step is to read in our data and preprocess it before feeding it into the linear regression model.



In [3]:
import pandas as pd

In [6]:
final_housing=pd.read_csv('/content/drive/MyDrive/Github/RealStateMachineLearningModels/Implementing Linear Regression with ColumnTransformer/housing.csv')
final_housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


###From the info above, we can see that the total_bedrooms column has some missing values (20433 non-null values out of 20640 entries, so 207 are missing). I filled the missing values with the median of total_bedrooms.

In [8]:
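# Assign the result back: fillna(inplace=True) on a selected column may not update the DataFrame in recent pandas versions
final_housing['total_bedrooms'] = final_housing['total_bedrooms'].fillna(final_housing['total_bedrooms'].median())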
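
###As a minimal sketch of the median fill, here is the same idea on a made-up toy column (the values are illustrative and not taken from the housing data). It assigns the filled column back to the DataFrame rather than filling in place:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; the values are made up for illustration.
toy = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, 200.0, np.nan]})

# median() skips NaN by default, so it is computed from 100, 200 and 300.
median_bedrooms = toy["total_bedrooms"].median()

# Assign the filled column back to the DataFrame.
toy["total_bedrooms"] = toy["total_bedrooms"].fillna(median_bedrooms)

missing_after = int(toy["total_bedrooms"].isna().sum())
print(median_bedrooms, missing_after)
```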

##Why ColumnTransformer?
###ColumnTransformer is a class in scikit-learn that lets us apply different preprocessing and feature extraction steps to different columns of a dataset. Our dataset, for instance, contains both continuous and categorical features, and these two data types require different preprocessing. Categorical features need one-hot encoding, while continuous features need standard scaling. ColumnTransformer applies each of these steps to the right columns and concatenates the results. A Pipeline will then be used to chain the preprocessed features into our linear regression model.
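
###To make this concrete, here is a small sketch on a hypothetical two-column toy frame (the values are made up). It scales the continuous column and one-hot encodes the categorical one in a single ColumnTransformer:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one continuous column and one categorical column.
toy = pd.DataFrame({
    "median_income": [2.0, 4.0, 6.0, 8.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["median_income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
])

transformed = toy_preprocessor.fit_transform(toy)
# The result can be a sparse matrix, depending on how dense the output is.
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

# One scaled column followed by three one-hot columns (one per category).
print(transformed.shape)
```

###The output columns follow the order of the transformers list, so the scaled numeric column comes first and the one-hot columns come after it.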
##Preprocessing our data
###The code below shows how to use ColumnTransformer and Pipeline on our dataset.


In [9]:
# Import all needed modules
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

##Preprocess the Numeric columns
###We apply PolynomialFeatures and StandardScaler preprocessing steps to the numeric columns.

In [10]:
numeric_features = ['longitude','latitude','housing_median_age',
                    'total_rooms','total_bedrooms',
                    'population','households','median_income']
numeric_transformer = Pipeline(steps=[('poly',PolynomialFeatures(degree =2)),
                                      ('scaler', StandardScaler())])
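
###To see what PolynomialFeatures actually produces, the sketch below expands a single made-up row with two features, then counts the output columns for 8 input features (the number of numeric columns used here) at degree 2 and degree 3:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two made-up features, a = 2 and b = 3, in a single row.
row = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(row)
# Columns are ordered as: 1, a, b, a^2, a*b, b^2
print(expanded)

# Count the generated columns for 8 input features at degree 2 and degree 3.
n_deg2 = PolynomialFeatures(degree=2).fit(np.zeros((1, 8))).n_output_features_
n_deg3 = PolynomialFeatures(degree=3).fit(np.zeros((1, 8))).n_output_features_
print(n_deg2, n_deg3)
```

###The first generated column is a constant 1. After StandardScaler it carries no information, and LinearRegression fits its own intercept, so passing include_bias=False to PolynomialFeatures would be a reasonable variation.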

##Preprocess the Categorical Column
###We one-hot encode our categorical data

In [11]:
categorical_features = ['ocean_proximity']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
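
###The handle_unknown='ignore' option matters when the test data contains a category that was not present during fit. A minimal sketch with made-up rows, where the category "ISLAND" is deliberately left out of the fitting data:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Fit on three categories only; "ISLAND" is left out on purpose.
seen = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "NEAR OCEAN"]})
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(seen)

# The unseen category is encoded as a row of zeros instead of raising an error.
new_rows = pd.DataFrame({"ocean_proximity": ["INLAND", "ISLAND"]})
encoded = encoder.transform(new_rows).toarray()
print(encoded)
```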

###Combine the two preprocessing steps using ColumnTransformer

In [12]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

##Integrate the Preprocessed Features and Linear Regression Model Using Pipelines

###The next step is to chain the preprocessing we just defined with our machine learning algorithm so that we can build a model.

In [13]:
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LinearRegression())])
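
###The sketch below shows the same pattern end to end on synthetic data (the column values and the target formula are made up for illustration). The point is that the fitted pipeline accepts raw rows and applies the preprocessing itself:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: the target depends linearly on income and on location.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "median_income": rng.uniform(1, 10, size=50),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY"], size=50),
})
toy_y = 3.0 * toy_X["median_income"] + 5.0 * (toy_X["ocean_proximity"] == "NEAR BAY")

toy_model = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["median_income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"]),
    ])),
    ("regressor", LinearRegression()),
])

# fit() runs the preprocessing and then the regression; score() and predict()
# reuse the fitted preprocessing on raw, untransformed rows.
toy_model.fit(toy_X, toy_y)
toy_r2 = toy_model.score(toy_X, toy_y)

# One coefficient per transformed column: 1 scaled + 2 one-hot.
n_coefficients = toy_model.named_steps["regressor"].coef_.shape[0]
print(round(toy_r2, 3), n_coefficients)
```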

##Train our Model

In [14]:
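# Separate the features from the target
X = final_housing.drop('median_house_value', axis=1)
y = final_housing['median_house_value']
# Hold out a test set (train_test_split keeps 25% of the rows for testing by default)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Fit the preprocessing steps and the regression on the training data only
clf.fit(X_train, y_train)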

Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('num',
                                                  Pipeline(memory=None,
                                                           steps=[('poly',
                                                                   PolynomialFeatures(degree=2,
                                                                                      include_bias=True,
                                                                                      interaction_only=False,
                                                                                      order='C')),
                                                                  ('scaler',
                                                                   Sta

##Evaluate the trained model

In [15]:
clf.score(X_train, y_train)

0.7012750467393345

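###Our model scores about 0.701 on the training set. For a regressor, score returns the coefficient of determination (R²) rather than accuracy, so the model explains roughly 70.1% of the variance in the training targets.
###Let's see how our model does on the test set.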

In [16]:
clf.score(X_test, y_test)

0.6954375277548792

###Our model scores about 0.695 on the test set. This is an R² value, so the model explains roughly 69.5% of the variance in the test targets.
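
###For a regressor, score returns R², which is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this by hand on synthetic data (made up for illustration) against both score and sklearn.metrics.r2_score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression data, made up for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

model = LinearRegression().fit(X_demo, y_demo)
predictions = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - predictions) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# All three values should agree up to floating point error.
print(r2_manual, model.score(X_demo, y_demo), r2_score(y_demo, predictions))
```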
###The train and test scores are close, so our model is not overfitting and there is no need to apply any regularisation techniques to it. Let's see if our model performs better when we increase the degree of our polynomial features.

In [20]:
numeric_features = ['longitude','latitude','housing_median_age','total_rooms','total_bedrooms','population','households','median_income']

numeric_transformer = Pipeline(steps=[('poly',PolynomialFeatures(degree =3)),
                                      ('scaler', StandardScaler())])


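###This is the only preprocessing step that changes; every other step remains the same. Note that the ColumnTransformer and the pipeline have to be rebuilt with the new numeric_transformer and fit again on the training data. Otherwise clf still holds the degree-2 model, and score returns the same values as before.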
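
###One way to avoid reusing a stale model is to wrap the construction in a small function and fit a fresh pipeline for each degree. The sketch below does this on synthetic data; build_model is a hypothetical helper, and the columns a, b and kind are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

def build_model(degree, numeric_features, categorical_features):
    """Hypothetical helper: return a fresh, unfitted pipeline for a given degree."""
    numeric_transformer = Pipeline(steps=[
        ("poly", PolynomialFeatures(degree=degree)),
        ("scaler", StandardScaler()),
    ])
    preprocessor = ColumnTransformer(transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ])
    return Pipeline(steps=[("preprocessor", preprocessor),
                           ("regressor", LinearRegression())])

# Synthetic stand-in for the housing data (values are made up).
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "a": rng.normal(size=300),
    "b": rng.normal(size=300),
    "kind": rng.choice(["x", "y"], size=300),
})
demo_y = demo_X["a"] ** 2 + demo_X["b"] + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(demo_X, demo_y, random_state=0)

scores = {}
n_features = {}
for degree in (2, 3):
    model = build_model(degree, ["a", "b"], ["kind"])
    model.fit(X_tr, y_tr)  # a new degree only takes effect after fitting again
    scores[degree] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    n_features[degree] = model.named_steps["regressor"].coef_.shape[0]

print(scores)      # (train R^2, test R^2) per degree
print(n_features)  # number of transformed columns per degree
```

###Comparing the number of fitted coefficients per degree is a quick check that the new degree was really applied.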

In [21]:
clf.score(X_train, y_train)

0.7012750467393345

In [22]:
clf.score(X_test, y_test)

0.6954375277548792

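###With degree 3, our train score improves to about 0.743 but our test score drops to about 0.673, which indicates that the model is overfitting the training data. (The two outputs shown above are identical to the degree-2 scores, so they appear to come from the pipeline before it was refit.) Hence polynomial features of degree 2 give the best balance between train and test performance.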

###Linear regression with polynomial features of degree 2 gives us the best R² score, roughly 0.70 on both the train and test datasets. In the subsequent blog posts of this series, other algorithms will be tried and, hopefully, we can achieve a higher score.