![cdac-logo](https://media-exp1.licdn.com/dms/image/C4E1BAQFHBADdyISluw/company-background_10000/0/1574830149499?e=2159024400&v=beta&t=3RhoioNp9zpT_h_y15pHIEZsrXsaH6L-aIdTlaNJrp0)
# **Workshop 6: Practical Demonstration - Regression [ BUS2003 - Data Engineering with Python ]**

Advanced predictive models based on statistics, machine learning, and modern AI have been widely applied to solve challenging real-world problems, some of which require complex models.

This week, we discuss one of the most popular tasks in predictive analytics: regression. The goal of regression is to describe the relationship between one or more independent variables and a dependent variable, and to use observed data to predict the value of the dependent variable from the values of the independent variables.

For example, suppose we want to predict house prices based on the features of those houses. In this case, the house price is our dependent variable, and house features such as the number of bedrooms, the number of car parks, and the distance from the CBD are our independent variables.




## **1. Regression**

Regression is a common prediction task for estimating the relationship between a dependent variable (often called the 'outcome' or 'response' variable) and one or more independent variables (often called 'predictors', 'covariates', 'explanatory variables' or 'features').

$Y_i = f(X_i, \beta) + e_i$

The researchers' goal is to estimate the function $f(X_{i},\beta )$ that most closely fits the data.
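
For linear regression, $f$ is a weighted sum of the features, and "most closely fits" is usually taken to mean minimising the sum of the squared errors $e_i$. The following is a minimal sketch of that idea on made-up data; the numbers and variable names are assumptions for illustration and are not part of the workshop datasets.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 2 + 3*x plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=200)

# Design matrix with a column of ones so that beta[0] acts as the intercept
X = np.column_stack([np.ones_like(x), x])

# Least squares: choose beta to minimise sum((y - X @ beta) ** 2)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta  # the e_i terms in the equation above
print("estimated beta:", beta)
print("mean residual:", residuals.mean())
```

The estimated `beta` should land close to the values used to generate the data, and the residuals average to zero because the model includes an intercept.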

#### **Dataset 1: Car efficiency**

Run the code cell below to import the data into our notebook.

In [None]:
import numpy as np
import pandas as pd

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

raw_dataset = pd.read_csv(url, names=column_names,
                          na_values='?', comment='\t',
                          sep=' ', skipinitialspace=True)

raw_dataset.head()

Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Origin
0,18.0,8,307.0,130.0,3504.0,12.0,70,1
1,15.0,8,350.0,165.0,3693.0,11.5,70,1
2,18.0,8,318.0,150.0,3436.0,11.0,70,1
3,16.0,8,304.0,150.0,3433.0,12.0,70,1
4,17.0,8,302.0,140.0,3449.0,10.5,70,1


Clean data

In [None]:
df_car = raw_dataset.copy()
df_car.isna().sum()
df_car = df_car.dropna()
df_car.head()

Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Origin
0,18.0,8,307.0,130.0,3504.0,12.0,70,1
1,15.0,8,350.0,165.0,3693.0,11.5,70,1
2,18.0,8,318.0,150.0,3436.0,11.0,70,1
3,16.0,8,304.0,150.0,3433.0,12.0,70,1
4,17.0,8,302.0,140.0,3449.0,10.5,70,1


The "Origin" column is categorical, not numeric. So the next step is to one-hot encode the values in the column with pd.get_dummies.

In [None]:
df_car['Origin'] = df_car['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})
df_car = pd.get_dummies(df_car, columns=['Origin'], prefix='', prefix_sep='')
df_car.head()

Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Europe,Japan,USA
0,18.0,8,307.0,130.0,3504.0,12.0,70,0,0,1
1,15.0,8,350.0,165.0,3693.0,11.5,70,0,0,1
2,18.0,8,318.0,150.0,3436.0,11.0,70,0,0,1
3,16.0,8,304.0,150.0,3433.0,12.0,70,0,0,1
4,17.0,8,302.0,140.0,3449.0,10.5,70,0,0,1
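
To see what the encoding step does in isolation, here is a small sketch on a made-up four-row frame (the values are illustrative only). Recent pandas versions return True/False indicator columns by default, so the sketch passes `dtype=int` to get the 0/1 columns shown in the output above.

```python
import pandas as pd

# Toy frame (made up for illustration) with a numeric code for a category
toy = pd.DataFrame({"MPG": [18.0, 26.0, 31.0, 24.0], "Origin": [1, 2, 3, 2]})

# Step 1: turn the codes into readable labels
toy["Origin"] = toy["Origin"].map({1: "USA", 2: "Europe", 3: "Japan"})

# Step 2: create one indicator column per label (0/1 because of dtype=int)
encoded = pd.get_dummies(toy, columns=["Origin"], prefix="", prefix_sep="", dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1, which tells the model the car's origin without implying that "3" is somehow larger than "1".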


In [None]:
# import plotly and use a scatter matrix to explore the data
import plotly.express as px
fig = px.scatter_matrix(df_car, height=1200)
fig

It is very hard to pick up useful patterns related to car efficiency from this plot alone.

In [None]:
# get the list of feature names (all column names except for the target variable MPG)
feature_names = [f for f in df_car.columns if f!="MPG"]

In [None]:
# import numpy and sklearn packages for predictive modelling
import numpy as np
from sklearn.linear_model import LinearRegression # importing linear regression algorithm
X = df_car[feature_names] # define X
y = df_car["MPG"] # define y
reg = LinearRegression().fit(X, y) # train the model
reg.score(X, y) # calculate score/metric (R square by default)

0.8241994699119172

The R-squared value for the linear regression model is high. However, it must be noted that it is just the training performance. We will talk about the testing performance in the next workshop.
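
For reference, R-squared compares the model's squared errors with those of a baseline that always predicts the mean of the target. The sketch below recomputes it by hand on synthetic data (the data and the name `reg_demo` are assumptions for illustration) and checks the result against `score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration only)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=100)

reg_demo = LinearRegression().fit(X, y)
y_pred = reg_demo.predict(X)

ss_res = np.sum((y - y_pred) ** 2)    # variation left unexplained by the model
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, reg_demo.score(X, y))  # the two values should agree
```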

In [None]:
# get the predicted mpg and assign to a new column called "predicted_mpg"
df_car["predicted_mpg"] = reg.predict(X)

In [None]:
# show the dataframe with a new column called "predicted_mpg"
df_car

Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Europe,Japan,USA,predicted_mpg
0,18.0,8,307.0,130.0,3504.0,12.0,70,0,0,1,14.953252
1,15.0,8,350.0,165.0,3693.0,11.5,70,0,0,1,14.040098
2,18.0,8,318.0,150.0,3436.0,11.0,70,0,0,1,15.230551
3,16.0,8,304.0,150.0,3433.0,12.0,70,0,0,1,14.994084
4,17.0,8,302.0,140.0,3449.0,10.5,70,0,0,1,14.901941
...,...,...,...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86.0,2790.0,15.6,82,0,0,1,28.108037
394,44.0,4,97.0,52.0,2130.0,24.6,82,1,0,0,35.465976
395,32.0,4,135.0,84.0,2295.0,11.6,82,0,0,1,31.029739
396,28.0,4,120.0,79.0,2625.0,18.6,82,0,0,1,29.100271


In [None]:
# use scatter plot to show the correlation between predicted mpg and its actual values
import plotly.express as px
px.scatter(df_car, x="MPG", y="predicted_mpg")

The above scatter plot shows that there is a strong correlation between our predictions and the actual MPG values.

In [None]:
# calculate the difference/errors between our predictions and actual values
df_car["error"] = df_car["MPG"] - df_car["predicted_mpg"]

In [None]:
# draw a histogram of errors
px.histogram(df_car, x="error")

The errors are centred close to zero and most of them are small, so our predictions are reasonably accurate on the training data. However, there are some cases where the errors are large.
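
A histogram can be complemented with a few summary numbers. Because positive and negative errors cancel in a plain average, it helps to also look at the mean absolute error (MAE) and the root mean squared error (RMSE). A minimal sketch on made-up values (the column names `actual` and `predicted` are placeholders, not columns of `df_car`):

```python
import numpy as np
import pandas as pd

# Made-up actual and predicted values, for illustration only
demo = pd.DataFrame({
    "actual":    [18.0, 15.0, 27.0, 44.0, 32.0],
    "predicted": [15.0, 14.0, 28.0, 35.5, 31.0],
})
demo["error"] = demo["actual"] - demo["predicted"]

mean_error = demo["error"].mean()            # signed errors can cancel out
mae = demo["error"].abs().mean()             # typical size of an error
rmse = np.sqrt((demo["error"] ** 2).mean())  # penalises large errors more

print(f"mean error={mean_error:.2f}, MAE={mae:.2f}, RMSE={rmse:.2f}")
```

The same three lines could be applied to the `error` column computed earlier.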

In [None]:
# use box plots to show the error distribution across Cylinders
px.box(df_car, x="Cylinders", y="error")

In [None]:
# show the coefficients of the linear models (beta)
reg.coef_

array([-0.48970942,  0.02397864, -0.01818346, -0.00671038,  0.07910304,
        0.77702694,  0.80225883,  1.0254847 , -1.82774353])

In [None]:
# visualise the coefficients
px.bar(x=reg.coef_, y=feature_names, orientation="h")

From the above visualisation, the features USA, Japan, Europe, Model Year, and Cylinders have the largest coefficients, so a one-unit change in any of them shifts the predicted MPG the most. For example, with the other features held constant, the predicted MPG is about 1.83 lower for a USA car. Similarly, each additional model year increases the predicted MPG by about 0.78 (see the Model Year coefficient).
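
The "one-unit change" reading of a coefficient can be checked directly. The sketch below uses synthetic data (the column names `year`, `weight` and `target` are assumptions for illustration) to label the coefficients and to confirm that raising one feature by 1 moves the prediction by exactly that feature's coefficient.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed): target rises by 0.8 per year and falls by 0.005 per unit of weight
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "year": rng.integers(70, 83, size=300),
    "weight": rng.uniform(1500, 5000, size=300),
})
demo["target"] = 0.8 * demo["year"] - 0.005 * demo["weight"] + rng.normal(0, 1, size=300)

features = ["year", "weight"]
model = LinearRegression().fit(demo[features], demo["target"])

# Label each coefficient with its feature name
coefs = pd.Series(model.coef_, index=features)
print(coefs)

# Raise "year" by one unit for a single row while leaving "weight" unchanged
row = demo[features].iloc[[0]]
row_next_year = row.assign(year=row["year"] + 1)
delta = model.predict(row_next_year)[0] - model.predict(row)[0]
print(delta)  # equals coefs["year"] up to floating-point error
```

Note that the `weight` coefficient is tiny only because weight is measured in large units, so the length of a bar in the coefficient chart is not by itself a measure of how important a feature is.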

We can apply exactly the same steps to build a KNN model (except for the coefficient steps, since coefficients are not available in KNN).

In [None]:
import numpy as np
from sklearn.neighbors import KNeighborsRegressor # import KNN algorithm
X = df_car[feature_names]
y = df_car["MPG"]
reg = KNeighborsRegressor(n_neighbors=2).fit(X, y) # train the KNN model with K = 2
reg.score(X, y) # calculate score/metric (R square by default)

0.8970993462360187

With K=2, KNN has a better training score (R-squared) than linear regression (0.897 versus 0.824). As before, this is training performance only.
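
To make the KNN prediction rule concrete: with default settings, the prediction for a point is the average target of its K nearest training points by Euclidean distance. The sketch below reproduces one prediction by hand on synthetic data (all names and numbers are assumptions for illustration) and shows why small K values score so well on the training set.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic training data (assumed for illustration)
rng = np.random.default_rng(3)
X_train = rng.uniform(0, 10, size=(50, 2))
y_train = X_train[:, 0] + 2 * X_train[:, 1] + rng.normal(0, 0.5, size=50)

query = np.array([[5.0, 5.0]])
k = 2

# By hand: find the k closest training points and average their targets
distances = np.linalg.norm(X_train - query, axis=1)
nearest = np.argsort(distances)[:k]
manual_prediction = y_train[nearest].mean()

knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
print(manual_prediction, knn.predict(query)[0])  # should match

# With k=1 every training point is its own nearest neighbour, so the
# training R-squared is perfect when no two rows share the same features
knn1 = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)
print(knn1.score(X_train, y_train))
```

This is one reason a high training score for KNN with a small K says little about how well the model will predict unseen cars.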

In [None]:
# similar to linear regression, we can reuse the code above to make predictions
df_car["predicted_mpg"] = reg.predict(X)

In [None]:
# calculate errors (actual minus predicted, the same sign convention as before)
df_car["error"] = df_car["MPG"] - df_car["predicted_mpg"]

In [None]:
# visualise predictions vs actual values
px.scatter(df_car, x="predicted_mpg", y="MPG")

In [None]:
# show error distribution
px.histogram(df_car, x="error")

#### **Dataset 2: Predict rental prices**

The data behind the [Inside Airbnb](http://insideairbnb.com/get-the-data.html) site is sourced from publicly available information from the Airbnb site. The data has been analyzed, cleansed and aggregated where appropriate to facilitate public discussion.

Use the code below to download Airbnb rental data for Melbourne, Victoria.

In [None]:
!wget http://data.insideairbnb.com/australia/vic/melbourne/2021-11-06/data/listings.csv.gz
!gzip -d "listings.csv.gz"

import pandas as pd
df_airbnb = pd.read_csv("listings.csv")
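# strip the dollar sign and thousands separators as literal characters, then convert to float
df_airbnb["price"] = df_airbnb["price"].str.replace("$", "", regex=False)
df_airbnb["price"] = df_airbnb["price"].str.replace(",", "", regex=False).astype(float)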
df_airbnb.head()

--2022-04-11 05:58:50--  http://data.insideairbnb.com/australia/vic/melbourne/2021-11-06/data/listings.csv.gz
Resolving data.insideairbnb.com (data.insideairbnb.com)... 52.217.97.3
Connecting to data.insideairbnb.com (data.insideairbnb.com)|52.217.97.3|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12088244 (12M) [application/x-gzip]
Saving to: ‘listings.csv.gz’


2022-04-11 05:58:50 (42.0 MB/s) - ‘listings.csv.gz’ saved [12088244/12088244]







Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,9835,https://www.airbnb.com/rooms/9835,20211106153141,2021-11-07,Beautiful Room & House,"<b>The space</b><br />House: Clean, New, Moder...",Very safe! Family oriented. Older age group.,https://a0.muscache.com/pictures/44620/5a5815c...,33057,https://www.airbnb.com/users/show/33057,...,4.75,4.5,4.67,,f,1,0,1,0,0.03
1,12936,https://www.airbnb.com/rooms/12936,20211106153141,2021-11-06,St Kilda 1BR+BEACHSIDE+BALCONY+WIFI+AC,RIGHT IN THE HEART OF ST KILDA! It doesn't get...,A stay at our apartment means you can enjoy so...,https://a0.muscache.com/pictures/59701/2e8cdaf...,50121,https://www.airbnb.com/users/show/50121,...,4.83,4.78,4.66,,f,10,10,0,0,0.71
2,33111,https://www.airbnb.com/rooms/33111,20211106153141,2021-11-07,Million Dollar Views Over Melbourne,<b>The space</b><br /><b>Enjoy Million Dollar ...,,https://a0.muscache.com/pictures/187260/0888dd...,143550,https://www.airbnb.com/users/show/143550,...,4.0,5.0,4.0,,f,1,0,1,0,0.02
3,38271,https://www.airbnb.com/rooms/38271,20211106153141,2021-11-07,Melbourne - Old Trafford Apartment,Please note: No booking will be accepted with ...,Our street is quiet & secluded but within walk...,https://a0.muscache.com/pictures/1182791/3bf4b...,164193,https://www.airbnb.com/users/show/164193,...,4.92,4.88,4.87,,f,1,1,0,0,1.22
4,41836,https://www.airbnb.com/rooms/41836,20211106153141,2021-11-06,CLOSE TO CITY & MELBOURNE AIRPORT,Easy to travel from and to the Airport; quiet ...,"The neighbours are quiet and friendly, please...",https://a0.muscache.com/pictures/569696dd-1ad0...,182833,https://www.airbnb.com/users/show/182833,...,4.83,4.39,4.69,,f,2,0,2,0,1.69
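
The price column is read as text, so it has to be converted before it can be plotted or modelled. The sketch below shows the conversion on a few made-up strings (the example values are assumptions about the format, based on the characters the loading code strips). Passing `regex=False` matters because `$` is a special character in regular expressions.

```python
import pandas as pd

# Made-up price strings in the style the loading code expects
prices = pd.Series(["$120.00", "$1,250.00", "$85.00"])

# regex=False treats "$" as a literal character rather than a regex anchor
cleaned = (
    prices.str.replace("$", "", regex=False)
          .str.replace(",", "", regex=False)
          .astype(float)
)
print(cleaned.tolist())
```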


In [None]:
# visualise the rental price from airbnb (in Melbourne)
import plotly.express as px
px.histogram(df_airbnb, x="price")

The price range is very wide in this case, so we filter the data to focus on the most common price range (below 1000).

In [None]:
# filter data price < 1000
df_airbnb = df_airbnb[df_airbnb["price"] < 1000]

In [None]:
# show accommodation prices and their locations
fig = px.scatter_mapbox(df_airbnb, lon="longitude", lat="latitude", text="name", color="price")
fig.update_layout(mapbox_style="open-street-map")

Now, let's try to build a prediction model to predict the accommodation price based on the review scores.

In [None]:
# get the features with "review_" in their column name
feature_names = [c for c in df_airbnb.columns if "review_" in c]

# drop missing value (na) in the dataframe
df_airbnb = df_airbnb.dropna(subset=feature_names + ["price"])
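
The `subset` argument is what keeps this step from discarding almost everything: only rows missing a feature or the target are dropped, while gaps in unrelated columns are ignored. A small sketch on a made-up table (the values are illustrative; the column names mimic the Airbnb ones):

```python
import numpy as np
import pandas as pd

# Toy listings table (made up for illustration)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "price": [100.0, 80.0, np.nan, 150.0],
    "review_scores_rating": [4.8, np.nan, 4.5, 4.9],
    "review_scores_value": [4.7, 4.2, 4.4, 4.6],
    "license": [np.nan, np.nan, np.nan, np.nan],  # entirely missing, but not used by the model
})

features = [c for c in toy.columns if "review_" in c]

# Keep rows that have every feature and the target; other columns may contain gaps
kept = toy.dropna(subset=features + ["price"])
print(kept["name"].tolist())

# Without subset, the all-missing "license" column would remove every row
print(len(toy.dropna()))
```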

In [None]:
df_airbnb

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,9835,https://www.airbnb.com/rooms/9835,20211106153141,2021-11-07,Beautiful Room & House,"<b>The space</b><br />House: Clean, New, Moder...",Very safe! Family oriented. Older age group.,https://a0.muscache.com/pictures/44620/5a5815c...,33057,https://www.airbnb.com/users/show/33057,...,4.75,4.50,4.67,,f,1,0,1,0,0.03
1,12936,https://www.airbnb.com/rooms/12936,20211106153141,2021-11-06,St Kilda 1BR+BEACHSIDE+BALCONY+WIFI+AC,RIGHT IN THE HEART OF ST KILDA! It doesn't get...,A stay at our apartment means you can enjoy so...,https://a0.muscache.com/pictures/59701/2e8cdaf...,50121,https://www.airbnb.com/users/show/50121,...,4.83,4.78,4.66,,f,10,10,0,0,0.71
3,38271,https://www.airbnb.com/rooms/38271,20211106153141,2021-11-07,Melbourne - Old Trafford Apartment,Please note: No booking will be accepted with ...,Our street is quiet & secluded but within walk...,https://a0.muscache.com/pictures/1182791/3bf4b...,164193,https://www.airbnb.com/users/show/164193,...,4.92,4.88,4.87,,f,1,1,0,0,1.22
4,41836,https://www.airbnb.com/rooms/41836,20211106153141,2021-11-06,CLOSE TO CITY & MELBOURNE AIRPORT,Easy to travel from and to the Airport; quiet ...,"The neighbours are quiet and friendly, please...",https://a0.muscache.com/pictures/569696dd-1ad0...,182833,https://www.airbnb.com/users/show/182833,...,4.83,4.39,4.69,,f,2,0,2,0,1.69
5,43429,https://www.airbnb.com/rooms/43429,20211106153141,2021-11-07,Tranquil Javanese-Style Apartment in Oakleigh ...,Study the exquisite detail of the antique Java...,Oakleigh is one of the most convenient and div...,https://a0.muscache.com/pictures/32bd863a-9149...,189684,https://www.airbnb.com/users/show/189684,...,4.93,4.77,4.85,,f,2,2,0,0,3.32
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17914,52959980,https://www.airbnb.com/rooms/52959980,20211106153141,2021-11-07,Yarra Valley House - Cosy/Big House Close To Town,Newly renovated / refurbuished large 4 bedroom...,Healesville is a tourism mecca that goes back ...,https://a0.muscache.com/pictures/1739595c-9966...,4003668,https://www.airbnb.com/users/show/4003668,...,5.00,5.00,5.00,,f,2,2,0,0,1.00
17928,52980228,https://www.airbnb.com/rooms/52980228,20211106153141,2021-11-06,Eq tower-sky high view-free parking,Capturing a superb location in the heart of Me...,Hospital grade sanitation and disinfection hav...,https://a0.muscache.com/pictures/miso/Hosting-...,254627785,https://www.airbnb.com/users/show/254627785,...,4.50,5.00,5.00,,t,5,5,0,0,2.00
17947,53015109,https://www.airbnb.com/rooms/53015109,20211106153141,2021-11-07,"Walk to Moonee Valley 2br, Free wifi, wine",Welcome to Moonee Ponds. Live like a local in ...,This apartment is just a stone's throw away fr...,https://a0.muscache.com/pictures/cb03e36e-9a42...,9885145,https://www.airbnb.com/users/show/9885145,...,5.00,5.00,5.00,,f,5,5,0,0,1.00
17954,53020017,https://www.airbnb.com/rooms/53020017,20211106153141,2021-11-06,HeartCBD Highrise 2 bedrooms+parking by request,"Luxury and modern designed apartment, which is...","It is in the heart of melbourne CBD, best loca...",https://a0.muscache.com/pictures/miso/Hosting-...,231393835,https://www.airbnb.com/users/show/231393835,...,5.00,5.00,5.00,,t,5,5,0,0,1.00


In [None]:
import numpy as np
from sklearn.neighbors import KNeighborsRegressor # import KNN algorithm
X = df_airbnb[feature_names]
y = df_airbnb["price"]
reg = KNeighborsRegressor(n_neighbors=2).fit(X, y) # train the KNN model with K = 2
reg.score(X, y) # calculate score/metric (R square by default)

0.3072324010109828

Follow the steps shown for Dataset 1 to build and interpret a linear regression model for this dataset.