# Linear models and Regression

This worksheet will guide you through using regression models.

When your data falls into classes (for example, cats and dogs), you can label it with a classifier. But what if you are trying to predict a value on a continuous scale? For example, you might want to predict how much a house costs based on its size. A price is not one of a fixed set of classes, so a classifier does not fit; this is a regression problem.

This tutorial will guide you through:
- downloading an open dataset
- setting up various regression models
- tuning hyperparameters to get the best outcome

You will need to make sure you have installed the following libraries:
- kagglehub
- scikit-learn (imported as sklearn)
- statsmodels

In [None]:
!pip install kagglehub
!pip install -U scikit-learn
!pip install statsmodels

In [2]:
import numpy as np #library for using large arrays in python
import pandas as pd #library for playing with datasets
import kagglehub
import matplotlib.pyplot as plt
############# regression libraries
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
import statsmodels.api as sm
import statsmodels.formula.api as smf


## Importing a dataset
First, take a look at Kaggle (https://www.kaggle.com/), a great platform where people post datasets. Have a look around. For now we will use a specific dataset, but later on you should download datasets that interest you.

Before we can download a dataset, we have to make an account and use it to sign in.

In [None]:
# to import the data we will use the kagglehub library
# we pass the dataset identifier copied from Kaggle to the download function,
# which downloads the latest version
path = kagglehub.dataset_download("shree1992/housedata")

print("Path to dataset files:", path)

## Playing with our data

Great, now we have downloaded a dataset... but can we just plug it into a model? No! We first have to look at our data and how it is formatted. The data is downloaded into a folder in CSV format. Try to think back to how we opened a dataset in yesterday's lab.
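
If you are unsure what the downloaded folder contains, the sketch below shows one possible way to find and open a CSV inside a folder. The helper name `load_first_csv` and the demo file `example.csv` are made up for illustration, and the demo uses a temporary folder rather than the real Kaggle download, so check the actual file names in your own `path`.

```python
import glob
import os
import tempfile

import pandas as pd


def load_first_csv(folder):
    """Return the first CSV file found in `folder` (alphabetical order) as a DataFrame."""
    csv_files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if not csv_files:
        raise FileNotFoundError(f"no CSV files found in {folder}")
    return pd.read_csv(csv_files[0])


# Demo on a throwaway folder so the sketch runs without a Kaggle download.
with tempfile.TemporaryDirectory() as demo_folder:
    pd.DataFrame({"bedrooms": [2, 3], "price": [250000, 340000]}).to_csv(
        os.path.join(demo_folder, "example.csv"), index=False
    )
    demo_df = load_first_csv(demo_folder)

print(demo_df.shape)
print(demo_df.head())
```

With the real dataset you would pass the `path` printed earlier instead of the temporary folder. If the folder holds more than one CSV, list the files first and pick the one you want.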


In [None]:
# TASKS

# open the dataset from the csv, you already have the path
df = ...  # replace ... with your code to load the csv into a pandas DataFrame
# summarise and preview this dataset
print(df.describe())
print(df.head())

## Modelling the data

Before we get into anything fancy, let's train some regression models. Regression models take each sample as a flat list of numbers. By this I mean we need every sample to be represented by a single vector. Let's say we only care about the bedrooms and bathrooms, and nothing else matters. We would then have [bedroom_no, bathroom_no] as the input and [price of house] as the label. The input data is the number of bedrooms and number of bathrooms, and the label is the cost of that house.

Extracting this data is quite simple: we grab the columns we want from the dataset, convert them to numpy and put them together (using the concatenate function).

We have shown you how to do it for bedrooms and bathrooms. Play around and try adding some other useful features to the input data. By the end we want an n x m matrix, where n is the number of samples and m is the number of input features. So for the above example m would be 2: bedrooms and bathrooms.
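
As a rough illustration of the n x m idea, the sketch below builds a feature matrix from a list of column names on a tiny made-up DataFrame (`toy_df` and its values are invented for this example). Selecting several columns at once and calling `to_numpy()` gives the same matrix as reshaping and concatenating each column.

```python
import numpy as np
import pandas as pd

# Tiny made-up dataset standing in for the real house data.
toy_df = pd.DataFrame({
    "bedrooms": [3, 2, 4],
    "bathrooms": [1.5, 1.0, 2.5],
    "price": [313000.0, 250000.0, 540000.0],
})

feature_columns = ["bedrooms", "bathrooms"]  # add more column names here

# Option 1: select the columns together, then convert to numpy.
X_toy = toy_df[feature_columns].to_numpy()
y_toy = toy_df["price"].to_numpy()

# Option 2: reshape each column to (n, 1) and concatenate side by side.
X_concat = np.concatenate(
    [toy_df[name].to_numpy().reshape((-1, 1)) for name in feature_columns],
    axis=1,
)

print("X shape:", X_toy.shape)  # n samples by m features
print("y shape:", y_toy.shape)
print("Same result:", np.array_equal(X_toy, X_concat))
```

Adding a feature is then just a matter of appending its column name to `feature_columns`.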

In [None]:
# we can grab a column by using the column title 
bedrooms = df['bedrooms']
bathrooms = df['bathrooms']
#we can then convert the pandas object to numpy arrays so we can work with them for machine learning
bedrooms = bedrooms.to_numpy()
bathrooms = bathrooms.to_numpy()
#when we combine them we do not want to join them end to end and go from n values to 2n... so we give each array a second axis.
print("Bathroom shape:",bathrooms.shape)
converted_bathroom=bathrooms.reshape((-1,1)) #-1 tells numpy to infer this dimension from the array length
converted_bedroom=bedrooms.reshape((-1,1)) #the result has shape (n, 1)
print("Bathrooms post processing:",converted_bathroom.shape)
#once they are in this shape we can join them side by side
X_data=np.concatenate([converted_bedroom,converted_bathroom],axis=1) #axis=1 joins along the columns, giving shape (n, 2)

#or we could do all the above code in one line:
X_data = np.concatenate([df['bedrooms'].to_numpy().reshape((-1,1)),df['bathrooms'].to_numpy().reshape((-1,1))],axis=1)


#then to gather the labels we can use
y_data=df['price']

print("Expected data shape:",X_data.shape)
print("Expected label shape:",y_data.shape)

assert len(y_data)==len(X_data), "incorrect shapes, x and y must be the same"

In [None]:
# TASK

#now add in a new column, or a few new columns from the dataset


#this will make sure your code works, assuming you use the same variable names for your labels
assert len(y_data)==len(X_data), "incorrect shapes, x and y must be the same"

How do we know which data to use? To begin with, you could plot each variable against the price and judge visually whether there is a link.
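
Alongside plotting, a quick numeric check is to rank columns by their correlation with the price. The sketch below does this on synthetic data; the column names and the relationship between `sqft_living` and `price` are assumptions made up for the example, and `numeric_only=True` needs a reasonably recent pandas. Correlation only measures linear association, so treat it as a first hint rather than proof.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data: price depends on living area, bedrooms are unrelated noise.
sqft = rng.uniform(500, 4000, n)
toy_df = pd.DataFrame({
    "sqft_living": sqft,
    "bedrooms": rng.integers(1, 6, n),
    "price": 150 * sqft + rng.normal(0, 20000, n),
})

# Pearson correlation of every numeric column with the price.
correlations = toy_df.corr(numeric_only=True)["price"].drop("price")

# Rank by absolute value so strong negative links are not missed.
ranked = correlations.abs().sort_values(ascending=False)
print(ranked)
```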


In [None]:
plt.scatter(df['bedrooms'],df['price'])
plt.grid(1)

plt.xlabel("number of bedrooms")
plt.ylabel("price")
plt.title("Price vs bedrooms")
plt.show()

It looks like the number of bedrooms alone does not have a clear relationship with the price of a house. Have a look at some other variables and see if any of them show a relationship.

### Converting our data to a train and test set

We want to train a model on as much data as we can, but how does it perform on unseen data? We must reserve part of the dataset for testing. This tells us how well the model works on data it has not seen. If it performs much worse on the test set than on the training set, that suggests overfitting; if it performs poorly on both, there may be little relationship between your data and labels.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=42)

#another hyperparameter is the test_size. scikit-learn defaults to 25% when it is not set; above we use 20%. Does varying this size impact accuracy?

#an idea for challenges once you have finished is to investigate this
#note that each call below overwrites the previous split, so only the last one is kept
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=42)

#another hyperparameter is the random state. It does not change the model itself,
# but which samples land in the train and test sets can change the measured accuracy. For fair experiments you should repeat the split with several different random states
# and then average the model performance over a number of trials

X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=np.random.randint(0,100))
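
The averaging idea from the comments above can be sketched as follows. This is only an illustration on synthetic data (`X_toy`, `y_toy` and the coefficients used to generate them are made up); with the house data you would swap in `X_data` and `y_data`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem with a known linear relationship.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(300, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(0, 1.0, 300)

scores = []
for seed in range(10):
    # A different random_state gives a different train/test split each trial.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_toy, y_toy, test_size=0.2, random_state=seed
    )
    toy_model = LinearRegression().fit(X_tr, y_tr)
    scores.append(r2_score(y_te, toy_model.predict(X_te)))

mean_r2 = np.mean(scores)
std_r2 = np.std(scores)
print(f"R^2 over {len(scores)} splits: mean={mean_r2:.3f}, std={std_r2:.3f}")
```

Using a fixed list of seeds rather than `np.random.randint` keeps the experiment repeatable while still varying the split.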


## Training a model
This section is about importing various regression models and training. 

### Linear regression
Linear regression is conceptually the simplest model. Let's take a look at implementing it:

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)

#make predictions
y_pred = model.predict(X_test)

#evaluate the model
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean squared error (MSE):", mean_squared_error(y_test, y_pred)) #this should be low
print("R^2 score:", r2_score(y_test, y_pred))

plt.axis('equal')
plt.grid(1)
plt.plot([df['price'].min(), df['price'].max()],
         [df['price'].min(), df['price'].max()], 'k--')
plt.title("Prediction vs real value")
plt.scatter(y_test,y_pred)
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
plt.show()

Looking at the graph and our MSE output, it seems the model has not performed too well. A good scatter of prediction vs real value should have its points lying along the dashed diagonal. Our plot shows predicted values that are much lower than the real ones for expensive houses: the model seems to predict low cost houses fairly well but not higher cost ones.

Let's take a look at some other models.
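
To make the two printed metrics less of a black box, here is a small sketch that computes MSE and R^2 by hand and checks them against scikit-learn. The four prices and predictions are made-up numbers for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual prices and predictions.
y_true = np.array([300000.0, 450000.0, 250000.0, 800000.0])
y_hat = np.array([320000.0, 400000.0, 260000.0, 600000.0])

residuals = y_true - y_hat

# MSE: the average squared residual (units are price squared).
mse_manual = np.mean(residuals ** 2)

# RMSE: the square root of MSE, back in the same units as the price.
rmse = np.sqrt(mse_manual)

# R^2: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("MSE :", mse_manual, "sklearn:", mean_squared_error(y_true, y_hat))
print("RMSE:", rmse)
print("R^2 :", r2_manual, "sklearn:", r2_score(y_true, y_hat))
```

Because prices are large numbers, the MSE will look huge even for a reasonable model. RMSE and R^2 are usually easier to interpret: an R^2 of 1 is a perfect fit and 0 is no better than always predicting the mean.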

### Ridge

In [None]:
ridge = Ridge(alpha=1.0)  # Regularization strength
ridge.fit(X_train, y_train)

#make predictions
y_pred = ridge.predict(X_test)

#evaluate the model
print("Coefficients:", ridge.coef_)
print("Intercept:", ridge.intercept_)
print("Mean squared error (MSE):", mean_squared_error(y_test, y_pred)) #this should be low
print("R^2 score:", r2_score(y_test, y_pred))

plt.axis('equal')
plt.grid(1)
plt.plot([df['price'].min(), df['price'].max()],
         [df['price'].min(), df['price'].max()], 'k--')
plt.title("Prediction vs real value")
plt.scatter(y_test,y_pred)
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
plt.show()

It is not much better than plain linear regression. It seems our input data might not be enough to capture a decent relationship. Try going back and adding more columns to the X data. Does this improve the predictions?
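
One possible way to gather many columns at once is to keep only the numeric ones, since text columns cannot be fed to these models directly. The sketch below assumes a made-up `toy_df` with a text column called `city`; the real dataset's columns may differ, so inspect `df.dtypes` first.

```python
import pandas as pd

# Made-up data with a mix of numeric and text columns.
toy_df = pd.DataFrame({
    "price": [313000.0, 250000.0, 540000.0],
    "bedrooms": [3, 2, 4],
    "sqft_living": [1340, 900, 2100],
    "city": ["Seattle", "Kent", "Bellevue"],
})

# Keep numeric columns only; text columns such as city are dropped.
numeric_df = toy_df.select_dtypes(include="number")

# The label must not be part of the inputs, so remove price from X.
X_all = numeric_df.drop(columns=["price"]).to_numpy()
y_all = numeric_df["price"].to_numpy()

print("Features used:", list(numeric_df.drop(columns=["price"]).columns))
print("X shape:", X_all.shape, "y shape:", y_all.shape)
```

Leaving the price inside X would let the model read the answer directly, so always drop the label column before training.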

In [None]:
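#TASK

# train a model with more features, let's include all the numeric columns
X = ...  # replace ... with your feature matrix
y = ...  # replace ... with your labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=np.random.randint(0,100))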

ridge = Ridge(alpha=1.0)  # Regularization strength
ridge.fit(X_train, y_train)

#make predictions
y_pred = ridge.predict(X_test)

#evaluate the model
print("Coefficients:", ridge.coef_)
print("Intercept:", ridge.intercept_)
print("Mean squared error (MSE):", mean_squared_error(y_test, y_pred)) #this should be low
print("R^2 score:", r2_score(y_test, y_pred))

plt.axis('equal')
plt.grid(1)
plt.plot([df['price'].min(), df['price'].max()],
         [df['price'].min(), df['price'].max()], 'k--')
plt.title("Prediction vs real value")
plt.scatter(y_test,y_pred)
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
plt.show()

Has this improved it? If not, perhaps the relationship between the data and the price is not linear enough. Let's look at a model that copes better with non-linear, noisy data.

### Random forest regression

In [None]:

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

#make predictions
y_pred = model.predict(X_test)

#evaluate the model
print("Mean squared error (MSE):", mean_squared_error(y_test, y_pred)) #this should be low
print("R^2 score:", r2_score(y_test, y_pred))

plt.axis('equal')
plt.grid(1)
plt.plot([df['price'].min(), df['price'].max()],
         [df['price'].min(), df['price'].max()], 'k--')
plt.title("Prediction vs real value")
plt.scatter(y_test,y_pred)
plt.xlabel("Actual value")
plt.ylabel("Predicted value")
plt.show()
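
A fitted random forest also exposes `feature_importances_`, which gives a rough idea of which inputs it relied on. The sketch below uses synthetic data where only the first feature matters; the feature names are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends non-linearly on feature 0 only.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(300, 3))
y_toy = 5.0 * X_toy[:, 0] ** 2 + rng.normal(0, 5.0, 300)

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_toy, y_toy)

feature_names = ["useful_feature", "noise_a", "noise_b"]  # hypothetical names
importances = forest.feature_importances_

# Print features from most to least important.
for name, value in sorted(zip(feature_names, importances),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

The importances sum to 1, so each value can be read as a share of the forest's total reliance on that feature.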

### GLM
Let's give a generalised linear model (GLM) a go. For house price prediction it might not be the best choice, but here is some demo code you can play around with.

When we say GLM here we are mostly talking about logistic regression, which, yes, is classification. Let's convert our dataset to a classification task. What if, instead of predicting the cost, we wanted to predict whether we could afford the house?
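
To see why logistic regression counts as a linear model, the sketch below shows the idea behind it: a linear score is passed through the sigmoid function to give a probability between 0 and 1. The intercept and coefficient are made-up numbers rather than fitted values, and the single feature (house size) is only an example.

```python
import numpy as np


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical parameters: bigger houses are less likely to be affordable.
intercept = 6.0
coef_per_sqft = -0.004

sqft = np.array([500.0, 1400.0, 3000.0])

# Same linear combination as in linear regression...
linear_score = intercept + coef_per_sqft * sqft

# ...but turned into a probability of the "affordable" class.
prob_affordable = sigmoid(linear_score)

# Threshold at 0.5 to get a class prediction.
predicted_affordable = prob_affordable >= 0.5

for size, p, label in zip(sqft, prob_affordable, predicted_affordable):
    print(f"{size:.0f} sqft -> P(affordable)={p:.3f}, predicted={label}")
```

Training a logistic regression means finding the intercept and coefficients that make these probabilities match the observed labels as closely as possible.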

In [None]:
our_budget = df['price'].median() #placeholder: replace with your own budget, check df['price'] to make sure your value is within its range

#now we convert the price to binary classes: True means we can afford the house

df['labels'] = df['price'] <= our_budget

df.head()

Now that we have the data in binary format, we can classify it using logistic regression.

But you need to do this yourself. The code from earlier shows how to convert the pandas data to numpy input data and labels.

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">Go to the scikit-learn LogisticRegression documentation</a> to find out how to implement logistic regression.

You have all the basics in this sheet, so start using them to build your own solution.

In [None]:
# code goes here

X = ...
y = ...

#train logistic regression

#test how accurate it is

## Conclusions so far

So far we have imported a dataset and played around with a few different hyperparameters. Go further and see how each hyperparameter affects the accuracy. Try to present this as a scientific experiment.

Once you have optimised, play around with other datasets on Kaggle. As long as the label is a continuous value rather than a class, the methods you learned here today are applicable.
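
One way such an experiment might look is sketched below: it sweeps the Ridge `alpha` hyperparameter and averages the test MSE over several splits for each value. The data is synthetic and the list of alphas is an arbitrary choice for illustration. For a rigorous comparison, a separate validation split or cross-validation is preferable to picking the best value on the test set.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: five features, two of which have no effect on the target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
true_weights = np.array([4.0, -3.0, 0.0, 2.0, 0.0])
y_toy = X_toy @ true_weights + rng.normal(0, 1.0, 200)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
results = {}

for alpha in alphas:
    mses = []
    for seed in range(5):
        # Repeat over several splits so one lucky split does not decide the result.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_toy, y_toy, test_size=0.2, random_state=seed
        )
        toy_ridge = Ridge(alpha=alpha).fit(X_tr, y_tr)
        mses.append(mean_squared_error(y_te, toy_ridge.predict(X_te)))
    results[alpha] = np.mean(mses)

for alpha, mse in results.items():
    print(f"alpha={alpha:<6} mean test MSE={mse:.3f}")

best_alpha = min(results, key=results.get)
print("Lowest mean MSE at alpha =", best_alpha)
```

Plotting `results` with alpha on a logarithmic x-axis is a natural next step for presenting the experiment.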