# Introduction to Linear Regression 



## Learning Objectives

1. Analyze a Pandas DataFrame.
2. Create Seaborn plots for Exploratory Data Analysis.
3. Train a Linear Regression Model using Scikit-Learn.


## Introduction 
This lab introduces linear regression using Python and Scikit-Learn. It serves as a foundation for the more complex algorithms and machine learning models that you will encounter in the course. We will train a linear regression model to predict housing prices.

Each learning objective corresponds to a __#TODO__ in this student lab notebook.

### Import Libraries

In [None]:
import os 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns # Seaborn is a Python data visualization library based on matplotlib. 
%matplotlib inline   

###  Load the Dataset

We will use the [USA housing prices](https://www.kaggle.com/kanths028/usa-housing) dataset found on Kaggle.  The data contains the following columns:

* 'Avg. Area Income': Average income of residents of the city the house is located in
* 'Avg. Area House Age': Average age of houses in the same city
* 'Avg. Area Number of Rooms': Average number of rooms for houses in the same city
* 'Avg. Area Number of Bedrooms': Average number of bedrooms for houses in the same city
* 'Area Population': Population of the city the house is located in
* 'Price': Price that the house sold at
* 'Address': Address of the house

Next, we read the dataset into a Pandas dataframe.

In [None]:
df_USAhousing = pd.read_csv('../USA_Housing_toy.csv')

In [None]:
# Show the first five rows.

df_USAhousing.head()

Let's check for any null values.

In [None]:
# isnull() flags the missing (null) values in a data frame; sum() then counts them per column.
df_USAhousing.isnull().sum()
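To show what the null check returns, here is a minimal sketch on a hypothetical three-row frame (the values and the name `toy` are made up for illustration and are not taken from the housing dataset). Each column maps to the number of missing entries it contains.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each of two columns.
toy = pd.DataFrame({
    "Price": [250000.0, np.nan, 310000.0],
    "Area Population": [23000.0, 41000.0, np.nan],
    "Address": ["1 Main St", "2 Oak Ave", "3 Pine Rd"],
})

# isnull() gives a boolean frame; sum() counts the True values per column.
null_counts = toy.isnull().sum()
print(null_counts)
```

A column with a count of zero has no missing values and needs no cleaning before training.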

In [None]:
# Pandas describe() is used to view some basic statistical details of a data frame or a series of numeric values.
df_USAhousing.describe()

In [None]:
# Pandas info() function is used to get a concise summary of the dataframe.
df_USAhousing.info()

Let's take a peek at the first five rows of the data for all columns.

**Lab Task 1:** Print the first five rows of the data for all columns.

In [None]:
# TODO -- print the first 5 rows

## Exploratory Data Analysis (EDA)

Let's create some simple plots to check out the data!  

In [None]:
sns.pairplot(df_USAhousing)

In [None]:
sns.displot(df_USAhousing['Price'])

**Lab Task 2:** Create the plots using heatmap():

In [None]:
# TODO -- create a heatmap

## Training a Linear Regression Model

Regression is a supervised machine learning process. It is similar to classification, but rather than predicting a label, we try to predict a continuous value. Linear regression models the relationship between a target variable (y) and a set of predictive features (x) as a weighted sum of the features plus an intercept. Simply stated, if you need to predict a number, then use regression.
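As a small illustration of fitting a line to data, the sketch below recovers a slope and intercept by ordinary least squares with NumPy. The data are synthetic and follow y = 3x + 2 exactly; this is not the Scikit-Learn API used later in the lab, only the same underlying idea.

```python
import numpy as np

# Synthetic data following y = 3*x + 2 exactly (no noise).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 2.0

# Design matrix with a column of ones so the fit includes an intercept.
A = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares: find the weights that minimize ||A @ w - y||^2.
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

print(round(slope, 6), round(intercept, 6))
```

With noisy real-world data the recovered slope and intercept would only approximate the underlying relationship.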

Let's now begin to train our regression model! We will need to first split up our data into an X array that contains the features to train on, and a y array with the target variable, in this case the Price column. We will toss out the Address column because it only has text info that the linear regression model can't use.

### X and y arrays

Next, let's define the features and label.  Briefly, feature is input; label is output. This applies to both classification and regression problems.

In [None]:
X = df_USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
               'Avg. Area Number of Bedrooms', 'Area Population']]
y = df_USAhousing['Price']

## Train - Test - Split

Now let's split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model. Note that we are using 20% of the data for testing (`test_size=0.2`).

#### What is Random State? 
If no integer is specified for `random_state`, a new random seed is used every time the code is executed, so the train and test datasets will contain different rows on each run. If a fixed value is assigned, such as `random_state=0`, `1`, `101`, or any other integer, the split is the same no matter how many times you execute the code, i.e. the same rows end up in the train and test datasets. The random state you provide is used as the seed for the random number generator, which ensures that the random numbers are generated in the same order.
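The effect of a fixed seed can be checked directly. This sketch uses small synthetic arrays (`X_demo` and `y_demo` are hypothetical names, not the housing data) and calls `train_test_split` twice with the same `random_state`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)   # 10 synthetic samples, 2 features
y_demo = np.arange(10)

# Two calls with the same seed produce identical splits.
_, X_test_a, _, y_test_a = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101)
_, X_test_b, _, y_test_b = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101)

same_split = (np.array_equal(X_test_a, X_test_b)
              and np.array_equal(y_test_a, y_test_b))
print(same_split, len(y_test_a))
```

With 10 samples and `test_size=0.2`, the test set holds 2 samples, and both calls select the same 2.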

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

## Creating and Training the Model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lm = LinearRegression()

**Lab Task 3:** Training the Model using fit():

In [None]:
# TODO -- train the model

## Model Evaluation

Let's evaluate the model by checking out its coefficients and how we can interpret them.

In [None]:
# print the intercept
print(lm.intercept_)

In [None]:
coeff_df = pd.DataFrame(lm.coef_,X.columns,columns=['Coefficient'])
coeff_df

Interpreting the coefficients:

- Holding all other features fixed, a 1 unit increase in **Avg. Area Income** is associated with an **increase of \$21.52**.
- Holding all other features fixed, a 1 unit increase in **Avg. Area House Age** is associated with an **increase of \$164883.28**.
- Holding all other features fixed, a 1 unit increase in **Avg. Area Number of Rooms** is associated with an **increase of \$122368.67**.
- Holding all other features fixed, a 1 unit increase in **Avg. Area Number of Bedrooms** is associated with an **increase of \$2233.80**.
- Holding all other features fixed, a 1 unit increase in **Area Population** is associated with an **increase of \$15.15**.
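The "holding all other features fixed" reading can be verified numerically. The sketch below fits a model on synthetic data (two made-up features with true coefficients 2 and 5, not the housing data), then raises only the first feature by one unit and compares the change in prediction with the fitted coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2*x1 + 5*x2 + 1, with no noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 2.0 * X_demo[:, 0] + 5.0 * X_demo[:, 1] + 1.0

model_demo = LinearRegression().fit(X_demo, y_demo)

base = np.array([[4.0, 7.0]])
bumped = base + np.array([[1.0, 0.0]])  # raise only the first feature by one unit

# The change in prediction equals the first coefficient.
delta = model_demo.predict(bumped)[0] - model_demo.predict(base)[0]
print(delta, model_demo.coef_[0])
```

The same check applies to any feature of a linear model: the prediction moves by exactly that feature's coefficient per unit change.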



## Predictions from our Model

Let's grab predictions off our test set and see how well it did!

In [None]:
predictions = lm.predict(X_test)

In [None]:
plt.scatter(y_test,predictions)

**Residual Histogram**

In [None]:
sns.displot((y_test-predictions), bins=50);

## Regression Evaluation Metrics


Here are three common evaluation metrics for regression problems:

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

**Mean Squared Error** (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

Comparing these metrics:

- **MAE** is the easiest to understand, because it's the average error.
- **MSE** is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.
- **RMSE** is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are **loss functions**, because we want to minimize them.
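To connect the formulas to numbers, here is a hand computation of the three metrics with NumPy on four made-up values (`y_true` and `y_pred` are illustrative, not outputs of the housing model).

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 4.0])

errors = y_true - y_pred            # [1, 0, -2, 3]
mae = np.mean(np.abs(errors))       # (1 + 0 + 2 + 3) / 4 = 1.5
mse = np.mean(errors ** 2)          # (1 + 0 + 4 + 9) / 4 = 3.5
rmse = np.sqrt(mse)                 # sqrt(3.5), roughly 1.871

print(mae, mse, rmse)
```

The largest error (3) accounts for half of the MAE total but 9 of the 14 squared-error units, which is the sense in which MSE and RMSE weight large errors more heavily.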

In [None]:
from sklearn import metrics

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

### Now, let's use Vertex Training and Prediction services

In [1]:
# TODO: copy in your project ID
PROJECT_ID = 'YOUR_PROJECT_ID'
REGION = 'us-central1'
# TODO: enter the name of your Cloud Storage bucket (it will be created if it does not exist)
BUCKET_NAME = 'YOUR_BUCKET_NAME'
BUCKET_URI = f'gs://{BUCKET_NAME}'
JOB_NAME = "linear_regression_job"

In [None]:
!gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

In [None]:
import os
from google.cloud import aiplatform

#### Initialize Vertex AI 

Initialize the Vertex AI SDK for Python for your project.

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

#### Set pre-built containers

Vertex AI provides pre-built containers to run training and prediction.

In [2]:
TRAIN_VERSION = "scikit-learn-cpu.0-23"
DEPLOY_VERSION = "scikit-learn-cpu.0-23"

TRAIN_IMAGE = "us-docker.pkg.dev/vertex-ai/training/{}:latest".format(TRAIN_VERSION)
DEPLOY_IMAGE = "us-docker.pkg.dev/vertex-ai/prediction/{}:latest".format(DEPLOY_VERSION)

#### Training script

In the next cell, you will write the contents of the training script, `task.py`, that will be sent to the training service.

In [5]:
# TODO: create an input_data folder in your Cloud Storage bucket and copy the file
# USA_Housing_toy.csv into it

In [None]:
%%writefile task.py
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from google.cloud import storage

import datetime
import pandas as pd
import numpy as np
import argparse
import os
import sys
import joblib

parser = argparse.ArgumentParser()
parser.add_argument('--bucket_name', dest='bucket_name',
                    type=str, help='Name of bucket')
args = parser.parse_args()

# read and process input data
# Use the bucket passed on the command line; BUCKET_NAME is not defined inside this script.
df_USAhousing = pd.read_csv(f'gs://{args.bucket_name}/input_data/USA_Housing_toy.csv')
X = df_USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
               'Avg. Area Number of Bedrooms', 'Area Population']]
y = df_USAhousing['Price']
# TODO: split the data into training and test sets
X_train, X_test, y_train, y_test = ...

# Build the Scikit model
# TODO: initialize the linear regression object imported above
lm = ...
# TODO: train the model on X_train, y_train

# Export the model to a file
model_name = 'model.joblib'
joblib.dump(lm, model_name)

# Upload the model to GCS
bucket = storage.Client().bucket(args.bucket_name)
blob = bucket.blob('{}/model/{}'.format(
    datetime.datetime.now().strftime('usa-housing-%Y-%m-%d-%H:%M:%S'),
    model_name))
blob.upload_from_filename(model_name)
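The `--bucket_name` flag defined in `task.py` can be exercised locally without any cloud access. This is a minimal sketch: the bucket name `my-bucket` is hypothetical, and building the input path from the parsed argument is one reasonable way to avoid relying on notebook variables inside the script.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--bucket_name', dest='bucket_name',
                    type=str, help='Name of bucket')

# An explicit list stands in for the command line; 'my-bucket' is a hypothetical name.
args = parser.parse_args(['--bucket_name', 'my-bucket'])

# Build the input path from the parsed argument.
input_uri = f'gs://{args.bucket_name}/input_data/USA_Housing_toy.csv'
print(input_uri)
```

When the script runs as a training job, the same value would arrive through the job's command-line arguments instead of the explicit list.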

#### Deploy the model

In [None]:
DEPLOYED_NAME = "linear_reg_model_deployed"

TRAFFIC_SPLIT = {"0": 100}

MIN_NODES = 1
MAX_NODES = 1


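# NOTE: `model` is not defined in this section; it is assumed to be the
# aiplatform.Model produced by the training job.
endpoint = model.deploy(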
    deployed_model_display_name=DEPLOYED_NAME,
    traffic_split=TRAFFIC_SPLIT,
    min_replica_count=MIN_NODES,
    max_replica_count=MAX_NODES,
)