# Module 4 - Regression

In this lesson, you will be introduced to regression as one kind of supervised learning method.

By the end of this lesson, you will be able to describe:
- Regression.
- Properties of linear regression.
- Modern real-life applications of regression.
- Regression metrics.

In the previous lesson, you learned classification methods.

**Housing Prices** 

Assume you want to buy a house. You go to your real estate agent and tell them that you want to buy a property. The real estate agent starts by asking you questions about your neighborhood, community, and preferences, including: the size of the house, the number of bedrooms, bathrooms, and any special features you require.

The real estate agent then locates a number of suitable properties based on these features and helps you through the home buying process.

Regression works in a similar way. It is a statistical method for finding relationships between variables that generalize well enough to estimate one variable from the others. Using previously seen, similar data samples, a machine learning model fits a curve that maps the input variables to a continuous numeric result.
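
To make the idea of fitting a curve concrete, here is a minimal sketch of the simplest case: a straight line fitted by ordinary least squares. The house sizes and prices are hypothetical values invented for illustration, and `fit_line` is an illustrative helper, not part of any library.

```python
# Hypothetical data: house size in square metres and sale price in thousands.
sizes = [50.0, 70.0, 90.0, 110.0, 130.0]
prices = [150.0, 200.0, 250.0, 300.0, 350.0]


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(sizes, prices)

# Use the fitted line to estimate the price of an unseen 100 square metre house.
predicted = slope * 100.0 + intercept
print(slope, intercept, predicted)
```

The fitted slope and intercept play the role of the agent's experience: once they are learned from past sales, any new house size maps to a continuous price estimate.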

## Linear regression

### Linear Regression in Python
In the following section, you will learn how to write Python code using the Scikit-learn implementation of a linear regression algorithm.
In this section, you do not need your computer. You will simply read and follow the example, unless you want to run the code step-by-step.

First, read the data.

```
import pandas as pd
df = pd.read_csv("data.csv")
```

Normalize your data using the Scikit-learn standard scaler.

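```
from sklearn.preprocessing import StandardScaler
# Assumption: the label column is named "target" (hypothetical name).
y = df["target"]
scaler = StandardScaler()
X = scaler.fit_transform(df.drop(columns=["target"]))
```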
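
As a rough sketch of what the standard scaler does to each column, the snippet below subtracts the column mean and divides by the population standard deviation. The `bedrooms` values are hypothetical, and `standardize` is an illustrative helper rather than a Scikit-learn function.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (divides by n)
    return [(value - mu) / sigma for value in column]


# Hypothetical feature column: number of bedrooms per house.
bedrooms = [2.0, 3.0, 3.0, 4.0, 5.0]
scaled = standardize(bedrooms)

print(scaled)
print(mean(scaled), pstdev(scaled))
```

After scaling, features measured in very different units (for example, square metres and number of rooms) contribute on a comparable scale.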

Divide your data into train and test sets.

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)
```

Use the Scikit-learn implementation of linear regression.


```
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
```

Use the LinearRegression instance you created to predict the target values for the test data.

```

y_pred = model.predict(X_test)

```
Collect the R-squared value after fitting.
Hint:
The score method of a fitted LinearRegression model returns the R-squared value (the same quantity that r2_score computes) for the data you pass to it.

```

score = model.score(X_test,y_test)
print(score)

```
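
For intuition about what the score means, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses hypothetical true and predicted values; `r_squared` is an illustrative helper written for this example.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of a baseline that always predicts the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical targets and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

r2 = r_squared(y_true, y_pred)
baseline = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])  # always predict the mean
print(r2, baseline)
```

A value close to 1 means the model explains most of the variation in the target, while a model that is no better than predicting the mean scores 0.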








### Mission: Estimating Housing Prices Using Linear Regression

In this mission, you will work with the Boston Housing dataset. You will practice how to extract and perform regression modeling on the Boston Housing dataset using linear regression. You will then evaluate the developed regressor.

**Instructions**

Perform the following task to complete this mission and write the code in the provided editor:

Try regression with the Boston Housing dataset using linear regression!

Your code should perform the following tasks to complete this mission:

- Train a LinearRegression model using X_train and y_train.
- Score your model using X_test and y_test.

Your code should return the score of the model using the test dataset.

```
# Note: load_boston was removed in scikit-learn 1.2; this cell assumes an older release.
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# Fit on the training split, then report R-squared on the held-out split.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(score)
```

```
0.6705795412578821
```


### Mission: Regression with Iris

In this mission, you will work with the Iris dataset. You will practice how to extract and perform regression modeling on the Iris dataset using linear regression. You will then evaluate the developed regressor.

**Instructions**

Perform the following task to complete this mission and write the code in the provided editor:

Try regression with the Iris dataset using linear regression!

Your code should perform the following tasks to complete this mission:

- Train a LinearRegression model using sepal_train and petal_length_train.
- Score your model using sepal_test and petal_length_test.

Your code should return the score of the model using the test dataset.

Revisit the Iris dataset here.

```
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def main():
    X, _ = load_iris(return_X_y=True)

    # Predict petal length (column 2) from the two sepal measurements (columns 0 and 1).
    sepal = X[:, :2]
    petal_length = X[:, 2]

    sepal_train, sepal_test, petal_length_train, petal_length_test = train_test_split(
        sepal, petal_length, test_size=0.33, random_state=0)

    model = LinearRegression()
    model.fit(sepal_train, petal_length_train)
    petal_length_pred = model.predict(sepal_test)

    score = model.score(sepal_test, petal_length_test)
    print(score)
    return score


main()
```

## Regression using SVMs

### Linear SVM

Support vector machine regression follows the same procedure and rules as support vector machine classification, with a few differences. Most notably, the algorithm aims to achieve the largest possible margin together with the smallest regression error.
Because the output of SVM regression is a real, continuous value, the model needs an approximating function that predicts a continuous value for each data point.
Consider the example shown below. The SVM aims to find the hyperplane that keeps the prediction errors as small as possible.
In SVM regression, you want the data points to lie as close as possible to the chosen hyperplane, whereas in classification you want them as far from it as possible. Unlike simple regression (such as ordinary least squares), SVM regression defines a range on both sides of the hyperplane within which errors are ignored, which makes the regression function insensitive to small errors.
In the end, SVM regression has a boundary, like SVM classification. However, the boundary in regression exists to make the regression function insensitive to errors that fall inside it.
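
The error-insensitive range described above is commonly expressed as an epsilon-insensitive loss: an error smaller than epsilon costs nothing, and a larger error costs only the part that exceeds epsilon. The following is a minimal sketch with hypothetical values; `epsilon_insensitive_loss` is an illustrative helper, not the Scikit-learn implementation.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss that ignores errors no larger than epsilon."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


# Hypothetical targets and predictions.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.05, 2.5, 2.0]

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
# The first prediction is inside the tube, so its loss is zero.
print(losses)
```

Only the points that fall outside the tube contribute to the loss, which is why the fitted function does not react to small errors.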


### Nonlinear SVM

Kernel functions represent the data in a higher-dimensional space so that it can be fitted with a linear hyperplane in that space.
The difference between nonlinear and linear regression can be explained in the image below:

A linear model always fits the data with a straight line (or a flat hyperplane), which in this case is insufficient and unrealistic. Therefore, higher-order models are used.
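
One way to see the link between a kernel and a higher-dimensional space is the degree-2 polynomial kernel on 2-D points: evaluating the kernel directly gives the same number as mapping both points into 3-D and taking an ordinary dot product there. The sketch below illustrates this for one hypothetical pair of points; `phi` and `poly2_kernel` are names chosen for this example.

```python
import math


def dot(a, b):
    """Ordinary dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def poly2_kernel(a, b):
    """Degree-2 polynomial kernel without a constant term: (a . b)^2."""
    return dot(a, b) ** 2


def phi(p):
    """Explicit feature map whose dot product equals poly2_kernel."""
    x1, x2 = p
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


a = (1.0, 2.0)
b = (3.0, 4.0)

kernel_value = poly2_kernel(a, b)   # computed in the original 2-D space
mapped_value = dot(phi(a), phi(b))  # computed in the 3-D feature space
print(kernel_value, mapped_value)
```

The kernel never builds the higher-dimensional vectors, which is what makes the approach practical when the feature space is very large.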

### Implement SVM Regression

You are going to train an SVM regressor. Because you are performing a regression task, you will use the support vector regressor class (SVR).
You will use the Boston Housing dataset, so you will import datasets from sklearn.

```
from sklearn.svm import SVR
from sklearn import datasets
from sklearn import model_selection
```

Now load the data, and split it into training and testing data using the sklearn.model_selection.train_test_split() function.

```
data = datasets.load_boston()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
```

```
data
```

```
{'data': array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
         4.9800e+00],
        [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
         9.1400e+00],
        [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
         4.0300e+00],
        ...,
        [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         5.6400e+00],
        [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
         6.4800e+00],
        [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
         7.8800e+00]]),
 'target': array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
        18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
        15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
        13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
        21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
        35.4, 24.7, 3
```


This is the most important step.
The data for this example is simple, so you will use the linear kernel.
Also, if you cannot visualize your data (more than three dimensions), always try the linear kernel first.
Arguments:
1. kernel: the kernel used to implement the kernel trick, as previously discussed. The kernel argument can take multiple values, such as:
   - linear: a simple linear kernel.
   - poly/rbf: polynomial and radial basis function kernels, which are useful for creating nonlinear hyperplanes.
2. gamma: the kernel coefficient, used mainly with nonlinear kernels. It controls how far the influence of a single training sample reaches. The higher the gamma, the more closely the model follows individual training points, so it can be tuned to avoid overfitting.
3. C: the penalty parameter of the error term. Higher values penalize errors more strongly and push toward a harder margin, while lower values give a softer margin with stronger regularization. It can also be tuned to avoid overfitting.
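
To illustrate the effect of gamma, the sketch below evaluates a radial basis function kernel, exp(-gamma * squared distance), for one hypothetical pair of points at two gamma values. `rbf_kernel` here is a hand-written helper for illustration, not the Scikit-learn function of the same name.

```python
import math


def rbf_kernel(a, b, gamma):
    """Radial basis function kernel: exp(-gamma * ||a - b||^2)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


a = (0.0, 0.0)
b = (1.0, 1.0)

same_point = rbf_kernel(a, a, gamma=0.1)  # identical points always give 1.0
low_gamma = rbf_kernel(a, b, gamma=0.1)   # b still looks similar to a
high_gamma = rbf_kernel(a, b, gamma=10.0) # b looks almost unrelated to a
print(same_point, low_gamma, high_gamma)
```

With a high gamma, each training point influences only its immediate surroundings, so the model can bend around individual samples and is more prone to overfitting.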

```
reg = SVR(kernel='linear')
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)
print(reg.score(X_test, y_test))
```

```
0.7120644131806819
```


### Estimating house prices using SVM

```
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# Nonlinear SVR with an RBF kernel; the features are not scaled here.
model = SVR(C=1.0, kernel="rbf", gamma='auto')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
score = model.score(X_test, y_test)

print('MSE: ' + str(mean_squared_error(y_test, y_pred)))
print("R-squared: " + str(score))
```

```
MSE: 79.60870712971428
R-squared: 0.01257013727196088
```

### Iris regression SVM

```
from sklearn.datasets import load_iris
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X, _ = load_iris(return_X_y=True)

sepal = X[:, :2]
petal_length = X[:, 2]

sepal_train, sepal_test, petal_length_train, petal_length_test = train_test_split(
    sepal, petal_length, test_size=0.33, random_state=0)

model = SVR(kernel='linear', C=0.2, gamma='auto', epsilon=0.1)
model.fit(sepal_train, petal_length_train)

y_pred = model.predict(sepal_test)
score = model.score(sepal_test, petal_length_test)

print('MSE: ' + str(mean_squared_error(petal_length_test, y_pred)))
```

```
MSE: 0.45971785004541466
```


```
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, _ = load_iris(return_X_y=True)

sepal = X[:, :2]
petal_length = X[:, 2]

sepal_train, sepal_test, petal_length_train, petal_length_test = train_test_split(
    sepal, petal_length, test_size=0.33, random_state=0)

petal_length_train
```

```
array([3.9, 6.1, 4.7, 3.8, 4.9, 5.1, 4.5, 5. , 4.7, 5.2, 4.5, 1.6, 5.1,
       4.2, 3.6, 4. , 4.6, 6. , 1.5, 1.1, 5.3, 4.2, 1.7, 1.5, 4.9, 1.5,
       5.1, 3. , 1.4, 4.5, 6.1, 4.2, 1.4, 5.9, 5.7, 5.8, 5.6, 1.6, 1.6,
       5.1, 5.7, 1.3, 5.4, 1.4, 5. , 5.4, 1.3, 1.4, 5.8, 1.4, 1.3, 1.7,
       4. , 5.9, 6.6, 1.4, 1.5, 1.4, 4.5, 4.4, 1.2, 1.7, 4.3, 1.5, 6.9,
       3.3, 6.4, 4.4, 1.5, 4.8, 1.2, 6.7, 1.5, 1.6, 6.1, 1.4, 5.6, 4.1,
       3.9, 3.5, 5.3, 5.2, 4.9, 5. , 1.6, 3.7, 5.6, 5.1, 1.5, 4.6, 4.1,
       4.8, 4.4, 1.3, 1.5, 1.5, 5.6, 4.1, 6.7, 1.4])
```

## Decision tree regression

Similar to decision tree classification, a decision tree regressor is composed of nodes, branches, and decision leaves.
Decision tree regressors minimize different metrics because they deal with numeric targets, but they use the same algorithm as before. Decision trees are considered supervised learning methods because they predict response values by learning decision rules derived from the features.

Splitting the dataset on an attribute leads to standard deviation reduction. The steps for building a decision tree regressor are:

1. Calculate the standard deviation of the target.
2. Split the dataset on the different attributes. Calculate the standard deviation for each branch. Subtract the resulting standard deviation from the standard deviation before the split. The result is the standard deviation reduction.
3. Choose the attribute with the largest standard deviation reduction for the decision node. This attribute gives the most homogeneous branches.
4. Repeat steps 2 and 3 while calculating the coefficient of variation (CV).

The process stops when either of these conditions is met:
- The maximum count of iterations is reached.
- The CV in each attribute has become less than some threshold.
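
The standard deviation reduction used in steps 1 to 3 can be sketched as follows. This example assumes the branch standard deviations are combined as an average weighted by branch size, which is a common formulation but is not stated in the text above. The target values and the two candidate splits are hypothetical, and `sd_reduction` is an illustrative helper.

```python
from statistics import pstdev


def sd_reduction(targets, branches):
    """Standard deviation before the split minus the weighted branch deviation."""
    n = len(targets)
    # Assumption: each branch is weighted by its share of the samples.
    weighted_sd = sum(len(branch) / n * pstdev(branch) for branch in branches)
    return pstdev(targets) - weighted_sd


# Hypothetical target values (for example, house prices in thousands).
targets = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]

# Two hypothetical candidate attributes, each splitting the data into two branches.
split_a = [[10.0, 12.0, 11.0], [30.0, 32.0, 31.0]]  # separates cheap from expensive
split_b = [[10.0, 30.0, 11.0], [12.0, 32.0, 31.0]]  # mixes cheap and expensive

sdr_a = sd_reduction(targets, split_a)
sdr_b = sd_reduction(targets, split_b)

# Step 3: choose the attribute with the largest reduction.
best = "a" if sdr_a > sdr_b else "b"
print(sdr_a, sdr_b, best)
```

The split that groups similar target values together leaves little variation inside each branch, so it produces the larger reduction and is chosen for the decision node.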

### DT Regression, Boston

```
from sklearn.datasets import load_boston
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# 'mae' is the older criterion name; scikit-learn 1.0 renamed it to 'absolute_error'.
model = DecisionTreeRegressor(criterion='mae', random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
score = model.score(X_test, y_test)

print('MAE: ' + str(mean_absolute_error(y_test, y_pred)))
```

```
MAE: 3.2946107784431136
```


### RF Regression, Iris

```
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, _ = load_iris(return_X_y=True)

sepal = X[:, :2]
petal_length = X[:, 2]

sepal_train, sepal_test, petal_length_train, petal_length_test = train_test_split(
    sepal, petal_length, test_size=0.33, random_state=0)

# 'mse' is the older criterion name; scikit-learn 1.0 renamed it to 'squared_error'.
model = RandomForestRegressor(criterion="mse", random_state=0, n_estimators=25)
model.fit(sepal_train, petal_length_train)

y_pred = model.predict(sepal_test)
score = model.score(sepal_test, petal_length_test)

print('MAE: ' + str(mean_absolute_error(petal_length_test, y_pred)))
```

```
MAE: 0.4141514285714284
```


## KNN regression

In this section, you will learn about:
- The K-Nearest Neighbor (KNN) algorithm.
- The difference between KNN for solving classification and regression problems.
- Implementation of KNN for regression in Python.

As discussed in previous lessons, K-Nearest Neighbor classification is a non-parametric method that relies on the feature space to determine the output. A KNN model uses the k data points closest to the test point as references for prediction.
In classification, the output is a class membership. A new point is classified based on the votes of its k nearest neighbors.
Example:
If k = 1, the class of a new data point is simply the same class as the nearest data point.
However, what about regression? In regression, the output is a numeric value: the prediction for a new point is the average of the target values of its k nearest neighbors.
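
As a minimal sketch of KNN for regression, the function below finds the k training points closest to a query point and averages their target values. It assumes Euclidean distance and equal weights for all neighbors; the training data is hypothetical and `knn_predict` is an illustrative helper, not the Scikit-learn implementation.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict a numeric value as the mean target of the k nearest neighbors."""
    # Pair each training target with its Euclidean distance to the query point.
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k


# Hypothetical one-feature training set.
train_X = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [10.0, 20.0, 30.0, 100.0]

prediction_k3 = knn_predict(train_X, train_y, (2.0,), k=3)
prediction_k1 = knn_predict(train_X, train_y, (9.0,), k=1)
print(prediction_k3, prediction_k1)
```

With k = 1 the prediction simply copies the target of the single nearest point, mirroring the classification example above, while larger k smooths the prediction by averaging more neighbors.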

### Predicting housing prices using KNN

```
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

model = KNeighborsRegressor(n_neighbors=3)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
score = model.score(X_test, y_test)
print(score)
```

```
0.5673733201680796
```


### Predicting Iris using KNN

```
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, _ = load_iris(return_X_y=True)

sepal = X[:, :2]
petal_length = X[:, 2]

sepal_train, sepal_test, petal_length_train, petal_length_test = train_test_split(
    sepal, petal_length, test_size=0.33, random_state=0)

model = KNeighborsRegressor(n_neighbors=5)
model.fit(sepal_train, petal_length_train)

y_pred = model.predict(sepal_test)
score = model.score(sepal_test, petal_length_test)
print(score)
```

```
0.9271883296829704
```
