# Implementation from Scratch

<br />

I am going to implement algorithms by using the least kinds of libraries such as Numpy possible.

## [Task 1] Create a Class of Linear Regression from Scratch

<br />

I will create a class of linear regression and incorporate it to the pipeline of regressions on the "sprint2" directory.

### Hypothesis Function

<br />

I implement the following hypothesis function of linear regression.

$$
h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_j x_j + \theta_n x_n \ \ \ (x_0=1)
$$

$x$: feature vector

$\theta$: parameter vector

$n$: the number of features

$x_j$: jth feature vector

$\theta_j$: jth parameter(weight) vector

I will implement the hypothesis function that can apply to any $n$, the number of features.

<br />

In addition, the following equation is the vector format.

$$
h_\theta(x) = \theta^T x
$$

### Objective Function

<br />


I will implement the following objective function of linear regression. This is the MSE, mean square error divided by 2 to use the steepest descent method easily.

$$
J(\theta) = \frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2
$$

$m$: the number of data input

$h_\theta()$: hypothesis function

$x^{(i)}$: feature vector of ith sample

$y^{(i)}$: correct values of ith sample

### Steepest Descent Method

<br />

I will fit datasets by steepest descent method. The following equation is to update the jth parameter.

$$
\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^m[(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}]
$$

$\alpha$: learning rate

$i$: index of a sample

$j$: index of a feature

In [1]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [21]:
# Create a class of linear regression from scratch

class ScratchLinearRegression():
    """
    Implementation of linear regression from scratch
    
    Parameters
    ----------
    num_iter: int
        The number of iteration
    
    lr: float
        Learning rate
    
    no_bias: bool
        True if not input the bias term
    
    verbose: bool
        True if output the learning process
    
    
    Attributes
    ----------
    self.coef_: ndarray whose shape is (n_features,)
        parameters
    
    self.loss: ndarray whose shape is (self.iter,)
        records of loss on train dataset
    
    self.val_loss: ndarray whose shape is (self.iter,)
        records of loss on validation dataset
    """
    
    def __init__(self, num_iter, lr, bias, verbose):
        # Record hyperparameters as attribute
        self.iter = num_iter
        self.lr = lr
        self.bias = bias
        self.verbose = verbose
        
        # Prepare arrays for recording loss
        self.loss = np.zeros(self.iter)
        self.val_loss = np.zeros(self.iter)
    
    
    def fit(self, X, y, X_val=None, y_val=None):
        """
        Fit linear regression. In a case of inputting validation dataset, return loss and accuracy of 
        the data per iteration.
        
        Parameters
        ----------
        X: ndarray whose shape is (n_samples,n_features)
            Features of train dataset
        
        y: ndarray whose shape is (n_samples,)
            Correct values of train dataset
        
        X_val: ndarray whose shape is (n_samples,n_features)
            Features of validation dataset
        
        y_val: ndarray whose shape is (n_samples,)
            Correct values of validation dataset
        """
        
        # Set a parameter randomly and transform it
        self.coef_ = np.random.randn(X.shape[1])
        self.coef_ = self.coef_.reshape(len(self.coef_),1)
        ###print("fit-1",self.coef_.shape)   # (2, 1)
        
        # Transform dataframes to let features rows
        X = X.T
        y = y.T
        if (X_val is not None) and (y_val is not None):
            X_val = X.T
            y_val = y_val.T
        
        if self.bias == True:
            bias = np.array([1 for _ in range(X.shape[1])])
            X = np.vstack((bias, X))
            bias = np.array([1 for _ in range(y.shape[1])])
            y = np.vstack((bias, y))
        
        # Update the theta and get loss of train dataset
        for i in range(self.iter):
            # Update the parameter
            self.coef_ = self._gradient_descent(X, y)
            print("fit-2",self.coef_.shape)
            # Compute the mean square mean
            mse = self._compute_cost(X, y)
            print("fit-3",mse.shape)
            # Record the errors
            self.loss[i] = mse
            # Return the loss if verbose is True
            if self.verbose:
                print(self.loss[i])
            
            # Get loss of validation datasets
            if (X_val is not None) and (y_val is not None):
                # Get the mean square error
                val_mse = self._compute_cost(X_test, y_val)
                # Record the errors
                self.val_loss[i] = val_mse
                # Return the loss if verbose is True
                if self.verbose:
                    print(self.val_loss[i])
    
    
    def predict(self, X):
        """
        Predict by using linear regression
        
        Parameters
        ----------
        X: ndarray whose shape is (n_samples,n_features)
            Samples
        
        
        Returns
        ----------
        ndarray whose shape is (n_samples,1)
            Results of the prediction by using linear regression
        """
        
        if self.bias == True:
            bias = np.array([1 for _ in range(X.shape[1])])
            X = np.vstack((bias, X))
        
        # Predict train dataset
        y_pred = np.dot(self.coef_.T, X)   # (1, 4) * (4, 120)
        
        return y_pred
    
    
    # Create a definition of hypothesis function of lunear regression
    def _linear_hypothesis(self, X):
        """
        Return hypothesis function of linear regression
        
        Parameters
        ----------
        X: ndarray whose shape is (n_samples,n_features)
            Train dataset
        
        Returns
        ----------
        ndarray whose shape is (n_samples,1)
            Results of the prediction by hypothesis function of linear regression
        """
        
        print("_linear_hypothesis-1",self.coef_.shape)
        print("_linear_hypothesis-2",X.shape)
        
        # Compute the hypothesis function
        y_pred = np.dot(self.coef_.T, X)   # (1, 4) * (4, 120)
        print("_linear_hypothesis-3",y_pred.shape)
         
        return y_pred
    
    
    # Create a definition to compute the mean square error
    def _compute_cost(self, X, y):
        """
        Compute the mean square error. Import the "MSE" definition.

        Parameters
        ----------
        X: ndarray whose shape is (n_samples,n_features)
            train dataset

        y: ndarray whose shape is (n_samples,1)
            correct value


        Returns
        ----------
        ndarray whose shape is (1,)
            mean square error
        """

        y_pred = self._linear_hypothesis(X)
    
        return self.MSE(y_pred, y)
    
    
    # Create a definition of the mean square error
    def MSE(self, y_pred, y):
        """
        Return the mean square error
        
        Parameters
        ----------
        y_pred: ndarray whose shape is (n_samples,)
            predited value
        
        y: ndarray whose shape is (n_samples,)
            correct value
        
        
        Returns
        ----------
        mse: numpy.float
            mean square error
        """
        
        print("MSE-1",y_pred.shape)
        print("MSE-2",y.shape)
        
        # Compute an error
        error = y_pred - y
        
        # Sum errors
        sum_errors = np.sum(error**2,axis=1) / error.shape[0]
        
        # Return the mean square error devided by 2
        return sum_errors / (2*y.shape[1])
    
    
    # Create a definition to fit datasets by steepest descent method
    def _gradient_descent(self, X, y):
        """
        Fit datasets by steepest descent method
        
        Parameters
        ----------
        X: ndarray whose shape is (n_samples,n_features)
            train dataset
        
        y: ndarray whose shape is (n_samples,1)
            correct value
        
        
        Returns
        ----------
        ndarray whose shape is (1,)
            parameter(weight)
        """
        
        print(X.reshape(2,1168))   # (2, 1)
        
        # Predict train dataset
        y_pred = np.dot(self.coef_.T, X)   # (1, 4) * (2, 1)
        print("_gradient_decsent-1",y_pred.shape)
        print(y.shape)
        
        # Compute the error and the mean square error
        error = y_pred - y   # (1, 120)
        print("_gradient_decsent-2",error.shape)
        
        # Compute the gradient
        grad = np.dot(X, error.T)   # (4, 120) * (120, 1)
        print("_gradient_decsent-3",grad.shape)
        
        print("_gradient_decsent-4",y.shape)
        print("_gradient_decsent-5",self.coef_.shape)
        
        # Update the parameter
        return self.coef_ - self.lr*grad/y.shape[1]
    
    
    # Plot learning records
    def plot_learning_record(self):
        plt.plot(self.loss)
        plt.plot(self.val_loss)
        
        plt.title("Learning Records")
        plt.xlabel("Number of Iterrations")
        plt.ylabel("Loss")
        
        plt.show()

#### Validate the Class

<br />

I am going to validate the class by using the "House Prices: Advanced Regression Techniques" datasets on Kaggle.

In [11]:
# Prepare a dataset for the validation

# Import the dataset
train = pd.read_csv('"House Prices- Advanced Regression Techniques".train.csv')
test = pd.read_csv('"House Prices- Advanced Regression Techniques".test.csv')

# Split the datasets into explanatory and objective variables
X = train.loc[:,["GrLivArea", "YearBuilt"]].values
X = X.reshape()

y = train.SalePrice.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

#Xy = pd.concat([X,y], axis=1)

#df = Xy[Xy.Species!=0]

# create and save a csv file of the dataframe
#df.to_csv('iris_dataset.csv')

In [12]:
# Standardize the dataset

# Initialize the class
scaler = StandardScaler()

# Fit the dataset
scaler.fit(X_train)

# Transfer the datasets
scaler.transform(X_train, X_test)



array([[-0.40709315, -0.45546896],
       [ 0.08317013,  0.71860895],
       [-1.39525026, -1.98829291],
       ...,
       [-1.26553079, -0.52069551],
       [-0.19343756, -1.72738671],
       [ 0.0526479 ,  1.17519481]])

In [22]:
# Initialize the class

slr = ScratchLinearRegression(num_iter=1000, lr=0.01, bias=False, verbose=True)

In [23]:
slr.fit(X_train, y_train, X_test, y_test)

[[1314 1571  796 ...  864 1426 1555]
 [1957 1993 1910 ... 1955 1918 2007]]
_gradient_decsent-1 (1, 1168)
(1168,)
_gradient_decsent-2 (1, 1168)
_gradient_decsent-3 (2, 1)
_gradient_decsent-4 (1168,)
_gradient_decsent-5 (2, 1)


IndexError: tuple index out of range

In [None]:
X_test = X_test.T
slr.predict(X_test)

# Validation

## [Task 2] Plot Learning Curve

<br />

I am going to create a definition of drawing a plot of learning curves to validate the "ScratchLinearRegression" class.

In [None]:
slr.plot_learning_record()