# Linear Regression

I am going to try to code a linear regression model in Java. I'll have an AI chatbot generate some mock data for training and testing. We will start with a simple single-variable regression, and if that goes well, I will try to code a multi-variable linear regression model.

## Short Description of Linear Regression Mathematics

Linear regression aims to establish a linear relationship between the independent variable(s) and the dependent variable. The goal of linear regression is to minimize the sum of the squared differences (errors) between the observed values (actual values) and the values predicted by the model.

## Simple Linear Regression Model (Single Variable)

The simple linear regression model can be represented by the equation: y = mx + c
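
For reference, the least-squares objective and the closed-form solution that the code below computes can be written out as follows. The intercept is written as c, as in the code, and the notation (x_i, y_i) for the i-th (commits, grade) pair is introduced here only for illustration.

```latex
\[
\min_{m,\,c} \sum_{i=1}^{n} \bigl(y_i - (m x_i + c)\bigr)^2
\]
\[
m = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2},
\qquad
c = \frac{\sum y_i - m \sum x_i}{n}
\]
```

The first expression for m matches the return statement of `calculateSlope`, and the expression for c matches `calculateIntercept`.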

In [None]:
import java.util.ArrayList;
import java.util.Arrays;

public class SingleVarCommitsGradeRegression {

    public static void main(String[] args) {
        // Mock data
        ArrayList<Double> commitsData = new ArrayList<>(Arrays.asList(5.0, 10.0, 15.0, 20.0, 25.0)); // Number of commits
        ArrayList<Double> gradeData = new ArrayList<>(Arrays.asList(70.0, 75.0, 80.0, 85.0, 90.0)); // Corresponding grades

        double m = calculateSlope(commitsData, gradeData);
        double c = calculateIntercept(commitsData, gradeData, m);

        System.out.println("Linear Regression Model for Grade based on Commits: Grade = " + m + " * Commits + " + c);
    }

    public static double calculateSlope(ArrayList<Double> commits, ArrayList<Double> grades) {
        int n = commits.size();
        double sumCommits = 0, sumGrades = 0, sumCommitsGrades = 0, sumCommits2 = 0;

        for (int i = 0; i < n; i++) {
            sumCommits += commits.get(i);
            sumGrades += grades.get(i);
            sumCommitsGrades += commits.get(i) * grades.get(i);
            sumCommits2 += commits.get(i) * commits.get(i);
        }

        return (n * sumCommitsGrades - sumCommits * sumGrades) / (n * sumCommits2 - sumCommits * sumCommits);
    }

    public static double calculateIntercept(ArrayList<Double> commits, ArrayList<Double> grades, double m) {
        int n = commits.size();
        double sumGrades = 0, sumCommits = 0;

        for (int i = 0; i < n; i++) {
            sumGrades += grades.get(i);
            sumCommits += commits.get(i);
        }

        return (sumGrades - m * sumCommits) / n;
    }
}

SingleVarCommitsGradeRegression.main(null);

In our context, we're trying to predict a student's grade (the dependent variable) from the number of commits they've made (the independent variable). Mathematically, this relationship is Grade = m × Commits + c, where m is the slope and c is the y-intercept. The slope m indicates how much the grade changes for each additional commit, while c represents the predicted grade when there are no commits. The code calculates m and c using the method of least squares, which minimizes the sum of the squared differences between the observed grades and the grades predicted by the model. Once we have m and c, we can predict the grade for any given number of commits.
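
As a small illustration of that last step, here is a minimal sketch of fitting and then predicting. The class name `CommitsGradePredictor` and its `fit` and `predict` methods are hypothetical names introduced for this example, not part of the code above. The fit uses the same least-squares formulas and the same mock data.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration; not part of the original notebook code.
class CommitsGradePredictor {

    // Least-squares fit of y = m * x + c; returns {m, c}.
    static double[] fit(List<Double> x, List<Double> y) {
        int n = x.size();
        double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0;
        for (int i = 0; i < n; i++) {
            sumX += x.get(i);
            sumY += y.get(i);
            sumXY += x.get(i) * y.get(i);
            sumX2 += x.get(i) * x.get(i);
        }
        double m = (n * sumXY - sumX * sumY) / (n * sumX2 - sumX * sumX);
        double c = (sumY - m * sumX) / n;
        return new double[] {m, c};
    }

    // Applies the fitted line to a new commit count.
    static double predict(double m, double c, double commits) {
        return m * commits + c;
    }

    public static void main(String[] args) {
        List<Double> commits = Arrays.asList(5.0, 10.0, 15.0, 20.0, 25.0);
        List<Double> grades = Arrays.asList(70.0, 75.0, 80.0, 85.0, 90.0);

        double[] model = fit(commits, grades); // {1.0, 65.0} for this mock data
        // prints "Predicted grade for 30 commits: 95.0"
        System.out.println("Predicted grade for 30 commits: " + predict(model[0], model[1], 30.0));
    }
}
```

Note that 30 commits lies outside the range of the mock data (5 to 25), so this is an extrapolation. Nothing in the model caps the grade at 100, so a large enough commit count would predict a grade above it.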

In [1]:
%jars /home/vishnuaa77/vscode/vishnu/lib/commons-math3-3.6.1.jar

In [4]:
import org.apache.commons.math3.linear.*;

import java.util.Arrays;

public class MultiVariableLinearRegression {

    public static void main(String[] args) {
        // Mock data representing GitHub analytics for each student
        double[][] xData = {
            {10, 2, 500, 100},  // {Commits, Repositories Contributed To, Additions, Deletions} for student 1
            {15, 3, 700, 150},
            {12, 1, 650, 120},
            {8,  2, 400, 80},
            {20, 4, 900, 180}
        };
        double[] yData = {85, 90, 87, 80, 95};  // Observed grades for the students (the values the model is fit to)

        // Calculate coefficients
        double[] coefficients = calculateCoefficients(xData, yData);

        System.out.println("Coefficients: " + Arrays.toString(coefficients));
    }

    public static double[] calculateCoefficients(double[][] xData, double[] yData) {
        int n = xData.length;
        int m = xData[0].length;

        // Construct matrix X and vector Y
        RealMatrix X = new Array2DRowRealMatrix(n, m + 1);
        RealVector Y = new ArrayRealVector(yData, false);

        for (int i = 0; i < n; i++) {
            X.setEntry(i, 0, 1);  // Bias term
            for (int j = 0; j < m; j++) {
                X.setEntry(i, j + 1, xData[i][j]);
            }
        }

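        // Calculate coefficients using the regularized normal equation:
        // B = (X^T * X + lambda*I)^(-1) * X^T * Y
        RealMatrix Xt = X.transpose();
        RealMatrix XtX = Xt.multiply(X);

        // Add the regularization term. With lambda > 0 this is ridge regression rather than
        // plain least squares; it keeps X^T * X invertible when there are few data points.
        // Note that the identity matrix covers the bias column too, so the bias is also penalized.
        double lambda = 0.01;  // Regularization parameter
        RealMatrix identity = MatrixUtils.createRealIdentityMatrix(m + 1);
        XtX = XtX.add(identity.scalarMultiply(lambda));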

        RealMatrix XtXInverse = new LUDecomposition(XtX).getSolver().getInverse();
        RealVector XtY = Xt.operate(Y);

        RealVector B = XtXInverse.operate(XtY);

        return B.toArray();
    }
}

MultiVariableLinearRegression.main(null);


Coefficients: [22.8704682225341, -23.00370260421539, 22.727831352114663, 0.47012450471473244, 0.10164040857773671]


This took quite a while to code. I had to import the Apache Commons Math 3 library so that I could do some of these linear algebra operations.
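
In matrix notation, the formula from the code comment reads as below. Here X is the data matrix with a leading column of ones, Y is the vector of grades, λ is the regularization parameter (0.01 in the code), and I is the identity matrix. With λ > 0 this is usually called ridge regression; setting λ = 0 gives the ordinary least-squares normal equation.

```latex
\[
B = \left(X^{\top} X + \lambda I\right)^{-1} X^{\top} Y
\]
```

Because the identity matrix in the code has size m + 1, the bias term is penalized along with the feature weights.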

I'll now break down the code.

Mock Data:

- xData: This 2D array represents the independent variables for each student. Each row corresponds to one student's GitHub analytics, and the columns represent:
  - Commits
  - Repositories Contributed To
  - Additions
  - Deletions
- yData: This array represents the dependent variable, which is the observed grade for each student. The model is fit so that its predictions from GitHub activity match these grades as closely as possible.
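
To connect the coefficient array back to a prediction: `calculateCoefficients` puts the bias in index 0 and the feature weights after it, in the same order as the `xData` columns. Below is a minimal sketch of applying such an array to one student's features. The class `MultiVarPredictor` and its `predict` method are hypothetical names introduced here, and the numbers in `main` are made up for illustration rather than taken from the fitted model.

```java
// Hypothetical helper for illustration; not part of the original notebook code.
class MultiVarPredictor {

    // coefficients[0] is the bias; coefficients[j + 1] multiplies features[j].
    static double predict(double[] coefficients, double[] features) {
        if (coefficients.length != features.length + 1) {
            throw new IllegalArgumentException(
                "Expected one coefficient per feature plus a bias term");
        }
        double result = coefficients[0];
        for (int j = 0; j < features.length; j++) {
            result += coefficients[j + 1] * features[j];
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up coefficients: {bias, Commits, Repositories, Additions, Deletions}
        double[] coefficients = {50.0, 1.0, 2.0, 0.01, 0.05};
        // One student's features, in the same column order as xData
        double[] features = {10, 2, 500, 100};
        System.out.println("Predicted grade: " + predict(coefficients, features));
    }
}
```

In the notebook, the array returned by `calculateCoefficients` could be passed in place of the made-up coefficients.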