In [19]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import scipy.stats as stats
from sklearn.model_selection import train_test_split

# Testing the Johnny Appleseed model just to play with it a little bit

In [None]:
# Example of linear regression, fit by hand with gradient descent
# (the sklearn model is instantiated here but not used in this cell)
model = LinearRegression()

# X = [distance_to_water, soil_quality, current_population]
X = np.array([
    [2,7,50],
    [5,5,30],
    [1,9,60],
    [8,4,20],
    [30,1,5]
])

# Target: Expected population in 5 years
y = np.array([100, 80, 150, 60, 10])

# Number of training examples and features
n_samples, n_features = X.shape
X.shape

# weights and bias
weights = np.zeros(n_features)
bias = 1
learning_rate = 0.001
epochs = 1000

# Gradient Descent
for epoch in range(epochs):
    y_predicted = np.dot(X, weights) + bias

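    # gradients of the mean squared error w.r.t. weights and bias
    # (the constant factor of 2 is absorbed into the learning rate)
    dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
    db = (1 / n_samples) * np.sum(y_predicted - y)

    # gradient descent update of weights and bias
    weights -= learning_rate * dw
    bias -= learning_rate * db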


# Test data
X_new = np.array([
    [4,8,35],
    [6,5,25],
    [30,2,5]
])
new_pred = np.dot(X_new, weights) + bias

print("Weights: ", weights)
print("Bias: ", bias)
print("Predicted population growth for new areas: ", new_pred)

for new in new_pred:
    if new < 50:
        print(
            f"Johnny Appleseed will not plant in this area. Population pred: {np.round(new, 1)}"
        )
    else:
        print(f"He would have planted here. Population pred: {np.round(new, 1)}")

<class 'numpy.ndarray'>
Weights:  [-0.08108529  4.44161259  1.66889345]
Bias:  1.9071065862991954
Predicted population growth for new areas:  [95.52693676 65.35099395 16.70224028]
He would have planted here. Population pred: 95.5
He would have planted here. Population pred: 65.4
Johnny Appleseed will not plant in this area. Population pred: 16.7


# Beginning of my own implementation

In [31]:
# Load dataset
filename = "dataset/survey_results_public.csv"
df = pd.read_csv(filename, low_memory=False)

# Encode categorical data
education_encoding_map = {
    "I never completed any formal education": 0,
    "Primary/elementary school": 1,
    "Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)": 2,
    "Some college/university study without earning a degree": 3,
    "Associate degree": 4,
    "Bachelor's degree (BA, BS, B.Eng., etc.)": 5,
    "Master's degree (MA, MS, M.Eng., MBA, etc.)": 6,
    "Professional degree (JD, MD, etc.)": 7,
    "Other doctoral degree (Ph.D, Ed.D., etc.)": 8
}
years_coding_encoding_map = {
    '0-2 years': 0,
    '3-5 years': 1,
    '6-8 years': 2,
    '9-11 years': 3,
    '12-14 years': 4,
    '15-17 years': 5,
    '18-20 years': 6,
    '21-23 years': 7,
    '24-26 years': 8,
    '27-29 years': 9,
    '30 or more years': 10
}
hours_computer_encoding_mapping = {
    'Less than 1 hour': 0,
    '1 - 4 hours': 1,
    '5 - 8 hours': 2,
    '9 - 12 hours': 3,
    'Over 12 hours': 4
}
df['FormalEducation'] = df['FormalEducation'].map(education_encoding_map)
df["YearsCoding"] = df["YearsCoding"].map(years_coding_encoding_map)
df["HoursComputer"] = df["HoursComputer"].map(hours_computer_encoding_mapping)

# Filter columns and remove NULLS
cols = ["FormalEducation", "YearsCoding", "HoursComputer", "ConvertedSalary"]
df_encoded = df[cols]
df_encoded = df_encoded.dropna()

In [39]:
# Get X and y variables
X = df_encoded.drop("ConvertedSalary", axis=1)
y = df_encoded["ConvertedSalary"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)

# Convert to numpy arrays 
X_train_np = X_train.to_numpy()
X_test_np = X_test.to_numpy()
y_train_np = y_train.to_numpy()
y_test_np = y_test.to_numpy()

# samples and features
n_samples, n_features = X_train_np.shape

In [40]:
# Initialize weights/bias
weights = np.zeros(n_features)
bias = 1
learning_rate = 0.001
epochs = 5000

In [41]:
# Gradient Descent
for epoch in range(epochs):
    y_predicted = np.dot(X_train_np, weights) + bias

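    # gradients of the mean squared error w.r.t. weights and bias
    # (the constant factor of 2 is absorbed into the learning rate)
    dw = (1 / n_samples) * np.dot(X_train_np.T, (y_predicted - y_train_np))
    db = (1 / n_samples) * np.sum(y_predicted - y_train_np)

    # gradient descent update of weights and bias
    weights -= learning_rate * dw
    bias -= learning_rate * db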

In [42]:
y_test_pred = np.dot(X_test_np, weights) + bias
mse_test = np.mean((y_test_pred - y_test_np) ** 2)
print("MSE: ", mse_test)

print("Weights: ", weights)
print("Bias: ", bias)
print("New y predictions: ", y_test_pred)

MSE:  39698088820.43863
Weights:  [ 4335.44745063 10431.98916744  9056.27473919]
Bias:  13913.755939945793
New y predictions:  [40697.20031958 55464.63693765 95816.87917917 ... 82425.15698935
 52504.90391528 94441.16475091]


#### Here we can see the weight learned for each feature, along with the bias, in my simple model. I will work to add more features in the future, but for now, of the 3 independent variables I included (Formal Education, Years Coding, and Hours/Day on a Computer), formal education surprisingly has the smallest weight when it comes to predicting salary. This goes against my original hypothesis that it would be very relevant. Instead, experience seems to be the biggest contributor to "ConvertedSalary": that feature's weight is roughly 2.4 times greater than education's. Hours/Day is also weighted quite strongly, which is less surprising to me. One caveat: the three features are ordinal codes on different scales (0-8, 0-10, and 0-4), so each weight is a change in salary per encoded step, and the raw weights are not directly comparable measures of importance.
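
#### One way to make weights easier to compare is to standardize each feature before fitting. The block below is only a sketch under stated assumptions: the data is synthetic (it is not the survey data), the "true" coefficients 3000, 5000, and 4000 are made up for illustration, and the fit uses `np.linalg.lstsq` instead of the gradient descent loop above.

```python
import numpy as np

# Hypothetical synthetic data standing in for the three encoded columns.
# These numbers are illustrative only and are NOT taken from the survey.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.integers(0, 9, n),    # education code, 0..8
    rng.integers(0, 11, n),   # years-coding bucket, 0..10
    rng.integers(0, 5, n),    # hours-on-computer bucket, 0..4
]).astype(float)
y = 20000 + 3000 * X[:, 0] + 5000 * X[:, 1] + 4000 * X[:, 2] + rng.normal(0, 1000, n)

# Standardize each column to zero mean and unit variance.
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma

# Least-squares fit with an explicit intercept column.
A = np.column_stack([np.ones(n), X_std])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, std_weights = coef[0], coef[1:]

# Each standardized weight is the change in y per one standard deviation
# of that feature, which puts the three features on a common footing.
print("Standardized weights:", std_weights)
print("Back on the raw scale:", std_weights / sigma)
```

#### Dividing the standardized weights by `sigma` recovers the per-step weights on the original scale, so nothing is lost by fitting on standardized features.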

# Conceptual Questions

## Question 1

#### RSS (Residual Sum of Squares) is the total squared difference between the actual data points and the model's predicted values. RSE (Residual Standard Error) is the square root of RSS divided by the residual degrees of freedom, so it acts as a standardized measure of how well the regression line fits the data. In short, RSS is a raw sum, while RSE is roughly the typical size of the error per data point.

#### RSS is given by: $$ \text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$ where $n$ is the number of data points, $y_i$ is the observed response for the $i$th data point, and $\hat{y}_i$ is the model's predicted value for that point.

#### RSE is given by: $$ \text{RSE} = \sqrt{ \frac{1}{n - p} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 } $$ where $p$ is the number of fitted parameters (including the intercept). This is an estimate of the standard deviation of the model's residuals, measuring how far the residuals tend to deviate from 0.

#### RSS is often used as a building block for other metrics. For example, R^2 depends on RSS. RSE, on the other hand, is a more intuitive measure that is usually read directly as a measure of fit. RSE is in the same units as the output variable, whereas RSS is in those units squared. So while they both measure error, RSS measures the overall size of the errors while RSE transforms that sum into a standard deviation of the residuals.
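
#### As a small worked example of the two formulas, the sketch below computes RSS and RSE with NumPy. The helper names `rss` and `rse` and the five data points are made up for illustration, and `p=2` assumes a model with one slope and one intercept.

```python
import numpy as np

def rss(y, y_hat):
    """Residual sum of squares: sum of squared (observed - predicted)."""
    residuals = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(np.sum(residuals ** 2))

def rse(y, y_hat, p):
    """Residual standard error with p fitted parameters (intercept included)."""
    n = len(y)
    return float(np.sqrt(rss(y, y_hat) / (n - p)))

# Illustrative values only.
y_obs = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_fit = np.array([2.0, 5.0, 8.0, 9.0, 12.0])

rss_value = rss(y_obs, y_fit)        # 1 + 0 + 1 + 0 + 1 = 3
rse_value = rse(y_obs, y_fit, p=2)   # sqrt(3 / (5 - 2)) = 1

print("RSS:", rss_value)
print("RSE:", rse_value)
```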

## Question 2

#### A loss function quantifies the discrepancy between the model's predictions and the actual values. Training then minimizes this function with respect to the model's parameters.

#### Mean Squared Error (MSE) is a specific type of loss function seen often in regression models. Its equation is $$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2 $$

#### MSE is mainly used for regression problems, with an emphasis on penalizing large errors: because it squares the residuals, bigger mistakes are weighted much more heavily. A lower MSE means the predictions are closer to the actual values on average.

#### In general, a loss function tells you how poorly a model is doing. The primary objective of training is to reduce this function; if the model is doing 'less bad', it must be improving. In summary, MSE is one of the simplest and most common loss functions for regression.
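
#### To make "training reduces the loss" concrete, here is a minimal sketch that tracks MSE while gradient descent runs on a toy problem. The data (points on the line y = 2x + 1), the learning rate, and the epoch count are assumptions chosen for illustration. This version keeps the factor of 2 from differentiating the square; the loops in the code cells above drop it, which only rescales the learning rate.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error between observed and predicted values."""
    return float(np.mean((y - y_hat) ** 2))

# Toy data lying exactly on y = 2x + 1 (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

w = np.zeros(1)
b = 0.0
lr = 0.05
losses = []

for _ in range(500):
    y_hat = X @ w + b
    losses.append(mse(y, y_hat))

    # Gradient of MSE with respect to w and b.
    error = y_hat - y
    dw = (2 / len(y)) * (X.T @ error)
    db = (2 / len(y)) * np.sum(error)

    w -= lr * dw
    b -= lr * db

print("First loss:", losses[0])
print("Last loss:", losses[-1])
print("Learned w, b:", w, b)
```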

## Question 3

#### To the untrained eye, linear regression may seem straightforward: it models the relationship between a dependent variable Y and one or more independent variables X, using a simple linear equation that looks strikingly similar to the first equation many of us learned: y=mx+b. However, the complexity in linear regression arises from a number of factors. One is all the assumptions you have to make. Linear regression relies on key assumptions like linearity, independence of the errors, constant variance of the errors, and normality of the errors. Many models are also prone to multicollinearity, which occurs when predictors are highly correlated with each other. This inflates the variance of the estimated coefficients, which makes interpreting the model very difficult. Even with simple models, finding a good balance so that the model is neither overfitting nor underfitting is very important. Understanding these complexities on a fundamental level is essential for effectively applying regression techniques.
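
#### Multicollinearity can be checked numerically. The sketch below uses synthetic predictors (not the survey data) where `x2` is built as a noisy copy of `x1`, then looks at the correlation matrix and at a variance inflation factor (VIF). The helper `vif` is a hypothetical name written for this example; it regresses one column on the others and returns 1 / (1 - R^2).

```python
import numpy as np

# Synthetic predictors, for illustration only.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)                  # unrelated predictor
X = np.column_stack([x1, x2, x3])

# Pairwise correlations between the columns.
corr = np.corrcoef(X, rowvar=False)

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the others."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    r2 = 1 - resid.var() / target.var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]

print("Correlation matrix:\n", np.round(corr, 3))
print("VIFs:", np.round(vifs, 2))
```

#### A VIF near 1 means a predictor is not explained by the others, while a large VIF signals that its coefficient will be unstable.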

## Question 4

#### There are several techniques and strategies you can employ to determine if a model is a good or bad fit. One that comes to mind is R^2, which measures the proportion of variance in the dependent variable explained by the independent variables. For a least-squares fit with an intercept, R^2 on the training data ranges from 0 to 1, and it serves as a key metric of how well the model fits. The closer R^2 is to 1, the better the fit of the model. However, one thing to keep in mind is that R^2 can be high even if the model has been overfit to the data, so it cannot be the only method used to assess the quality of the model. Another method is plotting the residuals against the fitted values and making sure they are randomly scattered around zero, rather than showing clear patterns or clustering. You can also use error metrics like Mean Squared Error (MSE). As I discussed previously, a lower MSE is more desirable and indicates a better fit, ideally when measured on held-out data.
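
#### R^2 can be computed directly as 1 - RSS / TSS, where TSS is the total sum of squares around the mean of y. The sketch below uses made-up values and a hypothetical helper `r_squared`. It also shows that predictions which are not a least-squares fit to the same data (for example, predictions on a test set) can give an R^2 below 0.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1 - ss_res / ss_tot)

# Illustrative values only.
y_obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
good_fit = np.array([1.1, 1.9, 3.2, 3.9, 5.0])   # close to the data
mean_only = np.full_like(y_obs, y_obs.mean())    # always predicts the mean
bad_fit = np.array([5.0, 4.0, 3.0, 2.0, 1.0])    # worse than the mean

r2_good = r_squared(y_obs, good_fit)
r2_mean = r_squared(y_obs, mean_only)
r2_bad = r_squared(y_obs, bad_fit)

# Residuals of the good fit; in a notebook these could be plotted against
# the fitted values to check for patterns.
residuals = y_obs - good_fit

print("R^2 (good fit):", r2_good)
print("R^2 (mean only):", r2_mean)
print("R^2 (bad fit):", r2_bad)
```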