# Forecasting Non-Farm Payrolls With Linear Regression 

by [Manuel Perez Yllan](https://www.linkedin.com/in/mpyllan/)

<img width=80 src="Images/Assembler.png">

***

Linear regression is often overlooked because it is simple and assumes a linear relationship between inputs and output. Yet it remains a powerful tool for many challenging forecasting problems. Let's see how to forecast employment data from the United States using a simple linear regression in Python.

<img width=800 src="Images/linear-regression.png">


## What Are Non-Farm Payrolls?

Of all the indicators of labor market health in the United States, non-farm payrolls (NFP) is one of the most important and most closely followed. This monthly employment report, which excludes jobs in private households, NGOs, and the agricultural sector, offers a broad overview of the employment situation in the United States. The NFP report gives the net change over the previous month in the total number of paid employees in the U.S., excluding the sectors mentioned above.

By analyzing the data, economists, policymakers, and investors can gauge the health of the labor market, track employment trends, and make informed decisions about economic policy, investments, and hiring.

Let’s try to apply our model and see how it performs. We will evaluate our forecasts using the directional ratio and the RMSE. Here’s what they refer to:

- The directional ratio is a binary up-or-down measure: the number of forecasts that got the direction right (NFP going up versus NFP going down) divided by the total number of forecasts.


- The RMSE stands for root mean squared error. It is a measure of the average magnitude of the errors between predicted and actual values in a dataset.
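To make the two metrics concrete, here is a minimal sketch that computes both by hand. The numbers are toy values chosen only for illustration, not real NFP figures, and the variable names are our own.

```python
import numpy as np

# Toy values for illustration only (not real NFP figures)
y_true = np.array([150.0, -20.0, 210.0, 90.0])
y_hat = np.array([120.0, 30.0, 180.0, 100.0])

# RMSE: square root of the mean of the squared errors
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Directional ratio: percentage of forecasts whose sign matches the actual sign
directional_ratio = np.mean(np.sign(y_true) == np.sign(y_hat)) * 100

print(f"RMSE: {rmse:.2f}")                            # RMSE: 33.17
print(f"Directional ratio: {directional_ratio:.0f}%")  # Directional ratio: 75%
```

Only the second forecast has the wrong sign, so three out of four directions are correct.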

## Tasks

Linear regression draws the best-fitting straight line through a scatterplot of data points. 

1. Load the NFP data from the Excel file (NFP.xlsx). You can use the FRED API as well.

2. Split the data into training and test sets.

3. Fit the model using the last five NFP changes as features or signals. Predict the test set.

4. Evaluate and compare the performance of the predictions.
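As a small illustration of the "best-fitting straight line" idea, the sketch below fits a line to five made-up points by ordinary least squares using `np.linalg.lstsq`. The points are invented for the example, and this is only one of several ways to solve the problem.

```python
import numpy as np

# Made-up points that lie roughly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Design matrix: one column for the slope, one column of ones for the intercept
design = np.column_stack([x, np.ones_like(x)])

# Least squares picks the slope and intercept that minimize the sum of squared errors
solution = np.linalg.lstsq(design, y, rcond=None)[0]
slope, intercept = solution

print(f"slope={slope:.2f}, intercept={intercept:.2f}")  # slope=1.97, intercept=1.06
```

With five lagged NFP changes as inputs, the idea is the same, except that the "line" has five slopes instead of one.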

### Import Libraries

First, we need to install and import the required libraries:

In [None]:
!pip install scikit-learn
!pip install openpyxl
!pip install matplotlib

# Required libraries
import math
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

### Load and Transform the Data

In [None]:
# Load data from file
data = pd.read_excel('NFP.xlsx').values
# Is everything ok with the libraries?
print(data)

We have to flatten the data. We need a list of numbers, not a list of lists.

In [None]:
# Flatten the data 
data = np.reshape(data, (-1))
# What's np.reshape doing?
print(data)
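As a hint, here is what `np.reshape(..., (-1))` does on a tiny array. The `table` below is a hypothetical stand-in for what `pd.read_excel(...).values` might return (a two-dimensional array with a single column). The real file may look different.

```python
import numpy as np

# Hypothetical stand-in for a one-column spreadsheet loaded as a 2-D array
table = np.array([[100], [130], [125], [160]])

# -1 asks NumPy to infer the length, which yields a 1-D array
flat = np.reshape(table, (-1))

print(table.shape)  # (4, 1)
print(flat.shape)   # (4,)
print(flat)         # [100 130 125 160]
```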

Now we want the "derivative" of the data. That is, the discrete rate of change from one point to the next.

In [None]:
# What's diff doing? Why?
data = np.diff(data)
print(data)
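A toy example may help here. Assuming the input holds levels, `np.diff` returns the change from each point to the next, so the result is one element shorter than the input. The values below are made up.

```python
import numpy as np

# Made-up levels (illustration only)
levels = np.array([100, 130, 125, 160])

# Each output element is levels[i + 1] - levels[i]
changes = np.diff(levels)

print(changes)                    # [30 -5 35]
print(len(levels), len(changes))  # 4 3
```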

Do you truly understand the data? Can you envision a plot for it?

### Data Processing

Now we need to split the data into training and test sets. We'll train the LR model on the training set and measure its performance on the test set.
Let's define a function that builds the lagged features and splits the data:

In [None]:
def data_preprocessing(data, num_lags, train_test_split):
    """Build lagged features and split them chronologically into train and test sets."""
    # Each sample uses num_lags consecutive values to predict the value that follows them
    x = []
    y = []
    for i in range(len(data) - num_lags):
        x.append(data[i:i + num_lags])
        y.append(data[i + num_lags])
    # Convert the data to numpy arrays
    x = np.array(x)
    y = np.array(y)
    # Split the data into training and testing sets (no shuffling: the test set is the most recent data)
    split_index = int(train_test_split * len(x))
    _x_train = x[:split_index]
    _y_train = y[:split_index]
    _x_test = x[split_index:]
    _y_test = y[split_index:]
    return _x_train, _y_train, _x_test, _y_test

In [None]:
# Dataset preprocessing and splitting: 5 lags, 80% of the samples for training
x_train, y_train, x_test, y_test = data_preprocessing(data, 5, 0.80)
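To see what the lagged windows look like, here is a sketch on a tiny series with 3 lags. It uses `sliding_window_view` from NumPy (available since NumPy 1.20) as a vectorized alternative to the loop. This is our own illustration, not the function used in this notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(10)  # toy series: 0, 1, ..., 9
num_lags = 3

# Every window of 3 consecutive values; the last window is dropped
# because no value follows it that could serve as a target
windows = sliding_window_view(series, num_lags)[:-1]
targets = series[num_lags:]

print(windows[:2])  # [[0 1 2]
                    #  [1 2 3]]
print(targets[:2])  # [3 4]

# Chronological 80/20 split, as in the notebook
split_index = int(0.8 * len(windows))
print(len(windows[:split_index]), len(windows[split_index:]))  # 5 2
```

Each row of `windows` lines up with the entry of `targets` that immediately follows it in the series.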

In [None]:
print('Test set size: ', len(x_test))
print(x_test)

In [None]:
print('Test set target size:', len(y_test))
print(y_test)

### LR model

In [None]:
# Create a LR model
model = LinearRegression()

In [None]:
# Fit the model to the TRAIN!! data
model.fit(x_train, y_train)

In [None]:
# Predict on the TEST!! data.
y_pred = model.predict(x_test)
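It can help to look inside a fitted model. The sketch below uses synthetic data (not NFP) where the target is an exact linear function of two inputs, and checks that a prediction is just the inputs multiplied by `coef_` plus `intercept_`. The names `x_demo`, `y_demo`, and `demo_model` are our own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (not NFP): the target is an exact linear function of two inputs
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(50, 2))
y_demo = 0.5 * x_demo[:, 0] - 0.2 * x_demo[:, 1] + 3.0

demo_model = LinearRegression().fit(x_demo, y_demo)

# A prediction is the dot product of the inputs with coef_, plus intercept_
manual_pred = x_demo @ demo_model.coef_ + demo_model.intercept_

print(demo_model.coef_)       # close to [0.5, -0.2]
print(demo_model.intercept_)  # close to 3.0
print(np.allclose(manual_pred, demo_model.predict(x_demo)))
```

In the NFP model, `model.coef_` would hold five weights, one per lagged change.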

#### Results and Analysis

In [None]:
# Plot the last 50 predicted and true NFP changes from the test set
plt.plot(y_pred[-50:], label='Predicted Data', linestyle='--', marker='o')
plt.plot(y_test[-50:], label='True Data', marker='o')
plt.legend()
plt.grid()
# Zero line: points above it are months where NFP went up
plt.axhline(y=0, color='black', linestyle='--')
plt.show()

In [None]:
# RMSE calculation (scikit-learn expects the true values first, then the predictions)
rmse_test = math.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE of Test: {rmse_test}")
# Is this a good value for the error? Why?
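One possible way to put the RMSE in context is to compare it with a naive forecast. The sketch below defines a hypothetical helper, `naive_rmse`, that predicts each value as the most recent lag (a persistence forecast). It assumes the last column of the feature matrix is the most recent lag, as in `data_preprocessing`, and the demo numbers are made up.

```python
import numpy as np

def naive_rmse(x_test, y_test):
    """RMSE of a persistence forecast: predict the most recent lag (last column)."""
    persistence = x_test[:, -1]
    return float(np.sqrt(np.mean((y_test - persistence) ** 2)))

# Made-up demo values (illustration only)
x_demo = np.array([[10.0, 20.0, 30.0],
                   [20.0, 30.0, 25.0]])
y_demo = np.array([25.0, 40.0])

print(f"Naive RMSE: {naive_rmse(x_demo, y_demo):.2f}")  # Naive RMSE: 11.18
```

If the regression's RMSE is not clearly lower than this kind of baseline, the model may not be adding much.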

In [None]:
# Directional ratio calculation: percentage of predictions with the same sign as the true change
directional_ratio = np.sum(np.sign(y_pred) == np.sign(y_test)) / len(y_test) * 100
print('Directional Ratio = ', directional_ratio, '%')
# Is this a good ratio? Why?
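A similar sanity check can be sketched for the directional ratio: how often would a forecaster be right by always predicting "up"? The helper `always_up_ratio` is hypothetical and the demo values are made up.

```python
import numpy as np

def always_up_ratio(y_test):
    """Directional ratio (in %) of a forecast that always predicts a positive change."""
    return float(np.mean(np.sign(y_test) == 1) * 100)

# Made-up demo values (illustration only)
y_demo = np.array([150.0, -20.0, 210.0, 90.0])

print(f"Always-up ratio: {always_up_ratio(y_demo):.0f}%")  # Always-up ratio: 75%
```

If most changes in the test set are positive, a high directional ratio is easy to reach, so comparing against this baseline gives a fairer picture.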