# Final Report

## Problem Statement

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Presently, the population of the United States, like that of the world, is increasing rapidly. The world’s population has increased by 1 billion in just over 11 years, with no sign of slowing down. As the population continues to grow, the rate at which it grows will increase as well. As the population increases, however, it becomes much harder for the government to provide for every person. This raises questions such as how many jobs need to be created and how much food needs to be produced and imported. We can see the effects in the current housing crisis, caused by a lack of inventory, and in food shortages caused by production and supply chain issues. Using population data from the US Census Bureau from 1952 to 2019, I will create a model that can accurately predict the population of the United States through 2030. Using that model, we can make predictions about the number of jobs the country will need, the amount of housing we will have to build, the amount of food we will have to produce, and so on. Additionally, it will be very useful to see how this increase in population will affect GDP per capita, since we would like to know how this rapid growth will affect the nation economically. I will use 1952-2019 GDP per capita data from FRED (Federal Reserve Economic Data) to make predictions about how the upcoming change in population will affect our economy.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Using the data from the US Census Bureau and FRED, I trained models to predict the growth of the population and GDP per capita respectively. Using these models, we will be able to predict the direction in which the population is trending, as well as whether GDP per capita shows any signs of growing more slowly as time goes on.

## Data Wrangling

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Data wrangling went very smoothly for these datasets. I expected the data to be relatively clean going into this phase, since it came from very reputable sources. Initially, after importing both CSV files, I dropped all extra columns so that each dataframe had only a date column and a value column. One thing that I wanted to make sure of early was that the dtypes of my values were correct: I wanted the date column to be a datetime object and the value column to be an integer. Next, I wanted the same column names in each dataframe, so I renamed the date and value columns in the GDP dataframe to match the 'date' and 'value' column names of the population dataframe. It was also important that the dates matched up. The GDP per capita data was measured at the start of every quarter, while the population data was measured monthly, so I filtered the population dataframe to include only values from January, April, July, and October. Additionally, the population data only ran from 1952 through the end of 2019, so I trimmed the GDP per capita dataframe to that same date range.
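
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The date alignment described above can be sketched without any third-party library. This is only an illustration of the filtering logic: the report itself worked with dataframes, and the helper name `align_to_quarters`, the `(date, value)` tuple layout, and the sample numbers are all assumptions made for this example.

```python
from datetime import date

# Months in which the quarterly GDP per capita series is reported.
QUARTER_START_MONTHS = {1, 4, 7, 10}


def align_to_quarters(rows, start_year=1952, end_year=2019):
    """Keep only quarter-start observations inside the shared date range.

    `rows` is a list of (date, value) tuples; this layout is hypothetical.
    """
    return [
        (d, v)
        for d, v in rows
        if d.month in QUARTER_START_MONTHS and start_year <= d.year <= end_year
    ]


# Made-up monthly values, used only to show the filtering.
monthly = [(date(2019, m, 1), 1000 + m) for m in range(1, 13)]
monthly.append((date(2020, 1, 1), 1013))  # outside the 1952-2019 range

quarterly = align_to_quarters(monthly)
print([d.isoformat() for d, _ in quarterly])
```

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Applying the same year bounds to both series is what leaves them with matching dates.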

## Detrending for Stationarity

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Below is the recorded data of both the population and the GDP per capita in the United States from the beginning of 1952 through the end of 2019, as well as the output using statsmodels' STL function on each dataset.

![dw_pop.png](attachment:dw_pop.png)

![stl_pop.png](attachment:stl_pop.png)

![dw_gdp.png](attachment:dw_gdp.png)

![stl_gdp.png](attachment:stl_gdp.png)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; When working with time series data such as this, several very useful models require the data to be stationary, so it is important to detrend the data first. In this case, we have 2 datasets where the mean increases over time, which makes both datasets non-stationary. Neither dataset shows strong signs of seasonality or excessive noise, so I decided to difference both datasets to detrend them.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; For the population data, I had initially taken one difference, which returned a high enough p-value using the KPSS test to call the data stationary. However, when I plotted the differenced data, it looked like a sine graph with a k value of 700. It just didn't sit right with me, because I feel that, given time, that data will begin to show seasonality. I decided to take a second difference of this data. Both graphs are attached below.

![diff_pop1.png](attachment:diff_pop1.png)

![diff_pop2.png](attachment:diff_pop2.png)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; For the GDP per capita data, taking only the first difference was enough to make the data stationary. The graph of the data differenced once is shown below.

![diff_gdp.png](attachment:diff_gdp.png)
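
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Differencing, as used in this section, replaces each value with its change from the previous value. The following is a minimal sketch of that idea, using a made-up series rather than the census or FRED data; the helper name `difference` is hypothetical.

```python
def difference(series, d=1):
    """Apply first-order differencing d times; each pass shortens the series by one."""
    out = list(series)
    for _ in range(d):
        out = [curr - prev for prev, curr in zip(out, out[1:])]
    return out


# Made-up series with a curved (quadratic) upward trend: t**2 + 100.
levels = [t * t + 100 for t in range(8)]

first = difference(levels, d=1)   # still trending upward
second = difference(levels, d=2)  # constant, so the trend is gone

print(first)
print(second)
```

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; In this toy series, one difference still leaves an upward trend, while the second difference is flat. Real data is noisier, which is why a test such as KPSS is used alongside the plots.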

## Data Analysis

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; With our recorded data, we can see that the population grows from around 156 million to nearly 330 million in the 68 years that we are analyzing. On the other hand, the United States' GDP has grown as well, but not enough. The United States GDP per capita has increased from 16,299 Chained 2012 Dollars (CUSD) in 1952 to just over 58,000 CUSD in 2019. That is about 3.6 times the 1952 value, or an increase of roughly 256%, over those 68 years. This number has been adjusted for inflation, using Chained 2012 Dollars to keep amounts relative to the 2012 USD value and to allow accurate comparison through time.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; For the most part, the population data stays fairly linear throughout time. Even with certain jumps or dips through this time frame, the population trends steadily upwards. We can see in the STL output that there is a very small amount of seasonality to the population's growth throughout time. It is noteworthy that this seasonality is very small, and in general insignificant to the stationarity of the data. 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The GDP per capita shows us a little more about what has been going on in the United States through this time. We can see a few jumps and dips in the GDP per capita in this time frame. Some of those include the dip around 1975 signalling the end of the Vietnam War, the upward spike in the mid-1980s as a result of Ronald Reagan's deficit spending caused by tax cuts, and of course the 2008 recession caused by hedge fund failures and poor real estate investments by banks. It would be unreasonable to assume that shocks like these cannot happen again. For example, my model, trained on data through 2019, will not predict the COVID-19 pandemic and subsequent recession.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; While the population and GDP per capita predictions might not be exact for each quarter throughout the 2020s, the goal is to see where the economy is trending by the end of 2029. With that data, economists can determine the best course of action for the United States over the upcoming decade.

## Machine Learning Models

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; In terms of machine learning, I decided to use 4 different models on each of my 2 datasets. I used an ARIMA model, a Linear Regression model, a Linear Support Vector Regression (SVR) model, and an XGBoost Regression model. I trained the models on the entire dataset, as I didn't feel comfortable with the number of data points I would have been using if I had split the data into a training and a testing set. I also felt that for accuracy of the model, the predictions needed to begin exactly where the recorded data left off. Training on the full dataset puts us at risk of overfitting, since there is no held-out data to check against, but both datasets are fairly linear so I was willing to take the risk.

### ARIMA Models

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; An ARIMA (autoregressive integrated moving average) model is a time series model that uses past values and past forecast errors to predict future values. For each ARIMA model, I needed to find a p, d, and q value to determine the model's order. In an ARIMA model, the p value is the number of autoregressive terms, the d value is the number of nonseasonal differences needed for stationarity, and the q value is the number of lagged forecast errors in the prediction equation. We already had a good idea of our d value for each, as that was the number of differences we took when detrending for stationarity. I began with the population data.
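
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; For reference, the standard textbook form of an ARIMA(p, d, q) model can be written with the lag operator $L$, where $L y_t = y_{t-1}$. The symbols $\phi_i$, $\theta_j$, $c$, and $\varepsilon_t$ are the usual notation and are not taken from the report itself:

$$
\left(1 - \sum_{i=1}^{p} \phi_i L^i\right)(1 - L)^d \, y_t = c + \left(1 + \sum_{j=1}^{q} \theta_j L^j\right)\varepsilon_t
$$

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Here $(1 - L)^d$ differences the series $d$ times, the $\phi_i$ are the autoregressive coefficients applied to past values, and the $\theta_j$ are the moving average coefficients applied to past forecast errors $\varepsilon_t$.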

#### Population Data

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; I began by using statsmodels' `plot_acf` function to plot the autocorrelation function as well as the data at 0, 1, and 2 differences. Below are the results.

![arima_acf_pop.png](attachment:arima_acf_pop.png)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Here, as before, I was torn between 1 and 2 differences for stationarity. While the data at 2 differences is clearly stationary, there could be overfitting present. I decided that it would be best to use a function to determine the best parameters based on the AIC and BIC values for given p, d, and q values. AIC and BIC estimate the relative quality of each model, rewarding goodness of fit while penalizing the extra parameters that lead to overfitting. I used p and q values from 0 to 5, once with a d value of 1 and once with a d value of 2. Below are the top 5 parameter sets returned by the function, first with 1 difference and then with 2 differences.

![aicbic_pop1.png](attachment:aicbic_pop1.png)

![aicbic_pop2.png](attachment:aicbic_pop2.png)

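&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; When comparing AIC and BIC values, lower values indicate a better balance between fit and model complexity. We got much lower AIC and BIC values with a d value of 2, so the ARIMA model for the population data had an order of (3, 2, 4). One caveat is that AIC and BIC are not strictly comparable across different d values, because differencing changes the data that the likelihood is computed on.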
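
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; For reference, the usual definitions of the two criteria are given below. The symbols are standard notation rather than anything taken from the report: $k$ is the number of estimated parameters, $n$ is the number of observations, and $\hat{\mathcal{L}}$ is the maximized likelihood of the model.

$$
\mathrm{AIC} = 2k - 2\ln\hat{\mathcal{L}}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{\mathcal{L}}
$$

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Both criteria reward a higher likelihood and penalize additional parameters, and the BIC penalty grows with the number of observations.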

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; With this order, we could now create our model. When doing this, I used the .summary() method to print a summary of our model. That summary is pictured below.

![arima_summary_pop.png](attachment:arima_summary_pop.png)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Here, since the Ljung-Box p-value is above .05, the null hypothesis is not rejected. The null hypothesis of the Ljung-Box test is that the residuals are uncorrelated, so we have no significant evidence of autocorrelation left in the residuals. Additionally, we reject the null hypothesis of the heteroskedasticity test. That null hypothesis is that the variance of the residuals is constant, so rejecting it indicates that the variance of the residuals changes over time. Neither result points to conclusive seasonality in the population data.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Finally, I plotted a comparison between our model and the recorded data, and I found the root mean squared error (RMSE) of those values to be 35.98. Though potentially overfitted, that is a very accurate model. That comparison is shown below.

![arima_comparison_pop.png](attachment:arima_comparison_pop.png)

#### GDP per Capita Data

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; I repeated the same process with the GDP per capita data as I did with the population data. Below are the autocorrelation function plotted with the data at 0 and 1 differences, as well as the top 5 parameter sets by AIC and BIC.

![arima_acf_gdp.png](attachment:arima_acf_gdp.png)

![aicbic_gdp.png](attachment:aicbic_gdp.png)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; When looking at this autocorrelation function, we can see pretty clearly that this data reaches stationarity at one difference. There is a considerable amount of noise in the plot of the first difference's data, but not enough to make the data non-stationary. I used the same function as with the population data to find the best AIC and BIC values. We got our lowest AIC and BIC values when using the order (3, 1, 1), so that will be the order for our model. The model summary and then the comparison are shown below.

![arima_summary_gdp.png](attachment:arima_summary_gdp.png)

![arima_comparison_gdp.png](attachment:arima_comparison_gdp.png)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Here, we reject the null hypothesis for both the Ljung-Box test and the heteroskedasticity test. Rejecting the Ljung-Box null hypothesis (that the residuals are uncorrelated) suggests that some autocorrelation remains in the residuals, and rejecting the heteroskedasticity null hypothesis (that the residual variance is constant) suggests that the variance of the residuals changes over time.

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Using the same function as with the population data, I plotted the comparison between the recorded data and our model. The RMSE here was 245.92, showing a less accurate model than with the population data. We might have expected this model to have a smaller RMSE, since the range of values is much smaller, but that is not the case here.

### Linear Regression Models

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; A Linear Regression model uses previous data to create a line of best fit. For this, more linear data will create a more accurate model. In this case, we created a model trained on the actual data for the full date range. I did not use the differenced data because the Linear Regression model would not be able to make an accurate prediction based on the stationary data. I began by writing a function that would take two inputs, X and y. That function would create a Linear Regression model trained on the inputted values. The function would use the model and the X values to predict y values, and plot a graph with the recorded and predicted data on top of one another. Finally, this function would print an RMSE for the comparison.
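
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The fit-and-score part of such a function can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the function used in the report: it fits a single-feature least squares line by hand, skips the plotting step, and uses made-up data. The names `fit_line` and `rmse` are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


# Made-up data: the line y = 3x + 10 with small alternating bumps.
x = list(range(10))
y = [3 * xi + 10 + (1 if xi % 2 else -1) for xi in x]

slope, intercept = fit_line(x, y)
fitted = [slope * xi + intercept for xi in x]
print(round(slope, 3), round(intercept, 3), round(rmse(y, fitted), 3))
```

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Because RMSE is expressed in the same units as y, it is easiest to interpret when the models being compared are scored on the same scale.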

#### Population Data

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The population data was the main reason for trying this regressor, as it looks very linear. Given a line of best fit, there is an RMSE of 2500.04 between the recorded data and the predicted data. This model isn't nearly as accurate as the ARIMA model, so I will not be using it for any forecasting. Though the model will not be useful for future predictions, it lets us see certain time periods where population growth either increased or decreased. There was a considerable drop in population growth through the 70s and 80s, relative to the high population growth of the baby boom through the 60s and to the higher population growth seen more recently due to immigration. William H. Frey wrote more about this in December of 2021; I would suggest reading [his piece for Brookings](https://www.brookings.edu/research/u-s-population-growth-has-nearly-flatlined-new-census-data-shows/) for more information.

![linreg_pop.png](attachment:linreg_pop.png)

#### GDP per Capita Data

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; With our GDP data, I was not expecting much from the Linear Regression model. With an RMSE of 1439.24, nearly 6 times higher than that of the ARIMA model, I do not plan on using either Linear Regression model for forecasting. On the other hand, we can make some really clear observations based on this chart. For example, from around 1960 all the way up to the turn of the century, the recorded data generally sat below the line of best fit. After the terrorist attacks of September 11th, 2001, when the United States went to war in Afghanistan and soon Iraq, there was a sharp turnaround in the GDP per capita. As described [here](https://watson.brown.edu/costsofwar/costs/economic/economy) by the Watson Institute for International and Public Affairs at Brown University, this large rise in GDP per capita can be seen as a result of massive deficit spending. While the GDP has greatly increased, the national debt has increased even more.

![linreg_gdp.png](attachment:linreg_gdp.png)

### Linear SVR Model

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The Linear Support Vector Regression (SVR) model is very similar to the Linear Regression model used previously. While the Linear Regression model minimizes the squared residuals when creating a line of best fit, SVR [is said to](https://towardsdatascience.com/support-vector-regression-svr-one-of-the-most-flexible-yet-robust-prediction-algorithms-4d25fbdaca60) also model non-linear relationships very well, although that flexibility comes from non-linear kernels and a Linear SVR still fits a straight line. An SVR model has a margin (usually called epsilon): if a data point is within that margin of the line, its error is counted as zero. Chapter 3 in Anh Le My Phung's [Comparison of Support Vector Regression and Neural Networks](https://scse.d.umn.edu/sites/scse.d.umn.edu/files/comparison_of_support_vector_regression_and_neural_network_final.pdf) has a very good explanation of how an SVR works. I used the same function as with the Linear Regression model, as both would seek to create the model, make the predictions, and plot the comparison. The only difference in the function was that instead of initializing a Linear Regression model I initialized a LinearSVR model with certain parameters. My goal with this model was to learn more about how a Linear SVR model works, and see if I could find it more useful than a Linear Regression model in this instance.
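
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The idea that errors inside the margin count as zero is known as the epsilon-insensitive loss. A minimal sketch of it is shown below. The helper name and the sample values are made up for illustration, and this shows only the per-point loss, not the full SVR fitting procedure.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Per-point SVR-style loss: zero inside the epsilon margin, linear outside it."""
    return [max(0.0, abs(a - p) - epsilon) for a, p in zip(actual, predicted)]


# Made-up values: the first two errors are within the margin of 1.0.
actual = [10.0, 12.0, 15.0]
predicted = [10.5, 11.0, 18.0]

print(epsilon_insensitive_loss(actual, predicted, epsilon=1.0))  # → [0.0, 0.0, 2.0]
```

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Only the third point contributes to the loss, and only by the amount its error exceeds the margin. Ordinary least squares, by contrast, penalizes every non-zero residual.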

#### Population Data

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; In the case of our population data, we see predicted values that are almost identical to the line of best fit shown using the Linear Regression model. The Linear SVR model yielded an RMSE of 3018 for the population data, as opposed to the 2500 from the Linear Regression model. In this case, Linear Regression is a better option than the Linear SVR model, but we still have the GDP per capita data to look at.

![lsvr_pop.png](attachment:lsvr_pop.png)

#### GDP per Capita Data

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Here, our Linear SVR model has an RMSE of 2082. Comparing the RMSE values from our Linear SVR models with those from our Linear Regression models, the population data showed an increase of just over 500 and the GDP per capita data showed an increase of around 640. In both cases, the Linear Regression model was more accurate, so these Linear SVR models will not be used for forecasting. Since this data is so linear, a Linear Regression model ends up working really well. The Linear SVR model would be better suited to independent and identically distributed (IID) data, which is not the case for this data at all. Since the population and GDP per capita are increasing or decreasing over time, each value is in no way independent of the previous one.

![lsvr_gdp.png](attachment:lsvr_gdp.png)

### XGBoost Regression Model

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The XGBoost Regression model is an ensemble of gradient boosted decision trees. An XGBoost Regression model builds a certain number of decision trees in sequence (determined by the n_estimators parameter), with each new tree aiming to lower the remaining residuals of the predictions made so far by minimizing a convex loss function, similarly to the Linear SVR. Ultimately, this model is unable to forecast values outside the range it was trained on. Since the population and GDP per capita are very much expected to be outside of their 1952-2019 ranges in 2029, we need to make our predictions based on the stationary data. I once again used the same function for this comparison, except I added in a 'diffs' input, indicating the number of times the data had to be differenced to reach stationarity. That is necessary to keep the X and y variables the same length when fitting the data and plotting.
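
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Tree-based models predict with averages of the training targets that fall in each leaf, which is why they cannot continue a trend past the training range. The sketch below illustrates this with a single hand-written regression stump (a tree with one split). It is not XGBoost itself, and the helper name `fit_stump` and the data are made up for the example.

```python
def fit_stump(x, y):
    """Fit a depth-1 regression tree: one split, each side predicts its mean."""
    best = None
    pairs = sorted(zip(x, y))
    for i in range(1, len(pairs)):
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Sum of squared errors if we split between position i - 1 and i.
        sse = sum((v - left_mean) ** 2 for v in left) + sum(
            (v - right_mean) ** 2 for v in right
        )
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda value: left_mean if value <= threshold else right_mean


# Made-up upward-trending series: y = 10 * x for x = 0..9.
x = list(range(10))
y = [10 * xi for xi in x]

predict = fit_stump(x, y)
# Far beyond the training range, the prediction is still a mean of seen targets.
print(predict(9), predict(100), max(y))
```

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The prediction at x = 100 is the same as at x = 9 and never exceeds the largest training target. A boosted ensemble adds many such piecewise-constant trees together, so it flattens out beyond the training data in the same way.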

#### Population Data

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; For the population data, we can see the XGBoost Regressor doing a very good job keeping up with the recorded data. It seems that the model was not able to fully capture how much the change in population varies, so its predictions show slightly quieter noise. This comparison yielded an RMSE of 29.54, beating out the ARIMA model for the lowest RMSE so far. Be that as it may, we have to pay attention to the range of values we see here. With the XGBoost Regressor being the only model we used the stationary data for, we should expect this to have the lowest RMSE by far. These values, ranging from around -300 to around 200, should have a much lower RMSE than ones that range over hundreds of thousands. I will keep an eye on this model, but for the population data I will be choosing the ARIMA model as my best model.

![xgb_pop.png](attachment:xgb_pop.png)

#### GDP per Capita Data

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; With the GDP per capita data, we run into the same issue to a lesser extent. The original data for GDP per capita ranges from around 17000 to around 60000, so stationary data from -1000 to 1000 is not that far off. Furthermore, the XGBoost Regressor has the lowest RMSE by a considerable margin. Compared to the next best RMSE from the ARIMA model at 245.92, here we only see an RMSE of 38.98. I still believe that the ARIMA model will be the best model for making out of range forecasting predictions, but I am curious to see how similarly this model performs. I choose the ARIMA model as my best model.

![xgb_gdp.png](attachment:xgb_gdp.png)

### My Best Model

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 

#### Analysis

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 

## Conclusion

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 

## Further Insight

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 