# US Stock Market Prediction Analysis
### By: Wasinee 

![zaz4.png](attachment:zaz4.png)

## Shocking But True: 90% of People Lose Money In Stocks

Success in any financial market requires one to identify solid investments. When a stock or derivative is undervalued, it makes sense to buy. If it's overvalued, perhaps it's time to sell. While these financial decisions were historically made manually by professionals, technology has ushered in new opportunities for retail investors. Data scientists, specifically, may be interested in exploring quantitative trading, where decisions are executed programmatically based on predictions from trained models. This project compares model predictions against real future returns.

Both seasoned and new investors lose money in stocks at some point, and most will not succeed. Many investors rely on stock tips from friends, relatives, or colleagues when deciding which stocks to buy. Today there are also many self-proclaimed stock geniuses who tell people which stocks to buy with no data backing their statements. In this recommendation system, I created a way to predict stock prices using data and time series analysis, with an ***average residual error of about $0.04*** on real stock market values.

## 1. Data

I used data from Kaggle, one of the world's largest data science communities, which provides tools and resources. This dataset contains historical daily prices for all tickers currently trading on NASDAQ. The up-to-date list is available from nasdaqtrader.com. The historical data was retrieved from Yahoo Finance via the yfinance Python package and contains prices up to April 1, 2020. The link to the dataset is below:

[Kaggle Stocks Data](https://www.kaggle.com/datasets/jacksoncrow/stock-market-dataset?select=stocks)

## 2. Method

A popular statistical method for time series forecasting is the ARIMA (AutoRegressive Integrated Moving Average) model.

***AR***: Autoregression. A model that uses the dependent relationship between an observation and some number of lagged observations.

***I***: Integrated. The use of differencing of raw observations (e.g. subtracting an observation from an observation at the previous time step) in order to make the time series stationary.

***MA***: Moving Average. A model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.

The parameters of the ARIMA model are defined as follows:

***p***: The number of lag observations included in the model, also called the lag order.

***d***: The number of times that the raw observations are differenced, also called the degree of differencing.

***q***: The size of the moving average window, also called the order of moving average.
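To make the ***d*** parameter concrete, here is a minimal sketch of differencing in plain Python. The `difference` helper and the sample prices are hypothetical and made up for illustration; they are not taken from the project notebooks.

```python
def difference(series, d=1):
    """Apply first-order differencing d times (the 'I' in ARIMA)."""
    result = list(series)
    for _ in range(d):
        # Subtract each observation from the one at the next time step.
        result = [curr - prev for prev, curr in zip(result, result[1:])]
    return result


# Made-up prices with an accelerating upward trend.
prices = [10.0, 12.0, 15.0, 19.0, 24.0]

first_diff = difference(prices, d=1)   # day-to-day changes, still trending
second_diff = difference(prices, d=2)  # changes of the changes, now constant

print(first_diff)
print(second_diff)
```

Each round of differencing shortens the series by one observation, which is one reason to keep ***d*** as small as possible.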

## 3. Data Cleaning

[Data Cleaning Github](https://github.com/WasineeSi/Springboard/blob/386625949ab6bbf35408b086b7122db4c3118767/Capstone%202/Project/Data%20Wrangling.ipynb)

I checked the data: it has no categorical values and no null values, and everything was already in chronological order. There are outliers in almost every dataset; however, I decided to keep those values since they are important for stock prediction. Since the data was already fairly clean, there was only one minor problem to fix.

![Screenshot%20%28323%29.png](attachment:Screenshot%20%28323%29.png)

***Data Types***: The first step was to convert the 'Date' column to datetime and focus mainly on the 'Closing' values. The 'Closing' column was already a float, so it needed no conversion. Then I set the 'Date' column as the index, which makes plotting the data points easier later on.
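The same idea can be sketched with only the Python standard library. This is an illustration, not the notebook's code: the column names (`Date`, `Close`) and the sample rows are assumptions, and the `load_close_by_date` helper is hypothetical. In pandas, the equivalent steps would typically be `pd.to_datetime` followed by `DataFrame.set_index`.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample in an assumed CSV layout; the values are made up.
raw_csv = """Date,Open,High,Low,Close,Volume
2016-01-04,41.06,41.19,40.34,40.69,3287300
2016-01-05,40.73,40.95,40.34,40.55,2587200
2016-01-06,40.24,40.99,40.05,40.73,2103600
"""


def load_close_by_date(csv_text):
    """Parse dates and return a dict mapping date -> closing price (float)."""
    closes = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = datetime.strptime(row["Date"], "%Y-%m-%d").date()
        closes[day] = float(row["Close"])
    return closes


close_by_date = load_close_by_date(raw_csv)
for day, close in close_by_date.items():
    print(day.isoformat(), close)
```

Keying the prices by a real date object, rather than a string, is what makes later date-based slicing and plotting straightforward.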

## 4. Exploratory Data Analysis (EDA)

[EDA Github](https://github.com/WasineeSi/Springboard/blob/386625949ab6bbf35408b086b7122db4c3118767/Capstone%202/Project/EDA.ipynb)



The moving average (MA) is a simple technical analysis tool that smooths out price data by creating a constantly updated average price. The average is taken over a specific period of time, like 10 days, 20 minutes, 30 weeks, or any time period the trader chooses. This is a plot of the moving average of stock 'a'. The moving average looks reasonable, so we move on to the daily returns.

![Screenshot%20%28328%29.png](attachment:Screenshot%20%28328%29.png)
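As a rough illustration of how a moving average like the one plotted above is computed, here is a minimal sketch in plain Python. The `simple_moving_average` helper and the closing prices are hypothetical; the notebook may have used a different window or a library routine.

```python
def simple_moving_average(values, window):
    """Return the trailing mean for every position that has a full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]
        averages.append(sum(chunk) / window)
    return averages


# Made-up closing prices.
closes = [10.0, 11.0, 12.0, 13.0, 14.0]
sma_3 = simple_moving_average(closes, window=3)
print(sma_3)
```

A larger window gives a smoother line but reacts more slowly to recent price changes.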

The daily return measures the change in a stock's price as a percentage of the previous day's closing price. A positive return means the stock has grown in value, while a negative return means it has lost value. For stock 'a', the daily percentage gains and losses appear to be approximately normally distributed.

![Screenshot%20%28330%29.png](attachment:Screenshot%20%28330%29.png)
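The daily return described above can be sketched as follows. The `daily_returns` helper and the prices are made up for illustration and are not the values behind the histogram.

```python
def daily_returns(closes):
    """Fractional change of each close relative to the previous day's close."""
    return [(curr - prev) / prev for prev, curr in zip(closes, closes[1:])]


# Made-up closing prices: a gain of about 2% followed by a loss of about 2%.
closes = [100.0, 102.0, 99.96]
returns = daily_returns(closes)

for r in returns:
    print(f"{r:+.2%}")
```

Because each return is relative to the previous close, the series has one fewer element than the price series.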

Over a period of 1 year, if one had invested 1 dollar in January 2016, one could have made a profit of around 15 cents.

![Screenshot%20%28331%29.png](attachment:Screenshot%20%28331%29.png)
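The growth of a single dollar is obtained by compounding the daily returns. Below is a minimal sketch with a hypothetical `growth_of_one_dollar` helper and made-up returns; it only illustrates the arithmetic and does not reproduce the plot above.

```python
def growth_of_one_dollar(returns):
    """Compound a sequence of fractional returns, starting from $1."""
    value = 1.0
    path = []
    for r in returns:
        value *= 1.0 + r
        path.append(value)
    return path


# Made-up period returns: +10%, -5%, +10%.
returns = [0.10, -0.05, 0.10]
path = growth_of_one_dollar(returns)

print(f"final value: ${path[-1]:.4f}")
print(f"profit: ${path[-1] - 1.0:.4f}")
```

Note that returns compound multiplicatively, so simply adding the percentages would slightly misstate the final value.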

The daily percentage change compares a stock's value with its value on the previous day. The larger the change, whether positive or negative, the more volatile the stock is. Therefore, if we plot the volatility of each stock, the one with the widest graph has the highest volatility. Higher volatility indicates that more risk is involved, which means you can have a higher return or a higher loss when investing in that particular stock. Of the 10 stocks we are evaluating, 'aal' seems to be the most volatile.

![Screenshot%20%28332%29.png](attachment:Screenshot%20%28332%29.png)
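One common way to put a number on the "width" of the return distribution is the standard deviation of the daily returns. The sketch below assumes that definition of volatility; the `volatility` helper and both price series are hypothetical and are not the 10 stocks from the project.

```python
import statistics


def volatility(closes):
    """Sample standard deviation of the daily fractional returns."""
    returns = [(curr - prev) / prev for prev, curr in zip(closes, closes[1:])]
    return statistics.stdev(returns)


# Two made-up price series: one calm, one with large daily swings.
calm = [100.0, 100.5, 100.2, 100.8, 100.6]
jumpy = [100.0, 108.0, 97.0, 109.0, 95.0]

print(f"calm:  {volatility(calm):.4f}")
print(f"jumpy: {volatility(jumpy):.4f}")
```

The series with the larger standard deviation is the one whose return histogram would look wider.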

## 5. Which Dataset to choose?

After graphing the rolling mean and standard deviation of all 10 datasets and running the Dickey-Fuller test, which tests whether a series is stationary or nonstationary, I found that data5, the 'aal' data, is already stationary. The differencing step in ARIMA exists to handle nonstationary series, so a stationary series does not need it (it would be modeled with d = 0), and I set data5 aside. Any of the remaining datasets can be used with the ARIMA model, and I chose data0, the 'a' stock dataset, which the Dickey-Fuller test indicated is nonstationary.
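The rolling mean and standard deviation check can be sketched in plain Python as below. This is only the informal, visual part of the analysis: it does not implement the Dickey-Fuller test itself, for which a library routine such as `adfuller` in statsmodels is the usual tool. The `rolling_stats` helper and both series are hypothetical.

```python
import statistics


def rolling_stats(values, window):
    """Return (mean, stdev) for each full trailing window."""
    stats = []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]
        stats.append((statistics.mean(chunk), statistics.stdev(chunk)))
    return stats


# Made-up series: one with a steady trend, one oscillating around a fixed level.
trending = [float(i) for i in range(1, 13)]
oscillating = [5.0, 6.0] * 6

trend_means = [mean for mean, _ in rolling_stats(trending, window=4)]
flat_means = [mean for mean, _ in rolling_stats(oscillating, window=4)]

print("trending rolling mean:   ", trend_means)
print("oscillating rolling mean:", flat_means)
```

A rolling mean that drifts over time, as in the trending series, is a hint of nonstationarity, while a rolling mean that stays level is consistent with a stationary series.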

## 6. Algorithms, Machine Learning, and Predictions

[Training and Preprocessing Github](https://github.com/WasineeSi/Springboard/blob/386625949ab6bbf35408b086b7122db4c3118767/Capstone%202/Project/Training%20and%20Preprocessing.ipynb)

[Modeling Github](https://github.com/WasineeSi/Springboard/blob/386625949ab6bbf35408b086b7122db4c3118767/Capstone%202/Project/Modeling.ipynb)

I chose to work with ARIMA for training my recommendation system. I ran over 4000 data points from stock 'a' through 3 different machine learning models: Linear Regression, Decision Tree Regression, and ARIMA. Linear Regression was the least accurate, followed by Decision Tree Regression, and ARIMA was the most accurate. It should be noted that ARIMA, although the most accurate, is also the most complex and takes the longest to run.

I chose root mean square error (RMSE) as the accuracy metric over mean absolute error (MAE) because the errors are squared before they are averaged, which gives large errors a higher weight. Thus, the RMSE is useful when large errors are undesirable. The RMSE is the square root of the mean of the squared residuals (the differences between actual and predicted values), so the smaller the RMSE, the more accurate the prediction.
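The difference between the two metrics shows up in a small, made-up example. The `mae` and `rmse` helpers below are written out by hand for illustration; the numbers are not results from the project.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error: square, average, then take the square root."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


actual = [10.0, 10.0, 10.0, 10.0]
even_errors = [11.0, 9.0, 11.0, 9.0]      # four errors of size 1
one_big_error = [10.0, 10.0, 10.0, 14.0]  # a single error of size 4

print("MAE: ", mae(actual, even_errors), mae(actual, one_big_error))
print("RMSE:", rmse(actual, even_errors), rmse(actual, one_big_error))
```

Both sets of predictions have the same MAE, but the RMSE is twice as large for the set with one big miss, which is the behavior that motivated choosing RMSE.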

![Screenshot%20%28333%29.png](attachment:Screenshot%20%28333%29.png)

![Screenshot%20%28334%29.png](attachment:Screenshot%20%28334%29.png)

![Screenshot%20%28335%29.png](attachment:Screenshot%20%28335%29.png)

The ARIMA predictions most closely match the actual values, so much so that you can barely see a difference between the predicted and the actual values.

In conclusion, the average residual is 0.03804182375937516 for data0, the 'a' stock. This is pretty good since it shows that the predictions are, on average, only about 4 cents off from the actual stock prices. Positive and negative residuals can offset each other in that average, so the RMSE is the stricter measure of typical error. I used an 80/20 train/test split and fit the model on the training data. The ARIMA model has a test RMSE of 0.564.
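For reference, here is a minimal sketch of a chronological 80/20 split and of how an average residual is computed. The helpers are hypothetical, the prices are made up, and the forecast is a naive "repeat the previous value" placeholder rather than ARIMA, so none of the numbers correspond to the results above.

```python
def chronological_split(series, train_fraction=0.8):
    """Split a time series in order, without shuffling."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def mean_residual(actual, predicted):
    """Average of (actual - predicted); signs are kept, so errors can cancel."""
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)


# Made-up prices rising by 1 each day.
prices = [float(i) for i in range(1, 11)]
train, test = chronological_split(prices)

# Naive placeholder forecast (NOT ARIMA): predict the previous actual value.
naive_forecast = train[-1:] + test[:-1]

print("train size:", len(train), "test size:", len(test))
print("mean residual:", mean_residual(test, naive_forecast))
print("offsetting errors:", mean_residual([10.0, 10.0], [12.0, 8.0]))
```

The last line shows why the average residual is best read together with the RMSE: two misses of 2 dollars in opposite directions average out to zero.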

## 8. Future Improvements

In future work, I would run the remaining loaded datasets, find the best order for each using auto_arima, and then plug that order into the ARIMA model to predict other stock prices. The recommendation system's run time could also be improved, especially for auto_arima and ARIMA(), which took a long time to run. I would also like to use more recent stock data so that the model is more relevant, and more data overall to make the model more accurate.

## 9. Credits

I would like to thank my Springboard Mentor Ricardo for all of the advice and suggestions, Kaggle for their data, and George Box and Gwilym Jenkins who developed the ARIMA model.