![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fdata-viz-of-the-week&branch=main&subPath=stock-prices/stock-prices-ML.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Callysto’s Weekly Data Visualization

## Stock Prices ML

### Recommended Grade levels: 
<br>

### Instructions

Click "Cell" and select "Run All".

This will import the data and run all the code, so you can see this week's data visualization. Scroll back to the top after you’ve run the cells.

![instructions](https://github.com/callysto/data-viz-of-the-week/blob/main/images/instructions.png?raw=true)

**You don't need to do any coding to view the visualizations**.

The plots generated in this notebook are interactive. You can hover over and click on elements to see more information. 

Email contact@callysto.ca if you experience issues.

### About this Notebook

Callysto's Weekly Data Visualization is a learning resource that aims to develop data literacy skills. We provide Grades 5-12 teachers and students with a data visualization, like a graph, to interpret. This companion resource walks learners through how the data visualization is created and interpreted by a data scientist. 

The steps of the data analysis process are listed below and applied to each weekly topic.

1. Question - What are we trying to answer?
2. Gather - Find the data source(s) you will need. 
3. Organize - Arrange the data, so that you can easily explore it. 
4. Explore - Examine the data to look for evidence to answer the question. This includes creating visualizations. 
5. Interpret - Describe what's happening in the data visualization. 
6. Communicate - Explain how the evidence answers the question. 

## Question

How can we use machine learning on historical stock data to predict future stock prices?

### Goal

We will use machine learning to see whether we can build models that help predict stock prices.

### Background

Learning about stocks at a young age builds a foundation for long-term financial stability. It helps people make informed decisions, develop savings habits, and benefit from the compounding returns of early investments.

## Gather

Stock prices in this notebook are obtained using the Python library [yfinance](https://pypi.org/project/yfinance/), and stock symbols and names are obtained from the [Nasdaq](https://www.nasdaq.com/market-activity/stocks/screener) stock screener.

### **Disclaimer**

This notebook is **strictly** for educational and informational purposes and does not constitute financial advice, recommendation, or endorsement. The content presented here is not intended to influence any investment decisions, and readers are strongly advised to seek independent financial advice and conduct their own research before making any investment choices. The authors and contributors of this notebook shall not be held responsible for any financial losses or decisions made based on the information provided.

### Code: 

Run the code cells below to import the libraries we need for this project. Libraries are pre-made code that make it easier to analyze our data.

In [None]:
import pandas as pd
import plotly.express as px

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

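# Install xgboost only if it is not already available
try:
    from xgboost import XGBClassifier
except ImportError:
    !pip install xgboost
    from xgboost import XGBClassifier

# Install yfinance only if it is not already available
try:
    import yfinance as yf
except ImportError:
    !pip install yfinance
    import yfinance as yf

print("Libraries imported.")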

In [None]:
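# Download the daily price history of the S&P 500 index (^GSPC).
# period="max" requests the full history explicitly, since the default period can differ between yfinance versions.
data = yf.download("^GSPC", period="max")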
data

In [None]:
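# "Tomorrow" holds the next trading day's closing price
data["Tomorrow"] = data["Close"].shift(-1)
# "Target" is 1 if the price closes higher the next day, otherwise 0
data["Target"] = (data["Tomorrow"] > data["Close"]).astype(int)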
data
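
To see what the `shift(-1)` step does, here is a small sketch on made-up prices (the numbers and the name `toy` are invented for illustration and are not real market data). Each row's `Tomorrow` is the next row's `Close`, and `Target` is 1 when the price rises the next day.

```python
import pandas as pd

# Made-up closing prices, for illustration only
toy = pd.DataFrame({"Close": [100.0, 102.0, 101.0, 105.0]})

# Pull the next row's Close up into the current row
toy["Tomorrow"] = toy["Close"].shift(-1)

# 1 if tomorrow's close is higher than today's, otherwise 0
toy["Target"] = (toy["Tomorrow"] > toy["Close"]).astype(int)

print(toy)
```

Notice that the last row has no "tomorrow" yet, so its `Tomorrow` value is missing and its `Target` falls back to 0 even though the real outcome is unknown.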

In [None]:
# Keep only the rows from the year 2000 onward
data = data[data.index >= "2000-01-01"]
data

In [None]:
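# Deliberately flawed example: the predictors below include the answer itself
bad_data = data.copy()
bad_data = bad_data.dropna()

bad_model = RandomForestClassifier(n_estimators=100, min_samples_split=100, random_state=42)

# Hold out the most recent 100 rows for testing
train = bad_data.iloc[:-100]
test = bad_data.iloc[-100:]

# "Tomorrow" and "Target" leak the outcome the model is supposed to predict
predictors = ["Open", "High", "Low", "Close", "Volume", "Tomorrow", "Target"]
bad_model.fit(train[predictors], train["Target"])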

In [None]:
predictions = bad_model.predict(test[predictors])
predictions = pd.Series(predictions, index=test.index)
precision_score(test["Target"], predictions)
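
The score from `bad_model` looks impressive because its predictors include `Tomorrow` and `Target`, the very answer it is asked to predict. This is often called data leakage. A minimal sketch on random, made-up data (all `toy` and `leaky`/`honest` names are hypothetical) shows the same effect: a model that is handed the answer scores perfectly, while a model that only sees noise does not. Setting `max_features=None` is an assumption made here so that every tree can always see the leaked column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score

rng = np.random.default_rng(0)

# A pure-noise feature and a random 0/1 target with no real relationship
toy = pd.DataFrame({
    "Noise": rng.normal(size=400),
    "Target": rng.integers(0, 2, size=400),
})
toy_train, toy_test = toy.iloc[:300], toy.iloc[300:]

# Leaky model: the target itself is one of the predictors
leaky = RandomForestClassifier(n_estimators=50, max_features=None, random_state=42)
leaky.fit(toy_train[["Noise", "Target"]], toy_train["Target"])
leaky_precision = precision_score(
    toy_test["Target"], leaky.predict(toy_test[["Noise", "Target"]])
)

# Honest model: only the noise feature is available
honest = RandomForestClassifier(n_estimators=50, max_features=None, random_state=42)
honest.fit(toy_train[["Noise"]], toy_train["Target"])
honest_precision = precision_score(
    toy_test["Target"], honest.predict(toy_test[["Noise"]])
)

print("Precision with leaked target:", leaky_precision)
print("Precision without leakage:", honest_precision)
```

A score that looks too good to be true is a signal to check whether any predictor contains information from the future.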

In [None]:
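# Same model settings, but trained only on values known on the day itself
model = RandomForestClassifier(n_estimators=100, min_samples_split=100, random_state=42)

# Hold out the most recent 100 trading days for testing
train = data.iloc[:-100]
test = data.iloc[-100:]

predictors = ["Open", "High", "Low", "Close", "Volume"]
model.fit(train[predictors], train["Target"])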

In [None]:
predictions = model.predict(test[predictors])
predictions = pd.Series(predictions, index=test.index)
precision_score(test["Target"], predictions)
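
`precision_score` answers one question: of the days the model predicted the price would go up, what fraction actually went up? In other words, precision = true positives / (true positives + false positives). The sketch below checks this by hand on made-up labels (the `actual` and `predicted` lists are invented for illustration).

```python
from sklearn.metrics import precision_score

# Made-up labels: 1 = price went up, 0 = price did not go up
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 1, 0, 0, 1, 1, 0]

# Look only at the days where the model said "up"
true_positives = sum(1 for a, p in zip(actual, predicted) if p == 1 and a == 1)
false_positives = sum(1 for a, p in zip(actual, predicted) if p == 1 and a == 0)

manual_precision = true_positives / (true_positives + false_positives)

print("By hand:", manual_precision)
print("scikit-learn:", precision_score(actual, predicted))
```

A model that predicts "up" every single day has a precision equal to the fraction of days the price rose, so that fraction makes a useful baseline to compare against.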

In [None]:
combined = pd.concat([test["Target"], predictions], axis=1)
combined.rename(columns={"Target": "Actual", 0: "Predicted"}, inplace=True)
combined

In [None]:
px.line(
    combined,
    x=combined.index,
    y=["Actual", "Predicted"],
    labels={"index": "Date", "value": "Values"},
    title="Actual vs Predicted",
).show()

In [None]:
data_copy = data.copy()

predictors = ["Open", "High", "Low", "Close", "Volume"]
target = "Target"

for col in predictors[:4]:
    data_copy[col + ' Rolling Mean'] = data_copy[col].rolling(window=5).mean()

data_copy.dropna(inplace=True)
data_copy.head()
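
The rolling mean columns smooth out day-to-day jumps by averaging each value with the four rows before it. A small sketch on made-up prices (the `toy_close` values are invented for illustration) shows why the first four rows end up empty and are removed by `dropna`.

```python
import pandas as pd

# Made-up closing prices, for illustration only
toy_close = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0], name="Close")

# Average of the current row and the four rows before it
toy_rolling = toy_close.rolling(window=5).mean()

print(pd.DataFrame({"Close": toy_close, "Close Rolling Mean": toy_rolling}))
```

The first complete window covers rows 0 to 4, so the first rolling mean appears on the fifth row.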

In [None]:
X = data_copy[predictors + [col + ' Rolling Mean' for col in predictors[:4]]]
y = data_copy[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
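
By default, `train_test_split` shuffles the rows before splitting, so days in the test set can sit between days in the training set. Because features such as 5-day rolling means overlap between neighbouring days, a shuffled split can make scores look better than they would on truly unseen future data. One possible alternative (not what the cell above does) is to pass `shuffle=False`, which keeps the rows in date order and uses the most recent 20% as the test set. The sketch uses a made-up frame named `toy`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up daily data with a date index, for illustration only
dates = pd.date_range("2024-01-01", periods=10, freq="D")
toy = pd.DataFrame({"Close": range(10), "Target": [0, 1] * 5}, index=dates)

# shuffle=False keeps chronological order: earliest rows train, latest rows test
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy[["Close"]], toy["Target"], test_size=0.2, shuffle=False
)

print("Train ends:", toy_X_train.index.max().date())
print("Test starts:", toy_X_test.index.min().date())
```

With this split, every training day comes before every test day, which is closer to how a prediction would be used in practice.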

In [None]:
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("classifier", XGBClassifier(random_state=42)),
])
param_grid = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__learning_rate": [0.01, 0.1, 0.2],
    "classifier__max_depth": [3, 4, 5],
}
grid_search = GridSearchCV(pipeline, param_grid=param_grid, scoring="precision", cv=5)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
predictions = best_model.predict(X_test)
precision = precision_score(y_test, predictions)

print("Best Hyperparameters:", grid_search.best_params_)
print("Precision Score:", precision)
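
The grid search can take a while because it tries every combination of the listed settings and trains each one several times. A short sketch using scikit-learn's `ParameterGrid` counts the work involved; the names `grid` and `cv_folds` are chosen here for illustration and mirror the values used in the cell above.

```python
from sklearn.model_selection import ParameterGrid

# Same settings as the grid in the cell above
grid = {
    "classifier__n_estimators": [50, 100, 200],
    "classifier__learning_rate": [0.01, 0.1, 0.2],
    "classifier__max_depth": [3, 4, 5],
}
cv_folds = 5

# Every combination is trained once per cross-validation fold
n_combinations = len(ParameterGrid(grid))
n_fits = n_combinations * cv_folds

print("Combinations:", n_combinations)
print("Model fits during the search:", n_fits)
```

After the search, `GridSearchCV` also refits the best combination once on the full training set, which is the model returned by `best_estimator_`.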