![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fdata-viz-of-the-week&branch=main&subPath=stock-prices/stock-prices.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Callysto’s Weekly Data Visualization

## Stock Prices

### Recommended Grade levels: 
<br>

### Instructions

Click "Cell" and select "Run All".

This will import the data and run all the code, so you can see this week's data visualization. Scroll back to the top after you’ve run the cells.

![instructions](https://github.com/callysto/data-viz-of-the-week/blob/main/images/instructions.png?raw=true)

**You don't need to do any coding to view the visualizations**.

The plots generated in this notebook are interactive. You can hover over and click on elements to see more information. 

Email contact@callysto.ca if you experience issues.

### About this Notebook

Callysto's Weekly Data Visualization is a learning resource that aims to develop data literacy skills. We provide Grades 5-12 teachers and students with a data visualization, like a graph, to interpret. This companion resource walks learners through how the data visualization is created and interpreted by a data scientist. 

The steps of the data analysis process are listed below and applied to each weekly topic.

1. Question - What are we trying to answer?
2. Gather - Find the data source(s) you will need. 
3. Organize - Arrange the data, so that you can easily explore it. 
4. Explore - Examine the data to look for evidence to answer the question. This includes creating visualizations. 
5. Interpret - Describe what's happening in the data visualization. 
6. Communicate - Explain how the evidence answers the question. 

## Question

How can we compare and visualize trends in stock prices to make informed financial decisions, and can machine learning use historical data to predict future prices?

### Goal

Our objective is to introduce the concept of stocks and to encourage a positive, informed mindset toward investing. We will also use machine learning to see whether we can build models that are useful for predicting prices.

### Background

Learning about stocks at a young age builds a foundation for long-term financial stability. It encourages individuals to make informed decisions, develop savings habits, and benefit from the compounding returns of early investments.

## Gather

Stock prices in this notebook are obtained using the Python library [yfinance](https://pypi.org/project/yfinance/) and stock symbols and names are obtained from the [Nasdaq](https://www.nasdaq.com/market-activity/stocks/screener).

### **Disclaimer**

This notebook is **strictly** for educational and informational purposes and does not constitute financial advice, recommendation, or endorsement. The content presented here is not intended to influence any investment decisions, and readers are strongly advised to seek independent financial advice and conduct their own research before making any investment choices. The authors and contributors of this notebook shall not be held responsible for any financial losses or decisions made based on the information provided.

### Code: 

Run the code cells below to import the libraries we need for this project. Libraries are pre-made code that make it easier to analyze our data.

In [None]:
import plotly.express as px
from plotly.subplots import make_subplots
import pandas as pd
import ipywidgets

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

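try:
    from xgboost import XGBClassifier
except ImportError:
    # Install xgboost if it is not already available, then retry the import
    !pip install xgboost
    from xgboost import XGBClassifier

try:
    import yfinance as yf
except ImportError:
    # Install yfinance if it is not already available, then retry the import
    !pip install yfinance
    import yfinance as yf

print("Libraries imported.")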

### Data

In this notebook, we'll primarily obtain data using the Python library [yfinance](https://pypi.org/project/yfinance/). As mentioned before, yfinance is a useful tool for easily obtaining financial data on stocks. This helps us analyze the data in detail and draw meaningful insights on price trends.

Before getting into any coding in this notebook, let's start by talking about what **stocks** are. The [Merriam-Webster](https://www.merriam-webster.com/dictionary/stock) dictionary defines a stock as: 

<div style="font-family: 'Times New Roman'; border: 1px solid #ddd; padding: 10px; border-radius: 5px;">
  <p style="font-weight: bold; margin-bottom: 5px;">Stock Definition:</p>
  <p style="margin-left: 20px;">"The proprietorship element in a corporation usually divided into shares and represented by transferable certificates."</p>
</div>


In simpler terms, a stock is like owning a *piece* of a company. It's divided into **shares**, and each share is a piece of the company. 

Once you own shares, you can sell them or trade them with others. 

The [Canadian Government](https://www.canada.ca/en/financial-consumer-agency/services/savings-investments/investing-basics.html) defines an **Exchange-Traded Fund** **(ETF)** as: 

"an investment fund that holds assets such as stocks, commodities or bonds. Exchange traded funds trade on stock exchanges and have a value that is similar to the total value of the assets they contain. This means that the value of an exchange traded fund can change throughout the day." 

Essentially, an ETF is a *diversified* investment fund traded on stock exchanges. The main benefit of investing in an ETF is that, since the ETF's price is determined by a combination of various stocks, even if a particular stock performs poorly within the ETF, the idea is that diversification across a variety of stocks mitigates the impact on your overall investment.
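To see why diversification softens the blow, here is a minimal sketch with made-up numbers: a hypothetical fund holds five equally weighted stocks, one of which loses half its value while the others stay flat. The stock names, weights, and returns are assumptions for illustration only, not real market data.

```python
# Hypothetical one-day returns for five stocks (made-up numbers).
# Stock "E" loses half its value; the others do not move.
returns = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 0.0, "E": -0.50}

# Assume the fund holds each stock in equal proportion.
weight = 1 / len(returns)

# The fund's return is the weighted average of its holdings' returns.
fund_return = sum(weight * r for r in returns.values())

print(f"Return if you only held stock E: {returns['E']:.0%}")
print(f"Return of the diversified fund:  {fund_return:.0%}")
```

In this toy case, someone holding only stock E loses 50%, while the equally weighted fund loses 10%. Real funds are rarely equally weighted, so the actual effect depends on how much of the fund each stock makes up.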

The data we're using is for the **S&P 500** (Standard & Poor's 500), a stock market index that tracks 500 major U.S. companies. The ticker symbol `^GSPC` used below refers to the index itself; ETFs that follow the S&P 500 give investors a convenient way to track overall stock market performance.

In [None]:
sp500 = yf.Ticker("^GSPC")
end_date = "2023-12-31"
sp500 = sp500.history(start="1927-01-01", end=end_date)

sp500

The `sp500` dataframe represents financial data, where the `Date` serves as the index for this particular dataframe. A dataframe is similar to a spreadsheet, where we have rows and columns that correspond to entries of data. 

In the context of a dataframe, an index is similar to the row numbers in a spreadsheet but, in this case, it's labeled with specific dates. The other columns, such as `Open`, `High`, `Low`, `Close`, `Volume`, `Dividends`, and `Stock Splits`, provide information corresponding to each date.


- **Date:** This shows the date of the financial data.
- **Open:** The price of the stock at the beginning of the trading day.
- **High:** The highest price the stock reached during the trading day.
- **Low:** The lowest price the stock reached during the trading day.
- **Close:** The price of the stock at the end of the trading day.
- **Volume:** This represents the total number of shares traded during the day.

Please note that `Dividends` and `Stock Splits` are included in the table but won't be used in this notebook, so you can ignore them.
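As a rough illustration of how these columns relate to one another, the sketch below builds a tiny dataframe with made-up prices (not real S&P 500 data) and derives two quantities from it: the day's trading range (`High` minus `Low`) and the change from `Open` to `Close`. The names `toy`, `Range`, and `Change` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Made-up prices for three trading days (for illustration only).
toy = pd.DataFrame(
    {
        "Open": [100.0, 102.0, 101.0],
        "High": [103.0, 104.0, 102.0],
        "Low": [99.0, 100.0, 98.0],
        "Close": [102.0, 101.0, 99.0],
    },
    index=pd.to_datetime(["2023-01-03", "2023-01-04", "2023-01-05"]),
)
toy.index.name = "Date"

# How far the price moved between its lowest and highest point each day.
toy["Range"] = toy["High"] - toy["Low"]

# Whether the price ended the day above (+) or below (-) where it started.
toy["Change"] = toy["Close"] - toy["Open"]

print(toy)
```

A positive `Change` means the price closed higher than it opened that day, and a negative one means it closed lower.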

## Organize

Now that we have a better sense of the different columns in our dataframe, let's *organize* our data for useful analysis. In coding terms, this is known as **data-cleaning**. 

Generally, data cleaning involves the process of identifying and removing errors, inconsistencies, or missing values in a dataset to ensure that the data is accurate. This also involves removing unused information to enhance clarity and focus on relevant data in our dataframe.
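A minimal sketch of both kinds of cleaning, assuming a small made-up table (`messy`) that has one missing price and one column we do not need. The column names here are hypothetical.

```python
import pandas as pd

# A made-up table with one missing price and one column we do not need.
messy = pd.DataFrame(
    {
        "Close": [10.0, None, 11.0],
        "Volume": [500, 450, 520],
        "Notes": ["", "", ""],
    }
)

# Remove rows that contain missing values, then drop the unused column.
clean = messy.dropna().drop(columns=["Notes"])

print(clean)
```

The row with the missing `Close` value is removed, and only the `Close` and `Volume` columns remain.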

We don't have to do much data cleaning on our dataframe, `sp500`. The only major step is to remove the `Dividends` and `Stock Splits` columns, since they will not be used in any of our later analysis.

In [None]:
del sp500["Dividends"]
del sp500["Stock Splits"]
display(sp500.head())
print("Unused columns deleted.")

## Explore and Interpret

We can use these columns to visualize trends in stock prices. For the **S&P 500**, our price data runs from 1927 to 2023.

In [None]:
sp500_plots = make_subplots(rows=1, cols=2, shared_yaxes=True, subplot_titles=("Open Price for S&P 500", "Close Price for S&P 500"))
sp500_plots.add_trace(px.line(sp500, x=sp500.index, y="Open").data[0], row=1, col=1)
sp500_plots.add_trace(px.line(sp500, x=sp500.index, y="Close").data[0], row=1, col=2)

sp500_plots.update_xaxes(title_text="Year", row=1, col=1)
sp500_plots.update_xaxes(title_text="Year", row=1, col=2)

sp500_plots.update_layout(title="S&P 500 Open and Close Price", yaxis_title="Price (USD)").show()

Let's take a look at the overall growth of the S&P 500 over the past years. While there are certain years when the stock dipped in price, we see that historically it has increased over time. 

We can also look at something that is similar to a stock in some ways but vastly different in others. [Cryptocurrency](https://en.wikipedia.org/wiki/Cryptocurrency) is a digital currency that exploded in popularity in 2020-2021. However, cryptocurrency is generally considered highly volatile, meaning there is a high risk of losing the money you invest. 

Let's take **Dogecoin**, a type of cryptocurrency, as an example. 

In [None]:
dogecoin = yf.Ticker("DOGE-USD")
dogecoin = dogecoin.history(start="2017-01-01", end=end_date)

dogecoin_plots = make_subplots(rows=1, cols=2, shared_yaxes=True, subplot_titles=("Open Price for DogeCoin", "Close Price for DogeCoin"))
dogecoin_plots.add_trace(px.line(dogecoin, x=dogecoin.index, y="Open").data[0], row=1, col=1)
dogecoin_plots.add_trace(px.line(dogecoin, x=dogecoin.index, y="Close").data[0], row=1, col=2)

dogecoin_plots.update_xaxes(title_text="Year", row=1, col=1)
dogecoin_plots.update_xaxes(title_text="Year", row=1, col=2)

dogecoin_plots.update_layout(title="DogeCoin Open and Close Price", yaxis_title="Price (USD)").show()

Comparing the price progression of Dogecoin to that of the S&P 500, it's evident that Dogecoin experienced a significant price surge in *May 2021*, followed by an **abrupt** crash. 

This visual representation underscores the volatile nature of cryptocurrencies and highlights the associated risks. Investing substantial amounts of money in inherently risky assets like cryptocurrencies can be dangerous, as demonstrated by the sharp and unpredictable price movements observed in Dogecoin.
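One common way to put a number on volatility is the standard deviation of daily percentage changes: the more the price jumps around from day to day, the larger the value. The sketch below uses two made-up price series, labelled `steady` and `jumpy`, rather than the real S&P 500 and Dogecoin data. The helper `daily_volatility` is a hypothetical name introduced for this example.

```python
import pandas as pd

# Made-up closing prices (not real market data).
steady = pd.Series([100.0, 101.0, 100.5, 101.5, 102.0])
jumpy = pd.Series([100.0, 130.0, 90.0, 140.0, 80.0])


def daily_volatility(prices):
    """Standard deviation of day-over-day percentage changes."""
    return prices.pct_change().dropna().std()


print(f"Steady series volatility: {daily_volatility(steady):.3f}")
print(f"Jumpy series volatility:  {daily_volatility(jumpy):.3f}")
```

The same function could be applied to `sp500["Close"]` and `dogecoin["Close"]` from the cells above to compare the two directly.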

### Import the data

We will use [Nasdaq's](https://www.nasdaq.com/market-activity/stocks/screener) list of currently traded stocks to show the variety of stocks on the market. 

In [None]:
stock_symbols = pd.read_csv("https://raw.githubusercontent.com/callysto/data-files/main/data-viz-of-the-week/stock-prices/stock_names.csv")
del stock_symbols['Unnamed: 2'] 

stock_symbols

Looking at the output of the different stocks above, use the drop-down menu below to see the different types of stocks in the market. 

In [None]:
options = stock_symbols['Name'].tolist()
stock_val = stock_symbols['Name'].iloc[0]

symbol_dropdown = ipywidgets.Dropdown(options=options, value=stock_val, description='Stock Name:')
output = ipywidgets.Output()

def display_stock_info(stock_name):
    with output:
        output.clear_output()
        selected_stock = stock_symbols[stock_symbols['Name'] == stock_name]
        display(selected_stock[['Symbol', 'Name']])

ipywidgets.interactive(display_stock_info, stock_name=symbol_dropdown)
ipywidgets.VBox([symbol_dropdown, output])

Using the information of the different stocks above, we can visualize the price history of any stock of your choice. 

In the variable `stock_symbol_to_visualize` below, feel free to change the value to any stock symbol that you would like to visualize. Use the *buttons* in the visualization to easily explore different time periods for the selected stock.

In [None]:
# Change this to the stock symbol that you would like to visualize
# For example: Instead of 'AAPL' (Apple) you can change it to 'AMZN' (Amazon)
stock_symbol_to_visualize = 'AAPL'

try:
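    # Look up the company name that matches the chosen symbol
    stock_name = stock_symbols[stock_symbols['Symbol'] == stock_symbol_to_visualize]['Name'].iloc[0]
    # Download the price history for the chosen symbol (not the first symbol in the list)
    symbol = stock_symbol_to_visualize
    data = yf.download(symbol)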
    
    stk_fig = px.line(data, x=data.index, y='Close', title=f'{stock_name} Historical Price')

    stk_fig.update_xaxes(title_text='Date', rangeslider_visible=True, rangeselector=dict(buttons=list([
                    dict(count=1, label="1d", step="day", stepmode="backward"),
                    dict(count=1, label="1m", step="month", stepmode="backward"),
                    dict(count=3, label="3m", step="month", stepmode="backward"),
                    dict(count=6, label="6m", step="month", stepmode="backward"),
                    dict(count=1, label="1y", step="year", stepmode="backward"),
                    dict(count=2, label="2y", step="year", stepmode="backward"),
                    dict(count=5, label="5y", step="year", stepmode="backward"),
                    dict(step="all")])))
    
    stk_fig.update_yaxes(title_text='Close Price (USD)')
    stk_fig.update_layout(title=f'{stock_name} - Historical Price').show()

except Exception:
    print(f"Error fetching stock data. Make sure that '{stock_symbol_to_visualize}' is a valid stock symbol.")
    data = None

In [None]:
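# "Tomorrow" is the next trading day's closing price
data["Tomorrow"] = data["Close"].shift(-1)
# "Target" is 1 if the price rises the next day, otherwise 0
data["Target"] = (data["Tomorrow"] > data["Close"]).astype(int)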
data
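The cell above builds the label that the models will try to predict. `shift(-1)` moves each value up by one row, so `Tomorrow` holds the next day's closing price, and `Target` is 1 when that price is higher than today's close and 0 otherwise. A minimal sketch on made-up prices:

```python
import pandas as pd

# Made-up closing prices for four consecutive days.
toy = pd.DataFrame({"Close": [10.0, 12.0, 11.0, 11.5]})

# Pull the next day's close up onto the current row.
toy["Tomorrow"] = toy["Close"].shift(-1)

# 1 if the price rises the next day, otherwise 0.
toy["Target"] = (toy["Tomorrow"] > toy["Close"]).astype(int)

print(toy)
```

The last row has no next day, so its `Tomorrow` is missing (`NaN`) and its `Target` falls back to 0. That row does not describe a real outcome and should not be treated as a genuine example.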

In [None]:
data = data[data.index >= "2000-01-01"]
data

In [None]:
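# A flawed model: it is handed the answer as one of its inputs
bad_data = data.copy()
bad_data = bad_data.dropna()

bad_model = RandomForestClassifier(n_estimators=100, min_samples_split=100, random_state=42)

# Hold out the most recent 100 days for testing
train = bad_data.iloc[:-100]
test = bad_data.iloc[-100:]

# "Tomorrow" and "Target" leak the answer into the inputs (data leakage),
# so this model's score will look far better than it could be in real use
predictors = ["Open", "High", "Low", "Close", "Volume", "Tomorrow", "Target"]
bad_model.fit(train[predictors], train["Target"])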

In [None]:
predictions = bad_model.predict(test[predictors])
predictions = pd.Series(predictions, index=test.index)
precision_score(test["Target"], predictions)

In [None]:
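model = RandomForestClassifier(n_estimators=100, min_samples_split=100, random_state=42)

# Drop the final row: its "Tomorrow" is unknown, so its "Target" is not a real label
model_data = data.dropna(subset=["Tomorrow"])

# Hold out the most recent 100 days for testing
train = model_data.iloc[:-100]
test = model_data.iloc[-100:]

# Use only information that is available on the day the prediction is made
predictors = ["Open", "High", "Low", "Close", "Volume"]
model.fit(train[predictors], train["Target"])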

In [None]:
predictions = model.predict(test[predictors])
predictions = pd.Series(predictions, index=test.index)
precision_score(test["Target"], predictions)
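`precision_score` answers the question: of all the days the model predicted the price would rise, what fraction actually rose? For 0/1 labels it is the number of true positives divided by the total number of positive predictions. The sketch below computes that by hand on made-up labels to show the arithmetic; the lists `actual` and `predicted` are hypothetical and are not the model's real output.

```python
# Made-up labels: 1 = price went up the next day, 0 = it did not.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 1, 0, 0, 1, 1, 0]

# True positives: predicted "up" and the price really went up.
true_pos = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)

# False positives: predicted "up" but the price did not go up.
false_pos = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)

precision = true_pos / (true_pos + false_pos)
print(f"Precision: {precision:.2f}")
```

Here the model predicted "up" five times and was right three times, giving a precision of 0.60. If prices rise on roughly half of all days, a precision near 0.5 is about what guessing would achieve.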

In [None]:
combined = pd.concat([test["Target"], predictions], axis=1)
combined.rename(columns={"Target": "Actual", 0: "Predicted"}, inplace=True)
combined

In [None]:
px.line(combined, x=combined.index, y=['Actual', 'Predicted'], labels={'index': 'Date', 'value': 'Values'},title='Actual vs Predicted').show()

In [None]:
data_copy = data.copy()

predictors = ["Open", "High", "Low", "Close", "Volume"]
target = "Target"

for col in predictors[:4]:
    data_copy[col + ' Rolling Mean'] = data_copy[col].rolling(window=5).mean()

data_copy.dropna(inplace=True)
data_copy.head()
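A rolling mean averages each value with the ones just before it, which smooths out day-to-day noise. The cell above uses a 5-day window; the sketch below uses a 3-day window on made-up prices so the arithmetic is easy to check by hand.

```python
import pandas as pd

# Made-up closing prices for six consecutive days.
prices = pd.Series([10.0, 12.0, 11.0, 13.0, 14.0, 12.0])

# Each entry is the average of that day and the two days before it.
rolling_mean = prices.rolling(window=3).mean()

print(rolling_mean)
```

The first two entries are `NaN` because a full 3-day window is not yet available. That is why the cell above calls `dropna` after computing the rolling means.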

In [None]:
X = data_copy[predictors + [col + ' Rolling Mean' for col in predictors[:4]]]
y = data_copy[target]

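# Keep rows in date order (shuffle=False) so the test set is the most recent 20% of days
# and the model is never trained on days that come after the ones it is tested on
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)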

(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="mean")),
    ('classifier', XGBClassifier(random_state=42)),
])
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__learning_rate': [0.01, 0.1, 0.2],
    'classifier__max_depth': [3, 4, 5],
}
grid_search = GridSearchCV(pipeline, param_grid=param_grid, scoring='precision', cv=5)
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
predictions = best_model.predict(X_test)
precision = precision_score(y_test, predictions)

print("Best Hyperparameters:", grid_search.best_params_)
print("Precision Score:", precision)