# INFO 3402 – Class 02: Computational Thinking and Hacker Ethic

[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT).

## Learning Objectives
* Review computational thinking concepts, practices, and perspectives from INFO 1201, *etc*.
* Download, launch, and interact with Jupyter Notebooks
* Work through an exploratory data retrieval and analysis project

## Load libraries

In [None]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sb

import time

## Oh no, a wall of text
First a warning: we invest a **lot** of time in writing these documents with detailed narratives, examples, links to resources, etc. It is **really** important that you read, understand, and develop an expectation for what should happen before executing any cells. We often see students just auto-executing a whole notebook and wondering why things are not working.

For example, do not execute the following cell or else it will print "You should have read the instructions!" and prevent anything else from executing for a long time. If you did accidentally run this cell, you can go to Kernel > Interrupt to stop it. Instead, convert the block of code below from "Code" type to a "Raw NBConvert" type from the drop-down menu to prevent it from being accidentally run.

There are going to be many examples of cells you should exercise caution in running throughout the rest of the class. We will never intentionally include code that compromises the security of your computer but you should still be cautious about executing any code. 

Examples of code blocks that call for caution in the remainder of our class include: installing a package (which only needs to be run once), scraping data that could take minutes or hours to complete, or running an analysis that might consume a lot of resources. 

*Please read the narratives and have some expectation for what each block of code should do before running it!*

In [None]:
i = 0

while i < 1e3:
    print("You should have read the instructions!")
    i += 1
    time.sleep(2)
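
One possible way to protect expensive cells like the kinds described above is to put the slow work behind an explicit flag, so that running every cell in order skips it by default. This is only a minimal sketch: the names `RUN_SLOW_CELLS` and `slow_step` are hypothetical and are not used anywhere else in this notebook.

```python
import time

# Hypothetical guard flag: flip to True only when you really mean to run the slow work
RUN_SLOW_CELLS = False

def slow_step(n_iterations=3, pause=0.01):
    """Stand-in for an expensive cell (installing, scraping, heavy analysis)."""
    completed = 0
    for _ in range(n_iterations):
        time.sleep(pause)
        completed += 1
    return completed

if RUN_SLOW_CELLS:
    result = slow_step()
else:
    result = None
    print("Skipped the slow step; set RUN_SLOW_CELLS = True to run it.")
```

Reading the narrative is still the first line of defense; a flag like this just makes the author's intent explicit in the code itself.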

## Computational thinking _practices_

Thinking back to examples of the *practices* of computational thinking (Brennan & Resnick 2012) from the slides:

* **Experimenting and iterating**: developing, experimenting, and developing some more
* **Testing and debugging**: making sure things work and solving problems when they arise
* **Reusing and remixing**: building on existing projects or ideas and sharing your own work
* **Abstracting and modularizing**: building something complex by putting together smaller parts

Let's see what these practices look like through an exploratory data collection and analysis exercise.

**Our goal**: Create a CSV file with the daily stock price since December 31, 2019 for each company in the [S&P 500 index](https://en.wikipedia.org/wiki/S%26P_500_Index).

We do not expect you to already be familiar with every piece of syntax in the rest of this example, so *do not stress* that you are already behind. We are going to learn much of this in the weeks to come! However, you should be able to follow the narrative and start to develop intuitions for how the pieces work.

### Step 1: Find a list of companies and ticker symbols
(**Reusing and remixing:** building on existing projects or ideas and sharing your own work)

Googling around, I found a dataset hosted by [DataHub.io](https://datahub.io/): [S&P 500 Companies with Financial Information](https://datahub.io/core/s-and-p-500-companies#data). We don't need to register for an account or anything! But the data was created 2 years ago, so it may not be up-to-date.

Option 1: Download the data to the same directory as this notebook and open it here.

In [2]:
sp500_companies_df = pd.read_csv('constituents_csv.csv')
sp500_companies_df.head(10)

| Unnamed: 0 | Symbol | Name | Sector |
|---|---|---|---|
| 0 | MMM | 3M Company | Industrials |
| 1 | AOS | A.O. Smith Corp | Industrials |
| 2 | ABT | Abbott Laboratories | Health Care |
| 3 | ABBV | AbbVie Inc. | Health Care |
| 4 | ABMD | ABIOMED Inc | Health Care |
| 5 | ACN | Accenture plc | Information Technology |
| 6 | ATVI | Activision Blizzard | Communication Services |
| 7 | ADBE | Adobe Inc. | Information Technology |
| 8 | AAP | Advance Auto Parts | Consumer Discretionary |
| 9 | AMD | Advanced Micro Devices Inc | Information Technology |


Option 2: Read the data directly into the notebook from the web!

In [None]:
sp500_companies_df = pd.read_csv('https://datahub.io/core/s-and-p-500-companies/r/constituents.csv')
sp500_companies_df.head(10)
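
An `Unnamed: 0` column like the one in the Option 1 output usually appears when a CSV was saved with its row index and then read back without telling pandas about it. Here is a minimal sketch of that behavior, using three rows from the output above held in memory (an assumption for illustration, not the real file), and showing how `index_col=0` avoids the extra column:

```python
import io
import pandas as pd

# Three rows of CSV text saved with the row index as an unlabeled first column
csv_text = """,Symbol,Name,Sector
0,MMM,3M Company,Industrials
1,AOS,A.O. Smith Corp,Industrials
2,ABT,Abbott Laboratories,Health Care
"""

# Default read: the unlabeled first column comes back as "Unnamed: 0"
with_extra_df = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the row index instead
clean_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(with_extra_df.columns.tolist())
print(clean_df.columns.tolist())
```

`pd.read_csv` accepts a local path, a URL, or a file-like object such as `io.StringIO`, which is why both options above use the same function.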

### Step 2: Find a data service that provides historical stock prices for free
(**Reusing and remixing:** building on existing projects or ideas and sharing your own work)

Googling around, I found this blog post describing the "[Best 5 free stock market APIs in 2020](https://towardsdatascience.com/best-5-free-stock-market-apis-in-2019-ad91dddec984)" so let's start there! The top choice (and one that I and [others](https://towardsdatascience.com/historical-stock-price-data-in-python-a0b6dc826836) know from experience) relies on a library called [`yfinance`](https://github.com/ranaroussi/yfinance). I generally do not recommend installing libraries willy-nilly, but in this case, let's install a new library to let us access historical stock prices for free. 

We only need to install the library once (repeating it won't necessarily hurt) and we can do this from the command line or from within the notebook. I'm going to install it from within the notebook using the `pip` package manager.

In [4]:
# Only need to run once
! pip install yfinance

Collecting yfinance
  Using cached https://files.pythonhosted.org/packages/c2/31/8b374a12b90def92a4e27d0fc595fc43635f395984e36a075244d98bd265/yfinance-0.1.54.tar.gz
Collecting multitasking>=0.0.7 (from yfinance)
  Using cached https://files.pythonhosted.org/packages/69/e7/e9f1661c28f7b87abfa08cb0e8f51dad2240a9f4f741f02ea839835e6d18/multitasking-0.0.9.tar.gz
Building wheels for collected packages: yfinance, multitasking
  Building wheel for yfinance (setup.py): started
  Building wheel for yfinance (setup.py): finished with status 'done'
  Created wheel for yfinance: filename=yfinance-0.1.54-py2.py3-none-any.whl size=22414 sha256=de07e73c503488a91f499606cf1e0db9b96e401f7254e08cc4d625c43088922a
  Stored in directory: C:\Users\User\AppData\Local\pip\Cache\wheels\f9\e3\5b\ec24dd2984b12d61e0abf26289746c2436a0e7844f26f2515c
  Building wheel for multitasking (setup.py): started
  Building wheel for multitasking (setup.py): finished with status 'done'
  Created wheel for multitasking: filename

Import the library.

In [5]:
import yfinance as yf

Another alternative library we could use is [`pandas-datareader`](https://github.com/pydata/pandas-datareader) ([docs](https://pydata.github.io/pandas-datareader/)). This also doesn't come standard with Anaconda, so we will need to install the library as well from either the command line or from within the notebook. Again, you should only need to do this once.

In [None]:
# Only need to run once
! conda install -y -c anaconda pandas-datareader

Import the library.

In [8]:
import pandas_datareader as pdr

ModuleNotFoundError: No module named 'pandas_datareader'

### Step 3: Write some code to retrieve the historical price data for one company
(**Experimenting and iterating**: developing, experimenting, and developing some more)

When all else fails, read the documentation. The author of the `yfinance` package has a [nice blog post](https://aroussi.com/post/python-yahoo-finance) detailing how to use it!

Ask for Apple's data from December 31, 2019 through late August 2020.

In [10]:
aapl_yf_df = yf.download(tickers='AAPL',
                         start='2019-12-31',
                         end='2020-08-24',
                         progress=False)

What does the resulting data look like?

In [None]:
aapl_yf_df.tail()

There are [many data readers available](https://pydata.github.io/pandas-datareader/remote_data.html) with `pandas-datareader`, including (curiously) an undocumented 'yahoo' datareader. 

In [None]:
aapl_pdr_df = pdr.DataReader(name='AAPL',
                             data_source='yahoo',
                             start='2019-12-31',
                             end='2020-08-24')

aapl_pdr_df.tail()

### Step 4: Write a loop to repeat this code for all 500 companies
(**Abstracting and modularizing**: building something complex by putting together smaller parts)

The `yfinance` library has built-in functionality that lets it make many requests in parallel. Let's show off this fancy functionality first.

First, get all the companies' stock tickers from `sp500_companies_df` and turn them into a list using the [`.tolist()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.tolist.html) method, which converts the data from a pandas Series data type (which non-pandas libraries like `yfinance` may not recognize) into a basic list data type.

In [None]:
sp500_symbol_list = sp500_companies_df['Symbol'].tolist()
sp500_symbol_list[:10]
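
To make the Series-versus-list distinction concrete, here is a small sketch on a toy three-row stand-in for `sp500_companies_df` (the name `toy_companies_df` is hypothetical):

```python
import pandas as pd

# Toy stand-in for sp500_companies_df with only three rows
toy_companies_df = pd.DataFrame({
    'Symbol': ['MMM', 'AOS', 'ABT'],
    'Name': ['3M Company', 'A.O. Smith Corp', 'Abbott Laboratories'],
})

symbol_s = toy_companies_df['Symbol']   # a pandas Series
symbol_list = symbol_s.tolist()         # a plain Python list

print(type(symbol_s).__name__, type(symbol_list).__name__)
print(symbol_list)
```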

Double-check it has 500 symbols. 505? Interesting, someone should get to the bottom of that.

In [None]:
len(sp500_symbol_list)

Pass the list of companies to the `download` function specific to the `yfinance` library.

In [None]:
all_df = yf.download(sp500_symbol_list,interval='1d',start='2019-12-31',end='2020-08-24')

Passing a list of multiple symbols (in this case 505) to a single function call is special and not what you would do 99% of the time. The usual pattern is to take each symbol, use the function to get its data, store this data in some structure, and move on to the next symbol. This takes much longer than the previous (special) approach, but I've illustrated it below since it's much closer to how this works in most cases.

In [None]:
# Initialize an empty dictionary where we will store results
all_data_dict = {}

# Loop through each symbol 
for i,symbol in enumerate(sp500_symbol_list):
    
    # Get the data for that symbol, suppressing the progress bar
    _df = yf.download(symbol,interval='1d',start='2019-12-31',end='2020-08-24',progress=False)
    
    # Could use the pandas-datareader function instead
    #_df = pdr.DataReader(symbol,data_source='yahoo',start='2019-12-31',end='2020-08-24')
    
    # Store the data in the dictionary
    all_data_dict[symbol] = _df
    
    # It's common courtesy to wait a bit in between each request to avoid overwhelming a system
    time.sleep(.1)
    
    # Print out where we are in the process, every 50th symbol
    if i%50 == 0:
        print(i,symbol)
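
A loop over hundreds of requests will often hit at least one symbol that fails, and by default a single exception stops the whole loop. The sketch below shows one way to record failures and keep going. It does not touch the network: `fake_download` is a hypothetical stand-in for `yf.download` that returns made-up prices, and `BAD` is a made-up symbol that always fails.

```python
import numpy as np
import pandas as pd

def fake_download(symbol):
    """Hypothetical stand-in for yf.download: made-up prices, no network."""
    if symbol == 'BAD':
        raise ValueError(f"No data for {symbol}")
    dates = pd.date_range('2019-12-31', periods=3, freq='B')
    return pd.DataFrame({'Adj Close': np.arange(3) + 10.0}, index=dates)

toy_symbols = ['MMM', 'BAD', 'ABT']
toy_data_dict = {}
failed_symbols = []

for symbol in toy_symbols:
    try:
        toy_data_dict[symbol] = fake_download(symbol)
    except ValueError:
        # Record the failure and keep going rather than stopping the whole loop
        failed_symbols.append(symbol)

print(sorted(toy_data_dict), failed_symbols)
```

Keeping a list of failed symbols makes it easy to retry just those later instead of repeating every request.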

### Step 5: Write code to combine all this data together into one structure
(**Testing and debugging**: making sure things work and solving problems when they arise)

Using the magic `yfinance.download` function, we have a giant DataFrame with a [multi-index](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html) column.

In [None]:
all_df.head()

It's hard to see all those column values. We can access them with the `.columns.levels` attribute.

In [None]:
all_df.columns.levels[0]

We can access the "Adj Close" values by treating it like a dictionary. The rows are days and the columns are the company symbols.

In [None]:
all_df['Adj Close'].head()
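
Multi-index columns are easier to reason about on a tiny example. This sketch builds a toy DataFrame with made-up symbols (`AAA`, `BBB`) and made-up values, laid out the way the text above describes `all_df`: the measure on the outer column level and the symbol on the inner level.

```python
import pandas as pd

dates = pd.date_range('2019-12-31', periods=2, freq='B')

# Two-level columns: the measure on the outer level, the symbol on the inner level
columns = pd.MultiIndex.from_product([['Adj Close', 'Volume'], ['AAA', 'BBB']])
toy_all_df = pd.DataFrame(
    [[10.0, 20.0, 100, 200],
     [11.0, 19.0, 110, 190]],
    index=dates, columns=columns)

# Selecting an outer-level label returns a DataFrame keyed by the inner level
toy_adjclose_df = toy_all_df['Adj Close']

print(toy_all_df.columns.levels[0].tolist())
print(toy_adjclose_df.columns.tolist())
```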

If we used the loop strategy, then the data is parked inside the `all_data_dict`. We can use the [`concat`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) function in pandas to combine them all.

In [None]:
concat_df = pd.concat(all_data_dict)
concat_df.head()

This returns a multi-index on the index with both symbol and date and then the values for that stock on each day as columns. I like the simpler dates as rows and symbols as columns approach from above. We can access a single column of `concat_df` with bracket notation and then [`unstack`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unstack.html?highlight=unstack#pandas.Series.unstack) the multi-indexed Series into a DataFrame.

In [None]:
adjclose_concat_s = concat_df['Adj Close']

# Inspect
adjclose_concat_s.head()

Unstack the multi-indexed series into a DataFrame.

In [None]:
adjclose_concat_df = adjclose_concat_s.unstack(0)

# Inspect
adjclose_concat_df.head()
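
The same concat-then-unstack steps can be traced on a toy dictionary of two small DataFrames. The symbols and prices here are made up for illustration:

```python
import pandas as pd

dates = pd.date_range('2019-12-31', periods=2, freq='B')

toy_dict = {
    'AAA': pd.DataFrame({'Adj Close': [10.0, 11.0], 'Volume': [100, 110]}, index=dates),
    'BBB': pd.DataFrame({'Adj Close': [20.0, 19.0], 'Volume': [200, 190]}, index=dates),
}

# Dictionary keys become the outer level of the row index: (symbol, date)
toy_concat_df = pd.concat(toy_dict)

# Pick one column, then move the symbol level (level 0) into the columns
toy_wide_df = toy_concat_df['Adj Close'].unstack(0)

print(toy_wide_df)
```

The result has dates as rows and symbols as columns, the same layout as selecting `'Adj Close'` from the multi-index columns above.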

### Step 6: Write to disk
(**Reusing and remixing**: building on existing projects or ideas and sharing your own work)

We have done all this hard work, so now save the results to disk as a CSV file so you can just load them instead of crawling the data again. You could then give this file to someone else, who could open it in a spreadsheet program like Excel. Character encoding errors are a huge pain to deal with, so I also make sure to explicitly encode my data to disk with the common UTF-8 standard.

In [None]:
all_df.to_csv('sp500_constituents_2020.csv',encoding='utf8')
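
It is worth checking that a saved file loads back the way you expect. This is a minimal round-trip sketch on a toy dates-by-symbols table written to a temporary directory; the file name `toy_prices.csv` and the values are made up. A DataFrame with multi-index columns like `all_df` would likely also need `header=[0, 1]` when reading it back.

```python
import os
import tempfile
import pandas as pd

dates = pd.date_range('2019-12-31', periods=2, freq='B')
toy_wide_df = pd.DataFrame({'AAA': [10.0, 11.0], 'BBB': [20.0, 19.0]}, index=dates)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_prices.csv')

    # Write with an explicit encoding
    toy_wide_df.to_csv(path, encoding='utf8')

    # Read back, using the first column as the index and parsing it as dates
    reloaded_df = pd.read_csv(path, index_col=0, parse_dates=True, encoding='utf8')

print(reloaded_df)
```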

### Step 7: Do some exploratory analyses on all this data
(**Experimenting and iterating**: developing, experimenting, and developing some more)

There's a lot of variance in the prices of the S&P 500 constituents: some stocks are only \$10 per share while others are in excess of \$1,000 per share. We could explore these values with a [histogram](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#histograms).

In [None]:
# Get all the prices on December 31 by accessing that row with .loc
dec31_prices = adjclose_concat_df.loc['2019-12-31']

# Make the histogram on these data
ax = dec31_prices.hist(bins=50)

# Always label your axes!
ax.set_xlabel('Adjusted Close')
ax.set_ylabel('Number of symbols')

What are these low and high values?

In [None]:
dec31_prices.sort_values().head(10)

In [None]:
dec31_prices.sort_values().tail(10)

The presence of a few outlier values skews the histogram. This large range of values is perfect to illustrate on a logarithmic scale, which calls for [logarithmic bins](https://stackoverflow.com/a/6856155/1574687).

In [None]:
ax = adjclose_concat_df.loc['2019-12-31'].hist(bins=np.logspace(0,4,50))
ax.set_xscale('log')
ax.set_ylim((0,60))
ax.set_xlabel('Adjusted close')
ax.set_ylabel('Number of companies')

# https://matplotlib.org/api/ticker_api.html#matplotlib.ticker.StrMethodFormatter
ax.xaxis.set_major_formatter(ticker.StrMethodFormatter('${x:,.0f}'))
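
To see what `np.logspace(0,4,50)` produces, here is a small sketch with made-up prices. The edges run from 10^0 = 1 to 10^4 = 10,000, and each edge is a constant multiple of the one before it, which is what makes the bars look evenly wide on a logarithmic axis.

```python
import numpy as np

# Made-up prices spanning several orders of magnitude
toy_prices = np.array([8.0, 15.0, 42.0, 95.0, 130.0, 310.0, 1800.0])

# 50 bin edges evenly spaced in log10 between 10**0 = 1 and 10**4 = 10,000
log_edges = np.logspace(0, 4, 50)

# Count how many prices fall between each pair of edges
counts, _ = np.histogram(toy_prices, bins=log_edges)

# Ratio between consecutive edges is constant on a log scale
edge_ratios = log_edges[1:] / log_edges[:-1]

print(log_edges[0], log_edges[-1], len(log_edges))
print(counts.sum())
```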

The COVID-19 pandemic and resulting shutdowns caused a historic crash and recovery in the stock market in 2020. Let's visualize the daily prices of each constituent stock over time. Given that there is so much variance in the prices of these stocks, let's normalize their prices to 1 on December 31.

In [None]:
# Normalize by dividing the values by their Dec 31 values
norm_adjclose_concat_df = adjclose_concat_df.div(adjclose_concat_df.loc['2019-12-31'])

# The data doesn't include weekends, holidays, and other days when the markets are closed
# Reindex the data to include all dates since the 31st to make a cleaner axis
norm_adjclose_concat_df = norm_adjclose_concat_df.reindex(pd.date_range('2019-12-31','2020-08-17'))

# Plot the data
ax = norm_adjclose_concat_df.plot(color='k',alpha=.25,legend=False,figsize=(10,5))

# Include a consistent range of y-values
ax.set_ylim((0,2.5))

# Include a yellow line with the average
norm_adjclose_concat_df.mean(axis=1).plot(ax=ax,color='y',lw=5)

# Include a red dashed line at 1 as a baseline
ax.axhline(y=1,ls='--',c='r')

# Save the picture
plt.tight_layout()
plt.savefig('sp500_prices_normalized_2020.png',dpi=300)
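
The normalize-and-reindex steps are easier to follow on a toy table. This sketch uses made-up prices for two made-up symbols on three trading days around a weekend (January 4 and 5, 2020 were a Saturday and Sunday):

```python
import pandas as pd

# Made-up prices for three trading days around a weekend
trading_days = pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06'])
toy_prices_df = pd.DataFrame({'AAA': [10.0, 12.0, 5.0],
                              'BBB': [200.0, 100.0, 300.0]}, index=trading_days)

# Divide every row by the first row so each column starts at 1
toy_norm_df = toy_prices_df.div(toy_prices_df.iloc[0])

# Reindex onto every calendar day; days with no trading become NaN
all_days = pd.date_range('2020-01-02', '2020-01-06')
toy_norm_df = toy_norm_df.reindex(all_days)

print(toy_norm_df)
```

After normalizing, a \$10 stock and a \$200 stock can share one axis, and the rows of missing values on the weekend are what show up as gaps in a line plot.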

We can see the crash in March when stock prices fell by almost 50% since the start of the year and the sustained rise since then as stimulus measures injected cash and liquidity into the economy. The gaps in the data are the days when the market was closed.

The stock market has (surprisingly, to me) almost recovered all its losses. Some stocks have performed amazingly well, nearly doubling in value since January, while others remain really depressed at only a quarter of their original value.

We can compute the ratio between the most recent and the December 31 values.

In [None]:
ratio_s = adjclose_concat_df.loc['2020-08-13']/adjclose_concat_df.loc['2019-12-31']

What are some of the lowest ratios? The worst performers include travel companies like Norwegian Cruise Lines (NCLH), Carnival Cruise Lines (CCL), and United Airlines (UAL); oil extractors like Occidental Petroleum (OXY); and luxury brand owner Coty (COTY).

In [None]:
ratio_s.sort_values().head()

Some of the best performers are e-commerce firms like PayPal (PYPL), chip makers like NVIDIA (NVDA), and pharmaceutical and medical device makers like Abiomed (ABMD), West Pharmaceuticals (WST), and Dexcom (DXCM).

In [None]:
ratio_s.dropna().sort_values().tail()
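
The `.dropna()` matters here because `sort_values` places missing values at the end by default, so they would otherwise land in `.tail()`. A small sketch with made-up ratios, where the made-up symbol `CCC` stands in for a company with no price on one of the two dates:

```python
import numpy as np
import pandas as pd

# Made-up ratios; CCC is missing a price on one date, so its ratio is NaN
toy_ratio_s = pd.Series({'AAA': 0.4, 'BBB': 1.8, 'CCC': np.nan, 'DDD': 1.1})

# Without dropna, the NaN sorts to the end and shows up among the "top" values
top_with_nan = toy_ratio_s.sort_values().tail(2).index.tolist()

# Dropping missing values first leaves only real ratios
top_clean = toy_ratio_s.dropna().sort_values().tail(2).index.tolist()

print(top_with_nan)
print(top_clean)
```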

Visualize as a histogram to reveal that most companies are near their values at the start of the year.

In [None]:
ax = ratio_s.hist(bins=50)
ax.set_xlim((0,2))
ax.axvline(1,c='r',ls='--')
ax.set_xlabel('Price ratio')
ax.set_ylabel('Number of companies')

plt.tight_layout()
plt.savefig('sp500_price_ratio_2020.png',dpi=300)

## Computational thinking _perspectives_

Thinking back to examples of the *perspectives* of computational thinking (Brennan & Resnick 2012) from the slides:

* **Expressing**: computation as a medium for creative and critical expression
* **Connecting**: computation as a tool for creating for and interacting with others
* **Questioning**: computation as a tool for investigating how the world works

What could we do next now that we have this data and these exploratory findings? 

What are other ways that we could use this notebook, these data, and these results to pursue creative or critical questions (**expressing**)? 

How could we share these findings, who else could be interested in using these data and results, and how could we get it to them (**connecting**)?

What do these data and results reveal about the disconnects between reactions of the stock market and the experiences of most Americans? What other kinds of data or analyses would we need to explore this (**questioning**)? 

Let's talk about these in lecture together!