In this demonstration we use a library called pandas_datareader, which pulls large, commonly used
datasets from public sources on the web into pandas. It might not work in the Coursera
environment or in your home environment. Not to fear: we've also included the data in assets/stocks.csv. So
you can skip just three cells down to start learning about heatmaps, but if you want to see a bit of how
we prepared the data, check out the next couple of cells.

In [None]:
#Let's play with some finance data for this example. Let's load in all the usual suspects that we will need,
#like pandas and matplotlib.
import pandas as pd
import numpy as np
import matplotlib as mpl
mpl.get_backend()
import matplotlib.pyplot as plt

#Now, to get the stock data we will need the datetime module and the pandas-datareader package.
#The pandas-datareader package is pretty neat: it allows you to read in data from sources such as
#Google, World Bank, and Yahoo. If you're following along, you'll need to install pandas-datareader
#using the terminal: just open a new terminal, type pip install pandas-datareader, and you should be all good.
import pandas_datareader as pdf
import datetime

In [None]:
#Let's write a little function to download some stocks, so we can explore them using heatmaps.

#We can easily do this by making a function that takes in the ticker or symbol of the stock or stocks that we 
#want, a start date and an end date.

def get(tickers, startdate, enddate):
    #The inner function, data(), downloads the data for a single ticker from the startdate to the
    #enddate and returns it so that get() can continue along.
    def data(ticker):
        return pdf.get_data_yahoo(ticker, start=startdate, end=enddate)
    #We map data() over the tickers to get one DataFrame per ticker,
    datas = map(data, tickers)
    #and return a single DataFrame that concatenates them, labeled by Ticker and Date.
    return pd.concat(datas, keys=tickers, names=['Ticker', 'Date'])

tickers = ['AAPL', 'MSFT', 'IBM', 'GOOG', '^GSPC']
#Uncomment this line to actually retrieve the data
#stocks = get(tickers, datetime.datetime(2006, 1, 1), datetime.datetime(2016, 1, 1))

#So, in this code we are extracting the stock data for Apple, Microsoft, IBM, Google, and the S&P 500 from
#January 1st, 2006 to January 1st, 2016 and gathering it into one nice big DataFrame.

#stocks.head() #Using the .head we can see that the variables include the high, low, open, close, 
#volume, and adj close stock values for each day and company.
#uncomment this line to save a copy of the data
#stocks.reset_index().to_csv("assets/stocks.csv", index=False)
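
To see what get() hands back without touching the network, here is a minimal sketch that concatenates two tiny tables the same way. The frames fake_aapl and fake_msft and all of their prices are made up for illustration; only the pd.concat call with keys= and names= mirrors the function above.

```python
import pandas as pd

# Two tiny, made-up price tables indexed by date (values are illustrative only)
dates = pd.to_datetime(["2006-01-03", "2006-01-04"])
fake_aapl = pd.DataFrame({"High": [10.0, 11.0], "Low": [9.0, 10.0]}, index=dates)
fake_msft = pd.DataFrame({"High": [27.0, 27.5], "Low": [26.0, 26.5]}, index=dates)

# keys= labels each block with its ticker; names= labels the two index levels
combined = pd.concat([fake_aapl, fake_msft],
                     keys=["AAPL", "MSFT"],
                     names=["Ticker", "Date"])

# reset_index() turns both index levels back into ordinary columns,
# which is the flat shape that gets written out with to_csv
flat = combined.reset_index()
print(flat)
```

The result has one row per ticker and date, which is why the CSV version of the data has plain Ticker and Date columns.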

In [None]:
#You can also find the data in the course material if you have any issues with this part
#So if you prefer, you can also just load in the data from a csv.

import pandas as pd
import numpy as np
import matplotlib as mpl
mpl.get_backend()
import matplotlib.pyplot as plt

#Load the stocks.csv dataset
stocks = pd.read_csv("assets/stocks.csv")

In [None]:
#To manipulate how a date is formatted, we can import Python's native datetime module, although
#pandas' own to_datetime function does the conversion we need below
from datetime import datetime

#Once we have that, we can make sure our date is in the correct format
stocks["Date"] = pd.to_datetime(stocks["Date"])

#and create a new variable to hold the year, formatted as a four-digit string.
stocks["Year"] = stocks["Date"].dt.strftime("%Y")

#Let's also copy the Ticker column into a new column called Company
stocks["Company"] = stocks["Ticker"]

#cool, let's check it out
stocks.head()
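
As a quick check on the date handling, here is a small sketch on a made-up two-row frame. The toy frame and the extra YearInt column are assumptions for illustration and are not part of the stocks data; the sketch only shows that strftime("%Y") gives the year as a string while .dt.year gives it as an integer.

```python
import pandas as pd

# A made-up frame with dates stored as plain strings, as read_csv would give us
toy = pd.DataFrame({"Date": ["2006-01-03", "2015-12-31"], "High": [10.0, 110.0]})

toy["Date"] = pd.to_datetime(toy["Date"])      # strings -> datetime64 values
toy["Year"] = toy["Date"].dt.strftime("%Y")    # year as a four-digit string
toy["YearInt"] = toy["Date"].dt.year           # the same year as an integer

print(toy.dtypes)
```

Either form works as a pivot index; the string form is what the rest of this walkthrough uses.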

In [None]:
#Ok, so that worked nicely, next let's pull out the Company, Year, and "High" stock price value for each
#company so that we can focus on that

stockshigh = stocks[["Company", "Year", "High"]] # We can call the new data frame "stocks high".

#and check it out
stockshigh.head()

In [None]:
#Unfortunately, we can't supply the whole DataFrame to matplotlib's imshow directly, since a heatmap expects
#company names as columns, years as the index, and the High price as values.

#If you are familiar with Microsoft Excel, you might have experience using pivot tables. Pivot tables are a
#powerful technique to summarise the levels or values of a particular variable

#Well, we can also do this using pandas' .pivot_table function. We'll just create a new DataFrame called
#stockshigh_pivot by calling the function on our stockshigh DataFrame, specifying that we want our index to be
#Year and our columns to be Company.

#Inside the pivot table we can also pass an aggregation function, here the mean, to get the mean high
#stock price for each year and each company.

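#values="High" keeps the columns as plain company names, and "mean" averages within each Year/Company pair
stockshigh_pivot = stockshigh.pivot_table(index="Year", columns="Company", values="High", aggfunc="mean")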

stockshigh_pivot = pd.DataFrame(stockshigh_pivot) #and put it in correct form

stockshigh_pivot.head()
#This is giving us the mean high stock value for unique combinations of year and company.
#This way we can use the heatmap to explore patterns in each company's high stock price across each Year.
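
To make the pivot step concrete, here is a minimal sketch on a six-row frame. The long_form frame and its prices are invented for illustration; the call has the same shape as the one above, with values="High" spelled out.

```python
import pandas as pd

# Made-up long-form data: one row per observation
long_form = pd.DataFrame({
    "Company": ["AAPL", "AAPL", "AAPL", "IBM", "IBM", "IBM"],
    "Year":    ["2006", "2006", "2007", "2006", "2007", "2007"],
    "High":    [10.0, 12.0, 20.0, 80.0, 100.0, 110.0],
})

# One row per Year, one column per Company, each cell the mean of High
wide = long_form.pivot_table(index="Year", columns="Company",
                             values="High", aggfunc="mean")
print(wide)
```

The two AAPL rows for 2006 collapse into a single cell holding their mean, which is exactly the year-by-company grid a heatmap needs.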


In [None]:
#Let's set up the visualization by first creating our fig and ax objects using matplotlib's plt.subplots()
#function. Since plt.subplots() returns a tuple containing a figure and axes object(s),
#we can unpack this tuple into the variables fig and ax, which is useful if we want to change figure-level
#attributes.

#For instance, we can specify the figure width and height using the figsize argument.
#This creates a figure object that is 20 inches wide and 15 inches tall.

fig, ax = plt.subplots(1,1, figsize=(20,15))

#The heatmap itself is an imshow plot, so we will just pass it the DataFrame stockshigh_pivot and set the
#colormap with cmap to blue-purple shades.
im = plt.imshow(stockshigh_pivot, cmap ="BuPu")

#Next, we'll set the x ticks to be the company names, and the y ticks to be the year.
plt.xticks(range(len(stockshigh_pivot.columns)), stockshigh_pivot.columns)
plt.yticks(range(len(stockshigh_pivot)), stockshigh_pivot.index)

#The cbar variable holds our color bar, which we build from the im image we just created.
#We'll need to use the fraction and pad arguments to scale this to a size we want.
#The fraction sets how much of the original axes the bar takes up, and the pad controls how much white space
#there is between the figure and the bar.
cbar = plt.colorbar(im, fraction=0.086, pad=0.04)

#Finally, we'll just give the figure axes labels and a title

ax.set_title("Heatmap of High Stock Price from 2006 to 2016")
ax.set_xlabel("Company")
ax.set_ylabel("Year")
#Ok, so now we have a categorical heatmap that allows us to compare the high stock value over time
#within and across each company.
#The first thing I'm noticing is that this isn't a very useful figure.
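
The mechanics of the plot can be sketched on a tiny array using the Axes methods instead of the plt functions. The values and the tickers AAA, BBB, and CCC are hypothetical, and the fraction and pad numbers are just one reasonable choice, not the ones used above.

```python
import numpy as np
import matplotlib.pyplot as plt

# 2 rows (years) x 3 columns (companies); all values are made up
values = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
row_labels = ["2006", "2007"]
col_labels = ["AAA", "BBB", "CCC"]   # hypothetical tickers

fig, ax = plt.subplots(figsize=(4, 3))
im = ax.imshow(values, cmap="BuPu")  # one colored cell per array entry

# Cell centers sit at integer positions 0, 1, 2, ... so the ticks go there
ax.set_xticks(range(len(col_labels)))
ax.set_xticklabels(col_labels)
ax.set_yticks(range(len(row_labels)))
ax.set_yticklabels(row_labels)

# Attach the color bar to this Axes explicitly
fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
```

Working through ax and fig rather than plt makes it explicit which Axes each call affects, which helps once a figure has more than one subplot.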

In [None]:
#Alright, so hopefully you took a moment to ponder that and noticed that the values for ^GSPC (the S&P 500)
#are significantly higher than those of the companies. As a result, we aren't really seeing the benefits of
#the color coding on the heatmap visualization.
#Scaling each column to z-scores (subtracting its mean and dividing by its standard deviation) puts every
#column on a common scale, which helps for visualization.

#So, let's norm the values and plot this again.

#load in zscore
from scipy.stats import zscore
stockshigh_pivot_norm = stockshigh_pivot.apply(zscore) #Creating normed dataframe

#and plot the normed values
fig, ax = plt.subplots(1,1, figsize=(20,15))
im = plt.imshow(stockshigh_pivot_norm, cmap = "BuPu")

plt.xticks(range(len(stockshigh_pivot_norm.columns)), stockshigh_pivot_norm.columns)
plt.yticks(range(len(stockshigh_pivot_norm)), stockshigh_pivot_norm.index)
cbar = plt.colorbar(im, fraction=0.086, pad=0.04)

ax.set_title("Heatmap of Normalized High Stock Price from 2006 to 2016")
ax.set_xlabel("Company")
ax.set_ylabel("Year")

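#Ok, now we can see that the high stock price has been increasing over time for all companies, because the
#boxes are getting darker over time. We can also compare companies and see that, for instance, IBM's
#normalized high stock value was much lower than the others' in 2015.
#Keep in mind that each column was scaled separately, so the colors show how a year compares with that
#company's own decade, not with raw prices.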
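
To see what the normalization does, here is a minimal sketch that computes z-scores by hand with pandas on a made-up frame. The toy frame, its AAA and INDEX columns, and all numbers are assumptions for illustration; the formula uses the population standard deviation (ddof=0), which is also the default in scipy.stats.zscore.

```python
import pandas as pd

# Made-up yearly means; INDEX sits on a much larger scale than AAA
toy = pd.DataFrame(
    {"AAA": [10.0, 20.0, 30.0], "INDEX": [1200.0, 1500.0, 1800.0]},
    index=["2006", "2007", "2008"],
)

# z-score each column: subtract the column mean, then divide by the
# population standard deviation (ddof=0)
toy_norm = (toy - toy.mean()) / toy.std(ddof=0)
print(toy_norm.round(3))
```

Both columns rise by the same relative pattern, so after scaling they become identical even though their raw values differ by two orders of magnitude. That is why the large-valued column no longer dominates the color scale.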