**Develop a data visualization tutorial that teaches the reader about a data visualization tool, technique, or package that you find particularly useful/interesting. The focus of this assignment can take a few different forms:**

•	*Data Visualization Tool* – There are dozens of data visualization tools that we will discuss in class but will not have a chance to explore in depth. Create a tutorial that teaches the reader about the tool, its key features, as well as a demonstration to showcase how you can use the tool to create an interesting visualization.

•	*Data Visualization Package* – There are many additional data visualization packages for Python and R that we will not have time to explore. Provide an overview of a lesser known or utilized package, its primary benefits/advantages, as well as some interesting visualizations you created using the package with accompanying code.

•	*Data Visualization Technique* – If there is a data visualization technique that you are excited about or feel is particularly impactful, your assignment can focus on why others should use this technique more often. Please include example visualizations with code to illustrate your point.

**Regardless of your focus, your assignment should contain the following information:**

•	Be from 1500-2000 words

•	Be written so it is accessible to a broad audience

•	Includes screenshots/visualizations when appropriate

•	Provides commented code if producing visualizations from an Open Source package

**While not required, I would encourage many of you to post your completed tutorial to your public GitHub page and/or submit it to a data science blog such as Towards Data Science, KDnuggets, or another online resource.**


# Static Plots are *so* 2019: How to Create Interactive Plotly Graphs and Publish them to your Website using BlogDown

Many of the data visualization tools commonly used in the data science community (including ggplot, matplotlib, and seaborn) do not offer robust interactive features. When working with data that lends itself to interactivity, I often rely on the Plotly Python package as an alternative. This article provides an overview of Plotly and its key features, as well as a demonstration of how to use Plotly to build interesting, interactive graphs.

But wait, there's more! We'll also walk through the steps to publish your fancy new interactive graphs to your personal website using the popular Blogdown package in R.

## What is Plotly?
Plotly is a data visualization tool available in multiple languages. The Plotly Python library is an open-source package often lauded for how customizable its visualizations are. However, some of the syntax within Plotly can be confusing for a first-time user (or even a long-time user). Luckily, Plotly offers the Plotly Express library to help us create cool interactive visualizations with simpler syntax than the original Plotly package.

## Installing Plotly

We can install Plotly by running the following shell command from a Jupyter Notebook cell (the leading `!` passes the line to the shell):

In [1]:
!pip install plotly

Collecting plotly
  Downloading plotly-4.12.0-py2.py3-none-any.whl (13.1 MB)
Collecting retrying>=1.3.3
  Using cached retrying-1.3.3.tar.gz (10 kB)
Building wheels for collected packages: retrying
  Building wheel for retrying (setup.py) ... done
  Created wheel for retrying: filename=retrying-1.3.3-py3-none-any.whl size=11430 sha256=19121637c211f3d680d54fcffa3d613a96d65b6c9493e335a01786171bf72621
  Stored in directory: /Users/adamhearn/Library/Caches/pip/wheels/c4/a7/48/0a434133f6d56e878ca511c0e6c38326907c0792f67b476e56
Successfully built retrying
Installing collected packages: retrying, plotly
Successfully installed plotly-4.12.0 retrying-1.3.3


Now that we've installed the package, we import it:

In [13]:
# these are the packages we'll be working with today
import numpy as np
import pandas as pd
import plotly as py
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot

# load plotly.js from the CDN so figures render inside the notebook
init_notebook_mode(connected=True)

Now we're ready to work with some real data! For this exercise, we'll be using data from a research project of mine: mapping Fall 2020 modes of instruction of 4-year colleges and universities:

In [38]:
dta = pd.read_csv('covid_college.csv')

# just the data we need
#dta = dta[['institution', 'lat', 'lon', 'act_avg', 'adm_rate', 'online','enrollment', 'endowment', 'public']]

dta

Unnamed: 0,unitid,institution,state,enrollment,public,private,sat_avg,act_avg,perc_oos,perc_sor,...,sen_rep,house_dem,house_rep,gov_dem,gov_rep,avg_cpc_jul,in_person,online,hybrid,policy
0,222178,Abilene Christian University,TX,3525.0,0,1,1129.0,24.0,0.13,0.32,...,0.612903,0.440000,0.560000,0,1,28.581084,1,0,0,Primarily in person
1,222831,Angelo State University,TX,9046.0,1,0,1051.0,21.0,0.03,0.03,...,0.612903,0.440000,0.560000,0,1,54.070145,0,0,1,Hybrid
2,222983,Austin College,TX,1294.0,0,1,1213.0,26.0,,0.17,...,0.612903,0.440000,0.560000,0,1,9.638675,1,0,0,Primarily in person
3,223232,Baylor University,TX,14108.0,0,1,1293.0,29.0,0.33,0.34,...,0.612903,0.440000,0.560000,0,1,40.953763,0,0,1,Hybrid
4,226091,Lamar University,TX,8697.0,1,0,1054.0,21.0,0.02,,...,0.612903,0.440000,0.560000,0,1,45.598425,1,0,0,Primarily in person
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
895,182281,University of Nevada-Las Vegas,NV,25282.0,1,0,1141.0,22.0,0.12,0.04,...,0.380952,0.666667,0.309524,1,0,36.490209,0,1,0,Primarily online
896,182290,University of Nevada-Reno,NV,17307.0,1,0,1179.0,23.0,0.26,0.01,...,0.380952,0.666667,0.309524,1,0,16.295994,0,0,1,Hybrid
897,102553,University of Alaska Anchorage,AK,12202.0,1,0,1119.0,21.0,0.07,,...,0.650000,0.375000,0.575000,0,1,14.280914,0,1,0,Primarily online
898,102614,University of Alaska Fairbanks,AK,6284.0,1,0,,,0.15,,...,0.650000,0.375000,0.575000,0,1,9.192894,0,0,1,Hybrid


Before we get into the more complicated GIS graphs (using the lat/long data), we'll start with some simpler ones. Like a **histogram**:

We'll be plotting the distribution of ACT scores in my sample of institutions:

In [40]:
# the data
data = [go.Histogram(x=dta['act_avg'],
                     nbinsx=40, # upper bound on the number of bins
                     histnorm='density')] # bar height = count / bin width

# the layout
layout = go.Layout(title="ACT Score Distribution",
                  xaxis=dict(title = 'Average ACT Score'),
                  yaxis=dict(title = 'Density'))

# the figure
fig = go.Figure(data = data,layout = layout)

#plot the figure
iplot(fig)

#save the figure
py.offline.plot(fig, filename='plots/my_first_plot.html')

'plots/my_first_plot.html'

Pretty easy, right? One thing to note: if you ran this code yourself, you'd be able to hover over each bar to see both its X (ACT score) and Y (density) values (see GIF below). This dynamic approach is especially valuable when you want clients to be able to interact with the graph.

We saved the graph to our local directory as an `.html` file, so we can soon post it to our fancy Blogdown website!

Now we can do a **scatter plot**: Let's look at the relationship between in-state and out-of-state tuition rates for public universities:

In [41]:
pub = dta[dta['public'] == 1]

data = [go.Scatter(x=pub['tuition_in'],
                   y=pub['tuition_out'],
                   text = pub['institution'],
                   mode = 'markers')]

# the layout 
layout = go.Layout(title="In-state vs. Out-of-state tuition rates: 4-year public institutions",
                  xaxis=dict(title = 'In-state'),
                  yaxis=dict(title = 'Out-of-state'))

# the figure
fig = go.Figure(data = data,layout = layout)

#plot the figure
iplot(fig)

#save the figure
py.offline.plot(fig, filename='plots/my_scatter_plot.html')

'plots/my_scatter_plot.html'

Getting the hang of it? Pretty neat, right? We'll get to more complicated graphs soon, but first, we can do better with this one.

Let's replicate the graph above to include colors for institutions that were fully online this Fall, as well as size by enrollment. 

In [42]:
pub = dta[dta['public'] == 1]


data = [go.Scatter(x=pub['tuition_in'],
                   y=pub['tuition_out'],
                   text = pub['institution'],
                   mode = 'markers',
                   marker=dict(size=pub['enrollment']/1000, # we need to scale this variable 
                               color=pub['online'],
                               colorscale = "Portland", # runs from blue (low values) to red (high values)
                               showscale=False))] 

# the layout 
layout = go.Layout(title="In-state vs. Out-of-state tuition rates: 4-year public institutions <br>\
                          Blue represents in-person instruction",
                  xaxis=dict(title = 'In-state'),
                  yaxis=dict(title = 'Out-of-state'))

# the figure
fig = go.Figure(data = data,layout = layout)

#plot the figure
iplot(fig)

#save the figure
py.offline.plot(fig, filename='plots/my_scatter_plot.html') # overwriting the previous file because
                                                      # this one is obviously better

'plots/my_scatter_plot.html'
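Dividing enrollment by a hand-picked constant works, but the right constant changes with the data. As a hedged alternative, Plotly's bubble chart documentation suggests setting `sizemode='area'` and computing `sizeref` from the largest value. The helper below is a sketch of that formula; `area_sizeref` and `max_marker_px` are names made up for this example.

```python
import math

def area_sizeref(sizes, max_marker_px=40):
    """Return a sizeref that draws the largest value about max_marker_px wide
    when the marker uses sizemode='area'."""
    valid = [s for s in sizes if not math.isnan(s)]  # ignore missing enrollments
    return 2.0 * max(valid) / (max_marker_px ** 2)

# Enrollment values from the first five rows of the table above
enrollments = [3525.0, 9046.0, 1294.0, 14108.0, 8697.0]
sizeref = area_sizeref(enrollments, max_marker_px=40)

# The marker would then be defined along these lines:
# marker=dict(size=pub['enrollment'], sizemode='area', sizeref=sizeref, sizemin=3)
```

With this approach the raw enrollment column is passed as `size`, and Plotly does the scaling.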

Now that we're dynamic plot pros, we're ready to move on to the big challenge: creating a GIS map of policy decisions by institution. Though it looks daunting, it's manageable once you understand what's going on "under the hood" in Plotly (which is what we've been practicing!).

First, we need to do a few preprocessing steps to get the data the way we want it. Remember how we hovered over each bubble above and the institution name came up? What if we wanted to add enrollment and policy decision to that as well? We create a new feature, `text`, to capture this information.

In [44]:
dta['text'] = dta['institution'] \
             + '<br>UG Enrollment ' \
             + dta['enrollment'].astype(str) \
             + '<br>' + dta['policy'].astype(str)
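Because `enrollment` is stored as a float, the string built above shows values like `14108.0`. The sketch below is one possible way to tidy the label. The helper name `hover_text` is hypothetical, and treating a missing enrollment as "n/a" is an assumption rather than something the original notebook does.

```python
import math

def hover_text(institution, enrollment, policy):
    """Build one hover label; <br> is the line break Plotly understands."""
    if enrollment is None or math.isnan(enrollment):
        enrollment_label = "n/a"                     # assumed placeholder for missing data
    else:
        enrollment_label = f"{int(enrollment):,}"    # 14108.0 -> "14,108"
    return f"{institution}<br>UG Enrollment {enrollment_label}<br>{policy}"

print(hover_text("Baylor University", 14108.0, "Hybrid"))
# Baylor University<br>UG Enrollment 14,108<br>Hybrid
```

To apply it to the whole dataframe, one option is a list comprehension over `zip(dta['institution'], dta['enrollment'], dta['policy'])`.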


Since we're working with a categorical outcome variable (`policy`), we need to do some special formatting. It will involve a couple of loops and can get pretty tricky. We start by finding the row position at which each new policy begins in the sorted dataframe, which we can do with the code below:

In [45]:
# keep only rows with one of the five known policies
known_policies = ["Fully in person", "Fully online", "Hybrid",
                  "Primarily in person", "Primarily online"]
dta = dta[dta['policy'].isin(known_policies)]

# sort by policy so that each policy forms one contiguous block of rows
dta = dta.sort_values(by = ['policy'])

loc = 0
changes = []
policies = []
for row in dta.itertuples():
    if row.policy not in policies:
        policies.append(row.policy)
        changes.append(loc)
    loc = loc + 1
    
print("Policies:", policies)
print("Changes:", changes)

Policies: ['Fully in person', 'Fully online', 'Hybrid', 'Primarily in person', 'Primarily online']
Changes: [0, 17, 98, 328, 575]


This means a new policy begins at rows 0, 17, 98, 328, and 575 of the sorted data, which has 850 rows in total. To create our color scale, we need to make sure each of these "switches" is accounted for, which we can do with a list of (start, end) tuples. This should do the trick:

In [46]:
i = 0
limits = []
for x in changes:
    if i != len(changes)-1:
        limits.append((x, changes[i+1])) # end is exclusive, matching Python slicing
    else:
        limits.append((x, dta.shape[0])) # till the end of the dataframe
    i = i + 1

limits

[(0, 17), (17, 98), (98, 328), (328, 575), (575, 850)]
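Since Python slices exclude their end index, the pairs are easiest to use when each group's end is simply the next group's start; `dta[start:end]` then keeps every row of the group, including the last one. A compact sketch of that idea, using the `changes` values printed above and the 850-row total (the name `n_rows` is introduced here for illustration):

```python
changes = [0, 17, 98, 328, 575]  # start row of each policy, from the output above
n_rows = 850                     # total number of rows in the filtered dataframe

# Pair every start with the next start; the last group ends at n_rows.
limits = list(zip(changes, changes[1:] + [n_rows]))
print(limits)
# [(0, 17), (17, 98), (98, 328), (328, 575), (575, 850)]
```

Each pair can be passed straight to a slice, and the group sizes add up to the full 850 rows.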

Nailed it. Last step before we get to the fun part! We just need to now define the colors, based on our sorted list of policies above:

In [47]:
colors = ["#780000", # fully in person
          "#00334f", # fully online
          "khaki", # hybrid
          "#ea6867", # primarily in person
          "#6491b2"] # primarily online

Now we're ready to plot our beautiful new map.

In [48]:
scale = 400

fig = go.Figure()

for i in range(len(limits)): # looping over each policy
    lim = limits[i] 
    dta_sub = dta.iloc[lim[0]:lim[1]] # subsetting the rows for the policy we want
    fig.add_trace(go.Scattergeo( # notice Scattergeo, which is used for GIS data
        locationmode = 'USA-states', # only matters when `locations` is used; the map region is set by `scope` below
        lon = dta_sub['lon'], # longitude
        lat = dta_sub['lat'], # latitude
        text = dta_sub['text'], # text variable we defined earlier
        marker = dict( # same marker syntax we worked with above!
            size = dta_sub['enrollment']/scale, # `scale` is defined at the top of this cell
            color = colors[i],
            line_color='rgb(40,40,40)',
            line_width=0.5,
            sizemode = 'area'
        ),
        name = policies[i]))

fig.update_layout(
        title_text = '2020 Fall Instruction Plans<br> \
                      4-year Postsecondary Institutions',
        showlegend = True,
        geo = dict(
            scope = 'usa',
            landcolor = 'rgb(217, 217, 217)', #multiple ways to define colors!
        )
    )

fig.show()
fig.write_html("plots/map.html")

And now we have an awesome interactive map of the United States, along with policy decisions of each 4-year institution in the data set. Beautiful!

So, now what? We can't just print this out and have the same features. What if you wanted to show it to your friends so they can hover over their tiny alma mater? To share it with our friends, we must host it on a website. Now on to **Part II: Publishing your Graphs.**

# Publishing your Interactive Plotly Graphs using Blogdown

Before I get into this section, I'll preface it by saying that you need to have already set up your personal website using Blogdown. Blogdown is an increasingly popular R package, typically used from RStudio, that allows you to build your own website. It is particularly popular within data science communities.

In many instances, the site is hosted on github.io, but it is possible to route it to your personal domain name. For example, I have already set up Blogdown and uploaded my personal content to www.achearn.com.

For guidance on how to create your own website using Blogdown in R, I recommend this tutorial: https://bookdown.org/yihui/blogdown/

If you've worked with Blogdown in the past, you likely know about the `public` and `static` directories. To summarize, the `public` directory holds the generated site, and is often linked to its own GitHub repository. This is where the "magic" happens and your website is published. The `static` directory, on the other hand, contains the files you want published on your website as-is; they are copied into `public` when the site is built. A typical Blogdown folder looks like the following:

[insert screenshot of website directory]

As you may have noticed above, we saved each of our interactive visualizations in `.html` format along the way, in a special `plots` folder within the working directory.

Now we just need to drag these graphs into our `static` directory in our `Website` folder (or, if you wanted to, you could have saved these graphs directly to the static folder!). 

[insert gif of dragging]
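If you would rather script this step than drag files by hand, the sketch below copies every `.html` file from one folder to another using only the standard library. The function name `copy_plots` and both example paths are assumptions about your folder layout, so adjust them to match your own site.

```python
import shutil
from pathlib import Path

def copy_plots(plots_dir, static_dir):
    """Copy every .html file from plots_dir into static_dir.

    Returns the names of the copied files in alphabetical order.
    """
    source = Path(plots_dir)
    target = Path(static_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the destination if needed
    copied = []
    for html_file in sorted(source.glob("*.html")):
        shutil.copy2(html_file, target / html_file.name)
        copied.append(html_file.name)
    return copied

# Example call; both paths are hypothetical:
# copy_plots("plots", "Website/static/files")
```

Files that are not `.html` are left alone, so notes or data stored next to the plots do not end up on the website.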

If you know Blogdown, you know you have to rebuild your site from R before pushing your changes to GitHub. We can do this with the R code below:

```{r}
blogdown::build_site(local=FALSE)
```

Last step! We just need to add, commit, and push our changes to GitHub from the `public` directory, which now includes our new graphs because we just ran the `build_site()` command above.

```
git add --all
git commit -m "adding plotly tutorial"
git push origin master
```

And voilà! Our new graphs are now published online, so you can send your friends or clients interactive graphs:

- Histogram: https://www.achearn.com/files/my_first_plot.html

- Scatter Plot: https://www.achearn.com/files/my_scatter_plot.html

- Map: https://www.achearn.com/files/map.html

I hope this tutorial has been useful for teaching users how to use Plotly, its key features, and how to utilize the package alongside Blogdown to create interactive visualizations to publish online.