## Tutorial 11: Example Analysis of American Authors

In this tutorial, we combine all of our skills from the semester so far
(as well as some new libraries) to carry out an actual analysis of data
from Wikipedia.

### Question of interest

In this tutorial we will focus on collecting Wikipedia data from a list of
American novelists. Our initial question of interest will be figuring out
the relationship between several metrics within the pages that all, in their
own way, indicate how prominently a given author is represented. We will see
how to store this data as a CSV file as well as how to produce interactive
graphics to explore the dataset.

### wiki.py

Last time we found a bug in my `wiki.py` code. Please re-download the script and
replace your original copy with the new version. The following loads the module and
checks that you have version 2 or higher (if there is no error, it works!):

In [None]:
import wiki

assert wiki.__version__ >= 2

### Getting the data

There is a list on Wikipedia of pages for American novelists. Please
start by looking at the page in your browser:

- https://en.wikipedia.org/wiki/List_of_American_novelists

Notice that many of the links on this page are to specific novels by
an author. There are also links at the top and bottom of the page before
the actual list starts. We don't want these in our analysis! In order
to just grab links to actual authors, we need to use regular expressions
and parse data directly from the page.

Start by loading the `re` module:

In [None]:
import re

Next, grab the page for the list of American novelists. We'll need the actual text
of the page; the first 1000 characters are printed here for reference:

In [None]:
data = wiki.get_wiki_json("List_of_American_novelists")
data_html = data['text']['*']
print(data_html[:1000])

In order to *just* get authors, a trick we can use on this page is to only find
links that come directly after the HTML tag `<li>` (a list item). This will avoid most of
the links we don't want, but will accidentally grab a few at the bottom of the page.
We will deal with those in a moment. Here is the regular expression that grabs the
pages of interest:

In [None]:
authors = re.findall('<li><a href="/wiki/([^"]+)"', data_html)
authors
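
To see why anchoring on `<li>` filters out the other links, here is a minimal sketch that applies the same regular expression to a small HTML fragment. The fragment is invented for illustration and is much simpler than the real page.

```python
import re

# Hypothetical HTML fragment, invented for illustration only
sample_html = (
    '<p>See also <a href="/wiki/American_literature">American literature</a>.</p>'
    '<ul>'
    '<li><a href="/wiki/Mark_Twain" title="Mark Twain">Mark Twain</a>, '
    '<i><a href="/wiki/Adventures_of_Huckleberry_Finn">Huckleberry Finn</a></i></li>'
    '<li><a href="/wiki/Toni_Morrison" title="Toni Morrison">Toni Morrison</a></li>'
    '</ul>'
)

# Same pattern as above: only links that come directly after <li> match,
# and the parentheses capture everything up to the closing quote
pattern = '<li><a href="/wiki/([^"]+)"'
sample_authors = re.findall(pattern, sample_html)
print(sample_authors)
```

The link inside the `<p>` tag and the link to the novel (which follows `<i>`, not `<li>`) are both skipped, so only the two author pages are returned.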

The list of authors includes over 1600 pages.

In [None]:
len(authors)

Notice that the list includes a few links at the bottom that we do not actually
want in our data.

In [None]:
authors[-40:]

We will cut off the list of authors manually with the following code (similar
to how I cut out the header and footer of the raw HTML code in Tutorial 6).

In [None]:
authors = authors[:(authors.index('Leane_Zugsmith') + 1)]
authors[-10:]
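
The slicing step can be tried on its own with a short list. This is only a sketch: the list below is a hypothetical stand-in for the real `authors` list, and the last two entries are made-up examples of unwanted trailing links.

```python
# Hypothetical stand-in for the real list of links
links = ['Mark_Twain', 'Leane_Zugsmith', 'Lists_of_writers', 'American_literature']

# index() gives the position of the last entry we want to keep;
# adding 1 makes the slice include that entry
cutoff = links.index('Leane_Zugsmith') + 1
trimmed = links[:cutoff]
print(trimmed)
```

Note that `list.index` raises a `ValueError` if the name is not in the list, so the cut-off name must match a link on the page exactly.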

Now that we have our list of authors, let's grab them all (or verify
that we have all of the links already).

In [None]:
for link in authors:
    wiki.get_wiki_json(link)

### Page metrics

Now that we have the pages for each of the authors, we want to gather a number
of metrics about each page. To write code that does this, typically I start by
playing around with a single page and *then* wrap it all up in a `for` loop.

For example, I wrote and tested the following code to figure out several metrics
of interest:

In [None]:
# load a single page of data
data = wiki.get_wiki_json("Mark_Twain")

In [None]:
# (1) get the title of the page
data['title']

In [None]:
# (2) determine the number of links to other languages
len(data['langlinks'])

In [None]:
# (3) determine number of internal links
len(data['links'])

In [None]:
# (4) determine number of characters in the text of the page
len(data['text']['*'])

In [None]:
# (5) determine number of external links
len(data['externallinks'])

You can use similar code to compute other metrics, such as the number of
images used on the page.
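
As a sketch of one such additional metric, the block below counts the images on a mock page. It assumes the page dictionary has an `images` key holding a list of file names; whether the data returned by `wiki.get_wiki_json` includes that key is an assumption, so check `data.keys()` on a real page first. The mock dictionary is invented for illustration.

```python
# Mock page, invented for illustration; the 'images' key is an assumption
mock_page = {
    'title': 'Example Author',
    'images': ['Portrait.jpg', 'Signature.svg', 'Book_cover.png'],
}

# .get() falls back to an empty list if the key is missing,
# so pages without the key are counted as having zero images
num_images = len(mock_page.get('images', []))
print(num_images)
```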

### Aggregating metrics

Now, we will use a `for` loop to collect the metadata and metrics
described in the prior section for each Wikipedia page in our corpus.
We cycle through each author, appending the new metric values to the
lists at the top of the code block. Fill in the body of the `for` loop so
that it appends each page's metrics to the corresponding list.

In [None]:
author_name = []
num_langs = []
num_links = []
num_chars = []
num_elinks = []

for link in authors:
    data = wiki.get_wiki_json(link)
    
    # WRITE YOUR CODE HERE
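
If the append pattern is unfamiliar, here it is in isolation on toy data unrelated to Wikipedia. This is only a sketch of the general technique; the variable names are made up for the example.

```python
# Toy input, invented for illustration
words = ["novel", "essay", "poem"]

# One empty list per value we want to collect
word_upper = []
word_len = []

for word in words:
    # Each pass through the loop adds exactly one entry to every list
    word_upper.append(word.upper())
    word_len.append(len(word))

print(word_upper)
print(word_len)
```

Because every list grows by one entry per iteration, position `i` in each list refers to the same input item. That alignment is what lets us combine the lists into a table later.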

Now, it will be useful to put all of this data together in a single table.
The standard library for working with tabular data in Python is called 
**pandas**, which we import here:

In [None]:
import pandas as pd

The object that stores tabular data in pandas is called a `DataFrame`
(yes, it's based on the data frame object native to R). There are many
ways to build a data frame object from a collection of lists, but this
block below illustrates my favorite method using an `OrderedDict`. Below,
I'll print out a copy of the table (notice that it prints nicely in
the Jupyter notebook).

In [None]:
import collections

df = collections.OrderedDict()
df['author_name'] = author_name
df['url'] = authors
df['num_langs'] = num_langs
df['num_links'] = num_links
df['num_chars'] = num_chars
df['num_elinks'] = num_elinks

df = pd.DataFrame(df)
df
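
Here is a minimal sketch of the same construction on toy lists, so the mapping from lists to columns is easy to see. The values are invented for illustration. It also shows that, in Python 3.7 and later, a plain `dict` keeps its insertion order, so it gives the same column order as an `OrderedDict`.

```python
import collections

import pandas as pd

# Toy lists, invented for illustration; all must have the same length
names = ["Author A", "Author B"]
langs = [12, 3]
chars = [45000, 8000]

# Each key becomes a column name; each list becomes that column's values
ordered = collections.OrderedDict()
ordered['author_name'] = names
ordered['num_langs'] = langs
ordered['num_chars'] = chars
toy_df = pd.DataFrame(ordered)

# A plain dict produces the same columns in the same order
plain_df = pd.DataFrame({'author_name': names,
                         'num_langs': langs,
                         'num_chars': chars})

print(toy_df)
print(list(toy_df.columns) == list(plain_df.columns))
```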

Pandas has a convenient method for storing a table of data as a CSV (comma
separated values) file. Running the code below will save the table as the
file "american_authors.csv"; it can be read into programs such as Excel,
Google Sheets, and other programming languages.

In [None]:
df.to_csv("american_authors.csv", index=False)

If you open the file browser, you'll see the CSV file show up in your 'tutorials'
directory. You can similarly read a CSV file back into Python using the `pd.read_csv`
function:

In [None]:
new_df = pd.read_csv("american_authors.csv")
new_df
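
To see what `index=False` does, the sketch below writes a toy table to a string rather than a file (calling `to_csv` without a path returns the CSV text) and then reads it back. The table values are invented for illustration, and `io.StringIO` is used here only so that no file is created.

```python
import io

import pandas as pd

# Toy table, invented for illustration
toy_df = pd.DataFrame({
    "author_name": ["Author A", "Author B"],
    "num_langs": [12, 3],
})

# With the default index=True, the row labels are written as an
# extra unnamed first column; index=False leaves them out
with_index = toy_df.to_csv()
without_index = toy_df.to_csv(index=False)
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the text back recovers the same columns and values
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip)
```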

### Plotting data

Another useful feature of the pandas library is that it makes it easy to
produce plots of the data stored within a table. Here is some example
code for producing a scatter plot from our pandas table:

In [None]:
%matplotlib inline
import matplotlib

In [None]:
matplotlib.rcParams['figure.figsize'] = (20.0, 10.0)

In [None]:
df.plot.scatter(x='num_langs',
                y='num_links',
                c='num_chars',
                colormap='viridis')

You can modify the `figure.figsize` parameter based on the size of your
computer screen.

### Interactive plotting with Bokeh

The static plot above is okay, but it's much more interesting to create an 
interactive graphic. To do this we will use the `bokeh` module. Load the
required functions from `bokeh` below and specify that the output should
appear within the Jupyter notebook.

In [None]:
from bokeh.plotting import figure, show, output_notebook, ColumnDataSource

output_notebook()

Now, the code block below produces an interactive scatter plot. You can pan and
zoom the plot depending on what parts of the plot you find interesting. Also,
perhaps most importantly, if you hover over a point the name of the author associated
with the point will show up. Try it!

In [None]:
TOOLTIPS = [
    ("Author", "@author_name"),
    ("Number Internal Links", "@num_links"),
    ("Number External Links", "@num_elinks"),
]

p = figure(plot_width=950,
           plot_height=500,
           tooltips=TOOLTIPS,
           tools="hover,pan,wheel_zoom,reset,tap",
           toolbar_location="below",
           toolbar_sticky=True,
           active_scroll='wheel_zoom',
           title="American Authors - Wikipedia Data",
           x_axis_label="Number of Language Pages",
           y_axis_label="Number of Internal Links")

p.circle(x='num_langs',
         y='num_links',
         size=10,
         fill_alpha=0.5,
         source=ColumnDataSource(data=df))

show(p)

You won't understand all of the components of the plot immediately, but hopefully the
example shows enough so that you could modify the plot to include different variables
or a different set of information when hovering over the points.

Finally, the plot below makes use of the `OpenURL` and `TapTool` models to make the
points clickable. Tapping on a point will open the Wikipedia page in a new tab. Try it
now!

In [None]:
from bokeh.models import OpenURL, TapTool

TOOLTIPS = [
    ("Author", "@author_name"),
    ("Number Internal Links", "@num_links"),
    ("Number External Links", "@num_elinks"),
]

p = figure(plot_width=950,
           plot_height=500,
           tooltips=TOOLTIPS,
           tools="hover,pan,wheel_zoom,reset,tap",
           toolbar_location="below",
           toolbar_sticky=True,
           active_scroll='wheel_zoom',
           title="American Authors - Wikipedia Data",
           x_axis_label="Number of Language Pages",
           y_axis_label="Number of Internal Links")

p.circle(x='num_langs',
         y='num_links',
         size=10,
         fill_alpha=0.5,
         source=ColumnDataSource(data=df))

taptool = p.select(type=TapTool)
taptool.callback = OpenURL(url="https://en.wikipedia.org/wiki/@url")

show(p)

Rather than a formal practice set of questions, today you will start working on your
first project, which uses the methods developed in this tutorial to analyze a new dataset.

### More practice

Feel free to check out the bokeh reference guide (note: it's huge!):

- https://bokeh.pydata.org/en/latest/docs/reference.html

In particular, check out the Gallery and demos. If you are interested in
data visualization, there will be several chances to build out interesting
bokeh-based applications later this semester.