## Tutorial 06: Using Requests and Regular Expressions to Count Words

Here we use the `requests` library to grab live web data from Wikipedia as an HTML page.
Using the regular expressions you saw in the prior tutorial, you'll remove all of the special
formatting and count the most frequent words found on the page. This will be our first chance
to do something with the actual data from Wikipedia.

### Modules

To start, we will load both the `re` module and the `requests` module. The second
is what we will use to download web pages into Python.

In [None]:
import re
import requests

### Making a request

We will all start by grabbing the Wikipedia webpage associated with the University
of Richmond. At the end of the tutorial, you'll be able to grab a website of your
own choosing. I suggest opening the Wikipedia page in another tab so that you can
compare the website with the extracted code in Python.

To make a "request" using the `requests` module, we use the function `get` and pass
it the full URL to the page, like this:

In [None]:
url = 'https://en.wikipedia.org/wiki/University_of_Richmond'
r = requests.get(url)
r

You'll notice that the object that is returned, called `r` here, does not
print out anything resembling the actual website. Instead, we just get a
message that should say `<Response [200]>` (if not, you have a problem; perhaps
a network connectivity issue). What this means is that the request was processed
and returned the [HTTP status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
of 200. This indicates that the request was processed with a status of OK; more
verbosely:

> Standard response for successful HTTP requests. The actual response will depend
> on the request method used. In a GET request, the response will contain an entity
> corresponding to the requested resource. In a POST request, the response will contain
> an entity describing or containing the result of the action.
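
As a small aside that is not part of the original tutorial, Python's standard library carries the same table of codes in `http.HTTPStatus`. The sketch below looks up the reason phrase for a few codes without touching the network; with the response from above, the integer code itself is available as `r.status_code`.

```python
from http import HTTPStatus

# Look up the standard reason phrase for a few common status codes.
for code in (200, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)

# The phrase that goes with the code we hope to see.
ok_phrase = HTTPStatus(200).phrase
```

Any code other than 200 is a hint that the rest of the tutorial will be working with an error page rather than the article.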

The response given by the website contains a number of elements. For example, there
is the "HTTP header" that contains metadata about the HTTP request:

In [None]:
r.headers

The part that we are most interested in, however, is the text of the response. This is
given by the attribute `text`, which we can print out as follows:

In [None]:
print(r.text)

The text above is written in a markup language called HTML; rendered in your browser it
yields the pretty website that you are used to seeing when you navigate to Wikipedia.

In Python, this text is just stored as a very long string object similar to the strings
you saw in Tutorial 3. We will now make use of string methods and regular expression 
functions to process the string and extract the individual words.

### Cleaning HTML code

To start, we will save the request text as a variable called `website`. Just
to simplify the processing, in the code below I have also removed the first
set of lines corresponding to the HTML header and an embedded JavaScript chunk
at the bottom of the text.

In [None]:
website = r.text
website = website[website.find("<body"):website.find("<noscript>")]
print(website)

Make sure that you scroll through some of the text below; the top is a bit noisy, but
you should see the text of the page hidden within the HTML tags. 

Now, it's your turn to start writing some code. In the block below overwrite the variable
`website` by removing all HTML tags from the original string. Print out the `website`
variable at the end of the block of code.
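
If you are unsure where to begin, here is one possible pattern tried on a short, made-up HTML snippet rather than on the real page. The names `sample` and `no_tags` are placeholders for illustration only; treat this as a sketch of the idea, not the only correct answer.

```python
import re

# Hypothetical sample; the real page is far longer.
sample = ('<p>The <b>University of Richmond</b> is a '
          '<a href="/wiki/Private_university">private</a> university.</p>')

# Match "<", then one or more characters that are not ">", then ">".
no_tags = re.sub(r'<[^>]+>', '', sample)
print(no_tags)
```

Replacing each tag with an empty string can glue together words that were separated only by tags; replacing with a space instead avoids that, at the cost of extra spaces that get cleaned up in a later step.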

Your output should already look a lot closer to the raw text on the website.
Recall that using the function `print` makes newline characters look
nice. It also makes a TAB (represented by the symbol `\t`) appear nicely.
To see this, look at the raw string `website` by running the code below; mentally
compare it to the printed version above.

In [None]:
website
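
The same difference can be seen on a tiny string. This is a sketch with a made-up `sample` value: `print` renders the tab and newline, while `repr` (which is what the notebook shows when a cell ends with a bare variable name) displays them as escape sequences.

```python
# Hypothetical sample containing one newline and two tabs.
sample = "name\tcity\nRichmond\tVirginia"

print(sample)        # tab and newline are rendered as whitespace
print(repr(sample))  # tab and newline are shown as \t and \n

raw = repr(sample)
```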

Let's now replace all copies of the special characters `\n`, `\r`, and `\t`
with a single space in the variable `website`. Make sure to save the result again
as the variable `website`. (Note: You can do this with three separate calls to
`re.sub`; try to do it with just a single line.)
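
One way to handle all three characters at once is a character class. The sketch below uses a made-up `sample` string, so you still need to adapt it to `website` yourself.

```python
import re

# Hypothetical sample with a tab, a carriage return, and two newlines.
sample = "University\tof\r\nRichmond\n"

# The class [\n\r\t] matches any single one of the three characters.
flat = re.sub(r'[\n\r\t]', ' ', sample)
print(repr(flat))
```

Each matched character becomes its own space, so `\r\n` turns into two spaces; collapsing runs of spaces is handled in a later step.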

Now we are really getting close to the raw text on the page!

As a next step, use the string method `.lower()` to convert the website
text to lower case. This will help later so that words like "School"
and "school" are not counted differently. Print out the result again
just to check your code.

There are still a number of special formatting marks, as well as punctuation
and other special characters, in this text. As a simple solution, in the
code block below write a call to the `re.sub` function that replaces anything
that **is not** a lower case letter with a space. (Hint: you did this exact
thing in Tutorial 5). Once again, print out the result at the end of the code
block.
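
As a reminder of how a negated character class behaves, here is a sketch on a made-up sentence (the names `sample`, `lowered`, and `letters_only` are illustrative). It lower-cases first, then replaces every character outside `a` to `z` with a space.

```python
import re

# Hypothetical sample with capitals, an apostrophe, a digit, and punctuation.
sample = "The School's 3 libraries are open!"

lowered = sample.lower()

# [^a-z] matches any single character that is NOT a lower case letter.
letters_only = re.sub(r'[^a-z]', ' ', lowered)
print(letters_only)
```

Note that spaces are themselves "not a lower case letter", so they are replaced by spaces too, which leaves them unchanged. The string keeps its original length.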

As a final step in cleaning the output, notice that the cleaning process has
left many places where words are separated by a long run of spaces. Use a
regular expression to convert any sequence of spaces into a single space. And
again, print out the result.
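
The `+` quantifier is the usual tool for "one or more of the same thing". A minimal sketch on a made-up string:

```python
import re

# Hypothetical sample with a run of three spaces and a trailing space.
sample = "the school s   libraries are open "

# " +" matches one or more consecutive spaces.
single = re.sub(r' +', ' ', sample)
print(repr(single))
```

A single leading or trailing space survives this step; the string method `.strip()` removes it if you want a completely clean result.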

Now you should have a nice clear version of just the words in the Wikipedia
page. Yay! See how awesome regular expressions can be!

### Extracting Words

Now that we have the raw text as a single string, we will use the
function `re.split` to split apart the individual words. Do this in the block
below, saving the result as a variable called `words`; print out the words with
the print function at the end of the code block.
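
Here is a sketch of `re.split` on a short made-up string, using the illustrative names `sample` and `sample_words` so as not to clash with your own `words` variable. It also shows why stray spaces at the ends matter.

```python
import re

# Hypothetical cleaned text with a leading and a trailing space.
sample = " the school s libraries are open "

# Splitting as-is yields empty strings at both ends.
with_empties = re.split(r' ', sample)

# Stripping first gives only the words.
sample_words = re.split(r' ', sample.strip())
print(sample_words)
```

If empty strings show up in your own result, they will later be counted as a "word", so it is worth stripping the text before splitting.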

As we move forward in the course, we will see a number of things that can be done with these
words such as building predictive and generative models. For today, let's just find the most
frequently used words on the page. To do this with a minimal amount of code, we will load a
function called `Counter` from the module `collections`:

In [None]:
from collections import Counter

If you saved the result above as a variable called `words`, as instructed, the
following code will then spit out the 30 most common words in the text along
with their counts. You can of course change the number 30 to anything you would
like, but 30 seems to work well for this exercise.

In [None]:
Counter(words).most_common(30)
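
To see what `most_common` returns, here is the same call on a tiny made-up list (`sample_words` is an illustrative name): a list of `(word, count)` pairs sorted from most to least frequent.

```python
from collections import Counter

# Hypothetical list of words standing in for the real `words` variable.
sample_words = ["the", "school", "of", "the", "arts", "the", "school"]

counts = Counter(sample_words)

# Ask for the two most frequent words and their counts.
top = counts.most_common(2)
print(top)
```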

Are these the words you would have expected to be the most common on the University
of Richmond Wikipedia page? Why or why not?

**Answer**:

### Wrapping it all up

In this tutorial I broke down all of the steps in requesting, cleaning, and counting
the most frequent words from a page on Wikipedia. The entire process when combined
requires only a total of about 10 lines. In the code block below I want you to put
all of the steps together, with the page URL at the top (here I put a new URL, the
one to the page about Marxism). At the end of the block the 30 most common words on
the page should show up.

In [None]:
url = 'https://en.wikipedia.org/wiki/Marxism'

# Put all of your code from above here and remove this comment

Once you have the code tested and working, try to input several other Wikipedia pages
and begin exploring what you see in the data. (Note: This should now be easy as you have
only to run a single block of code). Are there any interesting patterns or missing words
that start to show up?

**Answer**: