<br><br>
## <center> Lab 3: Web Scraping and Using a Word Embedding </center>##

#### <center> Working without APIs to get text and then one way of analyzing it.</center> ####
<br><br>
This notebook first works through a simple example of webscraping in python. APIs are the best way to get data, but sometimes they don't exist. Instead, you get the information by requesting the webpages your internet browser would get and design a simple program to pull out the useful parts. Often you'll actually be getting pages and dismantling them to find what pages to ask for next. Doing this is a form of what people call crawling, figuring out what data is out there and where is it because nobody is going to be able to, or want to, tell you the answer.<br> 
<br>
You'll start to understand webscraping requires you to be scrappy because there is no universal way of making a webpage. You have to reverse engineer what the web designer/programmer did to pack the underlying data into the nice formatting you see in a web browser. You'll learn there is a lot packed into the HTML document that becomes the page you see. We'll only just touch the surface of web design and what you might have to do to get the data you want. <br><br>

The second part of the lab is the analysis of our scraped text using a word embedding. This can start to uncover patterns in the text. A word embedding is a projection of a vocabulary into a high (often 300) dimensional space so that we can explore the "spatial" relationships between words. You can create your own word embedding, but you need far more input data than what we'll be collecting. Instead, we'll be using an existing word embedding to look for relationships in Bob Dylan's lyrics. This is only one way of analyzing text quantitatively, but it is a great place to start because it introduces many of the concepts and is tractable. 
<br><br>

<h4>Bob Dylan's Lyrics</h4>
To learn the basics of scraping, we are going to download all of Bob Dylan's lyrics, which someone has conveniently posted on a single website. Other than being a fairly straightforward example of scraping, it <em>might</em> be interesting to explore the lyrics through a word embedding.
<br>
<br>
First we import the module/package `Requests`, which will allow us to get webpages from the internet; 

In [7]:
import requests

If you're missing a module/package you'll need to get it. The easiest way is from the terminal/command line. If you're running python 2.7 use `pip`. If you're running python 3 use `pip3`. If you're using Anaconda, you should have the package already.<br>
Ex:<br>
`pip install requests`<br>
`pip3 install requests`

<br>
We supply the module's `get` function with the page we want, here <em><a href=http://bobdylan.com/songs>bobdylan.com/songs</a></em>. It will return the same page your internet browser would get. We store it in a variable. I named it `homepage`, but you can name it anything that is not a <a href=https://docs.python.org/2.5/ref/keywords.html>python keyword</a>. You just have to be consistent with whatever variables names you use.

In [8]:
homepage = requests.get("http://bobdylan.com/songs")

This function actually returns a <em>Response</em> object, which contains metadata in addition to the HTML used to render the page. To get to just the HTML, we need to look at the `text` field of the response object, i.e. `homepage.text`

If you look at this page in your browser, you'll see it is a long list of songs and not the lyrics themselves. That means we need to crawl this website to find the information we actually need. Crawling means creating a routine that identifies the location of the information we actually want so that we don't have to specify it ourselves. In this case, that means the links to the actual lyrics. We'll find those and then use them to get to the lyrics. The alternative is to create a list of links by hand (yuck).

<br>
If you print the `homepage.text` field, you'll see the HTML your browser uses to determine how to display the page. HTML is human-readable, but contains a lot of information we don't actually want. When printed without the proper visual formatting, it is downright disorienting. But the information we need is in there so we're going to dig it out!


In [None]:
homepage.text

<br>To be able to extract the information we actually want, we're going to use a module named `Beautiful Soup` to <em>parse</em> the HTML into something we can read and then approach computationally.

In [9]:
from bs4 import BeautifulSoup

We'll pass the HTML data in `homepage.text` to BeautifulSoup. We also need to specify a parser. Using `"html.parser"` is perfectly fine for 99.99% of use-cases.

In [10]:
soup = BeautifulSoup(homepage.text, "html.parser")

The <code>soup</code> object has presorted some commen elements of the HTML for us and gives us access to methods to find more complicated things. An example of the former is <code>soup.title</code>. Nearly all pages have a defined title element which your browser displays up on browser header, usually as the tab name.

In [5]:
soup.title

<title>Songs | The Official Bob Dylan Site</title>

You'll note that the text of title element that shows up in your broswer here has tags, i.e. `<title></title>`. That means we haven't actually gotten down to raw text. Actually the great thing about BeautifulSoup; your results are BeautifulSoup object that can be searched again and again. <b>BUT</b> it means we need to sure we do actually get to the raw text once we find what we're looking for. The next command does that and highlights the ability of python to navigate object structures succinctly.


In [6]:
soup.title.text

'Songs | The Official Bob Dylan Site'

<br><br>
Next, we'll find the element with the hyperlinks we want.

In [11]:
songs = soup.find_all("span","song")

The `find_all()` method with the parameters `"span"` and `"song"` finds all elements of type <em>span</em> with a class equal to <em>song</em>. The span element is a generic element type you'll find in a lot of HTML but the class <em>song</em> is custom to this website. So how did we know to look for span elements of the class song? I had to look at the raw HTML of the website and find examples of what I wanted and then figure out how the website developer encoded that information. This type of task is part of the black art of webscraping and can get much more complicated because developers might be inconsistent with their use of code or use more complex (and efficient) ways of passing data between their servers and your computer. Thankfully this website is quite easy to navigate. You could probably do it just looking at the `homepage.text` field above, but I'd recommend trying it in your browser. All the major browsers allow you to view the source document or find the thing that houses a particular part of the page. 

Firefox, Safari and Chrome all let you right click on a part of the page and "Inspect Element". This will open up a sidebar or another tab which jumps to and highlights the element you're viewing in the raw HTML. Elements are often nested in HTML (e.g. a cell in a table in a "div" in the body) and the browser doesn't necessarily highlight the lowest element. Accordingly you might need to expand the highlighted element to find what you're looking for. Try this on bobdylan.com/songs to see where the "span" and "song" parameters above come from.


<br><br>
Now let's look at one these song elements. We'll look at the second because it turns out the first song's lyrics aren't on the site.

In [9]:
songs[1]

<span class="song"><a href="http://bobdylan.com/songs/til-i-fell-love-you/">‘Til I Fell In Love With You</a></span>

You can see the hyperlink we want, but we need to dig it out before we store it.

In [12]:
songs_hrefs = []

for song in songs:
    songs_hrefs.append(song.find("a").get("href"))

Above we first created an empty list to store the hyperlinks we find. Then we iterate over the elements of `songs` and find the `<a>` (anchor) element and get the "href" variable from inside of it. We immediately put that in the list.

In [11]:
len(songs_hrefs)

653

Now you can see we've filled the list with 653 links to songs. Sure glad we didn't try to do that by hand!<Br><br>
    We now have the links and can go back to get all those pages using the `requests` module. But first, let's just start with one so that we can figure out how to get the relevant information out of the response we'll get

In [12]:
song_one = requests.get(songs_hrefs[1])

As you'll recall, the response object has metadata and then a bunch of raw HTML. We'll pull just the text and run it into BeautifulSoup to get something we can actually read

In [13]:
new_soup = BeautifulSoup(song_one.text,"html.parser")

In [14]:
lyrics = new_soup.find("div","article-content lyrics")

Just like for the home page, I looked at the raw HTML in order to figure out how the data we want is stored. It turned out to be in a `<div>` tag with a class named `article-content lyrics`<br><br>
Let's print this to see what we've got.

In [15]:
print(lyrics)

<div class="article-content lyrics">
			Well, my nerves are exploding and my body’s tense<br/>
I feel like the whole world got me pinned up against the fence<br/>
I’ve been hit too hard, I’ve seen too much<br/>
Nothing can heal me now, but your touch<br/>
I don’t know what I’m gonna do<br/>
I was all right ’til I fell in love with you<br/>
<br/>
Well, my house is on fire, burning to the sky<br/>
I thought it would rain but the clouds passed by<br/>
Now I feel like I’m coming to the end of my way<br/>
But I know God is my shield and he won’t lead me astray<br/>
Still I don’t know what I’m gonna do<br/>
I was all right ’til I fell in love with you<br/>
<br/>
Boys in the street beginning to play<br/>
Girls like birds flying away<br/>
When I’m gone you will remember my name<br/>
I’m gonna win my way to wealth and fame<br/>
I don’t know what I’m gonna do<br/>
I was all right ’til I fell in love with you<br/>
<br/>
Junk is piling up, taking up space<br/>
My eyes feel like 

That's looking pretty good, but there's still stuff in there we wouldn't want to have make it into our eventual analysis of the text. We'll get it out of here right now.<br><br>
The main problem is the copyright text at the bottom. The exact text will be different for different songs, but thankfully the web developer put it in its own `<p>` (paragraph) element. We'll identify that text and deal with it later.

In [16]:
copyright = lyrics.find("p")

The reason we have to deal with it later is that `lyrics` is a BeautifulSoup object. This makes it easier to find things, but we can't modify it. So we find the copyright information and then extract the HTML text from the lyrics object using the `text` field of the object. The text string we get from that <b>is</b> modifiable.

In [17]:
copyright.text

'Copyright © 1997 by Special Rider Music'

In [18]:
lyrics_rawtext = lyrics.text

This gives us an actual text string.

Now we'll use the python `replace` function to replace the copyright information with the empty string `""`. The replace function creates a new object so we'll save it to a new variable named `lyrics_text`. Note we access the `copyright` soup object's text field to do this.

In [19]:
lyrics_text = lyrics_rawtext.replace(copyright.text,"")

If you print this variable, it will look like we're close to done.

In [20]:
print(lyrics_text)


			Well, my nerves are exploding and my body’s tense
I feel like the whole world got me pinned up against the fence
I’ve been hit too hard, I’ve seen too much
Nothing can heal me now, but your touch
I don’t know what I’m gonna do
I was all right ’til I fell in love with you

Well, my house is on fire, burning to the sky
I thought it would rain but the clouds passed by
Now I feel like I’m coming to the end of my way
But I know God is my shield and he won’t lead me astray
Still I don’t know what I’m gonna do
I was all right ’til I fell in love with you

Boys in the street beginning to play
Girls like birds flying away
When I’m gone you will remember my name
I’m gonna win my way to wealth and fame
I don’t know what I’m gonna do
I was all right ’til I fell in love with you

Junk is piling up, taking up space
My eyes feel like they’re falling off my face
Sweat falling down, I’m staring at the floor
I’m thinking about that girl who won’t be back no more
I don’t know wh

We're actually not that close. It turns out that Jupyter notebooks use the HTML still in the text in order to format it for viewing when you use the `print()` command. Note the difference when you "execute" just the variable in the next line.

In [30]:
lyrics_text


			Well, my nerves are exploding and my body’s tense
I feel like the whole world got me pinned up against the fence
I’ve been hit too hard, I’ve seen too much
Nothing can heal me now, but your touch
I don’t know what I’m gonna do
I was all right ’til I fell in love with you

Well, my house is on fire, burning to the sky
I thought it would rain but the clouds passed by
Now I feel like I’m coming to the end of my way
But I know God is my shield and he won’t lead me astray
Still I don’t know what I’m gonna do
I was all right ’til I fell in love with you

Boys in the street beginning to play
Girls like birds flying away
When I’m gone you will remember my name
I’m gonna win my way to wealth and fame
I don’t know what I’m gonna do
I was all right ’til I fell in love with you

Junk is piling up, taking up space
My eyes feel like they’re falling off my face
Sweat falling down, I’m staring at the floor
I’m thinking about that girl who won’t be back no more
I don’t know wh

We need to get all those *whitespaces* and *delimiters* out while preserving the structure of the lyrics. Whoever encoded the lyrics ends the lines with `\r\n`. We'll split everything at those points first and iterate over the resulting lines, clean the individual lines and then add them to list named `clean_lines`.

In [22]:
clean_lines = []

lines = lyrics_text.split("\r\n")

for line in lines:
    clean_lines.append(line.strip().split())

In the last line we use `strip()` to strip non-alphanumeric characters off the line. We also split the line whereever there is whitespace using `split()`.<br><br>

In [29]:
print(clean_lines)

[['Well,', 'my', 'nerves', 'are', 'exploding', 'and', 'my', 'body’s', 'tense'], ['I', 'feel', 'like', 'the', 'whole', 'world', 'got', 'me', 'pinned', 'up', 'against', 'the', 'fence'], ['I’ve', 'been', 'hit', 'too', 'hard,', 'I’ve', 'seen', 'too', 'much'], ['Nothing', 'can', 'heal', 'me', 'now,', 'but', 'your', 'touch'], ['I', 'don’t', 'know', 'what', 'I’m', 'gonna', 'do'], ['I', 'was', 'all', 'right', '’til', 'I', 'fell', 'in', 'love', 'with', 'you'], ['Well,', 'my', 'house', 'is', 'on', 'fire,', 'burning', 'to', 'the', 'sky'], ['I', 'thought', 'it', 'would', 'rain', 'but', 'the', 'clouds', 'passed', 'by'], ['Now', 'I', 'feel', 'like', 'I’m', 'coming', 'to', 'the', 'end', 'of', 'my', 'way'], ['But', 'I', 'know', 'God', 'is', 'my', 'shield', 'and', 'he', 'won’t', 'lead', 'me', 'astray'], ['Still', 'I', 'don’t', 'know', 'what', 'I’m', 'gonna', 'do'], ['I', 'was', 'all', 'right', '’til', 'I', 'fell', 'in', 'love', 'with', 'you'], ['Boys', 'in', 'the', 'street', 'beginning', 'to', 'play'],

Above you can see that we now have the individual words of each line in a list. All of those lists are in one big list. We're also preserved the order the words appear in the lyrics. At this point though we can use python list *indexing* to get individual words. If you want the sixth word of the second line, just run the next line. (Remember that python list indices all start at zero, so the 6th element has an index number of 5.

In [28]:
print(clean_lines[1][5])

world


We could be done extracting data here, but because we don't know what type of analysis we'll end up doing let's also record the year the song was first copyrighted. We'll do this using "regular expressions", a way of expressing text characters abstractly. We'll look for the copyright year as a string of four numbers between 1000 and 2999.

In [16]:
import re

This is not the place to go into the details of "regex" (<b>reg</b>ular <b>ex</b>pressions), but note that we supply the copyright string we found earlier as the string being searched. We also convert the result from a string to an integer.

In [27]:
year = int(re.search("[1-2][0-9]{3}",copyright.text).group(0))
print(year)

1997


Ok, we've found everything we need it the HTML documents. But we don't want to run through the above steps 653 times, so let's package everything up into a single function. We can then just pass the page to the function and it will spit out exactly what we want.

In [36]:
def lyric_cleaner(page):
    soup = BeautifulSoup(requests.get(page).text,"html.parser")
    lyrics = soup.find("div","article-content lyrics")
    if lyrics.text.strip() != "":  
        copyright = lyrics.find("p")
        lyrics_rawtext = lyrics.text
        lyrics_text = lyrics_rawtext.replace(copyright.text,"")
        lines = lyrics_text.split("\r\n")
        clean_lines = []
        for line in lines:
            clean_lines.extend(line.strip().split())
        if copyright.text.strip() != "":   
            year = int(re.search("[1-2][0-9]{3}",copyright.text).group(0))
        else:
            year = None
        return [year,clean_lines]
    

There are two important things to note about this function. First is that we're returning a list of two elements, a year and the lyrics. This keep this information together for later. Second is that we first check that the elements of the page we're looking at aren't empty. If they were empty, Python would raise an error and stop everything. We screen the potential cases out using `if` statements to prevent this.<br><br>
Ok, now we're almost ready to gather and clean the pages/lyrics. But before we do, we import the `time` module.

In [14]:
import time

We do this so that we can <b><em>slow down</em></b> how quickly we do this whole process. We do this because if you send a hundreds of requests to a server basically all at once, you're going to get your IP address blacklisted. Websites don't like it if you ask for so much information that it ties up their servers. To avoid doing this, we slow things down. This ensures the success of our scraping but is also courteous. The time module's sleep function pauses the procedure for the number of seconds you tell it to.  1 second is sufficient for what we're doing, but if you're dealing with a major site and getting lots of data, you might increase the time.

In [37]:
lyrics = {}
lyrics_and_years = {}

for page in songs_hrefs[:3]:
    # the second to last item in split list is the name of the song
    song_name = page.split("/")[-2]
    
    cleaned_data = lyric_cleaner(page)
    if cleaned_data != None:
        lyrics_and_years[song_name] = cleaned_data
        lyrics[song_name] = cleaned_data[1]
    time.sleep(1) #sleeping for 1 second


Couple of things:<br>
We created two lists. We use `lyrics` to gather just the words of the songs. We <em>extend</em> the list each time. If you *extend* a list with a list, you get a flat list back (e.g. `["a", "list"].extend(["plus","a","list"]) = ["a", "list", plus","a","list"]`. If you *append* a list to a list, the outer list will contain a list (e.g. `["a","list"].append(["plus","a","list"]) = ["a","list",["plus","a","list"]]`. The package for doing word embeddings we are going to check out wants just a list of lists so in the end of this routine `lyrics` is one long list of lists where each inner list is the sequence of words from a song. This preserves the discrete object of the song which allows for a richer context for each word as the statistical structure is analyzed. This is different from the "bag of words" approach topic modeling uses and one the more appealling features of word embeddings.

<br><br>
Back to what we're done: We saved the song and copyright years together in a list of lists. We'll keep that list for later because knowing the contents of songs and when he wrote them might be a fruitful avenue for analysis.
<br><br>
Finally, the code above retrieves only the first three songs from our list of links because we "sliced" the list. `songs_hrefs[:3]` means take the first three elements of the list. I did this so that you don't accidentally run the code for all 653 songs, which takes 10+ minutes to run. To get all the songs, delete the `[:3]` part so that the line reads just `for page in songs_hrefs:`

<h2>Word Embeddings</h2>
<br>
The idea of embedding words is a bit like factor or principle component analysis; there are unobserved variables that capture the important features of the relationships between observations and if we can situate those observations in those dimensions, we can actually explore the relationships. In PCA or factor analysis you start with a high dimensional space defined by (although not necessarily identical to) the independent variables you have available. You then try to capture the variation present in a subset or linear combination of the variables. 
<br><br>
In word embeddings, however, our units of analysis are words (roughly, more on that shortly). We don't have a bunch of variable values for each word, so we aren't going from a variable space to a new space with fewer variables. Instead, we are projecting words into a high dimensional space in hopes that our mechanism for determining the projection will create meaningful geometric relationships between words in those dimensions. The primary means of creating the projection is to look at the context around the given word. It is common to look at the 3 words on either side of the target word and use them to predict the target word or vise versa. The more similar the contexts of two different words, the closer they end up in the space. (The location of each word in the space is represented by a vector with a length equal to the number of dimensions choosen for the space. The value at each entry in the vector is that word's position for that dimension.)
<br><br>
The concept of word embeddings have been around for a while but made a big splash in 2013 with the introduction of the *word2vec* (words to vectors) embedding. Its method for determining the projection was significantly faster than previous versions, which allowed for many more tokens (word instances) to be processed. The first results reported relied on 1.6 *billion* tokens. It turned out that running more words through the algorithm improved the performance of the embedding compared to previous ones.
<br><br>
We don't have billions of words so we aren't going to *create* a word embedding of our own. (Presumably distinct corpuses have distinct relationships among the words so one might want to do that.) If you want to get a sense of what that looks like, the relevant code is at the bottom of this lab. Instead we're going to make use of an existing embedding from the GloVe (*Glo*bal *Ve*ctors for Word Representations). There are actually several different publicly available embeddings gone with the GloVe algorithm and we'll be using the smallest because the bigger ones require a lot of computer memory. 
<br><br>
Below we'll load the 50 dimensional representation built using 6 billion words collected from Wikipedia and a corpus of new wire items (see [here]() for more information. It includes a vocabary of about 400k words. Again, what is included in this package is not raw text used to create an embedding. Instead we have a text file that contains the vector representations of the words. A 50 dimensional space is rather small--300 is common--but each one of those vectors would be 6 times as long and our file is already also 175MB. The 50 dimension space is good and will give you an idea of what working with an embedding is like, but it very likely less reliable than higher dimensional ones.
<br><br>
The first thing you'll need to do in order to continue is un-zip/extract the text file from the compressed version found in the `tools` folder inside of the folder this Notebook is in. The file is named `glove.6b.50d.txt.zip`. Double click on it or right click on it and open it with an archive utility. Once that is done you can start running the commands below.
<br><br>
First we import the package `pandas` so we can put the vectors into a data structure. We also need the `csv` package to tell `pandas` how to read the file.

In [59]:
import pandas as pd
import csv

In [60]:
try:
    words = pd.read_table("tools/glove.6B.50d.txt", sep=" ", index_col=0, header=None, quoting=csv.QUOTE_NONE)
    print("Vectors successfully loaded")
except:
    print("Data load failed. Chances are you didn't extract the text file as described above. Do that and run this cell again.")
    

Data load failed. Chances are you didn't extract the text file as described above. Do that and run this cell again.


In [42]:
"&7me".isalnum()

False

Now that we have the data loaded, let's create some functions for analyzing the data.

In [57]:
# This takes a word and gives back the associated vector
def get_vec(the_word):
    clean_word = "".join([i for i in the_word.lower() if i.isalnum()])
    return words.loc[clean_word].as_matrix()


# This takes two words and finds the Cartesian distance between them, a simple way of assessing the similarity of the word
def distance(word1, word2,verbose=False):
    missed_count = 0
    try:
        vec_1 = get_vec(word1) # we put the words in lower case
    except:
        if verbose:
            print("The first word is not in the embedding dictionary")
    try:
        vec_2 = get_vec(word2)
        return sum([(vec_1[i]-vec_2[i])**2 for i in range(len(vec_1))])**.5
    except:
        if verbose:
            print('The second word is not in the embedding dictionary')
        
    
    

# This return average, max, min and standard deviation of the distance between the words in a song.
def song_distances_summary(song,weighted=True,verbose=False):
    flat_list = [item for sublist in song for item in sublist]
    if not weighted:
        # creating a set removes multiple instances of words and therefore doesn't weight by number of occurrences.
        flat_list = list(set(song))
        
    distances = []
    for i in range(len(song)):
        for j in range(len(song)):
            if i < j:
                if song[i] != song[j]:
                    dist = distance(song[i],song[j],verbose)
                    if dist != None:
                        distances.append(dist)
                
    max_d = max(distances)
    min_d = min(distances)
    
    average = sum(distances)/len(distances)
    
    SD = (sum([(i-average)**2 for i in distances])/len(distances)-1)**.5
    
    return {"average":average, "std_dev":SD, "max":max_d, "min":min_d}
                


Now we can take a song we collected early and get a sense of spread of the words in the GloVe embedding.

In [46]:
first_song = list(lyrics.keys())[1]
print(first_song)

10000-men


In [58]:
song_distances_summary(lyrics[first_song],verbose=False)
# remove the verbose parameter if you don't want to see at the missing word messages

{'average': 5.0107855761136237,
 'max': 9.1707499130682884,
 'min': 0.0,
 'std_dev': 0.75532585939587604}

<br><br><br>
#### Creating your own embedding ####
Again, our Dylan lyrics collection is not nearly big enough to actually create an embedding that performs well. But we can still create one with the code below. We've cleaned everything so all the work is in the `gensim.model.Word2Vec` function. Running it with our puny corpus won't take long at all, but a suitably large corpus would need a lot more resources than your laptop. Thus this is just to show you how easy in is to create an embedding with existing python packages.

In [None]:
import gensim
model = gensim.models.Word2Vec(lyrics, size=100, window=7, min_count=1)
model.wv["just"] # return the vector
model.similarity("just","living,") # finds the distance between the words in the embedding.