# Radiohead Song Lyric Analysis
### Data Gathering

Before we can analyze Radiohead's song lyrics, we need to obtain and organize them. To start, I decided which songs I wanted lyrics for. Radiohead has a lot of songs, but I only wanted the ones from their nine studio albums (I'm more familiar with these, so working with them should be easier). Using data from [Wikipedia](https://en.wikipedia.org/wiki/Radiohead_discography), I put the titles into a [`yaml` file](Radiohead-Discography.yaml), organized by album and song title.

Now that I have the names, I need the corresponding lyrics. I was able to find the following code on [Quora](https://www.quora.com/Whats-a-good-api-to-use-to-get-song-lyrics), created by [Sagun Shrestha](https://www.quora.com/profile/Sagun-Shrestha-7), to get lyrics from [azlyrics.com](azlyrics.com):

In [1]:
import re
import urllib.request
from bs4 import BeautifulSoup

def get_lyrics(artist, song_title):
    artist = artist.lower()
    song_title = song_title.lower()
    # remove all except alphanumeric characters from artist and song_title
    artist = re.sub('[^A-Za-z0-9]+', "", artist)
    song_title = re.sub('[^A-Za-z0-9]+', "", song_title)
    if artist.startswith("the"):  # remove starting 'the' from artist e.g. the who -> who
        artist = artist[3:]
    url = "http://azlyrics.com/lyrics/" + artist + "/" + song_title + ".html"

    try:
        content = urllib.request.urlopen(url).read()
    except Exception as e:
        print(e)
        return None
    soup = BeautifulSoup(content, 'html.parser')
    lyrics = str(soup)
    # lyrics lie between up_partition and down_partition
    up_partition = '<!-- Usage of azlyrics.com content by any third-party lyrics provider is prohibited by our licensing agreement. Sorry about that. -->'
    down_partition = '<!-- MxM banner -->'
    lyrics = lyrics.split(up_partition)[1]
    lyrics = lyrics.split(down_partition)[0]
    lyrics = lyrics.replace('<br>', '').replace('</br>', '').replace('</div>', '').strip()
    return lyrics


To test, let's grab the lyrics for their song "Kid A" (I can never tell what they're saying there):

In [2]:
print(get_lyrics("Radiohead", "Kid A"))

I slipped away
I slipped on a little white lie

We got heads on sticks
You got ventriloquists
We got heads on sticks
You got ventriloquists

Standing in the shadows at the end of my bed <i>[x4]</i>

Rats and children follow me out of town
Rats and children follow me out of town
Come on kids...


That's nice. With that done, it was only a matter of looping through the `yaml` file and getting the lyrics for each song. After doing this, I put them into a [`csv` file](lyrics.csv) with columns for `artist`, `album`, `song`, and `lyrics`.
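
The loop described above might look roughly like the sketch below. It is not the code I actually ran: the function name `write_lyrics_csv` is made up for illustration, and it assumes the `yaml` file loads (for example via PyYAML's `yaml.safe_load`) into a dict that maps each album name to a list of song titles. The fetcher is passed in as an argument so the example can run offline with a stand-in instead of `get_lyrics`.

```python
import csv
import io

def write_lyrics_csv(discography, fetch, out, artist="Radiohead"):
    """Write one CSV row per song; return the titles that yielded no lyrics."""
    writer = csv.writer(out)
    writer.writerow(["artist", "album", "song", "lyrics"])
    missing = []
    # Assumed layout: {album name: [song title, ...]}
    for album, songs in discography.items():
        for song in songs:
            lyrics = fetch(artist, song)
            if lyrics is None:  # get_lyrics returns None when the request fails
                missing.append(song)
                continue
            writer.writerow([artist, album, song, lyrics])
    return missing

# Stand-in data and fetcher (both hypothetical) so the sketch runs without network access.
demo_discography = {"Kid A": ["Kid A", "Treefingers"]}

def demo_fetch(artist, song):
    return None if song == "Treefingers" else "I slipped away"

buffer = io.StringIO()
demo_missing = write_lyrics_csv(demo_discography, demo_fetch, buffer)
```

With a real file, `out` would be something like `open("lyrics.csv", "w", newline="")` and `fetch` would be `get_lyrics`. Returning the missing titles makes it easy to see which songs need a second look.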

This process worked for most songs, but the following returned no lyrics:  
- How Do You Do?
- High and Dry
- ***Treefingers***
- Packt Like Sardines in a Crushd Tin Box
- Pulk/Pull Revolving Doors
- Morning Bell/Amnesiac
- ***Hunting Bears***
- 2 + 2 = 5 (The Lukewarm.)
- Sit Down. Stand Up. (Snakes & Ladders.)
- Sail to the Moon. (Brush the Cobwebs out of th...
- Backdrifts. (Honeymoon is Over.)
- Go to Sleep. (Little Man being Erased.)
- Where I End and You Begin. (The Sky is Falling...
- We suck Young Blood. (Your Time is up.)
- The Gloaming. (Softly Open our Mouths in the C...
- There there. (The Boney King of Nowhere.)
- I Will. (No man's Land.)
- A Punchup at a Wedding. (No no no no no no no ...
- Myxomatosis. (Judge, Jury & Executioner.)
- Scatterbrain. (As Dead as Leaves.)
- A Wolf at the Door. (It Girl. Rag Doll.)
- ***Feral***

Some songs in the list simply have no lyrics. Those are shown in bold italics.  

For the other songs, it's because [azlyrics.com](azlyrics.com) has them under a different name than I do:  
- How Do You Do? -> How Do You ~~Do~~?  
- High and Dry -> High **&** Dry  
- Packt Like Sardines in a Crushd Tin Box -> Packt Like Sardines in a Crush**e**d Tin Box  
- Pulk/Pull Revolving Doors -> **Pull / Pulk** Revolving Doors  
- Morning Bell/Amnesiac -> **Amnesiac / Morning Bell**  

All of the songs with parentheses in the title are from the album Hail to the Thief, where each song was given two names. [azlyrics.com](azlyrics.com) only uses the first name for these.
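
One way to handle both kinds of mismatch before calling `get_lyrics` is a small title-translation step, sketched below. The helper name `lookup_title` is hypothetical, and the sketch assumes the album is stored under the exact name "Hail to the Thief". The parenthesised part is only stripped for that album, since titles on other albums may contain parentheses that belong to the real name.

```python
import re

# Titles that azlyrics.com lists under a different name (from the list above).
TITLE_OVERRIDES = {
    "How Do You Do?": "How Do You?",
    "High and Dry": "High & Dry",
    "Packt Like Sardines in a Crushd Tin Box": "Packt Like Sardines in a Crushed Tin Box",
    "Pulk/Pull Revolving Doors": "Pull / Pulk Revolving Doors",
    "Morning Bell/Amnesiac": "Amnesiac / Morning Bell",
}

def lookup_title(album, title):
    """Return the title to search for on azlyrics.com (hypothetical helper)."""
    if title in TITLE_OVERRIDES:
        return TITLE_OVERRIDES[title]
    if album == "Hail to the Thief":  # assumed album key
        # Drop the second name, e.g. " (The Lukewarm.)", keeping only the first.
        return re.sub(r"\s*\(.*$", "", title)
    return title
```

For example, `lookup_title("Hail to the Thief", "2 + 2 = 5 (The Lukewarm.)")` gives `"2 + 2 = 5"`, which `get_lyrics` then reduces to its alphanumeric characters as usual.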

### Data Cleanup

Sometimes the lyrics given by [azlyrics.com](azlyrics.com) were not in a useful form.

In certain songs, words or phrases are repeated, and the transcriber uses a marker to indicate the repetition instead of writing it out. This is fine for humans, but not for a computer. Here's a good example:

In [3]:
print(get_lyrics("Radiohead", "A Punchup at a Wedding"))

No <i>[x42]</i>
I don't know why you bother
Nothing's ever good enough for you.
(By the way) I was there and it wasn't like that.
You've come here just to start a fight
You had to piss on our parade
You had to shred our big day
You had to ruin it for all concerned
In a drunken punch-up at a wedding
Yeah
Hypocrite opportunist
Don't infect me with your poison
A bully in a china shop
When I turn 'round you stay frozen to the spot
You had the pointless snide remarks
Of hammerheaded sharks
The pot will call the kettle black
It's a drunken punch-up at a wedding yeah
Oh no no


If we want the word-count out of this song, it would make sense to list out all of the no's. This could be solved by a simple function... if the notation had a regular form, that is.

In [4]:
print(get_lyrics("Radiohead", "We Suck Young Blood"))

Are you hungry?
Are you sick?
Are you begging for a break?
Are you sweet?
Are you fresh?
Are you strung up by the wrists?
We want the young blood (la <i>[x8]</i>)
Are you fracturing?
Are you torn at the seams?
Would you do anything?
Fleabitten motheaten?
We suck young blood (la <i>[x8]</i>)
We suck young blood (la <i>[x8]</i>)
Woah woah
Won't let the creeping ivy
Won't let the nervous bury me
Our veins are thin
Our rivers poisoned
We want the sweet meat (la <i>[x8]</i>)
We want the young blood 
La <i>[x8]</i>


Here the marker means that only the text inside the parentheses should be repeated, not the whole line. This isn't the only irregularity, either:

In [5]:
print(get_lyrics("Radiohead", "Optimistic"))

Flies are buzzing around my head
Vultures circling the dead
Picking up every last crumb
The big fish eat the little ones
The big fish eat the little ones
Not my problem give me some

You can try the best you can
If you try the best you can
The best you can is good enough
<i>[x2]</i>

This one's optimistic
This one went to market
This one just came out of the swamp
This one dropped a payload
Fodder for the animals
Living on an animal farm

If you try the best you can
If you try the best you can
The best you can is good enough
<i>[x2]</i>

I'd really like to help you man
I'd really like to help you man.....
Nervous messed up marionette
Floating around on a prison ship

If you try the best you can
If you try the best you can
The best you can is good enough
If you try the best you can
If you try the best you can
Dinosaurs roaming the earth <i>[x3]</i>


Here a marker on its own line means that the whole preceding stanza should be repeated. And finally, here is the most interesting case of transcription with repetition:

In [6]:
print(get_lyrics("Radiohead", "Identikit"))

<i>[3x:]</i>
A moon-shaped pool, dancing clothes,
You love and now I know, but hold me, hold me.

Sweet-faced ones with nothing left inside
That we all can love
That we all can love
That we all...

Sweet-faced ones with nothing left inside
That we all can love
That we all can love
That we all...

And now I see you messing me around
I don't want to know
I don't want to know
I don't want to know

And now I see you messing me around
I don't want to know
I don't want to know
I don't want to know

Broken hearts make it rain
Broken hearts make it rain
Broken hearts make it rain
Broken hearts make it rain
Broken hearts make it rain
Broken hearts make it rain
Broken hearts make it rain
Broken hearts make it rain
Broken hearts make it rain
Broken hearts

Broken hearts make it rain
Broken hearts make it rain
Broken hearts make it rain
Broken hearts make it rain
Broken hearts make it rain
Broken hearts make it rain
Broken hearts make it rain
Broken hearts make it rain

Pieces of a ragdoll mankind

Whoever transcribed this wrote out every repetition in full except the one in the first stanza. Also, the repetition mark comes before the text to be repeated, when every other case has it after.

To solve this, I automated the repetition for simple ones, which make up the majority. I saved the rest to do manually.
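
A minimal sketch of that split between automatic and manual cases follows. It is an illustration rather than the exact code I used, and `expand_simple_repeats` is a made-up name. It treats a case as "simple" only when the marker closes a line that has text before it and no parentheses. Everything else that contains a marker is left untouched and reported for manual cleanup.

```python
import re

# Marker at the very end of a line with text before it, e.g. "No <i>[x42]</i>".
END_MARKER = re.compile(r"^(?P<text>.*\S)\s*<i>\[x(?P<count>\d+)\]</i>\s*$")
# Any repetition marker at all, including the "[3x:]" variant.
ANY_MARKER = re.compile(r"<i>\[(?:x\d+|\d+x:?)\]</i>")

def expand_simple_repeats(lyrics):
    """Expand line-level markers; return (new_lyrics, lines_left_for_manual_fix)."""
    expanded, manual = [], []
    for line in lyrics.splitlines():
        match = END_MARKER.match(line)
        if match and "(" not in match.group("text"):
            # Simple case: write the line out once per repetition.
            expanded.extend([match.group("text")] * int(match.group("count")))
        else:
            if ANY_MARKER.search(line):
                manual.append(line)  # irregular notation, fix by hand
            expanded.append(line)
    return "\n".join(expanded), manual
```

Under these rules, `No <i>[x42]</i>` becomes 42 lines of `No`, while `We suck young blood (la <i>[x8]</i>)`, a bare `<i>[x2]</i>`, and `<i>[3x:]</i>` all end up in the manual list.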

Also worth noting: the pages for "You" and "Motion Picture Soundtrack" both include lyrics from an earlier version of the song. We don't want these because they didn't make it onto the album.

Finally, in "Idioteque" and "Daydreaming", there are vocal passages so heavily distorted that they might as well not be words. "Daydreaming" is probably the best example:

In [7]:
print(get_lyrics("Radiohead", "Daydreaming"))

Dreamers
They never learn
They never learn
Beyond the point
Of no return
Of no return

And it's too late
The damage is done
The damage is done

This goes
Beyond me
Beyond you
The white room
By window
Where the sun goes
Through

We are
Just happy to serve
Just happy to serve
You

efil ym fo flaH
efil ym fo flaH
efil ym fo flaH
efil ym fo flaH
efil ym fo flaH
efil ym fo flaH
efil ym fo flaH
efil ym fo flaH

efil ym fo flaH
efil ym fo flaH
efil ym fo flaH
efil ym fo flaH
efil ym fo flaH
efil ym fo flaH
efil ym fo flaH
efil ym fo flaH
