# 8A. Song Lyrics Generator

In this lab, you will scrape a website to get lyrics of songs by your favorite artist. Then, you will train a model called a Markov chain on these lyrics so that you can generate a song in the style of your favorite artist.

# Question 1. Scraping Song Lyrics

Find a web site that has lyrics for several songs by your favorite artist. Scrape the lyrics into a Python list called `lyrics`, where each element of the list represents the lyrics of one song.

**Tips:**
- Find a web page that has links to all of the songs, like [this one](http://www.azlyrics.com/n/nirvana.html). [_Note:_ It appears that `azlyrics.com` blocks web scraping, so you'll have to find a different lyrics web site.] Then, you can scrape this page, extract the hyperlinks, and issue new HTTP requests to each hyperlink to get each song. 
- Use `time.sleep()` to stagger your HTTP requests so that you do not get banned by the website for making too many requests.
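
The flow described in the tips above can be sketched as follows. This is a minimal sketch using only the standard library, not the approach used later in this notebook. `LinkCollector`, `extract_links`, `fetch_all`, and the `example.com` URLs are hypothetical names chosen for illustration, and `fetch` stands for whatever function you use to download a page.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(index_html, base_url):
    """Return absolute URLs for all hyperlinks on the index page."""
    parser = LinkCollector()
    parser.feed(index_html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between requests."""
    pages = []
    for url in urls:
        pages.append(fetch(url))
        time.sleep(delay_seconds)  # stagger requests to avoid being banned
    return pages


# Hypothetical index page, used only to illustrate the flow.
index_html = '<a href="/songs/one.html">One</a> <a href="/songs/two.html">Two</a>'
song_urls = extract_links(index_html, "http://example.com/artist.html")
print(song_urls)
```

With the `requests` library, `fetch` could be something like `lambda url: requests.get(url).text`. Each downloaded page would then still need to be parsed to pull out the lyrics.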

## Note

For this lab, I use the Genius API to get the lyrics for Travis Scott, along with helper functions for navigating the API that live in a file called `spotfuncs`. I originally wrote `spotfuncs` for an earlier personal project.

In [1]:
import requests
import time

from bs4 import BeautifulSoup

In [2]:
import sys
from multiprocessing import Pool

import lxml
import numpy as np
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

import spotfuncs

spotify = spotipy.Spotify()

In [3]:
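def clean_lyrics(lyrics):
    """Drop bracketed annotations, strip parentheses, and join lines with <N>."""
    new_lyrics = []
    in_bracket_block = False  # True while inside a multi-line [ ... annotation
    for line in lyrics.split("\n"):
        # startswith/endswith are safe on empty lines, unlike line[0]
        if line.startswith("[") and not line.endswith("]"):
            in_bracket_block = True
        # a line that starts with "]" closes the multi-line annotation
        if line.startswith("]"):
            in_bracket_block = False
        if not line.startswith(("[", "]")) and not in_bracket_block:
            line = line.replace("(", "").replace(")", "")
            if line != "":
                new_lyrics.append(line)
    return " <N> ".join(new_lyrics)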

In [4]:
filepath = "/Users/ramanyakkala/Spec/genius.txt"

In [5]:
artist_id = spotfuncs.search_genius("Travis Scott", credentials_file=filepath)
genius_songs = spotfuncs.get_songs(artist_id, credentials_file=filepath)

In [6]:
travis_scott = pd.DataFrame([i for i in genius_songs.keys()],columns=["pos"])
travis_scott.head()

|   | pos |
|---|---|
| 0 | 100 Bottles |
| 1 | 100 Bottles (Remix) |
| 2 | 10 2 10 |
| 3 | 10 2 10 (Remix) |
| 4 | 12 Disciples |


Let's get the lyrics and then clean them.

In [7]:
travis_scott["lyrics"] = spotfuncs.get_lyrics(genius_songs)
travis_scott_ = travis_scott.set_index("pos")
travis_scott_ = travis_scott_[travis_scott_["lyrics"] != ""]

travis_scott_["clean_lyrics"] = travis_scott_["lyrics"].apply(clean_lyrics)

Let's get rid of lyrics that are 200 characters or shorter, since they are mainly leaks with incoherent lyrics or are not songs at all.

In [8]:
travis_scott_ = travis_scott_[travis_scott_["clean_lyrics"].apply(len) > 200]

There were some specific pages listed under Travis Scott that did not contain lyrics, only track or feature names.

Example: https://genius.com/Travis-scott-wav-radio-episode-1-tracklist-lyrics

In [9]:
for i in range(1,12):
    travis_scott_.drop(".Wav Radio Episode {0} Tracklist".format(i),inplace=True)

In [10]:
travis_scott_.drop("Featuring Quavo",inplace=True)

In [11]:
travis_scott_.drop("No Fear",inplace=True)

In [12]:
travis_scott_.drop("Birds Eye View Tour Dates",inplace=True)

In [13]:
travis_scott_.drop("Rodeo Tour Dates",inplace=True)

`pickle` is a Python library that serializes Python objects to disk so that you can load them in later.

In [14]:
import pickle

# The with-block closes the file once the data has been written.
with open("lyrics.pkl", "wb") as f:
    pickle.dump(travis_scott_["clean_lyrics"].values, f)
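
As a quick illustration of the save-and-load round trip, here is a minimal sketch. The `songs` list and the temporary file name are made up for the example and are not the real lyrics data.

```python
import os
import pickle
import tempfile

# Made-up stand-in for the cleaned lyrics.
songs = ["first song <N> second line", "another song"]

path = os.path.join(tempfile.mkdtemp(), "lyrics_demo.pkl")

# Write the object to disk; the with-block closes the file afterwards.
with open(path, "wb") as f:
    pickle.dump(songs, f)

# Read it back, as you would in a later session.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == songs)  # prints True
```

Only unpickle files that you created or trust, since loading a pickle can execute arbitrary code.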

# Question 2. Unigram Markov Chain Model

You will build a Markov chain for the artist whose lyrics you scraped in Question 1. Your model will process the lyrics and store the word transitions for that artist. The transitions will be stored in a dict called `chain`, which maps each word to a list of "next" words.

For example, if your song was ["The Joker" by the Steve Miller Band](https://www.youtube.com/watch?v=FgDU17xqNXo), `chain` might look as follows:

```
chain = {
    "some": ["people", "call", "people"],
    "call": ["me", "me", "me"],
    "the": ["space", "gangster", "pompitous", ...],
    "me": ["the", "the", "Maurice"],
    ...
}
```

Besides words, you should include a few additional states in your Markov chain. You should have `"<START>"` and `"<END>"` states so that we can keep track of how songs are likely to begin and end. You should also include a state called `"<N>"` to denote line breaks so that you can keep track of where lines begin and end. It is up to you whether you want to normalize case and strip punctuation.

So for example, for ["The Joker"](https://www.azlyrics.com/lyrics/stevemillerband/thejoker.html), you would add the following to your chain:

```
chain = {
    "<START>": ["Some", ...],
    "Some": ["people", ...],
    "people": ["call", ...],
    "call": ["me", ...],
    "me": ["the", ...],
    "the": ["space", ...],
    "space": ["cowboy,", ...],
    "cowboy,": ["yeah", ...],
    "yeah": ["<N>", ...],
    "<N>": ["Some", ..., "Come"],
    ...,
    "Come": ["on", ...],
    "on": ["baby", ...],
    "baby": ["and", ...],
    "and": ["I'll", ...],
    "I'll": ["show", ...],
    "show": ["you", ...],
    "you": ["a", ...],
    "a": ["good", ...],
    "good": ["time", ...],
    "time": ["<END>", ...],
}
```

Your chain will be trained not on just one song, but on all songs by your artist.
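
To make the idea of a "transition" concrete, here is a minimal sketch that lists the transitions in a made-up two-line song. The song text is invented, and it assumes the `" <N> "`-joined format produced in Question 1.

```python
from collections import Counter

# Made-up two-line song in the " <N> "-joined format.
song = "la la land <N> la di da"

# Surround the words with the special states.
states = ["<START>"] + song.split(" ") + ["<END>"]

# Every adjacent pair (current, next) is one observed transition.
transitions = list(zip(states, states[1:]))
counts = Counter(transitions)

print(transitions[:3])
# [('<START>', 'la'), ('la', 'la'), ('la', 'land')]
print(counts[("la", "la")], counts[("<N>", "la")])
# 1 1
```

In the chain, the key `"la"` would therefore map to `["la", "land", "di"]`, one entry for each time `"la"` was followed by something.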

In [15]:
def train_markov_chain(songs):
    """
    Args:
      - songs: a list of strings, where each string represents
               the lyrics of one song by an artist.
    
    Returns:
      A dict that maps a single word ("unigram") to a list of
      words that follow that word, representing the Markov
      chain trained on the lyrics.
    """
    chain = {"<START>": []}
    for lyrics in songs:
        words = lyrics.split(" ")
        chain["<START>"].append(words[0])
        # Each word is followed by the next word; the last word is followed by <END>.
        for word, next_word in zip(words, words[1:] + ["<END>"]):
            chain.setdefault(word, []).append(next_word)

    return chain

In [16]:
# Load the pickled lyrics object that you created in Question 1.
import pickle
with open("lyrics.pkl", "rb") as f:
    lyrics = pickle.load(f)

# Call the function you wrote above.
chain = train_markov_chain(lyrics)

# What words tend to start a song (i.e., what words follow the <START> tag?)
print(chain["<START>"][:20])

# What words tend to begin a line (i.e., what words follow the line break tag?)
print(chain["<N>"][:20])

['La', 'La', "Imma'", 'I', "I'm", 'I', 'You', 'Part', 'Hustle', 'Bandana', 'Yeah', 'Yeah,', 'Yup', "Who's", 'Yo...', 'Dean,', 'Are', 'I', 'All', 'Uhh-uhh']
['Hundred', 'Shot', "Spendin'", "Let's", 'Hundred', 'Shot', "Spendin'", "Let's", 'Let', 'Let', 'Straight', 'Hundred', 'Shot', "Spendin'", "Let's", 'Let', 'Let', 'Let', 'Let', 'Let']


Now, let's generate new lyrics using the Markov chain you constructed above. To do this, we'll begin at the `"<START>"` state and randomly sample a word from the list of words that follow `"<START>"`. Then, at each step, we'll randomly sample the next word from the list of words that followed the current word. We will continue this process until we sample the `"<END>"` state. This will give us the complete lyrics of a randomly generated song!

You may find the `random.choice()` function helpful for this question.
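
Because each list of "next" words keeps duplicates, sampling uniformly from the list with `random.choice()` already picks each word in proportion to how often it followed the current word. A minimal sketch with a made-up list of next words:

```python
import random
from collections import Counter

# Toy list of next words: "me" follows three times as often as "it".
next_words = ["me", "me", "me", "it"]

random.seed(0)  # fixed seed only so the run is repeatable
samples = Counter(random.choice(next_words) for _ in range(10_000))

# The relative frequencies should land near 0.75 and 0.25.
print({word: count / 10_000 for word, count in samples.items()})
```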

In [17]:
import random

def generate_new_lyrics(chain):
    """
    Args:
      - chain: a dict representing the Markov chain,
               such as one generated by train_markov_chain()
    
    Returns:
      A string representing the randomly generated song.
    """
    
    # a list for storing the generated words
    words = []
    # generate the first word
    words.append(random.choice(chain["<START>"]))
    
    start_word = words[0]
    while start_word != "<END>":
        next_word = random.choice(chain[start_word])
        words.append(next_word)
        start_word = next_word
        
    
    
    # join the words together into a string with line breaks
    lyrics = " ".join(words[:-1])
    return "\n".join(lyrics.split("<N>"))

In [18]:
print(generate_new_lyrics(chain))

They don't play that she Tyson, go legend, we playin' dices, those goosebumps 
 Feel like a neo solo  
 Like a long as approaches 
 Shoot his new bitch to watch my soul in all summer 
 How could take off Shake it 
 Hopped up from snippet 
 I chase over again 
 If we won't fit 
 I like Silkk The only wanted fame for you, uh 
 I wouldn't even let that she be heroes


# Question 3. Bigram Markov Chain Model

Now you'll build a more complex Markov chain that uses the last _two_ words (or bigram) to predict the next word. Now your dict `chain` should map a _tuple_ of words to a list of words that appear after it.

As before, you should also include tags that indicate the beginning and end of a song, as well as line breaks. That is, a tuple might contain tags like `"<START>"`, `"<END>"`, and `"<N>"`, in addition to regular words. So for example, for ["The Joker"](https://www.azlyrics.com/lyrics/stevemillerband/thejoker.html), you would add the following to your chain:

```
chain = {
    (None, "<START>"): ["Some", ...],
    ("<START>", "Some"): ["people", ...],
    ("Some", "people"): ["call", ...],
    ("people", "call"): ["me", ...],
    ("call", "me"): ["the", ...],
    ("me", "the"): ["space", ...],
    ("the", "space"): ["cowboy,", ...],
    ("space", "cowboy,"): ["yeah", ...],
    ("cowboy,", "yeah"): ["<N>", ...],
    ("yeah", "<N>"): ["Some", ...],
    ("time", "<N>"): ["Come"],
    ...,
    ("<N>", "Come"): ["on", ...],
    ("Come", "on"): ["baby", ...],
    ("on", "baby"): ["and", ...],
    ("baby", "and"): ["I'll", ...],
    ("and", "I'll"): ["show", ...],
    ("I'll", "show"): ["you", ...],
    ("show", "you"): ["a", ...],
    ("you", "a"): ["good", ...],
    ("a", "good"): ["time", ...],
    ("good", "time"): ["<END>", ...],
}
```
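
The same bookkeeping can be sketched on a made-up song by sliding a window of three states over the padded word sequence. The song text is invented for illustration. The `(None, "<START>")` padding follows the example above.

```python
# Made-up song in the " <N> "-joined format.
song = "la la land <N> la la la"

# Pad so the first bigram is (None, "<START>") and the last word is followed by "<END>".
states = [None, "<START>"] + song.split(" ") + ["<END>"]

chain = {}
for first, second, following in zip(states, states[1:], states[2:]):
    chain.setdefault((first, second), []).append(following)

print(chain[(None, "<START>")])  # ['la']
print(chain[("la", "la")])       # ['land', 'la', '<END>']
```

Note that the bigram `("la", "la")` occurs three times in this toy song, so it collects three entries, one for each occurrence.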

In [19]:
def train_markov_chain(songs):
    """
    Args:
      - songs: a list of strings, where each string represents
               the lyrics of one song by an artist.
    
    Returns:
      A dict that maps a tuple of 2 words ("bigram") to a list of
      words that follow that bigram, representing the Markov
      chain trained on the lyrics.
    """
    chain = {(None, "<START>"): []}
    for lyrics in songs:
        # Pad the song so that the first bigram is (None, "<START>")
        # and the last word is followed by "<END>".
        words = [None, "<START>"] + lyrics.split(" ") + ["<END>"]
        for i in range(len(words) - 2):
            # Append rather than assign, so that songs sharing a first
            # word do not overwrite each other's ("<START>", word) entry.
            chain.setdefault((words[i], words[i + 1]), []).append(words[i + 2])

    return chain

In [20]:
# Load the pickled lyrics object that you created in Question 1.
import pickle
with open("lyrics.pkl", "rb") as f:
    lyrics = pickle.load(f)

# Call the function you wrote above.
chain = train_markov_chain(lyrics)

# What words tend to start a song (i.e., what words follow the <START> tag?)
print(chain[(None, "<START>")])

['La', 'La', "Imma'", 'I', "I'm", 'I', 'You', 'Part', 'Hustle', 'Bandana', 'Yeah', 'Yeah,', 'Yup', "Who's", 'Yo...', 'Dean,', 'Are', 'I', 'All', 'Uhh-uhh', 'Yeah', 'Ayo', 'Yeah,', 'Yeah,', 'Hah,', 'The', "I'm", "Don't", 'I', 'She', 'She', 'Ocean', 'Yeah', 'High', 'A-Team', 'Metro', 'Mustard', 'Lyrics', 'Back', 'Aight,', 'Whoooo', 'Fuck', 'I', 'Mistercap', 'I', "Let's", "'70s,", 'Mhm', 'Yo,', 'Mixtape', 'That', 'Bitch,', 'Wheezy', 'Fuck', 'Call', 'Wakanda,', 'I', 'Thank', 'Tracklist', 'Yayo,', 'Yayo,', "What's", 'Always', 'God!', 'Full', 'Dis', 'Dis', 'Ouu-ouu-ouu-ouuu-ouu-ouu-ah', '**Feel', "I'm", 'Murda', 'Hey', 'I', 'No,', "What's", "Ballin'", 'Quavo!', 'Yeah,', 'Woke', 'Missed', 'Her', '30,', 'This', 'I', 'Sorry', 'Hey', 'Our', 'They', 'Where', 'Where', 'Mustard', 'Honorable', "Cos'", 'Midnight', 'Midnight', 'Tracklist:', 'Damn,', 'Yeah,', 'Yeah,', 'Devil', 'Mazi', 'And', 'Drink', 'Enter', 'Yeah', 'Lyrics', 'DJ', 'When', 'Cassette', 'Sometimes', 'Hooo', 'Travis', 'I', 'As', 'Ashes,'

Now, let's generate new lyrics using the Markov chain you constructed above. To do this, we'll begin at the `(None, "<START>")` state and randomly sample a word from the list of words that follow this bigram. Then, at each step, we'll randomly sample the next word from the list of words that followed the current bigram (i.e., the last two words). We will continue this process until we sample the `"<END>"` state. This will give us the complete lyrics of a randomly generated song!
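
The way the bigram state slides forward can be traced on a hand-written toy chain. This is a minimal sketch: `toy_chain` is hypothetical, and every bigram in it has exactly one successor, so the walk is deterministic.

```python
import random

# Hand-written toy chain; each bigram has a single successor.
toy_chain = {
    (None, "<START>"): ["Come"],
    ("<START>", "Come"): ["on"],
    ("Come", "on"): ["baby"],
    ("on", "baby"): ["<END>"],
}

state = (None, "<START>")
words = []
while True:
    next_word = random.choice(toy_chain[state])
    if next_word == "<END>":
        break
    words.append(next_word)
    # Slide the window: drop the older word, keep the newer one.
    state = (state[1], next_word)

print(" ".join(words))  # prints "Come on baby"
```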

In [21]:
import random

def generate_new_lyrics(chain):
    """
    Args:
      - chain: a dict representing the Markov chain,
               such as one generated by train_markov_chain()
    
    Returns:
      A string representing the randomly generated song.
    """
    
    # a list for storing the generated words
    words = []
    # generate the first word
    words.append(random.choice(chain[(None, "<START>")]))
  
    start_word = ("<START>",words[0])
    while "<END>" not in start_word:
        next_word = random.choice(chain[start_word])
        words.append(next_word)
        start_word = (words[-2],next_word)

    
    
    # join the words together into a string with line breaks
    lyrics = " ".join(words[:-1])
    return "\n".join(lyrics.split("<N>"))

In [22]:
print(generate_new_lyrics(chain))

I got my haters so so broke 
 Oh-oh-oh shit 
 I'm tryna tell you that black card, no limit 
 I love when my work done 
 Verse 2: Travi$ Scott 
 Verse: Tyler Devlin 
 I like my mothafuckin' granny did woo 
 I know you're home, baby ooh, 
 yeah 
 I tried the Perky but it was wild 
 Because it's no mistake 
 Rollin' the dice ah 
 I think you run into me, better have my 
 I've been drinkin' 
 Back-backyard, we gettin' it, far from what you do the impossible 
 I coulda went to school where they teach you finesse 
 Five hundred shoes for the poncho 
 Chiefin' while I'm here 
 Big shot hol' on, hol' on 
 Set my head as I get in there, get in that, dipped in that 
 Nah, then your father 
 I'm talking 'bout, you know no better 
 Baby, love, baby, baby, you're my love know that I'm a dog 
 , 
 yeah, yeah 
 I be ballin' 
 We so high, upper echelon Straight up 
 Have that, pass that, light it up, run it back like you would have made it out the ride 
 In the den, left it in your town, I'm spotting 

# Analysis

Compare the quality of the lyrics generated by the unigram model (in Question 2) and the bigram model (in Question 3). Which model seems to generate more reasonable lyrics? Can you explain why? What do you see as the advantages and disadvantages of each model?

Unigram (Pro)
- Produces more varied output, since each word usually has many possible "next words" to sample from.

Unigram (Neg)
- Lyrics are much less coherent, because each word depends only on the single word before it.

Bigram (Pro)
- Lyrics are more reasonable than the unigram model's, since conditioning on the previous two words forms more coherent phrases.

Bigram (Neg)
- Less random than the unigram model: any given bigram appears less often in the lyrics than any given single word, so each state has fewer "next words" to choose from, and the model can end up repeating stretches of the original lyrics.
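
One way to put a number on this trade-off is to count how many distinct next words each state has on average. The sketch below is an illustration on made-up songs, not a measurement on the Travis Scott lyrics. `average_branching` and `toy_songs` are hypothetical names. For simplicity it pads with repeated `"<START>"` tags rather than the `(None, "<START>")` convention used in Question 3.

```python
def average_branching(songs, order):
    """Mean number of distinct next words per state for a chain of the given order."""
    successors = {}
    for song in songs:
        states = ["<START>"] * order + song.split(" ") + ["<END>"]
        for i in range(len(states) - order):
            key = tuple(states[i:i + order])
            successors.setdefault(key, set()).add(states[i + order])
    return sum(len(s) for s in successors.values()) / len(successors)


# Made-up songs, just to show the effect.
toy_songs = ["a b a c a d", "a b a b"]

print(average_branching(toy_songs, 1))  # 1.6 distinct next words per unigram
print(average_branching(toy_songs, 2))  # about 1.29 per bigram
```

A lower average means that more states have only one possible continuation, which is what pushes the bigram model toward reproducing its training text.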

# Submission Instructions

Once you are finished, follow these steps:

1. Restart the kernel and re-run this notebook from beginning to end by going to `Kernel > Restart Kernel and Run All Cells`.
2. If this process stops halfway through, that means there was an error. Correct the error and repeat Step 1 until the notebook runs from beginning to end.
3. Double check that there is a number next to each code cell and that these numbers are in order.

Then, submit your lab as follows:

1. Go to `File > Export Notebook As > PDF`.
2. Double check that the entire notebook, from beginning to end, is in this PDF file. (If the notebook is cut off, try first exporting the notebook to HTML and printing to PDF.)
3. Upload the PDF [to PolyLearn](https://polylearn.calpoly.edu/AY_2018-2019/mod/assign/view.php?id=349486).