
# Song Lyrics Generator

In Part A of this lab, you scraped a website to get the lyrics of songs by your favorite artist.

In this part, you will train a model called a Markov chain on those lyrics, and generate a song in the style of your favorite artist.

## Question 0. Read in the Data

In Part A, you saved a `DataFrame` of songs by your favorite artist to a CSV file. Read in that CSV file here.

In [None]:
# TODO: Upload and read in your data from the CSV file
import pandas as pd
path = "/content/Stevie.csv"
df = pd.read_csv(path)
df.head(5)

| Unnamed: 0 | Song | Lyrics |
|---|---|---|
| 0 | Affairs of the Heart | One set of doors was the color of honey One se... |
| 1 | After the Glitter Fades | Well I never thought I'd make it Here in Holly... |
| 2 | Alice | Well I heard she flew down to the Mountain Cit... |
| 3 | The Apartment Song | I used to live in a two room apartment Neig... |
| 4 | Angel | Sometimes The most beautiful things The most i... |


## Question 1. Unigram Markov Chain Model

Build a Markov chain using the data that you read in above. Your model will process the lyrics and store the word transitions for that artist. The transitions will be stored in a dict called `chain`, which maps each word to a list of "next" words.

For example, if your song was ["The Joker" by the Steve Miller Band](https://www.youtube.com/watch?v=dV3AziKTBUo), `chain` might look as follows:

```
chain = {
    "some": ["people", "call", "people"],
    "call": ["me", "me", "me"],
    "the": ["space", "gangster", "pompitous", ...],
    "me": ["the", "the", "Maurice"],
    ...
}
```

Besides words, you should include a few additional states in your Markov chain. You should have `"<START>"` and `"<END>"` states so that we can keep track of which words are likely to begin and end songs. You should also include a state called `"<N>"` to denote line breaks so that you can keep track of where lines begin and end. It is up to you whether to normalize case and strip punctuation.

So for example, for ["The Joker"](https://www.azlyrics.com/lyrics/stevemillerband/thejoker.html), you would add the following to your chain:

```
chain = {
    "<START>": ["Some", ...],
    "Some": ["people", ...],
    "people": ["call", ...],
    "call": ["me", ...],
    "me": ["the", ...],
    "the": ["space", ...],
    "space": ["cowboy,", ...],
    "cowboy,": ["yeah", ...],
    "yeah": ["<N>", ...],
    "<N>": ["Some", ..., "Come"],
    ...,
    "Come": ["on", ...],
    "on": ["baby", ...],
    "baby": ["and", ...],
    "and": ["I'll", ...],
    "I'll": ["show", ...],
    "show": ["you", ...],
    "you": ["a", ...],
    "a": ["good", ...],
    "good": ["time", ...],
    "time": ["<END>", ...],
}
```

Your chain will be trained not on just one song, but on all songs by your artist.
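
As a rough illustration of how the tags fit in, the sketch below turns one toy lyric into the sequence of states the chain will see, then records each transition. It assumes line breaks survive as `\n` in the text (your scraped lyrics may not preserve them), and `tag_tokens` is a hypothetical helper name, not part of the lab.

```python
def tag_tokens(lyric):
    """Return the lyric's words with <START>, <N>, and <END> tags added.

    Assumes lines are separated by newline characters (an assumption;
    scraped lyrics may need a different line-break rule).
    """
    tokens = ["<START>"]
    lines = [line.split() for line in lyric.splitlines() if line.strip()]
    for i, line_words in enumerate(lines):
        if i > 0:
            tokens.append("<N>")  # mark the break between two lines
        tokens.extend(line_words)
    tokens.append("<END>")
    return tokens


toy_lyric = "Some people call me\nCome on baby"
tokens = tag_tokens(toy_lyric)
print(tokens)

# Each adjacent pair (current, following) is one transition to record.
toy_chain = {}
for current, following in zip(tokens, tokens[1:]):
    toy_chain.setdefault(current, []).append(following)
print(toy_chain["<N>"])
```

Note that `"<END>"` only ever appears as a follower, never as a key, since nothing comes after it.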

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
def train_unigram_markov_chain(lyrics):
  """
  Args:
    - lyrics: a list or Series of strings, where each string represents
              the lyrics of one song by an artist.

  Returns:
    A dict that maps a single word ("unigram") to a list of words
    that follow that unigram, representing the Markov chain.
  """
  chain = {"<START>": [], "<N>": []}
  for lyric in lyrics:
    # TODO: update the Markov chain
    words = lyric.split()
    for i in range(len(words)):
      if i == 0:
        chain["<START>"].append(words[i])
      if words[i] not in chain:
        chain[words[i]] = []
      if i != len(words) - 1:
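        # Heuristic: a capitalized word (other than "I", "I'm", "I'll", ...)
        # is treated as the start of a new line.
        next_word = words[i + 1]
        if not next_word[0].isupper() or next_word == "I" or next_word.startswith("I'"):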
          chain[words[i]].append(words[i+1])
        else:
          chain[words[i]].append("<N>")
          chain["<N>"].append(words[i + 1])
      else:
        chain[words[i]].append("<END>")

  return chain

In [None]:
# Call the function on the lyrics column of your `DataFrame`.
unigram_chain = train_unigram_markov_chain(df["Lyrics"])

# What words tend to start a song (i.e., what words follow the <START> tag?)
print(unigram_chain["<START>"])

# What words tend to begin a line (i.e., what words follow the line break tag?)
print(unigram_chain["<N>"][:20])

['One', 'Well', 'Well', 'I', 'Sometimes', 'It', 'At', 'This', 'Beautiful', "You're", 'You', 'The', 'I', 'There', 'You--beloved', 'For', "I've", 'Meeting', 'Climbs', 'Listen', "What's", 'I', "C'mon,", "You've", 'You', 'She', 'Do', 'I', 'Eyes', 'Crying', 'Maybe', 'Well', 'Walk', 'Papa', 'It', 'Baby,', 'Now', 'Just', 'Crying', 'We', 'She', "Don't", 'I', 'I', 'So', 'I', '(Burning...)', 'To', 'I', "She's", 'Dim', 'You', 'I', 'There', 'Except', 'And', 'What', 'Through', 'When', 'Rock', 'Though', "Don't", 'Rhinestone', 'So', 'Has', 'Hello,', 'Alas', 'Still', 'Hurry', 'Yes,', 'I', 'When', 'Talk', "Someone's", 'You', 'Outside', 'One', 'I', 'Baby', "You've", 'Illume', 'He', "You've", 'Well', 'I', 'Every', 'The', "It's", 'I', 'Such', 'It', 'Nobody', 'Did', "I'll", 'Temptation', 'I', 'Now', 'Is', 'How', 'Well', 'Sunflowers', 'Can', 'It', 'Do', 'Oh', 'You', 'Wake', 'Ooh', "I'm", 'Come', 'And', '-', 'Some', "You're", 'Like', 'I', 'In', 'And', 'Thrown', 'When', 'I', 'It', 'If', 'I', 'I', '(Ooh', 'Out

Now, let's generate new lyrics using the Markov chain you constructed above. To do this, we'll begin at the `"<START>"` state and randomly sample a word from the list of words that follow `"<START>"`. Then, at each step, we'll randomly sample the next word from the list of words that followed the current word. We will continue this process until we sample the `"<END>"` state. This will give us the complete lyrics of a randomly generated song!

You may find the `random.choice()` function helpful for this question.
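
Because each follower list keeps its duplicates, `random.choice()` picks a word with probability proportional to how often it followed the current word. The sketch below makes those probabilities explicit for the `"me"` entry from the "The Joker" example above; `transition_probabilities` is a hypothetical helper used only for illustration.

```python
import random
from collections import Counter


def transition_probabilities(followers):
    """Map each follower to its empirical probability of being sampled."""
    counts = Counter(followers)
    total = len(followers)
    return {word: count / total for word, count in counts.items()}


followers_of_me = ["the", "the", "Maurice"]  # from the example chain above
probs = transition_probabilities(followers_of_me)
print(probs)

# random.choice samples uniformly from the list, so a word that appears
# twice is drawn about twice as often as a word that appears once.
random.seed(0)  # seeded only so the sketch is repeatable
samples = [random.choice(followers_of_me) for _ in range(3000)]
sample_counts = Counter(samples)
print(sample_counts)
```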

In [None]:
import random

def generate_new_lyrics_from_unigram_chain(chain):
  """
  Args:
    - chain: a dict representing a unigram Markov chain

  Returns:
    A string representing the randomly generated song.
  """
  # a list for storing the generated words,
  # initialized with the first word
  words = [random.choice(chain["<START>"])]

  # TODO: generate the next word, appending to the `words` list
  current_word = words[0]
  while current_word != "<END>":
    # sample a follower of the current word; fall back to a word that
    # starts a line if the current word was never seen in training
    followers = chain[current_word] if current_word in chain else chain["<N>"]
    current_word = random.choice(followers)
    words.append(current_word)


  # join the words together into a string with line breaks
  lyrics = " ".join(words[:-1])
  return "\n".join(lyrics.split("<N>"))

In [None]:
print(generate_new_lyrics_from_unigram_chain(unigram_chain))

Just because you choose 
 And a while 
 I can be gone, stolen from difference worlds...we are the days since you've never said you'd like an invitation would live with the one 
 If you cared about you understood 
 Like 
 I know that is my mindThe blonde in the night in her ancient ways to be your skin 
 She was always seems 
 She takes yours away 
 But more time, 
 You'll have to be here... 
 Li 
 For the one 
 Lo-o-o-o-ove 
 What's the spring to the window... 
 Intuition, has to me in my reflection in the same closed door 
 Well 
 He fell for a woman...yes 
 Don't come into you stay in your happy-ever-afters... 
 Paris 
 Oh, there's no difference 
 Seemingly waiting on 
 And the room in the love will disappear 
 It was late 
 Come in that the middle of that it's so much


## Question 2. Bigram Markov Chain Model

Now you'll build a more complex Markov chain that uses the last _two_ words (or bigram) to predict the next word. Your dict should now map a _tuple_ of words to a list of words that appear after it.
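
A tuple works as a dict key because it is immutable and hashable, whereas a list is not. A minimal sketch, using made-up entries rather than real training data:

```python
bigram_chain_demo = {}  # hypothetical toy chain, for illustration only

# A tuple of two words can be used as a key.
bigram_chain_demo[("call", "me")] = ["the", "Maurice"]
print(bigram_chain_demo[("call", "me")])

# A list cannot: Python raises TypeError because lists are unhashable.
list_key_error = None
try:
    bigram_chain_demo[["call", "me"]] = ["the"]
except TypeError as error:
    list_key_error = error
    print("list key rejected:", error)
```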

As before, you should also include tags that indicate the beginning and end of a song, as well as line breaks. That is, a tuple might contain tags like `"<START>"`, `"<END>"`, and `"<N>"`, in addition to regular words. So for example, for ["The Joker"](https://www.azlyrics.com/lyrics/stevemillerband/thejoker.html), you would add the following to your chain:

```
bigram_chain = {
    (None, "<START>"): ["Some", ...],
    ("<START>", "Some"): ["people", ...],
    ("Some", "people"): ["call", ...],
    ("people", "call"): ["me", ...],
    ("call", "me"): ["the", ...],
    ("me", "the"): ["space", ...],
    ("the", "space"): ["cowboy,", ...],
    ("space", "cowboy,"): ["yeah", ...],
    ("cowboy,", "yeah"): ["<N>", ...],
    ("yeah", "<N>"): ["Some", ...],
    ("time", "<N>"): ["Come"],
    ...,
    ("<N>", "Come"): ["on", ...],
    ("Come", "on"): ["baby", ...],
    ("on", "baby"): ["and", ...],
    ("baby", "and"): ["I'll", ...],
    ("and", "I'll"): ["show", ...],
    ("I'll", "show"): ["you", ...],
    ("show", "you"): ["a", ...],
    ("you", "a"): ["good", ...],
    ("a", "good"): ["time", ...],
    ("good", "time"): ["<END>", ...],
}
```

In [None]:
def train_bigram_markov_chain(lyrics):
  """
  Args:
    - lyrics: a list or Series of strings, where each string represents
              the lyrics of one song by an artist.

  Returns:
    A dict that maps a tuple of 2 words ("bigram") to a list of words
    that follow that bigram, representing the Markov chain.
  """
  chain = {(None, "<START>"): []}
  for lyric in lyrics:
    # TODO: train the Markov chain
    # Heuristic: a capitalized word (other than "I", "I'm", "I'll", ...)
    # is treated as the start of a new line.
    tokens = [None, "<START>"]
    for j, word in enumerate(lyric.split()):
      starts_line = word[0].isupper() and word != "I" and not word.startswith("I'")
      if j > 0 and starts_line:
        tokens.append("<N>")
      tokens.append(word)
    tokens.append("<END>")

    # Slide over the tagged tokens: each pair of adjacent tokens is a key,
    # and the token right after the pair is recorded as a follower.
    # setdefault keeps followers from earlier songs instead of resetting them.
    for first, second, following in zip(tokens, tokens[1:], tokens[2:]):
      chain.setdefault((first, second), []).append(following)

  return chain

In [None]:
# Call your function on the lyrics column of your `DataFrame`.
bigram_chain = train_bigram_markov_chain(df["Lyrics"])

# What words tend to start a song (i.e., what words follow the <START> tag?)
print(bigram_chain[(None, "<START>")])

['One', 'Well', 'Well', 'I', 'Sometimes', 'It', 'At', 'This', 'Beautiful', "You're", 'You', 'The', 'I', 'There', 'You--beloved', 'For', "I've", 'Meeting', 'Climbs', 'Listen', "What's", 'I', "C'mon,", "You've", 'You', 'She', 'Do', 'I', 'Eyes', 'Crying', 'Maybe', 'Well', 'Walk', 'Papa', 'It', 'Baby,', 'Now', 'Just', 'Crying', 'We', 'She', "Don't", 'I', 'I', 'So', 'I', '(Burning...)', 'To', 'I', "She's", 'Dim', 'You', 'I', 'There', 'Except', 'And', 'What', 'Through', 'When', 'Rock', 'Though', "Don't", 'Rhinestone', 'So', 'Has', 'Hello,', 'Alas', 'Still', 'Hurry', 'Yes,', 'I', 'When', 'Talk', "Someone's", 'You', 'Outside', 'One', 'I', 'Baby', "You've", 'Illume', 'He', "You've", 'Well', 'I', 'Every', 'The', "It's", 'I', 'Such', 'It', 'Nobody', 'Did', "I'll", 'Temptation', 'I', 'Now', 'Is', 'How', 'Well', 'Sunflowers', 'Can', 'It', 'Do', 'Oh', 'You', 'Wake', 'Ooh', "I'm", 'Come', 'And', '-', 'Some', "You're", 'Like', 'I', 'In', 'And', 'Thrown', 'When', 'I', 'It', 'If', 'I', 'I', '(Ooh', 'Out

Now, let's generate new lyrics using the Markov chain you constructed above. To do this, we'll begin at the `(None, "<START>")` state and randomly sample a word from the list of words that follow this bigram. Then, at each step, we'll randomly sample the next word from the list of words that followed the current bigram (i.e., the last two words). We will continue this process until we sample the `"<END>"` state. This will give us the complete lyrics of a randomly generated song!

In [None]:
import random

def generate_new_lyrics_from_bigram_chain(chain):
  """
  Args:
    - chain: a dict representing a bigram Markov chain

  Returns:
    A string representing the randomly generated song.
  """

  # a list for storing the generated words,
  # initialized with the first word
  words = [random.choice(chain[(None, "<START>")])]
  current_tuple = ("<START>", words[0])
  # TODO: generate the next word, appending to the `words` list

  while current_tuple[1] != "<END>":
    last_word = current_tuple[1]
    if current_tuple in chain:
      new_word = random.choice(chain[current_tuple])
    else:
      # Unseen bigram: back off to any bigram that starts with the last
      # word, and end the song if no such bigram was seen in training.
      candidates = [bigram[1] for bigram in chain if bigram[0] == last_word]
      new_word = random.choice(candidates) if candidates else "<END>"
    words.append(new_word)
    current_tuple = (last_word, new_word)




  # join the words together into a string with line breaks
  lyrics = " ".join(words[:-1])
  return "\n".join(lyrics.split("<N>"))

In [None]:
print(generate_new_lyrics_from_bigram_chain(bigram_chain))

He fell for her 
 She burned his house down saying 
 I cannot show 
 Got the feeling remains... 
 Even in the world 
 Everyone was smiling 
 Though their pain was apparent and the wind 
 In dark sorrow 
 They don't matter at all 
 Where are the lost world. 
 Yea You're my little saviour, 
 I cannot pretend 
 You're not a sin 
 Give to me 
 But I happy 
 Yes for now, 
 But I didn't love you


## Question 3. Analysis

Compare the quality of the lyrics generated by the unigram model (in Question 1) and the bigram model (in Question 2). Which model seems to generate more reasonable lyrics? Can you explain why? What do you see as the advantages and disadvantages of each model?

Share your favorite generated song with the rest of the class in this [thread on Ed](https://edstem.org/us/courses/51640/discussion/4463213)! Please be mindful about the appropriateness of your song lyrics.

**YOUR ANSWER HERE.**

The bigram chain generated a much more coherent song than the unigram chain did. Each new word had to follow two words in order rather than just one, so it was chosen with more context and continued a longer stretch of text that already made sense. One thing we discussed in class is that these algorithms lack overall context when generating new output, and I definitely see that here.

The main drawback of the bigram chain is that it is much larger than the unigram chain. So although we got a somewhat more coherent song, getting even more coherent output would require even bigger chains (keyed on more than two words), which would be more expensive in both computation and memory. The advantage of the unigram chain is that it is smaller and cheaper to compute, while the bigram chain makes a marginally better song.
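
To make the size trade-off concrete, here is a small sketch that counts the states in a unigram and a bigram chain built from the same toy text. The `build_chain` helper and the toy tokens are assumptions for illustration, not the chains trained above.

```python
def build_chain(tokens, order):
    """Build a Markov chain keyed on the previous `order` tokens."""
    chain = {}
    for i in range(len(tokens) - order):
        key = tuple(tokens[i:i + order])
        chain.setdefault(key, []).append(tokens[i + order])
    return chain


toy_tokens = ("<START> the cat sat on the mat <N> "
              "the cat ran to the dog <END>").split()

unigram = build_chain(toy_tokens, order=1)
bigram = build_chain(toy_tokens, order=2)

# More context means more distinct states, each with fewer observed followers.
for name, chain in [("unigram", unigram), ("bigram", bigram)]:
    n_states = len(chain)
    avg_followers = sum(len(v) for v in chain.values()) / n_states
    print(name, n_states, round(avg_followers, 2))
```

Even on this tiny text the bigram chain has more keys and fewer followers per key. On a full set of lyrics the gap is typically wider, since there are usually far more distinct word pairs than distinct words.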

The other thing that really impresses me is that the generated songs are roughly as long as an actual Stevie Nicks song, so the random choices reached the `<END>` state in a realistic number of steps.

## Submission Instructions

- Restart this notebook and run the cells from beginning to end.
  - Go to Runtime > Restart and Run All.

In [None]:
# @markdown Run this cell to download this notebook as a webpage, `_NOTEBOOK.html`.

import google, json, nbformat

# Get the current notebook and write it to _NOTEBOOK.ipynb
raw_notebook = google.colab._message.blocking_request("get_ipynb",
                                                      timeout_sec=30)["ipynb"]
with open("_NOTEBOOK.ipynb", "w", encoding="utf-8") as ipynb_file:
  ipynb_file.write(json.dumps(raw_notebook))

# Use nbconvert to convert .ipynb to .html.
!jupyter nbconvert --to html --log-level WARN _NOTEBOOK.ipynb

# Download the .html file.
google.colab.files.download("_NOTEBOOK.html")


- Open `_NOTEBOOK.html` in your browser, and save it as a PDF.
    - Go to File > Print > Save as PDF.
- Double check that all of your code and output is visible in the saved PDF.
- Upload the PDF to [Gradescope](https://www.gradescope.com/courses/694907).
    - Please be sure to select the correct pages corresponding to each question.