# Web Mining and Applied NLP (44-620)

- Author: Aaron Gillespie
- Github: https://github.com/aarongilly 

## Requests, JSON, and NLP

### Student Name: Aaron Gillespie

Perform the tasks described in the Markdown cells below.  When you have completed the assignment make sure your code cells have all been run (and have output beneath them) and ensure you have committed and pushed ALL of your changes to your assignment repository.

Make sure you have [installed spaCy and its pipeline](https://spacy.io/usage#quickstart) and [spaCyTextBlob](https://spacy.io/universe/project/spacy-textblob).

Every question that requires you to write code will have a code cell underneath it; you may either write your entire solution in that cell or write it in a python file (`.py`), then import and run the appropriate code to answer the question.

This assignment requires that you write additional files (either JSON or pickle files); make sure to submit those files in your repository as well.

## Imports

In [85]:
# Create and activate a Python virtual environment. 
# Before starting the project, try all these imports FIRST
# Address any errors you get running this code cell 
# by installing the necessary packages into your active Python environment.
# Try to resolve issues using your materials and the web.
# If that doesn't work, ask for help in the discussion forums.
# You can't complete the exercises until you import these - start early! 
# We also import json and pickle (included in the Python Standard Library).

import json
import os
# import pickle  # unused: this notebook stores its results as JSON

import requests
import spacy
from spacy.tokens import Doc
from spacytextblob.spacytextblob import SpacyTextBlob

print('All prereqs installed.')
!pip list

All prereqs installed.
Package                   Version
------------------------- --------------
annotated-types           0.7.0
anyio                     4.9.0
appnope                   0.1.4
argon2-cffi               25.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 3.0.0
async-lru                 2.0.5
attrs                     25.3.0
babel                     2.17.0
beautifulsoup4            4.13.4
bleach                    6.2.0
blis                      1.3.0
catalogue                 2.0.10
certifi                   2025.6.15
cffi                      1.17.1
charset-normalizer        3.4.2
click                     8.2.1
cloudpathlib              0.21.1
comm                      0.2.2
confection                0.1.5
contourpy                 1.3.2
cycler                    0.12.1
cymem                     2.0.11
debugpy                   1.8.14
decorator                 5.2.1
defusedxml                0.7.1
en_core_web_sm         

## Questions

### Question 1:

1. The following code accesses the [lyrics.ovh](https://lyricsovh.docs.apiary.io/#reference/0/lyrics-of-a-song/search) public api, searches for the lyrics of a song, and stores it in a dictionary object.  Write the resulting json to a file (either a JSON file or a pickle file; you choose). You will read in the contents of this file for future questions so we do not need to frequently access the API.

In [86]:
artist = 'They Might Be Giants'
song = 'Birdhouse in your soul'
requestUrl = f'https://api.lyrics.ovh/v1/{artist}/{song}'
# print(f'Request URL: {requestUrl}')

# Fetch the lyrics from the API only if lyrics.json does not already exist,
# and save the result to a JSON file to avoid repeated API calls
if not os.path.exists('lyrics.json'):
    result = requests.get(requestUrl).json()
    with open('lyrics.json', 'w', encoding='utf-8') as f:
        json.dump(result, f, ensure_ascii=False, indent=4)

# Load the lyrics from the JSON file
with open('lyrics.json', 'r', encoding='utf-8') as f:
    lyrics = json.load(f)
    # Print the lyrics
    print(lyrics['lyrics'][:100])

I'm your only friend 
I'm not your only friend 
But I'm a little glowing friend 
But really I'm not 
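
The cache-or-fetch idea in the cell above can be written as a small reusable helper. This is only a sketch: `build_lyrics_url` and `load_or_fetch` are hypothetical names that are not part of the assignment, and the `fetch` argument is any zero-argument callable that returns a dictionary. Percent-encoding the artist and song is shown explicitly here, although `requests` normally takes care of spaces in a URL on its own.

```python
import json
import os
from urllib.parse import quote


def build_lyrics_url(artist, song):
    """Percent-encode both path segments so spaces and punctuation are safe."""
    base = "https://api.lyrics.ovh/v1"
    return f"{base}/{quote(artist, safe='')}/{quote(song, safe='')}"


def load_or_fetch(path, fetch):
    """Return the cached JSON at `path`, calling `fetch()` only on a cache miss."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    # Cache miss: fetch once, write the result to disk, then return it
    result = fetch()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=4)
    return result
```

In this notebook it could be called as `load_or_fetch('lyrics.json', lambda: requests.get(build_lyrics_url(artist, song)).json())`, so the API is contacted only when the file is missing.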


### Question 2:

2. Read in the contents of your file.  Print the lyrics of the song (not the entire dictionary!) and use spaCyTextBlob to perform sentiment analysis on the lyrics.  Print the polarity score of the sentiment analysis.  Given that the range of the polarity score is `[-1.0,1.0]`, which corresponds to how positive or negative the text in question is, do you think the lyrics have a more positive or negative connotation?  Answer this question in a comment in your code cell.

In [87]:
# Load model
nlp = spacy.load("en_core_web_sm")

# Build pipeline
if "spacytextblob" not in nlp.pipe_names:
    nlp.add_pipe("spacytextblob", last=True)

# Print pipeline components
print("Pipeline components:", nlp.pipe_names)

# Manually register the extensions 
if not Doc.has_extension("polarity"):
    Doc.set_extension("polarity", getter=lambda doc: doc._.blob.polarity)
if not Doc.has_extension("subjectivity"):
    Doc.set_extension("subjectivity", getter=lambda doc: doc._.blob.subjectivity)

# Load lyrics from file
with open("lyrics.json", "r", encoding="utf-8") as f:
    data = json.load(f)
lyrics = data["lyrics"]

# Preview
print("\n_lyrics preview:_\n", f"*{lyrics[:100]}...*")  

# Run pipeline
doc = nlp(lyrics)

# Access registered attributes and round results
polarity = round(doc._.polarity, 2)
subjectivity = round(doc._.subjectivity, 2)

# Print results
print("\nPolarity:", polarity)
print("Subjectivity:", subjectivity)

# Q: Do you think the song is positive or negative?
# A: The song has a polarity of 0.05, which is very near zero, but slightly positive. 
#    This would be considered a (very) mildly positive song.

Pipeline components: ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'spacytextblob']

_lyrics preview:_
 *I'm your only friend 
I'm not your only friend 
But I'm a little glowing friend 
But really I'm not ...*

Polarity: 0.05
Subjectivity: 0.55


### Question 3:

3. Write a function that takes an artist, song, and filename, accesses the lyrics.ovh api to get the song lyrics, and writes the results to the specified filename.  Test this function by getting the lyrics to any four songs of your choice and storing them in different files.

In [88]:
test_artists = ['The Strokes', 'Goldfinger', 'AJR', 'Johnny Cash']
test_songs = ['Last Nite', '99 Red Balloons', 'Bang!', 'Hurt']
test_file_names = ['The Strokes - Last Nite', 'Goldfinger - 99 Red Balloons',
              'AJR - Bang!', 'Johnny Cash - Hurt']

def get_lyrics(artist, song, filename):
    requestUrl = f'https://api.lyrics.ovh/v1/{artist}/{song}'
    if not os.path.exists(filename):
        result = json.loads(requests.get(requestUrl).text)
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(result, f, ensure_ascii=False, indent=4)

# Test the function with the test artists and songs
for artist, song, name in zip(test_artists, test_songs, test_file_names):
    filename = f'{name}.json'
    get_lyrics(artist, song, filename)
    print(f"Fetched lyrics for {name} and saved to {filename}")

Fetched lyrics for The Strokes - Last Nite and saved to The Strokes - Last Nite.json
Fetched lyrics for Goldfinger - 99 Red Balloons and saved to Goldfinger - 99 Red Balloons.json
Fetched lyrics for AJR - Bang! and saved to AJR - Bang!.json
Fetched lyrics for Johnny Cash - Hurt and saved to Johnny Cash - Hurt.json
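
One gap in `get_lyrics` is that it writes whatever the API returns, so a failed lookup would be cached and later analyzed as if it were a song. The sketch below guards against that. It assumes that a failed lookup comes back as JSON without a usable `lyrics` field, which I have not verified against the API documentation, and `save_lyrics_result` is a hypothetical helper name.

```python
import json


def save_lyrics_result(result, filename):
    """Write `result` to `filename` only if it contains non-empty lyrics.

    Returns True when the file was written and False otherwise.
    """
    # Assumption: a successful response is a dict with a non-empty "lyrics" string
    lyrics = result.get("lyrics") if isinstance(result, dict) else None
    if not lyrics:
        return False
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=4)
    return True
```

The return value lets the calling loop print "saved" only when a file was actually written, instead of reporting success unconditionally.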


### Question 4:

4. Write a function that takes the name of a file that contains song lyrics, loads the file, performs sentiment analysis, and returns the polarity score.  Use this function to print the polarity scores (with the name of the song) of the three files you created in question 3.  Does the reported polarity match your understanding of the song's lyrics? Why or why not do you think that might be?  Answer the questions in either a comment in the code cell or a markdown cell under the code cell.

In [89]:
def get_lyrics_polarity(filename):
    """
    Loads lyrics from a JSON file, performs sentiment analysis using spaCyTextBlob,
    and returns the polarity score.
    """
    with open(filename, 'r', encoding='utf-8') as f:
        data = json.load(f)
    lyrics = data.get('lyrics', '')
    doc = nlp(lyrics)

    # Round the polarity exposed by the registered extension and return it
    return round(doc._.polarity, 2)
    
for name in test_file_names:
    filename = f'{name}.json'
    polarity = get_lyrics_polarity(filename)
    print(f"Polarity for {name}: {polarity}")

Polarity for The Strokes - Last Nite: 0.03
Polarity for Goldfinger - 99 Red Balloons: -0.01
Polarity for AJR - Bang!: 0.44
Polarity for Johnny Cash - Hurt: 0.07


#### Does each polarity match my expectations?

In short, **not really**.

**The Strokes - Last Nite** 
- Polarity: 
  - 0.03
- Interpretation: 
  - slightly positive, but mostly neutral
- Alignment with expectations:
  - **Not bad.** The song basically tells a story, which, now that I'm looking at it in the context of a homework assignment, is perhaps not the *best* story to be analyzing. Overall, I'd say that from the perspective of the protagonist it's *probably* a positive thing. But on the whole I wouldn't disagree with "slightly positive".

**Goldfinger - 99 Red Balloons**
- Polarity:
  - -0.01
- Interpretation:
  - Essentially neutral
- Alignment with expectations:
  - **Not great**. I'd expect *more negative*. The song talks about panic and war and whatnot. Interestingly, I picked a song with some German in it. If I had more time I'd re-download the spaCy model with `german` ticked on and see if that changed the polarity.

**AJR - Bang!**
- Polarity:
  - 0.44
- Interpretation:
  - Positive
- Alignment with expectations:
  - **Good**. This is essentially just a party song.

**Johnny Cash - Hurt**
- Polarity:
  - 0.07
- Interpretation:
  - Slightly positive
- Alignment with expectations:
  - **Real bad**. This does not line up with my expectations at all. I tried to pick one of the saddest or most negative songs I could think of, and it wound up registering as a slightly positive song. To quote the song, "I will let you down".
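
The interpretations listed above follow an informal rule of thumb that can be made explicit. In this sketch the cutoffs `neutral_band` and `slight_band` are my own hypothetical choices, picked only so that the labels match the ones I wrote by hand; they are not defined by spaCyTextBlob.

```python
def describe_polarity(polarity, neutral_band=0.02, slight_band=0.10):
    """Map a polarity score in [-1.0, 1.0] to a rough verbal label.

    The band widths are arbitrary illustration values, not library defaults.
    """
    magnitude = abs(polarity)
    if magnitude < neutral_band:
        return "essentially neutral"
    direction = "positive" if polarity > 0 else "negative"
    if magnitude < slight_band:
        return f"slightly {direction}"
    return direction
```

With these cutoffs, 0.03 and 0.07 come out as "slightly positive", -0.01 as "essentially neutral", and 0.44 as "positive".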

Why:
Honestly, I don't have a great answer for this. I could easily see why *Last Nite* or *99 Red Balloons* would yield mostly neutral results. Both songs are fairly opaque in their message. The sentiment polarity for *Bang!* was more or less spot-on; if anything, I might consider that too low of a score. I suppose the repeated motif "Bang!" doesn't carry a lot of semantic meaning sans the context of its pronunciation.

Now *why* it rates Johnny Cash's song, which Google's AI labs describe as "a poignant reflection on aging, regret, and the consequences of past actions, particularly his struggles with addiction", as anything *other than* overwhelmingly negative is a bit beyond me. The lyrics aren't even really ambiguous or up to interpretation. It's about being sad about being old and about the choices you've made. So... not what I was expecting.
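
One possible explanation, as far as I understand it, is that TextBlob's default sentiment analyzer is lexicon-based: it scores the individual words it recognizes and averages them, rather than reading the text for its overall meaning. The toy scorer below is a sketch of that averaging idea only. The word scores are made up for illustration and are not TextBlob's real lexicon, and `toy_polarity` is a hypothetical function.

```python
# Hypothetical word scores for illustration only; NOT TextBlob's real lexicon
TOY_LEXICON = {
    "sweetest": 0.6,
    "friend": 0.3,
    "real": 0.2,
    "hurt": -0.5,
    "pain": -0.6,
}


def toy_polarity(text, lexicon=TOY_LEXICON):
    """Average the scores of known words; unknown words are ignored."""
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


# Positive and negative words cancel out, so a sad sentence scores near zero
print(toy_polarity("My sweetest friend, the pain is real and it hurt."))
```

Under this kind of averaging, a song that pairs bleak words with warm ones can land close to neutral even when a human reader finds it overwhelmingly sad.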

In [None]:
# Export Jupyter Notebook to HTML
os.system('jupyter nbconvert --to html module-4-P4.ipynb')