**A kind warning before you start**: Running these scripts takes **around 4 hours**, most of it spent waiting for the web scrapers to retrieve everything.

So if you are not particularly interested in playing with the web scrapers, you can check the results directly without running this notebook.

# 1. Data Preparation

In the first part of the notebook, we prepare all the data needed for the analysis and the data storytelling that follows. The data and their structures are:


*   **Metadata of the songs**: `Title`, `Artist`, `Artist ID`.

  This information will be retrieved from the Wikipedia pages of the [Billboard Year-End Hot 100 singles](https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_2000), which contain the ranking tables for our period of interest, 2000-2023. In this step we also collect the Wikidata item IDs of the artists to facilitate the queries in the next step; the IDs are scraped from each artist's Wikipedia page.

  They will be stored in Pandas DataFrames, one per year.
*   **Information of the artists**: `Nationality or Origin`, `Sex or Gender`.

  This information will be queried directly through the properties of the Wikidata knowledge base.

  It will be integrated into the aforementioned DataFrames.

*   **Lyrics of the Songs**: The lyrics will be fetched from [Genius](https://genius.com) with an API key, with the help of the `lyricsgenius` package. They will be stored as plain text files in folders organized by year.

## 1.1 Songs Metadata Extraction

We are going to use the Python package `BeautifulSoup` (imported from `bs4`), which parses HTML and XML documents, including those with malformed markup. It builds a parse tree from which data can be extracted, which makes it useful for web scraping.

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

After importing the packages we need, we start scraping the Wikipedia pages of the Billboard Year-End Hot 100 singles rankings. The rankings are held in wiki tables that share the same structure on every page, which makes the extraction easier.

* As the web pages show, each table has the header cells (HTML `<th>` elements) `No.`, `Title`, and `Artist`, followed by rows (HTML `<tr>` elements) whose data cells (`<td>` elements) hold the rank, title, and artist of each song. We take advantage of this clear structure and scrape it with BeautifulSoup.

* We additionally have to extract the Wikidata IDs of the artists, so that we can later retrieve their personal information from a more structured source. The ID is taken from the "Wikidata item" link on each artist's Wikipedia page, which is located through the metadata element `{"id": "t-wikibase"}`.


* To extract the same information from each of the Wikipedia pages from 2000 to 2023, we use a for loop that substitutes the `{year}` element of the URL automatically, since the last path segment of the URL is always written as `Billboard_Year-End_Hot_100_singles_of_{year}`.
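
The table walk described in these bullets can be tried offline on a tiny hand-written table. The sketch below is only an illustration, not the scraper itself: `SAMPLE_HTML` and `parse_rows` are hypothetical names, and the sample rows are made up. It also shows one wrinkle the bullets do not spell out: when an artist holds several consecutive ranks, the artist cell carries a `rowspan` attribute and the following rows omit that cell, so the last artist has to be carried forward.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of a year-end table (made-up rows, not real chart data)
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>No.</th><th>Title</th><th>Artist</th></tr>
  <tr><td>1</td><td>"Song A"</td><td rowspan="2"><a href="/wiki/Artist_X">Artist X</a></td></tr>
  <tr><td>2</td><td>"Song B"</td></tr>
  <tr><td>3</td><td>"Song C"</td><td><a href="/wiki/Artist_Y">Artist Y</a></td></tr>
</table>
"""

def parse_rows(html):
    """Return (title, artist) pairs, carrying the artist across rowspans."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable"})
    rows = []
    current_artist, span_left = None, 0
    for tr in table.find_all("tr")[1:]:      # skip the header row
        cells = tr.find_all("td")
        if not cells:
            continue
        if span_left > 0:
            # The artist cell of an earlier row still covers this row
            span_left -= 1
        else:
            artist_cell = cells[2]
            current_artist = artist_cell.get_text(strip=True)
            span_left = int(artist_cell.get("rowspan", 1)) - 1
        rows.append((cells[1].get_text(strip=True), current_artist))
    return rows

rows = parse_rows(SAMPLE_HTML)
for title, artist in rows:
    print(title, "-", artist)
```

Note that the titles keep their surrounding quotation marks, exactly as they appear in the table cells.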


After extracting the tables and iterating over the 24 web pages, we have what we need. For the convenience of downstream data manipulation, we put everything in DataFrames, which are saved as CSV files and appear in the file list after you run the code.

**This part of the script runs extremely slowly, so please be patient.**

In [4]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Define the function and start the connection of the scraper
def scrape_billboard_hot_100(year):
    url = f'https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_{year}'
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        songs_table = soup.find('table', {'class': 'wikitable'})

        songs_data = []
        current_artist = None
        artist_span_remaining = 0

        # Extract the song rows from the wiki table, and each artist's Wikidata ID from the artist's page
        for row in songs_table.find_all('tr')[1:]:
            cells = row.find_all('td')
            if cells:
                if artist_span_remaining > 0:
                    artist_span_remaining -= 1
                else:
                    current_artist = cells[2].text.strip()
                    artist_link = cells[2].find('a', href=True)
                    artist_wikidata_id = None
                    if artist_link:
                        artist_url = 'https://en.wikipedia.org' + artist_link['href']
                        artist_page = requests.get(artist_url, headers={'User-Agent': 'Mozilla/5.0'})
                        if artist_page.status_code == 200:
                            artist_soup = BeautifulSoup(artist_page.text, 'html.parser')
                            wikibase_item = artist_soup.find("li", {"id": "t-wikibase"})
                            if wikibase_item and wikibase_item.find('a', href=True):
                                artist_wikidata_id = wikibase_item.find('a')['href'].split('/')[-1]
                    if cells[2].has_attr('rowspan'):
                        artist_span_remaining = int(cells[2]['rowspan']) - 1

                title = cells[1].text.strip()
                songs_data.append({
                    'Title': title,
                    'Artist': current_artist,
                    'Wikidata ID': artist_wikidata_id
                })

        return pd.DataFrame(songs_data)
    else:
        print(f"Failed to retrieve the webpage for the year {year}: Status code {response.status_code}")
        return None

# A for loop for iterating from the rankings of 2000 to 2023
for year in range(2000, 2024):
    df_songs = scrape_billboard_hot_100(year)
    if df_songs is not None:
        print(f"Data for {year}:")
        print(df_songs.head())
        # Optionally, save to CSV
        df_songs.to_csv(f'billboard_hot_100_{year}.csv', index=False)


Data for 2000:
                   Title                             Artist Wikidata ID
0              "Breathe"                         Faith Hill     Q464241
1               "Smooth"       Santana featuring Rob Thomas     Q873384
2          "Maria Maria"  Santana featuring The Product G&B     Q873384
3         "I Wanna Know"                                Joe    Q1077266
4  "Everything You Want"                   Vertical Horizon    Q2061582
Data for 2001:
                          Title                            Artist Wikidata ID
0         "Hanging by a Moment"                         Lifehouse     Q845790
1                     "Fallin'"                       Alicia Keys     Q121507
2                 "All for You"                     Janet Jackson     Q131324
3  "Drops of Jupiter (Tell Me)"                             Train     Q282531
4     "I'm Real (Murder Remix)"  Jennifer Lopez featuring Ja Rule      Q40715
Data for 2002:
                    Title                         Artis

A quick recap before moving on: with the help of BeautifulSoup, and thanks to the clear structure of the Wikipedia pages, we successfully retrieved **the titles of the songs, and the names and Wikidata IDs of the artists** from the Billboard Year-End Hot 100 singles lists from 2000 to 2023.

In the next step, we go deeper into the Wikidata items of the corresponding artists and extract where they come from and their gender. Gender is only available when the artist is an individual; bands do not have this property.

Notice a peculiarity here: when a song is a collaboration between two artists, or features a second artist, only the Wikidata ID of the first (main) artist is extracted, and only that artist's metadata is analyzed in the following procedures.
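
The Wikidata ID kept for the main artist is simply the last path segment of the "Wikidata item" link on the artist's Wikipedia page. A minimal sketch of that step, assuming the link has the form `.../Special:EntityPage/Q...` (the helper name `qid_from_href` is hypothetical, and the check for a `Q` followed by digits is an addition of this sketch, not part of the scraper):

```python
import re
from urllib.parse import urlparse

def qid_from_href(href):
    """Return the trailing Q-identifier of a Wikidata link, or None."""
    # urlparse drops any "#fragment", so only the path is inspected
    last_segment = urlparse(href).path.rstrip("/").split("/")[-1]
    return last_segment if re.fullmatch(r"Q\d+", last_segment) else None

faith_hill = qid_from_href(
    "https://www.wikidata.org/wiki/Special:EntityPage/Q464241")
with_fragment = qid_from_href(
    "https://www.wikidata.org/wiki/Special:EntityPage/Q873384#sitelinks-wikipedia")
not_an_item = qid_from_href("https://en.wikipedia.org/wiki/Faith_Hill")

print(faith_hill, with_fragment, not_an_item)
```

Returning `None` for anything that does not look like an item ID keeps a malformed link from being stored as if it were a valid ID.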

## 1.2 Artists Metadata Extraction

Thanks to the well-structured data of the Wikidata knowledge base, information about our artists can be extracted easily from the corresponding properties.

Here are the properties we'll retrieve:


*   `P27` Country of Citizenship: When the artist is an individual, we retrieve this object as a country that recognizes the subject as its citizen.
*   `P495` Country of Origin: When the artist is a band, and therefore has no country of citizenship, we fall back to the country of origin of the item (a property that also applies to creative works, food, phrases, products, etc.).
*   `P21` Sex or Gender: When the artist is an individual, we retrieve her/his/their sex or gender identity objects, which could fall into multiple categories: male, female, non-binary, intersex, transgender female, transgender male, agender, etc.

In [6]:
def fetch_artist_details(wikidata_id):
    url = f"https://www.wikidata.org/wiki/Special:EntityData/{wikidata_id}.json"
    response = requests.get(url)
    details = {
        'nationality_or_origin': 'Unknown',
        'gender': 'Unknown'
    }

    if response.status_code == 200:
        data = response.json()
        entity_data = data['entities'][wikidata_id]
        claims = entity_data.get('claims', {})

        # Fetching nationality or country of origin
        citizenship_claims = claims.get('P27')  # P27 is the property for country of citizenship
        if citizenship_claims:
            citizenship_qid = citizenship_claims[0]['mainsnak']['datavalue']['value']['id']
            details['nationality_or_origin'] = get_label_from_wikidata(citizenship_qid)
        else:
            origin_claims = claims.get('P495')  # P495 is the property for country of origin
            if origin_claims:
                origin_qid = origin_claims[0]['mainsnak']['datavalue']['value']['id']
                details['nationality_or_origin'] = get_label_from_wikidata(origin_qid)

        # Fetching gender
        gender_claims = claims.get('P21')  # P21 is the property for sex or gender
        if gender_claims:
            gender_qid = gender_claims[0]['mainsnak']['datavalue']['value']['id']
            details['gender'] = get_label_from_wikidata(gender_qid)

    return details

# Next we dive into the JSON data from Wikidata
def get_label_from_wikidata(qid):
    """Fetches the label for a given QID from Wikidata."""
    label_url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    label_response = requests.get(label_url)
    if label_response.status_code == 200:
        label_data = label_response.json()
        label_entity = label_data['entities'][qid]
        if 'en' in label_entity['labels']:
            return label_entity['labels']['en']['value']
    return "Unknown"
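
The lookups above index straight into `['mainsnak']['datavalue']`. Wikidata statements whose value is recorded as unknown or absent (snak types `somevalue` and `novalue`) carry no `datavalue`, so a more defensive variant may be worth considering. The following is only a sketch on hand-built sample data: `first_item_id`, `nationality_or_origin_id`, and `item_claim` are hypothetical helpers, not part of the notebook, and it returns item IDs without resolving their labels.

```python
def first_item_id(claims, prop):
    """Return the first item ID stored under `prop`, or None if there is none."""
    for claim in claims.get(prop, []):
        snak = claim.get("mainsnak", {})
        # "somevalue" / "novalue" snaks carry no datavalue, so skip them
        if snak.get("snaktype") != "value":
            continue
        value = snak.get("datavalue", {}).get("value")
        if isinstance(value, dict) and "id" in value:
            return value["id"]
    return None

def nationality_or_origin_id(claims):
    """Prefer P27 (citizenship); fall back to P495 (country of origin)."""
    return first_item_id(claims, "P27") or first_item_id(claims, "P495")

def item_claim(qid):
    """Build a claim shaped like those in Wikidata's entity JSON (sample data)."""
    return {"mainsnak": {"snaktype": "value",
                         "datavalue": {"value": {"id": qid}}}}

# Hand-built examples: Q30 is the United States of America, Q6581072 is female
solo_artist = {"P27": [item_claim("Q30")], "P21": [item_claim("Q6581072")]}
band = {"P495": [item_claim("Q30")]}
unknown_gender = {"P21": [{"mainsnak": {"snaktype": "somevalue"}}]}

print(nationality_or_origin_id(solo_artist), first_item_id(solo_artist, "P21"))
print(nationality_or_origin_id(band), first_item_id(band, "P21"))
print(first_item_id(unknown_gender, "P21"))
```

A `None` result can then be mapped to `'Unknown'`, matching the defaults used in `fetch_artist_details`.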

Then let's integrate our new metadata as two new columns into our previous CSV files.

**The code will run slowly again, so please be extremely patient with our web scraper. After running the following block, the previous CSV files will be updated.**

If the Gender value is written as `Unknown`, this is most likely because the artist is a band.
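
Part of the waiting probably comes from repeated lookups: the same artists and the same country and gender items come up again and again, and each one triggers a fresh request. One possible mitigation is to memoize the lookups. The sketch below uses `functools.lru_cache` with a hypothetical stand-in function (`slow_label_lookup`) in place of the real network call, so it runs without any connection:

```python
from functools import lru_cache

calls = []  # records every simulated "network" hit

def slow_label_lookup(qid):
    """Hypothetical stand-in for get_label_from_wikidata (no network here)."""
    calls.append(qid)
    sample_labels = {"Q30": "United States of America", "Q6581072": "female"}
    return sample_labels.get(qid, "Unknown")

@lru_cache(maxsize=None)
def cached_label(qid):
    """Look a label up once; later calls with the same QID reuse the result."""
    return slow_label_lookup(qid)

labels = [cached_label(q) for q in ["Q30", "Q30", "Q6581072", "Q30"]]
print(labels)
print(f"{len(calls)} lookups for {len(labels)} requests")
```

In the notebook itself, the same decorator could be placed on `get_label_from_wikidata`, since its only argument is a string.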

In [7]:
# Update our dataframes through the iteration over 24 years again
for year in range(2000, 2024):
    df_songs = scrape_billboard_hot_100(year)
    if df_songs is not None and not df_songs.empty:
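        df_songs[['Nationality or Origin', 'Gender']] = df_songs['Wikidata ID'].apply(
            # Use the same keys in the fallback so every row yields the same two columns
            lambda x: pd.Series(fetch_artist_details(x) if x else
                                {'nationality_or_origin': 'Unknown', 'gender': 'Unknown'})
        )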
        print(f"Data for {year}:")
        print(df_songs.head())
        df_songs.to_csv(f'billboard_hot_100_{year}.csv', index=False)


Data for 2000:
                   Title                             Artist Wikidata ID  \
0              "Breathe"                         Faith Hill     Q464241   
1               "Smooth"       Santana featuring Rob Thomas     Q873384   
2          "Maria Maria"  Santana featuring The Product G&B     Q873384   
3         "I Wanna Know"                                Joe    Q1077266   
4  "Everything You Want"                   Vertical Horizon    Q2061582   

      Nationality or Origin   Gender  
0  United States of America   female  
1  United States of America  Unknown  
2  United States of America  Unknown  
3  United States of America     male  
4  United States of America  Unknown  
Data for 2001:
                          Title                            Artist Wikidata ID  \
0         "Hanging by a Moment"                         Lifehouse     Q845790   
1                     "Fallin'"                       Alicia Keys     Q121507   
2                 "All for You"           

KeyboardInterrupt: 

## 1.3 Lyrics Extraction with Genius API

I went through a lot of trial and error working with the Genius API before finding out that the earlier failures could have been caused by running the code on Google Colab: requests coming from the cloud server appear to be rejected by Genius.com. The following code should therefore be run locally to avoid network errors.

Here we use the Python package `lyricsgenius` to make retrieval through the Genius API easier.

As you can see, the directory in the script points to a local folder on my computer, so change it if you want to play with the script (it is not really worth the 2-hour wait, though).

In [15]:
import os
import pandas as pd
import requests
from lyricsgenius import Genius

def fetch_and_save_lyrics(year, df, directory):
    # Fetch and save lyrics for a given year and dataframe.
    lyrics_folder = os.path.join(directory, f'Lyrics/{year}')
    os.makedirs(lyrics_folder, exist_ok=True)
    
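    # Read the Genius API token from the environment instead of hard-coding it
    # (the variable name GENIUS_ACCESS_TOKEN is our choice; set it before running)
    genius = Genius(os.environ['GENIUS_ACCESS_TOKEN'], timeout=10, retries=3,  # Increase timeout and retries
                    skip_non_songs=True, excluded_terms=["(Remix)", "(Live)"], remove_section_headers=True)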
    for index, row in df.iterrows():
        retries = 3  # Define the number of retries for each song
        while retries > 0:
            try:
                song = genius.search_song(title=row['Title'], artist=row['Artist'])
                if song:
                    lyrics = song.lyrics
                    filename = f"{row['Title'].replace('/', '_')}.txt"
                    with open(os.path.join(lyrics_folder, filename), 'w', encoding='utf-8') as file:
                        file.write(lyrics)
                        print(f"Saved lyrics for {row['Title']} by {row['Artist']}")
                break  # Break the loop if successful
            except requests.Timeout:
                print(f"Timeout occurred for {row['Title']} by {row['Artist']}, retries left: {retries-1}")
                retries -= 1  # Decrease retries count
            except Exception as e:
                print(f"An error occurred for {row['Title']} by {row['Artist']}: {str(e)}")
                break  # Break the loop if a non-timeout error occurs

def process_files(directory):
    """Process each CSV file to fetch and save lyrics."""
    for year in range(2000, 2024):
        csv_file_path = os.path.join(directory, f'billboard_hot_100_{year}.csv')
        if os.path.exists(csv_file_path):
            df = pd.read_csv(csv_file_path)
            fetch_and_save_lyrics(year, df, directory)
        else:
            print(f"CSV file for year {year} does not exist in the specified directory.")

directory = '/Users/delete4ever/Desktop/Digital Literary Analysis/codes'
process_files(directory)


Searching for ""Breathe"" by Faith Hill...
Done.
Saved lyrics for "Breathe" by Faith Hill
Searching for ""Smooth"" by Santana featuring Rob Thomas...
Done.
Saved lyrics for "Smooth" by Santana featuring Rob Thomas
Searching for ""Maria Maria"" by Santana featuring The Product G&B...
Done.
Saved lyrics for "Maria Maria" by Santana featuring The Product G&B
Searching for ""I Wanna Know"" by Joe...
Done.
Saved lyrics for "I Wanna Know" by Joe
Searching for ""Everything You Want"" by Vertical Horizon...
Done.
Saved lyrics for "Everything You Want" by Vertical Horizon
Searching for ""Say My Name"" by Destiny's Child...
Done.
Saved lyrics for "Say My Name" by Destiny's Child
Searching for ""I Knew I Loved You"" by Savage Garden...
Done.
Saved lyrics for "I Knew I Loved You" by Savage Garden
Searching for ""Amazed"" by Lonestar...
Done.
Saved lyrics for "Amazed" by Lonestar
Searching for ""Bent"" by Matchbox Twenty...
Done.
Saved lyrics for "Bent" by Matchbox Twenty
Searching for ""He Wasn't 

We have now obtained the lyrics of almost 2,400 songs: 2,370 of them were found in the Genius database, which means 98.75% of the expected data were retrieved successfully. The lyrics are stored in `.txt` files, organized into 24 folders according to the year in which the songs ranked in the top 100.

I will deal with further refinement and cleaning of the textual data in the third step: text analysis.
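
One detail is visible in the search log: the `Title` column still carries the quotation marks from the Wikipedia tables (`""Breathe""`), and they end up in both the search query and the file names. If cleaner names are wanted, a small helper along these lines could be applied before searching and saving. This is a sketch with hypothetical helper names (`clean_title`, `safe_filename`); the set of characters replaced is an assumption about what is unsafe on common file systems, not something the notebook defines.

```python
import re

def clean_title(raw_title):
    """Drop surrounding whitespace and straight or curly double quotes."""
    return raw_title.strip().strip('"\u201c\u201d').strip()

def safe_filename(title, extension=".txt"):
    """Replace characters that are awkward in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", clean_title(title))
    return cleaned + extension

print(clean_title('"Breathe"'))
print(safe_filename('"I Knew I Loved You"'))
print(safe_filename("AC/DC?"))  # made-up title, only to show the replacement
```

The notebook's own code only replaces `/`, so adopting a helper like this would change the names of the files already written.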