# Overview

# Names and GitHub Usernames

- Andrew Li - drewli815
- Jialin Shan - j5shan
- Sameer Ahmed - the3L3M3NT
- Lacey Umamoto - lumamoto
- Rickesh Khilnani - rick10101221


# Research Question

Over the past 60 years, how have rap lyrics and their overarching themes evolved into the repetitive, monotonous genre that rap has become today? After analyzing Billboard's number-one rap singles for each decade from the 1960s to the 2010s, can we conclude that the main topics in rap have not changed?

# Background Research

- Contemporary music often sounds repetitive because many songs share a recurring theme, either in their lyrics or in their type of beat. This is especially evident in contemporary rap, where many songs cover the same topics: guns, liquor, drugs, money, sex, getting famous, being rich, etc. The beat matters as well: it is quite hard to tell one rap song from another when listening only to the instrumental versions. We attribute the dull and repetitive nature of rap to both of these factors.

- Therefore, to measure this monotony, we will gather data on thousands of songs from the 1960s to the 2010s and analyze their overarching themes by running TF-IDF on the lyrics of all 3800 rap songs and comparing their unique and common words. We will consider the American rap genre to have evolved from innovative to monotonous if rap music from the 1960s covers a range of topics beyond the ones listed above while contemporary rap does not. In the 20th century, rap music was much more impactful, and many pieces pushed for public change. Nowadays, that is less often the case.

- The overarching theme of each song will be its most repeated keyword (after removing common English words). Word counts will be stored in a map or Counter that associates each word with its frequency. We will primarily be using Genius' API to gather lyrics.

- We believe there should be a change in the music industry. Today's music industry makes music that is solely for 'instant gratification' and is completely shallow. (Refer to the second resource in the *Past Studies* section below)

- We all listen to and enjoy music.

- Past Studies
    - This [article](https://www.rapanalysis.com/2015/08/the-23-most-repetitive-rappers/) ranks the 23 most repetitive rappers, with Will.I.Am and Kid Cudi at the top. The analysis looked at how often certain keywords were said, but it only compared artists with one another.
    - This [short piece](https://roundup.brophyprep.org/index.php/2012/03/popular-hip-hop-music-praises-shallow-superficial-decadence/) (published in 2012) argues that music used to promote good values but, in the 21st century, has shifted to a shallow and superficial genre. The author quotes one of Drake's songs that focuses only on the instantly gratifying aspects of life, namely getting money, drinking, and smoking.

- References (include links):
    - [Billboard's Hot Weekly Charts](https://data.world/kcmillersean/billboard-hot-100-1958-2017) 
    - [Genius API](https://docs.genius.com/)
    - [Lyrics-Extractor Package for Python](https://pypi.org/project/lyrics-extractor/) (canceled)
        - This can be used for retrieving and cross-referencing lyrics for data validity. Uses Genius' API
    - [Language Detector](https://pypi.org/project/langdetect/) 
    - [Spotify's Web API](https://spotipy.readthedocs.io/en/2.16.0/) (canceled)
    - [Spotify](https://www.spotify.com/us/) (canceled)
    - [MusixMatch API for lyrics](https://developer.musixmatch.com/) (canceled)  
    - Google Sheets to store accumulated data for each decade (at the end of project for reference)
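
The keyword idea described above (the most repeated word in a song once common English words are removed) can be sketched in a few lines. This is only an illustration under stated assumptions: the stop-word set here is a tiny hand-picked stand-in for a full English list, and `word_counts` and `theme_keyword` are hypothetical helper names, not functions from the project.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set (an assumption); a real run would use a full list.
STOP_WORDS = {"the", "a", "an", "and", "i", "you", "my", "me",
              "it", "in", "on", "to", "of", "is"}

def word_counts(lyrics):
    """Map each non-stop-word in the lyrics to its frequency."""
    tokens = re.findall(r"[a-z']+", lyrics.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def theme_keyword(lyrics):
    """Return the most repeated non-stop-word, or None if nothing is left."""
    counts = word_counts(lyrics)
    return counts.most_common(1)[0][0] if counts else None

sample = "Money on my mind, money in the bank, I got the money"
print(theme_keyword(sample))
```

A single keyword is a coarse proxy for a theme, so in practice the top few entries of `most_common` may be more informative than the first one alone.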


# Hypothesis

We believe that contemporary rap music follows a repetitive trend, and that this trend will become evident once we compare the overarching themes of rap music for each decade. We expect to find that contemporary rap songs focus on the same few topics, whereas older rap music covered a diverse range of topics, so that each song and artist felt refreshing and unique.

# Data

- What variables: performer, song title, song year, and song lyrics (dataframe 1), and keywords with their respective counts (dataframe 2). Dataframe 1 will contain the data for all 3800 songs, and dataframe 2 will hold the keyword counts for each decade in separate columns.

- How many observations: This varies with how many songs are on the Billboard charts for each decade. We will have to normalize our data so that the number of songs for each decade is the same. In total, we will trim a dataset of 28000 songs down to the 3800 that are specifically rap songs.

- Who/what/how would these data be collected: 
    - Who
        - Each week, we will form subteams and assign each one a task so that data collection and analysis are as efficient as possible. This way, no two people spend their time on the same task, and we will finish the project sooner. For example, team A may be tasked with data collection while team B is in charge of parsing through the data for cleaning (removing stop words, stemming, etc.).
    - What
        - At first, we will be collecting song names, artists, and the date at which each song reached number one. Then, using lyrics extractor, we will find the lyrics for each song. This is all the data collection we need. Additional steps include parsing through the data with TF-IDF, removing common English words, and building an extra dataframe that holds word counts for each decade.
    - How
        - First, we will take Billboard's number-one songs and export them as a csv file. Then, we will read the file in as a dataframe so that we can iterate through each song title and run the lyrics extractor package on it. Finally, we will store each lyric body in the dataframe, remove any common English words from each lyric body, and iterate through all of the lyrics to associate a count with each word.

- How would these data be stored/organized: In Pandas dataframes, which will be used for building visualizations. Again, the data will be organized as two separate dataframes for processing (see above).

- What kind of songs are you collecting: We will be collecting Billboard's number-one songs for each decade from the 1960s to the 2010s. The number of songs for each decade varies, so we will have to establish a numerical bottleneck.

- Dataset 1:
    - Dataset Name: Billboard Hot Weekly Charts
    - [Link to the dataset](https://data.world/kcmillersean/billboard-hot-100-1958-2017)
    - Number of total observations: 28500
    - Truncated observation count for specifically 'rap': 3850

- Notes
    - We will ONLY be looking at rap songs for this analysis. The songs come purely from Billboard's Hot Weekly charts, which gives us a rough estimate of the most popular songs for each decade. Therefore, this may not capture ALL topics in rap over the last 6 decades.
    - Because we have a method of retrieving the lyrics for each song in our data set, we expect to have more than 25000 words and 3800 songs to parse through, which should be enough to model the overarching themes of each decade and compare them with one another.
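
The per-decade normalization mentioned above (capping every decade at the same number of songs) could look roughly like the following. This is a minimal sketch, not the project's method: it assumes random down-sampling to the size of the smallest decade, and `decade_of`, `balance_by_decade`, and the placeholder song titles are all hypothetical.

```python
import random
from collections import defaultdict

def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1994 -> 1990."""
    return (year // 10) * 10

def balance_by_decade(songs, seed=0):
    """Group (year, title) pairs by decade, then sample every group down
    to the size of the smallest one so all decades are equally represented."""
    groups = defaultdict(list)
    for year, title in songs:
        groups[decade_of(year)].append((year, title))
    if not groups:
        return {}
    cap = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return {decade: rng.sample(group, cap)
            for decade, group in sorted(groups.items())}

# Placeholder data for illustration only.
songs = [(1991, "song_a"), (1995, "song_b"), (1999, "song_c"),
         (2004, "song_d"), (2012, "song_e"), (2017, "song_f")]
balanced = balance_by_decade(songs)
print({decade: len(group) for decade, group in balanced.items()})
```

Down-sampling discards data from the larger decades; an alternative that keeps every song is to compare relative word frequencies (counts divided by the decade's total) instead of raw counts.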


# Setup

In [3]:
# standard library
import re
import unicodedata
from collections import Counter
from datetime import datetime
from string import punctuation

# third-party packages
import nltk
import numpy as np
import pandas as pd
import requests
import spacy
from autocorrect import Speller
from bs4 import BeautifulSoup
from langdetect import detect
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer

# Data Cleaning

We cut the 28000 observations down to the 3800 that pertain specifically to rap, merge columns to get a single ID, drop identical songs, and create a final dataframe to hold the lyrics of every song.

In [5]:
# read_excel and read_csv already return DataFrames
df = pd.read_excel("../data/Hot 100 Audio Features.xlsx")
df_hotstuff = pd.read_csv("../data/Hot Stuff.csv")

### Cutting Down Data

Here, we are dropping any observations without a genre specified, assuming that they were not considered rap songs because they weren't explicitly tagged as such. This **ALSO** excludes every song without a rap tag, including 'Hip Hop' songs (hip hop songs have an explicit 'hip hop' tag). Therefore, we are solely focusing on rap music in the entire data set.

In [6]:
# drop songs without genres
df.dropna(subset=['spotify_genre'], inplace=True)

# get songs with rap genre
# (DataFrame.append was removed in pandas 2.0; a boolean mask performs the same
# substring test on each genre string without a Python-level loop)
is_rap = df['spotify_genre'].str.contains('rap', regex=False)
df_rap = df[is_rap].copy()

# drop duplicate songs (songs with same songID)
df_rap = df_rap.drop_duplicates(subset=['SongID'], keep='first')
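
One detail of the filter is worth checking by hand: a plain substring test for 'rap' also matches tags such as 'trap'. Whether that is intended is not stated here, so the sketch below only shows how the two behaviors differ. It assumes each `spotify_genre` cell is a stringified list such as `"['pop rap', 'hip hop']"`; `has_rap_tag` is a hypothetical helper, not part of the notebook.

```python
import ast

def has_rap_tag(genre_cell, whole_word=True):
    """Return True if any genre tag mentions rap.

    genre_cell is assumed to be a stringified list like "['pop rap', 'trap']".
    With whole_word=True, 'rap' must appear as its own word inside a tag;
    with whole_word=False, any substring match counts (so 'trap' matches).
    """
    try:
        tags = ast.literal_eval(genre_cell)
    except (ValueError, SyntaxError):
        tags = [genre_cell]          # not a literal: treat the cell as one tag
    if not isinstance(tags, (list, tuple)):
        tags = [genre_cell]
    for tag in tags:
        text = str(tag).lower()
        if whole_word and "rap" in text.split():
            return True
        if not whole_word and "rap" in text:
            return True
    return False

print(has_rap_tag("['trap']"), has_rap_tag("['trap']", whole_word=False))
```

Applied to a dataframe, this could be used as `df['spotify_genre'].apply(has_rap_tag)` to build the mask, though the stricter match would change which songs are kept.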

### Clean Up Data

In [7]:
# merge df_rap and df_hotstuff to get weekID
df_merge = pd.merge(df_rap, df_hotstuff, how='left')

# drop songs with no weekID
df_merge.dropna(subset=['WeekID'], inplace=True)

# drop duplicate songs (songs with same songID)
df_merge = df_merge.drop_duplicates(subset=['SongID'],keep='first')

### Getting Years

In [8]:
# get years: parse each WeekID (month/day/year) and keep only the year
df_merge['Year'] = pd.to_datetime(df_merge['WeekID'], format="%m/%d/%Y").dt.year

# get final dataframe with year, performer, and song
df_final = df_merge[['Year', 'Performer', 'Song']]
df_final = df_final.reset_index(drop=True)

### Getting Lyrics

#### Helper Function

Strips accents from text.

In [9]:
# strip accents from text
# ex. beyoncé --> beyonce
def strip_accents(text):
    # decompose accented characters (NFD), then drop the non-ASCII combining marks
    text = unicodedata.normalize('NFD', text)\
           .encode('ascii', 'ignore')\
           .decode("utf-8")
    return str(text)

#### Make API call to get page to scrape from

In [1]:
def get_url(song_title, artist_name):
    # print("Searching for: ", song_title, "-", artist_name)
    
    # get song title and artist
    # convert to lowercase, remove non-alphanumeric characters
    
    title = re.sub(r'[^a-zA-Z0-9-$() ]', '', song_title.lower())
    #print("Title:", title)
    
    title_simple = re.sub(r'[^a-zA-Z0-9- ]', '', title)
    #print(title_simple)
    
    # title with no parentheses
    title_noparen = re.sub(r'\([^)]*\)', '', title)
    #print("Title no paren:", title_noparen)
    
    # replace dollar signs with s's
    title_nodollar = title.replace("$", "s")
    #print(title_nodollar)
    
    artist = re.sub(r'[^a-zA-Z0-9-$() ]', '', artist_name.lower())
    #print("Artist:", artist)
    
    artist_simple = re.sub(r'[^a-zA-Z0-9- ]', '', artist)
    #print(artist_simple)
    
    artist_nodollar = artist.replace("$", "s")
    #print(artist_nodollar)
    
    artist_split = artist.split()
    #print("Artist Split:", artist_split)
    
    # main artist
    if 'featuring' in artist:
        artist_nofeat = artist.split('featuring')[0]
    elif ',' in artist:
        artist_nofeat = artist.split(',')[0]
    else:
        artist_nofeat = artist 
    #print("Artist No Feat:", artist_nofeat)
    
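    # set up request
    # the Genius access token is read from an environment variable rather than
    # hard-coded in the notebook; the name GENIUS_ACCESS_TOKEN is an assumption
    import os
    headers = {'Authorization': 'Bearer ' + os.environ['GENIUS_ACCESS_TOKEN']}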
    data = {'q': title_noparen + ' ' + artist_nofeat}
    base_url = 'https://api.genius.com'
    search_url = base_url + '/search'
    
    current_page = 1 # page number of results
    next_page = True
    
    while next_page:
        # send the search query and the page number as URL query parameters
        params = {'q': data['q'], 'page': current_page}
        response = requests.get(search_url, headers=headers, params=params)
        d = response.json()
        page_hits = d['response']['hits']
        
        # if there are hits on the page
        if page_hits:
            # go through all hits
            for hit in page_hits:
                res = hit['result']
                
                # name of primary artist
                name = res['primary_artist']['name']
                name = strip_accents(name)
                name = re.sub(r'[^a-zA-Z0-9- ]', '', name.lower())
                #print("Name:",name)
                
                full_title = res['full_title']
                full_title = strip_accents(full_title)
                # convert full_title to lowercase and remove non-alphanumeric characters
                full_title = re.sub(r'[^a-zA-Z0-9- ]', '', full_title.lower())
                #print("Full Title:", full_title)
                
                if (
                    # 'lyrics' substring is in url
                    'lyrics' in res['url'] and
                     # song title (w/ or w/o parentheses) is in full title
                    (title in full_title or 
                     title_noparen in full_title or
                     title_nodollar in full_title or
                     title_simple in full_title
                    ) and
                    # 1st or 2nd word in artist is in full title or 
                    # main artist (no features) is in full title or name
                    (artist_nofeat in full_title or 
                     artist_nofeat in name or
                     artist_split[0] in full_title or
                     (len(artist_split) > 1 and artist_split[1] in full_title) or
                     artist_nodollar in full_title or
                     artist_nodollar in name or
                     artist_simple in full_title or
                     artist_simple in name
                    ) and
                    # 1st or 2nd word in artist is in name from response
                    (artist_split[0] in name or
                     (len(artist_split) > 1 and artist_split[1] in name)
                    ) and
                    # song is not a translation
                    'espanol' not in full_title and
                    'nederlandse' not in full_title and
                    'polskie' not in full_title and
                    'portugues' not in full_title and
                    'francaise' not in full_title and
                    'deutsche' not in full_title and
                    'oversttelse' not in full_title and
                    'traduzione' not in full_title and
                    'ceviri' not in full_title and
                    'translation' not in full_title and
                    # song is not a review by rap critic
                    'rap critic' not in full_title and
                    # song is not instrumental
                    'instrumental' not in full_title and
                    # song is not a parody
                    'parody' not in full_title
                ):
                    url = res['url']
                    # print("URL found: ", url)
                    return url
                    
            # increment current_page value for next loop
            current_page += 1
            # print("Finished scraping page {}".format(current_page - 1))
            
            # current_page has already been incremented, so this stops the
            # search once pages 1 through 9 have been checked
            if (current_page == 10):
                next_page = False
        else:
            # if page_hits is empty, stop
            next_page = False
        
    return 0
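
The matching logic in `get_url` relies on several normalized variants of the title and on the main artist with featured artists removed. The sketch below isolates that step so it can be checked on its own. It is an approximation of the cell above, not a replacement for it: `title_variants` and `main_artist` are hypothetical names, the example strings are placeholders, and unlike the cell it strips surrounding whitespace.

```python
import re

def title_variants(song_title):
    """Build the normalized title forms used when comparing against search hits."""
    # lowercase; keep letters, digits, hyphens, dollar signs, parentheses, spaces
    title = re.sub(r"[^a-z0-9\-$() ]", "", song_title.lower())
    return {
        "title": title,
        # letters, digits, hyphens and spaces only
        "simple": re.sub(r"[^a-z0-9\- ]", "", title),
        # parenthesized parts such as "(remix)" removed
        "noparen": re.sub(r"\([^)]*\)", "", title).strip(),
        # dollar signs written out as the letter s
        "nodollar": title.replace("$", "s"),
    }

def main_artist(performer):
    """Return the lead artist, dropping anything after 'featuring' or a comma."""
    artist = performer.lower()
    for separator in (" featuring ", ","):
        if separator in artist:
            return artist.split(separator)[0].strip()
    return artist.strip()

variants = title_variants("Money $ign (Remix)!")
print(variants)
print(main_artist("Artist One Featuring Artist Two"))
```

Keeping these helpers separate from the request loop makes it easier to see why a given song failed to match, since each variant can be printed and compared with the `full_title` returned by the search.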

#### Scrape page for lyrics

In [11]:
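def get_lyrics(song_title, artist_name):
    url = get_url(song_title, artist_name)
    if url == 0:
        print("Lyrics not found for", song_title, "-", artist_name)
        return np.nan  # np.NaN was removed in NumPy 2.0
    page = requests.get(url)
    html = BeautifulSoup(page.text, 'html.parser')
    # the lyrics container may be missing if the page layout differs
    lyrics_div = html.find('div', class_='lyrics')
    if lyrics_div is None:
        print("Lyrics container not found for", song_title, "-", artist_name)
        return np.nan
    return lyrics_div.get_text()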

#### Get lyrics

In [None]:
lyrics_list = []

for i, row in df_final.iterrows():
    if (i % 10 == 0): 
        print (str(round(i/len(df_final) * 100, 2)) + '% done')
    artist = row['Performer']
    song = row['Song']
    lyrics = get_lyrics(song, artist)
    # print(lyrics)
    lyrics_list.append(lyrics)
    
df_final['Lyrics'] = lyrics_list
df_final.to_csv("../data/Hot100DataWithNanLyrics.csv")

#### Drop rows with no lyrics

In [None]:
# drop rows with no lyrics
df_final.dropna(subset=['Lyrics'], inplace=True)

# export to csv
df_final.to_csv("../data/Hot100Data.csv")

We dropped 147 songs from our dataset because we were unable to find the lyrics for them using the Genius API. After dropping these songs, we now have a total of **3447 observations** in our dataset.

### Range of Years for Dataset
 - TODO:

# Data Analysis and Results

### Notes
- TODO: 

### Algorithms

- spaCy
- TF-IDF
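
As a reminder of what TF-IDF computes, here is a minimal pure-Python sketch. It assumes one common variant (term frequency as count divided by document length, inverse document frequency as log(N / df) with no smoothing); the exact variant used in this analysis is not specified, and library implementations differ in smoothing and normalization. `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Score every term of every document.

    documents: list of token lists. Returns one {term: score} dict per document,
    where score = (count / document length) * log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Toy "lyrics" after cleaning; each inner list stands for one song or one decade.
docs = [["yeah", "money", "money"],
        ["yeah", "love"],
        ["yeah", "peace", "peace"]]
scores = tf_idf(docs)
print(max(scores[0], key=scores[0].get))
```

A term that occurs in every document (here "yeah") gets a score of zero, which is the property that makes TF-IDF useful for separating decade-specific vocabulary from words common to all rap lyrics.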

In [None]:
contractions_dict = {     
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how is",
"I'd": "I had",
"I'd've": "I would have",
"I'll": "I will",
"I'll've": "I will have",
"I'm": "I am",
"I've": "I have",
"isn't": "is not",
"it'd": "it had",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that had",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there had",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they had",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what will",
"what'll've": "what will have",
"what're": "what are",
"what's": "what is",
"what've": "what have",
"when's": "when is",
"when've": "when have",
"where'd": "where did",
"where's": "where is",
"where've": "where have",
"who'll": "who will",
"who'll've": "who will have",
"who's": "who is",
"who've": "who have",
"why's": "why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had",
"you'd've": "you would have",
"you'll": "you will",
"you'll've": "you will have",
"you're": "you are",
"you've": "you have"
}

In [None]:
wordnet_lemmatizer = WordNetLemmatizer()
snowball_stemmer = SnowballStemmer('english')
punct=punctuation+'’'+'“'+'”'+'–'

def expand_contractions(text, contractions_dict):
    """
    expand english contractions
    """
    # look entries up by lowercased key so that "i'm" and "I'M" are both found,
    # and try longer keys first so that "can't've" is not cut short by "can't"
    lookup = {key.lower(): value for key, value in contractions_dict.items()}
    keys = sorted(lookup, key=len, reverse=True)
    contractions_pattern = re.compile('({})'.format('|'.join(re.escape(k) for k in keys)),
                                      flags=re.IGNORECASE | re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        # fall back to the matched text if no entry is found
        return lookup.get(match.lower(), match)

    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

def autospell(text):
    """
    correct the spelling of the word.
    """
    spell = Speller(lang='en',fast=True)
    spells = [spell(w) for w in (nltk.word_tokenize(text))]
    return " ".join(spells)

nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner']) #load small english core library
stops = set(stopwords.words("english")) #load stopwords into a set for fast membership tests



def normalize(text, lowercase, remove_stopwords):
    '''
    clean the text into desired format for text analysis
    '''
    text = expand_contractions(text,contractions_dict) #expand english contractions
    text = autospell(text) #correct spelling
    text = re.sub('<[^<]+?>','', text)  #remove HTML-style tags such as <i>
    text = ''.join(c for c in text if not c.isdigit())  #remove numbers
    text = ''.join(c for c in text if c not in punct)  #remove punctuation
    if lowercase:
        text = text.lower()  #lowercase
    doc = nlp(text)  #tokenize with spaCy
    lemmatized = list()
    for word in doc:
        lemma = word.lemma_.strip() #lemmatize
        if lemma:
            if not remove_stopwords or lemma not in stops:  #remove stopwords
                lemmatized.append(lemma)
    return lemmatized
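
When a contraction table is compiled into one regular expression, the order of the alternatives matters: Python's `re` tries them left to right, so a short key listed before a longer key with the same prefix wins. The sketch below illustrates this with a two-entry excerpt of the table; `expand` is a hypothetical helper written only for this demonstration.

```python
import re

# Two-entry excerpt of the contraction table, for illustration only.
table = {"can't": "cannot", "can't've": "cannot have"}

def expand(text, table, longest_first):
    """Replace contractions in text, optionally trying longer keys first."""
    keys = sorted(table, key=len, reverse=True) if longest_first else list(table)
    pattern = re.compile("|".join(re.escape(k) for k in keys), flags=re.IGNORECASE)
    lookup = {k.lower(): v for k, v in table.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)

naive = expand("I can't've known", table, longest_first=False)
fixed = expand("I can't've known", table, longest_first=True)
print(naive)
print(fixed)
```

With dictionary order the shorter key matches first and leaves a dangling "'ve" behind, while sorting the keys by length lets the longer contraction expand as a unit.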

In [None]:
df=pd.read_csv('../data/Hot100Data.csv')
df.drop(columns=['Unnamed: 0'], inplace=True)
df['lang']=df.Lyrics.apply(detect)
df=df[df['lang']=='en']
df['After_Clean'] = df['Lyrics'].apply(normalize, lowercase=True, remove_stopwords=True)
df.to_csv('cleaned3000.csv')