# Explicit Lyric Analysis

Inspiration: Many people claim that songs today contain more explicit words than songs from the past.

What I'm Gonna Do: I will count explicit words in song lyrics across 7 decades to see whether there is any truth to the statement above.

TLDR - I found that the later decades (2000s and 2010s) did indeed include more explicit words than earlier decades. The 2000s had the most explicit words per song (average: 1.73 words, median: 1 word). Compared to the 1950s, the 2000s had significantly more explicit words in the lyrics.

More results can be found under the Results header.

If you would like to view the Tableau Dashboard I built, here's the link: [Tableau Dashboard](https://public.tableau.com/app/profile/ceara.glenn/viz/SongsExplicitWordCountByDecade/YTDashboard)

## Sources

* Genius API
* Spotify API
* [Carnegie Mellon Explicit Word List](https://www.cs.cmu.edu/~biglou/resources/) 
* [YouTube Explicit Word List](https://www.freewebheaders.com/youtube-blacklist-words-list-youtube-comment-moderation/)

## Import Libraries

In [1]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from lyricsgenius import Genius

import pandas as pd
import numpy as np
import scipy.stats as ss
import re
import time
import string
import contractions
from operator import itemgetter

import fuzzywuzzy
from fuzzywuzzy import process
from fuzzywuzzy import fuzz

import nltk
#nltk.download('stopwords')
en_stop = nltk.corpus.stopwords.words('english')



In [2]:
from collections import defaultdict

## Set Credentials

You can retrieve Spotify and Genius credentials by setting up a developer's profile on those sites.

In [None]:
#spotify
Screds = SpotifyClientCredentials(
    client_id=Spotify_id,
    client_secret=Spotify_secret
)
sp = spotipy.Spotify(auth_manager=Screds)

#genius
Gcreds = Genius_key
genius = Genius(Gcreds, sleep_time=1, retries=1, timeout=10, remove_section_headers=True)

## Pull The Decades Playlists from Spotify

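I picked the Spotify playlists I wanted to use (the "All Out" decade playlists) and took each playlist ID from the playlist URL. Each call returns up to 100 tracks, so I make a second call with `offset=100` to get the next page of each playlist.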

In [None]:
#All Out 50s
fifties = sp.playlist_tracks("37i9dQZF1DWSV3Tk4GO2fq")#https://open.spotify.com/playlist/37i9dQZF1DWSV3Tk4GO2fq?si=bc646e564ef246c6
fifties_b = sp.playlist_tracks("37i9dQZF1DWSV3Tk4GO2fq", offset=100)
#All Out 60s
sixties = sp.playlist_tracks("37i9dQZF1DXaKIA8E7WcJj")#https://open.spotify.com/playlist/37i9dQZF1DXaKIA8E7WcJj?si=699d64348b8b48eb
sixties_b = sp.playlist_tracks("37i9dQZF1DXaKIA8E7WcJj", offset=100)
#All Out 70s
seventies = sp.playlist_tracks("37i9dQZF1DWTJ7xPn4vNaz")#https://open.spotify.com/playlist/37i9dQZF1DWTJ7xPn4vNaz?si=662ccc35b56b4ba6
seventies_b = sp.playlist_tracks("37i9dQZF1DWTJ7xPn4vNaz", offset=100)
#All Out 80s
eighties = sp.playlist_tracks("37i9dQZF1DX4UtSsGT1Sbe")#"https://open.spotify.com/playlist/37i9dQZF1DX4UtSsGT1Sbe?si=d53af64c58d84183"
eighties_b = sp.playlist_tracks("37i9dQZF1DX4UtSsGT1Sbe", offset=100)
#All Out 90s
nineties = sp.playlist_tracks("37i9dQZF1DXbTxeAdrVG2l")#"https://open.spotify.com/playlist/37i9dQZF1DXbTxeAdrVG2l?si=de39da16846e4703"
nineties_b = sp.playlist_tracks("37i9dQZF1DXbTxeAdrVG2l", offset=100)
#All Out 00s
millennium = sp.playlist_tracks("37i9dQZF1DX4o1oenSJRJd")#"https://open.spotify.com/playlist/37i9dQZF1DX4o1oenSJRJd?si=ab4ccbbb7726487b"
millennium_b = sp.playlist_tracks("37i9dQZF1DX4o1oenSJRJd", offset=100)
#All Out 10s
twenty10s = sp.playlist_tracks("37i9dQZF1DX5Ejj0EkURtP")#"https://open.spotify.com/playlist/37i9dQZF1DX5Ejj0EkURtP?si=82c356e32cdb40dd"
twenty10s_b = sp.playlist_tracks("37i9dQZF1DX5Ejj0EkURtP", offset=100)
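
The playlist IDs in the cell above are the last path segment of each playlist URL. As a small sketch of how an ID could be pulled out programmatically instead of by hand (`playlist_id_from_url` is a hypothetical helper, not part of the notebook):

```python
from urllib.parse import urlparse

def playlist_id_from_url(url):
    """Return the last path segment of a Spotify playlist URL (the playlist ID)."""
    path = urlparse(url).path  # e.g. "/playlist/37i9dQZF1DWSV3Tk4GO2fq"
    return path.rstrip("/").split("/")[-1]

# URL copied from the comment next to the All Out 50s pull
url = "https://open.spotify.com/playlist/37i9dQZF1DWSV3Tk4GO2fq?si=bc646e564ef246c6"
print(playlist_id_from_url(url))
```

The `?si=...` query string is ignored because only the path of the URL is inspected.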

In [None]:
#an example of what the raw data looks like once pulled from Spotify API
fifties

{'href': 'https://api.spotify.com/v1/playlists/37i9dQZF1DWSV3Tk4GO2fq/tracks?offset=0&limit=100&additional_types=track',
 'items': [{'added_at': '2021-04-28T06:38:23Z',
   'added_by': {'external_urls': {'spotify': 'https://open.spotify.com/user/'},
    'href': 'https://api.spotify.com/v1/users/',
    'id': '',
    'type': 'user',
    'uri': 'spotify:user:'},
   'is_local': False,
   'primary_color': None,
   'track': {'album': {'album_type': 'compilation',
     'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/43ZHCT0cAZBISjO8DG9PnE'},
       'href': 'https://api.spotify.com/v1/artists/43ZHCT0cAZBISjO8DG9PnE',
       'id': '43ZHCT0cAZBISjO8DG9PnE',
       'name': 'Elvis Presley',
       'type': 'artist',
       'uri': 'spotify:artist:43ZHCT0cAZBISjO8DG9PnE'}],
     'available_markets': ['AD',
      'AE',
      'AG',
      'AL',
      'AM',
      'AO',
      'AR',
      'AT',
      'AU',
      'AZ',
      'BA',
      'BB',
      'BD',
      'BE',
      'BF',
   

## Building the Dataset

In this step, I create an intermediate dataset from the Spotify data. It holds items like track name, album name, artist name(s), and the explicit tag. I extract the data from the Spotify JSON dictionary and reshape it so that it can be used to pull song lyrics from the Genius API.

****************************************************************************
Below are the notes I took during analysis:

Using songs pulled from ~10/7/2021~ 10/10/2021

columns needed:
* artist name
* album name
* featured artists
* explicit tag
* track name
* track (true or false)
* release date for song
* album type (I think this says whether it's a standalone single or a part of an album)
* playlist id - gives me the decade

columns added:
* track_id from Spotify
* flag for multiple artists

In [None]:
#lambda functions created for the create_dataset function

#locate data within the json format for album
album_facts = lambda playlist, fact: [playlist["items"][num]["track"]["album"][fact] for num in range(len(playlist["items"]))]

#locate data within the json format for track
track_facts = lambda playlist, fact: [playlist["items"][num]["track"][fact] for num in range(len(playlist["items"]))]
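
To make the nesting concrete, here is a minimal sketch that runs the same two lookups on a made-up, heavily trimmed stand-in for one Spotify response. The `mock_playlist` dictionary is an assumption for illustration (its values are taken from the first two rows of the dataset preview further down), and the helpers are written as plain functions with the same behavior as the lambdas.

```python
# A made-up, trimmed stand-in for one Spotify response (only the keys used here)
mock_playlist = {
    "items": [
        {"track": {"name": "Don't Be Cruel", "explicit": False,
                   "album": {"name": "Elvis' Golden Records",
                             "release_date": "1958-03-21"}}},
        {"track": {"name": "I've Got You Under My Skin - Remastered 1998", "explicit": False,
                   "album": {"name": "Songs For Swingin' Lovers! (Remastered)",
                             "release_date": "1956-03"}}},
    ]
}

def album_facts(playlist, fact):
    # one value per song, read from items -> track -> album
    return [item["track"]["album"][fact] for item in playlist["items"]]

def track_facts(playlist, fact):
    # one value per song, read from items -> track
    return [item["track"][fact] for item in playlist["items"]]

print(album_facts(mock_playlist, "release_date"))
print(track_facts(mock_playlist, "name"))
```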

In [None]:
#set up variables for the create_dataset function

#these are the features I want to pull from the album key in the json
album_features = ["release_date","type","name"]

#these are the features I want to pull from the track key in the json
track_features = ["explicit","track","name","id"]

#all of the results from the Spotify API pull put in a list
playlists = [fifties, fifties_b, sixties, sixties_b, seventies, seventies_b, 
             eighties, eighties_b, nineties, nineties_b, millennium, millennium_b,twenty10s,twenty10s_b]

In [None]:
def create_dataset(playlists,album_features,track_features):
    """
    This function will take the Spotify features from the JSON format and convert it into a pandas DataFrame.

    1) Create empty dataframe
    2) Iterate over each playlist and do the following:
        * Create an empty dataframe
        For the Track/Album Features
        * Retrieve the data using lambda function (album_facts or track_facts)
        * Create a column for the empty dataframe for each album/track feature
        For Playlist ID
        *Using regex, pull the playlist ID from the playlist link
        For Artists
        * Pull the artist key in the json
        * Create a column that says if the song has multiple artists (T/F)
        * Create column(s) for each artist in the song 
    3) Once the empty dataframe has been filled with all features from the playlist, the playlist dataframe
        will be appended to the overall dataframe

    Output: Pandas DataFrame filled with Spotify Data Needed for analysis
    """
    full_data = pd.DataFrame()
    for pl in playlists:
        songs = pd.DataFrame()
        for feature in album_features:
            facts = album_facts(pl, feature)
            songs["album_"+feature] = facts
        for feature in track_features:
            facts = track_facts(pl, feature)
            songs["track_"+feature] = facts
        songs["playlist_id"] = re.search(r"(?<=playlists/).*(?=/track)", pl["href"]).group(0)
        artists = list(list(map(itemgetter('name'), pl["items"][num]["track"]["artists"])) for num in range(len(pl["items"])))
        songs["multiple_artists"] = list(map(lambda x: True if len(x) > 1 else False, artists))
        artists_df = pd.DataFrame(artists, columns = ["Artist_"+str(i+1) for i in range(0,find_max(artists))])
        songs = pd.concat([songs,artists_df],axis=1)
        full_data = pd.concat([full_data,songs])
    return full_data

def find_max(List_of_Lists):
    """
    This function is to find the element in the list that has the most elements.
    Once I have that max number, this will become the number of columns that I will create for the artists.
    """
    maximum = 0
    for i in range(len(List_of_Lists)):
        num = len(List_of_Lists[i])
        if num > maximum:
            maximum = num
    return maximum
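
Two steps inside `create_dataset` are easy to check in isolation: the regex that pulls the playlist ID out of the response `href`, and the way the artist lists are widened to a common number of columns. The sketch below uses only the standard library; the `artists` list is a small hand-made sample, and the padding is written out explicitly to mirror what `pd.DataFrame` does with rows of unequal length.

```python
import re

# href copied from the raw Spotify output shown earlier
href = ("https://api.spotify.com/v1/playlists/37i9dQZF1DWSV3Tk4GO2fq/"
        "tracks?offset=0&limit=100&additional_types=track")
playlist_id = re.search(r"(?<=playlists/).*(?=/track)", href).group(0)

# artist names per song; find_max returns the length of the longest inner list
artists = [["Elvis Presley"], ["Ella Fitzgerald", "Louis Armstrong"]]
widest = max((len(names) for names in artists), default=0)

# pad shorter rows with None so every song has Artist_1..Artist_<widest>
padded = [names + [None] * (widest - len(names)) for names in artists]
columns = ["Artist_" + str(i + 1) for i in range(widest)]

print(playlist_id, columns, padded)
```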

In [None]:
full_data = create_dataset(playlists,album_features,track_features)

In [None]:
full_data.head()

Unnamed: 0,album_release_date,album_type,album_name,track_explicit,track_track,track_name,track_id,playlist_id,multiple_artists,Artist_1,Artist_2,Artist_3,Artist_4,Artist_5
0,1958-03-21,album,Elvis' Golden Records,False,True,Don't Be Cruel,01u6AEzGbGbQyYVdxajxqk,37i9dQZF1DWSV3Tk4GO2fq,False,Elvis Presley,,,,
1,1956-03,album,Songs For Swingin' Lovers! (Remastered),False,True,I've Got You Under My Skin - Remastered 1998,3aEJMh1cXKEjgh52claxQp,37i9dQZF1DWSV3Tk4GO2fq,False,Frank Sinatra,,,,
2,1995-01-01,album,The Best Of The Platters,False,True,Smoke Gets In Your Eyes,307XEC1IUwUs9ojlEFwH7f,37i9dQZF1DWSV3Tk4GO2fq,False,The Platters,,,,
3,1959,album,What'd I Say,False,True,"What'd I Say, Pt. 1 & 2",5yQ9iMZXGcr5rlO4hoLsP4,37i9dQZF1DWSV3Tk4GO2fq,False,Ray Charles,,,,
4,1996-06-04,album,Ella & Friends,False,True,Dream A Little Dream Of Me,3vFVS2WYHDG4KkWCNecvpn,37i9dQZF1DWSV3Tk4GO2fq,True,Ella Fitzgerald,Louis Armstrong,,,


## Get Lyrics

At this step, I use the Spotify data to retrieve the lyrics from the Genius API. This is still an intermediate dataset for my analysis. For each song, I take the top result from searching the Genius database with the cleaned Spotify track name and the first listed artist.

In [None]:
def new_track_names(data):
    """
    Cleaning the song name from each instance in the data. This function uses regex to remove everything from a " -" onward in the Spotify song name (for example, remaster notes) as well as any text in parentheses.
    The cleaned name is what will be used to search the Genius API. 

    data: Pandas dataframe; ideally this is the dataset formed from the Spotify data.

    Example:
        Before: "I've Got You Under My Skin - Remastered 1998"
        After: "I've Got You Under My Skin"
    """
    #clean song titles (regex=True is needed because newer pandas versions treat the pattern as a literal string by default)
    data["new_track_name"] = data["track_name"].str.replace(r"\s-.*", "", regex=True)
    data["new_track_name"] = data["new_track_name"].apply(lambda x: re.sub(r"\(.*\)","",x).strip())
    
    return data
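
The two regex steps can be tried on single strings with the standard library alone. This is a sketch of the same cleaning logic outside of pandas; `clean_title` is a hypothetical helper, and the "(Live)" title is a made-up example.

```python
import re

def clean_title(title):
    no_suffix = re.sub(r"\s-.*", "", title)          # drop " - Remastered 1998" and similar
    return re.sub(r"\(.*\)", "", no_suffix).strip()  # drop "(...)" notes

print(clean_title("I've Got You Under My Skin - Remastered 1998"))
print(clean_title("Some Song (Live)"))
print(clean_title("Sh-Boom"))
```

Because the first pattern requires whitespace before the hyphen, hyphenated titles such as "Sh-Boom" are left untouched.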

In [1]:
genius_info = lambda df: [genius.search_song(title=df["new_track_name"][i], artist=df["Artist_1"][i]) for i in range(len(df))]
#lyrics_please = lambda List: ["None" if List[num] is None else List[num].lyrics.replace("\n"," ").replace("EmbedShare URLCopyEmbedCopy","") for num in range(len(List))]

def add_Genius_info(data):
    """
    This function adds the genius data to the data set. Here are the steps taken to engineer the data:
    1) Create an empty dataframe & create list of intervals by 25. I create this interval to track the progress of rows processed in the dataset. It also allows for some built in sleep
    time between each pull of data from the API.
    2) Retrieve lyrics, titles, and artists from the Genius API. The lyrics are pulled for text analysis but the Genius titles and artists are pulled for accuracy comparison. In a later step,
    the Genius vs Spotify comparison will be discussed.
    3) Set the columns from the Genius API for song lyrics, song titles, and artists & concat the data to the empty dataframe.

    The result will be a dataset with the Genius API data.
    """
    complete_data = pd.DataFrame()
    #chunk boundaries every 25 rows; the final boundary is len(data) so a shorter last chunk is not skipped
    interval = list(range(0,len(data),25)) + [len(data)]
    for start, end in zip(interval[:-1], interval[1:]):
        print(start, end)
        part_data = data.iloc[start:end,:].reset_index()
        lyrics, full_titles, artists = get_Genius_info(part_data)
        part_data["lyrics"] = lyrics
        part_data["g_full_titles"] = full_titles
        part_data["g_artists"] = artists
        complete_data = pd.concat([complete_data,part_data])
        #pause between chunks to go easy on the API
        time.sleep(10)
    print("end")
    
    return complete_data
                
def get_Genius_info(data):
    """
    This function searches the Genius API to retrieve the song lyrics. This function is embedded in the add_Genius_info function.
    1) Retrieve Genius data using cleaned song title and artist from the data.
    2) Search the Genius data results for lyrics, song titles, and artists. Additionally, I clean the lyrics to remove unwanted jargon that exists in all entries.

    The result is several lists that will be used to create columns in the dataset for the add_Genius_info function.
    """
    retrieval = genius_info(data)
    lyrics = (list(map(lambda x: "None" if x is None else itemgetter("lyrics")(x.to_dict()).replace("\n"," ").replace("EmbedShare URLCopyEmbedCopy",""),retrieval)))
    full_titles = (list(map(lambda x: "None" if x is None else itemgetter("full_title")(x.to_dict()), retrieval)))
    artists = (list(map(lambda x: "None" if x is None else itemgetter("artist")(x.to_dict()),retrieval)))
    
    return lyrics, full_titles, artists
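
`add_Genius_info` works through the rows in batches of 25 with a pause between batches. Here is a minimal sketch of just the batching arithmetic, showing that a row count that isn't a multiple of 25 should still get a final, shorter batch. `chunk_bounds` is a hypothetical helper and not part of the notebook.

```python
def chunk_bounds(n_rows, size=25):
    """Yield (start, end) row positions covering 0..n_rows in steps of `size`."""
    for start in range(0, n_rows, size):
        yield start, min(start + size, n_rows)

# 60 rows -> two full batches and one batch of 10
bounds = list(chunk_bounds(60))
print(bounds)
```

Each `(start, end)` pair can be passed straight to `data.iloc[start:end]`.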

#### Run the functions!

In [None]:
full_data = new_track_names(full_data)
# complete_data = add_Genius_info(full_data)

0 25
Searching for "Don't Be Cruel" by Elvis Presley...
Done.
Searching for "I've Got You Under My Skin" by Frank Sinatra...
Done.
Searching for "Smoke Gets In Your Eyes" by The Platters...
Done.
Searching for "What'd I Say, Pt. 1 & 2" by Ray Charles...
Done.
Searching for "Dream A Little Dream Of Me" by Ella Fitzgerald...
Done.
Searching for "That'll Be The Day" by Buddy Holly...
Done.
Searching for "Blueberry Hill" by Fats Domino...
Done.
Searching for "I Kissed You" by The Everly Brothers...
Done.
Searching for "That's Amore" by Dean Martin...
Done.
Searching for "Baby" by Dinah Washington...
Done.
Searching for "Hound Dog" by Elvis Presley...
Done.
Searching for "Twilight Time" by The Platters...
Done.
Searching for "Pennies From Heaven" by Louis Prima...
Done.
Searching for "It's Only Make Believe" by Conway Twitty...
Done.
Searching for "Rags To Riches" by Tony Bennett...
Done.
Searching for "Everyday" by Buddy Holly...
Done.
Searching for "The Great Pretender" by The Platters...

Done.
Searching for "Get a Job" by The Silhouettes...
Done.
Searching for "Hey, Good Lookin'" by Hank Williams...
Done.
Searching for "I've Got The World On A String" by Frank Sinatra...
Done.
Searching for "It's Not for Me to Say" by Johnny Mathis...
Done.
Searching for "Rebel Rouser" by Duane Eddy...
Specified song does not contain lyrics. Rejecting.
Searching for "Sh-Boom" by The Crew Cuts...
Done.
Searching for "Yakety Yak" by The Coasters...
Done.
Searching for "Keep A Knockin" by Little Richard...
Done.
Searching for "Smile" by Nat King Cole...
Done.
Searching for "You Make Me Feel So Young" by Frank Sinatra...
Done.
Searching for "Heartbreak Hotel" by Elvis Presley...
Done.
Searching for "Jambalaya" by Hank Williams...
Done.
Searching for "Great Balls Of Fire" by Jerry Lee Lewis...
Done.
150 175
Searching for "Here Comes The Sun" by The Beatles...
Done.
Searching for "When a Man Loves a Woman" by Percy Sledge...
Done.
Searching for "Satisfaction" by The Rolling Stones...
Done.
S

Done.
Searching for "For Your Love" by The Yardbirds...
Done.
Searching for "Yesterday" by The Beatles...
Done.
Searching for "I Fall To Pieces" by Patsy Cline...
Done.
Searching for "Baby" by Dinah Washington...
Done.
Searching for "Do You Love Me" by The Contours...
Done.
Searching for "Somethin' Stupid" by Frank Sinatra...
Done.
Searching for "Sunny" by Bobby Hebb...
Done.
275 300
Searching for "Gimme Shelter" by The Rolling Stones...
Done.
Searching for "Hello Mary Lou, Goodbye Heart" by Ricky Nelson...
Done.
Searching for "I'd Rather Go Blind" by Etta James...
Done.
Searching for "Dearest" by Buddy Holly...
Done.
Searching for "Ramble On" by Led Zeppelin...
Done.
Searching for "To Love Somebody" by Bee Gees...
Done.
Searching for "Do You Know the Way to San Jose" by Dionne Warwick...
Done.
Searching for "Hey Paula" by Paul & Paula...
Done.
Searching for "Carrie Anne" by The Hollies...
Done.
Searching for "Please Mr. Postman" by The Marvelettes...
Done.
Searching for "A Sunday Kind

Done.
Searching for "New York Groove" by Ace Frehley...
Done.
Searching for "Don't Go Breaking My Heart" by Elton John...
Done.
Searching for "The Logical Song" by Supertramp...
Done.
Searching for "Good Times" by CHIC...
Done.
Searching for "Pink Moon" by Nick Drake...
Done.
Searching for "I Was Made For Lovin' You" by KISS...
Done.
Searching for "Red Light Spells Danger" by Billy Ocean...
Done.
Searching for "Old Time Rock & Roll" by Bob Seger...
Done.
Searching for "Big Yellow Taxi" by Joni Mitchell...
Done.
Searching for "All Right Now" by Free...
Done.
Searching for "The Harder They Come" by Jimmy Cliff...
Done.
Searching for "Heart of Gold" by Neil Young...
Done.
Searching for "Miss You" by The Rolling Stones...
Done.
Searching for "Don't It Make My Brown Eyes Blue" by Crystal Gayle...
Done.
Searching for "I Feel Love" by Donna Summer...
Done.
Searching for "Life in the Fast Lane" by Eagles...
Done.
Searching for "Play That Funky Music" by Wild Cherry...
Done.
Searching for "What

Done.
Searching for "Total Eclipse of the Heart" by Bonnie Tyler...
Done.
Searching for "Nikita" by Elton John...
Done.
550 575
Searching for "Jump" by Van Halen...
Done.
Searching for "Ebony & Ivory" by Paul McCartney...
Done.
Searching for "One More Night" by Phil Collins...
Done.
Searching for "If You Don't Know Me by Now" by Simply Red...
Done.
Searching for "That's All" by Genesis...
Done.
Searching for "All Right" by Christopher Cross...
Done.
Searching for "Just Around The Corner" by Cock Robin...
Done.
Searching for "Down Under" by Men At Work...
Done.
Searching for "Straight From The Heart" by Bryan Adams...
Done.
Searching for "Give It Up" by KC & The Sunshine Band...
Done.
Searching for "These Dreams" by Heart...
Done.
Searching for "We Are The World" by U.S.A. For Africa...
Done.
Searching for "Slave To Love" by Bryan Ferry...
Done.
Searching for "99 Luftballons" by Nena...
Done.
Searching for "Hungry Heart" by Bruce Springsteen...
Done.
Searching for "All Around the World"

Done.
Searching for "Independent Women, Pt. 1" by Destiny's Child...
Done.
Searching for "Can You Feel the Love Tonight" by Elton John...
Done.
Searching for "Missing" by Everything But The Girl...
Done.
Searching for "Don't Speak" by No Doubt...
Done.
Searching for "Big Big World" by Emilia...
Done.
Searching for "Save Tonight" by Eagle-Eye Cherry...
Done.
Searching for "Ironic" by Alanis Morissette...
Done.
700 725
Searching for "Always Be My Baby" by Mariah Carey...
Done.
Searching for "Sweetest Thing" by U2...
Done.
Searching for "Stay" by Lisa Loeb & Nine Stories...
Done.
Searching for "Say My Name" by Destiny's Child...
Done.
Searching for "Let's Talk About Sex" by Salt-N-Pepa...
Done.
Searching for "Fastlove, Pt. 1" by George Michael...
No results found for: 'Fastlove, Pt. 1 George Michael'
Searching for "High" by Lighthouse Family...
Done.
Searching for "One Sweet Day" by Mariah Carey...
Done.
Searching for "Joyride" by Roxette...
Done.
Searching for "Disco 2000" by Pulp...
Don

Done.
Searching for "Thnks fr th Mmrs" by Fall Out Boy...
Done.
Searching for "All Rise" by Blue...
Done.
Searching for "All For You" by Janet Jackson...
Done.
Searching for "Yellow" by Coldplay...
Done.
Searching for "Baby Boy" by Sean Paul...
Done.
850 875
Searching for "You Found Me" by The Fray...
Done.
Searching for "I Need a Girl  [feat. Loon, Ginuwine & Mario Winans]" by Diddy...
Done.
Searching for "Kiss Me Thru The Phone" by Soulja Boy...
Done.
Searching for "I Knew You Were Trouble" by Taylor Swift...
Done.
Searching for "Single Ladies" by Beyoncé...
Done.
Searching for "American Boy" by Estelle...
Done.
Searching for "If I Ain't Got You" by Alicia Keys...
Done.
Searching for "I'm Real" by Jennifer Lopez...
Done.
Searching for "Bleeding Love" by Leona Lewis...
Done.
Searching for "Moving Too Fast" by Artful Dodger...
Done.
Searching for "All Summer Long" by Kid Rock...
Done.
Searching for "World of Our Own" by Westlife...
Done.
Searching for "Dragostea Din Tei" by O-Zone...
D

Done.
Searching for "Try" by P!nk...
Done.
Searching for "Thunder" by Imagine Dragons...
Done.
1000 1025
Searching for "IDGAF" by Dua Lipa...
Done.
Searching for "Bad Blood" by Taylor Swift...
Done.
Searching for "Break Free" by Ariana Grande...
Done.
Searching for "Talk" by Khalid...
Done.
Searching for "Burn" by Ellie Goulding...
Done.
Searching for "I'm Not The Only One" by Sam Smith...
Done.
Searching for "Best Day Of My Life" by American Authors...
Done.
Searching for "Time of Our Lives" by Pitbull...
Done.
Searching for "i'm so tired..." by Lauv...
Done.
Searching for "Shut Up and Dance" by WALK THE MOON...
Done.
Searching for "Heroes" by Alesso...
Done.
Searching for "Sweet but Psycho" by Ava Max...
Done.
Searching for "Whatever It Takes" by Imagine Dragons...
Done.
Searching for "Rather Be" by Clean Bandit...
Done.
Searching for "When I Was Your Man" by Bruno Mars...
Done.
Searching for "Like I Can" by Sam Smith...
Done.
Searching for "Summer Air" by ItaloBrothers...
Done.
Sear

NameError: name 'by25' is not defined

In [None]:
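#Adding the decades to the dataset. This will allow for easier filtering when I want to poke around in the data.
#unique() keeps the order of first appearance, which matches the order the playlists were pulled in (50s to 10s)
decades = dict(zip(list(complete_data.playlist_id.unique()),["1950s","1960s","1970s","1980s","1990s","2000s","2010s"]))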
def map_values(row, values_dict):
    return values_dict[row]

complete_data['playlist_name'] = complete_data['playlist_id'].apply(map_values, args = (decades,))

In [None]:
complete_data.head()

Unnamed: 0,index,album_release_date,album_type,album_name,track_explicit,track_track,track_name,track_id,playlist_id,multiple_artists,Artist_1,Artist_2,Artist_3,Artist_4,Artist_5,new_track_name,lyrics,g_full_titles,g_artists,playlist_name
0,0,1958-03-21,album,Elvis' Golden Records,False,True,Don't Be Cruel,01u6AEzGbGbQyYVdxajxqk,37i9dQZF1DWSV3Tk4GO2fq,False,Elvis Presley,,,,,Don't Be Cruel,You know I can be found Sitting home all alone...,Don't Be Cruel by Elvis Presley,Elvis Presley,1950s
1,1,1956-03,album,Songs For Swingin' Lovers! (Remastered),False,True,I've Got You Under My Skin - Remastered 1998,3aEJMh1cXKEjgh52claxQp,37i9dQZF1DWSV3Tk4GO2fq,False,Frank Sinatra,,,,,I've Got You Under My Skin,I've got you under my skin I've got you deep i...,I've Got You Under My Skin by Frank Sinatra,Frank Sinatra,1950s
2,2,1995-01-01,album,The Best Of The Platters,False,True,Smoke Gets In Your Eyes,307XEC1IUwUs9ojlEFwH7f,37i9dQZF1DWSV3Tk4GO2fq,False,The Platters,,,,,Smoke Gets In Your Eyes,They asked me how I knew My true love was true...,Smoke Gets in Your Eyes by The Platters,The Platters,1950s
3,3,1959,album,What'd I Say,False,True,"What'd I Say, Pt. 1 & 2",5yQ9iMZXGcr5rlO4hoLsP4,37i9dQZF1DWSV3Tk4GO2fq,False,Ray Charles,,,,,"What'd I Say, Pt. 1 & 2","Hey mama, don't you treat me wrong Come and lo...",What'd I Say (Parts 1 & 2) by Ray Charles,Ray Charles,1950s
4,4,1996-06-04,album,Ella & Friends,False,True,Dream A Little Dream Of Me,3vFVS2WYHDG4KkWCNecvpn,37i9dQZF1DWSV3Tk4GO2fq,True,Ella Fitzgerald,Louis Armstrong,,,,Dream A Little Dream Of Me,Stars shining bright above you Night breezes s...,Dream a Little Dream of Me by Ella Fitzgerald ...,Ella Fitzgerald & Louis Armstrong,1950s


## Get Lyrics, Part 2

In my exploration of the API functions and data, I found that Genius did not always return the exact song that was in the data. I decided to compute a fuzzy matching score (0 to 100) that compares the Genius song title pulled from the API to the Spotify song title. If the score is 80 or higher, the lyrics are used in the explicit word analysis. Otherwise, the song is dropped.

In [None]:
#resetting the index - each 25-row chunk was indexed separately before the concat, so index values repeat; the leftover index columns are dropped
complete_data = complete_data.reset_index().drop(columns=["level_0","index"])

#obtaining the song title from the genius results - Before: "Jenny From The Block by Jennifer Lopez", After: "Jenny From The Block"
complete_data["g_title"] = complete_data["g_full_titles"].apply(lambda x: re.sub(r"\sby.*","",x).strip())

#creating a lambda function to calculate fuzzy matching ratio between Genius and Spotify song titles
score = lambda df,x,y: [fuzz.ratio(df[x][i].lower(), df[y][i].lower()) for i in range(len(df))]

# creating a column with the fuzzy matching score
complete_data["title_match_score"] = score(complete_data,"new_track_name","g_title")

#creating a pass/fail column based on the fuzzy matching score
complete_data["title_pass_fail"] = complete_data["title_match_score"].apply(lambda x: "fail" if x < 80 else "pass")
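
For a quick feel of how the 80-point cutoff behaves, here is a sketch that uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`. This is an assumption made so the example runs without extra packages; scores from fuzzywuzzy may differ slightly. `title_score` and `pass_fail` are hypothetical helpers, and the "Baby" vs "Hound Dog" pairing is made up to show a clear mismatch.

```python
from difflib import SequenceMatcher

def title_score(spotify_title, genius_title):
    """Similarity between two titles on a 0-100 scale, case-insensitive."""
    ratio = SequenceMatcher(None, spotify_title.lower(), genius_title.lower()).ratio()
    return int(round(100 * ratio))

def pass_fail(score, threshold=80):
    # same rule as the title_pass_fail column: below the threshold fails
    return "fail" if score < threshold else "pass"

same = title_score("Smoke Gets In Your Eyes", "Smoke Gets in Your Eyes")
different = title_score("Baby", "Hound Dog")
print(same, pass_fail(same))
print(different, pass_fail(different))
```

Lowercasing both titles first means differences in capitalization alone never cost a song its match.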

## Clean Up Lyrics
I have retrieved all of the lyrics!!! Now I need to clean the lyrics before analysis. Below I outline my order of operations to clean the lyrics. The result for each row is a set of the words used in the lyrics. A set keeps each unique word once, no matter how many times it appears in the song.

#### Order of Operations
* Break apart contractions - The stop words set from nltk doesn't typically include contractions, so I break them apart first (for example, "didn't" becomes "did not"). I do this before removing punctuation so the apostrophes are still in place.
* Remove Punctuation - Since I'm reviewing the text on a word level, I don't care about the syntax of the lyrics.
* Remove Stopwords - Removing words such as "she", "the", "as" leaves the important words for review.
* Remove Duplicates - I want to focus on whether an explicit word appears, not on how many times it occurs.

In [None]:
def prep_lyric_cleanup():
    """
    This function was created to prep the contractions dictionary and punctuation list to clean up the lyrics.
    1) Load contractions dictionary
    2) Split the dictionary and apply lowercase function to keys and values
    3) Zip the contractions dictionary back
    4) Drop months of the year keys from the contractions dictionary - Frankly, I felt it didn't matter so I dropped them.
    5) Create a punctuation translation list

    Output: lowercase contractions dictionary, translation list

    These outputs will be used to break apart contractions and remove punctuation.
    """
    cont_dict = contractions.contractions_dict
    lower_keys = [x.lower() for x in list(cont_dict.keys())]
    lower_values = [x.lower() for x in list(cont_dict.values())]
    lower_dict = dict(zip(lower_keys,lower_values))
    
    drop_keys = ['jan.','feb.','mar.','apr.','jun.','jul.','aug.','sep.','oct.','nov.','dec.']
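    for key in drop_keys:
        #pop with a default so a key missing from this version of the contractions dictionary doesn't raise a KeyError
        lower_dict.pop(key, None)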
   
    translation= str.maketrans("","",string.punctuation)
    
    return lower_dict, translation
    
def clean_it_up(lyrics):
    """
    This function cleans the lyrics for analysis.
    1) Apply lowercase function to lyrics
    2) Create a list of words by splitting the lyrics
    3) Break apart contractions using the contraction dictionary
    4) Remove punctuation using translation function
    5) Remove stop words using nltk's stop words set
    6) Remove duplicate words - The analysis is focused on word choice and not the number of times a word is used.
    """
    lower_lyrics = lyrics.lower()
    list_lyrics = lower_lyrics.split(" ")
    #break apart contractions
    processing1 = [lower_dict.get(word, word) for word in list_lyrics]
    #remove punctuation
    processing2 = " ".join(processing1).translate(translation).split(" ")
    #remove stop words
    processing3 = [word for word in processing2 if word not in en_stop]
    #remove duplicates
    final_lyrics = set(processing3)#list(set(processing3))
    
    return final_lyrics
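
The whole cleaning pipeline can be traced on one short line of text. The sketch below uses tiny hand-made stand-ins for the contractions dictionary and the nltk stop word list (both are assumptions for illustration, not the real resources), and `clean_lyrics` is a hypothetical name.

```python
import string

# tiny stand-ins for contractions.contractions_dict and nltk's stop word list
toy_contractions = {"didn't": "did not", "i'm": "i am"}
toy_stop_words = {"i", "am", "did", "not", "she", "the"}
strip_punct = str.maketrans("", "", string.punctuation)

def clean_lyrics(lyrics):
    words = lyrics.lower().split(" ")
    # break apart contractions while the apostrophes are still present
    expanded = [toy_contractions.get(w, w) for w in words]
    # remove punctuation, then split back into words
    no_punct = " ".join(expanded).translate(strip_punct).split(" ")
    # drop stop words and duplicates (this sketch also drops empty strings,
    # which the notebook version leaves in)
    return {w for w in no_punct if w and w not in toy_stop_words}

result = clean_lyrics("I didn't know, she said! She said")
print(result)
```

Note that a contraction followed directly by punctuation (for example `didn't,`) would not be found in the dictionary, because the lookup happens before punctuation is stripped.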

In [None]:
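#Load the contractions dictionary (lowercase) & punctuation translation table
lower_dict, translation = prep_lyric_cleanup()
#assumption: `data` is the dataset built above (complete_data); the rest of the notebook refers to it as `data`
data = complete_data
# Create a cleaned lyrics column using the lyric cleaning function
data["cleaned_lyrics"] = data["lyrics"].apply(clean_it_up)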

## Explicit Word Count Analysis

I am finally at the point where the data is ready to be analyzed. I created a few functions here that will analyze the explicit word count in the lyrics and add the additional columns to the data.

Explicit Word Lists Used:
* Carnegie Mellon Explicit Word List - This list was obtained from a research group at Carnegie Mellon. 
    * Thoughts After Analysis: I felt that this list was too strict for my analysis. It includes words that aren't typically considered explicit, like "fire". 
* YouTube Explicit List - A list obtained from a third-party site that claims it was updated in January 2021.
    * Thoughts After Analysis: This list was exactly what I was looking for. I based my analysis more strongly on this list but I did still calculate the word count using the CMU list.

In [None]:
def create_explicit_word_dictionary():
    """
    This function cleans up the explicit word lists to prepare for analysis usage. In my EDA of the list,
    there were some repeats in words and excessive amounts of blanks. 

    1) Create a lambda: This function will create a list that will strip empty spaces on either side of a word.
    2) Read in the explicit word lists from CMU and YouTube.
    3) Apply the lambda function to the explicit word lists to clean up the documents.
    4) Clean up CMU list - I found some repeats and peculiarities in this list so I added a few additional actions.
    5) Create Dictionary - A dictionary is created with two key/value pairs.
        * YT : List of YT Explicit Words
        * CMU: List of CMU Explicit Words
    """
    clean_it_up_lines = lambda List: [List[num].strip() for num in range(len(List))]
    
    with open('cmu_offensive_word_list.txt') as f:
        lines_cmu = f.readlines()
    with open('youtube-blacklist-words-list_comma-separated-text-file_2021-01-19.txt') as h:
        lines_yt = h.readlines()
        
    yt_explicit = set(clean_it_up_lines(lines_yt[13].split(",")))
    cmu_explicit = clean_it_up_lines(lines_cmu)
    
    #additional cleanup for cmu
    cmu_explicit.pop(0)
    cmu_explicit = set(cmu_explicit)
    
    explicit_dict = dict(zip(["yt","cmu"],[yt_explicit,cmu_explicit]))
    
    return explicit_dict

In [None]:
def get_explicit_columns(data, explicit_dict):
    """
    This function executes the explicit word count analysis and creates 2 new columns to append to the data set:
        * Explicit Words - The explicit words found in the lyrics when compared to explicit word list
        * Explicit Word Count - The number of explicit words found in the lyrics
    """
    new_data = data.copy()
    #for each explicit list in the explicit word dictionary
    for List in explicit_dict.keys():
        #find the words that intersect between the song lyrics and explicit word list (CMU or YT)
        explicit_set = new_data["cleaned_lyrics"].apply(lambda x: explicit_dict[List].intersection(set(x)))
        #count the explicit words found
        explicit_count = explicit_set.apply(lambda x: len(x))
        #create column with the explicit words found
        new_data[List+"_explicit_words_2"] = explicit_set
        #create column with the explicit word count
        new_data[List+"_explicit_count_2"] = explicit_count
    
    return new_data
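
At its core, the count is a set intersection between a song's cleaned words and each explicit word list. A minimal sketch with toy data follows: the two word sets are short placeholders and not the real lists (the text above notes that "fire" appears on the CMU list), and `cleaned` is a made-up song.

```python
# placeholder word lists, NOT the real CMU / YouTube lists
toy_explicit = {"yt": {"damn", "hell"}, "cmu": {"damn", "hell", "fire"}}

# a made-up set of cleaned lyrics for one song
cleaned = {"fire", "damn", "love", "night"}

# words the song shares with each list, and how many there are
found = {name: words & cleaned for name, words in toy_explicit.items()}
counts = {name: len(words) for name, words in found.items()}

print(found)
print(counts)
```

Since both sides are sets, a word that is repeated many times in a song still counts once.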

In [None]:
#create an explicit word dictionary using CMU and YT explicit word lists
explicit_dict = create_explicit_word_dictionary()
#create the final dataset for analysis
final_data = get_explicit_columns(data, explicit_dict)

In [None]:
# drop rows where lyrics were not found
complete = final_data.drop(final_data[final_data.lyrics=="None"].index)
# drop rows where the matching title score was less than 80
complete = complete.drop(complete[complete.title_match_score < 80].index)

## Results

Below are the results of my lyric analysis. For each decade, I've calculated summary statistics. I also completed a t-test to compare the explicit word count of the decades.

There is also a Tableau dashboard that shows the data in a more interactive way. The results below are more for the nerds like me :)


### Summary Statistics

In [None]:
def get_stats(df, explicit_word_count_column):
    """
    The calculation of summary statistics for each decade.

    df: data, pandas DataFrame
    explicit_word_count_column: the name of the column; yt_explicit_count or cmu_explicit_count
    """
    for keys in list(decades.keys()):
        print("*"*50)
        decade_df = df[df.playlist_id == keys][explicit_word_count_column].describe()
        print("total: ", decade_df["count"])
        print("decade: ", decades[keys])
        print("average: ", round(decade_df["mean"],2))
        print("median: ", round(decade_df["50%"],2))
        print("max: ", decade_df["max"])
        print("song w/ most explicit words: ", list(df[(df[explicit_word_count_column] == decade_df["max"])&(df["playlist_name"]==decades[keys])]["track_name"].values))
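
The same per-decade numbers could probably be pulled in one step with a pandas `groupby`. This is a minimal sketch on a tiny made-up frame; the column names match the ones used above, but the values are invented and it is not the code that produced the output below.

```python
import pandas as pd

# Made-up rows standing in for the real data set
df = pd.DataFrame({
    "playlist_name": ["1950s", "1950s", "2000s", "2000s", "2000s"],
    "cmu_explicit_count": [0, 2, 1, 3, 5],
})

# One row per decade: number of songs, average, median, and max explicit word count
summary = (
    df.groupby("playlist_name")["cmu_explicit_count"]
      .agg(["count", "mean", "median", "max"])
      .round(2)
)
print(summary)
```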

In [None]:
#summary statistics of each decade using the Carnegie Mellon explicit word list
get_stats(data,"cmu_explicit_count")

**************************************************
total:  139.0
decade:  1950s
average:  0.58
median:  0.0
max:  5.0
song w/ most explicit words:  ['Manhattan']
**************************************************
total:  133.0
decade:  1960s
average:  0.76
median:  0.0
max:  8.0
song w/ most explicit words:  ['Sympathy For The Devil']
**************************************************
total:  141.0
decade:  1970s
average:  1.22
median:  1.0
max:  12.0
song w/ most explicit words:  ['American Pie']
**************************************************
total:  142.0
decade:  1980s
average:  0.84
median:  1.0
max:  11.0
song w/ most explicit words:  ["We Didn't Start the Fire"]
**************************************************
total:  135.0
decade:  1990s
average:  1.3
median:  1.0
max:  6.0
song w/ most explicit words:  ["Freedom! '90 - Remastered", 'Ironic - 2015 Remaster', "Let's Talk About Sex"]
**************************************************
total:  141.0
decade:  2000s
average:  2.

In [None]:
#summary statistics of each decade using the YouTube explicit word list
get_stats(data,"yt_explicit_count")

**************************************************
total:  139.0
decade:  1950s
average:  0.22
median:  0.0
max:  3.0
song w/ most explicit words:  ['Bo Diddley']
**************************************************
total:  133.0
decade:  1960s
average:  0.23
median:  0.0
max:  3.0
song w/ most explicit words:  ['Like a Rolling Stone']
**************************************************
total:  141.0
decade:  1970s
average:  0.35
median:  0.0
max:  3.0
song w/ most explicit words:  ['Sultans Of Swing', 'The Ballroom Blitz']
**************************************************
total:  142.0
decade:  1980s
average:  0.2
median:  0.0
max:  4.0
song w/ most explicit words:  ["We Didn't Start the Fire"]
**************************************************
total:  135.0
decade:  1990s
average:  0.34
median:  0.0
max:  4.0
song w/ most explicit words:  ["Let's Talk About Sex"]
**************************************************
total:  141.0
decade:  2000s
average:  1.73
median:  1.0
max:  15.0
song 

In [3]:
data = pd.read_csv("Final_data_w_explicit.csv")

### Comparing Decades' Explicit Word Count

By performing t-tests, I compared all of the decades to each other. In the summary statistics, I did find that the later decades had a higher average explicit word count. Now I want to see if any of my results were significant.

Results - The later decades (2000s and 2010s) typically had significantly more explicit words than the earlier decades.
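
A note on the test itself: the loops below use `ss.ttest_1samp`, which treats the first decade's mean as a fixed number. A two-sample Welch t-test (`ss.ttest_ind` with `equal_var=False`) would also account for the spread within the first decade. Here is a minimal sketch of both on made-up counts. The two arrays are invented for illustration, and this is not the method behind the results reported here.

```python
import numpy as np
import scipy.stats as ss

# Made-up explicit word counts for two decades (not from the data set)
decade1_counts = np.array([0, 0, 0, 1, 0, 0, 2, 0])
decade2_counts = np.array([1, 3, 0, 2, 5, 1, 4, 2])

# One-sample test: decade 2 against the mean of decade 1, treated as a constant
one_sample = ss.ttest_1samp(decade2_counts, decade1_counts.mean())

# Welch two-sample test: uses the variance of both decades
welch = ss.ttest_ind(decade2_counts, decade1_counts, equal_var=False)

print("one-sample t:", round(one_sample.statistic, 2), "p:", round(one_sample.pvalue, 3))
print("welch t:     ", round(welch.statistic, 2), "p:", round(welch.pvalue, 3))
```

The Welch statistic is never larger in magnitude than the one-sample statistic, since its denominator also includes the first decade's variance, so it is the more conservative of the two.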

In [17]:
#create all possible pairs for the decades
decade_names = list(data["playlist_name"].unique())
decade_pairs = [(a, b) for idx, a in enumerate(decade_names) for b in decade_names[idx + 1:]]
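
For reference, `itertools.combinations` builds the same kind of pair list. This is a small sketch with the seven decade names typed in by hand, assuming they appear in chronological order in the data.

```python
from itertools import combinations

# Decade names written out by hand for illustration
decade_names = ["1950s", "1960s", "1970s", "1980s", "1990s", "2000s", "2010s"]

# Every unordered pair, earlier decade first
decade_pairs = list(combinations(decade_names, 2))

print(len(decade_pairs))
print(decade_pairs[0], decade_pairs[-1])
```

Seven decades give 21 pairs, so each loop below runs 21 tests.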

In [25]:
for pair in decade_pairs:
    #filter data by decade 1
    decade1 = data[data["playlist_name"]==pair[0]]
    #filter data by decade 2
    decade2 = data[data["playlist_name"]==pair[1]]
    #one-sample t-test: compares decade 2's counts to the mean of decade 1 (treated as a fixed value)
    test_results = ss.ttest_1samp(decade2["yt_explicit_count"],decade1["yt_explicit_count"].mean())
    #print results
    print(pair[0], "vs.", pair[1])
    print(f'The {pair[1]} has {"significantly" if test_results[1] < 0.05 else ""} {"more" if test_results[0] > 0 else "less"} explicit words than the {pair[0]}.')
    print("t-statistic:", round(test_results[0],2), "p-value:", round(test_results[1],3))
    print("*"*50)

1950s vs. 1960s
The 1960s has  more explicit words than the 1950s.
t-statistic: 0.05 p-value: 0.957
**************************************************
1950s vs. 1970s
The 1970s has significantly more explicit words than the 1950s.
t-statistic: 2.34 p-value: 0.021
**************************************************
1950s vs. 1980s
The 1980s has  less explicit words than the 1950s.
t-statistic: -0.43 p-value: 0.671
**************************************************
1950s vs. 1990s
The 1990s has  more explicit words than the 1950s.
t-statistic: 1.94 p-value: 0.054
**************************************************
1950s vs. 2000s
The 2000s has significantly more explicit words than the 1950s.
t-statistic: 6.84 p-value: 0.0
**************************************************
1950s vs. 2010s
The 2010s has significantly more explicit words than the 1950s.
t-statistic: 5.44 p-value: 0.0
**************************************************
1960s vs. 1970s
The 1970s has significantly more explicit 

In [26]:
for pair in decade_pairs:
    #filter data by decade 1
    decade1 = data[data["playlist_name"]==pair[0]]
    #filter data by decade 2
    decade2 = data[data["playlist_name"]==pair[1]]
    #one-sample t-test: compares decade 2's counts to the mean of decade 1 (treated as a fixed value)
    test_results = ss.ttest_1samp(decade2["cmu_explicit_count"],decade1["cmu_explicit_count"].mean())
    #print results
    print(pair[0], "vs.", pair[1])
    print(f'The {pair[1]} has {"significantly" if test_results[1] < 0.05 else ""} {"more" if test_results[0] > 0 else "less"} explicit words than the {pair[0]}.')
    print("t-statistic:", round(test_results[0],2), "p-value:", round(test_results[1],3))
    print("*"*50)

1950s vs. 1960s
The 1960s has  more explicit words than the 1950s.
t-statistic: 1.89 p-value: 0.061
**************************************************
1950s vs. 1970s
The 1970s has significantly more explicit words than the 1950s.
t-statistic: 4.74 p-value: 0.0
**************************************************
1950s vs. 1980s
The 1980s has significantly more explicit words than the 1950s.
t-statistic: 2.53 p-value: 0.012
**************************************************
1950s vs. 1990s
The 1990s has significantly more explicit words than the 1950s.
t-statistic: 5.94 p-value: 0.0
**************************************************
1950s vs. 2000s
The 2000s has significantly more explicit words than the 1950s.
t-statistic: 8.02 p-value: 0.0
**************************************************
1950s vs. 2010s
The 2010s has significantly more explicit words than the 1950s.
t-statistic: 8.21 p-value: 0.0
**************************************************
1960s vs. 1970s
The 1970s has signifi