# Analyze Taylor Swift Lyrics with Python

[The original dataset was curated by Jan Llenzl Dagohoy and published on Kaggle](https://www.kaggle.com/datasets/thespacefreak/taylor-swift-song-lyrics-all-albums)

In [22]:
# Import relevant libraries

import pandas as pd
from sklearn.feature_extraction import text # importing stop words

#### 1. Combine CSV files

In [3]:
files = [
    'taylor_swift_song_lyrics\\01-taylor_swift.csv',
    'taylor_swift_song_lyrics\\02-fearless_taylors_version.csv',
    'taylor_swift_song_lyrics\\03-speak_now_deluxe_package.csv',
    'taylor_swift_song_lyrics\\04-red_deluxe_edition.csv',
    'taylor_swift_song_lyrics\\05-1989_deluxe.csv',
    'taylor_swift_song_lyrics\\06-reputation.csv',
    'taylor_swift_song_lyrics\\07-lover.csv',
    'taylor_swift_song_lyrics\\08-folklore_deluxe_version.csv',
    'taylor_swift_song_lyrics\\09-evermore_deluxe_version.csv',
]

# Read each csv file and store them in a list
dataframe = [pd.read_csv(file) for file in files]   

# Concatenate the data
combined_data = pd.concat(dataframe, ignore_index=True)

# Save the combined data to a CSV file
combined_data.to_csv('taylor_swift_song_lyrics\\taylor_swift_lyrics.csv', index=False)

# Check the first 5 rows
print(combined_data.head())

     album_name track_title  track_n  \
0  Taylor Swift  Tim McGraw        1   
1  Taylor Swift  Tim McGraw        1   
2  Taylor Swift  Tim McGraw        1   
3  Taylor Swift  Tim McGraw        1   
4  Taylor Swift  Tim McGraw        1   

                                         lyric  line  
0          He said the way my blue eyes shined     1  
1  Put those Georgia stars to shame that night     2  
2                       I said, "That's a lie"     3  
3                  Just a boy in a Chevy truck     4  
4         That had a tendency of gettin' stuck     5  


In [4]:
# Check the info of the combined data
print(combined_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8358 entries, 0 to 8357
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   album_name   8358 non-null   object
 1   track_title  8358 non-null   object
 2   track_n      8358 non-null   int64 
 3   lyric        8358 non-null   object
 4   line         8358 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 326.6+ KB
None
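
The hard-coded, backslash-separated paths in step 1 only resolve on Windows. Below is a minimal sketch of a more portable variant, assuming the album CSVs sit in one folder and share the `NN-name.csv` naming shown above. The temporary folder and the two tiny sample files are stand-ins created only so the snippet runs on its own; they are not the Kaggle data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in data: two tiny CSVs in a temporary folder (hypothetical contents,
# not the Kaggle files) so the sketch is runnable by itself.
data_dir = Path(tempfile.mkdtemp())
pd.DataFrame({"album_name": ["Taylor Swift"], "lyric": ["sample line a"]}).to_csv(
    data_dir / "01-taylor_swift.csv", index=False
)
pd.DataFrame({"album_name": ["reputation"], "lyric": ["sample line b"]}).to_csv(
    data_dir / "06-reputation.csv", index=False
)

# Collect every file named like "NN-name.csv", sorted by name,
# instead of listing each path by hand.
files = sorted(data_dir.glob("[0-9][0-9]-*.csv"))

# Read and stack the files, as in the cell above.
combined = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
print(combined)
```

The `[0-9][0-9]-` prefix in the pattern also keeps a previously saved combined file such as `taylor_swift_lyrics.csv` from being read back in on a second run.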


#### 2. Add album year

In [5]:
# Check the existing album names
print(combined_data['album_name'].unique())

['Taylor Swift' 'Fearless (Taylor’s Version)' 'Speak Now (Deluxe)'
 'Red (Deluxe Edition)' '1989 (Deluxe)' 'reputation' 'Lover'
 'folklore (deluxe version)' 'evermore (deluxe version)']


In [6]:
def release_year(album_name):
    # Compare in lowercase: the album names are capitalized
    # ('Fearless', 'Speak Now', 'Red', 'Lover'), so lowercase keys would never match them as-is
    name = album_name.lower()
    if 'taylor swift' in name:
        return 2006
    elif 'fearless' in name:
        return 2008
    elif 'speak now' in name:
        return 2010
    elif 'red' in name:
        return 2012
    elif '1989' in name:
        return 2014
    elif 'reputation' in name:
        return 2017
    elif 'lover' in name:
        return 2019
    elif 'folklore' in name:
        return 2020
    elif 'evermore' in name:
        return 2020
    else:
        return 'No Year'
    
# Apply the function to the album_name column
combined_data['release_year'] = combined_data['album_name'].apply(release_year)

print(combined_data.head())

     album_name track_title  track_n  \
0  Taylor Swift  Tim McGraw        1   
1  Taylor Swift  Tim McGraw        1   
2  Taylor Swift  Tim McGraw        1   
3  Taylor Swift  Tim McGraw        1   
4  Taylor Swift  Tim McGraw        1   

                                         lyric  line release_year  
0          He said the way my blue eyes shined     1         2006  
1  Put those Georgia stars to shame that night     2         2006  
2                       I said, "That's a lie"     3         2006  
3                  Just a boy in a Chevy truck     4         2006  
4         That had a tendency of gettin' stuck     5         2006  
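
The if/elif chain in `release_year` can also be expressed as a lookup table. Here is a minimal sketch, assuming the album names printed in step 2 and the same years as the function above; `ALBUM_YEARS` and `lookup_release_year` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Lowercase keyword -> release year, mirroring the release_year function.
ALBUM_YEARS = {
    "taylor swift": 2006,
    "fearless": 2008,
    "speak now": 2010,
    "red": 2012,
    "1989": 2014,
    "reputation": 2017,
    "lover": 2019,
    "folklore": 2020,
    "evermore": 2020,
}


def lookup_release_year(album_name):
    """Return the year for the keyword the album name starts with, or None."""
    name = album_name.lower()
    for keyword, year in ALBUM_YEARS.items():
        if name.startswith(keyword):
            return year
    return None


# Small sample of the album names from step 2.
albums = pd.Series(
    ["Taylor Swift", "Fearless (Taylor’s Version)", "Red (Deluxe Edition)", "Lover"]
)
years = albums.apply(lookup_release_year)
print(years.tolist())  # [2006, 2008, 2012, 2019]
```

Matching on the start of the name rather than anywhere inside it avoids accidental hits, and adding a new album becomes a one-line change to the dictionary.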


#### 3. Clean the lyrics text
To count keyword mentions accurately, we need to make everything lowercase, remove punctuation, and exclude stop words.

- Change everything to lowercase and save the result in a new column called clean_lyric.
- Remove punctuation and save the result to the existing clean_lyric column.
- Remove stop words.


In [11]:
# Change lyrics to lowercase
combined_data['clean_lyric'] = combined_data['lyric'].str.lower()


In [12]:
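# Remove punctuation other than quotes, commas, periods, and apostrophes.
# regex=True is required: since pandas 2.0, str.replace treats the pattern as a literal string by default.
combined_data['clean_lyric'] = combined_data['clean_lyric'].str.replace(r'[^\w\s",.\']', '', regex=True)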
print(combined_data['clean_lyric'])

0               he said the way my blue eyes shined
1       put those georgia stars to shame that night
2                            i said, "that's a lie"
3                       just a boy in a chevy truck
4              that had a tendency of gettin' stuck
                           ...                     
8353         you know, you know, you know, you know
8354                           when it's time to go
8355                                 so then you go
8356                                    then you go
8357                                    you just go
Name: clean_lyric, Length: 8358, dtype: object


In [23]:
# Load scikit-learn's built-in English stop words. The frozenset is immutable,
# so to customize the list, build a new set from it.
skl_stopwords = text.ENGLISH_STOP_WORDS
print(skl_stopwords)



frozenset({'six', 'before', 'next', 'onto', 'former', 'us', 'show', 'until', 'himself', 'done', 'i', 'four', 'towards', 'each', 'who', 'part', 'into', 'move', 'nowhere', 'afterwards', 'thus', 'those', 'no', 'though', 'any', 'to', 'never', 'amount', 'thereby', 'off', 'else', 'by', 'up', 'thru', 'five', 'themselves', 'the', 'been', 'was', 'with', 'therein', 'sixty', 'few', 'amongst', 'others', 'toward', 'because', 'do', 'thick', 'becomes', 'enough', 'this', 'whereby', 'something', 'everywhere', 'how', 'there', 'mine', 'yourself', 'noone', 'must', 'see', 'more', 'keep', 'mostly', 'some', 'it', 'made', 'also', 'their', 'anywhere', 'they', 'either', 'empty', 'you', 'seems', 'here', 'our', 'of', 'in', 'within', 'down', 'system', 'give', 'bottom', 'after', 'moreover', 'hers', 'mill', 'anyone', 'eg', 'is', 'most', 'same', 'might', 'except', 'whereupon', 'as', 'per', 'am', 'own', 'around', 'herself', 'side', 'less', 'all', 'hasnt', 'someone', 'full', 'its', 'will', 'twelve', 'therefore', 'cant'

In [15]:
# Split the lyrics into words
combined_data['clean_lyric_list'] = combined_data['clean_lyric'].str.split()
print(combined_data['clean_lyric_list'].head())

0         [he, said, the, way, my, blue, eyes, shined]
1    [put, those, georgia, stars, to, shame, that, ...
2                         [i, said,, "that's, a, lie"]
3                  [just, a, boy, in, a, chevy, truck]
4         [that, had, a, tendency, of, gettin', stuck]
Name: clean_lyric_list, dtype: object


In [24]:
# Remove stop words
combined_data['clean_lyric_list'] = combined_data['clean_lyric_list'].apply(lambda x: [item for item in x if item not in skl_stopwords])
print(combined_data['clean_lyric_list'].head())

0    [said, way, blue, eyes, shined]
1     [georgia, stars, shame, night]
2             [said,, "that's, lie"]
3          [just, boy, chevy, truck]
4         [tendency, gettin', stuck]
Name: clean_lyric_list, dtype: object


In [25]:
# Re-join the words into a string
combined_data['clean_lyric'] = combined_data['clean_lyric_list'].str.join(' ')
print(combined_data['clean_lyric'])

0       said way blue eyes shined
1       georgia stars shame night
2              said, "that's lie"
3            just boy chevy truck
4          tendency gettin' stuck
                  ...            
8353       know, know, know, know
8354                    it's time
8355                             
8356                             
8357                         just
Name: clean_lyric, Length: 8358, dtype: object
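
In the output above, tokens such as `said,` and `lie"` keep their attached punctuation, so a stop word followed by a comma (for example `you,`) would not equal any entry in the stop word list and would slip through. One possible refinement is sketched below, under the assumption that punctuation at the edges of a token can be discarded. `SAMPLE_STOPWORDS` is a small hypothetical stand-in for `skl_stopwords`, and `remove_stopwords` is an illustrative helper, not part of the original notebook.

```python
import string

# Hypothetical stand-in for skl_stopwords so the sketch has no dependencies.
SAMPLE_STOPWORDS = frozenset({"i", "a", "you", "the", "to", "so", "then"})


def remove_stopwords(line, stopwords=SAMPLE_STOPWORDS):
    """Lowercase a line, strip punctuation from token edges, drop stop words."""
    tokens = (tok.strip(string.punctuation) for tok in line.lower().split())
    # Skip tokens that were pure punctuation (now empty) and stop words.
    return [tok for tok in tokens if tok and tok not in stopwords]


print(remove_stopwords('I said, "That\'s a lie"'))  # ['said', "that's", 'lie']
print(remove_stopwords("So then you, you go"))      # ['go']
```

Stripping only the edges keeps apostrophes inside words such as `that's` intact. With pandas, the same helper could be applied to the `lyric` column through `.apply()`.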


#### 4. Find keyword mentions

- Create a new column to indicate whether a lyric has "midnight" in it.
- Check how many times "midnight" occurs.

In [29]:
def midnight_lyric(lyric):
    # True when the cleaned lyric line contains "midnight"
    return 'midnight' in lyric

combined_data['midnight_lyric'] = combined_data['clean_lyric'].apply(midnight_lyric)
print(combined_data['midnight_lyric'].value_counts())

midnight_lyric
False    8349
True        9
Name: count, dtype: int64


#### 5. Expand the keyword list
"Midnight" might not be the only word Taylor Swift has used to talk about night, so we need to expand our keyword list. We've made a list of night words.
- Join each list into a regular expression string using the .join() function, with | as the separator to indicate "or".
- Create a new column for each word category (day, night, time) that checks the clean lyrics for the presence of any word in the regular expression.
- Count how many times the words appeared and print the result to the screen.
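
The three steps above can be sketched as follows. The word lists and the five sample lyric lines are hypothetical placeholders, since the notebook's own lists are not shown in this section; in the notebook itself the `str.contains` calls would run on `combined_data['clean_lyric']`.

```python
import pandas as pd

# Hypothetical word lists; substitute the lists prepared for the analysis.
night = ["night", "midnight", "dusk"]
day = ["day", "morning", "sunrise"]
time = ["today", "tomorrow", "yesterday"]

# "|" means "or" in a regular expression, e.g. "night|midnight|dusk".
night_regex = "|".join(night)
day_regex = "|".join(day)
time_regex = "|".join(time)

# Tiny stand-in for combined_data['clean_lyric'].
lyrics = pd.DataFrame({"clean_lyric": [
    "georgia stars shame night",
    "meet midnight",
    "good morning",
    "see tomorrow",
    "just boy chevy truck",
]})

# One boolean column per category: True when any word in the pattern appears.
lyrics["night"] = lyrics["clean_lyric"].str.contains(night_regex, regex=True)
lyrics["day"] = lyrics["clean_lyric"].str.contains(day_regex, regex=True)
lyrics["time"] = lyrics["clean_lyric"].str.contains(time_regex, regex=True)

# Summing a boolean column counts the lines that matched.
print(lyrics[["night", "day", "time"]].sum())
```

Because these patterns match substrings, `day` would also match `today`. Wrapping a pattern as `r"\b(?:" + "|".join(day) + r")\b"` is one way to restrict it to whole words.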
