## Cleaning Text - Pop

In this notebook, I clean the lyrics text so it can be used for machine learning. I am restricting the data to pop lyrics because pop is the most prevalent genre across all the decades, as shown in the previous EDA notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import spacy
import nltk
import re
import string

In [2]:
df = pd.read_csv("allsongscombined.csv")

In [3]:
df_pop = df[df['tag'] == 'pop']

I am replacing the line breaks with spaces before cleaning so that the words are recognized separately from the line-break characters.

In [4]:
df_pop['lyrics'] = df_pop['lyrics'].str.replace('\n', ' ', regex=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_pop['lyrics'] = df_pop['lyrics'].str.replace('\n', ' ', regex=False)
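
The warning above appears because `df_pop` was created by filtering `df`, so pandas cannot tell whether the assignment is meant to modify the original frame or a copy. One common way to avoid it is to take an explicit copy when filtering. The following is a minimal sketch of that idea; the small DataFrame is a hypothetical stand-in for `allsongscombined.csv`, not the real data.

```python
import pandas as pd

# Hypothetical toy frame standing in for allsongscombined.csv.
df = pd.DataFrame({
    "tag": ["pop", "rock", "pop"],
    "lyrics": ["la\nla", "yeah\nyeah", "oh\noh"],
})

# .copy() makes df_pop an independent DataFrame, so the assignment below
# is unambiguous and pandas has no reason to warn.
df_pop = df[df["tag"] == "pop"].copy()
df_pop["lyrics"] = df_pop["lyrics"].str.replace("\n", " ", regex=False)

print(df_pop["lyrics"].tolist())
```

With the explicit copy, later column assignments on `df_pop` leave the original `df` untouched.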


For cleaning the text, I am using spaCy. Initially I tried cleaning everything at once, but that exhausted my computer's memory, so I looked for a less memory-intensive approach. Along with the [spaCy documentation](https://spacy.io/), Perplexity pointed me to a few sources that helped me switch to a pipeline instead of processing the entire corpus at once, and to decide between spaCy and NLTK: [SpaCy or NLTK?](https://blog.parlanchin.com/blog/spacy-or-nltk/), [Spacy vs NLTK: Which NLP Library is Right for You?](https://botpenguin.com/blogs/spacy-vs-nltk), and [How to Clean Text Like a Boss for NLP in Python](https://dataknowsall.com/blog/textcleaning.html).

In [5]:
nlp = spacy.load('en_core_web_lg')

In [6]:
texts = df_pop['lyrics'].tolist()

In [7]:
docs = nlp.pipe(texts, batch_size=1000, disable=["ner", "parser"])

In [8]:
cleaned_texts = []
for doc in docs:
    tokens = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct]
    cleaned_texts.append(" ".join(tokens))
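
The loop above can be checked on a couple of sentences before running it on the full dataset. The sketch below wraps the same filter in a generator so that only the cleaned strings, not the `Doc` objects, are kept. It rests on several assumptions that are not part of this notebook: `clean_texts` is a hypothetical helper name, the pipeline is `spacy.blank("en")` (tokenizer only, so it runs without downloading a model), lowercased token text stands in for lemmas because a blank pipeline has no lemmatizer, and an extra `token.is_space` check drops the whitespace tokens that runs of spaces produce.

```python
import spacy


def clean_texts(nlp, texts, batch_size=1000):
    """Yield one cleaned string per input text, streaming through nlp.pipe."""
    for doc in nlp.pipe(texts, batch_size=batch_size):
        tokens = [
            # With en_core_web_lg loaded, token.lemma_.lower() would go here;
            # a blank pipeline has no lemmatizer, so this sketch uses lower_.
            token.lower_
            for token in doc
            # is_space is an addition to the notebook's filter (assumption).
            if not (token.is_stop or token.is_punct or token.is_space)
        ]
        yield " ".join(tokens)


# Assumption: a blank English pipeline, used only to keep the sketch light.
nlp = spacy.blank("en")
sample = ["The cats are running!", "Hello,   world."]
cleaned = list(clean_texts(nlp, sample, batch_size=2))
print(cleaned)
```

Because `clean_texts` is a generator, each `Doc` can be discarded as soon as its cleaned string has been produced.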

In [9]:
df_pop['cleaned_lyrics'] = cleaned_texts

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_pop['cleaned_lyrics'] = cleaned_texts


In [10]:
df_pop.head()

Unnamed: 0,title,artist,tag,year,lyrics,album,explicit,danceability,energy,key,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,decade,cleaned_lyrics
83,Idioteque,radiohead,pop,2000,Who's in the bunker? Who's in the bunker? Wom...,Kid A,False,0.615,0.931,3,...,0.273,0.0352,2.2e-05,0.0915,0.525,137.544,309093,3.0,2000s,bunker bunker woman child child child laugh ...
110,Billie Jean,michael jackson,pop,1982,She was more like a beauty queen from a movie...,Thriller,False,0.932,0.457,11,...,0.0541,0.0173,0.0436,0.0414,0.884,117.002,294227,4.0,1980s,like beauty queen movie scene say mind mean ...
132,Holiday,vampire weekend,pop,2010,"Holiday, oh, a holiday And the best one of th...",Holiday,False,0.715,0.769,2,...,0.105,0.023,0.000732,0.127,0.891,155.827,138293,4.0,2010s,holiday oh holiday good year doze underneath...
169,Islands,shakira,pop,2010,I don't have to leave anymore What I have is ...,Sale el Sol,False,0.778,0.76,8,...,0.045,0.104,1.8e-05,0.116,0.686,134.098,162893,4.0,2010s,leave anymore right spend night day search w...
190,Hold It Against Me,britney spears,pop,2011,"Hey, over there Please forgive me if I'm comi...",Femme Fatale (Deluxe Version),False,0.648,0.722,0,...,0.0427,0.0103,0.0,0.24,0.389,132.973,228827,4.0,2010s,hey forgive come strong hate stare win play ...


In [13]:
df_pop.head()

Unnamed: 0,title,artist,tag,year,lyrics,album,explicit,danceability,energy,key,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,decade,cleaned_lyrics
83,Idioteque,radiohead,pop,2000,\nWho's in the bunker? Who's in the bunker?\nW...,Kid A,False,0.615,0.931,3,...,0.273,0.0352,2.2e-05,0.0915,0.525,137.544,309093,3.0,2000s,bunker bunker woman child child child laugh...
110,Billie Jean,michael jackson,pop,1982,\nShe was more like a beauty queen from a movi...,Thriller,False,0.932,0.457,11,...,0.0541,0.0173,0.0436,0.0414,0.884,117.002,294227,4.0,1980s,like beauty queen movie scene say mind mean ...
132,Holiday,vampire weekend,pop,2010,"\nHoliday, oh, a holiday\nAnd the best one of ...",Holiday,False,0.715,0.769,2,...,0.105,0.023,0.000732,0.127,0.891,155.827,138293,4.0,2010s,holiday oh holiday good year doze underneat...
169,Islands,shakira,pop,2010,\nI don't have to leave anymore\nWhat I have i...,Sale el Sol,False,0.778,0.76,8,...,0.045,0.104,1.8e-05,0.116,0.686,134.098,162893,4.0,2010s,leave anymore right spend night day search...
190,Hold It Against Me,britney spears,pop,2011,"\nHey, over there\nPlease forgive me if I'm co...",Femme Fatale (Deluxe Version),False,0.648,0.722,0,...,0.0427,0.0103,0.0,0.24,0.389,132.973,228827,4.0,2010s,hey forgive come strong hate stare win pla...


Trying to clean all of the text at once with spaCy used up all of my computer's memory, so I went back and used a pipeline instead. I am exporting the result as a CSV again because the spaCy cleaning and lemmatization took a very long time to run.

In [12]:
df_pop.to_csv('poplyricscleaned.csv', index=False)
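
When the exported file is read back in a later notebook, one detail may be worth checking: any song whose lyrics were reduced to an empty string during cleaning is read back by `pd.read_csv` as a missing value by default. The sketch below illustrates the round trip on a hypothetical two-row frame, using an in-memory buffer in place of `poplyricscleaned.csv`.

```python
import io

import pandas as pd

# Hypothetical two-row frame; the second song was cleaned down to nothing.
df_pop = pd.DataFrame({
    "title": ["Song A", "Song B"],
    "cleaned_lyrics": ["holiday oh holiday", ""],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df_pop.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
# The empty string comes back as a missing value.
n_missing = int(reloaded["cleaned_lyrics"].isna().sum())

# Restore empty strings so downstream text code sees str values only.
reloaded["cleaned_lyrics"] = reloaded["cleaned_lyrics"].fillna("")
print(n_missing, reloaded["cleaned_lyrics"].tolist())
```

Filling the missing values with `""` (or dropping those rows) keeps later vectorizers from failing on non-string input.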

In [13]:
df_pop.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48027 entries, 83 to 106138
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   title             48027 non-null  object 
 1   artist            48027 non-null  object 
 2   tag               48027 non-null  object 
 3   year              48027 non-null  int64  
 4   lyrics            48027 non-null  object 
 5   album             48027 non-null  object 
 6   explicit          48027 non-null  bool   
 7   danceability      48027 non-null  float64
 8   energy            48027 non-null  float64
 9   key               48027 non-null  int64  
 10  loudness          48027 non-null  float64
 11  mode              48027 non-null  int64  
 12  speechiness       48027 non-null  float64
 13  acousticness      48027 non-null  float64
 14  instrumentalness  48027 non-null  float64
 15  liveness          48027 non-null  float64
 16  valence           48027 non-null  floa