## Mic check / Introduction
Rap music has been one of the most influential music genres for most of the last quarter century. The name is often said to stand for "Rhythm and Poetry", which tells you how much words matter here. This data analysis will try to shed more light on the way the most influential rappers, well, rap.

This project analyzes 5 songs from each of 7 rappers, so our data will contain lyrics from 35 songs. The goal of the analysis is to draw conclusions about the positivity and subjectivity of each rapper's lyrics, as well as the size of their vocabulary.

This project, like hopefully every other data science project, will go through the notorious process of "Data Cleaning". It's boring and all that, yet it's absolutely vital for the sanity of our results. If you put sour fruit in the juicer, you can't really expect not to get sour juice, right? Feed a model bad data and it feeds you bad results. In the world of data science, it's almost impossible to clean your data completely, yet we should always try to minimize the garbage so we can maximize the correctness of our output.

Clean data is not enough. We need properly structured data. Why? Well, algorithms have a thing for properly structured data. Organizing data in the right way makes it easier for algorithms to extract better information.

It's worth noting that the choice of rappers and songs is totally subjective. You could do this analysis for any musician, author, poet or any other producer of text content. Good luck with cleaning data for mumble rap lyrics though.

That said, let's see the final list of steps we will go through before we actually do some really cool stuff with our data:

1. Obtain data
2. Clean data
3. Structure data
    1. **Corpus** - collection of song lyrics from every rapper
   
    2. **Word Matrix** - matrix format of every word used in the Corpus
    
As we will see, the "Clean data" and "Structure data" steps depend on each other, so we will move back and forth between them instead of finishing one before starting the other.
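
To make the two target structures concrete, here is a minimal sketch on made-up text. The `toy_corpus` contents and the helper name `build_word_matrix` are illustrative assumptions, not part of the project code, and the sketch uses only the standard library, whereas the notebook itself relies on `CountVectorizer` later on.

```python
from collections import Counter

# Corpus: one merged text per rapper (made-up placeholder text, not real lyrics)
toy_corpus = {
    'rapper_a': 'keep your head up keep moving',
    'rapper_b': 'head in the clouds',
}

def build_word_matrix(corpus):
    """Return (sorted vocabulary, {name: counts aligned with the vocabulary})."""
    counts = {name: Counter(text.split()) for name, text in corpus.items()}
    vocabulary = sorted(set().union(*counts.values()))
    matrix = {name: [counter[word] for word in vocabulary]
              for name, counter in counts.items()}
    return vocabulary, matrix

vocabulary, matrix = build_word_matrix(toy_corpus)
print(vocabulary)  # ['clouds', 'head', 'in', 'keep', 'moving', 'the', 'up', 'your']
for name, row in matrix.items():
    print(name, row)
```

Each row is one rapper and each column is one word, which is the shape the word matrix takes at the end of this section.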

Finally, two days ago, on June 16, it was Tupac's birthday. So this is my way of saying "Happy birthday, man". Legends never die.

## 1. Obtain data
Here is the list of rappers and songs we will obtain data for:
1. Tupac Amaru Shakur
    1. "Life Goes On"
    2. "Unconditional Love"
    3. "Changes"
    4. "Until The End Of Time"
    5. "Dear Mama"
2. The Notorious B.I.G.
    1. "Ten Crack Commandments"
    2. "Hypnotize"
    3. "Juicy"
    4. "One More Chance"
    5. "Big Poppa"
3. Snoop Doggy Dogg
    1. "What's My Name?"
    2. "Gin And Juice"
    3. "That's That Shit"
    4. "Tha Doggfather"
    5. "Eastside Party"
4. Nas
    1. "N.Y. State Of Mind"
    2. "Nas Is Like"
    3. "Surviving The Times"
    4. "The Message"
    5. "Memory Lane"
5. The Game
    1. "Where I'm From"
    2. "One Night"
    3. "Don't Need Your Love"
    4. "Laugh"
    5. "Too Much"
6. 50 Cent
    1. "Hustler's Ambition"
    2. "Window Shopper"
    3. "If I Can't"
    4. "Many Men"
    5. "When It Rains It Pours"
7. Eminem
    1. "Rap God"
    2. "Phenomenal"
    3. "Wicked Ways"
    4. "Kings Never Die"
    5. "The Ringer"

We will find the lyrics on www.genius.com and store the URLs in variables:
    

In [22]:
# We will use Requests and BeautifulSoup for scraping
import requests
from bs4 import BeautifulSoup
# We will use Pickle and OS to create folders and store files
import pickle
import os
# We will use Pandas to structure and maintain our data
import pandas
# We will use Re and String modules when we start cleaning our data
import re
import string
# We will use this module in the 'Cleaning Data' and 'Structure Data' steps
from sklearn.feature_extraction.text import CountVectorizer

# Storing URLs to the Tupac songs lyrics
tupac_urls = ['https://genius.com/2pac-life-goes-on-lyrics',
             'https://genius.com/2pac-unconditional-love-lyrics',
             'https://genius.com/2pac-changes-lyrics',
             'https://genius.com/2pac-until-the-end-of-time-lyrics',
             'https://genius.com/2pac-dear-mama-lyrics']

# Storing URLs to the Biggie songs lyrics
biggie_urls = ['https://genius.com/The-notorious-big-ten-crack-commandments-lyrics',
              'https://genius.com/The-notorious-big-one-more-chance-lyrics',
              'https://genius.com/The-notorious-big-big-poppa-lyrics',
              'https://genius.com/The-notorious-big-juicy-lyrics',
              'https://genius.com/The-notorious-big-hypnotize-lyrics']

# Storing URLs to the Snoop songs lyrics
snoop_urls = ['https://genius.com/Snoop-dogg-gin-and-juice-lyrics',
             'https://genius.com/Snoop-dogg-thats-that-shit-lyrics',
             'https://genius.com/Snoop-dogg-tha-doggfather-lyrics',
             'https://genius.com/Snoop-dogg-eastside-party-lyrics',
             'https://genius.com/Snoop-dogg-who-am-i-lyrics']

# Storing URLs to the Nas songs lyrics
nas_urls = ['https://genius.com/Nas-surviving-the-times-lyrics',
           'https://genius.com/Nas-nas-is-like-lyrics',
           'https://genius.com/Nas-ny-state-of-mind-lyrics',
           'https://genius.com/Nas-the-message-lyrics',
           'https://genius.com/Nas-memory-lane-sittin-in-da-park-lyrics']

# Storing URLs to the The Game songs lyrics
game_urls = ['https://genius.com/The-game-where-im-from-lyrics',
            'https://genius.com/The-game-one-night-lyrics',
            'https://genius.com/The-game-dont-need-your-love-lyrics',
            'https://genius.com/The-game-laugh-lyrics',
            'https://genius.com/The-game-too-much-lyrics']

# Storing URLs to the 50 Cent songs lyrics
fifty_urls = ['https://genius.com/50-cent-many-men-wish-death-lyrics',
            'https://genius.com/50-cent-hustlers-ambition-lyrics',
            'https://genius.com/50-cent-if-i-cant-lyrics',
            'https://genius.com/50-cent-window-shopper-lyrics',
            'https://genius.com/50-cent-when-it-rains-it-pours-lyrics']

# Storing URLs to the Eminem songs lyrics
eminem_urls = ['https://genius.com/Eminem-rap-god-lyrics',
               'https://genius.com/Eminem-phenomenal-lyrics',
               'https://genius.com/Eminem-wicked-ways-lyrics',
               'https://genius.com/Eminem-kings-never-die-lyrics',
               'https://genius.com/Eminem-the-ringer-lyrics']

Next, we need a function that will take a URL and return the scraped text.

In [23]:
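# Takes a URL, scrapes the Genius page at that URL, and returns the lyrics text
def fetch_lyrics(url):
    print(url)

    page = requests.get(url, timeout=30)
    page.raise_for_status()  # stop early on HTTP errors instead of parsing an error page
    html = BeautifulSoup(page.text, "html.parser")
    # The lyrics are expected inside <div class="lyrics">; fail loudly if the layout differs
    container = html.find("div", class_="lyrics")
    if container is None:
        raise ValueError("No lyrics container found at " + url)

    return container.get_text()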
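
When scraping many pages in a row, a single failed request or an unexpected page layout would abort the whole run. The sketch below shows one possible way to keep going and collect the failures. The helper name `fetch_many`, its `pause_seconds` parameter, and the `fake_fetch` stand-in are assumptions made for illustration and are not part of the original notebook.

```python
import time

def fetch_many(urls, fetch_one, pause_seconds=1.0):
    """Call fetch_one(url) for every URL; return (texts, failed_urls)."""
    texts, failed_urls = [], []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(pause_seconds)  # be polite: pause between requests
        try:
            texts.append(fetch_one(url))
        except Exception as error:
            # Remember the failure and keep going with the next URL
            print('Failed:', url, repr(error))
            failed_urls.append(url)
    return texts, failed_urls

# Stand-in for a real scraper, so this sketch runs without network access
def fake_fetch(url):
    if url.endswith('missing'):
        raise ValueError('No lyrics container found at ' + url)
    return 'lyrics for ' + url

texts, failed = fetch_many(['song-one', 'song-missing'], fake_fetch, pause_seconds=0)
print(texts)   # ['lyrics for song-one']
print(failed)  # ['song-missing']
```

In the notebook this could be called as `fetch_many(tupac_urls, fetch_lyrics)`, which returns the lyrics that were scraped plus the URLs that need another look.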

Now that we have a function, we will use it to fetch lyrics. This could take a couple of minutes. You will see the URL of the scraped page printed out.

In [24]:
# We will store the lyrics in this dictionary
lyrics = {}
lyrics['tupac'] = [fetch_lyrics(url) for url in tupac_urls]
lyrics['biggie'] = [fetch_lyrics(url) for url in biggie_urls]
lyrics['snoop'] = [fetch_lyrics(url) for url in snoop_urls]
lyrics['nas'] = [fetch_lyrics(url) for url in nas_urls]
lyrics['game'] = [fetch_lyrics(url) for url in game_urls]
lyrics['fifty'] = [fetch_lyrics(url) for url in fifty_urls]
lyrics['eminem'] = [fetch_lyrics(url) for url in eminem_urls]

https://genius.com/2pac-life-goes-on-lyrics
https://genius.com/2pac-unconditional-love-lyrics
https://genius.com/2pac-changes-lyrics
https://genius.com/2pac-until-the-end-of-time-lyrics
https://genius.com/2pac-dear-mama-lyrics
https://genius.com/The-notorious-big-ten-crack-commandments-lyrics
https://genius.com/The-notorious-big-one-more-chance-lyrics
https://genius.com/The-notorious-big-big-poppa-lyrics
https://genius.com/The-notorious-big-juicy-lyrics
https://genius.com/The-notorious-big-hypnotize-lyrics
https://genius.com/Snoop-dogg-gin-and-juice-lyrics
https://genius.com/Snoop-dogg-thats-that-shit-lyrics
https://genius.com/Snoop-dogg-tha-doggfather-lyrics
https://genius.com/Snoop-dogg-eastside-party-lyrics
https://genius.com/Snoop-dogg-who-am-i-lyrics
https://genius.com/Nas-surviving-the-times-lyrics
https://genius.com/Nas-nas-is-like-lyrics
https://genius.com/Nas-ny-state-of-mind-lyrics
https://genius.com/Nas-the-message-lyrics
https://genius.com/Nas-memory-lane-sittin-in-da-park-

We could print out the lyrics collection for one rapper just to see that our scraping was successful.

In [25]:
#print(lyrics['tupac'])

Now that we have scraped our data, we need to organize it. We will save the lyrics collection for every rapper in a separate file using Pickle, which we imported at the start. Pickle is used in Python for object serialization. The files will be saved in a directory that we will create, if it doesn't already exist.

In [26]:
# Create the directory where we will keep the lyrics files (no error if it already exists)
os.makedirs('lyrics', exist_ok=True)
    
rappers = ['tupac', 'biggie', 'snoop', 'nas', 'game', 'fifty', 'eminem']

# Create one named file per rapper (the content is binary pickle data, despite the .txt extension)
for rapper in rappers:
    with open("lyrics/" + rapper + ".txt", "wb") as file:
        pickle.dump(lyrics[rapper], file)

We will read the data from the files and apply a few cleaning techniques to it.

In [27]:
data = {}

# Load each rapper's list of lyrics back from its pickle file
for rapper in rappers:
    with open("lyrics/" + rapper + ".txt", "rb") as file:
        data[rapper] = pickle.load(file)
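
As a quick sanity check of the save-and-load idea, here is a minimal round trip with Pickle on placeholder data. The `sample` dictionary and the temporary folder are assumptions for illustration only. Keep in mind that pickle files are binary and that you should only unpickle files you trust.

```python
import os
import pickle
import tempfile

# Placeholder data in the same shape as our lyrics: a list of song texts per rapper
sample = {'rapper_a': ['first song text', 'second song text']}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'rapper_a.pkl')
    # Serialize the list to disk ...
    with open(path, 'wb') as file:
        pickle.dump(sample['rapper_a'], file)
    # ... and read it back as a Python object
    with open(path, 'rb') as file:
        restored = pickle.load(file)

print(restored == sample['rapper_a'])  # True
```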

At this point you could check whether the data was loaded correctly. We are done obtaining data, so we proceed to the second step: cleaning data.

In [28]:
#data.keys()

In [29]:
#data['tupac']

Notice that we actually have a list of song lyrics instead of one single text.

## 2. Clean Data 
The techniques used to clean textual data are called "text pre-processing techniques". We will use a few of those, but won't go into any advanced data cleaning. We will go through the following techniques:
1. Remove garbage characters (like "\n", "[]" and so on)
2. Remove punctuation
3. Remove numerical values
4. Make all text lowercase
5. Remove stop words

First, we will structure the data in a better way. As noticed before, our data per rapper is a list of lyrics. So, we have to transform it into one single text. We will write a function that will accept a list of lyrics and will return one single text. That can only mean one thing...

## 3. Structure data
So, first we will merge our data into a single text per rapper, and then we will put it in the following formats:
- Corpus - collection of song lyrics from every rapper
- Word Matrix - matrix format of every word used in the Corpus

We will move back and forth between cleaning and structuring data, as we can't efficiently clean badly structured data.

In [30]:
def merge_lyrics(list_of_lyrics):
    merged_text = ' '.join(list_of_lyrics)
    return merged_text

merged_data = {key: [merge_lyrics(value)] for (key, value) in data.items()}
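
Here is what the merge does on a tiny made-up dictionary (the `toy_data` values are placeholders, not real lyrics). Note that each merged text is wrapped in a one-element list, so every rapper ends up with exactly one value.

```python
def merge_lyrics(list_of_lyrics):
    # Join all song texts of one rapper into a single string
    return ' '.join(list_of_lyrics)

toy_data = {'rapper_a': ['first song', 'second song'], 'rapper_b': ['only song']}
toy_merged = {key: [merge_lyrics(value)] for key, value in toy_data.items()}

print(toy_merged)
# {'rapper_a': ['first song second song'], 'rapper_b': ['only song']}
```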

Let's try it out. We will print the data before and after executing this function.

In [31]:
#data['tupac']

In [32]:
#merged_data['tupac']

That's better. At the moment our data is in a dictionary. Next, we will re-structure it into a pandas dataframe.

Our data is already more or less in the Corpus format; now we will create a pandas dataframe from it. Why? Because it will be easier to clean our data if it is structured in a dataframe.

In [33]:
pandas.set_option('display.max_colwidth', 150)

lyrics_dataframe = pandas.DataFrame.from_dict(merged_data).transpose()
lyrics_dataframe.columns = ['lyrics']
lyrics_dataframe = lyrics_dataframe.sort_index()
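
The transpose is needed because `from_dict` treats dictionary keys as columns by default. As a sketch on placeholder data (the `toy_merged` dictionary is an assumption for illustration), passing `orient='index'` should give the same layout in one call:

```python
import pandas

toy_merged = {'rapper_b': ['text b'], 'rapper_a': ['text a']}

# Approach used in this notebook: keys become columns, then we transpose
via_transpose = pandas.DataFrame.from_dict(toy_merged).transpose()
via_transpose.columns = ['lyrics']
via_transpose = via_transpose.sort_index()

# Alternative: treat the keys as row labels directly
via_orient = pandas.DataFrame.from_dict(
    toy_merged, orient='index', columns=['lyrics']).sort_index()

print(via_transpose)
print(via_orient)
```

Both dataframes have the rappers as the index and a single `lyrics` column.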

We will check to see if our dataframe is what we expect it to be.

In [34]:
#lyrics_dataframe

Now we could take a look at Eminem's (or any other rapper's) data.

In [35]:
#lyrics_dataframe.lyrics.loc['eminem']

Now that we have our data restructured, we can start cleaning it, beginning with the garbage characters.

In [36]:
def remove_garbage_characters(lyrics):
    # remove section tags in square brackets, e.g. "[Verse 1]"
    lyrics = re.sub(r'\[.*?\]', '', lyrics)
    # replace runs of two or more "\n" characters with ''
    # (note: this joins the words on either side of a blank line)
    lyrics = re.sub(r'\n{2,}', '', lyrics)
    # replace every remaining single "\n" character with ' '
    lyrics = re.sub(r'\n', ' ', lyrics)
    return lyrics

clean_data = pandas.DataFrame(lyrics_dataframe.lyrics.apply(remove_garbage_characters))
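
One side effect is worth seeing on a small example: replacing runs of newlines with an empty string glues together the words on either side of a blank line, which may explain tokens such as `youyeah` that show up later in the word matrix. The `sample` string below is made up, and the whitespace-collapsing alternative is only a suggestion, not what the notebook runs.

```python
import re

# Made-up text laid out like scraped lyrics: a blank line, a section tag, then a verse
sample = "hold on\n\n[Verse 2]\nbe strong"

without_tags = re.sub(r'\[.*?\]', '', sample)

# Same newline handling as remove_garbage_characters
glued = re.sub(r'\n', ' ', re.sub(r'\n{2,}', '', without_tags))

# Alternative: collapse any run of whitespace into a single space
spaced = re.sub(r'\s+', ' ', without_tags).strip()

print(glued)   # hold onbe strong
print(spaced)  # hold on be strong
```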

We could check to see if there are any changes in the data.

In [37]:
#clean_data.lyrics.loc['eminem']

We move on to the next data cleaning step: remove all punctuation.

In [38]:
def remove_punctuation(lyrics):
    # remove every ASCII punctuation character
    lyrics = re.sub('[%s]' % re.escape(string.punctuation), '', lyrics)
    # remove curly quotes and ellipsis characters, which string.punctuation does not cover
    lyrics = re.sub('[‘’“”…]', '', lyrics)
    return lyrics

clean_data = pandas.DataFrame(clean_data.lyrics.apply(remove_punctuation))
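
A short illustration of what this step does to contractions and hyphenated words, using a made-up `sample` line. It also shows `str.translate` as an alternative way to delete the same characters; the name `deletion_table` is an assumption for this sketch.

```python
import re
import string

def remove_punctuation(lyrics):
    lyrics = re.sub('[%s]' % re.escape(string.punctuation), '', lyrics)
    lyrics = re.sub('[‘’“”…]', '', lyrics)
    return lyrics

# Alternative: a translation table that deletes the same set of characters
deletion_table = str.maketrans('', '', string.punctuation + '‘’“”…')

sample = "Can’t stop, won't stop... G-Unit!"
print(remove_punctuation(sample))                                      # Cant stop wont stop GUnit
print(sample.translate(deletion_table) == remove_punctuation(sample))  # True
```

Punctuation is deleted rather than replaced by a space, so "G-Unit" becomes a single token.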

See where we are:

In [39]:
#clean_data.lyrics.loc['tupac']

That's beautiful. If you run the last line of code, you will notice that we still have numbers in the text. We don't need numbers, so let's remove them.

In [40]:
def remove_numerical_values(lyrics):
    # remove every token that contains at least one digit (e.g. "1996", but also "2pac")
    lyrics = re.sub(r'\w*\d\w*', '', lyrics)
    return lyrics

clean_data = pandas.DataFrame(clean_data.lyrics.apply(remove_numerical_values))
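
The pattern removes whole tokens that contain a digit, not just the digits themselves. A small made-up example shows the effect, including the double spaces it leaves behind:

```python
import re

def remove_numerical_values(lyrics):
    return re.sub(r'\w*\d\w*', '', lyrics)

sample = 'back in 1996 with 2pac and the 8 mile crew'
result = remove_numerical_values(sample)

print(repr(result))              # 'back in  with  and the  mile crew'
print(' '.join(result.split()))  # back in with and the mile crew
```

The leftover double spaces do no harm to a tokenizer that splits on word boundaries, but the second `print` shows one way to normalize them if needed.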

In [41]:
#clean_data.lyrics.loc['tupac']

Do you see any numbers? I don't. So we are doing good. Let's make all text lowercase now.

In [42]:
def transform_to_lowercase(lyrics):
    lyrics = lyrics.lower()
    return lyrics

clean_data = pandas.DataFrame(clean_data.lyrics.apply(transform_to_lowercase))

Let's check it out.

In [43]:
#clean_data.lyrics.loc['eminem']
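
The four cleaning steps applied so far can also be chained into one function. This is only a sketch: the names `CLEANING_STEPS` and `clean_lyrics` and the `sample` text are assumptions for illustration, while the four step functions mirror the ones defined in this notebook.

```python
import re
import string

def remove_garbage_characters(lyrics):
    lyrics = re.sub(r'\[.*?\]', '', lyrics)
    lyrics = re.sub(r'\n{2,}', '', lyrics)
    return re.sub(r'\n', ' ', lyrics)

def remove_punctuation(lyrics):
    lyrics = re.sub('[%s]' % re.escape(string.punctuation), '', lyrics)
    return re.sub('[‘’“”…]', '', lyrics)

def remove_numerical_values(lyrics):
    return re.sub(r'\w*\d\w*', '', lyrics)

def transform_to_lowercase(lyrics):
    return lyrics.lower()

# The steps run in the same order as in the notebook
CLEANING_STEPS = [remove_garbage_characters, remove_punctuation,
                  remove_numerical_values, transform_to_lowercase]

def clean_lyrics(lyrics, steps=CLEANING_STEPS):
    """Apply every cleaning step in order and return the cleaned text."""
    for step in steps:
        lyrics = step(lyrics)
    return lyrics

sample = "[Intro]\nIt's 1996, y'all\nKeep Ya Head Up!"
print(clean_lyrics(sample).split())  # ['its', 'yall', 'keep', 'ya', 'head', 'up']
```

With a dataframe like ours, something along the lines of `lyrics_dataframe.lyrics.apply(clean_lyrics)` would then run all four steps in a single pass.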

We will now add the full artist name of the rappers in our dataframe.

In [44]:
# Full names, listed in the same (alphabetical) order as the dataframe index
rappers_full_names = ['The Notorious B.I.G.', 'Eminem', '50 Cent', 'The Game', 'Nas', 'Snoop Doggy Dogg', 'Tupac Amaru']

clean_data['name'] = rappers_full_names

clean_data.to_pickle("corpus.pkl")

clean_data

| | lyrics | name |
|---|---|---|
| biggie | the ten crack commandments what nigga cant tell me nothing about this coke cant tell me nothing about this crack this weed for my hustlin... | The Notorious B.I.G. |
| eminem | look i was gonna go easy on you not to hurt your feelings but im only going to get this one chance somethings wrong i can feel it six minutes six ... | Eminem |
| fifty | man we gotta go get somethin to eat man im hungry as a motherfucker ayo man damn whats takin homie so long son calm down he coming gunshots ahh ... | 50 Cent |
| game | im a bldoubleod been on songs with sndoubleop inside a ferrari with the dre run up i let it sing like nate dodoubleg walk through mile gunits on ... | The Game |
| nas | but thats the whole tragic point my friends what would i do if i could suddenly feel and to know once again that what i feel is real i could cry i... | Nas |
| snoop | ugh hahaha im serious nigga one of yall niggas got some bad motherfuckin breath oh man aye baby aye baby shit aye baby get some bubblegum in this ... | Snoop Doggy Dogg |
| tupac | how many brothers fell victim to the streets rest in peace young nigga theres a heaven for a g be a lie if i told you that i never thought of deat... | Tupac Amaru |


Now let's move on to the last step of our data cleaning process: removing stop words. Before we do that, we have to organize our data in a word matrix. We will use CountVectorizer to tokenize the lyrics, and it can remove the stop words at the same time. Every row will contain the data for one rapper and every column will be a different word.

In [45]:
cv = CountVectorizer(stop_words='english')
tokenized_data = cv.fit_transform(clean_data.lyrics)
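# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() needs scikit-learn 1.0 or later
word_matrix = pandas.DataFrame(tokenized_data.toarray(), columns=cv.get_feature_names_out())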
word_matrix.index = clean_data.index

Let's see our word matrix:

In [46]:
word_matrix

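| | aaaahif | abandoned | able | abuse | ac | accident | accolades | accomplishments | account | accountant | ... | yous | youthful | youve | youyeah | zags | zeros | zod | zombie | zone | zé |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| biggie | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| eminem | 0 | 0 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| fifty | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 12 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| game | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| nas | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| snoop | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| tupac | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |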


Finally, we will save all data to files:

In [47]:
word_matrix.to_pickle("word_matrix.pkl")

clean_data.to_pickle('clean_data.pkl')
# Use a context manager so the file handle is closed after writing
with open("cv.pkl", "wb") as file:
    pickle.dump(cv, file)