# Most Common Words for Artists

## Introduction
In this project we use a dataset kindly provided by *Sergey Kuznetsov* at the following link: https://www.kaggle.com/mousehead/songlyrics

The dataset contains more than 50,000 songs. It includes the artist name, the song title and the lyrics.

Focusing on the lyrics, our objective is to draw a word cloud of the most common words used by a given artist or group.

## Importing Libraries 

We are going to use basic libraries such as pandas, numpy and matplotlib, as well as some more specific ones for working with natural language, for the following purposes:
 - nltk.corpus: provides a list of stop words. Also known as function words, these are words that do not add meaning to a sentence, so we do not want them in our results.
 - collections: its `Counter` class lets us retrieve the entries of a dictionary that have the highest values.
 - wordcloud: plots the word cloud image.
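
As a small illustration of how these pieces fit together, the sketch below filters stop words out of a sentence and then asks `collections.Counter` for the most frequent remaining words. The `toy_stopwords` set and the sample sentence are made up for the example; the notebook itself uses the full NLTK list.

```python
import collections

# Hypothetical, tiny stop-word set; the notebook uses nltk's English list instead
toy_stopwords = {"the", "a", "and", "you", "i", "me", "do"}

sample = "love me do you know i love you and i love the sun"

# Keep only the words that are not stop words
words = [w for w in sample.split() if w not in toy_stopwords]

# most_common(n) returns the n (word, count) pairs with the highest counts
top_words = collections.Counter(words).most_common(2)
print(top_words)
```

Removing the stop words before counting is what keeps words like "the" or "you" from dominating the result.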

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import collections
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

%matplotlib inline

In [None]:
#Read the csv file and drop the column with the links
#To save time, we downsize the dataset to a random sample of 1,000 songs, which makes the process less intensive

songs = pd.read_csv('../input/songdata.csv')
songs = songs.drop("link", axis = 1)
songs = songs.sample(n = 1000).reset_index(drop=True)

In [None]:
#Let's see how it looks
songs.head()


In [None]:
#Clean the lyrics field: lowercase it, strip unwanted characters and collapse repeated whitespace
#regex=False makes each pattern literal, so "(", "?" and "." are not read as regular expressions

text = songs["text"].str.lower()
for char in ["(", ")", "?", "!", "/", "\\", "["]:
    text = text.str.replace(char, "", regex=False)
for char in ["\n", ",", ".", "-", "]", ":"]:
    text = text.str.replace(char, " ", regex=False)
songs["text"] = text.str.replace(r"\s+", " ", regex=True)
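
The cleaning step can be checked on a single string before it is applied to the whole column. The sketch below is a standalone approximation using the standard `re` module; `clean_text` is a hypothetical helper, not part of the notebook, and unlike the cell above it also strips leading and trailing spaces.

```python
import re

def clean_text(text):
    """Lowercase, drop punctuation and collapse whitespace (illustrative helper)."""
    text = text.lower()
    # Characters that are simply removed
    text = re.sub(r"[()?!/\\\[]", "", text)
    # Characters that are replaced by a space, newlines included
    text = re.sub(r"[\n,.\-\]:]", " ", text)
    # Collapse any run of whitespace into a single space
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_text("Hey Jude, don't make it bad.\n(Take a sad song) and make it better!")
print(cleaned)
```

Apostrophes are kept on purpose, so contractions such as "don't" survive as single words.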

In [None]:
#Let's see how the lyrics of the first song look now
songs["text"][0]

In [None]:
#How many artists are there in the dataset?
n_artist = len(songs["artist"].unique())
n_artist

In [None]:
#Let's now check which artists we have in our dataset
#Below we can see the first 100 artists in order of appearance (unique() does not sort them)
artist_list = songs["artist"].unique()[0:100]
artist_list


We want a representative sample of an artist's lyrics in order to make a proper word cloud. Therefore we need to check how many songs we have for each artist.

In [None]:
songs["artist"].value_counts().head(30)
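
One way to act on those counts is to keep only the artists with enough songs to give a meaningful word cloud. The sketch below uses a plain list of artist names and a hypothetical threshold `min_songs`; neither the threshold value nor the sample data comes from the dataset.

```python
import collections

# Made-up artist column, one entry per song
artists = ["ABBA", "Queen", "ABBA", "Adele", "Queen", "ABBA", "Queen"]

min_songs = 3  # hypothetical cut-off, to be tuned for the real data

# Count the songs per artist, then keep those that reach the cut-off
song_counts = collections.Counter(artists)
well_covered = sorted(a for a, n in song_counts.items() if n >= min_songs)
print(well_covered)
```

With the 1,000-song sample used here, many artists will have only a handful of songs, so a cut-off of this kind is worth considering before plotting.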

In [None]:
#list of function (stop) words that we want to exclude from the results
#we add some extra stop words specific to the domain of the dataset, such as "chorus", "oh" or "nah"

function_words = stopwords.words('english')
function_words = function_words + ["la", "i'm", "i've", "oh", "nah", "chorus", "na", "i'll", "can't", "yeah"]

In [None]:
#function to group all the lyrics of the same artist into a single string

def group_lyrics(artist):
    #select the lyrics of every song by this artist and join them with spaces
    artist_songs = songs.loc[songs["artist"] == artist, "text"]
    return " ".join(artist_songs)

In [None]:
#function to plot a word cloud for a single artist 

def plot_wordcloud(artist):
    word_cloud = WordCloud(stopwords=function_words, width=1600, height=1200, min_font_size=4,
                           collocations=True, background_color="white", max_words=20)
    word_cloud.generate(group_lyrics(artist))
    plt.imshow(word_cloud, interpolation='bilinear' )
    plt.axis("off")
    plt.show()

Finally, we only need to call the function *plot_wordcloud* with the chosen artist, and voilà!

We are going to start by plotting the word cloud for The Beatles. Any bets on their most common words?

In [None]:
#Let's see it

plot_wordcloud("The Beatles")

In [None]:
#Let's see some more examples

for artist in artist_list[0:5]:
    print(artist)
    plot_wordcloud(artist)


We have found that *love* is a very common word in lyrics. What a surprise...

Let's continue the analysis by finding which artist uses *love* the most.

In [None]:
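#Count how often "love" appears in each artist's lyrics
#Note that str.count matches substrings, so "lovely" or "glove" are counted too
artist_love = {}
for artist in songs["artist"].unique():
    artist_love[artist] = group_lyrics(artist).count("love")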

In [None]:
collections.Counter(artist_love).most_common(10)
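
Because `str.count("love")` counts substring matches, words such as *lovely* or *glove* are included in the totals above. If only the exact word is of interest, a word-level count is one alternative. The sketch below uses a made-up `lyrics_by_artist` dictionary in place of `group_lyrics`, so the names and lyrics are purely illustrative.

```python
import collections

# Made-up lyrics, already cleaned and lowercased
lyrics_by_artist = {
    "Artist A": "love love me do you know i love you",
    "Artist B": "lovely gloves for my beloved",
    "Artist C": "all you need is love",
}

exact_love = collections.Counter()
for artist, lyrics in lyrics_by_artist.items():
    # split() yields whole words, so "lovely" or "gloves" are not counted
    exact_love[artist] = lyrics.split().count("love")

top_love = exact_love.most_common(2)
print(top_love)
```

In this toy data, "Artist B" would score three with the substring approach but zero with the word-level one, which shows how much the two counts can differ.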
