
### Determine Flesch-Kincaid Grade Level


From [Wikipedia](https://en.wikipedia.org/wiki/Flesch–Kincaid_readability_tests):

The Flesch–Kincaid readability tests are readability tests designed to indicate how difficult a passage in English is to understand. There are two tests, the Flesch Reading Ease, and the Flesch–Kincaid Grade Level. Although they use the same core measures (word length and sentence length), they have different weighting factors.

The results of the two tests correlate approximately inversely: a text with a comparatively high score on the Reading Ease test should have a lower score on the Grade-Level test. 


These readability tests are used extensively in the field of education. The "Flesch–Kincaid Grade Level Formula" instead presents a score as a U.S. grade level, making it easier for teachers, parents, librarians, and others to judge the readability level of various books and texts. It can also mean the number of years of education generally required to understand this text, relevant when the formula results in a number greater than 10. The grade level is calculated with the following formula:

$$0.39 \left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8 \left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59$$



The result is a number that corresponds with a U.S. grade level. The sentence, "The Australian platypus is seemingly a hybrid of a mammal and reptilian creature" is an 11.3 as it has 24 syllables and 13 words. The different weighting factors for words per sentence and syllables per word in each scoring system mean that the two schemes are not directly comparable and cannot be converted. The grade level formula emphasises sentence length over word length. By creating one-word strings with hundreds of random characters, grade levels may be attained that are hundreds of times larger than high school completion in the United States. Due to the formula's construction, the score does not have an upper bound.

The lowest grade level score in theory is −3.40, but there are few real passages in which every sentence consists of a single one-syllable word. Green Eggs and Ham by Dr. Seuss comes close, averaging 5.7 words per sentence and 1.02 syllables per word, with a grade level of −1.3. (Most of the 50 used words are monosyllabic; "anywhere", which occurs eight times, is the only exception.)
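
As a quick check of the numbers quoted above, the formula can be evaluated directly from raw counts. This is a minimal sketch; the function name `fk_grade` and its parameter names are illustrative and are not used elsewhere in this notebook.

```python
def fk_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level computed from raw counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# The platypus sentence: 13 words, 1 sentence, 24 syllables.
platypus = fk_grade(13, 1, 24)

# Theoretical minimum: every sentence is a single one-syllable word.
minimum = fk_grade(1, 1, 1)

print(round(platypus, 1), round(minimum, 2))
```

The two values round to 11.3 and -3.4, matching the figures in the quoted text.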



In [2]:
import pandas as pd
import re
import os

#Read Pre-Processed Files


In [3]:
countryDF = pd.read_pickle('../data/msongs/out/Country_Preprocessed_1109.p')
popDF = pd.read_pickle('../data/msongs/out/Pop_Preprocessed_1109.p')

In [4]:
popDF.columns

Index(['index', 'artist_id', 'tags', 'track_id', 'title', 'song_id', 'release',
       'artist_mbid', 'artist_name', 'duration', 'artist_familiarity',
       'artist_hotttnesss', 'year', 'track_7digitalid', 'shs_perf', 'shs_work',
       'lyrics_text', 'spotifyURI', 'songFeatures', 'danceability', 'energy',
       'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo', 'type', 'id', 'uri',
       'track_href', 'analysis_url', 'duration_ms', 'time_signature', 'genre',
       'country_count', 'pop_count', 'other_count'],
      dtype='object')

In [5]:
countryDF.columns

Index(['level_0', 'index', 'artist_id', 'tags', 'track_id', 'title', 'song_id',
       'release', 'artist_mbid', 'artist_name', 'duration',
       'artist_familiarity', 'artist_hotttnesss', 'year', 'track_7digitalid',
       'shs_perf', 'shs_work', 'lyrics_text', 'spotifyURI', 'songFeatures',
       'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'type', 'id', 'uri', 'track_href', 'analysis_url', 'duration_ms',
       'time_signature', 'genre', 'country_count', 'pop_count', 'other_count'],
      dtype='object')

In [6]:
countryDF.drop('level_0', axis=1, inplace=True)

In [7]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
songsDF = pd.concat([popDF, countryDF], ignore_index=True)

In [8]:
len(songsDF)

23850

In [9]:
#for testing subset
beyonce = songsDF[songsDF.artist_id == 'AR65K7A1187FB4DAA4'].reset_index(drop=True)


### Text Pre-processing:  Clean Lyrics to replace characters and remove verse tags

* drop records where the detected language is not 'en'
* remove tags, e.g. [Chorus], [Verse 1], etc.
* remove empty lines

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import feedparser
import nltk
from datetime import datetime
from time import time

plt.style.use("ggplot")

### Language Detection


In [None]:
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0

r_lang = []
for idx, row in songsDF.iterrows():
    try:
        r_lang.append(detect(row.lyrics_text))
    except Exception:  # detect() fails on empty or non-string lyrics
        r_lang.append('not found')
        print('error processing row {}'.format(idx))

In [11]:
songsDF['language'] = r_lang

In [12]:
#count number of records by language

print(set(r_lang), '\n')

for lang in set(r_lang):
    print(lang, ':', r_lang.count(lang))

{'no', 'sk', 'id', 'sv', 'sl', 'tl', 'pl', 'hr', 'ca', 'pt', 'fr', 'af', 'vi', 'cs', 'cy', 'et', 'sq', 'fi', 'ar', 'sw', 'fa', 'da', 'tr', 'not found', 'hu', 'nl', 'de', 'ro', 'it', 'lv', 'en', 'es', 'so'} 

no : 241
sk : 1
id : 33
sv : 168
sl : 1
tl : 16
pl : 16
hr : 12
ca : 25
pt : 224
fr : 239
af : 7
vi : 1
cs : 1
cy : 2
et : 3
sq : 1
fi : 169
ar : 2
sw : 9
fa : 1
da : 26
tr : 11
not found : 14
hu : 4
nl : 93
de : 368
ro : 441
it : 212
lv : 3
en : 20447
es : 1043
so : 16
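
The same per-language tally can be produced in a single pass with `collections.Counter`, which avoids calling `list.count` once per language. This is a sketch only: `sample_langs` is a small made-up list standing in for `r_lang`.

```python
from collections import Counter

# Hypothetical detection results standing in for r_lang.
sample_langs = ['en', 'es', 'en', 'not found', 'de', 'en', 'es']

language_counts = Counter(sample_langs)

# most_common() yields (language, count) pairs, largest count first.
for lang, n in language_counts.most_common():
    print(lang, ':', n)
```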


In [13]:
#filter out languages other than english
other_lang_idx = songsDF.index[songsDF['language'] != 'en']
songsDF.drop(other_lang_idx, inplace=True)

In [14]:
#### Other Missing Values

In [15]:
nullidx = songsDF.index[songsDF.valence.isnull()]
songsDF.drop(nullidx, inplace=True)

### Clean the lyrics and save them in the dataframe so that later lyrics processing can skip this cleaning step

In [16]:
lyrics_cleaned = []
for idx, row in songsDF.iterrows():
    try:
        # replace bracketed or parenthesized tags such as [Chorus] with a period
        lyrics = re.sub(r'[\(\[].*?[\)\]]', '.', row.lyrics_text)
        # remove empty lines
        lyrics = os.linesep.join([s for s in lyrics.splitlines() if s])
    except Exception:
        # store an empty string rather than reusing the previous row's lyrics
        lyrics = ''
        print('error processing row {}'.format(idx))

    lyrics_cleaned.append(lyrics)
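
The cleaning steps used in the loop above can also be wrapped in a single reusable function. A later cell calls a helper named `clean_lyrics_text` whose definition is not shown in this notebook, so the version below is only a guess at what such a helper could look like. The `replacement` parameter is hypothetical, and the sketch joins lines with `'\n'` rather than `os.linesep`.

```python
import re

# Matches a bracketed or parenthesized span such as "[Verse 1]" or "(x2)".
TAG_PATTERN = re.compile(r'[\(\[].*?[\)\]]')


def clean_lyrics_text(text, replacement='.'):
    """Replace section tags with `replacement` and drop empty lines."""
    if not isinstance(text, str):
        return ''  # e.g. NaN lyrics
    without_tags = TAG_PATTERN.sub(replacement, text)
    lines = [line for line in without_tags.splitlines() if line.strip()]
    return '\n'.join(lines)


sample = "[Verse 1]\nDiamonds used to be coal\n\n[Chorus]\nI'm not wondering why"
print(clean_lyrics_text(sample))
print(clean_lyrics_text(sample, replacement=''))
```

Replacing a tag with a period leaves a marker that a sentence tokenizer can treat as a boundary, while an empty replacement removes the tag line entirely.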

In [17]:
songsDF['lyrics_clean'] = lyrics_cleaned

In [18]:
songsDF.to_pickle('../data/msongs/out/Songs_TextPreProcessed_1b.p')

In [19]:
songsDF.head()

Unnamed: 0,index,artist_id,tags,track_id,title,song_id,release,artist_mbid,artist_name,duration,...,track_href,analysis_url,duration_ms,time_signature,genre,country_count,pop_count,other_count,language,lyrics_clean
0,0.0,AR009211187B989185,"[(pop rock,), (pop,), (synthpop,), (reggae pop,)]",TRDBNUI128F933DE6E,I'm So Sorry,SOZCBYK12AB0180B4D,The Best Of Original British Lovers Rock Volum...,9dfe78a6-6d91-454e-9b95-9d7722cbc476,Carroll Thompson,260.91057,...,https://api.spotify.com/v1/tracks/19mWMfe0fX4l...,https://api.spotify.com/v1/audio-analysis/19mW...,426520.0,4.0,pop,0.0,4.0,0.0,en,Gabrielle\nGabrielle\nI'm So Glad\nI can never...
1,1.0,AR00FOZ1187FB5C9F3,"[(synthpop,), (pop,), (pukkelpop,)]",TRWTSUW12903CD2DEA,Stay The Same,SOBGFYV12AB018DEAD,Stay The Same,92337972-f0c5-4ebd-be8c-f6b23d596ae1,autoKratz,404.97587,...,https://api.spotify.com/v1/tracks/1wZY4LPJsABn...,https://api.spotify.com/v1/audio-analysis/1wZY...,283333.0,4.0,pop,0.0,3.0,0.0,en,"We can't always stay the same, but we all keep..."
2,2.0,AR00FOZ1187FB5C9F3,"[(synthpop,), (pop,), (pukkelpop,)]",TRNZCDV128F92F00A1,Stay The Same,SOYUZXU12A58A78201,Animal,92337972-f0c5-4ebd-be8c-f6b23d596ae1,autoKratz,311.06567,...,https://api.spotify.com/v1/tracks/1gZ4TP1pQwRD...,https://api.spotify.com/v1/audio-analysis/1gZ4...,311200.0,4.0,pop,0.0,3.0,0.0,en,"We can't always stay the same, but we all keep..."
3,3.0,AR00FOZ1187FB5C9F3,"[(synthpop,), (pop,), (pukkelpop,)]",TRPGUDC12903CD2DEC,Stay The Same,SOBZAGY12AB018E2A5,Stay The Same,92337972-f0c5-4ebd-be8c-f6b23d596ae1,autoKratz,277.73342,...,https://api.spotify.com/v1/tracks/26wPSNT05P58...,https://api.spotify.com/v1/audio-analysis/26wP...,291000.0,4.0,pop,0.0,3.0,0.0,en,"We can't always stay the same, but we all keep..."
4,4.0,AR00FOZ1187FB5C9F3,"[(synthpop,), (pop,), (pukkelpop,)]",TRTMPTG128F92F00A0,Always More,SOUQQXB12A8C140687,Animal,92337972-f0c5-4ebd-be8c-f6b23d596ae1,autoKratz,255.26812,...,https://api.spotify.com/v1/tracks/1gZ4TP1pQwRD...,https://api.spotify.com/v1/audio-analysis/1gZ4...,311200.0,4.0,pop,0.0,3.0,0.0,en,"Faith I'm sure, there's something wanting but ..."


### These sections are just to view and analyze lyrics text

In [5]:
original_lyrics = beyonce.loc[3,'lyrics_text']
lyrics = beyonce.loc[3,'lyrics_text']

In [6]:
#clean the lyrics to replace characters and remove verse tags i.e. '[Verse 1]'

lyrics = re.sub(r'[\(\[].*?[\)\]]', '', lyrics)

#remove empty lines
lyrics = os.linesep.join([s for s in lyrics.splitlines() if s])
#all_words = all_words.replace('\n', ' ')

In [7]:
original_lyrics

'[Verse 1]\nDiamonds used to be coal\nLook young cause they got soul\nThat\'s why they\'re beautiful...\nAnd my heart used to be cold\n\'til your hands laid on my soul\nBaby, that\'s why you\'re beautiful...\n\n[Chorus]\nI\'m not wondering why\nThe sky\'s blue; that\'s not my business\nAll I know is I\nLook up and tell myself\n"Be patient, love. That could be us."\n\n[Verse 2]\nLovers used to make love\nAnd died just to give us\nTheir piece of the beautiful\nRemember when we made love?\nLove...\nWasn\'t it beautiful?\n\n[Chorus]\nDon\'t ask me why\nThe sky\'s blue; that\'s not my business\nAll I know is I....\nLook up and tell myself\n"Be patient, love. That could be us."\n\n[Bridge]\nDiamonds used to be coal\nLook young cause they got soul\nAnd my heart used to be cold\n\'Til your hands laid on my soul\nSomebody\'s got to stay deep in love\nThat could be us...\nThat\'s why we\'re beautiful\nThat\'s why you\'re beautiful\nOoh\nWhy, why\nThat\'s why you\'re beautiful\nThat\'s why you\'r

In [174]:
print(original_lyrics)

[Verse 1]
Diamonds used to be coal
Look young cause they got soul
That's why they're beautiful...
And my heart used to be cold
'til your hands laid on my soul
Baby, that's why you're beautiful...

[Chorus]
I'm not wondering why
The sky's blue; that's not my business
All I know is I
Look up and tell myself
"Be patient, love. That could be us."

[Verse 2]
Lovers used to make love
And died just to give us
Their piece of the beautiful
Remember when we made love?
Love...
Wasn't it beautiful?

[Chorus]
Don't ask me why
The sky's blue; that's not my business
All I know is I....
Look up and tell myself
"Be patient, love. That could be us."

[Bridge]
Diamonds used to be coal
Look young cause they got soul
And my heart used to be cold
'Til your hands laid on my soul
Somebody's got to stay deep in love
That could be us...
That's why we're beautiful
That's why you're beautiful
Ooh
Why, why
That's why you're beautiful
That's why you're beautiful
That's why you're beautiful


In [8]:
lyrics

'Diamonds used to be coal\nLook young cause they got soul\nThat\'s why they\'re beautiful...\nAnd my heart used to be cold\n\'til your hands laid on my soul\nBaby, that\'s why you\'re beautiful...\nI\'m not wondering why\nThe sky\'s blue; that\'s not my business\nAll I know is I\nLook up and tell myself\n"Be patient, love. That could be us."\nLovers used to make love\nAnd died just to give us\nTheir piece of the beautiful\nRemember when we made love?\nLove...\nWasn\'t it beautiful?\nDon\'t ask me why\nThe sky\'s blue; that\'s not my business\nAll I know is I....\nLook up and tell myself\n"Be patient, love. That could be us."\nDiamonds used to be coal\nLook young cause they got soul\nAnd my heart used to be cold\n\'Til your hands laid on my soul\nSomebody\'s got to stay deep in love\nThat could be us...\nThat\'s why we\'re beautiful\nThat\'s why you\'re beautiful\nOoh\nWhy, why\nThat\'s why you\'re beautiful\nThat\'s why you\'re beautiful\nThat\'s why you\'re beautiful'

In [9]:
print(lyrics)

Diamonds used to be coal
Look young cause they got soul
That's why they're beautiful...
And my heart used to be cold
'til your hands laid on my soul
Baby, that's why you're beautiful...
I'm not wondering why
The sky's blue; that's not my business
All I know is I
Look up and tell myself
"Be patient, love. That could be us."
Lovers used to make love
And died just to give us
Their piece of the beautiful
Remember when we made love?
Love...
Wasn't it beautiful?
Don't ask me why
The sky's blue; that's not my business
All I know is I....
Look up and tell myself
"Be patient, love. That could be us."
Diamonds used to be coal
Look young cause they got soul
And my heart used to be cold
'Til your hands laid on my soul
Somebody's got to stay deep in love
That could be us...
That's why we're beautiful
That's why you're beautiful
Ooh
Why, why
That's why you're beautiful
That's why you're beautiful
That's why you're beautiful


### Compute the Flesch-Kincaid Grade Level


In [20]:
import spacy 
import nltk
from nltk.tokenize import sent_tokenize
from textstat.textstat import textstatistics, easy_word_set, legacy_round
from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer, WhitespaceTokenizer


In [21]:
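import string
import nltk
from nltk.corpus import cmudict

d = cmudict.dict()

def nsyl(text):
    '''count syllables with the CMU Pronouncing Dictionary; unknown words count as 1'''
    numsyl = 0
    for word in WhitespaceTokenizer().tokenize(text):
        # strip surrounding punctuation so tokens like "beautiful..." can be looked up
        key = word.lower().strip(string.punctuation)
        try:
            # vowel phonemes end in a stress digit (0, 1 or 2); one per syllable
            syllen = [len([y for y in x if y[-1].isdigit()]) for x in d[key]]
        except KeyError:
            syllen = [1]

        # use the first listed pronunciation
        numsyl += syllen[0]

    return numsyl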

def break_sentences_nltk(text):
    # tokenize the argument, not the global `lyrics` variable
    return sent_tokenize(text)

def word_count_nltk(text):
    sentences = break_sentences_nltk(text)
    words = 0
    for sentence in sentences:
        #words += len([token for token in nltk.word_tokenize(sentence)])
        words += len([token for token in WhitespaceTokenizer().tokenize(sentence)])
    return words

def sentence_count_nltk(text):
    sentences = break_sentences_nltk(text)
    out_sentences = list(sentences)
    return len(out_sentences)

def avg_syllables_per_word_nltk(text):
    syllable = nsyl(text)
    #print('syllable', syllable)
    words = word_count_nltk(text)
    #print('words', words)
    ASPW = float(syllable) / float(words)
    #print('ASPW', ASPW)
    return legacy_round(ASPW, 2)
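
The syllable count in `nsyl` relies on the convention that every vowel phoneme in a CMUdict pronunciation ends in a stress digit. The sketch below illustrates that idea without loading the corpus: `SAMPLE_PRONUNCIATIONS` is a small hand-written dictionary in the CMUdict style, and `count_syllables` is a hypothetical helper, not part of NLTK.

```python
# Hand-written entries in the CMUdict style (ARPAbet phonemes); illustrative only.
SAMPLE_PRONUNCIATIONS = {
    'diamonds': [['D', 'AY1', 'M', 'AH0', 'N', 'D', 'Z']],
    'beautiful': [['B', 'Y', 'UW1', 'T', 'AH0', 'F', 'AH0', 'L']],
    'coal': [['K', 'OW1', 'L']],
}


def count_syllables(word, pronunciations=SAMPLE_PRONUNCIATIONS):
    """Count phonemes ending in a stress digit in the first pronunciation."""
    entries = pronunciations.get(word.lower())
    if not entries:
        return 1  # same fallback as nsyl: unknown words count as one syllable
    return sum(1 for phoneme in entries[0] if phoneme[-1].isdigit())


for w in ['Diamonds', 'beautiful', 'coal', 'ooh']:
    print(w, count_syllables(w))
```

Words missing from the dictionary fall back to one syllable, so slang and interjections in lyrics tend to be undercounted.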

In [22]:
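# load the spaCy model once; reloading it on every call is very slow
# (the 'en' shortcut is spaCy 2.x only; spaCy 3 needs a full name such as 'en_core_web_sm')
nlp = spacy.load('en')

def break_sentences(text):
    doc = nlp(text)
    # return a list so the sentences can be iterated more than once
    return list(doc.sents)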

def word_count(text):
    sentences = break_sentences(text)
    words = 0
    for sentence in sentences:
        words += len([token for token in sentence])
    return words

def sentence_count(text):
    sentences = break_sentences(text)
    out_sentences = list(sentences)
    return len(out_sentences)

def avg_sentence_length(text):
    words = word_count(text)
    sentences = sentence_count(text)
    average_sentence_length = float(words / sentences)
    return average_sentence_length

def syllables_count(word):
    return textstatistics().syllable_count(word)

def avg_syllables_per_word(text):
    syllable = syllables_count(text)
    words = word_count(text)
    ASPW = float(syllable) / float(words)
    return legacy_round(ASPW, 2)


In [23]:
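def flesch_kincaid_reading_grade(lyrics):
    ''' this function computes the Flesch-Kincaid grade level using spaCy
        for sentence and word counts and textstat for syllable counts.'''
    try:
        # the constant 15.59 was missing here; it is part of the formula above
        FKRG = (0.39 * (word_count(lyrics) / sentence_count(lyrics))
                + 11.8 * avg_syllables_per_word(lyrics)) - 15.59
    except Exception:
        # e.g. empty lyrics give a sentence count of zero
        FKRG = 0.0

    return legacy_round(FKRG, 2)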

In [61]:
def flesch_kincaid_reading_grade_nltk2(lyrics):
    ''' this function computes the Flesch-Kincaid grade level using spaCy
        for sentence splitting and NLTK whitespace tokens for word counts.'''
    FKRG = 0.0
    sent_count = 0.0
    try:
        sent_count = sentence_count(lyrics)
        FKRG = (float(0.39 * (word_count_nltk(lyrics)/sent_count ))  + \
               float(11.8 * avg_syllables_per_word_nltk(lyrics))) - 15.59
    except:
        FKRG = 0.0
                    
                     
    return legacy_round(FKRG,2), sent_count

In [62]:
def flesch_kincaid_reading_grade_nltk(lyrics):
    ''' this function computes the Flesch-Kincaid grade level using the
        NLTK sentence tokenizer for sentence splitting.'''

    FKRG = 0.0
    sent_count = 0  # defined up front so the return works if the try block fails
    try:
        sent_count = sentence_count_nltk(lyrics)
        FKRG = (float(0.39 * (word_count_nltk(lyrics)/sent_count ))  + \
               float(11.8 * avg_syllables_per_word_nltk(lyrics))) - 15.59
    except:
        FKRG = 0.0
                    
                     
    return legacy_round(FKRG,2), sent_count

In [25]:
lyrics

'.\nI jumped in the river and what did I see?\nBlack-eyed angels swam with me\nA moon full of stars and astral cars\nAnd all the figures I used to see\nAll my lovers were there with me\nAll my past and futures\nAnd we all went to heaven in a little row boat\nThere was nothing to fear and nothing to doubt\n.\nI jumped into the river\nBlack-eyed angels swam with me\nA moon full of stars and astral cars\nAnd all the figures I used to see\nAll my lovers were there with me\nAll my past and futures\nAnd we all went to heaven in a little row boat\nThere was nothing to fear and nothing to doubt\nThere was nothing to fear and nothing to doubt\nThere was nothing to fear and nothing to doubt'

In [251]:
lyrics = beyonce.loc[3,'lyrics_text']
lyrics = clean_lyrics_text(lyrics)

In [46]:
lyrics = songsDF.loc[4,'lyrics_clean']

In [47]:
#lyrics = beyonce.loc[3,'lyrics_text']
#FKRG = flesch_kincaid_reading_grade(lyrics)
FKRG_nltk, count = flesch_kincaid_reading_grade_nltk(lyrics)
FKRG_nltk, count

(62.96, 1)

In [48]:
FKRG_nltk, count = flesch_kincaid_reading_grade_nltk2(lyrics)
FKRG_nltk, count

(1.81, 18)

In [161]:

tokens = nltk.word_tokenize(lyrics)
 
    


In [163]:
len(tokens)

218

In [172]:
#spacy break sentences
spacy_sent = break_sentences(lyrics)

count = 0
for sent in spacy_sent:
    count += 1
    print(count, sent)
    
print('total sentences', count)

1 Diamonds used to be coal

2 Look young cause they got soul

3 That's why they're beautiful...

4 And my heart used to be cold

5 'til your hands laid on my soul
Baby
6 , that's why you're beautiful...

7 I'm not wondering why
The sky's blue; that's not my business

8 All I know is I
Look up and tell myself
"Be patient, love.
9 That could be us."

10 Lovers used to make love
And died just to give us
Their piece of the beautiful
Remember when we made love?

11 Love...

12 Wasn't it beautiful?

13 Don't ask me why
The sky's blue; that's not my business

14 All I know is I....

15 Look up and tell myself
"Be patient, love.
16 That could be us."

17 Diamonds used to be coal

18 Look young cause they got soul

19 And my heart used to be cold
'
20 Til your hands laid on my soul

21 Somebody's got to stay deep in love

22 That could be us...

23 That's why we're beautiful

24 That's why you're beautiful

25 Ooh

26 Why, why
That's why you're beautiful

27 That's why you're beautiful

28 That

In [166]:
#compare with nltk sent_tokenizer
sent_tokenize_list = sent_tokenize(lyrics)


In [167]:
len(sent_tokenize_list)

8

In [170]:
count = 0
for sent in sent_tokenize_list:
    #print(count, sent)
    count += 1
    print(count, sent)

1 Diamonds used to be coal
Look young cause they got soul
That's why they're beautiful...
And my heart used to be cold
'til your hands laid on my soul
Baby, that's why you're beautiful...
2 I'm not wondering why
The sky's blue; that's not my business
All I know is I
Look up and tell myself
"Be patient, love.
3 That could be us."
4 Lovers used to make love
And died just to give us
Their piece of the beautiful
Remember when we made love?
5 Love...
Wasn't it beautiful?
6 Don't ask me why
The sky's blue; that's not my business
All I know is I....
Look up and tell myself
"Be patient, love.
7 That could be us."
8 Diamonds used to be coal
Look young cause they got soul
And my heart used to be cold
'Til your hands laid on my soul
Somebody's got to stay deep in love
That could be us...
That's why we're beautiful
That's why you're beautiful
Ooh
Why, why
That's why you're beautiful
That's why you're beautiful
That's why you're beautiful
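
The two sentence tokenizers disagree sharply on the same lyrics, and the grade level is very sensitive to that count. The sketch below holds the word count at 218 (the value `word_count_nltk` reports for this song) and compares 8 sentences, the NLTK count, with 28, the number of sentences visible in the spaCy listing. The figure of 1.2 syllables per word is an assumption made purely for illustration; it is not measured anywhere in this notebook.

```python
def fk_grade(total_words, total_sentences, syllables_per_word):
    """Flesch-Kincaid grade level from counts and an average syllable rate."""
    return 0.39 * (total_words / total_sentences) + 11.8 * syllables_per_word - 15.59


WORDS = 218
ASSUMED_SYLLABLES_PER_WORD = 1.2  # assumption, for illustration only

grade_nltk = fk_grade(WORDS, 8, ASSUMED_SYLLABLES_PER_WORD)    # 8 sentences
grade_spacy = fk_grade(WORDS, 28, ASSUMED_SYLLABLES_PER_WORD)  # 28 sentences

print(round(grade_nltk, 2), round(grade_spacy, 2))
print('difference:', round(grade_nltk - grade_spacy, 2))
```

Because the syllable term is identical in both calls, the gap of roughly 7.6 grade levels comes entirely from the sentence count. Lyrics rarely contain sentence-ending punctuation, so the choice of sentence splitter largely determines the score.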


In [36]:
print(sent_tokenize_list[0])

Diamonds used to be coal
Look young cause they got soul
That's why they're beautiful...
And my heart used to be cold
'til your hands laid on my soul
Baby, that's why you're beautiful...


In [178]:
#compare various tokenizers
from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer, WhitespaceTokenizer

sample_treebank_ = TreebankWordTokenizer().tokenize(lyrics)
sample_wordpunct_ = WordPunctTokenizer().tokenize(lyrics)
sample_wspace_ = WhitespaceTokenizer().tokenize(lyrics)
sample_word_ = nltk.word_tokenize(lyrics)

### Because the Flesch-Kincaid grade depends on syllable and word counts, the whitespace tokenizer is used: it keeps each contraction as a single token, so fragments such as "'s" and "n't" are not counted as extra words and do not distort the grade.
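
The effect on the word count can be seen on a single lyric line without NLTK. In the sketch below, `str.split()` stands in for the whitespace tokenizer and the regular expression `\w+|[^\w\s]+` stands in for a punctuation-splitting tokenizer. Both are stand-ins chosen for illustration, not the notebook's actual tokenizer calls.

```python
import re

line = "That's why they're beautiful..."

# Whitespace splitting keeps contractions and trailing punctuation attached.
whitespace_tokens = line.split()

# Splitting on punctuation breaks each contraction into three tokens.
punct_split_tokens = re.findall(r"\w+|[^\w\s]+", line)

print(len(whitespace_tokens), whitespace_tokens)
print(len(punct_split_tokens), punct_split_tokens)
```

The same four-word line yields nine tokens when punctuation is split off, which would more than double the word count fed into the formula.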

In [179]:
print(sample_treebank_)
print('Treebank Tokenizer found %d tokens\n' % len(sample_treebank_))

print(sample_wordpunct_)
print('WordPunct Tokenizer found %d tokens\n' % len(sample_wordpunct_))

print(sample_wspace_)
print('Whitespace Tokenizer found %d tokens\n' % len(sample_wspace_))

print(sample_word_)
print('nltk.word_tokenize Tokenizer found %d tokens\n' % len(sample_word_))

['Diamonds', 'used', 'to', 'be', 'coal', 'Look', 'young', 'cause', 'they', 'got', 'soul', 'That', "'s", 'why', 'they', "'re", 'beautiful', '...', 'And', 'my', 'heart', 'used', 'to', 'be', 'cold', "'til", 'your', 'hands', 'laid', 'on', 'my', 'soul', 'Baby', ',', 'that', "'s", 'why', 'you', "'re", 'beautiful', '...', 'I', "'m", 'not', 'wondering', 'why', 'The', 'sky', "'s", 'blue', ';', 'that', "'s", 'not', 'my', 'business', 'All', 'I', 'know', 'is', 'I', 'Look', 'up', 'and', 'tell', 'myself', "''", 'Be', 'patient', ',', 'love.', 'That', 'could', 'be', 'us.', "''", 'Lovers', 'used', 'to', 'make', 'love', 'And', 'died', 'just', 'to', 'give', 'us', 'Their', 'piece', 'of', 'the', 'beautiful', 'Remember', 'when', 'we', 'made', 'love', '?', 'Love', '...', 'Was', "n't", 'it', 'beautiful', '?', 'Do', "n't", 'ask', 'me', 'why', 'The', 'sky', "'s", 'blue', ';', 'that', "'s", 'not', 'my', 'business', 'All', 'I', 'know', 'is', 'I', '...', '.', 'Look', 'up', 'and', 'tell', 'myself', "''", 'Be', 'pat

In [175]:
words = word_count_nltk(lyrics)

<class 'list'>


In [176]:
words

218

In [177]:
words = word_count(lyrics)
print(words)

253


In [40]:
#beyonce.reset_index(inplace=True, drop=True)

In [120]:
#lyrics = beyonce.loc[3,'lyrics_text']
#FKRG = flesch_kincaid_reading_grade(lyrics)
FKRG_nltk = flesch_kincaid_reading_grade_nltk(lyrics)


<class 'list'>
218
8
0.7
<class 'list'>
<class 'list'>


In [121]:
print(FKRG, FKRG_nltk)

11.78 0.0


In [89]:
test = 'lover'
print(syllables_count(test))


1


In [165]:
print(syllables_in_word('actually'))

actually
6


### Create the FKRG1 and FKRG2 columns by computing the Flesch-Kincaid grade level

In [57]:
songsDF.reset_index(drop=True, inplace=True)

In [63]:
def populate_FKRG(df, start=0):
    ''' compute both grade variants for every row from position `start` onward
        (the original cell hard-coded 20000 as the starting row).'''
    counter = 0
    for row in df.iloc[start:].itertuples():
        i = row.Index
        lyrics = row.lyrics_clean
        FKRG1, sent1 = flesch_kincaid_reading_grade_nltk(lyrics) #using sent_tokenize
        FKRG2, sent2 = flesch_kincaid_reading_grade_nltk2(lyrics) #using spacy

        # DataFrame.set_value was removed in pandas 1.0; .loc also creates missing columns
        df.loc[i, 'FKRG1'] = FKRG1
        df.loc[i, 'fkrg_sent1'] = sent1
        df.loc[i, 'FKRG2'] = FKRG2
        df.loc[i, 'fkrg_sent2'] = sent2

        counter += 1
        if counter % 500 == 0:
            print('processing row ', i, row.artist_name)

    return df

In [64]:
songsDF = populate_FKRG(songsDF) #start 430

In [70]:
songsDF.to_pickle('../data/msongs/out/Songs_NLP_allFeatures_1b.p')
#dfPop.to_pickle('../data/msongs/out/Pop_NLP_allFeatures.p')