# Feature Engineering

Features engineered in this notebook from the lyrics column:

* Number of sentences (n_sentences)

* Word count (word_count)

* Character count (character_count)

* Number of syllables (n_syllables)

* Unique word count (unique_word_count)

* Number of long words (n_long_words)

* Number of monosyllable words (n_monosyllable_words)

* Number of polysyllable words (n_polysyllable_words)

* Vader compound score/Valence (vader_compound)

* Vader negative score (vader_neg)

* Vader neutral score (vader_neu)

* Vader positive score (vader_pos)

* Objectivity score (objectivity_score)

* Positive versus negative score (pos_vs_neg)


The lyrics column that will be used for sentiment analysis and NLP will have to be processed from its RAW state to eligble version to be processed by textacy and other NLP packages.


These are some of the cleaning steps which I included in most of my functions:


* Removing elements such as [Chorus] or [Intro] added by the Lyrics Genius Website.

* Tokenization

* Conversion to lowercase

* Lemmatization

* Parsing

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Sentiment-Features-with-Textacy" data-toc-modified-id="Sentiment-Features-with-Textacy-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Sentiment Features with Textacy</a></span><ul class="toc-item"><li><span><a href="#Functions" data-toc-modified-id="Functions-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Functions</a></span></li></ul></li><li><span><a href="#Lexical-Diversity-Feature-Engineering" data-toc-modified-id="Lexical-Diversity-Feature-Engineering-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Lexical Diversity Feature Engineering</a></span><ul class="toc-item"><li><span><a href="#Functions" data-toc-modified-id="Functions-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Functions</a></span></li></ul></li></ul></div>

## Imports

**Basics**

In [1]:
import numpy as np
import pandas as pd
import re

**Sentiment Feature Creation**


! Only run textacy amd vaderSentiment when Kernel is set to **textacy** !

In [2]:
import textacy

In [3]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction import DictVectorizer

**Lexical Diversity**

In command line/bash:

```pip install lexical-diversity```

In [2]:
from lexical_diversity import lex_div as ld

## Sentiment Features with Textacy

### Functions

In [7]:
en_nlp = textacy.load_spacy_lang('en_core_web_sm')

In [5]:
def textacy_stats(column):
    '''
    
    '''
    en_nlp = textacy.load_spacy_lang('en_core_web_sm')

    #Lists of textacy statistics which can be added to dataframe as columns.
    n_sentences = []
    word_count = []
    character_count = []
    n_syllables = []
    unique_word_count = []
    n_long_words = []
    n_monosyllable_words = []
    n_polysyllable_words = []

    automated_readability_index = []
    coleman_liau_index = []
    flesch_kincaid_grade_level = []
    flesch_reading_ease = []
    gulpease_index = []
    gunning_fog_index = []
    lix = []
    smog_index = []
    wiener_sachtextformel = []

    for item in column:
        # Cleans the lyrics from unwanted elements.
        item = re.sub("[\[].*?[\]]", "", item)
        item = item.replace('\n', ' ')
        item = item.replace('\\', '')
        item = re.sub("[\(\)]", "", item)
        item = item.replace('  ', ' ')

        #Produces tokenized lyrics and produces statistics in the ts object.
        parsed = textacy.make_spacy_doc(item, lang=en_nlp)
        parsed._.to_tokenized_text()
        ts = textacy.TextStats(parsed)

        n_sentences.append(ts.basic_counts['n_sents'])
        word_count.append(ts.basic_counts['n_words'])
        character_count.append(ts.basic_counts['n_chars'])
        n_syllables.append(ts.basic_counts['n_syllables'])
        unique_word_count.append(ts.basic_counts['n_unique_words'])
        n_long_words.append(ts.basic_counts['n_long_words'])
        n_monosyllable_words.append(ts.basic_counts['n_monosyllable_words'])
        n_polysyllable_words.append(ts.basic_counts['n_polysyllable_words'])

        #automated_readability_index.append(ts.readability_stats['automated_readability_index'])
        #coleman_liau_index.append(ts.readability_stats['coleman_liau_index'])
        #flesch_kincaid_grade_level.append(ts.readability_stats['flesch_kincaid_grade_level'])
        #flesch_reading_ease.append(ts.readability_stats['flesch_reading_ease'])
        #gulpease_index.append(ts.readability_stats['gulpease_index'])
        #gunning_fog_index.append(ts.readability_stats['gunning_fog_index'])
        #lix.append(ts.readability_stats['lix'])
        #smog_index.append(ts.readability_stats['smog_index'])
        #wiener_sachtextformel.append(ts.readability_stats['wiener_sachtextformel'])

    return n_sentences, word_count, character_count, n_syllables, unique_word_count, n_long_words, n_monosyllable_words, n_polysyllable_words  #, automated_readability_index, coleman_liau_index, flesch_kincaid_grade_level, flesch_reading_ease, gulpease_index, gunning_fog_index, lix, smog_index, wiener_sachtextformel

In [6]:
def processing(column):
    """
    This function accepts a text column as an argument,
    cleans each cell's text from its unwanted elements (keeping punctuation and apostrophes),
    converts it to lowercase and appends each string text to a list.
    The function returns this list.
    """
    lyrics = []

    for x in column:
        x = re.sub("[\[].*?[\]]", "", x)
        x = x.replace('\n', ' ')
        x = x.replace('\\', '')
        x = re.sub("[\(\)]", "", x)
        #x = re.sub("[\!\.\,\?]", " ", x)
        #x = re.sub("[\']", "", x)
        x = x.replace('  ', ' ')
        x = x.lower()

        lyrics.append(x)

    return lyrics

Conducted in the ```textacy``` environment. Please set kernel to ```textacy```.

In [7]:
data = pd.read_csv("../../../../../../project-capstone/personal-github/Resources/artist_predictor_RAW.csv")

In [8]:
data.head()

Unnamed: 0,track_name,artist_name,release_year,spotify_uri,lyrics,genre,track_id,popularity,acousticness,danceability,...,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,Canal St.,A$AP Rocky,2015-05-26,spotify:track:0rBMP6VVGRgwnzZCLpijyl,"[Produced by Hector Delgado, Frans Mernick, an...",Hip-Hop,0rBMP6VVGRgwnzZCLpijyl,68.0,0.308,0.696,...,0.583,0.0,D#,0.0964,-9.351,Minor,0.301,137.079,4/4,0.524
1,Brotha Man,A$AP Rocky,2018-05-25,,"[Verse 1: A$AP Rocky]\nYoung man, brotha broth...",Hip-Hop,6MmtDonkpvSoRx9ACwrGDu,60.0,0.52,0.573,...,0.398,0.0,F,0.113,-9.372,Minor,0.114,124.073,5/4,0.436
2,LVL,A$AP Rocky,2013-01-15,spotify:track:4dz4nfY5LeFUR99uspA5e6,[Produced by Clams Casino]\n\n[Intro: A$AP Roc...,Hip-Hop,787rCZF9i4L1cXGMkdyIk4,59.0,0.18,0.595,...,0.427,1.3e-05,F,0.121,-6.764,Minor,0.054,120.085,4/4,0.0994
3,Holy Ghost,A$AP Rocky,2015-05-26,spotify:track:6AHNkRbVzkh95xilnYzDT7,"[Produced by Danger Mouse, additional producti...",Hip-Hop,6AHNkRbVzkh95xilnYzDT7,59.0,0.127,0.545,...,0.779,0.0,A#,0.349,-6.513,Minor,0.19,75.844,4/4,0.325
4,Fashion Killa,A$AP Rocky,2013-01-15,spotify:track:40H5libEZWrbkc8HTlXGbt,"[Chorus]\nHer pistol go…\n(Doot-doot-doot, ban...",Hip-Hop,0O3TAouZE4vL9dM5SyxgvH,65.0,0.293,0.802,...,0.82,0.0,B,0.515,-5.188,Major,0.154,139.972,4/4,0.81


In [9]:
n_sentences, word_count, character_count, n_syllables, unique_word_count, n_long_words, n_monosyllable_words, n_polysyllable_words = textacy_stats(data.lyrics)

In [10]:
data['n_sentences'] = n_sentences
data['word_count'] = word_count
data['character_count'] = character_count
data['n_syllables'] = n_syllables
data['unique_word_count'] = unique_word_count
data['n_long_words'] = n_long_words
data['n_monosyllable_words'] = n_monosyllable_words
data['n_polysyllable_words'] = n_polysyllable_words
data.head()

Unnamed: 0,track_name,artist_name,release_year,spotify_uri,lyrics,genre,track_id,popularity,acousticness,danceability,...,time_signature,valence,n_sentences,word_count,character_count,n_syllables,unique_word_count,n_long_words,n_monosyllable_words,n_polysyllable_words
0,Canal St.,A$AP Rocky,2015-05-26,spotify:track:0rBMP6VVGRgwnzZCLpijyl,"[Produced by Hector Delgado, Frans Mernick, an...",Hip-Hop,0rBMP6VVGRgwnzZCLpijyl,68.0,0.308,0.696,...,4/4,0.524,54,628,2354,721,249,39,542,7
1,Brotha Man,A$AP Rocky,2018-05-25,,"[Verse 1: A$AP Rocky]\nYoung man, brotha broth...",Hip-Hop,6MmtDonkpvSoRx9ACwrGDu,60.0,0.52,0.573,...,5/4,0.436,54,519,2039,595,232,44,454,11
2,LVL,A$AP Rocky,2013-01-15,spotify:track:4dz4nfY5LeFUR99uspA5e6,[Produced by Clams Casino]\n\n[Intro: A$AP Roc...,Hip-Hop,787rCZF9i4L1cXGMkdyIk4,59.0,0.18,0.595,...,4/4,0.0994,42,397,1497,465,188,23,341,7
3,Holy Ghost,A$AP Rocky,2015-05-26,spotify:track:6AHNkRbVzkh95xilnYzDT7,"[Produced by Danger Mouse, additional producti...",Hip-Hop,6AHNkRbVzkh95xilnYzDT7,59.0,0.127,0.545,...,4/4,0.325,32,430,1666,507,215,28,364,10
4,Fashion Killa,A$AP Rocky,2013-01-15,spotify:track:40H5libEZWrbkc8HTlXGbt,"[Chorus]\nHer pistol go…\n(Doot-doot-doot, ban...",Hip-Hop,0O3TAouZE4vL9dM5SyxgvH,65.0,0.293,0.802,...,4/4,0.81,147,745,2867,881,201,54,624,13


**VADER**

In [11]:
analyzer = SentimentIntensityAnalyzer()

In [12]:
data['lyrics_processed'] = processing(data.lyrics)

In [13]:
vader_scores = data.lyrics_processed.map(analyzer.polarity_scores)

In [14]:
vader_scores.head()

0    {'neg': 0.11, 'neu': 0.792, 'pos': 0.098, 'com...
1    {'neg': 0.091, 'neu': 0.747, 'pos': 0.162, 'co...
2    {'neg': 0.183, 'neu': 0.701, 'pos': 0.116, 'co...
3    {'neg': 0.128, 'neu': 0.778, 'pos': 0.094, 'co...
4    {'neg': 0.072, 'neu': 0.807, 'pos': 0.121, 'co...
Name: lyrics_processed, dtype: object

In [15]:
dvec = DictVectorizer()

vader_scores = dvec.fit_transform(vader_scores)
vader_scores

<1688x4 sparse matrix of type '<class 'numpy.float64'>'
	with 6752 stored elements in Compressed Sparse Row format>

In [16]:
dvec.feature_names_

['compound', 'neg', 'neu', 'pos']

In [17]:
for i, col in enumerate(dvec.feature_names_):
    data['vader_{}'.format(col)] = vader_scores[:, i].toarray().ravel()

```
objectivity = 1. - (pos_score + neg_score)
pos_vs_neg = pos_score - neg_score```

In [18]:
data['objectivity_score'] = 1.0 -(data.vader_pos - data.vader_neg)

In [19]:
data['pos_vs_neg'] = data.vader_pos - data.vader_neg

In [20]:
data.to_csv('capstone_data_with_sentiment.csv', index=False)

## Lexical Diversity Feature Engineering

Change back into Python 3 environment/Kernel and run neccessary imports from **Imports and Installations** section.

### Functions

In [3]:
def TTR_function(column):
    """
    This is a function that is able to produce the type-token-ratio.
    
    It takes a whole column of a dataframe as an argument (column of lyrics, text entries).
    1. It cleans the lyrics into a lowercase text without punctuation and then tokenizes this text
    WITHOUT lemmatization.
    2. The .ttr() function merely divides the number of unique words by the number of all words in the text(x).
    
    Returns a list of TTR values. One entry per column entry.
    """
    lyrics = []

    for x in column:
        x = re.sub("[\[].*?[\]]", "", x)
        x = x.replace('\n', ' ')
        x = x.replace('\\', '')
        x = re.sub("[\(\)]", "", x)
        x = re.sub("[\!\.\,\?]", " ", x)
        #x = re.sub("[\']", "", x)
        x = x.replace('  ', ' ')
        x = x.lower()

        x = ld.tokenize(x)

        #x = ld.flemmatize(x)

        ttr = ld.ttr(x)

        lyrics.append(ttr)

    return lyrics

In [4]:
def MTLD_function(column):
    """
    This is a function that is able to produce the 'Measure of Textual Lexical Diversity'. 
    1. It cleans the lyrics into a lowercase text without punctuation and then tokenize-lemmatizes it 
    with the lexical-diversity function .flemmatize().
    2. The lexical-diversity's .mtld() function calculates MTLD based on McCarthy and Jarvis (2010).
    """
    lyrics = []

    for x in column:
        x = re.sub("[\[].*?[\]]", "", x)
        x = x.replace('\n', ' ')
        x = x.replace('\\', '')
        x = re.sub("[\(\)]", "", x)
        x = re.sub("[\!\.\,\?]", " ", x)
        #x = re.sub("[\']", "", x)
        x = x.replace('  ', ' ')
        x = x.lower()

        x = ld.flemmatize(x)

        MTLD = ld.mtld(x)

        lyrics.append(MTLD)

    return lyrics

In [5]:
data = pd.read_csv(
    '../../../../../../project-capstone/personal-github/Resources/capstone_data_with_sentiment.csv'
)

In [6]:
data.head()

Unnamed: 0,track_name,artist_name,release_year,spotify_uri,lyrics,genre,track_id,popularity,acousticness,danceability,...,n_long_words,n_monosyllable_words,n_polysyllable_words,lyrics_processed,vader_compound,vader_neg,vader_neu,vader_pos,objectivity_score,pos_vs_neg
0,Canal St.,A$AP Rocky,2015-05-26,spotify:track:0rBMP6VVGRgwnzZCLpijyl,"[Produced by Hector Delgado, Frans Mernick, an...",Hip-Hop,0rBMP6VVGRgwnzZCLpijyl,68.0,0.308,0.696,...,39,542,7,"yeah, live through the strugglin', life's a ...",-0.7965,0.11,0.792,0.098,1.012,-0.012
1,Brotha Man,A$AP Rocky,2018-05-25,,"[Verse 1: A$AP Rocky]\nYoung man, brotha broth...",Hip-Hop,6MmtDonkpvSoRx9ACwrGDu,60.0,0.52,0.573,...,44,454,11,"young man, brotha brotha, you gotta fight for...",0.99,0.091,0.747,0.162,0.929,0.071
2,LVL,A$AP Rocky,2013-01-15,spotify:track:4dz4nfY5LeFUR99uspA5e6,[Produced by Clams Casino]\n\n[Intro: A$AP Roc...,Hip-Hop,787rCZF9i4L1cXGMkdyIk4,59.0,0.18,0.595,...,23,341,7,"clams casino, nigga a$ap mr. pistol popper,...",-0.9856,0.183,0.701,0.116,1.067,-0.067
3,Holy Ghost,A$AP Rocky,2015-05-26,spotify:track:6AHNkRbVzkh95xilnYzDT7,"[Produced by Danger Mouse, additional producti...",Hip-Hop,6AHNkRbVzkh95xilnYzDT7,59.0,0.127,0.545,...,28,364,10,"ay, i have a message from the most high that...",-0.9406,0.128,0.778,0.094,1.034,-0.034
4,Fashion Killa,A$AP Rocky,2013-01-15,spotify:track:40H5libEZWrbkc8HTlXGbt,"[Chorus]\nHer pistol go…\n(Doot-doot-doot, ban...",Hip-Hop,0O3TAouZE4vL9dM5SyxgvH,65.0,0.293,0.802,...,54,624,13,"her pistol go… doot-doot-doot, bang-bang, boo...",0.9722,0.072,0.807,0.121,0.951,0.049


In [7]:
data.columns

Index(['track_name', 'artist_name', 'release_year', 'spotify_uri', 'lyrics',
       'genre', 'track_id', 'popularity', 'acousticness', 'danceability',
       'duration_ms', 'energy', 'instrumentalness', 'key', 'liveness',
       'loudness', 'mode', 'speechiness', 'tempo', 'time_signature', 'valence',
       'n_sentences', 'word_count', 'character_count', 'n_syllables',
       'unique_word_count', 'n_long_words', 'n_monosyllable_words',
       'n_polysyllable_words', 'lyrics_processed', 'vader_compound',
       'vader_neg', 'vader_neu', 'vader_pos', 'objectivity_score',
       'pos_vs_neg'],
      dtype='object')

**Type-Token-Ratio (TTR)**

In [8]:
data['TTR'] = TTR_function(data.lyrics)

**Measure of lexical textual diversity (MTLD)**

In [9]:
data['MTLD'] = MTLD_function(data.lyrics)

**Saving dataframe**

In [10]:
data.to_csv('capstone_feature_engineered.csv', index=False)