# Analysis

So what did I mean when saying I'd explore the differences between speeches and songs? That's a good question, especially since each corpus was authored by a different group. 

First, we'll start off with some simple features of the corpora such as vocabulary, length, word frequency (in-document) and any more that come to mind as I progress. Following that, we'll dive into sentiment analysis and topic modeling, see if songs are more subjective than speeches or if "great speeches" are often positive. Remember this is a learning project, the data being explored may be a bit silly but through it we can learn different NLP techniques and see the problems associated with real life data.

In [1]:
# !pip install textblob
import numpy as np
import pandas as pd
# import logging
import re

In [2]:
comm_df = pd.read_excel('communication_data_clean.xlsx')
bow_df = pd.read_excel('communication_data_bow.xlsx')

In [3]:
comm_df['Chars'] = comm_df['Corpus'].apply(len)
comm_df['Words'] = comm_df['Corpus'].apply(lambda c: len(c.split()))
comm_df['Unique Words'] = bow_df['Corpus'].apply(lambda c: len(set(c.split())))
comm_df

Unnamed: 0,Originator,Title,Corpus,Link,Text Type,Chars,Words,Unique Words
0,Chimamanda Ngozi Adichie,The Danger of a Single Story,i am a storyteller. and i would like to tell y...,https://jamesclear.com/great-speeches/the-dang...,speech,15829,2867,724
1,Jeff Bezos,What Matters More Than Your Talents,"as a kid, i spent my summers with my grandpare...",https://jamesclear.com/great-speeches/what-mat...,speech,7275,1380,424
2,John C. Bogle,Enough,here how i recall the wonderful story that set...,https://jamesclear.com/great-speeches/enough-b...,speech,8362,1448,490
3,Brené Brown,The Anatomy of Trust,"oh, it just feels like an incredible understat...",https://jamesclear.com/great-speeches/the-anat...,speech,18717,3657,599
4,John Cleese,Creativity in Management,"you know, when video arts asked me if i would ...",https://jamesclear.com/great-speeches/creativi...,speech,27557,5015,1044
5,William Deresiewicz,Solitude and Leadership,my title must seem like a contradiction. what ...,https://jamesclear.com/great-speeches/solitude...,speech,32611,5857,1231
6,Richard Feynman,Seeking New Laws,what i want to talk to you about tonight is st...,https://jamesclear.com/great-speeches/seeking-...,speech,32609,5866,856
7,Neil Gaiman,Make Good Art,i never really expected to find myself giving ...,https://jamesclear.com/great-speeches/make-goo...,speech,15470,2970,626
8,John W. Gardner,Personal Renewal,i am going to talk about self renewal. one of ...,https://jamesclear.com/great-speeches/personal...,speech,22612,4171,1028
9,Elizabeth Gilbert,Your Elusive Creative Genius,i am a writer. writing books is my profession ...,https://jamesclear.com/great-speeches/your-elu...,speech,18636,3516,738


In [4]:
speeches = comm_df[comm_df['Text Type']=='speech']
albums = comm_df[comm_df['Text Type']=='song']

In [5]:
print('Speech Means:')
print(speeches[['Chars','Words','Unique Words']].mean())
print('\nAlbum Means:')
print(albums[['Chars','Words','Unique Words']].mean())
print('_'*40)
print('\nSpeech Standard Deviations:')
print(speeches[['Chars','Words','Unique Words']].std())
print('\nAlbum Standard Deviations:')
print(albums[['Chars','Words','Unique Words']].std())
print('_'*40)
print('\nSpeech Skew:')
print(speeches[['Chars','Words','Unique Words']].skew())
print('\nAlbum Skew:')
print(albums[['Chars','Words','Unique Words']].skew())
print('_'*40)
print('\nSpeech Kurtosis:')
print(speeches[['Chars','Words','Unique Words']].kurtosis())
print('\nAlbum Kurtosis:')
print(albums[['Chars','Words','Unique Words']].kurtosis())

Speech Means:
Chars           20768.458333
Words            3871.916667
Unique Words      787.500000
dtype: float64

Album Means:
Chars           14611.521739
Words            3060.565217
Unique Words      460.913043
dtype: float64
________________________________________

Speech Standard Deviations:
Chars           9196.893223
Words           1712.795805
Unique Words     244.017284
dtype: float64

Album Standard Deviations:
Chars           6602.855545
Words           1393.735818
Unique Words     176.030192
dtype: float64
________________________________________

Speech Skew:
Chars           0.168748
Words           0.177389
Unique Words    0.106348
dtype: float64

Album Skew:
Chars           0.934680
Words           0.946151
Unique Words    0.290726
dtype: float64
________________________________________

Speech Kurtosis:
Chars          -1.520080
Words          -1.466457
Unique Words   -1.156174
dtype: float64

Album Kurtosis:
Chars           0.621664
Words           0.516645
Unique W

Replace \n with periods in songs so we can do sentiment analysis by line (similar to what I'll do for speeches). We can look at how textblob takes informal text as seen in songs (just pull a few samples) and look to see if vader is any better. To filter noise, we can filter truly neutral sentences (there might be a large number that'll dampen our scores, but then again maybe not, so it's worth looking into).

## Sentiment Analysis

### Vader vs. TextBlob

VADER is used for many contexts, especially social media (can understand lots of things even if people's english sux :)). TextBlob tends to work better on proper writing, but of course since they have different implementations, I'd expect that this is not always the case. Here are the links to [VADER](https://github.com/cjhutto/vaderSentiment) and [TextBlob](https://github.com/sloria/TextBlob)'s respective Githubs. VADER's README file includes [how its scoring works](https://github.com/cjhutto/vaderSentiment#about-the-scoring).

I'm sure you're aware, but I want to point out that my corpora are quite long. How will that affect our analyzers and is there anything we can do to improve it? We'll test out a couple methods of approach and see which one seems most accurate. I'm not worried about the speeches (they're all well written/formatted) but songs are a different story. They sometimes have periods but many do not and that's understandable, songs aren't usually made of complete sentences.

In [6]:
from textblob import TextBlob
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sentence = 'i said you want to be starting something'
vader_analyzer = SentimentIntensityAnalyzer()
test = TextBlob(sentence)
print(test.sentiment)
print(vader_analyzer.polarity_scores(sentence))

Sentiment(polarity=0.0, subjectivity=0.1)
{'neg': 0.0, 'neu': 0.822, 'pos': 0.178, 'compound': 0.0772}


In [7]:
comm_df['VADER Polarity'] = comm_df['Corpus'].apply(lambda s: vader_analyzer.polarity_scores(s)['compound'])
comm_df['TextBlob Polarity'] = comm_df['Corpus'].apply(lambda s: TextBlob(s).sentiment.polarity)
comm_df

Unnamed: 0,Originator,Title,Corpus,Link,Text Type,Chars,Words,Unique Words,VADER Polarity,TextBlob Polarity
0,Chimamanda Ngozi Adichie,The Danger of a Single Story,i am a storyteller. and i would like to tell y...,https://jamesclear.com/great-speeches/the-dang...,speech,15829,2867,724,0.9564,0.060755
1,Jeff Bezos,What Matters More Than Your Talents,"as a kid, i spent my summers with my grandpare...",https://jamesclear.com/great-speeches/what-mat...,speech,7275,1380,424,0.9996,0.166246
2,John C. Bogle,Enough,here how i recall the wonderful story that set...,https://jamesclear.com/great-speeches/enough-b...,speech,8362,1448,490,0.9999,0.152561
3,Brené Brown,The Anatomy of Trust,"oh, it just feels like an incredible understat...",https://jamesclear.com/great-speeches/the-anat...,speech,18717,3657,599,1.0,0.093895
4,John Cleese,Creativity in Management,"you know, when video arts asked me if i would ...",https://jamesclear.com/great-speeches/creativi...,speech,27557,5015,1044,0.9999,0.135321
5,William Deresiewicz,Solitude and Leadership,my title must seem like a contradiction. what ...,https://jamesclear.com/great-speeches/solitude...,speech,32611,5857,1231,0.9999,0.138556
6,Richard Feynman,Seeking New Laws,what i want to talk to you about tonight is st...,https://jamesclear.com/great-speeches/seeking-...,speech,32609,5866,856,0.9993,0.065163
7,Neil Gaiman,Make Good Art,i never really expected to find myself giving ...,https://jamesclear.com/great-speeches/make-goo...,speech,15470,2970,626,0.9998,0.18402
8,John W. Gardner,Personal Renewal,i am going to talk about self renewal. one of ...,https://jamesclear.com/great-speeches/personal...,speech,22612,4171,1028,0.9998,0.139498
9,Elizabeth Gilbert,Your Elusive Creative Genius,i am a writer. writing books is my profession ...,https://jamesclear.com/great-speeches/your-elu...,speech,18636,3516,738,0.9998,0.113314


Obviously, VADER is not built for large pieces of text which I wasn't aware of, but sorta makes sense. Textblob scores look a bit better, but the scores are quite mild (which probably makes sense given many sentences are likely neutral sounding).

Let's look into our different options and see if we can get a more representative result. From both VADER and Textblob. It won't be fair to compare songs using VADER and Textblob on speeches even though those analyzers lend themselves better to specific text formats (or so I've seen from online videos/tutorials).

In [31]:
def get_sentiment_by_paragraph(s,analyzer,proportional_weighting=True):
    if analyzer=='TextBlob':
        get_sentiment = lambda s: TextBlob(s).sentiment.polarity
    elif analyzer=='VADER':
        get_sentiment = lambda s: vader_analyzer.polarity_scores(s)['compound']
    else:
        return np.nan
    paragraphs = s.split('\n')
    par_sent = np.array(list(map(get_sentiment,paragraphs)))
    if proportional_weighting:
        # weigh the sentiment per paragraph by its fraction of the total words included
        par_weight = np.array(list(map(lambda p: len(p.split()),paragraphs)))/len(s.split())
        return (par_weight*par_sent).sum()
    return par_sent.mean()

def get_sentiment_by_sentence(s,analyzer,proportional_weighting=True):
    if analyzer=='TextBlob':
        get_sentiment = lambda s: TextBlob(s).sentiment.polarity
    elif analyzer=='VADER':
        get_sentiment = lambda s: vader_analyzer.polarity_scores(s)['compound']
    else:
        return np.nan
    s = re.sub('\s',' ',s)
    # add special chars to mark sentence breaks
    s = re.sub('(.?!)','\1~',s)
    sentenses = s.split('~')
    sen_sent = np.array(list(map(get_sentiment,sentenses)))
    if proportional_weighting:
        # weigh the sentiment per paragraph by its fraction of the total words included
        sen_weight = np.array(list(map(lambda p: len(p.split()),sentenses)))/len(s.split())
        return (sen_weight*sen_sent).sum()
    return sen_sent.mean()


comm_df['VADER Polarity by Sentence'] = comm_df['Corpus'].apply(get_sentiment_by_sentence, analyzer='VADER')
comm_df['TextBlob Polarity by Sentence'] = comm_df['Corpus'].apply(get_sentiment_by_sentence, analyzer='TextBlob')    
comm_df['VADER Polarity by Paragraph'] = comm_df['Corpus'].apply(get_sentiment_by_paragraph, analyzer='VADER')
comm_df['TextBlob Polarity by Paragraph'] = comm_df['Corpus'].apply(get_sentiment_by_paragraph, analyzer='TextBlob')

# VADER was one that was especially low variance
print("Mean:")
print(comm_df.mean())
print("\n\nVariance:")
print(comm_df.var())

Mean:
Chars                             17755.489362
Words                              3474.872340
Unique Words                        627.680851
VADER Polarity                        0.998455
TextBlob Polarity                     0.132245
VADER Polarity by Paragraph           0.354882
TextBlob Polarity by Paragraph        0.118260
VADER Polarity by Sentence            0.844023
TextBlob Polarity by Sentence         0.130929
dtype: float64


Variance:
Chars                             7.282110e+07
Words                             2.563931e+06
Unique Words                      7.182400e+04
VADER Polarity                    4.096992e-05
TextBlob Polarity                 4.610731e-03
VADER Polarity by Paragraph       3.754233e-02
TextBlob Polarity by Paragraph    3.200606e-03
VADER Polarity by Sentence        1.274082e-01
TextBlob Polarity by Sentence     4.484380e-03
dtype: float64


Especially when looking at the VADER polarity scores, we see that chopping our corpora up into paragraphs and sentences noticeably increases our variance. TextBlob isn't really affected, so it's likely that it was built for text of variable length. 

Song and speech data are not really formatted to same. This isn't to say that we should use a diffent measure for each (that might be beneficial if more accurate) but it's in a way changing the yard stick. Let's look at the mean and variance for each category to see if one measure is better than the other. 

I'm mostly concerned about song data because there are sometimes ends to sentences and song lyrics aren't often proper English (in fact Santana's album is mostly Spanish, but I'll deal with that later). This is something to be done later, but only keeping valid english words by checking [pyenchant](https://pypi.org/project/pyenchant/) will likely prove beneficial for both sentiment analysis and topic modeling

## Topic Modelling

For this project, we're going to look at if there are any common recognizable topics in these popular songs and speeches. We also might see more songs for some topics more than others. Remember, this analysis is mostly for learning and to look at some different NLP techniques and hopefully draw some general conclusions for differences between the types of text we have.

We'll be using a couple different unsupervised topic modelling algorithms, namely Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF/NNMF). Both require you to provide the number of topics. As a brief (or maybe not brief) description:

### LDA - About:
A statistical model (algorithm) used to determine which words fit best under which topics. As mentioned, you say how many topics and then words start (randomly at first) being assigned to the different groups. From there, it's an iterative process of going through the words and deciding if the word should be put under a different topic. How does it do this? Well, you start looking at the probability of topic given the corpus and the probability of a word given the topic. The product of these is essenentially our model's probability the word belongs to the topic. For a more in depth explanation (and to cite my sources in an informal way) take a look at [Edwin Chen's intro to LDA page](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/).

### NMF - About: 
A group of algorithms used to (approximately) factorize a non-negative matrix (all entries $\ge 0$) into 2 matrices that are also non-negative. We use algorithms to approximate the factorization because there isn't always an analytical solution. According to this [Medium article](https://towardsdatascience.com/topic-modeling-articles-with-nmf-8c6b2a227a45#2ba8) (I know, don't trust everything you read on the internet), the left factorized matrix holds the topics while the right one hold the weights of those topics. Of course, we'll be using the bag of words version of our data as our initial matrix, so our "topics" will actually be words. Unlike LDA, we'll be using TF-IDF applied to our BoW model since that is a way to measure how important words are to the specific article (recall TF is document specific). 

### Helpful Article for sklearn Topic Modelling

While I am certainly applying these techniques to my own project and the sklearn objects can only be used in a certain way, this section was made much easier by reading over Aneesha Bakharia's Medium blog [here](https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730). She went substantially further in detail into how the algorithms work and goes over a complete example of topic modelling using an sklearn dataset, so check it out if you're interested. 

### LDA

Alright, so since we've decided to use LDA, we have to figure out which library we want to use (I am not ambitious enough to build my own, at least not when there isn't a practical reason to do so). Scikit-learn and gensim both have LDA models, but I will be using sklearn because it's faster. It's hard to objectively compare as attempted in the below articles, so why not use the faster one if there isn't a clear winner?

https://medium.com/@benzgreer/sklearn-lda-vs-gensim-lda-691a9f2e9ab7 <br>
https://www.kaggle.com/morrisb/compare-lda-topic-modeling-in-sklearn-and-gensim

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import make_multilabel_classification
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

NUM_TOPICS = 5

The following code block, we use CountVectorizer to get the count (or term frequency, aka tf) of each word in each document (corpus). If you take a look at sklearn's [documentation for CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), you'll notice I'm not using many of the optional parameters. This is because we've already dealt with stopwords, we aren't dealing with n-grams (we're not looking at groups of n contiguous words) and I'm not sure if the length of our documents will prove a problem for the algorithms. Many of the parameters like min_df are to denoise/ignore words that don't meet whatever threshold the parameter concerns itself with. I'll test them out and perhaps add some comments on that after the fact. 

In [None]:
tf_vectorizer = CountVectorizer()
tf = tf_vectorizer.fit_transform(bow_df['Corpus'])
tf_feature_names = tf_vectorizer.get_feature_names()

lda = LatentDirichletAllocation(NUM_TOPICS, random_state = 8)
# lda.fit(df)

### NMF

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

In [None]:
from sklearn.decomposition import NMF
# don't include terms that appear fewer than 2 times (df is document frequency)
tf_idf_vectorizer = TfidfVectorizer()
tf_idf = tf_idf_vectorizer.fit_transform(bow_df['Corpus'])
tf_idf_feature_names = tf_idf_vectorizer.get_feature_names()

# adjust based on findings later
nmf = NMF(n_components=NUM_TOPICS, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tf_idf)