# Analysis

So what did I mean when I said I'd explore the differences between speeches and songs? That's a fair question, especially since the two corpora were written by different groups of people.

First, we'll start with some simple features of the corpora, such as vocabulary, length, and in-document word frequency, plus anything else that comes to mind as I progress. After that, we'll dive into sentiment analysis and topic modeling to see whether songs are more subjective than speeches, or whether "great speeches" tend to be positive. Remember, this is a learning project. The data being explored may be a bit silly, but through it we can learn different NLP techniques and see the problems that come with real-life data.

In [1]:
!pip install textblob



In [2]:
import numpy as np
import pandas as pd
import logging

In [3]:
comm_df = pd.read_excel('communication_data_clean.xlsx')
bow_df = pd.read_excel('communication_data_bow.xlsx')

In [4]:
comm_df['Chars'] = comm_df['Corpus'].apply(len)
comm_df['Words'] = comm_df['Corpus'].apply(lambda c: len(c.split()))
comm_df['Unique Words'] = bow_df['Corpus'].apply(lambda c: len(set(c.split())))
comm_df

| Unnamed: 0 | Originator | Title | Corpus | Source | Chars | Words | Unique Words |
|---|---|---|---|---|---|---|---|
| 0 | Chimamanda Ngozi Adichie | The Danger of a Single Story | i am a storyteller. and i would like to tell y... | https://jamesclear.com/great-speeches/the-dang... | 15788 | 2867 | 853 |
| 1 | Jeff Bezos | What Matters More Than Your Talents | as a kid, i spent my summers with my grandpare... | https://jamesclear.com/great-speeches/what-mat... | 7253 | 1379 | 490 |
| 2 | John C. Bogle | Enough | here how i recall the wonderful story that set... | https://jamesclear.com/great-speeches/enough-b... | 8340 | 1448 | 542 |
| 3 | Brené Brown | The Anatomy of Trust | oh, it just feels like an incredible understat... | https://jamesclear.com/great-speeches/the-anat... | 4564 | 917 | 282 |
| 4 | John Cleese | Creativity in Management | you know, when video arts asked me if i would ... | https://jamesclear.com/great-speeches/creativi... | 27394 | 5013 | 1280 |
| 5 | William Deresiewicz | Solitude and Leadership | my title must seem like a contradiction. what ... | https://jamesclear.com/great-speeches/solitude... | 32604 | 5864 | 1543 |
| 6 | Richard Feynman | Seeking New Laws | what i want to talk to you about tonight is st... | https://jamesclear.com/great-speeches/seeking-... | 32603 | 5881 | 1180 |
| 7 | Neil Gaiman | Make Good Art | i never really expected to find myself giving ... | https://jamesclear.com/great-speeches/make-goo... | 15413 | 2967 | 785 |
| 8 | John W. Gardner | Personal Renewal | i am going to talk about self renewal. one of ... | https://jamesclear.com/great-speeches/personal... | 22549 | 4171 | 1209 |
| 9 | Elizabeth Gilbert | Your Elusive Creative Genius | i am a writer. writing books is my profession ... | https://jamesclear.com/great-speeches/your-elu... | 18585 | 3516 | 931 |
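
The in-document word frequency mentioned at the start isn't computed in the cells above. Here is a minimal sketch of one way to get it with `collections.Counter`, assuming plain whitespace tokenization like the `Words` column uses. The helper name `top_words` and the sample string are made up for illustration.

```python
from collections import Counter


def top_words(text, n=5):
    """Return the n most frequent whitespace-separated tokens in text."""
    return Counter(text.split()).most_common(n)


# Made-up sample text, not taken from the corpora
sample = "make good art make mistakes make good art"
print(top_words(sample, n=2))
```

Applied to the notebook's data, this would probably be run on the bag-of-words column (something like `bow_df['Corpus'].apply(top_words)`), so punctuation doesn't split counts between "art" and "art.".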


In [5]:
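# The rows are ordered speeches first, then albums, so split the frame in half by position
half = (len(comm_df) + 1) // 2  # same split point np.array_split(comm_df, 2) would use
speeches = comm_df.iloc[:half]
albums = comm_df.iloc[half:]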

In [6]:
print('Speech Means:')
print(speeches[['Chars','Words','Unique Words']].mean())
print('\nAlbum Means:')
print(albums[['Chars','Words','Unique Words']].mean())
print('_'*40)
print('\nSpeech Standard Deviations:')
print(speeches[['Chars','Words','Unique Words']].std())
print('\nAlbum Standard Deviations:')
print(albums[['Chars','Words','Unique Words']].std())
print('_'*40)
print('\nSpeech Skew:')
print(speeches[['Chars','Words','Unique Words']].skew())
print('\nAlbum Skew:')
print(albums[['Chars','Words','Unique Words']].skew())
print('_'*40)
print('\nSpeech Kurtosis:')
print(speeches[['Chars','Words','Unique Words']].kurtosis())
print('\nAlbum Kurtosis:')
print(albums[['Chars','Words','Unique Words']].kurtosis())

Speech Means:
Chars           18509.3
Words            3402.3
Unique Words      909.5
dtype: float64

Album Means:
Chars           14432.8
Words            3056.9
Unique Words      532.7
dtype: float64
________________________________________

Speech Standard Deviations:
Chars           10187.893906
Words            1824.764280
Unique Words      397.968522
dtype: float64

Album Standard Deviations:
Chars           7285.826084
Words           1543.368139
Unique Words     238.540493
dtype: float64
________________________________________

Speech Skew:
Chars           0.148041
Words           0.078804
Unique Words   -0.046021
dtype: float64

Album Skew:
Chars           0.664115
Words           0.616294
Unique Words    0.478950
dtype: float64
________________________________________

Speech Kurtosis:
Chars          -1.297700
Words          -1.363568
Unique Words   -0.877986
dtype: float64

Album Kurtosis:
Chars           0.481425
Words           0.244483
Unique Words    0.996907
dtype: float64

Next step: replace \n with periods in the songs so we can do sentiment analysis by line (similar to what I'll do for speeches). We can also pull a few samples to see how TextBlob handles informal text like song lyrics, and check whether VADER does any better. To reduce noise, we can drop truly neutral sentences. There might be enough of them to dampen our scores, or there might not, so it's worth looking into.
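
A minimal sketch of that newline-to-period step, assuming each lyric line is separated by a newline character in the raw text. `lines_to_sentences` is a made-up helper name, not something from the notebook.

```python
def lines_to_sentences(lyrics):
    """Turn newline-separated lyric lines into period-terminated sentences."""
    sentences = []
    for line in lyrics.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between verses
        if line[-1] not in ".!?":
            line += "."  # only add a period when the line has no end punctuation
        sentences.append(line)
    return " ".join(sentences)


# Made-up lyric fragment for illustration
print(lines_to_sentences("first line\n\nsecond line!\nthird line"))
```

Lines that already end in punctuation are left alone, so a song that does use periods doesn't end up with doubled ones.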

## Sentiment Analysis

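A quick note on VADER and TextBlob and how they work: both are lexicon-based, meaning they look words up in a dictionary of sentiment scores and combine them with a few rules. TextBlob's default analyzer returns a polarity between -1 and 1 and a subjectivity between 0 and 1.
VADER is used in many contexts, especially social media. Its rules account for things like slang, capitalization, and emoticons, so it can understand a lot even when the English wouldn't pass as formal writing :)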

I'm sure you're aware, but I want to point out that my corpora are quite long. How will that affect our analyzers, and is there anything we can do to improve the results? We'll test out a couple of approaches and see which one seems most accurate. I'm not worried about the speeches (they're all well written and formatted), but songs are a different story. Some have periods but many do not, which is understandable since songs aren't usually made of complete sentences.

In [7]:
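import nltk
from textblob import TextBlob
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon', quiet=True)  # VADER needs its lexicon downloaded once

# Alternative test sentence: "Textblob is amazingly simple to use. What great fun!"
sentence = 'i said you want to be starting something'
vader_analyzer = SentimentIntensityAnalyzer()
test = TextBlob(sentence)
print(test.sentiment)  # TextBlob: polarity and subjectivity
print(vader_analyzer.polarity_scores(sentence))  # VADER: neg/neu/pos/compound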

Sentiment(polarity=0.0, subjectivity=0.1)
{'neg': 0.0, 'neu': 0.822, 'pos': 0.178, 'compound': 0.0772}


In [8]:
comm_df['VADER Polarity'] = comm_df['Corpus'].apply(lambda s: vader_analyzer.polarity_scores(s)['compound'])
comm_df['TextBlob Polarity'] = comm_df['Corpus'].apply(lambda s: TextBlob(s).sentiment.polarity)
comm_df

| Unnamed: 0 | Originator | Title | Corpus | Source | Chars | Words | Unique Words | VADER Polarity | TextBlob Polarity |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Chimamanda Ngozi Adichie | The Danger of a Single Story | i am a storyteller. and i would like to tell y... | https://jamesclear.com/great-speeches/the-dang... | 15788 | 2867 | 853 | 0.9564 | 0.060755 |
| 1 | Jeff Bezos | What Matters More Than Your Talents | as a kid, i spent my summers with my grandpare... | https://jamesclear.com/great-speeches/what-mat... | 7253 | 1379 | 490 | 0.9996 | 0.166246 |
| 2 | John C. Bogle | Enough | here how i recall the wonderful story that set... | https://jamesclear.com/great-speeches/enough-b... | 8340 | 1448 | 542 | 0.9999 | 0.152561 |
| 3 | Brené Brown | The Anatomy of Trust | oh, it just feels like an incredible understat... | https://jamesclear.com/great-speeches/the-anat... | 4564 | 917 | 282 | 0.9982 | 0.031229 |
| 4 | John Cleese | Creativity in Management | you know, when video arts asked me if i would ... | https://jamesclear.com/great-speeches/creativi... | 27394 | 5013 | 1280 | 0.9999 | 0.135321 |
| 5 | William Deresiewicz | Solitude and Leadership | my title must seem like a contradiction. what ... | https://jamesclear.com/great-speeches/solitude... | 32604 | 5864 | 1543 | 0.9999 | 0.137679 |
| 6 | Richard Feynman | Seeking New Laws | what i want to talk to you about tonight is st... | https://jamesclear.com/great-speeches/seeking-... | 32603 | 5881 | 1180 | 0.9993 | 0.065163 |
| 7 | Neil Gaiman | Make Good Art | i never really expected to find myself giving ... | https://jamesclear.com/great-speeches/make-goo... | 15413 | 2967 | 785 | 0.9998 | 0.190168 |
| 8 | John W. Gardner | Personal Renewal | i am going to talk about self renewal. one of ... | https://jamesclear.com/great-speeches/personal... | 22549 | 4171 | 1209 | 0.9998 | 0.139498 |
| 9 | Elizabeth Gilbert | Your Elusive Creative Genius | i am a writer. writing books is my profession ... | https://jamesclear.com/great-speeches/your-elu... | 18585 | 3516 | 931 | 0.9998 | 0.113314 |


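Clearly, VADER is not built for large pieces of text. I wasn't aware of that, but it sort of makes sense: nine of the ten speeches shown score above 0.99, so the compound score tells us almost nothing about how they differ. The TextBlob scores look a bit better, but they are quite mild (which probably makes sense, given that many sentences are likely neutral sounding).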
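
One likely reason the VADER scores above all sit near 1: to my understanding, the compound score is the sum of the word-level valences squashed through x / sqrt(x² + α) with α = 15. Treat the formula and the constant as my reading of the library rather than something verified in this notebook. The sketch below only illustrates how quickly that kind of normalization flattens out as the raw sum grows, which is what happens when a document has thousands of words. The totals fed in are made-up numbers.

```python
import math


def normalize(total_valence, alpha=15):
    """Squash an unbounded valence sum into (-1, 1), VADER-style (assumed formula)."""
    return total_valence / math.sqrt(total_valence ** 2 + alpha)


# Hypothetical raw valence sums for texts of increasing length
for total in (1, 5, 50, 500):
    print(total, round(normalize(total), 4))
```

A short sentence with a raw sum around 1 lands well inside the range, while a long, mildly positive document ends up pinned just under 1 regardless of how positive it actually is.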

Let's look into our different options and see if we can get a more representative result from both VADER and TextBlob. It wouldn't be fair to compare songs scored with VADER against speeches scored with TextBlob, even though each analyzer lends itself better to a specific text format (or so I've seen in online videos/tutorials).
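
One of those options, sketched under some assumptions: score each sentence separately and average the results, optionally dropping sentences that score exactly 0. The function `mean_sentence_score`, the naive regex sentence splitter, and `toy_scorer` are all made up for illustration. The toy scorer is only a stand-in so the example runs without any lexicon downloads.

```python
import re


def mean_sentence_score(text, scorer, drop_neutral=False):
    """Average scorer(sentence) over the sentences in text.

    scorer is any callable that maps a string to a float polarity.
    """
    # Naive split after ., ! or ? followed by whitespace (an assumption;
    # a proper sentence tokenizer would handle abbreviations better)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    scores = [scorer(s) for s in sentences]
    if drop_neutral:
        scores = [x for x in scores if x != 0]
    return sum(scores) / len(scores) if scores else 0.0


def toy_scorer(sentence):
    """Stand-in scorer, not a real analyzer: +1 for 'great', -1 for 'awful'."""
    if "great" in sentence:
        return 1.0
    if "awful" in sentence:
        return -1.0
    return 0.0


text = "what a great day. i went home. that was awful. great fun!"
print(mean_sentence_score(text, toy_scorer))
print(mean_sentence_score(text, toy_scorer, drop_neutral=True))
```

With the real analyzers, the scorer could be `lambda s: vader_analyzer.polarity_scores(s)['compound']` or `lambda s: TextBlob(s).sentiment.polarity`, which keeps the comparison between the two on equal footing.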