![UKDS Logo](images/UKDS_Logos_Col_Grey_300dpi.png)

# Text-mining: Classifiers and sentiment analysis

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Sentiment-Analysis" data-toc-modified-id="Sentiment-Analysis-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Sentiment Analysis</a></span><ul class="toc-item"><li><span><a href="#Sentiment-Analysis-Tools" data-toc-modified-id="Sentiment-Analysis-Tools-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Sentiment Analysis Tools</a></span></li></ul></li><li><span><a href="#Analyse-trivial-documents-with-built-in-sentiment-analysis-tool" data-toc-modified-id="Analyse-trivial-documents-with-built-in-sentiment-analysis-tool-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Analyse trivial documents with built-in sentiment analysis tool</a></span></li><li><span><a href="#Analyse-foot-and-mouth-dataset" data-toc-modified-id="Analyse-foot-and-mouth-dataset-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Analyse foot and mouth dataset</a></span><ul class="toc-item"><li><span><a href="#Overall-Average-Polarity-and-Subjectivity" data-toc-modified-id="Overall-Average-Polarity-and-Subjectivity-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Overall Average Polarity and Subjectivity</a></span></li><li><span><a href="#Sentiment-Analysis-by-Occupation" data-toc-modified-id="Sentiment-Analysis-by-Occupation-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Sentiment Analysis by Occupation</a></span></li></ul></li><li><span><a href="#Sorting-out-Sentiment-Score-Issues" data-toc-modified-id="Sorting-out-Sentiment-Score-Issues-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Sorting out Sentiment Score Issues</a></span><ul class="toc-item"><li><span><a href="#Inspecting-Sentiment-and-Polarity-by-Row" data-toc-modified-id="Inspecting-Sentiment-and-Polarity-by-Row-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Inspecting Sentiment and Polarity by 
Row</a></span></li><li><span><a href="#Filtering-the-Sentiment-Scores" data-toc-modified-id="Filtering-the-Sentiment-Scores-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Filtering the Sentiment Scores</a></span></li></ul></li><li><span><a href="#Analysis-on-Filtered-Sentiments" data-toc-modified-id="Analysis-on-Filtered-Sentiments-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Analysis on Filtered Sentiments</a></span><ul class="toc-item"><li><span><a href="#Overall-Score-Average" data-toc-modified-id="Overall-Score-Average-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Overall Score Average</a></span></li><li><span><a href="#Occupation-Sentiment-Averages" data-toc-modified-id="Occupation-Sentiment-Averages-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Occupation Sentiment Averages</a></span></li></ul></li><li><span><a href="#Textblob-on-Different-dataset-formats" data-toc-modified-id="Textblob-on-Different-dataset-formats-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Textblob on Different dataset formats</a></span><ul class="toc-item"><li><span><a href="#Analysis-On-Raw-Data" data-toc-modified-id="Analysis-On-Raw-Data-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Analysis On Raw Data</a></span></li><li><span><a href="#Analysis-on-Sentences" data-toc-modified-id="Analysis-on-Sentences-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>Analysis on Sentences</a></span></li></ul></li><li><span><a href="#Using-VADER" data-toc-modified-id="Using-VADER-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Using VADER</a></span><ul class="toc-item"><li><span><a href="#On-the-Raw-Data" data-toc-modified-id="On-the-Raw-Data-8.1"><span class="toc-item-num">8.1&nbsp;&nbsp;</span>On the Raw Data</a></span></li><li><span><a href="#On-Sentence-Tokenised-data" data-toc-modified-id="On-Sentence-Tokenised-data-8.2"><span class="toc-item-num">8.2&nbsp;&nbsp;</span>On Sentence Tokenised data</a></span></li><li><span><a 
href="#On-Fully-Processed-Data" data-toc-modified-id="On-Fully-Processed-Data-8.3"><span class="toc-item-num">8.3&nbsp;&nbsp;</span>On Fully-Processed Data</a></span></li></ul></li><li><span><a href="#VADER-Comparisons" data-toc-modified-id="VADER-Comparisons-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>VADER Comparisons</a></span></li></ul></div>


A table of contents is provided at the top of this notebook. You can also open this menu at any point by clicking the Table of Contents button on the top toolbar (an icon with four horizontal bars; if unsure, hover your mouse over the buttons).

## Introduction

Now that we have finished processing the data and put it into a nice, easy-to-read format, it's on to the extraction! But first, let's run all the necessary processing code again as a summary.

In [1]:
# It is good practice to always start by importing the modules and packages you will need. 

import os                         # os is a module for navigating your machine (e.g., file directories).
import nltk                       # nltk stands for natural language tool kit and is useful for text-mining. 
import re                         # re is for regular expressions, which we use later 
import pandas as pd               # we need pandas to import the foot_mouth_original.xls file
# pandas needs xlrd to read legacy .xls files.
# Keep comments off the `!pip` lines: a trailing `# ...` can be passed to pip and rejected as a requirement.
!pip install -q xlrd
import xlrd

nltk.download('punkt')
from nltk import word_tokenize    # importing the word_tokenize function from nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Spellchecker
!pip install -q autocorrect
from autocorrect import Speller

from nltk.corpus import wordnet                    # Finally, things we need for lemmatising!
from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')        # The POS tagger used before lemmatising

print("Successfully imported necessary modules")   # The print statement is just a bit of encouragement!

foot_mouth_df = pd.read_csv('../code/data/foot_mouth/text.csv')
print(foot_mouth_df[:10])

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\L_Pel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\L_Pel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Successfully imported necessary modules
   Unnamed: 0                0  \
0           0  5407diary02.rtf   
1           1  5407diary03.rtf   
2           2  5407diary07.rtf   
3           3  5407diary08.rtf   
4           4  5407diary09.rtf   
5           5  5407diary10.rtf   
6           6  5407diary13.rtf   
7           7  5407diary14.rtf   
8           8  5407diary15.rtf   
9           9  5407diary16.rtf   

                                                   1  
0  \n\nInformation about diarist\nDate of birth: ...  
1  Information about diarist\nDate of birth: 1966...  
2  \n\nInformation about diarist\nDate of birth: ...  
3  Information about diarist\nDate of birth: 1963...  
4  Information about diarist\nDate of birth: 1981...  
5  Information about diarist\nDate of birth: 1937...  
6  Information about diarist\nDate of birth: 1947...  
7  \nInformation about diarist\nDate of birth: 19...  
8  Information about diarist\nDate of birth: 1949...  
9  \nInformation about diarist\nDate

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\L_Pel\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
# Renaming Columns
foot_mouth_df = pd.read_csv ('../code/data/foot_mouth/text.csv') 

foot_mouth_df.columns = ["Number", "Filename", "everything_else"]
print(foot_mouth_df.head())

print("Columns renamed :)")
print(" ")

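# Creating a new DataFrame with an Occupation column.
# The regex captures the first "word, whitespace, 1-2 digit number" sequence in each text, e.g. "Group 6".
oc_foot_mouth = foot_mouth_df.assign(Occupation = foot_mouth_df['everything_else'].str.extract(r'(\w+\s+\d{1,2})'))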

print(oc_foot_mouth.head()) # checking if it worked!

print(" Occupation Dataframe Created!")

print(" ")
print(" EVERYTHING READY! :)")
print(" ")

   Number         Filename                                    everything_else
0       0  5407diary02.rtf  \n\nInformation about diarist\nDate of birth: ...
1       1  5407diary03.rtf  Information about diarist\nDate of birth: 1966...
2       2  5407diary07.rtf  \n\nInformation about diarist\nDate of birth: ...
3       3  5407diary08.rtf  Information about diarist\nDate of birth: 1963...
4       4  5407diary09.rtf  Information about diarist\nDate of birth: 1981...
Columns renamed :)
 
   Number         Filename                                    everything_else  \
0       0  5407diary02.rtf  \n\nInformation about diarist\nDate of birth: ...   
1       1  5407diary03.rtf  Information about diarist\nDate of birth: 1966...   
2       2  5407diary07.rtf  \n\nInformation about diarist\nDate of birth: ...   
3       3  5407diary08.rtf  Information about diarist\nDate of birth: 1963...   
4       4  5407diary09.rtf  Information about diarist\nDate of birth: 1981...   

  Occupation  
0    Grou
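
The occupation extraction above relies on a regular expression. As a rough illustration, the same pattern can be tried on a single string with the standard `re` module. The sample text and the helper name `extract_occupation` are made up for this sketch and are not taken from the dataset or the notebook.

```python
import re

# Same pattern as in the cell above: a word, whitespace, then a 1-2 digit number.
OCCUPATION_PATTERN = re.compile(r'(\w+\s+\d{1,2})')

def extract_occupation(text):
    """Return the first 'word number' match (e.g. 'Group 6'), or None if there is none."""
    match = OCCUPATION_PATTERN.search(text)
    return match.group(1) if match else None

# Hypothetical header text, invented for illustration only.
sample = "Information about diarist\nDate of birth: 1975\nGender: F\nOccupation: Group 6\n"
print(extract_occupation(sample))
```

Because only the first match is kept, any earlier "word number" pair in a document would be captured instead of the occupation group, so it is worth spot-checking the extracted column.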

In [3]:
#Tokenising by word
foot_mouth_df['tokenised_words'] = foot_mouth_df.apply(lambda row: nltk.word_tokenize(row['everything_else']), axis=1)

#Removing Uppercase
foot_mouth_df['txt_lower'] = foot_mouth_df['tokenised_words'].apply(lambda x: [w.lower() for w in x])

#Correcting Spelling
spell = Speller(lang='en')

foot_mouth_df['spell_checked'] = foot_mouth_df['txt_lower'].apply(lambda x: [spell(w) for w in x])

# Restore two domain terms that appear as "der" and "ffd" after spell-checking.
term_corrections = {"der": "defra", "ffd": "fmd"}

foot_mouth_df["spell_checked"] = foot_mouth_df["spell_checked"].apply(lambda x: [term_corrections.get(w, w) for w in x])

#Removing punctuation and getting rid of the resulting empty strings
English_punctuation = "!\"#$%&()*+,./:;<=>?@[\\]^_`{|}~“”-"     # Define a variable with all the punctuation to remove (backslash escaped).
print(English_punctuation)                                     # Print that defined variable, just to check it is correct.
print("...") 


def remove_punctuation(from_text):                           # A function to strip punctuation from every token in a row
    table = str.maketrans('', '', English_punctuation)       # The Python function 'maketrans' creates a table that maps
    stripped = [w.translate(table) for w in from_text]       # the punctuation marks to 'None'. Print the table to check.
    return stripped

foot_mouth_df['no_punct'] = [remove_punctuation(i) for i in foot_mouth_df['spell_checked']] # Applying the above function to each row

foot_mouth_df['no_punct_no_space'] = [list(filter(None, sublist)) for sublist in foot_mouth_df['no_punct']]

#POS Tagging and Lemmatisation

foot_mouth_df['pos_tag'] = foot_mouth_df['no_punct_no_space'].apply(lambda x: nltk.pos_tag(x))

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer() 
foot_mouth_df['lemmatised'] = foot_mouth_df['pos_tag'].apply(lambda x: [lemmatizer.lemmatize(y[0], get_wordnet_pos(y[0])) for y in x])

#Removing Stop words
stop_words = set(stopwords.words('english'))
                                                                                                                   
new_stopwords =[e for e in stop_words if e not in ("aren't","couldn't","didn't","doesn't","don't","hadn't","hasn't","haven't","isn't","mightn't","mustn't","needn't",'no','not',"only","shouldn't","wasn't","weren't","won't","wouldn't")]
foot_mouth_df['no_stop_words'] = foot_mouth_df['lemmatised'].apply(lambda x: [item for item in x if item not in new_stopwords])                        

print("All done!")

!"#$%&()*+,./:;<=>?@[\]^_`{|}~“”-
...
All done!
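
Two of the steps above, stripping punctuation and removing stop words while keeping negations, can be tried on a handful of tokens without any third-party packages. This is only a sketch: the punctuation string is shortened, the stop-word set is a tiny stand-in for NLTK's English list, and the names `clean`, `STOP_WORDS` and `KEEP` are invented for the example.

```python
# Punctuation to strip (shortened here; the notebook uses a longer string).
PUNCTUATION = "!\"#$%&()*+,./:;<=>?@[\\]^_`{|}~-"
TABLE = str.maketrans('', '', PUNCTUATION)

# A tiny stand-in stop-word list; NLTK's real English list is much longer.
STOP_WORDS = {"the", "is", "a", "not", "no"}
KEEP = {"not", "no"}                       # negations we want to keep for sentiment
FILTERED_STOP_WORDS = STOP_WORDS - KEEP

def clean(tokens):
    """Strip punctuation, drop tokens that become empty, then drop stop words."""
    stripped = [t.translate(TABLE) for t in tokens]
    non_empty = list(filter(None, stripped))
    return [t for t in non_empty if t not in FILTERED_STOP_WORDS]

tokens = ["the", "outbreak", "is", "not", "over", ",", "foot-and-mouth", "!"]
print(clean(tokens))
```

Keeping "not" and "no" matters for sentiment work because dropping them would turn "not over" into "over" and lose the negation.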


In [4]:
# Appending final processed data column onto the oc_foot_mouth Dataframe

processed = foot_mouth_df['no_stop_words']

oc_foot_mouth = oc_foot_mouth.join(processed)
oc_foot_mouth.rename(columns = {'no_stop_words':'processed_text'}, inplace = True)

print(foot_mouth_df.head())
print(" ")
oc_foot_mouth.head()

print(" ")
print("All Set for Extraction!")

   Number         Filename                                    everything_else  \
0       0  5407diary02.rtf  \n\nInformation about diarist\nDate of birth: ...   
1       1  5407diary03.rtf  Information about diarist\nDate of birth: 1966...   
2       2  5407diary07.rtf  \n\nInformation about diarist\nDate of birth: ...   
3       3  5407diary08.rtf  Information about diarist\nDate of birth: 1963...   
4       4  5407diary09.rtf  Information about diarist\nDate of birth: 1981...   

                                     tokenised_words  \
0  [Information, about, diarist, Date, of, birth,...   
1  [Information, about, diarist, Date, of, birth,...   
2  [Information, about, diarist, Date, of, birth,...   
3  [Information, about, diarist, Date, of, birth,...   
4  [Information, about, diarist, Date, of, birth,...   

                                           txt_lower  \
0  [information, about, diarist, date, of, birth,...   
1  [information, about, diarist, date, of, birth,...   
2  [info

In [5]:
oc_foot_mouth.loc[:5]

|  | Number | Filename | everything_else | Occupation | processed_text |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 5407diary02.rtf | \n\nInformation about diarist\nDate of birth: ... | Group 6 | [information, diarist, date, birth, 1975, gend... |
| 1 | 1 | 5407diary03.rtf | Information about diarist\nDate of birth: 1966... | Group 6 | [information, diarist, date, birth, 1966, gend... |
| 2 | 2 | 5407diary07.rtf | \n\nInformation about diarist\nDate of birth: ... | Group 6 | [information, diarist, date, birth, 1964, gend... |
| 3 | 3 | 5407diary08.rtf | Information about diarist\nDate of birth: 1963... | Group 6 | [information, diarist, date, birth, 1963, gend... |
| 4 | 4 | 5407diary09.rtf | Information about diarist\nDate of birth: 1981... | Group 5 | [information, diarist, date, birth, 1981, gend... |
| 5 | 5 | 5407diary10.rtf | Information about diarist\nDate of birth: 1937... | Group 5 | [information, diarist, date, birth, 1937, gend... |


## Sentiment Analysis

The first (and main) step in the extraction process is sentiment analysis, a commonly used example of automatic classification. To be clear, automatic classification means that a model or learning algorithm has been trained on correctly classified documents, and it uses this training to return a probability assessment of which class a new document should belong to.

Sentiment analysis works the same way, but usually has only two classes: positive and negative. A trained model looks at new data and says whether that data is likely to be positive or negative, and this is what we will use to conduct our analysis. Let's take a look!
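
To make "trained on correctly classified documents" a little more concrete, here is a toy two-class classifier written from scratch (a multinomial Naive Bayes with add-one smoothing). It is only an illustration of the general idea: the four training sentences are invented, the function names are hypothetical, and this is not how the sentiment tool used later in this notebook is implemented.

```python
import math
from collections import Counter

def train(labelled_docs):
    """Count words per class from (text, label) pairs."""
    word_counts, doc_counts, vocab = {}, Counter(), set()
    for text, label in labelled_docs:
        tokens = text.lower().split()
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def predict_proba(model, text):
    """Return {label: probability} for a new document."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)      # class prior
        for token in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((counts[token] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Convert log scores into probabilities that sum to 1.
    highest = max(log_scores.values())
    weights = {label: math.exp(s - highest) for label, s in log_scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}

# Invented training data, for illustration only.
training = [("i love this", "positive"), ("this is great", "positive"),
            ("i hate this", "negative"), ("this is awful", "negative")]
model = train(training)
probabilities = predict_proba(model, "love great")
print(probabilities)
```

The new document "love great" contains only words seen in positive examples, so the positive class receives the larger probability.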

### Sentiment Analysis Tools

Let's start by importing and downloading some useful packages, including `textblob`, which is based on `nltk` and has built-in sentiment analysis tools.

To import the packages, click in the code cell below and either hit the 'Run' button at the top of this page or hold down the 'Shift' key and hit the 'Enter' key.

For the rest of this notebook, I will use 'Run/Shift+Enter' as shorthand for 'click in the code cell below and either hit the 'Run' button at the top of this page or hold down the 'Shift' key while hitting the 'Enter' key'.

Run/Shift+Enter.

In [5]:
import os                         # os is a module for navigating your machine (e.g., file directories).
import nltk                       # nltk stands for natural language tool kit and is useful for text-mining. 
import csv                        # csv is for importing and working with csv files
import statistics

In [6]:
!pip install -U textblob -q
!python -m textblob.download_corpora -q
from textblob import TextBlob

Finished.


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\L_Pel\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\L_Pel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\L_Pel\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\L_Pel\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     C:\Users\L_Pel\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\L_Pel\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!

## Analyse trivial documents with built-in sentiment analysis tool

Now, let's analyse some simple data before moving on to the more complex foot and mouth data.

Run/Shift+Enter, as above!

In [26]:
Doc1 = TextBlob("Textblob is just super. I love it!")             # Convert a few basic strings into Textblobs 
Doc2 = TextBlob("Cabbages are the worst. Say no to cabbages!")    # Textblobs, like other text-mining objects, are often called
Doc3 = TextBlob("Paris is the capital of France. ")               # 'documents'

type(Doc1)

textblob.blob.TextBlob

Docs 1 through 3 are TextBlobs, which we can see from the output of `type(Doc1)`.

We get a TextBlob by passing a string to the `TextBlob` class that we imported above, in the format `TextBlob('string goes here')`. (The capitalisation matters: `Textblob` would raise a `NameError`.) TextBlobs are ready for analysis through the `textblob` tools, such as the built-in sentiment analysis tool that we see in the code below.

Run/Shift+Enter on those TextBlobs.

In [12]:
print(Doc1.sentiment)
print(Doc2.sentiment)
print(Doc3.sentiment)

Sentiment(polarity=0.47916666666666663, subjectivity=0.6333333333333333)
Sentiment(polarity=-1.0, subjectivity=1.0)
Sentiment(polarity=0.0, subjectivity=0.0)


The previous code returns two values for each TextBlob object. Polarity sits on a negative-to-positive spectrum from -1.0 to 1.0, while subjectivity sits on a fact-to-opinion spectrum from 0.0 (factual) to 1.0 (opinion).

(We can see, for example, that Doc1 is fairly positive but also quite subjective, while Doc2 is very negative and very subjective. Doc3, in contrast, is both neutral and factual.)

To get only one of the two values, you can access the appropriate attribute as shown below.

Run/Shift+Enter to see each value on its own.

In [25]:
print(Doc1.sentiment.polarity)
print(Doc1.sentiment.subjectivity)

0.47916666666666663
0.6333333333333333
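
The `Sentiment` result behaves like a named tuple, so besides attribute access it can also be unpacked or indexed. The sketch below uses a stand-in built with `collections.namedtuple` and the Doc1 scores printed above, so it runs without `textblob` installed. Treat the stand-in as an assumption about the object's shape rather than the library's own code.

```python
from collections import namedtuple

# Stand-in with the same two fields as the Sentiment output shown above.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])
doc1_sentiment = Sentiment(polarity=0.47916666666666663, subjectivity=0.6333333333333333)

# Attribute access, as in the cell above.
print(doc1_sentiment.polarity)

# Tuple unpacking gives both values in one step.
polarity, subjectivity = doc1_sentiment
print(round(polarity, 2), round(subjectivity, 2))
```

Unpacking is handy when both values are needed at once, for example when filling two DataFrame columns from a single call.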


## Analyse foot and mouth dataset

Super. Now let's do this with our foot and mouth pandas DataFrame. We will look at the polarity and subjectivity of occupation Groups 1 and 4 in comparison to the average across all occupations!

### Overall Average Polarity and Subjectivity

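To look at the overall average, we join every row of our `processed_text` column into one long string and pass it to TextBlob, which returns the overall polarity and subjectivity.

Run/Shift+Enter

In [81]:
# Each row of 'processed_text' is a list of tokens, so str(x) turns each list into text before joining.
# (The original loop over the DataFrame repeated this same calculation once per column, so it is dropped.)
text = oc_foot_mouth.loc[:, 'processed_text'].tolist()
words = " ".join(str(x) for x in text)
text = TextBlob(words)            # note: `text` now holds a TextBlob, not the list of rows
total_score = text.sentiment

print(total_score)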

Sentiment(polarity=0.07367922973999301, subjectivity=0.4463072592542594)


Interesting. Overall polarity is close to neutral (about 0.07, so very slightly positive), even though the text is fairly subjective (about 0.45). This is quite surprising given how sad the topic is!
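
One detail of the cell above is worth seeing on a small scale: each row of `processed_text` is a list, so `str(x)` produces the list's printed form, brackets and quotes included. The made-up rows below contrast that with flattening the tokens first. Whether the two inputs lead to different sentiment scores on the real data is not something this sketch checks.

```python
# Two invented rows of tokens, standing in for the 'processed_text' column.
rows = [["good", "farm"], ["bad", "news"]]

# What the cell above builds: the printed form of each list, joined by spaces.
joined_reprs = " ".join(str(x) for x in rows)

# An alternative: flatten the lists so only the tokens themselves are joined.
flattened = " ".join(token for row in rows for token in row)

print(joined_reprs)
print(flattened)
```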

### Sentiment Analysis by Occupation

Now let's do our sentiment analysis on Groups 1 and 4.

First, we will use the same `==` operator used in the pre-processing stage to filter the DataFrame by Occupation.

In [72]:
group1_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 1']
group4_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 4']

Doing a quick check to see if the Dataframe has been filtered correctly...

In [73]:
#For group 1
print(group1_foot.loc[:])

print(" ")
print("length is " + str(len(group1_foot))) # length should be 13

    Number         Filename  \
33      33  5407diary47.rtf   
34      34  5407diary48.rtf   
35      35  5407diary49.rtf   
36      36  5407diary52.rtf   
37      37  5407diary53.rtf   
38      38  5407diary54.rtf   
40      40     5407fg01.rtf   
80      80    5407int47.rtf   
81      81    5407int48.rtf   
82      82    5407int49.rtf   
83      83    5407int52.rtf   
84      84    5407int53.rtf   
85      85    5407int54.rtf   

                                      everything_else Occupation  \
33  \nInformation about diarist\nDate of birth: 19...    Group 1   
34  \nInformation about diarist\nDate of birth: 19...    Group 1   
35  \t\nInformation about diarist\nDate of birth: ...    Group 1   
36  \nInformation about diarist\nDate of birth: 19...    Group 1   
37  \nInformation about diarist\nDate of birth: 19...    Group 1   
38  \nInformation about diarist\nDate of birth: 19...    Group 1   
40  \nGroups Discussion with Members of  Farmers F...    Group 1   
80  \nDate of Intervi

In [74]:
#For group 4
print(group4_foot.loc[:])

print(" ")
print("length is " + str(len(group4_foot))) # length should be 16

    Number         Filename  \
12      12  5407diary19.rtf   
13      13  5407diary21.rtf   
14      14  5407diary22.rtf   
15      15  5407diary23.rtf   
16      16  5407diary24.rtf   
17      17  5407diary26.rtf   
32      32  5407diary44.rtf   
43      43     5407fg04.rtf   
58      58    5407int19.rtf   
59      59    5407int20.rtf   
60      60    5407int21.rtf   
61      61    5407int22.rtf   
62      62    5407int23.rtf   
63      63    5407int24.rtf   
64      64    5407int26.rtf   
79      79    5407int44.rtf   

                                      everything_else Occupation  \
12  \nInformation about diarist\nDate of birth: 19...    Group 4   
13  \nInformation about diarist\nDate of birth: 19...    Group 4   
14  \nInformation about diarist\nDate of birth: 19...    Group 4   
15  \nInformation about diarist\nDate of birth: 19...    Group 4   
16  \nInformation about diarist\nDate of birth: 19...    Group 4   
17  \nInformation about diarist\nDate of birth: 19...    Group 4

Perfect! Now let's do Sentiment analysis on these new filtered Dataframes!

In [80]:
len(group1_foot.loc[:,'processed_text'])

13

In [82]:
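# Writing the code for the analysis
# Group 1
g1_text = group1_foot.loc[:, 'processed_text'].tolist()
# NOTE: as in the Group 4 cell, the join reads from `text` (last assigned in the
# overall-average cell), not from `g1_text`; this may be why the score below is 0.0.
words = " ".join(str(x) for x in text)
g1_text = TextBlob(words)
group1_score = g1_text.sentiment
print(group1_score)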

Sentiment(polarity=0.0, subjectivity=0.0)


In [88]:
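# Group 4
g4_text = group4_foot.loc[:, 'processed_text'].tolist()
# NOTE: the join reads from `text` (last assigned in the overall-average cell),
# not from `g4_text`; this may be why the score below is 0.0.
g4_words = " ".join(str(x) for x in text)
g4_text = TextBlob(g4_words)
group4_score = g4_text.sentiment
print(group4_score)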

Sentiment(polarity=0.0, subjectivity=0.0)


In [89]:
# Finding out the scores for each

print("The scores for Group 1 are " + str(group1_score))
print(" ")
print("The scores for Group 4 are " + str(group4_score))

The scores for Group 1 are Sentiment(polarity=0.0, subjectivity=0.0)
 
The scores for Group 4 are Sentiment(polarity=0.0, subjectivity=0.0)


OK, that's strange: both groups score exactly 0.0 on both measures. We will have to look more closely to see what the issue is!
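
One thing worth checking is which variable the join reads from. The Group 4 cell builds its string from `text`, and `text` was last assigned a TextBlob in the overall-average cell. Assuming a TextBlob iterates character by character like an ordinary string (an assumption, illustrated here with a plain `str` stand-in), the join would produce single letters separated by spaces, which a word-based scorer is unlikely to recognise. The rows and variable names below are invented for the sketch.

```python
# Plain-str stand-in for a leftover TextBlob (assumed to iterate by character).
leftover = "love it"
spaced = " ".join(str(x) for x in leftover)
print(spaced.split())        # single characters, not words

# Joining over the group's own token lists keeps whole words.
rows = [["love", "it"], ["hate", "it"]]
intended = " ".join(token for row in rows for token in row)
print(intended)
```

If this is what happened, joining over the group's own list should give non-zero scores, but that would need to be confirmed by re-running the cells on the real data.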

## Sorting out Sentiment Score Issues

### Inspecting Sentiment and Polarity by Row 

We will now use the following code to have a closer look at the sentiment scores.

In [7]:
oc_foot_mouth['scores'] = oc_foot_mouth['processed_text'].apply(lambda x: [TextBlob(y).sentiment for y in x])

In [9]:
oc_foot_mouth[:10]

|  | Number | Filename | everything_else | Occupation | processed_text | scores |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 5407diary02.rtf | \n\nInformation about diarist\nDate of birth: ... | Group 6 | [information, diarist, date, birth, 1975, gend... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... |
| 1 | 1 | 5407diary03.rtf | Information about diarist\nDate of birth: 1966... | Group 6 | [information, diarist, date, birth, 1966, gend... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... |
| 2 | 2 | 5407diary07.rtf | \n\nInformation about diarist\nDate of birth: ... | Group 6 | [information, diarist, date, birth, 1964, gend... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... |
| 3 | 3 | 5407diary08.rtf | Information about diarist\nDate of birth: 1963... | Group 6 | [information, diarist, date, birth, 1963, gend... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... |
| 4 | 4 | 5407diary09.rtf | Information about diarist\nDate of birth: 1981... | Group 5 | [information, diarist, date, birth, 1981, gend... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... |
| 5 | 5 | 5407diary10.rtf | Information about diarist\nDate of birth: 1937... | Group 5 | [information, diarist, date, birth, 1937, gend... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... |
| 6 | 6 | 5407diary13.rtf | Information about diarist\nDate of birth: 1947... | Group 5 | [information, diarist, date, birth, 1947, gend... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... |
| 7 | 7 | 5407diary14.rtf | \nInformation about diarist\nDate of birth: 19... | Group 5 | [information, diarist, date, birth, 1964, gend... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... |
| 8 | 8 | 5407diary15.rtf | Information about diarist\nDate of birth: 1949... | Group 5 | [information, diarist, date, birth, 1949, gend... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... |
| 9 | 9 | 5407diary16.rtf | \nInformation about diarist\nDate of birth: 19... | Group 5 | [information, diarist, date, birth, 1951, gend... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... |


In [97]:
oc_foot_mouth.loc[1,'scores']

[Sentiment(polarity=0.0, subjectivity=0.0),
 Sentiment(polarity=0.0, subjectivity=0.0),
 Sentiment(polarity=0.0, subjectivity=0.0),
 Sentiment(polarity=0.0, subjectivity=0.0),
 Sentiment(polarity=0.0, subjectivity=0.0),
 Sentiment(polarity=0.0, subjectivity=0.0),
 Sentiment(polarity=0.0, subjectivity=0.0),
 Sentiment(polarity=0.0, subjectivity=0.0),
 Sentiment(polarity=0.0, subjectivity=0.0),
 Sentiment(polarity=0.0, subjectivity=0.0),
 Sentiment(polarity=0.0, subjectivity=0.0),
 Sentiment(polarity=0.0, subjectivity=0.0),
 Sentiment(polarity=0.0, subjectivity=0.0),
 Sentiment(polarity=0.0, subjectivity=0.0),
 Sentiment(polarity=0.0, subjectivity=0.0),
 Sentiment(polarity=0.0, subjectivity=0.0),
 Sentiment(polarity=0.0, subjectivity=0.0),
 Sentiment(polarity=-0.25, subjectivity=0.25),
 Sentiment(polarity=-0.05, subjectivity=0.4),
 Sentiment(polarity=-0.2916666666666667, subjectivity=0.5416666666666666),
 Sentiment(polarity=0.0, subjectivity=0.0),
 Sentiment(polarity=0.0, subjectivity=0.

OK, so it looks like there are too many neutral words in this dataset, and they are skewing the results. For example, whole phrases such as "Information about diarist" or "Date of birth" are just documentation and have nothing to do with the actual responses submitted by the participants. So we will need to filter out all of these neutral scores!

### Filtering the Sentiment Scores

First, we will split the sentiment scores into polarity and subjectivity.

In [101]:
oc_foot_mouth['polarity'] = oc_foot_mouth['processed_text'].apply(lambda x: [TextBlob(y).sentiment.polarity for y in x])
oc_foot_mouth['subjectivity'] = oc_foot_mouth['processed_text'].apply(lambda x: [TextBlob(y).sentiment.subjectivity for y in x])
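
The same split can also be made from `(polarity, subjectivity)` pairs that have already been computed, without scoring each token again. A small sketch using a few of the pairs shown in the row-1 output above; `zip(*pairs)` is standard Python, and the variable names are illustrative.

```python
# A few (polarity, subjectivity) pairs copied from the row-1 output above.
pairs = [(0.0, 0.0), (-0.25, 0.25), (-0.05, 0.4),
         (-0.2916666666666667, 0.5416666666666666)]

# zip(*pairs) regroups the pairs into one sequence per field.
polarity, subjectivity = (list(values) for values in zip(*pairs))

print(polarity)
print(subjectivity)
```

Note that this unpacking needs at least one pair: an empty list would leave nothing to unpack, so an empty row would need its own check.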

In [102]:
# Checking this worked

oc_foot_mouth.loc[1,'polarity']

[0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.25,
 -0.05,
 -0.2916666666666667,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.1,
 0.0,
 -0.8,
 0.0,
 0.0,
 0.0,
 -0.1875,
 0.0,
 -0.5,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.3333333333333333,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.1,
 0.0,
 -0.3,
 -0.8,
 0.0,
 0.0,
 0.3,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.3,
 0.0,
 0.0,
 0.0,
 0.4000000000000001,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.5,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.3333333333333333,
 0.0,
 0.0,
 0.0,
 0.0,
 0.25,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.1,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.1,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,

In [107]:
oc_foot_mouth.loc[1,'subjectivity']

[0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.25,
 0.4,
 0.5416666666666666,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.14285714285714285,
 0.2,
 0.0,
 0.9,
 0.0,
 0.0,
 0.0,
 0.5,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.3,
 0.0,
 0.6,
 0.9,
 0.0,
 0.0,
 0.9,
 0.0,
 0.5,
 0.0,
 0.0,
 0.0,
 0.4,
 0.0,
 0.0,
 0.0,
 0.9,
 0.0,
 0.06666666666666667,
 0.0,
 0.0,
 0.0,
 0.0,
 0.5,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.8333333333333334,
 0.0,
 1.0,
 0.0,
 0.0,
 0.3333333333333333,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.4,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.4,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0

In [105]:
oc_foot_mouth['pol_filtered'] = [list(filter(lambda x: x != 0, sublist)) for sublist in oc_foot_mouth['polarity']]
oc_foot_mouth['subj_filtered'] = [list(filter(lambda x: x != 0, sublist)) for sublist in oc_foot_mouth['subjectivity']]
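
To see why the zeros matter, here is a small stdlib sketch comparing the mean of a list with and without its zero scores. It uses the first twenty polarity values of row 1 shown above (seventeen zeros followed by three negative scores). The helper name `mean_nonzero` is made up for this example.

```python
import statistics

def mean_nonzero(scores):
    """Average the non-zero scores; return None when every score is zero."""
    nonzero = [s for s in scores if s != 0]
    return statistics.mean(nonzero) if nonzero else None

# First twenty polarity values of row 1, as printed above.
polarity = [0.0] * 17 + [-0.25, -0.05, -0.2916666666666667]

print(statistics.mean(polarity))   # diluted by the seventeen zeros
print(mean_nonzero(polarity))      # average of the three scored tokens only
```

Note that filtering polarity and subjectivity separately means the two filtered lists no longer line up token by token, since a token can have zero polarity but non-zero subjectivity.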

In [106]:
oc_foot_mouth[:20]

|  | Number | Filename | everything_else | Occupation | processed_text | polarity | scores | subjectivity | pol_filtered | subj_filtered |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 5407diary02.rtf | \n\nInformation about diarist\nDate of birth: ... | Group 6 | [information, diarist, date, birth, 1975, gend... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [-0.25, -0.3, 0.7, 0.1, -0.16666666666666666, ... | [0.25, 0.2, 0.06666666666666667, 0.4, 0.066666... |
| 1 | 1 | 5407diary03.rtf | Information about diarist\nDate of birth: 1966... | Group 6 | [information, diarist, date, birth, 1966, gend... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [-0.25, -0.05, -0.2916666666666667, -0.1, -0.8... | [0.25, 0.4, 0.5416666666666666, 0.142857142857... |
| 2 | 2 | 5407diary07.rtf | \n\nInformation about diarist\nDate of birth: ... | Group 6 | [information, diarist, date, birth, 1964, gend... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.13636363636363635, -0.16666666666666666, 0.... | [0.45454545454545453, 0.16666666666666666, 0.6... |
| 3 | 3 | 5407diary08.rtf | Information about diarist\nDate of birth: 1963... | Group 6 | [information, diarist, date, birth, 1963, gend... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.1, 1.0, 1.0, 0.7, 0.2, -0.6999999999999998,... | [0.2, 0.3, 0.3, 0.6000000000000001, 0.2, 0.666... |
| 4 | 4 | 5407diary09.rtf | Information about diarist\nDate of birth: 1981... | Group 5 | [information, diarist, date, birth, 1981, gend... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [-0.6, 0.5, -0.25, 0.2, 0.25, 0.25, -0.25, 0.1... | [0.9, 0.4, 0.5, 0.25, 0.2, 0.25, 0.25, 1.0, 0.... |
| 5 | 5 | 5407diary10.rtf | Information about diarist\nDate of birth: 1937... | Group 5 | [information, diarist, date, birth, 1937, gend... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.1, -0.5, 0.5, -0.6, 0.8, 0.1, 0.5, -0.6, 0.... | [0.4, 1.0, 0.06666666666666667, 1.0, 1.0, 0.75... |
| 6 | 6 | 5407diary13.rtf | Information about diarist\nDate of birth: 1947... | Group 5 | [information, diarist, date, birth, 1947, gend... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [-0.5, -0.25, 0.375, 0.5, 0.05000000000000002,... | [1.0, 1.0, 0.06666666666666667, 0.25, 0.75, 0.... |
| 7 | 7 | 5407diary14.rtf | \nInformation about diarist\nDate of birth: 19... | Group 5 | [information, diarist, date, birth, 1964, gend... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.4, 0.10000000000000002, 0.15, -0.6999999999... | [0.7, 1.0, 0.3833333333333333, 0.6499999999999... |
| 8 | 8 | 5407diary15.rtf | Information about diarist\nDate of birth: 1949... | Group 5 | [information, diarist, date, birth, 1949, gend... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.7, -0.125, 0.7, 0.3333333333333333, 0.7, 0.... | [0.9, 0.3333333333333333, 1.0, 0.6000000000000... |
| 9 | 9 | 5407diary16.rtf | \nInformation about diarist\nDate of birth: 19... | Group 5 | [information, diarist, date, birth, 1951, gend... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | [0.4, 0.2, 0.25, 0.25, 0.2, 0.7, 0.36666666666... | [0.06666666666666667, 0.5, 0.4, 0.066666666666... |


So much better! Now let's do the analysis on these new scores!
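
As a rough illustration of why dropping the zeros matters, here is a minimal sketch on made-up per-word scores (the numbers and the names `word_scores` and `non_zero` are invented for this example, not taken from the diaries). Most words carry no sentiment in the lexicon, so they score 0.0 and drag the plain average towards zero.

```python
from statistics import fmean

# Hypothetical per-word polarity scores for one short document:
# most words are neutral (0.0), only two carry any sentiment.
word_scores = [0.0, 0.0, 0.5, 0.0, -0.25]

# Same filter as used on the real data: keep only the non-zero scores.
non_zero = list(filter(lambda x: x != 0, word_scores))

mean_all = fmean(word_scores)      # neutral words included
mean_non_zero = fmean(non_zero)    # neutral words dropped

print(non_zero)
print(mean_all, mean_non_zero)
```

The filtered mean is larger in magnitude simply because the same total is divided by fewer items, so filtered and unfiltered averages should not be compared directly with each other.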

## Analysis on Filtered Sentiments

For this we will just calculate the means of the scores! First, though, we need to import numpy so we can use its mean function!

In [123]:
import numpy as np

Now we will create new columns for the mean of each polarity and subjectivity row - this will make things easier for the next steps!

In [129]:
oc_foot_mouth['pol_mean'] = oc_foot_mouth['pol_filtered'].apply(np.mean)
oc_foot_mouth['subj_mean'] = oc_foot_mouth['subj_filtered'].apply(np.mean)

oc_foot_mouth[:10]

Unnamed: 0,Number,Filename,everything_else,Occupation,processed_text,polarity,scores,subjectivity,pol_filtered,subj_filtered,pol_mean,subj_mean,Mean
0,0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6,"[information, diarist, date, birth, 1975, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.25, -0.3, 0.7, 0.1, -0.16666666666666666, ...","[0.25, 0.2, 0.06666666666666667, 0.4, 0.066666...",0.127907,0.479233,0.127907
1,1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...,Group 6,"[information, diarist, date, birth, 1966, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.25, -0.05, -0.2916666666666667, -0.1, -0.8...","[0.25, 0.4, 0.5416666666666666, 0.142857142857...",0.083887,0.514394,0.083887
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6,"[information, diarist, date, birth, 1964, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.13636363636363635, -0.16666666666666666, 0....","[0.45454545454545453, 0.16666666666666666, 0.6...",0.202948,0.517709,0.202948
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...,Group 6,"[information, diarist, date, birth, 1963, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.1, 1.0, 1.0, 0.7, 0.2, -0.6999999999999998,...","[0.2, 0.3, 0.3, 0.6000000000000001, 0.2, 0.666...",0.1265,0.505286,0.1265
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...,Group 5,"[information, diarist, date, birth, 1981, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.6, 0.5, -0.25, 0.2, 0.25, 0.25, -0.25, 0.1...","[0.9, 0.4, 0.5, 0.25, 0.2, 0.25, 0.25, 1.0, 0....",0.159727,0.521339,0.159727
5,5,5407diary10.rtf,Information about diarist\nDate of birth: 1937...,Group 5,"[information, diarist, date, birth, 1937, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.1, -0.5, 0.5, -0.6, 0.8, 0.1, 0.5, -0.6, 0....","[0.4, 1.0, 0.06666666666666667, 1.0, 1.0, 0.75...",0.108185,0.48387,0.108185
6,6,5407diary13.rtf,Information about diarist\nDate of birth: 1947...,Group 5,"[information, diarist, date, birth, 1947, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.5, -0.25, 0.375, 0.5, 0.05000000000000002,...","[1.0, 1.0, 0.06666666666666667, 0.25, 0.75, 0....",0.152179,0.500282,0.152179
7,7,5407diary14.rtf,\nInformation about diarist\nDate of birth: 19...,Group 5,"[information, diarist, date, birth, 1964, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.4, 0.10000000000000002, 0.15, -0.6999999999...","[0.7, 1.0, 0.3833333333333333, 0.6499999999999...",0.13913,0.533937,0.13913
8,8,5407diary15.rtf,Information about diarist\nDate of birth: 1949...,Group 5,"[information, diarist, date, birth, 1949, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.7, -0.125, 0.7, 0.3333333333333333, 0.7, 0....","[0.9, 0.3333333333333333, 1.0, 0.6000000000000...",0.182986,0.556282,0.182986
9,9,5407diary16.rtf,\nInformation about diarist\nDate of birth: 19...,Group 5,"[information, diarist, date, birth, 1951, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.4, 0.2, 0.25, 0.25, 0.2, 0.7, 0.36666666666...","[0.06666666666666667, 0.5, 0.4, 0.066666666666...",0.148523,0.508256,0.148523


Perfect! Now let's go back to our original analyses of the dataset!
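
One caveat with the `np.mean` step: if every score in a row is zero, the filtered list is empty and `np.mean` returns `nan` (with a warning). pandas then skips `nan` by default when averaging a column, so such a document would silently drop out of the overall average. The sketch below shows this on a made-up two-row frame; `toy` and `safe_mean` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd

def safe_mean(values):
    """Return the mean of a list, or nan for an empty list (without a warning)."""
    return float(np.mean(values)) if len(values) > 0 else float("nan")

# Made-up data: the second "document" had only neutral scores,
# so nothing is left after filtering out the zeros.
toy = pd.DataFrame({"pol_filtered": [[0.5, -0.25], []]})
toy["pol_mean"] = toy["pol_filtered"].apply(safe_mean)

n_missing = int(toy["pol_mean"].isna().sum())   # rows with no non-zero scores
overall = toy["pol_mean"].mean()                # nan rows are skipped by default

print(n_missing, overall)
```

Running `isna().sum()` on the real `pol_mean` and `subj_mean` columns would show whether any diaries are affected in this way.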

### Overall Score Average

In [131]:
print(oc_foot_mouth["pol_mean"].mean())
print(oc_foot_mouth["subj_mean"].mean())

0.09797969069667002
0.5004429816411067


Okay, so compared to before, both the polarity AND the subjectivity are slightly higher, but still not noticeably different from what was found before. Let's see if this translates to any difference per occupation!

### Occupation Sentiment Averages

In [136]:
group1_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 1']
group4_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 4']

Unnamed: 0,Number,Filename,everything_else,Occupation,processed_text,polarity,scores,subjectivity,pol_filtered,subj_filtered,pol_mean,subj_mean,Mean
33,33,5407diary47.rtf,\nInformation about diarist\nDate of birth: 19...,Group 1,"[information, diarist, date, birth, 1956, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.25, 0.8, 0.25, 0.3666666666666667, 0.400000...","[0.3333333333333333, 1.0, 0.3333333333333333, ...",0.083072,0.468844,0.083072


In [137]:
print(group1_foot["pol_mean"].mean())
print(group1_foot["subj_mean"].mean())

print(" ")

print(group4_foot["pol_mean"].mean())
print(group4_foot["subj_mean"].mean())

0.08583683066634482
0.4784183401572556
 
0.10520366263253528
0.5051201291424214


Ok, much better than before! Although it does look like things are not that different from the general average!

Now, for visualisation purposes, I will get the averages for the other occupations.

In [138]:
group2_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 2']
group3_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 3']
group5_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 5']
group6_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 6']

In [139]:
print(group2_foot["pol_mean"].mean())
print(group2_foot["subj_mean"].mean())

print(" ")

print(group3_foot["pol_mean"].mean())
print(group3_foot["subj_mean"].mean())

print(" ")

print(group5_foot["pol_mean"].mean())
print(group5_foot["subj_mean"].mean())

print(" ")

print(group6_foot["pol_mean"].mean())
print(group6_foot["subj_mean"].mean())


0.1062511833977345
0.5086427142605836
 
0.07805426313312835
0.4975191717791073
 
0.11024942120003758
0.5115891299348682
 
0.10246320328808936
0.4940893600064796
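
The per-occupation averages above can also be obtained in one step with `groupby`, which avoids keeping six separate subsets in sync with the main frame. This is a sketch on a small made-up frame (`toy` and its values are invented); the column names `Occupation`, `pol_mean` and `subj_mean` match those used above.

```python
import pandas as pd

# Made-up stand-in for oc_foot_mouth with the same column names.
toy = pd.DataFrame({
    "Occupation": ["Group 1", "Group 1", "Group 4", "Group 4"],
    "pol_mean":   [0.10, 0.20, 0.00, 0.50],
    "subj_mean":  [0.40, 0.60, 0.50, 0.50],
})

# One row per occupation, one column per averaged score.
group_means = toy.groupby("Occupation")[["pol_mean", "subj_mean"]].mean()
print(group_means)
```

On the real data the equivalent call would be `oc_foot_mouth.groupby('Occupation')[['pol_mean', 'subj_mean']].mean()`.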


## Textblob on Different dataset formats

Since the "analysis per word" method gives strange results, we will try running TextBlob on less processed versions of the data.

### Analysis On Raw Data

In [29]:
oc_foot_mouth[['polarity', 'subjectivity']] = oc_foot_mouth['everything_else'].apply(lambda text: pd.Series(TextBlob(text).sentiment))
oc_foot_mouth[:5]

Unnamed: 0,Number,Filename,everything_else,Occupation,processed_text,polarity,subjectivity
0,0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6,"[information, diarist, date, birth, 1975, gend...",0.102367,0.430962
1,1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...,Group 6,"[information, diarist, date, birth, 1966, gend...",0.071534,0.483925
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6,"[information, diarist, date, birth, 1964, gend...",0.16185,0.479241
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...,Group 6,"[information, diarist, date, birth, 1963, gend...",0.094893,0.446693
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...,Group 5,"[information, diarist, date, birth, 1981, gend...",0.120849,0.486689


In [30]:
group1_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 1']
group4_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 4']
group2_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 2']
group3_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 3']
group5_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 5']
group6_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 6']

In [32]:
print("Overall Score averages are: ")

print(oc_foot_mouth["polarity"].mean())
print(oc_foot_mouth["subjectivity"].mean())

print(" ")

print(" Group 1 averages are: ")
print(group1_foot["polarity"].mean())
print(group1_foot["subjectivity"].mean())

print(" ")
print(" Group 4 averages are: ")
print(group4_foot["polarity"].mean())
print(group4_foot["subjectivity"].mean())

print(" ")
print(" Other averages are:")

print(" ")
print("Group2:")
print(group2_foot["polarity"].mean())
print(group2_foot["subjectivity"].mean())

print(" ")

print("Group3:")
print(group3_foot["polarity"].mean())
print(group3_foot["subjectivity"].mean())

print(" ")

print("Group5:")
print(group5_foot["polarity"].mean())
print(group5_foot["subjectivity"].mean())

print(" ")
print("Group6:")
print(group6_foot["polarity"].mean())
print(group6_foot["subjectivity"].mean())

Overall Score averages are: 
0.07005421874423302
0.44396826245946436
 
 Group 1 averages are: 
0.06223521503453617
0.42215742338412193
 
 Group 4 averages are: 
0.07092672635674173
0.4398351067769926
 
 Other averages are:
 
Group2:
0.07292686149497901
0.4542654403661014
 
Group3:
0.06097999793431373
0.44443789720562005
 
Group5:
0.07731680608744904
0.4521710803801739
 
Group6:
0.07745590373464921
0.44974279078552926


Okay, looks mostly the same! Let's try by sentence! But first we need to drop the score columns from the previous run...

In [33]:
# 'scores' only exists if the per-word cells were run, so ignore any missing columns
oc_foot_mouth.drop(columns=['polarity', 'subjectivity', 'scores'], errors='ignore', inplace=True)

oc_foot_mouth[:5]

Unnamed: 0,Number,Filename,everything_else,Occupation,processed_text
0,0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6,"[information, diarist, date, birth, 1975, gend..."
1,1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...,Group 6,"[information, diarist, date, birth, 1966, gend..."
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6,"[information, diarist, date, birth, 1964, gend..."
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...,Group 6,"[information, diarist, date, birth, 1963, gend..."
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...,Group 5,"[information, diarist, date, birth, 1981, gend..."


### Analysis on Sentences

In [63]:
oc_foot_mouth['sent_tokenised'] = oc_foot_mouth.apply(lambda row: nltk.sent_tokenize(row['everything_else']), axis=1)
oc_foot_mouth.loc[5, 'sent_tokenised']

['Information about diarist\nDate of birth: 1937\nGender: M\nOccupation: Group 5\nGeographic region: North Cumbria\n\nWeek 1\nMonday 11th March 2002\nWhilst watching the local TV news at 6 p.m. there was a news item that caused us to reflect back on the events a year ago.',
 'A young lady had just left a court where she had been found guilty of assaulting a Police Officer and also being in change of an offensive weapon -–a knife.',
 'The judge had acquitted her of the offences, he showed leniency towards her.',
 'Last year during the FMD crisis she had returned to her home to find that her pet goat had been killed by slaughterers because the animal was within the 3 km radius.',
 'She had gone berserk over this and threatened the Police Officer and others with the knife.',
 'She had to be forcibly restrained, she was very distraught over this killing.',
 'Even after she had appeared in court and had been acquitted of all charges she showed great emotion not only being freed but also qui

In [35]:
oc_foot_mouth['polarity'] = oc_foot_mouth['sent_tokenised'].apply(lambda x: [TextBlob(y).sentiment.polarity for y in x])
oc_foot_mouth['subjectivity'] = oc_foot_mouth['sent_tokenised'].apply(lambda x: [TextBlob(y).sentiment.subjectivity for y in x])

oc_foot_mouth[:5]

Unnamed: 0,Number,Filename,everything_else,Occupation,processed_text,sent_tokenised,polarity,subjectivity
0,0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6,"[information, diarist, date, birth, 1975, gend...",[\n\nInformation about diarist\nDate of birth:...,"[0.0, 0.125, 0.0, 0.0, 0.39999999999999997, 0....","[0.0, 0.375, 0.125, 0.17777777777777778, 0.466..."
1,1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...,Group 6,"[information, diarist, date, birth, 1966, gend...",[Information about diarist\nDate of birth: 196...,"[-0.19722222222222222, -0.3186111111111111, 0....","[0.3972222222222222, 0.5777777777777777, 0.0, ..."
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6,"[information, diarist, date, birth, 1964, gend...",[\n\nInformation about diarist\nDate of birth:...,"[0.11742424242424243, 0.0, 0.31145833333333334...","[0.28030303030303033, 0.0, 0.4166666666666667,..."
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...,Group 6,"[information, diarist, date, birth, 1963, gend...",[Information about diarist\nDate of birth: 196...,"[0.3666666666666667, 0.5, 0.0, -0.051851851851...","[0.16666666666666666, 0.15, 0.0, 0.51851851851..."
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...,Group 5,"[information, diarist, date, birth, 1981, gend...",[Information about diarist\nDate of birth: 198...,"[-0.0625, 0.5, -0.024999999999999994, 0.0, 0.0...","[0.3875, 0.5, 0.225, 0.0, 0.0, 0.25, 0.0, 0.25..."


In [36]:
oc_foot_mouth['pol_filtered'] = [list(filter(lambda x: x != 0, sublist)) for sublist in oc_foot_mouth['polarity']]
oc_foot_mouth['subj_filtered'] = [list(filter(lambda x: x != 0, sublist)) for sublist in oc_foot_mouth['subjectivity']]

In [37]:
import numpy as np
oc_foot_mouth['pol_mean'] = oc_foot_mouth['pol_filtered'].apply(np.mean)
oc_foot_mouth['subj_mean'] = oc_foot_mouth['subj_filtered'].apply(np.mean)

#oc_foot_mouth.drop('polarity', axis=1, inplace=True)
#oc_foot_mouth.drop('subjectivity', axis=1, inplace=True)
#oc_foot_mouth.drop('pol_filtered', axis=1, inplace=True)
#oc_foot_mouth.drop('subj_filtered', axis=1, inplace=True)
#oc_foot_mouth.drop('pol_mean', axis=1, inplace=True)
#oc_foot_mouth.drop('subj_mean', axis=1, inplace=True)
#oc_foot_mouth.drop('sent_tokenised', axis=1, inplace=True)

oc_foot_mouth[:10]

Unnamed: 0,Number,Filename,everything_else,Occupation,processed_text,sent_tokenised,polarity,subjectivity,pol_filtered,subj_filtered,pol_mean,subj_mean
0,0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6,"[information, diarist, date, birth, 1975, gend...",[\n\nInformation about diarist\nDate of birth:...,"[0.0, 0.125, 0.0, 0.0, 0.39999999999999997, 0....","[0.0, 0.375, 0.125, 0.17777777777777778, 0.466...","[0.125, 0.39999999999999997, 0.183333333333333...","[0.375, 0.125, 0.17777777777777778, 0.46666666...",0.116305,0.444695
1,1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...,Group 6,"[information, diarist, date, birth, 1966, gend...",[Information about diarist\nDate of birth: 196...,"[-0.19722222222222222, -0.3186111111111111, 0....","[0.3972222222222222, 0.5777777777777777, 0.0, ...","[-0.19722222222222222, -0.3186111111111111, 0....","[0.3972222222222222, 0.5777777777777777, 0.687...",0.082216,0.511248
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6,"[information, diarist, date, birth, 1964, gend...",[\n\nInformation about diarist\nDate of birth:...,"[0.11742424242424243, 0.0, 0.31145833333333334...","[0.28030303030303033, 0.0, 0.4166666666666667,...","[0.11742424242424243, 0.31145833333333334, 0.2...","[0.28030303030303033, 0.4166666666666667, 0.31...",0.197901,0.500509
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...,Group 6,"[information, diarist, date, birth, 1963, gend...",[Information about diarist\nDate of birth: 196...,"[0.3666666666666667, 0.5, 0.0, -0.051851851851...","[0.16666666666666666, 0.15, 0.0, 0.51851851851...","[0.3666666666666667, 0.5, -0.05185185185185181...","[0.16666666666666666, 0.15, 0.5185185185185186...",0.113457,0.481575
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...,Group 5,"[information, diarist, date, birth, 1981, gend...",[Information about diarist\nDate of birth: 198...,"[-0.0625, 0.5, -0.024999999999999994, 0.0, 0.0...","[0.3875, 0.5, 0.225, 0.0, 0.0, 0.25, 0.0, 0.25...","[-0.0625, 0.5, -0.024999999999999994, 0.25, 0....","[0.3875, 0.5, 0.225, 0.25, 0.25, 1.0, 0.333333...",0.148358,0.503865
5,5,5407diary10.rtf,Information about diarist\nDate of birth: 1937...,Group 5,"[information, diarist, date, birth, 1937, gend...",[Information about diarist\nDate of birth: 193...,"[0.0, -0.13333333333333333, 0.0, -0.1, 0.0, -0...","[0.0, 0.4666666666666666, 0.0, 0.0333333333333...","[-0.13333333333333333, -0.1, -0.14, 0.4, -0.06...","[0.4666666666666666, 0.03333333333333333, 1.0,...",0.085418,0.454015
6,6,5407diary13.rtf,Information about diarist\nDate of birth: 1947...,Group 5,"[information, diarist, date, birth, 1947, gend...",[Information about diarist\nDate of birth: 194...,"[0.0, -0.5, 0.0, 0.0, -0.1875, 0.375, 0.0, 0.5...","[0.0, 1.0, 0.0, 0.5333333333333333, 0.3125, 0....","[-0.5, -0.1875, 0.375, 0.5, -0.324999999999999...","[1.0, 0.5333333333333333, 0.3125, 0.75, 0.5, 0...",0.165448,0.487512
7,7,5407diary14.rtf,\nInformation about diarist\nDate of birth: 19...,Group 5,"[information, diarist, date, birth, 1964, gend...",[\nInformation about diarist\nDate of birth: 1...,"[0.0, 0.2, 0.10000000000000002, 0.0, 0.0499999...","[0.0, 0.85, 0.3833333333333333, 0.0, 0.2499999...","[0.2, 0.10000000000000002, 0.04999999999999999...","[0.85, 0.3833333333333333, 0.24999999999999997...",0.123176,0.492388
8,8,5407diary15.rtf,Information about diarist\nDate of birth: 1949...,Group 5,"[information, diarist, date, birth, 1949, gend...",[Information about diarist\nDate of birth: 194...,"[0.0, 0.0, 0.2875, 0.0, 0.3509259259259259, 0....","[0.3333333333333333, 0.0, 0.8, 0.0, 0.62962962...","[0.2875, 0.3509259259259259, -0.17121212121212...","[0.3333333333333333, 0.8, 0.6296296296296297, ...",0.158254,0.521919
9,9,5407diary16.rtf,\nInformation about diarist\nDate of birth: 19...,Group 5,"[information, diarist, date, birth, 1951, gend...",[\nInformation about diarist\nDate of birth: 1...,"[0.0, 0.0, 0.020000000000000007, 0.425, 0.0, 0...","[0.0, 0.06666666666666667, 0.3983333333333333,...","[0.020000000000000007, 0.425, 0.125, 0.3, 0.37...","[0.06666666666666667, 0.3983333333333333, 1.0,...",0.117394,0.474066


In [89]:
group1_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 1']
group4_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 4']
group2_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 2']
group3_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 3']
group5_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 5']
group6_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 6']

In [39]:
print("Overall Score averages are: ")

print(oc_foot_mouth["pol_mean"].mean())
print(oc_foot_mouth["subj_mean"].mean())

print(" ")

print(" Group 1 averages are: ")
print(group1_foot["pol_mean"].mean())
print(group1_foot["subj_mean"].mean())

print(" ")
print(" Group 4 averages are: ")
print(group4_foot["pol_mean"].mean())
print(group4_foot["subj_mean"].mean())

print(" ")
print(" Other averages are:")

print(" ")
print("Group2:")
print(group2_foot["pol_mean"].mean())
print(group2_foot["subj_mean"].mean())

print(" ")

print("Group3:")
print(group3_foot["pol_mean"].mean())
print(group3_foot["subj_mean"].mean())

print(" ")

print("Group5:")
print(group5_foot["pol_mean"].mean())
print(group5_foot["subj_mean"].mean())

print(" ")
print("Group6:")
print(group6_foot["pol_mean"].mean())
print(group6_foot["subj_mean"].mean())

Overall Score averages are: 
0.08458962440621778
0.4730428549538535
 
 Group 1 averages are: 
0.07660359635291054
0.4556586795690575
 
 Group 4 averages are: 
0.08588717471431488
0.4728307087663699
 
 Other averages are:
 
Group2:
0.08211729314421548
0.4786862218230877
 
Group3:
0.07161054306789462
0.4738004165942562
 
Group5:
0.09657965994035568
0.48024687682273853
 
Group6:
0.09659313260314527
0.4737395089878944


Ok, so it seems as if the unexpected results might just be due to the dataset and not an analysis error. But just to be safe we will also try another sentiment analysis tool.

## Using VADER

It would appear that VADER is generally a more accurate sentiment analysis tool (see [here](https://investigate.ai/investigating-sentiment-analysis/comparing-sentiment-analysis-tools/)).

So we will try our analysis with this tool instead! First step is importing modules!

In [81]:
import nltk
nltk.download('vader_lexicon')
nltk.download('movie_reviews')
nltk.download('punkt')

from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
sia = SIA()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\L_Pel\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\L_Pel\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\L_Pel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


(First, let's see how it works on trivial data.)

In [82]:
print(sia.polarity_scores("Textblob is just super. I love it!"))
print(sia.polarity_scores("Cabbages are the worst. Say no to cabbages!"))
print(sia.polarity_scores("Paris is the capital of France"))

{'neg': 0.0, 'neu': 0.323, 'pos': 0.677, 'compound': 0.8553}
{'neg': 0.524, 'neu': 0.476, 'pos': 0.0, 'compound': -0.7644}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}


**Things to Note:**

1) VADER works better WITH punctuation, e.g. "!" intensifies the emotion of the sentence

2) Capitalisation matters too, e.g. HOT > hot

3) KEEP DEGREE MODIFIERS AND NEGATIVES!

Therefore we will try this first on the RAW data!

### On the Raw Data

In [None]:
def get_scores(content):
    sia_scores = sia.polarity_scores(content)
    return sia_scores['compound']

oc_foot_mouth["Compound"] = oc_foot_mouth['everything_else'].apply(lambda x : get_scores(x))
oc_foot_mouth[:10]

Uhh... again, it is really strange for the sentiments to be so strong. Let's take a deeper look at what's going on here.

In [61]:
oc_foot_mouth["VADER"] = oc_foot_mouth['everything_else'].apply(lambda x : sia.polarity_scores(x))
oc_foot_mouth.loc[10, 'VADER']

{'neg': 0.056, 'neu': 0.803, 'pos': 0.141, 'compound': 1.0}

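Ok, so it looks like the compound score is not a useful summary for a whole diary: VADER sums the valence of the words and then squashes the total into the range -1 to 1, so a long document saturates at (or very near) the extremes. We will have to score smaller pieces of text and average them manually instead.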
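
To see why a whole diary ends up with a compound score of 1.0, it helps to look at how the compound value is produced. As far as we understand the VADER implementation, the summed word valences are normalised with x / sqrt(x² + α), with α = 15 by default, and the result is rounded to four decimal places. The sketch below re-implements only that normalisation step (the function name `normalise` and the example sums are ours, chosen for illustration) to show how quickly it saturates as the raw sum grows.

```python
import math

def normalise(raw_sum, alpha=15):
    """Squash a raw valence sum into the range -1..1, VADER-style."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

# Hypothetical raw sums: a short sentence, a paragraph, a long diary.
for raw_sum in (2, 20, 2000):
    print(raw_sum, round(normalise(raw_sum), 4))
```

A long, mildly positive diary can therefore score the same as a glowing one, which is why averaging sentence-level scores is more informative here.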

### On Sentence Tokenised data

We will first try analysing by sentence (the closest to the RAW dataset), using the same method as with TextBlob.

In [83]:
oc_foot_mouth['sent_tokenised'] = oc_foot_mouth.apply(lambda row: nltk.sent_tokenize(row['everything_else']), axis=1)

def get_scores(content):
    sia_scores = sia.polarity_scores(content)
    return sia_scores['compound']

In [87]:
#Calculating polarities and filtering out the 0.0 values
oc_foot_mouth['sent_vader'] = oc_foot_mouth['sent_tokenised'].apply(lambda x: [get_scores(y) for y in x])
oc_foot_mouth['filtered'] = [list(filter(lambda x: x != 0, sublist)) for sublist in oc_foot_mouth['sent_vader']]

# Finding the mean of both the filtered and unfiltered polarities
oc_foot_mouth['vad_mean'] = oc_foot_mouth['sent_vader'].apply(np.mean) 
oc_foot_mouth['filt_vad_mean'] = oc_foot_mouth['filtered'].apply(np.mean) 

oc_foot_mouth[:10]

In [73]:
print(oc_foot_mouth['vad_mean'].mean())
print(oc_foot_mouth['filt_vad_mean'].mean())

0.06949687176063819
0.1150819329965901


### On Fully-Processed Data

And now calculating the means on the original, fully processed data.

In [None]:
#Calculating polarities and filtering out the 0.0 values
oc_foot_mouth['proc_vader'] = oc_foot_mouth['processed_text'].apply(lambda x: [get_scores(y) for y in x])
oc_foot_mouth['proc_filtered'] = [list(filter(lambda x: x != 0, sublist)) for sublist in oc_foot_mouth['proc_vader']]

# Finding the mean of both the filtered and unfiltered polarities
oc_foot_mouth['proc_vad_mean'] = oc_foot_mouth['proc_vader'].apply(np.mean)
oc_foot_mouth['proc_filt_vad_mean'] = oc_foot_mouth['proc_filtered'].apply(np.mean)

oc_foot_mouth[:10]

In [None]:
print(oc_foot_mouth['proc_vad_mean'].mean())
print(oc_foot_mouth['proc_filt_vad_mean'].mean())

## VADER Comparisons

And finally, comparing all of the VADER sentiment analyses done so far!

In [90]:
# Sentence Data (unfiltered)

print("Sentence Unfiltered Score averages are: ")

print(oc_foot_mouth["vad_mean"].mean())

print(" ")

print(" Group 1 averages are: ")
print(group1_foot["vad_mean"].mean())


print(" ")
print(" Group 4 averages are: ")
print(group4_foot["vad_mean"].mean())


print(" ")
print(" Other averages are:")

print(" ")
print("Group2:")
print(group2_foot["vad_mean"].mean())


print(" ")

print("Group3:")
print(group3_foot["vad_mean"].mean())

print(" ")

print("Group5:")
print(group5_foot["vad_mean"].mean())

print(" ")
print("Group6:")
print(group6_foot["vad_mean"].mean())

Sentence Unfiltered Score averages are: 
0.06949687176063819
 
 Group 1 averages are: 
0.06209566099468896
 
 Group 4 averages are: 
0.06781730023153121
 
 Other averages are:
 
Group2:
0.06753245052948226
 
Group3:
0.07133954471656889
 
Group5:
0.0731050334750517
 
Group6:
0.07491307704990476


In [91]:
# Sentence Data (filtered)

print("Sentence Filtered Score averages are: ")

print(oc_foot_mouth['filt_vad_mean'].mean())

print(" ")

print(" Group 1 averages are: ")
print(group1_foot['filt_vad_mean'].mean())


print(" ")
print(" Group 4 averages are: ")
print(group4_foot['filt_vad_mean'].mean())


print(" ")
print(" Other averages are:")

print(" ")
print("Group2:")
print(group2_foot['filt_vad_mean'].mean())


print(" ")

print("Group3:")
print(group3_foot["filt_vad_mean"].mean())

print(" ")

print("Group5:")
print(group5_foot['filt_vad_mean'].mean())

print(" ")
print("Group6:")
print(group6_foot['filt_vad_mean'].mean())

Sentence Filtered Score averages are: 
0.1150819329965901
 
 Group 1 averages are: 
0.12500756430242138
 
 Group 4 averages are: 
0.10790695828077973
 
 Other averages are:
 
Group2:
0.10536974116136351
 
Group3:
0.11395831371874353
 
Group5:
0.12239513557360256
 
Group6:
0.11421255089606336


In [92]:
# Processed Data (unfiltered)

print("Processed Unfiltered Score averages are: ")

print(oc_foot_mouth['proc_vad_mean'].mean())

print(" ")

print(" Group 1 averages are: ")
print(group1_foot['proc_vad_mean'].mean())


print(" ")
print(" Group 4 averages are: ")
print(group4_foot['proc_vad_mean'].mean())


print(" ")
print(" Other averages are:")

print(" ")
print("Group2:")
print(group2_foot['proc_vad_mean'].mean())


print(" ")

print("Group3:")
print(group3_foot['proc_vad_mean'].mean())

print(" ")

print("Group5:")
print(group5_foot['proc_vad_mean'].mean())

print(" ")
print("Group6:")
print(group6_foot['proc_vad_mean'].mean())

Processed Unfiltered Score averages are: 
0.010651155968645912
 
 Group 1 averages are: 
0.010430390635996496
 
 Group 4 averages are: 
0.009178703520318595
 
 Other averages are:
 
Group2:
0.011188353932942423
 
Group3:
0.010140732881883252
 
Group5:
0.011453328422140465
 
Group6:
0.012082437170911862


In [93]:
# Processed Data (filtered)

print("Processed Filtered Score averages are: ")

print(oc_foot_mouth['proc_filt_vad_mean'].mean())

print(" ")

print(" Group 1 averages are: ")
print(group1_foot['proc_filt_vad_mean'].mean())


print(" ")
print(" Group 4 averages are: ")
print(group4_foot['proc_filt_vad_mean'].mean())


print(" ")
print(" Other averages are:")

print(" ")
print("Group2:")
print(group2_foot['proc_filt_vad_mean'].mean())


print(" ")

print("Group3:")
print(group3_foot['proc_filt_vad_mean'].mean())

print(" ")

print("Group5:")
print(group5_foot['proc_filt_vad_mean'].mean())

print(" ")
print("Group6:")
print(group6_foot['proc_filt_vad_mean'].mean())

Processed Filtered Score averages are: 
0.09082551868902182
 
 Group 1 averages are: 
0.09741171618892061
 
 Group 4 averages are: 
0.07716526516106408
 
 Other averages are:
 
Group2:
0.09077748021238714
 
Group3:
0.08688578256118008
 
Group5:
0.10130420217906953
 
Group6:
0.09098646502316536
