![UKDS Logo](images/UKDS_Logos_Col_Grey_300dpi.png)

# Text-mining: Classifiers and sentiment analysis

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Sentiment-Analysis" data-toc-modified-id="Sentiment-Analysis-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Sentiment Analysis</a></span><ul class="toc-item"><li><span><a href="#Sentiment-Analysis-Tools" data-toc-modified-id="Sentiment-Analysis-Tools-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Sentiment Analysis Tools</a></span></li></ul></li><li><span><a href="#Foot-and-mouth-dataset-analysis-with-VADER" data-toc-modified-id="Foot-and-mouth-dataset-analysis-with-VADER-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Foot and mouth dataset analysis with VADER</a></span><ul class="toc-item"><li><span><a href="#Filtering-the-Polarity-scores" data-toc-modified-id="Filtering-the-Polarity-scores-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Filtering the Polarity scores</a></span></li><li><span><a href="#Analysis-on-Filtered-Sentiments" data-toc-modified-id="Analysis-on-Filtered-Sentiments-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Analysis on Filtered Sentiments</a></span><ul class="toc-item"><li><span><a href="#Overall-Polarity-Average" data-toc-modified-id="Overall-Polarity-Average-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>Overall Polarity Average</a></span></li><li><span><a href="#Sentiment-Analysis-by-Occupation" data-toc-modified-id="Sentiment-Analysis-by-Occupation-3.2.2"><span class="toc-item-num">3.2.2&nbsp;&nbsp;</span>Sentiment Analysis by Occupation</a></span></li></ul></li></ul></li><li><span><a href="#Conclusions" data-toc-modified-id="Conclusions-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#VADER-Comparisons" data-toc-modified-id="VADER-Comparisons-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>VADER Comparisons</a></span></li></ul></div>


There is a table of contents provided here at the top of the notebook, but you can also access this menu at any point by clicking the Table of Contents button on the top toolbar (an icon with four horizontal bars, if unsure hover your mouse over the buttons). 

## Introduction

Now that we have finished processing the data and put it into a nice, easy to read format it's onto the extraction! But first let's get all the necessary processing summary code ran.

In [1]:
# It is good practice to always start by importing the modules and packages you will need. 

import os                         # os is a module for navigating your machine (e.g., file directories).
import nltk                       # nltk stands for natural language tool kit and is useful for text-mining. 
import re                         # re is for regular expressions, which we use later 
import pandas as pd               # we need pandas to import the foot_mouth_original.xls file
! pip install xlrd                # apparently we also need xlrd to read the .xls file because pandas is not old school
import xlrd                       # le sigh

nltk.download('punkt')
from nltk import word_tokenize    # importing the word_tokenize function from nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

!pip install autocorrect           # Spellchecker
from autocorrect import Speller

from nltk.corpus import wordnet                    # Finally, things we need for lemmatising!
from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')        # Like a POS-tagger...

print("Succesfully imported necessary modules")    # The print statement is just a bit of encouragement!

foot_mouth_df = pd.read_csv ('../code/data/foot_mouth/text.csv')
print (foot_mouth_df[:10])

ERROR: Invalid requirement: '#'
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\L_Pel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\L_Pel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Succesfully imported necessary modules
   Unnamed: 0                0  \
0           0  5407diary02.rtf   
1           1  5407diary03.rtf   
2           2  5407diary07.rtf   
3           3  5407diary08.rtf   
4           4  5407diary09.rtf   
5           5  5407diary10.rtf   
6           6  5407diary13.rtf   
7           7  5407diary14.rtf   
8           8  5407diary15.rtf   
9           9  5407diary16.rtf   

                                                   1  
0  \n\nInformation about diarist\nDate of birth: ...  
1  Information about diarist\nDate of birth: 1966...  
2  \n\nInformation about diarist\nDate of birth: ...  
3  Information about diarist\nDate of birth: 1963...  
4  Information about diarist\nDate of birth: 1981...  
5  Information about diarist\nDate of birth: 1937...  
6  Information about diarist\nDate of birth: 1947...  
7  \nInformation about diarist\nDate of birth: 19...  
8  Information about diarist\nDate of birth: 1949...  
9  \nInformation about diarist\nDate

ERROR: Invalid requirement: '#'
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\L_Pel\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
# Renaming Columns
foot_mouth_df = pd.read_csv ('../code/data/foot_mouth/text.csv') 

foot_mouth_df.columns = ["Number", "Filename", "everything_else"]
print(foot_mouth_df.head())

print("Columns renamed :)")
print(" ")

#Creating New Dataframe with Occupation Column
oc_foot_mouth = foot_mouth_df.assign(Occupation = foot_mouth_df['everything_else'].str.extract(r'(\w+\s+\d{1,2})'))

print(oc_foot_mouth.head()) # checking if it worked!

print(" Occupation Dataframe Created!")

print(" ")
print(" EVERYTHING READY! :)")
print(" ")

   Number         Filename                                    everything_else
0       0  5407diary02.rtf  \n\nInformation about diarist\nDate of birth: ...
1       1  5407diary03.rtf  Information about diarist\nDate of birth: 1966...
2       2  5407diary07.rtf  \n\nInformation about diarist\nDate of birth: ...
3       3  5407diary08.rtf  Information about diarist\nDate of birth: 1963...
4       4  5407diary09.rtf  Information about diarist\nDate of birth: 1981...
Columns renamed :)
 
   Number         Filename                                    everything_else  \
0       0  5407diary02.rtf  \n\nInformation about diarist\nDate of birth: ...   
1       1  5407diary03.rtf  Information about diarist\nDate of birth: 1966...   
2       2  5407diary07.rtf  \n\nInformation about diarist\nDate of birth: ...   
3       3  5407diary08.rtf  Information about diarist\nDate of birth: 1963...   
4       4  5407diary09.rtf  Information about diarist\nDate of birth: 1981...   

  Occupation  
0    Grou

In [3]:
#Tokenising by word
foot_mouth_df['tokenised_words'] = foot_mouth_df.apply(lambda row: nltk.word_tokenize(row['everything_else']), axis=1)

#Removing Uppercase
foot_mouth_df['txt_lower'] = foot_mouth_df['tokenised_words'].apply(lambda x: [w.lower() for w in x])

#Correcting Spelling
spell = Speller(lang='en')

foot_mouth_df['spell_checked'] = foot_mouth_df['txt_lower'].apply(lambda x: [spell(w) for w in x])

def replacement_mapping(x):
        if x == "der":
            return re.sub("der","defra",x)
        else:
            return x  

def replacement_mapping_2(x):
        if x == "ffd":
            return re.sub("ffd","fmd",x)
        else:
            return x  
        
foot_mouth_df["spell_checked"] = foot_mouth_df["spell_checked"].apply(lambda x:[replacement_mapping(w) for w in x])
foot_mouth_df["spell_checked"] = foot_mouth_df["spell_checked"].apply(lambda x:[replacement_mapping_2(w) for w in x])

#Removing Punctuation and getting rid of resulting space
English_punctuation = "!\"#$%&()*+,./:;<=>?@[\]^_`{|}~“”-"      # Define a variable with all the punctuation to remove.
print(English_punctuation)                                     # Print that defined variable, just to check it is correct.
print("...") 


def remove_punctuation(from_text):                           # Had to define a function to iterate over the strings in a row
    table = str.maketrans('', '', English_punctuation)       # The python function 'maketrans' creates a table that maps
    stripped = [w.translate(table) for w in from_text]        # the punctation marks to 'None'. Print the table to check. 
    return stripped

foot_mouth_df['no_punct'] = [remove_punctuation(i) for i in foot_mouth_df['spell_checked']] # Iterating above function to each

foot_mouth_df['no_punct_no_space'] = [list(filter(None, sublist)) for sublist in foot_mouth_df['no_punct']]

#POS Tagging and Lemmatisation

foot_mouth_df['pos_tag'] = foot_mouth_df['no_punct_no_space'].apply(lambda x: nltk.pos_tag(x))

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer() 
foot_mouth_df['lemmatised'] = foot_mouth_df['pos_tag'].apply(lambda x: [lemmatizer.lemmatize(y[0], get_wordnet_pos(y[0])) for y in x])

#Removing Stop words
stop_words = set(stopwords.words('english'))
                                                                                                                   
new_stopwords =[e for e in stop_words if e not in ("aren't","couldn't","didn't","doesn't","don't","hadn't","hasn't","haven't","isn't","mightn't","mustn't","needn't",'no','not',"only","shouldn't","wasn't","weren't","won't","wouldn't")]
foot_mouth_df['no_stop_words'] = foot_mouth_df['lemmatised'].apply(lambda x: [item for item in x if item not in new_stopwords])                        

print("All done!")

!"#$%&()*+,./:;<=>?@[\]^_`{|}~“”-
...
All done!


In [4]:
# Appending final processed data column onto the oc_foot_mouth Dataframe

processed = foot_mouth_df['no_stop_words']

oc_foot_mouth = oc_foot_mouth.join(processed)
oc_foot_mouth.rename(columns = {'no_stop_words':'processed_text'}, inplace = True)

print(foot_mouth_df.head())
print(" ")
oc_foot_mouth.head()

print(" ")
print("All Set for Extraction!")

   Number         Filename                                    everything_else  \
0       0  5407diary02.rtf  \n\nInformation about diarist\nDate of birth: ...   
1       1  5407diary03.rtf  Information about diarist\nDate of birth: 1966...   
2       2  5407diary07.rtf  \n\nInformation about diarist\nDate of birth: ...   
3       3  5407diary08.rtf  Information about diarist\nDate of birth: 1963...   
4       4  5407diary09.rtf  Information about diarist\nDate of birth: 1981...   

                                     tokenised_words  \
0  [Information, about, diarist, Date, of, birth,...   
1  [Information, about, diarist, Date, of, birth,...   
2  [Information, about, diarist, Date, of, birth,...   
3  [Information, about, diarist, Date, of, birth,...   
4  [Information, about, diarist, Date, of, birth,...   

                                           txt_lower  \
0  [information, about, diarist, date, of, birth,...   
1  [information, about, diarist, date, of, birth,...   
2  [info

In [5]:
oc_foot_mouth.loc[:5]

Unnamed: 0,Number,Filename,everything_else,Occupation,processed_text
0,0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6,"[information, diarist, date, birth, 1975, gend..."
1,1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...,Group 6,"[information, diarist, date, birth, 1966, gend..."
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6,"[information, diarist, date, birth, 1964, gend..."
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...,Group 6,"[information, diarist, date, birth, 1963, gend..."
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...,Group 5,"[information, diarist, date, birth, 1981, gend..."
5,5,5407diary10.rtf,Information about diarist\nDate of birth: 1937...,Group 5,"[information, diarist, date, birth, 1937, gend..."


## Sentiment Analysis

The first (and main) step in the extraction process will be Sentiment Analysis. Sentiment analysis is a commonly used example of automatic classification. To be clear, automatic classification means that a model or learning algorithm has been trained on correctly classified documents and it uses this training to return a probability assessment of what class a new document should belong to. 

Sentiment analysis works the same way, but usually only has two classes - positive and negative. A trained model looks at new data and says whether that new data is likely to be positive or negative and this is what we will be using to conduct our analysis. Let's take a look!

### Sentiment Analysis Tools

Let's start off by importing and downloading some useful packages, including `VADER`: it is based on `nltk` and has built in sentiment analysis tools. 

To import the packages, click in the code cell below and hit the 'Run' button at the top of this page or by holding down the 'Shift' key and hitting the 'Enter' key. 

For the rest of this notebook, I will use 'Run/Shift+Enter' as short hand for 'click in the code cell below and hit the 'Run' button at the top of this page or by hold down the 'Shift' key while hitting the 'Enter' key'. 

Run/Shift+Enter.

In [98]:
import os                         # os is a module for navigating your machine (e.g., file directories).
import nltk                       # nltk stands for natural language tool kit and is useful for text-mining. 
import csv                        # csv is for importing and working with csv files
import statistics

In [99]:
nltk.download('vader_lexicon')
nltk.download('movie_reviews')
nltk.download('punkt')

from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
sia = SIA()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\L_Pel\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\L_Pel\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\L_Pel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


VADER works by giving the negative `neg` score, neutral `neu` score, and positive `pos` score along with the overal `compound` polarity score! (see example below)

In [100]:
print(sia.polarity_scores("Textblob is just super. I love it!"))
print(sia.polarity_scores("Cabbages are the worst. Say no to cabbages!"))
print(sia.polarity_scores("Paris is the capital of France"))

{'neg': 0.0, 'neu': 0.323, 'pos': 0.677, 'compound': 0.8553}
{'neg': 0.524, 'neu': 0.476, 'pos': 0.0, 'compound': -0.7644}
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}


In order to then find JUST the overall polarity, you can create and use the following `get_scores` function

In [101]:
def get_scores(content):
    sia_scores = sia.polarity_scores(content)
    return sia_scores['compound']

print(get_scores("Textblob is just super. I love it!"))

0.8553


## Foot and mouth dataset analysis with VADER

Super. Now let's do this with our foot and mouth Pandas DataFrame. For this we will be looking at the polarity of Groups 1 and 4 in comparison to the average across occupations!

First, we will use the previously mentioned `get_scores` function to calculate the polarity of each file/row

In [107]:
def get_scores(content):
    sia_scores = sia.polarity_scores(content)
    return sia_scores['compound']

oc_foot_mouth['polarities'] = oc_foot_mouth['processed_text'].apply(lambda x : [get_scores(y)for y in x])

oc_foot_mouth[:10]


Unnamed: 0,Number,Filename,everything_else,Occupation,processed_text,polarities
0,0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6,"[information, diarist, date, birth, 1975, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...,Group 6,"[information, diarist, date, birth, 1966, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6,"[information, diarist, date, birth, 1964, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...,Group 6,"[information, diarist, date, birth, 1963, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...,Group 5,"[information, diarist, date, birth, 1981, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
5,5,5407diary10.rtf,Information about diarist\nDate of birth: 1937...,Group 5,"[information, diarist, date, birth, 1937, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
6,6,5407diary13.rtf,Information about diarist\nDate of birth: 1947...,Group 5,"[information, diarist, date, birth, 1947, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
7,7,5407diary14.rtf,\nInformation about diarist\nDate of birth: 19...,Group 5,"[information, diarist, date, birth, 1964, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
8,8,5407diary15.rtf,Information about diarist\nDate of birth: 1949...,Group 5,"[information, diarist, date, birth, 1949, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
9,9,5407diary16.rtf,\nInformation about diarist\nDate of birth: 19...,Group 5,"[information, diarist, date, birth, 1951, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [109]:
#Looking more closely at the polarity values in one file 
oc_foot_mouth.loc[1,'polarities']

[0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.1027,
 0.0,
 0.3818,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.5719,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.4767,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.1779,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.5719,
 0.0,
 0.0,
 0.0,
 -0.3818,
 0.4019,
 0.0,
 0.0,
 0.3612,
 0.0,
 0.0,
 0.0,
 0.0,
 0.3182,
 0.25,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.128,
 0.0,
 0.0,
 -0.4767,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 -0.3612,
 -0.4019,
 0.0,
 0.4215,
 -0.4019,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,

Ok so it looks like there is an issue with there being too many neutral words in this dataset that could skew the results (for example whole phrases such as " Information about diarist" or "date of birth" that are just documentations and do not have anything to do with the actual responses submitted by the participants). So we will need to filter out all of these neutral scores!

### Filtering the Polarity scores

In [110]:
oc_foot_mouth['filtered'] = [list(filter(lambda x: x != 0, sublist)) for sublist in oc_foot_mouth['polarities']]
oc_foot_mouth[:10]

Unnamed: 0,Number,Filename,everything_else,Occupation,processed_text,polarities,filtered
0,0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6,"[information, diarist, date, birth, 1975, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.296, 0.4588, 0.2732, -0.296, -0.6249, 0.361..."
1,1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...,Group 6,"[information, diarist, date, birth, 1966, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.1027, 0.3818, -0.5719, -0.4767, -0.1779, -..."
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6,"[information, diarist, date, birth, 1964, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.4404, 0.3182, 0.296, -0.3612, -0.296, -0.54..."
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...,Group 6,"[information, diarist, date, birth, 1963, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.6369, 0.6369, 0.128, 0.4404, -0.5423, -0.31..."
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...,Group 5,"[information, diarist, date, birth, 1981, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.128, 0.2732, 0.0772, -0.4404, 0.25, 0.0772..."
5,5,5407diary10.rtf,Information about diarist\nDate of birth: 1937...,Group 5,"[information, diarist, date, birth, 1937, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.4215, -0.5859, -0.4588, -0.296, 0.2023, -0..."
6,6,5407diary13.rtf,Information about diarist\nDate of birth: 1947...,Group 5,"[information, diarist, date, birth, 1947, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.3612, -0.4019, 0.4939, 0.4939, 0.3182, -0...."
7,7,5407diary14.rtf,\nInformation about diarist\nDate of birth: 19...,Group 5,"[information, diarist, date, birth, 1964, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.4939, 0.3818, 0.5719, 0.4019, -0.5423, 0.0..."
8,8,5407diary15.rtf,Information about diarist\nDate of birth: 1949...,Group 5,"[information, diarist, date, birth, 1949, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.4215, 0.4404, 0.4019, 0.4404, -0.4019, 0.40..."
9,9,5407diary16.rtf,\nInformation about diarist\nDate of birth: 19...,Group 5,"[information, diarist, date, birth, 1951, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.25, -0.4939, 0.4939, 0.4767, 0.4404, 0.401..."


So much better! Now lets do the analysis on these new scores!

### Analysis on Filtered Sentiments

For this we will just be calculating the means of the scores! First though we will need to import numpy so we can use its mean function!

In [111]:
import numpy as np

Now we will be creating a new column for the mean polarity score for each row/file - this will make thngs easier for the next steps!

In [122]:
oc_foot_mouth['pol_mean'] = oc_foot_mouth['filtered'].apply(np.mean)
oc_foot_mouth[:10]

Unnamed: 0,Number,Filename,everything_else,Occupation,processed_text,polarities,filtered,pol_mean
0,0,5407diary02.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6,"[information, diarist, date, birth, 1975, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.296, 0.4588, 0.2732, -0.296, -0.6249, 0.361...",0.095934
1,1,5407diary03.rtf,Information about diarist\nDate of birth: 1966...,Group 6,"[information, diarist, date, birth, 1966, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.1027, 0.3818, -0.5719, -0.4767, -0.1779, -...",0.020041
2,2,5407diary07.rtf,\n\nInformation about diarist\nDate of birth: ...,Group 6,"[information, diarist, date, birth, 1964, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.4404, 0.3182, 0.296, -0.3612, -0.296, -0.54...",0.195452
3,3,5407diary08.rtf,Information about diarist\nDate of birth: 1963...,Group 6,"[information, diarist, date, birth, 1963, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.6369, 0.6369, 0.128, 0.4404, -0.5423, -0.31...",0.079971
4,4,5407diary09.rtf,Information about diarist\nDate of birth: 1981...,Group 5,"[information, diarist, date, birth, 1981, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.128, 0.2732, 0.0772, -0.4404, 0.25, 0.0772...",0.14399
5,5,5407diary10.rtf,Information about diarist\nDate of birth: 1937...,Group 5,"[information, diarist, date, birth, 1937, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.4215, -0.5859, -0.4588, -0.296, 0.2023, -0...",-0.030438
6,6,5407diary13.rtf,Information about diarist\nDate of birth: 1947...,Group 5,"[information, diarist, date, birth, 1947, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.3612, -0.4019, 0.4939, 0.4939, 0.3182, -0....",0.073972
7,7,5407diary14.rtf,\nInformation about diarist\nDate of birth: 19...,Group 5,"[information, diarist, date, birth, 1964, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.4939, 0.3818, 0.5719, 0.4019, -0.5423, 0.0...",0.10303
8,8,5407diary15.rtf,Information about diarist\nDate of birth: 1949...,Group 5,"[information, diarist, date, birth, 1949, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.4215, 0.4404, 0.4019, 0.4404, -0.4019, 0.40...",0.192486
9,9,5407diary16.rtf,\nInformation about diarist\nDate of birth: 19...,Group 5,"[information, diarist, date, birth, 1951, gend...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[-0.25, -0.4939, 0.4939, 0.4767, 0.4404, 0.401...",0.101971


Interesting. It appears that sentiments were neutral overall even with them being fairly subjective. This is quite interesting given how sad the topic is!

#### Overall Polarity Average

In [114]:
print(oc_foot_mouth['pol_mean'].mean())

0.09082551868902182


#### Sentiment Analysis by Occupation

Now let's do our sentiment analysis on Groups 1 and 4.

First we will use the same `==` operator used in the pre-processing stage to filter the dataframe by Occupation

In [115]:
group1_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 1']
group4_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 4']

Doing a quick check to see if the Dataframe has been filtered correctly...

In [116]:
#For group 1
print(group1_foot.loc[:])

print(" ")
print("length is " + str(len(group1_foot))) # length should be 13

    Number         Filename  \
33      33  5407diary47.rtf   
34      34  5407diary48.rtf   
35      35  5407diary49.rtf   
36      36  5407diary52.rtf   
37      37  5407diary53.rtf   
38      38  5407diary54.rtf   
40      40     5407fg01.rtf   
80      80    5407int47.rtf   
81      81    5407int48.rtf   
82      82    5407int49.rtf   
83      83    5407int52.rtf   
84      84    5407int53.rtf   
85      85    5407int54.rtf   

                                      everything_else Occupation  \
33  \nInformation about diarist\nDate of birth: 19...    Group 1   
34  \nInformation about diarist\nDate of birth: 19...    Group 1   
35  \t\nInformation about diarist\nDate of birth: ...    Group 1   
36  \nInformation about diarist\nDate of birth: 19...    Group 1   
37  \nInformation about diarist\nDate of birth: 19...    Group 1   
38  \nInformation about diarist\nDate of birth: 19...    Group 1   
40  \nGroups Discussion with Members of  Farmers F...    Group 1   
80  \nDate of Intervi

In [117]:
#For group 4
print(group4_foot.loc[:])

print(" ")
print("length is " + str(len(group4_foot))) # length should be 16

    Number         Filename  \
12      12  5407diary19.rtf   
13      13  5407diary21.rtf   
14      14  5407diary22.rtf   
15      15  5407diary23.rtf   
16      16  5407diary24.rtf   
17      17  5407diary26.rtf   
32      32  5407diary44.rtf   
43      43     5407fg04.rtf   
58      58    5407int19.rtf   
59      59    5407int20.rtf   
60      60    5407int21.rtf   
61      61    5407int22.rtf   
62      62    5407int23.rtf   
63      63    5407int24.rtf   
64      64    5407int26.rtf   
79      79    5407int44.rtf   

                                      everything_else Occupation  \
12  \nInformation about diarist\nDate of birth: 19...    Group 4   
13  \nInformation about diarist\nDate of birth: 19...    Group 4   
14  \nInformation about diarist\nDate of birth: 19...    Group 4   
15  \nInformation about diarist\nDate of birth: 19...    Group 4   
16  \nInformation about diarist\nDate of birth: 19...    Group 4   
17  \nInformation about diarist\nDate of birth: 19...    Group 4

Perfect! Now let's do Sentiment analysis on these new filtered Dataframes!

In [118]:
print(" Group 1 averages are: " + str (group1_foot["pol_mean"].mean()))

print(" ")
print(" Group 4 averages are: " + str(group4_foot["pol_mean"].mean()))

 Group 1 averages are: 0.09741171618892061
 
 Group 4 averages are: 0.07716526516106408


Ok not bad! Although it does definitely look like things are not that much different to the general average!

Now for visualisations purposes I will get the averages for the other occupations.

In [119]:
group2_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 2']
group3_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 3']
group5_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 5']
group6_foot = oc_foot_mouth[oc_foot_mouth['Occupation'] == 'Group 6']

In [120]:
print(group2_foot["pol_mean"].mean())


print(" ")
print(group3_foot["pol_mean"].mean())


print(" ")
print(group5_foot["pol_mean"].mean())


print(" ")
print(group6_foot["pol_mean"].mean())

0.09077748021238714
 
0.08688578256118008
 
0.10130420217906953
 
0.09098646502316536


## Conclusions

Now that we have all of the proper values for everything we can summarise them all in one cell!

In [121]:
print("Overall Score averages are: ")

print(oc_foot_mouth["pol_mean"].mean())

print(" ")
print(" Group 1 averages are: ")
print(group1_foot["pol_mean"].mean())


print(" ")
print(" Group 4 averages are: ")
print(group4_foot["pol_mean"].mean())


print(" ")
print(" Other averages are:")

print(" ")
print("Group2:")
print(group2_foot["pol_mean"].mean())


print(" ")
print("Group3:")
print(group3_foot["pol_mean"].mean())


print(" ")
print("Group5:")
print(group5_foot["pol_mean"].mean())


print(" ")
print("Group6:")
print(group6_foot["pol_mean"].mean())

Overall Score averages are: 
0.09082551868902182
 
 Group 1 averages are: 
0.09741171618892061
 
 Group 4 averages are: 
0.07716526516106408
 
 Other averages are:
 
Group2:
0.09077748021238714
 
Group3:
0.08688578256118008
 
Group5:
0.10130420217906953
 
Group6:
0.09098646502316536


And finally we will save the resulting Dataframe into a csv file so that we can export it into an R notebook for summarisation, visualisation and statistical testing purposes!

In [123]:
os.getcwd()

'C:\\Users\\L_Pel\\OneDrive\\Documents\\GitHub\\text-mining-private\\code'

In [124]:
oc_foot_mouth.to_csv('C:/Users/L_Pel/OneDrive/Documents/GitHub/text-mining-private/code/data/foot_mouth_analysed.csv')