### Improving Sample for Test Data: a stricter polarity threshold keeping only subjective sentences

Here we create a new version of the test-data sample that aims to improve inter-rater reliability agreement on the sentiment of the answers/texts. In each answer/text, we keep only those sentences that TextBlob classifies as subjective and whose polarity score meets a stricter threshold (absolute value above 0.3).

Workflow:

1. Only keep those sentences that, in each answer/text, are subjective
2. Calculate a polarity score for each sentence in each text using VADER
3. Eliminate all scores that do not meet the (stricter) threshold for positivity/negativity
4. Calculate mean polarity score for the text

### 1. Imports and Set Up

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from ast import literal_eval

In [32]:
#### Set up working directory
os.chdir('/Users/alessia/Documents/DataScience/NLP_Project/Outputs')
cwd = os.getcwd()  # os.chdir() returns None, so read the path back explicitly

In [33]:
pd.set_option('display.max_colwidth', None)  # None disables truncation (pandas >= 1.0; -1 is deprecated)

### 2. Get Data

In [34]:
# Read in data, using literal_eval converters to parse the list-valued columns
testdata_sample = pd.read_csv("sa_q1_sample_testdata.csv", converters=dict(VDR_SA_scores_sents=literal_eval, 
                                                                           TB_SA_score_sents = literal_eval, 
                                                                           subjty_score_sents = literal_eval))

In [62]:
testdata_sample.dtypes;

In [36]:
# Fix the columns containing a mixture of strings and floats (NaN) due to pd.to_csv...

testdata_sample['only_subj_VDR_scores'] = testdata_sample['only_subj_VDR_scores'].map(lambda x: literal_eval(x) if isinstance(x, str) else [x])

testdata_sample['only_subj_TB_scores'] = testdata_sample['only_subj_TB_scores'].map(lambda x: literal_eval(x) if isinstance(x, str) else [x])
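The mixed types arise because list-valued cells are written to CSV as their string representation, while missing cells come back as float NaN. A minimal sketch of that round trip on a made-up two-row frame follows; the column names and the helper `parse_list_cell` are illustrative and not part of the project.

```python
import io
from ast import literal_eval

import numpy as np
import pandas as pd


def parse_list_cell(cell):
    """Turn a CSV cell back into a list; wrap non-strings (e.g. NaN) in a list."""
    return literal_eval(cell) if isinstance(cell, str) else [cell]


# Toy frame: one row holds a list of scores, the other is missing.
toy = pd.DataFrame({"id": [1, 2], "scores": [[0.5, -0.4], np.nan]})

# Write to an in-memory CSV and read it back, as to_csv / read_csv would on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Before parsing: a string for the list row, a float (NaN) for the missing row.
raw_types = reloaded["scores"].map(type).tolist()

# After parsing: every cell is a list.
reloaded["scores"] = reloaded["scores"].map(parse_list_cell)
parsed_types = reloaded["scores"].map(type).tolist()
```

Wrapping NaN in a one-element list keeps the column homogeneous, so later list-based helpers can be applied to every row.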

### 3. Import basic NLP functions

In [8]:
os.chdir('/Users/alessia/Documents/DataScience/textconsultations/')
cwd = os.getcwd()  # os.chdir() returns None, so read the path back explicitly

In [9]:
os.listdir()

['nlpfunctions', 'tutorial', 'README.md', '.git']

In [10]:
os.listdir('nlpfunctions');

In [11]:
import nlpfunctions.basic_NLP_functions as b_nlp



### 4. Calculate stricter mean polarity score for the only-subjective texts

In [37]:
testdata_sample.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1', 'Respondent ID',
       'Q1_census_methods',
       'Are you responding on behalf of an organisation, or as an individual?Response',
       'PublicSector', 'PrivateSector', 'OtherSectors', 'sent_tok_text',
       'VDR_SA_scores_sents', 'mean_VDR_SA_scores', 'VDR_polarity',
       'TB_SA_score_sents', 'TB_mean_SA_score', 'TB_polarity',
       'subjty_score_sents', 'Q1_only_subj_sents', 'only_subj_VDR_scores',
       'only_subj_mean_VDR_score', 'only_subj_VDR_polarity',
       'only_subj_TB_scores', 'only_subj_mean_TB_score',
       'only_subj_TB_polarity', 'strict_VDR_SA_scores_sents',
       'mean_strict_VDR_score', 'strict_VDR_polarity',
       'only_strict_polarity_sents', 'strict_sents_TB_scores',
       'strict_sent_mean_TB_score', 'strict_sent_TB_polarity'],
      dtype='object')

Sentence-tokenise texts

In [38]:
testdata_sample['subj_sent_tok_text'] = testdata_sample['Q1_only_subj_sents'].apply(lambda x: b_nlp.sent_tokenise_df(x))

#### Remove (i.e., set to NaN) all VADER polarity scores that do not meet the threshold
I.e., scores with -0.3 <= score <= 0.3 are removed.

In [39]:
testdata_sample['only_subj_VDR_scores'].map(type)[:5]

0    <class 'list'>
1    <class 'list'>
2    <class 'list'>
3    <class 'list'>
4    <class 'list'>
Name: only_subj_VDR_scores, dtype: object

In [41]:
testdata_sample['only_subj_strict_VDR_scores_sents'] = testdata_sample['only_subj_VDR_scores'].apply(lambda x: b_nlp.get_sentiment_stricter_threshold_df(x, polarity_threshold = 0.3))
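The helper `b_nlp.get_sentiment_stricter_threshold_df` is not shown in this notebook. As a rough sketch of what such a filter might do, assuming it replaces every score whose absolute value is at or below the threshold with NaN, consider the following; the function name `apply_polarity_threshold` is hypothetical.

```python
import numpy as np


def apply_polarity_threshold(scores, polarity_threshold=0.3):
    """Return a copy of `scores` where weak scores are replaced by NaN.

    A score is kept only if its absolute value exceeds the threshold;
    NaN inputs stay NaN.
    """
    scores = np.asarray(scores, dtype=float)
    return np.where(np.abs(scores) > polarity_threshold, scores, np.nan).tolist()


# Made-up sentence scores for one text.
filtered = apply_polarity_threshold([0.65, 0.1, -0.42, -0.3, float("nan")])
# 0.65 and -0.42 are kept; 0.1, -0.3 and the NaN input come out as NaN
```

Keeping NaN placeholders (rather than dropping entries) preserves the alignment between scores and sentences.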

In [45]:
testdata_sample[['only_subj_strict_VDR_scores_sents', 'only_subj_VDR_scores', 'strict_VDR_SA_scores_sents']];

#### Re-calculate text's mean polarity score using VADER

In [46]:
testdata_sample['mean_only_subj_strict_VDR_score'] = testdata_sample['only_subj_strict_VDR_scores_sents'].apply(lambda x: np.nanmean(x))


In [47]:
testdata_sample['only_subj_strict_VDR_polarity'] = testdata_sample['mean_only_subj_strict_VDR_score'].apply(lambda x: 'pos' if x > 0 else 'neg' if x < 0 else "")
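`np.nanmean` returns NaN and emits a `RuntimeWarning` when every score in a list is NaN, which is the case for texts where no sentence meets the threshold. One possible way to handle that case explicitly is sketched below; `safe_nanmean` and `label_polarity` are illustrative names, and the empty label for NaN means mirrors the behaviour of the cells above.

```python
import numpy as np


def safe_nanmean(scores):
    """Mean of the non-NaN scores, or NaN (without a warning) if there are none."""
    scores = np.asarray(scores, dtype=float)
    if scores.size == 0 or np.isnan(scores).all():
        return np.nan
    return float(np.nanmean(scores))


def label_polarity(mean_score):
    """Map a mean score to 'pos', 'neg', or '' (zero or NaN)."""
    if mean_score > 0:
        return "pos"
    if mean_score < 0:
        return "neg"
    return ""  # NaN fails both comparisons, so it also ends up here


# Made-up examples: one text with surviving scores, one without.
mean_kept = safe_nanmean([0.5, np.nan, 0.7])
mean_empty = safe_nanmean([np.nan, np.nan])
labels = [label_polarity(mean_kept), label_polarity(mean_empty)]
```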

#### From each text, remove the sentences whose polarity score does not meet the stricter threshold

In [48]:
testdata_sample['only_subj_strict_polarity_sents'] = testdata_sample['subj_sent_tok_text'].apply(lambda x: b_nlp.keep_only_strict_polarity_sents_df(x))
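`b_nlp.keep_only_strict_polarity_sents_df` is likewise not shown here. A simplified sketch of the idea, assuming the sentence scores are already available as a parallel list (the real helper takes only the sentences, so it presumably scores them itself), might look like this; `keep_strict_sentences` is a hypothetical name.

```python
import math


def keep_strict_sentences(sentences, scores, polarity_threshold=0.3):
    """Keep the sentences whose score has an absolute value above the threshold."""
    kept = []
    for sentence, score in zip(sentences, scores):
        if not math.isnan(score) and abs(score) > polarity_threshold:
            kept.append(sentence)
    return kept


# Made-up sentences and scores for one text.
sentences = ["I strongly support this.", "It was published in May.", "This is a terrible idea."]
scores = [0.62, 0.0, -0.48]
strict_sentences = keep_strict_sentences(sentences, scores)
```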

In [51]:
# Check
testdata_sample[['subj_sent_tok_text', 
                 'only_subj_strict_polarity_sents', 
                 'only_subj_strict_VDR_polarity', 
                 'mean_only_subj_strict_VDR_score', 'only_subj_strict_VDR_scores_sents']][:10];

# Some negative sentences are not picked up correctly...

#### Re-Calculate TextBlob polarity score on strict polarity texts

In [52]:
testdata_sample['only_subj_strict_sents_TB_scores'] = testdata_sample['only_subj_strict_polarity_sents'].apply(lambda x: b_nlp.get_textblob_sentiment_score_df(x))

In [53]:
testdata_sample['only_subj_strict_sent_mean_TB_score'] = testdata_sample['only_subj_strict_sents_TB_scores'].apply(lambda x: np.nanmean(x))


In [54]:
testdata_sample['only_subj_strict_sent_TB_polarity'] = testdata_sample['only_subj_strict_sent_mean_TB_score'].apply(lambda x: 'pos' if x > 0 else 'neg' if x < 0 else "")

### 5. Agreements

102 texts were "removed" (their polarity label is empty), so our sample size is now 98 texts.

In [63]:
print("Number of NaN: {}".format(testdata_sample['only_subj_strict_VDR_polarity'].isnull().sum()))
print(testdata_sample['only_subj_strict_VDR_polarity'].value_counts())

Number of NaN: 0
       102
pos    63 
neg    35 
Name: only_subj_strict_VDR_polarity, dtype: int64


In [56]:
# Checks
testdata_sample.head(3);

In [58]:
pd.crosstab(testdata_sample['only_subj_strict_VDR_polarity'], testdata_sample['only_subj_strict_sent_TB_polarity'])
# still many texts (21) that VADER classifies as negative while TB considers positive

| only_subj_strict_VDR_polarity (rows) vs. only_subj_strict_sent_TB_polarity (columns) | "" | neg | pos |
|---|---|---|---|
| "" | 102 | 0 | 0 |
| neg | 3 | 11 | 21 |
| pos | 3 | 7 | 53 |
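The crosstab above can be condensed into a single agreement figure. The sketch below computes raw agreement and Cohen's kappa from the four cells where both tools assigned a label (11, 21, 7, 53); choosing kappa as the agreement measure is an assumption here, not something the notebook specifies.

```python
import pandas as pd

# Counts copied from the crosstab above, restricted to texts labelled by both tools.
counts = pd.DataFrame(
    [[11, 21], [7, 53]],
    index=pd.Index(["neg", "pos"], name="VADER"),
    columns=pd.Index(["neg", "pos"], name="TextBlob"),
)

n = counts.to_numpy().sum()

# Observed agreement: share of texts on the diagonal.
observed = (counts.loc["neg", "neg"] + counts.loc["pos", "pos"]) / n

# Chance agreement: product of the row and column marginal shares, summed over labels.
row_share = counts.sum(axis=1) / n
col_share = counts.sum(axis=0) / n
expected = (row_share * col_share).sum()

kappa = (observed - expected) / (1 - expected)
print(f"observed agreement = {observed:.3f}, Cohen's kappa = {kappa:.3f}")
```

With these counts the observed agreement is about 0.70 and kappa about 0.25, which reflects how much of the raw agreement is explained by both tools labelling most texts as positive.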


Comparison of VADER polarity labels: (A&B) subjective-only texts with the stricter threshold vs. (B) all texts with the stricter threshold

In [59]:
pd.crosstab(testdata_sample['only_subj_strict_VDR_polarity'], testdata_sample['strict_VDR_polarity'])


| only_subj_strict_VDR_polarity (rows) vs. strict_VDR_polarity (columns) | neg | pos |
|---|---|---|
| "" | 38 | 29 |
| neg | 25 | 10 |
| pos | 9 | 54 |


Comparison of VADER polarity labels: (A&B) subjective-only texts with the stricter threshold vs. (A) subjective-only texts

In [60]:
pd.crosstab(testdata_sample['only_subj_strict_VDR_polarity'], testdata_sample['only_subj_VDR_polarity'])


| only_subj_strict_VDR_polarity (rows) vs. only_subj_VDR_polarity (columns) | neg | pos |
|---|---|---|
| "" | 9 | 14 |
| neg | 34 | 1 |
| pos | 2 | 61 |


In [61]:
# Save data; index=False avoids adding another 'Unnamed: 0' column on every save/reload cycle
testdata_sample.to_csv("/Users/alessia/Documents/DataScience/NLP_Project/Outputs/sa_q1_sample_testdata.csv", index=False)
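The `Unnamed: 0`, `Unnamed: 0.1`, ... columns listed in section 4 are what pandas produces when a frame is repeatedly written with its index and read back without `index_col`. A small in-memory sketch of the effect, using a made-up frame and a hypothetical helper `round_trip`:

```python
import io

import pandas as pd


def round_trip(df, **to_csv_kwargs):
    """Write `df` to an in-memory CSV and read it back with default settings."""
    buffer = io.StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)


toy = pd.DataFrame({"text": ["first answer", "second answer"]})

# Two save/reload cycles with the index written each time: two extra columns.
with_index = round_trip(round_trip(toy))

# The same two cycles with index=False: the columns stay unchanged.
without_index = round_trip(round_trip(toy, index=False), index=False)

print(list(with_index.columns))
print(list(without_index.columns))
```

Passing `index_col=0` to `pd.read_csv` is an alternative if the index should be preserved across saves.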