TextBlob outputs a `polarity` score for each tweet rather than predicting a sentiment. It is upto the developer to choose thresholds for which a tweet is tagged as positive, neutral, or negative. This notebook focusses on finding the best thresholds for which a tweet is successfully predicted as its actual sentiment label.

### Table of Content

- [Imports and Configurations](#imports-and-configurations)
- [Importing the Dataset](#importing-the-dataset)
- [Filtering Data](#filtering-data)
- [Tuning Parameters](#tuning-parameters)
- [Conclusion](#conclusion)

### Imports and Configurations

In [1]:
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

In [2]:
import numpy as np
import pyspark.pandas as ps
import pandas as pd
from textblob import TextBlob

### Importing the Dataset

In [3]:
columns = ["Tweet_ID","Entity","Sentiment","Tweet_Content"]
label = "Sentiment"

In [4]:
pdf_train = pd.read_table("./twitter_training.csv",names=columns,index_col="Tweet_ID",sep=",")
pdf_valid = pd.read_table("./twitter_validation.csv",names=columns,index_col="Tweet_ID",sep=",")

In [5]:
df = ps.concat([
    ps.from_pandas(pdf_train),
    # The following line can be commented out if Validation data is not to be used for tuning
    # After tuning with/without validation data, the same thresholds were tuned.
    ps.from_pandas(pdf_valid)
])

  fields = [
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/11/30 04:18:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


  for column, series in pdf.iteritems():
  fields = [
  for column, series in pdf.iteritems():


In [6]:
df.head()

                                                                                

Unnamed: 0_level_0,Entity,Sentiment,Tweet_Content
Tweet_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2401,Borderlands,Positive,im getting on borderlands and i will murder yo...
2401,Borderlands,Positive,I am coming to the borders and I will kill you...
2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
2401,Borderlands,Positive,im coming on borderlands and i will murder you...
2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...


### Filtering Data 

In [7]:
# As discussed in the data analysis notebook irrelevant tweets can be ignored
# for tuning parameters
df = df[df["Sentiment"]!="Irrelevant"]

In [8]:
df.drop_duplicates(inplace=True)
df.dropna(inplace=True)

### Tuning Parameters 

The `apply` method of a pandas dataframe is used to count correct predictions. This method takes only one argument which is a function which accepts one parameter. For tuning different values of `n` and `p` are to be used. Hence, a wrapper function is written below which can allow a for loop to change values of `n` and `p` while still passing a function that takes a single argument to the `apply` method.

Here, `p` is the polarity value above which a tweet is tagged as positive. A polarity value between `n` and `p` is tagged as neutral. A polarity value below `n` is tagged as negative.

In [9]:
def getSentimentTextBlob(n=-0.5,p=0.5):
    def getSentiment(line):
        analysis = TextBlob(line)
        if analysis.sentiment.polarity>=p:
            return "Positive"
        elif analysis.sentiment.polarity>= n:
            return "Neutral"
        return "Negative"
    return getSentiment

In [10]:
positive_cut = None
neutral_cut = None
best_score = 0
grid = np.linspace(-1,1,21)
grid

array([-1. , -0.9, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3, -0.2, -0.1,  0. ,
        0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ])

In [11]:
for p in range(len(grid)-2,1,-1):
    for n in range(1,p):
        sentiments = df[label].apply(getSentimentTextBlob(round(grid[n],1),round(grid[p],1)))
        score = (sentiments == df[label]).sum()/len(df)
        if score > best_score:
            positive_cut = p
            neutral_cut = n
            best_score = score

  fields = [
  for column, series in pdf.iteritems():
  fields = [
  for column, series in pdf.iteritems():
  fields = [
  for column, series in pdf.iteritems():
  fields = [
  for column, series in pdf.iteritems():
  fields = [
  for column, series in pdf.iteritems():
  fields = [
  for column, series in pdf.iteritems():
  fields = [
  for column, series in pdf.iteritems():
  fields = [
  for column, series in pdf.iteritems():
  fields = [
  for column, series in pdf.iteritems():
  fields = [
  for column, series in pdf.iteritems():
  fields = [
  for column, series in pdf.iteritems():
  fields = [
  for column, series in pdf.iteritems():
  fields = [
  for column, series in pdf.iteritems():
  fields = [
  for column, series in pdf.iteritems():
  fields = [
  for column, series in pdf.iteritems():
  fields = [
  for column, series in pdf.iteritems():
  fields = [
  for column, series in pdf.iteritems():
  fields = [
  for column, series in pdf.iteritems():
  fields = [
  for column, s

In [13]:
print(f"Positive Cut: {round(grid[positive_cut],1)}, Neutral Cut: {round(grid[neutral_cut],1)}, Best Score: {best_score}")

Positive Cut: 0.2, Neutral Cut: -0.2, Best Score: 1.0


----

### Conclusion 



The dataset was loaded in a Pandas API on Spark dataset, which was filtered by removing missing values, duplicates, and records with `Irrelevant` sentiment. Then parameters for positive and neutral cuts were tuned by using a grid of values between -1 and +1. The parameters were tuned on the basis of finding the best accuracy, and the resultant cuts were estimated as follows:

<pre>
Positive Sentiment:     1 >= Polarity >=0.2    
Neutral Sentiment:      0.2 > Polarity >= -0.2    
Negative Sentiment:     -0.2 > Polarity >= -1
</pre>