
In [None]:
import pandas as pd
import numpy as np
import os
import shutil
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS 
import re 
import string
import tensorflow as tf
from tensorflow.keras.layers import Dense,Input,Activation,Flatten,Embedding,Lambda,Dropout,LSTM,Conv1D
from tensorflow.keras.models import Model
from tqdm import tqdm
import spacy
import random
from spacy.util import compounding
from spacy.util import minibatch




In [None]:
!kaggle competitions download -c tweet-sentiment-extraction

Downloading test.csv to /content/data
  0% 0.00/307k [00:00<?, ?B/s]
100% 307k/307k [00:00<00:00, 46.8MB/s]
Downloading train.csv.zip to /content/data
  0% 0.00/1.23M [00:00<?, ?B/s]
100% 1.23M/1.23M [00:00<00:00, 82.2MB/s]
Downloading sample_submission.csv to /content/data
  0% 0.00/41.4k [00:00<?, ?B/s]
100% 41.4k/41.4k [00:00<00:00, 42.8MB/s]


In [None]:
!unzip train.csv.zip

Archive:  train.csv.zip
  inflating: train.csv               


In [None]:
#Loading the train dataset

train=pd.read_csv('/content/data/train.csv')
print('Shape of the train data is ',train.shape)
train.head()

Shape of the train data is  (27481, 4)


Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


* Train dataset has 27481 datapoints and 4 features.

In [None]:
#Loading the test dataset

test=pd.read_csv('/content/data/test.csv')
print('Shape of the test data is ',test.shape)
test.head()

Shape of the test data is  (3534, 3)


Unnamed: 0,textID,text,sentiment
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative
3,01082688c6,happy bday!,positive
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive


* Test dataset has 3534 datapoints and 3 features.

In [None]:
#Information regarding train data
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27481 entries, 0 to 27480
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   textID         27481 non-null  object
 1   text           27480 non-null  object
 2   selected_text  27480 non-null  object
 3   sentiment      27481 non-null  object
dtypes: object(4)
memory usage: 858.9+ KB


* The text and selected_text columns each have one null value, so the affected row is dropped.

In [None]:
#Dropping the row with null value
train.dropna(inplace=True)
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27480 entries, 0 to 27480
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   textID         27480 non-null  object
 1   text           27480 non-null  object
 2   selected_text  27480 non-null  object
 3   sentiment      27480 non-null  object
dtypes: object(4)
memory usage: 1.0+ MB


In [None]:
#Information regarding test data
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3534 entries, 0 to 3533
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   textID     3534 non-null   object
 1   text       3534 non-null   object
 2   sentiment  3534 non-null   object
dtypes: object(3)
memory usage: 83.0+ KB


* Test dataset has no rows with null values.

MODEL ARCHITECTURE:
* Three Named Entity Recognition (NER) models are built, one for each sentiment.
* Each NER model is trained with selected_text as the entity to recognize in the tweet. Given a tweet, the model labels part of the tweet as selected_text.
* Each model is trained until the loss starts to increase or the reduction in loss becomes small.
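
The stopping rule in the last bullet can be sketched as a small check on the per-iteration losses. This is only an illustration: the function name `first_stop_epoch` and the 1% threshold `min_rel_drop` are assumptions and are not defined anywhere in this notebook. The loss values are the positive-model losses printed in the training log of this notebook, rounded to two decimals.

```python
def first_stop_epoch(losses, min_rel_drop=0.01):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops at the first epoch whose loss is higher than the previous
    one, or whose relative improvement falls below min_rel_drop
    (a hypothetical threshold chosen for this sketch).
    """
    for i in range(1, len(losses)):
        prev, cur = losses[i - 1], losses[i]
        rel_drop = (prev - cur) / prev  # negative when the loss went up
        if rel_drop < min_rel_drop:
            return i + 1
    return None


# NER losses of the positive model, rounded from the training log
positive_losses = [27689.42, 25454.20, 23828.84, 22964.59, 22306.54,
                   21777.34, 21119.46, 20880.66, 20433.54, 20592.34]

stop_epoch = first_stop_epoch(positive_losses)
print(stop_epoch)
```

With these values the rule fires at the tenth iteration, where the loss rises from 20433.54 to 20592.34.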


In [None]:
#Reference: https://www.machinelearningplus.com/nlp/training-custom-ner-model-in-spacy/

def train_data(train_data,output_dir,new_name,model=None):
  '''Trains a spaCy (v2 API) ner model on train_data and saves it to output_dir.
  If model is given, training resumes from the model saved at that path.'''
  if model is None:
    #Create a new blank model for English language
    nlp=spacy.blank('en')
    print('Created new model')
  else:
    #Load the existing model from the given path
    nlp=spacy.load(model)
    print('Loaded old model ')

  #Create ner pipe if it doesn't exist, else fetch the existing ner pipe
  if "ner" not in nlp.pipe_names:
    ner=nlp.create_pipe("ner")
    nlp.add_pipe(ner, last=True)
  else:
    ner=nlp.get_pipe('ner')

  #Register every entity label that appears in the annotations
  for text, annotations in train_data:
    for ent in annotations.get("entities"):
      ner.add_label(ent[2])

  #Find the pipes other than ner and disable them for training
  other_pipes = [p for p in nlp.pipe_names if p != "ner"]
  with nlp.disable_pipes(*other_pipes):
    if model is None:
      nlp.begin_training()
    else:
      nlp.resume_training()

    #Training the model for 5 iterations (the first positive and negative runs logged below used 10)
    for itn in tqdm(range(5)):
      random.shuffle(train_data)
      batches = minibatch(train_data, size=compounding(4.0, 500.0, 1.001))
      losses = {}
      for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(texts,annotations,drop=0.5,losses=losses)
      print('loss:',losses)

  #Save the model
  nlp.meta["name"] = new_name
  nlp.to_disk(output_dir)




In [None]:
#Splitting the labelled data into train and hold-out sets (stratified by sentiment)
from sklearn.model_selection import train_test_split

X_train,X_test,Y_train,Y_test=train_test_split(train,train['sentiment'],
                                               test_size=0.2,random_state=42,stratify=train['sentiment'])

X_train.reset_index(inplace=True,drop=True)
X_test.reset_index(inplace=True,drop=True)



print('X_train shape',X_train.shape) 
print('X_test shape',X_test.shape)

X_train shape (21984, 4)
X_test shape (5496, 4)


In [None]:
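def createTrainingData(sentiment,data):
  '''Create training data in the spaCy format (text, {'entities': [(start, end, label)]}) for one sentiment '''
  final_data=[]
  for i, row in data.iterrows():
    if row.sentiment==sentiment:
      selected_text=row.selected_text
      text=row.text
      #Character offsets of selected_text inside the tweet
      start=text.find(selected_text)
      if start==-1:
        #Skip rows where selected_text is not a substring of the tweet
        continue
      end=start+len(selected_text)
      final_data.append((text,{'entities':[(start,end,'selected_text')]}))
  return final_data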
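
To make the annotation format concrete, here is a minimal sketch of the conversion for a single tweet, using the third row shown in `train.head()` above. The helper name `to_spacy_example` is hypothetical and only mirrors the logic of `createTrainingData`; it needs no spaCy install to run.

```python
def to_spacy_example(text, selected_text, label="selected_text"):
    """Build one (text, annotations) pair with character offsets.

    Returns None when selected_text does not occur in text.
    """
    start = text.find(selected_text)
    if start == -1:
        return None
    end = start + len(selected_text)
    return (text, {"entities": [(start, end, label)]})


example = to_spacy_example("my boss is bullying me...", "bullying me")
print(example)

# The offsets slice the original tweet back to the selected text
start, end, _ = example[1]["entities"][0]
print(example[0][start:end])
```

The entity span is given in character offsets, with the end offset exclusive, which is why slicing the tweet with `[start:end]` recovers the selected text.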


In [None]:
#Creating directories to store model data

%cd '/content/'
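#exist_ok=True lets the cell be re-run without failing on existing directories
os.makedirs('/content/positive',exist_ok=True)
os.makedirs('/content/negative',exist_ok=True)
os.makedirs('/content/neutral',exist_ok=True)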

/content


In [None]:
#Training for positive sentiment

positive_data=createTrainingData('positive',X_train)
train_data(positive_data,'/content/positive','pos')

  0%|          | 0/10 [00:00<?, ?it/s]

Created new model


 10%|█         | 1/10 [00:35<05:23, 35.90s/it]

loss: {'ner': 27689.42110305946}


 20%|██        | 2/10 [01:11<04:46, 35.76s/it]

loss: {'ner': 25454.201032690027}


 30%|███       | 3/10 [01:46<04:09, 35.58s/it]

loss: {'ner': 23828.83874272783}


 40%|████      | 4/10 [02:21<03:33, 35.53s/it]

loss: {'ner': 22964.594430033558}


 50%|█████     | 5/10 [02:58<02:59, 35.84s/it]

loss: {'ner': 22306.53968724727}


 60%|██████    | 6/10 [03:34<02:24, 36.02s/it]

loss: {'ner': 21777.336945568026}


 70%|███████   | 7/10 [04:10<01:47, 35.77s/it]

loss: {'ner': 21119.455629631935}


 80%|████████  | 8/10 [04:45<01:11, 35.58s/it]

loss: {'ner': 20880.659306122303}


 90%|█████████ | 9/10 [05:24<00:36, 36.79s/it]

loss: {'ner': 20433.544970024694}


100%|██████████| 10/10 [06:06<00:00, 36.62s/it]

loss: {'ner': 20592.343001055684}





In [None]:
#Training for negative sentiment

negative_data=createTrainingData('negative',X_train)
train_data(negative_data,'/content/negative','neg')


  0%|          | 0/10 [00:00<?, ?it/s]

Created new model

 10%|█         | 1/10 [00:35<05:19, 35.48s/it]

loss: {'ner': 25827.15608202864}

 20%|██        | 2/10 [01:09<04:40, 35.02s/it]

loss: {'ner': 23496.310243669737}

 30%|███       | 3/10 [01:43<04:03, 34.76s/it]

loss: {'ner': 22341.483347295303}

 40%|████      | 4/10 [02:17<03:26, 34.49s/it]

loss: {'ner': 20987.657571903812}

 50%|█████     | 5/10 [02:52<02:53, 34.67s/it]

loss: {'ner': 20574.77125196917}

 60%|██████    | 6/10 [03:28<02:20, 35.09s/it]

loss: {'ner': 19554.667962447274}

 70%|███████   | 7/10 [04:02<01:44, 34.78s/it]

loss: {'ner': 19137.85273454134}

 80%|████████  | 8/10 [04:36<01:09, 34.54s/it]

loss: {'ner': 18579.801552748977}

 90%|█████████ | 9/10 [05:11<00:34, 34.49s/it]

loss: {'ner': 18044.588299441464}

100%|██████████| 10/10 [05:50<00:00, 35.03s/it]

loss: {'ner': 17834.81188769448}





In [None]:
train_data(negative_data,'/content/negative','neg','/content/negative')

Loaded old model 



  0%|          | 0/5 [00:00<?, ?it/s]
 20%|██        | 1/5 [00:41<02:46, 41.58s/it]

loss: {'ner': 17490.297766154035}

 40%|██        | 2/5 [01:22<02:04, 41.52s/it]

loss: {'ner': 16887.991383886387}

 60%|██████    | 3/5 [02:04<01:23, 41.51s/it]

loss: {'ner': 16685.665047250834}

 80%|████████  | 4/5 [02:45<00:41, 41.43s/it]

loss: {'ner': 16313.12813841504}

100%|██████████| 5/5 [03:27<00:00, 41.41s/it]

loss: {'ner': 16105.474078456138}





In [None]:
#Training for neutral sentiment

neutral_data=createTrainingData('neutral',X_train)
train_data(neutral_data,'/content/neutral','neu')



  0%|          | 0/5 [00:00<?, ?it/s]

Created new model

 20%|██        | 1/5 [00:42<02:49, 42.43s/it]

loss: {'ner': 6126.086347264734}

 40%|██        | 2/5 [01:26<02:08, 42.98s/it]

loss: {'ner': 4390.583928749054}

 60%|██████    | 3/5 [02:15<01:29, 44.75s/it]

loss: {'ner': 4147.1346603988895}

 80%|████████  | 4/5 [03:04<00:45, 45.93s/it]

loss: {'ner': 4042.8038954479666}

100%|██████████| 5/5 [03:52<00:00, 46.54s/it]

loss: {'ner': 4115.085065671129}





In [None]:
def predict_selected_text(text,model):
  '''Determine the selected text for the given text '''
  #Giving the text to the model
  doc=model(text)
  ent_list=[]
  for e in doc.ents:
    #Character offsets of the predicted span, taken directly from the entity
    #(text.find would return the first occurrence, which may be a different position)
    arr=[e.start_char,e.end_char,e.label_]
    if arr not in ent_list:
      ent_list.append(arr)
  #If the model does not predict any entity, selected_text is the whole text
  if len(ent_list)==0:
    selected_text=text
  else:
    #Only the first predicted entity is used
    selected_text=text[ent_list[0][0]:ent_list[0][1]]

  return selected_text


In [None]:
def jaccard(str1, str2): 
  '''Returns the jaccard score for the given two strings '''
  a = set(str1.lower().split()) 
  b = set(str2.lower().split())
  c = a.intersection(b)
  return float(len(c)) / (len(a) + len(b) - len(c))
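
As a worked example of the score, the sketch below repeats the `jaccard` function so that it runs on its own and applies it to two prediction pairs that appear in the hold-out results table later in this notebook. The word sets are built on whitespace only, so `bored...` and `bored..` count as different words.

```python
def jaccard(str1, str2):
    """Word-level Jaccard score, same definition as in the cell above."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


# Prediction is the whole tweet, target is a 4-word prefix:
# 3 shared words, 8 + 4 - 3 = 9 words in the union
score_whole_tweet = jaccard("is feeling so bored... i miss school time",
                            "is feeling so bored..")

# Prediction is a single word out of a 7-word target: 1 shared word, 7 in the union
score_one_word = jaccard("crappy", "crappy karaoke game. I miss the fighting")

print(round(score_whole_tweet, 6), round(score_one_word, 6))
```

Both values agree with the `jaccard` column reported for these rows (0.333333 and 0.142857).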

In [None]:
#Creating a dataframe with all the attributes along with jaccard score
def jaccardScorePrediction(data):
  ''' Predicts the selected_text and computes jaccard score'''

  #Load each sentiment model once, rather than reloading it from disk for every row
  models={'positive':spacy.load('/content/positive'),
          'negative':spacy.load('/content/negative'),
          'neutral':spacy.load('/content/neutral')}

  rows=[]
  for _,row in data.iterrows():
    #Any sentiment other than positive/negative falls back to the neutral model, as before
    model=models.get(row.sentiment,models['neutral'])
    predicted_text=predict_selected_text(row.text,model)
    rows.append({'textID':row.textID,
                 'text':row.text,
                 'selected_text':row.selected_text,
                 'predicted_text':predicted_text,
                 'sentiment':row.sentiment,
                 'jaccard':jaccard(predicted_text,row.selected_text)})

  #Creating a dataframe with the below columns
  return pd.DataFrame(rows,columns=['textID','text','selected_text','predicted_text','sentiment','jaccard'])




In [None]:
#Prediction on the hold-out split (X_test) of the train data

test_df=jaccardScorePrediction(X_test)
test_df.head()

Unnamed: 0,textID,text,selected_text,predicted_text,sentiment,jaccard
0,45be0423e4,I thought that there was going to be another D...,crappy karaoke game. I miss the fighting,crappy,negative,0.142857
1,521d5dd501,I bet you received lots of hit from that twee...,I bet you received lots of hit from that tweet...,I bet you received lots of hit from that twee...,negative,1.0
2,605225ad21,Freakin` frustrated why can`t my coach realize...,frustrated,frustrated,negative,1.0
3,0abe62c2ee,is feeling so bored... i miss school time,is feeling so bored..,is feeling so bored... i miss school time,negative,0.333333
4,eca513ce47,wow this morning 8.15 hrs ding dong breakfasts...,"Mother hapy,",wow,positive,0.0


In [None]:
#Overall Jaccard score for hold out test data

average=np.mean(test_df['jaccard'])
print('The average jaccard score is ',average)

The average jaccard score is  0.6467800626675628


In [None]:
#Average Jaccard score for each sentiment

pos_average=np.mean(test_df['jaccard'][test_df['sentiment']=='positive'])
print('The average jaccard score for positive sentiment is  ',pos_average)

neg_average=np.mean(test_df['jaccard'][test_df['sentiment']=='negative'])
print('The average jaccard score for negative sentiment is  ',neg_average)

neu_average=np.mean(test_df['jaccard'][test_df['sentiment']=='neutral'])
print('The average jaccard score for neutral sentiment is  ',neu_average)

The average jaccard score for positive sentiment is   0.4363084132448223
The average jaccard score for negative sentiment is   0.4129347301888076
The average jaccard score for neutral sentiment is   0.972783969028342


* The jaccard score is 0.97 for neutral tweets but only about 0.44 for positive and 0.41 for negative tweets.
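
The overall score reported earlier is the count-weighted mean of these three per-sentiment averages. The short check below makes that arithmetic explicit; the per-sentiment tweet counts (1716, 1556, 2224) are the ones printed in the prediction-analysis cells further down in this notebook, and the variable names are only for this sketch.

```python
# Hold-out tweet counts per sentiment (from the prediction-analysis outputs)
counts = {"positive": 1716, "negative": 1556, "neutral": 2224}

# Per-sentiment average Jaccard scores printed above
means = {
    "positive": 0.4363084132448223,
    "negative": 0.4129347301888076,
    "neutral": 0.972783969028342,
}

total = sum(counts.values())
overall = sum(counts[s] * means[s] for s in counts) / total
print(total, round(overall, 4))
```

The result matches the overall hold-out average of about 0.6468, which shows how much the large, easy neutral class lifts the headline number.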

In [None]:
#Calculating the total number of tweets for each sentiment
total_positive=test_df['sentiment'].value_counts()['positive']
total_negative=test_df['sentiment'].value_counts()['negative']
total_neutral=test_df['sentiment'].value_counts()['neutral']

#Counting the tweets for each sentiment with jaccard score=1
correct_positive=test_df[(test_df['jaccard']==1) & (test_df['sentiment']=='positive')].shape[0]
correct_negative=test_df[(test_df['jaccard']==1) & (test_df['sentiment']=='negative')].shape[0]
correct_neutral=test_df[(test_df['jaccard']==1) & (test_df['sentiment']=='neutral')].shape[0]

print('The percent of positive tweets correctly predicted is ',correct_positive*100/total_positive)
print('The percent of negative tweets correctly predicted is ',correct_negative*100/total_negative)
print('The percent of neutral tweets correctly predicted is ',correct_neutral*100/total_neutral)

The percent of positive tweets correctly predicted is  27.505827505827504
The percent of negative tweets correctly predicted is  24.67866323907455
The percent of neutral tweets correctly predicted is  89.56834532374101


* About 27.5% of positive tweets and 24.7% of negative tweets have a jaccard score of 1.
* About 89.6% of neutral tweets have a jaccard score of 1.

### PREDICTION ANALYSIS

The error analysis looks at how well each model identifies the selected text as the difference in length (in words) between the tweet and the selected text grows.
* Category1: The tweet and the selected text have the same length (difference in length is 0)
* Category2: The tweet and the selected text have similar lengths (difference in length is at most 10)
* Category3: The difference in length is moderate (greater than 10 and at most 20)
* Category4: The difference in length is high (greater than 20)
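
The categories can be sketched as a small bucketing function on the word-count difference. This is an illustration under assumptions: the names `bucket_from_diff` and `length_bucket` are hypothetical, the thresholds 0, 10 and 20 follow the filters used in the cells below, and the buckets here are made disjoint, whereas the notebook's `diff_len<=10` filter also includes tweets with a difference of 0.

```python
def bucket_from_diff(diff):
    """Map a word-count difference to one of the four categories."""
    if diff == 0:
        return "Category1"
    if diff <= 10:
        return "Category2"
    if diff <= 20:
        return "Category3"
    return "Category4"


def length_bucket(text, selected_text):
    """Category of a tweet, based on whitespace-separated word counts."""
    diff = abs(len(text.split()) - len(selected_text.split()))
    return bucket_from_diff(diff)


same = length_bucket("happy bday!", "happy bday!")
close = length_bucket("is feeling so bored... i miss school time",
                      "is feeling so bored..")
print(same, close, bucket_from_diff(20), bucket_from_diff(23))
```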

In [None]:
#Adding word counts of selected_text and text, and their absolute difference

test_df['len_selected_text']=test_df['selected_text'].apply(lambda x: len(x.split()))
test_df['len_text']=test_df['text'].apply(lambda x: len(x.split()))
test_df['diff_len']=abs(test_df['len_selected_text']-test_df['len_text'])
test_df.head()

Unnamed: 0,textID,text,selected_text,predicted_text,sentiment,jaccard,len_selected_text,len_text,diff_len
0,45be0423e4,I thought that there was going to be another D...,crappy karaoke game. I miss the fighting,crappy,negative,0.142857,7,27,20
1,521d5dd501,I bet you received lots of hit from that twee...,I bet you received lots of hit from that tweet...,I bet you received lots of hit from that twee...,negative,1.0,17,17,0
2,605225ad21,Freakin` frustrated why can`t my coach realize...,frustrated,frustrated,negative,1.0,1,24,23
3,0abe62c2ee,is feeling so bored... i miss school time,is feeling so bored..,is feeling so bored... i miss school time,negative,0.333333,4,8,4
4,eca513ce47,wow this morning 8.15 hrs ding dong breakfasts...,"Mother hapy,",wow,positive,0.0,2,21,19


#### POSITIVE SENTIMENT

In [None]:
#Filtering positive sentiment tweets
positive_df=test_df[test_df['sentiment']=='positive']

#Splitting positive sentiment tweets into exact matches (jaccard=1) and the rest (jaccard!=1)
pos_jac_1=positive_df[positive_df['jaccard']==1]
pos_jac_0=positive_df[positive_df['jaccard']!=1]

In [None]:
print('The total number of positive tweets is',positive_df.shape[0])
print('The total number of positive tweets with jaccard=1 is',pos_jac_1.shape[0])
print('The total number of positive tweets with jaccard not 1 is',pos_jac_0.shape[0])

The total number of positive tweets is 1716
The total number of positive tweets with jaccard=1 is 472
The total number of positive tweets with jaccard not 1 is 1244


In [None]:
#Performance of positive model when tweet and selected text have the same length

total_zero_diff=positive_df[positive_df['diff_len']==0].shape[0]
zero_diff_jaccard_1=pos_jac_1[pos_jac_1['diff_len']==0].shape[0]
result=zero_diff_jaccard_1*100/total_zero_diff
print('The percentage of positive tweets with jaccard=1 when both tweet and selected text were same is',result)

The percentage of positive tweets with jaccard=1 when both tweet and selected text were same is 49.57983193277311


In [None]:
#Performance of positive model when difference between tweet and selected text is at most 10

diff_less_10=positive_df[positive_df['diff_len']<=10].shape[0]
diff_less_10_jaccard_1=pos_jac_1[pos_jac_1['diff_len']<=10].shape[0]
result=diff_less_10_jaccard_1*100/diff_less_10
print('The percentage of positive tweets with jaccard=1 when length difference is at most 10 is',result)

The percentage of positive tweets with jaccard=1 when length difference is at most 10 is 28.72983870967742


In [None]:
#Performance of positive model when difference between tweet and selected text is greater than 10 and at most 20

diff_less_20=positive_df[(positive_df['diff_len']>10) & (positive_df['diff_len']<=20)].shape[0]
diff_less_20_jaccard_1=pos_jac_1[(pos_jac_1['diff_len']>10) & (pos_jac_1['diff_len']<=20)].shape[0]
result=diff_less_20_jaccard_1*100/diff_less_20
print('The percentage of positive tweets with jaccard=1 when length difference is greater than 10 and at most 20 is',result)

The percentage of positive tweets with jaccard=1 when length difference is greater than 10 and at most 20 is 25.47332185886403


In [None]:
#Performance of positive model when difference between tweet and selected text is greater than 20

diff_less_30=positive_df[(positive_df['diff_len']>20)].shape[0]
diff_less_30_jaccard_1=pos_jac_1[(pos_jac_1['diff_len']>20)].shape[0]
result=diff_less_30_jaccard_1*100/diff_less_30
print('The percentage of positive tweets with jaccard=1 when length difference is greater than 20 is',result)

The percentage of positive tweets with jaccard=1 when length difference is greater than 20 is 27.272727272727273


CONCLUSION
* The positive sentiment model predicts around 50% of the tweets exactly (jaccard=1) when the tweet and the selected text have the same length.
* The positive sentiment model does less well in the other length-difference categories, where it predicts roughly 25% to 29% of the tweets exactly.

#### NEGATIVE SENTIMENT

In [None]:
#Filtering negative sentiment tweets
negative_df=test_df[test_df['sentiment']=='negative']

#Splitting negative sentiment tweets into exact matches (jaccard=1) and the rest (jaccard!=1)
neg_jac_1=negative_df[negative_df['jaccard']==1]
neg_jac_0=negative_df[negative_df['jaccard']!=1]

In [None]:
print('The total number of negative tweets is',negative_df.shape[0])
print('The total number of negative tweets with jaccard=1 is',neg_jac_1.shape[0])
print('The total number of negative tweets with jaccard not 1 is',neg_jac_0.shape[0])

The total number of negative tweets is 1556
The total number of negative tweets with jaccard=1 is 384
The total number of negative tweets with jaccard not 1 is 1172


In [None]:
#Performance of negative model when tweet and selected text have the same length

total_zero_diff=negative_df[negative_df['diff_len']==0].shape[0]
zero_diff_jaccard_1=neg_jac_1[neg_jac_1['diff_len']==0].shape[0]
result=zero_diff_jaccard_1*100/total_zero_diff
print('The percentage of negative tweets with jaccard=1 when both tweet and selected text were same is',result)

The percentage of negative tweets with jaccard=1 when both tweet and selected text were same is 53.49794238683128


In [None]:
#Performance of negative model when difference between tweet and selected text is at most 10

diff_less_10=negative_df[negative_df['diff_len']<=10].shape[0]
diff_less_10_jaccard_1=neg_jac_1[neg_jac_1['diff_len']<=10].shape[0]
result=diff_less_10_jaccard_1*100/diff_less_10
print('The percentage of negative tweets with jaccard=1 when length difference is at most 10 is',result)

The percentage of negative tweets with jaccard=1 when length difference is at most 10 is 27.874186550976138


In [None]:
#Performance of negative model when difference between tweet and selected text is greater than 10 and at most 20

diff_less_20=negative_df[(negative_df['diff_len']>10) & (negative_df['diff_len']<=20)].shape[0]
diff_less_20_jaccard_1=neg_jac_1[(neg_jac_1['diff_len']>10) & (neg_jac_1['diff_len']<=20)].shape[0]
result=diff_less_20_jaccard_1*100/diff_less_20
print('The percentage of negative tweets with jaccard=1 when length difference is greater than 10 and at most 20 is',result)

The percentage of negative tweets with jaccard=1 when length difference is greater than 10 and at most 20 is 19.29460580912863


In [None]:
#Performance of negative model when difference between tweet and selected text is greater than 20 and at most 30

diff_less_30=negative_df[(negative_df['diff_len']>20) & (negative_df['diff_len']<=30)].shape[0]
diff_less_30_jaccard_1=neg_jac_1[(neg_jac_1['diff_len']>20) & (neg_jac_1['diff_len']<=30)].shape[0]
result=diff_less_30_jaccard_1*100/diff_less_30
print('The percentage of negative tweets with jaccard=1 when length difference is greater than 20 and at most 30 is',result)

The percentage of negative tweets with jaccard=1 when length difference is greater than 20 and at most 30 is 22.36842105263158


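CONCLUSION
* The negative sentiment model predicts around 53% of the tweets exactly (jaccard=1) when the tweet and the selected text have the same length.
* The negative sentiment model performs poorly when the difference in length is above 10: about 19% exact matches for differences of 11 to 20 words and about 22% for differences of 21 to 30 words.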

#### NEUTRAL SENTIMENT

In [None]:
#Filtering neutral sentiment tweets
neutral_df=test_df[test_df['sentiment']=='neutral']

#Splitting neutral sentiment tweets into exact matches (jaccard=1) and the rest (jaccard!=1)
neu_jac_1=neutral_df[neutral_df['jaccard']==1]
neu_jac_0=neutral_df[neutral_df['jaccard']!=1]

In [None]:
print('The total number of neutral tweets is',neutral_df.shape[0])
print('The total number of neutral tweets with jaccard=1 is',neu_jac_1.shape[0])
print('The total number of neutral tweets with jaccard not 1 is',neu_jac_0.shape[0])

The total number of neutral tweets is 2224
The total number of neutral tweets with jaccard=1 is 1992
The total number of neutral tweets with jaccard not 1 is 232


In [None]:
#Performance of neutral model when tweet and selected text have the same length

total_zero_diff=neutral_df[neutral_df['diff_len']==0].shape[0]
zero_diff_jaccard_1=neu_jac_1[neu_jac_1['diff_len']==0].shape[0]
result=zero_diff_jaccard_1*100/total_zero_diff
print('The percentage of neutral tweets with jaccard=1 when both tweet and selected text were same is',result)

The percentage of neutral tweets with jaccard=1 when both tweet and selected text were same is 96.11461874696455


In [None]:
#Performance of neutral model when difference between tweet and selected text is at most 10

diff_less_10=neutral_df[neutral_df['diff_len']<=10].shape[0]
diff_less_10_jaccard_1=neu_jac_1[neu_jac_1['diff_len']<=10].shape[0]
result=diff_less_10_jaccard_1*100/diff_less_10
print('The percentage of neutral tweets with jaccard=1 when length difference is at most 10 is',result)

The percentage of neutral tweets with jaccard=1 when length difference is at most 10 is 90.42215161143895


In [None]:
#Performance of neutral model when difference between tweet and selected text is greater than 10 and at most 20

diff_less_20=neutral_df[(neutral_df['diff_len']>10) & (neutral_df['diff_len']<=20)].shape[0]
diff_less_20_jaccard_1=neu_jac_1[(neu_jac_1['diff_len']>10) & (neu_jac_1['diff_len']<=20)].shape[0]
result=diff_less_20_jaccard_1*100/diff_less_20
print('The percentage of neutral tweets with jaccard=1 when length difference is greater than 10 and at most 20 is',result)

The percentage of neutral tweets with jaccard=1 when length difference is greater than 10 and at most 20 is 0.0


CONCLUSION
* The neutral sentiment model predicts around 96% of the tweets exactly (jaccard=1) when the tweet and the selected text have the same length.
* The neutral sentiment model performs well when the difference in length is at most 10 (about 90% exact matches) but gets no exact matches when the difference is greater than 10 and at most 20.
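
The per-category cells for all three sentiments repeat the same computation: the share of tweets with jaccard=1 inside one `diff_len` range. A minimal sketch of a single helper that could replace them is shown below. The name `exact_match_rate` is hypothetical, and the five rows are just the `diff_len` and `jaccard` values from the `test_df.head()` output above (mixed sentiments), used only to exercise the function.

```python
def exact_match_rate(rows, lo, hi):
    """Percentage of rows with jaccard == 1 among rows with lo < diff_len <= hi.

    Returns None for an empty bucket instead of dividing by zero.
    """
    bucket = [r for r in rows if lo < r["diff_len"] <= hi]
    if not bucket:
        return None
    hits = sum(1 for r in bucket if r["jaccard"] == 1)
    return 100 * hits / len(bucket)


# diff_len and jaccard of the first five hold-out rows shown earlier
rows = [
    {"diff_len": 20, "jaccard": 0.142857},
    {"diff_len": 0, "jaccard": 1.0},
    {"diff_len": 23, "jaccard": 1.0},
    {"diff_len": 4, "jaccard": 0.333333},
    {"diff_len": 19, "jaccard": 0.0},
]

same_length = exact_match_rate(rows, -1, 0)          # diff_len == 0
moderate = exact_match_rate(rows, 10, 20)            # 10 < diff_len <= 20
large = exact_match_rate(rows, 20, float("inf"))     # diff_len > 20
empty = exact_match_rate(rows, 30, 40)               # no rows in this range
print(same_length, moderate, large, empty)
```

Returning `None` for an empty range avoids the division by zero that the inline cells would hit if a sentiment had no tweets in some category.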