# Sentiment Analysis: Data Gathering 2 (Lexicon Filter)

The texts whose original sentiment labels agree with the VADER sentiments form a smaller but seemingly more accurate dataset, which can be analyzed manually for words and phrases that indicate a specific sentiment. This notebook collects these phrases into a domain-specific lexicon and uses it to filter out each sentiment iteratively. The result is a balanced training dataset of around 4.5k samples for sentiment analysis.
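The agreement step described above can be sketched as follows. This is a minimal illustration in plain Python, assuming rows that carry the same `sentiment` and `vader` label columns as the data loaded below. The helper `agreeing_rows` and the sample rows are hypothetical and not part of the original notebook.

```python
# Hypothetical sample rows; the real data has the same two label columns.
rows = [
    {"id": 1, "sentiment": "NEGATIVE", "vader": "NEGATIVE"},
    {"id": 2, "sentiment": "NEGATIVE", "vader": "NONE"},
    {"id": 3, "sentiment": "POSITIVE", "vader": "POSITIVE"},
    {"id": 4, "sentiment": "NONE", "vader": "POSITIVE"},
]

def agreeing_rows(rows, label):
    """Keep rows where the original label and the VADER label both equal `label`."""
    return [r for r in rows if r["sentiment"] == label and r["vader"] == label]

neg_candidates = agreeing_rows(rows, "NEGATIVE")
pos_candidates = agreeing_rows(rows, "POSITIVE")
print([r["id"] for r in neg_candidates])
print([r["id"] for r in pos_candidates])
```

On a DataFrame, the same selection could be written as `df[(df["sentiment"] == label) & (df["vader"] == label)]`.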



In [2]:
import pandas as pd
import re

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
PROJECT_PATH = '/content/drive/MyDrive/Colab/data/ma_data/'
RAW_DATA_PATH = PROJECT_PATH + 'feedback_all_vader.csv'
DATA_POS_ORIG = PROJECT_PATH + 'ma_feedback_pos_orig.csv'
DATA_POS_VADER = PROJECT_PATH + 'ma_feedback_pos_vader.csv'
DATA_NEG_VADER = PROJECT_PATH + 'ma_feedback_neg_vader.csv'
DATA_NEU_VADER = PROJECT_PATH + 'ma_feedback_neutrals_vader.csv'
DATA_NEU_ORIG = PROJECT_PATH + 'ma_feedback_neutrals_orig.csv'
DATA_NEU_FILTERED = PROJECT_PATH + 'ma_feedback_neutrals_filtered.csv'

DATA_ALL_CHECK = PROJECT_PATH + 'ma_feedback_all_doublecheck.csv'

DATA_GOLD = PROJECT_PATH + 'ma_feedback_all_gold1k_sentiments.csv'
DATA_GOLD_CHECK = PROJECT_PATH + 'ma_feedback_gold5k_doublecheck.csv'

In [6]:
df_raw = pd.read_csv(RAW_DATA_PATH)
df_raw.rename(columns = {'Unnamed: 0': 'id'}, inplace=True)
print('All Data length: ', len(df_raw))

All Data length:  38972


In [8]:
df_raw.tail()

Unnamed: 0,id,feedback_text_en,sentiment,vader,vader score,delivery,feedback_return,product,monetary,one_hot_labels,feedback_normalized,normalized_with_stopwords
38967,38967,Sunglasses were broken on the nose pads. Can I...,NONE,NONE,0.0,False,False,False,False,[0 0 0 0],sunglass break nose pad get price reduction se...,sunglass be break on the nose pad can i get a ...
38968,38968,"Dear ladies and gentlemen, I have sold the ite...",NEGATIVE,NONE,0.3804,False,False,True,False,[0 0 1 0],sell item pram girl quilt baby jumpsuit pink c...,and i have sell the item quot pram girl quilt ...
38969,38969,"Hello, after only a few months loosen the Velc...",NEGATIVE,NONE,0.0772,False,True,True,False,[0 1 1 0],month loosen velcro closure shoe shoe longer r...,after only a few month loosen the velcro closu...
38970,38970,"Hello, I gave up my order today and wanted to ...",NEGATIVE,NEGATIVE,-0.4854,False,False,False,True,[0 0 0 1],give order today want redeem voucher unfortuna...,i give up my order today and want to redeem my...
38971,38971,Because I have been in the holidays I could no...,NONE,POSITIVE,0.8434,False,False,False,False,[0 0 0 0],holiday could pick order sent th march send ba...,because i have be in the holiday i could not p...


In [None]:
df_pos_orig = pd.read_csv(DATA_POS_ORIG, delimiter=';')
df_pos_orig.rename(columns = {'Unnamed: 0': 'id'}, inplace=True)
df_pos_vader = pd.read_csv(DATA_POS_VADER, delimiter=";")
df_pos_vader.rename(columns = {'Unnamed: 0': 'id'}, inplace=True)
df_neg = pd.read_csv(DATA_NEG_VADER, delimiter=";")
df_neg.rename(columns = {'Unnamed: 0': 'id'}, inplace=True)
df_neu_vader = pd.read_csv(DATA_NEU_VADER, delimiter=";")


### Remove pos and neg ids from rows of raw dataset

In [None]:
df_pos_orig_ids = df_pos_orig["id"].astype(int).to_list()
df_pos_vader_ids = df_pos_vader["id"].astype(int).to_list()
df_neg_ids = df_neg["id"].astype(int).to_list()

ids = set(df_pos_orig_ids + df_pos_vader_ids + df_neg_ids)
print(f"Ids length: {len(ids)}")

Ids length: 5426


In [None]:
df_all_copy = df_raw.copy()

In [None]:
# Remove rows that already appear in the manually gathered pos/neg sets
duplicate_ids = df_all_copy[df_all_copy["id"].isin(ids)].index
df_all_copy.drop(duplicate_ids, inplace=True)
len(df_all_copy)

33546

In [None]:
# Filter for neg / pos words
neg_words = ["unfortunately", "however", "examination", "a mistake", "annoying", "broken", "destroyed", "missing", "fade", 
             "too small", "too big", "delay", "several time", "missing", "not received", "no confirmation", "loosen", "shock",
             "spot", "i do not like", "a complaint", "want to complain", "not true", "not justified", "no confirmation", "discolor",
             "annoy at", "dissolve", "worn", "break", "dirty", "lose its color", "fuzz", "seam", " hole", "appalled", "dirt", " stain",
             "too thin", "no package", "shortcoming", "do not suit me", "no message", "cheat", "i have problem", "torn", "no return",
             "tear", "regretfully", "painful", "not receive a refund", "not updated", " damage", "technical issue", "thin sole",
             "not my fault", "i do not want to pay", "i find no", "never receive the order", "sole come off", "dissatisfied",
             "loose", "a defect", "drop a stone", "there be problem", "no feedback", "not arrive", "wear out", "disappointed",
             "crack", "unusual", "not fit", "incomprehensible", "not receive", "ruin", "ruined", "very thin", "quite thin",
             "not strong", "not look good", "too high", "too few", "not accept", "not valid", "the thin ", "second mistake",
             "be deny", "no button", "outrage", "not find", "very long time", "bug", "no payment", "zipper", "sad", "the defect",
             "rash", "tattered", "second reminder", "warn letter", "not at the desired", "no information", "upset", "take a long time",
             "warning", "i ask for clarification", "there be an error", "angry", "frustrate", "this problem",
             "must be an error", "no order confirmation", "i be still wait", "compensation", "the mistake", "by mistake",
             "steal", "this can not be", "cut be not nice", "bad quality", "not look nice", "dissapoints", "refuse", "incorrect",
             "lose its shape", "totally fade", "i have a problem", "wrong", "not clear to me", "unnecessarily", "forget it",
             "wash out", "poor quality", "pity", "misunderstanding", "unjustified", "shit", "confused", "bad service",
             "correct that", "not well write", "for clarification", "two defect", "optical defect", "have defect", "material defect",
             "obvious defect", "defective", "be defect", "this defect", "my complaint", "the complaint", "possible complaint",
             "possible to complain", "have to complain", "like to complain", "immediately complain", "reason to complain",
             "still complain about", "to complain again", "i complain", "there to complain", "and complain", "have complain",
             "before complain", "can i complain", "really annoy", "bit annoy", "get annoy", "increasingly annoy", "than annoy",
             "very annoy", "rather annoy", "especially annoy", "not recommend", "not satisfied", "come late", "uncomfortable",
             "please take care of it", "please help", "totally crumple", "no notification", "annoyed", "unfair",
             "there be a problem", "questionable", "processing error", "threat", "without comment", "be not correct",
             "still no money back", "never heard anything", "i be have problem", "this annoy", "item be miss", "absolutely not clear",
             "too tight", "be very bad", "sole wear off", "smear", "i have already", "why do i ", " not there yet", "second time",
             "despite", "be miss", "never arrive", "supposedly", "presumably", " nerve", " odd", "not come", "not understand",
             "why be", "get a reminder", "still not", "where be my", "how come", "disappear", "desperately", "pilling",
             "terrible mess", "quality loss", "leave something to be desire", "repeat reminder", "already pay", "correct this",
             "clarify", "brighter", "dented", "miss", "please check", "scratch", "not in the", "will not pay", " exam", "glue",
             "deform", "not quite understand", "not quite clear", "disproportionate", "deficiency", "decline", "my problem",
             "never receive", "reminder", "not deliver yet", "could not be read", "be not include", "not a nice", "not yet credit",
             "be not possible", "unacceptable", "ugly", "not comfortably", "why my order need", "wrinkle", "although", "still get nothing",
             "i be wait", "third time", "release the thread", "not be refund", "not willing to", "too long", "misuse", "not be deliver",
             "come off", "not yet arrive", "be not there", "lack of quality", "peel off", "that annoy", "be annoy", "small error",
             "inaccurate", " ago", "possibility to complain", " rip", " itch", "lawyer", " pale", "broke", "not answer",
             "miserable", "request for correction", "photo", "too large", "oversight", "get thin", "bleach", " patch", 
             "do not agree", "thread error", "should not happen", "faulty", "abrasion", "material error", "unstable",
             "not yet receive", "considerable defect", "picture"] 

pos_words = ["very statisfied", "super fast", "fun to shop", "fast delivery", "fast transaction", "very fast", "fit perfectly", 
             "perfect fit", "always happy", "beautiful", "super happy", "very happy", "huge selection", "great", "i like it", 
             "uncomplicated", "i love", "super fast", "sexy", " comfortable", "totally happy", " reliable", "a lot of choice",
             "keep it up", "everything perfect", "always perfect", "be awesome", "very nice", "big selection", "absolutely fit",
             "very good", "simplicity", "large selection", "cheap price", "all good", "satisfied", "compliment", "smooth", "no defect",
             "very pleased", "undamaged", "no complaint", "without complaint", "without error", "nothing to complain", "unproblematic",
             "generous", "no annoy", "in case of complaint", "flawless", "good quality", "super quality", "without problem", 
             "absolutely satisfy", "speedy delivery", "without any hassle", "recommend", "perfect", "super ", "unbeatable",
             "without any complication", "no problem", "all the best", "everything optimal", "i be satisfy", "no shipping cost",
             "nothing to improve", "deliver quickly", "wonderful", "good status", "stress free", "all right", "tip top",
             "no technical difficulty", "always fast", "always good", "my pleasure", "positive experience", "excellent service",
             "any time", "ship promptly", "everything work out", " satisfaction", "always satisfy", "everything be fine",
             "arrive quickly", "attractive", "very clear", "best quality", "big choice", "fast", "clear page", "clear website",
             "clear structure", " convenient", "do quickly", "everything be good", "everything be fine", "everything fit", "everything ok",
             "everything okay", "everything top", "everything go well", "everything work", "fast and easy", "fast process",
             "fast ship", "free shipping", "gladly again", "good choice", "good overview", "good selection",
             "wide selection", "prompt delivery", "so far so good", "only recommend", "very friendly", "completely satisfy", 
             "i like", "very satisfy", "be well describe", "amazing offer"]

In [None]:
def get_filtered_sentiments(texts, lexicon_neg=neg_words, lexicon_pos=pos_words):
  """Label each text by substring lookup in the lexicons; negative phrases take precedence."""
  res = []
  for text in texts:
    if any(n in text for n in lexicon_neg):
      res.append("NEGATIVE")
    elif any(p in text for p in lexicon_pos):
      res.append("POSITIVE")
    else:
      res.append("NEUTRAL")
  
  return res
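Because `get_filtered_sentiments` uses plain substring checks, a short entry such as "fast" also fires inside longer words. The following is a hedged sketch of a stricter alternative that matches whole words only, using the `re` module imported at the top. The helper `compile_lexicon` is hypothetical. It is not what the notebook does, and it would not reproduce the counts reported here, since some entries may rely on matching as prefixes.

```python
import re

def compile_lexicon(phrases):
    """Build one regex that matches any phrase only at word boundaries."""
    # Strip the padding spaces some lexicon entries use, drop duplicates,
    # and try longer phrases first so they win over their own prefixes.
    cleaned = sorted({p.strip() for p in phrases if p.strip()}, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(p) for p in cleaned) + r")\b")

# Small hypothetical subset of the positive lexicon.
pos_pattern = compile_lexicon(["fast", "i like", " reliable"])

print("fast" in "i skip breakfast today")                  # substring check: True
print(bool(pos_pattern.search("i skip breakfast today")))  # word-boundary check: False
print(bool(pos_pattern.search("delivery be fast")))        # True
```

Whether the stricter match is preferable depends on how consistently the texts were lemmatized, so it is best compared against the substring filter on the manually checked samples first.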

In [None]:
df_neg_texts = df_neg["normalized_with_stopwords"].astype(str).tolist()
df_pos_orig_texts = df_pos_orig["normalized_with_stopwords"].astype(str).tolist()
df_pos_vader_texts = df_pos_vader["normalized_with_stopwords"].astype(str).tolist()
df_neu_vader_texts = df_neu_vader["normalized_with_stopwords"].astype(str).tolist()

### Small test
To double-check that negative or positive words do not appear within neutral texts.


In [None]:
# Neutral text
test_text = ["i have order the emporio armani shoe in size amp i forget to send the size back after contact with your customer support i be tell that i may exceptionally send the item back would you be so nice and would send me a return label in this regard in advance and happy easter"]

In [None]:
lexicon = pos_words # or neg_words
for text in test_text:
  for phrase in lexicon:
    if phrase in text:
      print(phrase)

In [None]:
res = get_filtered_sentiments(test_text)
res

['NEUTRAL']

# Iterate over manual samples
The filter is run on samples that have already been labeled manually in order to identify outliers. Phrases in those outliers that point towards a specific sentiment are added to the lexicon above.
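One way to surface such outliers together with the phrases that triggered them is sketched below. The helpers `matched_phrases` and `find_outliers`, the mini-lexicons, and the sample texts are all hypothetical. Only the precedence rule (negative before positive) is taken from `get_filtered_sentiments`.

```python
def matched_phrases(text, lexicon):
    """Return every lexicon phrase that occurs in `text` as a substring."""
    return [phrase for phrase in lexicon if phrase in text]

def find_outliers(texts, expected, lexicon_neg, lexicon_pos):
    """List texts whose filter label differs from the manually assigned one."""
    outliers = []
    for text in texts:
        neg_hits = matched_phrases(text, lexicon_neg)
        pos_hits = matched_phrases(text, lexicon_pos)
        # Same precedence as get_filtered_sentiments: negative wins over positive.
        label = "NEGATIVE" if neg_hits else "POSITIVE" if pos_hits else "NEUTRAL"
        if label != expected:
            outliers.append({"text": text, "label": label,
                             "neg_hits": neg_hits, "pos_hits": pos_hits})
    return outliers

# Hypothetical mini-lexicons and texts, not taken from the dataset.
sample_neg = ["broken", "too small"]
sample_pos = ["very happy", "fast delivery"]
sample_texts = ["very happy with the fast delivery",
                "very happy but the zip be broken",
                "where can i find the size chart"]

outliers = find_outliers(sample_texts, "POSITIVE", sample_neg, sample_pos)
for row in outliers:
    print(row["label"], row["neg_hits"], row["pos_hits"], "|", row["text"])
```

Listing the matched phrases next to each outlier shows whether a lexicon entry is too broad or whether a phrase is still missing.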

In [None]:
neg_filtered_sentiments = get_filtered_sentiments(df_neg_texts)
pos_orig_filtered_sentiments = get_filtered_sentiments(df_pos_orig_texts)
pos_vader_filtered_sentiments = get_filtered_sentiments(df_pos_vader_texts)
neu_filtered_sentiments = get_filtered_sentiments(df_neu_vader_texts)

df_neg["sent labels"] = neg_filtered_sentiments
df_pos_orig["sent labels"] = pos_orig_filtered_sentiments
df_pos_vader["sent labels"] = pos_vader_filtered_sentiments
df_neu_vader["sent labels"] = neu_filtered_sentiments

In [None]:
# Double-check how many of the manually gathered positive samples the filter identified
df_res = pd.DataFrame(pos_vader_filtered_sentiments, columns=["labels"])
df_res.describe()

Unnamed: 0,labels
count,858
unique,3
top,POSITIVE
freq,811


In [None]:
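# Save results to find new phrases for the lexicon, then re-iterate
# NOTE: the three *_CHECK path constants used here are not defined in the path cell above and must be set first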
df_neg.to_csv(DATA_NEG_VADER_CHECK)
df_pos_orig.to_csv(DATA_POS_ORIG_CHECK)
df_pos_vader.to_csv(DATA_POS_VADER_CHECK)

# Get Sentiment for All Data
The sentiment distributions of the following three columns are also compared:
* "sentiment": the original sentiments
* "vader": sentiments predicted by VADER
* "filtered sentiment": sentiments found by this notebook
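Beyond comparing the per-column counts, the columns can be compared row by row. The sketch below uses hypothetical rows and plain Python. The `unify` mapping of NONE to NEUTRAL is an assumption about how the label sets correspond, not something the notebook states.

```python
from collections import Counter

# Hypothetical rows: (original "sentiment", "vader", "filtered sentiment").
rows = [
    ("NEGATIVE", "NONE", "NEGATIVE"),
    ("NEGATIVE", "NEGATIVE", "NEGATIVE"),
    ("NONE", "POSITIVE", "NEUTRAL"),
    ("POSITIVE", "POSITIVE", "POSITIVE"),
    ("BOTH", "NONE", "NEGATIVE"),
]

# The filter says NEUTRAL where the other two columns say NONE;
# map both to one name before comparing (an assumption about their meaning).
def unify(label):
    return "NEUTRAL" if label == "NONE" else label

# Count each (original, filtered) label pair, then the share that agrees.
pairs = Counter((unify(orig), filt) for orig, _, filt in rows)
agreement = sum(n for (orig, filt), n in pairs.items() if orig == filt) / len(rows)

for (orig, filt), n in sorted(pairs.items()):
    print(f"original={orig:<8} filtered={filt:<8} count={n}")
print(f"agreement with original labels: {agreement:.0%}")
```

On the DataFrame itself, `pd.crosstab(df_all_copy["sentiment"], df_all_copy["filtered sentiment"])` would produce the same kind of pair table.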


In [None]:
df_all_copy_texts = df_all_copy["normalized_with_stopwords"].astype(str).tolist()
all_filtered_sentiments = get_filtered_sentiments(df_all_copy_texts)

df_res = pd.DataFrame(all_filtered_sentiments, columns=["labels"])
df_all_copy["filtered sentiment"] = all_filtered_sentiments
df_res.describe()

Unnamed: 0,labels
count,33546
unique,3
top,NEGATIVE
freq,22343


In [None]:
df_all_copy["sentiment"].value_counts()

NEGATIVE    27415
NONE         5263
POSITIVE      572
BOTH          296
Name: sentiment, dtype: int64

In [None]:
df_all_copy["vader"].value_counts()

NONE        23989
NEGATIVE     7818
POSITIVE     1739
Name: vader, dtype: int64

In [None]:
df_all_copy["filtered sentiment"].value_counts()

NEGATIVE    22343
NEUTRAL      9950
POSITIVE     1253
Name: filtered sentiment, dtype: int64

In [None]:
df_all_copy.to_csv(DATA_ALL_CHECK)

# Gold Labels
After multiple iterations of the steps above, a certain number of samples for each sentiment was gathered, cleaned, and filtered a final time to make the dataset as balanced as possible and ready for training the custom sentiment analysis model.
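The notebook does not show how the classes were balanced. One common option, sketched here purely as an assumption, is to downsample every class to the size of the smallest one. The helper `downsample_balanced` and the sample data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def downsample_balanced(samples, seed=0):
    """Randomly trim every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    return balanced

# Hypothetical labelled samples.
samples = ([(f"neg {i}", "NEGATIVE") for i in range(5)]
           + [(f"pos {i}", "POSITIVE") for i in range(3)]
           + [(f"neu {i}", "NEUTRAL") for i in range(4)])

balanced = downsample_balanced(samples)
print(Counter(label for _, label in balanced))
```

The gold set below is only approximately balanced, which suggests the samples were gathered towards similar class sizes rather than trimmed to identical ones.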

In [9]:
df_gold = pd.read_csv(DATA_GOLD, delimiter=';')
print('Gold length: ', len(df_gold))
df_gold.drop(columns="Unnamed: 0", inplace=True)
df_gold.tail()

Gold length:  5236


Unnamed: 0,id,feedback_text_en,sentiment,vader,vader score,delivery,feedback_return,product,monetary,one_hot_labels,Unnamed: 10,normalized_with_stopwords,sent labels
5231,22851,I had ordered for the first time and was very ...,POSITIVE,NONE,0.4754,False,False,False,False,[0 0 0 0],order first time satisfy,i have order for the first time and be very sa...,POSITIVE
5232,22855,The photos and videos of the Artkikeln are sup...,POSITIVE,NONE,0.5994,False,False,True,False,[0 0 1 0],photo videos artkikeln super get good picture,the photo and video of the artkikeln be super ...,POSITIVE
5233,22859,- wide selection of high quality branded produ...,POSITIVE,NEGATIVE,-0.4588,False,False,True,False,[0 0 1 0],wide selection high quality brand product simp...,wide selection of high quality brand product s...,POSITIVE
5234,22865,Everything was fine ...... have nothing to men...,POSITIVE,NONE,0.5984,False,False,False,False,[0 0 0 0],everything fine nothing mention satisfy,everything be fine have nothing to mention ver...,POSITIVE
5235,22866,Good choice,POSITIVE,NONE,0.0,False,False,True,False,[0 0 1 0],good choice,good choice,POSITIVE


In [None]:
df_gold.drop_duplicates(subset="id", inplace=True)
len(df_gold)

4567

In [None]:
df_gold["sent labels"].value_counts()

NEUTRAL     1565
POSITIVE    1540
NEGATIVE    1462
Name: sent labels, dtype: int64

In [None]:
df_gold_texts = df_gold["normalized_with_stopwords"].astype(str).tolist()
filtered_sentiments = get_filtered_sentiments(df_gold_texts)
df_gold["sent labels"] = filtered_sentiments
df_gold["sent labels"].value_counts()

NEUTRAL     1530
POSITIVE    1522
NEGATIVE    1515
Name: sent labels, dtype: int64

In [None]:
df_gold.to_csv(DATA_GOLD_CHECK)