# Twitter Sentiment Analysis - using WordNet PosTag Lemmatization

Ќе го модификуваме претходното решение малку, со тоа што ќе се обидеме да направиме лематизација 
на зборовите користејќи ја WordNet мрежата.

Но за да разликуваме синоними кои имаат различна природа (глагол наспроти именка и сл.) ќе се обидеме прво на секој
од зборовите да му ја најдеме улогата во реченицата, т.е. PosTag-от со nltk. 

Зависно од типот на тагот ќе имаме 3 категории на зборови:
* Зборови кои ќе се игнорираат (испуштаат)
 * CC, DT, EX, LS, PDT, POS, RP, UH, WDT, WP, WP\$, WRB, MD
* Зборови кои ќе се остават какви што се
 * CD, FW, IN, PRP, PRP\$
* Зборови кои ќе се лематизираат
 * JJ, JJR, JJS, NN, NNS, NNPS, RB, RBR, RBS, VB, VBD, VBG, VBN, VBP, VBZ
 
 За што означува секој од таговите повеќе на следниов [линк](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) 

In [3]:
# lemmatization with wordnet and lexicon creation
POS_TAGS_TO_IGNORE = ['CC', 'DT', 'EX', 'LS', 'PDT', 'POS', 'RP', 'UH', 'WDT', 'WP', 'WP$', 'WRB', 'MD']
POS_TAGS_TO_LEAVE_AS_IS = ['CD', 'FW', 'IN', 'PRP', 'PRP$']
POS_TAGS_TO_LEMMATIZE = ['JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNPS', 'RB', 'RBR', 'RBS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP',
                         'VBZ']

PосTag-ерот на nltk има многу категории, но WordNet разликува само 4 категории, придавки (adj), глаголи (verb), именки (noun) и прилози (adv)
 
Конверзијата меѓу двата вида тагови ја правиме со

In [4]:
def get_wordnet_pos(pos_tag: str):
    if pos_tag.startswith('J'):
        return nltk.corpus.reader.ADJ
    elif pos_tag.startswith('V'):
        return nltk.corpus.reader.VERB
    elif pos_tag.startswith('N'):
        return nltk.corpus.reader.NOUN
    elif pos_tag.startswith('R'):
        return nltk.corpus.reader.ADV
    else:
        return None

Пред да лематизираме проверуваме дали пос тагот е погоден за ворднет

In [5]:
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from typing import List


def lemmatize_with_wordnet(word: str, pos_tag: str, lemmatizer: WordNetLemmatizer):
    wordnet_pos = get_wordnet_pos(pos_tag)
    if wordnet_pos is not None:
        return lemmatizer.lemmatize(word, wordnet_pos)

    return word

# end of lexicon creation and lemmatization


Фунцкиите кои листа од твитови ја лематизираат и токенизираат се дадени следни

In [6]:
def prune_tokens_based_on_pos(word: str, pos_tag: str, lemmatizer: WordNetLemmatizer):
    if pos_tag in POS_TAGS_TO_IGNORE:
        return None
    elif pos_tag in POS_TAGS_TO_LEAVE_AS_IS:
        return word
    elif pos_tag in POS_TAGS_TO_LEMMATIZE:
        return lemmatize_with_wordnet(word, pos_tag, lemmatizer)
    else:
        return word


def prune_token_list(tokens: List[str], lemmatizer: WordNetLemmatizer) -> List[List[str]]:
    tokens_with_pos = nltk.pos_tag(tokens)
    pruned = [prune_tokens_based_on_pos(token[0], token[1], lemmatizer) for token in tokens_with_pos]
    return list(filter(lambda x: x is not None, pruned))


def tokenize_and_lemmatize_tweets(tweets: List[str]) -> List[List[str]]:
    lemmatizer = WordNetLemmatizer()

    pruned_tweets = []
    count = 0
    for tweet in tweets:
        count += 1
        tokenized_tweet = nltk.word_tokenize(tweet.lower())
        pruned_tweet = prune_token_list(tokenized_tweet, lemmatizer)
        pruned_tweets.append(pruned_tweet)
        if count % 1000 == 0:
            print("Processed tweets: ",count)
            
    return pruned_tweets



Пример како се користи функцијата, и какви резултати дава:

In [7]:
print("Example usage:")
print(tokenize_and_lemmatize_tweets(["i like to eat pie"]))
print(tokenize_and_lemmatize_tweets(["eating pie can be extremely difficult"]))
print(tokenize_and_lemmatize_tweets(["How are you my friend? -Fine, thanks."]))

Example usage:
[['i', 'like', 'to', 'eat', 'pie']]
[['eat', 'pie', 'be', 'extremely', 'difficult']]
[['be', 'you', 'my', 'friend', '?', '-fine', ',', 'thanks', '.']]


И овој пат ќе ги отстраниме линковите од твитовите, со истата функција од претходно

In [9]:
import re

def remove_pattern_from_string(given_string, re_compiled_pattern):
    """removes a re compiled regex pattern from a given string """
    return re_compiled_pattern.sub("", given_string)

def remove_links_from_tweets(tweets_list: List[str]) -> List[str]: 
    """for a given list of tweet texts removes all links starting with http:// or https://"""
    regex = re.compile("(https?://\S+)", re.IGNORECASE)
    tweets_without_links = []
    for tweet in tweets_list:
        clean_tweet = remove_pattern_from_string(tweet, regex)
        tweets_without_links.append(clean_tweet)
    return tweets_without_links

Сакаме да се обидеме зборовите кои воопшто не се во WordNet да ги игнорираме

In [10]:
from nltk.corpus import wordnet
from nltk import WordNetLemmatizer

letters_only_regex = re.compile("[^a-zA-Z]")

def keep_only_wordnet_tokens(tweet_tokens:List[str]) -> List[str]:
    tweet_tokens = [letters_only_regex.sub("", word) for word in tweet_tokens]
    return [word for word in tweet_tokens if wordnet.synsets(word)]
    
example_tokens = ['i', 'like', 'pie', '#fire', 'bableh','3rd']
pruned_tokens = keep_only_wordnet_tokens(example_tokens)
print(example_tokens, ' -> ', pruned_tokens)

def keep_only_wordnet_tokens_in_list_of_tweets(tweets: List[List[str]]) -> List[List[str]]:
    return [keep_only_wordnet_tokens(tweet) for tweet in tweets]

['i', 'like', 'pie', '#fire', 'bableh', '3rd']  ->  ['i', 'like', 'pie', 'fire']


Сите трансформации на твитовите ги комбинираме во една функција

In [26]:
def join_words_in_list_of_tweets(tweet_list : List[List[str]]) -> List[str]:
    return [" ".join(tweet) for tweet in tweet_list]

def clean_list_of_tweets(tweet_list:List[str], keep_only_wordnet_tokens: bool = True) -> List[str]:
    modified_tweets = remove_links_from_tweets(tweet_list)
    modified_tweets = tokenize_and_lemmatize_tweets(modified_tweets)
    if keep_only_wordnet_tokens:
        modified_tweets = keep_only_wordnet_tokens_in_list_of_tweets(modified_tweets)
    modified_tweets = join_words_in_list_of_tweets(modified_tweets)
    return modified_tweets

Како и претходно ги вчитуваме сите твитови од влезната датотека и ги конвертираме во погоден формат за пречистување

In [40]:
input_lines = []

# training data set, each line = (tweet, class)
train_file_name = "train_and_dev_data/tweet_input/train_input.tsv"

# test data set, each line = (index/ line no., tweet)
test_file_name = "train_and_dev_data/tweet_input/test_input.tsv"

# solutions file, each line = (index, correct class)
solutions_file_name = "train_and_dev_data/tweet_output/test_solutions.tsv"

Reading input from:  train_and_dev_data/tweet_input/train_input.tsv


In [29]:
print("Reading input from: ", train_file_name)
with open(train_file_name) as f:
    input_lines = f.readlines()

def get_tuple_from_input_file(lines_with_tweet_and_class, delimiter):
    """splits each tuple (tweet, class) and appends them to tweets[] and classes[] accordingly and 
        returns them as (tweets, classes)"""
    tweets = []
    classes = []
    for tweet in lines_with_tweet_and_class:
        splits = tweet.split(delimiter)
        tweets.append(splits[0])
        classes.append(splits[1])

    return tweets, classes

# convert from (tweet, class)[] to (tweet[], class[])
tweets_and_class_tuple = get_tuple_from_input_file(input_lines, "\t")
clean_tweets = clean_list_of_tweets(tweets_and_class_tuple[0], False)

Processed tweets:  1000
Processed tweets:  2000
Processed tweets:  3000
Processed tweets:  4000


In [32]:
print(clean_tweets[0])

we think it 's much good than kid obsess over people like jay-z amp ; cyrus .


In [33]:
from sklearn.feature_extraction.text import CountVectorizer

print("Creating the bag of words ...")
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

# each tweet is converted to a vector using the bag of words method
train_data_features = vectorizer.fit_transform(clean_tweets)
train_data_features = train_data_features.toarray()
vocab = vectorizer.get_feature_names()
print("Bag of words done, example vocab: ",vocab[0:10])

Creating the bag of words ...
Bag of words done, example vocab:  ['00', '000', '00pm', '01', '03', '039', '08', '0800', '09', '10']


Методот на креирање на речникот и потоа тренирање на класификаторот остануваат исти

In [34]:
import sklearn.svm as svm

# from sklearn.linear_model import LogisticRegression
# from sklearn.ensemble import ExtraTreesClassifier
# from sklearn.ensemble import RandomForestClassifier

print("Training the classifier")
classifier = svm.SVC(kernel='rbf', C=10, gamma=0.001, max_iter=-1,
                     verbose=True, class_weight='balanced', cache_size=4000,
                     probability=True)
# classifier = RandomForestClassifier(n_estimators=200)
# classifier = ExtraTreesClassifier(n_estimators=200, n_jobs=-1, verbose=False,
#                                   class_weight='auto',
#                                   bootstrap=True)
# classifier = LogisticRegression(verbose=False)
classifier = classifier.fit(train_data_features, tweets_and_class_tuple[1])

print("Training done")

Training the classifier
[LibSVM]Training done


In [35]:

def get_tuple_from_test_input_file(tweets_with_number, delimiter):
    """splits each tuple (index, tweet) adding the results in into tweets and indexes returns (tweets, indexes)"""
    tweets = []
    index_numbers = []
    for tweet in tweets_with_number:
        splits = tweet.split(delimiter)
        tweets.append(splits[1])
        index_numbers.append(splits[0])

    return tweets, index_numbers


test_lines = []
with open(test_file_name) as f:
    test_lines = f.readlines()

print("Cleaning test tweets")
# convert from (index, tweet)[] to (index[], tweet[])
test_tweets_and_index = get_tuple_from_test_input_file(test_lines, "\t")

clean_test_tweets = clean_list_of_tweets(test_tweets_and_index[0], False)

Cleaning test tweets
Processed tweets:  1000
Processed tweets:  2000
Processed tweets:  3000
Processed tweets:  4000
Processed tweets:  5000
Processed tweets:  6000
Processed tweets:  7000
Processed tweets:  8000
Processed tweets:  9000
Processed tweets:  10000
Processed tweets:  11000
Processed tweets:  12000
Processed tweets:  13000
Processed tweets:  14000
Processed tweets:  15000
Processed tweets:  16000
Processed tweets:  17000
Processed tweets:  18000
Processed tweets:  19000
Processed tweets:  20000
Processed tweets:  21000
Processed tweets:  22000
Processed tweets:  23000
Processed tweets:  24000
Processed tweets:  25000
Processed tweets:  26000
Processed tweets:  27000
Processed tweets:  28000
Processed tweets:  29000
Processed tweets:  30000
Processed tweets:  31000
Processed tweets:  32000


In [36]:

# create a vector representation of the test tweets
test_data_features = vectorizer.transform(clean_test_tweets)
test_data_features = test_data_features.toarray()

print("Starting prediction")
result = classifier.predict(test_data_features)


print("Prediction done, starting evaluation")

Starting prediction
Prediction done, starting evaluation


In [37]:
POSITIVE = "positive"
NEGATIVE = "negative"
NEUTRAL = "neutral"


def calculate_score(solutions, result):
    if len(solutions) != len(result):
        raise Exception(
            "solutions and result are not the same length, " + str(len(solutions)) + " vs " + str(len(result)))

    confusion_matrix = {}

    total = len(solutions)
    for i in range(0, total):
        result_class = result[i].strip()
        if result_class not in confusion_matrix:
            confusion_matrix[result_class] = {}

        result_row = confusion_matrix[result_class]
        solution_class = solutions[i].strip()
        if solution_class not in result_row:
            result_row[solution_class] = 0.0

        result_row[solution_class] += 1.0

    PP = confusion_matrix[POSITIVE][POSITIVE]
    PU = confusion_matrix[POSITIVE][NEUTRAL]
    PN = confusion_matrix[POSITIVE][NEGATIVE]
    UP = confusion_matrix[NEUTRAL][POSITIVE]
    NP = confusion_matrix[NEGATIVE][POSITIVE]
    NN = confusion_matrix[NEGATIVE][NEGATIVE]
    NU = confusion_matrix[NEGATIVE][NEUTRAL]
    UN = confusion_matrix[NEUTRAL][NEGATIVE]

    precision_p = PP / (PP + PU + PN)
    recall_p = PP / (PP + UP + NP)
    F1_P = (2 * precision_p * recall_p) / (precision_p + recall_p)

    precision_n = NN / (NP + NU + PN)
    recall_n = NN / (NN + UN + PN)
    F1_N = (2 * precision_n * recall_n) / (precision_n + recall_n)
    return (F1_P + F1_N) / 2

solutions = []
with open(solutions_file_name) as f:
    solutions = f.readlines()

solution_and_index = get_tuple_from_test_input_file(solutions, "\t")
solutions = solution_and_index[0]
f1 = calculate_score(solutions, result)

In [38]:
print(f1)

0.4634390115963537


Резултатите се далеку полоши од претходно, ни со менување на тоа дали ќе се земат предвид само зборови кои ги има во WordNet каталогот или не