## BANA 275: NATRL LANG PROCESS

# Homework 1: Rule-Based Classification


The first programming assignment will familiarize you with the basic text processing methods, the
use of pre-built lexicons and rules for text classification.

## 1 Task: Sentiment Classification

The primary objective for the assignment is to predict the sentiment of a movie review. In particular, we
will be providing you with a dataset containing the text of the movie reviews from IMDB, and for each
review, you have to predict whether the review is positive or negative. We will also provide some code to
help you read and write the output files.

### 1.1 Data

The primary data file is named data.zip, which contains the following:
```
- lexicon/: Two sentiment lexicons. The code for reading them is included.
- test/: Folder of text files containing reviews that are not labeled.
- train/: Folder of text files containing the reviews that are part of labeled data.
- train.csv: List of files and associated sentiment label, for evaluating your classifier.

Note: train/ and test/ folder should contain 25,000 files. If you have 25,001 on disk, remember to delete the .DS_Store or desktop.ini before running any code.
```

### 1.2 Kaggle

Kaggle is a website that hosts machine learning competitions, and we will be using it to evaluate and
compare the accuracy of your classifiers. We know the true sentiment for each of the _unlabeled_ reviews,
which we will use to evaluate your submissions, and thus your submission file to Kaggle should contain a
predicted label for all the unlabeled reviews. In particular, the submission file ```test.csv``` should have the following format (code already does this):

- Start with a single line header: ```Fileindex, Category```
- For each of the unlabeled speech (sorted by name) there is a line containing an increasing integer index (i.e. line number 1), then a comma, and then the string label prediction of that speech.
- See ```test-basic.csv``` for example.

You can make ***at most _three_*** submissions each day, so we encourage you to test your submission files early,
and observe the performance of your system. By the end of the submission period, you will have to select
the two submissions the best of which you want to be judged as your final submission. Public leaderboard uses 30% of the data while your performance is evaluated by private leaderboard that uses 70% of the data.

### 1.3 Source Code

Some initial code contains methods for loading the data and lexicons, and calling the methods to run and
evaluate your classifier. It also contains the code to output the submission file from your classifier (called
```test.csv```) that you will submit to Kaggle. Your directory structure should look like this.
```
hw1  
│
└───code
│   └───rule-based-classification.ipynb
└───data
│   └───lexicon
│       │   inqtabs.txt
│       └───SentiWordNet_3.0.0_20130122.txt
│   └───test
│       │   0.txt
│       │   1.txt
│       │   ...   
│       └───24999.txt
│   └───train
│       │   0.txt
│       │   1.txt
│       │   ...   
│       └───24999.txt
│   └───train.csv
└───output
    │   fn.txt
    │   fp.txt
    │   test.csv
    │   tn.txt
    └───tp.txt
Note: You need mannually create folder 'code' and 'output'.
```

This [code block](#cb) contains the skeleton of your classifier; this is the **only** part you need to modify.

## 2 What to submit?

Prepare and submit a single write-up ( **PDF, maximum 2 pages** ) and a jupyter notebook to Canvas. **Do not include your student ID number** , since we might share it with the class if it’s
worth highlighting. The write-up and code should address the following.

### 2.1 Preliminaries (5 points)

At the top of your write-up, include your team's Kaggle name such as '**Sec A Team 1**' , and the accuracy that your **_best_** submission obtained on Kaggle. You do **not** need to include any other details such as name, UCINet Id, etc. 

### 2.2 Rule-Based Classifier (40 points)

Your main goal is to improve the basic classifier. For this, you should consider doing both of the following:

- **Lexicons** : We have provided two lexicons for your use. Each lexicon is a dictionary containing words
as keys and the sentiment as the value. For Harvard Inquirer [(inqtabs_dict)](http://www.wjh.harvard.edu/~inquirer/), the value is a sentiment
label: 0 for negative and 1 for positive. For SentiWordNet [(swn_dict)](http://sentiwordnet.isti.cnr.it/), each value is a pair of positive
and negative scores, respectively. Use them as you see fit.

- **Regular Expressions** : After looking at some reviews, you may have ideas for rules on the review
text that you think will help predict the sentiment. Implement them using if/then and regular
expressions.

Implement your suggestions in ```classify()```, and describe them in a few sentences in your
report. The primary evaluation for this part will be the performance of your classifier, combined with how
creative/interesting your proposed ideas are.

### 2.3 Examples (30 points)

In order to aid analysis, you also need to figure out the errors being made by your classifiers, i.e. split each prediction into _four_ categories: true positives, true negatives, false positives, and false negatives. If you look at ```get_error_type()```, there is an incorrect implementation of this method. Fix this code to print the appropriate examples, which will result in 4 files full of reviews, called ```fp.txt, fn.txt, tp.txt, and tn.txt```. Include 2-3 examples from the false positives and negatives in your report.

### 2.4 Analysis (20 points)

Analyze the above false positive and false negatives in your writeup. In particular, in a few sentences,
describe what is lacking in your approach, i.e. why do you think the errors exist. Write a sentence or two
about how you would address them if you had more time. You will be evaluated on how well you were able
to identify the problems, and the creativity of your proposed future solution.



## 3 Statement of Collaboration (5 points)

It is **mandatory** to include a _Statement of Collaboration_ in each submission, with respect to the guidelines
below. Include the names of everyone involved in the discussions (especially in-person ones), and what
was discussed.

All students are required to follow the academic honesty guidelines posted on the course website. For
programming assignments, in particular, we encourage the students to organize to
discuss the task descriptions, requirements, bugs in our code, and the relevant technical content _before_ they
start working on it. However, you should not discuss the specific solutions, and, as a guiding principle, you
are not allowed to take anything written or drawn away from these discussions (i.e. no photographs of the
blackboard, written notes, etc.). Especially _after_ you have started working on the
assignment, try to restrict the discussion on Canvas as much as possible, so that there is no doubt as to the
extent of your collaboration.

In [1]:
# Some initial codes. Do not modify.
import os
import csv
import sys
from tqdm.notebook import tqdm

POS_LABEL = '1'
NEG_LABEL = '0'


def check_if_exist(file_path):
    if not os.path.exists(file_path):
        print(file_path + ' could not be found')
        return False
    return True


def extract_word(word):
    return word.lower() if word.find('#') < 0 else word[:word.find('#')].lower()


def read_inqtabs(input_file_path):
    """
    :param input_file_path:
    :return lexicons: dictionary of labels (e.g. lexicons['good']: 1, lexicons['bad']: 0)
    """
    if not check_if_exist(input_file_path):
        return

    lexicons = dict()
    with open(input_file_path, 'r', encoding='utf-8') as fp:
        for line in fp.readlines():
            elements = line.strip().split('\t')
            word = extract_word(elements[0])
            if len(word) > 0 and (elements[2] == 'Positiv' or elements[3] == 'Negativ'):
                label = POS_LABEL if elements[2] == 'Positiv' else NEG_LABEL
                lexicons[word] = label
    return lexicons


def read_senti_word_net(input_file_path):
    """
    :param input_file_path:
    :return lexicon: dictionary of lists (e.g. lexicons['good'][0]: positive score, lexicons['bad'][1]: negative score)
    """
    if not check_if_exist(input_file_path):
        return

    all_lexicons = dict()
    with open(input_file_path, 'r', encoding='utf-8') as fp:
        for line in fp.readlines():
            if line.startswith('#'):
                continue

            elements = line.strip().split('\t')
            if len(elements) < 5 or len(elements[4]) == 0:
                continue

            for tmp_word in elements[4].split(' '):
                word = extract_word(tmp_word).replace('_', ' ')
                if len(word) > 0 and len(elements[2]) > 0 and len(elements[3]) > 0:
                    if word not in all_lexicons.keys():
                        all_lexicons[word] = list()
                        all_lexicons[word].append(list())
                        all_lexicons[word].append(list())
                    all_lexicons[word][0].append(float(elements[2]))
                    all_lexicons[word][1].append(float(elements[3]))

    lexicons = dict()
    for word in all_lexicons.keys():
        lexicons[word] = (max(all_lexicons[word][0]), max(all_lexicons[word][1]))
    return lexicons


def get_training_data(filedir):
    with open(os.path.join(filedir, 'train.csv'), encoding='utf-8') as csvfile:
        training_data = [row for row in csv.DictReader(csvfile, delimiter=',')]
        for entry in training_data:
            with open(os.path.join(filedir, 'train', entry['FileIndex'] + '.txt'), encoding='utf-8') as reviewfile:
                entry['Review'] = reviewfile.read()
    return training_data


def get_training_accuracy(data, inqtabs_dict, swn_dict):
    num_correct = 0
    etype_files = {}
    for etype in ["fp", "fn", "tp", "tn"]:
        etype_files[etype] = open('../output/'+ etype + '.txt', 'w+', encoding='utf-8')
    for row in data:
        sentiment_prediction = classify(row['Review'], inqtabs_dict, swn_dict)
        sentiment_label = int(row['Category'])
        if sentiment_prediction == sentiment_label:
            num_correct += 1
        etype = get_error_type(sentiment_prediction, sentiment_label)
        etype_files[etype].write("%s\t%s\n"%(row['FileIndex'], row['Review']))
    accuracy = num_correct * 1.0 / len(data)
    for etype in ["fp", "fn", "tp", "tn"]:
        etype_files[etype].close()
    print("Accuracy: " + str(accuracy))
    return accuracy


def write_predictions(filedir, inqtabs_dict, swn_dict, output_file_name):
    testfiledir = os.path.join(filedir, 'test')
    with open(output_file_name, 'w+', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, delimiter=',', fieldnames=['FileIndex', 'Category'])
        writer.writeheader()
        for filename in tqdm(sorted(os.listdir(testfiledir), key=lambda x: int(os.path.splitext(x)[0]))):
            with open(os.path.join(testfiledir, filename), encoding='utf-8') as reviewfile:
                review = reviewfile.read()
                prediction = dict()
                prediction['FileIndex'] = os.path.splitext(filename)[0]
                prediction['Category'] = classify(review, inqtabs_dict, swn_dict)
                writer.writerow(prediction)

<a id="cb"></a>
# Code Block for you to modify:

In [10]:
import re
import nltk
import numpy as np
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from collections import defaultdict
import operator


def get_error_type(pred, label):
    # return the type of error: tp,fp,tn,fn
    if pred == 1 and label == 1:
        return 'tp'
    elif pred == 1 and label == 0:
        return 'fp'
    elif pred == 0 and label == 1:
        return 'fn'
    elif pred == 0 and label == 0:
        return 'tn'


lemmatizer = WordNetLemmatizer()
term_count = {}
term_sentiment = {}

def top_terms(text, inqtabs_dict,swn_dict): #used to update dictionary, not to be applied into 'classify' function
    #clean up text
    regex_clean_html = re.compile(r'<.*?>') # Remove any HTML tags
    clean = re.sub(regex_clean_html, '', text)
    regex = r'[A-Za-z]{3,}\b' #You'll probably want to update this regular expression
    words = re.findall(regex, clean) # Extract word from a text
    
    positive_words_inqtabs = []
    negative_words_inqtabs = []
    words_matched_inqtabs = []
    positive_words_swn = {}
    negative_words_swn = {}
    positive_words_swn_count = []
    negative_words_swn_count = []
    words_matched_swn = []
    
    # Stopwords
    stopword = stopwords.words('english')
    words_lower = [x.lower() for x in words] # lower case
    words_rm = [x for x in words_lower if x not in stopword] # remove stopwords
    
    for word in words_rm:
        word = word.lower()  #You can do other preprocessing here like remove punctuation, etc
        if word in inqtabs_dict:
            if inqtabs_dict[word] == '1':
                positive_words_inqtabs.append(word)
            else:
                negative_words_inqtabs.append(word)
            words_matched_inqtabs.append(word)
        if word not in inqtabs_dict and word in swn_dict:
            positive_words_swn[word] = swn_dict[word][0]
            negative_words_swn[word] = swn_dict[word][1]
            words_matched_swn.append(word)

        
        if word not in term_count:
            term_count[word] = 1
        else:
            term_count[word] += 1

#built for inqtabs_dict
    sorted_pos = positive_words_inqtabs
    sorted_neg = negative_words_inqtabs
# #built for swn_dict
#     sorted_pos = sorted(positive_words_swn.items(), key=lambda x: x[1], reverse=True)
#     sorted_neg = sorted(negative_words_swn.items(), key=lambda x: x[1], reverse=True)
#     n = 20
#     print(sorted_pos)
#     print()
#     print(sorted_neg)
    
    
#     print("Positive words found: ", len(positive_words_swn), sorted(positive_words_swn.items(), key=lambda x: x[1], reverse=True))
#     print("Total positive sentiment: ", sum(positive_words_swn.values()))
#     print("\nNegative words scores: ", len(negative_words_swn), sorted(negative_words_swn.items(), key=lambda x: x[1], reverse=True))
#     print("Total negative sentiment:", sum(negative_words_swn.values()))
    return sorted_pos,sorted_neg    

#Put the code under the classify method here.  Here's some code to start you off
def classify(text, inqtabs_dict, swn_dict):
    regex_clean_html = re.compile(r'<.*?>') # Remove any HTML tags
    clean = re.sub(regex_clean_html, '', text)
    regex = r'[A-Za-z]{3,}\b' #You'll probably want to update this regular expression
    words = re.findall(regex, clean) # Extract word from a text
    inqtabs_dict['fun'] = 1
#     del inqtabs_dict['get']
    
    aa_temp = list(swn_dict['dog'])
    aa_temp[0],aa_temp[1] = 1,0
    swn_dict['dog'] = tuple(aa_temp)
    
    aa_temp = list(swn_dict['kill'])
    aa_temp[0],aa_temp[1] = 0,swn_dict['kill'][1]
    swn_dict['kill'] = tuple(aa_temp)
    
    aa_temp = list(swn_dict['good'])
    aa_temp[0],aa_temp[1] = swn_dict['good'][1],0
    swn_dict['good'] = tuple(aa_temp)
    
    positive_words_inqtabs = []
    negative_words_inqtabs = []
    positive_words_inqtabs_count = 0
    negative_words_inqtabs_count = 0    
    
    words_matched_inqtabs = []
    positive_words_swn = {}
    negative_words_swn = {}
    positive_words_swn_count = []
    negative_words_swn_count = []
    words_matched_swn = []
    
    # Stopwords
    stopword = stopwords.words('english') # Default English Stopwords
    words_lower = [x.lower() for x in words] # lower case
    words_rm = [x for x in words_lower if x not in stopword] # remove stopwords
  
#tokenize, pos tagging, lemmatize
#     sentences = " ".join(words_rm)
#     tokens = nltk.word_tokenize(sentences)
#     tag_map = defaultdict(lambda : wordnet.NOUN)
#     tag_map['J'] = wordnet.ADJ
#     tag_map['V'] = wordnet.VERB
#     tag_map['R'] = wordnet.ADV
#     word_list = []
#     for token,tag in pos_tag(tokens):
#         lemma = lemmatizer.lemmatize(token, tag_map[tag[0]])
#         word_list.append(lemma)  
    
    for word in words_rm:
        #word = word.lower()  #You can do other preprocessing here like remove punctuation, etc
        if word in inqtabs_dict:
            if inqtabs_dict[word] == '1':
                positive_words_inqtabs_count += 1
            else:
                negative_words_inqtabs_count += 1
            words_matched_inqtabs.append(word)
        if (word not in inqtabs_dict and word in swn_dict):
            positive_words_inqtabs_count += (swn_dict[word][0])**3
            negative_words_inqtabs_count += (swn_dict[word][1])**3
            words_matched_swn.append(word)

    pos_cnt = positive_words_inqtabs_count + 1
    neg_cnt = negative_words_inqtabs_count + 1
    if ((pos_cnt / neg_cnt) >1.47):
        score = 1
    else:
        score = 0

#     print("Original text: ", text)
#     print("Words in text: ", words)
#     print("\nScores from Lexicon 1 (Harvard Inquirer)")
#     print("Positive words found: ", len(positive_words_inqtabs), positive_words_inqtabs)
#     print("Negative words found: ", len(negative_words_inqtabs), negative_words_inqtabs)
#     print("\nScores from Lexicon 2 (SentiWordNet)")
#     print("Positive words found: ", len(positive_words_swn), sorted(positive_words_swn.items(), key=lambda x: x[1], reverse=True))
#     print("Total positive sentiment: ", sum(positive_words_swn.values()))
#     print("\nNegative words scores: ", len(negative_words_swn), sorted(negative_words_swn.items(), key=lambda x: x[1], reverse=True))
#     print("Total negative sentiment:", sum(negative_words_swn.values()))
#     print("\nScore: ", score)
    return score

In [11]:
%%time

filedir = 'C:/Users/fcbmu/OneDrive - UC Irvine/Academic 2020-21/Spring 2021/BANA 275/Python/HW 1/data'
output_file_name = 'C:/Users/fcbmu/OneDrive - UC Irvine/Academic 2020-21/Spring 2021/BANA 275/Python/HW 1/output/test.csv'
print("Reading data...")
data = get_training_data(filedir)
lexicon_dir = os.path.join(filedir, 'lexicon')
inqtabs_dict = read_inqtabs(os.path.join(lexicon_dir, 'inqtabs.txt'))
swn_dict = read_senti_word_net(os.path.join(lexicon_dir, 'SentiWordNet_3.0.0_20130122.txt'))
print("Classifying...")
get_training_accuracy(data, inqtabs_dict, swn_dict)
print("Writing output...")
write_predictions(filedir, inqtabs_dict, swn_dict, output_file_name)

Reading data...
Classifying...
Accuracy: 0.67312
Writing output...


HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=25000.0), HTML(value='')))


Wall time: 2min 57s


# Code for Error Analysis

In [12]:
%%time
# Rerun this as you make changes to the classify method 
# I strongly recommend you look at the number of false positives and false negatives after running this.
# You can see the false positives and false negatives by looking at the size of the fp.txt and fn.txt
# Note: you need to complete the get_error_type function to see the false positives and false negatives 
# prior to update kill and dog 0.67308
get_training_accuracy(data, inqtabs_dict, swn_dict)

Accuracy: 0.67312
Wall time: 57.6 s


0.67312

In [13]:
# find your current directory
import os
#curDir = os.getcwd()
#print(curDir)
fn = open('C:/Users/fcbmu/OneDrive - UC Irvine/Academic 2020-21/Spring 2021/BANA 275/Python/HW 1/output/fn.txt','r',encoding='utf-8')
fp = open('C:/Users/fcbmu/OneDrive - UC Irvine/Academic 2020-21/Spring 2021/BANA 275/Python/HW 1/output/fp.txt','r',encoding='utf-8')
tp = open('C:/Users/fcbmu/OneDrive - UC Irvine/Academic 2020-21/Spring 2021/BANA 275/Python/HW 1/output/tp.txt','r',encoding='utf-8')
tn = open('C:/Users/fcbmu/OneDrive - UC Irvine/Academic 2020-21/Spring 2021/BANA 275/Python/HW 1/output/tn.txt','r',encoding='utf-8')
fn_count = len(fn.readlines())
fp_count= len(fp.readlines())
tp_count= len(tp.readlines())
tn_count= len(tn.readlines())
#print(fn_count)
#print(fp_count)
#print(tp_count)
#print(tn_count)
total = fp_count + fn_count+tp_count+tn_count
total
#Confusion Matrix
print("Confusion Matrix")
print("TP: ",tp_count,", TN: ",tn_count)
print("FP: ",fp_count,", FN: ",fn_count)
print()
print("Confusion Matrix")
print("TP: ",round(tp_count/total,2),", TN: ",round(tn_count/total,2))
print("FP: ",round(fp_count/total,2),", FN: ",round(fn_count/total,2))
print()
print("Precision: ",round(tp_count/(tp_count+fp_count),2))
print("Recall: ",round(tp_count/(tp_count+fn_count),2))

Confusion Matrix
TP:  7650 , TN:  9178
FP:  3322 , FN:  4850

Confusion Matrix
TP:  0.31 , TN:  0.37
FP:  0.13 , FN:  0.19

Precision:  0.7
Recall:  0.61


# Error Analysis, debugging classify on a single example

In [14]:
import re
import nltk
import numpy as np
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag
from collections import defaultdict
import operator


def get_error_type(pred, label):
    # return the type of error: tp,fp,tn,fn
    if pred == 1 and label == 1:
        return 'tp'
    elif pred == 1 and label == 0:
        return 'fp'
    elif pred == 0 and label == 1:
        return 'fn'
    elif pred == 0 and label == 0:
        return 'tn'


lemmatizer = WordNetLemmatizer()
term_count = {}
term_sentiment = {}

def top_terms(text, inqtabs_dict,swn_dict): #used to update dictionary, not to be applied into 'classify' function
    #clean up text
    regex_clean_html = re.compile(r'<.*?>') # Remove any HTML tags
    clean = re.sub(regex_clean_html, '', text)
    regex = r'[A-Za-z]{3,}\b' #You'll probably want to update this regular expression
    words = re.findall(regex, clean) # Extract word from a text
    
    positive_words_inqtabs = []
    negative_words_inqtabs = []
    words_matched_inqtabs = []
    positive_words_swn = {}
    negative_words_swn = {}
    positive_words_swn_count = []
    negative_words_swn_count = []
    words_matched_swn = []
    
    # Stopwords
    stopword = stopwords.words('english')
    words_lower = [x.lower() for x in words] # lower case
    words_rm = [x for x in words_lower if x not in stopword] # remove stopwords
    
    for word in words_rm:
        word = word.lower()  #You can do other preprocessing here like remove punctuation, etc
        if word in inqtabs_dict:
            if inqtabs_dict[word] == '1':
                positive_words_inqtabs.append(word)
            else:
                negative_words_inqtabs.append(word)
            words_matched_inqtabs.append(word)
        if word not in inqtabs_dict and word in swn_dict:
            positive_words_swn[word] = swn_dict[word][0]
            negative_words_swn[word] = swn_dict[word][1]
            words_matched_swn.append(word)

        
        if word not in term_count:
            term_count[word] = 1
        else:
            term_count[word] += 1

#built for inqtabs_dict
    sorted_pos = positive_words_inqtabs
    sorted_neg = negative_words_inqtabs
# #built for swn_dict
#     sorted_pos = sorted(positive_words_swn.items(), key=lambda x: x[1], reverse=True)
#     sorted_neg = sorted(negative_words_swn.items(), key=lambda x: x[1], reverse=True)
#     n = 20
#     print(sorted_pos)
#     print()
#     print(sorted_neg)
    
    
#     print("Positive words found: ", len(positive_words_swn), sorted(positive_words_swn.items(), key=lambda x: x[1], reverse=True))
#     print("Total positive sentiment: ", sum(positive_words_swn.values()))
#     print("\nNegative words scores: ", len(negative_words_swn), sorted(negative_words_swn.items(), key=lambda x: x[1], reverse=True))
#     print("Total negative sentiment:", sum(negative_words_swn.values()))
    return sorted_pos,sorted_neg    

#Put the code under the classify method here.  Here's some code to start you off
def classify(text, inqtabs_dict, swn_dict):
    regex_clean_html = re.compile(r'<.*?>') # Remove any HTML tags
    clean = re.sub(regex_clean_html, '', text)
    regex = r'[A-Za-z]{3,}\b' #You'll probably want to update this regular expression
    words = re.findall(regex, clean) # Extract word from a text
    inqtabs_dict['fun'] = 1
#     del inqtabs_dict['get']
    
    aa_temp = list(swn_dict['dog'])
    aa_temp[0],aa_temp[1] = 1,0
    swn_dict['dog'] = tuple(aa_temp)
    
    aa_temp = list(swn_dict['kill'])
    aa_temp[0],aa_temp[1] = 0,swn_dict['kill'][1]
    swn_dict['kill'] = tuple(aa_temp)
    
    aa_temp = list(swn_dict['good'])
    aa_temp[0],aa_temp[1] = swn_dict['good'][1],0
    swn_dict['good'] = tuple(aa_temp)
    
    positive_words_inqtabs = []
    negative_words_inqtabs = []
    positive_words_inqtabs_count = 0
    negative_words_inqtabs_count = 0    
    
    words_matched_inqtabs = []
    positive_words_swn = {}
    negative_words_swn = {}
    positive_words_swn_count = []
    negative_words_swn_count = []
    words_matched_swn = []
    
    # Stopwords
    stopword = stopwords.words('english')
    words_lower = [x.lower() for x in words] # lower case
    words_rm = [x for x in words_lower if x not in stopword] # remove stopwords
  
#tokenize, pos tagging, lemmatize
#     sentences = " ".join(words_rm)
#     tokens = nltk.word_tokenize(sentences)
#     tag_map = defaultdict(lambda : wordnet.NOUN)
#     tag_map['J'] = wordnet.ADJ
#     tag_map['V'] = wordnet.VERB
#     tag_map['R'] = wordnet.ADV
#     word_list = []
#     for token,tag in pos_tag(tokens):
#         lemma = lemmatizer.lemmatize(token, tag_map[tag[0]])
#         word_list.append(lemma)  
    
    for word in words_rm:
        #word = word.lower()  #You can do other preprocessing here like remove punctuation, etc
        if word in inqtabs_dict:
            if inqtabs_dict[word] == '1':
                positive_words_inqtabs_count += 1
            else:
                negative_words_inqtabs_count += 1
            words_matched_inqtabs.append(word)
        if (word not in inqtabs_dict and word in swn_dict):
            positive_words_inqtabs_count += (swn_dict[word][0])**3
            negative_words_inqtabs_count += (swn_dict[word][1])**3
            words_matched_swn.append(word)

    pos_cnt = positive_words_inqtabs_count + 1
    neg_cnt = negative_words_inqtabs_count + 1
    if ((pos_cnt / neg_cnt) >1.47):
        score = 1
    else:
        score = 0

#     print("Original text: ", text)
#     print("Words in text: ", words)
#     print("\nScores from Lexicon 1 (Harvard Inquirer)")
#     print("Positive words found: ", len(positive_words_inqtabs), positive_words_inqtabs)
#     print("Negative words found: ", len(negative_words_inqtabs), negative_words_inqtabs)
#     print("\nScores from Lexicon 2 (SentiWordNet)")
#     print("Positive words found: ", len(positive_words_swn), sorted(positive_words_swn.items(), key=lambda x: x[1], reverse=True))
#     print("Total positive sentiment: ", sum(positive_words_swn.values()))
#     print("\nNegative words scores: ", len(negative_words_swn), sorted(negative_words_swn.items(), key=lambda x: x[1], reverse=True))
#     print("Total negative sentiment:", sum(negative_words_swn.values()))
#     print("\nScore: ", score)
    return score

In [15]:
# Modify this to see it on a different example
text = '''
Arguably this is a very good "sequel", better than the first live action film 101 Dalmatians. It has good dogs, good actors, good jokes and all right slapstick! <br /><br />Cruella DeVil, who has had some rather major therapy, is now a lover of dogs and very kind to them. Many, including Chloe Simon, owner of one of the dogs that Cruella once tried to kill, do not believe this. Others, like Kevin Shepherd (owner of 2nd Chance Dog Shelter) believe that she has changed. <br /><br />Meanwhile, Dipstick, with his mate, have given birth to three cute dalmatian puppies! Little Dipper, Domino and Oddball...<br /><br />Starring Eric Idle as Waddlesworth (the hilarious macaw), Glenn Close as Cruella herself and Gerard Depardieu as Le Pelt (another baddie, the name should give a clue), this is a good family film with excitement and lots more!! One downfall of this film is that is has a lot of painful slapstick, but not quite as excessive as the last film. This is also funnier than the last film.<br /><br />Enjoy "102 Dalmatians"! :-)
'''
classify(text, inqtabs_dict, swn_dict)

1

In [None]:
# top_terms(text, inqtabs_dict, swn_dict)
aa_temp = list(swn_dict['dog'])
print(aa_temp)
aa_temp[0],aa_temp[1] = 1,0
swn_dict['dog'] = tuple(aa_temp)
swn_dict['dog']



# swn_dict['dog'][0] = 1
# swn_dict['dog'][1] = 0

In [None]:
# top_terms(text, inqtabs_dict, swn_dict)
aa_temp = list(swn_dict['kill'])
print(aa_temp)
aa_temp[0],aa_temp[1] = 0,swn_dict['kill'][1]
swn_dict['kill'] = tuple(aa_temp)
swn_dict['kill']

In [None]:
aa_temp = list(swn_dict['good'])
print(aa_temp)
aa_temp[0],aa_temp[1] = swn_dict['good'][1],0
swn_dict['good'] = tuple(aa_temp)
swn_dict['good']

In [None]:
# del inqtabs_dict['get']
# inqtabs_dict['dog']

# Everything below this cell is meant to identify (by descending count) most frequent terms and their respective sentiment (pos or neg) scores

In [None]:
%%time
temp_data_1 = data[0:5000]
temp_data_2 = data[5000:1000]
temp_data_3 = data[10000:15000]
temp_data_4 = data[15000:20000]
temp_data_5 = data[20000:250001]

neg_word_dict_count = {}
pos_word_dict_count = {}
# print(temp_data)
print()
# temp = [{'FileIndex': '901', 'Category': '1', 'Review': 'NEW WORLD ORDER BOYS BOYS One of my favorite scenes One One One is at the beginning when guests on a private yacht decide to take an impromptu swim - in their underwear! Rather risqué for 1931!'}]

sorted_pos = []
sorted_neg = []

#identify tuples and append them to a new list
def append_dict(df_list):
    for row in df_list:
        sorted_pos_list,sorted_neg_list = top_terms(row['Review'], inqtabs_dict, swn_dict)
        sorted_pos.extend(sorted_pos_list)
        sorted_neg.extend(sorted_neg_list)

In [None]:
%%time
sorted_pos = []
sorted_neg = []
append_dict(temp_data_1)
append_dict(temp_data_2)
append_dict(temp_data_3)
append_dict(temp_data_4)
append_dict(temp_data_5)

In [None]:
#view list
sorted_neg_count_dict = {}
for word in sorted_neg:
    if word not in sorted_neg_count_dict:
        sorted_neg_count_dict[word] = 1
    else:
        sorted_neg_count_dict[word] += 1
sorted_neg = dict(sorted(sorted_neg_count_dict.items(), key=lambda item: item[1], reverse = True))
sorted_neg

In [None]:
get

In [None]:
sorted_pos_count_dict = {}
for word in sorted_pos:
    if word not in sorted_pos_count_dict:
        sorted_pos_count_dict[word] = 1
    else:
        sorted_pos_count_dict[word] += 1
sorted_pos = dict(sorted(sorted_pos_count_dict.items(), key=lambda item: item[1], reverse = True))
sorted_pos

In [None]:
#words to update:
# Possible words to update: one
#Consider re-scoring words in both list / making them both zero (i.e. they may be more common in positive texts when really they are used negativly. vice versa)
#consider words with a zero score which should have a value
#consider making some words mutually exclusive to one sentiment (i.e. 'one' has a value of 0.5 for both sentiment types when it likely has no negative or positive connotation)
#consider term count (righ tnow its removing all duplicates)
#using swn dictionary reduced accuracy to 50%, using swn dictionary with duplicate terms reduced accuracy to ~55%

# Non-lemmatized thresholds:
#     if ((pos_cnt / neg_cnt) >1.7): accuracy 66.392%
#     if ((pos_cnt / neg_cnt) >1.55): accuracy 66.92%
#     if ((pos_cnt / neg_cnt) >1.5): accuracy 66.97%
#     if ((pos_cnt / neg_cnt) >1.49): accuracy 67.14%
#     if ((pos_cnt / neg_cnt) >1.47): accuracy 67.14% #optimal
#     if ((pos_cnt / neg_cnt) >1.45): accuracy 67.1%
#     if ((pos_cnt / neg_cnt) >1.4): accuracy 66.93%
#     if ((pos_cnt / neg_cnt) >1.25): accuracy 64%

#lemmatized threshold accuracy #lemmatizing appears to decrease accuracy
#     if ((pos_cnt / neg_cnt) >1.47): accuracy 66.348

#non_lemmatized threshold with inqtabs_dict[fun' == 1]
#     if ((pos_cnt / neg_cnt) >1.47): accuracy 67.14% #optimal

#non-lemmatized with inqtabs_dict[fun' == 1] using both dictionaries
#     if ((pos_cnt / neg_cnt) >1.47): accuracy 66.504%

#non-lemmatized with inqtabs_dict[fun' == 1] using both dictionaries with swn value ^2
#     if ((pos_cnt / neg_cnt) >1.47): accuracy %67.096

#non-lemmatized with inqtabs_dict[fun' == 1] using both dictionaries with swn value ^3
#     if ((pos_cnt / neg_cnt) >1.47): accuracy %67.304

#non-lemmatized with inqtabs_dict[fun' == 1] using both dictionaries with swn value ^10
#     if ((pos_cnt / neg_cnt) >1.47): accuracy %67.304
#non-lemmatized with inqtabs_dict[fun' == 1] using both dictionaries with swn value ^10 update 'get', kill, dog
#     if ((pos_cnt / neg_cnt) >1.47): accuracy %67.304
#non-lemmatized with inqtabs_dict[fun' == 1] using both dictionaries with swn value multiple 2 update inq:'fun',update swn: 'get', kill, dog
#     if ((pos_cnt / neg_cnt) >1.47): accuracy %67.304