#### Group exercise 13.02:
- Load the Product Review Dataset for Sentiment Analysis from http://people.mpi-inf.mpg.de/~smukherjee/data/
- Do the necessary preprocessing of the text: lemmatization and stemming. Decide how you would deal with apostrophes (n’t => not) and other special symbols.
https://drive.google.com/open?id=1A-N8MpBfpMBnJhSeWx2YoU84a1poAVur
- Use page 6 to see the variety of decisions that need to be made during preprocessing: https://www.nyu.edu/projects/spirling/documents/preprocessing.pdf
- Inspect the vocabulary and further decrease the size of it by matching words like USA and U.S.A., etc.
- Find collocations among the bigrams using the pointwise mutual information (PMI) and t-test metrics (15.3.1)
- For each unigram and bigram, find its probability in the corpus
- Implement add-one smoothing, discounting, back-off, stupid back-off, and interpolation


In [1]:
from nltk.stem import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.probability import FreqDist
from nltk.metrics import BigramAssocMeasures
from nltk.collocations import BigramCollocationFinder

In [2]:
path = 'feature-specific/Dataset2.txt'

stemmer = SnowballStemmer('english')
# raw string so that \w is passed to the regex engine unchanged
tokenizer = RegexpTokenizer(r'\w+')

unigrams, bigrams = FreqDist(), FreqDist()

In [3]:
def preprocess(text, tokenizer=tokenizer, stemmer=stemmer):
    '''Preprocesses a line and returns a list of preprocessed tokens'''

    # lower case and remove leading/trailing spaces
    text = text.lower().strip()

    # tokenize the text
    tokens = tokenizer.tokenize(text)
    
    # apply stemmer to each token
    tokens = list(map(stemmer.stem, tokens))
    
    return tokens
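
The exercise also asks for a decision on apostrophes (n’t => not) and on variants such as USA and U.S.A. One option is to normalize the raw text before it reaches the tokenizer. The sketch below is only an illustration: the `normalize` helper, the contraction rules, and the acronym pattern are assumptions, not part of the original notebook, and the real rule list would depend on what the corpus contains.

```python
import re

# Hypothetical rewrite rules; irregular forms must come before the generic n't rule.
CONTRACTIONS = [
    (re.compile(r"\bcan['’]t\b"), "can not"),
    (re.compile(r"\bwon['’]t\b"), "will not"),
    (re.compile(r"n['’]t\b"), " not"),
]

# Two or more single letters, each followed by a dot, e.g. "u.s.a."
ACRONYM = re.compile(r"\b(?:[a-z]\.){2,}")


def normalize(text):
    """Lowercase, expand n't contractions and strip dots from dotted acronyms."""
    text = text.lower()
    for pattern, replacement in CONTRACTIONS:
        text = pattern.sub(replacement, text)
    # "u.s.a." -> "usa", so it shares a vocabulary entry with "usa"
    text = ACRONYM.sub(lambda m: m.group(0).replace(".", ""), text)
    return text
```

One way to wire this in would be `preprocess(normalize(review))`, so that the stemmer sees `not` instead of a stray `t`.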

In [4]:
with open(path, mode='r') as f:
    
    for line in f:

        # format: category $ label $ review
        parts = line.split('$')
        
        review = parts[2]
        tokens = preprocess(review)
        
        finder = BigramCollocationFinder.from_words(tokens)
        
        # add unigrams and bigrams of each review to global unigrams and bigrams
        unigrams += finder.word_fd
        bigrams += finder.ngram_fd
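
For the task "for each unigram and bigram find the probability in the corpus", the simplest reading is the relative frequency (maximum-likelihood estimate). The sketch below uses plain `collections.Counter` objects and a hypothetical helper `mle_probabilities` so that it runs without the dataset; it is an illustration, not the notebook's own code.

```python
from collections import Counter


def mle_probabilities(token_lists):
    """Relative frequencies of unigrams and bigrams over a list of token lists."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        # bigrams never cross a review boundary, as in the cell above
        bigrams.update(zip(tokens, tokens[1:]))

    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    p_unigram = {w: c / n_unigrams for w, c in unigrams.items()}
    p_bigram = {b: c / n_bigrams for b, c in bigrams.items()}
    return p_unigram, p_bigram


# toy input standing in for the preprocessed reviews
p_unigram, p_bigram = mle_probabilities([["a", "b", "a"], ["b"]])
```

With the `FreqDist` objects built above, `unigrams.freq(word)` and `bigrams.freq(bigram)` should give the same kind of relative frequency.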

In [5]:
# construct a BigramCollocationFinder out of all the unigrams and bigrams
# we found from the product reviews
finder = BigramCollocationFinder(unigrams, bigrams)

bigram_measures = BigramAssocMeasures()

pmi_scored = finder.score_ngrams(bigram_measures.pmi)
t_scored = finder.score_ngrams(bigram_measures.student_t)

In [6]:
print('pmi scores:')
for bigram, score in pmi_scored[:20]:
    print(bigram, score)

pmi scores:
('10m', 'ethernet') 16.04180885427799
('3m', '4m') 16.04180885427799
('640', '30fps') 16.04180885427799
('65536', '16bit') 16.04180885427799
('abandon', 'thier') 16.04180885427799
('accentu', 'dimish') 16.04180885427799
('achill', 'heel') 16.04180885427799
('afro', 'cuban') 16.04180885427799
('agk', 'studio') 16.04180885427799
('anni', 'lebovitz') 16.04180885427799
('bose', 'quietcomfort') 16.04180885427799
('bothersom', 'alleg') 16.04180885427799
('brief', 'synopsi') 16.04180885427799
('bulbous', 'formless') 16.04180885427799
('cdr', 'cdrw') 16.04180885427799
('cdrw', 'dvdr') 16.04180885427799
('ceram', 'tile') 16.04180885427799
('comress', 'smaler') 16.04180885427799
('cordless', 'drill') 16.04180885427799
('cosmet', 'standpoint') 16.04180885427799


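The top candidates by PMI all contain rare words, and all 20 share the same score. That is what PMI = log2(N · c(w1, w2) / (c(w1) · c(w2))) yields when c(w1) = c(w2) = c(w1, w2) = 1: the score reduces to log2(N), its maximum. So these are most likely bigrams whose words each occur only once in the corpus, which would also explain misspellings such as 'thier' and 'comress'.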
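
Why rare words dominate follows from the PMI definition, PMI(w1, w2) = log2(N · c(w1, w2) / (c(w1) · c(w2))). A minimal sketch, with a hypothetical helper `pmi` and a made-up corpus size, compares a bigram of two one-off words with a frequent bigram of frequent words:

```python
import math


def pmi(c12, c1, c2, n):
    """Pointwise mutual information from raw counts.

    c12: bigram count, c1 and c2: counts of the two words, n: total tokens.
    """
    return math.log2(c12 * n / (c1 * c2))


n = 50_000  # made-up corpus size, for illustration only

# Both words occur exactly once, and together: PMI reaches its maximum, log2(n).
hapax = pmi(1, 1, 1, n)

# A frequent bigram made of frequent words scores much lower.
frequent = pmi(500, 2_000, 3_000, n)

print(hapax, frequent)
```

This is why PMI rankings are often combined with a minimum-frequency filter before reading off the top candidates.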

In [7]:
print('t-scores:')
for bigram, score in t_scored[:20]:
    print(bigram, score)

t-scores:
('it', 's') 12.884295578726505
('i', 'have') 12.812256227650057
('of', 'the') 12.605367250001615
('easi', 'to') 11.967589633694583
('n', 't') 11.956585688767547
('if', 'you') 11.918357796787728
('to', 'use') 11.746838969167108
('it', 'is') 11.182119745380465
('you', 'can') 10.976127688906095
('i', 've') 10.434353257186
('is', 'a') 10.292449565083063
('on', 'the') 10.246031139677918
('this', 'is') 9.72287405475398
('this', 'camera') 9.650076070474315
('i', 'm') 9.182969523737354
('this', 'phone') 9.13843545620724
('the', 'ipod') 9.07075862237984
('to', 'be') 9.018155296082693
('with', 'the') 8.972702920561462
('i', 'am') 8.933835053570371


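The top candidates here are bigrams that occur often in the corpus: mostly function-word pairs such as ('of', 'the') and contraction fragments such as ('it', 's') and ('n', 't'), with only a few content pairs like ('this', 'camera').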
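
The preference for frequent bigrams can be seen in the usual count-based approximation of the t-test score, t ≈ (c(w1, w2) − c(w1) · c(w2) / N) / sqrt(c(w1, w2)). The sketch below is an illustration with a hypothetical helper `t_score` and made-up counts; whether NLTK's `student_t` uses exactly this form is an assumption here.

```python
import math


def t_score(c12, c1, c2, n):
    """t-test collocation score from raw counts (Bernoulli approximation)."""
    observed = c12 / n              # sample mean: relative bigram frequency
    expected = (c1 / n) * (c2 / n)  # mean under the independence hypothesis
    variance = observed             # p * (1 - p) is close to p for small p
    return (observed - expected) / math.sqrt(variance / n)


n = 50_000  # made-up corpus size, for illustration only

# A bigram seen once can never score above 1, however surprising it is.
hapax = t_score(1, 1, 1, n)

# A frequent bigram scores high even when its words are frequent too.
frequent = t_score(500, 2_000, 3_000, n)

print(hapax, frequent)
```

So the two metrics fail in opposite directions: PMI rewards rarity, the t-test rewards raw frequency.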

#### Implement add-one smoothing, discounting, back-off and stupid back-off, interpolation

TODO

Should we implement only a function that calculates the probability, or also compute the values it needs? And should we apply the result to the corpus somehow?
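
One possible answer is to do both: collect the counts once, then write each estimator as a small function over those counts. The sketch below shows add-one smoothing and linear interpolation on a toy corpus. All names (`count_ngrams`, `p_add_one`, `p_interpolated`) and the weight `lam=0.7` are assumptions for illustration; in practice the weight would be tuned on held-out data.

```python
from collections import Counter


def count_ngrams(token_lists):
    """Collect unigram, bigram and context (first word of a bigram) counts."""
    unigrams, bigrams, contexts = Counter(), Counter(), Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        for w1, w2 in zip(tokens, tokens[1:]):
            bigrams[(w1, w2)] += 1
            contexts[w1] += 1
    return unigrams, bigrams, contexts


def p_add_one(w1, w2, counts):
    """Add-one (Laplace) estimate of P(w2 | w1)."""
    unigrams, bigrams, contexts = counts
    vocab_size = len(unigrams)
    return (bigrams[(w1, w2)] + 1) / (contexts[w1] + vocab_size)


def p_interpolated(w1, w2, counts, lam=0.7):
    """Linear interpolation of the bigram and unigram MLE estimates."""
    unigrams, bigrams, contexts = counts
    p_uni = unigrams[w2] / sum(unigrams.values())
    if contexts[w1] == 0:
        # w1 never seen as a context: only the unigram estimate is available
        return p_uni
    p_bi = bigrams[(w1, w2)] / contexts[w1]
    return lam * p_bi + (1 - lam) * p_uni


# toy corpus standing in for the preprocessed reviews
corpus = [
    ["the", "camera", "is", "good"],
    ["the", "phone", "is", "good"],
    ["the", "camera", "is", "small"],
]
counts = count_ngrams(corpus)
vocab = list(counts[0])
```

Both functions return a proper conditional distribution: summing over the vocabulary for a fixed `w1` gives 1.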

Are discounting and back-off general concepts rather than specific methods one can implement directly?
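
Both ideas are families of methods, but each has simple concrete members. The sketch below picks two: absolute discounting, with the freed probability mass handed to the unigram distribution, and stupid back-off, which returns scores rather than probabilities. The function names, the discount `d=0.75`, and the toy corpus are assumptions for illustration; `alpha=0.4` is the value commonly quoted for stupid back-off. A full Katz back-off is more involved and is not shown.

```python
from collections import Counter


def count_ngrams(token_lists):
    """Collect unigram, bigram and context (first word of a bigram) counts."""
    unigrams, bigrams, contexts = Counter(), Counter(), Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        for w1, w2 in zip(tokens, tokens[1:]):
            bigrams[(w1, w2)] += 1
            contexts[w1] += 1
    return unigrams, bigrams, contexts


def p_absolute_discount(w1, w2, counts, d=0.75):
    """Subtract d from every seen bigram count, give the mass to the unigram model."""
    unigrams, bigrams, contexts = counts
    p_uni = unigrams[w2] / sum(unigrams.values())
    if contexts[w1] == 0:
        return p_uni
    # number of distinct words observed after w1
    n_followers = sum(1 for (first, _second) in bigrams if first == w1)
    discounted = max(bigrams[(w1, w2)] - d, 0) / contexts[w1]
    reserved_mass = d * n_followers / contexts[w1]
    return discounted + reserved_mass * p_uni


def stupid_backoff(w1, w2, counts, alpha=0.4):
    """Relative frequency if the bigram was seen, else a scaled unigram score."""
    unigrams, bigrams, contexts = counts
    if bigrams[(w1, w2)] > 0:
        return bigrams[(w1, w2)] / contexts[w1]
    return alpha * unigrams[w2] / sum(unigrams.values())


# toy corpus standing in for the preprocessed reviews
corpus = [
    ["the", "camera", "is", "good"],
    ["the", "phone", "is", "good"],
    ["the", "camera", "is", "small"],
]
counts = count_ngrams(corpus)
vocab = list(counts[0])
```

The discounted estimate still sums to 1 over the vocabulary, while the stupid back-off scores do not, which is the trade-off that method makes for simplicity.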