### LING1340 Homework 1
Name: **Daniel Zheng**

Email: **daniel.zheng@pitt.edu**

Date: **September 3, 2017**

#### Dataset
Large Movie Review Dataset from [here](http://ai.stanford.edu/~amaas/data/sentiment/).

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. *The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).*

In [10]:
# useful libraries
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
import re
from nltk import FreqDist, ngrams, pos_tag
from nltk.stem import WordNetLemmatizer
import glob


In [11]:
# looking at some data

train_neg_files = glob.glob('data/aclImdb/train/neg/*')
train_pos_files = glob.glob('data/aclImdb/train/pos/*')
train_neg = []
train_pos = []
for neg, pos in zip(train_neg_files, train_pos_files):
    with open(neg) as f:
        train_neg.append(f.read())
    with open(pos) as f:   
        train_pos.append(f.read())

In [12]:
print(train_neg[:2])

['Yes, he is! ...No, not because of Pintilie likes to undress his actors and show publicly their privies. Pintilie IS THE naked "emperor" - so to speak...<br /><br />It\'s big time for someone to state the truth. This impostor is a voyeur, a brat locked in an old man\'s body. His abundance of nude scenes have no artistic legitimacy whatsoever. It is 100% visual perversion: he gets his kicks by making the actors strip in the buff and look at their willies. And if he does this in front of the audience, he might eve get a hard-on! Did you know that, on the set of "Niki Ardelean", he used to embarrass poor Coca Bloss, by telling her: "Oh, Coca, how I wanna f*** you!"? She is a great lady, very decent and sensitive, and she became unspeakably ashamed - to his petty satisfaction! And, as a worrying alarm signal about the degree of vulgarity and lack of education in Romanian audiences, so many people are still so foolish to declare these visual obscenities "works of art"! Will anyone have eve

In [4]:
print(train_pos[:2])

['The movie "MacArthur" begins and ends at Gen. Douglas MacArthur\'s, Gregory Peck, Alma Mata the US Military Academy of West Point on the Hudson. We see a frail 82 year old Gen.MacArthur give the commencement speech to the graduating class of 1962 about what an honor it is to serve their country. The film then goes into an almost two hour long flashback on Gen. MacArthur\'s brilliant as well as controversial career that starts in the darkest hours of WWII on the besieged island of Corregidor in the Philippines in the early spring of 1942.<br /><br />Told to leave he island for Australia before the Japanese military invade it Gen. MacArthur for the very first time in his military career almost disobeys a direct order from his superior US President Franklin D. Roosevelt, Dan O\'Herlihy. Feeling that he\'ll be deserting his men at their greatest hour of need MacArthur reluctantly, together with his wife and young son, did what he was told only to have it haunt him for the reminder of the

### Description of Data
There are train and test sets of data. Within `train` and `test`, there is a `neg` and `pos` folder each with 12,500 negative and positive samples. In the `train` folder there is also a folder called `unsup` with 50,000 examples for unsupervised learning.

### Processing
A lot of the data contains "<br>" tags from HTML, which will have to be cleaned up. I will attempt to weight using tf-idf and train a classifier with the word2vec library. For movie reviews being positive, I have a feeling adjectives are the most important, so I might only keep adjectives and train another classifier with those.

In [51]:
# first clean up data, taking out punctuation, numbers and special characters

# some questionable regular expressions...
train_neg = [' '.join([word_tokenize(x) for x in string.split() if re.sub('[a-zA-Z0-9_.,!"\'-/]', '', x) == '']) for string in train_neg]
train_pos = [' '.join([word_tokenize(x) for x in string.split() if re.sub('[a-zA-Z0-9_.,!"\'-/]', '', x) == '']) for string in train_pos]



In [52]:
# example of what the filtered data looks like
print(train_pos[0])

['The movie MacArthur begins and ends at Gen Douglas MacArthurs Gregory Peck Alma Mata the US Military Academy of West Point on the Hudson We see a frail  year old GenMacArthur give the commencement speech to the graduating class of  about what an honor it is to serve their country The film then goes into an almost two hour long flashback on Gen MacArthurs brilliant as well as controversial career that starts in the darkest hours of WWII on the besieged island of Corregidor in the Philippines in the early spring of to leave he island for Australia before the Japanese military invade it Gen MacArthur for the very first time in his military career almost disobeys a direct order from his superior US President Franklin D Roosevelt Dan OHerlihy Feeling that hell be deserting his men at their greatest hour of need MacArthur reluctantly together with his wife and young son did what he was told only to have it haunt him for the reminder of the war It was that reason his escape under fire from 

In [1]:

tokenized = [word.lower() for word in word_tokenize(' '.join(first_filter)) if len(word) > 1 and len(word) < 25]

# comprised of only those chars and has at least one letter or number
cleaned = tokenized

#     cleaned = ' '.join(x for x in result.split() if len(x) > 1 and len(x) < 25)

# assumption: all folder names will contain nouns
# first build pos tags for every unique thing
tagged_set = pos_tag(set(cleaned))
tagged_dict = dict()
# generate lookup table
for word, tag in tagged_set:
    tagged_dict[word] = tag


lemmatizer = WordNetLemmatizer()

from nltk.corpus import wordnet

# need to convert since wordnet lemmatizer only takes 4 things
lemma_tag = {'NN':wordnet.NOUN,'JJ':wordnet.ADJ,'VB':wordnet.VERB,'RB':wordnet.ADV}

# group things together w/ lemmas
lemmas = [lemmatizer.lemmatize(t, pos=lemma_tag[tagged_dict[t][:2]]) if tagged_dict[t][:2] in ['NN', 'JJ', 'VB', 'RB'] else t for t in cleaned ]

lemma_set = pos_tag(set(lemmas))
lemma_dict = dict()
# generate lookup table
for word, tag in lemma_set:
    lemma_dict[word] = tag
all_counts = dict()
sizes = [1,2,3]
for size in sizes:
    noun_grams = []
    for tup in ngrams(lemmas, size):
        for word in tup:
            # only keep stuff that has nouns
            if lemma_dict[word].find('NN')!= -1:
                noun_grams.append(tup)
                break
    all_counts[size] = FreqDist(noun_grams)

best_keywords = all_counts[sizes[0]].most_common(1)[0]
for size in sizes:

    if all_counts[size].most_common(1)[0][1] * 2 > best_keywords[1]:
        best_keywords = all_counts[size].most_common(1)[0]

print(best_keywords)

# idea: basic algorithm to find most important phrase
# go from 1-5 in ngram size, take top result.
# if 2-gram is more than 1/2 counts of 1-gram, use the 2-gram
# keep going down


return best_keywords[0]

NameError: name 'result' is not defined