Alicia Sigmon, als333@pitt.edu, 9/2/2017

- Corpus: Pros and Cons
- Author: Bing Liu
- https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/pros_cons.zip
- size: 746276, 2 txt files: 1381 KB and 1471 KB
- format: corpus (2 files)
- License: Creative Commons Attribution 4.0 International
- The corpus's 2 files are of positive and negative reviews of technology products

# Summary of Code:

- First I read in the 2 files from the Pros and Cons corpus.
- Next I calculated the number of files in the corpus using PlaintextCorpusReader
    - I also practiced this using glob and commented it out - both methods are valid.
- Then I printed 200 characters of each text to show what the data looked like.  
- For my basic stats, I started by counting the number of entries for each text file.
    - Because each entry began and ended with markers, I split the file on these markers to create a list of entries, and the length of that list gave me the number of entries.
- I then removed the beginning and end markers so that they did not interfere with my word and sentence counts.
- I used nltk to find word tokens, word types, and sentences in the texts.
    - This data is messy because it comes from online forums. There are a lot of misspellings, missing or extra punctuation marks, and symbols that interfere with the word and sentence counts.
- For the discovery, I wanted to know which words and bigrams were most indicative of negative and positive reviews.
    - I looked into frequencies using nltk. I did not remove the punctuation, so many of the most frequent words and bigrams included punctuation.

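The entry-counting step described above can also be sketched with a regular expression that pulls out the text between each pair of markers. This is only a sketch, not the code used in the cells below: `extract_entries` and the two-line `sample` string are hypothetical, modeled on the shape of the data preview. One reason to consider it is that splitting on the end marker can leave a trailing non-entry element when the text ends with that marker, which would inflate the count by one.

```python
import re

# Hypothetical two-entry sample shaped like the preview of IntegratedCons.txt
sample = (
    "        <Cons>Eats...no, GULPS batteries</Cons>\n"
    "        <Cons>Awkward ergonomics, short battery life</Cons>\n"
)

def extract_entries(text, tag):
    """Return the text between each <tag>...</tag> pair, stripped of whitespace."""
    pattern = r"<{0}>(.*?)</{0}>".format(re.escape(tag))
    return [m.strip() for m in re.findall(pattern, text, flags=re.DOTALL)]

entries = extract_entries(sample, "Cons")

# Splitting on the end marker, for comparison: the last element is an empty string
naive = sample.replace("<Cons>", "").split("</Cons>\n")

print(len(entries))  # 2
print(len(naive))    # 3
```

On this sample the split-based list has one more element than there are entries, so it may be worth checking the last element of `cons_split` and `pros_split` before trusting the counts.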

# Future wish:

I think it would be interesting to look at spelling errors in this corpus.
Did people tend to make more errors in negative or positive reviews? 
I'm not sure how to look at spelling errors in a corpus like this
because it is full of correct and incorrect spellings and extra symbols.
We previously looked at minimum edit distance and spelling correction in the Introduction to Computational Linguistics course,
and I wonder how I would go through a corpus to measure how often people make errors.
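
As a reminder of what minimum edit distance computes, here is a minimal pure-Python sketch of the standard dynamic-programming version with unit costs for insertion, deletion, and substitution. The function below is written for illustration and is not taken from the course or from this notebook; NLTK also provides an `edit_distance` function that could be used instead.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit cost for insert, delete, and substitute."""
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from the first i chars of a to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

print(edit_distance("baterry", "battery"))  # 2
print(edit_distance("east", "eats"))        # 2
```

A small distance to a known word is one possible signal that an unknown token is a misspelling rather than a product name or symbol.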

One issue with this idea is that some people probably wrote their entries with a spell
checker while others didn't, and those who had one would not all have used the same one.
Even if I were able to count spelling errors in this corpus, I therefore could not generalize to
errors in pro and con entries in general, but I could see how many words in
each category of this corpus had spelling errors. To really look into spelling errors in pros and cons entries,
the corpus would have to come from people who all had the same spell checker.

Also, how would we compare the words to correctly spelled words? One idea I have is using 
a corpus of English words or maybe also a corpus of English slang to compare to the words in the Pros and Cons 
corpus. I'm not sure what the code would look like, and when I try to imagine it, it seems like each word type in the 
Pros and Cons corpus would have to be compared to words in English Words/Slang corpora to check for errors, 
which would take a very long time to run.

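On the runtime worry above: if the reference word list is stored as a Python `set`, each membership check takes roughly constant time on average, so checking every word type against it should be quick even for a large list. A minimal sketch, assuming a tiny hypothetical `english_words` set in place of a real word list (NLTK's `words` corpus might be one candidate) and a hypothetical helper `unknown_types`:

```python
# Hypothetical stand-in for a large English word list
english_words = {"eats", "batteries", "easy", "to", "use", "battery", "life", "short"}

def unknown_types(tokens, lexicon):
    """Return the sorted alphabetic word types that are absent from the lexicon."""
    types = {t.lower() for t in tokens if t.isalpha()}
    # Set difference checks every type against the lexicon in one pass
    return sorted(types - lexicon)

# Hypothetical tokens; in the notebook, contoks or protoks would be used instead
con_tokens = ["Eats", "batteries", "!", "short", "baterry", "lyfe"]
print(unknown_types(con_tokens, english_words))  # ['baterry', 'lyfe']
```

A lookup like this only flags tokens that are not in the list. A real-word error such as "East batteries" for "Eats batteries" in the cons preview would pass unnoticed, and product names or slang would be flagged even when spelled as intended.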

In [1]:
# Reading in the data
# The with-blocks close each file automatically after reading
with open("data/IntegratedCons.txt") as file:
    constxt = file.read()

with open("data/IntegratedPros.txt") as file:
    prostxt = file.read()

In [2]:
# Getting the fileids and counting the number of files
# Option 1: Using PlaintextCorpusReader
from nltk.corpus import PlaintextCorpusReader
corpus_root = 'data'
corpus = PlaintextCorpusReader(corpus_root, '.*')
#dir(corpus)
#corpus.raw()[:500]
print("There are " + str(len(corpus.fileids())) + " files in this corpus:")
for x in corpus.fileids():
    print("\t"+x)

There are 2 files in this corpus:
	IntegratedCons.txt
	IntegratedPros.txt


In [3]:
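# Option 2: Using glob (commented out; it finds the same files as Option 1)
# os.path keeps the path handling independent of the operating system

#import glob, os
#files = glob.glob(os.path.join('data', '*.txt'))
#print("There are " + str(len(files)) + " files in this corpus:")
#for x in files:
#    print("\t"+os.path.basename(x))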

In [4]:
# Looking at the data
print("Cons Preview:")
print(constxt[:200])
print("Pros Preview:")
print(prostxt[:200])

Cons Preview:
        <Cons>East batteries! On-off switch too easy to maneuver.</Cons>
        <Cons>Eats...no, GULPS batteries</Cons>
        <Cons>Awkward ergonomics, no optical viewfinder, short battery life, sl
Pros Preview:
        <Pros>Easy to use, economical!</Pros>
        <Pros>Digital is where it's at...down with developing film!</Pros>
        <Pros>Good image quality, 3x optical zoom, macro mode, inexpensive</Pro


In [5]:
# Basic Stats

#nltk.download()
import nltk

#Splits the files by entry
cons_split = constxt.replace("<Cons>", "").split("</Cons>\n")
pros_split = prostxt.replace("<Pros>", "").split("</Pros>\n")

#Removes the beginning and end markers of the entries for further analyses
new_constxt = constxt.replace("<Cons>", "").replace("</Cons>\n", "")
new_prostxt = prostxt.replace("<Pros>", "").replace("</Pros>\n", "")

# These include many things that are not words or not sentences
    # Would be best to remove punctuation / symbols
contoks = nltk.word_tokenize(new_constxt.lower())
contypes = sorted(set(contoks))
consents = nltk.sent_tokenize(new_constxt)
protoks = nltk.word_tokenize(new_prostxt.lower())
protypes = sorted(set(protoks))
prosents = nltk.sent_tokenize(new_prostxt)

In [6]:
#Printing basic stats from above - plus using numpy.sum() 
print("There are " + str(len(cons_split)) + " con entries.")
print("There are " + str(len(pros_split)) + " pro entries.\n")

import numpy
print("There are " + str(numpy.sum([len(cons_split),len(pros_split)])) + " total entries in the Pros and Cons Corpus.\n")

print("There are " + str(len(contoks)) + " con word tokens.")
print("There are " + str(len(contypes)) + " con word types.")
print("There are " + str(len(consents)) + " con sentence tokens.\n")
print("There are " + str(len(protoks)) + " pro word tokens.")
print("There are " + str(len(protypes)) + " pro word types.")
print("There are " + str(len(prosents)) + " pro sentence tokens.")

There are 22936 con entries.
There are 22941 pro entries.

There are 45877 total entries in the Pros and Cons Corpus.

There are 185655 con word tokens.
There are 14487 con word types.
There are 6850 con sentence tokens.

There are 206167 pro word tokens.
There are 12350 pro word types.
There are 6956 pro sentence tokens.


In [7]:
# Discovery Part 1: 
# What words are the most indicative of pro vs. con?

#Takes the tokens and creates a frequency dictionary
conFreq = nltk.FreqDist(contoks)
proFreq = nltk.FreqDist(protoks)

# Print most common words
print("Top 20 Con Words:")
for x in conFreq.most_common(20):
    print(x)
print("\nTop 20 Pro Words:")
for x in proFreq.most_common(20):
    print(x)

# Difference in total number of tokens
print("\nThere are " + str(len(protoks)-len(contoks)) + " more pro word tokens than con word tokens.\n")

# Examples of some frequency percentages
print("Here are examples of frequency perctanges:")
print("Con frequency of \",\":", conFreq.freq(","))
print("Pro frequency of \",\":", proFreq.freq(","))

print("Con frequency of \"battery\":", conFreq.freq("battery"))
print("Pro frequency of \"battery\":", proFreq.freq("battery"))

print("Con frequency of \"!\":", conFreq.freq("!"))
print("Pro frequency of \"!\":", proFreq.freq("!"))

Top 20 Con Words:
(',', 14205)
('.', 5763)
('to', 3472)
('a', 2925)
('no', 2675)
('not', 2633)
('the', 2555)
('battery', 2269)
('is', 2196)
('and', 2081)
(';', 1966)
('of', 1848)
('!', 1575)
('for', 1490)
('life', 1484)
("n't", 1472)
('slow', 1335)
('in', 1298)
('quality', 1241)
('it', 1108)

Top 20 Pro Words:
(',', 32632)
('.', 5739)
('to', 5196)
('quality', 5120)
('easy', 4516)
('and', 4490)
('great', 4209)
('use', 4071)
('good', 3644)
(';', 3154)
('of', 2742)
('price', 1985)
('!', 1838)
('features', 1813)
('for', 1602)
('battery', 1539)
('small', 1532)
('the', 1531)
('a', 1464)
('&', 1452)

There are 20512 more pro word tokens than con word tokens.

Here are examples of frequency perctanges:
Con frequency of ",": 0.07651288680617274
Pro frequency of ",": 0.15827945306474847
Con frequency of "battery": 0.012221593816487571
Pro frequency of "battery": 0.007464822207239762
Con frequency of "!": 0.008483477417791064
Pro frequency of "!": 0.008915102805007591

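Raw counts are dominated by punctuation and function words, and the two files differ in size, so one way to get closer to "indicative" words is to drop punctuation-only tokens and compare relative frequencies across the two classes. A minimal sketch using only the standard library; the helper names and the toy token lists are hypothetical, and in this notebook `contoks` and `protoks` would take their place.

```python
from collections import Counter

def content_counts(tokens):
    """Count only tokens that contain at least one letter or digit."""
    return Counter(t for t in tokens if any(ch.isalnum() for ch in t))

def relative_freq(counts, word):
    """Share of all counted tokens that are `word` (0.0 if nothing was counted)."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

# Hypothetical toy tokens standing in for contoks and protoks
con_toks = ["battery", "life", ",", "slow", "!", "battery"]
pro_toks = ["easy", "to", "use", ",", "battery", "great"]

con_counts = content_counts(con_toks)
pro_counts = content_counts(pro_toks)

con_rate = relative_freq(con_counts, "battery")
pro_rate = relative_freq(pro_counts, "battery")
print(con_rate, pro_rate)
```

The same comparison can be read off the output above: "battery" has a relative frequency of about 0.0122 in the cons and about 0.0075 in the pros, so it is roughly 1.6 times as frequent in the cons.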

In [8]:
# Discovery Part 2:
# What bigrams are most indicative of con or pro?

bi_con = nltk.ngrams(contoks,2)
bi_con_freq = nltk.FreqDist(bi_con)
print("Top 20 Con Word Bigrams:")
for (x, y) in bi_con_freq.most_common(20):
    print(x, y)

bi_pro = nltk.ngrams(protoks,2)
bi_pro_freq = nltk.FreqDist(bi_pro)
print("\nTop 20 Pro Word Bigrams:")
for (x, y) in bi_pro_freq.most_common(20):
    print(x, y)

Top 20 Con Word Bigrams:
(',', 'no') 1216
('battery', 'life') 1188
('a', 'little') 676
('life', ',') 669
('amp', ';') 593
('&', 'amp') 557
('hard', 'to') 542
(',', 'poor') 470
('!', '!') 431
('does', "n't") 431
(',', 'not') 425
('a', 'bit') 407
('.', 'no') 401
(',', 'battery') 396
(';', '#') 345
('quality', ',') 332
(',', 'slow') 328
('could', 'be') 304
('ca', "n't") 300
(';', '&') 273


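Bigrams such as ('&', 'amp'), ('amp', ';') and (';', '#') suggest that the raw text contains HTML entities like `&amp;` and numeric character references, which the tokenizer breaks into pieces. One possible cleanup, sketched here with the standard library's `html.unescape`, is to decode the entities before tokenizing. The `raw` string is a hypothetical example, not a line from the corpus.

```python
import html

# Hypothetical entry containing a named entity and a numeric character reference
raw = "Doesn&#39;t zoom &amp; focus"

# Decode HTML entities back into the characters they stand for
clean = html.unescape(raw)
print(clean)  # Doesn't zoom & focus
```

Applied to `new_constxt` and `new_prostxt` before `nltk.word_tokenize`, a step like this would likely keep entity fragments out of the top word and bigram lists.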