## Introduction

**Natural language processing** (NLP) is the task of making computers understand and produce human languages. It always starts with a corpus, i.e. a _body of text_.

## What is a corpus?
There are many corpora (*plural of corpus*) available in NLTK. Let's start with an English one called the **Brown corpus**.

When using a new corpus in NLTK for the first time, download the corpus with the `nltk.download()` function, e.g.

In [1]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /home/avi/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

After it has downloaded, you can import it like this:

In [2]:
from nltk.corpus import brown

In [3]:
brown.words()

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

In [4]:
len(brown.words())  # No. of words in the corpus

1161192

In [5]:
brown.sents()  # Returns a list of list of strings 

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

In [6]:
brown.sents(fileids='ca01') # You can access a specific file with 'fileids' argument.

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

The actual `brown` corpus data is **packaged as raw text files**, and you can find their IDs with:

In [7]:
len(brown.fileids())

500

In [8]:
print(brown.fileids()[:100])

['ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10', 'ca11', 'ca12', 'ca13', 'ca14', 'ca15', 'ca16', 'ca17', 'ca18', 'ca19', 'ca20', 'ca21', 'ca22', 'ca23', 'ca24', 'ca25', 'ca26', 'ca27', 'ca28', 'ca29', 'ca30', 'ca31', 'ca32', 'ca33', 'ca34', 'ca35', 'ca36', 'ca37', 'ca38', 'ca39', 'ca40', 'ca41', 'ca42', 'ca43', 'ca44', 'cb01', 'cb02', 'cb03', 'cb04', 'cb05', 'cb06', 'cb07', 'cb08', 'cb09', 'cb10', 'cb11', 'cb12', 'cb13', 'cb14', 'cb15', 'cb16', 'cb17', 'cb18', 'cb19', 'cb20', 'cb21', 'cb22', 'cb23', 'cb24', 'cb25', 'cb26', 'cb27', 'cc01', 'cc02', 'cc03', 'cc04', 'cc05', 'cc06', 'cc07', 'cc08', 'cc09', 'cc10', 'cc11', 'cc12', 'cc13', 'cc14', 'cc15', 'cc16', 'cc17', 'cd01', 'cd02', 'cd03', 'cd04', 'cd05', 'cd06', 'cd07', 'cd08', 'cd09', 'cd10', 'cd11', 'cd12']


In [9]:
print(brown.raw('cb01').strip()[:1000])  # Show the first 1000 characters

Assembly/nn-hl session/nn-hl brought/vbd-hl much/ap-hl good/nn-hl 
The/at General/jj-tl Assembly/nn-tl ,/, which/wdt adjourns/vbz today/nr ,/, has/hvz performed/vbn in/in an/at atmosphere/nn of/in crisis/nn and/cc struggle/nn from/in the/at day/nn it/pps convened/vbd ./.
It/pps was/bedz faced/vbn immediately/rb with/in a/at showdown/nn on/in the/at schools/nns ,/, an/at issue/nn which/wdt was/bedz met/vbn squarely/rb in/in conjunction/nn with/in the/at governor/nn with/in a/at decision/nn not/* to/to risk/vb abandoning/vbg public/nn education/nn ./.


	There/ex followed/vbd the/at historic/jj appropriations/nns and/cc budget/nn fight/nn ,/, in/in which/wdt the/at General/jj-tl Assembly/nn-tl decided/vbd to/to tackle/vb executive/nn powers/nns ./.
The/at final/jj decision/nn went/vbd to/in the/at executive/nn but/cc a/at way/nn has/hvz been/ben opened/vbn for/in strengthening/vbg budgeting/vbg procedures/nns and/cc to/to provide/vb legislators/nns information/nn they/ppss need/vb ./.




Two things stand out: every word is followed by `/` and its part-of-speech label (`word/tag`), and punctuation marks are separate tokens tagged the same way (e.g. `,/,` and `./.`).
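
The `word/tag` layout of the raw text can be unpacked with plain string handling. The following is a minimal sketch, assuming every token contains at least one `/`; the helper name `parse_tagged` is made up for illustration, and the sample line is a shortened piece of the raw output shown earlier.

```python
def parse_tagged(raw_line):
    """Split a line of 'word/tag' tokens into (word, tag) tuples."""
    pairs = []
    for token in raw_line.split():
        # Split on the LAST slash only, so a token such as ',/,' keeps ',' as its word.
        word, tag = token.rsplit('/', 1)
        pairs.append((word, tag))
    return pairs


# Shortened sample taken from the raw 'cb01' output.
line = "The/at General/jj-tl Assembly/nn-tl ,/, which/wdt adjourns/vbz today/nr"
tagged = parse_tagged(line)
print(tagged)
```

In practice NLTK already exposes the parsed form through `brown.tagged_words()`, so a hand-written parser like this is only useful for understanding the file format.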

From this we can see that our next steps should be **sentence tokenization** and **word tokenization**.

## Tokenization

**Sentence tokenization** is the process of splitting up strings into "sentences"

**Word tokenization** is the process of splitting up sentences into "words"
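
To make the two definitions concrete, here is a rough sketch of both steps using only regular expressions. It is not how NLTK tokenizes; the function names `naive_sent_tokenize` and `naive_word_tokenize` and the regex patterns are assumptions chosen for illustration.

```python
import re


def naive_sent_tokenize(text):
    # Split after '.', '?' or '!' whenever whitespace follows.
    return [s for s in re.split(r'(?<=[.?!])\s+', text.strip()) if s]


def naive_word_tokenize(sentence):
    # A word (optionally joined by '/', "'" or '-') or a single punctuation mark.
    return re.findall(r"\w+(?:[/'-]\w+)*|[^\w\s]", sentence)


text = "Maybe we could explore new beginnings together? You WONT be disappointed."
sentences = naive_sent_tokenize(text)
tokens = [naive_word_tokenize(s) for s in sentences]
print(sentences)
print(tokens)
```

Rules this simple break quickly on abbreviations and unusual punctuation, which is why a trained tokenizer is normally preferred.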

In [10]:
# Let's try webtext
nltk.download("webtext")

[nltk_data] Downloading package webtext to /home/avi/nltk_data...
[nltk_data]   Package webtext is already up-to-date!


True

In [11]:
from nltk.corpus import webtext

webtext.fileids()

['firefox.txt',
 'grail.txt',
 'overheard.txt',
 'pirates.txt',
 'singles.txt',
 'wine.txt']

In [12]:
# Each line is one advertisement.
print(webtext.raw('singles.txt').strip()[:1000])
print("\n")
for i, line in enumerate(webtext.raw('singles.txt').split('\n')):
    if i >= 10:  # Let's take a look at the first 10 ads.
        break
    
    print(str(i) + ':\t' + line)

25 SEXY MALE, seeks attrac older single lady, for discreet encounters.
35YO Security Guard, seeking lady in uniform for fun times.
40 yo SINGLE DAD, sincere friendly DTE seeks r/ship with fem age open S/E
44yo tall seeks working single mum or lady below 45 fship rship. Nat Open
6.2 35 yr old OUTGOING M seeks fem 28-35 for o/door sports - w/e away
A professional business male, late 40s, 6 feet tall, slim build, well groomed, great personality, home owner, interests include the arts travel and all things good, Ringwood area, is seeking a genuine female of similar age or older, in same area or surrounds, for a meaningful long term rship. Looking forward to hearing from you all.
ABLE young man seeks, sexy older women. Phone for fun ready to play
AFFECTIONATE LADY Sought by generous guy, 40s, mutual fulfillment
ARE YOU ALONE or lost in a r/ship too, with no hope in sight? Maybe we could explore new beginnings together? Im 45 Slim/Med build, GSOH, high needs and looking for someone similar. 

In [13]:
# Let's explore candidate no. 8
single_no8 = webtext.raw('singles.txt').split('\n')[8]
print(single_no8)

ARE YOU ALONE or lost in a r/ship too, with no hope in sight? Maybe we could explore new beginnings together? Im 45 Slim/Med build, GSOH, high needs and looking for someone similar. You WONT be disappointed.


## Sentence Tokenization

In NLTK, `sent_tokenize()` is the default tokenizer function that you can use to split strings into "sentences".

In [14]:
from nltk import sent_tokenize, word_tokenize

sent_tokenize(single_no8)

['ARE YOU ALONE or lost in a r/ship too, with no hope in sight?',
 'Maybe we could explore new beginnings together?',
 'Im 45 Slim/Med build, GSOH, high needs and looking for someone similar.',
 'You WONT be disappointed.']

In [15]:
for sent in sent_tokenize(single_no8):
    print(word_tokenize(sent))

## Lowercasing

Tokenization uses capitalization as a cue for where to split, so lowercasing before tokenizing would be sub-optimal. Lowercase after tokenizing instead.

In [16]:
for sent in sent_tokenize(single_no8):
    # It's a little inefficient to loop through each word,
    # but sometimes it helps to get better tokens.
    print([word.lower() for word in word_tokenize(sent)])
    # alternatively:
    # print(list(map(str.lower, word_tokenize(sent))))

['are', 'you', 'alone', 'or', 'lost', 'in', 'a', 'r/ship', 'too', ',', 'with', 'no', 'hope', 'in', 'sight', '?']
['maybe', 'we', 'could', 'explore', 'new', 'beginnings', 'together', '?']
['im', '45', 'slim/med', 'build', ',', 'gsoh', ',', 'high', 'needs', 'and', 'looking', 'for', 'someone', 'similar', '.']
['you', 'wont', 'be', 'disappointed', '.']

In [17]:
print(word_tokenize(single_no8))  # Treats the whole line as one document.  

['ARE', 'YOU', 'ALONE', 'or', 'lost', 'in', 'a', 'r/ship', 'too', ',', 'with', 'no', 'hope', 'in', 'sight', '?', 'Maybe', 'we', 'could', 'explore', 'new', 'beginnings', 'together', '?', 'Im', '45', 'Slim/Med', 'build', ',', 'GSOH', ',', 'high', 'needs', 'and', 'looking', 'for', 'someone', 'similar', '.', 'You', 'WONT', 'be', 'disappointed', '.']


## Tangential Note

NLTK uses Punkt for sentence tokenization. Punkt is a statistical model, so it applies what it has learnt from previous data.

Generally, it **works for most cases on well-formed text**, but if your data is different, e.g. noisy user-generated text, you might have to train a new model.

For example, if we look at candidate no. 9 (shown below), we see that it splits on "y.o." (it thinks that is the end of a sentence) and does not split on "&c." (it thinks that is an abbreviation, like "Mr." or "Inc.").

In [18]:
single_no9 = webtext.raw('singles.txt').split('\n')[9]
print(single_no9)

AMIABLE 43 y.o. gentleman with European background, 170 cm, medium build, employed, never married, no children. Enjoys sports, music, cafes, beach &c. Seeks an honest, attractive lady with a European background, without children, who would like to get married and have chil dren in the future. 29-39 y.o. Prefer non-smoker and living in Adelaide.


In [19]:
# Not splitting on &c but splits on y.o.
sent_tokenize(single_no9)
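
One cheap alternative to retraining a model is to split naively and then glue pieces back together when a piece ends in a known abbreviation. The sketch below illustrates that idea with the standard library only; the abbreviation set, the function name `split_with_abbreviations` and the shortened sample text are all assumptions, and this is not what Punkt does internally.

```python
import re

# Hypothetical, hand-picked abbreviations for this kind of ad text.
ABBREVIATIONS = {'y.o.', 'mr.', 'inc.'}


def split_with_abbreviations(text, abbreviations=ABBREVIATIONS):
    # Naive split after '.', '?' or '!' followed by whitespace.
    pieces = re.split(r'(?<=[.?!])\s+', text.strip())
    sentences = []
    glue_next = False
    for piece in pieces:
        if not piece:
            continue
        if glue_next and sentences:
            # The previous piece ended in an abbreviation: keep the sentence going.
            sentences[-1] += ' ' + piece
        else:
            sentences.append(piece)
        glue_next = piece.split()[-1].lower() in abbreviations
    return sentences


# Shortened, simplified version of the ad above.
sample = ("AMIABLE 43 y.o. gentleman, never married. "
          "Enjoys sports, music, cafes, beach &c. Seeks an honest lady.")
result = split_with_abbreviations(sample)
for sentence in result:
    print(sentence)
```

The rule is deliberately blunt: a sentence that really does end in "y.o." (as in "29-39 y.o. Prefer non-smoker ...") would be glued to the next one, so a list like this needs tuning per dataset.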

Next, we introduce a new concept: stopwords.

## Stopwords

**Stopwords** are non-content words that primarily have only a grammatical function.

In NLTK, you can access them as follows. 

In [20]:
from nltk.corpus import stopwords

stopwords_en = stopwords.words('english')
print(stopwords_en)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

We use stopwords to remove unnecessary content from the text and maintain the gist of important words.

In [21]:
# Treat the multiple sentences as one document (no need for sentence tokenization)
# Tokenize and lowercase
single_no8_tokenized_lowered = list(map(str.lower, word_tokenize(single_no8)))
print(single_no8_tokenized_lowered)

['are', 'you', 'alone', 'or', 'lost', 'in', 'a', 'r/ship', 'too', ',', 'with', 'no', 'hope', 'in', 'sight', '?', 'maybe', 'we', 'could', 'explore', 'new', 'beginnings', 'together', '?', 'im', '45', 'slim/med', 'build', ',', 'gsoh', ',', 'high', 'needs', 'and', 'looking', 'for', 'someone', 'similar', '.', 'you', 'wont', 'be', 'disappointed', '.']


## Let's try to remove the stopwords using the `english` stopwords list in NLTK

In [22]:
# We already have the stopwords in stopwords_en

# List comprehension.
print([word for word in single_no8_tokenized_lowered if word not in stopwords_en])

['alone', 'lost', 'r/ship', ',', 'hope', 'sight', '?', 'maybe', 'could', 'explore', 'new', 'beginnings', 'together', '?', 'im', '45', 'slim/med', 'build', ',', 'gsoh', ',', 'high', 'needs', 'looking', 'someone', 'similar', '.', 'wont', 'disappointed', '.']


In [23]:
# We can remove punctuation as well using string.punctuation
from string import punctuation
# It's a string, so we have to convert it into a set
print('From string.punctuation:', type(punctuation), punctuation)

From string.punctuation: <class 'str'> !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


## Combining the punctuation with the stopwords from NLTK

In [24]:
print(stopwords_en)
stopwords_en_withpunct = set(stopwords_en).union(set(punctuation))
print(stopwords_en_withpunct)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

In [25]:
# Removing stopwords and punctuation from single_no8_tokenized_lowered
print([word for word in single_no8_tokenized_lowered if word not in stopwords_en_withpunct])

## Using a stronger/longer list of stopwords

The previous output still contains dangling modal verbs (e.g. 'could', 'wont').

We can combine the stopwords we have in NLTK with other stopword lists we find online.

Personally, I like to use `stopwords-json` because it has stopwords in 50 languages:
[stopwords-json](https://github.com/6/stopwords-json)

In [26]:
# Stopwords from stopwords-json
stopwords_json = {"en":["a","a's","able","about","above","according","accordingly","across","actually","after","afterwards","again","against","ain't","all","allow","allows","almost","alone","along","already","also","although","always","am","among","amongst","an","and","another","any","anybody","anyhow","anyone","anything","anyway","anyways","anywhere","apart","appear","appreciate","appropriate","are","aren't","around","as","aside","ask","asking","associated","at","available","away","awfully","b","be","became","because","become","becomes","becoming","been","before","beforehand","behind","being","believe","below","beside","besides","best","better","between","beyond","both","brief","but","by","c","c'mon","c's","came","can","can't","cannot","cant","cause","causes","certain","certainly","changes","clearly","co","com","come","comes","concerning","consequently","consider","considering","contain","containing","contains","corresponding","could","couldn't","course","currently","d","definitely","described","despite","did","didn't","different","do","does","doesn't","doing","don't","done","down","downwards","during","e","each","edu","eg","eight","either","else","elsewhere","enough","entirely","especially","et","etc","even","ever","every","everybody","everyone","everything","everywhere","ex","exactly","example","except","f","far","few","fifth","first","five","followed","following","follows","for","former","formerly","forth","four","from","further","furthermore","g","get","gets","getting","given","gives","go","goes","going","gone","got","gotten","greetings","h","had","hadn't","happens","hardly","has","hasn't","have","haven't","having","he","he's","hello","help","hence","her","here","here's","hereafter","hereby","herein","hereupon","hers","herself","hi","him","himself","his","hither","hopefully","how","howbeit","however","i","i'd","i'll","i'm","i've","ie","if","ignored","immediate","in","inasmuch","inc","indeed","indicate","indicated","indicates","inner","insofar","instead","into",
"inward","is","isn't","it","it'd","it'll","it's","its","itself","j","just","k","keep","keeps","kept","know","known","knows","l","last","lately","later","latter","latterly","least","less","lest","let","let's","like","liked","likely","little","look","looking","looks","ltd","m","mainly","many","may","maybe","me","mean","meanwhile","merely","might","more","moreover","most","mostly","much","must","my","myself","n","name","namely","nd","near","nearly","necessary","need","needs","neither","never","nevertheless","new","next","nine","no","nobody","non","none","noone","nor","normally","not","nothing","novel","now","nowhere","o","obviously","of","off","often","oh","ok","okay","old","on","once","one","ones","only","onto","or","other","others","otherwise","ought","our","ours","ourselves","out","outside","over","overall","own","p","particular","particularly","per","perhaps","placed","please","plus","possible","presumably","probably","provides","q","que","quite","qv","r","rather","rd","re","really","reasonably","regarding","regardless","regards","relatively","respectively","right","s","said","same","saw","say","saying","says","second","secondly","see","seeing","seem","seemed","seeming","seems","seen","self","selves","sensible","sent","serious","seriously","seven","several","shall","she","should","shouldn't","since","six","so","some","somebody","somehow","someone","something","sometime","sometimes","somewhat","somewhere","soon","sorry","specified","specify","specifying","still","sub","such","sup","sure","t","t's","take","taken","tell","tends","th","than","thank","thanks","thanx","that","that's","thats","the","their","theirs","them","themselves","then","thence","there","there's","thereafter","thereby","therefore","therein","theres","thereupon","these","they","they'd","they'll","they're","they've","think","third","this","thorough","thoroughly","those","though","three","through","throughout","thru","thus","to","together","too","took","toward","towards","tried","tries","truly","try","t
rying","twice","two","u","un","under","unfortunately","unless","unlikely","until","unto","up","upon","us","use","used","useful","uses","using","usually","uucp","v","value","various","very","via","viz","vs","w","want","wants","was","wasn't","way","we","we'd","we'll","we're","we've","welcome","well","went","were","weren't","what","what's","whatever","when","whence","whenever","where","where's","whereafter","whereas","whereby","wherein","whereupon","wherever","whether","which","while","whither","who","who's","whoever","whole","whom","whose","why","will","willing","wish","with","within","without","won't","wonder","would","wouldn't","x","y","yes","yet","you","you'd","you'll","you're","you've","your","yours","yourself","yourselves","z","zero"]}

In [27]:
stopwords_json_en = set(stopwords_json['en'])
stopwords_nltk_en = set(stopwords.words('english'))
stopwords_punct = set(punctuation)

# Combine the stopwords. It's a lot longer, so I'm not printing it out.

stoplist_combined = set.union(stopwords_json_en, stopwords_nltk_en, stopwords_punct)


# Remove the stopwords from single_no8_tokenized_lowered
print([word for word in single_no8_tokenized_lowered if word not in stoplist_combined])

['lost', 'r/ship', 'hope', 'sight', 'explore', 'beginnings', 'im', '45', 'slim/med', 'build', 'gsoh', 'high', 'similar', 'wont', 'disappointed']


## Stemming and Lemmatization

Matching words with their root words, e.g. "walks", "walking", "walked" to "walk".

Stemming and lemmatization both rely on hand-written rules to find the root word.

**Stemming**: Trying to shorten a word with simple regex rules.

(in): having

(out): hav

**Lemmatization**: Trying to find the root word with linguistic rules (with the use of regexes).

(in): having

(out): have
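
The difference between the two can be shown with a toy version of each. This is only a sketch: `toy_stem` uses a single made-up suffix rule and `toy_lemmatize` uses a tiny hypothetical lookup table, so neither reproduces the Porter stemmer or the WordNet lemmatizer.

```python
import re


def toy_stem(word):
    # Chop a few common suffixes off the end; no knowledge of real words.
    return re.sub(r'(ing|ed|s)$', '', word.lower())


# Hypothetical mini-dictionary standing in for a real lexical resource.
TOY_LEMMAS = {
    'having': 'have',
    'walks': 'walk',
    'walking': 'walk',
    'walked': 'walk',
}


def toy_lemmatize(word):
    # Look the word up; fall back to the lowercased word itself.
    word = word.lower()
    return TOY_LEMMAS.get(word, word)


for w in ['having', 'walks', 'walking', 'walked']:
    print(w, '->', toy_stem(w), '|', toy_lemmatize(w))
```

The stemmer is cheap but can produce non-words such as "hav", while the lemmatizer returns real words but only for entries it knows about.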

There are various stemmers and one lemmatizer in NLTK; the most common are:

* **Porter Stemmer**

* **Wordnet Lemmatizer**


In [28]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()

for word in ['Walking', 'JOKING', 'walked']:
    print(porter.stem(word))  # stem and convert to lowercase 

walk
joke
walk


In [29]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

for word in ['Walking', 'walks', 'walked']:
    print(wnl.lemmatize(word))  # only lemmatize

Walking
walk
walked


## The lemmatizer is pretty complicated: it needs part-of-speech (POS) tags.

We won't cover what POS is today, so I'll just show how to "whip" the lemmatizer into doing what you need.

By default, the `WordNetLemmatizer.lemmatize()` function assumes that the word is a noun if there's no explicit POS tag in the input.



In [30]:
from nltk import pos_tag

# `pos_tag` takes a tokenized sentence as input and returns a list of (word, tag) tuples

walking_tagged = pos_tag(word_tokenize('He is walking to school'))
print(walking_tagged)

[('He', 'PRP'), ('is', 'VBZ'), ('walking', 'VBG'), ('to', 'TO'), ('school', 'NN')]


In [31]:
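def penn2morphy(penntag):
    """
    Converts Penn Treebank tags to WordNet POS tags.
    :param penntag: Penn Treebank tag, e.g. 'VBG'
    :return: WordNet POS tag, one of 'n', 'a', 'v', 'r'
    """
    
    morphy_tag = {'NN': 'n', 'JJ': 'a',
                  'VB': 'v', 'RB': 'r'}
    
    try:
        # Only the first two characters matter, e.g. 'VBG' -> 'VB'
        return morphy_tag[penntag[:2]]
    except KeyError:
        return 'n' # if mapping isn't found, fall back to Noun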
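
The same tag mapping can also be written with `dict.get()`, which avoids the `try`/`except` entirely. This is an equivalent sketch rather than a replacement; the names `PENN_TO_WORDNET` and `penn_to_wordnet` are made up here, and the tagged sentence is copied from the `pos_tag` output above.

```python
# Prefix of the Penn Treebank tag -> WordNet POS letter.
PENN_TO_WORDNET = {'NN': 'n', 'JJ': 'a', 'VB': 'v', 'RB': 'r'}


def penn_to_wordnet(penntag, default='n'):
    # Look up the first two characters; unknown tags fall back to noun.
    return PENN_TO_WORDNET.get(penntag[:2], default)


tagged = [('He', 'PRP'), ('is', 'VBZ'), ('walking', 'VBG'), ('to', 'TO'), ('school', 'NN')]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```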

In [32]:
[wnl.lemmatize(word.lower(), pos=penn2morphy(tag)) for word, tag in walking_tagged]

['he', 'be', 'walk', 'to', 'school']

## Now let's create a new lemmatization function for sentences, given what we learnt above.

In [33]:
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def penn2morphy(penntag):
    """ Converts penn treebank tags to wordnet"""
    morphy_tag = {'NN': 'n', 'JJ': 'a',
                  'VB': 'v', 'RB': 'r'}
    try:
        return morphy_tag[penntag[:2]]
    except KeyError:
        return 'n'  # fall back to noun if the tag has no mapping
    
def lemmatize_sent(text):
    # Text input is string, returns lowered strings.
    return [wnl.lemmatize((word.lower()), pos=penn2morphy(tag)) for word, tag in pos_tag(word_tokenize(text))]

lemmatize_sent('He is walking to school')

['he', 'be', 'walk', 'to', 'school']

## Let's try `lemmatize_sent()` and remove stopwords from single_no8

In [34]:
print('Original single no 8 : ')
print(single_no8)
print('Lemmatized and removed stopwords :')
print([word for word in lemmatize_sent(single_no8) 
       if word not in stoplist_combined
       and not word.isdigit() ])

Original single no 8 : 
ARE YOU ALONE or lost in a r/ship too, with no hope in sight? Maybe we could explore new beginnings together? Im 45 Slim/Med build, GSOH, high needs and looking for someone similar. You WONT be disappointed.
Lemmatized and removed stopwords :
['lose', 'r/ship', 'hope', 'sight', 'explore', 'beginning', 'im', 'slim/med', 'build', 'gsoh', 'high', 'similar', 'wont', 'disappoint']


## Combining what we know about removing stopwords and lemmatization

In [35]:
def preprocess_text(text):
    # Input: str, i.e. documents/sentence
    # Output : list(str), i.e. list of lemmas
    return ([word for word in lemmatize_sent(text)
             if word not in stoplist_combined
             and not word.isdigit()])

In [36]:
print(preprocess_text("going running and walks"))

['run', 'walk']


## From Strings to vectors

**Vector**: an array of numbers, e.g.

a = [1,2,3,4]

**Vector space model**: representing language as vectors of numbers.

**Bag-of-Words (BoW)**: representing each document/sentence as a vector of numbers, where each number is the count of a word in that document.

To count, we use Python's `collections.Counter`.

In [37]:
from collections import Counter

sent1 = "The quick brown fox jumps over the lazy brown dog."
sent2 = "Mr brown jumps over the lazy fox."

# Lemmatize and remove stopwords
processed_sent1 = preprocess_text(sent1)
processed_sent2 = preprocess_text(sent2)

print('Processed sentence: ')
print(processed_sent1)
print()
print('Word counts: ')
print(Counter(processed_sent1))

Processed sentence: 
['quick', 'brown', 'fox', 'jump', 'lazy', 'brown', 'dog']

Word counts: 
Counter({'brown': 2, 'dog': 1, 'lazy': 1, 'fox': 1, 'quick': 1, 'jump': 1})


In [38]:

print(Counter(processed_sent2))
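
Counts alone are not yet vectors: to compare documents, every document needs the same positions for the same words. The sketch below shows one way to do that with the standard library, assuming the preprocessed token lists for `sent1` and `sent2` are as written in the code; the helper name `to_vector` is made up for illustration.

```python
from collections import Counter

# Assumed output of preprocess_text() for sent1 and sent2.
doc1 = ['quick', 'brown', 'fox', 'jump', 'lazy', 'brown', 'dog']
doc2 = ['mr', 'brown', 'jump', 'lazy', 'fox']

# Shared vocabulary: every distinct word, in a fixed (alphabetical) order.
vocabulary = sorted(set(doc1) | set(doc2))


def to_vector(tokens, vocabulary):
    # A Counter returns 0 for words that never occur in this document.
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vectors = [to_vector(doc, vocabulary) for doc in (doc1, doc2)]
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each position in a vector corresponds to one vocabulary word, so the two documents can now be compared number by number.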

## Vectorization with sklearn

In `scikit-learn`, the `CountVectorizer` object provides pre-built functionality for the preprocessing and vectorization that we've been doing.



In [39]:
from io import StringIO
from sklearn.feature_extraction.text import CountVectorizer


# sent1
# sent2

with StringIO('\n'.join([sent1, sent2])) as fin:
    # Create the vectorizer
    count_vect = CountVectorizer()
    count_vect.fit_transform(fin)

In [40]:
# vocabulary_ maps each word (key) to its column index (value), not to its count.
count_vect.vocabulary_

{'brown': 0,
 'dog': 1,
 'fox': 2,
 'jumps': 3,
 'lazy': 4,
 'mr': 5,
 'over': 6,
 'quick': 7,
 'the': 8}

**Note**: we have not counted anything yet.

`the` is in the vocabulary list but it is a stopword.

`jumps` isn't stemmed or lemmatized.

We can **override the tokenizer and stop_words**

In [41]:
from io import StringIO
from sklearn.feature_extraction.text import CountVectorizer

# sent1
# sent2

with StringIO('\n'.join([sent1, sent2])) as fin:
    # Override the tokenizer and stop_words
    # (recent scikit-learn versions expect stop_words as a list, not a set)
    count_vect = CountVectorizer(stop_words=list(stoplist_combined), tokenizer=word_tokenize)
    count_vect.fit_transform(fin)
count_vect.vocabulary_



{'brown': 0, 'dog': 1, 'fox': 2, 'jumps': 3, 'lazy': 4, 'mr': 5, 'quick': 6}

Or just **override the analyzer** totally with our `preprocess_text`:

In [42]:
from io import StringIO
from sklearn.feature_extraction.text import CountVectorizer

# sent1
# sent2

with StringIO('\n'.join([sent1, sent2])) as fin:
    # Override the analyzer  totally with our preprocess text
    count_vect = CountVectorizer(analyzer=preprocess_text)
    count_vect.fit_transform(fin)
count_vect.vocabulary_

{'brown': 0, 'dog': 1, 'fox': 2, 'jump': 3, 'lazy': 4, 'mr': 5, 'quick': 6}

To vectorize any new sentence, we use `CountVectorizer.transform()`. The function returns a sparse matrix.

In [43]:
count_vect.transform([sent1, sent2])

In [44]:
# To view the matrix, you can convert it to an array
from operator import itemgetter

# print the words sorted by their index
words_sorted_by_index, _ = zip(*sorted(count_vect.vocabulary_.items(), key=itemgetter(1)))

print(preprocess_text(sent1))
print(preprocess_text(sent2))

print()

print('Vocab: ', words_sorted_by_index)
print()
print('matrix/vectors:\n', count_vect.transform([sent1, sent2]).toarray())

['quick', 'brown', 'fox', 'jump', 'lazy', 'brown', 'dog']
['mr', 'brown', 'jump', 'lazy', 'fox']

Vocab:  ('brown', 'dog', 'fox', 'jump', 'lazy', 'mr', 'quick')

matrix/vectors:
 [[2 1 1 1 1 0 1]
 [1 0 1 1 1 1 0]]
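
Once the vocabulary is fixed, vectorizing a new sentence only counts words that already have an index; anything unseen is dropped. The following standard-library sketch mimics that behaviour. The function name `transform` and the new token list are hypothetical, and the vocabulary is copied from the output above.

```python
from collections import Counter

# Word -> column index, as learnt from sent1 and sent2.
vocabulary = {'brown': 0, 'dog': 1, 'fox': 2, 'jump': 3,
              'lazy': 4, 'mr': 5, 'quick': 6}


def transform(tokens, vocabulary):
    vector = [0] * len(vocabulary)
    for token, count in Counter(tokens).items():
        index = vocabulary.get(token)
        if index is not None:  # tokens outside the vocabulary are ignored
            vector[index] = count
    return vector


# Hypothetical new document; 'cat' was never seen when the vocabulary was built.
new_doc = ['lazy', 'cat', 'jump', 'lazy', 'dog']
new_vector = transform(new_doc, vocabulary)
print(new_vector)
```

This is why the vocabulary should be built from the training data only: the validation and test sets are then described purely in terms of words the model has seen.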


## Now that we have learnt some basic NLP and vectorization, let's apply it to a fun task

[Random acts of pizza](https://www.kaggle.com/c/random-acts-of-pizza)

This competition contains a dataset with 5671 textual requests for pizza from the Reddit community Random Acts of Pizza, together with their outcome (successful/unsuccessful) and metadata.

![pizza image](https://kaggle2.blob.core.windows.net/competitions/kaggle/3949/media/pizzas.png)

## Let's explore the training data

In [62]:
import json
import os

print(os.getcwd())

with open('data/pizza/train.json') as fin:
    train_json = json.load(fin)
train_json[0]

/home/avi/github/basics-of-nlp


{'giver_username_if_known': 'N/A',
 'number_of_downvotes_of_request_at_retrieval': 0,
 'number_of_upvotes_of_request_at_retrieval': 1,
 'post_was_edited': False,
 'request_id': 't3_l25d7',
 'request_number_of_comments_at_retrieval': 0,
 'request_text': 'Hi I am in need of food for my 4 children we are a military family that has really hit hard times and we have exahusted all means of help just to be able to feed my family and make it through another night is all i ask i know our blessing is coming so whatever u can find in your heart to give is greatly appreciated',
 'request_text_edit_aware': 'Hi I am in need of food for my 4 children we are a military family that has really hit hard times and we have exahusted all means of help just to be able to feed my family and make it through another night is all i ask i know our blessing is coming so whatever u can find in your heart to give is greatly appreciated',
 'request_title': 'Request Colorado Springs Help Us Please',
 'requester_accoun

We're only interested in the text fields:

**Input**

* request_id : unique identifier for the request
* request_title : title of the Reddit post for the pizza request
* request_text_edit_aware : body text explaining the pizza request

**Output**

* requester_received_pizza : whether the requester received a pizza

For our purpose, let's use only `request_text_edit_aware` as the input to build our Naive Bayes classifier; the output is the `requester_received_pizza` field.

**Note**: the `request_id` is only used for mapping purposes when we submit the results to the Kaggle task.

In [63]:
print('UID:\t', train_json[0]['request_id'], '\n')
print('Title:\t', train_json[0]['request_title'], '\n')
print('Text:\t', train_json[0]['request_text_edit_aware'], '\n')
print('Tag:\t', train_json[0]['requester_received_pizza'], '\n')

UID:	 t3_l25d7 

Title:	 Request Colorado Springs Help Us Please 

Text:	 Hi I am in need of food for my 4 children we are a military family that has really hit hard times and we have exahusted all means of help just to be able to feed my family and make it through another night is all i ask i know our blessing is coming so whatever u can find in your heart to give is greatly appreciated 

Tag:	 False 



## Convert JSON to a pandas DataFrame

In [64]:
import pandas as pd
df = pd.json_normalize(train_json)  # Pandas magic... (pd.io.json.json_normalize in pandas < 1.0)
df_train = df[['request_id', 'request_title',
               'request_text_edit_aware',
               'requester_received_pizza']]
df_train.head()

| | request_id | request_title | request_text_edit_aware | requester_received_pizza |
|---|---|---|---|---|
| 0 | t3_l25d7 | Request Colorado Springs Help Us Please | Hi I am in need of food for my 4 children we a... | False |
| 1 | t3_rcb83 | [Request] California, No cash and I could use ... | I spent the last money I had on gas today. Im ... | False |
| 2 | t3_lpu5j | [Request] Hungry couple in Dundee, Scotland wo... | My girlfriend decided it would be a good idea ... | False |
| 3 | t3_mxvj3 | [Request] In Canada (Ontario), just got home f... | It's cold, I'n hungry, and to be completely ho... | False |
| 4 | t3_1i6486 | [Request] Old friend coming to visit. Would LO... | hey guys:\n I love this sub. I think it's grea... | False |


## Let's take a look at the test data

In [65]:
import json

with open('data/pizza/test.json') as fin:
    test_json = json.load(fin)
    
print('UID:\t', test_json[0]['request_id'], '\n')
print('Title:\t', test_json[0]['request_title'], '\n')
print('Text:\t', test_json[0]['request_text_edit_aware'], '\n')
print('Tag:\t', test_json[0]['requester_received_pizza'], '\n')

UID:	 t3_i8iy4 

Title:	 [request] pregger gf 95 degree house and no food.. promise to pay it forward! Northern Colorado 

Text:	 Hey all! It's about 95 degrees here and our kitchen is pretty much empty save for some bread and cereal.  My girlfriend/fiance is 8 1/2 months pregnant and we could use a good meal.  We promise to pay it forward when we get money! Thanks so much in advance! 



KeyError: 'requester_received_pizza'

From the above error we can see that `requester_received_pizza` is not available in the test data, since that's what our classifier is supposed to predict.

**Note**: Whatever features we train our classifier with, we should have them in our test set too. In our case, we need to make sure that the test set has the `request_text_edit_aware` field.
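
A quick sanity check for this is to list which required fields are missing from each dataset before training. The sketch below uses tiny hypothetical records shaped like the JSON entries; the helper name `missing_fields` and the record contents are assumptions for illustration only.

```python
REQUIRED_FEATURES = ['request_text_edit_aware']


def missing_fields(records, required):
    """Return the required field names that are absent from at least one record."""
    return [field for field in required
            if any(field not in record for record in records)]


# Tiny hypothetical records shaped like the train/test JSON entries.
train_records = [{'request_id': 't3_l25d7',
                  'request_text_edit_aware': 'Hi I am in need of food',
                  'requester_received_pizza': False}]
test_records = [{'request_id': 't3_i8iy4',
                 'request_text_edit_aware': 'Hey all!'}]

print(missing_fields(train_records, REQUIRED_FEATURES))
print(missing_fields(test_records, REQUIRED_FEATURES))
print(missing_fields(test_records, ['requester_received_pizza']))
```

The last call reports the label as missing from the test records, which is exactly the situation behind the `KeyError` above.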

## Let's put the test data into a pandas DataFrame too

In [66]:
import pandas as pd
df = pd.json_normalize(test_json)

df_test = df[['request_id',
              'request_title',
              'request_text_edit_aware']]

df_test.head()

| | request_id | request_title | request_text_edit_aware |
|---|---|---|---|
| 0 | t3_i8iy4 | [request] pregger gf 95 degree house and no fo... | Hey all! It's about 95 degrees here and our ki... |
| 1 | t3_1mfqi0 | [Request] Lost my job day after labour day, st... | I didn't know a place like this exists! \n\nI ... |
| 2 | t3_lclka | (Request) pizza for my kids please? | Hi Reddit. Im a single dad having a really rou... |
| 3 | t3_1jdgdj | [Request] Just moved to a new state(Waltham MA... | Hi I just moved to Waltham MA from my home sta... |
| 4 | t3_t2qt4 | [Request] Two girls in between paychecks, we'v... | We're just sitting here near indianapolis on o... |


## Split training data before vectorization

The first thing to do is split our training data into 2 parts:
* training : Used for training the model
* validation : Used to check the "soundness" of our model

**Note**:

* Splitting the data into 2 parts and holding out one part to check the model is one method of validating the "soundness" of our model. It's called **hold-out** validation.
* Another popular validation method is cross-validation. It's out of scope here, but you can take a look at `cross_val_score` in `sklearn.model_selection`.
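To give a rough idea of how cross-validation differs from a single hold-out split, here is a minimal pure-Python sketch that only generates the index splits. The function name `k_fold_indices` is hypothetical, and the folds are contiguous and unshuffled for readability; in practice you would shuffle first or use the tools in `sklearn.model_selection`.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, valid_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_samples)
                 if i < start or i >= start + size]
        yield train, valid
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, valid_idx in folds:
    print('train:', train_idx, 'valid:', valid_idx)
```

Each sample lands in the validation part exactly once, so the model is trained and scored `k` times and the scores are averaged, instead of relying on one split.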

In [67]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# It doesn't really matter what the function name is called
# but the `train_test_split` is splitting up the data into 
# 2 parts according to the `test_size` argument you've set.

# When we're splitting up the training data, we're splitting up
# into train, valid split. The function name is just a name

train, valid = train_test_split(df_train, test_size=0.2)

## Vectorize the train and validation set

In [69]:
# Initialize the vectorizer and 
# override the analyzer totally with the preprocess_text().
# Note: the vectorizer is just an 'empty' object now.

count_vect = CountVectorizer(analyzer=preprocess_text)

# When we use `CountVectorizer.fit_transform()`,
# we essentially create the dictionary and 
# vectorize our input text at the same time. 
train_set = count_vect.fit_transform(train['request_text_edit_aware'])
train_tags = train['requester_received_pizza']

# When vectorizing the validation data, we use `CountVectorizer.transform()`
valid_set = count_vect.transform(valid['request_text_edit_aware'])
valid_tags = valid['requester_received_pizza']
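The difference between fitting and transforming can be sketched in plain Python. This is a simplified stand-in for what a count vectorizer does, not scikit-learn's implementation: the helper names `fit_vocabulary` and `transform`, the whitespace tokenization, and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct token in the training docs to a column index."""
    vocab = sorted({token for doc in docs for token in doc.split()})
    return {token: idx for idx, token in enumerate(vocab)}

def transform(docs, vocabulary):
    """Turn docs into count vectors; tokens outside the vocabulary are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        # Dict order follows insertion order, which matches the column indices.
        rows.append([counts.get(token, 0) for token in vocabulary])
    return rows

train_docs = ['pizza please', 'hungry student pizza pizza']
valid_docs = ['hungry family please help']

vocab = fit_vocabulary(train_docs)         # "fit": learn the dictionary
train_rows = transform(train_docs, vocab)  # "transform": count against it
valid_rows = transform(valid_docs, vocab)  # unseen words are ignored

print(vocab)
print(train_rows)
print(valid_rows)
```

Because the validation text is counted against the dictionary learned from the training text, both matrices have the same columns, which is what the classifier needs.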

## Now we need to vectorize the test data too

After we vectorize our data, the input to train the classifier would be the vectorized text. 

When we predict the label with the trained model, our input needs to be vectorized too. 

In [70]:
# When vectorizing the test data, we use `CountVectorizer.transform()`
test_set = count_vect.transform(df_test['request_text_edit_aware'])

## There are different variants of the Naive Bayes (NB) classifier in `sklearn`.
For simplicity, let's just use `MultinomialNB`.

**Multinomial** is a big word, but it just means that there are many categories/bins/boxes to count over. Here each word in the vocabulary is one bin, so the model works directly on our bag-of-words counts.
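As a rough illustration of what a multinomial Naive Bayes model computes from word counts, here is a small pure-Python sketch with additive smoothing (`alpha=1.0`, the same default value shown for `MultinomialNB`). It is a simplified teaching version under stated assumptions, not scikit-learn's implementation; the function names and the toy count matrix are made up.

```python
import math

def train_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed per-class log word probabilities."""
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word in this class, plus the smoothing term.
        totals = [sum(col) + alpha for col in zip(*rows)]
        denom = sum(totals)
        log_likelihood[c] = [math.log(t / denom) for t in totals]
    return log_prior, log_likelihood

def predict_nb(x, log_prior, log_likelihood):
    """Pick the class with the highest log posterior for one count vector."""
    scores = {c: log_prior[c] + sum(count * lp
                                    for count, lp in zip(x, log_likelihood[c]))
              for c in log_prior}
    return max(scores, key=scores.get)

# Made-up count vectors over a 3-word vocabulary, with made-up labels.
X = [[2, 0, 1], [1, 0, 2], [0, 2, 0], [0, 3, 1]]
y = [True, True, False, False]

log_prior, log_likelihood = train_nb(X, y)
print(predict_nb([1, 0, 1], log_prior, log_likelihood))
print(predict_nb([0, 2, 0], log_prior, log_likelihood))
```

The smoothing term keeps a word that never appeared in a class from driving that class's probability to zero.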

In [71]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()

# To train the classifier, simply do
clf.fit(train_set, train_tags)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

## Before we test our classifier on the test set, we get a sense of how good it is on the validation set.

In [72]:
from sklearn.metrics import accuracy_score

# To predict our tags (i.e. whether requesters get their pizza),
# we feed the vectorized `valid_set` to .predict()

predictions_valid = clf.predict(valid_set)

# `accuracy_score` expects the true labels first, then the predictions.
print('Pizza reception accuracy = {}'.format(
    accuracy_score(valid_tags, predictions_valid) * 100)
)

Pizza reception accuracy = 71.53465346534654


## Now let's use the full training data set, re-vectorize, and retrain the classifier
More data == better model (in most cases)

In [74]:
full_train_set = count_vect.fit_transform(df_train['request_text_edit_aware'])
full_tags = df_train['requester_received_pizza']

# Note: We have to re-vectorize the test set, since
# our vectorizer is now fitted on the full training set.

test_set = count_vect.transform(df_test['request_text_edit_aware'])

# To train the classifier
clf = MultinomialNB()
clf.fit(full_train_set, full_tags)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

## Finally, we use the classifier to predict on the test set

In [75]:
# To predict our tags (i.e. whether requesters get their pizza),
# we feed the vectorized `test_set` to .predict()

predictions = clf.predict(test_set)

**Note**: Since we don't have the `requester_received_pizza` field in test data, we can't measure accuracy. But we can do some exploration as shown below.

## From the training data, we had 24% pizza giving rate

In [76]:
success_rate = sum(df_train['requester_received_pizza']) / len(df_train) * 100
print(str('Of {} requests, only {} gets their pizzas,'
          '{}% success rate...'.format(len(df_train),
                                       sum(df_train['requester_received_pizza']),
                                       success_rate)
          )
      )

Of 4040 requests, only 994 gets their pizzas,24.603960396039604% success rate...


## Lolz, our classifier is rather stingy

In [77]:
success_rate = sum(predictions)/len(predictions) * 100
print(str('Of {} requests, only {} gets their pizzas,'
          '{}% success rate...'.format(len(predictions),
                                       sum(predictions),
                                       success_rate)
          )
      )

Of 1631 requests, only 51 gets their pizzas,3.126916002452483% success rate...
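The numbers printed so far allow a quick sanity check: how well would a "classifier" do that always says no? The sketch below only reuses the counts and the validation accuracy printed earlier in this notebook. Note that the validation split is a random 20% of the training data, so its class ratio is only approximately the same as the full set.

```python
# Counts printed for the training data above.
n_requests = 4040
n_received = 994

# Accuracy (in %) of always predicting "no pizza" on the full training data.
majority_baseline = (n_requests - n_received) / n_requests * 100

# Validation accuracy printed earlier for the MultinomialNB model.
validation_accuracy = 71.53465346534654

print('Always-no baseline:  {:.2f}%'.format(majority_baseline))
print('Validation accuracy: {:.2f}%'.format(validation_accuracy))
```

Under these assumptions, the always-no baseline is a few points higher than the validation accuracy, which suggests accuracy alone is not a very informative score for data this imbalanced.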


## How accurate is our count-vectorization Naive Bayes classifier on the test data?

Since we don't have the `requester_received_pizza` field in the test data, we have to check that with an oracle (i.e. the person who knows).

On Kaggle, **checking with the oracle** means uploading the file in the correct format and their script will process the scores and tell you how you did.

**Note**: Different tasks use different metrics, but in most cases getting as many correct predictions as possible is the thing to aim for. We won't get into the details of how classifiers are evaluated, but for a start, please see [precision, recall and F1-scores](https://en.wikipedia.org/wiki/Precision_and_recall).
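For a first feel of those metrics, here is a minimal pure-Python sketch that computes them for the positive class. The labels and predictions are made up for illustration, and the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive (True) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # true positives
    fp = sum(1 for t, p in pairs if not t and p)    # false positives
    fn = sum(1 for t, p in pairs if t and not p)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up labels: 3 of 8 requesters got pizza; the model finds only 1 of them.
y_true = [True, True, True, False, False, False, False, False]
y_pred = [True, False, False, False, False, False, False, True]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

print('accuracy  =', accuracy)
print('precision =', precision)
print('recall    =', recall)
print('f1        =', f1)
```

In this toy case, accuracy looks acceptable while recall shows that most of the successful requests were missed.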

## Finally, let's take a look at what format the oracle expects and create the output file for our predictions accordingly

In [78]:
df_output = pd.DataFrame({'request_id': list(df_test['request_id']),
                          'requester_received_pizza': list(predictions)
                          })

# Convert the predictions from boolean to integer.
df_output['requester_received_pizza'] = df_output['requester_received_pizza'].astype(int)
df_output.head()

Unnamed: 0,request_id,requester_received_pizza
0,t3_i8iy4,0
1,t3_1mfqi0,0
2,t3_lclka,0
3,t3_1jdgdj,0
4,t3_t2qt4,0


In [60]:
# Create the CSV file.
# Run the cell that builds `df_output` first; otherwise this cell fails with
# NameError: name 'df_output' is not defined
df_output.to_csv('basic-nlp-submission.csv', index=False)