# Basics of tokenizing
Tutorial URL: https://www.kaggle.com/alvations/basic-nlp-with-nltk/#Combining-the-punctuation-with-the-stopwords-from-NLTK.
Other tutorials:
https://www.kaggle.com/ngyptr/python-nltk-sentiment-analysis
https://www.kaggle.com/itratrahman/nlp-tutorial-using-python#Training-Model
We need some text to work with, so we load the Brown corpus from NLTK (download it first with `nltk.download('brown')` if it is not already installed) and check the type of the corpus object, for easy reference in the API documentation.

In [1]:
import nltk
from nltk.corpus import brown
type(brown)

nltk.corpus.util.LazyCorpusLoader

The `words` function of `brown` returns a list-like sequence of word strings, and `sents` returns a sequence of sentences, each of which is itself a list of word strings. Since `brown` is a corpus, that is, a collection of documents, we can also list the individual files with `fileids` and restrict a call to particular files. Both `words` and `sents` do some processing on the raw text, which can be inspected with the `raw` function. In the raw form, each word is followed by a slash and a label, which appears to correspond to the part of speech of the word.

In [2]:
print(len(brown.words()))
print(len(brown.sents()))
print(len(brown.sents(fileids="ca01")))
print(len(brown.fileids()))
print(brown.raw('cb01').strip()[:1000]) # First 1000 characters.

1161192
57340
98
500
Assembly/nn-hl session/nn-hl brought/vbd-hl much/ap-hl good/nn-hl 
The/at General/jj-tl Assembly/nn-tl ,/, which/wdt adjourns/vbz today/nr ,/, has/hvz performed/vbn in/in an/at atmosphere/nn of/in crisis/nn and/cc struggle/nn from/in the/at day/nn it/pps convened/vbd ./.
It/pps was/bedz faced/vbn immediately/rb with/in a/at showdown/nn on/in the/at schools/nns ,/, an/at issue/nn which/wdt was/bedz met/vbn squarely/rb in/in conjunction/nn with/in the/at governor/nn with/in a/at decision/nn not/* to/to risk/vb abandoning/vbg public/nn education/nn ./.


	There/ex followed/vbd the/at historic/jj appropriations/nns and/cc budget/nn fight/nn ,/, in/in which/wdt the/at General/jj-tl Assembly/nn-tl decided/vbd to/to tackle/vb executive/nn powers/nns ./.
The/at final/jj decision/nn went/vbd to/in the/at executive/nn but/cc a/at way/nn has/hvz been/ben opened/vbn for/in strengthening/vbg budgeting/vbg procedures/nns and/cc to/to provide/vb legislators/nns information/nn the

In [4]:
# Each non-blank line of the raw text appears to hold one tagged sentence.
for i, line in enumerate(brown.raw('cb01').split('\n')):
    if i >= 10: # Let's take a look at the first 10 lines.
        break
    print(str(i) + ':\t' + line)

0:	
1:	
2:	
3:	Assembly/nn-hl session/nn-hl brought/vbd-hl much/ap-hl good/nn-hl 
4:	The/at General/jj-tl Assembly/nn-tl ,/, which/wdt adjourns/vbz today/nr ,/, has/hvz performed/vbn in/in an/at atmosphere/nn of/in crisis/nn and/cc struggle/nn from/in the/at day/nn it/pps convened/vbd ./.
5:	It/pps was/bedz faced/vbn immediately/rb with/in a/at showdown/nn on/in the/at schools/nns ,/, an/at issue/nn which/wdt was/bedz met/vbn squarely/rb in/in conjunction/nn with/in the/at governor/nn with/in a/at decision/nn not/* to/to risk/vb abandoning/vbg public/nn education/nn ./.
6:	
7:	
8:		There/ex followed/vbd the/at historic/jj appropriations/nns and/cc budget/nn fight/nn ,/, in/in which/wdt the/at General/jj-tl Assembly/nn-tl decided/vbd to/to tackle/vb executive/nn powers/nns ./.
9:	The/at final/jj decision/nn went/vbd to/in the/at executive/nn but/cc a/at way/nn has/hvz been/ben opened/vbn for/in strengthening/vbg budgeting/vbg procedures/nns and/cc to/to provide/vb legislators/nns inform
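
The `word/tag` layout visible in the raw lines above can be taken apart with plain string handling. The following is a minimal sketch, not how NLTK itself reads the corpus (NLTK exposes the tagged form directly through `brown.tagged_words()`). The helper name `parse_tagged_line` is hypothetical, and the sketch assumes every token contains at least one slash.

```python
def parse_tagged_line(line):
    """Split a raw Brown-style line into (word, tag) pairs.

    Hypothetical helper for illustration; assumes each token looks like word/tag.
    """
    pairs = []
    for token in line.split():
        # Split on the LAST slash so that a word containing "/" stays intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


sample = "The/at General/jj-tl Assembly/nn-tl ,/, which/wdt adjourns/vbz today/nr ./."
tagged = parse_tagged_line(sample)

print(tagged[:3])
# Dropping the tags recovers the plain tokens.
print([word for word, tag in tagged])
```

Splitting on the last slash matters for punctuation tokens such as `,/,`, where both the word and the tag are the same character.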

In [5]:
from nltk.corpus import webtext

In [6]:
print(webtext.fileids())
print(webtext.sents())
single_8 = webtext.raw('singles.txt').split('\n')[8]
print(single_8)

['firefox.txt', 'grail.txt', 'overheard.txt', 'pirates.txt', 'singles.txt', 'wine.txt']
[['Cookie', 'Manager', ':', '"', 'Don', "'", 't', 'allow', 'sites', 'that', 'set', 'removed', 'cookies', 'to', 'set', 'future', 'cookies', '"', 'should', 'stay', 'checked', 'When', 'in', 'full', 'screen', 'mode', 'Pressing', 'Ctrl', '-', 'N', 'should', 'open', 'a', 'new', 'browser', 'when', 'only', 'download', 'dialog', 'is', 'left', 'open', 'add', 'icons', 'to', 'context', 'menu', 'So', 'called', '"', 'tab', 'bar', '"', 'should', 'be', 'made', 'a', 'proper', 'toolbar', 'or', 'given', 'the', 'ability', 'collapse', '/', 'expand', '.'], ['[', 'XUL', ']', 'Implement', 'Cocoa', '-', 'style', 'toolbar', 'customization', '.'], ...]
ARE YOU ALONE or lost in a r/ship too, with no hope in sight? Maybe we could explore new beginnings together? Im 45 Slim/Med build, GSOH, high needs and looking for someone similar. You WONT be disappointed.


In [7]:
from nltk import sent_tokenize, word_tokenize


In [8]:
print(sent_tokenize(single_8))
print(word_tokenize(single_8))

for sent in sent_tokenize(single_8):
    print(word_tokenize(sent))
    # Either of the following would also convert every token to lower case:
    #print([word.lower() for word in word_tokenize(sent)])
    #print(list(map(str.lower, word_tokenize(sent))))

['ARE YOU ALONE or lost in a r/ship too, with no hope in sight?', 'Maybe we could explore new beginnings together?', 'Im 45 Slim/Med build, GSOH, high needs and looking for someone similar.', 'You WONT be disappointed.']
['ARE', 'YOU', 'ALONE', 'or', 'lost', 'in', 'a', 'r/ship', 'too', ',', 'with', 'no', 'hope', 'in', 'sight', '?', 'Maybe', 'we', 'could', 'explore', 'new', 'beginnings', 'together', '?', 'Im', '45', 'Slim/Med', 'build', ',', 'GSOH', ',', 'high', 'needs', 'and', 'looking', 'for', 'someone', 'similar', '.', 'You', 'WONT', 'be', 'disappointed', '.']
['ARE', 'YOU', 'ALONE', 'or', 'lost', 'in', 'a', 'r/ship', 'too', ',', 'with', 'no', 'hope', 'in', 'sight', '?']
['Maybe', 'we', 'could', 'explore', 'new', 'beginnings', 'together', '?']
['Im', '45', 'Slim/Med', 'build', ',', 'GSOH', ',', 'high', 'needs', 'and', 'looking', 'for', 'someone', 'similar', '.']
['You', 'WONT', 'be', 'disappointed', '.']


# Stopwords and punctuation
Stopwords are words that don't carry much semantic meaning and are more grammatical in nature. For many analyses, they are stripped out for simplicity.

In [9]:
from nltk.corpus import stopwords


In [10]:

stopwords_en = stopwords.words('english')
print(stopwords_en)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [11]:
word_tokenize(single_8)
single_8_tokenized_lowered = list(map(str.lower, word_tokenize(single_8)))
print(single_8_tokenized_lowered)



stopwords_en = set(stopwords.words('english')) # Set checking is faster in Python than list.
print([word for word in single_8_tokenized_lowered if word not in stopwords_en])

['are', 'you', 'alone', 'or', 'lost', 'in', 'a', 'r/ship', 'too', ',', 'with', 'no', 'hope', 'in', 'sight', '?', 'maybe', 'we', 'could', 'explore', 'new', 'beginnings', 'together', '?', 'im', '45', 'slim/med', 'build', ',', 'gsoh', ',', 'high', 'needs', 'and', 'looking', 'for', 'someone', 'similar', '.', 'you', 'wont', 'be', 'disappointed', '.']
['alone', 'lost', 'r/ship', ',', 'hope', 'sight', '?', 'maybe', 'could', 'explore', 'new', 'beginnings', 'together', '?', 'im', '45', 'slim/med', 'build', ',', 'gsoh', ',', 'high', 'needs', 'looking', 'someone', 'similar', '.', 'wont', 'disappointed', '.']


In [12]:
from string import punctuation
# It's a string, so we have to convert it into a set type.
print('From string.punctuation:', type(punctuation), punctuation)
print(type(punctuation))

From string.punctuation: <class 'str'> !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
<class 'str'>


In [13]:
punct_set = set(punctuation)  # membership test on a set, not a substring test on the str
print([word for word in single_8_tokenized_lowered if word not in punct_set])

['are', 'you', 'alone', 'or', 'lost', 'in', 'a', 'r/ship', 'too', 'with', 'no', 'hope', 'in', 'sight', 'maybe', 'we', 'could', 'explore', 'new', 'beginnings', 'together', 'im', '45', 'slim/med', 'build', 'gsoh', 'high', 'needs', 'and', 'looking', 'for', 'someone', 'similar', 'you', 'wont', 'be', 'disappointed']
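
One subtlety is worth a quick check here: `string.punctuation` is a plain `str`, so `token in punctuation` is a substring test, while `token in set(punctuation)` is a membership test over single characters. The sketch below uses a few made-up tokens (they are not from the corpus) to show where the two differ.

```python
from string import punctuation

punct_set = set(punctuation)

# Made-up tokens for illustration only.
tokens = ["hope", ",", "./", "", "...", "?"]

# Substring test: drops anything that occurs contiguously inside the string,
# including the empty string and multi-character runs such as "./".
kept_by_str = [t for t in tokens if t not in punctuation]

# Membership test: drops only tokens that are exactly one punctuation character.
kept_by_set = [t for t in tokens if t not in punct_set]

print(kept_by_str)
print(kept_by_set)
```

Neither test removes a token like `...`, so multi-character punctuation would need separate handling if it matters for the analysis.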


By combining the stopwords and the punctuation into a single set, we can remove both in one pass. The cell below then goes one step further and adds a stronger, more comprehensive stopword list taken from stopwords-json.

In [20]:
stopwords_en_withpunct = stopwords_en.union(set(punctuation))
print(stopwords_en_withpunct)

print([word for word in single_8_tokenized_lowered if word not in stopwords_en_withpunct])

# Stopwords from stopwords-json
stopwords_json = {"en":["a","a's","able","about","above","according","accordingly","across","actually","after","afterwards","again","against","ain't","all","allow","allows","almost","alone","along","already","also","although","always","am","among","amongst","an","and","another","any","anybody","anyhow","anyone","anything","anyway","anyways","anywhere","apart","appear","appreciate","appropriate","are","aren't","around","as","aside","ask","asking","associated","at","available","away","awfully","b","be","became","because","become","becomes","becoming","been","before","beforehand","behind","being","believe","below","beside","besides","best","better","between","beyond","both","brief","but","by","c","c'mon","c's","came","can","can't","cannot","cant","cause","causes","certain","certainly","changes","clearly","co","com","come","comes","concerning","consequently","consider","considering","contain","containing","contains","corresponding","could","couldn't","course","currently","d","definitely","described","despite","did","didn't","different","do","does","doesn't","doing","don't","done","down","downwards","during","e","each","edu","eg","eight","either","else","elsewhere","enough","entirely","especially","et","etc","even","ever","every","everybody","everyone","everything","everywhere","ex","exactly","example","except","f","far","few","fifth","first","five","followed","following","follows","for","former","formerly","forth","four","from","further","furthermore","g","get","gets","getting","given","gives","go","goes","going","gone","got","gotten","greetings","h","had","hadn't","happens","hardly","has","hasn't","have","haven't","having","he","he's","hello","help","hence","her","here","here's","hereafter","hereby","herein","hereupon","hers","herself","hi","him","himself","his","hither","hopefully","how","howbeit","however","i","i'd","i'll","i'm","i've","ie","if","ignored","immediate","in","inasmuch","inc","indeed","indicate","indicated","indicates","inner","insofar","instead","into",
"inward","is","isn't","it","it'd","it'll","it's","its","itself","j","just","k","keep","keeps","kept","know","known","knows","l","last","lately","later","latter","latterly","least","less","lest","let","let's","like","liked","likely","little","look","looking","looks","ltd","m","mainly","many","may","maybe","me","mean","meanwhile","merely","might","more","moreover","most","mostly","much","must","my","myself","n","name","namely","nd","near","nearly","necessary","need","needs","neither","never","nevertheless","new","next","nine","no","nobody","non","none","noone","nor","normally","not","nothing","novel","now","nowhere","o","obviously","of","off","often","oh","ok","okay","old","on","once","one","ones","only","onto","or","other","others","otherwise","ought","our","ours","ourselves","out","outside","over","overall","own","p","particular","particularly","per","perhaps","placed","please","plus","possible","presumably","probably","provides","q","que","quite","qv","r","rather","rd","re","really","reasonably","regarding","regardless","regards","relatively","respectively","right","s","said","same","saw","say","saying","says","second","secondly","see","seeing","seem","seemed","seeming","seems","seen","self","selves","sensible","sent","serious","seriously","seven","several","shall","she","should","shouldn't","since","six","so","some","somebody","somehow","someone","something","sometime","sometimes","somewhat","somewhere","soon","sorry","specified","specify","specifying","still","sub","such","sup","sure","t","t's","take","taken","tell","tends","th","than","thank","thanks","thanx","that","that's","thats","the","their","theirs","them","themselves","then","thence","there","there's","thereafter","thereby","therefore","therein","theres","thereupon","these","they","they'd","they'll","they're","they've","think","third","this","thorough","thoroughly","those","though","three","through","throughout","thru","thus","to","together","too","took","toward","towards","tried","tries","truly","try",
"trying","twice","two","u","un","under","unfortunately","unless","unlikely","until","unto","up","upon","us","use","used","useful","uses","using","usually","uucp","v","value","various","very","via","viz","vs","w","want","wants","was","wasn't","way","we","we'd","we'll","we're","we've","welcome","well","went","were","weren't","what","what's","whatever","when","whence","whenever","where","where's","whereafter","whereas","whereby","wherein","whereupon","wherever","whether","which","while","whither","who","who's","whoever","whole","whom","whose","why","will","willing","wish","with","within","without","won't","wonder","would","wouldn't","x","y","yes","yet","you","you'd","you'll","you're","you've","your","yours","yourself","yourselves","z","zero"]}
stopwords_json_en = set(stopwords_json['en'])
stopwords_nltk_en = set(stopwords.words('english'))
stopwords_punct = set(punctuation)
# Combine the stopwords. It's a lot longer, so I'm not printing it out...
stoplist_combined = set.union(stopwords_json_en, stopwords_nltk_en, stopwords_punct)

# Remove the stopwords from `single_8`.
print('With combined stopwords:')
print([word for word in single_8_tokenized_lowered if word not in stoplist_combined])

{'ain', '%', 'while', '*', 'no', '>', '~', 'mightn', 'such', 'which', "mightn't", 'me', 'my', "she's", 'its', 'we', 'to', 'then', 'now', 'aren', "haven't", 'shan', "won't", 'doing', 'most', 'him', 'than', 'out', 'and', 'of', 'ourselves', 've', '&', 'been', 'haven', '=', 'only', 's', 'other', '{', 'some', 'her', "couldn't", 'any', 'couldn', "you're", 'isn', 'wouldn', 'these', "isn't", 'don', "hasn't", ']', 'hadn', 'weren', 'there', 'was', 'under', 'ours', 'didn', 'same', 'for', 'myself', 'you', "'", 'do', 'having', 'with', 'when', 'all', 'll', ';', '+', 'mustn', 'm', 'itself', 'above', 'whom', 'yourself', '!', 'does', 'both', 'between', '<', '^', 'yours', ',', 'or', 'is', "doesn't", "you'd", '?', "aren't", 'a', "it's", "that'll", 'ma', 'had', 'further', 't', 'over', 'hers', 're', 'did', 'won', '#', 'through', "you'll", 'if', '|', "mustn't", 'can', ')', 'should', '/', '[', 'has', 'o', '$', "needn't", 'each', "hadn't", 'during', 'shouldn', 'being', 'from', 'again', 'be', 'are', 'i', 'will

# Stemming, Lemmatization
Both of these are methods of reducing words to a root form, so that different forms of the same word can be grouped together. The difference is that stemming is faster but cruder: it strips suffixes by rule and won't necessarily return full words (although the result is often still recognizable). Lemmatization is more complicated, and therefore slower, but also more linguistically correct, because it uses a dictionary and morphological rules. It also usually needs to know the part of speech of the word and, unless otherwise specified, NLTK's `WordNetLemmatizer` will assume that the word is a noun.

In [14]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
porter = PorterStemmer()

for word in ['walking', 'walks', 'walked']:
    print(porter.stem(word))
    
    
wnl = WordNetLemmatizer()
for word in ['walking', 'walks', 'walked']:
    print(wnl.lemmatize(word))

walk
walk
walk
walking
walk
walked
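
To make the contrast concrete without any corpus data, here is a deliberately tiny toy version of each idea. It is only a sketch: `toy_stem` is not the Porter algorithm and `toy_lemmatize` is not WordNet; the suffix list and the mini-lexicon are both hypothetical and exist purely for illustration.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny rule set


def toy_stem(word):
    """Strip the first matching suffix by rule, with no dictionary check."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-lexicon keyed by (word, part of speech).
TOY_LEXICON = {
    ("walking", "v"): "walk",
    ("walked", "v"): "walk",
    ("walks", "v"): "walk",
    ("walks", "n"): "walk",
}


def toy_lemmatize(word, pos="n"):
    """Look the word up under the given POS; unknown pairs come back unchanged."""
    return TOY_LEXICON.get((word, pos), word)


words = ["walking", "walks", "walked"]
stems = [toy_stem(w) for w in words]
noun_lemmas = [toy_lemmatize(w) for w in words]           # default POS is noun
verb_lemmas = [toy_lemmatize(w, pos="v") for w in words]  # POS supplied

print(stems)
print(noun_lemmas)
print(verb_lemmas)
print(toy_stem("running"))  # rule-based stripping can yield a non-word
```

The noun-by-default lookup reproduces the pattern in the lemmatizer output above: only `walks` changes, because `walking` and `walked` are not treated as verb forms until a part of speech is supplied.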


This section maps each Penn Treebank part-of-speech tag produced by `pos_tag` to the corresponding WordNet tag (via the `morphy_tag` dictionary), so that the lemmatizer is told what type of word it is dealing with.

In [15]:
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def penn2morphy(penntag):
    """ Converts Penn Treebank tags to WordNet. """
    morphy_tag = {'NN':'n', 'JJ':'a',
                  'VB':'v', 'RB':'r'}
    try:
        return morphy_tag[penntag[:2]]
    except KeyError:
        return 'n' # if mapping isn't found, fall back to Noun.
    
# `pos_tag` takes the tokenized sentence as input, i.e. a list of strings,
# and returns a list of (word, tag) tuples,
# so we need to get the tag from the 2nd element.
print(word_tokenize("He is walking to school"))
walking_tagged = pos_tag(word_tokenize('He is walking to school'))
print(walking_tagged)

[wnl.lemmatize(word.lower(), pos=penn2morphy(tag)) for word, tag in walking_tagged]

['He', 'is', 'walking', 'to', 'school']
[('He', 'PRP'), ('is', 'VBZ'), ('walking', 'VBG'), ('to', 'TO'), ('school', 'NN')]


['he', 'be', 'walk', 'to', 'school']
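
The tag conversion relies only on the first two characters of the Penn Treebank tag, which can be checked in isolation without running the tagger. The sketch below restates the same mapping with `dict.get` in place of `try`/`except`; the function name `penn_to_wordnet` and its `default` parameter are hypothetical and used only for this illustration.

```python
# Same four prefixes as the mapping above; everything else falls back to noun.
MORPHY_TAG = {"NN": "n", "JJ": "a", "VB": "v", "RB": "r"}


def penn_to_wordnet(penntag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS letter via its first two characters."""
    return MORPHY_TAG.get(penntag[:2], default)


for tag in ["VBZ", "VBG", "NNS", "JJR", "RBR", "PRP", "TO"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Tags such as `PRP` and `TO` have no WordNet counterpart in this mapping, so they are treated as nouns, which leaves words like `he` and `to` unchanged by the lemmatizer.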

In [16]:
def lemmatize_sent(text): 
    # Text input is string, returns lowercased strings.
    return [wnl.lemmatize(word.lower(), pos=penn2morphy(tag)) 
            for word, tag in pos_tag(word_tokenize(text))]

lemmatize_sent('He is walking to school')

['he', 'be', 'walk', 'to', 'school']

In [22]:
stoplist_combined = set.union(stopwords_json_en, stopwords_nltk_en, stopwords_punct)

def preprocess_text(text):
    # Input: str, i.e. document/sentence
    # Output: list(str) , i.e. list of lemmas
    return [word for word in lemmatize_sent(text) 
            if word not in stoplist_combined
            and not word.isdigit()]

print('Original Single no. 8:')
print(single_8, '\n')
print('Lemmatized and removed stopwords:')
print(preprocess_text(single_8))

Original Single no. 8:
ARE YOU ALONE or lost in a r/ship too, with no hope in sight? Maybe we could explore new beginnings together? Im 45 Slim/Med build, GSOH, high needs and looking for someone similar. You WONT be disappointed. 

Lemmatized and removed stopwords:
['lose', 'r/ship', 'hope', 'sight', 'explore', 'beginning', 'im', 'slim/med', 'build', 'gsoh', 'high', 'similar', 'wont', 'disappoint']


# Vectorizing a sentence
`Counter` takes a list of strings and counts the number of instances of each unique word. This can be used to create a table with unique words as columns, sentences as rows, and cell values showing the number of times the word occurs in the sentence. (A dense table like this needs one column per unique word, so it doesn't seem practical for even moderately large collections of text.)

In [None]:
from collections import Counter

sent1 = "The quick brown fox jumps over the lazy brown dog."
sent2 = "Mr brown jumps over the lazy fox."

# Lemmatize and remove stopwords
processed_sent1 = preprocess_text(sent1)
processed_sent2 = preprocess_text(sent2)

print('Processed sentence:')
print(processed_sent1)
print()
print('Word counts:')
print(Counter(processed_sent1))

print('Processed sentence:')
print(processed_sent2)
print()
print('Word counts:')
print(Counter(processed_sent2))
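
The word-by-sentence table described at the start of this section can be assembled from the per-sentence counters. This is a minimal sketch using only the standard library; the token lists in `docs` are hypothetical stand-ins for preprocessed sentences, not the actual output of `preprocess_text`.

```python
from collections import Counter

# Hypothetical token lists standing in for preprocessed sentences.
docs = [
    ["quick", "brown", "fox", "jump", "lazy", "brown", "dog"],
    ["mr", "brown", "jump", "lazy", "fox"],
]

# One Counter per sentence.
counts = [Counter(doc) for doc in docs]

# Columns: every unique word across all sentences, in a fixed (sorted) order.
vocab = sorted(set().union(*docs))

# Rows: one per sentence; a Counter returns 0 for words it has not seen.
table = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in table:
    print(row)
```

Each new unique word adds a column that is zero for most rows, which is why this dense layout becomes unwieldy as the text collection grows.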