# Natural Language Processing

* **Corpus**: The body/collection of text being investigated.
* **Document**: The unit of analysis, what is considered a single observation.

# spaCy

In [1]:
import spacy

In [6]:
nlp = spacy.load("en_core_web_sm")

In [7]:
doc = nlp('Lets try out spacy. We can easily divide our text into sentences! I have run out of ideas.')

In [9]:
for sentence in doc.sents:
    print(sentence)

Lets try out spacy.
We can easily divide our text into sentences!
I have run out of ideas.


In [10]:
doc[0]

Lets

In [11]:
doc[6]

can

In [None]:
# pos_: coarse-grained part-of-speech tag
# tag_: fine-grained (detailed) part-of-speech tag

In [None]:
# Penn Treebank tag reference: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [12]:
doc = nlp('The quick brown fox jumped over the lazy dog. Mr. Peanut wears a top hat.')

In [13]:
tags = set()

for word in doc:
    tags.add(word.tag_)
    print((word.text,  word.pos_, word.tag_))

('The', 'DET', 'DT')
('quick', 'ADJ', 'JJ')
('brown', 'ADJ', 'JJ')
('fox', 'PROPN', 'NNP')
('jumped', 'VERB', 'VBD')
('over', 'ADP', 'IN')
('the', 'DET', 'DT')
('lazy', 'ADJ', 'JJ')
('dog', 'NOUN', 'NN')
('.', 'PUNCT', '.')
('Mr.', 'PROPN', 'NNP')
('Peanut', 'PROPN', 'NNP')
('wears', 'VERB', 'VBZ')
('a', 'DET', 'DT')
('top', 'ADJ', 'JJ')
('hat', 'NOUN', 'NN')
('.', 'PUNCT', '.')


In [14]:
tags

{'.', 'DT', 'IN', 'JJ', 'NN', 'NNP', 'VBD', 'VBZ'}

In [15]:
for tag in tags:
    print((tag, spacy.explain(tag)))

('NN', 'noun, singular or mass')
('.', 'punctuation mark, sentence closer')
('JJ', 'adjective')
('VBD', 'verb, past tense')
('VBZ', 'verb, 3rd person singular present')
('NNP', 'noun, proper singular')
('DT', 'determiner')
('IN', 'conjunction, subordinating or preposition')


In [19]:
import wikipedia

In [23]:
wikipedia.page('Python (programming language)')

In [25]:
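def pages_to_sentences(*pages):
    """Return the sentences of the given Wikipedia pages as a flat list of strings."""
    sentences = []

    for page in pages:
        p = wikipedia.page(page)
        doc = nlp(p.content)  # spaCy splits the page text into sentences
        sentences += [sent.text for sent in doc.sents]
    return sentences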

In [26]:
lang_sents = pages_to_sentences('Python (programming language)')

In [28]:
animal_sents = pages_to_sentences("Reticulated python", "Ball Python")

In [29]:
documents = lang_sents + animal_sents

In [30]:
lang_sents[:5]

['Python is an interpreted, high-level, general-purpose programming language.',
 "Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace.",
 'Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.',
 'Python is dynamically typed and garbage-collected.',
 'It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented, and functional programming.']

In [31]:
animal_sents[:5]

['The reticulated python (Malayopython reticulatus) is a species of snake in the family Pythonidae.',
 'The species is native to South Asia and Southeast Asia.',
 "It is the world's longest snake and listed as least concern on the IUCN Red List because of its wide distribution.",
 'In several range countries, it is hunted for its skin, for use in traditional medicine, and for sale as a pet.',
 'It is an excellent swimmer, has been reported far out at sea and has colonized many small islands within its range.\n']

# Bag of words

In [32]:
from sklearn.feature_extraction.text import CountVectorizer

In [33]:
bag_of_words = CountVectorizer()

In [34]:
bag_of_words.fit(documents)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [35]:
word_counts = bag_of_words.transform(documents)

In [69]:
word_counts

<659x2824 sparse matrix of type '<class 'numpy.int64'>'
	with 9141 stored elements in Compressed Sparse Row format>

The `transform` method returns a sparse matrix, a memory-efficient way of storing a matrix whose entries are mostly zero: only the non-zero values are stored, together with their row and column positions. Sparse matrices have a `toarray()` method that returns the full dense matrix, **but** calling it on a large matrix may exhaust memory. Some key hyperparameters of `CountVectorizer` are listed below:

* `min_df`: ignores words that appear in fewer than this many documents (an integer count or a float proportion of documents).
* `max_df`: ignores words that appear in more than this many documents (an integer count or a float proportion of documents).
* `max_features`: keeps only the most frequent words across the corpus, up to this number of features.
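
To make the sparse storage idea concrete, here is a minimal sketch in plain Python that keeps only the non-zero entries of a small count matrix together with their row and column. The `to_sparse` helper and the toy matrix are illustrative assumptions, not part of scikit-learn or SciPy, which use more compact layouts such as Compressed Sparse Row.

```python
def to_sparse(dense):
    """Return {(row, col): value} for every non-zero entry of a dense matrix."""
    return {
        (i, j): value
        for i, row in enumerate(dense)
        for j, value in enumerate(row)
        if value != 0
    }


# Toy document-term counts: 3 documents, 6 vocabulary words (made-up values).
dense_counts = [
    [0, 2, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 3, 0, 0],
]

sparse_counts = to_sparse(dense_counts)
total_cells = len(dense_counts) * len(dense_counts[0])

print(sparse_counts)
print(f"stored {len(sparse_counts)} of {total_cells} entries")
```

In the `word_counts` matrix above, only 9,141 of the 659 × 2,824 = 1,861,016 entries are stored, roughly 0.5%.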

After fitting a `CountVectorizer` object, the following method and attribute help with determining which index belongs to which word.

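* `get_feature_names()`: Returns a list of the words used as features; a word's position in the list is its column index. In scikit-learn 1.0 and later, use `get_feature_names_out()` instead (`get_feature_names()` was removed in 1.2).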
* `vocabulary_`: A dictionary mapping a word to its corresponding feature index.
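
The relationship between the two can be sketched in plain Python. The tokenizer below reuses the default `token_pattern` shown in the `CountVectorizer` output above and assigns column indices in sorted order; the helper names and the toy documents are assumptions for illustration, not scikit-learn's implementation.

```python
import re

# Same pattern as the default token_pattern: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def build_vocabulary(docs):
    """Map each distinct token to a column index, assigned in sorted order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = [0] * len(vocabulary)
    for tok in tokenize(doc):
        if tok in vocabulary:
            counts[vocabulary[tok]] += 1
    return counts


toy_docs = ["Python is a programming language", "A python is a snake"]
vocabulary = build_vocabulary(toy_docs)                  # word -> column index
feature_names = sorted(vocabulary, key=vocabulary.get)   # column index -> word

print(vocabulary)
print(feature_names)
print([count_vector(doc, vocabulary) for doc in toy_docs])
```

Note that the single-character word "a" never becomes a feature, because the pattern requires at least two word characters.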

Let's use `vocabulary_` to determine how many times "programming" occurs in the documents for Python the programming language and python the animal. Do the results make sense?

In [43]:
animal_counts = bag_of_words.transform(animal_sents)
lang_counts = bag_of_words.transform(lang_sents)

In [52]:
prog_index = bag_of_words.vocabulary_['programming']

In [56]:
animal_counts.sum(axis=0)[0, prog_index]

0

In [57]:
lang_counts.sum(axis=0)[0, prog_index]

29

# The HashingVectorizer transformer

In [58]:
print(hash('hi'))

-5701525620873357580


In [59]:
print(hash('digging'))

-7273384014955944524


In [60]:
print(hash('data'))

1219583126548586098


In [61]:
print(hash('apple'))
print(hash('apples'))

6912009904231611981
8865167685245516106


In [62]:
from sklearn.feature_extraction.text import HashingVectorizer

In [66]:
hash_bag_of_words = HashingVectorizer(norm=None)

In [67]:
hash_bag_of_words.fit(documents)

HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
         decode_error='strict', dtype=<class 'numpy.float64'>,
         encoding='utf-8', input='content', lowercase=True,
         n_features=1048576, ngram_range=(1, 1), non_negative=False,
         norm=None, preprocessor=None, stop_words=None, strip_accents=None,
         token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None)

In [68]:
hash_bag_of_words.transform(documents)

<659x1048576 sparse matrix of type '<class 'numpy.float64'>'
	with 9141 stored elements in Compressed Sparse Row format>
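
The idea behind this transformer can be sketched without scikit-learn: hash each token and take the result modulo the number of columns, so no vocabulary has to be stored. The sketch below uses MD5 from `hashlib` as a stand-in, because Python's built-in `hash` for strings changes between interpreter sessions unless `PYTHONHASHSEED` is set. It is not the hash function scikit-learn uses, and `n_features=16` is an arbitrarily small value chosen for readability.

```python
import hashlib


def hashed_column(token, n_features):
    """Map a token to a column index that is stable across sessions."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features


def hashed_counts(tokens, n_features=16):
    """Count tokens per hashed column; different tokens may share a column."""
    counts = [0] * n_features
    for tok in tokens:
        counts[hashed_column(tok, n_features)] += 1
    return counts


tokens = ["python", "is", "a", "programming", "language", "python"]
row = hashed_counts(tokens)
print(row)
```

The trade-off is that a column index cannot be mapped back to a word, since the hash is one-way and several words can land in the same column.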

# TF-IDF

In [71]:
from sklearn.feature_extraction.text import TfidfTransformer

In [72]:
tfidf = TfidfTransformer()

In [73]:
tfidf_weights = tfidf.fit_transform(word_counts)

In [74]:
print(tfidf_weights)

  (0, 2017)	0.12406941267839564
  (0, 1999)	0.4332472531353061
  (0, 1977)	0.27490568683645294
  (0, 1464)	0.3748599854757929
  (0, 1428)	0.24562367384616593
  (0, 1359)	0.15941849497574362
  (0, 1344)	0.4074104756037003
  (0, 1202)	0.38907899021289033
  (0, 1101)	0.34491072729047456
  (0, 223)	0.24235519670854555
  (1, 2780)	0.1312317717331203
  (1, 2762)	0.263174743750557
  (1, 2688)	0.22280181219440504
  (1, 2673)	0.18508541773543052
  (1, 2311)	0.25133316318455723
  (1, 2174)	0.22280181219440504
  (1, 2100)	0.2096936074270906
  (1, 2054)	0.25133316318455723
  (1, 2017)	0.08014505724364421
  (1, 1883)	0.22280181219440504
  (1, 1768)	0.09418834112336935
  (1, 1743)	0.27986451417470937
  (1, 1376)	0.14820060789454798
  (1, 1269)	0.09285323905135626
  (1, 1156)	0.24214811971573488
  :	:
  (649, 2022)	0.46695799816305983
  (649, 1879)	0.6768975860919657
  (649, 331)	0.45299375591365837
  (649, 278)	0.344320407466149
  (650, 1879)	0.6543113476481107
  (650, 948)	0.7562252708941385
  (651

In [79]:
# Indices of the 19 largest IDF values, in descending order
top_idf_indices = tfidf.idf_.argsort()[:-20:-1]

In [81]:
ind_to_words = bag_of_words.get_feature_names()

In [106]:
len(ind_to_words)

2824

In [82]:
for ind in top_idf_indices:
    print(tfidf.idf_[ind], ind_to_words[ind])

6.799092654460526 zope
6.799092654460526 mindstorms
6.799092654460526 microcontrollers
6.799092654460526 micropython
6.799092654460526 microthreads
6.799092654460526 mid
6.799092654460526 midbody
6.799092654460526 middle
6.799092654460526 mime
6.799092654460526 mimicking
6.799092654460526 mimics
6.799092654460526 mindanao
6.799092654460526 mindoro
6.799092654460526 minimalist
6.799092654460526 metres
6.799092654460526 connecting
6.799092654460526 missouri
6.799092654460526 mistakes
6.799092654460526 mistaking
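
The value repeated above is consistent with the smoothed IDF formula that scikit-learn documents for its default `smooth_idf=True`, idf(t) = ln((1 + n) / (1 + df(t))) + 1, applied to a word that appears in exactly one of the n = 659 documents. The helper below is a sketch of that formula under this assumption, not the library code.

```python
import math


def smooth_idf(n_documents, document_frequency):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_documents) / (1 + document_frequency)) + 1


rare = smooth_idf(659, 1)      # word found in a single document, about 6.7991
common = smooth_idf(659, 659)  # word found in every document, exactly 1.0
print(rare, common)
```

Every word that occurs in only one document receives this same maximal weight, which is why the list above is a tie among rare words rather than a meaningful ranking.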


# Stop words

In [83]:
from spacy.lang.en import STOP_WORDS

In [84]:
print(type(STOP_WORDS))

<class 'set'>
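
Because the stop words are stored in a `set`, membership tests are fast and filtering a token list takes one line. The small set below is a hand-picked stand-in for `STOP_WORDS`, used only so the sketch runs without spaCy; `remove_stop_words` is a hypothetical helper.

```python
# Hand-picked stand-in for spaCy's STOP_WORDS (assumption for illustration).
toy_stop_words = {"a", "is", "the", "of", "in"}


def remove_stop_words(tokens, stop_words):
    """Drop tokens whose lowercase form is a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = ["The", "reticulated", "python", "is", "a", "species", "of", "snake"]
filtered = remove_stop_words(tokens, toy_stop_words)
print(filtered)
```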


In [87]:
STOP_WORDS_python = STOP_WORDS.union({"python"})
STOP_WORDS_python

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

# Stemming and lemmatization

In [88]:
print([word.lemma_ for word in nlp('run runs ran running')])
print([word.lemma_ for word in nlp('buy buys buying bought')])
print([word.lemma_ for word in nlp('see saw seen seeing')])

['run', 'run', 'run', 'run']
['buy', 'buy', 'buy', 'buy']
['see', 'see', 'see', 'see']


In [89]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [102]:
def lemmatizer(text):
    return [word.lemma_ for word in nlp(text)]

In [98]:
stop_words_str = ' '.join(STOP_WORDS)

stop_words_lemma = set(word.lemma_ for word in nlp(stop_words_str))

In [103]:
tidif = TfidfVectorizer(max_features=100, stop_words=stop_words_lemma.union({'python'}), tokenizer=lemmatizer)

In [104]:
tidif.fit(documents)


TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=100, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words={'no', 'elsewhere', 'two', 'i', 'without', 'therefore', 'something', 'upon', 'might', 'side', 'thru', 'could', 'hereupon', 'above', 'fifteen', 'several', 'all', 'or', 'out', 'nor', 'get', 'rather', 'off', 'only', 'anything', 'below', 'put', 'else', 'of', 're', 'about', 'own', 'must', 'thr...'also', 'should', 'thence', 'use', 'any', 'well', 'either', 'along', 'in', 'six', 'front', 'anyway'},
        strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=<function lemmatizer at 0x00000258E435C400>,
        use_idf=True, vocabulary=None)

In [107]:
len(tidif.get_feature_names())

100

# Tokenization and n-grams

In [108]:
count_bigrams = CountVectorizer(max_features=100, stop_words=stop_words_lemma.union({'python'}), ngram_range=(2, 2))

In [109]:
count_bigrams.fit(documents)


CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=100, min_df=1,
        ngram_range=(2, 2), preprocessor=None,
        stop_words={'no', 'elsewhere', 'two', 'i', 'without', 'therefore', 'something', 'upon', 'might', 'side', 'thru', 'could', 'hereupon', 'above', 'fifteen', 'several', 'all', 'or', 'out', 'nor', 'get', 'rather', 'off', 'only', 'anything', 'below', 'put', 'else', 'of', 're', 'about', 'own', 'must', 'thr...'also', 'should', 'thence', 'use', 'any', 'well', 'either', 'along', 'in', 'six', 'front', 'anyway'},
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [110]:
count_bigrams.get_feature_names()

['12 september',
 '23 ft',
 '25 ft',
 'arbitrary precision',
 'archived original',
 'are available',
 'are common',
 'are supported',
 'are used',
 'are written',
 'artificial intelligence',
 'assignment statement',
 'auliya et',
 'ball care',
 'ball is',
 'ball pythons',
 'been killed',
 'blah eggs',
 'block code',
 'boy was',
 'classes are',
 'code block',
 'code is',
 'colour pattern',
 'design philosophy',
 'double quote',
 'enclosure is',
 'et al',
 'executes block',
 'expressions are',
 'external links',
 'floating point',
 'floor division',
 'ft length',
 'ft long',
 'functional programming',
 'guido van',
 'had been',
 'has been',
 'his friends',
 'indentation similar',
 'indonesia was',
 'integer division',
 'is better',
 'is considered',
 'is incremented',
 'is recommended',
 'is true',
 'is used',
 'is written',
 'isbn 978',
 'it does',
 'it has',
 'it is',
 'it was',
 'language is',
 'languages java',
 'large standard',
 'list comprehensions',
 'longest snake',
 'new featur
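
What `ngram_range=(2, 2)` asks for can be sketched in a few lines: slide a window of two tokens over each document and join each pair with a space. The `ngrams` helper is a hypothetical illustration, not scikit-learn's implementation, and it skips stop-word removal.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "python supports functional programming".split()
bigrams = ngrams(tokens, 2)
print(bigrams)

# ngram_range=(1, 2) corresponds to using single words and bigrams together.
words_and_bigrams = ngrams(tokens, 1) + bigrams
print(words_and_bigrams)
```

Forms such as "is" and "are" survive in the bigram list above, presumably because `stop_words_lemma` holds lemmas while this vectorizer does not lemmatize its tokens.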

# A simple NLP model

The model below combines the techniques covered so far:

* TF-IDF weighting
* stop words
* words and bigrams
* lemmatization

In [112]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

In [114]:
documents = animal_sents + lang_sents

In [115]:
labels = ['animals'] * len(animal_sents) + ['language'] * len(lang_sents)

In [116]:
labels

['animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'animals',
 'an

In [136]:
stop_words_str = ' '.join(STOP_WORDS)

stop_words_lemma = set(word.lemma_ for word in nlp(stop_words_str))

In [137]:
tidif = TfidfVectorizer(stop_words=stop_words_lemma, tokenizer=lemmatizer, ngram_range=(1, 2))

In [138]:
X = tidif.fit_transform(documents)

In [139]:
model = MultinomialNB()

In [140]:
model.fit(X, labels)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [141]:
# Accuracy on the training data itself, not on held-out documents
model.score(X, labels)

0.9711684370257967

In [142]:
test_docs = ["My Python program is only 100 bytes long.",
             "A python's bite is not venomous but still hurts.",
             "I can't find the error in the python code.",
             "Where is my pet python; I can't find her!",
             "I use for and while loops when writing Python.",
             "The python will loop and wrap itself onto me.",
             "I use snake case for naming my variables.",
             "My python has grown to over 10 ft long!",
             "I use virtual environments to manage package versions.",
             "Pythons are the largest snakes in the environment."]

In [131]:
# Display names, in the same order as model.classes_ ('animals', 'language')
class_label = ['animal', 'language']

In [143]:
transformed_test = tidif.transform(test_docs)

In [145]:
y_prob = model.predict_proba(transformed_test)

In [149]:
# Column 1 holds the probability of 'language'; predict index 1 when it exceeds 0.5
predicted_indices = (y_prob[:, 1] > 0.5).astype(int)

In [150]:
for i, index in enumerate(predicted_indices):
    print(test_docs[i], " = ", class_label[index], ' at ', 100*y_prob[i, index])

My Python program is only 100 bytes long.  =  language  at  68.75212565333943
A python's bite is not venomous but still hurts.  =  animal  at  53.445442328258366
I can't find the error in the python code.  =  language  at  75.21037288878709
Where is my pet python; I can't find her!  =  animal  at  51.870029061448506
I use for and while loops when writing Python.  =  language  at  82.5001339993406
The python will loop and wrap itself onto me.  =  language  at  67.06073182840161
I use snake case for naming my variables.  =  language  at  60.675321671352364
My python has grown to over 10 ft long!  =  animal  at  58.22505020470826
I use virtual environments to manage package versions.  =  language  at  78.41673587332183
Pythons are the largest snakes in the environment.  =  animal  at  68.39335550179028
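
For intuition about what `MultinomialNB` does with `alpha=1.0`, here is a minimal sketch of multinomial naive Bayes with add-one smoothing on raw word counts. The training sentences, helper names, and whitespace tokenization are assumptions made for illustration; the model above works on lemmatized TF-IDF features instead.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels, alpha=1.0):
    """Collect per-class word counts and per-class document counts."""
    word_counts = defaultdict(Counter)  # class -> word -> count
    class_counts = Counter(labels)      # class -> number of documents
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab, alpha


def predict_proba(doc, model):
    """Return {class: probability} for one document."""
    word_counts, class_counts, vocab, alpha = model
    n_docs = sum(class_counts.values())
    log_scores = {}
    for label in sorted(class_counts):
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / n_docs)  # log prior
        for word in doc.lower().split():
            if word in vocab:  # words never seen in training are ignored
                smoothed = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
                score += math.log(smoothed)
        log_scores[label] = score
    # Turn log scores into probabilities that sum to one.
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {label: value / norm for label, value in exp_scores.items()}


# Made-up training sentences (assumption for illustration).
toy_docs = [
    "the python swallowed its prey",
    "pythons are large snakes",
    "python code uses indentation",
    "import the module in python code",
]
toy_labels = ["animal", "animal", "language", "language"]

model = train(toy_docs, toy_labels)
probs = predict_proba("my python code has a bug", model)
print(probs)
```

Here "python" appears in both classes, so the word "code", seen only in the language sentences, is what tips the prediction.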
