## 0. Text processing (without NLTK)

In [1]:
my_string = 'The quick brown fox jumps over the lazy dog!!!.'

### Lowercase, Punctuation

In [2]:
cleaned = ''.join([char for char in my_string.lower() if char not in '!@.#$%^&*()'])
cleaned

'the quick brown fox jumps over the lazy dog'
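The hard-coded character list above misses punctuation such as commas or quotes. A minimal sketch of the same cleaning step, assuming the standard library's `string.punctuation` is an acceptable definition of "punctuation" (the names `remove_punct` and `cleaned_all` are only illustrative):

```python
import string

my_string = 'The quick brown fox jumps over the lazy dog!!!.'

# Build a translation table that deletes every ASCII punctuation character
remove_punct = str.maketrans('', '', string.punctuation)

# Lowercase first, then drop the punctuation in a single pass
cleaned_all = my_string.lower().translate(remove_punct)
print(cleaned_all)
```

`str.translate` avoids writing out the punctuation characters by hand, at the cost of also removing characters such as hyphens and apostrophes that you may want to keep.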

### Tokenization

- characters
- words
- sentences


In [3]:
tokenized = cleaned.split()
tokenized 

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
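The cell above covers word tokens only. As a rough sketch of the other two levels listed, characters and sentences can also be produced with plain Python. The sentence rule here (split on whitespace that follows `.`, `!` or `?`) is an assumption for illustration and will break on abbreviations such as "Inc.":

```python
import re

text = "God is Great! I won a lottery."

# Character tokens: every character becomes its own token
chars = list("fox")

# Sentence tokens: split on whitespace that comes right after '.', '!' or '?'
sentences = re.split(r'(?<=[.!?])\s+', text)

print(chars)
print(sentences)
```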

### Stopwords

In [4]:
# remove stopwords in tokenized

stopwords = ['the', 'and', 'this', 'but']

no_stopwords = [word for word in tokenized if word not in stopwords]

no_stopwords

['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']

### Stemming/Lemmatization

#### Stemming
- jumped -> jump
- jumps -> jump
- jumping -> jump

#### Lemmatization
- running -> run
- ran -> run
- flies -> fly
- flew -> fly
- flying -> fly
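Irregular forms such as "ran" or "flew" cannot be recovered by trimming suffixes, so a lemmatizer needs vocabulary knowledge. The sketch below mimics that with a tiny hand-written lookup table; `LEMMAS` and `toy_lemmatize` are hypothetical names, and the table holds only the examples listed above:

```python
# Hand-written lookup table: inflected form -> lemma (illustrative only)
LEMMAS = {
    'running': 'run',
    'ran': 'run',
    'flies': 'fly',
    'flew': 'fly',
    'flying': 'fly',
}

def toy_lemmatize(word):
    """Return the lemma if the word is in the table, else the lowercased word."""
    word = word.lower()
    return LEMMAS.get(word, word)

print([toy_lemmatize(w) for w in ['ran', 'Flew', 'flying', 'dog']])
```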


<div class="alert alert-block alert-info">
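<b>rstrip()</b> removes trailing characters from the end of a string: with no argument it strips whitespace, and with an argument such as <code>'s'</code> it strips every trailing character found in that argument.</div>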

In [5]:
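# Naive "stemming": strip the trailing 's' from plural words.
# Note: rstrip('s') removes *every* trailing 's', so 'glass' would become 'gla'.
stemmed = [word.rstrip('s') for word in no_stopwords]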

stemmed

['quick', 'brown', 'fox', 'jump', 'over', 'lazy', 'dog']
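Because `rstrip('s')` strips every trailing "s", it damages words that merely end in "ss". A slightly more careful rule, still only a sketch, removes a single final "s" and leaves "ss" endings alone. The function name `naive_stem` and the minimum-length check are assumptions made for this example:

```python
def naive_stem(word):
    """Remove one plural 's', but leave short words and 'ss' endings alone."""
    if word.endswith('ss') or len(word) <= 3:
        return word
    if word.endswith('s'):
        return word[:-1]
    return word

# rstrip removes all trailing 's' characters
print('glass'.rstrip('s'))

# the rule above only removes a single plural 's'
print([naive_stem(w) for w in ['jumps', 'glass', 'bus', 'dog']])
```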

## N-grams

#### 2-gram
'i do not like data science' => (i, do), (do, not), (not, like), ... , (data, science)

#### 3-gram
=> (i, do, not), (do, not, like), ... , (like, data, science)

In [6]:
sentence = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

In [7]:
# TODO: 2-gram our list of words
[(sentence[i] , sentence[i+1]) for i in range(len(sentence)-1)]

[('the', 'quick'),
 ('quick', 'brown'),
 ('brown', 'fox'),
 ('fox', 'jumps'),
 ('jumps', 'over'),
 ('over', 'the'),
 ('the', 'lazy'),
 ('lazy', 'dog')]

In [8]:
# Also try it without an explicit index loop
    # output => [('the', 'quick'), ('quick', 'brown'), ...]

list(zip(sentence[:-1], sentence[1:]))

[('the', 'quick'),
 ('quick', 'brown'),
 ('brown', 'fox'),
 ('fox', 'jumps'),
 ('jumps', 'over'),
 ('over', 'the'),
 ('the', 'lazy'),
 ('lazy', 'dog')]
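The `zip` trick generalizes from pairs to any window size. A minimal sketch, where `ngrams` is a helper name chosen for this example:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a list of tokens."""
    # zip stops at the shortest slice, so incomplete trailing windows are dropped
    return list(zip(*(tokens[i:] for i in range(n))))

sentence = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
print(trigrams[0], trigrams[-1], len(trigrams))
```

With `n=2` this reproduces the list above; with `n=3` it yields the 3-grams.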

## NLTK - Natural Language Toolkit is a leading platform for building Python programs to work with human language data.

## 1. Tokenize Words and Sentences with NLTK

The Natural Language Toolkit has an important module, `nltk.tokenize`, which offers tokenization at two levels:

1. word tokenize
2. sentence tokenize

### 1.1 Word tokenize
We use the `word_tokenize()` function to split a sentence into words; punctuation marks become separate tokens.

In [9]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/gehadbarakat/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [10]:
from nltk.tokenize import word_tokenize
text = "God is Great! I won a lottery."
print(word_tokenize(text))

['God', 'is', 'Great', '!', 'I', 'won', 'a', 'lottery', '.']
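For comparison, `str.split()` would leave punctuation attached to words ("Great!"). The regular expression below is only a rough approximation of the behaviour shown above, not what NLTK does internally; it happens to give the same tokens for this sentence:

```python
import re

text = "God is Great! I won a lottery."

# Whitespace splitting keeps punctuation glued to the neighbouring word
print(text.split())

# Runs of word characters, or any single non-space, non-word character
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens)
```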


### 1.2 Sentence tokenize

In [11]:
from nltk.tokenize import sent_tokenize
print(sent_tokenize(text))

['God is Great!', 'I won a lottery.']


In [12]:
import requests
from bs4 import BeautifulSoup
import re
import nltk

page = requests.get("http://en.wikipedia.org/wiki/Apple_Inc.")
soup = BeautifulSoup(page.content, 'html.parser')
# Collect every paragraph inside the main article body
findings = soup.find(attrs={'id': 'mw-content-text'}).find_all('p')

for p in findings:
    # Drop numeric citation markers such as [6]; the raw string avoids invalid-escape warnings
    text = re.sub(r'\[\d+\]', '', p.text).strip()
    if text != '':
        print(nltk.tokenize.sent_tokenize(text))

['Apple Inc. is an American multinational technology company headquartered in Cupertino, California.', "Apple is the world's largest technology company by revenue, with US$394.3 billion in 2022 revenue.", "As of March\xa02023[update], Apple is the world's biggest company by market capitalization.", 'As of June 2022, Apple is the fourth-largest personal computer vendor by unit sales and the second-largest mobile phone manufacturer in the world.', 'It is considered one of the Big Five American information technology companies, alongside Alphabet (parent company of Google), Amazon, Meta Platforms, and Microsoft.']
["Apple was founded as Apple Computer Company on April 1, 1976, by  Steve Wozniak, Steve Jobs (1955–2011) and Ronald Wayne to develop and sell Wozniak's Apple I personal computer.", 'It was incorporated by Jobs and Wozniak as Apple Computer, Inc. in 1977.', "The company's second computer, the Apple II, became a best seller and one of the first mass-produced microcomputers.", 'Ap
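The pattern `\[\d+\]` removes numeric citation markers only, which is why `[update]` still appears in the output above. A small sketch of a slightly wider pattern, tested on a hard-coded sample rather than the live page (`strip_markers` is a hypothetical helper name, and handling `[update]` is an assumption about what should be removed):

```python
import re

def strip_markers(text):
    """Remove bracketed citation numbers like [6] and the literal [update] tag."""
    return re.sub(r'\[(?:\d+|update)\]', '', text)

sample = "in 2022 revenue.[6] As of March 2023[update], Apple is"
cleaned_sample = strip_markers(sample)
print(cleaned_sample)
```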

In [13]:
findings

[<p class="mw-empty-elt">
 </p>,
 <p><b>Apple Inc.</b> is an American <a href="/wiki/Multinational_corporation" title="Multinational corporation">multinational</a> <a href="/wiki/Technology_company" title="Technology company">technology company</a> headquartered in <a href="/wiki/Cupertino,_California" title="Cupertino, California">Cupertino, California</a>. Apple is the world's <a href="/wiki/List_of_largest_technology_companies_by_revenue" title="List of largest technology companies by revenue">largest technology company by revenue</a>, with <span style="white-space: nowrap"><a href="/wiki/United_States_dollar" title="United States dollar">US$</a>394.3 billion</span> in 2022 revenue.<sup class="reference" id="cite_ref-6"><a href="#cite_note-6">[6]</a></sup> As of March 2023<sup class="plainlinks noexcerpt noprint asof-tag update" style="display:none;"><a class="external text" href="https://en.wikipedia.org/w/index.php?title=Apple_Inc.&amp;action=edit">[update]</a></sup>, Apple is the

## 2. Removing stop words with NLTK

In [14]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/gehadbarakat/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example ='''Apple Inc. is an American multinational technology 
company headquartered in Cupertino, California.', 
"Apple is the world's largest technology company by revenue, 
with US$394.3 billion in 2022 revenue.", 
"As of March\xa02023[update], Apple is the world's biggest company by
market capitalization.", 'As of June 2022, Apple is the fourth-largest
personal computer vendor by unit sales and the second-largest mobile 
phone manufacturer in the world.', 'It is considered one of 
the Big Five American information technology companies, alongside
Alphabet (parent company of Google), 
Amazon, Meta Platforms, and Microsoft.'''
stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example)

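# Keep only the tokens that are not stop words.
# The comparison is case-sensitive, so capitalized tokens such as 'As' are kept.
filtered_sentence = [w for w in word_tokens if w not in stop_words]

# Equivalent explicit loop
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
#print(word_tokens)
print(filtered_sentence)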

['Apple', 'Inc.', 'American', 'multinational', 'technology', 'company', 'headquartered', 'Cupertino', ',', 'California', '.', "'", ',', "''", 'Apple', 'world', "'s", 'largest', 'technology', 'company', 'revenue', ',', 'US', '$', '394.3', 'billion', '2022', 'revenue', '.', '``', ',', "''", 'As', 'March', '2023', '[', 'update', ']', ',', 'Apple', 'world', "'s", 'biggest', 'company', 'market', 'capitalization', '.', '``', ',', "'As", 'June', '2022', ',', 'Apple', 'fourth-largest', 'personal', 'computer', 'vendor', 'unit', 'sales', 'second-largest', 'mobile', 'phone', 'manufacturer', 'world', '.', "'", ',', "'It", 'considered', 'one', 'Big', 'Five', 'American', 'information', 'technology', 'companies', ',', 'alongside', 'Alphabet', '(', 'parent', 'company', 'Google', ')', ',', 'Amazon', ',', 'Meta', 'Platforms', ',', 'Microsoft', '.']
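In the output above, "As" survives because the stop word list is lowercase, and punctuation tokens survive because they are not stop words. A minimal sketch of a stricter filter follows; the small `stop_words` set is a hand-picked subset so the example runs without NLTK data, and `keep` is an illustrative helper name:

```python
# Hand-picked subset of stop words, for illustration only
stop_words = {'as', 'of', 'is', 'the', 'by'}

tokens = ['As', 'of', 'June', '2022', ',', 'Apple', 'is', 'the', 'fourth-largest', '.']

def keep(token):
    """Keep tokens that contain a letter or digit and are not stop words."""
    has_alnum = any(ch.isalnum() for ch in token)
    return has_alnum and token.lower() not in stop_words

filtered = [t for t in tokens if keep(t)]
print(filtered)
```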


## 3. Stemming and Lemmatization

In our original hand-built vocabulary, we had to include both "computer" and "computers".  It would have been useful to identify them as one word.

This is not limited to just trailing "s" characters: e.g., the words "carry", "carries", "carrying", and "carried" all carry -- roughly -- the same meaning.  The process of replacing them by a common root, or **stem**, is called stemming -- the stem will not, in general, be a full word itself.

There's a related process called **lemmatization**, where the analog of the "stem" _is_ an actual word. We can choose to stem our words before counting them.

`nltk.stem` is a package that performs stemming using different classes. Popular ones are `SnowballStemmer` and `PorterStemmer`:
- `PorterStemmer`: See http://www.tartarus.org/~martin/PorterStemmer/ for the homepage of the algorithm.
- `SnowballStemmer`: The algorithm for English is documented here: Porter, M. "An algorithm for suffix stripping." Program 14.3 (1980): 130-137.


In [20]:
# Snowball stemmer (ignore_stopwords=True returns English stop words unstemmed)
stemmer = nltk.stem.SnowballStemmer("english", ignore_stopwords=True)
print("stemming Carry:", [stemmer.stem(s) for s in ["Carry", "Carries", "Carrying", "Carried"]])
print("stemming eat:", [stemmer.stem(s) for s in ["eat", "eating", "eaten", "ate"]])
print("stemming Computer:", [stemmer.stem(s) for s in ["computer", "Compute", "Computers"]])

stemming Carry: ['carri', 'carri', 'carri', 'carri']
stemming eat: ['eat', 'eat', 'eaten', 'ate']
stemming Computer: ['comput', 'comput', 'comput']


In [22]:
#PorterStemmer
stemmer = nltk.stem.PorterStemmer()
print("stemming carry", [stemmer.stem(s) for s in ["carry", "carries", "carrying", "carried"]])
print("stemming eat", [stemmer.stem(s) for s in ["eat", "eating", "eaten", "ate"]])
print("stemming computer", [stemmer.stem(s) for s in ["computer", "computers"]])

stemming carry ['carri', 'carri', 'carri', 'carri']
stemming eat ['eat', 'eat', 'eaten', 'ate']
stemming computer ['comput', 'comput']
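To connect stemming back to counting, the sketch below merges word counts by stem. It takes the stemming function as a parameter, so a real stemmer's `stem` method could be passed in; `toy_stem` is a deliberately crude stand-in used here so the example runs without NLTK, and both function names are hypothetical:

```python
from collections import Counter

def toy_stem(word):
    """Crude stand-in for a real stemmer: lowercase and drop one plural 's'."""
    word = word.lower()
    if word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word

def count_stems(tokens, stem):
    """Count tokens after mapping each one through the given stem function."""
    return Counter(stem(token) for token in tokens)

counts = count_stems(['computer', 'computers', 'Computer', 'fox'], toy_stem)
print(counts)
```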


**Lemmatization** is the process of converting a word to its base form, or lemma. Unlike stemming, which strips suffixes by rule and often produces non-words such as "carri", lemmatization considers the context (for example, the part of speech) and returns a meaningful base form.

In [23]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/gehadbarakat/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [29]:
from nltk.stem import WordNetLemmatizer

# Init the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()
# lemmatize() treats every word as a noun unless a `pos` argument is given,
# which is why verb forms such as 'carrying' come back unchanged below.
print("lemmatizing carry", [lemmatizer.lemmatize(s) for s in ["carry", "carries", "carrying", "carried"]])
print("lemmatizing eat", [lemmatizer.lemmatize(s) for s in ["eat", "eating", "eaten", "ate"]])
print("lemmatizing computer", [lemmatizer.lemmatize(s) for s in ["computer", "computers"]])

lemmatizing carry ['carry', 'carry', 'carrying', 'carried']
lemmatizing eat ['eat', 'eating', 'eaten', 'ate']
lemmatizing computer ['computer', 'computer']



<h2> 4. Tagging <span class="badge badge-secondary">Important</span>
</h2>

Consider the "Ford" vs "ford" example.  As a human being, the easiest way to tell these apart is that Ford is a __noun__ while ford is a __verb__.

Fortunately, NLTK also has a part-of-speech tagger: You give it a sentence, and it tries to tag the parts of speech (e.g., noun, verb, adjective, etc.).  The command is `nltk.pos_tag` and for documentation on the tags either search around online, or use `nltk.help.upenn_tagset`:

In [30]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/gehadbarakat/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [83]:
s1 = "I tried to ford the river, and my unfortunate oxen died"
s2 = "Henry Ford built factories to facilitate the construction of the Ford automobile."
output = nltk.pos_tag(nltk.tokenize.word_tokenize(s1))
print(output)

[('I', 'PRP'), ('tried', 'VBD'), ('to', 'TO'), ('ford', 'VB'), ('the', 'DT'), ('river', 'NN'), (',', ','), ('and', 'CC'), ('my', 'PRP$'), ('unfortunate', 'JJ'), ('oxen', 'NN'), ('died', 'VBD')]
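The tagger returns a list of `(token, tag)` pairs, so ordinary list operations are enough to work with it. The sketch below copies the pairs printed above into a literal (so it runs without the tagger) and picks out the verbs, relying on the fact that all Penn Treebank verb tags start with `VB`:

```python
from collections import Counter

# Tagged tokens copied from the pos_tag output above
tagged = [('I', 'PRP'), ('tried', 'VBD'), ('to', 'TO'), ('ford', 'VB'),
          ('the', 'DT'), ('river', 'NN'), (',', ','), ('and', 'CC'),
          ('my', 'PRP$'), ('unfortunate', 'JJ'), ('oxen', 'NN'), ('died', 'VBD')]

# Verb tags are VB, VBD, VBG, VBN, VBP and VBZ
verbs = [word for word, tag in tagged if tag.startswith('VB')]

# How often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in tagged)

print(verbs)
print(tag_counts.most_common(3))
```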


In [34]:
import nltk
nltk.download('tagsets')

[nltk_data] Downloading package tagsets to
[nltk_data]     /Users/gehadbarakat/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

In [35]:
nltk.help.upenn_tagset('PRP')

PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us


<div class="alert alert-danger">
<b>Review my code vs. theirs</b> </div>

In [78]:
def feature_verbs(sentence):
    ### My attempt
    ### input: sentence
    ### output: intended to be boolean - True/False, but this version only prints
    output = nltk.pos_tag(nltk.tokenize.word_tokenize(sentence))
    for s in output:
        # Only matches the VBP tag, so VB, VBD, VBG, VBN and VBZ verbs are missed
        if s[1] == "VBP":
            print(s[0], "is a Verb")

In [81]:
def feature_verbs(sentence):
    ### input: sentence
    ### output: boolean - True/False
    words = nltk.tokenize.word_tokenize(sentence)
    pos_tag = nltk.pos_tag(words)
    print(pos_tag)
    # True if any tag is a verb tag (VB, VBD, VBG, VBN, VBP, VBZ)
    return any('VB' in tag for _, tag in pos_tag)

In [82]:
feature_verbs("I drive a Ford.")

[('I', 'PRP'), ('drive', 'VBP'), ('a', 'DT'), ('Ford', 'NNP'), ('.', '.')]


True

------------------------------------------------------------------------
## 5. Get Data (for our vectorizer later)

In [90]:
from glob import glob

book_corpus = []

for path in glob('books/*.txt'):
    with open(path,encoding = 'utf8') as f:
        book = f.read()
        book_corpus.append(book)

In [91]:
[path for path in glob('books/*.txt')]

['books/democracy.txt',
 'books/Sports.txt',
 'books/technology.txt',
 'books/Arts.txt']

In [92]:
len(book_corpus)

4

In [100]:
import pandas as pd
# Count how often each whitespace-separated token occurs in the first book
pd.Series(book_corpus[0].split()).value_counts()

the           32
of            28
and           20
to            14
democracy      9
              ..
safeguards     1
abuse          1
uphold         1
dignity        1
world.}        1
Length: 250, dtype: int64
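Splitting on whitespace alone leaves tokens such as `world.}` in the counts and treats "The" and "the" as different words. A minimal sketch of a cleaner count using only the standard library follows; the sample string is made up for illustration, and `word_counts` is a hypothetical helper name:

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text and count runs of letters, digits and apostrophes."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

# Made-up sample text, not taken from the book files
sample = "Democracy upholds dignity. The world.} the WORLD"
counts = word_counts(sample)
print(counts.most_common(2))
```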

In [125]:
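import requests
from bs4 import BeautifulSoup
import re
import nltk

var = "democracy"
page = requests.get("http://en.wikipedia.org/wiki/{}".format(var))
soup = BeautifulSoup(page.content, 'html.parser')
findings = soup.find(attrs={'id':'mw-content-text'}).find_all('p')

# Append to books/<var>.txt; the file is opened once and closed automatically
with open("books/{}.txt".format(var), "a", encoding="utf8") as f:
    for p in findings:
        # Drop numeric citation markers such as [1]
        text = re.sub(r'\[\d+\]', '', p.text).strip()
        if text != '':
            output = nltk.tokenize.sent_tokenize(text)
            print(output)
            # Write only when this paragraph produced sentences, so empty
            # paragraphs no longer re-write the previous paragraph's output
            f.write(str(output))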
        

['Democracy (from Ancient Greek: δημοκρατία, romanized:\xa0dēmokratía, dēmos \'people\' and kratos \'rule\') is a form of government in which the people have the authority to deliberate and decide legislation ("direct democracy") or to choose governing officials to do so ("representative democracy").', 'Who is considered part of "the people" and how authority is shared among or delegated by the people has changed over time and at different rates in different countries.', 'Features of democracy often times include freedom of assembly, association, property rights, freedom of religion and speech, citizenship, consent of the governed, voting rights, freedom from unwarranted governmental deprivation of the right to life and liberty, and minority rights.']
['The notion of democracy has evolved over time considerably.', 'Throughout history, one can find evidence of direct democracy, in which communities make decisions through popular assembly.', 'Today, the dominant form of democracy is repr

In [111]:
f = open("books/democracy.txt", "r+")


In [113]:
findings

[<p class="mw-empty-elt">
 </p>,
 <p class="mw-empty-elt">
 </p>,
 <p><b>Democracy</b> (from <a class="mw-redirect" href="/wiki/Ancient_Greek_language" title="Ancient Greek language">Ancient Greek</a>: <span lang="grc">δημοκρατία</span>, <small><a class="mw-redirect" href="/wiki/Romanization_of_Ancient_Greek" title="Romanization of Ancient Greek">romanized</a>: </small><span title="Ancient Greek-language romanization"><i lang="grc-Latn">dēmokratía</i></span>, <i>dēmos</i> 'people' and <i>kratos</i> 'rule')<sup class="reference" id="cite_ref-Oxford_1-0"><a href="#cite_note-Oxford-1">[1]</a></sup> is a form of <a href="/wiki/Government" title="Government">government</a> in which <a href="/wiki/People" title="People">the people</a> have the <a href="/wiki/Authority" title="Authority">authority</a> to <a class="mw-redirect" href="/wiki/Deliberate" title="Deliberate">deliberate</a> and decide legislation ("<a href="/wiki/Direct_democracy" title="Direct democracy">direct democracy</a>") or t