# Natural Language Processing using NLTK



In [1]:
# Install NLTK - pip install nltk
import nltk
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to /projects/0ddeade5-3577-4fe
[nltk_data]     8-8cd6-8a0cb653428e/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /projects/0ddeade5-3577-4fe8-
[nltk_data]     8cd6-8a0cb653428e/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## NLP Part 0 - Get some Data!

This section's code is mostly given to you as a review for how you can scrape and manipulate data from the web. 



In [2]:
import urllib
import bs4 as bs
import re

In [3]:
# We will read the contents of the Wikipedia article "Global_warming" as an example, please feel free to use your own! You can use the url below:
url = 'https://en.wikipedia.org/wiki/Global_warming' # you can change this to use other sites as well.

# We can open the page using "urllib.request.urlopen" then read it using ".read()"
source = urllib.request.urlopen(url).read()

# Beautiful Soup is a Python library for pulling data out of HTML and XML files.
# you may need to install a parser library --> "!pip3 install lxml"
# Parsing the data/creating BeautifulSoup object

soup = bs.BeautifulSoup(source,"html.parser") 

# Fetching the data
text = ""
for paragraph in soup.find_all('p'): #The <p> tag defines a paragraph in the webpages
    text += paragraph.text

# Preprocessing the data

text = re.sub(r'\[[0-9]*\]',' ',text) # [0-9]* --> Matches zero or more repetitions of any digit from 0 to 9
text = text.lower() #everything to lowercase
text = re.sub(r'\W^.?!',' ',text) # \W --> Matches any character which is not a word character except (.?!)
text = re.sub(r'\d',' ',text) # \d --> Matches any decimal digit
text = re.sub(r'\s+',' ',text) # \s --> Matches any characters that are considered whitespace (Ex: [\t\n\r\f\v].)

In [4]:
text[:400]

" contemporary climate change includes both global warming and its impacts on earth's weather patterns. there have been previous periods of climate change, but the current changes are distinctly more rapid and not due to natural causes. instead, they are caused by the emission of greenhouse gases, mostly carbon dioxide (co ) and methane. burning fossil fuels for energy use creates most of these emi"

## NLP Part 1 - Tokenization of paragraphs/sentences

In this section we are going to tokenize our sentences and words. If you aren't familiar with tokenization, we recommend looking up "what is tokenization". 

You should also spend time on the [NLTK documentation](https://www.nltk.org/). If you're not sure how to do something, or get an error, it is best to google it first and ask questions as you go!



In [5]:
#'''
#Your code here: Tokenize the words from the data and set it to a variable called words.
#Hint: how to this might be on the very home page of NLTK!
#'''
import nltk
sentence = "contemporary climate change includes both global warming and its impacts on earth's weather pattern"
words = nltk.word_tokenize(sentence)
words



['contemporary',
 'climate',
 'change',
 'includes',
 'both',
 'global',
 'warming',
 'and',
 'its',
 'impacts',
 'on',
 'earth',
 "'s",
 'weather',
 'pattern']

In [6]:
print(words[:10])

['contemporary', 'climate', 'change', 'includes', 'both', 'global', 'warming', 'and', 'its', 'impacts']


In [35]:
#'''
#Your code here: Tokenize the sentences from the data  and set it to a variable called sentences.
#Hint: try googling how to tokenize sentences in NLTK!
#'''

import nltk
text = " contemporary climate change includes both global warming and its impacts on earth's weather patterns. there have been previous periods of climate change, but the current changes are distinctly more rapid and not due to natural causes. instead, they are caused by the emission of greenhouse gases, mostly carbon dioxide (co ) and methane. burning fossil fuels for energy use creates most of these emi"
sentences = text.split('.')
sentences

[" contemporary climate change includes both global warming and its impacts on earth's weather patterns",
 ' there have been previous periods of climate change, but the current changes are distinctly more rapid and not due to natural causes',
 ' instead, they are caused by the emission of greenhouse gases, mostly carbon dioxide (co ) and methane',
 ' burning fossil fuels for energy use creates most of these emi']

In [8]:
print(sentences[:10])

[" contemporary climate change includes both global warming and its impacts on earth's weather patterns", ' there have been previous periods of climate change, but the current changes are distinctly more rapid and not due to natural causes', ' instead, they are caused by the emission of greenhouse gases, mostly carbon dioxide (co ) and methane', ' burning fossil fuels for energy use creates most of these emi']


## NLP Part 2 - Stopwords and Punctuation
Now we are going to work to remove stopwords and punctuation from our data. Why do you think we are going to do this? Do some research if you don't know yet. 

In [9]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /projects/0ddeade5-3577-4
[nltk_data]     fe8-8cd6-8a0cb653428e/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:

def rem_stopwords(sentence):
    return ' '.join([word for word in nltk.word_tokenize(sentence) if word not in stopwords.words('english')])

sentences = [rem_stopwords(sentence) for sentence in sentences]
print(sentences[:100]) #Check if it worked correctly. Are all stopwords removed?


["contemporary climate change includes global warming impacts earth 's weather patterns", 'previous periods climate change , current changes distinctly rapid due natural causes', 'instead , caused emission greenhouse gases , mostly carbon dioxide ( co ) methane', 'burning fossil fuels energy use creates emi']


In [33]:
#'''

#define a function called "remove_punctuation" that removes punctuation from the sentences.
#'''
import re
import string
def convert(string):
    list1=[]
    list1[:0]=string
    return list1
punctlist = convert(string.punctuation)

def remove_punctuation(sentences):
    sentences_list = []
#    ### Some code goes here. Hint: Try looking up how to remove stopwords in NLTK if you get stuck. ###
    for i in range(4):
        sentences_without_punctuation = re.sub(r'[^\w\s]','', sentences[i])
        sentences_list.append(sentences_without_punctuation)
    return sentences_list
'''
def rem_punct(sentence):
    return " ".join([word for word in nltk.word_tokenize(sentence) if word not in punctlist])
'''
sentences = remove_punctuation(sentences)
print(sentences[:100]) #eliminating all punctuation.

#i think it's because sentences is a list and not a string
#yeah you're right

['contemporary', 'climate', 'change', 'includes']


## NLP Part 3a - Stemming the words

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. There is an example below!



In [12]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# try each of the words below
x=stemmer.stem('troubled')
#stemmer.stem('trouble')
#stemmer.stem('troubling')
#stemmer.stem('troubles')
print(x)

troubl


In [13]:
#'''
#Your code here:
#Define a function called "stem_sentences" that takes in a list of sentences and returns a list of stemmed sentences.
#'''
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()



def stem_sentences(sentences):
    ### Some code goes here. Hint: Try looking up how to stem words in NLTK if you get stuck (or simply use the example above and run stemmer in a loop!). ###
    stemmed = []
    for i in range(len(sentences)):
        words = nltk.word_tokenize(sentences[i])
        for x in range(len(words)):
            stemmed.append(stemmer.stem(words[x]))
    for j in range(len(stemmed)):
        print(stemmed[j], end = " " )
    return stemmed
sentences = stem_sentences(sentences)

contemporari climat chang includ global warm impact earth s weather pattern previou period climat chang current chang distinctli rapid due natur caus instead caus emiss greenhous gase mostli carbon dioxid co methan burn fossil fuel energi use creat emi 

In [14]:

print(sentences)

['contemporari', 'climat', 'chang', 'includ', 'global', 'warm', 'impact', 'earth', 's', 'weather', 'pattern', 'previou', 'period', 'climat', 'chang', 'current', 'chang', 'distinctli', 'rapid', 'due', 'natur', 'caus', 'instead', 'caus', 'emiss', 'greenhous', 'gase', 'mostli', 'carbon', 'dioxid', 'co', 'methan', 'burn', 'fossil', 'fuel', 'energi', 'use', 'creat', 'emi']


## NLP Part 3b - Lemmatization
Lemmatization considers the context and converts the word to its meaningful base form. There is a cool tutorial and definition of lemmatization in NLTK [here](https://www.geeksforgeeks.org/python-lemmatization-with-nltk/).

In [31]:
from nltk.stem import WordNetLemmatizer
    
## Step 1: Import the lemmatizer
lemmatizer = WordNetLemmatizer()

'''
Your code here: Define a function called "lem_sentences" that: loops through the sentences, split the sentences up by words and applies "lemmatizer.lemmatize" to each word and then join everything back into a sentence
'''
lem_list = []
##Similar to stopwords: For loop through the sentences, split by words and apply "lemmatizer.lemmatize" to each word and join back into a sentence
def lem_sentences(sentences):
    for i in range(len(sentences)):
        words = nltk.word_tokenize(sentences[i])
        for x in range(len(words)): 
            lem_list.append(lemmatizer.lemmatize(words[x]))
    for j in range(len(lem_list)):
        print(lem_list[j], end = " ")
    return lem_list
sentences = lem_sentences(sentences)


contemporary climate change includes both global warming and it impact on earth 's weather pattern there have been previous period of climate change , but the current change are distinctly more rapid and not due to natural cause instead , they are caused by the emission of greenhouse gas , mostly carbon dioxide ( co ) and methane burning fossil fuel for energy use creates most of these emi 

In [32]:
print(sentences[:100]) 

['contemporary', 'climate', 'change', 'includes', 'both', 'global', 'warming', 'and', 'it', 'impact', 'on', 'earth', "'s", 'weather', 'pattern', 'there', 'have', 'been', 'previous', 'period', 'of', 'climate', 'change', ',', 'but', 'the', 'current', 'change', 'are', 'distinctly', 'more', 'rapid', 'and', 'not', 'due', 'to', 'natural', 'cause', 'instead', ',', 'they', 'are', 'caused', 'by', 'the', 'emission', 'of', 'greenhouse', 'gas', ',', 'mostly', 'carbon', 'dioxide', '(', 'co', ')', 'and', 'methane', 'burning', 'fossil', 'fuel', 'for', 'energy', 'use', 'creates', 'most', 'of', 'these', 'emi']


## NLP Part 4 - POS Tagging
Parts of speech tagging is marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context.

In [17]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to /project
[nltk_data]     s/0ddeade5-3577-4fe8-8cd6-8a0cb653428e/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [18]:
# POS Tagging example
# CC - coordinating conjunction
# NN - noun, singular (cat, tree)
all_words = nltk.word_tokenize(text)  ###If we want to look at part of speech taking before we stem/lem

tagged_words = nltk.pos_tag(all_words)
##Creates a list of lists where each element of the list is [word,partofspeech abbreviation]

# Tagged word paragraph
word_tags = []
for tw in tagged_words:
    word_tags.append(tw[0]+"_"+tw[1])

tagged_paragraph = ' '.join(word_tags)

'''
Your code here: print the first 1000 characters of tagged_paragraph.
'''
print(tagged_paragraph[:1000])

contemporary_JJ climate_NN change_NN includes_VBZ both_DT global_JJ warming_NN and_CC its_PRP$ impacts_NNS on_IN earth_NN 's_POS weather_NN patterns_NNS ._. there_EX have_VBP been_VBN previous_JJ periods_NNS of_IN climate_NN change_NN ,_, but_CC the_DT current_JJ changes_NNS are_VBP distinctly_RB more_RBR rapid_JJ and_CC not_RB due_JJ to_TO natural_JJ causes_NNS ._. instead_RB ,_, they_PRP are_VBP caused_VBN by_IN the_DT emission_NN of_IN greenhouse_NN gases_NNS ,_, mostly_RB carbon_NN dioxide_NN (_( co_NN )_) and_CC methane_NN ._. burning_VBG fossil_JJ fuels_NNS for_IN energy_NN use_NN creates_VBZ most_JJS of_IN these_DT emi_NNS


# Word2Vec Model Visualization

In [19]:
# Install gensim - pip install gensim
import nltk
from gensim.models import Word2Vec
import matplotlib.pyplot as plt
nltk.download('punkt')
from wordcloud import WordCloud

[nltk_data] Downloading package punkt to /projects/0ddeade5-3577-4fe8-
[nltk_data]     8cd6-8a0cb653428e/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [37]:
#Let's go ahead and create a list that's formatted how word2vec needs:
    # a list of lists where the ith entry in the list is the word tokenizaiton of the ith sentence (after preprocessing)
'''
url2 = 'https://en.wikipedia.org/wiki/Global_warming'
source2 = urllib.request.urlopen(url2).read()
soup2 = bs.BeautifulSoup(source2,"html.parser")
text2 = ""
for paragraph in soup2.find_all('p'):
    text2 += paragraph.text
text2 = re.sub(r'\[[0-9]*\]',' ',text2)
text2 = text.lower()
text2 = re.sub(r'\W^.?!',' ',text2)
text2 = re.sub(r'\d',' ',text2)
text2 = re.sub(r'\s+',' ',text2)
'''

from gensim.models import KeyedVectors

sentences2 = nltk.sent_tokenize(text)
sentences2 = [rem_stopwords(sentence) for sentence in sentences2]
sentences2 = remove_punctuation(sentences2)
word_list = [nltk.word_tokenize(sentence) for sentence in sentences2]

In [38]:
# print the tokenized list of lists
print(word_list [:10])

[['contemporary', 'climate', 'change', 'includes', 'global', 'warming', 'impacts', 'earth', 's', 'weather', 'patterns'], ['previous', 'periods', 'climate', 'change', 'current', 'changes', 'distinctly', 'rapid', 'due', 'natural', 'causes'], ['instead', 'caused', 'emission', 'greenhouse', 'gases', 'mostly', 'carbon', 'dioxide', 'co', 'methane'], ['burning', 'fossil', 'fuels', 'energy', 'use', 'creates', 'emi']]


## Training the Word2Vec model

For this part you may want to follow a guide [here](https://radimrehurek.com/gensim/models/word2vec.html). 



In [51]:
''' Training the Word2Vec model. You should pass:
1. a list of lists where the ith entry in the list is the word tokenizaiton of the ith sentence
2. min_count=1 --> Ignores all words with total frequency lower than 1 (i.e., include everything).
'''
# create the model
model = Word2Vec(word_list , min_count=1)

# get the most common words of the model (it's entire vocabulary)
most_common_words = model.wv.index_to_key

# save the model to use it later
model.save("word2vec.model")
# model = Word2Vec.load("word2vec.model")

In [54]:
#print the first 10 most common words.
print(most_common_words[:10])

['climate', 'change', 'emi', 'weather', 'distinctly', 'changes', 'current', 'periods', 'previous', 'patterns']


In [43]:
# Look up the most similar words to certain words in your text using the model.wv.most_similar() function
most_similar_words = model.wv.most_similar
print

TypeError: 'Word2Vec' object is not iterable

## Testing our model



In [0]:
    # Finding Word Vectors - print word vectors for certain words in your text


In [0]:
    ### Finding the most similar words in the model ###


In [0]:
similar1, similar2

In [0]:
# code to print a wordcloud for your sentences
wordcloud = WordCloud(
                        background_color='white',
                        max_words=100,
                        max_font_size=50, 
                        random_state=42
                        ).generate(str(sentences))
fig = plt.figure(1)
plt.figure(figsize=(10,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

### Why did we do all this work?

In [0]:
# reFetching the data
lame_text = ""
for paragraph in soup.find_all('p'): #The <p> tag defines a paragraph in the webpages
    lame_text += paragraph.text

In [0]:
'''
Doing the same without removing stop words or lemming
'''
# tokenize the text using sent_tokenize

# from this list of sentences, create a list of lists where the ith entry in the list is the word tokenizaiton of the ith sentence (after preprocessing)

In [0]:
# Redo the word cloud but set stopwords to empty so it looks really bad
wordcloud = WordCloud(
                        background_color='white',
                        max_words=100,
                        max_font_size=50, 
                        random_state=42, ###SET STOPWORDS = [] and/or include_numbers = True or you will get the same thing!!!
                        stopwords = [],
                        include_numbers = True).generate(str(lame_sentences)) 
fig = plt.figure(1)
plt.figure(figsize=(10,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

In [0]:
# Training the Word2Vec model (same code as before), but one change: use our lame data that was not preprocessed

# Try printing this after training the model.
words = model.wv.index_to_key
print(words[:10])

In [0]:
# Finding a vector of a word, but badly

In [0]:
### Finding the most similar words in the model but... you get the idea ###



## Reflection
How important do you think proper preprocessing in NLP is?