# Background
- In most of our work with strings so far, we have looked only at the literal spelling of each word and ignored everything else about the content.
- With Natural Language Processing (NLP) we can analyze properties of the words themselves.
- We do this with the TextBlob and NLTK libraries.

#### API for TextBlob https://textblob.readthedocs.io/en/dev/api_reference.html#textblob.blob.WordList


#### TextBlob is another prebuilt Python library that we can use for a range of text-processing tasks.
#### It supports many of them, including
- Tokenization—splitting text into pieces called tokens, which are meaningful units, such as words and numbers.
- Parts-of-speech (POS) tagging—identifying each word’s part of speech, such as noun, verb, adjective, etc.
- Noun phrase extraction—locating groups of words that represent nouns, such as “red brick factory.”
- Sentiment analysis—determining whether text has positive, neutral or negative sentiment. 
- Inter-language translation and language detection powered by Google Translate.
- Inflection—pluralizing and singularizing words. There are other aspects of inflection that are not part of TextBlob. 
- Word frequencies—determining how often each word appears in a corpus.
- WordNet integration for finding word definitions, synonyms and antonyms.
- Stop word elimination—removing common words, such as a, an, the, I, we, you and more to analyze the important words in a corpus.


#### All of these are useful in the right situation. Each of you has done some of them by hand in previous labs.

To download what we need, open the Anaconda command line, much as we did earlier in the semester. Run the following commands and follow the instructions for each.

- conda install -c conda-forge textblob
- ipython -m textblob.download_corpora


Before we begin, we need to run the following code. The commented code should only be run once!

In [3]:
#%conda install pandas
#%conda install numpy
#%conda install textblob
#%conda install requests

from textblob import TextBlob
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('brown')
nltk.download('movie_reviews')
nltk.download('omw-1.4')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\alanr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\alanr\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\alanr\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\alanr\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\alanr\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\alanr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

True

#### Let's look at how these things actually work with some basic examples.

In [4]:



string="This is a sentence, as it contains a subject and a verb. In fact it is a complex sentence! I think."

blob=TextBlob(string)
print(blob)
blob

This is a sentence, as it contains a subject and a verb. In fact it is a complex sentence! I think.


TextBlob("This is a sentence, as it contains a subject and a verb. In fact it is a complex sentence! I think.")

#### The variable blob in the cell above is a TextBlob object. This matters because it lets us do more with it than with an ordinary string. Note that printing a TextBlob shows only the text it contains, so it is not obvious that it came from a TextBlob to begin with. Let's see what else it can do.

In [5]:
# What if we wanted to break up via sentences? Like we did on a previous lab?
sentences=blob.sentences

# Note that .sentences has no parentheses after it. It is a property, which is a common technique in many libraries.

print(sentences)

# The type of each item in this list is a Sentence Object meaning it also has a slew of functions we can use on them.

print(type(sentences[0]))
print(sentences[0])



# What if we wanted to break up via words?
# It works just as you would expect, similar to the sentences above.

words=blob.words

print(words)
print(type(words[0]))
print(words[0])

[Sentence("This is a sentence, as it contains a subject and a verb."), Sentence("In fact it is a complex sentence!"), Sentence("I think.")]
<class 'textblob.blob.Sentence'>
This is a sentence, as it contains a subject and a verb.
['This', 'is', 'a', 'sentence', 'as', 'it', 'contains', 'a', 'subject', 'and', 'a', 'verb', 'In', 'fact', 'it', 'is', 'a', 'complex', 'sentence', 'I', 'think']
<class 'textblob.blob.Word'>
This
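
To get a feel for what `.words` is doing, here is a rough standard-library approximation. It is only a sketch: the regular expression is an assumption made for illustration and is not the tokenizer TextBlob actually uses, so results may differ on contractions, hyphens and other edge cases.

```python
import re

text = ("This is a sentence, as it contains a subject and a verb. "
        "In fact it is a complex sentence! I think.")

# Assumption: treat every run of letters, digits or underscores as a word.
# TextBlob's real tokenizer is more careful than this.
rough_words = re.findall(r"\w+", text)

print(rough_words[:4])
print(len(rough_words))
```

For this simple string the rough split finds the same 21 words that `blob.words` printed, but the items here are plain strings, not Word objects.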


#### Parts of Speech

In [6]:
## What if I was interested in the parts of speech of our particular words?
partsOfSpeech=blob.tags
print(partsOfSpeech)

# This helps us (and the computer) work out which sense of a word is meant.
# A word can have many meanings, and knowing its POS (part of speech)
# narrows down the possible definitions.

[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sentence', 'NN'), ('as', 'IN'), ('it', 'PRP'), ('contains', 'VBZ'), ('a', 'DT'), ('subject', 'NN'), ('and', 'CC'), ('a', 'DT'), ('verb', 'NN'), ('In', 'IN'), ('fact', 'NN'), ('it', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('complex', 'JJ'), ('sentence', 'NN'), ('I', 'PRP'), ('think', 'VBP')]


##### OK, but what do those tags mean? Let's go look it up!

https://www.geeksforgeeks.org/python-part-of-speech-tagging-using-textblob/

There is also a dictionary you can access below:

In [7]:
# Dictionary of parts-of-speech markers.

POSdict = {'CC': 'coordinating conjunction',
'CD':'cardinal digit',
'DT': 'determiner',
'EX': "existential there (like: 'there is' … think of it like 'there exists')",
'FW': 'foreign word',
'IN': 'preposition/subordinating conjunction',
'JJ': "adjective 'big'",
'JJR': "adjective, comparative 'bigger'",
'JJS': "adjective, superlative 'biggest'",
'LS': 'list marker 1)',
'MD': 'modal could, will',
'NN': "noun, singular 'desk'",
'NNS': "noun plural 'desks'",
'NNP': "proper noun, singular 'Harrison'",
'NNPS': "proper noun, plural 'Americans'",
'PDT': "predeterminer 'all the kids'",
'POS': "possessive ending parent's",
'PRP':'personal pronoun I, he, she',
'PRP$': 'possessive pronoun my, his, hers',
'RB': 'adverb very, silently',
'RBR': 'adverb, comparative better',
'RBS': 'adverb, superlative best',
'RP': 'particle give up',
'TO': "to go 'to' the store.",
'UH': 'interjection errrrrrrrm',
'VB': 'verb, base form take',
'VBD': 'verb, past tense took',
'VBG': 'verb, gerund/present participle taking',
'VBN': 'verb, past participle taken',
'VBP': 'verb, sing. present, non-3rd take',
'VBZ': 'verb, 3rd person sing. present takes',
'WDT': 'wh-determiner which',
'WP': 'wh-pronoun who, what',
'WP$': 'possessive wh-pronoun whose',
'WRB' :'wh-adverb where, when'}
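
A dictionary like this can be used to turn the raw tags into readable descriptions. The sketch below repeats a small subset of the dictionary and a few (word, tag) pairs from the earlier output so that it runs on its own; the names `pos_names` and `described` are made up for this example.

```python
# A few (word, tag) pairs copied from the blob.tags output earlier.
tags = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sentence', 'NN'),
        ('complex', 'JJ'), ('I', 'PRP'), ('think', 'VBP')]

# Small subset of the tag dictionary, repeated so this cell is self-contained.
pos_names = {'DT': 'determiner', 'NN': 'noun, singular', 'JJ': 'adjective',
             'PRP': 'personal pronoun'}

# Look up each tag; .get() falls back to the raw tag if it is not in the subset.
described = [(word, pos_names.get(tag, tag)) for word, tag in tags]

for word, description in described:
    print(word, '->', description)
```

Using `.get(tag, tag)` instead of `pos_names[tag]` avoids a KeyError when the tagger produces a tag the dictionary does not list.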

In [8]:
# What if I wanted to get frequency like we did on previous labs?
freqDict=blob.word_counts
print(freqDict)

#Note this returns a defaultdict object. To get a plain Python dict, pass it to dict().
print(dict(freqDict))

defaultdict(<class 'int'>, {'this': 1, 'is': 2, 'a': 4, 'sentence': 2, 'as': 1, 'it': 2, 'contains': 1, 'subject': 1, 'and': 1, 'verb': 1, 'in': 1, 'fact': 1, 'complex': 1, 'i': 1, 'think': 1})
{'this': 1, 'is': 2, 'a': 4, 'sentence': 2, 'as': 1, 'it': 2, 'contains': 1, 'subject': 1, 'and': 1, 'verb': 1, 'in': 1, 'fact': 1, 'complex': 1, 'i': 1, 'think': 1}
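
Notice that the keys in the output are all lowercase. As a point of comparison, the same tallies can be reproduced with `collections.Counter` from the standard library. This is a sketch of the idea and not a description of how TextBlob computes `word_counts` internally.

```python
from collections import Counter

# The words from blob.words earlier in the notebook, as plain strings.
words = ['This', 'is', 'a', 'sentence', 'as', 'it', 'contains', 'a', 'subject',
         'and', 'a', 'verb', 'In', 'fact', 'it', 'is', 'a', 'complex',
         'sentence', 'I', 'think']

# Lowercase first so that 'In' and 'in' would be counted together,
# which matches the lowercase keys in the word_counts output.
counts = Counter(word.lower() for word in words)

print(counts['a'])
print(counts.most_common(1))
```

`Counter.most_common(n)` returns the `n` most frequent items, which saves writing a custom sort when all you want is the top of the list.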


#### Noun Phrases
Sometimes words carry less meaning individually than they do in groups. Take "water ski": "water" and "ski" each have their own meaning, and only together do they name a water ski. Computers should be able to recognize such groups, which is what noun phrases are for.



In [9]:
nounPhraseStr=TextBlob("I love using my jet ski on a wonderful day like today.")
print(nounPhraseStr)
nounPhrase=nounPhraseStr.noun_phrases
# Note: this is a WordList object, which acts like a list but allows extra functions to be called on it
print(type(nounPhrase))
print(nounPhrase)

# Additionally, each element of the WordList is still a Word object, even when it contains multiple words
print(type(nounPhrase[0]))
print(nounPhrase[0])




I love using my jet ski on a wonderful day like today.
<class 'textblob.blob.WordList'>
['jet ski', 'wonderful day']
<class 'textblob.blob.Word'>
jet ski


#### Lets take a look at the basics of sentiment analysis.
Sentiment analysis takes a sentence or series of sentences and tries to quantify whether the text is positive, negative, or neutral.

TextBlob does this by assigning a polarity value from -1 to 1. The closer to -1, the more negative the sentiment; the closer to 1, the more positive; and the closer to 0, the more neutral.

TextBlob can also rate subjectivity versus objectivity, giving a subjectivity rating from 0 to 1. Higher values indicate a more subjective sentence; lower values indicate a more objective statement.

In [10]:
angrySent=TextBlob("That food was terrible!")

neutralStr=TextBlob("The food was mediocre at best.")

positiveStr=TextBlob("The food was excellent.")

objectiveStr=TextBlob("My name is William.")

print(angrySent,angrySent.sentiment)

print(neutralStr,neutralStr.sentiment)

print(positiveStr,positiveStr.sentiment)

print(objectiveStr,objectiveStr.sentiment)

#Note that the objective sentence has a polarity of zero and a subjectivity of zero, meaning it is a neutral sentence with absolute objectivity.


#If we wanted to get the specific values for polarity or subjectivity out of this Sentiment object, we could do the following:
print(angrySent.sentiment.polarity)
print(angrySent.sentiment[0])
print(angrySent.sentiment.subjectivity)
print(angrySent.sentiment[1])



That food was terrible! Sentiment(polarity=-1.0, subjectivity=1.0)
The food was mediocre at best. Sentiment(polarity=0.25, subjectivity=0.65)
The food was excellent. Sentiment(polarity=1.0, subjectivity=1.0)
My name is William. Sentiment(polarity=0.0, subjectivity=0.0)
-1.0
-1.0
1.0
1.0
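
A polarity number is often easier to work with once it is turned into a label. The helper below is a minimal sketch: the function name `label_polarity` and the threshold of 0.1 are arbitrary choices for illustration, not part of TextBlob.

```python
def label_polarity(polarity, threshold=0.1):
    """Turn a polarity score in [-1, 1] into a coarse label.

    The default threshold is an arbitrary value chosen for this example.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Polarity values copied from the output above.
for score in (-1.0, 0.25, 1.0, 0.0):
    print(score, label_polarity(score))
```

Under this rule the score of 0.25 for "The food was mediocre at best." comes out as "positive", even though a human reader would probably not call that sentence praise. It is a useful reminder to sanity-check sentiment scores against the text.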


The TextBlob library also comes with a NaiveBayesAnalyzer, which was trained on a database of movie reviews. Naive Bayes is a commonly used machine learning text-classification algorithm. The following uses the analyzer keyword argument to specify a TextBlob’s sentiment analyzer.

In [27]:
from textblob.sentiments import NaiveBayesAnalyzer
angrySent=TextBlob("That food was terrible!",analyzer=NaiveBayesAnalyzer())

print(angrySent,angrySent.sentiment)

That food was terrible! Sentiment(classification='neg', p_pos=0.22647144513750855, p_neg=0.7735285548624911)
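
Note that this result has different fields from the default analyzer: a classification plus two probabilities. The sketch below uses a `namedtuple` as a stand-in for that result (it is not TextBlob's own class) to show how the fields relate to one another, using the numbers printed above.

```python
from collections import namedtuple
import math

# Stand-in for the result printed above; the real object comes from TextBlob.
Sentiment = namedtuple("Sentiment", ["classification", "p_pos", "p_neg"])
result = Sentiment("neg", 0.22647144513750855, 0.7735285548624911)

# Fields can be read by name or by position.
print(result.p_pos, result[1])

# The two probabilities add up to (approximately) 1.
print(math.isclose(result.p_pos + result.p_neg, 1.0))

# In this example the classification matches the larger probability.
predicted = "pos" if result.p_pos > result.p_neg else "neg"
print(predicted)
```

Because the fields differ between analyzers, index `[1]` means subjectivity with the default analyzer but `p_pos` here, so reading fields by name is usually the safer habit.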


These Word objects allow us to do a lot more than just look at sentences as a whole. What if we were interested in getting definitions? Or synonyms? Antonyms? All of these are possible!

In [28]:
from textblob import Word
word=Word("Happy")

#print(word.definitions)
#print(word.synsets)

###Each Synset represents a group of synonyms. In the notation happy.a.01:
#• happy is the original Word’s lemmatized form (in this case, it’s the same).
#• a is the part of speech, which can be a for adjective, n for noun, v for verb, r for
#adverb or s for adjective satellite. Many adjective synsets in WordNet have satellite synsets that represent similar adjectives.
#• 01 is the sense number. Many words have multiple meanings, and this number
#identifies the corresponding meaning in the WordNet database (numbering starts at 01).

#Note you could pass a POS to a different function, .get_synsets(), to get only a specific POS in the set
#word=Word("House")
#synset=word.get_synsets('n')
#print(word,synset)
#synset=word.get_synsets('v')
#print(word,synset)

You might remember us talking about stop words in the sonnet lab. Those came from NLTK! It also includes stop word lists for many other natural languages. Let's focus on the English ones.

In [31]:
import nltk
from nltk.corpus import stopwords
stops = stopwords.words('english')
#print((stops))

## Note for below: most of our usual string manipulation tricks (lower, upper, slicing, etc.) also work on TextBlobs.
englishString=TextBlob("My name is Josh. I am 32. I teach math and computer science.").lower()
## We can use this list of stop words to eliminate them, just like we did before.
wordlst=[word for word in englishString.words if word not in stops]
print(wordlst)

['name', 'josh', '32', 'teach', 'math', 'computer', 'science']
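
The same filtering idea can be tried without any downloads. The sketch below uses a tiny hand-written stop list (hypothetical, and far shorter than NLTK's English list) stored in a set. Sets make the `not in` test fast, which starts to matter on long texts.

```python
# Hypothetical, much shorter stop list; NLTK's English list is far longer.
stops = {"my", "is", "i", "am", "and"}

text = "my name is josh. i am 32. i teach math and computer science."

# Strip trailing periods, then keep only words that are not stop words.
kept = [w.strip(".") for w in text.split() if w.strip(".") not in stops]
print(kept)
```

With the real NLTK list, a similar speed-up is available by wrapping it once: `stops = set(stopwords.words('english'))`.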


Let's try our hand at a bit of analysis similar to what we did before. Let's make some plots of the frequency of non-stop words in Romeo and Juliet.

In [32]:
import pandas as pd
import re


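# Open the file, read it into a string and turn it directly into a TextBlob.
# Assumption: the file is UTF-8. Decoding it that way avoids the garbled
# "â€™" sequences that appear when curly apostrophes are misread.
with open("romeoJuliet.txt", encoding="utf-8") as file:
    dataString=TextBlob(file.read().lower())

# Remove curly apostrophes so they do not end up inside the word counts.
dataString=dataString.replace("’",'')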
print(type(dataString))


freqDict=dict(dataString.word_counts).items()

words = [word for word in freqDict if word[0] not in stops]

#print(words)


def getItem(t):
    return t[1]



sortedFreqListKeys=sorted(words, key=getItem,reverse=True)[:20]
print(sortedFreqListKeys)

df=pd.DataFrame(sortedFreqListKeys,columns=["Word","Frequency"])
axes=df.plot.bar(x='Word', y='Frequency', legend=False)



FileNotFoundError: [Errno 2] No such file or directory: 'romeoJuliet.txt'

## Challenge Questions

Let us again turn to the text of the Constitution found below:

In [37]:
import requests

response = requests.get('https://www.usconstitution.net/const.txt')
text = response.text

text = text[278:]

1. Turn this string into a TextBlob. Create a list of all the sentences in the Constitution and a list of all the words.

2. Using the functions we have discussed today, make a dictionary of the word frequencies in the Constitution. 

3. Determine the part of speech of each word in the text.

4. Write a program which counts the different parts of speech. Display that information in a table (i.e. a Pandas DataFrame).

5. Perform a sentiment analysis on each sentence of the text and store that information as a list.

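6. Create a histogram of all the positive sentiments. (Hint: with the default analyzer, `sentence.sentiment[0]` is the polarity of `sentence` and `sentence.sentiment[1]` is its subjectivity, so the positive sentiments are the polarity values greater than 0.)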