
# Introduction to Text Analysis with Python

Welcome to the Digital Scholarship Lab introduction to Text Analysis with Python class. In this class we'll learn the basics of text analysis:

- parsing text
- analyzing the text

We'll use our own homemade analysis tool first, then we'll use a Python library called `TextBlob` to try some built-in analysis tools.

This workshop assumes you've completed our Intro to Python [workshop](https://brockdsl.github.io/Intro_to_Python_Workshop/).



Be sure to enable line numbers by looking for the 'gear' icon and checking the box in the 'Editor' panel.

## EG. Scrabble!

<img src="https://upload.wikimedia.org/wikipedia/commons/5/5d/Scrabble_game_in_progress.jpg" width="500">

Scrabble is a popular game where players try to score points by spelling words and placing them on the game board. We'll use Scrabble scoring as our first attempt at text analysis. This will demonstrate the basics of how text analysis works.

The function below gives you the Scrabble score of any word you give it.

In [1]:
# This function will return the Scrabble score of a word

def scrabble_score(word):

    #Dictionary of our scrabble scores
    score_lookup = {
        "a": 1,
        "b": 3,
        "c": 3,
        "d": 2,
        "e": 1,
        "f": 4,
        "g": 2,
        "h": 4,
        "i": 1,
        "j": 8,
        "k": 5,
        "l": 1,
        "m": 3,
        "n": 1,
        "o": 1,
        "p": 3,
        "q": 10,
        "r": 1,
        "s": 1,
        "t": 1,
        "u": 1,
        "v": 4,
        "w": 4,
        "x": 8,
        "y": 4,
        "z": 10,
        "\n": 0, #just in case a new line character jumps in here
        " ":0 #normally single words don't have spaces but we'll put this here just in case

    }

    total_score = 0

    #We look up each letter in the scoring dictionary and add it to a running total.
    #To keep our dictionary short we only store lowercase letters, so we change
    #each letter to lowercase with .lower(). Characters that are not in the
    #dictionary (digits, hyphens, apostrophes) score 0 instead of raising a KeyError.
    for letter in word:
        total_score = total_score + score_lookup.get(letter.lower(), 0)

    return total_score


Text analysis is a process made up of three basic steps:

1. Identifying the text (or corpus) that you'd like to analyze
1. Applying the analysis to your prepared text
1. Reviewing the results

In our very basic Scrabble example we are just interested in finding the points we would get for spelling a specific word.
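
To make the three steps concrete, here is a minimal sketch that runs them on a tiny made-up corpus of three words. The compact `LETTER_VALUES` table and the `score_word` helper are our own shorthand for this illustration (they use the same letter values as `scrabble_score` above), and the word list is just an example.

```python
# Build a compact letter-value table: letters grouped by their Scrabble value
LETTER_VALUES = {}
for letters, value in [("aeilnorstu", 1), ("dg", 2), ("bcmp", 3),
                       ("fhvwy", 4), ("k", 5), ("jx", 8), ("qz", 10)]:
    for letter in letters:
        LETTER_VALUES[letter] = value


def score_word(word):
    # Characters without a value (spaces, punctuation) count as 0
    return sum(LETTER_VALUES.get(ch, 0) for ch in word.lower())


# Step 1: identify the corpus (an example list of words)
corpus = ["quiz", "python", "text"]

# Step 2: apply the analysis to every item in the corpus
scores = {word: score_word(word) for word in corpus}

# Step 3: review the results
for word, score in scores.items():
    print(word, score)

best = max(scores, key=scores.get)
print("Highest scoring word:", best)
```

Swapping in a different scoring function at step 2 is all it takes to turn this into a different kind of analysis.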

In a more complex example with a larger corpus you can do any of the following types of analysis:
- determine the sentiment (positive / negative tone) of the text
- quantify how complex a piece of writing is based on the vocabulary it uses
- determine what topics are in your corpus
- classify your text into different categories based on what it is about

Of course, there are many other outcomes you can get from performing text analysis.

Try questions Q1 - Q2 and type "All Done" in the chat box when you are done.

## Q1

Score your name by creating the text variable _name_ on line 1.

How many points do you get for your name? Complete the expression below to find out the Scrabble score of your name.

In [None]:
name = ""
print("Score for my name is:", scrabble_score(name))


## Q2

Score your pet's name (or favorite character from a story) by creating the text variable _pet_name_ on line 1.
Does your name or the name of your pet score higher in Scrabble?

In [None]:
pet_name = ""
print("Score for my pet's name is:",scrabble_score(pet_name))

#Compare to see which gets more points!
if scrabble_score(pet_name) > scrabble_score(name):
    print("My pet's name scores more points!")
else:
    print("My name scores more points than (or the same as) my pet's name")



# Beyond the basics

We just completed a very basic text analysis where we analyzed two different bits of text to see which one scores higher in Scrabble. Let's expand this idea to a more complex example using the [TextBlob](https://textblob.readthedocs.io/en/dev/) Python library. There are other, more complex libraries that you can use for text analysis, but we are using simpler tools so we can spend more time looking at results and less time setting up the code.

# Installing and Loading the Libraries

This next cell will install and load the required libraries that will do the text analysis.

In [None]:
#Install textblob using magic commands
#Only needed once
%pip install textblob
#Alternative way to fetch TextBlob's corpora (a shell command, not a magic):
#!python -m textblob.download_corpora

from textblob import TextBlob

import pandas as pd
import nltk
from nltk.corpus import stopwords
import requests
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('brown')
nltk.download('wordnet')
nltk.download('omw-1.4')

#Let's make sure our previews show more information
pd.set_option('display.max_colwidth', 999)

#Classifier for later
from textblob.classifiers import NaiveBayesClassifier
from textblob import Word

# Corpus

![winnie_splash](https://raw.githubusercontent.com/BrockDSL/Text_Analysis_with_Python/master/winnie_splash.png)

Corpus is a fancy way of saying the text that we will be looking at. Cleaning up a corpus and getting it ready for analysis is a big part of the process; once that is done, the rest is easy. For our example we are going to be looking at some entries from the 1900 [diary](https://dr.library.brocku.ca/handle/10464/7282) of Winnie Beam. The next cell will load this corpus into a Pandas dataframe and show us a few entries.

In [None]:
winnie_corpus = pd.read_csv('https://raw.githubusercontent.com/BrockDSL/Text_Analysis_with_Python/master/winnie_corpus.txt', header = None, delimiter="\t")
winnie_corpus.columns = ["page","date","entry"]
winnie_corpus['date'] = pd.to_datetime(winnie_corpus['date'])
winnie_corpus['entry'] = winnie_corpus.entry.astype(str)

#preview our top entries
winnie_corpus.head()

# Measuring Sentiment

We can analyze the _sentiment_ of the text (more [details](https://planspace.org/20150607-textblob_sentiment/)). The next cell demonstrates this:

In [None]:

happy_sentence = "Python is the best programming language ever!"
sad_sentence = "Python is difficult to use, and very frustrating"


print("Sentiment of happy sentence ", TextBlob(happy_sentence).sentiment)
print("Sentiment of sad sentence ", TextBlob(sad_sentence).sentiment)

# polarity ranges from -1 to 1.
# subjectivity ranges from 0 to 1.
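
To build some intuition for what a polarity score is, here is a toy word-list scorer. This is only a sketch and not TextBlob's actual algorithm: the `TOY_LEXICON` values are numbers we made up, and `toy_polarity` simply averages the scores of the words it recognizes.

```python
# Hand-made word scores for illustration only (not TextBlob's lexicon)
TOY_LEXICON = {"best": 1.0, "great": 0.8, "difficult": -0.5, "frustrating": -0.5}


def toy_polarity(text):
    # Lowercase each word and strip surrounding punctuation
    words = [w.strip(".,!?").lower() for w in text.split()]
    # Keep only the words our toy lexicon knows about
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    # Average the known scores; text with no known words is neutral
    return sum(scores) / len(scores) if scores else 0.0


happy_score = toy_polarity("Python is the best programming language ever!")
sad_score = toy_polarity("Python is difficult to use, and very frustrating")
neutral_score = toy_polarity("Python is a programming language")
print(happy_score, sad_score, neutral_score)
```

A real sentiment tool uses a much larger word list and also handles things like negation and intensifiers, which is why its scores will differ from this sketch.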



## Q3

Try a couple of different sentences in the code cell below. See if you can create something that scores -1 and another that scores 1 for _polarity_. See if you can minimize the _subjectivity_ of your sentence. *Share your answers in the chat box*.

(We can create a multi-line string of text by putting it in triple quotes, as in the following cell.)

In [None]:
test_sentence = """

"""
print("Score of test sentence is ", TextBlob(test_sentence).sentiment)

# Adding Sentiment to our Diary entries

This next cell will score each diary entry and add the scores to the dataframe as two new columns. We loop through each entry and calculate the two scores that represent the sentiment. After all the scores are computed, we add them to the dataframe.

In [None]:
#Apply sentiment analysis from TextBlob

polarity = []
subjectivity = []


for day in winnie_corpus.entry:
    #print(day,"\n")
    score = TextBlob(day)
    polarity.append(score.sentiment.polarity)
    subjectivity.append(score.sentiment.subjectivity)

winnie_corpus['polarity'] = polarity
winnie_corpus['subjectivity'] = subjectivity


#Let's look at our new top entries
winnie_corpus.head()

Now that we have daily sentiment values, let's try to visualize how they go up and down over the course of the diary.

In [None]:
#Let's graph out the sentiment as it changes day to day.

plt.plot(winnie_corpus["date"],winnie_corpus["polarity"])
plt.xticks(rotation=45)
plt.title("Sentiment of Winnie's Diary Entries")
plt.show()

## Interesting spikes?

We see some really strong negative and positive spikes in the sentiment. Let's just take a look at some of those entries. Run the next three cells to look at the individual negative and positive entries.

In [None]:
#instead of looking at just the highest and lowest value we'll reduce that number by a threshold value
#that way we can see numbers that are close to the highest sentiment and the lowest sentiment
#we'll start with 20%.


threshold = 0.2

In [None]:
#Very Negative
bad_sentiment = winnie_corpus["polarity"].min()

#Reduce this number by threshold %
bad_sentiment = bad_sentiment - (bad_sentiment * threshold)

winnie_corpus[winnie_corpus["polarity"] <= bad_sentiment]

In [None]:
#Very Positive
good_sentiment = winnie_corpus["polarity"].max()

#Reduce this number by threshold %
good_sentiment = good_sentiment - (good_sentiment * threshold)

winnie_corpus[winnie_corpus["polarity"] >= good_sentiment]
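
Here is the threshold arithmetic worked through on a short list of made-up polarity values, so you can see which values make the cut. The numbers are hypothetical and are not taken from the diary.

```python
# Hypothetical polarity scores, one per diary entry
polarities = [-0.5, -0.45, -0.1, 0.0, 0.3, 0.72, 0.8]
threshold = 0.2

lowest = min(polarities)   # -0.5
highest = max(polarities)  # 0.8

# Move each extreme 20% of the way back towards zero
bad_cutoff = lowest - (lowest * threshold)     # roughly -0.4
good_cutoff = highest - (highest * threshold)  # roughly 0.64

# Keep everything at least as extreme as the cutoffs
very_negative = [p for p in polarities if p <= bad_cutoff]
very_positive = [p for p in polarities if p >= good_cutoff]

print("Very negative:", very_negative)
print("Very positive:", very_positive)
```

A larger threshold moves both cutoffs closer to zero, so more entries are selected.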

## Q4

Do you agree with the sentiment scores that are applied in the above two cells? Share your thoughts in the chat.

# What else can we get from the text?

We've seen some details about sentiment, but what else can we get from the text? Let's grab a random entry and see what we can find out about it. We'll choose the entry at index *22*.

In [None]:
entry_number = 22
bit_of_corpus = TextBlob(winnie_corpus["entry"][entry_number])
bit_of_corpus

# Sentences and Sentiment

We applied sentiment analysis to daily entries, but we can also apply it to individual sentences to see how the score fluctuates within an entry.

In [None]:
for sentence in bit_of_corpus.sentences:
    print(sentence)
    print(sentence.sentiment,"\n")

# Words in sentences

You can parse through words in a sentence using TextBlob as well. The next cell illustrates this. We'll need this to calculate specific sentiment scores in our next section.

In [None]:
for sentence in bit_of_corpus.sentences:
    for word in sentence.words:
        print(word)
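
For a rough idea of what splitting text into sentences and words involves, here is a crude approximation using regular expressions from Python's standard library. This is only a sketch: TextBlob's tokenizer is more careful (for example with abbreviations), and the sample text is made up.

```python
import re

# Made-up sample text
text = "It snowed all day. We stayed home! Did Mary call?"

# Split after sentence-ending punctuation that is followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Within each sentence, treat runs of letters and apostrophes as words
words_per_sentence = [re.findall(r"[A-Za-z']+", s) for s in sentences]

for sentence, words in zip(sentences, words_per_sentence):
    print(sentence)
    print(words, "\n")
```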

## Q5

Another random journal entry. Pick a random number between 0 and one less than the length of the dataframe and update *en_no* in line 1. If you get an interesting result, share it with the class in the chat box.

In [None]:
#Pick a value between 0 and one less than this number
len(winnie_corpus)

In [None]:
en_no =

another_bit_of_corpus = TextBlob(winnie_corpus["entry"][en_no])

print("Random Entry: \n")
print(another_bit_of_corpus,"\n")

#Go through all of the sentences of this entry and determine their sentiment
for sentence in another_bit_of_corpus.sentences:
    print(sentence)
    print(sentence.sentiment,"\n")

# Stopwords

Often we want to remove common words (e.g. a, the, of) in our corpus before we analyze things. For the most part TextBlob will ignore these words if we use the right analysis. We'll just look at it here to understand the idea.

In [None]:
for word in stopwords.words('english'):
    print(word)

Now let's see how we can remove stopwords from a piece of text. Experiment by changing to a different example sentence.

In [None]:
ex_text = TextBlob("I know this, do you?")


for sentence in ex_text.sentences:
    for word in sentence.words:

        if word.lower() not in stopwords.words('english'):
            print(word)
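
The same idea can be sketched without any libraries. This version assumes a tiny hand-picked stopword set (`SMALL_STOPWORDS` is our own, far shorter than the NLTK list) and stores it as a Python `set`, which keeps each membership check fast.

```python
# A tiny hand-picked stopword set for illustration (not the NLTK list)
SMALL_STOPWORDS = {"i", "this", "do", "you", "a", "the", "of"}


def remove_stopwords(text, stop_words=SMALL_STOPWORDS):
    # Split on whitespace and strip punctuation from the ends of each token
    tokens = [t.strip(".,!?;:") for t in text.split()]
    # Keep non-empty tokens whose lowercase form is not a stopword
    return [t for t in tokens if t and t.lower() not in stop_words]


kept = remove_stopwords("I know this, do you?")
print(kept)
```

With the real list you could build the set once, for example `set(stopwords.words('english'))`, and reuse it inside the loop.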

## Lemmatization

Sometimes analysis requires us to **stem** or **lemmatize** our words so that each one becomes its root word. For example, the word _cats_ would be transformed into _cat_. Doing this makes our analysis a bit clearer, as we can more readily compare words against one another. Our Python library allows us to do this fairly easily.

A full & comprehensive text analysis project would require us to do this for our corpus.

In [None]:
w = Word("cats")
w.lemmatize()

In [None]:
ex_text_2 = TextBlob("I have chased so many mice and cats today, I am exhausted.")


for sentence in ex_text_2.sentences:
    for word in sentence.words:
        print(Word(word).lemmatize())
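
To see why stemming and lemmatizing are not quite the same thing, here is a toy comparison. Both helpers are our own simplified stand-ins: `naive_stem` only strips a couple of suffixes, while `toy_lemmatize` first checks a small hand-picked table of irregular plurals, which is a rough imitation of the dictionary lookup a real lemmatizer performs.

```python
# Hand-picked irregular plurals; a real lemmatizer uses a full dictionary
IRREGULAR_PLURALS = {"mice": "mouse", "geese": "goose", "children": "child"}


def naive_stem(word):
    # Crude suffix stripping: only knows about "ies" and a trailing "s"
    word = word.lower()
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def toy_lemmatize(word):
    # Check the irregular forms first, then fall back to suffix stripping
    return IRREGULAR_PLURALS.get(word.lower(), naive_stem(word))


for w in ["cats", "stories", "mice"]:
    print(w, "-> stem:", naive_stem(w), "| lemma:", toy_lemmatize(w))
```

Suffix stripping handles _cats_ and _stories_ but leaves _mice_ untouched; only the dictionary lookup maps it to _mouse_.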


## Sentiment v. Stop Words v. Lemmatization

The question now becomes: does the sentiment score change if we apply our processing steps? Let's try by taking our random entry, applying these steps, and getting the score.

In [None]:
#Sentiment of unchanged entry (the same entry that is processed in the steps below)
raw_sentiment = bit_of_corpus.sentiment

print("\nSentiment of entry: \n",raw_sentiment)


#Remove stopwords & score sentiment
stopword_sent = ""
for sentence in bit_of_corpus.sentences:
    for word in sentence.words:
        if word.lower() not in stopwords.words('english'):
            stopword_sent = stopword_sent + " " + str(word)

stopword_sentiment = TextBlob(stopword_sent).sentiment
print("\nSentiment of entry without stopwords: \n",stopword_sentiment)


#Lemmatize the words and print sentiment
lemm_sent = ""
for sentence in bit_of_corpus.sentences:
    for word in sentence.words:
        lw = str(Word(word).lemmatize())
        lemm_sent = lemm_sent + " " + lw

lemm_sentiment = TextBlob(lemm_sent).sentiment
print("\nSentiment of entry with lemmatization: \n",lemm_sentiment)



---
Bottom line: our sentiment tool already considers stopwords and lemmatization. In *this case* we didn't have to process the text further.

# Noun Phrases

We can get a good idea about what a corpus is about by looking at the different _nouns_ that show up in it. _Nouns_ that show up a lot give us an idea of the contents of the text. TextBlob is smart enough to ignore *stopwords* when it does this.

In [None]:
for np in bit_of_corpus.noun_phrases:
    print(np)

### Automatic Keyword generator

One good use of noun phrase identification is automatically creating keywords for a collection of works in your corpus. The basic structure goes like this:

1. Read through each document in your corpus

2. Identify each noun phrase in your documents

3. The noun phrases that show up the most are the keywords for your document

We are going to be looking at the book [The Prince](https://en.wikipedia.org/wiki/The_Prince). (You can modify line 4 to download a different book; just pick the full-text [Gutenberg](https://www.gutenberg.org/) URL for the variable.)

You'll need to be patient while this cell runs.

In [None]:
keywords = dict()

#You can replace with any book on Gutenberg, we are using The Prince - https://www.gutenberg.org/ebooks/1232
BOOK_URL = "https://www.gutenberg.org/files/1232/1232-0.txt"


#We are using a Library called requests to download the book (https://realpython.com/python-requests/)
print("Downloading book...")
book = requests.get(BOOK_URL)

#Turn text into text blob
book_blob = TextBlob(book.text)


print("Identifying noun phrases and building frequency dictionary...")

#Go through all noun phrases
for np in book_blob.noun_phrases:
    if np in keywords:
        keywords[np] += 1
    else:
        keywords[np] = 1


#Sort dictionary and print top 20 entries
print("Most common noun phrases...")

for np in sorted(keywords, key=keywords.get, reverse=True)[0:20]:
    print(np, keywords[np])
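
The count-then-sort pattern used above can also be written with `collections.Counter` from Python's standard library. The sketch below uses a short made-up list of phrases in place of `book_blob.noun_phrases` so that it runs on its own.

```python
from collections import Counter

# Made-up phrases standing in for the noun phrases of a document
phrases = ["new prince", "fortune", "new prince", "arms", "fortune", "new prince"]

# Counter builds the frequency dictionary in one step
counts = Counter(phrases)

# most_common(n) returns the n most frequent items with their counts
top_two = counts.most_common(2)

for phrase, count in top_two:
    print(phrase, count)
```

`Counter` is a dictionary subclass, so `counts["fortune"]` works the same way as `keywords[np]` does in the cell above.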

## A closer look at the corpus

Let's look at the January diary entries.

In [None]:
#January Entries
jan_corpus = winnie_corpus[(winnie_corpus['date'] >= '1900-01-01') & (winnie_corpus['date'] <= '1900-01-31')]

Let's see what Winnie talks about the most in the month. We can do this by extracting the _noun phrases_ in her entries. We can put them in a dictionary to count how many times a phrase is used.

In [None]:
jan_phrases = dict()

for entry in jan_corpus.entry:
    tb = TextBlob(entry)
    for np in tb.noun_phrases:
        if np in jan_phrases:
            jan_phrases[np] += 1
        else:
            jan_phrases[np] = 1

In [None]:
#Print the top 10 things she mentioned in January

for np in sorted(jan_phrases, key=jan_phrases.get, reverse=True)[0:10]:
    print(np, jan_phrases[np])



## Q6

Let's compare against the first 6 months of the year. Run the following set of cells.
What can you say about Winnie's topics over the first half of the year? Share your thoughts in the chat box.

In [None]:
#February Entries
feb_corpus = winnie_corpus[(winnie_corpus['date'] >= '1900-02-01') & (winnie_corpus['date'] <= '1900-02-28')]

feb_phrases = dict()

for entry in feb_corpus.entry:
    tb = TextBlob(entry)
    for np in tb.noun_phrases:
        if np in feb_phrases:
            feb_phrases[np] += 1
        else:
            feb_phrases[np] = 1

#Print the top 10 things she mentioned in February

for np in sorted(feb_phrases, key=feb_phrases.get, reverse=True)[0:10]:
    print(np, feb_phrases[np])

In [None]:
#March Entries
mar_corpus = winnie_corpus[(winnie_corpus['date'] >= '1900-03-01') & (winnie_corpus['date'] <= '1900-03-31')]


mar_phrases = dict()

for entry in mar_corpus.entry:
    tb = TextBlob(entry)
    for np in tb.noun_phrases:
        if np in mar_phrases:
            mar_phrases[np] += 1
        else:
            mar_phrases[np] = 1

#Print the top 10 things she mentioned in March

for np in sorted(mar_phrases, key=mar_phrases.get, reverse=True)[0:10]:
    print(np, mar_phrases[np])

In [None]:
#April Entries
april_corpus = winnie_corpus[(winnie_corpus['date'] >= '1900-04-01') & (winnie_corpus['date'] <= '1900-04-30')]

april_phrases = dict()

for entry in april_corpus.entry:
    tb = TextBlob(entry)
    for np in tb.noun_phrases:
        if np in april_phrases:
            april_phrases[np] += 1
        else:
            april_phrases[np] = 1

#Print the top 10 things she mentioned in April

for np in sorted(april_phrases, key=april_phrases.get, reverse=True)[0:10]:
    print(np, april_phrases[np])

In [None]:
#May Entries
may_corpus = winnie_corpus[(winnie_corpus['date'] >= '1900-05-01') & (winnie_corpus['date'] <= '1900-05-31')]

may_phrases = dict()

for entry in may_corpus.entry:
    tb = TextBlob(entry)
    for np in tb.noun_phrases:
        if np in may_phrases:
            may_phrases[np] += 1
        else:
            may_phrases[np] = 1

#Print the top 10 things she mentioned in May

for np in sorted(may_phrases, key=may_phrases.get, reverse=True)[0:10]:
    print(np, may_phrases[np])

In [None]:
#June Entries
june_corpus = winnie_corpus[(winnie_corpus['date'] >= '1900-06-01') & (winnie_corpus['date'] <= '1900-06-30')]

june_phrases = dict()

for entry in june_corpus.entry:
    tb = TextBlob(entry)
    for np in tb.noun_phrases:
        if np in june_phrases:
            june_phrases[np] += 1
        else:
            june_phrases[np] = 1

#Print the top 10 things she mentioned in June

for np in sorted(june_phrases, key=june_phrases.get, reverse=True)[0:10]:
    print(np, june_phrases[np])
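
The monthly cells above all repeat the same filter-and-count pattern, so it can be folded into one helper. The sketch below shows the idea with the standard library only. The entries are made up, and each one already carries a list of phrases; with the real corpus those lists would come from `TextBlob(entry).noun_phrases`. The name `top_phrases_by_month` is our own.

```python
from collections import Counter, defaultdict
from datetime import date

# Made-up (date, phrases) pairs standing in for the diary entries
entries = [
    (date(1900, 1, 3), ["school", "sleigh ride"]),
    (date(1900, 1, 9), ["school", "letter"]),
    (date(1900, 2, 2), ["church", "letter"]),
    (date(1900, 2, 14), ["church"]),
]


def top_phrases_by_month(entries, n=1):
    # One Counter per month number, created on first use
    by_month = defaultdict(Counter)
    for day, phrases in entries:
        by_month[day.month].update(phrases)
    # Keep only the n most frequent phrases for each month
    return {month: counts.most_common(n) for month, counts in by_month.items()}


monthly_top = top_phrases_by_month(entries)
for month, top in monthly_top.items():
    print(month, top)
```

Grouping by `day.month` removes the need to type out the first and last day of every month by hand.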

## Q7

Find a URL to perform some analysis. You can try to get something from:
- [CBC news](https://www.cbc.ca/news)
- [New York Times](https://www.nytimes.com/)
- The text of a tweet...
- What else?

Paste your URL into the variable defined in line 1.

Share the URL you've analyzed by posting a link in the chat box.

In [None]:
URL = "https://www.cbc.ca/news/science/wikipedia-bias-1.6129073"

res = requests.get(URL)
html_page = res.content
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(string=True)

output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head',
    'input',
    'script',
    'style',
]

for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)

eTB = TextBlob(output)

#Sentiment
print("Sentiment:\n")
print(eTB.sentiment)


#Noun Phrases
print("\nNoun Phrases:\n")
ex_phrases = dict()

#We go through and count how many times a Noun-Phrase shows up
for np in eTB.noun_phrases:
    if np in ex_phrases:
        ex_phrases[np] += 1
    else:
        ex_phrases[np] = 1

#We'll print the noun-phrase and times they show up
#We'll stop when the noun-phrases only show up once

for np in sorted(ex_phrases, key=ex_phrases.get, reverse=True):
    if ex_phrases[np] == 1:
        break
    print(np, ex_phrases[np])
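
To see what the tag filtering in the cell above is doing, here is a small sketch built on Python's standard `html.parser` module instead of BeautifulSoup. It keeps only text that sits outside of tags such as `script` and `style`. The `VisibleText` class and the sample HTML are our own illustration, and real pages will be messier than this.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    # Tags whose contents should not be treated as readable text
    SKIP = {"script", "style", "noscript", "head"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped tags we are currently inside
        self.chunks = []     # pieces of visible text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Only keep non-blank text found outside of skipped tags
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


sample_html = (
    "<html><head><title>T</title><style>p{}</style></head>"
    "<body><p>Hello <b>world</b></p><script>var x=1;</script></body></html>"
)

parser = VisibleText()
parser.feed(sample_html)
parser.close()

visible = " ".join(parser.chunks)
print(visible)
```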


# And now for something completely different

## A very basic classifier

We looked at how to score the sentiment of a corpus. We can also create a classifier of our own if we provide training and testing data. In our example we are going to look at whether some statements about Twitter are subjective ( _sub_ ) or objective ( _obj_ ).

In [None]:
train = [
    ('I think Twitter is stupid', 'sub'),
    ('Lots of people spend too much time on Twitter.', 'obj'),
    ('Twitter is a waste of time.', 'sub'),
    ('Twitter can be used to find information.', 'obj'),
    ('Many celebrities have Twitter accounts.', 'obj'),
    ('I think there is too much misinformation on Twitter', 'sub'),
    ("I don't like Twitter.", 'sub'),
    ("Twitter is the best ever", 'sub'),
    ('Twitter is great because all of my friends use it', 'sub'),
    ('Twitter is a fortune 500 company', 'obj')
    ]


test = [
     ('Twitter is a company', 'obj'),
     ("You can't communicate well with such short sentences", 'sub'),
     ("Twitter is disruptive to society", 'sub'),
     ("Over 500 million people use Twitter", 'obj'),
     ('A Twitter message can have 280 characters', 'obj'),
     ("A Twitter message is always stupid", 'sub')
    ]


In [None]:
#Build the classifier and run the training data through it
cl = NaiveBayesClassifier(train)

In [None]:
#Classify each item in the test set to see how well the classifier works.

for item in test:
    print("Item: ",item[0])
    print("Guess: \t\t",cl.classify(item[0]))
    print("Actual: \t",item[1],"\n")

print("\nAccuracy of guesses", cl.accuracy(test))



In [None]:
# We can have the classifier tell us some things it has noticed in the samples
cl.show_informative_features()
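
If you are curious how a Naive Bayes classifier reaches a decision, here is a small from-scratch sketch. It is a simplified word-count variant with add-one smoothing, written for illustration; it is not the implementation that TextBlob uses, and the training sentences are made up.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    # Count how often each label occurs and how often each word occurs per label
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify_nb(model, text):
    label_counts, word_counts, vocab = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        # Start from the log prior: how common the label is overall
        score = math.log(n / total_examples)
        # Add-one smoothing so unseen words never give a zero probability
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training sentences
toy_train = [
    ("i think it is stupid", "sub"),
    ("i think it is great", "sub"),
    ("it has many users", "obj"),
    ("it is a company", "obj"),
]

model = train_nb(toy_train)
guess_1 = classify_nb(model, "i think it is bad")
guess_2 = classify_nb(model, "it is a company with users")
print(guess_1, guess_2)
```

Words like "i" and "think" only appear in the subjective examples, so they pull the first sentence towards `sub`. This is the same kind of pattern that `show_informative_features()` reports.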

## Q8

As our last activity, try to create your own classifier in the next code cell. You'll just need to provide examples for the classifier to train on.

In [None]:
train_2 = [
    ('I love this sandwich.', 'pos'),
    ('','pos'), #add a positive sentence
    ('','pos'), #add a positive sentence
    ('','pos'), #add a positive sentence
    ('I do not like this restaurant', 'neg'),
    ('','neg'), #add a negative sentence
    ('','neg'), #add a negative sentence
    ('','neg')  #add a negative sentence
    ]


cl_2 = NaiveBayesClassifier(train_2)

print("Our Important features:")
cl_2.show_informative_features()

Run the following cell as often as you'd like to have the classifier attempt more sentences.

In [None]:
print("\nInput a sentence you wish to classify")
test_sentence = input()
print("Classification category: ", cl_2.classify(test_sentence))

# Congrats!

You have now learned the basics of Text Analysis using Python and TextBlob.

We also offer a workshop called [Advanced Text Analysis](https://brockdsl.github.io/Advanced_Text_Analysis_with_Python/) if you'd like to dig into more details on topic modelling.

All of our workshops are posted on [Eventbrite](https://brockdsl.eventbrite.com/).


# More Links


- [Sentiment Analysis of Tweets Using Python](https://www.greycampus.com/blog/data-science/sentiment-analysis-on-twitter-tweets-using-python) - a case study that uses twitter data to generate sentiment values.


- [VADER](https://github.com/cjhutto/vaderSentiment#python-demo-and-code-examples) - (Valence Aware Dictionary and sEntiment Reasoner) is a sentiment library designed to be used for social media that can better reflect the sentiment of slang, emoticons and hashtags.


- [Topic Modelling with gensim](https://towardsdatascience.com/topic-modeling-with-gensim-a5609cefccc) - The next step in your understanding of text analysis should be topic modelling, where we try to determine what topics are in a corpus. It is a bit too complex to tackle in this workshop.


- [Kaggle](https://www.kaggle.com/search?q=text+analysis) - If you do data science using Python in a notebook, this is the place for you.


- [Python for Librarians](https://libraryjuiceacademy.com/shop/course/270-python-for-librarians/) - An upcoming workshop that will look at many interesting pieces of Python.