# Week 3 - Text Analysis & Working with Text Files

(Be sure to copy to drive)

Text data is a bit different from numeric data. With numbers, we can quickly compute an average, or find the highest and lowest values in a range, to get a sense of what we are dealing with. We can't really do that with text. Instead, we'll focus on some tools that you can use to analyze text, starting with a library called [TextBlob](https://textblob.readthedocs.io/en/dev/).

In [None]:
#Load up our libraries
from textblob import TextBlob
from google.colab import drive

#these should look familiar
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import requests

#Some extra libraries we'll need for text analysis
import nltk
nltk.download('punkt')
nltk.download('brown')
nltk.download('punkt_tab')


#Connect to Gdrive
drive.mount('/content/gdrive')

print("Libraries and Drive Ready!")

# A text dataset

We are going to use a digital humanities example to explore some of what is possible. We'll begin with a diary kept by a young girl in the year 1901. Her name is Winnie Beam. Her [diary](http://hdl.handle.net/10464/7282) has been digitized, as well as turned into [data](https://docs.google.com/spreadsheets/d/17FO_a6jcLgycwDd7uYsrpPqcIterjquY2UyC7QAZmWs/edit?usp=sharing).

We are going to read in a CSV version of the data and put it into a dataframe.

In [None]:
#Run this cell to load up our 'corpus'

winnie_corpus = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vQcj2YkYKMn3HZo-5yfKw65kVENg_RZLRTrBxvJeRPB46k0z_BqIUD5ecuoyEmEGuCJ79ZyP-8rGeIv/pub?gid=0&single=true&output=csv', header = None)
winnie_corpus.columns = ["page","date","entry"]
winnie_corpus['date'] = pd.to_datetime(winnie_corpus['date'])
winnie_corpus['entry'] = winnie_corpus.entry.astype(str)

#preview our top entries
winnie_corpus.head()

# Sentiment Analysis


We can analyze the _sentiment_ of the text (see more [details](https://planspace.org/20150607-textblob_sentiment/)). The next cell demonstrates this:

In [None]:
happy_sentence = "Python is the best programming language ever!"
sad_sentence = "Python is difficult to use, and very frustrating"


print("Sentiment of happy sentence ", TextBlob(happy_sentence).sentiment)
print("Sentiment of sad sentence ", TextBlob(sad_sentence).sentiment)

# polarity ranges from -1 to 1.
# subjectivity ranges from 0 to 1.
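
To build some intuition for where a polarity number could come from, here is a toy lexicon-based scorer. It is only a sketch: the word list `TOY_LEXICON` and the function `toy_polarity` are made up for illustration and are far simpler than what TextBlob actually does (see the details link above). The idea is just that individual words carry scores, and a sentence's polarity is an average of the scores of the words it recognizes.

```python
import re

# Hypothetical, hand-made lexicon: word -> polarity between -1 and 1.
# These numbers are invented for this example, not taken from TextBlob.
TOY_LEXICON = {
    "best": 1.0,
    "good": 0.7,
    "difficult": -0.5,
    "frustrating": -0.4,
    "worst": -1.0,
}

def toy_polarity(text):
    """Average the scores of the words found in TOY_LEXICON (0.0 if none)."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)

print(toy_polarity("Python is the best programming language ever!"))
print(toy_polarity("Python is difficult to use, and very frustrating"))
```

Notice that a sentence with no known words scores 0.0, which is one reason a "neutral" score does not always mean the text is truly neutral.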


**Q1** Try a couple of different sentences in the code cell below. See if you can write one sentence that scores -1 for polarity and another that scores 1. Then see if you can minimize the subjectivity of your sentence.
(We can create a multi-line string of text by putting it in triple quotes, as in the following cell.)


In [None]:
test_text = """

Will this be a happy sentence or a sad one? Only TextBlob will tell!


"""
print("Score of test sentence is ", TextBlob(test_text).sentiment)

# Adding Sentiment to our Diary entries

This next cell scores each diary entry. We loop through each entry and calculate the two scores that represent its sentiment. After all the scores are computed, we add them to the dataframe as two new columns.

In [None]:
#Apply sentiment analysis from TextBlob

polarity = []
subjectivity = []


for day in winnie_corpus.entry:
    score = TextBlob(day)
    polarity.append(score.sentiment.polarity)
    subjectivity.append(score.sentiment.subjectivity)

winnie_corpus['polarity'] = polarity
winnie_corpus['subjectivity'] = subjectivity


#Let's look at our new top entries
winnie_corpus.head()


# Entries

We can now see that scoring has been added. Run the next cell a few times to see different random entries.

In [None]:
winnie_corpus.sample(5)

# Now what?

We can do many things with the new data. For now, let's just draw a line graph showing the changes in _polarity_ in her diary entries.

In [None]:
#Let's graph out the sentiment as it changes day to day.

plt.plot(winnie_corpus["date"],winnie_corpus["polarity"])
plt.title("Polarity sentiment of Winnie's Diary Entries")
plt.show()
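
Day-to-day scores can jump around a lot, which makes a trend hard to see. One common option is to plot a rolling average instead of the raw values. The sketch below uses a tiny made-up dataframe called `demo` as a stand-in for `winnie_corpus` so that it runs on its own; the window of 3 is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up stand-in for winnie_corpus, only so this example runs on its own.
demo = pd.DataFrame({
    "date": pd.date_range("2000-01-01", periods=6, freq="D"),
    "polarity": [0.0, 0.3, 0.6, -0.3, 0.0, 0.3],
})

# Each value becomes the mean of itself and the two rows before it.
# The first two rows have no full window, so they come out as NaN.
demo["polarity_smooth"] = demo["polarity"].rolling(window=3).mean()

print(demo)
```

In the notebook, the same idea would be to pass `winnie_corpus["polarity"].rolling(window=7).mean()` to `plt.plot` in place of the raw column, trying different window sizes to see how the picture changes.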

**Q2** Modify the following code cell to create a line graph of the _subjectivity_ of her diary entries for the year.

In [None]:
plt.plot(winnie_corpus["date"],winnie_corpus[])
plt.title("CHANGEME")
plt.ylabel("CHANGEME")
plt.xlabel("CHANGEME")
plt.show()


# Closer look?

Let's take a closer look at the really high _polarity_ sentiment entries to see what is going on.

In [None]:
#Top Five
winnie_corpus.sort_values(by = 'polarity', ascending = False).head(5)

In [None]:
#Bottom Five (lowest polarity)
winnie_corpus.sort_values(by = 'polarity', ascending = True).head(5)

**Q3** Do you agree with the polarity scores that TextBlob assigns to these diary entries? Why or why not? Feel free to add some notes into the following text cell.

I think that the sentiment scoring is...

# Noun Phrases


We can get a good idea of what a corpus is about by looking at the different nouns that show up in it. Nouns that show up a lot give us an idea of the contents of the text. TextBlob can do this for us. Run the cell below a few times to grab random entries in the data and to see what noun phrases they use.

In [None]:
#We need a library to get Python to pick random values
import random
#randint includes both endpoints, so the last valid position is len - 1
random_entry_number = random.randint(0, len(winnie_corpus) - 1)

#We finally pick a random entry here
bit_of_corpus = TextBlob(winnie_corpus.iloc[random_entry_number]["entry"])

print("page:",winnie_corpus.iloc[random_entry_number]['page'])
print("date: ", winnie_corpus.iloc[random_entry_number]['date'])
print("entry: \n", winnie_corpus.iloc[random_entry_number]['entry'])

print("---")
print("Noun Phrases found")
print("---")
for np in bit_of_corpus.noun_phrases:
    print(np)

**Q4** What do you think about the Noun Phrase identification? Is it useful or not?



I think Noun Phrase...

# Noun Phrases for the Diary

Now let's generate the noun phrases for January's entries.

In [None]:
#we use some pandas work to just grab the January entries
#stretch back into your memory to think about conditionals again
jan_corpus = winnie_corpus[(winnie_corpus['date'] >= '1900-01-01') & (winnie_corpus['date'] <= '1900-01-31')]

jan_phrases = dict()

for entry in jan_corpus.entry:

    tb = TextBlob(entry)
    #we create a dictionary that will hold the noun phrases
    #if it is the first time we see this np we put it in the dictionary
    #if not, we must have a count already, so we increase that by one
    for np in tb.noun_phrases:
        if np in jan_phrases:
            jan_phrases[np] += 1
        else:
            jan_phrases[np] = 1

#Print the top 10 things she mentioned in January

for np in sorted(jan_phrases, key=jan_phrases.get, reverse=True)[0:10]:
    print(np, jan_phrases[np])
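
The "check if it is in the dictionary, otherwise start at 1" pattern above is worth understanding, but Python's standard library also has a ready-made tool for counting: `collections.Counter`. The sketch below shows the same tallying idea. The list `phrases_by_entry` is made up for illustration and stands in for what `tb.noun_phrases` would give you for each entry.

```python
from collections import Counter

# Made-up noun phrases, one inner list per diary entry.
phrases_by_entry = [
    ["sunday school", "fine day"],
    ["sunday school", "new book"],
    ["sunday school"],
]

phrase_counts = Counter()
for phrases in phrases_by_entry:
    # update() adds one to the count of every phrase in the list
    phrase_counts.update(phrases)

# most_common(n) returns the n highest counts, already sorted
for phrase, count in phrase_counts.most_common(2):
    print(phrase, count)
```

A `Counter` behaves like a dictionary, so `phrase_counts["sunday school"]` works the same way as `jan_phrases[np]` does above.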



**Q5** Modify the next series of cells to generate noun phrases for the next 5 months of the year.

In [None]:
#February Entries
feb_corpus = winnie_corpus[(winnie_corpus['date'] >= '') & (winnie_corpus['date'] <= '')]

feb_phrases = dict()

for entry in feb_corpus.entry:
    tb = TextBlob(entry)
    for np in tb.noun_phrases:
        if np in feb_phrases:
            feb_phrases[np] += 1
        else:
            feb_phrases[np] = 1

#Print the top 10 things she mentioned in February

for np in sorted(feb_phrases, key=feb_phrases.get, reverse=True)[0:10]:
    print(np, feb_phrases[np])

In [None]:
#March Entries
mar_corpus = winnie_corpus[(winnie_corpus['date'] >= '') & (winnie_corpus['date'] <= '')]


mar_phrases = dict()

for entry in mar_corpus.entry:
    tb = TextBlob(entry)
    for np in tb.noun_phrases:
        if np in mar_phrases:
            mar_phrases[np] += 1
        else:
            mar_phrases[np] = 1

#Print the top 10 things she mentioned in March

for np in sorted(mar_phrases, key=mar_phrases.get, reverse=True)[0:10]:
    print(np, mar_phrases[np])

In [None]:
#April Entries
april_corpus = winnie_corpus[(winnie_corpus['date'] >= '') & (winnie_corpus['date'] <= '')]

april_phrases = dict()

for entry in april_corpus.entry:
    tb = TextBlob(entry)
    for np in tb.noun_phrases:
        if np in april_phrases:
            april_phrases[np] += 1
        else:
            april_phrases[np] = 1

#Print the top 10 things she mentioned in April

for np in sorted(april_phrases, key=april_phrases.get, reverse=True)[0:10]:
    print(np, april_phrases[np])

In [None]:
#May Entries
may_corpus = winnie_corpus[(winnie_corpus['date'] >= '') & (winnie_corpus['date'] <= '')]

may_phrases = dict()

for entry in may_corpus.entry:
    tb = TextBlob(entry)
    for np in tb.noun_phrases:
        if np in may_phrases:
            may_phrases[np] += 1
        else:
            may_phrases[np] = 1

#Print the top 10 things she mentioned in May

for np in sorted(may_phrases, key=may_phrases.get, reverse=True)[0:10]:
    print(np, may_phrases[np])

In [None]:
#June Entries
june_corpus = winnie_corpus[(winnie_corpus['date'] >= '') & (winnie_corpus['date'] <= '')]

june_phrases = dict()

for entry in june_corpus.entry:
    tb = TextBlob(entry)
    for np in tb.noun_phrases:
        if np in june_phrases:
            june_phrases[np] += 1
        else:
            june_phrases[np] = 1

#Print the top 10 things she mentioned in June

for np in sorted(june_phrases, key=june_phrases.get, reverse=True)[0:10]:
    print(np, june_phrases[np])
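
Once you have worked through the cells above by hand, it is worth knowing that pandas can also select a month directly through the `.dt` accessor on a datetime column, which avoids typing start and end dates. This is a sketch on a small made-up dataframe called `demo`, not on the diary itself.

```python
import pandas as pd

# Made-up stand-in for winnie_corpus, only so this example runs on its own.
demo = pd.DataFrame({
    "date": pd.to_datetime(["2000-01-15", "2000-02-03", "2000-02-20", "2000-03-01"]),
    "entry": ["first", "second", "third", "fourth"],
})

# .dt.month gives the month number (1 = January ... 12 = December) of each date.
feb_demo = demo[demo["date"].dt.month == 2]

print(feb_demo)
```

Note that this ignores the year. If a dataset spans more than one year, you would also need a condition on `.dt.year`.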

# Changes in topic


**Q6**
Take a moment to look at what is printed for each month. Can you get a sense of what Winnie is writing about over the months? Or how those topics change?

What I can tell from looking at Noun Phrases in the diary is...

# Text and Text Files

Our week 2 & week 3 warmup material introduced some ideas about working with files in Google Drive and in our Colab environment. Since we are dealing with text analysis right now we'll take a moment to talk about text files as well.

Sometimes we want to take a string variable and write it to a file so that we can use it at a later time.

We'll also automatically grab content from the web using the [requests](https://pypi.org/project/requests/) library, just like we did in week 1. We are going to grab a book from the [Project Gutenberg](https://www.gutenberg.org/) site as our example.

This is technically an example of screen scraping, i.e., we are programmatically grabbing content from the web using an automated tool. This is the type of thing that AI bots are doing, and arguably it is [ruining](https://library.unc.edu/news/library-it-vs-the-ai-bots/) the web.

In [None]:
#We'll be using the H.G. Wells book - The Invisible Man (https://www.gutenberg.org/ebooks/5230)
#but we'll focus on the plain text version
book_text_url = "https://www.gutenberg.org/cache/epub/5230/pg5230.txt"

response = requests.get(book_text_url)

In [None]:
#we now have a string variable (response.text) which holds the whole text of the book
response.text
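
If you scroll through the output, you will see that the file contains more than the book itself: Project Gutenberg wraps the text in a header and a license footer. The sketch below is one way to keep only the part in between. It assumes the file has marker lines that begin with `*** START OF` and `*** END OF`; check your own file, since the exact wording of these lines may differ. The function name `strip_gutenberg_boilerplate` and the `sample` text are made up for this example.

```python
def strip_gutenberg_boilerplate(text):
    """Return the text between the '*** START OF' and '*** END OF' marker lines.

    If either marker is missing, the text is returned unchanged.
    """
    lines = text.splitlines()
    start = None
    end = None
    for i, line in enumerate(lines):
        if line.startswith("*** START OF") and start is None:
            start = i
        elif line.startswith("*** END OF"):
            end = i
            break
    if start is None or end is None or end <= start:
        return text
    # Keep only the lines strictly between the two markers.
    return "\n".join(lines[start + 1:end]).strip()

# A tiny made-up file in the same general shape as a Gutenberg text.
sample = (
    "License information...\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter I\n"
    "The story itself.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "More license information...\n"
)

print(strip_gutenberg_boilerplate(sample))
```

In the notebook you could try `strip_gutenberg_boilerplate(response.text)`. This matters for text analysis because words in the license text would otherwise be counted as if they were part of the book.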

In [None]:
#File I/O in Python is a whole week of content on its own
#but quickly, the 'w' means we are writing to the file
with open('invisible_man.txt', 'w', encoding='utf-8') as f:
    f.write(response.text)
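
Writing is only half of the story. To get text back out of a file, open it with `'r'` (read) instead of `'w'`. The sketch below writes a short made-up string to a temporary folder and reads it back, so it will not touch any of your own files.

```python
import os
import tempfile

# Write to a temporary folder so this example does not touch your own files.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write("First line\nSecond line\n")

# 'r' means we are reading; f.read() returns the whole file as one string.
with open(path, "r", encoding="utf-8") as f:
    contents = f.read()

print(contents)
```

The same pattern, with `'invisible_man.txt'` as the file name, would load the saved book back into a string variable.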

In [None]:
#Shell command (the ! prefix runs it outside of Python) to display contents of folder
!ls -l


# File I/O? More to go!

I encourage you to look up more tutorials on working with files in Python to get a full sense of what is possible.

Check out your drive through the [web](https://drive.google.com/), navigate to your `LibraryJuicePython` folder and have a look at what is there now.


# One final activity: Automatic Keyword Generator

Let's put all of what we have learned together to create an automatic keyword generator that identifies noun phrases in a book from Project Gutenberg.

We are going to be looking at the book [The Prince](https://en.wikipedia.org/wiki/The_Prince).


In [None]:
keywords = dict()

# We are using The Prince - https://www.gutenberg.org/ebooks/1232
book_url = "https://www.gutenberg.org/files/1232/1232-0.txt"
book_title = "The Prince"


print("Downloading book...")
book = requests.get(book_url)

#save a copy of the downloaded book as a text file
with open(book_title+'.txt', 'w', encoding='utf-8') as f:
    f.write(book.text)

#Turn text into text blob
book_blob = TextBlob(book.text)


print("Identifying noun phrases and building frequency dictionary...")

#Go through all noun phrases
for np in book_blob.noun_phrases:
    if np in keywords:
        keywords[np] += 1
    else:
        keywords[np] = 1

noun_phrases = ""
#Sort dictionary and print top 20 entries
print("Most common noun phrases...")

for np in sorted(keywords, key=keywords.get, reverse=True)[0:20]:
    noun_phrases += np + ","+str(keywords[np])+"\n"
    print(np, keywords[np])

with open(book_title+'_keywords.txt', 'w', encoding='utf-8') as f:
    f.write(noun_phrases)
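
The cell above builds each `phrase,count` line by gluing strings together. That works, but it would produce a confusing file if a phrase ever contained a comma. Python's built-in `csv` module handles that case by adding quotes where needed. This is a sketch using made-up counts in `demo_keywords` rather than the real `keywords` dictionary, and it writes to an in-memory buffer instead of a real file.

```python
import csv
import io

# Made-up counts standing in for the `keywords` dictionary.
demo_keywords = {"prince": 12, "men, arms": 3, "fortune": 7}

# io.StringIO behaves like an open text file but keeps everything in memory.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["noun_phrase", "count"])
for phrase in sorted(demo_keywords, key=demo_keywords.get, reverse=True):
    writer.writerow([phrase, demo_keywords[phrase]])

print(buffer.getvalue())
```

To write to a real file instead, you would pass `csv.writer` a file opened with `open(..., 'w', newline='', encoding='utf-8')` in place of `buffer`.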

In [None]:
#Let's look again at what is in our folder
!ls

In [None]:
#Let's move all of the files we created for this exercise to our usual folder
!mv *.txt /content/gdrive/MyDrive/LibraryJuicePython

**Q7** Try running the automatic keyword generator on a different book from Project Gutenberg, perhaps something you have already read. Do you think it gives you a good idea of what the book is about?


You will need to change the values for `book_url` and `book_title` on lines 4 and 5 of the keyword generator cell.


# Moral of the story

Text analysis lets you do a bunch of different things. We have just scratched the surface here.