Text Analysis

... in which we open a file for reading; <br>
count the number of total words and unique words;<br>
and produce word counts using a dict.

In this first cell, there are two ways to access the text data:
- (a) in a local file on disk
- (b) from a remote website


In [53]:
## (A) if working with a local file on your disk: 
# open a file for reading:
# filename = 'time.txt' # Time
# # filename = 'romeo.txt' # Romeo and Juliet
# filename = 'grapes.txt' # Grapes of Wrath
# with open(filename, 'rb') as f: # binary mode: `text` is bytes, like `response.content` in (B)
#     text = f.read() # read the whole file at once

## (B) working with a remote file, download from WWW:
import requests 
# url_time = "https://raw.githubusercontent.com/bu-cds-omds/dx602-examples/refs/heads/main/data/time.txt" # Time
# response = requests.get(url_time)
url_romeo = "https://raw.githubusercontent.com/bu-cds-omds/dx602-examples/refs/heads/main/data/romeo.txt"  # Romeo and Juliet
response = requests.get(url_romeo)
response.raise_for_status() # stop here if the download failed
# url_grapes = "https://raw.githubusercontent.com/bu-cds-omds/dx602-examples/refs/heads/main/data/grapes.txt" # Grapes of Wrath

text = response.content # raw bytes; each word is decoded to str in the next cell
# print(text)
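The two access routes above can also be folded into one helper that always returns a `str`. This is only a sketch under a few assumptions: the name `load_text` is hypothetical, it uses the standard library's `urllib.request` instead of `requests`, and it assumes the files are UTF-8 encoded.

```python
from pathlib import Path
from urllib.request import urlopen


def load_text(source: str, encoding: str = "utf-8") -> str:
    """Return the contents of a local file or an http(s) URL as a str.

    `load_text` is a hypothetical helper, not part of the notebook.
    """
    if source.startswith(("http://", "https://")):
        # remote file: download the raw bytes, then decode them
        with urlopen(source) as response:
            return response.read().decode(encoding)
    # local file: read and decode in one step
    return Path(source).read_text(encoding=encoding)


# usage (not executed here):
# text = load_text('romeo.txt')
# text = load_text(url_romeo)
```

Because the helper decodes up front, `text.split()` would then produce `str` words directly, and the per-word `decode('utf-8')` step would no longer be needed.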



Let's examine this text:
- Find total count of words
- Use a dictionary to count unique words
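The two steps listed above can be sketched on a toy input. The sample sentence is made up for illustration, and the single-pass `dict.get` pattern is one common way to count, not necessarily the approach taken in the notebook cell.

```python
# toy input standing in for the `words` list (hypothetical sample text)
sample_words = "to be or not to be".split()

total = len(sample_words)  # total number of words

word_counts = {}
for w in sample_words:
    # dict.get returns 0 the first time a word is seen
    word_counts[w] = word_counts.get(w, 0) + 1

print(f"{total} total words, {len(word_counts)} distinct words")
print(word_counts)
# 6 total words, 4 distinct words
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Counting in a single pass avoids rescanning the whole list once per distinct word, which matters as the text grows.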


In [54]:
# split into a list of words:
words = text.split() # split based on white space
# str.strip() # remove whitespace
# words = [str(w.lower()) for w in words]
words = [w.decode('utf-8') for w in words]
# print(words)
print(f"Found {len(words)} total words.")

# number of distinct words:
distinct_words = set(words)
print(f"Found {len(distinct_words)} distinct total words.")

# word counts: how many times does each word occur?
# go through each word in `distinct_words`, and count its occurrences in `words`
word_counts = {} # empty dictionary
for w in distinct_words:
    # store the count with the word:
    word_counts[w] = words.count(w)

# show the dictionary of word counts:
#print(word_counts)

# which words occur most frequently?
pairs = list(word_counts.items())

pairs = sorted(pairs, key=lambda x: x[1], reverse=True)
# print(pairs)
# print the 10 most frequently occurring words:
for word, count in pairs[:10]:
    print(f'word: {word}, count: {count}')



Found 6276 total words.
Found 2467 distinct total words.
word: the, count: 180
word: of, count: 144
word: I, count: 134
word: and, count: 117
word: a, count: 102
word: to, count: 92
word: in, count: 76
word: my, count: 72
word: is, count: 67
word: you, count: 60
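The counts above treat each token exactly as it appears after splitting on whitespace, so `The` and `the`, or `end` and `end.`, are tallied as different words. A minimal sketch of one way to normalize before counting, using a made-up sentence rather than the play and `collections.Counter` from the standard library:

```python
import string
from collections import Counter

# hypothetical sample text, not taken from the source files
sample = "The cat saw the dog. The dog saw the cat!"

# lowercase each token and trim punctuation from both ends
tokens = [w.strip(string.punctuation).lower() for w in sample.split()]
tokens = [w for w in tokens if w]  # drop tokens that were pure punctuation

counts = Counter(tokens)
for word, count in counts.most_common(3):
    print(f'word: {word}, count: {count}')
```

This kind of normalization suits counting, but the Markov model later in this notebook relies on trailing punctuation to detect sentence ends, so it would need the original tokens.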


**First-order Markov model**

... in which we track words (from original text) and the list of words that can follow each one.

In [56]:
# use a dictionary to hold our Markov model
model = {}

# starting state: either the first word of the text, or a sentence-start symbol
# current_word = words[0]
current_word = "$" # special symbol for "start of sentence"

# loop over all words:
# for next_word in words[1:]:
for next_word in words:

    # add this sequence to our dictionary
    # building a list of words that can follow `current_word`
    # print(f'{current_word} -> {next_word}')

    # check: have we seen `current_word` before?
    if current_word in model:
        # append to the list of words that can follow `current_word`
        follow_words = model[current_word]
        follow_words.append(next_word)
    else:
        # insert into the dictionary (first time we see `current_word`)
        model[current_word] = [next_word]

    # when we get to the end of a sentence, the next word is a sentence-starting word
    # if next_word[-1] in '.?!': # example: 'to end.'
    if '.' in next_word or '!' in next_word or '?' in next_word: 
        current_word = '$' # sentence-starting word follows
    else:
        # advance to the next word:
        current_word = next_word

# print(model)
# for w in model:
#     print(f'{w}: {model[w]}')
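To see what the model looks like, it helps to build one from a tiny input. The sketch below follows the same logic as the cell above but wraps it in a hypothetical function, `build_model`, and uses `collections.defaultdict` so the "have we seen this word?" check is handled automatically. The sample sentence is made up.

```python
from collections import defaultdict

START = "$"  # same sentence-start symbol as in the cell above


def build_model(words):
    """Map each word to the list of words observed right after it."""
    model = defaultdict(list)
    current_word = START
    for next_word in words:
        # a missing key starts out as an empty list, so append always works
        model[current_word].append(next_word)
        # sentence-ending punctuation means a sentence-starting word follows
        ends_sentence = any(ch in next_word for ch in ".!?")
        current_word = START if ends_sentence else next_word
    return dict(model)


toy_model = build_model("I am here. I am not there.".split())
print(toy_model)
# {'$': ['I', 'I'], 'I': ['am', 'am'], 'am': ['here.', 'not'], 'not': ['there.']}
```

Note that `'here.'` and `'there.'` never become keys: after a sentence-ending word, the current word resets to `$`.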

**Generating Text (unintelligently)**

... in which we use our Markov model of the text to generate new text.
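Because the model stores every observed follower, duplicates included, picking uniformly from a follower list selects each word in proportion to how often it followed in the text. A small sketch with a made-up follower list shows the implied probabilities:

```python
from collections import Counter

# hypothetical follower list for a single word
follow_words = ["am", "am", "will", "am"]

counts = Counter(follow_words)
total = len(follow_words)

# probability of each follower under a uniform pick from the list
probabilities = {w: c / total for w, c in counts.items()}
print(probabilities)
# {'am': 0.75, 'will': 0.25}
```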

In [58]:
import random

# start with sentence-starting word:
current_word = '$'

# do this N times to generate N words
for i in range(25):

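    # find the words that can follow `current_word`
    # (fall back to sentence starters if nothing ever followed `current_word`,
    # which can happen for the very last word of the text)
    follow_words = model.get(current_word, model['$'])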

    # pick one word from that list -- unintelligently!!
    next_word = random.choice(follow_words)
    print(next_word, end=' ')

    # advance the `current_word`
    # when we get to the end of a sentence, the next word is a sentence-starting word
    # if next_word[-1] in '.?!': # example: 'to end.'
    if '.' in next_word or '!' in next_word or '?' in next_word: 
        current_word = '$' # sentence-starting word follows
    else:
        # advance to the next word:
        current_word = next_word

CITIZENS: Clubs, bills, and they do with sweetmeats tainted are: Sometime she be fourteen; That dreamers often lie. saw you do, sir, in good mark-man!- 
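Each run of the cell above produces different text. A sketch of the same loop wrapped in a function makes the output reproducible and returns it as a string. The names `generate`, `n_words`, `seed`, and `toy_model` are all hypothetical, and the toy model stands in for the one built from the play.

```python
import random


def generate(model, n_words=25, start="$", seed=None):
    """Generate `n_words` words by walking the follower lists in `model`."""
    rng = random.Random(seed)  # a fixed seed gives repeatable output
    current_word = start
    output = []
    for _ in range(n_words):
        # fall back to sentence starters if `current_word` has no followers
        follow_words = model.get(current_word) or model[start]
        next_word = rng.choice(follow_words)
        output.append(next_word)
        # after sentence-ending punctuation, a sentence-starting word follows
        if any(ch in next_word for ch in ".!?"):
            current_word = start
        else:
            current_word = next_word
    return " ".join(output)


# toy model for illustration only
toy_model = {"$": ["I"], "I": ["am"], "am": ["here.", "not"], "not": ["there."]}
print(generate(toy_model, n_words=6, seed=0))
```

Passing the same `seed` twice yields the same text, which is handy when comparing models built from different source files.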