This assignment walks you through the steps required to perform an advanced frequency analysis on words in a given text source. It includes the following steps:

Convert a text file into a string.
Split a string into words, excluding punctuation marks.
Remove stop words from the list of words.
Lemmatize the words in the list so that all words are stem words.
Count the frequency of each stem word and store the results in a dictionary.
Convert the dictionary to a JSON file.
You may use any text file you wish, including files used in lessons and exercises in this course, files downloaded from a website like Project Gutenberg, or a file you create specifically for this assignment. After completing the activity, you should test it using at least one other file.

You may create this as a single script that includes all steps, or you can split the steps into individual scripts.

Step 1: Convert a Text File to a String
Create a function that takes as input the path to a text file and returns the contents of the file as a string.

In [3]:
def read_text_file(file_path):
    # Use a with block so the file is closed even if reading fails
    with open(file_path, "r") as f:
        text = f.read()
    return text

text = read_text_file("FileIO-DataFiles/test_file.txt")
print(text)

Hello World!
Hello World!
Hello World!


Step 2: Split the String into Words
Create a function that takes as input a string and returns a list of strings representing the words in the text file.

The function should divide the string into words based on any type of punctuation.

The function should convert all words into lowercase.

In [13]:
import re
import string
def read_text_file(file_path):
    with open(file_path, "r") as f:
        text = f.read()
    return text

def split_text(text):
    words = re.findall(r"[\w']+|[.,!?;]", text.lower())
    punctuation_list =  list(string.punctuation)
    w_clean = list()
    
    for word in words:
        new_word = ""
        for char in word:
            if char not in punctuation_list:
                new_word += char
            
        if len(new_word) > 0:
            w_clean.append(new_word)
        
    return w_clean
 
text = read_text_file("FileIO-DataFiles/flatland01.txt")
words = split_text(text)
print(words)

['flatland', 'part', '1', 'this', 'world', 'section', '1', 'of', 'the', 'nature', 'of', 'flatland', 'i', 'call', 'our', 'world', 'flatland', 'not', 'because', 'we', 'call', 'it', 'so', 'but', 'to', 'make', 'its', 'nature', 'clearer', 'to', 'you', 'my', 'happy', 'readers', 'who', 'are', 'privileged', 'to', 'live', 'in', 'space', 'imagine', 'a', 'vast', 'sheet', 'of', 'paper', 'on', 'which', 'straight', 'lines', 'triangles', 'squares', 'pentagons', 'hexagons', 'and', 'other', 'figures', 'instead', 'of', 'remaining', 'fixed', 'in', 'their', 'places', 'move', 'freely', 'about', 'on', 'or', 'in', 'the', 'surface', 'but', 'without', 'the', 'power', 'of', 'rising', 'above', 'or', 'sinking', 'below', 'it', 'very', 'much', 'like', 'shadows', 'only', 'hard', 'with', 'luminous', 'edges', 'and', 'you', 'will', 'then', 'have', 'a', 'pretty', 'correct', 'notion', 'of', 'my', 'country', 'and', 'countrymen', 'alas', 'a', 'few', 'years', 'ago', 'i', 'should', 'have', 'said', 'my', 'universe', 'but', 'n
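
The character-by-character loop in split_text can be condensed. The following is a minimal sketch, not the required solution: the name split_text_compact and the module-level translation table are assumptions introduced here for illustration. It keeps the same tokenizing idea (runs of word characters and apostrophes) and then deletes any remaining punctuation with str.translate.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def split_text_compact(text):
    """Return the lowercase words in text with punctuation removed."""
    # Grab runs of word characters and apostrophes; other punctuation acts as a separator
    tokens = re.findall(r"[\w']+", text.lower())
    # Delete leftover punctuation (apostrophes, underscores) inside each token
    cleaned = (token.translate(_STRIP_PUNCTUATION) for token in tokens)
    # Drop tokens that became empty after cleaning
    return [token for token in cleaned if token]


sample_words = split_text_compact("Hello World! It's a well-known fact.")
print(sample_words)
# ['hello', 'world', 'its', 'a', 'well', 'known', 'fact']
```

Note that contractions lose their apostrophe ("it's" becomes "its"), which matters when the words are later compared against a stop word list.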

Step 3: Exclude Stop Words
When searching or indexing text content (such as web pages or large documents), we typically want to exclude frequently-used words like "the," "a," or "and" so that the search or analysis includes only the words that are more likely to produce meaningful results. We use the term "stop words" to reference this collection of words.

Because this is a common task when working with text, the third-party nltk (Natural Language Toolkit) package includes stop word lists for a variety of languages. We can use this package to remove stop words from text we want to search or analyze.

You may need to download extra parts of this module. To do this, run the following snippet in a cell by itself.

In [6]:
import nltk
 
nltk.download('stopwords')
 
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Veronica\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Veronica\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

Create a function that takes as input a list of words and removes all stop words. The basic steps of importing the stopwords module are provided for you, but you may find it useful to do more research on stop words before completing this step.

In [15]:
import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

import re
import string
def read_text_file(file_path):
    with open(file_path, "r") as f:
        text = f.read()
    return text

def split_text(text):
    words = re.findall(r"[\w']+|[.,!?;]", text.lower())
    punctuation_list =  list(string.punctuation)
    w_clean = list()
    
    for word in words:
        new_word = ""
        for char in word:
            if char not in punctuation_list:
                new_word += char
            
        if len(new_word) > 0:
            w_clean.append(new_word)
        
    return w_clean
 
def remove_stop_words(words,stop_words):
    words_clean = list()
    for word in words:
        if word not in stop_words:
            words_clean.append(word)
        
    return words_clean
 
text = read_text_file("FileIO-DataFiles/flatland01.txt")
words = split_text(text)
print(words)
words_clean = remove_stop_words(words,stop_words)
print(words_clean)

['flatland', 'part', '1', 'this', 'world', 'section', '1', 'of', 'the', 'nature', 'of', 'flatland', 'i', 'call', 'our', 'world', 'flatland', 'not', 'because', 'we', 'call', 'it', 'so', 'but', 'to', 'make', 'its', 'nature', 'clearer', 'to', 'you', 'my', 'happy', 'readers', 'who', 'are', 'privileged', 'to', 'live', 'in', 'space', 'imagine', 'a', 'vast', 'sheet', 'of', 'paper', 'on', 'which', 'straight', 'lines', 'triangles', 'squares', 'pentagons', 'hexagons', 'and', 'other', 'figures', 'instead', 'of', 'remaining', 'fixed', 'in', 'their', 'places', 'move', 'freely', 'about', 'on', 'or', 'in', 'the', 'surface', 'but', 'without', 'the', 'power', 'of', 'rising', 'above', 'or', 'sinking', 'below', 'it', 'very', 'much', 'like', 'shadows', 'only', 'hard', 'with', 'luminous', 'edges', 'and', 'you', 'will', 'then', 'have', 'a', 'pretty', 'correct', 'notion', 'of', 'my', 'country', 'and', 'countrymen', 'alas', 'a', 'few', 'years', 'ago', 'i', 'should', 'have', 'said', 'my', 'universe', 'but', 'n
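
One thing to watch for in this step: split_text strips apostrophes, while the NLTK English list stores contractions with their apostrophes (for example "don't"), so a cleaned word such as "dont" will not match. The sketch below shows one possible workaround. The function names and the small sample_stop_words set are hypothetical stand-ins so the example runs without downloading anything; in the assignment you would pass set(stopwords.words('english')) instead.

```python
def with_apostrophe_free_variants(stop_words):
    """Return the stop words plus copies that have their apostrophes removed."""
    stop_words = set(stop_words)
    return stop_words | {word.replace("'", "") for word in stop_words}


def remove_stop_words_compact(words, stop_words):
    """Keep only the words that are not in the stop word collection."""
    return [word for word in words if word not in stop_words]


# Hypothetical stand-in for set(stopwords.words('english'))
sample_stop_words = {"the", "a", "on", "don't", "it's"}
sample_words = ["dont", "walk", "on", "the", "grass", "its", "green"]

expanded_stop_words = with_apostrophe_free_variants(sample_stop_words)

print(remove_stop_words_compact(sample_words, sample_stop_words))
# ['dont', 'walk', 'grass', 'its', 'green']
print(remove_stop_words_compact(sample_words, expanded_stop_words))
# ['walk', 'grass', 'green']
```

Keeping the stop words in a set, as the assignment code already does, makes each membership test fast even for long texts.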

Step 4: Lemmatize the Words
We can also use the nltk module to lemmatize words in a text file. The term lemmatize refers to the process of identifying words that are inflected versions of the same stem word, so that only the stem word is included in the analysis.

For example, each of the following phrases includes an inflected form of the stem word "walk":

I walked to the coffee shop last night.
Helen regularly walks her dog in the evening.
They saw the boys walking toward the house.
A strict textual analysis would count each of these as a separate word, but they are all actually different forms of the same stem word, "walk." Lemmatizing the words reduces the number of words that a process must analyze, making the process more efficient and the results more meaningful.

The following code imports WordNetLemmatizer from the nltk.stem module and creates a lemmatizer. We can then use the lemmatizer to identify the lemma (or root form) of an inflected word, as shown in the example.

In [16]:
# example code
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 
 
word = "priorities"
word_lemmatized =  lemmatizer.lemmatize(word)
print(word) #original word
print(word_lemmatized) #lemmatized word

priorities
priority


Using this code as a starting point, create a function to lemmatize each word in a list of words produced in the previous step of this activity.

In [18]:
# use this cell to complete the activity
import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 
import re
import string

def read_text_file(file_path):
    with open(file_path, "r") as f:
        text = f.read()
    return text

def split_text(text):
    words = re.findall(r"[\w']+|[.,!?;]", text.lower())
    punctuation_list =  list(string.punctuation)
    w_clean = list()
    
    for word in words:
        new_word = ""
        for char in word:
            if char not in punctuation_list:
                new_word += char
            
        if len(new_word) > 0:
            w_clean.append(new_word)
        
    return w_clean
 
def remove_stop_words(words,stop_words):
    words_clean = list()
    for word in words:
        if word not in stop_words:
            words_clean.append(word)
        
    return words_clean

def lemmatize_words(words_clean):
    words_lemmatized = list()
    for word in words_clean:
        word_lemmatized =  lemmatizer.lemmatize(word)
        words_lemmatized.append(word_lemmatized)
    return words_lemmatized
 
text = read_text_file("FileIO-DataFiles/flatland01.txt")
words = split_text(text)
print(words)
print("\n")
words_clean = remove_stop_words(words,stop_words)
print(words_clean)
print("\n")
words_lemmatized = lemmatize_words(words_clean)
print(words_lemmatized)

['flatland', 'part', '1', 'this', 'world', 'section', '1', 'of', 'the', 'nature', 'of', 'flatland', 'i', 'call', 'our', 'world', 'flatland', 'not', 'because', 'we', 'call', 'it', 'so', 'but', 'to', 'make', 'its', 'nature', 'clearer', 'to', 'you', 'my', 'happy', 'readers', 'who', 'are', 'privileged', 'to', 'live', 'in', 'space', 'imagine', 'a', 'vast', 'sheet', 'of', 'paper', 'on', 'which', 'straight', 'lines', 'triangles', 'squares', 'pentagons', 'hexagons', 'and', 'other', 'figures', 'instead', 'of', 'remaining', 'fixed', 'in', 'their', 'places', 'move', 'freely', 'about', 'on', 'or', 'in', 'the', 'surface', 'but', 'without', 'the', 'power', 'of', 'rising', 'above', 'or', 'sinking', 'below', 'it', 'very', 'much', 'like', 'shadows', 'only', 'hard', 'with', 'luminous', 'edges', 'and', 'you', 'will', 'then', 'have', 'a', 'pretty', 'correct', 'notion', 'of', 'my', 'country', 'and', 'countrymen', 'alas', 'a', 'few', 'years', 'ago', 'i', 'should', 'have', 'said', 'my', 'universe', 'but', 'n
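
A caveat about the "walk" example earlier in this step: WordNetLemmatizer.lemmatize treats every word as a noun unless told otherwise (its pos parameter defaults to "n"), so verb forms such as "walked" are generally returned unchanged by the code above. The sketch below shows one simple heuristic: try several part-of-speech tags and keep the first result that differs from the input. The function lemmatize_any_pos and the toy lookup table are hypothetical and exist only so the example runs without the WordNet data; with nltk installed you could pass WordNetLemmatizer().lemmatize as the second argument instead.

```python
def lemmatize_any_pos(word, lemmatize, pos_tags=("n", "v", "a")):
    """Try each part-of-speech tag in turn; return the first lemma that differs."""
    for pos in pos_tags:
        lemma = lemmatize(word, pos)
        if lemma != word:
            return lemma
    # No tag produced a different form, so keep the word as it is
    return word


# Hypothetical toy lookup standing in for WordNetLemmatizer().lemmatize
_TOY_LEMMAS = {
    ("walked", "v"): "walk",
    ("walking", "v"): "walk",
    ("walks", "n"): "walk",
    ("priorities", "n"): "priority",
}


def toy_lemmatize(word, pos="n"):
    """Look up (word, pos) in the toy table; return the word unchanged if absent."""
    return _TOY_LEMMAS.get((word, pos), word)


sample = ["walked", "walks", "walking", "priorities", "dog"]
sample_lemmas = [lemmatize_any_pos(word, toy_lemmatize) for word in sample]
print(sample_lemmas)
# ['walk', 'walk', 'walk', 'priority', 'dog']
```

This is only a heuristic. Tagging each word with its actual part of speech before lemmatizing is more accurate, but it is beyond what this assignment requires.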

Step 5: Count the Words
Create a function that takes as input a list of lemmatized words and returns a dictionary that has the frequency of occurrence of each lemma.

In [19]:
import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 
import re
import string

def read_text_file(file_path):
    with open(file_path, "r") as f:
        text = f.read()
    return text

def split_text(text):
    words = re.findall(r"[\w']+|[.,!?;]", text.lower())
    punctuation_list =  list(string.punctuation)
    w_clean = list()
    
    for word in words:
        new_word = ""
        for char in word:
            if char not in punctuation_list:
                new_word += char
            
        if len(new_word) > 0:
            w_clean.append(new_word)
        
    return w_clean
 
def remove_stop_words(words,stop_words):
    words_clean = list()
    for word in words:
        if word not in stop_words:
            words_clean.append(word)
        
    return words_clean

def lemmatize_words(words_clean):
    words_lemmatized = list()
    for word in words_clean:
        word_lemmatized =  lemmatizer.lemmatize(word)
        words_lemmatized.append(word_lemmatized)
    return words_lemmatized
def compute_frequency_words(words_lemmatized):
    word_freq = dict()

    for word in words_lemmatized:
        # Tally within the lemmatized list itself rather than the global `words` list
        word_freq[word] = word_freq.get(word, 0) + 1
    
    return word_freq

text = read_text_file("FileIO-DataFiles/flatland01.txt")
words = split_text(text)
print(words)
print("\n")
words_clean = remove_stop_words(words,stop_words)
print(words_clean)
print("\n")
words_lemmatized = lemmatize_words(words_clean)
print(words_lemmatized)
print("\n")
words_frequency = compute_frequency_words(words_lemmatized)
print(type(words_frequency)) #should print dict
print("\n")
print(words_frequency)

['flatland', 'part', '1', 'this', 'world', 'section', '1', 'of', 'the', 'nature', 'of', 'flatland', 'i', 'call', 'our', 'world', 'flatland', 'not', 'because', 'we', 'call', 'it', 'so', 'but', 'to', 'make', 'its', 'nature', 'clearer', 'to', 'you', 'my', 'happy', 'readers', 'who', 'are', 'privileged', 'to', 'live', 'in', 'space', 'imagine', 'a', 'vast', 'sheet', 'of', 'paper', 'on', 'which', 'straight', 'lines', 'triangles', 'squares', 'pentagons', 'hexagons', 'and', 'other', 'figures', 'instead', 'of', 'remaining', 'fixed', 'in', 'their', 'places', 'move', 'freely', 'about', 'on', 'or', 'in', 'the', 'surface', 'but', 'without', 'the', 'power', 'of', 'rising', 'above', 'or', 'sinking', 'below', 'it', 'very', 'much', 'like', 'shadows', 'only', 'hard', 'with', 'luminous', 'edges', 'and', 'you', 'will', 'then', 'have', 'a', 'pretty', 'correct', 'notion', 'of', 'my', 'country', 'and', 'countrymen', 'alas', 'a', 'few', 'years', 'ago', 'i', 'should', 'have', 'said', 'my', 'universe', 'but', 'n
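
As an alternative to updating the dictionary by hand, the standard library's collections.Counter does the tallying in one call. This is a minimal sketch; the function name compute_frequency_counter and the sample list are made up for illustration.

```python
from collections import Counter


def compute_frequency_counter(words_lemmatized):
    """Return a plain dict mapping each lemma to the number of times it occurs."""
    # Counter tallies every item in a single pass over the list
    return dict(Counter(words_lemmatized))


sample_lemmas = ["flatland", "world", "nature", "flatland", "world", "flatland"]

sample_frequency = compute_frequency_counter(sample_lemmas)
print(sample_frequency)
# {'flatland': 3, 'world': 2, 'nature': 1}

# most_common is handy for inspecting the top results
top_two = Counter(sample_lemmas).most_common(2)
print(top_two)
# [('flatland', 3), ('world', 2)]
```

Converting the Counter to a plain dict keeps the return type the same as the assignment asks for, and it serializes to JSON the same way.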

Step 6: Export the Results to JSON
Create a function that takes as input a dictionary where the key is a word and the value is the frequency of occurrence of that word in an input text.

The function should store the dictionary in a JSON file named words_frequency.json.

In [23]:
import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 
import re
import string
import json

def read_text_file(file_path):
    with open(file_path, "r") as f:
        text = f.read()
    return text

def split_text(text):
    words = re.findall(r"[\w']+|[.,!?;]", text.lower())
    punctuation_list =  list(string.punctuation)
    w_clean = list()
    
    for word in words:
        new_word = ""
        for char in word:
            if char not in punctuation_list:
                new_word += char
            
        if len(new_word) > 0:
            w_clean.append(new_word)
        
    return w_clean
 
def remove_stop_words(words,stop_words):
    words_clean = list()
    for word in words:
        if word not in stop_words:
            words_clean.append(word)
        
    return words_clean

def lemmatize_words(words_clean):
    words_lemmatized = list()
    for word in words_clean:
        word_lemmatized =  lemmatizer.lemmatize(word)
        words_lemmatized.append(word_lemmatized)
    return words_lemmatized
def compute_frequency_words(words_lemmatized):
    word_freq = dict()

    for word in words_lemmatized:
        # Tally within the lemmatized list itself rather than the global `words` list
        word_freq[word] = word_freq.get(word, 0) + 1
    
    return word_freq

def save_words_frequency(words_frequency,file_path="FileIO-DataFiles/words_frequency.json"):
    with open(file_path, 'w') as outfile:  
        json.dump(words_frequency, outfile)

text = read_text_file("FileIO-DataFiles/flatland01.txt")
words = split_text(text)
print(words)
print("\n")
words_clean = remove_stop_words(words,stop_words)
print(words_clean)
print("\n")
words_lemmatized = lemmatize_words(words_clean)
print(words_lemmatized)
print("\n")
words_frequency = compute_frequency_words(words_lemmatized)
print(type(words_frequency)) #should print dict
print("\n")
print(words_frequency)
save_words_frequency(words_frequency,"FileIO-DataFiles/words_frequency.json")

['flatland', 'part', '1', 'this', 'world', 'section', '1', 'of', 'the', 'nature', 'of', 'flatland', 'i', 'call', 'our', 'world', 'flatland', 'not', 'because', 'we', 'call', 'it', 'so', 'but', 'to', 'make', 'its', 'nature', 'clearer', 'to', 'you', 'my', 'happy', 'readers', 'who', 'are', 'privileged', 'to', 'live', 'in', 'space', 'imagine', 'a', 'vast', 'sheet', 'of', 'paper', 'on', 'which', 'straight', 'lines', 'triangles', 'squares', 'pentagons', 'hexagons', 'and', 'other', 'figures', 'instead', 'of', 'remaining', 'fixed', 'in', 'their', 'places', 'move', 'freely', 'about', 'on', 'or', 'in', 'the', 'surface', 'but', 'without', 'the', 'power', 'of', 'rising', 'above', 'or', 'sinking', 'below', 'it', 'very', 'much', 'like', 'shadows', 'only', 'hard', 'with', 'luminous', 'edges', 'and', 'you', 'will', 'then', 'have', 'a', 'pretty', 'correct', 'notion', 'of', 'my', 'country', 'and', 'countrymen', 'alas', 'a', 'few', 'years', 'ago', 'i', 'should', 'have', 'said', 'my', 'universe', 'but', 'n
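
To confirm that the export worked, it can help to read the JSON file back and compare it with the original dictionary. The sketch below assumes UTF-8 encoding and writes to a temporary directory so that it does not touch your data files; the function names and the sample dictionary are hypothetical. The indent and sort_keys arguments to json.dump are optional and only make the file easier to read.

```python
import json
import os
import tempfile


def save_words_frequency_pretty(words_frequency, file_path):
    """Write the dictionary as indented JSON with keys in alphabetical order."""
    with open(file_path, "w", encoding="utf-8") as outfile:
        json.dump(words_frequency, outfile, indent=2, sort_keys=True)


def load_words_frequency(file_path):
    """Read a JSON file back into a dictionary."""
    with open(file_path, "r", encoding="utf-8") as infile:
        return json.load(infile)


sample_frequency = {"world": 2, "flatland": 3, "nature": 1}

# Write and re-read inside a temporary directory that is removed afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "words_frequency.json")
    save_words_frequency_pretty(sample_frequency, path)
    restored = load_words_frequency(path)

print(restored == sample_frequency)
# True
```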

Step 7: Combine All Steps in a Single Program
Using the skeleton below, combine all of the previous steps into a single script that will perform the following steps:

Convert a text file to a string.
Split the string into words, excluding punctuation marks.
Remove stop words from the list of strings.
Lemmatize the words in the list so that all words are stem words.
Count the frequency of each stem word and store the results in a dictionary.
Convert the dictionary to a JSON file.

In [24]:
import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 
import re
import string
import json

def read_text_file(file_path):
    # Use a with block so the file is closed even if reading fails
    with open(file_path, "r") as f:
        text = f.read()
    return text

def split_text(text):
    words = re.findall(r"[\w']+|[.,!?;]", text.lower())
    punctuation_list =  list(string.punctuation)
    w_clean = list()
    
    for word in words:
        new_word = ""
        for char in word:
            if char not in punctuation_list:
                new_word += char
            
        if len(new_word) > 0:
            w_clean.append(new_word)
        
    return w_clean
 
def remove_stop_words(words,stop_words):
    words_clean = list()
    for word in words:
        if word not in stop_words:
            words_clean.append(word)
        
    return words_clean

def lemmatize_words(words_clean):
    words_lemmatized = list()
    for word in words_clean:
        word_lemmatized =  lemmatizer.lemmatize(word)
        words_lemmatized.append(word_lemmatized)
    return words_lemmatized
def compute_frequency_words(words_lemmatized):
    word_freq = dict()

    for word in words_lemmatized:
        # Tally within the lemmatized list itself rather than the global `words` list
        word_freq[word] = word_freq.get(word, 0) + 1
    
    return word_freq

def save_words_frequency(words_frequency,file_path="FileIO-DataFiles/words_frequency.json"):
    with open(file_path, 'w') as outfile:  
        json.dump(words_frequency, outfile)

text = read_text_file("FileIO-DataFiles/flatland01.txt")
words = split_text(text)
words_clean = remove_stop_words(words,stop_words)
words_lemmatized = lemmatize_words(words_clean)
words_frequency = compute_frequency_words(words_lemmatized)
save_words_frequency(words_frequency,"FileIO-DataFiles/words_frequency.json")

Requirements
After completing all steps in this assignment, verify that your code meets the following requirements:

Your name and the current date appear as a comment in the first line of code.
The final version of the file successfully completes each of the following tasks:
Convert a text file to a string.
Split the string into words, excluding punctuation marks.
Remove stop words from the list of strings.
Lemmatize the words in the list so that all words are stem words.
Count the frequency of each stem word and store the results in a dictionary.
Convert the dictionary to a JSON file.
Include appropriate exception handling for predictable errors such as missing files.
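
For the last requirement, one possible shape for the exception handling is sketched below. It is an illustration under stated assumptions rather than the expected answer: the name read_text_file_safe, the choice of UTF-8 encoding, and the decision to return None on failure are all assumptions made here. The demonstration uses a temporary directory so it runs without any of the course data files.

```python
import os
import sys
import tempfile


def read_text_file_safe(file_path):
    """Return the file contents as a string, or None if the file cannot be read."""
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        print(f"File not found: {file_path}", file=sys.stderr)
    except PermissionError:
        print(f"No permission to read: {file_path}", file=sys.stderr)
    except UnicodeDecodeError:
        print(f"File is not valid UTF-8 text: {file_path}", file=sys.stderr)
    # Reached only when one of the handlers above ran
    return None


with tempfile.TemporaryDirectory() as tmp_dir:
    missing_path = os.path.join(tmp_dir, "missing.txt")
    present_path = os.path.join(tmp_dir, "present.txt")
    with open(present_path, "w", encoding="utf-8") as f:
        f.write("Hello World!")

    missing_result = read_text_file_safe(missing_path)
    present_result = read_text_file_safe(present_path)

print(missing_result)
# None
print(present_result)
# Hello World!
```

The calling code can then check for None and stop early instead of passing a missing value to the later steps.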