# Assignment: Word Counts

Perform an advanced frequency analysis on the words in a given text source:

1. Convert a text file into a string.
2. Split the string into words, excluding punctuation marks.
3. Remove stop words from the list of words.
4. Lemmatize the words in the list so that each word is reduced to its stem word.
5. Count the frequency of each stem word and store the results in a dictionary.
6. Convert the dictionary to a JSON file.

### Step 1: Convert a Text File to a String
Create a function that takes as input the path to a text file and returns the contents of the file as a string.

In [None]:
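import os

# Create a small sample file if it does not exist yet.
os.makedirs("data", exist_ok=True)
if os.path.exists("data/text.txt"):
    print("File already exists.")
else:
    with open("data/text.txt", "w") as f:
        f.write("Hello World!")

def read_text_file(file_path):
    # The with statement closes the file even if reading fails.
    with open(file_path, 'r') as f:
        return f.read()

text = read_text_file("data/text.txt")
print(text)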

### Step 2: Split the String into Words
Create a function that takes as input a string and returns a list of strings representing the words in the text file.

The function should divide the string into words based on any type of punctuation.

The function should convert all words into lowercase.

In [None]:
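import re

def split_text(text):
    # Work on the argument directly; caching it in a file would make later
    # calls return the old text instead of the string that was passed in.
    # Lowercase first so that "The" and "the" count as the same word.
    lower_case = text.lower()

    # Split on hyphens, semicolons, commas, periods, and whitespace.
    # Other punctuation (for example em dashes or apostrophes) is not split on.
    content_list = re.split(r"[-;,.\s]\s*", lower_case)

    # Drop the empty strings that re.split leaves behind, e.g. after the final period.
    return [word for word in content_list if word]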
    
text = 'Sherlock Holmes she is always the woman. I have seldom heard him mention her under any other name. In his eyes she eclipses and predominates the whole of her sex. It was not that he felt any emotion akin to love for Irene Adler. All emotions, and that one particularly, were abhorrent to his cold, precise, but admirably balanced mind. He was, I take it, the most perfect reasoning and observing machine that the world has seen; but, as a lover, he would have placed himself in a false position. He never spoke of the softer passions, save with a gibe and a sneer. They were admirable things for the observer—excellent for drawing the veil from men’s motives and actions. But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results. Grit in a sensitive instrument, or a crack in one of his own high-power lenses, would not be more disturbing than a strong emotion in a nature such as his. And yet there was but one woman to him, and that woman was the late Irene Adler, of dubious and questionable memory.'
words = split_text(text)
print(words)
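
The pattern above splits only on hyphens, semicolons, commas, periods, and whitespace, so a token joined by a dash character or containing a typographic apostrophe stays in one piece. As a minimal sketch of one alternative, assuming it is acceptable to treat every non-word character as a separator, the hypothetical helper `split_on_any_punctuation` (not part of the assignment skeleton) collects runs of word characters instead of listing the delimiters:

```python
import re

def split_on_any_punctuation(text):
    """Return lowercase runs of word characters (hypothetical helper)."""
    # \w+ matches runs of letters, digits, and underscores, so dashes,
    # apostrophes, and every other punctuation mark act as separators.
    return re.findall(r"\w+", text.lower())

sample = "Excellent for drawing the veil from men’s motives; high-power lenses."
print(split_on_any_punctuation(sample))
```

The trade-off is that possessives are split too: `men’s` becomes `men` and `s`, so a later cleanup step may be needed if that matters for the analysis.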

### Step 3: Exclude Stop Words
When searching or indexing text content (such as web pages or large documents), we typically want to exclude frequently-used words like "the," "a," or "and" so that the search or analysis includes only the words that are more likely to produce meaningful results. We use the term "stop words" to reference this collection of words.

Because this is a common task when working with text, the third-party NLTK library (imported as `nltk`) includes stop word lists for a variety of languages. We can use this library to remove stop words from text we want to search or analyze.

Create a function that takes as input a list of words and removes all stop words. The basic steps of importing the stopwords module are provided for you, but you may find it useful to do more research on stop words before completing this step.

In [None]:
import nltk
 
nltk.download('stopwords')
 
nltk.download('wordnet')

In [None]:
import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def remove_stop_words(words,stop_words):
    
    filtered_sentence = []
    for w in words:
        if w not in stop_words:
            filtered_sentence.append(w)
    return filtered_sentence
words = ['sherlock', 'holmes', 'she', 'is', 'always', 'the', 'woman', 'i', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name', 'in', 'his', 'eyes', 'she', 'eclipses', 'and', 'predominates', 'the', 'whole', 'of', 'her', 'sex', 'it', 'was', 'not', 'that', 'he', 'felt', 'any', 'emotion', 'akin', 'to', 'love', 'for', 'irene', 'adler', 'all', 'emotions', 'and', 'that', 'one', 'particularly', 'were', 'abhorrent', 'to', 'his', 'cold', 'precise', 'but', 'admirably', 'balanced', 'mind', 'he', 'was', 'i', 'take', 'it', 'the', 'most', 'perfect', 'reasoning', 'and', 'observing', 'machine', 'that', 'the', 'world', 'has', 'seen', 'but', 'as', 'a', 'lover', 'he', 'would', 'have', 'placed', 'himself', 'in', 'a', 'false', 'position', 'he', 'never', 'spoke', 'of', 'the', 'softer', 'passions', 'save', 'with', 'a', 'gibe', 'and', 'a', 'sneer', 'they', 'were', 'admirable', 'things', 'for', 'the', 'observer—excellent', 'for', 'drawing', 'the', 'veil', 'from', 'men’s', 'motives', 'and', 'actions', 'but', 'for', 'the', 'trained', 'reasoner', 'to', 'admit', 'such', 'intrusions', 'into', 'his', 'own', 'delicate', 'and', 'finely', 'adjusted', 'temperament', 'was', 'to', 'introduce', 'a', 'distracting', 'factor', 'which', 'might', 'throw', 'a', 'doubt', 'upon', 'all', 'his', 'mental', 'results', 'grit', 'in', 'a', 'sensitive', 'instrument', 'or', 'a', 'crack', 'in', 'one', 'of', 'his', 'own', 'high', 'power', 'lenses', 'would', 'not', 'be', 'more', 'disturbing', 'than', 'a', 'strong', 'emotion', 'in', 'a', 'nature', 'such', 'as', 'his', 'and', 'yet', 'there', 'was', 'but', 'one', 'woman', 'to', 'him', 'and', 'that', 'woman', 'was', 'the', 'late', 'irene', 'adler', 'of', 'dubious', 'and', 'questionable', 'memory', '']
words_clean = remove_stop_words(words,stop_words)
print(words_clean)
print(len(words_clean))
print(len(words))
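
The filtering loop can also be written as a list comprehension. The sketch below is illustrative only: `remove_stop_words_compact` is a hypothetical name, and `SAMPLE_STOP_WORDS` is a tiny hand-written set standing in for NLTK's much longer English list so that the example runs without any downloads.

```python
# A tiny hand-written stop-word set, used only for illustration;
# NLTK's English list is much longer.
SAMPLE_STOP_WORDS = {"the", "a", "and", "is", "she", "of", "her"}

def remove_stop_words_compact(words, stop_words):
    """Keep only the words that are not in stop_words, preserving order."""
    return [w for w in words if w not in stop_words]

sample_words = ["sherlock", "holmes", "she", "is", "always", "the", "woman"]
print(remove_stop_words_compact(sample_words, SAMPLE_STOP_WORDS))
# prints ['sherlock', 'holmes', 'always', 'woman']
```

Storing the stop words in a set rather than a list keeps each membership check fast, which is why the cell above wraps `stopwords.words('english')` in `set(...)`.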

### Step 4: Lemmatize the Words
We can also use the nltk module to lemmatize words in a text file. The term lemmatize refers to the process of identifying words that are inflected versions of the same stem word, so that only the stem word is included in the analysis.

For example, each of the following phrases includes an inflected form of the stem word "walk":

- I walked to the coffee shop last night.
- Helen regularly walks her dog in the evening.
- They saw the boys walking toward the house.

A strict textual analysis would count each of these as a separate word, but they are all actually different forms of the same stem word, "walk." Lemmatizing the words reduces the number of words that a process must analyze, making the process more efficient and the results more meaningful.
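
To make the effect on counting concrete, here is a toy sketch that does not use NLTK at all. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the lookup table covers only the three forms from the example sentences; a real lemmatizer consults a full dictionary instead.

```python
from collections import Counter

# Hand-written lookup table for illustration only; a real lemmatizer
# such as NLTK's WordNetLemmatizer consults a full dictionary instead.
TOY_LEMMAS = {"walked": "walk", "walks": "walk", "walking": "walk"}

def toy_lemmatize(word):
    """Return the lemma from the toy table, or the word unchanged if it is unknown."""
    return TOY_LEMMAS.get(word, word)

tokens = ["walked", "walks", "walking", "dog"]
raw_counts = Counter(tokens)                              # four distinct entries
lemma_counts = Counter(toy_lemmatize(t) for t in tokens)  # two distinct entries
print(len(raw_counts), len(lemma_counts))
print(lemma_counts["walk"])
```

One caveat when moving to NLTK: `WordNetLemmatizer.lemmatize` treats its input as a noun unless a `pos` argument is supplied, so verb forms such as "walked" are only reduced to "walk" when it is called with `pos="v"`.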


In [None]:
# example code
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer() 
 
word = "priorities"
word_lemmatized =  lemmatizer.lemmatize(word)
print(word) #original word
print(word_lemmatized) #lemmatized word

Create a function that lemmatizes each word in the list of words produced in the previous step of this activity.

In [None]:
# use this cell to complete the activity
def lemmatize_words(words_clean):
    lem_words = []
    for word in words_clean:
        word_lemmatized = lemmatizer.lemmatize(word)
        lem_words.append(word_lemmatized)
    return lem_words
words_lemmatized = lemmatize_words(words_clean)
print(words_lemmatized)

### Step 5: Count the Words
Create a function that takes as input a list of lemmatized words and returns a dictionary that has the frequency of occurrence of each lemma.

In [None]:
words_lemmatized = ['sherlock', 'holmes', 'always', 'woman', 'seldom', 'heard', 'mention', 'name', 'eye', 'eclipse', 'predominates', 'whole', 'sex', 'felt', 'emotion', 'akin', 'love', 'irene', 'adler', 'emotion', 'one', 'particularly', 'abhorrent', 'cold', 'precise', 'admirably', 'balanced', 'mind', 'take', 'perfect', 'reasoning', 'observing', 'machine', 'world', 'seen', 'lover', 'would', 'placed', 'false', 'position', 'never', 'spoke', 'softer', 'passion', 'save', 'gibe', 'sneer', 'admirable', 'thing', 'observer—excellent', 'drawing', 'veil', 'men’s', 'motif', 'action', 'trained', 'reasoner', 'admit', 'intrusion', 'delicate', 'finely', 'adjusted', 'temperament', 'introduce', 'distracting', 'factor', 'might', 'throw', 'doubt', 'upon', 'mental', 'result', 'grit', 'sensitive', 'instrument', 'crack', 'one', 'high', 'power', 'lens', 'would', 'disturbing', 'strong', 'emotion', 'nature', 'yet', 'one', 'woman', 'woman', 'late', 'irene', 'adler', 'dubious', 'questionable', 'memory']
def compute_frequency_words(words_lemmatized):
    word_freq = dict()
    # Initialize every distinct word as a key with a count of zero
    for i in words_lemmatized:
        word_freq[i] = 0

    # Loop through the list again and add 1 to the count for each occurrence
    for i in words_lemmatized:
        if i in word_freq:
            word_freq[i] += 1
    # print("Word Frequency: ", word_freq)
    return word_freq

words_frequency = compute_frequency_words(words_lemmatized)
print(type(words_frequency)) #should print dict
print(words_frequency)
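
The standard library's `collections.Counter` performs the same tally in one pass. The sketch below is an alternative to the two-loop version, not a required part of the assignment; `compute_frequency_counter` is a hypothetical name.

```python
from collections import Counter

def compute_frequency_counter(words):
    """Count occurrences of each word and return a plain dict."""
    return dict(Counter(words))

sample = ["woman", "emotion", "irene", "adler", "emotion", "woman", "woman"]
freq = compute_frequency_counter(sample)
print(type(freq))  # a plain dict, as in the cell above
print(freq)
# most_common(n) lists the n most frequent words with their counts
print(Counter(sample).most_common(2))
```

Converting back to `dict` keeps the return type identical to `compute_frequency_words`, so the result can be passed straight to the JSON export step.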

### Step 6: Export the Results to JSON
Create a function that takes as input a dictionary where the key is a word and the value is the frequency of occurrence of that word in an input text.

The function should store the dictionary in a JSON file named words_frequency.json.


In [None]:
import json

json_dict = {'sherlock': 1, 'holmes': 1, 'always': 1, 'woman': 3, 'seldom': 1, 'heard': 1, 'mention': 1, 'name': 1, 'eye': 1, 'eclipse': 1, 'predominates': 1, 'whole': 1, 'sex': 1, 'felt': 1, 'emotion': 3, 'akin': 1, 'love': 1, 'irene': 2, 'adler': 2, 'one': 3, 'particularly': 1, 'abhorrent': 1, 'cold': 1, 'precise': 1, 'admirably': 1, 'balanced': 1, 'mind': 1, 'take': 1, 'perfect': 1, 'reasoning': 1, 'observing': 1, 'machine': 1, 'world': 1, 'seen': 1, 'lover': 1, 'would': 2, 'placed': 1, 'false': 1, 'position': 1, 'never': 1, 'spoke': 1, 'softer': 1, 'passion': 1, 'save': 1, 'gibe': 1, 'sneer': 1, 'admirable': 1, 'thing': 1, 'observer—excellent': 1, 'drawing': 1, 'veil': 1, 'men’s': 1, 'motif': 1, 'action': 1, 'trained': 1, 'reasoner': 1, 'admit': 1, 'intrusion': 1, 'delicate': 1, 'finely': 1, 'adjusted': 1, 'temperament': 1, 'introduce': 1, 'distracting': 1, 'factor': 1, 'might': 1, 'throw': 1, 'doubt': 1, 'upon': 1, 'mental': 1, 'result': 1, 'grit': 1, 'sensitive': 1, 'instrument': 1, 'crack': 1, 'high': 1, 'power': 1, 'lens': 1, 'disturbing': 1, 'strong': 1, 'nature': 1, 'yet': 1, 'late': 1, 'dubious': 1, 'questionable': 1, 'memory': 1}

# convert a dictionary into a string object that we can display. 
json_data = json.dumps(json_dict, indent = 4) 
print(json_dict) # dict
print(type(json_dict))
 
print(json_data) # string
print(type(json_data))

In [None]:
import json
from pprint import pprint
json_dict = {'sherlock': 1, 'holmes': 1, 'always': 1, 'woman': 3, 'seldom': 1, 'heard': 1, 'mention': 1, 'name': 1, 'eye': 1, 'eclipse': 1, 'predominates': 1, 'whole': 1, 'sex': 1, 'felt': 1, 'emotion': 3, 'akin': 1, 'love': 1, 'irene': 2, 'adler': 2, 'one': 3, 'particularly': 1, 'abhorrent': 1, 'cold': 1, 'precise': 1, 'admirably': 1, 'balanced': 1, 'mind': 1, 'take': 1, 'perfect': 1, 'reasoning': 1, 'observing': 1, 'machine': 1, 'world': 1, 'seen': 1, 'lover': 1, 'would': 2, 'placed': 1, 'false': 1, 'position': 1, 'never': 1, 'spoke': 1, 'softer': 1, 'passion': 1, 'save': 1, 'gibe': 1, 'sneer': 1, 'admirable': 1, 'thing': 1, 'observer—excellent': 1, 'drawing': 1, 'veil': 1, 'men’s': 1, 'motif': 1, 'action': 1, 'trained': 1, 'reasoner': 1, 'admit': 1, 'intrusion': 1, 'delicate': 1, 'finely': 1, 'adjusted': 1, 'temperament': 1, 'introduce': 1, 'distracting': 1, 'factor': 1, 'might': 1, 'throw': 1, 'doubt': 1, 'upon': 1, 'mental': 1, 'result': 1, 'grit': 1, 'sensitive': 1, 'instrument': 1, 'crack': 1, 'high': 1, 'power': 1, 'lens': 1, 'disturbing': 1, 'strong': 1, 'nature': 1, 'yet': 1, 'late': 1, 'dubious': 1, 'questionable': 1, 'memory': 1}
with open('data/words_frequency.json', 'w') as outfile:
    json.dump(json_dict, outfile)

# Read the file back to confirm what was written; the with block closes it afterwards.
with open('data/words_frequency.json', 'r') as f:
    pprint(json.load(f))
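
By default `json.dump` escapes non-ASCII characters, so a key that contains a typographic dash or apostrophe is written as a `\uXXXX` sequence. As a minimal sketch, assuming UTF-8 output is acceptable, the hypothetical helpers below write readable, indented JSON and load it back. A temporary directory is used so the example does not touch the `data/` folder.

```python
import json
import os
import tempfile

def save_words_frequency_utf8(words_frequency, file_path):
    """Write the dictionary as indented JSON, keeping non-ASCII characters readable."""
    with open(file_path, "w", encoding="utf-8") as outfile:
        json.dump(words_frequency, outfile, indent=4, ensure_ascii=False)

def load_words_frequency(file_path):
    """Read the JSON file back into a dictionary."""
    with open(file_path, "r", encoding="utf-8") as infile:
        return json.load(infile)

sample = {"woman": 3, "men’s": 1}
# A temporary directory keeps this sketch from touching the data/ folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "words_frequency.json")
    save_words_frequency_utf8(sample, path)
    restored = load_words_frequency(path)
print(restored == sample)
```

Either form loads back to the same dictionary; `ensure_ascii=False` only changes how the file looks on disk.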

### Step 7: Combine All Steps in a Single Program
Using the skeleton below, combine all of the previous steps into a single script that will perform the following steps:

1. Convert a text file to a string.
2. Split the string into words, excluding punctuation marks.
3. Remove stop words from the list of strings.
4. Lemmatize the words in the list so that all words are stem words.
5. Count the frequency of each stem word and store the results in a dictionary.
6. Convert the dictionary to a JSON file.

In [None]:
import json
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


def string_freq(file_path):
    # Read text file to a string
    with open(file_path, 'r') as f:
        content = f.read()

    # Split string into words excluding punctuation marks:
    lower_case = content.lower()
    content_list = [w for w in re.split(r"[-;,.\s]\s*", lower_case) if w]

    # Remove stop words:
    stop_words = set(stopwords.words('english'))
    filtered_sentence = []
    for w in content_list:
        if w not in stop_words:
            filtered_sentence.append(w)

    # Lemmatize the words:
    lemmatizer = WordNetLemmatizer()
    lem_words = []
    for word in filtered_sentence:
        lem_words.append(lemmatizer.lemmatize(word))

    # Count the frequency of each stem word and store the counts in a dictionary
    word_freq = dict()
    for word in lem_words:
        word_freq[word] = word_freq.get(word, 0) + 1

    # Convert the dictionary into a JSON file:
    with open('data/words_frequency.json', 'w') as outfile:
        json.dump(word_freq, outfile)
    return word_freq

file_path = "data/text.txt"
print(string_freq(file_path))


In [None]:
text = 'Sherlock Holmes she is always the woman. I have seldom heard him mention her under any other name. In his eyes she eclipses and predominates the whole of her sex. It was not that he felt any emotion akin to love for Irene Adler. All emotions, and that one particularly, were abhorrent to his cold, precise, but admirably balanced mind. He was, I take it, the most perfect reasoning and observing machine that the world has seen; but, as a lover, he would have placed himself in a false position. He never spoke of the softer passions, save with a gibe and a sneer. They were admirable things for the observer—excellent for drawing the veil from men’s motives and actions. But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results. Grit in a sensitive instrument, or a crack in one of his own high-power lenses, would not be more disturbing than a strong emotion in a nature such as his. And yet there was but one woman to him, and that woman was the late Irene Adler, of dubious and questionable memory.'

In [14]:
import os
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
import json

def read_text_file(file_path):
    f = open(file_path, 'r')
    r = f.read()
    f.close()
    return r
def split_text(text):
    lower_case = text.lower()
    content_list = re.split(r"[-;,.\s]\s*", lower_case)
    # Drop the empty strings that re.split can leave behind
    return [word for word in content_list if word]
def remove_stop_words(words,stop_words):
    filtered_sentence = []
    for w in words:
        if w not in stop_words:
            filtered_sentence.append(w)
    return filtered_sentence
def lemmatize_words(words_clean):
    lemmatizer = WordNetLemmatizer() 
    lem_words = []
    for word in words_clean:
        word_lemmatized = lemmatizer.lemmatize(word)
        lem_words.append(word_lemmatized)
    return lem_words
def compute_frequency_words(words_lemmatized):
    word_freq = dict()
    # Initialize every distinct word as a key with a count of zero
    for i in words_lemmatized:
        word_freq[i] = 0

    # Loop through the list again and add 1 to the count for each occurrence:
    for i in words_lemmatized:
        if i in word_freq:
            word_freq[i] += 1
    return word_freq
def save_words_frequency(words_frequency,file_path="data/words_frequency.json"):
    with open(file_path, 'w') as outfile:
        json.dump(words_frequency, outfile)
        
    f = open(file_path, 'r')
    print(f.read())
    f.close()

stop_words = set(stopwords.words('english'))
text = read_text_file("data/text.txt")
words = split_text(text)
words_clean = remove_stop_words(words,stop_words)
words_lemmatized = lemmatize_words(words_clean)
words_frequency = compute_frequency_words(words_lemmatized)
save_words_frequency(words_frequency,file_path="data/words_frequency.json")


{"sherlock": 1, "holmes": 1, "always": 1, "woman": 3, "seldom": 1, "heard": 1, "mention": 1, "name": 1, "eye": 1, "eclipse": 1, "predominates": 1, "whole": 1, "sex": 1, "felt": 1, "emotion": 3, "akin": 1, "love": 1, "irene": 2, "adler": 2, "one": 3, "particularly": 1, "abhorrent": 1, "cold": 1, "precise": 1, "admirably": 1, "balanced": 1, "mind": 1, "take": 1, "perfect": 1, "reasoning": 1, "observing": 1, "machine": 1, "world": 1, "seen": 1, "lover": 1, "would": 2, "placed": 1, "false": 1, "position": 1, "never": 1, "spoke": 1, "softer": 1, "passion": 1, "save": 1, "gibe": 1, "sneer": 1, "admirable": 1, "thing": 1, "observer\u2014excellent": 1, "drawing": 1, "veil": 1, "men\u2019s": 1, "motif": 1, "action": 1, "trained": 1, "reasoner": 1, "admit": 1, "intrusion": 1, "delicate": 1, "finely": 1, "adjusted": 1, "temperament": 1, "introduce": 1, "distracting": 1, "factor": 1, "might": 1, "throw": 1, "doubt": 1, "upon": 1, "mental": 1, "result": 1, "grit": 1, "sensitive": 1, "instrument": 1