# Homework 1

## Question 1

Write a function called `count_words_basic` that takes in the test phrase and returns a dictionary of word counts.

**Output:**

`{'i': 1,
 'bought': 1,
 'a': 2,
 'sandwich': 1,
 'with': 1,
 'side': 1,
 'of': 1,
 'chips': 1}`

In [271]:
test_phrase_basic = 'i bought a sandwich with a side of chips'

In [272]:
def count_words_basic(text):
    """
    Takes a string as input and returns a dictionary of word counts
    args:
        text : input str
    output:
        ans_dict: dictionary of word counts for text
    """
    ans_dict = {}
    # Splitting text into words and looping over each word to generate a count dictionary
    for word in text.split(' '):
        if word not in ans_dict:
            ans_dict[word] = 0
        ans_dict[word]+=1
    return ans_dict

ans_dict = {'i': 1,  'bought': 1,  'a': 2,  'sandwich': 1,  'with': 1,  'side': 1,  'of': 1,  'chips': 1}
output_dict = count_words_basic(text = test_phrase_basic)

assert ans_dict == output_dict, "Wrong Answer"
print("Correct Answer")
print(output_dict)

Correct Answer
{'i': 1, 'bought': 1, 'a': 2, 'sandwich': 1, 'with': 1, 'side': 1, 'of': 1, 'chips': 1}
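
The same counting loop can be written a little more compactly. The sketch below is an alternative, not the required solution: `dict.get` supplies a default of 0 for unseen words, and `str.split()` with no argument splits on any run of whitespace. The name `count_words_get` is hypothetical.

```python
def count_words_get(text):
    """Return a dictionary mapping each word in `text` to its count."""
    counts = {}
    # split() with no argument splits on any run of whitespace,
    # so repeated spaces do not produce empty-string "words"
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


print(count_words_get('i bought a sandwich with a side of chips'))
```

For the test phrase this gives the same dictionary as `count_words_basic`.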


## Question 2

Write a function called `count_words` that takes in the test phrase and returns a dictionary of word counts.
  * Step 1: Write a function called `clean_text` that removes a few common punctuation marks (`.,?!'"`) and makes all the text lowercase.
  * Step 2: Write another function called `count_words` that splits a string of text into words and then cleans it using the function `clean_text`.

In [288]:
test_phrase = 'I bought a sandwich with a side of chips!'

In [303]:
def clean_text(word_list):
    """
    Strips the punctuation marks listed in `punctuations` from both ends of
    each word and converts each word to lowercase
    args:
        word_list : list of input words to be cleaned
    output:
        out_list: list of cleaned words
    """
    # Creating a string with all punctuation marks that are to be removed
    punctuations = """,?!'".[](){}:;"""
    
    out_list = []
    
    # Iterating over each word from the word list
    # Removing the punctuation marks using strip function
    for word in word_list:
        word = word.strip(punctuations).lower()
        out_list.append(word)
    return out_list
    

def count_words(text):
    """
    1) Splits the input string into words on spaces
    2) Cleans the list of words by stripping common punctuation marks and lowercasing
    3) Counts the frequency of each word
    args:
        text : input text
    output:
        count_dict : dictionary of word counts
    """
    
    # Splitting the raw text into words
    word_list = text.split(' ')
    # Cleaning the words using the clean_text function
    cleaned_word_list = clean_text(word_list)
    
    #  get freq count of each word
    count_dict = {}
    for word in cleaned_word_list:
        if word not in count_dict:
            count_dict[word] = 0
        count_dict[word]+=1
    return count_dict

In [304]:
count_words(test_phrase)

{'i': 1,
 'bought': 1,
 'a': 2,
 'sandwich': 1,
 'with': 1,
 'side': 1,
 'of': 1,
 'chips': 1}
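
Note that `str.strip` only trims characters from the two ends of a string, so punctuation inside a word is left in place. A small sketch of that behaviour follows; the helper name `strip_and_lower` is hypothetical and mirrors the loop body of `clean_text`.

```python
punctuations = """,?!'".[](){}:;"""


def strip_and_lower(word):
    """Trim punctuation from both ends of `word` and lowercase it."""
    return word.strip(punctuations).lower()


print(strip_and_lower("chips!"))    # chips
print(strip_and_lower("We're"))     # we're (inner apostrophe is kept)
print(strip_and_lower('"Hello,"'))  # hello
```

This is why contractions such as `we're` stay intact after cleaning.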

## Question 3

Write a function called `most_common` that returns the most common word from a string. Explain what the code is actually doing.

In [305]:
test_phrase = 'I bought a sandwich with a side of chips!'

In [306]:
def most_common(text):
    """
    1) Cleans the input text by converting it to lowercase
       and removing punctuation.
    2) Creates a frequency map of the words
    3) Finds the most common word (the one with the highest frequency)
    args:
        text : input string
    output:
        most_common_word : highest frequency word from the text 
    """
    # Text cleaning and freq map generation
    count_dict = count_words(text)
    
    # Get highest frequency word from freq map
    most_common_word = max(count_dict, key = lambda x : count_dict[x])
    
    print(f"Highest Frequency Word : {most_common_word} Frequency : {count_dict[most_common_word]}")
    return most_common_word

In [307]:
most_common(text = test_phrase)

Highest Frequency Word : a Frequency : 2


'a'
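
To explain the `max` call in more detail: iterating over a dictionary yields its keys, and the `key` argument tells `max` to compare those keys by their counts rather than alphabetically. The sketch below uses small made-up dictionaries and `dict.get` in place of the lambda, which behaves the same way here.

```python
count_dict = {'i': 1, 'bought': 1, 'a': 2, 'sandwich': 1}

# max() iterates over the dictionary's keys and compares them by the
# value returned from `key`, here each word's count
most_common_word = max(count_dict, key=count_dict.get)
print(most_common_word)  # a

# On a tie, max() returns the first maximal key it encounters
tied = {'x': 2, 'y': 2}
first_of_tied = max(tied, key=tied.get)
print(first_of_tied)  # x
```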

## Question 4

Write a function called `count_words_from_file` that allows you to read in a file, clean the text, count the words and return the most common word.

**Reminder:** For this homework assignment, there should not be use of any external libraries, so no using pandas.

**Hint:** Also remove the `\n` character in your `clean_text` step, which means "new line".

In [308]:
file_name = "peter_pan_chapter_1.txt"

In [311]:
def count_words_from_file(file_name):
    """
    a) Reads the document from a file
    b) Cleans the text by converting it to lowercase and removing punctuation
    c) Creates a dictionary of word counts
    d) Gets the most common word
    """
    # Reading the file using open and reading the contents of the file
    with open(file_name, 'r') as file:
        data = file.read()
        
    # Replacing new line with blank space
    data = data.replace('\n', ' ')
    
    # Generating Freq map of the words after data cleaning using count_words method
    count_dict = count_words(data)
    
    # Finding word with highest frequency from the count dict 
    most_common_word = max(count_dict, key = lambda x : count_dict[x])
    print(f"Most Common Word : {most_common_word} Frequency : {count_dict[most_common_word]}")
    return most_common_word, count_dict

In [312]:
most_common_word, count_dict = count_words_from_file(file_name)

Most Common Word : the Frequency : 135
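
One detail worth checking when handling `\n`: replacing newlines with spaces and then splitting on a single space can leave empty strings in the word list, which would then be counted as a "word". The sketch below uses a short made-up string rather than the chapter file to show the difference between `split(' ')` and `split()`.

```python
data = "the cat\nsat  down\n"

# Replacing newlines and splitting on a single space leaves empty strings
# wherever two separators are adjacent or the text ends with one
with_space_split = data.replace('\n', ' ').split(' ')
print(with_space_split)  # ['the', 'cat', 'sat', '', 'down', '']

# split() with no argument treats spaces, tabs and newlines alike
# and discards the empty strings
whitespace_split = data.split()
print(whitespace_split)  # ['the', 'cat', 'sat', 'down']
```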


## Question 5

Expand on the previous functions to create a function called `count_words_in_many_documents` that reads in a list of documents, and creates a dictionary of word counts across all the documents.

In [313]:
documents = ['I enjoyed a delicious homemade lasagna for dinner last night.',
            'She ordered a classic Caesar salad for lunch at the restaurant.',
            'Breakfast is my favorite meal of the day! I always start with a hearty omelette.',
            "We're having grilled chicken with roasted vegetables for tonight's meal.",
            'On a hot summer day, nothing beats a refreshing fruit salad.',
            'They decided to order takeout sushi for a convenient and tasty meal.',
            'Thanksgiving dinner is traditionally a feast of turkey, stuffing, and cranberry sauce.',
            "My grandmother's homemade apple pie is the perfect dessert to end any meal.",
            "Aromatic spices and herbs can transform a simple dish into a culinary masterpiece.",
            "Exploring street food markets in foreign cities is a delightful way to experience local culture.",
            "The rich and creamy texture of a perfectly ripe avocado is a true gastronomic pleasure.",
            "Sushi, with its delicate balance of flavors and textures, is an art form on a plate.",
            "Homemade apple pie, fresh out of the oven, fills the air with a comforting, cinnamon-infused aroma.",
            "Savoring a warm bowl of chicken soup on a chilly day is like a hug for the soul.",
            "Food brings people together, creating lasting memories around the dinner table.",
            "Discovering new flavors and cuisines is an exciting culinary adventure that broadens the palate.",
            "Grilling outdoors on a sunny day creates a mouthwatering symphony of sizzling meats and vegetables.",
            "The joy of sharing a delicious meal with loved ones is one of life's simple pleasures."]

In [284]:
def update_freq_map(final_freq_map, temp_freq_map):
    """
    The function updates the final frequency map using the temporary freq map
    args:
        final_freq_map: output freq map which needs to be updated 
        temp_freq_map: temporary freq map which updates the final map 
    returns:
        final_freq_map: updated freq map
    """
    for key, value in temp_freq_map.items():
        if key in final_freq_map:
            final_freq_map[key]+=value
        else:
            final_freq_map[key] = value
    return final_freq_map


def count_words_in_many_documents(documents):
    """
    Given a list of documents, the function performs the following operations
        a) Cleans each text document by making it lowercase and removing punctuation
        b) Creates an individual frequency map for each document
        c) Combines all the individual frequency maps
    args:
        documents : list of input documents
    returns:
        final_count_dict : combined frequency map of all the documents
    """
    final_count_dict = {}
    for document in documents:
        
        # Getting word count dict for each document 
        temp_count_dict = count_words(document)
        
        # Merging temp word count dict into final word count dict 
        final_count_dict = update_freq_map(final_count_dict, temp_count_dict)
    
    return final_count_dict   

In [315]:
doc_word_dict = count_words_in_many_documents(documents)
print(doc_word_dict)

{'i': 2, 'enjoyed': 1, 'a': 20, 'delicious': 2, 'homemade': 3, 'lasagna': 1, 'for': 5, 'dinner': 3, 'last': 1, 'night': 1, 'she': 1, 'ordered': 1, 'classic': 1, 'caesar': 1, 'salad': 2, 'lunch': 1, 'at': 1, 'the': 10, 'restaurant': 1, 'breakfast': 1, 'is': 9, 'my': 2, 'favorite': 1, 'meal': 5, 'of': 9, 'day': 4, 'always': 1, 'start': 1, 'with': 5, 'hearty': 1, 'omelette': 1, "we're": 1, 'having': 1, 'grilled': 1, 'chicken': 2, 'roasted': 1, 'vegetables': 2, "tonight's": 1, 'on': 4, 'hot': 1, 'summer': 1, 'nothing': 1, 'beats': 1, 'refreshing': 1, 'fruit': 1, 'they': 1, 'decided': 1, 'to': 3, 'order': 1, 'takeout': 1, 'sushi': 2, 'convenient': 1, 'and': 7, 'tasty': 1, 'thanksgiving': 1, 'traditionally': 1, 'feast': 1, 'turkey': 1, 'stuffing': 1, 'cranberry': 1, 'sauce': 1, "grandmother's": 1, 'apple': 2, 'pie': 2, 'perfect': 1, 'dessert': 1, 'end': 1, 'any': 1, 'aromatic': 1, 'spices': 1, 'herbs': 1, 'can': 1, 'transform': 1, 'simple': 2, 'dish': 1, 'into': 1, 'culinary': 2, 'masterpiec
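
The merge step can be illustrated on two tiny hand-written count dictionaries. This is a sketch with a hypothetical helper `merge_counts`, which does the same job as `update_freq_map` using `dict.get`.

```python
def merge_counts(total, new_counts):
    """Add the counts in `new_counts` to `total` in place and return it."""
    for word, count in new_counts.items():
        total[word] = total.get(word, 0) + count
    return total


total = {}
merge_counts(total, {'a': 2, 'meal': 1})
merge_counts(total, {'a': 1, 'day': 1})
print(total)  # {'a': 3, 'meal': 1, 'day': 1}
```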

## Question 6

Create a function called `return_top_n_words` that takes in a dictionary of word counts (output of previous question) and returns the top n words specified by the user. The default value of n should be 3 if the user doesn't specify a value.

**Input:**

`return_top_n_words(docs, n=5)`

**Output:**

`The word 'a' appears 20 times.
The word 'the' appears 10 times.
The word 'is' appears 9 times.
The word 'of' appears 9 times.
The word 'and' appears 7 times.
`

In [316]:
def return_top_n_words(docs, n=3):
    """
    Given a list of documents and a value of n, the function returns the top n words across the documents.
    The function uses the count_words_in_many_documents function to create the freq map of all the documents
    args:
        docs : list of input documents
        n : number of top-frequency words to be returned (default 3)
    output:
        top_n_words_dict: dictionary of the top n words along with their frequencies
    """
    # Generating word dict from document 
    doc_word_dict = count_words_in_many_documents(docs)
    
    if n > len(doc_word_dict):
        print(f"n value exceeds the length of the word count dictionary, n = {n} , len = {len(doc_word_dict)}")
        print(f"Using n as maximum length of the word count dictionary")
        n = len(doc_word_dict)
    
    # Sorting (count, word) pairs in descending order and keeping the top n;
    # words with equal counts therefore come out in reverse alphabetical order
    top_n_matches = sorted(zip(doc_word_dict.values(), doc_word_dict.keys()), reverse=True)[:n]
    
    top_n_words_dict = {match[1] : match[0] for match in top_n_matches}
    return top_n_words_dict

In [318]:
print(return_top_n_words(documents, n = 10))

{'a': 20, 'the': 10, 'of': 9, 'is': 9, 'and': 7, 'with': 5, 'meal': 5, 'for': 5, 'on': 4, 'day': 4}
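
The question describes a function that takes the word-count dictionary itself and reports each word in the form `The word 'a' appears 20 times.` A minimal sketch of that variant is shown below. The name `print_top_n_words` and the sample dictionary are hypothetical, and sorting on the count alone is a swapped-in approach that keeps tied words in dictionary order.

```python
def print_top_n_words(word_counts, n=3):
    """Print and return the n most frequent (word, count) pairs."""
    # sorted() is stable, even with reverse=True, so words with equal
    # counts keep the order in which they appear in the dictionary
    ranked = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
    top_n = ranked[:n]
    for word, count in top_n:
        print(f"The word '{word}' appears {count} times.")
    return top_n


sample_counts = {'a': 20, 'for': 5, 'the': 10, 'is': 9, 'of': 9, 'and': 7}
print_top_n_words(sample_counts, n=5)
```

With this tie-breaking rule, `is` is listed before `of` because it appears first in the dictionary.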
