In [None]:
# Open the file in read mode
with open('t8.shakespeare.txt', 'r') as file:
    # Initialize a counter
    line_count = 0
    # Iterate through each line in the file
    for line in file:
        # Increment the counter for each line
        line_count += 1

print(f'The file has {line_count} lines.')

In [None]:
# docdist1.py
#
#
# This program computes the "distance" between two text files
# as the angle between their word frequency vectors.
#
# For each input file, a word-frequency vector is computed as follows:
#    (1) the specified file is read in
#    (2) it is converted into a list of alphanumeric "words"
#        Here a "word" is a sequence of consecutive alphanumeric
#        characters.  Non-alphanumeric characters are treated as blanks.
#        Case is not significant.
#    (3) for each word, its frequency of occurrence is determined
#    (4) the word/frequency lists are sorted into order alphabetically
#
# The "distance" between two vectors is the angle between them.
# If x = (x1, x2, ..., xn) is the first vector (xi = freq of word i)
# and y = (y1, y2, ..., yn) is the second vector,
# then the angle between them is defined as:
#    d(x,y) = arccos(inner_product(x,y) / (norm(x)*norm(y)))
# where:
#    inner_product(x,y) = x1*y1 + x2*y2 + ... xn*yn
#    norm(x) = sqrt(inner_product(x,x))

import math
    # math.acos(x) is the arccosine of x.
    # math.sqrt(x) is the square root of x.

import string
    # Python 3 removed the module-level string.join and string.lower functions;
    #    the equivalent str methods are used instead:
    # sep.join(words) takes a given list of words,
    #    and returns a single string resulting from concatenating them
    #    together, separated by the string sep .
    # word.lower() converts word to lower-case

import sys
    # sys.exit() allows us to quit (if we can't read a file)

# Operation 1: read a text file ##
##################################
def read_file(filename):
    ###    Read the text file with the given filename;
    ###     return a list of the lines of text in the file.
    try:
        # The with block closes the file even if reading fails.
        with open(filename) as fp:
            L = fp.readlines()
    except IOError as excObj:
        print(str(excObj))
        print("Error opening or reading input file: " + filename)
        sys.exit()
    return L
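
The distance described in the header comment of docdist1.py can be sketched directly from its formula. This is a minimal illustration rather than the program's own code: it assumes each frequency vector is a plain dict mapping word to count, and the names `inner_product`, `vector_angle`, `doc_a` and `doc_b` are chosen here only for the example.

```python
import math

def inner_product(x, y):
    # Sum of count products over the words that appear in both vectors.
    return sum(count * y[word] for word, count in x.items() if word in y)

def vector_angle(x, y):
    # d(x, y) = arccos(inner_product(x, y) / (norm(x) * norm(y)))
    cosine = inner_product(x, y) / math.sqrt(inner_product(x, x) * inner_product(y, y))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.acos(max(-1.0, min(1.0, cosine)))

# Hypothetical word counts, for illustration only.
doc_a = {"to": 2, "be": 2, "or": 1, "not": 1}
doc_b = {"to": 1, "be": 1, "question": 1}

print(vector_angle(doc_a, doc_a))  # identical documents give an angle of 0.0
print(vector_angle(doc_a, doc_b))  # partially overlapping documents
```

Documents with no words in common have an inner product of 0 and therefore an angle of pi/2, the largest distance this measure can report for non-negative counts.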

1. How many lines are in assets/t8.shakespeare.txt?

In [None]:
# Open the file in read mode
with open('/home/jovyan/work/resources/data/t8.shakespeare.txt', 'r') as file:
    # Initialize a counter
    line_count = 0
    # Iterate through each line in the file
    for line in file:
        # Increment the counter for each line
        line_count += 1

print(f'The file has {line_count} lines.')

The file has 124456 lines.

2. In the function word_frequencies_for_file(...), where is most of the time spent?

    > read_file(...)
    > get_words_from_line_list(...)
    > count_frequency(...)
    > insertion_sort(...)

#############################################
## compute word frequencies for input file ##
#############################################
import cProfile

def word_frequencies_for_file(filename,verbose=False):

    ### Return alphabetically sorted list of (word,frequency) pairs
    ### for the given file.

    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)
    insertion_sort(freq_mapping)
    if verbose:
        print("File",filename,":", len(line_list),"lines,", len(word_list),"words,", len(freq_mapping),"distinct words")

    return freq_mapping

def profile_word_frequencies(filename):
    word_frequencies_for_file(filename, verbose=True)

# Profile the function
cProfile.run('profile_word_frequencies("/home/jovyan/work/resources/data/t8.shakespeare.txt")')
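
To see where the time goes, it helps to sort the profiler's report by cumulative time so that the most expensive calls appear first. The sketch below shows one way to do that with the standard `pstats` module. The functions `slow_step`, `fast_step` and `pipeline` are hypothetical stand-ins, not part of docdist1.py; in the notebook the profiled call would be `word_frequencies_for_file(...)` instead.

```python
import cProfile
import io
import pstats

def slow_step(n):
    # Hypothetical stand-in for an expensive stage (quadratic work).
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total

def fast_step(n):
    # Hypothetical stand-in for a cheap stage (linear work).
    return sum(range(n))

def pipeline(n):
    return slow_step(n) + fast_step(n)

# Collect profiling data for a single call.
profiler = cProfile.Profile()
profiler.enable()
pipeline(200)
profiler.disable()

# Sort by cumulative time and keep only the top few entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

In the printed table, the `cumtime` column is the one to compare across the four candidate functions in the question.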

3. Which of the following commands would result in output that looks similar to:

        102 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

    
    > %%timeit
    > document_similarity('assets/short.t1.txt','assets/short.t2.txt')
    > %timeit  document_similarity('assets/short.t1.txt','assets/short.t2.txt')
    > %lprun -f document_similarity document_similarity('assets/short.t1.txt','assets/short.t2.txt')
    > %lprun -f __main__ document_similarity('assets/short.t1.txt','assets/short.t2.txt')
    > Any of the above would generate this output.

> %timeit document_similarity('assets/short.t1.txt','assets/short.t2.txt')

The %timeit magic command in IPython or Jupyter Notebook is used to time the execution of a single statement. It runs the statement multiple times to get a more accurate measurement of the execution time, providing the mean and standard deviation of the runs.

Of the other options:

    - %%timeit followed by the call on the next line is the cell-magic form of the same tool. It times the whole cell and reports the result in the same "mean ± std. dev." format, but it must be the first line of its cell. On its own, without the %%timeit line above it, document_similarity('assets/short.t1.txt','assets/short.t2.txt') simply runs the function without timing it.

    - %lprun -f document_similarity document_similarity('assets/short.t1.txt','assets/short.t2.txt') and %lprun -f __main__ document_similarity('assets/short.t1.txt','assets/short.t2.txt') use line_profiler to profile code line by line. Their output is a per-line table of hits and times, not a single timing summary.

    - "Any of the above" is therefore ruled out, because the %lprun options produce a different kind of report.

So the command that times the single statement directly is %timeit document_similarity('assets/short.t1.txt','assets/short.t2.txt').
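
Outside IPython, a similar summary can be approximated with the standard `timeit` module. The sketch below is an illustration under assumptions: `work` is a hypothetical stand-in for `document_similarity(...)`, the run and loop counts are fixed by hand to match the "7 runs, 10 loops each" in the sample output, and the formatting only imitates what the magic command prints.

```python
import statistics
import timeit

def work():
    # Hypothetical stand-in for document_similarity(...).
    return sum(i * i for i in range(10_000))

# 7 runs of 10 loops each; each entry is the total time for one run.
runs = timeit.repeat(work, repeat=7, number=10)

# Convert each run's total into a per-loop time, then summarise in milliseconds.
per_loop = [total / 10 for total in runs]
mean_ms = statistics.mean(per_loop) * 1e3
std_ms = statistics.pstdev(per_loop) * 1e3

print(f"{mean_ms:.3g} ms ± {std_ms:.3g} ms per loop (mean ± std. dev. of 7 runs, 10 loops each)")
```

Unlike this fixed-count sketch, %timeit chooses the number of loops automatically unless it is given the -n and -r options.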

4. What order is count_frequency()?

    > O(1)
    - O(n)
    > O(n^2)
    > O(nlogn)
    > O(2^n)


The count_frequency function has a time complexity of O(n), where n is the number of words in the word_list.

Explanation
1. Creating the Dictionary:
    - Initializing the dictionary frequency_dict is an O(1) operation.
2. Iterating Through the Word List:
    - The for loop iterates through each word in the word_list, which takes O(n) time, where n is the length of the list.
3. Updating the Frequency Count:
    - Checking if a word is in the dictionary and updating its count both have an average time complexity of O(1) due to the efficient average-case performance of dictionary operations in Python.
4. Converting Dictionary to List:
    - Converting the dictionary items to a list using list(frequency_dict.items()) takes O(m) time, where m is the number of unique words. Since m can never exceed n, this step is O(n) in the worst case.

Combining these steps, the overall time complexity of the count_frequency function is O(n) on average, so its running time scales linearly with the length of the input list.
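
The steps above can be sketched as follows. This is an assumed dictionary-based version written to match the explanation (it reuses the name `frequency_dict` mentioned there). The notebook's actual `count_frequency` is not shown in this section and may differ.

```python
def count_frequency(word_list):
    # Assumed implementation for illustration; the notebook's code may differ.
    frequency_dict = {}                      # step 1: O(1)
    for word in word_list:                   # step 2: n iterations
        if word in frequency_dict:           # step 3: O(1) on average
            frequency_dict[word] += 1
        else:
            frequency_dict[word] = 1
    return list(frequency_dict.items())      # step 4: O(m), with m <= n

words = ["to", "be", "or", "not", "to", "be"]
print(count_frequency(words))
# [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```

A version that instead searched a list of pairs for each word would do up to m comparisons per word and would no longer be linear, so the O(n) result depends on the dictionary lookup.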

5. Let's say that you have two files that were the same size and running document_similarity() on them took 18 seconds to complete.  If you then ran document_similarity() on two different files, each of which was twice the size of the original files, and it took 72 seconds to complete, what would that tell you about the order of the overall script?
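
One way to reason about this is to assume the running time follows a power law, T(n) ≈ c·n^k, and solve for k from the two measurements. The sketch below does that arithmetic. The helper name `empirical_exponent` is hypothetical, and two data points give only a rough estimate.

```python
import math

def empirical_exponent(t_small, t_large, size_ratio):
    # If T(n) ~ c * n**k, then T(r*n) / T(n) = r**k,
    # so k = log(time ratio) / log(size ratio).
    return math.log(t_large / t_small) / math.log(size_ratio)

# 18 s for the original files, 72 s for files twice the size.
k = empirical_exponent(18, 72, 2)
print(k)
```

Doubling the input multiplied the time by 72 / 18 = 4 = 2^2, so the estimated exponent is about 2. That is consistent with quadratic growth, O(n^2), for the overall script.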