<h1 align='center'> CSC4120 Programming Assignment 1 </h1>

## Submission Requirements

   The submission <font color = #FF0000>deadline is January 28 (Sun.), 2024, 11:59 pm</font>. Solutions submitted after the deadline will receive 0 points. Please submit an **ipynb** file and clearly state your group members' student IDs; otherwise, points will be deducted.

## What you need to do

1. Understand the document distance problem.

2. Understand the Python code and how we improve the algorithm in each step.

3. Implement merge sort and the dictionary version.

## Student IDs

- 120090244
- 121090271
- 122090031

In [2]:
import math
import sys
import cProfile
import string

filename_1 = "file1.txt"
filename_2 = "file2.txt"
translation_table = str.maketrans(string.punctuation + string.ascii_uppercase,
                                     " "*len(string.punctuation) + string.ascii_lowercase)


## 1. Initial version of document distance

This program computes the "distance" between two text files as the angle between their word frequency vectors (in radians).

For each input file, a word-frequency vector is computed as follows:

   (1) the specified file is read in

   (2) it is converted into a list of alphanumeric "words"

       Here a "word" is a sequence of consecutive alphanumeric
       characters.  Non-alphanumeric characters are treated as blanks.
       Case is not significant.

   (3) for each word, its frequency of occurrence is determined
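
Step (2) above can be illustrated with a small sketch. It rebuilds the same translation table the notebook defines at the top; the helper name `demo_words` and the sample sentence are made up for illustration. Note that the table only covers ASCII punctuation, so other non-alphanumeric characters would pass through unchanged.

```python
import string

# Same mapping as the notebook's translation_table:
# ASCII punctuation -> space, upper-case -> lower-case.
table = str.maketrans(
    string.punctuation + string.ascii_uppercase,
    " " * len(string.punctuation) + string.ascii_lowercase,
)

def demo_words(line):
    """Illustrative helper: split one line into lower-case words."""
    return line.translate(table).split()

words = demo_words("Hello, World! It's 2024.")
print(words)  # ['hello', 'world', 'it', 's', '2024']
```

The apostrophe is treated as a blank, so "It's" becomes the two words "it" and "s".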

The "distance" between two vectors is the angle between them.

If $ x = (x_1, x_2, \ldots, x_n) $ is the first vector ($ x_i $ = frequency of word $ i $)
and $ y = (y_1, y_2, \ldots, y_n) $ is the second vector,
then the angle between them is defined as:

   $$ d(x,y) = \arccos{\left(\frac{\operatorname{innerProduct}(x,y)}{\operatorname{norm}(x) \cdot \operatorname{norm}(y)}\right)} $$

where:
$$
\begin{cases}
\operatorname{innerProduct}(x,y) = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n \\[1em]
\operatorname{norm}(x) = \sqrt{\operatorname{innerProduct}(x,x)}
\end{cases}
$$
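
As a quick sanity check of the definition, here is a minimal sketch that evaluates the angle for small dense vectors. The function name `demo_angle` and the sample vectors are assumptions for illustration only; the notebook itself works with (word, frequency) lists rather than plain tuples.

```python
import math

def demo_angle(x, y):
    """Illustrative helper: angle (radians) between two equal-length vectors."""
    dot_xy = sum(a * b for a, b in zip(x, y))
    dot_xx = sum(a * a for a in x)
    dot_yy = sum(b * b for b in y)
    cosine = dot_xy / math.sqrt(dot_xx * dot_yy)
    # Guard against rounding pushing the ratio slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return math.acos(cosine)

# Documents with the same word proportions point in the same direction.
print(demo_angle((1, 1), (2, 2)))
# Documents with no words in common are orthogonal.
print(demo_angle((1, 0), (0, 1)))
```

Identical word proportions give an angle of 0, and documents sharing no words give $\pi/2$, which is the largest value possible for non-negative frequency vectors.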

   ***


### What you need to do

Run the code and report the running time.

There are 71158 function calls (71147 primitive calls) in 1.880 seconds.
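
When reading profiler output, it can help to sort by cumulative time instead of the default "standard name" ordering. The sketch below shows one way to do that with the standard `cProfile` and `pstats` modules; `profile_top` and `busy_work` are hypothetical names used only for this illustration, and `busy_work` stands in for a function such as `docdist1`.

```python
import cProfile
import io
import pstats

def profile_top(func, limit=5):
    """Illustrative helper: profile func() and return the top rows as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Order rows by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()

def busy_work():
    """Stand-in workload; replace with the function to be measured."""
    return sum(i * i for i in range(10000))

report = profile_top(busy_work)
print(report)
```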

In [3]:
def read_file(filename):
    """ 
    Read the text file with the given filename;
    return a list of the lines of text in the file.
    """
    try:
        with open(filename, 'r', encoding='utf-8') as f:
            return f.readlines()
    except IOError:
        print("Error opening or reading input file: ", filename)
        sys.exit()

def get_words_from_line_list(L):
    """
    Parse the given list L of text lines into words.
    Return list of all words found.
    """
    word_list = []
    for line in L:
        words_in_line = get_words_from_string(line)
        word_list = word_list + words_in_line
    return word_list

def get_words_from_string(line):
    """
    Return a list of the words in the given input string,
    converting each word to lower-case.

    Input:  line (a string)
    Output: a list of strings 
              (each string is a sequence of alphanumeric characters)
    """
    line = line.translate(translation_table)
    word_list = line.split()
    return word_list

def count_frequency(word_list):
    """
    Return a list giving pairs of form: (word,frequency)
    """
    L = []
    for new_word in word_list:
        for entry in L:
            if new_word == entry[0]:
                entry[1] = entry[1] + 1
                break
        else:
            L.append([new_word, 1])
    return L

def word_frequencies_for_file(filename):
    """
    Return a list of (word,frequency) pairs for the given file,
    in order of first occurrence (this version does not sort).
    """
    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)
    return freq_mapping

def inner_product(L1, L2):
    """
    Inner product between two vectors, where vectors
    are represented as lists of (word,freq) pairs.

    Example: inner_product([["and",3],["of",2],["the",5]],
                           [["and",4],["in",1],["of",1],["this",2]]) = 14.0 
    """
    sum = 0.0
    for word1, count1 in L1:
        for word2, count2 in L2:
            if word1 == word2:
                sum += count1 * count2
    return sum

def vector_angle(L1, L2):
    """
    The inputs are two vectors, each a list of (word,freq) pairs.

    Return the angle between these two vectors.
    """
    numerator = inner_product(L1, L2)
    denominator = math.sqrt(inner_product(L1, L1) * inner_product(L2, L2))
    return math.acos(numerator / denominator)

def docdist1():
    document_vector_1 = word_frequencies_for_file(filename_1)
    document_vector_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(document_vector_1, document_vector_2)
    print("The distance between the documents is: %0.6f (radians)"%distance)

cProfile.run("docdist1()")

The distance between the documents is: 0.619328 (radians)
         71158 function calls (71147 primitive calls) in 1.880 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.004    0.002 573875801.py:1(read_file)
        2    1.686    0.843    1.716    0.858 573875801.py:13(get_words_from_line_list)
    23110    0.004    0.000    0.029    0.000 573875801.py:24(get_words_from_string)
        2    0.146    0.073    0.146    0.073 573875801.py:37(count_frequency)
        2    0.000    0.000    1.866    0.933 573875801.py:51(word_frequencies_for_file)
        3    0.011    0.004    0.011    0.004 573875801.py:61(inner_product)
        1    0.000    0.000    0.011    0.011 573875801.py:76(vector_angle)
        1    0.001    0.001    1.879    1.879 573875801.py:86(docdist1)
        2    0.000    0.000    0.000    0.000 <frozen abc>:121(__subclasscheck__)
        2    0.000    0.000    0.000    0.000 <f

## 2. Replace concatenation with `extend` in `get_words_from_line_list`

Compare the running time with the previous version, explain why it improves (or does not), and use `cProfile` to identify where the difference comes from.
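
The difference between the two approaches can be seen in isolation with a small timing sketch. `result = result + chunk` allocates a new list and copies every element accumulated so far, so the total work grows quadratically with the number of words, whereas `extend` appends in place. The function names and the synthetic `chunks` data below are assumptions for illustration; actual timings depend on the machine.

```python
import timeit

def build_with_concat(chunks):
    """Illustrative: accumulate with +, copying the whole list each time."""
    result = []
    for chunk in chunks:
        result = result + chunk  # new list, copies everything so far
    return result

def build_with_extend(chunks):
    """Illustrative: accumulate in place with extend."""
    result = []
    for chunk in chunks:
        result.extend(chunk)  # appends to the existing list
    return result

# Synthetic input: 2000 "lines" of 5 words each.
chunks = [["w"] * 5 for _ in range(2000)]

t_concat = timeit.timeit(lambda: build_with_concat(chunks), number=3)
t_extend = timeit.timeit(lambda: build_with_extend(chunks), number=3)
print(f"concat: {t_concat:.4f}s  extend: {t_extend:.4f}s")
```

Both functions return the same list; only the amount of copying differs.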

\###

Your answer goes here

\###



In [None]:
def get_words_from_line_list(L):
    """
    Parse the given list L of text lines into words.
    Return list of all words found.
    """
    word_list = []
    for line in L:
        words_in_line = get_words_from_string(line)
        word_list.extend(words_in_line)
    return word_list

def docdist2():
    document_vector_1 = word_frequencies_for_file(filename_1)
    document_vector_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(document_vector_1, document_vector_2)
    print("The distance between the documents is: %0.6f (radians)"%distance)

cProfile.run("docdist2()")

## 3. Sort the document vector

Compare the running time with the previous version, explain why it improves (or does not), and use `cProfile` to identify where the difference comes from.
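
To see what sorting buys for the inner product, the sketch below counts word comparisons for the nested-loop approach and for a single merged scan over two sorted lists, using the example from the `inner_product` docstring. The two function names and the comparison counters are added for illustration and are not part of the assignment code.

```python
def inner_product_nested(L1, L2):
    """Illustrative: compare every pair of entries; returns (sum, comparisons)."""
    total, comparisons = 0.0, 0
    for word1, count1 in L1:
        for word2, count2 in L2:
            comparisons += 1
            if word1 == word2:
                total += count1 * count2
    return total, comparisons

def inner_product_sorted(L1, L2):
    """Illustrative: one pass over two sorted lists; returns (sum, loop steps)."""
    total, steps = 0.0, 0
    i = j = 0
    while i < len(L1) and j < len(L2):
        steps += 1
        if L1[i][0] == L2[j][0]:
            total += L1[i][1] * L2[j][1]
            i += 1
            j += 1
        elif L1[i][0] < L2[j][0]:
            i += 1
        else:
            j += 1
    return total, steps

A = [["and", 3], ["of", 2], ["the", 5]]
B = [["and", 4], ["in", 1], ["of", 1], ["this", 2]]
print(inner_product_nested(A, B))
print(inner_product_sorted(A, B))
```

The nested version always performs `len(L1) * len(L2)` comparisons, while the sorted scan takes at most `len(L1) + len(L2)` steps. The cost of sorting itself has to be paid separately, which is why the choice of sorting algorithm matters.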

\###

Your answer goes here

\###


In [None]:
def insertion_sort(A):
    """
    Sort list A into order, in place.

    From Cormen/Leiserson/Rivest/Stein,
    Introduction to Algorithms (second edition), page 17,
    modified to adjust for fact that Python arrays use 
    0-indexing.
    """
    for j in range(len(A)):
        key = A[j]
        # insert A[j] into sorted sequence A[0..j-1]
        i = j - 1
        while i > -1 and A[i] > key:
            A[i + 1] = A[i]
            i = i - 1
        A[i + 1] = key
    return A
    
def word_frequencies_for_file(filename):
    """
    Return alphabetically sorted list of (word,frequency) pairs 
    for the given file.
    """
    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)
    insertion_sort(freq_mapping)
    return freq_mapping

def inner_product(L1, L2):
    """
    Inner product between two vectors, where vectors
    are represented as alphabetically sorted (word,freq) pairs.

    Example: inner_product([["and",3],["of",2],["the",5]],
                           [["and",4],["in",1],["of",1],["this",2]]) = 14.0 
    """
    sum = 0.0
    i = 0
    j = 0
    while i < len(L1) and j < len(L2):
        # L1[i:] and L2[j:] yet to be processed
        if L1[i][0] == L2[j][0]:
            # both vectors have this word
            sum += L1[i][1] * L2[j][1]
            i += 1
            j += 1
        elif L1[i][0] < L2[j][0]:
            # word L1[i][0] is in L1 but not L2
            i += 1
        else:
            # word L2[j][0] is in L2 but not L1
            j += 1
    return sum

def docdist3():
    sorted_word_list_1 = word_frequencies_for_file(filename_1)
    sorted_word_list_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(sorted_word_list_1, sorted_word_list_2)
    print("The distance between the documents is: %0.6f (radians)"%distance)

cProfile.run("docdist3()")

## 4. Change sorting from insertion sort to merge sort

Implement merge sort.

Compare the running time with the previous version, explain why it improves (or does not), and use `cProfile` to identify where the difference comes from.
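
For reference, one common top-down formulation of merge sort is sketched below. It is a minimal sketch, not necessarily the version expected for this assignment; the name `merge_sort_sketch` is hypothetical, and it returns a new list rather than sorting in place, which matches how `merge_sort` is called in the cell that follows.

```python
def merge_sort_sketch(items):
    """Illustrative merge sort: return a new sorted list."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort_sketch(items[:mid])
    right = merge_sort_sketch(items[mid:])

    # Merge the two sorted halves.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of these still has elements left.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

pairs = [["the", 5], ["and", 3], ["of", 2], ["in", 1]]
print(merge_sort_sketch(pairs))
```

Because each `[word, frequency]` pair is a list, Python compares pairs lexicographically, so sorting the pairs orders them by word. Merge sort runs in O(n log n) time, compared with O(n²) in the worst case for insertion sort.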

\###

Your answer goes here

\###

In [None]:
def merge_sort(A):
    """
    Sort list A into order, and return result.
    """
    #######################
    #                     #
    #     TO IMPLEMENT    #
    #                     #
    #######################

def word_frequencies_for_file(filename):
    """
    Return alphabetically sorted list of (word,frequency) pairs 
    for the given file.
    """
    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)
    freq_mapping = merge_sort(freq_mapping)
    return freq_mapping

def docdist4():
    sorted_word_list_1 = word_frequencies_for_file(filename_1)
    sorted_word_list_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(sorted_word_list_1, sorted_word_list_2)
    print("The distance between the documents is: %0.6f (radians)"%distance)

cProfile.run("docdist4()")

## 5. Use dictionaries instead of lists

Implement the algorithm using dictionaries instead of lists. 

Explain why this version is faster, and use `cProfile` to identify where the improvement comes from.
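
A minimal sketch of the dictionary approach is shown below, assuming a plain `dict` keyed by word. The `_sketch` function names are hypothetical and this is only one possible shape for the two functions left to implement. Dictionary lookups take constant time on average, so counting becomes linear in the number of words and no sorting step is needed.

```python
def count_frequency_sketch(word_list):
    """Illustrative: map each word to its number of occurrences."""
    freq = {}
    for word in word_list:
        freq[word] = freq.get(word, 0) + 1
    return freq

def inner_product_sketch(D1, D2):
    """Illustrative: sum of count products over words present in both dicts."""
    # Iterate over the smaller dictionary to minimise lookups.
    if len(D1) > len(D2):
        D1, D2 = D2, D1
    total = 0.0
    for word, count in D1.items():
        if word in D2:
            total += count * D2[word]
    return total

print(count_frequency_sketch(["the", "of", "the", "and", "the"]))
print(inner_product_sketch({"and": 3, "of": 2, "the": 5},
                           {"and": 4, "in": 1, "of": 1, "this": 2}))
```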

\### 

Your answer goes here

\###

In [None]:
def read_file(filename):
    """ 
    Read the text file with the given filename;
    return a list of the lines of text in the file.
    """
    try:
        with open(filename, 'r', encoding='utf-8') as f:
            return f.readlines()
    except IOError:
        print("Error opening or reading input file: ", filename)
        sys.exit()

def get_words_from_line_list(L):
    """
    Parse the given list L of text lines into words.
    Return list of all words found.
    """
    word_list = []
    for line in L:
        words_in_line = get_words_from_string(line)
        word_list.extend(words_in_line)
    return word_list

def get_words_from_string(line):
    """
    Return a list of the words in the given input string,
    converting each word to lower-case.

    Input:  line (a string)
    Output: a list of strings 
              (each string is a sequence of alphanumeric characters)
    """
    line = line.translate(translation_table)
    word_list = line.split()
    return word_list

def count_frequency(word_list):
    """
    Input a list of words
    Return a DICTIONARY of (word, frequency) pairs
    """
    #######################
    #                     #
    #     TO IMPLEMENT    #
    #                     #
    #######################

def word_frequencies_for_file(filename):
    """
    Return dictionary of (word,frequency) pairs for the given file.
    """

    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)
    return freq_mapping

def inner_product(D1, D2):
    """
    Inner product between two vectors, where vectors
    are represented as dictionaries of (word,freq) pairs.

    Example: inner_product({"and":3,"of":2,"the":5},
                           {"and":4,"in":1,"of":1,"this":2}) = 14.0 
    """
    #######################
    #                     #
    #     TO IMPLEMENT    #
    #                     #
    #######################

def vector_angle(D1, D2):
    """
    The inputs are two vectors, each represented as a dictionary of (word,freq) pairs.
    Return the angle between these two vectors.
    """
    numerator = inner_product(D1, D2)
    denominator = math.sqrt(inner_product(D1, D1) * inner_product(D2, D2))
    return math.acos(numerator / denominator)

def docdist5():
    word_dict_1 = word_frequencies_for_file(filename_1)
    word_dict_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(word_dict_1, word_dict_2)
    print("The distance between the documents is: %0.6f (radians)"%distance)

cProfile.run("docdist5()")