# Cosine Similarity Implementation

## Required libraries

In [3]:
# For list type hint
from typing import List

# Library for command-line argument parsing
import argparse

# For concurrent file reading
from concurrent.futures import ThreadPoolExecutor

# For TF-IDF calculation
from sklearn.feature_extraction.text import TfidfVectorizer

# For cosine similarity calculation
from sklearn.metrics.pairwise import cosine_similarity

# For matrix type hint
from scipy.sparse import csr_matrix


## Functions

  - Some functions have no separate explanation because they are written to be self-explanatory: descriptive names, type hints, and a docstring.

### File Reading

#### Single File
- This could be improved: the whole file is read into memory at once.

In [4]:

def read_single_file(file_path: str) -> str:
    """
    Read the content of a file given its file path.
    If the file does not exist or an error occurs, raise an error.
    """
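    try:
        # Assumes UTF-8 text; an explicit encoding avoids the platform default
        with open(file_path, 'r', encoding='utf-8') as f:
            return f.read()
    except FileNotFoundError as error:
        # Chain the original exception so the traceback is preserved
        raise FileNotFoundError(f"File {file_path} does not exist.") from error
    except IOError as error:
        raise IOError(f"An error occurred while reading {file_path}: {error}") from error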

- This function might change (https://github.com/colesbury/nogil-3.12), because I will probably use it for CPU-intensive
    tasks that need to run in parallel (for now I don't know whether the API is 100% compatible).
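The memory concern noted above can be illustrated with a generator that yields a file line by line instead of loading it whole. This is only a sketch: `iter_file_lines` is a hypothetical helper that is not part of this notebook, and UTF-8 is an assumed encoding.

```python
import os
import tempfile
from typing import Iterator


def iter_file_lines(file_path: str) -> Iterator[str]:
    """Yield the lines of a file one at a time (hypothetical helper)."""
    # Assumption: the file is UTF-8 text.
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            yield line


# Small demonstration on a temporary file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first line\nsecond line\n")
    demo_path = tmp.name

# Only one line is held in memory at a time while counting.
line_count = sum(1 for _ in iter_file_lines(demo_path))
os.remove(demo_path)
print(line_count)
```

Whether this helps depends on the downstream step, since `read_multiple_files` expects one string per file.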

#### Multiple Files

- It is written in a modular style: the single-file reader can be replaced and the program will still work,
  as long as the reader returns a single string.
- Once the `with` block completes, the thread pool is shut down automatically via `shutdown()`.

In [5]:
def read_multiple_files(file_paths: List[str]) -> List[str]:
    """
    Read multiple files concurrently and return their content.
    """
    with ThreadPoolExecutor() as executor:
        return list(executor.map(read_single_file, file_paths))
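A short usage sketch may help show the behaviour: `executor.map` returns results in the same order as the input paths, regardless of which thread finishes first. The temporary files and their names are made up for this demonstration, and `read_single_file` is reduced here to a minimal reader with the same contract (path in, string out), assuming UTF-8 text.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import List


def read_single_file(file_path: str) -> str:
    """Read one file into a string (minimal stand-in for the reader above)."""
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()


def read_multiple_files(file_paths: List[str]) -> List[str]:
    """Read multiple files concurrently and return their content."""
    with ThreadPoolExecutor() as executor:
        # Results come back in the order of file_paths.
        return list(executor.map(read_single_file, file_paths))


with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical demo files, created only for this example.
    demo_paths = []
    for index in range(3):
        path = os.path.join(tmp_dir, f"demo{index}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(f"content {index}")
        demo_paths.append(path)

    contents = read_multiple_files(demo_paths)

print(contents)
```

If any single read raises, the exception surfaces when the results are collected into the list, so the caller sees the same error it would get from `read_single_file`.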

### Validation

#### File Paths
- Unique file paths checker

In [None]:

def validate_file_paths(file_paths: List[str]) -> None:
    """
    Validate that all file paths are unique. Raise ValueError if they are not.
    """
    if len(set(file_paths)) != len(file_paths):
        raise ValueError("All files must be unique.")
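One limitation of the check above is that it compares path strings, so two different spellings of the same file are treated as distinct. The sketch below shows this and a possible variant that normalizes paths first. `validate_normalized_file_paths` is a hypothetical helper, not part of this notebook, and `os.path.abspath` is just one way to normalize (it does not resolve symbolic links).

```python
import os
from typing import List


def validate_file_paths(file_paths: List[str]) -> None:
    """Raise ValueError if any path string appears more than once."""
    if len(set(file_paths)) != len(file_paths):
        raise ValueError("All files must be unique.")


def validate_normalized_file_paths(file_paths: List[str]) -> None:
    """Hypothetical variant: compare absolute, normalized paths instead."""
    normalized = [os.path.abspath(path) for path in file_paths]
    validate_file_paths(normalized)


# Different spellings of the same file pass the string-based check...
validate_file_paths(["file1.txt", "./file1.txt"])

# ...but are rejected once the paths are normalized.
try:
    validate_normalized_file_paths(["file1.txt", "./file1.txt"])
    duplicate_detected = False
except ValueError:
    duplicate_detected = True

print(duplicate_detected)
```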

#### 

<div class="admonition note">
    <p class="admonition-title">Note</p>
    <p>
        If two distributions are similar, then their entropies are similar,
        which implies that the KL divergence between the two distributions
        will be smaller...
    </p>
</div>

Three sample files:

- file1.txt

```
Cats are small, furry animals that are known for their playful and independent nature.
They are one of the most popular pets in the world.
Cats have sharp claws and teeth, which they use for hunting.

```

- file2.txt
  
```
Cats are known for their love of sleeping. 
They can sleep up to 16 hours a day.
Unlike dogs, cats are solitary animals and do not require constant attention
```

- file3.txt

```
Cats have a strong territorial instinct. 
They mark their territory by releasing pheromones.
They are also known for their agility and are excellent climbers.
```
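To give a feel for what a pairwise cosine similarity over these three texts looks like, here is a simplified, standard-library sketch. It is an assumption-laden stand-in: it uses raw term counts with a naive regex tokenizer rather than the TF-IDF weights that `TfidfVectorizer` would produce, so the numbers will differ from those of the scikit-learn pipeline. The helper names `term_counts` and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter

# Contents of the three sample files, inlined for the demonstration.
documents = [
    "Cats are small, furry animals that are known for their playful and "
    "independent nature. They are one of the most popular pets in the world. "
    "Cats have sharp claws and teeth, which they use for hunting.",
    "Cats are known for their love of sleeping. They can sleep up to 16 hours "
    "a day. Unlike dogs, cats are solitary animals and do not require "
    "constant attention",
    "Cats have a strong territorial instinct. They mark their territory by "
    "releasing pheromones. They are also known for their agility and are "
    "excellent climbers.",
]


def term_counts(text: str) -> Counter:
    """Count lowercase alphanumeric tokens (naive tokenizer, an assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vectors = [term_counts(document) for document in documents]
similarity = [[cosine(a, b) for b in vectors] for a in vectors]

for row in similarity:
    print([round(value, 3) for value in row])
```

The result is a symmetric 3x3 matrix with ones on the diagonal; the off-diagonal entries are the pairwise similarities between the files.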