## A simple tf-idf implementation:

In [None]:
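import math
def tfidf(term, doc, doc_list):
  # Compare whole words rather than substrings (e.g. "the" should not match "then")
  docs_with_term = [d for d in doc_list if term in d.split()]
  if not docs_with_term:
    return 0.0 #term never appears in the corpus; avoids dividing by zero below
  tf = doc.split().count(term) #term count in this document
  df = len(docs_with_term)/len(doc_list) #fraction of documents containing the term
  idf = math.log(1/df)
  return tf * idf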

In [None]:
documents = ["the mouse", "the small cat", "the cheese", "the big cat"]

In [None]:
# TODO: calculate the importance of each word in a document
tfidf("cheese", "the small cat", documents)

0.0

In [None]:
def getUniqueVocab(documents):
  #Find unique vocab words
  vocab = []
  for doc in documents:
    vocab.extend(doc.split())
  vocab = set(vocab)
  return vocab

def vectorize(phrase, documents):
  #Find unique vocab words
  vocab = getUniqueVocab(documents)

  #Build vector representation
  result = []
  for word in vocab:
    result.append(tfidf(word, phrase, documents))
  return result

In [None]:
# TODO: let's vectorize some documents!
print(getUniqueVocab(documents))
print(vectorize("the small cat", documents))

{'cheese', 'small', 'big', 'the', 'mouse', 'cat'}
[0.0, 1.3862943611198906, 0.0, 0.0, 0.0, 0.6931471805599453]
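
Because `vocab` is a `set`, the order of the entries in the vector can change from run to run, which makes the raw list hard to read. As a sketch, here is one way to pair each word with its weight using a sorted vocabulary. `tfidf_by_word` is a hypothetical helper (not part of the notebook) that applies the same `tf` and `idf` definitions as above, assuming words are separated by whitespace.

```python
import math

documents = ["the mouse", "the small cat", "the cheese", "the big cat"]

def tfidf_by_word(phrase, doc_list):
    """Hypothetical helper: map each vocabulary word to its tf-idf weight in `phrase`."""
    tokenized = [d.split() for d in doc_list]
    # Sorting gives a stable column order, unlike iterating over a set.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    phrase_tokens = phrase.split()
    weights = {}
    for word in vocab:
        n_docs_with_word = sum(word in tokens for tokens in tokenized)
        df = n_docs_with_word / len(doc_list)   # fraction of documents containing the word
        weights[word] = phrase_tokens.count(word) * math.log(1 / df)
    return weights

print(tfidf_by_word("the small cat", documents))
```

Here "small" gets `log(4)` because it appears in one of four documents, "cat" gets `log(2)` because it appears in two of four, and "the" gets `log(1) = 0` because it appears in all of them.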


----
## TF-IDF with SK-Learn

This is a pretty crude approach to tf-idf. In practice we often want to smooth out the tf-idf computation, and include extra normalization terms.

There are many corner cases to consider, and variations on how we can compute the `tf` term and the `idf` term.

It's often useful to use a library to compute tf-idf, and scikit-learn (`sklearn`) provides one.
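
To make one such corner case concrete: the simple `idf = log(1/df)` above is undefined for a word that appears in no document, and it gives zero weight to a word that appears in every document. The sketch below compares it with one common smoothed variant (add one to both counts, then add one to the result), which is the form scikit-learn uses by default. The function names are made up for illustration.

```python
import math

def idf_crude(n_docs, n_docs_with_term):
    # Unsmoothed form: log(N / count). Fails when count is 0.
    return math.log(n_docs / n_docs_with_term)

def idf_smoothed(n_docs, n_docs_with_term):
    # Smoothed form: log((1 + N) / (1 + count)) + 1. Stays finite even when count is 0.
    return math.log((1 + n_docs) / (1 + n_docs_with_term)) + 1

n_docs = 4
for count in (4, 2, 1, 0):
    try:
        crude = round(idf_crude(n_docs, count), 3)
    except ZeroDivisionError:
        crude = "undefined"
    print(count, crude, round(idf_smoothed(n_docs, count), 3))
```

With four documents, a word found in all of them gets a crude idf of 0 but a smoothed idf of 1, so it is down-weighted rather than erased.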

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# TODO: create vectorizer and fit_transform our documents (training set)
vectorizer = TfidfVectorizer(min_df=1)

In [None]:
vectorized_docs = vectorizer.fit_transform(documents)

In [None]:
vectorizer.get_feature_names_out()

array(['big', 'cat', 'cheese', 'mouse', 'small', 'the'], dtype=object)

In [None]:
for train_vec in vectorized_docs:
  print(train_vec.toarray())

[[0.         0.         0.         0.88654763 0.         0.46263733]]
[[0.         0.5728925  0.         0.         0.72664149 0.37919167]]
[[0.         0.         0.88654763 0.         0.         0.46263733]]
[[0.72664149 0.5728925  0.         0.         0.         0.37919167]]

Note that "the" (word #5) has a non-zero value even though it occurs in every document. Still, "the" has the smallest non-zero value in every row (about 0.38 to 0.46), while "cheese" and "mouse" have the largest (about 0.89).

**TODO: ANSWER ON CHIME IN** For which examples does SK-Learn's TF-IDF formula give "the" the most weight? Why is that? Any advantages to this method?

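The actual scikit-learn formula (with the default settings `smooth_idf=True` and `norm='l2'`) is $\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \left( \ln\frac{1 + N}{1 + \text{df}(t)} + 1 \right)$, where $N$ is the total number of documents in the collection, $\text{df}(t)$ is the number of documents that contain the word $t$, and $\text{tf}(t, d)$ is the number of times $t$ occurs in document $d$. Each document vector is then divided by its Euclidean (L2) norm, so a word's final weight also depends on the other words in the same document.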
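
As a sanity check, the printed rows can be recomputed by hand. The sketch below assumes scikit-learn's default settings (raw counts for `tf`, smoothed idf `ln((1 + N) / (1 + df)) + 1`, then L2 normalization of each row) and plain whitespace tokenization, which is simpler than the vectorizer's real tokenizer but gives the same tokens for these documents. `smoothed_tfidf_row` is a hypothetical name.

```python
import math

documents = ["the mouse", "the small cat", "the cheese", "the big cat"]

def smoothed_tfidf_row(doc, doc_list):
    """Hypothetical helper: L2-normalized tf-idf weights for `doc`, keyed by word."""
    tokenized = [d.split() for d in doc_list]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(doc_list)
    doc_tokens = doc.split()

    raw = {}
    for word in vocab:
        df = sum(word in tokens for tokens in tokenized)  # documents containing the word
        idf = math.log((1 + n_docs) / (1 + df)) + 1       # smoothed idf
        raw[word] = doc_tokens.count(word) * idf          # raw count times idf

    norm = math.sqrt(sum(w * w for w in raw.values()))    # Euclidean length of the row
    return {word: (w / norm if norm else 0.0) for word, w in raw.items()}

row = smoothed_tfidf_row("the mouse", documents)
print({word: round(w, 4) for word, w in row.items() if w > 0})
# prints {'mouse': 0.8865, 'the': 0.4626}
```

These values match the first row printed by the vectorizer above.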

We can use `vectorizer.transform` along with `toarray()` to get the vectorized result of a new document:

In [None]:
# TODO: print the vector version of "the big cheese"
print(vectorizer.transform(["the big cheese"]).toarray())

[[0.66338461 0.         0.66338461 0.         0.         0.34618161]]


----
## Your turn:

1. First, compute the distance between two vectors. My examples use the squared Euclidean distance, but you can use other distance measures (e.g. cosine distance).

Here is a stub of `my_dist` to get you started:

In [None]:
import numpy as np
def my_dist(v1, v2):
  return 0 # TODO: Update this to be the distance between v1 and v2

In [None]:
print(my_dist(np.array([3]),np.array([4])))          #Squared L2 Dist is 1
print(my_dist(np.array([1,1]),np.array([2,2])))      #Squared L2 Dist is 2
print(my_dist(np.array([1,2,3]),np.array([1,-1,7]))) #Squared L2 Dist is 25
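
For comparison with the squared Euclidean distance used in the checks above, here is a minimal sketch of the cosine distance mentioned earlier. It is not a solution to the `my_dist` stub. `cosine_dist` is a hypothetical name, and returning 1.0 for an all-zero vector is an assumption made here, since the cosine is undefined in that case.

```python
import numpy as np

def cosine_dist(v1, v2):
    """Hypothetical helper: cosine distance, i.e. 1 - cos(angle between v1 and v2)."""
    v1 = np.ravel(v1).astype(float)  # flatten so (1, n) arrays from toarray() also work
    v2 = np.ravel(v2).astype(float)
    norm_product = np.linalg.norm(v1) * np.linalg.norm(v2)
    if norm_product == 0:
        return 1.0  # assumption: treat an all-zero vector as unrelated to everything
    return float(1.0 - np.dot(v1, v2) / norm_product)

print(cosine_dist(np.array([1, 1]), np.array([2, 2])))  # same direction: approximately 0
print(cosine_dist(np.array([1, 0]), np.array([0, 1])))  # orthogonal: 1.0
```

For vectors of length 1, the squared Euclidean distance equals twice the cosine distance. `TfidfVectorizer` L2-normalizes its rows by default, so on these vectors the two distances rank documents in the same order.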

If your distance function above works, you can now compute the distance between every document vector and a new document using code like this:

In [None]:
new_post_vec = vectorizer.transform(["the big cheese"])
[my_dist(new_post_vec.toarray(),train_vec.toarray()) for train_vec in vectorized_docs] #I get [1.7, 1.7, 0.5, 0.7], your results may differ

2. Now write a function called `findClosest` that takes a string as input and returns the document in `documents` whose feature vector is closest.

Again, we'll give you a stub of `findClosest` to help get you started:

In [None]:
def findClosest(prompt):
  closest_id = 0  # TODO: Fix this to be the actual closest index
  return documents[closest_id]

In [None]:
findClosest("the big cheese") #Should probably return "the cheese"

---
Thought Experiment - **ANSWER ON CHIME IN**
 - What would you want to happen when the input is `"a cheesy slice of pizza"`?
 - What happens in reality?
 - How might we fix this?

In [None]:
findClosest("a cheesy slice of pizza") #Should this still return "the cheese"?