<a href="https://colab.research.google.com/github/WayneGretzky1/CSCI-4521-Applied-Machine-Learning/blob/main/2_2_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Download and extract the data
The data is in a gzipped tar archive (`.tar.gz`).

In [None]:
!wget "https://raw.githubusercontent.com/be-prado/csci4521/refs/heads/main/20news-bydate.tar.gz"

In [None]:
!tar -xf 20news-bydate.tar.gz

## Using SciKit-Learn's CountVectorizer


In [None]:
from sklearn.feature_extraction.text import CountVectorizer

The parameter `min_df` (minimum document frequency) controls how rarely used words are handled.
 - If it is an integer, all words that occur in fewer than that many documents are dropped.
 - If it is a fraction, all words that occur in less than that fraction of the documents are dropped.

`max_df` works in a similar manner: words that occur in more than the given number (or fraction) of documents are dropped.

In [None]:
vectorizer = CountVectorizer(min_df=1) #min_df=1 --> use all words

In [None]:
CountVectorizer?

Consider two sentences:

In [None]:
content = ["How to catch pokemon", "Which Pokemon is the hardest to catch?"]

How many unique words are there between the two?
  - Are `catch` and `catch?` the same word?
  - Are `Pokemon` and `pokemon` the same word?
  - Would `catch` and `catching` be the same word?

In [None]:
# TODO: fit_transform the sentences then print the vocab


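We can turn each sentence into a "bag of words". `CountVectorizer` stores a count per vocabulary word; since no word repeats within these sentences, each entry is:
 - 1 if the word is present
 - 0 if the word is absent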

In [None]:
print(X.toarray())

### CountVectorizer on Usenet posts

In [None]:
import os
DIR = "/content/20news-bydate-train/rec.sport.hockey"

In [None]:
posts = [open(os.path.join(DIR, filename)).read() for filename in os.listdir(DIR)]

In [None]:
posts[45]

"From: maynard@ramsey.cs.laurentian.ca (Roger Maynard)\nSubject: Re: Canadiens - another Stanley Cup???\nOrganization: Dept. of Computer Science, Laurentian University, Sudbury, ON\nLines: 35\n\nIn <rauser.734062534@sfu.ca> rauser@fraser.sfu.ca (Richard John Rauser) writes:\n\n>pereira@CAM.ORG (Dean Pereira) writes:\n\n\n>>\t\tWith the kind of team Montreal has now,  they can take the\n>>cup easily.  The only problem they have right now is that everyone is\n>>trying to steal the show and play alone.  They need some massive teamwork.\n\nThis is known as the Savard syndrome - and we are talking Denis, not Serge.\nNo team will ever win squat with the likes of Denis Savard in their lineup.\n\n>>\tThey are also in a little of a slump because long-time hockey\n>>Montreal Canadiens announcer Claude Mouton died last tuesday and it was\n>>rough on everybody because he has worked with the organization for 21\n>>years.  But I know that is no excuse.  But if the Habs manage to get some\n>>good tea

In [None]:
# TODO: fit_transform the vectorizer with our new data


In [None]:
X_train.shape

In [None]:
print(vectorizer.get_feature_names_out())

In [None]:
# TODO: vectorize the sentence "Should a team be added in Wisconsin?"


In [None]:
# TODO: get the names of features present

### Finding Nearest Neighbors

`new_post_vec` is a feature vector, and we can try to find its nearest neighbors in the training set
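
The idea can be sketched on a small dense array before applying it to the real data. The arrays below are made-up stand-ins for the training matrix and the query vector; rows of the sparse matrix returned by `CountVectorizer` would first need `.toarray()`:

```python
import numpy as np

# Hypothetical count vectors: three "training documents" over a 3-word vocabulary.
train = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 0.0],
                  [3.0, 3.0, 3.0]])
query = np.array([0.0, 1.0, 1.0])

# Euclidean distance from the query to every row at once (broadcasting).
toy_dists = np.linalg.norm(train - query, axis=1)

# The nearest neighbor is the row with the smallest distance.
toy_best = int(np.argmin(toy_dists))
print(toy_dists, toy_best)
```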

In [None]:
import numpy as np

In [None]:
def dist_raw(v1, v2):
  delta = v1-v2
  return np.linalg.norm(delta)

In [None]:
# TODO: find the distances between the new post and the vectors in our training set


In [None]:
# TODO: which post is the closest?


In [None]:
dists[best_match_ID]

In [None]:
posts[best_match_ID]

Hmm. The query document was `"Should a team be added in Wisconsin?"`.

Does this post seem related to our query feature? Let's check which elements of the feature vectors overlap.

In [None]:
# TODO: print the query vector and the closest vector?


In [None]:
print(new_post_vec)

That worked poorly... There is no overlap in features. What happened?

#### Normalized distance
Normalizing each vector to unit length before computing the distance makes the comparison depend on the mix of words in a document rather than on the document's length.

In [None]:
def dist_norm(v1, v2):
  v1_normalized = v1/np.linalg.norm(v1) #Normalize vectors to unit length
  v2_normalized = v2/np.linalg.norm(v2)
  delta = v1_normalized-v2_normalized   #Then take distance
  return np.linalg.norm(delta)
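
A small sketch of why normalization helps, using made-up count vectors (the two distance functions are repeated here only so the example runs on its own):

```python
import numpy as np

def dist_raw(v1, v2):
    return np.linalg.norm(v1 - v2)

def dist_norm(v1, v2):
    return np.linalg.norm(v1 / np.linalg.norm(v1) - v2 / np.linalg.norm(v2))

short_doc = np.array([1.0, 2.0, 0.0])  # word counts of a short document
long_doc = 3 * short_doc               # same word mix, three times as long
other_doc = np.array([0.0, 0.0, 2.0])  # shares no words with short_doc

raw_same = dist_raw(short_doc, long_doc)       # large, although the content matches
norm_same = dist_norm(short_doc, long_doc)     # approximately 0
norm_other = dist_norm(short_doc, other_doc)   # sqrt(2): no words in common
print(raw_same, norm_same, norm_other)
```

One caveat: an all-zero vector (for example, a query containing no vocabulary words) has norm 0, so `dist_norm` would divide by zero and return `nan`.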

In [None]:
# TODO: find the normalized distances between the new post and the vectors in our training set then find the new closest post


In [None]:
best_match_ID = np.argmin(dists)
print(best_match_ID)

In [None]:
posts[best_match_ID]

## Stop Words, Stemming, and TF-IDF
Ignoring common words (stop words)

In [None]:
vectorizer = CountVectorizer(min_df=1, stop_words='english')

We'll lose some words now. The size of the feature vector should be smaller.

In [None]:
X_train.shape #Old Vectorizations

In [None]:
X_train = vectorizer.fit_transform(posts)

In [None]:
X_train.shape #New Vectorizations

In [None]:
sorted(vectorizer.get_stop_words())

In [None]:
# TODO: based on a new query post, which post in our dataset is closest?


How did this do?

We can also add stemming and tf-idf:

In [None]:
import nltk.stem
english_stemmer = nltk.stem.SnowballStemmer('english')

In [None]:
english_stemmer.stem("happily")

In [None]:
class StemmedCountVectorizer(CountVectorizer):
  def build_analyzer(self):
    # Reuse the standard analyzer (tokenizing, lowercasing, stop words), then stem each token
    analyzer = super().build_analyzer()
    return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

In [None]:
vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')
X_train = vectorizer.fit_transform(posts)

In [None]:
X_train.shape

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
class StemmedTfidfVectorizer(TfidfVectorizer):
  def build_analyzer(self):
    # Reuse the standard analyzer (tokenizing, lowercasing, stop words), then stem each token
    analyzer = super().build_analyzer()
    return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

In [None]:
vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english')
X_train = vectorizer.fit_transform(posts)

In [None]:
X_train.shape

In [None]:
# TODO: with these new vectorizers, let's test that query post again


#### Cosine Similarity

We can use the cosine similarity instead of the normalized vector distance.

Remember that similarity is maximized, whereas distance is minimized: the best match is now the `argmax`, not the `argmin`.

In [None]:
def cos_similarity(v1, v2):
  return np.vdot(v1,v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
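
The two measures are closely related: for unit-length vectors, $\lVert a - b \rVert^2 = 2 - 2\cos\theta$, so the largest cosine similarity and the smallest normalized distance pick the same document. A minimal numerical check on made-up vectors (both functions are repeated so the example runs on its own):

```python
import numpy as np

def cos_similarity(v1, v2):
    return np.vdot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def dist_norm(v1, v2):
    return np.linalg.norm(v1 / np.linalg.norm(v1) - v2 / np.linalg.norm(v2))

# Hypothetical query and training vectors.
query = np.array([1.0, 1.0, 0.0])
train = np.array([[2.0, 2.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 0.0, 5.0]])

sims = np.array([cos_similarity(query, row) for row in train])
norm_dists = np.array([dist_norm(query, row) for row in train])

# Squared normalized distance equals 2 - 2 * cosine similarity.
identity_holds = bool(np.allclose(norm_dists ** 2, 2 - 2 * sims))

best_by_similarity = int(np.argmax(sims))        # maximize similarity
best_by_distance = int(np.argmin(norm_dists))    # minimize distance
print(identity_holds, best_by_similarity, best_by_distance)
```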

In [None]:
# TODO: use cosine similarity in place of the distance metric and try the query post again


## Closest Document Function

In [None]:
# Helper function: return the training post most similar to the prompt
def findClosestStory(prompt):
  new_post_vec = vectorizer.transform([prompt]).toarray()
  sims = [cos_similarity(new_post_vec, train_vec.toarray()) for train_vec in X_train]
  closest_id = np.argmax(sims)  # similarity, so take the arg max (not the arg min)
  return posts[closest_id]


In [None]:
print(findClosestStory("where can I find cheap tickets?"))
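
Looping over rows and densifying each one can be slow on a large training set. One possible alternative, sketched here on a made-up corpus, is scikit-learn's `cosine_similarity`, which accepts sparse matrices directly. The names `toy_posts`, `toy_vectorizer`, `X_toy`, and `find_closest_post` are hypothetical and not part of the notebook:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the `posts` list (hypothetical content).
toy_posts = [
    "Cheap tickets for the hockey game are on sale",
    "The goalie made thirty saves last night",
    "Trade rumors about the defenseman continue",
]

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_posts)

def find_closest_post(prompt):
    """Return the toy post with the highest cosine similarity to the prompt."""
    query_vec = toy_vectorizer.transform([prompt])
    # One call compares the query against every row; the result has shape (1, n_posts).
    sims = cosine_similarity(query_vec, X_toy).ravel()
    return toy_posts[int(np.argmax(sims))]

closest = find_closest_post("where can I find cheap tickets?")
print(closest)
```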