# Worksheet

[FORM](https://forms.gle/hmgXYrwLn7ckapQN7)

### Latent Semantic Analysis

In this section we fetch news articles from 3 different categories. We apply TF-IDF vectorization to the corpus and use SVD to represent each document in the feature space of topics that SVD uncovers. We then cluster the documents into 3 clusters while varying the number of singular vectors used to represent the corpus (i.e. varying the dimension of the embedding space) and compare the result to the grouping given by the news article categories. Do we end up with a better clustering the more singular vectors we use? Is there an optimal embedding space?
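To make the idea of an embedding space concrete, here is a minimal sketch on a made-up 4×3 matrix (the values and the names `X`, `embedding_k`, and `projection_k` are illustrative assumptions, not data from this worksheet). It suggests that keeping the first k columns of U·Σ is the same as projecting the centered documents onto the first k right singular vectors.

```python
import numpy as np

# Toy document-term matrix: 4 documents, 3 terms (illustrative values only)
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0]])

# Center each term (column) before the decomposition
X_centered = X - X.mean(axis=0)

# Thin SVD: X_centered = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 2
# Document coordinates in the k-dimensional embedding space
embedding_k = U[:, :k] * s[:k]

# Equivalent view: project the centered rows onto the top-k right singular vectors
projection_k = X_centered @ Vt[:k].T

print(embedding_k.shape)                       # (4, 2)
print(np.allclose(embedding_k, projection_k))  # True
```

Increasing `k` adds more singular directions to each document's representation; whether that helps clustering is the question this section explores.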

In [6]:
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

categories = ['comp.os.ms-windows.misc', 'sci.space','rec.sport.baseball']
news_data = fetch_20newsgroups(subset='train', categories=categories)
vectorizer = TfidfVectorizer(stop_words='english', min_df=4,max_df=0.8)

stemmer = SnowballStemmer("english", ignore_stopwords=True)  # build once, not once per word
stemmed_data = [" ".join(stemmer.stem(word)
                         for sent in sent_tokenize(message)
                         for word in word_tokenize(sent))
                for message in news_data.data]

dtm = vectorizer.fit_transform(stemmed_data)
terms = vectorizer.get_feature_names_out()
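# Centering makes the matrix dense; convert the np.matrix result to a plain ndarray
centered_dtm = np.asarray(dtm - np.mean(dtm, axis=0))

# Thin SVD: avoids building the full square factors, which are not needed here
u, s, vt = np.linalg.svd(centered_dtm, full_matrices=False)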
plt.xlim([0,50])
plt.plot(range(1,len(s)+1),s)
plt.show()

ag = []
scores = np.asarray(u)[:, :len(s)] * s  # document coordinates U·Σ, computed once
for k in range(1,25):
    vectorsk = scores[:, :k]  # keep the first k singular directions
    kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=10, random_state=0)
    labelsk = kmeans.fit_predict(vectorsk)
    # v_measure_score takes (labels_true, labels_pred); closer to 1 means closer to news categories
    ag.append(metrics.v_measure_score(news_data.target, labelsk))

plt.plot(range(1,25),ag)
plt.ylabel('Agreement',size=20)
plt.xlabel('No of Prin Comps',size=20)
plt.show()
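The agreement score used above can be checked on a tiny example. The sketch below uses made-up label lists (`true_labels`, `relabeled`, and `merged` are hypothetical names and values) to show that `metrics.v_measure_score` depends only on how points are grouped, not on the cluster ids themselves.

```python
from sklearn import metrics

# Toy ground-truth categories and two hypothetical clusterings (illustrative only)
true_labels = [0, 0, 1, 1, 2, 2]
relabeled   = [2, 2, 0, 0, 1, 1]   # same grouping, different cluster ids
merged      = [0, 0, 0, 0, 1, 1]   # first two categories merged into one cluster

perfect = metrics.v_measure_score(true_labels, relabeled)
partial = metrics.v_measure_score(true_labels, merged)

print(round(perfect, 6))  # 1.0: cluster ids do not matter, only the grouping
print(partial < perfect)  # True: merging categories lowers the score
```

This is why the K-means labels can be compared directly with `news_data.target` even though K-means assigns its cluster ids arbitrarily.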



### Embeddings

The data comes from the [Yelp Dataset](https://www.yelp.com/dataset). Each line is a review that consists of a label (0 for negative reviews and 1 for positive reviews) and a set of words.

```
1 i will never forget this single breakfast experience in mad...
0 the search for decent chinese takeout in madison continues ...
0 sorry but me julio fell way below the standard even for med...
1 so this is the kind of food that will kill you so there s t...
```

To transform each set of words into a vector, we will rely on a feature engineering method called word embeddings. (TF-IDF, used in the previous section, is another way to turn text into vectors.) Rather than simply indicating which words are present, word embeddings represent each word by "embedding" it in a low-dimensional vector space that may carry more information about the word's semantic meaning. For example, in this space the words "King" and "Queen" would be close.
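As a rough illustration of what "close" means here, the sketch below compares made-up 3-dimensional vectors with cosine similarity. The vectors in `toy_vectors` and the helper `cosine_similarity` are assumptions invented for this example; real word2vec vectors have far more dimensions and different values.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional embeddings, invented for illustration
toy_vectors = {
    "king":     np.array([0.9, 0.8, 0.1]),
    "queen":    np.array([0.8, 0.9, 0.2]),
    "sandwich": np.array([0.1, -0.2, 0.9]),
}

sim_royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_food = cosine_similarity(toy_vectors["king"], toy_vectors["sandwich"])

print(sim_royal > sim_food)  # True: "king" is closer to "queen" than to "sandwich"
```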

`word2vec.txt` contains the `word2vec` embeddings for about 15 thousand words. Not every word in each review is present in the provided `word2vec.txt` file. We can treat these words as being "out of vocabulary" and ignore them.

### Example

Let x_i denote the sentence `"a hot dog is not a sandwich because it is not square"` and let a toy word2vec dictionary be as follows:

```
hot      0.1     0.2     0.3
not      -0.1    0.2     -0.3
sandwich 0.0     -0.2    0.4
square   0.2     -0.1    0.5
```

We would first trim the sentence so that it contains only words in our vocabulary, `"hot not sandwich not square"`, and then embed x_i into the feature space by averaging the word vectors:

$$ \phi_2(x_i) = \frac{1}{5} \left( \text{word2vec}(\text{hot}) + 2 \cdot \text{word2vec}(\text{not}) + \text{word2vec}(\text{sandwich}) + \text{word2vec}(\text{square}) \right) = \begin{bmatrix} 0.02 & 0.06 & 0.12 \end{bmatrix}^T $$
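The worked example can be reproduced in a few lines. This is a minimal sketch that hard-codes the toy dictionary from the table (the names `toy_word2vec`, `trimmed`, and `embedding` are chosen here for illustration); it is not the solution to the exercises that follow.

```python
import numpy as np

# Toy word2vec dictionary from the example
toy_word2vec = {
    "hot":      np.array([0.1, 0.2, 0.3]),
    "not":      np.array([-0.1, 0.2, -0.3]),
    "sandwich": np.array([0.0, -0.2, 0.4]),
    "square":   np.array([0.2, -0.1, 0.5]),
}

sentence = "a hot dog is not a sandwich because it is not square"

# Trim: keep only in-vocabulary words, preserving order and repeats
trimmed = [w for w in sentence.split() if w in toy_word2vec]

# Embed: average the vectors of the remaining words
embedding = np.mean([toy_word2vec[w] for w in trimmed], axis=0)

print(" ".join(trimmed))       # hot not sandwich not square
print(np.round(embedding, 6))  # [0.02 0.06 0.12]
```

Because "not" appears twice in the trimmed sentence, its vector is counted twice in the average, which matches the factor of 2 in the formula.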

a) Implement a function to trim out-of-vocabulary words from the reviews. Your function should return an `np.ndarray` of the same shape and dtype as the original loaded dataset.

In [9]:
import csv
import numpy as np

VECTOR_LEN = 300   # Length of word2vec vector
MAX_WORD_LEN = 64  # Max word length in dict.txt and word2vec.txt

def load_tsv_dataset(file):
    """
    Loads raw data and returns a tuple containing the reviews and their ratings.

    Parameters:
        file (str): File path to the dataset tsv file.

    Returns:
        An np.ndarray of shape N. N is the number of data points in the tsv file.
        Each element dataset[i] is a tuple (label, review), where the label is
        an integer (0 or 1) and the review is a string.
    """
    dataset = np.loadtxt(file, delimiter='\t', comments=None, encoding='utf-8',
                         dtype='l,O')
    return dataset


def load_feature_dictionary(file):
    """
    Creates a map of words to vectors using the file that has the word2vec
    embeddings.

    Parameters:
        file (str): File path to the word2vec embedding file.

    Returns:
        A dictionary indexed by words, returning the corresponding word2vec
        embedding np.ndarray.
    """
    word2vec_map = dict()
    with open(file) as f:
        read_file = csv.reader(f, delimiter='\t')
        for row in read_file:
            word, embedding = row[0], row[1:]
            word2vec_map[word] = np.array(embedding, dtype=float)
    return word2vec_map


def trim_reviews(path_to_dataset):
    """
    Trims reviews by removing words not found in the word2vec_map.

    Parameters:
        path_to_dataset (str): File path to the dataset tsv file.
    
    Returns:
        np.ndarray: An array of tuples, where each tuple is (label, trimmed_review).
    """
    # Load dataset
    dataset = load_tsv_dataset(path_to_dataset)
    
    # Load the word2vec dictionary inside this function
    word2vec_map = load_feature_dictionary("./data/word2vec.txt")
    
    # Function to trim a single review
    def trim_review(review):
        words = review.split()  # Split review into words
        trimmed_words = [word for word in words if word in word2vec_map]  # Keep words in word2vec_map
        return ' '.join(trimmed_words)  # Join trimmed words back into a string
    
    # Apply trimming to all reviews in the dataset
    trimmed_dataset = []
    for label, review in dataset:
        trimmed_review = trim_review(review)
        trimmed_dataset.append((label, trimmed_review))
    
    # Reuse the loaded dataset's structured dtype so the shape and dtype match the input
    return np.array(trimmed_dataset, dtype=dataset.dtype)

trim_train = trim_reviews("./data/train_small.tsv")
trim_test = trim_reviews("./data/test_small.tsv")

b) Implement the embedding and store it in a .tsv file where the first column is the label and the remaining columns are the embedding features. Round all numbers to 6 decimal places. `embedded_train_small.tsv` contains the expected output of your function.

In [10]:
def embed_reviews(trimmed_dataset):
    """
    Embeds each review by averaging the word embeddings of words found in the word2vec_map.

    Parameters:
        trimmed_dataset (np.ndarray): Array of tuples (label, trimmed_review).
    
    Returns:
        np.ndarray: Embedded dataset with shape (N, VECTOR_LEN + 1), where the first column is the label
                    and the next VECTOR_LEN columns are the averaged word embeddings.
    """
    # Load the word2vec dictionary
    word2vec_map = load_feature_dictionary("./data/word2vec.txt")
    
    embedded_dataset = []
    
    for label, review in trimmed_dataset:
        words = review.split()
        
        # Extract word embeddings from word2vec_map
        embeddings = [word2vec_map[word] for word in words if word in word2vec_map]
        
        if embeddings:
            # Average the word embeddings for this review
            avg_embedding = np.mean(embeddings, axis=0)
        else:
            # If no valid embeddings found, use a zero vector
            avg_embedding = np.zeros(VECTOR_LEN)
        
        # Add the label as the first element
        embedded_review = np.hstack(([label], avg_embedding))
        embedded_dataset.append(embedded_review)
    
    return np.array(embedded_dataset)

def save_as_tsv(dataset, filename):
    """
    Saves the embedded dataset as a .tsv file with labels as the first column and embeddings as the rest.
    
    Parameters:
        dataset (np.ndarray): Embedded dataset, with each row being (label, embedding_vector).
        filename (str): The path to save the .tsv file.
    """
    with open(filename, 'w+') as f:
        for row in dataset:
            # Format each value in the row, rounding to 6 decimal places
            row_str = '\t'.join([f"{x:.6f}" for x in row])
            f.write(row_str + '\n')

# Example usage:
embedded_train = embed_reviews(trim_train)
embedded_test = embed_reviews(trim_test)

save_as_tsv(embedded_train, "./data/output/embedded_train_small.tsv")
save_as_tsv(embedded_test, "./data/output/embedded_test_small.tsv")
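As a possible alternative to the manual loop in `save_as_tsv`, NumPy's `np.savetxt` can write the same tab-separated, 6-decimal format. The sketch below uses a made-up array (`toy_embedded`) and an in-memory buffer so it does not depend on the worksheet's data files; treat it as one option rather than the required approach.

```python
import io
import numpy as np

# Stand-in for an embedded dataset: label first, then features (illustrative values)
toy_embedded = np.array([[1.0, 0.0200004, -0.1234567],
                         [0.0, 0.5, 0.25]])

buffer = io.StringIO()  # in-memory file so the sketch does not touch ./data
np.savetxt(buffer, toy_embedded, fmt="%.6f", delimiter="\t")
tsv_text = buffer.getvalue()

print(tsv_text)
# 1.000000	0.020000	-0.123457
# 0.000000	0.500000	0.250000
```

Passing a file path instead of `buffer` would write the rows to disk in the same format.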


To complete the form, run the following code:

In [11]:
import csv

row_num = 0
with open('./data/output/embedded_test_small.tsv') as f:
    read_file = csv.reader(f, delimiter='\t')
    for row in read_file:
        if row_num == 6:
            print(row[12])
            break
        row_num += 1

0.337376
