# Question 1

In the textbook, you build the TF-IDF transformer manually. The scikit-learn package produces a similar result (see section 3.4.3), but the values differ somewhat. Explain in words, based on your intuition, what makes them different.

# Question 2
In section 3.4.2, we measured the cosine similarity between the query (“How long does it take to get to the store?”) and the given documents. Repeat the same exercise using the sklearn `TfidfVectorizer` class.

    - Please use the same `cosine_sim` function when you check the similarities.

In [1]:
import math

def cosine_sim(vec1, vec2):
    """
    Since our vectors are dictionaries, let's convert them to lists for easier mathing.
    """
    vec1 = [val for val in vec1.values()]
    vec2 = [val for val in vec2.values()]
    
    dot_prod = 0
    for i, v in enumerate(vec1):
        dot_prod += v * vec2[i]
        
    mag_1 = math.sqrt(sum([x**2 for x in vec1]))
    mag_2 = math.sqrt(sum([x**2 for x in vec2]))
    
    return dot_prod / (mag_1 * mag_2)

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

doc_0 = "The faster Harry got to the store, the faster Harry, the faster, would get home."
doc_1 = "Harry is hairy and faster than Jill."
doc_2 = "Jill is not as hairy as Harry."

corpus = [doc_0, doc_1, doc_2]

vectorizer = TfidfVectorizer(min_df=1)
model = vectorizer.fit_transform(corpus)

print(model.todense()) 
# you may need to convert each row of this output matrix into a dictionary

[[0.         0.         0.42662402 0.18698644 0.18698644 0.
  0.22087441 0.18698644 0.         0.         0.         0.18698644
  0.         0.74794576 0.18698644 0.18698644]
 [0.46312056 0.         0.35221512 0.         0.         0.35221512
  0.27352646 0.         0.35221512 0.35221512 0.         0.
  0.46312056 0.         0.         0.        ]
 [0.         0.75143242 0.         0.         0.         0.28574186
  0.22190405 0.         0.28574186 0.28574186 0.37571621 0.
  0.         0.         0.         0.        ]]


In [3]:
query = "How long does it take to get to the store?"
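
One possible way to connect the matrix returned by `TfidfVectorizer` to the dictionary-based `cosine_sim` is sketched below. It assumes that the query is transformed with the vectorizer already fitted on the corpus, and that each row is turned into a dictionary keyed by vocabulary term. The names `terms`, `doc_dicts`, `query_dict`, and `sims` are illustrative only, and `cosine_sim` is repeated so the block runs on its own.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

def cosine_sim(vec1, vec2):
    """Cosine similarity of two dicts whose keys are in the same order."""
    vals1 = list(vec1.values())
    vals2 = list(vec2.values())
    dot_prod = sum(a * b for a, b in zip(vals1, vals2))
    mag_1 = math.sqrt(sum(x ** 2 for x in vals1))
    mag_2 = math.sqrt(sum(x ** 2 for x in vals2))
    return dot_prod / (mag_1 * mag_2)

corpus = [
    "The faster Harry got to the store, the faster Harry, the faster, would get home.",
    "Harry is hairy and faster than Jill.",
    "Jill is not as hairy as Harry.",
]
query = "How long does it take to get to the store?"

vectorizer = TfidfVectorizer(min_df=1)
doc_matrix = vectorizer.fit_transform(corpus).toarray()
# transform() reuses the fitted vocabulary; query words not in the corpus are dropped
query_row = vectorizer.transform([query]).toarray()[0]

# Terms ordered by column index, so every dict shares the same key order
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
doc_dicts = [dict(zip(terms, row)) for row in doc_matrix]
query_dict = dict(zip(terms, query_row))

sims = [cosine_sim(query_dict, doc) for doc in doc_dicts]
for i, sim in enumerate(sims):
    print(f"doc_{i}: {sim:.4f}")
```

Only `doc_0` shares any vocabulary with the query, so it is the only document that can receive a nonzero similarity here.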

# Question 3

Zipf’s law states that if the word types in a corpus are sorted by frequency, the frequency of the word at rank $r$ is proportional to $\frac{1}{r}$. In this question, let’s evaluate this law using Henry Wood Elliot’s book *Our Arctic province* (see `pg70373.txt`).

Explain your finding.
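
A minimal sketch of one way to check the law: count word frequencies, sort them by rank, and fit a line to log(frequency) against log(rank). Under Zipf's law the slope should be close to -1. The helper names `rank_frequency` and `loglog_slope`, the simple regex tokenizer, and the toy text are all assumptions made for illustration, not part of the assignment.

```python
import math
import re
from collections import Counter

def rank_frequency(text):
    """Return word counts sorted from most to least frequent."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sorted(Counter(tokens).values(), reverse=True)

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Toy text in which the word at rank r occurs exactly 120 / r times
words = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta"]
toy_text = " ".join(" ".join([w] * (120 // r)) for r, w in enumerate(words, start=1))

freqs = rank_frequency(toy_text)
slope = loglog_slope(freqs)
print(freqs, round(slope, 3))

# For the book itself, something like the following could be used:
# with open("pg70373.txt", encoding="utf-8") as f:
#     print(loglog_slope(rank_frequency(f.read())))
```

Project Gutenberg files include license text before and after the book, so it may be worth stripping that before counting.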

# Question 4

Data: The Movie Review Data is a collection of movie reviews from imdb.com in the early 2000s, curated by Bo Pang and Lillian Lee for their research on natural language processing. The dataset includes 2,000 reviews, evenly split between positive and negative sentiments. The data was updated and cleaned up in v2.0, released in 2004. This dataset is often referred to as the polarity dataset.

You need to develop a sentiment classifier using this dataset and evaluate its performance using cross-validation.

## (a) 

The first approach is to follow the methodology used in Homework 1:

1. Split the Movie Review Data into training and test sets.
2. Train a Naive Bayes model on the training set using a bag-of-words representation.
3. Evaluate the model's performance.
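
The three steps above could be sketched roughly as follows. The Homework 1 code is not shown here, so this assumes `CountVectorizer` for the bag-of-words representation and `MultinomialNB` for the classifier. The eight toy reviews are hypothetical stand-ins for the 2,000 real ones.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in data; replace with the Movie Review Data
texts = [
    "a wonderful and moving film", "great acting and a great story",
    "loved every minute of it", "brilliant direction and a fine cast",
    "a boring and terrible film", "awful acting and a weak story",
    "hated every minute of it", "clumsy direction and a dull cast",
]
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

# Step 1: split, keeping the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Step 2: fit the vocabulary on the training set only, then train Naive Bayes
vectorizer = CountVectorizer()
nb = MultinomialNB()
nb.fit(vectorizer.fit_transform(X_train), y_train)

# Step 3: evaluate on the held-out reviews
pred = nb.predict(vectorizer.transform(X_test))
acc = accuracy_score(y_test, pred)
print(f"accuracy: {acc:.2f}")
```

Keeping `X_train`, `X_test`, `y_train`, and `y_test` around makes it easy to reuse the same split in (b) and (c).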

## (b)

The next approach is to use TF-IDF representation and cosine similarity, similar to Question 2:

1. Use the same train/test split as in (a).
2. Transform the training data into a TF-IDF representation using the `TfidfVectorizer` class from the `sklearn` library (call `fit_transform()`).
3. Transform the test data using `TfidfVectorizer.transform()`.
4. For each review in the test set:
    - Calculate the cosine similarity with all positive reviews in the training set.
    - Calculate the cosine similarity with all negative reviews in the training set.
    - Assign (predict) sentiment based on the higher cosine similarity.
5. Evaluate the performance of the model.
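
A rough sketch of the steps above, under one possible reading: a test review's score for each class is its mean cosine similarity to all training reviews of that class. Other readings (for example, similarity to a class centroid) are equally plausible. The toy data and variable names are hypothetical, and `cosine_similarity` from `sklearn.metrics.pairwise` is used here instead of the dictionary-based function for brevity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in data; replace with the split from (a)
train_texts = [
    "great wonderful film", "loved this wonderful movie",
    "terrible boring film", "hated this boring movie",
]
train_labels = np.array(["pos", "pos", "neg", "neg"])
test_texts = ["wonderful great acting", "boring terrible plot"]
test_labels = np.array(["pos", "neg"])

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # fit on training data only
X_test = vectorizer.transform(test_texts)        # reuse the fitted vocabulary

# sims[i, j] = cosine similarity between test review i and training review j
sims = cosine_similarity(X_test, X_train)
pos_score = sims[:, train_labels == "pos"].mean(axis=1)
neg_score = sims[:, train_labels == "neg"].mean(axis=1)

# Ties fall to "neg" in this sketch
pred = np.where(pos_score > neg_score, "pos", "neg")
acc = float((pred == test_labels).mean())
print(pred, acc)
```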

## (c)

The last approach is to use Latent Semantic Analysis (`LSA`) with Linear Discriminant Analysis (`LDA`):

1. Use the same train/test split as in (a).
2. Transform the training data into a document-topic representation using the `PCA` class from the `sklearn` library.
3. Train a sentiment classifier using `LinearDiscriminantAnalysis` on the document-topic data.
4. Evaluate the performance of the model.
5. Repeat steps 2-4 with different numbers of topics.
6. Determine the optimal number of topics by comparing the performance of the model with different numbers of topics.
7. Justify the choice of the optimal number of topics based on the model's performance.
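
The loop over topic counts could look roughly like the sketch below. It assumes that TF-IDF vectors are the input to `PCA` and that they are converted to a dense array first. For the full dataset, `TruncatedSVD` may be a more practical substitute because it accepts sparse input directly. The toy data, the candidate topic counts, and the variable names are hypothetical.

```python
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

# Hypothetical stand-in data; replace with the split from (a)
train_texts = [
    "a wonderful and moving film", "great acting and a great story",
    "loved every minute of it", "brilliant direction and a fine cast",
    "a boring and terrible film", "awful acting and a weak story",
    "hated every minute of it", "clumsy direction and a dull cast",
]
train_labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
test_texts = ["a great and moving story", "a dull and boring cast"]
test_labels = ["pos", "neg"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts).toarray()
X_test = vectorizer.transform(test_texts).toarray()

scores = {}
for n_topics in (1, 2, 3):
    # Fit the projection on the training data, then apply it to the test data
    pca = PCA(n_components=n_topics, random_state=0)
    train_topics = pca.fit_transform(X_train)
    test_topics = pca.transform(X_test)

    lda = LinearDiscriminantAnalysis()
    lda.fit(train_topics, train_labels)
    scores[n_topics] = accuracy_score(test_labels, lda.predict(test_topics))

best = max(scores, key=scores.get)
print(scores, "best:", best)
```

Choosing the number of topics on the test set can make the final score look better than it is; running the comparison with cross-validation on the training data is one way to avoid that.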