<div class="alert alert-block alert-danger">

Submit your solutions in either Assignment_302.ipynb or Assignment_302_alternative.ipynb. You do not need to submit both.

</div>

## Assignment: Document Embeddings and Classification with Doc2Vec (10 points)

**Background:**  
This assignment investigates how well Doc2Vec document embeddings perform on text classification tasks, using datasets from the Hugging Face Hub. You will implement and compare the two Doc2Vec architectures (PV-DM and PV-DBOW), analyze document similarity patterns, and evaluate classification performance. The workflow follows the structure and code style shown in the attached practice notebooks.

### Instructions and Point Breakdown

**1. Dataset Preparation (2 points)**

- Select any text classification dataset from the Hugging Face Hub with at least 3 categories. You are not limited to the examples below. Check the label count before choosing, since some of them (IMDb, Amazon Polarity, SMS Spam Collection) are binary:
  - IMDb: https://huggingface.co/datasets/SetFit/imdb
  - Amazon Polarity: https://huggingface.co/datasets/SetFit/amazon_polarity
  - Yahoo Answers Topics: https://huggingface.co/datasets/sentence-transformers/yahoo-answers
  - Banking77: https://huggingface.co/datasets/gtfintechlab/banking77
  - SMS Spam Collection: https://huggingface.co/datasets/ucirvine/sms_spam
  - HateXplain (hate speech and offensive language): https://huggingface.co/datasets/Hate-speech-CNERG/hatexplain

- Write code to:
  - Load the dataset using `datasets.load_dataset()` from Hugging Face.
  - Create `TaggedDocument` objects with a unique identifier (tag) for each document.
  - Split the data into training and test sets if the dataset does not already provide them.
- In 2–3 sentences, explain why the `TaggedDocument` format is necessary for Doc2Vec training and how it differs from standard text preprocessing.

**2. Doc2Vec Model Training (2 points)**

- Train two Doc2Vec models using gensim:
  - **PV-DM model:** Set `dm=1`, `vector_size=100`, `window=5`, `min_count=2`.
  - **PV-DBOW model:** Set `dm=0`, `vector_size=100`, `window=5`, `min_count=2`.
- Train both models for 20 epochs on the training documents.
- Print the vocabulary size for each model and explain what the `vector_size` parameter represents.
- Briefly discuss one advantage of the PV-DM architecture over PV-DBOW.

**3. Document Similarity Analysis and Visualization (3 points)**

- Select 10 documents from different categories in your test set.
- Use both trained models to:
  - Infer document vectors for these test documents.
  - Compute pairwise cosine similarities between all document pairs.
  - Create a similarity heatmap for each model showing which documents are most similar.
- Compare similarity patterns between PV-DM and PV-DBOW models in a Markdown table.

**4. Classification Performance Evaluation (2 points)**

- Use the trained Doc2Vec models as feature extractors for classification:
  - Extract document vectors for all training and test documents.
  - Train a logistic regression classifier on the Doc2Vec features.
  - Evaluate classification accuracy on the test set for both models.
- Report and compare test accuracy for PV-DM and PV-DBOW approaches in a Markdown table.
- In 2–3 sentences, interpret your findings: Which Doc2Vec variant performed better for your chosen dataset, and what might explain the difference?

**5. Technical Reflection (1 point)**

- In a Markdown cell, answer:
  - How do Doc2Vec document embeddings compare to traditional bag-of-words approaches for text classification?
  - Suggest one modification to improve the Doc2Vec models (e.g., different hyperparameters, ensemble methods, preprocessing).
  - Name a real-world application where Doc2Vec would be particularly useful, and explain why document-level embeddings are advantageous for that task.

### Submission Requirements

- Jupyter Notebook containing:
  - Clear, well-commented code for all sections
  - Output cells showing similarity heatmaps and accuracy tables
  - Reflection in Markdown cells
- Use the following Python libraries: `datasets`, `gensim`, `scikit-learn`, `matplotlib`, `seaborn`, `numpy`

#### Grading Rubric

| Section                      | Points |
|------------------------------|:------:|
| Dataset preparation          |   2    |
| Doc2Vec model training       |   2    |
| Similarity analysis          |   3    |
| Classification evaluation    |   2    |
| Reflection quality           |   1    |
| **Total**                    | **10** |