**Homework Assignment: Advanced Word2Vec Exploration**

**Objective:**  
Develop a deep understanding of Word2Vec, not only by training and using word embeddings but also by analyzing how different training choices affect their quality. You will experiment with various hyperparameters, evaluate embeddings using intrinsic tasks, visualize clusters, and reflect on the limitations of the model.

---

### Tasks

1. **Data Collection & Preprocessing:**
   - **Corpus Selection:**  
     Choose a substantial text corpus relevant to a domain of your interest (e.g., news articles, research papers, or even a mix of genres from Wikipedia). Explain your choice.
   - **Cleaning & Tokenization:**  
     - Remove punctuation and special characters, and normalize case.
     - Tokenize the text into sentences and words.
     - (Optional) Remove stopwords and apply stemming or lemmatization; justify your decision either way.
   - **Exploratory Analysis:**  
     - Compute and report statistics (vocabulary size, sentence length distribution, frequency of top words).
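
The cleaning, tokenization, and statistics steps above can be sketched with the standard library alone. This is a minimal illustration, not a required implementation: the regex-based sentence splitter is deliberately naive, and the function names (`split_sentences`, `tokenize`, `corpus_stats`) and the sample text are made up for this example. A real corpus will likely call for a proper tokenizer.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercase the sentence and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def corpus_stats(sentences, top_n=5):
    """Return vocabulary size, per-sentence lengths, and the most frequent words."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {
        "vocab_size": len(counts),
        "sentence_lengths": [len(sent) for sent in sentences],
        "top_words": counts.most_common(top_n),
    }


# Tiny made-up corpus, only to show the shape of the output.
raw = "The king rules the land. The queen rules too! Who rules the sea?"
sentences = [tokenize(s) for s in split_sentences(raw)]
stats = corpus_stats(sentences, top_n=2)
print(stats)
```

The list of token lists in `sentences` is also the input shape that most Word2Vec implementations expect, so the same structure can be reused for training.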

2. **Word2Vec Model Training:**
   - **Implement Two Variants:**  
     Train Word2Vec using both:
     - **Continuous Bag-of-Words (CBOW)**
     - **Skip-Gram**
   - **Hyperparameter Experiments:**  
     Experiment with the following parameters:
     - Vector dimensions (e.g., 50, 100, 300)
     - Context window size (e.g., 3, 5, 10)
     - Minimum word frequency thresholds
     - Negative sampling (including the number of negative samples) vs. hierarchical softmax
     - Sub-sampling threshold for frequent words

     Document each setting you try and the rationale behind it.
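
One way to organize these experiments is to enumerate the configurations up front and train one model per configuration. The sketch below assumes gensim 4.x, where `Word2Vec` accepts `vector_size`, `window`, `min_count`, `sg`, `hs`, `negative`, and `sample`. The grid values, the `hs_negative` pairing, and the helper names are illustrative choices, not requirements. The full grid is large, so in practice you would probably vary one or two parameters at a time.

```python
from itertools import product

# Hypothetical search space; the values mirror the examples listed in the task.
GRID = {
    "sg": [0, 1],                               # 0 = CBOW, 1 = Skip-Gram
    "vector_size": [50, 100, 300],
    "window": [3, 5, 10],
    "min_count": [2, 5],
    "hs_negative": [(0, 5), (0, 15), (1, 0)],   # (hs, negative) pairs
    "sample": [1e-3, 1e-5],                     # sub-sampling threshold
}


def expand_grid(grid):
    """Yield one keyword-argument dict per hyperparameter combination."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        # Split the paired setting into the two separate keyword arguments.
        hs, negative = config.pop("hs_negative")
        config.update(hs=hs, negative=negative)
        yield config


def train(sentences, config, seed=42):
    """Train one model. gensim (assumed installed) is imported lazily so the
    grid code above can be inspected without it."""
    from gensim.models import Word2Vec

    # workers=1 keeps runs closer to reproducible, at the cost of speed.
    return Word2Vec(sentences=sentences, seed=seed, workers=1, **config)


configs = list(expand_grid(GRID))
print(len(configs), configs[0])
```

Pairing `hs` and `negative` in one grid entry avoids combinations that mix both objectives by accident: `(1, 0)` selects hierarchical softmax only, while `(0, k)` selects negative sampling with `k` noise words.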

3. **Intrinsic Evaluation of Embeddings:**
   - **Similarity & Analogy Tasks:**  
     - Create a set of queries to find the top-N most similar words for a given target (e.g., “king” should return “queen”, “prince”, etc.).
     - Test analogy relationships such as “man : woman :: king : ?”.
     - Given a set of words, identify the one that does not belong in the group using vector distance metrics.

     Compare the performance of CBOW and Skip-Gram on each of these tasks.
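
To make the three intrinsic tasks concrete, here is a small pure-Python sketch over hand-made toy vectors. The vectors are invented so that the arithmetic is easy to follow and are not the output of any trained model. The analogy uses the common vector-offset formulation (target = b - a + c, then nearest neighbor by cosine similarity, excluding the query words). The outlier check picks the word least similar to the mean of the unit-normalized vectors.

```python
import math

# Toy 4-dimensional vectors, invented for illustration only.
VECS = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}


def norm(u):
    return math.sqrt(sum(x * x for x in u))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def most_similar(word, vecs, top_n=3):
    """Top-N neighbors of `word` by cosine similarity, excluding the word itself."""
    scores = [(w, cosine(vecs[word], v)) for w, v in vecs.items() if w != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


def analogy(a, b, c, vecs):
    """Solve a : b :: c : ? with the offset b - a + c, excluding the query words."""
    target = [vb - va + vc for va, vb, vc in zip(vecs[a], vecs[b], vecs[c])]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vecs[w], target))


def odd_one_out(words, vecs):
    """Return the word least similar to the mean of the unit-normalized vectors."""
    units = [[x / norm(vecs[w]) for x in vecs[w]] for w in words]
    mean = [sum(col) / len(units) for col in zip(*units)]
    return min(words, key=lambda w: cosine(vecs[w], mean))


print(most_similar("king", VECS, top_n=2))
print(analogy("man", "woman", "king", VECS))
print(odd_one_out(["king", "queen", "man", "apple"], VECS))
```

If you train with gensim, `model.wv.most_similar(positive=[...], negative=[...])` and `model.wv.doesnt_match([...])` cover the same ground. Writing the functions by hand at least once is still a useful way to see what those calls compute.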

4. **Visualization & Clustering:**
   - **Dimensionality Reduction:**  
     Use t-SNE (or another dimensionality reduction method) to project the high-dimensional embeddings into 2D space.
   - **Cluster Analysis:**  
     - Plot a subset of words (e.g., the top 100 most frequent) and visually inspect their clustering.
     - Optionally, use clustering algorithms (e.g., K-Means) on the embeddings and discuss any patterns or topics you observe.
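
The projection and clustering steps might look like the sketch below, which assumes NumPy, scikit-learn, and matplotlib are installed. The helper names, the cluster count, and the perplexity heuristic are illustrative assumptions. t-SNE requires the perplexity to be smaller than the number of points, so the first helper shrinks it for small word subsets.

```python
def safe_perplexity(n_points, preferred=30):
    """Pick a t-SNE perplexity below the number of points (simple heuristic)."""
    return max(1, min(preferred, (n_points - 1) // 3))


def project_and_cluster(words, vectors, n_clusters=5, seed=0):
    """Return 2D t-SNE coordinates and K-Means labels for the given vectors.

    vectors: array-like of shape (len(words), dim).
    """
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.manifold import TSNE

    X = np.asarray(vectors, dtype=float)
    coords = TSNE(
        n_components=2,
        perplexity=safe_perplexity(len(words)),
        init="pca",
        random_state=seed,
    ).fit_transform(X)
    # Cluster in the original space: the 2D projection distorts distances.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    return coords, labels


def plot_words(words, coords, labels):
    """Scatter the projected words, colored by cluster and annotated with text."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 8))
    ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=12)
    for word, (x, y) in zip(words, coords):
        ax.annotate(word, (x, y), fontsize=8)
    ax.set_title("t-SNE projection of word embeddings")
    return fig


print(safe_perplexity(100), safe_perplexity(10))
```

With a gensim 4.x model, the inputs could be taken as `words = model.wv.index_to_key[:100]` (the vocabulary is stored in descending frequency order by default) and `vectors = model.wv[words]`. Keep in mind that t-SNE layouts change with the seed and perplexity, so inspect more than one run before drawing conclusions about clusters.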
  
5. **Comparative Analysis & Reflection:**
   - **Parameter Impact Discussion:**  
     - Analyze how different hyperparameters (vector size, window size, sampling techniques) influenced the quality of the embeddings.
     - Compare the strengths and weaknesses of CBOW vs. Skip-Gram based on your intrinsic evaluations.
   - **Limitations & Improvements:**  
     - Reflect on the limitations of Word2Vec (e.g., a single vector per word regardless of sense, which cannot capture polysemy, and no representation for out-of-vocabulary words).
     - Propose potential methods or hybrid approaches (e.g., incorporating context with transformers or using subword information) to overcome these challenges.
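
As a starting point for the subword idea, the sketch below extracts character n-grams from a word wrapped in boundary markers, in the style popularized by fastText. It covers only the n-gram extraction step. The function name and the default range of 3 to 6 characters are assumptions for this example. A subword model would then learn a vector per n-gram and combine them, so that an unseen word still gets a representation from the pieces it shares with known words.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>', with '<' and '>' marking word boundaries."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams


trigrams = char_ngrams("where", 3, 3)
print(trigrams)

# N-grams shared between a known word and a hypothetical unseen one.
shared = set(char_ngrams("where")) & set(char_ngrams("whereas"))
print(sorted(shared))
```

The boundary markers matter: the trigram `her` taken from inside `where` is distinct from the whole-word sequence `<her>`, so prefixes and suffixes are kept apart from word-internal fragments.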

---

### Deliverables

- **Jupyter Notebook:**
  - A well-documented notebook that includes:
    - Data preprocessing, exploratory data analysis, and cleaning steps.
    - Code for training both CBOW and Skip-Gram models along with your hyperparameter experiments.
    - Implementation and results of intrinsic evaluation tasks (similarity, analogy, outlier detection).
    - Visualization of embeddings and any clustering results.
    - Detailed markdown cells that explain your methodology, findings, and reflections.


---

### Grading Criteria

- **Implementation & Experimentation (40%):**  
  Accurate and efficient code with diverse hyperparameter experiments.
- **Evaluation & Analysis (30%):**  
  Depth of intrinsic evaluations and insightful comparisons between model variants.
- **Visualization & Reporting (30%):**  
  Quality of visualizations and clarity in presenting your results and conclusions.

