<a href="https://colab.research.google.com/github/Yixian-ch/nlp_selflearning/blob/main/word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Mid-Term Mini-Project: Training and Analyzing Domain-Specific Word Embeddings**

## **Project Overview**

In this mini-project, you will explore how **Word2Vec embeddings differ across different scientific domains**. You will:

- Train separate **Word2Vec models** on scientific papers from different domains.
- Use **word similarity and analogy tasks** to compare the representations across models.
- Analyze and interpret the domain-specific variations in embeddings.

## **Project Objectives**

✅ Learn to train Word2Vec embeddings on domain-specific corpora.  
✅ Understand how different domains influence word representations.  
✅ Perform **word similarity** and **word analogy** evaluations.  
✅ Compare embeddings trained on different domains to reveal biases and context variations.


## **Project Tasks**

### **1️⃣ Corpus Selection & Preprocessing**

Each team/student will:

1. **Select at least two scientific domains** (e.g., Computer Science, Biology, Medicine, Physics, etc.).
2. Use the **ArXiv Abstracts Dataset** (available on Hugging Face: `gfissore/arxiv-abstracts-2021`).
3. Extract **a subset of abstracts** from their selected domains (e.g., 10,000 abstracts per domain).
4. Preprocess the text by:
   - Lowercasing
   - Removing punctuation
   - Tokenizing sentences
   - Removing stopwords (optional)

📌 **Deliverable:** A cleaned text corpus for each selected domain.
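
The preprocessing steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the required implementation: the function name `preprocess_abstract`, the tiny `STOPWORDS` set, and the naive sentence splitter are all assumptions made for the example. A real run would more likely use a full stopword list and a proper sentence tokenizer (e.g. from NLTK).

```python
import re
import string

# Tiny illustrative stopword list (an assumption); a real run would use a fuller list.
STOPWORDS = {"a", "an", "the", "of", "is", "are", "in", "on", "for", "and", "to", "we"}


def preprocess_abstract(text, remove_stopwords=True):
    """Return a list of sentences, each a list of lowercase tokens."""
    # Naive sentence split: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.lower().strip())
    # Translation table that deletes every ASCII punctuation character.
    table = str.maketrans("", "", string.punctuation)
    cleaned = []
    for sentence in sentences:
        tokens = sentence.translate(table).split()
        if remove_stopwords:
            tokens = [t for t in tokens if t not in STOPWORDS]
        if tokens:  # skip sentences that end up empty
            cleaned.append(tokens)
    return cleaned


example = "We study galaxy clusters. The mass function is measured!"
print(preprocess_abstract(example))
```

Keeping one token list per sentence matches the input format that Word2Vec implementations usually expect. Note that deleting all punctuation also merges hyphenated terms (`x-ray` becomes `xray`), which may or may not be desirable for scientific text.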


In [None]:
# !pip install gdown  # Ensure gdown is installed
# !pip install datasets
!gdown --id 1yB52MYODAORHRvTfHHYweJg3uRlsdjSO -O arxiv_abstracts_sampled100000.csv

Downloading...
From (original): https://drive.google.com/uc?id=1yB52MYODAORHRvTfHHYweJg3uRlsdjSO
From (redirected): https://drive.google.com/uc?id=1yB52MYODAORHRvTfHHYweJg3uRlsdjSO&confirm=t&uuid=ccabe0a1-50f6-427d-b3b3-fd4deb651855
To: /content/arxiv_abstracts_sampled100000.csv
100% 120M/120M [00:01<00:00, 64.0MB/s]


## Libraries
Load the full dataset from Hugging Face, sample 100,000 abstracts, and save the sample as a CSV file on Google Drive.

In [None]:
import pandas as pd
from datasets import load_dataset
dataset = load_dataset("gfissore/arxiv-abstracts-2021", split="train")
df = pd.DataFrame(dataset)
df_sampled = df.sample(n=100000, random_state=42)
df_sampled.to_csv("/content/drive/MyDrive/nlp_en/arxiv_abstracts_sampled100000.csv", index=False)
print("Dataset saved as arxiv_abstracts_sampled.csv")

Dataset saved as arxiv_abstracts_sampled.csv


## Read csv

In [12]:
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/nlp_en/arxiv_abstracts_sampled100000.csv")
# check data types in df
print(df.info())


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 11 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   id           100000 non-null  object
 1   submitter    99288 non-null   object
 2   authors      100000 non-null  object
 3   title        100000 non-null  object
 4   comments     77753 non-null   object
 5   journal-ref  37314 non-null   object
 6   doi          50947 non-null   object
 7   abstract     100000 non-null  object
 8   report-no    8517 non-null    object
 9   categories   100000 non-null  object
 10  versions     100000 non-null  object
dtypes: object(11)
memory usage: 8.4+ MB
None


In [6]:
from collections import Counter
df['categories'].head(50)
print(type(df['categories'][25]))
# The categories column is stored as a string; to select the wanted categories, it is easier to convert it into a list of category names

<class 'str'>


In [27]:
# In the original df, each entry of the categories column is the string form of a list, e.g. "['math.AG cs.CC']".
# ast.literal_eval parses that string back into a list without executing arbitrary code (unlike eval).
# The single element is then split on spaces because the categories are separated by spaces.
import ast
df['categories'] = df['categories'].apply(lambda x: ast.literal_eval(x)[0].split(" "))
print(df['categories'][25],type(df['categories'][25]),type(df['categories']))

['math.AG', 'cs.CC'] <class 'list'> <class 'pandas.core.series.Series'>


In [29]:
# inspect how many different categories are in our df
all_categories = [cat for list_of_cate in df['categories'] for cat in list_of_cate] # returns a list of all categories
count_cate = Counter(all_categories)
print(count_cate)

Counter({'hep-ph': 8219, 'hep-th': 7516, 'quant-ph': 5985, 'astro-ph': 5162, 'cs.LG': 4988, 'gr-qc': 4597, 'cond-mat.mes-hall': 3837, 'cond-mat.mtrl-sci': 3703, 'cs.CV': 3446, 'math-ph': 3399, 'math.MP': 3399, 'cond-mat.stat-mech': 3238, 'cond-mat.str-el': 3107, 'astro-ph.CO': 2933, 'stat.ML': 2624, 'nucl-th': 2545, 'astro-ph.SR': 2527, 'math.CO': 2527, 'math.AP': 2491, 'astro-ph.GA': 2487, 'hep-ex': 2347, 'astro-ph.HE': 2340, 'math.PR': 2312, 'math.AG': 2249, 'cs.AI': 2091, 'cs.IT': 1884, 'math.IT': 1884, 'cond-mat.supr-con': 1853, 'math.DG': 1801, 'physics.optics': 1795, 'math.NT': 1687, 'math.OC': 1680, 'cond-mat.soft': 1668, 'cs.CL': 1574, 'math.DS': 1458, 'math.NA': 1399, 'hep-lat': 1295, 'math.FA': 1261, 'astro-ph.IM': 1156, 'math.RT': 1120, 'nucl-ex': 1080, 'physics.flu-dyn': 1079, 'cs.CR': 1079, 'astro-ph.EP': 1077, 'cond-mat.dis-nn': 1048, 'math.GT': 1014, 'math.ST': 963, 'stat.TH': 963, 'math.CA': 938, 'stat.ME': 932, 'cs.DS': 923, 'physics.soc-ph': 920, 'cs.SY': 892, 'physic

## Choice of domains
An abstract can be cross-listed in several domains, e.g. `[cs, math, physics]`. Here I keep only the abstracts whose categories all belong to a single main domain, such as math, cs, or physics.
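
When selecting abstracts by domain name, a plain substring test and a prefix test can disagree on arXiv category labels. The sketch below illustrates the difference on labels that appear in the category counts above. The helper names `matches_substring` and `matches_prefix` are hypothetical and only serve the comparison.

```python
def matches_substring(domain, categories):
    """True if `domain` occurs anywhere inside every category label."""
    return all(domain in name for name in categories)


def matches_prefix(domain, categories):
    """True if every category label starts with `domain`."""
    return all(name.startswith(domain) for name in categories)


physics_only = ["physics.optics", "physics.soc-ph"]
cs_only = ["cs.LG", "cs.CV"]

# "physics" ends in "cs", so the substring test accepts a physics-only paper as cs.
print(matches_substring("cs", physics_only))  # True
print(matches_prefix("cs", physics_only))     # False
print(matches_prefix("cs", cs_only))          # True
```

A prefix test still accepts the alias `math-ph` for `math`, so comparing against the exact archive name would be stricter if that matters for the chosen domains.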

In [30]:
def filter_domain(domain: str, list_domain: list) -> bool:
    # Prefix match: a substring test would also accept e.g. "physics.optics" for "cs"
    return all(name.startswith(domain) for name in list_domain)

# abstracts contain hard line breaks; replace them with spaces
df['abstract'] = df['abstract'].apply(lambda x: x.replace("\n"," "))

# df[mask] keeps only the rows for which the boolean mask is True
# draw 10,000 samples for each domain

cate_math = df[df['categories'].apply(lambda x: filter_domain("math",x))].sample(n=10000, random_state=42).reset_index(drop=True)

cate_cs = df[df['categories'].apply(lambda x: filter_domain("cs",x))].sample(n=10000, random_state=42).reset_index(drop=True)

cate_astro = df[df['categories'].apply(lambda x: filter_domain("astro",x))].sample(n=10000, random_state=42).reset_index(drop=True)


print(cate_math.shape,cate_cs.shape,cate_astro.shape,sep="\n")

(10000, 11)
(10000, 11)
(10000, 11)


## Preprocess the abstract column with NLTK
- lowercasing
- stopword removal
- punctuation removal
- sentence segmentation
- stemming

In [52]:
import nltk
from nltk.corpus import stopwords
import numpy as np
from nltk.tokenize import sent_tokenize
from nltk.stem import PorterStemmer
# nltk.download('punkt_tab')
# nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
stem = PorterStemmer()

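def preprocess(corpus:str) -> str:
    '''
    Lowercase one abstract, split it into sentences, drop stopwords and
    non-alphanumeric tokens, stem the remaining words, and return them
    joined into a single space-separated string.
    '''
    sents:list = sent_tokenize(corpus.lower())
    # remove punctuation and stopwords; tokens come from a whitespace split, so a word
    # with punctuation attached (e.g. "data,") fails isalnum() and is dropped entirely
    cleaned_sets = []
    for sent in sents:
        current_sent = [stem.stem(word) for word in sent.split() if word.isalnum() and word not in stop_words]
        cleaned_sets.extend(current_sent) # a flat list of stemmed words
    return " ".join(cleaned_sets)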

cate_math["processed_abstact"] = cate_math['abstract'].apply(preprocess)
cate_cs["processed_abstact"] = cate_cs['abstract'].apply(preprocess)
cate_astro["processed_abstact"] = cate_astro['abstract'].apply(preprocess)

print(cate_math["processed_abstact"].head(),cate_cs["processed_abstact"].head(),cate_astro["processed_abstact"].head())


synthet data gener becom preval solut privaci leakag data gener model design gener realist synthet precis express data distribut real gener adversari network gain great success comput vision doubtlessli use synthet data though prior work demonstr great learn correl data distribut rather true process dataset natur correl reliabl statist techniqu tell linear depend easili affect encod underli factor real data natur reliabl propos causal model name causal tabular gener neural network gener synthet tabular data use tabular causal extens experi simul dataset real dataset demonstr better perform method given true causal graph compar perform use estim causal


## Contexts are sensitive to the time period of the corpus

### **2️⃣ Train Word2Vec Models**

Train separate **Word2Vec models** on each domain-specific corpus.

📌 **Implementation Steps:**

```python
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Load and tokenize text corpus
def preprocess_text(text):
    # TODO

# Example corpus (Replace with real domain text)
domain_corpus = ["Machine learning is advancing AI.", "Deep learning improves NLP."]
tokenized_corpus = preprocess_text(domain_corpus)

# Train Word2Vec model
# TODO: read gensim documentation:
# https://radimrehurek.com/gensim/models/word2vec.html
# https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py

model_domain1.save("word2vec_domain1.model")
model_domain2.save("word2vec_domain2.model")
# ...
```

🔹 Use the **same hyperparameters** across all domains for fair comparison.

📌 **Deliverable:** Trained Word2Vec models for different domains.


### **3️⃣ Word Similarity Comparison**
Compare how different domains represent words.

📌 **Task**: Select **5 key terms** and find their closest words using cosine similarity.

For example:
```python
word = "network"
print(model_domain1.wv.most_similar(word, topn=5))
print(model_domain2.wv.most_similar(word, topn=5))
```

📌 **Deliverable:** Table comparing top similar words per domain.
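
To make the ranking criterion concrete, the sketch below computes cosine similarity by hand with NumPy and ranks neighbours of the same word in two toy vector sets. The 3-dimensional vectors are invented for illustration and stand in for two trained domain models. The helper names `cosine_similarity` and `nearest_words` are hypothetical.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def nearest_words(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical vectors: "network" sits near different words in each domain.
toy_cs = {
    "network": np.array([1.0, 0.2, 0.0]),
    "neural": np.array([0.9, 0.3, 0.1]),
    "graph": np.array([0.7, 0.6, 0.0]),
    "galaxy": np.array([0.0, 0.1, 1.0]),
}
toy_astro = {
    "network": np.array([0.1, 0.2, 1.0]),
    "neural": np.array([1.0, 0.1, 0.0]),
    "reaction": np.array([0.2, 0.3, 0.9]),
    "galaxy": np.array([0.0, 0.4, 0.8]),
}

for name, vectors in [("cs", toy_cs), ("astro", toy_astro)]:
    print(name, [w for w, _ in nearest_words("network", vectors)])
```

For the deliverable table, the same loop can be run over the five chosen terms with the neighbours returned by each trained model, one column per domain.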



### **4️⃣ Word Analogy Task (Word Relationship Tests)**
Test how embeddings encode semantic relationships.

📌 **Task:** Compute word analogies manually using vector arithmetic.

```python
def compute_analogy(word_a, word_b, word_c, model):
    vec = # TODO
    return model.wv.similar_by_vector(vec, topn=5)

print(compute_analogy("algorithm", "software", "hardware", model))
```

📌 **Deliverable:** Compute at least 5 analogies for each domain and compare results.
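
The vector arithmetic behind an analogy "a is to b as c is to ?" is commonly written as `b - a + c`, followed by a nearest-neighbour search. The sketch below shows this on invented toy vectors with plain NumPy. The function name `analogy` and the vectors are assumptions for illustration, and this is only one common formulation of the task.

```python
import numpy as np


def analogy(word_a, word_b, word_c, vectors, topn=1):
    """Solve "a is to b as c is to ?" with the offset b - a + c."""
    target = vectors[word_b] - vectors[word_a] + vectors[word_c]
    target = target / np.linalg.norm(target)
    scores = []
    for word, vec in vectors.items():
        if word in (word_a, word_b, word_c):
            continue  # the query words themselves are not valid answers
        scores.append((word, float(np.dot(target, vec / np.linalg.norm(vec)))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented vectors: axis 1 separates man/woman, axis 2 marks royalty.
toy = {
    "man": np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king": np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([0.0, 0.0, 1.0]),
    "river": np.array([0.0, 1.0, 0.0]),
}

print(analogy("man", "woman", "king", toy))
```

With a trained gensim model the vectors would come from `model.wv[word]`. Keep in mind that `similar_by_vector` does not exclude the query words from its results, whereas `most_similar(positive=[b, c], negative=[a])` does, so the query words may have to be filtered out by hand when using the former.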



### **5️⃣ Visualization: PCA/t-SNE of Embeddings**

Use PCA or t-SNE to visualize embeddings and analyze clustering.

📌 **Deliverable:** Visualization of word clusters.
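
As a starting point, PCA can be computed directly with NumPy's SVD. The sketch below projects a handful of vectors onto their first two principal components. The random `embeddings` matrix is a stand-in for real vectors (e.g. `model.wv[word]` for each selected word), and the helper name `pca_2d` is hypothetical.

```python
import numpy as np


def pca_2d(matrix):
    """Project row vectors onto their first two principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
words = ["network", "graph", "neural", "galaxy", "halo", "stellar"]
# Random stand-ins for 50-dimensional word vectors.
embeddings = rng.normal(size=(len(words), 50))

coords = pca_2d(embeddings)
for word, (x, y) in zip(words, coords):
    print(f"{word:10s} {x:+.3f} {y:+.3f}")
```

The resulting `coords` can be passed to a scatter plot with one text label per word. t-SNE is an alternative when local neighbourhoods matter more than global variance, at the cost of a layout that changes with its random seed and perplexity setting.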

### **Discussion:**

- What differences were observed between embeddings from different domains?

- What were the most surprising similarities/differences?

- How does the domain-specific corpus affect word representations?

- Find cases where the analogy fails, and discuss why the model makes mistakes:
  - Not enough training data?
  - Context-dependent meanings?
  - Training corpus bias?

- Any other interesting findings?


### **6️⃣ Optional Task (bonus):**
1. Compare your trained Word2Vec models with other online pre-trained Word2Vec models. Find existing models from: [Gensim-data repository](https://github.com/piskvorky/gensim-data)
2. Train a simple classification model using Word2Vec (can be your own trained model or online word embedding models) vectors as text representation (e.g., classify abstract domain).

📌 **Deliverable:** Optional results and analysis for either or both tasks.
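
For the classification task, a simple baseline is to average the word vectors of an abstract and classify the resulting document vector. The sketch below uses invented 2-dimensional vectors and a nearest-centroid rule in plain NumPy. The names `document_vector` and `predict`, the toy data, and the choice of classifier are all assumptions for illustration. A real experiment would more likely feed the averaged vectors to a standard classifier such as logistic regression.

```python
import numpy as np


def document_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


# Invented word vectors standing in for a trained model.
toy_vectors = {
    "neural": np.array([1.0, 0.0]),
    "network": np.array([0.8, 0.2]),
    "galaxy": np.array([0.0, 1.0]),
    "halo": np.array([0.2, 0.8]),
}
train = [
    (["neural", "network"], "cs"),
    (["network", "training"], "cs"),   # "training" is out of vocabulary
    (["galaxy", "halo"], "astro"),
    (["halo", "mass"], "astro"),       # "mass" is out of vocabulary
]

# One centroid per class: the mean of that class's document vectors.
centroids = {}
for label in sorted({y for _, y in train}):
    docs = [document_vector(tokens, toy_vectors, 2) for tokens, y in train if y == label]
    centroids[label] = np.mean(docs, axis=0)


def predict(tokens):
    """Assign the label of the closest class centroid."""
    vec = document_vector(tokens, toy_vectors, 2)
    return min(centroids, key=lambda label: np.linalg.norm(vec - centroids[label]))


print(predict(["neural", "halo", "network"]))
print(predict(["galaxy"]))
```

Averaging discards word order, so this baseline mostly measures how well the vocabulary of each domain separates in the embedding space.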

# **Project Submission**

📅 **Deadline:** 27/04/2025

📌 **Submission Format:**

* Jupyter Notebook (.ipynb) with code and analysis.

* Conversation with AI Chatbot (if used, attach the conversation along with a brief report explaining what you learned from it)

🚀 **Good luck! Explore, compare, and understand embeddings!** 🚀
