Finding Distinctive Words with TF-IDF in Scientific Abstracts

We'll apply TF-IDF to a handful of scientific abstracts to pull out the terms that best characterize each one. This is a handy way to triage primary literature.

TF-IDF weighs words based on two factors:
- Term Frequency (TF): How often a word appears in a document, normalized by the document's length. This captures the word's relative importance within that document.
- Inverse Document Frequency (IDF): How rare the word is across all documents. Words that appear in many documents (like "the" or "study") receive lower IDF scores; specialized terms that appear in only a few receive higher scores.
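
As a minimal sketch of these two factors, the snippet below computes TF and IDF by hand on a made-up, pre-tokenized corpus. The corpus and the helper names `term_frequency` and `inverse_document_frequency` are illustrative assumptions, and the IDF uses the plain log(N/DF) form; other variants add smoothing.

```python
from collections import Counter
from math import log

# Hypothetical pre-tokenized corpus, invented for illustration only
docs = [
    ["study", "antibody", "antibody", "response"],
    ["study", "neuron", "memory"],
    ["study", "antibody", "vaccine"],
]

def term_frequency(word, tokens):
    """Share of the document's tokens that are `word`."""
    return Counter(tokens)[word] / len(tokens)

def inverse_document_frequency(word, corpus):
    """log(N / DF): N documents in total, DF of them contain `word`."""
    doc_freq = sum(1 for tokens in corpus if word in tokens)
    return log(len(corpus) / doc_freq)

# TF is measured in the first document, IDF across the whole corpus
for word in ["study", "antibody", "neuron"]:
    tf = term_frequency(word, docs[0])
    idf = inverse_document_frequency(word, docs)
    print(f"{word}: tf={tf:.2f} idf={idf:.2f}")
```

Here "study" appears in every document, so its IDF is log(1) = 0 no matter how often it occurs, while "neuron" is rare in the corpus but absent from the first document, so its TF there is 0.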

TF-IDF aims to highlight words that are frequent in a specific document but rare across the rest of the corpus. It helps with:
- Identifying field-specific terminology in academic literature
- Extracting keywords from documents
- Building search engines that return relevant results
- Comparing document similarity
- Text classification and clustering
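
For the document-similarity use case, one common approach (not the only one) is cosine similarity between TF-IDF vectors. The sketch below stores each vector as a word-to-weight dictionary; the function name `cosine_similarity` and the weights are hypothetical, chosen only to illustrate the idea.

```python
from math import sqrt

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[w] * vec_b[w] for w in shared)
    norm_a = sqrt(sum(v * v for v in vec_a.values()))
    norm_b = sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Hypothetical TF-IDF weights, invented for illustration
doc_a = {"antibody": 0.4, "vaccine": 0.3}
doc_b = {"antibody": 0.2, "immune": 0.5}
doc_c = {"dopamine": 0.6, "brain": 0.1}

print(cosine_similarity(doc_a, doc_b))  # positive: the two share "antibody"
print(cosine_similarity(doc_a, doc_c))  # 0.0: no words in common
```

Documents that share highly weighted terms score closer to 1, and documents with no terms in common score 0.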


Coding: 

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
from math import log

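# Download NLTK resources (only needed once per environment)
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data used by newer NLTK releases
nltk.download('stopwords')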

# Sample abstracts from different scientific fields
abstracts = [
    {"field": "immunology", "text": "Study of antibodies and vaccines in immune response."},
    {"field": "immunology", "text": "Research on vaccine effects on antibody production."},
    {"field": "neuroscience", "text": "Brain activity during memory formation and recall."},
    {"field": "neuroscience", "text": "Neural pathways and dopamine in the brain."}
]

# Tokenize each abstract and drop stopwords so only content-bearing words remain
stop_words = set(stopwords.words('english'))  # build once, reuse for every document
processed_docs = []
for doc in abstracts:
    # Convert to lowercase and tokenize
    tokens = word_tokenize(doc["text"].lower())

    # Remove stopwords; the length filter also drops punctuation and very short tokens
    tokens = [word for word in tokens if word not in stop_words and len(word) > 2]

    processed_docs.append({"field": doc["field"], "tokens": tokens})

# Calculate Term Frequency (TF)
# Count how many times each word appears in a single document and divide by the
# document's total number of words. This normalization prevents longer documents
# from having artificially higher term frequencies.
for doc in processed_docs:
    word_counts = Counter(doc["tokens"])
    total_words = len(doc["tokens"])
    doc["tf"] = {word: count/total_words for word, count in word_counts.items()}

# Calculate Inverse Document Frequency (IDF)
# Count how many documents contain each word (its document frequency, DF), then
# compute log(N / DF), where N is the total number of documents. Words found in
# many documents get a high DF and therefore a low IDF; a word found in every
# document gets log(1) = 0. The logarithm keeps very rare words from dominating.
all_words = set()
for doc in processed_docs:
    all_words.update(doc["tokens"])

total_docs = len(processed_docs)
idf = {}
for word in all_words:
    doc_count = sum(1 for doc in processed_docs if word in doc["tokens"])
    idf[word] = log(total_docs / doc_count)

# Calculate TF-IDF
# Multiply each term's TF by its IDF to get a combined score. Words with high
# TF-IDF appear frequently in a specific document but rarely in the overall corpus.
for doc in processed_docs:
    doc["tfidf"] = {word: tf_value * idf[word] for word, tf_value in doc["tf"].items()}

# Find top terms by field
field_terms = {}
for doc in processed_docs:
    field = doc["field"]
    if field not in field_terms:
        field_terms[field] = {}
    
    for word, score in doc["tfidf"].items():
        field_terms[field][word] = field_terms[field].get(word, 0) + score

# Print top terms for each field
for field, terms in field_terms.items():
    top_terms = sorted(terms.items(), key=lambda x: x[1], reverse=True)[:3]
    print(f"\nTop terms for {field}:")
    for term, score in top_terms:
        print(f"  {term}: {score:.4f}")

We group documents by their field and sum the TF-IDF scores for each word across documents in the same field to find field-specific terminology.
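
The summing step can be checked by hand. The sketch below redoes the arithmetic for two neuroscience terms using only the standard library. It assumes the preprocessing above leaves five tokens in the first neuroscience abstract and four in the second, and the variable names are illustrative.

```python
from math import log

total_docs = 4

# "brain" occurs in both neuroscience abstracts (2 of the 4 documents)
idf_brain = log(total_docs / 2)
# "neural" occurs only in the second neuroscience abstract (1 of 4)
idf_neural = log(total_docs / 1)

# Assumed token counts after stopword removal: 5 and 4
# Each word occurs once per abstract, so TF is 1/5 or 1/4
brain_score = (1 / 5) * idf_brain + (1 / 4) * idf_brain
neural_score = (1 / 4) * idf_neural

print(f"brain:  {brain_score:.4f}")   # brain:  0.3119
print(f"neural: {neural_score:.4f}")  # neural: 0.3466
```

Summing rewards "brain" for appearing in two abstracts, but because IDF is computed per document rather than per field, its lower IDF still leaves it slightly behind "neural".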

Interpreting Results
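The output shows the three highest-scoring terms for each field. With this sample, every immunology term appears in exactly one abstract, so all of them tie and the top three ("study", "antibodies", "vaccines") are simply the first ones encountered. For neuroscience, "neural", "pathways", and "dopamine" lead. The word "brain" ranks just below them even though it appears in both neuroscience abstracts, because appearing in two of the four documents halves its IDF.
On a larger corpus, high-scoring terms tend to reflect the specialized vocabulary that characterizes each field. Our sample is too small to show that reliably, and without stemming or lemmatization, "vaccine" and "vaccines" are counted as separate words.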