### Why This Project Stands Out
- Real-World Dataset: Fetches titles and abstracts of real academic papers through the arXiv API.
- State-of-the-Art Tools: Combines a BERT model (SciBERT) for ranking with an LLM for summarization.
- Relevance to NLP: Covers embedding, retrieval, and question-answering pipelines.
- Practical Impact: Useful for researchers and students looking for summarized academic insights.
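
As a rough illustration of the first bullet: the arXiv API returns an Atom (XML) feed in which each `entry` element carries a `title` and a `summary` (the abstract). The sketch below parses a small hand-written sample feed using only the standard library. The sample text and the helper name `parse_arxiv_feed` are made up for illustration and are not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Atom namespace used by the feed
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hand-written sample feed (illustrative, not real arXiv data)
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>  A Sample Paper Title </title>
    <summary>
      First line of the abstract,
      continued on a second line.
    </summary>
  </entry>
  <entry>
    <title>Another Sample Title</title>
    <summary>Short abstract.</summary>
  </entry>
</feed>
"""

def parse_arxiv_feed(xml_text):
    """Return a list of {"title", "abstract"} dicts from an Atom feed string."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM_NS)
        # Collapse internal line breaks and repeated spaces
        papers.append({
            "title": " ".join(title.split()),
            "abstract": " ".join(summary.split()),
        })
    return papers

papers = parse_arxiv_feed(SAMPLE_FEED)
for paper in papers:
    print(paper["title"], "->", paper["abstract"])
```

Parsing with an explicit namespace map avoids relying on an HTML parser to read XML, at the cost of slightly more verbose lookups.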

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 1: Fetch research paper abstracts from the arXiv API
def scrape_arxiv(query, max_results=10):
    base_url = "http://export.arxiv.org/api/query"
    params = {
        "search_query": f"all:{query}",
        "start": 0,
        "max_results": max_results
    }
    
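    # A timeout keeps the call from hanging if the API is unreachable
    response = requests.get(base_url, params=params, timeout=30)
    if response.status_code == 200:
        # The API returns an Atom (XML) feed, so use the XML parser;
        # the "lxml" HTML parser can trigger an XMLParsedAsHTMLWarning
        soup = BeautifulSoup(response.content, "xml")
        papers = []
        for entry in soup.find_all("entry"):
            title = entry.find("title").text
            summary = entry.find("summary").text
            papers.append({"title": title.strip(), "abstract": summary.strip()})
        return pd.DataFrame(papers)
    else:
        raise Exception(f"Failed to fetch data from arXiv API (HTTP {response.status_code})")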

# Step 2: Save dataset to CSV
query = "machine learning"
papers_df = scrape_arxiv(query, max_results=20)
papers_df.to_csv("arxiv_papers.csv", index=False)
print("Dataset saved to arxiv_papers.csv!")


Dataset saved to arxiv_papers.csv!
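
The `start` and `max_results` parameters in the cell above can also be used to page through a larger result set. A minimal sketch, assuming a hypothetical helper `build_page_params` and a page size chosen only for illustration; it builds the request parameters and URLs but does not contact the API:

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/api/query"  # same endpoint as the cell above

def build_page_params(query, total, page_size=10):
    """Yield one parameter dict per page until `total` results are covered."""
    for start in range(0, total, page_size):
        yield {
            "search_query": f"all:{query}",
            "start": start,
            # The last page may be smaller than page_size
            "max_results": min(page_size, total - start),
        }

pages = list(build_page_params("machine learning", total=25, page_size=10))
urls = [f"{BASE_URL}?{urlencode(p)}" for p in pages]
for url in urls:
    print(url)
```

Each dict could be passed as `params` to `requests.get`, ideally with a short pause between requests to stay polite to the API.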



In [21]:
# papers_df = pd.read_csv("arxiv_papers.csv")
# papers = papers_df["abstract"].tolist()
# papers

In [3]:
# ! pip install sentence-transformers
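
The next cell retrieves abstracts by embedding similarity and then re-scores them with a classification head. As a hedged illustration of the retrieval idea only, the sketch below ranks texts by cosine similarity between a query vector and document vectors. The three-dimensional vectors and the helper names are invented for illustration; in the notebook the vectors would come from the embedding model, and the FAISS index may use a different distance measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs, docs):
    """Return (score, doc) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query_vec, vec), doc)
              for vec, doc in zip(doc_vecs, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Toy vectors (hypothetical; real embeddings have hundreds of dimensions)
docs = ["paper A", "paper B", "paper C"]
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
query_vec = [1.0, 0.2, 0.0]

ranking = rank_by_similarity(query_vec, doc_vecs, docs)
for score, doc in ranking:
    print(f"{doc}: {score:.3f}")
```

Unlike a classification head that has not been fine-tuned, a similarity score of this kind depends only on the embeddings, so it needs no extra training to produce a usable ordering.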

In [31]:
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.llms import Ollama
from transformers import BertTokenizer, BertForSequenceClassification
import pandas as pd
import torch
import torch.nn.functional as F

# Load dataset
papers_df = pd.read_csv("arxiv_papers.csv")
papers = papers_df["abstract"].tolist()

# Step 1: Embed documents with HuggingFace
print("Generating embeddings...")
embeddings = HuggingFaceEmbeddings(model_name="allenai/scibert_scivocab_uncased")
vectorstore = FAISS.from_texts(papers, embeddings)

# Step 2: Set up LangChain pipeline with an LLM
llm = Ollama(model="llama3.2")
retriever = vectorstore.as_retriever()
qa_pipeline = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

# Step 3: Ranking using BERT (Optional)
bert_model_name = "allenai/scibert_scivocab_uncased"
bert_tokenizer = BertTokenizer.from_pretrained(bert_model_name)
bert_model = BertForSequenceClassification.from_pretrained(bert_model_name)

def rank_papers(query, papers):
    print("Ranking papers with BERT...")
    scores = []
    max_length = 512  # BERT's maximum input length; longer pairs are truncated
    for paper in papers:
        # Tokenize the (query, paper) pair; padding is unnecessary for a single pair
        inputs = bert_tokenizer(
            query, paper,
            return_tensors="pt",
            truncation=True,
            max_length=max_length
        )

        # Get model outputs without tracking gradients (inference only)
        with torch.no_grad():
            outputs = bert_model(**inputs)

        # Convert the logits to class probabilities
        logits = outputs.logits
        probs = F.softmax(logits, dim=-1)  # Apply softmax along the last dimension
        # NOTE: the classification head is newly initialized (see the warning in
        # the output), so this score is not a trained relevance signal
        score = probs[0][0].item()  # Treats class 0 as "relevant" (an assumption)

        scores.append(score)

    # Sort papers by scores (higher score is better)
    return sorted(zip(scores, papers), reverse=True)

# Query the system
query = "Computer"
retrieved_papers = retriever.get_relevant_documents(query)
ranked_papers = rank_papers(query, [doc.page_content for doc in retrieved_papers])

# Step 4: Summarize the top-ranked papers
print("\nTop Papers and Summaries:")
summary_question = "Please summarize this paper's abstract in a few sentences."  # Explicit question for summarization

for i, paper in enumerate(ranked_papers[:3], 1):
    print(f"\nPaper {i}:")
    print("Abstract:", paper[1])
    summary = qa_pipeline.run(f"{summary_question} {paper[1]}")
    print("Summary:", summary)

Generating embeddings...


No sentence-transformers model found with name allenai/scibert_scivocab_uncased. Creating a new one with mean pooling.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at allenai/scibert_scivocab_uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Ranking papers with BERT...

Top Papers and Summaries:

Paper 1:
Abstract: Transparent machine learning is introduced as an alternative form of machine
learning, where both the model and the learning system are represented in
source code form. The goal of this project is to enable direct human
understanding of machine learning models, giving us the ability to learn,
verify, and refine them as programs. If solved, this technology could represent
a best-case scenario for the safety and security of AI systems going forward.
Summary: Here is a summary of the paper's abstract in a few sentences:

This paper introduces transparent machine learning, where both models and learning systems are represented in source code form to enable human understanding and verification. The goal is to improve the safety and security of AI systems by making them more transparent and explainable.

Paper 2:
Abstract: Introduction to Machine learning covering Statistical Inference (Bayes, EM,
ML/MaxEnt duality), 