# SchoLAS

## Problem Statement
SchoLAS is an LLM-based tool that scrapes scholarly sites to gather the latest literature in the field of Learning Analytics, so that analysts can query it to stay up to date with the field.

## Idea of the Project

1. To start, a large body of research papers is available via SoLAR's Journal of Learning Analytics, so we scrape the journal's most recent issue using `BeautifulSoup`.
2. Then we build a RAG pipeline that processes these papers and augments a local LLM so it can answer questions based on the papers we downloaded.
3. Finally, we integrate `Streamlit` as the front end to make the tool easy to use.
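
Step 2 usually begins by splitting each paper's text into overlapping chunks before indexing. The sketch below shows one simple way to do that. The function name `chunk_text` and the word-based sizes are illustrative assumptions, not part of this notebook's pipeline yet.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word-based chunks where consecutive chunks share `overlap` words.

    Hypothetical helper: the sizes are placeholders and would need tuning
    for whichever embedding model or LLM is eventually used.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary at least partly visible in both neighbouring chunks, which tends to help retrieval.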


In [1]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import os
from dotenv import load_dotenv
import fitz
import json
import re


In [4]:
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/115.0 Safari/537.36"
    )
}

In [32]:
# Step 1: Fetch the issue page HTML
issue_url = "https://learning-analytics.info/index.php/JLA/issue/view/511"
# issue_url = "https://learning-analytics.info/index.php/JLA/issue/view/509"
response = requests.get(issue_url, headers=headers)
response.raise_for_status()  # ensures we fail if the request breaks

soup = BeautifulSoup(response.text, "html.parser")

In [33]:
print(soup)

 <!DOCTYPE html>

<html lang="en-US" xml:lang="en-US">
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>
		Vol. 12 No. 2 (2025): Human Creativity and Learning Analytics
							| Journal of Learning Analytics
			</title>
<link href="https://learning-analytics.info/public/journals/42/favicon_en_US.png" rel="icon"/>
<meta content="Open Journal Systems 3.2.1.1" name="generator"/>
<link href="https://learning-analytics.info/index.php/JLA/gateway/plugin/WebFeedGatewayPlugin/atom" rel="alternate" type="application/atom+xml"/>
<link href="https://learning-analytics.info/index.php/JLA/gateway/plugin/WebFeedGatewayPlugin/rss" rel="alternate" type="application/rdf+xml"/>
<link href="https://learning-analytics.info/index.php/JLA/gateway/plugin/WebFeedGatewayPlugin/rss2" rel="alternate" type="application/rss+xml"/>
<link href="https://learning-analytics.info/index.php/JLA/$$$call$$$/page/page/css?name=stylesheet" rel="stylesheet" type="te

In [34]:
# Find the <title> tag and keep its first line, which holds the issue name
journal_issue = soup.find("title").get_text(strip=True).split('\n')[0]
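
The issue title is later used as a folder name, and it contains a colon, which some file systems (Windows in particular) reject. Below is a minimal sketch of a sanitizer that could be applied to `journal_issue` first. The helper `safe_dirname` and its fallback name are hypothetical and not part of the notebook.

```python
import re


def safe_dirname(name: str, max_len: int = 100) -> str:
    """Return a version of `name` that is safer to use as a directory name."""
    # Collapse tabs, newlines and repeated spaces into single spaces
    cleaned = re.sub(r"\s+", " ", name)
    # Replace characters that are invalid or awkward in directory names
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", cleaned)
    # Trim surrounding spaces and dots, then cap the length
    cleaned = cleaned.strip(" .")[:max_len].rstrip(" .")
    return cleaned or "untitled_issue"


print(safe_dirname("Vol. 12 No. 2 (2025): Human Creativity and Learning Analytics"))
```

Using the sanitized name would change the folder path, so any later cell that reads from the corpus directory would need to use the same helper.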


In [35]:
# Step 2: Find all 'PDF' button links
pdf_button_links = []
for a_tag in soup.find_all("a", class_="obj_galley_link pdf"):
    viewer_link = urljoin(issue_url, a_tag.get("href"))
    pdf_button_links.append(viewer_link)

print(f"Found {len(pdf_button_links)} PDF viewer links")
for link in pdf_button_links:
    print(link)

Found 19 PDF viewer links
https://learning-analytics.info/index.php/JLA/article/view/9273/7927
https://learning-analytics.info/index.php/JLA/article/view/8833/7921
https://learning-analytics.info/index.php/JLA/article/view/8807/7923
https://learning-analytics.info/index.php/JLA/article/view/8837/7919
https://learning-analytics.info/index.php/JLA/article/view/8809/7913
https://learning-analytics.info/index.php/JLA/article/view/8857/7917
https://learning-analytics.info/index.php/JLA/article/view/8835/7925
https://learning-analytics.info/index.php/JLA/article/view/8625/7899
https://learning-analytics.info/index.php/JLA/article/view/8543/7911
https://learning-analytics.info/index.php/JLA/article/view/8495/7905
https://learning-analytics.info/index.php/JLA/article/view/8481/7903
https://learning-analytics.info/index.php/JLA/article/view/8441/7915
https://learning-analytics.info/index.php/JLA/article/view/8753/7901
https://learning-analytics.info/index.php/JLA/article/view/8553/7893
https://

In [36]:
# Take the first viewer link
test_viewer_url = pdf_button_links[0]
print(f"Testing viewer page: {test_viewer_url}")

# Fetch viewer page HTML
viewer_resp = requests.get(test_viewer_url, headers=headers)
viewer_resp.raise_for_status()

viewer_soup = BeautifulSoup(viewer_resp.text, "html.parser")

# Find <a class="download"> for actual PDF link
download_tag = viewer_soup.find("a", class_="download")
if download_tag:
    pdf_url = urljoin(test_viewer_url, download_tag.get("href"))
    print("Actual PDF URL:", pdf_url)
else:
    print("No download link found in viewer page.")

Testing viewer page: https://learning-analytics.info/index.php/JLA/article/view/9273/7927
Actual PDF URL: https://learning-analytics.info/index.php/JLA/article/download/9273/7927/46087
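
The viewer-to-PDF lookup can be factored into a function that takes HTML as a string, which makes it testable without network access. The sketch below is an assumption-laden variant: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs with no extra dependencies, and the name `resolve_pdf_url` is hypothetical. It looks for the same `<a class="download">` element as the cell above.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class _DownloadLinkFinder(HTMLParser):
    """Remember the href of the first <a> tag whose class list includes 'download'."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.href is not None:
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if "download" in classes and attr.get("href"):
            self.href = attr["href"]


def resolve_pdf_url(viewer_html: str, viewer_url: str) -> Optional[str]:
    """Return the absolute PDF URL found in a viewer page, or None if absent."""
    finder = _DownloadLinkFinder()
    finder.feed(viewer_html)
    return urljoin(viewer_url, finder.href) if finder.href else None


# Made-up HTML for illustration only; the real viewer page is much larger
sample = '<header><a class="return" href="/back">Back</a><a class="download" href="/files/a.pdf" download>PDF</a></header>'
print(resolve_pdf_url(sample, "https://example.org/view/1"))
```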


In [37]:
folder_store_papers = os.path.join("rag_corpus", journal_issue)
os.makedirs(folder_store_papers, exist_ok=True)

In [38]:
for idx, viewer_link in enumerate(pdf_button_links, start=1):
    print(f"Now on {idx} of {len(pdf_button_links)}")
    # Fetch viewer page HTML
    viewer_resp = requests.get(viewer_link, headers=headers)
    viewer_resp.raise_for_status()

    viewer_soup = BeautifulSoup(viewer_resp.text, "html.parser")

    # Find <a class="download"> for actual PDF link
    download_tag = viewer_soup.find("a", class_="download")
    if not download_tag:
        print("⚠️ No download link found, skipping.")
        continue
    pdf_url = urljoin(viewer_link, download_tag.get("href"))
    print(f"Downloading: {pdf_url}")
    # Get paper title for filename; fall back to paper_<idx> when the
    # viewer page has no <h1 class="page_title">
    title_tag = viewer_soup.find("h1", class_="page_title")
    if title_tag:
        filename = title_tag.get_text(strip=True)
    else:
        filename = f"paper_{idx}"
    
    # Clean filename
    safe_filename = "".join(c if c.isalnum() or c in " _-" else "_" for c in filename) + ".pdf"
    file_path = os.path.join(folder_store_papers, safe_filename)
    
    # Download PDF
    pdf_resp = requests.get(pdf_url, headers=headers)
    pdf_resp.raise_for_status()
    with open(file_path, "wb") as f:
        f.write(pdf_resp.content)
    
    print(f"✅ Saved: {file_path}")

print("\nAll downloads complete!")



Now on 1 of 19
Downloading: https://learning-analytics.info/index.php/JLA/article/download/9273/7927/46087
✅ Saved: rag_corpus/Vol. 12 No. 2 (2025): Human Creativity and Learning Analytics/paper_1.pdf
Now on 2 of 19
Downloading: https://learning-analytics.info/index.php/JLA/article/download/8833/7921/46081
✅ Saved: rag_corpus/Vol. 12 No. 2 (2025): Human Creativity and Learning Analytics/paper_2.pdf
Now on 3 of 19
Downloading: https://learning-analytics.info/index.php/JLA/article/download/8807/7923/46083
✅ Saved: rag_corpus/Vol. 12 No. 2 (2025): Human Creativity and Learning Analytics/paper_3.pdf
Now on 4 of 19
Downloading: https://learning-analytics.info/index.php/JLA/article/download/8837/7919/46079
✅ Saved: rag_corpus/Vol. 12 No. 2 (2025): Human Creativity and Learning Analytics/paper_4.pdf
Now on 5 of 19
Downloading: https://learning-analytics.info/index.php/JLA/article/download/8809/7913/45995
✅ Saved: rag_corpus/Vol. 12 No. 2 (2025): Human Creativity and Learning Analytics/paper_5
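
In the download loop, a single failed request raises and stops the whole run. One possible mitigation is to retry with exponential backoff. The sketch below is a generic helper, not the notebook's method: `fetch_with_retry` and its parameters are hypothetical, and the actual HTTP call is passed in as a function so the helper can be tested offline.

```python
import time
from typing import Callable


def fetch_with_retry(
    fetch: Callable[[str], bytes],
    url: str,
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(url), retrying on failure with delays of base_delay * 2**attempt."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:  # in real use, narrow this to network errors
            last_error = exc
            # Wait before the next attempt, but not after the final one
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

In the loop above, `fetch` could be a small function that calls `requests.get(url, headers=headers)`, then `raise_for_status()`, and returns `.content`. A short pause between papers would also be gentler on the journal's server.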

In [None]:
# import ollama 

# RAG stuff

# store model output in a variable

# Streamlit design
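
The planned RAG step can be prototyped before any model is wired in. The sketch below is a minimal stand-in under stated assumptions: retrieval is a naive term-overlap score rather than embeddings, and `generate` is a hypothetical callable representing whatever local model call (for example via `ollama`) is chosen later. All function names here are illustrative.

```python
import re
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_chunks(question: str, chunks: List[str], k: int = 3) -> List[str]:
    """Return up to k chunks sharing the most terms with the question."""
    q_counts = Counter(tokenize(question))
    scored = []
    for idx, chunk in enumerate(chunks):
        c_counts = Counter(tokenize(chunk))
        score = sum(min(q_counts[t], c_counts[t]) for t in q_counts)
        scored.append((score, idx))
    # Highest score first; ties keep the original chunk order
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunks[idx] for score, idx in scored[:k] if score > 0]


def build_prompt(question: str, context_chunks: List[str]) -> str:
    """Assemble a prompt that asks the model to answer from the given context."""
    context = "\n\n".join(f"[{n}] {c}" for n, c in enumerate(context_chunks, 1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer(question: str, chunks: List[str],
           generate: Callable[[str], str], k: int = 3) -> str:
    """Retrieve context, build the prompt, and return the model output."""
    prompt = build_prompt(question, top_chunks(question, chunks, k))
    return generate(prompt)  # `generate` is a placeholder for the local LLM call
```

The string returned by `answer` is the model output stored in a variable, which a Streamlit front end could then display.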