RAGify Science Grades 1–10
A smart, AI-powered learning companion for Classes 1 to 10, built on Retrieval-Augmented Generation (RAG). It retrieves passages from science textbooks and uses them to make science easy to explore, ask about, and understand across all grade levels.


In [7]:
%pip install pandas PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
Note: you may need to restart the kernel to use updated packages.


In [1]:
import os

import pandas as pd
from PyPDF2 import PdfReader


In [3]:
import os
import pandas as pd
from PyPDF2 import PdfReader

# Set base directory for downloads
downloads_dir = os.path.expanduser("~/Downloads")

# Define all file paths using os.path.join
class1 = [os.path.join(downloads_dir, 'class1.pdf'), os.path.join(downloads_dir, 'class1_1.pdf')]
class2 = [os.path.join(downloads_dir, 'class2.pdf'), os.path.join(downloads_dir, 'class2_pb.pdf')]
class3 = [os.path.join(downloads_dir, 'class3.pdf'), os.path.join(downloads_dir, 'class3_1.pdf')]
class4 = [os.path.join(downloads_dir, 'class4.pdf'), os.path.join(downloads_dir, 'class4_pb.pdf')]
class5 = [os.path.join(downloads_dir, 'class5.pdf')]
class6 = [os.path.join(downloads_dir, 'class6.pdf')]
class7 = [os.path.join(downloads_dir, 'class7_1.pdf')]
class8 = [os.path.join(downloads_dir, 'class8.pdf')]
class9 = [
    os.path.join(downloads_dir, 'class9 bio_1.pdf'), os.path.join(downloads_dir, 'class9 bio_2.pdf'),
    os.path.join(downloads_dir, 'class9 bio.pdf'), os.path.join(downloads_dir, 'class9 chemistry_1.pdf'),
    os.path.join(downloads_dir, 'class9 chemistry.pdf'), os.path.join(downloads_dir, 'class9 chemistry2.pdf'),
    os.path.join(downloads_dir, 'class9 physics fb.pdf'), os.path.join(downloads_dir, 'class9 physics.pdf')
]
class10 = [
    os.path.join(downloads_dir, 'class10 bio.pdf'), os.path.join(downloads_dir, 'class10 chemis.pdf'),
    os.path.join(downloads_dir, 'class10 physics.pdf')
]

pdf_files = class1 + class2 + class3 + class4 + class5 + class6 + class7 + class8 + class9 + class10

pdf_content_list = []

# Extract text from each PDF; pages with no extractable text contribute an empty string
for file in pdf_files:
    try:
        with open(file, 'rb') as pdf_file:
            reader = PdfReader(pdf_file)
            content = "".join(page.extract_text() or "" for page in reader.pages)
        pdf_content_list.append({'file_name': os.path.basename(file), 'content': content})
    except Exception as e:
        print(f"Error processing file {file}: {e}")

pdf_df = pd.DataFrame(pdf_content_list)
print(pdf_df)
pdf_df.to_csv('pdf_contents.csv', index=False)


                 file_name                                            content
0               class1.pdf  PrimarySouth   Sudan\nPupil’s Book\nAll the co...
1             class1_1.pdf  Uncorrected proof, all content subject to chan...
2               class2.pdf  A\nPublished by Macmillan/McGraw-Hill, of McGr...
3            class2_pb.pdf  PrimarySouth   Sudan\nPupil’s Book\nAll the co...
4               class3.pdf  3\nArchana Shukla\nThis is my book\nI am ........
5             class3_1.pdf  PrimarySouth   Sudan\nPupil’s Book\nAll the co...
6               class4.pdf  Archana Shukla\nThis is my book\nI am ...........
7            class4_pb.pdf  PrimarySouth   Sudan\nPupil’s Book\nAll the co...
8               class5.pdf                                                   
9               class6.pdf  SCIENCE\nTEXTBOOK  FOR CLASS VI\n2018-19\nFirs...
10            class7_1.pdf  SCIENCE\nTEXTBOOK  FOR CLASS VII\n2018-19\nFir...
11              class8.pdf  SCIENCE\nTEXTBOOK  FOR CLASS VIIISCI
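
The table above shows an empty `content` cell for class5.pdf, which suggests that no text was extracted from that file (possibly a scanned, image-only PDF, although the output does not say why). Note also that `pandas.read_csv` parses empty fields as NaN by default, so an empty string saved to the CSV comes back as a missing value. A minimal sketch for flagging such files before saving is shown below; `find_empty_extractions` is a hypothetical helper, and the sample records are made up for illustration.

```python
def find_empty_extractions(records):
    """Return the file names whose extracted text is empty or whitespace only."""
    return [
        record["file_name"]
        for record in records
        if not (record.get("content") or "").strip()
    ]


# Toy records standing in for pdf_content_list (contents are illustrative only)
sample_records = [
    {"file_name": "class5.pdf", "content": ""},
    {"file_name": "class6.pdf", "content": "SCIENCE TEXTBOOK FOR CLASS VI"},
]

empty_files = find_empty_extractions(sample_records)
print(empty_files)
```

In the notebook, the same helper could be called as `find_empty_extractions(pdf_content_list)` right after the extraction loop.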

In [7]:
import pandas as pd

# Load pre-saved PDF content directly
pdf_df = pd.read_csv("pdf_contents.csv")

# Optional: inspect loaded content
print(pdf_df.head())


       file_name                                            content
0     class1.pdf  PrimarySouth   Sudan\nPupil’s Book\nAll the co...
1   class1_1.pdf  Uncorrected proof, all content subject to chan...
2     class2.pdf  A\nPublished by Macmillan/McGraw-Hill, of McGr...
3  class2_pb.pdf  PrimarySouth   Sudan\nPupil’s Book\nAll the co...
4     class3.pdf  3\nArchana Shukla\nThis is my book\nI am ........


Step#2: Chunk Text

In [9]:
# 1. Define the chunking function
def chunk_text(text, chunk_size=500):
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    return chunks

# 2. Apply chunking to all PDF content
chunks = []  # List to hold all text chunks

for idx, row in pdf_df.iterrows():
    if pd.notna(row['content']):
        # Chunk the current document
        current_chunks = chunk_text(row['content'])
        
        # Optional: Debug print for first file only
        if idx == 0:
            print(f"🔍 File: {row['file_name']}")
            print(f"🧩 Total chunks from this file: {len(current_chunks)}")
            print("📑 First 2 chunks preview:\n")
            for chunk in current_chunks[:2]:
                print(chunk[:300] + "\n---\n")  # Print first 300 characters of each chunk

        # Add chunks to the master list
        chunks.extend(current_chunks)

# 3. Final check
print(f"\n✅ Total combined chunks from all PDFs: {len(chunks)}")
print("🧠 Preview of first 2 chunks overall:\n")
for i, ch in enumerate(chunks[:2]):
    print(f"Chunk {i+1} (first 300 characters):")
    print(ch[:300] + "\n---\n")



🔍 File: class1.pdf
🧩 Total chunks from this file: 12
📑 First 2 chunks preview:

PrimarySouth Sudan Pupil’s Book All the courses in this primary series were developed by the Ministry of General Education and Instruction, Republic of South Sudan. The books have been designed to meet the primary school syllabus, and at the same time equiping the pupils with skills to fit in the mo
---

permission of the Copyright Holder. Pictures, illustrations and links to third party websites are provided in good faith, for information and education purposes only.iiiT able of contents Unit 1 Parts of the body and hygiene.................................1 1.1 Parts of the human body .............
---


✅ Total combined chunks from all PDFs: 1233
🧠 Preview of first 2 chunks overall:

Chunk 1 (first 300 characters):
PrimarySouth Sudan Pupil’s Book All the courses in this primary series were developed by the Ministry of General Education and Instruction, Republic of South Sudan. The books have been designed
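
To make the chunking step concrete, the sketch below runs the same word-based `chunk_text` on a short toy string. It also includes `chunk_text_overlap`, a hypothetical variant that is not part of this notebook: it lets consecutive chunks share a few words, so a sentence that straddles a chunk boundary still appears whole in one of them. The `overlap` parameter and its default are assumptions.

```python
def chunk_text(text, chunk_size=500):
    """Same word-based chunker as in the cell above."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def chunk_text_overlap(text, chunk_size=500, overlap=50):
    """Hypothetical variant: consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    result = []
    for start in range(0, len(words), step):
        result.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return result


sample = "one two three four five six seven"
plain = chunk_text(sample, chunk_size=3)
overlapped = chunk_text_overlap(sample, chunk_size=3, overlap=1)
print(plain)
print(overlapped)
```

With `chunk_size=3`, the plain chunker splits the seven words into three non-overlapping groups, while the overlapping variant repeats the last word of each chunk at the start of the next.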

Step#3: Vectorization

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

# Vectorize the text chunks using Bag of Words model
vectorizer = CountVectorizer().fit(chunks)
chunk_vectors = vectorizer.transform(chunks)

# Debug print: basic info
print(f"\n✅ Vectorization Complete!")
print(f"📐 Shape of vectorized matrix: {chunk_vectors.shape}")  # (number_of_chunks, vocabulary_size)

# Optional: print first 5 vocabulary words
print("\n🔡 First 5 Vocabulary Words:")
print(list(vectorizer.vocabulary_.keys())[:5])

# Optional: show non-zero elements in the first chunk's vector
print("\n📊 Non-zero elements in vector for chunk 1:")
first_vector_dense = chunk_vectors[0].toarray()[0]
feature_names = vectorizer.get_feature_names_out()  # look up once, not once per feature

for idx, val in enumerate(first_vector_dense):
    if val != 0:
        print(f"{feature_names[idx]}: {val}")



✅ Vectorization Complete!
📐 Shape of vectorized matrix: (1233, 27267)

🔡 First 5 Vocabulary Words:
['primarysouth', 'sudan', 'pupil', 'book', 'all']

📊 Non-zero elements in vector for chunk 1:
00500: 1
18033: 1
1primary: 1
2018: 1
activities: 1
all: 2
always: 1
an: 1
and: 16
any: 2
applied: 1
approach: 1
area: 1
as: 6
at: 2
basics: 1
be: 4
been: 2
before: 1
book: 27
books: 3
box: 1
by: 9
can: 1
care: 1
careful: 1
clean: 1
clear: 1
collaboration: 1
comprehensively: 1
comprises: 1
confiscated: 1
conjunction: 1
course: 1
courses: 1
cover: 2
coverage: 1
covers: 1
cut: 1
damaged: 1
designed: 1
developed: 3
do: 10
don: 1
down: 1
dry: 1
each: 1
education: 7
either: 1
electronic: 2
equiping: 1
exercises: 1
experts: 1
explanation: 1
face: 1
fit: 1
fold: 1
for: 5
force: 1
form: 1
found: 1
full: 2
fun: 1
funded: 3
funzi: 1
general: 7
global: 1
government: 1
graphic: 1
grounding: 1
group: 1
guide: 1
hands: 1
has: 1
have: 2
how: 2
if: 1
illustrations: 1
immediately: 1
imparting: 1
in: 6
industrial
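
The word counts above are what the retrieval step compares. As a small illustration of how Bag of Words vectors and cosine similarity pick a chunk, the sketch below repeats the same two scikit-learn calls on a three-sentence toy corpus. The sentences and the `toy_` names are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the real textbook chunks
toy_chunks = [
    "the earth orbits the sun",
    "plants need water and sunlight",
    "a magnet attracts iron",
]

# Fit the vocabulary on the chunks, then turn each chunk into a count vector
toy_vectorizer = CountVectorizer().fit(toy_chunks)
toy_vectors = toy_vectorizer.transform(toy_chunks)

# Words in the query that are not in the fitted vocabulary are simply ignored
query_vec = toy_vectorizer.transform(["how does the earth move around the sun"])
scores = cosine_similarity(query_vec, toy_vectors)[0]

best = int(scores.argmax())
print(f"Best chunk: {toy_chunks[best]!r} (score {scores[best]:.4f})")
```

Only the first sentence shares any vocabulary with the query, so the other two chunks score zero.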

In [17]:
from sklearn.metrics.pairwise import cosine_similarity

try:
    while True:
        query = input("\n🧠 Ask your science question (or type 'exit', 'quit', or press Ctrl+C to stop): ").strip().lower()

        # Exit conditions
        if query in ['exit', 'quit']:
            print("👋 Exiting... Bye!")
            break

        # Vectorize query
        query_vec = vectorizer.transform([query])

        # Compute cosine similarity
        similarities = cosine_similarity(query_vec, chunk_vectors)

        # Get top 3 matching chunks
        top_n = 3
        top_indices = similarities[0].argsort()[-top_n:][::-1]

        print("\n📈 Top Matching Chunks:\n")
        for idx in top_indices:
            print(f"📄 Chunk #{idx} (Score: {similarities[0][idx]:.4f}):\n")
            print(chunks[idx][:300], "...")  # Limit to first 300 chars
            print("-" * 80)

except KeyboardInterrupt:
    print("\n🛑 Keyboard interrupt received. Exiting safely.")



🧠 Ask your science question (or type 'exit', 'quit', or press Ctrl+C to stop):  what is an earth?



📈 Top Matching Chunks:

📄 Chunk #1171 (Score: 0.3975):

satellites in order to stay in an orbit closer to Earth needs to travel faster as compare to those satellites in the farther orbits. Worked Example 4 Calculate the speed of a satellite which orbits the Earth at an altitude of 1000 kilometres above Earth's surface? Solution Step: 1 Write down known q ...
--------------------------------------------------------------------------------
📄 Chunk #1172 (Score: 0.3601):

6.38100.6 kg Nm 10 6.673v -- - ´ =´+ ´´ ´ ´=kg Hence, the orbital speed of satellite is .1 3ms10 7.36-´Self Assessment Questions: Q10: Write down any four uses of artificial satellites. Q11: What is Geostationary orbit? Q12: Why the two satellites of different masses have same speed in the same orbi ...
--------------------------------------------------------------------------------
📄 Chunk #39 (Score: 0.3559):

Chapter 7 • Lesson 1What is a tool? A tool can be a simple machine, or it can be made up of many simple machi


🧠 Ask your science question (or type 'exit', 'quit', or press Ctrl+C to stop):  exit


👋 Exiting... Bye!
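
In the run above, the third match for "what is an earth?" is a chunk that begins "What is a tool?". One plausible reason is that raw word counts give very common words such as "what" and "is" as much weight as "earth". The sketch below compares the default `CountVectorizer` with `CountVectorizer(stop_words="english")` on a two-sentence toy corpus. The sentences and the `top_match` helper are made up for illustration, and the built-in English stop-word list is only one possible choice.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus: the first chunk shares only common words with the query
toy_chunks = [
    "what is a tool and what is it for",
    "the earth is a planet",
]
query = "what is an earth"


def top_match(vectorizer):
    """Fit the vectorizer on the toy chunks and return the index of the best match."""
    vectors = vectorizer.fit_transform(toy_chunks)
    scores = cosine_similarity(vectorizer.transform([query]), vectors)[0]
    return int(scores.argmax())


with_common_words = top_match(CountVectorizer())
without_common_words = top_match(CountVectorizer(stop_words="english"))
print(with_common_words, without_common_words)
```

With the default settings the "tool" chunk ranks first because of the repeated "what is". Once common words are dropped, the only remaining query term is "earth", and the "earth" chunk ranks first.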


In [29]:
import json

import requests
from sklearn.metrics.pairwise import cosine_similarity

# Reuses `chunks`, `vectorizer`, and `chunk_vectors` from the earlier cells.

# Same query loop as above, now passing the best-matching chunk to a local Llama model
try:
    while True:
        query = input("\n🧠 Ask your science question (or type 'exit', 'quit', or press Ctrl+C to stop): ").strip().lower()

        # Exit conditions
        if query in ['exit', 'quit']:
            print("👋 Exiting... Bye!")
            break

        # Vectorize query
        query_vec = vectorizer.transform([query])

        # Compute cosine similarity
        similarities = cosine_similarity(query_vec, chunk_vectors)

        # Get top 3 matching chunks
        top_n = 3
        top_indices = similarities[0].argsort()[-top_n:][::-1]

        print("\n📈 Top Matching Chunks:\n")
        
        # Prepare relevant chunk and user input for prompt
        relevant_document = chunks[top_indices[0]]  # Use only the most similar chunk as context
        prompt = """
        You are a smart, AI-powered learning companion and tutor for Classes 1 to 10, built on Retrieval-Augmented Generation (RAG) technology.
        Your goal is to make science easy to explore, ask about, and understand across all grade levels.
        This is the recommended tutoring material and guide: {relevant_document}
        The user input is: {user_input}
        Compile the best answer based on the user input. If you can work out which class the topic belongs to, tell the user.
        Explain from easy to medium to advanced, covering all aspects with examples, so that the user can prepare for their
        exams, make notes, or get any other science-related help. Use concise words suited to the class or grade level the user asked about,
        making science fun and simple to understand.
        Note: Do not answer questions that are not about science. Instead, reply: "Please ask only science-related questions. Ask again."

        """.format(user_input=query, relevant_document=relevant_document)

        # Send the prompt to the llama3.2 model served at localhost:11434
        url = 'http://localhost:11434/api/generate'
        data = {
            "model": "llama3.2",
            "prompt": prompt
        }

        # json= serializes the payload and sets the Content-Type header
        response = requests.post(url, json=data, stream=True)

        full_response = []

        try:
            response.raise_for_status()  # fail early on HTTP errors
            # The reply is streamed as one JSON object per line; collect the text fragments
            for line in response.iter_lines():
                if line:
                    decoded_line = json.loads(line.decode('utf-8'))
                    full_response.append(decoded_line.get('response', ''))
        finally:
            response.close()

        # Print the generated recommendation
        print("🤖 Recommendation:")
        print(''.join(full_response))

except KeyboardInterrupt:
    print("\n🛑 Keyboard interrupt received. Exiting safely.")



🧠 Ask your science question (or type 'exit', 'quit', or press Ctrl+C to stop):  How does the Earth move around the sun?



📈 Top Matching Chunks:

🤖 Recommendation:
You've asked about how the Earth moves around the Sun. That's a fantastic question! Let's explore it together.

**Class:** This is for Class 3-5 students (Grade 3-5).

**How does the Earth move around the Sun?**

Imagine you're holding a ball on a string and swinging it around your head. The ball moves in a circle, right? That's kind of like how the Earth moves around the Sun!

The Earth orbits the Sun because of its own gravity and motion. It takes about 365.25 days to complete one full rotation around the Sun, which is why we have a year with 12 months.

Here's an example to help you understand:

* Imagine the Sun as the center of a big circle.
* The Earth moves in a circular path around the Sun, kind of like a merry-go-round.
* As the Earth moves, it also rotates on its axis, which means it spins from side to side. That's why we have day and night!

So, to summarize: the Earth moves around the Sun because of its own gravity and motion, comp


🧠 Ask your science question (or type 'exit', 'quit', or press Ctrl+C to stop):  what is national anthem?



📈 Top Matching Chunks:

🤖 Recommendation:
I'd be happy to help you with your science-related questions.

However, I noticed that your first question "what is national anthem?" doesn't belong to the science category. It's a general knowledge question about music or culture.

Please ask another science-related question, and I'll do my best to assist you.

If you're ready to move on, I can recall our previous conversation about atm, mmHg, and other units of measurement.

Which one would you like to explore further?



🧠 Ask your science question (or type 'exit', 'quit', or press Ctrl+C to stop):  exit


👋 Exiting... Bye!
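
The cell above ranks three chunks but passes only the first one to the model, and it assembles the streamed reply inline. The sketch below separates these two steps into small functions that can be tried without a running model server. `build_context` and `collect_stream` are hypothetical helpers, `max_chars` is an arbitrary limit, and the `done` and `error` fields are assumptions about the shape of the streamed JSON lines. The sample lines are made up.

```python
import json


def build_context(chunks, top_indices, max_chars=1500):
    """Concatenate the top-ranked chunks, each trimmed to max_chars characters."""
    return "\n\n".join(chunks[i][:max_chars] for i in top_indices)


def collect_stream(lines):
    """Join the 'response' fragments from an iterable of JSON lines (bytes)."""
    parts = []
    for line in lines:
        if not line:
            continue  # skip blank keep-alive lines
        payload = json.loads(line.decode("utf-8"))
        if "error" in payload:  # assumed error shape
            raise RuntimeError(payload["error"])
        parts.append(payload.get("response", ""))
        if payload.get("done"):  # assumed end-of-stream marker
            break
    return "".join(parts)


# Made-up stand-ins for `chunks`, `top_indices`, and `response.iter_lines()`
context = build_context(["chunk A", "chunk B", "chunk C"], [2, 0])
fake_lines = [
    b'{"response": "The Earth ", "done": false}',
    b"",
    b'{"response": "orbits the Sun.", "done": true}',
]
answer = collect_stream(fake_lines)
print(context)
print(answer)
```

In the notebook, `build_context(chunks, top_indices)` could replace `relevant_document`, and `collect_stream(response.iter_lines())` could replace the inner loop that fills `full_response`.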
