<a href="https://colab.research.google.com/github/aishwarya-kumar/skillrec_for_gigworkers/blob/main/RAG6_llama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [30]:
# !pip install pandas numpy sentence_transformers
# !pip install ollama transformers langchain
# !pip install chromadb pdfplumber

In [5]:
import pandas as pd
import numpy as np
import pdfplumber
import os
import re
import ollama
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
import chromadb
# from transformers import LlamaForCausalLM, LlamaTokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import pipeline

In [6]:
def load_documents(path):
    documents = []
    print(f"Starting to load PDFs from directory: {path}")

    # Loop through all files in the directory and process PDFs
    for file in os.listdir(path):
        if file.endswith(".pdf"):
            pdf_path = os.path.join(path, file)
            print(f"Processing file: {pdf_path}")

            with pdfplumber.open(pdf_path) as pdf:
                text = ""

                # Extract text from each page of the PDF
                for i, page in enumerate(pdf.pages):
                    page_text = page.extract_text()
                    if page_text:
                        text += page_text
                    else:
                        print(f"Warning: No text found on page {i+1} of {file}")

                # If there's any text extracted, add it to the documents list
                if text.strip():
                    documents.append({"page_content": text, "metadata": {"source": file}})
                    print(f"Extracted text from {file}")
                else:
                    print(f"Warning: No text extracted from {file}. Skipping.")

    # Check if any documents were loaded
    if not documents:
        raise ValueError("No content extracted from any PDF in the directory.")

    print(f"Loaded {len(documents)} documents successfully from {path}.")
    return documents

In [7]:
path =  "/content/"
documents = load_documents(path)

Starting to load PDFs from directory: /content/
Processing file: /content/The-Job-Skills-of-2024-Report.pdf
Extracted text from The-Job-Skills-of-2024-Report.pdf
Loaded 1 documents successfully from /content/.


In [8]:
def clean_text(text):
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\n+', '\n', text)
    return text.strip()

In [9]:
def chunk_text(text, max_token_limit=4096, chunk_overlap=200):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=max_token_limit,
        chunk_overlap=chunk_overlap)

    # Split text into chunks
    chunks = text_splitter.split_text(text)
    return chunks

In [10]:
def preprocess_documents(docs, max_token_limit, chunk_overlap):
    all_chunks = []

    for doc in docs:
        # Clean the text content
        cleaned_content = clean_text(doc['page_content'])

        # Chunk the cleaned content
        chunks = chunk_text(cleaned_content, max_token_limit, chunk_overlap)

        # Add each chunk with metadata
        for chunk in chunks:
            all_chunks.append({"page_content": chunk, "metadata": doc['metadata']})

    print(f"Split {len(docs)} documents into {len(all_chunks)} chunks.")
    return all_chunks

In [31]:
chunks= preprocess_documents(documents,max_token_limit=4096, chunk_overlap=200)
# chunks[0]

Split 1 documents into 14 chunks.


In [12]:
embedding_model = SentenceTransformer('all-MiniLM-L12-v2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/352 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [13]:
def get_embeddings(chunks):
    print("Starting to generate embeddings for the chunks...")

    # Check if there are any chunks
    if not chunks:
        print("No chunks found! Exiting.")
        return []

    # Extracting page content from chunks and generating embeddings
    page_contents = [chunk['page_content'] for chunk in chunks]
    print(f"Extracted {len(page_contents)} page contents from the chunks.")

    # Generate embeddings using the model
    try:
        print("Generating embeddings:")
        embeddings = embedding_model.encode(page_contents)
        print(f"Generated embeddings for {len(page_contents)} chunks.")
    except Exception as e:
        print(f"Error generating embeddings: {e}")
        return []

    return embeddings

In [14]:
document_embeddings = get_embeddings(chunks)

Starting to generate embeddings for the chunks...
Extracted 14 page contents from the chunks.
Generating embeddings:
Generated embeddings for 14 chunks.


In [15]:
client = chromadb.Client()

def build_chromadb_index(documents, embeddings):
    collection_name = "tech_jobs"

    # Get a list of all collections
    collections = client.list_collections()  # Get a list of all collections
    collection_names = [collection.name for collection in collections]

    # Check if the collection exists and delete it if it does
    if collection_name in collection_names:
      client.delete_collection(name=collection_name)
      print(f"Deleted existing collection '{collection_name}'.")
    # else:
    #   print(f"No existing collection named '{collection_name}', proceeding to create a new one.")

    # Create a new collection
    collection = client.create_collection(name=collection_name)
    print(f"Created a new collection '{collection_name}'.")

    # Ensure there is content before adding to ChromaDB
    documents_text = [doc['page_content'] for doc in documents]
    if not documents_text:
        raise ValueError("No valid text content found in documents.")

    # Add the documents and their embeddings to the collection
    collection.add(
        documents=documents_text,
        embeddings=embeddings,
        # metadatas=[{"source": "pdf"}] * len(documents),
        ids=[str(i) for i in range(len(documents))]
    )
    print(f"Added {len(documents)} documents to ChromaDB collection.")

    return collection

In [16]:
len(chunks)

14

In [17]:
collection = build_chromadb_index(chunks, document_embeddings)

Created a new collection 'tech_jobs'.
Added 14 documents to ChromaDB collection.


In [18]:
from datetime import datetime
# Get the current year and month
current_year = datetime.now().year
current_month = datetime.now().strftime("%B")

query_prompt = f"""
Identify the top 3 job roles or careers for gig workers or freelancers in {current_month} {current_year}.
For each role, provide a list of the top 5 most in-demand skills required to succeed. Keep your answer succint and to the point.
"""

In [19]:
def retrieve_relevant_chunks(query, collection):
    query_embedding = embedding_model.encode([query]).tolist()
    query_result = collection.query(query_embeddings=query_embedding, n_results=3)
    return query_result['documents']

In [20]:
relevant_chunks = retrieve_relevant_chunks(query_prompt, collection)
# relevant_chunks[0]

In [21]:
# def generate_response(query, retrieved_chunks, max_tokens=4096 ):

#     tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
#     model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

#     # Prepare the prompt by combining the query with the retrieved chunks
#     prompt = f"Query: {query}\n\nRelevant Information:\n"
#     for idx, chunk in enumerate(retrieved_chunks):
#         prompt += f"Chunk {idx+1}: {chunk}\n"

#     # Encode the prompt text
#     inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=4096 )

#     # Generate a response based on the prompt
#     output = model.generate(**inputs, max_length=max_tokens)

#     # Decode and return the response
#     response = tokenizer.decode(output[0], skip_special_tokens=True)
#     return response

In [40]:
def generate_response(query, retrieved_chunks, max_new_tokens=512):
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

    # Prepare the prompt by combining the query with the retrieved chunks
    prompt = (
    f"Query: {query}\n\n"
    f"Relevant Information:\n"
    f"{retrieved_chunks}\n\n"
    f"Answer the query based on the relevant information provided above:\nAnswer:")

    # Encode the prompt text
    inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=4096)
    # print(tokenizer.decode(inputs["input_ids"][0]))

    # Generate a response based on the prompt
    output = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    pad_token_id=tokenizer.eos_token_id,
    temperature=0.7,
    top_k=50,
    top_p=0.9 )

    # Decode and return the response
    response = tokenizer.decode(output[0], skip_special_tokens=True)

    print(tokenizer.decode(output[0], skip_special_tokens=False))
    if "Answer:" in response:
      answer = response.split("Answer:")[-1].strip()
    else:
      answer = response.strip()

    return answer


In [23]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
The token `tk1` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `tk1`


In [41]:
response= generate_response(query_prompt, relevant_chunks)

<|begin_of_text|>Query: 
Identify the top 3 job roles or careers for gig workers or freelancers in November 2024.
For each role, provide a list of the top 5 most in-demand skills required to succeed. Keep your answer succint and to the point.


Relevant Information:
[['The Job Skills of 2024 The Fastest-Growing Job Skills for Businesses, Governments, and Higher Education Institutions The Job Skills of 2024 1Introduction | Business Skills | Data Science Skills | Tech Skills | Conclusion | Appendix Table of Contents Introduction 3 The Fastest-Growing Job 9 Conclusion 26 Skills for 2024 Foreword: The State of Job Skills 4 in 2024 Business Skill Trends for 2024 10 Appendix 28 Fastest-Growing Leadership Skills 14 Using Data to Identify Critical Skills 6 Regional Data: The Fastest-Growing 29 Data Science Skill Trends for 2024 16 Job Skills for 2024 Executive Summary 7 Fastest-Growing AI Skills 19 Vertical Data: The Fastest-Growing 35 Tech Skill Trends for 2024 21 Job Skills for 2024 Fastest-

In [42]:
print(response)

Top 3 job roles or careers for gig workers or freelancers in November 2024, along with the top 5 most in-demand skills required to succeed:

1. **Data Visualization**: The top 5 in-demand skills for data visualization include:
	* Power BI
	* Tableau Software
	* Python
	* R
	* SQL
2. **Business Intelligence**: The top 5 in-demand skills for business intelligence include:
	* Power BI
	* Business Intelligence
	* SQL
	* Python
	* R
3. **Software Development**: The top 5 in-demand skills for software development include:
	* React (Web Framework)
	* Python
	* Java
	* JavaScript
	* C++
