# Web Scraping

In [1]:
!pipenv install langchain_community

Loading .env environment variables...
Installing langchain_community...
✔ Installation Succeeded
Installing dependencies from Pipfile.lock (f6673e)...
All dependencies are now up-to-date!
Upgrading langchain_community in dependencies.
Building requirements...
Resolving dependencies....
✔ Success! Locking packages...
Installing dependencies from Pipfile.lock (36fc99)...
All dependencies are now up-to-date!


In [8]:
!pip install selenium webdriver-manager

Collecting selenium
  Downloading selenium-4.30.0-py3-none-any.whl.metadata (7.5 kB)
Collecting webdriver-manager
  Downloading webdriver_manager-4.0.2-py2.py3-none-any.whl.metadata (12 kB)
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.29.0-py3-none-any.whl.metadata (8.5 kB)
Collecting trio-websocket~=0.9 (from selenium)
  Downloading trio_websocket-0.12.2-py3-none-any.whl.metadata (5.1 kB)
Collecting sortedcontainers (from trio~=0.17->selenium)
  Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB)
Collecting outcome (from trio~=0.17->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.9->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl.metadata (5.6 kB)
Collecting pysocks!=1.5.7,<2.0,>=1.5.6 (from urllib3[socks]<3,>=1.26->selenium)
  Downloading PySocks-1.7.1-py3-none-any.whl.metadata (13 kB)
Downloading selenium-4.30.0-py3-none-any.whl (9.4 MB)

## Extract the job role and skills in JSON format

In [12]:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import time

def extract_job_info(url):
    """Scrape the job title and description from a job posting page."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run Chrome without a visible window
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

    try:
        driver.get(url)
        time.sleep(3)  # give client-side rendering a moment before waiting on elements

        # The posting title is expected in the first <h1> on the page.
        job_title = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "h1"))
        ).text

        # Try each selector in turn; the first one that yields text wins.
        job_desc = None
        possible_selectors = [
            ".job-description",
            "div[data-automation-id='jobDescription']",
            "div[class*='description']"
        ]

        for selector in possible_selectors:
            try:
                job_desc = WebDriverWait(driver, 10).until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, selector))
                ).text
                if job_desc:
                    break
            except TimeoutException:
                # This selector did not appear in time; try the next one.
                continue

        if not job_desc:
            job_desc = "Job description not found"

        job_info = {"title": job_title, "description": job_desc}

    except Exception as e:
        job_info = {"error": f"Failed to extract job details: {str(e)}"}
    finally:
        driver.quit()  # always close the browser, even if extraction failed

    return job_info


url = "https://jpmc.fa.oraclecloud.com/hcmUI/CandidateExperience/en/sites/CX_1001/job/210606405?keyword=Python+Developer+-+Analyst&mode=location"
job_data = extract_job_info(url)
print(job_data)


{'title': 'Python Developer - Analyst', 'description': 'You are a strategic thinker passionate about driving solutions using “Data”. You have found the right team.\nAs a Data Engineer in our STO team, you will be a strategic thinker passionate about promoting solutions using data. You will mine, interpret, and clean our data, asking questions, connecting the dots, and uncovering hidden opportunities for realizing the data’s full potential. As part of a team of specialists, you will “slice and dice” data using various methods and create new visions for the future. Our STO team is focused on collaborating and partnering with business to deliver efficiency and enhance controls via technology adoption and infrastructure support for Global Finance & Business Management India.\nJob Responsibilities\nWrite efficient Python and SQL code to extract, transform, and load (ETL) data from various sources into Databricks.\nPerform data analysis and computation to derive actionable insights from the 

In [None]:
import os
from langchain_groq import ChatGroq

llm = ChatGroq(
    api_key=os.environ.get("GROQ_API_KEY"),  # set GROQ_API_KEY in your environment rather than hardcoding the key
    model="llama-3.1-8b-instant",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

In [16]:
from langchain_core.prompts import PromptTemplate

prompt_extract = PromptTemplate.from_template(
    """
    I will give you scraped text from the job posting. 
    Your job is to extract the job details & requirements in a JSON format containing the following keys: 'role', 'experience', 'skills', and 'description'. 
    Only return valid JSON. No preamble, please.
    Here is the scraped text: {job_data}
    """
)

chain_extract = prompt_extract | llm 
response = chain_extract.invoke(input={'job_data':job_data})
print(type(response.content))
print(response.content)

<class 'str'>
{
  "role": "Python Developer - Analyst",
  "experience": "Minimum 3 years of experience in data engineering",
  "skills": [
    "Python",
    "SQL",
    "Databricks",
    "Data analysis and computation",
    "Data visualization",
    "Tableau",
    "Machine learning",
    "Data science",
    "Cloud platforms (AWS, Azure, GCP)",
    "LLM (Large Language Model)",
    "Data quality, integrity, and security"
  ],
  "description": "As a Data Engineer in our STO team, you will mine, interpret, and clean our data, asking questions, connecting the dots, and uncovering hidden opportunities for realizing the data’s full potential. You will write efficient Python and SQL code to extract, transform, and load (ETL) data from various sources into Databricks, perform data analysis and computation to derive actionable insights from the data, and collaborate with data scientists, analysts, and other stakeholders to understand data requirements and deliver solutions."
}


In [24]:
from langchain_core.output_parsers import JsonOutputParser

json_parser = JsonOutputParser()
job_json = json_parser.parse(response.content)
print(type(job_json))
print(job_json)

<class 'dict'>
{'role': 'Python Developer - Analyst', 'experience': 'Minimum 3 years of experience in data engineering', 'skills': ['Python', 'SQL', 'Databricks', 'Data analysis and computation', 'Data visualization', 'Tableau', 'Machine learning', 'Data science', 'Cloud platforms (AWS, Azure, GCP)', 'LLM (Large Language Model)', 'Data quality, integrity, and security'], 'description': 'As a Data Engineer in our STO team, you will mine, interpret, and clean our data, asking questions, connecting the dots, and uncovering hidden opportunities for realizing the data’s full potential. You will write efficient Python and SQL code to extract, transform, and load (ETL) data from various sources into Databricks, perform data analysis and computation to derive actionable insights from the data, and collaborate with data scientists, analysts, and other stakeholders to understand data requirements and deliver solutions.'}
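
The parsed dictionary is used directly in later cells, so it can help to confirm that the model returned every key the prompt asked for. The following is a minimal stdlib-only sketch, not part of the original notebook: `parse_job_json` and `REQUIRED_KEYS` are hypothetical names, and the sample reply is made up for illustration.

```python
import json

# Keys requested in the extraction prompt above.
REQUIRED_KEYS = {"role", "experience", "skills", "description"}

def parse_job_json(text: str) -> dict:
    """Parse a model reply into a dict and check the requested keys (hypothetical helper)."""
    # Keep only the span from the first "{" to the last "}", which drops
    # any preamble or Markdown fence wrapped around the JSON object.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    parsed = json.loads(text[start:end + 1])
    missing = REQUIRED_KEYS - set(parsed)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return parsed

# Illustrative reply with a preamble the prompt asked the model to omit.
sample = (
    "Here is the JSON:\n"
    '{"role": "Python Developer - Analyst", "experience": "3 years", '
    '"skills": ["Python", "SQL"], "description": "ETL work"}'
)
print(parse_job_json(sample)["skills"])
```

Failing early with a `ValueError` here is a design choice: a missing key would otherwise surface later as a `KeyError` inside the matching or email cells.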


## JSON format to ChromaDB

In [19]:
import chromadb
import uuid 


client = chromadb.PersistentClient(path="./vectorstore")


collection = client.get_or_create_collection(name="job_portfolio")


job_details = {
    "skills": [
        "Proficiency in Python programming",
        "Experience with web frameworks such as Django or Flask",
        "Understanding of multi-process architecture",
        "Knowledge of front-end technologies like JavaScript, HTML5, and CSS3",
        "Familiarity with event-driven programming",
        "Strong problem-solving skills",
        "Excellent communication and teamwork abilities"
    ],
    "experience": [
        "Bachelor's degree in Computer Science, Engineering, or related field",
        "2+ years of experience in Python development",
        "Experience with version control systems like Git",
        "Background in financial services or related industry is a plus"
    ]
}


job_details_str = str(job_details)

# Store job details in ChromaDB
collection.add(
    documents=[job_details_str],  
    ids=[str(uuid.uuid4())],  
    metadatas=[{"source": "JPMorganChase"}]  
)

print("Job details stored successfully in ChromaDB!")


Job details stored successfully in ChromaDB!
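
The cell above stores `str(job_details)` under a random `uuid4` id, so each re-run appends another copy and the stored text is a Python repr rather than JSON. As a hedged alternative, the sketch below builds the same three arguments with `json.dumps` and a deterministic `uuid5` id. `build_chroma_record` is a hypothetical helper, and the namespace choice is an assumption.

```python
import json
import uuid

def build_chroma_record(details: dict, source: str) -> dict:
    """Build documents/ids/metadatas for one portfolio entry (hypothetical helper)."""
    # JSON text can be read back with json.loads, unlike the output of str(dict).
    document = json.dumps(details, sort_keys=True, ensure_ascii=False)
    # uuid5 is deterministic: the same source and document always give the same id.
    doc_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}:{document}"))
    return {
        "documents": [document],
        "ids": [doc_id],
        "metadatas": [{"source": source}],
    }

record = build_chroma_record(
    {"skills": ["Proficiency in Python programming"], "experience": ["2+ years"]},
    "JPMorganChase",
)
print(record["ids"])
```

The result can be passed as `collection.upsert(**record)`, so running the cell again overwrites the existing entry instead of adding a duplicate.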


In [20]:
job_json['skills']

['Python',
 'SQL',
 'Databricks',
 'Data analysis and computation',
 'Data visualization',
 'Tableau',
 'Machine learning',
 'Data science',
 'Cloud platforms (AWS, Azure, GCP)',
 'LLM (Large Language Model)',
 'Data quality, integrity, and security']

## Match job skills to the closest portfolio in ChromaDB

In [25]:
def match_job_to_portfolio(job_json):

    job_skills_str = str(job_json["skills"])  

    results = collection.query(
        query_texts=[job_skills_str],
        n_results=1  
    )

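    # query() returns one inner list per query text, so check the inner list too.
    if results["documents"] and results["documents"][0]:
        return {"match": results["documents"][0], "metadata": results["metadatas"][0]}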
    else:
        return {"error": "No matching portfolio found"}


new_job = {
    "role": job_json["role"],
    "skills": job_json["skills"],
    "experience": job_json["experience"],
    "description": job_json["description"]
}


match_result = match_job_to_portfolio(new_job)
print(match_result)


{'match': ['{\'skills\': [\'Proficiency in Python programming\', \'Experience with web frameworks such as Django or Flask\', \'Understanding of multi-process architecture\', \'Knowledge of front-end technologies like JavaScript, HTML5, and CSS3\', \'Familiarity with event-driven programming\', \'Strong problem-solving skills\', \'Excellent communication and teamwork abilities\'], \'experience\': ["Bachelor\'s degree in Computer Science, Engineering, or related field", \'2+ years of experience in Python development\', \'Experience with version control systems like Git\', \'Background in financial services or related industry is a plus\']}'], 'metadata': [{'source': 'JPMorganChase'}]}
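
As the output shows, the query result nests one list per query text inside each field. A small sketch of flattening that shape into one record per match follows. `flatten_query_results` is a hypothetical helper, and the sample dict only imitates the structure ChromaDB returns, with made-up values.

```python
def flatten_query_results(results: dict) -> list:
    """Turn the nested result of a single-text query into a flat list of matches."""
    documents = results.get("documents") or []
    if not documents or not documents[0]:
        return []  # no query was run or nothing matched
    docs = documents[0]  # matches for the first (only) query text
    metadatas = results.get("metadatas") or []
    distances = results.get("distances") or []
    # Fall back to None when a field was not included in the result.
    metas = metadatas[0] if metadatas else [None] * len(docs)
    dists = distances[0] if distances else [None] * len(docs)
    return [
        {"document": doc, "metadata": meta, "distance": dist}
        for doc, meta, dist in zip(docs, metas, dists)
    ]

# Made-up result in the same nested shape as the output above.
sample_results = {
    "ids": [["id-1"]],
    "documents": [["{'skills': ['Proficiency in Python programming']}"]],
    "metadatas": [[{"source": "JPMorganChase"}]],
    "distances": [[0.5]],
}
print(flatten_query_results(sample_results))
```

Keeping the distance next to each document makes it possible to discard weak matches with a threshold before the email is generated.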


## Generate a cold email using extracted job details and the matched portfolio

In [30]:
from langchain_core.prompts import PromptTemplate

def generate_cold_email(job_json, match_result):

    job_role = job_json["role"]
    skills = ", ".join(job_json["skills"])
    experience = job_json["experience"]
    description = job_json["description"]
    
    if "match" in match_result and "metadata" in match_result:
        portfolio_source = match_result["metadata"][0]["source"]
    else:
        portfolio_source = "N/A"

    prompt_email = PromptTemplate.from_template(
        """
        **Objective:**  
        Craft a highly professional and persuasive cold email targeted at hiring managers or recruiters.  
        The email should be personalized, highlighting how our company's experience, skills, and past projects align perfectly with the job role.  
    
        **Tone & Style:**  
        - Professional, engaging, and confident  
        - Concise and to the point (max 200 words)  
        - Friendly but not overly casual  
    
        **Email Structure:**  
    

           - A compelling and personalized subject line that grabs attention.  
           - Example: "Expert Python Developer for {job_role} – Let’s Connect!"  
     
           - Address the hiring manager or recruiter.  
           - Mention the specific job role and company name.  
           - Create a strong opening that sparks interest.  
    
           - Briefly mention relevant skills and experience based on the job posting.  
           - Show how our past projects align with their needs.  
           - Mention key technical skills that match the role.  
      
           - Highlight a relevant project from our portfolio.  
           - Explain how our experience can benefit their team.  
    
           - Politely ask for a meeting or call.  
           - Provide a direct way to contact us.  
           - Example: "Would love to discuss how we can add value to your team. Let’s schedule a quick call this week!"  
    
        **Job Details:**  
        - Role: {job_role}  
        - Skills: {skills}  
        - Experience: {experience}  
        - Description: {description}  
        - Matching Portfolio Source: {portfolio_source} 

        In the salutation, don't mention the hiring manager's name; just write "Dear Hiring Manager".
        You are Ujjwal Kumar Singh [name], your email is ujjwalks2709@gmail.com, your phone number is +91 7543932088.
    
        ✨ Ensure the email feels natural, avoids generic phrasing, and maximizes impact!  
        """
    )


    chain_email = prompt_email | llm 
    response_email = chain_email.invoke(input={
        "job_role": job_role,
        "skills": skills,
        "experience": experience,
        "description": description,
        "portfolio_source": portfolio_source
    })

    return response_email.content


cold_email = generate_cold_email(new_job, match_result)
print("\n📩 Generated Cold Email:\n")
print(cold_email)



📩 Generated Cold Email:

Here's a highly professional and persuasive cold email targeted at hiring managers or recruiters:

Subject: Expert Python Developer for Python Developer - Analyst – Let’s Connect!

Dear Hiring Manager,

I came across the Python Developer - Analyst role at [Company Name] and was impressed by the opportunity to join a team that values data-driven insights. As a seasoned data engineer with a passion for unlocking the full potential of data, I believe my skills and experience make me an ideal fit for this position.

With over 3 years of experience in data engineering, I possess a strong foundation in Python, SQL, and Databricks. My expertise in data analysis and computation, data visualization using Tableau, and machine learning has enabled me to deliver actionable insights to stakeholders in my previous roles. I'm also well-versed in cloud platforms (AWS, Azure, GCP) and have experience working with Large Language Models (LLM).

One of my notable projects, which 