# CV Parser Testing

This notebook tests the enhanced CV parser with LLM integration.

In [14]:
import sys
import os
import json
from pprint import pprint

# Add project root to path
sys.path.append('..')

# Import required modules
from resume.resume_parser import extract_text
from utils.cv_parser import CVParser, CVProfile

## Load Test CV

In [15]:
# Path to test CV
cv_path = '../resume/Anjani Sharma.pdf'

# Extract text from the CV
cv_text = extract_text(cv_path)

# Print the first 500 characters of the CV text
print(cv_text[:500] + '...')

ANJANI  SHARMA  
Senior Data Science & AI Leader | Generative AI, Machine  Learning,  Python,  People leader 
Ph# 07436353049 , email: anjani.sharma1@gmail.com , LinkedIn:  LinkedIn    Project Repository: Github  
 
Professional Summary  
Accomplished data science leader with over 9 years of experience delivering production -grade machine learning solutions and leading high -
performing teams. Proven track record in developing and deploying cutting -edge AI applications, including GenAI, recomm ...


## Test Rule-Based Extraction

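First, let's test the regex-based extraction on its own to see what information we can get without the LLM. Note that `_extract_personal_info()` still appears to attempt an LLM call, as the error line in the output shows.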

In [16]:
# Create a parser instance
parser = CVParser(cv_text)

# Extract basic info using regex patterns
parser._extract_personal_info()

# Print the extracted personal information
for key in ['name', 'position', 'email', 'phone', 'linkedin', 'website', 'location']:
    print(f"{key}: {parser.result.get(key, 'Not found')}")

Error extracting personal info with LLM: 'github'
name: ANJANI SHARMA
position: Senior Data Science & AI Leader
email: anjani.sharma1@gmail.com
phone: 07436353049
linkedin: LinkedIn
website: 
location: Brighton, UK
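
As a rough illustration of what regex-based extraction can recover from the header line of a CV, here is a minimal sketch. The patterns and the `extract_contact_info` helper are assumptions made for this example and are not taken from `CVParser`, whose actual rules may differ.

```python
import re

# Hypothetical patterns for illustration; not taken from CVParser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,14}\d")

def extract_contact_info(text: str) -> dict:
    """Return the first email and phone number found in text, or '' if absent."""
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "email": email.group(0) if email else "",
        # Strip spaces and hyphens so the number has one canonical form
        "phone": re.sub(r"[\s-]", "", phone.group(0)) if phone else "",
    }

sample = "Ph# 07436353049 , email: anjani.sharma1@gmail.com , LinkedIn:  LinkedIn"
contact = extract_contact_info(sample)
print(contact)
```

Patterns like these only see literal text, which is consistent with `linkedin` coming back as the bare word `LinkedIn`: the hyperlink target is not present in the extracted text.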


In [17]:
# Identify sections in the CV
sections = parser._identify_sections()

# Print the identified sections
print("Identified Sections:")
for section_name in sections:
    print(f"- {section_name}")

Identified Sections:
- summary
- skills
- experience
- projects
- education
- certifications
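
One plausible way to identify sections like these is to treat any line that consists solely of a known heading as a section boundary. The sketch below assumes that approach. The `SECTION_HEADINGS` table, the `identify_sections` function, and the sample text are all made up for illustration, and `_identify_sections()` may work differently.

```python
import re

# Hypothetical heading keywords; the real parser's list may differ.
SECTION_HEADINGS = {
    "summary": r"(professional\s+)?summary|profile",
    "skills": r"(technical\s+|key\s+)?skills",
    "experience": r"(work\s+|professional\s+)?experience",
    "education": r"education",
}

def identify_sections(text: str) -> dict:
    """Split text into sections keyed by the heading that introduces them."""
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        # A line counts as a heading only if the whole line matches a pattern
        matched = next(
            (name for name, pattern in SECTION_HEADINGS.items()
             if re.fullmatch(pattern, stripped, flags=re.IGNORECASE)),
            None,
        )
        if matched:
            current = matched
            sections[current] = []
        elif current and stripped:
            sections[current].append(stripped)
    return {name: "\n".join(lines) for name, lines in sections.items()}

# Made-up sample text, not the test CV
sample_cv = """Jane Doe
Professional Summary
Data scientist with ML experience.
Skills
Python, SQL
Education
BSc Mathematics
"""
found = identify_sections(sample_cv)
print(list(found))
```

Requiring a full-line match keeps body text that merely mentions "experience" or "skills" from being mistaken for a heading.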


## Test LLM-Enhanced Parsing

Now, let's test the full parser with LLM integration.

In [6]:
# Create a profile instance (uses the full parser)
profile = CVProfile(cv_text)

# Get the parsed data
parsed_data = profile.parsed_data

# Print basic personal information
print("\n--- BASIC INFO ---")
for key in ['name', 'position', 'email', 'phone', 'linkedin', 'website', 'location']:
    print(f"{key}: {parsed_data.get(key, 'Not found')}")

Error extracting personal info with LLM: 'github'

--- BASIC INFO ---
name: ANJANI SHARMA
position: Senior Data Science & AI Leader
email: anjani.sharma1@gmail.com
phone: 07436353049
linkedin: LinkedIn profile URL if found
website: 
location: Brighton, UK
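
In the output above, the `linkedin` field holds the text `LinkedIn profile URL if found`, which looks like a prompt placeholder rather than a real URL. A small validation step could catch values like this before they reach the profile. The sketch below is one possible check. `is_linkedin_url` is a hypothetical helper and not part of `CVParser`.

```python
from urllib.parse import urlparse

def is_linkedin_url(value: str) -> bool:
    """Return True if value parses as a URL on linkedin.com or a subdomain."""
    candidate = value.strip()
    # Allow scheme-less input such as "linkedin.com/in/example"
    if "://" not in candidate:
        candidate = "https://" + candidate
    host = urlparse(candidate).netloc.lower()
    return host == "linkedin.com" or host.endswith(".linkedin.com")

for value in ["LinkedIn profile URL if found", "LinkedIn", "https://www.linkedin.com/in/example"]:
    print(f"{value!r}: {is_linkedin_url(value)}")
```

A field that fails the check could be reset to an empty string, matching how `website` is reported when nothing is found.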


In [7]:
# Print skills (first 10)
print("\n--- SKILLS ---")
skills = parsed_data.get('skills', [])
for skill in skills[:10]:
    print(f"- {skill.get('name', 'Unknown')} ({skill.get('category', 'Unknown')})")
if len(skills) > 10:
    print(f"... and {len(skills) - 10} more skills")


--- SKILLS ---
- Python (Programming Languages)
- SQL (Programming Languages)
- Java (Programming Languages)
- R (Programming Languages)
- Generative AI (Data Science & Analytics)
- Machine Learning & AI (Data Science & Analytics)
- Big Data Tools (Data Science & Analytics)
- Data Visualization Tools (Data Science & Analytics)
- Cloud Platforms (DevOps & Cloud)
- Version Control (Tools & Software)
... and 1 more skills


In [8]:
# Print work experience
print("\n--- WORK EXPERIENCE ---")
experiences = parsed_data.get('work_experience', [])
for exp in experiences:
    print(f"- {exp.get('title', 'Unknown')} at {exp.get('company', 'Unknown')}")
    print(f"  {exp.get('start_date', '')} to {exp.get('end_date', '')}")
    
    # Print a few achievements if available
    achievements = exp.get('achievements', [])
    if achievements:
        print("  Achievements:")
        for achievement in achievements[:2]:
            print(f"  - {achievement}")
        if len(achievements) > 2:
            print(f"  ... and {len(achievements) - 2} more achievements")
    print()


--- WORK EXPERIENCE ---
- Founder & Head of Data Science and AI at AI Transformers Ltd
  Apr '2023 to Present
  Achievements:
  - Improved customer retention by 15% through ML-driven segmentation
  - Increased marketing campaign conversion by 20% using advanced ML models
  ... and 1 more achievements

- Sr. Data Scientist at Admiral Group Plc
  Aug '2023 to Mar '2024

- Lead Data Scientist at Mindmap Consulting Digital
  June '2020 to Jul '2023

- Analytics Lead at Mindmap Consulting Digital
  June '2016 to May '2020

- Freelance Data Scientist at Mindmap Consulting Digital
  Nov '2014 to June '2016



In [9]:
# Print education
print("\n--- EDUCATION ---")
education = parsed_data.get('education', [])
for edu in education:
    print(f"- {edu.get('degree', 'Unknown')} in {edu.get('field', 'Unknown')}")
    print(f"  {edu.get('institution', 'Unknown')}")
    print(f"  {edu.get('start_date', '')} to {edu.get('end_date', '')}")
    print()


--- EDUCATION ---
- BA - Honours in Economics and Mathematics
  St. Joseph’s College, North Bengal University
  January 2000 to January 2003

- Data Science Specialization in 
  
  January 2023 to January 2025



In [10]:
# Print projects (if available)
print("\n--- PROJECTS ---")
projects = parsed_data.get('projects', [])
for project in projects[:3]:
    print(f"- {project.get('name', 'Unknown')}")
    print(f"  {project.get('description', '')}")
    
    # Print technologies if available
    technologies = project.get('technologies', [])
    if technologies:
        print(f"  Technologies: {', '.join(technologies[:5])}")
        if len(technologies) > 5:
            print(f"  ... and {len(technologies) - 5} more technologies")
    print()


--- PROJECTS ---
- Customer Support Chatbot
  Developed and deployed a chatbot for customer support using LangChain and OpenAI API.
  Technologies: LangChain, OpenAI, Python, AWS

- Text Classification Project
  Conducted multi-class text classification using NLTK and BERT for text analysis.
  Technologies: NLTK, BERT

- Metadata Generation Project
  Implemented a metadata generation project using LLaMA2, FAISS, and RAGs.
  Technologies: LLaMA2, FAISS, RAGs, Python



## Test Skill Enhancement

Let's examine the skill enhancement feature, which adds related skills inferred from the parsed ones.

In [11]:
# Get original skills (names only)
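# Skip entries that are empty or have no name, so a missing key cannot raise KeyError
original_skills = [s['name'].strip().lower() for s in parsed_data.get('skills', []) if s and s.get('name')]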
print("Original Skills:")
print(", ".join(original_skills[:10]))
if len(original_skills) > 10:
    print(f"... and {len(original_skills) - 10} more skills")

# Get enhanced skills
enhanced_skills = profile.skills
print("\nEnhanced Skills:")
print(", ".join(enhanced_skills[:10]))
if len(enhanced_skills) > 10:
    print(f"... and {len(enhanced_skills) - 10} more skills")

# Find new skills added through enhancement
new_skills = [s for s in enhanced_skills if s.lower() not in original_skills]
print("\nNewly Added Skills:")
print(", ".join(new_skills))

Original Skills:
python, sql, java, r, generative ai, machine learning & ai, big data tools, data visualization tools, cloud platforms, version control
... and 1 more skills

Enhanced Skills:
neural networks, java, r, generative ai, cloud platforms, python, deep learning, sql, statistics, version control
... and 11 more skills

Newly Added Skills:
neural networks, deep learning, statistics, predictive modeling, data analysis, natural language processing, ETL, data science, data engineering, data warehousing
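
To make the idea of skill enhancement concrete, here is a minimal sketch that expands a skill list through a lookup table of related skills. The `RELATED_SKILLS` mapping and the `enhance_skills` function are assumptions for illustration only. How `CVProfile` actually derives `profile.skills` is not shown in this notebook.

```python
# Hypothetical mapping from a skill to related skills; illustrative only.
RELATED_SKILLS = {
    "machine learning": ["deep learning", "neural networks", "statistics"],
    "sql": ["data warehousing", "ETL"],
}

def enhance_skills(skills: list[str]) -> list[str]:
    """Return the input skills plus related ones, de-duplicated case-insensitively."""
    enhanced = []
    seen = set()
    for skill in skills:
        # Consider the skill itself first, then anything related to it
        for candidate in [skill, *RELATED_SKILLS.get(skill.lower(), [])]:
            key = candidate.lower()
            if key not in seen:
                seen.add(key)
                enhanced.append(candidate)
    return enhanced

base = ["Python", "SQL", "Machine Learning"]
result = enhance_skills(base)
base_lower = {b.lower() for b in base}
added = [s for s in result if s.lower() not in base_lower]
print(added)
```

This sketch preserves input order, so repeated runs print the same sequence. The enhanced list in the output above appears in a different order from the original skills, which would be consistent with a set-based implementation.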


## Complete Parsed Data

Finally, let's see the complete parsed data in a structured format.

In [13]:
# Convert the parsed data to a formatted JSON string
parsed_json = json.dumps(parsed_data, indent=2)

# Print the JSON
print(parsed_json)

TypeError: Object of type datetime is not JSON serializable
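
The `TypeError` above occurs because `json.dumps` has no built-in encoding for `datetime` objects. One common workaround is to pass a `default` callable that converts them to ISO 8601 strings. The sketch below assumes the dates in `parsed_data` are `datetime` or `date` instances. `json_default` and the `parsed_at` key are hypothetical names used only for this example.

```python
import json
from datetime import date, datetime

def json_default(value):
    """Convert date/datetime objects to ISO 8601 strings for json.dumps."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    # Keep the standard failure mode for any other unsupported type
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")

# Stand-in for parsed_data; the real structure comes from CVProfile.
example_data = {"name": "Example", "parsed_at": datetime(2024, 1, 15, 9, 30)}
example_json = json.dumps(example_data, indent=2, default=json_default)
print(example_json)
```

In the notebook cell, the equivalent call would be `json.dumps(parsed_data, indent=2, default=json_default)`. Passing `default=str` is a shorter alternative, at the cost of silently stringifying every unsupported type.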