# Smart Resume Parser using Hugging Face Transformers
## Unit-1 NLP Mini-Project

**Chethan S** <br>
**PES2UG23CS150** <br>
**6th Semester C section**

**Objective:** Implement a resume parsing system using Hugging Face pipelines to:
- Extract named entities (NER)
- Generate a text summary
- Perform zero-shot classification for skill domains

**Models Used:**
- `dbmdz/bert-large-cased-finetuned-conll03-english` (BERT-based, the default for the `ner` pipeline) for NER
- `sshleifer/distilbart-cnn-12-6` for summarization
- `facebook/bart-large-mnli` for zero-shot classification

## 1. Install and Import Required Libraries
The `transformers` library and a PyTorch backend must be installed first, for example with `pip install transformers torch`. The cell below then imports what the resume parser needs.

In [2]:
# Import required libraries
from transformers import pipeline
import warnings
warnings.filterwarnings('ignore')


## 2. Define Resume Text
We define a short sample resume that serves as the input for every later step.

In [3]:
# Sample resume text for processing
resume_text = """
Sarah Johnson
Machine Learning Engineer
Email: sarah.johnson@email.com | Phone: (555) 123-4567
Location: San Francisco, California

PROFESSIONAL SUMMARY
Experienced Machine Learning Engineer with 5+ years of expertise in developing AI-powered solutions. 
Skilled in Python programming, deep learning frameworks like TensorFlow and PyTorch, and cloud platforms 
including AWS and Azure. Proven track record in implementing natural language processing models and 
computer vision applications for enterprise clients.

TECHNICAL SKILLS
• Programming Languages: Python, JavaScript, SQL, R
• Machine Learning: TensorFlow, PyTorch, Scikit-learn, Keras
• Web Development: React, Node.js, Flask, Django
• Cloud Platforms: AWS, Azure, Google Cloud Platform
• Cybersecurity: Penetration testing, Security auditing, Vulnerability assessment
• Artificial Intelligence: Natural Language Processing, Computer Vision, Neural Networks
• Databases: PostgreSQL, MongoDB, MySQL

WORK EXPERIENCE
Senior ML Engineer at TechCorp Inc., San Francisco (2021-Present)
- Developed and deployed machine learning models for fraud detection achieving 94% accuracy
- Built recommendation systems using collaborative filtering and deep learning techniques
- Led a team of 4 data scientists in implementing MLOps pipelines on AWS

Data Scientist at DataSolutions LLC, California (2019-2021)
- Created predictive analytics models for customer behavior analysis
- Implemented web scraping solutions and API integrations using Python
- Designed secure data processing pipelines with encryption and access controls

EDUCATION
Master of Science in Computer Science - Stanford University (2019)
Bachelor of Engineering in Software Engineering - University of California, Berkeley (2017)

CERTIFICATIONS
- AWS Certified Machine Learning Specialty
- Google Cloud Professional ML Engineer
- Certified Ethical Hacker (CEH)
"""

print("Resume text loaded successfully!")
print(f"Text length: {len(resume_text)} characters")
print("Ready for NLP processing...")

Resume text loaded successfully!
Text length: 1881 characters
Ready for NLP processing...


## 3. Setup NER Pipeline
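We use a BERT-based NER model to identify and extract entities such as names, organizations, and locations from the resume. Because no model name is passed, the pipeline falls back to its default, `dbmdz/bert-large-cased-finetuned-conll03-english`. Setting `aggregation_strategy="simple"` merges consecutive tokens that share a tag into a single entity.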

In [4]:
# Initialize NER pipeline with BERT-based model
print("Loading NER pipeline...")
ner_pipeline = pipeline("ner", aggregation_strategy="simple")
print(" NER pipeline loaded successfully!")
print("Model ready for entity extraction")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496.
Using a pipeline without specifying a model name and revision in production is not recommended.


Loading NER pipeline...


Loading weights: 100%|██████████| 391/391 [00:00<00:00, 921.88it/s, Materializing param=classifier.weight]                                      
BertForTokenClassification LOAD REPORT from: dbmdz/bert-large-cased-finetuned-conll03-english
Key                      | Status     |  | 
-------------------------+------------+--+-
bert.pooler.dense.weight | UNEXPECTED |  | 
bert.pooler.dense.bias   | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


 NER pipeline loaded successfully!
Model ready for entity extraction
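To make the effect of `aggregation_strategy="simple"` more concrete, the sketch below re-creates the grouping idea in plain Python. It is a simplified illustration, not the library's own code: the helper name `group_simple` and the token predictions are hypothetical. It shows why a fragment such as `##Flow` can surface on its own when the model tags only part of a word as an entity.

```python
def group_simple(token_preds):
    """Merge consecutive tokens that share an entity type (simplified sketch)."""
    groups, current = [], []
    for tok in token_preds:
        if tok["entity"] == "O":
            # A non-entity token closes any open group.
            if current:
                groups.append(current)
                current = []
            continue
        prefix, _, tag = tok["entity"].partition("-")
        if current:
            last_tag = current[-1]["entity"].partition("-")[2]
            # Same type and not a new "B-" start: extend the open group.
            if tag == last_tag and prefix != "B":
                current.append(tok)
                continue
            groups.append(current)
        current = [tok]
    if current:
        groups.append(current)
    return [
        {
            "entity_group": g[0]["entity"].partition("-")[2],
            # Glue "##" continuation pieces back onto the previous piece.
            "word": " ".join(t["word"] for t in g).replace(" ##", ""),
            "score": sum(t["score"] for t in g) / len(g),
        }
        for g in groups
    ]


# Hypothetical per-token predictions, for illustration only.
tokens = [
    {"word": "Sarah", "entity": "B-PER", "score": 0.99},
    {"word": "Johnson", "entity": "I-PER", "score": 0.99},
    {"word": "uses", "entity": "O", "score": 0.99},
    {"word": "Tensor", "entity": "O", "score": 0.55},
    {"word": "##Flow", "entity": "I-ORG", "score": 0.60},
]
grouped = group_simple(tokens)
for g in grouped:
    print(g["entity_group"], g["word"], round(g["score"], 3))
```

In this toy input, "Tensor" is tagged `O`, so the following `##Flow` piece has nothing to attach to and is reported as a separate ORG entity.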


## 4. Extract Entities from Resume
Here we run the NER pipeline on the resume text and group the extracted entities by type.

In [5]:
# Extract entities from resume text
print("Extracting named entities from resume...")
entities = ner_pipeline(resume_text)

print("\n" + "="*50)
print("EXTRACTED ENTITIES")
print("="*50)

# Group entities by type for better presentation
entity_types = {}
for entity in entities:
    entity_type = entity['entity_group']
    if entity_type not in entity_types:
        entity_types[entity_type] = []
    entity_types[entity_type].append({
        'text': entity['word'],
        # Cast to a built-in float before rounding; the raw score is float32,
        # which otherwise prints as 0.9940000176429749 instead of 0.994
        'confidence': round(float(entity['score']), 3)
    })

# Display entities by type
for entity_type, items in entity_types.items():
    print(f"\n{entity_type}:")
    for item in items:
        print(f"   • {item['text']} (confidence: {item['confidence']})")

print(f"\nTotal entities found: {len(entities)}")
print("Entity extraction completed!")

Extracting named entities from resume...

EXTRACTED ENTITIES

PER:
   • Sarah Johnson (confidence: 0.994)

ORG:
   • Learning (confidence: 0.552)
   • ##Flow (confidence: 0.599)
   • PyTorch (confidence: 0.665)
   • AWS (confidence: 0.699)
   • Azure (confidence: 0.679)
   • ##QL (confidence: 0.377)
   • R (confidence: 0.434)
   • TensorFlow (confidence: 0.665)
   • PyTorch (confidence: 0.719)
   • Scikit (confidence: 0.666)
   • Keras (confidence: 0.732)
   • React (confidence: 0.951)
   • Node (confidence: 0.896)
   • j (confidence: 0.492)
   • F (confidence: 0.867)
   • ##k (confidence: 0.739)
   • Django (confidence: 0.475)
   • AWS (confidence: 0.866)
   • Azure (confidence: 0.825)
   • Google Cloud (confidence: 0.74)
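The ORG list above contains word-piece fragments (`##Flow`, `##QL`, `##k`), single characters (`j`, `F`, `R`), and repeated names. One possible clean-up step, sketched below, filters these out before the entities are used further. The helper name `clean_entities` and the `min_score` threshold of 0.6 are assumptions made for illustration, not part of the original notebook; the sample values are copied from the output above.

```python
def clean_entities(entities, min_score=0.6):
    """Drop word-piece fragments, single characters, low scores, and duplicates."""
    cleaned, seen = [], set()
    for ent in entities:
        word = ent["word"].strip()
        # "##" marks a word-piece continuation that lost its first half.
        if word.startswith("##") or len(word) < 2:
            continue
        score = float(ent["score"])
        if score < min_score:
            continue
        key = (ent["entity_group"], word)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"entity_group": ent["entity_group"], "word": word, "score": score})
    return cleaned


# A few entities in the shape returned by the NER pipeline (scores from the output above).
sample_entities = [
    {"entity_group": "ORG", "word": "Learning", "score": 0.552},
    {"entity_group": "ORG", "word": "##Flow", "score": 0.599},
    {"entity_group": "ORG", "word": "PyTorch", "score": 0.665},
    {"entity_group": "ORG", "word": "AWS", "score": 0.699},
    {"entity_group": "ORG", "word": "R", "score": 0.434},
    {"entity_group": "ORG", "word": "PyTorch", "score": 0.719},
    {"entity_group": "ORG", "word": "j", "score": 0.492},
    {"entity_group": "ORG", "word": "F", "score": 0.867},
]
kept = clean_entities(sample_entities)
print([e["word"] for e in kept])
```

The length filter also discards legitimate one-letter names such as the language R, so the threshold and rules would need tuning for real resumes.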

## 5. Setup Text Summarization Pipeline
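We use the DistilBART model (`sshleifer/distilbart-cnn-12-6`) to create a concise summary of the resume. The tokenizer and model are loaded directly here rather than through a `pipeline` object.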

In [8]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

print("Loading summarization model...")
tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")
summarization_model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6")
print("Summarization model loaded successfully!")
print("Model ready for text summarization")


Loading summarization model...


Please make sure the generation config includes `forced_bos_token_id=0`. 
Loading weights: 100%|██████████| 358/358 [00:00<00:00, 781.51it/s, Materializing param=model.shared.weight]                                  


Summarization model loaded successfully!
Model ready for text summarization


## 6. Generate Resume Summary
We tokenize the resume, truncating it to at most 1024 tokens, and generate a summary of 50 to 150 tokens with the model's `generate()` method.

In [9]:

# Generate summary with appropriate length constraints
print(" Generating resume summary...")

# Tokenize and generate using the seq2seq model
inputs = tokenizer(resume_text, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = summarization_model.generate(
    inputs["input_ids"],
    attention_mask=inputs.get("attention_mask"),
    max_length=150,
    min_length=50,
    do_sample=False,
    early_stopping=True
)
summary_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("\n" + "="*50)
print(" RESUME SUMMARY")
print("="*50)
print(summary_text)
print("="*50)

print(f"\n Original text length: {len(resume_text)} characters")
print(f" Summary length: {len(summary_text)} characters")
print(" Summary generation completed!")


 Generating resume summary...

 RESUME SUMMARY
 Sarah Johnson is an experienced Machine Learning Engineer with 5+ years of expertise in developing AI-powered solutions . She is fluent in Python programming, deep learning frameworks like TensorFlow and PyTorch, and cloud platforms like AWS and Azure . She has a track record in implementing natural language processing models and computer vision applications .

 Original text length: 1881 characters
 Summary length: 363 characters
 Summary generation completed!
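The generated summary contains a stray space before each full stop ("solutions . She"). A small post-processing step can tidy this up; the sketch below is one way to do it, and the helper name `tidy_summary` is hypothetical. The sample string is an excerpt of the summary printed above.

```python
import re


def tidy_summary(text):
    """Remove spaces before punctuation and collapse repeated whitespace."""
    text = re.sub(r"\s+([.,;:!?])", r"\1", text)
    return re.sub(r"\s{2,}", " ", text).strip()


# Excerpt (first and last sentence) of the summary shown above.
raw_summary = (
    " Sarah Johnson is an experienced Machine Learning Engineer with 5+ years "
    "of expertise in developing AI-powered solutions . She has a track record in "
    "implementing natural language processing models and computer vision applications ."
)
clean_summary = tidy_summary(raw_summary)
print(clean_summary)
```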


## 7. Setup Zero-Shot Classification Pipeline
We use `facebook/bart-large-mnli` to classify the resume into skill domains without any task-specific training.

In [10]:
# Initialize zero-shot classification pipeline
print(" Loading zero-shot classification pipeline...")
classification_pipeline = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Define skill domain labels for classification
skill_domains = [
    "Programming",
    "Artificial Intelligence", 
    "Machine Learning",
    "Web Development",
    "Cybersecurity"
]

print(" Zero-shot classification pipeline loaded successfully!")
print(f" Ready to classify into {len(skill_domains)} skill domains")
print(" Domains:", ", ".join(skill_domains))

 Loading zero-shot classification pipeline...


Loading weights: 100%|██████████| 515/515 [00:00<00:00, 936.96it/s, Materializing param=model.shared.weight]                                   


 Zero-shot classification pipeline loaded successfully!
 Ready to classify into 5 skill domains
 Domains: Programming, Artificial Intelligence, Machine Learning, Web Development, Cybersecurity


## 8. Classify Resume into Skill Domains
Here, we classify the resume content to determine which skill domains it best matches.

In [14]:
# Perform zero-shot classification
print(" Classifying resume into skill domains...")

classification_result = classification_pipeline(resume_text, skill_domains)

print("\n" + "="*50)
print(" SKILL DOMAIN CLASSIFICATION")
print("="*50)

# Display results with confidence scores
for rank, (domain, score) in enumerate(zip(classification_result['labels'], classification_result['scores']), start=1):
    confidence_percent = round(score * 100, 1)
    print(f"{rank}. {domain}: {confidence_percent}%")
    
    # Add visual confidence bar
    bar_length = int(confidence_percent / 5)  # Scale to 20 characters max
    bar = "=" * bar_length + ":" * (20 - bar_length)
    print(f"   {bar} ({confidence_percent}%)")
    print()

print("="*50)
print(f" Top domain: {classification_result['labels'][0]}")
print(f" Confidence: {round(classification_result['scores'][0] * 100, 1)}%")
print(" Classification completed!")

 Classifying resume into skill domains...

 SKILL DOMAIN CLASSIFICATION
1. Machine Learning: 46.7%
   =========::::::::::: (46.7%)

2. Artificial Intelligence: 28.6%
   =====::::::::::::::: (28.6%)

3. Programming: 11.7%
   ==:::::::::::::::::: (11.7%)

4. Cybersecurity: 6.8%
   =::::::::::::::::::: (6.8%)

5. Web Development: 6.2%
   =::::::::::::::::::: (6.2%)

 Top domain: Machine Learning
 Confidence: 46.7%
 Classification completed!
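The five scores above add up to 100%. In the default single-label mode, the zero-shot pipeline takes the entailment logit for each candidate label and applies a softmax across the labels, so the domains compete with each other. With `multi_label=True`, each label is instead scored on its own from its entailment and contradiction logits, which can suit resumes that span several domains. The sketch below reproduces both calculations in plain Python; the function names and the logit values are hypothetical and chosen only to illustrate the arithmetic.

```python
import math


def single_label_scores(entailment_logits):
    """Softmax over the entailment logits of all labels (scores sum to 1)."""
    peak = max(entailment_logits.values())
    exps = {label: math.exp(v - peak) for label, v in entailment_logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def multi_label_scores(logit_pairs):
    """Per-label softmax over (contradiction, entailment); labels are independent."""
    return {
        label: 1.0 / (1.0 + math.exp(contradiction - entailment))
        for label, (contradiction, entailment) in logit_pairs.items()
    }


# Hypothetical logits, for illustration only.
entailment = {"Machine Learning": 2.0, "Programming": 0.5, "Cybersecurity": -0.5}
pairs = {"Machine Learning": (-1.0, 2.0), "Programming": (0.2, 0.5)}

single = single_label_scores(entailment)
multi = multi_label_scores(pairs)
print({k: round(v, 3) for k, v in single.items()})
print({k: round(v, 3) for k, v in multi.items()})
```

In single-label mode, adding or removing a candidate label changes every other score, so the percentages are best read as a ranking rather than as absolute confidence.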


## 9. Comprehensive Results Summary
Let's compile all our analysis results into a comprehensive resume parsing report.
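The cell below prints the report as text. If the results need to be consumed by other code, the same fields can also be collected into a dictionary and serialized as JSON. The following is a minimal sketch under that assumption: `build_report` is a hypothetical helper, and the stand-in inputs only mimic the shapes of `entities`, `summary_text`, and `classification_result` from the earlier cells.

```python
import json


def build_report(resume_text, entities, summary_text, classification_result, top_k=3):
    """Collect the parser outputs into one JSON-serializable dictionary."""
    def words(group):
        return [e["word"] for e in entities if e["entity_group"] == group]

    names = words("PER")
    ranked = list(zip(classification_result["labels"], classification_result["scores"]))[:top_k]
    ratio = len(summary_text) / len(resume_text) * 100 if resume_text else 0.0
    return {
        "name": names[0] if names else None,
        "locations": words("LOC"),
        "organizations": words("ORG"),
        "summary": summary_text.strip(),
        "skill_domains": [{"domain": d, "score": round(float(s), 3)} for d, s in ranked],
        "stats": {
            "resume_chars": len(resume_text),
            "entities": len(entities),
            "summary_ratio_percent": round(ratio, 1),
        },
    }


# Small stand-in inputs (hypothetical), shaped like the variables from earlier cells.
demo_entities = [
    {"entity_group": "PER", "word": "Sarah Johnson", "score": 0.994},
    {"entity_group": "LOC", "word": "San Francisco", "score": 0.99},
    {"entity_group": "ORG", "word": "AWS", "score": 0.866},
]
demo_classification = {
    "labels": ["Machine Learning", "Artificial Intelligence", "Programming", "Cybersecurity"],
    "scores": [0.467, 0.286, 0.117, 0.068],
}
report = build_report(
    "Sarah Johnson, Machine Learning Engineer, San Francisco. Works with AWS.",
    demo_entities,
    " Sarah Johnson is a Machine Learning Engineer.",
    demo_classification,
)
print(json.dumps(report, indent=2))
```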

In [17]:
# Comprehensive Resume Analysis Report
print("="*60)
print(" SMART RESUME PARSER - COMPREHENSIVE REPORT")
print("="*60)

print("\n CANDIDATE INFORMATION")
print("-" * 40)
# Extract key entities for candidate info
names = [entity['word'] for entity in entities if entity['entity_group'] == 'PER']
locations = [entity['word'] for entity in entities if entity['entity_group'] == 'LOC'] 
organizations = [entity['word'] for entity in entities if entity['entity_group'] == 'ORG']

print(f" Name: {names[0] if names else 'Not detected'}")
print(f" Location: {', '.join(locations[:2]) if locations else 'Not detected'}")
print(f" Organizations: {', '.join(organizations[:3]) if organizations else 'Not detected'}")

print("\n RESUME SUMMARY")
print("-" * 40)
print(summary_text)

print("\n SKILL DOMAIN CLASSIFICATION")
print("-" * 40)
for rank, (domain, score) in enumerate(zip(classification_result['labels'][:3], classification_result['scores'][:3]), start=1):
    confidence_percent = round(score * 100, 1)
    print(f"{rank}. {domain}: {confidence_percent}%")

print("\n ANALYSIS STATISTICS")
print("-" * 40)
print(f" Resume length: {len(resume_text)} characters")
print(f" Entities extracted: {len(entities)}")
print(f" Summary compression: {round((len(summary_text)/len(resume_text))*100, 1)}%")
print(f" Top skill domain: {classification_result['labels'][0]}")

print("\n MODELS USED")
print("-" * 40)
print("• NER: BERT-based model")
print("• Summarization: sshleifer/distilbart-cnn-12-6") 
print("• Classification: facebook/bart-large-mnli")

print("\n" + "="*60)
print(" RESUME PARSING COMPLETED SUCCESSFULLY!")
print("="*60)

 SMART RESUME PARSER - COMPREHENSIVE REPORT

 CANDIDATE INFORMATION
----------------------------------------
 Name: Sarah Johnson
 Location: San Francisco, California
 Organizations: Learning, ##Flow, PyTorch

 RESUME SUMMARY
----------------------------------------
 Sarah Johnson is an experienced Machine Learning Engineer with 5+ years of expertise in developing AI-powered solutions . She is fluent in Python programming, deep learning frameworks like TensorFlow and PyTorch, and cloud platforms like AWS and Azure . She has a track record in implementing natural language processing models and computer vision applications .

 SKILL DOMAIN CLASSIFICATION
----------------------------------------
1. Machine Learning: 46.7%
2. Artificial Intelligence: 28.6%
3. Programming: 11.7%

 ANALYSIS STATISTICS
----------------------------------------
 Resume length: 1881 characters
 Entities extracted: 50
 Summary compression: 19.3%
 Top skill domain: Machine Learning

 MODELS USED
------------------