# Resume Category Classification using BERT  
## Text Quality Analysis

This notebook identifies and analyzes text quality issues in the resume dataset.
The purpose is to understand potential sources of noise before applying
BERT-based text classification.


### Import Required Libraries

The following libraries are used for:
- Data loading and inspection
- Regular expression-based text pattern analysis
- Frequency analysis for repeated words

No preprocessing or modification is performed at this stage.


In [1]:
import pandas as pd
import re
from collections import Counter

### Load Dataset

The resume dataset is loaded from the cloned repository as-is.
At this stage, no columns are modified, removed, or encoded.
The `Category` column is preserved exactly as provided.


In [5]:
!git clone https://github.com/chanmyae99/resume-bert-classification.git

Cloning into 'resume-bert-classification'...
remote: Enumerating objects: 35, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 35 (delta 4), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (35/35), 908.24 KiB | 14.65 MiB/s, done.
Resolving deltas: 100% (4/4), done.


In [6]:
%cd resume-bert-classification/notebooks


/content/resume-bert-classification/notebooks


In [7]:
import os
os.getcwd()

'/content/resume-bert-classification/notebooks'

In [10]:
!ls ..


data  notebooks  README.md


In [9]:
df = pd.read_csv("../data/raw/CHAN MYAE AUNG.csv")
df.head()

,Category,Resume
0,Python Developer,Technical Skills: Languages Python Python Fram...
1,Health and fitness,Education Details \r\nJanuary 2018 M.S. Nutrit...
2,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."
3,Network Security Engineer,"Operating Systems: Windows, Linux, Ubuntu Netw..."
4,Java Developer,Education Details \r\n BE IT pjlce\r\nJava D...


### Dataset Overview

This step provides a structural overview of the dataset, including:
- Number of records
- Column data types
- Presence of missing values

This information helps identify potential data completeness issues.


In [11]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  962 non-null    object
 1   Resume    962 non-null    object
dtypes: object(2)
memory usage: 15.2+ KB


### Missing Value Analysis

Missing values in the resume text may indicate incomplete or poorly formatted resumes,
which can negatively impact text understanding and classification.


In [12]:
df.isnull().sum()


Category    0
Resume      0
dtype: int64
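
`isnull()` only flags true missing values; a resume stored as an empty string or as whitespace only would still count as non-null. The sketch below shows a stricter check. The helper name `is_blank` and the sample strings are assumptions for illustration and are not part of the original notebook.

```python
def is_blank(text):
    """Return True if text is missing, empty, or whitespace only (hypothetical helper)."""
    # Non-string values (None, or the float NaN pandas uses for missing text)
    # are treated as blank.
    if not isinstance(text, str):
        return True
    return text.strip() == ""


# Illustrative inputs, not taken from the dataset
samples = ["Education Details \r\n BE IT", "", "  \r\n ", None]
blank_flags = [is_blank(s) for s in samples]
print(blank_flags)  # [False, True, True, True]
```

Applied to the dataset, `df["Resume"].apply(is_blank).sum()` would count such records without modifying the column.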


### Raw Resume Text Inspection

This step examines raw resume text samples to visually identify:
- Formatting artifacts
- Inconsistent capitalization
- Presence of symbols and noise
- Redundant or unstructured content

These samples serve as qualitative evidence of text quality issues.


In [13]:
for i in range(3):
    print(f"\n--- Resume Sample {i+1} ---\n")
    print(df["Resume"].iloc[i][:600])



--- Resume Sample 1 ---

Technical Skills: Languages Python Python Framework Django, DRF Databases MySQL, Oracle, Sqlite, MongoDB Web Technologies CSS, HTML, RESTful Web Services REST Methodologies Agile, Scrum Version Control Github Project Managent Tool Jira Operating Systems Window, Unix Education Details 
 BE   Dr.BAMU,Aurangabad
Python Developer 

Python Developer - Arsys Inovics pvt ltd
Skill Details 
CSS- Exprience - 31 months
DJANGO- Exprience - 31 months
HTML- Exprience - 31 months
MYSQL- Exprience - 31 months
PYTHON- Exprience - 31 months
web services- Exprience - Less than 1 year months
Logger- 

--- Resume Sample 2 ---

Education Details 
January 2018 M.S. Nutrition and Exercise Physiology New York, NY Teachers College, Columbia University
January 2016 B.S. Nutrition and Dietetics Miami, FL Florida International University
January 2011 B.Sc. General Microbiology Pune, Maharashtra Abasaheb Garware College
Group Fitness Instructor, India 

Group Fitness Ins

### Resume Length Analysis

Resume length varies significantly across candidates.
Standard BERT models accept at most 512 tokens per sequence, so long resumes
are truncated and shorter ones are padded to a common length within a batch.
Length is measured here in characters, not tokens.


In [14]:
df["resume_length"] = df["Resume"].apply(len)
df["resume_length"].describe()


,resume_length
count,962.0
mean,3161.433472
std,2886.343894
min,142.0
25%,1217.25
50%,2355.0
75%,4073.75
max,14816.0
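
Character counts do not map directly onto BERT's token limit. As a rough proxy only, whitespace-separated words can be counted and compared against a budget; WordPiece tokenization typically yields at least as many tokens as words, so the true token count is usually higher. In this sketch, `MAX_WORDS`, `word_count`, and `exceeds_budget` are assumed names, and the sample texts are made up.

```python
MAX_WORDS = 512  # assumed budget, mirroring the 512-token limit of standard BERT models


def word_count(text):
    """Count whitespace-separated words as a rough proxy for token count."""
    return len(text.split())


def exceeds_budget(text, max_words=MAX_WORDS):
    """Flag text whose word count alone already exceeds the budget."""
    return word_count(text) > max_words


# Illustrative inputs, not taken from the dataset
short_resume = "Python Developer with Django experience"
long_resume = "skill " * 600
flags = [exceeds_budget(short_resume), exceeds_budget(long_resume)]
print(word_count(short_resume), word_count(long_resume), flags)
```

An exact figure would require running the actual BERT tokenizer on each resume; the word count only indicates which resumes are certain to be truncated.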


### Capitalization Inconsistency Analysis

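Resumes often contain mixed usage of uppercase and lowercase letters.
With a cased BERT model, variants such as `PYTHON` and `Python` are tokenized
differently and therefore receive different embeddings. An uncased model
lowercases text during tokenization, so casing matters mainly when a cased
model is chosen.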


In [15]:
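def casing_ratio(text):
    # Share of uppercase letters among all cased letters
    upper = sum(c.isupper() for c in text)
    lower = sum(c.islower() for c in text)
    # The +1 avoids division by zero for text without letters
    return upper / (upper + lower + 1)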

df["casing_ratio"] = df["Resume"].apply(casing_ratio)
df["casing_ratio"].describe()


,casing_ratio
count,962.0
mean,0.127502
std,0.049295
min,0.039988
25%,0.091481
50%,0.123028
75%,0.156109
max,0.304348
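
To make the effect of casing concrete, the sketch below counts distinct surface forms of a few words before and after lowercasing. The word list and the helper `distinct_forms` are assumptions for illustration; lowercasing here only mimics what an uncased tokenizer does as its first step.

```python
def distinct_forms(words):
    """Number of distinct surface forms in an iterable of words."""
    return len(set(words))


# Illustrative case variants of two skill names
words = ["PYTHON", "Python", "python", "DJANGO", "Django"]
before = distinct_forms(words)
after = distinct_forms(w.lower() for w in words)
print(before, after)  # 5 2
```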


### Special Characters and Symbol Analysis

Resumes frequently include symbols such as bullets, punctuation, and decorative characters.
These symbols may introduce meaningless tokens during tokenization
and add noise to the text representation.


In [16]:
def extract_special_chars(text):
    return list(set(re.findall(r"[^\w\s]", text)))

df["special_chars"] = df["Resume"].apply(extract_special_chars)
df["special_chars"].head()


,special_chars
0,"[), ,, -, :, , ¢, ', ., &, (]"
1,"[-, ., ,]"
2,"[-, ,]"
3,"[/, ), ,, -, :, , ¢, ', ., &, (]"
4,"[), ,, -, , :, ., +, (, ]"
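
The per-resume sets above show which symbols occur but not how often. A possible follow-up, sketched here under the assumption that a corpus-wide frequency count is wanted, reuses the same `[^\w\s]` pattern with a `Counter`. The function `special_char_counts` and the sample texts are hypothetical.

```python
import re
from collections import Counter


def special_char_counts(texts):
    """Count every non-word, non-whitespace character across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[^\w\s]", text))
    return counts


# Illustrative inputs, not taken from the dataset
texts = ["Skills: Python, Django (DRF)", "C++ - 2 years; Java - 3 years"]
counts = special_char_counts(texts)
print(counts["+"], counts["-"], counts[":"])  # 2 2 1
```

On the dataset, `special_char_counts(df["Resume"])` would give the totals, and `counts.most_common()` would rank the symbols by frequency.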


### Line Breaks and Formatting Artifacts

Resumes are often copied from documents or PDFs, resulting in excessive
line breaks and formatting artifacts. These disrupt sentence continuity
and semantic flow.


In [17]:
# "\r" and "\n" are counted separately, so each Windows-style "\r\n" adds 2
df["line_breaks"] = df["Resume"].apply(
    lambda x: x.count("\n") + x.count("\r")
)

df["line_breaks"].describe()


,line_breaks
count,962.0
mean,85.534304
std,68.573387
min,16.0
25%,36.0
50%,68.0
75%,113.5
max,404.0
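
Because the count above adds `\r` and `\n` occurrences separately, a `\r\n` pair contributes two to the total. If logical line breaks are of interest instead, one option is a regular expression that treats `\r\n` as a single break. This is a sketch of an alternative measure, not the notebook's method; `LINE_BREAK` and `logical_line_breaks` are assumed names.

```python
import re

# Match "\r\n" first so the pair is counted once, then lone "\r" or "\n"
LINE_BREAK = re.compile(r"\r\n|\r|\n")


def logical_line_breaks(text):
    """Count line breaks, treating a "\\r\\n" pair as a single break."""
    return len(LINE_BREAK.findall(text))


# Illustrative input mixing both line-ending styles
sample = "Education Details \r\nBE IT\r\nJava Developer\nSkills"
raw = sample.count("\n") + sample.count("\r")
logical = logical_line_breaks(sample)
print(raw, logical)  # 5 3
```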


### Redundant Word and Skill Repetition

Repeated occurrences of the same skills or keywords may bias model attention
towards certain terms without adding meaningful new information.


In [18]:
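def repeated_words(text):
    # Whitespace split keeps standalone punctuation such as "-" and ":" as tokens
    words = text.lower().split()
    # Keep tokens that occur more than three times in the resume
    return [w for w, c in Counter(words).items() if c > 3]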

df["repeated_tokens"] = df["Resume"].apply(repeated_words)
df["repeated_tokens"].head()


,repeated_tokens
0,"[python, web, services, rest, project, -, expr..."
1,"[and, columbia, university, fitness, -]"
2,"[exprience, -, less, than, 1, year, months]"
3,"[network, :, cisco, and, lan, networking, devi..."
4,"[java, exprience, -, less, than, 1, year, mont..."
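
The output above includes punctuation such as `-` and `:` as well as function words such as `and`, because the text is split on whitespace only. A variant that restricts the count to word-like terms might look like the sketch below. The regular expression, the small `STOPWORDS` set, and the name `repeated_terms` are all assumptions for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword set, not a complete list
STOPWORDS = {"and", "the", "of", "in", "to", "than"}


def repeated_terms(text, min_count=4):
    """Return word-like terms occurring at least min_count times."""
    # A term starts with a letter; "+" and "#" are allowed so that
    # names such as "c++" or "c#" survive.
    words = re.findall(r"[a-z][a-z+#]*", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, c in counts.items() if c >= min_count]


# Illustrative input, not taken from the dataset
sample = "python - django - python : python and python - and and and sql"
result = repeated_terms(sample)
print(result)  # ['python']
```

The default `min_count=4` matches the `c > 3` threshold used in `repeated_words`.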


### Summary of Text Quality Issues

The table below summarizes key quantitative indicators of text quality issues
identified across the dataset. This summary supports the justification for
subsequent preprocessing steps.


In [19]:
summary = pd.DataFrame({
    "avg_resume_length": [df["resume_length"].mean()],
    "max_resume_length": [df["resume_length"].max()],
    "avg_line_breaks": [df["line_breaks"].mean()],
    "avg_casing_ratio": [df["casing_ratio"].mean()]
})

summary


,avg_resume_length,max_resume_length,avg_line_breaks,avg_casing_ratio
0,3161.433472,14816,85.534304,0.127502
