# Purpose and Scope of Text Pre-processing

Text pre-processing is a critical step in Natural Language Processing pipelines,
particularly when using transformer-based models such as BERT.

Although BERT is robust to unstructured and noisy text, certain text quality issues,
including excessive formatting artifacts, inconsistent capitalization, and structural
noise, can negatively affect tokenization, attention distribution, and contextual
representation.

In this notebook, only **minimal and model-appropriate preprocessing techniques** are
applied to address the significant text quality issues identified earlier. Aggressive
text cleaning methods such as stemming, lemmatization, stopword removal, or text
augmentation are intentionally avoided, as they may remove important contextual
information or violate project constraints.

All preprocessing steps preserve the original meaning of the resume text and do not
introduce any external or synthetic content.


In [41]:
import pandas as pd
import re
from collections import Counter
from google.colab import files


### Load Dataset


The resume dataset is loaded as-is from the raw CSV file.
At this stage, no columns are modified, removed, or encoded.
The `Category` column is preserved exactly as provided.

In [6]:
!git clone https://github.com/chanmyae99/resume-bert-classification.git

Cloning into 'resume-bert-classification'...
remote: Enumerating objects: 64, done.
remote: Counting objects: 100% (64/64), done.
remote: Compressing objects: 100% (55/55), done.
remote: Total 64 (delta 18), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (64/64), 930.92 KiB | 2.68 MiB/s, done.
Resolving deltas: 100% (18/18), done.


In [7]:
%cd resume-bert-classification/notebooks

/content/resume-bert-classification/notebooks/resume-bert-classification/notebooks


In [27]:
df = pd.read_csv("../data/raw/CHAN MYAE AUNG.csv")
df.head()

| Unnamed: 0 | Category | Resume |
|---|---|---|
| 0 | Python Developer | Technical Skills: Languages Python Python Fram... |
| 1 | Health and fitness | Education Details \r\nJanuary 2018 M.S. Nutrit... |
| 2 | Data Science | Education Details \r\n MCA YMCAUST, Faridab... |
| 3 | Network Security Engineer | Operating Systems: Windows, Linux, Ubuntu Netw... |
| 4 | Java Developer | Education Details \r\n BE IT pjlce\r\nJava D... |


### Processing Copy

To preserve the original resume text and enable clear before-and-after comparison,
all preprocessing steps are applied to a copied version of the resume column.


In [28]:
df["Resume_processed"] = df["Resume"].copy()


### Case Normalisation

To address inconsistent capitalization across resumes, all text is converted to lowercase.
This reduces stylistic variation introduced by formatting conventions such as uppercase
section headers while preserving semantic meaning.


In [29]:
df["Resume_processed"] = df["Resume_processed"].str.lower()



In [30]:
sample_index = 0

print("=== BEFORE CASE NORMALISATION ===\n")
print(df["Resume"].iloc[sample_index][:400])

print("\n=== AFTER CASE NORMALISATION ===\n")
print(df["Resume_processed"].iloc[sample_index][:400])



=== BEFORE CASE NORMALISATION ===

Technical Skills: Languages Python Python Framework Django, DRF Databases MySQL, Oracle, Sqlite, MongoDB Web Technologies CSS, HTML, RESTful Web Services REST Methodologies Agile, Scrum Version Control Github Project Managent Tool Jira Operating Systems Window, Unix Education Details 
 BE   Dr.BAMU,Aurangabad
Python Developer 

Python Developer - Arsys Inovics pvt ltd
Skill Details 
CSS- Exp

=== AFTER CASE NORMALISATION ===

technical skills: languages python python framework django, drf databases mysql, oracle, sqlite, mongodb web technologies css, html, restful web services rest methodologies agile, scrum version control github project managent tool jira operating systems window, unix education details 
 be   dr.bamu,aurangabad
python developer 

python developer - arsys inovics pvt ltd
skill details 
css- exp


### Removal of Special Characters and Symbols

Special characters and symbols introduced by resume formatting, such as mis-encoded
bullet markers, carry little semantic information and may generate noisy tokens during
BERT tokenization. Every character other than a lowercase letter, digit, or whitespace
is replaced with a space to reduce this noise.


In [31]:
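def remove_special_characters(text):
    # Replace anything that is not a lowercase letter, digit, or whitespace with a space.
    # Assumes the text is already lowercased; uppercase letters would be replaced too.
    return re.sub(r"[^a-z0-9\s]", " ", text)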

df["Resume_processed"] = df["Resume_processed"].apply(remove_special_characters)


In [35]:
print("=== BEFORE SPECIAL CHARACTER REMOVAL ===\n")
print(df["Resume"].iloc[sample_index][1520: 1800])
print("\n")
print("=== AFTER SPECIAL CHARACTER REMOVAL ===\n")
print(df["Resume_processed"].iloc[sample_index][1520: 1800])


=== BEFORE SPECIAL CHARACTER REMOVAL ===

ities:
â¢ Participated in entire lifecycle of the projects including Design, Development, and Deployment, Testing and Implementation and support.
â¢ Developed views and templates with Python and Django's view controller and templating language to created user-friendly website


=== AFTER SPECIAL CHARACTER REMOVAL ===

ities 
    participated in entire lifecycle of the projects including design  development  and deployment  testing and implementation and support 
    developed views and templates with python and django s view controller and templating language to created user friendly website
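
To see what this pattern keeps and discards, the following sketch reapplies the same function to a few short strings. The sample strings are hypothetical and are not taken from the dataset. It suggests two things worth keeping in mind: punctuation inside technical terms is replaced as well, so `c++` and `c#` both end up as a bare `c`, and the character class lists only lowercase letters, so the step relies on case normalisation having run first.

```python
import re

def remove_special_characters(text):
    # Same pattern as the notebook cell above.
    return re.sub(r"[^a-z0-9\s]", " ", text)

# Hypothetical skill strings, already lowercased as in the previous step.
samples = ["c++ and c#", ".net core", "node.js", "user-friendly"]
results = {s: remove_special_characters(s) for s in samples}
for before, after in results.items():
    print(repr(before), "->", repr(after))

# Applied to text that has NOT been lowercased, uppercase letters are
# treated as special characters and replaced too.
print(repr(remove_special_characters("Python")))
```

Whether losing these distinctions matters depends on how often such terms separate one resume category from another; the notebook does not measure this.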


### Normalisation of Line Breaks and Whitespace

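Excessive line breaks and irregular spacing fragment sentence continuity.
Each run of whitespace, including `\r\n` line breaks, is collapsed to a single space,
and leading and trailing whitespace is trimmed, to improve contextual flow for BERT’s
attention mechanism.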


In [36]:
def normalize_whitespace(text):
    text = re.sub(r"\s+", " ", text)
    return text.strip()

df["Resume_processed"] = df["Resume_processed"].apply(normalize_whitespace)


In [38]:
print("=== BEFORE WHITESPACE NORMALISATION ===\n")
print(df["Resume"].iloc[sample_index][:400])
print("\n")
print("=== AFTER WHITESPACE NORMALISATION ===\n")
print(df["Resume_processed"].iloc[sample_index][:400])


=== BEFORE WHITESPACE NORMALISATION ===

Technical Skills: Languages Python Python Framework Django, DRF Databases MySQL, Oracle, Sqlite, MongoDB Web Technologies CSS, HTML, RESTful Web Services REST Methodologies Agile, Scrum Version Control Github Project Managent Tool Jira Operating Systems Window, Unix Education Details 
 BE   Dr.BAMU,Aurangabad
Python Developer 

Python Developer - Arsys Inovics pvt ltd
Skill Details 
CSS- Exp


=== AFTER WHITESPACE NORMALISATION ===

technical skills languages python python framework django drf databases mysql oracle sqlite mongodb web technologies css html restful web services rest methodologies agile scrum version control github project managent tool jira operating systems window unix education details be dr bamu aurangabad python developer python developer arsys inovics pvt ltd skill details css exprience 31 months django e
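
As a small illustration of the step above, the sketch below runs the same function on a hypothetical string that mixes `\r\n` line breaks, tabs, repeated spaces, and a non-breaking space. The input string is made up for the example; only the function body comes from the notebook.

```python
import re

def normalize_whitespace(text):
    # Collapse every run of whitespace into one space, then trim the ends.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# Hypothetical input. For str patterns, \s also matches Unicode whitespace
# such as the non-breaking space (\u00a0).
raw = "education details \r\n be   dr bamu\taurangabad\r\n\r\npython\u00a0developer  "
normalized = normalize_whitespace(raw)
print(repr(normalized))
```

Because every whitespace run becomes a single space, paragraph and section boundaries are no longer visible in the processed text.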


### Reduction of Redundant Word Repetition

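Some resumes contain immediate repetition of skills due to formatting (for example,
`python python`). Consecutive duplicate words are collapsed to a single occurrence so
that repeated tokens do not receive disproportionate weight, while the semantic
content is preserved.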


In [39]:
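def reduce_redundant_repetition(text):
    words = text.split()
    # Start with the first word (if any), then keep a word only when it
    # differs from the word kept immediately before it.
    cleaned = [words[0]] if words else []
    for w in words[1:]:
        if w != cleaned[-1]:
            cleaned.append(w)
    return " ".join(cleaned)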

df["Resume_processed"] = df["Resume_processed"].apply(reduce_redundant_repetition)


In [40]:
print("=== BEFORE ===\n")
print(df["Resume"].iloc[sample_index][:400])

print("\n=== AFTER ===\n")
print(df["Resume_processed"].iloc[sample_index][:400])


=== BEFORE ===

Technical Skills: Languages Python Python Framework Django, DRF Databases MySQL, Oracle, Sqlite, MongoDB Web Technologies CSS, HTML, RESTful Web Services REST Methodologies Agile, Scrum Version Control Github Project Managent Tool Jira Operating Systems Window, Unix Education Details 
 BE   Dr.BAMU,Aurangabad
Python Developer 

Python Developer - Arsys Inovics pvt ltd
Skill Details 
CSS- Exp

=== AFTER ===

technical skills languages python framework django drf databases mysql oracle sqlite mongodb web technologies css html restful web services rest methodologies agile scrum version control github project managent tool jira operating systems window unix education details be dr bamu aurangabad python developer python developer arsys inovics pvt ltd skill details css exprience 31 months django exprienc
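
The output above still contains `python developer python developer`. The sketch below illustrates why: the function compares each word only with the word kept just before it, so a repeated single word is collapsed but a repeated multi-word phrase is not. The `dedupe_with_groupby` helper is a hypothetical alternative added for comparison and is not part of the notebook.

```python
from itertools import groupby

def reduce_redundant_repetition(text):
    # Notebook implementation: drop a word when it equals the previous kept word.
    words = text.split()
    cleaned = [words[0]] if words else []
    for w in words[1:]:
        if w != cleaned[-1]:
            cleaned.append(w)
    return " ".join(cleaned)

def dedupe_with_groupby(text):
    # Hypothetical equivalent: groupby yields one key per run of equal words.
    return " ".join(key for key, _ in groupby(text.split()))

single_word = reduce_redundant_repetition("languages python python framework django")
phrase = reduce_redundant_repetition("python developer python developer arsys")
numbers = reduce_redundant_repetition("10 10 years")

print(single_word)  # the adjacent duplicate "python" is collapsed
print(phrase)       # unchanged: no two adjacent words are identical
print(numbers)      # adjacent identical numbers are collapsed as well
```

Since the comparison is purely textual, adjacent identical numbers are merged too, which may or may not be desirable for a given dataset.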


## Summary of Text Pre-processing

All preprocessing steps were applied selectively to address specific text quality
issues identified earlier. The original resume text was preserved throughout the
process, and before-and-after examples demonstrate the effect of each technique.

The final preprocessed resume text is now suitable for BERT tokenization and
model training.
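
For reference, the four steps can be sketched as a single function. This is a minimal consolidation under the assumption that the steps run in the same order as in the notebook; the name `preprocess_resume` and the sample string are hypothetical, and the notebook itself applies each step in a separate cell.

```python
import re

def preprocess_resume(text):
    # 1. Case normalisation.
    text = text.lower()
    # 2. Replace characters other than lowercase letters, digits, whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # 3. Collapse whitespace runs and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # 4. Drop a word when it equals the word immediately before it.
    words = text.split()
    deduped = [w for i, w in enumerate(words) if i == 0 or w != words[i - 1]]
    return " ".join(deduped)

# Hypothetical sample modelled on the first resume shown earlier.
sample = "Technical Skills: Languages Python Python Framework Django, DRF \r\n BE   Dr.BAMU,Aurangabad"
processed = preprocess_resume(sample)
print(processed)
```

With a function like this, the whole pipeline could be applied in one call, for example `df["Resume"].apply(preprocess_resume)`. Running it a second time on its own output should leave the text unchanged.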


In [43]:
# Make sure the folder exists
!mkdir -p ../data/preprocessed

# Save the preprocessed CSV
df.to_csv("../data/preprocessed/resumes_preprocessed.csv", index=False)
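
One detail to be aware of when reloading the saved file: if a resume were reduced to an empty string by the cleaning steps, `pd.read_csv` would read that cell back as `NaN` by default. The sketch below shows this with a hypothetical two-row frame and an in-memory buffer instead of the real file; whether any resume in the dataset actually ends up empty is not known from the notebook.

```python
import io
import pandas as pd

# Hypothetical frame: the second row stands in for a resume that became empty.
demo = pd.DataFrame({
    "Category": ["Data Science", "Testing"],
    "Resume_processed": ["education details mca", ""],
})

# Write to an in-memory buffer rather than ../data/preprocessed/.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default behaviour: the empty field is parsed as a missing value.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
n_missing = int(reloaded["Resume_processed"].isna().sum())

# keep_default_na=False keeps the empty field as an empty string.
buffer.seek(0)
reloaded_raw = pd.read_csv(buffer, keep_default_na=False)
n_empty = int((reloaded_raw["Resume_processed"] == "").sum())

print(n_missing, n_empty)
```

A quick check such as `df["Resume_processed"].str.len().eq(0).sum()` before saving would show whether this case occurs in the data at all.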
