Cleaning messy text using **re**

In [2]:
raw_text = """
@@ Welcome!!!  to the   STUDENT   Performance*** system...

This  project -- analyzes  students'    scores,    feedback,     and learning-patterns.

Some students are STUDYING (harder) while others??... are struggling!! with concepts.

We will @track performance across    subjects:  math, science,, and history....!
"""

import re

def clean_text(text):
    text = re.sub(r'[^\w\s.,]', '', text)       # Keep only word characters, whitespace, periods, commas (hyphens and apostrophes are dropped too)
    text = re.sub(r'\s{2,}', ' ', text)         # Collapse runs of 2+ whitespace characters, including newlines, into one space
    text = re.sub(r'\n+', '\n', text)           # Collapse any repeated newlines that remain
    return text.strip()

# Try it
cleaned_text = clean_text(raw_text)
print("Cleaned Text:\n", cleaned_text)

Cleaned Text:
 Welcome to the STUDENT Performance system... This project analyzes students scores, feedback, and learningpatterns. Some students are STUDYING harder while others... are struggling with concepts. We will track performance across subjects math, science,, and history....
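
The output above shows two side effects of `clean_text`: `learning-patterns` is fused into `learningpatterns`, and repeated punctuation such as `,,` and `....` survives. The sketch below is one possible variant, not part of the original notebook. The function name `clean_text_keep_word_joiners` and the rule of reducing every run of periods or commas to a single mark are assumptions made for illustration.

```python
import re

def clean_text_keep_word_joiners(text):
    # Hypothetical variant of clean_text; name and rules are illustrative assumptions.
    # Keep word characters, whitespace, periods, commas, hyphens, and apostrophes
    text = re.sub(r"[^\w\s.,'-]", "", text)
    # Drop hyphens that do not sit between two word characters (e.g. a stray "--")
    text = re.sub(r"(?<!\w)-+|-+(?!\w)", "", text)
    # Reduce a run of periods/commas to its first character (assumption: ellipses are not wanted)
    text = re.sub(r"([.,])[.,]+", r"\1", text)
    # Collapse every whitespace run, newlines included, into one space
    text = re.sub(r"\s+", " ", text)
    return text.strip()

sample = "This  project -- analyzes  students'    scores,, and learning-patterns...."
print(clean_text_keep_word_joiners(sample))
# This project analyzes students' scores, and learning-patterns.
```

Whether to keep hyphens and apostrophes depends on the downstream step: a tokenizer may split `learning-patterns` into separate tokens anyway, while a bag-of-words model may prefer the fused or the split form.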


Tokenization and Lemmatization using **spaCy**

In [3]:
!pip install spacy
!python -m spacy download en_core_web_sm   # spacy.load() below needs this model installed
import spacy

# Load English NLP model
nlp = spacy.load("en_core_web_sm")

def tokenize_and_lemmatize(text):
    doc = nlp(text)
    print("Tokens & Lemmas:\n")
    for token in doc:
        print(f"{token.text:15} --> {token.lemma_}")

# Try it
tokenize_and_lemmatize(cleaned_text)

Tokens & Lemmas:

Welcome         --> welcome
to              --> to
the             --> the
STUDENT         --> student
Performance     --> performance
system          --> system
...             --> ...
This            --> this
project         --> project
analyzes        --> analyze
students        --> student
scores          --> score
,               --> ,
feedback        --> feedback
,               --> ,
and             --> and
learningpatterns --> learningpatterns
.               --> .
Some            --> some
students        --> student
are             --> be
STUDYING        --> study
harder          --> hard
while           --> while
others          --> other
...             --> ...
are             --> be
struggling      --> struggle
with            --> with
concepts        --> concept
.               --> .
We              --> we
will            --> will
track           --> track
performance     --> performance
across          --> across
subjects        --> subject
math         

Cleaning a messy table using **pandas**

In [4]:
raw_table = [
    ["Name", "Score", "Remarks"],
    ["Alice", "87", "Good improvement"],
    ["Bob", "91"],                       # Missing Remarks
    ["Carol", "", "Needs support"],      # Missing Score
    ["", "75", "Average"],               # Missing Name
    ["David", "89", ""]                  # Missing Remarks
]

import pandas as pd

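def clean_table_data(raw_table):
    # Width of the widest row, so that every row can be padded to the same length
    max_cols = max(len(row) for row in raw_table)
    # Pad short rows with empty strings; cells that are already blank stay as ''
    padded_rows = [row + ['']*(max_cols - len(row)) for row in raw_table]
    # First row is the header, the remaining rows are the data
    df = pd.DataFrame(padded_rows[1:], columns=padded_rows[0])
    return df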

# Try it
df = clean_table_data(raw_table)
print("\nCleaned DataFrame:\n")
print(df)


Cleaned DataFrame:

    Name Score           Remarks
0  Alice    87  Good improvement
1    Bob    91                  
2  Carol           Needs support
3           75           Average
4  David    89                  
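
In the DataFrame above, missing cells are still empty strings and `Score` is still text, so pandas will not treat them as missing or numeric. A possible follow-up step is sketched below; it is not part of the original notebook. The helper name `normalize_table` is hypothetical, and the sample rows are re-entered by hand so the cell runs on its own.

```python
import numpy as np
import pandas as pd

def normalize_table(df):
    # Hypothetical follow-up step: mark blanks as missing and make Score numeric
    out = df.copy()
    # Treat empty or whitespace-only strings as missing values
    out = out.replace(r"^\s*$", np.nan, regex=True)
    # Convert Score to a number; anything unparseable becomes NaN
    out["Score"] = pd.to_numeric(out["Score"], errors="coerce")
    return out

# Same rows as the padded table above, typed in directly
rows = [
    ["Alice", "87", "Good improvement"],
    ["Bob", "91", ""],
    ["Carol", "", "Needs support"],
    ["", "75", "Average"],
    ["David", "89", ""],
]
df_padded = pd.DataFrame(rows, columns=["Name", "Score", "Remarks"])
df_norm = normalize_table(df_padded)

# Count missing values per column
print(df_norm.isna().sum())
```

With blanks stored as `NaN`, methods such as `dropna`, `fillna`, and `mean` handle the gaps directly, which is not the case for empty strings.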
