Week 11 – Assignment: Natural Language Processing (NLP)

Subject: Data Science & AI
Project: Credit Card Fraud Detection
Student Name: Ayesha Tariq
Date: October 23, 2025

Objectives

- Apply NLP preprocessing techniques
- Perform tokenization
- Remove stopwords
- Apply TF-IDF vectorization
- Prepare a basic NLP pipeline

Markdown Cell #1 – Introduction

Introduction

Natural Language Processing (NLP) focuses on analyzing and processing text data.
Since the main project dataset contains numerical transaction data,
a small text-based dataset is used to demonstrate NLP preprocessing
techniques required for this assignment.


Code Cell #1 – Import Libraries

In [17]:
import pandas as pd
import numpy as np

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from sklearn.feature_extraction.text import TfidfVectorizer


Code Cell #2 – Download NLTK Resources

In [18]:
nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Code Cell #3 – Create Sample Text Dataset

In [19]:
# Sample transaction-related text data
text_data = [
    "Transaction approved successfully",
    "Payment declined due to insufficient balance",
    "Suspicious transaction detected",
    "Card payment completed",
    "Possible fraud detected on account",
    "Transaction failed please contact bank"
]

df_text = pd.DataFrame({"Text": text_data})
df_text


Index,Text
0,Transaction approved successfully
1,Payment declined due to insufficient balance
2,Suspicious transaction detected
3,Card payment completed
4,Possible fraud detected on account
5,Transaction failed please contact bank


Code Cell #4 – Tokenization & Stopword Removal

In [20]:
import nltk
import os

# Set custom NLTK data path to avoid lookup errors
nltk_data_path = os.path.join(os.getcwd(), "nltk_data")
os.makedirs(nltk_data_path, exist_ok=True)

# Download required resources to custom path
nltk.download('punkt', download_dir=nltk_data_path)
nltk.download('stopwords', download_dir=nltk_data_path)

# Add the custom path to NLTK data search paths
nltk.data.path.append(nltk_data_path)

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Load stopwords
stop_words = set(stopwords.words('english'))

# Preprocessing function (tokenization + stopword removal)
def preprocess_text(text):
    try:
        tokens = word_tokenize(text.lower())
    except LookupError:
        # Fallback: split on whitespace if the punkt tokenizer data cannot be found
        tokens = text.lower().split()
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    return tokens

# Apply preprocessing
df_text["Tokens"] = df_text["Text"].apply(preprocess_text)

df_text


[nltk_data] Downloading package punkt to /content/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /content/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Index,Text,Tokens
0,Transaction approved successfully,"[transaction, approved, successfully]"
1,Payment declined due to insufficient balance,"[payment, declined, due, insufficient, balance]"
2,Suspicious transaction detected,"[suspicious, transaction, detected]"
3,Card payment completed,"[card, payment, completed]"
4,Possible fraud detected on account,"[possible, fraud, detected, account]"
5,Transaction failed please contact bank,"[transaction, failed, please, contact, bank]"
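
One detail of the fallback branch in preprocess_text is worth noting: str.split() leaves punctuation attached to words, so the isalpha() filter then silently drops tokens such as "bank.". The sample sentences contain no punctuation, so the output above is unaffected. The sketch below compares the whitespace fallback with a regex-based alternative. The tiny stopword set and the sample sentence are assumptions made for illustration; they are not the NLTK list or project data.

```python
import re

# Hypothetical mini stopword set, used only so the example needs no downloads.
DEMO_STOPWORDS = {"to", "on", "the", "a"}


def split_fallback(text):
    """Mirror of the fallback branch: whitespace split, then isalpha() filter."""
    tokens = text.lower().split()
    return [w for w in tokens if w.isalpha() and w not in DEMO_STOPWORDS]


def regex_fallback(text):
    """Alternative sketch: extract runs of letters so punctuation is discarded."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [w for w in tokens if w not in DEMO_STOPWORDS]


sample = "Payment declined, please contact bank."  # illustrative sentence
print(split_fallback(sample))  # tokens with attached punctuation are lost
print(regex_fallback(sample))  # all five words survive
```

The regex version is only a rough substitute for word_tokenize: it would, for instance, split a contraction such as "don't" into two pieces.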


Code Cell #5 – TF-IDF Vectorization

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Join tokens back into strings for TF-IDF
df_text["Processed_Text"] = df_text["Tokens"].apply(lambda x: " ".join(x))

# Initialize TF-IDF vectorizer
tfidf = TfidfVectorizer()

# Fit and transform processed text
X_tfidf = tfidf.fit_transform(df_text["Processed_Text"])

# Convert to DataFrame for visualization
tfidf_df = pd.DataFrame(
    X_tfidf.toarray(),
    columns=tfidf.get_feature_names_out()
)

tfidf_df


Index,account,approved,balance,bank,card,completed,contact,declined,detected,due,failed,fraud,insufficient,payment,please,possible,successfully,suspicious,transaction
0,0.0,0.635091,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.635091,0.0,0.439681
1,0.0,0.0,0.462625,0.0,0.0,0.0,0.0,0.462625,0.0,0.462625,0.0,0.0,0.462625,0.379359,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.559022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.681722,0.471964
3,0.0,0.0,0.0,0.0,0.611713,0.611713,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.501613,0.0,0.0,0.0,0.0,0.0
4,0.521823,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.427903,0.0,0.0,0.521823,0.0,0.0,0.0,0.521823,0.0,0.0,0.0
5,0.0,0.0,0.0,0.472493,0.0,0.0,0.472493,0.0,0.0,0.0,0.472493,0.0,0.0,0.0,0.472493,0.0,0.0,0.0,0.327113
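
To see where these numbers come from, the sketch below recomputes row 0 by hand using only the standard library. It assumes TfidfVectorizer's default settings: smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The token lists are copied from the Tokens column shown earlier.

```python
import math

# Token lists copied from the "Tokens" column of the preprocessing output.
docs = [
    ["transaction", "approved", "successfully"],
    ["payment", "declined", "due", "insufficient", "balance"],
    ["suspicious", "transaction", "detected"],
    ["card", "payment", "completed"],
    ["possible", "fraud", "detected", "account"],
    ["transaction", "failed", "please", "contact", "bank"],
]
n_docs = len(docs)


def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(doc):
    """Raw tf * idf per term, then scale the row to unit L2 length."""
    raw = {term: doc.count(term) * idf(term) for term in set(doc)}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {term: v / norm for term, v in raw.items()}


row0 = tfidf_row(docs[0])
for term in sorted(row0):
    print(term, round(row0[term], 6))
```

Under those assumptions the printed weights should agree with row 0 of the matrix: "transaction" occurs in three documents, so its IDF is smaller than that of "approved" or "successfully", which each occur once.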


Markdown Cell #2 – Results & Analysis

Results & Analysis

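- Tokenization splits each sentence into individual word tokens; the text is lowercased first.
- Stopword removal drops very common words that carry little meaning, such as "to" and "on" in this dataset.
- TF-IDF weights each word by how often it occurs in a document, scaled down by how many documents contain it. For example, "transaction" appears in three of the six sentences and receives a lower weight in row 0 (0.439681) than "approved" (0.635091), which appears only once.
- The resulting 6 x 19 matrix (six sentences, 19 vocabulary terms) can be used as input for machine learning models.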
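
As a small illustration of using the matrix downstream, the sketch below computes cosine similarity between sentences from their TF-IDF rows. The non-zero weights are copied from rows 0, 2, and 3 of the output above; the sparse-dictionary representation and the helper name cosine are choices made for this example only.

```python
import math

# Non-zero TF-IDF weights copied from rows 0, 2 and 3 of the matrix.
rows = {
    0: {"approved": 0.635091, "successfully": 0.635091, "transaction": 0.439681},
    2: {"detected": 0.559022, "suspicious": 0.681722, "transaction": 0.471964},
    3: {"card": 0.611713, "completed": 0.611713, "payment": 0.501613},
}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


# Sentences 0 and 2 share only "transaction"; sentences 0 and 3 share nothing.
print(round(cosine(rows[0], rows[2]), 4))
print(cosine(rows[0], rows[3]))
```

Sentences that share no vocabulary score zero, while the shared word "transaction" gives sentences 0 and 2 a small positive similarity.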

Markdown Cell #3 – Project Milestone Justification

Project Milestone – NLP Pipeline Ready

Although the primary project dataset is numerical, this assignment
demonstrates a complete NLP preprocessing pipeline including:

- Tokenization
- Stopword removal
- TF-IDF vectorization

This satisfies the NLP pipeline requirement and prepares the system
for future integration of text-based transaction descriptions or logs.
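
A minimal sketch of what that future integration might look like: the vectorizer is fitted once on existing text and then reused, via transform rather than fit_transform, so that new descriptions are mapped onto the same columns. Two things here are assumptions, not part of the notebook: scikit-learn's built-in English stopword list stands in for the NLTK list, and the new descriptions are invented examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The six sample sentences from Code Cell #3.
historical_text = [
    "Transaction approved successfully",
    "Payment declined due to insufficient balance",
    "Suspicious transaction detected",
    "Card payment completed",
    "Possible fraud detected on account",
    "Transaction failed please contact bank",
]

# Assumption: scikit-learn's stopword list replaces the NLTK one, so the
# vocabulary may differ slightly from the matrix shown in Code Cell #5.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X_hist = vectorizer.fit_transform(historical_text)

# Hypothetical incoming descriptions, not project data.
new_descriptions = [
    "Suspicious card payment detected",
    "Refund issued to customer",
]

# transform() reuses the fitted vocabulary and IDF weights.
X_new = vectorizer.transform(new_descriptions)

print(X_hist.shape, X_new.shape)
```

Words that were never seen during fitting are ignored by transform, so the second description maps to an all-zero row. In practice the vectorizer would need to be fitted on a much larger body of text for new descriptions to be represented usefully.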


Markdown Cell #4 – Conclusion

Conclusion

- Tokenization, stopword removal, and TF-IDF vectorization were applied to six sample transaction messages.
- Together these steps form a complete text preprocessing pipeline.
- This completes the **NLP milestone**.
