This Jupyter notebook covers the data cleaning and pre-processing part of the project.

- We load the pros and cons dataset from https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets. The two files are merged into a single dataset with the text in one column and its label in another: 0 for Con/negative text and 1 for Pro/positive text.

- We remove duplicate texts. For the tf-idf version we also lowercase the text and strip special characters and punctuation.

- We output two versions of cleaned text - one for tf-idf and one for BERT.

- The final dataframe is written to 'data/cleaned_df.csv', which is used in _tfidf_and_ml.ipynb_ and _bert_modeling.ipynb_.

We clean the data for two pipelines: one that uses tf-idf for feature extraction and one that uses BERT. The tf-idf version needs more aggressive cleaning (lowercasing and punctuation removal), while the BERT version stays closer to the raw text, since the BERT tokenizer handles most of that on its own.

#### I. Data Loading

In [2]:
# imports

import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import unicodedata

In [3]:
# Load Pros (positive)
pros_texts = []
with open('data/IntegratedPros.txt', 'r', encoding='utf-8', errors='ignore') as f:
    for line in f:
        text = line.replace('<Pros>', '').replace('</Pros>', '').strip()
        if text:
            pros_texts.append(text)

# Load Cons (negative)
cons_texts = []
with open('data/IntegratedCons.txt', 'r', encoding='utf-8', errors='ignore') as f:
    for line in f:
        text = line.replace('<Cons>', '').replace('</Cons>', '').strip()
        if text:
            cons_texts.append(text)

# Create DataFrame
df = pd.DataFrame({
    'text': pros_texts + cons_texts,
    'label': [1] * len(pros_texts) + [0] * len(cons_texts)
})

print(f"Total entries: {len(df)} (Pros: {len(pros_texts)}, Cons: {len(cons_texts)})")

Total entries: 45867 (Pros: 22936, Cons: 22931)
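
The two loading loops above differ only in the file path and the tag name, so they could be folded into one helper. The following is a minimal sketch of that idea, not the notebook's actual code: `load_tagged_lines` is a hypothetical name, and the temporary file is a stand-in so the snippet runs without the real dataset.

```python
import os
import tempfile


def load_tagged_lines(path, tag):
    """Return non-empty lines from `path` with <tag> and </tag> markers removed.

    Hypothetical helper; it mirrors the per-file loops used in the notebook.
    """
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    texts = []
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        for line in f:
            text = line.replace(open_tag, "").replace(close_tag, "").strip()
            if text:  # skip blank lines and tags that wrap only whitespace
                texts.append(text)
    return texts


# Tiny stand-in file so the sketch is runnable without data/IntegratedPros.txt.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample_pros.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("<Pros>Easy to use, economical!</Pros>\n")
        f.write("\n")
        f.write("<Pros>  </Pros>\n")
        f.write("<Pros>intuitive</Pros>\n")
    sample_pros = load_tagged_lines(sample_path, "Pros")

print(sample_pros)
```

With the real files, the same helper would be called once with `'Pros'` and once with `'Cons'`.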


In [4]:
df.head()

                                                text  label
0                           Easy to use, economical!      1
1  Digital is where it's at...down with developin...      1
2  Good image quality, 3x optical zoom, macro mod...      1
3  Awesome features/easy to use/fun/versatile/low...      1
4                           intuitive, user friendly      1


In [5]:
print("\nMissing values:")
print(df.isnull().sum())

# Count rows that exactly repeat an earlier row (same text and same label)
n_duplicates = df.duplicated().sum()
print(f"\nNumber of duplicate entries: {n_duplicates}")

# Text length statistics
df['text_length'] = df['text'].apply(len)
df['word_count'] = df['text'].apply(lambda x: len(x.split()))

print("\nText length statistics:")
print(df[['text_length', 'word_count']].describe())


Missing values:
text     0
label    0
dtype: int64

Number of duplicate entries: 4157

Text length statistics:
        text_length    word_count
count  45867.000000  45867.000000
mean      44.341727      7.050341
std       25.692263      4.002214
min        1.000000      1.000000
25%       25.000000      4.000000
50%       41.000000      6.000000
75%       61.000000     10.000000
max      381.000000     57.000000


In [6]:
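# De-duplicate on the text alone. If the same text appears under both labels,
# only its first occurrence is kept (the Pro row, since pros are listed first).
df = df.drop_duplicates(subset=['text']).reset_index(drop=True)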
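
The earlier duplicate check counted exact row repeats (4157), while this cell drops repeats by text alone. The label counts printed later sum to 41563, which means 4304 rows were removed. The gap of 147 suggests that some texts occur under both labels and silently keep the Pro label. The sketch below shows one way to surface such texts on a toy frame. The toy data and the choice to drop ambiguous texts entirely are assumptions for illustration, not what the notebook does.

```python
import pandas as pd

# Toy data: "none" appears as both a Pro (1) and a Con (0).
toy = pd.DataFrame({
    "text": ["easy to use", "none", "easy to use", "none", "battery life"],
    "label": [1, 1, 1, 0, 0],
})

# Exact (text, label) repeats: only the second "easy to use" row.
full_row_dupes = int(toy.duplicated().sum())

# Repeats by text alone: also counts the second "none" row.
text_dupes = int(toy.duplicated(subset=["text"]).sum())

# Texts that carry more than one distinct label.
labels_per_text = toy.groupby("text")["label"].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()

# One possible policy: drop ambiguous texts, then de-duplicate the rest.
deduped = (
    toy[~toy["text"].isin(conflicting)]
    .drop_duplicates(subset=["text"])
    .reset_index(drop=True)
)

print(full_row_dupes, text_dupes, conflicting)
print(deduped)
```

Whether ambiguous texts should be dropped, kept under one label, or kept under both depends on the downstream task. The point here is only to make the choice explicit.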

In [7]:
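def clean_for_tfidf(text):
    """Aggressive cleaning for tf-idf: lowercase, drop URLs and punctuation."""
    text = text.lower()
    text = unicodedata.normalize("NFKD", text)          # decompose accents/compatibility chars
    text = re.sub(r"http\S+|www\S+", " ", text)         # remove URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)            # remove punctuation/special chars
    text = re.sub(r"\s+", " ", text).strip()            # collapse whitespace
    return text


def clean_for_bert(text):
    """Light cleaning for BERT: keep case and punctuation, drop URLs."""
    text = unicodedata.normalize("NFKD", text)          # decompose accents/compatibility chars
    text = re.sub(r"http\S+|www\S+", " ", text)         # remove URLs
    text = re.sub(r"\s+", " ", text).strip()            # collapse whitespace
    return text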

In [8]:
df["text_tfidf"] = df["text"].apply(clean_for_tfidf)
df["text_bert"] = df["text"].apply(clean_for_bert)
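
To make the difference between the two cleaned columns concrete, the snippet below runs both cleaners on a single made-up review. The function bodies are copied from the cell above so the snippet runs on its own. The sample sentence and its URL are invented for illustration.

```python
import re
import unicodedata


def clean_for_tfidf(text):
    text = text.lower()
    text = unicodedata.normalize("NFKD", text)
    text = re.sub(r"http\S+|www\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)      # remove punctuation/special chars
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace
    return text


def clean_for_bert(text):
    text = unicodedata.normalize("NFKD", text)
    text = re.sub(r"http\S+|www\S+", " ", text)   # remove URLs
    text = re.sub(r"\s+", " ", text).strip()      # collapse whitespace
    return text


# Made-up sample, not taken from the dataset.
sample = "Great 3x zoom!!  See http://example.com for details, it's GREAT"

tfidf_version = clean_for_tfidf(sample)
bert_version = clean_for_bert(sample)

print(tfidf_version)  # great 3x zoom see for details it s great
print(bert_version)   # Great 3x zoom!! See for details, it's GREAT
```

Note that the tf-idf cleaner turns "it's" into the two tokens "it" and "s", because the apostrophe is replaced by a space rather than deleted.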

In [9]:
print(df.label.value_counts())

label
1    21387
0    20176
Name: count, dtype: int64


In [11]:
print("\nMissing values check:")
print(f"text_tfidf missing: {df['text_tfidf'].isnull().sum()}")
print(f"text_bert missing: {df['text_bert'].isnull().sum()}")


Missing values check:
text_tfidf missing: 0
text_bert missing: 0
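
`isnull()` only detects `NaN`/`None`. A text made up entirely of punctuation would be reduced to an empty string by `clean_for_tfidf` and still count as "not missing". The notebook does not show whether such rows exist in this dataset, so the following is a hedged sketch on toy data of how blank strings could be counted and, if desired, filtered out.

```python
import pandas as pd

# Toy frame: the second row stands for a review like "!!!" after cleaning.
toy = pd.DataFrame({
    "text_tfidf": ["easy to use", "", "good zoom"],
    "text_bert": ["Easy to use!", "!!!", "Good zoom"],
})

# isnull() does not flag the empty string.
null_count = int(toy["text_tfidf"].isnull().sum())

# An explicit blank check does.
blank_mask = toy["text_tfidf"].str.strip().eq("")
blank_count = int(blank_mask.sum())

# Optionally drop rows that have no tf-idf tokens left.
non_blank = toy[~blank_mask].reset_index(drop=True)

print(f"null: {null_count}, blank: {blank_count}, rows kept: {len(non_blank)}")
```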


In [12]:
print(df.isnull().sum())

text           0
label          0
text_length    0
word_count     0
text_tfidf     0
text_bert      0
dtype: int64


In [10]:
df.to_csv('data/cleaned_df.csv', index=False)
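
One thing to keep in mind when the downstream notebooks reload this file: with default settings, `pd.read_csv` parses empty fields as `NaN`, so a cleaned text that ended up as an empty string would come back as a missing value. How the other notebooks read the file is not shown here. The sketch below only illustrates the round-trip behaviour on a toy frame, using an in-memory buffer instead of a real file.

```python
import io

import pandas as pd

# Toy frame with one empty cleaned text.
toy = pd.DataFrame({"text_tfidf": ["easy to use", ""], "label": [1, 0]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: the empty field comes back as NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)
nan_after_default_read = int(default_read["text_tfidf"].isnull().sum())

# keep_default_na=False keeps it as an empty string instead.
buffer.seek(0)
kept_blank = pd.read_csv(buffer, keep_default_na=False)
nan_after_keep = int(kept_blank["text_tfidf"].isnull().sum())

print(nan_after_default_read, nan_after_keep)
```

Either dropping blank rows before saving or reading with `keep_default_na=False` avoids passing `NaN` to a vectorizer or tokenizer later on.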