## Importing Libraries and Initial Setup

The following libraries and modules are imported and initialized for text preprocessing, feature extraction, and class rebalancing:

- `pandas`: Data manipulation and analysis.
- `string`: Common string operations and constants; used here for `string.punctuation`.
- `json`: Reads the Filipino lemma dictionary.
- `nltk`: The Natural Language Toolkit, a comprehensive library for natural language processing. Here, it is used for:
  - `nltk.tokenize.word_tokenize`: Splits sentences into individual words (tokens).
  - `nltk.corpus.stopwords`: Provides stopword lists in multiple languages, used to filter low-information words (e.g., "the", "is") from the text.
- `stopwordsiso`: Stopword lists keyed by ISO 639-1 language code; it supplies the Filipino (`tl`) stopwords. It is imported under the alias `stopwords`, which rebinds the name imported from `nltk.corpus`, so later calls on `stopwords` go to `stopwordsiso`.
- `spacy`: Library for advanced natural language processing.
- `sklearn.feature_extraction.text.TfidfVectorizer`: Converts text to TF-IDF features.
- `imblearn.over_sampling.RandomOverSampler` and `imblearn.under_sampling.RandomUnderSampler`: Rebalance the class distribution.

NLTK data (`punkt`, `punkt_tab`, and `stopwords`) is downloaded for tokenization and stopword filtering.

The transformer-based `en_core_web_trf` pipeline from spaCy is loaded; it is used later for English lemmatization.

In [None]:
import pandas as pd
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import spacy
import json
from sklearn.feature_extraction.text import TfidfVectorizer
# Note: this alias rebinds `stopwords`, shadowing the nltk.corpus import above
import stopwordsiso as stopwords
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

nlp = spacy.load("en_core_web_trf")

## Filipino Stopwords

Filipino (Tagalog) stopwords are loaded from `stopwordsiso` using the language code `tl`. Stopwords are common function words, such as articles, prepositions, and conjunctions, that carry little meaning on their own. They are removed during preprocessing so that the analysis focuses on more informative words.

In [2]:
filipino_stopwords = stopwords.stopwords('tl')  

## Function: `load_dataset`

The `load_dataset` function loads a dataset from a CSV file. It catches a missing file, an empty file, parse errors, and any other exception, prints a message for each, and returns `None` instead of a DataFrame when loading fails.

### Code

In [3]:
def load_dataset(file_path):
    try:
        data = pd.read_csv(file_path)
        print("Dataset loaded successfully!")
        return data
    except FileNotFoundError:
        print("File not found. Please check the file path.")
    except pd.errors.EmptyDataError:
        print("File is empty. Please check the file content.")
    except pd.errors.ParserError:
        print("Error parsing file. Please check the file format.")
    except Exception as e:
        print(f"An error occurred: {e}")

## Loading Dataset and Extracting Sentences

The following code snippet demonstrates how to load a dataset from a CSV file and extract sentences from it.

### Code

In [None]:
file_path = '../data/training_data_sample.csv'
data = load_dataset(file_path)

sentences = data['sentence'].tolist()

## Function: `format_list_as_string`

The `format_list_as_string` function converts a list of tokens into a formatted string representation.

In text preprocessing, tokens are often stored as lists of words after steps like tokenization, stopword removal, and lemmatization. These lists can sometimes be difficult to read or display, especially when reviewing or debugging processed data.

This function formats these lists consistently, with double-quoted items, when they are printed.

### Code

In [5]:
def format_list_as_string(token_list):
    return str(token_list).replace("'", '"')
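
As a quick illustration, the sketch below calls the function on two small, made-up token lists. The second call shows a limitation of the simple replace approach: a token containing an apostrophe ends up with an unbalanced quote. `json.dumps` is shown as one possible alternative; it is not used in the original notebook.

```python
import json

def format_list_as_string(token_list):
    # Same approach as above: swap single quotes for double quotes in the repr
    return str(token_list).replace("'", '"')

plain = format_list_as_string(["mahal", "kita"])
tricky = format_list_as_string(["don't", "go"])
via_json = json.dumps(["don't", "go"])  # alternative (assumption), quote-safe

print(plain)     # ["mahal", "kita"]
print(tricky)    # ["don"t", "go"]
print(via_json)  # ["don't", "go"]
```

Because punctuation is removed earlier in this pipeline, apostrophes rarely reach this function, so the simple version is usually sufficient for display.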

## Function: `print_table`

The `print_table` function displays a formatted table using the `rich` library, which provides enhanced terminal output for data visualization.

### Code

In [None]:
def print_table(data, title="Table", num_samples=20):
    from rich.console import Console
    from rich.table import Table
    
    table = Table(title=title)
    
    for col in data.columns:
        table.add_column(col)

    for _, row in data.head(num_samples).iterrows():
        formatted_row = [format_list_as_string(row[col]) if isinstance(row[col], list) else row[col] for col in data.columns]
        table.add_row(*map(str, formatted_row))
    
    console = Console()
    console.print(table)

print_table(data, title="Original Data")

## Function: `convert_to_lowercase`

The `convert_to_lowercase` function converts all text in the 'sentence' column of a DataFrame to lowercase.

It ensures uniformity in text data by transforming all characters to lowercase, which is a key preprocessing step in natural language processing (NLP) tasks.

### Code

In [None]:
def convert_to_lowercase(data):
    if 'sentence' in data.columns:
        data['sentence'] = data['sentence'].str.lower()
        print("Sentence has been converted to lowercase.")
    else:
        print("Column 'sentence' not found in the DataFrame.")

convert_to_lowercase(data)
print_table(data, title="Data After Lowercase Conversion")

## Function: `remove_punctuation`

The `remove_punctuation` function removes punctuation from all text in the 'sentence' column of a DataFrame.

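It strips every character listed in `string.punctuation` (the ASCII punctuation marks), so that punctuation does not interfere with subsequent text processing steps. Non-ASCII marks such as curly quotes are not in that set and are left in place.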

### Code

In [None]:
def remove_punctuation(data):
    if 'sentence' in data.columns:
        data['sentence'] = data['sentence'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
        print("Punctuation has been removed.")
    else:
        print("Column 'sentence' not found in the DataFrame.")

remove_punctuation(data)
print_table(data, title="Data After Punctuation Removal")
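
The same translation-table technique can be tried on plain strings, outside a DataFrame. This is a minimal sketch; `strip_punctuation` and `PUNCT_TABLE` are hypothetical names introduced for the example, and the sample strings are made up.

```python
import string

# Build the translation table once; it maps each ASCII punctuation mark to None
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    # Hypothetical helper mirroring the lambda used in remove_punctuation
    return text.translate(PUNCT_TABLE)

ascii_case = strip_punctuation("hello, world! it's fine.")
unicode_case = strip_punctuation("“hello” … world")

print(ascii_case)    # hello world its fine
print(unicode_case)  # “hello” … world (non-ASCII marks are kept)
```

Building the table once, rather than inside the lambda on every row, avoids repeated work on large datasets.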



## Function: `remove_numbers`

The `remove_numbers` function removes numerical digits from all text in the 'sentence' column of a DataFrame.

It ensures that the text data is free from irrelevant numerical characters that could distort the analysis.

### Code

In [None]:
def remove_numbers(data):
    if 'sentence' in data.columns:
        data['sentence'] = data['sentence'].str.replace(r'\d+', '', regex=True)
        print("Numbers have been removed.")
    else:
        print("Column 'sentence' not found in the DataFrame.")

remove_numbers(data)
print_table(data, title="Data After Numbers Removal")
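
To see what the `\d+` pattern does to a single string, here is a small sketch using the standard `re` module. `strip_digits` is a hypothetical helper, and the whitespace cleanup step is an optional addition that the notebook does not perform.

```python
import re

def strip_digits(text):
    # Hypothetical helper: same pattern as remove_numbers, applied to one string
    return re.sub(r"\d+", "", text)

removed = strip_digits("i waited 30 minutes for 2 buses")
# Optional cleanup (assumption): collapse the gaps left where digits were removed
collapsed = re.sub(r"\s+", " ", removed).strip()

print(repr(removed))    # 'i waited  minutes for  buses'
print(repr(collapsed))  # 'i waited minutes for buses'
```

The doubled spaces are harmless in this pipeline because tokenization in the next step splits on whitespace anyway.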


## Function: `tokenize_sentences`

The `tokenize_sentences` function tokenizes each sentence in the 'sentence' column of a DataFrame into individual words.

It breaks down sentences into individual words or tokens, facilitating analysis, standardizing input, and enabling various natural language processing tasks.

### Code

In [None]:
def tokenize_sentences(data):
    if 'sentence' in data.columns:
        data['sentence'] = data['sentence'].apply(lambda x: word_tokenize(x))
        print("Sentences have been tokenized.")
    else:
        print("Column 'sentence' not found in the DataFrame.")

tokenize_sentences(data)
print_table(data, title="Data After Tokenization")

## Function: `remove_stopwords`

The `remove_stopwords` function removes stopwords from each tokenized sentence in the 'sentence' column of a DataFrame.

Stopwords are common words that usually add little meaning to a sentence and are often filtered out in text processing to enhance the focus on more meaningful words.

### Code

In [None]:
def remove_stopwords(data):
    if 'sentence' in data.columns:
        # stopwordsiso expects ISO 639-1 codes, so English is 'en' rather than 'english'
        english_stopwords = set(stopwords.stopwords('en'))
        all_stopwords = english_stopwords.union(set(filipino_stopwords))
        
        data['sentence'] = data['sentence'].apply(lambda tokens: [word for word in tokens if word.lower() not in all_stopwords])
        print("Stopwords have been removed.")
    else:
        print("Column 'sentence' not found in the DataFrame.")

remove_stopwords(data)
print_table(data, title="Data After Stopwords Removal")
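
The filtering logic can be illustrated on a single token list. The sketch below uses tiny hand-written stopword sets as stand-ins for the full lists loaded in the notebook; `filter_tokens` is a hypothetical helper.

```python
# Tiny illustrative stopword sets (assumption); the notebook loads full lists
english_stopwords = {"the", "is", "a"}
filipino_stopwords = {"ang", "ng", "sa"}
all_stopwords = english_stopwords | filipino_stopwords

def filter_tokens(tokens, stopword_set):
    # Keep tokens whose lowercase form is not a stopword
    return [t for t in tokens if t.lower() not in stopword_set]

tokens = ["Ang", "weather", "is", "maganda", "sa", "beach"]
kept = filter_tokens(tokens, all_stopwords)

print(kept)  # ['weather', 'maganda', 'beach']
```

Combining both languages into one set suits code-switched text, where a single sentence can mix English and Filipino words.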

## Functions: `lemmatize_filo` and `lemmatize_eng`

Lemmatization runs in two passes over the tokens in the 'sentence' column. `lemmatize_filo` maps Filipino tokens to their lemmas using a dictionary loaded from `../data/filipino_lemmatizer.json`, and `lemmatize_eng` then lemmatizes each token with spaCy.

Lemmatization reduces words to their base or dictionary form (lemmas). Consolidating different inflections of a word into a single representation standardizes the text, which can improve downstream tasks such as classification and clustering.

### Code

In [None]:
def lemmatize_filo(data):
    with open('../data/filipino_lemmatizer.json', 'r', encoding='utf-8') as json_file:
        lemma_dict = json.load(json_file)

    token_to_lemma = {}
    for lemma, tokenval in lemma_dict['lemma_dict'].items():
        for token in tokenval:
            token_to_lemma[token] = lemma

    for index, row in data.iterrows():
        tokens = row['sentence']
        
        if all(len(token) == 1 for token in tokens):
            tokens = ''.join(tokens).split()
        
        updated_tokens = [token_to_lemma.get(token, token) for token in tokens]
        data.at[index, 'sentence'] = updated_tokens

lemmatize_filo(data)
print_table(data, title="Data After Lemmatization in Filipino")
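
The dictionary inversion at the heart of `lemmatize_filo` can be sketched in isolation. The excerpt below is hypothetical: it only mimics the shape the function reads (a `lemma_dict` key mapping each lemma to a list of inflected forms) and is not taken from the actual JSON file.

```python
# Hypothetical excerpt in the shape the function reads: lemma -> inflected forms
lemma_data = {
    "lemma_dict": {
        "kain": ["kumain", "kinain", "kakain"],
        "ganda": ["maganda", "gumanda"],
    }
}

# Invert the mapping so each inflected form points to its lemma
token_to_lemma = {
    token: lemma
    for lemma, forms in lemma_data["lemma_dict"].items()
    for token in forms
}

tokens = ["kumain", "maganda", "happy"]
# Tokens missing from the dictionary fall through unchanged
lemmas = [token_to_lemma.get(t, t) for t in tokens]

print(lemmas)  # ['kain', 'ganda', 'happy']
```

Leaving unknown tokens untouched is what lets the English pass handle them afterwards.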

In [None]:
def lemmatize_eng(data):
    if 'sentence' in data.columns:
        data['sentence'] = data['sentence'].apply(lambda tokens: [nlp(token)[0].lemma_ for token in tokens])
        print("Tokens have been lemmatized.")
    else:
        print("Column 'sentence' not found in the DataFrame.")

lemmatize_eng(data)
print_table(data, title="Data After Lemmatization in English")

## Function: `join_tokens`

The `join_tokens` function joins tokens in each list within the 'sentence' column of a DataFrame back into single sentences.

It turns each token list back into a single space-separated string. This is the input format expected by later steps such as TF-IDF vectorization and by modeling tasks like text classification or sentiment analysis.

### Code

In [None]:
def join_tokens(data):
    if 'sentence' in data.columns:
        data['sentence'] = data['sentence'].apply(lambda tokens: ' '.join(tokens))
        print("Tokens have been joined back into sentences.")
    else:
        print("Column 'sentence' not found in the DataFrame.")

join_tokens(data)
print_table(data, title="Data After Joining Tokens")

## Function: `vectorize_with_tfidf`

The `vectorize_with_tfidf` function performs TF-IDF vectorization on the 'sentence' column of a DataFrame.

**TF-IDF** is a statistical measure used to evaluate the importance of a word in a document relative to a collection (or corpus) of documents. It combines two key concepts:

1. **Term Frequency (TF)**: Measures how frequently a word appears in a single document. Words that appear frequently within a document are considered more important for that document.
   - Formula: `TF = (Number of times a term appears in a document) / (Total number of terms in the document)`

2. **Inverse Document Frequency (IDF)**: Measures how important a word is across the entire dataset. Words that appear in many documents (common words) are less important, while rare words are more important.
   - Formula: `IDF = log(Total number of documents / Number of documents containing the term)`

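The **TF-IDF score** is calculated by multiplying TF and IDF for each word, providing a numerical value representing the word's significance within a document relative to the entire corpus.

Note that scikit-learn's `TfidfVectorizer`, with its default settings, uses raw term counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each document vector, so its values differ from the textbook formulas above.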
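
The two formulas can be worked through by hand on a toy corpus. The sketch below follows the textbook definitions exactly as written above, not scikit-learn's implementation. The documents and function names are made up for illustration, and the IDF helper assumes the term occurs in at least one document.

```python
import math
from collections import Counter

# Toy corpus (assumption): three already-tokenized documents
docs = [
    ["happy", "happy", "day"],
    ["sad", "day"],
    ["happy", "news"],
]

def term_frequency(term, doc):
    # TF = occurrences of the term / total terms in the document
    return Counter(doc)[term] / len(doc)

def inverse_document_frequency(term, corpus):
    # IDF = log(total documents / documents containing the term)
    # Assumes the term appears in at least one document
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return term_frequency(term, doc) * inverse_document_frequency(term, corpus)

score_happy = tf_idf("happy", docs[0], docs)  # (2/3) * ln(3/2)
score_sad = tf_idf("sad", docs[1], docs)      # (1/2) * ln(3/1)

print(round(score_happy, 4))  # 0.2703
print(round(score_sad, 4))    # 0.5493
```

Although "happy" occurs twice in its document, "sad" scores higher because it appears in only one of the three documents, which is the effect IDF is designed to capture.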

### Code

In [None]:
def vectorize_with_tfidf(data):
    if 'sentence' in data.columns:
        vectorizer = TfidfVectorizer()

        tfidf_matrix = vectorizer.fit_transform(data['sentence'])

        print("TF-IDF Vectorization complete.")
        return tfidf_matrix, vectorizer
    else:
        print("Column 'sentence' not found in the DataFrame.")

tfidf_matrix, vectorizer = vectorize_with_tfidf(data)
feature_names = vectorizer.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
tfidf_df['emotion'] = data['emotion'].values

# Save the DataFrame to a CSV file
tfidf_df.to_csv('tfidf_vectorized_data.csv', index=False)

print(tfidf_df.head())

## Save a Model

In [16]:
import pickle
import os

def save_model(model, filename):
    with open(filename, 'wb') as f:
        pickle.dump(model, f)

    print(f"Model saved to {filename}")

current_dir = os.getcwd()

models_folder = os.path.join(current_dir, "..", "models")
# exist_ok=True makes this a no-op when the folder already exists
os.makedirs(models_folder, exist_ok=True)

## Save the TF-IDF Vectorizer Model

In [None]:
save_model(vectorizer, os.path.join(models_folder, "tfidf_vectorizer_model.pkl"))
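
To reuse the saved vectorizer later, the pickle file has to be read back. The sketch below shows one way to do that; `load_model` is a hypothetical counterpart to `save_model` and does not appear in the original notebook. A plain dictionary stands in for the fitted vectorizer so the example runs on its own.

```python
import os
import pickle
import tempfile

def load_model(filename):
    # Hypothetical counterpart to save_model; only unpickle files you trust
    with open(filename, "rb") as f:
        return pickle.load(f)

# A plain dict stands in for the fitted vectorizer in this sketch
stand_in = {"vocabulary": {"happy": 0, "sad": 1}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stand_in_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(stand_in, f)
    restored = load_model(path)

print(restored == stand_in)  # True
```

With a real fitted vectorizer, new text should be passed to the restored object's `transform` method rather than `fit_transform`, so that it is mapped onto the vocabulary learned during training.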

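## Function: `resample_data`

The `resample_data` function balances the emotion classes by applying `RandomOverSampler` followed by `RandomUnderSampler`, both with `random_state=42` for reproducibility. With the default sampling strategy, the oversampler already brings every class up to the size of the largest one, so the undersampling step that follows does not remove any rows.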

In [None]:
X = tfidf_df.drop(columns=['emotion'])
y = tfidf_df['emotion']

print("Class distribution before resampling:")
print(y.value_counts())
print("\nData before resampling:")
print(tfidf_df)

def resample_data(X, y):
    oversampler = RandomOverSampler(random_state=42)
    X_over, y_over = oversampler.fit_resample(X, y)
    
    print(f"\nOversampled class distribution:\n{y_over.value_counts()}")

    undersampler = RandomUnderSampler(random_state=42)
    X_resampled, y_resampled = undersampler.fit_resample(X_over, y_over)
    
    print(f"\nFinal resampled class distribution:\n{y_resampled.value_counts()}")
    
    return X_resampled, y_resampled

X_resampled, y_resampled = resample_data(X, y)

resampled_df = pd.concat([pd.DataFrame(X_resampled, columns=feature_names), pd.DataFrame(y_resampled, columns=['emotion'])], axis=1)

resampled_df.to_csv('../data/resampled_df.csv', index=False)

print("\nData after resampling:")
print(resampled_df)
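
To make the effect of random oversampling concrete, here is a simplified pure-Python stand-in. It is not imblearn's implementation; `random_oversample` and the sample rows and labels are hypothetical. It only illustrates the idea of drawing minority-class rows with replacement until every class matches the largest one.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    # Simplified stand-in for random oversampling, not imblearn's implementation
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == label]
        # Draw with replacement until this class reaches the majority size
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels

rows = ["r1", "r2", "r3", "r4", "r5", "r6"]
labels = ["joy", "joy", "joy", "joy", "anger", "sadness"]
new_rows, new_labels = random_oversample(rows, labels)
balanced = Counter(new_labels)

print(balanced["joy"], balanced["anger"], balanced["sadness"])  # 4 4 4
```

Because duplicated rows are exact copies, oversampling before a train/test split can leak copies of the same sentence into both sets; splitting first and resampling only the training portion avoids that.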