# **Project 4: LLM Project Activity - Topic Modeling**
### **Week 23** 2-Representation
Apply tokenization and text representation methods in the project. (9.3)

- Input text loading and preprocessing were completed in notebook 1_preprocessing.
- Tokenization was also evaluated and applied in notebook 1_preprocessing (text was cleaned, normalized, and split on spaces). This notebook adds a count of the documents in the dataset, the total words, and the vocabulary size after the notebook 1 tokenization.
- Next, for text representation, the text is vectorized with the chosen method, TF-IDF, producing a vectorized feature matrix. The notebook 1 token lists must first be joined back into strings, because TF-IDF takes strings as input and re-tokenizes them while building the feature matrix. The number of documents and total words is checked again at this stage for reference.
- Finally, a Logistic Regression ML model is built on these features and its performance is evaluated against the test data.

In [None]:
#Number of documents and total words vocabulary after tokenization (notebook 1_preprocessing) - documents, words, vocabulary size
#Train dataset
num_documents_train = len(ds_train)
all_tokens_train = [token for tokens in ds_train['text'] for token in tokens]
total_words_train = len(all_tokens_train)
vocab_size_train = len(set(all_tokens_train))

#Test dataset
num_documents_test = len(ds_test)
all_tokens_test = [token for tokens in ds_test['text'] for token in tokens]
total_words_test = len(all_tokens_test)
vocab_size_test = len(set(all_tokens_test))

print(f"Train documents: {num_documents_train}")
print(f"Train total words (tokens): {total_words_train}")
print(f"Train vocabulary size (unique words): {vocab_size_train}\n")

print(f"Test documents: {num_documents_test}")
print(f"Test total words (tokens): {total_words_test}")
print(f"Test vocabulary size (unique words): {vocab_size_test}")

Train documents: 11314
Train total words (tokens): 1166464
Train vocabulary size (unique words): 101919

Test documents: 7532
Test total words (tokens): 730769
Test vocabulary size (unique words): 73050


In [None]:
#Vectorization using TF-IDF to convert text into numerical features
from sklearn.feature_extraction.text import TfidfVectorizer

#Detokenize
train_texts = [' '.join(tokens) if isinstance(tokens, list) else tokens for tokens in new_ds['train']['text']]
test_texts = [' '.join(tokens) if isinstance(tokens, list) else tokens for tokens in new_ds['test']['text']]

#Create TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

#Fit on training data and transform
X_train_tfidf = tfidf_vectorizer.fit_transform(train_texts)
X_test_tfidf = tfidf_vectorizer.transform(test_texts)
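
The re-tokenization step can change the vocabulary. As a minimal sketch on made-up toy token lists (`toy_tokens` is hypothetical and not part of the project data), the example below joins tokens back into strings and fits a default `TfidfVectorizer`. With default settings the vectorizer lowercases text and keeps only tokens of two or more word characters, so single-character tokens such as "a" and "x" drop out of the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the notebook 1 token lists
toy_tokens = [
    ["a", "gpu", "is", "fast"],
    ["the", "gpu", "runs", "x", "windows"],
]

# Detokenize: join each token list back into one string per document
toy_texts = [" ".join(tokens) for tokens in toy_tokens]

# Default settings: lowercase=True, tokens must have at least 2 word characters
toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(toy_texts)

# Vocabulary learned by the vectorizer after its own tokenization
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
print(X_toy.shape)  # (documents, vocabulary size)
```

If single-character tokens matter for the task, the `token_pattern` parameter of `TfidfVectorizer` can be changed; whether that is worthwhile here is a judgment call.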

In [None]:
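#Re-check of dataset statistics (documents, words, vocabulary size) via a reusable helper
#Note: this counts the notebook 1 tokens in ds_train/ds_test, not the TF-IDF re-tokenized output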
def dataset_stats(ds):
    num_documents = len(ds)
    all_tokens = [token for tokens in ds['text'] for token in tokens]
    total_words = len(all_tokens)
    vocab_size = len(set(all_tokens))
    return num_documents, total_words, vocab_size

# For train set
train_num_docs, train_total_words, train_vocab_size = dataset_stats(ds_train)

# For test set
test_num_docs, test_total_words, test_vocab_size = dataset_stats(ds_test)

print(f"Train set - Documents: {train_num_docs}, Total words: {train_total_words}, Vocabulary size: {train_vocab_size}")
print(f"Test set  - Documents: {test_num_docs}, Total words: {test_total_words}, Vocabulary size: {test_vocab_size}")

Train set - Documents: 11314, Total words: 1166464, Vocabulary size: 101919
Test set  - Documents: 7532, Total words: 730769, Vocabulary size: 73050
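
The figures above match the earlier cell because the helper reads the notebook 1 tokens. To check the TF-IDF side instead, one option is to read the statistics from the fitted vectorizer and the feature matrix. The sketch below uses made-up toy strings, and `tfidf_stats` is a hypothetical helper name, not something defined in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_stats(texts, vectorizer, X):
    """Return (documents, total tokens, vocabulary size) as TF-IDF sees them."""
    # build_analyzer() returns the same preprocessing + tokenization callable
    # that the vectorizer applies internally
    analyzer = vectorizer.build_analyzer()
    num_documents = X.shape[0]
    total_words = sum(len(analyzer(text)) for text in texts)
    # Vocabulary is fixed at fit time, so it always reflects the training data
    vocab_size = len(vectorizer.vocabulary_)
    return num_documents, total_words, vocab_size

# Hypothetical toy strings standing in for the detokenized documents
demo_texts = ["the cat sat on the mat", "a dog sat"]
demo_vectorizer = TfidfVectorizer()
X_demo = demo_vectorizer.fit_transform(demo_texts)

demo_docs, demo_words, demo_vocab = tfidf_stats(demo_texts, demo_vectorizer, X_demo)
print(f"Documents: {demo_docs}, Total words: {demo_words}, Vocabulary size: {demo_vocab}")
print(f"Non-zero TF-IDF entries: {X_demo.nnz}")
```

In the project, the same call would presumably take `train_texts`, `tfidf_vectorizer`, and `X_train_tfidf`; the resulting counts may differ from the notebook 1 numbers because the tokenization rules differ.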


In [None]:
#Extract labels from dataset in preparation for build, train, test of Logistic Regression Model
y_train = ds_train['label'].tolist()
y_test = ds_test['label'].tolist()

In [None]:
#Build, train, test Logistic Regression Model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

#Create model
clf = LogisticRegression(max_iter=1000, random_state=42)
#Train model
clf.fit(X_train_tfidf, y_train)
#Predict on test data
y_pred = clf.predict(X_test_tfidf)
#Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))

Accuracy: 0.6795
              precision    recall  f1-score   support

           0       0.49      0.45      0.47       319
           1       0.60      0.70      0.65       389
           2       0.66      0.62      0.64       394
           3       0.66      0.62      0.64       392
           4       0.77      0.68      0.72       385
           5       0.80      0.70      0.75       395
           6       0.72      0.79      0.76       390
           7       0.77      0.70      0.73       396
           8       0.48      0.81      0.60       398
           9       0.81      0.79      0.80       397
          10       0.88      0.87      0.87       399
          11       0.86      0.67      0.75       396
          12       0.54      0.59      0.56       393
          13       0.76      0.76      0.76       396
          14       0.71      0.74      0.72       394
          15       0.64      0.79      0.70       398
          16       0.58      0.69      0.63       364
          

**Model Performance Summary**

Overall Accuracy: 68%

F1-scores: performance varies significantly across classes (e.g., classes 10 and 6 perform well, with F1 of roughly .87 and .76, while classes 18 and 0 perform worse, with F1 of roughly .49 and .47).

Results suggest possible class imbalance: some classes have fewer samples and lower F1 scores (e.g., class 19 had 251 samples and the lowest F1 of .26, whereas class 2 had 394 samples and an F1 of .64).
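
One way to examine the link between sample count and F1 is to pull per-class support and F1 into arrays and view them side by side. This is a minimal sketch on made-up toy labels (`toy_true` and `toy_pred` are hypothetical); in the project, `y_test` and `y_pred` would presumably take their place.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical toy labels: class 2 is rare and never predicted
toy_true = [0, 0, 0, 0, 1, 1, 1, 2]
toy_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# zero_division=0 scores a class as 0 when it has no predicted samples
precision, recall, f1, support = precision_recall_fscore_support(
    toy_true, toy_pred, zero_division=0
)

# Print classes from smallest to largest support to look for a pattern
for label in sorted(range(len(support)), key=lambda i: support[i]):
    print(f"class {label}: support={support[label]}, f1={f1[label]:.2f}")
```

A pattern where the smallest classes also score lowest would support the imbalance explanation, though similar topics between classes could produce low F1 as well.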

There is also potential for false positives: some classes show high recall but low precision (e.g., class 8, with recall .81 and precision .48), meaning many documents from other classes are predicted as that class.
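
A confusion matrix shows where those false positives come from. The sketch below uses made-up toy labels (`cm_true`, `cm_pred`, and `target` are hypothetical names) and derives precision and recall for one class from its column and row. In scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: class 1 is over-predicted
cm_true = [0, 1, 1, 2, 2, 2]
cm_pred = [1, 1, 1, 1, 2, 2]

cm = confusion_matrix(cm_true, cm_pred, labels=[0, 1, 2])

target = 1  # class to inspect
tp = int(cm[target, target])            # correctly predicted as target
fp = int(cm[:, target].sum()) - tp      # other classes predicted as target
fn = int(cm[target, :].sum()) - tp      # target predicted as another class

target_precision = tp / (tp + fp)
target_recall = tp / (tp + fn)

print(cm)
print(f"class {target}: precision={target_precision:.2f}, recall={target_recall:.2f}")
# Column entries off the diagonal show which true classes feed the false positives
print({true_label: int(cm[true_label, target]) for true_label in range(3) if true_label != target})
```

Applied to the project with `target = 8`, the off-diagonal entries of that column would indicate which classes are most often mistaken for class 8.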


