### Goal: analyze and classify messages, then develop a chatbot that responds to natural-language queries with meaningful insights.

Each record in the dataset contains:
- id_user = unique 'int' value
- timestamp = 'str' (convert it to a timestamp)
- source = 'str'
- message = 'str'

## Mount Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Installations

In [3]:
!pip install PyGithub

Collecting PyGithub
  Downloading PyGithub-2.6.1-py3-none-any.whl.metadata (3.9 kB)
Collecting pynacl>=1.4.0 (from PyGithub)
  Downloading PyNaCl-1.5.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl.metadata (8.6 kB)
Downloading PyGithub-2.6.1-py3-none-any.whl (410 kB)
Downloading PyNaCl-1.5.0-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (856 kB)
Installing collected packages: pynacl, PyGithub
Successfully installed PyGithub-2.6.1 pynacl-1.5.0


## Git Commit

In [19]:
# Step 1: Reset the environment by forcing a directory change to a safe location
import os
os.chdir('/content')


In [None]:
# Step 2: Clean up any existing problematic directories
!rm -rf /content/LLM_DataScientist_ChatBot
!rm -rf /content/repo


In [None]:
# Step 3: Mount Google Drive properly
from google.colab import drive
drive.mount('/content/drive', force_remount=True)


In [None]:
# Step 4: Set up the GitHub token correctly
# This needs to be done manually through Colab's secrets manager
# or directly in this cell (but be careful with security)
import getpass
import os

# Get token securely (won't be visible in the output)
github_token = getpass.getpass('Enter your GitHub token: ')
os.environ['GITHUB_TOKEN'] = github_token

# Verify the token is set
print(f"Token is set: {os.environ.get('GITHUB_TOKEN') is not None}")


In [None]:
# Step 5: Now clone the repository
!git clone https://github.com/alikova/LLM_DataScientist_ChatBot.git /content/clean_repo


In [None]:
# Step 6: Navigate to the repository
%cd /content/clean_repo


In [None]:
# Step 7: Configure Git
!git config --global user.name "alikova"
!git config --global user.email "z.alenka7@gmail.com"


In [None]:
# Step 8: List files in Google Drive to find the notebook
!ls -la "/content/drive/MyDrive/Colab Notebooks/"


In [None]:
# Step 9: Once you find the correct path, copy the notebook
# Update this path based on the ls command output
# !cp "/content/drive/MyDrive/Colab Notebooks/YOUR_ACTUAL_PATH/LLM_DataScientist_analyse_classify_chatbot.ipynb" .

# Step 10: Add, commit, and push using the token
# !git add LLM_DataScientist_analyse_classify_chatbot.ipynb
# !git commit -m "Add analysis and classification notebook"
# !git push https://$GITHUB_TOKEN@github.com/alikova/LLM_DataScientist_ChatBot.git main

## Data Preprocessing and Cleaning

In [None]:
import re

import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer

df = pd.read_csv("LLM-DataScientist-Task_Data.csv")

# Parse the string timestamps so they can be filtered by date later
df['timestamp'] = pd.to_datetime(df['timestamp'])

def clean_text(text):
  """Lowercase, strip punctuation, and collapse runs of whitespace."""
  text = str(text).lower()
  text = re.sub(r'[^\w\s]', '', text)
  # Collapse whitespace last, because removing punctuation can leave double spaces
  text = re.sub(r'\s+', ' ', text).strip()
  return text

# Apply cleaning function (missing messages become empty strings)
df['cleaned_message'] = df['message'].fillna('').apply(clean_text)

# Tokenization
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
df['tokens'] = df['cleaned_message'].apply(lambda x: tokenizer.tokenize(x))

df.to_csv("cleaned_LLM-DataScientist-Task_Data.csv", index=False)

## Message Vectorization with HuggingFace Transformers
- Use a pre-trained transformer model from HuggingFace to convert the text messages into numerical vectors.
- Encode the messages with BERT to obtain embeddings for later classification.
- Fine-tune the transformer model.
- Use the embeddings as features for downstream classification.

Evaluation metrics:
- accuracy, precision, recall, F1-score, and confusion matrix
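As a rough illustration of what these metrics measure, the sketch below computes accuracy and per-class precision, recall, and F1 from (true, predicted) label pairs using only the standard library. The helper names and the toy labels are invented for this example; the notebook itself relies on `sklearn.metrics` for the real evaluation.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter(zip(y_true, y_pred))

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    counts = confusion_counts(y_true, y_pred)
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = counts[(label, label)]
        fp = sum(n for (t, p), n in counts.items() if p == label and t != label)
        fn = sum(n for (t, p), n in counts.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

# Toy labels, invented for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
for label, (p, r, f1) in per_class_metrics(y_true, y_pred).items():
    print(f"class {label}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Per-class numbers matter here because the message categories are unlikely to be balanced, so a high overall accuracy can hide a class the model rarely gets right.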

In [None]:
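import torch
from sklearn.model_selection import train_test_split
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)

# ASSUMPTION: df has an integer 'label' column with values 0-4.
# The raw dataset has no labels, so messages must be annotated before fine-tuning.

# Load pre-trained BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=5)

# Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

class MessageDataset(torch.utils.data.Dataset):
    """Tokenized messages plus labels, in the item format Trainer expects."""

    def __init__(self, texts, labels):
        self.encodings = tokenizer(list(texts), padding="max_length", truncation=True)
        self.labels = list(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(values[idx]) for key, values in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

# Split into train/test sets, then tokenize each split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_data = MessageDataset(train_df['cleaned_message'], train_df['label'])
test_data = MessageDataset(test_df['cleaned_message'], test_df['label'])

# Set up the Trainer
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=test_data
)

# Train the model
trainer.train()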


In [None]:
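from sklearn.metrics import classification_report, confusion_matrix

# Get predictions
predictions = trainer.predict(test_data)
predicted_labels = predictions.predictions.argmax(axis=-1)
# Ground-truth labels returned by Trainer.predict (requires a labeled eval dataset)
true_labels = predictions.label_ids

# Evaluation metrics: compare true labels with predicted labels
print(classification_report(true_labels, predicted_labels))
print(confusion_matrix(true_labels, predicted_labels))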


## Message Categorization and Filtering

- Categories: once the model is trained, use its output (the predicted category) to classify each message.
- Filtering: filter the messages by time range and source before feeding them into the model.
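The categorization step can be sketched as follows: the classifier emits one raw score (logit) per category, and the highest-scoring category becomes the prediction. The function names, the `min_confidence` threshold, and the `"uncertain"` fallback are assumptions made for this sketch, and the label list only contains the three categories named in the notebook; the real list has to match the model's `num_labels`.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_category(logits, labels, min_confidence=0.5):
    """Return (label, probability); fall back to 'uncertain' below the threshold."""
    if len(logits) != len(labels):
        raise ValueError("one logit per label is required")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_confidence:
        return "uncertain", probs[best]
    return labels[best], probs[best]

# Categories named in the notebook; the logits are invented for illustration
labels = ["login issues", "game issues", "payment issues"]
category, confidence = to_category([2.0, 0.1, -1.0], labels)
print(category, round(confidence, 3))
```

Keeping the probability alongside the label makes it possible to route low-confidence messages to manual review instead of forcing them into a category.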

In [None]:
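# Assuming categories are ['login issues', 'game issues', 'payment issues', etc.]
# BertForSequenceClassification has no .predict(); run inference through a pipeline instead.
# Labels come back as 'LABEL_0', 'LABEL_1', ... unless model.config.id2label is set.
from transformers import pipeline
classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)
df['predicted_category'] = [r['label'] for r in classifier(df['cleaned_message'].tolist(), truncation=True)]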


In [None]:
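# Compare parsed datetimes rather than raw strings (timestamp is a 'str' in the raw data)
timestamps = pd.to_datetime(df['timestamp'])
df_filtered = df[(timestamps >= '2023-01-01') & (df['source'] == 'telegram')]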


## Building the Chatbot PetE - Pete Bot

In [None]:
def chatbot_response(query):
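    # NOTE: the extract_* helpers, filter_data, and generate_response are placeholders
    # that must be defined before this cell can run.
    # Process the query (e.g., extract the category, time range, and source)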
    category = extract_category_from_query(query)
    time_range = extract_time_range_from_query(query)
    source = extract_source_from_query(query)

    # Filter data based on the query
    filtered_data = filter_data(df, category, time_range, source)

    # Return the response
    return generate_response(filtered_data)

# Example function call
response = chatbot_response("Show me all login issues from the last month.")
print(response)
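The three `extract_*` helpers called above are not defined in the notebook. A minimal keyword-based sketch of them might look like the following. The keyword map, the list of known sources, and the 30-day approximation of a month are all assumptions made for illustration; `filter_data` and `generate_response` are left out because they depend on the final DataFrame layout.

```python
import re
from datetime import datetime, timedelta

# Hypothetical keyword map; the category names come from the notebook comment
CATEGORY_KEYWORDS = {
    "login issues": ("login", "log in", "sign in", "password"),
    "game issues": ("game", "crash", "lag"),
    "payment issues": ("payment", "deposit", "refund"),
}
# Only source value that appears in the notebook; extend to match the data
KNOWN_SOURCES = ("telegram",)

def extract_category_from_query(query):
    """Return the first category whose keywords appear in the query, else None."""
    text = query.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(word in text for word in keywords):
            return category
    return None

def extract_source_from_query(query):
    """Return the first known source mentioned in the query, else None."""
    text = query.lower()
    return next((s for s in KNOWN_SOURCES if s in text), None)

def extract_time_range_from_query(query, now=None):
    """Return (start, end) for phrases like 'last month' or 'last 7 days', else None."""
    now = now or datetime.now()
    match = re.search(r"last\s+(\d+\s+)?(day|week|month)s?", query.lower())
    if not match:
        return None
    count = int(match.group(1)) if match.group(1) else 1
    days = {"day": 1, "week": 7, "month": 30}[match.group(2)]  # month approximated as 30 days
    return now - timedelta(days=count * days), now

query = "Show me all login issues from the last month."
print(extract_category_from_query(query))
print(extract_time_range_from_query(query, now=datetime(2024, 3, 31)))
print(extract_source_from_query(query))
```

Substring matching is crude (for example, "lag" also matches "flag"), so a production version would more plausibly reuse the fine-tuned classifier or an LLM to interpret the query.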


## Deploy and Tests

### Test

- Test the model with different messages and evaluate its accuracy.

### Deploy

- Deploy the finished model and chatbot to a server or cloud platform.

### Document

- Update the GitHub repository with detailed documentation, including setup instructions, usage, and how to interact with the chatbot.