<a href="https://colab.research.google.com/github/armandossrecife/mysentimentanalysis/blob/main/my_automatic_inspection_issues_in_hf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prerequisites

## Dependencies

### Install dependencies

- datasets from Hugging Face
- transformers from Hugging Face
- torch
- accelerate
- nltk

In [1]:
!pip -q install datasets

In [2]:
!pip -q install transformers[torch]

In [3]:
!pip -q install accelerate -U

In [4]:
!pip -q install nltk

### Import dependencies


- torch
- pandas
- numpy
- transformers
- sklearn
- datasets
- json
- string
- nltk

In [5]:
import torch
import pandas as pd
import numpy as np

from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments
from transformers import AutoTokenizer

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold

from datasets import load_dataset
import json

import string
from urllib.parse import urlparse
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [6]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [7]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [8]:
def truncate_string(text, max_length=100, add_ellipsis=True):
  # Return the text unchanged if it fits; otherwise cut it at max_length
  if len(text) <= max_length:
    return text

  truncated_text = text[:max_length]

  if add_ellipsis:
    truncated_text += "..."

  return truncated_text

def to_lowercase(text):
  return text.lower()

def remove_hyperlinks(text):
  # Drop tokens that urlparse recognises as having a URL scheme
  tokens = nltk.word_tokenize(text)
  filtered_tokens = [token for token in tokens if not urlparse(token).scheme]
  return ' '.join(filtered_tokens)

def remove_punctuation(text):
  # Delete every character listed in string.punctuation
  translator = str.maketrans('', '', string.punctuation)
  return text.translate(translator)

def remove_stopwords(text):
  # Drop English stop words (expects lowercased text)
  stop_words = set(stopwords.words('english'))
  words = text.split()
  filtered_words = [word for word in words if word not in stop_words]
  return ' '.join(filtered_words)

def preprocess_text(text):
  # Lowercase, strip hyperlinks, then remove stop words
  text = to_lowercase(text)
  text = remove_hyperlinks(text)
  #text = remove_punctuation(text)  # punctuation removal is currently disabled
  text = remove_stopwords(text)
  return text
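A caveat on `remove_hyperlinks`: `nltk.word_tokenize` may split a URL such as `https://example.org/x` into several tokens before `urlparse` sees it, in which case the scheme check would miss it. The sketch below is one possible alternative, assuming it is acceptable to drop any whitespace-delimited token that starts with `http://`, `https://`, or `www.`. The name `strip_urls` and the pattern are illustrative and are not part of the notebook.

```python
import re

# Hypothetical helper (not in the original notebook): remove URL-like tokens
# before any tokenization happens.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def strip_urls(text):
    # Replace each URL with a space, then collapse repeated whitespace
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())

print(strip_urls("Patch pushed to https://github.com/example/repo for review"))
# Patch pushed to for review
```

If this behaviour is wanted, `strip_urls` could be called in `preprocess_text` in place of `remove_hyperlinks`; whether that improves the classifier is untested here.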

# A) Create the model (based on the Hugging Face DistilBERT example)

In [9]:
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
task = "issue-analysis"
MY_HUGGING_FACE_DATASET = "armandoufpi/cassandraissuesgroundtruth"

In [10]:
# Load pre-trained DistilBERT model and tokenizer
model = DistilBertForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


## Dataset from my Hugging Face account

https://huggingface.co/datasets/armandoufpi/cassandraissuesgroundtruth

In [11]:
# Dataset from my Hugging Face account
splits = {'train': 'train.jsonl', 'test': 'test.jsonl'}
df_treino = pd.read_json("hf://datasets/armandoufpi/cassandraissuesgroundtruth/" + splits["train"])
df_teste = pd.read_json("hf://datasets/armandoufpi/cassandraissuesgroundtruth/" + splits["test"])

In [12]:
df_treino['SummaryDescriptionComments']= df_treino.apply(lambda row: row['summary'] + ' ' + row['description'] + ' ' + row['comments_text'],axis=1).values
df_treino['processed_text'] = df_treino['SummaryDescriptionComments'].apply(preprocess_text)

df_teste['SummaryDescriptionComments']= df_teste.apply(lambda row: row['summary'] + ' ' + row['description'] + ' ' + row['comments_text'],axis=1).values
df_teste['processed_text'] = df_teste['SummaryDescriptionComments'].apply(preprocess_text)

In [13]:
df_treino

Unnamed: 0,issue_key,summary,issue_type,issue_status,issue_priority,description,comments,architectural_impact,comments_text,label,label_text,SummaryDescriptionComments,processed_text
0,CASSANDRA-3489,EncryptionOptions should be instantiated,Bug,Resolved,Low,"As the title says, otherwise you get an NPE wh...","['There\'s a bunch of ""if encryption options i...",NO,"There\'s a bunch of ""if encryption options is ...",0,negative,EncryptionOptions should be instantiated As th...,"encryptionoptions instantiated title says , ot..."
1,CASSANDRA-16780,Log when writing many tombstones to a partition,Improvement,Resolved,Normal,Log when writing many tombstones to a partitio...,['https://github.com/krummas/cassandra/commits...,NO,https://github.com/krummas/cassandra/commits/m...,0,negative,Log when writing many tombstones to a partitio...,log writing many tombstones partition log writ...
2,CASSANDRA-5426,Redesign repair messages,Improvement,Resolved,Low,Many people have been reporting 'repair hang' ...,['Work in progress is pushed to: https://githu...,YES,https://github.com/yukim/cassandra/commits/542...,1,positive,Redesign repair messages Many people have been...,redesign repair messages many people reporting...
3,CASSANDRA-5121,system.peers.tokens is empty after node restart,Bug,Resolved,Low,Using a 2 nodes fresh cluster (127.0.0.1 & 127...,"['In StorageService.handleStateNormal, when we...",NO,removeEndpoint should be used instead\n [ju...,0,negative,system.peers.tokens is empty after node restar...,system.peers.tokens empty node restart using 2...
4,CASSANDRA-11944,sstablesInBounds might not actually give all s...,Bug,Resolved,Normal,Same problem as with CASSANDRA-11886 - if we t...,['https://github.com/krummas/cassandra/commits...,YES,https://github.com/krummas/cassandra/commits/m...,1,positive,sstablesInBounds might not actually give all s...,sstablesinbounds might actually give sstables ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,CASSANDRA-18617,Disable the deprecated keyspace/table threshol...,Improvement,Resolved,Normal,The non-guardrail thresholds 'keyspace_count_w...,"[""Part of this change is to add converters tha...",YES,\xa0[https://github.com/apache/cassandra/pull/...,1,positive,Disable the deprecated keyspace/table threshol...,disable deprecated keyspace/table thresholds c...
196,CASSANDRA-5244,Compactions don't work while node is bootstrap...,Bug,Resolved,Urgent,It seems that there is a race condition in Sto...,"[""Thanks for the detective work, Jouni. I'll ...",NO,BLOCKED (on object monitor)\n at org.apache...,0,negative,Compactions don't work while node is bootstrap...,compactions n't work node bootstrapping seems ...
197,CASSANDRA-173,add getPendingTasks to CFSMBean,Improvement,Resolved,Low,need to add an atomicint and inc/decr it whene...,['rebased patch as 0001-CASSANDRA-173-added-CF...,NO,rebased patch as 0001-CASSANDRA-173-added-CFS-...,0,negative,add getPendingTasks to CFSMBean need to add an...,add getpendingtasks cfsmbean need add atomicin...
198,CASSANDRA-359,CFS readStats_ and diskReadStats_ are missing,Bug,Resolved,Normal,There is no description,"[""shouldn't we also get rid of getReadDiskHits...",NO,"[""shouldn't we also get rid of getReadDiskHits...",0,negative,CFS readStats_ and diskReadStats_ are missing ...,cfs readstats_ diskreadstats_ missing descript...


In [14]:
df_teste

Unnamed: 0,issue_key,summary,issue_type,issue_status,issue_priority,description,comments,architectural_impact,comments_text,label,label_text,SummaryDescriptionComments,processed_text
0,CASSANDRA-11944,sstablesInBounds might not actually give all s...,Bug,Resolved,Normal,Same problem as with CASSANDRA-11886 - if we t...,['https://github.com/krummas/cassandra/commits...,YES,https://github.com/krummas/cassandra/commits/m...,1,positive,sstablesInBounds might not actually give all s...,sstablesinbounds might actually give sstables ...
1,CASSANDRA-12988,make the consistency level for user-level auth...,Improvement,Resolved,Low,Most reads for the auth-related tables execute...,['Linked patch allows an operator to set the r...,YES,[Link|https://app.circleci.com/pipelines/githu...,1,positive,make the consistency level for user-level auth...,make consistency level user-level auth reads w...
2,CASSANDRA-15004,Anti-compaction briefly corrupts sstable state...,Bug,Resolved,Urgent,Since we use multiple sstable rewriters in ant...,['|[3.0|https://github.com/bdeggleston/cassand...,YES,not sure what is going on with the dtests thou...,1,positive,Anti-compaction briefly corrupts sstable state...,anti-compaction briefly corrupts sstable state...
3,CASSANDRA-15265,Index summary redistribution can start even wh...,Bug,Resolved,Normal,When we pause autocompaction for upgradesstabl...,['Patch adds a flag in `CompactionManager` whi...,YES,[3.0|https://circleci.com/workflow-run/8882a8a...,1,positive,Index summary redistribution can start even wh...,index summary redistribution start even compac...
4,CASSANDRA-18029,fix starting Paxos auto repair,Bug,Resolved,Normal,This test was not run in CI because of its nam...,['I fixed here what I could: [https://github.c...,YES,repaired}} rely on running regular/incremental...,1,positive,fix starting Paxos auto repair This test was n...,fix starting paxos auto repair test run ci nam...
5,CASSANDRA-18058,In-memory index and query path,New Feature,Resolved,Normal,An in-memory index using the in-memory trie st...,['The github PR for this ticket is here:\xa0\r...,YES,[https://app.circleci.com/pipelines/github/ade...,1,positive,In-memory index and query path An in-memory in...,in-memory index query path in-memory index usi...
6,CASSANDRA-18617,Disable the deprecated keyspace/table threshol...,Improvement,Resolved,Normal,The non-guardrail thresholds 'keyspace_count_w...,"[""Part of this change is to add converters tha...",YES,\xa0[https://github.com/apache/cassandra/pull/...,1,positive,Disable the deprecated keyspace/table threshol...,disable deprecated keyspace/table thresholds c...
7,CASSANDRA-1919,Add shutdownhook to flush commitlog,Improvement,Resolved,Low,this replaces the periodic_with_flush approach...,"[""The approach I took was to add a shutdownBlo...",YES,Could not create ServerSocket on address /127....,1,positive,Add shutdownhook to flush commitlog this repla...,add shutdownhook flush commitlog replaces peri...
8,CASSANDRA-414,remove sstableLock,Improvement,Resolved,Normal,There is no description,['rebased.\n\n02\n remove sstableLock. re-...,YES,the cleanup does happen. If it were the SSTR ...,1,positive,remove sstableLock There is no description the...,remove sstablelock description cleanup happen ...
9,CASSANDRA-5426,Redesign repair messages,Improvement,Resolved,Low,Many people have been reporting 'repair hang' ...,['Work in progress is pushed to: https://githu...,YES,https://github.com/yukim/cassandra/commits/542...,1,positive,Redesign repair messages Many people have been...,redesign repair messages many people reporting...


## Load the training and test data

In [15]:
# Load the dataset from the armandoufpi Hugging Face account
train_data = load_dataset(MY_HUGGING_FACE_DATASET, split="train")
test_data = load_dataset(MY_HUGGING_FACE_DATASET, split="test")

In [16]:
train_data

Dataset({
    features: ['summary', 'architectural_impact', 'comments_text', 'label_text', 'comments', 'issue_status', 'description', 'issue_priority', 'issue_type', 'issue_key', 'label'],
    num_rows: 200
})

In [17]:
test_data

Dataset({
    features: ['summary', 'architectural_impact', 'comments_text', 'label_text', 'comments', 'issue_status', 'description', 'issue_priority', 'issue_type', 'issue_key', 'label'],
    num_rows: 26
})

In [18]:
print(f"len(train_data['summary']): {len(train_data['summary'])}")
print(f"train_data['summary'][0]: {train_data['summary'][0]}")
print(f"train_data['label'][0]: {train_data['label'][0]}")
print(f"train_data['label_text'][0]: {train_data['label_text'][0]}")
print(f"train_data['description'][0]: {train_data['description'][0]}")

len(train_data['summary']): 200
train_data['summary'][0]: EncryptionOptions should be instantiated
train_data['label'][0]: 0
train_data['label_text'][0]: negative
train_data['description'][0]: As the title says, otherwise you get an NPE when the options are missing from the yaml.  It's included in my second patch on CASSANDRA-3045 and is a one line fix.


## Preprocess the training and test data

In [19]:
# Tokenize the issue description (alternative to preprocess_function)
def preprocess_function_description(examples):
  return tokenizer(examples["description"], padding="max_length", truncation=True)

# Tokenize the issue summary (applied to the train and test data)
def preprocess_function(examples):
  return tokenizer(examples["summary"], padding="max_length", truncation=True)

In [20]:
# Preprocess train and test data
train_data = train_data.map(preprocess_function, batched=True)
test_data = test_data.map(preprocess_function, batched=True)

Map:   0%|          | 0/26 [00:00<?, ? examples/s]
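Because `padding="max_length"` is passed without an explicit `max_length`, each summary should be padded to the model's maximum input length (512 tokens for DistilBERT), and `truncation=True` cuts anything longer. The toy sketch below illustrates that idea on a plain list of token ids. `pad_or_truncate` is a hypothetical helper, the ids are arbitrary, and the real tokenizer also takes care of special tokens when truncating, which this sketch ignores.

```python
# Toy illustration (hypothetical helper, not the tokenizer's real code) of
# what padding="max_length" combined with truncation=True does to one sequence.
def pad_or_truncate(token_ids, max_length, pad_id=0):
    # Truncate sequences that are too long
    ids = list(token_ids[:max_length])
    # 1 marks a real token, 0 marks padding
    attention_mask = [1] * len(ids)
    # Pad short sequences up to max_length
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, attention_mask + [0] * padding

ids, mask = pad_or_truncate([101, 7592, 102], max_length=6)
print(ids)   # [101, 7592, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0]
```

The two returned lists correspond to the `input_ids` and `attention_mask` columns that `map` adds to the datasets.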

In [21]:
train_data

Dataset({
    features: ['summary', 'architectural_impact', 'comments_text', 'label_text', 'comments', 'issue_status', 'description', 'issue_priority', 'issue_type', 'issue_key', 'label', 'input_ids', 'attention_mask'],
    num_rows: 200
})

In [22]:
test_data

Dataset({
    features: ['summary', 'architectural_impact', 'comments_text', 'label_text', 'comments', 'issue_status', 'description', 'issue_priority', 'issue_type', 'issue_key', 'label', 'input_ids', 'attention_mask'],
    num_rows: 26
})

In [23]:
# Access the 'input_ids' from the preprocessed data
#train_inputs = train_data["input_ids"]
#test_inputs = test_data["input_ids"]

## Train the model

In [24]:
!rm -rf results
!mkdir results
!ls -l

total 8
drwxr-xr-x 2 root root 4096 Aug 23 19:16 results
drwxr-xr-x 1 root root 4096 Aug 22 13:24 sample_data


In [25]:
training_args = TrainingArguments(
    output_dir="results",  # directory for checkpoints and outputs
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=16,  # batch size per device
    learning_rate=2e-5,
)

In [26]:
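# Compute accuracy from the logits and labels passed by the Trainer at evaluation time
def compute_metrics(eval_pred):
  logits, labels = eval_pred
  predictions = np.argmax(logits, axis=-1)
  return {"accuracy": accuracy_score(labels, predictions)}

# Create trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=test_data,
    compute_metrics=compute_metrics,  # must be a callable, not a string
)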

In [27]:
# Train the model
trainer.train()

Step,Training Loss


TrainOutput(global_step=39, training_loss=0.819644047663762, metrics={'train_runtime': 34.9939, 'train_samples_per_second': 17.146, 'train_steps_per_second': 1.114, 'total_flos': 79480439193600.0, 'train_loss': 0.819644047663762, 'epoch': 3.0})
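The reported `global_step=39` can be checked from the training configuration. The short calculation below assumes a single device and no gradient accumulation, neither of which the notebook states explicitly.

```python
import math

num_rows = 200          # training examples reported by train_data
batch_size = 16         # per_device_train_batch_size
num_epochs = 3          # num_train_epochs

# Assumes a single device and no gradient accumulation
steps_per_epoch = math.ceil(num_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 13 39
```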

## Make predictions with the trained model

In [28]:
# TODO: analyse the issue based on several fields at once
def analyse_issue(issue_field):
  inputs = tokenizer(issue_field, padding="max_length", truncation=True, return_tensors="pt")

  # Move the model and inputs to the GPU if available
  if torch.cuda.is_available():
    model.to('cuda')
    inputs = inputs.to('cuda')

  # Switch to evaluation mode so that dropout does not affect predictions
  model.eval()
  with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

  # Map the predicted class to the architectural-impact label
  if predictions.item() == 1:
    return "YES"
  else:
    return "NO"
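The TODO in the cell above mentions analysing an issue from several fields at once. One simple way to approach that, sketched here under the assumption that plain concatenation is good enough, is to join the chosen fields into one string before tokenizing. `combine_fields` and the example record are hypothetical and are not part of the notebook.

```python
# Hypothetical helper: build one input string from several issue fields.
def combine_fields(row, fields=("summary", "description")):
    # Skip missing or empty fields so they do not add stray separators
    parts = [str(row[f]).strip() for f in fields if row.get(f)]
    return " ".join(parts)

# Made-up record for illustration only
issue = {"summary": "Example summary", "description": "Example description.", "comments_text": ""}
print(combine_fields(issue))
# Example summary Example description.
print(combine_fields(issue, fields=("summary", "comments_text")))
# Example summary
```

The combined string could then be passed to `analyse_issue`. Since the tokenizer truncates at the model's maximum length, long descriptions or comments would be cut off, so the field order matters.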

### Test data

In [29]:
df_teste[['issue_key', 'summary', 'architectural_impact']]

Unnamed: 0,issue_key,summary,architectural_impact
0,CASSANDRA-11944,sstablesInBounds might not actually give all s...,YES
1,CASSANDRA-12988,make the consistency level for user-level auth...,YES
2,CASSANDRA-15004,Anti-compaction briefly corrupts sstable state...,YES
3,CASSANDRA-15265,Index summary redistribution can start even wh...,YES
4,CASSANDRA-18029,fix starting Paxos auto repair,YES
5,CASSANDRA-18058,In-memory index and query path,YES
6,CASSANDRA-18617,Disable the deprecated keyspace/table threshol...,YES
7,CASSANDRA-1919,Add shutdownhook to flush commitlog,YES
8,CASSANDRA-414,remove sstableLock,YES
9,CASSANDRA-5426,Redesign repair messages,YES


In [30]:
yes_count = df_teste.architectural_impact.to_list().count("YES")
no_count = df_teste.architectural_impact.to_list().count("NO")

# Print the counts
print("YES count:", yes_count)
print("NO count:", no_count)

YES count: 10
NO count: 16
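These counts give a useful reference point: a classifier that always predicts the majority class would already reach a certain accuracy on this test set. The short calculation below uses the counts printed above. The majority-class baseline is added here for context and is not something the notebook computes.

```python
yes_count = 10   # "YES" rows in the test set, from the output above
no_count = 16    # "NO" rows in the test set, from the output above

total = yes_count + no_count
majority_label = "YES" if yes_count >= no_count else "NO"
baseline_accuracy = max(yes_count, no_count) / total
print(majority_label, round(baseline_accuracy, 3))  # NO 0.615
```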


### Run the model on the test data (based only on the Summary field)

In [31]:
lista_analisa_summary_yes_no = []

for index, row in df_teste.iterrows():
  field = row['summary']
  issue_key = row['issue_key']
  summary = truncate_string(text=row['summary'], max_length=50)
  previsao = analyse_issue(issue_field=field)
  print(f"{issue_key}, {summary}, Architectural Impact:{previsao}")
  lista_analisa_summary_yes_no.append(previsao)

CASSANDRA-11944, sstablesInBounds might not actually give all sstab..., Architectural Impact:NO
CASSANDRA-12988, make the consistency level for user-level auth rea..., Architectural Impact:YES
CASSANDRA-15004, Anti-compaction briefly corrupts sstable state for..., Architectural Impact:NO
CASSANDRA-15265, Index summary redistribution can start even when c..., Architectural Impact:YES
CASSANDRA-18029, fix starting Paxos auto repair, Architectural Impact:NO
CASSANDRA-18058, In-memory index and query path, Architectural Impact:YES
CASSANDRA-18617, Disable the deprecated keyspace/table thresholds a..., Architectural Impact:YES
CASSANDRA-1919, Add shutdownhook to flush commitlog, Architectural Impact:NO
CASSANDRA-414, remove sstableLock, Architectural Impact:YES
CASSANDRA-5426, Redesign repair messages, Architectural Impact:YES
CASSANDRA-11540, The JVM should exit if jmx fails to bind, Architectural Impact:NO
CASSANDRA-6013, CAS may return false but still commit the insert, Architectural Imp

In [32]:
yes_count_summary = lista_analisa_summary_yes_no.count("YES")
no_count_summary = lista_analisa_summary_yes_no.count("NO")

# Print the counts
print("YES count:", yes_count_summary)
print("NO count:", no_count_summary)

YES count: 7
NO count: 19
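Counting YES and NO predictions shows the predicted class balance but not how many predictions are correct. A minimal sketch of comparing predictions against the ground truth follows. `yes_no_metrics` is a hypothetical helper and the labels in the example are made up, so the printed numbers say nothing about the notebook's model.

```python
# Hypothetical helper: binary classification metrics for "YES"/"NO" labels.
def yes_no_metrics(y_true, y_pred, positive="YES"):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels for illustration only, not the notebook's results
y_true = ["YES", "YES", "NO", "NO"]
y_pred = ["YES", "NO", "NO", "YES"]
print(yes_no_metrics(y_true, y_pred))
# {'accuracy': 0.5, 'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

In the notebook, the same function could be called as `yes_no_metrics(df_teste.architectural_impact.to_list(), lista_analisa_summary_yes_no)`. The `sklearn.metrics` functions imported earlier would serve equally well once the labels are mapped to 0 and 1.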
