<a href="https://colab.research.google.com/github/armandossrecife/mysentimentanalysis/blob/main/my_automatic_inspection_issues.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A) Inspection Tests for Cassandra Issues

## Install dependencies

- datasets from Hugging Face
- transformers Hugging Face
- torch
- accelerate
- nltk

In [334]:
!pip -q install datasets

In [335]:
!pip -q install transformers[torch]

In [336]:
!pip -q install accelerate -U

In [337]:
!pip -q install nltk

## Import dependencies


- torch
- pandas
- numpy
- transformers
- sklearn
- datasets
- json
- string
- nltk

In [338]:
import torch
import pandas as pd
import numpy as np

from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments
from transformers import AutoTokenizer

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold

from datasets import load_dataset
import json

import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from urllib.parse import urlparse

In [339]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [340]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [341]:
def truncate_string(text, max_length=100, add_ellipsis=True):
  if len(text) <= max_length:
    return text

  truncated_text = text[:max_length]

  if add_ellipsis:
    truncated_text += "..."

  return truncated_text

def to_lowercase(text):
  return text.lower()

def remove_hyperlinks(text):
  tokens = nltk.word_tokenize(text)
  filtered_tokens = [token for token in tokens if not urlparse(token).scheme]
  return ' '.join(filtered_tokens)

def remove_punctuation(text):
  translator = str.maketrans('', '', string.punctuation)
  return text.translate(translator)

def remove_stopwords(text):
  stop_words = set(stopwords.words('english'))
  words = text.split()
  filtered_words = [word for word in words if word not in stop_words]
  return ' '.join(filtered_words)

def preprocess_text(text):
  text = to_lowercase(text)
  text = remove_hyperlinks(text)
  #text = remove_punctuation(text)
  text = remove_stopwords(text)
  return text
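
In the processed text shown further down, URLs survive as fragments such as `https : //github.com/...`. A likely cause is that `nltk.word_tokenize` splits a URL into `https`, `:` and `//...` before `urlparse` sees it, so no single token carries a scheme. A minimal sketch of a regex-based alternative follows; `remove_hyperlinks_regex` and its pattern are assumptions for illustration and are not part of the original notebook.

```python
import re

# Hypothetical pattern: an http(s) or ftp URL, running up to the next whitespace.
_URL_PATTERN = re.compile(r'(?:https?|ftp)://\S+')

def remove_hyperlinks_regex(text):
    """Drop URLs before any tokenization, then collapse leftover whitespace."""
    without_urls = _URL_PATTERN.sub(' ', text)
    return ' '.join(without_urls.split())

print(remove_hyperlinks_regex(
    'patch at https://github.com/krummas/cassandra/commits/marcuse/16780 looks good'
))
```

Swapping this in for `remove_hyperlinks` inside `preprocess_text` would change the processed text, so the outputs in this notebook would no longer match.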

## Dataset from my Hugging Face account

https://huggingface.co/datasets/armandoufpi/cassandraissuesgroundtruth


In [342]:
splits = {'train': 'train.jsonl', 'test': 'test.jsonl'}
df_treino = pd.read_json("hf://datasets/armandoufpi/cassandraissuesgroundtruth/" + splits["train"])
df_teste = pd.read_json("hf://datasets/armandoufpi/cassandraissuesgroundtruth/" + splits["test"])

In [383]:
df_treino

Unnamed: 0,issue_key,summary,issue_type,issue_status,issue_priority,description,comments,architectural_impact,comments_text,label,label_text
0,CASSANDRA-3489,EncryptionOptions should be instantiated,Bug,Resolved,Low,"As the title says, otherwise you get an NPE wh...","['There\'s a bunch of ""if encryption options i...",NO,"There\'s a bunch of ""if encryption options is ...",0,negative
1,CASSANDRA-16780,Log when writing many tombstones to a partition,Improvement,Resolved,Normal,Log when writing many tombstones to a partitio...,['https://github.com/krummas/cassandra/commits...,NO,https://github.com/krummas/cassandra/commits/m...,0,negative
2,CASSANDRA-5426,Redesign repair messages,Improvement,Resolved,Low,Many people have been reporting 'repair hang' ...,['Work in progress is pushed to: https://githu...,YES,https://github.com/yukim/cassandra/commits/542...,1,positive
3,CASSANDRA-5121,system.peers.tokens is empty after node restart,Bug,Resolved,Low,Using a 2 nodes fresh cluster (127.0.0.1 & 127...,"['In StorageService.handleStateNormal, when we...",NO,removeEndpoint should be used instead\n [ju...,0,negative
4,CASSANDRA-11944,sstablesInBounds might not actually give all s...,Bug,Resolved,Normal,Same problem as with CASSANDRA-11886 - if we t...,['https://github.com/krummas/cassandra/commits...,YES,https://github.com/krummas/cassandra/commits/m...,1,positive
...,...,...,...,...,...,...,...,...,...,...,...
195,CASSANDRA-18617,Disable the deprecated keyspace/table threshol...,Improvement,Resolved,Normal,The non-guardrail thresholds 'keyspace_count_w...,"[""Part of this change is to add converters tha...",YES,\xa0[https://github.com/apache/cassandra/pull/...,1,positive
196,CASSANDRA-5244,Compactions don't work while node is bootstrap...,Bug,Resolved,Urgent,It seems that there is a race condition in Sto...,"[""Thanks for the detective work, Jouni. I'll ...",NO,BLOCKED (on object monitor)\n at org.apache...,0,negative
197,CASSANDRA-173,add getPendingTasks to CFSMBean,Improvement,Resolved,Low,need to add an atomicint and inc/decr it whene...,['rebased patch as 0001-CASSANDRA-173-added-CF...,NO,rebased patch as 0001-CASSANDRA-173-added-CFS-...,0,negative
198,CASSANDRA-359,CFS readStats_ and diskReadStats_ are missing,Bug,Resolved,Normal,There is no description,"[""shouldn't we also get rid of getReadDiskHits...",NO,"[""shouldn't we also get rid of getReadDiskHits...",0,negative


In [384]:
df_teste

Unnamed: 0,issue_key,summary,issue_type,issue_status,issue_priority,description,comments,architectural_impact,comments_text,label,label_text
0,CASSANDRA-11944,sstablesInBounds might not actually give all s...,Bug,Resolved,Normal,Same problem as with CASSANDRA-11886 - if we t...,['https://github.com/krummas/cassandra/commits...,YES,https://github.com/krummas/cassandra/commits/m...,1,positive
1,CASSANDRA-12988,make the consistency level for user-level auth...,Improvement,Resolved,Low,Most reads for the auth-related tables execute...,['Linked patch allows an operator to set the r...,YES,[Link|https://app.circleci.com/pipelines/githu...,1,positive
2,CASSANDRA-15004,Anti-compaction briefly corrupts sstable state...,Bug,Resolved,Urgent,Since we use multiple sstable rewriters in ant...,['|[3.0|https://github.com/bdeggleston/cassand...,YES,not sure what is going on with the dtests thou...,1,positive
3,CASSANDRA-15265,Index summary redistribution can start even wh...,Bug,Resolved,Normal,When we pause autocompaction for upgradesstabl...,['Patch adds a flag in `CompactionManager` whi...,YES,[3.0|https://circleci.com/workflow-run/8882a8a...,1,positive
4,CASSANDRA-18029,fix starting Paxos auto repair,Bug,Resolved,Normal,This test was not run in CI because of its nam...,['I fixed here what I could: [https://github.c...,YES,repaired}} rely on running regular/incremental...,1,positive
5,CASSANDRA-18058,In-memory index and query path,New Feature,Resolved,Normal,An in-memory index using the in-memory trie st...,['The github PR for this ticket is here:\xa0\r...,YES,[https://app.circleci.com/pipelines/github/ade...,1,positive
6,CASSANDRA-18617,Disable the deprecated keyspace/table threshol...,Improvement,Resolved,Normal,The non-guardrail thresholds 'keyspace_count_w...,"[""Part of this change is to add converters tha...",YES,\xa0[https://github.com/apache/cassandra/pull/...,1,positive
7,CASSANDRA-1919,Add shutdownhook to flush commitlog,Improvement,Resolved,Low,this replaces the periodic_with_flush approach...,"[""The approach I took was to add a shutdownBlo...",YES,Could not create ServerSocket on address /127....,1,positive
8,CASSANDRA-414,remove sstableLock,Improvement,Resolved,Normal,There is no description,['rebased.\n\n02\n remove sstableLock. re-...,YES,the cleanup does happen. If it were the SSTR ...,1,positive
9,CASSANDRA-5426,Redesign repair messages,Improvement,Resolved,Low,Many people have been reporting 'repair hang' ...,['Work in progress is pushed to: https://githu...,YES,https://github.com/yukim/cassandra/commits/542...,1,positive
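
Some issue keys, for example `CASSANDRA-11944` and `CASSANDRA-5426`, appear in both tables above. A small sketch for listing such overlaps is given here; `overlapping_keys` is a hypothetical helper, and the sample lists only copy keys visible in the two outputs.

```python
def overlapping_keys(train_keys, test_keys):
    """Return the issue keys present in both splits, sorted for stable output."""
    return sorted(set(train_keys) & set(test_keys))

# Keys copied from rows shown in the two tables above (not the full splits).
train_sample = ['CASSANDRA-3489', 'CASSANDRA-16780', 'CASSANDRA-5426',
                'CASSANDRA-5121', 'CASSANDRA-11944']
test_sample = ['CASSANDRA-11944', 'CASSANDRA-12988', 'CASSANDRA-15004',
               'CASSANDRA-5426']

shared = overlapping_keys(train_sample, test_sample)
print(shared)
```

On the real frames the call would be `overlapping_keys(df_treino['issue_key'], df_teste['issue_key'])`. Any shared keys become duplicated rows once the two splits are concatenated.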


### Load the dataset and apply the required processing (transformations)

In [343]:
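# Stack the train and test splits; each split keeps its own row labels, so index values repeat
dataset = pd.concat([df_treino, df_teste], axis=0)
# Map the numeric label to a textual class: 1 -> 'AI_Yes', 0 -> 'AI_No'
dataset['Textual_Type'] = 'AI_Yes'
dataset.loc[dataset['label']==0, 'Textual_Type'] = 'AI_No'
# Join summary, description and comments into a single text field per issue
dataset['SummaryDescriptionComments']= dataset.apply(lambda row: row['summary'] + ' ' + row['description'] + ' ' + row['comments_text'],axis=1).values
# Apply preprocess_text (lowercasing, hyperlink filtering, stopword removal)
dataset['processed_text'] = dataset['SummaryDescriptionComments'].apply(preprocess_text)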

In [344]:
dataset.head()

Unnamed: 0,issue_key,summary,issue_type,issue_status,issue_priority,description,comments,architectural_impact,comments_text,label,label_text,Textual_Type,SummaryDescriptionComments,processed_text
0,CASSANDRA-3489,EncryptionOptions should be instantiated,Bug,Resolved,Low,"As the title says, otherwise you get an NPE wh...","['There\'s a bunch of ""if encryption options i...",NO,"There\'s a bunch of ""if encryption options is ...",0,negative,AI_No,EncryptionOptions should be instantiated As th...,"encryptionoptions instantiated title says , ot..."
1,CASSANDRA-16780,Log when writing many tombstones to a partition,Improvement,Resolved,Normal,Log when writing many tombstones to a partitio...,['https://github.com/krummas/cassandra/commits...,NO,https://github.com/krummas/cassandra/commits/m...,0,negative,AI_No,Log when writing many tombstones to a partitio...,log writing many tombstones partition log writ...
2,CASSANDRA-5426,Redesign repair messages,Improvement,Resolved,Low,Many people have been reporting 'repair hang' ...,['Work in progress is pushed to: https://githu...,YES,https://github.com/yukim/cassandra/commits/542...,1,positive,AI_Yes,Redesign repair messages Many people have been...,redesign repair messages many people reporting...
3,CASSANDRA-5121,system.peers.tokens is empty after node restart,Bug,Resolved,Low,Using a 2 nodes fresh cluster (127.0.0.1 & 127...,"['In StorageService.handleStateNormal, when we...",NO,removeEndpoint should be used instead\n [ju...,0,negative,AI_No,system.peers.tokens is empty after node restar...,system.peers.tokens empty node restart using 2...
4,CASSANDRA-11944,sstablesInBounds might not actually give all s...,Bug,Resolved,Normal,Same problem as with CASSANDRA-11886 - if we t...,['https://github.com/krummas/cassandra/commits...,YES,https://github.com/krummas/cassandra/commits/m...,1,positive,AI_Yes,sstablesInBounds might not actually give all s...,sstablesinbounds might actually give sstables ...


### Key attributes

In [345]:
dataset[['issue_key', 'summary', 'description', 'comments_text', 'label', 'Textual_Type']]

Unnamed: 0,issue_key,summary,description,comments_text,label,Textual_Type
0,CASSANDRA-3489,EncryptionOptions should be instantiated,"As the title says, otherwise you get an NPE wh...","There\'s a bunch of ""if encryption options is ...",0,AI_No
1,CASSANDRA-16780,Log when writing many tombstones to a partition,Log when writing many tombstones to a partitio...,https://github.com/krummas/cassandra/commits/m...,0,AI_No
2,CASSANDRA-5426,Redesign repair messages,Many people have been reporting 'repair hang' ...,https://github.com/yukim/cassandra/commits/542...,1,AI_Yes
3,CASSANDRA-5121,system.peers.tokens is empty after node restart,Using a 2 nodes fresh cluster (127.0.0.1 & 127...,removeEndpoint should be used instead\n [ju...,0,AI_No
4,CASSANDRA-11944,sstablesInBounds might not actually give all s...,Same problem as with CASSANDRA-11886 - if we t...,https://github.com/krummas/cassandra/commits/m...,1,AI_Yes
...,...,...,...,...,...,...
21,CASSANDRA-6706,Duplicate rows returned when in clause has rep...,If a value is repeated within an IN clause the...,"[""That is kind of the intended behavior. Is it...",0,AI_No
22,CASSANDRA-6962,examine shortening path length post-5202,From CASSANDRA-5202 discussion:\n\n{quote}\nDi...,feels pretty error prone. What about keeping t...,0,AI_No
23,CASSANDRA-6972,Throw an ERROR when auto_bootstrap: true and b...,Obviously when this condition exists the node ...,false in their seed configs.' 'Yes the right f...,0,AI_No
24,CASSANDRA-758,support wrapped range queries,we want to support scanning from KeyX to KeyA ...,add wrapped range support + test' '+1 Looks go...,0,AI_No


In [346]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 226 entries, 0 to 25
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   issue_key                   226 non-null    object
 1   summary                     226 non-null    object
 2   issue_type                  226 non-null    object
 3   issue_status                226 non-null    object
 4   issue_priority              226 non-null    object
 5   description                 226 non-null    object
 6   comments                    226 non-null    object
 7   architectural_impact        226 non-null    object
 8   comments_text               226 non-null    object
 9   label                       226 non-null    int64 
 10  label_text                  226 non-null    object
 11  Textual_Type                226 non-null    object
 12  SummaryDescriptionComments  226 non-null    object
 13  processed_text              226 non-null    object
dtypes: int64(1), object(13)

In [347]:
minhas_colunas = ['issue_key', 'SummaryDescriptionComments', 'processed_text', 'Textual_Type']
dataset2 = dataset[minhas_colunas]
dataset2.head()

Unnamed: 0,issue_key,SummaryDescriptionComments,processed_text,Textual_Type
0,CASSANDRA-3489,EncryptionOptions should be instantiated As th...,"encryptionoptions instantiated title says , ot...",AI_No
1,CASSANDRA-16780,Log when writing many tombstones to a partitio...,log writing many tombstones partition log writ...,AI_No
2,CASSANDRA-5426,Redesign repair messages Many people have been...,redesign repair messages many people reporting...,AI_Yes
3,CASSANDRA-5121,system.peers.tokens is empty after node restar...,system.peers.tokens empty node restart using 2...,AI_No
4,CASSANDRA-11944,sstablesInBounds might not actually give all s...,sstablesinbounds might actually give sstables ...,AI_Yes


In [386]:
dataset2['processed_text'][0]

Unnamed: 0,processed_text
0,"encryptionoptions instantiated title says , otherwise get npe options missing yaml . 's included second patch cassandra-3045 one line fix . there\ 's bunch `` encryption options null ignore '' special cases already you\ 're going instantiate default instead let\ 's get rid those.\n\nmay also need applied 0.8 unless aforesaid special cases cover everything . ' ' could find special case added first time fixed back 0.8 cassandra-3007 . attached patch removes instantiates default instead . ' `` hmm . thought place otc 's going npe current code base . +1 patch . '' `` ( checked 0.8 otc null check . 're good . ) '' 'committed ."
0,"sstablesinbounds might actually give sstables within bounds due start positions moved sstables problem cassandra-11886 - try fetch sstablesinbounds canonical_sstables , miss actually overlapping sstables . 3.0+ state sstableset want calling method . looks like issue could cause include many sstables compactions think contain droppable tombstones https : //github.com/krummas/cassandra/commits/marcuse/intervaltreesstableset\nhttp : //cassci.datastax.com/view/dev/view/krummas/job/krummas-marcuse-intervaltreesstableset-testall/\nhttp : //cassci.datastax.com/view/dev/view/krummas/job/krummas-marcuse-intervaltreesstableset-dtest/\n\npatch remove option pick sstableset want returned live sstables supported . want canonical sstables within bounds provide intervaltree built sstables.\n\nalso includes cassandra-11886 part might change depending review ticket ' ' [ ~benedict ] - bandwidth review well along w/cassandra-11886 ? ' 'life busy right sure ... ' 'no doubt - hope context would similar enough 11886 delta would pretty small add top . looking make habit . : ) ' 'habits imply future date life hopefully hectic . currently mid-demolition rebuild home eating free non-free time alike . ' `` 'm little confused patch jira comment - n't see ( branch ) removal option provide sstableset ... '' 'maybe looking wrong commit ? rebased squashed [ here|https : //github.com/krummas/cassandra/commits/marcuse/intervaltreesstableset ] \n { code } \n- public collection < sstablereader > getoverlappingsstables ( sstableset sstableset iterable < sstablereader > sstables ) \n+ public collection < sstablereader > getoverlappingsstables ( iterable < sstablereader > sstables ) \n { code } ' `` probably - earlier commits seemed ticket . thanks.\n\ni 'll proper read later would suggest renaming methods include implicit sstableset.live 's still minimally front-and-centre functionality used perhaps renaming select selectlive sstablesinbounds perhaps inboundslive ( sstables ) ? 
'' 'pushed new commit method renames branch triggered new cassci builds ' 'ping [ ~benedict ] ' '+1 ' 'committed thanks !"
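
The lookup above returns two rows because `pd.concat` keeps each split's original row labels, so label `0` exists once per split. A small sketch with toy frames (the text values are placeholders) shows the effect and two ways to get a single row.

```python
import pandas as pd

train = pd.DataFrame({'processed_text': ['train row a', 'train row b']})
test = pd.DataFrame({'processed_text': ['test row a']})

# Default concat keeps labels 0, 1 from train and 0 from test.
combined = pd.concat([train, test], axis=0)
label_lookup = combined['processed_text'][0]     # label-based: Series with two entries
first_row = combined['processed_text'].iloc[0]   # positional: a single string

# ignore_index=True relabels the rows 0..n-1, so labels are unique again.
reindexed = pd.concat([train, test], axis=0, ignore_index=True)
print(len(label_lookup), first_row, reindexed.index.is_unique)
```

Whether to use `.iloc` or `ignore_index=True` depends on whether the original per-split row labels are still needed later.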


## Configure the environment for the AI model

In [349]:
model_name = 'distilbert-base-uncased'
# Use the GPU when one is available; otherwise fall back to the CPU
device_name = 'cuda' if torch.cuda.is_available() else 'cpu'
max_length = 512
cached_model_directory_name = 'distilbert-ehbugs'

In [350]:
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)

In [351]:
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
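
With `max_length = 512`, any issue text longer than the limit has to be cut before it reaches the model, and shorter texts are typically padded. The following is a conceptual sketch of that step using the special-token IDs from the tokenizer printout above (`[PAD]` = 0, `[CLS]` = 101, `[SEP]` = 102). `pad_or_truncate` is a hypothetical helper and the body token IDs are placeholders; this is not the tokenizer's actual implementation.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # IDs taken from the tokenizer printout

def pad_or_truncate(token_ids, max_length):
    """Wrap token IDs in [CLS]/[SEP], cut on the right, then pad on the right."""
    body = token_ids[:max_length - 2]        # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)    # 1 marks real tokens
    padding = max_length - len(input_ids)
    input_ids += [PAD_ID] * padding
    attention_mask += [0] * padding          # 0 marks padding positions
    return input_ids, attention_mask

ids, mask = pad_or_truncate([2000, 2001, 2002], max_length=8)
print(ids)
print(mask)
```

Right-side cutting and padding mirror the `truncation_side='right'` and `padding_side='right'` settings in the printout. Long issue threads therefore lose their tail, which is usually the latest comments.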

## Helper class for handling the dataset

In [352]:
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

def cap_number(x):
    # Clamp a raw model output to the [0, 1] interval
    if x > 1:
        return 1
    elif x < 0:
        return 0
    else:
        return x

def compute_metrics(pred):
    labels = pred.label_ids
    # preds = pred.predictions.argmax(-1)
    # Treat each prediction as one score per example, clamp it to [0, 1],
    # and label it positive when the score exceeds 0.5
    outputs = pred.predictions.flatten().tolist()
    probas = [cap_number(x) for x in outputs]
    preds = np.array(np.array(probas) > 0.5, dtype=int)
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
    }
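
`compute_metrics` reports accuracy only, although precision, recall and F1 are imported earlier. As an illustration, the same clamp-and-threshold step can feed those metrics too. The sketch below uses plain Python with made-up scores and labels; `threshold_scores` and `precision_recall_f1` are hypothetical names.

```python
def threshold_scores(scores, threshold=0.5):
    """Clamp each score to [0, 1], then mark it 1 when above the threshold."""
    clamped = [min(1.0, max(0.0, s)) for s in scores]
    return [1 if s > threshold else 0 for s in clamped]

def precision_recall_f1(labels, preds):
    """Binary precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

scores = [-0.2, 0.4, 0.7, 1.3]   # made-up raw model outputs
labels = [0, 1, 1, 1]            # made-up ground truth
preds = threshold_scores(scores)
print(preds, precision_recall_f1(labels, preds))
```

Because 0.5 lies inside [0, 1], clamping never moves a score across the threshold; it only matters if the clamped values are reused as probabilities elsewhere.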

In [353]:
!echo 'Creating the results folder'
!rm -rf results
!mkdir results
!echo 'Creating the logs folder'
!rm -rf logs
!mkdir logs

Creating the results folder
Creating the logs folder


## Configure the training arguments

In [354]:
training_args = TrainingArguments(
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    learning_rate=5e-5,              # initial learning rate for Adam optimizer
    warmup_steps=100,                # number of warmup steps for learning rate scheduler (set lower because of small dataset size)
    weight_decay=0.01,               # strength of weight decay
    output_dir='/content/results',   # output directory
    logging_dir='/content/logs',     # directory for storing logs
    logging_steps=150,               # number of steps to output logging (set lower because of small dataset size)
    evaluation_strategy='steps',     # evaluate during fine-tuning so that we can see progress
)
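
The comments on `warmup_steps` and `logging_steps` refer to the small dataset, so it can help to compare both values with the number of optimizer steps a run actually takes. This is a back-of-the-envelope sketch. It assumes a single device, no gradient accumulation, and a training fold of about four fifths of the 226 rows reported by `dataset.info()`; none of these are stated explicitly in this section.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps for one run, assuming one device and no gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs

# Assumption: each fold trains on roughly 4/5 of the 226 rows.
train_rows = round(226 * 4 / 5)
steps = total_optimizer_steps(train_rows, batch_size=16, epochs=3)
print(train_rows, steps)
```

Under these assumptions a run takes 36 steps, fewer than both `warmup_steps=100` and `logging_steps=150`. The learning-rate warmup would then never complete and step-based logging would likely never trigger, so lowering both values may be worth considering.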



In [355]:
unique_labels = {'AI_Yes', 'AI_No'}
label2id = {'AI_No': 0, 'AI_Yes': 1}
id2label = {0: 'AI_No', 1: 'AI_Yes'}

## Create a StratifiedKFold

Stratified K-Fold cross-validator.

Provides train/test indices to split data in train/test sets.

This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

In [356]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=51)
X, y = dataset['processed_text'], dataset['Textual_Type']
skf.get_n_splits(X, y)
folds = {}
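
To make the idea of stratification concrete, the sketch below assigns the samples of each class to folds in round-robin order, so every fold receives a near-equal share of each class. It is a simplified, unshuffled illustration in plain Python, not scikit-learn's algorithm; `stratified_fold_ids` is a hypothetical helper and the labels are toy data.

```python
from collections import Counter, defaultdict

def stratified_fold_ids(labels, n_splits):
    """Assign each sample a fold id so class proportions stay similar across folds."""
    fold_of = [None] * len(labels)
    seen_per_class = defaultdict(int)
    for idx, label in enumerate(labels):
        # Round-robin within each class spreads that class evenly over the folds.
        fold_of[idx] = seen_per_class[label] % n_splits
        seen_per_class[label] += 1
    return fold_of

labels = ['AI_Yes'] * 10 + ['AI_No'] * 5   # toy labels, not the real dataset
fold_of = stratified_fold_ids(labels, n_splits=5)
for fold in range(5):
    members = [labels[i] for i, f in enumerate(fold_of) if f == fold]
    print(fold, Counter(members))
```

With the `skf` object above, `skf.split(X, y)` yields one `(train_index, test_index)` pair per fold, which is the usual way to populate a dictionary such as `folds`.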

In [357]:
X

Unnamed: 0,processed_text
0,"encryptionoptions instantiated title says , otherwise get npe options missing yaml . 's included second patch cassandra-3045 one line fix . there\ 's bunch `` encryption options null ignore '' special cases already you\ 're going instantiate default instead let\ 's get rid those.\n\nmay also need applied 0.8 unless aforesaid special cases cover everything . ' ' could find special case added first time fixed back 0.8 cassandra-3007 . attached patch removes instantiates default instead . ' `` hmm . thought place otc 's going npe current code base . +1 patch . '' `` ( checked 0.8 otc null check . 're good . ) '' 'committed ."
1,log writing many tombstones partition log writing many tombstones partition like writing large partition https : //github.com/krummas/cassandra/commits/marcuse/16780\r\n\r\nhttps : //app.circleci.com/pipelines/github/krummas/cassandra ? branch=marcuse % 2f16780 ' ' think add yaml comments explaining . ' 'yep good point pushed fix ' '+1 ' `` n't think technically put 4.0.x unless 's considered bug unfortunately . '' 'committed thanks
2,"redesign repair messages many people reporting 'repair hang ' something goes wrong . two major causes hang 1 ) validation failure 2 ) streaming failure . currently , failures happen , failed node would respond back repair initiator . goal ticket redesign message flows around repair repair never hang . https : //github.com/yukim/cassandra/commits/5426-3\n\nremoved classes kept backward compatibility.\n\nbq . one thing i\ 'm sure seems get error log doesn\'t error repair session . maybe otherwise fear people won\'t notice something went wrong.\nbq . also fail maybe could send error message ( typically exception message ) easier debugging/reporting.\n\nthe latest version notifies user throwing exception filled ( repairsession # exception ) error occurred . sending exception back coordinator useful i\ 'd rather take different approach use tracing cf ( cassandra-5483 ) .\n\nbq . also wonder maybe fail-fast policy errors . instance one node fail it\ 's validation phase maybe might worth failing right away let user re-trigger repair fixed whatever source error rather still differencing/syncing nodes ( admit solutions possible ) .\n\ni changed let repair session fail error occurred think better repair option ( something like -k -- keep-going ) keep repair running report failed session/job end . +1 separate ticket.\n\nbq . going bit think add 2 messages interrupt validation sync phase . could useful users need stop repair reason also get error validation one node could use interrupt nodes thus fail fast minimizing amount work done uselessly . anyway guess part done follow ticket.\n\n+1 separate ticket . also need add way abort streaming interrupt syncing.\n\nbq . repairmessagetype gossip proof could wise add `` future '' type say 4 5 `` case '' .\nbq . really need repairmessageheader ? making repairmessage repairjobdesc repairmessagetype body rather creating yet another class ? \n\nfor messages mimicked way o.a.c.transport.messages does.\n\nbq . 
hashcode methods ( differencer nodepair repairjobdesc ... ) i\ 'd prefer using guava\ 's objects.hashcode ( ) ( objects.equal ( ) equals ( ) null ) .\n\ndone didn\'t miss anything.\n\nbq . would move gossiper/failure registration ars.addtoactivesessions.\n\ndone.\n\nbq . i\ 'd remove validator.rangetovalidate inline desc.range.\n\ndone.\n\nbq . curiosity mean todo comment validator.add ( ) .\n\nthat comment ancient version . removed since longer applicable.\n\nbq . merkletree.fullrange maybe it\ 's time add mt serializer rather restoring manually ugly error prone . aslo partitioner let\ 's maybe mt uses databasedescriptor.getpartitioner ( ) directly rather restoring manually differencer.run ( ) .\n\nyup good time finally cleanup merkletree serialization . done.\n ' `` streamingrepairtask.initiatestreaming ( ) 's block\n\n { code } try\n { \n ... \n streamout.transfersstables ( outsession sstables request.ranges operationtype.aes ) ; \n // request ranges remote node\n streamin.requestranges ( request.dst desc.keyspace collections.singleton ( cfstore ) request.ranges operationtype.aes ) ; \n } \ncatch ( exception e ) ... { code } \n\nis value putting streamin.requestranges ( ) separate try block ( immediately ) fail streamout problem ? could potentially make forward progress ( stream streamin ) even streamout fails ? 'll note 1.2 try/catch yuki 's new work changed regard.\n\n\n\n '' `` [ ~jasobrown ] actually think try catch block redundant . streaming run thread streamingrepairtask exception handled istreamcallback 's onerror method ( empty current 1.2 ) .\ni 'm trying overhaul streaming api 2.0 ( cassandra-5286 ) fine grained control streaming . '' 'yuki confirms https : //github.com/yukim/cassandra/commits/5426-3 ready review . ' `` alright v3 lgtm +1.\n\ni 've committed though 'll note currently repair tends get stuck due cassandra-5699 ( 've checked ok patch cassandra-5699 ) . '' ]"
3,"system.peers.tokens empty node restart using 2 nodes fresh cluster ( 127.0.0.1 & 127.0.0.2 ) running latest 1.2 , ’ querying system.peers get nodes cluster respective token . seems problem either node restart . node starts , querying system.peers seems ok : { code } 127.0.0.1 > select * system.peers ; + -- -- -- -- -- -- -- -- -+ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + -- -- -- -- -- -- -- -+ -- -- -- -- -- -+ -- -- -- -- -- -- -- -- -- -- -+ -- -- -- -- -- -- -- -- -+ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -+ | data_center | host_id | peer | rack | release_version | rpc_address | schema_version | tokens | +=================+==========================================+===============+===========+=====================+=================+==========================================+===========================================+ | datacenter1 | 4819cbb0-9741-4fe0-8d7d-95941b0247bf | 127.0.0.2 | rack1 | 1.2.0 | 127.0.0.2 | 59adb24e-f3cd-3e02-97f0-5b395827453f | 56713727820156410577229101238628035242 | + -- -- -- -- -- -- -- -- -+ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + -- -- -- -- -- -- -- -+ -- -- -- -- -- -+ -- -- -- -- -- -- -- -- -- -- -+ -- -- -- -- -- -- -- -- -+ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -+ { code } soon one node restarted ( let ’ say 127.0.0.2 ) , tokens column empty : { code } 127.0.0.1 > select * system.peers ; + -- -- -- -- -- -- -- -- -+ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + -- -- -- -- -- -- -- -+ -- -- -- -- -- -+ -- -- -- -- -- -- -- -- -- -- -+ -- -- -- -- -- -- -- -- -+ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + -- -- -- -- -- -- -+ | data_center | host_id | peer | rack | release_version | rpc_address | schema_version | tokens | 
+=================+==========================================+===============+===========+=====================+=================+==========================================+=============+ | datacenter1 | 4819cbb0-9741-4fe0-8d7d-95941b0247bf | 127.0.0.2 | rack1 | 1.2.0 | 127.0.0.2 | 59adb24e-f3cd-3e02-97f0-5b395827453f | | + -- -- -- -- -- -- -- -- -+ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + -- -- -- -- -- -- -- -+ -- -- -- -- -- -+ -- -- -- -- -- -- -- -- -- -- -+ -- -- -- -- -- -- -- -- -+ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- + -- -- -- -- -- -- -+ { code } { code } log server side : debug responding : rows [ peer ( system , peers ) , org.apache.cassandra.db.marshal.inetaddresstype ] [ data_center ( system , peers ) , org.apache.cassandra.db.marshal.utf8type ] [ host_id ( system , peers ) , org.apache.cassandra.db.marshal.uuidtype ] [ rack ( system , peers ) , org.apache.cassandra.db.marshal.utf8type ] [ release_version ( system , peers ) , org.apache.cassandra.db.marshal.utf8type ] [ rpc_address ( system , peers ) , org.apache.cassandra.db.marshal.inetaddresstype ] [ schema_version ( system , peers ) , org.apache.cassandra.db.marshal.uuidtype ] [ tokens ( system , peers ) , org.apache.cassandra.db.marshal.settype ( org.apache.cassandra.db.marshal.utf8type ) ] | 127.0.0.2 | datacenter1 | 4819cbb0-9741-4fe0-8d7d-95941b0247bf | rack1 | 1.2.0 | 127.0.0.2 | 59adb24e-f3cd-3e02-97f0-5b395827453f | null { code } restarting node ( 127.0.0.1 ) restore back tokens column . 
removeendpoint used instead\n [ junit ] \tat org.apache.cassandra.db.systemtable.updatetokens ( ) \n [ junit ] \tat org.apache.cassandra.db.systemtable.updatelocaltokens ( ) \n [ junit ] \tat org.apache.cassandra.service.storageservice.handlestatenormal ( ) \n [ junit ] \tat org.apache.cassandra.service.storageservice.onchange ( ) \n [ junit ] \tat org.apache.cassandra.service.relocatetest.testrelocationsuccess ( ) \n [ junit ] \n [ junit ] \n [ junit ] test org.apache.cassandra.service.relocatetest failed\n ... \n { noformat } \n ' 'you update fixed 17adf8e4f72114d336140fac5157a35e63d1f53a ' 'updated ; test passes thanks !"
4,"sstablesinbounds might actually give sstables within bounds due start positions moved sstables problem cassandra-11886 - try fetch sstablesinbounds canonical_sstables , miss actually overlapping sstables . 3.0+ state sstableset want calling method . looks like issue could cause include many sstables compactions think contain droppable tombstones https : //github.com/krummas/cassandra/commits/marcuse/intervaltreesstableset\nhttp : //cassci.datastax.com/view/dev/view/krummas/job/krummas-marcuse-intervaltreesstableset-testall/\nhttp : //cassci.datastax.com/view/dev/view/krummas/job/krummas-marcuse-intervaltreesstableset-dtest/\n\npatch remove option pick sstableset want returned live sstables supported . want canonical sstables within bounds provide intervaltree built sstables.\n\nalso includes cassandra-11886 part might change depending review ticket ' ' [ ~benedict ] - bandwidth review well along w/cassandra-11886 ? ' 'life busy right sure ... ' 'no doubt - hope context would similar enough 11886 delta would pretty small add top . looking make habit . : ) ' 'habits imply future date life hopefully hectic . currently mid-demolition rebuild home eating free non-free time alike . ' `` 'm little confused patch jira comment - n't see ( branch ) removal option provide sstableset ... '' 'maybe looking wrong commit ? rebased squashed [ here|https : //github.com/krummas/cassandra/commits/marcuse/intervaltreesstableset ] \n { code } \n- public collection < sstablereader > getoverlappingsstables ( sstableset sstableset iterable < sstablereader > sstables ) \n+ public collection < sstablereader > getoverlappingsstables ( iterable < sstablereader > sstables ) \n { code } ' `` probably - earlier commits seemed ticket . thanks.\n\ni 'll proper read later would suggest renaming methods include implicit sstableset.live 's still minimally front-and-centre functionality used perhaps renaming select selectlive sstablesinbounds perhaps inboundslive ( sstables ) ? 
'' 'pushed new commit method renames branch triggered new cassci builds ' 'ping [ ~benedict ] ' '+1 ' 'committed thanks !"
5,partially heap memtables move contents bytebuffers off-heap records written memtable . ( see comments details ) removed couple used methods added couple refaction.impossible ( ) instead null changed memtable code avoids double cast minor typo.\n\nthanks . 'll incorporate cassandra-6694 instead 's okay ? '' `` bq . starts new thread gc work done room static pool n't threads n't timed . cleaner thread started remains forever . mean 're worried flushing many memtables gc work possible would spam
6,"introduce transactional api behaviours corrupt system state penultimate ( probably final 2.1 , agree introduce ) round changes internals managing sstable writing , 've introduced new api called `` transactional '' hope make much easier write correct behaviour . things stand conflate lot behaviours methods like `` close '' - recent changes unpicked , n't go far enough . proposal introduces interface designed support four actions ( top normal function ) : * preparetocommit * commit * abort * cleanup normal operation , finished constructing state change call preparetocommit ; state changes prepared , call commit . point everything fails , abort called . _either_ case , cleanup called last . transactional objects autocloseable , behaviour rollback changes unless commit completed successfully . changes actually less invasive might sound , since recently introduce abort places , well commit like methods . simply formalises behaviour , makes consistent objects interact way . much code change boilerplate , moving object try-declaration , although change still non-trivial . _does_ eliminate _lot_ special casing since 2.1 released . data tracker api changes compaction leftover cleanups finish job making much easier reason , change think worthwhile considering 2.1 , since 've overhauled entire area ( released changes ) , change essentially finishing touches , risk minimal potential gains reasonably significant . 1 ) really horrendous generics ; 2 ) moving preparetocommit transactional making no-args requiring commit preparation arguments provided separate method ; 3 ) leaving as-is . \n\ni think 'm leaning towards ( 2 ) though may change mind taken conclusion . n't perfect allow us clearly codify correct behaviours cost needing little use only-temporary builder-like state inside transactional objects preparetocommit parameters also return values ( like sstablerewriter sstablewriter return list readers reader respectively ) .\n\nbq . 
may want convert touched /io tests take advantage exercise various writers transactional\n\nyeah . reader tests probably perhaps introduce special sequentialwriter test work kinds implementation test behaviours consistent transactional . appear kind sstablewriter test either . think separate ticket since scope much broader perhaps introduce starter touching functionality file follow-up . '' `` ok 've pushed smallish refactor puts preparetocommit ( )"
7,"make sstable2json output readable sstable2json writes entire file single json line . also , local timestamp delete given hex bytes instead int . fix attached ' 'also renames `` columns '' field output `` cells '' ' ' feel like missing matching changes sstableimport . make sstableexporttest pass first look ? ' 'v2 ' 'now sstableimporttest . v3 ? ' 'v3 also removes `` old '' format support import ( pre-1.0 stuff ) supercolumn support . ' 'lgtm +1 . ' 'committed"
8,"operational improvements & hardening replica filtering protection cassandra-8272 uses additional space heap ensure correctness 2i filtering queries consistency levels one/local_one . things follow , however , make life bit easier operators generally de-risk usage : ( note : line numbers based { { trunk } } { { 3cfe3c9f0dcf8ca8b25ad111800a21725bf152cb } } . ) * minor optimizations * * { { } } - given size up-front , may able use simple arrays instead lists { { rowstofetch } } { { originalpartitions } } . alternatively ( also ) , may able null references two collections aggressively . ( ex . using { { arraylist # set ( ) } } instead { { get ( ) } } { { queryprotectedpartitions ( ) } } , assuming pass { { tofetch } } argument { { querysourceonkey ( ) } } . ) * { { } } - may able use { { encodingstats.merge ( ) } } remove custom { { stats ( ) } } method . * { { & 228 } } - cache instance { { unaryoperator # identity ( ) } } instead creating one fly . * { { } } - may able scatter/gather rather serially querying every row needs completed . n't clear win perhaps , given targets latency single queries adds complexity . ( certainly decent candidate kick even issue . ) * documentation intelligibility * * places ( changes.txt , tracing output { { replicafilteringprotection } } , etc . ) mention `` replica-side filtering protection '' ( makes seem like coordinator n't filter ) rather `` replica filtering protection '' ( sounds like actually , protect incorrect replica filtering results ) . 's minor fix , would avoid confusion . * method call chain { { dataresolver } } might bit simpler put { { repaireddatatracker } } { { resolvecontext } } . * testing * * want bite bullet get basic tests rfp ( including guardrails might add ) onto in-jvm dtest framework . * guardrails * * stands , n't way enforce upper bound memory usage { { replicafilteringprotection } } caches row responses first round requests . 
( remember , later used merged second round results complete data filtering . ) operators likely need way protect , i.e . simply fail queries hit particular threshold rather gc nodes oblivion . ( control limits page sizes n't quite get us , stale results _expand_ number incomplete results must cache . ) fun question , primary axes scope ( per-query , global , etc . ) granularity ( per-partition , per-row , per-cell , actual heap usage , etc. ) . starting disposition right trade-off performance/complexity accuracy something along lines cached rows per query . prior art suggests probably makes sense alongside things like { { tombstone_failure_threshold } } { { cassandra.yaml } } . created cassandra-15948 '' ' [ ~maedhroz ] couldn\'t resist giving try per-row lazy pre-fetch left [ here|https : //github.com/adelapena/cassandra/commit/accf2a47c341875942b0d8b06c016cc0d66d62cb ] .\r\n\r\ninstead consuming contents merged partition consumes row-per-row replica contents . way conflicts caches one row per replica instead entire partition per replica . also iif finds rows fetch replica advances first phase merged row iterator bit reaching certain cache size trying find balance cache size number rfp queries . right desired cache size hardcoded 100 could use config property query limit example . also could also let unbounded minimize number rfp queries main advantage approach absence conflicts nothing needs cached . 
benefit common case conflicts.\r\n\r\nto illustrate per-row approach behaves let\ 's see example : \r\n { code : python } \r\nself._prepare_cluster ( \r\n create_table= '' create table ( k int c int v text primary key ( k c ) ) '' \r\n create_index= '' create index ( v ) '' \r\n both_nodes= [ `` insert ( k c v ) values ( 0 0 \'old\ ' ) '' \r\n `` insert ( k c v ) values ( 0 1 \'old\ ' ) '' \r\n `` insert ( k c v ) values ( 0 2 \'old\ ' ) '' \r\n `` insert ( k c v ) values ( 0 3 \'old\ ' ) '' \r\n `` insert ( k c v ) values ( 0 4 \'old\ ' ) '' \r\n `` insert ( k c v ) values ( 0 5 \'old\ ' ) '' \r\n `` insert ( k c v ) values ( 0 6 \'old\ ' ) '' \r\n `` insert ( k c v ) values ( 0 7 \'old\ ' ) '' \r\n `` insert ( k c v ) values ( 0 8 \'old\ ' ) '' \r\n `` insert ( k c v ) values ( 0 9 \'old\ ' ) '' ] \r\n only_node1= [ `` insert ( k c v ) values ( 0 4 \'new\ ' ) '' \r\n `` insert ( k c v ) values ( 0 6 \'new\ ' ) '' ] ) \r\nself._assert_all ( `` select c v = \'old\ ' '' rows= [ [ 0 ] [ 1 ] [ 2 ] [ 3 ] [ 5 ] [ 7 ] [ 8 ] [ 9 ] ] ) \r\n { code } \r\nwithout per-row approach cached 20 rows ( 10 per replica ) issued one single rfp query . contrast per-row approach behaviour : \r\n * target cache size high unbounded cache max 12 rows need 1 rfp queries . less cached row don\'t cache rows first conflict found fourth row.\r\n * target cache size 2 cache max 8 rows need 1 rfp queries . two conflicts fit window cached rows fetched don\'t need cache rows.\r\n * use target cache size 1 cache max 6 rows differently need 2 separate rfp queries.\r\n * conflicts one cached row per replica ; current one.\r\n\r\nnote consuming rows first phase iterator populate cache still produce unlimited growth cache still need guardrail . configurable target cache size mention used try find balance cache size grouping primary keys fetch . ' 'bq . 
main advantage approach absence conflicts nothing needs cached\r\n\r\neven though partition-restricted queries without digest mismatches skip { { dataresolver } } entirely still helps case mismatch start large number conflict-free rows correct ? think benefit consider given likely dealing partition-restricted queries.\r\n\r\nbq . also could also let unbounded minimize number rfp queries\r\n\r\nthe one thing makes little uneasy extra logic need enforce `` target cache size '' . propose avoid simply leave guardrails we\ 've already got place avoid catastrophe ( excessive rfp queries ) see means simplify remains ( like clear { { contents } } array list batches ) . i\ 'll try see looks pull main 3.0 branch works.\r\n\r\naside think remaining question would verifying safety [ aggressively clearing|https : //github.com/apache/cassandra/pull/659/commits/30b8f4bebd95b3520b637d6d25d6bc16cb4d81a2 ] { { responses } } . ' `` [ ~adelapena ] tried strip lazy rows approach bit [ here|https : //github.com/maedhroz/cassandra/commit/c5abb49626da0141277de92e173fa8ed8062bcf3 ] . understand bit better 'm bit skeptical whether want proceed . already know partition-restricted case without digest mismatch avoids altogether . large number non-conflicting rows start first-phase iterator though seems like price avoiding row caching creating large number { { cachedrowiterator } } objects . maybe right trade-off 'm sure . '' `` [ ~maedhroz ] \xa0i like changes lazy rows approach . however 'm afraid need snapshot cached rows done [ local copy-and-clear|https : //github.com/adelapena/cassandra/blob/accf2a47c341875942b0d8b06c016cc0d66d62cb/src/java/org/apache/cassandra/service/replicafilteringprotection.java # l522-l524 ] otherwise advances replica introduce new data mess producing multiple test failures . 
much better previous patch track number contents snapshot save us queue copy 's done [ here|https : //github.com/adelapena/cassandra/blob/35d8e712bbbe03076ba867c11759664e8ff839e4/src/java/org/apache/cassandra/service/replicafilteringprotection.java # l528-l568 ] .\r\n\r\nalso think making { { currentmergedrows } } / { { unprotectedpartition } } partition iterator correct . 's pointer current first iteration merged partition shared builders rfp . make local reduce speed pointer advanced producing end rfp queries.\r\n { quote } large number non-conflicting rows start first-phase iterator though seems like price avoiding row caching creating large number { { cachedrowiterator } } objects . maybe right trade-off 'm sure.\r\n { quote } \r\nwe find balance max cache size number { { cachedrowiterator } } instances try grow cache bit conflicts : \r\n { code : java } \r\nwhile ( unprotectedpartition ! = null & & unprotectedpartition.hasnext ( ) \r\n & & ( tofetch ! = null || cachedrows.size ( ) < min_cache_size ) ) \r\n { code } \r\nmin cache/buffer size constant config property function warning threshold something related query limit . would still limit size cache absence conflicts quickly reducing number { { cachedrowiterator } } instances.\r\n\r\nalso given concerned cache size might want consider tracking max size cache reaches query add new table metric tracks average max cache size . '' ' [ ~adelapena ] quick slack discussion think we\ 've landed following : \r\n\r\n1 . ) we\ 'll stop partition-based lazy first-phase iterator consumption approach ( what\ 's main patch branch right ) . it\ 's clear { { min_cache_size } } ( avoid creating tons `` singleton '' { { cachedrowiterator } } instances ) would produce something meaningfully different.\r\n\r\n2 . ) interest visibility we\ 'll explore adding histogram quantifies much row caching queries . would put data behind assumptions might help tuning guardrails well . ( i\ 'll try something shortly ... 
) ' `` case need explore per-row approach future 'm leaving [ here|https : //github.com/adelapena/cassandra/commit/90900ec717958270bc38b501b4248dfb7d55958c ] \xa0the extensive prototype uses two properties control min cache size n't conflicts yet found conflicts . '' ' [ ~maedhroz ] metric cache size really nice . however think would useful track max per-query cache size rather per-partition size . single advance first phase merge iterator still insert indefinite amount partitions cache . example let\ 's consider scenario rows outdated replica superseded updated replica : \r\n { code : java } \r\nself._prepare_cluster ( \r\n create_table= '' create table ( k int c int v text primary key ( k c ) ) '' \r\n create_index= '' create index ( v ) '' \r\n both_nodes= [ `` insert ( k c v ) values ( 0 0 \'old\ ' ) '' \r\n `` insert ( k c v ) values ( 0 1 \'old\ ' ) '' \r\n `` insert ( k c v ) values ( 1 0 \'old\ ' ) '' \r\n `` insert ( k c v ) values ( 1 1 \'old\ ' ) '' \r\n `` insert ( k c v ) values ( 2 0 \'old\ ' ) '' \r\n `` insert ( k c v ) values ( 2 1 \'old\ ' ) '' \r\n `` insert ( k c v ) values ( 3 0 \'old\ ' ) '' \r\n `` insert ( k c v ) values ( 3 1 \'old\ ' ) '' ] \r\n only_node1= [ `` delete k = 0 '' \r\n `` delete k = 1 '' \r\n `` delete k = 2 '' \r\n `` delete k = 3 '' ] ) \r\nself._assert_none ( `` select c v = \'old\ ' limit 1 '' ) \r\n { code } \r\nin test 4 partitions cached single advance merged iterator cache contains 16 rows maximum . however metric records 8 times per-partition cache size 2. think would useful either single metric max cache size 12 two records ( one per replica ) value 6. would make easier detect cases exposed example.\r\n\r\nalso per-query metric would easier advice operators start worrying consistency among replicas rfp metric starts get higher fetch size independently whether queries single-partition not.\r\n\r\nas tracking cache size per replica per query think would nice used criteria used guardrail measure thing . 
would mean either tracking metric per query leaving guardrail changing guardrail per-replica instead per-query . \r\n\r\nwdyt ? ' `` bq . tracking cache size per replica per query think would nice used criteria used guardrail measure thing . would mean either tracking metric per query leaving guardrail changing guardrail per-replica instead per-query.\r\n\r\ntracking metric per query rather changing existing guardrail sounds like right move although n't really legitimate multi-partition use-case yet time per-partition metric equivalent per-query one . hand whole point metric help provide guidance operators looking set appropriate warn/fail thresholds . 'll push something today along slightly modified inline"
9,"anti-compaction briefly corrupts sstable state reads since use multiple sstable rewriters anticompaction , first call preparetocommit remove original sstables tracker view rewriters add sstables . creates brief window reads miss data . sure going dtests though probably need restart ' 'nice catch looks like good fix me.\r\n\r\n ( +1 ) ' `` blake realised issue patch posted put together alternative patch input [ ~krummas ] .\r\n\r\n [ 3.0|https : //github.com/belliottsmith/cassandra/tree/15004-3.0 ] [ 3.11|https : //github.com/belliottsmith/cassandra/tree/15004-3.11 ] [ 4.0|https : //github.com/belliottsmith/cassandra/tree/15004-4.0 ] \r\n\r\nthese patches extract interface { { lifecycletransaction } } no-op relevant calls ( { { preparetocommit } } { { obsoleteoriginals } } ) { { sstablerewriter.preparetocommit } } update tracker - invoked directly rewriter finished preparatory work.\r\n\r\nit 's bit ugly still finicky probably better/safer invasive surgery point time . '' 'updated unit tests [ 3.0|https : //github.com/krummas/cassandra/tree/15004-3.0 ] [ 3.11|https : //github.com/krummas/cassandra/tree/15004-3.11 ] [ trunk|https : //github.com/krummas/cassandra/tree/15004-trunk ] also adds checks files disk expect ' 'lgtm need comments explaining going comment mentioning { { permitredundanttransitions } } needs removed/updated ' `` thanks . 've pushed branches updated comments . '' '+1 ' 'thanks committed [ 3.0|https : //github.com/apache/cassandra/commit/44785dd2eec5697eec7e496ed3a73d2573f4fe6a ] [ 3.11|https : //github.com/apache/cassandra/commit/9199e591c6148d14f3d12784af8ce5342f118161 ] [ 4.0|https : //github.com/apache/cassandra/commit/df62169d1b6a5bfff2bc678ffbeb0883a3a576b5 ]"


In [358]:
y

Unnamed: 0,Textual_Type
0,AI_No
1,AI_No
2,AI_Yes
3,AI_No
4,AI_Yes
5,AI_Yes
6,AI_No
7,AI_No
8,AI_Yes
9,AI_Yes


In [359]:
skf.split(X, y)

<generator object _BaseKFold.split at 0x791d4c98b760>
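`skf.split(X, y)` returns a lazy generator, so nothing is computed until it is iterated; each iteration yields a pair of index collections (train, test). The pure-Python sketch below only illustrates what stratification means and is not scikit-learn's actual algorithm: the hypothetical helper `stratified_folds` deals the indices of each class round-robin into folds, so every fold keeps roughly the same class ratio. The labels are the ten shown in the `y` output above.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits):
    """Yield (train_idx, test_idx) pairs with classes spread evenly over folds.

    Simplified illustration: no shuffling, round-robin assignment per class.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)

    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        yield train_idx, sorted(test_idx)

labels = ["AI_No", "AI_No", "AI_Yes", "AI_No", "AI_Yes",
          "AI_Yes", "AI_No", "AI_No", "AI_Yes", "AI_Yes"]

splits = stratified_folds(labels, n_splits=5)  # generator: nothing runs yet
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    test_labels = [labels[i] for i in test_idx]
    print(f"Fold {fold}: test indices {test_idx} -> {test_labels}")
```

Because a generator is consumed once, the training loop calls `skf.split(X, y)` again rather than reusing the object displayed here.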

In [360]:
print(f"Running in {device_name}")

Running in cuda


## Train and Evaluate the Model

In [361]:
list_my_metrics = list()

In [362]:
for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(f"Fold {i+1}: Train Size {len(train_index)} | Test Size {len(test_index)}")
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # The tokenizer expects plain Python strings
    X_train = [str(text) for text in X_train]
    X_test = [str(text) for text in X_test]

    # Build the label mapping before encoding; sorting keeps the ids stable across runs
    unique_labels = sorted(set(y_train))
    label2id = {label: idx for idx, label in enumerate(unique_labels)}
    id2label = {idx: label for label, idx in label2id.items()}

    train_encodings = tokenizer(X_train, truncation=True, padding=True, max_length=max_length)
    test_encodings  = tokenizer(X_test, truncation=True, padding=True, max_length=max_length)

    train_labels_encoded = [float(label2id[yi]) for yi in y_train]
    test_labels_encoded  = [float(label2id[yi]) for yi in y_test]

    train_dataset = MyDataset(train_encodings, train_labels_encoded)
    test_dataset = MyDataset(test_encodings, test_labels_encoded)

    model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=1).to(device_name)
    trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
      eval_dataset=test_dataset,
      compute_metrics=compute_metrics
    )
    trainer.train()
    trainer.evaluate()
    predicted_results = trainer.predict(test_dataset)
    outputs = predicted_results.predictions.flatten().tolist()
    probas = [cap_number(x) for x in outputs]
    preds = np.array(np.array(probas) > 0.5, dtype=int)

    # roc_auc_score(test_labels_encoded, probas)
    folds[i] = {}
    folds[i]['pre'] = precision_score(test_labels_encoded, preds)
    folds[i]['rec'] = recall_score(test_labels_encoded, preds)
    folds[i]['acc'] = accuracy_score(test_labels_encoded, preds)
    folds[i]['auc'] = roc_auc_score(test_labels_encoded, probas)
    folds[i]['f1'] = f1_score(test_labels_encoded, preds)

    print(f"Fold {i+1}=> PRE: {folds[i]['pre']}; REC: {folds[i]['rec']}; ACC: {folds[i]['acc']}; F1S: {folds[i]['f1']}; AUC: {folds[i]['auc']}")

    item_metric = {'Fold':i+1, 'PRE':folds[i]['pre'], 'REC':folds[i]['rec'], 'ACC':folds[i]['acc'], 'F1S':folds[i]['f1'], 'AUC':folds[i]['auc']}
    list_my_metrics.append(item_metric)

Fold 1: Train Size 180 | Test Size 46


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss,Validation Loss


Fold 1=> PRE: 0.6739130434782609; REC: 1.0; ACC: 0.6739130434782609; F1S: 0.8051948051948052; AUC: 0.7655913978494624
Fold 2: Train Size 181 | Test Size 45


Fold 2=> PRE: 0.6888888888888889; REC: 1.0; ACC: 0.6888888888888889; F1S: 0.8157894736842105; AUC: 0.7373271889400921
Fold 3: Train Size 181 | Test Size 45


Fold 3=> PRE: 0.6666666666666666; REC: 1.0; ACC: 0.6666666666666666; F1S: 0.8; AUC: 0.6599999999999999
Fold 4: Train Size 181 | Test Size 45


Fold 4=> PRE: 0.6818181818181818; REC: 1.0; ACC: 0.6888888888888889; F1S: 0.8108108108108109; AUC: 0.7333333333333334
Fold 5: Train Size 181 | Test Size 45


Fold 5=> PRE: 0.6590909090909091; REC: 0.9666666666666667; ACC: 0.6444444444444445; F1S: 0.7837837837837838; AUC: 0.6511111111111112
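The fold results are worth checking against a trivial baseline. In folds 1 to 3 recall is 1.0 and precision equals accuracy, which is exactly what a classifier that labels every test issue as positive would produce. The sketch below reproduces the Fold 1 numbers under that assumption. The class counts (31 positives out of 46) are inferred from the reported precision of 31/46, not read from the dataset, and the helper name `binary_metrics` is illustrative.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, accuracy and F1 for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"pre": precision, "rec": recall, "acc": accuracy, "f1": f1}

# Assumed Fold 1 test set: 31 positive and 15 negative issues (inferred).
y_true = [1] * 31 + [0] * 15
y_pred = [1] * len(y_true)  # always predict the positive class

baseline = binary_metrics(y_true, y_pred)
print(baseline)
```

When a fold's precision, recall, accuracy and F1 match this baseline, the thresholded metrics say little beyond the class balance, and the AUC column (computed from the raw scores rather than the 0.5 threshold) is the more informative one.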


In [363]:
df_my_metrics = pd.DataFrame(list_my_metrics)
df_my_metrics

Unnamed: 0,Fold,PRE,REC,ACC,F1S,AUC
0,1,0.673913,1.0,0.673913,0.805195,0.765591
1,2,0.688889,1.0,0.688889,0.815789,0.737327
2,3,0.666667,1.0,0.666667,0.8,0.66
3,4,0.681818,1.0,0.688889,0.810811,0.733333
4,5,0.659091,0.966667,0.644444,0.783784,0.651111
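With five folds, the usual summary is the mean and standard deviation of each metric. A small sketch using only the standard library, with the values copied from the table above (in the notebook itself, `df_my_metrics.describe()` would presumably give the same picture):

```python
from statistics import mean, stdev

# Per-fold metrics copied from the df_my_metrics output.
fold_metrics = [
    {"PRE": 0.673913, "REC": 1.0, "ACC": 0.673913, "F1S": 0.805195, "AUC": 0.765591},
    {"PRE": 0.688889, "REC": 1.0, "ACC": 0.688889, "F1S": 0.815789, "AUC": 0.737327},
    {"PRE": 0.666667, "REC": 1.0, "ACC": 0.666667, "F1S": 0.8, "AUC": 0.66},
    {"PRE": 0.681818, "REC": 1.0, "ACC": 0.688889, "F1S": 0.810811, "AUC": 0.733333},
    {"PRE": 0.659091, "REC": 0.966667, "ACC": 0.644444, "F1S": 0.783784, "AUC": 0.651111},
]

summary = {}
for name in ("PRE", "REC", "ACC", "F1S", "AUC"):
    values = [fold[name] for fold in fold_metrics]
    # stdev is the sample standard deviation (n - 1 in the denominator)
    summary[name] = (mean(values), stdev(values))
    print(f"{name}: mean={summary[name][0]:.4f} std={summary[name][1]:.4f}")
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two model variants is larger than the fold-to-fold noise.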


# B) Create the Model (based on the Hugging Face DistilBERT example)

In [364]:
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
task = "issue-analysis"
MY_HUGGING_FACE_DATASET = "armandoufpi/cassandraissuesgroundtruth"

In [365]:
# Load pre-trained DistilBERT model and tokenizer
model = DistilBertForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

## Dataset from my Hugging Face account

https://huggingface.co/datasets/armandoufpi/cassandraissuesgroundtruth

In [366]:
# Dataset from my Hugging Face account
splits = {'train': 'train.jsonl', 'test': 'test.jsonl'}
df_treino = pd.read_json("hf://datasets/armandoufpi/cassandraissuesgroundtruth/" + splits["train"])
df_teste = pd.read_json("hf://datasets/armandoufpi/cassandraissuesgroundtruth/" + splits["test"])

In [367]:
df_treino

Unnamed: 0,issue_key,summary,issue_type,issue_status,issue_priority,description,comments,architectural_impact,comments_text,label,label_text
0,CASSANDRA-3489,EncryptionOptions should be instantiated,Bug,Resolved,Low,"As the title says, otherwise you get an NPE wh...","['There\'s a bunch of ""if encryption options i...",NO,"There\'s a bunch of ""if encryption options is ...",0,negative
1,CASSANDRA-16780,Log when writing many tombstones to a partition,Improvement,Resolved,Normal,Log when writing many tombstones to a partitio...,['https://github.com/krummas/cassandra/commits...,NO,https://github.com/krummas/cassandra/commits/m...,0,negative
2,CASSANDRA-5426,Redesign repair messages,Improvement,Resolved,Low,Many people have been reporting 'repair hang' ...,['Work in progress is pushed to: https://githu...,YES,https://github.com/yukim/cassandra/commits/542...,1,positive
3,CASSANDRA-5121,system.peers.tokens is empty after node restart,Bug,Resolved,Low,Using a 2 nodes fresh cluster (127.0.0.1 & 127...,"['In StorageService.handleStateNormal, when we...",NO,removeEndpoint should be used instead\n [ju...,0,negative
4,CASSANDRA-11944,sstablesInBounds might not actually give all s...,Bug,Resolved,Normal,Same problem as with CASSANDRA-11886 - if we t...,['https://github.com/krummas/cassandra/commits...,YES,https://github.com/krummas/cassandra/commits/m...,1,positive
...,...,...,...,...,...,...,...,...,...,...,...
195,CASSANDRA-18617,Disable the deprecated keyspace/table threshol...,Improvement,Resolved,Normal,The non-guardrail thresholds 'keyspace_count_w...,"[""Part of this change is to add converters tha...",YES,\xa0[https://github.com/apache/cassandra/pull/...,1,positive
196,CASSANDRA-5244,Compactions don't work while node is bootstrap...,Bug,Resolved,Urgent,It seems that there is a race condition in Sto...,"[""Thanks for the detective work, Jouni. I'll ...",NO,BLOCKED (on object monitor)\n at org.apache...,0,negative
197,CASSANDRA-173,add getPendingTasks to CFSMBean,Improvement,Resolved,Low,need to add an atomicint and inc/decr it whene...,['rebased patch as 0001-CASSANDRA-173-added-CF...,NO,rebased patch as 0001-CASSANDRA-173-added-CFS-...,0,negative
198,CASSANDRA-359,CFS readStats_ and diskReadStats_ are missing,Bug,Resolved,Normal,There is no description,"[""shouldn't we also get rid of getReadDiskHits...",NO,"[""shouldn't we also get rid of getReadDiskHits...",0,negative


In [368]:
df_teste

Unnamed: 0,issue_key,summary,issue_type,issue_status,issue_priority,description,comments,architectural_impact,comments_text,label,label_text
0,CASSANDRA-11944,sstablesInBounds might not actually give all s...,Bug,Resolved,Normal,Same problem as with CASSANDRA-11886 - if we t...,['https://github.com/krummas/cassandra/commits...,YES,https://github.com/krummas/cassandra/commits/m...,1,positive
1,CASSANDRA-12988,make the consistency level for user-level auth...,Improvement,Resolved,Low,Most reads for the auth-related tables execute...,['Linked patch allows an operator to set the r...,YES,[Link|https://app.circleci.com/pipelines/githu...,1,positive
2,CASSANDRA-15004,Anti-compaction briefly corrupts sstable state...,Bug,Resolved,Urgent,Since we use multiple sstable rewriters in ant...,['|[3.0|https://github.com/bdeggleston/cassand...,YES,not sure what is going on with the dtests thou...,1,positive
3,CASSANDRA-15265,Index summary redistribution can start even wh...,Bug,Resolved,Normal,When we pause autocompaction for upgradesstabl...,['Patch adds a flag in `CompactionManager` whi...,YES,[3.0|https://circleci.com/workflow-run/8882a8a...,1,positive
4,CASSANDRA-18029,fix starting Paxos auto repair,Bug,Resolved,Normal,This test was not run in CI because of its nam...,['I fixed here what I could: [https://github.c...,YES,repaired}} rely on running regular/incremental...,1,positive
5,CASSANDRA-18058,In-memory index and query path,New Feature,Resolved,Normal,An in-memory index using the in-memory trie st...,['The github PR for this ticket is here:\xa0\r...,YES,[https://app.circleci.com/pipelines/github/ade...,1,positive
6,CASSANDRA-18617,Disable the deprecated keyspace/table threshol...,Improvement,Resolved,Normal,The non-guardrail thresholds 'keyspace_count_w...,"[""Part of this change is to add converters tha...",YES,\xa0[https://github.com/apache/cassandra/pull/...,1,positive
7,CASSANDRA-1919,Add shutdownhook to flush commitlog,Improvement,Resolved,Low,this replaces the periodic_with_flush approach...,"[""The approach I took was to add a shutdownBlo...",YES,Could not create ServerSocket on address /127....,1,positive
8,CASSANDRA-414,remove sstableLock,Improvement,Resolved,Normal,There is no description,['rebased.\n\n02\n remove sstableLock. re-...,YES,the cleanup does happen. If it were the SSTR ...,1,positive
9,CASSANDRA-5426,Redesign repair messages,Improvement,Resolved,Low,Many people have been reporting 'repair hang' ...,['Work in progress is pushed to: https://githu...,YES,https://github.com/yukim/cassandra/commits/542...,1,positive


## Load the training and test data

In [369]:
# Load the train and test splits of the dataset from the Hugging Face Hub
train_data = load_dataset(MY_HUGGING_FACE_DATASET, split="train")
test_data = load_dataset(MY_HUGGING_FACE_DATASET, split="test")

In [370]:
train_data

Dataset({
    features: ['summary', 'architectural_impact', 'comments_text', 'label_text', 'comments', 'issue_status', 'description', 'issue_priority', 'issue_type', 'issue_key', 'label'],
    num_rows: 200
})

In [371]:
test_data

Dataset({
    features: ['summary', 'architectural_impact', 'comments_text', 'label_text', 'comments', 'issue_status', 'description', 'issue_priority', 'issue_type', 'issue_key', 'label'],
    num_rows: 26
})

In [372]:
print(f"len(train_data['summary']): {len(train_data['summary'])}")
print(f"train_data['summary'][0]: {train_data['summary'][0]}")
print(f"train_data['label'][0]: {train_data['label'][0]}")
print(f"train_data['label_text'][0]: {train_data['label_text'][0]}")
print(f"train_data['description'][0]: {train_data['description'][0]}")

len(train_data['summary']): 200
train_data['summary'][0]: EncryptionOptions should be instantiated
train_data['label'][0]: 0
train_data['label_text'][0]: negative
train_data['description'][0]: As the title says, otherwise you get an NPE when the options are missing from the yaml.  It's included in my second patch on CASSANDRA-3045 and is a one line fix.


## Preprocess the training and test data

In [373]:
# Tokenize the issue description (alternative input field, not used below)
def preprocess_function_description(examples):
  return tokenizer(examples["description"], padding="max_length", truncation=True)

# Tokenize the issue summary (the field used for training in this section)
def preprocess_function(examples):
  return tokenizer(examples["summary"], padding="max_length", truncation=True)

In [374]:
# Preprocess train and test data
train_data = train_data.map(preprocess_function, batched=True)
test_data = test_data.map(preprocess_function, batched=True)

In [375]:
# Access the 'input_ids' from the preprocessed data
#train_inputs = train_data["input_ids"]
#test_inputs = test_data["input_ids"]

## Train the model

In [376]:
!rm -rf results
!mkdir results
!ls -l

total 12
drwxr-xr-x 2 root root 4096 Jul 30 20:58 logs
drwxr-xr-x 2 root root 4096 Jul 30 20:58 results
drwxr-xr-x 1 root root 4096 Jul 29 13:22 sample_data


In [377]:
training_args = TrainingArguments(
    output_dir="results",  # checkpoints are written to this directory
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

In [378]:
# Create trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=test_data,
    # compute_metrics must be a callable, not a string; it is omitted here
    # because this section never calls trainer.evaluate()
)
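`Trainer` calls `compute_metrics` with the evaluation predictions and expects a dict of named metrics back, so the argument has to be a callable. A minimal sketch of an accuracy function follows. The name `compute_accuracy` is an assumption of this example, and it presumes the model emits one logit per class, as the two-class SST-2 checkpoint used here does. It is written in plain Python so that it also runs on lists; inside `Trainer` the inputs would be NumPy arrays.

```python
def compute_accuracy(eval_pred):
    """Return {'accuracy': ...} from a (logits, labels) pair."""
    logits, labels = eval_pred
    correct = 0
    total = 0
    for row, label in zip(logits, labels):
        # argmax over the class dimension
        predicted = max(range(len(row)), key=lambda k: row[k])
        correct += int(predicted == int(label))
        total += 1
    return {"accuracy": correct / total if total else 0.0}

# Toy check with made-up logits for three examples and two classes.
toy_logits = [[0.2, 1.3], [2.0, -1.0], [0.1, 0.4]]
toy_labels = [1, 0, 0]
toy_result = compute_accuracy((toy_logits, toy_labels))
print(toy_result)
```

It would be passed as `compute_metrics=compute_accuracy` when constructing the `Trainer`, and only takes effect when evaluation actually runs, for example through `trainer.evaluate()`.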

In [379]:
# Train the model
trainer.train()

Step,Training Loss


TrainOutput(global_step=39, training_loss=0.819644047663762, metrics={'train_runtime': 32.3881, 'train_samples_per_second': 18.525, 'train_steps_per_second': 1.204, 'total_flos': 79480439193600.0, 'train_loss': 0.819644047663762, 'epoch': 3.0})
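The reported `global_step=39` can be checked by hand: 200 training rows with a per-device batch size of 16 give 13 optimizer steps per epoch (the last batch is partial), and 3 epochs give 39. The sketch assumes a single device and no gradient accumulation, neither of which is stated explicitly in the notebook output.

```python
import math

num_rows = 200       # train_data.num_rows
batch_size = 16      # per_device_train_batch_size
num_epochs = 3       # num_train_epochs

# The final, smaller batch still counts as one optimizer step.
steps_per_epoch = math.ceil(num_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```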

## Make predictions with the trained model

In [380]:
# TODO: analyse the issue based on several fields at once
def analyse_issue(issue_field):
  inputs = tokenizer(issue_field, padding="max_length", truncation=True, return_tensors="pt")

  # Move the model and inputs to the GPU if available
  if torch.cuda.is_available():
    model.to('cuda')
    inputs = inputs.to('cuda')

  # Switch to evaluation mode so dropout does not make predictions vary between calls
  model.eval()
  with torch.no_grad():
    outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=-1).item()

  # Map the predicted class id to a readable label
  if prediction == 1:
    return "Architectural Impact: Yes"
  else:
    return "Architectural Impact: No"
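For the TODO above, one simple option is to concatenate several issue fields into a single string before tokenizing. The helper below is a hypothetical sketch, not part of the notebook: `combine_fields` and the `" [SEP] "` separator are assumptions of this example (a plain space or newline would also work), and whether the extra fields help the classifier would need to be measured.

```python
def combine_fields(row, fields=("summary", "description"), separator=" [SEP] "):
    """Join the non-empty text fields of an issue into one input string."""
    parts = []
    for name in fields:
        value = row.get(name)
        if value is None:
            continue
        text = str(value).strip()
        if text:
            parts.append(text)
    return separator.join(parts)

# Example issue taken from the first training row shown earlier.
issue = {
    "issue_key": "CASSANDRA-3489",
    "summary": "EncryptionOptions should be instantiated",
    "description": "As the title says, otherwise you get an NPE "
                   "when the options are missing from the yaml.",
}
combined = combine_fields(issue)
print(combined)
```

The result could then be passed as `analyse_issue(issue_field=combine_fields(row))`. Truncation at the tokenizer's maximum length still applies, so long descriptions are cut off.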

### Test data

In [388]:
df_teste[['issue_key', 'summary', 'architectural_impact']]

Unnamed: 0,issue_key,summary,architectural_impact
0,CASSANDRA-11944,sstablesInBounds might not actually give all s...,YES
1,CASSANDRA-12988,make the consistency level for user-level auth...,YES
2,CASSANDRA-15004,Anti-compaction briefly corrupts sstable state...,YES
3,CASSANDRA-15265,Index summary redistribution can start even wh...,YES
4,CASSANDRA-18029,fix starting Paxos auto repair,YES
5,CASSANDRA-18058,In-memory index and query path,YES
6,CASSANDRA-18617,Disable the deprecated keyspace/table threshol...,YES
7,CASSANDRA-1919,Add shutdownhook to flush commitlog,YES
8,CASSANDRA-414,remove sstableLock,YES
9,CASSANDRA-5426,Redesign repair messages,YES


### Run the model on the test data

In [381]:
for index, row in df_teste.iterrows():
  field = row['summary']
  issue_key = row['issue_key']
  summary = truncate_string(text=row['summary'], max_length=50)
  print(f"{issue_key}, {summary}, {analyse_issue(issue_field=field)}")

CASSANDRA-11944, sstablesInBounds might not actually give all sstab..., Architectural Impact: No
CASSANDRA-12988, make the consistency level for user-level auth rea..., Architectural Impact: Yes
CASSANDRA-15004, Anti-compaction briefly corrupts sstable state for..., Architectural Impact: No
CASSANDRA-15265, Index summary redistribution can start even when c..., Architectural Impact: Yes
CASSANDRA-18029, fix starting Paxos auto repair, Architectural Impact: No
CASSANDRA-18058, In-memory index and query path, Architectural Impact: Yes
CASSANDRA-18617, Disable the deprecated keyspace/table thresholds a..., Architectural Impact: Yes
CASSANDRA-1919, Add shutdownhook to flush commitlog, Architectural Impact: No
CASSANDRA-414, remove sstableLock, Architectural Impact: Yes
CASSANDRA-5426, Redesign repair messages, Architectural Impact: Yes
CASSANDRA-11540, The JVM should exit if jmx fails to bind, Architectural Impact: No
CASSANDRA-6013, CAS may return false but still commit the insert, Archit