In [1]:
import requests
import pandas as pd
from collections import Counter
import plotly.express as px

# Reading The Dataset

In [2]:

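# GitHub REST endpoint listing rails/rails issues (pull requests are included).
url = "https://api.github.com/repos/rails/rails/issues"

issues = []

params = {
    "state": "all",   # open and closed
    "per_page": 100,  # maximum page size
}

# Fetch the first five pages (up to 500 records).
for i in range(1, 6):
    params["page"] = i
    response = requests.get(url, params=params)
    response.raise_for_status()  # stop on rate-limit or other HTTP errors
    issues.extend(response.json())

df = pd.DataFrame(issues)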

In [3]:
df.head()

Unnamed: 0,url,repository_url,labels_url,comments_url,events_url,html_url,id,node_id,number,title,...,closed_at,author_association,active_lock_reason,draft,pull_request,body,reactions,timeline_url,performed_via_github_app,state_reason
0,https://api.github.com/repos/rails/rails/issue...,https://api.github.com/repos/rails/rails,https://api.github.com/repos/rails/rails/issue...,https://api.github.com/repos/rails/rails/issue...,https://api.github.com/repos/rails/rails/issue...,https://github.com/rails/rails/pull/50958,2116092234,PR_kwDNIULOZeUV5A,50958,Add missing alias to errors array,...,,NONE,,False,{'url': 'https://api.github.com/repos/rails/ra...,it used to be that the << operator was able to...,{'url': 'https://api.github.com/repos/rails/ra...,https://api.github.com/repos/rails/rails/issue...,,
1,https://api.github.com/repos/rails/rails/issue...,https://api.github.com/repos/rails/rails,https://api.github.com/repos/rails/rails/issue...,https://api.github.com/repos/rails/rails/issue...,https://api.github.com/repos/rails/rails/issue...,https://github.com/rails/rails/issues/50954,2115098687,I_kwDNIULOfhHYPw,50954,esbuild precompilation error in rails 7.1.3,...,,NONE,,,,### Steps to reproduce\r\ncreate a new rails 7...,{'url': 'https://api.github.com/repos/rails/ra...,https://api.github.com/repos/rails/rails/issue...,,
2,https://api.github.com/repos/rails/rails/issue...,https://api.github.com/repos/rails/rails,https://api.github.com/repos/rails/rails/issue...,https://api.github.com/repos/rails/rails/issue...,https://api.github.com/repos/rails/rails/issue...,https://github.com/rails/rails/pull/50953,2114861781,PR_kwDNIULOZdP1vw,50953,Add webp as a default to active_storage.web_im...,...,,CONTRIBUTOR,,False,{'url': 'https://api.github.com/repos/rails/ra...,### Motivation / Background\r\n\r\nCustomers a...,{'url': 'https://api.github.com/repos/rails/ra...,https://api.github.com/repos/rails/rails/issue...,,
3,https://api.github.com/repos/rails/rails/issue...,https://api.github.com/repos/rails/rails,https://api.github.com/repos/rails/rails/issue...,https://api.github.com/repos/rails/rails/issue...,https://api.github.com/repos/rails/rails/issue...,https://github.com/rails/rails/pull/50952,2114686099,PR_kwDNIULOZdGLMw,50952,Tiny update to callbacks docs [ci skip],...,2024-02-02T12:33:18Z,CONTRIBUTOR,,False,{'url': 'https://api.github.com/repos/rails/ra...,### Motivation / Background\r\n\r\nThe followi...,{'url': 'https://api.github.com/repos/rails/ra...,https://api.github.com/repos/rails/rails/issue...,,
4,https://api.github.com/repos/rails/rails/issue...,https://api.github.com/repos/rails/rails,https://api.github.com/repos/rails/rails/issue...,https://api.github.com/repos/rails/rails/issue...,https://api.github.com/repos/rails/rails/issue...,https://github.com/rails/rails/pull/50951,2114685245,PR_kwDNIULOZdGIPQ,50951,set default_enforce_utf8 to false,...,,NONE,,False,{'url': 'https://api.github.com/repos/rails/ra...,`enforce_utf8` is false by deault in `form_for...,{'url': 'https://api.github.com/repos/rails/ra...,https://api.github.com/repos/rails/rails/issue...,,
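
The preview above mixes `/pull/` and `/issues/` URLs because GitHub's issues endpoint also returns pull requests, which carry a `pull_request` field. The following is a minimal sketch of separating the two kinds of record. The function name `split_issues_and_prs` and the sample records are hypothetical and only illustrate the idea.

```python
def split_issues_and_prs(records):
    """Split raw API records into (issues, pull_requests).

    A record counts as a pull request when it has a non-empty
    'pull_request' field; plain issues lack it or have it unset.
    """
    issues, pull_requests = [], []
    for record in records:
        if record.get("pull_request"):
            pull_requests.append(record)
        else:
            issues.append(record)
    return issues, pull_requests


# Hypothetical, heavily trimmed records for illustration only.
sample = [
    {"number": 1, "title": "A bug report"},
    {"number": 2, "title": "A patch", "pull_request": {"url": "https://example.invalid/pr/2"}},
    {"number": 3, "title": "Another bug report", "pull_request": None},
]

plain_issues, prs = split_issues_and_prs(sample)
print(len(plain_issues), len(prs))
```

On the DataFrame built above, a comparable split could probably be obtained with `df['pull_request'].isna()`, since the column is empty for plain issues in the preview.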


### 1. How does the number of issues evolve over time?

In [4]:

df['created_at'] = pd.to_datetime(df['created_at'])

issue_counts_by_date = df.resample('D', on='created_at').size()

# Plot
fig = px.line(issue_counts_by_date, title='Number of Issues Over Time')
fig.update_xaxes(title_text='Date')
fig.update_yaxes(title_text='Number of Issues')
fig.show()


### 2. Are there any periods in which we get more issues?

Yes. The peaks in the daily line chart above mark periods in which more issues were reported:

- The most significant peak falls around the first week of January 2024. This could indicate a surge in activity, possibly due to new-year code sprints, releases, or other community activities.
- Another noticeable peak occurs in the third week of January 2024.
- The final week of January 2024 also shows an increased number of issues, though the rise is less pronounced than the first peak.

These peaks could have various causes, such as new feature deployments, version updates, or newly discovered bugs. Cross-referencing the dates with the Rails project's update logs, community forums, or other documentation would help explain the increases.
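
Reading peaks off a chart is subjective, so a simple threshold can make the call repeatable. The sketch below flags days whose count exceeds the mean by a chosen number of standard deviations. The function name `find_peak_days`, the `num_std` parameter, and the sample dates are assumptions for illustration, not part of the analysis above.

```python
from collections import Counter
from datetime import date
from statistics import mean, pstdev


def find_peak_days(created_dates, num_std=1.0):
    """Return (day, count) pairs whose count exceeds mean + num_std * std.

    Only days with at least one record are considered; days with zero
    records are not part of the statistics in this sketch.
    """
    counts = Counter(created_dates)
    values = list(counts.values())
    threshold = mean(values) + num_std * pstdev(values)
    return sorted((day, n) for day, n in counts.items() if n > threshold)


# Made-up creation dates, not taken from the Rails data.
sample_dates = (
    [date(2024, 1, 3)] * 9
    + [date(2024, 1, 4)] * 2
    + [date(2024, 1, 5)] * 3
    + [date(2024, 1, 6)] * 2
)

peaks = find_peak_days(sample_dates)
print(peaks)
```

The threshold is a judgment call: a larger `num_std` flags only the most extreme days, while a smaller one also reports minor bumps.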






### 3. Is there anyone who reports more issues than others?

In [5]:

df['reporter'] = df['user'].apply(lambda x: x['login'] if isinstance(x, dict) else None)

top_reporters = df['reporter'].value_counts().head(10)

print(top_reporters)

skipkayhil        33
seanpdoyle        26
p8                25
dhh               24
akhilgkrishnan    21
casperisfine      14
sato11            14
byroot            11
Earlopain         10
dorianmarie        9
Name: reporter, dtype: int64


In [6]:
top_reporters_data = top_reporters.reset_index()
top_reporters_data.columns = ['Reporter', 'Frequency']

fig = px.bar(top_reporters_data, x='Reporter', y='Frequency',
             title='Top Issue Reporters',
             labels={'Reporter': 'Username', 'Frequency': 'Number of Issues Reported'},
             color='Frequency')

fig.show()
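
Raw counts are easier to judge as shares of the sample. The sketch below reuses the ten counts printed above and assumes the sample holds 500 records (five full pages of 100), which the notebook does not print explicitly. `TOTAL_RECORDS` is an illustrative name.

```python
# Counts copied from the printed top-10 output above.
top_reporters = {
    "skipkayhil": 33,
    "seanpdoyle": 26,
    "p8": 25,
    "dhh": 24,
    "akhilgkrishnan": 21,
    "casperisfine": 14,
    "sato11": 14,
    "byroot": 11,
    "Earlopain": 10,
    "dorianmarie": 9,
}

# Assumption: five full pages of 100 records each were fetched.
TOTAL_RECORDS = 500

shares = {login: count / TOTAL_RECORDS for login, count in top_reporters.items()}
top10_share = sum(top_reporters.values()) / TOTAL_RECORDS

for login, share in shares.items():
    print(f"{login:<16}{share:6.1%}")
print(f"Top 10 combined: {top10_share:.1%}")
```

Under that assumption, the ten most active accounts opened a bit more than a third of the sampled records. With the DataFrame at hand, `df['reporter'].value_counts(normalize=True)` should give the same shares without hard-coding the total.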

### 4. What is the most popular category (label)?

In [7]:
def extract_label_names(labels):
    """Return the names of all labels attached to one issue."""
    return [label['name'] for label in labels if 'name' in label]


df['label_names'] = df['labels'].apply(extract_label_names)
# Flatten the per-issue lists and count how often each label occurs.
label_counts = Counter(name for names in df['label_names'] for name in names)
labels_frequency_df = (
    pd.DataFrame(label_counts.items(), columns=['Label Name', 'Frequency'])
    .sort_values(by='Frequency', ascending=False)
    .reset_index(drop=True)
)


In [8]:
fig = px.bar(labels_frequency_df.head(10),  
             x='Label Name',
             y='Frequency',
             title='Top 10 Most Popular Labels',
             labels={'Label Name': 'Label Name', 'Frequency': 'Frequency'},
             color='Frequency',  
             )


fig.update_layout(xaxis_title="Label Name",
                  yaxis_title="Frequency",
                  xaxis={'categoryorder':'total descending'} 
                  )

fig.show()

### Most Discussed Issues

In [9]:
most_discussed_issues = df.sort_values(by='comments', ascending=False)
print(most_discussed_issues[['title', 'comments']].head(10))

                                                 title  comments
496        Set a new default for the Puma thread count        83
490            Add (a very basic!!) Rubocop by default        30
180  Request: rename Rails console as irr or someth...        23
384             Generate devcontainer files by default        20
492  Extract Action Notifier framework for push not...        13
439                 Default to creating git pre-commit        13
460  Add rate limiting to Action Controller via the...        13
63          Introduce `ActiveSupport::TestCase.around`        11
471  Add Thruster to Docker setup to get HTTP/2, X-...        11


In [10]:
issue_counts_by_month = df.resample('M', on='created_at').size()

# Plot
fig = px.line(issue_counts_by_month, title='Number of Issues Each Month')
fig.update_xaxes(title_text='Month')
fig.update_yaxes(title_text='Number of Issues')
fig.show()

# Classification Task

In [13]:
import pandas as pd
import torch
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def get_first_label(label_list):
    """Return the name of the first label, or 'No Label' if the issue has none."""
    if label_list:
        return label_list[0]['name']
    else:
        return 'No Label'

df['single_label'] = df['labels'].apply(get_first_label)

df['body'] = df['body'].astype(str)  # ensure the text column is string-typed
# Encode label names as integer class ids (0..n_classes-1) for the model.
df['single_label'] = pd.factorize(df['single_label'])[0]

class GitHubIssueDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Prepare datasets
# A fixed seed keeps the train/validation split reproducible between runs.
X_train, X_val, y_train, y_val = train_test_split(df['body'], df['single_label'], test_size=0.2, random_state=42)
train_encodings = tokenizer(X_train.tolist(), truncation=True, padding=True)
val_encodings = tokenizer(X_val.tolist(), truncation=True, padding=True)
train_dataset = GitHubIssueDataset(train_encodings, y_train.tolist())
val_dataset = GitHubIssueDataset(val_encodings, y_val.tolist())

# Load model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(df['single_label'].unique()))

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    evaluation_strategy="epoch", 
    logging_steps=10,
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Custom compute_metrics function
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)  # Convert model logits to class predictions
    # Use 'macro', 'micro', or 'weighted' averaging based on your specific needs
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }


# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Evaluate the model
trainer.evaluate()


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/150 [00:00<?, ?it/s]

{'loss': 2.9678, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.2}
{'loss': 2.8811, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.4}
{'loss': 2.8296, 'learning_rate': 3e-06, 'epoch': 0.6}
{'loss': 2.7329, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.8}
{'loss': 2.6656, 'learning_rate': 5e-06, 'epoch': 1.0}


  0%|          | 0/13 [00:00<?, ?it/s]


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.


Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.



{'eval_loss': 2.6227939128875732, 'eval_accuracy': 0.21, 'eval_f1': 0.05402041324371421, 'eval_precision': 0.10384615384615385, 'eval_recall': 0.07806267806267807, 'eval_runtime': 176.4532, 'eval_samples_per_second': 0.567, 'eval_steps_per_second': 0.074, 'epoch': 1.0}
{'loss': 2.6265, 'learning_rate': 6e-06, 'epoch': 1.2}
{'loss': 2.5047, 'learning_rate': 7.000000000000001e-06, 'epoch': 1.4}
{'loss': 2.5653, 'learning_rate': 8.000000000000001e-06, 'epoch': 1.6}
{'loss': 2.4543, 'learning_rate': 9e-06, 'epoch': 1.8}
{'loss': 2.4802, 'learning_rate': 1e-05, 'epoch': 2.0}


  0%|          | 0/13 [00:00<?, ?it/s]


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.



{'eval_loss': 2.3065574169158936, 'eval_accuracy': 0.33, 'eval_f1': 0.09999195080330985, 'eval_precision': 0.22240642499263188, 'eval_recall': 0.12989766081871343, 'eval_runtime': 243.3792, 'eval_samples_per_second': 0.411, 'eval_steps_per_second': 0.053, 'epoch': 2.0}
{'loss': 2.3586, 'learning_rate': 1.1000000000000001e-05, 'epoch': 2.2}
{'loss': 2.4396, 'learning_rate': 1.2e-05, 'epoch': 2.4}
{'loss': 2.3297, 'learning_rate': 1.3000000000000001e-05, 'epoch': 2.6}
{'loss': 2.2934, 'learning_rate': 1.4000000000000001e-05, 'epoch': 2.8}
{'loss': 2.1659, 'learning_rate': 1.5e-05, 'epoch': 3.0}


  0%|          | 0/13 [00:00<?, ?it/s]


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.



{'eval_loss': 2.1251490116119385, 'eval_accuracy': 0.33, 'eval_f1': 0.08492127136874532, 'eval_precision': 0.14012023757786468, 'eval_recall': 0.1237248213125406, 'eval_runtime': 244.55, 'eval_samples_per_second': 0.409, 'eval_steps_per_second': 0.053, 'epoch': 3.0}
{'train_runtime': 7443.1479, 'train_samples_per_second': 0.161, 'train_steps_per_second': 0.02, 'train_loss': 2.553015111287435, 'epoch': 3.0}


  0%|          | 0/13 [00:00<?, ?it/s]


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.



{'eval_loss': 2.1251490116119385,
 'eval_accuracy': 0.33,
 'eval_f1': 0.08492127136874532,
 'eval_precision': 0.14012023757786468,
 'eval_recall': 0.1237248213125406,
 'eval_runtime': 238.3497,
 'eval_samples_per_second': 0.42,
 'eval_steps_per_second': 0.055,
 'epoch': 3.0}
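
The logged learning rates climb steadily from 1e-06 to 1.5e-05 and never level off. That pattern fits a linear warmup over the configured `warmup_steps=500` when the run only has 150 steps in total. The sketch below reproduces the logged values, assuming the `Trainer` default peak learning rate of 5e-5, since the notebook does not set `learning_rate`. The function `linear_warmup_lr` is a simplified stand-in, not the library's scheduler.

```python
def linear_warmup_lr(step, peak_lr=5e-5, warmup_steps=500):
    """Learning rate during linear warmup: rises from 0 to peak_lr.

    A real schedule decays after warmup; this sketch only models the
    ramp and simply holds peak_lr once warmup is over.
    """
    if step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps


TOTAL_STEPS = 150  # from the progress bar in the training output

for step in (10, 50, 100, TOTAL_STEPS):
    print(step, linear_warmup_lr(step))
```

Under these assumptions the run ends at 30% of the peak learning rate, so a shorter warmup (or more training steps) may be worth trying before drawing conclusions about the model.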

### Conclusion and Next Steps

The model was trained for three epochs on the bodies of Ruby on Rails GitHub issues and evaluated on a held-out 20% split. It reaches an accuracy of 33%, but its macro-averaged precision, recall, and F1 scores are only about 14%, 12%, and 8%, respectively. The gap between accuracy and macro F1, together with the repeated warnings about labels with no predicted samples, suggests that the model predicts only a few frequent classes and ignores the rest. In its current state it is not accurate enough for real-world use.

The evaluation loss is also still high (about 2.13) and was still falling after the third epoch, which suggests the model has not finished learning and generalizes poorly to unseen data. Likely contributing factors include class imbalance, inadequate feature representation, and untuned hyperparameters.
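
To see how class imbalance can produce a moderate accuracy alongside a very low macro F1, consider a classifier that always predicts the most frequent class. The sketch below computes both metrics by hand on made-up labels. The helper `macro_f1` is written out for illustration and is meant to mirror macro averaging, not to replace the scikit-learn function used in the notebook.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: class 0 dominates, classes 1 to 3 are rare.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 3, 3]

# Baseline that always predicts the most frequent class.
majority_class = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_class] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f}  macro_f1={macro_f1(y_true, y_pred):.3f}")
```

Comparing the trained model against such a baseline on the real validation split would show how much it has learned beyond the label distribution.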

The following steps are recommended to improve the model's performance:

- Data Augmentation: Expand the dataset by fetching more issues, or apply text augmentation techniques to add variety to the training examples.
- Class Imbalance Mitigation: Use resampling techniques or class weights to compensate for imbalance in the dataset.
- Hyperparameter Optimization: Use systematic techniques such as grid search or random search to explore hyperparameter settings. Note that `warmup_steps=500` exceeds the 150 total training steps of this run, so the learning rate was still ramping up when training ended.
- Model Complexity Review: Investigate other model architectures, or a larger model, to capture subtler patterns in the data.
- Feature Engineering: Use richer embeddings or experiment with other text preprocessing methods to improve the input features.
- Cross-validation: Use k-fold cross-validation to check that the model is reliable and stable across different subsets of the data.
- Alternative Evaluation Metrics: The scores above are already macro-averaged. Also report weighted averages and per-class precision, recall, and F1 to see how performance varies across classes.
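
As one concrete instance of the class-weight suggestion above, the sketch below computes "balanced" weights, where each class is weighted by `n_samples / (n_classes * class_count)`. This is the same heuristic scikit-learn uses for `class_weight='balanced'`. The function name `balanced_class_weights` and the sample labels are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


# Made-up integer class ids, not taken from the Rails data.
sample_labels = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]

weights = balanced_class_weights(sample_labels)
print(weights)
```

The resulting values could be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`. Wiring that into the `Trainer` used above would need a custom loss and is not shown here.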

Addressing these areas should lead to a more accurate and trustworthy classifier for Rails issues. Such a classifier would help maintainers categorize and prioritize incoming issues and so streamline collaborative development.