# Quality Metrics Analysis - 75QUA

This notebook analyzes CK metrics and the bugs detected by SpotBugs across multiple releases of a Java project.

## ⚠️ IMPORTANT: Run the Analysis First!

```bash
make analyze REPO=jhy/jsoup
# OR
make analyze-limit REPO=jhy/jsoup LIMIT=5
```

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from pathlib import Path
import json
import xml.etree.ElementTree as ET

# Plot styling
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

## 1. Configuration

In [None]:
# CONFIGURATION - change this to your project's name
PROJECT_NAME = "jsoup"
RESULTS_DIR = Path(f"/workspace/results/{PROJECT_NAME}")

print(f"Analyzing project: {PROJECT_NAME}")
print(f"Results directory: {RESULTS_DIR}")
print(f"Directory exists: {RESULTS_DIR.exists()}")

# Default to an empty list so later cells still run when the directory is missing
release_dirs = []
if RESULTS_DIR.exists():
    # One subdirectory per release; hidden directories are skipped.
    # Note: sorted() orders the names as plain strings, not by version number.
    release_dirs = sorted(d for d in RESULTS_DIR.glob('*') if d.is_dir() and not d.name.startswith('.'))
    print(f"✓ Found {len(release_dirs)} releases")
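
Sorting directory names as plain strings places `1.10.1` before `1.9.2`, which affects which release the later cells treat as the first and the latest. Below is a minimal sketch of a version-aware sort key. It assumes release directories are named after tags that contain a dotted version number, such as `jsoup-1.10.1`; both that naming scheme and the helper name `release_sort_key` are assumptions for illustration.

```python
import re

def release_sort_key(name):
    """Hypothetical helper: map 'jsoup-1.10.1' to ((1, 10, 1), name) for sorting."""
    # Take the first dotted run of digits found in the name (assumed naming scheme).
    match = re.search(r'\d+(?:\.\d+)*', name)
    if match is None:
        # Names without a version number sort first, tie-broken alphabetically.
        return ((), name)
    return (tuple(int(part) for part in match.group().split('.')), name)

names = ["jsoup-1.9.2", "jsoup-1.10.1", "jsoup-1.2.3"]
print(sorted(names))                        # plain string order
print(sorted(names, key=release_sort_key))  # numeric version order
```

With `Path` objects the key would be applied to the directory name, for example `sorted(dirs, key=lambda d: release_sort_key(d.name))`.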

## 2. Load CK Metrics

In [None]:
all_metrics = []

for release_dir in release_dirs:
    # CK writes one row per class to ck/class.csv
    class_csv = release_dir / 'ck' / 'class.csv'

    if class_csv.exists():
        df = pd.read_csv(class_csv)
        df['release'] = release_dir.name

        # Attach the publication date when release metadata is available
        metadata_file = release_dir / 'metadata.json'
        if metadata_file.exists():
            with open(metadata_file) as f:
                metadata = json.load(f)
                df['release_date'] = metadata.get('published_date', '')

        all_metrics.append(df)
        print(f"✓ {release_dir.name}: {len(df)} classes")

if all_metrics:
    df_all = pd.concat(all_metrics, ignore_index=True)
    print(f"\n✓ Total classes: {len(df_all)}")
    print(f"✓ Releases: {df_all['release'].nunique()}")
else:
    df_all = pd.DataFrame()

In [None]:
# Inspect the structure of the data
df_all.head()
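
The cells that follow rely on `class.csv` providing the columns `class`, `wmc`, `dit`, `noc`, `cbo`, `lcom`, `rfc` and `loc`. A small guard can report missing columns before an aggregation fails with a `KeyError`. This is only a sketch: `missing_columns` is a hypothetical helper and the sample frame is made up.

```python
import pandas as pd

# Columns used by the later cells of this notebook.
REQUIRED_COLUMNS = ['class', 'wmc', 'dit', 'noc', 'cbo', 'lcom', 'rfc', 'loc']

def missing_columns(df, required=REQUIRED_COLUMNS):
    """Hypothetical helper: list the required columns that df does not have."""
    return [col for col in required if col not in df.columns]

# Made-up frame standing in for df_all; 'rfc' and 'loc' are deliberately absent.
sample = pd.DataFrame({
    'class': ['org.example.A'], 'wmc': [12], 'dit': [1],
    'noc': [0], 'cbo': [4], 'lcom': [3],
})
print(missing_columns(sample))
```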

## 3. Descriptive Statistics by Release

In [None]:
if not df_all.empty:
    metrics_by_release = df_all.groupby('release').agg({
        'wmc': ['mean', 'median', 'std', 'max'],
        'dit': ['mean', 'median', 'std', 'max'],
        'noc': ['mean', 'median', 'std', 'max'],
        'cbo': ['mean', 'median', 'std', 'max'],
        'lcom': ['mean', 'median', 'std', 'max'],
        'rfc': ['mean', 'median', 'std', 'max'],
        'loc': ['sum', 'mean', 'median', 'std']
    }).round(2)
    
    display(metrics_by_release)
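
Aggregating with a dict of lists produces two-level column labels such as `('wmc', 'mean')`, which is why the plotting cells index `metrics_by_release` with tuples. If single-level names are easier to work with, for instance in an exported CSV, the two levels can be joined. A sketch with made-up data:

```python
import pandas as pd

# Made-up data standing in for df_all.
sample = pd.DataFrame({
    'release': ['r1', 'r1', 'r2'],
    'wmc': [10, 20, 30],
    'loc': [100, 200, 300],
})
summary = sample.groupby('release').agg({'wmc': ['mean', 'max'], 'loc': ['sum']})

# Join the two header levels: ('wmc', 'mean') -> 'wmc_mean'
flat = summary.copy()
flat.columns = ['_'.join(col) for col in summary.columns]
print(list(flat.columns))
```

Working on a copy keeps the tuple-indexed frame intact for the cells that still expect it.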

## 4. Visualization - Metric Evolution

In [None]:
if not df_all.empty:
    fig, axes = plt.subplots(2, 2, figsize=(16, 10))
    fig.suptitle('Evolution of CK Metrics', fontsize=16, fontweight='bold')

    # (column in metrics_by_release, panel title, y-axis label, marker, color)
    panels = [
        (('wmc', 'mean'), 'WMC (Weighted Methods per Class) - Mean', 'Mean WMC', 'o', 'blue'),
        (('cbo', 'mean'), 'CBO (Coupling Between Objects) - Mean', 'Mean CBO', 's', 'green'),
        (('lcom', 'mean'), 'LCOM (Lack of Cohesion of Methods) - Mean', 'Mean LCOM', '^', 'red'),
        (('loc', 'sum'), 'LOC (Lines of Code) - Total', 'Total LOC', 'D', 'purple'),
    ]

    # One line chart per metric, laid out row by row on the 2x2 grid
    for ax, (column, title, ylabel, marker, color) in zip(axes.flat, panels):
        metrics_by_release[column].plot(ax=ax, marker=marker, color=color)
        ax.set_title(title)
        ax.set_ylabel(ylabel)
        ax.tick_params(axis='x', rotation=45)
        ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig(RESULTS_DIR / 'metrics_evolution.png', dpi=300, bbox_inches='tight')
    plt.show()

## 5. Metric Distributions (Boxplots)

**Useful for spotting outlier classes in each release**
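
By default, matplotlib boxplots draw whiskers at 1.5 times the interquartile range (IQR) and plot anything beyond them as individual points. The same rule can be applied to the data to list the outlier classes by name. A minimal sketch, where `iqr_outliers` is a hypothetical helper and the sample data is made up:

```python
import pandas as pd

def iqr_outliers(df, metric, k=1.5):
    """Hypothetical helper: rows whose metric lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[metric].quantile(0.25)
    q3 = df[metric].quantile(0.75)
    iqr = q3 - q1
    mask = (df[metric] < q1 - k * iqr) | (df[metric] > q3 + k * iqr)
    return df[mask]

# Made-up classes of a single release; class F is far above the rest.
sample = pd.DataFrame({
    'class': ['A', 'B', 'C', 'D', 'E', 'F'],
    'wmc': [5, 6, 7, 8, 9, 80],
})
print(iqr_outliers(sample, 'wmc'))
```

To mirror the per-release boxplots, the helper would be applied to the rows of one release at a time.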

In [None]:
if not df_all.empty:
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))

    metrics = ['wmc', 'dit', 'noc', 'cbo', 'lcom', 'rfc']

    # One boxplot panel per metric, grouped by release
    for metric, ax in zip(metrics, axes.flat):
        df_all.boxplot(column=metric, by='release', ax=ax, rot=45)
        ax.set_title(metric.upper())
        ax.set_xlabel('')

    # Set the figure title after plotting: boxplot(by=...) writes its own suptitle
    fig.suptitle('Distribution of CK Metrics by Release', fontsize=16, fontweight='bold')
    plt.tight_layout()
    plt.savefig(RESULTS_DIR / 'metrics_distribution.png', dpi=300, bbox_inches='tight')
    plt.show()

## 6. Correlation Between Metrics (Heatmap)

**Shows how the CK metrics relate to one another**
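
`DataFrame.corr()` computes the Pearson coefficient by default, which measures linear association. When a few very large classes dominate the data, a rank-based coefficient can be a useful cross-check; pandas supports it through `method='spearman'`. A sketch with made-up values:

```python
import pandas as pd

# Made-up data: 'loc' grows monotonically with 'wmc', but not linearly.
sample = pd.DataFrame({
    'wmc': [1, 2, 3, 4, 5, 6],
    'loc': [1, 4, 9, 16, 25, 500],
})

pearson = sample.corr()                     # default: linear (Pearson) correlation
spearman = sample.corr(method='spearman')   # rank-based correlation

print(pearson.loc['wmc', 'loc'])
print(spearman.loc['wmc', 'loc'])
```

Because the ranks of the two columns agree exactly, the Spearman coefficient is 1 here, while the single large value pulls the Pearson coefficient well below that.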

In [None]:
if not df_all.empty:
    correlation_metrics = ['wmc', 'dit', 'noc', 'cbo', 'lcom', 'rfc', 'loc']
    corr_matrix = df_all[correlation_metrics].corr()
    
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
                square=True, linewidths=1, cbar_kws={"shrink": 0.8})
    plt.title('Correlation Matrix of CK Metrics', fontsize=14, fontweight='bold', pad=20)
    plt.tight_layout()
    plt.savefig(RESULTS_DIR / 'correlation_matrix.png', dpi=300, bbox_inches='tight')
    plt.show()

## 7. Top Problem Classes

In [None]:
if not df_all.empty:
    # Last release in load order (release directories are sorted by name)
    latest_release = df_all[df_all['release'] == df_all['release'].unique()[-1]]

    print("Top 10 Classes by Complexity (WMC):")
    print(latest_release.nlargest(10, 'wmc')[['class', 'wmc', 'cbo', 'lcom', 'loc']])

    print("\nTop 10 Classes by Coupling (CBO):")
    print(latest_release.nlargest(10, 'cbo')[['class', 'wmc', 'cbo', 'lcom', 'loc']])

    print("\nTop 10 Classes by Lack of Cohesion (LCOM):")
    print(latest_release.nlargest(10, 'lcom')[['class', 'wmc', 'cbo', 'lcom', 'loc']])

## 8. Trend Analysis

In [None]:
if not df_all.empty:
    # First and last rows of the per-release summary (rows are ordered by release name)
    first_release = metrics_by_release.iloc[0]
    last_release = metrics_by_release.iloc[-1]

    # Mean of each CK metric, plus total LOC
    columns = [('wmc', 'mean'), ('dit', 'mean'), ('noc', 'mean'), ('cbo', 'mean'),
               ('lcom', 'mean'), ('rfc', 'mean'), ('loc', 'sum')]

    growth_rates = pd.DataFrame({
        'Metric': ['WMC', 'DIT', 'NOC', 'CBO', 'LCOM', 'RFC', 'LOC (total)'],
        'First Release': [first_release[col] for col in columns],
        'Last Release': [last_release[col] for col in columns],
    })

    # Relative change in percent; a zero baseline yields NaN instead of inf
    baseline = growth_rates['First Release'].replace(0, np.nan)
    growth_rates['Change (%)'] = ((growth_rates['Last Release'] - growth_rates['First Release']) / baseline * 100).round(2)

    print("Metric Growth Analysis:")
    display(growth_rates)
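
The change column is the relative difference between the two releases: (last - first) / first × 100. The ratio is undefined when the first-release value is 0, which can happen for a metric such as mean NOC. A sketch of the same computation for a single pair of values, with that case handled explicitly (`percent_change` is a hypothetical helper):

```python
import math

def percent_change(first, last):
    """Hypothetical helper: relative change in percent, NaN when undefined."""
    if first == 0 or math.isnan(first) or math.isnan(last):
        # A zero or missing baseline has no meaningful relative change.
        return float('nan')
    return round((last - first) / first * 100, 2)

print(percent_change(12.5, 15.0))  # growth from 12.5 to 15.0
print(percent_change(10, 5))       # a decrease gives a negative value
print(percent_change(0, 0.3))      # zero baseline: undefined
```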

## 9. Export CK Metrics

In [None]:
if not df_all.empty:
    metrics_by_release.to_csv(RESULTS_DIR / 'metrics_summary.csv')
    growth_rates.to_csv(RESULTS_DIR / 'growth_rates.csv', index=False)
    print("✓ CK metrics exported:")
    print("  - metrics_summary.csv")
    print("  - growth_rates.csv")

## 10. Bug Analysis - SpotBugs + find-sec-bugs

### 10.1 Load Bugs

In [None]:
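def parse_spotbugs_xml(xml_file):
    """Parse a SpotBugs XML report into a list of dicts, one per BugInstance.

    Returns an empty list if the file cannot be read or parsed.
    """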
    try:
        tree = ET.parse(xml_file)
        root = tree.getroot()
        bugs = []
        
        for bug in root.findall('.//BugInstance'):
            bug_info = {
                'type': bug.get('type'),
                'priority': int(bug.get('priority', 0)),
                'rank': int(bug.get('rank', 0)),
                'category': bug.get('category'),
                'abbrev': bug.get('abbrev', ''),
            }
            
            class_elem = bug.find('.//Class')
            bug_info['class'] = class_elem.get('classname', '') if class_elem is not None else ''
            
            method_elem = bug.find('.//Method')
            bug_info['method'] = method_elem.get('name', '') if method_elem is not None else ''
            
            long_msg = bug.find('.//LongMessage')
            bug_info['description'] = long_msg.text if long_msg is not None else ''
            
            bugs.append(bug_info)
        
        return bugs
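    except (ET.ParseError, OSError, ValueError) as e:
        # Unreadable file, malformed XML, or a non-numeric priority/rank attribute
        print(f"Error parsing {xml_file}: {e}")
        return []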

# Collect bugs from every release
all_bugs = []

for release_dir in release_dirs:
    spotbugs_xml = release_dir / 'spotbugs-report.xml'

    if spotbugs_xml.exists():
        bugs = parse_spotbugs_xml(spotbugs_xml)

        # Read the release metadata once per release rather than once per bug
        release_date = None
        metadata_file = release_dir / 'metadata.json'
        if metadata_file.exists():
            with open(metadata_file) as f:
                release_date = json.load(f).get('published_date', '')

        for bug in bugs:
            bug['release'] = release_dir.name
            if release_date is not None:
                bug['release_date'] = release_date

        all_bugs.extend(bugs)
        print(f"✓ {release_dir.name}: {len(bugs)} bugs")
    else:
        print(f"✗ {release_dir.name}: no SpotBugs report")

if all_bugs:
    df_bugs = pd.DataFrame(all_bugs)
    # SpotBugs priority codes: 1 = high, 2 = medium, 3 = low
    df_bugs['priority_label'] = df_bugs['priority'].map({1: 'HIGH', 2: 'MEDIUM', 3: 'LOW'})
    print(f"\n✓ Total bugs: {len(df_bugs)}")
    print(f"✓ Releases with bugs: {df_bugs['release'].nunique()}")
else:
    df_bugs = pd.DataFrame()
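
To make the expected input concrete, the sketch below runs the same element and attribute lookups on a hand-written fragment. The fragment is illustrative only, not output from a real run; it contains just the elements and attributes that `parse_spotbugs_xml` reads. SpotBugs typically writes `LongMessage` only when the report is generated with messages enabled (the `-xml:withMessages` output option), so `description` may be empty otherwise.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration, not output from a real analysis.
SAMPLE_REPORT = """
<BugCollection>
  <BugInstance type="NP_NULL_ON_SOME_PATH" priority="1" rank="8"
               abbrev="NP" category="CORRECTNESS">
    <LongMessage>Possible null pointer dereference in org.example.Foo.bar()</LongMessage>
    <Class classname="org.example.Foo"/>
    <Method classname="org.example.Foo" name="bar"/>
  </BugInstance>
</BugCollection>
"""

root = ET.fromstring(SAMPLE_REPORT)
for bug in root.findall('.//BugInstance'):
    # Same lookups as parse_spotbugs_xml, without the None checks for brevity.
    record = {
        'type': bug.get('type'),
        'priority': int(bug.get('priority', 0)),
        'category': bug.get('category'),
        'class': bug.find('.//Class').get('classname', ''),
        'method': bug.find('.//Method').get('name', ''),
        'description': bug.find('.//LongMessage').text,
    }
    print(record)
```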

### 10.2 Bug Statistics

In [None]:
if not df_bugs.empty:
    bugs_by_release = df_bugs.groupby('release').agg({
        'type': 'count',
        'priority': ['mean', 'min', 'max']
    }).round(2)
    bugs_by_release.columns = ['Total_Bugs', 'Priority_Mean', 'Priority_Min', 'Priority_Max']
    
    print("Bugs by Release:")
    display(bugs_by_release)

    print("\nBugs by Category:")
    print(df_bugs['category'].value_counts())

    print("\nBugs by Priority:")
    print(df_bugs['priority_label'].value_counts())
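
Priority is an ordinal code, so its per-release mean can be hard to interpret. A count per priority level and release is often easier to read. A sketch using `pd.crosstab` on made-up records:

```python
import pandas as pd

# Made-up records standing in for df_bugs.
sample = pd.DataFrame({
    'release': ['r1', 'r1', 'r1', 'r2', 'r2'],
    'priority_label': ['HIGH', 'LOW', 'LOW', 'MEDIUM', 'LOW'],
})

# Rows: releases; columns: priority labels; cells: number of bugs.
counts = pd.crosstab(sample['release'], sample['priority_label'])
# Fix the column order and show missing levels as zero.
counts = counts.reindex(columns=['HIGH', 'MEDIUM', 'LOW'], fill_value=0)
print(counts)
```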

### 10.3 Bug Visualizations

In [None]:
if not df_bugs.empty:
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('Bug Analysis - SpotBugs + find-sec-bugs', fontsize=16, fontweight='bold')

    # 1. Bug count over releases
    bugs_by_release['Total_Bugs'].plot(ax=axes[0, 0], marker='o', color='red', linewidth=2)
    axes[0, 0].set_title('Evolution of Total Bugs')
    axes[0, 0].set_ylabel('Total Bugs')
    axes[0, 0].tick_params(axis='x', rotation=45)
    axes[0, 0].grid(True, alpha=0.3)

    # 2. Bugs by category
    df_bugs['category'].value_counts().head(10).plot(kind='barh', ax=axes[0, 1], color='orange')
    axes[0, 1].set_title('Top 10 Categories')
    axes[0, 1].set_xlabel('Count')

    # 3. Distribution by priority (fixed label order so each level keeps its color)
    priority_order = ['HIGH', 'MEDIUM', 'LOW']
    priority_colors = {'HIGH': '#ff4444', 'MEDIUM': '#ffaa44', 'LOW': '#44ff44'}
    priority_counts = df_bugs['priority_label'].value_counts().reindex(priority_order).dropna()
    priority_counts.plot(
        kind='pie', ax=axes[1, 0], autopct='%1.1f%%',
        colors=[priority_colors[label] for label in priority_counts.index]
    )
    axes[1, 0].set_title('Distribution by Priority')
    axes[1, 0].set_ylabel('')

    # 4. Most frequent bug types
    df_bugs['type'].value_counts().head(10).plot(kind='barh', ax=axes[1, 1], color='steelblue')
    axes[1, 1].set_title('Top 10 Bug Types')
    axes[1, 1].set_xlabel('Count')

    plt.tight_layout()
    plt.savefig(RESULTS_DIR / 'bugs_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()

### 10.4 Security Bug Analysis (find-sec-bugs)

In [None]:
if not df_bugs.empty:
    # Keep only findings in the SECURITY category
    security_bugs = df_bugs[df_bugs['category'] == 'SECURITY'].copy()

    print(f"Total security bugs: {len(security_bugs)}")

    if not security_bugs.empty:
        print("\nSecurity Bugs by Release:")
        print(security_bugs['release'].value_counts().sort_index())

        print("\nTop 10 Vulnerability Types:")
        print(security_bugs['type'].value_counts().head(10))

        print("\nTop 10 Classes with the Most Security Bugs:")
        print(security_bugs['class'].value_counts().head(10))

        # Visualization
        fig, axes = plt.subplots(1, 2, figsize=(16, 6))
        fig.suptitle('Security Bug Analysis', fontsize=16, fontweight='bold')

        security_by_release = security_bugs.groupby('release').size()
        security_by_release.plot(ax=axes[0], marker='o', color='darkred', linewidth=2)
        axes[0].set_title('Evolution of Security Bugs')
        axes[0].set_ylabel('Count')
        axes[0].tick_params(axis='x', rotation=45)
        axes[0].grid(True, alpha=0.3)

        security_bugs['type'].value_counts().head(10).plot(kind='barh', ax=axes[1], color='crimson')
        axes[1].set_title('Top 10 Vulnerabilities')
        axes[1].set_xlabel('Count')

        plt.tight_layout()
        plt.savefig(RESULTS_DIR / 'security_bugs.png', dpi=300, bbox_inches='tight')
        plt.show()
    else:
        print("\n✓ No security bugs found")

### 10.5 Critical Bugs (HIGH Priority)

In [None]:
if not df_bugs.empty:
    critical_bugs = df_bugs[df_bugs['priority'] == 1].copy()

    print(f"Total CRITICAL bugs: {len(critical_bugs)}")

    if not critical_bugs.empty:
        # Last release in load order (release directories are sorted by name)
        latest_release_name = df_bugs['release'].unique()[-1]
        latest_critical = critical_bugs[critical_bugs['release'] == latest_release_name]

        print(f"\nCritical bugs in the latest release ({latest_release_name}): {len(latest_critical)}")
        print("\n" + "="*80)
        print("CRITICAL BUG DETAILS (first 15):")
        print("="*80)

        for _, bug in latest_critical.head(15).iterrows():
            print(f"\n🔴 {bug['type']} - {bug['category']}")
            print(f"   Class: {bug['class']}")
            if bug['method']:
                print(f"   Method: {bug['method']}")
            if bug['description']:
                # Append an ellipsis only when the description was actually cut
                suffix = '...' if len(bug['description']) > 150 else ''
                print(f"   {bug['description'][:150]}{suffix}")
            print("   " + "-"*76)
    else:
        print("\n✓ No critical bugs found!")

### 10.6 Export Bug Data

In [None]:
if not df_bugs.empty:
    # Export all bugs (security_bugs and critical_bugs come from sections 10.4 and 10.5)
    df_bugs.to_csv(RESULTS_DIR / 'bugs_all_releases.csv', index=False)
    print("✓ Bugs exported:")
    print("  - bugs_all_releases.csv")

    if not security_bugs.empty:
        security_bugs.to_csv(RESULTS_DIR / 'security_bugs.csv', index=False)
        print("  - security_bugs.csv")

    if not critical_bugs.empty:
        critical_bugs.to_csv(RESULTS_DIR / 'critical_bugs.csv', index=False)
        print("  - critical_bugs.csv")

    # JSON summary (df_bugs is non-empty here, so value_counts() has at least one entry)
    summary_stats = {
        'total_bugs': len(df_bugs),
        'security_bugs': len(security_bugs),
        'critical_bugs': len(critical_bugs),
        'releases_analyzed': df_bugs['release'].nunique(),
        'most_common_bug_type': df_bugs['type'].value_counts().index[0],
        'most_common_category': df_bugs['category'].value_counts().index[0]
    }

    with open(RESULTS_DIR / 'bugs_summary.json', 'w') as f:
        json.dump(summary_stats, f, indent=2)
    print("  - bugs_summary.json")

    print("\n" + "="*80)
    print("📊 OVERALL SUMMARY:")
    print("="*80)
    for key, value in summary_stats.items():
        print(f"  {key.replace('_', ' ').title()}: {value}")
    print("="*80)

## 11. Summary and Next Steps

### ✅ What was analyzed:

1. **CK Metrics** - Complexity (WMC), Coupling (CBO), Cohesion (LCOM), etc.
2. **Distribution** - Boxplots for spotting outlier classes
3. **Correlations** - Heatmap showing how the metrics relate
4. **General Bugs** - SpotBugs (all detected bugs)
5. **Security Bugs** - find-sec-bugs (vulnerabilities)
6. **Critical Bugs** - HIGH priority

### 📁 Generated files:

**CK Metrics:**
- `metrics_summary.csv` - Statistics by release
- `growth_rates.csv` - Growth rates
- `metrics_evolution.png` - Evolution charts
- `metrics_distribution.png` - Boxplots
- `correlation_matrix.png` - Correlation heatmap

**Bugs:**
- `bugs_all_releases.csv` - All bugs
- `security_bugs.csv` - Security bugs
- `critical_bugs.csv` - Critical bugs
- `bugs_summary.json` - Statistical summary
- `bugs_analysis.png` - General visualizations
- `security_bugs.png` - Security visualizations

### 🎯 How to use this in your work:

1. **Identify problems**: Use the Top Classes tables and the boxplots to find outliers
2. **Prioritize**: Address security and critical bugs first
3. **Refactor**: Target classes with high WMC, CBO, or LCOM
4. **Submit PRs**: Fix bugs and refactor code
5. **Document**: Use the charts in the scientific paper

### 💡 Tips for Pull Requests:

- Security bug fixes are always welcome
- Start with simple bugs (LOW priority)
- Classes with high LCOM → split responsibilities (Extract Class)
- Classes with high WMC → apply Extract Method to their most complex methods

**Good luck! 🚀**