# Quality Metrics Analysis - 75QUA

This notebook analyzes CK metrics, PMD violations, bugs detected by SpotBugs, and refactorings detected by RefactoringMiner across multiple releases of a Java project.

## ⚠️ IMPORTANT: Run the Analysis First!

```bash
make analyze REPO=jhy/jsoup
# OR
make analyze-limit REPO=jhy/jsoup LIMIT=5
```

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from pathlib import Path
import json
import xml.etree.ElementTree as ET

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

## 1. Configuration

In [None]:
# CONFIGURATION - change this to your project's name
PROJECT_NAME = "jsoup"
RESULTS_DIR = Path(f"/workspace/results/{PROJECT_NAME}")

print(f"Analyzing project: {PROJECT_NAME}")
print(f"Results directory: {RESULTS_DIR}")
print(f"Directory exists: {RESULTS_DIR.exists()}")

# Default to an empty list so later cells do not raise NameError when the directory is missing
release_dirs = []
if RESULTS_DIR.exists():
    release_dirs = sorted(d for d in RESULTS_DIR.glob('*') if d.is_dir() and not d.name.startswith('.'))
    print(f"✓ Found {len(release_dirs)} releases")
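
Sorting the release directories with a plain `sorted()` compares names as text, so a release such as 1.10.1 can land before 1.8.3. The sketch below shows one possible numeric sort key. It assumes directory names embed dotted version numbers (the names used here are hypothetical), and `version_key` is an illustrative helper, not part of the analysis pipeline.

```python
import re

def version_key(name):
    """Sort key that compares the numeric parts of a release name as integers."""
    # 'jsoup-1.10.1' -> [1, 10, 1]; names without digits sort first
    return [int(part) for part in re.findall(r'\d+', name)]

# Hypothetical release directory names
names = ['jsoup-1.9.2', 'jsoup-1.10.1', 'jsoup-1.8.3']

print(sorted(names))                   # text order
print(sorted(names, key=version_key))  # numeric order
```

If the directory names follow this pattern, passing `key=lambda d: version_key(d.name)` to the `sorted()` call above would put the releases in version order.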

## 2. Load CK Metrics

In [None]:
all_metrics = []

for release_dir in release_dirs:
    class_csv = release_dir / 'ck' / 'class.csv'
    
    if class_csv.exists():
        df = pd.read_csv(class_csv)
        df['release'] = release_dir.name
        
        metadata_file = release_dir / 'metadata.json'
        if metadata_file.exists():
            with open(metadata_file) as f:
                metadata = json.load(f)
                df['release_date'] = metadata.get('published_date', '')
        
        all_metrics.append(df)
        print(f"✓ {release_dir.name}: {len(df)} classes")

if all_metrics:
    df_all = pd.concat(all_metrics, ignore_index=True)
    print(f"\n✓ Total classes: {len(df_all)}")
    print(f"✓ Releases: {df_all['release'].nunique()}")
else:
    df_all = pd.DataFrame()
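
The loading cells in this notebook each read `published_date` from `metadata.json`. A small helper can centralize that lookup. This is a minimal sketch: `read_release_date` is a hypothetical name, and the temporary directory and date below are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

def read_release_date(release_dir):
    """Return 'published_date' from metadata.json, or '' when the file or key is missing."""
    metadata_file = Path(release_dir) / 'metadata.json'
    if not metadata_file.exists():
        return ''
    with open(metadata_file, encoding='utf-8') as f:
        return json.load(f).get('published_date', '')

# Demonstration with a throwaway directory and a made-up date
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'metadata.json').write_text('{"published_date": "2024-01-15"}', encoding='utf-8')
    print(read_release_date(tmp))
```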

In [None]:
# Inspect the structure of the data
df_all.head()

## 3. Descriptive Statistics per Release

In [None]:
if not df_all.empty:
    metrics_by_release = df_all.groupby('release').agg({
        'wmc': ['mean', 'median', 'std', 'max'],
        'dit': ['mean', 'median', 'std', 'max'],
        'noc': ['mean', 'median', 'std', 'max'],
        'cbo': ['mean', 'median', 'std', 'max'],
        'lcom': ['mean', 'median', 'std', 'max'],
        'rfc': ['mean', 'median', 'std', 'max'],
        'loc': ['sum', 'mean', 'median', 'std']
    }).round(2)
    
    display(metrics_by_release)
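
Aggregating with a dictionary of lists produces two-level column labels such as `('wmc', 'mean')`. When single-level names are easier to work with (for example in an exported CSV), the levels can be joined. The sketch below uses a made-up miniature table, not the real CK data.

```python
import pandas as pd

# Hypothetical miniature of df_all: two releases, two metrics
demo = pd.DataFrame({
    'release': ['r1', 'r1', 'r2', 'r2'],
    'wmc': [2, 4, 6, 10],
    'loc': [10, 30, 20, 60],
})

summary = demo.groupby('release').agg({'wmc': ['mean', 'max'], 'loc': ['sum']})

# Columns are tuples such as ('wmc', 'mean'); join them into 'wmc_mean'
flat = summary.copy()
flat.columns = ['_'.join(col) for col in summary.columns]
print(flat)
```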

## 4. Visualization - Metric Evolution

In [None]:
if not df_all.empty:
    # (column, aggregate, title, y-label, marker, color) for each panel
    panels = [
        ('wmc', 'mean', 'WMC - Complexity', 'Mean WMC', 'o', 'blue'),
        ('dit', 'mean', 'DIT - Inheritance', 'Mean DIT', 's', 'orange'),
        ('noc', 'mean', 'NOC - Children', 'Mean NOC', '^', 'brown'),
        ('cbo', 'mean', 'CBO - Coupling', 'Mean CBO', 'D', 'green'),
        ('lcom', 'mean', 'LCOM - Cohesion', 'Mean LCOM', 'v', 'red'),
        ('rfc', 'mean', 'RFC - Response', 'Mean RFC', '*', 'cyan'),
        ('loc', 'sum', 'LOC - Lines (Total)', 'Total LOC', 'p', 'purple'),
    ]

    fig = plt.figure(figsize=(18, 12))
    fig.suptitle('Evolution of the 7 CK Metrics', fontsize=16, fontweight='bold')

    # One subplot per metric, all drawn with the same styling
    for i, (metric, agg, title, ylabel, marker, color) in enumerate(panels, 1):
        ax = plt.subplot(3, 3, i)
        metrics_by_release[(metric, agg)].plot(ax=ax, marker=marker, color=color, linewidth=2)
        ax.set_title(title)
        ax.set_ylabel(ylabel)
        ax.tick_params(axis='x', rotation=45)
        ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig(RESULTS_DIR / 'metrics_evolution.png', dpi=300, bbox_inches='tight')
    plt.show()

## 5. Metric Distributions (Boxplots)

**Useful for identifying outlier classes in each release**

In [None]:
if not df_all.empty:
    fig = plt.figure(figsize=(20, 12))

    metrics = ['wmc', 'dit', 'noc', 'cbo', 'lcom', 'rfc', 'loc']
    titles = ['WMC (Complexity)', 'DIT (Inheritance)', 'NOC (Children)',
              'CBO (Coupling)', 'LCOM (Cohesion)', 'RFC (Response)', 'LOC (Lines)']

    for i, (metric, title) in enumerate(zip(metrics, titles), 1):
        ax = plt.subplot(3, 3, i)
        df_all.boxplot(column=metric, by='release', ax=ax, rot=45)
        ax.set_title(title)
        ax.set_xlabel('')

    # DataFrame.boxplot(by=...) overwrites the figure title on every call, so set it once at the end
    fig.suptitle('Distribution of the 7 CK Metrics per Release - Boxplots',
                 fontsize=16, fontweight='bold', y=0.995)
    plt.tight_layout()
    plt.savefig(RESULTS_DIR / 'metrics_distribution.png', dpi=300, bbox_inches='tight')
    plt.show()
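
Boxplots show that outliers exist but not which classes they are. The sketch below applies the usual boxplot rule (values above Q3 + 1.5 × IQR) to list them. `iqr_outliers` is a hypothetical helper and the data is made up; the 1.5 multiplier is the conventional default, not a value taken from this project.

```python
import pandas as pd

def iqr_outliers(df, metric, k=1.5):
    """Return rows whose `metric` exceeds Q3 + k * IQR (the upper boxplot whisker limit)."""
    q1 = df[metric].quantile(0.25)
    q3 = df[metric].quantile(0.75)
    upper = q3 + k * (q3 - q1)
    return df[df[metric] > upper]

# Made-up classes: four ordinary values and one extreme one
demo = pd.DataFrame({
    'class': ['A', 'B', 'C', 'D', 'E'],
    'wmc': [3, 4, 5, 6, 50],
})
print(iqr_outliers(demo, 'wmc'))
```

Calling the helper on the rows of a single release, for example `df_all[df_all['release'] == name]`, would list that release's outlier classes for the chosen metric.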

## 6. Correlation between Metrics (Heatmap)

**Shows relationships among the CK metrics**

In [None]:
if not df_all.empty:
    correlation_metrics = ['wmc', 'dit', 'noc', 'cbo', 'lcom', 'rfc', 'loc']
    corr_matrix = df_all[correlation_metrics].corr()
    
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
                square=True, linewidths=1, cbar_kws={"shrink": 0.8})
    plt.title('Correlation Matrix of the CK Metrics', fontsize=14, fontweight='bold', pad=20)
    plt.tight_layout()
    plt.savefig(RESULTS_DIR / 'correlation_matrix.png', dpi=300, bbox_inches='tight')
    plt.show()
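
`DataFrame.corr()` defaults to Pearson correlation, which measures linear association. Code metrics are often skewed, so a rank-based alternative such as Spearman may be worth comparing. The sketch below uses made-up values to show how the two can differ on a monotonic but non-linear relationship.

```python
import pandas as pd

demo = pd.DataFrame({'wmc': [1, 2, 3, 4, 5]})
demo['loc'] = demo['wmc'] ** 3   # monotonic but non-linear relationship (made-up data)

pearson = demo.corr(method='pearson').loc['wmc', 'loc']
spearman = demo.corr(method='spearman').loc['wmc', 'loc']
print(f"pearson={pearson:.3f} spearman={spearman:.3f}")
```

Passing `method='spearman'` to the `corr()` call in the cell above would produce the rank-based version of the heatmap.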

## 7. Top Problem Classes

In [None]:
if not df_all.empty:
    latest_release = df_all[df_all['release'] == df_all['release'].unique()[-1]]

    print("="*80)
    print(f"ANALYSIS OF THE 7 CK METRICS - LATEST RELEASE ({latest_release['release'].iloc[0]})")
    print("="*80)

    # (metric to rank by, heading, columns to display)
    rankings = [
        ('wmc', 'Highest Complexity (WMC)', ['class', 'wmc', 'cbo', 'lcom', 'loc']),
        ('dit', 'Deepest Inheritance (DIT)', ['class', 'dit', 'wmc', 'cbo']),
        ('noc', 'Most Children (NOC)', ['class', 'noc', 'dit', 'wmc']),
        ('cbo', 'Highest Coupling (CBO)', ['class', 'cbo', 'wmc', 'lcom']),
        ('lcom', 'Lowest Cohesion (LCOM - high values)', ['class', 'lcom', 'wmc', 'cbo']),
        ('rfc', 'Highest RFC (Response For Class)', ['class', 'rfc', 'wmc', 'cbo']),
        ('loc', 'Most Lines of Code (LOC)', ['class', 'loc', 'wmc', 'cbo']),
    ]

    for i, (metric, heading, columns) in enumerate(rankings):
        # Blank line before the first table, a dashed separator before the others
        if i == 0:
            print()
        else:
            print("\n" + "-"*80)
        print(f"Top 10 Classes with {heading}:")
        print(latest_release.nlargest(10, metric)[columns])

## 8. Trend Analysis

In [None]:
if not df_all.empty:
    first_release = metrics_by_release.iloc[0]
    last_release = metrics_by_release.iloc[-1]

    # (metric, aggregate) column of metrics_by_release behind each row of the table
    metric_keys = [('wmc', 'mean'), ('dit', 'mean'), ('noc', 'mean'), ('cbo', 'mean'),
                   ('lcom', 'mean'), ('rfc', 'mean'), ('loc', 'sum')]

    growth_rates = pd.DataFrame({
        'Metric': ['WMC', 'DIT', 'NOC', 'CBO', 'LCOM', 'RFC', 'LOC (total)'],
        'First Release': [first_release[key] for key in metric_keys],
        'Last Release': [last_release[key] for key in metric_keys],
    })

    # A zero baseline would yield inf, so treat it as undefined (NaN) instead
    baseline = growth_rates['First Release'].replace(0, np.nan)
    growth_rates['Change (%)'] = ((growth_rates['Last Release'] - baseline) / baseline * 100).round(2)

    print("Metric Growth Analysis:")
    display(growth_rates)
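
Comparing only the first and last releases hides what happened in between. As a complement, `Series.pct_change()` gives the change from each release to the next. This is a minimal sketch on made-up values that assumes the index is already in chronological order.

```python
import pandas as pd

# Hypothetical mean WMC per release, already in chronological order
wmc_mean = pd.Series([10.0, 12.0, 9.0], index=['r1', 'r2', 'r3'])

# Percentage change relative to the previous release; the first entry has no predecessor
step_change = (wmc_mean.pct_change() * 100).round(2)
print(step_change)
```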

## 9. Export CK Metrics

In [None]:
if not df_all.empty:
    metrics_by_release.to_csv(RESULTS_DIR / 'metrics_summary.csv')
    growth_rates.to_csv(RESULTS_DIR / 'growth_rates.csv', index=False)
    print("✓ CK metrics exported:")
    print("  - metrics_summary.csv")
    print("  - growth_rates.csv")

## 9.5 PMD Analysis - Static Code Analysis

Analysis of the problems detected by PMD (often expanded as Programming Mistake Detector) in each release.

In [None]:
# Load PMD data
all_pmd = []

for release_dir in release_dirs:
    pmd_csv = release_dir / 'pmd-report.csv'
    
    if pmd_csv.exists():
        try:
            df = pd.read_csv(pmd_csv)
            df['release'] = release_dir.name
            
            metadata_file = release_dir / 'metadata.json'
            if metadata_file.exists():
                with open(metadata_file) as f:
                    metadata = json.load(f)
                    df['release_date'] = metadata.get('published_date', '')
            
            all_pmd.append(df)
            print(f"✓ {release_dir.name}: {len(df)} PMD problems")
        except Exception as e:
            print(f"✗ {release_dir.name}: error reading PMD report - {e}")
    else:
        print(f"✗ {release_dir.name}: no PMD report")

if all_pmd:
    df_pmd = pd.concat(all_pmd, ignore_index=True)
    print(f"\n✓ Total PMD problems: {len(df_pmd)}")
    print(f"✓ Releases with PMD data: {df_pmd['release'].nunique()}")
else:
    df_pmd = pd.DataFrame()
    print("\n✗ No PMD data available")

df_pmd.head()

### 9.5.1 PMD Statistics per Release

In [None]:
if not df_pmd.empty:
    pmd_by_release = df_pmd.groupby('release').agg({
        'Problem': 'count',
        'Priority': ['mean', 'min', 'max']
    }).round(2)
    pmd_by_release.columns = ['Total_Problems', 'Priority_Mean', 'Priority_Min', 'Priority_Max']
    
    print("="*80)
    print("PMD PROBLEMS PER RELEASE")
    print("="*80)
    display(pmd_by_release)

    print("\n" + "="*80)
    print("OVERALL STATISTICS:")
    print("="*80)
    print(f"Mean problems per release: {pmd_by_release['Total_Problems'].mean():.1f}")
    print(f"Median: {pmd_by_release['Total_Problems'].median():.1f}")
    print(f"Minimum: {pmd_by_release['Total_Problems'].min()}")
    print(f"Maximum: {pmd_by_release['Total_Problems'].max()}")

    # First vs. last release
    first = pmd_by_release.iloc[0]
    last = pmd_by_release.iloc[-1]
    variation = ((last['Total_Problems'] - first['Total_Problems']) / first['Total_Problems'] * 100)

    print(f"\nFirst release ({pmd_by_release.index[0]}): {first['Total_Problems']:.0f} problems")
    print(f"Last release ({pmd_by_release.index[-1]}): {last['Total_Problems']:.0f} problems")
    print(f"Change: {variation:+.1f}%")
else:
    print("No PMD data available")

### 9.5.2 Distribution by Priority

PMD ranks problems on 5 priority levels (1 is the most severe):
- **Priority 1:** High (critical; serious design/security problems)
- **Priority 2:** Medium High
- **Priority 3:** Medium
- **Priority 4:** Medium Low
- **Priority 5:** Low (mostly code style)
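
Numeric priorities are easier to read in tables and legends once mapped to names. The sketch below maps the `Priority` column to the five level names used by PMD's `RulePriority` enum (High through Low). The data frame is made up, and `PMD_PRIORITY_LABELS` is a hypothetical constant introduced for illustration.

```python
import pandas as pd

# Level names follow PMD's RulePriority enum; the rows below are made up
PMD_PRIORITY_LABELS = {1: 'High', 2: 'Medium High', 3: 'Medium', 4: 'Medium Low', 5: 'Low'}

demo = pd.DataFrame({'Priority': [3, 3, 1, 5, 3]})
demo['priority_label'] = demo['Priority'].map(PMD_PRIORITY_LABELS)

counts = demo['priority_label'].value_counts()
print(counts)
```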

In [None]:
if not df_pmd.empty:
    priority_dist = df_pmd['Priority'].value_counts().sort_index()

    print("="*80)
    print("DISTRIBUTION BY PRIORITY (OVERALL)")
    print("="*80)
    for priority, count in priority_dist.items():
        pct = count / len(df_pmd) * 100
        print(f"Priority {priority}: {count:5d} ({pct:5.1f}%)")

    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('PMD Analysis - Evolution and Distribution', fontsize=16, fontweight='bold')

    # 1. Evolution of the total number of problems
    pmd_by_release['Total_Problems'].plot(ax=axes[0, 0], marker='o', color='purple', linewidth=2)
    axes[0, 0].set_title('Total Problems per Release')
    axes[0, 0].set_ylabel('Count')
    axes[0, 0].tick_params(axis='x', rotation=45)
    axes[0, 0].grid(True, alpha=0.3)
    axes[0, 0].axhline(y=pmd_by_release['Total_Problems'].mean(), color='orange',
                       linestyle='--', label=f'Mean: {pmd_by_release["Total_Problems"].mean():.1f}')
    axes[0, 0].legend()

    # 2. Distribution by priority (pie chart)
    priority_labels = [f'Priority {p}' for p in priority_dist.index]
    # One color per PMD priority (1-5), looked up by priority value rather than by position
    priority_colors = {1: '#ff4444', 2: '#ff8844', 3: '#ffcc44', 4: '#88cc44', 5: '#44aa88'}
    priority_dist.plot(kind='pie', ax=axes[0, 1], autopct='%1.1f%%',
                       colors=[priority_colors.get(p, '#999999') for p in priority_dist.index],
                       labels=priority_labels, startangle=90)
    axes[0, 1].set_title('Distribution by Priority (Overall)')
    axes[0, 1].set_ylabel('')

    # 3. Evolution by priority
    priority_evolution = df_pmd.groupby(['release', 'Priority']).size().unstack(fill_value=0)
    priority_evolution.plot(ax=axes[1, 0], marker='o', linewidth=2)
    axes[1, 0].set_title('Evolution by Priority')
    axes[1, 0].set_ylabel('Count')
    axes[1, 0].tick_params(axis='x', rotation=45)
    axes[1, 0].legend(title='Priority', labels=[f'Priority {p}' for p in priority_evolution.columns])
    axes[1, 0].grid(True, alpha=0.3)

    # 4. Top 10 most-violated rules (latest release)
    latest_release_name = df_pmd['release'].unique()[-1]
    latest_pmd = df_pmd[df_pmd['release'] == latest_release_name]
    top_rules = latest_pmd['Rule'].value_counts().head(10)
    top_rules.plot(kind='barh', ax=axes[1, 1], color='darkviolet')
    axes[1, 1].set_title(f'Top 10 Violated Rules ({latest_release_name})')
    axes[1, 1].set_xlabel('Count')
    axes[1, 1].invert_yaxis()

    plt.tight_layout()
    plt.savefig(RESULTS_DIR / 'pmd_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()
else:
    print("No PMD data available")

### 9.5.3 Top 10 Rules, Categories, and Files

In [None]:
if not df_pmd.empty:
    latest_release_name = df_pmd['release'].unique()[-1]
    latest_pmd = df_pmd[df_pmd['release'] == latest_release_name].copy()  # .copy() avoids SettingWithCopyWarning

    print("="*80)
    print(f"DETAILED ANALYSIS - LATEST RELEASE ({latest_release_name})")
    print("="*80)

    print(f"\nTotal problems: {len(latest_pmd)}")

    print("\n" + "-"*80)
    print("TOP 10 MOST-VIOLATED RULES:")
    print("-"*80)
    top_rules = latest_pmd['Rule'].value_counts().head(10)
    for i, (rule, count) in enumerate(top_rules.items(), 1):
        pct = count / len(latest_pmd) * 100
        print(f"{i:2d}. {rule:50s} : {count:4d} ({pct:5.1f}%)")

    print("\n" + "-"*80)
    print("TOP 10 CATEGORIES (Rule set):")
    print("-"*80)
    top_categories = latest_pmd['Rule set'].value_counts().head(10)
    for i, (cat, count) in enumerate(top_categories.items(), 1):
        pct = count / len(latest_pmd) * 100
        print(f"{i:2d}. {cat:30s} : {count:4d} ({pct:5.1f}%)")

    print("\n" + "-"*80)
    print("TOP 10 FILES WITH THE MOST PROBLEMS:")
    print("-"*80)
    # Extract the file name from the full path
    latest_pmd['FileName'] = latest_pmd['File'].apply(lambda x: x.split('/')[-1] if isinstance(x, str) else '')
    top_files = latest_pmd['FileName'].value_counts().head(10)
    for i, (file, count) in enumerate(top_files.items(), 1):
        pct = count / len(latest_pmd) * 100
        print(f"{i:2d}. {file:50s} : {count:4d} ({pct:5.1f}%)")

    print("\n" + "-"*80)
    print("CRITICAL PROBLEMS (Priority 1):")
    print("-"*80)
    critical = latest_pmd[latest_pmd['Priority'] == 1]
    print(f"Total: {len(critical)} critical problems ({len(critical)/len(latest_pmd)*100:.1f}%)")

    if len(critical) > 0:
        print("\nTop 5 critical rules:")
        critical_rules = critical['Rule'].value_counts().head(5)
        for rule, count in critical_rules.items():
            print(f"  • {rule}: {count}")
else:
    print("No PMD data available")

### 9.5.4 Export PMD Data

In [None]:
if not df_pmd.empty:
    latest_release_name = df_pmd['release'].unique()[-1]
    latest_pmd = df_pmd[df_pmd['release'] == latest_release_name]
    
    # Export everything
    df_pmd.to_csv(RESULTS_DIR / 'pmd_all_releases.csv', index=False)
    print("✓ PMD exported:")
    print("  - pmd_all_releases.csv (all problems)")

    # Latest release
    latest_pmd.to_csv(RESULTS_DIR / f'pmd_{latest_release_name}.csv', index=False)
    print(f"  - pmd_{latest_release_name}.csv (latest release)")

    # Critical problems
    critical = df_pmd[df_pmd['Priority'] == 1]
    if not critical.empty:
        critical.to_csv(RESULTS_DIR / 'pmd_critical_all.csv', index=False)
        print("  - pmd_critical_all.csv (priority 1)")

    # JSON summary
    summary = {
        'latest_release': latest_release_name,
        'latest_release_problems': len(latest_pmd),
        'latest_release_critical': len(latest_pmd[latest_pmd['Priority'] == 1]),
        'average_problems_per_release': float(pmd_by_release['Total_Problems'].mean()),
        'median_problems_per_release': float(pmd_by_release['Total_Problems'].median()),
        'total_releases_analyzed': df_pmd['release'].nunique(),
        'most_common_rule': latest_pmd['Rule'].value_counts().index[0] if len(latest_pmd) > 0 else 'N/A',
        'most_common_category': latest_pmd['Rule set'].value_counts().index[0] if len(latest_pmd) > 0 else 'N/A'
    }
    
    with open(RESULTS_DIR / 'pmd_summary.json', 'w') as f:
        json.dump(summary, f, indent=2)
    print("  - pmd_summary.json")
    
    print("\n" + "="*80)
    print("📊 PMD SUMMARY:")
    print("="*80)
    for key, value in summary.items():
        label = key.replace('_', ' ').title()
        if isinstance(value, float):
            print(f"  {label}: {value:.1f}")
        else:
            print(f"  {label}: {value}")
    print("="*80)
else:
    print("No PMD data available")

## 10. Bug Analysis - SpotBugs + find-sec-bugs

### 10.1 Load Bugs

In [None]:
def parse_spotbugs_xml(xml_file):
    """Parse SpotBugs XML report."""
    try:
        tree = ET.parse(xml_file)
        root = tree.getroot()
        bugs = []
        
        for bug in root.findall('.//BugInstance'):
            bug_info = {
                'type': bug.get('type'),
                'priority': int(bug.get('priority', 0)),
                'rank': int(bug.get('rank', 0)),
                'category': bug.get('category'),
                'abbrev': bug.get('abbrev', ''),
            }
            
            class_elem = bug.find('.//Class')
            bug_info['class'] = class_elem.get('classname', '') if class_elem is not None else ''
            
            method_elem = bug.find('.//Method')
            bug_info['method'] = method_elem.get('name', '') if method_elem is not None else ''
            
            long_msg = bug.find('.//LongMessage')
            bug_info['description'] = long_msg.text if long_msg is not None else ''
            
            bugs.append(bug_info)
        
        return bugs
    except Exception as e:
        print(f"Error parsing {xml_file}: {e}")
        return []

# Collect bugs
all_bugs = []

for release_dir in release_dirs:
    spotbugs_xml = release_dir / 'spotbugs-report.xml'

    if spotbugs_xml.exists():
        bugs = parse_spotbugs_xml(spotbugs_xml)

        # Read the release metadata once per release, not once per bug
        release_date = None
        metadata_file = release_dir / 'metadata.json'
        if metadata_file.exists():
            with open(metadata_file) as f:
                release_date = json.load(f).get('published_date', '')

        for bug in bugs:
            bug['release'] = release_dir.name
            if release_date is not None:
                bug['release_date'] = release_date

        all_bugs.extend(bugs)
        print(f"✓ {release_dir.name}: {len(bugs)} bugs")
    else:
        print(f"✗ {release_dir.name}: no SpotBugs report")

if all_bugs:
    df_bugs = pd.DataFrame(all_bugs)
    df_bugs['priority_label'] = df_bugs['priority'].map({1: 'HIGH', 2: 'MEDIUM', 3: 'LOW'})
    print(f"\n✓ Total bugs: {len(df_bugs)}")
    print(f"✓ Releases with bugs: {df_bugs['release'].nunique()}")
else:
    df_bugs = pd.DataFrame()
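
To make the parsing logic easier to follow, the sketch below runs the same `ElementTree` lookups on a tiny hand-written XML fragment. The fragment only contains the elements and attributes the parser above reads; the class names and values are invented, and real SpotBugs reports carry many more fields.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration; class names and values are invented
SAMPLE = """
<BugCollection>
  <BugInstance type="NP_NULL_ON_SOME_PATH" priority="1" rank="8" category="CORRECTNESS">
    <Class classname="org.example.Parser"/>
    <Method name="parse"/>
    <LongMessage>Possible null pointer dereference</LongMessage>
  </BugInstance>
  <BugInstance type="DM_DEFAULT_ENCODING" priority="2" rank="19" category="I18N">
    <Class classname="org.example.Reader"/>
  </BugInstance>
</BugCollection>
"""

root = ET.fromstring(SAMPLE)
parsed = []
for bug in root.findall('.//BugInstance'):
    class_elem = bug.find('.//Class')
    method_elem = bug.find('.//Method')  # may be absent, as in the second bug
    parsed.append((
        bug.get('type'),
        int(bug.get('priority', 0)),
        class_elem.get('classname', '') if class_elem is not None else '',
        method_elem.get('name', '') if method_elem is not None else '',
    ))
print(parsed)
```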

### 10.2 Bug Statistics

In [None]:
if not df_bugs.empty:
    bugs_by_release = df_bugs.groupby('release').agg({
        'type': 'count',
        'priority': ['mean', 'min', 'max']
    }).round(2)
    bugs_by_release.columns = ['Total_Bugs', 'Priority_Mean', 'Priority_Min', 'Priority_Max']
    
    print("="*80)
    print("BUGS PER RELEASE (each release is independent)")
    print("="*80)
    display(bugs_by_release)

    # Overall statistics
    print("\n" + "="*80)
    print("OVERALL STATISTICS:")
    print("="*80)
    print(f"Mean bugs per release: {bugs_by_release['Total_Bugs'].mean():.1f}")
    print(f"Median bugs per release: {bugs_by_release['Total_Bugs'].median():.1f}")
    print(f"Minimum bugs in a release: {bugs_by_release['Total_Bugs'].min()}")
    print(f"Maximum bugs in a release: {bugs_by_release['Total_Bugs'].max()}")

    # First vs. last release
    first_release = bugs_by_release.iloc[0]
    last_release = bugs_by_release.iloc[-1]
    variation = ((last_release['Total_Bugs'] - first_release['Total_Bugs']) / first_release['Total_Bugs'] * 100)

    print(f"\nFirst release ({bugs_by_release.index[0]}): {first_release['Total_Bugs']:.0f} bugs")
    print(f"Last release ({bugs_by_release.index[-1]}): {last_release['Total_Bugs']:.0f} bugs")
    print(f"Change: {variation:+.1f}%")
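
One caveat with counting via `groupby('release')`: a release whose report contains no bugs contributes no rows, so it is missing from the table and the per-release mean is computed over fewer releases. The sketch below shows one way to account for that with `reindex`, on made-up data; `all_releases` is a hypothetical list of every analyzed release.

```python
import pandas as pd

all_releases = ['r1', 'r2', 'r3']                       # hypothetical: every analyzed release
bugs = pd.DataFrame({'release': ['r1', 'r1', 'r3']})    # r2 produced no bugs

counts = bugs.groupby('release').size()

# Add the missing releases back with an explicit count of zero
full_counts = counts.reindex(all_releases, fill_value=0)

print(counts.mean(), full_counts.mean())
```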

### 10.3 Bug Evolution Across Releases

In [None]:
if not df_bugs.empty:
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('Bug Evolution Across Releases', fontsize=16, fontweight='bold')

    # 1. Evolution of the total number of bugs
    bugs_by_release['Total_Bugs'].plot(ax=axes[0, 0], marker='o', color='red', linewidth=2)
    axes[0, 0].set_title('Total Bugs per Release')
    axes[0, 0].set_ylabel('Number of Bugs')
    axes[0, 0].tick_params(axis='x', rotation=45)
    axes[0, 0].grid(True, alpha=0.3)
    axes[0, 0].axhline(y=bugs_by_release['Total_Bugs'].mean(), color='orange',
                       linestyle='--', label=f'Mean: {bugs_by_release["Total_Bugs"].mean():.1f}')
    axes[0, 0].legend()

    # 2. Evolution by category (top 5)
    category_evolution = df_bugs.groupby(['release', 'category']).size().unstack(fill_value=0)
    top_categories = df_bugs['category'].value_counts().head(5).index
    category_evolution[top_categories].plot(ax=axes[0, 1], marker='o', linewidth=2)
    axes[0, 1].set_title('Evolution of the Top 5 Categories')
    axes[0, 1].set_ylabel('Number of Bugs')
    axes[0, 1].tick_params(axis='x', rotation=45)
    axes[0, 1].legend(title='Category', bbox_to_anchor=(1.05, 1), loc='upper left')
    axes[0, 1].grid(True, alpha=0.3)

    # 3. Evolution by priority
    priority_evolution = df_bugs.groupby(['release', 'priority_label']).size().unstack(fill_value=0)
    # unstack() orders columns alphabetically (HIGH, LOW, MEDIUM); reorder by severity and color by label
    priority_colors = {'HIGH': '#ff4444', 'MEDIUM': '#ffaa44', 'LOW': '#44aa44'}
    priority_evolution = priority_evolution[[p for p in priority_colors if p in priority_evolution.columns]]
    priority_evolution.plot(ax=axes[1, 0], marker='o', linewidth=2,
                            color=[priority_colors[p] for p in priority_evolution.columns])
    axes[1, 0].set_title('Evolution by Priority')
    axes[1, 0].set_ylabel('Number of Bugs')
    axes[1, 0].tick_params(axis='x', rotation=45)
    axes[1, 0].legend(title='Priority')
    axes[1, 0].grid(True, alpha=0.3)

    # 4. Top 10 bug types in the latest release
    latest_release_name = df_bugs['release'].unique()[-1]
    latest_bugs = df_bugs[df_bugs['release'] == latest_release_name]
    latest_bugs['type'].value_counts().head(10).plot(kind='barh', ax=axes[1, 1], color='steelblue')
    axes[1, 1].set_title(f'Top 10 Bug Types ({latest_release_name})')
    axes[1, 1].set_xlabel('Count')

    plt.tight_layout()
    plt.savefig(RESULTS_DIR / 'bugs_evolution.png', dpi=300, bbox_inches='tight')
    plt.show()

### 10.5 Security Bugs (find-sec-bugs)

In [None]:
if not df_bugs.empty:
    security_bugs = df_bugs[df_bugs['category'] == 'SECURITY'].copy()
    
    print("="*80)
    print("SECURITY BUGS (find-sec-bugs)")
    print("="*80)

    if not security_bugs.empty:
        # Per-release analysis
        security_by_release = security_bugs.groupby('release').size()

        print(f"\nMean security bugs per release: {security_by_release.mean():.1f}")
        print(f"Median: {security_by_release.median():.1f}")

        print("\nSecurity Bugs per Release:")
        print(security_by_release)

        # Latest release
        latest_release_name = df_bugs['release'].unique()[-1]
        latest_security = security_bugs[security_bugs['release'] == latest_release_name]

        print(f"\n{'='*80}")
        print(f"LATEST RELEASE ({latest_release_name}): {len(latest_security)} security bugs")
        print("="*80)

        print("\nTop 10 Vulnerability Types (latest release):")
        print(latest_security['type'].value_counts().head(10))

        print("\nTop 10 Classes with Security Bugs (latest release):")
        print(latest_security['class'].value_counts().head(10))

        # Visualization
        fig, axes = plt.subplots(1, 2, figsize=(16, 6))
        fig.suptitle('Security Bug Analysis', fontsize=16, fontweight='bold')

        security_by_release.plot(ax=axes[0], marker='o', color='darkred', linewidth=2)
        axes[0].set_title('Evolution of Security Bugs')
        axes[0].set_ylabel('Count')
        axes[0].tick_params(axis='x', rotation=45)
        axes[0].grid(True, alpha=0.3)
        axes[0].axhline(y=security_by_release.mean(), color='orange',
                        linestyle='--', label=f'Mean: {security_by_release.mean():.1f}')
        axes[0].legend()

        latest_security['type'].value_counts().head(10).plot(kind='barh', ax=axes[1], color='crimson')
        axes[1].set_title(f'Top 10 Vulnerabilities ({latest_release_name})')
        axes[1].set_xlabel('Count')

        plt.tight_layout()
        plt.savefig(RESULTS_DIR / 'security_bugs.png', dpi=300, bbox_inches='tight')
        plt.show()
    else:
        print("\n✓ No security bugs found")

### 10.6 Critical Bugs (HIGH Priority)

In [None]:
if not df_bugs.empty:
    critical_bugs = df_bugs[df_bugs['priority'] == 1].copy()
    
    print("="*80)
    print("CRITICAL BUGS (HIGH Priority)")
    print("="*80)

    if not critical_bugs.empty:
        # Per-release analysis
        critical_by_release = critical_bugs.groupby('release').size()

        print(f"\nMean critical bugs per release: {critical_by_release.mean():.1f}")
        print(f"Median: {critical_by_release.median():.1f}")

        print("\nCritical Bugs per Release:")
        print(critical_by_release)

        # Latest release
        latest_release_name = df_bugs['release'].unique()[-1]
        latest_critical = critical_bugs[critical_bugs['release'] == latest_release_name]

        print(f"\n{'='*80}")
        print(f"LATEST RELEASE ({latest_release_name}): {len(latest_critical)} critical bugs")
        print("="*80)

        if not latest_critical.empty:
            print("\nCRITICAL BUG DETAILS:")
            for idx, bug in latest_critical.iterrows():
                print(f"\n🔴 {bug['type']} - {bug['category']}")
                print(f"   Class: {bug['class']}")
                if bug['method']:
                    print(f"   Method: {bug['method']}")
                if bug['description']:
                    # Append an ellipsis only when the description is actually truncated
                    desc = bug['description']
                    print(f"   {desc[:150]}{'...' if len(desc) > 150 else ''}")
                print("   " + "-"*76)
    else:
        print("\n✓ No critical bugs found!")

### 10.7 Export Bug Data

In [None]:
if not df_bugs.empty:
    # Identify the latest release
    latest_release_name = df_bugs['release'].unique()[-1]
    latest_bugs = df_bugs[df_bugs['release'] == latest_release_name]
    latest_security = security_bugs[security_bugs['release'] == latest_release_name] if not security_bugs.empty else pd.DataFrame()
    latest_critical = critical_bugs[critical_bugs['release'] == latest_release_name] if not critical_bugs.empty else pd.DataFrame()

    # Export all bugs
    df_bugs.to_csv(RESULTS_DIR / 'bugs_all_releases.csv', index=False)
    print("✓ Bugs exported:")
    print("  - bugs_all_releases.csv (all bugs from all releases)")

    # Bugs in the latest release
    latest_bugs.to_csv(RESULTS_DIR / f'bugs_{latest_release_name}.csv', index=False)
    print(f"  - bugs_{latest_release_name}.csv (latest release)")

    if not security_bugs.empty:
        security_bugs.to_csv(RESULTS_DIR / 'security_bugs_all.csv', index=False)
        print("  - security_bugs_all.csv (all releases)")

    if not critical_bugs.empty:
        critical_bugs.to_csv(RESULTS_DIR / 'critical_bugs_all.csv', index=False)
        print("  - critical_bugs_all.csv (all releases)")

    # JSON summary
    summary_stats = {
        'latest_release': latest_release_name,
        'latest_release_bugs': len(latest_bugs),
        'latest_release_security_bugs': len(latest_security),
        'latest_release_critical_bugs': len(latest_critical),
        'average_bugs_per_release': float(bugs_by_release['Total_Bugs'].mean()),
        'median_bugs_per_release': float(bugs_by_release['Total_Bugs'].median()),
        'total_releases_analyzed': df_bugs['release'].nunique(),
        'most_common_bug_type_latest': latest_bugs['type'].value_counts().index[0] if len(latest_bugs) > 0 else 'N/A',
        'most_common_category_latest': latest_bugs['category'].value_counts().index[0] if len(latest_bugs) > 0 else 'N/A'
    }
    
    with open(RESULTS_DIR / 'bugs_summary.json', 'w') as f:
        json.dump(summary_stats, f, indent=2)
    print("  - bugs_summary.json")
    
    print("\n" + "="*80)
    print("📊 OVERALL SUMMARY:")
    print("="*80)
    for key, value in summary_stats.items():
        label = key.replace('_', ' ').title()
        if isinstance(value, float):
            print(f"  {label}: {value:.1f}")
        else:
            print(f"  {label}: {value}")
    print("="*80)

## 11. Refactorings (RefactoringMiner)

Analysis of the refactorings detected by RefactoringMiner (file `refactorings-all.json`).


In [None]:
refactoring_file = RESULTS_DIR / 'refactorings-all.json'
rows = []
commits_com_ref = 0
total_commits_refminer = 0

if refactoring_file.exists():
    with open(refactoring_file, encoding='utf-8') as f:
        ref_data = json.load(f)

    total_commits_refminer = len(ref_data.get('commits', []))

    for commit in ref_data.get('commits', []):
        ref_list = commit.get('refactorings', [])
        if ref_list:
            commits_com_ref += 1
            for ref in ref_list:
                rows.append({
                    'commit': commit.get('sha1'),
                    'type': ref.get('type'),
                    'description': ref.get('description')
                })

    df_refs = pd.DataFrame(rows)
    print(f'Total commits evaluated (RefactoringMiner): {total_commits_refminer}')
    print(f'Commits with refactorings: {commits_com_ref}')
    print(f'Total refactorings detected: {len(df_refs)}')
else:
    df_refs = pd.DataFrame()
    print('File refactorings-all.json not found. Run the analysis first.')

df_refs.head()


### 11.2 Advanced Analysis - Refactored Files and Classes

In [None]:
# Extract the files and classes involved in each refactoring
if refactoring_file.exists():
    file_refactorings = []
    class_refactorings = []
    
    with open(refactoring_file, encoding='utf-8') as f:
        ref_data = json.load(f)
    
    for commit in ref_data.get('commits', []):
        for ref in commit.get('refactorings', []):
            ref_type = ref.get('type', '')
            
            # Collect the files and classes touched on either side of the refactoring.
            # Sets ensure each file/class is counted once per refactoring, even when
            # it appears in both leftSideLocations and rightSideLocations.
            files_in_ref = set()
            classes_in_ref = set()
            for side in ['leftSideLocations', 'rightSideLocations']:
                for loc in ref.get(side, []):
                    file_path = loc.get('filePath', '')
                    if file_path:
                        files_in_ref.add(file_path)
                    
                    # For TYPE_DECLARATION locations, codeElement is kept as-is
                    # (expected to be the qualified name, e.g. "org.jsoup.parser.Tag")
                    code_elem = loc.get('codeElement', '')
                    if code_elem and loc.get('codeElementType') == 'TYPE_DECLARATION':
                        classes_in_ref.add(code_elem)
            
            for file_path in sorted(files_in_ref):
                file_refactorings.append({
                    'file': file_path,
                    'type': ref_type,
                    'commit': commit.get('sha1')
                })
            
            for class_name in sorted(classes_in_ref):
                class_refactorings.append({
                    'class': class_name,
                    'type': ref_type,
                    'commit': commit.get('sha1')
                })
    
    df_file_refs = pd.DataFrame(file_refactorings) if file_refactorings else pd.DataFrame()
    df_class_refs = pd.DataFrame(class_refactorings) if class_refactorings else pd.DataFrame()
    
    print(f"✓ Extracted {len(df_file_refs)} file-level refactoring records")
    print(f"✓ Extracted {len(df_class_refs)} class-level refactoring records")
else:
    df_file_refs = pd.DataFrame()
    df_class_refs = pd.DataFrame()

#### 11.2.1 Top 10 Most Refactored Files

Files that received the most refactorings across the analyzed history.
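
How a file is counted affects this ranking. The sketch below, on made-up data, contrasts counting every location row with counting each refactoring once per file. The `ref_id` column is a hypothetical identifier introduced only for this illustration; it is not a field of the tool's output.

```python
import pandas as pd

# Hypothetical location rows: one row per (refactoring, location).
locations = pd.DataFrame({
    "ref_id": [0, 0, 1, 2, 2, 2],
    "file":   ["A.java", "A.java", "A.java", "B.java", "B.java", "A.java"],
    "type":   ["Rename Method", "Rename Method", "Extract Method",
               "Move Method", "Move Method", "Move Method"],
})

# Counting raw rows counts locations, so a refactoring with a left and a
# right location in the same file weighs double.
by_location = locations["file"].value_counts()

# Dropping duplicate (ref_id, file) pairs counts each refactoring once per file.
by_refactoring = (
    locations.drop_duplicates(["ref_id", "file"])["file"].value_counts()
)

print(by_location)
print(by_refactoring)
```

Either definition can be defended; what matters is stating which one a "top 10" table uses, since the two can rank files differently.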

In [None]:
if not df_file_refs.empty:
    # Top 10 most refactored files
    top_files = df_file_refs['file'].value_counts().head(10)
    
    print("="*80)
    print("TOP 10 MOST REFACTORED FILES")
    print("="*80)
    print(top_files)
    
    # Visualization
    fig = plt.figure(figsize=(18, 14))
    fig.suptitle('Refactored Files Analysis', fontsize=16, fontweight='bold', y=0.995)
    
    # Layout: 2 columns
    # Left column: top 10 files
    # Right column: top 5 files, broken down by refactoring type
    
    # Top 10 files (show only the file name when the path exceeds 40 characters)
    ax1 = plt.subplot(1, 2, 1)
    top_files_display = top_files.copy()
    top_files_display.index = [f.split('/')[-1] if len(f) > 40 else f for f in top_files_display.index]
    
    top_files_display.plot(kind='barh', ax=ax1, color='steelblue')
    ax1.set_title('Top 10 Most Refactored Files', fontsize=12, fontweight='bold')
    ax1.set_xlabel('Number of Refactorings')
    ax1.invert_yaxis()
    ax1.grid(True, alpha=0.3, axis='x')
    
    # Top 5 files by refactoring type (5 vertically stacked subplots)
    top5_files = top_files.head(5).index
    
    for i, file_path in enumerate(top5_files, 1):
        ax = plt.subplot(5, 2, i*2)
        
        # Filter the refactorings for this file
        file_refs = df_file_refs[df_file_refs['file'] == file_path]
        type_counts = file_refs['type'].value_counts().head(10)
        
        # Use only the file name in the title
        file_name = file_path.split('/')[-1]
        
        # Horizontal bar chart
        type_counts.plot(kind='barh', ax=ax, color='coral')
        ax.set_title(f'{i}. {file_name} ({len(file_refs)} refs)', fontsize=10, fontweight='bold')
        ax.set_xlabel('Count', fontsize=8)
        ax.tick_params(axis='y', labelsize=7)
        ax.tick_params(axis='x', labelsize=8)
        ax.invert_yaxis()
        ax.grid(True, alpha=0.3, axis='x')
        
        # Truncate long labels
        labels = [label.get_text()[:35] + '...' if len(label.get_text()) > 35 else label.get_text() 
                  for label in ax.get_yticklabels()]
        ax.set_yticklabels(labels)
    
    plt.tight_layout()
    plt.savefig(RESULTS_DIR / 'refactorings_by_file.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    # Export
    top_files.to_csv(RESULTS_DIR / 'top_refactored_files.csv', header=['refactoring_count'])
    print(f"\n✓ Exported: top_refactored_files.csv")
else:
    print("No refactored-file data available")

#### 11.2.2 Top 10 Most Refactored Classes

Classes that received the most refactorings, and the refactoring types applied to each.
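
The class-by-type table behind such a heatmap can also be built in a single step with `pd.crosstab`. The sketch below uses invented class names and is only meant to show the shape of the result, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical class-level refactoring records (one row per refactoring).
class_refs = pd.DataFrame({
    "class": ["org.example.Parser", "org.example.Parser", "org.example.Tag",
              "org.example.Parser", "org.example.Tag"],
    "type":  ["Extract Method", "Rename Method", "Rename Class",
              "Extract Method", "Move Class"],
})

# Rows are classes, columns are refactoring types, cells are counts.
matrix = pd.crosstab(class_refs["class"], class_refs["type"])

# Order rows so the most refactored class comes first.
row_order = matrix.sum(axis=1).sort_values(ascending=False).index
matrix = matrix.loc[row_order]

print(matrix)
```

The resulting integer matrix can be passed directly to `sns.heatmap(..., fmt='d')`.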

In [None]:
if not df_class_refs.empty:
    # Top 10 most refactored classes
    top_classes = df_class_refs['class'].value_counts().head(10)
    
    print("="*80)
    print("TOP 10 MOST REFACTORED CLASSES")
    print("="*80)
    print(top_classes)
    
    # For each of the top 10 classes, show which refactoring types were applied
    print("\n" + "="*80)
    print("REFACTORING TYPES BY CLASS (TOP 10)")
    print("="*80)
    
    for class_name in top_classes.head(10).index:
        class_refs = df_class_refs[df_class_refs['class'] == class_name]
        type_counts = class_refs['type'].value_counts()
        
        print(f"\n📦 {class_name} ({len(class_refs)} refactorings)")
        print("-" * 80)
        for ref_type, count in type_counts.items():
            print(f"  • {ref_type}: {count}")
    
    # Visualization
    fig, axes = plt.subplots(2, 1, figsize=(18, 14))
    fig.suptitle('Refactored Classes Analysis', fontsize=16, fontweight='bold')
    
    # Top 10 classes (shorten long names: keep only the simple class name)
    top_classes_display = top_classes.copy()
    top_classes_display.index = [c.split('.')[-1] if '.' in c else c for c in top_classes_display.index]
    
    top_classes_display.plot(kind='barh', ax=axes[0], color='darkgreen')
    axes[0].set_title('Top 10 Most Refactored Classes')
    axes[0].set_xlabel('Number of Refactorings')
    axes[0].invert_yaxis()
    
    # Heatmap: top 10 classes x refactoring types
    top10_classes = top_classes.head(10).index
    df_top10 = df_class_refs[df_class_refs['class'].isin(top10_classes)]
    heatmap_data = df_top10.groupby(['class', 'type']).size().unstack(fill_value=0)
    
    # Keep only the most common types to avoid cluttering the heatmap
    top_types = df_top10['type'].value_counts().head(15).index
    heatmap_data = heatmap_data[top_types]
    
    # Shorten class names in the index
    heatmap_data.index = [c.split('.')[-1] if '.' in c else c for c in heatmap_data.index]
    
    sns.heatmap(heatmap_data, annot=True, fmt='d', cmap='YlOrRd', ax=axes[1], 
                cbar_kws={'label': 'Count'})
    axes[1].set_title('Heatmap: Top 10 Classes x Top 15 Refactoring Types')
    axes[1].set_xlabel('Refactoring Type')
    axes[1].set_ylabel('Class')
    axes[1].tick_params(axis='x', rotation=45, labelsize=8)
    axes[1].tick_params(axis='y', labelsize=9)
    
    plt.tight_layout()
    plt.savefig(RESULTS_DIR / 'refactorings_by_class.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    # Export
    top_classes.to_csv(RESULTS_DIR / 'top_refactored_classes.csv', header=['refactoring_count'])
    heatmap_data.to_csv(RESULTS_DIR / 'class_refactoring_heatmap.csv')
    print(f"\n✓ Exported: top_refactored_classes.csv, class_refactoring_heatmap.csv")
else:
    print("No refactored-class data available")

#### 11.2.3 Refactoring Categorization

Grouping of refactorings by semantic category.
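
Keyword-based grouping is sensitive to the order in which keywords are tested. The sketch below shows one way to make that order explicit with a rule table in which the first match wins. The keywords and category labels are illustrative assumptions, not an official taxonomy.

```python
# Ordered (keyword, category) rules: the first match wins, so specific
# keywords must precede generic ones such as "class" or "attribute".
RULES = [
    ("pull up", "Hierarchy"),
    ("push down", "Hierarchy"),
    ("extract superclass", "Hierarchy"),
    ("extract interface", "Hierarchy"),
    ("method", "Methods"),
    ("class", "Classes"),
    ("attribute", "Variables/Attributes"),
    ("variable", "Variables/Attributes"),
    ("package", "Packages"),
]


def categorize(ref_type):
    """Return the category of the first rule whose keyword occurs in ref_type."""
    name = (ref_type or "").lower()  # tolerate a missing type
    for keyword, category in RULES:
        if keyword in name:
            return category
    return "Other"


for example in ["Pull Up Attribute", "Extract Superclass", "Rename Method", None]:
    print(example, "->", categorize(example))
```

With this ordering, "Pull Up Attribute" lands in the hierarchy group rather than being captured by the generic "attribute" keyword, and "Extract Superclass" is not captured by "class".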

In [None]:
if not df_refs.empty:
    # Categorize refactorings
    def categorize_refactoring(ref_type):
        # Guard against a missing type (ref.get('type') may return None)
        ref_type_lower = (ref_type or '').lower()
        
        # Categories based on Fowler's Refactoring Catalog.
        # Hierarchy is tested first so that e.g. "Pull Up Attribute" is not
        # captured by the generic 'attribute' keyword further down.
        if any(x in ref_type_lower for x in ['pull up', 'push down', 'extract interface',
                                               'extract superclass', 'collapse hierarchy']):
            return 'Hierarchy'
        elif any(x in ref_type_lower for x in ['extract method', 'inline method', 'move method', 
                                                 'rename method', 'change method', 'add parameter',
                                                 'remove parameter', 'parameterize']):
            return 'Methods'
        elif any(x in ref_type_lower for x in ['extract class', 'inline class', 'move class',
                                                 'rename class', 'change class', 'split class']):
            return 'Classes'
        elif any(x in ref_type_lower for x in ['extract variable', 'inline variable', 'rename variable',
                                                 'rename attribute', 'rename parameter',
                                                 'encapsulate', 'field', 'attribute']):
            return 'Variables/Attributes'
        elif any(x in ref_type_lower for x in ['move', 'rename package']):
            return 'Packages'
        else:
            return 'Other'
    
    df_refs['category'] = df_refs['type'].apply(categorize_refactoring)
    
    # Statistics per category
    category_stats = df_refs['category'].value_counts()
    
    print("="*80)
    print("REFACTORINGS BY CATEGORY")
    print("="*80)
    print(category_stats)
    print(f"\nTotal: {category_stats.sum()} refactorings")
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    fig.suptitle('Refactoring Categorization', fontsize=16, fontweight='bold')
    
    # Pie chart
    category_stats.plot(kind='pie', ax=axes[0], autopct='%1.1f%%', startangle=90)
    axes[0].set_title('Distribution by Category')
    axes[0].set_ylabel('')
    
    # Top 5 types per category (stacked bar)
    category_type_data = []
    for cat in category_stats.index:
        cat_refs = df_refs[df_refs['category'] == cat]
        top5_types = cat_refs['type'].value_counts().head(5)
        for ref_type, count in top5_types.items():
            category_type_data.append({
                'category': cat,
                'type': ref_type[:40] + '...' if len(ref_type) > 40 else ref_type,  # Truncate long names
                'count': count
            })
    
    df_cat_type = pd.DataFrame(category_type_data)
    # pivot_table sums the counts if two truncated type names collide within a category
    pivot = df_cat_type.pivot_table(index='category', columns='type', values='count',
                                    aggfunc='sum', fill_value=0)
    pivot.plot(kind='barh', stacked=True, ax=axes[1])
    axes[1].set_title('Top 5 Types per Category')
    axes[1].set_xlabel('Count')
    axes[1].legend(title='Type', bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=7)
    
    plt.tight_layout()
    plt.savefig(RESULTS_DIR / 'refactorings_by_category.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    # Export
    category_stats.to_csv(RESULTS_DIR / 'refactorings_categories.csv', header=['count'])
    print(f"\n✓ Exported: refactorings_categories.csv")
else:
    print("No refactoring data available")

#### 11.2.4 Cross-Reference: Refactorings vs CK Metrics

Check whether classes with quality problems (high WMC, CBO, or LCOM) were refactored.
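
The idea of "high" used here is relative: a class is flagged when any metric reaches the 80th percentile of its release. A minimal sketch on invented values (the class names and numbers are hypothetical):

```python
import pandas as pd

# Hypothetical CK metrics for five classes of a single release.
ck = pd.DataFrame({
    "class": ["A", "B", "C", "D", "E"],
    "wmc":   [5, 10, 15, 20, 100],
    "cbo":   [1, 2, 3, 4, 5],
    "lcom":  [0, 0, 0, 0, 50],
})

metrics = ["wmc", "cbo", "lcom"]

# One threshold per metric: the 80th percentile (linear interpolation).
thresholds = ck[metrics].quantile(0.80)

# A class is flagged when at least one metric reaches its threshold.
flagged = ck[(ck[metrics] >= thresholds).any(axis=1)]

print(thresholds)
print(flagged)
```

Because the comparison is `>=`, a non-negative metric whose 80th percentile is 0 would flag every class, so it is worth printing the thresholds before trusting the flagged set.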

In [None]:
if not df_all.empty and not df_class_refs.empty:
    # Use the last release (in directory sort order) for the CK metrics
    latest_release = df_all[df_all['release'] == df_all['release'].unique()[-1]].copy()
    
    # Flag problematic classes (top 20% in WMC, CBO, or LCOM)
    wmc_threshold = latest_release['wmc'].quantile(0.80)
    cbo_threshold = latest_release['cbo'].quantile(0.80)
    lcom_threshold = latest_release['lcom'].quantile(0.80)
    
    problematic_classes = latest_release[
        (latest_release['wmc'] >= wmc_threshold) |
        (latest_release['cbo'] >= cbo_threshold) |
        (latest_release['lcom'] >= lcom_threshold)
    ].copy()
    
    print("="*80)
    print("CROSS-REFERENCE: PROBLEMATIC CLASSES vs REFACTORINGS")
    print("="*80)
    print(f"Problematic classes identified: {len(problematic_classes)}")
    print(f"  • WMC >= {wmc_threshold:.1f}: {len(latest_release[latest_release['wmc'] >= wmc_threshold])}")
    print(f"  • CBO >= {cbo_threshold:.1f}: {len(latest_release[latest_release['cbo'] >= cbo_threshold])}")
    print(f"  • LCOM >= {lcom_threshold:.1f}: {len(latest_release[latest_release['lcom'] >= lcom_threshold])}")
    
    # Count refactorings per class
    refactored_classes_count = df_class_refs['class'].value_counts().to_dict()
    
    # Attach the refactoring count to each problematic class
    problematic_classes['refactoring_count'] = problematic_classes['class'].apply(
        lambda x: refactored_classes_count.get(x, 0)
    )
    
    # Split problematic classes into refactored and not refactored
    refactored_problematic = problematic_classes[problematic_classes['refactoring_count'] > 0]
    not_refactored_problematic = problematic_classes[problematic_classes['refactoring_count'] == 0]
    
    # Avoid dividing by zero when no class was flagged
    n_problematic = max(len(problematic_classes), 1)
    print(f"\nProblematic classes that WERE refactored: {len(refactored_problematic)} ({len(refactored_problematic)/n_problematic*100:.1f}%)")
    print(f"Problematic classes that were NOT refactored: {len(not_refactored_problematic)} ({len(not_refactored_problematic)/n_problematic*100:.1f}%)")
    
    # Top 10 most refactored problematic classes
    if not refactored_problematic.empty:
        print("\n" + "-"*80)
        print("TOP 10 MOST REFACTORED PROBLEMATIC CLASSES:")
        print("-"*80)
        top_refactored = refactored_problematic.nlargest(10, 'refactoring_count')[
            ['class', 'wmc', 'cbo', 'lcom', 'refactoring_count']
        ]
        print(top_refactored.to_string())
    
    # Problematic classes that most need attention: fewest refactorings first,
    # ties broken by the highest problem score
    print("\n" + "-"*80)
    print("TOP 10 PROBLEMATIC CLASSES WITH FEW REFACTORINGS (need attention):")
    print("-"*80)
    problematic_classes['problem_score'] = (
        problematic_classes['wmc'] / wmc_threshold +
        problematic_classes['cbo'] / cbo_threshold +
        problematic_classes['lcom'] / lcom_threshold
    )
    needs_attention = problematic_classes.sort_values(
        ['refactoring_count', 'problem_score'], ascending=[True, False]
    ).head(10)[
        ['class', 'wmc', 'cbo', 'lcom', 'refactoring_count', 'problem_score']
    ]
    print(needs_attention.to_string())
    
    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('Cross-Reference: Refactorings vs CK Metrics', fontsize=16, fontweight='bold')
    
    # 1. Problematic classes: refactored vs not refactored (pie)
    refactored_status = pd.Series({
        'Refactored': len(refactored_problematic),
        'Not Refactored': len(not_refactored_problematic)
    })
    refactored_status.plot(kind='pie', ax=axes[0, 0], autopct='%1.1f%%', 
                          colors=['lightgreen', 'lightcoral'], startangle=90)
    axes[0, 0].set_title('Problematic Classes: Refactoring Status')
    axes[0, 0].set_ylabel('')
    
    # 2. WMC vs refactorings (scatter)
    axes[0, 1].scatter(problematic_classes['wmc'], problematic_classes['refactoring_count'], 
                      alpha=0.6, color='blue')
    axes[0, 1].set_title('WMC vs Number of Refactorings')
    axes[0, 1].set_xlabel('WMC (Complexity)')
    axes[0, 1].set_ylabel('Refactorings')
    axes[0, 1].grid(True, alpha=0.3)
    
    # 3. CBO vs refactorings (scatter)
    axes[1, 0].scatter(problematic_classes['cbo'], problematic_classes['refactoring_count'], 
                      alpha=0.6, color='green')
    axes[1, 0].set_title('CBO vs Number of Refactorings')
    axes[1, 0].set_xlabel('CBO (Coupling)')
    axes[1, 0].set_ylabel('Refactorings')
    axes[1, 0].grid(True, alpha=0.3)
    
    # 4. LCOM vs refactorings (scatter)
    axes[1, 1].scatter(problematic_classes['lcom'], problematic_classes['refactoring_count'], 
                      alpha=0.6, color='red')
    axes[1, 1].set_title('LCOM vs Number of Refactorings')
    axes[1, 1].set_xlabel('LCOM (Lack of Cohesion)')
    axes[1, 1].set_ylabel('Refactorings')
    axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.savefig(RESULTS_DIR / 'refactorings_vs_metrics.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    # Export
    problematic_classes[['class', 'wmc', 'cbo', 'lcom', 'refactoring_count', 'problem_score']].to_csv(
        RESULTS_DIR / 'problematic_classes_refactorings.csv', index=False
    )
    print(f"\n✓ Exported: problematic_classes_refactorings.csv")
else:
    print("CK metrics or refactoring data not available for cross-referencing")

#### 11.2.5 Descriptive Statistics - Refactorings
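
Grouping the refactoring table by commit only yields commits that have at least one refactoring, so the resulting mean and median describe those commits alone. If statistics over all analyzed commits are wanted, the counts can be reindexed against the full commit list. The sketch below uses made-up commit ids and assumes such a list is available.

```python
import pandas as pd

# Hypothetical inputs: every analyzed commit id, and one row per refactoring.
all_commits = ["c1", "c2", "c3", "c4"]
refs = pd.DataFrame({"commit": ["c1", "c1", "c1", "c3"]})

# Commits without refactorings are absent from this Series.
per_commit_nonzero = refs.groupby("commit").size()

# Reindexing adds the missing commits with a count of 0.
per_commit_all = per_commit_nonzero.reindex(all_commits, fill_value=0)

print("Mean over commits with refactorings:", per_commit_nonzero.mean())
print("Mean over all commits:", per_commit_all.mean())
```

In this toy case the mean drops from 2.0 to 1.0 once the two commits without refactorings are included.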

In [None]:
if not df_refs.empty:
    # Descriptive statistics
    print("="*80)
    print("DESCRIPTIVE STATISTICS - REFACTORINGS")
    print("="*80)
    
    # Refactorings per commit (df_refs only contains commits with at least one refactoring)
    refs_per_commit = df_refs.groupby('commit').size()
    
    print(f"\nTotal commits analyzed: {total_commits_refminer}")
    print(f"Commits with refactorings: {commits_com_ref} ({commits_com_ref/total_commits_refminer*100:.1f}%)")
    print(f"Commits without refactorings: {total_commits_refminer - commits_com_ref} ({(total_commits_refminer - commits_com_ref)/total_commits_refminer*100:.1f}%)")
    
    print(f"\nTotal refactorings detected: {len(df_refs)}")
    print(f"Unique refactoring types: {df_refs['type'].nunique()}")
    
    print(f"\nRefactorings per commit (commits with at least one refactoring):")
    print(f"  • Mean: {refs_per_commit.mean():.2f}")
    print(f"  • Median: {refs_per_commit.median():.1f}")
    print(f"  • Standard deviation: {refs_per_commit.std():.2f}")
    print(f"  • Minimum: {refs_per_commit.min()}")
    print(f"  • Maximum: {refs_per_commit.max()}")
    print(f"  • 75th percentile: {refs_per_commit.quantile(0.75):.1f}")
    print(f"  • 90th percentile: {refs_per_commit.quantile(0.90):.1f}")
    
    # Top 5 commits with the most refactorings
    print(f"\nTop 5 commits with the most refactorings:")
    top_commits = refs_per_commit.nlargest(5)
    for commit, count in top_commits.items():
        print(f"  • {commit[:8]}... : {count} refactorings")
    
    # Visualization
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    fig.suptitle('Refactoring Statistics per Commit', fontsize=16, fontweight='bold')
    
    # Histogram of refactorings per commit
    refs_per_commit.hist(bins=30, ax=axes[0], color='steelblue', edgecolor='black')
    axes[0].set_title('Distribution of Refactorings per Commit')
    axes[0].set_xlabel('Number of Refactorings')
    axes[0].set_ylabel('Frequency (commits)')
    axes[0].axvline(refs_per_commit.mean(), color='red', linestyle='--', 
                   label=f'Mean: {refs_per_commit.mean():.2f}')
    axes[0].axvline(refs_per_commit.median(), color='orange', linestyle='--',
                   label=f'Median: {refs_per_commit.median():.1f}')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Boxplot
    refs_per_commit.plot(kind='box', ax=axes[1], vert=True)
    axes[1].set_title('Boxplot: Refactorings per Commit')
    axes[1].set_ylabel('Number of Refactorings')
    axes[1].grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.savefig(RESULTS_DIR / 'refactorings_statistics.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    # Final JSON summary
    summary = {
        'total_commits': total_commits_refminer,
        'commits_with_refactorings': commits_com_ref,
        'commits_without_refactorings': total_commits_refminer - commits_com_ref,
        'percentage_commits_with_refs': round(commits_com_ref/total_commits_refminer*100, 2),
        'total_refactorings': len(df_refs),
        'unique_refactoring_types': df_refs['type'].nunique(),
        'refactorings_per_commit': {
            'mean': round(refs_per_commit.mean(), 2),
            'median': round(refs_per_commit.median(), 1),
            'std': round(refs_per_commit.std(), 2),
            'min': int(refs_per_commit.min()),
            'max': int(refs_per_commit.max()),
            'percentile_75': round(refs_per_commit.quantile(0.75), 1),
            'percentile_90': round(refs_per_commit.quantile(0.90), 1)
        }
    }
    
    with open(RESULTS_DIR / 'refactorings_summary.json', 'w') as f:
        json.dump(summary, f, indent=2)
    
    print(f"\n✓ Exported: refactorings_summary.json")
else:
    print("No refactoring data available")

## 12. Summary and Next Steps

### ✅ What was analyzed:
1. **CK Metrics** - Complexity (WMC), Coupling (CBO), Lack of Cohesion (LCOM), etc.
2. **Distribution** - Boxplots to identify outlier classes.
3. **Correlations** - Heatmap showing the relationships between metrics.
4. **PMD - Static Analysis** - Evolution of code problems, priorities, violated rules.
5. **General Bugs** - SpotBugs (all detected bugs).
6. **Security Bugs** - find-sec-bugs (vulnerabilities).
7. **Critical Bugs** - HIGH priority.
8. **Refactorings** - Full history analyzed by RefactoringMiner.
9. **Advanced Refactoring Analysis**:
   - Top 10 most refactored files
   - Top 10 most refactored classes + types applied
   - Categorization by refactoring type (Methods, Classes, Hierarchy, etc.)
   - Cross-reference with CK metrics (were problematic classes refactored?)
   - Descriptive statistics (refactorings per commit)

### 📁 Generated files:

**CK Metrics**
- `metrics_summary.csv` - Statistics per release.
- `growth_rates.csv` - Growth rates.
- `metrics_evolution.png` - Evolution charts.
- `metrics_distribution.png` - Boxplots.
- `correlation_matrix.png` - Correlation heatmap.

**PMD - Static Analysis**
- `pmd_all_releases.csv` - All PMD problems.
- `pmd_{release}.csv` - Problems in the latest release.
- `pmd_critical_all.csv` - Critical problems (Priority 1).
- `pmd_summary.json` - Statistical summary.
- `pmd_analysis.png` - Evolution and distribution by priority.

**Bugs**
- `bugs_all_releases.csv` - All bugs.
- `security_bugs_all.csv` - Security bugs.
- `critical_bugs_all.csv` - Critical bugs.
- `bugs_summary.json` - Statistical summary.
- `bugs_evolution.png` - General visualizations.
- `security_bugs.png` - Security visualizations.

**Refactorings (RefactoringMiner)**
- `refactorings-all.json` - Refactorings detected in commits.
- `refactoring-miner.log` - Full processing log (stdout/stderr).
- `refactorings_by_type.csv` - Number of refactorings per type.
- `top_refactored_files.csv` - Most refactored files.
- `top_refactored_classes.csv` - Most refactored classes.
- `class_refactoring_heatmap.csv` - Classes x types heatmap data.
- `refactorings_categories.csv` - Distribution by category.
- `problematic_classes_refactorings.csv` - Cross-reference with CK metrics.
- `refactorings_summary.json` - Descriptive statistics.
- `refactorings_by_file.png` - File visualizations.
- `refactorings_by_class.png` - Class visualizations + heatmap.
- `refactorings_by_category.png` - Categorization.
- `refactorings_vs_metrics.png` - Cross-reference with CK metrics.
- `refactorings_statistics.png` - Statistics per commit.

### 🎯 How to use:
1. **Identify problems** - Use Top Classes (CK metrics), Top Rules (PMD), boxplots, and refactoring types.
2. **Prioritize** - Focus on:
   - **PMD Priority 1** (critical design/security problems)
   - Security and critical bugs (SpotBugs)
   - Problematic classes that were NOT refactored (see `problematic_classes_refactorings.csv`)
   - Most refactored files/classes (change hotspots)
3. **Refactor** - Classes with high WMC/CBO/LCOM plus many PMD problems are natural candidates.
4. **Submit PRs** - Fix bugs and critical PMD problems, and plan compatible refactorings.
5. **Document** - Use the charts and tables in the scientific article, cross-referencing metrics, PMD, bugs, and refactorings.
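
One possible way to turn the prioritization step into a ranked list is sketched below. It assumes a table with the columns exported to `problematic_classes_refactorings.csv`; the rows are invented, and the ordering rule (fewest refactorings first, then highest problem score) is just one reasonable choice.

```python
import pandas as pd

# Hypothetical rows using the columns exported to
# problematic_classes_refactorings.csv; in practice this table could be
# loaded with pd.read_csv from the results directory.
candidates = pd.DataFrame({
    "class": ["org.example.A", "org.example.B", "org.example.C"],
    "wmc": [120, 80, 60],
    "cbo": [30, 25, 10],
    "lcom": [400, 900, 50],
    "refactoring_count": [0, 0, 7],
    "problem_score": [4.1, 5.3, 2.0],
})

# Never-refactored classes come first; among them, the worst score leads.
ranked = candidates.sort_values(
    ["refactoring_count", "problem_score"], ascending=[True, False]
).reset_index(drop=True)

print(ranked[["class", "refactoring_count", "problem_score"]])
```

Sorting on both columns keeps the list stable when many classes share a refactoring count of zero.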

### 💡 Tips for Pull Requests:
- **PMD Priority 1 problems** are excellent candidates for quick PRs (design/best practices).
- Security bug fixes are always welcome; prioritize them.
- Start with simple fixes (PMD Priority 3-4, LOW bugs) and work up to the critical ones.
- Classes with high LCOM + many PMD problems -> apply **Split Responsibility** (e.g. Extract Class).
- Classes with high WMC + PMD rule violations -> apply **Extract Method**.
- Use the refactoring ranking to justify decisions.
- **Problematic classes with few refactorings** are priority candidates.

### 📊 Insights for the Scientific Article:
- **Metrics x refactorings correlation:** Were highly complex classes refactored?
- **PMD vs CK Metrics:** Do classes with high WMC/CBO have more PMD problems?
- **Quality evolution:** Did PMD problems decrease across releases?
- **Refactoring patterns:** Which categories are most common?
- **Change hotspots:** The most refactored files/classes point to critical areas.
- **Technical debt:** Problematic classes without refactorings + many PMD problems = accumulated technical debt.
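
For the first question, a rank correlation is one simple, hedged way to quantify the relationship between a metric and the refactoring count. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks, which avoids any dependency beyond pandas. The numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical per-class values: WMC and the number of refactorings received.
df = pd.DataFrame({
    "wmc": [5, 12, 30, 45, 80],
    "refactoring_count": [0, 1, 1, 4, 9],
})

# Spearman's rho equals the Pearson correlation computed on the ranks
# (ties receive their average rank, pandas' default).
rho = df["wmc"].rank().corr(df["refactoring_count"].rank())

print(f"Spearman rho (WMC vs refactorings): {rho:.3f}")
```

A rank-based coefficient is less sensitive to the few very large classes that typically dominate CK metric distributions, though it still says nothing about causation.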