# Machine Learning for Software Analysis (MLSA)

#### Fabio Pinelli
<a href="mailto:fabio.pinelli@imtlucca.it">fabio.pinelli@imtlucca.it</a><br/>
IMT School for Advanced Studies Lucca<br/>
2025/2026<br/>
October 9, 2025

## Exercise 1

**Dataset**: KC1 (NASA PROMISE) — static code metrics (Halstead, McCabe/Cyclomatic, LOC...).

**Metrics**:
- *Halstead* (operators/operands, volume, difficulty, effort, estimated bugs): [https://en.wikipedia.org/wiki/Halstead_complexity_measures](https://en.wikipedia.org/wiki/Halstead_complexity_measures)
- *Cyclomatic complexity (McCabe)*: [https://en.wikipedia.org/wiki/Cyclomatic_complexity](https://en.wikipedia.org/wiki/Cyclomatic_complexity)


**Goal**: binary classification of defective vs non-defective modules.
  
+ **Size / comments**

    - **loc**: Lines of Code (LOC): total number of lines in the module.
    - **loccodeandcomment**: Number of lines that contain both code and a comment.
    - **locode**: Effective lines of code (no comments/blank).
    - **locomment**: Number of comment lines.
    - **loblank**: Number of blank lines.
    - **branchcount**: Number of branches in the control flow (approx. number of decisions).

+ **McCabe / Cyclomatic**
    - **v(g)**: Cyclomatic complexity v(G): number of linearly independent paths in the CFG (McCabe).
    - **ev(g)**: Essential complexity ev(G): measures degree of structuredness.
    - **iv(g)**: Design complexity iv(G): complexity related to design/call structure.
    - **cyclomatic_complexity**: Cyclomatic complexity: number of independent paths.

+ **Halstead**
    - **uniq_op**: Halstead: distinct operators (η₁).
    - **uniq_opnd**: Halstead: distinct operands (η₂).
    - **total_op**: Halstead: total operators (N₁).
    - **total_opnd**: Halstead: total operands (N₂).
    - **n**: Halstead length N = N₁ + N₂.
    - **v**: Halstead volume V = N × log₂(η₁ + η₂).
    - **l**: Halstead level L (inverse of difficulty).
    - **d**: Halstead difficulty D = (η₁/2) × (N₂/η₂).
    - **i**: Halstead intelligence content I = L × V (interpretations vary).
    - **e**: Halstead effort E = D × V.
    - **b**: Halstead estimated bugs B ≈ V/3000 (or E^(2/3)/3000).
    - **t**: Halstead time to program T = E/18 (seconds).
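
The Halstead formulas listed above can be checked by hand with a minimal sketch like the one below. The function name `halstead` and the example counts are hypothetical, chosen only so the arithmetic comes out in round numbers. The values stored in the KC1 columns may have been produced by a tool with slightly different conventions, so exact agreement with the dataset is not guaranteed.

```python
import math


def halstead(uniq_op, uniq_opnd, total_op, total_opnd):
    """Derive Halstead measures from the four base counts (illustrative helper)."""
    if uniq_opnd == 0 or total_opnd == 0:
        # D (and therefore L = 1/D) is undefined without operands.
        raise ValueError("need at least one operand")
    n = total_op + total_opnd                      # length N = N1 + N2
    v = n * math.log2(uniq_op + uniq_opnd)         # volume V = N * log2(eta1 + eta2)
    d = (uniq_op / 2) * (total_opnd / uniq_opnd)   # difficulty D
    l = 1 / d                                      # level L, inverse of difficulty
    e = d * v                                      # effort E = D * V
    return {
        "n": n,
        "v": v,
        "d": d,
        "l": l,
        "i": l * v,      # intelligence content I = L * V
        "e": e,
        "b": v / 3000,   # estimated bugs, using the V/3000 variant
        "t": e / 18,     # time to program, in seconds
    }


# Hypothetical module: 4 distinct operators, 4 distinct operands,
# 8 operator occurrences and 8 operand occurrences.
metrics = halstead(uniq_op=4, uniq_opnd=4, total_op=8, total_opnd=8)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For these counts the sketch gives N = 16, V = 48, D = 4, L = 0.25, I = 12 and E = 192, which matches a manual evaluation of the formulas.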





In [1]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/fpinell/mlsa/refs/heads/main/AA20252026/data/kc1_modified.csv')

In [2]:
df.shape

(2109, 22)

In [3]:
df.defects.value_counts()

| defects | count |
| --- | --- |
| False | 1783 |
| True | 326 |
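
The counts above show that the two classes are far from balanced. As a quick sketch, the snippet below hard-codes those two counts (instead of re-reading the CSV) and derives the defect rate and the accuracy of a trivial classifier that always predicts "non-defective". The variable names are illustrative assumptions, not part of the exercise.

```python
# Class counts copied from the df.defects.value_counts() output above.
counts = {False: 1783, True: 326}

total = sum(counts.values())                 # should equal the number of rows
defect_rate = counts[True] / total           # share of defective modules
imbalance_ratio = counts[False] / counts[True]

# Accuracy of a baseline that always predicts the majority class.
majority_baseline_accuracy = max(counts.values()) / total

print(f"modules: {total}")
print(f"defect rate: {defect_rate:.1%}")
print(f"non-defective per defective module: {imbalance_ratio:.2f}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

Only about 15.5% of the modules are defective, so a model that never flags a defect already reaches roughly 84.5% accuracy. This suggests evaluating the classifier with metrics that are sensitive to the minority class (for example precision, recall or F1) rather than with accuracy alone.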
