# Analyze Linux Bugs

In [1]:
import pandas as pd

## Data source
- The last 1000 commits from the Linux kernel GitHub repository
- Date of the most recent commit -> `Wed Sep 20 12:09:23 2023 -0700`
- Stored as `linux_commits.txt`

## Methodology

Starting from the most recent commit and working backwards, we look for the first 100 commits that fix a bug, following these steps:

1) Select a commit:
- I discard those whose title starts with "Merge ..."
- I check that the title, the commit description, or the bug report (mail thread or Bugzilla) states that the commit fixes a bug in the software
- I check that the change is made in the source code and that it changes functionality, excluding variable renames and comment-only changes.

2) Extract commit info:
- Name/message of the commit (for identification only)
- Link to commit
- Link to mailing list (if any)
- Link to Bugzilla (if any)
- Link to bug-introducing commit (BIC) (if any)
- Location -> Derived from the folder of the changed file (e.g. `fs/` => File System)
- Regression -> Whether or not it is a regression
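
As an illustration of the Location field, here is a minimal sketch that maps a changed file path to a label. Only the `fs/` => File System pairing comes from the text above. The other directory pairings, the `OTHER` fallback, and the names `TOP_DIR_TO_LOCATION` and `location_for` are assumptions made for this example; in this analysis the labelling was done by hand.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from top-level kernel directory to a LOCATION label.
# Only "fs" comes from the methodology text; the other pairings are assumptions.
TOP_DIR_TO_LOCATION = {
    "fs": "FILESYSTEM",
    "drivers": "DRIVER",
    "arch": "ARCH",
    "net": "NETWORK",
    "kernel": "KERNEL",
    "mm": "MEMORY",
}


def location_for(path: str) -> str:
    """Return the LOCATION label for a changed file, or "OTHER" if unmapped."""
    parts = PurePosixPath(path).parts
    top = parts[0] if parts else ""
    return TOP_DIR_TO_LOCATION.get(top, "OTHER")
```

A commit that touches files in several top-level folders would still need a manual decision.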

In [2]:
df = pd.read_csv('Bugs_LinuxKernel.csv')  

## Results

After reviewing 203 commits, we found 100 that fix a bug (Bug-Fixing Commits, BFCs).

To get an overview of how much information we were able to collect, we analyze each of the proposed fields:

### How many BFCs have a link to the mailing list?

In [3]:
len(df[df["MAIL LIST LINK"]!="NONE"])

49

**Findings**
- About half of the BFCs (49 of 100) have an associated mailing list link.
- These links are easy to identify because they use the format `Link: <URL>` or `Closes: <URL>`.
- Some BFCs have more than one associated mailing list link.
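
Because the links follow a fixed format, they could be collected automatically. The following is a minimal sketch, assuming each trailer sits on its own line of the commit message; the function name `extract_links` and the example message are hypothetical.

```python
import re

# Matches trailer lines of the form "Link: <URL>" or "Closes: <URL>".
TRAILER_RE = re.compile(r"^(Link|Closes):[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def extract_links(message: str) -> list[tuple[str, str]]:
    """Return (trailer, url) pairs in the order they appear in the message."""
    return TRAILER_RE.findall(message)


# Hypothetical commit message, for illustration only.
example_message = """subsys: fix something

Short description of the fix.

Closes: https://example.org/thread/1
Link: https://example.org/thread/2
Signed-off-by: A Developer <dev@example.org>
"""

links = extract_links(example_message)
```

Returning a list rather than a single value keeps the case of several links per BFC visible.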

### How many BFCs have a link to Bugzilla?

In [4]:
len(df[df["BUGZILLA LINK"]!="NONE"])

5

**Findings**
- The number is very low compared to mailing lists (5 versus 49 of the 100 BFCs).
- **Proposal:** Start from the Bugzilla reports and check whether the BFCs reported there link back to the same Bugzilla issue.

### How many BFCs have a link to the BIC?

In [5]:
len(df[df["BIC LINK"]!="NONE"])

80

**Findings**
- The number is surprisingly high: 80 of the 100 BFCs identify a BIC.
- This is because commit messages report the BIC in the format `Fixes: <hash> (<comment>)`.
- Some BFCs report more than one BIC.
- In many cases the authors use `git bisect` to find the BIC.
- **Proposal**: The time elapsed between the BIC and the BFC could be calculated automatically.
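
The proposal above can be sketched as follows. This is an illustration under assumptions, not the method used in this analysis: it assumes the `Fixes:` trailer is on its own line, that both commit dates are available in the layout shown in the data source section, and that day and month names are in English (`%a` and `%b` depend on the locale). The names `extract_bics` and `elapsed` are hypothetical.

```python
import re
from datetime import datetime, timedelta

# "Fixes: <hash> (<comment>)" on its own line; the comment part is optional here.
FIXES_RE = re.compile(
    r"^Fixes:[ \t]*([0-9a-f]{7,40})\b[ \t]*(?:\((.*)\))?[ \t]*$",
    re.MULTILINE,
)

# Date layout from the data source section, e.g. "Wed Sep 20 12:09:23 2023 -0700".
GIT_DATE_FORMAT = "%a %b %d %H:%M:%S %Y %z"


def extract_bics(message: str) -> list[tuple[str, str]]:
    """Return (hash, comment) pairs, one per Fixes: trailer in the message."""
    return FIXES_RE.findall(message)


def elapsed(bic_date: str, bfc_date: str) -> timedelta:
    """Time between the BIC and the BFC; the UTC offsets are taken into account."""
    bic = datetime.strptime(bic_date, GIT_DATE_FORMAT)
    bfc = datetime.strptime(bfc_date, GIT_DATE_FORMAT)
    return bfc - bic
```

The hash from `extract_bics` would still have to be resolved to a commit date, for example from the repository history, before calling `elapsed`.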

### Where are the BFCs located?

In [34]:
df.groupby("LOCATION").count().sort_values("ID",ascending=False)['ID']

LOCATION
DRIVER        48
FILESYSTEM    24
ARCH           7
NETWORK        6
KERNEL         5
INCLUDE        4
SCRIPTS        2
BLOCK          1
IO             1
MEMORY         1
SECURITY       1
Name: ID, dtype: int64

**Findings**
- The "drivers" module accounts for about half of the bugs (48 of 100). This is in line with several studies, which show that it is the module where most bugs are fixed.

### How many BFCs fix a detected regression?

In [37]:
len(df[df["REGRESSION"]=="YES"])

9

**Findings**
- A BFC is counted as a regression if the commit author reports it as such or if it reverts a change made by the BIC.
- The number is not very high (9 of 100), but we may be missing many cases.
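
Since regressions may be under-counted, a simple text heuristic could help flag candidates for manual review. The sketch below is an assumption, not the procedure used above: it only looks for a title starting with "Revert" or a word beginning with "regress" in the message, and it cannot tell whether a revert actually undoes the BIC. The name `looks_like_regression` is hypothetical.

```python
import re

# Any word starting with "regress" (regression, regressed, ...), case-insensitive.
REGRESSION_WORD_RE = re.compile(r"\bregress\w*", re.IGNORECASE)


def looks_like_regression(message: str) -> bool:
    """Flag a commit message as a regression candidate for manual review."""
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    return subject.startswith("Revert ") or bool(REGRESSION_WORD_RE.search(message))
```

A positive result only marks the commit for a closer look; the final YES/NO still requires reading the commit and its BIC.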

## Other proposals

- Check/collect authors of commits (or who signs them)

## Examples

Below are some example commits, chosen either because they are interesting for categorization or because they are difficult to classify:

- Commits that look like a fix but actually just avoid a warning
    - https://github.com/torvalds/linux/commit/75b2f7e4c9e0fd750a5a27ca9736d1daa7a3762a
    - https://github.com/torvalds/linux/commit/72178d5d1a38dd185d1db15f177f2d122ef10d9b
- Commit that renames variables
    - https://github.com/torvalds/linux/commit/40d84e198b0ae64df71ac0e70675b16900b90bde
- Commit with more than one mailing list
    - https://github.com/torvalds/linux/commit/df1c357f25d808e30b216188330e708e09e1a412
- Commit with a different mailing list than usual
    - https://github.com/torvalds/linux/commit/80cc944eca4f0baa9c381d0706f3160e491437f2
- Commit with more than one BIC, one of them reverted by the BFC and the other already reverted earlier
    - https://github.com/torvalds/linux/commit/7a3bc2b3989e05bbaa904a63279049a401491c84
- Commits that I discarded as fixes (but I'm not sure)
    - https://github.com/torvalds/linux/commit/e4e14095cc68a2efefba6f77d95efe1137e751d4
    - https://github.com/torvalds/linux/commit/06ed09351b67eb1114ae106a87a0ee3ea9adb3db
    - https://github.com/torvalds/linux/commit/64be47ba286117ee4e3dd9d064c88ea2913e3269
    - https://github.com/torvalds/linux/commit/ec5fa9fcdeca69edf7dab5ca3b2e0ceb1c08fe9a
    - https://github.com/torvalds/linux/commit/ab43213e7afd08ac68d4282060bacf309e70fd14
    - https://github.com/torvalds/linux/commit/1832403cd41ca6b19b24e9d64f79cb08d920ca44
    - https://github.com/torvalds/linux/commit/f5b2c10b57615828b531bb0ae56bd6325a41167e
    - https://github.com/torvalds/linux/commit/fc6efed2c728c9c10b058512fc9c1613f870a8e8
    - https://github.com/torvalds/linux/commit/97e3c6a853f2af9145daf0c6ca25bcdf55c759d4
    - https://github.com/torvalds/linux/commit/9855d60cfc720ff32355484c119acafd3c4dc806
    - https://github.com/torvalds/linux/commit/8a81cf96f5510aaf9a65d103f7405079a7b0fcc5
- Commits that change/fix only configuration files (Kconfig)
    - https://github.com/torvalds/linux/commit/c42d116ccb72b6a33728e2b4b76ab175197ffb07
    - https://github.com/torvalds/linux/commit/41425941dfcf47cc6df8e500af6ff16a7be6539f
    - https://github.com/torvalds/linux/commit/86e16b87afac20779da1228d690a95c54d7e2ad0
    - https://github.com/torvalds/linux/commit/98a15816636044f25be4644db2a3e09fad68aaf7
    - https://github.com/torvalds/linux/commit/f6c8a312ef0175ea67a1ace29e1d1e5d470ea45a
    - https://github.com/torvalds/linux/commit/b2cacc2e818717545e6d0cc453b72f98249398bf
    - https://github.com/torvalds/linux/commit/24dc13f94367edb314b13923818d98dd565edc44
- Commit that fixes a test
    - https://github.com/torvalds/linux/commit/a06023a8f78d3e9e73ca4363ccf3871a06e16ecc