# Code Quality in Different Programming Languages
*Probabilistic Programming 2024 Exam by Raúl Pardo and Andrzej Wąsowski*<br/>
*version 1.0.0 2024-05-01 10:00*

*Does the programming language in which a program is written affect software quality?* This is the driving research question you will study in this exam. In this context, we quantify software quality as the number of bugs: the more bugs, the lower the quality. In the software engineering world, there are many beliefs (or stereotypes) about how the choice of programming language affects the propensity to introduce bugs. For instance, a common belief is that functional languages help programmers avoid introducing bugs. Other factors may also influence the number of bugs in a piece of software. In the context of open source projects, it seems plausible that older projects will have more reported bugs; e.g., one would expect a 10-year-old project to have more bugs than a 1-month-old project.


## Data

The dataset contains $N=1127$ records of fragments of GitHub projects that are written in different programming languages. The dataset is in the file [dataset.csv](dataset.csv). Click the link to get an idea of the content and structure of the file. The variables in the dataset are divided into predictors and a predicted variable as follows:

Predictors:
* Language (L): the used programming language
* Commits (C): the total number of commits in the project
* Insertions (I): the total number of inserted lines in all commits
* Age (A): the time passed since the oldest recorded commit in the project
* Devs (D): the total number of users committing code to the project
* Project type (T): the type of project, e.g., application, library, framework, ...

Predicted variable:
* Bugs (B): the number of commits classified as "bugs"

## Research questions

As mentioned above, we are interested in understanding whether the programming language in which a program is written (as well as other factors) impacts the number of bugs in the project. To this end, you must investigate the validity of the following hypotheses:
    
* **H1** - Haskell code is less prone to contain bugs (B). In other words, the distribution on the number of bugs (B) for Haskell gives high probability to the lowest number of bugs among all programming languages (L).
    
* **H2** - Age (A) has a positive impact on the number of bugs (B) for all programming languages (L). That is, older projects have a larger number of bugs (B).
    
* **H3** - Number of commits (C) does not impact the effect of age (A) on the number of bugs (B) for any programming language (L). That is, the effect of age (A), conditioned on number of commits (C), on number of bugs (B) is the same as the direct effect of age (A) on number of bugs (B).
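One way to make H2 and H3 concrete, offered only as a sketch of a possible reading and not as the expected solution, is to compare two regressions that differ only in whether commits enter the linear predictor. The Poisson likelihood with a log link, the per-language coefficients, and the symbols $\alpha$, $\beta^{A}$, $\tilde\beta^{A}$ and $\beta^{C}$ are all assumptions introduced here for illustration; $A_i$ and $C_i$ are assumed to be suitably transformed (e.g., standardized) versions of age and commits.

$$
\begin{aligned}
B_i &\sim \mathrm{Poisson}(\lambda_i) \\
\text{without commits:}\quad \log \lambda_i &= \alpha_{L[i]} + \beta^{A}_{L[i]}\, A_i \\
\text{with commits:}\quad \log \lambda_i &= \alpha_{L[i]} + \tilde\beta^{A}_{L[i]}\, A_i + \beta^{C}_{L[i]}\, C_i
\end{aligned}
$$

Under this reading, H2 asks whether the posterior of $\beta^{A}_{j}$ is concentrated on positive values for every language $j$, and H3 asks whether the posterior of $\tilde\beta^{A}_{j}$ is essentially the same as that of $\beta^{A}_{j}$ for every language $j$.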

Your task is to use Bayesian Inference and Regression to decide whether these hypotheses hold, or possibly reject them. This includes:

* Loading, restructuring and transforming the data as needed.

* Designing Bayesian regression models and using the inference algorithms to test the above hypotheses in PyMC.

* Explaining your model idea in English, preferably using a figure, and showing the Python code.

* Checking and reflecting (in writing) on the quality of the sampling process, considering warnings from the tool, sampling summary statistics, trace plots, and autocorrelation plots. Comment whether the quality is good, and whether you had to make any adjustments during the modeling.

* Visualizing the posterior information appropriately to address the hypotheses.
  

You should hand in a zip file with a Jupyter notebook and the data file (so that we can run it), and a **PDF file rendering of the Jupyter notebook**, so that your work can be assessed just by reading this file. The PDF file should include all the plots and results. Make sure the notebook is actually a **report** readable to the examiners, especially to the censor, who has not been following the course. The report should contain a brief introduction, a concise explanation of how the data is loaded and cleaned, an analysis of the model design, a discussion of sampling quality, the posterior plots, and a decision outcome for each hypothesis. It should end with an overall conclusion.

It appears that the best PDF rendering is obtained by File / Export to HTML, and then saving/printing to PDF from your browser.

*IMPORTANT:* For each of the tasks below, your code must be accompanied by an explanation of its meaning and intended purpose. **Source code alone is not self-explanatory**. You should also reflect on the results you get, e.g., highlighting issues with the data, or issues, pitfalls and assumptions of a model. **Exams containing only source code or very sparse explanations will result in low grades, including failing grades.**

## Minimum requirements 

1. Design a regression model to predict number of bugs (B) using language (L) as a predictor.
  
2. Analyze hypothesis H1 using the regression model in (1.).
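As a possible starting point for item 1, and only a sketch under stated assumptions, note that B is a count, so a Poisson likelihood with a log link and one intercept per language is a natural first candidate. The prior parameters $\mu_\alpha$ and $\sigma_\alpha$ are placeholders introduced here; their values would need to be justified, for instance with prior predictive checks.

$$
\begin{aligned}
B_i &\sim \mathrm{Poisson}(\lambda_i) \\
\log \lambda_i &= \alpha_{L[i]} \\
\alpha_j &\sim \mathrm{Normal}(\mu_\alpha, \sigma_\alpha)
\end{aligned}
$$

Here $L[i]$ denotes the language index of record $i$, and $\exp(\alpha_j)$ is the expected number of bugs for language $j$. With such a model, H1 could be examined by comparing the posterior of the Haskell intercept with the intercepts of the other languages. If the data turn out to be overdispersed relative to a Poisson, a different count likelihood may be more appropriate.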

## Ideas for extension

**Groups aiming at grade 7 and more should complete the following tasks:**
    
3. Analyze hypothesis H2, if necessary design a new model.
        
4. Perform prior predictive checks in all your models. Explain why the priors you selected are appropriate.
    
5. Perform posterior predictive checks in all your models. Discuss the results in the posterior predictive checks.
    
6. Discuss trace convergence in all your models.
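To illustrate what a prior predictive check (item 4) is meant to reveal, the following minimal sketch simulates the expected bug count implied by two priors on a log-scale intercept. It assumes a Poisson model with a log link, uses only the Python standard library so that it runs without PyMC, and the two priors `Normal(0, 10)` and `Normal(3, 1)` as well as the helper names are hypothetical choices made for this example. In the notebook itself, `pm.sample_prior_predictive` would play this role.

```python
import math
import random
import statistics


def prior_predictive_rates(mu, sigma, n_draws=10_000, seed=0):
    """Draw alpha ~ Normal(mu, sigma) and return the implied Poisson rates exp(alpha)."""
    rng = random.Random(seed)
    return [math.exp(rng.gauss(mu, sigma)) for _ in range(n_draws)]


def summarize(rates):
    """Return the median and an approximate 95th percentile of the implied rates."""
    ordered = sorted(rates)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    return statistics.median(ordered), p95


# Two hypothetical priors for the intercept on the log scale (assumptions).
vague = prior_predictive_rates(mu=0.0, sigma=10.0)
tighter = prior_predictive_rates(mu=3.0, sigma=1.0)

for name, rates in [("Normal(0, 10)", vague), ("Normal(3, 1)", tighter)]:
    median, p95 = summarize(rates)
    print(f"{name}: median rate = {median:.3g}, 95th percentile = {p95:.3g}")
```

Because the intercept is exponentiated, a prior that looks harmless on the log scale can place substantial mass on implausibly large bug counts. Comparing the printed summaries against what is plausible for a GitHub project is one way to argue that a prior is appropriate.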
    
**Groups aiming at grade 10 and higher should try 3-5 ideas from below or add some of your own:**

7. Analyze hypothesis H3, if necessary design a new model.
    
8. Perform a counterfactual analysis in your model for H3: For each project, plot posterior predictions of the number of bugs for increasing age, assuming that the project has 2000 commits. You may extend this task to a varying number of commits.
    
9. Consider mixture models for analyzing the hypotheses above. Explain why the mixture models you evaluate are appropriate in the context of this analysis.
    
10. Design models that treat/transform the outcome variable (number of bugs) as a real value. Analyze the hypotheses above with the new model. Explain whether the result of this analysis differs from the one for the models you used in (1.), (2.) and (3.). Alternatively, use a binomial model to predict a probability that a commit is a bug.
    
11. Use information criteria to compare the models to analyze H1, H2 and H3.
    
12. Design a meaningful multilevel model in the context of these data.
    
13. Pose and analyze a new hypothesis involving more predictors than those in H1, H2 and H3.
   
14. Use causal reasoning to analyze causal relations between the variables in the dataset.

This is an open exam, and **the above directions of thinking are mostly to inspire you**. Treat the task as a project. The list above is indicative and will not be strictly followed when grading. You can land lower if the extensions are not realized well. You can land higher if you have other interesting ideas of your own beyond those listed here. You are encouraged to add your own ideas (based on the course material).

**The solutions to the exam must be made solely by the members of the group**. You are not allowed to discuss exam solutions with other classmates, post questions in internet fora, or the like. You are allowed to ask for clarification of possible mistakes, misprints, and so on, by private email to `raup@itu.dk` and `wasowski@itu.dk` with a CC to `mflh@itu.dk` (Exam coordinator).

**Your solution must contain the following declaration:**

    We hereby declare that we have answered these exam questions ourselves without any outside help.

---

In [None]:
import numpy as np
import pandas as pd
import arviz as az
import pymc as pm
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('dataset.csv')

In [None]:
df