Small challenges designed to train analytical thinking and the use of data-driven software analysis techniques.

feststelltaste/software-analytics-katas

Software Analytics Katas

Small exercises designed to train analytical thinking and the use of data-driven software analysis.

If you want to get started with Python & pandas right away, you can do so by clicking the Binder button.

Status: beta, Feedback and Pull Requests very welcome!

About "katas"

Katas are intended as a small-group (3-5 people) exercise, usually as part of a larger group (4-10 groups are ideal), each of whom is doing a different kata. A moderator keeps track of time, assigns katas (or allows this website to choose one randomly), and acts as the facilitator for the exercise.

-- Adapted from Ted Neward

Software Analytics Katas Overview

Motivation Problem

Data source: Stack Overflow
Difficulty: easy

Problem Context

Developers in a software company use a version control system (VCS for short) called CVS (Concurrent Versions System). The developers now want to migrate to SVN (Subversion). However, you believe that Git has become the standard in the software development community, so you suggest Git as an alternative for the team.

Your Task

Find data-driven facts that show that the software development community mainly uses the version control system Git!

Additional Information

  • You know that Stack Overflow is a platform that provides answers to questions regarding certain technologies.

Starters

  • File (~ 4 MB) with statistics about the questions asked on Stack Overflow about version control systems over several years.
    • CreationDate: The timestamp of the creation date of a Stack Overflow post (= question).
    • TagName: The tag name for a technology (in our case for only 4 VCSes: "cvs", "svn", "git" and "mercurial").
    • ViewCount: The number of views of a post
  • Dataset URL: datasets/stackoverflow_vcs_data_subset.csv
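One possible way to start is to count the questions per year and tag. The following is a minimal sketch, assuming the three columns described above and that `CreationDate` can be parsed by pandas. The helper name `questions_per_year` and the sample rows are made up for illustration and are not part of the dataset.

```python
import pandas as pd


def questions_per_year(posts: pd.DataFrame) -> pd.DataFrame:
    """Count questions per year (rows) and VCS tag (columns)."""
    posts = posts.copy()
    posts["year"] = pd.to_datetime(posts["CreationDate"]).dt.year
    return posts.groupby(["year", "TagName"]).size().unstack(fill_value=0)


# Tiny made-up sample that only mimics the shape of the real data.
sample = pd.DataFrame({
    "CreationDate": ["2009-03-01 10:00:00", "2009-07-15 12:30:00",
                     "2015-01-20 08:00:00", "2015-02-11 09:45:00",
                     "2015-06-30 17:10:00"],
    "TagName": ["svn", "git", "git", "git", "svn"],
    "ViewCount": [120, 300, 80, 45, 10],
})
counts = questions_per_year(sample)
print(counts)

# With the real file (path taken from the starter above):
# posts = pd.read_csv("datasets/stackoverflow_vcs_data_subset.csv")
# print(questions_per_year(posts))
```

Question counts are only one indicator. Summing `ViewCount` per tag and year in the same way would show how much attention each VCS receives.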

A Long Ping Pong Along

Data source: Log files
Difficulty: medium

Problem Context

The control software for the "DependencyHell" ghost train attraction was implemented as a microservices architecture. Since there were unexplained failures in spooky cases again and again, a ping API was introduced for all services. This allows services to call other services to see if they are currently reachable. A convention was introduced for the API that any working service should acknowledge a call to /healthy with the HTTP status code 200. However, there are still sporadic failures.

Your Task

The development team would like to narrow down when and why the sporadic bugs occur. An analysis of the log files about the failure situations at the microservices should provide clarity.

Additional Information

  • There is an aggregated log file that logs the ping calls over several days.
  • The log file format is the same for all services
  • The operating software runs in the exhibitor's own data center.
  • The showman usually operates from 1 p.m. and sometimes deep into the night.

Starters

  • Log file (~7 MB) with the recorded communication between the services of one week with the following information:
    • timestamp: timestamp of the log entry
    • status: Returned HTTP status code of a request
    • method: HTTP request method used
    • url: Called service URL
    • ms: Response time of the called service call
  • Dataset URL: datasets/scarylog.csv
    • Possible header string: "timestamp", "status", "method", "url", "ms".
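A first look at "when" could be the share of failed pings per hour of the day. This is a rough sketch, assuming the five columns listed above, a timestamp format that pandas can parse, and that every status other than 200 counts as a failure (following the /healthy convention). The helper name `failure_rate_by_hour` and the sample rows are illustrative only.

```python
import pandas as pd


def failure_rate_by_hour(log: pd.DataFrame) -> pd.Series:
    """Share of non-200 responses per hour of the day (0-23)."""
    log = log.copy()
    log["hour"] = pd.to_datetime(log["timestamp"]).dt.hour
    log["failed"] = log["status"] != 200
    return log.groupby("hour")["failed"].mean()


# Made-up sample rows; the real timestamp format may differ.
sample = pd.DataFrame({
    "timestamp": ["2021-10-01 12:59:00", "2021-10-01 13:05:00",
                  "2021-10-01 13:20:00", "2021-10-01 23:40:00"],
    "status": [200, 200, 503, 500],
    "method": ["GET"] * 4,
    "url": ["/healthy"] * 4,
    "ms": [12, 15, 3000, 2800],
})
rates = failure_rate_by_hour(sample)
print(rates)

# With the real file, assuming it has no header row:
# log = pd.read_csv("datasets/scarylog.csv",
#                   names=["timestamp", "status", "method", "url", "ms"])
# print(failure_rate_by_hour(log))
```

Grouping by `url` instead of the hour, or comparing the `ms` values of failed and successful calls, are natural next steps for the "why".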

Tests are Code, too

Data source: version control system
Difficulty: medium

Problem Context

The developers of the integrated development environment "IntelliJ IDEA" have noticed that many developers check out the current version, make changes, but do not commit the changes back to the repo until days later in one big commit. This leads to many merge conflicts and thus to conflicts among developers as well. In most cases, developers also forget to check in newly written tests (or they don't write any at all, which is only discovered later during the code review).

To improve the development process, the developers have agreed on the following measures:

  • All commits must now contain fewer than 500 lines of code.
  • The ratio of test code to source code must be at least 0.7:1 at the end of the day.

Your Task

Show that developers are now working as they have agreed upon! Track the commit activities and find out whether developers are now writing tests as they should.

Additional Information

  • Only source code written in Java or Kotlin is affected by the measure.
  • IntelliJ IDEA uses the postfix "Test" as an identifier for test code.
  • Source code is managed using the Git version control system.
  • The software project uses a Continuous Integration Server.

Starters

  • A (preprocessed) Git Log Numstat file over a period of six months. Each line corresponds to a change to a source code file and contains the following content:
    • ts_in_s: the commit timestamp in seconds
    • path: The file path of the source code file
    • add: The number of lines added ("additions")
    • del: The number of deleted lines ("deletions")
  • Dataset URL: datasets/intellij_testing.csv
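Both measures can be checked roughly with pandas. The sketch below rests on several assumptions that the kata does not confirm: rows with the same `ts_in_s` are treated as one commit (the file has no commit ID), a commit's size is taken as additions plus deletions, and the daily ratio compares added test lines with added source lines. All helper names and the sample rows are hypothetical.

```python
import pandas as pd


def java_kotlin_only(changes: pd.DataFrame) -> pd.DataFrame:
    """Keep only changes to Java or Kotlin files."""
    mask = changes["path"].str.contains(r"\.(?:java|kt)$")
    return changes[mask].copy()


def commit_sizes(changes: pd.DataFrame) -> pd.Series:
    """Changed lines per commit; assumption: one timestamp = one commit."""
    code = java_kotlin_only(changes)
    return (code["add"] + code["del"]).groupby(code["ts_in_s"]).sum()


def daily_test_ratio(changes: pd.DataFrame) -> pd.Series:
    """Added test lines divided by added source lines, per day."""
    code = java_kotlin_only(changes)
    code["day"] = pd.to_datetime(code["ts_in_s"], unit="s").dt.date
    is_test = code["path"].str.contains(r"Test\.(?:java|kt)$")
    code["kind"] = is_test.map({True: "test", False: "source"})
    added = code.groupby(["day", "kind"])["add"].sum().unstack(fill_value=0)
    added = added.reindex(columns=["source", "test"], fill_value=0)
    return added["test"] / added["source"]


# Made-up sample rows in the shape described above.
sample = pd.DataFrame({
    "ts_in_s": [1600000000, 1600000000, 1600000000, 1600003600],
    "path": ["src/Foo.java", "test/FooTest.java", "README.md", "src/Bar.kt"],
    "add": [100, 80, 5, 600],
    "del": [10, 0, 0, 0],
})
sizes = commit_sizes(sample)
too_big = sizes[sizes >= 500]   # commits that violate the first measure
ratio = daily_test_ratio(sample)
print(too_big)
print(ratio[ratio < 0.7])       # days that violate the second measure
```

Plotting `sizes` and `ratio` over the six months would then show whether the numbers improve after the agreement.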

Under the hood

Data source: version control system
Difficulty: medium

Problem Context

A customer management system for veterinary practices called PetClinic, written in Java, uses the JDBC (Java Database Connectivity) interface to access a database. However, the enterprise architects have now determined that in the future all database access should be done via an object-relational mapping using the interface of JPA (Java Persistence API). The development team must migrate the application from JDBC to JPA in addition to the normal feature development. The migration is taking a long time because many code passages have to be changed, and the work has been dragging on for a while. Meanwhile, confidence on the part of product management seems to be slowly waning.

Your Task

You would like to make the progress of the technology replacement transparent. Visualize the respective amounts of code for the old and new libraries over time to show the progress of the migration.

Additional Information

  • The team has stored the code for the two interfaces (JDBC and JPA, respectively) in different Java packages (= directories) with the respective interface name.
  • The source code is managed using the Git version control system.
  • The software project uses a Continuous Integration Server.

Starters

  • A (preprocessed) CSV file (~3 MB) with Git Log Numstat output, where each line records a change to one file, including the number of changed lines of code.
  • Dataset URL: datasets/db_api_refactoring.csv
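The migration progress can be approximated by the running net line count per technology. The columns of this file are not described above, so the names `timestamp`, `file`, `additions` and `deletions` in this sketch are assumptions and need to be adapted to the actual header. It also assumes that the package directories are literally named `jdbc` and `jpa`. The helper name `lines_over_time` and the sample paths are hypothetical.

```python
import pandas as pd


def lines_over_time(changes: pd.DataFrame) -> pd.DataFrame:
    """Cumulative net lines of code per technology and day."""
    changes = changes.copy()
    # Classify each file by the package directory in its path.
    changes["tech"] = changes["file"].str.extract(r"/(jdbc|jpa)/", expand=False)
    changes = changes.dropna(subset=["tech"])
    changes["net"] = changes["additions"] - changes["deletions"]
    changes["day"] = pd.to_datetime(changes["timestamp"]).dt.normalize()
    net = changes.pivot_table(index="day", columns="tech", values="net",
                              aggfunc="sum", fill_value=0)
    return net.sort_index().cumsum()


# Made-up sample rows; column names are assumptions, not the real header.
sample = pd.DataFrame({
    "timestamp": ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-03"],
    "file": ["src/main/java/petclinic/jdbc/A.java",
             "src/main/java/petclinic/jpa/B.java",
             "src/main/java/petclinic/jdbc/A.java",
             "src/main/java/petclinic/jpa/C.java"],
    "additions": [100, 40, 0, 50],
    "deletions": [0, 0, 60, 5],
})
progress = lines_over_time(sample)
print(progress)
```

A line or area plot of the resulting frame (for example with `progress.plot()`, which requires matplotlib) gives product management a picture of JDBC code shrinking while JPA code grows.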

Access Denied

Data sources: various kinds of static and dynamic data
Difficulty: challenging

Problem Context

The internal insurance system "InsurHappy" is a web application that makes heavy use of mainframe operations. However, sporadic authorization errors occur in the application when it is used. It is suspected that some users lack the individual access rights needed to execute certain COBOL routines. At the same time, mainframe administrators are very keen to ensure that users are not given too many access rights.

Your Task

A list of missing user permissions is needed to show which real users, with which user IDs, need which routines to work smoothly with InsurHappy. Create this list for the administrators!

Starters

What's Next?

Did you find these challenges interesting? I would be happy if you leave some feedback here as an issue.

I hope to see you at a conference or meetup to talk about it soon! You can see where I am here. I also organize Software Analytics workshops from time to time where we take a deeper look at the data-driven analysis of software systems. On my blog, I write about my analyses as well.

Markus Harrer
@feststelltaste
