import pandas as pd


# Data Science on <br/> Software Data
<b>Markus Harrer</b>, Software Development Analyst
  
`@feststelltaste`


<small>Visual Software Analytics Summer School, 18 September 2019</small>

<img src="../../demos/resources/innoq_logo.jpg" width=20% height="20%" align="right"/>

# About Me

## In the past
* Bachelor student
* Researcher
* Software developer
* Master student*
* Master's degree candidate*
* Application developer

**and househusband*

## Now

<img src="../../demos/resources/about_me.png" style="width:85%;" >

## My Motivation for Data Analysis in Software Development

### The current problem in the industry
<img src="../../demos/resources/kombar0_en.png" style="width:95%;" align="center"/>

### The current problem in the industry
<img src="../../demos/resources/kombar4_en.png" style="width:95%;" align="center"/>

## "Software Analytics" to the rescue?

### Definition Software Analytics
"Software Analytics is analytics on <b>software data</b> for **managers** and <b class="green">software engineers</b> with the aim of empowering software development individuals and teams to <i>gain and share insight from their data</i> to <b>make better decisions</b>."
<br/>
<div align="right"><small>Tim Menzies and Thomas Zimmermann</small></div>

### What kinds of Software Data do we have?

* static
* runtime
* chronological
* community

<b>=> a great variety!</b>

### My problem with classic Software Analytics

<img src="../../demos/resources/freq1_en.png" style="width:80%;" align="center"/>

### My problem with classic Software Analytics

<img src="../../demos/resources/freq2_en.png" style="width:80%;" align="center"/>

### My problem with classic Software Analytics

<img src="../../demos/resources/freq3_en.png" style="width:80%;" align="center"/>

### My problem with classic Software Analytics

<img src="../../demos/resources/freq4_en.png" style="width:80%;" align="center"/>

### My problem with classic Software Analytics

<img src="../../demos/resources/freq5_en.png" style="width:80%;" align="center"/>

### Some analysis tasks from practice

* Communicating negative performance implications of complex data models
* Spotting concurrency problems in custom-built frameworks
* Identifying performance bottlenecks across different software systems
* Making knowledge lost due to turnover visible
* Analyzing the health of an open source community

### "It depends" aka "context matters!"

<div align="center">
<img src="../../demos/resources/context.png" style="width:70%;" /></div>

<b>Individual systems == individual problems => individual analyses => individual insights!</b>

### Others see that problem, too

*Thomas Zimmermann in "One size does not fit all":*
<br/><br/>
<div style="font-size:70%;" align="center">
"The main lesson: There is no one size fits all model. Even if you find models that work for most, they will not work for everyone. There is much <strong>academic research</strong> into <strong>general models</strong>. In contrast, <b><span class="green">industrial practitioners</span></b> are often fine with <b><span class="green">models that just work for their data</span></b> if the model provides some insight or allows them to work more efficiently."<br/><br/></div>

But: "... the methods typically are applicable on different datasets." <b>=> we see what's possible!</b>

<br/><br/><div align="center"><h1><b><strong>Data Science</strong> on <b><span class="green">Software Data</span></b>:<br/><br/> A Lightweight Implementation of <b><span class="blue">Software Analytics</span></b></h1></div>

## Data Science

### What is Data Science?

"**Statistics** on a <b><span class="green">Mac</span></b>."
<br/>
<br/>
<div align="right"><small>https://twitter.com/cdixon/status/428914681911070720</small></div>

<b>Data Science Venn Diagram (Drew Conway)</b>

<img src="../../demos/resources/venn_diagram.png" style="width:50%;" >

### My Definition

#### What does "**data**" mean for me?
"Without **data** you're just another person with an opinion."
<br/>
<div align="right"><small>W. Edwards Deming</small></div>

<b>=> Delivering credible insights based on <span class="green">facts</span>.</b>

#### What does "**science**" mean for me?
  
  
"The aim of **science** is to seek the simplest explanations of complex facts."
<br/>
<div align="right"><small>Albert Einstein</small></div>

<b>=> Working out insights in a <span class="green">comprehensible</span> way.</b>

## Why Data Science at all?

### High demand in data analytics

<img src="../../demos/resources/data_scientist_sexy.png" style="width: 80%;"/>

### Young job positions are paid well...
*Data from Stack Overflow Developer Survey 2019*
<img src="../../demos/resources/stackoverflow_salary_devtype-1.svg" style="width: 65%;"/>

### ... but also demanding?
*Data from Stack Overflow Developer Survey 2019*

<b>"Who's Actively Looking for a Job?" (Top 5)</b>
<img src="../../demos/resources/stackoverflow_on_job_search.png" style="width: 100%;"/>

### Big and supportive community
* Free online courses, videos and tutorials (e.g. DataCamp with > 4.6M members)
* Online communities whose members help each other (e.g. Stack Overflow)
* Online competitions to improve your own skills (e.g. Kaggle)

### Free and easy-to-use tools!
_"R is for statisticians who want to program, Python is for developers who want to do statistics."_

<img src="../../demos/resources/r_vs_python_pandas.png" style="width: 57%;"/>

### Data Science popularity is still growing!

"100" == max. popularity!

# How far away are <b><span class="green">Software Engineers</span></b> from <strong>Data Science</strong>?

### What is a Data Scientist?
"A data scientist is someone who<br/>
&nbsp;&nbsp;is better at **statistics**<br/>
&nbsp;&nbsp;than any <b><span class="green">software engineer</span></b><br/>
&nbsp;&nbsp;and better at <b><span class="green">software engineering</span></b><br/>
&nbsp;&nbsp;than any **statistician**."
<br/>
<br/>
<div align="right"><small>From https://twitter.com/cdixon/status/428914681911070720</small></div>

<b>Not as far away as you may have thought!</b>

# How to Get Started?

## Reuse a Proven Approach (~ scientific method)
<small>Roger Peng's "Stages of Data Analysis"</small><br/>
I. Stating Question  
II. Exploratory Data Analysis  
III. Formal Modeling  
IV. Interpretation  
V. Communication  
  



<b>=> from a <strong>question</strong> via <span class="green">data</span> to <span class="blue" style="background-color: #FFFF00">insights</span>!</b>

## Be Aware of the "Seven principles
...of inductive software engineering" (Tim Menzies)
1. Humans Before Algorithms
1. Plan for Scale
1. Get Early Feedback
1. Be Open Minded
1. Be Smart with Your Learning
1. Live with the Data You Have
1. Develop a Broad Skill Set That Uses a Big Toolkit

## Use Literate Statistical Programming

`(Intent + Code + Data + Results)`<br />
`* Logical Step`<br />
`+ Automation`<br />
`= Literate Statistical Programming`

Approach: **Computational notebooks**

### Computational Notebook Example
<br/>
  

<div align="center"><img src="../../demos/resources/notebook_approach.jpg"></div>

## Use Standard Data Science Tools

### One of the more popular tech stacks

* Jupyter Notebook
* Python 3
* pandas
* matplotlib

### Jupyter Notebook

**Interactive Notebook**
* Document-based analyses
* Executable Code
* Displaying results immediately
* Everything in one place
* Every step to the solution visible

<b><span class="green">=> Working out results in a comprehensible way!</span></b>



### Python 3

**Best programming language for Data Science!**
* Easy
* Effective
* Fast
* Fun
* Automation

<b><span class="green">=> Data Analysis becomes repeatable</span></b>

### pandas

**Pragmatic data analysis framework**
* Tabular data structures ("programmable Excel sheet")
* Really fast
* Flexible 
* Expressive

<b><span class="green">=> Good integration point for your data sources!</span></b>

### matplotlib

**Programmable visualization library**

* Programmatic creation of graphics
* Plots line charts, bar charts, pie charts and much more
* Integrated into pandas

<b><span class="green">=> Direct visualization of results in Jupyter Notebooks</span></b>

### The Python ecosystem
<br/>
<div class="row">
  <div class="column">
    <b>Data Analysis</b>
    <ul>
      <li>NumPy</li>
      <li>scikit-learn</li>
      <li>TensorFlow</li>
      <li>SciPy</li>
      <li>PySpark</li>
      <li>py2neo</li>
    </ul>
  </div>
  <div class="column">
    <b>Visualization and more</b>
    <ul>
      <li>pygal</li>
      <li>Bokeh</li>
      <li>python-pptx</li>
      <li>RISE</li>
      <li>Requests, xmldataset, Selenium, Flask...</li>
    </ul>
  </div>
</div> 

<b><span class="green">=> Provides the flexibility that is needed in specific situations</span></b> 


### Other Technologies
**Jupyter Notebook** also works with other technology platforms, e.g.
* jQAssistant software scanner / Neo4j graph database
* JVM-based languages via beakerx / Tablesaw
* bash

<b><span class="green">=> If you want to use special technology, you can!</span></b>


### Anaconda 3

**Data Science Python Distribution**

* Free all-inclusive package
* Brings everything you need to get started
* Optimized for running fast on your operating system

<b><span class="green">=> Download, install, ready, go!</span></b>

### My Recommendations for an easy start

#### My TOP 5's*

https://www.feststelltaste.de/category/top5/
    
Courses, videos, blogs, books and more...

<small>**some pages are still under development*</small>

### My Book Recommendations
* Adam Tornhill: Software Design X-Ray 
* Wes McKinney: Python For Data Analysis
* Jeff Leek: The Elements of Data Analytic Style
* Tim Menzies, Laurie Williams, Thomas Zimmermann: Perspectives on Data Science for Software Engineering

# Hands-On

## Programming Demo

### Case Study

#### IntelliJ IDEA

* IDE for Java developers
* Almost entirely written in Java
* Big and long-living project

### I. Stating Question (1/3)

* Write down your question explicitly
* Explain analysis idea comprehensibly


### I. Stating Question (2/3)

<b>Question</b>
* Which code is complex and has changed often lately?


### I. Stating Question (3/3)
#### Implementation Idea
* Tools: Jupyter, Python, pandas, matplotlib
* Heuristics:
  * "complex": many lines of code
  * "change often": number of Git commits
  * "lately": last 30 days


**Meta goal:** Get to know the basic mechanics of the stack.

### II. Exploratory Data Analysis
* Load and explore possible data sets
* Clean up and filter the raw data

*We load a Git log dataset extracted from a Git repository.*
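
A minimal sketch of this loading step. The original cell is not part of this export, so the CSV content, the column names and the variable `log` are assumptions made for illustration only.

```python
import io

import pandas as pd

# Hypothetical stand-in for a Git log export; the real file and its
# column names may differ.
GIT_LOG_CSV = io.StringIO(
    "additions,deletions,filename,sha,timestamp,author\n"
    "4,0,src/Main.java,be6bd3b,2019-07-05 09:01:27,Alice\n"
    "10,2,src/util/Helper.java,be6bd3b,2019-07-05 09:01:27,Alice\n"
    "1,1,README.md,a1c9f20,2019-06-01 14:22:10,Bob\n"
)

# read_csv turns the text into a DataFrame with one row per changed file.
log = pd.read_csv(GIT_LOG_CSV)
print(log.head())
```

With a real export, the `io.StringIO` object would be replaced by the path to the file.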

*We explore some basic key elements of the dataset.*

<b>1</b> **DataFrame** (~ programmable Excel worksheet), <b>6</b> **Series** (= columns), <b>1128819</b> **rows** (= entries)
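
Numbers like these can be read off with `info()` and `shape`. The sketch below uses a tiny hypothetical DataFrame (assumed column names, three rows) instead of the real dataset.

```python
import pandas as pd

# Tiny hypothetical Git log; the column names are assumptions.
log = pd.DataFrame({
    "additions": [4, 10, 1],
    "deletions": [0, 2, 1],
    "filename": ["src/Main.java", "src/util/Helper.java", "README.md"],
    "sha": ["be6bd3b", "be6bd3b", "a1c9f20"],
    "timestamp": ["2019-07-05 09:01:27", "2019-07-05 09:01:27", "2019-06-01 14:22:10"],
    "author": ["Alice", "Alice", "Bob"],
})

log.info()                     # column names, dtypes and non-null counts
n_rows, n_columns = log.shape  # shape is (rows, columns)
print(f"{n_columns} Series, {n_rows} rows")
```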

*We convert the text with a time to a real timestamp object.*
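
A possible way to do the conversion with `pd.to_datetime`. The column name `timestamp` and the sample values are assumptions.

```python
import pandas as pd

# Hypothetical sample; timestamps arrive as plain text.
log = pd.DataFrame({
    "filename": ["src/Main.java", "README.md"],
    "timestamp": ["2019-07-05 09:01:27", "2019-06-01 14:22:10"],
})

# Parse the text into real datetime values so that date arithmetic works.
log["timestamp"] = pd.to_datetime(log["timestamp"])
print(log.dtypes)
```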

*We filter out older changes.*
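
A sketch of the "last 30 days" filter. It assumes the newest commit in the data as the reference date, which keeps the result reproducible; the original analysis may have used a different reference, such as the current date.

```python
import pandas as pd

# Hypothetical sample with already converted timestamps.
log = pd.DataFrame({
    "filename": ["A.java", "B.java", "A.java"],
    "timestamp": pd.to_datetime(["2019-09-10", "2019-05-01", "2019-08-25"]),
})

# Assumption: "lately" is measured from the newest change in the dataset.
reference = log["timestamp"].max()
recent = log[log["timestamp"] > reference - pd.Timedelta(days=30)]
print(recent)
```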

*We keep just code written in Java.*
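
One simple way to keep only Java code is to filter on the file extension. The file names below are made up.

```python
import pandas as pd

# Hypothetical file names from a change log.
log = pd.DataFrame({
    "filename": ["src/Main.java", "README.md", "build.gradle", "src/util/Helper.java"],
})

# Keep only rows whose file name ends with ".java".
java = log[log["filename"].str.endswith(".java")]
print(java)
```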

### III. Formal Modeling

* Create new perspective on the data
* Join data with other datasets


*We aggregate the rows by counting the number of changes per file.*
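
The aggregation could look like this sketch, where each row is one change to one file. The column names and the result name `changes` are assumptions.

```python
import pandas as pd

# Hypothetical change log: one row per changed file per commit.
java = pd.DataFrame({
    "filename": ["src/Main.java", "src/util/Helper.java", "src/Main.java"],
    "sha": ["be6bd3b", "be6bd3b", "c07d112"],
})

# Count the rows per file; the result is indexed by file name.
changes = java.groupby("filename").size().to_frame("changes")
print(changes)
```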

*We add additional information about the number of lines of all currently existing files...*

*...and join this data with the existing dataset.*
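
A sketch of the join, assuming both datasets are indexed by file name. The `lines` values and the name `lines_of_code` are hypothetical. The inner join is a design choice made here: it drops files that were changed recently but no longer exist.

```python
import pandas as pd

# Hypothetical change counts per file (for example from the aggregation step).
changes = pd.DataFrame(
    {"changes": [2, 1, 5]},
    index=pd.Index(["src/Main.java", "src/util/Helper.java", "src/Old.java"], name="filename"),
)

# Hypothetical line counts of the files that currently exist.
lines_of_code = pd.DataFrame(
    {"lines": [120, 45]},
    index=pd.Index(["src/Main.java", "src/util/Helper.java"], name="filename"),
)

# Join on the shared index; "inner" keeps only files present in both datasets.
hotspots = changes.join(lines_of_code, how="inner")
print(hotspots)
```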

### IV. Interpretation
* Work out the essence of the analysis
* Make the central message / new insight clear

*We show only the TOP 10 hotspots in the code.*
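
A sketch of the TOP 10 selection on generated data. Ranking by number of changes is an assumption; ranking by lines, or by a combination of both, would work the same way.

```python
import pandas as pd

# Hypothetical hotspot table with 12 files.
hotspots = pd.DataFrame(
    {"changes": range(1, 13), "lines": range(100, 1300, 100)},
    index=[f"File{i}.java" for i in range(1, 13)],
)

# Assumption: rank by number of changes and keep the first ten entries.
top10 = hotspots.sort_values(by="changes", ascending=False).head(10)
print(top10)
```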

### V. Communication
* Transform insights into a comprehensible visualization
* Communicate the next steps after the analysis

*We plot the TOP 10 list as an XY diagram.*
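
A sketch of such a plot using the matplotlib integration in pandas. File names and values are made up, and the axis assignment (changes on x, lines on y) is an assumption.

```python
import pandas as pd

# Hypothetical shortened hotspot list.
top10 = pd.DataFrame(
    {"changes": [12, 9, 7], "lines": [2400, 310, 1800]},
    index=["Editor.java", "Parser.java", "Indexer.java"],
)

# Scatter plot: number of changes on the x axis, lines of code on the y axis.
ax = top10.plot.scatter(x="changes", y="lines", title="Hotspots (hypothetical data)")

# Label every point with its file name.
for name, row in top10.iterrows():
    ax.annotate(name, (row["changes"], row["lines"]))
```

In a Jupyter notebook the figure is displayed directly below the cell.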

### End of Demo

## Further Analysis

* Analysis of performance bottlenecks with data from `vmstat`
* Identifying Modularization Options based on Code Changes
* Dependency Analysis with data from `jdeps` and visualization with `D3`

## Summary
**1.** <b>Software Analytics</b> with Data Science is possible!<br/>
**2.** If you need to go into deeper analysis: <b>you can</b>!  
**3.** There are many <b>data sources</b> in software development. _What are you waiting for?_


<b>=> from a <strong>question</strong> via <span class="green">data</span> to <span class="blue" style="background-color: #FFFF00">insights</span>!</b>

# Thanks! Questions?

<b>Markus Harrer</b><br/>
innoQ Deutschland GmbH
  
markus.harrer@innoq.com

`@feststelltaste`