# Automatically detecting security-relevant system weaknesses
<div style="text-align: right">-- handed in by Felix Wolff | 765508</div>

*Code Repository Mining* seminar at Hasso-Plattner-Institute, winter term 2017/2018

This document explains the reasoning behind the technical solution implemented for the topic *Effects of high-profile incidents on code*. It first covers the application features, the data analysis and the logic leading to several implementation decisions, as well as the data structure inside the database. Finally, it explains why the topic was adapted.

The code for this project can be found [here](https://github.com/flxw/code-repository-mining).
The intermediate presentation slides can be found [here](https://github.com/flxw/code-repository-mining/blob/master/docs/CRM%20Intermediate%20Presentation%20Felix%20Wolff.pdf).
The final presentation slides can be found here:

In [9]:
import requests
import time
import re
import psycopg2
import os

import plotly.plotly as py
import plotly.graph_objs as go

import cufflinks as cf
import pandas as pd

from scipy import stats

PLOTLY_UN    = os.environ.get("PLOTLY_UN")
PLOTLY_TOKEN = os.environ.get("PLOTLY_TOKEN")

POSTGRES_DB_NAME = os.environ.get("POSTGRES_DB_NAME")
POSTGRES_DB_UN   = os.environ.get("POSTGRES_DB_UN")
POSTGRES_DB_PW   = os.environ.get("POSTGRES_DB_PW")
POSTGRES_DB_HOST = os.environ.get("POSTGRES_DB_HOST")
# SQLAlchemy-style URL for the %sql magic
connect_to_db = 'postgresql+psycopg2://' + \
                POSTGRES_DB_UN + ':' + POSTGRES_DB_PW + '@' + \
                POSTGRES_DB_HOST + '/' + POSTGRES_DB_NAME

%load_ext sql
%config echo=False
%sql $connect_to_db
# Direct connection for pandas; pass the host so both connections target the same server
connection = psycopg2.connect(dbname=POSTGRES_DB_NAME, user=POSTGRES_DB_UN,
                              password=POSTGRES_DB_PW, host=POSTGRES_DB_HOST)
cursor     = connection.cursor()




# The technical solution

Upon running the script `client/checksystem.py`, all packages installed via the distribution-default package manager are checked for known weaknesses. For every weakness found, a block of information is printed as follows:
```
CVE-2013-0166 released on Friday 08. February 2013
OpenSSL before 0.9.8y, 1.0.0 before 1.0.0k, and 1.0.1 before 1.0.1d does not properly perform signature verification for OCSP responses, which allows remote OCSP servers to cause a denial of service (NULL pointer dereference and application crash) via an invalid key.
Official NIST entry: https://nvd.nist.gov/vuln/detail/CVE-2013-0166
Recommended information source (16.4% of total references for this CWE): http://www.kb.cert.org/vuls/id/737740
A knowledgeable Twitter and Github user might be: https://github.com/delphij - as 91.5% of his posts are on this kind of CWE
```

Line by line, this reveals the following information:
1. The CVE ID (for a refresher, see [Wikipedia](https://en.wikipedia.org/wiki/Common_Vulnerabilities_and_Exposures)) and its publication date
2. A brief description of the vulnerability, taken from the National Institute of Standards and Technology (NIST)
3. The official NIST database link
4. An information source that is likely to help the user, together with its share of all references for this CWE
5. A person who is active on both GitHub and Twitter in the domain of cybersecurity and who might be of assistance. Tweets are grouped by [CWE](https://cwe.mitre.org/about/).
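
The five-line layout described above could be assembled roughly as follows. This is a minimal sketch, not the code of `client/checksystem.py`: the `Finding` class, its field names and `format_finding` are assumptions made for illustration, and the date format relies on the default (English) locale.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Finding:
    """One report entry; all field names are illustrative assumptions."""
    cve_id: str
    published: date
    summary: str
    source_url: str
    source_share: float   # fraction of all references for this CWE
    expert: str           # handle used on both GitHub and Twitter
    expert_share: float   # fraction of the expert's tweets on this CWE


def format_finding(finding: Finding) -> str:
    """Render a finding in the five-line layout of the sample output."""
    published = finding.published.strftime('%A %d. %B %Y')
    return "\n".join([
        f"{finding.cve_id} released on {published}",
        finding.summary,
        f"Official NIST entry: https://nvd.nist.gov/vuln/detail/{finding.cve_id}",
        f"Recommended information source ({finding.source_share:.1%} of total "
        f"references for this CWE): {finding.source_url}",
        f"A knowledgeable Twitter and Github user might be: "
        f"https://github.com/{finding.expert} - as {finding.expert_share:.1%} "
        f"of his posts are on this kind of CWE",
    ])
```

Calling `format_finding` with the values from the sample block (CVE-2013-0166, published on 8 February 2013, shares 0.164 and 0.915) reproduces its first and last lines.
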

This application is considerably more convenient than the [complicated search form at NIST](https://nvd.nist.gov/vuln/search), because it checks every installed package without manual queries. It also follows the trend towards automatic vulnerability detection systems, as evidenced by [JFrog's Xray](https://jfrog.com/xray/) and GitHub's [recent addition to its data services](https://github.com/blog/2470-introducing-security-alerts-on-github).

After an explanation of the dataset, the logic behind the information items 4 and 5 shall be described in detail.

# Data origins

Three datasets from different sources were combined inside a PostgreSQL database to form the foundation for the application and analysis presented in this document:

1. A complete [GHTorrent](http://ghtorrent.org/) dump
2. Tweets referring to CVE IDs that were also referred to by commits from the above source. To collect them, [TweetScraper](https://github.com/flxw/tweetscraper) was forked and extended so that it can save to PostgreSQL.
3. Relevant data extracted via ETL from the [cve-search](https://github.com/cve-search/cve-search) project.

The tables and their origins are listed below (views in *italics*):

| ghtorrent | Twitter | cve-search |
| --- | --- | --- |
| commits | cve_referring_tweets | cwe |
| *view_commits_search_for_cve* | *view_cve_referring_tweets_extracted_domains* | cve_per_product_version |
| | *view_cve_referring_tweets_extracted_cves* | cve_cwe_classification |
| | | cvereference |
| | | *view_cvereference_extracted_domains* |
| | | cve |
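
The `*_extracted_cves` and `*_extracted_domains` views suggest that CVE IDs and link domains are pulled out of free text. How the views do this is not shown here, so the following is only a Python sketch of the idea. The function names and both regular expressions are assumptions; the CVE pattern follows the official ID syntax (`CVE`, a four-digit year, then at least four digits).

```python
import re
from urllib.parse import urlparse

# CVE IDs: "CVE-", a four-digit year, then at least four digits.
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)
# Deliberately simple link matcher: everything up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_cves(text):
    """Return the distinct CVE IDs mentioned in text, upper-cased and sorted."""
    return sorted({match.upper() for match in CVE_PATTERN.findall(text)})


def extract_domains(text):
    """Return the distinct host names of all http(s) links in text, sorted."""
    return sorted({urlparse(url).netloc.lower() for url in URL_PATTERN.findall(text)})
```

Applied to a tweet that mentions `cve-2013-0166` and links to `https://nvd.nist.gov/vuln/detail/CVE-2013-0166`, the two functions return `['CVE-2013-0166']` and `['nvd.nist.gov']`.
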

# User recommendations

To recommend a person from the plethora of Twitter and GitHub users who might be an expert on a software error, two criteria were introduced:
1. The user uses the same name on both GitHub and Twitter.
2. The user has tweeted about the same CWE as the CVE in question, i.e. the user knows this type of vulnerability.

An important assumption here is that identical usernames belong to the same person. In a sampling test, this held true for the following users:
* zisk0 - [Twitter](https://twitter.com/zisk0) - [GitHub](https://github.com/zisk0)
* nahi - [Twitter](https://twitter.com/nahi) - [GitHub](https://github.com/nahi)
* fdiskyou - [Twitter](https://twitter.com/fdiskyou) - [GitHub](https://github.com/fdiskyou)
* citypw - [Twitter](https://twitter.com/citypw) - [GitHub](https://github.com/citypw)
* breenmachine - [Twitter](https://twitter.com/breenmachine) - [GitHub](https://github.com/breenmachine)
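
The two criteria and the identical-username assumption can be sketched in a few lines. This is a hedged illustration rather than the project's actual query: the function name, the input shapes and the ranking by a user's own share of tweets on the CWE (which mirrors the wording "91.5% of his posts" in the sample output) are assumptions.

```python
from collections import Counter


def recommend_expert(tweets, github_names, cwe_id):
    """Pick the user with the largest share of own tweets on cwe_id.

    tweets: iterable of (username, cwe_id) pairs, one per tweet.
    github_names: set of GitHub login names.
    Returns (username, share) or None if nobody qualifies.
    """
    total = Counter()    # tweets per user
    on_cwe = Counter()   # tweets per user on the requested CWE
    for username, tweet_cwe in tweets:
        if username not in github_names:
            continue     # criterion 1: same handle on both platforms
        total[username] += 1
        if tweet_cwe == cwe_id:
            on_cwe[username] += 1   # criterion 2: tweeted about this CWE
    if not on_cwe:
        return None
    # Rank by share of own tweets; break ties by absolute tweet count.
    best = max(on_cwe, key=lambda user: (on_cwe[user] / total[user], on_cwe[user]))
    return best, on_cwe[best] / total[best]
```

Ranking by share alone favours users with very few tweets, so a minimum tweet count may be a sensible additional filter.
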

As the following graph shows, the number of GitHub users who tweet about CVEs grows from year to year, and within each year the tweets are spread fairly evenly across these users. This trend should be read against the growing number of GitHub users overall.

In [31]:
query = """
SELECT
    DISTINCT t.username,
    extract(year from t.timestamp) AS t_year,
    COUNT(t.id) OVER (PARTITION BY t.username, extract(year from t.timestamp)) AS t_user_count,
    COUNT(t.id) OVER (PARTITION BY extract(year from t.timestamp)) AS t_year_count
FROM cve_referring_tweets t
JOIN view_commit_data_search_for_cve vc ON vc.name = t.username
ORDER BY t.username, t_year"""

df = pd.read_sql_query(query, connection)

lyt = go.Layout(
    title='Same Github & Twitter handles over time and share-of-year-volume',
    font=dict(family='Open Sans, monospace', size=12, color='#888888'),
    autosize=False,
    height=800,
    margin=go.Margin(
      l=175
    ),
    xaxis=dict(title='Year'),
    yaxis=dict(title='Usernames')
)

data = [
    {
        'x': df.t_year,
        'y': df.username,
        'mode': 'markers',
        'marker': {
            'color': df.t_user_count / df.t_year_count,
            'size': 10,
            'showscale': True,
            "colorscale": [ [0,"rgb(40,171,226)"], [1,"rgb(247,146,58)"] ]
        }

    }
]

fig = go.Figure(data = data, layout = lyt)
py.iplot(fig, filename='same-userhandles-time-volume-bubble-chart')

Not only does the number of tweets increase every year, but some users also appear to be knowledgeable in certain areas. This becomes apparent when plotting each user's share of the total number of tweets for a given CWE against CWE IDs and usernames. The graph below shows the users who have contributed more than 10% of the total number of tweets for a CWE:

In [33]:
query = """
SELECT
    DISTINCT t.username,
    ccc.cweid,
    COUNT(t.id) OVER (PARTITION BY t.username, ccc.cweid) AS t_cwe_count,
    COUNT(t.id) OVER (PARTITION BY ccc.cweid) AS t_count
FROM cve_referring_tweets t
JOIN view_commit_data_search_for_cve vc ON vc.name = t.username
JOIN view_cve_referring_tweets_extracted_cves ec ON t.id = ec.tweet_id
JOIN cve_cwe_classification ccc ON ec.cve = ccc.cveid"""

df = pd.read_sql_query(query, connection)

lyt = go.Layout(
    title='Same Github & Twitter handles per CWE with greater-than-10%-share-of-cwe-volume',
    font=dict(family='Open Sans, monospace', size=12, color='#888888'),
    autosize=False,
    height=600,
    margin=go.Margin(
      l=175
    ),
    xaxis=dict(title='CWE IDs'),
    yaxis=dict(title='Usernames')
)

ratio = df.t_cwe_count / df.t_count

data = [
    {
        'x': df.cweid[ratio > 0.1],
        'y': df.username[ratio > 0.1],
        'mode': 'markers',
        'marker': {
            'color': ratio[ratio > 0.1],  # same mask as x and y so colours stay aligned
            'size': 10,
            'showscale': True,
            "colorscale": [ [0,"rgb(40,171,226)"], [1,"rgb(247,146,58)"] ]
        }
    }
]

fig = go.Figure(data = data, layout = lyt)
py.iplot(fig, filename='same-userhandles-cwe-volume-bubble-chart')
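
The window-function counts used in the query can be checked offline on a toy frame. The sketch below is an assumption-laden stand-in for the database query: usernames and CWE IDs are made up, and a threshold of 0.5 replaces the 10% used in the notebook so that the tiny sample yields a visible selection.

```python
import pandas as pd

# Toy data: one row per tweet (usernames and CWE IDs are made up).
tweets = pd.DataFrame({
    "username": ["alice", "alice", "bob", "bob", "carol", "alice"],
    "cweid":    ["CWE-20", "CWE-20", "CWE-20", "CWE-79", "CWE-79", "CWE-79"],
})

# COUNT(t.id) OVER (PARTITION BY t.username, ccc.cweid)
tweets["t_cwe_count"] = tweets.groupby(["username", "cweid"])["username"].transform("count")
# COUNT(t.id) OVER (PARTITION BY ccc.cweid)
tweets["t_count"] = tweets.groupby("cweid")["username"].transform("count")

# SELECT DISTINCT collapses the identical rows per (username, cweid).
shares = tweets.drop_duplicates().reset_index(drop=True)
shares["ratio"] = shares.t_cwe_count / shares.t_count

# Filter the whole frame with one mask so x, y and colour stay aligned.
selected = shares[shares.ratio > 0.5]
```

Filtering the frame once and reading `selected.cweid`, `selected.username` and `selected.ratio` from the result avoids passing differently sized series to the plot.
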

# Source recommendation

## Twitter references

## NIST references

# Data sources

# Data structure and size

# Future work

# Justification of topic adaption