# Experimenting with Python Clang Bindings

The goal of this notebook is to load the Python clang bindings and the Juliet dataset, then try to get an AST out of some of Juliet's code snippets and see whether we can make sense of any of it.

## Setup
You need clang and LLVM installed.

Then you need to make sure the clang Python bindings are on your Python path. In practice: run the next cell, and if it fails:
  1. Download the clang source code from this location: http://releases.llvm.org/download.html
     - Note: Download the version that matches your installed clang. Check it by running `clang --version` in your shell, then download the clang source code for the version it outputs (from the page linked above). I had to download 7.0.1.
  2. Extract this source to a known location. I chose "/home/dan/masters-cyber-security/project/clang-src/".
  3. Open `~/.bashrc` in a text editor, and at the end add the following line:
 `export PYTHONPATH=/home/dan/masters-cyber-security/project/clang-src/:$PYTHONPATH`
     - Note: The directory on `PYTHONPATH` must be the one that directly contains the `clang` package. In a clang source tree that is normally `bindings/python`.
  4. Run `source ~/.bashrc` in your shell.
  5. Restart Jupyter Notebook and all Python sessions.

Hopefully it'll work then.
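
If editing `~/.bashrc` is inconvenient, roughly the same effect can be had from inside Python by extending `sys.path` before the import. This is only a sketch: the helper name `ensure_importable` is made up for illustration, and the directory you pass must be the one that directly contains the `clang` package.

```python
import importlib.util
import sys


def ensure_importable(package, extra_dir):
    """Return True if `package` can be located, adding `extra_dir` to sys.path if needed.

    `ensure_importable` is a hypothetical helper, not part of the clang bindings.
    """
    # Already importable: nothing to do.
    if importlib.util.find_spec(package) is not None:
        return True
    # Otherwise put the candidate directory at the front of the search path.
    if extra_dir not in sys.path:
        sys.path.insert(0, extra_dir)
    # Make sure the import system notices directories created or added just now.
    importlib.invalidate_caches()
    return importlib.util.find_spec(package) is not None
```

In this notebook the call would look like `ensure_importable("clang", "/path/to/the/directory/containing/clang")`, run before `import clang.cindex`.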

In [None]:
import clang.cindex

In [2]:
import os
import pandas as pd

In [None]:
# Only needed if libclang is not found automatically; adjust the path to match your installation.
clang.cindex.Config.set_library_file('/lib/x86_64-linux-gnu/libclang-8.so.1')
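
The library path above is specific to one machine. A rough way to find a suitable value for `set_library_file` elsewhere is to glob a few likely directories. The helper name `find_libclang_candidates` and the directory list are assumptions for a Debian/Ubuntu-style layout, and the pattern is deliberately loose, so it may also match related libraries.

```python
import glob
import os


def find_libclang_candidates(search_dirs):
    """Return sorted, de-duplicated paths that look like libclang shared libraries."""
    matches = []
    for directory in search_dirs:
        # Loose pattern: matches e.g. libclang.so, libclang-8.so.1, libclang.so.7
        matches.extend(glob.glob(os.path.join(directory, "libclang*.so*")))
    return sorted(set(matches))


# Guessed directories (assumption); missing directories simply yield no matches.
candidates = find_libclang_candidates(
    ["/lib/x86_64-linux-gnu", "/usr/lib/x86_64-linux-gnu", "/usr/lib"]
)
print(candidates)
```

Pick the entry whose version matches `clang --version` and pass it to `clang.cindex.Config.set_library_file` before creating an index.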

Load the Juliet dataset and pick the first data point as an example.

In [6]:
juliet = pd.read_csv("../data/juliet.csv.zip")

In [10]:
example = juliet.iloc[0]
example

Unnamed: 0                                                     0
testcase_ID                                                61940
filename       000/061/940/CWE114_Process_Control__w32_char_c...
code           /* TEMPLATE GENERATED TESTCASE FILE\nFilename:...
flaw                                                     CWE-114
flaw_loc                                                     121
CWE-015                                                    False
CWE-023                                                    False
CWE-036                                                    False
CWE-078                                                    False
CWE-090                                                    False
CWE-114                                                     True
CWE-121                                                    False
CWE-122                                                    False
CWE-123                                                    False
CWE-124                  

In [27]:
print(example.code)

/* TEMPLATE GENERATED TESTCASE FILE
Filename: CWE114_Process_Control__w32_char_connect_socket_01.c
Label Definition File: CWE114_Process_Control__w32.label.xml
Template File: sources-sink-01.tmpl.c
*/
/*
 * @description
 * CWE: 114 Process Control
 * BadSource: connect_socket Read data using a connect socket (client side)
 * GoodSource: Hard code the full pathname to the library
 * Sink:
 *    BadSink : Load a dynamic link library
 * Flow Variant: 01 Baseline
 *
 * */

#include "std_testcase.h"

#include <wchar.h>

#ifdef _WIN32
#include <winsock2.h>
#include <windows.h>
#include <direct.h>
#pragma comment(lib, "ws2_32") /* include ws2_32.lib when linking */
#define CLOSE_SOCKET closesocket
#else /* NOT _WIN32 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#define INVALID_SOCKET -1
#define SOCKET_ERROR -1
#define CLOSE_SOCKET close
#define SOCKET int
#endif

#define TCP_PORT 27015
#define IP_ADDRESS "127.0.0.1"


#if

Instantiate the clang parser and give it our example. We use `unsaved_files` to tell it to parse a file that doesn't actually exist on disk.

In [48]:
index = clang.cindex.Index.create()
translation_unit = index.parse(path=example.filename, unsaved_files=[(example.filename, example.code)])
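
Parsing succeeds even when clang hits errors (for example a header it cannot find), so it is worth looking at the translation unit's diagnostics before trusting the AST. The sketch below only relies on the `severity`, `spelling` and `location` attributes that `clang.cindex.Diagnostic` objects expose; the helper names are made up for illustration.

```python
from collections import Counter

# Severity levels as defined on clang.cindex.Diagnostic (Ignored=0 ... Fatal=4).
SEVERITY_NAMES = {0: "ignored", 1: "note", 2: "warning", 3: "error", 4: "fatal"}


def summarize_diagnostics(diagnostics):
    """Count diagnostics per severity name (hypothetical helper)."""
    counts = Counter()
    for diag in diagnostics:
        counts[SEVERITY_NAMES.get(diag.severity, "unknown")] += 1
    return dict(counts)


def format_diagnostic(diag):
    """Render one diagnostic as 'line:col: message' (hypothetical helper)."""
    return "%s:%s: %s" % (diag.location.line, diag.location.column, diag.spelling)
```

In this notebook that would be `summarize_diagnostics(translation_unit.diagnostics)`. If errors about missing headers such as `std_testcase.h` show up, `index.parse` also accepts an `args` list of compiler flags, so an `-I` flag pointing at the directory that holds Juliet's support headers may help (that location is not given here).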

In [49]:
translation_unit

<clang.cindex.TranslationUnit at 0x7f8977d09278>

`root` is the root node of the AST. Try to explore it and figure out what it all means! It's pretty dense, ha.

In [50]:
root = translation_unit.cursor

In [33]:
children = list(root.get_children())

In [41]:
children[0].kind

CursorKind.TYPEDEF_DECL

In [57]:
# took this from a lovely tutorial on chess.com: 
#     https://www.chess.com/blog/lockijazz/using-python-to-traverse-and-modify-clang-s-ast-tree
# needed a minor update: change node.type to node.kind

function_calls = []             # List of AST node objects that are function calls
function_declarations = []      # List of AST node objects that are function declarations

def traverse(node):
    # Recurse for children of this node
    for child in node.get_children():
        traverse(child)

    # Add the node to function_calls
    if node.kind == clang.cindex.CursorKind.CALL_EXPR:
        function_calls.append(node)

    # Add the node to function_declarations
    if node.kind == clang.cindex.CursorKind.FUNCTION_DECL:
        function_declarations.append(node)

    # Print out information about the node
    print('Found %s [line=%s, col=%s]' % (node.displayname, node.location.line, node.location.column))

In [58]:
traverse(root)

Found _Float32 [line=214, col=15]
Found _Float64 [line=251, col=16]
Found _Float32x [line=268, col=16]
Found _Float64x [line=285, col=21]
Found tm [line=83, col=8]
Found  [line=89, col=14]
Found  [line=89, col=6]
Found wcscpy(int *restrict, const int *restrict) [line=87, col=17]
Found  [line=94, col=14]
Found  [line=94, col=6]
Found wcsncpy(int *restrict, const int *restrict, int) [line=92, col=17]
Found  [line=99, col=14]
Found  [line=99, col=6]
Found wcscat(int *restrict, const int *restrict) [line=97, col=17]
Found  [line=103, col=14]
Found  [line=103, col=6]
Found wcsncat(int *restrict, const int *restrict, int) [line=101, col=17]
Found  [line=107, col=33]
Found  [line=107, col=14]
Found  [line=107, col=6]
Found __s1 [line=106, col=35]
Found __s2 [line=106, col=56]
Found wcscmp(const int *, const int *) [line=106, col=12]
Found  [line=110, col=33]
Found  [line=110, col=14]
Found  [line=110, col=6]
Found __s1 [line=109, col=36]
Found __s2 [line=109, col=57]
Found __n [line=109, col=

In [70]:
[decl.displayname for decl in function_declarations]

['wcscpy(int *restrict, const int *restrict)',
 'wcsncpy(int *restrict, const int *restrict, int)',
 'wcscat(int *restrict, const int *restrict)',
 'wcsncat(int *restrict, const int *restrict, int)',
 'wcscmp(const int *, const int *)',
 'wcsncmp(const int *, const int *, int)',
 'wcscasecmp(const int *, const int *)',
 'wcsncasecmp(const int *, const int *, int)',
 'wcscasecmp_l(const int *, const int *, int)',
 'wcsncasecmp_l(const int *, const int *, int, int)',
 'wcscoll(const int *, const int *)',
 'wcsxfrm(int *restrict, const int *restrict, int)',
 'wcscoll_l(const int *, const int *, int)',
 'wcsxfrm_l(int *, const int *, int, int)',
 'wcsdup(const int *)',
 'wcschr(const int *, int)',
 'wcsrchr(const int *, int)',
 'wcscspn(const int *, const int *)',
 'wcsspn(const int *, const int *)',
 'wcspbrk(const int *, const int *)',
 'wcsstr(const int *, const int *)',
 'wcstok(int *restrict, const int *restrict, int **restrict)',
 'wcslen(const int *)',
 'wcsnlen(const int *, int)',


In [71]:
[call.displayname for call in function_calls]

['strchr',
 'strchr',
 'close',
 'printLine',
 'printLine',
 'strcpy',
 'printLine',
 'printLine',
 'goodG2B']
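
Most of the declarations listed above come from included system headers such as `wchar.h`, not from the Juliet test case itself. One way to cut the noise is to keep only nodes whose source location is in the file being analysed. This sketch uses the `location.file` attribute of a cursor (which can be `None`) and assumes the reported file name matches the path that was passed to `index.parse`. The helper names are illustrative.

```python
def in_main_file(node, filename):
    """True if the node's location is in `filename` rather than an included header."""
    source_file = node.location.file
    # Some cursors (e.g. the translation unit itself) have no file attached.
    return source_file is not None and source_file.name == filename


def only_from(nodes, filename):
    """Filter a list of AST nodes down to those located in `filename`."""
    return [node for node in nodes if in_main_file(node, filename)]
```

Here that would be `only_from(function_declarations, example.filename)`, and likewise for `function_calls`.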

I found this nice tutorial that helps to explain how the Python clang bindings can be used to explore ASTs: https://github.com/FraMuCoder/PyClASVi/blob/master/doc/python_clang_usage.md

In [56]:
def print_ast(cursor, deep=0):
    print(' '.join((deep*'    ', str(cursor.kind), str(cursor.spelling))))
    for child in cursor.get_children():
        print_ast(child, deep+1)

print_ast(root)

 CursorKind.TRANSLATION_UNIT 000/061/940/CWE114_Process_Control__w32_char_connect_socket_01.c
     CursorKind.TYPEDEF_DECL _Float32
     CursorKind.TYPEDEF_DECL _Float64
     CursorKind.TYPEDEF_DECL _Float32x
     CursorKind.TYPEDEF_DECL _Float64x
     CursorKind.STRUCT_DECL tm
     CursorKind.FUNCTION_DECL wcscpy
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.UNEXPOSED_ATTR 
     CursorKind.FUNCTION_DECL wcsncpy
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.UNEXPOSED_ATTR 
     CursorKind.FUNCTION_DECL wcscat
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.UNEXPOSED_ATTR 
     CursorKind.FUNCTION_DECL wcsncat
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.UNEXPOSED_ATTR 
     CursorKind.FUNCTION_DECL wcscmp
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.PURE_ATTR 
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.PARM_DECL __s1
         CursorKind.PARM_DECL __s2
     CursorKind.FUNCTION_DECL wcsncmp
         CursorKind.UNEXPOSED_ATTR 
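
The full dump gets long quickly. A variant that stops at a given depth and returns the lines instead of printing them makes the tree easier to skim. This is only a sketch: `format_ast` and its `max_depth` parameter are made up here, and it uses the same `kind`, `spelling` and `get_children()` cursor members as `print_ast` above.

```python
def format_ast(cursor, max_depth=2, depth=0):
    """Return indented 'kind spelling' lines for the AST, down to `max_depth` levels."""
    lines = ["%s%s %s" % ("    " * depth, cursor.kind, cursor.spelling)]
    # Only descend while we are above the requested depth limit.
    if depth < max_depth:
        for child in cursor.get_children():
            lines.extend(format_ast(child, max_depth, depth + 1))
    return lines
```

For example, `print("\n".join(format_ast(root, max_depth=1)))` would show only the translation unit and its top-level declarations.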