# Experimenting with Python Clang Bindings

The goal of this notebook is to load the Python clang bindings and the Juliet dataset, then try to get an AST out of some of Juliet's code snippets. Can we make sense of any of it?

## Setup
You need clang and llvm installed. 

Then you need to make sure the clang python bindings are in your python path. What this really means is that you need to run the next cell, and if it fails you need to:
  1. Download the clang source code from this location: http://releases.llvm.org/download.html
     - Note: Take care to download the version that matches your installed clang. Check it by running `clang --version` in your shell, then download the clang source code for that version from the page linked above. I had to download 7.0.1.
  2. Extract this source to a known location. I chose "/home/dan/masters-cyber-security/project/clang-src/".
  3. Open `~/.bashrc` in a text editor, and at the end add the following line (the `export` makes the variable visible to programs started from the shell, such as jupyter):
 `export PYTHONPATH=/home/dan/masters-cyber-security/project/clang-src/:$PYTHONPATH`
      
  4. Run `source ~/.bashrc` in your shell.
  5. Restart jupyter notebook and all python sessions.

Hopefully it'll work then.
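
If the import still fails, it can help to check whether Python can see the bindings at all before digging further. The sketch below uses only the standard library; `bindings_available` is a hypothetical helper name chosen for illustration, not something clang provides.

```python
import importlib.util
import sys


def bindings_available(package="clang"):
    """Return True if the top-level package can be located on sys.path."""
    # find_spec returns None when a top-level package cannot be found.
    return importlib.util.find_spec(package) is not None


if not bindings_available():
    # Show where Python is looking, to compare against PYTHONPATH.
    print("clang bindings not found; sys.path is:")
    for entry in sys.path:
        print("   ", entry)
```

If the directory you added to `PYTHONPATH` does not show up in the printed list, the variable probably did not reach the jupyter process.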

In [11]:
import clang.cindex

In [12]:
import os
import pandas as pd

In [13]:
# This cell might not be needed for you.
clang.cindex.Config.set_library_file('/lib/x86_64-linux-gnu/libclang-8.so.1')
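
If you do need that cell but the path above does not exist on your machine, you can look for the library yourself. This is a minimal sketch: `find_libclang` is a hypothetical helper, and the directories it searches are guesses for a Debian/Ubuntu-style layout.

```python
import glob
import os


def find_libclang(search_dirs):
    """Return paths of shared libraries whose names start with libclang, sorted."""
    matches = []
    for directory in search_dirs:
        # Matches names such as libclang.so, libclang-8.so.1 or libclang.so.7
        matches.extend(glob.glob(os.path.join(directory, "libclang*.so*")))
    return sorted(matches)


# Assumed candidate directories; adjust for your distribution.
candidates = find_libclang(
    ["/lib/x86_64-linux-gnu", "/usr/lib/x86_64-linux-gnu", "/usr/lib"]
)
print(candidates)
```

Pick the entry that matches your clang version and pass it to `clang.cindex.Config.set_library_file`. The call has to happen before the first `Index.create()`, because the library is loaded at that point.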

Load the Juliet dataset and pick the first data point as an example.

In [14]:
juliet = pd.read_csv("../data/juliet.csv.zip")

In [15]:
example = juliet.iloc[0]
example

Unnamed: 0                                                     0
testcase_ID                                                61940
filename       000/061/940/CWE114_Process_Control__w32_char_c...
code           /* TEMPLATE GENERATED TESTCASE FILE\nFilename:...
flaw                                                     CWE-114
flaw_loc                                                     121
CWE-015                                                    False
CWE-023                                                    False
CWE-036                                                    False
CWE-078                                                    False
CWE-090                                                    False
CWE-114                                                     True
CWE-121                                                    False
CWE-122                                                    False
CWE-123                                                    False
CWE-124                  

In [16]:
print(example.code)

/* TEMPLATE GENERATED TESTCASE FILE
Filename: CWE114_Process_Control__w32_char_connect_socket_01.c
Label Definition File: CWE114_Process_Control__w32.label.xml
Template File: sources-sink-01.tmpl.c
*/
/*
 * @description
 * CWE: 114 Process Control
 * BadSource: connect_socket Read data using a connect socket (client side)
 * GoodSource: Hard code the full pathname to the library
 * Sink:
 *    BadSink : Load a dynamic link library
 * Flow Variant: 01 Baseline
 *
 * */

#include "std_testcase.h"

#include <wchar.h>

#ifdef _WIN32
#include <winsock2.h>
#include <windows.h>
#include <direct.h>
#pragma comment(lib, "ws2_32") /* include ws2_32.lib when linking */
#define CLOSE_SOCKET closesocket
#else /* NOT _WIN32 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#define INVALID_SOCKET -1
#define SOCKET_ERROR -1
#define CLOSE_SOCKET close
#define SOCKET int
#endif

#define TCP_PORT 27015
#define IP_ADDRESS "127.0.0.1"


#if

Instantiate the clang parser and give it our example. We use `unsaved_files` to tell it to parse a file that doesn't actually exist on disk.

In [17]:
index = clang.cindex.Index.create()
translation_unit = index.parse(path=example.filename, unsaved_files=[(example.filename, example.code)])

In [18]:
translation_unit

<clang.cindex.TranslationUnit at 0x7f40a97f0e80>
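
Getting a `TranslationUnit` back does not mean the parse was clean. The snippet includes `"std_testcase.h"`, which we never passed in `unsaved_files`, so clang may well have reported it as missing. A translation unit exposes what went wrong through its `diagnostics` attribute. The sketch below formats those; `summarize_diagnostics` and `SEVERITY_NAMES` are hypothetical names, and the function only relies on the `severity`, `spelling` and `location` attributes of each diagnostic.

```python
# Severity numbers follow libclang's CXDiagnosticSeverity enum.
SEVERITY_NAMES = {0: "ignored", 1: "note", 2: "warning", 3: "error", 4: "fatal"}


def summarize_diagnostics(tu):
    """Return one readable line per diagnostic attached to a translation unit."""
    lines = []
    for diag in tu.diagnostics:
        name = SEVERITY_NAMES.get(diag.severity, str(diag.severity))
        loc = diag.location
        lines.append(
            "%s at line %s, col %s: %s" % (name, loc.line, loc.column, diag.spelling)
        )
    return lines


# In the notebook: print("\n".join(summarize_diagnostics(translation_unit)))
```

Any `error` or `fatal` lines are worth reading before trusting the AST, since clang may drop or simplify the parts of the tree it could not resolve.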

`root` is the root node of the AST. Try to explore and figure out what this all means! It's pretty dense, ha.

In [19]:
root = translation_unit.cursor

In [20]:
children = list(root.get_children())

In [21]:
children[0].kind

CursorKind.TYPEDEF_DECL

In [22]:
# took this from a lovely tutorial on chess.com: 
#     https://www.chess.com/blog/lockijazz/using-python-to-traverse-and-modify-clang-s-ast-tree
# needed a minor update: change node.type to node.kind

function_calls = []             # List of AST node objects that are function calls
function_declarations = []      # List of AST node objects that are function declarations

def traverse(node):
    # Recurse for children of this node
    for child in node.get_children():
        traverse(child)

    # Add the node to function_calls
    if node.kind == clang.cindex.CursorKind.CALL_EXPR:
        function_calls.append(node)

    # Add the node to function_declarations
    if node.kind == clang.cindex.CursorKind.FUNCTION_DECL:
        function_declarations.append(node)

    # Print out information about the node
    print('Found %s [line=%s, col=%s]' % (node.displayname, node.location.line, node.location.column))

In [23]:
traverse(root)

Found _Float32 [line=214, col=15]
Found _Float64 [line=251, col=16]
Found _Float32x [line=268, col=16]
Found _Float64x [line=285, col=21]
Found wint_t [line=20, col=23]
Found __count [line=15, col=7]
Found __wch [line=18, col=19]
Found  [line=19, col=17]
Found __wchb [line=19, col=10]
Found  [line=16, col=3]
Found __wch [line=18, col=19]
Found  [line=19, col=17]
Found __wchb [line=19, col=10]
Found  [line=16, col=3]
Found __value [line=20, col=5]
Found  [line=13, col=9]
Found __count [line=15, col=7]
Found __wch [line=18, col=19]
Found  [line=19, col=17]
Found __wchb [line=19, col=10]
Found  [line=16, col=3]
Found __wch [line=18, col=19]
Found  [line=19, col=17]
Found __wchb [line=19, col=10]
Found  [line=16, col=3]
Found __value [line=20, col=5]
Found  [line=13, col=9]
Found __mbstate_t [line=21, col=3]
Found __mbstate_t [line=6, col=9]
Found mbstate_t [line=6, col=21]
Found _IO_FILE [line=4, col=8]
Found struct _IO_FILE [line=5, col=16]
Found __FILE [line=5, col=25]
Found _IO_FILE [l

Found  [line=223, col=20]
Found MSG_WAITALL [line=223, col=5]
Found  [line=225, col=16]
Found MSG_FIN [line=225, col=5]
Found  [line=227, col=16]
Found MSG_SYN [line=227, col=5]
Found  [line=229, col=20]
Found MSG_CONFIRM [line=229, col=5]
Found  [line=231, col=16]
Found MSG_RST [line=231, col=5]
Found  [line=233, col=20]
Found MSG_ERRQUEUE [line=233, col=5]
Found  [line=235, col=20]
Found MSG_NOSIGNAL [line=235, col=5]
Found  [line=237, col=17]
Found MSG_MORE [line=237, col=5]
Found  [line=239, col=22]
Found MSG_WAITFORONE [line=239, col=5]
Found  [line=241, col=18]
Found MSG_BATCH [line=241, col=5]
Found  [line=243, col=20]
Found MSG_ZEROCOPY [line=243, col=5]
Found  [line=245, col=20]
Found MSG_FASTOPEN [line=245, col=5]
Found  [line=248, col=24]
Found MSG_CMSG_CLOEXEC [line=248, col=5]
Found  [line=200, col=1]
Found msg_name [line=259, col=11]
Found socklen_t [line=260, col=5]
Found msg_namelen [line=260, col=15]
Found struct iovec [line=262, col=12]
Found msg_iov [line=262, col=19

Found  [line=756, col=28]
Found __pid_t [line=756, col=8]
Found fork() [line=756, col=16]
Found  [line=764, col=29]
Found __pid_t [line=764, col=8]
Found vfork() [line=764, col=16]
Found  [line=770, col=33]
Found __fd [line=770, col=27]
Found ttyname(int) [line=770, col=14]
Found  [line=775, col=6]
Found  [line=775, col=14]
Found __fd [line=774, col=27]
Found __buf [line=774, col=39]
Found __buflen [line=774, col=53]
Found ttyname_r(int, char *, int) [line=774, col=12]
Found  [line=779, col=30]
Found __fd [line=779, col=24]
Found isatty(int) [line=779, col=12]
Found  [line=784, col=27]
Found ttyslot() [line=784, col=12]
Found  [line=790, col=6]
Found  [line=790, col=14]
Found __from [line=789, col=30]
Found __to [line=789, col=50]
Found link(const char *, const char *) [line=789, col=12]
Found  [line=797, col=6]
Found  [line=797, col=14]
Found __fromfd [line=795, col=24]
Found __from [line=795, col=46]
Found __tofd [line=795, col=58]
Found __to [line=796, col=18]
Found __flags [line=79

In [24]:
[decl.displayname for decl in function_declarations]

['wcscpy(int *restrict, const int *restrict)',
 'wcsncpy(int *restrict, const int *restrict, int)',
 'wcscat(int *restrict, const int *restrict)',
 'wcsncat(int *restrict, const int *restrict, int)',
 'wcscmp(const int *, const int *)',
 'wcsncmp(const int *, const int *, int)',
 'wcscasecmp(const int *, const int *)',
 'wcsncasecmp(const int *, const int *, int)',
 'wcscasecmp_l(const int *, const int *, locale_t)',
 'wcsncasecmp_l(const int *, const int *, int, locale_t)',
 'wcscoll(const int *, const int *)',
 'wcsxfrm(int *restrict, const int *restrict, int)',
 'wcscoll_l(const int *, const int *, locale_t)',
 'wcsxfrm_l(int *, const int *, int, locale_t)',
 'wcsdup(const int *)',
 'wcschr(const int *, int)',
 'wcsrchr(const int *, int)',
 'wcscspn(const int *, const int *)',
 'wcsspn(const int *, const int *)',
 'wcspbrk(const int *, const int *)',
 'wcsstr(const int *, const int *)',
 'wcstok(int *restrict, const int *restrict, int **restrict)',
 'wcslen(const int *)',
 'wcsnlen(

In [25]:
[call.displayname for call in function_calls]

['socket',
 'memset',
 'inet_addr',
 'htons',
 'connect',
 'strchr',
 'strchr',
 'close',
 'printLine',
 'printLine',
 'strcpy',
 'printLine',
 'printLine',
 'goodG2B']
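
The declaration list above is dominated by functions from system headers such as `wchar.h`, which makes the snippet's own code hard to spot. One way to cut the noise is to keep only cursors whose location is in the file we parsed. The sketch below assumes that `cursor.location.file.name` equals the path given to `index.parse`; `iter_main_file` and `call_names` are hypothetical helper names.

```python
def iter_main_file(node, filename):
    """Yield cursors below `node` whose source location is in `filename`."""
    for child in node.get_children():
        source = child.location.file
        # Cursors pulled in from headers report the header as their file,
        # so the whole subtree is skipped.
        if source is None or source.name != filename:
            continue
        yield child
        yield from iter_main_file(child, filename)


def call_names(root, filename):
    """Names of the call expressions that appear in `filename` itself."""
    return [
        cursor.displayname
        for cursor in iter_main_file(root, filename)
        if cursor.kind.name == "CALL_EXPR"
    ]


# In the notebook: call_names(root, example.filename)
```

Unlike `traverse`, this returns its results instead of appending to module-level lists, so it can be rerun on another example without clearing state first.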

I found this nice tutorial that helps to explain how the Python clang bindings can be used to explore ASTs: https://github.com/FraMuCoder/PyClASVi/blob/master/doc/python_clang_usage.md

In [26]:
def print_ast(cursor, deep=0):
    print(' '.join((deep*'    ', str(cursor.kind), str(cursor.spelling))))
    for child in cursor.get_children():
        print_ast(child, deep+1)

print_ast(root)

 CursorKind.TRANSLATION_UNIT 000/061/940/CWE114_Process_Control__w32_char_connect_socket_01.c
     CursorKind.TYPEDEF_DECL _Float32
     CursorKind.TYPEDEF_DECL _Float64
     CursorKind.TYPEDEF_DECL _Float32x
     CursorKind.TYPEDEF_DECL _Float64x
     CursorKind.TYPEDEF_DECL wint_t
     CursorKind.STRUCT_DECL 
         CursorKind.FIELD_DECL __count
         CursorKind.UNION_DECL 
             CursorKind.FIELD_DECL __wch
             CursorKind.FIELD_DECL __wchb
                 CursorKind.INTEGER_LITERAL 
         CursorKind.FIELD_DECL __value
             CursorKind.UNION_DECL 
                 CursorKind.FIELD_DECL __wch
                 CursorKind.FIELD_DECL __wchb
                     CursorKind.INTEGER_LITERAL 
     CursorKind.TYPEDEF_DECL __mbstate_t
         CursorKind.STRUCT_DECL 
             CursorKind.FIELD_DECL __count
             CursorKind.UNION_DECL 
                 CursorKind.FIELD_DECL __wch
                 CursorKind.FIELD_DECL __wchb
                     CursorKi

         CursorKind.FIELD_DECL msg_controllen
         CursorKind.FIELD_DECL msg_flags
     CursorKind.STRUCT_DECL cmsghdr
         CursorKind.FIELD_DECL cmsg_len
         CursorKind.FIELD_DECL cmsg_level
         CursorKind.FIELD_DECL cmsg_type
         CursorKind.FIELD_DECL __cmsg_data
     CursorKind.FUNCTION_DECL __cmsg_nxthdr
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.TYPE_REF struct cmsghdr
         CursorKind.PARM_DECL __mhdr
             CursorKind.TYPE_REF struct msghdr
         CursorKind.PARM_DECL __cmsg
             CursorKind.TYPE_REF struct cmsghdr
     CursorKind.ENUM_DECL 
         CursorKind.ENUM_CONSTANT_DECL SCM_RIGHTS
             CursorKind.INTEGER_LITERAL 
     CursorKind.STRUCT_DECL linger
         CursorKind.FIELD_DECL l_onoff
         CursorKind.FIELD_DECL l_linger
     CursorKind.STRUCT_DECL osockaddr
         CursorKind.FIELD_DECL sa_family
         CursorKind.FIELD_DECL sa_data
             CursorKind.INTEGER_LITERAL 
     CursorKind.ENUM_DECL 


         CursorKind.PARM_DECL __path
     CursorKind.FUNCTION_DECL tcgetpgrp
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.TYPE_REF __pid_t
         CursorKind.PARM_DECL __fd
     CursorKind.FUNCTION_DECL tcsetpgrp
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.PARM_DECL __fd
         CursorKind.PARM_DECL __pgrp_id
             CursorKind.TYPE_REF __pid_t
     CursorKind.FUNCTION_DECL getlogin
     CursorKind.FUNCTION_DECL getlogin_r
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.PARM_DECL __name
         CursorKind.PARM_DECL __name_len
     CursorKind.FUNCTION_DECL setlogin
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.PARM_DECL __name
     CursorKind.VAR_DECL optarg
     CursorKind.VAR_DECL optind
     CursorKind.VAR_DECL opterr
     CursorKind.VAR_DECL optopt
     CursorKind.FUNCTION_DECL getopt
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.PARM_DECL ___argc
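
A dump this long is hard to read in full. As a rough summary, we can count how many cursors of each kind the tree contains. This is a sketch using only `cursor.kind` and `cursor.get_children()`; `count_kinds` is a hypothetical helper name.

```python
from collections import Counter


def count_kinds(cursor, counts=None):
    """Count the cursors of each kind in the subtree rooted at `cursor`."""
    if counts is None:
        counts = Counter()
    counts[str(cursor.kind)] += 1
    for child in cursor.get_children():
        count_kinds(child, counts)
    return counts


# In the notebook: count_kinds(root).most_common(10)
```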
      

In [44]:

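# Each edge is a (parent_id, child_id) pair; a node's id is its (line, column) location.
edgelist = []

def tree2edgelist(node):
    node_id = (node.location.line, node.location.column)

    for child in node.get_children():
        child_id = (child.location.line, child.location.column)
        edgelist.append((node_id, child_id))

        # Recurse so that edges are recorded in depth-first order
        tree2edgelist(child)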

In [45]:
tree2edgelist(root)
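
One caveat with `(line, column)` ids: they do not say which file a node came from, so cursors from different headers can end up with the same id and get merged in the graph. A possible fix, sketched below, is to add the file name and cursor kind to the id. `node_id` and `tree_to_edges` are hypothetical names, and this variant returns the list instead of filling a global.

```python
def node_id(cursor):
    """Identify a cursor by file name, line, column and kind."""
    loc = cursor.location
    # The translation unit's own cursor has no file attached.
    file_name = loc.file.name if loc.file is not None else None
    return (file_name, loc.line, loc.column, str(cursor.kind))


def tree_to_edges(node, edges=None):
    """Return (parent_id, child_id) pairs for the subtree rooted at `node`."""
    if edges is None:
        edges = []
    parent = node_id(node)
    for child in node.get_children():
        edges.append((parent, node_id(child)))
        tree_to_edges(child, edges)
    return edges


# In the notebook: edges = tree_to_edges(root)
```

Ids built this way can still collide in unusual cases, for example two cursors of the same kind at the same position, so treat them as a better approximation and not as guaranteed unique keys.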

In [52]:
for (parent, child) in edgelist:
    print(parent, child)

(0, 0) (214, 15)
(0, 0) (251, 16)
(0, 0) (268, 16)
(0, 0) (285, 21)
(0, 0) (20, 23)
(0, 0) (13, 9)
(13, 9) (15, 7)
(13, 9) (16, 3)
(16, 3) (18, 19)
(16, 3) (19, 10)
(19, 10) (19, 17)
(13, 9) (20, 5)
(20, 5) (16, 3)
(16, 3) (18, 19)
(16, 3) (19, 10)
(19, 10) (19, 17)
(0, 0) (21, 3)
(21, 3) (13, 9)
(13, 9) (15, 7)
(13, 9) (16, 3)
(16, 3) (18, 19)
(16, 3) (19, 10)
(19, 10) (19, 17)
(13, 9) (20, 5)
(20, 5) (16, 3)
(16, 3) (18, 19)
(16, 3) (19, 10)
(19, 10) (19, 17)
(0, 0) (6, 21)
(6, 21) (6, 9)
(0, 0) (4, 8)
(0, 0) (5, 25)
(5, 25) (5, 16)
(0, 0) (4, 8)
(0, 0) (7, 25)
(7, 25) (7, 16)
(0, 0) (28, 8)
(28, 8) (31, 10)
(28, 8) (31, 25)
(31, 25) (31, 10)
(31, 25) (31, 35)
(28, 8) (34, 29)
(28, 8) (35, 14)
(28, 8) (36, 14)
(28, 8) (39, 15)
(39, 15) (39, 23)
(0, 0) (42, 33)
(42, 33) (42, 16)
(0, 0) (24, 20)
(24, 20) (24, 9)
(0, 0) (83, 8)
(0, 0) (87, 17)
(87, 17) (89, 6)
(87, 17) (89, 14)
(0, 0) (92, 17)
(92, 17) (94, 6)
(92, 17) (94, 14)
(0, 0) (97, 17)
(97, 17) (99, 6)
(97, 17) (99, 14)
(0, 0) (

(0, 0) (80, 3)
(80, 3) (75, 9)
(75, 9) (77, 27)
(77, 27) (77, 10)
(75, 9) (78, 8)
(78, 8) (78, 15)
(75, 9) (79, 31)
(0, 0) (86, 9)
(86, 9) (88, 34)
(88, 34) (88, 10)
(86, 9) (89, 8)
(89, 8) (89, 15)
(86, 9) (90, 12)
(0, 0) (91, 3)
(91, 3) (86, 9)
(86, 9) (88, 34)
(88, 34) (88, 10)
(86, 9) (89, 8)
(89, 8) (89, 15)
(86, 9) (90, 12)
(0, 0) (93, 9)
(93, 9) (95, 8)
(95, 8) (95, 15)
(93, 9) (96, 12)
(0, 0) (97, 3)
(97, 3) (93, 9)
(93, 9) (95, 8)
(95, 8) (95, 15)
(93, 9) (96, 12)
(0, 0) (103, 22)
(0, 0) (108, 9)
(108, 9) (110, 8)
(110, 8) (110, 15)
(108, 9) (111, 12)
(0, 0) (112, 3)
(112, 3) (108, 9)
(108, 9) (110, 8)
(110, 8) (110, 15)
(108, 9) (111, 12)
(0, 0) (114, 9)
(114, 9) (116, 8)
(116, 8) (116, 15)
(114, 9) (117, 7)
(0, 0) (118, 3)
(118, 3) (114, 9)
(114, 9) (116, 8)
(116, 8) (116, 15)
(114, 9) (117, 7)
(0, 0) (26, 8)
(26, 8) (28, 11)
(26, 8) (29, 12)
(0, 0) (33, 21)
(33, 21) (33, 9)
(0, 0) (24, 6)
(24, 6) (26, 3)
(26, 3) (26, 17)
(24, 6) (29, 3)
(29, 3) (29, 16)
(24, 6) (32, 3)
(32,

(71, 1) (402, 5)
(71, 1) (404, 5)
(71, 1) (406, 5)
(71, 1) (408, 5)
(71, 1) (410, 5)
(71, 1) (412, 5)
(71, 1) (414, 5)
(71, 1) (416, 5)
(71, 1) (418, 5)
(71, 1) (420, 5)
(71, 1) (422, 5)
(71, 1) (424, 5)
(71, 1) (426, 5)
(71, 1) (428, 5)
(71, 1) (430, 5)
(71, 1) (432, 5)
(71, 1) (434, 5)
(71, 1) (436, 5)
(71, 1) (438, 5)
(71, 1) (440, 5)
(71, 1) (442, 5)
(71, 1) (445, 5)
(71, 1) (447, 5)
(71, 1) (449, 5)
(71, 1) (451, 5)
(71, 1) (454, 5)
(71, 1) (456, 5)
(71, 1) (458, 5)
(71, 1) (460, 5)
(71, 1) (462, 5)
(71, 1) (465, 5)
(71, 1) (467, 5)
(71, 1) (469, 5)
(71, 1) (471, 5)
(71, 1) (473, 5)
(71, 1) (475, 5)
(71, 1) (477, 5)
(71, 1) (479, 5)
(71, 1) (481, 5)
(71, 1) (483, 5)
(71, 1) (485, 5)
(71, 1) (487, 5)
(71, 1) (489, 5)
(71, 1) (491, 5)
(71, 1) (493, 5)
(71, 1) (497, 5)
(497, 5) (497, 16)
(497, 16) (497, 16)
(497, 16) (497, 41)
(71, 1) (499, 5)
(71, 1) (502, 5)
(71, 1) (504, 5)
(71, 1) (506, 5)
(71, 1) (508, 5)
(71, 1) (511, 5)
(71, 1) (514, 5)
(71, 1) (516, 5)
(71, 1) (518, 5)
(71, 1

In [53]:
# Write the edge list to disk, one "parent child" pair per line.
# The with-block closes the file even if printing raises an error.
with open('../data/ast_output.txt', 'w') as f:
    for (parent, child) in edgelist:
        print(parent, child, file=f)