# Experimenting with Python Clang Bindings

The goal of this notebook is to load the Python Clang bindings and the Juliet dataset, then try to get an AST out of some of Juliet's code snippets. Can we make sense of any of it?

## Setup
You need Clang and LLVM installed.

Then you need to make sure the Clang Python bindings are on your Python path. In practice, this means running the next cell and, if it fails, doing the following:
  1. Download the Clang source code from this location: http://releases.llvm.org/download.html
     - Note: Be careful to download the matching version. Check it by running `clang --version` in your shell, then download the Clang source code for the version it reports (from the page linked above). I had to download 7.0.1.
  2. Extract this source to a known location. I chose "/home/dan/masters-cyber-security/project/clang-src/".
  3. Open `~/.bashrc` in a text editor and add the following line at the end:
 `export PYTHONPATH=/home/dan/masters-cyber-security/project/clang-src/:$PYTHONPATH`

  4. Run `source ~/.bashrc` in your shell.
  5. Restart Jupyter Notebook and all Python sessions.

Hopefully it will work after that.
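
If editing `~/.bashrc` is inconvenient, roughly the same effect can be had from inside the notebook. The following is a minimal sketch, not part of the original setup: `add_bindings_dir` and `bindings_available` are hypothetical helper names, and the directory you pass in is whatever location you extracted the source to.

```python
import importlib.util
import os
import sys


def add_bindings_dir(path):
    """Prepend `path` to sys.path if it is an existing directory.

    Returns True if the directory is on sys.path afterwards, else False.
    """
    path = os.path.abspath(os.path.expanduser(path))
    if not os.path.isdir(path):
        return False
    if path not in sys.path:
        sys.path.insert(0, path)
    return True


def bindings_available():
    """Report whether `import clang.cindex` would find a module."""
    try:
        return importlib.util.find_spec("clang.cindex") is not None
    except ImportError:
        # Raised when the parent package `clang` itself cannot be imported.
        return False
```

For example, call `add_bindings_dir("/home/dan/masters-cyber-security/project/clang-src/")` and then check `bindings_available()` before running the import cell. Unlike the `~/.bashrc` route, this only affects the current Python session.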

In [11]:
import clang.cindex

In [12]:
import os
import pandas as pd

In [13]:
# Only needed if the bindings cannot find libclang on their own; adjust the path to your installation.
clang.cindex.Config.set_library_file('/lib/x86_64-linux-gnu/libclang-8.so.1')

Load the Juliet dataset and pick the first data point as an example.

In [14]:
juliet = pd.read_csv("../data/juliet.csv.zip")

In [15]:
example = juliet.iloc[0]
example

Unnamed: 0                                                     0
testcase_ID                                                61940
filename       000/061/940/CWE114_Process_Control__w32_char_c...
code           /* TEMPLATE GENERATED TESTCASE FILE\nFilename:...
flaw                                                     CWE-114
flaw_loc                                                     121
CWE-015                                                    False
CWE-023                                                    False
CWE-036                                                    False
CWE-078                                                    False
CWE-090                                                    False
CWE-114                                                     True
CWE-121                                                    False
CWE-122                                                    False
CWE-123                                                    False
CWE-124                  

In [16]:
print(example.code)

/* TEMPLATE GENERATED TESTCASE FILE
Filename: CWE114_Process_Control__w32_char_connect_socket_01.c
Label Definition File: CWE114_Process_Control__w32.label.xml
Template File: sources-sink-01.tmpl.c
*/
/*
 * @description
 * CWE: 114 Process Control
 * BadSource: connect_socket Read data using a connect socket (client side)
 * GoodSource: Hard code the full pathname to the library
 * Sink:
 *    BadSink : Load a dynamic link library
 * Flow Variant: 01 Baseline
 *
 * */

#include "std_testcase.h"

#include <wchar.h>

#ifdef _WIN32
#include <winsock2.h>
#include <windows.h>
#include <direct.h>
#pragma comment(lib, "ws2_32") /* include ws2_32.lib when linking */
#define CLOSE_SOCKET closesocket
#else /* NOT _WIN32 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>
#define INVALID_SOCKET -1
#define SOCKET_ERROR -1
#define CLOSE_SOCKET close
#define SOCKET int
#endif

#define TCP_PORT 27015
#define IP_ADDRESS "127.0.0.1"


#if

Instantiate the clang parser and give it our example. We use `unsaved_files` to tell it to parse a file that doesn't actually exist on disk.

In [17]:
index = clang.cindex.Index.create()
translation_unit = index.parse(path=example.filename, unsaved_files=[(example.filename, example.code)])

In [18]:
translation_unit

<clang.cindex.TranslationUnit at 0x7f40a97f0e80>
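
Getting a `TranslationUnit` back does not mean the code parsed cleanly: libclang records any problems on `translation_unit.diagnostics`. Since the example includes `"std_testcase.h"`, which is not passed in `unsaved_files`, it is worth checking for a missing-header diagnostic. The following is a minimal sketch; `format_diagnostics` is a hypothetical helper, and it only reads the `severity`, `spelling` and `location.line` attributes of each diagnostic.

```python
# Severity levels as defined by libclang's CXDiagnosticSeverity enum.
SEVERITY_NAMES = {0: "ignored", 1: "note", 2: "warning", 3: "error", 4: "fatal"}


def format_diagnostics(diagnostics, min_severity=2):
    """Return one readable line per diagnostic at or above `min_severity`."""
    lines = []
    for diag in diagnostics:
        if diag.severity < min_severity:
            continue
        name = SEVERITY_NAMES.get(diag.severity, str(diag.severity))
        lines.append("%s: line %s: %s" % (name, diag.location.line, diag.spelling))
    return lines
```

In the notebook this would be used as `format_diagnostics(translation_unit.diagnostics)`. An empty list means libclang reported nothing at warning level or above.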

`root` is the root node of the AST. Try to explore it and figure out what it all means. It is pretty dense.

In [19]:
root = translation_unit.cursor

In [20]:
children = list(root.get_children())

In [21]:
children[0].kind

CursorKind.TYPEDEF_DECL

In [22]:
# took this from a lovely tutorial on chess.com: 
#     https://www.chess.com/blog/lockijazz/using-python-to-traverse-and-modify-clang-s-ast-tree
# needed a minor update: change node.type to node.kind

function_calls = []             # List of AST node objects that are function calls
function_declarations = []      # List of AST node objects that are function declarations

def traverse(node):
    # Recurse for children of this node
    for child in node.get_children():
        traverse(child)

    # Add the node to function_calls
    if node.kind == clang.cindex.CursorKind.CALL_EXPR:
        function_calls.append(node)

    # Add the node to function_declarations
    if node.kind == clang.cindex.CursorKind.FUNCTION_DECL:
        function_declarations.append(node)

    # Print out information about the node
    print('Found %s [line=%s, col=%s]' % (node.displayname, node.location.line, node.location.column))

In [23]:
traverse(root)

Found _Float32 [line=214, col=15]
Found _Float64 [line=251, col=16]
Found _Float32x [line=268, col=16]
Found _Float64x [line=285, col=21]
Found wint_t [line=20, col=23]
Found __count [line=15, col=7]
Found __wch [line=18, col=19]
Found  [line=19, col=17]
Found __wchb [line=19, col=10]
Found  [line=16, col=3]
Found __wch [line=18, col=19]
Found  [line=19, col=17]
Found __wchb [line=19, col=10]
Found  [line=16, col=3]
Found __value [line=20, col=5]
Found  [line=13, col=9]
Found __count [line=15, col=7]
Found __wch [line=18, col=19]
Found  [line=19, col=17]
Found __wchb [line=19, col=10]
Found  [line=16, col=3]
Found __wch [line=18, col=19]
Found  [line=19, col=17]
Found __wchb [line=19, col=10]
Found  [line=16, col=3]
Found __value [line=20, col=5]
Found  [line=13, col=9]
Found __mbstate_t [line=21, col=3]
Found __mbstate_t [line=6, col=9]
Found mbstate_t [line=6, col=21]
Found _IO_FILE [line=4, col=8]
Found struct _IO_FILE [line=5, col=16]
Found __FILE [line=5, col=25]
Found _IO_FILE [l

Found  [line=223, col=20]
Found MSG_WAITALL [line=223, col=5]
Found  [line=225, col=16]
Found MSG_FIN [line=225, col=5]
Found  [line=227, col=16]
Found MSG_SYN [line=227, col=5]
Found  [line=229, col=20]
Found MSG_CONFIRM [line=229, col=5]
Found  [line=231, col=16]
Found MSG_RST [line=231, col=5]
Found  [line=233, col=20]
Found MSG_ERRQUEUE [line=233, col=5]
Found  [line=235, col=20]
Found MSG_NOSIGNAL [line=235, col=5]
Found  [line=237, col=17]
Found MSG_MORE [line=237, col=5]
Found  [line=239, col=22]
Found MSG_WAITFORONE [line=239, col=5]
Found  [line=241, col=18]
Found MSG_BATCH [line=241, col=5]
Found  [line=243, col=20]
Found MSG_ZEROCOPY [line=243, col=5]
Found  [line=245, col=20]
Found MSG_FASTOPEN [line=245, col=5]
Found  [line=248, col=24]
Found MSG_CMSG_CLOEXEC [line=248, col=5]
Found  [line=200, col=1]
Found msg_name [line=259, col=11]
Found socklen_t [line=260, col=5]
Found msg_namelen [line=260, col=15]
Found struct iovec [line=262, col=12]
Found msg_iov [line=262, col=19

Found  [line=756, col=28]
Found __pid_t [line=756, col=8]
Found fork() [line=756, col=16]
Found  [line=764, col=29]
Found __pid_t [line=764, col=8]
Found vfork() [line=764, col=16]
Found  [line=770, col=33]
Found __fd [line=770, col=27]
Found ttyname(int) [line=770, col=14]
Found  [line=775, col=6]
Found  [line=775, col=14]
Found __fd [line=774, col=27]
Found __buf [line=774, col=39]
Found __buflen [line=774, col=53]
Found ttyname_r(int, char *, int) [line=774, col=12]
Found  [line=779, col=30]
Found __fd [line=779, col=24]
Found isatty(int) [line=779, col=12]
Found  [line=784, col=27]
Found ttyslot() [line=784, col=12]
Found  [line=790, col=6]
Found  [line=790, col=14]
Found __from [line=789, col=30]
Found __to [line=789, col=50]
Found link(const char *, const char *) [line=789, col=12]
Found  [line=797, col=6]
Found  [line=797, col=14]
Found __fromfd [line=795, col=24]
Found __from [line=795, col=46]
Found __tofd [line=795, col=58]
Found __to [line=796, col=18]
Found __flags [line=79
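
Most of the output above comes from system headers pulled in by the `#include` lines, not from the Juliet test case itself. The following is a minimal sketch of one way to restrict the traversal to the file being parsed. `iter_main_file_nodes` is a hypothetical helper; it relies on `cursor.location.file`, which can be `None`, and assumes the file name libclang reports matches the path given to `index.parse`.

```python
def iter_main_file_nodes(root, filename):
    """Yield nodes (pre-order) below `root` that are located in `filename`.

    Top-level declarations from other files are skipped together with
    their whole subtree.
    """
    for top in root.get_children():
        source_file = top.location.file
        if source_file is None or source_file.name != filename:
            continue
        stack = [top]
        while stack:
            node = stack.pop()
            yield node
            # Reverse so that children are visited in source order.
            stack.extend(reversed(list(node.get_children())))
```

For example, `[n.displayname for n in iter_main_file_nodes(root, example.filename)]` should list far fewer nodes than the full traversal, because everything declared in the included headers is dropped.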

In [24]:
[decl.displayname for decl in function_declarations]

['wcscpy(int *restrict, const int *restrict)',
 'wcsncpy(int *restrict, const int *restrict, int)',
 'wcscat(int *restrict, const int *restrict)',
 'wcsncat(int *restrict, const int *restrict, int)',
 'wcscmp(const int *, const int *)',
 'wcsncmp(const int *, const int *, int)',
 'wcscasecmp(const int *, const int *)',
 'wcsncasecmp(const int *, const int *, int)',
 'wcscasecmp_l(const int *, const int *, locale_t)',
 'wcsncasecmp_l(const int *, const int *, int, locale_t)',
 'wcscoll(const int *, const int *)',
 'wcsxfrm(int *restrict, const int *restrict, int)',
 'wcscoll_l(const int *, const int *, locale_t)',
 'wcsxfrm_l(int *, const int *, int, locale_t)',
 'wcsdup(const int *)',
 'wcschr(const int *, int)',
 'wcsrchr(const int *, int)',
 'wcscspn(const int *, const int *)',
 'wcsspn(const int *, const int *)',
 'wcspbrk(const int *, const int *)',
 'wcsstr(const int *, const int *)',
 'wcstok(int *restrict, const int *restrict, int **restrict)',
 'wcslen(const int *)',
 'wcsnlen(

In [25]:
[call.displayname for call in function_calls]

['socket',
 'memset',
 'inet_addr',
 'htons',
 'connect',
 'strchr',
 'strchr',
 'close',
 'printLine',
 'printLine',
 'strcpy',
 'printLine',
 'printLine',
 'goodG2B']
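
Several callees show up more than once in this list, so a tally is a quick way to summarise it. This is a minimal sketch using the standard library; `count_call_names` is a hypothetical helper that only reads each node's `displayname`.

```python
from collections import Counter


def count_call_names(calls):
    """Count how often each callee name appears in a list of call nodes."""
    return Counter(call.displayname for call in calls)
```

Calling `count_call_names(function_calls).most_common(2)` on the list above gives `printLine` four times and `strchr` twice.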

I found this nice tutorial that helps to explain how the Python Clang bindings can be used to explore ASTs: https://github.com/FraMuCoder/PyClASVi/blob/master/doc/python_clang_usage.md

In [26]:
def print_ast(cursor, deep=0):
    print(' '.join((deep*'    ', str(cursor.kind), str(cursor.spelling))))
    for child in cursor.get_children():
        print_ast(child, deep+1)

print_ast(root)

 CursorKind.TRANSLATION_UNIT 000/061/940/CWE114_Process_Control__w32_char_connect_socket_01.c
     CursorKind.TYPEDEF_DECL _Float32
     CursorKind.TYPEDEF_DECL _Float64
     CursorKind.TYPEDEF_DECL _Float32x
     CursorKind.TYPEDEF_DECL _Float64x
     CursorKind.TYPEDEF_DECL wint_t
     CursorKind.STRUCT_DECL 
         CursorKind.FIELD_DECL __count
         CursorKind.UNION_DECL 
             CursorKind.FIELD_DECL __wch
             CursorKind.FIELD_DECL __wchb
                 CursorKind.INTEGER_LITERAL 
         CursorKind.FIELD_DECL __value
             CursorKind.UNION_DECL 
                 CursorKind.FIELD_DECL __wch
                 CursorKind.FIELD_DECL __wchb
                     CursorKind.INTEGER_LITERAL 
     CursorKind.TYPEDEF_DECL __mbstate_t
         CursorKind.STRUCT_DECL 
             CursorKind.FIELD_DECL __count
             CursorKind.UNION_DECL 
                 CursorKind.FIELD_DECL __wch
                 CursorKind.FIELD_DECL __wchb
                     CursorKi

         CursorKind.FIELD_DECL msg_controllen
         CursorKind.FIELD_DECL msg_flags
     CursorKind.STRUCT_DECL cmsghdr
         CursorKind.FIELD_DECL cmsg_len
         CursorKind.FIELD_DECL cmsg_level
         CursorKind.FIELD_DECL cmsg_type
         CursorKind.FIELD_DECL __cmsg_data
     CursorKind.FUNCTION_DECL __cmsg_nxthdr
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.TYPE_REF struct cmsghdr
         CursorKind.PARM_DECL __mhdr
             CursorKind.TYPE_REF struct msghdr
         CursorKind.PARM_DECL __cmsg
             CursorKind.TYPE_REF struct cmsghdr
     CursorKind.ENUM_DECL 
         CursorKind.ENUM_CONSTANT_DECL SCM_RIGHTS
             CursorKind.INTEGER_LITERAL 
     CursorKind.STRUCT_DECL linger
         CursorKind.FIELD_DECL l_onoff
         CursorKind.FIELD_DECL l_linger
     CursorKind.STRUCT_DECL osockaddr
         CursorKind.FIELD_DECL sa_family
         CursorKind.FIELD_DECL sa_data
             CursorKind.INTEGER_LITERAL 
     CursorKind.ENUM_DECL 


         CursorKind.PARM_DECL __path
     CursorKind.FUNCTION_DECL tcgetpgrp
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.TYPE_REF __pid_t
         CursorKind.PARM_DECL __fd
     CursorKind.FUNCTION_DECL tcsetpgrp
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.PARM_DECL __fd
         CursorKind.PARM_DECL __pgrp_id
             CursorKind.TYPE_REF __pid_t
     CursorKind.FUNCTION_DECL getlogin
     CursorKind.FUNCTION_DECL getlogin_r
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.PARM_DECL __name
         CursorKind.PARM_DECL __name_len
     CursorKind.FUNCTION_DECL setlogin
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.PARM_DECL __name
     CursorKind.VAR_DECL optarg
     CursorKind.VAR_DECL optind
     CursorKind.VAR_DECL opterr
     CursorKind.VAR_DECL optopt
     CursorKind.FUNCTION_DECL getopt
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.UNEXPOSED_ATTR 
         CursorKind.PARM_DECL ___argc
      

In [66]:
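edgelist = []

def tree2edgelist(node, identifier=1):
    # `identifier` is this node's id on entry; the return value is the
    # highest id assigned anywhere in this node's subtree.
    for child in node.get_children():
        child_id = tree2edgelist(child, identifier+1)
        # `identifier` advances after every child, so each edge links the
        # previous running id to the last id of the child's subtree.
        edgelist.append([identifier,child_id])
        identifier = child_id
    return identifier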


In [67]:
tree2edgelist(root)

2538
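
Note that `tree2edgelist` does not record strict (parent, child) pairs. Its first edges, `[1, 2]`, `[2, 3]`, `[3, 4]`, form a chain, even though `print_ast` shows those first typedefs as siblings directly under the translation unit. The following is a minimal sketch of a variant that numbers nodes in the same pre-order but attaches every child to its actual parent. `tree_to_edges` is a hypothetical name, and it is written iteratively to avoid deep recursion.

```python
def tree_to_edges(root):
    """Return (edges, node_count) with nodes numbered 1..n in pre-order.

    Each edge is a (parent_id, child_id) pair.
    """
    edges = []
    next_id = 1
    # Each stack entry is (node, id of its parent, or None for the root).
    stack = [(root, None)]
    while stack:
        node, parent_id = stack.pop()
        node_id = next_id
        next_id += 1
        if parent_id is not None:
            edges.append((parent_id, node_id))
        children = list(node.get_children())
        # Push in reverse so the first child is numbered next.
        stack.extend((child, node_id) for child in reversed(children))
    return edges, next_id - 1
```

With `edges, n = tree_to_edges(root)`, a tree always has `len(edges) == n - 1`, which is a cheap sanity check on the result.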

In [68]:
edgelist

[[1, 2],
 [2, 3],
 [3, 4],
 [4, 5],
 [5, 6],
 [7, 8],
 [9, 10],
 [11, 12],
 [10, 12],
 [8, 12],
 [14, 15],
 [16, 17],
 [15, 17],
 [13, 17],
 [12, 17],
 [6, 17],
 [19, 20],
 [21, 22],
 [23, 24],
 [22, 24],
 [20, 24],
 [26, 27],
 [28, 29],
 [27, 29],
 [25, 29],
 [24, 29],
 [18, 29],
 [17, 29],
 [30, 31],
 [29, 31],
 [31, 32],
 [33, 34],
 [32, 34],
 [34, 35],
 [36, 37],
 [35, 37],
 [38, 39],
 [40, 41],
 [41, 42],
 [39, 42],
 [42, 43],
 [43, 44],
 [44, 45],
 [46, 47],
 [45, 47],
 [37, 47],
 [48, 49],
 [47, 49],
 [50, 51],
 [49, 51],
 [51, 52],
 [53, 54],
 [54, 55],
 [52, 55],
 [56, 57],
 [57, 58],
 [55, 58],
 [59, 60],
 [60, 61],
 [58, 61],
 [62, 63],
 [63, 64],
 [61, 64],
 [65, 66],
 [66, 67],
 [67, 68],
 [68, 69],
 [69, 70],
 [64, 70],
 [71, 72],
 [72, 73],
 [73, 74],
 [74, 75],
 [75, 76],
 [76, 77],
 [70, 77],
 [78, 79],
 [79, 80],
 [80, 81],
 [77, 81],
 [82, 83],
 [83, 84],
 [84, 85],
 [85, 86],
 [81, 86],
 [87, 88],
 [88, 89],
 [89, 90],
 [91, 92],
 [90, 92],
 [86, 92],
 [93, 94],
 [9

In [69]:
for (parent, child) in edgelist:
    print(parent, child)

1 2
2 3
3 4
4 5
5 6
7 8
9 10
11 12
10 12
8 12
14 15
16 17
15 17
13 17
12 17
6 17
19 20
21 22
23 24
22 24
20 24
26 27
28 29
27 29
25 29
24 29
18 29
17 29
30 31
29 31
31 32
33 34
32 34
34 35
36 37
35 37
38 39
40 41
41 42
39 42
42 43
43 44
44 45
46 47
45 47
37 47
48 49
47 49
50 51
49 51
51 52
53 54
54 55
52 55
56 57
57 58
55 58
59 60
60 61
58 61
62 63
63 64
61 64
65 66
66 67
67 68
68 69
69 70
64 70
71 72
72 73
73 74
74 75
75 76
76 77
70 77
78 79
79 80
80 81
77 81
82 83
83 84
84 85
85 86
81 86
87 88
88 89
89 90
91 92
90 92
86 92
93 94
94 95
95 96
96 97
98 99
97 99
92 99
100 101
101 102
102 103
99 103
104 105
103 105
106 107
107 108
108 109
110 111
109 111
105 111
112 113
111 113
114 115
115 116
113 116
117 118
118 119
116 119
120 121
121 122
119 122
123 124
124 125
122 125
126 127
127 128
125 128
129 130
130 131
128 131
132 133
133 134
131 134
135 136
134 136
137 138
138 139
136 139
140 141
141 142
139 142
143 144
144 145
142 145
146 147
147 148
148 149
149 150
150 151
145 151
152 153
151 

1070 1071
1072 1073
1071 1073
1066 1073
1074 1075
1075 1076
1077 1078
1076 1078
1079 1080
1078 1080
1073 1080
1081 1082
1082 1083
1084 1085
1083 1085
1086 1087
1085 1087
1080 1087
1088 1089
1090 1091
1089 1091
1092 1093
1091 1093
1087 1093
1094 1095
1095 1096
1097 1098
1096 1098
1099 1100
1098 1100
1093 1100
1101 1102
1102 1103
1103 1104
1104 1105
1105 1106
1100 1106
1107 1108
1108 1109
1109 1110
1110 1111
1111 1112
1106 1112
1113 1114
1114 1115
1115 1116
1116 1117
1117 1118
1119 1120
1118 1120
1121 1122
1120 1122
1112 1122
1123 1124
1124 1125
1125 1126
1126 1127
1127 1128
1129 1130
1128 1130
1131 1132
1130 1132
1122 1132
1133 1134
1134 1135
1136 1137
1135 1137
1137 1138
1132 1138
1139 1140
1140 1141
1142 1143
1141 1143
1143 1144
1138 1144
1145 1146
1146 1147
1147 1148
1148 1149
1149 1150
1151 1152
1150 1152
1144 1152
1153 1154
1154 1155
1155 1156
1156 1157
1157 1158
1159 1160
1158 1160
1152 1160
1161 1162
1162 1163
1163 1164
1160 1164
1165 1166
1167 1168
1166 1168
1169 1170
1168 1170


2203 2204
2205 2206
2206 2207
2207 2208
2204 2208
2209 2210
2210 2211
2211 2212
2208 2212
2212 2213
2213 2214
2214 2215
2215 2216
2217 2218
2218 2219
2219 2220
2220 2221
2221 2222
2216 2222
2223 2224
2224 2225
2225 2226
2226 2227
2222 2227
2228 2229
2229 2230
2230 2231
2231 2232
2227 2232
2233 2234
2234 2235
2232 2235
2236 2237
2237 2238
2238 2239
2239 2240
2235 2240
2241 2242
2242 2243
2243 2244
2244 2245
2240 2245
2246 2247
2245 2247
2248 2249
2249 2250
2250 2251
2247 2251
2252 2253
2253 2254
2254 2255
2255 2256
2256 2257
2257 2258
2251 2258
2259 2260
2260 2261
2258 2261
2262 2263
2261 2263
2264 2265
2263 2265
2266 2267
2265 2267
2268 2269
2269 2270
2270 2271
2267 2271
2272 2273
2273 2274
2274 2275
2271 2275
2276 2277
2277 2278
2275 2278
2279 2280
2278 2280
2280 2281
2282 2283
2281 2283
2284 2285
2285 2286
2283 2286
2287 2288
2286 2288
2289 2290
2290 2291
2291 2292
2293 2294
2292 2294
2288 2294
2295 2296
2296 2297
2298 2299
2297 2299
2294 2299
2300 2301
2301 2302
2299 2302
2303 2304


In [70]:
# Write the edge list to disk, one "parent child" pair per line.
with open('../data/ast_output.txt', 'w') as f:
    for (parent, child) in edgelist:
        print(parent, child, file=f)
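
To use the saved file later, the pairs need to be read back in. The following is a minimal sketch, assuming the whitespace-separated two-column format written above; `read_edge_list` is a hypothetical helper that accepts any iterable of lines, including an open file.

```python
from collections import defaultdict


def read_edge_list(lines):
    """Parse "parent child" lines into a dict mapping parent -> [children]."""
    adjacency = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # Skip blank or malformed lines.
        parent, child = int(parts[0]), int(parts[1])
        adjacency[parent].append(child)
    return dict(adjacency)
```

For example: `with open('../data/ast_output.txt') as f: graph = read_edge_list(f)`. The first column is used as the dictionary key, so the result is only as meaningful as the pairs that were written out.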