# Code miner

This script simplifies mining variable categories from Java code.
It makes a best effort to extract code snippets and uses *keywords* to select the ones likely to contain useful variables. The user then manually chooses the variables and maps them to categories.

## Working pipeline
The working pipeline is as follows:
1. Clone the Java code repositories
2. Choose keywords and their types
3. Run all the cells up to *Main parsing loop*
4. Parse the repositories
5. Save all the changes in the part after parsing (*Saving progress*)

## Setup

In [1]:
import pandas as pd
import os
import IPython
import time
import re

# String separator for GUI
sep = '-'*40

# Time in seconds the script waits before clearing the current cell output
sleep_time = 2

### Code directories

The `reps_dir` list stores all repositories that the parser will traverse.

In [2]:
reps_dir = [r'./reps/Arduino']
# reps_dir = [r'./reps/Arduino', r'./reps/sndcpy', r'./reps/AmazeFileManager']

### Keywords and categories

Keywords are the words the parser tries to find in a code snippet. They may be regular expressions, but use those with care. Each keyword is a tuple of the form `(key, isWord)`:
 * `key` is the word itself
 * `isWord` is a boolean: `True` if the script should match the key only as a separate word, `False` otherwise
 

In [3]:
search_keywords = [('for', True), 
                   ('while', True), 
                   ('iterator', False)]

For example, consider the following piece of code:
```java
void main() {
    Boolean isFormatted = false;
    ListIterator<Integer> it = new ArrayList<Integer>().listIterator();
    while(true) {
        if (foo(isFormatted, it)) {
            break;
        }
    }
}
```
With the `search_keywords` list above, the parser would not trigger on the key `for` in `isFormatted`, since it is part of another word. However, it would trigger on the key `while`, since it is a separate word, and on the key `iterator`, since `isWord` is `False` for it.
 
Finally, note that the keyword search is *case insensitive*.
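
The matching rule can be illustrated with a small standalone sketch. `keyword_hits` is a hypothetical helper, not part of the notebook. It assumes plain-text keys and approximates "part of another word" with ASCII letters and digits, which is narrower than the `isalnum` check used by the notebook functions.

```python
import re


def keyword_hits(code, keywords):
    """Return the keys that match `code` under the (key, isWord) rule."""
    hits = []
    for key, is_word in keywords:
        pattern = re.escape(key)  # assumption: the key is literal text
        if is_word:
            # No ASCII letter or digit may touch the key on either side.
            pattern = rf'(?<![A-Za-z0-9]){pattern}(?![A-Za-z0-9])'
        # IGNORECASE mirrors the case-insensitive search.
        if re.search(pattern, code, flags=re.IGNORECASE):
            hits.append(key)
    return hits


snippet = """
Boolean isFormatted = false;
ListIterator<Integer> it = new ArrayList<Integer>().listIterator();
while (true) { break; }
"""
keywords = [('for', True), ('while', True), ('iterator', False)]
print(keyword_hits(snippet, keywords))  # ['while', 'iterator']
```

The key `for` is rejected because its only occurrence sits inside `isFormatted`, while `iterator` is accepted inside `ListIterator` because its `isWord` flag is `False`.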

Below is the list of available categories that variables can be mapped to.

> **Important**: make sure that no two categories start with the same letter. If you want to lift this restriction, see the `show_to_user` function implementation.

In [4]:
# Category names must not start with the same letter!
categories_available = ['loop_control', 'iterator', 'maybe_loop_control', 'break_loop_control']
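
The first-letter rule exists because a category can be entered by its first letter. Below is a minimal sketch of a check for that rule. Both `has_unique_first_letters` and `resolve_category` are hypothetical helpers that do not appear in the notebook.

```python
categories_available = ['loop_control', 'iterator',
                        'maybe_loop_control', 'break_loop_control']


def has_unique_first_letters(categories):
    """True if no two category names share their first letter."""
    first_letters = [name[0] for name in categories]
    return len(set(first_letters)) == len(first_letters)


def resolve_category(token, categories):
    """Map a full name or a first letter to a category, or return None."""
    for name in categories:
        if token == name or token == name[0]:
            return name
    return None


assert has_unique_first_letters(categories_available)
print(resolve_category('m', categories_available))  # maybe_loop_control

# With a clash, the first letter always resolves to the earlier name.
clashing = ['loop_control', 'list_index']
print(has_unique_first_letters(clashing))  # False
print(resolve_category('l', clashing))     # loop_control
```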

### Dataset

The data is loaded from and saved to the file named by `dataset_name`.

In [5]:
columns_list = ['Name', 'Code', 'Category']

dataset_name = 'parser_dataset.csv'

This is a simple routine that loads the `pd.DataFrame` from the file, or creates an empty one if the file cannot be read.

In [6]:
try:
    dataset = pd.read_csv(dataset_name)
except Exception: 
    dataset = pd.DataFrame(columns = columns_list)    

Check the columns of the dataset, just for safety.

In [7]:
assert (dataset.columns == columns_list).all()

Print some information about the dataset.

In [8]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2903 entries, 0 to 2902
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Name      2903 non-null   object
 1   Code      2903 non-null   object
 2   Category  2903 non-null   object
dtypes: object(3)
memory usage: 68.2+ KB


New entries are first stored in the dictionary `new_data`, so that you can roll changes back.

Each dictionary key corresponds to one column of the dataset, and its value is the list of new entries for that column.

In [9]:
new_data = {col: [] for col in columns_list}
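
As an illustration of the staging idea, the sketch below fills the buffer and then rolls it back. `empty_buffer` is a hypothetical helper, and the variable names and code string are placeholders rather than real dataset entries.

```python
columns_list = ['Name', 'Code', 'Category']


def empty_buffer():
    """Create a fresh staging dictionary with one empty list per column."""
    return {col: [] for col in columns_list}


new_data = empty_buffer()

# Stage two placeholder entries; nothing touches the dataset yet.
for name, category in [('i', 'loop_control'), ('it', 'iterator')]:
    new_data['Name'].append(name)
    new_data['Code'].append('{ /* method body */ }')
    new_data['Category'].append(category)

staged = len(new_data['Name'])
print(f'{staged} entries staged')  # 2 entries staged

# Rolling back means discarding the buffer before it is merged.
new_data = empty_buffer()
```

Keeping the three lists the same length matters, because they later become the columns of one dataframe.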

### Progress file

The progress file stores the paths of all completed files. It is saved and loaded on the same principle as the dataset.

Set the `dir_sep` symbol, which separates entries in the file, and specify the progress file name.

In [10]:
dir_sep = '\n'
files_progress_name = 'file_progress.txt'

Loading routine. During script execution, all completed files are stored in a set, which avoids duplicate entries and makes lookups of a specific element faster.

In [11]:
files_completed = set()

try:
    with open(files_progress_name, 'r') as f:
        # Skip empty entries, so that an empty file yields an empty set
        files_completed = {entry for entry in f.read().split(dir_sep) if entry}
except Exception:
    pass

current_progress = len(files_completed)
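
The save and load round trip of the progress file can be sketched as follows. `save_completed` and `load_completed` are hypothetical helper names and the paths are made up for the example. The sketch writes to a temporary directory, so it does not touch the real progress file, and it skips blank entries so that an empty file yields an empty set.

```python
import os
import tempfile

dir_sep = '\n'


def save_completed(path, completed):
    """Write one completed file path per line."""
    with open(path, 'w') as f:
        f.write(dir_sep.join(sorted(completed)))


def load_completed(path):
    """Read the completed paths back; a missing file means no progress."""
    try:
        with open(path, 'r') as f:
            return {entry for entry in f.read().split(dir_sep) if entry}
    except FileNotFoundError:
        return set()


with tempfile.TemporaryDirectory() as tmp:
    progress_path = os.path.join(tmp, 'file_progress.txt')
    missing = load_completed(progress_path)  # file does not exist yet
    save_completed(progress_path, {'./reps/A.java', './reps/B.java'})
    restored = load_completed(progress_path)

print(len(missing), len(restored))  # 0 2
```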

### Traversing the reps

This code recursively traverses all directories from `reps_dir` and adds all `.java` files to the `code_samples` list.

Each entry of `code_samples` has the form `(path, code)`, where:
 * `path` - relative path to the file
 * `code` - string with all the code in the file

In [17]:
code_samples = []
files_amount = 0

# Traverse all repositories from the list and
# save the code of each *.java file together with its path
for rep in reps_dir:
    walk = os.walk(rep)

    for root, dirs, files in walk:
        for file in files:
            if file.endswith(".java"):
                files_amount += 1
                path = root + '/' + file
                code = text_from_file(path)
                code_samples.append((path, code))
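
For comparison, the same traversal can be sketched with `pathlib`. `collect_java_files` is a hypothetical function, and the demo builds its own temporary directory instead of reading `reps_dir`. Note that the path strings it produces may differ from the `root + '/' + file` form used above, so they would not necessarily match entries already stored in the progress file.

```python
import tempfile
from pathlib import Path


def collect_java_files(repositories):
    """Return (path, code) pairs for every *.java file under the given roots."""
    samples = []
    for rep in repositories:
        # rglob walks the directory tree recursively.
        for path in sorted(Path(rep).rglob('*.java')):
            if path.is_file():
                samples.append((str(path), path.read_text(encoding='utf-8')))
    return samples


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'src'
    src.mkdir()
    (src / 'Main.java').write_text('class Main {}', encoding='utf-8')
    (src / 'notes.txt').write_text('not Java', encoding='utf-8')
    samples = collect_java_files([tmp])

print(len(samples))  # 1
```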

Compare the total number of files with the number of files already completed:

In [18]:
print(f"Completed {current_progress} out of {files_amount}")

Completed 3135 out of 548


## Functions

### GUI

* `print_gui` prints main user interface elements on the screen.

In [None]:
def print_gui(code_fragment, keyword_found):
    print(sep)
    print(f'Parser found keyword {keyword_found}!\n')
    print('List all needed variables in the form:')
    print('"varName typeName"')
    print(f'There are only the following variable categories: {categories_available}.')
    print('You can write either full category name, or only first letter.\n')
    print('To end -- press Enter with empty line. To restart -- type "$r" + Enter')
    print('Invalid input would result in force restart.')
    print(sep)
    print(f'Code fragment:')
    print()
    print(code_fragment)
    print(sep)
    

* `show_to_user` asks the user for variable names, parses them, and adds them to `new_data`

In [13]:
def show_to_user(code_fragment, keyword_found, new_data):
    while True:
        # Printing GUI
        time.sleep(sleep_time)
        IPython.display.clear_output()
        print_gui(code_fragment, keyword_found)
        
        # Input is kept in these lists first, so that the
        # prompt can be restarted without touching new_data
        names = list()
        categories = list()  
        
        line = input()    
        while line != '' and line.find('$r') == -1:
            # Words should be a list of two elements: variable name and its type
            words = line.split(' ')
            
            # Check list size
            if len(words) != 2:
                print('Invalid words amount! Restarting...')
                line = '$r'
                break        
            # Check if code fragment actually contains written variable
            if code_fragment.find(words[0]) == -1:
                print('Invalid variable name! Restarting...')
                line = '$r'
                break
            
            # Validating the type
            category = None
            for cat in categories_available:
                # !!! This is the place where you can change how categories are detected
                # By default, you can mention:
                #     - First letter of category
                #     - First two letters of category
                #     - Full category name
                if words[1] == cat or words[1] == cat[0] or words[1] == cat[0:2]:
                    category = cat
                    break        
                    
            if category is None:
                print('Invalid category! Restarting...')
                line = '$r'
                break    
            
            names.append(words[0])
            categories.append(category) 
            
            line = input()
            
        if line.find('$r') != -1:
            print("Restarting...")
            continue
        break        
    
    # Adding data to temporary data structure
    new_data['Name'] += names
    new_data['Code'] += [code_fragment] * len(names)
    new_data['Category'] += categories
    
    print(sep)
    print('New samples successfully added to database!')

### Validating code fragments 

* `check_variable` returns `True` if the code at position `idx` in `code` matches the given key from `search_keywords`

In [14]:
def check_variable(code, idx, key, isWord):
    prevIdx = idx - 1
    postIdx = idx + len(key)
    
    if re.match(key, code[idx:]) is None:
        return False
    
    # Either the key is not word-only, or the word-boundary rule must hold
    return not isWord or ((prevIdx == -1 or not code[prevIdx].isalnum()) and 
                          (postIdx >= len(code) or not code[postIdx].isalnum()))
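
One caveat when a key is a regular expression: `idx + len(key)` marks the end of the match only if the key is literal text. A possible variant, sketched below under the hypothetical name `check_variable_regex`, takes the end position from the match object instead. Unlike slicing with `code[idx:]`, `Pattern.match(code, idx)` keeps the original string, so anchors such as `^` refer to the real start of the string.

```python
import re


def check_variable_regex(code, idx, key, is_word):
    """Like check_variable, but measures the match instead of the key."""
    match = re.compile(key).match(code, idx)
    if match is None:
        return False
    if not is_word:
        return True

    prev_idx = idx - 1
    post_idx = match.end()  # actual end of the matched text
    left_ok = prev_idx == -1 or not code[prev_idx].isalnum()
    right_ok = post_idx >= len(code) or not code[post_idx].isalnum()
    return left_ok and right_ok


print(check_variable_regex('x = foreach', 4, r'for(each)?', True))  # True
print(check_variable_regex('formatted', 0, 'for', True))            # False
print(check_variable_regex('isformatted', 2, 'for', False))         # True
```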

 * `try_find_keywords` tries to find any keyword in the given code fragment. It calls the `show_to_user` function and returns `True` if it finds a keyword, and returns `False` otherwise

In [15]:
def try_find_keywords(code, new_data):
    lcode = code.lower()
    cur_idx = 0        
    
    while cur_idx < len(lcode):
        # If comment opening symbol found, jump to comment close symbol
        if cur_idx < len(lcode) - 1 and lcode[cur_idx:cur_idx+2] == r'/*':
            cur_idx = lcode.find(r'*/', cur_idx+2)
            
        if cur_idx < len(lcode) - 1 and lcode[cur_idx:cur_idx+2] == r'//':
            cur_idx = lcode.find('\n', cur_idx+2)

        # If string literal beginning found, jump to string literal end
        if lcode[cur_idx] == r'"':
            cur_idx = lcode.find(r'"', cur_idx + 1)

        if cur_idx == -1:
            break

        for (key, isWord) in search_keywords:
            if check_variable(lcode, cur_idx, key, isWord):
                show_to_user(code, key, new_data)

                return True

        cur_idx += 1
            
    return False

 * `try_find_keywords_alternative` works much faster than `try_find_keywords`. However, it also matches keywords inside string literals and comments, and it checks only the first occurrence of each keyword

In [16]:
def try_find_keywords_alternative(code, new_data):
    lcode = code.lower()
    for (key, isWord) in search_keywords:
        idx = lcode.find(key)
        prevIdx = idx - 1
        postIdx = idx + len(key)
        if idx != -1:
            if not isWord or ((prevIdx == -1 or not lcode[prevIdx].isalnum()) and (postIdx >= len(code) or not lcode[postIdx].isalnum())):
                show_to_user(code, key, new_data)
                return True
            
    return False
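
Because `str.find` returns only the first occurrence, a word-only key whose first occurrence lies inside another identifier hides any later valid occurrence. A possible way to examine every occurrence is sketched below. `contains_keyword` is a hypothetical helper that assumes a literal, lower-case key and leaves out the call to `show_to_user`.

```python
def contains_keyword(code, key, is_word):
    """Check every occurrence of `key`, not only the first one."""
    lcode = code.lower()
    idx = lcode.find(key)
    while idx != -1:
        post_idx = idx + len(key)
        left_ok = idx == 0 or not lcode[idx - 1].isalnum()
        right_ok = post_idx >= len(lcode) or not lcode[post_idx].isalnum()
        if not is_word or (left_ok and right_ok):
            return True
        # Continue the search just after the rejected occurrence.
        idx = lcode.find(key, idx + 1)
    return False


body = '{ boolean isFormatted = false; for (;;) { break; } }'
print(contains_keyword(body, 'for', True))  # True
```

Here the first `for` sits inside `isFormatted` and is rejected, and the loop keyword that follows is still found.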

### Others

 * `text_from_file` opens the given file and returns its entire content as a string

In [12]:
def text_from_file(path):
    with open(path, 'rb') as f:
        text = f.read()
    return text.decode('UTF-8')

 * `class_check` returns `True` if the code at the given position (given by the code string and the index) starts with the `class` or `enum` keyword surrounded by whitespace, and `False` otherwise.

In [19]:
def class_check(code, idx):
    keys = ["class", "enum"]
    
    prevIdx = idx - 1
    
    for key in keys:
        postIdx = idx + len(key)
        if code[idx : postIdx] != key:
            continue
    
        # The keyword has to be surrounded by whitespace
        if ((prevIdx == -1 or code[prevIdx].isspace()) and 
                (postIdx >= len(code) or code[postIdx].isspace())):
            return True
    
    return False

## Main parsing loop

The following part tries to spot methods and sends each method body to the functions that search for keywords. If a keyword is found, it starts a dialogue with the user.

In [20]:
for (path, code) in code_samples:
    # Skip files that were already processed
    if path in files_completed:
        continue
        
    # Current index in code
    idx = 0
    # Unclosed curly brackets
    bracket_idx = 0
    # Unclosed class bodies
    class_bracket_idx = 0
    # Index where the last method began; 0 if no method is currently being processed
    method_begin_idx = 0
    
    
    if_variable_added = False
    if_method_proceeded = False
    if_brackets_open = False
    if_comment_open = False
    if_multiline_comment_open = False
         
    while idx < len(code):                
        # Skipping string literals and comments
        if code[idx] == '"':
            idx += 1
            while code[idx] != '"':
                if code[idx] == '\\':
                    idx += 1
                
                idx += 1
                
        if code[idx] == "'":
            idx += 1
            while code[idx] != "'":
                if code[idx] == '\\':
                    idx += 1
                
                idx += 1
                
        if idx+1 < len(code) and code[idx : idx+2] == '//':
            idx = code.find('\n', idx)
        if idx+1 < len(code) and code[idx : idx+2] == '/*':
            idx = code.find('*/', idx)
        
        if idx == -1:
            break
        
        # Counting brackets
        if code[idx] == '{':
            bracket_idx += 1
            
        if code[idx] == '}':
            bracket_idx -= 1
        
        # Try to find new class or close opened one
        if not if_method_proceeded: 
            if class_check(code, idx):
                idx = code.find('{', idx)
                bracket_idx += 1
                class_bracket_idx += 1
                
            elif class_bracket_idx > bracket_idx:
                class_bracket_idx -= 1    
        # Spot new method beginning   
        if code[idx] == '{' and bracket_idx == class_bracket_idx+1 and not if_method_proceeded:
            method_begin_idx = idx
            if_method_proceeded = True
        
        # Spot method closing
        if code[idx] == '}' and bracket_idx == class_bracket_idx and if_method_proceeded:
            if method_begin_idx == 0:
                raise RuntimeError(f"Slicing error in file {path}!")
            
            # Main loop part. Trying to find keywords in found method body
            if try_find_keywords(code[method_begin_idx : idx+1], new_data):
                if_variable_added = True
            method_begin_idx = 0
                
            if_method_proceeded = not if_method_proceeded
        
        # Each loop iteration, index increases
        idx += 1

    if bracket_idx != 0:
        raise NameError(f'Bracket error in file {path}!')

    if if_variable_added:
        print(f'You have finished {path}.')
    files_completed.add(path)
    
    
IPython.display.clear_output()
print('All repos were parsed!')

----------------------------------------
Parser found keyword for!

List all needed variables in the form:
"varName typeName"
There are only the following variable categories: ['loop_control', 'iterator', 'maybe_loop_control', 'break_loop_control'].
You can write either full category name, or only first letter.

To end -- press Enter with empty line. To restart -- type "$r" + Enter
Invalid input would result in force restart.
----------------------------------------
Code fragment:

{
        if (totalOutputStreams == 0) {
            return 0;
        }
        for (int i = ((int)totalOutputStreams) - 1; i >= 0; i--) {
            if (findBindPairForOutStream(i) < 0) {
                return unpackSizes[i];
            }
        }
        return 0;
    }
----------------------------------------


KeyboardInterrupt: Interrupted by user
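
The brace counting behind method detection can be sketched in isolation. The function below is a deliberately simplified, hypothetical `method_bodies`: it assumes a single top-level class, no nested classes, and no braces inside string literals or comments, all of which the loop above handles separately.

```python
def method_bodies(code):
    """Collect the '{...}' blocks opened directly inside a top-level class."""
    depth = 0      # number of currently unclosed curly brackets
    start = None   # index where the current method body began
    bodies = []
    for idx, ch in enumerate(code):
        if ch == '{':
            depth += 1
            # Depth 1 is the class body, depth 2 opens a method body.
            if depth == 2 and start is None:
                start = idx
        elif ch == '}':
            if depth == 2 and start is not None:
                bodies.append(code[start:idx + 1])
                start = None
            depth -= 1
    return bodies


java = 'class A { int f() { for (;;) { break; } return 0; } void g() { } }'
for body in method_bodies(java):
    print(body)
```

Each extracted body keeps its opening and closing bracket, which matches the shape of the code fragment shown to the user above.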

## Saving progress

Show the progress and append all new entries to the dataframe. Also reset `new_data`, so that the main parsing loop can be resumed.

In [28]:
print(f'Amount of examples was {len(dataset)}')
dataset = pd.concat([dataset, pd.DataFrame(new_data)], axis = 0, ignore_index = True)
print(f'Amount of examples now {len(dataset)}')

new_data = {col: [] for col in columns_list}

Amount of examples was 2903
Amount of examples now 2903


Save all the changes to the files.

In [25]:
with open(files_progress_name, 'w') as f:
    f.write(dir_sep.join(files_completed))
dataset.to_csv(dataset_name, index = False)

Check that the file was updated.

In [26]:
# Use for checking purposes
try:
    dataset = pd.read_csv(dataset_name)
except Exception: 
    dataset = pd.DataFrame(columns = columns_list)

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2903 entries, 0 to 2902
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Name      2903 non-null   object
 1   Code      2903 non-null   object
 2   Category  2903 non-null   object
dtypes: object(3)
memory usage: 68.2+ KB


In [27]:
print(dataset['Category'].value_counts())

loop_control          1474
iterator              1055
maybe_loop_control     225
break_loop_control     149
Name: Category, dtype: int64


## Error handling


In [24]:
assert False, 'Do not run this part with "Run all"!'

AssertionError: Do not run this part with "Run all"!

This section is for dealing with errors.
In the following cell, write the path to the file that failed.

In [21]:
err_file_name = './reps/Arduino/arduino-core/src/processing/app/helpers/PreferencesHelper.java' 

Find the code of this file:

In [22]:
err_code = ""
for (name, code) in code_samples:
    if name == err_file_name:
        err_code = code
print(err_code)

package processing.app.helpers;

import java.awt.Color;
import java.awt.Font;

public abstract class PreferencesHelper {

//  /**
//   * Create a Color with the value of the specified key. The format of the color
//   * should be an hexadecimal number of 6 digit, eventually prefixed with a '#'.
//   * 
//   * @param name
//   * @return A Color object or <b>null</b> if the key is not found or the format
//   *         is wrong
//   */
//  static public Color getColor(PreferencesMap prefs, String name) {
//    Color parsed = parseColor(prefs.get(name));
//    if (parsed != null)
//      return parsed;
//    return Color.GRAY; // set a default
//  }
//
//
//  static public void setColor(PreferencesMap prefs, String attr, Color what) {
//    putColor(prefs, attr, what);
//  }
//
//
//  static public Font getFontWithDefault(PreferencesMap prefs, PreferencesMap defaults, String attr) {
//    Font font = getFont(prefs, attr);
//    if (font == null) {
//      String value = defaults.get(attr)

After fixing the code, save this path in a separate file.

In [23]:
name = "error_pathes"
with open("file_progress.txt", 'a') as f:
    f.write(dir_sep + name)