                 ____       __       _                 __   ______            _ __      __     __
                / __ \___  / /______(_)__ _   ______ _/ /  / ____/___  ____  (_) /___  / /_   / /
               / /_/ / _ \/ __/ ___/ / _ \ | / / __ `/ /  / /   / __ \/ __ \/ / / __ \/ __/  / / 
              / _, _/  __/ /_/ /  / /  __/ |/ / /_/ / /  / /___/ /_/ / /_/ / / / /_/ / /_   /_/  
             /_/ |_|\___/\__/_/  /_/\___/|___/\__,_/_/   \____/\____/ .___/_/_/\____/\__/  (_)   
                                                                   /_/                           
                                       (learning every nanosecond)

## Getting code
The goal of this algorithm is to retrieve a relevant code snippet given an English description and a series of already defined code/description pairs.
![ret](../images/retrievalHighLevel.png)

In [156]:
import linecache
import pyndri
import os
import sys
import numpy as np
np.random.seed(0)
import re, string, timeit

## Indri 
Indri is the retrieval engine that I am currently using since it has a nice interface with python and has some of the algorithms I need.

Indri takes files in an XML format. Sentence pairs are usually stored line by line in a file. So we will need to convert from single line to formatted XML.

### Datasets 
We will currently make a full explanation for only one dataset: Django. This is because it is relatively small (18k sentences) and clean. Further descriptions and analysis are found in other notebooks in this directory.

In [160]:
def show_sample(fp, src_ext=".src", tgt_ext=".tgt", lines=[3,21,80,99]):
    linecache.clearcache()
    for l in lines:
        print(linecache.getline(train_fp+src_ext, l))
        print("LINE: {} \nSOURCE:    {} \nTARGET:     {}\n".format(l, 
                                                                   linecache.getline(fp+src_ext, l), 
                                                                   linecache.getline(fp+tgt_ext, l)))

In [101]:
django_fp = "../datasets/django/all"
show_sample(django_fp, src_ext=".desc", tgt_ext=".code", lines=[13,14])

LINE: 13 
SOURCE:      define the function get_cache with backend and dictionary pair of elements kwargs as arguments.
 
TARGET:         def get_cache ( backend , ** kwargs ) :


LINE: 14 
 




In [102]:
!head -5 ../datasets/django/all.desc 

  from threading import local into default name space.
  from django.conf import settings into default name space.
  from django.core import signals into default name space.


In [103]:
!head -5 ../datasets/django/all.code

 from threading import local
  from django . conf import settings
 from django . core import signals


In [104]:
dirName = "temp"
 
try:
    # Create target Directory
    os.mkdir(dirName)
    print("Directory " , dirName ,  " Created ") 
except FileExistsError:
    print("Directory " , dirName ,  " already exists")

Directory  temp  Created 


### Dataset and train / test split
Copy the full dataset to the temp folder. We then split the data into a training and testing set at around 90% / 10%

In [107]:
train_ratio = 0.9 # this means 90% of the data will be used for training, thus 10% for testing
num_samples = sum(1 for line in open(django_fp + ".desc"))
train_cutoff = int(num_samples * train_ratio)

lines = np.arange(num_samples)
np.random.shuffle(lines)

train_lines = lines[:train_cutoff]
test_lines = lines[train_cutoff:]

In [139]:
train_fp = "temp/retrieval_train"
test_fp = "temp/retrieval_test"

##### Train split for .desc and .code

In [109]:
with open(train_fp + ".desc", "w") as out:
    for l in train_lines:
        src = linecache.getline(django_fp + ".desc", l)
        out.write(src)

In [110]:
with open(train_fp + ".code", "w") as out:
    for l in train_lines:
        src = linecache.getline(django_fp + ".code", l)
        out.write(src)

##### Test split for .desc and .code

In [111]:
with open(test_fp + ".desc", "w") as out:
    for l in test_lines:
        src = linecache.getline(django_fp + ".desc", l)
        out.write(src)

In [112]:
with open(test_fp + ".code", "w") as out:
    for l in test_lines:
        src = linecache.getline(django_fp + ".code", l)
        out.write(src)

### Convert to TrecText format

In [113]:
with open(train_fp + ".desc", "r") as f, open("temp/train_desc.trectext", "w") as out:
    count = 0
    while True:
        line = f.readline()
        
        if not line :
            break
            
        out.write("<DOC>\n  <DOCNO>{}</DOCNO>\n  <TEXT>\n{}  </TEXT>\n</DOC>\n".format(count, line))
        count += 1

### Create the index with indri
To create an index we need to supply Indri with a parameter file specifying how to handle each document. Indri will then generate an index folder with is fast to query.

In [114]:
with open("temp/IndriBuildIndex.conf", "w") as out:
    conf = """
<parameters>
<index>temp/django_index/</index>
<memory>1024M</memory>
<storeDocs>true</storeDocs>
<corpus><path>temp/train_desc.trectext</path><class>trectext</class></corpus>
<stemmer><name>krovetz</name></stemmer>
</parameters>"""
    
    out.write(conf)
    

In [115]:
!IndriBuildIndex temp/IndriBuildIndex.conf

kstem_add_table_entry: Duplicate word emeritus will be ignored.
0:00: Created repository temp/django_index/
0:00: Opened temp/train_desc.trectext
0:06: Documents parsed: 16923 Documents indexed: 16923
0:06: Closed temp/train_desc.trectext
0:06: Closing index
0:07: Finished


In [116]:
index = pyndri.Index("temp/django_index/")
env = pyndri.TFIDFQueryEnvironment(index)

## Query example

In [165]:
linecache.getline(train_fp+".desc", 1)

"  replace every occurrence of '\\t' in s with '\\\\t'.\n"

In [166]:
train_fp+".desc"

'temp/retrieval_train.desc'

In [163]:
results = env.query('error handler', results_requested=5)

In [164]:
show_sample(train_fp, src_ext=".desc", tgt_ext=".code", lines=[doc[0] for doc in results])

  for every handler in handlers,

LINE: 16012 
SOURCE:      for every handler in handlers,
 
TARGET:          for handler in handlers :


  for every handler in handlers,

LINE: 10339 
SOURCE:      for every handler in handlers,
 
TARGET:                               for handler in handlers :


  substitute self._upload_handlers for handlers.

LINE: 8784 
SOURCE:      substitute self._upload_handlers for handlers.
 
TARGET:      handlers = self . _upload_handlers


  substitute upload_handlers for self._upload_handlers.

LINE: 7760 
SOURCE:      substitute upload_handlers for self._upload_handlers.
 
TARGET:       self . _upload_handlers = upload_handlers


  substitute _upload_handlers for self.__upload_handlers.

LINE: 4361 
SOURCE:      substitute _upload_handlers for self.__upload_handlers.
 
TARGET:      self . _upload_handlers = upload_handlers




In [119]:
top_code = [linecache.getline(django_fp + ".code", doc[0]) for doc in results]

In [121]:
with open(test_fp + ".desc", "r") as f, open("temp/retrieval_predictions" + ".code", "w") as out:
    lines = f.readlines()
    l = 1
    for line in lines:
        result = env.query(line.translate(str.maketrans('', '', string.punctuation)), results_requested=1)
        out.write(linecache.getline(train_fp + ".code",result[0][0]))
        if l % 500 == 1:
            print("LINE: {} \nQUERY:     {}\nPRED DESCRIPTION:    {} \nPRED CODE:     {}\nTRUTH    {}\n".format(
                l, 
                line, 
                linecache.getline(train_fp + ".desc", result[0][0]), 
                linecache.getline(train_fp + ".code", result[0][0]), 
                linecache.getline(test_fp + ".code", l)))
        l += 1

LINE: 1 
QUERY:       if p_pattern starts with a string '^',

PRED DESCRIPTION:      open the file named filepath in read mode, with file descriptor renamed to fp perform,
 
PRED CODE:                       with open ( filepath , 'rb' ) as fp :

TRUTH              return commands


LINE: 501 
QUERY:       return value.

PRED DESCRIPTION:      substitute sys.maxsize for MAXSIZE.
 
PRED CODE:       MAXSIZE = sys . maxsize

TRUTH     return


LINE: 1001 
QUERY:       define the method _isdst with 2 arguments self and dt.

PRED DESCRIPTION:      define the method eval with 2 arguments self and context.
 
PRED CODE:        def eval ( self , context ) :

TRUTH     return changeset




IndexError: tuple index out of range

In [125]:
"if p_pattern starts with a string '^',".translate(str.maketrans('', '', string.punctuation))

'if ppattern starts with a string '