                 ____       __       _                 __   ______            _ __      __     __
                / __ \___  / /______(_)__ _   ______ _/ /  / ____/___  ____  (_) /___  / /_   / /
               / /_/ / _ \/ __/ ___/ / _ \ | / / __ `/ /  / /   / __ \/ __ \/ / / __ \/ __/  / / 
              / _, _/  __/ /_/ /  / /  __/ |/ / /_/ / /  / /___/ /_/ / /_/ / / / /_/ / /_   /_/  
             /_/ |_|\___/\__/_/  /_/\___/|___/\__,_/_/   \____/\____/ .___/_/_/\____/\__/  (_)   
                                                                   /_/                           
                                       (learning every nanosecond)

## Getting code
Given an English description and a collection of existing description/code pairs, the goal of this algorithm is to retrieve the most relevant code snippet.
![ret](../images/retrievalHighLevel.png)
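
To make the idea concrete, here is a toy sketch that does not use Indri at all. It ranks a handful of description/code pairs (copied from the Django samples shown later in this notebook) by a simple TF-IDF-style overlap with the query and returns the code attached to the best-matching description. The names `PAIRS`, `tokenize` and `retrieve` are hypothetical, and the weighting is a simplification chosen for illustration, not the scoring that Indri applies.

```python
import math
from collections import Counter

# Toy corpus of (description, code) pairs in the style of the Django dataset.
PAIRS = [
    ("from threading import local into default name space.",
     "from threading import local"),
    ("for every handler in handlers,",
     "for handler in handlers :"),
    ("define the function get_cache with backend and dictionary pair of elements kwargs as arguments.",
     "def get_cache ( backend , ** kwargs ) :"),
]


def tokenize(text):
    """Lowercase, split on whitespace, and trim surrounding '.' and ','."""
    tokens = (tok.strip(".,") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def retrieve(query, pairs, top_k=1):
    """Return the code of the top_k pairs whose description best matches query."""
    docs = [Counter(tokenize(desc)) for desc, _ in pairs]
    n_docs = len(docs)
    # Document frequency: in how many descriptions each term occurs.
    df = Counter(term for doc in docs for term in doc)
    query_terms = set(tokenize(query))

    scored = []
    for i, doc in enumerate(docs):
        # Sum of term frequency times a smoothed inverse document frequency.
        score = sum(doc[t] * math.log(1 + n_docs / df[t])
                    for t in query_terms if t in doc)
        scored.append((score, i))

    # Highest score first; ties are broken by original order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [pairs[i][1] for _, i in scored[:top_k]]


best = retrieve("for every handler in handlers", PAIRS)
```

Note that this sketch always returns `top_k` snippets, even when no query term matches any description. A real system would probably want a score threshold.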

In [1]:
import linecache
import pyndri
import os
import sys

## Indri 
Indri is the retrieval engine I am currently using, since it has a convenient Python interface (pyndri) and implements some of the algorithms I need.

Indri takes files in an XML-style format, whereas the sentence pairs are stored one per line in plain-text files. Each line therefore has to be wrapped in its own formatted document entry before it can be indexed.

### Datasets 
For now, only one dataset is explained in full: Django. It is relatively small (18k sentences) and clean. Further descriptions and analysis are found in other notebooks in this directory.

In [2]:
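def show_sample(fp, src_ext=".src", tgt_ext=".tgt", lines=(3, 21, 80, 99)):
    """Print the source/target pair found at each 1-based line number in `lines`."""
    for line_no in lines:
        print("LINE: {} \nSOURCE:    {} \nTARGET:     {}\n".format(
            line_no,
            linecache.getline(fp + src_ext, line_no),
            linecache.getline(fp + tgt_ext, line_no)))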

In [3]:
django_fp = "../datasets/django/all"
show_sample(django_fp, src_ext=".desc", tgt_ext=".code", lines=[13,14])

LINE: 13 
SOURCE:      define the function get_cache with backend and dictionary pair of elements kwargs as arguments.
 
TARGET:         def get_cache ( backend , ** kwargs ) :


LINE: 14 
 




In [5]:
!head -5 ../datasets/django/all.desc 

  from threading import local into default name space.
  from django.conf import settings into default name space.
  from django.core import signals into default name space.


In [4]:
!head -5 ../datasets/django/all.code

 from threading import local
  from django . conf import settings
 from django . core import signals


In [6]:
dirName = "temp"
 
try:
    # Create target Directory
    os.mkdir(dirName)
    print("Directory " , dirName ,  " Created ") 
except FileExistsError:
    print("Directory " , dirName ,  " already exists")

Directory  temp  Created 


### Convert to TrecText format

In [11]:
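with open(django_fp + ".desc", "r") as f, open("temp/django_desc.trectext", "w") as out:
    # Wrap every description line in its own <DOC> entry.
    # DOCNO is the 0-based line index; the queries below identify documents
    # by the id that pyndri returns, not by this DOCNO.
    for count, line in enumerate(f):
        out.write("<DOC>\n  <DOCNO>{}</DOCNO>\n  <TEXT>\n{}  </TEXT>\n</DOC>\n".format(count, line))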
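
Before indexing, it can be worth checking that the conversion produced exactly one document per input line. The sketch below is not part of the original pipeline: `read_trectext` is a hypothetical helper that parses the layout written above with a regular expression, and the two sample lines are the first descriptions of the Django dataset, held in memory so that the example runs on its own.

```python
import re

# Matches one <DOC> entry in the layout produced by the conversion cell.
DOC_PATTERN = re.compile(
    r"<DOC>\s*<DOCNO>(.*?)</DOCNO>\s*<TEXT>\n(.*?)\s*</TEXT>\s*</DOC>",
    re.DOTALL,
)


def read_trectext(text):
    """Return a list of (docno, body) pairs parsed from TrecText-formatted text."""
    return [(docno.strip(), body.strip()) for docno, body in DOC_PATTERN.findall(text)]


# In-memory stand-in for the description file.
sample_lines = [
    "from threading import local into default name space.\n",
    "from django.conf import settings into default name space.\n",
]

# Same format string as the conversion cell.
trectext = "".join(
    "<DOC>\n  <DOCNO>{}</DOCNO>\n  <TEXT>\n{}  </TEXT>\n</DOC>\n".format(i, line)
    for i, line in enumerate(sample_lines)
)

docs = read_trectext(trectext)
# One document per input line, numbered from 0.
assert len(docs) == len(sample_lines)
```

To check the real file, pass the contents of `temp/django_desc.trectext` to `read_trectext` and compare the number of documents with the number of lines in `all.desc`.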

### Create the index with indri
To create an index we need to supply Indri with a parameter file specifying how to handle each document. Indri will then generate an index folder which is fast to query.

In [18]:
with open("temp/IndriBuildIndex.conf", "w") as out:
    conf = """
<parameters>
<index>temp/django_index/</index>
<memory>1024M</memory>
<storeDocs>true</storeDocs>
<corpus><path>temp/django_desc.trectext</path><class>trectext</class></corpus>
<stemmer><name>krovetz</name></stemmer>
</parameters>"""
    
    out.write(conf)
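
If several datasets need to be indexed, the same parameter file could be generated programmatically instead of being written by hand. The sketch below builds the XML with the standard library, which also guarantees well-formed output. `build_indri_params` is a hypothetical helper, and it only emits the parameters already used in the cell above.

```python
import xml.etree.ElementTree as ET


def build_indri_params(index_dir, corpus_path, corpus_class="trectext",
                       memory="1024M", stemmer="krovetz"):
    """Return the parameter XML for IndriBuildIndex as a string (hypothetical helper)."""
    params = ET.Element("parameters")
    ET.SubElement(params, "index").text = index_dir
    ET.SubElement(params, "memory").text = memory
    ET.SubElement(params, "storeDocs").text = "true"

    # <corpus> holds the input path and the document format.
    corpus = ET.SubElement(params, "corpus")
    ET.SubElement(corpus, "path").text = corpus_path
    ET.SubElement(corpus, "class").text = corpus_class

    # <stemmer> names the stemming algorithm applied at index time.
    stem = ET.SubElement(params, "stemmer")
    ET.SubElement(stem, "name").text = stemmer

    return ET.tostring(params, encoding="unicode")


conf_xml = build_indri_params("temp/django_index/", "temp/django_desc.trectext")
```

The resulting string can be written to `temp/IndriBuildIndex.conf` exactly as in the cell above.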
    

In [19]:
!IndriBuildIndex temp/IndriBuildIndex.conf

kstem_add_table_entry: Duplicate word emeritus will be ignored.
0:00: Created repository temp/django_index/
0:00: Opened temp/django_desc.trectext
0:10: Documents parsed: 18805 Documents indexed: 18805
0:10: Closed temp/django_desc.trectext
0:10: Closing index
0:11: Finished


In [21]:
index = pyndri.Index("temp/django_index/")
env = pyndri.TFIDFQueryEnvironment(index)

In [44]:
results = env.query('for every handler in handlers', results_requested=5)

In [45]:
show_sample(django_fp, src_ext=".desc", tgt_ext=".code", lines=[doc[0] for doc in results])

LINE: 9733 
SOURCE:      for every handler in handlers,
 
TARGET:         for handler in handlers :


LINE: 9698 
SOURCE:      for every handler in handlers,
 
TARGET:                               for handler in handlers :


LINE: 9650 
SOURCE:      for every handler in handlers,
 
TARGET:          for handler in handlers :


LINE: 9747 
SOURCE:      for every handler in self._upload_handlers,
 
TARGET:                  for handler in self . _upload_handlers :


LINE: 10074 
SOURCE:      for every handler in settings.FILE_UPLOAD_HANDLERS,
 
TARGET:               self . _upload_handlers = [ uploadhandler . load_handler ( handler , self )  for handler in settings . FILE_UPLOAD_HANDLERS ]




In [32]:
top_code = [linecache.getline(django_fp + ".code", doc[0]) for doc in results]
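
The lines returned by `linecache.getline` keep their leading padding and trailing newline, and the scores are dropped. A small helper can keep both pieces of information together. This is a sketch, assuming (as the output above suggests) that each result is a `(document id, score)` tuple whose id equals the 1-based line number in the dataset files. `fetch_code` is a hypothetical name, and the demo uses a throwaway file with made-up scores so that it runs without an index.

```python
import linecache
import os
import tempfile


def fetch_code(results, code_path):
    """Return (line_number, score, code) for each (doc_id, score) result."""
    return [
        (doc_id, score, linecache.getline(code_path, doc_id).strip())
        for doc_id, score in results
    ]


# Demo on a temporary file holding the first two lines of the code file.
with tempfile.TemporaryDirectory() as tmp:
    code_path = os.path.join(tmp, "all.code")
    with open(code_path, "w") as f:
        f.write(" from threading import local\n  from django . conf import settings\n")

    fake_results = [(2, 1.5), (1, 0.7)]  # made-up ids and scores, best match first
    ranked = fetch_code(fake_results, code_path)
```

With the real index, the call would be `fetch_code(results, django_fp + ".code")`.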