Links to the datasets used in this project: <br>
https://github.com/github/CodeSearchNet <br>
https://www.kaggle.com/linkanjarad/coding-problems-and-solution-python-code?select=ProblemSolutionPythonV3.csv <br>
https://www.kaggle.com/veeralakrishna/python-code-data <br>

In [1]:
import numpy as np
import pandas as pd
import os
from multiprocessing import Pool
import re
import pysbd # For sentence segmentation
from langdetect import detect_langs
import string

## Data Exploration and Pre-processing

#### About the Data
We used three different datasets for this project, two sourced from Kaggle and one provided by GitHub.

The first two are collections of coding challenges from various sources such as w3resource, geeksforgeeks, etc. Each example usually has a prompt written in plain English and a solution written in Python. The coding challenges cover these topics:
* Working with strings, lists, arrays, tuples, dictionaries, CSV, JSON
* Utilizing modules including NumPy, BeautifulSoup, tkinter, Pandas, random, os, re, datetime
* File I/O
* Loops and Conditionals
* Functions (including Lambda) and Classes
* OOP and DSA
* Searching & Sorting
* Pattern Printing 

Together, these two datasets contain around 8,000 code examples, each paired with its prompt.

Our third dataset is CodeSearchNet by GitHub. The full dataset contains 2 million comment and code pairs across multiple programming languages, but we only use a selection of those written in Python here. CodeSearchNet was originally intended to improve code search using natural language, but it should work well for code summarization too. The CodeSearchNet data contains more advanced functions and class definitions and is inherently more complex than the coding challenge data.


#### Preprocessing Coding Challenge Data 
*Loading in the data from directory*

In [2]:
# https://www.kaggle.com/linkanjarad/coding-problems-and-solution-python-code?select=ProblemSolutionPythonV3.csv
df = pd.read_csv("Data/ProblemSolutionPythonV3.csv", index_col=0) #small dataset 1
df.head()

Unnamed: 0,Problem,Python Code
0,Write a NumPy program to repeat elements of an...,"import numpy as np\rx = np.repeat(3, 4)\rprint..."
1,Write a Python function to create and print a ...,def printValues():\n\tl = list()\n\tfor i in r...
2,Write a Python program to remove duplicates fr...,"import itertools\rnum = [[10, 20], [40], [30, ..."
3,Write a NumPy program to compute the x and y c...,import numpy as np\rimport matplotlib.pyplot a...
4,Write a Python program to alter a given SQLite...,import sqlite3\rfrom sqlite3 import Error\rdef...


*Loading code data*

In [3]:
# https://www.kaggle.com/veeralakrishna/python-code-data
with open('Data/Python_code_data.txt', encoding="utf-8") as f: #small dataset 2
    lines = f.readlines() # the with block closes the file automatically

#### Exploring a few samples from both datasets

In [4]:
rand_example = np.random.choice(df.shape[0])
print("#"*120 + "\n" + "#"*22, "-"*30, " Problem  ", "-"*30, "#"*22 + "\n" + "#"*120, '\n')
print(df['Problem'].iloc[rand_example])
print('\n' + "#"*120 + "\n" + "#"*22, "-"*30, "  Function  ", "-"*30, "#"*22 + "\n" + "#"*120, '\n')
print(df['Python Code'].iloc[rand_example])

########################################################################################################################
###################### ------------------------------  Problem   ------------------------------ ######################
######################################################################################################################## 

Write a Python Program for ShellSort

########################################################################################################################
###################### ------------------------------   Function   ------------------------------ ######################
######################################################################################################################## 

# Python program for implementation of Shell Sort
  
def shellSort(arr):
  
    # Start with a big gap, then reduce the gap
    n = len(arr)
    gap = n/2
  
    # Do a gapped insertion sort for this gap size.
    # The first gap elem

In [5]:
rand_example = np.random.choice(df.shape[0])
print("#"*120 + "\n" + "#"*22, "-"*30, " Problem  ", "-"*30, "#"*22 + "\n" + "#"*120, '\n')
print(df['Problem'].iloc[rand_example])
print('\n' + "#"*120 + "\n" + "#"*22, "-"*30, "  Function  ", "-"*30, "#"*22 + "\n" + "#"*120, '\n')
print(df['Python Code'].iloc[rand_example])

########################################################################################################################
###################### ------------------------------  Problem   ------------------------------ ######################
######################################################################################################################## 

Write a Python program to parse a given CSV string and get the list of lists of string values. Use csv.reader

########################################################################################################################
###################### ------------------------------   Function   ------------------------------ ######################
######################################################################################################################## 

import csvcsv_string = """1,2,34,5,67,8,9"""print("Original string:")print(csv_string)lines = csv_string.splitlines()print("List of CSV formatted strings

In [6]:
rand_example = np.random.choice(df.shape[0])
print("#"*120 + "\n" + "#"*22, "-"*30, " Problem  ", "-"*30, "#"*22 + "\n" + "#"*120, '\n')
print(df['Problem'].iloc[rand_example])
print('\n' + "#"*120 + "\n" + "#"*22, "-"*30, "  Function  ", "-"*30, "#"*22 + "\n" + "#"*120, '\n')
print(df['Python Code'].iloc[rand_example])

########################################################################################################################
###################### ------------------------------  Problem   ------------------------------ ######################
######################################################################################################################## 

Write a Pandas program to create a time series object with a time zone. 

########################################################################################################################
###################### ------------------------------   Function   ------------------------------ ######################
######################################################################################################################## 

import pandas as pdprint("Timezone: Europe/Berlin:")print("Using pytz:")date_pytz = pd.Timestamp('2019-01-01', tz = 'Europe/Berlin')print(date_pytz.tz)  print("Using dateutil:")date_util = pd.

In [7]:
print(''.join(lines[0:30]))

# write a python program to add two numbers 
num1 = 1.5
num2 = 6.3
sum = num1 + num2
print(f'Sum: {sum}')


# write a python function to add two user provided numbers and return the sum
def add_two_numbers(num1, num2):
    sum = num1 + num2
    return sum


# write a program to find and print the largest among three numbers

num1 = 10
num2 = 12
num3 = 14
if (num1 >= num2) and (num1 >= num3):
   largest = num1
elif (num2 >= num1) and (num2 >= num3):
   largest = num2
else:
   largest = num3
print(f'largest:{largest}')


# write a program to find and print the smallest among three numbers
num1 = 10
num2 = 12



#### Cleaning Coding Challenge Data

We first remove unnecessary language from the prompts, since phrases like 'write a python function' or 'write a numpy program' add nothing to the description of the task.

In [8]:
# drop the first five words of each prompt ('write a python program to', 'write a python function to', ... etc);
# prompts whose lead-in is not exactly five words long are trimmed incorrectly
df['Problem'] = df['Problem'].apply(lambda x: ' '.join(x.split()[5:]))

*Looking at examples*

In [9]:
rand_example = np.random.choice(df.shape[0])
print(df['Problem'].iloc[rand_example])

in an array


In [10]:
rand_example = np.random.choice(df.shape[0])
print(df['Problem'].iloc[rand_example])

get the current date, oldest date and number of days between Current date and oldest date of Ufo dataset.


In [11]:
rand_example = np.random.choice(df.shape[0])
print(df['Problem'].iloc[rand_example])

create an array of all the even integers from 30 to 70.
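
Dropping a fixed number of words can cut into the task description when a prompt uses a different lead-in (the `in an array` example above has lost most of its text). As a minimal sketch of an alternative, and not what this notebook actually ran, the lead-in could be matched with a regular expression so that only the boilerplate is removed. The pattern and the helper name `strip_prompt_prefix` are assumptions made for illustration.

```python
import re

# Hypothetical pattern: "write a/an [one word] program|function|script|code to|that|which|for "
PROMPT_PREFIX = re.compile(
    r"^\s*write\s+an?\s+(?:\w+\s+)?(?:program|function|script|code)\s+(?:to|that|which|for)\s+",
    flags=re.IGNORECASE,
)

def strip_prompt_prefix(prompt: str) -> str:
    """Remove a leading 'Write a ... program to' phrase; leave other prompts untouched."""
    return PROMPT_PREFIX.sub("", prompt, count=1)

print(strip_prompt_prefix("Write a NumPy program to repeat elements of an array"))
print(strip_prompt_prefix("Write a Python Program for ShellSort"))
```

It could be applied with `df['Problem'].apply(strip_prompt_prefix)`; prompts that do not match the pattern are returned unchanged rather than truncated.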


**Cleaning the Python Code Data**

The second dataset was provided as a plain txt file, so we needed to identify which lines are code and which lines are problem prompts (comments).

In [12]:
# comments: the prompt lines; comment_lines: their line numbers in the .txt file; code: the code blocks.
# comments and code will eventually become the two columns of the dataframe
comments = []
comment_lines = []
code = []

# for each line, if it starts with a '#' character, save that line as a comment line
for i in np.arange(len(lines)):
    if(lines[i][0]=='#'):
        comments.append(lines[i])
        comment_lines.append(i)
        
# testing to see which lines are comment lines
print(comment_lines[0:4])
print(lines[0])
print(lines[7])

[0, 7, 13, 27]
# write a python program to add two numbers 

# write a python function to add two user provided numbers and return the sum



In [13]:
# populate list of code
for index, elem in enumerate(comment_lines):
    if index+1 < len(comment_lines) and index - 1 >= 0:
        prev = comment_lines[index-1]
        code.append(lines[prev+1:elem])
        
# create dataframe
df1 = pd.DataFrame(list(zip(comments,code)),columns=['comments','code'])

#convert list of lines to one string
df1['code'] = df1['code'].apply(lambda x: ' '.join(x))
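
The pairing logic above can also be written as a single pass over the file. The following is a sketch under stated assumptions rather than the notebook's code: `pair_comments_with_code` is a hypothetical helper, it keeps the final comment (which the cell above drops), and it joins lines with `''` so the original indentation is preserved.

```python
def pair_comments_with_code(lines):
    """Pair every line starting with '#' with the lines that follow it, up to the next '#' line."""
    pairs = []
    prompt, body = None, []
    for line in lines:
        if line.startswith('#'):
            if prompt is not None:
                pairs.append((prompt, ''.join(body)))
            prompt, body = line, []
        elif prompt is not None:
            body.append(line)
    if prompt is not None:  # flush the last prompt
        pairs.append((prompt, ''.join(body)))
    return pairs

sample = ["# add\n", "a = 1\n", "b = 2\n", "\n", "# sub\n", "c = 3\n"]
for prompt, code_block in pair_comments_with_code(sample):
    print(repr(prompt), repr(code_block))
```

Like the original cell, this treats every `#` line as a new prompt, so inline comments inside a solution still split it into separate examples.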

*Looking at examples*

In [14]:
rand_example = np.random.choice(df1.shape[0])
print("#"*120 + "\n" + "#"*22, "-"*30, " Problem  ", "-"*30, "#"*22 + "\n" + "#"*120, '\n')
print(df1['comments'].iloc[rand_example])
print('\n' + "#"*120 + "\n" + "#"*22, "-"*30, "  Function  ", "-"*30, "#"*22 + "\n" + "#"*120, '\n')
print(df1['code'].iloc[rand_example])

########################################################################################################################
###################### ------------------------------  Problem   ------------------------------ ######################
######################################################################################################################## 

# Write a program to print the sum of squares of first n natural numbers


########################################################################################################################
###################### ------------------------------   Function   ------------------------------ ######################
######################################################################################################################## 

n = 21
 sum_n = 0
 for i in range(1, n+1):
     sum_n += i**2
 print(sum_n)
 
 



The second dataset was less organized, so we apply a set of manual rules to clean the problem prompts and to remove the examples that could not be cleaned.

*Cleaning df1*

In [15]:
df1['comments'] = df1['comments'].str.lower()
count = 0
remaining = 0
for row in df1['comments']:
    if len(df1['comments'][count].split())<=3: # blank out comments with at most three tokens (the '#' plus two words)
        df1['comments'][count] = ''
    df1['comments'][count] = df1['comments'][count].replace('# write a python program to ','')
    df1['comments'][count] = df1['comments'][count].replace('# write a program to ','')
    df1['comments'][count] = df1['comments'][count].replace('# write program to ','')
    df1['comments'][count] = df1['comments'][count].replace('# write program which ','')
    df1['comments'][count] = df1['comments'][count].replace('# write a program which ','')
    df1['comments'][count] = df1['comments'][count].replace('# write a ','')
    df1['comments'][count] = df1['comments'][count].replace('# write python code to demonstrate ','')
    df1['comments'][count] = df1['comments'][count].replace('# write python3 code to demonstrate','')
    df1['comments'][count] = df1['comments'][count].replace('# please write a program which ','')
    df1['comments'][count] = df1['comments'][count].replace('# please write a program to ','')
    df1['comments'][count] = df1['comments'][count].replace('# define a function that can ','')
    df1['comments'][count] = df1['comments'][count].replace('# define a function which can ','')
    if df1['comments'][count].find('write a program to')!=-1:
        df1['comments'][count] = df1['comments'][count][df1['comments'][count].find('write a program to')+19:]
    if df1['comments'][count].find('write a python program that')!=-1:
        df1['comments'][count] = df1['comments'][count][df1['comments'][count].find('write a python program that')+28:]
    if df1['comments'][count].find('function to')!=-1:
        df1['comments'][count] = df1['comments'][count][df1['comments'][count].find('function to')+12:]
    if df1['comments'][count].find('program to')!=-1:
        df1['comments'][count] = df1['comments'][count][df1['comments'][count].find('program to')+11:]
    if df1['comments'][count].find('define')!=-1 and df1['comments'][count].find('class')!=-1:
        df1['comments'][count] = ""
    if df1['comments'][count].find('# in[')!=-1:
        df1['comments'][count] = ""
    if df1['comments'][count].find('printing result')!=-1:
        df1['comments'][count] = ""
    if df1['comments'][count].find('code to')!=-1:
        df1['comments'][count] = df1['comments'][count][df1['comments'][count].find('code to')+8:]
    if df1['comments'][count].find('function that')!=-1:
        df1['comments'][count] = df1['comments'][count][df1['comments'][count].find('function that')+14:]
    if df1['comments'][count].find('function which')!=-1:
        df1['comments'][count] = df1['comments'][count][df1['comments'][count].find('function which')+15:]
    if df1['comments'][count].find('#')==0:
        remaining+=1
    count+=1
print(df1.shape[0]-remaining,'clean examples')
print(remaining, 'remaining')
print(round(((df1.shape[0]-remaining)/df1.shape[0])*100,2) ,'% clean')

4111 clean examples
845 remaining
82.95 % clean
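
The rules above could also be collected into one function that cleans a single comment, which keeps the rule list in one place and avoids element-wise assignments such as `df1['comments'][count] = ...`. This is only a sketch: `clean_comment` and the three rule tables are hypothetical names, the tables hold a subset of the rules used above, and the rules are applied in a simplified order.

```python
# Hypothetical rule tables containing a subset of the rules from the cell above
STRIP_PREFIXES = (
    '# write a python program to ',
    '# write a program to ',
    '# write a ',
)
CUT_AFTER = ('function to', 'program to', 'code to')
DROP_IF_CONTAINS = ('# in[', 'printing result')

def clean_comment(comment):
    """Return the cleaned prompt, or '' when the comment should be dropped."""
    comment = comment.lower()
    if len(comment.split()) <= 3:          # '#' plus at most two words
        return ''
    for prefix in STRIP_PREFIXES:
        comment = comment.replace(prefix, '')
    for marker in CUT_AFTER:               # keep only the text after the marker
        position = comment.find(marker)
        if position != -1:
            comment = comment[position + len(marker) + 1:]
    if any(marker in comment for marker in DROP_IF_CONTAINS):
        return ''
    return comment

print(repr(clean_comment('# write a python function to merge two lists\n')))
```

Applied as `df1['comments'] = df1['comments'].apply(clean_comment)`, the whole column is assigned at once.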


In [16]:
# replace empty prompts with null and then drop
df1['comments'] = df1['comments'].replace('', np.nan)
df1 = df1.dropna()

# drop the remaining examples
df1 = df1[df1['comments'].str.find('#') == -1]

*New number of examples*

In [17]:
df1.shape

(3735, 2)

In [18]:
df1.head()

Unnamed: 0,comments,code
0,add two numbers \n,num1 = 1.5\n num2 = 6.3\n sum = num1 + num2\n ...
1,add two user provided numbers and return the s...,"def add_two_numbers(num1, num2):\n sum = n..."
2,find and print the largest among three numbers\n,\n num1 = 10\n num2 = 12\n num3 = 14\n if (num...
3,find and print the smallest among three numbers\n,num1 = 10\n num2 = 12\n num3 = 14\n if (num1 <...
4,merge two given lists into one\n,"def merge_lists(l1, l2):\n return l1 + l2\..."


**Combining both datasets**

Now that we have both datasets processed we can combine them:

In [19]:
print(df1.columns)
print(df.columns)
df1.rename(columns={'comments':'Problem','code':'Python Code'},inplace=True)
print(df1.columns)

Index(['comments', 'code'], dtype='object')
Index(['Problem', 'Python Code'], dtype='object')
Index(['Problem', 'Python Code'], dtype='object')


In [20]:
df = pd.concat([df, df1]) # DataFrame.append was removed in pandas 2.0
df.shape

(7042, 2)

In [21]:
# lowercase everything
df['Problem'] = df['Problem'].apply(lambda x: str(x).lower())
df['Python Code'] = df['Python Code'].apply(lambda x: str(x).lower())

*Removing imports and examples that are not functions*

In [22]:
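# Drop everything before the first 'def' (imports, comments); code without 'def' is left unchanged
def strip_before_def(code):
    index = code.find('def')
    return code[index:] if index != -1 else code

df['Python Code'] = df['Python Code'].apply(strip_before_def)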
        
# Removing any examples that are not functions 
df = df[df['Python Code'].str.find('def') != -1]
df = df.reset_index(drop=True)
df

Unnamed: 0,Problem,Python Code
0,create and print a list where the values are s...,def printvalues():\n\tl = list()\n\tfor i in r...
1,alter a given sqlite table.,def sql_connection():\r try:\r conn = sq...
2,extract specified size of strings from a give ...,"def extract_string(str_list1, l):\r result ..."
3,sort unsorted numbers using strand sort.,"def strand_sort(arr: list, reverse: bool = fal..."
4,insert a specified element in a given list aft...,"def inset_element_list(lst, x, n):\r i = n\..."
...,...,...
3152,program using generator to print the numbers w...,def numgenerator(n):\n for i in range(n+1)...
3153,searches an item in a sorted list. the functio...,"def bin_search(li, element):\n bottom = 0\..."
3154,searches an item in a sorted list. the functio...,"def bin_search(li, element):\n bottom = 0\..."
3155,print this list after removing all duplicate v...,def removeduplicate( li ):\n newli=[]\n ...
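
Searching for the substring `def` also matches identifiers such as `default` or `defaultdict`, so some examples that contain no function can pass the filter above. A stricter check, shown here only as a sketch and not as the method used in this notebook, looks for an actual function header. `DEF_PATTERN` and `keep_from_first_def` are hypothetical names.

```python
import re

# 'def', whitespace, a function name, then an opening parenthesis
DEF_PATTERN = re.compile(r'\bdef\s+\w+\s*\(')

def keep_from_first_def(code):
    """Return the code starting at its first function header, or None if it has none."""
    match = DEF_PATTERN.search(code)
    return code[match.start():] if match else None

print(keep_from_first_def("import os\ndef f(x):\n    return x"))
print(keep_from_first_def("default = 3"))
```

Rows for which the helper returns `None` could then be removed with `dropna()`.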


*Spacing out the code for easier tokenization*

In [23]:
def clean_function(text):
    text_split = str(text).replace('\n',' ').replace('\t',' ').replace(',', ' , ').split()
    text = ' '.join(text_split)
    text = (text.replace('(', ' ( ').replace(')', ' ) ')
            .replace('=',' = ').replace('"', ' " ')
            .replace("'", " ' ").replace("#"," # ")
            .replace('[',' [ ').replace(']', ' ] ')
            .replace('{',' { ').replace('}', ' } ')
            .replace('+', ' + ').replace('-', ' - ')
            .replace(':', ' : ').replace('  ',' '))
    return text

df['Python Code Cleaned'] = df['Python Code'].apply(clean_function)
df

Unnamed: 0,Problem,Python Code,Python Code Cleaned
0,create and print a list where the values are s...,def printvalues():\n\tl = list()\n\tfor i in r...,def printvalues ( ) : l = list ( ) for i in ra...
1,alter a given sqlite table.,def sql_connection():\r try:\r conn = sq...,def sql_connection ( ) : try : conn = sqlite3....
2,extract specified size of strings from a give ...,"def extract_string(str_list1, l):\r result ...","def extract_string ( str_list1 , l ) : result ..."
3,sort unsorted numbers using strand sort.,"def strand_sort(arr: list, reverse: bool = fal...","def strand_sort ( arr : list , reverse : bool ..."
4,insert a specified element in a given list aft...,"def inset_element_list(lst, x, n):\r i = n\...","def inset_element_list ( lst , x , n ) : i = n..."
...,...,...,...
3152,program using generator to print the numbers w...,def numgenerator(n):\n for i in range(n+1)...,def numgenerator ( n ) : for i in range ( n + ...
3153,searches an item in a sorted list. the functio...,"def bin_search(li, element):\n bottom = 0\...","def bin_search ( li , element ) : bottom = 0 t..."
3154,searches an item in a sorted list. the functio...,"def bin_search(li, element):\n bottom = 0\...","def bin_search ( li , element ) : bottom = 0 t..."
3155,print this list after removing all duplicate v...,def removeduplicate( li ):\n newli=[]\n ...,def removeduplicate ( li ) : newli = [ ] seen ...
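
The chain of `replace` calls spaces out a fixed set of characters, which also splits string literals and leaves operators such as `*` or `<` attached to their neighbours. For comparison, here is a sketch that uses Python's standard `tokenize` module instead. It is an alternative technique, not what this notebook uses, and it only works on code that tokenizes cleanly (it can raise `tokenize.TokenError` or `IndentationError` on malformed examples). The helper name `tokenize_python` is an assumption.

```python
import io
import tokenize

# Token types that carry layout only and are dropped from the output
SKIPPED = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
           tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT}

def tokenize_python(code):
    """Return the code as a single line of space-separated Python tokens."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type not in SKIPPED:
            tokens.append(tok.string)
    return ' '.join(tokens)

print(tokenize_python("def add(a, b):\n    return a + b\n"))
```

Unlike `clean_function`, this keeps each string literal as one token and drops comments.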


**Saving the dataset**

In [24]:
# df.to_csv('./Data/Python_code_cleaned.csv', index=False)

### Preprocessing the CodeSearchNet dataset

*Loading in data from directory*

In [25]:
data_dir = './Data'

In [26]:
codenet = pd.read_csv(os.path.join(data_dir, 'CodeNetData.csv'))
codenet.head()

Unnamed: 0,docstring,function
0,Trains a k-nearest neighbors classifier for fa...,"def train(train_dir, model_save_path=None, n_n..."
1,Recognizes faces in given image using a traine...,"def predict(X_img_path, knn_clf=None, model_pa..."
2,Shows the face recognition results visually.\n...,"def show_prediction_labels_on_image(img_path, ..."
3,Convert a dlib 'rect' object to a plain tuple ...,"def _rect_to_css(rect):\n """"""\n Convert ..."
4,"Make sure a tuple in (top, right, bottom, left...","def _trim_css_to_bounds(css, image_shape):\n ..."


Size of the dataset

In [27]:
codenet.shape

(503502, 2)

**Exploring some random samples**

*Run the cell below multiple times to see different examples*

In [28]:
rand_example = np.random.choice(codenet.shape[0])
print("#"*120 + "\n" + "#"*22, "-"*30, " Docstring  ", "-"*30, "#"*22 + "\n" + "#"*120, '\n')
print(codenet['docstring'].iloc[rand_example])
print('\n' + "#"*120 + "\n" + "#"*22, "-"*30, "  Function  ", "-"*30, "#"*22 + "\n" + "#"*120, '\n')
print(codenet['function'].iloc[rand_example])

########################################################################################################################
###################### ------------------------------  Docstring   ------------------------------ ######################
######################################################################################################################## 

setup root logger with ColoredFormatter.

########################################################################################################################
###################### ------------------------------   Function   ------------------------------ ######################
######################################################################################################################## 

def setup_logger(log_level, log_file=None):
    """setup root logger with ColoredFormatter."""
    level = getattr(logging, log_level.upper(), None)
    if not level:
        color_print("Invalid log level: %s" % log_level, "RED

In [29]:
rand_example = np.random.choice(codenet.shape[0])
print("#"*120 + "\n" + "#"*22, "-"*30, " Docstring  ", "-"*30, "#"*22 + "\n" + "#"*120, '\n')
print(codenet['docstring'].iloc[rand_example])
print('\n' + "#"*120 + "\n" + "#"*22, "-"*30, "  Function  ", "-"*30, "#"*22 + "\n" + "#"*120, '\n')
print(codenet['function'].iloc[rand_example])

########################################################################################################################
###################### ------------------------------  Docstring   ------------------------------ ######################
######################################################################################################################## 

Deploys a file to the Artifactory BEL namespace cache

    :param str filename: The physical path
    :param str module_name: The name of the module to deploy to
    :param tuple[str] auth: A pair of (str username, str password) to give to the auth keyword of the constructor of
                            :class:`artifactory.ArtifactoryPath`. Defaults to the result of :func:`get_arty_auth`.
    :return: The resource path, if it was deployed successfully, else none.
    :rtype: Optional[str]

########################################################################################################################
#################

In [30]:
rand_example = np.random.choice(codenet.shape[0])
print("#"*120 + "\n" + "#"*22, "-"*30, " Docstring  ", "-"*30, "#"*22 + "\n" + "#"*120, '\n')
print(codenet['docstring'].iloc[rand_example])
print('\n' + "#"*120 + "\n" + "#"*22, "-"*30, "  Function  ", "-"*30, "#"*22 + "\n" + "#"*120, '\n')
print(codenet['function'].iloc[rand_example])

########################################################################################################################
###################### ------------------------------  Docstring   ------------------------------ ######################
######################################################################################################################## 

Remove sphinx-tabs CSS and JS asset files if not used in a page

########################################################################################################################
###################### ------------------------------   Function   ------------------------------ ######################
######################################################################################################################## 

def update_context(app, pagename, templatename, context, doctree):
    """ Remove sphinx-tabs CSS and JS asset files if not used in a page """
    if doctree is None:
        return
    visitor = _FindTabs

#### Steps for cleaning the docstrings and functions

**Docstring** <br>
1. Keep only the first paragraph (the text before the first blank line)
2. Replace line breaks and tabs with single spaces
3. Cut the docstring off at the first `:return:`, `Args`, `:param`, or `Parameters`
4. Remove the examples that are not in English
5. Remove the examples that contain anything other than letters, spaces, `,`, or `.`
6. Remove `,` from the remaining examples
7. Lowercase everything
8. Remove everything after the first sentence
9. Remove the period at the end of the sentence
10. Remove examples with fewer than 3 or more than 20 words
11. Remove any missing or blank examples

**Function** <br>
1. Remove the docstring (everything between the first pair of triple double quotes)
2. Remove functions that have more than 100 whitespace-separated words
3. Remove tabs, line breaks, and extra white space, and add spacing around commas, parentheses, equal signs, and other punctuation
4. Remove any missing or blank examples

*Note: running the code below can take a long time (sometimes 45+ minutes). The slow part is the language detection (step 4), which checks 500k+ examples and removes any that are not in English; even when this operation is parallelized across 14 cores it takes around 10 minutes. You can skip all the pre-processing steps and load in the cleaned data below.*
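
To make the docstring steps concrete, the sketch below applies them to a single string. It is illustrative only: `clean_docstring` is a hypothetical helper, the language detection of step 4 is left out, and the sentence segmentation of step 8 is replaced by a naive split on `'. '` instead of `pysbd`.

```python
MARKERS = (':return:', 'Args', ':param', 'Parameters')

def clean_docstring(doc):
    """Return the cleaned first sentence, or None if the example should be dropped."""
    text = str(doc).split('\n\n')[0]           # step 1: first paragraph only
    text = ' '.join(text.split())              # step 2: collapse whitespace
    for marker in MARKERS:                     # step 3: cut at parameter sections
        cut = text.find(marker)
        if cut != -1:
            text = text[:cut]
    # step 5: only letters, spaces, commas and periods are allowed
    if not text.replace(' ', '').replace(',', '').replace('.', '').isalpha():
        return None
    text = text.replace(',', '').lower()       # steps 6 and 7
    text = text.split('. ')[0].rstrip('.').strip()  # steps 8 and 9 (naive split)
    n_words = len(text.split())
    return text if 3 <= n_words <= 20 else None     # step 10

print(clean_docstring("Wait for the job to complete, or a timeout to happen.\n\n    Args:\n      timeout: seconds"))
```

Returning `None` for rejected examples lets the final step remove them with `dropna()`.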

In [31]:
#codenet = pd.read_csv(os.path.join(data_dir, 'CodeNetData_cleaned.csv'))

**Cleaning docstring**

In [32]:
### Step 1
codenet['docstring_cleaned'] = codenet['docstring'].apply(lambda x: str(x).split('\n\n')[0])

### Step 2
codenet['docstring_cleaned'] = codenet['docstring_cleaned'].apply(lambda x: ' '.join(str(x).replace('\n',' ').split()))

### Step 3
codenet['docstring_cleaned'] = codenet['docstring_cleaned'].apply(lambda x: x[:(None if str(x).find(':return:') == -1 else str(x).find(':return:'))])
codenet['docstring_cleaned'] = codenet['docstring_cleaned'].apply(lambda x: x[:(None if str(x).find('Args') == -1 else str(x).find('Args'))])
codenet['docstring_cleaned'] = codenet['docstring_cleaned'].apply(lambda x: x[:(None if str(x).find(':param') == -1 else str(x).find(':param'))])
codenet['docstring_cleaned'] = codenet['docstring_cleaned'].apply(lambda x: x[:(None if str(x).find('Parameters') == -1 else str(x).find('Parameters'))])

### Step 4
from f import add_features
# def detectEnglish(input):
#     try:
#         if(detect_langs(input)[0].lang=="en"):
#             return True
#         else: 
#             return False
#     except:
#         return False

# def add_features(df):       
#     df['docstring_isEnglish'] = df['docstring'].apply(lambda x: detectEnglish(str(x)))
#     return df

def parallelize_dataframe(df, func, n_cores=8):
    df_split = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

codenet = parallelize_dataframe(codenet, add_features, n_cores=14) # multiprocessing across 14 worker processes
codenet = codenet[codenet['docstring_isEnglish'] == True]
codenet = codenet.reset_index(drop=True) # resetting the index

### Step 5
codenet = codenet[codenet['docstring_cleaned'].apply(lambda x: x.replace(' ','').replace(',','').replace('.','').isalpha())]
codenet = codenet.reset_index(drop=True)

### Step 6
codenet['docstring_cleaned'] = codenet['docstring_cleaned'].apply(lambda x: str(x).replace(',',''))

### Step 7
codenet['docstring_cleaned'] = codenet['docstring_cleaned'].apply(lambda x: str(x).lower())

### Step 8
seg = pysbd.Segmenter(language="en", clean=False)
codenet['docstring_cleaned'] = codenet['docstring_cleaned'].apply(lambda x: seg.segment(x)[0])

### Step 9
codenet['docstring_cleaned'] = codenet['docstring_cleaned'].apply(lambda x: str(x).rstrip('.'))

### Step 10
codenet = codenet[codenet['docstring_cleaned'].apply(lambda x: len(x.split()) > 2)]
codenet = codenet[codenet['docstring_cleaned'].apply(lambda x: len(x.split()) < 21)]

### Step 11
codenet = codenet[codenet['docstring_cleaned'].astype(str) != ''] # Removing any docstrings that are empty
codenet = codenet[codenet['docstring_cleaned'].isna() == False]
codenet = codenet[codenet["docstring_cleaned"].astype(str) != 'None']
codenet = codenet.reset_index(drop=True) # resetting the index

*Looking at the cleaned examples now*

In [33]:
rand_example = np.random.choice(codenet.shape[0])
print(f'Cleaned docstring: {codenet["docstring_cleaned"].iloc[rand_example]}\n')
print(f'Old docstring: {codenet["docstring"].iloc[rand_example]}')

Cleaned docstring: wait for the job to complete or a timeout to happen

Old docstring: Wait for the job to complete, or a timeout to happen.

      This is more efficient than the version in the base Job class, in that we can
      use a call that blocks for the poll duration rather than a sleep. That means we
      shouldn't block unnecessarily long and can also poll less.

    Args:
      timeout: how long to wait (in seconds) before giving up; default None which means no timeout.

    Returns:
      The QueryJob


In [34]:
rand_example = np.random.choice(codenet.shape[0])
print(f'Cleaned docstring: {codenet["docstring_cleaned"].iloc[rand_example]}\n')
print(f'Old docstring: {codenet["docstring"].iloc[rand_example]}')

Cleaned docstring: get the values for the convolutionfilter record

Old docstring: Get the values for the CONVOLUTIONFILTER record.


In [35]:
rand_example = np.random.choice(codenet.shape[0])
print(f'Cleaned docstring: {codenet["docstring_cleaned"].iloc[rand_example]}\n')
print(f'Old docstring: {codenet["docstring"].iloc[rand_example]}')

Cleaned docstring: add a resource attribute attribute to a resource

Old docstring: Add a resource attribute attribute to a resource.

        attr_is_var indicates whether the attribute is a variable or not --
        this is used in simulation to indicate that this value is expected
        to be filled in by the simulator.


**Cleaning Functions**

In [36]:
### Step 1
def remove_pydocs_doubleQ(text):
    text = str(text)
    try:
        indexes_to_remove = []
        for match in re.finditer('"""', text):
            indexes_to_remove.append(match.start())  
        return text[:indexes_to_remove[0]] + text[indexes_to_remove[1]+3:]
    except IndexError:
        # fewer than two `"""` markers: return None so the row is dropped in step 4
        return None
    
codenet['function_cleaned'] = codenet['function'].apply(remove_pydocs_doubleQ)

### Step 2
codenet = codenet[codenet['function_cleaned'].apply(lambda x: len(str(x).split())) < 101]

### Step 3
def clean_function(text):
    text_split = str(text).replace('\n',' ').replace('\t',' ').replace(',', ' , ').split()
    text = ' '.join(text_split)
    text = (text.replace('(', ' ( ').replace(')', ' ) ')
            .replace('=',' = ').replace('"', ' " ')
            .replace("'", " ' ").replace("#"," # ")
            .replace('[',' [ ').replace(']', ' ] ')
            .replace('{',' { ').replace('}', ' } ')
            .replace('+', ' + ').replace('-', ' - ')
            .replace(':', ' : ').replace('  ',' '))
    return text

codenet['function_cleaned'] = codenet['function_cleaned'].apply(clean_function)

### Step 4
codenet = codenet[codenet['function_cleaned'].astype(str) != ''] # Removing any functions that are empty
codenet = codenet[codenet['function_cleaned'].isna() == False] 
codenet = codenet[codenet["function_cleaned"].astype(str) != 'None']
codenet = codenet.reset_index(drop=True) # resetting the index
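
Step 1 only recognises docstrings written with triple double quotes, so functions documented with `'''` return `None` and are removed in step 4. A regular expression can handle both quote styles. The version below is a sketch of that alternative, not the code used to build this dataset: `remove_first_docstring` is a hypothetical name, and unlike the function above it returns code without a docstring unchanged instead of dropping it.

```python
import re

# Match the first pair of triple quotes of either style, across line breaks
DOCSTRING = re.compile(r"(\"\"\"|''').*?\1", flags=re.DOTALL)

def remove_first_docstring(source):
    """Remove the first triple-quoted string from the source code."""
    return DOCSTRING.sub('', str(source), count=1)

print(remove_first_docstring('def f():\n    """doc"""\n    return 1'))
```

As with the original function, any triple-quoted string counts as a docstring here, even if it is an ordinary string literal.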

*Looking at the cleaned functions*

In [37]:
rand_example = np.random.choice(codenet.shape[0])
print(f'Cleaned function: \n{codenet["function_cleaned"].iloc[rand_example]}\n')
print(f'Old function: \n{codenet["function"].iloc[rand_example]}')

Cleaned function: 
def set_x_grid_info ( self , x_low , x_high , num_x , xscale , xval_name ) : self._set_grid_info ( ' x ' , x_low , x_high , num_x , xscale , xval_name ) return

Old function: 
def set_x_grid_info(self, x_low, x_high, num_x, xscale, xval_name):
        """Set the grid values for x.

        Create information for the grid of x values.

        Args:
            num_x (int): Number of points on axis.
            x_low/x_high (float): Lowest/highest value for the axis.
            xscale (str): Scale of the axis. Choices are 'log' or 'lin'.
            xval_name (str): Name representing the axis. See GenerateContainer documentation
                for options for the name.

        """
        self._set_grid_info('x', x_low, x_high, num_x, xscale, xval_name)
        return


In [38]:
rand_example = np.random.choice(codenet.shape[0])
print(f'Cleaned function: \n{codenet["function_cleaned"].iloc[rand_example]}\n')
print(f'Old function: \n{codenet["function"].iloc[rand_example]}')

Cleaned function: 
def getPageFontList ( self , pno ) : if self.isClosed or self.isEncrypted : raise ValueError ( " operation illegal for closed / encrypted doc " ) if self.isPDF : return self._getPageInfo ( pno , 1 ) return [ ] 

Old function: 
def getPageFontList(self, pno):
        """Retrieve a list of fonts used on a page.
        """
        if self.isClosed or self.isEncrypted:
            raise ValueError("operation illegal for closed / encrypted doc")
        if self.isPDF:
            return self._getPageInfo(pno, 1)
        return []


New shape of the pre-processed dataset

In [39]:
codenet.shape

(201644, 5)

In [40]:
codenet.head()

Unnamed: 0,docstring,function,docstring_cleaned,docstring_isEnglish,function_cleaned
0,Returns an array of bounding boxes of human fa...,"def _raw_face_locations(img, number_of_times_t...",returns an array of bounding boxes of human fa...,True,"def _raw_face_locations ( img , number_of_time..."
1,Returns an array of bounding boxes of human fa...,"def face_locations(img, number_of_times_to_ups...",returns an array of bounding boxes of human fa...,True,"def face_locations ( img , number_of_times_to_..."
2,Return the Catalyst datatype from the size of ...,"def _int_size_to_type(size):\n """"""\n Ret...",return the catalyst datatype from the size of ...,True,def _int_size_to_type ( size ) : if size < = 8...
3,Convert a schema from Spark to Arrow,"def to_arrow_schema(schema):\n """""" Convert ...",convert a schema from spark to arrow,True,def to_arrow_schema ( schema ) : import pyarro...
4,Convert schema from Arrow to Spark.,"def from_arrow_schema(arrow_schema):\n """""" ...",convert schema from arrow to spark,True,def from_arrow_schema ( arrow_schema ) : retur...


*Removing any hidden characters*

In [41]:
# keep only the characters found in string.printable (ASCII letters, digits, punctuation, whitespace)
codenet['function_cleaned'] = codenet['function_cleaned'].apply(lambda x: ''.join(ch for ch in x if ch in string.printable))
codenet['docstring_cleaned'] = codenet['docstring_cleaned'].apply(lambda x: ''.join(ch for ch in x if ch in string.printable))
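
A small example shows what this filter does and does not keep. `string.printable` contains only ASCII characters, so accented letters and other non-ASCII text are removed along with invisible characters. The helper name `keep_printable` is an assumption for illustration.

```python
import string

PRINTABLE = set(string.printable)  # ASCII letters, digits, punctuation and whitespace

def keep_printable(text):
    """Drop every character that is not in string.printable."""
    return ''.join(ch for ch in text if ch in PRINTABLE)

# 'é' and the zero-width space (U+200B) are both removed
print(keep_printable('caf\u00e9\u200b ok'))
```

Using a `set` for the lookup makes the membership test constant-time, which matters when filtering hundreds of thousands of functions.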

In [44]:
### Saving the dataset to the data dir
# codenet[['docstring','docstring_cleaned','function','function_cleaned']].to_csv(os.path.join(data_dir, 'CodeNetData_cleaned.csv'))