The purpose of this project is to build a machine learning model capable of predicting the programming language of a given code snippet.
With the advancement and accessibility of technology, many programming languages have been developed. Similar to the differences between spoken languages, each programming language has its own syntax, notation, best practices, and rules.
Natural Language Processing (NLP) has been widely used to tackle applications involving different spoken languages. While the processing of programming languages has been explored, it has not advanced to the same level. Processing programming languages presents its own challenges, since the meaning of spaces, punctuation marks, and words varies across programming languages. For example, the word 'from' means different things in different languages, as do the punctuation marks "." and "&".
The following 30 programming languages are explored in this project: "Ada", "AWK", "C", "C++", "C#", "COBOL", "Fortran", "Go", "Groovy", "Haskell", "Java", "JavaScript", "Julia", "Kotlin", "Mathematica", "MATLAB", "Perl", "PHP", "PowerShell", "PureBasic", "Python", "R", "Racket", "Ruby", "Rust", "Scala", "SQL PL", "Swift", "UNIX Shell", "VBA"
The code snippets were web-scraped using scraper.py from the Rosetta Code webpage. Rosetta Code is a wiki project created by Mike Mol in 2007 that presents solutions to common tasks in a myriad of coding languages. As such, it is a useful resource for obtaining code snippets with their labels. The scraper crawls across each task page and writes the name of the language and its corresponding code snippet for that task to a CSV file (RosettaCodeData.csv).
Note that the scraper writes commas in the code snippets as the string sequence '!@#$%^&&^%$#@!'. This is done because CSVs interpret commas as column delimiters, which would disrupt the 2-column structure of the file. The replacement sequence is deliberately complex to avoid colliding with string sequences that occur naturally in code snippets. When preprocessing the data for classification, this string sequence is restored to commas using a string replacement (a sketch is shown below).
Rosetta Code Web Page: http://www.rosettacode.org/wiki/Rosetta_Code
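The restoration step can be done with a simple string replacement when loading the data. The following is a minimal sketch; the column names ("Language", "Text") are assumptions and may differ from those actually used in RosettaCodeData.csv.

```python
import pandas as pd

# Load the scraped dataset and restore the comma placeholder.
# Column names here are assumptions; adjust them to match RosettaCodeData.csv.
df = pd.read_csv("RosettaCodeData.csv")
df["Text"] = df["Text"].str.replace("!@#$%^&&^%$#@!", ",", regex=False)
```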
- The data was loaded from the RosettaCodeData.csv file using the Pandas library
- The text column was preprocessed by:
- Replacing the string sequence '!@#$%^&&^%$#@!' with commas
- Removing spaces and all characters that are not English letters, digits, or punctuation marks
- The language column was preprocessed by:
- Label-encoding the 30 languages to a numeric representation from 0 to 29
- Both the text and language columns were converted to arrays
- The text column was tokenized at a mixed level (word and character levels), where each word and each punctuation mark is a token. Thus each observation is changed from a single string to a sequence of numeric indices corresponding to the words and punctuation marks in that string. Mixed-level tokenization/representation was the safest way to preserve the uniqueness of each language, since the words and punctuation marks are often the distinguishing factors between languages. This preprocessing step also applies to any language, so the model can easily be expanded to include more languages.
- The tokenized text was embedded into a vector representation based on the Term Frequency-Inverse Document Frequency (TF-IDF) of each token in the text (a sketch of these preprocessing steps follows this list)
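The preprocessing described above could be implemented along the following lines with scikit-learn. This is a sketch continuing from the loading snippet above; the token pattern (one token per word or per punctuation mark) is an assumed approximation rather than the exact pattern used in the project.

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# Label-encode the 30 language names to integers 0-29.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df["Language"].values)

# Mixed-level tokenization: every word and every punctuation mark is a token,
# then the tokens are embedded with TF-IDF. The regex is an assumed approximation.
vectorizer = TfidfVectorizer(token_pattern=r"\w+|[^\w\s]")
X = vectorizer.fit_transform(df["Text"].values)
```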
The model was built using Scikit-Learn. It is a Support Vector Machine (SVM) classifier with a linear kernel. An SVM was chosen because this problem deals with a high-dimensional feature space (>70,000 unique tokens in the dataset), and linear SVMs are known to excel in such settings, which are common in NLP applications.
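The training step might look like the following sketch, assuming scikit-learn's SVC with a linear kernel and a standard train/test split; probability=True and the split parameters are assumptions, included so the classifier can report the prediction percentages described below.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Hold out a test set and fit a linear-kernel SVM on the TF-IDF features.
# probability=True enables per-class probability estimates for top-N predictions.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = SVC(kernel="linear", probability=True)
clf.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```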
To make a prediction using the model:
- Clone this repository from GitHub
- Install the packages from requirements.txt:

  pip install -r requirements.txt

- Place the code snippets in the Snippets folder as .txt files. There are .txt files there as examples.
- From the command line, run predict.py as follows:

  python predict.py --top 3

  The --top argument specifies the top N predictions to display when making a prediction.
- The command line should print the filename and the top N predictions for each file with the corresponding percentage for each prediction. The last line summarizes the time taken to make the predictions.
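For illustration only, the output might look something like this; the filenames, languages, percentages, and timing below are hypothetical and will differ from the actual output of predict.py:

```
snippet1.txt: Python (92.3%), Ruby (4.1%), Perl (1.8%)
snippet2.txt: C++ (88.7%), C (7.9%), C# (1.2%)
Predictions completed in 1.24 seconds
```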