A classifier that takes a snippet of code and guesses which programming language it is written in.
After completing this assignment, you should understand:
- Feature extraction
- Classification
- The varied syntax of programming languages
After completing this assignment, you should be able to:
- Build a robust classifier
- A Git repo called programming-language-classifier containing at least:
  - a README.md file explaining how to run your project
  - a requirements.txt file
  - a suite of tests for your project
- Passing unit tests
- No PEP8 or Pyflakes warnings or errors
Option 1: Get code from the Computer Language Benchmarks Game. You can download their code directly. In the downloaded archive under benchmarksgame/bench, you'll find many directories with short programs in them. Using the file extensions of these files, you should be able to find out what programming language they are.
Option 2: Scrape code from Rosetta Code. You will need to figure out how to scrape HTML and parse it. BeautifulSoup is your best bet for doing that.
Option 3: Get code from GitHub somehow. The specifics of this are left up to you.
You are allowed to use other code samples as well.
For your sanity, you only have to worry about the following languages:
- C (.gcc, .c)
- C#
- Common Lisp (.sbcl)
- Clojure
- Haskell
- Java
- JavaScript
- OCaml
- Perl
- PHP (.hack, .php)
- Python
- Ruby (.jruby, .yarv)
- Scala
- Scheme (.racket)
Feel more than free to add others!
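If you go with Option 1, labelling the corpus can be as simple as a lookup from file extension to language. The sketch below is only a starting point: `EXTENSION_TO_LANGUAGE` and `load_corpus` are hypothetical names, and the table contains only the extensions given in the list above. Languages listed without an extension need their own entries once you have checked the archive.

```python
from pathlib import Path

# Only the extensions named in the list above; the remaining languages
# need entries added after inspecting the archive yourself.
EXTENSION_TO_LANGUAGE = {
    ".gcc": "C",
    ".c": "C",
    ".sbcl": "Common Lisp",
    ".hack": "PHP",
    ".php": "PHP",
    ".jruby": "Ruby",
    ".yarv": "Ruby",
    ".racket": "Scheme",
}


def load_corpus(root):
    """Return (source_text, language) pairs for recognised files under root."""
    samples = []
    for path in sorted(Path(root).rglob("*")):
        language = EXTENSION_TO_LANGUAGE.get(path.suffix.lower())
        if language is None or not path.is_file():
            continue  # unknown extension, or a directory
        # errors="replace" keeps an oddly encoded file from aborting the load.
        text = path.read_text(encoding="utf-8", errors="replace")
        samples.append((text, language))
    return samples
```

Calling `load_corpus("benchmarksgame/bench")` would then give a list of labelled samples to train on.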
Using your corpus, extract features for your classifier. Use whichever classifier engine works best for you, provided you can explain how it works.
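As one possible starting point for feature extraction, a snippet can be reduced to the relative frequency of each word and punctuation character it contains. This is a minimal sketch, not a prescribed design: `extract_features` and `TOKEN_PATTERN` are hypothetical names, and the choice of tokens is an assumption you will likely want to refine.

```python
import re
from collections import Counter

# Identifiers/keywords, or single punctuation characters.
# Numeric literals and whitespace are ignored by this pattern.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|[^\w\s]")


def extract_features(snippet):
    """Map a code snippet to the relative frequency of each token."""
    tokens = TOKEN_PATTERN.findall(snippet)
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: count / total for token, count in counts.items()}
```

Relative frequencies are used rather than raw counts so that long and short snippets produce comparable feature values.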
Your initial classifier should be able to take a string containing code and return a guessed language for it. It is recommended you also have a method that returns the snippet's percentage chance for each language in a dict.
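To make that interface concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing, written in plain Python. It is one option among many, chosen here because it is easy to explain. The class name `NaiveBayesGuesser` and its methods `fit`, `guess`, and `probabilities` are hypothetical, and the model must be fitted on at least one non-empty sample before it is queried.

```python
import math
import re
from collections import Counter, defaultdict

# Identifiers/keywords, or single punctuation characters.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|[^\w\s]")


class NaiveBayesGuesser:
    """Multinomial naive Bayes over code tokens, with add-one smoothing."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token counts
        self.doc_counts = Counter()               # language -> snippets seen
        self.vocabulary = set()

    def fit(self, samples):
        """Train on an iterable of (snippet, language) pairs."""
        for snippet, language in samples:
            tokens = TOKEN_PATTERN.findall(snippet)
            self.token_counts[language].update(tokens)
            self.doc_counts[language] += 1
            self.vocabulary.update(tokens)
        return self

    def _log_scores(self, snippet):
        tokens = TOKEN_PATTERN.findall(snippet)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for language, counts in self.token_counts.items():
            denominator = sum(counts.values()) + vocab_size
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[language] / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            scores[language] = score
        return scores

    def probabilities(self, snippet):
        """Return a dict mapping each language to a percentage."""
        scores = self._log_scores(snippet)
        highest = max(scores.values())
        # Subtract the maximum before exponentiating to avoid underflow.
        weights = {lang: math.exp(s - highest) for lang, s in scores.items()}
        total = sum(weights.values())
        return {lang: 100 * w / total for lang, w in weights.items()}

    def guess(self, snippet):
        """Return the single most likely language."""
        scores = self._log_scores(snippet)
        return max(scores, key=scores.get)
```

A typical use would be `model = NaiveBayesGuesser().fit(samples)` followed by `model.guess(code)` or `model.probabilities(code)`, where `samples` is a list of `(snippet, language)` pairs from your corpus.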
The test/ directory contains code snippets. The file test.csv contains a list of the file names in the test directory and the language of each snippet. Use this set of snippets to test your classifier. Do not use the test snippets for training your classifier.
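Scoring against the test set might look like the sketch below. It rests on assumptions that should be checked against the real files: each row of test.csv is taken to be `filename,language` with no header row, and `guess` is any function that takes a snippet string and returns a language name. The function name `evaluate` and its default paths are hypothetical.

```python
import csv
from pathlib import Path


def evaluate(guess, test_dir="test", csv_path="test.csv"):
    """Return the fraction of test snippets whose language is guessed correctly.

    Assumes each CSV row is `filename,language` with no header row; adjust
    the parsing if the real file is laid out differently.
    """
    correct = 0
    total = 0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if len(row) < 2:
                continue  # skip blank or malformed rows
            filename, expected = row[0].strip(), row[1].strip()
            snippet = (Path(test_dir) / filename).read_text(
                encoding="utf-8", errors="replace"
            )
            # Compare case-insensitively so "Python" matches "python".
            if guess(snippet).lower() == expected.lower():
                correct += 1
            total += 1
    return correct / total if total else 0.0
```

Because the function only reads the test snippets and never passes them to training code, it keeps the test set separate from the training corpus.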
- You will need Python 3 installed on your machine, or access to a Python 3 interpreter. See Python's site for details.
- Clone this repo onto your machine.
- Make sure you have a Python 3 virtual environment running in the folder you intend to work from. See this site for details if you're not familiar. Complete this step before attempting the steps below.
- In your command-line program (such as Terminal on Mac OS X), navigate into the newly created repo. By default, this will be called programming-language-classifier. Install the requirements by running pip install -r requirements.txt. Note that lxml compiles against the version of Python in your environment (Python 3) after downloading; this can take a while, so please be patient.
- For textblob to work properly, you will need to download its associated data files. To do this, run the following command at the command line:
python -m textblob.download_corpora
This process can take a while, so please be patient.
- For nltk to work properly, you will need to download its associated data files. To do this, run the following commands at the command line:
$ python3
>>> import nltk
>>> nltk.download()
This will open a new window in which you will need to choose to download all. This process can take a while, so please be patient. When it is done and you have closed the download window, the command line will show True. Enter exit() on the Python command line to return to the command prompt.
- To run this program, save guess_lang.py to your computer. Using a command-line program (such as Terminal on Mac OS X), navigate to the folder containing the downloaded file and run the following line to play:
python3 guess_lang.py