# Programming Language Classifier
Brought to you by Python and scikit-learn.



Import the main classifier functions and a few scikit-learn functions to explain the process:


In [1]:
from classifier import prepare_dataset, prepare_pipeline, predict_language
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn import linear_model
from sklearn.pipeline import Pipeline

# The classifier in action
`sample.py` is a tiny Python script.

In [2]:
pipeline, target_names = prepare_pipeline()
predict_language('sample.py', pipeline, target_names)

Prediction: python


'python'

This classifier uses three parts from scikit-learn: CountVectorizer, TfidfTransformer, and SGDClassifier.

CountVectorizer takes the text of the programs and vectorizes it by word count, that is, it turns each program into a numeric vector the estimator can use.
TfidfTransformer takes those word counts and discounts terms that appear in many programs. In other words, it gives a term a high value in a given program if that term occurs often in that particular program and very rarely anywhere else.
SGDClassifier is classifier.py's estimator. It is trained on the output of the vectorizer and transformer and can then make predictions about new input. In particular, SGDClassifier implements linear models trained with stochastic gradient descent. One sample at a time, the estimator estimates the gradient of the loss and updates its weights with that information. With enough data, it predicts very well.
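
To make the three stages concrete, here is a minimal sketch that runs them by hand on a toy corpus. The four snippets and their labels are made up for illustration and are not the project's training data; the real pipeline chains the same steps through `Pipeline`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy corpus (assumption): two Python-like and two C-like snippets.
docs = [
    "def add(a, b): return a + b",
    "import os\nprint(os.getcwd())",
    "int main(void) { return 0; }",
    "#include <stdio.h>\nint main(void) { printf(\"hi\"); return 0; }",
]
labels = ["python", "python", "c", "c"]

# Step 1: turn each snippet into a sparse vector of token counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Step 2: reweight the counts so tokens shared by many snippets count for less.
tfidf = TfidfTransformer()
weights = tfidf.fit_transform(counts)

# "return" occurs in three snippets and "def" in one, so "def" gets the larger IDF.
idf_def = tfidf.idf_[vectorizer.vocabulary_["def"]]
idf_return = tfidf.idf_[vectorizer.vocabulary_["return"]]

# Step 3: fit a linear model with stochastic gradient descent.
clf = SGDClassifier(random_state=0)
clf.fit(weights, labels)

# New input has to pass through the same fitted vectorizer and transformer.
new_doc = ["def greet(name): print(name)"]
prediction = clf.predict(tfidf.transform(vectorizer.transform(new_doc)))[0]
print(idf_def, idf_return, prediction)
```

With only four training snippets the prediction itself means little; the point is the order of the steps and the fact that new input reuses the already fitted vectorizer and transformer.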


Here are the training and test scores for our SGDClassifier pipeline (for a classifier, `score` reports mean accuracy rather than an F score):

In [3]:
target_names, X_train, X_test, y_train, y_test = prepare_dataset()
keystone = Pipeline([('vectorizer', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('classifier', linear_model.SGDClassifier())])
keystone.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
       ...   penalty='l2', power_t=0.5, random_state=None, shuffle=True,
       verbose=0, warm_start=False))])

In [4]:
print("SGDClassifier train score: {}".format(keystone.score(X_train, y_train)))
print("SGDClassifier test score: {}".format(keystone.score(X_test, y_test)))

SGDClassifier train score: 0.9942939594674363
SGDClassifier test score: 0.9876705141657922
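
For a classifier, `Pipeline.score` returns mean accuracy. If an actual F score is wanted, `sklearn.metrics.f1_score` can compute one from predictions. The sketch below uses made-up labels and predictions purely to show the two metrics side by side; with this notebook's variables the call would presumably be `f1_score(y_test, keystone.predict(X_test), average='macro')`.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels and predictions (assumption), not results from the classifier.
y_true = ["python", "python", "ruby", "ruby", "java", "java"]
y_pred = ["python", "ruby", "ruby", "ruby", "java", "java"]

# Fraction of samples labelled correctly; this is what score() reports.
accuracy = accuracy_score(y_true, y_pred)

# Unweighted mean of the per-language F1 scores.
macro_f1 = f1_score(y_true, y_pred, average="macro")

print("accuracy: {:.4f}".format(accuracy))
print("macro F1: {:.4f}".format(macro_f1))
```

Macro averaging treats every language equally, which matters when some languages have far fewer samples than others.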


Now, compare those scores with the ones from a naive Bayes classifier:

In [5]:
trans_alaska = Pipeline([('vectorizer', CountVectorizer()), ('classifier', MultinomialNB())])
trans_alaska.fit(X_train, y_train)

Pipeline(steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [6]:
print("Naive Bayes train score: {}".format(trans_alaska.score(X_train, y_train)))
print("Naive Bayes test score: {}".format(trans_alaska.score(X_test, y_test)))

Naive Bayes train score: 0.9683872237161408
Naive Bayes test score: 0.953305351521511
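
A single train/test split gives one number per model. One way to make the comparison less dependent on that split is cross-validation, sketched below with a hypothetical `compare_pipelines` helper and a made-up six-snippet corpus. Neither is part of classifier.py, and the scores from such a tiny corpus are not meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def compare_pipelines(docs, labels, cv=3):
    """Return the mean cross-validated accuracy for each pipeline (hypothetical helper)."""
    candidates = {
        "sgd": Pipeline([("vectorizer", CountVectorizer()),
                         ("tfidf", TfidfTransformer()),
                         ("classifier", SGDClassifier(random_state=0))]),
        "naive_bayes": Pipeline([("vectorizer", CountVectorizer()),
                                 ("classifier", MultinomialNB())]),
    }
    # cross_val_score refits a fresh clone of each pipeline on every fold.
    return {name: cross_val_score(pipe, docs, labels, cv=cv).mean()
            for name, pipe in candidates.items()}


# Hypothetical toy corpus (assumption): three snippets per language.
docs = [
    "def add(a, b): return a + b",
    "import os\nprint(os.getcwd())",
    "for item in items: print(item)",
    "int main(void) { return 0; }",
    "#include <stdio.h>",
    "void swap(int *a, int *b);",
]
labels = ["python", "python", "python", "c", "c", "c"]

results = compare_pipelines(docs, labels, cv=3)
print(results)
```

On the real data, `docs` and `labels` would be the full set of programs and their languages rather than a pre-split `X_train` and `y_train`.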


## Stoichiometry wins every time. Or so said my chemistry teacher.

You can classify programs yourself! Just run `python3 classifier.py your_filename` from the command line in the programming-language-classifier directory. Have fun!