## Final Attempt on Language Detection

This notebook shows every step that leads to a model, which is then run against the europarl.test file for classification. It builds on multiple previous attempts. The code used to compare hyperparameters, on which this notebook is based, is in [final_hyperparameter_tuning_using_percentage_of_words.py](final_hyperparameter_tuning_using_percentage_of_words.py).

This code uses files in the /pickles directory. If you clone the repository from scratch, first download the [corpora](http://www.statmt.org/europarl/) into the /txt folder, then run [Create Pickles from Corpora](Create%20Pickles%20from%20Corpora.ipynb). To save space, neither the /txt nor the /pickles files are included in this repository.

This notebook shows that a model can be built using three parameters:

    number_of_documents      - how many documents from corpora to use for training
    upto_percentage          - percentage of most common words to use per language
    number_of_common_letters - how many letters of that language to use for feature set creation

### Feature set

The feature set consists of three parts:

common_letters - the most frequent letters in a document. One feature is created per prefix of the ranking, from 2 letters up to n (number_of_common_letters).

    features['common_letters_2'] = 'tl'
    features['common_letters_3'] = 'tlp'
    features['common_letters_4'] = 'tlpo'

common_ending - the most common word endings, ranked by frequency (ten in the sample featureset shown later in this notebook).

    features['common_ending_top_1'] = 'ing'
    features['common_ending_top_2'] = 'ter'
    ...
    features['common_ending_top_10'] = 'ong'

common_words - the top percentage (upto_percentage) of words used in each language; each word becomes a True/False feature.

    features['estacion'] = True
    features['corriendo'] = False
    ...
    features['code'] = True
    
The model with the best performance uses only common_letters and common_ending. Setting upto_percentage to 0, as this notebook does, produces no common_words features.
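The letter and ending features described above can be sketched roughly as follows. This is a minimal illustration, not the code in util.features: the function name `sketch_document_features`, the three-character ending length, and the default of ten endings are assumptions inferred from the sample featureset in this notebook.

```python
from collections import Counter


def sketch_document_features(text, number_of_common_letters=7,
                             ending_length=3, top_endings=10):
    """Hypothetical re-creation of the letter and ending features.

    Assumptions (not confirmed by util.features): endings are the last
    `ending_length` characters of each word, and `top_endings` of them are kept.
    """
    words = text.lower().split()
    features = {}

    # Rank letters by frequency; equal counts keep first-seen order.
    letter_counts = Counter(ch for word in words for ch in word if ch.isalpha())
    ranked = "".join(
        ch for ch, _ in letter_counts.most_common(number_of_common_letters)
    )
    # One feature per prefix of the ranking: common_letters_2, common_letters_3, ...
    for n in range(2, len(ranked) + 1):
        features["common_letters_%d" % n] = ranked[:n]

    # Rank word endings; words shorter than ending_length count as a whole.
    ending_counts = Counter(word[-ending_length:] for word in words)
    ranked_endings = ending_counts.most_common(top_endings)
    for rank, (ending, _) in enumerate(ranked_endings, start=1):
        features["common_ending_top_%d" % rank] = ending

    return features


example = sketch_document_features("the cat sat on the mat")
```

Because each feature value is a short string, a Naive Bayes classifier can treat it as a categorical value per feature name, which is the shape the sample featureset shows.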



In [1]:
# Modules used directly in the cells below
import time
import numpy
import nltk

# Utility functions in the util package,
# created to parse and classify the corpora
from util.general import *
from util.features import *
from util.classification import *

In [2]:
# Hyperparameters
number_of_documents      = 5000
upto_percentage          = 0
number_of_common_letters = 7

In [3]:
# -------------Step 1-------------
# ----READ FROM PICKLE FILES (Pre-processed)----
# Get data to create features from corpora
pickles_directory = "pickles"
print("Number of documents to extract: " + str(number_of_documents))
print("Percentage of common words to use:" + str(upto_percentage))

# Part 1 - Extract documents from corpora
start = time.time()
all_documents = extract_documents_from_corpora_pickles(pickles_directory, number_of_documents)
print("Elapsed time reading all documents:" + print_elapsed_time(start))
print("Total Documents:" + str(len(all_documents)))

# get common words
start = time.time()
most_common_words = extract_words_from_corpora_pickles_upto_per(pickles_directory, upto_percentage)
elapsed_reading_wl = print_elapsed_time(start)
print("Elapsed time reading all words, letters:" + elapsed_reading_wl)
print("all_documents:" + str(len(all_documents)))
print("most_common_words:" + str(len(most_common_words)))
for k, v in most_common_words.items():
    print(k+" words:"+str(len(v)))


Number of documents to extract: 5000
Percentage of common words to use:0
Elapsed time reading all documents:0:02:55
Total Documents:105000
Elapsed time reading all words, letters:0:01:09
all_documents:105000
most_common_words:21
bg words:0
cs words:0
da words:0
de words:0
el words:0
en words:0
es words:0
et words:0
fi words:0
fr words:0
hu words:0
it words:0
lt words:0
lv words:0
nl words:0
pl words:0
pt words:0
ro words:0
sk words:0
sl words:0
sv words:0
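The zero word counts above follow from upto_percentage = 0. As a rough sketch of how such a cutoff could work, the hypothetical helper below keeps a percentage of the distinct vocabulary, ranked by frequency. Whether extract_words_from_corpora_pickles_upto_per measures the percentage this way is an assumption; it may, for example, use cumulative frequency instead.

```python
from collections import Counter


def top_percentage_of_words(word_counts, upto_percentage):
    """Hypothetical helper: keep the most frequent words, up to
    `upto_percentage` percent of the distinct vocabulary (an assumption)."""
    keep = int(len(word_counts) * upto_percentage / 100)
    return [word for word, _ in word_counts.most_common(keep)]


counts = Counter("a a a b b c d".split())
none_kept = top_percentage_of_words(counts, 0)    # empty list, as in this run
half_kept = top_percentage_of_words(counts, 50)   # the two most frequent words
```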


In [4]:
# -------------Step 2-------------
# Create featureset to be used for training
# this is a list of documents with features and label
start = time.time()

# create word_features
word_features = most_common_wordsonly(most_common_words)
print("words_features:" + str(len(word_features)))

# create featureset
featuresets = [(document_features_fromwords(d, word_features, number_of_common_letters), c) for (d, c) in all_documents]
elapsed_feature_creation = print_elapsed_time(start)
print("Elapsed time featureset creation:" + elapsed_feature_creation)
print("featuresets:" + str(len(featuresets)))


words_features:0
Elapsed time featureset creation:0:08:02
featuresets:105000


In [5]:
# Sample of one featureset used for training.
featuresets[123]

({'common_ending_top_1': 'ния',
  'common_ending_top_10': 'по',
  'common_ending_top_2': 'ата',
  'common_ending_top_3': 'и',
  'common_ending_top_4': 'чаи',
  'common_ending_top_5': 'на',
  'common_ending_top_6': 'ята',
  'common_ending_top_7': 'ипа',
  'common_ending_top_8': 'ава',
  'common_ending_top_9': 'ане',
  'common_letters_2': 'аи',
  'common_letters_3': 'аир',
  'common_letters_4': 'аирв',
  'common_letters_5': 'аирвн',
  'common_letters_6': 'аирвнп',
  'common_letters_7': 'аирвнпк'},
 'bg')

In [6]:
# -------------Step 3-------------
# Split train, test 
# create model with train
# score model with test
numpy.random.shuffle(featuresets)
# calculate how many items to slice by (80% train, 20% test)
slice_by = int((80 * len(featuresets)) / 100)
train_set, test_set = featuresets[:slice_by], featuresets[slice_by:]
print("Train set:" + str(len(train_set)))
print("Test set:" + str(len(test_set)))

# -------------Step 4-------------
# Build the Model
start = time.time()
classifier = nltk.NaiveBayesClassifier.train(train_set)
elapsed_training = print_elapsed_time(start)
print("Elapsed time for training:" + elapsed_training)
start = time.time()
accuracy = nltk.classify.accuracy(classifier, test_set)
print("Classifier Accuracy:" + str(accuracy * 100))
elapsed_accuracy = print_elapsed_time(start)
print("Elapsed time for accuracy testing:" + elapsed_accuracy)



Train set:84000
Test set:21000
Elapsed time for training:0:00:02
Classifier Accuracy:96.5904761904762
Elapsed time for accuracy testing:0:00:14
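The split sizes above come from a plain 80/20 slice of the shuffled featuresets. A minimal standalone sketch of that step, using only the standard library, is shown here; `split_train_test` and its `seed` parameter are hypothetical names added for reproducibility and are not part of the util package.

```python
import random


def split_train_test(items, train_percentage=80, seed=None):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the boundary.
    slice_by = (train_percentage * len(shuffled)) // 100
    return shuffled[:slice_by], shuffled[slice_by:]


# Same size as the 105000 featuresets in this notebook.
train, test = split_train_test(range(105000), seed=0)
```

Passing a fixed seed makes the reported accuracy repeatable between runs, which the in-place numpy.random.shuffle call does not guarantee unless the global seed is set.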


### Deploy model

Once the model has been trained, run it against the europarl.test file.
Accuracy is the number of test entries labeled correctly divided by the total number attempted.

The accuracy of this model against the europarl.test file is **92.58%**.
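The accuracy figure can be reproduced from the counts reported in this notebook (19442 correct out of 21000). The sketch below also derives per-language accuracy from a counter whose keys look like 'bg_correct' and 'bg_incorrect', as in the results printed by Step 5. Both function names are hypothetical and are not part of util.classification.

```python
def accuracy_percentage(correct, total):
    """Return correct / total as a percentage."""
    if total == 0:
        raise ValueError("total must be greater than zero")
    return correct / total * 100


def per_language_accuracy(language_counter):
    """Map each language code to its accuracy percentage.

    Assumes keys of the form '<lang>_correct' and '<lang>_incorrect';
    a missing key is treated as a count of zero.
    """
    languages = sorted({key.split("_")[0] for key in language_counter})
    result = {}
    for lang in languages:
        correct = language_counter.get(lang + "_correct", 0)
        incorrect = language_counter.get(lang + "_incorrect", 0)
        result[lang] = accuracy_percentage(correct, correct + incorrect)
    return result


overall = accuracy_percentage(19442, 21000)
by_language = per_language_accuracy(
    {"bg_correct": 999, "bg_incorrect": 1, "el_correct": 1000}
)
```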

In [7]:
# -------------Step 5-------------
# Test against europarl_test file

europarl_testfile = "europarl.test"
results_outfile = "europarl_test_classified_attempt_" + str(number_of_documents) + "_" + str(upto_percentage) + "_" + str(number_of_common_letters)+".csv"
everyother = 20
start = time.time()
total_ctr, positive_ctr, negative_ctr, language_counter = test_europarltest_file_final(europarl_testfile, classifier, word_features, number_of_common_letters)
# results
   
print("       Total attempted: " + str(total_ctr))
print("  Classified correctly: " + str(positive_ctr))
print("Classified incorrectly: " + str(negative_ctr))
euro_accuracy = (positive_ctr / total_ctr) * 100
print("  Europartest Accuracy: "+str(euro_accuracy))
elapsed_accuracy = print_elapsed_time(start)
print("Elapsed time for accuracy testing:" + elapsed_accuracy)

print("Results per language:")
for k,v in language_counter.items():
    print("       "+k+":"+str(v))


       Total attempted: 21000
  Classified correctly: 19442
Classified incorrectly: 1558
  Europartest Accuracy: 92.58095238095238
Elapsed time for accuracy testing:0:00:19
Results per language:
       bg_correct:999
       bg_incorrect:1
       cs_incorrect:180
       cs_correct:820
       da_incorrect:81
       da_correct:919
       de_correct:912
       de_incorrect:88
       el_correct:1000
       en_correct:934
       en_incorrect:66
       es_correct:911
       es_incorrect:89
       et_incorrect:85
       et_correct:915
       fi_correct:951
       fi_incorrect:49
       fr_correct:971
       fr_incorrect:29
       hu_correct:947
       hu_incorrect:53
       it_incorrect:60
       it_correct:940
       lt_correct:932
       lt_incorrect:68
       lv_correct:917
       lv_incorrect:83
       nl_correct:941
       nl_incorrect:59
       pl_correct:944
       pl_incorrect:56
       pt_correct:934
       pt_incorrect:66
       ro_incorrect:56
       ro_correct:944
       sk_incorre