# Language Identification in a document
To accomplish this task, two approaches will be used - first by comparing the n-most frequent words in the document with the n-most frequent words in each language and choosing the language with the highest score. Secondly, n-gram words approach will be used. The essence of both these approaches has been taken from [Comparison of Language Identification Techniques](https://dbs.cs.uni-duesseldorf.de/lehre/bmarbeit/barbeiten/ba_panich.pdf) by L Panich. Changes have been made in terms of designing matching scores. Additionally, comparatively small-sized datasets have been chosen to check the efficiency of these methods with limited data.



## Dataset
The dataset used here is the DLI-32 dataset as given in [https://github.com/xprogramer/DLI32-corpus](https://github.com/xprogramer/DLI32-corpus). It is a relatively small corpus that has 10 .txt files for each of the 32 languages, where text length ranges from 93 to 146 words. Here, for simplicity, I have considered 6 languages. I have performed certain preprocessing steps to consolidate all files of a particular language to a single file.


In [1]:
# extracting dataset from zip
import zipfile as zf
files = zf.ZipFile("dataset.zip", 'r')
files.extractall('dataset')
files.close()

In [2]:
# Old file names as given as 1.txt, 2.txt and so on; we consolidate them to 'French', 'English' and so on...
old_names_list = [i for i in range(1, 91)]
old_to_new = {
              'French' : [i for i in range(1,11)], # Files with serial no 1 to 10 are in french, 10 to 20 in English and so on
              'English' : [ i for i in range(11,21)],
              'Russian' : [i for i in range(31,41)],
              'German' : [i for i in range(41,51)],
              'Italian' : [i for i in range(51, 61)],
              'Spanish' : [i for i in range(71, 81)],
}

for lang in old_to_new.keys():
    outfile = 'dataset/' + lang + '.txt'
    with open( outfile, 'w') as fname: # destination file i.e., language
        for old_file in old_names_list:
            if old_file in old_to_new[lang]:
                with open('dataset/dataset/' + str(old_file) + '.txt') as inp:#opening each match of old_to_new[lang]
                    fname.write(inp.read())
                  

### 1. Most frequent word approach
In this method, we will compare the top n-most frequent words in the given document with the top n-frequent words obtained from our dataset for each language. A weighted and normalised score will be maintained based on the matches occured in this manner. The language with the highest score will be taken as the language for that document.

In [3]:
import re
import  string
import numpy as np
import math

#### Step 1) Create a list of n-frequent words based on input language

In [4]:
n_words = 20  # change the value 20 to any constant to include top-n words
def frequent_words(lang, n_words): #returns n-most frequent words for a doc
    words = re.sub('['+string.punctuation+']', '', lang).split() # convert text to a list of words with punctuation removed
    word_dict = {}
    for word in words:
        if word in word_dict:
            word_dict[word] += 1
        else:
            word_dict[word] = 1
    freq = 0 # stores the highest frequency
    
    sorted_words = sorted(word_dict.items(), key= lambda x : x[1], reverse = True)
    print (sorted_words)    
    final_words = []
    for i in range(len(sorted_words)):
        final_words.append(sorted_words[i][0])
    return final_words
    
#with open ('dataset/French.txt', 'r') as fname:
   # temp = fname.read()  
   # frequent_words(temp, 20)

#### Step 2) Generating n-most frequent words for each language
In this case , I am using a list of 20 most used words for each language

In [5]:
lang_list = ['French', 'English', 'Russian', 'German', 'Italian', 'Spanish']
freq_words = {}
normalizer = n_words*(n_words+1)/2 # corresponds to a perfect score if all words match
#print (normalizer)
for lang in lang_list:
    with open ('dataset/'+lang+'.txt', 'r') as fname:
        freq_words[lang] = frequent_words(fname.read(), n_words)
print (freq_words)        

[('de', 53), ('et', 32), ('le', 24), ('à', 18), ('pas', 17), ('les', 16), ('que', 14), ('la', 14), ('est', 13), ('en', 12), ('des', 11), ('pour', 11), ('un', 10), ('une', 10), ('votre', 10), ('ce', 10), ('vous', 9), ('ne', 9), ('sur', 8), ('qui', 8), ('dans', 8), ('par', 8), ('médecin', 7), ('nutritionniste', 7), ('taux', 7), ('avec', 7), ('faire', 7), ('plus', 6), ('peut', 6), ('il', 5), ('important', 5), ('ou', 5), ('oncologue', 5), ('nutriments', 5), ('a', 5), ('études', 4), ('dun', 4), ('traitement', 4), ('Il', 4), ('du', 4), ('tout', 4), ('toujours', 4), ('avant', 4), ('car', 4), ('quil', 4), ('je', 4), ('analyses', 4), ('au', 4), ('être', 4), ('physique', 4), ('nest', 4), ('équipe', 4), ('dire', 4), ('ont', 3), ('montré', 3), ('alimentation', 3), ('très', 3), ('survie', 3), ('collaborer', 3), ('fait', 3), ('loncologue', 3), ('saine', 3), ('après', 3), ('long', 3), ('déficit', 3), ('si', 3), ('quoi', 3), ('soit', 3), ('bon', 3), ('ça', 3), ('genre', 3), ('lactivité', 3), ('trop', 

#### Step 3) Generating matching scores for test file with each language
Here we use a weighted score: We find the n-most frequent words in the test file and then compare them with the frequent words in each language. If a word matches with the most frequent word(say, first word) in the given language, it will be given a higher association score with that language, compared to it matching with a less frequent word (say, 15th word). Thus, it is like a weighted score, which will be normalised in such a way so that if all words match, it is given a score of 1.

In [6]:
def matching_score(test_freq_words, freq_words):
    scores = {lang: 0 for lang in lang_list} # scores corresponding to each language
    for word in test_freq_words:
        for lang in freq_words.keys():
            if word in freq_words[lang] : # test word is present as a frequent word in a particulal language
                scores[lang] += (20 - freq_words[lang].index(word))
                # so that if it matches with a more frequent word(at the beginning), it gets a higher score
    for k in scores.keys():
        scores[k] = scores[k] / normalizer
    return scores    

#### Step 4) Testing with a user-input file

In [7]:
def language_identify(file_address, st): # argument 'st' denotes whether you are uploading a file or directly copying text
    if st == 'file': # If you are copying a file address
        test_file = file_address 
        test_freq_words = []
        with open(test_file, 'r', errors = 'ignore') as fname: # some characters throw an error with 'utf-8'
            test_freq_words = frequent_words(fname.read(), n_words)
    #print(test_freq_words) 
    #print(freq_words)
    elif st == 'not_file':
        test_freq_words = frequent_words(file_address, n_words)
    final_scores = sorted(matching_score(test_freq_words, freq_words).items(), key= lambda x:x[1], reverse = True)    
    #print (final_scores)
    print ('Language of document is:', final_scores[0][0])
    return final_scores[0][0]

In [8]:
file_1 = 'dataset/lang_test.txt'# CHANGE THIS URL TO INCLUDE YOUR OWN TEST FILE
language_identify(file_1, 'file')

[('und', 8), ('mit', 6), ('zu', 5), ('Zuschauer', 5), ('dem', 5), ('fr', 4), ('den', 4), ('des', 4), ('epische', 4), ('das', 4), ('der', 3), ('nicht', 3), ('eine', 3), ('seiner', 3), ('Theater', 3), ('Brecht', 3), ('zum', 3), ('es', 3), ('in', 3), ('er', 3), ('Begriff', 2), ('Verfremdung', 2), ('nur', 2), ('seine', 2), ('sondern', 2), ('die', 2), ('Brechtsche', 2), ('aber', 2), ('erst', 2), ('Schriften', 2), ('Theaters', 2), ('Das', 2), ('einen', 2), ('Theaterstck', 2), ('statt', 2), ('ihn', 2), ('macht', 2), ('Haltung', 2), ('Geschehen', 2), ('also', 2), ('Brechts', 1), ('spielt', 1), ('rein', 1), ('literarischen', 1), ('Werke', 1), ('wichtige', 1), ('Rolle', 1), ('Theaterstcke', 1), ('Opern', 1), ('auch', 1), ('Um', 1), ('Vorstellung', 1), ('verstehen', 1), ('muss', 1), ('man', 1), ('anhand', 1), ('einiger', 1), ('damit', 1), ('verbundenen', 1), ('epischen', 1), ('untersuchen', 1), ('im', 1), ('Gegensatz', 1), ('dramatischen', 1), ('Form', 1), ('aufstellt', 1), ('modern', 1), ('erklr

'Italian'

#### Step 5) Testing with random bits of text copied from wiki articles in different languages

In [10]:
while (True):
    test_input = input("\n Enter few lines of text: ")
    language_identify(test_input, 'not_file')
    user = input("Do you want to test more? (y/Y or N/n)")
    if user == 'N' or user == 'n':
        break


 Enter few lines of text: 
[]
Language of document is: French
Do you want to test more? (y/Y or N/n)n


## 2.  N-gram approach

If we use a list of most frequent collection of words instead of most frequent words, it will help resolve the ambiguities that arise sometimes while using a single word. For example, if the word 'a' comes frequently in a given document, we cannot correctly determine which language is talked about. But instead if we use a pair of words 'a huir', we can be more certain that the language is Spanish. The problem with word grams, however, is that it becomes increasingly unlikely to find a larger n-gram in a given language, i.e., 'This will do', 'This will be', 'This will work' will all be considered as different n-grams, even though first two letters are same in all of them. 

For this reason, we will use char level n-grams of a language. For this exercise, after a number of experiments with different combinations, I have decided to use the average normalised n-gram distances for bigrams, trigrams and quadgrams.

More weightage are given to those n-grams which do not have a underscore in them while assigning frequencies.


#### Step 1) Create n-grams out of given corpus of text and return top k n-grams along with their log probabilities

In [11]:
n = [2,3,4]  # Here we are choosing bigrams,trigrams and quadgrams; change this value to get n-grams with a particular n
k = 50 # Decides how many top n-grams will be used for calculating the distance metric

In [30]:
def create_n_grams(lang, num, k): #returns top k n-grams according to frequency
    words = re.sub('['+string.punctuation+']', '', lang) #  punctuation removed
    words = words.lower()
    words = re.sub('\s+', ' ', words).strip() # replaces multiple spaces, newline tabs with a single space
    words = words.replace(' ','_')# so that we can visualise spaces easily
    grams = {}
    #print (words)
    for i in range(len(words)-num):
        temp = words[i:i+num]
        if temp in grams:
            grams[temp] += 1
        else:
            grams[temp] = 1
    sum_freq = len(words) - num + 1
    for key in grams.keys():
        red = 1 # reduction factor equal 1 if no '_' is present
        if '_' in key: red = 2
        grams[key] = round(math.log(grams[key] / (red * sum_freq)), 3) #normalizing by dividing by total no of n-grams for that corpus and taking log                                             
    grams = sorted(grams.items(), key= lambda x : x[1], reverse = True) 
    #print (grams)
    final_grams = [] # contains a list of top k n-grams in a given language 
    log_probs = [] # contains logprobs corresponding to each n-gram
    for i in range(len(grams)):
        final_grams.append(grams[i][0])
        log_probs.append(grams[i][1])
    return final_grams, log_probs
       
with open ('dataset/French.txt', 'r') as fname:
    temp = fname.read()  
    create_n_grams(temp, 3, 20)    

####  Step 2) Feeding each language's text into create_n_grams() and storing the output as a dictionary indexed by language names

In [31]:
lang_list = ['French', 'English', 'Russian', 'German', 'Italian', 'Spanish']
bi_grams = {}
tri_grams = {}
quad_grams = {}
for lang in lang_list:
    with open ('dataset/'+lang+'.txt', 'r') as fname:
        file = fname.read()
        bi_grams[lang] = create_n_grams(file, n[0], k)
        tri_grams[lang] = create_n_grams(file, n[1], k)
        quad_grams[lang] = create_n_grams(file, n[2], k)       
n_grams = {2 : bi_grams, 3 : tri_grams, 4 : quad_grams}
#print (n_grams[2]) 

####  Step 3) Creating matching function to decide the distance between n-grams of two pieces of text 
If we find that a n-gram in a document matches with that of a given language, we take the absolute value of the difference of both its logprobs. We sum up all such absolute values to get the distance of the document from a given language. In case a n-gram of the document is not found in a given language, we penalize by adding a penalty term. Here, I have chosen the absolute value of that n-gram's logprob as the penalty term.

In [32]:
def matching_score_2(test_grams, grams_list, n): # n helps us know whether it is bigram, trigram or quadgram
    dist = {lang: 0 for lang in lang_list} # distance corresponding to each language
    for gram in test_grams[0]:
        for lang in grams_list.keys():
            idx_2 = test_grams[0].index(gram)
            if gram in n_grams[n][lang][0] : 
                idx = n_grams[n][lang][0].index(gram)
                dist[lang] += abs(n_grams[n][lang][1][idx] - test_grams[1][idx_2])
            else: # gram is not present in that language's corpus
                dist[lang] += abs(test_grams[1][idx_2])
                # penalty term
    return dist   

#### Step 4) Testing the n-gram approach with input .txt files as well as pieces of text copied from wikipedia

In [33]:
def language_identify_2(file_address, st): # argument 'st' denotes whether you are uploading a file or directly copying text
    test_bigrams = []
    test_trigrams = []
    test_quadgrams = []
    test_file = []
    if st == 'file': # If you are copying a file address
        temp = file_address
        with open(temp, 'r', errors = 'ignore') as fname: # some characters throw an error with 'utf-8'
            file_address = fname.read()
    #print (file_address) 
    test_bigrams = create_n_grams(file_address, 2, k)
    test_trigrams = create_n_grams(file_address, 3, k)
    test_quadgrams = create_n_grams(file_address, 4, k)
    bi_dist = matching_score_2(test_bigrams, bi_grams, 2) 
    tri_dist = matching_score_2(test_trigrams, tri_grams, 3)
    quad_dist = matching_score_2(test_quadgrams, quad_grams, 4) 
    #print (bi_dist, tri_dist)
    final_dist = {}
    for lang in bi_dist.keys():
        final_dist[lang] =bi_dist[lang] + tri_dist[lang] + quad_dist[lang]
    sum_dist = 0
    for dist in final_dist.values():
        sum_dist += dist
    for lang in final_dist.keys():
        final_dist[lang] /= sum_dist
    dist_list = sorted(final_dist.items(), key= lambda x:x[1])     
    #print (dist_list)    
    print ('Predicted language :' + dist_list[0][0] + '\n')
    return dist_list[0][0]

In [34]:
file_1 = 'dataset/lang_test.txt'# CHANGE THIS URL TO INCLUDE YOUR OWN TEST FILE
language_identify_2(file_1, 'file')

Predicted language :German



'German'

In [44]:
while (True):
    test_input = input("\n Enter few lines of text: ")
    language_identify_2(test_input, 'not_file')
    user = input("Do you want to test more? (y/Y or N/n)")
    if user == 'N' or user == 'n':
        break


 Enter few lines of text: Nadie ser condenado por actos u omisiones que en el momento de cometerse no fueron delictivos segn el Derecho nacional o internacional. Tampoco se impondr pena ms grave que la aplicable en el momento de la comisin del delito. 
Predicted language :Spanish

Do you want to test more? (y/Y or N/n)n


## Testing with Test Dataset


#### Testing with data from Universal Declaration of Human Rights Translations
Here we use random text from various translations of the Universal Declaration of Human Rights

In [22]:
import random
from random import randrange


In [41]:
count = 0
accuracy = 0
misclassified = [] # contains list of misclassified examples
for i in range(1000):
    random_index = randrange(0,6)
    random_lang = lang_list[random_index]
    file = 'test_set/' + random_lang#
    random_lines = random.choice(open(file, errors = "ignore").readlines())
    if len (random_lines) <= 20: # less than 20 characters is considered too short for language identification
        continue
    count += 1 # A valid test case needs to have 20 or more characters    
    print (random_lines)
    print ("Actual language : " + random_lang )
    predicted = language_identify_2(random_lines, 'not_file')
    #print ("Predicted language : ", predicted)
    if (predicted == random_lang): 
        accuracy += 1
    else:
        misclassified.append((random_lines, random_lang, predicted))
print ('Accuracy :', accuracy/count)   


Los hombres y las mujeres, a partir de la edad nbil, tienen derecho, sin restriccin alguna por motivos de raza, nacionalidad o religin, a casarse y fundar una familia; y disfrutarn de iguales derechos en cuanto al matrimonio, durante el matrimonio y en caso de disolucin del matrimonio. 

Actual language : Spanish
Predicted language :Spanish

принимая во внимание, что пренебрежение и презрение к правам человека привели к варварским актам, которые возмущают совесть человечества, и что создание такого мира, в котором люди будут иметь свободу слова и убеждений и будут свободны от страха и нужды, провозглашено как высокое стремление людей; и

Actual language : Russian
Predicted language :Russian

Ogni individuo ha diritto ad un'effettiva possibilit di ricorso a competenti tribunali contro atti che violino i diritti fondamentali a lui riconosciuti dalla costituzione o dalla legge.

Actual language : Italian
Predicted language :Italian

Niemand darf der Folter oder grausamer, unmenschlicher o

Predicted language :Spanish

This right may not be invoked in the case of prosecutions genuinely arising from non-political crimes or from acts contrary to the purposes and principles of the United Nations. 

Actual language : English
Predicted language :English

Toda persona tiene derecho a la educacin. La educacin debe ser gratuita, al menos en lo concerniente a la instruccin elemental y fundamental. La instruccin elemental ser obligatoria. La instruccin tcnica y profesional habr de ser generalizada; el acceso a los estudios superiores ser igual para tod
Actual language : Spanish
Predicted language :Spanish

Nadie ser condenado por actos u omisiones que en el momento de cometerse no fueron delictivos segn el Derecho nacional o internacional. Tampoco se impondr pena ms grave que la aplicable en el momento de la comisin del delito. 

Actual language : Spanish
Predicted language :Spanish

Toda persona tiene derecho al descanso, al disfrute del tiempo libre, a una limitacin razonable de 

Predicted language :Spanish

Ogni individuo ha diritto di lasciare qualsiasi paese, incluso il proprio, e di ritornare nel proprio paese. 

Actual language : Italian
Predicted language :Italian

Everyone has the right to the protection of the moral and material interests resulting from any scientific, literary or artistic production of which he is the author. 

Actual language : English
Predicted language :English

Toute personne a le droit de circuler librement et de choisir sa rsidence  l'intrieur d'un Etat. 

Actual language : French
Predicted language :French

Todo individuo tiene derecho a la vida, a la libertad y a la seguridad de su persona. 

Actual language : Spanish
Predicted language :Spanish

Каждый человек имеет право на эффективное восстановление в правах компетентными национальными судами в случаях нарушения его основных прав, предоставленных ему конституцией или законом. 

Actual language : Russian
Predicted language :Russian

Considerando que una concepcin comn de esto

Predicted language :Spanish

Ogni individuo ha il diritto di cercare e di godere in altri paesi asilo dalle persecuzioni. 

Actual language : Italian
Predicted language :Italian

Mtter und Kinder haben Anspruch auf besondere Frsorge und Untersttzung. Alle Kinder, eheliche wie auereheliche, genieen den gleichen sozialen Schutz. 

Actual language : German
Predicted language :German

Ogni individuo accusato di un reato  presunto innocente sino a che la sua colpevolezza non sia stata provata legalmente in un pubblico processo nel quale egli abbia avuto tutte le garanzie necessarie per la sua difesa. 

Actual language : Italian
Predicted language :Italian

Jeder hat das Recht auf Meinungsfreiheit und freie Meinungsuerung; dieses Recht schliet die Freiheit ein, Meinungen ungehindert anzuhngen sowie ber Medien jeder Art und ohne Rcksicht auf Grenzen Informationen und Gedankengut zu suchen, zu empfangen und zu verbreiten. 

Actual language : German
Predicted language :German

da es notwendig i

Predicted language :French

Toda persona tiene los derechos y libertades proclamados en esta Declaracin, sin distincin alguna de raza, color, sexo, idioma, religin, opinin poltica o de cualquier otra ndole, origen nacional o social, posicin econmica, nacimiento o cualquier otra condicin. 

Actual language : Spanish
Predicted language :Spanish

Toute personne accuse d'un acte dlictueux est prsume innocente jusqu' ce que sa culpabilit ait t lgalement tablie au cours d'un procs public o toutes les garanties ncessaires  sa dfense lui auront t assures. 

Actual language : French
Predicted language :French

Toda persona tiene derecho a salir de cualquier pas, incluso el propio, y a regresar a su pas. 

Actual language : Spanish
Predicted language :Spanish

Nadie ser sometido a torturas ni a penas o tratos crueles, inhumanos o degradantes. 

Actual language : Spanish
Predicted language :Spanish

Tutti sono eguali dinanzi alla legge e hanno diritto, senza alcuna discriminazione, ad una eguale 

Predicted language :Spanish

Toda persona tiene derecho a una nacionalidad. 

Actual language : Spanish
Predicted language :Spanish

Ogni individuo, in quanto membro della societ, ha diritto alla sicurezza sociale, nonch alla realizzazione attraverso lo sforzo nazionale e la cooperazione internazionale ed in rapporto con l'organizzazione e le risorse di ogni Stato, dei diritti economici, sociali e culturali indispensabili alla sua dignit ed al libero sviluppo della sua personalit.

Actual language : Italian
Predicted language :Italian

Tutti sono eguali dinanzi alla legge e hanno diritto, senza alcuna discriminazione, ad una eguale tutela da parte della legge. Tutti hanno diritto ad una eguale tutela contro ogni discriminazione che violi la presente Dichiarazione come contro qualsiasi incitamento a tale discriminazione.

Actual language : Italian
Predicted language :Italian

Nessun individuo potr essere sottoposto ad interferenze arbitrarie nella sua vita privata, nella sua famiglia, n

Predicted language :French

la presente dichiarazione universale dei diritti umani come ideale comune da raggiungersi da tutti i popoli e da tutte le Nazioni, al fine che ogni individuo ed ogni organo della societ, avendo costantemente presente questa Dichiarazione, si sforzi di promuovere, con l'insegnamento e l'educazione, il rispetto di questi diritti e di queste libert e di garantirne, mediante misure progressive di carattere nazionale e internazionale, l'universale ed effettivo riconoscimento e rispetto tanto fra i popoli degli stessi Stati membri, quanto fra quelli dei territori sottoposti alla loro giurisdizione.

Actual language : Italian
Predicted language :Italian

Toute personne a le droit de circuler librement et de choisir sa rsidence  l'intrieur d'un Etat. 

Actual language : French
Predicted language :French

This right may not be invoked in the case of prosecutions genuinely arising from non-political crimes or from acts contrary to the purposes and principles of the Un

Predicted language :French

A nadie se privar arbitrariamente de su nacionalidad ni del derecho a cambiar de nacionalidad. 

Actual language : Spanish
Predicted language :Spanish

Toute personne a le droit de fonder avec d'autres des syndicats et de s'affilier  des syndicats pour la dfense de ses intrts. 

Actual language : French
Predicted language :French

провозглашает настоящую Всеобщую декларацию прав человека в качестве задачи, к выполнению которой должны стремиться все народы и государства с тем, чтобы каждый человек и каждый орган общества, постоянно имея в виду настоящую Декларацию, стремились путем просвещения и образования содействовать уважению этих прав и свобод и обеспечению, путем национальных и международных прогрессивных мероприятий, всеобщего и эффективного признания и осуществления их как среди народов государств-членов Организации, так и среди народов территорий, находящихся под их юрисдикцией. 

Actual language : Russian
Predicted language :Russian

Toute personne,

Predicted language :English

Nul ne sera l'objet d'immixtions arbitraires dans sa vie prive, sa famille, son domicile ou sa correspondance, ni d'atteintes  son honneur et  sa rputation. Toute personne a droit  la protection de la loi contre de telles immixtions ou de telles atteintes. 

Actual language : French
Predicted language :French

Toute personne a droit  la libert de pense, de conscience et de religion; ce droit implique la libert de changer de religion ou de conviction ainsi que la libert de manifester sa religion ou sa conviction, seule ou en commun, tant en public qu'en priv, par l'enseignement, les pratiques, le culte et l'accomplissement des rites. 

Actual language : French
Predicted language :French

Niemand darf der Folter oder grausamer, unmenschlicher oder erniedrigender Behandlung oder Strafe unterworfen werden. 

Actual language : German
Predicted language :German

Каждый человек, обвиняемый в совершении преступления, имеет право считаться невиновным до тех пор, пок

Predicted language :Spanish

Toda persona tiene el derecho de acceso, en condiciones de igualdad, a las funciones pblicas de su pas. 

Actual language : Spanish
Predicted language :Spanish

Declaracin Universal de Derechos Humanos

Actual language : Spanish
Predicted language :Spanish

Ogni individuo ha diritto alla libert di riunione e di associazione pacifica. 

Actual language : Italian
Predicted language :Italian

da die Nichtanerkennung und Verachtung der Menschenrechte zu Akten der Barbarei gefhrt haben, die das Gewissen der Menschheit mit Emprung erfllen, und da verkndet worden ist, da einer Welt, in der die Menschen Rede- und Glaubensfreiheit und Freiheit von Furcht und Not genieen, das hchste Streben des Menschen gilt, 

Actual language : German
Predicted language :German

Jeder hat Anspruch auf die in dieser Erklrung verkndeten Rechte und Freiheiten ohne irgendeinen Unterschied, etwa nach Rasse, Hautfarbe, Geschlecht, Sprache, Religion, politischer oder sonstiger berzeugung, 

Predicted language :German

Le mariage ne peut tre conclu qu'avec le libre et plein consentement des futurs poux. 

Actual language : French
Predicted language :French

Toute personne a droit  un recours effectif devant les juridictions nationales comptentes contre les actes violant les droits fondamentaux qui lui sont reconnus par la constitution ou par la loi. 

Actual language : French
Predicted language :French

Ogni individuo ha diritto al riposo ed allo svago, comprendendo in ci una ragionevole limitazione delle ore di lavoro e ferie periodiche retribuite.

Actual language : Italian
Predicted language :Italian

Генеральная Ассамблея, 

Actual language : Russian
Predicted language :French

Nessun individuo potr essere arbitrariamente privato della sua cittadinanza, n del diritto di mutare cittadinanza. 

Actual language : Italian
Predicted language :Italian

Ogni individuo che lavora ha diritto ad una rimunerazione equa e soddisfacente che assicuri a lui stesso e alla sua famiglia

Predicted language :Italian

Die Familie ist die natrliche Grundeinheit der Gesellschaft und hat Anspruch auf Schutz durch Gesellschaft und Staat. 

Actual language : German
Predicted language :German

Ogni individuo che lavora ha diritto ad una rimunerazione equa e soddisfacente che assicuri a lui stesso e alla sua famiglia una esistenza conforme alla dignit umana ed integrata, se necessario, da altri mezzi di protezione sociale. 

Actual language : Italian
Predicted language :Italian

Все люди равны перед законом и имеют право, без всякого различия, на равную защиту закона. Все люди имеют право на равную защиту от какой бы то ни было дискриминации, нарушающей настоящую Декларацию, и от какого бы то ни было подстрекательства к такой дискриминации. 

Actual language : Russian
Predicted language :Russian

Considerando tambin esencial promover el desarrollo de relaciones amistosas entre las naciones, 

Actual language : Spanish
Predicted language :Spanish

принимая во внимание, что прене

Predicted language :English

﻿Всеобщая декларация прав человека

Actual language : Russian
Predicted language :Russian

En caso de persecucin, toda persona tiene derecho a buscar asilo, y a disfrutar de l, en cualquier pas. 

Actual language : Spanish
Predicted language :Spanish

Todos son iguales ante la ley y tienen, sin distincin, derecho a igual proteccin de la ley. Todos tienen derecho a igual proteccin contra toda discriminacin que infrinja esta Declaracin y contra toda provocacin a tal discriminacin. 

Actual language : Spanish
Predicted language :Spanish

Jeder hat das Recht auf Meinungsfreiheit und freie Meinungsuerung; dieses Recht schliet die Freiheit ein, Meinungen ungehindert anzuhngen sowie ber Medien jeder Art und ohne Rcksicht auf Grenzen Informationen und Gedankengut zu suchen, zu empfangen und zu verbreiten. 

Actual language : German
Predicted language :German

Everyone has the right to rest and leisure, including reasonable limitation of working hours and periodic h

Predicted language :French

Everyone has the right to equal access to public service in his country. 

Actual language : English
Predicted language :English

Considrant que la mconnaissance et le mpris des droits de l'homme ont conduit  des actes de barbarie qui rvoltent la conscience de l'humanit et que l'avnement d'un monde o les tres humains seront libres de parler et de croire, librs de la terreur et de la misre, a t proclam comme la plus haute aspiration de l'homme, 

Actual language : French
Predicted language :French

A partir de l'ge nubile, l'homme et la femme, sans aucune restriction quant  la race, la nationalit ou la religion, ont le droit de se marier et de fonder une famille. Ils ont des droits gaux au regard du mariage, durant le mariage et lors de sa dissolution. 

Actual language : French
Predicted language :French

Considerando que una concepcin comn de estos derechos y libertades es de la mayor importancia para el pleno cumplimiento de dicho compromiso, 

Actual lang

Predicted language :French

No one shall be subjected to arbitrary arrest, detention or exile. 

Actual language : English
Predicted language :English

Everyone, as a member of society, has the right to social security and is entitled to realization, through national effort and international co-operation and in accordance with the organization and resources of each State, of the economic, social and cultural rights indispensable for his dignity and the free development of his personality. 

Actual language : English
Predicted language :English

Nul ne peut tre oblig de faire partie d'une association. 

Actual language : French
Predicted language :French

Jeder hat das Recht auf eine Staatsangehrigkeit. 

Actual language : German
Predicted language :German

принимая во внимание, что народы Объединенных Наций подтвердили в Уставе свою веру в основные права человека, в достоинство и ценность человеческой личности и в равноправие мужчин и женщин и решили содействовать социальному прогрессу

Predicted language :English

Tutti sono eguali dinanzi alla legge e hanno diritto, senza alcuna discriminazione, ad una eguale tutela da parte della legge. Tutti hanno diritto ad una eguale tutela contro ogni discriminazione che violi la presente Dichiarazione come contro qualsiasi incitamento a tale discriminazione.

Actual language : Italian
Predicted language :Italian

Toda persona acusada de delito tiene derecho a que se presuma su inocencia mientras no se pruebe su culpabilidad, conforme a la ley y en juicio pblico en el que se le hayan asegurado todas las garantas necesarias para su defensa. 

Actual language : Spanish
Predicted language :Spanish

Ogni individuo ha diritto all'istruzione. L'istruzione deve essere gratuita almeno per quanto riguarda le classi elementari e fondamentali. L'istruzione elementare deve essere obbligatoria. L'istruzione tecnica e professionale deve essere messa alla portata di tutti e l'istruzione superio
Actual language : Italian
Predicted language :It

#### Seeing misclassified examples

In [46]:
print (accuracy, count)
print (set(misclassified))

522 530
{('Генеральная Ассамблея, \n', 'Russian', 'French'), ('Resolution 217 A (III) vom 10.12.1948 \n', 'German', 'English'), ('The General Assembly, \n', 'English', 'German')}


## Conclusion

We see that the ensemble of 2, 3 and 4-grams trained on the distance metric defined above achieves an accuracy of around 98.5%. Out of 530 examples, it misclassifies only 3 unique examples which are having very few words (2 or 3), which is usually not the case for documents. Furthermore, this method produces these results after being trained on a comparatively small corpus over each language. More improvement can be obtained by training over a larger corpus. 