# Language Detection
There are three languages: A, B and C. Each language uses the same set of symbols: "A, o, e, t, p, g, and k." However, each language uses the symbols differently. In each of these languages we can model everything as P(next symbol | current symbol).

There is training data available for each language. This consists of several files each generated by sampling from a Markov model. Using Python, build a Markov model for each of the languages.
Now use the Markov model and Bayes' rule to classify the test cases. Write down how you used Bayes' rule to get your classifier. Give the full posterior distribution for each test case.
The audio dataset can be found here, while the symbol dataset is in this file.

For pre-class work you are asked to do two things:

1. build a Markov model for each language:
   1. find the initial distribution (With what probability does each letter occur first?)
   2. find the transition matrix (Look here if you’re stuck.)
   3. Now you have a Markov model :)
2. classify test cases for each language using Bayes’ rule → find the probability of a language, given the string: P(language = A|string = “Aoetpp”)
   1. Bayes rule: P(A|B) = $\frac{P(B|A)*P(A)}{P(B)}$
   2. P(language = A|string = “Aoetpp”) is the posterior we are interested in.
   3. P(string = “Aoetpp”|language = A) is the likelihood, the probability of a string given a certain language
      1. calculate the probability of the markov model for language A generating the string “Aoetpp”
      2. What is the probability of “A”? (initial distribution)
      3. What is the probability of “o” given “A”? (transition matrix)
   4. P(language = A) is the prior probability we have about the probability of the language. (You can assume a uniform prior.)
   5. P(string = “Aoetpp”) is the evidence, or the probability of the string “Aoetpp”.
      1. P(string = “Aoetpp”|language = A) * P(language = A)
      2. P(string = “Aoetpp”|language = B) * P(language = B)
      3. P(string = “Aoetpp”|language = C) * P(language = C)

Identify which language model has the highest probability … tadaaaam!

In [8]:
import os
os.listdir('symbol/')
files = os.listdir()

In [12]:
from pathlib import Path

entries = Path('symbol/')
files = []
for entry in entries.iterdir():
    files.append(entry.name)

In [89]:
langA = ['symbol/%s' % f for f in files if "langA" in f]
langB = ['symbol/%s' % f for f in files if "langB" in f]
langC = ['symbol/%s' % f for f in files if "langC" in f]
test = ['symbol/%s' % f for f in files if "test" in f]
print(len(langA))
print(langA[0])


30
symbol/language-training-langA-7


In [90]:
import collections

def parse_text(single_lang):
    symbols = {'A':0, 'o':1, 'e':2, 't':3, 'p':4, 'g':5, 'k':6}
    file = open(single_lang, 'r').read() 
    nums = [symbols[i] for i in file]
    return nums

def transition_matrix(transitions):
    '''
    the following code takes a list such as
    [1,1,2,6,8,5,5,7,8,8,1,1,4,5,5,0,0,0,1,1,4,4,5,1,3,3,4,5,4,1,1]
    with states labeled as successive integers starting with 0
    and returns a transition matrix, M,
    where M[i][j] is the probability of transitioning from i to j
    Resource: https://stackoverflow.com/questions/46657221/generating-markov-transition-matrix-in-python
    '''

    n = 1+ max(transitions) #number of states

    M = [[0]*n for _ in range(n)]

    for (i,j) in zip(transitions,transitions[1:]):
        M[i][j] += 1

    #now convert to probabilities:
    for row in M:
        s = sum(row)
        if s > 0:
            row[:] = [f/s for f in row]
    return M

def print_matrix(m): 
    for row in m: 
        print(' '.join('{0:.2f}'.format(x) for x in row))
    return None
    

def single_initial_prob(single_lang, char_index):
    '''Initial prob of an english character for a single file'''
    counter=collections.Counter(single_lang)
    return counter[char_index]/len(single_lang)

def initial_dist(lang, char):
    '''Initial prob of an english character for a language '''
    output = []
    symbols = {'A':0, 'o':1, 'e':2, 't':3, 'p':4, 'g':5, 'k':6} 
    for i in lang:
        char_index  = symbols[char]
        num = parse_text(i)
        output.append(single_initial_prob(num, char_index))
    return output 

## Find the probability of having 'A'? 

In [95]:
import numpy as np

prob_A = initial_dist(langA, 'A')
prob_B = initial_dist(langB, 'A')
prob_C = initial_dist(langC, 'A')
print(f'Average probability of getting A in language A file given language A is {np.mean(prob_A)}') 
print(f'Average probability of getting B in language A file given language B is {np.mean(prob_B)}') 
print(f'Average probability of getting C in language A file given language C is {np.mean(prob_C)}') 


Average probability of getting A in language A file is 0.13433333333333333
Average probability of getting B in language A file is 0.22399999999999995
Average probability of getting C in language A file is 0.14266666666666666


## What is the probability of “o” given “A” (transition matrix)? 

In [107]:
def baysien_prob(lang, given_char, pred_char): 
    '''Calculate the baysien prob or a predicted character given on character in a language'''
    symbols = {'A':0, 'o':1, 'e':2, 't':3, 'p':4, 'g':5, 'k':6} 
    give_ind = symbols[given_char]
    pred_ind = symbols[pred_char]
    output = []
    for i in lang:
        num = parse_text(i)
        m = transition_matrix(num)
        output.append(m[give_ind][pred_ind])
    return output 
langA_baysien = baysien_prob(langA, 'A', 'o')
langB_baysien = baysien_prob(langB, 'A', 'o')
langC_baysien = baysien_prob(langC, 'A', 'o')

print(f'Average probability of getting o given A in language A file given language A is {np.mean(langA_baysien)}')
print(f'Average probability of getting o given A in language B file given language B is {np.mean(langB_baysien)}')
print(f'Average probability of getting o given A in language C file given language C is {np.mean(langC_baysien)}')

Average probability of getting o given A in language A file is 0.016605098605098605
Average probability of getting o given A in language B file is 0.28682639852664976
Average probability of getting o given A in language C file is 0.07930071347061382


P(string = “Aoetpp”|language = A)

Probability of generating 'Aoetpp' is $P(A)* P(o|A) * P(e|Ao) * P(t|Aoe) * P(p|Aoet) * P(p|Aoetp)$


## Probability of generating 'Aoetpp'

In [131]:
def Aoetpp(lang): 
    all_lists  = [initial_dist(lang, 'A'), baysien_prob(lang, 'A', 'o'), baysien_prob(lang, 'o', 'e'), baysien_prob(lang, 'e', 't'), baysien_prob(lang, 't', 'p'), baysien_prob(lang, 'p', 'p')] 
    mul = 1
    for i in range(len(all_lists)-1): 
        mul = np.multiply(all_lists[i], all_lists[i+1])
    return  mul

print(f'Average probability of getting Aoetpp in language A file given language A is {np.mean(Aoetpp(langA))}')
print(f'Average probability of getting Aoetpp in language B file given language B is {np.mean(Aoetpp(langB))}')
print(f'Average probability of getting Aoetpp in language C file given language C is {np.mean(Aoetpp(langC))}')

Average probability of getting Aoetpp in language A file is 0.006219030300637773
Average probability of getting Aoetpp in language B file is 0.0
Average probability of getting Aoetpp in language C file is 0.030558180380914297


In [132]:
prob_lang = 1/3 
print(f'Average probability of getting Aoetpp in language A file given language A is {np.mean(Aoetpp(langA))*prob_lang}')
print(f'Average probability of getting Aoetpp in language B file given language B is {np.mean(Aoetpp(langB))* prob_lang}')
print(f'Average probability of getting Aoetpp in language C file given language C is {np.mean(Aoetpp(langC))*prob_lang}')

Average probability of getting Aoetpp in language A file given language A is 0.002073010100212591
Average probability of getting Aoetpp in language B file given language B is 0.0
Average probability of getting Aoetpp in language C file given language C is 0.010186060126971432
