# Assignment Task 1
### Generative vs Discriminative

**Generative** : Naive Bayes classifier, Gaussian Mixture model, GANs, LDA(Latent Dirichlet Allocation) <br>
**Discriminative** : Neural networks, Logistic regression, SVM<br> <br>
**Explanations** : <br>
**Generative Models** : These models are generative because they model how the data itself is distributed. A generative classifier such as Naive Bayes learns the joint probability p(x,y), factored as p(x|y)p(y), and obtains the posterior by dividing by the evidence p(x), i.e., p(y|x) = p(x,y)/p(x) = p(x|y)p(y)/p(x). Gaussian mixture models, GANs, and LDA likewise model the distribution of the data, so new samples x can be drawn from them.<br>
**Discriminative Models** : Each of these models is used in classification tasks, where we need the probability of output y given input x. They learn the conditional distribution p(y|x) (or, for SVM, a decision boundary) directly, without modeling the prior p(y) or the input distribution p(x).
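
The generative route to p(y|x) can be checked numerically. The sketch below uses a made-up 2x2 joint distribution (the numbers are assumptions chosen for illustration, not taken from any dataset) and recovers the posterior through Bayes' rule.

```python
import numpy as np

# Hypothetical toy joint distribution p(x, y):
# rows index x in {0, 1}, columns index y in {0, 1}.
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])

# Generative route: factor the joint as p(x|y) * p(y), then apply Bayes' rule.
p_y = joint.sum(axis=0)          # prior p(y)
p_x_given_y = joint / p_y        # likelihood p(x|y), each column sums to 1
p_x = joint.sum(axis=1)          # evidence p(x)
posterior = (p_x_given_y * p_y) / p_x[:, None]   # p(y|x), each row sums to 1

print(posterior)
```

A discriminative model would fit the rows of `posterior` directly from labelled examples and never estimate `p_x_given_y` or `p_y`.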


# Assignment Task 2
### Hidden Markov Model

In [None]:
import numpy as np
from nltk.corpus import brown

In [None]:
corpus = brown.tagged_sents(tagset='universal')[:-100] 
test_data= brown.tagged_sents(tagset='universal')[-10:]

In [None]:
tag_dict={}
word_dict={}

for sent in corpus:
    for elem in sent:
        if elem[0] not in word_dict:
            word_dict[elem[0]] = 0
        if elem[1] not in tag_dict:
            tag_dict[elem[1]] = 0
        word_dict[elem[0]] += 1
        tag_dict[elem[1]] += 1
print("Number of words in dict : ",len(word_dict))

In [None]:
tags_list = {}
tags_rev_list = []
i = 0
for tag, ct in tag_dict.items():
    tags_list[tag] = i
    tags_rev_list.append(tag)
    i += 1
NUM_TAGS = len(tags_list)
word_list = {}
i = 0
for word, ct in word_dict.items():
    word_list[word] = i
    i += 1
NUM_WORDS = len(word_list)
print("Tags List : ", tags_list)
print("Reversed Tag List : ",tags_rev_list)

#### Start Matrix

In [None]:
S = np.zeros(NUM_TAGS)
start_tags_ct = {}
total = 0
for sent in corpus:
    start_tag = sent[0][1]
    if start_tag not in start_tags_ct:
        start_tags_ct[start_tag] = 0
    start_tags_ct[start_tag] += 1
    total += 1
for start_tag, ct in start_tags_ct.items():
    S[tags_list[start_tag]] = ct/total
print("Start Matrix with dimensions",S.shape,":")
print(S)

#### Transition Matrix

In [None]:
P = np.zeros((NUM_TAGS,NUM_TAGS))
transition = {}
for sent in corpus:
    ln = len(sent)
    for i in range(1,ln):
        prev = sent[i-1][1]
        curr = sent[i][1]
        if prev not in transition:
            transition[prev] = {}
        if curr not in transition[prev]:
            transition[prev][curr] = 0
        transition[prev][curr] += 1
for prev,val in transition.items():
    total = 0
    for curr, ct in val.items():
        total += ct
    for curr, ct in val.items():
        P[tags_list[prev]][tags_list[curr]] = ct/total
print("Transition Matrix with dimensions",P.shape,":")
print(P)
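
The counting and row-normalisation step can be tried on a corpus small enough to verify by hand. This is a minimal sketch on a hypothetical two-sentence corpus; `toy_corpus`, `tag_index`, and `toy_P` are illustrative names that are not part of the notebook above.

```python
import numpy as np

# Hypothetical toy corpus of (word, tag) sentences, used only for illustration.
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"), (".", ".")],
]

toy_tags = sorted({tag for sent in toy_corpus for _, tag in sent})
tag_index = {tag: i for i, tag in enumerate(toy_tags)}

# Count tag bigrams: counts[prev, curr] is how often curr follows prev.
counts = np.zeros((len(toy_tags), len(toy_tags)))
for sent in toy_corpus:
    for (_, prev), (_, curr) in zip(sent, sent[1:]):
        counts[tag_index[prev], tag_index[curr]] += 1

# Normalise each row; rows with no outgoing transitions stay all-zero.
row_totals = counts.sum(axis=1, keepdims=True)
toy_P = np.divide(counts, row_totals, out=np.zeros_like(counts), where=row_totals > 0)

print(toy_tags)
print(toy_P)
```

Every row of a transition matrix built this way sums to 1, except rows for tags that only ever end a sentence, which remain zero. The same check, `P.sum(axis=1)`, is a quick sanity test on the full matrix.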

#### Emission Matrix

In [None]:
O = np.zeros((NUM_TAGS,NUM_WORDS))
emission = {}
for sent in corpus:
    for elem in sent:
        word = elem[0]
        tag = elem[1]
        if tag not in emission:
            emission[tag] = {}
        if word not in emission[tag]:
            emission[tag][word] = 0
        emission[tag][word] += 1
        
for tag, val in emission.items():
    total = 0
    for word, ct in val.items():
        total += ct
    for word, ct in val.items():
        O[tags_list[tag]][word_list[word]] = ct/total
print("Emission Matrix with dimensions",O.shape,":")
print(O)

#### Formulation of Viterbi Algorithm

In [None]:
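# Fallback emission probability for words that never occur in the training corpus.
# The same value is used at every position so that scores are comparable.
UNKNOWN_WORD_PROB = 1e-10

def viterbi(sequence):
    M = len(sequence)
    state_len = NUM_TAGS
    
    T1 = np.zeros((state_len,M))             # best path probability ending in each state
    T2 = np.zeros((state_len,M), dtype=int)  # backpointer to the best previous state
    
    # Initialisation: start probability times emission of the first word
    for state in range(state_len):
        prob = UNKNOWN_WORD_PROB
        if sequence[0] in word_list:
            prob = O[state,word_list[sequence[0]]]
        T1[state,0] = S[state]*prob
        T2[state,0] = -1
    # Recursion: extend the best path into each state, one word at a time
    for i in range(1,M):
        for state in range(state_len):
            prob = UNKNOWN_WORD_PROB
            if sequence[i] in word_list:
                prob = O[state,word_list[sequence[i]]]
            temp = T1[:,i-1]*P[:,state]*prob
            T1[state,i] = np.max(temp)
            T2[state,i] = np.argmax(temp)

    # Backtracking: follow the backpointers from the best final state
    seq = np.zeros(M, dtype=int)
    seq[M-1] = np.argmax(T1[:,M-1])
    best_score = np.max(T1[:,M-1])
    for state in range(M-2,-1,-1):
        seq[state] = T2[seq[state+1],state+1]
    return seq, best_score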
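
Multiplying many probabilities can underflow to zero on long sentences, which makes every path score tie at 0. One common remedy is to run the same recursion on log-probabilities, replacing products with sums. The following is a sketch under stated assumptions, not the notebook's method: `viterbi_log`, its `floor` parameter, and the toy matrices are hypothetical, and clipping zero probabilities to `floor` is a choice made here so that `log` stays finite.

```python
import numpy as np

def viterbi_log(obs, start, trans, emit, floor=1e-10):
    """Viterbi in log space. obs holds column indices into emit, or -1 for unknown words."""
    n_states, n_steps = len(start), len(obs)
    # Clip zeros to a small floor (an assumed value) so that log() stays finite.
    log_s = np.log(np.maximum(start, floor))
    log_p = np.log(np.maximum(trans, floor))
    log_o = np.log(np.maximum(emit, floor))
    unknown = np.full(n_states, np.log(floor))

    def emission(t):
        return log_o[:, obs[t]] if obs[t] >= 0 else unknown

    score = np.zeros((n_states, n_steps))
    back = np.zeros((n_states, n_steps), dtype=int)
    score[:, 0] = log_s + emission(0)
    for t in range(1, n_steps):
        # cand[prev, curr] = best score ending in prev, plus the prev -> curr transition
        cand = score[:, t - 1][:, None] + log_p
        back[:, t] = np.argmax(cand, axis=0)
        score[:, t] = np.max(cand, axis=0) + emission(t)

    # Follow the backpointers from the best final state.
    path = np.zeros(n_steps, dtype=int)
    path[-1] = np.argmax(score[:, -1])
    for t in range(n_steps - 2, -1, -1):
        path[t] = back[path[t + 1], t + 1]
    return path, score[path[-1], -1]

# Hypothetical two-state, two-word model for illustration only.
start = np.array([0.8, 0.2])
trans = np.array([[0.1, 0.9],
                  [0.7, 0.3]])
emit = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
path, log_score = viterbi_log([0, 1, 0], start, trans, emit)
print(path, np.exp(log_score))
```

The returned score is a log-probability, so it is compared with other log scores or passed through `np.exp` for display. Adapting this to the notebook would mean passing `S`, `P`, and `O` and mapping each word to its `word_list` index, with -1 for unseen words.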

In [None]:
test_sents = []
test_tags = []
for sent in test_data:
    test_sents.append([x[0] for x in sent])
    test_tags.append([x[1] for x in sent])

In [None]:
total_correct = 0
total = 0
test_labels = []
for i,sent in enumerate(test_sents):
    seq, best_score = viterbi(sent)
    print("Sentence : ", ' '.join(sent))
    print("Best Score : ",best_score)
    res_tags = [tags_rev_list[int(x)] for x in seq]
    test_labels.append(res_tags)
    print("Best Sequence : ",res_tags)
    print("Actual Sequence : ", test_tags[i])
    matches = np.array(res_tags)==np.array(test_tags[i])
    correct = np.mean(matches)
    total_correct += np.sum(matches)
    total += len(res_tags)
    print("Accuracy : ",correct*100,"%")
    print("------------------------------------------------------------------")
print("Overall Accuracy : ",total_correct,"/",total," = ",(total_correct*100/total),"%")

In [None]:
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

In [None]:
print("F1 score for HMM Model :")
print(metrics.flat_f1_score(test_tags, test_labels, 
                      average='weighted', labels=tags_rev_list))

In [None]:
print("Tag wise report for HMM Model")
print(metrics.flat_classification_report(
    test_tags, test_labels, labels=tags_rev_list, digits=3
))

# Assignment Task 3
### Conditional Random Fields

In [None]:
train_sents = corpus
def word2features(sent,i):
    word = sent[i][0]
    
    features ={
    'bias': 1.0,
    'length_of_word' : len(word),
    'startsWithUpper' : word[0].isupper(),
    'lower' : word.lower(),
    'word_index' : i,
    'length_of_sentence' : len(sent),
    }
                
    return features

def sent2features(sent):
    return [word2features(sent,i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for word,label in sent]

In [None]:
X_train=[sent2features(s) for s in train_sents]
y_train=[sent2labels(s) for s in train_sents]

X_test=[sent2features(s) for s in test_data]
y_test=[sent2labels(s) for s in test_data]

In [None]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs', 
    c1=0.1, 
    c2=0.1, 
    max_iterations=100, 
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

In [None]:
y_pred = crf.predict(X_test)
labels=list(crf.classes_)
print("F1 score for CRF model : ")
print(metrics.flat_f1_score(y_test, y_pred, 
                      average='weighted', labels=labels))

In [None]:
# Universal POS tags have no B-/I- prefixes, so a plain alphabetical sort is enough
sorted_labels = sorted(labels)
print("Tag Wise report for CRF model :")
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))

#### CRF features justification
**bias** - constant feature that lets the model learn a baseline preference for each tag<br>
**length_of_word** - the distribution of word lengths can differ between POS tags<br>
**startsWithUpper** - some nouns, notably proper nouns, start with an uppercase letter. This feature also gave a good improvement in the F1 score<br>
**lower** - words are lowercased in most NLP tasks, so providing the lowercase form as a feature means the same word in different cases shares at least this feature. It showed a tremendous improvement in F1 score (from ~0.40 to ~0.94)<br>
**word_index and length_of_sentence** - the relative position of a word in the sentence is a good feature for assigning POS tags. E.g., subjects or nouns usually come at the start and the verb comes later. These features improved the F1 score to 0.96
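
To see what the CRF actually receives, it can help to print the feature dictionaries for one short sentence. The sketch below re-declares the same six features under the hypothetical name `toy_word2features` so that it runs on its own; the sentence is made up for illustration.

```python
def toy_word2features(sent, i):
    # Same six features as word2features, re-declared so the snippet is self-contained.
    word = sent[i][0]
    return {
        'bias': 1.0,
        'length_of_word': len(word),
        'startsWithUpper': word[0].isupper(),
        'lower': word.lower(),
        'word_index': i,
        'length_of_sentence': len(sent),
    }

# Hypothetical tagged sentence in the same (word, tag) format as the Brown corpus.
toy_sent = [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB"), (".", ".")]
toy_features = [toy_word2features(toy_sent, i) for i in range(len(toy_sent))]
for feats in toy_features:
    print(feats)
```

Each token is described only by its own form and position. None of the features look at neighbouring words, so any context sensitivity comes from the CRF's tag-transition weights.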