### Lab3

In [13]:
import os
import numpy as np
from collections import Counter
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix


def make_Dictionary(train_dir):
    emails = [os.path.join(train_dir, f) for f in os.listdir(train_dir)]
    all_words = []
    for mail in emails:
        with open(mail) as m:
            for i, line in enumerate(m):
                if i == 2:
                    words = line.split()
                    all_words += words

    dictionary = Counter(all_words)

    # Drop tokens that are not purely alphabetic or are a single character.
    # Iterate over a snapshot of the keys so entries can be deleted safely.
    for item in list(dictionary):
        if not item.isalpha() or len(item) == 1:
            del dictionary[item]
    dictionary = dictionary.most_common(3000)
    return dictionary


def extract_features(mail_dir):
    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files), 3000))
    docID = 0
    for fil in files:
        with open(fil) as fi:
            for i, line in enumerate(fi):
                if i == 2:
                    words = line.split()
                    for word in words:
                        # Use a separate index name so the line counter `i` is not shadowed.
                        for wordID, d in enumerate(dictionary):
                            if d[0] == word:
                                features_matrix[docID, wordID] = words.count(word)
            docID = docID + 1
    return features_matrix


# Create a dictionary of words with its frequency

train_dir = 'train-mails'
dictionary = make_Dictionary(train_dir)

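# Prepare feature vectors per training mail and its labels.
# Labels are assigned by position: 0 = ham, 1 = spam (second half of the 702 mails).

train_labels = np.zeros(702)
train_labels[351:702] = 1  # the original slice 351:701 left the last mail labeled 0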
train_matrix = extract_features(train_dir)

# Train the SVM and Naive Bayes classifiers

model1 = LinearSVC()
model2 = MultinomialNB()

model1.fit(train_matrix, train_labels)
model2.fit(train_matrix, train_labels)

# Test the unseen mails for Spam

test_dir = 'test-mails'
test_matrix = extract_features(test_dir)
test_labels = np.zeros(260)
test_labels[130:260] = 1

result1 = model1.predict(test_matrix)
result2 = model2.predict(test_matrix)

print(confusion_matrix(test_labels, result1))
print(confusion_matrix(test_labels, result2))

[[126   4]
 [  6 124]]
[[129   1]
 [  9 121]]
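
The word lookup in `extract_features` scans the whole dictionary for every word of every mail. A minimal sketch of a faster alternative is shown below, assuming a word-to-column index is built once. The names `count_features`, `vocabulary_index`, and the sample texts are hypothetical and only for illustration; they are not part of the lab code.

```python
from collections import Counter


def count_features(text, vocabulary_index):
    """Return one feature row: the count of each vocabulary word in `text`."""
    counts = Counter(text.split())
    row = [0] * len(vocabulary_index)
    for word, n in counts.items():
        col = vocabulary_index.get(word)  # O(1) lookup instead of a linear scan
        if col is not None:
            row[col] = n
    return row


# Hypothetical three-word vocabulary standing in for the 3000-word dictionary.
vocabulary = ["free", "money", "meeting"]
vocabulary_index = {word: col for col, word in enumerate(vocabulary)}

sample_texts = ["free money free offer", "meeting schedule for monday"]
rows = [count_features(text, vocabulary_index) for text in sample_texts]
print(rows)
```

With the lab's data, the index could be built from `[w for w, _ in dictionary]`, and each row written into `features_matrix[docID]`.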


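_The test set contains 130 spam emails and 130 non-spam (ham) emails._

The confusion matrices above give the following **result**, where rows are the actual classes and columns are the predicted classes: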

| Naive Bayes | ham | spam |
| :---: | :---: | :---: |
|ham|129|1|
|spam|9|121|

| SVM | ham | spam |
| :---: | :---: | :---: |
|ham|126|4|
|spam|6|124|

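**Conclusion**: on this small email set, both models reach the same overall accuracy on the test set (250 of 260 correct, about 96.2%). They differ in how the 10 errors are distributed: Naive Bayes misclassifies 1 ham email as spam and 9 spam emails as ham, while the SVM's errors are more balanced (4 and 6).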
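
To make the comparison concrete, the sketch below derives accuracy, precision, and recall from the two confusion matrices reported above. It assumes spam is the positive class and that rows are actual classes and columns are predicted classes, as `confusion_matrix` returns them. The helper name `summarize` is hypothetical.

```python
def summarize(cm):
    """Return (accuracy, precision, recall) for a 2x2 confusion matrix.

    cm is [[TN, FP], [FN, TP]] with spam treated as the positive class.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp)  # share of predicted spam that is really spam
    recall = tp / (tp + fn)     # share of real spam that was caught
    return accuracy, precision, recall


svm_cm = [[126, 4], [6, 124]]
nb_cm = [[129, 1], [9, 121]]

for name, cm in [("SVM", svm_cm), ("Naive Bayes", nb_cm)]:
    accuracy, precision, recall = summarize(cm)
    print(f"{name}: accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Both models score the same accuracy, but Naive Bayes has higher precision (about 0.992 vs 0.969) and lower recall (about 0.931 vs 0.954): it rarely flags ham as spam, at the cost of letting more spam through.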

#### Question1
> The Naive Bayes algorithm is widely used in practice, for example in text classification, spam filtering, credit evaluation, and phishing website detection.

#### Question2
Advantage:
> The algorithm's logic is simple and stable.

When the attributes of the data set are relatively independent of each other, the naive Bayes method performs well.
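
How simple the logic is can be seen in a from-scratch sketch of a multinomial naive Bayes text classifier. This is an illustrative toy, not the `MultinomialNB` implementation used in the lab: the function names, the four training sentences, and the choice of Laplace smoothing with `alpha=1.0` are all assumptions made for the example.

```python
import math
from collections import Counter


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-likelihoods."""
    vocab = sorted({w for d in docs for w in d.split()})
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d.split())
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((counts[w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict_nb(model, text):
    """Pick the class with the highest log posterior score."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # Independence assumption: per-word log-probabilities simply add up.
        # Words never seen in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in text.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


docs = ["free money now", "win free prize", "meeting at noon", "project meeting notes"]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)

print(predict_nb(model, "free prize money"))
print(predict_nb(model, "noon meeting"))
```

Training is a single counting pass and prediction is a sum of table lookups, which is why the method is cheap and behaves predictably.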

#### Question3
Disadvantage:
> It requires the attributes to be independent.

In many cases this independence is hard to satisfy, because the attributes of a data set are often correlated. When that happens, classification performance can be greatly reduced.
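
One way to see the effect of correlated attributes: if two attributes carry the same evidence, naive Bayes still multiplies both in as if they were independent, so the evidence is counted twice. The sketch below uses a hypothetical likelihood ratio of 3 (a word three times as likely in spam as in ham) and an assumed prior of 0.5; the function name `posterior_spam` is illustrative only.

```python
def posterior_spam(likelihood_ratios, prior_spam=0.5):
    """Posterior P(spam | evidence) under the naive independence assumption.

    Each ratio is P(attribute | spam) / P(attribute | ham); naive Bayes
    multiplies them all together, whether or not they are independent.
    """
    odds = prior_spam / (1 - prior_spam)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1 + odds)


ratio = 3.0  # hypothetical value chosen for illustration

once = posterior_spam([ratio])          # the evidence counted once
twice = posterior_spam([ratio, ratio])  # a perfectly correlated copy added

print(once, twice)
```

A duplicated attribute adds no new information, yet the posterior moves from 0.75 to 0.9. With many correlated attributes the model can become strongly overconfident.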

#### Question4
When the data set attributes are approximately independent, that is, when the correlation between them is small.