# Assignment 2 (100 points)

**Name: Daniel Shaquille** <br>
**Email: das9688@thi.de** <br>
**Group:** B <br>
**Hours spent *(optional)* :** <br>

### SMS Spam Detection *(100 points)*

<p>You are hired as an AI expert in the development department of a telecommunications company. The first thing on your orientation plan is a small project that your boss has assigned you for the following given situation. Your supervisor has given away his private cell phone number on too many websites and is now complaining about daily spam SMS. Therefore, it is your job to write a spam detector in Python. </p>

<p>In doing so, you need to use a Naive Bayes classifier that can handle both bag-of-words (BoW) and tf-idf features as input. For the evaluation of your spam detector, an SMS collection is available as a dataset - this has yet to be suitably split into train and test data. To keep the costs as low as possible and to avoid problems with copyrights, your boss insists on a new development with Python.</p>

<p>Include a short description of the data preprocessing steps, method, experiment design, hyper-parameters, and evaluation metric. Also, document your findings, drawbacks, and potential improvements.</p>

<p>Note: You need to implement the bag-of-words (BoW) and tf-idf feature extractors from scratch. You can use existing Python libraries for other tasks.</p>

**Dataset and Resources**

* SMS Spam Collection Dataset: https://archive.ics.uci.edu/dataset/228/sms+spam+collection

In [None]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score


<h1>Bag of Words</h1>

In [None]:
data = pd.read_csv("SMSSpamCollection", sep='\t', header=None, encoding='ISO-8859-1')

x = data[1]
y = data[0]

data = x.tolist()


# Now 'data' is a list containing only the messages
original_data = data


In [None]:
'''Bag Of Words'''


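# Build the vocabulary: map each unique word to a column index
def custom_fit(data):

    unique_words = set()

    for each_sentence in data:
        # Split on any whitespace, matching the tokenization in custom_transform
        for each_word in each_sentence.split():

            # Ignore very short tokens (fewer than 3 characters)
            if len(each_word) > 2:
                unique_words.add(each_word)

    # Sort the words so that the column indices are reproducible
    vocab = {}

    for i, w in enumerate(sorted(unique_words)):
        vocab[w] = i

    return vocab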


# Convert the documents into a bag-of-words count matrix
def custom_transform(data):
    vocab = custom_fit(data)

    # One row per document, one column per vocabulary word
    num_docs = len(data)
    num_words = len(vocab)
    bow_matrix = np.zeros((num_docs, num_words), dtype=int)

    for i, doc in enumerate(data):
        for word in doc.split():
            # Words shorter than 3 characters are never in the vocabulary
            col_index = vocab.get(word)
            if col_index is not None:
                bow_matrix[i, col_index] += 1

    return bow_matrix

bag_of_words = custom_transform(original_data)


In [None]:
X_train_BoW, X_test_BoW, y_train_BoW, y_test_BoW = train_test_split(bag_of_words, y, test_size=0.2, random_state=50)

# Train and evaluate the classifier
classifier = MultinomialNB()
classifier.fit(X_train_BoW, y_train_BoW)
y_pred = classifier.predict(X_test_BoW)

# Calculate accuracy
accuracy = accuracy_score(y_test_BoW, y_pred)
print("Accuracy:", accuracy)
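
Accuracy alone can hide weak spam detection when ham messages dominate the test set. As a minimal sketch, assuming the labels are the strings `ham` and `spam` as in the dataset file, precision, recall, and F1 for the spam class can be computed directly from the predictions. The helper name `spam_metrics` and the toy labels are illustrative only and are not results from this notebook.

```python
def spam_metrics(y_true, y_pred, positive="spam"):
    """Return (precision, recall, f1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels (hypothetical, chosen only to exercise the function)
y_true_toy = ["ham", "ham", "spam", "spam", "ham", "spam"]
y_pred_toy = ["ham", "spam", "spam", "ham", "ham", "spam"]

precision, recall, f1 = spam_metrics(y_true_toy, y_pred_toy)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In the notebook itself, the already imported `classification_report(y_test_BoW, y_pred)` reports the same quantities for both classes.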


In [None]:
'''Bag of Words'''

'''
custom_fit builds the vocabulary: it goes through every message, collects the unique words
longer than two characters, sorts them, and assigns each word a column index.

custom_transform then builds the bag-of-words matrix: it initializes a zero matrix with one row
per message and one column per vocabulary word, iterates over the messages, and stores how often
each vocabulary word occurs in that message. All other entries in the row stay zero.

After obtaining the bag-of-words matrix, we split the data into train and test sets, where the test set is 20 percent of the whole dataset.
We fit MultinomialNB() on the training set and then predict the labels of the test set.
Finally, we compute the accuracy score.

'''
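
The steps described above can be traced on a toy corpus. This is a minimal sketch rather than the notebook's implementation: it uses plain Python lists instead of NumPy, and, as an assumption that differs from the notebook, it builds the vocabulary from the training messages only, so words that appear only in the test message are ignored. The names `build_vocab`, `to_bow`, and the example messages are hypothetical.

```python
def build_vocab(docs, min_len=3):
    """Map each token of at least min_len characters to a column index."""
    tokens = {w for doc in docs for w in doc.split() if len(w) >= min_len}
    # Sorting keeps the column order stable between runs
    return {w: i for i, w in enumerate(sorted(tokens))}


def to_bow(docs, vocab):
    """Count vocabulary tokens per document; unknown tokens are skipped."""
    matrix = [[0] * len(vocab) for _ in docs]
    for row, doc in enumerate(docs):
        for w in doc.split():
            col = vocab.get(w)
            if col is not None:
                matrix[row][col] += 1
    return matrix


# Hypothetical toy messages
train_docs = ["free prize now", "call me now now"]
test_docs = ["free call today"]

vocab = build_vocab(train_docs)      # fitted on the training messages only
bow_train = to_bow(train_docs, vocab)
bow_test = to_bow(test_docs, vocab)  # "today" is not in the vocabulary

print(vocab)
print(bow_train)
print(bow_test)
```

Here "me" is dropped by the length filter and "today" contributes nothing to the test row, because it was never seen during fitting.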



<h1>TF-IDF</h1>

In [None]:
data = pd.read_csv("SMSSpamCollection", sep='\t', header=None, encoding='ISO-8859-1')
# print(data)

x = data[1]
y = data[0]

data = x.tolist()

# Now 'data' is a list containing only the messages
original_data2 = data
# Show only the first few messages instead of printing the whole dataset
print(original_data2[:5])



In [None]:

# Function for preprocessing
def preprocess_text(text):
    words_list = []
    for sent in text:
        words = [word.lower() for word in sent.split() if word.isalpha()]
        words_list.append(words)
    return words_list

sentences = preprocess_text(original_data2)


word_set = set(word for words in sentences for word in words)
# print(word_set)

total_docs = len(original_data2)


# We index each word from the vocabulary to map the word to the vector
# (sorted so that the column order is the same on every run)
word_index = {}
for i, w in enumerate(sorted(word_set)):
    word_index[w] = i
    

# A function to keep count of the number of documents containing the word
def count_dict(sentences):
    counts = {}
    for word in word_set:
        counts[word] = 0

    for sent in sentences:
        # set(sent) counts each word once per document (document frequency),
        # not once per occurrence
        for word in set(sent):
            counts[word] += 1

    return counts

word_count = count_dict(sentences)
# print(word_count)


# Term Frequency calculation: share of the document's tokens equal to the word
def term_frequency(document, word):
    N = len(document)
    occurrence = len([token for token in document if token == word])
    return occurrence / N

# Inverse Document Frequency Calculation
def inverse_document_frequency(word):
    # Add 1 to the document count to avoid division by zero for unseen words
    word_occurrence = word_count.get(word, 0) + 1

    return np.log(total_docs / word_occurrence)

# TF-IDF function
def tf_idf(sentence):
    vec = np.zeros((len(word_set),))
    for word in sentence:
        tf = term_frequency(sentence, word)
        idf = inverse_document_frequency(word)
        vec[word_index[word]] = tf * idf
    
    return vec


tf_idf_converted = []
for sent in sentences:
    tf_idf_converted.append(tf_idf(sent))



In [None]:
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(tf_idf_converted, y, test_size=0.2, random_state=50)

# Train and evaluate the classifier
classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train_tfidf)
y_pred = classifier.predict(X_test_tfidf)

# Note: Gaussian and Bernoulli Naive Bayes are alternative variants that could be compared here

# Calculate accuracy
accuracy = accuracy_score(y_test_tfidf, y_pred)
print("Accuracy:", accuracy)


In [None]:
'''TF-IDF'''

'''
First we preprocess and tokenize the text: each message is split on whitespace, tokens that are not purely alphabetic are dropped, and the remaining tokens are lowercased.
We then build the set of unique words in the corpus and index each word in the vocabulary.
The function count_dict() keeps count of the number of documents containing each word.
term_frequency() calculates the term frequency of a word in a document.
inverse_document_frequency() calculates the inverse document frequency of a word.
Finally, tf_idf() calculates the TF-IDF vector for a sentence; applying it to every message gives the arrays of tf-idf values for the dataset.

After obtaining the tf-idf arrays, we split the data into train and test sets, where the test set is 20 percent of the whole dataset.
We fit MultinomialNB() on the training set and then predict the labels of the test set.
Finally, we compute the accuracy score.

Findings: in this case, the bag-of-words features give a higher accuracy score than the tf-idf features.
Drawbacks: the preprocessing is basic. Splitting on whitespace leaves punctuation attached to words, the tf-idf tokenizer discards every token that contains digits or punctuation, and stop words are not removed.
Potential improvements: use nltk tools such as word_tokenize and stopwords, and compare the results against CountVectorizer.

'''
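
The tf-idf steps listed above can be checked by hand on a toy corpus. The following is a minimal sketch, not the notebook's code: it uses the same weighting as the notebook, tf = count / document length and idf = log(N / (df + 1)), where df is the number of documents containing the word. The names `document_frequency` and `tf_idf_vector` and the example documents are hypothetical.

```python
import math


def document_frequency(docs):
    """Count, for each word, the number of documents that contain it."""
    df = {}
    for doc in docs:
        for word in set(doc):  # set() counts a word once per document
            df[word] = df.get(word, 0) + 1
    return df


def tf_idf_vector(doc, df, n_docs, index):
    """Return the tf-idf vector of one tokenized document."""
    vec = [0.0] * len(index)
    for word in set(doc):
        if word not in index:
            continue  # skip words outside the vocabulary
        tf = doc.count(word) / len(doc)
        idf = math.log(n_docs / (df.get(word, 0) + 1))
        vec[index[word]] = tf * idf
    return vec


# Hypothetical, already tokenized toy documents
docs = [["free", "prize", "free"], ["call", "now"], ["call", "later"], ["see", "you"]]

df = document_frequency(docs)
index = {w: i for i, w in enumerate(sorted(df))}
vec = tf_idf_vector(docs[0], df, len(docs), index)

print(df)
print(vec)
```

For the first document, "free" has tf = 2/3 and idf = log(4 / 2), so its weight is (2/3) * log(2). With this idf, a word that occurs in every document gets a slightly negative weight, which MultinomialNB does not accept as input, so such a case would need a different smoothing.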

### Additional Experiments *(5 additional points - <span style="color: red;">Optional</span>)*

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer


In [None]:
data = pd.read_csv("SMSSpamCollection", sep='\t', header=None, encoding='ISO-8859-1')
# print(data)

x = data[1]
y = data[0]

data = x.tolist()

# Now 'data' is a list containing only the messages
original_data3 = data  

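# Build the stop-word set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    tokens = word_tokenize(text.lower())  # lowercase and tokenize
    tokens = [token for token in tokens if token.isalpha()]  # keep alphabetic tokens only
    tokens = [token for token in tokens if token not in stop_words]  # remove stop words
    return " ".join(tokens)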

processed_texts = [preprocess_text(text) for text in original_data3]

<h1>TF-IDF</h1>

In [None]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(processed_texts)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)


In [None]:
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred))


<h1>Bag of Words</h1>

In [None]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(processed_texts)
# Use the same split as in the other experiments so that the accuracies are comparable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)


In [None]:
classifier = MultinomialNB()
classifier.fit(X_train, y_train)


In [None]:
y_pred = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
