<a href="https://colab.research.google.com/github/brindhasenthilkumar/fmml2021/blob/main/Mod3_Lab3_fmml20210502.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Sahil Manoj Bhatt

NOTE: YOU ONLY NEED TO MAKE CHANGES/WRITE CODE IN CELLS THAT SPECIFICALLY MENTION TASK-1, TASK-2, etc.

WRITE ANY OBSERVATION(S), IF REQUIRED BY THE TASK, IN A SEPARATE CELL AT THE BOTTOM OF THE NOTEBOOK.  

---

In [None]:
# NLTK (or Natural Language Tool Kit) is a commonly used library for processing text.
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

### Data Cleaning and Preprocessing step

Raw text must be cleaned and converted into a form that machine-learning algorithms can use.  
For text, several things need to be taken into account:


1.   Removing numbers from the text.
2.   Handling capitalization and punctuation.
3.   Stemming and lemmatizing text.

Most importantly, algorithms cannot operate on words or images directly; these must first be converted into vectors, a numeric form that algorithms can work with.
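As a rough illustration of the first two steps, the sketch below uses only the standard library. `simple_clean` is a hypothetical helper introduced here for illustration; it is not the `cleanText` function used in this lab, and it leaves out stemming and lemmatization.

```python
import re

def simple_clean(text):
    """Hypothetical helper: keep letters only, lowercase, split on whitespace."""
    letters_only = re.sub(r"[^A-Za-z]", " ", text)  # digits and punctuation become spaces
    return letters_only.lower().split()

tokens = simple_clean("Wasted 2 hours!!")
print(tokens)
```

Note that replacing punctuation with spaces also splits contractions, so "It's" becomes the two tokens "it" and "s".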



In [None]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return ''

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

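    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither lemmatize nor stemmer was requested: return the cleaned tokens unchanged
    # (without this, the function would return None and callers would fail).
    return word_tokenize(text)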

In [None]:
sample_text = "finalizing"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print("Stemmer on: ", sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print("Lemmatizer on: ", sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print("Both on: ", sample_text_result)

finalizing
Stemmer on:  final
Lemmatizer on:  finalize
Both on:  finalize


### BAG OF WORDS 

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document. 
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
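The idea can be sketched with the standard library alone. The two toy documents below are made up for illustration, and `to_bow` is a hypothetical helper; the lab itself relies on scikit-learn's `CountVectorizer` for this step.

```python
from collections import Counter

docs = ["the movie was good", "the plot was bad bad"]

# Vocabulary: every distinct word across all documents, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_bow(doc):
    """Return the count of each vocabulary word in `doc` (word order is ignored)."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

bow_vectors = [to_bow(doc) for doc in docs]
print(vocab)
print(bow_vectors)
```

Shuffling the words of a document leaves its vector unchanged, which is exactly the loss of order information described above.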

In [None]:
# Function to convert train and test documents to bag-of-words vectors, with the option of removing stopwords. Returns the train and test document-term matrices.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


### TF-IDF 
TF-IDF (term frequency-inverse document frequency) weights each word by how often it occurs in a document and how rare it is across the whole collection. This addresses a weakness of plain Bag of Words, where words that are frequent in every document receive large counts even though they carry little information.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is based on the inverse of the document frequency and measures the informativeness of term *t*: the fewer documents contain *t*, the higher its IDF.
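A minimal sketch of these definitions, assuming the textbook weighting TF x log(N / DF), where N is the number of documents. The toy documents and the names `doc_freq` and `tfidf` are made up for illustration. scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from the ones computed here.

```python
import math
from collections import Counter

docs = [["good", "movie"], ["bad", "movie"], ["bad", "plot", "bad"]]
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Textbook weighting (an assumption of this sketch): raw count times log(N / DF)."""
    tf = Counter(doc)
    return {term: count * math.log(n_docs / doc_freq[term]) for term, count in tf.items()}

weights = [tfidf(doc) for doc in docs]
print(weights[2])
```

In the third document, "plot" occurs once but appears in no other document, so it ends up with a higher weight than "bad", which occurs twice but is shared with another document.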




In [None]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

### UNDERSTANDING THE DATA : A REVIEWS DATASET

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews.csv


In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv', sep='\t')

df.head(10)

Unnamed: 0,"sentence\tsentiment,,,,,,,"
"Not sure who was more lost - the flat characters or the audience,"" nearly half of whom walked out.","0"",,,,,,"
"Attempting artiness with black & white and clever camera angles,"" the movie disappointed - became even more ridiculous - as the acting was poor and the plot and lines almost non-existent.","0"",,,,,,"
"Very little music or anything to speak of.\t0,,,,,,,",
"The best scene in the movie was when Gerardo is trying to find a song that keeps running through his head.\t1,,,,,,,",
"The rest of the movie lacks art, charm, meaning... If it's about emptiness,"" it works I guess because it's empty.","0"",,,,"
"Wasted two hours.\t0,,,,,,,",
"Saw the movie today and thought it was a good effort,"" good messages for kids.","1"",,,,,,"
"A bit predictable.\t0,,,,,,,",
"Loved the casting of Jimmy Buffet as the science teacher.\t1,,,,,,,",
"And those baby owls were adorable.\t1,,,,,,,",


In [None]:
df.shape

(999, 1)

In [None]:
list1 = []
list2 = []
# Parse the file by hand: each line holds a sentence and a sentiment separated by a tab,
# plus stray commas and quotes that are stripped out here.
with open('reviews.csv', 'r') as file1:
  for line in file1:
    parts = line.split('\t')
    list1.append(parts[0].replace(',', '').strip('\"').replace('"', ''))
    list2.append(parts[1].replace(',', '').replace('"', '').replace('\n', ''))

In [None]:
for i in range(10):
  print(list2[i])

sentiment
0
0
0
1
0
0
1
0
1


In [None]:
df1 = pd.DataFrame({'sentence': list1, 'sentiment': list2}) # making a dataframe using the two lists

In [None]:
df1.isnull().sum()

sentence     0
sentiment    0
dtype: int64

In [None]:
df1.head(2)

Unnamed: 0,sentence,sentiment
0,sentence,sentiment
1,Not sure who was more lost - the flat characte...,0


In [None]:
df1 = df1.iloc[1:] # removes the first row and retains the other rows
df1

Unnamed: 0,sentence,sentiment
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1
5,The rest of the movie lacks art charm meaning....,0
...,...,...
995,I just got bored watching Jessice Lange take h...,0
996,Unfortunately any virtue in this film's produc...,0
997,In a word it is embarrassing.,0
998,Exceptionally bad!,0


In [None]:
df1.groupby("sentiment").count()

Unnamed: 0_level_0,sentence
sentiment,Unnamed: 1_level_1
0,499
1,500


### KNN MODEL

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
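Before looking at the scikit-learn models, it may help to see the core of KNN written out. The sketch below is a simplified, uniform-weight version using cosine distance and only the standard library; `cosine_distance`, `knn_predict`, and the toy vectors are hypothetical names and data for illustration, not what `KNeighborsClassifier` does internally.

```python
import math
from collections import Counter

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_vectors, train_labels, query, k):
    """Majority vote among the k training vectors closest to `query`."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: cosine_distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy count vectors (made-up data) with sentiment labels.
train_vectors = [[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 0]]
train_labels = [1, 1, 0, 0]
prediction = knn_predict(train_vectors, train_labels, query=[1, 0, 0], k=3)
print(prediction)
```

With `weights='distance'`, closer neighbours would count for more in the vote instead of each neighbour counting equally.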

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn(k, weight, distance):
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data =  df1 # pd.read_csv('reviews.csv',sep='\t')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    #knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)
    knn = neighbors.KNeighborsClassifier(n_neighbors=k, weights=weight, algorithm='auto', leaf_size=30, p=2, metric=distance, metric_params=None, n_jobs=1)

    print("k =", k, " Weights = ", weight, " Distance Metrics = ", distance)
    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn(k, weight, distance):
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = df1 # pd.read_csv('reviews.csv',sep='\t')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=k, weights= weight, algorithm='brute', leaf_size=30, p=2,
                                         metric=distance, metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [None]:
## KNN accuracy after using BoW and TF-IDF
predicted, y_test = bow_knn(7, 'uniform', 'cosine')
predicted, y_test = tfidf_knn(7, 'uniform', 'cosine')

k = 7  Weights =  uniform  Distance Metrics =  cosine
KNN with BOW accuracy = 70.5%
Cross Validation Accuracy: 0.70
[0.70411985 0.69548872 0.70676692]


KNN with TFIDF accuracy = 76.5%
Cross Validation Accuracy: 0.74
[0.76029963 0.72556391 0.7406015 ]


In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

print(classification_report(y_test,predicted))

print(confusion_matrix(y_test,predicted))

              precision    recall  f1-score   support

           0       0.74      0.74      0.74        90
           1       0.79      0.78      0.79       110

    accuracy                           0.77       200
   macro avg       0.76      0.76      0.76       200
weighted avg       0.77      0.77      0.77       200

[[67 23]
 [24 86]]
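
The figures in the classification report can be recomputed by hand from the confusion matrix, whose rows are true classes and whose columns are predicted classes. The short sketch below does this for class 1; the variable names are chosen here for illustration.

```python
# Confusion matrix from the output above: rows are true classes, columns are predicted classes.
cm = [[67, 23],
      [24, 86]]

tn, fp = cm[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm[1]  # true class 1: predicted 0, predicted 1

precision_1 = tp / (tp + fp)  # of the reviews predicted positive, the fraction truly positive
recall_1 = tp / (tp + fn)     # of the truly positive reviews, the fraction that were found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_1, 2), round(recall_1, 2), round(accuracy, 3))
```

These values agree with the report: precision 0.79 and recall 0.78 for class 1, and an overall accuracy of 0.765, which the report rounds to 0.77.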


In [None]:
predicted, y_test = bow_knn(9, 'distance', 'cosine')
predicted, y_test = tfidf_knn(9, 'distance', 'cosine')

k = 9  Weights =  distance  Distance Metrics =  cosine
KNN with BOW accuracy = 71.0%
Cross Validation Accuracy: 0.69
[0.69662921 0.68045113 0.69548872]


KNN with TFIDF accuracy = 75.5%
Cross Validation Accuracy: 0.74
[0.7752809  0.71428571 0.72932331]


In [None]:
predicted, y_test = bow_knn(9, 'uniform', 'minkowski')
predicted, y_test = tfidf_knn(9, 'uniform', 'minkowski')

k = 9  Weights =  uniform  Distance Metrics =  minkowski
KNN with BOW accuracy = 68.0%
Cross Validation Accuracy: 0.63
[0.61797753 0.63533835 0.63909774]


KNN with TFIDF accuracy = 77.0%
Cross Validation Accuracy: 0.74
[0.77153558 0.71052632 0.72556391]


In [None]:
predicted, y_test = bow_knn(9, 'distance', 'minkowski')
predicted, y_test = tfidf_knn(9, 'distance', 'minkowski')

k = 9  Weights =  distance  Distance Metrics =  minkowski
KNN with BOW accuracy = 69.0%
Cross Validation Accuracy: 0.64
[0.62546816 0.64285714 0.65037594]


KNN with TFIDF accuracy = 77.5%
Cross Validation Accuracy: 0.74
[0.77153558 0.71428571 0.72556391]


In [None]:
predicted, y_test = bow_knn(9, 'uniform', 'euclidean')
predicted, y_test = tfidf_knn(9, 'uniform', 'euclidean')

k = 9  Weights =  uniform  Distance Metrics =  euclidean
KNN with BOW accuracy = 68.0%
Cross Validation Accuracy: 0.63
[0.61797753 0.63533835 0.63909774]


KNN with TFIDF accuracy = 77.0%
Cross Validation Accuracy: 0.74
[0.77153558 0.71052632 0.72556391]


In [None]:
predicted, y_test = bow_knn(9, 'distance', 'euclidean')
predicted, y_test = tfidf_knn(9, 'distance', 'euclidean')

k = 9  Weights =  distance  Distance Metrics =  euclidean
KNN with BOW accuracy = 69.0%
Cross Validation Accuracy: 0.64
[0.62546816 0.64285714 0.65037594]


KNN with TFIDF accuracy = 77.5%
Cross Validation Accuracy: 0.74
[0.77153558 0.71428571 0.72556391]


In [None]:
predicted, y_test = bow_knn(9, 'uniform', 'cityblock')
predicted, y_test = tfidf_knn(9, 'uniform', 'cityblock')

k = 9  Weights =  uniform  Distance Metrics =  cityblock
KNN with BOW accuracy = 70.5%
Cross Validation Accuracy: 0.61
[0.64419476 0.56766917 0.60526316]


KNN with TFIDF accuracy = 66.0%
Cross Validation Accuracy: 0.60
[0.59925094 0.62406015 0.58270677]


In [None]:
predicted, y_test = bow_knn(9, 'distance', 'cityblock')
predicted, y_test = tfidf_knn(9, 'distance', 'cityblock')

k = 9  Weights =  distance  Distance Metrics =  cityblock
KNN with BOW accuracy = 71.0%
Cross Validation Accuracy: 0.62
[0.65543071 0.57518797 0.62406015]


KNN with TFIDF accuracy = 68.5%
Cross Validation Accuracy: 0.61
[0.59925094 0.63533835 0.58270677]


In [None]:
predicted, y_test = bow_knn(9, 'distance', 'chebyshev')
predicted, y_test = tfidf_knn(9, 'distance', 'chebyshev')

k = 9  Weights =  distance  Distance Metrics =  chebyshev
KNN with BOW accuracy = 50.0%
Cross Validation Accuracy: 0.51
[0.46441948 0.5037594  0.54887218]


KNN with TFIDF accuracy = 61.5%
Cross Validation Accuracy: 0.53
[0.5505618  0.52631579 0.51879699]


In [None]:
predicted, y_test = bow_knn(9, 'uniform', 'chebyshev')
predicted, y_test = tfidf_knn(9, 'uniform', 'chebyshev')

k = 9  Weights =  uniform  Distance Metrics =  chebyshev
KNN with BOW accuracy = 49.0%
Cross Validation Accuracy: 0.50
[0.46067416 0.4924812  0.54511278]


KNN with TFIDF accuracy = 61.5%
Cross Validation Accuracy: 0.53
[0.54681648 0.52631579 0.51879699]


In [None]:
## KNN accuracy after using TFIDF
# this function has been altered above

#predicted, y_test = tfidf_knn()

KNN with TFIDF accuracy = 75.5%
Cross Validation Accuracy: 0.73
[0.73033708 0.7406015  0.71052632]


### SPAM TEXT DATASET
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam_text_data.csv to spam_text_data.csv


In [None]:
import pandas as pd
df = pd.read_csv('spam_text_data.csv', on_bad_lines='skip')  # replaces error_bad_lines=False, which was removed in pandas 2.0
df.head(20)

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.groupby("Category").count()

Unnamed: 0_level_0,Message
Category,Unnamed: 1_level_1
0,4825
1,747


In [None]:
df.head(5)

Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
len(df)

5572

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn(k, weight, distance):
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam_text_data.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # Use the function arguments directly instead of an unpacked **dict that shadows the built-in dict.
    knn = neighbors.KNeighborsClassifier(n_neighbors=k, weights=weight, algorithm='auto', leaf_size=30, p=2, metric=distance, metric_params=None, n_jobs=1)

    print(" For Spam Dataset:  k =", k, " Weights = ", weight, " Distance Metrics = ", distance)
    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn(k, weight, distance):
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam_text_data.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # Use the function arguments rather than reading from the global variable named dict.
    knn = neighbors.KNeighborsClassifier(n_neighbors=k, weights=weight, algorithm='brute', leaf_size=30, p=2, metric=distance, metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
## KNN accuracy after using BoW and TF-IDF
dict = {"k": 7, "weight": "uniform", "distance": "cosine"}
predicted, y_test = bow_knn(**dict)
predicted, y_test = tfidf_knn(**dict)


 For Spam Dataset:  k = 7  Weights =  uniform  Distance Metrics =  cosine
KNN with BOW accuracy = 97.57847533632287%
Cross Validation Accuracy: 0.96
[0.95625841 0.96231494 0.96228956]


KNN with TFIDF accuracy = 97.57847533632287%
Cross Validation Accuracy: 0.96
[0.95693136 0.96366083 0.96094276]


In [None]:
dict = {"k": 7, "weight": "distance", "distance": "cosine"}
predicted, y_test = bow_knn(**dict)
predicted, y_test = tfidf_knn(**dict)

 For Spam Dataset:  k = 7  Weights =  distance  Distance Metrics =  cosine
KNN with BOW accuracy = 97.9372197309417%
Cross Validation Accuracy: 0.97
[0.96635262 0.9730821  0.96969697]


KNN with TFIDF accuracy = 98.29596412556054%
Cross Validation Accuracy: 0.97
[0.96298789 0.9717362  0.96498316]


In [None]:
dict = {"k": 7, "weight": "distance", "distance": "cityblock"}
predicted, y_test = bow_knn(**dict)
predicted, y_test = tfidf_knn(**dict)

 For Spam Dataset:  k = 7  Weights =  distance  Distance Metrics =  cityblock
KNN with BOW accuracy = 93.45291479820628%
Cross Validation Accuracy: 0.93
[0.93001346 0.92059219 0.92929293]


KNN with TFIDF accuracy = 93.09417040358744%
Cross Validation Accuracy: 0.92
[0.92328398 0.9179004  0.92255892]


In [None]:
dict = {"k": 5, "weight": "uniform", "distance": "cityblock"}
predicted, y_test = bow_knn(**dict)
predicted, y_test = tfidf_knn(**dict)

 For Spam Dataset:  k = 5  Weights =  uniform  Distance Metrics =  cityblock
KNN with BOW accuracy = 91.74887892376682%
Cross Validation Accuracy: 0.90
[0.90309556 0.89771198 0.9030303 ]


KNN with TFIDF accuracy = 91.74887892376682%
Cross Validation Accuracy: 0.90
[0.89771198 0.89636608 0.8976431 ]


In [None]:
dict = {"k": 5, "weight": "distance", "distance": "euclidean"}
predicted, y_test = bow_knn(**dict)
predicted, y_test = tfidf_knn(**dict)

 For Spam Dataset:  k = 5  Weights =  distance  Distance Metrics =  euclidean
KNN with BOW accuracy = 93.99103139013452%
Cross Validation Accuracy: 0.93
[0.93270525 0.92126514 0.93131313]


KNN with TFIDF accuracy = 93.99103139013452%
Cross Validation Accuracy: 0.92
[0.92664872 0.91857335 0.92525253]


In [None]:
dict = {"k": 5, "weight": "uniform", "distance": "euclidean"}
predicted, y_test = bow_knn(**dict)
predicted, y_test = tfidf_knn(**dict)

 For Spam Dataset:  k = 5  Weights =  uniform  Distance Metrics =  euclidean
KNN with BOW accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.91
[0.90713324 0.90040377 0.91245791]


KNN with TFIDF accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.90
[0.90040377 0.89771198 0.9010101 ]


In [None]:
dict = {"k": 7, "weight": "uniform", "distance": "euclidean"}
predicted, y_test = bow_knn(**dict)
predicted, y_test = tfidf_knn(**dict)

 For Spam Dataset:  k = 7  Weights =  uniform  Distance Metrics =  euclidean
KNN with BOW accuracy = 90.94170403587444%
Cross Validation Accuracy: 0.90
[0.89838493 0.8923284  0.9037037 ]


KNN with TFIDF accuracy = 90.85201793721973%
Cross Validation Accuracy: 0.89
[0.89165545 0.88694482 0.89090909]


In [None]:
from sklearn.metrics import confusion_matrix
pd.crosstab(y_test, predicted, rownames=['True'], colnames=['Predicted'], margins=True)

Predicted,0,1,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,67,23,90
1,24,86,110
All,91,109,200


In [None]:
print(confusion_matrix(y_test,predicted))

[[67 23]
 [24 86]]


In [None]:
# This cell may take some time to run
#predicted, y_test = bow_knn()

KNN with BOW accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.91
[0.90713324 0.90040377 0.91245791]




In [None]:
# This cell may take some time to run
# the function is modified 
#predicted, y_test = tfidf_knn()

KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]


### Questions to Think About and Answer
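1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?
Ans: Bag-of-Words uses raw counts, so words that appear frequently in many documents receive high importance even though they say little about any one document. TF-IDF down-weights such common words and emphasizes the words that are distinctive to a particular document.

Because the distinctive words carry more weight, the distances between document vectors are more informative and accuracy tends to improve. On the reviews dataset above, for example, KNN with k = 7 and cosine distance reached 76.5% with TF-IDF versus 70.5% with BoW.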

2. Can you think of techniques that are better than both BoW and TF-IDF?
Ans: Based on the material I found, several other techniques are reported to be more effective than BoW and TF-IDF. One example is word2vec, which represents words as dense vectors that capture the syntactic and semantic similarities between them.

3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

Stemming extracts the stem of a given word, which need not be a dictionary word.
Lemmatization reduces a given word to its lemma, a meaningful dictionary word.

Eg:
Finalization
Finally
Finalizing

With stemming, the output for all three words is "final".
In this case the stem happens to be a meaningful word.

For the same example, lemmatization gives
finalization
finally
finalize

These are three different meaningful words.

Pros and Cons of Stemming and Lemmatization:
1. Stemming is faster than lemmatization, but lemmatization gives meaningful words.
2. Stemming is appropriate for sentiment classification and spam detection, as the stems are sufficient to classify the observations.
3. Lemmatization is more useful when the meaning of words matters, such as analysing the questions and answers from a Google Form or any other digital source.


### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
