
# **SRISHTI'23 Tutorial - 10**
### Using KNN for Text Classification
#### Module Coordinator: Tanvi Kamble


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, let's look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing step

Raw text must be processed and converted into a form suitable for machine-learning algorithms.
With text, several things need to be taken into account:


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.

Most importantly, algorithms cannot work on raw words directly; the text must be converted into vectors, a numeric form that algorithms can understand.



In [3]:
""" Running, runs, runner: reducing related forms like these to a common base such as "run" is done by stemming or lemmatizing the text.

Stemming chops off a suffix or prefix using fixed rules, without any information about the underlying word "run",
whereas lemmatizing looks the word up in a dictionary, so the base form "run" is known to it."""

### **NLTK**
NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first set it up and download the resources we need.


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [6]:
""" The tag[:2] slice in the code below handles variation in the length of POS tags. Many tags in the Penn Treebank tagset are two characters long, such as 'NN', 'VB' and 'JJ', but others are longer, such as 'NNS' (plural noun), 'VBN' (past participle verb) and 'JJR' (comparative adjective).

Using tag[:2] keeps only the first two characters of the POS tag, so two-character tags and longer tags are treated alike. For example, tag[:2] gives 'NN' for 'NNS' and 'VB' for 'VBN'.

The leading characters are enough to determine the general POS category (noun, verb, adjective, adverb), which is all the lemmatizer needs here."""

In [7]:
""" In cleanText below, the call wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])) uses [:2] to take the first two characters of each token's POS tag.

The POS tags returned by NLTK's pos_tag() function are Penn Treebank tags, such as 'NN' for a noun, 'VB' for a verb, 'JJ' for an adjective, and so on.

The get_tag() function maps these Penn Treebank tags to the WordNet POS tags that the lemmatize() method expects. WordNet uses different codes: 'NN' corresponds to 'n', 'VB' to 'v', 'JJ' to 'a', and 'RB' to 'r'.

The two-character prefix identifies the POS category; get_tag() then checks its first letter to return the corresponding WordNet POS tag for lemmatization."""

In [39]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text) #non alphabetic characters are replaced with space
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return ''

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)

        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))

            except Exception:  # tag has no WordNet equivalent: fall back to lemmatize's default part of speech (noun)
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
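        return text_result

    # Neither lemmatize nor stemmer was requested: return the cleaned tokens as they are
    return word_tokenize(text)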

In [13]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
['troubling'] [('troubling', 'VBG')]
['trouble']
trouble
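
The output above shows the stemmer producing `troubl` and the lemmatizer producing `trouble`. The difference can be illustrated with a toy sketch. This is not the Snowball or WordNet algorithm: `SUFFIXES`, `LEMMA_DICT`, `toy_stem` and `toy_lemmatize` are hypothetical names made up for this illustration, and the rules and dictionary are deliberately tiny.

```python
# Toy illustration only: NOT the Snowball stemmer or the WordNet lemmatizer.
SUFFIXES = ("ing", "ed", "s")  # hypothetical rule list
LEMMA_DICT = {"troubling": "trouble", "running": "run", "ran": "run"}  # hypothetical mini-dictionary


def toy_stem(word):
    """Strip the first matching suffix, with no knowledge of real words."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary; unknown words are returned unchanged."""
    word = word.lower()
    return LEMMA_DICT.get(word, word)


for w in ["Troubling", "running", "ran"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The rule-based function can return strings that are not words (`troubl`, `runn`) and cannot relate `ran` to `run`, while the dictionary-based function can do both but only for words it knows.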


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
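
To make the "order is discarded" point concrete, here is a minimal sketch using only the standard library. The helper name `bag_of_words` and the two sentences are made up for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercase, whitespace-separated words; word order is not recorded."""
    return Counter(text.lower().split())


a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

print(a)
print(a == b)  # the two sentences mean different things but have the same bag
```

Both sentences map to the same counts, which is exactly the information a bag-of-words model keeps and the information it throws away.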

In [14]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test
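
`createBagOfWords` fits the vectorizer on the training text only and then reuses it to transform the test text. The sketch below imitates that split in plain Python so the effect is visible. It is a simplification: `fit_vocabulary` and `transform` are hypothetical helpers, columns are ordered by first appearance (scikit-learn's `CountVectorizer` orders its vocabulary alphabetically), and tokenization is a plain whitespace split.

```python
def fit_vocabulary(docs):
    """Assign a column index to every word seen in the training documents."""
    vocab = {}
    for doc in docs:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def transform(docs, vocab):
    """Turn documents into count vectors using an already fitted vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.lower().split():
            if word in vocab:  # words never seen during fitting are ignored
                row[vocab[word]] += 1
        rows.append(row)
    return rows


train_docs = ["good food", "bad service"]
test_docs = ["good good service", "terrible food"]

vocab = fit_vocabulary(train_docs)
print(vocab)
print(transform(test_docs, vocab))
```

The word `terrible` never appears in the training documents, so it has no column and contributes nothing to the test vector. Train and test vectors therefore always have the same length.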


In [15]:
""" Stop words are very common words that carry little meaning on their own. Examples in English include "the", "is", "and", "in" and "a". """

In [18]:
clean_train = ["I love to read books",
               "Reading is my favorite hobby",
               "Books transport me to different worlds"]
distinct=[]
for i in clean_train:
  for j in i.split(" "):
    if j not in distinct:
      distinct.append(j)

document_matrix=[]
for i in clean_train:
  temp=[i.split(" ").count(j) for j in distinct]  # count whole words, not substrings
  document_matrix.append(temp)
document_matrix

[[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]

## Section 1.3: TF-IDF
TF-IDF (term frequency-inverse document frequency) scores how important a word is to a document within a collection. It addresses a weakness of the Bag of Words technique, which counts every word equally, so very common words can dominate the representation used for text classification.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is based on the inverse of the document frequency and measures the informativeness of term *t*: the fewer documents a term appears in, the higher its IDF.
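
Written out, using the plain (unsmoothed) form that the hand-worked cells in this section compute, with $N$ the number of documents; the symbols $\mathrm{tf}$, $\mathrm{df}$ and $\mathrm{idf}$ are notation introduced here for convenience:

```latex
\begin{aligned}
\mathrm{tf}(t, d)  &= \text{number of times term } t \text{ occurs in document } d \\
\mathrm{df}(t)     &= \text{number of documents that contain } t \\
\mathrm{idf}(t)    &= \log\frac{N}{\mathrm{df}(t)} \\
\text{tf-idf}(t, d) &= \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
\end{aligned}
```

For example, with $N = 3$ documents, a term found in only one document has $\mathrm{idf} = \log 3 \approx 1.0986$, while a term found in two documents has $\mathrm{idf} = \log(3/2) \approx 0.4055$.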




In [19]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

In [26]:
import numpy as np
clean_train = ["I love to read books",
               "Reading is my favorite hobby",
               "Books transport me to different worlds"]
distinct=[]
for i in clean_train:
  for j in i.split(" "):
    if j not in distinct:
      distinct.append(j)

document_matrix=[]
for i in clean_train:
  temp=np.array([i.split(" ").count(j) for j in distinct])  # count whole words, not substrings
  document_matrix.append(temp)

document_matrix=np.array(document_matrix)
print("Term Frequency")
document_matrix

Term Frequency


array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]])

In [27]:
import math
print("IDF")
idf=[]
for j in range(len(distinct)):
  idf.append(math.log(len(clean_train)/np.sum(document_matrix[:,j])))
idf


IDF


[1.0986122886681098,
 1.0986122886681098,
 0.4054651081081644,
 1.0986122886681098,
 1.0986122886681098,
 1.0986122886681098,
 1.0986122886681098,
 1.0986122886681098,
 1.0986122886681098,
 1.0986122886681098,
 1.0986122886681098,
 1.0986122886681098,
 1.0986122886681098,
 1.0986122886681098,
 1.0986122886681098]

In [29]:
tf_idf=[i*idf for i in document_matrix]
tf_idf

[array([1.09861229, 1.09861229, 0.40546511, 1.09861229, 1.09861229,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ]),
 array([0.        , 0.        , 0.        , 0.        , 0.        ,
        1.09861229, 1.09861229, 1.09861229, 1.09861229, 1.09861229,
        0.        , 0.        , 0.        , 0.        , 0.        ]),
 array([0.        , 0.        , 0.40546511, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        1.09861229, 1.09861229, 1.09861229, 1.09861229, 1.09861229])]
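
The values above follow the plain definition `log(N / df)`. scikit-learn's `TfidfVectorizer`, which `createTFIDF` uses, by default applies a smoothed IDF, `ln((1 + N) / (1 + df)) + 1`, and then scales each document vector to unit L2 length, so its numbers will not match the hand calculation. The sketch below reproduces those two default steps with the standard library only. The helper names and the five-term document are hypothetical.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """IDF with scikit-learn's default smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def l2_normalize(row):
    """Scale a vector to unit Euclidean length (all-zero rows are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row


n_docs = 3
# Hypothetical document with five terms, each occurring once;
# the third term also appears in one other document (df = 2).
term_counts = [1, 1, 1, 1, 1]
doc_freqs = [1, 1, 2, 1, 1]

weights = [tf * smoothed_idf(n_docs, df) for tf, df in zip(term_counts, doc_freqs)]
normalized = l2_normalize(weights)
print([round(w, 4) for w in weights])
print([round(w, 4) for w in normalized])
```

Note also that `TfidfVectorizer` lowercases text and, with its default token pattern, drops single-character tokens such as "I", so its vocabulary for the three example sentences differs from the `distinct` list built by hand.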

In [31]:
""" IDF is Inverse Document Frequency. For each word in distinct (the vocabulary) we calculate the IDF.
It is log(#Documents / #Documents containing that word).
TF-IDF is calculated as TF * IDF.

The TF for each document is in document_matrix.
"""

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [32]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews.csv
Saving spam.csv to spam.csv


In [33]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [34]:
df = df.dropna()

In [35]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
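
As a reminder of what KNN does at prediction time, here is a minimal pure-Python sketch of the voting step. It is an illustration under simplifying assumptions, not scikit-learn's implementation: `knn_predict` is a hypothetical helper, the distance is always Euclidean, and a small constant is added to avoid division by zero where scikit-learn handles zero distances separately.

```python
import math
from collections import defaultdict


def knn_predict(train_X, train_y, query, k=3, weights="uniform"):
    """Classify `query` by a vote among its k nearest training points."""
    # Distance from the query to every training point, nearest first
    neighbors = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = defaultdict(float)
    for dist, label in neighbors[:k]:
        if weights == "uniform":
            votes[label] += 1.0  # every neighbor counts equally
        else:
            votes[label] += 1.0 / (dist + 1e-9)  # closer neighbors count more
    return max(votes, key=votes.get)


train_X = [(0.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
train_y = ["pos", "neg", "neg"]
query = (1.0, 0.0)

print(knn_predict(train_X, train_y, query, k=3, weights="uniform"))
print(knn_predict(train_X, train_y, query, k=3, weights="distance"))
```

With uniform weights the two farther `neg` points outvote the single nearby `pos` point; with distance weights the nearby point wins. This is the behavior the `weights` parameter controls.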

In [43]:
""" The code below initializes a K-Nearest Neighbors (KNN) classifier using scikit-learn's `neighbors.KNeighborsClassifier` class. The parameters used in the initialization are:

- `n_neighbors`: The number of neighbors that vote on each prediction. In the first model it is set to 5.
- `weights`: The weight function used in prediction. `'uniform'` means all neighbors have equal weight. Other options are `'distance'`, where closer neighbors have more influence, or a custom function.
- `algorithm`: The algorithm used to compute the nearest neighbors. `'auto'` lets scikit-learn choose an algorithm based on the data passed to `fit`.
- `leaf_size`: The leaf size passed to the KD tree or Ball tree. It affects the speed of building and querying the tree and the memory it uses, not which neighbors are returned.
- `p`: The power parameter for the Minkowski distance metric. `p=2` is equivalent to the Euclidean distance and `p=1` to the Manhattan distance.
- `metric`: The distance metric used to calculate distances between points. scikit-learn's default is `'minkowski'`, which with `p=2` is the Euclidean distance; the first model sets `'euclidean'` explicitly. Other options include `'manhattan'`, `'chebyshev'`, `'cosine'`, or a custom distance function.
- `metric_params`: Additional keyword arguments to be passed to the distance metric function.
- `n_jobs`: The number of parallel jobs to run for the neighbors search. Here it is set to 1, so the search runs without parallelization.

Initializing `KNeighborsClassifier` with these parameters creates a KNN classifier object that is ready to be trained and used for classification."""

In [44]:
""" A closer look at the `algorithm` and `leaf_size` parameters of the K-Nearest Neighbors (KNN) classifier:

- `algorithm`: This parameter determines the algorithm used to compute the nearest neighbors. The available options are:
  - `'auto'`: The default. scikit-learn attempts to choose between 'ball_tree', 'kd_tree', and 'brute' based on the data passed to `fit`.
  - `'ball_tree'`: Builds a Ball tree data structure to store the training samples and search for nearest neighbors efficiently. It copes with higher-dimensional data better than a KD tree, but is more costly to build.
  - `'kd_tree'`: Builds a KD tree data structure to store the training samples. It is fast for low-dimensional data but loses its advantage as the number of dimensions grows, as it does for bag-of-words vectors with one dimension per vocabulary word.
  - `'brute'`: Performs a brute-force search by computing the distance from each query point to every training sample. It is suitable for small datasets, for high-dimensional data, or when using a distance metric that the tree-based algorithms do not support (the TF-IDF model below pairs `'brute'` with `'cosine'`).

- `leaf_size`: This parameter sets the number of samples at which the KD tree or Ball tree stops splitting and searches a leaf by brute force. It changes how fast the tree is built and queried and how much memory it needs. It does not change which neighbors are found, so it has no effect on accuracy.

In summary, with `algorithm='auto'` the KNN classifier picks a search algorithm for you, and `leaf_size` is a speed and memory setting for the tree-based algorithms."""

In [40]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.
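
Until then, a rough picture of what `cv=3` means in the functions above: the training data is split into three folds, and each fold takes one turn as the validation set while the other two are used for fitting. The sketch below shows only the index bookkeeping. `k_fold_indices` is a hypothetical helper using plain contiguous folds; for classifiers, scikit-learn's `cross_val_score` uses stratified folds, which this simplification does not reproduce.

```python
def k_fold_indices(n_samples, k=3):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


for train_idx, val_idx in k_fold_indices(7, k=3):
    print("train:", train_idx, "validate:", val_idx)
```

Each sample is used for validation exactly once, which is why `cross_val_score` returns three scores, one per fold.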

In [41]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

KNN with BOW accuracy = 62.30366492146597%




Cross Validation Accuracy: 0.62
[0.60784314 0.58431373 0.66141732]




In [42]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

KNN with TFIDF accuracy = 70.15706806282722%
Cross Validation Accuracy: 0.73
[0.7254902  0.74117647 0.72834646]




In [67]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn(n,weight,algo,leafsize,p1,metric1,njobs):
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=n, weights=weight, algorithm=algo, leaf_size=leafsize, p=p1, metric=metric1, metric_params=None, n_jobs=njobs)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn(n,weight,algo,leafsize,p1,metric1,njobs):
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=n, weights=weight, algorithm=algo, leaf_size=leafsize, p=p1, metric=metric1, metric_params=None, n_jobs=njobs)


    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [52]:
""" The main parameters of the `KNeighborsClassifier` class in scikit-learn are:

- `n_neighbors`: The number of neighbors to consider for classification (default 5).
- `weights`: The weight function used in prediction. Possible values are:
  - `'uniform'`: All points in each neighborhood are weighted equally (default).
  - `'distance'`: Weight points by the inverse of their distance. Closer neighbors have a greater influence on the prediction.
  - Custom callable function: You can define a custom function that accepts an array of distances and returns an array of weights.
- `algorithm`: The algorithm used to compute the nearest neighbors. Possible values are:
  - `'auto'`: Attempts to select the most appropriate algorithm based on the data passed to `fit` (default).
  - `'ball_tree'`: Builds a Ball tree data structure for nearest neighbor search.
  - `'kd_tree'`: Builds a KD tree data structure for nearest neighbor search.
  - `'brute'`: Performs a brute-force search by computing the distance from each query point to every training point.
- `leaf_size`: The leaf size passed to the KD tree or Ball tree (default 30). It affects the speed and memory use of the tree, not the predictions.
- `p`: The power parameter for the Minkowski distance metric. For example:
  - `p=1` corresponds to the Manhattan distance.
  - `p=2` corresponds to the Euclidean distance (default).
  - Other values of `p` give the general Minkowski distance of that order.
- `metric`: The distance metric used for computing distances between points. Possible values include:
  - `'minkowski'`: Minkowski distance of order `p` (default; Euclidean when `p=2`).
  - `'euclidean'`: Euclidean distance.
  - `'manhattan'`: Manhattan distance.
  - `'chebyshev'`: Chebyshev distance.
  - `'cosine'`: Cosine distance, used by the TF-IDF model above.
  - Custom distance metric: You can define a custom function that computes the distance between two points.
- `n_jobs`: The number of parallel jobs to run for the neighbors search. A value of -1 uses all available processors.

These parameters let you customize the behavior of the KNN classifier for your task."""

In [53]:
predicted, y_test = bow_knn(7,'uniform','auto',40,3,'minkowski',-1)

KNN with BOW accuracy = 64.92146596858639%




Cross Validation Accuracy: 0.63
[0.64705882 0.59215686 0.65748031]




In [54]:
predicted, y_test = tfidf_knn(7,'uniform','auto',40,3,'minkowski',-1)

KNN with TFIDF accuracy = 68.58638743455498%




Cross Validation Accuracy: 0.60
[0.61176471 0.6        0.59448819]


In [68]:
predicted, y_test = bow_knn(5,'distance','brute',30,2,'cosine',-1)

KNN with BOW accuracy = 70.15706806282722%




Cross Validation Accuracy: 0.71
[0.70980392 0.70196078 0.73228346]




In [69]:
predicted, y_test = tfidf_knn(5,'distance','brute',30,2,'cosine',-1)

KNN with TFIDF accuracy = 70.15706806282722%
Cross Validation Accuracy: 0.73
[0.7254902  0.74117647 0.72834646]
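
In the runs above, the BoW model scored about 62% to 65% with Euclidean or Minkowski distance and about 70% with cosine distance (the `weights` and `n_neighbors` settings also changed between runs, so the metric is not the only difference). One plausible factor, offered as an interpretation and not something these runs prove, is that cosine distance ignores document length. The sketch below uses made-up count vectors over a hypothetical three-word vocabulary to show the effect.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical counts for the vocabulary ("good", "food", "bad")
short_review = [1, 1, 0]
long_review = [3, 3, 0]   # same word proportions, three times as long
other_review = [0, 1, 1]  # different wording, similar length to the short one

print("euclidean:", euclidean(short_review, long_review), euclidean(short_review, other_review))
print("cosine:   ", cosine_distance(short_review, long_review), cosine_distance(short_review, other_review))
```

Under Euclidean distance the short review is closer to the differently worded review than to the longer review with the same wording; under cosine distance the order is reversed.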




# Section 4: SPAM TEXT DATASET
Now let's use what we've learnt to classify texts as spam or not spam.

In [55]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam (1).csv


In [56]:
import pandas as pd
df = pd.read_csv('spam.csv', on_bad_lines='skip')  # error_bad_lines was deprecated in pandas 1.3 and removed in 2.0
df




| | Category | Message |
| --- | --- | --- |
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [57]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [58]:
df.head(5)

| | Category | Message |
| --- | --- | --- |
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [59]:
len(df)

5572

In [66]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [61]:
# This cell may take some time to run
predicted, y_test = bow_knn()



KNN with BOW accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.91
[0.90713324 0.90040377 0.91245791]




In [62]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()



KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]
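
Accuracy alone can hide how a classifier treats the less frequent class. Both functions above return `predicted` and `y_test`, so a confusion matrix and per-class precision and recall could be computed from them. The sketch below uses small made-up label arrays (`y_true_demo` and `y_pred_demo` are hypothetical stand-ins, not the real outputs) to show the calls.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels for illustration only: 0 = ham, 1 = spam.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])

accuracy = metrics.accuracy_score(y_true_demo, y_pred_demo)
# Rows are true classes, columns are predicted classes.
conf_mat = metrics.confusion_matrix(y_true_demo, y_pred_demo)
spam_precision = metrics.precision_score(y_true_demo, y_pred_demo, pos_label=1)
spam_recall = metrics.recall_score(y_true_demo, y_pred_demo, pos_label=1)

print("accuracy:", accuracy)
print(conf_mat)
print("spam precision:", spam_precision, "spam recall:", spam_recall)
```

In this toy case the accuracy is 90% even though half of the spam messages are missed. With the real outputs, the same calls would take `y_test` and `predicted` in place of the demo arrays.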


In [72]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn(n,weight,algo,leafsize,p1,metric1,njobs):
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=n, weights=weight, algorithm=algo, leaf_size=leafsize, p=p1, metric=metric1, metric_params=None, n_jobs=njobs)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn(n,weight,algo,leafsize,p1,metric1,njobs):
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=n, weights=weight, algorithm=algo, leaf_size=leafsize, p=p1, metric=metric1, metric_params=None, n_jobs=njobs)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [64]:
# This cell may take some time to run
predicted, y_test = bow_knn(5,'distance','brute',30,2,'cosine',-1)



KNN with BOW accuracy = 98.29596412556054%
Cross Validation Accuracy: 0.97
[0.96837147 0.97106326 0.96969697]




In [75]:
predicted, y_test =tfidf_knn(5,'distance','brute',30,2,'cosine',-1)



KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]
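
For TASK - 2, calling the functions by hand for every parameter combination gets tedious. One possible alternative is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline` and let `GridSearchCV` try the combinations. The sketch below is not the tutorial's own code: it uses a tiny made-up corpus (`texts`, `labels`) and a plain `TfidfVectorizer` instead of `createTFIDF`, so the scores it prints say nothing about the spam dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Tiny made-up corpus for illustration; 0 = ham, 1 = spam.
texts = [
    "win a free prize now", "free entry to win cash",
    "claim your free cash prize", "urgent winner claim prize now",
    "free ringtone text to win", "cash prize waiting claim now",
    "are we meeting for lunch today", "call me when you get home",
    "see you at the office tomorrow", "can you pick up some milk",
    "running late be there soon", "thanks for dinner last night",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("knn", KNeighborsClassifier(algorithm="brute")),
])

# Parameter names are prefixed with the pipeline step they belong to.
param_grid = {
    "knn__n_neighbors": [1, 3],
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["cosine", "euclidean"],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

Because the vectorizer sits inside the pipeline, each cross-validation fold builds its vocabulary from its own training split only.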


### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?
2. Can you think of techniques that are better than both BoW and TF-IDF?
3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

In [76]:
"""

TF-IDF combines how often a word occurs in a document (term frequency) with how many documents contain it (inverse document frequency), whereas BoW only uses the count within each document.
TF-IDF gives more weight to rare terms that occur in only a few documents.
These rare terms can carry significant information and help distinguish between documents.
In contrast, Bag-of-Words does not consider rarity or uniqueness:
a term that occurs frequently in every document still receives a high weight.

TF-IDF automatically downweights common stop words that are frequent in many documents.
As the number of documents containing a word increases, the IDF value of the word decreases.

When a word occurs in many documents, the ratio inside the log shrinks, so the IDF value decreases.
But when a word occurs many times within a single document, its term frequency gives it more importance."""


In [77]:
"""

1. TF-IDF considers the frequency of a word in all the documents, while BoW only considers the frequency in the specific document.
2. TF-IDF gives more importance to rare terms that occur in only a few documents. These rare terms can carry significant information and help in distinguishing between different documents. In contrast, BoW treats all terms equally and does not consider their rarity or uniqueness.
3. TF-IDF automatically downweights common stop words that are frequent in many documents. The IDF value decreases as the frequency of a word across documents increases.
4. When a word occurs many times in a document, it is given more importance in both TF-IDF and BoW. However, TF-IDF also considers the inverse document frequency, which balances the weight of the term based on its occurrence in other documents.

"""

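
The difference between raw counts and IDF weighting can be seen on a toy corpus. The sketch below assumes scikit-learn's `CountVectorizer` and `TfidfVectorizer` with default settings, where the smoothed IDF is `ln((1 + n) / (1 + df)) + 1` for `n` documents and document frequency `df`. The three documents are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny made-up corpus: "free" appears in every document, "prize" in only one.
docs = [
    "free prize free",
    "free lunch today",
    "free parking today",
]

bow = CountVectorizer()
bow_matrix = bow.fit_transform(docs).toarray()
bow_vocab = bow.vocabulary_  # maps each term to its column index

tfidf = TfidfVectorizer()
tfidf.fit(docs)
# Map each term to its learned IDF value.
idf = {term: tfidf.idf_[idx] for term, idx in tfidf.vocabulary_.items()}

# In document 0, BoW gives "free" a larger raw count than "prize".
print(bow_matrix[0][bow_vocab["free"]], bow_matrix[0][bow_vocab["prize"]])
# IDF is lowest for the term that occurs in every document.
for term in sorted(idf, key=idf.get):
    print(term, round(idf[term], 3))
```

Here "free" gets the minimum IDF of 1.0 because it occurs in all three documents, while "prize", which occurs in only one, gets the highest.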

In [79]:
"""
1. Word Embeddings: Word embeddings, such as Word2Vec, GloVe, or FastText, represent words as dense vectors in a continuous space. These embeddings capture semantic and contextual relationships between words and can be more effective in capturing word meanings and document semantics compared to BoW or TF-IDF.

2. Neural Network Models: Deep learning models, such as Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs), can learn representations directly from the raw text data. These models can capture complex relationships and dependencies between words and can achieve better performance for tasks like text classification or sentiment analysis.

3. Transformer Models: Transformer models, such as the BERT (Bidirectional Encoder Representations from Transformers) model, have achieved state-of-the-art results in various natural language processing (NLP) tasks. These models utilize attention mechanisms to capture contextual information and dependencies between words, leading to more accurate representations of text.

4. Subword-level Representations: Rather than considering words as atomic units, subword-level representations like Byte-Pair Encoding (BPE) or subword embeddings can be used. These representations break words into smaller units or subwords, which can handle out-of-vocabulary words and capture morphological variations.

5. Topic Modeling: Techniques like Latent Dirichlet Allocation (LDA) can be used to discover latent topics in a collection of documents. These models identify underlying themes and distributions of topics in the corpus, providing a more interpretable representation of the text data.

6. Graph-based Models: Graph-based models represent text as a graph structure, where words are nodes and relationships between words are edges. These models can capture semantic relationships and dependencies between words, enabling more nuanced representations.

These techniques go beyond the simple word frequency-based approaches of BoW and TF-IDF and leverage more sophisticated methods to capture semantic meaning, context, and relationships in text data. Each technique has its own strengths and applicability depending on the specific task and dataset at hand."""

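
To give a feel for the subword idea in point 4, the sketch below compares word-level TF-IDF with character n-gram TF-IDF on a misspelled message. Character n-grams are used here as a simple stand-in for subword units; this is not Byte-Pair Encoding, and the two example texts are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up example: the second text misspells "free" and pluralizes "prize".
texts = ["free prize", "freee prizes"]

# Word-level features: the two texts share no whole words.
word_vec = TfidfVectorizer(analyzer="word")
word_sim = cosine_similarity(word_vec.fit_transform(texts))[0, 1]

# Character n-grams inside word boundaries: many fragments are shared.
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
char_sim = cosine_similarity(char_vec.fit_transform(texts))[0, 1]

print("word-level similarity:", word_sim)
print("char n-gram similarity:", char_sim)
```

The word-level similarity is zero because no token matches exactly, whereas the character-level representation still finds overlap, which is the kind of robustness to unseen word forms that subword methods aim for.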

In [81]:
"""

Stemming:
Pros:

Computationally efficient and fast, since it applies simple rules that chop off affixes (mostly suffixes).
Reduces related word forms to a common stem, which shrinks the vocabulary and the storage size.

Cons:

Stemming may produce stems that are not actual dictionary words, or incorrect stems in some cases.
Stemming can lead to ambiguity or loss of meaning since it collapses different words into the same stem, which may result in imprecise retrieval.
Stemming is language-dependent and requires specific stemming algorithms for different languages.

Lemmatization:
Pros:

Lemmatization produces actual dictionary words or lemmas, which helps in maintaining the integrity and meaning of the words.
It considers the part of speech of the word, which allows for more accurate base form identification based on the context.
Lemmatization can be beneficial in tasks that require more precise linguistic analysis, such as language understanding, text generation, and machine translation.

Cons:

Lemmatization is computationally more expensive than stemming because it needs morphological analysis and access to a vocabulary database.
It may not always lead to significant improvements in retrieval performance in information retrieval systems.
Lemmatization requires more linguistic knowledge and resources specific to each language, making it more challenging to implement for multiple languages."""


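
The contrast between rule-based stemming and dictionary-based lemmatization can be sketched in a few lines. The example below uses NLTK's `PorterStemmer` for stemming. For lemmatization it uses a tiny hand-written lookup table (`toy_lemmas` is hypothetical and only illustrates the idea of a dictionary); a real lemmatizer such as NLTK's `WordNetLemmatizer` needs the WordNet corpus to be downloaded first.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini-dictionary standing in for a real lemmatizer's lexicon.
toy_lemmas = {
    "running": "run",
    "runs": "run",
    "studies": "study",
    "better": "good",
}

words = ["running", "runs", "studies", "better"]

# Stemming applies suffix-stripping rules without knowing the word.
stems = {w: stemmer.stem(w) for w in words}
# Lemmatization looks the word up and returns its dictionary form.
lemmas = {w: toy_lemmas.get(w, w) for w in words}

for w in words:
    print(f"{w:10s} stem={stems[w]:10s} lemma={lemmas[w]}")
```

The stemmer maps "studies" to "studi", which is not a dictionary word, while the lookup returns "study". The price is that the lookup only works for words the dictionary knows about.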

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
