# <center>ML Final Project</center>

## <center>Rocchio Classification</center>

#### The Rocchio classification algorithm is a simple but effective technique for text and document classification. It represents each document as a vector in a high-dimensional space and, using the training data, computes a centroid (the mean vector) for each category. A new document is assigned to the category whose centroid is closest to its vector. The underlying idea is that a document should lie close to the centroid of its own class and far from the centroids of the other classes.
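
#### The centroid idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's implementation: the helper names `fit_centroids` and `predict_nearest` and the toy 2-D vectors are hypothetical, and Euclidean distance is assumed as the distance measure.

```python
import numpy as np


def fit_centroids(X, y):
    """Return the sorted class labels and one mean vector (centroid) per class."""
    classes = np.unique(y)
    centroids = np.vstack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict_nearest(X, classes, centroids):
    """Assign each row of X to the class whose centroid is closest (Euclidean)."""
    # Pairwise distances between samples and centroids: shape (n_samples, n_classes)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]


# Hypothetical 2-D "document vectors" for two classes
X_toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y_toy = np.array([0, 0, 1, 1])

classes, centroids = fit_centroids(X_toy, y_toy)
new_docs = np.array([[0.8, 0.2], [0.2, 0.7]])
print(predict_nearest(new_docs, classes, centroids))  # -> [0 1]
```

#### In a real text pipeline the rows of `X` would be TF-IDF vectors with thousands of dimensions, but the two steps (average per class, then pick the nearest average) stay the same.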

#### Because it is simple and cheap to train, Rocchio is a popular choice for tasks such as sentiment analysis and spam filtering, where it can perform well. It also handles high-dimensional data efficiently, which keeps it in use across a wide range of natural language processing applications.



#### Research paper URL - "https://arxiv.org/pdf/1904.08067v5.pdf"

#### Dataset URL - "https://github.com/kk7nc/Text_Classification/tree/master/Data"

#### GitHub URL - "https://github.com/kk7nc/Text_Classification/tree/master/code"

### Dataset download helpers

In [1]:
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
RMDL: Random Multimodel Deep Learning for Classification
 * Copyright (C) 2018  Kamran Kowsari <kk7nc@virginia.edu>
 * Last Update: 04/25/2018
 * This file is part of  RMDL project, University of Virginia.
 * Free to use, change, share and distribute source code of RMDL
 * Referenced paper : RMDL: Random Multimodel Deep Learning for Classification
 * Referenced paper : An Improvement of Data Classification using Random Multimodel Deep Learning (RMDL)
 * Comments and Error: email: kk7nc@virginia.edu
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''


from __future__ import print_function

import os, sys, tarfile
import numpy as np
import zipfile

if sys.version_info >= (3, 0, 0):
    import urllib.request as urllib  # ugly but works
else:
    import urllib

print(sys.version_info)

# Directory where the GloVe archive is downloaded and extracted
DATA_DIR = os.path.join('.', 'Glove')


def download_and_extract(data='Wikipedia'):
    """
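    Download and extract the selected GloVe word vectors.
    :param data: 'Wikipedia', 'Common_Crawl_840B', 'Common_Crawl_42B', or 'Twitter'
    :return: absolute path of the directory containing the extracted files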
    """

    if data=='Wikipedia':
        DATA_URL = 'http://nlp.stanford.edu/data/glove.6B.zip'
    elif data=='Common_Crawl_840B':
        DATA_URL = 'http://nlp.stanford.edu/data/wordvecs/glove.840B.300d.zip'
    elif data=='Common_Crawl_42B':
        DATA_URL = 'http://nlp.stanford.edu/data/wordvecs/glove.42B.300d.zip'
    elif data=='Twitter':
        DATA_URL = 'http://nlp.stanford.edu/data/wordvecs/glove.twitter.27B.zip'
    else:
        raise ValueError("parameter should be Twitter, Common_Crawl_42B, Common_Crawl_840B, or Wikipedia")


    dest_directory = DATA_DIR
    if not os.path.exists(dest_directory):
        os.makedirs(dest_directory)
    filename = DATA_URL.split('/')[-1]
    filepath = os.path.join(dest_directory, filename)
    print(filepath)

    path = os.path.abspath(dest_directory)
    if not os.path.exists(filepath):
        def _progress(count, block_size, total_size):
            sys.stdout.write('\rDownloading %s %.2f%%' % (filename,
                                                          float(count * block_size) / float(total_size) * 100.0))
            sys.stdout.flush()

        filepath, _ = urllib.urlretrieve(DATA_URL, filepath)#, reporthook=_progress)


        # Extract the archive into DATA_DIR; the context manager closes the file
        with zipfile.ZipFile(filepath, 'r') as zip_ref:
            zip_ref.extractall(DATA_DIR)
    return path

sys.version_info(major=3, minor=8, micro=8, releaselevel='final', serial=0)


In [2]:
from __future__ import print_function

import os, sys, tarfile
import numpy as np

if sys.version_info >= (3, 0, 0):
    import urllib.request as urllib  # ugly but works
else:
    import urllib

print(sys.version_info)

# Directory where the Web of Science (WOS) archive is downloaded and extracted
DATA_DIR = os.path.join('.', 'data_WOS')

# URL of the WOS dataset archive
DATA_URL = 'http://kowsari.net/WebOfScience.tar.gz'


def download_and_extract():
    """
    Download and extract the Web of Science (WOS) datasets
    :return: absolute path of the directory containing the extracted files
    """
    dest_directory = DATA_DIR
    if not os.path.exists(dest_directory):
        os.makedirs(dest_directory)
    filename = DATA_URL.split('/')[-1]
    filepath = os.path.join(dest_directory, filename)


    path = os.path.abspath(dest_directory)
    if not os.path.exists(filepath):
        def _progress(count, block_size, total_size):
            sys.stdout.write('\rDownloading %s %.2f%%' % (filename,
                                                          float(count * block_size) / float(total_size) * 100.0))
            sys.stdout.flush()

        filepath, _ = urllib.urlretrieve(DATA_URL, filepath, reporthook=_progress)

        print('Downloaded', filename)

        # Extract the archive; the context manager closes the file afterwards
        with tarfile.open(filepath, 'r') as tar_ref:
            tar_ref.extractall(dest_directory)
    return path

sys.version_info(major=3, minor=8, micro=8, releaselevel='final', serial=0)


In [4]:
from sklearn.neighbors import NearestCentroid
# from sklearn.neighbors.nearest_centroid import NearestCentroid
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.datasets import fetch_20newsgroups

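# Load the predefined train and test splits of the 20 Newsgroups corpus
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

# Token counts -> TF-IDF weights -> nearest-centroid (Rocchio) classifier
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', NearestCentroid()),
                     ])

# Fit the whole pipeline on the raw training documents
text_clf.fit(X_train, y_train)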


predicted = text_clf.predict(X_test)
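
#### To try the same three-step pipeline without downloading 20 Newsgroups, the sketch below fits it on a hypothetical four-document corpus. The documents, the `spam`/`ham` labels, and the `toy_*` variable names are invented for illustration; only the pipeline steps mirror the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import Pipeline

# Hypothetical training corpus: two spam-like and two work-like documents
toy_docs = [
    "cheap pills buy now",
    "win money now with a cheap offer",
    "meeting agenda for monday",
    "notes from the project meeting on monday",
]
toy_labels = ["spam", "spam", "ham", "ham"]

# Same steps as the notebook: counts -> TF-IDF -> nearest centroid
toy_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', NearestCentroid()),
                    ])
toy_clf.fit(toy_docs, toy_labels)

# Each unseen document shares vocabulary with only one of the two classes
toy_pred = toy_clf.predict(["cheap money offer now", "monday project agenda"])
print(list(toy_pred))
```

#### Each test document here overlaps with the vocabulary of a single class, so its TF-IDF vector lies closer to that class centroid. Real corpora are rarely this clean, which is one reason per-class scores can differ widely.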

In [5]:
print(metrics.classification_report(y_test, predicted))

              precision    recall  f1-score   support

           0       0.75      0.49      0.60       319
           1       0.44      0.76      0.56       389
           2       0.75      0.68      0.71       394
           3       0.71      0.59      0.65       392
           4       0.81      0.71      0.76       385
           5       0.83      0.66      0.74       395
           6       0.49      0.88      0.63       390
           7       0.86      0.76      0.80       396
           8       0.91      0.86      0.89       398
           9       0.85      0.79      0.82       397
          10       0.95      0.80      0.87       399
          11       0.94      0.66      0.78       396
          12       0.40      0.70      0.51       393
          13       0.84      0.49      0.62       396
          14       0.89      0.72      0.80       394
          15       0.55      0.73      0.63       398
          16       0.68      0.76      0.71       364
          17       0.97    

## Conclusion:
#### On the 20 Newsgroups dataset, the Rocchio classifier performs moderately well, with an overall accuracy of 69%. Results vary widely by class: the f1-scores reach 0.89 and 0.87 for the best classes, while the weakest scores fall to 0.47 and 0.51. These findings suggest that Rocchio works well for classes that are clearly separated from the rest, but it may not be the best option for datasets whose categories overlap or are less distinct. Overall, Rocchio shows promise, but better results may require further tuning or a different configuration.
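
#### Overall accuracy and per-class f1-scores like those quoted above can be computed directly from the label arrays with `sklearn.metrics`. The sketch below uses small hypothetical label arrays (`y_true_toy`, `y_pred_toy`) instead of the notebook's `y_test` and `predicted`, so the numbers it prints are not the notebook's results.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels for a three-class problem
y_true_toy = np.array([0, 0, 1, 1, 2, 2])
y_pred_toy = np.array([0, 1, 1, 1, 2, 0])

# Fraction of documents labeled correctly
toy_accuracy = metrics.accuracy_score(y_true_toy, y_pred_toy)

# One f1-score per class (average=None disables averaging across classes)
toy_f1 = metrics.f1_score(y_true_toy, y_pred_toy, average=None)

# Class indices ordered from weakest to strongest f1-score
weakest_first = np.argsort(toy_f1)

print("accuracy:", round(toy_accuracy, 3))
print("f1 per class:", np.round(toy_f1, 3))
print("classes, weakest first:", weakest_first)
```

#### Passing `y_test` and `predicted` from the cells above to the same calls would give the figures for the actual run, and sorting by f1-score is a quick way to find the classes that need attention.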