# <center>ML Final Project</center>

## <center>Bagging & Boosting</center>

### <center>Bagging - </center>

#### Bootstrap aggregation, or "bagging," is an ensemble method used to lower a predictive model's variance. It is a parallel method: each model is fit independently of the others, so many models can be trained at once. Each model is trained on a bootstrap sample, a new dataset of the same size as the original that is drawn from it with replacement. Because the draws are made with replacement, some observations appear several times in a given training set while others are left out, and every element in the original dataset has an equal chance of being chosen on each draw. The models' predictions are then aggregated, by majority vote for classification or by averaging for regression.
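
As a rough illustration of bootstrap sampling, the sketch below draws several training sets with replacement from a toy dataset and reports what fraction of the original observations each one contains. The names `bootstrap_sample`, `observations`, and `fractions` are hypothetical and do not appear elsewhere in this notebook; only the Python standard library is used.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


# Toy dataset of 1000 distinct observations (an assumption for illustration).
rng = random.Random(0)
observations = list(range(1000))

# Five independent bootstrap training sets, one per model in the ensemble.
samples = [bootstrap_sample(observations, rng) for _ in range(5)]

# Fraction of distinct original observations present in each sample.
# For large n this tends towards 1 - 1/e, roughly 0.632.
fractions = [len(set(s)) / len(observations) for s in samples]
print(fractions)
```

Each sample has the same size as the original dataset, yet roughly a third of the observations are missing from any one of them, which is what makes the individual models differ.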


### <center>Boosting - </center>

#### Boosting is a sequential ensemble technique that combines weak learners into a stronger one, mainly to decrease the bias error of the model. In weight-based variants such as AdaBoost, the weights of the observations are adjusted after each iteration according to the most recent classification performance: a misclassified observation has its weight increased, so it has more influence in the subsequent training cycle and later learners focus on the more difficult examples. Each model is also given a weight in the final ensemble, with models that do well weighted more heavily. In this way, the mistakes made by earlier models inform the training of each new learner. Gradient boosting, which is used later in this notebook, follows the same sequential idea but fits each new learner to the negative gradient of the loss instead of reweighting observations.
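
The reweighting step can be sketched as a single AdaBoost-style round for binary labels in {-1, +1}. This is a minimal illustration under that assumption, not the procedure used by `GradientBoostingClassifier`; the function name `adaboost_round` and the toy labels are hypothetical.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """Run one AdaBoost-style update and return (model_weight, new_weights).

    weights must sum to 1; labels are assumed to be -1 or +1.
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Clip to avoid division by zero or log of zero for perfect/useless learners.
    err = min(max(err, 1e-10), 1 - 1e-10)

    # Weight of this model in the final vote: larger when the error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)

    # Increase the weight of misclassified observations, decrease the rest.
    updated = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Five observations with uniform starting weights; only the second is misclassified.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]
weights = [0.2] * 5

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 3))                      # 0.693
print([round(w, 3) for w in new_weights])   # [0.125, 0.5, 0.125, 0.125, 0.125]
```

After one round the single misclassified observation carries half of the total weight, so the next weak learner is pushed to classify it correctly.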




#### Research paper URL - "https://arxiv.org/pdf/1904.08067v5.pdf"

#### Dataset URL - "https://github.com/kk7nc/Text_Classification/tree/master/Data"

#### GitHub URL - "https://github.com/kk7nc/Text_Classification/tree/master/code"

### Dataset loaded...

In [2]:
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
RMDL: Random Multimodel Deep Learning for Classification
 * Copyright (C) 2018  Kamran Kowsari <kk7nc@virginia.edu>
 * Last Update: 04/25/2018
 * This file is part of  RMDL project, University of Virginia.
 * Free to use, change, share and distribute source code of RMDL
 * Refrenced paper : RMDL: Random Multimodel Deep Learning for Classification
 * Refrenced paper : An Improvement of Data Classification using Random Multimodel Deep Learning (RMDL)
 * Comments and Error: email: kk7nc@virginia.edu
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''


from __future__ import print_function

import os, sys, tarfile
import numpy as np
import zipfile

if sys.version_info >= (3, 0, 0):
    import urllib.request as urllib  # ugly but works
else:
    import urllib

print(sys.version_info)

# path to the directory where the GloVe archives are stored
# (os.path.join avoids the invalid '\G' escape and works on any OS)
DATA_DIR = os.path.join('.', 'Glove')


def download_and_extract(data='Wikipedia'):
    """
    Download and extract the GloVe
    :return: None
    """

    if data=='Wikipedia':
        DATA_URL = 'http://nlp.stanford.edu/data/glove.6B.zip'
    elif data=='Common_Crawl_840B':
        DATA_URL = 'http://nlp.stanford.edu/data/wordvecs/glove.840B.300d.zip'
    elif data=='Common_Crawl_42B':
        DATA_URL = 'http://nlp.stanford.edu/data/wordvecs/glove.42B.300d.zip'
    elif data=='Twitter':
        DATA_URL = 'http://nlp.stanford.edu/data/wordvecs/glove.twitter.27B.zip'
    else:
        raise ValueError("parameter should be Twitter, Common_Crawl_42B, Common_Crawl_840B, or Wikipedia")


    dest_directory = DATA_DIR
    if not os.path.exists(dest_directory):
        os.makedirs(dest_directory)
    filename = DATA_URL.split('/')[-1]
    filepath = os.path.join(dest_directory, filename)
    print(filepath)

    path = os.path.abspath(dest_directory)
    if not os.path.exists(filepath):
        def _progress(count, block_size, total_size):
            sys.stdout.write('\rDownloading %s %.2f%%' % (filename,
                                                          float(count * block_size) / float(total_size) * 100.0))
            sys.stdout.flush()

        filepath, _ = urllib.urlretrieve(DATA_URL, filepath)#, reporthook=_progress)


        # extract the archive; the context manager closes it afterwards
        with zipfile.ZipFile(filepath, 'r') as zip_ref:
            zip_ref.extractall(DATA_DIR)
    return path

sys.version_info(major=3, minor=8, micro=8, releaselevel='final', serial=0)


In [3]:
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
RMDL: Random Multimodel Deep Learning for Classification
 * Copyright (C) 2018  Kamran Kowsari <kk7nc@virginia.edu>
 * Last Update: 04/25/2018
 * This file is part of  RMDL project, University of Virginia.
 * Free to use, change, share and distribute source code of RMDL
 * Refrenced paper : RMDL: Random Multimodel Deep Learning for Classification
 * Refrenced paper : An Improvement of Data Classification using Random Multimodel Deep Learning (RMDL)
 * Comments and Error: email: kk7nc@virginia.edu
'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''


from __future__ import print_function

import os, sys, tarfile
import numpy as np

if sys.version_info >= (3, 0, 0):
    import urllib.request as urllib  # ugly but works
else:
    import urllib

print(sys.version_info)

# path to the directory where the WOS data is stored
# (os.path.join avoids the invalid '\d' escape and works on any OS)
DATA_DIR = os.path.join('.', 'data_WOS')

# url of the WOS archive
DATA_URL = 'http://kowsari.net/WebOfScience.tar.gz'



def download_and_extract():
    """
    Download and extract the WOS datasets
    :return: None
    """
    dest_directory = DATA_DIR
    if not os.path.exists(dest_directory):
        os.makedirs(dest_directory)
    filename = DATA_URL.split('/')[-1]
    filepath = os.path.join(dest_directory, filename)


    path = os.path.abspath(dest_directory)
    if not os.path.exists(filepath):
        def _progress(count, block_size, total_size):
            sys.stdout.write('\rDownloading %s %.2f%%' % (filename,
                                                          float(count * block_size) / float(total_size) * 100.0))
            sys.stdout.flush()

        filepath, _ = urllib.urlretrieve(DATA_URL, filepath, reporthook=_progress)

        print('Downloaded', filename)

        # extract the archive; the context manager closes it afterwards
        with tarfile.open(filepath, 'r') as tar:
            tar.extractall(dest_directory)
    return path

sys.version_info(major=3, minor=8, micro=8, releaselevel='final', serial=0)


# Bagging : 

In [4]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

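# token counts -> TF-IDF weights -> bagged k-nearest-neighbours classifier
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', BaggingClassifier(KNeighborsClassifier())),
                     ])

# fit the whole pipeline on the training split
text_clf.fit(X_train, y_train)

# predict labels for the held-out test split
predicted = text_clf.predict(X_test)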

In [5]:
print(metrics.classification_report(y_test, predicted))

              precision    recall  f1-score   support

           0       0.58      0.73      0.65       319
           1       0.59      0.59      0.59       389
           2       0.63      0.56      0.59       394
           3       0.57      0.57      0.57       392
           4       0.60      0.56      0.58       385
           5       0.69      0.61      0.65       395
           6       0.61      0.45      0.52       390
           7       0.74      0.72      0.73       396
           8       0.81      0.83      0.82       398
           9       0.74      0.74      0.74       397
          10       0.81      0.85      0.83       399
          11       0.75      0.83      0.79       396
          12       0.70      0.49      0.58       393
          13       0.81      0.54      0.65       396
          14       0.75      0.78      0.76       394
          15       0.71      0.78      0.74       398
          16       0.72      0.73      0.72       364
          17       0.62    

# Boosting :

In [6]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

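# token counts -> TF-IDF weights -> gradient-boosted decision trees
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', GradientBoostingClassifier(n_estimators=50, verbose=2)),
                     ])

# fit the whole pipeline on the training split (verbose=2 logs each iteration)
text_clf.fit(X_train, y_train)

# predict labels for the held-out test split
predicted = text_clf.predict(X_test)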

      Iter       Train Loss   Remaining Time 
         1           2.2026           65.98m
         2           1.9756           70.43m
         3           1.8175           68.44m
         4           1.6981           64.51m
         5           1.6015           61.62m
         6           1.5210           60.36m
         7           1.4468           59.33m
         8           1.3821           58.40m
         9           1.3284           56.43m
        10           1.2772           54.28m
        11           1.2323           52.81m
        12           1.1904           51.35m
        13           1.1532           49.94m
        14           1.1182           48.93m
        15           1.0865           47.98m
        16           1.0557           46.47m
        17           1.0273           45.20m
        18           0.9991           43.77m
        19           0.9739           42.70m
        20           0.9495           41.63m
        21           0.9245           40.35m
        2

In [7]:
print(metrics.classification_report(y_test, predicted))

              precision    recall  f1-score   support

           0       0.80      0.64      0.71       319
           1       0.68      0.67      0.67       389
           2       0.70      0.68      0.69       394
           3       0.65      0.72      0.68       392
           4       0.78      0.78      0.78       385
           5       0.84      0.61      0.70       395
           6       0.82      0.84      0.83       390
           7       0.86      0.72      0.78       396
           8       0.89      0.85      0.87       398
           9       0.92      0.85      0.88       397
          10       0.93      0.86      0.90       399
          11       0.89      0.81      0.85       396
          12       0.29      0.69      0.41       393
          13       0.87      0.68      0.76       396
          14       0.85      0.81      0.83       394
          15       0.84      0.86      0.85       398
          16       0.65      0.79      0.71       364
          17       0.96    

### Conclusion:
#### According to the performance measures, boosting fared better than bagging on this task: the boosting model's f1-score was 0.75, whereas the bagging model's was 0.67. Because the f1-score is the harmonic mean of precision and recall, the higher value indicates that boosting struck a better balance between the two. The gain was not uniform across classes: for class 12, boosting reached a precision of only 0.29 (f1-score 0.41), compared with 0.70 (f1-score 0.58) for bagging. The two pipelines also use different base learners (k-nearest neighbours for bagging, decision trees for gradient boosting), so the difference reflects the full model choice and not the ensemble strategy alone. In summary, both methods yielded respectable results on the data, and boosting was the better option for this specific job.
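
The per-class f1-scores in the reports can be checked by hand from the precision and recall columns, since the f1-score is their harmonic mean. The sketch below recomputes two rows that are visible in the outputs; the helper name `f1_from_precision_recall` is hypothetical, and small rounding differences are possible in general because the reports show precision and recall rounded to two decimals.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Class 12 in the boosting report: precision 0.29, recall 0.69.
boosting_class_12 = f1_from_precision_recall(0.29, 0.69)

# Class 0 in the bagging report: precision 0.58, recall 0.73.
bagging_class_0 = f1_from_precision_recall(0.58, 0.73)

print(round(boosting_class_12, 2))  # 0.41
print(round(bagging_class_0, 2))    # 0.65
```

The harmonic mean is pulled towards the smaller of the two values, which is why a precision of 0.29 keeps the class 12 f1-score low even though recall is 0.69.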