**Multi-class classification**

Import the libraries 

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
np.random.seed(0)


In [5]:
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.neighbors import NearestNeighbors
from scipy import stats
from sklearn.metrics import balanced_accuracy_score
from time import time as tt

Data
We use the 20 newsgroups dataset, which is available directly from the sklearn library.

It's a dataset of news articles that can be categorized into one of 20 possible categories.

Notes:

We will load 5 of those 20 categories here, which makes this a 5-class (multi-class) classification task.
Here the training and test data are split by time (the test posts were written after the training posts); they are not shuffled. This means that we want to build a classifier that doesn't overfit on how a topic is talked about at a specific point in time, but instead manages to generalise to future news articles. This is an example of a case where we don't want to shuffle our dataset.
sklearn helps us clean the data with the "remove" argument (see below). In most real projects, cleaning the data is not that easy.
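
To make the idea of a time-based split concrete, here is a minimal sketch. The posts, dates and cutoff are made up for illustration and are not taken from the 20 newsgroups data; the only point is that every test item is newer than every training item.

```python
from datetime import date

# Hypothetical toy posts with made-up dates (not from the real dataset)
posts = [
    ("playoff predictions", date(1993, 4, 20)),
    ("great game last night", date(1993, 3, 1)),
    ("election debate thread", date(1993, 5, 2)),
    ("new bike, first ride", date(1993, 3, 15)),
]

cutoff = date(1993, 4, 1)  # assumed cutoff date, chosen only for this example

# Sort by date and split at the cutoff instead of shuffling
posts_sorted = sorted(posts, key=lambda p: p[1])
train = [text for text, d in posts_sorted if d < cutoff]
test = [text for text, d in posts_sorted if d >= cutoff]

print(train)
print(test)
```

With a shuffled split, posts from the same period would land on both sides, and the test score would say less about how the model handles future articles.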

In [8]:
# we only load 5 of the 20 possible news categories
categories= ['rec.motorcycles',
    'rec.sport.baseball',
   'rec.sport.hockey',
   'talk.politics.misc', 
   'talk.religion.misc']
 


In [9]:

# training
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'), 
                                      categories=categories)
# note: remove=('headers', 'footers', 'quotes') strips parts of each text that are
# not really representative of a category but can cause the algorithm to overfit
# test
newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'),
                                     categories=categories)
# uncomment the following line if you want to check the dataset's description
#print(newsgroups_train.DESCR)


In [12]:
# Let's see what data we got and in what format

In [13]:
print(newsgroups_train)




There is a lot going on, but the printed result starts with "{", which means newsgroups_train is a collection of items that can be accessed by name, the same way we access data inside a Python dictionary. We can see that these names include 'data', 'target_names' and 'target'. Let's try accessing 'target'.

In [16]:
print(newsgroups_train['target'])


[0 0 3 ... 0 0 1]
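
As a small aside, here is a sketch of the two access styles this kind of object supports. It uses sklearn's Bunch container on a tiny made-up stand-in (the values are invented, only the field names mirror the real object), so it runs without downloading anything.

```python
from sklearn.utils import Bunch

# Tiny stand-in with the same field names as the real object (values are made up)
toy = Bunch(data=["first post", "second post"],
            target=[0, 1],
            target_names=["rec.motorcycles", "rec.sport.baseball"])

print(list(toy.keys()))   # the available field names
print(toy["target"])      # dictionary-style access
print(toy.target)         # attribute-style access returns the same object
```

Both styles appear in this notebook: `newsgroups_train['target']` above and `newsgroups_train.target_names` below.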


These are our target labels. They have already been encoded as integers: we know from when we downloaded the dataset that they correspond to specific news categories (motorcycles, baseball, hockey, politics, religion). Let's see what the exact names are; we can probably assume that the order of the category names corresponds to the order of the labels.

In [18]:
print('Category names are:')
print(newsgroups_train.target_names)

print()

print('Category names and corresponding labels are:')
print([(v, k) for k, v in enumerate(newsgroups_train.target_names)])
# above we just iterated through all the category names. The enumerate function
# yields each value together with its position in the list, and that position
# is the integer label used in 'target'

Category names are:
['rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'talk.politics.misc', 'talk.religion.misc']

Category names and corresponding labels are:
[('rec.motorcycles', 0), ('rec.sport.baseball', 1), ('rec.sport.hockey', 2), ('talk.politics.misc', 3), ('talk.religion.misc', 4)]
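
The same mapping can be turned around to decode integer labels back into category names. This is a minimal sketch: the names are copied from the output above, and `sample_labels` is just a handful of illustrative values, not real data.

```python
target_names = ['rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey',
                'talk.politics.misc', 'talk.religion.misc']

# label -> name lookup; each label is the position of its name in target_names
label_to_name = dict(enumerate(target_names))

sample_labels = [0, 0, 3, 1]  # a few illustrative labels
decoded = [label_to_name[label] for label in sample_labels]
print(decoded)
```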


Let's put the texts and their target labels into a dataframe for ease of processing.

In [20]:
# let's create the dataframes. The `zip` function pairs up two sequences element by element,
# so each (text, label) pair becomes a row and each sequence becomes a column
df_train = pd.DataFrame(data=zip(newsgroups_train.data, newsgroups_train.target), 
                        columns=['news_text', 'category'])
df_test = pd.DataFrame(data=zip(newsgroups_test.data, newsgroups_test.target), 
                       columns=['news_text','category'])



Is the data balanced?

In [21]:
df_train.category.value_counts()

2    600
0    598
1    597
3    465
4    377
Name: category, dtype: int64

It's a bit imbalanced, but it's not too bad. We'll see what we can do about it...
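
To put a number on "a bit imbalanced", here is a quick sketch that turns the counts printed above into class shares and a largest-to-smallest ratio. The counts are copied by hand from that output; the ratio is just one simple way to summarise imbalance, not a standard threshold.

```python
# Class counts copied from the value_counts() output above
counts = {2: 600, 0: 598, 1: 597, 3: 465, 4: 377}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in sorted(shares.items()):
    print(f"class {label}: {share:.1%}")
print(f"largest / smallest class: {imbalance_ratio:.2f}")
```

The largest class has roughly 1.6 times as many examples as the smallest, which is consistent with calling the imbalance mild.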

Since we're here, let's also check for duplicates and missing values.

In [22]:
# missing values
print(df_train.isna().sum())
print()
print(df_test.isna().sum())


news_text    0
category     0
dtype: int64

news_text    0
category     0
dtype: int64


In [23]:
# duplicates
print(df_train.duplicated().sum())

print()

print(df_test.duplicated().sum())

65

34


We do have some duplicate rows. Let's remove them.

In [24]:
# remove duplicates
df_train = df_train.drop_duplicates()
df_test = df_test.drop_duplicates()
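
A small sketch of what counts as a duplicate here, on a made-up toy dataframe. By default, pandas flags a row only when every column matches an earlier row, so the same text with a different label is kept. The `reset_index(drop=True)` call is an optional extra that the notebook itself does not use.

```python
import pandas as pd

# Made-up toy data: rows 0 and 1 are identical, row 2 has the same text but another label
toy = pd.DataFrame({
    "news_text": ["go team", "go team", "go team", "nice bike"],
    "category": [1, 1, 2, 0],
})

# duplicated() marks a row only if ALL columns match an earlier row
print(toy.duplicated().sum())

# Drop the exact duplicates and renumber the rows from 0
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```

To treat rows with identical text as duplicates regardless of label, `subset=["news_text"]` could be passed to `duplicated` or `drop_duplicates`; whether that is desirable depends on how conflicting labels should be handled.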
