In [1]:
#loading the news groups dataset from scikit-learn's datasets library
from sklearn.datasets import fetch_20newsgroups

In [2]:
#initializing the dataset
news = fetch_20newsgroups()
#this may take a few minutes as the dataset is large

In [3]:
#printing the first 2 rows of the data
print(news.data[0:2])

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n", "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 1

In [4]:
#printing the target names
print(news.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


We can see that there are 20 categories in this dataset.

In [5]:
#storing the target names in the categories variable
categories = news.target_names

In [6]:
#selecting which subset to load: 'train' for the training set, 'test' for the test set
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

The split between the train and test sets is based on whether a message was posted before or after a specific date.

In [7]:
#printing the training data
print(train.data[2])

From: twillis@ec.ecn.purdue.edu (Thomas E Willis)
Subject: PB questions...
Organization: Purdue University Engineering Computer Network
Distribution: usa
Lines: 36

well folks, my mac plus finally gave up the ghost this weekend after
starting life as a 512k way back in 1985.  sooo, i'm in the market for a
new machine a bit sooner than i intended to be...

i'm looking into picking up a powerbook 160 or maybe 180 and have a bunch
of questions that (hopefully) somebody can answer:

* does anybody know any dirt on when the next round of powerbook
introductions are expected?  i'd heard the 185c was supposed to make an
appearence "this summer" but haven't heard anymore on it - and since i
don't have access to macleak, i was wondering if anybody out there had
more info...

* has anybody heard rumors about price drops to the powerbook line like the
ones the duo's just went through recently?

* what's the impression of the display on the 180?  i could probably swing
a 180 if i got the 80Mb disk

In [8]:
#printing the testing data
print(test.data[2])

From: mathew <mathew@mantis.co.uk>
Subject: Re: STRONG & weak Atheism
Organization: Mantis Consultants, Cambridge. UK.
X-Newsreader: rusnews v1.02
Lines: 9

acooper@mac.cc.macalstr.edu (Turin Turambar, ME Department of Utter Misery) writes:
> Did that FAQ ever got modified to re-define strong atheists as not those who
> assert the nonexistence of God, but as those who assert that they BELIEVE in 
> the nonexistence of God?

In a word, yes.


mathew



In order to use this dataset for machine learning, we need to convert the content of each string into a vector of numbers. We can do this with a TF-IDF vectorizer, and create a pipeline that attaches it to a multinomial Naive Bayes classifier.

TF-IDF stands for Term Frequency-Inverse Document Frequency. It scales each word's count in a document by a measure of how rare that word is across all documents, so words that appear in almost every document get less weight. CountVectorizer, by contrast, uses the raw word counts, which puts too much weight on words that appear frequently everywhere.
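To make the difference concrete, here is a minimal sketch on a made-up three-sentence corpus (the sentences and variable names are assumptions for illustration, not part of the newsgroups data). It compares how strongly a word that occurs in every document ("the") is weighted relative to a word that occurs in only one ("engine") under each vectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (made up): "the" appears in every document,
# "engine" appears in only the first one.
docs = [
    "the car has the engine",
    "the mac has the modem",
    "the game has the score",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs).toarray()

# Column indices of the two words in each vocabulary
c_the, c_engine = count_vec.vocabulary_["the"], count_vec.vocabulary_["engine"]
t_the, t_engine = tfidf_vec.vocabulary_["the"], tfidf_vec.vocabulary_["engine"]

# How many times heavier "the" is than "engine" in the first document
count_ratio = counts[0, c_the] / counts[0, c_engine]
tfidf_ratio = tfidf[0, t_the] / tfidf[0, t_engine]

print("raw count ratio the/engine:", count_ratio)
print("tf-idf ratio the/engine:   ", tfidf_ratio)
print("idf of 'the':", tfidf_vec.idf_[t_the], " idf of 'engine':", tfidf_vec.idf_[t_engine])
```

The ratio shrinks under TF-IDF because the word that occurs everywhere receives a smaller inverse-document-frequency factor than the rare word.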

In [9]:
#importing TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

#importing Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB

#importing pipeline
from sklearn.pipeline import make_pipeline

In text classification, the features are related to word counts or frequencies within the documents to be classified. We have used Multinomial Naive Bayes as it is appropriate for features that represent counts or count rates.

In [10]:
#we will use a pipeline to attach MultinomialNB to TfidfVectorizer
model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.1))

# here alpha is the additive (Laplace/Lidstone) smoothing parameter: it adds a small value to every observed
# count, so that a word that never appeared with a class in the training data does not get zero probability.
# By trying several values of alpha, we found that 0.1 gives the best result.
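As a rough illustration of what the smoothing does, the sketch below recomputes the smoothed per-class word probabilities by hand and compares them with what MultinomialNB stores in `feature_log_prob_`. The count matrix is made up, and the data is tiny on purpose; it is only meant to show the formula, not the notebook's actual model.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix (made up): 4 documents x 3 vocabulary words, 2 classes.
# Word 2 never occurs in class 0, and word 0 never occurs in class 1.
X = np.array([[2, 1, 0],
              [3, 0, 0],
              [0, 1, 4],
              [0, 2, 3]])
y = np.array([0, 0, 1, 1])
alpha = 0.1

clf = MultinomialNB(alpha=alpha).fit(X, y)

# Smoothed estimate for each class:
# (count of word in class + alpha) / (total words in class + alpha * vocabulary size)
manual = np.zeros((2, X.shape[1]))
for c in (0, 1):
    word_counts = X[y == c].sum(axis=0)
    manual[c] = (word_counts + alpha) / (word_counts.sum() + alpha * X.shape[1])

sklearn_probs = np.exp(clf.feature_log_prob_)

print("by hand:\n", manual)
print("from MultinomialNB:\n", sklearn_probs)
```

With alpha set to 1 this is usually called Laplace smoothing, and with a value below 1 (as with 0.1 here) Lidstone smoothing. In both cases the unseen word keeps a small non-zero probability.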

In [11]:
#fitting the model on the training data
model.fit(train.data, train.target)

Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('multinomialnb',
                 MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True))],
         verbose=False)

In [12]:
#predicting labels for the test data
pred_labels = model.predict(test.data)
print(pred_labels)

[ 7 11  0 ...  9  3 15]


In [13]:
#f1 score of the model
from sklearn import metrics

#the default average is 'binary', but our problem is multiclass, so we use 'macro': it computes the metric
#independently for each class and then takes the unweighted average (hence treating all classes equally).

print(metrics.f1_score(test.target, pred_labels, average='macro'))  

0.81388938732554
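The macro average can be checked by hand. The following sketch uses small made-up label arrays (not the notebook's predictions) to show that `average='macro'` is the plain mean of the per-class F1 scores, while `average='weighted'` weights each class by its number of true samples.

```python
import numpy as np
from sklearn import metrics

# Made-up labels for three classes of unequal size
y_true = np.array([0, 0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 0, 1, 1, 2, 2])

# One F1 score per class
per_class = metrics.f1_score(y_true, y_pred, average=None)

# Macro: unweighted mean of the per-class scores
macro_by_hand = per_class.mean()
macro = metrics.f1_score(y_true, y_pred, average="macro")

# Weighted: mean of the per-class scores weighted by class support
weighted_by_hand = np.average(per_class, weights=np.bincount(y_true))
weighted = metrics.f1_score(y_true, y_pred, average="weighted")

print("per class:", per_class)
print("macro:", macro, " weighted:", weighted)
```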


In [14]:
# confusion matrix
# from sklearn.metrics import confusion_matrix
# con_mat = confusion_matrix(test.target, pred_labels)

# heatmap of confusion matrix
# import seaborn as sns
# import matplotlib.pyplot as plt

# %matplotlib inline

# plt.figure(figsize=(20,20))
# ax = sns.heatmap(con_mat.T, annot=True, fmt='.1g', xticklabels=train.target_names, yticklabels=train.target_names)
# plt.xlabel('True labels')
# plt.ylabel('Predicted labels')

# bottom, top = ax.get_ylim()
# ax.set_ylim(bottom + 0.5, top - 0.5)
# plt.show()

#we dropped the idea of plotting a confusion matrix due to the large number of labels
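When a 20 x 20 heatmap is too crowded to read, one alternative is to list only the largest off-diagonal cells of the confusion matrix. The sketch below does this with a hypothetical helper `top_confusions` (the name and the toy labels are assumptions for illustration).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def top_confusions(y_true, y_pred, names, k=3):
    """Return up to k (true label, predicted label, count) triples, largest first."""
    cm = confusion_matrix(y_true, y_pred, labels=np.arange(len(names)))
    np.fill_diagonal(cm, 0)                          # ignore correct predictions
    flat_order = np.argsort(cm, axis=None)[::-1][:k]  # indices of the k largest cells
    rows, cols = np.unravel_index(flat_order, cm.shape)
    # Skip cells with a zero count so only real confusions are reported
    return [(names[r], names[c], int(cm[r, c])) for r, c in zip(rows, cols) if cm[r, c] > 0]

# Toy labels (made up) for three of the newsgroup names
toy_names = ["rec.autos", "sci.med", "sci.space"]
toy_true = [0, 0, 0, 1, 1, 2, 2, 2]
toy_pred = [0, 1, 1, 1, 1, 2, 0, 2]

confused = top_confusions(toy_true, toy_pred, toy_names, k=3)
for true_name, pred_name, count in confused:
    print(f"{true_name} predicted as {pred_name}: {count}")
```

With the notebook's variables, a call such as `top_confusions(test.target, pred_labels, test.target_names, k=10)` should list the ten most frequent mix-ups.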

In [15]:
#let us define a function to predict the label/category of any sentence
def predict_label(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]

In [16]:
predict_label('Elon Musk is working on new electric car model')
#correct prediction by the model

'rec.autos'

In [17]:
predict_label('I love my Mac')
#correct prediction by the model

'comp.sys.mac.hardware'

In [18]:
predict_label('Narendra Modi is helping the country fight against coronavirus')
#plausible but not exact: the sentence is political, but it is not about the Middle East,
#and none of the 20 groups covers it well

'talk.politics.mideast'
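The classifier always returns one of the labels it was trained on, even for a sentence that fits none of them. A pipeline ending in MultinomialNB also exposes `predict_proba`, which can show how confident the prediction is. The sketch below uses a tiny made-up two-class corpus and a hypothetical helper `predict_with_confidence`; it is not the model trained above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (made up): two documents per class
toy_docs = [
    "engine brakes and a new car model",
    "the car dealer sold a sports car",
    "my mac laptop needs more memory",
    "apple released a new mac display",
]
toy_labels = [0, 0, 1, 1]
toy_names = ["rec.autos", "comp.sys.mac.hardware"]

toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.1))
toy_model.fit(toy_docs, toy_labels)

def predict_with_confidence(s, model=toy_model, names=toy_names):
    """Return (label, probability) for the most likely class of sentence s."""
    probs = model.predict_proba([s])[0]
    best = probs.argmax()
    return names[best], float(probs[best])

# A sentence that shares words with the car documents
in_domain = predict_with_confidence("electric car with a new engine")
# A sentence with no known words: the model can only fall back on the class priors
off_topic = predict_with_confidence("coronavirus vaccine policy")

print(in_domain)
print(off_topic)
```

A low top probability is a hint that the sentence may not belong to any of the known categories, although Naive Bayes probabilities are often poorly calibrated and should be read with care.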

The scikit-learn example "Classification of text documents using sparse features" shuffles the training and test data instead of splitting them by time, and in that case multinomial Naive Bayes gets a much higher F-score of 0.88.

Let us find out whether the model is really performing that well or is overfitting. We can check this by removing the headers, footers and quotes, which appear in posts across all newsgroups, and fitting the model again.
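To see why this metadata matters, note that header lines such as Subject and Organization can give away the newsgroup without the model reading the message body. The sketch below uses a hypothetical helper `strip_header_sketch` that drops everything up to the first blank line; it is a simplification for illustration and not scikit-learn's own implementation.

```python
def strip_header_sketch(text):
    """Drop everything up to and including the first blank line (hypothetical helper)."""
    _header, _blank, body = text.partition("\n\n")
    return body

# A shortened version of the test post printed earlier
post = (
    "From: mathew <mathew@mantis.co.uk>\n"
    "Subject: Re: STRONG & weak Atheism\n"
    "Organization: Mantis Consultants, Cambridge. UK.\n"
    "Lines: 9\n"
    "\n"
    "In a word, yes.\n"
)

body = strip_header_sketch(post)
print(repr(body))
```

Once the header is gone, the word "Atheism" from the subject line is no longer available as a feature, so the classifier has to rely on the body text alone.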

In [29]:
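#note: the option names are plural ('headers', 'footers'); other spellings such as 'header' do not strip anything.
#the score printed below was recorded with the singular spellings, so a rerun may give a different value
train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=categories)
test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=categories)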

In [30]:
model = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.1))

In [31]:
model.fit(train.data, train.target)

Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('multinomialnb',
                 MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True))],
         verbose=False)

In [32]:
pred_labels = model.predict(test.data)

In [33]:
print(metrics.f1_score(test.target, pred_labels, average='macro'))

0.765567386988352


We can see that our F1 score dropped noticeably, from about 0.81 to about 0.77, after removing headers, footers and quotes. However, this is more realistic than our earlier score, as our model is no longer overfitting to that metadata.