## Assignment 5

This assignment uses the 20 Newsgroups dataset.  
It contains around 18,000 posts on 20 different topics. The data are split into train and test subsets based on a specific date: all messages posted before that date belong to the train set, and the rest belong to the test set.  

The goal of the assignment is to use a Naive Bayes classifier to assign each document to its topic.  
We also assume that the data follow a multinomial distribution.  

_Multinomial Naive Bayes is suitable for classification of discrete features, such as word counts in the case of text classification._
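To make the multinomial assumption concrete, here is a minimal sketch on a made-up word-count matrix. The data and the helper name `multinomial_nb_joint_log_likelihood` are illustrative assumptions, not part of the assignment. It estimates the smoothed per-class term probabilities by hand and compares the resulting predictions with scikit-learn's `MultinomialNB`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix (rows: documents, columns: vocabulary terms).
# These numbers are invented purely for illustration.
X = np.array([[3, 0, 1],
              [2, 1, 0],
              [0, 4, 1],
              [0, 3, 2]])
y = np.array([0, 0, 1, 1])

def multinomial_nb_joint_log_likelihood(X, y, X_new, alpha=1.0):
    """Unnormalised log P(class) + log P(doc | class) under a multinomial model."""
    classes = np.unique(y)
    # Log prior: relative class frequencies.
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # Per-class term counts with additive (Laplace) smoothing.
    counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_theta = np.log(counts / counts.sum(axis=1, keepdims=True))
    return X_new @ log_theta.T + log_prior

X_new = np.array([[1, 0, 0], [0, 2, 1]])
manual = multinomial_nb_joint_log_likelihood(X, y, X_new)
manual_pred = manual.argmax(axis=1)

# Reference: the same model fitted with scikit-learn.
clf = MultinomialNB(alpha=1.0).fit(X, y)
sk_pred = clf.predict(X_new)
print(manual_pred, sk_pred)
```

Each document is scored by summing the log-probabilities of its terms, weighted by how often they occur, which is why word counts are a natural input for this model.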

In [69]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

The official scikit-learn documentation for the dataset [recommends](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) removing headers, footers and quotes (newsgroup-related metadata) when working with a Naive Bayes classifier, because the classifier overfits on this metadata instead of learning actual topic-related features.


In [16]:
X_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
X_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

In [17]:
news_topics = {i:x for i,x in enumerate(X_train.target_names)}

In [48]:
print(f'The target topics of the data are: \n{", ".join(X_train.target_names)}. \n')
print(f'Train data consist of {X_train.filenames.shape[0]} entries and test data of {X_test.filenames.shape[0]} entries.')
#Counter(X_train.target)
print('Also, there is approximately the same number of examples across all the different topics.')

The target topics of the data are: 
alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, soc.religion.christian, talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc. 

Train data consist of 11314 entries and test data of 7532 entries.
Also, there is approximately the same number of examples across all the different topics.
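The balance claim can be checked directly from the label array. The following is a minimal sketch that uses a small hypothetical label list so that it runs without downloading the dataset; `class_counts` is an illustrative helper name. In the notebook one would pass `X_train.target` and `X_train.target_names` instead.

```python
from collections import Counter

def class_counts(targets, target_names):
    """Return {topic name: number of examples} for a sequence of integer labels."""
    counts = Counter(int(t) for t in targets)
    return {name: counts.get(i, 0) for i, name in enumerate(target_names)}

# Hypothetical stand-in for X_train.target / X_train.target_names.
toy_targets = [0, 1, 1, 2, 0, 2, 1]
toy_names = ['alt.atheism', 'comp.graphics', 'sci.space']

counts = class_counts(toy_targets, toy_names)
for name, n in counts.items():
    print(f'{name:>15}: {n}')

# Ratio of the largest to the smallest class; values near 1 indicate balance.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f'max/min class size ratio: {imbalance_ratio:.2f}')
```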


In [65]:
text_clf = Pipeline([('count_vectorizer', CountVectorizer()), 
                     ('tfidf_vectorizer', TfidfTransformer()),
                     ('clf', MultinomialNB())
                    ])

In [66]:
text_clf.fit(X_train.data, X_train.target)

Pipeline(steps=[('count_vectorizer', CountVectorizer()),
                ('tfidf_vectorizer', TfidfTransformer()),
                ('clf', MultinomialNB())])
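A side note on the pipeline above: with default settings, `CountVectorizer` followed by `TfidfTransformer` produces the same features as a single `TfidfVectorizer`. The sketch below checks this on a tiny made-up corpus (the documents and labels are invented for illustration). TF-IDF values are fractional rather than integer counts; the scikit-learn documentation notes that `MultinomialNB` may still work with such features in practice.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny invented corpus; labels: 0 = space, 1 = hockey.
docs = ['the rocket reached orbit',
        'the shuttle launch was delayed',
        'the goalie made a great save',
        'the team won the hockey game']
labels = [0, 0, 1, 1]

# Two-step featurisation, as in the pipeline above.
term_counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(term_counts)

# Single-step equivalent.
one_step = TfidfVectorizer().fit_transform(docs)
same_features = np.allclose(two_step.toarray(), one_step.toarray())
print('Identical features:', same_features)

# The shorter pipeline is fitted and used in the same way.
toy_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])
toy_clf.fit(docs, labels)
toy_pred = toy_clf.predict(['the rocket launch'])
print(toy_pred)
```

Keeping the two steps separate, as the notebook does, is still useful when the raw counts are needed on their own, for example to compare a counts-only model against the TF-IDF one.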

In [68]:
predictions = text_clf.predict(X_test.data)

In [70]:
print(classification_report(X_test.target, predictions, target_names=X_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.81      0.07      0.13       319
           comp.graphics       0.72      0.62      0.67       389
 comp.os.ms-windows.misc       0.70      0.50      0.59       394
comp.sys.ibm.pc.hardware       0.55      0.75      0.64       392
   comp.sys.mac.hardware       0.81      0.61      0.69       385
          comp.windows.x       0.83      0.74      0.78       395
            misc.forsale       0.86      0.69      0.77       390
               rec.autos       0.82      0.68      0.74       396
         rec.motorcycles       0.89      0.63      0.73       398
      rec.sport.baseball       0.95      0.69      0.80       397
        rec.sport.hockey       0.59      0.90      0.71       399
               sci.crypt       0.47      0.80      0.59       396
         sci.electronics       0.77      0.43      0.55       393
                 sci.med       0.86      0.63      0.73    