## <center>20 Newsgroups data</center>

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his paper "Newsweeder: Learning to filter netnews", though he does not explicitly mention this collection. The 20 Newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.



## About the dataset
The data is organized into 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are very closely related to each other (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are highly unrelated (e.g. misc.forsale / soc.religion.christian). Here is a list of the 20 newsgroups, partitioned (more or less) according to subject matter:

```text
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x

rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey

sci.crypt
sci.electronics
sci.med
sci.space

misc.forsale

talk.politics.misc
talk.politics.guns
talk.politics.mideast

talk.religion.misc
alt.atheism
soc.religion.christian
```

This example uses the 20 Newsgroups dataset. It is downloaded automatically on first use, then cached and reused for the document classification example.

You can restrict the categories by passing a list of their names to the dataset loader, or pass `None` to load all 20.

### Importing necessary libraries

```python
import warnings
from pprint import pprint

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")
```

### Loading the dataset

```python
# Load the training split with all 20 categories and list their names
df = fetch_20newsgroups(subset='train')
pprint(list(df.target_names))
```

```text
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
```

### Task 1: Create a list of 4 newsgroups and fetch them with `fetch_20newsgroups` for both train and test data.
***
### Instructions

- Create a list of the 4 newsgroups `'alt.atheism'`, `'talk.religion.misc'`, `'comp.graphics'`, `'sci.space'` and save it as `categories`.
- Fetch these 4 newsgroups using `fetch_20newsgroups` with the parameters `subset='train'` and `categories=categories`, and store the result in `newsgroups_train`. Do the same with `subset='test'` and store the result in `newsgroups_test`.
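
One way to sketch the steps above is shown here. The helper name `load_newsgroups` is hypothetical (it is not part of scikit-learn or of the task); it only wraps the `fetch_20newsgroups` call so that nothing is downloaded until you call it.

```python
from sklearn.datasets import fetch_20newsgroups

# The four newsgroups named in the task
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']


def load_newsgroups(subset, categories=categories):
    """Hypothetical helper: fetch one split ('train' or 'test').

    The first call downloads the data and caches it locally;
    later calls reuse the cache.
    """
    return fetch_20newsgroups(subset=subset, categories=categories)
```

With network access (or a warm cache), `newsgroups_train = load_newsgroups('train')` and `newsgroups_test = load_newsgroups('test')` give the two variables the task asks for.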


### Task 2: Use TfidfVectorizer on the train data and find the number of non-zero components per sample.
***
### Instructions
 - Initialise a `TfidfVectorizer()` object and save it as `vectorizer`.
 - Apply the `fit_transform()` method of `vectorizer` to `newsgroups_train.data` and store the result in `vectors`.
 - Print the shape of `vectors` (number of samples by number of features).
 - Find the average number of non-zero components per sample.
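
The steps above can be sketched on a tiny in-memory corpus, so the example runs without downloading anything. The three sentences in `docs` are made up for illustration; in the notebook you would pass `newsgroups_train.data` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for newsgroups_train.data
docs = [
    "the shuttle reached orbit",
    "the orbit of the moon",
    "graphics card renders the image",
]

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(docs)  # sparse matrix: samples x features

# Shape of the TF-IDF matrix
print(vectors.shape)

# Average number of non-zero components per sample:
# total stored non-zeros divided by the number of rows
avg_nonzero = vectors.nnz / vectors.shape[0]
print(avg_nonzero)
```

Dividing `vectors.nnz` by the number of rows gives the per-sample average without converting the sparse matrix to a dense one.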

#### Observation: The extracted TF-IDF vectors are very sparse, with an average of 159 non-zero components per sample in a more than 30,000-dimensional space (i.e. fewer than 0.5% of the features are non-zero).

### Task 3: Transform the test data with the fitted TfidfVectorizer, train a Naive Bayes model, and calculate the F1 score.
***
### Instructions
- Apply the `transform()` method of `vectorizer` to `newsgroups_test.data` and store the result in `vectors_test`. Use `transform()` rather than `fit_transform()` here so that the test data is mapped onto the vocabulary learned from the train data.
- Initialise a Naive Bayes model with `MultinomialNB()`, passing `alpha=.01`, and save it to a variable called `clf`.
- Apply the `fit()` method of `clf` to `vectors` and `newsgroups_train.target`.
- Predict on the test data, i.e. `vectors_test`, and save the result as `pred`.
- Compute the F1 score between `newsgroups_test.target` and `pred` using `f1_score`, passing `average='macro'`.
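
A minimal sketch of this workflow, again on a made-up toy corpus rather than the real newsgroup data. The documents and labels below are assumptions for illustration only; in the notebook, `vectors`, `newsgroups_train.target`, and `newsgroups_test` take their place.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for the train and test splits (0 = space, 1 = graphics)
train_docs = [
    "rocket launch orbit",
    "satellite orbit moon",
    "pixel shader render",
    "image pixel texture",
]
train_target = [0, 0, 1, 1]
test_docs = ["moon rocket", "texture shader"]
test_target = [0, 1]

# Fit the vocabulary on the train data only, then reuse it for the test data
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(train_docs)
vectors_test = vectorizer.transform(test_docs)

# Multinomial Naive Bayes with light additive smoothing
clf = MultinomialNB(alpha=.01)
clf.fit(vectors, train_target)
pred = clf.predict(vectors_test)

# Macro averaging computes F1 per class and takes the unweighted mean
score = f1_score(test_target, pred, average='macro')
print(score)
```

On such a small, cleanly separated corpus the score says nothing about real performance; the point is only the order of the calls.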

### Task 4: Print the top 20 words for every news category.
***
As we saw, Multinomial Naive Bayes achieves an F-score of about 0.88. You might be wondering what is going on inside this classifier.

Let’s take a look at what the most informative features are:
- Create a function `show_top20` with 3 parameters: `classifier`, `vectorizer`, and `categories`.
- Get the feature names using the `get_feature_names()` method of `vectorizer` (`get_feature_names_out()` in recent scikit-learn releases), convert them into a numpy array, and save it as `feature_names`.
- Start a `for` loop that iterates over both the index and the value of `categories` using the `enumerate()` function.
- Inside the loop, pass `classifier.coef_[i]` to numpy's `argsort()` and keep the last 20 indices with `[-20:]`, saving them in the variable `top20`. Since `argsort()` sorts in ascending order, these are the indices of the 20 largest coefficients. (Recent scikit-learn releases removed `coef_` from `MultinomialNB`; use `classifier.feature_log_prob_[i]` there.)
- Print the category name and its top 20 words, i.e. `feature_names[top20]`, for every news category.
- Finally, call the function as `show_top20(clf, vectorizer, newsgroups_train.target_names)`.