# Practical 2: Text classification
#### Ayoub Bagheri

In this practical, we are going to create a text classification pipeline. We will work with the well-known 20 Newsgroups data set, which can be loaded through the sklearn library.

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. It was originally collected by Ken Lang, and it has become a popular data set for experiments in text applications of machine learning techniques.

Today we will use the following libraries. Make sure they are installed before you start!

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics
import pandas as pd
import numpy as np

### Let's get started!

1\. **Use the code below to load the train and test subsets of the 20 Newsgroups data set from sklearn datasets. Remove the headers, footers and quotes from the news articles when loading the data sets. Use 321 for random_state. To get faster execution times for this practical, we will work on a partial data set with only 5 of the 20 categories available in the data set: ('rec.sport.hockey', 'talk.politics.mideast', 'soc.religion.christian', 'comp.graphics', 'sci.med').**

In [2]:
categories = ['rec.sport.hockey', 'talk.politics.mideast', 'soc.religion.christian', 'comp.graphics', 'sci.med']

In [3]:
twenty_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), 
                                  categories=categories, shuffle=True, random_state=321)
# type(twenty_train)

In [4]:
twenty_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), 
                                 categories=categories, shuffle=True, random_state=321)

2\. **Find out about the number of news articles in train and test sets.**
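
One way to answer this is to look at the length of the `data` attribute of each loaded subset. The sketch below wraps that in a small helper. `describe_subset` is a hypothetical name, and the toy `Bunch` only stands in for `twenty_train` so that the snippet runs without downloading the data.

```python
from sklearn.utils import Bunch


def describe_subset(bunch):
    """Return (number of documents, number of distinct labels) for a subset."""
    return len(bunch.data), len(set(bunch.target))


# Toy stand-in with the same attributes as the object returned by
# fetch_20newsgroups (data, target, target_names); this is NOT the real data.
toy_train = Bunch(
    data=[
        "the goalie made a save",
        "the doctor prescribed rest",
        "render the polygon mesh",
    ],
    target=[0, 1, 2],
    target_names=["rec.sport.hockey", "sci.med", "comp.graphics"],
)

n_docs, n_classes = describe_subset(toy_train)
print(f"{n_docs} documents across {n_classes} categories")
```

In the notebook, the same helper could be called as `describe_subset(twenty_train)` and `describe_subset(twenty_test)`.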

3\. **Convert the train and test sets to dataframes.**
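
A minimal sketch of the conversion, assuming each subset exposes `data` (the raw texts) and `target` (the integer labels). The column name `label` matches the `df_train.label` used later in this practical; the column name `text` and the helper name `bunch_to_dataframe` are assumptions made for illustration. A toy `Bunch` replaces the real subsets so the snippet runs offline.

```python
import pandas as pd
from sklearn.utils import Bunch


def bunch_to_dataframe(bunch):
    """Put the raw documents and their integer labels side by side."""
    return pd.DataFrame({"text": bunch.data, "label": bunch.target})


# Toy stand-in for twenty_train; not the real data.
toy_train = Bunch(
    data=["the goalie made a save", "the doctor prescribed rest"],
    target=[0, 1],
)

df_toy = bunch_to_dataframe(toy_train)
print(df_toy.head())
```

With the real data, this would read `df_train = bunch_to_dataframe(twenty_train)` and `df_test = bunch_to_dataframe(twenty_test)`.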

4\. **To feed text data to classification models, you first need to turn the text into vectors of numerical values suitable for statistical analysis. Use the binary representation with TfidfVectorizer to create document-term matrices for the train and test sets (name them X_train and X_test). We built a similar document-term matrix in the previous practical.**
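
The following sketch shows one possible reading of this step: `binary=True` makes `TfidfVectorizer` treat every non-zero term count as 1 before the idf weighting and normalisation are applied. If a strictly 0/1 matrix is intended, `use_idf=False` and `norm=None` would also be needed; which of the two is meant here is an assumption. The short texts are toy placeholders for the real documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy placeholders for the train and test documents.
train_texts = [
    "the goalie stopped the puck",
    "the team scored in overtime",
    "the patient received a vaccine",
    "the doctor treated the infection",
]
test_texts = ["the puck hit the goalie", "the doctor gave a vaccine"]

vectorizer = TfidfVectorizer(binary=True)

# Learn the vocabulary and idf weights on the training texts only ...
X_train = vectorizer.fit_transform(train_texts)
# ... and reuse them for the test texts, so both matrices share columns.
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)
```

Fitting on the training set and only transforming the test set keeps information from the test documents out of the vocabulary and the idf weights.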

5\. **Create y_train and y_test objects from the df_train.label.values and df_test.label.values, respectively.**
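
As a small illustration, the label arrays can be taken straight from the dataframe column, and `np.bincount` gives a quick check of how many documents each class has. The toy dataframe below is an assumption standing in for `df_train`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train with the same 'label' column.
df_train = pd.DataFrame(
    {
        "text": ["puck and goalie", "vaccine trial", "overtime goal", "polygon mesh"],
        "label": [0, 1, 0, 2],
    }
)

# .values returns the labels as a NumPy array, which sklearn accepts as y.
y_train = df_train.label.values

# Number of documents per class, indexed by the integer label.
class_counts = np.bincount(y_train)
print(class_counts)
```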

6\. **Select at least two of the following classifiers and train two models on the data set.**
    - [K-Nearest Neighbor classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)
    - [Multinomial Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)
    - [Support Vector Machine](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
    - [Decision Tree Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)
    - [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
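
As a rough sketch of this step, the snippet below trains two of the listed classifiers on the same document-term matrix and compares their test accuracy. The choice of `MultinomialNB` and `LinearSVC` is arbitrary, and the toy texts and labels are assumptions used in place of the newsgroup data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn import metrics

# Toy placeholders for the real documents and labels.
train_texts = [
    "the goalie stopped the puck",
    "the team scored in overtime",
    "the patient received a vaccine",
    "the doctor treated the infection",
]
y_train = [0, 0, 1, 1]
test_texts = ["the puck hit the goalie", "the doctor gave a vaccine"]
y_test = [0, 1]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

models = {
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(random_state=321),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)          # train on the document-term matrix
    y_pred = model.predict(X_test)       # predict labels for unseen documents
    scores[name] = metrics.accuracy_score(y_test, y_pred)
    print(name, scores[name])
```

Keeping the models in a dictionary makes it easy to add a third or fourth classifier from the list without changing the loop.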

7\. **Using a [Voting Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier), we can combine multiple classifiers. Can we get better results if we combine the classifiers? (this is also called ensemble learning)**
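
A minimal sketch of a voting ensemble, assuming hard (majority) voting over three of the classifiers imported above. Hard voting is used here because `LinearSVC` does not provide class probabilities, which soft voting would require. The toy data is again a placeholder, so the result says nothing about performance on the real data set.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy placeholders for the real documents and labels.
train_texts = [
    "the goalie stopped the puck",
    "the team scored in overtime",
    "the patient received a vaccine",
    "the doctor treated the infection",
]
y_train = [0, 0, 1, 1]
test_texts = ["the puck hit the goalie", "the doctor gave a vaccine"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Each estimator gets a name; the ensemble predicts the majority label.
voting_clf = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("svm", LinearSVC(random_state=321)),
        ("tree", DecisionTreeClassifier(random_state=321)),
    ],
    voting="hard",
)
voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)
print(y_pred)
```

Whether the ensemble beats its best member depends on the data, so it is worth comparing its test accuracy against the individual models from step 6.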

8\. **To make building a text classifier easier, we can use the Pipeline class from sklearn. Create a pipeline with TfidfVectorizer and your best classifier from step 6.**

9\. **Fit the pipeline on your training set.**
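
Steps 8 and 9 could look roughly like the sketch below. `MultinomialNB` is only a placeholder for whichever classifier performed best in step 6, the step names `tfidf` and `clf` are arbitrary, and the toy texts stand in for the real training documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy placeholders for the real training documents and labels.
train_texts = [
    "the goalie stopped the puck",
    "the team scored in overtime",
    "the patient received a vaccine",
    "the doctor treated the infection",
]
y_train = [0, 0, 1, 1]

# The pipeline chains vectorisation and classification, so it takes raw text.
text_clf = Pipeline(
    [
        ("tfidf", TfidfVectorizer()),
        ("clf", MultinomialNB()),
    ]
)
text_clf.fit(train_texts, y_train)

# Prediction also works directly on raw, unseen text.
predicted = text_clf.predict(["the goalie saved the puck"])
print(predicted)
```

Because the vectorizer lives inside the pipeline, it is fitted on the training texts only and automatically reused at prediction time.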

10\. **Compute the accuracy on the test set.**
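
Accuracy is the fraction of test documents whose predicted label equals the true label. The sketch below uses small hand-written label lists (an assumption, standing in for `y_test` and the pipeline's predictions) so that the result can be checked by hand: four of the five predictions are correct.

```python
import numpy as np
from sklearn import metrics

# Hand-written stand-ins for the true and predicted test labels.
y_test = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1])

# Library call ...
accuracy = metrics.accuracy_score(y_test, y_pred)
# ... and the same quantity computed directly as a mean of matches.
accuracy_by_hand = np.mean(y_test == y_pred)

print(accuracy, accuracy_by_hand)
```

With a fitted pipeline, the predictions would come from its `predict` method applied to the raw test texts.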

11\. **Can you also compute precision, recall and f1?**
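
One possible approach uses the `metrics` module imported at the start. The sketch below computes macro-averaged precision, recall and F1, and prints a per-class report. The label lists are hand-written stand-ins for `y_test` and the model's predictions, and macro averaging is an assumption; weighted or per-class scores may be more appropriate depending on the question.

```python
from sklearn import metrics

# Hand-written stand-ins for the true and predicted test labels.
y_test = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]

# Macro average: compute each score per class, then take the unweighted mean.
precision, recall, f1, _ = metrics.precision_recall_fscore_support(
    y_test, y_pred, average="macro"
)
print(round(precision, 3), round(recall, 3), round(f1, 3))

# Per-class breakdown of the same three scores plus the class support.
print(metrics.classification_report(y_test, y_pred))
```

With the real data, passing `target_names=twenty_test.target_names` to `classification_report` would label the rows with the newsgroup names instead of integers.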