# Lesson 2: Text Classification

Theory: We will learn how text can be represented numerically and how a classifier can use that representation. We will use TF-IDF and experiment with a couple of supervised learning models.

Exercise: Build an NLP pipeline to perform classification.
We will need to clean the text, transform it into a numeric representation that an algorithm can consume, and finally classify it.

Outcome: You will be able to solve a text classification problem end to end.
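
As a rough illustration of the representation step mentioned in the Theory note, the sketch below turns a tiny, made-up corpus into a TF-IDF matrix with scikit-learn's `TfidfVectorizer`. The three sentences are hypothetical and only there to keep the example self-contained; the lesson itself uses the 20 Newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, for illustration only)
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Column order of the matrix: terms sorted by their column index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(X.shape)
print(vocab)

# Non-zero weights of the first document
row = X[0].toarray().ravel()
weights = {term: round(float(w), 3) for term, w in zip(vocab, row) if w > 0}
print(weights)
```

Within a document, a term that occurs in fewer documents of the corpus ("cat") gets a larger weight than an equally frequent term that occurs in more of them ("sat"). Each row is L2-normalised by default.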



In [None]:
!python -m spacy download en

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
     |████████████████████████████████| 12.0 MB 27.2 MB/s
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [None]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

ng_train = fetch_20newsgroups(subset='train', 
                                  categories=categories,
                                  shuffle=True,
                                  random_state=11)

ng_test = fetch_20newsgroups(subset='test',
                                 categories=categories,
                                 shuffle=True,
                                 random_state=11)


ng_test.target


array([2, 1, 1, ..., 1, 2, 0])

In [None]:
import re
import string

import nltk
import spacy
from nltk.corpus import stopwords
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from spacy.lang.en import English

nltk.download('stopwords')

# Check that the spaCy model downloaded above loads
spacy.load('en')

# Blank English tokenizer (no statistical model required)
parser = English()

# Extra stop words to add to NLTK's English list
ENGLISH_STOP_WORDS = ['the']

STOPLIST = set(stopwords.words('english')).union(ENGLISH_STOP_WORDS)
# Every ASCII punctuation character plus a few multi-character or typographic symbols
SYMBOLS = list(string.punctuation) + ["-", "...", "“", "”"]


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
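
The cell above imports `TransformerMixin` and prepares a stop list and a symbol list, which suggests a custom cleaning step for the pipeline. Below is a minimal sketch of such a step. `TextCleaner`, `clean_text`, and the small `STOP_WORDS` set are hypothetical names introduced here so the example runs without any downloads; in the notebook you could use `STOPLIST` and `SYMBOLS` instead.

```python
import re
import string

from sklearn.base import BaseEstimator, TransformerMixin

# Assumption: a tiny hand-written stop list, standing in for STOPLIST
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}
# Translation table that maps every ASCII punctuation character to a space
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_text(text):
    """Lowercase, drop e-mail addresses, punctuation, and stop words."""
    text = text.lower()
    text = re.sub(r"\S+@\S+", " ", text)  # e-mail addresses
    text = text.translate(PUNCT_TABLE)    # punctuation -> spaces
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


class TextCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical stateless cleaning step usable inside a Pipeline."""

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        return [clean_text(doc) for doc in X]


cleaned = TextCleaner().fit_transform(
    ["The cat, the dog & the bird!", "Mail me: bob@example.com"]
)
print(cleaned)  # ['cat dog bird', 'mail me']
```

Because the class implements `fit` and `transform`, it can be placed as the first step of a `Pipeline`, ahead of the vectorizer.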


In [None]:
help(Pipeline)

Help on class Pipeline in module sklearn.pipeline:

class Pipeline(sklearn.utils.metaestimators._BaseComposition)
 |  Pipeline(steps, *, memory=None, verbose=False)
 |  
 |  Pipeline of transforms with a final estimator.
 |  
 |  Sequentially apply a list of transforms and a final estimator.
 |  Intermediate steps of the pipeline must be 'transforms', that is, they
 |  must implement `fit` and `transform` methods.
 |  The final estimator only needs to implement `fit`.
 |  The transformers in the pipeline can be cached using ``memory`` argument.
 |  
 |  The purpose of the pipeline is to assemble several steps that can be
 |  cross-validated together while setting different parameters. For this, it
 |  enables setting parameters of the various steps using their names and the
 |  parameter name separated by a `'__'`, as in the example below. A step's
 |  estimator may be replaced entirely by setting the parameter with its name
 |  to another estimator, or a transformer removed by setting
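
The help text above is cut off before its example, so here is a small sketch of the `step__parameter` naming it describes. The step names `tfidf` and `clf`, the toy texts, and the parameter grid are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data (hypothetical): four documents per class
texts = [
    "the patient was given a new drug",
    "doctors treat the disease with medicine",
    "the vaccine prevents infection",
    "symptoms of the illness include fever",
    "render the image with opengl",
    "the graphics card draws polygons",
    "3d rendering uses shaders and textures",
    "the image format stores pixels",
]
labels = ["med"] * 4 + ["graphics"] * 4

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])

# Address a step's parameter as <step name>__<parameter name>
pipe.set_params(tfidf__ngram_range=(1, 2), clf__C=0.5)
print(pipe.get_params()["tfidf__ngram_range"])

# The same naming lets a grid search tune all steps together
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0],
}
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

With so little data the selected parameters are not meaningful; the point is only how the names are built.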

In [None]:
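from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(2, 2))
# complete here; LinearSVC (imported above) is one possible choice, not
# necessarily the classifier that produced the report below
clf = LinearSVC()

# complete here; a vectorizer followed by a classifier is the minimal pipeline
pipe = Pipeline([('vectorizer', vectorizer), ('clf', clf)])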

# data
X_train = ng_train.data
y_train = ng_train.target
X_test = ng_test.data
y_test = ng_test.target

# train
pipe.fit(X_train, y_train)

# test
y_pred = pipe.predict(X_test)



from sklearn import metrics
print(metrics.classification_report(y_test, y_pred,
    target_names=ng_train.target_names))


                        precision    recall  f1-score   support

           alt.atheism       0.90      0.75      0.82       319
         comp.graphics       0.88      0.94      0.91       389
               sci.med       0.92      0.87      0.90       396
soc.religion.christian       0.85      0.95      0.90       398

              accuracy                           0.89      1502
             macro avg       0.89      0.88      0.88      1502
          weighted avg       0.89      0.89      0.88      1502
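
To make the columns of the report easier to read, the sketch below recomputes precision, recall, F1, and support by hand for a small made-up set of labels and compares them with scikit-learn's `precision_recall_fscore_support`. The label lists and the helper `per_class_scores` are hypothetical and unrelated to the newsgroup results above.

```python
from sklearn.metrics import precision_recall_fscore_support

# Made-up true and predicted labels for three classes
y_true = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
y_hat = [0, 0, 1, 1, 1, 1, 2, 2, 2, 0]


def per_class_scores(y_true, y_hat, label):
    """Precision, recall, F1, and support for one class.

    Assumes the class is both predicted and present at least once,
    so no denominator is zero.
    """
    pairs = list(zip(y_true, y_hat))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp)  # of everything predicted as label, how much was right
    recall = tp / (tp + fn)     # of everything truly label, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, tp + fn


manual = [per_class_scores(y_true, y_hat, label) for label in (0, 1, 2)]
for label, scores in zip((0, 1, 2), manual):
    print(label, scores)

# "macro avg" is the plain mean over classes;
# "weighted avg" weights each class by its support
macro_f1 = sum(s[2] for s in manual) / len(manual)
weighted_f1 = sum(s[2] * s[3] for s in manual) / sum(s[3] for s in manual)
print(macro_f1, weighted_f1)

reference = precision_recall_fscore_support(y_true, y_hat, labels=[0, 1, 2])
```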



In [None]:
X_test, y_pred

(['From: "nigel allen" <nigel.allen@canrem.com>\nSubject: Occupational Injuries and Disease: Workers Memorial Day\nReply-To: "nigel allen" <nigel.allen@canrem.com>\nOrganization: Canada Remote Systems\nDistribution: sci\nLines: 97\n\n\nHere is a press release from the American Federation of State, \nCounty and Municipal Employees.\n\n Unions Point To Deadly Workplaces; AFSCME, Other Unions\nCommemorate Workers Memorial Day\n To: National Desk, Labor Writer\n Contact: Janet Rivera of the American Federation of State, County\nand Municipal Employees, AFL-CIO, 202-429-1130\n\n   WASHINGTON, April 23 -- The American Federation of State, \nCounty and Municipal Employees (AFSCME) and other unions\nof the AFL-CIO on Wednesday, April 28, will commemorate the fifth\nannual Workers Memorial Day -- a day to pay homage to the 6\nmillion workers who are killed, injured, or diseased on the job.\n   This year, AFSCME will focus its Workers Memorial Day efforts an\nthe dangerous environment in which c

# Exercise

Improve the classifier above: create a new pipeline, add some transformations, and check the performance of other classifiers of your choice.
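
One possible starting point, offered as a sketch and not as the expected solution: a helper that fits the same vectorizer with several classifiers and reports accuracy for each. The function name `compare_classifiers`, the vectorizer settings, the choice of candidates, and the toy data are all assumptions; in the notebook you would pass `ng_train.data`, `ng_train.target`, `ng_test.data`, and `ng_test.target` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def compare_classifiers(candidates, X_train, y_train, X_test, y_test):
    """Fit one TF-IDF pipeline per classifier and return test accuracies."""
    scores = {}
    for name, clf in candidates.items():
        pipe = Pipeline([
            # Assumed settings: unigrams, English stop words, log-scaled tf
            ("tfidf", TfidfVectorizer(stop_words="english", sublinear_tf=True)),
            ("clf", clf),
        ])
        pipe.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, pipe.predict(X_test))
    return scores


candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svc": LinearSVC(),
}

# Toy data (hypothetical) so the sketch runs on its own
toy_train = [
    "the doctor prescribed a new drug",
    "the vaccine prevents infection",
    "render the scene with shaders",
    "the graphics card draws polygons",
]
toy_train_labels = ["med", "med", "graphics", "graphics"]
toy_test = ["a drug against infection", "polygons and shaders"]
toy_test_labels = ["med", "graphics"]

scores = compare_classifiers(
    candidates, toy_train, toy_train_labels, toy_test, toy_test_labels
)
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```

A cleaning transformer or a different `ngram_range` can be added to the pipeline inside the helper to test further ideas.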