# Lesson 2: Text Classification

Theory: We will learn how text can be represented numerically and how a classifier can use that representation. We will use TF-IDF and experiment with a couple of supervised learning models.

Exercise: Build an NLP pipeline to perform classification.
We will need to clean the text, transform it into something readable by an algorithm, and finally classify it.

Outcome: You will be able to solve a text classification problem end to end.
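
To make the TF-IDF idea from the theory concrete, here is a minimal sketch that computes the weights by hand. The toy corpus and the `tfidf` helper are illustrative assumptions, and the formula is the plain textbook variant (tf = count / document length, idf = ln(N / df)); library implementations such as scikit-learn's apply smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the lesson's dataset).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Whitespace tokenization is enough for this illustration.
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Return {term: tf * idf} with tf = count / len(tokens), idf = ln(N / df)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }


vectors = [tfidf(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(weight, 3) for term, weight in sorted(vec.items())})
```

Terms that occur in only one document (such as "cat") get a higher weight than terms shared across documents (such as "sat"), which is what lets a classifier focus on the words that distinguish one text from another.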



In [None]:
!python -m spacy download en

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [None]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

ng_train = fetch_20newsgroups(subset='train',
                              categories=categories,
                              shuffle=True,
                              random_state=11)

ng_test = fetch_20newsgroups(subset='test',
                             categories=categories,
                             shuffle=True,
                             random_state=11)


ng_test.target


array([2, 1, 1, ..., 1, 2, 0])

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
# from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
import string
import re
import spacy
spacy.load('en')
from spacy.lang.en import English

import nltk
nltk.download('stopwords')

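# Extra stop words to merge with NLTK's list (left empty here).
ENGLISH_STOP_WORDS = []
# Blank English pipeline, used only to tokenize and lemmatize.
parser = English()

# Stop words and punctuation symbols to drop during tokenization.
STOPLIST = set(stopwords.words('english')).union(ENGLISH_STOP_WORDS)
SYMBOLS = list(string.punctuation) + ["-", "...", "“", "”"]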

class CleanTextTransformer(TransformerMixin):
    
    def _clean_text(self, text):
        text = text.strip().replace("\n", " ").replace("\r", " ")
        text = text.lower()
        return text

    def transform(self, X, **transform_params):
        return [self._clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self
    
    def get_params(self, deep=True):
        return {}
    


def tokenize_text(sample):
    '''
    Tokenize a document with spaCy and return its lowercased lemmas,
    with stop words and punctuation symbols removed.
    '''
    tokens = parser(sample)
    # Use each token's lemma; spaCy 2.x gives pronouns the placeholder
    # lemma "-PRON-", so fall back to the lowercased text in that case.
    lemmas = [
        token.lemma_.lower().strip() if token.lemma_ != "-PRON-" else token.lower_
        for token in tokens
    ]
    return [tok for tok in lemmas if tok not in STOPLIST and tok not in SYMBOLS]

    

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:

vectorizer = CountVectorizer(tokenizer=tokenize_text, 
                             ngram_range=(1,1))
clf = LinearSVC()

pipe = Pipeline([('cleanText', CleanTextTransformer()),
                 ('vectorizer', vectorizer), 
                 ('clf', clf)])
# data
X_train = ng_train.data
y_train = ng_train.target
X_test = ng_test.data
y_test = ng_test.target
# train
pipe.fit(X_train, y_train)
# test
y_pred = pipe.predict(X_test)



from sklearn import metrics
print(metrics.classification_report(y_test, y_pred,
    target_names=ng_train.target_names))


                        precision    recall  f1-score   support

           alt.atheism       0.90      0.77      0.83       319
         comp.graphics       0.87      0.94      0.90       389
               sci.med       0.92      0.84      0.88       396
soc.religion.christian       0.85      0.94      0.89       398

              accuracy                           0.88      1502
             macro avg       0.88      0.88      0.88      1502
          weighted avg       0.88      0.88      0.88      1502



In [None]:
y_test, y_pred

(array([2, 1, 1, ..., 1, 2, 0]), array([2, 1, 1, ..., 1, 2, 0]))

# Exercise

Improve the classifier above: create a new pipeline, add some transformations, and check the performance of other classifiers of your choice.
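
One possible starting point is sketched below. It swaps the count features for TF-IDF and loops over a few candidate classifiers. The `evaluate_classifiers` helper, the choice of classifiers, the vectorizer settings, and the toy data are all assumptions made for illustration, not a reference solution.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def evaluate_classifiers(X_train, y_train, X_test, y_test):
    """Fit one TF-IDF pipeline per candidate classifier and return test accuracies."""
    candidates = {
        "LinearSVC": LinearSVC(),
        "MultinomialNB": MultinomialNB(),
        "LogisticRegression": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, clf in candidates.items():
        pipe = Pipeline([
            # Unigrams and bigrams with scikit-learn's built-in English stop list.
            ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
            ("clf", clf),
        ])
        pipe.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, pipe.predict(X_test))
    return scores


# Tiny hypothetical dataset so the sketch runs without downloading anything.
toy_train = [
    "the rendering engine draws polygons and textures",
    "image files and 3d graphics software",
    "the doctor prescribed medicine for the infection",
    "patients reported symptoms after treatment",
]
toy_train_labels = [0, 0, 1, 1]
toy_test = ["new graphics rendering software", "medicine and treatment for patients"]
toy_test_labels = [0, 1]

scores = evaluate_classifiers(toy_train, toy_train_labels, toy_test, toy_test_labels)
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.2f}")
```

With the lesson's data, the same helper could be called as `evaluate_classifiers(X_train, y_train, X_test, y_test)` using the variables defined in the cells above, and the resulting accuracies compared against the 0.88 reported for the baseline pipeline.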