# First approximation to a text classification pipeline with Python-based modules

The goal of this notebook is to present a common way of assembling and running Python-based NLP modules for text classification.

This evaluation is inspired by a well-known Kaggle challenge on tumor classification.

**Note:** In the current directory, create a 'datasets' folder at the same level as the notebook and decompress into it the training and test datasets available in 'compressed_datasets'. The cells below do this automatically, using a folder named 'datasets_test'.


In [1]:
import os
import subprocess

uncompressed_datasets = os.path.join(os.getcwd(), "datasets_test")

print(uncompressed_datasets)

# Create the folder for the uncompressed datasets if it does not exist yet
os.makedirs(uncompressed_datasets, exist_ok=True)


/home/crosas/0.BSC2019/Projects/NLP/Encomienda/PTL/notebooks/datasets_test


In [2]:
loc_7z = "/usr/bin/7zr"
compressed_datasets = os.getcwd() + "/compressed_datasets"


- Extract Training Data into local folder

In [3]:

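training_data_compressed = "training_data.json.7z"
# Drop the ".7z" suffix to get the name of the extracted file
training_data = uncompressed_datasets + "/" + training_data_compressed[:-3]

path_compressed = compressed_datasets + "/" + training_data_compressed

# "e" extracts the archive; "-o" (no space before the path) sets the output folder
extract_command = r'"{}" e "{}" -o"{}"'.format(loc_7z, path_compressed, uncompressed_datasets)

# subprocess.call returns the exit code of 7-Zip, which is shown below the cell
subprocess.call(extract_command, shell=True)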


2

- Extract Test Data into local folder (it is also labelled, so it will be merged with the training data)

In [4]:

test_data_compressed = "test_data.json.7z"
test_data = uncompressed_datasets + "/" + test_data_compressed[:-3]

path_compressed = compressed_datasets + "/" + test_data_compressed

extract_command = r'"{}" e "{}" -o"{}"'.format(loc_7z, path_compressed, uncompressed_datasets)
subprocess.call(extract_command, shell=True)


2

- Extract Data to be predicted into local folder

In [5]:
prediction_data_compressed = "prediction_data.json.7z"
prediction_data = uncompressed_datasets + "/" + prediction_data_compressed[:-3]

path_compressed = compressed_datasets + "/" + prediction_data_compressed

extract_command = r'"{}" e "{}" -o"{}"'.format(loc_7z, path_compressed, uncompressed_datasets)
subprocess.call(extract_command, shell=True)

2
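
The value printed under each extraction cell is the return value of `subprocess.call`, that is, the exit code of the 7-Zip process. The 7-Zip command-line documentation lists `2` as a fatal error, so it may be worth checking the code instead of only displaying it. The sketch below is one possible way to do that; `build_extract_command` and `extract_archive` are hypothetical helper names, and the `-y` switch (assume "yes" on all prompts) is an addition that the original cells do not use.

```python
import subprocess

# Exit codes documented for the 7-Zip command-line tools.
SEVEN_ZIP_EXIT_CODES = {
    0: "no error",
    1: "warning (non-fatal)",
    2: "fatal error",
    7: "command line error",
    8: "not enough memory",
    255: "user stopped the process",
}


def build_extract_command(loc_7z, archive_path, output_dir):
    """Return the argument list for extracting archive_path into output_dir."""
    # "e" extracts without directory structure; "-o" is glued to its value;
    # "-y" answers yes to prompts such as overwrite confirmations.
    return [loc_7z, "e", archive_path, "-o" + output_dir, "-y"]


def extract_archive(loc_7z, archive_path, output_dir):
    """Hypothetical helper: run the extraction and raise if it did not succeed."""
    result = subprocess.run(build_extract_command(loc_7z, archive_path, output_dir))
    if result.returncode != 0:
        reason = SEVEN_ZIP_EXIT_CODES.get(result.returncode, "unknown error")
        raise RuntimeError("7-Zip exited with {}: {}".format(result.returncode, reason))
    return result.returncode
```

Passing the arguments as a list avoids `shell=True` and the manual quoting of paths.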

### 1. Read Input

To read the data we use Textacy (https://spacy.io/universe/project/textacy or https://chartbeat-labs.github.io/textacy/).

Textacy is built on spaCy and is designed for the pre- and postprocessing steps around it.

In [6]:
import textacy

print(training_data)
# lines=True: the file holds one JSON object per line
training_records = textacy.io.read_json(training_data, lines=True)

print(test_data)
test_records = textacy.io.read_json(test_data, lines=True)

/home/crosas/0.BSC2019/Projects/NLP/Encomienda/PTL/notebooks/datasets_test/training_data.json
/home/crosas/0.BSC2019/Projects/NLP/Encomienda/PTL/notebooks/datasets_test/test_data.json
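
To make the expected file layout concrete, here is a minimal standard-library sketch of reading a file with one JSON object per line, which is what `lines=True` implies. It is not Textacy's implementation; `read_json_lines` is a hypothetical name, and the two sample records are made up, reusing only the `"text"` and `"class"` keys that this notebook reads later.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Two made-up records with the "text" and "class" keys used in this notebook.
sample_records = [
    {"text": "First example document.", "class": "1"},
    {"text": "Second example document.", "class": "2"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json")
    with open(sample_path, "w", encoding="utf-8") as handle:
        for record in sample_records:
            handle.write(json.dumps(record) + "\n")
    # Materialize the generator while the temporary file still exists.
    loaded = list(read_json_lines(sample_path))
```

Because the reader is a generator, the records can be iterated only once unless they are stored in a list.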


### 2. List of Text and Labels

Textacy provides a data structure that gives easy access to the data it reads.
In this case, the training database is in JSON format. Each entry contains a full text (in English) and the label of the class it belongs to.

In [7]:
X = []
ylabels = []

for training_record in training_records:
    
    X.append(training_record["text"]) # the features we want to analyze
    ylabels.append(training_record["class"]) # the labels, or answers, we want to test against

for test_record in test_records:
    X.append(test_record["text"])
    ylabels.append(test_record["class"])

print(len(X))
print(len(ylabels))


3682
3682
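
Before splitting the data it can help to look at how the labels are distributed, since a skewed distribution affects both the split and the weighted metrics used later. A minimal sketch with `collections.Counter`, using made-up labels in place of `ylabels`:

```python
from collections import Counter

# Made-up labels standing in for `ylabels`.
example_labels = ["1", "2", "2", "7", "7", "7"]

label_counts = Counter(example_labels)
total = sum(label_counts.values())

# Print classes from most to least frequent with their share of the data.
for label, count in label_counts.most_common():
    print("{}: {} ({:.1%})".format(label, count, count / total))
```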


#### 2.1 Create training and test sets from labelled data 

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)
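
The split above is random, so the scores reported at the end of the notebook will vary between runs. The plain-Python sketch below illustrates the idea of a seeded 70/30 shuffle split; it is not scikit-learn's implementation, and `seeded_split` is a hypothetical name.

```python
import random


def seeded_split(items, labels, test_size=0.3, seed=0):
    """Shuffle indices with a fixed seed and split them into train and test parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [items[i] for i in train_idx],
        [items[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


docs = ["doc{}".format(i) for i in range(10)]
labels = [i % 2 for i in range(10)]

# The same seed gives the same partition on every call.
first = seeded_split(docs, labels, seed=0)
second = seeded_split(docs, labels, seed=0)
```

In scikit-learn the same effect is obtained by passing `random_state` to `train_test_split`; its `stratify` parameter additionally keeps the class proportions equal in both parts.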

#### 2.2 Tokenizing data with spaCy

In [9]:
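import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# string.punctuation is a single string of ASCII punctuation characters
punctuations = string.punctuation

# Full English model ('en' is the spaCy 2.x shortcut name); not used by the tokenizer below
nlp = spacy.load('en')
stop_words = STOP_WORDS

# Blank English pipeline, used here only to tokenize
parser = English()

def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    # Lowercased lemma of each token; spaCy 2.x lemmatizes pronouns to "-PRON-", so keep their surface form
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    # Drop stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    return mytokens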
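
The filtering step of the tokenizer can be tried in isolation. The sketch below works on an already tokenized list and leaves out lemmatization, so it only illustrates the stop-word and punctuation filter; `filter_tokens` and the tiny stop-word set are hypothetical stand-ins for the spaCy objects.

```python
import string

# Hypothetical, tiny stop-word list; spaCy's STOP_WORDS is much larger.
EXAMPLE_STOP_WORDS = {"the", "a", "of", "in", "is"}


def filter_tokens(tokens, stop_words=EXAMPLE_STOP_WORDS):
    """Lowercase and strip tokens, then drop stop words and punctuation."""
    cleaned = [token.lower().strip() for token in tokens]
    return [
        token
        for token in cleaned
        # `token not in string.punctuation` is a substring test, so it also
        # drops empty strings and runs such as "()" that appear in that order.
        if token not in stop_words and token not in string.punctuation
    ]


tokens = ["The", "mutation", "of", "BRCA1", "(", "exon", "11", ")", "is", "rare", "."]
filtered = filter_tokens(tokens)
```

Note that the punctuation check compares against a string rather than a set, which is why empty tokens left over after `strip()` are removed as a side effect.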

### 3. transforms(): Extract Noun Phrases from Text

A custom scikit-learn transformer that replaces each text with its noun phrases.

In [10]:
from sklearn.base import TransformerMixin

class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Replace each text with a string made of its noun phrases
        return [transform_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn: this transformer is stateless
        return self

    def get_params(self, deep=True):
        return {}

def transform_text(text):
    if len(text) > 1:
        # Build a spaCy doc from a (text, metadata) pair
        doc = textacy.make_spacy_doc((str(text), {"class": ""}))
        nps = textacy.extract.noun_chunks(doc,
                                          drop_determiners = True,
                                          min_freq = 1)
        # Join the words of each noun phrase with underscores so it forms one whitespace-delimited unit
        doc_nps = [str(np).replace(' ','_') for np in nps]
        nps = ' '.join(doc_nps) # because CountVectorizer needs a string
    else:
        nps = ''
    return nps
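
The string manipulation at the end of `transform_text` is easy to check without spaCy. In this sketch the noun phrases are made up and `join_noun_phrases` is a hypothetical name; only the underscore-and-space joining mirrors the code above.

```python
def join_noun_phrases(noun_phrases):
    """Turn each phrase into one underscore-joined unit, then join units with spaces."""
    return " ".join(phrase.replace(" ", "_") for phrase in noun_phrases)


# Made-up noun phrases, as a noun-chunk extractor might return them.
example_phrases = ["missense mutation", "tumor suppressor gene", "patients"]
joined = join_noun_phrases(example_phrases)

# Splitting on whitespace recovers one unit per original phrase.
tokens = joined.split()
```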

### 4. Vectorizer (Bag of Words / TF-IDF)

In [11]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer)
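
As a rough illustration of what the two vectorizers compute, the sketch below builds count vectors and TF-IDF weights by hand for two tiny, made-up documents. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of scikit-learn's default settings; it is a simplified stand-in, not the library code.

```python
import math
from collections import Counter


def bag_of_words(tokenized_docs):
    """Return the sorted vocabulary and one count vector per document."""
    vocabulary = sorted({token for doc in tokenized_docs for token in doc})
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


def tfidf(count_vectors):
    """Weight counts by smoothed IDF and L2-normalize each row."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # Document frequency: number of documents in which each term occurs.
    doc_freq = [sum(1 for row in count_vectors if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    weighted = [[count * idf[j] for j, count in enumerate(row)] for row in count_vectors]
    normalized = []
    for row in weighted:
        norm = math.sqrt(sum(value * value for value in row))
        normalized.append([value / norm if norm else 0.0 for value in row])
    return normalized


docs = [["tumor", "gene", "gene"], ["tumor", "protein"]]
vocabulary, counts = bag_of_words(docs)
weights = tfidf(counts)
```

Terms that occur in every document ("tumor" here) get the lowest IDF, so TF-IDF gives them less weight than raw counts do.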

### 5. Classifier for Training (Logistic Regression)

In [12]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver='newton-cg',multi_class='multinomial')
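
With the multinomial option, logistic regression scores every class with a linear function of the features and turns the scores into probabilities with a softmax. The sketch below shows only this prediction step, with made-up weights; fitting the weights (the job of the `newton-cg` solver) is not shown, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def predict_proba(features, weights, biases):
    """One linear score per class, followed by softmax."""
    scores = [
        sum(w * x for w, x in zip(class_weights, features)) + bias
        for class_weights, bias in zip(weights, biases)
    ]
    return softmax(scores)


# Made-up parameters: 3 classes, 2 features (e.g. two term counts).
weights = [[1.0, -1.0], [0.0, 0.5], [-1.0, 1.0]]
biases = [0.0, 0.1, 0.0]
probabilities = predict_proba([2.0, 0.0], weights, biases)
predicted_class = probabilities.index(max(probabilities))
```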

### 6. Training Pipeline

In [14]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([('cleaner', predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])

pipe.fit(X_train,y_train)



Pipeline(memory=None,
     steps=[('cleaner', <__main__.predictors object at 0x7f9d28e3d6d8>), ('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ng...ty='l2', random_state=None, solver='newton-cg',
          tol=0.0001, verbose=0, warm_start=False))])

In [15]:
from joblib import dump

dump(pipe, 'pipeUSECASE.joblib')

['pipeUSECASE.joblib']

### 7. Classification

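Load the saved pipeline and predict the classes of the held-out test set

In [16]:
from joblib import load
# Assign the loaded pipeline to `pipe` so that the predictions come from the saved model
pipe = load('pipeUSECASE.joblib')

preds = pipe.predict(X_test)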

#print("results:")
#for (sample, pred) in zip(X_test, preds):
#    print (sample, ":", pred)



### 8. Evaluating the fit on the test set

In [24]:
from sklearn import metrics
# Accuracy has no averaging mode; precision, recall and F1 are averaged over classes, weighted by support
print("Logistic Regression Accuracy: ", metrics.accuracy_score(y_test, preds))
print("Logistic Regression Precision (weighted): ", metrics.precision_score(y_test, preds,
                                                                average='weighted'))
print("Logistic Regression Recall (weighted): ", metrics.recall_score(y_test, preds,
                                                          average='weighted'))
print("Logistic Regression F1 score (weighted): ", metrics.f1_score(y_test, preds,
                                                                    average='weighted'))


Logistic Regression Accuracy:  0.604524886877828
Logistic Regression Precision (weighted):  0.6090253559799256
Logistic Regression Recall (weighted):  0.604524886877828
Logistic Regression F1 score (weighted):  0.6045672925785044
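
The accuracy and the weighted recall printed above are identical. This is expected: weighting each class's recall by its share of the samples and summing gives the total number of correct predictions divided by the number of samples, which is the accuracy. The plain-Python sketch below recomputes the four scores on made-up labels to show this; `weighted_scores` is a hypothetical name and the code is a simplified stand-in for the scikit-learn functions.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return accuracy plus support-weighted precision, recall and F1."""
    support = Counter(y_true)
    n_samples = len(y_true)
    precision = recall = f1 = 0.0
    for label, n_label in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_predicted = sum(1 for p in y_pred if p == label)
        label_precision = true_pos / n_predicted if n_predicted else 0.0
        label_recall = true_pos / n_label
        denom = label_precision + label_recall
        label_f1 = 2 * label_precision * label_recall / denom if denom else 0.0
        # Each class contributes in proportion to its number of true samples.
        weight = n_label / n_samples
        precision += weight * label_precision
        recall += weight * label_recall
        f1 += weight * label_f1
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n_samples
    return accuracy, precision, recall, f1


# Made-up labels for illustration only.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]
accuracy, precision, recall, f1 = weighted_scores(y_true, y_pred)
```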


***Questions***:

1. How can this be executed on an HPC system, e.g. to train a new model for the corpus?
2. Is there a maximum length for the spaCy `nlp` object?
3. Can we use noun phrases to clean the text?
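
Regarding question 2: as far as I know, a spaCy pipeline exposes a `max_length` attribute (1,000,000 characters by default) and refuses longer texts. One possible workaround, sketched below under that assumption, is to split long texts into smaller pieces before processing them. `split_into_chunks` is a hypothetical helper that cuts at spaces; it does not respect sentence boundaries.

```python
def split_into_chunks(text, max_chars):
    """Split text into pieces of at most max_chars, breaking at spaces when possible."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to cut at the last space inside the window.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    return chunks


example_text = "alpha beta gamma delta epsilon"
chunks = split_into_chunks(example_text, max_chars=12)
```

Each chunk could then be passed to the pipeline separately and the extracted noun phrases concatenated afterwards.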