# Text Processing for Accelerator project

A simplified pipeline processing text with FastText.

* Load CPA data
* Basic text cleaning
* Vectorize (with FastText)
* Split into test and training sets
* Reduce dimension using supervised UMAP
* Predict on test set
* Use metrics for an unbalanced dataset
* View the clustering in plotly scatterplots

In [1]:
# This shouldn't be necessary if we run `pip install -e .` in the parent directory
%load_ext autoreload
%autoreload 2

In [2]:
import functools
from pprint import pprint
from time import time
from IPython.display import display, HTML
import logging
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

from sklearn import metrics
import umap

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express
from imblearn.metrics import classification_report_imbalanced

import text_processing

pd.set_option('display.max_colwidth', None)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Load FastText Pretrained

Note: This requires a fair bit of memory (peaks at about 17.5 GiB).

We recommend shutting down other kernels first. Once the model has loaded, memory usage drops again.

Loading takes a few minutes.

In [3]:
wv = text_processing.fetch_fasstext_pretrained(filepath="../../data/wiki.en.bin")

2020-12-04 14:51:07,267 - text_processing - INFO - Loading FastText pretrained from ../../data/wiki.en.bin
2020-12-04 14:55:07,910 - text_processing - INFO - Model loaded


### Load in the CPA data

In [4]:
CPA = text_processing.fetch_files()

2020-12-04 14:55:08,025 - text_processing - INFO - cleanded CPA File imported


In [5]:
CPA1 = CPA[CPA.Level.isin({5,6})][['Code','Descr_old','Descr','Category_0','Category_1','Category_2','Category_3']].copy()
df = text_processing.clean_col(CPA1, "Descr")
df.drop('Descr',axis=1,inplace=True)

df.sample(5)

2020-12-04 14:55:08,180 - text_processing - INFO - Cleaning column: Descr 


Unnamed: 0,Code,Descr_old,Category_0,Category_1,Category_2,Category_3,Descr_cleaned
4689,69.10.1,Legal services,8,M,69,69.1,legal services
3030,30.30.9,Sub-contracted operations as part of manufacturing of air and spacecraft and related machinery,2,C,30,30.3,sub-contracted operations part manufacturing air spacecraft related machinery
1606,21.20.12,"Medicaments, containing hormones, but not antibiotics",2,C,21,21.2,medicaments containing hormones
82,01.13.72,Sugar beet seeds,1,A,1,1.1,sugar beet seeds
3213,32.99.41,Cigarette lighters and other lighters; smoking pipes and cigar or cigarette holders and parts thereof,2,C,32,32.9,cigarette lighters lighters smoking pipes cigar cigarette holders parts thereof
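
The sample above suggests that `clean_col` lowercases the text, strips punctuation, and drops stopwords (the NLTK stopword download in the import cell points the same way). A minimal sketch of that kind of cleaning follows. `clean_text` and the small `STOPWORDS` set are hypothetical stand-ins, not the actual `text_processing` implementation, which also appears to drop exclusion clauses such as "but not antibiotics".

```python
import re

# Hypothetical, deliberately tiny stopword list; the notebook appears to use NLTK's.
STOPWORDS = {"a", "an", "and", "as", "for", "in", "of", "or", "other", "the", "to", "with"}

def clean_text(text):
    """Lowercase, replace punctuation (except hyphens) with spaces, drop stopwords."""
    lowered = text.lower()
    no_punct = re.sub(r"[^a-z0-9\- ]+", " ", lowered)
    tokens = [tok for tok in no_punct.split() if tok not in STOPWORDS]
    return " ".join(tokens)

print(clean_text("Sugar beet seeds"))
```

Hyphens are kept so that tokens such as `sub-contracted` survive intact, as they do in the sample output.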


### Vectorize CPA data using FastText

In [6]:
text_to_vec = functools.partial(text_processing.vectorize_text, wv)
df["Descr_cleaned_vectorized"] = df.Descr_cleaned.apply(text_to_vec)
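
How `vectorize_text` turns a cleaned description into a single vector is not shown here. One common approach, sketched below as an assumption rather than as the project's implementation, is to average the word vectors of the tokens. The `toy_vectors` dictionary and the `average_vector` function are hypothetical; a real FastText model can also build vectors for out-of-vocabulary words from subword n-grams.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for the FastText model.
toy_vectors = {
    "sugar": np.array([1.0, 0.0, 2.0]),
    "beet": np.array([0.0, 2.0, 2.0]),
    "seeds": np.array([2.0, 1.0, 2.0]),
}

def average_vector(text, vectors, dim=3):
    """Return the mean of the known word vectors, or zeros if none are known."""
    known = [vectors[tok] for tok in text.split() if tok in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

sentence_vec = average_vector("sugar beet seeds", toy_vectors)
print(sentence_vec)
```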

In [7]:
# reduce the dimension for the whole lot
df['Reduced_dim'] = text_processing.reduce_dimensionality(df.Descr_cleaned_vectorized,10)
# reduce the dimension for the whole lot using supervised learning
df['Reduced_dim_supervised'] = text_processing.reduce_dimensionality_supervised(df.Descr_cleaned_vectorized,df.Category_2)

2020-12-04 14:55:08,573 - text_processing - INFO - Now applying umap to reduce dimension


UMAP(min_dist=0.0, n_components=10, n_neighbors=10, random_state=3052528580,
     verbose=10)
Construct fuzzy simplicial set
Fri Dec  4 14:55:08 2020 Finding Nearest Neighbors
Fri Dec  4 14:55:08 2020 Building RP forest with 8 trees
Fri Dec  4 14:55:09 2020 NN descent for 12 iterations
	 0  /  12
	 1  /  12
	 2  /  12
	 3  /  12
	 4  /  12
	 5  /  12
	 6  /  12
	 7  /  12
	 8  /  12
	 9  /  12
	 10  /  12
Fri Dec  4 14:55:15 2020 Finished Nearest Neighbor Search
Fri Dec  4 14:55:17 2020 Construct embedding
	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs


2020-12-04 14:55:32,491 - text_processing - INFO - Now applying umap to reduce dimension


Fri Dec  4 14:55:32 2020 Finished embedding
UMAP(min_dist=0.0, n_components=10, random_state=3052528580, verbose=10)
Construct fuzzy simplicial set
Fri Dec  4 14:55:32 2020 Finding Nearest Neighbors
Fri Dec  4 14:55:32 2020 Building RP forest with 8 trees
Fri Dec  4 14:55:33 2020 NN descent for 12 iterations
	 0  /  12
	 1  /  12
	 2  /  12
	 3  /  12
	 4  /  12
	 5  /  12
	 6  /  12
	 7  /  12
	 8  /  12
Fri Dec  4 14:55:34 2020 Finished Nearest Neighbor Search


  return f(**kwargs)


Fri Dec  4 14:55:34 2020 Construct embedding
	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Fri Dec  4 14:55:50 2020 Finished embedding


### Split the CPA data into a training set and a test set

In [8]:
# Split dataset into training set and test set
train_set, test_set = train_test_split(df.copy(), test_size=0.2, random_state=42)

### Use the Random Forest Classifier

Reference: https://www.datacamp.com/community/tutorials/random-forests-classifier-python

We have already used supervised UMAP to reduce the dimension of the whole dataset, before splitting it into training and test sets.   
Because the reducer saw the labels of the rows that later became the test set, the accuracies below may be optimistic.   
We now fit the random forest classifier on the training set and see how it performs on the test set.
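
The leak-free ordering (split first, fit the reducer on the training rows only, then transform the test rows) can be sketched as follows. The data is synthetic, and scikit-learn's `LinearDiscriminantAnalysis` stands in for supervised UMAP purely to keep the example self-contained; it is not the reducer used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized descriptions and their category labels.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# 1. Split before any supervised fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 2. Fit the supervised reducer on the training rows only.
reducer = LinearDiscriminantAnalysis(n_components=2).fit(X_train, y_train)
Z_train = reducer.transform(X_train)
Z_test = reducer.transform(X_test)  # the reducer never sees the test labels

# 3. Train and evaluate the classifier on the reduced features.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Z_train, y_train)
test_accuracy = clf.score(Z_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

With `umap-learn` the same pattern would be a `fit` on the training vectors and labels followed by a `transform` on the test vectors.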


In [9]:
# Create a random forest classifier for each category level
for this_cat in ['Category_3','Category_2','Category_1','Category_0']:
    # Features: the supervised UMAP embedding
    X_train = train_set.Reduced_dim_supervised
    X_test = test_set.Reduced_dim_supervised

    y_train = train_set[this_cat]
    y_test = test_set[this_cat]

    vecs = np.array(list(X_train.values))
    target = np.array(list(y_train.values))

    # Train the model on the training set
    clf = RandomForestClassifier(n_estimators=100).fit(vecs, target)

    y_pred = clf.predict(np.array(list(X_test.values)))
    # Model accuracy: how often is the classifier correct?
    print(f"Accuracy for {this_cat} classification:", metrics.accuracy_score(y_test, y_pred))

Accuracy for Category_3 classification: 0.8098360655737705
Accuracy for Category_2 classification: 0.9814207650273225
Accuracy for Category_1 classification: 0.985792349726776
Accuracy for Category_0 classification: 0.994535519125683


In [10]:
CN = text_processing.fetch_CN_files()
Cat1_Cat2_map = CPA[CPA.Level==2][['Code','Parent']].rename(columns={'Code':'Category_2','Parent':'Category_1'})
CN=CN.merge(Cat1_Cat2_map, on='Category_2', how='left')


# we now set up a higher level for A10 industry levels (10 categories)
update_dict0 = {'A':'1','F':'3','J':'5', 'K':'6', 'L':'7','M':'8','N':'8'}
update_dict = {**update_dict0,**dict.fromkeys(['B','C','D','E'],'2'),**dict.fromkeys(['G','H','I'],'4'),
               **dict.fromkeys(['O','P','Q'],'9'), **dict.fromkeys(['R','S','T','U'],'10')}


CN['Category_0'] = CN.Category_1.replace(update_dict)
CN['Category_0']= CN['Category_0'].astype(str)
CN.sample(3)

2020-12-04 14:56:04,839 - text_processing - INFO - CN new Files imported and cleaned


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Category_0
5233,29269070,20.14.43,2926 90 70,10,"(excl. acrylonitrile, 1-cyanoguanidine ""dicyandiamide"", fenproporex ""INN"" and its salts, methadone ""INN""-intermediate ""4-cyano-2-dimethylamino-4,4-diphenylbutane"", alpha-Phenylacetoacetonitrile and isophthalonitrile)",Nitrile-function compounds,20,C,2
4032,27101931,19.20.26,2710 19 31,10,,Gas oils of petroleum or bituminous minerals for undergoing a specific process as defined in Additional Note 5 to chapter 27,19,C,2
7707,48051910,17.12.34,4805 19 10,10,,"Wellenstoff, uncoated, in rolls of a width > 36 cm or in square or rectangular sheets with one side > 36 cm and the other side > 15 cm in the unfolded state",17,C,2
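
The mapping built above collapses the 21 section letters (A to U) into ten groups. A quick, self-contained check that every letter is covered and that exactly ten group labels result can be sketched as below. The dictionary is copied from the cell above; `sections`, `missing` and `groups` are illustrative names.

```python
import string

update_dict0 = {'A': '1', 'F': '3', 'J': '5', 'K': '6', 'L': '7', 'M': '8', 'N': '8'}
update_dict = {
    **update_dict0,
    **dict.fromkeys(['B', 'C', 'D', 'E'], '2'),
    **dict.fromkeys(['G', 'H', 'I'], '4'),
    **dict.fromkeys(['O', 'P', 'Q'], '9'),
    **dict.fromkeys(['R', 'S', 'T', 'U'], '10'),
}

# Section letters run from A to U.
sections = string.ascii_uppercase[:21]
missing = [letter for letter in sections if letter not in update_dict]
groups = sorted(set(update_dict.values()), key=int)

print("Unmapped sections:", missing)
print("Groups:", groups)
```

Note that `Series.replace` leaves values that are not in the dictionary unchanged, so rows left without a `Category_1` by the left merge stay NaN, and `astype(str)` then turns them into the string `'nan'`.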


In [11]:
# Vectorize the CN description using FastText
CN["Descr_cleaned_vectorized"] = CN.CN_Description_cleaned.apply(
    text_to_vec
)

In [12]:
CN['Reduced_dim'] = text_processing.reduce_dimensionality(CN.Descr_cleaned_vectorized, 10)
#CN_test_df = CN[(CN.Category_2.notnull()) & (CN.CN_Level==4)]
#CN_test_df['Reduced_dim'] = text_processing.reduce_dimensionality(CN_test_df.Descr_cleaned_vectorized)

#CN_test_df['Reduced_dim_supervised'] = text_processing.train_test_umap(df.Descr_cleaned_vectorized,df.Category_2, CN_test_df.Descr_cleaned_vectorized)

2020-12-04 14:56:08,549 - text_processing - INFO - Now applying umap to reduce dimension


UMAP(min_dist=0.0, n_components=10, n_neighbors=10, random_state=3052528580,
     verbose=10)
Construct fuzzy simplicial set
Fri Dec  4 14:56:08 2020 Finding Nearest Neighbors
Fri Dec  4 14:56:08 2020 Building RP forest with 11 trees
Fri Dec  4 14:56:09 2020 NN descent for 14 iterations
	 0  /  14
	 1  /  14
	 2  /  14
	 3  /  14
	 4  /  14
	 5  /  14
	 6  /  14
	 7  /  14
	 8  /  14
	 9  /  14
	 10  /  14
Fri Dec  4 14:56:11 2020 Finished Nearest Neighbor Search
Fri Dec  4 14:56:11 2020 Construct embedding
	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Fri Dec  4 14:56:34 2020 Finished embedding
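
Note that `df['Reduced_dim']` and `CN['Reduced_dim']` come from two separate calls to `reduce_dimensionality`, and the log above shows a fresh UMAP fit for the CN data. Two independently fitted embeddings need not share a coordinate system, so a classifier trained on one may not transfer to the other. The sketch below shows the fit-once, transform-both pattern, with a small NumPy PCA standing in for UMAP; `fit_pca` and `apply_pca` are hypothetical helpers, not part of `text_processing`.

```python
import numpy as np

def fit_pca(X, n_components):
    """Learn a mean and projection matrix from X (rows are samples)."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T

def apply_pca(X, mean, components):
    """Project X using a previously fitted mean and projection matrix."""
    return (X - mean) @ components

rng = np.random.default_rng(0)
cpa_vectors = rng.normal(size=(50, 8))   # stand-in for CPA description vectors
cn_vectors = rng.normal(size=(30, 8))    # stand-in for CN description vectors

mean, components = fit_pca(cpa_vectors, n_components=3)
cpa_reduced = apply_pca(cpa_vectors, mean, components)
cn_reduced = apply_pca(cn_vectors, mean, components)  # same projection as CPA
print(cpa_reduced.shape, cn_reduced.shape)
```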


In [13]:
#Create a random forest classifier
clf=RandomForestClassifier(n_estimators=100)

this_cat = 'Category_1'

#Train the model using the training sets 
X_train = np.array(list(df.Reduced_dim.values))
y_train = np.array(list(df[this_cat].values))

clf.fit(X_train,y_train)

# we stick to the lowest level
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==10)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))
#CN_test_df['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)

Results = CN_test_df.drop(['Descr_cleaned_vectorized','Reduced_dim'],axis=1).copy()
Results['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)
display(Results.sample(5))
print(classification_report_imbalanced(y_CN_test, y_CN_pred))

Accuracy: 0.8077004442563994


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Category_0,Predicted
3775,25181000,08.11.30,2518 10 00,10,"(excl. broken or crushed dolomite for concrete aggregates, road metalling or railway or other ballast), not calcined or not sintered, incl.","Crude dolomite dolomite roughly trimmed or merely cut, by sawing or otherwise, into blocks or slabs of a rectangular ""incl. square"" shape",8,B,2,C
7640,48025590,17.12.14,4802 55 90,10,", not containing fibres obtained by a mechanical or chemi-mechanical process or of which <= 10% by weight of the total fibre content consists of such fibres, and weighing >= 80 g but <= 150 g/m², n.","Uncoated paper and paperboard, of a kind used for writing, printing or other graphic purposes, and non-perforated punchcards and punch-tape paper, in rolls of any sizee.s.",17,C,2,C
6579,39152000,38.11.55,3915 20 00,10,,"Waste, parings and scrap, of polymers of styrene",38,E,2,M
11520,73102119,25.92.11,7310 21 19,10,,"Cans of iron or steel, of a capacity of < 50 l, which are to be closed by soldering or crimping, of a kind used for preserving drink",25,C,2,C
11012,72112900,24.32.10,7211 29 00,10,"not clad, plated or coated, containing by weight >= 0,25% of carbon","Flat-rolled products of iron or non-alloy steel, of a width of < 600 mm, simply cold-rolled ""cold-reduced"",",24,C,2,C


                   pre       rec       spe        f1       geo       iba       sup

          A       0.00      0.00      1.00      0.00      0.00      0.00       594
          B       0.00      0.00      1.00      0.00      0.00      0.00       106
          C       0.91      0.89      0.08      0.90      0.26      0.07      8605
          D       0.00      0.00      1.00      0.00      0.00      0.00         2
          E       0.00      0.00      0.99      0.00      0.00      0.00       102
          F       0.00      0.00      1.00      0.00      0.00      0.00         0
          J       0.00      0.00      0.96      0.00      0.00      0.00        32
          M       0.00      0.00      0.94      0.00      0.00      0.00         5
          P       0.00      0.00      1.00      0.00      0.00      0.00         0
          R       0.00      0.00      1.00      0.00      0.00      0.00         7
          S       0.00      0.00      1.00      0.00      0.00      0.00         1

av

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
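
In the report above, class C accounts for 8,605 of the test rows and is the only class with non-zero recall, so the overall accuracy says little about the other sections. Balanced accuracy (the mean of the per-class recalls) makes that visible. A toy illustration with made-up labels, not the notebook's data:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == label] == label) for label in np.unique(y_true)]
    return float(np.mean(recalls))

# Made-up example: a classifier that always predicts the majority class.
y_true = ["C"] * 8 + ["A", "B"]
y_pred = ["C"] * 10

accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
print(accuracy, balanced_accuracy(y_true, y_pred))
```

scikit-learn provides the same quantity as `sklearn.metrics.balanced_accuracy_score`.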


In [14]:
classification_report_imbalanced

<function imblearn.metrics._classification.classification_report_imbalanced(y_true, y_pred, *, labels=None, target_names=None, sample_weight=None, digits=2, alpha=0.1)>

In [15]:
this_cat = 'Category_2'

#Train the model using the training sets 
X_train = np.array(list(df.Reduced_dim.values))
y_train = np.array(list(df[this_cat].values))

clf.fit(X_train,y_train)

# we stick to the lowest level
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==10)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 3 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))

Results = CN_test_df.drop(['Descr_cleaned_vectorized','Reduced_dim'],axis=1).copy()
Results['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)
display(Results.sample(5))
print(classification_report_imbalanced(y_CN_test, y_CN_pred))

Test 3 Accuracy: 0.0239052253014597


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Category_0,Predicted
13897,84831050,28.15.22,8483 10 50,10,,Articulated shafts,28,C,2,72
15063,87149610,30.92.30,8714 96 10,10,,Pedals for bicycles,30,C,2,27
4203,28111200,20.13.24,2811 12 00,10,,"Hydrogen cyanide ""hydrocyanic acid""",20,C,2,38
13253,84431970,28.99.14,8443 19 70,10,"(excl. machinery for printing textile materials, those for use in the production of semiconductors, ink jet printing machines, hectograph or stencil duplicating machines, addressing machines and other office printing machines of heading 8469 to 8472 and offset, flexographic, letterpress and gravure printing machinery)","Printing machinery used for printing by means of plates, cylinders and other printing components of heading 8442",28,C,2,73
592,3031200,10.20.13,0303 12 00,10,"(excl. sockeye salmon ""red salmon"")",Frozen Pacific salmon,10,C,2,14


                   pre       rec       spe        f1       geo       iba       sup

         01       0.00      0.00      1.00      0.00      0.00      0.00       417
         02       0.00      0.00      1.00      0.00      0.00      0.00        38
         03       0.00      0.00      1.00      0.00      0.00      0.00       139
         05       0.00      0.00      1.00      0.00      0.00      0.00         5
         06       0.00      0.00      1.00      0.00      0.00      0.00         5
         07       0.00      0.00      1.00      0.00      0.00      0.00        25
         08       0.00      0.00      1.00      0.00      0.00      0.00        71
         10       0.75      0.00      1.00      0.01      0.06      0.00      1748
         11       0.00      0.00      1.00      0.00      0.00      0.00       216
         12       0.00      0.00      1.00      0.00      0.00      0.00        16
         13       0.00      0.00      1.00      0.00      0.00      0.00       731
   

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
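
An accuracy of about 0.02 is low enough to suggest a systematic problem rather than a weak model. One possible contributor (an assumption, not verified here) is a mismatch in label formatting between the two sources, for example `1` in the CPA sample table versus `01` in the report above. A small check for that, using made-up label lists and a hypothetical `normalise_division` helper:

```python
def normalise_division(label):
    """Render a division label as a zero-padded two-character string."""
    return str(label).strip().zfill(2)

# Made-up labels illustrating the two formats.
train_labels = [1, 10, 69, "30"]
test_labels = ["01", "10", "28", "30"]

raw_overlap = set(map(str, train_labels)) & set(test_labels)
normalised_overlap = (
    {normalise_division(lab) for lab in train_labels}
    & {normalise_division(lab) for lab in test_labels}
)
print(sorted(raw_overlap), sorted(normalised_overlap))
```

If the raw overlap is smaller than the normalised one, some classes can never be predicted correctly until the labels are brought into a common format.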


## Try again with a Binary Classifier

In [67]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

In [None]:
#Train the model using the training sets 
X_train = np.array(list(df.Reduced_dim.values))
y_train = np.array(list(df[this_cat].values))

clf.fit(X_train,y_train)

# we stick to the lowest level
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==10)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 3 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))

In [58]:
CN_test_df[['CN_Code', 'CPA_Code', 'Category_0']].dtypes

CN_Code        int64
CPA_Code      object
Category_0    object
dtype: object

In [59]:
np.array(CN_test_df[this_cat])

array(['1', '1', '1', ..., '10', '10', '10'], dtype=object)

In [52]:
y_CN_test

array(['1', '1', '1', ..., '10', '10', '10'], dtype='<U2')

In [35]:
def binary_cl(X_train, forest_clf, cat):
    y_train=(train_set.Category_0==cat)
  #  y_probas_forest = cross_val_predict(forest_clf, X_train, y_train, cv=3,  method="predict_proba")
    y_scores = cross_val_predict(forest_clf, X_train, y_train, cv=3, method='predict_proba')
    score = roc_auc_score(y_train, y_scores[:,1])
    return score


this_cat = 'Category_0'

#Train the model using the training sets 
X_train = np.array(list(df.Reduced_dim.values))
y_train_BE = (df[this_cat]=='2')


bin_forest_clf = RandomForestClassifier(random_state=42)
bin_forest_clf.fit(X_train,y_train_BE)

CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==10)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
y_CN_pred2=bin_forest_clf.predict(CN_test)


In [46]:
#m = y_CN_pred2
m = pd.Series(data=y_CN_pred2.tolist(), index=CN_test_df.index)

In [68]:
Results = CN_test_df.drop(['Descr_cleaned_vectorized','Reduced_dim'],axis=1).copy()


m = y_CN_pred2
Results['Pred_val'] = '2'
Results['Pred_val'] = Results.Pred_val.where(m, '0')

Results['Predicted'] = pd.Series(data=y_CN_pred2.tolist(), index=CN_test_df.index)
#display(Results.sample(5))
#print(classification_report_imbalanced(y_CN_test, y_CN_pred2))
#print(classification_report_imbalanced(y_CN_test, np.array(Results.Pred_val)))

precision_recall_fscore_support(y_CN_tes,  np.array(Results.Pred_val), *, beta=1.0, labels=None, pos_label="2", average='binary', warn_for=('precision', 'recall', 'f-score'), sample_weight=None, zero_division='warn')

SyntaxError: invalid syntax (<ipython-input-68-ee3510ad45c5>, line 13)

In [70]:
precision_recall_fscore_support(y_CN_test,  np.array(Results.Pred_val), pos_label="2", average='binary')

ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
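
The `ValueError` above arises because `y_CN_test` still holds the multiclass `Category_0` labels, while `average='binary'` expects exactly two classes. One way round it, sketched here with made-up labels, is to binarize the true labels with the same rule used for the training target (`== '2'`) before scoring. The array names are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Made-up stand-ins for the true Category_0 labels and the boolean predictions.
y_true_labels = np.array(["1", "2", "2", "10", "2"])
y_pred_is_2 = np.array([False, True, False, True, True])

# Binarize the truth with the same rule used to build the training target.
y_true_is_2 = (y_true_labels == "2")

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true_is_2, y_pred_is_2, pos_label=True, average="binary")
print(precision, recall, f1)
```

In the notebook's terms this would mean scoring `y_CN_test == '2'` against `y_CN_pred2` directly, without going through the `'2'`/`'0'` string column.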

In [None]:
def binary_cl(X_train, forest_clf, cat):
    y_train=(train_set.Category_0==cat)
  #  y_probas_forest = cross_val_predict(forest_clf, X_train, y_train, cv=3,  method="predict_proba")
    y_scores = cross_val_predict(forest_clf, X_train, y_train, cv=3, method='predict_proba')
    score = roc_auc_score(y_train, y_scores[:,1])
    return score
 
X_train = train_set[['dim1','dim2']]
forest_clf = RandomForestClassifier(random_state=42)
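
The cell above stops before `binary_cl` is called. A self-contained sketch of the same one-vs-rest idea follows, on synthetic data and with the target passed in explicitly rather than read from the global `train_set`. `binary_auc` is a hypothetical variant of `binary_cl`, not the notebook's function.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def binary_auc(X, y_is_cat, clf, cv=3):
    """Cross-validated ROC AUC for a binary (one-vs-rest) target."""
    # Column 1 holds the predicted probability of the positive class.
    probas = cross_val_predict(clf, X, y_is_cat, cv=cv, method="predict_proba")
    return roc_auc_score(y_is_cat, probas[:, 1])

# Synthetic, imbalanced two-class data standing in for the reduced vectors.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, weights=[0.8, 0.2], class_sep=2.0,
                           random_state=0)
forest_clf = RandomForestClassifier(n_estimators=50, random_state=42)
score = binary_auc(X, y == 1, forest_clf)
print(f"Cross-validated ROC AUC: {score:.3f}")
```

Passing the target as an argument keeps the features and labels aligned, which matters once the function is applied to more than one dataset.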

