# Text Processing for Accelerator project

A simplified pipeline for processing text with FastText.

* Load CPA data
* Basic text cleaning
* Vectorize (with FastText)
* Split into test and training sets
* Reduce dimension using supervised UMAP
* Predict on test set
* Use metrics for an unbalanced dataset
* View the clustering in plotly scatterplots

In [1]:
# This shouldn't be necessary if we run `pip install -e .` in the parent directory
%load_ext autoreload
%autoreload 2

In [2]:
import functools
from pprint import pprint
from time import time
from IPython.display import display, HTML
import logging
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

from sklearn import metrics
import umap

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express
from imblearn.metrics import classification_report_imbalanced

import text_processing

pd.set_option('display.max_colwidth', None)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Load FastText Pretrained

Note: This requires a fair bit of memory (peaks at about 17.5 GiB).

We recommend shutting down other kernels first. Once the model has loaded, memory usage drops again.

Loading takes a few minutes (a little over four in the run logged below).

In [3]:
wv = text_processing.fetch_fasstext_pretrained(filepath="../../data/wiki.en.bin")

2020-12-03 07:44:59,528 - text_processing - INFO - Loading FastText pretrained from ../../data/wiki.en.bin
2020-12-03 07:49:14,184 - text_processing - INFO - Model loaded


### Load in the CPA data

In [4]:
CPA = text_processing.fetch_files()

2020-12-03 07:49:14,283 - text_processing - INFO - cleanded CPA File imported


In [5]:
CPA1 = CPA[CPA.Level.isin({5,6})][['Code','Descr_old','Descr','Category_0','Category_1','Category_2','Category_3']].copy()
df = text_processing.clean_col(CPA1, "Descr")
df.drop('Descr',axis=1,inplace=True)

df.sample(5)

2020-12-03 07:49:14,420 - text_processing - INFO - Cleaning column: Descr 


Unnamed: 0,Code,Descr_old,Category_0,Category_1,Category_2,Category_3,Descr_cleaned
4689,69.10.1,Legal services,8,M,69,69.1,legal services
3030,30.30.9,Sub-contracted operations as part of manufacturing of air and spacecraft and related machinery,2,C,30,30.3,sub-contracted operations part manufacturing air spacecraft related machinery
1606,21.20.12,"Medicaments, containing hormones, but not antibiotics",2,C,21,21.2,medicaments containing hormones
82,01.13.72,Sugar beet seeds,1,A,1,1.1,sugar beet seeds
3213,32.99.41,Cigarette lighters and other lighters; smoking pipes and cigar or cigarette holders and parts thereof,2,C,32,32.9,cigarette lighters lighters smoking pipes cigar cigarette holders parts thereof
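
The before/after pairs in the sample above suggest that cleaning lowercases the text, strips punctuation and drops common stopwords. The following is a minimal sketch of that idea, not the actual `text_processing.clean_col` (which is not shown here and appears to do more, for example dropping the "but not antibiotics" clause). The `clean_text` helper and the tiny stopword set are assumptions made for illustration; the notebook itself appears to rely on the NLTK stopword list.

```python
import re

# Hypothetical, hand-written stopword list for illustration only.
STOPWORDS = {"and", "or", "of", "as", "other", "the", "a", "an", "in", "for", "to", "with"}


def clean_text(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop stopwords."""
    text = text.lower()
    # Keep letters, digits, hyphens and whitespace; everything else becomes a space.
    text = re.sub(r"[^a-z0-9\-\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(clean_text("Sugar beet seeds"))
print(clean_text("Sub-contracted operations as part of manufacturing of air "
                 "and spacecraft and related machinery"))
```

On these two descriptions the sketch reproduces the `Descr_cleaned` values shown in the sample.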


### Vectorize CPA data using FastText

In [10]:
text_to_vec = functools.partial(text_processing.vectorize_text, wv)
df["Descr_cleaned_vectorized"] = df.Descr_cleaned.apply(text_to_vec)
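
One common way to turn a cleaned description into a single vector is to average the vectors of its words. Whether `text_processing.vectorize_text` does exactly this is an assumption. The sketch below uses a hypothetical `average_vector` helper and a toy two-dimensional lookup table in place of the pretrained model. A real FastText model can also build vectors for unseen words from subword n-grams, which this plain dictionary lookup does not attempt.

```python
from typing import Dict, List


def average_vector(text: str, vectors: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the vectors of the known words in `text` (zeros if none are known)."""
    words = [w for w in text.split() if w in vectors]
    if not words:
        return [0.0] * dim
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


# Toy stand-in for the pretrained word vectors (hypothetical values).
toy_vectors = {"sugar": [1.0, 0.0], "beet": [0.0, 1.0], "seeds": [1.0, 1.0]}

vec = average_vector("sugar beet seeds", toy_vectors, dim=2)
print(vec)  # each component is 2/3
```

In practice the same averaging would normally be done with NumPy arrays rather than Python lists.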

In [11]:
# reduce the dimension for the whole lot
df['Reduced_dim'] = text_processing.reduce_dimensionality(df.Descr_cleaned_vectorized,10)
# reduce the dimension for the whole lot using supervised learning
df['Reduced_dim_supervised'] = text_processing.reduce_dimensionality_supervised(df.Descr_cleaned_vectorized,df.Category_2)

2020-12-03 07:55:34,088 - text_processing - INFO - Now applying umap to reduce dimension


UMAP(min_dist=0.0, n_components=10, n_neighbors=10, random_state=3052528580,
     verbose=10)
Construct fuzzy simplicial set
Thu Dec  3 07:55:34 2020 Finding Nearest Neighbors
Thu Dec  3 07:55:34 2020 Building RP forest with 8 trees
Thu Dec  3 07:55:35 2020 NN descent for 12 iterations
	 0  /  12
	 1  /  12
	 2  /  12
	 3  /  12
	 4  /  12
	 5  /  12
	 6  /  12
	 7  /  12
	 8  /  12
	 9  /  12
	 10  /  12
Thu Dec  3 07:55:41 2020 Finished Nearest Neighbor Search
Thu Dec  3 07:55:44 2020 Construct embedding
	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs


2020-12-03 07:56:00,647 - text_processing - INFO - Now applying umap to reduce dimension


Thu Dec  3 07:56:00 2020 Finished embedding
UMAP(min_dist=0.0, n_components=10, random_state=3052528580, verbose=10)
Construct fuzzy simplicial set
Thu Dec  3 07:56:00 2020 Finding Nearest Neighbors
Thu Dec  3 07:56:00 2020 Building RP forest with 8 trees
Thu Dec  3 07:56:00 2020 NN descent for 12 iterations
	 0  /  12
	 1  /  12
	 2  /  12
	 3  /  12
	 4  /  12
	 5  /  12
	 6  /  12
	 7  /  12
	 8  /  12
Thu Dec  3 07:56:01 2020 Finished Nearest Neighbor Search




Thu Dec  3 07:56:02 2020 Construct embedding
	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Thu Dec  3 07:56:18 2020 Finished embedding


### Split the CPA data into a training set and a test set

In [12]:
# Split dataset into training set and test set
train_set, test_set = train_test_split(df.copy(), test_size=0.2, random_state=42)
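
The split above is a plain random split. Because the categories are unbalanced, rare classes can end up under-represented in the test set; a stratified split is one optional alternative (scikit-learn's `train_test_split` accepts a `stratify` argument for this). The idea can be sketched with the standard library alone. The `stratified_split` helper and the toy labels are hypothetical and only illustrate the principle.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), taking about `test_size` of each class for testing."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 90 items of a common class, 10 of a rare one.
labels = ["C"] * 90 + ["A"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] == "A" for i in test_idx), "rare items in the test set")
```

Each class contributes the same fraction to the test set, so the rare class is neither lost nor over-sampled by chance.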

### Use the Random Forest Classifier

https://www.datacamp.com/community/tutorials/random-forests-classifier-python

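We have already used supervised UMAP to reduce the dimension, fitted on the whole dataset with the `Category_2` labels.   
We then split the data into training and test sets, so the test-set embeddings were computed with knowledge of their own labels and the accuracies below are likely optimistic.   
We now fit the random forest classifier on the training set and see how it performs on the test set.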


In [13]:
# Train a random forest classifier for each category level
X_train = np.array(list(train_set.Reduced_dim_supervised.values))
X_test = np.array(list(test_set.Reduced_dim_supervised.values))

for this_cat in ['Category_3','Category_2','Category_1','Category_0']:
    y_train = np.array(list(train_set[this_cat].values))
    y_test = test_set[this_cat]

    # Fit on the training set, then predict on the held-out test set
    clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # Model accuracy: how often is the classifier correct?
    print(f"Accuracy for {this_cat} classification:", metrics.accuracy_score(y_test, y_pred))

Accuracy for Category_3 classification: 0.8098360655737705
Accuracy for Category_2 classification: 0.9814207650273225
Accuracy for Category_1 classification: 0.985792349726776
Accuracy for Category_0 classification: 0.994535519125683


In [14]:
CN = text_processing.fetch_CN_files()
Cat1_Cat2_map = CPA[CPA.Level==2][['Code','Parent']].rename(columns={'Code':'Category_2','Parent':'Category_1'})
CN=CN.merge(Cat1_Cat2_map, on='Category_2', how='left')


# we now set up a higher level: the A10 industry levels (10 categories)
update_dict0 = {'A':'1','F':'3','J':'5', 'K':'6', 'L':'7','M':'8','N':'8'}
update_dict = {**update_dict0,**dict.fromkeys(['B','C','D','E'],'2'),**dict.fromkeys(['G','H','I'],'4'),
               **dict.fromkeys(['O','P','Q'],'9'), **dict.fromkeys(['R','S','T','U'],'10')}


CN['Category_0'] = CN.Category_1.replace(update_dict)
CN['Category_0']= CN['Category_0'].astype(str)
CN.sample(3)

2020-12-03 08:00:15,221 - text_processing - INFO - CN new Files imported and cleaned


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Category_0
5233,29269070,20.14.43,2926 90 70,10,"(excl. acrylonitrile, 1-cyanoguanidine ""dicyandiamide"", fenproporex ""INN"" and its salts, methadone ""INN""-intermediate ""4-cyano-2-dimethylamino-4,4-diphenylbutane"", alpha-Phenylacetoacetonitrile and isophthalonitrile)",Nitrile-function compounds,20,C,2
4032,27101931,19.20.26,2710 19 31,10,,Gas oils of petroleum or bituminous minerals for undergoing a specific process as defined in Additional Note 5 to chapter 27,19,C,2
7707,48051910,17.12.34,4805 19 10,10,,"Wellenstoff, uncoated, in rolls of a width > 36 cm or in square or rectangular sheets with one side > 36 cm and the other side > 15 cm in the unfolded state",17,C,2
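
The dictionary built in the cell above folds section letters into ten A10 groups. A quick sanity check of that mapping is sketched below, assuming the section letters run from A to U. Values with no entry in the mapping are left unchanged by `Series.replace`, so a check like this can catch gaps early.

```python
import string

# Same mapping as in the cell above.
update_dict0 = {'A': '1', 'F': '3', 'J': '5', 'K': '6', 'L': '7', 'M': '8', 'N': '8'}
update_dict = {
    **update_dict0,
    **dict.fromkeys(['B', 'C', 'D', 'E'], '2'),
    **dict.fromkeys(['G', 'H', 'I'], '4'),
    **dict.fromkeys(['O', 'P', 'Q'], '9'),
    **dict.fromkeys(['R', 'S', 'T', 'U'], '10'),
}

sections = string.ascii_uppercase[:21]  # 'A' .. 'U' (assumed full set of sections)
missing = [s for s in sections if s not in update_dict]
groups = sorted(set(update_dict.values()), key=int)

print("unmapped sections:", missing)
print("A10 groups:", groups)
```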


In [15]:
# Vectorize the CN description using FastText
CN["Descr_cleaned_vectorized"] = CN.CN_Description_cleaned.apply(
    text_to_vec
)

In [17]:
CN['Reduced_dim'] = text_processing.reduce_dimensionality(CN.Descr_cleaned_vectorized, 10)
#CN_test_df = CN[(CN.Category_2.notnull()) & (CN.CN_Level==4)]
#CN_test_df['Reduced_dim'] = text_processing.reduce_dimensionality(CN_test_df.Descr_cleaned_vectorized)

#CN_test_df['Reduced_dim_supervised'] = text_processing.train_test_umap(df.Descr_cleaned_vectorized,df.Category_2, CN_test_df.Descr_cleaned_vectorized)

2020-12-03 08:01:08,992 - text_processing - INFO - Now applying umap to reduce dimension


UMAP(min_dist=0.0, n_components=10, n_neighbors=10, random_state=3052528580,
     verbose=10)
Construct fuzzy simplicial set
Thu Dec  3 08:01:09 2020 Finding Nearest Neighbors
Thu Dec  3 08:01:09 2020 Building RP forest with 11 trees
Thu Dec  3 08:01:09 2020 NN descent for 14 iterations
	 0  /  14
	 1  /  14
	 2  /  14
	 3  /  14
	 4  /  14
	 5  /  14
	 6  /  14
	 7  /  14
	 8  /  14
	 9  /  14
	 10  /  14
Thu Dec  3 08:01:12 2020 Finished Nearest Neighbor Search
Thu Dec  3 08:01:12 2020 Construct embedding
	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Thu Dec  3 08:01:35 2020 Finished embedding


In [22]:
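# Create a random forest classifier
clf = RandomForestClassifier(n_estimators=100)

this_cat = 'Category_1'
# Train the model on the full CPA data.
# Note: df.Reduced_dim and CN.Reduced_dim come from two separate UMAP fits
# (cells 11 and 17), so their coordinates are not guaranteed to be comparable.
X_train = np.array(list(df.Reduced_dim.values))
y_train = np.array(list(df[this_cat].values))

clf.fit(X_train, y_train)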

# try different tests
######## test 1
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==4)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 1 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))


########## test 2
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==7)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 2 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))
#CN_test_df['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)

########## test 3
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==10)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 3 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))
#CN_test_df['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)

######## test 4
CN_test_df= CN[(CN.Category_2.notnull())]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 4 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))

Results = CN_test_df.drop(['Descr_cleaned_vectorized','Reduced_dim'],axis=1).copy()
Results['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)
display(Results.sample(5))
print(classification_report_imbalanced(y_CN_test, y_CN_pred))

Test 1 Accuracy: 0.6519337016574586
Test 2 Accuracy: 0.7627212389380531
Test 3 Accuracy: 0.7826316902898244
Test 4 Accuracy: 0.775413176364048


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Category_0,Predicted
6446,39041000,20.16.30,3904 10,7,not mixed with any other substances,"Poly""vinyl chloride"", in primary forms,",20,C,2,C
69,1051500,01.47.14,0105 15 00,10,,"Live domestic guinea fowls, weighing <= 185 g",1,A,1,M
2946,20071091,10.86.10,2007 10 91,10,"(excl. with a sugar content of > 13% by weight), whether or not containing added sugar or other sweetening matter,","Jams, jellies, marmalades, purée and pastes, of guavas, mangoes, mangosteens, papaws ""papayas"", tamarinds, cashew apples, lychees, jackfruit, sapodillo plums, passion fruit, carambola, pitahaya, obtained by cooking put up for retail sale as infant food or for dietetic purposes, in containers of a net weight of <= 250 g",10,C,2,C
9944,63022100,13.92.12,6302 21 00,10,(excl. knitted or crocheted),Printed bedlinen of cotton,13,C,2,C
10695,71039900,32.12.11,7103 99 00,10,"(excl. precious and semi-precious stones, simply sawn or roughly shaped, diamonds, rubies, sapphires and emeralds, imitation precious stones and semi-precious stones), whether or not graded, not strung, mounted or set, precious and semi-precious stones, worked, ungraded, temporarily strung for convenience of transport","Precious and semi-precious stones, worked but",32,C,2,C


                   pre       rec       spe        f1       geo       iba       sup

          A       0.00      0.00      0.99      0.00      0.00      0.00       845
          B       0.00      0.00      1.00      0.00      0.00      0.00       209
          C       0.90      0.86      0.14      0.88      0.35      0.13     11944
          D       0.00      0.00      1.00      0.00      0.00      0.00         6
          E       0.01      0.01      0.99      0.01      0.08      0.01       167
          F       0.00      0.00      1.00      0.00      0.00      0.00         0
          G       0.00      0.00      1.00      0.00      0.00      0.00         0
          J       0.00      0.00      0.94      0.00      0.00      0.00        51
          M       0.00      0.00      0.94      0.00      0.00      0.00         7
          P       0.00      0.00      1.00      0.00      0.00      0.00         0
          R       0.00      0.00      1.00      0.00      0.00      0.00        19
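
The report columns are precision (`pre`), recall (`rec`), specificity (`spe`), F1 score (`f1`), the geometric mean of recall and specificity (`geo`), index balanced accuracy (`iba`) and support (`sup`). The sketch below shows why accuracy alone can mislead on unbalanced data. It uses a hypothetical `one_vs_rest` helper on toy labels, not the CN data, and computes only the three simplest columns.

```python
from math import sqrt


def one_vs_rest(y_true, y_pred, label):
    """Recall, specificity and their geometric mean for one class against the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)
    rec = tp / (tp + fn) if tp + fn else 0.0
    spe = tn / (tn + fp) if tn + fp else 0.0
    return {"rec": rec, "spe": spe, "geo": sqrt(rec * spe)}


# Toy data: 8 items of class "C", 2 of class "A", and a model that always answers "C".
y_true = ["C"] * 8 + ["A"] * 2
y_pred = ["C"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report_c = one_vs_rest(y_true, y_pred, "C")
report_a = one_vs_rest(y_true, y_pred, "A")
print(accuracy)   # 0.8
print(report_c)   # perfect recall, zero specificity, so geo is 0.0
print(report_a)   # zero recall, perfect specificity, so geo is 0.0

# Row C of the report above: rec 0.86 and spe 0.14 give a geo of about 0.35.
geo_c = round(sqrt(0.86 * 0.14), 2)
print(geo_c)
```

The toy model scores 80% accuracy while its geometric mean is zero for both classes, which is the same pattern the report shows for the dominant class C.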
   




In [23]:
# Create a random forest classifier
clf=RandomForestClassifier(n_estimators=100)

this_cat = 'Category_0'
df['Category_0'] = df['Category_0'].astype(str)
#Train the model using the training sets 
X_train = np.array(list(df.Reduced_dim.values))
y_train = np.array(list(df[this_cat].values))

clf.fit(X_train,y_train)

# try different tests
######## test 1
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==4)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 1 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))


########## test 2
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==7)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 2 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))
#CN_test_df['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)

########## test 3
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==10)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 3 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))
#CN_test_df['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)

######## test 4
CN_test_df= CN[(CN.Category_2.notnull())]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 4 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))

Results = CN_test_df.drop(['Descr_cleaned_vectorized','Reduced_dim'],axis=1).copy()
Results['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)
display(Results.sample(5))
print(classification_report_imbalanced(y_CN_test, y_CN_pred))

Test 1 Accuracy: 0.7900552486187845
Test 2 Accuracy: 0.8349004424778761
Test 3 Accuracy: 0.8409138988787814
Test 4 Accuracy: 0.8385782205116595


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Category_0,Predicted
12246,82033000,25.73.30,8203 30 00,10,,"Metal-cutting shears and similar hand tools, of base metal",25,C,2,2
350,2076031,10.12.10,0207 60 31,10,,"Fresh, chilled or frozen whole wings of domestic guinea fowls",10,C,2,2
913,3054980,10.20.24,0305 49 80,10,"(excl. offal, Pacific salmon, Atlantic salmon, Danube salmon, herring, lesser or Greenland halibut, Atlantic halibut, mackerel, trout, tilapia, catfish, carp, eels, Nile perch and snakeheads)","Smoked fish, incl. fillets",10,C,2,2
11670,73211900,27.52.11,7321 19,7,"(excl. liquid or gaseous fuel, and large cooking appliances)","Appliances for baking, frying, grilling and cooking and plate warmers, for domestic use, of iron or steel, for solid fuel or other non-electric source of energy",27,C,2,2
5976,35029020,20.59.60,3502 90 20,10,"(excl. egg albumin and milk albumin [incl. concentrates of two or more whey proteins containing by weight > 80% whey proteins, calculated on the dry matter])","Albumins, unfit, or to be rendered unfit, for human consumption",20,C,2,2


                   pre       rec       spe        f1       geo       iba       sup

          1       0.00      0.00      1.00      0.00      0.00      0.00       845
         10       0.00      0.00      1.00      0.00      0.00      0.00        22
          2       0.93      0.90      0.04      0.91      0.18      0.04     12326
          3       0.00      0.00      1.00      0.00      0.00      0.00         0
          5       0.00      0.00      0.96      0.00      0.00      0.00        51
          8       0.00      0.00      0.95      0.00      0.00      0.00         7

avg / total       0.86      0.84      0.10      0.85      0.17      0.03     13251
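
A useful reference point for the accuracies above is the majority-class baseline: the score obtained by always predicting the most frequent class. The sketch below computes it from the `sup` column of the report above; the variable names are chosen here for illustration.

```python
# Support per Category_0 class, copied from the report above.
support = {"1": 845, "10": 22, "2": 12326, "3": 0, "5": 51, "8": 7}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(total)                        # 13251, matching the avg / total row
print(majority_class)               # "2"
print(round(baseline_accuracy, 4))  # about 0.93
```

Always predicting class 2 would score about 0.93 on these rows, which is higher than the Test 4 accuracy of about 0.84 reported above. This suggests that the classifier is not yet extracting useful signal from the CN embeddings.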





Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Category_0,Predicted
8301,52095200,13.20.20,5209 52 00,10,,"Woven fabrics of cotton, containing >= 85% cotton by weight and weighing > 200 g/m², in three-thread or four-thread twill, incl. cross twill, printed",13,C,2,5
11636,73182200,25.94.12,7318 22 00,10,(excl. spring washers and other lock washers),Washers of iron or steel,25,C,2,8
12317,82078019,25.73.40,8207 80 19,10,,"Tools for turning, interchangeable, for working metal, with working part of materials other than sintered metal carbide or cermets",25,C,2,2
13899,84832000,28.15.23,8483 20,7,,"Bearing housings, incorporating ball or roller bearings, for machinery",28,C,2,2
1175,4031053,10.51.52,0403 10 53,10,", whether or not concentrated,","Yogurt flavoured or with added fruit, nuts or cocoa, sweetened, in solid forms, of a milkfat content by weight of > 1,5% but <= 27%",10,C,2,2


                   pre       rec       spe        f1       geo       iba       sup

          1       0.00      0.00      1.00      0.00      0.00      0.00       845
         10       0.00      0.00      1.00      0.00      0.00      0.00        22
          2       0.92      0.90      0.02      0.91      0.14      0.02     12326
          3       0.00      0.00      1.00      0.00      0.00      0.00         0
          4       0.00      0.00      1.00      0.00      0.00      0.00         0
          5       0.00      0.00      0.97      0.00      0.00      0.00        51
          8       0.00      0.00      0.94      0.00      0.00      0.00         7
          9       0.00      0.00      1.00      0.00      0.00      0.00         0

avg / total       0.86      0.84      0.09      0.85      0.13      0.02     13251



