# Text Processing for Accelerator project

A simplified pipeline processing text with FastText.

* Load CPA data
* Basic text cleaning
* Vectorize (with FastText)
* Split into test and training sets
* Reduce dimension using supervised UMAP
* Predict on test set
* Use metrics for an unbalanced dataset
* View the clustering in plotly scatterplots

In [1]:
# This shouldn't be necessary if we run `pip install -e .` in the parent directory
%load_ext autoreload
%autoreload 2

In [2]:
import functools
from pprint import pprint
from time import time
from IPython.display import display, HTML
import logging
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

from sklearn import metrics
import umap

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express
from imblearn.metrics import classification_report_imbalanced

import text_processing

pd.set_option('display.max_colwidth', None)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Load FastText Pretrained

Note: this needs a lot of memory (usage peaks at about 17.5 GiB) and takes a few minutes to load.

We recommend shutting down other kernels first. Once the model has loaded, memory usage drops again.

In [3]:
wv = text_processing.fetch_fasstext_pretrained(filepath="../../data/wiki.en.bin")

2020-11-28 18:03:34,036 - text_processing - INFO - Loading FastText pretrained from ../../data/wiki.en.bin
2020-11-28 18:07:37,959 - text_processing - INFO - Model loaded


### Load in the CPA data

In [4]:
CPA = text_processing.fetch_files()

2020-11-28 18:07:38,074 - text_processing - INFO - cleanded CPA File imported


In [5]:
CPA1 = CPA[CPA.Level.isin({5,6})][['Code','Descr_old','Descr','Category_0','Category_1','Category_2','Category_3']].copy()
df = text_processing.clean_col(CPA1, "Descr")
df.drop('Descr',axis=1,inplace=True)

df.sample(5)

2020-11-28 18:07:38,184 - text_processing - INFO - Cleaning column: Descr 


Unnamed: 0,Code,Descr_old,Category_0,Category_1,Category_2,Category_3,Descr_cleaned
4689,69.10.1,Legal services,8,M,69,69.1,legal services
3030,30.30.9,Sub-contracted operations as part of manufacturing of air and spacecraft and related machinery,2,C,30,30.3,sub-contracted operations part manufacturing air spacecraft related machinery
1606,21.20.12,"Medicaments, containing hormones, but not antibiotics",2,C,21,21.2,medicaments containing hormones
82,01.13.72,Sugar beet seeds,1,A,1,1.1,sugar beet seeds
3213,32.99.41,Cigarette lighters and other lighters; smoking pipes and cigar or cigarette holders and parts thereof,2,C,32,32.9,cigarette lighters lighters smoking pipes cigar cigarette holders parts thereof
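
The implementation of `text_processing.clean_col` is not shown here. Judging only from the sample rows above, it lowercases the text, strips punctuation and drops stopwords. The sketch below illustrates that kind of cleaning under those assumptions; `basic_clean` and `TOY_STOPWORDS` are hypothetical names, and the stopword list is a tiny stand-in for the NLTK list that the import log mentions.

```python
import re

# Tiny illustrative stopword list (assumption); the real pipeline appears to
# use the much longer NLTK English stopword list.
TOY_STOPWORDS = {"as", "of", "and", "or", "other", "the"}


def basic_clean(text, stopwords=TOY_STOPWORDS):
    """Lowercase, replace punctuation (except hyphens) with spaces, drop stopwords."""
    text = text.lower()
    # Keep letters, digits, whitespace and hyphens; everything else becomes a space.
    text = re.sub(r"[^a-z0-9\s-]", " ", text)
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


cleaned = basic_clean(
    "Sub-contracted operations as part of manufacturing of air and spacecraft "
    "and related machinery"
)
print(cleaned)
```

On this input the sketch reproduces the `Descr_cleaned` value shown in the sample row for code 30.30.9. It does not attempt the exclusion handling visible in the 21.20.12 row, where "but not antibiotics" was removed.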


### Vectorize CPA data using FastText

In [6]:
text_to_vec = functools.partial(text_processing.vectorize_text, wv)
df["Descr_cleaned_vectorized"] = df.Descr_cleaned.apply(
    text_to_vec
)
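
`text_processing.vectorize_text` is not shown in this notebook. A common way to turn a cleaned description into a single vector is to average the vectors of its words, and the sketch below assumes that approach. `mean_pool_vectorize` is a hypothetical name, and a small dictionary stands in for the FastText model so the example runs without the 17.5 GiB download.

```python
import functools

import numpy as np


def mean_pool_vectorize(word_vectors, text, dim):
    """Average the vectors of the known words in `text`.

    `word_vectors` is any mapping from word to 1-D array (an assumption made
    for this sketch). Returns a zero vector if no word is known.
    """
    vecs = [word_vectors[tok] for tok in text.split() if tok in word_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)


# Toy 2-dimensional "embeddings", made up for illustration only.
toy_vectors = {
    "legal": np.array([1.0, 0.0]),
    "services": np.array([0.0, 1.0]),
}

# Bind the lookup table once, mirroring the functools.partial call in the cell above.
text_to_vec_toy = functools.partial(mean_pool_vectorize, toy_vectors, dim=2)
vec = text_to_vec_toy("legal services")
print(vec)
```

A real FastText model can also build vectors for unseen words from subword n-grams, so the membership check used here would normally not be needed with it.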

In [7]:
# reduce the dimension for the whole lot
df['Reduced_dim'] = text_processing.reduce_dimensionality(df.Descr_cleaned_vectorized)
# reduce the dimension for the whole lot using supervised learning
df['Reduced_dim_supervised'] = text_processing.reduce_dimensionality_supervised(df.Descr_cleaned_vectorized,df.Category_2)

2020-11-28 18:07:38,582 - text_processing - INFO - Now applying umap to reduce dimension


UMAP(min_dist=0.0, n_components=10, n_neighbors=10, random_state=3052528580,
     verbose=10)
Construct fuzzy simplicial set
Sat Nov 28 18:07:38 2020 Finding Nearest Neighbors
Sat Nov 28 18:07:38 2020 Building RP forest with 8 trees
Sat Nov 28 18:07:39 2020 NN descent for 12 iterations
	 0  /  12
	 1  /  12
	 2  /  12
	 3  /  12
	 4  /  12
	 5  /  12
	 6  /  12
	 7  /  12
	 8  /  12
	 9  /  12
	 10  /  12
Sat Nov 28 18:07:45 2020 Finished Nearest Neighbor Search
Sat Nov 28 18:07:47 2020 Construct embedding
	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs


2020-11-28 18:08:02,866 - text_processing - INFO - Now applying umap to reduce dimension


Sat Nov 28 18:08:02 2020 Finished embedding
UMAP(min_dist=0.0, n_components=10, random_state=3052528580, verbose=10)
Construct fuzzy simplicial set
Sat Nov 28 18:08:02 2020 Finding Nearest Neighbors
Sat Nov 28 18:08:02 2020 Building RP forest with 8 trees
Sat Nov 28 18:08:03 2020 NN descent for 12 iterations
	 0  /  12
	 1  /  12
	 2  /  12
	 3  /  12
	 4  /  12
	 5  /  12
	 6  /  12
	 7  /  12
	 8  /  12
Sat Nov 28 18:08:04 2020 Finished Nearest Neighbor Search


Sat Nov 28 18:08:05 2020 Construct embedding
	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Sat Nov 28 18:08:21 2020 Finished embedding


### Split the CPA data into a training set and a test set

In [8]:
# Split dataset into training set and test set
train_set, test_set = train_test_split(df.copy(), test_size=0.2, random_state=42)

### Use the Random Forest Classifier

https://www.datacamp.com/community/tutorials/random-forests-classifier-python

We have already used supervised UMAP to reduce the dimension of the full dataset; note that this was done before the split, so the test rows also took part in fitting the embedding.   
We then split our data into training and test datasets.   
We now train a random forest classifier on the training set and see how well it predicts the test set.
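
One way to keep the test rows out of the embedding is to fit the reducer on the training rows only and then project the test rows with it. The sketch below shows that ordering for any object with scikit-learn style `fit` and `transform` methods. `reduce_without_leakage` and `MeanCentering` are hypothetical names; `MeanCentering` is a toy stand-in so the example runs quickly, and in this notebook the reducer would presumably be a `umap.UMAP` instance.

```python
import numpy as np


def reduce_without_leakage(reducer, X_train, y_train, X_test):
    """Fit `reducer` on the training rows only, then project both sets."""
    reducer.fit(X_train, y_train)
    return reducer.transform(X_train), reducer.transform(X_test)


class MeanCentering:
    """Toy stand-in reducer (assumption): subtracts the training mean."""

    def fit(self, X, y=None):
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_


X_train = np.array([[0.0, 2.0], [2.0, 4.0]])
X_test = np.array([[10.0, 10.0]])

train_emb, test_emb = reduce_without_leakage(
    MeanCentering(), X_train, [0, 1], X_test
)
print(train_emb)
print(test_emb)
```

The test row is shifted by the mean of the training rows alone, which is the property that matters: nothing about the test set influences the fitted transform.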


In [9]:
# Train and evaluate a random forest classifier for each category level
for this_cat in ['Category_3','Category_2','Category_1','Category_0']:
    X_train = train_set.Reduced_dim_supervised
    X_test = test_set.Reduced_dim_supervised

    y_train = train_set[this_cat]
    y_test = test_set[this_cat]

    # Each row holds one embedding vector, so stack the rows into a 2-D array
    vecs = np.array(list(X_train.values))
    target = np.array(list(y_train.values))

    # Train the model using the training set
    clf = RandomForestClassifier(n_estimators=100).fit(vecs, target)

    y_pred = clf.predict(np.array(list(X_test.values)))
    # Model accuracy: how often is the classifier correct?
    print(f"Accuracy for {this_cat} classification:", metrics.accuracy_score(y_test, y_pred))

Accuracy for Category_3 classification: 0.8098360655737705
Accuracy for Category_2 classification: 0.9814207650273225
Accuracy for Category_1 classification: 0.985792349726776
Accuracy for Category_0 classification: 0.994535519125683


In [54]:
CN = text_processing.fetch_CN_files()
Cat1_Cat2_map = CPA[CPA.Level==2][['Code','Parent']].rename(columns={'Code':'Category_2','Parent':'Category_1'})
CN=CN.merge(Cat1_Cat2_map, on='Category_2', how='left')


# We now set up a higher level for the A10 industry levels (10 categories)
update_dict0 = {'A':'1','F':'3','J':'5', 'K':'6', 'L':'7','M':'8','N':'8'}
update_dict = {**update_dict0,**dict.fromkeys(['B','C','D','E'],'2'),**dict.fromkeys(['G','H','I'],'4'),
               **dict.fromkeys(['O','P','Q'],'9'), **dict.fromkeys(['R','S','T','U'],'10')}


CN['Category_0'] = CN.Category_1.replace(update_dict)
CN['Category_0']= CN['Category_0'].astype(str)
CN.sample(3)

2020-11-28 18:49:55,918 - text_processing - INFO - CN new Files imported and cleaned


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Category_0
11068,72149979,24.10.62,7214 99 79,10,"(excl. bars and rods with indentations, ribs, grooves or other deformations produced during the rolling process, twisted after rolling, and of free-cutting steel)","Bars and rods of iron or non-alloy steel, only hot-rolled, only hot-drawn or only hot-extruded, containing by weight >= 0,25% carbon, of circular cross-section measuring < 80 mm in diameter",24,C,2
9982,63031200,13.92.15,6303 12,7,(excl. awnings and sunblinds),"Curtains, incl. drapes, and interior blinds, curtain or bed valances of synthetic fibres, knitted or crocheted",13,C,2
563,3028200,03.00.21,0302 82,7,,"Fresh or chilled, rays and skates ""Rajidae""",3,A,1
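
The mapping from CPA section letters to the ten higher-level codes can also be written group by group, which makes it easy to check that every section from A to U is covered. This is a sketch that builds the same dictionary as `update_dict` above; `A10_GROUPS` and `section_to_a10` are names introduced here for illustration.

```python
import string

# Higher-level code -> section letters, copied from update_dict above.
A10_GROUPS = {
    "1": "A",
    "2": "BCDE",
    "3": "F",
    "4": "GHI",
    "5": "J",
    "6": "K",
    "7": "L",
    "8": "MN",
    "9": "OPQ",
    "10": "RSTU",
}

# Invert the grouping into a section -> code lookup.
section_to_a10 = {
    section: code
    for code, sections in A10_GROUPS.items()
    for section in sections
}

# Sanity check: which of the sections A..U have no code assigned?
missing = set(string.ascii_uppercase[:21]) - set(section_to_a10)
print(section_to_a10["C"], section_to_a10["N"], missing)
```

With pandas, `Series.replace` leaves any value that is not in the dictionary unchanged, while `Series.map` would turn it into `NaN`, so the two behave differently for rows where `Category_1` is missing or unexpected.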


In [55]:
# Vectorize the CN description using FastText
CN["Descr_cleaned_vectorized"] = CN.CN_Description_cleaned.apply(
    text_to_vec
)

In [56]:
CN['Reduced_dim'] = text_processing.reduce_dimensionality(CN.Descr_cleaned_vectorized)
#CN_test_df = CN[(CN.Category_2.notnull()) & (CN.CN_Level==4)]
#CN_test_df['Reduced_dim'] = text_processing.reduce_dimensionality(CN_test_df.Descr_cleaned_vectorized)

#CN_test_df['Reduced_dim_supervised'] = text_processing.train_test_umap(df.Descr_cleaned_vectorized,df.Category_2, CN_test_df.Descr_cleaned_vectorized)

2020-11-28 18:50:07,381 - text_processing - INFO - Now applying umap to reduce dimension


UMAP(min_dist=0.0, n_components=10, n_neighbors=10, random_state=3052528580,
     verbose=10)
Construct fuzzy simplicial set
Sat Nov 28 18:50:07 2020 Finding Nearest Neighbors
Sat Nov 28 18:50:07 2020 Building RP forest with 11 trees
Sat Nov 28 18:50:08 2020 NN descent for 14 iterations
	 0  /  14
	 1  /  14
	 2  /  14
	 3  /  14
	 4  /  14
	 5  /  14
	 6  /  14
	 7  /  14
	 8  /  14
	 9  /  14
	 10  /  14
Sat Nov 28 18:50:10 2020 Finished Nearest Neighbor Search
Sat Nov 28 18:50:10 2020 Construct embedding
	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Sat Nov 28 18:50:33 2020 Finished embedding


In [57]:
# Create a random forest classifier
clf=RandomForestClassifier(n_estimators=100)

this_cat = 'Category_1'
# Train the model on the full CPA dataset
X_train = np.array(list(df.Reduced_dim.values))
y_train = np.array(list(df[this_cat].values))

clf.fit(X_train,y_train)

# try different tests
######## test 1
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==4)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 1 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))


########## test 2
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==7)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 2 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))
#CN_test_df['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)

########## test 3
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==10)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 3 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))
#CN_test_df['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)

######## test 4
CN_test_df= CN[(CN.Category_2.notnull())]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 4 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))

Test 1 Accuracy: 0.6740331491712708
Test 2 Accuracy: 0.7842920353982301
Test 3 Accuracy: 0.8087581975883225
Test 4 Accuracy: 0.8002414912082108
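
The four tests above repeat the same steps for different values of `CN_Level`. A minimal sketch of how the per-level accuracy could be computed in one pass is shown below. `accuracy_by_level` is a hypothetical helper and the labels are made-up toy data, not results from this notebook.

```python
import numpy as np


def accuracy_by_level(levels, y_true, y_pred):
    """Return accuracy for each distinct level, plus the overall accuracy."""
    levels = np.asarray(levels)
    correct = np.asarray(y_true) == np.asarray(y_pred)
    results = {
        int(level): float(correct[levels == level].mean())
        for level in np.unique(levels)
    }
    results["all"] = float(correct.mean())
    return results


# Toy data for illustration only.
toy_levels = [4, 4, 7, 7, 10, 10, 10, 10]
toy_true = ["A", "C", "C", "C", "C", "A", "C", "C"]
toy_pred = ["C", "C", "C", "C", "C", "C", "C", "A"]

scores = accuracy_by_level(toy_levels, toy_true, toy_pred)
print(scores)
```

In the notebook this would be called once with `CN_test_df.CN_Level`, the true categories and the predictions for all rows where `Category_2` is not null.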


In [60]:
# Create a random forest classifier
clf=RandomForestClassifier(n_estimators=100)

this_cat = 'Category_0'
df['Category_0'] = df['Category_0'].astype(str)
# Train the model on the full CPA dataset
X_train = np.array(list(df.Reduced_dim.values))
y_train = np.array(list(df[this_cat].values))

clf.fit(X_train,y_train)

# try different tests
######## test 1
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==4)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 1 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))


########## test 2
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==7)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 2 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))
#CN_test_df['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)

########## test 3
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==10)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 3 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))
#CN_test_df['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)

######## test 4
CN_test_df= CN[(CN.Category_2.notnull())]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 4 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))

Test 1 Accuracy: 0.7734806629834254
Test 2 Accuracy: 0.8313053097345132
Test 3 Accuracy: 0.8354135815527819
Test 4 Accuracy: 0.8334465323371821


In [61]:
Results = CN_test_df.drop(['Descr_cleaned_vectorized','Reduced_dim'],axis=1).copy()
Results['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)
Results.sample(5)


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Category_0,Predicted
5837,33019021,20.53.10,3301 90 21,10,,Extracted oleoresins of liquorice and hops,20,C,2,2
14703,85459010,27.90.13,8545 90 10,10,,"Heating resistors for electrical purposes, of graphite or other carbon",27,C,2,2
7839,48191000,17.21.13,4819 10 00,10,,"Cartons, boxes and cases, of corrugated paper or paperboard",17,C,2,2
2156,12024200,10.39.25,1202 42 00,10,"(excl. seed for sowing, roasted or otherwise cooked) not broken","Groundnuts, shelled, whether or",10,C,2,2
3519,22089011,11.01.10,2208 90 11,10,,"Arrack, in containers holding <= 2 l",11,C,2,2


In [63]:
print(classification_report_imbalanced(y_CN_test, y_CN_pred))

                   pre       rec       spe        f1       geo       iba       sup

          1       0.00      0.00      1.00      0.00      0.00      0.00       845
         10       0.00      0.00      1.00      0.00      0.00      0.00        22
          2       0.92      0.90      0.02      0.91      0.14      0.02     12326
          3       0.00      0.00      1.00      0.00      0.00      0.00         0
          4       0.00      0.00      1.00      0.00      0.00      0.00         0
          5       0.00      0.00      0.96      0.00      0.00      0.00        51
          8       0.00      0.00      0.95      0.00      0.00      0.00         7
          9       0.00      0.00      1.00      0.00      0.00      0.00         0

avg / total       0.86      0.83      0.09      0.85      0.13      0.02     13251
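
The columns of the report are precision, recall, specificity, F1, the geometric mean of recall and specificity, the index balanced accuracy, and the support. The sketch below computes recall, specificity and their geometric mean for one class against the rest, to show why a class that is never predicted correctly scores zero even when overall accuracy looks high. `one_vs_rest_scores` is a hypothetical helper and the labels are toy data.

```python
import math


def one_vs_rest_scores(y_true, y_pred, label):
    """Recall, specificity and their geometric mean for `label` vs. the rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return recall, specificity, math.sqrt(recall * specificity)


# Toy data: eight rows of class "2", two of class "1", everything predicted as "2".
toy_true = ["2"] * 8 + ["1"] * 2
toy_pred = ["2"] * 10

accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
majority = one_vs_rest_scores(toy_true, toy_pred, "2")
minority = one_vs_rest_scores(toy_true, toy_pred, "1")
print(accuracy, majority, minority)
```

Here accuracy is 0.8, yet the geometric mean is 0 for both classes: the majority class has perfect recall but zero specificity, and the minority class has the reverse. The report above shows the same pattern, with class 2 dominating the support.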



