# Text Processing for Accelerator project

A simplified pipeline processing text with FastText.

* Load CPA data
* Basic text cleaning
* Vectorize (with FastText)
* Split into test and training sets
* Reduce dimension using supervised UMAP
* Predict on test set
* Use metrics for an unbalanced dataset
* View the clustering in plotly scatterplots

In [1]:
# this bit shouldn't be necessary if we pip install -e .   in the parent directory
%load_ext autoreload
%autoreload 2

In [2]:
import functools
from pprint import pprint
from time import time
from IPython.display import display, HTML
import logging
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

from sklearn import metrics
import umap

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express
from imblearn.metrics import classification_report_imbalanced

import text_processing

pd.set_option('display.max_colwidth', None)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Load FastText Pretrained

Note: Loading requires a fair amount of memory (peaking at about 17.5 GiB), so shut down other kernels first. Memory usage drops again once the model has loaded.

Loading takes a few minutes.

In [3]:
wv = text_processing.fetch_fasstext_pretrained(filepath="../../data/wiki.en.bin")

2021-04-12 15:49:13,122 - text_processing - INFO - Loading FastText pretrained from ../../data/wiki.en.bin
2021-04-12 15:53:14,035 - text_processing - INFO - Model loaded


### Load in the CPA data

In [4]:
CPA = text_processing.fetch_files()

2021-04-12 15:53:14,198 - text_processing - INFO - cleanded CPA File imported


In [5]:
CPA1 = CPA[CPA.Level.isin({5,6})][['Code','Descr_old','Descr','Category_0','Category_1','Category_2','Category_3']].copy()
df = text_processing.clean_col(CPA1, "Descr")
df.drop('Descr',axis=1,inplace=True)

df.sample(5)

2021-04-12 15:58:35,644 - text_processing - INFO - Cleaning column: Descr 


Unnamed: 0,Code,Descr_old,Category_0,Category_1,Category_2,Category_3,Descr_cleaned
4689,69.10.1,Legal services,8,M,69,69.1,legal services
3030,30.30.9,Sub-contracted operations as part of manufacturing of air and spacecraft and related machinery,2,C,30,30.3,sub-contracted operations part manufacturing air spacecraft related machinery
1606,21.20.12,"Medicaments, containing hormones, but not antibiotics",2,C,21,21.2,medicaments containing hormones
82,01.13.72,Sugar beet seeds,1,A,1,1.1,sugar beet seeds
3213,32.99.41,Cigarette lighters and other lighters; smoking pipes and cigar or cigarette holders and parts thereof,2,C,32,32.9,cigarette lighters lighters smoking pipes cigar cigarette holders parts thereof
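
The `Descr_cleaned` column above is lower-cased, stripped of punctuation and has common stop words removed. `text_processing.clean_col` itself is not shown in this notebook, so the following is only a minimal sketch of that kind of cleaning. The function name `clean_text` and the short hard-coded stop-word list are assumptions for illustration (the notebook downloads the NLTK stop-word list instead), and the sketch does not reproduce the removal of exclusion clauses such as "but not antibiotics".

```python
import re

# Hypothetical, deliberately small stop-word list (a stand-in for NLTK's list)
STOP_WORDS = {"a", "an", "and", "as", "for", "in", "of", "or", "other", "the", "to", "with"}


def clean_text(text: str) -> str:
    """Lower-case, drop punctuation (keeping hyphens) and remove stop words."""
    lowered = text.lower()
    # Replace anything that is not a letter, digit, whitespace or hyphen with a space
    no_punct = re.sub(r"[^a-z0-9\s-]", " ", lowered)
    tokens = [tok for tok in no_punct.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


print(clean_text("Sub-contracted operations as part of manufacturing of air and spacecraft and related machinery"))
```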


### Vectorize CPA data using FastText

In [6]:
text_to_vec = functools.partial(text_processing.vectorize_text, wv)
df["Descr_cleaned_vectorized"] = df.Descr_cleaned.apply(text_to_vec)
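
`text_processing.vectorize_text` is not shown here. A common way to turn a cleaned description into one fixed-length vector is to average the word vectors of its tokens. The sketch below assumes that approach and uses a tiny hypothetical lookup table (`toy_vectors`) in place of the FastText model; the name `mean_vector` is also hypothetical.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for the FastText model
toy_vectors = {
    "sugar": np.array([1.0, 0.0, 0.0]),
    "beet": np.array([0.0, 1.0, 0.0]),
    "seeds": np.array([0.0, 0.0, 1.0]),
}


def mean_vector(lookup, text, dim=3):
    """Average the vectors of all known tokens; return zeros if none are known."""
    vectors = [lookup[tok] for tok in text.split() if tok in lookup]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


vec = mean_vector(toy_vectors, "sugar beet seeds")
print(vec)
```

The model comes first in the argument list so that it can be bound with `functools.partial`, as in the cell above. Skipping unknown tokens is a simplification: FastText can build vectors for out-of-vocabulary words from character n-grams.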

In [7]:
# reduce the dimension for the whole lot
df['Reduced_dim'] = text_processing.reduce_dimensionality(df.Descr_cleaned_vectorized,10)
# reduce the dimension for the whole lot using supervised learning
df['Reduced_dim_supervised'] = text_processing.reduce_dimensionality_supervised(df.Descr_cleaned_vectorized,df.Category_2)

2021-04-12 15:58:39,172 - text_processing - INFO - Now applying umap to reduce dimension


UMAP(min_dist=0.0, n_components=10, n_neighbors=10, random_state=3052528580,
     verbose=10)
Construct fuzzy simplicial set
Mon Apr 12 15:58:39 2021 Finding Nearest Neighbors
Mon Apr 12 15:58:39 2021 Building RP forest with 8 trees
Mon Apr 12 15:58:39 2021 NN descent for 12 iterations
	 0  /  12
	 1  /  12
	 2  /  12
	 3  /  12
	 4  /  12
	 5  /  12
	 6  /  12
	 7  /  12
	 8  /  12
	 9  /  12
Mon Apr 12 15:58:46 2021 Finished Nearest Neighbor Search
Mon Apr 12 15:58:48 2021 Construct embedding
	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs


2021-04-12 15:59:03,230 - text_processing - INFO - Now applying umap to reduce dimension


Mon Apr 12 15:59:03 2021 Finished embedding
UMAP(min_dist=0.0, n_components=10, random_state=3052528580, verbose=10)
Construct fuzzy simplicial set
Mon Apr 12 15:59:03 2021 Finding Nearest Neighbors
Mon Apr 12 15:59:03 2021 Building RP forest with 8 trees
Mon Apr 12 15:59:03 2021 NN descent for 12 iterations
	 0  /  12
	 1  /  12
	 2  /  12
	 3  /  12
	 4  /  12
	 5  /  12
	 6  /  12
	 7  /  12
	 8  /  12
	 9  /  12
Mon Apr 12 15:59:04 2021 Finished Nearest Neighbor Search


  return f(**kwargs)


Mon Apr 12 15:59:05 2021 Construct embedding
	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Mon Apr 12 15:59:21 2021 Finished embedding


### Split the CPA data into a training set and a test set

In [8]:
# Split dataset into training set and test set
train_set, test_set = train_test_split(df.copy(), test_size=0.2, random_state=42)

### Use the Random Forest Classifier

https://www.datacamp.com/community/tutorials/random-forests-classifier-python

We have already used supervised UMAP to reduce the dimension of the whole dataset, before splitting it.   
We then split our data into training and test datasets.   
We now train the random forest classifier on the training set, and see how it performs on the test set. Because UMAP saw the `Category_2` labels of the test rows as well, the accuracies below are likely optimistic.


In [9]:
# Train and evaluate a Random Forest classifier for each category level
for this_cat in ['Category_3', 'Category_2', 'Category_1', 'Category_0']:
    # Features are the supervised UMAP embeddings
    X_train = train_set.Reduced_dim_supervised
    X_test = test_set.Reduced_dim_supervised

    y_train = train_set[this_cat]
    y_test = test_set[this_cat]

    # Each cell holds one embedding vector; stack them into a 2-D array
    vecs = np.array(list(X_train.values))
    target = np.array(list(y_train.values))

    clf = RandomForestClassifier(n_estimators=100).fit(vecs, target)

    y_pred = clf.predict(np.array(list(X_test.values)))
    # Model accuracy: how often is the classifier correct?
    print(f"Accuracy for {this_cat} classification:", metrics.accuracy_score(y_test, y_pred))

Accuracy for Category_3 classification: 0.8087431693989071
Accuracy for Category_2 classification: 0.9748633879781421
Accuracy for Category_1 classification: 0.9846994535519126
Accuracy for Category_0 classification: 0.9912568306010929


In [35]:
# Repeat with a Random Forest classifier on the unsupervised UMAP embeddings
# for this_cat in ['Category_3','Category_2','Category_1','Category_0']:
for this_cat in ['Category_2', 'Category_1', 'Category_0']:
    # Features are the unsupervised UMAP embeddings
    X_train = train_set.Reduced_dim
    X_test = test_set.Reduced_dim

    y_train = train_set[this_cat]
    y_test = test_set[this_cat]

    vecs = np.array(list(X_train.values))
    target = np.array(list(y_train.values))

    clf = RandomForestClassifier(n_estimators=100).fit(vecs, target)

    y_pred = clf.predict(np.array(list(X_test.values)))
    # Model accuracy: how often is the classifier correct?
    rs = round(metrics.accuracy_score(y_test, y_pred), 4)
    tx = f"<p>Accuracy for {this_cat} classification: {rs}</p>"
    display(HTML(tx))


In [16]:
CN = text_processing.fetch_CN_mapper()
Cat1_Cat2_map = CPA[CPA.Level==2][['Code','Parent']].rename(columns={'Code':'Category_2','Parent':'Category_1'})
CN=CN.merge(Cat1_Cat2_map, on='Category_2', how='left')


# we now set up a higher level for the A10 industry levels (10 categories)
update_dict0 = {'A':'1','F':'3','J':'5', 'K':'6', 'L':'7','M':'8','N':'8'}
update_dict = {**update_dict0,**dict.fromkeys(['B','C','D','E'],'2'),**dict.fromkeys(['G','H','I'],'4'),
               **dict.fromkeys(['O','P','Q'],'9'), **dict.fromkeys(['R','S','T','U'],'10')}


CN['Category_0'] = CN.Category_1.replace(update_dict)
CN['Category_0']= CN['Category_0'].astype(str)
CN.sample(3)

2021-04-12 16:04:22,550 - text_processing - INFO - CN new Files imported and cleaned


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_old,CN_Description_cleaned,Category_2,Category_1,Category_0
7983,50072061,13.20.11,5007 20 61,10,"(excl. crêpes, and pongee, habutai, honan, shantung, corah and similar far eastern fabrics wholly of silk)","Densely-woven fabrics made from yarn of different colours, containing >= 85% silk or silk waste by weight, of a width > 57 cm to 75 cm","Densely-woven fabrics made from yarn of different colours, containing >= 85% silk or silk waste by weight, of a width > 57 cm to 75 cm",13.0,C,2.0
11582,73152000,25.93.17,7315 20 00,10,,"Skid chain for motor vehicles, of iron or steel","Skid chain for motor vehicles, of iron or steel",25.0,C,2.0
13004,84283900,,8428 39,7,"(excl. those for underground use and bucket, belt or pneumatic types)","Continuous-action elevators and conveyors, for goods or materials","Continuous-action elevators and conveyors, for goods or materials",,,
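
As a quick sanity check on the A10 mapping defined in the cell above, the sketch below rebuilds the same dictionary and confirms that every section letter from A to U is assigned to one of the ten groups. The helper name `a10_group` is hypothetical.

```python
import string

# Same section-to-A10 mapping as in the cell above
update_dict0 = {'A': '1', 'F': '3', 'J': '5', 'K': '6', 'L': '7', 'M': '8', 'N': '8'}
update_dict = {
    **update_dict0,
    **dict.fromkeys(['B', 'C', 'D', 'E'], '2'),
    **dict.fromkeys(['G', 'H', 'I'], '4'),
    **dict.fromkeys(['O', 'P', 'Q'], '9'),
    **dict.fromkeys(['R', 'S', 'T', 'U'], '10'),
}


def a10_group(section):
    """Return the A10 group for a section letter, or None if it is unmapped."""
    return update_dict.get(section)


sections = list(string.ascii_uppercase[:21])  # 'A' .. 'U'
unmapped = [s for s in sections if a10_group(s) is None]
groups_used = sorted(set(update_dict.values()), key=int)
print("unmapped sections:", unmapped)
print("groups used:", groups_used)
```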


In [17]:
# Vectorize the CN description using FastText
CN["Descr_cleaned_vectorized"] = CN.CN_Description_cleaned.apply(
    text_to_vec
)

In [18]:
CN['Reduced_dim'] = text_processing.reduce_dimensionality(CN.Descr_cleaned_vectorized, 10)
#CN_test_df = CN[(CN.Category_2.notnull()) & (CN.CN_Level==4)]
#CN_test_df['Reduced_dim'] = text_processing.reduce_dimensionality(CN_test_df.Descr_cleaned_vectorized)

#CN_test_df['Reduced_dim_supervised'] = text_processing.train_test_umap(df.Descr_cleaned_vectorized,df.Category_2, CN_test_df.Descr_cleaned_vectorized)

2021-04-12 16:04:26,378 - text_processing - INFO - Now applying umap to reduce dimension


UMAP(min_dist=0.0, n_components=10, n_neighbors=10, random_state=3052528580,
     verbose=10)
Construct fuzzy simplicial set
Mon Apr 12 16:04:26 2021 Finding Nearest Neighbors
Mon Apr 12 16:04:26 2021 Building RP forest with 11 trees
Mon Apr 12 16:04:27 2021 NN descent for 14 iterations
	 0  /  14
	 1  /  14
	 2  /  14
	 3  /  14
	 4  /  14
	 5  /  14
	 6  /  14
	 7  /  14
	 8  /  14
	 9  /  14
	 10  /  14
Mon Apr 12 16:04:29 2021 Finished Nearest Neighbor Search
Mon Apr 12 16:04:29 2021 Construct embedding
	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Mon Apr 12 16:04:52 2021 Finished embedding


In [19]:
# Create a Random Forest classifier
clf=RandomForestClassifier(n_estimators=100)

this_cat = 'Category_1'

#Train the model using the training sets 
X_train = np.array(list(df.Reduced_dim.values))
y_train = np.array(list(df[this_cat].values))

clf.fit(X_train,y_train)

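# we stick to the lowest level
# NOTE: CN.Reduced_dim and df.Reduced_dim come from two separate UMAP fits,
# so their coordinates are not guaranteed to be comparable
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==10)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))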
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))
#CN_test_df['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)

#Results = CN_test_df.drop(['Descr_cleaned_vectorized','CN_Description_old','Excl_removed','Reduced_dim'],axis=1).copy()
Results = CN_test_df[['CN_Code','CPA_Code','CN_Description_cleaned','Category_1']].copy()
Results['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)
display(Results.sample(5))
print(classification_report_imbalanced(y_CN_test, y_CN_pred))

Accuracy: 0.9036386714618151


Unnamed: 0,CN_Code,CPA_Code,CN_Description_cleaned,Category_1,Predicted
5475,29394100,21.10.53,Ephedrine and its salts,C,C
1437,6042090,02.30.30,"Foliage, branches and other parts of plants grasses, fresh, suitable for bouquets or ornamental purposes",A,C
15991,95061180,32.30.11,Snow-skis,C,C
1090,3079100,03.00.42,"Live, fresh or chilled molluscs, even in shell ; fresh or chilled flours, meals and pellets of molluscs, fit for human consumption",A,C
5611,30061010,21.20.24,Sterile surgical catgut,C,C


                   pre       rec       spe        f1       geo       iba       sup

          A       0.00      0.00      1.00      0.00      0.00      0.00       594
          B       0.00      0.00      1.00      0.00      0.00      0.00       106
          C       0.91      0.99      0.01      0.95      0.08      0.01      8605
          D       0.00      0.00      1.00      0.00      0.00      0.00         2
          E       0.00      0.00      1.00      0.00      0.00      0.00       102
          J       0.07      0.16      0.99      0.10      0.39      0.14        32
          M       0.00      0.00      1.00      0.00      0.00      0.00         5
          R       0.00      0.00      1.00      0.00      0.00      0.00         7
          S       0.00      0.00      1.00      0.00      0.00      0.00         1

avg / total       0.83      0.90      0.10      0.86      0.08      0.01      9454
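
The columns of the report are precision (`pre`), recall (`rec`), specificity (`spe`), F1 (`f1`), geometric mean (`geo`), index balanced accuracy (`iba`) and support (`sup`). To show why a class can score a high recall yet a low geometric mean, here is a minimal sketch that computes four of these from confusion counts. The function name `imbalance_metrics` is hypothetical, and the counts are made up for illustration, not taken from the run above.

```python
import math


def imbalance_metrics(tp, fp, fn, tn):
    """Precision, recall, specificity and geometric mean for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Geometric mean of recall (sensitivity) and specificity
    geo = math.sqrt(recall * specificity)
    return {"pre": precision, "rec": recall, "spe": specificity, "geo": geo}


# Made-up counts for a dominant class: nearly everything is predicted as this class
dominant = imbalance_metrics(tp=990, fp=95, fn=10, tn=5)
print({name: round(value, 2) for name, value in dominant.items()})
```

With counts like these, recall is high but specificity is very low, which is the pattern class C shows in the report.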



  _warn_prf(average, modifier, msg_start, len(result))


In [20]:
classification_report_imbalanced

<function imblearn.metrics._classification.classification_report_imbalanced(y_true, y_pred, *, labels=None, target_names=None, sample_weight=None, digits=2, alpha=0.1)>

In [21]:
this_cat = 'Category_2'

#Train the model using the training sets 
X_train = np.array(list(df.Reduced_dim.values))
y_train = np.array(list(df[this_cat].values))

clf.fit(X_train,y_train)

# we stick to the lowest level
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==10)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 3 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))

Results = CN_test_df.drop(['Descr_cleaned_vectorized','Reduced_dim'],axis=1).copy()
Results['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)
display(Results.sample(5))
print(classification_report_imbalanced(y_CN_test, y_CN_pred))

Test 3 Accuracy: 0.08800507721599322


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_old,CN_Description_cleaned,Category_2,Category_1,Category_0,Predicted
11383,73042990,24.20.12,7304 29 90,10,(excl. products of cast iron),"Casing and tubing of a kind used for drilling for oil or gas, seamless, of iron or steel, of an external diameter > 406,4 mm","Casing and tubing of a kind used for drilling for oil or gas, seamless, of iron or steel, of an external diameter > 406,4 mm",24,C,2,20
5041,29181200,20.14.34,2918 12 00,10,,Tartaric acid,Tartaric acid,20,C,2,24
14568,85393980,27.40.15,8539 39 80,10,"(excl. hot-cathode fluorescent lamps, mercury or sodium vapour lamps, metal halide lamps, ultraviolet lamps, and cold-cathode fluorescent lamps ""CCFLs"" for backlighting of flat panel displays)",Discharge lamps,Discharge lamps,27,C,2,20
3155,20094919,10.32.14,2009 49 19,10,(excl. containing spirit) not containing added sugar or other sweetening matter,"Pineapple juice, unfermented, Brix value > 67 at 20°C, value of > 30 € per 100 kg, whether or not containing added sugar or other sweetening matter","Pineapple juice, unfermented, Brix value > 67 at 20°C, value of > 30 € per 100 kg, whether or",10,C,2,1
8350,52114910,13.20.20,5211 49 10,10,,"Woven jacquard fabrics containing predominantly, but < 85% cotton by weight, mixed mainly or solely with man-made fibres and weighing > 200 g/m², made of yarn of different colours","Woven jacquard fabrics containing predominantly, but < 85% cotton by weight, mixed mainly or solely with man-made fibres and weighing > 200 g/m², made of yarn of different colours",13,C,2,20


                   pre       rec       spe        f1       geo       iba       sup

         01       0.06      0.29      0.79      0.10      0.47      0.21       417
         02       0.00      0.00      1.00      0.00      0.00      0.00        38
         03       0.00      0.00      1.00      0.00      0.00      0.00       139
         05       0.00      0.00      1.00      0.00      0.00      0.00         5
         06       0.00      0.00      1.00      0.00      0.00      0.00         5
         07       0.00      0.00      0.98      0.00      0.00      0.00        25
         08       0.00      0.00      1.00      0.00      0.00      0.00        71
         10       0.29      0.07      0.96      0.11      0.26      0.06      1748
         11       0.00      0.00      1.00      0.00      0.00      0.00       216
         12       0.00      0.00      1.00      0.00      0.00      0.00        16
         13       0.00      0.00      1.00      0.00      0.00      0.00       731
   

  _warn_prf(average, modifier, msg_start, len(result))


## Try again with a Binary Classifier

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

In [23]:
CN_test_df[['CN_Code', 'CPA_Code', 'Category_0']].dtypes

CN_Code        int64
CPA_Code      object
Category_0    object
dtype: object

In [24]:
np.array(CN_test_df[this_cat])

array(['01', '01', '01', ..., '91', '91', '91'], dtype=object)

In [25]:
y_CN_test

array(['01', '01', '01', ..., '91', '91', '91'], dtype='<U2')

In [26]:
this_cat = 'Category_0'

#Train the model using the training sets 
X_train = np.array(list(df.Reduced_dim_supervised.values))
y_train_BE = (df[this_cat]=='2')


bin_forest_clf = RandomForestClassifier(random_state=42)
bin_forest_clf.fit(X_train,y_train_BE)

CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==10)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
y_CN_pred=bin_forest_clf.predict(CN_test)

Results = CN_test_df.drop(['Descr_cleaned_vectorized'],axis=1).copy()
Results['Expected'] = (Results[this_cat]=='2')
Results['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)
print(precision_recall_fscore_support(Results.Expected,  np.array(Results.Predicted), pos_label=True, average='binary'))
display(Results.sample())


(0.890993265993266, 0.2401588201928531, 0.37833973728889286, None)


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_old,CN_Description_cleaned,Category_2,Category_1,Category_0,Reduced_dim,Expected,Predicted
3578,23061000,10.41.41,2306 10 00,10,", whether or not ground or in the form of pellets,","Oilcake and other solid residues, whether or not ground or in the form of pellets, resulting from the extraction of cotton seeds",Oilcake and other solid residues resulting from the extraction of cotton seeds,10,C,2,"[9.620353698730469, 2.551506280899048, 7.416372776031494, 6.456171989440918, 5.780008316040039, 6.509273052215576, 6.948757648468018, 5.201693058013916, 2.9731948375701904, 2.9721081256866455]",True,False
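
The tuple printed above is (precision, recall, F1, support); support is `None` because `average='binary'` was requested. As a check on how the three numbers relate, this small sketch recomputes F1 as the harmonic mean of the printed precision and recall. The function name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as printed by the cell above
precision, recall = 0.890993265993266, 0.2401588201928531
print(round(f1_from(precision, recall), 4))
```

High precision with low recall means the classifier is rarely wrong when it assigns a row to group 2, but it misses roughly three quarters of the rows that belong there.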


In [27]:
df_tmp = df[df.Category_0=='2'].drop('Descr_cleaned_vectorized', axis=1).copy()
X_train = np.array(list(df_tmp.Reduced_dim_supervised.values))
y_train_C = (df_tmp[this_cat] =='C')
y_train_C

348     False
349     False
352     False
353     False
357     False
        ...  
3456    False
3457    False
3458    False
3459    False
3460    False
Name: Category_0, Length: 2702, dtype: bool

In [28]:

this_cat = 'Category_1'

#Train the model using the training sets 
df_tmp = df[df.Category_0=='2'].copy()
X_train = np.array(list(df_tmp.Reduced_dim_supervised.values))
y_train_C = (df_tmp[this_cat] =='C')


bin_forest_clf = RandomForestClassifier(random_state=42)
bin_forest_clf.fit(X_train,y_train_C)

CN_test_df2= Results[Results.Predicted==True]

CN_test2 = np.array(list(CN_test_df2["Reduced_dim"].values))
y_CN_pred=bin_forest_clf.predict(CN_test)

Results2 = CN_test_df.drop(['Descr_cleaned_vectorized'],axis=1).copy()
Results2['Expected'] = (Results2[this_cat]=='C')
Results2['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)
print(precision_recall_fscore_support(Results2.Expected,  np.array(Results2.Predicted), pos_label=True, average='binary'))
#Results2

(0.9105297867663222, 0.9626961069145845, 0.935886572897249, None)


## Finally, do a multi-class random forest for "C"

In [30]:
this_cat = 'Category_2'

#Train the model using the training sets 
df_tmp = df[df.Category_1=='C'].copy()
X_train = np.array(list(df_tmp.Reduced_dim_supervised.values))
y_train = (df_tmp[this_cat])

clf.fit(X_train,y_train)

# we stick to the lowest level
CN_test_df3= Results2[Results2.Predicted==True]
CN_test3 = np.array(list(CN_test_df3["Reduced_dim"].values))

y_CN_test3 = (CN_test_df3.Category_2)

y_CN_pred3=clf.predict(CN_test3)


Prediction = CN_test_df3.drop(['Reduced_dim'],axis=1).copy()
Prediction['Predicted'] = pd.Series(data=y_CN_pred3.tolist(), index=CN_test_df3.index)

#print(classification_report_imbalanced(y_CN_test3, y_CN_pred3))

In [51]:
display(Prediction.sample(5))

Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_old,CN_Description_cleaned,Category_2,Category_1,Category_0,Expected,Predicted
9487,61091000,14.14.30,6109 10 00,10,,"T-shirts, singlets and other vests of cotton, knitted or crocheted","T-shirts, singlets and other vests of cotton, knitted or crocheted",14,C,2,True,27
2041,11010015,10.61.21,1101 00 15,10,,Flour of common wheat and spelt,Flour of common wheat and spelt,10,C,2,True,26
2855,19054090,10.72.11,1905 40 90,10,(excl. rusks),Toasted bread and similar toasted products,Toasted bread and similar toasted products,10,C,2,True,26
9989,63039290,13.92.15,6303 92 90,10,"(excl. nonwovens, knitted or crocheted, awnings and sunblinds)","Curtains, incl. drapes, and interior blinds, curtain or bed valances of synthetic fibres","Curtains, incl. drapes, and interior blinds, curtain or bed valances of synthetic fibres",13,C,2,True,27
6579,39152000,38.11.55,3915 20 00,10,,"Waste, parings and scrap, of polymers of styrene","Waste, parings and scrap, of polymers of styrene",38,E,2,False,26


## Try the binary classifier for each C value

In [52]:
df_tmp = df[df.Category_1=='C'].copy()


In [53]:
this_cat = 'Category_2'

prediction = Results2[Results2.Predicted==True].drop(['Reduced_dim'],axis=1).copy()
prediction['prediction'] = '0'
#Train the model using the training sets 
for xcat in df_tmp.Category_2.unique():

    df_tmp = df[df.Category_1=='C'].copy()
    X_train = np.array(list(df_tmp.Reduced_dim_supervised.values))
    y_train = (df_tmp[this_cat]==xcat)

    bin_forest_clf = RandomForestClassifier(random_state=42)
    bin_forest_clf.fit(X_train,y_train)

    CN_test= Results2[Results2.Predicted==True]
    CN_t = np.array(list(CN_test["Reduced_dim"].values))
    y_CN_pred=bin_forest_clf.predict(CN_t)
    y_true = (CN_test[this_cat]==xcat)
    
    m = pd.Series(data=y_CN_pred.tolist(), index=CN_test.index)
    prediction['Predict'] = m

    prediction['temp'] = xcat
    prediction['prediction'] = prediction.temp.where(m, prediction.prediction)

    print('Results for ',xcat,'\n',precision_recall_fscore_support(y_true, y_CN_pred, pos_label=True, average='binary'))
    print('number predicted correctly :', len(prediction[(prediction.Category_2==xcat)&(prediction.Predict==True)]),
' number missed :', len(prediction[(prediction.Category_2==xcat)&(prediction.Predict==False)]))
display(prediction.sample(10))

  _warn_prf(average, modifier, msg_start, len(result))


Results for  10 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 1650


  _warn_prf(average, modifier, msg_start, len(result))


Results for  11 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 169


  _warn_prf(average, modifier, msg_start, len(result))


Results for  12 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 11
Results for  13 
 (0.1540983606557377, 0.06545961002785515, 0.09188660801564029, None)
number predicted correctly : 47  number missed : 671


  _warn_prf(average, modifier, msg_start, len(result))


Results for  14 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 358


  _warn_prf(average, modifier, msg_start, len(result))


Results for  15 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 172


  _warn_prf(average, modifier, msg_start, len(result))


Results for  16 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 202
Results for  17 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 185
Results for  18 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 1
Results for  19 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 72
Results for  20 
 (0.2768777614138439, 0.3197278911564626, 0.2967640094711918, None)
number predicted correctly : 376  number missed : 800


  _warn_prf(average, modifier, msg_start, len(result))


Results for  21 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 143
Results for  22 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 159


  _warn_prf(average, modifier, msg_start, len(result))


Results for  23 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 261


  _warn_prf(average, modifier, msg_start, len(result))


Results for  24 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 560


  _warn_prf(average, modifier, msg_start, len(result))


Results for  25 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 363
Results for  26 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 366
Results for  27 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 335


  _warn_prf(average, modifier, msg_start, len(result))


Results for  28 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 809
Results for  29 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 141


  _warn_prf(average, modifier, msg_start, len(result))


Results for  30 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 132
Results for  31 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 41


  _warn_prf(average, modifier, msg_start, len(result))


Results for  32 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 260
Results for  33 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 0


  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_old,CN_Description_cleaned,Category_2,Category_1,Category_0,Expected,Predicted,prediction,Predict,temp
15016,87112098,30.91.12,8711 20 98,10,(excl. scooters),"Motorcycles, incl. mopeds, with reciprocating internal combustion piston engine of a cylinder capacity > 125 cm³ to 250 cm³","Motorcycles, incl. mopeds, with reciprocating internal combustion piston engine of a cylinder capacity > 125 cm³ to 250 cm³",30,C,2,True,True,19,False,33
11354,73021040,24.10.75,7302 10 40,10,,"Grooved rails of iron or steel, for railway or tramway track, new","Grooved rails of iron or steel, for railway or tramway track, new",24,C,2,True,True,19,False,33
6308,38245010,23.63.10,3824 50 10,10,,Concrete ready to pour,Concrete ready to pour,23,C,2,True,True,0,False,33
3213,20099011,10.32.17,2009 90 11,10,(excl. containing spirit) not containing added sugar or other sweetening matter,"Mixtures of apple and pear juice, unfermented, Brix value > 67 at 20°C, value of <= 22 € per 100 kg, whether or not containing added sugar or other sweetening matter","Mixtures of apple and pear juice, unfermented, Brix value > 67 at 20°C, value of <= 22 € per 100 kg, whether or",10,C,2,True,True,0,False,33
7605,47069100,17.11.14,4706 91 00,10,"(excl. that of bamboo, wood, cotton linters and fibres derived from recovered [waste and scrap] paper or paperboard)",Mechanical pulp of fibrous cellulosic material,Mechanical pulp of fibrous cellulosic material,17,C,2,True,True,26,False,33
3677,25010010,08.93.10,2501 00 10,10,,Sea water and salt liquors,Sea water and salt liquors,8,B,2,False,True,0,False,33
1548,7108010,10.39.11,0710 80 10,10,,"Olives, uncooked or cooked by steaming or by boiling in water, frozen","Olives, uncooked or cooked by steaming or by boiling in water, frozen",10,C,2,True,True,0,False,33
13706,84732190,28.23.22,8473 21 90,10,n.e.s.,"Parts and accessories of electronic calculators of subheading 8470.10, 8470.21 or 8470.29,","Parts and accessories of electronic calculators of subheading 8470.10, 8470.21 or 8470.29,",28,C,2,True,True,0,False,33
11445,73063049,24.20.33,7306 30 49,10,(excl. products plated or coated with zinc),"Threaded or threadable tubes ""gas pipe"", welded, of circular cross-section, of iron or non-alloy steel","Threaded or threadable tubes ""gas pipe"", welded, of circular cross-section, of iron or non-alloy steel",24,C,2,True,True,20,False,33
4445,28353100,20.13.42,2835 31 00,10,not chemically defined,"Sodium triphosphate ""sodium tripolyphosphate"", whether or not chemically defined","Sodium triphosphate ""sodium tripolyphosphate"", whether or",20,C,2,True,True,0,False,33
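
In the loop above, each binary classifier overwrites `prediction` wherever it predicts True, so when several classifiers fire for the same row the last category in the loop wins, and rows where none fire stay at '0'. One common alternative, not used in this notebook, is to collect a score per category (for example from `predict_proba`) and pick the highest. The sketch below assumes such a score matrix already exists; the numbers are made up and `pick_best` is a hypothetical helper.

```python
import numpy as np

categories = ["10", "13", "20"]

# Made-up scores: one row per item, one column per one-vs-rest classifier
scores = np.array([
    [0.70, 0.20, 0.60],   # two classifiers are fairly confident
    [0.10, 0.15, 0.05],   # no classifier is confident
    [0.05, 0.90, 0.30],
])


def pick_best(score_matrix, labels):
    """Return, for each row, the label whose classifier gave the highest score."""
    best_columns = np.argmax(score_matrix, axis=1)
    return [labels[i] for i in best_columns]


best = pick_best(scores, categories)
print(best)
```

Every row then receives exactly one label, and conflicts between classifiers are resolved by score rather than by loop order.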


In [54]:
xcat = '10'
df_tmp = df[df.Category_1=='C'].copy()
X_train = np.array(list(df_tmp.Reduced_dim_supervised.values))
y_train = (df_tmp[this_cat]==xcat)

bin_forest_clf = RandomForestClassifier(random_state=42)
bin_forest_clf.fit(X_train,y_train)

CN_test= Results2[Results2.Predicted==True]
CN_t = np.array(list(CN_test["Reduced_dim"].values))
y_CN_pred=bin_forest_clf.predict(CN_t)
y_true = (CN_test[this_cat]==xcat)

In [55]:
y_CN_pred

array([False, False, False, ..., False, False, False])

In [56]:
y_true

3        False
6        False
7        False
8        False
10       False
         ...  
16192    False
16196    False
16199    False
16200    False
16204    False
Name: Category_2, Length: 9098, dtype: bool

In [57]:
print('Results for ',xcat,'\n',precision_recall_fscore_support(y_true, y_CN_pred, pos_label=False, average='binary'))

Results for  10 
 (0.8186414596614641, 1.0, 0.9002780128127644, None)


In [58]:
xcat = '33'
prediction = Results2[Results2.Predicted==True].drop(['Reduced_dim','Expected','Predicted'],axis=1).copy()
prediction['prediction'] = '0'
#Train the model using the training sets 

df_tmp = df[df.Category_1=='C'].copy()
X_train = np.array(list(df_tmp.Reduced_dim_supervised.values))
y_train = (df_tmp[this_cat]==xcat)

bin_forest_clf = RandomForestClassifier(random_state=42)
bin_forest_clf.fit(X_train,y_train)

CN_test= Results2[Results2.Predicted==True]
CN_t = np.array(list(CN_test["Reduced_dim"].values))
y_CN_pred=bin_forest_clf.predict(CN_t)
y_true = (CN_test[this_cat]==xcat)

m = pd.Series(data=y_CN_pred.tolist(), index=CN_test.index)
prediction['Predict'] = m

prediction['temp'] = xcat
prediction['prediction'] = prediction.temp.where(m, prediction.prediction)
prediction[prediction.Category_2=='10'].sample(5)

Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_old,CN_Description_cleaned,Category_2,Category_1,Category_0,prediction,Predict,temp
1570,7119010,10.39.12,0711 90 10,10,(excl. sweet pepper),"Fruits of genus Capsicum or Pimenta provisionally preserved, e.g. by sulphur dioxide gas, in brine, in sulphur water or in other preservative solutions, but unsuitable in that state for immediate consumption","Fruits of genus Capsicum or Pimenta provisionally preserved, e.g. by sulphur dioxide gas, in brine, in sulphur water or in other preservative solutions, but unsuitable in that state for immediate consumption",10,C,2,0,False,33
2071,11032090,10.61.32,1103 20 90,10,"(excl. rye, barley, oats, maize, rice and wheat)",Cereal pellets,Cereal pellets,10,C,2,0,False,33
1828,8134010,10.39.29,0813 40 10,10,,"Dried peaches, incl. nectarines","Dried peaches, incl. nectarines",10,C,2,0,False,33
1581,7123200,10.39.13,0712 32 00,10,not further prepared,"Dried wood ears ""Auricularia spp."", whole, cut, sliced, broken or in powder, but not further prepared","Dried wood ears ""Auricularia spp."", whole, cut, sliced, broken or in powder, but",10,C,2,0,False,33
2934,20059980,10.39.17,2005 99 80,10,"(excl. preserved by sugar, homogenised vegetables of subheading 2005.10, and tomatoes, mushrooms, truffles, potatoes, sauerkraut, peas ""Pisum sativum"", beans ""Vigna spp., Phaseolus spp."" asparagus, olives, sweetcorn ""Zea Mays var. Saccharata"", bamboo shoots, fruit of the genus Capsicum hot to the taste, capers, artichokes and mixtures of vegetables) not frozen","Vegetables, prepared or preserved otherwise than by vinegar or acetic acid, not frozen","Vegetables, prepared or preserved otherwise than by vinegar or acetic acid,",10,C,2,0,False,33


In [None]:
print('32 is correct :', len(prediction[(prediction.Category_2=='32')&(prediction.Predict==True)]),
'\n32 missed :', len(prediction[(prediction.Category_2=='32')&(prediction.Predict==False)]))

In [None]:
m