# Text Processing for Accelerator project

A simplified pipeline processing text with FastText.

* Load CPA data
* Basic text cleaning
* Vectorize (with FastText)
* Split into test and training sets
* Reduce dimension using UMAP supervised 
* Predict on test set
* Use metrics for an unbalanced dataset
* View the clustering in plotly scatterplots

In [1]:
# this bit shouldn't be necessary if we pip install -e .   in the parent directory
%load_ext autoreload
%autoreload 2

In [2]:
import functools
from pprint import pprint
from time import time
from IPython.display import display, HTML
import logging
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

from sklearn import metrics
import umap

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express
from imblearn.metrics import classification_report_imbalanced

import text_processing

pd.set_option('display.max_colwidth', None)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Load FastText Pretrained

Note: This requires a fair bit of memory (peaks at about 17.5 GiB)

Recommend shutting down other kernels first, once this has loaded memory usage drops again.

This takes a few minutes to load in.

In [3]:
wv = text_processing.fetch_fasstext_pretrained(filepath="../../data/wiki.en.bin")

2020-12-04 14:51:07,267 - text_processing - INFO - Loading FastText pretrained from ../../data/wiki.en.bin
2020-12-04 14:55:07,910 - text_processing - INFO - Model loaded


#### Load in the CPA data

In [4]:
CPA = text_processing.fetch_files()

2020-12-04 14:55:08,025 - text_processing - INFO - cleanded CPA File imported


In [5]:
CPA1 = CPA[CPA.Level.isin({5,6})][['Code','Descr_old','Descr','Category_0','Category_1','Category_2','Category_3']].copy()
df = text_processing.clean_col(CPA1, "Descr")
df.drop('Descr',axis=1,inplace=True)

df.sample(5)

2020-12-04 14:55:08,180 - text_processing - INFO - Cleaning column: Descr 


Unnamed: 0,Code,Descr_old,Category_0,Category_1,Category_2,Category_3,Descr_cleaned
4689,69.10.1,Legal services,8,M,69,69.1,legal services
3030,30.30.9,Sub-contracted operations as part of manufacturing of air and spacecraft and related machinery,2,C,30,30.3,sub-contracted operations part manufacturing air spacecraft related machinery
1606,21.20.12,"Medicaments, containing hormones, but not antibiotics",2,C,21,21.2,medicaments containing hormones
82,01.13.72,Sugar beet seeds,1,A,1,1.1,sugar beet seeds
3213,32.99.41,Cigarette lighters and other lighters; smoking pipes and cigar or cigarette holders and parts thereof,2,C,32,32.9,cigarette lighters lighters smoking pipes cigar cigarette holders parts thereof


### Vectorize CPA data using FastText

In [6]:
text_to_vec = functools.partial(text_processing.vectorize_text, wv)
df["Descr_cleaned_vectorized"] = df.Descr_cleaned.apply(text_to_vec)

In [7]:
# reduce the dimension for the whole lot
df['Reduced_dim'] = text_processing.reduce_dimensionality(df.Descr_cleaned_vectorized,10)
# reduce the dimension for the whole lot using supervised learning
df['Reduced_dim_supervised'] = text_processing.reduce_dimensionality_supervised(df.Descr_cleaned_vectorized,df.Category_2)

2020-12-04 14:55:08,573 - text_processing - INFO - Now applying umap to reduce dimension


UMAP(min_dist=0.0, n_components=10, n_neighbors=10, random_state=3052528580,
     verbose=10)
Construct fuzzy simplicial set
Fri Dec  4 14:55:08 2020 Finding Nearest Neighbors
Fri Dec  4 14:55:08 2020 Building RP forest with 8 trees
Fri Dec  4 14:55:09 2020 NN descent for 12 iterations
	 0  /  12
	 1  /  12
	 2  /  12
	 3  /  12
	 4  /  12
	 5  /  12
	 6  /  12
	 7  /  12
	 8  /  12
	 9  /  12
	 10  /  12
Fri Dec  4 14:55:15 2020 Finished Nearest Neighbor Search
Fri Dec  4 14:55:17 2020 Construct embedding
	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs


2020-12-04 14:55:32,491 - text_processing - INFO - Now applying umap to reduce dimension


Fri Dec  4 14:55:32 2020 Finished embedding
UMAP(min_dist=0.0, n_components=10, random_state=3052528580, verbose=10)
Construct fuzzy simplicial set
Fri Dec  4 14:55:32 2020 Finding Nearest Neighbors
Fri Dec  4 14:55:32 2020 Building RP forest with 8 trees
Fri Dec  4 14:55:33 2020 NN descent for 12 iterations
	 0  /  12
	 1  /  12
	 2  /  12
	 3  /  12
	 4  /  12
	 5  /  12
	 6  /  12
	 7  /  12
	 8  /  12
Fri Dec  4 14:55:34 2020 Finished Nearest Neighbor Search


  return f(**kwargs)


Fri Dec  4 14:55:34 2020 Construct embedding
	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Fri Dec  4 14:55:50 2020 Finished embedding


### Split the CPA data into a training set and a test set

In [8]:
# Split dataset into training set and test set
train_set, test_set = train_test_split(df.copy(), test_size=0.2, random_state=42)

# Use the Random Forest Classifier

https://www.datacamp.com/community/tutorials/random-forests-classifier-python

We have already  used UMAP supervised classification to reduce the dimension on the training set.   
We then split our data into training and test datasets.   
We now use the random forest classifier on the training set, and see how it works on the test set.


In [9]:
#Create Gaussian Classifier
for this_cat in ['Category_3','Category_2','Category_1','Category_0']:
    #Train the model using the training sets 
    X_train = train_set.Reduced_dim_supervised
    X_test = test_set.Reduced_dim_supervised

    y_train = train_set[this_cat]
    y_test = test_set[this_cat]

    vecs = np.array(list(X_train.values))
    target = np.array(list(y_train.values))

    clf = RandomForestClassifier(n_estimators=100
                              ).fit(vecs, target)

    y_pred=clf.predict(np.array(list(X_test.values)))
# Model Accuracy, how often is the classifier correct?
    print(f"Accuracy for {this_cat} classification:",metrics.accuracy_score(y_test, y_pred))

Accuracy for Category_3 classification: 0.8098360655737705
Accuracy for Category_2 classification: 0.9814207650273225
Accuracy for Category_1 classification: 0.985792349726776
Accuracy for Category_0 classification: 0.994535519125683


In [10]:
CN = text_processing.fetch_CN_files()
Cat1_Cat2_map = CPA[CPA.Level==2][['Code','Parent']].rename(columns={'Code':'Category_2','Parent':'Category_1'})
CN=CN.merge(Cat1_Cat2_map, on='Category_2', how='left')


# we now set up a higher level for A10 indstry levels (10 categories)
update_dict0 = {'A':'1','F':'3','J':'5', 'K':'6', 'L':'7','M':'8','N':'8'}
update_dict = {**update_dict0,**dict.fromkeys(['B','C','D','E'],'2'),**dict.fromkeys(['G','H','I'],'4'),
               **dict.fromkeys(['O','P','Q'],'9'), **dict.fromkeys(['R','S','T','U'],'10')}


CN['Category_0'] = CN.Category_1.replace(update_dict)
CN['Category_0']= CN['Category_0'].astype(str)
CN.sample(3)

2020-12-04 14:56:04,839 - text_processing - INFO - CN new Files imported and cleaned


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Category_0
5233,29269070,20.14.43,2926 90 70,10,"(excl. acrylonitrile, 1-cyanoguanidine ""dicyandiamide"", fenproporex ""INN"" and its salts, methadone ""INN""-intermediate ""4-cyano-2-dimethylamino-4,4-diphenylbutane"", alpha-Phenylacetoacetonitrile and isophthalonitrile)",Nitrile-function compounds,20,C,2
4032,27101931,19.20.26,2710 19 31,10,,Gas oils of petroleum or bituminous minerals for undergoing a specific process as defined in Additional Note 5 to chapter 27,19,C,2
7707,48051910,17.12.34,4805 19 10,10,,"Wellenstoff, uncoated, in rolls of a width > 36 cm or in square or rectangular sheets with one side > 36 cm and the other side > 15 cm in the unfolded state",17,C,2


In [11]:
# Vectorize the CN description using FastText
CN["Descr_cleaned_vectorized"] = CN.CN_Description_cleaned.apply(
    text_to_vec
)

In [12]:
CN['Reduced_dim'] = text_processing.reduce_dimensionality(CN.Descr_cleaned_vectorized, 10)
#CN_test_df = CN[(CN.Category_2.notnull()) & (CN.CN_Level==4)]
#CN_test_df['Reduced_dim'] = text_processing.reduce_dimensionality(CN_test_df.Descr_cleaned_vectorized)

#CN_test_df['Reduced_dim_supervised'] = text_processing.train_test_umap(df.Descr_cleaned_vectorized,df.Category_2, CN_test_df.Descr_cleaned_vectorized)

2020-12-04 14:56:08,549 - text_processing - INFO - Now applying umap to reduce dimension


UMAP(min_dist=0.0, n_components=10, n_neighbors=10, random_state=3052528580,
     verbose=10)
Construct fuzzy simplicial set
Fri Dec  4 14:56:08 2020 Finding Nearest Neighbors
Fri Dec  4 14:56:08 2020 Building RP forest with 11 trees
Fri Dec  4 14:56:09 2020 NN descent for 14 iterations
	 0  /  14
	 1  /  14
	 2  /  14
	 3  /  14
	 4  /  14
	 5  /  14
	 6  /  14
	 7  /  14
	 8  /  14
	 9  /  14
	 10  /  14
Fri Dec  4 14:56:11 2020 Finished Nearest Neighbor Search
Fri Dec  4 14:56:11 2020 Construct embedding
	completed  0  /  200 epochs
	completed  20  /  200 epochs
	completed  40  /  200 epochs
	completed  60  /  200 epochs
	completed  80  /  200 epochs
	completed  100  /  200 epochs
	completed  120  /  200 epochs
	completed  140  /  200 epochs
	completed  160  /  200 epochs
	completed  180  /  200 epochs
Fri Dec  4 14:56:34 2020 Finished embedding


In [13]:
#Create a Gaussian Classifier
clf=RandomForestClassifier(n_estimators=100)

this_cat = 'Category_1'

#Train the model using the training sets 
X_train = np.array(list(df.Reduced_dim.values))
y_train = np.array(list(df[this_cat].values))

clf.fit(X_train,y_train)

# we stick to the lowest level
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==10)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))
#CN_test_df['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)

Results = CN_test_df.drop(['Descr_cleaned_vectorized','Reduced_dim'],axis=1).copy()
Results['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)
display(Results.sample(5))
print(classification_report_imbalanced(y_CN_test, y_CN_pred))

Accuracy: 0.8077004442563994


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Category_0,Predicted
3775,25181000,08.11.30,2518 10 00,10,"(excl. broken or crushed dolomite for concrete aggregates, road metalling or railway or other ballast), not calcined or not sintered, incl.","Crude dolomite dolomite roughly trimmed or merely cut, by sawing or otherwise, into blocks or slabs of a rectangular ""incl. square"" shape",8,B,2,C
7640,48025590,17.12.14,4802 55 90,10,", not containing fibres obtained by a mechanical or chemi-mechanical process or of which <= 10% by weight of the total fibre content consists of such fibres, and weighing >= 80 g but <= 150 g/m², n.","Uncoated paper and paperboard, of a kind used for writing, printing or other graphic purposes, and non-perforated punchcards and punch-tape paper, in rolls of any sizee.s.",17,C,2,C
6579,39152000,38.11.55,3915 20 00,10,,"Waste, parings and scrap, of polymers of styrene",38,E,2,M
11520,73102119,25.92.11,7310 21 19,10,,"Cans of iron or steel, of a capacity of < 50 l, which are to be closed by soldering or crimping, of a kind used for preserving drink",25,C,2,C
11012,72112900,24.32.10,7211 29 00,10,"not clad, plated or coated, containing by weight >= 0,25% of carbon","Flat-rolled products of iron or non-alloy steel, of a width of < 600 mm, simply cold-rolled ""cold-reduced"",",24,C,2,C


                   pre       rec       spe        f1       geo       iba       sup

          A       0.00      0.00      1.00      0.00      0.00      0.00       594
          B       0.00      0.00      1.00      0.00      0.00      0.00       106
          C       0.91      0.89      0.08      0.90      0.26      0.07      8605
          D       0.00      0.00      1.00      0.00      0.00      0.00         2
          E       0.00      0.00      0.99      0.00      0.00      0.00       102
          F       0.00      0.00      1.00      0.00      0.00      0.00         0
          J       0.00      0.00      0.96      0.00      0.00      0.00        32
          M       0.00      0.00      0.94      0.00      0.00      0.00         5
          P       0.00      0.00      1.00      0.00      0.00      0.00         0
          R       0.00      0.00      1.00      0.00      0.00      0.00         7
          S       0.00      0.00      1.00      0.00      0.00      0.00         1

av

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [14]:
classification_report_imbalanced

<function imblearn.metrics._classification.classification_report_imbalanced(y_true, y_pred, *, labels=None, target_names=None, sample_weight=None, digits=2, alpha=0.1)>

In [15]:
this_cat = 'Category_2'

#Train the model using the training sets 
X_train = np.array(list(df.Reduced_dim.values))
y_train = np.array(list(df[this_cat].values))

clf.fit(X_train,y_train)

# we stick to the lowest level
CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==10)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
y_CN_pred=clf.predict(CN_test)

y_CN_test = np.array(list(CN_test_df[this_cat]))
# Model Accuracy, how often is the classifier correct?
print("Test 3 Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))

Results = CN_test_df.drop(['Descr_cleaned_vectorized','Reduced_dim'],axis=1).copy()
Results['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)
display(Results.sample(5))
print(classification_report_imbalanced(y_CN_test, y_CN_pred))

Test 3 Accuracy: 0.0239052253014597


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Category_0,Predicted
13897,84831050,28.15.22,8483 10 50,10,,Articulated shafts,28,C,2,72
15063,87149610,30.92.30,8714 96 10,10,,Pedals for bicycles,30,C,2,27
4203,28111200,20.13.24,2811 12 00,10,,"Hydrogen cyanide ""hydrocyanic acid""",20,C,2,38
13253,84431970,28.99.14,8443 19 70,10,"(excl. machinery for printing textile materials, those for use in the production of semiconductors, ink jet printing machines, hectograph or stencil duplicating machines, addressing machines and other office printing machines of heading 8469 to 8472 and offset, flexographic, letterpress and gravure printing machinery)","Printing machinery used for printing by means of plates, cylinders and other printing components of heading 8442",28,C,2,73
592,3031200,10.20.13,0303 12 00,10,"(excl. sockeye salmon ""red salmon"")",Frozen Pacific salmon,10,C,2,14


                   pre       rec       spe        f1       geo       iba       sup

         01       0.00      0.00      1.00      0.00      0.00      0.00       417
         02       0.00      0.00      1.00      0.00      0.00      0.00        38
         03       0.00      0.00      1.00      0.00      0.00      0.00       139
         05       0.00      0.00      1.00      0.00      0.00      0.00         5
         06       0.00      0.00      1.00      0.00      0.00      0.00         5
         07       0.00      0.00      1.00      0.00      0.00      0.00        25
         08       0.00      0.00      1.00      0.00      0.00      0.00        71
         10       0.75      0.00      1.00      0.01      0.06      0.00      1748
         11       0.00      0.00      1.00      0.00      0.00      0.00       216
         12       0.00      0.00      1.00      0.00      0.00      0.00        16
         13       0.00      0.00      1.00      0.00      0.00      0.00       731
   

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## Try again with Binary Classifier

In [67]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

In [58]:
CN_test_df[['CN_Code', 'CPA_Code', 'Category_0']].dtypes

CN_Code        int64
CPA_Code      object
Category_0    object
dtype: object

In [59]:
np.array(CN_test_df[this_cat])

array(['1', '1', '1', ..., '10', '10', '10'], dtype=object)

In [52]:
y_CN_test

array(['1', '1', '1', ..., '10', '10', '10'], dtype='<U2')

In [110]:
this_cat = 'Category_0'

#Train the model using the training sets 
X_train = np.array(list(df.Reduced_dim_supervised.values))
y_train_BE = (df[this_cat]=='2')


bin_forest_clf = RandomForestClassifier(random_state=42)
bin_forest_clf.fit(X_train,y_train_BE)

CN_test_df= CN[(CN.Category_2.notnull()) & (CN.CN_Level==10)]
CN_test = np.array(list(CN_test_df["Reduced_dim"].values))
y_CN_pred=bin_forest_clf.predict(CN_test)

Results = CN_test_df.drop(['Descr_cleaned_vectorized'],axis=1).copy()
Results['Expected'] = (Results[this_cat]=='2')
Results['Predicted'] = pd.Series(data=y_CN_pred2.tolist(), index=CN_test_df.index)
print(precision_recall_fscore_support(Results.Expected,  np.array(Results.Predicted), pos_label=True, average='binary'))
dispaly(Results.sample())


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Category_0,Reduced_dim,Expected,Predicted
3,1012100,01.43.11,0101 21 00,10,,Pure-bred breeding horses,01,A,1,"[4.371167182922363, 4.2713141441345215, 3.029484510421753, 3.2516627311706543, 1.0452690124511719, 2.8048088550567627, 0.957646369934082, 2.319648027420044, 4.587272644042969, 5.369235515594482]",False,True
6,1012910,01.43.11,0101 29 10,10,,Horses for slaughter,01,A,1,"[4.993223667144775, 4.722702980041504, 3.9171667098999023, 2.408440589904785, 1.7557017803192139, 3.532912015914917, 1.8845216035842896, 1.152524709701538, 2.9480128288269043, 3.4713973999023438]",False,True
7,1012990,01.43.11,0101 29 90,10,"(excl. for slaughter, pure-bred for breeding)",Live horses,01,A,1,"[7.885985851287842, 4.089168548583984, 0.7208231091499329, 0.6245003938674927, -2.365338087081909, 2.2572553157806396, 1.014506220817566, 1.0964590311050415, 2.9066412448883057, 4.793102264404297]",False,True
8,1013000,01.43.12,0101 30 00,10,,Live asses,01,A,1,"[7.888180255889893, 4.088168621063232, 0.7192676663398743, 0.6175832748413086, -2.367185592651367, 2.2566211223602295, 1.0143322944641113, 1.098036766052246, 2.9084677696228027, 4.7963738441467285]",False,True
10,1019000,01.43.12,0101 90 00,10,,Live mules and hinnies,01,A,1,"[7.760809421539307, 4.844364643096924, 1.0845295190811157, 2.903223991394043, -2.7435245513916016, 1.3028923273086548, 0.5737646818161011, 0.4076416492462158, 1.9680979251861572, 4.820626735687256]",False,True
...,...,...,...,...,...,...,...,...,...,...,...,...
16196,97020000,90.03.13,9702 00 00,10,,"Original engravings, prints and lithographs",90,R,10,"[3.354680061340332, 3.977051258087158, 2.4024713039398193, 2.675119638442993, 1.9343726634979248, 4.1116132736206055, 2.7841665744781494, 1.1603434085845947, -0.6579230427742004, 3.4041452407836914]",False,False
16199,97030000,90.03.13,9703 00 00,10,,"Original sculptures and statuary, in any material",90,R,10,"[2.979419708251953, 3.0503387451171875, 3.7242374420166016, 3.3472957611083984, 4.06452751159668, 3.0019617080688477, 2.8378853797912598, 0.2262466847896576, -0.749163806438446, 3.4150054454803467]",False,True
16200,97040000,91.02.20,9704 00 00,10,"not of current or new issue in which they have, or will have, a recognised face value","Postage or revenue stamps, stamp-postmarks, first-day covers, postal stationery, stamped paper and the like, used, or if unused,",91,R,10,"[3.4894795417785645, 4.183339595794678, 2.3391225337982178, 2.442664384841919, 2.3472044467926025, 4.3478593826293945, 2.925283193588257, 1.1007142066955566, -0.9359190464019775, 3.608502149581909]",False,True
16204,97050000,91.02.20,9705 00 00,10,,"Collections and collector's pieces of zoological, botanical, mineralogical, anatomical, historical, archaeological, palaeontological, ethnographic or numismatic interest",91,R,10,"[3.390747308731079, 4.085256099700928, 2.4493229389190674, 2.4739294052124023, 1.9800584316253662, 3.850285291671753, 2.9312877655029297, 1.2311877012252808, -0.27771446108818054, 3.482422351837158]",False,True


(0.9249356065252055, 0.8554736245036869, 0.8888495992456389, None)

In [104]:
df_tmp = df[df.Category_0=='2'].drop('Descr_cleaned_vectorized', axis=1).copy()
X_train = np.array(list(df_tmp.Reduced_dim_supervised.values))
y_train_C = (df_tmp[this_cat] =='C')
y_train_C

348     False
349     False
352     False
353     False
357     False
        ...  
3456    False
3457    False
3458    False
3459    False
3460    False
Name: Category_1, Length: 2702, dtype: bool

In [119]:

this_cat = 'Category_1'

#Train the model using the training sets 
df_tmp = df[df.Category_0=='2'].copy()
X_train = np.array(list(df_tmp.Reduced_dim_supervised.values))
y_train_C = (df_tmp[this_cat] =='C')


bin_forest_clf = RandomForestClassifier(random_state=42)
bin_forest_clf.fit(X_train,y_train_C)

CN_test_df2= Results[Results.Predicted==True]

CN_test2 = np.array(list(CN_test_df2["Reduced_dim"].values))
y_CN_pred=bin_forest_clf.predict(CN_test)

Results2 = CN_test_df.drop(['Descr_cleaned_vectorized'],axis=1).copy()
Results2['Expected'] = (Results2[this_cat]=='C')
Results2['Predicted'] = pd.Series(data=y_CN_pred2.tolist(), index=CN_test_df.index)
print(precision_recall_fscore_support(Results2.Expected,  np.array(Results2.Predicted), pos_label=True, average='binary'))
#Results2

(0.9069054335827302, 0.8592678675188844, 0.8824442057524765, None)


## Finally, do multi class random forest for "C"

In [128]:
this_cat = 'Category_2'

#Train the model using the training sets 
df_tmp = df[df.Category_1=='C'].copy()
X_train = np.array(list(df_tmp.Reduced_dim_supervised.values))
y_train = (df_tmp[this_cat])

clf.fit(X_train,y_train)

# we stick to the lowest level
CN_test_df3= Results2[Results2.Predicted==True]
CN_test3 = np.array(list(CN_test_df3["Reduced_dim"].values))

y_CN_test3 = (CN_test_df3.Category_2)

y_CN_pred3=clf.predict(CN_test3)


Prediction = CN_test_df3.drop(['Reduced_dim'],axis=1).copy()
Prediction['Predicted'] = pd.Series(data=y_CN_pred3.tolist(), index=CN_test_df3.index)

#print(classification_report_imbalanced(y_CN_test3, y_CN_pred3))

In [125]:
display(Prediction.sample(5))

Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Category_0,Expected,Predicted
1708,8051022,01.23.13,0805 10 22,10,,Fresh navel oranges,1,A,1,False,10
15726,92029080,32.20.12,9202 90 80,10,"(excl. with keyboard, those played with a bow and guitars)","Mandolins, zithers and other string musical instruments",32,C,2,True,33
13735,84749090,28.92.62,8474 90 90,10,(excl. of cast iron or cast steel),Parts of machinery of heading 8474,28,C,2,True,33
12978,84269190,28.22.14,8426 91 90,10,(excl. hydraulic cranes designed for the loading and unloading of vehicles),Cranes designed for mounting on road vehicles,28,C,2,True,33
4803,29062100,20.14.23,2906 21 00,10,,Benzyl alcohol,20,C,2,True,32


## Try the binary clasifier for each C value

In [129]:
df_tmp = df[df.Category_1=='C'].copy()


array(['10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20',
       '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31',
       '32', '33'], dtype=object)

In [188]:
this_cat = 'Category_2'

prediction = Results2[Results2.Predicted==True].drop(['Reduced_dim'],axis=1).copy()
prediction['prediction'] = '0'
#Train the model using the training sets 
for xcat in df_tmp.Category_2.unique():

    df_tmp = df[df.Category_1=='C'].copy()
    X_train = np.array(list(df_tmp.Reduced_dim_supervised.values))
    y_train = (df_tmp[this_cat]==xcat)

    bin_forest_clf = RandomForestClassifier(random_state=42)
    bin_forest_clf.fit(X_train,y_train)

    CN_test= Results2[Results2.Predicted==True]
    CN_t = np.array(list(CN_test["Reduced_dim"].values))
    y_CN_pred=bin_forest_clf.predict(CN_t)
    y_true = (CN_test[this_cat]==xcat)
    
    m = pd.Series(data=y_CN_pred.tolist(), index=CN_test.index)
    prediction['Predict'] = m

    prediction['temp'] = xcat
    prediction['prediction'] = prediction.temp.where(m, prediction.prediction)

    print('Results for ',xcat,'\n',precision_recall_fscore_support(y_true, y_CN_pred, pos_label=True, average='binary'))
    print('number predicted correctly :', len(prediction[(prediction.Category_2==xcat)&(prediction.Predict==True)]),
' number missed :', len(prediction[(prediction.Category_2==xcat)&(prediction.Predict==False)]))
display(prediction.sample(10))

  _warn_prf(average, modifier, msg_start, len(result))


Results for  10 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 1710


  _warn_prf(average, modifier, msg_start, len(result))


Results for  11 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 209
Results for  12 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 16


  _warn_prf(average, modifier, msg_start, len(result))


Results for  13 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 612


  _warn_prf(average, modifier, msg_start, len(result))


Results for  14 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 348
Results for  15 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 154


  _warn_prf(average, modifier, msg_start, len(result))


Results for  16 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 160
Results for  17 
 (0.03466557911908646, 0.46195652173913043, 0.0644916540212443, None)
number predicted correctly : 85  number missed : 99


  _warn_prf(average, modifier, msg_start, len(result))


Results for  18 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 0
Results for  19 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 76
Results for  20 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 1137


  _warn_prf(average, modifier, msg_start, len(result))


Results for  21 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 135


  _warn_prf(average, modifier, msg_start, len(result))


Results for  22 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 151
Results for  23 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 180


  _warn_prf(average, modifier, msg_start, len(result))


Results for  24 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 281


  _warn_prf(average, modifier, msg_start, len(result))


Results for  25 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 197


  _warn_prf(average, modifier, msg_start, len(result))


Results for  26 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 362
Results for  27 
 (0.5, 0.06129032258064516, 0.10919540229885057, None)
number predicted correctly : 19  number missed : 291


  _warn_prf(average, modifier, msg_start, len(result))


Results for  28 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 678


  _warn_prf(average, modifier, msg_start, len(result))


Results for  29 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 133
Results for  30 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 122


  _warn_prf(average, modifier, msg_start, len(result))


Results for  31 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 11


  _warn_prf(average, modifier, msg_start, len(result))


Results for  32 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 228
Results for  33 
 (0.0, 0.0, 0.0, None)
number predicted correctly : 0  number missed : 0


  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Category_0,Expected,Predicted,prediction,Predict,temp
14601,85409900,26.11.40,8540 99 00,10,(excl. parts of cathode ray tubes),"Parts of thermionic, cold cathode or photo cathode valves and tubes, n.e.s.",26,C,2,True,True,33,True,33
3124,20091998,10.32.12,2009 19 98,10,"(excl. containing spirit and frozen, with a value of <= 30 € per 100 kg and with > 30% added sugar) not containing added sugar or other sweetening matter","Orange juice, unfermented, Brix value > 20 but <= 67 at 20°C, whether or",10,C,2,True,True,0,False,33
15068,87149930,30.92.30,8714 99 30,10,,Luggage carriers for bicycles,30,C,2,True,True,33,True,33
4725,29043100,20.14.14,2904 31 00,10,,Perfluorooctane sulphonic acid,20,C,2,True,True,17,False,33
16152,96140090,32.99.41,9614 00 90,10,(excl. roughly shaped blocks of wood for the manufacture of pipes),"Smoking pipes, incl. pipe bowls, cigar or cigarette holders, and parts thereof, n.e.s.",32,C,2,True,True,17,False,33
4610,29012400,20.14.11,2901 24 00,10,,"Buta-1,3-diene and isoprene",20,C,2,True,True,0,False,33
14996,87089993,29.32.30,8708 99 93,10,,"Parts and accessories of closed-die forged steel, for tractors, motor vehicles for the transport of ten or more persons, motor cars and other motor vehicles principally designed for the transport of persons, motor vehicles for the transport of goods and special purpose motor vehicles, n.e.s.",29,C,2,True,True,33,True,33
1700,8043000,01.22.19,0804 30 00,10,,Fresh or dried pineapples,1,A,1,False,True,0,False,33
7745,48089000,17.12.72,4808 90 00,10,"(excl. sack kraft and other kraft paper, and goods of heading 4803)","Paper and paperboard, creped, crinkled, embossed or perforated, in rolls of a width > 36 cm or in square or rectangular sheets with one side > 36 cm and the other side > 15 cm in the unfolded state",17,C,2,True,True,33,True,33
915,3055110,10.20.23,0305 51 10,10,(excl. fillets and offal) not smoked stockfish,"Cod ""Gadus morhua, Gadus ogac, Gadus macrocephalus"", dried, unsalted,",10,C,2,True,True,0,False,33


In [146]:
xcat = '10'
df_tmp = df[df.Category_1=='C'].copy()
X_train = np.array(list(df_tmp.Reduced_dim_supervised.values))
y_train = (df_tmp[this_cat]==xcat)

bin_forest_clf = RandomForestClassifier(random_state=42)
bin_forest_clf.fit(X_train,y_train)

CN_test= Results2[Results2.Predicted==True]
CN_t = np.array(list(CN_test["Reduced_dim"].values))
y_CN_pred=bin_forest_clf.predict(CN_t)
y_true = (CN_test[this_cat]==xcat)

In [147]:
y_CN_pred

array([False, False, False, ..., False, False, False])

In [148]:
y_true

3        False
6        False
7        False
8        False
10       False
         ...  
16184    False
16199    False
16200    False
16204    False
16206    False
Name: Category_2, Length: 8153, dtype: bool

In [152]:
print('Results for ',xcat,'\n',precision_recall_fscore_support(y_true, y_CN_pred, pos_label=False, average='binary'))

Results for  10 
 (0.7902612535263094, 1.0, 0.8828446149630036, None)


In [173]:
xcat = '33'
prediction = Results2[Results2.Predicted==True].drop(['Reduced_dim','Expected','Predicted'],axis=1).copy()
prediction['prediction'] = '0'
#Train the model using the training sets 

df_tmp = df[df.Category_1=='C'].copy()
X_train = np.array(list(df_tmp.Reduced_dim_supervised.values))
y_train = (df_tmp[this_cat]==xcat)

bin_forest_clf = RandomForestClassifier(random_state=42)
bin_forest_clf.fit(X_train,y_train)

CN_test= Results2[Results2.Predicted==True]
CN_t = np.array(list(CN_test["Reduced_dim"].values))
y_CN_pred=bin_forest_clf.predict(CN_t)
y_true = (CN_test[this_cat]==xcat)

m = pd.Series(data=y_CN_pred.tolist(), index=CN_test.index)
prediction['Predict'] = m

prediction['temp'] = xcat
prediction['prediction'] = prediction.temp.where(m, prediction.prediction)
prediction[prediction.Category_2=='10'].sample(5)

Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Category_0,prediction,Predict,temp
3171,20096971,10.32.15,2009 69 71,10,(excl. containing spirit),"Concentrated grape juice, incl. grape must, unfermented, Brix value > 30 but <= 67 at 20°C, value of <= 18 € per 100 kg, containing > 30% added sugar",10,C,2,0,False,33
3568,23032090,10.81.20,2303 20 90,10,(excl. beet pulp),Bagasse and other waste of sugar manufacture,10,C,2,0,False,33
3280,21050091,10.52.10,2105 00 91,10,,"Ice cream and other edible ice, containing >= 3% but < 7% milkfats",10,C,2,0,False,33
1836,8135019,10.39.29,0813 50 19,10,"(excl. mixtures of edible nuts, bananas, dates, figs, pineapples, avocados, guavas, mangoes, mangosteens, citrus fruit and grapes)","Mixtures of dried apricots, apples, peaches, incl. prunus persica nectarina and nectarines, pears, papaws ""papayas"" or other edible and dried fruit, containing prunes",10,C,2,0,False,33
1581,7123200,10.39.13,0712 32 00,10,not further prepared,"Dried wood ears ""Auricularia spp."", whole, cut, sliced, broken or in powder, but",10,C,2,0,False,33


In [183]:
print('32 is correct :', len(prediction[(prediction.Category_2=='32')&(prediction.Predict==True)]),
'\n32 missed :', len(prediction[(prediction.Category_2=='32')&(prediction.Predict==False)]))

32 is correct : 173 
32 missed : 55


In [172]:
m

3        False
6        False
7        False
8        False
10       False
         ...  
16184    False
16199    False
16200    False
16204    False
16206    False
Length: 8153, dtype: bool