# Text Processing for Accelerator project

A simplified pipeline for processing and classifying text with FastText:

* Load CPA data
* Basic text cleaning
* Vectorize (with FastText)
* Reduce dimension using UMAP, both supervised and unsupervised
* Predict unclassified data

In [1]:
# This should not be necessary if the package was installed with `pip install -e .` in the parent directory
%load_ext autoreload
%autoreload 2

In [2]:
import functools
from pprint import pprint
from time import time
from IPython.display import display, HTML
import logging
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

from sklearn import metrics
import umap

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express

import text_processing

pd.set_option('display.max_colwidth', None)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jupyter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Load FastText Pretrained

Note: loading the model requires a fair amount of memory (usage peaks at about 17.5 GiB).

We recommend shutting down other kernels first. Once the model has loaded, memory usage drops again.

Loading takes a few minutes.

In [3]:
wv = text_processing.fetch_fasstext_pretrained(filepath="../../data/wiki.en.bin")

2020-11-27 13:39:25,522 - text_processing - INFO - Loading FastText pretrained from ../../data/wiki.en.bin
2020-11-27 13:42:50,863 - text_processing - INFO - Model loaded


## Load the CPA data

In [4]:
CPA = text_processing.fetch_files()

2020-11-27 13:44:28,424 - text_processing - INFO - cleanded CPA File imported


In [5]:
CPA1 = CPA[CPA.Level.isin({3,4,5,6})][['Code','Descr_old','Descr','Category_0','Category_1','Category_2']].copy()
df = text_processing.clean_col(CPA1, "Descr")
df.drop('Descr',axis=1,inplace=True)
# choose the category for our main classification
df['Cat'] = df.Category_1

df.sample(5)

2020-11-27 13:44:29,436 - text_processing - INFO - Cleaning column: Descr 


Unnamed: 0,Code,Descr_old,Category_0,Category_1,Category_2,Descr_cleaned,Cat
4066,49.42.19,Other removal services,4,H,49,removal services,H
4674,68.31.11,"Residential buildings and associated land sale services on a fee or contract basis, except of time-share ownership properties",7,L,68,residential buildings associated land sale services fee contract basis,L
3957,47.00.43,Retail trade services of flat glass,4,G,47,retail trade services flat glass,G
3845,46.5,Wholesale trade services of information and communication equipment,4,G,46,wholesale trade services information communication equipment,G
1559,20.59.56,Pickling preparations; fluxes; prepared rubber accelerators; compound plasticisers and stabilisers for rubber or plastics; catalytic preparations n.e.c.; mixed alkylbenzenes and mixed alkylnaphthalenes n.e.c.,2,C,20,pickling preparations fluxes prepared rubber accelerators compound plasticisers stabilisers rubber plastics catalytic preparations mixed alkylbenzenes mixed alkylnaphthalenes,C
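
The sample above suggests that cleaning lowercases the text and removes punctuation and stopwords. The following is a minimal sketch of that idea only. `clean_description` and `STOPWORDS` are hypothetical names, the stopword list is hand-picked for illustration, and this is not the implementation of `text_processing.clean_col`, which appears to do more (for example, dropping "except ..." clauses and "n.e.c.").

```python
import re

# Hypothetical, hand-picked stopword list used only for this illustration.
STOPWORDS = {"a", "and", "for", "in", "of", "on", "or", "other", "the"}


def clean_description(text):
    """Lowercase the text, replace non-letters with spaces, and drop stopwords."""
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    tokens = [tok for tok in letters_only.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(clean_description("Retail trade services of flat glass"))
print(clean_description("Other removal services"))
```

On these two descriptions the sketch reproduces the `Descr_cleaned` values shown in the sample.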


### Vectorize CPA data using FastText

In [7]:
text_to_vec = functools.partial(text_processing.vectorize_text, wv)
df["Descr_cleaned_vectorized"] = df.Descr_cleaned.apply(
    text_to_vec
)
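
One common way to turn a cleaned description into a single vector is to average the vectors of its words. Whether `text_processing.vectorize_text` does exactly this is an assumption. The sketch below uses a toy dictionary in place of the FastText model, and `average_word_vectors` and `toy_vectors` are hypothetical names.

```python
import numpy as np


def average_word_vectors(text, word_vectors, dim):
    """Return the mean of the vectors of the known words in `text`."""
    vectors = [word_vectors[tok] for tok in text.split() if tok in word_vectors]
    if not vectors:
        # No known words: fall back to a zero vector of the right size.
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Toy two-dimensional "embeddings" standing in for the pretrained model.
toy_vectors = {
    "retail": np.array([1.0, 0.0]),
    "trade": np.array([0.0, 1.0]),
}

print(average_word_vectors("retail trade", toy_vectors, dim=2))
```

A real FastText model can also build vectors for unseen words from subword n-grams, so the membership check here is specific to the toy lookup.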

## Dimensionality Reduction using UMAP

### Unsupervised dimension reduction

In [None]:
# df["Descr_cleaned_vectorized_low_dimension"] = text_processing.reduce_dimensionality(df.Descr_cleaned_vectorized)

### Supervised dimension reduction

In [8]:

df["Descr_cleaned_vectorized_low_dimension"] = text_processing.reduce_dimensionality_supervised(df.Descr_cleaned_vectorized, df.Cat)

2020-11-27 13:45:16,171 - text_processing - INFO - Now applying umap to reduce dimension


UMAP(min_dist=0.0, n_components=10, n_neighbors=10, random_state=3052528580,
     verbose=10)
Construct fuzzy simplicial set
Fri Nov 27 13:45:16 2020 Finding Nearest Neighbors
Fri Nov 27 13:45:16 2020 Building RP forest with 9 trees
Fri Nov 27 13:45:16 2020 NN descent for 12 iterations
	 0  /  12
	 1  /  12
	 2  /  12
	 3  /  12
	 4  /  12
	 5  /  12
	 6  /  12
	 7  /  12
	 8  /  12
	 9  /  12
Fri Nov 27 13:45:23 2020 Finished Nearest Neighbor Search




Fri Nov 27 13:45:26 2020 Construct embedding
	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Fri Nov 27 13:45:46 2020 Finished embedding


### Split the CPA data into a training set and a test set

In [10]:
# Split dataset into training set and test set
train_set, test_set = train_test_split(df.copy(), test_size=0.2, random_state=42)
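
If some categories are much rarer than others, a stratified split keeps the class proportions the same in both parts. This is a sketch on toy data, not a change to the split used in this notebook; the labels and sizes below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 80 rows of class "A" and 20 rows of class "B".
features = np.arange(100).reshape(-1, 1)
labels = np.array(["A"] * 80 + ["B"] * 20)

# stratify=labels asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

print((y_te == "A").sum(), (y_te == "B").sum())
```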

## Use the Random Forest Classifier

https://www.datacamp.com/community/tutorials/random-forests-classifier-python

The supervised UMAP reduction above was fitted on the full labelled CPA dataset before the train/test split, so the test rows and their labels contributed to the embedding.   
The test accuracy reported below is therefore likely to be optimistic. For an unbiased estimate, fit UMAP on the training set only and then use it to transform the test set.

We now fit a random forest classifier on the training set and evaluate it on the test set.
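
The order "fit the reducer on training rows, then transform unseen rows" can be sketched with umap-learn as follows. This is an illustration under assumptions, not the code behind the `text_processing` helpers: `encode_labels` and `reduce_train_then_new` are hypothetical names, and the UMAP settings are copied from the configuration printed in the log output (`n_components=10`, `n_neighbors=10`, `min_dist=0.0`).

```python
import numpy as np


def encode_labels(labels):
    """Map string labels to integer codes, a safe input for UMAP's target."""
    classes, codes = np.unique(np.asarray(labels), return_inverse=True)
    return codes, classes


def reduce_train_then_new(train_vectors, train_labels, new_vectors, random_state=42):
    """Fit supervised UMAP on the training rows only, then embed unseen rows."""
    import umap  # provided by the umap-learn package

    codes, _ = encode_labels(train_labels)
    reducer = umap.UMAP(
        n_components=10, n_neighbors=10, min_dist=0.0, random_state=random_state
    )
    # Labels are used only when fitting on the training rows.
    train_low = reducer.fit_transform(np.array(list(train_vectors)), y=codes)
    # Unseen rows are projected into the same space without their labels.
    new_low = reducer.transform(np.array(list(new_vectors)))
    return train_low, new_low
```

A classifier trained on `train_low` can then be applied to `new_low`, because both come from the same fitted reducer.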


In [11]:
# Create a random forest classifier
clf = RandomForestClassifier(n_estimators=100)

# Build the feature and label arrays, then train the model on the training set
X_train = np.array(list(train_set["Descr_cleaned_vectorized_low_dimension"].values))
y_train = np.array(list(train_set.Cat.values))

X_test = np.array(list(test_set["Descr_cleaned_vectorized_low_dimension"].values))
y_test = np.array(list(test_set.Cat.values))

clf.fit(X_train,y_train)

y_pred=clf.predict(X_test)


### Accuracy

In [12]:
# True labels for the test set
y_test = np.array(list(test_set.Cat.values))
# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))


Accuracy: 0.9879963065558633
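
A single accuracy figure can hide errors that are concentrated in a few categories. A confusion matrix and per-class recall make this visible. The sketch below uses made-up labels rather than the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for illustration.
labels = ["A", "C", "E"]
y_true = ["A", "A", "C", "C", "C", "E"]
y_hat = ["A", "C", "C", "C", "C", "C"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=labels)

# Recall per class: correct predictions divided by the number of true rows.
recall_per_class = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(dict(zip(labels, recall_per_class)))
```

Here overall accuracy is 4 out of 6, yet class "E" is never predicted correctly, which the per-class recall shows directly.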


In [13]:
# get a dataframe of the Category_1 descriptions
Cat_descr = text_processing.get_category_description(CPA, 1)
Cat_descr.sample(5)

Unnamed: 0,Category_1,Category_1_descr
4685,M,"PROFESSIONAL, SCIENTIFIC AND TECHNICAL SERVICES"
3618,G,WHOLESALE AND RETAIL TRADE SERVICES; REPAIR SERVICES OF MOTOR VEHICLES AND MOTORCYCLES
0,A,"PRODUCTS OF AGRICULTURE, FORESTRY AND FISHING"
5516,U,SERVICES PROVIDED BY EXTRATERRITORIAL ORGANISATIONS AND BODIES
3307,D,"ELECTRICITY, GAS, STEAM AND AIR CONDITIONING"


In [25]:

print(len(y_pred), len(y_test))
test_set['Predicted'] = pd.Series(data=y_pred.tolist(), index=test_set.index)
non_matches0 = test_set[test_set.Predicted != test_set.Cat].drop('Descr_cleaned_vectorized',axis=1)
non_matches1 = non_matches0.merge(Cat_descr, on='Category_1', how='inner')
non_matches = non_matches1.merge(Cat_descr.rename(columns={'Category_1_descr':'Prediced_Cat_descr', 'Category_1':'Predicted'}),
                                 on='Predicted', how='inner')


1083 1083
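
As an alternative to two merges, category descriptions can be attached with `Series.map`, which keeps the original index and row order of the mismatching rows. This is a sketch on toy data; the frames and the `Cat_descr` and `Predicted_descr` column names are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the description table and the prediction results.
descriptions = pd.DataFrame({
    "Category_1": ["A", "C"],
    "Category_1_descr": [
        "PRODUCTS OF AGRICULTURE, FORESTRY AND FISHING",
        "MANUFACTURED PRODUCTS",
    ],
})
results = pd.DataFrame({
    "Cat": ["A", "C", "A"],
    "Predicted": ["A", "C", "C"],
})

# One description per category code, indexed by the code.
lookup = (
    descriptions.drop_duplicates("Category_1")
    .set_index("Category_1")["Category_1_descr"]
)

# Keep only the rows where the prediction differs from the true category.
mismatches = results[results.Predicted != results.Cat].copy()
mismatches["Cat_descr"] = mismatches.Cat.map(lookup)
mismatches["Predicted_descr"] = mismatches.Predicted.map(lookup)

print(mismatches)
```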


In [26]:
non_matches

Unnamed: 0,Code,Descr_old,Category_0,Category_1,Category_2,Descr_cleaned,Cat,Descr_cleaned_vectorized_low_dimension,Predicted,Category_1_descr,Prediced_Cat_descr
0,46.46.12,"Wholesale trade services of surgical, medical and orthopaedic instruments and devices",4,G,46,wholesale trade services surgical medical orthopaedic instruments devices,G,"[9.310953140258789, 5.6005353927612305, 6.543834686279297, -0.023517046123743057, 4.097960472106934, 7.09716272354126, 4.504067420959473, 4.323415279388428, 8.958893775939941, 7.982893466949463]",C,WHOLESALE AND RETAIL TRADE SERVICES; REPAIR SERVICES OF MOTOR VEHICLES AND MOTORCYCLES,MANUFACTURED PRODUCTS
1,01.26.12,Olives for production of olive oil,1,A,1,olives production olive oil,A,"[7.0381999015808105, 5.458954811096191, 4.1663818359375, 2.1175312995910645, 3.4548428058624268, 7.016144752502441, 5.037139415740967, 3.6507906913757324, 9.4058198928833, 5.878485679626465]",C,"PRODUCTS OF AGRICULTURE, FORESTRY AND FISHING",MANUFACTURED PRODUCTS
2,01.16.11,"Cotton, whether or not ginned",1,A,1,cotton whether,A,"[9.953943252563477, 5.169009685516357, 3.9541127681732178, -1.5118979215621948, 4.058659553527832, 6.438737392425537, 5.8681769371032715, 5.471585273742676, 7.542972564697266, 6.428776264190674]",C,"PRODUCTS OF AGRICULTURE, FORESTRY AND FISHING",MANUFACTURED PRODUCTS
3,38.12.21,Spent (irradiated) fuel elements (cartridges) of nuclear reactors,2,E,38,spent irradiated fuel elements cartridges nuclear reactors,E,"[8.673707962036133, 5.085537433624268, 6.337480545043945, -0.3022427558898926, 4.307536602020264, 7.334295272827148, 3.1383056640625, 4.211902141571045, 8.406109809875488, 8.148168563842773]",C,"WATER SUPPLY; SEWERAGE, WASTE MANAGEMENT AND REMEDIATION SERVICES",MANUFACTURED PRODUCTS
4,38.32.13,Briquettes n.e.c. (produced from several different industrialwastes etc.),2,E,38,briquettesproduced several different industrialwastes etc,E,"[9.494315147399902, 5.150977611541748, 4.825336933135986, -1.0162200927734375, 4.202905178070068, 6.765171051025391, 4.622105121612549, 5.232601165771484, 7.6984710693359375, 7.1002116203308105]",C,"WATER SUPPLY; SEWERAGE, WASTE MANAGEMENT AND REMEDIATION SERVICES",MANUFACTURED PRODUCTS
5,38.12.27,"Waste and scrap of primary cells, primary batteries and electric accumulators",2,E,38,waste scrap primary cells primary batteries electric accumulators,E,"[8.263056755065918, 5.1628289222717285, 3.1491007804870605, 0.3126712441444397, 3.2859604358673096, 6.92125940322876, 4.376079559326172, 4.316834926605225, 8.986431121826172, 6.113201141357422]",C,"WATER SUPPLY; SEWERAGE, WASTE MANAGEMENT AND REMEDIATION SERVICES",MANUFACTURED PRODUCTS
6,71.20.14,Technical inspection services of road transport vehicles,8,M,71,technical inspection services road transport vehicles,M,"[6.769769191741943, 3.2892050743103027, 7.405467987060547, 7.608370304107666, 8.856298446655273, 3.8895153999328613, 0.17763715982437134, 8.51385498046875, -0.4921194314956665, 14.535955429077148]",H,"PROFESSIONAL, SCIENTIFIC AND TECHNICAL SERVICES",TRANSPORTATION AND STORAGE SERVICES
7,93.19.12,Services of athletes,10,R,93,services athletes,R,"[8.332815170288086, 9.459389686584473, 2.354379892349243, 11.285989761352539, 3.7164382934570312, 1.1837884187698364, 1.4744195938110352, 6.738093852996826, 2.795396089553833, 0.20117609202861786]",I,"ARTS, ENTERTAINMENT AND RECREATION SERVICES",ACCOMMODATION AND FOOD SERVICES
8,85.53.11,Car driving school services,9,P,85,car driving school services,P,"[7.294715881347656, 1.0873725414276123, 7.941929340362549, 4.924047946929932, 5.580871105194092, -1.680103063583374, 2.9003400802612305, 6.285297870635986, -2.1978495121002197, 4.077579498291016]",I,EDUCATION SERVICES,ACCOMMODATION AND FOOD SERVICES
9,79.90.32,"Reservation services for convention centres, congress centres and exhibit halls",8,N,79,reservation services convention centres congress centres exhibit halls,N,"[8.553359985351562, 7.7629523277282715, 4.443340301513672, 8.728897094726562, 7.046304225921631, 2.980679988861084, 3.743736982345581, 2.4642460346221924, -1.2671756744384766, 5.511058330535889]",I,ADMINISTRATIVE AND SUPPORT SERVICES,ACCOMMODATION AND FOOD SERVICES


In [43]:
CN = text_processing.fetch_CN_files()
Cat1_Cat2_map = CPA[CPA.Level==2][['Code','Parent']].rename(columns={'Code':'Category_2','Parent':'Category_1'})
CN=CN.merge(Cat1_Cat2_map, on='Category_2', how='left')
CN.sample(3)

2020-11-27 14:10:01,789 - text_processing - INFO - CN new Files imported and cleaned


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1
12149,81083000,38.11.58,8108 30,7,(excl. ash and residues containing titanium),Titanium waste and scrap,38,E
4128,28012000,20.13.21,2801 20,7,,Iodine,20,C
14678,85442000,27.32.12,8544 20,7,,"Coaxial cable and other coaxial electric conductors, insulated",27,C
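
When attaching a parent category through a left merge, it can help to check that the mapping has one row per key and to see which rows found no match. The sketch below uses toy frames and pandas' `validate` and `indicator` options; the codes are made up, and the key columns must have compatible dtypes for the merge to work.

```python
import pandas as pd

# Toy stand-ins for the CN table and the Category_2 -> Category_1 mapping.
cn = pd.DataFrame({
    "CN_Code": ["28012000", "81083000", "99999999"],
    "Category_2": ["20", "38", "00"],
})
mapping = pd.DataFrame({
    "Category_2": ["20", "38"],
    "Category_1": ["C", "E"],
})

# validate raises if a key appears more than once in the mapping;
# indicator adds a "_merge" column showing where each row came from.
merged = cn.merge(
    mapping, on="Category_2", how="left", validate="many_to_one", indicator=True
)
unmatched = merged[merged["_merge"] == "left_only"]

print(unmatched[["CN_Code", "Category_2"]])
```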


In [44]:
CN["Descr_cleaned_vectorized"] = CN.CN_Description_cleaned.apply(
    text_to_vec
)

CN["Descr_cleaned_vectorized_low_dimension"] = text_processing.train_predict_umap(
    train_set.Descr_cleaned_vectorized, train_set.Cat, CN.Descr_cleaned_vectorized)

#CN["Descr_cleaned_vectorized_low_dimension"] = text_processing.reduce_dimensionality(CN.Descr_cleaned_vectorized)

2020-11-27 14:10:26,067 - text_processing - INFO - Now applying umap to reduce dimension


UMAP(min_dist=0.0, n_components=10, n_neighbors=10, random_state=3052528580,
     verbose=10)
Construct fuzzy simplicial set
Fri Nov 27 14:10:26 2020 Finding Nearest Neighbors
Fri Nov 27 14:10:26 2020 Building RP forest with 8 trees
Fri Nov 27 14:10:26 2020 NN descent for 12 iterations
	 0  /  12
	 1  /  12
	 2  /  12
	 3  /  12
	 4  /  12
	 5  /  12
	 6  /  12
	 7  /  12
	 8  /  12
	 9  /  12
	 10  /  12
Fri Nov 27 14:10:26 2020 Finished Nearest Neighbor Search
Fri Nov 27 14:10:26 2020 Construct embedding




	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Fri Nov 27 14:10:42 2020 Finished embedding
	completed  0  /  30 epochs
	completed  3  /  30 epochs
	completed  6  /  30 epochs
	completed  9  /  30 epochs
	completed  12  /  30 epochs
	completed  15  /  30 epochs
	completed  18  /  30 epochs
	completed  21  /  30 epochs
	completed  24  /  30 epochs
	completed  27  /  30 epochs


In [45]:
CN_test_df = CN[(CN.Category_2.notnull()) & (CN.CN_Level==4)].drop('Descr_cleaned_vectorized',axis=1)
CN_test_df.sample(2)

Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Descr_cleaned_vectorized_low_dimension
2275,12130000,01.11.50,1213,4,", whether or not chopped,","Cereal straw and husks, unprepared ground, pressed or in the form of pellets",1,A,"[6.974757671356201, 3.1228458881378174, 3.402285575866699, 2.1919214725494385, -1.1207762956619263, 2.4246151447296143, 5.346846580505371, 5.771240711212158, 7.522921085357666, 7.538023948669434]"
1326,4090000,01.49.21,409,4,,Natural honey,1,A,"[6.966550827026367, 1.889949917793274, 3.576120376586914, 3.171947717666626, -0.8104632496833801, 3.893202066421509, 4.78844690322876, 4.473344802856445, 5.786776542663574, 7.159286975860596]"


In [47]:
CN_test_df = CN[(CN.Category_2.notnull()) & (CN.CN_Level==4)].drop('Descr_cleaned_vectorized',axis=1)
CN_test = np.array(list(CN_test_df["Descr_cleaned_vectorized_low_dimension"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

# True Category_1 labels for the selected CN rows
y_CN_test = np.array(list(CN_test_df.Category_1))
# Model accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))

CN_test_df['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)
#CN_test_df.sample(10)

Accuracy: 0.7348066298342542


Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Descr_cleaned_vectorized_low_dimension,Predicted
12065,79070000,25.99.29,7907,4,,"Articles of zinc, n.e.s.",25,C,"[6.8997344970703125, 2.7892704010009766, 3.704699993133545, 1.9100350141525269, -1.2090647220611572, 2.3195607662200928, 5.403890609741211, 5.735423564910889, 7.214788913726807, 7.670832633972168]",C
3794,25210000,08.11.20,2521,4,,"Limestone flux; limestone and other calcareous stone, of a kind used for the manufacture of lime or cement",8,B,"[6.9647674560546875, 3.2986114025115967, 3.478912591934204, 2.210007667541504, -1.097478985786438, 2.311497449874878, 5.433619976043701, 5.923828125, 7.674107074737549, 7.5279154777526855]",C
5783,32110000,20.30.22,3211,4,,Prepared driers,20,C,"[6.989763259887695, 3.702418327331543, 3.5404562950134277, 1.9265789985656738, -0.807572603225708, 2.3073172569274902, 5.717047691345215, 6.157283306121826, 7.56751012802124, 7.605450630187988]",C
3981,27060000,19.10.20,2706,4,"not dehydrated or partially distilled, incl.","Tar distilled from coal, from lignite or from peat, and other mineral tars, whether or reconstituted tars",19,C,"[6.890868186950684, 3.1043508052825928, 3.639850378036499, 2.384371519088745, -1.1566147804260254, 2.2625045776367188, 5.379957675933838, 5.778299808502197, 7.629493713378906, 7.495373725891113]",C
5730,32050000,20.12.21,3205,4,(other than Chinese or Japanese lacquer and paints),Colour lakes ; preparations based on colour lakes of a kind used to dye fabrics or produce colorant preparations,20,C,"[7.298078536987305, 2.9872918128967285, 3.5680882930755615, 2.9959404468536377, -1.4463539123535156, 2.647268533706665, 5.0178985595703125, 6.2552490234375, 8.108952522277832, 7.408880710601807]",C
10754,71110000,24.41.50,7111,4,not further worked than semi-manufactured,"Base metals, silver or gold, clad with platinum,",24,C,"[6.819460868835449, 2.688161611557007, 3.773958444595337, 1.8268553018569946, -1.1484134197235107, 2.341683864593506, 5.469573497772217, 5.639402866363525, 7.1116862297058105, 7.7428178787231445]",C
6283,38210000,20.59.52,3821,4,,"Prepared culture media for the development or maintenance of micro-organisms ""incl. viruses and the like"" or of plant, human or animal cells",20,C,"[7.294963836669922, 3.0441555976867676, 3.5986063480377197, 3.085550308227539, -1.316444993019104, 2.786898136138916, 5.1099042892456055, 6.331294536590576, 8.121070861816406, 7.405853271484375]",C
6265,38160000,23.20.13,3816,4,(excl. preparations based on graphite or other carbonaceous substances),"Refractory cements, mortars, concretes and similar compositions",23,C,"[6.995981693267822, 3.419543981552124, 3.4109036922454834, 2.2536017894744873, -1.1123594045639038, 2.277566909790039, 5.404903888702393, 5.987984657287598, 7.809151649475098, 7.49137020111084]",C
15789,93040000,25.40.12,9304,4,"(excl. swords, cutlasses, bayonettes and similar arms of heading 9307)","Spring, air or gas guns and pistols, truncheons and other non-firearms",25,C,"[6.9927239418029785, 3.848196268081665, 3.421250581741333, 2.1580443382263184, -0.8386797904968262, 2.197298765182495, 5.684571743011475, 6.25738000869751, 7.979743480682373, 7.543983459472656]",C
12050,79020000,38.11.58,7902,4,"(excl. ash and residues from zinc production ""heading 2620"", ingots and other similar unwrought shapes, of remelted waste and scrap, of zinc ""heading 7901"" and waste and scrap of primary cells, primary batteries and electric accumulators)",Zinc waste and scrap,38,E,"[6.53981351852417, 3.5141782760620117, 3.2348785400390625, 1.8661344051361084, -0.7140233516693115, 2.3796579837799072, 5.7668561935424805, 5.679171085357666, 7.50644063949585, 7.637502670288086]",C


In [54]:
non_matches = CN_test_df[CN_test_df.Predicted != CN_test_df.Category_1].merge(Cat_descr.rename(columns={'Category_1':'Predicted',
             'Category_1_descr':'Predicted_descr'}),  on='Predicted',how='left').merge(Cat_descr, on='Category_1', how='left')
non_matches.sample(5)

Unnamed: 0,CN_Code,CPA_Code,CN_Section,CN_Level,Excl_removed,CN_Description_cleaned,Category_2,Category_1,Descr_cleaned_vectorized_low_dimension,Predicted,Predicted_descr,Category_1_descr
9,25020000,08.91.12,2502,4,,Unroasted iron pyrites,8,B,"[7.336482048034668, 2.7513153553009033, 4.382230758666992, 3.5765457153320312, 0.6826784610748291, 2.194370985031128, 4.7237324714660645, 5.172999858856201, 8.621283531188965, 8.172290802001953]",C,MANUFACTURED PRODUCTS,MINING AND QUARRYING
0,4090000,01.49.21,409,4,,Natural honey,1,A,"[6.966550827026367, 1.889949917793274, 3.576120376586914, 3.171947717666626, -0.8104632496833801, 3.893202066421509, 4.78844690322876, 4.473344802856445, 5.786776542663574, 7.159286975860596]",C,MANUFACTURED PRODUCTS,"PRODUCTS OF AGRICULTURE, FORESTRY AND FISHING"
45,97040000,91.02.20,9704,4,"not of current or new issue in which they have, or will have, a recognised face value","Postage or revenue stamps, stamp-postmarks, first-day covers, postal stationery, stamped paper and the like, used, or if unused,",91,R,"[7.155947208404541, 3.4886226654052734, 3.5725913047790527, 2.449514150619507, -1.0812259912490845, 2.385221481323242, 5.542203903198242, 6.507355213165283, 8.090235710144043, 7.561448097229004]",C,MANUFACTURED PRODUCTS,"ARTS, ENTERTAINMENT AND RECREATION SERVICES"
30,40040000,38.11.54,4004,4,,"Waste, parings and scrap of soft rubber and powders and granules obtained therefrom",38,E,"[6.818662643432617, 3.144841194152832, 3.7023751735687256, 2.318312168121338, -1.0802149772644043, 2.228505849838257, 5.515503883361816, 5.83602237701416, 7.630491733551025, 7.534660339355469]",C,MANUFACTURED PRODUCTS,"WATER SUPPLY; SEWERAGE, WASTE MANAGEMENT AND REMEDIATION SERVICES"
12,25140000,08.11.40,2514,4,"not roughly trimmed or merely cut, by sawing or otherwise, into blocks or slabs of a square or rectangular shape;","Slate, whether or slate powder and slate refuse",8,B,"[7.017999649047852, 3.441309928894043, 3.61346435546875, 2.462878704071045, -1.026330590248108, 2.34023118019104, 5.553254127502441, 6.253854751586914, 7.96026611328125, 7.518208980560303]",C,MANUFACTURED PRODUCTS,MINING AND QUARRYING


In [55]:
### repeat the above on level 7

CN_test_df = CN[(CN.Category_2.notnull()) & (CN.CN_Level==7)].drop('Descr_cleaned_vectorized',axis=1)
CN_test = np.array(list(CN_test_df["Descr_cleaned_vectorized_low_dimension"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

# True Category_1 labels for the selected CN rows
y_CN_test = np.array(list(CN_test_df.Category_1))
# Model accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))

CN_test_df['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)

Accuracy: 0.869745575221239


In [56]:
### repeat the above on level 10

CN_test_df = CN[(CN.Category_2.notnull()) & (CN.CN_Level==10)].drop('Descr_cleaned_vectorized',axis=1)
CN_test = np.array(list(CN_test_df["Descr_cleaned_vectorized_low_dimension"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

# True Category_1 labels for the selected CN rows
y_CN_test = np.array(list(CN_test_df.Category_1))
# Model accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))

CN_test_df['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)

Accuracy: 0.8868203934842395


In [57]:
### repeat the above on all levels

CN_test_df = CN[(CN.Category_2.notnull())].drop('Descr_cleaned_vectorized',axis=1)
CN_test = np.array(list(CN_test_df["Descr_cleaned_vectorized_low_dimension"].values))
#y_test = np.array(list(test_set.Cat.values))
y_CN_pred=clf.predict(CN_test)

# True Category_1 labels for the selected CN rows
y_CN_test = np.array(list(CN_test_df.Category_1))
# Model accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_CN_test, y_CN_pred))

CN_test_df['Predicted'] = pd.Series(data=y_CN_pred.tolist(), index=CN_test_df.index)

Accuracy: 0.8800845219228738
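
The per-level cells above repeat the same steps, so they could be folded into one helper that reports accuracy for each level. This is a sketch: `accuracy_by_level` and `ThresholdClassifier` are hypothetical names, and the toy frame and stand-in classifier below replace the notebook's `CN` data and fitted `clf`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score


def accuracy_by_level(frame, classifier, feature_col, label_col, level_col):
    """Return one row per level with the row count and prediction accuracy."""
    rows = []
    for level, group in frame.groupby(level_col):
        features = np.array(group[feature_col].tolist())
        predicted = classifier.predict(features)
        rows.append({
            "level": level,
            "n_rows": len(group),
            "accuracy": accuracy_score(group[label_col], predicted),
        })
    return pd.DataFrame(rows)


class ThresholdClassifier:
    """Stand-in for a fitted model: predicts "C" when the first feature is positive."""

    def predict(self, features):
        return np.where(features[:, 0] > 0, "C", "A")


toy = pd.DataFrame({
    "CN_Level": [4, 4, 7, 7],
    "vector": [[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [2.0, 0.0]],
    "Category_1": ["C", "C", "C", "C"],
})

report = accuracy_by_level(toy, ThresholdClassifier(), "vector", "Category_1", "CN_Level")
print(report)
```

With the notebook's data, the call would take the filtered `CN` frame, `clf`, and the column names `Descr_cleaned_vectorized_low_dimension`, `Category_1` and `CN_Level`.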
