
Production-ready code to extract domain-specific features for your AI tasks. Train fast custom XGBoost models optimized for applications in eCommerce.


fLing : Fast Linguistics


The fLing Open Source Project.


Introduction

fLing is a library for extracting task-specific linguistic features from the textual fields of your data. A real-time inference API will be released. This is a beta version of the fLing open source project. For download information and usage manuals, take a look at the notebook files in examples.

Primary modules

  • Data storage in Elasticsearch (distributed/real-time)
  • Customized distance metrics
  • Domain- and task-specific sentence generation
  • Textual clustering
  • Multi-objective ranking
  • Ranking model deployment for live apps

Primary functionalities

  • Pre-process text columns in your dataset with state-of-the-art, task-specific, transformer-based NLP tokenizers.
  • Store data in Elasticsearch and get TF-IDF and BM25 based distance metrics, enhanced to fit your domain- and task-specific goals.
  • Add pretrained word embeddings (word2vec, GloVe, fastText, or custom trained) to convert raw text to document embeddings for non-transformer methods, and use custom embeddings to train custom transformer models.
  • Use domain-enhanced BM25 and TF-IDF based distance methods, designed for specific tasks, as weak learners.
  • Compute clusters and save cluster characteristics in a trained model. Use distance-based metrics in Elasticsearch to cluster documents.
  • Use cluster IDs as a new feature for other supervised and unsupervised tasks (see the sketch below).
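
The cluster IDs from the last point can feed any downstream model. Below is a minimal, hypothetical sketch (not part of fLing's API) that one-hot encodes a cluster-ID column, such as the glove-clusterID column produced in the clustering notebook further down, and trains a scikit-learn classifier on it; the column names and helper are illustrative.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical helper (not part of fLing): treat a cluster-ID column as a
# categorical feature for a downstream classifier.
def score_cluster_feature(df: pd.DataFrame, cluster_col: str, label_col: str) -> float:
    # one-hot encode cluster assignments; unassigned (noise) rows get their own column
    features = pd.get_dummies(df[cluster_col].fillna("noise"), prefix=cluster_col)
    X_train, X_test, y_train, y_test = train_test_split(
        features, df[label_col], test_size=0.2, random_state=42)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.score(X_test, y_test)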

Dependencies

  • pytorch, huggingface/transformers, gensim

Technologies used

  • Sequential Denoising Autoencoders (being added)
  • ClusterGAN (initial edition)

fastboardAI/fling https://github.com/fastboardAI/fling.git

Latest developments are tracked in arnab64/fling: https://github.com/arnab64/fling.git

Example notebook: ADDING PRETRAINED VECTORS, TRAINING VECTORS, CREATING COMBINED VECTORS

# EXAMPLE: classifying SPAM with fLing

import os, sys, glob
import operator, string, argparse, math, random, statistics
import re, pprint
import nltk
from nltk.corpus import stopwords
from collections import Counter
import numpy as np
import pandas as pd
import scipy
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import metrics
from fling import utilities as ut
from fling import tfidfModule as tfm

# load and pre-process (tokenize) the data; you can use other tokenizers as well
os.chdir("/Users/arnabborah/Documents/repositories/fling/")
spamtm = tfm.dataProcessor("datasets/spamTextMessages.csv",None)
spamtm.dataInitial
Category Message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
... ... ...
5567 spam This is the 2nd time we have tried 2 contact u...
5568 ham Will ü b going to esplanade fr home?
5569 ham Pity, * was in mood for that. So...any other s...
5570 ham The guy did some bitching but I acted like i'd...
5571 ham Rofl. Its true to its name

5572 rows × 2 columns

# create a flingTFIDF object to compute TF-IDF and add it as a new column to the data (a pd.DataFrame)
ftf = tfm.flingTFIDF(spamtm.dataInitial,'Message')
ftf.smartTokenizeColumn()
ftf.getTF()
ftf.computeIDFmatrix()
ftf.getTFIDF()

# run the next line only if you are computing distances on the TF-IDF dict
ftf.createDistanceMetadata()
[ ================================================== ] 100.00%
Adding term frequency column based on stopsRemoved
[ ================================================== ] 100.00%
Computing list of words for IDF...

Created list of terms for IDF matrix with 8780  terms.

Computing global IDF matrix...

[ ================================================== ] 100.00%
Computing and adding TF-IDF column based on stopsRemoved
[ ================================================== ] 100.00%
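
For reference, the TF-IDF step above (getTF, computeIDFmatrix, getTFIDF) can be approximated with scikit-learn. This is only a conceptual sketch for comparison, not fLing's implementation; the hypothetical sumTFIDF_sketch column mirrors the sumTFIDF column that appears in the data further down.

from sklearn.feature_extraction.text import TfidfVectorizer

# Conceptual sketch (not fLing's implementation): per-document TF-IDF weights
# plus their row sums, roughly comparable to the 'sumTFIDF' column added above.
def add_tfidf_sum_sketch(df, text_col):
    vec = TfidfVectorizer(stop_words='english')
    weights = vec.fit_transform(df[text_col].fillna(''))
    out = df.copy()
    out['sumTFIDF_sketch'] = weights.sum(axis=1).A1  # row-wise sum of TF-IDF weights
    return out
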
import gensim
from fling import vectorize as vect

# train and add doc2vec vectors based on the column 'Message'
# gensim is required to train doc2vec vectors
vecc = vect.vectorize(ftf.data,'Message')
trained_doc2vec_model = vecc.trainDocVectors()
vecc.addDocVectors()
vecc.data
5572 documents added!
Category Message stopsRemoved tfMatrix sumTFIDF doc2vec
0 ham Go until jurong point, crazy.. Available only ... go jurong point crazy available bugis n great ... word tf tf-idf 0 go 1 ... 38.281443 [0.015742207, 0.0031893118, 0.010138756, -0.08...
1 ham Ok lar... Joking wif u oni... ok lar joking wif u oni word tf tf-idf 0 ok 1 1.31950... 12.583182 [-0.014953367, 0.030154036, 0.017708715, -0.10...
2 spam Free entry in 2 a wkly comp to win FA Cup fina... free entry wkly comp win fa cup final tkts st... word tf tf-idf 0 entry ... 49.524838 [0.008385706, 0.004221165, -2.3364251e-05, -0....
3 ham U dun say so early hor... U c already then say... u dun say early hor u c already say word tf tf-idf 0 u 2 1.669... 16.431526 [0.029679298, 0.06244122, -0.008049136, -0.119...
4 ham Nah I don't think he goes to usf, he lives aro... nah think goes usf lives around though word tf tf-idf 0 nah 1 2.70461... 16.678825 [0.004876227, -0.008055425, 0.0023417333, 0.00...
... ... ... ... ... ... ...
5567 spam This is the 2nd time we have tried 2 contact u... nd time tried contact u u ⣠pound prize cla... word tf tf-idf 0 ... 29.685673 [0.043106798, 0.06623637, -0.010588597, -0.185...
5568 ham Will ü b going to esplanade fr home? 㼠b going esplanade fr home word tf tf-idf 0 㼠1 1... 12.328684 [0.016016621, -0.01830655, 0.016508967, -0.105...
5569 ham Pity, * was in mood for that. So...any other s... pity * mood soany suggestions word tf tf-idf 0 pity ... 15.080331 [-0.18763976, 0.03453686, -0.027078941, -0.055...
5570 ham The guy did some bitching but I acted like i'd... guy bitching acted like i'd interested buying ... word tf tf-idf 0 guy ... 32.770129 [0.009096158, -0.0057535497, 0.004273705, -0.0...
5571 ham Rofl. Its true to its name rofl true name word tf tf-idf 0 rofl 1 3.143951 1 ... 7.558242 [-0.0014662278, 0.009742865, 0.0015902708, -0....

5572 rows × 6 columns
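
vecc.trainDocVectors() relies on gensim's Doc2Vec (gensim is listed as a dependency above). Below is a rough sketch of what such training looks like with gensim directly; the hyperparameters are illustrative and may not match fLing's defaults.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Rough gensim sketch of doc2vec training (hyperparameters are illustrative,
# not fLing's defaults).
def train_doc2vec_sketch(texts, vector_size=100, epochs=20):
    corpus = [TaggedDocument(words=t.lower().split(), tags=[i])
              for i, t in enumerate(texts)]
    model = Doc2Vec(vector_size=vector_size, min_count=2, epochs=epochs)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    return model  # model.dv[i] is the trained vector for document i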

from fling import flingPretrained as fpt

# create a flingPretrained object
# dataProcessed = pd.read_pickle('datasets/data_tfidf_processed.pkl')
fdb = fpt.flingPretrained(vecc.data)
# add pretrained GloVe vectors
fdb.loadPretrainedWordVectors('glove')
fdb.addDocumentGloveVectors()

# add combined vectors of TF-IDF with (GloVe + doc2vec) to inject inter-sentence semantic information
fdb.tfidf2vec('tf-idf','glove')
# fdb.tfidf2vec('tf-idf','doc2vec')
fdb.splitTestTrain()
fdb.dataTrain
Working on pretrained word embeddings!

Loading Glove Model

400000  words loaded!

GloVe Vectors Loaded!

[ ================================================== ] 100.00%
Computing column: vec_tfidf-glove
[ ==                                                 ] 5.81%

/Users/arnabborah/Documents/repositories/fling/fling/flingPretrained.py:237: RuntimeWarning: Mean of empty slice
  return(np.nanmean(docVecList,axis=0))


[ =========================================          ] 83.44%

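The "Mean of empty slice" warning above comes from documents whose tokens all fall outside the GloVe vocabulary, so the mean is taken over an empty list. Below is a minimal sketch of the underlying idea, loading GloVe vectors from a text file and averaging the in-vocabulary word vectors per document; the file path and zero-vector fallback are illustrative assumptions, not fLing's behaviour.

import numpy as np

# Sketch of GloVe-based document vectors (the file path is an assumption).
def load_glove_sketch(path='glove.6B.100d.txt'):
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def doc_vector_sketch(tokens, vectors, dim=100):
    known = [vectors[t] for t in tokens if t in vectors]
    # fall back to a zero vector when no token is in the GloVe vocabulary,
    # which is the case that triggers the 'Mean of empty slice' warning above
    return np.mean(known, axis=0) if known else np.zeros(dim, dtype=np.float32)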

# train group characteristics on column 'Category'
fdb.createGroupedCharacteristics('Category')
for key in fdb.groupedCharacteristic.keys():
    print('Characteristic of',key,'\n',fdb.groupedCharacteristic[key])   
Computing groupCharacteristics for, Category
Characteristic of glove 
 None
Characteristic of vec_tfidf-doc2vec 
 None
Characteristic of vec_tfidf-glove 
                                             vec_tfidf-glove
Category                                                   
ham       [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
spam      [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
Characteristic of doc2vec 
                                                     doc2vec
Category                                                   
ham       [-0.0008339239, 0.008468696, 0.0014372141, -0....
spam      [0.00509379, 0.008787291, -0.0049210927, -0.05...
Characteristic of glove-vector 
                                                glove-vector
Category                                                   
ham       [0.08621057522946847, 0.16108873455431685, 0.1...
spam      [0.038020029286601906, 0.25794960063990663, 0....
Characteristic of glove-tfIDF 
                                                 glove-tfIDF
Category                                                   
ham       [0.08615151890437718, 0.16173257886936682, 0.1...
spam      [0.032436123023218626, 0.24874980733559582, 0....
# predict a vector-based Category for each type of vector added
fdb.addVectorComputedGroup('glove-vector','cGroup_glove')
fdb.addVectorComputedGroup('doc2vec','cGroup_doc2vec')
fdb.addVectorComputedGroup('glove-tfIDF','cGroup_gloveWt_tfidf')
fdb.addVectorComputedGroup('vec_tfidf-glove','cGroup_tfidf-glove')
/Users/arnabborah/Documents/repositories/fling/fling/flingPretrained.py:284: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.dataTest[groupName] = computedGroups
#fdb.addVectorComputedGroup('vec_tfidf-doc2vec','cGroup_tfidf-doc2vec')
fdb.getAccuracy('Category','cGroup_glove')
fdb.getAccuracy('Category','cGroup_doc2vec')
fdb.getAccuracy('Category','cGroup_gloveWt_tfidf')
fdb.getAccuracy('Category','cGroup_tfidf-glove')
Accuracy of cGroup_glove 79.84449760765551 %
Accuracy of cGroup_doc2vec 78.88755980861244 %
Accuracy of cGroup_gloveWt_tfidf 79.90430622009569 %
Accuracy of cGroup_tfidf-glove 0.0 %
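
Conceptually, addVectorComputedGroup assigns each test document to the category whose grouped-characteristic (mean) vector is closest, and getAccuracy compares that assignment with the true label; the 0.0% for cGroup_tfidf-glove is consistent with the all-NaN vec_tfidf-glove characteristic shown above. Below is a minimal nearest-centroid sketch using cosine similarity; the similarity measure is an assumption and not necessarily fLing's.

import numpy as np

# Nearest-centroid sketch: assign each document to the category whose mean
# ("grouped characteristic") vector is most similar, then measure accuracy.
def predict_by_centroid_sketch(doc_vecs, centroids):
    # centroids: dict mapping category -> mean vector (as in the table above)
    labels = list(centroids)
    matrix = np.stack([centroids[label] for label in labels])      # (k, d)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    docs = np.stack(doc_vecs)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = docs @ matrix.T                                         # cosine similarities
    return [labels[i] for i in sims.argmax(axis=1)]

def accuracy_sketch(y_true, y_pred):
    return 100.0 * np.mean(np.asarray(y_true) == np.asarray(y_pred))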

Example notebook: CLUSTERING

import os
import warnings
warnings.filterwarnings('ignore')
from fling import utilities as ut
from fling import tfidfModule as tfm

# change the working directory
os.chdir("/Users/arnabborah/Documents/repositories/textclusteringDBSCAN/scripts/")
# read the .csv data file using the dataProcessor class
rp = tfm.dataProcessor("../datasets/DataAnalyst.csv")
                                  Job Description  Company Name
Industry                                                       
-1                                            353           352
IT Services                                   325           325
Staffing & Outsourcing                        323           323
Health Care Services & Hospitals              151           151
Consulting                                    111           111
...                                           ...           ...
Chemical Manufacturing                          1             1
Pet & Pet Supplies Stores                       1             1
Consumer Product Rental                         1             1
Metals Brokers                                  1             1
News Outlet                                     1             1

[89 rows x 2 columns]
# create a flingTFIDF object around the pre-processed data
ftf = tfm.flingTFIDF(rp.dataInitialSmall,'Job Description')

# tokenization, customizable
ftf.smartTokenizeColumn()

# get the term frequency of each document and add it as an object in a new column
ftf.getTF()

# compute Inverse Document Frequencies across the entire vocabulary
ftf.computeIDFmatrix()

# get TF-IDF and store it as a new column in the data ('tf-idf')
ftf.getTFIDF()

# compute sum of all tf-idf values and add it as a new column
ftf.createDistanceMetadata()
[ ================================================== ] 100.00%
Adding term frequency column based on stopsRemoved
[ ================================================== ] 100.00%
Computing list of words for IDF...

Created list of terms for IDF matrix with 27075  terms.

Computing global IDF matrix...

[ ================================================== ] 100.00%
Computing and adding TF-IDF column based on stopsRemoved
[ ================================================== ] 100.00%
os.chdir("/Users/arnabborah/Documents/repositories/textclusteringDBSCAN/scripts/")
ftf.data.to_pickle('../processFiles/data_tfidf_processed.pkl')
os.chdir("/Users/arnabborah/Documents/repositories/textclusteringDBSCAN/")
# load dataset with tf-idf vectors and load pretrained GloVe word vectors
from fling import flingPretrained as pre
import pandas as pd

dataProcessed = pd.read_pickle('processFiles/data_tfidf_processed.pkl')
fdb = pre.flingPretrained(dataProcessed)
fdb.loadPretrainedWordVectors('glove')

# adding glove vectors for every document
fdb.addDocumentGloveVector()
DBSCAN initialized!

Loading Glove Model

400000  words loaded!

GloVe Vectors Loaded!
# use DBSCAN clustering on the GloVe vectors loaded in the previous step
from fling import flingDBSCAN as fdbscan

fdbscan1 = fdbscan.flingDBSCAN(fdb.data,None,25,'glove')
fdbscan1.dbscanCompute()
fdbscan1.addClusterLabel('glove-clusterID')
fdbscan1.printClusterInfo()
flingDBSCAN initialized!

computing best distance
[ ================================================== ] 100.00%

(plot: best-distance estimation)

Best epsilon computed on GLOVE = 0.6544420699360174 


initiating DBSCAN Clustering with glove vectors

[                                                    ] 0.04%
 ----  cluster_1_ assigned to 565 points! ----
[                                                    ] 0.09%
 ----  cluster_2_ assigned to 855 points! ----
[                                                    ] 0.18%
 ----  cluster_3_ assigned to 58 points! ----
[                                                    ] 0.31%
 ----  cluster_4_ assigned to 119 points! ----
[                                                    ] 0.53%
 ----  cluster_5_ assigned to 109 points! ----
[                                                    ] 1.07%
 ----  cluster_6_ assigned to 53 points! ----
[                                                    ] 1.91%
 ----  cluster_7_ assigned to 37 points! ----
[ =                                                  ] 2.26%
 ----  cluster_8_ assigned to 55 points! ----
[ ===                                                ] 6.79%
 ----  cluster_9_ assigned to 35 points! ----
[ =======                                            ] 15.85%
 ----  cluster_10_ assigned to 32 points! ----
[ ====================                               ] 41.59%
 ----  cluster_11_ assigned to 27 points! ----
[ ================================================== ] 100.00%
 11 clusters formed!
Cluster characteristics:
 -- vectors: glove
 -- minPts: 25
 -- EstimatedBestDistance 0.6544420699360174
 -- 11 clusters formed!
 -- 1945 points assigned to clusters!
 -- 308 noise points!

 -- 13.670661340434975 % noise!
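
flingDBSCAN first estimates a "best distance" (epsilon) from the data and then runs DBSCAN with the given minPts (25 here). Below is a rough sketch of that two-step idea with scikit-learn; the k-distance heuristic used is a common choice and is not necessarily fLing's estimator.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Sketch of the DBSCAN step: estimate epsilon from k-nearest-neighbour distances
# (a common heuristic), then cluster. Not fLing's exact estimator.
def cluster_vectors_sketch(vectors, min_pts=25):
    X = np.stack(vectors)
    knn = NearestNeighbors(n_neighbors=min_pts).fit(X)
    distances, _ = knn.kneighbors(X)
    eps = float(np.median(distances[:, -1]))  # median distance to the min_pts-th neighbour
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)  # -1 marks noise
    return labels, eps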
# convert tf-idf weights into vectors
fdb.tfidf2vec('tf-only')
fdb.tfidf2vec('tf-idf')

# cluster documents based on the tf-idf vectors
fdbscan2 = fdbscan.flingDBSCAN(fdb.data,None,25,'tfidf')
fdbscan2.dbscanCompute()
fdbscan2.addClusterLabel('tfidf-clusterID')
fdbscan2.printClusterInfo() 
flingDBSCAN initialized!

computing best distance
[ ================================================== ] 100.00%

(plot: best-distance estimation)

Best epsilon computed on GLOVE-TFIDF = 1.4628292329952732 


initiating DBSCAN Clustering with tfidf vectors

[                                                    ] 0.04%
 ----  cluster_1_ assigned to 810 points! ----
[                                                    ] 0.09%
 ----  cluster_2_ assigned to 695 points! ----
[                                                    ] 0.31%
 ----  cluster_3_ assigned to 61 points! ----
[                                                    ] 0.93%
 ----  cluster_4_ assigned to 347 points! ----
[ =                                                  ] 3.86%
 ----  cluster_5_ assigned to 26 points! ----
[ =============                                      ] 26.14%
 ----  cluster_6_ assigned to 44 points! ----
[ ================                                   ] 32.45%
 ----  cluster_7_ assigned to 27 points! ----
[ ================================================== ] 100.00%
 7 clusters formed!
Cluster characteristics:
 -- vectors: tfidf
 -- minPts: 25
 -- EstimatedBestDistance 1.4628292329952732
 -- 7 clusters formed!
 -- 1995 points assigned to clusters!
 -- 258 noise points!

 -- 11.451398135818907 % noise!
fdb.data
Job Description Company Name Industry stopsRemoved tfMatrix sumTFIDF glove-vector glove-clusterID tfidf2vec-tf tfidf2vec-tfidf tfidf-clusterID
0 Are you eager to roll up your sleeves and harn... Vera Institute of Justice\n3.2 Social Assistance eager roll sleeves harness data drive policy c... word tf tf-idf 0 data... 811.569328 [0.20507256798029552, 0.05984949950738914, 0.0... cluster_0_ [0.2986073091133004, 0.05040200935960588, 0.09... [0.26263354824176166, -0.023444644206149418, -... cluster_0_
1 Overview\n\nProvides analytical and technical ... Visiting Nurse Service of New York\n3.8 Health Care Services & Hospitals overview provides analytical technical support... word tf tf-idf 0 dat... 415.287583 [0.23643422682926837, -0.055056957317073156, 0... cluster_1_ [0.4055475764227641, -0.07285501829268287, 0.1... [0.35240058786555273, -0.1412004425681622, 0.0... cluster_1_
2 We�re looking for a Senior Data Analyst who ... Squarespace\n3.4 Internet we�re looking senior data analyst love mento... word tf tf-idf 0 data ... 439.815932 [0.155861351576923, 0.11735425461538473, -0.05... cluster_2_ [0.283220747730769, 0.14354892653846157, 0.044... [0.2563749918506738, 0.17575736117618113, -0.0... cluster_2_
3 Requisition NumberRR-0001939\nRemote:Yes\nWe c... Celerity\n4.1 IT Services requisition numberrr remoteyes collaborate cre... word tf tf-idf 0 � ... 569.217931 [0.2306739880813952, 0.09347254534883724, -0.0... cluster_2_ [0.29634610203488354, 0.10983982558139535, 0.0... [0.2966705423736133, 0.028126685382837024, -0.... cluster_2_
4 ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl... FanDuel\n3.9 Sports & Recreation fanduel group fanduel group worldclass team br... word tf tf-idf 0 fanduel... 420.106719 [0.12914707201834857, 0.11582829587155963, 0.0... cluster_3_ [0.17368260871559627, 0.10919291513761473, 0.0... [0.021771101166884813, 0.16355587986765768, -0... None
... ... ... ... ... ... ... ... ... ... ... ...
2248 Maintains systems to protect data from unautho... Avacend, Inc.\n2.5 Staffing & Outsourcing maintains systems protect data unauthorized us... word tf tf-idf 0 ... 43.940807 [0.2738081315789473, -0.001255321052631562, 0.... None [0.2949110263157894, 0.029555310526315794, 0.0... [0.23112386279259817, -0.08318866123802247, -0... cluster_4_
2249 Position:\nSenior Data Analyst (Corporate Audi... Arrow Electronics\n2.9 Wholesale position senior data analyst corporate audit j... word tf tf-idf 0 ... 439.042957 [0.2200468355481728, 0.10710706677740867, 0.04... cluster_1_ [0.3396034966777404, 0.09931764750830561, 0.09... [0.3077493047461843, 0.06387599003189207, 0.06... cluster_1_
2250 Title: Technical Business Analyst (SQL, Data a... Spiceorb -1 title technical business analyst sql data anal... word tf tf-idf 0 busin... 205.978695 [0.36188271052631577, 0.05400915065789475, 0.0... cluster_2_ [0.5060029144736842, 0.04490494473684211, 0.11... [0.45506833532863533, 5.3025424212786644e-05, ... cluster_2_
2251 Summary\n\nResponsible for working cross-funct... Contingent Network Services\n3.1 Enterprise Software & Network Solutions summary responsible working crossfunctionally ... word tf tf-idf 0 ... 364.177527 [0.25247974618181807, 0.07676844581818185, -0.... cluster_2_ [0.34654995709090924, 0.07137524545454547, 0.0... [0.27937433353352015, 0.08437047685035409, -0.... cluster_1_
2252 You.\n\nYou bring your body, mind, heart and s... SCL Health\n3.4 Health Care Services & Hospitals bring body mind heart spirit work senior quali... word tf tf-idf 0 data ... 366.509859 [0.23890638028806577, 0.1815799016460906, -0.0... cluster_2_ [0.3220337218518514, 0.22893831193415645, 0.07... [0.2850343471866271, 0.2451438898926933, -0.08... cluster_2_

2253 rows × 11 columns
