
Production-ready code to extract domain-specific features for your AI tasks. Train fast custom XGBoost models optimized for applications in eCommerce.


fLing : Fast Linguistics


The fLing Open Source Project.


Introduction

fLing is a library for extracting task-specific linguistic features from the textual fields of your data. A real-time inference API will be released. This is a beta version of the fLing open source project. For download information and usage manuals, take a look at the notebook files in examples.

Primary modules

  • Data storage in Elasticsearch (distributed/real-time)
  • Customized distance metrics
  • Domain- and task-specific sentence generation
  • Textual clustering
  • Multi-objective ranking
  • Ranking model deployment for live apps

Primary functionalities

  • Pre-process text columns in your dataset with state-of-the-art, task-specific, transformer-based NLP tokenizers.
  • Store data in Elasticsearch and get TF-IDF and BM25 based distance metrics, enhanced to fit your domain- and task-specific goals.
  • Add pretrained word embeddings (word2vec, GloVe, fastText, or custom trained) to convert raw text to document embeddings for non-transformer methods, and use custom embeddings to train custom transformer models.
  • Use domain-enhanced BM25 and TF-IDF based distance methods, designed for specific tasks, as weak learners.
  • Compute clusters and save cluster characteristics in a trained model. Use distance-based metrics in Elasticsearch to cluster documents.
  • Use cluster IDs as a new feature for other supervised and unsupervised tasks (see the sketch below).
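
The cluster IDs from the last point can feed any downstream model. Below is a minimal, hypothetical sketch (not part of fLing's API) that one-hot encodes a cluster-ID column, such as the glove-clusterID column produced in the clustering notebook further down, and trains a scikit-learn classifier on it; the column names and helper are illustrative.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical helper (not part of fLing): treat a cluster-ID column as a
# categorical feature for a downstream classifier.
def score_cluster_feature(df: pd.DataFrame, cluster_col: str, label_col: str) -> float:
    # one-hot encode cluster assignments; unassigned (noise) rows get their own column
    features = pd.get_dummies(df[cluster_col].fillna("noise"), prefix=cluster_col)
    X_train, X_test, y_train, y_test = train_test_split(
        features, df[label_col], test_size=0.2, random_state=42)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.score(X_test, y_test)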

Dependencies

  • pytorch, huggingface/transformers, gensim

Technologies used

  • Sequential Denoising Autoencoders (being added)
  • ClusterGAN (initial edition)

fastboardAI/fling https://github.com/fastboardAI/fling.git

Latest developments are tracked in arnab64/fling: https://github.com/arnab64/fling.git

Example notebook: ADDING PRETRAINED VECTORS, TRAINING VECTORS, CREATING COMBINED VECTORS

# EXAMPLE: classifying SPAM with fLing

import os, sys, glob
import operator, string, argparse, math, random, statistics
import re, pprint
import nltk
from nltk.corpus import stopwords
from collections import Counter
import numpy as np
import pandas as pd
import scipy
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import metrics
from fling import utilities as ut
from fling import tfidfModule as tfm

# load and pre-process (tokenize) the data; you can use other tokenizers as well
os.chdir("/Users/arnabborah/Documents/repositories/fling/")
spamtm = tfm.dataProcessor("datasets/spamTextMessages.csv",None)
spamtm.dataInitial
Category Message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
... ... ...
5567 spam This is the 2nd time we have tried 2 contact u...
5568 ham Will ü b going to esplanade fr home?
5569 ham Pity, * was in mood for that. So...any other s...
5570 ham The guy did some bitching but I acted like i'd...
5571 ham Rofl. Its true to its name

5572 rows × 2 columns

# create a flingTFIDF object to compute TF-IDF and add it as a new column to the data (a pd.DataFrame)
ftf = tfm.flingTFIDF(spamtm.dataInitial,'Message')
ftf.smartTokenizeColumn()
ftf.getTF()
ftf.computeIDFmatrix()
ftf.getTFIDF()

# run the next line only if you are computing distances on the TF-IDF dict
ftf.createDistanceMetadata()
[ ================================================== ] 100.00%
Adding term frequency column based on stopsRemoved
[ ================================================== ] 100.00%
Computing list of words for IDF...

Created list of terms for IDF matrix with 8780  terms.

Computing global IDF matrix...

[ ================================================== ] 100.00%
Computing and adding TF-IDF column based on stopsRemoved
[ ================================================== ] 100.00%
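
For reference, the TF-IDF step above (getTF, computeIDFmatrix, getTFIDF) can be approximated with scikit-learn. This is only a conceptual sketch for comparison, not fLing's implementation; the hypothetical sumTFIDF_sketch column mirrors the sumTFIDF column that appears in the data further down.

from sklearn.feature_extraction.text import TfidfVectorizer

# Conceptual sketch (not fLing's implementation): per-document TF-IDF weights
# plus their row sums, roughly comparable to the 'sumTFIDF' column added above.
def add_tfidf_sum_sketch(df, text_col):
    vec = TfidfVectorizer(stop_words='english')
    weights = vec.fit_transform(df[text_col].fillna(''))
    out = df.copy()
    out['sumTFIDF_sketch'] = weights.sum(axis=1).A1  # row-wise sum of TF-IDF weights
    return out
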
import gensim
from fling import vectorize as vect

# train and add doc2vec vectors based on the column 'Message'
# gensim is required to train doc2vec vectors
vecc = vect.vectorize(ftf.data,'Message')
trained_doc2vec_model = vecc.trainDocVectors()
vecc.addDocVectors()
vecc.data
5572 documents added!
Category Message stopsRemoved tfMatrix sumTFIDF doc2vec
0 ham Go until jurong point, crazy.. Available only ... go jurong point crazy available bugis n great ... word tf tf-idf 0 go 1 ... 38.281443 [0.015742207, 0.0031893118, 0.010138756, -0.08...
1 ham Ok lar... Joking wif u oni... ok lar joking wif u oni word tf tf-idf 0 ok 1 1.31950... 12.583182 [-0.014953367, 0.030154036, 0.017708715, -0.10...
2 spam Free entry in 2 a wkly comp to win FA Cup fina... free entry wkly comp win fa cup final tkts st... word tf tf-idf 0 entry ... 49.524838 [0.008385706, 0.004221165, -2.3364251e-05, -0....
3 ham U dun say so early hor... U c already then say... u dun say early hor u c already say word tf tf-idf 0 u 2 1.669... 16.431526 [0.029679298, 0.06244122, -0.008049136, -0.119...
4 ham Nah I don't think he goes to usf, he lives aro... nah think goes usf lives around though word tf tf-idf 0 nah 1 2.70461... 16.678825 [0.004876227, -0.008055425, 0.0023417333, 0.00...
... ... ... ... ... ... ...
5567 spam This is the 2nd time we have tried 2 contact u... nd time tried contact u u ⣠pound prize cla... word tf tf-idf 0 ... 29.685673 [0.043106798, 0.06623637, -0.010588597, -0.185...
5568 ham Will ü b going to esplanade fr home? 㼠b going esplanade fr home word tf tf-idf 0 㼠1 1... 12.328684 [0.016016621, -0.01830655, 0.016508967, -0.105...
5569 ham Pity, * was in mood for that. So...any other s... pity * mood soany suggestions word tf tf-idf 0 pity ... 15.080331 [-0.18763976, 0.03453686, -0.027078941, -0.055...
5570 ham The guy did some bitching but I acted like i'd... guy bitching acted like i'd interested buying ... word tf tf-idf 0 guy ... 32.770129 [0.009096158, -0.0057535497, 0.004273705, -0.0...
5571 ham Rofl. Its true to its name rofl true name word tf tf-idf 0 rofl 1 3.143951 1 ... 7.558242 [-0.0014662278, 0.009742865, 0.0015902708, -0....

5572 rows × 6 columns
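
vecc.trainDocVectors() relies on gensim's Doc2Vec (gensim is listed as a dependency above). Below is a rough sketch of what such training looks like with gensim directly; the hyperparameters are illustrative and may not match fLing's defaults.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Rough gensim sketch of doc2vec training (hyperparameters are illustrative,
# not fLing's defaults).
def train_doc2vec_sketch(texts, vector_size=100, epochs=20):
    corpus = [TaggedDocument(words=t.lower().split(), tags=[i])
              for i, t in enumerate(texts)]
    model = Doc2Vec(vector_size=vector_size, min_count=2, epochs=epochs)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    return model  # model.dv[i] is the trained vector for document i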

from fling import flingPretrained as fpt

# create a flingPretrained object
# dataProcessed = pd.read_pickle('datasets/data_tfidf_processed.pkl')
fdb = fpt.flingPretrained(vecc.data)
# add pretrained GloVe vectors
fdb.loadPretrainedWordVectors('glove')
fdb.addDocumentGloveVectors()

# add combined vectors of TF-IDF with (GloVe + doc2vec) to inject inter-sentence semantic information
fdb.tfidf2vec('tf-idf','glove')
# fdb.tfidf2vec('tf-idf','doc2vec')
fdb.splitTestTrain()
fdb.dataTrain
Working on pretrained word embeddings!

Loading Glove Model

400000  words loaded!

GloVe Vectors Loaded!

[ ================================================== ] 100.00%
Computing column: vec_tfidf-glove
[ ==                                                 ] 5.81%

/Users/arnabborah/Documents/repositories/fling/fling/flingPretrained.py:237: RuntimeWarning: Mean of empty slice
  return(np.nanmean(docVecList,axis=0))


[ =========================================          ] 83.44%

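The "Mean of empty slice" warning above comes from documents whose tokens all fall outside the GloVe vocabulary, so the mean is taken over an empty list. Below is a minimal sketch of the underlying idea, loading GloVe vectors from a text file and averaging the in-vocabulary word vectors per document; the file path and zero-vector fallback are illustrative assumptions, not fLing's behaviour.

import numpy as np

# Sketch of GloVe-based document vectors (the file path is an assumption).
def load_glove_sketch(path='glove.6B.100d.txt'):
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def doc_vector_sketch(tokens, vectors, dim=100):
    known = [vectors[t] for t in tokens if t in vectors]
    # fall back to a zero vector when no token is in the GloVe vocabulary,
    # which is the case that triggers the 'Mean of empty slice' warning above
    return np.mean(known, axis=0) if known else np.zeros(dim, dtype=np.float32)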

# train group characteristics on column 'Category'
fdb.createGroupedCharacteristics('Category')
for key in fdb.groupedCharacteristic.keys():
    print('Characteristic of',key,'\n',fdb.groupedCharacteristic[key])   
Computing groupCharacteristics for, Category
Characteristic of glove 
 None
Characteristic of vec_tfidf-doc2vec 
 None
Characteristic of vec_tfidf-glove 
                                             vec_tfidf-glove
Category                                                   
ham       [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
spam      [nan, nan, nan, nan, nan, nan, nan, nan, nan, ...
Characteristic of doc2vec 
                                                     doc2vec
Category                                                   
ham       [-0.0008339239, 0.008468696, 0.0014372141, -0....
spam      [0.00509379, 0.008787291, -0.0049210927, -0.05...
Characteristic of glove-vector 
                                                glove-vector
Category                                                   
ham       [0.08621057522946847, 0.16108873455431685, 0.1...
spam      [0.038020029286601906, 0.25794960063990663, 0....
Characteristic of glove-tfIDF 
                                                 glove-tfIDF
Category                                                   
ham       [0.08615151890437718, 0.16173257886936682, 0.1...
spam      [0.032436123023218626, 0.24874980733559582, 0....
# predict a vector-based Category for each type of vector added
fdb.addVectorComputedGroup('glove-vector','cGroup_glove')
fdb.addVectorComputedGroup('doc2vec','cGroup_doc2vec')
fdb.addVectorComputedGroup('glove-tfIDF','cGroup_gloveWt_tfidf')
fdb.addVectorComputedGroup('vec_tfidf-glove','cGroup_tfidf-glove')
/Users/arnabborah/Documents/repositories/fling/fling/flingPretrained.py:284: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.dataTest[groupName] = computedGroups
#fdb.addVectorComputedGroup('vec_tfidf-doc2vec','cGroup_tfidf-doc2vec')
fdb.getAccuracy('Category','cGroup_glove')
fdb.getAccuracy('Category','cGroup_doc2vec')
fdb.getAccuracy('Category','cGroup_gloveWt_tfidf')
fdb.getAccuracy('Category','cGroup_tfidf-glove')
Accuracy of cGroup_glove 79.84449760765551 %
Accuracy of cGroup_doc2vec 78.88755980861244 %
Accuracy of cGroup_gloveWt_tfidf 79.90430622009569 %
Accuracy of cGroup_tfidf-glove 0.0 %
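
Conceptually, addVectorComputedGroup assigns each test document to the category whose grouped-characteristic (mean) vector is closest, and getAccuracy compares that assignment with the true label; the 0.0% for cGroup_tfidf-glove is consistent with the all-NaN vec_tfidf-glove characteristic shown above. Below is a minimal nearest-centroid sketch using cosine similarity; the similarity measure is an assumption and not necessarily fLing's.

import numpy as np

# Nearest-centroid sketch: assign each document to the category whose mean
# ("grouped characteristic") vector is most similar, then measure accuracy.
def predict_by_centroid_sketch(doc_vecs, centroids):
    # centroids: dict mapping category -> mean vector (as in the table above)
    labels = list(centroids)
    matrix = np.stack([centroids[label] for label in labels])      # (k, d)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    docs = np.stack(doc_vecs)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = docs @ matrix.T                                         # cosine similarities
    return [labels[i] for i in sims.argmax(axis=1)]

def accuracy_sketch(y_true, y_pred):
    return 100.0 * np.mean(np.asarray(y_true) == np.asarray(y_pred))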

Example notebook: CLUSTERING

import os
import warnings
warnings.filterwarnings('ignore')
from fling import utilities as ut
from fling import tfidfModule as tfm

# change the working directory
os.chdir("/Users/arnabborah/Documents/repositories/textclusteringDBSCAN/scripts/")
# read the .csv data file using the dataProcessor class
rp = tfm.dataProcessor("../datasets/DataAnalyst.csv")
                                  Job Description  Company Name
Industry                                                       
-1                                            353           352
IT Services                                   325           325
Staffing & Outsourcing                        323           323
Health Care Services & Hospitals              151           151
Consulting                                    111           111
...                                           ...           ...
Chemical Manufacturing                          1             1
Pet & Pet Supplies Stores                       1             1
Consumer Product Rental                         1             1
Metals Brokers                                  1             1
News Outlet                                     1             1

[89 rows x 2 columns]
# create a flingTFIDF object around the pre-processed data
ftf = tfm.flingTFIDF(rp.dataInitialSmall,'Job Description')

# tokenization, customizable
ftf.smartTokenizeColumn()

# get the term frequency of each document and add it as an object in a new column
ftf.getTF()

# compute Inverse Document Frequencies across the entire vocabulary
ftf.computeIDFmatrix()

# get TF-IDF and store it as a new column in the data ('tf-idf')
ftf.getTFIDF()

# compute sum of all tf-idf values and add it as a new column
ftf.createDistanceMetadata()
[ ================================================== ] 100.00%
Adding term frequency column based on stopsRemoved
[ ================================================== ] 100.00%
Computing list of words for IDF...

Created list of terms for IDF matrix with 27075  terms.

Computing global IDF matrix...

[ ================================================== ] 100.00%
Computing and adding TF-IDF column based on stopsRemoved
[ ================================================== ] 100.00%
os.chdir("/Users/arnabborah/Documents/repositories/textclusteringDBSCAN/scripts/")
ftf.data.to_pickle('../processFiles/data_tfidf_processed.pkl')
os.chdir("/Users/arnabborah/Documents/repositories/textclusteringDBSCAN/")
# load dataset with tf-idf vectors and load pretrained GloVe word vectors
from fling import flingPretrained as pre
import pandas as pd

dataProcessed = pd.read_pickle('processFiles/data_tfidf_processed.pkl')
fdb = pre.flingPretrained(dataProcessed)
fdb.loadPretrainedWordVectors('glove')

# adding glove vectors for every document
fdb.addDocumentGloveVector()
DBSCAN initialized!

Loading Glove Model

400000  words loaded!

GloVe Vectors Loaded!
# use DBSCAN clustering on the GloVe vectors loaded in the previous step
from fling import flingDBSCAN as fdbscan

fdbscan1 = fdbscan.flingDBSCAN(fdb.data,None,25,'glove')
fdbscan1.dbscanCompute()
fdbscan1.addClusterLabel('glove-clusterID')
fdbscan1.printClusterInfo()
flingDBSCAN initialized!

computing best distance
[ ================================================== ] 100.00%

(plot: best-distance estimation)

Best epsilon computed on GLOVE = 0.6544420699360174 


initiating DBSCAN Clustering with glove vectors

[                                                    ] 0.04%
 ----  cluster_1_ assigned to 565 points! ----
[                                                    ] 0.09%
 ----  cluster_2_ assigned to 855 points! ----
[                                                    ] 0.18%
 ----  cluster_3_ assigned to 58 points! ----
[                                                    ] 0.31%
 ----  cluster_4_ assigned to 119 points! ----
[                                                    ] 0.53%
 ----  cluster_5_ assigned to 109 points! ----
[                                                    ] 1.07%
 ----  cluster_6_ assigned to 53 points! ----
[                                                    ] 1.91%
 ----  cluster_7_ assigned to 37 points! ----
[ =                                                  ] 2.26%
 ----  cluster_8_ assigned to 55 points! ----
[ ===                                                ] 6.79%
 ----  cluster_9_ assigned to 35 points! ----
[ =======                                            ] 15.85%
 ----  cluster_10_ assigned to 32 points! ----
[ ====================                               ] 41.59%
 ----  cluster_11_ assigned to 27 points! ----
[ ================================================== ] 100.00%
 11 clusters formed!
Cluster characteristics:
 -- vectors: glove
 -- minPts: 25
 -- EstimatedBestDistance 0.6544420699360174
 -- 11 clusters formed!
 -- 1945 points assigned to clusters!
 -- 308 noise points!

 -- 13.670661340434975 % noise!
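
flingDBSCAN first estimates a "best distance" (epsilon) from the data and then runs DBSCAN with the given minPts (25 here). Below is a rough sketch of that two-step idea with scikit-learn; the k-distance heuristic used is a common choice and is not necessarily fLing's estimator.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Sketch of the DBSCAN step: estimate epsilon from k-nearest-neighbour distances
# (a common heuristic), then cluster. Not fLing's exact estimator.
def cluster_vectors_sketch(vectors, min_pts=25):
    X = np.stack(vectors)
    knn = NearestNeighbors(n_neighbors=min_pts).fit(X)
    distances, _ = knn.kneighbors(X)
    eps = float(np.median(distances[:, -1]))  # median distance to the min_pts-th neighbour
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)  # -1 marks noise
    return labels, eps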
# convert tf-idf weights into vectors
fdb.tfidf2vec('tf-only')
fdb.tfidf2vec('tf-idf')

# cluster documents based on the tf-idf vectors
fdbscan2 = fdbscan.flingDBSCAN(fdb.data,None,25,'tfidf')
fdbscan2.dbscanCompute()
fdbscan2.addClusterLabel('tfidf-clusterID')
fdbscan2.printClusterInfo() 
flingDBSCAN initialized!

computing best distance
[ ================================================== ] 100.00%

(plot: best-distance estimation)

Best epsilon computed on GLOVE-TFIDF = 1.4628292329952732 


initiating DBSCAN Clustering with tfidf vectors

[                                                    ] 0.04%
 ----  cluster_1_ assigned to 810 points! ----
[                                                    ] 0.09%
 ----  cluster_2_ assigned to 695 points! ----
[                                                    ] 0.31%
 ----  cluster_3_ assigned to 61 points! ----
[                                                    ] 0.93%
 ----  cluster_4_ assigned to 347 points! ----
[ =                                                  ] 3.86%
 ----  cluster_5_ assigned to 26 points! ----
[ =============                                      ] 26.14%
 ----  cluster_6_ assigned to 44 points! ----
[ ================                                   ] 32.45%
 ----  cluster_7_ assigned to 27 points! ----
[ ================================================== ] 100.00%
 7 clusters formed!
Cluster characteristics:
 -- vectors: tfidf
 -- minPts: 25
 -- EstimatedBestDistance 1.4628292329952732
 -- 7 clusters formed!
 -- 1995 points assigned to clusters!
 -- 258 noise points!

 -- 11.451398135818907 % noise!
fdb.data
Job Description Company Name Industry stopsRemoved tfMatrix sumTFIDF glove-vector glove-clusterID tfidf2vec-tf tfidf2vec-tfidf tfidf-clusterID
0 Are you eager to roll up your sleeves and harn... Vera Institute of Justice\n3.2 Social Assistance eager roll sleeves harness data drive policy c... word tf tf-idf 0 data... 811.569328 [0.20507256798029552, 0.05984949950738914, 0.0... cluster_0_ [0.2986073091133004, 0.05040200935960588, 0.09... [0.26263354824176166, -0.023444644206149418, -... cluster_0_
1 Overview\n\nProvides analytical and technical ... Visiting Nurse Service of New York\n3.8 Health Care Services & Hospitals overview provides analytical technical support... word tf tf-idf 0 dat... 415.287583 [0.23643422682926837, -0.055056957317073156, 0... cluster_1_ [0.4055475764227641, -0.07285501829268287, 0.1... [0.35240058786555273, -0.1412004425681622, 0.0... cluster_1_
2 We�re looking for a Senior Data Analyst who ... Squarespace\n3.4 Internet we�re looking senior data analyst love mento... word tf tf-idf 0 data ... 439.815932 [0.155861351576923, 0.11735425461538473, -0.05... cluster_2_ [0.283220747730769, 0.14354892653846157, 0.044... [0.2563749918506738, 0.17575736117618113, -0.0... cluster_2_
3 Requisition NumberRR-0001939\nRemote:Yes\nWe c... Celerity\n4.1 IT Services requisition numberrr remoteyes collaborate cre... word tf tf-idf 0 � ... 569.217931 [0.2306739880813952, 0.09347254534883724, -0.0... cluster_2_ [0.29634610203488354, 0.10983982558139535, 0.0... [0.2966705423736133, 0.028126685382837024, -0.... cluster_2_
4 ABOUT FANDUEL GROUP\n\nFanDuel Group is a worl... FanDuel\n3.9 Sports & Recreation fanduel group fanduel group worldclass team br... word tf tf-idf 0 fanduel... 420.106719 [0.12914707201834857, 0.11582829587155963, 0.0... cluster_3_ [0.17368260871559627, 0.10919291513761473, 0.0... [0.021771101166884813, 0.16355587986765768, -0... None
... ... ... ... ... ... ... ... ... ... ... ...
2248 Maintains systems to protect data from unautho... Avacend, Inc.\n2.5 Staffing & Outsourcing maintains systems protect data unauthorized us... word tf tf-idf 0 ... 43.940807 [0.2738081315789473, -0.001255321052631562, 0.... None [0.2949110263157894, 0.029555310526315794, 0.0... [0.23112386279259817, -0.08318866123802247, -0... cluster_4_
2249 Position:\nSenior Data Analyst (Corporate Audi... Arrow Electronics\n2.9 Wholesale position senior data analyst corporate audit j... word tf tf-idf 0 ... 439.042957 [0.2200468355481728, 0.10710706677740867, 0.04... cluster_1_ [0.3396034966777404, 0.09931764750830561, 0.09... [0.3077493047461843, 0.06387599003189207, 0.06... cluster_1_
2250 Title: Technical Business Analyst (SQL, Data a... Spiceorb -1 title technical business analyst sql data anal... word tf tf-idf 0 busin... 205.978695 [0.36188271052631577, 0.05400915065789475, 0.0... cluster_2_ [0.5060029144736842, 0.04490494473684211, 0.11... [0.45506833532863533, 5.3025424212786644e-05, ... cluster_2_
2251 Summary\n\nResponsible for working cross-funct... Contingent Network Services\n3.1 Enterprise Software & Network Solutions summary responsible working crossfunctionally ... word tf tf-idf 0 ... 364.177527 [0.25247974618181807, 0.07676844581818185, -0.... cluster_2_ [0.34654995709090924, 0.07137524545454547, 0.0... [0.27937433353352015, 0.08437047685035409, -0.... cluster_1_
2252 You.\n\nYou bring your body, mind, heart and s... SCL Health\n3.4 Health Care Services & Hospitals bring body mind heart spirit work senior quali... word tf tf-idf 0 data ... 366.509859 [0.23890638028806577, 0.1815799016460906, -0.0... cluster_2_ [0.3220337218518514, 0.22893831193415645, 0.07... [0.2850343471866271, 0.2451438898926933, -0.08... cluster_2_

2253 rows × 11 columns
