## Demo BasicSemantics

BasicSemantics is the core class of bunkatech. It carries out the basic operations:
- term extraction
- term embedding
- document embedding

In [2]:
import pandas as pd
from bunkatech import BasicSemantics
from sklearn.datasets import fetch_20newsgroups
 
# Load Data
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
data = pd.DataFrame(docs, columns = ['docs'])
data = data.sample(2000, random_state = 42).reset_index(drop=True)
data['doc_index'] = data.index
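
The class only needs a DataFrame with one text column and one index column, so the same layout can be tried on a few in-memory strings before downloading the full dataset. Below is a minimal sketch, assuming a hypothetical helper `make_corpus_frame` that is not part of bunkatech. Dropping blank documents is an extra step that the cell above does not perform.

```python
import pandas as pd


def make_corpus_frame(texts, text_var="docs", index_var="doc_index"):
    """Build a DataFrame with one text column and one integer index column.

    Hypothetical helper for illustration; it mirrors the layout built above.
    """
    frame = pd.DataFrame(list(texts), columns=[text_var])
    # Assumption: blank documents are not useful, so drop them
    # (this step is not in the original cell).
    frame = frame[frame[text_var].str.strip() != ""].reset_index(drop=True)
    frame[index_var] = frame.index
    return frame


toy = make_corpus_frame(["The FBI opened a case.", "", "IBM and Apple sell computers."])
print(toy)
```

The resulting column names match the `text_var` and `index_var` arguments used when instantiating the class.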

### Instantiate the BasicSemantics class

In [3]:
basic = BasicSemantics(data,
        text_var = 'docs',
        index_var = 'doc_index',
        terms_path=None, # no pre-extracted terms
        terms_embeddings_path=None, # no pre-computed terms embeddings
        docs_embeddings_path=None) # no pre-computed docs embeddings

#### Fit the class

For multilingual embeddings, **distiluse-base-multilingual-cased-v1** is the recommended model. For English-only corpora, you may use **all-MiniLM-L6-v2**.
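
The choice between the two models can be made explicit in code. This is a small sketch, assuming a hypothetical helper `pick_embedding_model` (not part of bunkatech) that takes the language codes present in a corpus:

```python
# The two model names come from the text above; the helper itself is hypothetical.
MULTILINGUAL_MODEL = "distiluse-base-multilingual-cased-v1"
ENGLISH_MODEL = "all-MiniLM-L6-v2"


def pick_embedding_model(languages):
    """Return the English model for English-only corpora, else the multilingual one."""
    codes = {code.lower() for code in languages}
    if not codes:
        raise ValueError("at least one language code is required")
    if codes == {"en"}:
        return ENGLISH_MODEL
    return MULTILINGUAL_MODEL


print(pick_embedding_model(["en"]))
print(pick_embedding_model(["en", "fr"]))
```

The returned name can then be passed as `terms_embedding_model` and `docs_embedding_model` when fitting.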

In [4]:
basic.fit(
    extract_terms=True,  # Extract terms
    language="en",  # Language used for term extraction
    sample_size_terms=1000,  # Size of the sample to extract terms from
    terms_ents=True,  # Extract entities as part of terms
    terms_include_types=["PERSON", "ORG"],  # Entity types to keep
    terms_ncs=False,  # Extract noun chunks as part of terms
    terms_ngrams=(2, 2),  # N-gram range to extract; bigrams are recommended
    terms_include_pos=["NOUN", "PROPN", "ADJ"],  # Parts of speech to keep in n-grams
    terms_limit=2000,  # Number of top extracted terms to keep
    terms_embedding=True,  # Embed the extracted terms
    docs_embedding=True,  # Embed the documents
    terms_embedding_model="all-MiniLM-L6-v2",  # Terms embedding model
    docs_embedding_model="all-MiniLM-L6-v2",  # Docs embedding model
    docs_dimension_reduction=5,  # Number of dimensions to reduce the docs embeddings to; set to False to skip reduction
    docs_multiprocessing=False,  # Set to True to use multiprocessing and speed up document embedding
)

100%|█████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:11<00:00, 83.83it/s]


Start Embedding...


UMAP(n_components=5, verbose=True)
Wed Mar 30 14:57:17 2022 Construct fuzzy simplicial set
Wed Mar 30 14:57:18 2022 Finding Nearest Neighbors
Wed Mar 30 14:57:20 2022 Finished Nearest Neighbor Search
Wed Mar 30 14:57:21 2022 Construct embedding


Wed Mar 30 14:57:23 2022 Finished embedding
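
To give an intuition for what `terms_ngrams=(2, 2)` and `terms_limit` control, here is a rough pure-Python sketch of bigram counting with a cap on the number of terms kept. It is not how bunkatech extracts terms: the function `top_ngrams` is hypothetical, and the part-of-speech and entity filters configured above are left out.

```python
import re
from collections import Counter


def top_ngrams(texts, ngram_range=(2, 2), limit=2000):
    """Count word n-grams across texts and keep the `limit` most frequent.

    Hypothetical illustration only; no POS or entity filtering is applied.
    """
    low, high = ngram_range
    counts = Counter()
    for text in texts:
        # Naive tokenization: lowercase alphanumeric runs.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for n in range(low, high + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common(limit)


sample = [
    "The space shuttle launch was delayed.",
    "Another space shuttle mission is planned.",
]
print(top_ngrams(sample, limit=3))
```

With `ngram_range=(2, 2)` only bigrams are counted, and `limit` plays the role of keeping the top terms by frequency.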


#### Display the results

In [5]:
terms = basic.terms # Extracted terms
terms_embeddings = basic.terms_embeddings # Terms embeddings
docs_embeddings = basic.docs_embeddings # Docs embeddings

In [6]:
terms.head(5)

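| | lemma | count_terms | text | main form | ent |
|---|---|---|---|---|---|
| 0 | fbi | 16 | FBI | fbi | ORG |
| 1 | sun | 16 | Sun \| SUN | sun | ORG \| PERSON |
| 2 | faq | 15 | FAQ | faq | ORG |
| 3 | ibm | 15 | IBM | ibm | ORG |
| 4 | apple | 14 | Apple | apple | ORG |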


In [7]:
terms_embeddings.head(5)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 374 | 375 | 376 | 377 | 378 | 379 | 380 | 381 | 382 | 383 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fbi | -0.010757 | -0.02258 | -0.072251 | 0.027075 | 0.028994 | -0.062471 | 0.082013 | 0.005155 | 0.013121 | -0.025359 | ... | 0.069753 | -0.038552 | 0.022208 | 0.011805 | -0.099422 | 0.011224 | 0.052369 | -0.007461 | 0.111011 | -0.064966 |
| sun | -0.0341 | 0.110354 | 0.031398 | 0.096041 | 0.028723 | -0.019378 | 0.170204 | -0.063061 | 0.067839 | -0.011797 | ... | -0.051929 | -0.038254 | -0.045386 | -0.004905 | -0.020007 | 0.017308 | 0.085608 | -0.012262 | -0.03281 | 0.038905 |
| faq | 0.007975 | -0.007841 | -0.106851 | 0.032403 | -0.032816 | 0.064804 | 0.070553 | 0.063417 | -0.068213 | 0.007695 | ... | 0.019992 | 0.054783 | 0.061933 | -0.055984 | -0.032177 | -0.011321 | 0.091173 | 0.004498 | 0.069162 | -0.005235 |
| ibm | -0.094194 | -0.012911 | -0.030876 | -0.046995 | -0.042532 | -0.000814 | 0.07623 | 0.075563 | 0.068856 | -0.03588 | ... | -0.047528 | -0.010515 | -0.076038 | 0.047723 | 0.007834 | 0.095698 | 0.078301 | 0.000371 | 0.085381 | -0.029013 |
| apple | -0.006138 | 0.031012 | 0.064794 | 0.010941 | 0.005267 | -0.047476 | 0.081203 | 0.028981 | 0.066762 | 0.0303 | ... | 0.013608 | -0.019269 | 0.021374 | -0.099896 | -0.067282 | 0.07655 | 0.160245 | -0.04819 | 0.07352 | 0.089539 |
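
Because the terms embeddings are a DataFrame indexed by term with one column per dimension (384 here), related terms can be ranked with cosine similarity. The following is a minimal sketch, assuming a hypothetical helper `most_similar_terms` and toy 3-dimensional vectors invented for illustration; neither comes from bunkatech.

```python
import numpy as np
import pandas as pd


def most_similar_terms(embeddings, term, top_n=3):
    """Rank the other terms by cosine similarity to `term` (hypothetical helper)."""
    vectors = embeddings.to_numpy(dtype=float)
    # Normalize each row so that a dot product equals the cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = unit[embeddings.index.get_loc(term)]
    scores = pd.Series(unit @ query, index=embeddings.index)
    return scores.drop(term).sort_values(ascending=False).head(top_n)


# Toy vectors, invented for illustration; real rows would have 384 values.
toy_embeddings = pd.DataFrame(
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]],
    index=["term_a", "term_b", "term_c"],
)
print(most_similar_terms(toy_embeddings, "term_a"))
```

The same call should work on `terms_embeddings` as long as its index holds unique terms.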


In [8]:
docs_embeddings.head(5)

| doc_index | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 8.752942 | 7.512095 | 3.858257 | 4.748151 | 4.997494 |
| 1 | 8.577482 | 0.563074 | 3.324986 | 6.02084 | 5.804401 |
| 2 | 8.478346 | 2.663219 | 4.824983 | 6.247846 | 6.066621 |
| 3 | 8.936728 | 4.108136 | 6.747239 | 6.290255 | 3.968288 |
| 4 | 10.344117 | 4.097079 | 7.081563 | 5.168822 | 4.030689 |
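
The reduced document embeddings can be compared directly, for example to find the closest document to a given one. Here is a small sketch using the five rows shown above and a hypothetical helper `nearest_document` (not part of bunkatech). Distances in the reduced space are only an approximation of similarity in the original embedding space.

```python
import numpy as np
import pandas as pd

# The five reduced document vectors printed above, as displayed.
docs = pd.DataFrame(
    [
        [8.752942, 7.512095, 3.858257, 4.748151, 4.997494],
        [8.577482, 0.563074, 3.324986, 6.020840, 5.804401],
        [8.478346, 2.663219, 4.824983, 6.247846, 6.066621],
        [8.936728, 4.108136, 6.747239, 6.290255, 3.968288],
        [10.344117, 4.097079, 7.081563, 5.168822, 4.030689],
    ],
    index=pd.Index(range(5), name="doc_index"),
)


def nearest_document(embeddings, doc_id):
    """Return the index of the document closest to `doc_id` (Euclidean distance)."""
    diffs = embeddings.to_numpy() - embeddings.loc[doc_id].to_numpy()
    distances = pd.Series(np.linalg.norm(diffs, axis=1), index=embeddings.index)
    # Exclude the query document itself before taking the minimum.
    return distances.drop(doc_id).idxmin()


print(nearest_document(docs, 3))
```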
