## Demo BasicSemantics

`BasicSemantics` is the core class of bunkatech. It carries out the basic operations:
- term extraction
- term embeddings
- document embeddings
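
To make the first of these operations concrete, here is a minimal sketch of bigram-based term extraction in plain Python. It is not bunkatech's implementation: `extract_bigrams` is a hypothetical helper, and it only counts adjacent word pairs, without the part-of-speech or entity filtering that a real extractor would apply.

```python
from collections import Counter
import re


def extract_bigrams(docs, limit=3):
    """Return the `limit` most frequent word bigrams across `docs`."""
    counts = Counter()
    for doc in docs:
        # Lowercase and keep alphabetic tokens only (a deliberately crude tokenizer).
        tokens = re.findall(r"[a-z']+", doc.lower())
        # Pair each token with its successor to form bigrams.
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts.most_common(limit)


# Toy documents, invented for illustration only.
docs = [
    "A serial killer stalks a small town.",
    "A detective hunts a serial killer.",
    "The small town hides a secret.",
]
top_terms = extract_bigrams(docs, limit=3)
print(top_terms)
```

Without a part-of-speech filter, pairs such as "a serial" rank alongside "serial killer", which is why restricting n-grams to nouns, proper nouns, and adjectives matters in practice.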

In [1]:
import pandas as pd
from bunkatech import BasicSemantics
import warnings
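warnings.filterwarnings('ignore')  # silence library warnings for the demo

# Load the IMDB dataset and draw a reproducible sample of 2,000 rows
data = pd.read_csv('../data/imdb.csv', index_col=[0])
data = data.sample(2000, random_state=42)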

### Instantiate the BasicSemantics class

In [2]:
basic = BasicSemantics(
    data,
    text_var='description',
    index_var='imdb',
    terms_path=None,  # no pre-extracted terms
    terms_embeddings_path=None,  # no pre-computed terms embeddings
    docs_embeddings_path=None,  # no pre-computed docs embeddings
)

#### Fit the class

For multilingual embeddings, **distiluse-base-multilingual-cased-v1** is the recommended model. For English-only text, you may use **all-MiniLM-L6-v2**.
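
As a rough illustration of that choice, the sketch below selects one of the two models named above from a language code and embeds a list of texts with the sentence-transformers library. The helper names `pick_embedding_model` and `embed_texts` are hypothetical and not part of bunkatech; treating every non-English language code as "multilingual" is also an assumption made for the example.

```python
def pick_embedding_model(language):
    """Map a language code to one of the two models named above."""
    if language == "en":
        return "all-MiniLM-L6-v2"
    # Assumption: any other language falls back to the multilingual model.
    return "distiluse-base-multilingual-cased-v1"


def embed_texts(texts, language="en"):
    """Embed `texts` with sentence-transformers; returns a 2-D NumPy array."""
    # Imported lazily: loading sentence-transformers also loads PyTorch,
    # and the model weights are downloaded on first use.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(pick_embedding_model(language))
    return model.encode(texts)
```

Calling `embed_texts(["A detective hunts a serial killer."])` would return an array with one row per input text.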

In [None]:
basic.fit(
    extract_terms=True,  # Extract terms
    language="en",  # Language used for term extraction
    sample_size_terms=1000,  # Size of the sample to extract terms from
    terms_ents=True,  # Extract entities as part of terms
    terms_include_types=["PERSON", "ORG"],  # Entity types to keep
    terms_ncs=False,  # Extract noun chunks as part of terms
    terms_ngrams=(2, 2),  # N-grams to extract; bigrams are recommended
    terms_include_pos=["NOUN", "PROPN", "ADJ"],  # Part-of-speech tags to keep in n-grams
    terms_limit=2000,  # Number of top extracted terms to keep
    terms_embedding=True,  # Embed the extracted terms
    docs_embedding=True,  # Embed the documents
    terms_embedding_model="all-MiniLM-L6-v2",  # Terms embedding model
    docs_embedding_model="all-MiniLM-L6-v2",  # Docs embedding model
    docs_dimension_reduction=5,  # Number of dimensions to reduce the docs embeddings to; set to False to skip reduction
    docs_multiprocessing=False,  # Set to True to speed up document embedding with multiprocessing
)

  0%|                                                                           | 0/1000 [00:00<?, ?it/s]2022-03-31 09:35:53,054 - INFO : loaded 'en_core_web_sm' spaCy language pipeline
  0%|                                                                 | 1/1000 [00:07<2:05:57,  7.57s/it]2022-03-31 09:35:53,227 - INFO : loaded 'en_core_web_sm' spaCy language pipeline
100%|███████████████████████████████████████████████████████████████| 1000/1000 [00:08<00:00, 112.17it/s]
2022-03-31 09:35:54,805 - INFO : Loa

Batches:   0%|          | 0/63 [00:00<?, ?it/s]

2022-03-31 09:36:10,170 - INFO : Load pretrained SentenceTransformer: all-MiniLM-L6-v2


Start Embedding...


2022-03-31 09:36:17,614 - INFO : Use pytorch device: cpu


Batches:   0%|          | 0/63 [00:00<?, ?it/s]
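
The `docs_dimension_reduction` argument passed to `fit` shrinks each document embedding to a handful of dimensions. The source does not say which algorithm bunkatech uses for this step, so the sketch below substitutes plain PCA (computed with NumPy's SVD) as a stand-in. The helper `reduce_dimensions` is hypothetical, and the input is random data shaped like 384-dimensional all-MiniLM-L6-v2 embeddings rather than real output.

```python
import numpy as np


def reduce_dimensions(embeddings, n_components=5):
    """Project row vectors onto their top principal components (PCA via SVD)."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of `vt` are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(42)
# Stand-in for real document embeddings: 200 documents, 384 dimensions each.
fake_docs_embeddings = rng.normal(size=(200, 384))
reduced = reduce_dimensions(fake_docs_embeddings, n_components=5)
print(reduced.shape)
```

Each document keeps one row, and only the number of columns changes.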

#### Display the results

In [None]:
terms = basic.terms  # Extracted terms
terms_embeddings = basic.terms_embeddings  # Terms embeddings
docs_embeddings = basic.docs_embeddings  # Docs embeddings

In [None]:
terms.head(5)

In [None]:
terms_embeddings.head(5)

In [None]:
docs_embeddings.head(5)