### Find vendor names in emails

#### Setup

JS setup
- Run `npm install pst-extractor` in the *process_pst* folder.

- Comment out lines 162-165 in *process_pst/node_modules/pst-extractor/dist/PSTFolder.class.js*: 

  ```
  if ((emailRow && emailRow.itemIndex == -1) || !emailRow) {
    // no more!
    return null;
  }
  ```

Process PST files
- `node process_pst.js <input folder or file> <output folder>`

Python dependencies

- `pip install beautifulsoup4 bertopic flair keybert keyphrase_vectorizers scikit-learn`
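
Before starting a long run, it can help to confirm that the packages above are importable. The following is a minimal sketch, not part of the project: the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names, and the mapping only records the usual top-level import name for each pip package.

```python
from importlib.util import find_spec

# pip package name -> top-level import name (hypothetical helper table)
REQUIRED = {
    "beautifulsoup4": "bs4",
    "bertopic": "bertopic",
    "flair": "flair",
    "keybert": "keybert",
    "keyphrase_vectorizers": "keyphrase_vectorizers",
    "scikit-learn": "sklearn",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]


missing = missing_packages()
print("Missing:", ", ".join(missing) if missing else "none")
```

Anything listed as missing can be installed with the `pip install` command above.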

#### Imports

In [3]:
from collections import Counter
from get_entities import main, similarity_sbert_kw, similarity_wordnet

#### Specify input and output, optional args, and run
- For a default run, only *input_folder* (which should contain the emails extracted by *process_pst.js*) and *output_folder* need to be specified.

- Set *use_custom* to True to use *custom_kwargs*

- Finds vendors by:
    1. Getting keywords for every email.

    2. Using *filter_func* with *filter_func_args* to find emails whose keywords are related to *filter_terms*. (When looking for vendors, *filter_terms* should relate to payments, invoices, etc.)

    3. Using a named entity recognition tagger to get organization names from those emails, since they are more likely to be vendors. (The organizations that appear most frequently across all emails are filtered out because they might be false positives.)

- Two possible values for *filter_func*: similarity_wordnet and similarity_sbert_kw. When using WordNet, *filter_terms* must be a list of [WordNet synsets](https://www.nltk.org/howto/wordnet.html). With S-BERT, it should just be a list of words or phrases.

- Use a different *filter_label* every time you change *filter_func*, *filter_terms*, or *filter_func_args* so that the results are written to a different file.
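
The frequency filter in step 3 can be sketched with `collections.Counter`. This is an illustration only, under stated assumptions: the function name `likely_vendors`, the `top_n_background` parameter, and the sample organization names are all hypothetical, and the actual logic in *get_entities.py* may differ.

```python
from collections import Counter


def likely_vendors(filtered_orgs, background_orgs, top_n_background=2):
    """Rank organizations from filtered emails, dropping globally common ones.

    filtered_orgs: organization names tagged in emails that passed the filter.
    background_orgs: organization names tagged in a sample of all emails.
    top_n_background: how many of the most frequent background names to drop
        (a hypothetical parameter, not taken from get_entities.py).
    """
    # Names that dominate the overall corpus are treated as likely false positives.
    common = {
        name for name, _ in Counter(background_orgs).most_common(top_n_background)
    }
    # Count only the remaining names, most frequent first.
    counts = Counter(name for name in filtered_orgs if name not in common)
    return counts.most_common()


# Made-up example data for illustration.
filtered = ["Acme Supply", "Priority Services", "Acme Supply", "Globex"]
background = ["Priority Services", "Priority Services", "Microsoft", "Microsoft", "Globex"]
print(likely_vendors(filtered, background))
```

In this toy data, "Priority Services" appears in the filtered emails but is dropped because it is among the most frequent names overall, which is the kind of false positive the filter is meant to remove.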


In [4]:
input_folder = 'data/extracted emails/process_pst_js/priorityservices/'
output_folder = 'data/vendor results/priorityservices/'

use_custom = False # Set to True to use custom_kwargs

# Template showing the expected arguments (placeholder values; overwritten below).
kwargs = {
    'filter_label': 'str', # Change every time you change the other kwargs.
    'filter_terms': [], # List of WordNet synsets for similarity_wordnet OR plain words/phrases for similarity_sbert_kw.
    'filter_func_args': {}, # See get_entities.py for other args.
    'filter_threshold': float, # Similarity threshold for selecting emails relevant to filter_terms.
    'filter_func': similarity_wordnet or similarity_sbert_kw # WordNet uses synsets; S-BERT uses word embeddings.
     }

custom_kwargs = {
}

default_kwargs = {
    'filter_label':'sbert invoice related',
    'filter_terms': ['invoice'],
    'filter_func_args': {},
    'filter_threshold': .85,
    'filter_func': similarity_sbert_kw,
}

##########################################

kwargs = custom_kwargs if use_custom else default_kwargs

main(input_folder, output_folder, **kwargs)

Getting keywords...: 100%|██████████| 37758/37758 [23:52<00:00, 26.36it/s]  


2024-01-26 13:44:43,955 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>


Getting entities...: 100%|██████████| 37758/37758 [36:20<00:00, 17.32it/s]  


2024-01-26 14:21:06,862 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>


Getting entities from 1000 random docs...: 100%|██████████| 1000/1000 [06:03<00:00,  2.75it/s]



Finished.
