<a href="https://colab.research.google.com/github/WetSuiteLeiden/example-notebooks/blob/main/research-methods/methods_nlp__concordance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Goal of this notebook

Explain what a concordance is, and how it might help you explore what a word or phrase means.

Give some basic code and basic examples.

## What 

A concordance has [varied meanings in different fields](https://en.wikipedia.org/wiki/Concordance),
but in [publishing](https://en.wikipedia.org/wiki/Concordance_(publishing)) and linguistics it usually means taking a term or phrase
and showing all occurrences of that term in the context it is used in, where that context ranges from a few words up to a sentence.

Before that could be automated, this was only done for a few words of special importance.
With computers we can do it for arbitrary terms; we just have to deal with the bulk of output.
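
To make the idea concrete, here is a minimal sketch that uses only Python's standard `re` module. The function name `kwic`, its `width` parameter, and the example sentences are made up for illustration; this is not the wetsuite helper used later in this notebook.

```python
import re

def kwic(pattern, texts, width=30):
    """Yield (before, match, after) for every regex match in every text.

    `width` is the number of context characters kept on each side.
    """
    compiled = re.compile(pattern)
    for text in texts:
        for m in compiled.finditer(text):
            before = text[max(0, m.start() - width):m.start()]
            after = text[m.end():m.end() + width]
            yield before, m.group(0), after

# Hypothetical example sentences, made up for illustration
examples = [
    'De gemeente heeft de vergunning verleend.',
    'Zonder vergunning mag er niet gebouwd worden.',
]

rows = list(kwic(r'vergunning', examples, width=30))
for before, match, after in rows:
    # right-align the left context so that the matches line up in a column
    print(f'{before:>30s} | {match} | {after}')
```

Right-aligning the left context is what makes the output scannable: the matched term ends up in one column, with its surroundings on either side.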

No, this is not at all complex, but it can still be quite _useful_ for a few tasks, such as inspecting your data,
finding things to match, finding out how to write pattern rules, etc.


Yes, you can think up more features to use this for actual research.
    
Yes, you can potentially do this faster with an installation of a search engine.

Both of those represent a learning curve, and you can get a decent start from a relatively small piece of DIY code, such as that below.

In [2]:
import re
import random
import spacy
import wetsuite.helpers.localdata
import wetsuite.helpers.strings
import wetsuite.helpers.escape
import wetsuite.helpers.spacy
import wetsuite.helpers.notebook

In [60]:
# a list of strings, made elsewhere
sentence_store = wetsuite.helpers.localdata.LocalKV('sentences.db', str, str)

# assuming the full thing is a _lot_...
#sentences = sentence_store.values()
#...then a random sample of around 100K seems a good balance while developing, between having some cases and not spending very long
sentences = sentence_store.random_values(100000)

In [4]:
dutch = spacy.load('nl_core_news_lg')

In [66]:
def concordance( needle_re, haystack_strlist ):
    for hs_item in haystack_strlist:
        for before_str, match_str, match_obj, after_str in wetsuite.helpers.strings.findall_with_context(needle_re, hs_item, 40):
            # print it out in a way that makes it clear what part was the match
            print(f'{before_str:>50s} {match_str:^30s} {after_str}')


# the reason we chose regular expression is that, aside from literal matches such as...
#concordance(r'verbaal', sentences)

# ...it allows (limited) tricks like "words ending in vergunning" (via "one or more non-space characters, then the letters vergunning at a word boundary")
concordance(r'[^\s]+vergunning\b', sentences)

# ...or words with 'minister' somewhere in it   (the (?i) is a Python regex feature that makes the match case insensitive)
#concordance(r'(?i)[^\s]*minister[^\s]*', sentences)   
#concordance(r'(?i)[^\s]+fraud[^\s]+\b', sentences)


           kort gezegd, op neer dat verweerder de       omgevingsvergunning        alleen toetst aan het bestemmingsplan, 
          oor tussen haakjes daarachter het woord           ‘vergunning           ’ te plaatsen.
                                      Bouw bv een       omgevingsvergunning        voor het oprichten van 20 woningen verl
          gen en zij er weer in het bezit van een       verblijfsvergunning        zal worden gesteld.
          et college is bevoegd om een dergelijke       omgevingsvergunning        te verlenen op grond van artikel 2.12, 
                         Op het perceel waarop de         aanlegvergunning         betrekking heeft, rust ingevolge het te
                                       het zonder       omgevingsvergunning        aanpassen van de zijgevel van de woning
          De Afdeling stelt vast dat de verleende       omgevingsvergunning        niet in strijd is met het bestemmingspl
          ing van deze brief te beëindigen en een      exploit
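
The patterns from the cell above can also be tried on a few hand-written strings, which makes it easier to see exactly what each one matches. This is a small sketch using only the standard `re` module; the sample strings are made up for illustration.

```python
import re

# Hypothetical sample strings, made up for illustration
samples = [
    'De omgevingsvergunning is verleend door de minister.',
    'De Staatssecretaris en de Minister-President overlegden.',
    'Een vergunning is niet vereist.',
]

# one or more non-space characters, then 'vergunning' at a word boundary
compound = re.compile(r'[^\s]+vergunning\b')
# any run of non-space characters containing 'minister', ignoring case
minister = re.compile(r'(?i)[^\s]*minister[^\s]*')

compound_hits = [m.group(0) for s in samples for m in compound.finditer(s)]
minister_hits = [m.group(0) for s in samples for m in minister.finditer(s)]

print(compound_hits)
print(minister_hits)
```

Two things to notice. The `+` in the first pattern requires at least one character before `vergunning`, so the bare word in the last sample is skipped. And `[^\s]` means "anything that is not whitespace", so attached punctuation is matched too, which is also why `‘vergunning` shows up in the output above.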

Let's try to make something a little more controllable,
and give you the POS tags that you might indeed want when [developing rules](methods_nlp__patterns_spacy.ipynb).

In [74]:
from wetsuite.helpers.escape import nodetext

class Concordance():
    def __init__( self, haystack_items, context_charamt:int=30 ):
        self.haystack        = haystack_items
        self.context_charamt = context_charamt

    def process(self, needle_re):
        self.matches = []
        for hs_item in self.haystack:
            for before_str, match_str, _match_obj, after_str in wetsuite.helpers.strings.findall_with_context(needle_re, hs_item, self.context_charamt):
                self.matches.append( (before_str, match_str, after_str) )
        #return self
    
    def process_with_spacy(self, needle_re, model):
        # TODO: review, there is likely an off-by-1 error in here
        # CONSIDER: caching sentence parses
        self.matches = []
        for hs_item in self.haystack:
            for before_str, match_str, match_obj, after_str in wetsuite.helpers.strings.findall_with_context(needle_re, hs_item, self.context_charamt):

                m_st, m_en = match_obj.start(), match_obj.end()
                #print("Match at char pos %d..%d"%(m_st, m_en))
                doc = model(hs_item)  # use the model that was passed in, rather than the global
                toks = list(doc)

                before, hit, after = [],[],[]

                # tokens before match start (at m_st)
                while len(toks)>0: 
                    tok = toks.pop(0)
                    if tok.idx > m_st-self.context_charamt:
                        before.append('%s<span class="g">/%s</span>'%(nodetext(tok.text), nodetext(tok.pos_)))
                        #before.append( '%s/%s'%(tok.text, tok.pos_) )
                        #print( '%s/%s '%(tok.text, tok.pos_), end='' )
                        #print( 'B4:%s:%s/%s'%(tok.idx, tok.text, tok.pos_) )
                    if tok.idx + len(tok.text_with_ws) + 1 > m_st:
                        break

                # tokens before match end
                while len(toks)>0: 
                    tok = toks.pop(0)
                    if tok.idx+1 > m_en:
                        toks.insert(0, tok) # oops, put it back
                        break
                    #hit.append( '%s/%s'%(tok.text, tok.pos_) )
                    hit.append('%s<span class="g">/%s</span>'%(nodetext(tok.text), nodetext(tok.pos_)))
                    #print( '%s/%s '%(tok.text, tok.pos_), end='' )
                    #print( 'IN:%s:%s/%s'%(tok.idx, tok.text, tok.pos_) )

                # tokens after match
                for tok in toks:
                    if tok.idx < m_en + self.context_charamt:
                        #after.append('%s/%s '%(tok.text, tok.pos_))
                        after.append('%s<span class="g">/%s</span>'%(nodetext(tok.text), nodetext(tok.pos_)))
                    #print( '%s/%s '%(tok.text, tok.pos_), end='' )
                    #print( 'AF:%s:%s/%s'%(tok.idx, tok.text, tok.pos_) )

                self.matches.append( (' '.join(before), ' '.join(hit), ' '.join(after)) )
        #return self

    def _repr_html_(self):
        ret = ['<table>']
        for before, m, after in self.matches:
            ret.append('<tr>')
            ret.append(f'<td>{before}</td>')
            ret.append(f'<td class="c2">{m}</td>')
            ret.append(f'<td class="c3">{after}</td>')
            ret.append('</tr>')
        ret.append('</table>')
        #padding-left:2em; text-indent: -2em; 
        return '<div style="width:100%%"><style>.c2 {text-align:center; color:green} .c3 {text-align:left} .g {opacity:0.33;}</style>%s</div>'%"".join(ret)
    
c = Concordance( sentences )
#c.process(r'[^\s]+vergunning\b' )                       # no POS
c.process_with_spacy(r'(?i)[^\s]+fraud[^\s]+\b', dutch ) # with POS
display( c )

| Before | Match | After |
|---|---|---|
| “/PUNCT /SPACE in/ADP Europa/PROPN op/ADP grote/ADJ schaal/NOUN | btw-fraude/NOUN | plaatsvindt/VERB met/ADP zogeheten/ADJ koper/NOUN |
| De/DET term/NOUN | btw-carrouselfraude/ADJ | wordt/AUX gebruikt/VERB voor/ADP een/DET samenstel/NOUN |
| hiermee/ADV dat/SCONJ eisende/ADJ partij/NOUN | gefraudeerd/VERB | heeft/AUX ./PUNCT |
| dat/DET kader/NOUN gezamenlijk/ADJ hebben/AUX | gefraudeerd/VERB | met/ADP PGB-gelden/NOUN ,/PUNCT door/ADP het/DET innen/VERB |
| is/AUX het/DET Centraal/PROPN Meldpunt/PROPN | Faillissementsfraude/PROPN | ondergebracht/VERB ./PUNCT |
| ,/PUNCT in/ADP verband/NOUN gebracht/VERB met/ADP | beleggingsfraude/NOUN | en/CCONJ het/PRON niet/ADV nakomen/VERB van/ADP verplichtingen/NOUN |
| ontvangen/VERB over/ADP mogelijke/ADJ | beleggingsfraude/NOUN | door/ADP beleggingsvereniging/NOUN “/PUNCT Fibonacci/NOUN |
| geschrift/NOUN (/PUNCT feiten/NOUN 5/NUM t/m/ADJ 10/NUM )/PUNCT ,/PUNCT | faillissementsfraude/PROPN | (/PUNCT feit/NOUN 11/NUM )/PUNCT en/CCONJ onttrekking/NOUN aan/ADP |
| een/DET proces-verbaal/NOUN van/ADP | uitkeringsfraude/NOUN | van/ADP de/DET Dienst/PROPN Werk/NOUN en/CCONJ Inkomen/VERB |
| verwijt/NOUN dat/PRON [/NOUN gedaagde/VERB ]/PROPN heeft/AUX | gefraudeerd/VERB | ter/ADP zitting/NOUN nog/ADV verklaard/VERB dat/SCONJ |
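
The TODO in `process_with_spacy` suspects an off-by-one in the token bookkeeping. One way to make that logic easier to check is to sort each token into before, hit, or after purely by comparing character spans. The sketch below does that on plain `(text, start_offset)` tuples, which stand in for spaCy's `tok.text` and `tok.idx`. The function name, the regex tokenizer, and the example sentence are assumptions made for illustration, not the notebook's method.

```python
import re

def split_tokens_by_span(tokens, m_start, m_end, context_chars):
    """Sort tokens into (before, hit, after) lists relative to a match span.

    tokens: list of (text, start_offset) tuples.
    A token that overlaps the match span [m_start, m_end) counts as a hit.
    Context tokens are kept only if they lie fully within context_chars
    characters of the match.
    """
    before, hit, after = [], [], []
    for text, start in tokens:
        end = start + len(text)
        if end <= m_start:                       # entirely left of the match
            if start >= m_start - context_chars:
                before.append(text)
        elif start >= m_end:                     # entirely right of the match
            if end <= m_end + context_chars:
                after.append(text)
        else:                                    # overlaps the match
            hit.append(text)
    return before, hit, after

# Hypothetical sentence, with a crude regex tokenizer standing in for spaCy
sentence = 'Er is sprake van uitkeringsfraude door de verdachte.'
tokens = [(m.group(0), m.start()) for m in re.finditer(r'\w+|[^\w\s]', sentence)]

match = re.search(r'[^\s]+fraude\b', sentence)
before, hit, after = split_tokens_by_span(tokens, match.start(), match.end(), 12)
print(before, hit, after)
```

Because every token is classified by one explicit comparison, there is no running position to keep in sync, and each boundary case (a token ending exactly where the match starts, say) can be read straight from the `<=` and `>=` tests.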


Some ways you might extend this:
- get the data out as Python data, a dataframe, etc.
- record the source of each text we are searching in, so that each match can be traced back.
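
Both extensions can be sketched together: return each match as a dict that also records where it came from. The function name `concordance_records`, the field names, and the document ids below are hypothetical. The sketch assumes the texts are available as a mapping from some source identifier to text, which may or may not be how your own store is keyed.

```python
import re

def concordance_records(pattern, documents, width=30):
    """Return one dict per match, including the id of the source text.

    documents: dict mapping a source id to its text.
    """
    compiled = re.compile(pattern)
    records = []
    for source_id, text in documents.items():
        for m in compiled.finditer(text):
            records.append({
                'source': source_id,
                'start':  m.start(),   # character offset of the match in the source
                'before': text[max(0, m.start() - width):m.start()],
                'match':  m.group(0),
                'after':  text[m.end():m.end() + width],
            })
    return records

# Hypothetical ids and texts, made up for illustration
documents = {
    'doc-1': 'Het college verleent de omgevingsvergunning.',
    'doc-2': 'Er is geen aanlegvergunning aangevraagd.',
}

records = concordance_records(r'[^\s]+vergunning\b', documents)
for r in records:
    print(r['source'], r['start'], repr(r['before']), r['match'], repr(r['after']))
```

A list of dicts like this is plain Python data, and can be handed to `pandas.DataFrame(records)` if you prefer working with a dataframe.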
