Alejandro Rojo
https://github.com/alerojorela

# Semantic info extraction by distribution / correlation
Some semantic information can be extracted without supervision by measuring which words tend to co-occur in the same sentence. A sentence like *he drove the car* suggests that *driving* is an action involving a car, and if similar sentences recur regularly we may infer that *driving* is the characteristic action performed on a car.

This script applies the **apriori algorithm** to find correlations between pairs of words related by an enunciation event (a sentence), using a corpus of sentences.

We're going to see in action:
+ nltk library
    + pos-tagged corpora
    + lemmatization
+ implementation of the apriori algorithm to find correlations between words in several enunciation events
+ getting frequencies using Counter class from collections library
+ combinations using permutations and product from itertools library
+ sorting of dictionaries by values

We need NLTK, a basic NLP library, to retrieve a tagged corpus and lemmatize it.
Alternatively, we could have used a heavier NLP library such as spaCy.

In [1]:
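try:
    import nltk
except ImportError:
    !pip install nltk
    import nltk
# Assumption: the corpora may not be present locally yet.
# nltk.download skips data that is already up to date.
nltk.download('brown')
nltk.download('wordnet')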

In [2]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet, brown as corpus
from collections import Counter
from itertools import chain, combinations, permutations, product

We define a function that retrieves sentences from the Brown corpus and extracts the POS-tagged lexical units we are interested in, here nouns (N) and verbs (V).

In [3]:
sel_lex_pos = ['N', 'V']
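# part-of-speech (POS) mapping of lexical items
# http://korpus.uib.no/icame/manuals/BROWN/INDEX.HTM
# Brown tags: N* noun, NP* proper noun, V* verb, JJ* adjective, RB* adverb
wn_mapping = {'N': wordnet.NOUN,
              'V': wordnet.VERB,
              'J': wordnet.ADJ,
              'R': wordnet.ADV}


def lexical_tag(tag):
    """Map a Brown tag to a WordNet POS; None for unselected tags and proper nouns."""
    if tag[0] in sel_lex_pos and not tag.startswith('NP'):
        return wn_mapping.get(tag[0])
    return None


def get_nltk_lemmas(max_sentences=0):
    """Return one {lemma: POS} dictionary per sentence that has selected items."""
    lemmatizer = WordNetLemmatizer()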

    sentences = corpus.tagged_sents()
    if max_sentences:  # limits sentences
        sentences = sentences[:max_sentences]

    lexical_sentences = []
    for sentence in sentences:
        event = {}
        for token, pos in sentence:
            normalized_tag = lexical_tag(pos)
            if normalized_tag:
                normalized_token = lemmatizer.lemmatize(token, pos=normalized_tag).lower()
                event[normalized_token] = normalized_tag

        if event:
            lexical_sentences.append(event)
    return lexical_sentences
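
To make the tag filter concrete, here is a minimal sketch that restates the rule of `lexical_tag` without NLTK. It assumes `'n'` and `'v'` as stand-ins for `wordnet.NOUN` and `wordnet.VERB` (the values visible in this notebook's output). `lexical_tag_demo` and the list of sample tags are illustrative names, not part of the script.

```python
# Stand-ins for wordnet.NOUN and wordnet.VERB, so the sketch runs without NLTK data.
NOUN, VERB = 'n', 'v'
SELECTED = {'N': NOUN, 'V': VERB}


def lexical_tag_demo(tag):
    """Hypothetical restatement of lexical_tag, for illustration only."""
    if tag[:1] in SELECTED and not tag.startswith('NP'):
        return SELECTED[tag[0]]
    return None


# A handful of Brown-style tags: common nouns, verbs, proper nouns, others.
sample_tags = ['NN', 'NNS', 'NNS-TL', 'NR', 'VBD', 'VBN-TL',
               'NP', 'NP-TL', 'JJ', 'RB', 'MD']
mapped = {tag: lexical_tag_demo(tag) for tag in sample_tags}
for tag, pos in mapped.items():
    print(f'{tag:8} -> {pos}')
```

Tags carrying the title suffix `-TL` pass the filter as well, which is presumably how capitalized words from names end up among the lemmas.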

Next we process the first 10,000 sentences. The function returns a list of dictionaries, one per sentence, each mapping lemmas to their POS (noun or verb).

In [4]:
events = get_nltk_lemmas(max_sentences=10000)
events[:3]  # show some samples

[{'county': 'n',
  'jury': 'n',
  'say': 'v',
  'friday': 'n',
  'investigation': 'n',
  'primary': 'n',
  'election': 'n',
  'produce': 'v',
  'evidence': 'n',
  'irregularity': 'n',
  'take': 'v',
  'place': 'n'},
 {'jury': 'n',
  'say': 'v',
  'term-end': 'n',
  'presentment': 'n',
  'city': 'n',
  'committee': 'n',
  'charge': 'n',
  'election': 'n',
  'deserve': 'v',
  'praise': 'n',
  'thanks': 'n',
  'manner': 'n',
  'conduct': 'v'},
 {'term': 'n',
  'jury': 'n',
  'charge': 'v',
  'court': 'n',
  'judge': 'n',
  'investigate': 'v',
  'report': 'n',
  'irregularity': 'n',
  'primary': 'n',
  'win': 'v',
  'mayor-nominate': 'n'}]
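
Because each event is a dictionary keyed by lemma, a lemma that occurs several times in a sentence is stored once, and if it occurs both as a noun and as a verb the last tag wins. The sketch below shows this with a made-up stream of (lemma, POS) tuples; `tagged_lemmas` is hypothetical data, not taken from the corpus.

```python
# Hypothetical (lemma, POS) stream for one sentence, already lemmatized.
tagged_lemmas = [('charge', 'n'), ('jury', 'n'), ('charge', 'v'), ('jury', 'n')]

event = {}
for lemma, pos in tagged_lemmas:
    event[lemma] = pos  # same kind of assignment as in get_nltk_lemmas

print(event)
```

As a consequence, the frequencies computed later count the sentences that contain a lemma, not its raw number of occurrences.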

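Now we define a function that computes the lift (a measure of correlation) for pairs of items; it is not implemented for triples or larger itemsets. The score is the pair count divided by the product of the two item counts, so it is proportional to the usual lift rather than equal to it.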

In [5]:
def get_lift(items, pairs, support_filter=0, lift_filter=0):
    # absolute frequencies of:
    # individual items
    support1 = dict(Counter(items))
    # pairs
    support2 = dict(Counter(pairs))

    # filter
    if support_filter:
        support2 = {k: v for k, v in support2.items() if v > support_filter}

    # lift of each pair: pair count divided by the product of its items' counts
    lift2 = {pair: support / (support1[pair[0]] * support1[pair[1]])
             for pair, support in support2.items()}
    if lift_filter:
        lift2 = {k: v for k, v in lift2.items() if v > lift_filter}
    return lift2
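
The relation between the score returned by `get_lift` and lift expressed as a probability ratio can be checked on a toy example. This is a minimal sketch, assuming that probabilities are estimated as counts divided by the number of events; `toy_events`, `raw_score` and `textbook_lift` are illustrative names.

```python
from collections import Counter
from itertools import permutations

# Toy events (hypothetical data): each set holds the lemmas of one sentence.
toy_events = [
    {'drive', 'car'},
    {'drive', 'car', 'road'},
    {'sing', 'song'},
    {'drive', 'road'},
]

item_counts = Counter(item for event in toy_events for item in event)
pair_counts = Counter(pair for event in toy_events
                      for pair in permutations(sorted(event), 2))
n_events = len(toy_events)


def raw_score(pair):
    """Score as computed by get_lift: count(a, b) / (count(a) * count(b))."""
    a, b = pair
    return pair_counts[pair] / (item_counts[a] * item_counts[b])


def textbook_lift(pair):
    """Lift as a probability ratio: P(a, b) / (P(a) * P(b))."""
    return n_events * raw_score(pair)


print(raw_score(('drive', 'car')), textbook_lift(('drive', 'car')))
```

The two scores differ only by the constant factor `n_events`, so the ranking of pairs is the same. With the normalized version, values above 1 indicate pairs that co-occur more often than independence would predict.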

Now we feed the items and the pairs (items co-occurring in the same event) into the function.

In [6]:
# flat list of items over all events
all_items = [item for event in events for item in event]
# ordered pairs of items within the same event
pairs = [pair for event in events for pair in permutations(event, 2)]

In [7]:
results = get_lift(all_items, pairs, support_filter=5)
{k: v for k, v in sorted(results.items(), key=lambda item: -item[1])}

{('birdie', 'par'): 0.07792207792207792,
 ('par', 'birdie'): 0.07792207792207792,
 ('hydrogen', 'atom'): 0.07,
 ('atom', 'hydrogen'): 0.07,
 ('cholesterol', 'blood'): 0.03431372549019608,
 ('blood', 'cholesterol'): 0.03431372549019608,
 ('cotton', 'gin'): 0.028070175438596492,
 ('gin', 'cotton'): 0.028070175438596492,
 ('par', 'hole'): 0.022727272727272728,
 ('hole', 'par'): 0.022727272727272728,
 ('gin', 'machinery'): 0.022556390977443608,
 ('machinery', 'gin'): 0.022556390977443608,
 ('toll-road', 'bond'): 0.022222222222222223,
 ('bond', 'toll-road'): 0.022222222222222223,
 ('fallout', 'shelter'): 0.021052631578947368,
 ('shelter', 'fallout'): 0.021052631578947368,
 ('allied', 'arts'): 0.021052631578947368,
 ('arts', 'allied'): 0.021052631578947368,
 ('license', 'fee'): 0.01856763925729443,
 ('fee', 'license'): 0.01856763925729443,
 ('va', 'hospital'): 0.015065913370998116,
 ('hospital', 'va'): 0.015065913370998116,
 ('farm', 'dealer'): 0.013725490196078431,
 ('dealer', 'farm'): 0.01
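
Every pair appears twice in this output because `permutations` yields both orderings. If the order is irrelevant, one possible alternative is `combinations` over the sorted lemmas of each event, which gives each unordered pair a single spelling. A minimal sketch with made-up events in the same shape as `events`:

```python
from collections import Counter
from itertools import combinations

# Toy events (hypothetical data) in the same shape as `events`: lemma -> POS.
toy_events = [
    {'hydrogen': 'n', 'atom': 'n'},
    {'atom': 'n', 'hydrogen': 'n', 'bond': 'n'},
]

# Sorting each event first gives every unordered pair one canonical spelling.
unordered_pairs = Counter(
    pair for event in toy_events for pair in combinations(sorted(event), 2)
)
print(unordered_pairs)
```

A list of pairs built this way should work as the `pairs` argument of `get_lift`, since the function only looks up each member of a pair in the item counts.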

Additionally, we can check correlations between verbs and nouns, `V*N`.

In [8]:
# flat list of items over all events
all_items = [item for event in events for item in event]
# (verb, noun) pairs within the same event
pairs = []
for event in events:
    nouns = [token for token, pos in event.items() if pos == 'n']
    verbs = [token for token, pos in event.items() if pos == 'v']
    pairs.extend(product(verbs, nouns))

In [9]:
results = get_lift(all_items, pairs, support_filter=5)
{k: v for k, v in sorted(results.items(), key=lambda item: -item[1])}

{('allied', 'arts'): 0.021052631578947368,
 ('attract', 'industry'): 0.007692307692307693,
 ('united', 'states'): 0.0070130241877773,
 ('united', 'nations'): 0.006758583400919167,
 ('solve', 'problem'): 0.006216006216006216,
 ('reach', 'agreement'): 0.004310344827586207,
 ('confront', 'problem'): 0.0039447731755424065,
 ('sell', 'stock'): 0.0035714285714285713,
 ('pass', 'senate'): 0.0033068783068783067,
 ('sing', 'song'): 0.0031185031185031187,
 ('hit', 'ball'): 0.0030499428135722455,
 ('pay', 'taxpayer'): 0.0026954177897574125,
 ('drive', 'car'): 0.0025,
 ('pass', 'bill'): 0.0024733637747336376,
 ('attend', 'conference'): 0.0024489795918367346,
 ('pay', 'worker'): 0.0023584905660377358,
 ('spend', 'dollar'): 0.002142857142857143,
 ('raise', 'question'): 0.0020126509488211618,
 ('tell', 'trial'): 0.002,
 ('play', 'orchestra'): 0.0019828155981493722,
 ('happen', 'thing'): 0.0019636720667648502,
 ('involve', 'case'): 0.0018291089626339169,
 ('save', 'life'): 0.0017921146953405018,
 ('un

This provides us with some semantic information, such as:
+ songs are sung
+ balls are hit
+ cars are driven
+ orchestras play

This will, of course, require additional filtering, but it already has the potential to orient us towards an ontology.
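
As one illustration of what such filtering might look like, the sketch below keeps only the top-scoring verb for each noun. The scores are copied from the verb-noun output above; `best_verb_per_noun` is a hypothetical helper, and this is only one of many possible filters.

```python
# A few scores copied from the verb-noun output above.
scores = {
    ('solve', 'problem'): 0.006216006216006216,
    ('confront', 'problem'): 0.0039447731755424065,
    ('pass', 'senate'): 0.0033068783068783067,
    ('pass', 'bill'): 0.0024733637747336376,
    ('pay', 'taxpayer'): 0.0026954177897574125,
    ('pay', 'worker'): 0.0023584905660377358,
}


def best_verb_per_noun(scores):
    """Hypothetical filter: keep only the top-scoring verb for each noun."""
    best = {}
    for (verb, noun), score in scores.items():
        if noun not in best or score > best[noun][1]:
            best[noun] = (verb, score)
    return {noun: verb for noun, (verb, _) in best.items()}


characteristic = best_verb_per_noun(scores)
print(characteristic)
```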