# Co-occurrence Matrix

Generally speaking, a co-occurrence matrix has one set of entities in its rows (ER) and another in its columns (EC). Each cell gives the number of times a given ER appears in the same context as a given EC. To use a co-occurrence matrix, you therefore have to define your entities and the context in which they co-occur.

In NLP, the classic approach is to define each entity (i.e., each row and column) as a word present in a text, and the context as a sentence.

*Consider the following text:*
> Roses are red. Sky is blue.

With the classic approach described above, we get the following matrix:

| | Roses | are | red | Sky | is | blue |
| :- | :-: | :-: | :-: | :-: | :-: | :-: |
| Roses | 1 | 1 | 1 | 0 | 0 | 0 |
| are | 1 | 1 | 1 | 0 | 0 | 0 |
| red | 1 | 1 | 1 | 0 | 0 | 0 |
| Sky | 0 | 0 | 0 | 1 | 1 | 1 |
| is | 0 | 0 | 0 | 1 | 1 | 1 |
| blue | 0 | 0 | 0 | 1 | 1 | 1 |

Here, each cell indicates whether the two items co-occur or not. You may replace this with the number of times they co-occur, or with a more sophisticated approach. You may also change the entities themselves, for example by putting nouns in columns and adjectives in rows instead of using every word.
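As a rough illustration, the table above can be reproduced with a few lines of plain Python. This is only a sketch: the function name `sentence_cooccurrence`, the `binary` flag, and the naive split on `.`, `!` and `?` are assumptions made for this example, not part of the original answer.

```python
import re

def sentence_cooccurrence(text, binary=True):
    """Word-by-word co-occurrence matrix, using the sentence as context."""
    # Naive sentence split (an assumption of this sketch)
    sentences = [s for s in re.split(r"[.!?]", text) if s.strip()]
    tokenized = [s.split() for s in sentences]

    # Vocabulary in order of first appearance, as in the table above
    vocab = []
    for words in tokenized:
        for word in words:
            if word not in vocab:
                vocab.append(word)
    index = {word: i for i, word in enumerate(vocab)}

    # Count, for every pair of words, the sentences that contain both
    matrix = [[0] * len(vocab) for _ in vocab]
    for words in tokenized:
        unique = set(words)
        for a in unique:
            for b in unique:
                matrix[index[a]][index[b]] += 1

    if binary:
        # Keep only "co-occur or not" instead of the counts
        matrix = [[min(1, count) for count in row] for row in matrix]
    return vocab, matrix

vocab, matrix = sentence_cooccurrence("Roses are red. Sky is blue.")
print(vocab)
for row in matrix:
    print(row)
```

Calling the function with `binary=False` keeps the number of sentences in which each pair appears together, which is the count variant mentioned above.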

**What are they used for in NLP?** The most evident use of these matrices is their ability to provide links between notions. Suppose you're working on product reviews, and suppose for simplicity that each review is composed only of short sentences. You'll have something like this:

> Product X is amazing.<br/>I hate product Y.

Representing these reviews as one co-occurrence matrix lets you associate products with the opinions expressed about them.

*[[Source]](https://stackoverflow.com/questions/24073030/what-are-co-occurence-matrixes-and-how-are-they-used-in-nlp)*
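To make the product/opinion association concrete, here is a minimal sketch of a rectangular matrix with products in rows and opinion words in columns, again using the sentence as context. The `products` and `opinions` lists and the function `product_opinion_matrix` are hypothetical and chosen by hand for this example; in practice you would need some way to detect product names and opinion words.

```python
import re

reviews = "Product X is amazing. I hate product Y."
products = ["x", "y"]            # hypothetical row entities
opinions = ["amazing", "hate"]   # hypothetical column entities

def product_opinion_matrix(text, rows, cols):
    """Count the sentences in which a row entity and a column entity both appear."""
    matrix = [[0] * len(cols) for _ in rows]
    # Naive sentence split on '.', '!' and '?' (an assumption of this sketch)
    for sentence in re.split(r"[.!?]", text.lower()):
        words = set(sentence.split())
        for i, row in enumerate(rows):
            for j, col in enumerate(cols):
                if row in words and col in words:
                    matrix[i][j] += 1
    return matrix

matrix = product_opinion_matrix(reviews, products, opinions)
for product, row in zip(products, matrix):
    print(product, dict(zip(opinions, row)))
```

Here product X ends up linked to "amazing" and product Y to "hate", which is the kind of association described above.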

In [1]:
import re
import string
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Product X is amazing.',
    'I hate product Y.'
]

In [2]:
def clean(sentences):
    """Strip punctuation, lowercase, and collapse repeated spaces."""
    table = str.maketrans('', '', string.punctuation)   # Build the translation table once
    result = []
    for sentence in sentences:
        sentence = sentence.translate(table).lower()
        sentence = re.sub(' +', ' ', sentence).strip()
        result.append(sentence)
    return result

def update_matrix(sent, feats, matrix, window_len):
    """Add the co-occurrence counts of one sentence to matrix (in place)."""
    words = sent.lower().split(' ')
    for focus_word_idx, focus_word in enumerate(words):    # Iterate each word as focus word
        if focus_word not in feats:                         # Skip words outside the vocabulary
            continue
        matrix_row_idx = feats.index(focus_word)
        # Window of window_len words on each side; it includes the focus word itself
        x = max(0, focus_word_idx - window_len)
        y = min(len(words), focus_word_idx + window_len + 1)
        for context_word_idx in range(x, y):
            if words[context_word_idx] in feats:
                matrix_col_idx = feats.index(words[context_word_idx])
                matrix[matrix_row_idx][matrix_col_idx] += 1
    return matrix

In [4]:
corpus = clean(corpus)
vectorizer = CountVectorizer(stop_words=None, token_pattern=r"(?u)\b\w+\b")
vec = vectorizer.fit_transform(corpus)
features = vectorizer.get_feature_names_out().tolist()   # get_feature_names() was removed in scikit-learn 1.2
n = len(features)
window_len = 2
matrix = np.zeros((n, n))   # Initialize co-occurrence matrix to 0

for sentence in corpus:
    update_matrix(sentence, features, matrix, window_len)   # Updates matrix in place

print(matrix)

[[1. 0. 0. 1. 0. 1. 0.]
 [0. 1. 1. 0. 1. 0. 1.]
 [0. 1. 1. 0. 1. 0. 0.]
 [1. 0. 0. 1. 1. 1. 0.]
 [0. 1. 1. 1. 2. 1. 1.]
 [1. 0. 0. 1. 1. 1. 0.]
 [0. 1. 0. 0. 1. 0. 1.]]
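
The printed matrix carries no labels. Its rows and columns follow the vectorizer's vocabulary, which `CountVectorizer` sorts alphabetically, so here the order is `amazing, hate, i, is, product, x, y`. The sketch below copies the values from the output above and adds two hypothetical helpers, `cooccurrence` and `neighbours`, to read the matrix by word rather than by index. Note that the diagonal counts each word with itself, because the window includes the focus word.

```python
# Vocabulary in the alphabetical order used by the matrix above
features = ["amazing", "hate", "i", "is", "product", "x", "y"]

# Values copied from the printed output
matrix = [
    [1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 1, 1, 0],
    [0, 1, 1, 1, 2, 1, 1],
    [1, 0, 0, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1],
]

def cooccurrence(a, b):
    """Count stored for focus word a and context word b."""
    return matrix[features.index(a)][features.index(b)]

def neighbours(word):
    """Other words that share a window with word, with their counts."""
    row = matrix[features.index(word)]
    return {f: c for f, c in zip(features, row) if c and f != word}

print(neighbours("x"))
print(neighbours("y"))
```

Reading the rows this way shows "x" sharing a window with "amazing" and "y" sharing one with "hate", while "product" appears with both.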
