# Collocations and Bigrams

Collocations are multi-word phrases that occur in text drawn from various sources (web pages, documents, etc.). To find collocations, we group words that tend to appear together frequently into phrases of different lengths.

Two common types of collocations:

A: Bigrams: Groups of two words that frequently occur together.
e.g., "artificial intelligence", "due to", etc.

B: Trigrams: Groups of three words that frequently occur together.
e.g., "the New York", "the United States", etc.

In [1]:
#Importing libraries
import nltk
from nltk.util import ngrams
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

In [2]:
#Defining the example text to be considered
text= "The quick brown fox jumped over the lazy dog."

#Extracting unigram tokens first
Tokens= nltk.word_tokenize(text)

#Creating bigrams from the tokens
bigram_tokens= tuple(nltk.bigrams(Tokens))
print(bigram_tokens)

(('The', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumped'), ('jumped', 'over'), ('over', 'the'), ('the', 'lazy'), ('lazy', 'dog'), ('dog', '.'))


In [3]:
#Creating trigrams from the tokens
trigram_tokens= tuple(nltk.trigrams(Tokens))
print(trigram_tokens)

(('The', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumped'), ('fox', 'jumped', 'over'), ('jumped', 'over', 'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog'), ('lazy', 'dog', '.'))
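The sliding-window idea behind the bigram and trigram output above can be sketched in plain Python. This is only an illustration of the technique, not NLTK's own implementation; `make_ngrams` and `sample` are hypothetical names introduced here.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    tokens = list(tokens)
    # Build n staggered views of the token list (starting at offsets 0..n-1)
    # and zip them together; zip stops at the shortest view, so only
    # complete n-grams are produced.
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical sample sentence, split on whitespace for simplicity
sample = "The quick brown fox".split()
print(make_ngrams(sample, 2))  # bigrams
print(make_ngrams(sample, 3))  # trigrams
```

A sequence shorter than `n` simply yields an empty list, which matches the intuition that no complete n-gram exists in it.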


Note: Not all bigrams and trigrams are meaningful. Combinations like ('brown', 'fox'), ('fox', 'jumped'), ('quick', 'brown', 'fox') and ('brown', 'fox', 'jumped') can be used meaningfully, while combinations like ('over', 'the') and ('over', 'the', 'lazy') hold little significance. In the latter cases, the n-gram either adds little semantic significance beyond that of its individual words, as with ('over', 'the'), or does not work as a self-contained, stand-alone phrase, as with ('over', 'the', 'lazy').

Now, we are going to explore some methods for separating the meaningful collocations from the non-meaningful ones.
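One widely used way to score candidate pairs is pointwise mutual information (PMI), which compares how often two words occur together with how often they would co-occur if they were independent. The sketch below uses only the standard library so that the arithmetic is visible. `pmi_scores`, `min_count` and `toy_tokens` are hypothetical names, the toy sentence is invented for illustration, and NLTK's `BigramAssocMeasures` (imported in the first cell) may normalise its scores differently.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by PMI, skipping pairs seen fewer than min_count times."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bigrams
        p_w1 = unigram_counts[w1] / n_tokens
        p_w2 = unigram_counts[w2] / n_tokens
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Invented toy corpus, already tokenised by whitespace
toy_tokens = ("new york is big . new york is busy . "
              "the city is big . the park is big .").split()
for pair, score in sorted(pmi_scores(toy_tokens).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

The frequency cut-off matters: without it, a pair that occurs only once between two rare words tends to receive the highest PMI even though there is little evidence for it.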

In [5]:
#Defining the text to be considered

original_text="Artificial Intelligence (AI) By JAKE FRANKENFIELD Updated March 08, 2021 Reviewed by GORDON SCOTT What Is Artificial Intelligence (AI)? Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term may also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving. The ideal characteristic of artificial intelligence is its ability to rationalize and take actions that have the best chance of achieving a specific goal. A subset of artificial intelligence is machine learning, which refers to the concept that computer programs can automatically learn from and adapt to new data without being assisted by humans. Deep learning techniques enable this automatic learning through the absorption of huge amounts of unstructured data such as text, images, or video. KEY TAKEAWAYS Artificial intelligence refers to the simulation of human intelligence in machines. The goals of artificial intelligence include learning, reasoning, and perception. AI is being used across different industries including finance and healthcare. Weak AI tends to be simple and single-task oriented, while strong AI carries on tasks that are more complex and human-like. What if you had started investing years ago? Find out what a hypothetical investment would be worth today. SELECT A STOCK TSLA TESLA INC AAPL APPLE INC NKE NIKE INC AMZN AMAZON.COM, INC WMT WALMART INC SELECT INVESTMENT AMOUNT $ 1,000 SELECT A PURCHASE DATE 5 years ago CALCULATE Understanding Artificial Intelligence (AI) When most people hear the term artificial intelligence, the first thing they usually think of is robots. That's because big-budget films and novels weave stories about human-like machines that wreak havoc on Earth. But nothing could be further from the truth. 
Artificial intelligence is based on the principle that human intelligence can be defined in a way that a machine can easily mimic it and execute tasks, from the most simple to those that are even more complex."

In [6]:
print("The given piece of text converted to a list:")
print()
text=original_text.lower().split()
print(text)

The given piece of text converted to a list:

['artificial', 'intelligence', '(ai)', 'by', 'jake', 'frankenfield', 'updated', 'march', '08,', '2021', 'reviewed', 'by', 'gordon', 'scott', 'what', 'is', 'artificial', 'intelligence', '(ai)?', 'artificial', 'intelligence', '(ai)', 'refers', 'to', 'the', 'simulation', 'of', 'human', 'intelligence', 'in', 'machines', 'that', 'are', 'programmed', 'to', 'think', 'like', 'humans', 'and', 'mimic', 'their', 'actions.', 'the', 'term', 'may', 'also', 'be', 'applied', 'to', 'any', 'machine', 'that', 'exhibits', 'traits', 'associated', 'with', 'a', 'human', 'mind', 'such', 'as', 'learning', 'and', 'problem-solving.', 'the', 'ideal', 'characteristic', 'of', 'artificial', 'intelligence', 'is', 'its', 'ability', 'to', 'rationalize', 'and', 'take', 'actions', 'that', 'have', 'the', 'best', 'chance', 'of', 'achieving', 'a', 'specific', 'goal.', 'a', 'subset', 'of', 'artificial', 'intelligence', 'is', 'machine', 'learning,', 'which', 'refers', 'to', 'the',

In [7]:
#Calculating the frequency of unigrams

wordfreq=[]
for w in text:
    wordfreq.append(text.count(w))

print("Frequencies: "+str(wordfreq))    

Frequencies: [10, 12, 3, 3, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 3, 6, 10, 12, 1, 10, 12, 3, 3, 9, 13, 2, 9, 4, 12, 3, 2, 9, 3, 1, 9, 2, 1, 1, 10, 2, 1, 1, 13, 2, 1, 1, 5, 1, 9, 1, 3, 9, 1, 1, 1, 1, 8, 4, 1, 2, 2, 3, 10, 1, 13, 1, 1, 9, 10, 12, 6, 1, 1, 9, 1, 10, 1, 1, 9, 1, 13, 1, 1, 9, 1, 8, 1, 1, 8, 1, 9, 10, 12, 6, 3, 2, 1, 3, 9, 13, 1, 9, 1, 1, 3, 1, 1, 3, 10, 1, 9, 1, 2, 1, 2, 1, 3, 1, 1, 3, 1, 1, 1, 1, 3, 1, 13, 1, 9, 1, 1, 9, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 10, 12, 3, 9, 13, 2, 9, 4, 12, 3, 1, 13, 1, 9, 10, 12, 1, 2, 1, 10, 1, 3, 6, 2, 1, 1, 1, 1, 1, 1, 10, 1, 1, 3, 1, 9, 5, 2, 10, 1, 1, 1, 1, 3, 1, 3, 1, 9, 3, 2, 1, 10, 1, 3, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 8, 1, 2, 1, 5, 1, 1, 3, 8, 1, 1, 1, 5, 1, 1, 5, 1, 1, 5, 1, 1, 5, 1, 1, 5, 3, 2, 1, 1, 1, 3, 8, 1, 1, 1, 2, 1, 1, 1, 10, 12, 3, 1, 2, 1, 1, 13, 2, 10, 1, 13, 1, 1, 1, 1, 2, 9, 6, 1, 1, 1, 1, 1, 10, 1, 1, 1, 1, 1, 2, 9, 1, 1, 3, 1, 1, 1, 1, 5, 1, 3, 13, 1, 10, 12, 6, 1, 3, 13, 1, 9, 4, 12, 3, 5, 1, 3, 8, 1, 9, 8, 3, 3, 1, 2, 1, 10, 1, 

In [8]:
#Printing the pairs of words and frequencies: 
print("Pairs: "+str(tuple(zip(text,wordfreq))))

Pairs: (('artificial', 10), ('intelligence', 12), ('(ai)', 3), ('by', 3), ('jake', 1), ('frankenfield', 1), ('updated', 1), ('march', 1), ('08,', 1), ('2021', 1), ('reviewed', 1), ('by', 3), ('gordon', 1), ('scott', 1), ('what', 3), ('is', 6), ('artificial', 10), ('intelligence', 12), ('(ai)?', 1), ('artificial', 10), ('intelligence', 12), ('(ai)', 3), ('refers', 3), ('to', 9), ('the', 13), ('simulation', 2), ('of', 9), ('human', 4), ('intelligence', 12), ('in', 3), ('machines', 2), ('that', 9), ('are', 3), ('programmed', 1), ('to', 9), ('think', 2), ('like', 1), ('humans', 1), ('and', 10), ('mimic', 2), ('their', 1), ('actions.', 1), ('the', 13), ('term', 2), ('may', 1), ('also', 1), ('be', 5), ('applied', 1), ('to', 9), ('any', 1), ('machine', 3), ('that', 9), ('exhibits', 1), ('traits', 1), ('associated', 1), ('with', 1), ('a', 8), ('human', 4), ('mind', 1), ('such', 2), ('as', 2), ('learning', 3), ('and', 10), ('problem-solving.', 1), ('the', 13), ('ideal', 1), ('characteristic', 1
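The loop above calls `text.count(w)` once per token, so every repeated word is counted again and appears again in the pairs. A possible alternative, sketched here under the assumption that surrounding punctuation should be ignored, is `collections.Counter`, which keeps one entry per distinct word. `count_words` and `sample_text` are hypothetical names, and the sample sentence is made up for illustration.

```python
import string
from collections import Counter

def count_words(raw_text):
    """Count each distinct word once, ignoring case and surrounding punctuation."""
    # Strip leading/trailing punctuation so '(ai)' and '(ai)?' both become 'ai'
    cleaned = (tok.strip(string.punctuation) for tok in raw_text.lower().split())
    # Drop tokens that were pure punctuation and are now empty
    return Counter(tok for tok in cleaned if tok)

# Hypothetical short sample, not the article text used in this notebook
sample_text = "Artificial Intelligence (AI)? Artificial intelligence (AI) refers to AI."
word_counts = count_words(sample_text)
print(word_counts.most_common(3))
```

Because punctuation is stripped, the counts from this sketch will not match the frequencies printed above, where tokens such as '(ai)' and '(ai)?' are treated as different words.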

# Calculating Bigram Frequencies and Finding the Most Frequent Bigrams

In [9]:
bigram_generator=nltk.bigrams(text)
#Building a frequency distribution over the bigrams
bigram_fd=nltk.FreqDist(bigram_generator)
bigram_fd

FreqDist({('artificial', 'intelligence'): 9, ('intelligence', '(ai)'): 3, ('refers', 'to'): 3, ('to', 'the'): 3, ('human', 'intelligence'): 3, ('that', 'are'): 3, ('of', 'artificial'): 3, ('intelligence', 'is'): 3, ('the', 'simulation'): 2, ('simulation', 'of'): 2, ...})

In [10]:
bigram_fd.most_common(10)

[(('artificial', 'intelligence'), 9),
 (('intelligence', '(ai)'), 3),
 (('refers', 'to'), 3),
 (('to', 'the'), 3),
 (('human', 'intelligence'), 3),
 (('that', 'are'), 3),
 (('of', 'artificial'), 3),
 (('intelligence', 'is'), 3),
 (('the', 'simulation'), 2),
 (('simulation', 'of'), 2)]
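The ranking above mixes content phrases such as ('artificial', 'intelligence') with function-word pairs such as ('to', 'the'). One simple filter, sketched below, drops every pair that contains a stop word before counting. `STOP_WORDS` is a deliberately tiny, hypothetical list and `content_bigrams` is a name introduced for this example; a real analysis would use a fuller stop-word list.

```python
from collections import Counter

# Hypothetical, deliberately small stop-word list for illustration only
STOP_WORDS = {"the", "to", "of", "is", "in", "that", "are", "a", "and"}

def content_bigrams(tokens, stop_words=STOP_WORDS):
    """Count adjacent pairs in which neither word is a stop word."""
    pairs = zip(tokens, tokens[1:])
    return Counter(
        pair for pair in pairs
        if pair[0] not in stop_words and pair[1] not in stop_words
    )

# One sentence from the article text, lower-cased and without punctuation
sample_tokens = ("artificial intelligence refers to the simulation of "
                 "human intelligence in machines").split()
print(content_bigrams(sample_tokens).most_common())
```

This filter is blunt: it would also discard a genuine collocation like "due to", so it suits noun-phrase style collocations better than fixed expressions built from function words.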

# Calculating Trigram Frequencies and Finding the Most Frequent Trigrams

In [11]:
trigram_generator=nltk.trigrams(text)
#Building a frequency distribution over the trigrams
trigram_fd=nltk.FreqDist(trigram_generator)
trigram_fd

FreqDist({('artificial', 'intelligence', '(ai)'): 3, ('refers', 'to', 'the'): 3, ('of', 'artificial', 'intelligence'): 3, ('artificial', 'intelligence', 'is'): 3, ('to', 'the', 'simulation'): 2, ('the', 'simulation', 'of'): 2, ('simulation', 'of', 'human'): 2, ('of', 'human', 'intelligence'): 2, ('human', 'intelligence', 'in'): 2, ('intelligence', '(ai)', 'by'): 1, ...})

In [12]:
trigram_fd.most_common(10)

[(('artificial', 'intelligence', '(ai)'), 3),
 (('refers', 'to', 'the'), 3),
 (('of', 'artificial', 'intelligence'), 3),
 (('artificial', 'intelligence', 'is'), 3),
 (('to', 'the', 'simulation'), 2),
 (('the', 'simulation', 'of'), 2),
 (('simulation', 'of', 'human'), 2),
 (('of', 'human', 'intelligence'), 2),
 (('human', 'intelligence', 'in'), 2),
 (('intelligence', '(ai)', 'by'), 1)]