# Insights from Text using N-Grams

## Overview

This notebook is an introductory NLP tutorial about N-Grams and how they can be used to find information within text data.

### N-Grams Overview

N-grams are a fundamental concept in NLP that capture sequential patterns in text. They have a wide range of applications and provide insight into language usage and structure. The "n" in n-gram is the number of tokens in each sequence. A token is the unit the text is split into: typically a word, though it can also be a single character or a fixed-length group of characters. A 2-gram (also called a bigram) is a pair of consecutive tokens, a 3-gram (trigram) is a triplet of consecutive tokens, and so on. The choice of n depends on the task and the text being analyzed.

Example of bigrams for the sentence "the quick brown fox":

`["the", "quick"], ["quick", "brown"], ["brown", "fox"]`
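The same sliding-window idea works for any n and for characters as well as words. Below is a minimal sketch; the helper name `make_ngrams` is made up for illustration and is not part of any library.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive items from a sequence, as tuples."""
    # A sequence of length L has L - n + 1 windows of size n (none if n > L)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the quick brown fox"

# Word-level bigrams: split the sentence into words first
word_bigrams = make_ngrams(sentence.split(), 2)
print(word_bigrams)

# Character-level trigrams: a string is already a sequence of characters
char_trigrams = make_ngrams("quick", 3)
print(char_trigrams)
```

Only the choice of token changes between the two calls; the windowing logic stays the same.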

### How are N-Grams used?

N-grams capture local patterns in text and provide useful statistical information about the combination of words or characters in the text. By analyzing the frequencies of n-grams, models can learn patterns, predict the likelihood of certain words or phrases, and generate text that looks like the input data.
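As a rough illustration of analyzing n-gram frequencies, the sketch below counts bigrams with `collections.Counter` from the standard library. The sample sentence is invented for the example.

```python
from collections import Counter

# Hypothetical sample text, chosen so that one bigram repeats
text = "the cat sat on the mat and the cat slept"
words = text.split()

# zip(words, words[1:]) pairs each word with the word that follows it
bigram_counts = Counter(zip(words, words[1:]))

# Show the most frequent bigrams first
for bigram, count in bigram_counts.most_common(3):
    print(bigram, count)
```

Here `("the", "cat")` occurs twice while every other bigram occurs once, which is the kind of signal a model can pick up on.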

One common application of n-grams is in language modeling, where the objective is to predict the next word in a sequence given the previous context.
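To make the language-modeling idea concrete, here is a small sketch of a bigram model that predicts the next word from the single previous word. The toy corpus and the function names `predict_next` and `next_word_probability` are assumptions for illustration; real language models use far more data and usually some form of smoothing.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for this example
corpus = "the cat sat on the mat . the cat ate the fish ."
words = corpus.split()

# For every word, count which words follow it
next_word_counts = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    next_word_counts[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    followers = next_word_counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

def next_word_probability(prev, nxt):
    """Estimate P(nxt | prev) as count(prev, nxt) / count(prev, anything)."""
    followers = next_word_counts.get(prev)
    if not followers:
        return 0.0
    return followers[nxt] / sum(followers.values())

print(predict_next("the"))
print(next_word_probability("the", "cat"))
```

In this corpus "the" is followed by "cat" twice and by "mat" and "fish" once each, so the model predicts "cat" with an estimated probability of 0.5.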

N-grams are also used in text classification tasks, such as sentiment analysis or spam detection.
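One way n-grams can feed a classifier is as count features. The sketch below shows only the feature-extraction step, not a classifier; `ngram_features` is a hypothetical helper and the review text is invented.

```python
from collections import Counter

def ngram_features(text, n):
    """Count the word n-grams in a text after lowercasing it."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

review = "The movie was not good"

unigram_features = ngram_features(review, 1)
bigram_features = ngram_features(review, 2)

# Unigrams see the word "good" on its own
print(("good",) in unigram_features)

# Bigrams keep the negation attached to it
print(("not", "good") in bigram_features)
```

For sentiment analysis, a feature such as `("not", "good")` carries information that the single word "good" loses, which is one reason bigram features can help.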

## N-Grams Example in Python

We will build a small example in Python to see how N-Grams work and how they might be used.

### Bigrams

As mentioned earlier, bigrams are pairs of consecutive tokens (words, in our case) in the text.

```python
# Create a function to generate bigrams from text
def generate_bigrams(text):
    # Split the text for each word
    words = text.split()
    bigrams = []

    # Iterate through the words
    for i in range(len(words) - 1):
        # Append each bigram or 2 consecutive words as a tuple
        bigrams.append((words[i], words[i + 1]))

    return bigrams

# Example text
text = "This is an example bigram use case in Python."
result = generate_bigrams(text)

# Print each bigram
for bigram in result:
    print(bigram)
```

Output:

```
('This', 'is')
('is', 'an')
('an', 'example')
('example', 'bigram')
('bigram', 'use')
('use', 'case')
('case', 'in')
('in', 'Python.')
```


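We can now generate bigrams for any input text, in pure Python! Note that because the function splits on whitespace only, punctuation stays attached to words, as in `'Python.'`.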
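In the output above, the last token is `'Python.'` with the period attached, so "Python" and "Python." would count as different tokens. One possible cleanup, sketched here under the assumption that lowercasing and trimming punctuation suits your task, is to normalize tokens before pairing them. The `tokenize` helper is hypothetical.

```python
import string

def tokenize(text):
    """Lowercase each word and strip punctuation from its start and end."""
    tokens = (word.strip(string.punctuation).lower() for word in text.split())
    # Drop tokens that were punctuation only and are now empty
    return [token for token in tokens if token]

text = "This is an example bigram use case in Python."
tokens = tokenize(text)

# Pair each token with the one that follows it
clean_bigrams = list(zip(tokens, tokens[1:]))
print(clean_bigrams[-1])
```

Punctuation inside a word, such as the hyphen in "n-gram", is left alone because `str.strip` only trims the ends.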

### N-Grams

To generate n-grams of any size, we will use NLTK in this example.

#### About NLTK

NLTK (Natural Language Toolkit) is a popular Python library for natural language processing (NLP). It provides functions, classes, and tools for NLP tasks such as tokenization, stemming, and part-of-speech tagging.

```python
# Library Imports
from nltk import ngrams

# Example usage
text = "An example n-gram use case in Python. This time using nltk."

# Generate trigrams this time using the NLTK ngrams function
ngram_result = list(ngrams(text.split(), 3))

# Print each ngram
for ngram in ngram_result:
    print(ngram)
```

Output:

```
('An', 'example', 'n-gram')
('example', 'n-gram', 'use')
('n-gram', 'use', 'case')
('use', 'case', 'in')
('case', 'in', 'Python.')
('in', 'Python.', 'This')
('Python.', 'This', 'time')
('This', 'time', 'using')
('time', 'using', 'nltk.')
```


## How to Use N-Grams

The power of N-Grams comes from using them as a measure of likelihood and context. Once the N-Grams are generated, we can count which sequences occur most often and see which words tend to come before and after a given word. This turns raw text into a richer form of information to pass downstream to models or other NLP tasks.

For example, in the trigram example earlier, the token "Python." appears in 3 of the 9 trigrams, or about 33% of them. Information like this is useful across a broad set of NLP tasks, including Language Modeling, Text Generation, Information Retrieval, Machine Translation, Text Classification, and more.
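The 33% figure can be reproduced from the trigram example. This sketch rebuilds the trigrams with plain `zip` rather than NLTK so that it runs without extra dependencies, then also lists the words directly before and after the target token.

```python
text = "An example n-gram use case in Python. This time using nltk."
words = text.split()

# Build trigrams with zip: a sliding window of size 3 over the words
trigrams = list(zip(words, words[1:], words[2:]))

# The token keeps its period because the text was split on whitespace only
target = "Python."
containing = [t for t in trigrams if target in t]
share = len(containing) / len(trigrams)
print(f"{len(containing)} of {len(trigrams)} trigrams contain {target!r} ({share:.0%})")

# Words that directly precede and follow the target token
before = [words[i - 1] for i, w in enumerate(words) if w == target and i > 0]
after = [words[i + 1] for i, w in enumerate(words) if w == target and i < len(words) - 1]
print("before:", before, "after:", after)
```

On a text this small the numbers mean little, but the same counts over a larger corpus show which contexts a word tends to appear in.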

Explore what information your N-Grams can give you!

## What can you do next?

- Play around with NLTK! As mentioned earlier, NLTK is a powerful and popular NLP library in Python. I highly recommend exploring it more and finding out what NLP tasks can be done using the library.

- Start looking at Topic Modeling, Text Classification, and other NLP topics. With Large Language Models becoming more and more popular, it helps to understand how NLP works and the evolution of the field.

## Summary

N-grams in natural language processing (NLP) are contiguous sequences of N items (typically words) that capture local linguistic patterns and are used for various downstream tasks such as language modeling, text generation, and information retrieval.

## Bio

Frankie Cancino is a Senior Data Scientist for Mercedes-Benz Research & Development, working on machine learning use cases.

### Links
* [NLTK Docs](https://www.nltk.org)
* [LinkedIn](https://www.linkedin.com/in/frankie-cancino/)
* [Twitter](https://twitter.com/frankiecancino)