## 1. What are Corpora?

A corpus (plural: corpora) is a collection of written or spoken texts. Computers make it possible to compile large amounts of authentic written and spoken language in machine-readable form. Such a collection can then be analysed in various ways to establish patterns of grammar and vocabulary usage.

## 2. What are Tokens?
A "token" is a single occurrence of a word in a text or corpus, so the token count is the total number of words, regardless of how often they are repeated. A "type" is a distinct word, so the type count is the number of different words in the text or corpus.
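
The distinction can be sketched in a few lines of Python. This is only an illustration: the function name `count_tokens_and_types`, the example sentence, and the simple regex tokenizer (lowercased runs of letters, digits, or apostrophes) are assumptions made here, not part of any particular library.

```python
import re

def count_tokens_and_types(text):
    """Return (token count, type count) for a piece of text."""
    # Assumed tokenizer: lowercase, then keep runs of letters, digits, or apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Types are the distinct tokens.
    types = set(tokens)
    return len(tokens), len(types)

text = "The cat sat on the mat because the mat was warm."
n_tokens, n_types = count_tokens_and_types(text)
print(n_tokens, n_types)  # 11 tokens, 8 types ("the" and "mat" repeat)
```

A different tokenizer (for example, one that keeps case or punctuation) would give different counts, so the numbers always depend on how words are defined.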

## 3. What are Unigrams, Bigrams, Trigrams?
A 1-gram (or unigram) is a one-word sequence. For the sentence “I love reading blogs about data science on Analytics Vidhya”, the unigrams would simply be: “I”, “love”, “reading”, “blogs”, “about”, “data”, “science”, “on”, “Analytics”, “Vidhya”.

A 2-gram (or bigram) is a two-word sequence, like “I love”, “love reading”, or “Analytics Vidhya”.

A 3-gram (or trigram) is a three-word sequence, like “I love reading”, “about data science” or “on Analytics Vidhya”.

## 4. How to generate n-grams from text?
N-grams are contiguous sequences of n items (here, words) in a text. N can be 1, 2 or any other positive integer, although very large values of N are rarely used because such long n-grams seldom occur more than once.

```python
s = """
    I am annu patel pursuing b tech computer science and engineering
    i am from delhi
    i am good in python programming, machine learning, sql
"""
```

```python
import re

def generate_ngrams(s, n):
    # Convert to lowercase
    s = s.lower()

    # Replace all non-alphanumeric characters with spaces
    s = re.sub(r'[^a-z0-9\s]', ' ', s)

    # Split on any whitespace (spaces and newlines); this also drops empty tokens
    tokens = s.split()

    # Use zip over n shifted copies of the token list to build the n-grams,
    # then join each tuple of tokens into a single string
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]
```

```python
generate_ngrams(s, n=5)
```

```
['i am annu patel pursuing',
 'am annu patel pursuing b',
 'annu patel pursuing b tech',
 'patel pursuing b tech computer',
 'pursuing b tech computer science',
 'b tech computer science and',
 'tech computer science and engineering',
 'computer science and engineering i',
 'science and engineering i am',
 'and engineering i am from',
 'engineering i am from delhi',
 'i am from delhi i',
 'am from delhi i am',
 'from delhi i am good',
 'delhi i am good in',
 'i am good in python',
 'am good in python programming',
 'good in python programming machine',
 'in python programming machine learning',
 'python programming machine learning sql']
```

## 5. Explain Lemmatization

Lemmatization is a linguistic term for grouping together the inflected forms of a word so that they can be analysed as a single item, identified by the word's lemma. The aim is to remove inflectional endings and return the word's dictionary form; for example, "am", "are" and "is" all share the lemma "be".
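
As a rough illustration, lemmatization can be sketched as a dictionary lookup. The table `LEMMA_TABLE` and the function `lemmatize` below are hypothetical and hand-written for this example; real lemmatizers rely on a full vocabulary and morphological analysis, often together with part-of-speech information.

```python
# Hypothetical, hand-written lookup table: inflected form -> lemma.
LEMMA_TABLE = {
    "am": "be", "is": "be", "are": "be", "was": "be", "were": "be",
    "mice": "mouse", "ran": "run", "running": "run",
}

def lemmatize(tokens, table=LEMMA_TABLE):
    """Map each token to its lemma; unknown tokens are returned lowercased."""
    return [table.get(token.lower(), token.lower()) for token in tokens]

print(lemmatize(["The", "mice", "were", "running"]))
# ['the', 'mouse', 'be', 'run']
```

Note that irregular forms such as "mice" or "were" cannot be handled by stripping suffixes, which is why a lookup of this kind (or a richer morphological model) is needed.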

## 6. Explain Stemming
Stemming is the process of reducing the morphological variants of a word to a common root or base form, called the stem. Programs that do this are commonly referred to as stemming algorithms or stemmers. For example, a stemming algorithm reduces “chocolates” and “chocolatey” to a common stem, and “retrieval”, “retrieved” and “retrieves” to a stem such as “retrieve”; the stem need not itself be a dictionary word. Stemming is a common step in natural language processing pipelines. The input to the stemmer is a sequence of tokenized words.
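
The idea can be sketched with a deliberately crude suffix stripper. The function `simple_stem` and its suffix list are assumptions made for this example; this is not the Porter algorithm or any library's stemmer, which apply many more rules and conditions.

```python
def simple_stem(word, suffixes=("ing", "ed", "es", "s", "e")):
    """Strip the first matching suffix, provided at least three characters remain."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["retrieve", "retrieved", "retrieves", "retrieving"]
print([simple_stem(w) for w in words])
# ['retriev', 'retriev', 'retriev', 'retriev']
```

All four variants collapse to the same stem, "retriev", which shows that a stem is a shared key for related words rather than a dictionary form.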

## 7. Explain Part-of-speech (POS) tagging
Part-of-speech (POS) tagging is the task of labelling each word in a sentence with its appropriate part of speech. Parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions and their sub-categories.
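
A very simple baseline tagger just looks each word up in a lexicon. The sketch below assumes a tiny hand-written `LEXICON`, a coarse tag set (`DET`, `ADJ`, `NOUN`, `VERB`, `ADV`) and a default tag for unknown words; all of these are illustrative choices. Practical taggers also use the surrounding context to resolve ambiguous words.

```python
# Hypothetical mini-lexicon: word -> part-of-speech tag.
LEXICON = {
    "the": "DET", "a": "DET",
    "small": "ADJ",
    "dog": "NOUN",
    "barks": "VERB",
    "loudly": "ADV",
}

def pos_tag(tokens, lexicon=LEXICON, default="NOUN"):
    """Tag each token by lexicon lookup; unknown words get the default tag."""
    return [(token, lexicon.get(token.lower(), default)) for token in tokens]

print(pos_tag("The small dog barks loudly".split()))
# [('The', 'DET'), ('small', 'ADJ'), ('dog', 'NOUN'), ('barks', 'VERB'), ('loudly', 'ADV')]
```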

## 8. Explain Chunking or shallow parsing
Shallow parsing (also called chunking or light parsing) is an analysis of a sentence that first identifies its constituent parts (nouns, verbs, adjectives, etc.) and then links them into higher-order units that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.).

## 9. Explain Noun Phrase (NP) chunking
Noun phrase chunking deals with extracting the noun phrases from a sentence. While NP chunking is much simpler than full parsing, building an accurate and efficient NP chunker is still a challenging task. NP chunking is important because it is used in many applications.
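
One common way to illustrate NP chunking is a pattern over part-of-speech tags. The sketch below assumes the input is already tagged with a coarse tag set (`DET`, `ADJ`, `NOUN`, and so on) and that a noun phrase is an optional determiner, any number of adjectives, and one or more nouns. Both the tag set and the pattern are simplifying assumptions, and `np_chunks` is a name chosen for this example.

```python
import re

# Encode each tag as one character so that string positions equal token positions.
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N"}

def np_chunks(tagged_tokens):
    """Return noun-phrase chunks matching the pattern DET? ADJ* NOUN+."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged_tokens)
    chunks = []
    for match in re.finditer(r"D?A*N+", codes):
        words = [word for word, _ in tagged_tokens[match.start():match.end()]]
        chunks.append(" ".join(words))
    return chunks

tagged = [
    ("the", "DET"), ("little", "ADJ"), ("dog", "NOUN"),
    ("chased", "VERB"),
    ("a", "DET"), ("red", "ADJ"), ("ball", "NOUN"),
]
print(np_chunks(tagged))
# ['the little dog', 'a red ball']
```

The verb "chased" falls outside both chunks, which is typical of shallow parsing: it groups words into flat phrases without building a full parse tree.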

## 10. Explain Named Entity Recognition
Named-entity recognition is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
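
A toy version of this task can be sketched with a gazetteer (a list of known names) plus regular expressions for numeric categories. Everything below is invented for illustration: the `GAZETTEER` entries, the `PATTERNS`, the label names, and the example sentence. Real NER systems usually rely on statistical or neural models rather than fixed lists.

```python
import re

# Hypothetical gazetteer: category -> known names.
GAZETTEER = {
    "PERSON": ["Alice Smith"],
    "ORG": ["Acme Corp"],
    "LOC": ["Paris"],
}

# Hypothetical patterns for numeric entity types.
PATTERNS = {
    "MONEY": r"\$\d+(?:\.\d+)?",
    "PERCENT": r"\d+(?:\.\d+)?%",
}

def find_entities(text):
    """Return (entity text, label) pairs in order of appearance."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                found.append((match.start(), match.group(), label))
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            found.append((match.start(), match.group(), label))
    found.sort()  # order by position in the text
    return [(entity, label) for _, entity, label in found]

text = "Alice Smith joined Acme Corp in Paris for $5000, a 10% raise."
print(find_entities(text))
# [('Alice Smith', 'PERSON'), ('Acme Corp', 'ORG'), ('Paris', 'LOC'), ('$5000', 'MONEY'), ('10%', 'PERCENT')]
```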