# NLP Assignment 3
1. What's the difference between regular grammar and regular expressions?
2. In the context of NLP, what is parsing?
3. What exactly is the TF-IDF procedure?
4. Which factors influence Natural Language Processing interpretation?
5. What are the different types of conversational interfaces?
6. Apply Uni-gram and Bi-gram to the statement "I am really happy."

## 1. What's the difference between regular grammar and regular expressions?

   - The theory of computation defines a regular language as a language that is accepted by a finite automaton. A finite automaton is a machine that reads an input string and decides whether that string belongs to the language. If some finite automaton accepts exactly the strings of a language, the language is regular; if no such automaton exists, it is not.

   - E.g.: the lexical analyzer of a compiler uses finite automata to recognize tokens in the given source code, and it reports an error when it meets a character sequence that matches no token.

   - Regular expressions, regular grammars and finite automata are simply three different formalisms for the same thing, and there are algorithms to convert from any of them to any other. Regular grammars and regular expressions differ in how they describe the language:

#### Regular grammar: 
   - It is a generator of a regular language: a set of rules from which every string of the language, and no other string, can be derived. Rules of this kind are embedded in a compiler to check the given source code.
   
- A regular grammar is a four-tuple (N, Σ, P, S) where N is the set of non-terminals, Σ is the set of terminals, S ∈ N is the start non-terminal and P is the set of productions that tell us how to rewrite the start symbol, step by step, into a string in Σ∗. All productions in P must be drawn from one of the following two types (the two types must not be mixed):

1. Right Linear Grammars:

- For non-terminals B, C, terminal a and the empty string ε, all rules are of the form:
    - B→a
    - B→aC
    - B→ε
    
2. Left Linear Grammars:

- Left linear grammars are the same, except that the second rule form is B→Ca: the non-terminal appears to the left of the terminal.

####  Regular expression: 
- It is a representation of a regular language: an algebraic expression that denotes the language. In practice, a regular expression is a character sequence that defines a search pattern.

- Regular expressions over an alphabet Σ are defined recursively as follows:
    - ∅ is a regular expression
    - ε is a regular expression
    - a is a regular expression for every a∈Σ
    - if A and B are regular expressions, then
        - A⋅B is a regular expression (concatenation)
        - A∣B is a regular expression (alternation)
        - A∗ is a regular expression (Kleene star)

- Along with some semantics (i.e. how we interpret the operators to get a set of strings), this gives us a way of describing the strings of a regular language.
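
To make the equivalence concrete, here is a minimal sketch that describes one language twice: once as a right linear grammar and once as a regular expression. The language (ab)∗, the `GRAMMAR` dictionary and the helper names are assumptions chosen for illustration; they do not come from the assignment.

```python
import re
from itertools import product

# Hypothetical right linear grammar for the language (ab)*:
#   S -> aB | ε
#   B -> bS
# Each production is (terminal, next_non_terminal); ("", None) encodes ε.
GRAMMAR = {
    "S": [("a", "B"), ("", None)],
    "B": [("b", "S")],
}

def grammar_accepts(string, symbol="S"):
    """Return True if `string` can be derived from `symbol`."""
    for terminal, next_symbol in GRAMMAR[symbol]:
        if next_symbol is None:
            # Terminating production (B -> a or B -> ε): must match the rest.
            if string == terminal:
                return True
        elif terminal and string.startswith(terminal):
            # Production B -> aC: consume one terminal, continue from C.
            if grammar_accepts(string[len(terminal):], next_symbol):
                return True
    return False

# The same language written as a regular expression.
REGEX = re.compile(r"(ab)*")

def regex_accepts(string):
    return REGEX.fullmatch(string) is not None

# Compare both descriptions on every string over {a, b} up to length 6.
for length in range(7):
    for letters in product("ab", repeat=length):
        s = "".join(letters)
        assert grammar_accepts(s) == regex_accepts(s)

print(grammar_accepts("abab"), regex_accepts("abab"))  # True True
print(grammar_accepts("aba"), regex_accepts("aba"))    # False False
```

The exhaustive comparison is only a sanity check on short strings, not a proof; the general result rests on the conversion algorithms mentioned above.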


## 2. In the context of NLP, what is parsing?

- A parser may be defined as a software component that takes text as input, verifies its syntax against a formal grammar, and converts it into a structural representation. That representation is a data structure such as a parse tree, an abstract syntax tree, or another hierarchical structure.
- Syntactic analysis, also known as parsing or syntax analysis, is the phase of NLP that follows lexical analysis.
- The goal of this step is to work out how the words of a sentence relate to each other grammatically, which is a prerequisite for extracting its precise, or dictionary-like, meaning.
- It does so by analyzing strings of symbols in natural language according to the rules of a formal grammar.

Parsing is divided into two categories by the direction of derivation:
- Top-down parsing - The parser starts constructing the parse tree from the start symbol and tries to derive the input from it.
- Bottom-up parsing - The parser starts with the input symbols and tries to build the parse tree up to the start symbol.
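
As an illustration of the top-down approach, the following sketch implements a small recursive descent parser. The three-rule grammar, the `LEXICON` and all function names are hypothetical and far simpler than anything a real NLP parser would use.

```python
# Toy grammar (hypothetical, for illustration only):
#   S  -> NP VP
#   NP -> Det N
#   VP -> V NP
LEXICON = {
    "Det": {"the", "a"},
    "N": {"dog", "cat"},
    "V": {"chased", "saw"},
}

def parse_word(category, tokens, pos):
    """Match one word of the given lexical category at position `pos`."""
    if pos < len(tokens) and tokens[pos] in LEXICON[category]:
        return (category, tokens[pos]), pos + 1
    raise SyntaxError(f"expected {category} at position {pos}")

def parse_np(tokens, pos):
    det, pos = parse_word("Det", tokens, pos)
    noun, pos = parse_word("N", tokens, pos)
    return ("NP", det, noun), pos

def parse_vp(tokens, pos):
    verb, pos = parse_word("V", tokens, pos)
    obj, pos = parse_np(tokens, pos)
    return ("VP", verb, obj), pos

def parse_sentence(text):
    """Top-down parse: start from S and expand towards the input words."""
    tokens = text.lower().split()
    subject, pos = parse_np(tokens, 0)
    predicate, pos = parse_vp(tokens, pos)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected word at position {pos}")
    return ("S", subject, predicate)

tree = parse_sentence("The dog chased a cat")
print(tree)
# ('S', ('NP', ('Det', 'the'), ('N', 'dog')),
#       ('VP', ('V', 'chased'), ('NP', ('Det', 'a'), ('N', 'cat'))))
```

Each function corresponds to one non-terminal, so the call structure mirrors the parse tree. A sentence that does not fit the grammar, such as "dog the chased", raises `SyntaxError`.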


## 3. What exactly is the TF-IDF procedure?

- TF-IDF (term frequency-inverse document frequency) is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.

- It is computed by multiplying two metrics: the term frequency (how many times a word appears in a document) and the inverse document frequency of the word across the set of documents (which shrinks as the word appears in more documents).
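
The product can be worked through by hand. The sketch below uses the plain textbook definitions, tf = count / document length and idf = log(N / df); the toy corpus and function names are assumptions made for illustration. Libraries often use variants of these formulas (scikit-learn, for example, smooths the IDF and normalizes each vector by default), so their numbers will differ.

```python
import math

# Hypothetical toy corpus, tokenized by whitespace.
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [doc.split() for doc in documents]

def term_frequency(term, tokens):
    """Share of the document's tokens that are `term`."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, corpus):
    """log(N / df): terms that are rarer across the corpus weigh more."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, tokens, corpus):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)

for term in ("the", "cat"):
    print(term, round(tf_idf(term, tokenized[0], tokenized), 4))
# the 0.1352
# cat 0.1831
```

Although "the" occurs twice in the first document and "cat" only once, "cat" scores higher because it appears in fewer documents of the corpus.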


### TF-IDF implementation with scikit-learn

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

x_train = ['The main idea of TF-IDF is that algorithm is an important feature that can be separated from the corpus background']
x_test = ['Original text marked ', ' main idea']

# Count term occurrences, keeping at most the 10 most frequent terms as features.
vectorizer = CountVectorizer(max_features=10)

# Converts raw term counts into TF-IDF weights.
tf_idf_transformer = TfidfTransformer()

# Learn the vocabulary and IDF weights from the training text, then transform it.
tf_idf = tf_idf_transformer.fit_transform(vectorizer.fit_transform(x_train))
x_train_weight = tf_idf.toarray()

# Transform the test text with the vocabulary and IDF weights learned above.
tf_idf = tf_idf_transformer.transform(vectorizer.transform(x_test))
x_test_weight = tf_idf.toarray()

print('Output x_train text vector:')
print(x_train_weight)
print('Output x_test text vector:')
print(x_test_weight)
```

```
Output x_train text vector:
[[0.22941573 0.22941573 0.22941573 0.45883147 0.22941573 0.22941573
  0.22941573 0.22941573 0.45883147 0.45883147]]
Output x_test text vector:
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]
```


## 4. Which factors influence Natural Language Processing interpretation?

- Natural Language Processing is the technique used by computers to understand, and take actions based upon, human languages such as English. It is a part of Artificial Intelligence and cognitive computing. The process can involve speech-to-text conversion and training the machine to make intelligent decisions or take actions.

- Natural Language Processing, or NLP, works on unstructured data, and its interpretation depends upon several factors such as regional language, accent, grammar, tone, sentiment, locality, slang, pronunciation, etc.

- NLP typically proceeds through a series of steps: lexical analysis, syntactic analysis, semantic analysis, discourse integration, and pragmatic analysis. Some popular NLP implementations are Amazon Alexa, Google Assistant, and chatbots.



## 5. What are the different types of conversational interfaces?

- A conversational interface is an interface you can talk or write to in plain language. The aim is to provide a seamless user experience, as if you were talking to a friend or acquaintance. In practice, however, conversational interfaces mostly act as a stop-gap: they answer basic questions but are unable to offer as much support as a live agent.


- The main types of conversational interfaces are:
    - Basic bots
    - Text-based assistants
    - Voice assistants



## 6. Apply Uni-gram and Bi-gram to the statement "I am really happy."



**In computational linguistics, an n-gram is a sequence of n consecutive items in a text (items can be phonemes, syllables, letters, words or base pairs).**

N-grams of texts are widely used in the field of text mining and natural language processing. They are basically sets of co-occurring words within a defined window; when computing the n-grams, we typically move the window one word forward, or more depending upon the scenario.

>For example, take the sentence **“The cow jumps over the moon”**. If **N=2** (known as bigrams), then the n-grams would be:

* the cow
* cow jumps
* jumps over
* over the
* the moon

In n-gram, **n = 1 is unigram**, **n = 2 is bigram**, **n = 3 is trigram**. 

N-grams are often used for comparing sentence similarity, fuzzy querying, judging sentence plausibility, sentence correction, etc.
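
The sliding window described above can be written once for any n. This is a minimal sketch; the function name `ngrams` and the whitespace tokenization are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return all n-grams over `tokens` as space-joined strings."""
    # Slide a window of width n over the tokens, one position at a time.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cow jumps over the moon".split()
print(ngrams(tokens, 2))
# ['the cow', 'cow jumps', 'jumps over', 'over the', 'the moon']
print(ngrams(tokens, 3))
# ['the cow jumps', 'cow jumps over', 'jumps over the', 'over the moon']
```

If n is larger than the number of tokens, the function returns an empty list.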


### Unigram Implementation

```python
sentence = "I am really happy."
tokens = sentence.split(" ")

# Every token on its own is a unigram.
unigrams = []
for i in range(len(tokens)):
    unigrams.append(tokens[i])

print(unigrams)
```

```
['I', 'am', 'really', 'happy.']
```


### Bigram Implementation

```python
sentence = "I am really happy."
tokens = sentence.split(" ")

# Pair every token with the token that follows it.
bigrams = []
for i in range(len(tokens) - 1):
    bigrams.append(tokens[i] + " " + tokens[i + 1])

print(bigrams)
```

```
['I am', 'am really', 'really happy.']
```


### Trigram Implementation

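Here the punctuation is stripped before tokenizing, so the final period does not end up in the last trigram.

```python
import re

# Match common punctuation marks so they can be removed before tokenizing.
punctuation_pattern = re.compile(r'[.,!?"]')

sentence = "I am really happy."
no_punctuation_sentence = re.sub(punctuation_pattern, "", sentence)
tokens = no_punctuation_sentence.split()

# Join every token with the two tokens that follow it.
trigrams = []
for i in range(len(tokens) - 2):
    trigrams.append(tokens[i] + " " + tokens[i + 1] + " " + tokens[i + 2])

print(trigrams)
```

```
['I am really', 'am really happy']
```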

**References:**

- https://cs.stackexchange.com/questions/45755/difference-between-regular-expression-and-grammar-in-automata
- https://www.tutorialspoint.com/natural_language_toolkit/natural_language_toolkit_parsing.htm#:~:text=Parsing%20and%20its%20relevance%20in,Syntactic%20analysis%20or%20syntax%20analysis.
- https://monkeylearn.com/blog/what-is-tf-idf/#:~:text=TF%2DIDF%20(term%20frequency%2D,across%20a%20set%20of%20documents.
- https://www.educba.com/what-is-natural-language-processing/
- https://alan.app/blog/what-is-conversational-user-interface-cui/#TypesofConversationalUserInterfaces