# NLP Assignment 5
1. Explain the concept of lemmatization.
2. What exactly is NLU? Mention how it's used.
3. What exactly are stop words?
4. What is corpus juris?
5. Describe the veiled language model.
6. Explain how a word embedding works.

### 1. Explain the concept of lemmatization.

- In stemming and in lemmatization, we try to reduce a given word to its root word. The root word is called a stem in the stemming process, and it is called a lemma in the lemmatization process.

- In stemming, the tail end of the word is simply chopped off to arrive at the stem. Different algorithms decide how many characters to chop off, but none of them actually knows the meaning of the word in the language it belongs to.
- In lemmatization, on the other hand, the algorithms have this knowledge. In effect, they refer to a dictionary to understand the meaning of the word before reducing it to its root word, or lemma.


- So, a lemmatization algorithm would know that the word better is derived from the word good, and hence the lemma is good. A stemming algorithm wouldn’t be able to do the same.
- Stemming can over-stem or under-stem: the word better could be reduced to bet or bett, or just retained as better. But stemming has no way to reduce it to its root word good. This is the basic difference between stemming and lemmatization.

In [1]:
import nltk
from nltk.stem import PorterStemmer
word_stemmer = PorterStemmer()
word_stemmer.stem('going')

'go'

In [2]:
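from nltk.stem import WordNetLemmatizer
# Requires the WordNet data package: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
# The word is treated as a noun by default (pos='n'), so 'going' comes back unchanged.
lemmatizer.lemmatize('going')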

'going'
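
The contrast between chopping off a suffix and looking a word up can be sketched without NLTK. This is a toy illustration only: the suffix list and the mini-dictionary below are hypothetical stand-ins, and real stemmers and lemmatizers are far more elaborate.

```python
# Toy illustration: neither function is a real stemming or lemmatization algorithm.
SUFFIXES = ("ing", "ed", "er", "s")  # hypothetical suffix list

# Hypothetical mini-dictionary mapping inflected forms to their lemmas.
LEMMA_DICT = {"better": "good", "going": "go", "went": "go"}


def toy_stem(word):
    """Chop off the first matching suffix, with no knowledge of meaning."""
    for suffix in SUFFIXES:
        # Keep at least two characters so short words are not destroyed.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_DICT.get(word, word)


for w in ("going", "better", "went"):
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The suffix rule can only shorten a word, so it turns better into bett, while the dictionary lookup is what makes the jump from better to good possible.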

### 2. What exactly is NLU? Mention how it's used.

- Natural language understanding is a branch of artificial intelligence that uses computer software to understand input in the form of sentences using text or speech.
- NLU enables human-computer interaction. Comprehension of human languages such as English, Spanish and French allows computers to understand commands without the formalized syntax of computer languages. NLU also enables computers to communicate back to humans in their own languages.
- A major application of NLU is chat- and voice-enabled bots that can interact with the public without supervision. Many major IT companies, such as Amazon, Apple, Google and Microsoft, as well as startups, have NLU projects underway.


### 3. What exactly are stop words?

- Stop words are the words in a language that do not add much meaning to a sentence.
- They can often be ignored without sacrificing the meaning of the sentence. For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on.
- In that case, dropping stop words can cause problems when searching for phrases that include them, particularly in names such as “The Who” or “Take That”.

- For a task such as text classification or sentiment analysis, stop words are usually removed because they contribute little information to the model, i.e. we keep unwanted words out of our corpus. For language translation, however, stop words are useful, as they have to be translated along with the other words.

- There is no hard and fast rule on when to remove stop words. But I would suggest removing them if the task is one of Language Classification, Spam Filtering, Caption Generation, Auto-Tag Generation, Sentiment Analysis, or something else related to text classification.

- On the other hand, if the task is one of Machine Translation, Question Answering, Text Summarization, or Language Modeling, it’s better not to remove the stop words, as they are a crucial part of these applications.

In [8]:
import re

import nltk
from nltk.corpus import stopwords

# Requires NLTK data: nltk.download('stopwords') and nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead).
paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               """
# Preprocessing the data
text = re.sub(r'\[[0-9]*\]', ' ', paragraph)  # drop bracketed citation markers such as [1]
text = text.lower()
text = re.sub(r'\d', ' ', text)               # replace digits with spaces
text = re.sub(r'\s+', ' ', text)              # collapse runs of whitespace

# Preparing the dataset
sentences = nltk.sent_tokenize(text)
sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
print(sentences[0])

# Applying stop words: build the set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
sentences = [[word for word in sentence if word not in stop_words] for sentence in sentences]

['i', 'have', 'three', 'visions', 'for', 'india', '.']


In [9]:
sentences[0]

['three', 'visions', 'india', '.']
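
The phrase-search problem mentioned earlier (names such as “The Who”) can be seen with a minimal sketch. It assumes a tiny hand-picked stop list made of the five example words from the text, not NLTK's list, and the helper name `remove_stop_words` is hypothetical.

```python
# Hypothetical stop list: just the five example words named in the text.
STOP_WORDS = {"the", "is", "at", "which", "on"}


def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop tokens found in STOP_WORDS."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


# An ordinary sentence keeps its content words.
print(remove_stop_words("the cat is on the mat"))

# A band name loses the word that makes it a name.
print(remove_stop_words("The Who is on tour"))
```

After filtering, the query about the band is reduced to who and tour, so a search over the filtered tokens can no longer match the exact phrase.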

### 4. What is corpus juris?

- The legal term Corpus Juris means "body of law".

- It was originally used by the Romans for several of their collections of all the laws in a certain field—see Corpus Juris Civilis—and was later adopted by medieval jurists in assembling the Corpus Juris Canonici.

### 5. Describe the veiled language model.

- A language model is a probability distribution over sequences of words. Given a sequence of length m, a language model assigns a probability to the whole sequence. Language models generate probabilities by training on text corpora in one or many languages. Given that languages can be used to express an infinite variety of valid sentences (the property of digital infinity), language modelling faces the problem of assigning non-zero probabilities to linguistically valid sequences that may never be encountered in the training data.

- Essentially, the problem being addressed is that human beings often put hidden, implied, or veiled meaning into the natural language they use. Natural language processing systems that operate on the literal meaning of terms and phrases cannot identify this hidden, implied, or veiled meaning, and may instead come to an incorrect conclusion about what the content is stating.

- For example, announcements by organizational, governmental, and business leaders, politicians, spokespersons, public relations individuals, and the like, often utilize language that attempts to provide information while hiding or at least obscuring the real meaning behind the announcement by utilizing specific combinations of terms and phrases that are ambiguous in their meaning.
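
The first point above, a language model as a probability distribution that must still give non-zero probability to unseen sequences, can be sketched with a bigram model. The bigram approximation, the add-one (Laplace) smoothing, the two-sentence corpus, and all function names are assumptions chosen for illustration; they do not come from the text.

```python
from collections import Counter


def train_bigram_counts(corpus):
    """Count contexts and bigrams over whitespace-tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = {"</s>"}  # tokens the model can predict, including end-of-sentence
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens[1:])
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, vocab


def sequence_probability(sentence, context_counts, bigram_counts, vocab):
    """Probability of the whole sequence as a product of smoothed bigram terms."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from getting probability zero.
        prob *= (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + len(vocab))
    return prob


corpus = ["the cat sat", "the dog ran"]  # toy training data
model = train_bigram_counts(corpus)
p_seen = sequence_probability("the cat sat", *model)
p_unseen = sequence_probability("the cat ran", *model)  # never appears in the corpus
print(p_seen, p_unseen)
```

The sentence that never occurs in the training data still receives a small positive probability, lower than that of the sentence the model has seen.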

### 6. Explain how a word embedding works.

- It is an approach for representing words and documents. A word embedding, or word vector, is a numeric vector that represents a word in a lower-dimensional space. It allows words with similar meanings to have similar representations, so the vectors can also approximate meaning. A word vector with 50 values can represent 50 unique features.

- Word Embeddings are a method of extracting features out of text so that we can input those features into a machine learning model to work with text data. 
- They try to preserve syntactic and semantic information. Methods such as Bag of Words (BOW), CountVectorizer and TF-IDF rely on the word counts in a sentence but do not preserve any syntactic or semantic information. In these methods, the size of the vector is the number of elements in the vocabulary.
- Because most of those elements are zero for any given sentence, the result is a sparse matrix.
- Large input vectors mean a huge number of weights, which makes training computationally expensive. Word embeddings offer a solution to these problems.
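
The difference between count-based vectors and dense embeddings can be sketched in plain Python. The three sentences and the 3-dimensional "embedding" values are made up for illustration (real embeddings are learned, not hand-written), and the function names are hypothetical.

```python
import math


def bag_of_words(sentences):
    """Return the vocabulary and one count vector per sentence."""
    vocab = sorted({word for s in sentences for word in s.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for s in sentences:
        vec = [0] * len(vocab)  # one slot per vocabulary entry
        for word in s.lower().split():
            vec[index[word]] += 1
        vectors.append(vec)
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


sentences = ["the cat sat on the mat", "the dog barked", "stocks fell sharply today"]
vocab, vectors = bag_of_words(sentences)
zeros = sum(value == 0 for vec in vectors for value in vec)
print("vector length:", len(vectors[0]), "vocabulary size:", len(vocab))
print("zero entries:", zeros, "of", len(vocab) * len(vectors))

# Hand-written toy embeddings (assumed values, not trained).
toy_embedding = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "stocks": [0.1, 0.0, 0.9],
}
print("cat~dog:", cosine(toy_embedding["cat"], toy_embedding["dog"]))
print("cat~stocks:", cosine(toy_embedding["cat"], toy_embedding["stocks"]))
```

The count vectors grow with the vocabulary and are mostly zeros, while the dense vectors stay short and place related words close together.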


Two common approaches to obtaining word embeddings are:

1) **Word2Vec:**
- Word2Vec is one of the most popular techniques for learning word embeddings using a shallow neural network.
- The embedding can be obtained using two methods (both involving neural networks): Skip-Gram and Continuous Bag of Words (CBOW).
    - Word2vec training format:

          Size Dimension
          Word1 vector1
          Word2 vector2
          ....
          WordN vectorN
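
The two Word2Vec training schemes differ in how training examples are formed: Skip-Gram predicts each context word from the center word, while CBOW predicts the center word from its surrounding context. The sketch below only builds those training examples and does not train a network; the window size and function names are assumptions for illustration.

```python
def context_indices(position, length, window):
    """Indices within `window` tokens of `position`, excluding the position itself."""
    start = max(0, position - window)
    stop = min(length, position + window + 1)
    return [j for j in range(start, stop) if j != position]


def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word is the input, each neighbour a target."""
    return [
        (center, tokens[j])
        for i, center in enumerate(tokens)
        for j in context_indices(i, len(tokens), window)
    ]


def cbow_examples(tokens, window=2):
    """(context words, center) examples: the neighbours are the input, the center the target."""
    return [
        ([tokens[j] for j in context_indices(i, len(tokens), window)], center)
        for i, center in enumerate(tokens)
    ]


tokens = ["the", "quick", "brown", "fox"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

In a full implementation these examples would be fed to the shallow network, and the learned input weights would become the word vectors.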

2) **GloVe:**
- GloVe is an unsupervised learning algorithm for obtaining vector representations of words. It is trained on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations show interesting linear substructures of the word vector space.

- GloVe word vector format: GloVe is a type of word embedding. The file format written by the Stanford open-source GloVe training code differs slightly from the word2vec format.
- The first line of a model trained by word2vec gives the vocabulary size and the vector dimension, while a GloVe file has no such header line.

    - GloVe training format:

          Word1 vector1
          Word2 vector2
          ....
          WordN vectorN
    
- Therefore, we can add a line with the vocabulary size and dimension at the front of a GloVe-trained model, after which it can be used in the same way as a word2vec model. The official website provides many pre-trained word vector models, which can be downloaded and used directly.
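
The header fix described above can be sketched as a small text conversion. It assumes one vector per line, with a token that contains no whitespace followed by its values; the function name `glove_to_word2vec_text` is hypothetical, and the two sample vectors are made up.

```python
def glove_to_word2vec_text(glove_text):
    """Prepend the 'vocabulary_size dimension' header that word2vec text files start with."""
    lines = [line for line in glove_text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no vectors found")
    # Each line is a token followed by its vector values.
    dimension = len(lines[0].split()) - 1
    header = f"{len(lines)} {dimension}"
    return "\n".join([header] + lines) + "\n"


# Made-up sample in GloVe text format (no header line).
glove_sample = "cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n"
print(glove_to_word2vec_text(glove_sample))
```

For a real file, the same logic would be applied line by line while streaming, since pre-trained vector files can be too large to hold comfortably in a single string.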

### References:
- https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/
- https://www.techtarget.com/searchenterpriseai/definition/natural-language-understanding-NLU
- https://medium.com/@saitejaponugoti/stop-words-in-nlp-5b248dadad47
- https://patents.justia.com/patent/9760564