# **Exp No.1 : Write a program to perform tokenization, filtering and script validation of English and Hindi text**

## Tokenization

**Tokenization** splits a phrase, sentence, paragraph, or entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token. <br> For example, the text “It is raining” can be tokenized into ‘It’, ‘is’, ‘raining’. <br>

Tokenization can separate either words or sentences. Splitting text into words is called **word tokenization**, and splitting it into sentences is called **sentence tokenization**. <br>
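
The difference between the two levels can be sketched with the standard library alone. This is a naive illustration rather than a production approach: the regular expression and the sample text are assumptions chosen for the example, and abbreviations such as "Dr." would be split incorrectly.

```python
import re

text = "It is raining. Take an umbrella!"

# Sentence tokenization (naive): split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokenization (naive): split each sentence on whitespace
words = [sentence.split() for sentence in sentences]

print(sentences)  # ['It is raining.', 'Take an umbrella!']
print(words)      # [['It', 'is', 'raining.'], ['Take', 'an', 'umbrella!']]
```

Note that the punctuation stays attached to the last word of each sentence here, because only whitespace is used as the word separator.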

In the process of tokenization, some characters like punctuation marks may be discarded.

Before processing natural language, we need to identify the words that make up a string of characters. That is why tokenization is the first step in most NLP work on text data: the meaning of a text is interpreted largely by analyzing the words it contains.

Let’s take an example. Consider the below string:

“This is a cat.”

What happens when we tokenize this string? We get [‘This’, ‘is’, ‘a’, ‘cat’]. The full stop has been discarded.
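
One way to reproduce this result is a regular expression that keeps only runs of word characters. This is a minimal sketch, assuming plain English input; the pattern is an illustrative choice, and it would split contractions such as "isn't" into two tokens.

```python
import re

text = "This is a cat."

# \w+ matches runs of letters, digits and underscores, so the full stop is dropped
tokens = re.findall(r"\w+", text)

print(tokens)  # ['This', 'is', 'a', 'cat']
```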

The tokenized form has many uses. For example, we can:
*   Count the number of words in the text
*   Count the frequency of a word, that is, the number of times a particular word occurs
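
Both counts can be sketched with the standard library. The sample sentence below is an assumption made for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

tokens = "the cat sat on the mat".split()

word_count = len(tokens)       # total number of tokens
frequencies = Counter(tokens)  # occurrences of each distinct token

print(word_count)                  # 6
print(frequencies["the"])          # 2
print(frequencies.most_common(1))  # [('the', 2)]
```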

**White Space Tokenization** is the simplest tokenization technique. Given a sentence or paragraph, it produces words by splitting the input wherever white space is encountered. It is the fastest tokenization technique, but it only works for languages in which white space separates a sentence into meaningful words, for example English and Hindi.
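
In Python, white space tokenization amounts to calling `str.split()` with no arguments. The sample sentences below are assumptions chosen for illustration.

```python
english = "It is raining"
hindi = "आज बारिश हो रही है"

english_tokens = english.split()
hindi_tokens = hindi.split()

print(english_tokens)  # ['It', 'is', 'raining']
print(hindi_tokens)    # ['आज', 'बारिश', 'हो', 'रही', 'है']

# Punctuation is not white space, so it stays attached to the neighbouring word
with_punctuation = "This is a cat.".split()
print(with_punctuation)  # ['This', 'is', 'a', 'cat.']
```

The last line shows the main limitation of this technique: punctuation has to be handled in a separate step.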


**Tokenization with NLTK**

NLTK is a standard Python library with prebuilt functions and utilities for natural language processing and computational linguistics, and it is one of the most widely used libraries in the field.

The tokenize module of the Natural Language Toolkit provides functions for two levels of tokenization:

1.   word tokenization (word_tokenize)
2.   sentence tokenization (sent_tokenize)




In [None]:
!pip install nltk



In [None]:
import nltk

nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package brown_tei to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_cat.zip.
[nltk_data]    | Downloading package cess_esp to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_esp.zip.
[nltk_data]    | Downloading package chat80 to /root/nltk_data...
[nltk_data]    |   Unzipp

True

In [None]:
'''
Tokenization of words
We use the method word_tokenize() to split a sentence into words. '''

from nltk.tokenize import word_tokenize
#text = "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The result is a computer capable of ""understanding"" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves."
text = "New-York. is a city bitch."
print("word_tokenize",word_tokenize(text),'\n')

from nltk.tokenize import TreebankWordTokenizer
# Separates words using punctuation and spaces.
tokenizer = TreebankWordTokenizer()
print("TreebankWordTokenizer",tokenizer.tokenize(text),'\n')


from nltk.tokenize import WordPunctTokenizer
# Separates every run of punctuation from the words around it.
tokenizer = WordPunctTokenizer()
print("WordPunctTokenizer",tokenizer.tokenize(text),'\n' )


# Multi-Word Expression Tokenizer (MWETokenizer): takes a list of tokens and merges
# the listed multi-word expressions into single tokens.
# Neither expression below occurs in this text, so the tokens come back unchanged.
from nltk.tokenize import MWETokenizer
tokenizer = MWETokenizer([('New', 'Panvel'), ('Kohann', 'K.', 'Toper')],separator=' ')

print("MWETokenizer",tokenizer.tokenize(text.split()))

word_tokenize ['New-York', '.', 'is', 'a', 'city', 'bitch', '.'] 

TreebankWordTokenizer ['New-York.', 'is', 'a', 'city', 'bitch', '.'] 

WordPunctTokenizer ['New', '-', 'York', '.', 'is', 'a', 'city', 'bitch', '.'] 

MWETokenizer ['New-York.', 'is', 'a', 'city', 'bitch.']


In [None]:
'''
Tokenization of sentences
We use the method sent_tokenize() to split a paragraph into sentences. '''

from nltk.tokenize import sent_tokenize
# Single quotes delimit the string so that the double quotes around "understanding" do not end it early.
text = 'Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The result is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.'
print(sent_tokenize(text))

# sent_tokenize uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module,
# which has already been trained and therefore knows which characters and punctuation
# mark the beginning and end of a sentence.

In [None]:
'''
If we want to tokenize text in another language, we can load a pickle file other than the English one.

https://github.com/alyssaq/nltk_data/blob/master/tokenizers/punkt/README describes the support for other languages
'''

import nltk.data 

text = "NLP is Great! I won a free Coursera cupon. Lets start studying NLP."
# Loading PunktSentenceTokenizer using English pickle file 
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle') 
  
print(tokenizer.tokenize(text))

# to tokenize French text
french_tokenizer = nltk.data.load('tokenizers/punkt/PY3/french.pickle') 

text = "Bonjour, comment allez-vous. J'ai gagné une Coupe Coursera gratuite"

print(french_tokenizer.tokenize(text))

# to tokenize Spanish text
spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle') 
  
text = 'Hola, cómo estás. Gané un cupón Coursera gratis'
print(spanish_tokenizer.tokenize(text))

# to tokenize German text
from nltk.tokenize import sent_tokenize

text = "NLP ist großartig! Ich habe einen kostenlosen Coursera Cupon gewonnen. Fangen wir an, NLP zu studieren."
token_text = sent_tokenize(text, language='german')
print(token_text)

['NLP is Great!', 'I won a free Coursera cupon.', 'Lets start studying NLP.']
['Bonjour, comment allez-vous.', "J'ai gagné une Coupe Coursera gratuite"]
['Hola, cómo estás.', 'Gané un cupón Coursera gratis']
['NLP ist großartig!', 'Ich habe einen kostenlosen Coursera Cupon gewonnen.', 'Fangen wir an, NLP zu studieren.']


**Student Task:**

Write code to demonstrate word-level and sentence-level tokenization of Hindi text.

## Filtering

One of the key steps in processing language data is removing noise so that the machine can more easily detect patterns in the data. Text data contains a lot of noise in the form of special characters such as hashtags, punctuation and numbers, all of which are difficult for computers to interpret when they are left in the data. We therefore need to process the data to remove these elements.
Additionally, it is important to pay attention to the casing of words. If both upper-case and lower-case versions of the same word are present, the computer treats them as different entities even though they are the same word.
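
The effect of casing can be seen in a simple word count. This is a small sketch using an assumed sample sentence; it is not part of the filtering function defined in this section.

```python
from collections import Counter

tokens = "The cat chased the other cat".split()

# Without normalization, "The" and "the" are counted as different words
raw_count = Counter(tokens)["the"]

# After lowercasing, both forms are counted together
lowered_count = Counter(t.lower() for t in tokens)["the"]

print(raw_count)      # 1
print(lowered_count)  # 2
```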

In [None]:
import re
import string

def filter_text(inText, lowerFlag=False, upperFlag=False, numberFlag=False, htmlFlag=False,
                urlFlag=False, punctFlag=False, spaceFlag=False, hashtagFlag=False, emojiFlag=False):
    if lowerFlag:
        inText = inText.lower()

    if upperFlag:
        inText = inText.upper()

    if numberFlag:
        # remove runs of digits
        inText = re.sub(r"\d+", '', inText)

    if htmlFlag:
        # remove HTML tags such as <br>
        inText = re.sub(r'<[^>]*>', '', inText)

    if urlFlag:
        # remove tokens that start with http, https, ftp or www
        inText = re.sub(r'(https?|ftp|www)\S+', '', inText)

    if punctFlag:
        # remove ASCII punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
        table_ = str.maketrans('', '', string.punctuation)
        inText = inText.translate(table_)

    if spaceFlag:
        # collapse runs of spaces into one and trim both ends
        inText = re.sub(' +', " ", inText).strip()

    if hashtagFlag:
        pass  # left for the student task

    if emojiFlag:
        pass  # left for the student task

    return inText

In [None]:
usrText = input()

Total 8 chickens are there. Isn't it?


In [None]:
filter_text(usrText,lowerFlag=True,htmlFlag=True,punctFlag=True,numberFlag=True)

'total  chickens are there isnt it'

**Student Task:**

Modify the above code to demonstrate filtering of hashtag words and certain emojis.

## Script Validation

In **script validation**, foreign words (words that do not belong to the required input language) are detected and removed. In the sentence “ विदेशी को हटाना hoga आज ”, the word “hoga” is a Hindi word written in English (Latin) characters. Depending on the requirements of the NLP application, script validation either removes the word “hoga” or transliterates it into the Devanagari script as “होगा”.
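
The removal case can be sketched for the example sentence as follows. This is one possible approach, not the method used in the rest of this section: it relies on Unicode character names from the standard `unicodedata` module, and the helper name `is_devanagari_word` is hypothetical. Transliteration is not covered.

```python
import unicodedata

def is_devanagari_word(word):
    # True only if every character's Unicode name starts with "DEVANAGARI"
    return all(unicodedata.name(ch, "").startswith("DEVANAGARI") for ch in word)

sentence = "विदेशी को हटाना hoga आज"
tokens = sentence.split()

# Words containing any non-Devanagari character are treated as foreign
foreign = [t for t in tokens if not is_devanagari_word(t)]
cleaned = " ".join(t for t in tokens if is_devanagari_word(t))

print(foreign)  # ['hoga']
print(cleaned)  # विदेशी को हटाना आज
```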

In [None]:
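# Unicode code-point references:
# https://en.wikipedia.org/wiki/List_of_Unicode_characters
# https://jrgraphix.net/r/Unicode/0020-007F
DEVANAGARI = range(2304, 2432)  # Devanagari block, U+0900 to U+097F

def isEnglishChar(ch):
    # Letters only: A-Z (65-90) and a-z (97-122).
    # Code points 91-96 are punctuation ([ \ ] ^ _ `), so they are excluded.
    return 65 <= ord(ch) <= 90 or 97 <= ord(ch) <= 122

def detectLang(inText, charFlag=False, wordFlag=False, sentenceFlag=False, lang="EN"):
    # Character level: the input must be exactly one character
    if charFlag and len(inText) == 1:
        if lang == "EN" and isEnglishChar(inText):
            return "EN"
        if lang == "HI" and ord(inText) in DEVANAGARI:
            return "HI"

    # Word level: every character must belong to the requested script
    if wordFlag and len(inText) > 1:
        if lang == "EN":
            return "EN" if all(isEnglishChar(x) for x in inText) else "Not Found"
        if lang == "HI":
            return "HI" if all(ord(x) in DEVANAGARI for x in inText) else "Not Found"

    if sentenceFlag:
        pass  # left for the student task

    return "Not Found"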

In [None]:
detectLang("स्वीकार",wordFlag=True,charFlag=True,lang="HI")

'HI'

In [None]:
detectLang("T",charFlag=True,lang="HI")

'Not Found'

**Student Task:**

Modify the above code to detect the language at sentence level and to validate whether the input text contains only Devanagari script.