# **NATURAL LANGUAGE PROCESSING**
Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, like speech and text, by software.

The study of natural language processing has been around for more than 50 years and grew out of the field of linguistics with the rise of computers.

In this notebook, you will discover what natural language processing is and why it is so important:

* What natural language is and how it differs from other types of data.
* What makes working with natural language so challenging.
* Where the field of NLP came from and how modern practitioners define it.

---



### REGULAR EXPRESSION
A regular expression (sometimes called a rational expression) is a sequence of characters that defines a search pattern, mainly for pattern matching on strings, such as "find and replace" operations.

Regular expressions are specially encoded text strings used as patterns for matching sets of strings.

Regular expressions are a general way to match patterns in sequences of characters. They are supported in most programming languages, including C++, Java, and Python.
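
As a minimal sketch of the "find and replace" use mentioned above, the snippet below uses `re.sub` from Python's standard library to replace every run of digits with a placeholder. The function name `mask_digits` and the `#` placeholder are illustrative choices, not something the notebook defines.

```python
import re

def mask_digits(text):
    """Replace every run of digits in text with the placeholder '#'."""
    return re.sub(r"\d+", "#", text)

print(mask_digits("he has 8 dogs and 11 cats"))
```

The pattern `\d+` is the same one used for finding digits in the cells that follow; only the operation changes from finding to replacing.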



In [None]:
import re

In [None]:
my_string = "Lets write RegEx! wont that be fun? I sure think so. Can you find 4 sentence? or perhapse, all 19 words ?"

In [None]:
sentence_end= r"[.?!]"

In [None]:
print(re.split(sentence_end, my_string))

In [None]:
# Find all capitalized words in my_string and print the result
# (the pattern needs at least two characters, so a lone "I" is skipped)
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words,my_string))

In [None]:
# Split my_string on whitespace and print the result
spaces = r"\s+"
print(re.split(spaces,my_string))

In [None]:
# Find all runs of digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

In [None]:
# re.match only matches at the start of the string, so this returns None
re.match("b", "abcdef")

In [None]:
# re.search scans the whole string and finds "d" at index 3
re.search("d", "abcdefg")

# **TOKENIZATION**

Tokenization is the process of splitting a text into smaller units called tokens, such as sentences, words, punctuation marks, or hashtags. It is usually the first step of an NLP pipeline, because later steps such as counting, filtering, and lemmatizing operate on tokens rather than on raw strings.

The same term is also used in data security, where it means replacing sensitive data with nonsensitive stand-in values. That is an unrelated meaning and is not what this notebook covers.

# Four NLTK tokenizers are covered here:
### 1. WORD_TOKENIZE 
### 2. SENT_TOKENIZE
### 3. REGEXP_TOKENIZE
### 4. TWEETTOKENIZER 

# 1. WORD_TOKENIZE

Word tokenization is the process of splitting a text into individual words. Many NLP tasks require it, because each word must be captured before it can be analyzed further, for example classified or counted for sentiment analysis.

In [None]:
import nltk
nltk.download('punkt')  # newer NLTK releases may need nltk.download('punkt_tab') instead
from nltk.tokenize import word_tokenize
word_tokenize("Welcome to NLP")


In [None]:
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

In [None]:
from nltk.tokenize import word_tokenize
  
text = "Hello everyone. I am Yash Thavkar. We are going to study about NLP"
word_tokenize(text)

# 2. SENT_TOKENIZE

The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. It has already been trained, so it knows which characters and punctuation mark the beginning and end of a sentence.
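
To see why a trained sentence tokenizer is useful, here is a minimal rule-based sketch using only the standard library. It splits after `.`, `?`, or `!` whenever whitespace follows. The function name `naive_sent_split` and the sample sentence are assumptions made for illustration; this is not how Punkt works internally.

```python
import re

def naive_sent_split(text):
    """Split after '.', '?' or '!' when followed by whitespace (rule-based sketch)."""
    return re.split(r"(?<=[.?!])\s+", text.strip())

# The abbreviation "Dr." is wrongly treated as the end of a sentence
print(naive_sent_split("Dr. Smith arrived. He sat down."))
```

The fixed rule cannot tell an abbreviation from a sentence boundary, which is the kind of case a trained tokenizer is designed to handle better.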

In [None]:
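# scene_one should be a variable holding the text; the literal string
# "scene_one" would be tokenized as a single word. Stand-in text (assumption):
scene_one = "Hello everyone. I am Yash Thavkar. We are going to study about NLP"
sentence = sent_tokenize(scene_one)

In [None]:
tokenized_sent = word_tokenize(sentence[0])

In [None]:
unique_tokens = set(word_tokenize(scene_one))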
print(unique_tokens)

In [None]:
# Only digits
digits_only = r'\d+'
re.findall(digits_only, "he has 8 dogs and 11 cats")

In [None]:
# Only lowercase letters
letters_only = r'[a-z]+'
re.findall(letters_only, "he has 8 dogs and 11 cats")

In [None]:
# Both letters and digits
digits_and_words = r'\w+'
re.findall(digits_and_words, "he has 8 dogs and 11 cats")

In [None]:
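my_str = "match lowercase spaces nums like 12, but no commas"
# The space inside the character class lets the match continue across words;
# it stops at the first comma
re.match(r'[a-z0-9 ]+', my_str)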

In [None]:
from nltk.tokenize import sent_tokenize
text = "Hello everyone. I am Yash Thavkar. We are going to study about NLP"
sent_tokenize(text)

# 3. REGEXP_TOKENIZE

Regular expressions give you complete control over how text is tokenized. Because they can get complicated very quickly, use them only when the word tokenizers covered in the previous sections are not adequate.
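
As a rough illustration of that control, the sketch below tokenizes with the standard library's `re.findall` and a two-part pattern. The names `TOKEN_PATTERN` and `simple_regex_tokenize` are hypothetical, and the pattern is one possible choice rather than a recommended default.

```python
import re

# Alternatives are tried left to right: words (apostrophes allowed) first,
# then any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"[\w']+|[^\w\s]")

def simple_regex_tokenize(text):
    """Return word tokens and standalone punctuation marks, in order."""
    return TOKEN_PATTERN.findall(text)

print(simple_regex_tokenize("Hey, let's explore NLP."))
```

Changing the pattern changes the tokens: dropping the second alternative discards punctuation, and dropping the apostrophe splits "let's" into two tokens.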



In [None]:
from nltk import regexp_tokenize
my_str ="SOLDIER#1 : Found them? In Mercea? The coconut's tropical!"
pattern1 =r"(\w+|#\d+|\?|\!)"
regexp_tokenize(my_str, pattern= pattern1)

In [None]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")
text = "Hey , let's explore the topic  NLP."
tokenizer.tokenize(text)

# 4. TWEETTOKENIZER

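Tweets contain hashtags, mentions, and emoticons that general-purpose tokenizers tend to split apart. The best way to tokenize tweets is to use a tokenizer built for them, such as NLTK's TweetTokenizer.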
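
The difference can be sketched with two standard-library patterns. This is an illustration only: `TWEET_AWARE` is a hypothetical pattern covering a handful of cases, not the rules that TweetTokenizer actually applies.

```python
import re

GENERIC = re.compile(r"\w+")
# Hashtags and mentions, the "<3" heart, simple emoticons such as :) or ;-(,
# and finally plain words
TWEET_AWARE = re.compile(r"[#@]\w+|<3|[:;]-?[()DP]|\w+")

tweet = "#NLP is superfun! <3 #learning"
print(GENERIC.findall(tweet))      # the '#' signs and the heart are lost
print(TWEET_AWARE.findall(tweet))  # hashtags and the heart survive as tokens
```

A hand-written pattern like this needs a new alternative for every new kind of token, which is why a purpose-built tokenizer is usually the more practical choice.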

In [None]:
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer



In [None]:
tweets =['This is the best #nlp exercise ive found online! #python',
         '#NLP is superfun! <3 #learning',
         'Thanks @Python :) #nlp #python'] 

In [None]:
pattern1 =r"#\w+"

In [None]:
regexp_tokenize(tweets[0], pattern1)


In [None]:
# Match hashtags and mentions; inside a character class "|" is a literal pipe, so it is dropped
pattern2 = r"[#@]\w+"

In [None]:
regexp_tokenize(tweets[-1], pattern2)

In [None]:
regexp_tokenize(tweets[1], pattern1)

In [None]:
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)

## NON-ASCII TOKENIZATION

In [None]:
#Create a string
german_text =" Wann gehen wir zur Pizza? und fahren sie mit vorbei?"

In [None]:
all_words = word_tokenize(german_text)
print(all_words)

In [None]:
# Tokenize and print only capitalized words
capital_words =r"[A-Z]\w+"

In [None]:
print(regexp_tokenize(german_text, capital_words))
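
One caveat with non-ASCII text: the class `[A-Z]` covers ASCII capitals only, so words that start with letters such as Ö or Ü are missed. The sketch below shows one possible workaround using `str.isupper`, which is Unicode-aware. The helper name `capitalized_words` and the sample sentence are made up for illustration.

```python
import re

def capitalized_words(text):
    """Return words whose first character is uppercase in any script."""
    return [w for w in re.findall(r"\w+", text) if w[0].isupper()]

sample = "Wann fahren wir nach Österreich? Über die Alpen."
print(re.findall(r"[A-Z]\w+", sample))  # misses "Österreich" and "Über"
print(capitalized_words(sample))
```

In Python 3, `\w` already matches Unicode letters in `str` patterns, so only the capital-letter test needs to change.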

## CHARTING WORD LENGTH WITH NLTK

In [None]:
#PLOTTING A HISTOGRAM WITH MATPLOTLIB
from matplotlib import pyplot as plt
plt.hist([1,5,5,7,7,7,9])

In [None]:
#combining NLP data extraction with plotting 
from nltk.tokenize import word_tokenize
words = word_tokenize("This is a pretty cool tool!")
word_length = [len(w) for w in words]
plt.hist(word_length)
plt.show()

# WORD COUNTS WITH BAG-OF-WORDS

# **Basic method for finding topics in a text**

* Need to first create tokens using tokenization
* ... and then count up all the tokens
* The more frequent a word, the more important it might be
* Can be a great way to determine the significant words in a text
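
The steps above can be sketched with the standard library alone. This version assumes English text and uses a simple `[a-z]+` pattern in place of a real tokenizer; the names `bag_of_words` and `toy_bow` are hypothetical.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, pull out alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

toy_bow = bag_of_words("The cat is in the box. The cat likes the box. The box is over the cat.")
print(toy_bow.most_common(3))
```

Lowercasing first means "The" and "the" are counted together, so the most frequent token here is "the" with a count of 6. That also shows why stopwords are usually removed before frequency is read as importance.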

In [None]:
#BUILDING A COUNTER WITH BAG-OF-WORDS
from nltk.tokenize import word_tokenize
from collections import Counter
counter = Counter(word_tokenize("""The cat is in the box. The cat likes the box. The box is over the cat."""))
counter.most_common(4)

In [None]:
tokens = word_tokenize(german_text)

In [None]:
# Convert the tokens to lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

In [None]:
# Create a Counter from the lowercase tokens: bow_simple
from nltk.tokenize import word_tokenize
from collections import Counter
bow_simple = Counter(lower_tokens)

In [None]:
# Print the 10 most common tokens
print(bow_simple.most_common(10))

# Text Preprocessing

Text preprocessing cleans text data so that it is ready to feed to a model. Text data contains noise in various forms, such as emoticons, punctuation, and inconsistent letter case.
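
A minimal sketch of such cleaning, using only the standard library: lowercase the text, keep alphabetic tokens, and drop stopwords. `TOY_STOPWORDS` is a tiny made-up list for illustration; NLTK's English stopword list is much longer.

```python
import re

# Tiny illustrative stopword list (assumption, not NLTK's list)
TOY_STOPWORDS = {"the", "is", "in", "a", "an", "and", "over"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens only, and drop toy stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in TOY_STOPWORDS]

print(preprocess("The Cat is in the Box :) The BOX is over the cat!!"))
```

The emoticon, the punctuation, and the case differences all disappear, leaving only the content words to count.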






In [None]:
# Text preprocessing: removing stopwords
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter


In [None]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

In [None]:
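text = "The Cat is in the Box , The Cat likes the Box , The Box is over the Cat."
# Keep only alphabetic tokens, lowercased
tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]
# Build the stopword set once instead of reloading the list for every token
english_stops = set(stopwords.words('english'))
no_stops = [t for t in tokens if t not in english_stops]
Counter(no_stops).most_common(2)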

In [None]:
# Exercise 2
from nltk.stem import WordNetLemmatizer

In [None]:
alpha_only = [t for t in lower_tokens if t.isalpha()]

In [None]:
# Build the stopword set once, then filter
english_stops = set(stopwords.words('english'))
no_stops = [t for t in alpha_only if t not in english_stops]

In [None]:
wordnet_lemmatizer = WordNetLemmatizer()

In [None]:
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

In [None]:
bow = Counter(lemmatized)

In [None]:
print(bow.most_common(10))