What are Corpus, Tokens, and N-grams?
A corpus is a collection of text documents; for example, a data set of tweets is a corpus. A corpus consists of documents, documents consist of paragraphs, paragraphs consist of sentences, and sentences consist of smaller units called tokens.
Tokens can be words, phrases, or n-grams, where an n-gram is a group of n consecutive words.
Refer to this link <https://www.analyticsvidhya.com/blog/2021/02/basics-of-natural-language-processing-nlp-basics/> for more details.
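
The "group of n words" idea (usually written n-gram) can be illustrated with a few lines of plain Python. This is only a minimal sketch: the helper name `ngrams` and the sample sentence are chosen here for illustration, and the tokens come from a simple whitespace split rather than a real tokenizer.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) built from a list of tokens."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # zip over n staggered views of the token list; zip stops at the shortest one
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "twitter data is a corpus".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A sentence with fewer than n tokens simply yields an empty list.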

Structure of a token: prefix + morpheme + suffix

#### Text Preprocessing

1. Noise Removal - Removing stop words (is, am, the), social media entities (mentions, hashtags), and punctuation.
                    The process_tweet function below does part of this with regular expressions; the stop word list comes from the nltk library.

2. Normalization -  a. Stemming: Stripping suffixes from a word with simple rules (not recommended when valid words are needed, because the result may not be in the vocabulary).
                    b. Lemmatization: A more systematic, step-by-step procedure for obtaining the root form of a word; it uses a vocabulary (dictionary lookup) and morphological analysis (word structure and grammatical relations).

3. Object Standardization - Replacing slang, jargon, and other non-standard words with words from the vocabulary.
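
Object standardization is often done with a lookup table. The sketch below assumes a tiny hand-written dictionary (`SLANG_LOOKUP` and its entries are hypothetical, as is the helper name `standardize`); a real table would be much larger and specific to the domain.

```python
import re

# Hypothetical lookup table, for illustration only
SLANG_LOOKUP = {
    "rt": "retweet",
    "dm": "direct message",
    "luv": "love",
    "awsm": "awesome",
}

def standardize(text, lookup=SLANG_LOOKUP):
    """Replace whole-word slang tokens using a lookup table (case-insensitive)."""
    def swap(match):
        word = match.group(0)
        # Words missing from the table are returned unchanged
        return lookup.get(word.lower(), word)
    return re.sub(r"\b\w+\b", swap, text)

print(standardize("RT this is a retweeted tweet by awsm people, DM me"))
```

Matching on whole words keeps "retweeted" intact even though it starts with "rt" letters elsewhere in the table.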

#### Trade-off between Stemming and Lemmatization:
The choice depends on the specific use case. Lemmatization produces a linguistically valid word while stemming is faster but may generate non-words.

In [11]:
# Noise Removal 

import nltk                                # Python library for NLP
from nltk.corpus import twitter_samples    # sample Twitter dataset from NLTK
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

# With Stemming
def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet

    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks    
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # Lowercase the words
    tweet = tweet.lower()
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean

In [8]:
# Download the WordNet data if you have not already done so
#import nltk
#nltk.download('wordnet')

[nltk_data] Downloading package wordnet to C:\Users\MOHAMMED
[nltk_data]     USAMA\AppData\Roaming\nltk_data...


True

In [12]:
# Normalization
from nltk.stem.wordnet import WordNetLemmatizer 
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer 
stem = PorterStemmer()

word = "multiplying"

# Apply stemming
print("Stemmed:", stem.stem(word))

# Apply lemmatization, specifying that the word is a verb
print("Lemmatized as verb:", lem.lemmatize(word, 'v'))

Stemmed: multipli
Lemmatized as verb: multiply


In [13]:
# With Lemmatization
def process_tweet_Lem(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet

    """
    lem = WordNetLemmatizer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks    
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # Lowercase the words
    tweet = tweet.lower()
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            lemm_text = lem.lemmatize(word)  # Lemmatizing word
            tweets_clean.append(lemm_text)

    return tweets_clean

In [10]:
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

In [14]:
# Preprocessing and Stemming
process_tweet(text)

['found',
 '2002',
 'spacex',
 '’',
 'mission',
 'enabl',
 'human',
 'becom',
 'spacefar',
 'civil',
 'multi-planet',
 'speci',
 'build',
 'self-sustain',
 'citi',
 'mar',
 '2008',
 'spacex',
 '’',
 'falcon',
 '1',
 'becam',
 'first',
 'privat',
 'develop',
 'liquid-fuel',
 'launch',
 'vehicl',
 'orbit',
 'earth']

In [15]:
# Preprocessing and Lemmatizing
process_tweet_Lem(text)

['founded',
 '2002',
 'spacex',
 '’',
 'mission',
 'enable',
 'human',
 'become',
 'spacefaring',
 'civilization',
 'multi-planet',
 'specie',
 'building',
 'self-sustaining',
 'city',
 'mar',
 '2008',
 'spacex',
 '’',
 'falcon',
 '1',
 'became',
 'first',
 'privately',
 'developed',
 'liquid-fuel',
 'launch',
 'vehicle',
 'orbit',
 'earth']
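
In the lemmatized output, verbs such as 'founded' and 'became' come back unchanged. That is because `lemmatize(word)` treats every word as a noun unless a part of speech is passed, which is also why the earlier cell had to pass `'v'` for "multiplying". One common workaround is to tag each token first and map the tag to a WordNet part-of-speech code. The sketch below shows only that mapping step in plain Python; the helper name `penn_to_wordnet` is hypothetical, and the (word, tag) pairs are written by hand to stand in for the output of a tagger such as `nltk.pos_tag` (which needs its own model download).

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS code.

    WordNet uses 'a' (adjective), 'v' (verb), 'n' (noun), 'r' (adverb).
    Unknown tags fall back to 'n', the lemmatizer's default.
    """
    if tag.startswith("J"):
        return "a"
    if tag.startswith("V"):
        return "v"
    if tag.startswith("R"):
        return "r"
    return "n"

# Hand-written (word, tag) pairs standing in for a POS tagger's output
tagged = [("founded", "VBN"), ("humans", "NNS"), ("privately", "RB"), ("first", "JJ")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

With such a mapping, the call inside the loop of `process_tweet_Lem` would become `lem.lemmatize(word, penn_to_wordnet(tag))`, at the cost of running a tagger over every tweet.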

### Tokenization using different libraries

In [23]:
# nltk
# Download the Punkt tokenizer models once before running this cell
#nltk.download('punkt') 

# word tokenize
from nltk.tokenize import word_tokenize 
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed liquid-fuel launch vehicle to orbit the Earth."""

print(word_tokenize(text))


from nltk.tokenize import sent_tokenize
sent_tokenize(text)

['Founded', 'in', '2002', ',', 'SpaceX', '’', 's', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi-planet', 'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on', 'Mars', '.', 'In', '2008', ',', 'SpaceX', '’', 's', 'Falcon', '1', 'became', 'the', 'first', 'privately', 'developed', 'liquid-fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth', '.']


['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet species by building a self-sustaining city on Mars.',
 'In 2008, SpaceX’s Falcon 1 became the first privately developed liquid-fuel launch vehicle to orbit the Earth.']

In [33]:
# Using keras

# Word Tokenization
from keras.preprocessing.text import text_to_word_sequence
# define
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# tokenize
result = text_to_word_sequence(text)
result




['founded',
 'in',
 '2002',
 'spacex’s',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization',
 'and',
 'a',
 'multi',
 'planet',
 'species',
 'by',
 'building',
 'a',
 'self',
 'sustaining',
 'city',
 'on',
 'mars',
 'in',
 '2008',
 'spacex’s',
 'falcon',
 '1',
 'became',
 'the',
 'first',
 'privately',
 'developed',
 'liquid',
 'fuel',
 'launch',
 'vehicle',
 'to',
 'orbit',
 'the',
 'earth']

In [38]:
# Using gensim
#!pip install gensim==3.8.3

# word tokenization
from gensim.utils import tokenize
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
list(tokenize(text))

['Founded',
 'in',
 'SpaceX',
 's',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization',
 'and',
 'a',
 'multi',
 'planet',
 'species',
 'by',
 'building',
 'a',
 'self',
 'sustaining',
 'city',
 'on',
 'Mars',
 'In',
 'SpaceX',
 's',
 'Falcon',
 'became',
 'the',
 'first',
 'privately',
 'developed',
 'liquid',
 'fuel',
 'launch',
 'vehicle',
 'to',
 'orbit',
 'the',
 'Earth']
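
The three outputs differ in visible ways: NLTK keeps punctuation and hyphenated words as tokens, Keras lowercases and splits on hyphens, and gensim drops the numbers ('2002', '2008', '1') and splits "SpaceX’s" into 'SpaceX' and 's'. The gensim-style result can be approximated with a single regular expression that keeps only runs of non-digit word characters. This is a sketch of the idea, not gensim's own code; the names `ALPHA_RUN` and `alpha_tokenize` are made up for this example.

```python
import re

# Runs of word characters that are not digits (an approximation, not gensim's own code)
ALPHA_RUN = re.compile(r"(?:(?!\d)\w)+")

def alpha_tokenize(text, lowercase=False):
    """Return alphabetic tokens only; digits and punctuation act as separators."""
    if lowercase:
        text = text.lower()
    return ALPHA_RUN.findall(text)

sample = "In 2008, SpaceX’s Falcon 1 became the first privately developed liquid-fuel launch vehicle."
print(alpha_tokenize(sample))
```

Which behaviour is preferable depends on the task: dropping numbers is harmless for topic modelling but loses information if dates or quantities matter.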

#### Basic Regex Code required for NLP

In [40]:
# To find a URL in a sentence
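def find_url(text):
    # Match http:// or https:// followed by characters commonly found in URLs
    # (the parameter is named text so it does not shadow the imported string module)
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
    # Convert the list of matches to one string (space-separated if there are several)
    return " ".join(urls)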

example="I love spending time at https://www.leetcode.com/"
find_url(example)

'https://www.leetcode.com/'

In [45]:
#!pip install emoji 
import emoji

# To find emojis in a sentence
def findEmoji(text):
    emo_text=emoji.demojize(text)
    line=re.findall(r':(.*?):',emo_text)
    return line

example="I love ⚽ very much 😁"
findEmoji(example)

['soccer_ball', 'beaming_face_with_smiling_eyes']

In [48]:
# To find an email

def findEmail(text):
    # Regex pattern for matching email addresses (simple, not a full validator)
    line = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', str(text))
    return ",".join(line)

# Example usage
example = "Gaurav's gmail is gsgs111@gmail.com"
print(findEmail(example))


gsgs111@gmail.com
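
The same `re.findall` approach works for the social media entities mentioned under noise removal. The sketch below extracts hashtags and mentions; the function names `find_hashtags` and `find_mentions` and the sample text are assumptions made for this example, and the patterns only cover handles and tags made of letters, digits, and underscores.

```python
import re

def find_hashtags(text):
    """Return hashtag words without the leading '#'."""
    return re.findall(r"#(\w+)", text)

def find_mentions(text):
    """Return user handles without the leading '@'."""
    # (?<!\w) skips an '@' that follows a word character, as in an email address
    return re.findall(r"(?<!\w)@(\w+)", text)

example = "Thanks @nasa and @SpaceX for the #Mars update! Mail me at a.b@example.com #space"
print(find_hashtags(example))
print(find_mentions(example))
```

The lookbehind in `find_mentions` is what keeps the email address in the sample from being reported as a mention of "example".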
