**Advantages and disadvantages of various text encoding techniques like OHE, BOW, N-Gram and TF-IDF**
| **** | **OHE** | **BOW** | **N\-Gram** | **TF\-IDF** |
|:---:|:---:|:---:|:---:|:---:|
| **Advantages** | Simple Implementation: One\-hot encoding is straightforward to implement and understand\. Each word or token is represented by a vector with a length equal to the vocabulary size, where only one element is set to 1 \(indicating the presence of the word\) and all others are set to 0\. | Simplicity: Bag\-of\-Words is relatively simple to understand and implement\. It involves counting the frequency of each word in the document, creating a vector representation based on these counts\. | Preservation of Local Context: N\-gram encoding captures local context by considering sequences of words \(or characters\) instead of individual tokens\. This can help in preserving some degree of syntactic and semantic information, especially in tasks where local context is important, such as sentiment analysis or named entity recognition\. | Reflects Importance: TF\-IDF captures the importance of a term within a document by considering both its frequency \(TF\) and its rarity across the entire document collection \(IDF\)\. Terms that appear frequently in a document but rarely in other documents are assigned higher weights, reflecting their significance\. |
| **** | Independence Between Words: Each word is represented independently of other words, so adding a word to the vocabulary adds one dimension without changing the vectors of existing words\. Note that the vector length still grows with the vocabulary size \(see High Dimensionality below\)\. | Flexibility: BOW can be easily adapted to different tasks and datasets\. It can handle various types of text data, including documents of different lengths and languages\. | Flexibility in Granularity: N\-gram encoding allows flexibility in the choice of N, the size of the sequence\. By varying N, you can capture different levels of linguistic information, ranging from individual words \(unigrams\) to longer phrases \(bigrams, trigrams, etc\.\)\. This flexibility can be beneficial in capturing different aspects of the text data\. | Reduces Impact of Common Terms: TF\-IDF reduces the influence of common terms that appear in many documents by assigning them lower weights\. This helps in emphasizing the discriminative power of less frequent terms, which are often more informative for distinguishing between documents\. |
| **** | Interpretability: One\-hot encoding provides a clear and interpretable representation of the data\. Each dimension of the one\-hot encoded vector corresponds directly to a specific word or token in the vocabulary\. | Interpretability: The resulting BOW representation provides some interpretability, as each dimension in the vector corresponds to a specific word, and the value represents the frequency of that word in the document\. | Disambiguation of Word Combinations: Compared to single\-word bag\-of\-words features, N\-gram features can distinguish word combinations such as "not good" from "good", which single\-word counts conflate\. This comes at the cost of a larger and sparser feature space, not a smaller one, particularly with higher\-order N\-grams\. | Simple and Efficient: TF\-IDF is relatively simple to compute and implement\. It does not require complex modeling or training and can be efficiently calculated using standard algorithms\. Additionally, TF\-IDF matrices tend to be sparse, which further improves computational efficiency and memory usage\. |
| **** | Useful for Sparse Data: One\-hot encoding works well with sparse data, where most of the values in the encoded vectors are zero\. This can be advantageous for memory efficiency and computational performance\. | Efficiency with Occurrence Information: BOW preserves some information about word occurrence, which can be useful in certain applications where word presence or absence is more relevant than the order or context of words\. | Robustness to Spelling Variations and Typos: N\-gram encoding can be robust to spelling variations and typos since it considers sequences of characters or words rather than individual tokens\. This can help in capturing similar textual patterns even when there are slight variations in the input text\. | Language Agnostic: TF\-IDF is language\-agnostic and can be applied to text data in any language without requiring language\-specific preprocessing or feature engineering\. This makes it widely applicable across different domains and languages\. |
| **** | Applicability to Various Machine Learning Models: One\-hot encoding is compatible with a wide range of machine learning models, including linear models, tree\-based models, and neural networks\. It can be easily incorporated into different types of algorithms without requiring complex modifications\. | Compatibility with Traditional ML Models: BOW representations are compatible with traditional machine learning models such as Naive Bayes, Logistic Regression, and Support Vector Machines\. These models can directly operate on the vectorized representations without requiring additional preprocessing\. | Capturing Phrase\-Level Information: N\-gram encoding can capture phrase\-level information, allowing the model to learn from common phrases or expressions in the text\. This can be particularly useful in tasks where understanding multi\-word expressions or idiomatic language is important, such as in sentiment analysis or machine translation\. | Interpretability: TF\-IDF provides some level of interpretability, as the resulting weights indicate the importance of each term within a document\. This allows users to gain insights into the key terms that contribute to the representation of a document and can aid in tasks such as document summarization or keyword extraction\. |
| **Disadvantages** | High Dimensionality: One\-hot encoding leads to high\-dimensional feature vectors, especially in datasets with large vocabularies\. This can result in increased memory requirements and computational complexity, especially when dealing with text datasets with a vast number of unique words\. | Loss of Sequence Information: BOW discards the order and sequence of words in the text, treating each document as an unordered collection of words\. This can lead to a loss of valuable information, especially in tasks where word order or context is crucial, such as sentiment analysis or natural language understanding\. | Increased Data Sparsity: Using higher\-order N\-grams can lead to increased data sparsity, especially in datasets with limited occurrences of specific sequences\. This can result in less reliable estimates of the frequency of rare N\-grams and may lead to overfitting or poor generalization performance, especially in the case of smaller datasets\. | Limited Semantic Understanding: TF\-IDF does not capture semantic relationships between terms and relies solely on term frequencies and document frequencies\. As a result, it may struggle to understand the semantic context or meaning of terms, leading to limitations in tasks requiring deeper semantic understanding, such as natural language understanding or question answering\. |
| **** | Loss of Semantic Information: One\-hot encoding treats each word as independent, ignoring any semantic relationships or similarities between words\. Consequently, it does not capture the semantic meaning of words or their contextual information, which may limit the performance of models, especially in tasks requiring understanding of language semantics\. | High Dimensionality: BOW representations can lead to high\-dimensional feature spaces, especially in large vocabularies or datasets with many unique words\. This can result in increased memory and computational requirements, as well as the curse of dimensionality, where the number of features surpasses the available data points\. | Higher Memory and Computational Requirements: Encoding text using N\-grams can lead to higher memory and computational requirements compared to simpler encoding methods like one\-hot encoding or bag\-of\-words, especially when using higher\-order N\-grams\. This is due to the increased number of features and the need to store and process the sequences\. | Insensitive to Word Order: TF\-IDF treats documents as bags of words and ignores the order and context of terms within the document\. This can be a disadvantage in tasks where word order or sequence information is crucial, such as sentiment analysis or text generation\. |
| **** | Inability to Represent Out\-of\-Vocabulary Words: One\-hot encoding cannot handle out\-of\-vocabulary words that were not seen during training\. When encountering unseen words during inference, it either requires special handling or leads to incomplete representations, potentially affecting model performance\. | Sparse Representation: BOW vectors are often sparse, containing mostly zeros, especially in datasets with a large vocabulary or long documents\. Sparse representations can be inefficient for storage and computation, and they may require special handling in certain machine learning algorithms\. | Loss of Global Context: While N\-gram encoding captures local context, it may suffer from the loss of global context, as it focuses only on contiguous sequences of words or characters within a fixed window size\. This can limit its ability to capture long\-range dependencies or understand the overall structure of the text, which may be important in certain tasks such as document classification or language modeling\. | Vocabulary Size Sensitivity: TF\-IDF is sensitive to the size of the vocabulary and the prevalence of rare terms in the document collection\. Large vocabularies or datasets with many rare terms can lead to sparse TF\-IDF matrices, which may require additional preprocessing or regularization techniques to handle effectively\. |
| **** | No Information about Word Frequency: One\-hot encoding does not preserve information about the frequency of words in the text\. All words are treated equally, regardless of their importance or frequency in the dataset, which may lead to suboptimal performance in tasks where word frequency is relevant\. | Ignoring Semantic Relationships: BOW treats each word as an independent feature, ignoring any semantic relationships or similarities between words\. This limitation can hinder the performance of models, particularly in tasks requiring understanding of language semantics or context\. | Vocabulary Size Sensitivity: N\-gram encoding is sensitive to the size of the vocabulary and the choice of N\. Larger vocabularies or higher\-order N\-grams can lead to an explosion in the number of features, resulting in increased computational complexity and potential overfitting, especially in the case of limited training data\. | Difficulty with Synonyms and Polysemy: TF\-IDF may struggle to differentiate between synonyms or terms with multiple meanings \(polysemy\) since it relies solely on term frequency and document frequency\. This can lead to inconsistencies in the representation of semantically similar terms, potentially affecting the performance of downstream tasks\. |
| **** | Lack of Continuity and Smoothness: One\-hot encoded vectors are discrete and lack continuity, which can pose challenges for certain machine learning algorithms, such as neural networks, that benefit from continuous representations and smooth gradients during training\. This discontinuity may lead to difficulties in convergence and optimization\. | Vulnerability to Synonyms and Polysemy: BOW does not differentiate between synonyms or different meanings of the same word \(polysemy\)\. Consequently, words with similar meanings may be represented differently, leading to a loss of discriminative power and potentially affecting model performance, especially in tasks requiring nuanced understanding of language\. | Inability to Capture Semantic Relationships: N\-gram encoding treats each sequence of words as a separate feature, without capturing semantic relationships between different sequences\. This can limit its ability to understand the semantic meaning of the text or to generalize well to unseen sequences, especially in tasks requiring deeper semantic understanding, such as question answering or natural language inference\. | Limited Handling of Out\-of\-Vocabulary Terms: TF\-IDF does not handle out\-of\-vocabulary terms that were not seen when the vocabulary and IDF values were computed\. Such terms have no IDF score and are typically ignored at inference time, so their importance or relevance is not captured\. |
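
To make the comparison in the table concrete, here is a minimal pure-Python sketch of the four encodings on a toy two-document corpus. The corpus, the helper names and the textbook TF-IDF formula (`tf * log(N / df)`) are assumptions for illustration; scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), tokenized by whitespace.
docs = ["the cat sat on the mat", "the dog sat on the log"]
tokenized = [d.split() for d in docs]

# Shared vocabulary, sorted so that the column order is deterministic.
vocab = sorted({tok for doc in tokenized for tok in doc})

def one_hot(token):
    """One vector per token: a single 1 at the token's vocabulary index."""
    return [1 if token == v else 0 for v in vocab]

def bag_of_words(tokens):
    """One vector per document: how often each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[v] for v in vocab]

def word_ngrams(tokens, n):
    """Contiguous n-token windows, which keep local word order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(tokens):
    """Textbook weighting: tf = count / len(doc), idf = log(N / df)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    weights = []
    for v in vocab:
        df = sum(1 for doc in tokenized if v in doc)
        tf = counts[v] / len(tokens)
        weights.append(tf * math.log(n_docs / df))
    return weights

print(vocab)
print(one_hot("cat"))
print(bag_of_words(tokenized[0]))
print(word_ngrams(tokenized[0], 2))
print([round(w, 3) for w in tf_idf(tokenized[0])])
```

Words that occur in every document ("the", "sat", "on") receive a TF-IDF weight of zero here, while the BOW vector gives "the" the largest count. That difference is the "reduces impact of common terms" row of the table in action.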


# **What is NATURAL LANGUAGE PROCESSING?**

**Natural Language Processing and Machine Learning make it possible to build robust models with the storage capacity and processing power available to us today. Natural Language Processing deals with processing human language while discovering the patterns, relationships, and semantics present in large amounts of data.**

![nlp-important-use-cases-min.png](attachment:nlp-important-use-cases-min.png)

# **Description**

**This is an assignment notebook for NLP text processing. It covers Data Preprocessing and Feature Engineering in detail, including tokenization, normalization, vectorization, cleaning of input data, and basic analysis. The code is implemented using nltk, spacy and textblob.**


# **Content**

* **ENVIRONMENT SETUP**
    * Install Packages
    * Import Packages
    

* **DATA PREPROCESSING**
    * Data Loading
    * Preliminary Analysis
    * Sentence Tokenization
    * Word Tokenization
    * Stopword Removal
    * Removal of Tags
    * Delimiter Removal
    * Spell Check 
    * Stemming
    * Lemmatization
    
    
* **FEATURE ENGINEERING**
    * Encoding    
    * POS Tagger
    * N-GRAM
    * Bag Of Words
    * TF
    * TF-IDF
    * Dependency Parser
    * Named Entity Recognition
    * Word Embedding
    * Sentiment Analysis
    * Subjectivity Detection

# **ENVIRONMENT SETUP**

# **Install Packages**

In [1]:
!pip install autocorrect
!pip install spacy
!python -m spacy download en_core_web_sm
!pip install spacytextblob
!pip install nltk

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


# **Import Packages**

In [2]:
import warnings
import numpy as np 
import pandas as pd 
import os
import re
import nltk
from nltk.corpus import abc
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import *
from nltk.stem.snowball import *
from nltk.util import ngrams
import string
import spacy
from spacy import displacy
from spacytextblob.spacytextblob import SpacyTextBlob
from autocorrect import Speller
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
from IPython.display import HTML
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# **DATA PREPROCESSING**

# **Data Loading**

In [3]:
data_nltk = pd.read_csv('./sample.csv')
print('Total number of entries in the sample twitter dataset are:', len(data_nltk))
data_nltk.head()

Total number of entries in the sample twitter dataset are: 93


Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,119237,105834,True,Wed Oct 11 06:55:44 +0000 2017,@AppleSupport causing the reply to be disregar...,119236.0,
1,119238,ChaseSupport,False,Wed Oct 11 13:25:49 +0000 2017,@105835 Your business means a lot to us. Pleas...,,119239.0
2,119239,105835,True,Wed Oct 11 13:00:09 +0000 2017,@76328 I really hope you all change but I'm su...,119238.0,
3,119240,VirginTrains,False,Tue Oct 10 15:16:08 +0000 2017,@105836 LiveChat is online at the moment - htt...,119241.0,119242.0
4,119241,105836,True,Tue Oct 10 15:17:21 +0000 2017,@VirginTrains see attached error message. I've...,119243.0,119240.0


# **Preliminary Analysis**

In [4]:
data_nltk.shape

(93, 7)

In [5]:
data_nltk.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93 entries, 0 to 92
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   tweet_id                 93 non-null     int64  
 1   author_id                93 non-null     object 
 2   inbound                  93 non-null     bool   
 3   created_at               93 non-null     object 
 4   text                     93 non-null     object 
 5   response_tweet_id        65 non-null     object 
 6   in_response_to_tweet_id  68 non-null     float64
dtypes: bool(1), float64(1), int64(1), object(4)
memory usage: 4.6+ KB


In [6]:
data_nltk.dtypes

tweet_id                     int64
author_id                   object
inbound                       bool
created_at                  object
text                        object
response_tweet_id           object
in_response_to_tweet_id    float64
dtype: object

In [7]:
data_nltk.isnull().sum()

tweet_id                    0
author_id                   0
inbound                     0
created_at                  0
text                        0
response_tweet_id          28
in_response_to_tweet_id    25
dtype: int64

The response_tweet_id column has many null values (28), as does in_response_to_tweet_id (25). Let's examine the rows where in_response_to_tweet_id is null.

In [8]:
null_response_tweets = data_nltk[data_nltk.in_response_to_tweet_id.isnull()]
null_response_tweets.head(10)

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,119237,105834,True,Wed Oct 11 06:55:44 +0000 2017,@AppleSupport causing the reply to be disregar...,119236,
2,119239,105835,True,Wed Oct 11 13:00:09 +0000 2017,@76328 I really hope you all change but I'm su...,119238,
12,119250,105838,True,Wed Oct 11 05:33:17 +0000 2017,"@AppleSupport hi #apple, I’ve a concern about ...",119249119251,
14,119253,105839,True,Wed Oct 11 07:21:34 +0000 2017,I just updated my phone and suddenly everythin...,119252,
22,119256,105840,True,Wed Oct 11 12:53:29 +0000 2017,@76495 @91226 Please help! Spotify Premium ski...,119254,
24,119263,105841,True,Wed Oct 11 06:29:07 +0000 2017,@AppleSupport after the 11.0.2 my phone just s...,119262,
26,119265,105842,True,Wed Oct 11 10:42:43 +0000 2017,First flight for long time with @British_Airwa...,119264119266,
28,119268,105843,True,Wed Oct 11 06:27:16 +0000 2017,Okay @76099 I used my fucking phone for 2 minu...,119267,
32,119272,105844,True,Wed Oct 11 02:19:23 +0000 2017,You’ve paralysed my phone with your update @76...,119271,
34,119274,82476,True,Wed Oct 11 12:50:07 +0000 2017,"@O2 I received this a few weeks ago, since the...",119273,


In [9]:
data_nltk.isna()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,False,False,False,False,False,False,True
1,False,False,False,False,False,True,False
2,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...
88,False,False,False,False,False,False,False
89,False,False,False,False,False,False,True
90,False,False,False,False,False,False,False
91,False,False,False,False,False,False,False


In [10]:
data = data_nltk[['tweet_id', 'text']]
data

Unnamed: 0,tweet_id,text
0,119237,@AppleSupport causing the reply to be disregar...
1,119238,@105835 Your business means a lot to us. Pleas...
2,119239,@76328 I really hope you all change but I'm su...
3,119240,@105836 LiveChat is online at the moment - htt...
4,119241,@VirginTrains see attached error message. I've...
...,...,...
88,119330,@105860 I wish Amazon had an option of where I...
89,119331,They reschedule my shit for tomorrow https://t...
90,119332,"@105861 Hey Sara, sorry to hear of the issues ..."
91,119333,@Tesco bit of both - finding the layout cumber...


In [11]:
text = data['text']
text

0     @AppleSupport causing the reply to be disregar...
1     @105835 Your business means a lot to us. Pleas...
2     @76328 I really hope you all change but I'm su...
3     @105836 LiveChat is online at the moment - htt...
4     @VirginTrains see attached error message. I've...
                            ...                        
88    @105860 I wish Amazon had an option of where I...
89    They reschedule my shit for tomorrow https://t...
90    @105861 Hey Sara, sorry to hear of the issues ...
91    @Tesco bit of both - finding the layout cumber...
92    @105861 If that doesn't help please DM your fu...
Name: text, Length: 93, dtype: object

# **Delimiter Removal and Emoji Removal**

Delimiters (punctuation marks) are removed to reduce the size of the dataset, since in many cases they carry little useful information. A few examples are question marks (?), full stops (.), and exclamation marks (!). For example, after delimiter removal the sentence 'I am cold!' becomes 'I am cold'. The list of punctuation characters can be obtained from the string library or from nltk.

![bfbe0b98b48faee19456e72e2fc61641-min.jpg](attachment:bfbe0b98b48faee19456e72e2fc61641-min.jpg)


In [12]:
from string import punctuation
print(f'Delimiters in English: \n{punctuation}')

Delimiters in English: 
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [13]:
type(data_nltk["text"][0])

str

In [14]:
import re

#sample_text = u'This dog \U0001f602'
#print(text) # with emoji

emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)
print(emoji_pattern.sub(r'', text[0])) # no emoji

data_nltk["text"] = data_nltk["text"].apply(lambda sentence: emoji_pattern.sub(r'', sentence))

@AppleSupport causing the reply to be disregarded and the tapped notification under the keyboard is opened


In [15]:
import string
def remove_punctuation(text):
  for punctuation in string.punctuation:
    text = text.replace(punctuation, '')
  return text
data_nltk['text'] = data_nltk['text'].apply(remove_punctuation)

In [16]:
data_nltk["text"][0]

'AppleSupport causing the reply to be disregarded and the tapped notification under the keyboard is opened'
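
As an alternative to looping over every punctuation character, the same removal can be sketched in a single pass with `str.translate`. The helper name `strip_punctuation` is an assumption for illustration. Note that `string.punctuation` covers ASCII characters only, so typographic characters such as the curly apostrophe in "I’ve" are left untouched by either approach.

```python
import string

# Translation table that maps every ASCII punctuation character to None (delete).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text: str) -> str:
    """Remove ASCII punctuation from text in one pass."""
    return text.translate(_PUNCT_TABLE)

print(strip_punctuation("@AppleSupport I am cold!"))
print(strip_punctuation("I’ve a concern"))  # the curly apostrophe survives
```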

# **Sentence Tokenization**

For any corpus, we first divide the text into smaller units so that they can be processed individually; this is called tokenization. Sentence tokenization takes a text or a corpus as input and returns a list of sentences, which can then be broken down further into words. For example, for the text "I love dogs. I have a dog", the output is ["I love dogs.", "I have a dog"], which is further divided into tokens such as "I", "love", "dogs", "have", "a", "dog".

![Screenshot%20%2868%29.png](attachment:Screenshot%20%2868%29.png)

In [17]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/divya.amith/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [18]:
#creating 2 dataframes to perform nlp text preprocessing using spacy and nltk libraries
data_nltk["text"] = data_nltk["text"].apply(str.lower)
data_spacy = data_nltk.copy()

In [19]:
data_nltk["text"][0]

'applesupport causing the reply to be disregarded and the tapped notification under the keyboard is opened'

In [20]:
# sents1 = sent_tokenize(text[0])
# print(f'Sentence Tokenization using NLTK: \n {text[0]} => {sents1}')

In [21]:
data_nltk["text"] = data_nltk["text"].apply(lambda sentence: [sent_tokenize(sentence)])

In [22]:
data_nltk["text"][0]

[['applesupport causing the reply to be disregarded and the tapped notification under the keyboard is opened']]
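
The result above is a single sentence per tweet. One likely reason is the order of the steps: punctuation was stripped before sentence tokenization, so the sentence boundaries were already gone. The sketch below illustrates the effect with a naive regex-based splitter, which is an assumption standing in for `sent_tokenize` and is much cruder than NLTK's Punkt model.

```python
import re
import string

def naive_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (rough stand-in for sent_tokenize)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

raw = "I love dogs. I have a dog."
stripped = raw.translate(str.maketrans("", "", string.punctuation))

print(naive_sentences(raw))       # sentence boundaries are still visible
print(naive_sentences(stripped))  # everything collapses into one "sentence"
```

If sentence-level structure matters for a later step, splitting into sentences first and removing punctuation afterwards preserves it.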

# **Word Tokenization**

Word tokenization works like sentence tokenization, but it splits a sentence into
individual words, which are returned as items in a list. For example, for the
sentence "Chennai is humid", the result is ["Chennai", "is", "humid"].
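
The idea can be sketched with a one-line regular expression. This is a simplified illustration, not what `word_tokenize` does internally: NLTK also keeps punctuation as separate tokens and splits contractions, whereas the hypothetical helper below simply keeps runs of word characters.

```python
import re

def simple_word_tokenize(sentence):
    """Return runs of word characters; punctuation and whitespace are dropped."""
    return re.findall(r"\w+", sentence)

print(simple_word_tokenize("Chennai is humid"))
```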

![fig2-min-min.png](attachment:fig2-min-min.png)

In [23]:
#sents1

In [24]:
# words1 = [word_tokenize(sents) for sents in sents1]
# print(f'Word Tokenization using NLTK: \n {sents1} => {words1}')

In [25]:
#[sentence for sentence in data_nltk["text"][6]]

In [26]:
data_nltk["text"] = data_nltk["text"].apply(lambda sentences: [word_tokenize(" ".join(sentence)) for sentence in sentences] )

In [27]:
data_nltk["text"][0]

[['applesupport',
  'causing',
  'the',
  'reply',
  'to',
  'be',
  'disregarded',
  'and',
  'the',
  'tapped',
  'notification',
  'under',
  'the',
  'keyboard',
  'is',
  'opened']]

In [28]:
#sp = spacy.load('en_core_web_sm')

In [29]:
# print(f'Word Tokenization using SpaCy: \n\n{sp(text[0])} =>\n')

# words2 = sp(text[0])
# for word in words2:
#     print(word)

# **Stopword Removal**

The dataset may contain words like ‘after’, ‘every’ and ‘I’. These words carry little information for NLP applications such as sentiment detection and can therefore be removed. Because stopwords occur so frequently, they suppress the importance of other words.

![stopwords-min.jpg](attachment:stopwords-min.jpg)

In [30]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/divya.amith/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [31]:
# tokens1 = [word for word in words1 if not word in stopwords.words('english')] 
# print(f'Stopword Removal using NLTK: \n{words1} => {tokens1}')

for word in stopwords.words('english'):
     print(word)

i
me
my
myself
we
our
ours
ourselves
you
you're
you've
you'll
you'd
your
yours
yourself
yourselves
he
him
his
himself
she
she's
her
hers
herself
it
it's
its
itself
they
them
their
theirs
themselves
what
which
who
whom
this
that
that'll
these
those
am
is
are
was
were
be
been
being
have
has
had
having
do
does
did
doing
a
an
the
and
but
if
or
because
as
until
while
of
at
by
for
with
about
against
between
into
through
during
before
after
above
below
to
from
up
down
in
out
on
off
over
under
again
further
then
once
here
there
when
where
why
how
all
any
both
each
few
more
most
other
some
such
no
nor
not
only
own
same
so
than
too
very
s
t
can
will
just
don
don't
should
should've
now
d
ll
m
o
re
ve
y
ain
aren
aren't
couldn
couldn't
didn
didn't
doesn
doesn't
hadn
hadn't
hasn
hasn't
haven
haven't
isn
isn't
ma
mightn
mightn't
mustn
mustn't
needn
needn't
shan
shan't
shouldn
shouldn't
wasn
wasn't
weren
weren't
won
won't
wouldn
wouldn't


In [32]:
data_nltk["text"] = data_nltk["text"].apply(lambda words: [word for word in words if not word in stopwords.words('english')])

In [33]:
data_nltk["text"][0]

[['applesupport',
  'causing',
  'the',
  'reply',
  'to',
  'be',
  'disregarded',
  'and',
  'the',
  'tapped',
  'notification',
  'under',
  'the',
  'keyboard',
  'is',
  'opened']]
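
The output above still contains 'the', 'to', 'be', 'and', 'under' and 'is'. Each entry of the column is a list of token lists, so the comprehension compares whole lists against the stop word list and never finds a match. A minimal sketch of a filter that descends one level deeper is shown below. The small `STOP_WORDS` set is a hypothetical stand-in for `stopwords.words('english')`; building a set once also avoids scanning a list for every token.

```python
# A few illustrative stop words; the notebook uses stopwords.words('english').
STOP_WORDS = frozenset({"the", "to", "be", "and", "under", "is"})

def remove_stopwords(sentences):
    """Filter each inner token list, keeping the list-of-lists shape."""
    return [[tok for tok in tokens if tok not in STOP_WORDS] for tokens in sentences]

row = [["applesupport", "causing", "the", "reply", "to", "be", "disregarded",
        "and", "the", "tapped", "notification", "under", "the", "keyboard",
        "is", "opened"]]

print(remove_stopwords(row))
```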

# **Removal of Tags**

Data scraped from web pages often contains HTML tags. These tags carry no information about the text itself and can therefore be removed. For example, a tag like < body > (Body Tag) is deleted.

![SEO-Meta-Tags-min.jpg](attachment:SEO-Meta-Tags-min.jpg)

In [None]:
# sent_with_html = "<head> <title> Natural Language Processing </title> </head>"
# remove_html = re.compile('<.*?>')

# print(f"Removing HTML tags: \n{sent_with_html} => {re.sub(remove_html, '', sent_with_html).strip()}")

In [None]:
# data_nltk['text'] = data_nltk['text'].str.replace('<.*?>', '')
# #remove urls as well
# data_nltk['text'] = data_nltk['text'].str.replace('http[s]?://(?:[a-z0-9-]+\.)+[a-z]{2,6}(?:/[^\\s/]+)+', '')
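
The commented-out cells above can be sketched with the standard `re` module as follows. The patterns and the helper name are assumptions chosen for illustration, and the example URL is hypothetical. This step would need to run before punctuation removal, because stripping `<`, `>`, `:` and `/` first leaves nothing for the patterns to match. Also note that in recent pandas versions `Series.str.replace` treats the pattern as a literal string unless `regex=True` is passed.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")      # anything between angle brackets
URL_RE = re.compile(r"https?://\S+")  # http/https up to the next whitespace

def strip_tags_and_urls(text):
    """Replace tags and URLs with spaces, then collapse repeated whitespace."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())

print(strip_tags_and_urls("<head> <title> Natural Language Processing </title> </head>"))
print(strip_tags_and_urls("LiveChat is online at the moment - https://t.co/example"))
```

With pandas, the same function could be applied as `data_nltk['text'].apply(strip_tags_and_urls)`.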

# **Stemming**

Stemming applies algorithmic rules to strip affixes from a derived word and extract its stem. The resulting stems are not necessarily valid words; they are simply the letters that remain once the affixes are removed. For example, the word “beautiful” is stemmed to “beauti”.

![09-min.jpg](attachment:09-min.jpg)

In [36]:
tokens1 = ['applesupport',
  'causing',
  'the',
  'reply',
  'to',
  'be',
  'disregarded',
  'and',
  'the',
  'tapped',
  'notification',
  'under',
  'the',
  'keyboard',
  'is',
  'opened']
porterStemmer = PorterStemmer()
stemWords1 = [porterStemmer.stem(word) for word in tokens1]

print(f'Tokens after Stemming using Porter Stemmer: \n{stemWords1}')

Tokens after Stemming using Porter Stemmer: 
['applesupport', 'caus', 'the', 'repli', 'to', 'be', 'disregard', 'and', 'the', 'tap', 'notif', 'under', 'the', 'keyboard', 'is', 'open']


In [37]:
data_nltk["text"] = data_nltk["text"].apply(lambda tokenlist: [[porterStemmer.stem(word) for word in tokens] for tokens in tokenlist])

In [38]:
data_nltk["text"][0]

[['applesupport',
  'caus',
  'the',
  'repli',
  'to',
  'be',
  'disregard',
  'and',
  'the',
  'tap',
  'notif',
  'under',
  'the',
  'keyboard',
  'is',
  'open']]

# **Lemmatization**

Lemmatization is similar to stemming, but it uses vocabulary and context (such as part of speech) to return a valid dictionary form, the lemma. It groups the inflected forms of a word so that they can be interpreted as a single root word. For example, “studies” is lemmatized to “study”, whereas a stemmer produces “studi”.

![stemvslemma.png](attachment:ec28f8ab-2d2f-4c47-a82f-27ebc94d3c0e.png)

In [39]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/divya.amith/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [40]:
wordNetLemmatizer = WordNetLemmatizer()
lemmaWords1 = [wordNetLemmatizer.lemmatize(word) for word in tokens1]

print(f'Tokens after Lemmatization using WordNet Lemmatizer: \n{tokens1} => {lemmaWords1}')

Tokens after Lemmatization using WordNet Lemmatizer: 
['applesupport', 'causing', 'the', 'reply', 'to', 'be', 'disregarded', 'and', 'the', 'tapped', 'notification', 'under', 'the', 'keyboard', 'is', 'opened'] => ['applesupport', 'causing', 'the', 'reply', 'to', 'be', 'disregarded', 'and', 'the', 'tapped', 'notification', 'under', 'the', 'keyboard', 'is', 'opened']


In [41]:
data_nltk["text"] = data_nltk["text"].apply(lambda tokenlist: [[wordNetLemmatizer.lemmatize(word) for word in tokens] for tokens in tokenlist])

# **FEATURE ENGINEERING**

# **Encoding**

Encoding is the process of converting data into a format that computers can process. Humans comprehend natural language, but a machine works only with numbers, so encoding maps text to numeric values. For example, the labels 'positive' and 'negative' are mapped to the numbers 0 and 1.



![OHE.jpg](attachment:OHE.jpg)

In [44]:
#sample
animals = ['dog', 'cat', 'mouse', 'dog', 'lion', 'lion', 'mouse', 'tiger', 'rat', 'dog']

label_encoder = preprocessing.LabelEncoder()
data = pd.DataFrame({'Labels' : animals, 'Label Encoder Values' : label_encoder.fit_transform(animals)})

print("Label Encoder")
data.style.background_gradient(cmap = 'BrBG')

Label Encoder


Unnamed: 0,Labels,Label Encoder Values
0,dog,1
1,cat,0
2,mouse,3
3,dog,1
4,lion,2
5,lion,2
6,mouse,3
7,tiger,5
8,rat,4
9,dog,1


In [45]:
animals = np.array(['dog', 'cat', 'mouse', 'dog', 'lion', 'lion', 'mouse', 'tiger', 'rat', 'dog'])

animals.reshape(-1,1)

array([['dog'],
       ['cat'],
       ['mouse'],
       ['dog'],
       ['lion'],
       ['lion'],
       ['mouse'],
       ['tiger'],
       ['rat'],
       ['dog']], dtype='<U5')

In [46]:
animals = np.array(['dog', 'cat', 'mouse', 'dog', 'lion', 'lion', 'mouse', 'tiger', 'rat', 'dog'])

ohe = preprocessing.OneHotEncoder()
result = ohe.fit_transform(animals.reshape(-1,1)).toarray()

data = pd.DataFrame(result.astype(int))
data['Labels'] = animals

print("One Hot Encoder")
data.style.background_gradient(cmap = 'Wistia')

One Hot Encoder


Unnamed: 0,0,1,2,3,4,5,Labels
0,0,1,0,0,0,0,dog
1,1,0,0,0,0,0,cat
2,0,0,0,1,0,0,mouse
3,0,1,0,0,0,0,dog
4,0,0,1,0,0,0,lion
5,0,0,1,0,0,0,lion
6,0,0,0,1,0,0,mouse
7,0,0,0,0,0,1,tiger
8,0,0,0,0,1,0,rat
9,0,1,0,0,0,0,dog


In [47]:
animals

array(['dog', 'cat', 'mouse', 'dog', 'lion', 'lion', 'mouse', 'tiger',
       'rat', 'dog'], dtype='<U5')

In [70]:
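def flatten_list(test_list):
    # Recursively flatten arbitrarily nested lists into one flat list
    if isinstance(test_list, list):
        temp = []
        for ele in test_list:
            temp.extend(flatten_list(ele))  # recurse using the function's own name
        return temp
    else:
        return [test_list]

ohe_twitter = preprocessing.OneHotEncoder()
k = data_nltk.copy()
# The encoder is refit on every tweet: one row per token, one column per distinct token in that tweet
enc_data = k['text'].apply(lambda x: ohe_twitter.fit_transform(np.array(flatten_list(x)).reshape(-1,1)))
k['labels'] = enc_data  # store the sparse matrices so that k['labels'] can be inspected below
k.head()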

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id,labels
0,119237,105834,True,Wed Oct 11 06:55:44 +0000 2017,"[[applesupport, caus, the, repli, to, be, disr...",119236.0,,"(0, 1)\t1.0\n (1, 3)\t1.0\n (2, 11)\t1.0\n..."
1,119238,ChaseSupport,False,Wed Oct 11 13:25:49 +0000 2017,"[[105835, your, busi, mean, a, lot, to, u, ple...",,119239.0,"(0, 0)\t1.0\n (1, 18)\t1.0\n (2, 5)\t1.0\n..."
2,119239,105835,True,Wed Oct 11 13:00:09 +0000 2017,"[[76328, i, realli, hope, you, all, chang, but...",119238.0,,"(0, 0)\t1.0\n (1, 8)\t1.0\n (2, 10)\t1.0\n..."
3,119240,VirginTrains,False,Tue Oct 10 15:16:08 +0000 2017,"[[105836, livechat, is, onlin, at, the, moment...",119241.0,119242.0,"(0, 3)\t1.0\n (1, 14)\t1.0\n (2, 12)\t1.0\..."
4,119241,105836,True,Tue Oct 10 15:17:21 +0000 2017,"[[virgintrain, see, attach, error, messag, ive...",119243.0,119240.0,"(0, 14)\t1.0\n (1, 9)\t1.0\n (2, 1)\t1.0\n..."


In [71]:
k['labels'][0]

<16x14 sparse matrix of type '<class 'numpy.float64'>'
	with 16 stored elements in Compressed Sparse Row format>
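
The 16x14 shape can be checked by hand. The sketch below is only an illustration; it reuses the lemmatized token list printed earlier for the same tweet, on the assumption that stemming changes the spelling of the tokens but not how many of them are distinct.

```python
tokens = ['applesupport', 'causing', 'the', 'reply', 'to', 'be', 'disregarded', 'and',
          'the', 'tapped', 'notification', 'under', 'the', 'keyboard', 'is', 'opened']

n_rows = len(tokens)       # one row per token in the tweet
n_cols = len(set(tokens))  # one column per distinct token ('the' occurs three times)

print(n_rows, n_cols)
```

Because the encoder is refit for each tweet, a column index means a different word in every row of the data frame, so these matrices are not directly comparable across tweets.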

# **POS Tagger**

A POS (part-of-speech) tagger labels each word in a sentence with its grammatical category, according to the grammar of the language. Libraries such as NLTK and spaCy provide a built-in tagger. For example, for the text “The pizza was disgusting but the location was beautiful”, a tagger produces output such as [“The [DT]”, “pizza [NN]”, “was [VBD]”, “disgusting [VBG]”, “but [CC]”, “the [DT]”, “location [NN]”, “was [VBD]”, “beautiful [JJ]”].


![1_qZELwIpKeEQ-j3EnRF-CrQ-2.jpg](attachment:1_qZELwIpKeEQ-j3EnRF-CrQ-2.jpg)

In [72]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/divya.amith/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [73]:
tagged_tokens1 = nltk.pos_tag(tokens1)

print(f'POS tagging using NLTK: \n{tokens1} => {tagged_tokens1}')

POS tagging using NLTK: 
['applesupport', 'causing', 'the', 'reply', 'to', 'be', 'disregarded', 'and', 'the', 'tapped', 'notification', 'under', 'the', 'keyboard', 'is', 'opened'] => [('applesupport', 'NN'), ('causing', 'VBG'), ('the', 'DT'), ('reply', 'NN'), ('to', 'TO'), ('be', 'VB'), ('disregarded', 'VBN'), ('and', 'CC'), ('the', 'DT'), ('tapped', 'JJ'), ('notification', 'NN'), ('under', 'IN'), ('the', 'DT'), ('keyboard', 'NN'), ('is', 'VBZ'), ('opened', 'VBN')]


In [75]:
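# spaCy tags Token objects, not plain strings, so the tokens must go through a pipeline first.
# Assumption: the 'en_core_web_sm' model is installed (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(' '.join(tokens1))

print('POS tagging using SpaCy: \n')
for token in doc:
    print(f'{token.text} : {token.pos_}')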

# **N-Gram**

An n-gram is a contiguous sequence of n items (words or characters) taken from a text. N-gram language models, which are widely used in NLP for statistical problems involving text and speech, use these sequences to estimate the probability of the next word given the preceding words. For example, for the sentence “The movie was boring.”, the unigrams are [“The”, “movie”, “was”, “boring”], the bigrams are [“The movie”, “movie was”, “was boring”], and the trigrams are [“The movie was”, “movie was boring”].

![7356.1569499094-min.png](attachment:7356.1569499094-min.png)
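
The sliding-window idea behind n-grams can be sketched in a few lines of plain Python. `make_ngrams` is a hypothetical helper written for this illustration; the notebook itself uses NLTK's `ngrams` for the same job.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence_tokens = ['The', 'movie', 'was', 'boring']

unigrams = make_ngrams(sentence_tokens, 1)
bigrams = make_ngrams(sentence_tokens, 2)
trigrams = make_ngrams(sentence_tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence of m tokens yields m - n + 1 n-grams, so larger values of n give fewer but more specific sequences.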

In [None]:
n_grams1 = ngrams(tokens1, 2)
n_grams1 = [ ' '.join(grams) for grams in n_grams1]

print(f'N-Gram using NLTK (n = 2): \n{tokens1} => {n_grams1}')

In [None]:
n_gram_finder = nltk.collocations.TrigramCollocationFinder.from_words(tokens1)

print(f'Most Common N-Gram Finder using NLTK (n = 3): \n{tokens1} => {n_gram_finder.ngram_fd.most_common(2)}')

# **Bag of Words**

The bag-of-words model first tokenizes the text into sentences and words, and then counts how many times each word appears, ignoring word order. For example, in the sentence “It is nice but horrid, and that’s not a nice thing.”, the word “nice” is extracted and counted with two occurrences.

![image-20190906164045-2-min.jpeg](attachment:image-20190906164045-2-min.jpeg)
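
The example sentence can be counted with the standard library alone. This is a minimal sketch; the regular expression used for tokenization is an assumption made for the illustration and is much cruder than the NLTK tokenizers used elsewhere in this notebook.

```python
import re
from collections import Counter

sentence = "It is nice but horrid, and that's not a nice thing."

# Lowercase the text and keep runs of letters and apostrophes as words.
words = re.findall(r"[a-z']+", sentence.lower())

# The bag of words is simply a word -> count mapping; word order is discarded.
bag = Counter(words)

print(bag['nice'])
print(bag.most_common(3))
```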

In [None]:
word_count = {}

for word in tokens1:
    
    if word not in word_count.keys():
        word_count[word] = 1
    else:
        word_count[word] += 1
        
print(f'Bag of Words: \n{tokens1} => {word_count}')

# **Term Frequency**

TF (Term Frequency) measures how often a term occurs in a document, usually as the raw count divided by the total number of terms in the document. It treats all terms as equally important. For example, if the word “Fruit” appears five times in a document of 100 words, then the TF for “Fruit” is 5/100 = 0.05.

![term_frequency_after_stopword_removal-min.png](attachment:term_frequency_after_stopword_removal-min.png)
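
The “Fruit” arithmetic can be reproduced with a short sketch. `term_frequency` is a hypothetical helper, and the 100-word document is synthetic, built only to match the numbers in the example.

```python
from collections import Counter

def term_frequency(term, document_tokens):
    """Relative frequency: occurrences of term divided by the document length."""
    if not document_tokens:
        return 0.0
    return Counter(document_tokens)[term] / len(document_tokens)

# Synthetic document: 5 occurrences of 'fruit' among 100 tokens.
document = ['fruit'] * 5 + ['filler'] * 95

tf_fruit = term_frequency('fruit', document)
print(tf_fruit)
```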

In [None]:
def color(val):
    
    color = 'hotpink' if val > 0 else ''
    return 'background-color: %s' % color

In [None]:
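count_vectorizer = CountVectorizer()
text_list = list(text[0:10])

# Raw term counts: one row per document, one column per vocabulary term
tf = count_vectorizer.fit_transform(text_list)

tf_feature_names = count_vectorizer.get_feature_names_out()

print('Term Frequency of Document')
counts = pd.DataFrame(tf.toarray(), columns = tf_feature_names)
# Divide each row by that document's total term count (not by the vocabulary size);
# empty documents keep a divisor of 1 to avoid division by zero
df = counts.div(counts.sum(axis = 1).replace(0, 1), axis = 0)
# Chain the calls so the caption and the highlighting apply to the same Styler
df.style.set_caption("Term Frequency of Document").applymap(color)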

# **Term Frequency - Inverse Document Frequency**

TF-IDF (Term Frequency-Inverse Document Frequency) measures the importance of a word in a document: the weight grows with the number of times the word appears in the document and shrinks with the number of documents in the collection that contain it. For example, if the word “Fruit” has a term frequency of 0.05 and appears in 100 of 10000 documents, then with a base-10 logarithm the TF-IDF is 0.05 * log(10000/100) = 0.05 * 2 = 0.1.

![0_7r2GKRepjh5Fl41r-min.png](attachment:0_7r2GKRepjh5Fl41r-min.png)
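
The textbook formula can be worked through with a short sketch. It assumes the term frequency is the relative frequency 5/100 = 0.05 and that the logarithm is base 10; `tf_idf` is a hypothetical helper for this illustration, not a library function.

```python
import math

def tf_idf(term_count, doc_length, n_docs, docs_with_term):
    """Textbook TF-IDF: relative term frequency times log10 of the inverse document frequency."""
    tf = term_count / doc_length
    idf = math.log10(n_docs / docs_with_term)
    return tf * idf

# 'Fruit' appears 5 times in a 100-word document and in 100 of 10000 documents.
score = tf_idf(term_count=5, doc_length=100, n_docs=10000, docs_with_term=100)
print(score)
```

scikit-learn's `TfidfVectorizer` will not reproduce this number exactly: by default it uses a smoothed, natural-log IDF and L2-normalizes each document vector, so its weights should be compared within one matrix rather than against the hand calculation.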

In [None]:
tfidf_vectorizer = TfidfVectorizer()
text_list = list(text[0:10])

tfidf = tfidf_vectorizer.fit_transform(text_list)

tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

print('Term Frequency - Inverse Document Frequency of Document')
df = pd.DataFrame(tfidf.toarray(), columns = tfidf_feature_names)
# Chain the calls so the caption and the highlighting apply to the same Styler
df.style.set_caption("Term Frequency - Inverse Document Frequency of Document").applymap(color)

Apply all of the transformations above to the text dataset using NLTK.
