<details><summary><b>Books on NLP:</b></summary>
    
* Wikipedia article on NLP (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
* NLTK Book (Natural Language Processing with Python from Oreilly (downloaded)
* Foundations of Statistical Natural Language Processing from Manning publications
</details>

<details open><summary><b>Examples of NLP:</b></summary>
    
* Clustering News Articles
* Suggesting similar books
* Grouping Legal Documents
* Analyzing Consumer Feedback
* Spam Email Detaction
</details>

<details open><summary><b>Our basic process for NLP:</b></summary>
    
* Compile all documents (Corpus)
* Featurize the words to numerics
* Compile features of documents
</details>

<details><summary><b>Detailed process steps:</b></summary>
    
* A standard way of doing this is through the use of what is known as "TF-IDF" (called Term Frequency - Inverse Document Frequency)
* Say we have 2 documents "Blue House" and "Red House"
    * Here total words list is (red, blue, house)
* Featurize the two documents based on word count
    * "Blue House" --> (red,blue,house) --> (0,1,1)
    * "Red House" --> (red,blue,house) --> (1,0,1)
        * Thus vectors represented as a vector of word counts is called a <b>"Bag of Words"</b>.
        * These are now vectors in an N-dimensional space.
        * we can compare vector with cosine similarity.
            $Sim(A,B) = Cos(\theta) = \frac{A.B}{||A||.||B||}$
            * Here $\theta$ is the angle between the two vectors A and B
            
![image.png](attachment:image.png)
</details>

<details><summary><b>Tokenizer and Tokenization in NLP</b> with <i>Tokenizer(), RegexTokenizer()</i></summary>

* Tokenization - a process of taking a text (sentence) and breaking it into individual tokens (usually words)
* Tokenizer - Normal tokenizer &
    * <u>Tokenizer()</u> -- Normal tokenizer which treats a SPACE character as a word/token separator
    * <u>RegexTokenizer(pattern)</u> -- Regular expression tokenizer - allows advanced tokenization based on regular expressios. This treats given pattern as the token separator.
</details>

<details><summary><b>Bag of Words and TF-IDF:</b></summary>

* We can imporve on "Bag of Words" by adjusting word counts based on their frequency in corpus (the group of all documents).
* We can use Term Frequency Inverse Document Frequency
* Term Frequency - Importance of the term with in that document.
    * TF(x.y) = Number of occurrences of term x in doucment y
* Inverse Document Frequency - Importance of the term in the corpus
    * IDF(t) = log(N/dfx)
        * N = Total number of documents
        * dfx = Number of documents with the term.

* TF-IDF = $W_{x,y} = t_{x,y}$ X $log(\frac{N}{df_x})$
    * This is mathematical expression for TF-IDF
    * $W_{x,y}$ is TF-IDF of the term x within document y = How imprtanta word to a document in a collection of corpus.
    * $t_{x,y}$ is the term frequency i.e. frequency of term x in document y
    * $df_{x}$ is the number of documents containing x
    * $N$ is the total number of documents 
    
* TF-IDF is a numerical statistics that reflects how imprtant a word is to a document in a collection of corpus.
* TF-IDF is often used as a weighting factor in searches of information retrieval, text mining and user modeling

* Various spark tools from pyspark.ml.feature helps with this entire feature behind the scene.
</details>

In [1]:
import sys
sys.path.append('C:/Users/nishita/exercises_udemy/MyTrials/tools')
from chinmay_tools import *

# ############################################################.

## NLP: NLP_Code_Along (Spam Message Classification)

* DATA PREPARATION STEPS
    * csv--(read, rename columns)-->data_in--(StringIndexer)-->data_labeled--(Tokenizer)-->data_words--(StopWordsRemover)-->data_filtered
    * data_filtered--(CountVectorizer)-->data_TF--(IDF)-->data_TFIDF--(VectorAssembler)-->data_features
    * data_features.select(['label', 'features'])-->data_clean
* SPAM PREDICTION (ML PROCESSIGN & EVALUATION)
    * clean_data--(randomSplit)-->train_data,test_data
    * NaiveBays().fit(train_data)-->spam_predictor
    * spam_predictor.transform(test_data)-->predicted_test_results
    * MulticlassClassificationEvaluator().evaluate(predicted_test_results)-- compares 'prediction' column against 'label'

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark1 = SparkSession.builder.appName('nlp_sms_classification').getOrCreate()

In [4]:
data_in = spark1.read.csv('Natural_Language_Processing/smsspamcollection/SMSSpamCollection', sep='\t')

In [5]:
data_in.columns
data_in.head(2)

['_c0', '_c1']

[Row(_c0='ham', _c1='Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'),
 Row(_c0='ham', _c1='Ok lar... Joking wif u oni...')]

In [6]:
print('Rename the columns to meaningful ones')
data_in = data_in.withColumnRenamed('_c0', 'class').withColumnRenamed('_c1', 'text')
data_in.columns

Rename the columns to meaningful ones


['class', 'text']

In [7]:
from pyspark.ml.feature import StringIndexer

In [8]:
indexer = StringIndexer(inputCol='class', outputCol='label')

In [9]:
data_labeled = indexer.fit(data_in).transform(data_in)

In [10]:
from pyspark.ml.feature import Tokenizer

In [11]:
tokenizer = Tokenizer(inputCol='text', outputCol='words')

In [12]:
data_words = tokenizer.transform(data_labeled)

In [13]:
from pyspark.ml.feature import StopWordsRemover

In [14]:
remover = StopWordsRemover(inputCol='words', outputCol='filtered')

In [15]:
data_filtered = remover.transform(data_words)

In [16]:
#validate word counts before and after stopword removal usind a count_words user defined function
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

In [17]:
count_tokens = udf(lambda word_list: len(word_list), IntegerType())

In [18]:
data_filtered.withColumn('word_count', count_tokens(col('words'))).withColumn('filtered_count', count_tokens(col('filtered'))).show()

+-----+--------------------+-----+--------------------+--------------------+----------+--------------+
|class|                text|label|               words|            filtered|word_count|filtered_count|
+-----+--------------------+-----+--------------------+--------------------+----------+--------------+
|  ham|Go until jurong p...|  0.0|[go, until, juron...|[go, jurong, poin...|        20|            16|
|  ham|Ok lar... Joking ...|  0.0|[ok, lar..., joki...|[ok, lar..., joki...|         6|             6|
| spam|Free entry in 2 a...|  1.0|[free, entry, in,...|[free, entry, 2, ...|        28|            23|
|  ham|U dun say so earl...|  0.0|[u, dun, say, so,...|[u, dun, say, ear...|        11|             9|
|  ham|Nah I don't think...|  0.0|[nah, i, don't, t...|[nah, think, goes...|        13|             7|
| spam|FreeMsg Hey there...|  1.0|[freemsg, hey, th...|[freemsg, hey, da...|        32|            18|
|  ham|Even my brother i...|  0.0|[even, my, brothe...|[even, brother, l.

In [19]:
from pyspark.ml.feature import CountVectorizer

In [20]:
#Get TF-IDF using CountVectorizer and IDF
c_vec = CountVectorizer(inputCol='filtered', outputCol='c_vec')

In [21]:
data_tf = c_vec.fit(data_filtered).transform(data_filtered)

In [22]:
from pyspark.ml.feature import IDF

In [23]:
idf = IDF(inputCol='c_vec', outputCol='tf_idf')

In [24]:
data_tfidf = idf.fit(data_tf).transform(data_tf)

In [25]:
from pyspark.ml.feature import VectorAssembler

In [26]:
assembler = VectorAssembler(inputCols=['label','tf_idf'], outputCol='features')

In [27]:
data_final = assembler.transform(data_tfidf)

In [28]:
data_clean = data_final.select(['label', 'features'])

In [29]:
train_data, test_data = data_clean.randomSplit([0.7, 0.3])

In [30]:
from pyspark.ml.classification import NaiveBayes

In [31]:
nb = NaiveBayes()  # Default columns expected are 'features', 'label', 'prediction' etc..

In [32]:
spam_predictor = nb.fit(train_data)

In [33]:
test_result = spam_predictor.transform(test_data)

In [34]:
# Now evaluate our prediction using MulticlassClassificaitonEvaluator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [35]:
# default params expected are 'prediction', 'label', metricName='f1'
# The prediciton an dlabel column names match in the test_result, so nto passing here

MulticlassClassificationEvaluator().evaluate(test_result)
MulticlassClassificationEvaluator(metricName='accuracy').evaluate(test_result)
MulticlassClassificationEvaluator(metricName='weightedPrecision').evaluate(test_result)
MulticlassClassificationEvaluator(metricName='weightedRecall').evaluate(test_result)
MulticlassClassificationEvaluator(metricName='f1').evaluate(test_result)


0.9166200239195221

0.9076281287246722

0.9431122470843195

0.9076281287246721

0.9166200239195221

In [36]:
MulticlassClassificationEvaluator(metricName='accuracy').evaluate(test_result)

0.9076281287246722

# ############################################################.

## NLP Ch-2: Tools_for_NLP (part-2).ipynb

* <b>TF-IDF</b>
    * TF-IDF : Term Frequency Inverse Document Frequency - is basically feature of vectorization used with text to reflect the <b>importance of a term to a document in the corpus</b> itself.
    * STEPS:
        1. tokenize the sentences using Tokenizer() or RegexTokenizer(pattern)
        * Apply HashingTF to generate rawFeatures (a preliminiary vectorization)
        * Apply IDF on result from HashingTF to generate final vectorized 'features'
        * This 'features' field from TF-IDF output can now be used as input to a ML algorithms.
* <b>CounterVectorizer</b>

<details><summary><b>Explanation on TF-IDF</b></summary>

<p><a href="http://en.wikipedia.org/wiki/Tf%E2%80%93idf">Term frequency-inverse document frequency (TF-IDF)</a> 
is a feature vectorization method widely used in text mining to reflect the importance of a term 
to a document in the corpus. Denote a term by <code>$t$</code>, a document by  d , and the corpus by D.
Term frequency <code>$TF(t, d)$</code> is the number of times that term <code>$t$</code> appears in document <code>$d$</code>, while 
document frequency <code>$DF(t, D)$</code> is the number of documents that contains term <code>$t$</code>. If we only use 
term frequency to measure the importance, it is very easy to over-emphasize terms that appear very 
often but carry little information about the document, e.g. &#8220;a&#8221;, &#8220;the&#8221;, and &#8220;of&#8221;. If a term appears 
very often across the corpus, it means it doesn&#8217;t carry special information about a particular document.
Inverse document frequency is a numerical measure of how much information a term provides:

$$ IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1} $$

where |D| is the total number of documents in the corpus. Since logarithm is used, if a term 
appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to avoid 
dividing by zero for terms outside the corpus (Here smoothing term 1 is added to both numeraor and denominator to avoid 'divide by zero error, in case the rm does not appear in any document). The TF-IDF measure is simply the product of TF and IDF:
$$ TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D). $$
</details>

<details><summary><b>Explanation on CountVectorizer</b></summary>
    
CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary, and generates a CountVectorizerModel. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA.

During the fitting process, CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. Another optional binary toggle parameter controls the output vector. If set to true all nonzero counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.
</details>

_______
# Feature Extractors
_______

<h2 id="tf-idf">TF-IDF</h2>

<p><a href="http://en.wikipedia.org/wiki/Tf%E2%80%93idf">Term frequency-inverse document frequency (TF-IDF)</a> 
is a feature vectorization method widely used in text mining to reflect the importance of a term 
to a document in the corpus. Denote a term by <code>$t$</code>, a document by  d , and the corpus by D.
Term frequency <code>$TF(t, d)$</code> is the number of times that term <code>$t$</code> appears in document <code>$d$</code>, while 
document frequency <code>$DF(t, D)$</code> is the number of documents that contains term <code>$t$</code>. If we only use 
term frequency to measure the importance, it is very easy to over-emphasize terms that appear very 
often but carry little information about the document, e.g. &#8220;a&#8221;, &#8220;the&#8221;, and &#8220;of&#8221;. If a term appears 
very often across the corpus, it means it doesn&#8217;t carry special information about a particular document.
Inverse document frequency is a numerical measure of how much information a term provides:

$$ IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1} $$

where |D| is the total number of documents in the corpus. Since logarithm is used, if a term 
appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to avoid 
dividing by zero for terms outside the corpus (Here smoothing term 1 is added to both numeraor and denominator to avoid 'divide by zero error, in case the rm does not appear in any document). The TF-IDF measure is simply the product of TF and IDF:
$$ TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D). $$


In [None]:
from pyspark.sql import SparkSession
spark2 = SparkSession.builder.appName('nlp_2').getOrCreate()

In [None]:
# Create a dataframe holding the words of sentences.
# To the "creteDataFRame() we will pass two lists, 
#     1st parameter is a list of tupples represengting the individual rows and 
#     second param is a list of column names for each tupple record in 1st parameter
sentence_data = spark2.createDataFrame([
        (0, 'Hi I hear about Spark'),
        (1, 'I wish java could use case classes'),
        (2, 'Logistic regression models are neat')
    ], ['label', 'sentence']
)

In [None]:
from pyspark.ml.feature import Tokenizer

In [None]:
tokenizer = Tokenizer(inputCol='sentence', outputCol='words')
words_data = tokenizer.transform(sentence_data)

In [None]:
words_data.show()

###### Generate TF

In [None]:
from pyspark.ml.feature import HashingTF

In [None]:
hashingTF = HashingTF(inputCol='words', outputCol='rawFeatures', numFeatures=20)
# There are 16 distinct words in the original dataframe sent_df, we we can pass around 20 as 'numFeatures' parameter.

* In HashingTF(), if 'numFeatures' is passed in, it must be more than the number of distinct tokens in the dataframe else there will be collision. i.e. collision means two or more words getting same hashed number. Default value is 2**18 i.e. 262144
* HashingTF uses 'feature hashing' also called 'hashing trick' for vectorizing features.
* Refer: https://en.wikipedia.org/wiki/Feature_hashing

For HashingTF Refer
* https://stackoverflow.com/questions/44966444/what-is-the-relation-between-numfeatures-in-hashingtf-in-spark-mllib-and-actual

In [None]:
featureized_data = hashingTF.transform(words_data)

In [None]:
featureized_data.show()

###### Now generate TF-IDF by generating IDF on the result from tF

In [None]:
from pyspark.ml.feature import IDF

In [None]:
idf = IDF(inputCol='rawFeatures', outputCol='features')

In [None]:
idf_model = idf.fit(featureized_data)

In [None]:
rescaled_data = idf_model.transform(featureized_data)

In [None]:
rescaled_data.show()

* If we had the first column of the initial sentence DF as 'label' and TF-IDF output as 'features', then we can use 'label' and 'feature' column to use with other ML algorithms.

## CountVectorizer
CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary, and generates a CountVectorizerModel. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA.

During the fitting process, CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. Another optional binary toggle parameter controls the output vector. If set to true all nonzero counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.

In [None]:
from pyspark.ml.feature import CountVectorizer

In [None]:
# Input data: Each row is a bag of words with a ID.
sdf = spark2.createDataFrame([
    (0, "a b c".split(" ")),
    (1, "a b b c a".split(" "))
], ["id", "words"])

* CountVectorizer - Extracts a vocabulary from document collections and generates

In [None]:
sdf.show()

In [None]:
# Here vocabulary size (vocabSize) is 3, i.e a, b, c - it is the number of distinct tokens or words
# vocab size must be atleast higher than number of distinct words or vocabulary
# We want a token to be included in the vocabulary only if it is part of atleast 2 documents i.e. minDF=2
# minDF is the minimum number of documents a term must appear in to be included in the vocabulary.

cvec = CountVectorizer(inputCol='words', outputCol='features', vocabSize=3, minDF=2)

In [None]:
sdf_result = cvec.fit(sdf).transform(sdf)

In [None]:
sdf_result.show(truncate=False)

#### Explanation of count vectorization model
###### The 'features' column can be explained as below
* Second document (for id==1) is:  [a, b, b, c, a]
* features(for second document) is: (3,[0,1,2],[2.0,2.0,1.0])
    * First element in 'features' is 3 - it represents the vocabulary size, i.e. number of distinct terms : a, b, c
    * Second element is [0,1,2] -- it show which terms form the vocabulary appear in 'words' column.
    * Third element is [2.0,2.0,1.0] -- it show frequency of the occurring terms in order. In this example the term frequency for [0th, 1st, 2nd] terns i.e. [a,b,c] are [2.0, 2.0, 1.0)] respectively. I.e. in the second document i.e for sdf_result[id==2]. That means a appears twice, b appears twice, but c appears once.
* This is essentially the "Bag of Words" method w
    
    
    
# ASK Ashish Lal: HOW TO PRINT CountVectorizer vocabulary in jupyter

* Now we will use count vectorizer with the original words data

In [None]:
cvec2 = CountVectorizer(inputCol='words', outputCol='features', vocabSize=20, minDF=1)

In [None]:
results_data = cvec2.fit(words_data).transform(words_data)

In [None]:
results_data.show(truncate=False)

# ############################################################.

## NLP Ch-1: Tools_for_NLP(part-1).ipynb

In [None]:
from pyspark.sql import SparkSession

In [None]:
spark1 = SparkSession.builder.appName('nlp_1').getOrCreate()

In [None]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer

In [None]:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

* <b>'col'</b> function is for calling columns
* <b>'udf'</b> stands for user defined functions, we can create a function using lambda expressions

In [None]:
# Create a dataframe holding the sentences to be processed.
# To the "creteDataFRame() we will pass two lists, 
#     1st parameter is a list of tupples represengting the individual rows and 
#     second param is a list of column names for each tupple record in 1st parameter
sent_df = spark1.createDataFrame([
        (0, 'Hi I hear about Spark'),
        (1, 'I wish java could use case classes'),
        (2, 'Logistic,regression,models,are,neat')
    ], ['id', 'sentence']
)

# Here in the 3rd sentence we have purposefully put comma in place of space in the line

In [None]:
sent_df.show()

In [None]:
tokenizer = Tokenizer(inputCol='sentence', outputCol='words')

#### For regular expressions Refer https://www.geeksforgeeks.org/write-regular-expressions/
* We can use Notepad++ to search for the patterns
* \\W matches a single non-word character like a space or a comma etc (\\W+ actually matchees a consecutive seqiuence of such non-word characters. To pass a single reverse slash character we need to type two consecutive reverse slashes as escape sequence.
* Check out the resource links @ 09:10 / 16:12 in the video "NLP Tools Part One" for more on expressions

* Tokenizer() uses a SPACE as the word separator while tokenizing the sentence
* RegexTokenizer uses the supplied pattern as the word separator

###### Get the 'words' column from the 'senence' columnusing tokenizer

In [None]:
sent_df_tokenized = tokenizer.transform(sent_df)

In [None]:
sent_df_tokenized.show()

### Define a user defined function using lambda to count number of words in a list of words

In [None]:
count_tokens = udf(lambda word_list: len(word_list), IntegerType())

In [None]:
sent_df_tokenized.withColumn('wordcount', count_tokens(col('words'))).show()

* Here we see that the normal 'Tokenizer' could not split the sentence using comma as separator, so third entry show only one big joint word. We need to use RegexTokenizer here with comma i.e. a non-word charcter (\\W) as a separator which will match space and comma both.

In [None]:
regex_tokenizer = RegexTokenizer(inputCol='sentence', outputCol='words', pattern='\\W')

* Behaviour of RegexTokenizer without 'pattern' parameter is same as that of Tokenizer()

In [None]:
sent_df_tokenized = regex_tokenizer.transform(sent_df)

In [None]:
sent_df_tokenized.show()

In [None]:
sent_df_tokenized.withColumn('word_count', count_tokens(col('words'))).show()

* Now to remove "stop words" which are very common and frequently occurring words and does nto carry much meaning, like : 'or', 'the' etc

### We can use "StopwordsRemover" from spark to filter out the common stop words from our tokens or words

In [None]:
from pyspark.ml.feature import StopWordsRemover

In [None]:
remover = StopWordsRemover(inputCol='words', outputCol='filtered')

* Note the <b>truncate=False</b> parameter to sdf.show(trun..) to expand the columns to show all the values instead of putting tripple dots after a fixed width.

In [None]:
sent_df_filtered = remover.transform(sent_df_tokenized).withColumn('word_count', count_tokens(col('filtered')))
sent_df_filtered.show()

sent_df_filtered.show(truncate=False)



* Another example of stop word removal

In [None]:
sentenceData = spark1.createDataFrame([
    (0, ["I", "saw", "the", "red", "balloon"]),
    (1, ["Mary", "had", "a", "little", "lamb"])
], ["id", "raw"])

remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.transform(sentenceData).show()
remover.transform(sentenceData).show(truncate=False)


" We can even add our common words appended to the spark provided list to ignore them, may be due to a regulatory / domain requirement.

#### N-Gram
* N-Gram is a sequence of tokens typically words for some integer
* It is a sequence of N tokens for some integer N @ 13:02 / 16:12
* ngrams show sequence of 'n' consecutive words in order

In [None]:
from pyspark.ml.feature import NGram

In [None]:
ngram = NGram(n=2, inputCol='filtered', outputCol='ngrammed')

sent_df_ngrammed = ngram.transform(sent_df_filtered)

sent_df_ngrammed.withColumn('ngram_count', count_tokens(col('ngrammed'))).show()

In [None]:
sent_df_ngrammed.select('ngrammed').show(truncate=False)

In [None]:
ngram3 = NGram(n=3, inputCol='filtered', outputCol='ngrammed')

ngram3.transform(sent_df_filtered).withColumn('ngram_count', count_tokens(col('ngrammed'))).select('ngrammed', 'ngram_count').show(truncate=False)


###### This kind of ngrams are very useful in finding the relationship between say two words - which words always appear next to each other etc..
###### In more advanced Natural Language Processing N-Grams may be needed.