# NLP Via Spark

Spark can assist us in NLP applications!

For more information go to: https://spark.apache.org/docs/latest/ml-features.html

In this notebook we show three applications of natural language processing (NLP) using Spark's MLlib: TF-IDF, Word2Vec, and n-grams.

## TF-IDF

*Reminder:* TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a word is to a document relative to a collection of documents. It considers both the frequency of the word in the document and the frequency of the word in the entire corpus.


Term Frequency (TF):

*   Term Frequency measures how often a term (word) appears in a document. It is calculated using the formula:
*   TF(t,d) = (Number of times term t appears in document d) / (Total number of terms in document d)
*   TF gives higher weight to terms that appear more frequently within a document. It helps in identifying the importance of a word in a specific document.

Inverse Document Frequency (IDF):


*   Inverse Document Frequency measures how rare a term is across a collection of documents. It is calculated using the formula:
*   IDF(t,D) = log( (Total number of documents in corpus ∣D∣) / (Number of documents containing term t) )
*   IDF gives higher weight to terms that are rare across the entire corpus but are present in a few documents. It helps in identifying terms that are unique or specific to certain documents.

TF-IDF Calculation:


*   TF-IDF combines the TF and IDF values to determine the weight of a term in a document relative to the entire corpus. It is calculated as:
*    TF-IDF(t,d,D)=TF(t,d)×IDF(t,D)
*  TF-IDF increases with the number of times a term appears in a document (TF) but is offset by how common the term is across the corpus (IDF). This helps in identifying terms that are both frequent in a document and distinctive of that document.
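
As a minimal sketch, the three formulas above can be computed by hand in plain Python. This is not Spark's implementation: the toy corpus and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration. Spark's `IDF` documents a smoothed variant, log((∣D∣ + 1) / (DF(t,D) + 1)), and `HashingTF` emits raw counts rather than the normalized ratio, so MLlib's numbers will differ from the ones computed here.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data): each document is a list of tokens
corpus = {
    "d1": ["the", "cat", "sat", "on", "the", "mat"],
    "d2": ["the", "dog", "sat"],
    "d3": ["a", "bird", "sang"],
}

def tf(term, doc_tokens):
    """Times `term` appears in the document / total terms in the document."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    """log(total documents / documents containing `term`).

    Raises ZeroDivisionError if no document contains `term`.
    """
    n_containing = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc_name, corpus):
    """TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)."""
    return tf(term, corpus[doc_name]) * idf(term, corpus)

for term in ["the", "cat"]:
    print(f"{term}: TF={tf(term, corpus['d1']):.3f}, "
          f"IDF={idf(term, corpus):.3f}, "
          f"TF-IDF={tf_idf(term, 'd1', corpus):.3f}")
```

In this toy corpus "the" occurs twice in `d1` but also appears in two of the three documents, while "cat" occurs once and only in `d1`, so "cat" ends up with the higher TF-IDF score.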

### Data
Note that we must not share copyrighted data without prior written permission from the owner.
Our data consists of some of the best books written by F. Scott Fitzgerald, including the renowned work "The Great Gatsby." These books were obtained from Project Gutenberg, an extensive digital library initiative that provides free access to a wide range of public domain literary works.

You can find Project Gutenberg at https://www.gutenberg.org/

In [22]:
import re
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

# Create a Spark session
spark = SparkSession.builder \
    .appName("TF-IDF Example") \
    .getOrCreate()

# Map each document name to its file path
documents = {
    "TheGreatGatsby": "../data/TheGreatGatsby.txt",
    "ThisSideofParadise": "../data/ThisSideofParadise.txt",
    "TheBeautifulandDamned": "../data/TheBeautifulandDamned.txt",
    "TalesofTheJazzAge": "../data/TalesofTheJazzAge.txt",
    "FlappersandPhilosophers": "../data/FlappersandPhilosophers.txt",
    "AlltheSadYoungMen": "../data/AlltheSadYoungMen.txt",
}

# Preprocess and tokenize each document
unique_words = set()  # Set to store unique words
for doc_name, doc_path in documents.items():
    with open(doc_path, 'r', encoding='utf-8') as f:
        text = f.read()

    # Preprocess the data
    text = text.lower()  # Lowercase all words
    words = re.findall(r'\w+', text)  # Extract words using regular expression

    # Add unique words to the set
    unique_words.update(words)

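# Convert the set of unique words into a one-column DataFrame.
# Note: each unique word becomes its own one-word row ("document"), so every
# term frequency is 1 and each TF-IDF value equals the IDF of the word's hash bucket.
text_data = spark.createDataFrame([(word,) for word in unique_words], ["text"])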

# Tokenize the text data
tokenizer = Tokenizer(inputCol="text", outputCol="words")
words_data = tokenizer.transform(text_data)

# Apply HashingTF to convert words to term frequency vectors
hashing_tf = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1000)
tf_data = hashing_tf.transform(words_data)

# Compute IDF to get the TF-IDF vectors
idf_model = IDF(inputCol="rawFeatures", outputCol="features")
idf_data = idf_model.fit(tf_data).transform(tf_data)

# Extract the top N words with the highest TF-IDF scores
num_top_words = 5
top_words = idf_data.select("words", "features") \
    .rdd.map(lambda row: row.asDict()) \
    .flatMap(lambda x: [(word, score) for word, score in zip(x["words"], x["features"])]) \
    .sortBy(lambda x: x[1], ascending=False) \
    .take(num_top_words)

# Print the top N unique words with their TF-IDF scores
print(f"Top {num_top_words} unique words based on TF-IDF scores across the corpus:")
for word, score in top_words:
    print(f"{word}: {score:.3f}")




Top 5 unique words based on TF-IDF scores across the corpus:
diffusing: 6.967
decent: 6.967
skeins: 6.967
hope: 6.967
unimpeachable: 6.967
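
The `HashingTF` step above does not store a vocabulary: it hashes each term to one of `numFeatures` buckets and counts per bucket. The following is a rough plain-Python sketch of that idea, not Spark's code. Spark documents MurmurHash3 as its hash function, whereas this sketch substitutes `zlib.crc32` as a stand-in, and the names `bucket_index` and `hashed_term_counts` as well as the sample tokens are hypothetical.

```python
import zlib

def bucket_index(term, num_features):
    """Map a term to a bucket in [0, num_features). CRC32 is a stand-in hash."""
    return zlib.crc32(term.encode("utf-8")) % num_features

def hashed_term_counts(tokens, num_features):
    """Return {bucket index: count}, i.e. a sparse term-frequency vector."""
    counts = {}
    for token in tokens:
        idx = bucket_index(token, num_features)
        counts[idx] = counts.get(idx, 0) + 1
    return counts

# Hypothetical sample tokens
tokens = ["gatsby", "believed", "in", "the", "green", "light", "the"]

# Plenty of buckets: collisions between distinct terms are unlikely
print(hashed_term_counts(tokens, num_features=1000))

# Fewer buckets than distinct terms: some terms must share a bucket
print(hashed_term_counts(tokens, num_features=4))
```

Whenever the vocabulary is larger than `numFeatures`, some distinct words necessarily share a bucket and therefore share an IDF value, which is one reason several words can end up with identical scores.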


## Word2Vec

*Reminder:* Word2Vec is a technique for learning distributed representations of words in a continuous vector space from large text corpora. It associates words with vectors in such a way that semantically similar words are mapped to nearby points in the vector space, enabling algorithms to capture semantic relationships between words.

In [23]:
from pyspark.ml.feature import Word2Vec
from pyspark.ml.linalg import DenseVector
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Word2VecExample").getOrCreate()

# Prepare input data: Each row is a bag of words from a sentence or document
documentDF = spark.createDataFrame([
    ("Hi I heard about Spark".split(" "), ),
    ("I wish Java could use case classes".split(" "), ),
    ("Logistic regression models are neat".split(" "), )
], ["text"])

# FlatMap the array of words to separate rows
words_df = documentDF.rdd.flatMap(lambda x: x[0])
# Collect the words into a list
word_list = words_df.collect()

# Learn a mapping from words to vectors
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(documentDF)
word_vectors = model.getVectors()

# Create a dictionary mapping words to vectors
word_to_vector = {}
for row in word_vectors.collect():
    word = row["word"]
    vector = DenseVector(row["vector"])
    word_to_vector[word] = vector

# Print each word and its Word2Vec representation
for word, vector in word_to_vector.items():
    print("Word: '{}'\nVector: {}".format(word, vector))


Word: 'heard'
Vector: [-0.1428680270910263,0.052687663584947586,0.10879096388816833]
Word: 'are'
Vector: [0.12437053769826889,0.14396001398563385,-0.15560980141162872]
Word: 'neat'
Vector: [-0.06944344192743301,0.16405947506427765,-0.13333088159561157]
Word: 'classes'
Vector: [0.10187613219022751,-0.14191721379756927,0.08895421773195267]
Word: 'I'
Vector: [0.11645925045013428,-0.04665014520287514,0.005499431863427162]
Word: 'regression'
Vector: [0.0016052094288170338,-0.0849643424153328,0.01142392959445715]
Word: 'Logistic'
Vector: [0.12236311286687851,-0.04140022397041321,0.08744674175977707]
Word: 'Spark'
Vector: [-0.018199866637587547,-0.07334957271814346,0.15221886336803436]
Word: 'could'
Vector: [0.12124157696962357,0.0344112291932106,0.007167116738855839]
Word: 'use'
Vector: [-0.15078683197498322,0.027394283562898636,-0.049794115126132965]
Word: 'Hi'
Vector: [-0.0015435452805832028,-0.13844861090183258,-0.01282623689621687]
Word: 'models'
Vector: [-0.02154526300728321,-0.09330314

In [24]:
# Define the cosine similarity function
def cosine_similarity(vec1, vec2):
    dot_product = float(vec1.dot(vec2))
    norm_vec1 = float(vec1.norm(2))
    norm_vec2 = float(vec2.norm(2))
    similarity = dot_product / (norm_vec1 * norm_vec2)
    return similarity

# Calculate cosine similarity between specific words
word_pairs = [("Hi", "heard"), ("Spark", "Java"), ("regression", "neat")]

# Get the list of words
words = [row["word"] for row in word_vectors.collect()]

for pair in word_pairs:
    word1_vec = word_vectors.filter(word_vectors.word == pair[0]).select("vector").collect()[0][0]
    word2_vec = word_vectors.filter(word_vectors.word == pair[1]).select("vector").collect()[0][0]
    similarity = cosine_similarity(DenseVector(word1_vec), DenseVector(word2_vec))
    print("Cosine similarity between '{}' and '{}': {:.4f}".format(pair[0], pair[1], similarity))


Cosine similarity between 'Hi' and 'heard': -0.3255
Cosine similarity between 'Spark' and 'Java': -0.1902
Cosine similarity between 'regression' and 'neat': -0.8163


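This example was not very informative: three short sentences are far too little data for Word2Vec to learn meaningful vectors, so these similarities are largely noise. Let's try again with a larger amount of data.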

#### Alice in Wonderland Example

In [27]:
import re
from pyspark.ml.feature import Word2Vec
from pyspark.ml.linalg import DenseVector
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Word2VecExample").getOrCreate()

# Load the text file "Alice_book.txt" and preprocess sentences
TEXT_PATH= "../data/Alice_book.txt"
with open(TEXT_PATH, 'r', encoding='utf-8') as f:
    sentences = f.readlines()

sentences = [sen.strip().lower() for sen in sentences]  # Lowercase all words
sentences = [sen.split() for sen in sentences if sen]  # Split sentences into words
sentences = [[re.sub(r'\W+', '', w) for w in sen] for sen in sentences]  # Remove non-word characters

# Prepare input data: Each row is a bag of words from a sentence or document
documentDF = spark.createDataFrame([(sen,) for sen in sentences], ["text"])

# Learn a mapping from words to vectors
word2Vec = Word2Vec(vectorSize=1000, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(documentDF)
word_vectors = model.getVectors()

# Create a dictionary mapping words to vectors
word_to_vector = {}
for row in word_vectors.collect():
    word = row["word"]
    vector = DenseVector(row["vector"])
    word_to_vector[word] = vector



#### Cosine similarity between words

- "tea" and "time" should have high similarity as they are closely related in the context of "tea-time" in "Alice in Wonderland".
- "queen" and "hearts" should have high similarity as they are closely related in the context of the Queen of Hearts character.
- "hatter" and "watch" should have low similarity as they could represent distinct concepts or contexts. In the book, the rabbit had a watch.
- "alice" and "dream" might have low similarity as they are less likely to co-occur in the text.
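
To help read the numbers that follow, here is a small plain-Python sketch of cosine similarity on hand-picked vectors. The vectors are illustrative assumptions, not Word2Vec output. The measure depends only on the angle between two vectors, so it ranges from 1 (same direction) through 0 (orthogonal) to -1 (opposite direction).

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])        # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # close to -1

print(f"same direction: {same_direction:.4f}")
print(f"orthogonal:     {orthogonal:.4f}")
print(f"opposite:       {opposite:.4f}")
```

Scaling a vector does not change the result, which is why `[1.0, 2.0]` and `[2.0, 4.0]` count as maximally similar.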


In [28]:
# Define the cosine similarity function
def cosine_similarity(vec1, vec2):
    dot_product = float(vec1.dot(vec2))
    norm_vec1 = float(vec1.norm(2))
    norm_vec2 = float(vec2.norm(2))
    similarity = dot_product / (norm_vec1 * norm_vec2)
    return similarity

# Calculate cosine similarity between specific words
word_pairs = [("queen", "hearts"), ("tea", "time"), ("alice", "dream"), ("hatter", "watch")]

# Get the list of words
words = [row["word"] for row in word_vectors.collect()]

for pair in word_pairs:
    if pair[0] in word_to_vector and pair[1] in word_to_vector:
        word1_vec = word_to_vector[pair[0]]
        word2_vec = word_to_vector[pair[1]]
        similarity = cosine_similarity(DenseVector(word1_vec), DenseVector(word2_vec))
        print("Cosine similarity between '{}' and '{}': {:.4f}".format(pair[0], pair[1], similarity))
    else:
        print("One or both words '{}' and '{}' not found in the word vectors.".format(pair[0], pair[1]))


Cosine similarity between 'queen' and 'hearts': 0.9460
Cosine similarity between 'tea' and 'time': 0.5390
Cosine similarity between 'alice' and 'dream': -0.0349
Cosine similarity between 'hatter' and 'watch': 0.0255


## N-gram

*Reminder:* N-grams are contiguous sequences of n items (words, characters, or tokens) extracted from a text document, where n represents the number of items in the sequence. They are commonly used in natural language processing for tasks such as language modeling, text generation, and feature extraction.
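
As a quick illustration of the definition, here is a plain-Python sketch that slides a window of length n over a sequence. The helper name `ngrams` is hypothetical and this is not how Spark's `NGram` is implemented. It returns tuples, whereas Spark joins the words of each n-gram into a single space-separated string.

```python
def ngrams(items, n):
    """Return all contiguous length-n windows over `items` as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "Hi I heard about Spark".split()

# Word bigrams
print(ngrams(words, 2))
# [('Hi', 'I'), ('I', 'heard'), ('heard', 'about'), ('about', 'Spark')]

# Character trigrams
print(ngrams(list("spark"), 3))
# [('s', 'p', 'a'), ('p', 'a', 'r'), ('a', 'r', 'k')]

# A sequence shorter than n yields no n-grams
print(ngrams(words, 6))
# []
```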

In [29]:
from pyspark.ml.feature import NGram
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession\
    .builder\
    .appName("NGramExample")\
    .getOrCreate()

# Create a DataFrame with word sequences
wordDataFrame = spark.createDataFrame([
    (0, ["Hi", "I", "heard", "about", "Spark"]),
    (1, ["I", "wish", "Java", "could", "use", "case", "classes"]),
    (2, ["Logistic", "regression", "models", "are", "neat"])
], ["id", "words"])

# Generate NGrams from the word sequences
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramDataFrame = ngram.transform(wordDataFrame)

# Show the generated NGrams
ngramDataFrame.select("ngrams").show(truncate=False)

+------------------------------------------------------------------+
|ngrams                                                            |
+------------------------------------------------------------------+
|[Hi I, I heard, heard about, about Spark]                         |
|[I wish, wish Java, Java could, could use, use case, case classes]|
|[Logistic regression, regression models, models are, are neat]    |
+------------------------------------------------------------------+



#### Encoding

We can encode the n-grams as numeric vectors and use them in various natural language processing (NLP) tasks such as:
- Text Classification
- Sentiment Analysis
- Named Entity Recognition (NER)
- Topic Modeling
- Machine Translation

And more!

In [30]:
from pyspark.ml.feature import CountVectorizer

# Initialize CountVectorizer
cv = CountVectorizer(inputCol="ngrams", outputCol="features")

# Fit the CountVectorizer to the NGrams DataFrame
cv_model = cv.fit(ngramDataFrame)

# Transform the NGrams DataFrame to get the encoded features
encoded_df = cv_model.transform(ngramDataFrame)

# Show the encoded features
encoded_df.select("id", "ngrams", "features").show(truncate=False)


+---+------------------------------------------------------------------+----------------------------------------------+
|id |ngrams                                                            |features                                      |
+---+------------------------------------------------------------------+----------------------------------------------+
|0  |[Hi I, I heard, heard about, about Spark]                         |(14,[3,5,10,12],[1.0,1.0,1.0,1.0])            |
|1  |[I wish, wish Java, Java could, could use, use case, case classes]|(14,[2,4,6,7,11,13],[1.0,1.0,1.0,1.0,1.0,1.0])|
|2  |[Logistic regression, regression models, models are, are neat]    |(14,[0,1,8,9],[1.0,1.0,1.0,1.0])              |
+---+------------------------------------------------------------------+----------------------------------------------+



Let's break down the representation (14,[3,5,10,12],[1.0,1.0,1.0,1.0]) from row 0:

- 14: Total vocabulary size, meaning there are 14 distinct n-grams in the dataset.
- [3, 5, 10, 12]: Indices of the n-grams that have non-zero counts in the encoded vector. These indices correspond to the positions of the n-grams in the vocabulary.
- [1.0, 1.0, 1.0, 1.0]: Counts for those n-grams. Each count indicates how many times the corresponding n-gram appears in the row.

So, in this example, the n-gram list [Hi I, I heard, heard about, about Spark] is represented as a sparse vector of size 14 with non-zero counts at indices 3, 5, 10, and 12.
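
To make the sparse notation concrete, the sketch below expands the triple shown for row 0 of the output table, (14,[3,5,10,12],[1.0,1.0,1.0,1.0]), into an ordinary dense list in plain Python. The helper name `to_dense` is hypothetical. In Spark itself, the fitted model's `vocabulary` attribute (here `cv_model.vocabulary`) should tell you which n-gram each index refers to.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

# Row 0 of the output table above
row0 = to_dense(14, [3, 5, 10, 12], [1.0, 1.0, 1.0, 1.0])

print(row0)
print("length:", len(row0))
print("non-zero positions:", [i for i, v in enumerate(row0) if v != 0.0])
```

Only 4 of the 14 entries are non-zero, which is why the sparse form is the more compact way to store the vector.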

We can now use the encoded vectors for various purposes such as training machine learning models, performing similarity calculations, or any other task that requires numerical representations of text data.

In [31]:
# Do not forget to release the resources held by Spark
spark.stop()