# NLP Via Spark

Spark can assist us in NLP applications!

For more information go to: https://spark.apache.org/docs/latest/ml-features.html

In this notebook we will show three applications of natural language processing (NLP) using Spark's MLlib.

## TF-IDF

*Reminder:* TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents, by considering both the frequency of the word in the document and the frequency of the word in the entire corpus.


Term Frequency (TF):

*   Term Frequency measures how often a term (word) appears in a document. It is calculated using the formula:
*   TF(t,d) = (Number of times term t appears in document d) / (Total number of terms in document d)
*   TF gives higher weight to terms that appear more frequently within a document. It helps in identifying the importance of a word in a specific document.

Inverse Document Frequency (IDF):


*   Inverse Document Frequency measures the importance of a term across a collection of documents. It is calculated using the formula:
*   IDF(t,D) = log( (Total number of documents in corpus ∣D∣) / (Number of documents containing term t) )
*   IDF gives higher weight to terms that are rare across the entire corpus but are present in a few documents. It helps in identifying terms that are unique or specific to certain documents.

TF-IDF Calculation:


*   TF-IDF combines the TF and IDF values to determine the weight of a term in a document relative to the entire corpus. It is calculated as:
*    TF-IDF(t,d,D)=TF(t,d)×IDF(t,D)
*  TF-IDF increases with the number of times a term appears in a document (TF) but is offset by how common the term is across the corpus (IDF). This helps in identifying terms that are both frequent in a document and specific to that document.
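
The three formulas above can be checked by hand on a tiny corpus. The sketch below is plain Python, not Spark; the documents `d1` to `d3` and the helper names `tf`, `idf` and `tf_idf` are made up for illustration. Note that the numbers Spark produces will differ: `HashingTF` outputs raw term counts rather than normalized frequencies, and Spark's `IDF` uses a smoothed variant, log((∣D∣ + 1) / (DF(t,D) + 1)).

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only): each document is a list of tokens
docs = {
    "d1": ["harry", "waved", "his", "wand"],
    "d2": ["harry", "read", "his", "book"],
    "d3": ["the", "owl", "slept"],
}

def tf(term, doc_tokens):
    """Times `term` appears in the document / total number of terms in the document."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    """log(total documents / documents containing `term`).

    Assumes `term` occurs in at least one document; otherwise this divides by zero.
    """
    n_containing = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc_id, corpus):
    """Weight of `term` in document `doc_id` relative to the whole corpus."""
    return tf(term, corpus[doc_id]) * idf(term, corpus)

print(tf_idf("wand", "d1", docs))   # rare term: appears only in d1
print(tf_idf("harry", "d1", docs))  # more common term: appears in d1 and d2
```

Both terms occur once in `d1`, so they have the same TF; "wand" gets the higher weight only because it occurs in fewer documents.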

### Harry Potter Example
This example uses a Spark session to analyze the Harry Potter books: it computes TF-IDF features for each book and measures the cosine similarity between pairs of books. A preprocessing step lowercases the text and replaces non-word characters with spaces, so that tokenization yields clean words.

In [None]:
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.sql.functions import col, desc, udf
from pyspark.sql.types import DoubleType
import re

# Initialize Spark session
spark = SparkSession.builder \
    .appName("TF-IDF Example") \
    .getOrCreate()

# Define the preprocessing function
def preprocess_text(book_path):
    with open(book_path, 'r', encoding='utf-8') as f:
        text = f.read().lower().replace('\n', ' ')
        text = re.sub(r'\W+', ' ', text)
    return text

# Define the book paths
book_paths = ['../data/HPBook1.txt','../data/HPBook2.txt','../data/HPBook3.txt','../data/HPBook4.txt','../data/HPBook5.txt','../data/HPBook6.txt','../data/HPBook7.txt']

# Read and preprocess the book texts
book_texts = [preprocess_text(book_path) for book_path in book_paths]

# Create a DataFrame with book paths and texts
df = spark.createDataFrame([(book_path, text) for book_path, text in zip(book_paths, book_texts)], ['path', 'text'])

# Tokenize the text
tokenizer = Tokenizer(inputCol="text", outputCol="words")
words_df = tokenizer.transform(df)

# Apply TF-IDF
hashing_tf = HashingTF(inputCol="words", outputCol="rawFeatures")
tf_features = hashing_tf.transform(words_df)

idf = IDF(inputCol="rawFeatures", outputCol="features")
idf_model = idf.fit(tf_features)
tfidf_features = idf_model.transform(tf_features)

# Show the TF-IDF features
tfidf_features.select("path", "features").show(truncate=False)


[Output truncated: a table with the columns path and features]

### Check the similarity between books based on TF-IDF features
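
Cosine similarity is the dot product of two vectors divided by the product of their L2 norms, so it compares the direction of two TF-IDF vectors and ignores their length. As a minimal plain-Python sketch (the vectors `book_a`, `book_b` and `book_c` are invented, not taken from the books):

```python
import math

def cosine_similarity(v1, v2):
    """Dot product of the two vectors divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Hypothetical TF-IDF vectors over a four-term vocabulary
book_a = [0.5, 0.0, 1.2, 0.0]
book_b = [1.0, 0.0, 2.4, 0.0]  # same direction as book_a, twice the length
book_c = [0.0, 0.7, 0.0, 0.3]  # shares no terms with book_a

print(cosine_similarity(book_a, book_b))  # close to 1: same term profile
print(cosine_similarity(book_a, book_c))  # 0: no terms in common
```

Because TF-IDF weights are non-negative, similarities between books computed this way fall between 0 and 1.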




In [43]:
# Define a function to calculate cosine similarity
def cosine_similarity(v1, v2):
    return float(v1.dot(v2) / (v1.norm(2) * v2.norm(2)))

cosine_similarity_udf = udf(cosine_similarity, DoubleType())

# Calculate cosine similarity between book pairs
book_pairs_similarity = (tfidf_features.alias("i")
    .crossJoin(tfidf_features.alias("j"))
    .filter(col("i.path") < col("j.path"))
    .select(
        col("i.path").alias("book1"),
        col("j.path").alias("book2"),
        cosine_similarity_udf(col("i.features"), col("j.features")).alias("similarity")
    )
)

# Sort by similarity in descending order and select the top 5 similarities
top_5_similarities = book_pairs_similarity.orderBy(desc("similarity")).limit(5)

# Show the top 5 similarities between book pairs
top_5_similarities.show(truncate=False)

+-----------+-----------+-------------------+
|book1      |book2      |similarity         |
+-----------+-----------+-------------------+
|HPBook5.txt|HPBook7.txt|0.24994726959550115|
|HPBook6.txt|HPBook7.txt|0.20425334173286822|
|HPBook3.txt|HPBook5.txt|0.12461851071829834|
|HPBook5.txt|HPBook6.txt|0.12336893033749687|
|HPBook3.txt|HPBook7.txt|0.11819872432953012|
+-----------+-----------+-------------------+



## Word2Vec

*Reminder:* Word2Vec is a technique for learning distributed representations of words in a continuous vector space from large text corpora. It associates words with vectors in such a way that semantically similar words are mapped to nearby points in the vector space, enabling algorithms to capture semantic relationships between words.

In [26]:
from pyspark.ml.feature import Word2Vec
from pyspark.ml.linalg import DenseVector
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Word2VecExample").getOrCreate()

# Prepare input data: Each row is a bag of words from a sentence or document
documentDF = spark.createDataFrame([
    ("Hi I heard about Spark".split(" "), ),
    ("I wish Java could use case classes".split(" "), ),
    ("Logistic regression models are neat".split(" "), )
], ["text"])

# FlatMap the array of words to separate rows
words_df = documentDF.rdd.flatMap(lambda x: x[0])
# Collect the words into a list
word_list = words_df.collect()

# Learn a mapping from words to vectors
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(documentDF)
word_vectors = model.getVectors()

# Create a dictionary mapping words to vectors
word_to_vector = {}
for row in word_vectors.collect():
    word = row["word"]
    vector = DenseVector(row["vector"])
    word_to_vector[word] = vector

# Print each word and its Word2Vec representation
for word, vector in word_to_vector.items():
    print("Word: '{}'\nVector: {}".format(word, vector))


Word: 'heard'
Vector: [-0.13917386531829834,-0.0984792485833168,0.13137060403823853]
Word: 'are'
Vector: [-0.02926809713244438,-0.13298393785953522,-0.04646815359592438]
Word: 'neat'
Vector: [-0.054623812437057495,0.07886628061532974,0.053552910685539246]
Word: 'classes'
Vector: [-0.10438217967748642,0.12623107433319092,-0.14298762381076813]
Word: 'I'
Vector: [-0.1577177494764328,0.09371575713157654,-0.1259755939245224]
Word: 'regression'
Vector: [-0.14404425024986267,0.09209312498569489,-0.08378593623638153]
Word: 'Logistic'
Vector: [-0.08757105469703674,0.12236057221889496,-0.09722736477851868]
Word: 'Spark'
Vector: [-0.007214682642370462,0.14007550477981567,-0.166773721575737]
Word: 'could'
Vector: [-0.09311074763536453,0.06391556560993195,-0.031752970069646835]
Word: 'use'
Vector: [0.06772690266370773,0.11048644781112671,-0.06937925517559052]
Word: 'Hi'
Vector: [0.06973613798618317,-0.007943478412926197,-0.12884926795959473]
Word: 'models'
Vector: [0.1053892970085144,0.125589177012

In [27]:
# Define the cosine similarity function
def cosine_similarity(vec1, vec2):
    dot_product = float(vec1.dot(vec2))
    norm_vec1 = float(vec1.norm(2))
    norm_vec2 = float(vec2.norm(2))
    similarity = dot_product / (norm_vec1 * norm_vec2)
    return similarity

# Calculate cosine similarity between specific words
word_pairs = [("Hi", "heard"), ("Spark", "Java"), ("regression", "neat")]

for pair in word_pairs:
    # Look up both vectors in the dictionary built in the previous cell
    word1_vec = word_to_vector[pair[0]]
    word2_vec = word_to_vector[pair[1]]
    similarity = cosine_similarity(word1_vec, word2_vec)
    print("Cosine similarity between '{}' and '{}': {:.4f}".format(pair[0], pair[1], similarity))


Cosine similarity between 'Hi' and 'heard': -0.8186
Cosine similarity between 'Spark' and 'Java': 0.7317
Cosine similarity between 'regression' and 'neat': 0.5088


With only three short sentences to learn from, this example was not very informative. Let's try again with a larger amount of data.

#### Alice in Wonderland Example

In [28]:
import re
from pyspark.ml.feature import Word2Vec
from pyspark.ml.linalg import DenseVector
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Word2VecExample").getOrCreate()

# Load the text file "Alice_book.txt" and preprocess sentences
TEXT_PATH= "../data/Alice_book.txt"
with open(TEXT_PATH, 'r', encoding='utf-8') as f:
    sentences = f.readlines()

sentences = [sen.strip().lower() for sen in sentences]  # Lowercase all words
sentences = [sen.split() for sen in sentences if sen]  # Split sentences into words
sentences = [[re.sub(r'\W+', '', w) for w in sen] for sen in sentences]  # Remove non-word characters

# Prepare input data: Each row is a bag of words from a sentence or document
documentDF = spark.createDataFrame([(sen,) for sen in sentences], ["text"])

# Learn a mapping from words to vectors
word2Vec = Word2Vec(vectorSize=1000, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(documentDF)
word_vectors = model.getVectors()

# Create a dictionary mapping words to vectors
word_to_vector = {}
for row in word_vectors.collect():
    word = row["word"]
    vector = DenseVector(row["vector"])
    word_to_vector[word] = vector



#### Cosine similarity between words

- "tea" and "time" should have high similarity as they are closely related in the context of "tea-time" in "Alice in Wonderland".
- "queen" and "hearts" should have high similarity as they are closely related in the context of the Queen of Hearts character.
- "hatter" and "watch" should have low similarity as they could represent distinct concepts or contexts. In the book, the rabbit had a watch.
- "alice" and "dream" might have low similarity as they are less likely to co-occur in the text.


In [29]:
# Define the cosine similarity function
def cosine_similarity(vec1, vec2):
    dot_product = float(vec1.dot(vec2))
    norm_vec1 = float(vec1.norm(2))
    norm_vec2 = float(vec2.norm(2))
    similarity = dot_product / (norm_vec1 * norm_vec2)
    return similarity

# Calculate cosine similarity between specific words
word_pairs = [("queen", "hearts"), ("tea", "time"), ("alice", "dream"), ("hatter", "watch")]

# Get the list of words
words = [row["word"] for row in word_vectors.collect()]

for pair in word_pairs:
    if pair[0] in word_to_vector and pair[1] in word_to_vector:
        word1_vec = word_to_vector[pair[0]]
        word2_vec = word_to_vector[pair[1]]
        similarity = cosine_similarity(DenseVector(word1_vec), DenseVector(word2_vec))
        print("Cosine similarity between '{}' and '{}': {:.4f}".format(pair[0], pair[1], similarity))
    else:
        print("One or both words '{}' and '{}' not found in the word vectors.".format(pair[0], pair[1]))


Cosine similarity between 'queen' and 'hearts': 0.9078
Cosine similarity between 'tea' and 'time': 0.8455
Cosine similarity between 'alice' and 'dream': 0.0660
Cosine similarity between 'hatter' and 'watch': -0.1602
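
Beyond checking hand-picked pairs, the same similarity can be used to rank every word against a query word. The sketch below does this in plain Python over a dictionary shaped like `word_to_vector`; the helper `most_similar` and the 3-dimensional `toy_vectors` are hypothetical and not taken from the trained model. Spark's `Word2VecModel` also offers a `findSynonyms` method for this kind of lookup.

```python
import math

def cosine_similarity(v1, v2):
    """Dot product of the two vectors divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))

def most_similar(word, vectors, top_n=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hypothetical 3-dimensional vectors, chosen by hand for illustration
toy_vectors = {
    "queen": [0.9, 0.1, 0.0],
    "hearts": [0.8, 0.2, 0.1],
    "tea": [0.0, 0.9, 0.4],
    "watch": [-0.5, 0.1, 0.9],
}

for other, score in most_similar("queen", toy_vectors):
    print(f"{other}: {score:.4f}")
```

With these toy vectors, "hearts" ranks first for "queen" and "tea" second, while "watch" has a negative similarity and is dropped by `top_n=2`.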


## N-gram

*Reminder:* N-grams are contiguous sequences of n items (words, characters, or tokens) extracted from a text document, where n represents the number of items in the sequence. They are commonly used in natural language processing for tasks such as language modeling, text generation, and feature extraction.
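
To make the definition concrete, here is a minimal plain-Python sketch of word n-gram extraction. The helper name `ngrams` is made up for illustration; it joins each window with a single space, which is also how the bigrams appear in the Spark output further down.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows, each joined with a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["Hi", "I", "heard", "about", "Spark"]
print(ngrams(tokens, 2))  # ['Hi I', 'I heard', 'heard about', 'about Spark']
print(ngrams(tokens, 3))  # ['Hi I heard', 'I heard about', 'heard about Spark']
```

A sequence of k tokens yields k - n + 1 n-grams, and none at all when it is shorter than n.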

In [30]:
from pyspark.ml.feature import NGram
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession\
    .builder\
    .appName("NGramExample")\
    .getOrCreate()

# Create a DataFrame with word sequences
wordDataFrame = spark.createDataFrame([
    (0, ["Hi", "I", "heard", "about", "Spark"]),
    (1, ["I", "wish", "Java", "could", "use", "case", "classes"]),
    (2, ["Logistic", "regression", "models", "are", "neat"])
], ["id", "words"])

# Generate NGrams from the word sequences
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngramDataFrame = ngram.transform(wordDataFrame)

# Show the generated NGrams
ngramDataFrame.select("ngrams").show(truncate=False)

+------------------------------------------------------------------+
|ngrams                                                            |
+------------------------------------------------------------------+
|[Hi I, I heard, heard about, about Spark]                         |
|[I wish, wish Java, Java could, could use, use case, case classes]|
|[Logistic regression, regression models, models are, are neat]    |
+------------------------------------------------------------------+



#### Encoding

We can encode the n-grams as numerical vectors and use them in various natural language processing (NLP) tasks such as:
- Text Classification
- Sentiment Analysis
- Named Entity Recognition (NER)
- Topic Modeling
- Machine Translation

And more!

In [31]:
from pyspark.ml.feature import CountVectorizer

# Initialize CountVectorizer
cv = CountVectorizer(inputCol="ngrams", outputCol="features")

# Fit the CountVectorizer to the NGrams DataFrame
cv_model = cv.fit(ngramDataFrame)

# Transform the NGrams DataFrame to get the encoded features
encoded_df = cv_model.transform(ngramDataFrame)

# Show the encoded features
encoded_df.select("id", "ngrams", "features").show(truncate=False)


+---+------------------------------------------------------------------+----------------------------------------------+
|id |ngrams                                                            |features                                      |
+---+------------------------------------------------------------------+----------------------------------------------+
|0  |[Hi I, I heard, heard about, about Spark]                         |(14,[2,4,11,13],[1.0,1.0,1.0,1.0])            |
|1  |[I wish, wish Java, Java could, could use, use case, case classes]|(14,[3,5,6,7,10,12],[1.0,1.0,1.0,1.0,1.0,1.0])|
|2  |[Logistic regression, regression models, models are, are neat]    |(14,[0,1,8,9],[1.0,1.0,1.0,1.0])              |
+---+------------------------------------------------------------------+----------------------------------------------+



Let's break down the representation (14,[2,4,11,13],[1.0,1.0,1.0,1.0]):

- 14: Total vocabulary size, meaning there are 14 distinct n-grams in the dataset.
- [2, 4, 11, 13]: Indices of the n-grams that have non-zero counts in the encoded vector. These indices correspond to the positions of the n-grams in the vocabulary.
- [1.0, 1.0, 1.0, 1.0]: Count of each of those n-grams. Each count indicates how many times the corresponding n-gram appears in the row.

So, in this example, the n-gram list [Hi I, I heard, heard about, about Spark] is represented as a sparse vector of size 14 that contains non-zero counts for the n-grams at indices 2, 4, 11, and 13.
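
To see what the sparse notation stands for, the sketch below expands it into a full-length list in plain Python. The helper `sparse_to_dense` is hypothetical and written only for illustration; on a real pyspark `SparseVector`, the `toArray()` method gives the dense form.

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) into a list of length `size`, zero elsewhere."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

# The encoded vector of row 0 from the table above
row0 = sparse_to_dense(14, [2, 4, 11, 13], [1.0, 1.0, 1.0, 1.0])
print(row0)
```

The sparse form stores only 4 index/value pairs instead of 14 numbers, which matters once the vocabulary grows to many thousands of n-grams.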

We can now use the encoded vectors for various purposes such as training machine learning models, performing similarity calculations, or any other task that requires numerical representations of text data.

In [32]:
# Do not forget to release the resources held by Spark
spark.stop()