**Assignment 3**<br>
Name: Benedictus Bimo C W<br>
Student ID: 5025201097<br>
Class: Big Data A<br>
Lecturer: Abdul Munif, S.Kom., M.Sc.

## Sources
1. https://www.uber.com/en-ID/blog/lsh/
2. https://stackoverflow.com/questions/56816537/cant-find-kaggle-json-file-in-google-colab
3. https://spark.apache.org/docs/latest/api/python/index.html
4. https://spark.apache.org/docs/latest/ml-features.html#locality-sensitive-hashing

# Initialization

## Checking the Environment

In [None]:
!java --version
!python --version

## Installing Apache Spark (PySpark)

In [None]:
# Install PySpark from PyPI

!pip install pyspark

## Initialize Apache Spark context

In [3]:
# Import Apache Spark SQL
from pyspark.sql import SparkSession

# Create Spark Session/Context
# Run Spark on the local machine, using all available CPU cores (local[*])
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Hello Pyspark") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

In [None]:
# Check the Spark session
print(spark)

# Data Mining Task

The LSH task in this notebook consists of three steps:

1. Convert the original data into vectors.
2. Compute hash values with the MinHash algorithm.
3. Search for similar items with an approximate nearest-neighbor search or an approximate similarity join.
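
The three steps above can be sketched in plain Python on a toy corpus. This is only an illustration under stated assumptions: the hash family `(a * x + b) mod PRIME`, the helper names, and the sample documents are hypothetical and are not taken from Spark's implementation.

```python
import random
from itertools import combinations

PRIME = 2_147_483_647  # a large prime for the toy hash family (assumption)


def to_word_set(text):
    """Step 1: represent a document as the set of its lowercase words."""
    return set(text.lower().split())


def make_hash_functions(count, seed=0):
    """Draw `count` random (a, b) pairs for hashes of the form (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(count)]


def minhash_signature(word_ids, hash_functions):
    """Step 2: for each hash function, keep the smallest hash over the set."""
    return tuple(min((a * x + b) % PRIME for x in word_ids) for a, b in hash_functions)


def candidate_pairs(signatures):
    """Step 3: collect pairs whose signatures agree in at least one position."""
    pairs = set()
    for (i, sig_i), (j, sig_j) in combinations(enumerate(signatures), 2):
        if any(u == v for u, v in zip(sig_i, sig_j)):
            pairs.add((i, j))
    return pairs


docs = [
    "the united states of america",
    "The United States of America",  # same word set as document 0
    "fresh bread and strong coffee",
]
word_sets = [to_word_set(d) for d in docs]
vocabulary = {w: i for i, w in enumerate(sorted(set().union(*word_sets)))}
hash_functions = make_hash_functions(3)
signatures = [
    minhash_signature({vocabulary[w] for w in ws}, hash_functions) for ws in word_sets
]
pairs = candidate_pairs(signatures)
print(pairs)
```

Documents 0 and 1 contain the same words, so their signatures are identical and they form a candidate pair. Document 2 shares no words with them and cannot collide, because each toy hash function maps distinct word IDs to distinct values.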

## Downloading the dataset

In [None]:
!pip install kaggle

In [22]:
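# PLEASE USE YOUR OWN KEY
# Create your own API token by following https://github.com/Kaggle/kaggle-api#api-credentials

import json
import os

# Placeholder key: substitute your own and never publish a real API key
api_token = {"username": "benewicaksono",
             "key": "YOUR_KAGGLE_API_KEY"}

# Make sure the config directory exists before writing the token file
os.makedirs('/root/.kaggle', exist_ok=True)
with open('/root/.kaggle/kaggle.json', 'w') as file:
    json.dump(api_token, file)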

!chmod 600 ~/.kaggle/kaggle.json
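
One way to keep the secret out of the notebook is to read it from the environment. The sketch below assumes the variables `KAGGLE_USERNAME` and `KAGGLE_KEY` are set (the Kaggle API documentation describes these names); the helper `write_kaggle_credentials` is hypothetical and simply writes the same `kaggle.json` file with owner-only permissions.

```python
import json
import os
import stat
from pathlib import Path


def write_kaggle_credentials(config_dir, username, key):
    """Write kaggle.json into config_dir and restrict it to the owner (mode 600)."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)
    path = config_dir / "kaggle.json"
    path.write_text(json.dumps({"username": username, "key": key}))
    path.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return path


# Read the secrets from the environment instead of hard-coding them in a cell
username = os.environ.get("KAGGLE_USERNAME")
key = os.environ.get("KAGGLE_KEY")
if username and key:
    write_kaggle_credentials(Path.home() / ".kaggle", username, key)
else:
    print("Set KAGGLE_USERNAME and KAGGLE_KEY before running this cell.")
```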


In [None]:
# Download from https://www.kaggle.com/datasets/urbanbricks/wikipedia-promotional-articles

!kaggle datasets download -d urbanbricks/wikipedia-promotional-articles

## Extract Dataset

In [None]:
!unzip wikipedia-promotional-articles.zip

## Check files in the current directory

In [None]:
!ls -la

## Read the dataset

In [None]:
# Read CSV (promotional.csv)
df = spark.read.option("header", True).csv("/content/promotional.csv")
df.printSchema()

In [None]:
# Add a unique ID column to the dataset
from pyspark.sql.functions import monotonically_increasing_id

newsDF = df.withColumn("id", monotonically_increasing_id())
newsDF.show()
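
The IDs produced by `monotonically_increasing_id` are unique and increasing but not consecutive. The Spark documentation describes the current layout as the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The plain-Python sketch below mimics that layout; the function name `increasing_id` and the two-partition example are hypothetical.

```python
def increasing_id(partition_id, row_in_partition):
    """Mimic the documented layout: partition ID in the upper 31 bits,
    record number within the partition in the lower 33 bits."""
    return (partition_id << 33) + row_in_partition


# Two hypothetical partitions with three rows each
ids = [increasing_id(p, r) for p in range(2) for r in range(3)]
print(ids)  # [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

The jump between the third and fourth ID is why the `id` column can show very large values even for a small dataset.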

In [None]:
# Count the total number of rows
newsDF.count()

## 1. Tokenize and vectorize the text
We split each article's text into lowercase word tokens, then convert the tokens into count vectors.

In [None]:
# Prepare the tokenizer
from pyspark.ml.feature import Tokenizer

# create a tokenizer object to tokenize the text
tokenizer = Tokenizer(inputCol="text", outputCol="words")
# tokenize the text in the dataframe
wordsDF = tokenizer.transform(newsDF)

# show the resulting dataframe
wordsDF.show()
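
Spark's `Tokenizer` lowercases the text and splits it on whitespace, so punctuation stays attached to the neighboring word. The plain-Python sketch below approximates that behavior and contrasts it with a regex-based split; both helper names are hypothetical. In Spark, `RegexTokenizer` from `pyspark.ml.feature` would be the closest counterpart to the second function.

```python
import re


def whitespace_tokenize(text):
    """Lowercase, then split on whitespace; punctuation stays attached."""
    return text.lower().split()


def word_tokenize(text):
    """Lowercase, then keep only runs of letters, digits, and underscores."""
    return re.findall(r"\w+", text.lower())


sample = "The United States, founded in 1776."
print(whitespace_tokenize(sample))  # ['the', 'united', 'states,', 'founded', 'in', '1776.']
print(word_tokenize(sample))        # ['the', 'united', 'states', 'founded', 'in', '1776']
```

With whitespace splitting, `states,` and `states` count as different vocabulary terms, which affects which documents match a query word later on.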

In [None]:
# Vectorize the dataset
from pyspark.ml.feature import CountVectorizer

# define the size of the vocabulary (the minimum document frequency, minDF, is set in the constructor below)
vocabSize=1000

# create a CountVectorizer object and fit it on the tokenized data
cvModel = CountVectorizer(inputCol="words", outputCol="features", vocabSize=vocabSize, minDF=10).fit(wordsDF)

# transform the tokenized data into a vectorized format
vectorizedDF = cvModel.transform(wordsDF)

# show the resulting dataframe
vectorizedDF.show()
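
To make the roles of `vocabSize` and `minDF` concrete, here is a plain-Python sketch of count vectorization on a toy corpus. It is an approximation under stated assumptions, not Spark's code: the helpers `fit_vocabulary` and `to_sparse_counts` are hypothetical, and ties are broken alphabetically only to keep the output deterministic.

```python
from collections import Counter


def fit_vocabulary(token_lists, vocab_size, min_df):
    """Keep terms found in at least min_df documents, then the vocab_size most frequent."""
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    term_freq = Counter(t for tokens in token_lists for t in tokens)
    eligible = [t for t in term_freq if doc_freq[t] >= min_df]
    # Sort by total count (descending), breaking ties alphabetically
    eligible.sort(key=lambda t: (-term_freq[t], t))
    return eligible[:vocab_size]


def to_sparse_counts(tokens, vocabulary):
    """Map a token list to {vocabulary index: count}, skipping unknown tokens."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[t] for t in tokens if t in index)
    return dict(sorted(counts.items()))


docs = [
    ["united", "states", "of", "america"],
    ["united", "nations"],
    ["states", "of", "matter", "states"],
]
vocabulary = fit_vocabulary(docs, vocab_size=3, min_df=2)
print(vocabulary)                             # ['states', 'of', 'united']
print(to_sparse_counts(docs[2], vocabulary))  # {0: 2, 1: 1}
```

Terms that appear in fewer than `min_df` documents never enter the vocabulary, so a document made up only of rare words ends up with an empty vector.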

## 2. Fit/train an LSH Model

In [None]:
from pyspark.ml.feature import MinHashLSH

# Define the MinHashLSH model with the desired input and output columns, and number of hash tables
mh = MinHashLSH(inputCol="features", outputCol="hashValues", numHashTables=3)

# Train the model using the vectorized data
LSHmodel = mh.fit(vectorizedDF)

# Apply the trained LSH model to the vectorized data and show the results
LSHmodel.transform(vectorizedDF).show()
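
MinHash treats its input as a set: the Spark documentation states that all non-zero values are treated as binary 1, and that each input vector needs at least one non-zero entry. The sketch below illustrates both points in plain Python with one hash value per hash table. The hash formula, the prime constant, and the helper names are assumptions for illustration, not a copy of Spark's internals.

```python
import random

HASH_PRIME = 2_038_074_743  # a large prime; the exact constant is an assumption


def make_coefficients(num_hash_tables, seed=12345):
    """Draw one random (a, b) pair per hash table."""
    rng = random.Random(seed)
    return [(rng.randrange(1, HASH_PRIME), rng.randrange(HASH_PRIME))
            for _ in range(num_hash_tables)]


def minhash_values(sparse_vector, coefficients):
    """Return one min-hash per table, using only the indices of non-zero entries."""
    active = [i for i, v in sparse_vector.items() if v != 0]
    if not active:
        raise ValueError("MinHash needs at least one non-zero entry")
    return [min(((1 + i) * a + b) % HASH_PRIME for i in active)
            for a, b in coefficients]


coefficients = make_coefficients(num_hash_tables=3)
counts = {3: 5.0, 7: 1.0, 42: 2.0}  # term counts, as CountVectorizer produces
binary = {3: 1.0, 7: 1.0, 42: 1.0}  # same terms, presence only
same = minhash_values(counts, coefficients) == minhash_values(binary, coefficients)
print(same)  # True
```

Because only the positions of non-zero entries matter, the term counts from `CountVectorizer` do not change the hash values. Rows whose feature vector is entirely zero would need to be filtered out before hashing.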


## 3. Search for items similar to the key "united" "states"

In [None]:
# Get the index of the word "united" and "states" in the vocabulary
print(cvModel.vocabulary.index("united"))
print(cvModel.vocabulary.index("states"))

In [34]:
# Convert the two-word input into a vector as long as the fitted vocabulary (at most vocabSize = 1000)
# Each query word gets the value 1.0 at its vocabulary index; every other position stays 0.0
# Final result: key = [0, 0, ..., 1.0, ..., 1.0, 0, ..., 0]
from pyspark.ml.linalg import Vectors
key = Vectors.sparse(len(cvModel.vocabulary), {cvModel.vocabulary.index("united"): 1.0, cvModel.vocabulary.index("states"): 1.0})
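
Note that `list.index` raises `ValueError` when a word is not in the vocabulary, for example because the `minDF` or `vocabSize` cutoff removed it. The plain-Python sketch below builds the key entries defensively; `build_key_entries` and the toy vocabulary are hypothetical stand-ins for `cvModel.vocabulary`.

```python
def build_key_entries(words, vocabulary):
    """Return ({index: 1.0}, missing_words) for the query words."""
    index = {term: i for i, term in enumerate(vocabulary)}
    entries = {index[w]: 1.0 for w in words if w in index}
    missing = [w for w in words if w not in index]
    return entries, missing


vocabulary = ["the", "of", "united", "states"]  # toy stand-in for cvModel.vocabulary
entries, missing = build_key_entries(["united", "states", "zanzibar"], vocabulary)
print(entries)  # {2: 1.0, 3: 1.0}
print(missing)  # ['zanzibar']
```

The resulting dictionary has the shape that `Vectors.sparse(size, entries)` accepts. If every query word is missing, the key would be empty and should not be passed to the MinHash model.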

In [35]:
# Define the number of neighbors
k = 40

In [None]:
# Search for the k approximate nearest neighbors of the key using the fitted LSH model
resultDF = LSHmodel.approxNearestNeighbors(vectorizedDF, key, k)
resultDF.show()
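
The result includes a distance column (named `distCol` by default). For `MinHashLSH` this is the Jaccard distance between the key and each row, that is, 1 minus the size of the intersection divided by the size of the union of their non-zero indices. A plain-Python sketch with hypothetical index sets:

```python
def jaccard_distance(a, b):
    """Return 1 - |a ∩ b| / |a ∪ b| for two sets of non-zero indices."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        raise ValueError("both sets are empty")
    return 1.0 - len(a & b) / len(union)


key_indices = {2, 3}  # hypothetical vocabulary indices of "united" and "states"
rows = {
    "doc_a": {2, 3},        # exactly the key terms
    "doc_b": {0, 2, 3, 9},  # key terms plus two others
    "doc_c": {0, 1},        # no key terms
}
distances = {name: jaccard_distance(key_indices, idx) for name, idx in rows.items()}
ranked = sorted(distances, key=distances.get)
print(distances)  # {'doc_a': 0.0, 'doc_b': 0.5, 'doc_c': 1.0}
print(ranked)     # ['doc_a', 'doc_b', 'doc_c']
```

Long articles contain many terms besides the two key words, so their union with the key is large and their distance to a two-word key tends to be close to 1.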

In [37]:
# Save the result as CSV
import pandas as pd
data = resultDF.toPandas()
# index=False avoids writing the pandas row index as an extra unnamed column
data.to_csv("result.csv", index=False)

## Check result.csv

In [None]:
import pandas as pd

# read the CSV file into a Pandas dataframe
df = pd.read_csv('result.csv')

# display the first 5 rows of the dataframe
print(df.head())