<a href="https://colab.research.google.com/github/angelaoryza/BigData/blob/main/MinHashing/Locality_Sensitive_Hashing_(promotional_csv).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Big Data : Assignment 3**

Name : Angela Oryza Prabowo
Student Number : 5025201022

## **Checking the Environment**

In [20]:
!java --version
!python --version

openjdk 11.0.18 2023-01-17
OpenJDK Runtime Environment (build 11.0.18+10-post-Ubuntu-0ubuntu120.04.1)
OpenJDK 64-Bit Server VM (build 11.0.18+10-post-Ubuntu-0ubuntu120.04.1, mixed mode, sharing)
Python 3.8.10


## **Installing Apache Spark (PySpark)**

In [21]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## **Initialize Apache Spark Context**

In [22]:
# Import Apache Spark SQL
from pyspark.sql import SparkSession

# Create the Spark session (which wraps the Spark context)
# Run on the local machine with all CPU cores, as indicated by [*]
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("My Pyspark") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
      


## **Data Mining Task**

The LSH task in this notebook consists of three steps:


1.   Convert the original data into vectors.
2.   Compute the hashes using the MinHash algorithm.
3.   Search for similar pairs using approximate k-nearest neighbors or a similarity join.
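
The three steps above can be illustrated without Spark. The following is a minimal pure-Python sketch, not the implementation used by `MinHashLSH`: the toy documents, the helper names (`minhash_signature`, `estimated_jaccard`, `exact_jaccard`), the number of hash functions, and the hash family are all assumptions chosen for illustration.

```python
import random

# Toy corpus (hypothetical, for illustration only).
docs = {
    "doc1": "free software for internet users",
    "doc2": "free software for internet companies",
    "doc3": "a television drama from japan",
}

# Step 1: represent each document as a set of vocabulary indices.
vocab = sorted({word for text in docs.values() for word in text.split()})
index = {word: i for i, word in enumerate(vocab)}
sets = {name: {index[w] for w in text.split()} for name, text in docs.items()}

# Step 2: compute a MinHash signature using random linear hash functions.
PRIME = 2_147_483_647  # a large prime; an assumed choice for this sketch
NUM_HASHES = 200       # assumed signature length
rng = random.Random(42)
coeffs = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
          for _ in range(NUM_HASHES)]


def minhash_signature(items):
    """Keep, for every hash function, the smallest hash over the set."""
    return [min((a * (x + 1) + b) % PRIME for x in items) for a, b in coeffs]


signatures = {name: minhash_signature(s) for name, s in sets.items()}


# Step 3: compare signatures; the share of matching positions
# approximates the Jaccard similarity of the underlying sets.
def estimated_jaccard(sig1, sig2):
    return sum(1 for u, v in zip(sig1, sig2) if u == v) / len(sig1)


def exact_jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)


for left, right in [("doc1", "doc2"), ("doc1", "doc3")]:
    print(left, right,
          "estimated:", estimated_jaccard(signatures[left], signatures[right]),
          "exact:", exact_jaccard(sets[left], sets[right]))
```

With more hash functions the estimate tends to get closer to the exact value, at the cost of more computation per document.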






## **Downloading the dataset**

In [23]:
!pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [24]:
!mkdir ~/.kaggle/
!touch ~/.kaggle/kaggle.json

# Replace the placeholder with your own Kaggle API key; never publish a real key in a notebook
api_token = {"username": "angelaoryza", "key": "<YOUR_KAGGLE_API_KEY>"}

import json 

with open('/root/.kaggle/kaggle.json', 'w') as file:
    json.dump(api_token, file)

!chmod 600 ~/.kaggle/kaggle.json

mkdir: cannot create directory ‘/root/.kaggle/’: File exists


In [25]:
# Download from https://www.kaggle.com/datasets/urbanbricks/wikipedia-promotional-articles

!kaggle datasets download -d urbanbricks/wikipedia-promotional-articles

wikipedia-promotional-articles.zip: Skipping, found more recently modified local copy (use --force to force download)


In [26]:
!unzip wikipedia-promotional-articles.zip

Archive:  wikipedia-promotional-articles.zip
replace good.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: good.csv                
replace promotional.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: promotional.csv         


In [27]:
!ls -la

total 783140
drwxr-xr-x 1 root root      4096 Mar  8 07:20 .
drwxr-xr-x 1 root root      4096 Mar  8 06:54 ..
drwxr-xr-x 4 root root      4096 Mar  6 17:51 .config
-rw-r--r-- 1 root root 475685227 Oct 27  2019 good.csv
-rw-r--r-- 1 root root 115360355 Oct 27  2019 promotional.csv
drwxr-xr-x 1 root root      4096 Mar  6 17:52 sample_data
-rw-r--r-- 1 root root 210863294 Mar  8 07:02 wikipedia-promotional-articles.zip


## **Read the dataset**

In [28]:
# Read CSV
df = spark.read.option("header", True).csv("/content/promotional.csv")
df.printSchema()

root
 |-- text: string (nullable = true)
 |-- advert: string (nullable = true)
 |-- coi: string (nullable = true)
 |-- fanpov: string (nullable = true)
 |-- pr: string (nullable = true)
 |-- resume: string (nullable = true)
 |-- url: string (nullable = true)



In [29]:
# Add an ID for the dataset
from pyspark.sql.functions import monotonically_increasing_id

newsDF = df.withColumn("id", monotonically_increasing_id())
newsDF.show()

+--------------------+------+---+------+---+------+--------------------+---+
|                text|advert|coi|fanpov| pr|resume|                 url| id|
+--------------------+------+---+------+---+------+--------------------+---+
|1 Litre no Namida...|     0|  0|     1|  0|     0|https://en.wikipe...|  0|
|1DayLater was fre...|     1|  1|     0|  0|     0|https://en.wikipe...|  1|
|1E is a privately...|     1|  0|     0|  0|     0|https://en.wikipe...|  2|
|1Malaysia pronoun...|     1|  0|     0|  0|     0|https://en.wikipe...|  3|
|The Jerusalem Bie...|     1|  0|     0|  0|     0|https://en.wikipe...|  4|
|1st Round Enterpr...|     0|  0|     0|  1|     0|https://en.wikipe...|  5|
|2ergo is a provid...|     1|  0|     0|  0|     0|https://en.wikipe...|  6|
|2N Telekomunikace...|     1|  0|     0|  0|     0|https://en.wikipe...|  7|
|A 3D printing mar...|     1|  0|     0|  0|     0|https://en.wikipe...|  8|
|3DR is an America...|     1|  1|     0|  0|     0|https://en.wikipe...|  9|

In [30]:
# Get the total rows
newsDF.count()

23837

### **1. Prepare the Tokenizer**
We split the input text into word tokens.
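
As a rough illustration of what word-level tokenization produces, and of how it relates to k-shingles, here is a small pure-Python sketch. It only approximates Spark's `Tokenizer` (lowercasing and splitting on whitespace); the helper names `tokenize` and `word_shingles` and the sample sentence are hypothetical.

```python
def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def word_shingles(tokens, k):
    """Return every run of k consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


# Hypothetical sample sentence, not taken from the dataset.
tokens = tokenize("Spark makes Big Data simple")

print(word_shingles(tokens, 1))  # k = 1: the single-word tokens themselves
print(word_shingles(tokens, 2))  # k = 2: overlapping word pairs
```

With k = 1 the shingles are just the individual words, which is the representation used in this notebook. For larger k in Spark, `pyspark.ml.feature.NGram` can be applied to the tokenized column.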

In [31]:
# Prepare the tokenizer
from pyspark.ml.feature import Tokenizer

# Tokenizer splits the text into single words, so the shingle size k is effectively 1:
# each word becomes one token
tokenizer = Tokenizer(inputCol="text", outputCol="words")
wordsDF = tokenizer.transform(newsDF)

wordsDF.show()

+--------------------+------+---+------+---+------+--------------------+---+--------------------+
|                text|advert|coi|fanpov| pr|resume|                 url| id|               words|
+--------------------+------+---+------+---+------+--------------------+---+--------------------+
|1 Litre no Namida...|     0|  0|     1|  0|     0|https://en.wikipe...|  0|[1, litre, no, na...|
|1DayLater was fre...|     1|  1|     0|  0|     0|https://en.wikipe...|  1|[1daylater, was, ...|
|1E is a privately...|     1|  0|     0|  0|     0|https://en.wikipe...|  2|[1e, is, a, priva...|
|1Malaysia pronoun...|     1|  0|     0|  0|     0|https://en.wikipe...|  3|[1malaysia, prono...|
|The Jerusalem Bie...|     1|  0|     0|  0|     0|https://en.wikipe...|  4|[the, jerusalem, ...|
|1st Round Enterpr...|     0|  0|     0|  1|     0|https://en.wikipe...|  5|[1st, round, ente...|
|2ergo is a provid...|     1|  0|     0|  0|     0|https://en.wikipe...|  6|[2ergo, is, a, pr...|
|2N Telekomunikace..

In [35]:
# Vectorize the dataset
from pyspark.ml.feature import CountVectorizer

# Use a vocabSize of 800 because promotional.csv contains less text than good.csv
vocabSize=800

# Train the CountVectorizer Model using our data
cvModel = CountVectorizer(inputCol="words", outputCol="features", vocabSize=vocabSize, minDF=10).fit(wordsDF)

# Transform our data into vectors
vectorizedDF = cvModel.transform(wordsDF)
vectorizedDF.show()


+--------------------+------+---+------+---+------+--------------------+---+--------------------+--------------------+
|                text|advert|coi|fanpov| pr|resume|                 url| id|               words|            features|
+--------------------+------+---+------+---+------+--------------------+---+--------------------+--------------------+
|1 Litre no Namida...|     0|  0|     1|  0|     0|https://en.wikipe...|  0|[1, litre, no, na...|(800,[0,1,2,3,4,5...|
|1DayLater was fre...|     1|  1|     0|  0|     0|https://en.wikipe...|  1|[1daylater, was, ...|(800,[0,1,2,3,4,5...|
|1E is a privately...|     1|  0|     0|  0|     0|https://en.wikipe...|  2|[1e, is, a, priva...|(800,[0,1,2,3,4,5...|
|1Malaysia pronoun...|     1|  0|     0|  0|     0|https://en.wikipe...|  3|[1malaysia, prono...|(800,[0,1,2,3,4,5...|
|The Jerusalem Bie...|     1|  0|     0|  0|     0|https://en.wikipe...|  4|[the, jerusalem, ...|(800,[0,1,2,3,4,5...|
|1st Round Enterpr...|     0|  0|     0|  1|    
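
To make the roles of `vocabSize` and `minDF` more concrete, the sketch below mimics the idea behind `CountVectorizer` in plain Python: keep only terms that occur in at least `min_df` documents, take the most frequent ones up to `vocab_size`, and encode each document as sparse index-to-count pairs. The helper names, the toy documents, and the alphabetical tie-breaking are assumptions for illustration and may differ from Spark's behavior.

```python
from collections import Counter


def build_vocabulary(docs, vocab_size, min_df):
    """Pick the most frequent terms that appear in at least min_df documents."""
    term_freq = Counter()
    doc_freq = Counter()
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    eligible = [term for term in term_freq if doc_freq[term] >= min_df]
    # Most frequent first; ties broken alphabetically (an assumed rule).
    eligible.sort(key=lambda term: (-term_freq[term], term))
    return eligible[:vocab_size]


def to_sparse(tokens, vocabulary):
    """Encode a token list as {vocabulary index: count}."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[t] for t in tokens if t in index)
    return dict(sorted(counts.items()))


# Hypothetical tokenized documents.
toy_docs = [
    ["free", "software", "for", "internet"],
    ["software", "company"],
    ["internet", "software", "software"],
]

vocabulary = build_vocabulary(toy_docs, vocab_size=800, min_df=2)
vectors = [to_sparse(tokens, vocabulary) for tokens in toy_docs]
print(vocabulary)
print(vectors)
```

Terms that are filtered out simply do not appear in the vector, which is why a small vocabulary discards much of each article's wording.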

### **2. Fit/train an LSH Model**

In [37]:
from pyspark.ml.feature import MinHashLSH

mh = MinHashLSH(inputCol="features", outputCol="hashValues", numHashTables=3)
LSHmodel = mh.fit(vectorizedDF)

LSHmodel.transform(vectorizedDF).show()

+--------------------+------+---+------+---+------+--------------------+---+--------------------+--------------------+--------------------+
|                text|advert|coi|fanpov| pr|resume|                 url| id|               words|            features|          hashValues|
+--------------------+------+---+------+---+------+--------------------+---+--------------------+--------------------+--------------------+
|1 Litre no Namida...|     0|  0|     1|  0|     0|https://en.wikipe...|  0|[1, litre, no, na...|(800,[0,1,2,3,4,5...|[[1.9746211E7], [...|
|1DayLater was fre...|     1|  1|     0|  0|     0|https://en.wikipe...|  1|[1daylater, was, ...|(800,[0,1,2,3,4,5...|[[171183.0], [1.5...|
|1E is a privately...|     1|  0|     0|  0|     0|https://en.wikipe...|  2|[1e, is, a, priva...|(800,[0,1,2,3,4,5...|[[171183.0], [1.5...|
|1Malaysia pronoun...|     1|  0|     0|  0|     0|https://en.wikipe...|  3|[1malaysia, prono...|(800,[0,1,2,3,4,5...|[[171183.0], [927...|
|The Jerusalem Bie..
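
Each row in `hashValues` holds one hash per hash table (three here, because `numHashTables=3`). A common way to use several tables is to treat two rows as candidate neighbors when they agree in at least one table. The sketch below shows that bucketing idea in plain Python; the hash values and the helper name `candidate_pairs` are hypothetical, and Spark's internal procedure may differ in detail.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical hash values: one integer per hash table for each document.
hash_values = {
    "doc1": [11, 52, 93],
    "doc2": [11, 40, 77],
    "doc3": [25, 40, 93],
    "doc4": [30, 61, 88],
}


def candidate_pairs(hash_values):
    """Pair up documents that share a hash value in the same table."""
    buckets = defaultdict(list)
    for item, values in hash_values.items():
        for table, value in enumerate(values):
            buckets[(table, value)].append(item)
    pairs = set()
    for members in buckets.values():
        for left, right in combinations(sorted(members), 2):
            pairs.add((left, right))
    return pairs


print(sorted(candidate_pairs(hash_values)))
```

Only candidate pairs need an exact distance computation afterwards. Adding hash tables makes it less likely that a truly similar pair is missed, but increases the work per row.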

### **3. Searching for items similar to the key "software" "internet"**



In [39]:
print(cvModel.vocabulary.index("software"))
print(cvModel.vocabulary.index("internet"))

225
574


In [41]:
# Test a search for the key "software" "internet"

from pyspark.ml.linalg import Vectors

# Convert the two-word input into a sparse vector of size 800 (vocabSize)
# Each word's vocabulary index gets the value 1.0; every other position is 0.0
# Final result: key = [0, 0, ... , 1.0, 1.0, ...]

key = Vectors.sparse(vocabSize, {cvModel.vocabulary.index("software"):1.0, cvModel.vocabulary.index("internet"): 1.0})


In [43]:
# Define the number of neighbors
k = 40

# Search using the LSH model that we already trained
resultDF = LSHmodel.approxNearestNeighbors(vectorizedDF, key, k)
resultDF.show()

+--------------------+------+---+------+---+------+--------------------+----------+--------------------+--------------------+--------------------+------------------+
|                text|advert|coi|fanpov| pr|resume|                 url|        id|               words|            features|          hashValues|           distCol|
+--------------------+------+---+------+---+------+--------------------+----------+--------------------+--------------------+--------------------+------------------+
|Airavata is a sof...|     1|  0|     0|  0|     0|https://en.wikipe...|       420|[airavata, is, a,...|(800,[1,2,4,5,6,1...|[[1.69875943E8], ...|0.9333333333333333|
|Captain Sim is a ...|     1|  0|     0|  0|     0|https://en.wikipe...|      3315|[captain, sim, is...|(800,[0,1,3,5,6,7...|[[1.69875943E8], ...|0.9333333333333333|
|CaseWare Internat...|     1|  0|     0|  0|     0|https://en.wikipe...|      3432|[caseware, intern...|(800,[1,3,5,6,7,1...|[[1.69875943E8], ...|            0.9375|
|Uni
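
For a MinHash model, `distCol` is the Jaccard distance between the set of non-zero indices of the key and that of each row. The sketch below computes it by hand. The key indices 225 and 574 are the vocabulary positions printed earlier; the two document index sets are hypothetical and only show one way that distances such as 0.9375 or 0.9333 can arise.

```python
def jaccard_distance(a, b):
    """Return 1 - |a ∩ b| / |a ∪ b| for two non-empty sets."""
    union = a | b
    if not union:
        raise ValueError("Jaccard distance is undefined for two empty sets")
    return 1.0 - len(a & b) / len(union)


# Non-zero indices of the key: "software" (225) and "internet" (574).
key_indices = {225, 574}

# Hypothetical documents that contain both key words plus other vocabulary terms.
doc_with_32_terms = set(range(30)) | {225, 574}
doc_with_30_terms = set(range(28)) | {225, 574}

print(jaccard_distance(key_indices, doc_with_32_terms))
print(jaccard_distance(key_indices, doc_with_30_terms))
```

Because the key contains only two terms while each article contains many, even the best matches stay close to a distance of 1. The ranking among results is therefore more informative than the absolute values.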

In [44]:
# Save the result into CSV
import pandas as pd

data = resultDF.toPandas()
data.to_csv("result_promotional.csv")