In [2]:
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace,rand

In [3]:
def init_spark():
    spark = SparkSession.builder.appName("HelloWorld").getOrCreate()
    sc = spark.sparkContext
    return spark, sc

# Exercise 1

We can define a similarity measure for multisets called SMS:

The divisor should be the union of the sets and elements which apear multiple times should appear as often as their maximum number of appearances in one of the sets. We denote this by $\cup '$

The divident should be the intersection of the two sets, but elements which appear multiple times in both sets should be present times the smaller number of appearances in either. We denote this operator by $\cap '$

Looking at an example - S1={A,A,B,B,B}, S2={A,A,B,C}:
$S1 \cap' S2 ={A,A,B}$
$S1 \cup' S2 ={A,A,B,B,B,C}$

Putting it together we have:
$SMS(S1,S2)=\frac{|S1 \cap' S2 |}{|S1 \cup' S2 |}=\frac{3}{6}$

Testing for regular sets, in which case SMS should be equal to Jaccard:
S1={1,2,4}, S2={2,4,7,8}
$Jaccard(S1,S2) = \frac{2}{5}$
$SMS(S1,S2)=\frac{|S1 \cap' S2 |}{|S1 \cup' S2 |}=\frac{|{1,2}|}{|{1,2,4,7,8}|}= \frac{2}{5}$

# Exercise 2

First all the datasets are loaded. We converted the grundgesetzt to a txt-file. As well as created 9 other txt-files to test
our process on (Movie-scripts of:Harry Potter1-7 + Shrek1-2).

In [4]:
spark, sc = init_spark()

dataframes= []

dataframes.append(spark.read.text("data_sheet6/grundgesetz.txt",wholetext=True))
dataframes.append(spark.read.text("data_sheet6/hp1.txt",wholetext=True))
dataframes.append(spark.read.text("data_sheet6/hp2.txt",wholetext=True))
dataframes.append(spark.read.text("data_sheet6/hp3.txt",wholetext=True))
dataframes.append(spark.read.text("data_sheet6/hp4.txt",wholetext=True))
dataframes.append(spark.read.text("data_sheet6/hp5.txt",wholetext=True))
dataframes.append(spark.read.text("data_sheet6/hp6.txt",wholetext=True))
dataframes.append(spark.read.text("data_sheet6/hp7.txt",wholetext=True))
dataframes.append(spark.read.text("data_sheet6/shrek.txt",wholetext=True))
dataframes.append(spark.read.text("data_sheet6/shrek2.txt",wholetext=True))

22/12/04 20:38:03 WARN Utils: Your hostname, LT1081 resolves to a loopback address: 127.0.1.1; using 172.24.119.217 instead (on interface eth0)
22/12/04 20:38:03 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/12/04 20:38:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


 Now we need to do some processing on the text. We will replace the newline and tab characters with whitespaces.

In [8]:
for dataFrame in dataframes:
    # firstly we remove word seperations indicated by -\n
    dataFrame = dataFrame.withColumn("value",regexp_replace("value","-\n",""))
    #secondly we remove tabs and newlines
    dataFrame = dataFrame.withColumn("value",regexp_replace("value","\n"," "))
    dataFrame = dataFrame.withColumn("value",regexp_replace("value","\t",""))
    #lastly multiple whitespaces are removed and collapsed to a single one
    dataFrame = dataFrame.withColumn("value",regexp_replace("value","\\s{2,}"," "))

After cleaning up we can start to create the shingles. For that we first define a shingling function.

In [6]:
def shingling_k(text,k):
    tokens = list(text)
    shingle = [tokens[i:i+k] for i in range(len(tokens) - k + 1)]
    unique_shingles = []

    for shingleList in shingle:
        shingleText = "".join(str(i) for i in shingleList)
        unique_shingles.append(shingleText)
    return set(unique_shingles)

Now we can execute the singling functions for the different sets

In [7]:
name=['Grundgesetz','HP1','HP2','HP3','HP4','HP5','HP6','HP7','Shrek','Shrek2']
i=0
for dataFrame in dataframes:
    print(type(dataFrame))
    set_of_shingles_5 = dataFrame.rdd.map(lambda row: shingling_k((row[0]),5))
    set_of_shingles_9 = dataFrame.rdd.map(lambda row: shingling_k((row[0]),9))

    set_5 = set_of_shingles_5.take(1)
    set_9 = set_of_shingles_9.take(1)

    size_of_set_5 = len(set_5[0])
    size_of_set_9 = len(set_9[0])

    print("Created shingles for set: "+name[i])
    print("Amount of different 5 shingles: "+str(size_of_set_5))
    print("Amount of different 9 shingles: "+str(size_of_set_9))
    i += 1

<class 'pyspark.sql.dataframe.DataFrame'>


                                                                                

Created shingles for set: Grundgesetz
Amount of different 5 shingles: 26468
Amount of different 9 shingles: 83065
<class 'pyspark.sql.dataframe.DataFrame'>


                                                                                

Created shingles for set: HP1
Amount of different 5 shingles: 75402
Amount of different 9 shingles: 298854
<class 'pyspark.sql.dataframe.DataFrame'>


                                                                                

Created shingles for set: HP2
Amount of different 5 shingles: 78668
Amount of different 9 shingles: 333446
<class 'pyspark.sql.dataframe.DataFrame'>


                                                                                

Created shingles for set: HP3
Amount of different 5 shingles: 87226
Amount of different 9 shingles: 397422
<class 'pyspark.sql.dataframe.DataFrame'>


                                                                                

Created shingles for set: HP4
Amount of different 5 shingles: 112516
Amount of different 9 shingles: 629863
<class 'pyspark.sql.dataframe.DataFrame'>


                                                                                

Created shingles for set: HP5
Amount of different 5 shingles: 132257
Amount of different 9 shingles: 800445
<class 'pyspark.sql.dataframe.DataFrame'>


                                                                                

Created shingles for set: HP6
Amount of different 5 shingles: 108148
Amount of different 9 shingles: 581263
<class 'pyspark.sql.dataframe.DataFrame'>


                                                                                

Created shingles for set: HP7
Amount of different 5 shingles: 116934
Amount of different 9 shingles: 660338
<class 'pyspark.sql.dataframe.DataFrame'>
Created shingles for set: Shrek
Amount of different 5 shingles: 20400
Amount of different 9 shingles: 44413
<class 'pyspark.sql.dataframe.DataFrame'>
Created shingles for set: Shrek2
Amount of different 5 shingles: 19669
Amount of different 9 shingles: 40016


# Exercise 3

### a)

$$\begin{pmatrix}
0 & 1 & 0 & 1\\
0 & 1 & 0 & 0\\
1 & 0 & 0 & 1\\
0 & 0 & 1 & 0\\
0 & 0 & 1 & 1\\
1 & 0 & 0 & 0\\
\end{pmatrix}$$

#### Minhash for h_1(x)

$$\begin{pmatrix}
5 & 1 & 3 & 1
\end{pmatrix}$$

#### Minhash for h_2(x)

$$\begin{pmatrix}
2 & 2 & 2 & 2
\end{pmatrix}$$

#### Minhash for h_3(x)

$$\begin{pmatrix}
0 & 1 & 4 & 0
\end{pmatrix}$$


### b)

None of those functions provides a true permutation?

For h_1(x): S_2 and S_4 collide

For h_2(x): all S collide

For h_3(x): S_1 and S_4 collide

### c)

S_1 and S_2: Jaccard=0/4=0  Hashsim=1/3

S_1 and S_3: Jaccard=0/4=0  Hashsim=1/3

S_1 and S_4: Jaccard=1/4    Hashsim=2/3

S_2 and S_3: Jaccard=0/4=0  Hashsim=1/3

S_2 and S_4: Jaccard=1/4    Hashsim=2/3

S_3 and S_4: Jaccard=1/4    Hashsim=1/3

# Exercise 4

If the Jaccard-similarity is 0 than, S1 and S2 do not share any elements. Since the minhash returns the element with the samllest hash value, they can never match up if S1 and S2 do not contain any shared items.

# Exercise 5

### a) $Jaccard_{S1,S2}=\frac{1}{4}$

### b)
Whenever column d is the first for all 120 permutations the two colums hash to the same value. The probability that this is true for S1 is 1 and the probability that this happens is true for S2 in $\frac{1}{4}$ of the cases. Hence out of all 120 permutations it is true for $120*\frac{1}{4}=30$ of them.

# Exercise 6

The provider of some service with login needs a way to store passwords. However, it is never a good idea to store password in plain text, as else a intruder can just read out all clear text passwords. Therefore, it is common practice to hash the
password a user selected, which will make an intrusion less harmfull. But if the hashing technique used is a quite common one
a set of hashes can still allow you to extract the passwords. To do so one can find identical hashes and work backwords from there. Secondly one could use a so-called rainbow table, which includes a very high number of precomputed hashes for some set of passwords and just find matching ones. This way one does not have to compute all the hashes especially if the hash-technique is quite common. Lastly to truely secure a hash a provider can add a salt, which is some random long string of characters, which is added to each password before hashing. This makes rainbow tables useless and forces the hacker to truely brut-force his way through the hash.