# Frequent Pattern Mining with PySpark and MLlib
* Notebook by Adam Lang
* Date: 1/9/2024

# Overview
* In this notebook we will implement two of the frequent pattern mining algorithms from PySpark's MLlib.

1. FP-growth
2. PrefixSpan

* It is important to remember that PrefixSpan mines sequential patterns, so the order of the itemsets matters, while FP-growth treats each transaction as an unordered set of items.

# Create Spark Session

In [1]:
## Create spark session
import pyspark
from pyspark.sql import SparkSession

## init session
spark = SparkSession.builder.appName("FPMining").getOrCreate()
spark

# Dataset Info
* This data was collected (2016-2018) through an interactive on-line personality test.
* The personality test was constructed with the "Big-Five Factor Markers" from the IPIP. https://ipip.ori.org/newBigFive5broadKey.htm
Participants were informed that their responses would be recorded and used for research at the beginning of the test, and asked to confirm their consent at the end of the test.

The following items were presented on one page and each was rated on a five-point scale using radio buttons. The order on the page was EXT1, AGR1, CSN1, EST1, OPN1, EXT2, etc.
The scale was labeled 1=Disagree, 3=Neutral, 5=Agree

 - EXT1	I am the life of the party.
 - EXT2	I don't talk a lot.
 - EXT3	I feel comfortable around people.
 - EXT4	I keep in the background.
 - EXT5	I start conversations.
 - EXT6	I have little to say.
 - EXT7	I talk to a lot of different people at parties.
 - EXT8	I don't like to draw attention to myself.
 - EXT9	I don't mind being the center of attention.
 - EXT10	I am quiet around strangers.
 - EST1	I get stressed out easily.
 - EST2	I am relaxed most of the time.
 - EST3	I worry about things.
 - EST4	I seldom feel blue.
 - EST5	I am easily disturbed.
 - EST6	I get upset easily.
 - EST7	I change my mood a lot.
 - EST8	I have frequent mood swings.
 - EST9	I get irritated easily.
 - EST10	I often feel blue.
 - AGR1	I feel little concern for others.
 - AGR2	I am interested in people.
 - AGR3	I insult people.
 - AGR4	I sympathize with others' feelings.
 - AGR5	I am not interested in other people's problems.
 - AGR6	I have a soft heart.
 - AGR7	I am not really interested in others.
 - AGR8	I take time out for others.
 - AGR9	I feel others' emotions.
 - AGR10	I make people feel at ease.
 - CSN1	I am always prepared.
 - CSN2	I leave my belongings around.
 - CSN3	I pay attention to details.
 - CSN4	I make a mess of things.
 - CSN5	I get chores done right away.
 - CSN6	I often forget to put things back in their proper place.
 - CSN7	I like order.
 - CSN8	I shirk my duties.
 - CSN9	I follow a schedule.
 - CSN10	I am exacting in my work.
 - OPN1	I have a rich vocabulary.
 - OPN2	I have difficulty understanding abstract ideas.
 - OPN3	I have a vivid imagination.
 - OPN4	I am not interested in abstract ideas.
 - OPN5	I have excellent ideas.
 - OPN6	I do not have a good imagination.
 - OPN7	I am quick to understand things.
 - OPN8	I use difficult words.
 - OPN9	I spend time reflecting on things.
 - OPN10	I am full of ideas.

* The time spent on each question is also recorded in milliseconds. These are the variables ending in _E. This was calculated by taking the time when the button for the question was clicked minus the time of the most recent other button click.

1. **dateload** - The timestamp when the survey was started.
2. **screenw** - The width of the user's screen in pixels
3. **screenh** - The height of the user's screen in pixels
4. **introelapse** - The time in seconds spent on the landing / intro page
5. **testelapse** - The time in seconds spent on the page with the survey questions
6. **endelapse** - The time in seconds spent on the finalization page (where the user was asked to indicate if they had answered accurately and their answers could be stored and used for research). Again: this dataset only includes users who answered "Yes" to this question; users were free to answer no and could still view their results either way.

7. **IPC** - The number of records from the user's IP address in the dataset. For max cleanliness, only use records where this value is 1. High values can be because of shared networks (e.g. entire universities) or multiple submissions

8. **country** - The country, determined by technical information (NOT ASKED AS A QUESTION)

9. **lat_appx_lots_of_err** - The approximate latitude of the user, determined by technical information. THIS IS NOT VERY ACCURATE. Read the article "How an internet mapping glitch turned a random Kansas farm into a digital hell" https://splinternews.com/how-an-internet-mapping-glitch-turned-a-random-kansas-f-1793856052 to learn about the perils of relying on this information.

10. **long_appx_lots_of_err** - The approximate longitude of the user.


**Source:** https://www.kaggle.com/tunguz/big-five-personality-test#data-final.csv

# Upload Dataset

In [2]:
## set path
path = '/content/drive/MyDrive/Colab Notebooks/PySpark Data Science/'
## the file is tab-delimited, with column names in the first row
df = spark.read.option("delimiter", "\t").csv(
    path + 'data-final.csv',
    inferSchema=True,
    header=True)

In [3]:
## view df
df.limit(5).toPandas()

Unnamed: 0,EXT1,EXT2,EXT3,EXT4,EXT5,EXT6,EXT7,EXT8,EXT9,EXT10,...,dateload,screenw,screenh,introelapse,testelapse,endelapse,IPC,country,lat_appx_lots_of_err,long_appx_lots_of_err
0,4,1,5,2,5,1,5,2,4,1,...,2016-03-03 02:01:01,768,1024,9,234,6,1,GB,51.5448,0.1991
1,3,5,3,4,3,3,2,5,1,5,...,2016-03-03 02:01:20,1360,768,12,179,11,1,MY,3.1698,101.706
2,2,3,4,4,3,2,1,3,2,5,...,2016-03-03 02:01:56,1366,768,3,186,7,1,GB,54.9119,-1.3833
3,2,2,2,3,4,2,2,4,1,4,...,2016-03-03 02:02:02,1920,1200,186,219,7,1,GB,51.75,-1.25
4,3,3,3,3,5,3,3,5,3,4,...,2016-03-03 02:02:57,1366,768,8,315,17,2,KE,1.0,38.0


In [4]:
## schema
df.printSchema()

root
 |-- EXT1: string (nullable = true)
 |-- EXT2: string (nullable = true)
 |-- EXT3: string (nullable = true)
 |-- EXT4: string (nullable = true)
 |-- EXT5: string (nullable = true)
 |-- EXT6: string (nullable = true)
 |-- EXT7: string (nullable = true)
 |-- EXT8: string (nullable = true)
 |-- EXT9: string (nullable = true)
 |-- EXT10: string (nullable = true)
 |-- EST1: string (nullable = true)
 |-- EST2: string (nullable = true)
 |-- EST3: string (nullable = true)
 |-- EST4: string (nullable = true)
 |-- EST5: string (nullable = true)
 |-- EST6: string (nullable = true)
 |-- EST7: string (nullable = true)
 |-- EST8: string (nullable = true)
 |-- EST9: string (nullable = true)
 |-- EST10: string (nullable = true)
 |-- AGR1: string (nullable = true)
 |-- AGR2: string (nullable = true)
 |-- AGR3: string (nullable = true)
 |-- AGR4: string (nullable = true)
 |-- AGR5: string (nullable = true)
 |-- AGR6: string (nullable = true)
 |-- AGR7: string (nullable = true)
 |-- AGR8: string (nu

In [5]:
## num of rows in dataset
df.count()

1015341

Summary
* There are a little over 1 million rows in the dataset.

# FP-Growth Model
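* Each input array (one transaction) can't contain any duplicate values --> this is important to remember!
* In the data, the raw response columns repeat the same values within a row (for example, several answers of '4'), which will yield errors.
  * To handle this we derive one distinct label per trait with a SQL `CASE WHEN` expression passed to `expr`.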

In [10]:
## handle duplicates: derive one label per trait so each items array holds unique values
from pyspark.sql.functions import array, array_contains, col, desc, expr, size

## personality types sub df (the first matching WHEN branch wins)
p_types = df.withColumn("vert", expr("""
    CASE WHEN EXT1 IN ('4','5') OR EXT3 IN ('4','5') OR EXT7 IN ('4','5') OR EXT9 IN ('4','5') THEN 'extrovert'
         WHEN EXT1 IN ('1','2') OR EXT3 IN ('1','2') OR EXT7 IN ('1','2') OR EXT9 IN ('1','2') THEN 'introvert'
         ELSE 'neutrovert' END"""))
p_types = p_types.withColumn("mood", expr("""
    CASE WHEN EST2 IN ('4','5') THEN 'chill'
         WHEN EST2 IN ('1','2') THEN 'highstrung'
         ELSE 'neutral' END"""))

## CREATE ARRAYS
p_types = p_types.select(array('mood','vert').alias("items"))
p_types.limit(4).toPandas()



Unnamed: 0,items
0,"[chill, extrovert]"
1,"[neutral, introvert]"
2,"[chill, extrovert]"
3,"[neutral, introvert]"
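To make the effect of the two `CASE WHEN` expressions concrete, here is a minimal plain-Python sketch that mirrors their branch order. The helper names `vert_label` and `mood_label` are hypothetical and not part of the notebook. The EXT answers are taken from rows 1 and 4 of the dataset preview; the EST2 answers are assumptions, chosen to be consistent with the labels shown, because EST2 is hidden in the truncated preview.

```python
def vert_label(ext1, ext3, ext7, ext9):
    """Mirror the 'vert' CASE WHEN: the first matching branch wins."""
    answers = (ext1, ext3, ext7, ext9)
    if any(a in ("4", "5") for a in answers):
        return "extrovert"
    if any(a in ("1", "2") for a in answers):
        return "introvert"
    return "neutrovert"


def mood_label(est2):
    """Mirror the 'mood' CASE WHEN on the single EST2 answer."""
    if est2 in ("4", "5"):
        return "chill"
    if est2 in ("1", "2"):
        return "highstrung"
    return "neutral"


# EXT1, EXT3, EXT7, EXT9 for preview rows 1 and 4; the EST2 values are assumed.
row1 = [mood_label("3"), vert_label("3", "3", "2", "1")]
row4 = [mood_label("4"), vert_label("3", "3", "3", "3")]

print(row1)
print(row4)

# The two label vocabularies do not overlap, so an items array never repeats a value.
assert len(set(row1)) == len(row1) and len(set(row4)) == len(row4)
```

Because the extrovert branch is tested first, a single answer of 4 or 5 on any of the four items is enough to label a row `extrovert`, even when the other three answers are low.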


## Fit FP Model

In [11]:
from pyspark.ml.fpm import FPGrowth

## fit model
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.3, minConfidence=0.1)
model = fpGrowth.fit(p_types)

In [12]:
## view item popularity frequency
itempopularity = model.freqItemsets
itempopularity.createOrReplaceTempView("itempopularity")

spark.sql("SELECT * FROM itempopularity ORDER BY freq desc").limit(200).toPandas()

Unnamed: 0,items,freq
0,[extrovert],672037
1,[chill],435011
2,"[chill, extrovert]",325415
3,[introvert],325318
4,[highstrung],306443
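As a quick check on the `minSupport=0.3` threshold, the sketch below converts the counts above into support values. It assumes that all 1,015,341 rows reported by `df.count()` were passed to the model; the variable names are illustrative only.

```python
n_rows = 1015341       # from df.count()
min_support = 0.3      # value passed to FPGrowth

# Frequencies copied from the freqItemsets output above.
freq_itemsets = {
    ("extrovert",): 672037,
    ("chill",): 435011,
    ("chill", "extrovert"): 325415,
    ("introvert",): 325318,
    ("highstrung",): 306443,
}

# An itemset is frequent when freq / n_rows >= minSupport.
min_count = min_support * n_rows
supports = {items: freq / n_rows for items, freq in freq_itemsets.items()}

print("minimum count:", min_count)
for items, support in supports.items():
    print(items, round(support, 4), support >= min_support)
```

Labels that occur in the data but are missing from the table, such as `neutral` and `neutrovert`, did not reach that minimum count.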


### Association Rules Analysis
* Lift:
  * Calculation: Lift is calculated by dividing the confidence of a rule by the support of the consequent item.
  * Interpretation:
      * Lift > 1: Indicates a positive association, meaning the items are more likely to occur together than expected by chance.
      * Lift = 1: Indicates no association, meaning the items are independent.
      * Lift < 1: Indicates a negative association, meaning the items are less likely to occur together than expected.

In [13]:
## review assoc rules
assoc = model.associationRules
assoc.createOrReplaceTempView("assoc")

## query in desc order
spark.sql("SELECT * FROM assoc ORDER BY confidence desc").limit(20).toPandas()

Unnamed: 0,antecedent,consequent,confidence,lift,support
0,[chill],[extrovert],0.748062,1.130202,0.320498
1,[extrovert],[chill],0.484222,1.130202,0.320498
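The confidence, lift and support columns can be reproduced by hand from the item frequencies reported earlier. The sketch below is plain arithmetic with illustrative variable names, again assuming all 1,015,341 rows went into the model.

```python
n_rows = 1015341           # from df.count()
freq_chill = 435011        # freq of [chill]
freq_extrovert = 672037    # freq of [extrovert]
freq_both = 325415         # freq of [chill, extrovert]

# support(rule) = share of rows that contain antecedent and consequent together
support = freq_both / n_rows

# confidence(A -> B) = freq(A and B) / freq(A)
conf_chill_to_extrovert = freq_both / freq_chill
conf_extrovert_to_chill = freq_both / freq_extrovert

# lift(A -> B) = confidence(A -> B) / support(B)
lift_chill_to_extrovert = conf_chill_to_extrovert / (freq_extrovert / n_rows)
lift_extrovert_to_chill = conf_extrovert_to_chill / (freq_chill / n_rows)

print(round(support, 6))
print(round(conf_chill_to_extrovert, 6), round(lift_chill_to_extrovert, 6))
print(round(conf_extrovert_to_chill, 6), round(lift_extrovert_to_chill, 6))
```

Lift is symmetric in the two items, which is why both rules share the same value even though their confidences differ. A lift of about 1.13 points to a mild positive association between `chill` and `extrovert`.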


In [14]:
## predict outcome
predict = model.transform(p_types)
predict.limit(15).toPandas()

Unnamed: 0,items,prediction
0,"[chill, extrovert]",[]
1,"[neutral, introvert]",[]
2,"[chill, extrovert]",[]
3,"[neutral, introvert]",[]
4,"[chill, neutrovert]",[extrovert]
5,"[chill, extrovert]",[]
6,"[chill, extrovert]",[]
7,"[chill, extrovert]",[]
8,"[chill, extrovert]",[]
9,"[neutral, extrovert]",[chill]
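The `prediction` column can be read as follows: for each row, every rule whose antecedent is fully contained in `items` is applied, and its consequent is added unless the row already holds it. Below is a minimal plain-Python sketch of that logic using the two rules found above. The function name `predict` is hypothetical, and this is an illustration rather than Spark's implementation.

```python
# (antecedent, consequent) pairs copied from the association rules above.
rules = [
    ({"chill"}, ["extrovert"]),
    ({"extrovert"}, ["chill"]),
]


def predict(items, rules):
    """Collect consequents of all applicable rules that are not already in `items`."""
    basket = set(items)
    prediction = []
    for antecedent, consequent in rules:
        if antecedent <= basket:  # rule applies: every antecedent item is present
            for item in consequent:
                if item not in basket and item not in prediction:
                    prediction.append(item)
    return prediction


for items in (["chill", "extrovert"], ["neutral", "introvert"],
              ["chill", "neutrovert"], ["neutral", "extrovert"]):
    print(items, "->", predict(items, rules))
```

This explains the empty predictions: `[chill, extrovert]` already contains both consequents, and `[neutral, introvert]` matches no antecedent.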


# PrefixSpan Pattern Mining
* Unlike the FP-growth algorithm above, PrefixSpan mines sequential patterns: the order of the itemsets within each sequence matters.

In [19]:
## let's create an array of arrays --> PrefixSpan reads its input from a column named "sequence" by default
df_array = df.select(array(
    array('EXT1', 'EXT2'), array('EXT3', 'EXT4'), array('EXT5', 'EXT6'),
    array('EXT7', 'EXT8'), array('EXT9', 'EXT10')).alias("sequence"))
df_array.show(truncate=False)

+----------------------------------------+
|sequence                                |
+----------------------------------------+
|[[4, 1], [5, 2], [5, 1], [5, 2], [4, 1]]|
|[[3, 5], [3, 4], [3, 3], [2, 5], [1, 5]]|
|[[2, 3], [4, 4], [3, 2], [1, 3], [2, 5]]|
|[[2, 2], [2, 3], [4, 2], [2, 4], [1, 4]]|
|[[3, 3], [3, 3], [5, 3], [3, 5], [3, 4]]|
|[[3, 3], [4, 2], [4, 2], [2, 3], [3, 4]]|
|[[4, 3], [4, 3], [3, 3], [5, 3], [4, 3]]|
|[[3, 1], [5, 2], [5, 2], [5, 2], [3, 2]]|
|[[2, 2], [3, 3], [4, 2], [2, 2], [4, 4]]|
|[[1, 5], [3, 5], [2, 3], [2, 4], [5, 4]]|
|[[3, 3], [2, 3], [3, 2], [4, 3], [3, 5]]|
|[[3, 1], [5, 3], [5, 1], [5, 5], [5, 3]]|
|[[4, 1], [5, 4], [5, 1], [4, 1], [5, 2]]|
|[[1, 5], [1, 5], [1, 5], [1, 5], [1, 5]]|
|[[1, 5], [2, 5], [1, 4], [1, 2], [2, 5]]|
|[[2, 1], [3, 4], [4, 3], [5, 3], [3, 5]]|
|[[1, 4], [2, 4], [2, 3], [2, 4], [2, 4]]|
|[[4, 1], [5, 2], [4, 2], [3, 2], [4, 2]]|
|[[4, 2], [5, 3], [4, 4], [5, 2], [5, 2]]|
|[[5, 1], [5, 2], [5, 1], [5, 3], [5, 4]]|
+----------
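For reference, the nested structure can be reproduced in plain Python. The sketch below pairs the ten EXT answers of the first preview row two at a time, which is what the nested `array(...)` calls do for each row. `to_sequence` is a hypothetical helper, not part of the notebook.

```python
def to_sequence(answers, width=2):
    """Group a flat list of answers into consecutive itemsets of `width` items."""
    return [answers[i:i + width] for i in range(0, len(answers), width)]


# EXT1..EXT10 for the first row of the dataset preview.
ext_row0 = [4, 1, 5, 2, 5, 1, 5, 2, 4, 1]
sequence = to_sequence(ext_row0)
print(sequence)
```

Each inner pair is one itemset and the outer list fixes their order, so "earlier" and "later" in the mined patterns refer to question order (EXT1 through EXT10), not to time.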

In [20]:
from pyspark.ml.fpm import PrefixSpan


In [21]:
## create prefixspan
prefixSpan = PrefixSpan(minSupport=0.3, maxPatternLength=10)
sequence = prefixSpan.findFrequentSequentialPatterns(df_array)
sequence.show(10)

+---------------+------+
|       sequence|  freq|
+---------------+------+
|          [[1]]|688392|
|          [[3]]|811456|
|          [[5]]|714614|
|          [[2]]|857365|
|          [[4]]|869495|
|       [[2, 5]]|324431|
|       [[2, 3]]|369402|
|     [[2], [4]]|640768|
|  [[2], [4, 2]]|341779|
|[[2], [4], [4]]|378550|
+---------------+------+
only showing top 10 rows
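To read this output: `[[2, 5]]` is a single itemset, meaning 2 and 5 occur together in the same pair, while `[[2], [4]]` means a pair containing 2 is followed later by a pair containing 4. The sketch below counts both kinds of pattern over the first five sequences shown earlier. `contains_pattern` is a hypothetical helper that illustrates the matching rule; it is not Spark's implementation.

```python
def contains_pattern(sequence, pattern):
    """Return True if `pattern` occurs in `sequence` as an ordered sub-sequence.

    Each pattern itemset must be a subset of some sequence itemset, and the
    matched sequence itemsets must appear in the same order as in the pattern.
    """
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and set(pattern[pos]) <= set(itemset):
            pos += 1
    return pos == len(pattern)


# First five rows of df_array.show(), copied from the output above.
rows = [
    [[4, 1], [5, 2], [5, 1], [5, 2], [4, 1]],
    [[3, 5], [3, 4], [3, 3], [2, 5], [1, 5]],
    [[2, 3], [4, 4], [3, 2], [1, 3], [2, 5]],
    [[2, 2], [2, 3], [4, 2], [2, 4], [1, 4]],
    [[3, 3], [3, 3], [5, 3], [3, 5], [3, 4]],
]

count_ordered = sum(contains_pattern(r, [[2], [4]]) for r in rows)
count_together = sum(contains_pattern(r, [[2, 5]]) for r in rows)
print(count_ordered, count_together)
```

Order matters for the first kind of pattern: the second row contains `[[4], [2]]` but not `[[2], [4]]`, because its only 2 appears after its only 4.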



In [22]:
## find specific sequences
sequence.where(size(col("sequence"))>1).show()

+---------------+------+
|       sequence|  freq|
+---------------+------+
|     [[2], [4]]|640768|
|  [[2], [4, 2]]|341779|
|[[2], [4], [4]]|378550|
|[[2], [4], [2]]|354996|
|     [[2], [2]]|619741|
|[[2], [2], [4]]|371093|
|[[2], [2], [2]]|340745|
|     [[2], [3]]|535357|
|[[2], [3], [4]]|312056|
|     [[2], [5]]|491046|
|     [[2], [1]]|399955|
|     [[1], [4]]|451389|
|     [[1], [2]]|451373|
|     [[1], [3]]|386571|
|     [[1], [5]]|493284|
|     [[1], [1]]|435008|
|  [[4, 2], [4]]|356858|
|  [[4, 2], [2]]|334139|
|     [[4], [4]]|657945|
|  [[4], [4, 2]]|348053|
+---------------+------+
only showing top 20 rows



Summary
* On their own, the sequences above aren't telling us much, so we need to drill deeper into the individual sequences.

In [23]:
from pyspark.sql.functions import expr, round

## new df
## go over each array and find the size of each array within arrays
filtered = sequence.withColumn('size', expr('transform(sequence, x -> size(x))'))
filtered.show()

+---------------+------+---------+
|       sequence|  freq|     size|
+---------------+------+---------+
|          [[1]]|688392|      [1]|
|          [[3]]|811456|      [1]|
|          [[5]]|714614|      [1]|
|          [[2]]|857365|      [1]|
|          [[4]]|869495|      [1]|
|       [[2, 5]]|324431|      [2]|
|       [[2, 3]]|369402|      [2]|
|     [[2], [4]]|640768|   [1, 1]|
|  [[2], [4, 2]]|341779|   [1, 2]|
|[[2], [4], [4]]|378550|[1, 1, 1]|
|[[2], [4], [2]]|354996|[1, 1, 1]|
|     [[2], [2]]|619741|   [1, 1]|
|[[2], [2], [4]]|371093|[1, 1, 1]|
|[[2], [2], [2]]|340745|[1, 1, 1]|
|     [[2], [3]]|535357|   [1, 1]|
|[[2], [3], [4]]|312056|[1, 1, 1]|
|     [[2], [5]]|491046|   [1, 1]|
|     [[2], [1]]|399955|   [1, 1]|
|     [[1], [4]]|451389|   [1, 1]|
|     [[1], [2]]|451373|   [1, 1]|
+---------------+------+---------+
only showing top 20 rows



In [28]:
## row counts by percentages
row_cnt = df_array.count()
filtered = filtered.withColumn('percentage', round((col("freq")/row_cnt)*100,2))

## order by desc
filtered.orderBy(desc("percentage")).show()

+----------+------+------+----------+
|  sequence|  freq|  size|percentage|
+----------+------+------+----------+
|     [[4]]|869495|   [1]|     85.64|
|     [[2]]|857365|   [1]|     84.44|
|     [[3]]|811456|   [1]|     79.92|
|     [[5]]|714614|   [1]|     70.38|
|     [[1]]|688392|   [1]|      67.8|
|[[4], [4]]|657945|[1, 1]|      64.8|
|[[2], [4]]|640768|[1, 1]|     63.11|
|[[4], [2]]|631292|[1, 1]|     62.18|
|[[2], [2]]|619741|[1, 1]|     61.04|
|[[3], [4]]|609953|[1, 1]|     60.07|
|[[3], [3]]|586425|[1, 1]|     57.76|
|[[3], [2]]|578087|[1, 1]|     56.94|
|[[4], [3]]|554520|[1, 1]|     54.61|
|[[2], [3]]|535357|[1, 1]|     52.73|
|  [[4, 2]]|500865|   [2]|     49.33|
|[[4], [5]]|493853|[1, 1]|     48.64|
|[[1], [5]]|493284|[1, 1]|     48.58|
|[[2], [5]]|491046|[1, 1]|     48.36|
|[[3], [5]]|459342|[1, 1]|     45.24|
|[[5], [5]]|453373|[1, 1]|     44.65|
+----------+------+------+----------+
only showing top 20 rows



Summary
* This is great, but the top of the list is dominated by itemsets of size 1, which isn't too helpful.
* Let's filter further.

In [29]:
## keep only patterns that contain at least one itemset of size 2
filtered = filtered.where(array_contains(filtered.size,2))
filtered.orderBy(desc("percentage")).show()

+-------------+------+------+----------+
|     sequence|  freq|  size|percentage|
+-------------+------+------+----------+
|     [[4, 2]]|500865|   [2]|     49.33|
|     [[4, 3]]|407520|   [2]|     40.14|
|     [[2, 3]]|369402|   [2]|     36.38|
|[[4, 2], [4]]|356858|[2, 1]|     35.15|
|     [[5, 1]]|356214|   [2]|     35.08|
|[[4], [4, 2]]|348053|[1, 2]|     34.28|
|[[2], [4, 2]]|341779|[1, 2]|     33.66|
|[[4, 2], [2]]|334139|[2, 1]|     32.91|
|     [[4, 1]]|326818|   [2]|     32.19|
|     [[2, 5]]|324431|   [2]|     31.95|
|[[3], [4, 2]]|306263|[1, 2]|     30.16|
+-------------+------+------+----------+
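The two Spark expressions used here, `transform(sequence, x -> size(x))` and `array_contains(size, 2)`, have simple plain-Python counterparts. The sketch below applies them to a handful of patterns taken from the outputs above; it is only meant to show what the filter keeps and what it drops.

```python
# A few patterns copied from the PrefixSpan outputs above.
patterns = [
    [[4]],
    [[2], [4]],
    [[4, 2]],
    [[4, 2], [4]],
    [[2], [4], [4]],
]

# Counterpart of transform(sequence, x -> size(x)): one size per itemset.
sizes = [[len(itemset) for itemset in pattern] for pattern in patterns]

# Counterpart of array_contains(size, 2): keep patterns with a two-item itemset.
kept = [pattern for pattern, s in zip(patterns, sizes) if 2 in s]

print(sizes)
print(kept)
```

Note that the filter selects on itemset size, not on the number of itemsets: `[[2], [4], [4]]` has three steps but is dropped because none of its itemsets holds two items. Filtering on the number of steps is what `size(col("sequence")) > 1` did in cell `In [22]`.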



Summary
* Among the patterns that contain a two-item itemset, `[[4, 2]]` is the most frequent: it appears in 49.33% of the sequences.