<a href="https://colab.research.google.com/github/UdayLab/PAMI/blob/main/notebooks/CoMine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finding correlated patterns in a transactional database using CoMine

This tutorial has three parts. In the first part, we describe the basic approach to finding correlated patterns in a transactional database using the CoMine algorithm. In the second part, we evaluate the performance of the CoMine algorithm on a dataset at different *minimum support* values. In the final part, we evaluate the performance of the CoMine algorithm on the same dataset at different *minimum all-confidence* threshold values.

***

## Prerequisites:

1. Installing the PAMI library

In [None]:
!pip install pami #install the pami repository

2. Downloading a sample dataset

In [None]:
!wget -nc https://u-aizu.ac.jp/~udayrage/datasets/transactionalDatabases/Transactional_T10I4D100K.csv #download a sample transactional database

3. Printing the first few lines of the dataset to learn its format

In [None]:
!head -2 Transactional_T10I4D100K.csv

_format:_ every row contains items separated by a separator (a tab in this dataset).

__Example:__

item1 item2 item3 item4

item1 item4 item6
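
To make the format concrete, here is a minimal sketch that reads such rows into lists of items. The helper name `read_transactions` is hypothetical and not part of PAMI; the sketch assumes a tab separator, as used for this dataset in the rest of the tutorial.

```python
import io

def read_transactions(handle, sep="\t"):
    """Yield each non-empty line of `handle` as a list of item strings."""
    for line in handle:
        # split the row on the separator and drop empty fragments
        items = [item.strip() for item in line.strip().split(sep)]
        items = [item for item in items if item]
        if items:
            yield items

# In-memory stand-in for the file, using the two example rows above
sample = io.StringIO("item1\titem2\titem3\titem4\nitem1\titem4\titem6\n")
transactions = list(read_transactions(sample))
print(transactions)
```

To read the downloaded file instead, pass `open('Transactional_T10I4D100K.csv')` as the handle.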

***




## Part 1: Finding correlated patterns using CoMine

### Step 1: Understanding the statistics of a database to choose an appropriate *minimum support* (*minSup*) value.

In [None]:
#import the class file
import PAMI.extras.dbStats.TransactionalDatabase as stats

#specify the file name
inputFile = 'Transactional_T10I4D100K.csv'

#initialize the class
obj=stats.TransactionalDatabase(inputFile,sep='\t')

#execute the class
obj.run()

#Printing each of the database statistics
print(f'Database size : {obj.getDatabaseSize()}')
print(f'Total number of items : {obj.getTotalNumberOfItems()}')
print(f'Database sparsity : {obj.getSparsity()}')
print(f'Minimum Transaction Size : {obj.getMinimumTransactionLength()}')
print(f'Average Transaction Size : {obj.getAverageTransactionLength()}')
print(f'Maximum Transaction Size : {obj.getMaximumTransactionLength()}')
print(f'Standard Deviation Transaction Size : {obj.getStandardDeviationTransactionLength()}')
print(f'Variance in Transaction Sizes : {obj.getVarianceTransactionLength()}')

#saving the distribution of items' frequencies and transactional lengths
itemFrequencies = obj.getSortedListOfItemFrequencies()
transactionLength = obj.getTransanctionalLengthDistribution()
obj.save(itemFrequencies, 'itemFrequency.csv')
obj.save(transactionLength, 'transactionSize.csv')

#Alternative approach to derive the database statistics and plot the graphs
# obj.printStats()
# obj.plotGraphs()
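
As a rough intuition for the sparsity statistic printed above, the sketch below computes the fraction of empty cells in the transaction-by-item matrix of a toy database. This is an assumed definition used for illustration; the function `sparsity` is hypothetical, and PAMI's `getSparsity()` may differ in its details.

```python
def sparsity(transactions):
    """Fraction of (transaction, item) cells that are empty."""
    distinct_items = {item for t in transactions for item in t}
    total_cells = len(transactions) * len(distinct_items)
    if total_cells == 0:
        return 0.0  # an empty database has no cells to be sparse
    filled_cells = sum(len(set(t)) for t in transactions)
    return 1 - filled_cells / total_cells

# Toy database: 4 transactions over 5 distinct items, 9 filled cells out of 20
toy_db = [["a", "b", "c"], ["a", "d"], ["b"], ["a", "c", "e"]]
print(round(sparsity(toy_db), 3))
```

A value close to 1 means that most transactions contain only a small share of all items.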

### Step 2: Draw the item frequency graph and the transaction length distribution graph for more information

In [None]:
import PAMI.extras.graph.plotLineGraphFromDictionary as plt

itemFrequencies = obj.getFrequenciesInRange()
transactionLength = obj.getTransanctionalLengthDistribution()
plt.plotLineGraphFromDictionary(itemFrequencies, 100, 'Items\' frequency graph', 'No of items', 'frequency')
plt.plotLineGraphFromDictionary(transactionLength, 100, 'transaction distribution graph', 'transaction length', 'frequency')

### Step 3: Choosing an appropriate *minSup* value

_Observations_

  1. The input dataset is sparse, as the sparsity value is 0.988 (= 98.8%).
  2. Many items have low frequencies, as seen in the items' frequency graph.
  3. The dataset is not high dimensional, as the transaction length distribution is an inverted curve that peaks around 10.

  Based on the above observations, let us choose a _minSup_ value of 100 (in count). We can increase or decrease the _minSup_ value based on the number of patterns being generated.

In [None]:
minSup=100 #minSup is specified in count. However, users can also specify minSup as a fraction between 0 and 1.
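
The two ways of expressing _minSup_ (a count or a fraction between 0 and 1) can be related with a small conversion, sketched below. The helper names are hypothetical and the rounding rule is an assumption rather than PAMI's exact behavior. The database size of 100,000 is also an assumption based on the `D100K` in the file name; use the value printed in Step 1 for exact figures.

```python
import math

def to_support_count(min_sup, database_size):
    """Return minSup as a transaction count (assumed rule: floats are fractions)."""
    if isinstance(min_sup, float):
        # round first to guard against floating-point noise before taking the ceiling
        return math.ceil(round(min_sup * database_size, 9))
    return int(min_sup)

def to_support_fraction(min_sup_count, database_size):
    """Return minSup as a fraction of the database size."""
    return min_sup_count / database_size

database_size = 100_000  # assumption; see the database statistics for the real size
print(to_support_count(100, database_size))
print(to_support_count(0.001, database_size))
print(to_support_fraction(100, database_size))
```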

### Step 3 (continued): Choosing an appropriate minimum all-confidence (*minAllConf*) value


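We are often interested in finding patterns having high correlation, which calls for a high _minAllConf_ value. However, a very high threshold may leave few patterns in a sparse dataset, so let us start with a _minAllConf_ value of 0.2 (or 20%) and raise it if too many patterns are generated.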

In [None]:
minAllConf=0.2
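
The all-confidence of a pattern is commonly defined as its support divided by the largest support among its individual items. The sketch below computes it by brute force on a toy database. The functions `support` and `all_confidence` are hypothetical helpers written for illustration and are not PAMI's implementation.

```python
def support(pattern, transactions):
    """Number of transactions that contain every item of `pattern`."""
    pattern = set(pattern)
    return sum(1 for t in transactions if pattern <= set(t))

def all_confidence(pattern, transactions):
    """Support of the pattern divided by the largest single-item support."""
    pattern_support = support(pattern, transactions)
    max_item_support = max(support([item], transactions) for item in pattern)
    return pattern_support / max_item_support if max_item_support else 0.0

toy_db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]

# Each single item occurs in 4 transactions; {a, b} occurs in 3; {a, b, c} in 2
print(all_confidence(["a", "b"], toy_db))       # 0.75
print(all_confidence(["a", "b", "c"], toy_db))  # 0.5
```

A pattern is reported as correlated only when both its support and its all-confidence meet the chosen thresholds, so raising either threshold can only shrink the result.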

### Step 4: Mining correlated patterns using CoMine

In [None]:
from PAMI.correlatedPattern.basic import CoMine  as alg #import the algorithm

obj = alg.CoMine(iFile='Transactional_T10I4D100K.csv', minSup=minSup, minAllConf=minAllConf, sep='\t')    #initialize
obj.startMine()            #start the mining process

obj.save('correlatedPatternsAtMinSupCount100_020.txt') #save the patterns


correlatedPatternsDF= obj.getPatternsAsDataFrame() #get the generated correlated patterns as a dataframe
print('Total No of patterns: ' + str(len(correlatedPatternsDF))) #print the total number of patterns
print('Runtime: ' + str(obj.getRuntime())) #measure the runtime

print('Memory (RSS): ' + str(obj.getMemoryRSS()))
print('Memory (USS): ' + str(obj.getMemoryUSS()))


### Step 5: Investigating the generated patterns

Open the patterns' file and investigate the generated patterns. If the generated patterns are interesting, use them; otherwise, redo Steps 3 and 4 with different _minSup_ and _minAllConf_ values.

In [None]:
!tail correlatedPatternsAtMinSupCount100_020.txt

The storage format is: _correlatedPattern:support:allConfidence_
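
A line in this format could be split back into its parts as sketched below. The helper `parse_pattern_line` and the sample line are hypothetical; the sketch assumes that the items of a pattern are separated by whitespace and that the last two colon-separated fields are the support and the all-confidence.

```python
def parse_pattern_line(line):
    """Split 'items:support:allConfidence' into (items, support, all_confidence)."""
    # split from the right so only the last two colons act as field separators
    pattern_text, support_text, all_conf_text = line.strip().rsplit(":", 2)
    items = pattern_text.split()  # assumption: whitespace-separated items
    return items, float(support_text), float(all_conf_text)

# Hypothetical sample line, not taken from the real output file
sample_line = "item1\titem4:120:0.25"
items, support, all_conf = parse_pattern_line(sample_line)
print(items, support, all_conf)
```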



***

## Part 2: Evaluating the CoMine algorithm on a dataset at different minSup values

### Step 1: Import the libraries and specify the input parameters

In [None]:
#Import the libraries
from PAMI.correlatedPattern.basic import CoMine  as alg #import the algorithm
import pandas as pd

#Specify the input parameters
inputFile = 'Transactional_T10I4D100K.csv'
seperator='\t'
minSupValues = [100, 150, 200, 250, 300]
#minSup can also be specified as a fraction between 0 and 1, e.g., minSupValues = [0.005, 0.006, 0.007, 0.008, 0.009]
minAllConf=0.3

### Step 2: Create a data frame to store the results of CoMine

In [None]:
result = pd.DataFrame(columns=['algorithm', 'minSup', 'patterns', 'runtime', 'memory'])
#initialize a data frame to store the results of the CoMine algorithm

### Step 3: Execute the algorithm at different minSup values

In [None]:
for minSup in minSupValues:
    obj = alg.CoMine(inputFile, minSup=minSup,minAllConf=minAllConf,sep=seperator)
    obj.startMine()
    #store the results in the data frame
    result.loc[result.shape[0]] = ['CoMine', minSup, len(obj.getPatterns()), obj.getRuntime(), obj.getMemoryRSS()]

### Step 4: Print the result

In [None]:
print(result)

### Step 5: Visualizing the results

In [None]:
from PAMI.extras.graph import plotLineGraphsFromDataFrame as plt

ab = plt.plotGraphsFromDataFrame(result)
ab.plotGraphsFromDataFrame()

***

## Part 3: Evaluating the CoMine algorithm on a dataset at different minAllConf values

### Step 1: Import the libraries and specify the input parameters

In [None]:
#Import the libraries
from PAMI.correlatedPattern.basic import CoMine  as alg #import the algorithm
import pandas as pd

#Specify the input parameters
inputFile = 'Transactional_T10I4D100K.csv'
seperator='\t'
minAllConfValues = [0.1, 0.2, 0.3, 0.4, 0.5]
#minAllConf is specified between 0 and 1. minSup is given in count below, but it can also be specified between 0 and 1.
minSup=100

### Step 2: Create a data frame to store the results of CoMine

In [None]:
result = pd.DataFrame(columns=['algorithm', 'minAllConf', 'patterns', 'runtime', 'memory'])
#initialize a data frame to store the results of the CoMine algorithm

### Step 3: Execute the algorithm at different minAllConf values

In [None]:
for minAllConf in minAllConfValues:
    obj = alg.CoMine(inputFile, minSup=minSup, minAllConf=minAllConf, sep=seperator)
    obj.startMine()
    #store the results in the data frame
    result.loc[result.shape[0]] = ['CoMine', minAllConf, len(obj.getPatterns()), obj.getRuntime(), obj.getMemoryRSS()]

### Step 4: Print the result

In [None]:
print(result)

### Step 5: Visualizing the results

In [None]:
result.plot(x='minAllConf', y='patterns', kind='line')
result.plot(x='minAllConf', y='runtime', kind='line')
result.plot(x='minAllConf', y='memory', kind='line')

#Graphs can be improved further by using additional packages, such as plotly and matplotlib