<a href="https://colab.research.google.com/github/UdayLab/PAMI/blob/main/notebooks/CMine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finding coverage patterns in transactional databases using CMine

This tutorial has two parts. The first part describes the basic approach to finding coverage patterns in a transactional database using the CMine algorithm. The second part describes an advanced approach, in which we evaluate the CMine algorithm on a dataset at different *minimum coverage support* threshold values.

***

## Prerequisites:

1. Installing the PAMI library

In [None]:
!pip install -U pami #install the PAMI library

2. Downloading a sample dataset

In [None]:
!wget -nc https://u-aizu.ac.jp/~udayrage/datasets/transactionalDatabases/Transactional_T10I4D100K.csv #download a sample transactional database

3. Printing the first few lines of the dataset to learn its format.

In [None]:
!head -2 Transactional_T10I4D100K.csv

_format:_ every row contains items separated by a separator.

__Example:__

item1 item2 item3 item4

item1 item4 item6
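
As a rough illustration of this format, the sketch below splits rows into lists of items. The helper name `parse_transactions` and the inline sample rows are hypothetical; the tab separator is an assumption that matches the `sep='\t'` argument used later in this tutorial.

```python
def parse_transactions(lines, sep='\t'):
    """Split each non-empty row into a list of items (hypothetical helper)."""
    transactions = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank rows
            transactions.append(line.split(sep))
    return transactions

# two sample rows in the layout described above (illustrative data only)
rows = ["item1\titem2\titem3\titem4\n", "item1\titem4\titem6\n"]
transactions = parse_transactions(rows)
print(transactions)
```

To read the real file, the same helper could be applied to an open file handle instead of the inline list.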

***

## Part 1: Finding coverage patterns with CMine

### Step 1: Understanding the statistics of a database to choose an appropriate *minimum coverage support* (*minCS*) value.

In [None]:
#import the class file
import PAMI.extras.dbStats.TransactionalDatabase as stats

#specify the file name
inputFile = 'Transactional_T10I4D100K.csv'

#initialize the class
obj=stats.TransactionalDatabase(inputFile,sep='\t')

#execute the class
obj.run()

#Printing each of the database statistics
print(f'Database size : {obj.getDatabaseSize()}')
print(f'Total number of items : {obj.getTotalNumberOfItems()}')
print(f'Database sparsity : {obj.getSparsity()}')
print(f'Minimum Transaction Size : {obj.getMinimumTransactionLength()}')
print(f'Average Transaction Size : {obj.getAverageTransactionLength()}')
print(f'Maximum Transaction Size : {obj.getMaximumTransactionLength()}')
print(f'Standard Deviation Transaction Size : {obj.getStandardDeviationTransactionLength()}')
print(f'Variance in Transaction Sizes : {obj.getVarianceTransactionLength()}')

#saving the distribution of items' frequencies and transactional lengths
itemFrequencies = obj.getSortedListOfItemFrequencies()
transactionLength = obj.getTransanctionalLengthDistribution()
obj.save(itemFrequencies, 'itemFrequency.csv')
obj.save(transactionLength, 'transactionSize.csv')

#Alternative approach to derive the database statistics and plot the graphs
# obj.printStats()
# obj.plotGraphs()
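
To make these statistics concrete, here is a minimal pure-Python sketch that computes them on a toy database. It assumes sparsity is defined as the fraction of empty cells in the transaction-by-item matrix; this is a common definition, and PAMI's internal computation may differ in detail. The toy data and variable names are illustrative only.

```python
from statistics import mean, pstdev

# toy database: each transaction is a list of items (illustrative data only)
database = [
    ['a', 'b', 'c'],
    ['a', 'c'],
    ['b', 'd'],
    ['a', 'b', 'c', 'd'],
]

lengths = [len(t) for t in database]
distinct_items = {item for t in database for item in t}

database_size = len(database)
total_items = len(distinct_items)
# assumed definition: share of empty cells in the transaction-by-item matrix
sparsity = 1 - sum(lengths) / (database_size * total_items)

print('Database size :', database_size)
print('Total number of items :', total_items)
print('Database sparsity :', sparsity)
print('Min / average / max transaction size :', min(lengths), mean(lengths), max(lengths))
print('Standard deviation of transaction size :', pstdev(lengths))
```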

### Step 2: Drawing the items' frequency graph and the transaction length distribution graph for more information

In [None]:
import PAMI.extras.graph.plotLineGraphFromDictionary as plt

itemFrequencies = obj.getFrequenciesInRange()
transactionLength = obj.getTransanctionalLengthDistribution()
plt.plotLineGraphFromDictionary(itemFrequencies, 100, 'Items\' frequency graph', 'No of items', 'frequency')
plt.plotLineGraphFromDictionary(transactionLength, 100, 'transaction distribution graph', 'transaction length', 'frequency')

### Step 3: Choosing an appropriate *minCS* value

_Observations_

  1. The input dataset is sparse, as the sparsity value is 0.988 (=98.8%).
  2. Many items have low frequencies, as seen in the items' frequency graph.
  3. The dataset is not high dimensional, as the inverted curve of the transaction length distribution peaks at around 10.

  Based on the above observations, let us choose a _minCS_ value of 0.08 (i.e., 8% of the transactions). We can increase or decrease _minCS_ depending on the number of patterns generated.

In [None]:
minimumCoverageSupport=0.08 #A coverage pattern must cover at least 8% of the transactions

### Step 4: Choosing values for the other parameters (minRF and maxOR)

In [None]:
minimumRelativeFrequency=0.02 #every item must appear in at least 2% of the transactions
maximumOverlapRatio=0.8 #Overlap between an itemset and a new item must not be more than 80%
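
To show what the three thresholds constrain, the sketch below computes relative frequency, coverage support, and overlap ratio on a toy database. It follows the definitions commonly given in the coverage-pattern literature (items ordered by descending frequency, overlap measured against the least frequent item); the function names are hypothetical and PAMI's internals may be organised differently.

```python
def tidset(database, item):
    """Set of transaction indices that contain the item."""
    return {tid for tid, t in enumerate(database) if item in t}

def relative_frequency(database, item):
    """Fraction of transactions containing the item (compared with minRF)."""
    return len(tidset(database, item)) / len(database)

def coverage_support(database, pattern):
    """Fraction of transactions containing at least one item (compared with minCS)."""
    covered = set()
    for item in pattern:
        covered |= tidset(database, item)
    return len(covered) / len(database)

def overlap_ratio(database, pattern):
    """Share of the least frequent item's transactions already covered
    by the other items (compared with maxOR). Assumes every item occurs."""
    ordered = sorted(pattern, key=lambda i: -len(tidset(database, i)))
    *head, last = ordered
    covered = set()
    for item in head:
        covered |= tidset(database, item)
    last_tids = tidset(database, last)
    return len(covered & last_tids) / len(last_tids)

# toy database (illustrative data only)
toy = [['a', 'b'], ['a', 'c'], ['a'], ['b', 'c'], ['c', 'd']]
print(relative_frequency(toy, 'a'))
print(coverage_support(toy, ['a', 'b']))
print(overlap_ratio(toy, ['a', 'b']))
```

Under these definitions, a pattern qualifies when each item meets minRF, the overlap ratio does not exceed maxOR, and the coverage support reaches minCS.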

### Step 5: Mining coverage patterns using CMine

In [None]:
from PAMI.coveragePattern.basic import CMine as alg #import the algorithm

obj = alg.CMine(iFile=inputFile, minRF=minimumRelativeFrequency, minCS=minimumCoverageSupport, maxOR=maximumOverlapRatio, sep='\t')    #initialize
obj.startMine()            #start the mining process

obj.save('coveragePatterns.txt') #save the patterns

coveragePatternsDF= obj.getPatternsAsDataFrame() #get the generated coverage patterns as a dataframe
print('Total No of patterns: ' + str(len(coveragePatternsDF))) #print the total number of patterns
print('Runtime: ' + str(obj.getRuntime())) #print the runtime

print('Memory (RSS): ' + str(obj.getMemoryRSS()))
print('Memory (USS): ' + str(obj.getMemoryUSS()))

### Step 6: Investigating the generated patterns

Open the patterns' file and investigate the generated patterns. If the generated patterns are interesting, use them; otherwise, redo Steps 3 to 5 with a different _minCS_ value.

In [None]:
!head coveragePatterns.txt

The storage format is: _coveragePattern:coverageSupport_
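
A saved line in this format could be read back with a sketch like the one below. It assumes the items of a pattern are tab-separated and that the support follows the last colon; both are assumptions about the file layout, and the sample line is hypothetical rather than real output.

```python
def parse_pattern_line(line, sep='\t'):
    """Split 'pattern:support' into a list of items and a numeric support.

    Hypothetical helper; the item separator is an assumption.
    """
    pattern_text, support_text = line.strip().rsplit(':', 1)
    items = [i.strip() for i in pattern_text.split(sep) if i.strip()]
    return items, float(support_text)

sample = "item1\titem4:0.12\n"  # hypothetical line, not real output
items, support = parse_pattern_line(sample)
print(items, support)
```

Splitting on the last colon keeps the parser safe even if an item name itself contains a colon.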



***

## Part 2: Evaluating the CMine algorithm on a dataset at different minCS values

### Step 1: Import the libraries and specify the input parameters

In [None]:
#Import the libraries
from PAMI.coveragePattern.basic import CMine as alg #import the algorithm
import pandas as pd

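#Specify the input parameters
inputFile = 'Transactional_T10I4D100K.csv'
seperator='\t'
minimumRelativeFrequency=0.02 #same minRF value as in Part 1
maximumOverlapRatio=0.8 #same maxOR value as in Part 1
minimumCoverageSupportValues = [0.09,0.08,0.07,0.06,0.05]
#each minimumCoverageSupport value is specified between 0 and 1.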

### Step 2: Create a data frame to store the results of CMine

In [None]:
result = pd.DataFrame(columns=['algorithm', 'minCS', 'patterns', 'runtime', 'memory'])
#initialize a data frame to store the results of the CMine algorithm

### Step 3: Execute the algorithm at different minCS values

In [None]:
for minimumCoverageSupport in minimumCoverageSupportValues:
    obj = alg.CMine(inputFile,minRF=minimumRelativeFrequency,minCS=minimumCoverageSupport,maxOR=maximumOverlapRatio,sep=seperator)
    obj.startMine()
    #store the results in the data frame
    result.loc[result.shape[0]] = ['CMine', minimumCoverageSupport, len(obj.getPatterns()), obj.getRuntime(), obj.getMemoryRSS()]
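
To see why the number of patterns changes across such a sweep, the sketch below counts qualifying patterns on a toy database at several minCS values. It is a brute-force stand-in that enumerates every itemset, not CMine's actual search strategy; the function names, thresholds, and data are illustrative assumptions.

```python
from itertools import combinations

def tidsets(database):
    """Map each item to the set of transaction indices containing it."""
    out = {}
    for tid, t in enumerate(database):
        for item in t:
            out.setdefault(item, set()).add(tid)
    return out

def count_coverage_patterns(database, min_rf, min_cs, max_or):
    """Brute-force count of itemsets meeting all three thresholds (toy stand-in)."""
    n = len(database)
    tids = tidsets(database)
    # keep items meeting minRF, ordered by descending frequency (ties by name)
    frequent = sorted((i for i in tids if len(tids[i]) / n >= min_rf),
                      key=lambda i: (-len(tids[i]), i))
    count = 0
    for size in range(1, len(frequent) + 1):
        # combinations() preserves the frequency order of `frequent`
        for pattern in combinations(frequent, size):
            *head, last = pattern
            covered = set().union(*(tids[i] for i in head))
            overlap = len(covered & tids[last]) / len(tids[last])
            support = len(covered | tids[last]) / n
            if overlap <= max_or and support >= min_cs:
                count += 1
    return count

# toy database (illustrative data only)
toy_db = [['a', 'b'], ['a', 'c'], ['a'], ['b', 'c'], ['c', 'd']]
thresholds = [1.0, 0.8, 0.6, 0.4, 0.2]
counts = [count_coverage_patterns(toy_db, 0.2, cs, 0.5) for cs in thresholds]
for cs, c in zip(thresholds, counts):
    print(cs, c)
```

Because lowering minCS only relaxes one condition, the count can never decrease as minCS decreases, which is the trend to look for in the results table.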

### Step 4: Print the result

In [None]:
print(result)

### Step 5: Visualizing the results

In [None]:
result.plot(x='minCS', y='patterns', kind='line')
result.plot(x='minCS', y='runtime', kind='line')
result.plot(x='minCS', y='memory', kind='line')

#Graphs can be improved further by using additional packages, such as plotly and matplotlib