<a href="https://colab.research.google.com/github/arjunpogaku/BOOK_Hands-on-Pattern-Mining/blob/main/chapter2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 2: Handling Big Data - Classification, Storage, and Processing Techniques

## Install the PAMI package

In [2]:
!pip install PAMI

Collecting PAMI
  Downloading pami-2024.12.6.1-py3-none-any.whl.metadata (80 kB)
Collecting resource (from PAMI)
  Downloading Resource-0.2.1-py2.py3-none-any.whl.metadata (478 bytes)
Collecting validators (from PAMI)
  Downloading validators-0.34.0-py3-none-any.whl.metadata (3.8 kB)
Collecting sphinx-rtd-theme (from PAMI)
  Downloading sphinx_rtd_theme-3.0.2-py2.py3-none-any.whl.metadata (4.4 kB)
Collecting discord.py (from PAMI)
  Downloading discord.py-2.4.0-py3-none-any.whl.metadata (6.9 kB)
Collecting fastparquet (from PAMI)
  Downloading fastparquet-2024.11.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Collecting cramjam>=2.3 (from fastparquet->PAMI)
  Downloading cramjam-2.9.0-cp310-cp310-manylinux_2_17_x86_64.manyli

## Downloading a sample file

In [3]:
!wget -nc https://web-ext.u-aizu.ac.jp/~udayrage/datasets/transactionalDatabases/Transactional_T10I4D100K.csv

--2024-12-06 17:36:54--  https://web-ext.u-aizu.ac.jp/~udayrage/datasets/transactionalDatabases/Transactional_T10I4D100K.csv
Resolving web-ext.u-aizu.ac.jp (web-ext.u-aizu.ac.jp)... 163.143.103.34
Connecting to web-ext.u-aizu.ac.jp (web-ext.u-aizu.ac.jp)|163.143.103.34|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4019277 (3.8M) [text/csv]
Saving to: ‘Transactional_T10I4D100K.csv’


2024-12-06 17:36:56 (5.00 MB/s) - ‘Transactional_T10I4D100K.csv’ saved [4019277/4019277]
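
Before converting the file, it can help to confirm what was downloaded. The following is a minimal sketch, not part of PAMI, that counts the transactions, the distinct items, and the mean transaction length. It assumes one transaction per line with tab-separated items, which matches the `sep='\t'` used for this file later in the notebook. The function name `summarize_transactions` is hypothetical.

```python
from collections import Counter


def summarize_transactions(path, sep='\t'):
    """Return (transaction count, distinct item count, mean transaction length)."""
    item_counts = Counter()
    n_transactions = 0
    total_items = 0
    with open(path, encoding='utf-8') as handle:
        for line in handle:
            # Split the line into items and drop empty fields.
            items = [item for item in line.strip().split(sep) if item]
            if not items:
                continue  # ignore blank lines
            n_transactions += 1
            total_items += len(items)
            # Count each item once per transaction.
            item_counts.update(set(items))
    mean_length = total_items / n_transactions if n_transactions else 0.0
    return n_transactions, len(item_counts), mean_length
```

Calling `summarize_transactions('Transactional_T10I4D100K.csv')` after the download returns the three figures as a tuple.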



## Converting a CSV file into a Parquet file

### Generic



```python
from PAMI.extras.convert import CSV2Parquet as alg

obj = alg.CSV2Parquet(inputFile,outputFile,sep)
obj.convert()

print('Runtime: ' + str(obj.getRuntime()))
print('Memory (RSS): ' + str(obj.getMemoryRSS()))
print('Memory (USS): ' + str(obj.getMemoryUSS()))
```



### Example 1: CSV2Parquet

In [4]:
import PAMI.extras.convert.CSV2Parquet as cp

obj = cp.CSV2Parquet(inputFile='Transactional_T10I4D100K.csv',
      outputFile='Transactional_T10I4D100K.parquet', sep='\t')
obj.convert()

print('Runtime: ' + str(obj.getRuntime()))
print('Memory (RSS): ' + str(obj.getMemoryRSS()))
print('Memory (USS): ' + str(obj.getMemoryUSS()))

Runtime: 0.8549163341522217
Memory (RSS): 317763584
Memory (USS): 294957056
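
RSS (resident set size) is the memory the process currently holds in RAM, including pages shared with other processes. USS (unique set size) counts only the memory unique to the process. The notebook does not state the unit of these figures. The sketch below assumes they are byte counts and converts the values printed by Example 1 to mebibytes. The helper `bytes_to_mib` is hypothetical and not part of PAMI.

```python
def bytes_to_mib(n_bytes):
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)


# Values printed by Example 1, assumed to be in bytes.
rss_bytes = 317763584
uss_bytes = 294957056

print(f'RSS: {bytes_to_mib(rss_bytes):.2f} MiB')  # RSS: 303.04 MiB
print(f'USS: {bytes_to_mib(uss_bytes):.2f} MiB')  # USS: 281.29 MiB
```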


## Converting a Parquet file into a CSV file

### Generic


```python
from PAMI.extras.convert import Parquet2CSV as alg

obj = alg.Parquet2CSV(inputFile,outputFile,sep)
obj.convert()

print('Runtime: ' + str(obj.getRuntime()))
print('Memory (RSS): ' + str(obj.getMemoryRSS()))
print('Memory (USS): ' + str(obj.getMemoryUSS()))
```



### Example 2: Parquet2CSV

In [5]:
import PAMI.extras.convert.Parquet2CSV as cp

obj = cp.Parquet2CSV(inputFile='Transactional_T10I4D100K.parquet',
      outputFile='new_Tran_T10I4D100K.csv', sep='\t')
obj.convert()

print('Runtime: ' + str(obj.getRuntime()))
print('Memory (RSS): ' + str(obj.getMemoryRSS()))
print('Memory (USS): ' + str(obj.getMemoryUSS()))

Runtime: 2.4655816555023193
Memory (RSS): 219906048
Memory (USS): 198017024
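
One way to check whether the CSV to Parquet to CSV round trip kept the transactions intact is to compare the original file with the regenerated one. The sketch below is a plain-Python comparison, not a PAMI feature, and `same_transactions` is a hypothetical helper. It ignores empty fields and blank lines, so trailing separators alone do not count as differences. Whether it returns `True` for the two files above depends on how the converters write rows, which this sketch does not assume.

```python
def same_transactions(path_a, path_b, sep='\t'):
    """Return True if both files hold the same transactions in the same order."""
    def read(path):
        with open(path, encoding='utf-8') as handle:
            # Drop empty fields so trailing separators are not treated as items.
            rows = ([item for item in line.strip().split(sep) if item]
                    for line in handle)
            # Drop blank lines; the list is built while the file is still open.
            return [row for row in rows if row]
    return read(path_a) == read(path_b)
```

For the files in this notebook, the call would be `same_transactions('Transactional_T10I4D100K.csv', 'new_Tran_T10I4D100K.csv')`.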


## Converting a DataFrame into a Particular Database Type

### Generic

```python
from PAMI.extras.convert import DF2DB as alg
import pandas as pd
import numpy as np

obj = alg.DF2DB(dataFrame)
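# convert2ParticularDatabase is a placeholder: call the method for the target
# database type (e.g. convert2TransactionalDatabase) with its own parameters.
obj.convert2ParticularDatabase(outputFileName, **otherParameters)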

print('Runtime: ' + str(obj.getRuntime()))
print('Memory (RSS): ' + str(obj.getMemoryRSS()))
print('Memory (USS): ' + str(obj.getMemoryUSS()))
```



### Example 3: DataFrame to transactional database

In [6]:
from PAMI.extras.convert import DF2DB as alg
import pandas as pd
import numpy as np

data = np.random.randint(1, 100, size=(1000, 4))
dataFrame = pd.DataFrame(data,
             columns=['Item1', 'Item2', 'Item3', 'Item4']
            )

obj = alg.DF2DB(dataFrame)
obj.convert2TransactionalDatabase(oFile='transactionalDB.csv',
       condition='>=', thresholdValue=36
     )
print('Runtime: ' + str(obj.getRuntime()))
print('Memory (RSS): ' + str(obj.getMemoryRSS()))
print('Memory (USS): ' + str(obj.getMemoryUSS()))

Runtime: 0.025201082229614258
Memory (RSS): 214548480
Memory (USS): 193175552


In [None]:
!head transactionalDB.csv  # print the first lines of the generated transactional database

Item2	Item4
Item1	Item2
Item1	Item3
Item1	Item2
Item1	Item2	Item3	Item4
Item1	Item2	Item3
Item3	Item4
Item3	Item4
Item1	Item2	Item3	Item4
Item2
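
The output above is consistent with a simple rule: each row of the DataFrame becomes one transaction that lists the names of the columns whose value satisfies the condition (here `>= 36`). The following is a minimal pure-Python sketch of that rule, not PAMI's implementation. The function `rows_to_transactions`, the `CONDITIONS` table, and the sample values are hypothetical. Only `'>='` appears in the notebook, and the other comparison strings are assumptions. The sketch also assumes that rows where no column passes are skipped.

```python
import operator

# Comparison strings mapped to functions. Only '>=' appears in the notebook;
# the remaining entries are assumptions added for illustration.
CONDITIONS = {
    '>=': operator.ge, '>': operator.gt,
    '<=': operator.le, '<': operator.lt,
    '==': operator.eq, '!=': operator.ne,
}


def rows_to_transactions(rows, condition='>=', threshold=36):
    """Turn each row (a column-name -> value mapping) into a list of item names."""
    compare = CONDITIONS[condition]
    transactions = []
    for row in rows:
        # Keep the column names whose value passes the comparison.
        items = [name for name, value in row.items() if compare(value, threshold)]
        if items:  # skip rows where no column passes (assumption of this sketch)
            transactions.append(items)
    return transactions


# Hypothetical sample rows, not taken from the notebook's random data.
rows = [
    {'Item1': 12, 'Item2': 80, 'Item3': 5, 'Item4': 36},
    {'Item1': 90, 'Item2': 40, 'Item3': 35, 'Item4': 1},
    {'Item1': 3, 'Item2': 2, 'Item3': 1, 'Item4': 0},
]
for transaction in rows_to_transactions(rows):
    print('\t'.join(transaction))
```

With these sample rows, the first transaction is `Item2` and `Item4`, the second is `Item1` and `Item2`, and the third row produces nothing. A pandas DataFrame can be passed in as `dataFrame.to_dict('records')`.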
