# Data reading examples (including in-memory conversion from numpy to ARFF)
* **TL;DR**: Is the data in a CSV or ARFF file? Use the ```stream_from_file(...)``` function to read it and obtain a valid ```Stream```.
  
* Key functions and classes involved. 
  * ```ARFFStream(path=...)```: Class that inherits from ```Stream```
  * ```numpy_to_ARFF(X, y, ...)```: Function that receives numpy ```X``` and ```y``` and returns MOA ```Instances``` and ```InstancesHeader``` objects
  * ```NumpyStream(X, y, ...)```: Class that wraps a numpy ```X``` and ```y``` in a ```Stream```-compatible object
  * ```stream_from_file(path_to_csv_or_arff=...)```: Function that returns a ```Stream```
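
The notebook does not show how ```stream_from_file``` decides between the two formats. One plausible approach is to look at the file extension, as in the sketch below. The helper name ```detect_format``` is hypothetical and is not part of CapyMOA; the library may well use a different rule.

```python
from pathlib import Path


def detect_format(path):
    """Return 'csv' or 'arff' based on the file extension (case-insensitive).

    Hypothetical helper for illustration only; not a CapyMOA function.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return "csv"
    if suffix == ".arff":
        return "arff"
    raise ValueError(f"Unsupported file extension: {suffix!r}")


print(detect_format("../data/electricity.arff"))
print(detect_format("../data/electricity.CSV"))
```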
 
**Notebook updated on 20/10/2023**

In [1]:
import numpy as np  # used by the np.genfromtxt examples further down
import pandas as pd
# local code imports
from capymoa.evaluation import windowed_evaluation
from capymoa.learner.classifier import OnlineBagging
from capymoa.stream.stream import *

arff_elec_path = '../data/electricity.arff'
csv_elec_path = '../data/electricity.csv'

capymoa_root: /home/anton/github.com/tachyonicClock/CapyMOA/src/capymoa
MOA jar path location (config.ini): /home/anton/github.com/tachyonicClock/CapyMOA/src/capymoa/jar/moa.jar
JVM Location (system): 
JAVA_HOME: /usr/lib64/jvm/java
JVM args: ['-Xmx8g', '-Xss10M']
Sucessfully started the JVM and added MOA jar to the class path


## Using ```stream_from_file(...)```
* This is the recommended way to read data from a CSV or ARFF file.

* ```stream_from_file(<...>.csv)```

In [2]:
%%time
stream = stream_from_file(path_to_csv_or_arff=csv_elec_path)
ob_learner = OnlineBagging(schema=stream.get_schema(), ensemble_size=5)

results = windowed_evaluation(stream=stream, learner=ob_learner, window_size=4500)

display(results['windowed'].metrics_per_window())
print(results['windowed'].metrics())

Unnamed: 0,classified instances,classifications correct (percent),Kappa Statistic (percent),Kappa Temporal Statistic (percent),Kappa M Statistic (percent)
0,4500.0,84.466667,67.338484,3.851444,60.419026
1,9000.0,82.933333,65.431704,-1.721854,62.862669
2,13500.0,84.2,68.45675,-7.564297,66.930233
3,18000.0,78.533333,53.581883,-49.074074,48.807631
4,22500.0,80.911111,58.33522,-19.637883,51.276234
5,27000.0,73.311111,40.992724,-109.965035,35.775401
6,31500.0,75.266667,44.775669,-103.10219,36.508842
7,36000.0,74.0,45.151302,-96.638655,34.963869
8,40500.0,74.288889,49.99539,-73.463268,40.697078
9,45000.0,82.622222,64.986886,-11.714286,64.838129


[45312.0, 81.97777777777779, 63.596765274097066, -17.706821480406372, 63.203266787658805]
CPU times: user 2.7 s, sys: 64.7 ms, total: 2.77 s
Wall time: 1.14 s
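
The per-window table above stops at 45000 instances while the cumulative metrics report 45312, which suggests that only complete windows are listed. The sketch below shows that idea for accuracy alone, without CapyMOA. The function name ```windowed_accuracy``` and the toy labels are made up for illustration, and the library's evaluator may handle the trailing partial window differently.

```python
def windowed_accuracy(y_true, y_pred, window_size):
    """Return one accuracy percentage per complete window of predictions."""
    accuracies = []
    correct = 0
    for i, (truth, pred) in enumerate(zip(y_true, y_pred), start=1):
        correct += int(truth == pred)
        if i % window_size == 0:
            # A window is complete: record its accuracy and reset the counter
            accuracies.append(100.0 * correct / window_size)
            correct = 0
    # Instances after the last complete window are not reported
    return accuracies


# Toy labels, not taken from the electricity dataset
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 0]
print(windowed_accuracy(y_true, y_pred, window_size=4))
```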


* ```stream_from_file(<...>.arff)```

In [3]:
%%time
stream = stream_from_file(path_to_csv_or_arff=arff_elec_path)
ob_learner = OnlineBagging(schema=stream.get_schema(), ensemble_size=5)

results = windowed_evaluation(stream=stream, learner=ob_learner, window_size=4500)

display(results['windowed'].metrics_per_window())
print(results['windowed'].metrics())

Unnamed: 0,classified instances,classifications correct (percent),Kappa Statistic (percent),Kappa Temporal Statistic (percent),Kappa M Statistic (percent)
0,4500.0,84.466667,67.338484,3.851444,60.419026
1,9000.0,82.933333,65.431704,-1.721854,62.862669
2,13500.0,84.2,68.45675,-7.564297,66.930233
3,18000.0,78.533333,53.581883,-49.074074,48.807631
4,22500.0,80.911111,58.33522,-19.637883,51.276234
5,27000.0,73.311111,40.992724,-109.965035,35.775401
6,31500.0,75.266667,44.775669,-103.10219,36.508842
7,36000.0,74.0,45.151302,-96.638655,34.963869
8,40500.0,74.288889,49.99539,-73.463268,40.697078
9,45000.0,82.622222,64.986886,-11.714286,64.838129


[45312.0, 81.97777777777779, 63.596765274097066, -17.706821480406372, 63.203266787658805]
CPU times: user 555 ms, sys: 9.86 ms, total: 564 ms
Wall time: 201 ms
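
The Kappa columns in the tables above compare the learner's accuracy against a baseline. The sketch below uses the definitions commonly given in the data-stream literature: Kappa compares against chance agreement, and Kappa Temporal compares against a "no-change" baseline that predicts the previous true label. The function names and toy labels are illustrative, and MOA's evaluator may differ in details such as how the first instance is treated. Under these definitions, a negative Kappa Temporal means the learner does worse than simply repeating the previous label.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def kappa(y_true, y_pred):
    """Cohen's kappa: accuracy corrected for chance agreement."""
    n = len(y_true)
    p0 = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_chance = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p0 - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0


def kappa_temporal(y_true, y_pred):
    """Accuracy corrected against a baseline that repeats the previous true label.

    Assumption: the first instance has no predecessor and counts as a baseline miss.
    """
    n = len(y_true)
    p0 = accuracy(y_true, y_pred)
    p_persist = sum(y_true[i] == y_true[i - 1] for i in range(1, n)) / n
    return (p0 - p_persist) / (1 - p_persist) if p_persist < 1 else 0.0


# Toy labels with long runs of the same class, where the no-change baseline is strong
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]
print(kappa(y_true, y_pred), kappa_temporal(y_true, y_pred))
```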


## Using ARFFStream directly
* If the data resides in an ARFF file, one can use ARFFStream directly as shown in the example below. 
* However, it is easier to use ```stream_from_file(path_to_csv_or_arff=...)```

In [4]:
from capymoa.evaluation import ClassificationEvaluator, ClassificationWindowedEvaluator

maxInstancesToProcess = 5000
sampleFrequency = 1000
instancesProcessed = 1

stream = ARFFStream(path=arff_elec_path)

learner = OnlineBagging(schema=stream.get_schema(), ensemble_size=5)

evaluator_TTT = ClassificationEvaluator(schema=stream.get_schema(), window_size=sampleFrequency)
evaluator_windowed = ClassificationWindowedEvaluator(schema=stream.get_schema(), window_size=sampleFrequency)

while stream.has_more_instances() and instancesProcessed <= maxInstancesToProcess:
    instance = stream.next_instance()
    # Test-then-train: predict on the instance before the learner is trained on it
    prediction = learner.predict(instance)
    evaluator_TTT.update(instance.y(), prediction)
    evaluator_windowed.update(instance.y(), prediction)
    learner.train(instance)

    instancesProcessed += 1

print(evaluator_TTT.accuracy())
evaluator_windowed.metrics_per_window()

84.48


Unnamed: 0,classified instances,classifications correct (percent),Kappa Statistic (percent),Kappa Temporal Statistic (percent),Kappa M Statistic (percent)
0,1000.0,85.7,71.415778,-1.41844,70.816327
1,2000.0,83.5,61.565339,-17.857143,44.630872
2,3000.0,80.8,60.255853,-3.225806,53.284672
3,4000.0,87.6,73.842863,26.627219,68.286445
4,5000.0,84.8,64.54827,5.0,57.062147
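
The loop in the cell above follows a test-then-train pattern: each instance is first used for prediction and only afterwards for training. The library-free sketch below shows the same pattern with a toy majority-class learner. ```MajorityClassLearner```, ```prequential_accuracy```, and the data are hypothetical and exist only for this illustration.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy learner that always predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        if not self.counts:
            return None  # nothing seen yet, so no prediction is possible
        return self.counts.most_common(1)[0][0]

    def train(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(instances, learner, max_instances=None):
    """Run test-then-train over (x, y) pairs and return accuracy in percent."""
    correct = 0
    processed = 0
    for x, y in instances:
        if max_instances is not None and processed >= max_instances:
            break
        correct += int(learner.predict(x) == y)  # test first
        learner.train(x, y)                      # then train
        processed += 1
    return 100.0 * correct / processed if processed else 0.0


# Made-up (features, label) pairs
data = [((0.1,), 1), ((0.2,), 1), ((0.3,), 0), ((0.4,), 1), ((0.5,), 1)]
print(prequential_accuracy(data, MajorityClassLearner()))
```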


## Testing the ```numpy_to_ARFF()``` function
* ```numpy_to_ARFF(...)``` should **not be used directly**; it is easier to create a ```NumpyStream``` object.
* This function returns MOA (Java) objects (Instances and InstancesHeader)
  * ```<java class 'com.yahoo.labs.samoa.instances.InstancesHeader'>```
  * ```<java class 'com.yahoo.labs.samoa.instances.Instances'>```

In [5]:
!pip install scikit-learn



In [6]:
# Import necessary libraries
from sklearn.datasets import load_iris

# Load the Iris dataset
data = load_iris()
X = data.data  # Features
y = data.target  # Target labels

print(numpy_to_ARFF(X, y, "Iris"))

(<java object 'com.yahoo.labs.samoa.instances.Instances'>, <java object 'com.yahoo.labs.samoa.instances.InstancesHeader'>)


In [7]:
# Import necessary libraries
from sklearn.datasets import load_diabetes

# Load the Diabetes dataset
data = load_diabetes()
X = data.data  # Features
y = data.target  # Target (diabetes progression)

# print(data.DESCR)
arff_instances_data, arff_instances_header = numpy_to_ARFF(X, y, "Diabetes", data.feature_names)

print(arff_instances_data)

@relation Diabetes

@attribute age numeric
@attribute sex numeric
@attribute bmi numeric
@attribute bp numeric
@attribute s1 numeric
@attribute s2 numeric
@attribute s3 numeric
@attribute s4 numeric
@attribute s5 numeric
@attribute s6 numeric
@attribute target numeric

@data
0.038075906433423026,0.05068011873981862,0.061696206518683294,0.0218723855140367,-0.04422349842444599,-0.03482076283769895,-0.04340084565202491,-0.002592261998183278,0.019907486170462722,-0.01764612515980379,151.0,
-0.0018820165277906047,-0.044641636506989144,-0.051474061238800654,-0.02632752814785296,-0.008448724111216851,-0.019163339748222204,0.07441156407875721,-0.03949338287409329,-0.0683315470939731,-0.092204049626824,75.0,
0.08529890629667548,0.05068011873981862,0.04445121333659049,-0.00567042229275739,-0.04559945128264711,-0.03419446591411989,-0.03235593223976409,-0.002592261998183278,0.002861309289833047,-0.025930338989472702,141.0,
-0.0890629393522567,-0.044641636506989144,-0.011595014505211082,-0.03665608
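
To make the printed structure above easier to follow, the sketch below builds comparable ARFF text from plain Python lists. It is not how ```numpy_to_ARFF``` works internally (that function returns MOA Java objects); ```to_arff_text``` and its parameters are hypothetical. The sketch also omits the trailing comma that appears at the end of each data row in the MOA output.

```python
def to_arff_text(relation, feature_names, rows, target_name="target", class_labels=None):
    """Render (features, target) rows as ARFF text.

    If class_labels is given, the target is declared nominal; otherwise numeric.
    """
    lines = [f"@relation {relation}", ""]
    for name in feature_names:
        lines.append(f"@attribute {name} numeric")
    if class_labels is None:
        lines.append(f"@attribute {target_name} numeric")
    else:
        labels = ",".join(str(label) for label in class_labels)
        lines.append(f"@attribute {target_name} {{{labels}}}")
    lines += ["", "@data"]
    for features, target in rows:
        values = list(features) + [target]
        lines.append(",".join(str(v) for v in values))
    return "\n".join(lines)


# Made-up rows for illustration
rows = [([0.5, 1.5], 1), ([0.25, 2.0], 0)]
print(to_arff_text("Toy", ["a", "b"], rows, target_name="class", class_labels=[0, 1]))
```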

## Reading CSV data from a file using ```np.genfromtxt```
* This is just an example of how to do this directly.
* It is best to just use ```stream_from_file(path_to_csv_or_arff=...)```

In [8]:
data = np.genfromtxt(csv_elec_path, delimiter=',', skip_header=1)  # Assuming a header row

# Extract the feature data (all columns except the last one) and target data (last column)
X = data[:, :-1]
y = data[:, -1]

# Extract the header from the CSV file (first row)
with open(csv_elec_path, 'r') as file:
    header = file.readline().strip().split(',')

# Optionally, you can print the shapes of X and y to verify
print("X shape:", X.shape)
print("y shape:", y.shape)
print(header)

arff_data, arff_header = numpy_to_ARFF(X, y.astype(int), "Elec", header[:-1], header[-1])

# schema=Schema(moa_header=arff_data.getHeader())
print(arff_header)

X shape: (45312, 6)
y shape: (45312,)
['period', 'nswprice', 'nswdemand', 'vicprice', 'vicdemand', 'transfer', 'class']


@relation Elec

@attribute period numeric
@attribute nswprice numeric
@attribute nswdemand numeric
@attribute vicprice numeric
@attribute vicdemand numeric
@attribute transfer numeric
@attribute class {0,1}

@data
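
The cell above splits the CSV into features, target, and header names with numpy. For readers without numpy at hand, the sketch below does the same split with the standard library only, assuming the target is the last column and holds integer labels. ```split_features_target``` is a hypothetical helper and the sample values are made up.

```python
import csv
import io


def split_features_target(csv_text):
    """Split CSV text into feature names, target name, feature rows, and labels.

    Assumes a header row, numeric features, and an integer target in the last column.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    X, y = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        X.append([float(v) for v in row[:-1]])
        y.append(int(float(row[-1])))
    return header[:-1], header[-1], X, y


# Made-up sample with the same layout idea as the electricity CSV
sample = "period,nswprice,class\n0.0,0.056,1\n0.02,0.051,0\n"
feature_names, target_name, X, y = split_features_target(sample)
print(feature_names, target_name, len(X), y)
```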



## Example: Creating a stream using X and y (NumpyStream) 

## Testing every method of the NumpyStream class

In [9]:
data = np.genfromtxt(csv_elec_path, delimiter=',', skip_header=1)  # Assuming a header row

# Extract the feature data (all columns except the last one) and target data (last column)
X = data[:, :-1]
y = data[:, -1]

# Extract the header from the CSV file (first row)
with open(csv_elec_path, 'r') as file:
    header = file.readline().strip().split(',')
    
np_stream = NumpyStream(X=X, y=y.astype(int), dataset_name="Elec", feature_names=header[:-1], target_name=header[-1])


print(np_stream.next_instance())
print(np_stream.next_instance())
print(np_stream.next_instance())

print("restarting stream...")

np_stream.restart()

print(np_stream.next_instance())
print(np_stream.next_instance())
print(np_stream.next_instance())

np_stream.restart()

try:
    np_stream.get_moa_stream()
except ValueError as ve:
    print(ve)

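# Drain the stream and count how many instances it yields
counter_num_instances = 0
while np_stream.has_more_instances():
    np_stream.next_instance()
    counter_num_instances += 1

# Call next_instance() once more on the exhausted stream; the return value is discarded
np_stream.next_instance()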

print(f"Read {counter_num_instances} instances, total = {np_stream.arff_instances_data.numInstances() }")

<capymoa.stream.stream.Instance object at 0x7fd666c8d0f0>
<capymoa.stream.stream.Instance object at 0x7fd6802759c0>
<capymoa.stream.stream.Instance object at 0x7fd63cd6f760>
restarting stream...
<capymoa.stream.stream.Instance object at 0x7fd63cd6c0d0>
<capymoa.stream.stream.Instance object at 0x7fd63cd6f760>
<capymoa.stream.stream.Instance object at 0x7fd63cd6c0d0>
Not a moa_stream, a numpy read file
Read 45312 instances, total = 45312
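
The methods exercised above (```has_more_instances```, ```next_instance```, ```restart```, ```get_moa_stream```) can be pictured with a small in-memory class. This is a sketch of the interface only, not CapyMOA's ```NumpyStream```: ```ListStream``` is hypothetical, instances are plain tuples, and returning ```None``` from an exhausted stream is an assumption rather than something the notebook output confirms.

```python
class ListStream:
    """Minimal in-memory stream over parallel lists of features and labels."""

    def __init__(self, X, y):
        if len(X) != len(y):
            raise ValueError("X and y must have the same length")
        self._X = X
        self._y = y
        self._index = 0

    def has_more_instances(self):
        return self._index < len(self._X)

    def next_instance(self):
        if not self.has_more_instances():
            return None  # assumption: an exhausted stream yields None
        instance = (self._X[self._index], self._y[self._index])
        self._index += 1
        return instance

    def restart(self):
        """Rewind so the next call to next_instance() returns the first instance."""
        self._index = 0

    def get_moa_stream(self):
        # Mirrors the behaviour shown above: no MOA stream backs in-memory data
        raise ValueError("Not a moa_stream, a numpy read file")


stream = ListStream([[0.1], [0.2], [0.3]], [1, 0, 1])
first = stream.next_instance()
stream.restart()
print(first == stream.next_instance())
```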


## Using ```NumpyStream``` to create the stream and train
* An ```OnlineBagging``` learner is trained on this stream.
* The goal of this example is to show that the ```Stream``` is valid, i.e. it can be used for training and testing as usual.

In [10]:
data = np.genfromtxt(csv_elec_path, delimiter=',', skip_header=1)  # Assuming a header row

# Extract the feature data (all columns except the last one) and target data (last column)
X = data[:, :-1]
y = data[:, -1]

# Extract the header from the CSV file (first row)
with open(csv_elec_path, 'r') as file:
    header = file.readline().strip().split(',')
    
np_stream = NumpyStream(X=X, y=y.astype(int), dataset_name="Elec", feature_names=header[:-1], target_name=header[-1])

ob_learner = OnlineBagging(schema=np_stream.get_schema(), ensemble_size=5)

ob_windowed_evaluator = ClassificationWindowedEvaluator(schema=np_stream.get_schema(), window_size=5000)

while np_stream.has_more_instances():
    instance = np_stream.next_instance()
    
    prediction = ob_learner.predict(instance)
    ob_windowed_evaluator.update(instance.y(), prediction)
    ob_learner.train(instance)

ob_windowed_evaluator.metrics_per_window()

Unnamed: 0,classified instances,classifications correct (percent),Kappa Statistic (percent),Kappa Temporal Statistic (percent),Kappa M Statistic (percent)
0,5000.0,84.48,67.192489,2.512563,60.082305
1,10000.0,82.92,65.723322,-3.015682,64.192872
2,15000.0,84.26,68.414479,-11.948791,66.581741
3,20000.0,78.48,52.881349,-49.237171,47.716229
4,25000.0,73.98,40.318363,-65.732484,32.168926
5,30000.0,76.78,49.165527,-106.58363,43.060324
6,35000.0,73.18,43.235885,-98.961424,34.871297
7,40000.0,74.8,50.791615,-71.428571,37.561943
8,45000.0,82.6,65.046029,-10.687023,63.628763


## Testing ```stream_from_file(path_to_csv_or_arff=...)```

In [11]:
stream = stream_from_file(path_to_csv_or_arff=arff_elec_path)

print(stream.next_instance())

stream.get_moa_stream()

<capymoa.stream.stream.Instance object at 0x7fd63cd6ee90>


<java object 'moa.streams.ArffFileStream'>

In [12]:
stream = stream_from_file(path_to_csv_or_arff=csv_elec_path)

print(stream.next_instance())

# get_moa_stream() raises a ValueError here, as expected: the CSV file is read through numpy, so no MOA stream backs it.
# Furthermore, users are not expected to call get_moa_stream() directly unless they know what they are doing.
try:
    stream.get_moa_stream()
except ValueError as ve:
    print(ve)

<capymoa.stream.stream.Instance object at 0x7fd6802756c0>
Not a moa_stream, a numpy read file


## Using ```stream_from_file``` with ARFF and with CSV (should give the same results)

* Online Bagging is run with 200 learners to confirm that, once the file has been read, the ARFF-backed and CSV-backed ```Stream``` objects held in memory show no difference in runtime.

In [13]:
%%time
stream = stream_from_file(path_to_csv_or_arff=arff_elec_path)

ob_learner = OnlineBagging(schema=stream.get_schema(), ensemble_size=200)

# Cumulative (not windowed) evaluator, hence the name
ob_evaluator = ClassificationEvaluator(schema=stream.get_schema())

while stream.has_more_instances():
    instance = stream.next_instance()

    prediction = ob_learner.predict(instance)
    ob_evaluator.update(instance.y(), prediction)
    ob_learner.train(instance)

CPU times: user 7.35 s, sys: 39.3 ms, total: 7.39 s
Wall time: 6.9 s


In [14]:
%%time
stream = stream_from_file(path_to_csv_or_arff=csv_elec_path)

ob_learner = OnlineBagging(schema=stream.get_schema(), ensemble_size=200)

# Cumulative (not windowed) evaluator, hence the name
ob_evaluator = ClassificationEvaluator(schema=stream.get_schema())

while stream.has_more_instances():
    instance = stream.next_instance()

    prediction = ob_learner.predict(instance)
    ob_evaluator.update(instance.y(), prediction)
    ob_learner.train(instance)

CPU times: user 7.32 s, sys: 56.8 ms, total: 7.37 s
Wall time: 6.99 s
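
The ```%%time``` magic used above only works inside a notebook. Outside one, a comparable wall-clock measurement can be taken with ```time.perf_counter```, as in the sketch below. ```time_call``` is a hypothetical helper, and the workload is a stand-in rather than the Online Bagging loop.

```python
import time


def time_call(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; replace with the evaluation loop to be timed
result, elapsed = time_call(sum, range(100_000))
print(f"result={result}, wall time={elapsed:.4f} s")
```

A single run is noisy, so small gaps such as the one between the two cells above are best read as "no meaningful difference" rather than as a ranking.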
