## This tutorial will explain how to compare data collected in GC logs to external data

# `Step 1`: Gather a dataset you would like to compare to a GC log. I will use the bytecode size, in bytes, of each method compiled at runtime as an external non-GC data source, to see whether it is related to GC activity.
 

In [39]:
# Import pandas to read the file
import pandas as pd 

# Read the CSV into a dataframe; the string below is the path to the file.
# Data generated using the JVM flag -XX:+PrintCompilation.
# The file has no header row, so pass header=None to keep the first line as data.
csv_data = pd.read_csv("just-in-time.csv", header=None)

print(csv_data)

           0   1
0         27   1
1         27  42
2         27  56
3         28  36
4         28  46
...      ...  ..
1344  201990   8
1345  201990  55
1346  201991  18
1347  201994  10
1348  201995   5

[1349 rows x 2 columns]
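
Because `just-in-time.csv` has no header row, it is worth checking how `pd.read_csv` treats the first line. The sketch below uses a few inline rows in the same "time, bytes" layout instead of the real file; the column names `timestamp_ms` and `bytecode_bytes` are illustrative choices, not names used elsewhere in this notebook.

```python
import io

import pandas as pd

# A few rows in the same headerless "time, bytes" layout as the CSV.
sample = "27, 1\n27, 42\n27, 56\n28, 36\n"

# Default behaviour: the first line becomes the column names and is lost as data.
with_default = pd.read_csv(io.StringIO(sample))

# header=None keeps every line as data; names= labels the columns.
# skipinitialspace=True drops the space that follows each comma.
with_names = pd.read_csv(
    io.StringIO(sample),
    header=None,
    names=["timestamp_ms", "bytecode_bytes"],  # illustrative names
    skipinitialspace=True,
)

print(len(with_default), len(with_names))  # 3 4
```

Reading with `header=None` keeps the first measurement in the dataset, which matters when the data is later lined up against GC events by time.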


# `Step 2`: Organize your data so that it fits the column layout expected of a "gc_event_dataframe": timestamps go into `TimeFromStart_seconds`, an event type goes into `EventType`, and the corresponding data goes into `Duration_miliseconds`.
In this example, we will create a new empty dataframe with the correct columns, then fill those columns with values.

In [40]:
import sys
sys.path.append("../")
from src.read_log_file import columnNames

column_names = columnNames()
# print(column_names)

# Create a new, empty dataframe with the expected columns
new_dataframe = pd.DataFrame(columns=list(column_names))

# The CSV timestamps are in milliseconds (see the extraction cell at the end), so convert them to seconds
new_dataframe["TimeFromStart_seconds"] = csv_data.iloc[:, 0] / 1000
# The bytecode size takes the place of the duration value
new_dataframe["Duration_miliseconds"] = csv_data.iloc[:, 1]
# Every row gets the same event type label
new_dataframe["EventType"] = "Just-In-Time"
# print(new_dataframe)
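
The mapping in this step can also be wrapped in a small reusable helper. This is a minimal sketch, not part of the original notebook: the function name `to_gc_event_dataframe` is hypothetical, and `assumed_columns` stands in for whatever `columnNames()` actually returns (only the three column names mentioned above come from this tutorial; `OtherColumn` is a placeholder).

```python
import pandas as pd


def to_gc_event_dataframe(external, column_names, event_type):
    """Map a two-column (time in seconds, value) dataframe onto the expected columns."""
    required = {"TimeFromStart_seconds", "EventType", "Duration_miliseconds"}
    missing = required - set(column_names)
    if missing:
        raise ValueError(f"column_names is missing: {sorted(missing)}")
    frame = pd.DataFrame(
        {
            "TimeFromStart_seconds": external.iloc[:, 0].to_numpy(),
            "Duration_miliseconds": external.iloc[:, 1].to_numpy(),
            "EventType": event_type,  # scalar is repeated for every row
        }
    )
    # Put the columns in the expected order; columns with no data are filled with NaN.
    return frame.reindex(columns=list(column_names))


# Assumed stand-in for columnNames(); "OtherColumn" is a placeholder.
assumed_columns = ["TimeFromStart_seconds", "EventType", "Duration_miliseconds", "OtherColumn"]
external = pd.DataFrame({"time_s": [0.027, 0.027, 0.028], "size": [1, 42, 36]})
result = to_gc_event_dataframe(external, assumed_columns, "Just-In-Time")
print(result)
```

The helper expects the time column to already be in seconds, so any unit conversion has to happen before calling it.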


# `Step 3`: Now that you have a completed dataframe, it's time to add it to the dataset being analyzed. Append the new dataframe to the list of dataframes (`gc_event_dataframes`) generated by running log analysis.

In [41]:
#################################################################
######                                                     ######
####  Taken from the original notebook: Set files and run     ####
######                                                     ######
#################################################################

################################################################################################
files = ["workload_gc.log"]
labels = ["GC workload", "Just-in-time data"] # MAKE SURE TO ADD AN EXTRA LABEL
time_range_seconds = None
################################################################################################


import sys
import matplotlib.pyplot as plt
import pandas as pd
sys.path.append("../../")
sys.path.append("../") 
plt.rcParams["figure.figsize"] = [12, 7]

from src.read_log_file import get_parsed_comparions_from_files 
gc_event_dataframes = get_parsed_comparions_from_files(files, time_range_seconds)

#### Here, append the new data ####
gc_event_dataframes.append(new_dataframe)

# Sanity check: the list should now hold one dataframe per label
print(len(gc_event_dataframes))


2
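
Since every dataframe needs a matching label, a small guard can catch a forgotten label before any plotting happens. This is a hedged sketch: `check_labels` is a hypothetical helper, not a function from `src.read_log_file`, and the strings below are placeholders for the real dataframes.

```python
def check_labels(dataframes, labels):
    """Pair each label with its dataframe, or raise if the counts differ."""
    if len(dataframes) != len(labels):
        raise ValueError(
            f"{len(dataframes)} dataframes but {len(labels)} labels"
        )
    return dict(zip(labels, dataframes))


# Placeholder values standing in for the parsed GC log and the external data.
dataframes = ["gc-dataframe-placeholder", "external-dataframe-placeholder"]
labels = ["GC workload", "Just-in-time data"]
labelled = check_labels(dataframes, labels)
print(list(labelled))  # ['GC workload', 'Just-in-time data']
```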


 

# `Step 4`: Take the new cells you created and insert them ABOVE the normal analysis cells.

# `Step 5`: Add the line that appends the new data to `gc_event_dataframes` AFTER the log analysis has read the file.

# `Step 6`: Now, run log analysis normally!

### NOTE: THE CELL BELOW IS JUST AN EXAMPLE. It is not needed for most analyses.



My JIT compilation data lives in the file `JIT-data`.
The regex below captures the timestamp and the bytecode size from every relevant line, and the results are written to a CSV called `just-in-time.csv`.

In [42]:

import re

# Group 1 is the timestamp, group 2 is the bytecode size in bytes.
# A raw string keeps \s and \d from being treated as string escapes.
match_pattern = r"^\s*(\d+) *\d+.*\((\d+) bytes\)(?:$|   made not entrant$)"
timestamps_miliseconds = []
bytes_loaded = []
with open("JIT-data", "r") as input_file:
    for line in input_file:
        match = re.search(match_pattern, line)
        if match:
            timestamps_miliseconds.append(match.group(1))
            bytes_loaded.append(match.group(2))

# Now that we have our data, we can create the CSV (it has no header row)
with open("just-in-time.csv", "w") as output_file:
    for timestamp, size in zip(timestamps_miliseconds, bytes_loaded):
        output_file.write(timestamp + ", " + size + "\n")  # one "time, bytes" row per match
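
Before running the extraction on a full log, it can help to try the regex on a few lines. The sample lines below are made up for illustration and only imitate the general shape of `-XX:+PrintCompilation` output (timestamp, compile id, method, size in bytes); they are not taken from the real `JIT-data` file.

```python
import re

pattern = re.compile(r"^\s*(\d+) *\d+.*\((\d+) bytes\)(?:$|   made not entrant$)")

# Made-up sample lines, for illustration only.
sample_lines = [
    "     27    1       3       com.example.Foo::bar (55 bytes)",
    "     28    2       3       com.example.Foo::baz (12 bytes)   made not entrant",
    "a line that does not describe a compilation",
]

rows = []
for line in sample_lines:
    match = pattern.search(line)
    if match:
        # group(1) is the timestamp, group(2) is the bytecode size
        rows.append((int(match.group(1)), int(match.group(2))))

print(rows)  # [(27, 55), (28, 12)]
```

Lines that do not start with a timestamp, or that end in anything other than the size or the `made not entrant` suffix, are skipped.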
