# Given a CSV file, update its columns to fit the expected format for gc_event_dataframe

> ### This tutorial and tool take a CSV file containing external data and make sure the gc_log_analysis tool can accept it. Given a CSV filename, the code below transforms the file's structure to match that of the `gc_event_dataframe` and writes a new CSV file. Those CSV files can later be imported with a short snippet (shown in the bottom cell) for easy analysis.

## Populate the following variable fields. Then, run all cells
- `old_csv_filename` : the path to the file you would like to fix the format of.
- `my_column_transformations` : transformation functions for the data, one per column. Use `None` to leave a column untransformed. The length of this list must match the number of columns in the original dataset.
- `new_column_names` : the updated column names you would like to have, as strings. Use `None` to leave a column out of the new file. The length of this list must match the number of columns in the original dataset.
- `populate_columns` : if there is a column you would like to fill with the same value in every row, add a tuple to this list. The first element of the tuple is the column name, and the second element is the constant value that every row will take on. The list can have any length, but each tuple must have length 2.
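
The rules above can be checked before running the cells. The following is a minimal sketch, not part of the tool: `check_settings` and its `n_columns` parameter are hypothetical names introduced here for illustration, and `n_columns` is assumed to be the number of columns in the original CSV.

```python
def check_settings(n_columns, column_transformations, new_column_names, populate_columns):
    """Return a list of problems found in the settings (an empty list means none found)."""
    problems = []
    # Both per-column lists must have one entry per column of the original CSV.
    if len(column_transformations) != n_columns:
        problems.append("my_column_transformations must have %d entries" % n_columns)
    if len(new_column_names) != n_columns:
        problems.append("new_column_names must have %d entries" % n_columns)
    # Each transformation is either None or something callable.
    for transformation in column_transformations:
        if transformation is not None and not callable(transformation):
            problems.append("transformation %r is not callable" % (transformation,))
    # Each new column name is either None or a string.
    for name in new_column_names:
        if name is not None and not isinstance(name, str):
            problems.append("column name %r is not a string" % (name,))
    # Each populate_columns entry is a (column name, constant value) tuple.
    for entry in populate_columns:
        if not (isinstance(entry, tuple) and len(entry) == 2):
            problems.append("populate_columns entry %r is not a tuple of length 2" % (entry,))
    return problems


# Hypothetical settings for a CSV with 3 columns.
print(check_settings(
    3,
    [lambda value: int(value) / 1000, None, None],
    ["TimeFromStart_seconds", "EventType", "Duration_miliseconds"],
    [("EventName", "Transaction")],
))
```

An empty list means the settings satisfy the length and type rules; it does not check that the column names are spelled as the tool expects.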

# To view the combined data, you have 2 options: 
## (1) You can rename the columns to match those in a `gc_event_dataframe`, and easily import the data, which will automatically get parsed.

> IMPORTANT: if you are using option 1, you MUST use the exact spelling / capitalization of the column names, as they are searched for as "keys" in a search for data.
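
Because the match is exact, a quick comparison against the expected names can catch a misspelling early. This is a minimal sketch: `EXPECTED_COLUMNS` is copied by hand from the `columnNames()` output shown in this tutorial, and `find_unrecognized_names` is a hypothetical helper, not a function of the tool.

```python
# Copied from the columnNames() output in this tutorial.
# Note that "Duration_miliseconds" is spelled with a single "l" in the tool.
EXPECTED_COLUMNS = [
    "DateTime",
    "TimeFromStart_seconds",
    "EventType",
    "EventName",
    "AdditionalEventInfo",
    "HeapBeforeGC",
    "HeapAfterGC",
    "Duration_miliseconds",
]


def find_unrecognized_names(new_column_names, expected=EXPECTED_COLUMNS):
    """Return the names that do not exactly match an expected column (None entries are skipped)."""
    return [name for name in new_column_names
            if name is not None and name not in expected]


# "eventtype" differs from "EventType" only in capitalization, so it is reported.
print(find_unrecognized_names(["TimeFromStart_seconds", "eventtype", None]))
# prints ['eventtype']
```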

## (2) You can import the data as a dataframe, and manually select the columns to plot, with an additional line per plot.



In [1]:
# Here is an example of using method 1.
# First, I will access the COLUMN NAMES, to fit my data into the described columns.
import sys
sys.path.append("../")
from src.read_log_file import columnNames
columnNames()

['DateTime',
 'TimeFromStart_seconds',
 'EventType',
 'EventName',
 'AdditionalEventInfo',
 'HeapBeforeGC',
 'HeapAfterGC',
 'Duration_miliseconds']

In [2]:
# Current CSV file name:
old_csv_filename = "./tutorial-files/sample_transaction_data.csv" 

# My data currently has these 3 columns: 
# ms_passed, transaction_type, transaction_duration
# I would like these to become the following 3 columns (described by new_column_names)
new_column_names = ["TimeFromStart_seconds", "EventType", "Duration_miliseconds"] # How to rename the columns. Choose None to not use the data in that column

# Because the data I have in my CSV is in milliseconds, I will transform it into seconds. 
# I will not do transformations on the other two columns.
my_column_transformations = [lambda value : int(value) / 1000, None, None] # Applies to each element in the column

# I will populate the EventName column from the gc_event_dataframe with the word Transaction for every row. 
populate_columns = [("EventName", "Transaction")] # list of ('A', B): Sets all rows in Column 'A' to value B
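
To see what the first transformation does to individual values, here is a small sketch in plain Python. The sample values are hypothetical, and `to_seconds` is just a named copy of the lambda above.

```python
def to_seconds(value):
    """Same conversion as the lambda in my_column_transformations: milliseconds to seconds."""
    return int(value) / 1000


# Hypothetical raw values; int() accepts integers and integer-like strings.
sample_ms = ["352", 1507, 2740]
print([to_seconds(value) for value in sample_ms])
# prints [0.352, 1.507, 2.74]
```

Note that `int()` truncates a float such as `352.9` to `352` and raises a `ValueError` for a string such as `"352.9"`, so a column with fractional milliseconds would likely need `float(value)` instead.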

### The cell below contains the code to create your CSV, but does not need to be inspected

In [3]:
# Get the column names
import sys
import pandas as pd 
sys.path.append("../src")
from read_log_file import columnNames

def create_formatted_csv(output_csv_filename):
    # Reads the settings defined in the cell above: old_csv_filename, my_column_transformations,
    # new_column_names, and populate_columns.
    # Create a blank dataframe, and add the needed columns to it.
    df = pd.DataFrame()
    for column in columnNames():
        df[column] = ""
    
    # Gather data from columns of the original csv
    old_df = pd.read_csv(old_csv_filename)
    
    # Apply any transformations to each row in each column
    for index, transformation in enumerate(my_column_transformations):
        if transformation:
            old_df.iloc[:,index] = old_df.iloc[:,index].apply(transformation)

    # Populate the new array with data from the old, under the column names
    for index, (column, column_data) in enumerate(old_df.items()):  # items() replaces iteritems(), removed in pandas 2.0
        if new_column_names[index]:
            df[new_column_names[index]] = column_data

    # Populate columns with the same value in the new dataframe
    for column, value in populate_columns:
        df[column] = value  # A scalar is broadcast to every row

    # Write the new dataframe to a CSV file, without the index column
    df.to_csv(output_csv_filename, index = False)
    return df 
    

In [6]:
# Run the function
create_formatted_csv("./tutorial-files/example_data.csv")

Unnamed: 0,DateTime,TimeFromStart_seconds,EventType,EventName,AdditionalEventInfo,HeapBeforeGC,HeapAfterGC,Duration_miliseconds
0,,0.352,Sell,Transaction,,,,115.172530
1,,1.507,Sell,Transaction,,,,189.881817
2,,2.740,Sell,Transaction,,,,142.341934
3,,3.336,Buy,Transaction,,,,153.363000
4,,4.707,Sell,Transaction,,,,26.094972
...,...,...,...,...,...,...,...,...
229,,229.572,Sell,Transaction,,,,259.170689
230,,230.959,Sell,Transaction,,,,177.218825
231,,231.631,Sell,Transaction,,,,236.976637
232,,232.583,Sell,Transaction,,,,197.810383


## Run this code block in any cell in the analysis notebook after running all cells. Then re-run all cells.

The cell below is not expected to run here. Copy and paste its contents into the top cell of `analyze_logs_dev.ipynb`, or whichever notebook you are using for analysis.

In [None]:

csv_files_to_import = ["./tutorial-files/example_data.csv"] # Populate this with CSV files.

for csv_file in csv_files_to_import:
    gc_event_dataframes.append(pd.read_csv(csv_file))  # Read one file per iteration, not the whole list