# Initial Data Aggregation for Name that Neutrino! #
### Elizabeth Warrick ###

This Python notebook outlines the initial form of data aggregation for the Name that Neutrino project. The input is the classifications CSV exported from Zooniverse; the output is a table holding the classification counts for each subject along with its IceCube information.

First, import the necessary packages.

In [1]:
#Imports
import pandas as pd #reading csv into data frames
from pandas import json_normalize
import json #reading JSON strings into python dictionaries
from IPython.display import FileLink #downloading the final csv. 
import numpy as np

In [3]:
#Load in the classifications csv from Zooniverse
ntn_classifications = 'name-that-neutrino-classifications.csv'
#Read in that csv as a pandas data frame. 
classifications = pd.read_csv(ntn_classifications)

The following cell goes through the annotations column of classifications to pick out each event topology classification.
We start by initializing an empty list and then loop through every classification (one row per user vote).
Each row's annotations are stored as a single JSON string, which we need to transform into a list of dictionaries.
The variable "subj_id" holds the Zooniverse subject id of the IceCube event for that row.
We then loop through each element of the parsed string q; json.loads turns the string into a list of dictionaries, one per question answered.
For each element we add a new key-value pair, "id", holding the Zooniverse subject id, so that it becomes a column in the final table.
Then we append the element to the list.
Lastly, we take this list, now full of dictionaries, and put it into a data frame.

In [4]:
#Expanding JSON Fields
#Converts JSON strings into Python dictionaries, providing access to key-value pairs.
my_list=[] #goal is to make a list of dictionaries with new key-value pair with each dictionary individually. 
for i in range(len(classifications.annotations)):
    q = classifications.annotations[i] #string, need to transform into list of dictionaries
    subj_id = classifications.subject_ids[i]
    for element in json.loads(q): #list of dictionaries, element is a dictionary. 
        element["id"]=subj_id #key value pair
        my_list.append(element) #adding new dictionaries to empty list

x = pd.DataFrame.from_dict(my_list) 
#putting a dictionary into a df in this way splits it up into keys = cols and values = values. 
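
To make the expansion step concrete, here is a minimal sketch on a made-up two-row table. The sample annotation strings, the subject ids, the "T1" answer, and the names `sample` and `expand_annotations` are illustrative assumptions, not rows or helpers from the real Zooniverse export.

```python
import json
import pandas as pd

# Hypothetical stand-in for the Zooniverse export: two classifications,
# each holding a JSON string with the tasks the user answered.
sample = pd.DataFrame({
    "subject_ids": [101, 102],
    "annotations": [
        '[{"task": "T0", "value": "Cascade"}, {"task": "T1", "value": "Very sure"}]',
        '[{"task": "T0", "value": "Skimming"}]',
    ],
})

def expand_annotations(df):
    """Return one row per answered task, tagged with its subject id."""
    rows = []
    for subj_id, raw in zip(df["subject_ids"], df["annotations"]):
        for answer in json.loads(raw):  # each answer is a dictionary
            answer["id"] = subj_id      # new key-value pair
            rows.append(answer)
    return pd.DataFrame(rows)

expanded = expand_annotations(sample)
print(expanded)
```

Iterating with `zip` over the two columns avoids indexing by position, so the sketch also works if the data frame's index is not 0, 1, 2, and so on.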

This next cell takes the data frame we made above (a table where each row is a different question asked to the users about a subject) and extracts the users' classifications of each event.

We start by making a list of the different classes an event can be and an array of all the unique subject ids that have been classified. These help us tally the votes made by the users.

We then build an empty data frame whose columns are the event classes and whose index is the subject ids of the unique events.

Then we loop through unique_events and create a variable "counts": it selects the rows of the task data frame x where users classified the event topology (task T0) for that event and counts how often each answer was given.
We then create an empty dictionary called "temp_dict".
Then we loop through classes to see whether each class appears in counts for that event and how many times.
If it appears, we create a new key-value pair of the class and its count. If it doesn't appear, the value for that class is 0.
Lastly, we store temp_dict as that event's row of event_counts.

In [16]:
classes = ["Cascade","Skimming", "Through-Going Track", "Starting Track", "Stopping Track"]

unique_events = np.unique(classifications['subject_ids'])

event_counts = pd.DataFrame(columns=classes, index = unique_events)
for event in unique_events:
    counts = x.loc[(x["task"] == 'T0') & (x["id"] == event)]['value'].value_counts()
    temp_dict = {}
    for c in classes:
        if c in counts.keys():
            temp_dict[c] = counts[c] #new key-value pair
        else: #if a type of event doesn't appear
            temp_dict[c] = 0
    event_counts.loc[event] = temp_dict
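
As a cross-check on the loop above, the same tally can be sketched with `pd.crosstab`, which counts every (subject, answer) pair in one call. The small `tasks` table below, including its "T1" answers, is invented for illustration. The sketch also assumes, as in the cell above, that task "T0" is the topology question.

```python
import pandas as pd

classes = ["Cascade", "Skimming", "Through-Going Track", "Starting Track", "Stopping Track"]

# Hypothetical expanded task table: one row per answered question.
tasks = pd.DataFrame({
    "task": ["T0", "T0", "T0", "T1", "T0", "T1"],
    "value": ["Cascade", "Cascade", "Skimming", "Very sure", "Starting Track", "Very sure"],
    "id": [101, 101, 101, 101, 102, 103],
})

all_subjects = sorted(tasks["id"].unique())  # every subject that was classified
t0 = tasks[tasks["task"] == "T0"]            # keep only the topology question

# Count votes per subject and class, then add any class or subject that
# received no T0 votes as a column or row of zeros.
tally = pd.crosstab(t0["id"], t0["value"]).reindex(
    index=all_subjects, columns=classes, fill_value=0
)
print(tally)
```

The `reindex` call does the job of the `else` branch in the loop: classes nobody voted for still appear, with a count of 0.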

The following cell follows a similar process, but for accessing the IceCube information of each Zooniverse subject.
Again we start by initializing an empty list and a list of the columns we want. We then build another empty data frame from that list of column names.
We loop through each row in the subject_data column of classifications and pull out the key of each row, which is its subject id.
If that subject id is already in the list, we skip to the next row (this ensures that we don't have any repeated subjects).
If the event is not a repeat (i.e. not yet appended to the list), we append it and create a new dictionary from the value of the key-value pair, which is itself a dictionary holding the subject's data.
Then we save that dictionary as a row of the data frame icecube_info, indexed by the subject id.

In [6]:
my_list2 = [] #subject ids that have already been added
classes2 = ['retired',"Run", "Event", 'Filename','#prediction_0000', '#prediction_0001', '#prediction_0002', '#prediction_0003','#prediction_0004']
icecube_info = pd.DataFrame(columns=classes2)
for i in range(len(classifications.subject_data)):
    r = json.loads(classifications.subject_data[i]) #string -> dictionary keyed by subject id
    event2 = list(r.keys())[0] #the subject id (as a string)
    if event2 in my_list2: #skip subjects we have already stored
        continue
    else:
        my_list2.append(event2)
        dict2 = r[event2] #another dictionary, value in the key-value pair is another dict. 
        icecube_info.loc[int(event2)] = dict2 
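
The de-duplication step can also be sketched with a plain dictionary keyed by subject id, which is then turned into a data frame in one call. The `subject_data` strings below are made up for illustration (real rows carry more fields, such as Filename and the prediction columns), and `records` and `info` are hypothetical names.

```python
import json
import pandas as pd

# Hypothetical subject_data strings, shaped as {subject_id: {metadata}}.
# Subject 101 appears twice because two users classified it.
subject_data = pd.Series([
    '{"101": {"retired": null, "Run": 1200, "Event": 7}}',
    '{"101": {"retired": null, "Run": 1200, "Event": 7}}',
    '{"102": {"retired": null, "Run": 1201, "Event": 9}}',
])

records = {}
for raw in subject_data:
    parsed = json.loads(raw)  # parse each row once
    subject_id, metadata = next(iter(parsed.items()))
    # setdefault keeps the first record seen for each subject id
    records.setdefault(int(subject_id), metadata)

# One row per subject, indexed by the integer subject id.
info = pd.DataFrame.from_dict(records, orient="index")
print(info)
```

A dictionary lookup is constant time, so this avoids rescanning a growing list for every classification.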

Lastly, we join the two data frames, choose an output filename, and write the result to a CSV that we can export. Make sure index=True so that the subject id is included in the resulting table.

In [14]:
output = event_counts.join(icecube_info, how="outer")
filename_output = 'classifications_counts.csv'
output.to_csv(filename_output, index=True)
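
To illustrate what the outer join does with subjects that appear in only one table, here is a small sketch with invented numbers. The tables `counts` and `info` and the `subject_id` column label are assumptions for the example; it writes to an in-memory buffer rather than a file.

```python
import io
import pandas as pd

# Hypothetical vote counts and IceCube info, both indexed by subject id.
counts = pd.DataFrame({"Cascade": [2, 0], "Skimming": [1, 3]}, index=[101, 102])
info = pd.DataFrame({"Run": [1200, 1300]}, index=[101, 103])

# An outer join keeps every subject id found in either table;
# cells with no matching row are filled with NaN.
joined = counts.join(info, how="outer")

buffer = io.StringIO()
# index=True writes the subject id; index_label names that column.
joined.to_csv(buffer, index=True, index_label="subject_id")
csv_text = buffer.getvalue()
print(csv_text)
```

Subject 102 has counts but no run number, and subject 103 has a run number but no counts, so both rows contain empty cells in the CSV.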

The command below allows you to download the CSV.

In [15]:
FileLink(filename_output)