<a href="https://colab.research.google.com/github/chawlaKat/Notebook-Noodlings/blob/master/Bark_to_Byte.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h2>Goal: Move .arff data into a dataframe</h2>

Functionality: <br>
*  Given an .arff file, drop the attribute data types and create a pandas DataFrame whose column headers are the attribute names
*  Filter a DataFrame down to the rows where a column equals a given value ("where_equals")


Uses files (for testing):
*   sayNi.arff
*   sayHey.arff

Future work:


*   Use multi-indexing to preserve the data types
*   Use a NumPy array instead of a list, for speed
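
As one possible direction for the multi-indexing idea, the sketch below stores each attribute name together with its ARFF type in a two-level column index. The sample values are copied from the sayNi.arff output shown further down. The helper name `typed_columns`, the level names, and the choice to label nominal attributes with the placeholder `'NOMINAL'` are assumptions made for illustration, not part of the existing code.

```python
import pandas as pd

# Sample values mirroring the structure shown for sayNi.arff
attributes = [('name', 'STRING'),
              ('title', 'STRING'),
              ('braveryRating', 'NUMERIC'),
              ('lifestatus', ['living', 'dead', 'unknown', 'other'])]
data = [['King_Arthur', 'King-of-the-Britons', 9.0, 'other'],
        ['Sir_Galahad', 'The-Pure', 8.0, 'dead']]


def typed_columns(atts):
  # Hypothetical helper: build (name, type) column labels.
  # Nominal types arrive as lists, which cannot serve as labels,
  # so they are replaced with the placeholder string 'NOMINAL'.
  labels = []
  for att_name, att_type in atts:
    type_label = att_type if isinstance(att_type, str) else 'NOMINAL'
    labels.append((att_name, type_label))
  return pd.MultiIndex.from_tuples(labels, names=['attribute', 'type'])


typed_frame = pd.DataFrame(data, columns=typed_columns(attributes))

# The 'attribute' level still holds the plain attribute names
print(list(typed_frame.columns.get_level_values('attribute')))
# The 'type' level keeps the data types next to them
print(list(typed_frame.columns.get_level_values('type')))
```

A trade-off of this layout is that the allowed values of a nominal attribute are lost; keeping them would need a separate lookup.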

Notes: 

*  Variable declarations and assignments may be out of order, because I moved the cells around







<h4>Imports</h4>



In [0]:
import arff
import pandas as pd
import numpy as np

<h4>Helpers</h4>

Run these cells before calling the primary methods: a helper has to be defined by the time the code that calls it runs :D

Get arff data: returns a dictionary<br>

---<br>


Format:

*   'attributes' : [('Name Element0', 'TYPE'), ('Name Element1', 'TYPE')]
*   'data': [<br>
[Entry0 Element0, Entry0 Element1], <br>
[Entry1 Element0, Entry1 Element1],
<br>]
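
To make the format concrete, the sketch below uses a hand-written dictionary in the same shape (values taken from the sayNi.arff output further down) and pairs each data entry with its attribute names. The variable names `loaded_example` and `records` are made up for this illustration.

```python
# Hand-written stand-in for the dictionary described above
loaded_example = {
    'attributes': [('name', 'STRING'), ('braveryRating', 'NUMERIC')],
    'data': [['King_Arthur', 9.0],
             ['Sir_Robin', 1.0]],
}

# Element i of every data entry belongs to attribute i
names = [att_name for att_name, att_type in loaded_example['attributes']]
records = [dict(zip(names, entry)) for entry in loaded_example['data']]

print(records[0])  # {'name': 'King_Arthur', 'braveryRating': 9.0}
```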



In [0]:
def open_file(name):
  # Parse the .arff file; the with-block closes the file once loading is done
  with open(name) as file:
    loaded = arff.load(file)

  return loaded

Given a list of tuples (or lists), extracts the first element of each. Specifically, this gets the name of each attribute without its type, so the names can be used as column headers.

In [0]:
def get_first(tuple_list):
  # Collect element 0 (the attribute name) from each tuple
  first_only = []

  for el in tuple_list:
    first_only.append(el[0])

  return first_only
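
A quick sketch of how this behaves on attribute tuples, including a nominal attribute whose "type" is a list of allowed values. The helper is repeated so the block runs on its own; `get_first_short` is a hypothetical shorter spelling of the same idea, not something the notebook defines.

```python
def get_first(tuple_list):
  # Same logic as the helper above, repeated so this block is self-contained
  first_only = []
  for el in tuple_list:
    first_only.append(el[0])
  return first_only


def get_first_short(tuple_list):
  # Hypothetical alternative: the same extraction as a list comprehension
  return [el[0] for el in tuple_list]


sample_atts = [('name', 'STRING'),
               ('lifestatus', ['living', 'dead', 'unknown', 'other'])]

# The nominal attribute's list of allowed values is dropped along with the types
print(get_first(sample_atts))        # ['name', 'lifestatus']
print(get_first_short(sample_atts))  # ['name', 'lifestatus']
```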

Given data and a list of column headers, constructs a dataframe.

In [0]:
def to_dataframe(data, cols):
  dataframe = pd.DataFrame(data, columns=cols)
  return dataframe

<h4>Primary Methods</h4>

Given an .arff file, return a data frame

In [0]:
def arff_to_dataframe(filename):
  loaded = open_file(filename)
  data = loaded['data']
  atts = loaded['attributes']

  # Keep only the attribute names; the types are dropped here
  col_headers = get_first(atts)

  frame = to_dataframe(data, col_headers)

  return frame

Given a dataframe, a column name, and a value to compare against, return the rows where that column equals the value.

In [0]:
def where_equals(dataframe, column, is_equal_to):
  source_column = dataframe[column]
  # Boolean Series: True for each row whose value matches
  where_clause = source_column == is_equal_to

  matching_data = dataframe.loc[where_clause]

  return matching_data
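
A small sketch of the two steps inside this method, using a hand-built frame with values from the sayNi.arff test data. The comparison yields a boolean Series (one True/False per row), and `.loc` keeps the rows marked True. The names `knights`, `mask`, and `dead_knights` are assumptions for illustration.

```python
import pandas as pd

knights = pd.DataFrame(
    [['Sir_Lancelot', 11.0, 'other'],
     ['Sir_Galahad', 8.0, 'dead'],
     ['Sir_Robin', 1.0, 'dead']],
    columns=['name', 'braveryRating', 'lifestatus'])

# Step 1: element-wise comparison gives a boolean mask
mask = knights['lifestatus'] == 'dead'
print(mask.tolist())  # [False, True, True]

# Step 2: .loc with the mask keeps matching rows and their original index labels
dead_knights = knights.loc[mask]
print(dead_knights)
```

Because the original index labels are kept (1 and 2 here), calling `.reset_index(drop=True)` on the result is an option when a fresh 0-based index is wanted.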

**Basic sunny-day tests**

The tests are commented out so that this notebook works as an imported module. If running it on its own, check that liac-arff has been installed.

In [0]:
#pip install liac_arff

Collecting liac_arff
  Downloading https://files.pythonhosted.org/packages/e9/35/fbc9217cfa91d98888b43e1a19c03a50d716108c58494c558c65e308f372/liac-arff-2.4.0.tar.gz
Building wheels for collected packages: liac-arff
  Building wheel for liac-arff (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/d1/6a/e7/529dc54d76ecede4346164a09ae3168df358945612710f5203
Successfully built liac-arff
Installing collected packages: liac-arff
Successfully installed liac-arff-2.4.0


In [0]:
# OPEN_FILE TESTS
# file_name = "sayNi.arff";
# my_loaded_file = open_file(file_name)
# my_loaded_file

{'attributes': [('name', 'STRING'),
  ('title', 'STRING'),
  ('braveryRating', 'NUMERIC'),
  ('lifestatus', ['living', 'dead', 'unknown', 'other'])],
 'data': [['King_Arthur', 'King-of-the-Britons', 9.0, 'other'],
  ['Sir_Bedevere', 'The-Wise', 5.0, 'other'],
  ['Sir_Lancelot', 'The-Brave', 11.0, 'other'],
  ['Sir_Galahad', 'The-Pure', 8.0, 'dead'],
  ['Sir_Robin', 'The-Not-Quite-So-Brave-As-Sir-Lancelot', 1.0, 'dead']],
 'description': '',
 'relation': 'Arthurs_Knights'}

In [0]:
#GET_FIRST TESTS
#attribute_names = get_first(my_loaded_file['attributes'])
#attribute_names

['name', 'title', 'braveryRating', 'lifestatus']

In [0]:
# TO_DATAFRAME TESTS
#ni_data = my_loaded_file['data'];
# ni_cols = attribute_names

# to_dataframe(ni_data, ni_cols)

In [0]:
#test
#knightFrame = arff_to_dataframe('sayNi.arff')

#knightFrame

| | name | title | braveryRating | lifestatus |
|---|---|---|---|---|
| 0 | King_Arthur | King-of-the-Britons | 9.0 | other |
| 1 | Sir_Bedevere | The-Wise | 5.0 | other |
| 2 | Sir_Lancelot | The-Brave | 11.0 | other |
| 3 | Sir_Galahad | The-Pure | 8.0 | dead |
| 4 | Sir_Robin | The-Not-Quite-So-Brave-As-Sir-Lancelot | 1.0 | dead |


In [0]:
#test
# deadKnightFrame = where_equals(knightFrame, 'lifestatus', 'dead')
# deadKnightFrame

| | name | title | braveryRating | lifestatus |
|---|---|---|---|---|
| 3 | Sir_Galahad | The-Pure | 8.0 | dead |
| 4 | Sir_Robin | The-Not-Quite-So-Brave-As-Sir-Lancelot | 1.0 | dead |


<h3>This is where I noodled around, so I knew how the bits worked and how they fit together</h3>

In [0]:
# #change vars to open different file
# filename = "sayNi.arff";

# opened_file = open(filename);
# loaded_file = arff.load(opened_file);

# loaded_file

{'attributes': [('name', 'STRING'),
  ('title', 'STRING'),
  ('braveryRating', 'NUMERIC'),
  ('lifestatus', ['living', 'dead', 'unknown', 'other'])],
 'data': [['King_Arthur', 'King-of-the-Britons', 9.0, 'other'],
  ['Sir_Bedevere', 'The-Wise', 5.0, 'other'],
  ['Sir_Lancelot', 'The-Brave', 11.0, 'other'],
  ['Sir_Galahad', 'The-Pure', 8.0, 'dead'],
  ['Sir_Robin', 'The-Not-Quite-So-Brave-As-Sir-Lancelot', 1.0, 'dead']],
 'description': '',
 'relation': 'Arthurs_Knights'}

In [0]:
# att_list = loaded_file['attributes']

# att_list

[('name', 'STRING'),
 ('title', 'STRING'),
 ('braveryRating', 'NUMERIC'),
 ('lifestatus', ['living', 'dead', 'unknown', 'other'])]

In [0]:
# data_list = loaded_file['data']

# data_list

[['King_Arthur', 'King-of-the-Britons', 9.0, 'other'],
 ['Sir_Bedevere', 'The-Wise', 5.0, 'other'],
 ['Sir_Lancelot', 'The-Brave', 11.0, 'other'],
 ['Sir_Galahad', 'The-Pure', 8.0, 'dead'],
 ['Sir_Robin', 'The-Not-Quite-So-Brave-As-Sir-Lancelot', 1.0, 'dead']]

In [0]:
# att_frame = pd.DataFrame(att_list)

# att_frame

| | 0 | 1 |
|---|---|---|
| 0 | name | STRING |
| 1 | title | STRING |
| 2 | braveryRating | NUMERIC |
| 3 | lifestatus | [living, dead, unknown, other] |


In [0]:
# #for each element of att_list, get first value; append it
# #this is brute force. clean it up later.

# att_name_list = [];

# for att in att_list:
#    att_name_list.append(att[0]);
    
# att_name_list

['name', 'title', 'braveryRating', 'lifestatus']

In [0]:
# dataframe = pd.DataFrame(data_list, columns = att_name_list)

# dataframe

| | name | title | braveryRating | lifestatus |
|---|---|---|---|---|
| 0 | King_Arthur | King-of-the-Britons | 9.0 | other |
| 1 | Sir_Bedevere | The-Wise | 5.0 | other |
| 2 | Sir_Lancelot | The-Brave | 11.0 | other |
| 3 | Sir_Galahad | The-Pure | 8.0 | dead |
| 4 | Sir_Robin | The-Not-Quite-So-Brave-As-Sir-Lancelot | 1.0 | dead |


In [0]:
# #skip this, i don't get it yet
# file_ni = open_file("sayNi.arff")
# attributes = np.array(file_ni['attributes'])

# attributes

array([['name', 'STRING'],
       ['title', 'STRING'],
       ['braveryRating', 'NUMERIC'],
       ['lifestatus', list(['living', 'dead', 'unknown', 'other'])]],
      dtype=object)

In [0]:
# #skip this, i don't get it yet
# att_names_only = [];

# for att in np.nditer(attributes):
#   print(att[0])

TypeError: ignored
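
One likely reason for the TypeError: `np.nditer` refuses to iterate over an object-dtype array unless the `refs_ok` flag is passed, and even with the flag it visits single elements rather than rows, so `att[0]` would not give the attribute name. A sketch of an alternative, assuming the same two-column object array as above, is to slice the first column directly. The array is built by hand here so the block does not depend on the .arff file, and the variable names are illustrative.

```python
import numpy as np

atts = [('name', 'STRING'),
        ('title', 'STRING'),
        ('braveryRating', 'NUMERIC'),
        ('lifestatus', ['living', 'dead', 'unknown', 'other'])]

# Fill an object array cell by cell so the nominal list stays a single element
attributes = np.empty((len(atts), 2), dtype=object)
for row, (att_name, att_type) in enumerate(atts):
  attributes[row, 0] = att_name
  attributes[row, 1] = att_type

# Column 0 holds the attribute names
att_names_only = attributes[:, 0].tolist()
print(att_names_only)

# With flags=['refs_ok'] nditer runs, but it yields one element at a time:
# 4 rows x 2 columns = 8 visits, not 4 rows
visited = sum(1 for _ in np.nditer(attributes, flags=['refs_ok']))
print(visited)
```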

In [0]:
# arff_to_dataframe('sayHey.arff')

| | name | title | braveryRating | lifestatus |
|---|---|---|---|---|
| 0 | Aladdin | Street-Rat | 9.0 | living |
| 1 | Jasmine | Princess | 10.0 | living |
| 2 | Jafar | Royal-Physir | 4.0 | other |
| 3 | Sultan | Sultan | 2.0 | living |
| 4 | Abu | Monkey | 9.0 | living |
