# More file I/O, intro to Pandas

In [0]:
# import our files functionality from last time
from google.colab import files

In [0]:
# open a file for writing
with open('test.txt', 'w') as f:
  for x in range(0, 11):
    f.write(str(x**2) + '\n')

#files.download('test.txt')
%pycat test.txt
%ls

### What happens when you try to write an integer to a text file?

In [0]:
# open a file for writing
with open('test.txt', 'w') as f:
  for x in range(0, 11):
    f.write(x)
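
The cell above raises a TypeError, because a file opened in text mode only accepts strings. Below is a minimal sketch of the error and the usual fix, converting with str() first. The file name test_int.txt and the use of the temporary directory are placeholders chosen for this example, not part of the original notebook.

```python
import os
import tempfile

# placeholder path for this example (an assumption, not from the notebook)
path = os.path.join(tempfile.gettempdir(), 'test_int.txt')

error_message = None
with open(path, 'w') as f:
    try:
        f.write(5)                  # an int is not a str, so this fails
    except TypeError as err:
        error_message = str(err)
    f.write(str(5) + '\n')          # converting to a string first works

print(error_message)

# read the file back to confirm only the string version was written
with open(path, 'r') as f:
    contents = f.read()
print(repr(contents))
```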


## Binary file I/O
* So far we've just been dealing with text files, where everything is stored as a string of characters
* Binary files store raw bytes rather than characters, which is more compact and easier to interpret (for the machine, not for you!)
* Can use bytearray to convert integers in the range 0 to 255 to binary format (one byte per number)
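
As a quick illustration of the 0 to 255 limit (a sketch, with values chosen just for this example): each integer passed to bytearray becomes exactly one byte, and anything outside that range is rejected with a ValueError.

```python
# each integer in 0..255 is stored as exactly one byte
small = bytearray([0, 1, 2, 255])
print(len(small))   # one byte per number

# a value outside 0..255 does not fit in a single byte
out_of_range_error = None
try:
    bytearray([256])
except ValueError as err:
    out_of_range_error = str(err)

print(out_of_range_error)
```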

In [0]:
# open a file for writing binary
with open('test.bin', 'wb') as f:
  # generate a list of numbers, use bytearray to convert
  # numbers over the range 0:255 to binary format
  bytes_to_write = bytearray([0,1,2,3,4,5])

  # write to file!
  f.write(bytes_to_write)

# have a look!
files.download('test.bin')
%pycat test.bin

In [0]:
# now read it back in
with open('test.bin', 'rb') as f:
  # remember that f.read() reads in the entire file...
  bytes_read = f.read()

# notice that in binary mode f.read() returns a bytes object (printed as b'...'), not a str
print(bytes_read)
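
A bytes object behaves like a read-only sequence of small integers, so the original numbers can be recovered without any extra libraries. A small sketch, using an in-memory bytes object as a stand-in for the contents of test.bin:

```python
# stand-in for what f.read() returns for the six bytes written earlier
bytes_read = bytes([0, 1, 2, 3, 4, 5])

print(type(bytes_read))    # a bytes object, not a str
print(bytes_read[3])       # indexing gives back an int

# converting to a list recovers the numbers that were written
numbers = list(bytes_read)
print(numbers)
```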

## Can use numpy to make reading/writing binary formats more human friendly

In [0]:
import numpy as np

#### now read it back in - note that you HAVE to know the data type!
* if you use the wrong dtype, numpy groups the bytes incorrectly (for example, two bytes per number for int16 instead of one) and you'll get the wrong numbers (or none...)


In [0]:
with open('test.bin', 'rb') as f:
  bytes_read = np.fromfile(f, dtype=np.int8)
  #bytes_read = np.fromfile(f, dtype=np.int16)
  
print(bytes_read)
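
To see why the dtype matters, here is a sketch that reads the same six bytes two ways. The path is a placeholder in the temporary directory, and '<i2' (little-endian 16-bit integer) is used in place of np.int16 so the result does not depend on the machine's byte order.

```python
import os
import tempfile

import numpy as np

# placeholder path for this example (an assumption, not from the notebook)
path = os.path.join(tempfile.gettempdir(), 'dtype_demo.bin')

# write the bytes 0..5, one byte per number
np.array([0, 1, 2, 3, 4, 5], dtype=np.uint8).tofile(path)

# correct dtype: one byte per number
as_int8 = np.fromfile(path, dtype=np.int8)

# wrong dtype: two bytes are combined into each number
as_int16 = np.fromfile(path, dtype='<i2')

print(as_int8)
print(as_int16)
```

With two bytes per value, the pair (0, 1) is read as 0 + 1*256 = 256, the pair (2, 3) as 2 + 3*256 = 770, and so on, which is why only three (wrong) numbers come back.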

## JSON (JavaScript Object Notation) format
* straightforward and standardized way of storing and exchanging data files
* kind of like a csv or a txt file in nature, but more sophisticated
* developed as a way of transferring JavaScript objects between browsers and servers, but now frequently used for all types of data and languages
* built from a small set of value types: 
  * objects (like dictionaries)
  * arrays (like lists)
  * strings (sequence of characters in double quotes)
  * numbers
  * true, false, and null

[link to main page](http://json.org/)
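
A short sketch of how Python values map onto JSON text with json.dumps (the example values are made up for illustration):

```python
import json

as_object = json.dumps({"name": "John", "age": 30})   # dict  -> JSON object
as_array = json.dumps([1, 2, 3])                      # list  -> JSON array
as_string = json.dumps("Giant")                       # str   -> JSON string
as_number = json.dumps(3.5)                           # float -> JSON number
as_literals = json.dumps([True, False, None])         # -> true, false, null

for text in (as_object, as_array, as_string, as_number, as_literals):
    print(text)
```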

In [0]:
# import json module
import json

In [0]:
# build a dictionary with a couple of different data types
# (strings and an integer)
user_profile = {
  "name": "John",
  "age": 30,
  "bicycle": "Giant",
}

out = json.dumps(user_profile)
print(out)

### Now write a .json file to disk - very similar to file creating/writing that we did above

In [0]:
with open('test.json', 'w') as outfile:
  json.dump(user_profile, outfile)
  
files.download('test.json')

## Now load the json file back in!

In [0]:
with open('test.json', 'r') as infile:
  x = json.load(infile)
  
# and you get back a dictionary
x['bicycle']
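
One thing to keep in mind is that a round trip through JSON does not always return exactly what went in, because JSON has fewer types than Python. A small sketch with made-up data: tuples come back as lists, and non-string dictionary keys come back as strings.

```python
import json

original = {1: "one", "point": (3, 4)}

# serialize to a JSON string, then parse it straight back
restored = json.loads(json.dumps(original))

print(original)
print(restored)
```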

## Basic data structures - start with Series then build up to DataFrames

[Pandas quick start guide for Series](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#series)

* A **Series** is a 1D array that can hold any type of data (numeric types, non-numeric, Python objects and so forth).
    * Each entry is **labeled** with an index that is used to keep track of what each entry is, and can be used to lookup the value corresponding to each index during analysis (remember keys in dictionaries? similar idea)
    * These labels are fixed - they will always index the same value unless you explicitly break that link.
    * The list of labels that forms the index can either be declared upon series creation or, by default, it will range from 0 to len(data)-1.
        * If you're going to use Pandas to organize your data, specifying usable and informative labels is a good idea because that's one of the main advantages of organizing your data in this manner
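
A minimal sketch of the two ways the index can be set (the values and labels are made up for illustration): left to the default, or declared when the Series is created.

```python
import pandas as pd

# no index given: labels default to 0 .. len(data)-1
default_index = pd.Series([10, 20, 30])

# index declared at creation: labels are whatever we choose
labeled = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(list(default_index.index))
print(list(labeled.index))
print(labeled['b'])    # look up a value by its label
```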
        

**Warning**. Pandas will allow you to specify non-unique labels. This can be ok for operations that don't rely on indexing by label. However, operations that do rely on unique labels for indexing may throw an unexpected error, so in general it's good practice to use unique labels!
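
A sketch of what non-unique labels look like in practice, using a made-up Series. Looking up a duplicated label returns several entries instead of one value, and reindex is one example of an operation that refuses to work on a duplicated index.

```python
import pandas as pd

# the label 'a' is used twice
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])
print(dup.index.is_unique)

# a duplicated label returns a Series with every match, not a single value
matches = dup['a']
print(len(matches))

# reindexing needs unique labels, so it raises a ValueError here
reindex_error = None
try:
    dup.reindex(['a', 'b'])
except ValueError as err:
    reindex_error = str(err)

print(reindex_error)
```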


## Import Pandas and random (for generating example data)

In [0]:
# import pandas under its usual alias, plus the random module for generating data
import pandas as pd
import random

## Create a series of data stored in a list, and then make a set of index labels

In [0]:
# seed the generator
random.seed(0)

# For this simulation, let's have 12 subjects, and some data
# generated pseudo-randomly from a uniform distribution over integers
N = 12
M = 40

# init a list
data=[]

# fill it up with random integers over the range 0 to M (both ends included)
for i in range(0,N):
  data.append(random.randint(0,M))

print(data)
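
The point of seeding the generator is reproducibility: the same seed gives the same sequence of numbers every time. A small sketch, where draw is a hypothetical helper written just for this example:

```python
import random

def draw(seed, n=12, upper=40):
    """Return n pseudo-random integers in [0, upper] after seeding."""
    random.seed(seed)
    return [random.randint(0, upper) for _ in range(n)]

first = draw(0)
second = draw(0)

# same seed, same numbers
print(first == second)
```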

## Make a list of subject names for use as index labels


In [0]:
label_prefix = 'Sub'
index=[]
for n in range(0,N):
    index.append(label_prefix+str(n))    
    
# print our list of index labels
print('Index labels: ', index, '\n')

## Then make our Pandas Series by passing in our data array and our index labels

In [0]:
s = pd.Series(data, index=index)
print(s)

## Note that each subject is now a field in the series and can be used to retrieve the corresponding value...there are a few ways to do this
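* can access by position
* can access by field (attribute)
* can access by index label

In [0]:
# access by position; use .iloc, since plain s[0] is deprecated
# for positional access in recent pandas versions
print(s.iloc[0])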

In [0]:
# access by field
print(s.Sub11)

In [0]:
# access by index label
print(s['Sub0'])
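
The three access styles side by side, as a sketch on a small made-up Series (s_demo is a stand-in so the example runs on its own):

```python
import pandas as pd

s_demo = pd.Series([24, 7, 31], index=['Sub0', 'Sub1', 'Sub2'])

by_position = s_demo.iloc[0]       # first entry, whatever its label is
by_label = s_demo.loc['Sub0']      # entry whose label is 'Sub0'
by_attribute = s_demo.Sub0         # same entry, attribute style

print(by_position, by_label, by_attribute)
```

Attribute access only works when the label is a valid Python name and does not clash with an existing Series attribute or method, so label-based access is the safer default.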

## Can also use labels to check for membership or to iterate over labels

In [0]:
# check for membership
print('Sub11' in s) 
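
Note that the in operator on a Series checks the index labels, not the stored values (just like in on a dictionary checks keys). A sketch with a small made-up Series:

```python
import pandas as pd

s_demo = pd.Series([5, 7], index=['Sub0', 'Sub1'])

label_found = 'Sub0' in s_demo        # checks the labels
value_as_label = 5 in s_demo          # 5 is a value, not a label
value_found = 5 in s_demo.values      # check the values explicitly

print(label_found, value_as_label, value_found)
```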

#### iterate over index labels


In [0]:
for i in s.index:
  print(i)

#### iterate over values...

In [0]:
for v in s.values:
  print(v)   

In [0]:
# can also get to the values more directly like this:
for d in s:
  print(d)
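
When both the label and the value are needed in the same loop, Series.items() yields them as pairs. A sketch on a small made-up Series:

```python
import pandas as pd

s_demo = pd.Series([24, 7, 31], index=['Sub0', 'Sub1', 'Sub2'])

pairs = []
for label, value in s_demo.items():
    print(label, value)
    pairs.append((label, value))
```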