## Unit 2.1.4 Working with Files

### CSVs
The pandas read_csv() function takes a string containing the path to the file you want to read and returns a data frame object.  To write a data frame out to a CSV file, use its .to_csv() method.  Both accept optional keyword arguments that let you adjust how the file is parsed or written.

In [4]:
import pandas as pd

# The file must be in the same folder as the Jupyter notebook; otherwise, pass the full file path.
df = pd.read_csv('purchases.csv')
print(df)

  Unnamed: 0 country  ad_views  items_purchased
0     George      US        16                2
1       John     CAN        42                1
2     Thomas     CAN        32                0
3      James      US        13                8
4     Andrew     CAN        63                0
5     Martin      US        19                5
6    William      US        65                7
7    Zachary      US        23                3
8    Millard     CAN        16                0
9   Franklin      US        77                5


purchases.csv looks like this:

,country,ad_views,items_purchased

George,US,16,2

John,CAN,42,1

Thomas,CAN,32,0

James,US,13,8

Andrew,CAN,63,0

Martin,US,19,5

William,US,65,7

Zachary,US,23,3

Millard,CAN,16,0

Franklin,US,77,5


In [13]:
# Writes df to a CSV file named my_data.csv at the given path
df.to_csv('/Users/joannelin410/Desktop/my_data.csv')
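The "Unnamed: 0" column in the output above appears because the first header field in purchases.csv is empty. As a minimal sketch of the optional keyword arguments mentioned earlier, the example below reads the same layout from an inline string (a stand-in for the file, shortened to three rows) and uses index_col=0 so the names become the row index instead.

```python
import io
import pandas as pd

# Inline stand-in for purchases.csv (first three rows only).
csv_text = """,country,ad_views,items_purchased
George,US,16,2
John,CAN,42,1
Thomas,CAN,32,0
"""

# index_col=0 uses the first (unnamed) column as the row index,
# so no 'Unnamed: 0' column is created.
df_named = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_named)

# Calling to_csv() without a path returns the CSV text as a string.
# Passing index=False would leave the row labels out of the output.
round_trip = df_named.to_csv()
print(round_trip)
```

Reading from io.StringIO is only a convenience for the example; with a real file you would pass the path as before.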

### JSON - JavaScript Object Notation

JSON and XML files allow for more customizable and flexible data storage.  They are known as semi-structured files.
The flexibility of semi-structured data often comes with additional complexity.  JSON data can be deeply nested and may take substantial processing before you can get it into the form you want to work with.

JSON is a way to represent a JavaScript object as a string.  "Objects" in JavaScript are collections of key-value pairs, like dictionaries in Python.  
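To make the dictionary comparison concrete, here is a small sketch using the json module from the Python Standard Library. The record and its "name" key are made up for illustration.

```python
import json

# A Python dictionary with made-up values.
purchase = {"name": "George", "country": "US", "ad_views": 16}

# dumps() serializes the dictionary to a JSON string.
as_text = json.dumps(purchase)
print(type(as_text), as_text)

# loads() parses a JSON string back into a Python dictionary.
back = json.loads(as_text)
print(back == purchase)
```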

You can create a data frame from a JSON file with read_json().

purchases.json looks like this:

{
  "Unnamed: 0":{
    "0":"George",
    "1":"John",
    "2":"Thomas",
    "3":"James",
    "4":"Andrew",
    "5":"Martin",
    "6":"William",
    "7":"Zachary",
    "8":"Millard",
    "9":"Franklin"
  },
  "country":{
    "0":"US",
    "1":"CAN",
    "2":"CAN",
    "3":"US",
    "4":"CAN",
    "5":"US",
    "6":"US",
    "7":"US",
    "8":"CAN",
    "9":"US"
  },
  "ad_views":{
    "0":16,
    "1":42,
    "2":32,
    "3":13,
    "4":63,
    "5":19,
    "6":65,
    "7":23,
    "8":16,
    "9":77
  },
  "items_purchased":{
    "0":2,
    "1":1,
    "2":0,
    "3":8,
    "4":0,
    "5":5,
    "6":7,
    "7":3,
    "8":0,
    "9":5
  }
}

In [16]:
df = pd.read_json('purchases.json')
print(df)

  Unnamed: 0  ad_views country  items_purchased
0     George        16      US                2
1       John        42     CAN                1
2     Thomas        32     CAN                0
3      James        13      US                8
4     Andrew        63     CAN                0
5     Martin        19      US                5
6    William        65      US                7
7    Zachary        23      US                3
8    Millard        16     CAN                0
9   Franklin        77      US                5


You can normalize nested JSON into a flat table using pandas.json_normalize() (in pandas versions before 1.0, the function lives at pandas.io.json.json_normalize()). See here: http://pandas.pydata.org/pandas-docs/stable/io.html#normalization
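A minimal sketch of normalization, assuming pandas 1.0 or later, where the function is available as pd.json_normalize(). The nested records and the "location" and "stats" keys are invented for illustration.

```python
import pandas as pd

# Hypothetical nested records; the structure is an assumption for this example.
records = [
    {"name": "George",
     "location": {"country": "US"},
     "stats": {"ad_views": 16, "items_purchased": 2}},
    {"name": "John",
     "location": {"country": "CAN"},
     "stats": {"ad_views": 42, "items_purchased": 1}},
]

# Nested keys are flattened into dotted column names
# such as 'stats.ad_views'.
flat = pd.json_normalize(records)
print(flat.columns.tolist())
print(flat)
```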

You can output your data frame as a JSON file using .to_json().
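As a rough sketch, the example below serializes a small made-up data frame to a JSON string and reads it back. Called without a path, to_json() returns the text instead of writing a file; the orient argument controls the layout.

```python
import io
import json
import pandas as pd

df_small = pd.DataFrame({"country": ["US", "CAN"], "ad_views": [16, 42]})

# The default layout for a data frame is column -> {row label -> value},
# the same shape as purchases.json above.
json_text = df_small.to_json()
print(json_text)

# orient="records" produces a list with one object per row instead.
records_text = df_small.to_json(orient="records")
print(records_text)

# Read the default layout back into a data frame.
restored = pd.read_json(io.StringIO(json_text))
print(restored)
```

To write to disk instead, pass a file path as the first argument to to_json().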

### XML
XML, or "eXtensible Markup Language", is a hierarchical semi-structured data format, like JSON.  Both are widely used to transfer data over the web.  The newer JSON format is more common than the older, clunkier XML.  

Older versions of pandas do not have an XML equivalent to read_csv() and read_json() (pandas 1.3 added read_xml()), so here we use the xml module from the Python Standard Library to read in XML files and convert them to an element tree.  Then we can manually process the element tree into a list that we can feed into pandas.

Read more about element trees here: https://docs.python.org/3.6/library/xml.etree.elementtree.html

When generating data files, you should probably avoid XML because it takes more work to process.  

In [19]:
# Import Pandas and a part of the xml module.
import pandas
import xml.etree.ElementTree as ET


# Load and parse the XML file into a tree.
tree = ET.parse('purchases.xml')

# Find the root of the tree. This is the node of the tree where we'll
# start our iteration.
root = tree.getroot()

# Define a custom function to loop over our tree, extract values, and
# return a two-dimensional list.
# If you are working with a differently structured XML file, you'll need
# to iterate over your XML tree differently.

def xml_to_list(root):
    result = []
    for row in root:
        row_list = []
        for column in row:
            row_list.append(column.text)
        result.append(row_list)
    return result
    
# Feed our two-dimensional list into Pandas.
df = pandas.DataFrame(xml_to_list(root))
print(df)


          0    1   2  3
0    George   US  16  2
1      John  CAN  42  1
2    Thomas  CAN  32  0
3     James   US  13  8
4    Andrew  CAN  63  0
5    Martin   US  19  5
6   William   US  65  7
7   Zachary   US  23  3
8   Millard  CAN  16  0
9  Franklin   US  77  5
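The data frame above has numeric column labels and every value is a string, because only the element text was collected. One possible variation, sketched below, keeps the tag names as column labels and converts the numeric columns. The contents of purchases.xml are not shown in these notes, so the XML layout and tag names here are assumptions.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Hypothetical XML layout: the tag names are assumed, not taken
# from the real purchases.xml.
xml_text = """<purchases>
  <row><name>George</name><country>US</country><ad_views>16</ad_views><items_purchased>2</items_purchased></row>
  <row><name>John</name><country>CAN</country><ad_views>42</ad_views><items_purchased>1</items_purchased></row>
</purchases>"""

root = ET.fromstring(xml_text)

# Build one dictionary per row, mapping each child tag to its text.
rows = [{child.tag: child.text for child in row} for row in root]
df_xml = pd.DataFrame(rows)

# Element text is always a string, so convert numeric columns explicitly.
numeric_cols = ["ad_views", "items_purchased"]
df_xml[numeric_cols] = df_xml[numeric_cols].astype(int)
print(df_xml)
```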


### Python open()
Python offers a more general-purpose way to open any file: the built-in open() function.

In [23]:
# Let's open the poem.txt file, create a file object, and print out the
# file text line by line.

with open('poem.txt') as poem_file:
    text = poem_file.readlines()
    print("This file is {} lines long".format(len(text)))
    for line in text:
        print(line)



This file is 19 lines long
Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

Complex is better than complicated.

Flat is better than nested.

Sparse is better than dense.

Readability counts.

Special cases aren't special enough to break the rules.

Although practicality beats purity.

Errors should never pass silently.

Unless explicitly silenced.

In the face of ambiguity, refuse the temptation to guess.

There should be one-- and preferably only one --obvious way to do it.

Although that way may not be obvious at first unless you're Dutch.

Now is better than never.

Although never is often better than *right* now.

If the implementation is hard to explain, it's a bad idea.

If the implementation is easy to explain, it may be a good idea.

Namespaces are one honking great idea -- let's do more of those!


open() creates a <b>file object</b>. The .readlines() method of the file object creates a list of strings, where each element of the list is a line of text from the input file.  Learn more here: https://docs.python.org/3/library/io.html#i-o-base-classes

open() will leave the file open until you close it.  The file object's .close() method closes the file.  Keeping a file open can keep resources tied up and cause unexpected trouble.  Using the "with" statement as above means you don't have to remember to call .close(), because a file opened in a "with" statement is closed automatically once the "with" block exits.  

**Using "with" when manually opening files is best practice.**
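The short sketch below illustrates the automatic close. It writes a throwaway file in a temporary folder (the file name is arbitrary) and checks the file object's closed attribute inside and after the "with" block.

```python
import os
import tempfile

# Create a small throwaway file to read back.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w", encoding="utf-8") as out_file:
    out_file.write("Beautiful is better than ugly.\n")
    out_file.write("Explicit is better than implicit.\n")

with open(path, encoding="utf-8") as in_file:
    # Strip the trailing newline from each line so print() does not
    # add a blank line between lines.
    lines = [line.rstrip("\n") for line in in_file]
    print(in_file.closed)  # still open inside the block

print(in_file.closed)  # closed once the block has exited
for line in lines:
    print(line)
```

The rstrip("\n") call also explains the blank lines in the poem output above: each line read from a file keeps its newline, and print() adds another.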

### A word about encoding

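All strings in Python 3 are <b>Unicode</b> strings, and UTF-8 is the default encoding for Python source code.  The default encoding that open() uses for text files, however, can depend on the platform and Python version unless you pass the encoding argument explicitly.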

However, files may have been created with a different encoding, and it is not always possible to determine a file's encoding automatically and decode it correctly.  You often need to make an educated guess about the likely encoding and use trial and error to test it.

English-language Microsoft Windows uses the cp1252 encoding, Cyrillic Windows uses cp1251, etc.  Microsoft Windows is a big culprit in encoding problems.
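The trial-and-error approach can be sketched roughly as follows. The helper decode_with_fallback() is a hypothetical function written for this example, not part of any library, and the list of candidate encodings is an assumption you would adapt to your data.

```python
# Bytes as they might appear in a file saved on English-language Windows.
raw = "café".encode("cp1252")

def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding used)."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")

text, used = decode_with_fallback(raw)
print(text, used)
```

A successful decode does not prove the guess was right, since some encodings accept almost any byte sequence, so it is worth inspecting the result. Once you know the encoding, you can pass it directly when opening the file, for example open(path, encoding="cp1252").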

Read about the history of Unicode here: https://docs.python.org/3/howto/unicode.html