## File I/O

## Where We Are
Representing Information
Functional Programming
Object-Oriented Programming
File I/O
Where We Are In The Course: File I/O

## What We're Doing
Principles of File I/O: File systems; the pathlib module; and Path-like objects
Plain Text: What are file-like objects; how to responsibly open (and close) files; and how to read data from and write data to external files
JSON: What is JSON; loading JSON data from a file into Python; dumping data from Python into a file
CSV: What is CSV; reading CSV data from a file into Python; writing data from Python to a file.
Lesson Overview: We'll learn about File I/O principles, plain text, JSON, and CSV.

## Learning Objectives
By the end of this lesson, you will be able to:

Read and write generic data from and to external files
Use the JSON library to load and dump JSON-formatted data
Use the CSV library to load and dump CSV-formatted data
Combine all of the techniques we've seen to create advanced programs

## Files, Formats, and Structured Data
Data at rest are stored in files.
File extensions can inform us about the structure of the content within.
To connect Python to a file, we open a file object.
Remember, files can come from anywhere – your computer, a network, even the cloud!
The Big Picture: Extract, Transform, Load

## The Big Picture
When working with files, it helps to think in three steps:

Extract data from files into Python.
Transform the data within Python, according to whatever it is you actually want to do.
Load the data from Python back into a file by writing it out.
Data flows into and out of Python through file-like objects - Python objects that can connect to the external filesystem.

## Applications of File I/O
Consume external data (large data sets)
Produce external results (machine learning models, analyses)
Abstractly, model the consumption and production of data

## New Terms
Term	Definition
File	A logical collection of digital information into a single unit.
File System	The mechanism by which a digital system divides its storage into an organized collection of logical files.
File Type	The specific format with which a file's data should be interpreted, often implicit in the file's extension or data.

https://en.wikipedia.org/wiki/List_of_file_formats
https://simple.wikipedia.org/wiki/File_system

## File Systems and Paths

## pathlib

Python's built-in pathlib module provides object-oriented filesystem paths, and it's a useful introduction to thinking about the layout of files on a filesystem.

There are loads of useful things you can do with Paths (the documentation page is well worth reading), but you'll be able to get by with the following pieces of functionality:

from pathlib import Path

here = Path('.')  # A Path for the interpreter's current working directory (which is usually, but not necessarily, the directory containing your Python files).
here = here.resolve()  # Resolve symbolic links and `..` segments into an absolute `Path`.
parent = here.parent  # Navigate up the chain of parents. This is a purely lexical operation, so it's important to call `.resolve()` or a similar method first.
child = here / 'subfolder' / 'subfile.txt'  # Navigate to a subfolder or subfile. Hooray for magic methods (`__truediv__`) and polymorphism!
To reiterate, the important concept is that files are organized on a filesystem through nested directories ("folders"), with a specific file path from the root folder or a drive. The important detail is that Python's pathlib module provides good tools for working with files and file paths.
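
As a small illustration of those pieces working together, here is a minimal sketch that builds a nested path inside a throwaway temporary directory and inspects it. The names `subfolder` and `subfile.txt` are placeholders, and the use of `tempfile` is only an assumption made so the sketch does not touch your real files.

```python
import tempfile
from pathlib import Path

# Work inside a throwaway directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    target = root / 'subfolder' / 'subfile.txt'  # Placeholder names.

    # Create the intermediate folder, then the file itself.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text('hello\n', encoding='utf-8')

    name = target.name        # The final path component.
    suffix = target.suffix    # The file extension, including the dot.
    exists = target.exists()  # Whether the path exists on the filesystem.
    same_parent = target.parent.parent == root  # Two steps up is the root.
    listing = sorted(p.name for p in (root / 'subfolder').iterdir())
```

Note that `/`, `.name`, `.suffix`, and `.parent` only manipulate the path itself, while `.mkdir()`, `.write_text()`, `.exists()`, and `.iterdir()` actually touch the filesystem.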

## Plain Text Files

The general pattern for reading and writing plain text from files is:

Extract data from a file into Python
Open a file-like object f
Call f.read() or a similar method
Do something with the data, now within Python
Write data from Python to a file.
Open a file-like object f
Call f.write(line) or a similar method
Plain Text File I/O diagram

## Opening a File
The open function in Python can open a file-like object from a path-like object, for reading, writing, or perhaps both (there are a few more flags as well).

with open(filepath, mode) as f:
    # Use the file-like object `f`
    ...
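The mode can be 'r' for reading (the default) or 'w' for writing, which empties an existing file before writing to it. Other options include 'a' for appending to the end of a file and 'x' for creating a file that must not already exist.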
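
To make the difference between modes concrete, the following sketch writes to the same file with 'w', then 'a' (append, one of the other available modes), then 'w' again. The file name `log.txt` and the temporary directory are assumptions chosen for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'log.txt'  # Placeholder file name.

    with open(path, 'w') as f:    # 'w' creates the file (or empties it).
        f.write('first\n')
    with open(path, 'a') as f:    # 'a' adds to the end of the file.
        f.write('second\n')
    with open(path, 'r') as f:
        after_append = f.read()

    with open(path, 'w') as f:    # 'w' again discards the old contents.
        f.write('third\n')
    with open(path) as f:         # The mode defaults to 'r'.
        after_rewrite = f.read()
```

After the append, the file holds both lines; after the second 'w', only the last line remains.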

The new syntax of the with statement ensures that the opened file-like object will be closed when we exit the block - it's good hygiene when working with resources like files. It's roughly equivalent to:

f = open(filepath, mode)
try:
    # Use the file-like object `f`
    ...
finally:
    f.close()
In this way, even if an error occurs in the body, Python will make sure to immediately close the file object.
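
One way to convince yourself of this is to check the file object's `closed` attribute. In this sketch, the file name and the simulated `ValueError` are assumptions made purely for demonstration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'example.txt'  # Placeholder file name.
    path.write_text('some data\n', encoding='utf-8')

    try:
        with open(path, 'r') as f:
            closed_inside = f.closed  # Still open inside the block.
            raise ValueError('simulated failure')
    except ValueError:
        pass

    # The name `f` still exists after the block, but the file is closed.
    closed_after = f.closed
```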

## QUIZ QUESTION
Which of the following is the most appropriate syntax to open a file named my-outfile-file.txt for writing and write the string "Hello, world!" to it?

# (A)
with open('my-outfile-file.txt', 'w') as outfile: 
    outfile.write("Hello, world!")

# (B)
outfile = open('my-outfile-file.txt', 'w')
outfile.write("Hello, world!")
# Oops! We forgot to close the file.

Answer = A


## Example
Suppose that we have a file queries.txt:

python programmer

UDACITY

Web developer

and that we wish to write a program that will normalize the queries in this file by lowercasing them all and removing the extra lines between each query.

# (1) Extract data from the `queries.txt` file into Python.
with open('queries.txt', 'r') as infile:
    contents = infile.read()  # Read one big string - the contents of this file.


# (2) Transform the data within Python.
queries = contents.split('\n')  # Split the string into a list by line breaks.
normalized = [query.strip().lower() for query in queries[::2]]  # Normalize each query with the stripped, lowercased version of every other line.

# (3) Write the normalized queries out to a file.
with open('normalized-queries.txt', 'w') as outfile:
    for query in normalized:
        outfile.write(query + '\n')  # It might be better to use outfile.writelines here, but let's practice `.write`-ing strings.
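
As a variation on the program above, here is a hedged sketch that iterates over the file line by line, skips blank lines instead of taking every other line, and uses `writelines`. The function name `normalize_queries` is hypothetical, and the behavior differs from the original if the file contains non-blank lines at the skipped positions.

```python
import tempfile
from pathlib import Path


def normalize_queries(inpath, outpath):
    """Write the stripped, lowercased, non-blank lines of inpath to outpath."""
    with open(inpath, 'r') as infile:
        # Iterating over a file-like object yields one line at a time.
        normalized = [line.strip().lower() for line in infile if line.strip()]
    with open(outpath, 'w') as outfile:
        # writelines does not add line breaks, so we add them ourselves.
        outfile.writelines(query + '\n' for query in normalized)
    return normalized


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'queries.txt'
    dst = Path(tmp) / 'normalized-queries.txt'
    src.write_text('python programmer\n\nUDACITY\n\nWeb developer\n', encoding='utf-8')

    result = normalize_queries(src, dst)
    written = dst.read_text(encoding='utf-8')
```

Reading line by line avoids holding the whole file in one big string, which matters more as files get larger.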
## QUIZ QUESTION
Suppose that we open a file f with the 'r' mode. What is the type of the object returned by f.read()?

Answer = str
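
For comparison, this short sketch shows what `read` and two of the "similar methods", `readline` and `readlines`, return for a file opened in 'r' mode. The file contents are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'queries.txt'  # Placeholder file with two lines.
    path.write_text('python programmer\nUDACITY\n', encoding='utf-8')

    with open(path, 'r') as f:
        whole = f.read()       # One str with the entire contents.

    with open(path, 'r') as f:
        first = f.readline()   # One str: the next line, line break included.
        rest = f.readlines()   # A list of str: all of the remaining lines.
```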

## New Terms
Term	Definition
File Mode	The mode in which a file-like object is opened, including 'r' (for reading, the default) or 'w' for writing.
File-like Object	An object that behaves as a file, that can perhaps be read from, written to, or more.
Path-like Object	An object that behaves like a Path, in that it can identify a file path.
Plain Text	A file format that interprets binary data as plain text values.
The open function	A built-in function that opens a file-like object at a path.
The pathlib module	A built-in module that provides object-oriented filesystem paths.
The with keyword	A keyword that introduces a context manager for managing resources, such as automatically closing a file.

## Further Reading
open: The built-in open function.

https://www.w3schools.com/python/python_file_handling.asp

https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files

https://docs.python.org/3/library/io.html#io-overview

https://docs.python.org/3/library/pathlib.html

https://www.educative.io/edpresso/the-with-statement-in-python




## File I/O

In this exercise, you'll write a function count_unique_words that prints the ten most common unique words from a text file.

def count_unique_words(filename):
    ...
Concretely, we'll be using hamlet.txt, a text file containing the full text of "The Tragedy of Hamlet, Prince of Denmark" released by Project Gutenberg under their license.

Your output might look like:

the 1109
and 763
of 735
to 673
I 514
a 499
in 455
my 443
you 423
HAMLET. 359
We won't worry too much about punctuation, capitalization, or other nuances of language. For this exercise, it's safe to say that, given a line of text from a text file, the "words" within that line are the elements that result when you split the line on any whitespace.

Hint: This will be significantly easier if you use a data type from Python's built-in collections module - collections.Counter. You can read more about collections.Counter in the Python documentation - https://docs.python.org/3/library/collections.html#collections.Counter 
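
If you haven't used collections.Counter before, here is a minimal sketch of how it behaves on a couple of made-up lines of text. It is not a solution to the exercise, just a feel for the data type.

```python
from collections import Counter

# Count the whitespace-separated "words" in one line of text.
counts = Counter('to be or not to be'.split())

# A Counter can be updated with more words later, one line at a time.
counts.update('to thine own self'.split())

# most_common(n) returns the n most frequent (word, count) pairs.
top_two = counts.most_common(2)
```

Here `top_two` is a list of tuples, so you can loop over it and print each word alongside its count.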

## License
This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away, or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org. If you are not located in the United States, you'll have to check the laws of the country where you are located before using this eBook. http://www.gutenberg.org/

## JSON
JSON Format
JSON stands for "JavaScript Object Notation", and is a file format for encoding structured data.

JSON elements include strings, numbers, booleans, and null (like Python's str, int/float, bool, and None), as well as the (sequence) aggregate array (like Python's list) and the (associative) aggregate object (like Python's dict).
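
To see this correspondence in practice, the sketch below parses a small, made-up JSON string with `json.loads` (the string-based sibling of `json.load`) and records the Python type of each value.

```python
import json

# A made-up JSON object exercising each kind of JSON element.
text = '{"name": "iris", "count": 3, "width": 0.2, "wild": true, "notes": null, "tags": ["a", "b"]}'

data = json.loads(text)  # The JSON object becomes a Python dict.

# Map each key to the name of the Python type its value was parsed into.
types = {key: type(value).__name__ for key, value in data.items()}
```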

Let's look at some example files:

[
  {
    "class": "Iris-setosa",
    "petallength": 1.4,
    "petalwidth": 0.2,
    "sepallength": 5.1,
    "sepalwidth": 3.5
  },
  {
    "class": "Iris-versicolor",
    "petallength": 4.7,
    "petalwidth": 1.4,
    "sepallength": 7,
    "sepalwidth": 3.2
  },
  {
    "class": "Iris-virginica",
    "petallength": 6,
    "petalwidth": 2.5,
    "sepallength": 6.3,
    "sepalwidth": 3.3
  }
]
The above file represents an array of objects, each of which represents a particular flower. Each of these flowers has a class, a petal length and width, and a sepal length and width.

How about the information in a tweet?

{
  "created_at": "Thu Apr 06 15:24:15 +0000 2017",
  "id_str": "850006245121695744",
  "text": "1\/ Today we\u2019re sharing our vision for the future of the Twitter API platform!\nhttps:\/\/t.co\/XweGngmxlP",
  "user": {
    "id": 2244994945,
    "name": "Twitter Dev",
    "screen_name": "TwitterDev",
    "location": "Internet",
    "url": "https:\/\/dev.twitter.com\/",
    "description": "Your official source for Twitter Platform news, updates & events. Need technical help? Visit https:\/\/twittercommunity.com\/ \u2328\ufe0f #TapIntoTwitter"
  },
  "place": {
  },
  "entities": {
    "hashtags": [
    ],
    "urls": [
      {
        "url": "https:\/\/t.co\/XweGngmxlP",
        "unwound": {
          "url": "https:\/\/cards.twitter.com\/cards\/18ce53wgo4h\/3xo1c",
          "title": "Building the Future of the Twitter API Platform"
        }
      }
    ],
    "user_mentions": [
    ]
  }
}
This is a much more highly-structured object, but the idea is the same.

Even Nobel prizes can be represented in a JSON format:

{
  "prizes": [
    {
      "year": "2020",
      "category": "chemistry",
      "laureates": [
        {
          "id": "991",
          "firstname": "Emmanuelle",
          "surname": "Charpentier",
          "motivation": "\"for the development of a method for genome editing\"",
          "share": "2"
        },
        {
          "id": "992",
          "firstname": "Jennifer A.",
          "surname": "Doudna",
          "motivation": "\"for the development of a method for genome editing\"",
          "share": "2"
        }
      ]
    },
    {
      "year": "2020",
      "category": "economics",
      "laureates": [
        {
          "id": "995",
          "firstname": "Paul",
          "surname": "Milgrom",
          "motivation": "\"for improvements to auction theory and inventions of new auction formats\"",
          "share": "2"
        },
        {
          "id": "996",
          "firstname": "Robert",
          "surname": "Wilson",
          "motivation": "\"for improvements to auction theory and inventions of new auction formats\"",
          "share": "2"
        }
      ]
    },
    {
      "year": "2020",
      "category": "literature",
      "laureates": [
        {
          "id": "993",
          "firstname": "Louise",
          "surname": "Gl\u00fcck",
          "motivation": "\"for her unmistakable poetic voice that with austere beauty makes individual existence universal\"",
          "share": "1"
        }
      ]
    },
    {
      "year": "2020",
      "category": "peace",
      "laureates": [
        {
          "id": "994",
          "motivation": "\"for its efforts to combat hunger, for its contribution to bettering conditions for peace in conflict-affected areas and for acting as a driving force in efforts to prevent the use of hunger as a weapon of war and conflict\"",
          "share": "1",
          "firstname": "World Food Programme"
        }
      ]
    },
    {
      "year": "2020",
      "category": "physics",
      "laureates": [
        {
          "id": "988",
          "firstname": "Roger",
          "surname": "Penrose",
          "motivation": "\"for the discovery that black hole formation is a robust prediction of the general theory of relativity\"",
          "share": "2"
        },
        {
          "id": "989",
          "firstname": "Reinhard",
          "surname": "Genzel",
          "motivation": "\"for the discovery of a supermassive compact object at the centre of our galaxy\"",
          "share": "4"
        },
        {
          "id": "990",
          "firstname": "Andrea",
          "surname": "Ghez",
          "motivation": "\"for the discovery of a supermassive compact object at the centre of our galaxy\"",
          "share": "4"
        }
      ]
    },
    {
      "year": "2020",
      "category": "medicine",
      "laureates": [
        {
          "id": "985",
          "firstname": "Harvey",
          "surname": "Alter",
          "motivation": "\"for the discovery of Hepatitis C virus\"",
          "share": "3"
        },
        {
          "id": "986",
          "firstname": "Michael",
          "surname": "Houghton",
          "motivation": "\"for the discovery of Hepatitis C virus\"",
          "share": "3"
        },
        {
          "id": "987",
          "firstname": "Charles",
          "surname": "Rice",
          "motivation": "\"for the discovery of Hepatitis C virus\"",
          "share": "3"
        }
      ]
    }
  ]
}
## Attribution
The Iris dataset comes from UCI's ML Repository.

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

The tweet data comes from Twitter's Developer Docs. https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/overview/intro-to-tweet-json

The Nobel prize data comes from their v1 API, http://api.nobelprize.org/v1/prize.json, and is governed by their Terms of Use. https://www.nobelprize.org/about/terms-of-use-for-api-nobelprize-org-and-data-nobelprize-org/

Reddit provides JSON-formatted data.

If we loaded the contents of https://www.reddit.com/r/all/top.json?t=all into a Python object named content using json.load, which of the following expressions would construct a list of the top post titles?

You'll have to actually take a look at the JSON file at the above link.

Answer
[child["data"]["title"] for child in content["data"]["children"]]
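
To see why that expression works, here is a sketch on a tiny stand-in for the loaded content. The nesting is inferred from the answer expression itself; the titles and the `score` field are made up, and the real response contains many more keys.

```python
# A heavily trimmed, hypothetical stand-in for the parsed JSON content.
content = {
    "data": {
        "children": [
            {"data": {"title": "First post", "score": 10}},
            {"data": {"title": "Second post", "score": 7}},
        ]
    }
}

# Walk down to the list of children, then pull the title out of each one.
titles = [child["data"]["title"] for child in content["data"]["children"]]
```

Each step of the indexing corresponds to one level of nesting: an object, then an array of objects, then an object inside each of those.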








## Working With JSON

The general pattern is similar to before. For reading and writing JSON data from and to files, it is:

Extract data from a JSON file into Python
Open a file-like object f
Call json.load(f) or a similar method
Do something with the data, now within Python
Write data from Python to a file.
Open a file-like object f
Call json.dump(obj, f) or a similar method
Working with JSON Data

Suppose that we have the file listings.json of job listings:

[
    {
        "name": "Udacity",
        "role": 100,
        "description": "A stellar Python instructor is needed for a new course!",
        "available": true
    },
    {
        "name": "Udacity",
        "role": 404,
        "description": "A quality assistance engineer who can start immediately.",
        "available": false
    }
]
and we want to write a program that will only keep available jobs.

import json

# Extract data into Python
with open('listings.json', 'r') as infile:
    contents = json.load(infile)  # Parse JSON data into a Python object.

# Filter out all unavailable job listings.
available = [job for job in contents if job["available"]]

# Write available listings to an output file.
with open('available-listings.json', 'w') as outfile:
    json.dump(available, outfile, indent=2)
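
The "similar methods" mentioned earlier include `json.dumps` and `json.loads`, which work with strings rather than file-like objects. This sketch runs the same filter entirely in memory on a shortened, assumed copy of the listings.

```python
import json

# A shortened copy of the listings, already as Python objects.
listings = [
    {"name": "Udacity", "role": 100, "available": True},
    {"name": "Udacity", "role": 404, "available": False},
]

available = [job for job in listings if job["available"]]

# Encode to a JSON-formatted string, then decode it again.
text = json.dumps(available, indent=2)
round_tripped = json.loads(text)
```

Notice that Python's `True` is written as JSON's `true` in the string, and is parsed back into `True` on the way in.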

## New Terms
Term	Definition
JSON	A standard format for encoding structured data, often thought of as (nested) sequences and mappings.
The json module	A built-in module that provides a JSON encoder and decoder through the json.dump and json.load functions.

https://en.wikipedia.org/wiki/JSON
https://www.json.org/json-en.html
https://docs.python.org/3/library/json.html
https://realpython.com/python-json/

