## Working with Files (part 1) - Intro

In this lesson you'll learn how to read from and write to files in Python.

Reading and writing data files is one of the most fundamental skills of a data engineer. You're always working with files. Some common data engineering tasks are:
- Reading data files
- Checking the schema (or format) of each row
- Parsing rows of data and ensuring key fields are present and valid
- Transforming rows and fields to apply business logic or to validate their content against other reference sources
- Writing the output to a database or to a data file format better suited to cloud or big data tools

Let's get started!

### Simple File I/O: Speaking to the Can! 

In this section you'll learn to write a simple message to a data file, then open it and read the message back. This is what we call _"speaking to the can"_, where the can is the file that you will create.

Let's look at the simplest example of writing a message to a file:

In [22]:

with open("./data/can.txt", "w") as myfile:
    msg = "Only from the heart can you touch the sky. -Rumi"
    myfile.write(msg)


Let's digest our code:
- Python's built-in `open()` function opens a file for reading or writing
- It returns a file object, also commonly called a _file handle_. We assign this file object to a variable called `myfile` in our code. You can name this variable anything you like; the handle is a special object that lets you perform operations (such as read and write) on the file
- The `open()` function takes the file **path** as its first parameter and the **mode** for our file operation as its second. Only the path is required; if you leave out the mode, it defaults to `r`
- In this example, we open the file with `w` for _write_ mode
- There are other common modes such as `r` for _read_ mode or `a` for _append_ mode
- The `with` statement is a Python construct commonly used when working with files. It lets us write an indented code block in which the file is open
- The file handle's `.write()` method writes a string to the file

Simple, right?!
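To make the points above a little more concrete, here is a minimal sketch that inspects the file handle. The temporary directory and the file name `modes.txt` are assumptions made for this illustration so that the lesson's `./data/can.txt` is left untouched.

```python
import os
import tempfile

# Hypothetical scratch location, used only for this illustration
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "modes.txt")

with open(path, "w") as myfile:
    # write() returns the number of characters it wrote
    written = myfile.write("hello")
    write_mode = myfile.mode
    print("mode:", write_mode, "characters written:", written)

# the mode argument is omitted here, so the file is opened in read mode
with open(path) as myfile:
    default_mode = myfile.mode
    content = myfile.read()
    print("mode:", default_mode, "content:", content)
```

The `.mode` attribute is handy when you are unsure how a handle was opened.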

It's important to note that the `with` statement automatically closes our file when the block ends (that is, when the code is unindented back). It's very important to always close files after you're done working with them. Open file handles take up operating system resources, and most operating systems limit how many file handles a process can have open at once, so by not closing files you risk reaching this limit. Closing a file also makes sure that any buffered writes actually reach the disk. Additionally, some operating systems (Windows in particular) can lock a file while your program has it open, meaning that other programs may not be able to modify or delete the file until your program closes it.

Let's examine this:

In [None]:
with open("./data/can.txt", "w") as myfile:
    msg = "Only from the heart can you touch the sky. -Rumi"
    myfile.write(msg)
    # pay attention that the file is still open inside the with block
    print("is file closed (inside with)? ", myfile.closed)

# outside of the with block, the file is closed
print("is file closed (outside with)? ", myfile.closed)
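For comparison, here is a rough sketch of what closing a file by hand looks like without `with`. It is not exactly what Python does internally, but it shows the same guarantee: the file is closed even if an error occurs while writing. The temporary file `manual_close.txt` is an assumption made for this illustration.

```python
import os
import tempfile

# Hypothetical scratch location, used only for this illustration
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "manual_close.txt")

myfile = open(path, "w")
try:
    myfile.write("closing by hand")
finally:
    # this runs even if write() raises, so the handle is always released
    myfile.close()

print("is file closed? ", myfile.closed)
```

The `with` form is shorter and harder to get wrong, which is why it is the usual choice.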


Now, let's open our file again in **read** mode and read back our content:

In [None]:
with open("./data/can.txt", "r") as myfile:
    msg = myfile.read()
    print(msg)

You can see that now:
- We open the file in `r` or _read_ mode
- The file object `read()` method reads the entire content of the file (to the end) and returns the content
- We assign the content to a variable called `msg` and print it

**NOTE:** a few important points:
- Opening a file in `w` mode completely overwrites its previous content. If you open a file and don't write anything, then you'll end up with an empty file.
- If you'd like to add content to an existing file, use `a` or _append_ mode
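The difference between the two modes can be sketched as follows. The temporary file `append_demo.txt` is an assumption made for this illustration.

```python
import os
import tempfile

# Hypothetical scratch location, used only for this illustration
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "append_demo.txt")

# "w" creates the file, or empties it if it already exists
with open(path, "w") as myfile:
    myfile.write("first line\n")

# "a" keeps the existing content and writes at the end
with open(path, "a") as myfile:
    myfile.write("second line\n")

with open(path, "r") as myfile:
    after_append = myfile.read()
print(after_append)

# opening in "w" again without writing anything leaves an empty file
with open(path, "w") as myfile:
    pass

with open(path, "r") as myfile:
    after_truncate = myfile.read()
print(f"after reopening in 'w': '{after_truncate}'")
```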


The `read()` method without an argument reads the entire file to the end. Since the file content is read into memory, reading a very large file might cause your computer to run out of memory! Therefore it's better practice to read large files in chunks. You'll see more examples later using the `readline()` method.

For now, let's look at another example where we read only a certain number of characters from our file and then close it:

In [None]:
with open("./data/can.txt", "r") as myfile:
    # read 10 characters
    msg = myfile.read(10)
    print(f"read: '{msg}' len={len(msg)}")
    # read the next 9 characters
    msg = myfile.read(9)
    print(f"read: '{msg}' len={len(msg)}")
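The same idea extends to reading a whole file piece by piece. Below is a minimal sketch that keeps calling `read()` with a size until it returns an empty string, which signals the end of the file. The temporary file and the `chunk_size` of 10 are assumptions made for this illustration; real code would typically use a much larger chunk size.

```python
import os
import tempfile

# Hypothetical scratch location, used only for this illustration
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "chunks.txt")

original = "Only from the heart can you touch the sky. -Rumi"
with open(path, "w", encoding="utf-8") as myfile:
    myfile.write(original)

chunk_size = 10  # assumed value, kept tiny so the output is easy to follow
chunks = []
with open(path, "r", encoding="utf-8") as myfile:
    while True:
        chunk = myfile.read(chunk_size)
        if chunk == "":
            # read() returns an empty string once the end of the file is reached
            break
        chunks.append(chunk)
        print(f"read: '{chunk}' len={len(chunk)}")
```

Only one chunk is processed at a time inside the loop. This sketch collects the chunks in a list just so the result can be inspected; for a truly large file you would process each chunk and discard it.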

#### Exercise

- Open a file and write another Rumi quote
- Open the file again and read your quote

In [None]:
# open the file for writing

# open the file for reading

### File Encoding

It's important to note another `open()` parameter called `encoding`. A file's encoding is how the computer converts text to and from bytes. At the end of the day, everything in a computer is binary. Two very common encodings are `utf-8` and `utf-16`. Both can represent every Unicode character, including Persian (Farsi) letters (the original language of the poet Rumi) and Chinese characters; they differ in how many bytes they use. _utf-8_ is a variable-width encoding that uses one to four bytes per character. The 128 ASCII characters, which cover basic English letters, digits, and punctuation, take one byte each, so any ASCII text is also valid utf-8. _utf-16_ uses two or four bytes per character. You can refer to the [ASCII Table](https://www.asciitable.com/) to see how ASCII characters are encoded as binary numbers. If you don't pass `encoding`, the default depends on your platform and Python version, so it's safer to set it explicitly.
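As a small illustration of what an encoding does, the sketch below converts a few characters to bytes with `str.encode()` and counts how many bytes each encoding needs. The three sample characters (Latin, Persian, and Chinese) are chosen only for this illustration.

```python
# Sample characters chosen for illustration: Latin "A", Persian "ر", Chinese "中"
byte_counts = {}
for ch in ["A", "ر", "中"]:
    utf8_bytes = ch.encode("utf-8")
    # "utf-16-le" is utf-16 in little-endian byte order, without a byte order mark
    utf16_bytes = ch.encode("utf-16-le")
    byte_counts[ch] = (len(utf8_bytes), len(utf16_bytes))
    print(f"{ch!r}: utf-8 uses {len(utf8_bytes)} byte(s), utf-16 uses {len(utf16_bytes)} byte(s)")
```

`utf-16-le` is used here instead of plain `utf-16` so that the two-byte byte order mark Python prepends for `utf-16` is not counted.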

Let's see this in practice:

In [None]:
with open("./data/can.txt", "w", encoding="utf-8") as myfile:
    msg = "Your heart knows the way. Run in that direction. -Rumi"
    myfile.write(msg)

print("file is written! Now reading:")
with open("./data/can.txt", "r", encoding="utf-8") as myfile:
    msg = myfile.read()
    print(msg)

**NOTE:** You should always open a file with the same encoding it was originally written in; otherwise you'll get garbled characters or a `UnicodeDecodeError` (a subclass of `UnicodeError`).

See the example below where we intentionally read/write with different encodings:

In [None]:
# !!! this code will raise a UnicodeDecodeError (a subclass of UnicodeError) because we write with utf-16 but read with utf-8 !!!

with open("./data/can.txt", "w", encoding="utf-16") as myfile:
    msg = "Your heart knows the way. Run in that direction. -Rumi"
    myfile.write(msg)

print("file is written! Now reading:")
with open("./data/can.txt", "r", encoding="utf-8") as myfile:
    msg = myfile.read()
    print(msg)
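One way to deal with such a mismatch is to catch the error and retry with the encoding the file was actually written in. The sketch below assumes we already know that encoding is `utf-16`; in real projects you usually have to find it out from whoever produced the file. The temporary file `mismatch.txt` is also an assumption made for this illustration.

```python
import os
import tempfile

# Hypothetical scratch location, used only for this illustration
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "mismatch.txt")

with open(path, "w", encoding="utf-16") as myfile:
    myfile.write("Your heart knows the way. Run in that direction. -Rumi")

decode_failed = False
try:
    with open(path, "r", encoding="utf-8") as myfile:
        msg = myfile.read()
except UnicodeDecodeError as err:
    decode_failed = True
    print("could not decode as utf-8:", err)
    # retry with the encoding the file was written in (assumed to be known)
    with open(path, "r", encoding="utf-16") as myfile:
        msg = myfile.read()

print(msg)
```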

#### Exercise

- Write to and read from a file 5 times. Pick whatever you want to read/write!
- Try special things like writing a message with newline characters: `"this is \n a multi line \n text"`

In [None]:
# write your code here

### Resources

Additional reading:
- [Python Documentation: File I/O](https://docs.python.org/3/tutorial/inputoutput.html#tut-files)