# SAO/LIP Python Primer Course Lecture 7

In this notebook, you will learn about:
- File paths and the `os` library
- I/O in base Python
- The `pandas` library
- Reading and viewing files in `pandas`
- Manipulating datasets

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/acorreia61201/SAOPythonPrimer/blob/main/lectures/Lecture7.ipynb)

At the end of our discussion on `numpy`, we started covering the concept of *I/O*, or *input/output*. This is a useful feature that you can use to share or load large datasets. In this lecture, we'll cover two more methods to do this: using base Python, and using an external library `pandas`.

## File Paths

Before we get into that, it's important that we understand how files are stored on your computer. We can use a library `os` to visualize how this works.

In [None]:
import os

All files on an operating system are located in a *directory*, more commonly known as a folder. Directories can contain files or subdirectories, which themselves may have their own files and subdirectories. Each file has a sequence of directories describing where it is on your system known as a *path*.
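
As a small illustration of how a path is built from directory names, here is a sketch using the standard `os.path` helpers. The directory and file names are hypothetical and nothing is created on disk.

```python
import os

# Hypothetical directory names, joined with the separator your OS uses
# ('/' on Linux and macOS, '\\' on Windows).
path = os.path.join('home', 'user', 'lectures', 'Lecture7.ipynb')

directory = os.path.dirname(path)  # everything before the final component
filename = os.path.basename(path)  # the final component (the file itself)

print(path)
print(directory)
print(filename)
```

Building paths with `os.path.join()` rather than typing the separators by hand keeps the same code working across operating systems.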

This notebook, for example, has a path. The directory that Python is currently running in is called the *current working directory*; when you run a notebook locally, this is usually the directory that contains the notebook. In Python, we can view the current working directory with `os.getcwd()`:

In [None]:
lecs = os.getcwd()
lecs

This current working directory is populated with some files auto-generated by Colab (if you're using that; you may have some other files if you're running locally). To view them all, we use the *list* command which we call in Python with `os.listdir()`.

In [None]:
os.listdir()

Each file within a directory has a unique *file name*, which usually ends in a *file extension* (the part after the last `.`). File extensions tell your operating system how to interpret and open files. For example, extensions like `png` and `jpg` are interpreted as images, while extensions like `txt` are interpreted as plain-text files. Each lecture above has the extension `ipynb`, the standard for a Jupyter notebook.
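
If you want to separate a file name from its extension in code, one option is `os.path.splitext()`. The file names below are made up for illustration.

```python
import os

# Hypothetical file names; none of these need to exist on disk.
names = ['Lecture7.ipynb', 'plot.png', 'notes.txt']

# splitext() splits at the last '.', keeping the dot with the extension.
parts = [os.path.splitext(name) for name in names]

for stem, ext in parts:
    print(stem, ext)
```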

We can create a new directory in the current working directory using `os.mkdir()`:

In [None]:
os.mkdir('example_dir')

We can see the new directory using `os.listdir()`:

In [None]:
os.listdir()

Notice that if we try to add a new directory with the same name we get an error:

In [None]:
os.mkdir('example_dir')
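
The error raised here is a `FileExistsError`. If you'd rather not have it stop your program, two common options are catching it with `try`/`except` or using `os.makedirs()` with `exist_ok=True`. The sketch below works inside a temporary scratch directory (an assumption made so that it doesn't touch the directories used in this lecture).

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, 'example_dir')
    os.mkdir(target)  # first call succeeds

    try:
        os.mkdir(target)  # second call fails: the directory already exists
        second_attempt = 'created'
    except FileExistsError:
        second_attempt = 'already exists'

    # makedirs(..., exist_ok=True) silently accepts an existing directory
    os.makedirs(target, exist_ok=True)
    still_there = os.path.isdir(target)

print(second_attempt)  # already exists
print(still_there)     # True
```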

We can move our current working directory to this new directory using `os.chdir()`:

In [None]:
os.chdir('example_dir')
os.getcwd()

Let's see what's in this new directory:

In [None]:
os.listdir()

It returns an empty list. This makes sense; we haven't added anything to it yet. We can, however, view what's in the previous directory (or any directory on the system, for that matter) using an *absolute path*:

In [None]:
os.listdir(lecs)

There are two special strings that represent *relative paths*, meaning paths that are interpreted relative to the current working directory. The string `.` refers to the current working directory:

In [None]:
os.listdir('.')

The string `..` refers to the directory above the current working directory:

In [None]:
os.listdir('..')

We can use this to easily move up one directory in the path:

In [None]:
os.chdir('..')
os.getcwd()
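
To see which absolute path a relative path stands for, one option is `os.path.abspath()`. This is a minimal sketch; the paths printed depend on where you run it.

```python
import os

cwd = os.getcwd()
here = os.path.abspath('.')     # '.' resolved against the working directory
parent = os.path.abspath('..')  # '..' resolved to the directory above it

print(here)
print(parent)
```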

We can also modify currently existing files. If we want to change the name of the directory we just made, we can use `os.rename()`:

In [None]:
os.rename('example_dir', 'new_dir')

In [None]:
os.listdir()

If we want to remove the directory entirely, we can use `os.rmdir()` (this only works if the directory is empty):

In [None]:
os.rmdir('new_dir')

If we want to remove files, we can instead use `os.remove()`. If you try to remove a directory with this command, you'll get an error:

In [None]:
os.mkdir('example')

In [None]:
os.remove('example')

The same is true in reverse (i.e. using `os.rmdir()` on a file). These are the basics of using `os` to view and manipulate files; the full documentation is at https://docs.python.org/3/library/os.html. (There's a lot here; you may be better off doing your own research if you have a specific thing you want to do.)
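
If you don't know in advance whether a path is a file or a directory, you can check with `os.path.isdir()` before choosing which function to call. The helper `remove_path` below is a hypothetical name introduced for this sketch, and it works in a temporary scratch directory so nothing from the lecture is deleted.

```python
import os
import tempfile

def remove_path(path):
    """Hypothetical helper: delete a file or an empty directory."""
    if os.path.isdir(path):
        os.rmdir(path)   # directories need rmdir()
    else:
        os.remove(path)  # files need remove()

with tempfile.TemporaryDirectory() as scratch:
    a_dir = os.path.join(scratch, 'example')
    a_file = os.path.join(scratch, 'example.txt')
    os.mkdir(a_dir)
    open(a_file, 'x').close()  # create an empty file and close it

    remove_path(a_dir)
    remove_path(a_file)
    leftovers = os.listdir(scratch)

print(leftovers)  # []
```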

## I/O in Base Python

Now, let's get to reading and writing files using built-in Python commands. First, we need a file to open. There are two ways to get one. We can download an external file from the internet using the command `wget`. This is a command-line program rather than a Python function, so in a notebook we have to precede it with `!`. For this example, we'll download the Gettysburg Address:

In [None]:
!wget https://collincapano.com/wp-content/uploads/2023/01/gettysburg_address-bliss_copy.txt

Let's see if it downloaded properly:

In [None]:
os.listdir()

Let's give it a more concise name using `os`:

In [None]:
os.rename('gettysburg_address-bliss_copy.txt', 'gettysburg_address.txt')

Alternatively, we can create our own file using the function `open()`. This is one of the primary functions used for I/O. As the name implies, its simplest use is to open a file. However, there are a variety of *modes* we can use, accessible via extra arguments. To create a new file, we pass `'x'` as a second argument, indicating create mode:

In [None]:
open('new_file.txt', 'x').close() # create the empty file, then close it right away

Again, we can check that this new file exists using `os`:

In [None]:
os.listdir()
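
Create mode is a safe way to make a new file because it refuses to touch a file that already exists. The sketch below shows this in a temporary scratch directory (an assumption, so that the lecture's own `new_file.txt` is left alone).

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, 'new_file.txt')
    open(path, 'x').close()  # first call creates the file

    try:
        open(path, 'x')      # second call fails: the file is already there
        outcome = 'created again'
    except FileExistsError:
        outcome = 'refused: file already exists'

print(outcome)  # refused: file already exists
```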

### Reading Files

Now, let's open the files we've created. We can do this by using `open()` without any extra arguments:

In [None]:
newfile = open('new_file.txt')
gb = open('gettysburg_address.txt')

By default, `open()` will open a file in read mode. This means that all we can do is look at its contents. To get the contents of a file as a string, we call the method `read()` on one of the objects we've created above; wrapping it in `print()` displays the text. Let's try it with the Gettysburg Address:

In [None]:
print(gb.read())

Another way we can read the contents of a file is with the method `readlines()`, which we'll try out below:

In [None]:
gb.readlines()

We get back an empty list...what's going on? A file object keeps track of its position in the file, and `readlines()` only returns the lines that haven't been read yet. Since `read()` with no arguments reads the entire file, any later call to `read()` or `readlines()` has nothing left to return.

To read the file from the start again, we'll open it again with `open()`. First, though, it's good practice to close any files you've opened once you're done with them: this releases the file back to the operating system and, when writing, makes sure your changes are actually saved. We can do this with the `close()` method:

In [None]:
newfile.close()
gb.close()
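
Since it's easy to forget `close()`, Python also offers the `with` statement, which closes the file automatically when the indented block ends. The sketch below uses a hypothetical file `demo.txt` in a temporary scratch directory, and creates it using write mode, which is covered later in this lecture.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, 'demo.txt')  # hypothetical demo file

    with open(path, 'w') as f:  # write mode, used here just to make a file
        f.write('first line\nsecond line\n')

    with open(path) as f:       # default read mode
        text = f.read()

    # Outside the block, the file has already been closed for us.
    closed_after_block = f.closed

print(text)
print(closed_after_block)  # True
```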

Now, let's open the Gettysburg Address again, this time using `readlines()`:

In [None]:
gb = open('gettysburg_address.txt')
print(gb.readlines())
gb.close()

`readlines()` returns a list with each line of the file as an individual element. However, printing the list shows special characters like `\n` (the newline character) literally; each string has to be passed to `print()` on its own for them to be interpreted. We can do this by iterating over the list in a loop, just as we've learned before:

In [None]:
gb = open('gettysburg_address.txt')
lines = gb.readlines()
for i in lines:
    print(i)
gb.close()

This is pretty close to what's actually in the file: the `\n` characters are now interpreted as actual line breaks. (The output is double-spaced because `print()` adds its own line break after the one already at the end of each line.) We can also write this using the method `readline()`, which reads the next line that hasn't been read yet:

In [None]:
gb = open('gettysburg_address.txt')
line = gb.readline() # placeholder variable starting at first line
while line != '': # iterate until readline() returns an empty string (end of file)
    print(line) # print current line
    line = gb.readline() # redefine placeholder as next line
gb.close()
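
A file object can also be looped over directly, which yields one line at a time without calling `readline()` yourself. The sketch below uses a short hypothetical text file in a temporary scratch directory, and strips the trailing `\n` from each line so the output isn't double-spaced.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, 'demo.txt')  # hypothetical demo file
    with open(path, 'w') as f:
        f.write('Four score\nand seven years ago\n')

    cleaned = []
    with open(path) as f:
        for line in f:                         # one line per iteration
            cleaned.append(line.rstrip('\n'))  # drop the trailing newline

for line in cleaned:
    print(line)
```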

`read()` has an optional argument that controls how much of the file is read. Passing an integer to `read()` returns at most that many characters (for a file opened in text mode, as here). The default value is -1, which reads everything up to the end of the file. The same is true for `readline()`, which will stop partway through a line if it reaches the limit. For `readlines()`, the argument is only a hint: it still reads whole lines, but stops reading more once the total size of the lines read so far exceeds the given number.

In [None]:
gb = open('gettysburg_address.txt')
print(gb.read(5)) # print the first 5 characters (spaces included)
print(gb.read(10)) # print the next 10 characters
print(gb.read(-1)) # print the rest of the characters
gb.close()
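
Each call to `read()` picks up where the previous one stopped. If you want to go back to the beginning without reopening the file, the method `seek(0)` moves the position back to the start. A minimal sketch, using a hypothetical one-line file in a temporary scratch directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, 'demo.txt')  # hypothetical demo file
    with open(path, 'w') as f:
        f.write('Hello, world!\n')

    with open(path) as f:
        first = f.read(5)   # 'Hello'
        rest = f.read()     # everything after the first 5 characters
        at_end = f.read()   # '' because nothing is left to read
        f.seek(0)           # move back to the start of the file
        again = f.read(5)   # 'Hello' once more

print(repr(first), repr(rest), repr(at_end), repr(again))
```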

### Writing Files

Let's now try printing the contents of `new_file.txt`:

In [None]:
nf = open('new_file.txt')
print(nf.read())
nf.close()

Just as you'd probably expect, the new file is empty. In order to add text to the file, we'll need to make use of the write mode for `open()`. We can do this by opening the file with the argument `'w'`. Be aware that opening an existing file in write mode erases whatever it contained:

In [None]:
nf = open('new_file.txt', 'w')

We can now add a line to the file using the method `write()`, with the argument being a string you wish to add to the file:

In [None]:
nf.write('This is a new line of text.')

Let's check the file now:

In [None]:
nf.read()

Oops, remember that we opened the file in write mode only. This means that we can't apply any of the methods we could use in read mode. Let's close this file to save our changes:

In [None]:
nf.close()
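
Because write mode starts from an empty file every time, reopening a file with `'w'` discards what was written before. If you want to keep the existing contents and add to the end, `open()` also has an append mode, `'a'`. The sketch below compares the two on a hypothetical file in a temporary scratch directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, 'demo.txt')  # hypothetical demo file

    with open(path, 'w') as f:
        f.write('first version\n')

    with open(path, 'w') as f:   # 'w' again: the old contents are erased
        f.write('second version\n')

    with open(path, 'a') as f:   # 'a': new text goes after what's there
        f.write('appended line\n')

    with open(path) as f:
        contents = f.read()

print(contents)
```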

If we want to both read and write a file, we can open the file in read/write mode. We can do this by using the argument `'r+'`:

In [None]:
nf = open('new_file.txt', 'r+')

Now, we can read the file however we wish:

In [None]:
nf.read()

We can also write new lines to the file:

In [None]:
nf.write('This is also a new line of text.')

Let's check on our changes (remember we have to create a new instance):

In [None]:
nf.close()
nf = open('new_file.txt', 'r+')
print(nf.read())

The text we've added suggests that each sentence sits on its own line, but it seems like that's not the case. We'll have to use the special characters we saw when reading the lines of the Gettysburg Address.

In [None]:
nf.write('\n')
nf.write('This is actually a new line of text.\n')

In [None]:
nf.close()
nf = open('new_file.txt', 'r+')
print(nf.read())
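
One subtlety of `'r+'` mode is that a write lands wherever the file's current position is. In the cells above we always called `read()` first, which leaves the position at the end of the file, so new text was added after the old text. Writing immediately after opening instead overwrites from the start. A minimal sketch on a hypothetical file in a temporary scratch directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, 'demo.txt')  # hypothetical demo file
    with open(path, 'w') as f:
        f.write('abcdef')

    with open(path, 'r+') as f:
        f.write('XY')            # position is 0, so 'ab' is overwritten
    with open(path) as f:
        overwritten = f.read()

    with open(path, 'r+') as f:
        f.read()                 # moves the position to the end
        f.write('!')             # so this is added after the old text
    with open(path) as f:
        extended = f.read()

print(overwritten)  # XYcdef
print(extended)     # XYcdef!
```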

If we want to add multiple lines to a text file, we can use `writelines()`, with the input being a list of lines we wish to add. We still have to add the line breaks manually.

In [None]:
nf.writelines(['This is a line written with writelines().\n', 'Remember to add in the linebreaks manually.\n'])

In [None]:
nf.close()
nf = open('new_file.txt', 'r+')
print(nf.read())
nf.close() # always remember to close your files
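
If your lines are stored without line breaks, one way to add them on the fly is to pass `writelines()` a generator expression that appends `'\n'` to each item. The sketch below uses hypothetical data and a hypothetical file in a temporary scratch directory.

```python
import os
import tempfile

rows = ['alpha', 'beta', 'gamma']  # hypothetical lines without '\n'

with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, 'demo.txt')  # hypothetical demo file

    with open(path, 'w') as f:
        # writelines() accepts any iterable of strings
        f.writelines(row + '\n' for row in rows)

    with open(path) as f:
        lines = f.readlines()

print(lines)  # ['alpha\n', 'beta\n', 'gamma\n']
```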

## I/O in `pandas`

Using `open()` is a basic way to read and write the contents of files. If we want to read in large datasets, there are more robust ways to do so. We can use a library called `pandas`, one of the most popular libraries for reading data in scientific programming. We'll make sure it's installed and import it as always:

In [1]:
!pip install pandas



In [8]:
import pandas as pd