# File IO and Exceptions
## 1DV501 - Introduction to Programming


Recording of this lecture:
- https://www.youtube.com/watch?v=3sqDNuaDKww

Recordings of lecture last year (2020):
- https://youtube.com/playlist?list=PLdXitOaYf2HAgjHNCfnosGw08AqzE2c4o


### Sign up for the first Python Test on Friday, October 8.
- Read information in Moodle.
- **Registration is mandatory. Deadline October 5.**
- Registration is now open in Moodle (Scroll down a bit to find it.)


### For students staying in Sweden
- The test will take place at Campus Växjö or at Campus Kalmar.
- We will not allow any student living in Sweden to take the test remotely ...
    - ...except students on the Physics distance program
- Exact time and place for the test will be presented later on.

### For students staying abroad
- You will be given an opportunity to take the Python Test remotely.
- The test will be monitored using Zoom. You will be asked to set up a webcam (or mobile phone) in such a way that you and your computer are in clear view during the test.
- More instructions related to the distance version of the Python Test will be presented later on in Moodle.

## Today

- Working with Files and Directories
- File input/output (IO)
    - Working with text files
    - Working with data files
- Text Processing (Extra material, Assignment 3 preparation)

**Reading instructions:** Chap. 9.3

## Working with files and directories

When programming we often need to access files and directories on our computer.
- For example, read data from a file in a certain directory, or output processed text to a certain file in a certain directory.
- The Python library `os` (operating system) can be used to query/access the file system on your computer.
- For example, to print all the Python (.py) files in a certain directory.
- The file system depends on your operating system (e.g. Mac or Windows) ⇒ The examples we show today can look slightly different on different computers (operating systems).

### The `os` module



- Note: Windows and *nix systems differ in how paths are written. Windows separates directories with `\`, *nix systems with `/`.

- Relative path -> the path relative to where your program runs.
- Absolute path -> the actual full path of the file system.
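
A minimal sketch of the difference between a relative and an absolute path. The names `data` and `numbers.txt` are made up for illustration; the file does not need to exist for this example to run.

```python
import os

# Build a relative path in a portable way (the right separator is chosen for us).
# 'data' and 'numbers.txt' are hypothetical names.
relative_path = os.path.join('data', 'numbers.txt')

# Turn the relative path into an absolute one, based on the current working directory.
absolute_path = os.path.abspath(relative_path)

print("Separator on this system:", os.sep)
print("Relative path:", relative_path)
print("Absolute path:", absolute_path)
print("Is absolute:", os.path.isabs(absolute_path))
```

Using `os.path.join` instead of typing `/` or `\` by hand is one way to keep a program working on both Windows and *nix systems.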

---

- *Tips for advanced users. If you're running Windows, it's possible to install WSL (Windows Subsystem for Linux).*

---

In [None]:
import os
path = os.getcwd()
print(path)

In [None]:
os.chdir('figures')

In [None]:
os.getcwd()

In [None]:
os.chdir('..')

In [None]:
os.getcwd()

In [None]:
os.chdir('/Users/frahaa/dev/courses/1DV501')

- The `os` module gives support for queries related to files and directories
- `os.getcwd()`  name of current working directory
    - The virtual machine's start directory in this execution
    - Not same as folder containing this program code
    - Topmost directory inside Visual Studio Code 
- `os.chdir('figures')` change to child directory
- `os.chdir('..')` change to parent directory


In [None]:
import os                       # Operating system module
os.chdir('/Users/frahaa/dev/courses/1DV501')

path = os.getcwd()              # Get current working directory
print("Current dir:", path)     # ... /1DV501

lst = os.listdir(path)  # List files and directories in path directory

for s in lst:
    print(s)       

os.chdir('figures')     # chdir returns None, so there is nothing to store
print("\nMoved to dir:", os.getcwd())

lst = os.listdir()      # No argument -> list files and folders in current directory

for s in lst:
    if s.endswith(".py"):     # Print files ending with ".py"
        print(s)              # time.py, tax.py, shortname.py, quote.py, 

- `os.listdir(path)` List content (as strings) of directory `path`
- Directory content -> files and directories 
- Hidden entities (e.g. `.vscode` or `.DS_Store`) have names starting with a `.` 



## Using `os.scandir()`



`os.scandir(path='.')` returns an iterator of `DirEntry` objects, one for each entry in the directory `path`.

https://docs.python.org/3/library/os.html


In [None]:
import os
os.chdir('/Users/frahaa/dev/courses/1DV501')

entries = os.scandir()

for a in entries:
    print('Class:', type(a))
    print('Name:', a.name)
    print('Ends with \'g\':', a.name.endswith('g'))
    print('Is a file:', a.is_file())
    print('Is a dir:', a.is_dir())
    print()

In [None]:
# Example from PDF-slides
import os
os.chdir('/Users/frahaa/dev/courses/1DV501')

def is_hidden(entry):
    return entry.name.startswith(".")

def print_entries(list_of_entries):
    for entry in list_of_entries:
        if entry.is_file() and not is_hidden(entry):
            print("File: ", entry.name, type(entry) )
        elif entry.is_dir() and not is_hidden(entry):
            print("Dir: ", entry.name, entry.path)

path = os.getcwd()
entries = os.scandir(path)  # Iterator over entries of type DirEntry
print_entries(entries)      
print()

os.chdir('..')              # chdir returns None, so we do not store its result
entries = os.scandir()      # No argument -> scan the (new) current directory
print_entries(entries)        

## Files and dirs, continued ...

Note "Portable Operating System Interface (POSIX)"

### The `os.listdir(...)` approach 

- `os.listdir(path)` -> all file and directory names in directory `path`
- Problem: all names are given as `strings` -> hard to know if it is a file or a directory
- Suitable approach when you quickly want to find the content of a given directory
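
If you still want to use `os.listdir(...)`, one possible workaround (a sketch, not the only way) is to ask `os.path.isdir` and `os.path.isfile` about each name. The example builds a small temporary directory so that it runs on any computer; the names `figures` and `hello.py` are made up.

```python
import os
import tempfile

# Build a small throwaway directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, 'figures'))
    with open(os.path.join(tmp, 'hello.py'), 'w') as file:
        file.write("print('hello')\n")

    dir_names = []
    file_names = []
    for name in os.listdir(tmp):
        full_path = os.path.join(tmp, name)   # listdir gives names only, not paths
        if os.path.isdir(full_path):
            dir_names.append(name)
        elif os.path.isfile(full_path):
            file_names.append(name)

print("Directories:", dir_names)
print("Files:", file_names)
```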

### The `os.scandir(...)` approach 

- `os.scandir(path)` -> all files and directories in `path` as `DirEntry` objects
- Each `DirEntry` object `entry` comes with two attributes:
    - `entry.name` -> short local name of file or directory
    - `entry.path` -> path of the file or directory (the `path` given to `scandir` joined with the name)
- and two methods:
    - `entry.is_file()` -> True if `entry` is a file
    - `entry.is_dir()` -> True if `entry` is a directory

- Suitable approach for more complex problems like:
    - List all python files in a given directory
    - Find all sub-directories (transitively)  of a given directory
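
As an illustration of the first problem, here is a minimal sketch that lists the Python files in a directory using `os.scandir(...)`. The function name `list_python_files` and the file names are assumptions made for this example only.

```python
import os
import tempfile

def list_python_files(path='.'):
    """Return the sorted names of the non-hidden .py files directly inside path."""
    names = []
    with os.scandir(path) as entries:       # with-as closes the iterator when done
        for entry in entries:
            if entry.is_file() and entry.name.endswith('.py') and not entry.name.startswith('.'):
                names.append(entry.name)
    return sorted(names)

# Try it on a small temporary directory with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    for name in ['tax.py', 'time.py', 'notes.txt', '.hidden.py']:
        with open(os.path.join(tmp, name), 'w') as file:
            file.write('')
    os.mkdir(os.path.join(tmp, 'figures.py'))   # a directory, so it is skipped
    python_files = list_python_files(tmp)

print(python_files)
```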


In [None]:

def count_dirs(path='.'):
    c_ = 0
    entries = os.scandir(path)
    for entry in entries:
        if entry.is_dir():
            #print(entry.name)
            c_ += 1 + count_dirs(entry.path)
    return c_

path = '/Users/frahaa/dev/courses/1DV501'
print(f"The path {path} contains {count_dirs(path)} subdirectories")

In [None]:
path = '/Users/frahaa/dev'
print(f"Dir {path} contains {count_dirs(path)} subdirectories")

- `count_dirs(path)` is a recursive function that visits all subdirectories
- Visits all  subdirectories transitively -> subdirs to subdirs to subdirs ...
- Difficult to handle without recursion (**a perfect example of when recursion is useful**)
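
For comparison, the standard library function `os.walk` visits a directory tree for you, so you do not have to write the recursion yourself. The sketch below is an alternative to `count_dirs`, not a replacement for it; the name `count_dirs_walk` and the directory names are made up. For ordinary directory trees (symbolic links aside) it should give the same count.

```python
import os
import tempfile

def count_dirs_walk(path='.'):
    """Count all subdirectories below path using os.walk."""
    count = 0
    # os.walk yields one tuple per visited directory:
    # (directory path, names of its subdirectories, names of its files)
    for _current, dir_names, _file_names in os.walk(path):
        count += len(dir_names)
    return count

# Build a small temporary tree: a/b/c and d
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'a', 'b', 'c'))
    os.makedirs(os.path.join(tmp, 'd'))
    n_dirs = count_dirs_walk(tmp)

print("Subdirectories:", n_dirs)
```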


## Reading text from file

- `file =  open(path,"r")` open file `path` for reading (`r`)
- `file` is here an object representing a connection to a file
- `for line in file` -> read from file line by line




In [None]:

path = '/Users/frahaa/dev/courses/1DV501'
path += '/data/holy_grail_script_scene1.txt'
print("Reading from ",path)

file =  open(path,"r")
line_count = 0
for line in file:
    line_count += 1
    print(line)
file.close()
print("Line count: ",line_count)



 
---


- Ugly printout since each `line` includes a trailing `"\n"` (and `print` adds another line break), and the file ends with empty lines.



In [None]:
#path = ...

filelist=[]
file =  open(path,"r")
for x in file:
    filelist.append(str(x).replace('\n',''))
file.close()

In [None]:
filelist[3]

In [None]:
file =  open(path,"r")
file.close()    # Close the connection; the type of the object is unchanged

type(file)

# Input/output modules

https://docs.python.org/3/library/io.html

The io module provides Python’s main facilities for dealing with various types of I/O. There are three main types of I/O: text I/O, binary I/O and raw I/O. These are generic categories, and various backing stores can be used for each of them. A concrete object belonging to any of these categories is called a file object. Other common terms are stream and file-like object.

`f = open("myfile.txt", "r", encoding="utf-8")` sets the text encoding explicitly. This is often not necessary on *nix systems, where UTF-8 is usually the default.

Note: text encoding is important. Use `utf-8` (Universal Coded Character Set Transformation Format, 8-bit).
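
A small sketch of why the encoding matters: characters such as `ä` and `ö` take more than one byte in UTF-8, so the same encoding must be used when writing and reading. The file name `encoding_demo.txt` is made up, and the file is placed in a temporary directory.

```python
import os
import tempfile

text = "Campus Växjö\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'encoding_demo.txt')   # hypothetical file name

    file = open(path, "w", encoding="utf-8")        # write using UTF-8
    file.write(text)
    file.close()

    file = open(path, "r", encoding="utf-8")        # read using the same encoding
    read_back = file.read()
    file.close()

    size_in_bytes = os.path.getsize(path)

print(read_back.strip())
print("Characters:", len(text), "Bytes on disk:", size_in_bytes)
```

The number of bytes on disk is larger than the number of characters, since `ä` and `ö` are stored as two bytes each.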

In [None]:
#file = 'test'
file =  open(path,"r")

for x in file:
    print(x)

file.close()

- Ugly print problem solved by using `print(line.strip())` -> remove trailing `"\n"`
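
Note that `strip()` removes all whitespace at both ends of the string, not only the line break. A short sketch of the difference, using a made-up sample line:

```python
line = "  KING ARTHUR: Whoa there!  \n"   # made-up sample line with extra spaces

print(repr(line))               # the raw line, including the trailing "\n"
print(repr(line.strip()))       # whitespace removed at both ends
print(repr(line.rstrip("\n")))  # only the trailing line break removed
```

If leading spaces (indentation) matter in your text, `rstrip("\n")` may be the safer choice.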



In [None]:

path = '/Users/frahaa/dev/courses/1DV501'
path += "/data/holy_grail_script_scene1.txt"

file =  open(path,"r")
full_text = ""
for line in file:
    full_text += line
file.close()
print(full_text)

---

- We first store the entire text in a string (including linebreaks)

- Reading text is easy, just remember:
    - We read the text line by line,
    - Each line also includes a final `"\n"`, and
    - Empty lines are also included.

---


- It is important to close the file connections (`file.close()`) once reading/writing is done.
- A non-closed connection might cause problems later on when you try to access a file.


## Writing text to a file



In [None]:

path = '/Users/frahaa/dev/courses/1DV501/output.txt'
full_text = 'It is I, Arthur, son of Uther Pendragon, from the castle of Camelot. King of the Britons, defeator of the Saxons, sovereign of all England!'

file = open(path,"w")
file.write(full_text)
file.close()


In [None]:
file = open(path,'r')
for line in file:
    print(line.strip())
file.close()

- Write entire text to file.

- Result: Text in file has same formatting as `full_text`.



In [None]:
lines = ["do\n","re\n","mi\n","fa\n","so\n","la\n"]

file = open(path,"w")
file.writelines(lines)
file.close()


In [None]:
file = open(path,'r')
for line in file:
    print(line.strip())
file.close()

- We write text line by line to file.
- Result: do,re,mi,fa,so,la as six separate lines.

- Writing text is also easy, just remember to handle the line breaks.
---

### Recommendations
- Always look at the content of the file you are about to read to understand how it is organized
- Always open the output file when writing to a file to inspect the result

## Reading and Writing text - Summary

We use `open(...)` to make a file connection
-  `open(path,"r")` -> open file for reading. The program will crash if the file doesn't exist (or is read protected)
-  `open(path,"w")` -> open file for writing. The file will be created if it doesn't exist, or replaced if it does exist.
-  `open(path,"a")` -> open file for appending -> add new text at the end of a file. The file will be created if it doesn't exist, or appended if it does exist.
-  Default is `"r"` -> `open(path)` means open file for reading
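
The crash when a file is missing is an exception of type `FileNotFoundError`. A minimal sketch of how it could be handled with `try`/`except` instead of letting the program crash; the file name `missing.txt` is made up and is guaranteed not to exist since it is looked up in a fresh temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "missing.txt")   # hypothetical file that does not exist

    try:
        file = open(path, "r")
    except FileNotFoundError:
        # We end up here instead of crashing
        message = "Could not find the file missing.txt"
    else:
        # Only executed if open(...) succeeded
        message = "Opened the file missing.txt"
        file.close()

print(message)
```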

---

- `file` in `file = open(...)` is a file object. File object usage:

-  `for line in file:` -> read one line at a time
-  `full_text = file.read()` -> read entire file content
-  `file.write(full_text)` -> write entire text 
-  `file.writelines(lines)` where `lines` is a list of strings -> write line by line (but not adding any linebreaks) 
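
A compact sketch that combines the modes `"w"`, `"a"` and the default `"r"` with `writelines`, `write` and `read`. The file name `notes.txt` is made up, and the file lives in a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'notes.txt')   # hypothetical file name

    file = open(path, "w")                  # create (or replace) the file
    file.writelines(["do\n", "re\n"])       # we supply the line breaks ourselves
    file.close()

    file = open(path, "a")                  # append at the end of the file
    file.write("mi\n")
    file.close()

    file = open(path)                       # default mode is "r"
    full_text = file.read()                 # read entire file content
    file.close()

print(full_text)
```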


## Safe file handling with `with-as`



In [None]:
path = '/Users/frahaa/dev/courses/1DV501/output3.txt'

#with open(path, "r") as file:
#   for line in file:
#       print( line.strip() )     
        

# Safe file writing 

with open(path, "a") as file:
   file.write("First line to add\n")
   file.write("Last line to add\n")

- `with` and `as` are two Python keywords
- The `with-as` statement includes file closing and was introduced to make sure that an open file is always closed (no matter what happens)
- **Although a bit cryptic, it is the recommended approach to open a file.**
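
A small sketch showing that the file really is closed after a `with-as` block, also when an error occurs inside the block. The file name `numbers.txt` and its content are made up; `file.closed` is an attribute of file objects that tells whether the file has been closed.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'numbers.txt')   # hypothetical file name

    with open(path, "w") as file:
        file.write("42\nnot a number\n")
    closed_after_write = file.closed          # True, the with block closed the file

    try:
        with open(path, "r") as file:
            for line in file:
                n = int(line.strip())         # fails on the second line
    except ValueError:
        closed_after_error = file.closed      # True, closed despite the error

print("Closed after writing:", closed_after_write)
print("Closed after error:", closed_after_error)
```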


# Reading Data Files

Two examples of numeric files with different formatting.

- File: `integers1.dat` (show in text editor)
- File: `integers2.dat` (show in text editor)

**Task:** For each file, implement a function read_file(path) returning an integer list.


*Notice*
- Different formatting -> different versions of read_file(path)
- We will get data as strings -> must be converted to integers

In [None]:
import os

def read_integers1(path):
    lst = []
    with open(path, "r") as file:
        for line in file: # Read one line at a time
            n = int(line.strip()) # Strip and convert to integer
            lst.append(n)
    return lst

# Program starts
path = '/Users/frahaa/dev/courses/1DV501' + "/data/integers1.dat"
lst1 = read_integers1(path)
print("integers1.dat: ", lst1)

In [None]:
import os

def read_integers2(path):
    lst = []
    with open(path, "r") as file:
        as_string = file.read() # Read entire file (as a long string) 
        string_list = as_string.split(";") # Split string into smaller strings
        for s in string_list: # Convert each string to an integer
            lst.append(int(s))
    return lst

# Program starts
path = '/Users/frahaa/dev/courses/1DV501' + "/data/integers2.dat"
lst2 = read_integers2(path)
print("integers2.dat: ", lst2) 

# File IO - Recommendations

- Always look at the content of the file you are about to read to understand how it is organized
- Text is often read line by line. Use `strip()` to remove line breaks (`\n`)
- Numeric data can be organized in many different ways. For example, comma-separated or one number per row.
- Use the `split()` method to divide rows into lists
- File reading gives data as a string -> must be converted to numbers when dealing with numeric data
- Always open the output file when writing to a file to inspect the result
- Start with small samples and print a lot to make sure that everything works
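
A minimal sketch of the `strip()`, `split()` and convert steps for one comma-separated row. The sample row is made up; it ends with a comma to show that `split()` can produce empty strings, which `int(...)` cannot convert.

```python
row = "12, 7, 42, 3,\n"          # made-up sample row with a trailing comma

numbers = []
for piece in row.strip().split(","):
    piece = piece.strip()        # remove spaces around each piece
    if piece == "":              # the trailing comma gives an empty last piece
        continue
    numbers.append(int(piece))   # convert string to integer

print(numbers)
print("Sum:", sum(numbers))
```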

# Assignment 3 - Exercise 4

Here you can download two large files containing English text:
- `holy_grail.txt` containing the script of the Monty Python movie The Holy Grail. (It is a Python course after all!)
- `eng_news_100K-sentences.txt` containing 100,000 sentences taken from English newspapers.

## Task

Write a program `find_words.py` that can identify individual words in a text and store these words (in lower case) in a separate text file.
By a word we mean a string containing only the English letters plus ”’” and ”-”. Hence, we consider words like ”can’t”, ”John’s”, and ”full-time” as valid words. Furthermore, a word doesn’t contain any digits, or symbols like ”.”, ”,”, ”!”, ”?”, etc.

*Notice: The result of this exercise (two files containing only the words from the two text files) will be used later on in other exercises and the Mini-project.*

# Exercise 4 - Recommended Approach

**Step 1**

1. A function `read_file(file_path)` that reads the file specified by `file_path` and returns a list of strings containing each row in the text file.
2. A function `get_words(row)` that divides a row (a string) into words and returns a list of words (strings)
3. A function `save_words(file_path, words)` that saves all the words in the list `words` in the file specified by `file_path`

**Step 2**

- The function `get_words(row)` is the hard part. How do you divide a row (text) into a list of words? How do you approach the problem?

# Exercise 4 - Identifying words

Input: A sentence from text as a string.

### Steps to be taken

1. Convert sentence to lower case sentence
2. Replace punctuation marks (e.g. `".","!","?",":", ...`) that are not part of words with `" "`
3. Split sentence into list of raw words using `split()`. Example raw words:
    ```
    '11', '53am', 'thursday', '13', '(january)', '2011', 'emotions',
    'fly', 'during', 'ridge', 'lane', 'development', 'hearing', 'watfor
    'town', 'hall', 'reached', 'boiling', 'point', 'when', 'x', 'i'
    'controversial', 'plans', 'for', 'a', 'back', 'garden', 'john'
    'development', 'resurfaced', 'last', 'night', '3mx3m',
    '11am', 'this', 'probably', 'doesn’t', 'come', 'as', 'a', 'm'
    ```
4. Iterate over raw words and remove non-words -> our list of words
5. Examples of non-words:
    - Any raw word containing digits
    - Any single-letter raw word other than ”a” and ”i” (the only single-letter words in English are ”a” and ”I”)

In general, look at printouts, identify non-words, and try to remove them
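
The steps above could be sketched roughly as follows. This is one possible starting point under simple assumptions, not a complete solution: the set of punctuation marks, the set of allowed characters and the sample sentence are all chosen for illustration, and real text will most likely need more filtering.

```python
def get_words(row):
    """Split a row of text into a list of lower-case words (a rough sketch)."""
    allowed = "abcdefghijklmnopqrstuvwxyz'’-"   # letters plus apostrophes and hyphen
    row = row.lower()                           # Step 1: lower case
    for mark in ".,!?:;()\"":                   # Step 2: punctuation -> space
        row = row.replace(mark, " ")
    words = []
    for raw in row.split():                     # Step 3: raw words
        only_allowed = True
        for ch in raw:
            if ch not in allowed:               # e.g. a digit
                only_allowed = False
        if not only_allowed:
            continue                            # Step 4: skip non-words like '53am'
        if len(raw) == 1 and raw not in ("a", "i"):
            continue                            # skip single letters other than a and i
        words.append(raw)
    return words

# Made-up sample sentence in the style of the news file
sample = "11:53am Thursday 13 (January) 2011: Emotions fly, this probably doesn’t come as a surprise x!"
words = get_words(sample)
print(words)
```

Printing the raw words that are skipped, as well as the ones that are kept, makes it easier to check that no proper words are removed.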

# Exercise 4 - Results

Result: A file containing words taken from the input text file.

- No exact word definition -> no exact word count
- We got the following word counts: 1896870 for `100k_sentences`, 10799 for `holy_grail`.
- We don’t expect you to get these results exactly, but they should be about the same (say ±5%).
- Switching from `100k_sentences` to `holy_grail` should only require changing the input path, not writing two different programs.

## Suggestions
- Do not start with large text files.
- Start with text files with only (say) 20 sentences that allow manual inspection (printouts)
- Try using multiple small text samples
- Evaluate effect of, for example removing any raw word containing parentheses, by printing words that are removed -> make sure that proper words are not removed.

# Assignment 3
- Assignment 3 is now available in Moodle
- Assignment 3 should be submitted using Gitlab. This holds for all students -> campus, distance, as well as ”Others”.
- Instructions for how to use Gitlab will be published in Moodle. Post questions in Slack if you have problems.

*Notice*
- Assignment 3 contains fewer but larger and more complex exercises.

Take a small-step approach where you divide the problem into several smaller problems which you solve one at a time. Test that each part works before you move on to the next one. Also, be sure to read the entire exercise text before you start any coding.