## File I/O (Input / Output)

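## Jupyter only: writing a cell to a text file
* %%writefile filename.ext

This cell magic writes the contents of the cell to a file in the current working directory (run pwd first to see where that is).

%%writefile is a Jupyter (IPython) cell magic, not a Python command, so it does not work in a plain .py script.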
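
Outside Jupyter, roughly the same effect as %%writefile can be had with ordinary Python. The following is a minimal sketch, not the notebook's own method; the file name mylib_demo.py and the temporary folder are assumptions chosen only to keep the example free of side effects.

```python
from pathlib import Path
import tempfile

# hypothetical target folder: a temporary directory, so nothing in the project is touched
target_dir = Path(tempfile.mkdtemp())

# the text we want to end up in the file (a cut-down version of the module idea)
source = "MY_PI = 3.1415926\n\ndef add(a, b):\n    return a + b\n"

target = target_dir / "mylib_demo.py"  # hypothetical file name
written = target.write_text(source, encoding="utf-8")  # returns the number of characters written
print(f"Wrote {written} characters to {target.name}")
```

Path.write_text opens the file, writes the string and closes the file again, so it overwrites any existing file of the same name just as %%writefile does.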

In [1]:
import os
os.getcwd()

'd:\\Github\\RTU_Python_720_Fall_2020\\core'

In [1]:
%%writefile mylib.py
# this is a small .py file that we will use as a module (import)
import math # importing standard Python math library
MY_PI = 3.1415926

def nb_year(pop_start, percent, yearly_arrival, pop_end):
    count = 0
    population = pop_start
    while population < pop_end:
        # without rounding down, the shorthand would be population *= (1 + percent / 100)
        # or, also short, population += population * percent / 100
        population = population + math.floor(population * percent / 100)
        population += yearly_arrival
        count += 1
    return count

def add(a,b):
    return a+b

# could add main guard here

Writing mylib.py
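
The main guard mentioned in the last comment of mylib.py could look roughly like the sketch below. It is not part of the file written above; the function name main is a hypothetical choice, and add is repeated only so the block runs on its own.

```python
def add(a, b):
    return a + b


def main():
    # code here runs only when the file is executed directly,
    # for example with: python mylib.py
    print("add(5, 15) =", add(5, 15))


# __name__ is "__main__" when the file is run as a script,
# and the module name (e.g. "mylib") when the file is imported
if __name__ == "__main__":
    main()
```

With such a guard, import mylib stays silent, while running the file as a script executes main.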


In [2]:
import mylib # import looks FIRST in your current directory, so the mylib.py written above is found there
# a copy in a folder listed in the PYTHONPATH environment variable (here C:\PyLib\mylib.py) would also be found

In [3]:
print(mylib.MY_PI)

3.1415926


In [None]:
mylib.nb_year(100,2,0,200)

42

In [4]:
mylib.add(5,15)  # so adding on the fly is problematic on Google Colab

20

![Frost](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Robert_Frost_NYWTS_3.jpg/440px-Robert_Frost_NYWTS_3.jpg)

[Robert Frost](https://en.wikipedia.org/wiki/Robert_Frost)

In [5]:
%%writefile two_roads.txt
Robert Frost
The Road not Taken

Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;

Then took the other, just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,

3rd verse

And both that morning equally lay
In leaves, no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.

4th verse

I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.

Writing two_roads.txt


## Path from pathlib library

pathlib is part of the standard Python library and has shipped with Python since version 3.4.

In [None]:
from pathlib import Path # pathlib is the modern way to handle file paths in Python
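
As a small illustration of why Path objects are convenient, the sketch below builds a path with the / operator and takes it apart again. The folder names data and poems are hypothetical; nothing is read from or written to disk.

```python
from pathlib import Path

# the / operator joins path pieces with the right separator for the current OS
data_dir = Path("data") / "poems"        # hypothetical folder names
poem_path = data_dir / "two_roads.txt"

print(poem_path.name)    # file name with extension
print(poem_path.parent)  # the folder part
print(poem_path.parts)   # tuple of all path components

# with_suffix returns a NEW Path, the original is unchanged
csv_path = poem_path.with_suffix(".csv")
print(csv_path.name)
```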

## Find out current working directory

In [7]:
current_work_directory = Path.cwd()
# this will give us absolute path to the current working directory
print(f"Current working directory is: {current_work_directory}")

Current working directory is: d:\Github\RTU_Python_720_Fall_2020\core


## Finding what we have in current working directory

In [None]:
# we are using list comprehension to get all files in the current directory
# we are using . to represent the current directory
# we are using * to represent all files
# here in the glob - which is for current directory - we are looking for all .txt files
# we obtain a list of Path objects as a result
files = [f for f in Path(".").glob("*.txt") if f.is_file()]
files # note it gives Windows path on Windows OS
# on Mac we would see Posix path

[WindowsPath('alice_queen.txt'),
 WindowsPath('requirements.txt'),
 WindowsPath('somefile.txt'),
 WindowsPath('two_roads.txt')]

In [12]:
# let's get the last file in the list
file_path = files[-1]
print(f"Last file in the list is: {file_path}") # notice when we print this Windows Path
# we are only shown the string representation of the Path object
# we can get the name of the file by using the .name attribute
print(f"Name of the file is: {file_path.name}")
# we can get stem of the file by using the .stem attribute
print(f"Stem of the file is: {file_path.stem}")
# we can get extension of the file by using the .suffix attribute
print(f"Extension of the file is: {file_path.suffix}")


Last file in the list is: two_roads.txt
Name of the file is: two_roads.txt
Stem of the file is: two_roads
Extension of the file is: .txt


In [16]:
print(f"Opening the file: {file_path.name}")
with open(file_path, encoding="utf-8") as f: #note that mode="r" is the default mode
    text_from_file = f.read() # generally this is all you need to read the file
    # file is still open here
    another_text_read = f.read() # this will read nothing because we are at the end of the file
    # it is possible to reset the file pointer to the beginning of the file
    f.seek(0) # we can start over, we can seek also to specific position in file
    # seek might be useful if you have a non standard file format and need to skip some bytes
    yet_another_text_read = f.read() # this will read the file again
    # again file is still open but again exhausted
# file is automatically closed after the block
print("Got", len(text_from_file), "symbols in", file_path)
# now let's see how many symbols we got in another_text_read
print("Got", len(another_text_read), "symbols in another_text_read")
# now let's see how many symbols we got in yet_another_text_read
print("Got", len(yet_another_text_read), "symbols in yet_another_text_read")

Opening the file: two_roads.txt
Got 783 symbols in two_roads.txt
Got 0 symbols in another_text_read
Got 783 symbols in yet_another_text_read
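
When the whole file is wanted as one string, Path.read_text is a shorter alternative to with open. The sketch below uses a hypothetical demo.txt in a temporary folder and also repeats the read / seek behaviour shown in the cell above.

```python
from pathlib import Path
import tempfile

# hypothetical demo file in a temporary folder
demo = Path(tempfile.mkdtemp()) / "demo.txt"
demo.write_text("first line\nsecond line\n", encoding="utf-8")

# one call: open, read everything, close
text = demo.read_text(encoding="utf-8")

with open(demo, encoding="utf-8") as f:
    first = f.read()   # whole file
    empty = f.read()   # nothing left, we are at the end of the file
    f.seek(0)          # back to the beginning
    again = f.read()   # whole file once more

print(len(text), len(first), len(empty), len(again))
```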


In [None]:
import string # this includes some useful constants like
string.punctuation # not all punctuation is here, but most common is
# so this could serve as a good start for cleaning up the text

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

## Recipe to clean up bad characters in text file

In [19]:
clean_text = text_from_file # here text is still dirty...
# how many we have before cleaning
print("Got", len(clean_text), "symbols before cleaning")
for punct in string.punctuation:
    clean_text = clean_text.replace(punct, "") # so I can replace all occurrences of all punctuation
print("Got", len(clean_text), "symbols after cleaning")

Got 783 symbols before cleaning
Got 765 symbols after cleaning
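
An alternative to the replace loop is str.translate with a table built by str.maketrans, which removes all listed characters in one pass. This is a sketch on a two-line sample rather than the full file; note that string.punctuation is ASCII only, so the long dash after "I" in the poem survives both approaches.

```python
import string

# two lines from the poem; \u2014 is the long dash that follows "I"
sample = "Two roads diverged in a wood, and I\u2014\nI took the one less traveled by,"

# third argument of maketrans lists the characters to delete
table = str.maketrans("", "", string.punctuation)
cleaned = sample.translate(table)

print(cleaned)
print("long dash still present:", "\u2014" in cleaned)
```

If such characters should go too, they can be appended to the delete list, for example string.punctuation + "\u2014".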


In [20]:
# head of cleaned text
# first 300 symbols
print(clean_text[:300])

Robert Frost
The Road not Taken

Two roads diverged in a yellow wood
And sorry I could not travel both
And be one traveler long I stood
And looked down one as far as I could
To where it bent in the undergrowth

Then took the other just as fair
And having perhaps the better claim
Because it was grass


In [21]:
# tokens = alice_text.split()
tokens = clean_text.split() # very handy because it splits by ANY whitespace, space, tab, newline, and more esoteric ones
print("Got", len(tokens), "tokens in", file_path)

Got 153 tokens in two_roads.txt


In [22]:
from collections import Counter
word_count = Counter(tokens)
word_count.most_common(20)

[('I', 8),
 ('the', 8),
 ('And', 6),
 ('as', 4),
 ('in', 3),
 ('a', 3),
 ('one', 3),
 ('and', 3),
 ('that', 3),
 ('not', 2),
 ('Two', 2),
 ('roads', 2),
 ('diverged', 2),
 ('wood', 2),
 ('could', 2),
 ('both', 2),
 ('be', 2),
 ('it', 2),
 ('took', 2),
 ('for', 2)]

## Normalization before counting

In the example above the same word is counted separately when its case differs (for example And and and). We can normalize the text to lowercase before counting.

In [24]:
# we have two ways of fixing it, we could lowercase the original text
# or we could lowercase the tokens we just created
# in a way it is easier to lowercase the original text and THEN split again
# but for now let's see how we can lowercase the tokens
# we could use list comprehension
# but for this we will use good old for loop
lowercase_tokens = []
for token in tokens:
    lowercase_tokens.append(token.lower())
# let's see how many tokens you got
print("Got", len(lowercase_tokens), "tokens in lowercase_tokens")
# we could even make a very basic assertion that length of tokens and lowercase_tokens should be the same
assert len(tokens) == len(lowercase_tokens), "Number of tokens should be the same"
# if assertion is correct, nothing happens, if it is wrong, an exception is raised

Got 153 tokens in lowercase_tokens


In [None]:
# let's count again
lower_counter = Counter(lowercase_tokens)
lower_counter.most_common(20) # this is much better now our counts are not case sensitive
# we see that and and And are now counted together
# same for the and The
# small downside is that we lost the original case of words such as I and Two
# there are cases where case is important (pun intended)

[('the', 9),
 ('and', 9),
 ('i', 8),
 ('in', 4),
 ('as', 4),
 ('a', 3),
 ('one', 3),
 ('that', 3),
 ('not', 2),
 ('two', 2),
 ('roads', 2),
 ('diverged', 2),
 ('wood', 2),
 ('could', 2),
 ('both', 2),
 ('be', 2),
 ('to', 2),
 ('it', 2),
 ('took', 2),
 ('for', 2)]
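
The other route mentioned in the comments, lowercasing the text first and splitting afterwards, fits in one line. A minimal sketch on a made-up sample sentence:

```python
from collections import Counter

# made-up sample, loosely based on the poem
sample = "And sorry I could not travel both and be one traveler"

# lowercase the whole text once, then split on any whitespace
counts = Counter(sample.lower().split())

print(counts.most_common(1))
```

Here And and and end up under the same key, so the most common token is and with a count of 2.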

In [26]:
# CSV - comma separated values files are just text files with some structure
N = 50 # of course could use any other name for the variable
with open("forst_word_stats.csv", mode="w", encoding="utf-8") as f: # to write I need to specify mode="w"
    f.write("word, frequency\n") # we write a single row
    # f.write("\n".join(word_count.most_common(N)))
    for my_tuple in lower_counter.most_common(N): # it is easier to change values in one place
        f.write(f"{my_tuple[0]},{my_tuple[1]}\n") # most common returns a list of tuples
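
Writing rows by hand works for simple words, but breaks once a value itself contains a comma or a quote. The standard csv module handles that quoting. Below is a sketch under assumptions: the sample text, the file name word_stats_demo.csv and the temporary folder are all made up for illustration.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path

counts = Counter("the road and the wood and the sigh".split())
out_path = Path(tempfile.mkdtemp()) / "word_stats_demo.csv"  # hypothetical file

# newline="" is what the csv documentation recommends when opening files for csv
with open(out_path, mode="w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "frequency"])   # header row
    writer.writerows(counts.most_common(2))  # each (word, count) tuple becomes one row

# read it back to check; csv.reader returns every value as a string
with open(out_path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```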


## Changing file names using Python

It is possible to change file names using Python.

In [27]:
# let's find the latest CSV file that we created
csv_files = [f for f in Path(".").glob("*.csv") if f.is_file()]
# let's sort by modification time
# sort takes an optional argument key:
# a function that is called on each element before sorting
# a lambda is a way to define such a small function in one line
# the effect is that we sort files not by file name but by modification time
csv_files.sort(key=lambda f: f.stat().st_mtime)
# latest file is the last one
latest_csv = csv_files[-1]
# let's print the name of the latest file
# we could print the modification time as well but it is not very human readable
# it is based on epoch time starting from 1970.1.1
print(f"Latest csv file is: {latest_csv.name}")

Latest csv file is: forst_word_stats.csv


In [28]:
# so let's rename this file to something more meaningful
# the correct name is "frost_word_stats.csv"
# we could use the .rename method of the Path object
# we could also build the new name with the .replace method of the string object, using latest_csv.name

# let's use rename
# latest_csv must be Path object for this to work

latest_csv.rename("frost_word_stats.csv")

WindowsPath('frost_word_stats.csv')
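
A slightly more defensive variant of the rename is sketched below. It assumes hypothetical file names in a temporary folder. Path.with_name keeps the new file in the same folder as the old one (a bare string passed to rename is interpreted relative to the current working directory), and the exists check avoids clobbering a file that is already there.

```python
from pathlib import Path
import tempfile

work_dir = Path(tempfile.mkdtemp())          # temporary folder for the demo
old_path = work_dir / "forst_demo.csv"       # hypothetical misspelled name
old_path.write_text("word,frequency\n", encoding="utf-8")

# same folder, new file name
new_path = old_path.with_name("frost_demo.csv")

if new_path.exists():
    print(f"{new_path.name} already exists, not renaming")
else:
    old_path.rename(new_path)
    print(f"Renamed {old_path.name} to {new_path.name}")
```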

## Reading lines from a file

In [29]:
# we could read all text lines at once
with open(file_path, encoding="utf-8") as f:
    lines = f.readlines()
    # file is still open but the file pointer is at the end
# file is closed here
# how many lines we got
print("Got", len(lines), "lines in", file_path)

Got 30 lines in two_roads.txt


## Reading rows of text of larger files

In [30]:
# instead of reading the file all at once we can read it line by line
# this is beneficial for very large files
# we could even process a 1TB file with no problem
# as long as we do not try to load it all at once
# of course this file should have more than 1 line :)

# let's read the file line by line
with open(file_path, encoding="utf-8") as f:
    for row in f:
        print(row, end="")
        # we could have done something else here

Robert Frost
The Road not Taken

Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;

Then took the other, just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,

3rd verse

And both that morning equally lay
In leaves, no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.

4th verse

I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.


In [None]:
# so let's use this approach to save length of each row
row_counts = []
with open(file_path, encoding="utf-8") as f:
    for row in f:
        row_counts.append(len(row)) # so we count the length of each row one by one
# let's see what we got
print(row_counts)
# note that row counts also includes newline characters
# so for true characters we should subtract 1 from each count

[13, 19, 1, 37, 34, 34, 38, 37, 1, 35, 37, 39, 37, 37, 1, 10, 1, 34, 38, 38, 37, 38, 1, 10, 1, 36, 31, 37, 33, 38]
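
Subtracting 1 is only right when every row ends with a newline, and the last row of a file may not. Stripping the newline before measuring avoids that edge case. A sketch using io.StringIO as a stand-in for an open file (the sample text is made up):

```python
import io

# StringIO behaves like an open text file; the last row has no trailing newline
fake_file = io.StringIO("Robert Frost\nThe Road not Taken\n\nlast line without newline")

# rstrip("\n") removes the newline if present and does nothing otherwise
lengths = [len(row.rstrip("\n")) for row in fake_file]
print(lengths)
```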


## Use with open always! 

* closes automatically!
* throws exceptions on errors
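
To see what with open saves us from writing, the sketch below checks the closed attribute inside and after the block and then shows a roughly equivalent manual version with try / finally. The file with_demo.txt in a temporary folder is hypothetical.

```python
from pathlib import Path
import tempfile

demo = Path(tempfile.mkdtemp()) / "with_demo.txt"  # hypothetical demo file
demo.write_text("hello\n", encoding="utf-8")

with open(demo, encoding="utf-8") as f:
    inside = f.closed   # False: file is open inside the block
after = f.closed        # True: closed as soon as the block ends

# roughly what the with statement does for us
f2 = open(demo, encoding="utf-8")
try:
    data = f2.read()
finally:
    f2.close()          # runs even if read() raised an exception

print(inside, after, f2.closed)
```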

In [None]:
# Idiom on how to open AND close a file for reading and doing work
with open('two_roads.txt', encoding="utf-8") as fin: # fin or f or file_in, whatever name makes most sense; f is very common
    for line in fin:
        print(line) # each line keeps its own newline, so print adds an extra blank line
        # do work with each line here, save into a list or other structure
    # we can do more work with file here
    # maybe fin.seek(0) to read it again for some reason
    # File will be closed once this block ends
# File is closed now, automatically
print("file is closed already here")

Robert Frost

The Road not Taken



Two roads diverged in a yellow wood,

And sorry I could not travel both

And be one traveler, long I stood

And looked down one as far as I could

To where it bent in the undergrowth;



Then took the other, just as fair,

And having perhaps the better claim,

Because it was grassy and wanted wear;

Though as for that the passing there

Had worn them really about the same,



3rd verse



And both that morning equally lay

In leaves, no step had trodden black.

Oh, I kept the first for another day!

Yet knowing how way leads on to way,

I doubted if I should ever come back.



4th verse



I shall be telling this with a sigh

Somewhere ages and ages hence:

Two roads diverged in a wood, and I—

I took the one less traveled by,

And that has made all the difference.
file is closed already here


## For MacOS and Linux
* use pwd to see where you are
* myfile = open("/Users/MyUserName/SomeFolder/MaybeAnotherFolder/myfile.txt")

## For Windows
* use pwd to see where you are
* myfile = open("C:\\Users\\MyUserName\\SomeFolder\\MaybeAnotherFolder\\myfile.txt")
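
The doubled backslashes in the Windows path are needed because a single backslash starts an escape sequence inside a normal Python string. A raw string (prefix r) is an alternative. The sketch below uses the pure path classes from pathlib, which only manipulate the text of a path and never touch the disk, so it runs the same on any OS; the folder names are the placeholder names from the examples above.

```python
from pathlib import PurePosixPath, PureWindowsPath

# escaped backslashes and a raw string describe the same Windows path
escaped = "C:\\Users\\MyUserName\\SomeFolder\\MaybeAnotherFolder\\myfile.txt"
raw = r"C:\Users\MyUserName\SomeFolder\MaybeAnotherFolder\myfile.txt"
print(escaped == raw)

win = PureWindowsPath(raw)
posix = PurePosixPath("/Users/MyUserName/SomeFolder/MaybeAnotherFolder/myfile.txt")

print(win.name, posix.name)   # file name is the same on both
print(win.as_posix())         # same Windows path written with forward slashes
print(posix.parent.name)      # name of the containing folder
```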

In [32]:
# Jupyter Magic !someOScommand for example !dir or !ls
!dir

 Volume in drive D is Data
 Volume Serial Number is 72A3-8E69

 Directory of d:\Github\RTU_Python_720_Fall_2020\core

11/21/2024  05:19 PM    <DIR>          .
11/21/2024  05:19 PM    <DIR>          ..
02/27/2023  06:11 PM             8,779 alice_queen.txt
11/21/2024  05:08 PM               434 frost_word_stats.csv
02/27/2023  06:11 PM             8,015 Jupyter Tips.ipynb
11/14/2024  04:58 PM             1,064 myAprilMod.py
11/14/2024  05:13 PM    <DIR>          myAprilPackage
11/21/2024  04:43 PM               611 mylib.py
02/27/2023  06:11 PM             1,131 MyMod.ipynb
02/27/2023  06:11 PM             1,378 Practice_1.ipynb
02/27/2023  06:11 PM            44,701 Python Classes.ipynb
04/17/2023  05:44 PM           157,878 Python Dictionaries.ipynb
02/27/2023  06:11 PM            72,412 Python File IO.ipynb
02/27/2023  06:11 PM            35,579 Python File Operations 2 Binary Files and Pickle.ipynb
11/07/2024  04:44 PM            78,151 Python Flow Control.ipynb
02/27/2023  06:11 PM

In [None]:
pwd

'/content'

In [42]:
# importing OS specific library for system work
# idea being that we can do same on Windows/Mac/Linux and not worry about the OS
import os

# absolute paths and relative paths

## Filtering file and writing a new one

In [34]:
# so let's open two files at once
# this way we can process the file sequentially
# even truly large files can be processed this way

# we will use the with statement
# we will use the open function
with open(file_path, encoding="utf-8") as fin: # so fin is file stream in
    with open("clean_frost.txt", mode="w", encoding="utf-8") as fout: # and file stream out
        for line in fin:
            # here we have a single line from incoming file read into line from fin stream
            # we could check if it has some specific properties
            # simple one would be whether it has any nonwhitespace characters
            if not line.strip(): # means we got nothing after stripping whitespace
                # we do nothing with empty lines and continue to the next line
                continue

            # let's keep all lines that start with And
            if line.startswith("And"):
                # we could do some processing here
                # and write to fout
                fout.write(line) # so we are copying the line into the output file


# if you know Linux shell command you could have used grep, sed, awk to do the same
# but Python approach is more portable and more readable
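
The two nested with statements can also be written as a single with that opens both files, which saves one level of indentation. A sketch with hypothetical file names in a temporary folder and a three-line sample instead of the full poem:

```python
from pathlib import Path
import tempfile

work_dir = Path(tempfile.mkdtemp())
src = work_dir / "poem_demo.txt"       # hypothetical input file
dst = work_dir / "filtered_demo.txt"   # hypothetical output file
src.write_text(
    "And sorry I could not travel both\n"
    "\n"
    "Then took the other\n"
    "And that has made all the difference.\n",
    encoding="utf-8",
)

# both files are opened here and both are closed when the block ends
with open(src, encoding="utf-8") as fin, open(dst, mode="w", encoding="utf-8") as fout:
    for line in fin:
        if line.startswith("And"):
            fout.write(line)

kept = dst.read_text(encoding="utf-8").splitlines()
print(kept)
```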

In [None]:
def get_filtered_list(fname, good_words=()):
    lines_to_keep = []
    with open(fname, encoding="utf-8") as f:
        for line in f: # we go through file line by line
            if any(word in line for word in good_words):
                lines_to_keep.append(line)
    # file is closed here already
    print(len(lines_to_keep), f"lines with {good_words}")
    return lines_to_keep

## Appending to a file

You can only append data to a file at the end of the file. You can't insert data in the middle of a file or at the beginning of a file.

If you need to insert something in middle or beginning, you need to read the file, modify it and write it back completely.
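
The read, modify, write back cycle described here can be sketched as follows for inserting a line at the beginning. The file insert_demo.txt in a temporary folder is hypothetical, and the approach assumes the file is small enough to fit in memory.

```python
from pathlib import Path
import tempfile

target = Path(tempfile.mkdtemp()) / "insert_demo.txt"  # hypothetical demo file
target.write_text("second line\nthird line\n", encoding="utf-8")

# 1. read everything, 2. build the new content, 3. write the file back completely
original = target.read_text(encoding="utf-8")
target.write_text("first line\n" + original, encoding="utf-8")

result = target.read_text(encoding="utf-8").splitlines()
print(result)
```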

In [37]:
# let's add a row of stars to clean_frost.txt
# also let's add current date and time
from datetime import datetime
with open("clean_frost.txt", mode="a", encoding="utf-8") as fout: # mode="a" is for append
    # a will add to the end of the file
    fout.write("*" * 80 + "\n")
    fout.write(f"File clean_frost.txt was created on {datetime.now()}\n")
    fout.write("*" * 80 + "\n")

### Modes:
  * 'r' - Read Only (the default)
  * 'w' - Write Only (and will overwrite existing files!!!)
  * 'a' - Append Only (stream is at the end of file!)
  * 'r+' - Read and Write
  * 'w+' - Write and Read, overwriting an existing file or making a new one

  From C (fopen):
  * 'r+' - Open for reading and writing. The stream is positioned at the beginning of the file.
  * 'w+' - Open for reading and writing. The file is created if it does not exist, otherwise it is truncated (**destroyed!**). The stream is positioned at the beginning of the file.
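
The difference between the modes is easiest to see on a tiny file. The sketch below uses a hypothetical modes_demo.txt in a temporary folder and inspects the content after each step.

```python
from pathlib import Path
import tempfile

demo = Path(tempfile.mkdtemp()) / "modes_demo.txt"  # hypothetical demo file

with open(demo, mode="w", encoding="utf-8") as f:   # creates the file (or empties it)
    f.write("one\n")

with open(demo, mode="a", encoding="utf-8") as f:   # adds at the end
    f.write("two\n")
after_append = demo.read_text(encoding="utf-8")

with open(demo, mode="r+", encoding="utf-8") as f:  # keeps content, starts at the beginning
    f.write("ONE")                                  # overwrites the first three characters in place
after_r_plus = demo.read_text(encoding="utf-8")

with open(demo, mode="w", encoding="utf-8") as f:   # opening with "w" alone already truncates
    pass
after_truncate = demo.read_text(encoding="utf-8")

print(repr(after_append), repr(after_r_plus), repr(after_truncate))
```

Note that 'r+' overwrites in place rather than inserting, which matches the point that data can only be added at the end of a file.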