# Filesystem I/O in Python

Often, the text we want to work with comes in a raw .txt format. Take for example the [Cornell movie review dataset](http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz).

We can download the above archive, extract it and read a single review very easily.

In [None]:
with open("Datasets/review_polarity/txt_sentoken/neg/cv001_19502.txt") as f:
    text = f.read()

print(text)
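
The download and extraction step mentioned above can also be scripted. The following is a minimal sketch using only the standard library; the helper names `download` and `extract` and the target paths are illustrative assumptions, and it presumes the archive unpacks to a `txt_sentoken` folder, as the paths used in this notebook suggest.

```python
import tarfile
import urllib.request
from pathlib import Path

URL = "http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz"


def download(url, archive_path):
    """Save `url` to `archive_path`, skipping the request if the file already exists."""
    archive_path = Path(archive_path)
    if not archive_path.exists():
        archive_path.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(archive_path))
    return archive_path


def extract(archive_path, target_dir):
    """Unpack a .tar.gz archive into `target_dir` and return that directory."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target_dir)
    return target_dir


# Usage (hypothetical paths; the first call performs a network request):
# archive = download(URL, "Datasets/review_polarity.tar.gz")
# extract(archive, "Datasets/review_polarity")
```

Skipping the download when the archive already exists keeps repeated notebook runs from fetching the file again.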

Reading a whole file in a single operation may not be practical if the file is large and you wish to operate on it iteratively (i.e. line by line). 

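The `readlines()` method does not help here: it reads the entire file and returns a list of all its lines. To read lazily, iterate over the file object itself, which yields one line at a time.

In [None]:
with open("Datasets/review_polarity/txt_sentoken/neg/cv001_19502.txt") as f:
    for line in f:
        # each line keeps its trailing newline, so suppress the one print adds
        print(line, end="")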
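
To make the difference between `readlines()` and direct iteration concrete, here is a small sketch that runs on a throwaway temporary file rather than the dataset. The file contents and variable names are made up for illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example runs without the dataset.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    sample_path = tmp.name

# readlines() builds the complete list of lines in memory.
with open(sample_path) as f:
    all_lines = f.readlines()

# Iterating over the file object yields one line at a time instead,
# so only the current line (and the running result) is held.
longest = ""
with open(sample_path) as f:
    for line in f:
        stripped = line.rstrip("\n")
        if len(stripped) > len(longest):
            longest = stripped

print(type(all_lines).__name__, len(all_lines))  # list 3
print(longest)  # second line

os.remove(sample_path)
```

For a three-line file the two approaches are equivalent in practice; the distinction only matters once a file is too large to hold comfortably in memory.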

What about reading the files in a whole directory or subdirectory structure to process? We don't want to be typing in all the file names.

That's where the `os` module, and specifically the `os.walk` function, can help.

`os.walk` recursively traverses a directory tree and reports every directory and file within it. All we need to specify is the directory to start from.

In [None]:
import os

for root, dirs, files in os.walk("Datasets/review_polarity/"):
    # The outer loop visits every directory in the tree, one per iteration,
    # and sets 'root' to the directory currently being navigated.
    print("\n--> Current root: ", root)

    # 'dirs' lists the subdirectories that sit directly inside the current root.
    for directory in dirs:
        print("DIR  ", directory)

    # 'files' lists the files that sit directly inside the current root.
    for file in files:
        print("FILE ", file)
        

What if we are only interested in a specific set of files or directories? We can filter the filenames by matching them with rules.

Here we assume that we only want text files.

In [None]:
for root, dirs, files in os.walk("Datasets/review_polarity/"):    
    for file in files:
        if file.endswith(".txt"):
            print(file)

To get the full path to a file, we can use `os.path.join`, which joins path components in an OS-independent way: it inserts the correct separator (slash or backslash) for you.

In [None]:
for root, dirs, files in os.walk("Datasets/review_polarity/"):    
    for file in files:
        if file.endswith(".txt"):
            # 'root' holds the path of the directory that contains the file
            print("\nRoot: ", root)
            print(os.path.join(root,file))
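
The filtering idea extends to directories as well. With the default top-down traversal, `os.walk` lets you edit the `dirs` list in place to stop it from descending into particular subdirectories. The sketch below combines that with the extension filter; the function name `find_files` and its `skip_dirs` parameter are hypothetical, chosen only for this example.

```python
import os


def find_files(start_dir, extension=".txt", skip_dirs=("pos",)):
    """Return paths under `start_dir` ending in `extension`, ignoring `skip_dirs`."""
    matches = []
    for root, dirs, files in os.walk(start_dir):
        # Assigning to dirs[:] edits the list in place, so os.walk will not
        # descend into the directories that were removed.
        dirs[:] = [d for d in dirs if d not in skip_dirs]
        for name in sorted(files):
            if name.endswith(extension):
                matches.append(os.path.join(root, name))
    return matches


# Usage (assumes the dataset layout used in this notebook):
# negative_reviews = find_files("Datasets/review_polarity/", skip_dirs=("pos",))
```

Pruning directories this way avoids listing files that would only be discarded afterwards.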

Now let's find out how many lines there are in total across all the reviews.

In [None]:
import os

linecount = 0

for root, dirs, files in os.walk("Datasets/review_polarity/"):    
    for file in files:
        if file.endswith(".txt"):
            with open(os.path.join(root,file)) as f:
                # iterate over the file directly so lines are read one at a time
                for line in f:
                    linecount += 1

print("Total lines in all txt files:", linecount)
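
As an alternative, the same count can be written more compactly with the standard `pathlib` module, whose `rglob` method searches a directory tree recursively for names matching a pattern. This is a sketch rather than a replacement for the loop above; `count_lines` is a hypothetical helper name.

```python
from pathlib import Path


def count_lines(start_dir, pattern="*.txt"):
    """Count the lines in every file under `start_dir` whose name matches `pattern`."""
    total = 0
    for path in Path(start_dir).rglob(pattern):
        if path.is_file():
            with path.open() as f:
                # the generator consumes the file one line at a time
                total += sum(1 for _ in f)
    return total


# Usage (assumes the dataset layout used in this notebook):
# print("Total lines in all txt files:", count_lines("Datasets/review_polarity/"))
```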

## Writing a file

In [None]:
with open('Datasets/review_polarity/fake.txt', 'w') as file:  
    file.write('Fake review!')
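
The mode argument controls what happens to existing content: `'w'` creates the file or truncates it if it already exists, while `'a'` appends to the end. A small sketch of the difference, written to a temporary directory (an assumption made here so the dataset folder is left untouched):

```python
import os
import tempfile

out_path = os.path.join(tempfile.mkdtemp(), "fake.txt")

# "w" creates the file, or empties it first if it already exists.
with open(out_path, "w") as f:
    f.write("Fake review!\n")

# "a" keeps the existing content and adds to the end.
with open(out_path, "a") as f:
    f.write("Another fake review!\n")

with open(out_path) as f:
    contents = f.read()

print(contents)
```

Note that `write` does not add a newline for you, so each string above ends with an explicit `\n`.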

## Conclusion

We can now walk directory trees containing many text files and filter those files by name. We can read a file in one chunk or incrementally, line by line, and write text out to a new file.