### Credits:

<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

This notebook was created by Zhuo Chen based on the notebooks created by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org<br />

Reused and modified for internal use at Università Cattolica del Sacro Cuore di Milano, by Deborah Grbac, email deborah.grbac@unicatt.it and Valentina Schiariti, email valentina.schiariti-collaboratore@unicatt.it, released under CC BY License.

This repository is based on **Constellate notebooks**. The original Jupyter notebooks repository was designed by the educators at **ITHAKA's Constellate project**. The project was sunset on July 1, 2025. The current repository uses and reuses Constellate notebooks as Open Educational Resources (OER), free for re-use under a Creative Commons CC BY License.
___


# Python Intermediate 3

**Description:** This notebook describes:
* The Python `pathlib` library
* What directory and file paths are (absolute and relative paths)
* How to use file paths to manipulate, organize, and analyze files

This is Part 3 of 5 in the *Python Intermediate* series that will prepare you to do text analysis using the Python programming language.

**Note**: Running this notebook locally will give you full control to test, modify, and save your work. We strongly recommend downloading it before you begin.

___

In [2]:
### Download Sample Files for this Lesson
import urllib.request
from pathlib import Path  # Path objects represent file and directory paths

# Check if a data folder exists. If not, create it.
data_folder = Path('./data/')  # a path relative to the current directory ('./')
data_folder.mkdir(exist_ok=True)  # create the directory; exist_ok=True means no error is raised if it already exists

# The sample files to download (books.zip is a zip archive)
download_urls = [
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample.txt',
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/books.zip',
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_cloud.png'
]

for url in download_urls:
    urllib.request.urlretrieve(url, './data/' + Path(url).name)

import zipfile

# Extract the contents of the zip archive into ./data/books
with zipfile.ZipFile("./data/books.zip", "r") as zip_ref:
    zip_ref.extractall("./data/books")

# Delete the zip archive now that its contents have been extracted
bookzip_path = Path.cwd() / 'data' / 'books.zip'
bookzip_path.unlink()

print('Sample files ready.')

Sample files ready.


## A Tree Structure of the Filesystem
<center><img src='https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/filesystem_tree.png' width=700></center>

## An Introduction to `pathlib`

Python Intermediate 2 describes the way to open, read, and write files. The built-in module `pathlib` is the best way to connect to the file system so your code can work seamlessly with files and directories, even across different operating systems. For example, `pathlib` can help you accomplish tasks like:

* Find whether a particular file or directory already exists
* Simplify the code for reading and writing to files
* Find information about a file, including its extension
* Iterate a process over a large group of files in a directory (including subdirectories)

## Finding and Defining Paths

We can find the current working directory by calling `Path.cwd()`. (This is similar to the Unix command `pwd`.)

In [38]:
# Get the current working directory
Path.cwd()

WindowsPath('C:/Users/Utente/Introduction-to-Text-Analyisis-with-Python/Python Notebooks/Python_Intermediate')

This looks like the file paths we normally see on the computer, except that it is displayed with forward slashes `/`, whereas a Windows machine normally shows paths with backslashes, e.g. `C:\Windows\`.

We can create a path object at any time by passing a string to `Path()` and assigning the result to a variable.

To do this, we can either use the **absolute path** seen previously that starts at the root of the operating system filesystem, or we can simplify by using the **relative path** if the files are in our current working directory.
See this example:

In [39]:
# Create a path object based on a string from a relative path
file_path = Path('./data/sample.txt') 

The `.` at the beginning refers to the current directory. If the file is in a parent directory rather than the current directory, `.` alone will not reach it. You will need to refer to the file either by its absolute path or by a relative path that starts with `..`, which means one level up. Repeat `..` for each additional level (for example, `../..` for two levels up):

In [40]:
# Open the test file using the absolute path
with open('/Users/Utente/Introduction-to-Text-Analyisis-with-Python/Python Notebooks/test.txt', 'r') as f:
    print(f.read())

test test test


In [41]:
# Open the test file using the relative path
with open('../test.txt', 'r') as f:
    print(f.read())

test test test
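
The same idea can be sketched with path objects. The example below is only an illustration: the directory `/home/user/project/notebooks` is a hypothetical location, and `PurePosixPath` is used so that the code never touches the disk and prints the same result on any operating system.

```python
from pathlib import PurePosixPath

# Hypothetical location of a notebook, used only for illustration
notebook_dir = PurePosixPath('/home/user/project/notebooks')

# Each '..' component refers to one directory level up
one_up = notebook_dir / '..' / 'test.txt'
two_up = notebook_dir / '..' / '..' / 'test.txt'

print(one_up)  # /home/user/project/notebooks/../test.txt
print(two_up)  # /home/user/project/notebooks/../../test.txt

# Pure paths do not collapse '..' on their own;
# .parent gives the cleaned-up equivalent of going one level up
print(notebook_dir.parent / 'test.txt')  # /home/user/project/test.txt
```
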


With `Path('./data/sample.txt')` we created a **path object**, not simply a string, as the `type()` function shows. A path object is more flexible than a plain string. It lets us write code that is easier to adapt to different operating systems, since we do not have to worry about which direction the slashes go or about other details that differ from one operating system to another.

In [42]:
# We have created a Path object, not simply a string
type(file_path)

pathlib._local.WindowsPath

We can also use the `print()` function on the path object.

In [43]:
# Print out the path object
print(file_path)

data\sample.txt


The `.resolve()` method takes a relative path object and returns an absolute path object. The absolute path is the full path from the root of the filesystem. On a Mac or Linux, the root is simply `/`. On a Windows computer, it is usually `C:\`.

In [44]:
# Getting the full path using .resolve()
# Returns a path object

file_path.resolve()  # the full path from the root directory to the file

WindowsPath('C:/Users/Utente/Introduction-to-Text-Analyisis-with-Python/Python Notebooks/Python_Intermediate/data/sample.txt')

We can also build a path by joining path objects and strings with the `/` operator.

In [45]:
# Building another path off the current working directory
# Using the slash notation

file_path = Path.cwd() / 'data' / 'sample.txt'
print(file_path)

C:\Users\Utente\Introduction-to-Text-Analyisis-with-Python\Python Notebooks\Python_Intermediate\data\sample.txt
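
As a minimal sketch of why the `/` operator helps with portability, the example below joins the same components using the two "pure" path flavours that `pathlib` provides. The folder name `project` is hypothetical. Pure paths only manipulate text and never access the filesystem, so both flavours can be created on any machine.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The same components joined with the / operator
posix_path = PurePosixPath('project') / 'data' / 'sample.txt'
windows_path = PureWindowsPath('project') / 'data' / 'sample.txt'

print(posix_path)    # project/data/sample.txt
print(windows_path)  # project\data\sample.txt

# .as_posix() returns a forward-slash string for either flavour
print(windows_path.as_posix())  # project/data/sample.txt
```

In everyday code we would simply use `Path`, which picks the right flavour for the machine the code runs on.
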


## Checking if a Path Points to an Existing File or Directory

| Method | Effect |
|---|---|
| `.is_file()` | Return `True` if the path points to an existing file, otherwise `False` |
| `.is_dir()` | Return `True` if the path points to an existing directory, otherwise `False` |

In [46]:
# Check whether path is a file
# Returns a Boolean

file_path.is_file()

True

In [47]:
# Check whether path is a directory
# Returns a Boolean

file_path.is_dir()

False
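
The checks above can be tried safely in a temporary folder. This is only a sketch: the file names are hypothetical, and it also uses `.exists()`, a related `pathlib` method not covered in the table, which returns `True` for either a file or a directory.

```python
import tempfile
from pathlib import Path

# Work inside a temporary directory that is deleted automatically
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    existing = folder / 'notes.txt'     # hypothetical file name
    missing = folder / 'not_there.txt'  # never created
    existing.touch()

    checks = {
        'folder is a directory': folder.is_dir(),
        'existing is a file': existing.is_file(),
        'missing exists': missing.exists(),
        'missing is a file': missing.is_file(),
    }

for label, result in checks.items():
    print(label, '->', result)
```

Note that a path that does not exist returns `False` for both `.is_file()` and `.is_dir()` rather than raising an error.
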

## Path Attributes

A path object also has useful attributes. Unlike the methods above, which end in parentheses `()`, attributes do not use parentheses. We can think of a method as a function that is attached to an object with dot notation: it usually performs some action or transformation and may need to take arguments in the parentheses. An attribute, on the other hand, is more like a property of the object, so it does not require parentheses.

| Attribute | Information Returned |
|---|---|
| `.parent` | Return a path object for the parent directory |
| `.parents[x]` | Return a path object for the ancestor at index x (`0` is the parent, `1` the grandparent, and so on) |
| `.name` | Return a string containing the file name with extension |
| `.stem` | Return a string containing the file name without extension |
| `.suffix` | Return a string containing the file extension |

In [48]:
# Get the parent of the path
# Returns a path object

file_path.parent

WindowsPath('C:/Users/Utente/Introduction-to-Text-Analyisis-with-Python/Python Notebooks/Python_Intermediate/data')

In [49]:
# Finding the grandparent of the path using .parent twice
# Returns a path object
# If the path was specified from a relative path, 
# you may need to use .resolve() to get the absolute path first

file_path.parent.parent

WindowsPath('C:/Users/Utente/Introduction-to-Text-Analyisis-with-Python/Python Notebooks/Python_Intermediate')

In [50]:
# Going even higher up the path
# Finding the great-grandparent of the full path using parents with an index
# (index 0 is the parent, 1 the grandparent, 2 the great-grandparent)
# Returns a path object

file_path.parents[2] # Try changing the index

WindowsPath('C:/Users/Utente/Introduction-to-Text-Analyisis-with-Python/Python Notebooks')

In [51]:
# Return just the name of the file or folder
# Returns a string
file_path.name

'sample.txt'

In [52]:
# Return just the name of the file without extension
# Returns a string

file_path.stem

'sample'

In [53]:
# Return just the extension/suffix
# Returns a string

file_path.suffix

'.txt'
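
To see the attributes side by side, here is a short sketch on a hypothetical path. It uses `PurePosixPath`, so nothing needs to exist on disk and the output is the same on every operating system.

```python
from pathlib import PurePosixPath

# Hypothetical path, chosen only to illustrate the attributes
book = PurePosixPath('/home/user/data/books/dracula_bram_stoker.txt')

print(book.name)        # dracula_bram_stoker.txt
print(book.stem)        # dracula_bram_stoker
print(book.suffix)      # .txt
print(book.parent)      # /home/user/data/books
print(book.parents[0])  # /home/user/data/books (same as .parent)
print(book.parents[1])  # /home/user/data
```
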

## Creating Files and Directories

To create a new file or directory, first create the desired path object then use the appropriate method:

* `.touch()` will create a new file
* `.mkdir()` will create a new directory

In [54]:
# Create a new file
new_file_path = Path.cwd() / 'data' / 'new_file.txt'
new_file_path.touch()

In [55]:
# Create a new directory
new_dir_path = Path.cwd() / 'data' / 'new_directory'
new_dir_path.mkdir(exist_ok = True) 
# The exist_ok=True argument prevents an error if the directory already exists
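
One situation the cell above does not cover is creating several nested directories at once. As a sketch under that assumption, `.mkdir()` also accepts a `parents=True` argument that creates any missing parent directories. The folder names below are hypothetical, and the example runs in a temporary directory.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / 'level_one' / 'level_two'  # hypothetical folder names

    try:
        nested.mkdir()  # fails because 'level_one' does not exist yet
        created_without_parents = True
    except FileNotFoundError:
        created_without_parents = False

    # parents=True creates 'level_one' and then 'level_two'
    nested.mkdir(parents=True, exist_ok=True)
    created_with_parents = nested.is_dir()

print(created_without_parents)  # False
print(created_with_parents)     # True
```
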

## Removing Files and Directories

To remove a file or directory, first create the path object then use the appropriate method:

* `.unlink()` will delete a file
* `.rmdir()` will delete a directory (the directory must be empty)


In [56]:
# Remove a file
new_file_path.unlink()

In [57]:
# Remove a directory
new_dir_path.rmdir()
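
The `.rmdir()` method only works on an empty directory. The sketch below shows what happens otherwise and one possible alternative, `shutil.rmtree()` from the standard library, which is not part of `pathlib` and deletes a folder together with everything inside it. The names are hypothetical, and the example runs in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / 'not_empty'  # hypothetical folder name
    folder.mkdir()
    (folder / 'file.txt').touch()

    try:
        folder.rmdir()  # the directory still contains a file
        rmdir_succeeded = True
    except OSError:
        rmdir_succeeded = False

    # Removes the folder and its contents; use with care
    shutil.rmtree(folder)
    still_there = folder.exists()

print(rmdir_succeeded)  # False
print(still_there)      # False
```
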

## Rename a File or Directory

To rename a file, you will need two path objects: the original path object and a new path object with the new name. The syntax looks like:

`old_path.rename(new_path)`

In [58]:
# Create an original file for this example

old_path = Path.cwd() / 'data' / 'original_file.txt'
old_path.touch()

In [None]:
# Rename the original file with `.rename()`
# On Windows, if the renamed file already exists an error will occur
# On Unix, if the renamed file already exists the file will be overwritten silently

new_path = Path.cwd() / 'data' / 'renamed_file.txt'
old_path.rename(new_path)
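
Because the behavior differs between operating systems when the target already exists, one cautious pattern is to check for the target first. This is a minimal sketch, not the only way to do it, and it runs in a temporary directory so no real files are affected.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / 'original_file.txt'
    target = Path(tmp) / 'renamed_file.txt'
    source.write_text('some content')

    # Only rename when nothing would be overwritten
    if target.exists():
        print('Target already exists; nothing renamed.')
    else:
        source.rename(target)

    source_exists = source.exists()
    target_text = target.read_text()

print(source_exists)  # False
print(target_text)    # some content
```
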

## Open, Read, and Write to Text Files

Path objects work with the `with` context manager. Instead of passing a string into the `open()` function, we can call the `.open()` method on the path object.

In [60]:
# Opening the file with a context manager
# and the `.open()` method
# The 'r' (read-only) mode argument is optional with `.open()`

with file_path.open() as f:  # calling .open() on the path object
    print(f.read())

A text file can have many words in it
These words are written on the second line
Third line
Fourth line
Fifth line
Sixth line
Seventh line
Eighth line
Ninth line
Tenth line. This is the end of the text file!


## Quickly Reading or Writing a File

If you are reading a small text file, then there is an even shorter way to read the file using a path object: `.read_text()`. This method opens the file, creates a string from the file object contents, and then closes the file object automatically.

In [61]:
# Using the read_text method
# Returns a string
print(file_path.read_text())

A text file can have many words in it
These words are written on the second line
Third line
Fourth line
Fifth line
Sixth line
Seventh line
Eighth line
Ninth line
Tenth line. This is the end of the text file!


There is also a fast method for writing to a file using a path object: `.write_text()`. This method opens the file object in write mode, writes a string to the file, and then closes it automatically. *Be careful with this method since it will overwrite an existing file at that path!*

In [62]:
# Create a new file

new_file_path = Path.cwd() / 'data' / 'new_file.txt'

# Write to a file
# This overwrites the file if it already exists

new_file_path.write_text('Hello World!')
print(new_file_path.read_text())

Hello World!
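
The overwriting behavior can be seen in a short sketch. The file name is hypothetical and the example runs in a temporary directory. To add to a file instead of replacing it, one option is to open the path in append mode (`'a'`) with `.open()`. The `encoding='utf-8'` argument is an assumption about the text being written; it is optional but makes the result consistent across systems.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / 'log.txt'  # hypothetical file name

    log_path.write_text('first\n', encoding='utf-8')
    log_path.write_text('second\n', encoding='utf-8')  # replaces 'first'

    # Append mode adds to the end of the file instead of overwriting
    with log_path.open('a', encoding='utf-8') as f:
        f.write('third\n')

    contents = log_path.read_text(encoding='utf-8')

print(contents)
```

The printed file contains `second` and `third` but not `first`, because the second call to `.write_text()` replaced the earlier contents.
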


## Gathering a List of Files with `.iterdir()` and `.rglob()`
It is common to gather a list of files in a directory (or set of directories) in order to execute code on each one in turn. If all the files are in a single directory, then we can use the `.iterdir()` method.

In [21]:
# Use .iterdir() to iterate over files in a directory

input_dir = Path.cwd() / 'data' / 'books'
for file in input_dir.iterdir():
    print(file)

C:\Users\Utente\Introduction-to-Text-Analyisis-with-Python\Python Notebooks\Python_Intermediate\data\books\100_years_of_solitude.png
C:\Users\Utente\Introduction-to-Text-Analyisis-with-Python\Python Notebooks\Python_Intermediate\data\books\adventures_of_sherlock_holmes_a_conan_doyle.txt
C:\Users\Utente\Introduction-to-Text-Analyisis-with-Python\Python Notebooks\Python_Intermediate\data\books\alice_in_wonderland_lewis_carroll.txt
C:\Users\Utente\Introduction-to-Text-Analyisis-with-Python\Python Notebooks\Python_Intermediate\data\books\a_room_with_a_view_e_m_forster.txt
C:\Users\Utente\Introduction-to-Text-Analyisis-with-Python\Python Notebooks\Python_Intermediate\data\books\dracula_bram_stoker.txt
C:\Users\Utente\Introduction-to-Text-Analyisis-with-Python\Python Notebooks\Python_Intermediate\data\books\little_women_louisa_m_alcott.txt
C:\Users\Utente\Introduction-to-Text-Analyisis-with-Python\Python Notebooks\Python_Intermediate\data\books\middlemarch_george_eliot.txt
C:\Users\Utente\In

In [9]:
# Use .iterdir() to iterate over files in a directory
# Checking for a particular extension
# Only works for a single directory!

for file in input_dir.iterdir():
    if file.suffix == '.txt':
        print(file)

/home/jovyan/constellate-notebooks/Python-intermediate/data/books/dracula_bram_stoker.txt
/home/jovyan/constellate-notebooks/Python-intermediate/data/books/alice_in_wonderland_lewis_carroll.txt
/home/jovyan/constellate-notebooks/Python-intermediate/data/books/moby_dick_herman_melville.txt
/home/jovyan/constellate-notebooks/Python-intermediate/data/books/little_women_louisa_m_alcott.txt
/home/jovyan/constellate-notebooks/Python-intermediate/data/books/a_room_with_a_view_e_m_forster.txt
/home/jovyan/constellate-notebooks/Python-intermediate/data/books/middlemarch_george_eliot.txt
/home/jovyan/constellate-notebooks/Python-intermediate/data/books/adventures_of_sherlock_holmes_a_conan_doyle.txt
/home/jovyan/constellate-notebooks/Python-intermediate/data/books/pride_and_prejudice_jane_austen.txt


The `.iterdir()` method will work on a single directory, but if you have multiple nested directories then you can use the `.rglob()` method. Be careful with this method, however, since if there are a lot of nested folders it could take a very long time to process the whole directory tree.

In [24]:
# Use .rglob() to iterate over all matching files, including those in subfolders

for file in input_dir.rglob("*.txt"): 
    print(file)

C:\Users\Utente\Introduction-to-Text-Analyisis-with-Python\Python Notebooks\Python_Intermediate\data\books\adventures_of_sherlock_holmes_a_conan_doyle.txt
C:\Users\Utente\Introduction-to-Text-Analyisis-with-Python\Python Notebooks\Python_Intermediate\data\books\alice_in_wonderland_lewis_carroll.txt
C:\Users\Utente\Introduction-to-Text-Analyisis-with-Python\Python Notebooks\Python_Intermediate\data\books\a_room_with_a_view_e_m_forster.txt
C:\Users\Utente\Introduction-to-Text-Analyisis-with-Python\Python Notebooks\Python_Intermediate\data\books\dracula_bram_stoker.txt
C:\Users\Utente\Introduction-to-Text-Analyisis-with-Python\Python Notebooks\Python_Intermediate\data\books\little_women_louisa_m_alcott.txt
C:\Users\Utente\Introduction-to-Text-Analyisis-with-Python\Python Notebooks\Python_Intermediate\data\books\middlemarch_george_eliot.txt
C:\Users\Utente\Introduction-to-Text-Analyisis-with-Python\Python Notebooks\Python_Intermediate\data\books\moby_dick_herman_melville.txt
C:\Users\Utent
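
The difference between the two methods is easiest to see with a nested folder. The sketch below builds a small hypothetical directory tree in a temporary location and compares the results. The names are sorted because neither method guarantees the order in which files are returned.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'sub').mkdir()                  # a nested folder
    (root / 'top.txt').touch()
    (root / 'image.png').touch()
    (root / 'sub' / 'nested.txt').touch()

    # .iterdir() only looks at the top level of the directory
    top_level = sorted(p.name for p in root.iterdir() if p.suffix == '.txt')

    # .rglob() also searches every subfolder
    recursive = sorted(p.name for p in root.rglob('*.txt'))

print(top_level)  # ['top.txt']
print(recursive)  # ['nested.txt', 'top.txt']
```
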

___

# Lesson Complete