# Files & Directories

A `file` is a sequence of bytes stored in a `filesystem` and accessed by its `filename`.

A `directory` is a collection of files, and sometimes other directories as well.

The term `folder` is a synonym for directory.

# Working with Files

## Create or Open with `open()`

We have to call the `open` function before we perform the following tasks:
* Read an existing file
* Write to a new file
* Append to an existing file
* Overwrite an existing file

*fileobj = open(filename,mode)*

- fileobj - file object returned by `open()`
- filename - String name of the file
- mode - string indicating the file's type and the operation we want to perform


The first letter of *mode* indicates the operation:
* **r** means read. The file must already exist.
* **w** means write. If the file doesn't exist, it is created. If it already exists, it is overwritten.
* **x** means write, but only if the file doesn't already exist.
* **a** means append (write after the end) if the file exists. If it doesn't exist, it is created.

The second letter of *mode* is the file's type:
- **t** (or nothing) means text
- **b** means binary

*close* the file once you have finished reading or writing.
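
As a minimal sketch of how the mode letters behave, the block below works inside a temporary directory so that it does not touch any real file. The file name `modes_demo.txt` is made up for illustration.

```python
import os
import tempfile

# Work in a throwaway directory; "modes_demo.txt" is a hypothetical name.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "modes_demo.txt")

    f = open(path, "wt")   # "w": create the file, or truncate it if it exists
    f.write("first\n")
    f.close()

    f = open(path, "at")   # "a": keep existing content and write at the end
    f.write("second\n")
    f.close()

    try:
        open(path, "xt")   # "x": refuses to open a file that already exists
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

    f = open(path, "rt")   # "r": read the file back as text
    content = f.read()
    f.close()

print(repr(content), exclusive_failed)
```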

In [3]:
fileobj = open("example.txt","wt")
fileobj.close()

## Write a Text file with `print()`

In [4]:
fileobj = open("example.txt","wt")
print("Welcome to example of file",file=fileobj)
fileobj.close()

In [5]:
fileobj = open("example.txt","wt")
print("This is another line for my file",file=fileobj)
fileobj.close()

In [6]:
fileobj = open("example.txt","at")
print("\nThis is the outcome of append",file=fileobj)
fileobj.close()

In [7]:
fileobj = open("example.txt","xt") # succeeds only if the file does not exist
print("\nThis is the outcome of append",file=fileobj)
fileobj.close()

FileExistsError: [Errno 17] File exists: 'example.txt'

In [8]:
fileobj = open("example_xt.txt","xt") # succeeds only if the file does not exist
print("\nThis is the outcome of append",file=fileobj)
fileobj.close()

## Write a text file with `write()`

In [9]:
rhymes= """Jack and Jill went up the hill
To fetch a pail of water.
Jack fell down and broke his crown,
And Jill came tumbling after.
"""

In [10]:
fileobj = open("rhymes.txt","wt")
fileobj.write(rhymes)
fileobj.close()
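
A small sketch of two details of writing text, under the assumption that a throwaway file is acceptable: `write()` returns the number of characters written, and `writelines()` writes each string as-is without adding newlines. The file name `rhymes_demo.txt` is hypothetical.

```python
import os
import tempfile

lines = ["Jack and Jill went up the hill\n", "To fetch a pail of water.\n"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rhymes_demo.txt")  # hypothetical file name

    fileobj = open(path, "wt")
    # write() returns the number of characters it wrote
    written = fileobj.write(lines[0])
    # writelines() writes every string unchanged; newlines must already be there
    fileobj.writelines(lines[1:])
    fileobj.close()

    fileobj = open(path, "rt")
    text = fileobj.read()
    fileobj.close()

print(written)
print(text)
```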

## Read a Text file with `read()`, `readline()`, or `readlines()`

In [11]:
f = open("rhymes.txt",'rt')
rhymes = f.read()
f.close()

In [12]:
print(rhymes)

Jack and Jill went up the hill
To fetch a pail of water.
Jack fell down and broke his crown,
And Jill came tumbling after.



In [13]:
f = open("rhymes.txt",'rt')
rhymes = f.readline()
f.close()

In [14]:
rhymes

'Jack and Jill went up the hill\n'

In [15]:
f = open("rhymes.txt",'rt')
f.readline()

'Jack and Jill went up the hill\n'

In [16]:
f.readline()

'To fetch a pail of water.\n'

In [17]:
f.readline()

'Jack fell down and broke his crown,\n'

In [18]:
f.close()

In [19]:
f = open("rhymes.txt",'rt')
rhymes = f.readlines()
f.close()

In [20]:
rhymes

['Jack and Jill went up the hill\n',
 'To fetch a pail of water.\n',
 'Jack fell down and broke his crown,\n',
 'And Jill came tumbling after.\n']
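
The following sketch shows two things the cells above imply but do not spell out: `readline()` returns an empty string at the end of the file, and a file object can be iterated directly, which is a convenient place to strip the trailing `\n`. It writes its own small file (the name `rhymes_demo.txt` is made up) so it can run on its own.

```python
import os
import tempfile

rhyme_text = "Jack and Jill went up the hill\nTo fetch a pail of water.\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rhymes_demo.txt")  # hypothetical file name
    f = open(path, "wt")
    f.write(rhyme_text)
    f.close()

    f = open(path, "rt")
    first = f.readline()
    second = f.readline()
    at_end = f.readline()   # '' (empty string) signals the end of the file
    f.close()

    f = open(path, "rt")
    # Iterating over the file yields one line at a time, newline included
    stripped = [line.rstrip("\n") for line in f]
    f.close()

print(repr(first), repr(at_end))
print(stripped)
```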

## Write a Binary file with `write()`

In [21]:
#generate binary data
bdata = bytes(range(0,256))
len(bdata)

256

In [22]:
f = open("binfile","wb")
f.write(bdata)
f.close()

In [23]:
f = open("binfile","rb")
print(f.read())
f.close()

b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'


## Close Files automatically using `with`

In [25]:
with open("rhymes.txt","rt") as f:
    print(f.read())

Jack and Jill went up the hill
To fetch a pail of water.
Jack fell down and broke his crown,
And Jill came tumbling after.



In [26]:
with open("demo.jpeg","rt") as f:
    print(f.read())

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

In [27]:
with open("demo.jpeg","rb") as f: # binary
    print(f.read())

b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xe2\x01\xd8ICC_PROFILE\x00\x01\x01\x00\x00\x01\xc8\x00\x00\x00\x00\x040\x00\x00mntrRGB XYZ \x07\xe0\x00\x01\x00\x01\x00\x00\x00\x00\x00\x00acsp\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\xf6\xd6\x00\x01\x00\x00\x00\x00\xd3-\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\tdesc\x00\x00\x00\xf0\x00\x00\x00$rXYZ\x00\x00\x01\x14\x00\x00\x00\x14gXYZ\x00\x00\x01(\x00\x00\x00\x14bXYZ\x00\x00\x01<\x00\x00\x00\x14wtpt\x00\x00\x01P\x00\x00\x00\x14rTRC\x00\x00\x01d\x00\x00\x00(gTRC\x00\x00\x01d\x00\x00\x00(bTRC\x00\x00\x01d\x00\x00\x00(cprt\x00\x00\x01\x8c\x00\x00\x00<mluc\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x0cenUS\x00\x00\x00\x08\x00\x00\x00\x1c\x00s\x00R\x00G\x00BXYZ \x00\x00\x00\x00
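
The binary output above starts with `\xff\xd8\xff`, which is the signature that JPEG files begin with. As a hedged sketch, a file's type can be guessed by reading just its first few bytes in binary mode. The helper name `looks_like_jpeg` and the two demo files are assumptions made for this example.

```python
import os
import tempfile

JPEG_MAGIC = b"\xff\xd8\xff"  # the first three bytes of a JPEG file

def looks_like_jpeg(path):
    """Return True if the file starts with the JPEG signature."""
    with open(path, "rb") as f:
        return f.read(len(JPEG_MAGIC)) == JPEG_MAGIC

with tempfile.TemporaryDirectory() as tmp:
    # A fake "JPEG": only the header bytes, enough for the signature check
    fake_jpeg = os.path.join(tmp, "fake.jpeg")
    with open(fake_jpeg, "wb") as f:
        f.write(b"\xff\xd8\xff\xe0\x00\x10JFIF\x00")

    text_file = os.path.join(tmp, "notes.txt")
    with open(text_file, "wt") as f:
        f.write("plain text")

    jpeg_result = looks_like_jpeg(fake_jpeg)
    text_result = looks_like_jpeg(text_file)

print(jpeg_result, text_result)
```

A signature check like this only inspects the header; it does not prove that the rest of the file is a valid image.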

# Working with Other Common File Formats

### .envrc

In [28]:
text = '''export DATABASE_URL="postgres://username:password@localhost:5432/mydatabase"
export API_KEY="your_api_key_here"
export DEBUG="true"
export PORT="3000"'''

with open(".envrc","wt") as f:
    f.write(text)
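
Reading such a file back is a matter of splitting lines. The sketch below assumes every setting has the simple form `export NAME="value"`, as in the cell above; real `.envrc` files can contain arbitrary shell code, which this does not handle. The function name `parse_envrc` is made up for the example.

```python
text = '''export DATABASE_URL="postgres://username:password@localhost:5432/mydatabase"
export API_KEY="your_api_key_here"
export DEBUG="true"
export PORT="3000"'''

def parse_envrc(content):
    """Parse simple `export NAME="value"` lines into a dict of strings."""
    settings = {}
    for line in content.splitlines():
        line = line.strip()
        if not line.startswith("export "):
            continue  # skip blank lines, comments, and anything else
        # Split on the first "=" only, so values may contain "=" themselves
        name, _, value = line[len("export "):].partition("=")
        settings[name] = value.strip('"')
    return settings

settings = parse_envrc(text)
print(settings["PORT"])
```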

### JSON

In [29]:
# Read json

with open("policy.json", "rt") as f:
    policy = f.readlines()

In [30]:
print(policy)

['{\n', '    "Version": "2012-10-17",\n', '    "Statement": [\n', '      {\n', '        "Effect": "Allow",\n', '        "Action": [\n', '          "s3:GetObject",\n', '          "s3:PutObject"\n', '        ],\n', '        "Resource": "arn:aws:s3:::example-bucket/*"\n', '      },\n', '      {\n', '        "Effect": "Allow",\n', '        "Action": "dynamodb:*",\n', '        "Resource": "*"\n', '      },\n', '      {\n', '        "Effect": "Allow",\n', '        "Action": "ec2:Describe*",\n', '        "Resource": "*"\n', '      },\n', '      {\n', '        "Effect": "Deny",\n', '        "Action": "s3:*",\n', '        "Resource": "arn:aws:s3:::example-bucket/top-secret/*"\n', '      }\n', '    ]\n', '  }\n', '  ']


In [31]:
import json

with open("policy.json", "rt") as f:
    policy = json.load(f)

print(policy)


{'Version': '2012-10-17', 'Statement': [{'Effect': 'Allow', 'Action': ['s3:GetObject', 's3:PutObject'], 'Resource': 'arn:aws:s3:::example-bucket/*'}, {'Effect': 'Allow', 'Action': 'dynamodb:*', 'Resource': '*'}, {'Effect': 'Allow', 'Action': 'ec2:Describe*', 'Resource': '*'}, {'Effect': 'Deny', 'Action': 's3:*', 'Resource': 'arn:aws:s3:::example-bucket/top-secret/*'}]}


In [32]:
import pprint

In [34]:
pprint.pprint(policy)

{'Statement': [{'Action': ['s3:GetObject', 's3:PutObject'],
                'Effect': 'Allow',
                'Resource': 'arn:aws:s3:::example-bucket/*'},
               {'Action': 'dynamodb:*', 'Effect': 'Allow', 'Resource': '*'},
               {'Action': 'ec2:Describe*', 'Effect': 'Allow', 'Resource': '*'},
               {'Action': 's3:*',
                'Effect': 'Deny',
                'Resource': 'arn:aws:s3:::example-bucket/top-secret/*'}],
 'Version': '2012-10-17'}
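
Writing JSON is the mirror image of reading it. This is a minimal sketch that uses a cut-down policy dict (not the full `policy.json`) and a hypothetical output name `policy_copy.json`: `json.dump` writes to a file object, and `json.dumps` returns a string, which is an alternative to `pprint` for readable output.

```python
import json
import os
import tempfile

# A trimmed-down policy, for illustration only
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "ec2:Describe*", "Resource": "*"},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "policy_copy.json")  # hypothetical file name

    with open(path, "wt") as f:
        json.dump(policy, f, indent=4)   # indent makes the file human-readable

    with open(path, "rt") as f:
        reloaded = json.load(f)          # round trip back to a Python dict

# json.dumps returns the formatted JSON as a string instead of writing a file
pretty = json.dumps(policy, indent=4, sort_keys=True)
print(pretty)
```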


### YAML

In [35]:
!pip install PyYaml



In [42]:
import yaml

with open("web-servers.yaml", "rt") as f:
    server = yaml.safe_load(f)

In [43]:
pprint.pprint(server)  # the YAML document is loaded as plain Python lists and dicts

[{'become': True,
  'handlers': [{'name': 'Restart Nginx',
                'service': {'name': 'nginx', 'state': 'restarted'}}],
  'hosts': 'web_servers',
  'name': 'Install and configure Nginx',
  'tasks': [{'apt': {'update_cache': True}, 'name': 'Update apt package index'},
            {'apt': {'name': 'nginx', 'state': 'present'},
             'name': 'Install Nginx'},
            {'name': 'Copy Nginx configuration file',
             'notify': 'Restart Nginx',
             'template': {'dest': '/etc/nginx/nginx.conf',
                          'src': 'nginx.conf.j2'}}]}]


In [44]:
with open("web-servers-2.yaml", "wt") as f:
    yaml.dump(server,f)

### XML

In [24]:
# Working with XML File

import xml.etree.ElementTree as ET

# Parse the XML file
tree = ET.parse('sample.xml')

# Get the root element
root = tree.getroot()

# Accessing elements and attributes:
for child in root:
    print(child.tag, child.attrib)  # Print tag and attributes

# Finding specific elements:
for package in root.findall('package'):
    name = package.find('name').text
    price = package.find('price').text
    print(name, price)


package {}
package {}
Package 1 100
Package 2 150
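
The contents of `sample.xml` are not shown, so the sketch below assumes a structure consistent with the output above: a root element containing `package` elements, each with a `name` and a `price` child. The root tag `packages` is a guess. It parses from a string with `ET.fromstring` and converts the prices to integers.

```python
import xml.etree.ElementTree as ET

# Assumed layout of sample.xml; the root tag name is hypothetical
xml_text = """<packages>
  <package><name>Package 1</name><price>100</price></package>
  <package><name>Package 2</name><price>150</price></package>
</packages>"""

root = ET.fromstring(xml_text)  # parse from a string instead of a file

prices = {}
for package in root.findall("package"):
    name = package.find("name").text
    # Element text is always a string, so convert before doing arithmetic
    prices[name] = int(package.find("price").text)

total = sum(prices.values())
print(prices, total)
```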


In [23]:
import lxml.etree as etree

# Parse the XML file
tree = etree.parse('sample.xml')

# Get the root element
root = tree.getroot()

# Accessing elements and attributes: (similar to ElementTree)
# ... 

# XPath support:
for package in root.xpath('//package'):
    name = package.xpath('./name/text()')[0]
    price = package.xpath('./price/text()')[0]
    print(name, price)


Package 1 100
Package 2 150


### CSV

In [1]:
# open
with open("train.csv","rt") as f:
    content = f.read()
print(content)

Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
LP001002,Male,No,0,Graduate,No,5849,0,,360,1,Urban,Y
LP001003,Male,Yes,1,Graduate,No,4583,1508,128,360,1,Rural,N
LP001005,Male,Yes,0,Graduate,Yes,3000,0,66,360,1,Urban,Y
LP001006,Male,Yes,0,Not Graduate,No,2583,2358,120,360,1,Urban,Y
LP001008,Male,No,0,Graduate,No,6000,0,141,360,1,Urban,Y
LP001011,Male,Yes,2,Graduate,Yes,5417,4196,267,360,1,Urban,Y
LP001013,Male,Yes,0,Not Graduate,No,2333,1516,95,360,1,Urban,Y
LP001014,Male,Yes,3+,Graduate,No,3036,2504,158,360,0,Semiurban,N
LP001018,Male,Yes,2,Graduate,No,4006,1526,168,360,1,Urban,Y
LP001020,Male,Yes,1,Graduate,No,12841,10968,349,360,1,Semiurban,N
LP001024,Male,Yes,2,Graduate,No,3200,700,70,360,1,Urban,Y
LP001027,Male,Yes,2,Graduate,,2500,1840,109,360,1,Urban,Y
LP001028,Male,Yes,2,Graduate,No,3073,8106,200,360,1,Urban,Y
LP001029,Male,No,0,Graduate,No,1853,2840,114,360,1,Rural,N
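
Between reading the raw text and loading a full DataFrame, the standard library's `csv` module is a middle ground. The sketch below uses a trimmed sample (three rows and a subset of the columns shown above) held in a string, so it does not depend on `train.csv` being present.

```python
import csv
import io

# Trimmed sample: a subset of the columns and rows shown above
sample = """Loan_ID,Gender,Married,ApplicantIncome,LoanAmount,Loan_Status
LP001002,Male,No,5849,,Y
LP001003,Male,Yes,4583,128,N
LP001005,Male,Yes,3000,66,Y
"""

# DictReader uses the header row as keys; every value is read as a string
reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

approved = [row["Loan_ID"] for row in rows if row["Loan_Status"] == "Y"]
# A missing value shows up as an empty string, not as None or NaN
missing_amount = [row["Loan_ID"] for row in rows if row["LoanAmount"] == ""]

print(approved)
print(missing_amount)
```

With a real file, `io.StringIO(sample)` would be replaced by a file object opened with `open("train.csv", "rt", newline="")`.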

In [2]:
import pandas as pd
df = pd.read_csv("train.csv")
df.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [3]:
type(df)

pandas.core.frame.DataFrame

In [4]:
df

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y


In [5]:
df.describe() 

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History
count,614.0,614.0,592.0,600.0,564.0
mean,5403.459283,1621.245798,146.412162,342.0,0.842199
std,6109.041673,2926.248369,85.587325,65.12041,0.364878
min,150.0,0.0,9.0,12.0,0.0
25%,2877.5,0.0,100.0,360.0,1.0
50%,3812.5,1188.5,128.0,360.0,1.0
75%,5795.0,2297.25,168.0,360.0,1.0
max,81000.0,41667.0,700.0,480.0,1.0


# Working with Large Text Files
- Strategies for working with Large files

In [None]:
# Process in chunks (Iterative approach)
# Line by Line processing

with open('large_file.txt', 'r') as file:
    for line in file:
        pass
        # Process each line

In [7]:
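def extract_data(line):
    # Hypothetical helper (not part of the original notebook): assumes the
    # timestamp is the first whitespace-separated field of the log line.
    timestamp, _, error_message = line.strip().partition(" ")
    return timestamp, error_message

def process_log_lines(filename):
    with open(filename, 'r') as file:
        for line in file:  # reads one line at a time, not the whole file
            if "ERROR" in line:
                timestamp, error_message = extract_data(line)
                # Do something with the error (log, send alert, etc.)

# Call the function
# process_log_lines('large_logfile.log')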

In [None]:
# Chunk based Processing
CHUNK_SIZE = 1024 * 1024  # 1 MB chunks
with open('large_file.txt', 'r') as file:
    while True:
        chunk = file.read(CHUNK_SIZE)
        if not chunk:
            break
        # Process the chunk
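
To make the chunk loop concrete, here is a small sketch that counts newline characters without ever holding the whole file in memory. The chunk size is deliberately tiny so the loop runs several times on a short demo file; the function name `count_newlines` and the file name are made up for the example.

```python
import os
import tempfile

CHUNK_SIZE = 8  # tiny on purpose; a real job would use something like 1 MB

def count_newlines(path, chunk_size=CHUNK_SIZE):
    """Count newline characters by reading the file chunk by chunk."""
    total = 0
    with open(path, "r") as file:
        while True:
            chunk = file.read(chunk_size)  # at most chunk_size characters
            if not chunk:                  # empty string means end of file
                break
            total += chunk.count("\n")
    return total

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "large_file_demo.txt")  # hypothetical file name
    with open(path, "w") as f:
        f.write("line one\nline two\nline three\n")
    newline_count = count_newlines(path)

print(newline_count)
```

Chunks do not respect line boundaries, so logic that needs whole lines is usually simpler with line-by-line iteration.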

In [8]:
# Memory Efficient Libraries

import pandas as pd

for chunk in pd.read_csv('train.csv', chunksize=10000):
    pass
    # Process each DataFrame chunk

In [None]:
# Other option - Dask

In [None]:
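def process_line(line):
    # Placeholder for illustration: replace with real per-line processing.
    return line.strip()

def process_file(filename):
    with open(filename, 'r') as file:
        for line in file:
            yield process_line(line)  # Process and yield results one by one

for result in process_file('large_file.txt'):
    pass  # Handle each result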

# Text Encryption with Python

### Hashlib

In [9]:
import hashlib

There are various hashing algorithms available in hashlib, each with different characteristics:

- md5: A widely used algorithm, but no longer considered secure for new applications.
- sha1: Similar to md5, also not recommended for security-sensitive applications.
- sha256: A stronger hashing algorithm, commonly used for data integrity checks.
- sha384, sha512: Even stronger options for enhanced security.

In [10]:
data = "This is a password".encode()
hash_object = hashlib.sha256(data)
hash_object

<sha256 _hashlib.HASH object @ 0x13f3b7670>

In [11]:
hash_object.hexdigest()

'9ae12b1403d242c53b0ea80137de34856b3495c3c49670aa77c7ec99eadbba6e'
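
For integrity checks on files, a hash object can be fed piece by piece with `update()`, so even a large file never has to fit in memory. The following is a sketch under that assumption; the helper name `sha256_of_file`, the chunk size, and the demo file are all choices made for this example.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=64 * 1024):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:      # hashing works on bytes, so open as binary
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)     # feeding chunks equals hashing all at once
    return digest.hexdigest()

payload = bytes(range(256)) * 1000   # 256,000 bytes of demo data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "binfile_demo")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(payload)
    file_digest = sha256_of_file(path)

print(file_digest == hashlib.sha256(payload).hexdigest())
```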

**Important points**

- The same data will always produce the same hash using the same algorithm.
- Even minor changes to the data will result in a completely different hash.
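- Hashes are one-way: there is no practical way to compute the original data from a hash, although weak inputs (such as common passwords) can still be found by guessing.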
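
The first two points can be checked directly. In this small sketch the second input differs from the first only in the case of its last letter.

```python
import hashlib

first = hashlib.sha256(b"This is a password").hexdigest()
again = hashlib.sha256(b"This is a password").hexdigest()
changed = hashlib.sha256(b"This is a passworD").hexdigest()  # one letter differs

# Count how many of the 64 hex characters differ between the two digests
differing = sum(a != b for a, b in zip(first, changed))

print(first == again)     # same data, same algorithm
print(first == changed)   # slightly different data
print(differing, "of", len(first), "hex characters differ")
```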

**Additional Considerations:**

- Use stronger algorithms like sha256 or higher for security-sensitive applications.
- Hashes are not encryption: there is no key that recovers the original data from a hash.
- Consider using HMAC (Hash-based Message Authentication Code) for message authentication, which combines hashing with a secret key.
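
As a minimal sketch of HMAC with the standard library's `hmac` module: the sender computes a tag from the message and a shared secret key, and the receiver recomputes it and compares. The key, the message, and the helper name `is_authentic` are placeholders for illustration.

```python
import hashlib
import hmac

secret_key = b"demo-secret-key"   # placeholder; a real key should be random
message = b"deploy build 42"      # made-up message

# The tag depends on both the message and the secret key
tag = hmac.new(secret_key, message, hashlib.sha256).hexdigest()

def is_authentic(key, msg, received_tag):
    """Recompute the tag and compare it in constant time."""
    expected = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received_tag)

print(is_authentic(secret_key, message, tag))
print(is_authentic(secret_key, b"deploy build 43", tag))
```

`hmac.compare_digest` is used instead of `==` because it is designed to avoid leaking information through comparison timing.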

### Cryptography

In [12]:
!pip install cryptography



In [13]:
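# The cryptography package has two layers:
#   - "recipes": high-level, safe-by-default APIs such as Fernet
#   - "hazmat": low-level primitives that are easy to misuse

# Symmetric encryption with Fernet (AES in CBC mode plus an HMAC for integrity)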

from cryptography.fernet import Fernet

# Key generation
key = Fernet.generate_key() 

# Instantiate the Fernet cipher
cipher = Fernet(key)

# Encryption
plaintext = b"Encrypt this secret message"
ciphertext = cipher.encrypt(plaintext)

# Decryption
decrypted_text = cipher.decrypt(ciphertext) 
print(decrypted_text)

b'Encrypt this secret message'


In [14]:
# Asymmetric Encryption
# RSA

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Generate private and public keys 
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# Encryption (using the public key)
message = b"A message encrypted with RSA"
ciphertext = public_key.encrypt(
    message,
    padding.OAEP(
        mgf=padding.MGF1(algorithm=hashes.SHA256()),
        algorithm=hashes.SHA256(),
        label=None
    )
)

# Decryption (using the private key)
plaintext = private_key.decrypt(
    ciphertext,
    padding.OAEP(
        mgf=padding.MGF1(algorithm=hashes.SHA256()),
        algorithm=hashes.SHA256(),
        label=None
    )
)
print(plaintext)


b'A message encrypted with RSA'


In [16]:
# Password Hashing
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
from cryptography.hazmat.backends import default_backend

backend = default_backend()
salt = os.urandom(16)  # Generate a random salt; store it alongside the derived key

# Key derivation function
kdf = PBKDF2HMAC(
    algorithm=hashes.SHA256(),
    length=32,
    salt=salt,
    iterations=390000,
    backend=backend
)

password = b"mysecurepassword"
key = kdf.derive(password)

# To verify a password later, build a new PBKDF2HMAC with the same salt and
# parameters (an instance can only be used once), then call:
# kdf.verify(password, key)
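
The same derive-then-verify flow can be sketched with only the standard library, using `hashlib.pbkdf2_hmac` in place of the `cryptography` class above. The function names `hash_password` and `verify_password` are made up for this example; the iteration count mirrors the cell above.

```python
import hashlib
import hmac
import os

ITERATIONS = 390000  # same count as the cell above

def hash_password(password, salt=None):
    """Derive a 32-byte key from the password; returns (salt, key)."""
    if salt is None:
        salt = os.urandom(16)  # a fresh random salt per password
    key = hashlib.pbkdf2_hmac("sha256", password, salt, ITERATIONS)
    return salt, key

def verify_password(password, salt, expected_key):
    """Re-derive the key with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password, salt, ITERATIONS)
    return hmac.compare_digest(candidate, expected_key)

salt, key = hash_password(b"mysecurepassword")

print(verify_password(b"mysecurepassword", salt, key))
print(verify_password(b"wrongpassword", salt, key))
```

Both the salt and the derived key need to be stored; the salt is not secret, but without it the key cannot be re-derived for verification.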


# Working with Directories in Python 

In [17]:
# os
# shutil

import os
os.getcwd()

'/Users/nachikethpro/Desktop/author-repo/python-for-mlops-aiops-devops/2.automation-files-directory'

In [18]:
os.chdir(r"/Users/nachikethpro/Desktop/author-repo/python-for-mlops-aiops-devops")

In [19]:
os.getcwd()

'/Users/nachikethpro/Desktop/author-repo/python-for-mlops-aiops-devops'

In [20]:
os.chdir(r"/Users/nachikethpro/Desktop/author-repo/python-for-mlops-aiops-devops/2.automation-files-directory")

In [21]:
# create a new dir
os.mkdir("new_directory")

In [22]:
# create nested directories
os.makedirs("new_dir/sub_dir")

In [23]:
os.listdir() # list files and directories

['web-servers-2.yaml',
 'files-directories.ipynb',
 'policy.json',
 '.envrc',
 'sp.config',
 'book.txt',
 'new_directory',
 'demo.jpeg',
 'rhymes.txt',
 'web-servers.yaml',
 'new_dir',
 'sample.xml',
 'example_xt.txt',
 'example.txt',
 'train.csv',
 '.ipynb_checkpoints',
 'binfile',
 'food.py']

In [26]:
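os.rmdir("new_directory") # works only if the directory is empty (here it had apparently already been removed by an earlier run of this cell)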

FileNotFoundError: [Errno 2] No such file or directory: 'new_directory'

In [25]:
os.rmdir("new_dir")

OSError: [Errno 66] Directory not empty: 'new_dir'

In [27]:
import shutil

In [28]:
shutil.rmtree("new_dir")
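
The cells above can be replayed safely inside a temporary directory. This sketch assumes nothing beyond the standard library and shows which calls succeed and which raise: `os.rmdir` only removes empty directories, while `shutil.rmtree` removes a directory together with everything in it.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()  # throwaway parent directory for the demo

os.mkdir(os.path.join(base, "new_directory"))            # single directory
os.makedirs(os.path.join(base, "new_dir", "sub_dir"))    # nested directories
# exist_ok=True makes a repeated call harmless instead of raising an error
os.makedirs(os.path.join(base, "new_dir", "sub_dir"), exist_ok=True)

os.rmdir(os.path.join(base, "new_directory"))            # fine: it is empty

try:
    os.rmdir(os.path.join(base, "new_dir"))              # not empty
    rmdir_failed = False
except OSError:
    rmdir_failed = True

shutil.rmtree(os.path.join(base, "new_dir"))             # removes the whole tree
remaining = os.listdir(base)

shutil.rmtree(base)                                      # clean up the demo
print(rmdir_failed, remaining)
```

Because `shutil.rmtree` deletes recursively and without confirmation, it is worth double-checking the path before calling it.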

In [30]:
# Pathlib
from pathlib import Path

In [31]:
file_path = Path("demo.jpeg")

In [32]:
type(file_path)

pathlib.PosixPath

In [33]:
file_path

PosixPath('demo.jpeg')

In [34]:
file_path.suffix

'.jpeg'

In [None]:
# project_dir = Path("my_project") 

In [35]:
Path.cwd() # current working directory

PosixPath('/Users/nachikethpro/Desktop/author-repo/python-for-mlops-aiops-devops/2.automation-files-directory')

In [36]:
file_path.parent

PosixPath('.')

In [38]:
file_path.absolute()

PosixPath('/Users/nachikethpro/Desktop/author-repo/python-for-mlops-aiops-devops/2.automation-files-directory/demo.jpeg')

In [39]:
# check existence
file_path.exists()

True

In [41]:
file_path.parent.absolute()

PosixPath('/Users/nachikethpro/Desktop/author-repo/python-for-mlops-aiops-devops/2.automation-files-directory')

In [42]:
os.path.join(file_path.parent.absolute(), "train.csv")

'/Users/nachikethpro/Desktop/author-repo/python-for-mlops-aiops-devops/2.automation-files-directory/train.csv'
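
With `pathlib`, the same join can be written with the `/` operator, which keeps the result as a `Path` rather than a string. A short sketch, which only builds paths and does not touch the disk:

```python
from pathlib import Path

file_path = Path("demo.jpeg")

# "/" joins path components and returns a new Path object
csv_path = file_path.parent / "train.csv"

# Related helpers for picking a path apart or deriving a new one
renamed = file_path.with_suffix(".png")

print(csv_path.name)    # file name with extension
print(file_path.stem)   # file name without extension
print(renamed.name)     # same stem, different suffix
```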

In [43]:
import datetime
datetime.datetime.now()

datetime.datetime(2024, 4, 18, 9, 45, 9, 361124)

In [44]:
datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

'20240418_094516'

### Examples from MLOps Projects

In [None]:
# Model Save
import os
import json
import datetime

# Assumes `model` (an object with a Keras-style save_weights method) and
# `config_dict` (a JSON-serializable dict) were created earlier.
model_name = "my_model"
version = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
model_dir = f"models/{model_name}/{version}"

os.makedirs(model_dir, exist_ok=True)  # Create the directory structure

# Save model weights
model.save_weights(os.path.join(model_dir, "model.h5"))

# Save configuration
with open(os.path.join(model_dir, "config.json"), "w") as f:
    json.dump(config_dict, f)


In [None]:
# Data Versioning

import os

data_dir = "datasets/my_dataset"
version = "v2"  # Update the version as needed
versioned_data_dir = os.path.join(data_dir, version)

os.makedirs(versioned_data_dir, exist_ok=True)

# Perform preprocessing and store data in the versioned directory

In [None]:
# Manage Experiment Outputs
import os
import time

experiment_name = "experiment_with_new_hyperparams"
timestamp = time.strftime("%Y%m%d_%H%M%S")  
experiment_dir = f"experiments/{experiment_name}_{timestamp}"

os.makedirs(experiment_dir)

# During your experiment:
# Save logs to os.path.join(experiment_dir, "training_log.txt")
# Save plots to os.path.join(experiment_dir, "performance_plot.png")
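
The same layout can be sketched with `pathlib`, here inside a temporary directory so that nothing is left behind. The hyperparameter values and the file name `params.json` are invented for the example.

```python
import json
import tempfile
import time
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    experiment_name = "experiment_with_new_hyperparams"
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    experiment_dir = Path(tmp) / "experiments" / f"{experiment_name}_{timestamp}"

    # parents=True creates "experiments" too; exist_ok=True tolerates reruns
    experiment_dir.mkdir(parents=True, exist_ok=True)

    params = {"learning_rate": 0.01, "epochs": 5}  # made-up hyperparameters
    (experiment_dir / "params.json").write_text(json.dumps(params, indent=2))
    (experiment_dir / "training_log.txt").write_text("epoch 1: started\n")

    created = sorted(p.name for p in experiment_dir.iterdir())
    reloaded = json.loads((experiment_dir / "params.json").read_text())

print(created)
print(reloaded)
```

Storing the parameters next to the logs keeps each run self-describing, which helps when comparing experiments later.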
