# <center>Intermediate Python (Part-5)</center>

# ***<center>Working with Files</center>***

<img src=https://img.memecdn.com/dat-file_o_200764.jpg height=300 width=300>

## File Access Modes

>Access modes govern which operations are possible on an opened file, that is, how the file will be used once it's opened. They also define the initial position of the file handle. The file handle works like a cursor: it marks where data will be read from or written to in the file. This section covers six commonly used access modes in Python.

- **Read Only (‘r’)** : Opens a text file for reading. The handle is positioned at the beginning of the file. If the file does not exist, FileNotFoundError (a kind of I/O error) is raised. This is also the default mode in which a file is opened.


- **Read and Write (‘r+’)** : Opens the file for reading and writing. The handle is positioned at the beginning of the file. Raises FileNotFoundError if the file does not exist.


- **Write Only (‘w’)** : Opens the file for writing. For an existing file, the data is truncated and overwritten. The handle is positioned at the beginning of the file. Creates the file if it does not exist.


- **Write and Read (‘w+’)** : Opens the file for reading and writing. For an existing file, the data is truncated and overwritten. The handle is positioned at the beginning of the file. Creates the file if it does not exist.


- **Append Only (‘a’)** : Open the file for writing. The file is created if it does not exist. The handle is positioned at the end of the file. The data being written will be inserted at the end, after the existing data.


- **Append and Read (‘a+’)** : Open the file for reading and writing. The file is created if it does not exist. The handle is positioned at the end of the file. The data being written will be inserted at the end, after the existing data.
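
The difference between 'r', 'w' and 'a' is easiest to see on a scratch file. The sketch below is only an illustration: the file name modes_demo.txt and the temporary directory are assumptions, not part of the course resources.

```python
import os
import tempfile

# Hypothetical scratch file inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

# 'r' on a missing file raises FileNotFoundError
try:
    open(path, "r")
    missing_raises = False
except FileNotFoundError:
    missing_raises = True

# 'w' creates the file and writes from the beginning
f = open(path, "w")
f.write("first\n")
f.close()

# 'a' keeps the existing data and writes at the end
f = open(path, "a")
f.write("second\n")
f.close()

f = open(path, "r")
after_append = f.read()
f.close()

# 'w' on an existing file truncates it before writing
f = open(path, "w")
f.write("third\n")
f.close()

f = open(path, "r")
after_truncate = f.read()
f.close()

print(missing_raises)
print(repr(after_append))
print(repr(after_truncate))
```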

## 1. Text files

### Opening a file

### Closing a file

<img src=https://memegenerator.net/img/instances/500x/58748422/if-you-could-just-close-that-file-thatd-be-great-emm-kay.jpg height=400 width=400>

### Writing to a file

- write
- writelines
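
A minimal sketch of the two write methods, assuming a hypothetical scratch file write_demo.txt. Note that writelines does not add newline characters for you.

```python
import os
import tempfile

# Hypothetical scratch file inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "write_demo.txt")

f = open(path, "w")
# write() takes a single string and returns the number of characters written
chars_written = f.write("alpha\n")
# writelines() takes an iterable of strings and writes them as they are
f.writelines(["beta\n", "gamma\n"])
f.close()

# read the file back to check what was written
f = open(path, "r")
content = f.read()
f.close()

print(chars_written)
print(repr(content))
```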

### Reading from a file
- read
- readline
- readlines
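
The three read methods differ in how much they consume. The sketch below assumes a hypothetical three-line scratch file read_demo.txt.

```python
import os
import tempfile

# Hypothetical scratch file with three lines
path = os.path.join(tempfile.mkdtemp(), "read_demo.txt")
f = open(path, "w")
f.write("one\ntwo\nthree\n")
f.close()

f = open(path, "r")
first_line = f.readline()   # one line, including its trailing newline
remaining = f.readlines()   # list of all lines left after the handle
at_end = f.read()           # handle is at the end, so nothing is left
f.close()

f = open(path, "r")
whole = f.read()            # the entire file as one string
f.close()

print(repr(first_line))
print(remaining)
print(repr(at_end))
print(repr(whole))
```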

### Moving the cursor

- seek(n) : moves the file handle to byte offset n, counted from the beginning of the file (offset 0 is the first byte).
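
A small sketch of seek, together with tell (which reports the current handle position). The scratch file seek_demo.bin is an assumption, and binary mode is used here so that offsets are exact byte counts.

```python
import os
import tempfile

# Hypothetical scratch file holding ten bytes
path = os.path.join(tempfile.mkdtemp(), "seek_demo.bin")
f = open(path, "wb")
f.write(b"0123456789")
f.close()

f = open(path, "rb")
f.seek(4)                  # move the handle to byte offset 4
middle = f.read(3)         # read three bytes starting there
position = f.tell()        # handle now sits just after the bytes read
f.seek(0)                  # jump back to the beginning
start = f.read(2)
f.close()

print(middle, position, start)
```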

### Smarter way of opening files...

The "with" statement gives you cleaner syntax and safer exception handling.

"The with statement simplifies exception handling by encapsulating common
preparation and cleanup tasks."

In addition, it automatically closes the file when the block ends, even if an
exception is raised inside the block. The with statement is a way of ensuring
that clean-up code always runs.
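
As a quick check of the automatic clean-up, the sketch below (using a hypothetical scratch file with_demo.txt) inspects the closed attribute of the file object after a normal exit and after an exception.

```python
import os
import tempfile

# Hypothetical scratch file inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

# normal exit: the file is closed as soon as the block ends
with open(path, "w") as f:
    f.write("hello\n")
closed_after_block = f.closed

# exit through an exception: the file is still closed
try:
    with open(path, "r") as f:
        raise ValueError("simulated failure while reading")
except ValueError:
    closed_after_error = f.closed

print(closed_after_block, closed_after_error)
```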


## 2. JSON files

>JavaScript Object Notation, abbreviated as JSON, is a lightweight data interchange format. Python's built-in json module encodes Python objects as JSON strings and decodes JSON strings into Python objects.

<img src=https://cosmic-s3.imgix.net/a36a7c70-4a9b-11e6-9c2d-9547cfa7474b-article-json-everywhere-meme.jpg?w=900 height=500 width=500>

- **json.dump(obj, fileObj) :** Serializes obj as a JSON formatted stream to fileObj.
- **json.dumps(obj) :** Serializes obj as JSON formatted string.
- **json.load(fileObj) :** De-serializes the JSON document read from fileObj (an open file) to a Python object.
- **json.loads(s) :** De-serializes s (a string containing a JSON document) to a Python object.
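
The four functions can be tried without touching the disk. In this sketch the record is made-up data, and an in-memory io.StringIO buffer stands in for a file opened in text mode.

```python
import io
import json

# made-up record for illustration
record = {"name": "Nikhil", "year": 2, "branches": ["COE", "IT"], "active": True}

# dumps / loads work on strings
text = json.dumps(record)
restored = json.loads(text)

# dump / load work on file objects; StringIO stands in for an open text file
buffer = io.StringIO()
json.dump(record, buffer)
buffer.seek(0)               # rewind before reading back
from_file = json.load(buffer)

print(text)
print(restored == record, from_file == record)
```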

## 3. XML files

<img src=http://i.imgur.com/frZZbAQ.jpg height=300 width=300>

>XML stands for eXtensible Markup Language. It was designed to store and transport data, and to be both human- and machine-readable. Accordingly, the design goals of XML emphasize simplicity, generality, and usability across the Internet.

```python
import xml.etree.ElementTree as ET
```

>Here, we are using xml.etree.ElementTree (call it ET, in short) module. Element Tree has two classes – **ElementTree** represents the whole XML document as a tree, and **Element** represents a single node in this tree. Interactions with the whole document (reading and writing to/from files) are usually done on the ElementTree level. Interactions with a single XML element and its sub-elements are done on the Element level.

In [61]:
# create element tree object
tree = ET.parse("resources/rssfeed.xml")

In [41]:
# get root element
root = tree.getroot()

In [42]:
# create empty list for news items
newsitems = []

### XPATH
>./channel/item is XPath syntax (XPath is a language for addressing parts of an XML document). Here, we want to find all item elements that are children of the channel children of the root element (denoted by ‘.’), in other words the item grandchildren of the root.
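
To see what such a path matches, here is a minimal sketch on a made-up feed parsed from a string. The XML below is an assumption for illustration and is not the content of resources/rssfeed.xml.

```python
import xml.etree.ElementTree as ET

# made-up feed for illustration only
xml_text = """<rss>
  <channel>
    <title>Example feed</title>
    <item><title>First story</title></item>
    <item><title>Second story</title></item>
  </channel>
</rss>"""

# fromstring returns the root Element (<rss>) directly
root = ET.fromstring(xml_text)

# './channel/item' matches the <item> grandchildren of the root
titles = [item.find("title").text for item in root.findall("./channel/item")]

# './item' matches only direct children of the root, and there are none
direct_items = root.findall("./item")

print(titles)
print(direct_items)
```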

![](https://indianpythonista.files.wordpress.com/2017/01/1.gif)

In [44]:
# iterate news items
for item in root.findall('./channel/item'):
    # empty news dictionary
    news = {}

    # iterate child elements of item
    for child in item:
        if child.tag == "media":
            news[child.tag] = child.attrib['url']
        else:
            # in Python 3, child.text is already a str, so no encoding is needed
            news[child.tag] = child.text
            
    # append news dictionary to news items list
    newsitems.append(news)

## JSON vs XML

<img src=http://s2.quickmeme.com/img/37/378dc176eed7cbd89f077b3d3b065f4ac6ea9815dad3fbf71cfa94a9c92d5e93.jpg height=400 width=400>

Both JSON and XML can be used to receive data from a web server.

### JSON is Like XML Because
- Both JSON and XML are "self describing" (human readable)
- Both JSON and XML are hierarchical (values within values)
- Both JSON and XML can be parsed and used by lots of programming languages

### JSON is Unlike XML Because
- JSON doesn't use end tags
- JSON is shorter
- JSON is quicker to read and write
- JSON can use arrays

### Why JSON is Better Than XML

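- XML generally takes more code to parse than JSON, because elements, attributes, and text have to be walked explicitly.
- JSON maps directly onto native data structures: it parses into a ready-to-use JavaScript object, and in Python the json module returns dictionaries and lists.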
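
To make the comparison concrete, the sketch below parses the same made-up student record from both formats using only the standard library. The record and its element names are assumptions chosen for illustration.

```python
import json
import xml.etree.ElementTree as ET

# the same made-up record in both formats
json_text = '{"name": "Aditya", "branch": "IT", "scores": [9.3, 9.1]}'
xml_text = ("<student><name>Aditya</name><branch>IT</branch>"
            "<scores><score>9.3</score><score>9.1</score></scores></student>")

# JSON: one call gives a dict containing a real list of floats
student_from_json = json.loads(json_text)

# XML: the tree has to be walked, and all text arrives as strings
root = ET.fromstring(xml_text)
student_from_xml = {
    "name": root.find("name").text,
    "branch": root.find("branch").text,
    "scores": [float(s.text) for s in root.findall("./scores/score")],
}

print(student_from_json == student_from_xml)
print(len(json_text), len(xml_text))
```

Both routes end in the same dictionary, but the XML version needs explicit navigation and type conversion, and its text is longer for the same data.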

<img src=https://memegenerator.net/img/instances/400x/45187234.jpg height=300 width=300>

## 4. CSV files

<img src=https://cdn-images-1.medium.com/max/1600/1*4Z2oMWCTiY-wxa3F9NaIAQ.png height=300 width=300>

> CSV (Comma Separated Values) is a simple file format used to store tabular data, such as a spreadsheet or database. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.

To work with CSV files in Python, there is a built-in module called csv.
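
As a first look at records and fields, the sketch below parses a small made-up table held in a string (io.StringIO stands in for an open file). It also shows that a field containing a comma survives when it is quoted.

```python
import csv
import io

# made-up table; the quoted field contains a comma
csv_text = 'Name,Branch,CGPA\nNikhil,COE,9.0\n"Kumar, Sagar",SE,9.5\n'

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)     # first record holds the field names
records = list(reader)    # every remaining record is a list of strings

print(header)
print(records)
```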

### Reading a CSV file

To read a CSV file, pass the file object to csv.reader. The resulting reader object yields each row as a list of strings.


In [63]:
import csv

In [65]:
# list that will collect the data rows
rows = []

# reading csv file (newline='' lets the csv module handle line endings)
with open("resources/aapl.csv", 'r', newline='') as csvfile:
    # creating a csv reader object
    csvreader = csv.reader(csvfile)

    # extracting field names through first row
    fields = next(csvreader)

    # extracting each data row one by one
    for row in csvreader:
        rows.append(row)

### Writing a CSV file

To write to CSV file, use csv.writer object.

In [55]:
# field names
fields = ['Name', 'Branch', 'Year', 'CGPA']
 
# data rows of csv file
rows = [ ['Nikhil', 'COE', '2', '9.0'],
         ['Sanchit', 'COE', '2', '9.1'],
         ['Aditya', 'IT', '2', '9.3'],
         ['Sagar', 'SE', '1', '9.5'],
         ['Prateek', 'MCE', '3', '7.8'],
         ['Sahil', 'EP', '2', '9.1']]

In [None]:
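# name of the output csv file (hypothetical path; the original cell does not define it)
filename = "resources/university_records.csv"

# writing to csv file (newline='' prevents blank lines between rows on Windows)
with open(filename, 'w', newline='') as csvfile: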
    # creating a csv writer object
    csvwriter = csv.writer(csvfile)
     
    # writing the fields
    csvwriter.writerow(fields)
     
    # writing the data rows
    csvwriter.writerows(rows)

### Writing a dictionary to a CSV file

To write rows that are stored as dictionaries, use a csv.DictWriter object.

In [None]:
# my data rows as dictionary objects
mydict =[{'branch': 'COE', 'cgpa': '9.0', 'name': 'Nikhil', 'year': '2'},
         {'branch': 'COE', 'cgpa': '9.1', 'name': 'Sanchit', 'year': '2'},
         {'branch': 'IT', 'cgpa': '9.3', 'name': 'Aditya', 'year': '2'},
         {'branch': 'SE', 'cgpa': '9.5', 'name': 'Sagar', 'year': '1'},
         {'branch': 'MCE', 'cgpa': '7.8', 'name': 'Prateek', 'year': '3'},
         {'branch': 'EP', 'cgpa': '9.1', 'name': 'Sahil', 'year': '2'}]
 
# field names
fields = ['name', 'branch', 'year', 'cgpa']

In [None]:
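# name of the output csv file (hypothetical path; the original cell does not define it)
filename = "resources/university_records.csv"

# writing to csv file (newline='' prevents blank lines between rows on Windows)
with open(filename, 'w', newline='') as csvfile: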
    # creating a csv dict writer object
    writer = csv.DictWriter(csvfile, fieldnames = fields)
     
    # writing headers (field names)
    writer.writeheader()
     
    # writing data rows
    writer.writerows(mydict)

![](http://i.imgur.com/IFF1FJo.jpg)

## 5. ZIP files

>ZIP is an archive file format that supports lossless data compression. By lossless compression, we mean that the compression algorithm allows the original data to be perfectly reconstructed from the compressed data. So, a ZIP file is a single file containing one or more compressed files, offering an ideal way to make large files smaller and keep related files together.

Why do we need zip files?

- To reduce storage requirements.
- To improve transfer speed over standard connections.

To work with zip files in Python, we will use the built-in zipfile module.
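
The lossless property can be checked with a quick in-memory round trip. This is a sketch under stated assumptions: the data and the member name repeated.txt are made up, and an io.BytesIO buffer stands in for a zip file on disk. Note that ZipFile stores members uncompressed by default (ZIP_STORED), so compression has to be requested explicitly.

```python
import io
import zipfile

# made-up, highly repetitive data that compresses well
original = b"the same line repeated many times\n" * 200

# write one member into an in-memory archive, asking for DEFLATE compression
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("repeated.txt", original)

compressed_size = len(buffer.getvalue())

# read the member back and compare it with the original bytes
with zipfile.ZipFile(buffer, "r") as archive:
    restored = archive.read("repeated.txt")

print(len(original), compressed_size)
print(restored == original)
```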

### Extracting from a zipfile

In [None]:
# importing required modules
from zipfile import ZipFile
 
# specifying the zip file name
file_name = "resources/my_files.zip"
 
# opening the zip file in READ mode
# (named zf rather than zip, to avoid shadowing the built-in zip function)
with ZipFile(file_name, 'r') as zf:
    # printing all the contents of the zip file
    zf.printdir()

    # extracting all the files
    print('Extracting all the files now...')
    zf.extractall()
    print('Done!')

### Writing to a zip file

In [None]:
import os
from zipfile import ZipFile
 
def get_all_file_paths(directory):
 
    # initializing empty file paths list
    file_paths = []
 
    # crawling through directory and subdirectories
    for root, directories, files in os.walk(directory):
        for filename in files:
            # join the two strings in order to form the full filepath.
            filepath = os.path.join(root, filename)
            file_paths.append(filepath)
 
    # returning all file paths
    return file_paths        
 
    
# path to folder which needs to be zipped
directory = 'resources/my_files'

# calling function to get all file paths in the directory
file_paths = get_all_file_paths(directory)

# printing the list of all files to be zipped
print('Following files will be zipped:')
for file_name in file_paths:
    print(file_name)

# writing files to a zipfile
# (named zf rather than zip, to avoid shadowing the built-in zip function)
with ZipFile('my_zipped_files.zip','w') as zf:
    # writing each file one by one
    for file in file_paths:
        zf.write(file)

print('All files zipped successfully!')

## 6. PDF files

![](https://lh3.googleusercontent.com/Rkf3FXjYCeQPlExy4RRi32sQs1J-DqGdzONTzGP2khJ992M6oPt1Ob6beMUA69M7vXc=w170)

>PDF stands for Portable Document Format. It uses .pdf extension. It is used to present and exchange documents reliably, independent of software, hardware, or operating system.

>Invented by Adobe, PDF is now an open standard maintained by the International Organization for Standardization (ISO). PDFs can contain links and buttons, form fields, audio, video, and business logic.

We will be using a third-party module, **PyPDF2**.

PyPDF2 is a python library built as a PDF toolkit. It is capable of:

- Extracting document information (title, author, …)
- Splitting documents page by page
- Merging documents page by page
- Cropping pages
- Merging multiple pages into a single page
- Encrypting and decrypting PDF files
and more!

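To install PyPDF2 (the examples below use the PdfFileReader-style names from PyPDF2 1.x and 2.x, which were removed in version 3.0, so an older release is pinned here):

```
$ pip install "PyPDF2<3.0"
```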

### Extracting text from PDF files

In [None]:
# importing required modules
import PyPDF2
 
# creating a pdf file object
pdfFileObj = open('resources/example.pdf', 'rb')
 
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
 
# printing number of pages in pdf file
print(pdfReader.numPages)
 
# creating a page object
pageObj = pdfReader.getPage(0)
 
# extracting text from page
print(pageObj.extractText())
 
# closing the pdf file object
pdfFileObj.close()

>While PDF files are great for laying out text in a way that’s easy for people to print and read, they’re not straightforward for software to parse into plaintext. As such, PyPDF2 might make mistakes when extracting text from a PDF and may even be unable to open some PDFs at all. There isn’t much you can do about this, unfortunately. PyPDF2 may simply be unable to work with some of your particular PDF files.

### Rotating PDF files

<img src=http://s2.quickmeme.com/img/63/630da292a23ce1243761be3c8ec658d6efae07579b00c7f4a5a25dc88a97e8b6.jpg height=250 width=250>

In [None]:
def PDFrotate(origFileName, newFileName, rotation):
 
    # creating a pdf File object of original pdf
    pdfFileObj = open(origFileName, 'rb')
     
    # creating a pdf Reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
 
    # creating a pdf writer object for new pdf
    pdfWriter = PyPDF2.PdfFileWriter()
     
    # rotating each page
    for page in range(pdfReader.numPages):
 
        # creating rotated page object
        pageObj = pdfReader.getPage(page)
        pageObj.rotateClockwise(rotation)
 
        # adding rotated page object to pdf writer
        pdfWriter.addPage(pageObj)
 
    # new pdf file object
    newFile = open(newFileName, 'wb')
     
    # writing rotated pages to new file
    pdfWriter.write(newFile)
 
    # closing the original pdf file object
    pdfFileObj.close()
     
    # closing the new pdf file object
    newFile.close()
     
        
# original pdf file name
origFileName = 'resources/example.pdf'

# new pdf file name
newFileName = 'resources/rotated_example.pdf'

# rotation angle
rotation = 270

# calling the PDFrotate function
PDFrotate(origFileName, newFileName, rotation)

### Merging PDF files

In [None]:
def PDFmerge(pdfs, output): 
    # creating pdf file merger object
    pdfMerger = PyPDF2.PdfFileMerger()
     
    # appending pdfs one by one
    for pdf in pdfs:
        with open(pdf, 'rb') as f:
            pdfMerger.append(f)
         
    # writing combined pdf to output pdf file
    with open(output, 'wb') as f:
        pdfMerger.write(f)

# pdf files to merge
pdfs = ['resources/example.pdf', 'resources/rotated_example.pdf']

# output pdf file name
output  = 'resources/combined_example.pdf'

# calling pdf merge function
PDFmerge(pdfs = pdfs, output = output)


### Splitting PDF

In [None]:
def PDFsplit(pdf, splits):
    # creating input pdf file object
    pdfFileObj = open(pdf, 'rb')
     
    # creating pdf reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
     
    # starting index of first slice
    start = 0
     
    # ending index (exclusive) of first slice
    end = splits[0]
     
     
    for i in range(len(splits)+1):
        # creating pdf writer object for (i+1)th split
        pdfWriter = PyPDF2.PdfFileWriter()
         
        # output pdf file name
        outputpdf = pdf.split('.pdf')[0] + str(i) + '.pdf'
         
        # adding pages to pdf writer object
        for page in range(start,end):
            pdfWriter.addPage(pdfReader.getPage(page))
         
        # writing split pdf pages to pdf file
        with open(outputpdf, "wb") as f:
            pdfWriter.write(f)
 
        # interchanging page split start position for next split
        start = end
        try:
            # setting split end position for next split
            end = splits[i+1]
        except IndexError:
            # setting split end position for last split
            end = pdfReader.numPages
         
    # closing the input pdf file object
    pdfFileObj.close()
             

# pdf file to split
pdf = 'resources/example.pdf'

# split page positions
splits = [2,4]

# calling PDFsplit function to split pdf
PDFsplit(pdf, splits)

![](https://memegenerator.net/img/instances/400x/60931389.jpg)
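
The page-range bookkeeping inside PDFsplit can be checked on its own, without PyPDF2 or a real PDF. The helper split_ranges below is hypothetical (it is not part of PyPDF2 or of the function above), and the 6-page document is an assumption.

```python
def split_ranges(num_pages, splits):
    """Return (start, end) page ranges, end exclusive, for the given split positions."""
    # boundaries are: first page, each split position, one past the last page
    bounds = [0] + list(splits) + [num_pages]
    # pair each boundary with the next one
    return list(zip(bounds[:-1], bounds[1:]))


# a hypothetical 6-page document split at pages 2 and 4
ranges = split_ranges(6, [2, 4])
print(ranges)
```

Each tuple corresponds to one output file: pages 0 and 1, pages 2 and 3, then pages 4 and 5.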

To learn more:
- https://automatetheboringstuff.com/chapter13/
- https://pythonhosted.org/PyPDF2/