# FILE HANDLING IN PYTHON:-

One of the most common tasks that you can do with Python is reading and writing files. Whether it’s writing to a simple text file, reading a complicated server log, or even analyzing raw byte data, all of these situations require reading or writing a file.

What Is a File?
Before we can go into how to work with files in Python, it’s important to understand what exactly a file is and how modern operating systems handle some of their aspects.

At its core, a file is a contiguous set of bytes used to store data. This data is organized in a specific format and can be anything as simple as a text file or as complicated as a program executable. In the end, these byte files are then translated into binary 1 and 0 for easier processing by the computer.

Files on most modern file systems are composed of three main parts:

Header: metadata about the contents of the file (file name, size, type, and so on)

Data: contents of the file as written by the creator or editor

End of file (EOF): special character that indicates the end of the file


What this data represents depends on the format specification used, which is typically represented by an extension. For example, a file that has an extension of .gif most likely conforms to the Graphics Interchange Format specification. There are hundreds, if not thousands, of file extensions out there. For this tutorial, you’ll only deal with .txt or .csv file extensions.


File Paths
When you access a file on an operating system, a file path is required. The file path is a string that represents the location of a file. It’s broken up into three major parts:

1.Folder Path: the file folder location on the file system where subsequent folders are separated by a forward slash / (Unix) or backslash \ (Windows)

2.File Name: the actual name of the file

3.Extension: the end of the file path pre-pended with a period (.) used to indicate the file type



# 1. A file located within a file structure:-

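Here’s a quick example. Let’s say you have a file structure in which animals.csv and a folder named path sit side by side, path contains dog_breeds.txt and a folder named to, and to contains cats.gif.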

Let’s say you wanted to access the cats.gif file, and your current location was in the same folder as path. In order to access the file, you need to go through the path folder and then the to folder, finally arriving at the cats.gif file. The Folder Path is path/to/. The File Name is cats. The File Extension is .gif. So the full path is path/to/cats.gif.


# 2.The file can be simply referenced by the file name and extension 

Now let’s say that your current location or current working directory (cwd) is in the to folder of our example folder structure. Instead of referring to the cats.gif by the full path of path/to/cats.gif, the file can be simply referenced by the file name and extension cats.gif


But what about dog_breeds.txt? How would you access that without using the full path? You can use the special characters double-dot (..) to move one directory up. This means that ../dog_breeds.txt will reference the dog_breeds.txt file from the directory of to.

The double-dot (..) can be chained together to traverse multiple directories above the current directory. For example, to access animals.csv from the to folder, you would use ../../animals.csv.
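The path parts and the double-dot references described above can be checked with a short sketch. It assumes the folder layout from the example (path/to/cats.gif, path/dog_breeds.txt, animals.csv) and uses Unix-style separators; the variable names are only illustrative.

```python
import posixpath
from pathlib import PurePosixPath

# Split a path into the three parts described above
full_path = PurePosixPath('path/to/cats.gif')
folder_path = str(full_path.parent)   # folder path
file_name = full_path.stem            # file name without the extension
extension = full_path.suffix          # extension, including the period
print(folder_path, file_name, extension)

# Resolve the double-dot (..) references, starting from the 'to' folder
cwd = 'path/to'
dog_breeds = posixpath.normpath(posixpath.join(cwd, '../dog_breeds.txt'))
animals = posixpath.normpath(posixpath.join(cwd, '../../animals.csv'))
print(dog_breeds)
print(animals)
```

Nothing is read from disk here; the paths are only manipulated as strings, so the sketch runs even though the example files do not exist.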

# 1.Line Endings


One problem often encountered when working with file data is the representation of a new line or line ending. The line ending has its roots in the Morse Code era, when a specific prosign was used to communicate the end of a transmission or the end of a line.

Later, this was standardized for teleprinters by both the International Organization for Standardization (ISO) and the American Standards Association (ASA). The ASA standard states that line endings should use the sequence of the Carriage Return (CR or \r) and the Line Feed (LF or \n) characters (CR+LF or \r\n). The ISO standard, however, allowed for either the CR+LF characters or just the LF character.

Windows uses the CR+LF characters to indicate a new line, while Unix and the newer Mac versions use just the LF character. This can cause some complications when you’re processing files on an operating system that is different than the file’s source.
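As a rough illustration of how Python deals with this, the sketch below writes Windows-style line endings as raw bytes and reads them back in text mode. The temporary file name is hypothetical; the newline parameter of open() is what controls the translation.

```python
import os
import tempfile

# Write raw bytes containing Windows-style (CR+LF) line endings
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'windows_lines.txt')  # hypothetical file name
with open(sample_path, 'wb') as writer:
    writer.write(b'first\r\nsecond\r\n')

# Default text mode translates \r\n (and a lone \r) to \n on every platform
with open(sample_path, 'r') as reader:
    translated = reader.read()

# newline='' turns the translation off, so the original endings are kept
with open(sample_path, 'r', newline='') as reader:
    untranslated = reader.read()

print(repr(translated))
print(repr(untranslated))
```

In most text-processing code the default translation is what you want; newline='' is mainly useful when the exact line endings matter.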


# 2.Character Encodings
Another common problem that you may face is the encoding of the byte data. An encoding is a translation between byte data and human-readable characters. This is typically done by assigning a numerical value to represent each character. The two most common character sets are ASCII and Unicode. ASCII can only store 128 characters, while Unicode can contain up to 1,114,112 characters.

ASCII is actually a subset of Unicode, and the UTF-8 encoding of Unicode stores those first 128 characters with the same byte values as ASCII. It’s important to note that parsing a file with the incorrect character encoding can lead to failures or misrepresentation of the character. For example, if a file was created using the UTF-8 encoding, and you try to parse it using the ASCII encoding, then an error will be thrown as soon as a character outside of those 128 values is encountered.
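A minimal sketch of that failure, using an in-memory string rather than a file; the sample word is only illustrative.

```python
text = 'caf\u00e9'  # 'cafe' with an accented final e, which is outside ASCII

utf8_bytes = text.encode('utf-8')
print(utf8_bytes)  # the accented character takes two bytes in UTF-8

# Pure ASCII text encodes to the same bytes in ASCII and UTF-8
same_bytes = 'cafe'.encode('ascii') == 'cafe'.encode('utf-8')

# Decoding UTF-8 bytes as ASCII fails as soon as a byte above 127 appears
try:
    utf8_bytes.decode('ascii')
    decode_failed = False
except UnicodeDecodeError as error:
    decode_failed = True
    print('Could not decode as ASCII:', error)

# Decoding with the encoding that was used to write the data works
round_trip = utf8_bytes.decode('utf-8')
```

When reading files, the same choice is made through the encoding argument of open(), for example open(path, encoding='utf-8').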



# Opening and Closing a File in Python

When you want to work with a file, the first thing to do is to open it. This is done by invoking the open() built-in function. open() has a single required argument that is the path to the file. open() has a single return, the file object:

In [40]:
file=open('Daily Activities.txt')
print(file)

<_io.TextIOWrapper name='Daily Activities.txt' mode='r' encoding='cp1252'>


It’s important to remember that it’s your responsibility to close the file.

In most cases, upon termination of an application or script, a file will be closed eventually. However, there is no guarantee when exactly that will happen. This can lead to unwanted behavior including resource leaks. It’s also a best practice within Python (Pythonic) to make sure that your code behaves in a way that is well defined and reduces any unwanted behavior.

# 1.First way of closing the file with the help of try-finally block

When you’re manipulating a file, there are two ways that you can use to ensure that a file is closed properly, even when encountering an error. The first way to close a file is to use the try-finally block:

In [41]:
reader=open('Daily Activities.txt')
try:
    for f in reader:
        print(f)
        
finally:
    reader.close()

Tomorrow Day:[22 October]

1.Python Statement[ 1 hrs]-Done

2.Python While Loops[1 hrs]-This need to be done correctly-Done

3.Python Functions[2 hrs]-This will take about 2 hrs- Done-Big Topic to Covered

4.Python File Handling and OS Module,Python List Comprehension-This Three Topics should be done today

[23rd October]

1.Python OOP[3 hrs]

2.Python Iterators,Decorators,Generators[1 hrs]

3.Python Lambda Functions[ 1 hrs]

4.Python map and Filter Function[1 hrs]

5.Python Exception Handling[1 hrs]

[24th October]

1.Python DateTime Module[ 2 hrs]

2.Python AdvancedOperationOfDict[ 2 hrs]

3.Python Regular Expression[2 hrs]

[25th October]

Python Exercise to boost Coding Techniques



# 2. The second way to close a file is to use the with statement:

The with statement automatically takes care of closing the file once it leaves the with block, even in cases of error. I highly recommend that you use the with statement as much as possible, as it allows for cleaner code and makes handling any unexpected errors easier for you.

Most likely, you’ll also want to use the second positional argument, mode. This argument is a string that contains multiple characters to represent how you want to open the file. The default and most common is 'r', which represents opening the file in read-only mode as a text file:

In [43]:
with open('Daily Activities.txt','r') as reader:
    # No explicit close() is needed: the with block closes the file on exit
    print(reader.read())

Tomorrow Day:[22 October]
1.Python Statement[ 1 hrs]-Done
2.Python While Loops[1 hrs]-This need to be done correctly-Done
3.Python Functions[2 hrs]-This will take about 2 hrs- Done-Big Topic to Covered
4.Python File Handling and OS Module,Python List Comprehension-This Three Topics should be done today
[23rd October]
1.Python OOP[3 hrs]
2.Python Iterators,Decorators,Generators[1 hrs]
3.Python Lambda Functions[ 1 hrs]
4.Python map and Filter Function[1 hrs]
5.Python Exception Handling[1 hrs]
[24th October]
1.Python DateTime Module[ 2 hrs]
2.Python AdvancedOperationOfDict[ 2 hrs]
3.Python Regular Expression[2 hrs]
[25th October]
Python Exercise to boost Coding Techniques



Character           Meaning

'r'                 Open for reading (default)

'w'                 Open for writing, truncating (overwriting) the file first

'rb' or 'wb'	    Open in binary mode (read/write using byte data)
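The modes in the table can be tried out with a small sketch. It works in a temporary folder with a hypothetical file name so that no existing file is overwritten.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'modes_demo.txt')  # hypothetical file name

# 'w' creates the file, or truncates it if it already exists
with open(demo_path, 'w') as writer:
    writer.write('first version\n')
with open(demo_path, 'w') as writer:
    writer.write('second version\n')

# 'r' opens the file read-only as text
with open(demo_path, 'r') as reader:
    text_content = reader.read()

# 'rb' returns the same data as a bytes object
with open(demo_path, 'rb') as reader:
    byte_content = reader.read()

# The with statement has already closed the file at this point
file_is_closed = reader.closed

print(repr(text_content))
print(byte_content)
print(file_is_closed)
```

Only the second write survives, because the second open() in 'w' mode truncated the file first.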


Let’s go back and talk a little about file objects. A file object is:

“an object exposing a file-oriented API (with methods such as read() or write()) to an underlying resource.” (Python glossary)

There are three different categories of file objects:

Text files

Buffered binary files

Raw binary files

Each of these file types is defined in the io module. Here’s a quick rundown of how everything lines up.

# 1.Text File Types

A text file is the most common file that you’ll encounter. Here are some examples of how these files are opened:

In [44]:
read_file=open('Daily Activities.txt','r')
print(type(read_file))

<class '_io.TextIOWrapper'>


With these types of files, open() will return a TextIOWrapper file object. This is the default file object returned by open().

In [45]:
read_file.close()

# 2.Buffered binary files

A buffered binary file type is used for reading and writing binary files. Here are some examples of how these files are opened:

In [46]:
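binary_file=open('Snip.txt','rb')
print(type(binary_file))
binary_file.close() # close the reader before reopening the same file
binary_file=open('Snip.txt','wb') # caution: 'wb' truncates the existing file
print(type(binary_file))
binary_file.close()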

<class '_io.BufferedReader'>
<class '_io.BufferedWriter'>


With these types of files, open() will return either a BufferedReader or BufferedWriter file object.

# 3.Raw binary files

A raw file type is:

“generally used as a low-level building-block for binary and text streams.”

In [47]:
raw_file=open('5 and 7.png','rb',buffering=0)
print(type(raw_file))

<class '_io.FileIO'>


# Reading and Writing Opened Files

Once you’ve opened up a file, you’ll want to read or write to the file. First off, let’s cover reading a file. There are multiple methods that can be called on a file object to help you out:

Methods for Reading the File:-

.read(size=-1)-This reads at most size characters from the file (size bytes in binary mode). If no argument is passed or None or -1 is passed, then the entire file is read.

.readline(size=-1)-This reads at most size characters from the current line. If the line is longer than size, the next call continues from where the previous one stopped. If no argument is passed or None or -1 is passed, then the entire line (or rest of the line) is read.

.readlines()-This reads the remaining lines from the file object and returns them as a list.
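The three methods share one file position, which the following sketch tries to make visible. It writes a small temporary file first; the file name and its contents are made up for the illustration.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
lines_path = os.path.join(tmp_dir, 'lines.txt')  # hypothetical file name
with open(lines_path, 'w') as writer:
    writer.write('alpha\nbeta\ngamma\n')

with open(lines_path, 'r') as reader:
    first_three = reader.read(3)        # first 3 characters of the file
    rest_of_line = reader.readline()    # rest of the current line
    partial_line = reader.readline(2)   # at most 2 characters of the next line
    remaining = reader.readlines()      # everything that is left, as a list
    at_eof = reader.read()              # empty string once the end is reached

print(repr(first_three), repr(rest_of_line), repr(partial_line))
print(remaining)
print(repr(at_eof))
```

Each call picks up where the previous one stopped, so .readlines() here returns the tail of the second line followed by the third line.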

In [None]:
raw_file.close()

In [48]:
with open('Snippets.txt','r') as reader:
    print(reader.read())

data.describe()
data.info()
data.isnull().sum()
sns.pairplot(data)
rows=2
cols=7

fig.ax=plt.subplots(nrows=rows,ncols=cols,figsize=(16,4))
plt.tight_layout()
cols=data.columns
index=0

for i in range(rows):
   for j in range(columns):
	sns.displot(data[col[index],ax[i][j])
   index=index+1






Here’s an example of how to read at most 10 characters of a line at a time using the Python .readline() method:

In [49]:
with open('Snippets.txt','r') as reader:
    print(reader.readline(10))
    print(reader.readline(10))
    print(reader.readline(10))
    print(reader.readline(10))

data.descr
ibe()

data.info(
)



Here’s an example of how to read the entire file as a list using the Python .readlines() method:

In [50]:
with open('Snippets.txt','r') as reader:
    print(reader.readlines())

['data.describe()\n', 'data.info()\n', 'data.isnull().sum()\n', 'sns.pairplot(data)\n', 'rows=2\n', 'cols=7\n', '\n', 'fig.ax=plt.subplots(nrows=rows,ncols=cols,figsize=(16,4))\n', 'plt.tight_layout()\n', 'cols=data.columns\n', 'index=0\n', '\n', 'for i in range(rows):\n', '   for j in range(columns):\n', '\tsns.displot(data[col[index],ax[i][j])\n', '   index=index+1\n', '\n', '\n', '\n']


The above example can also be done by using list() to create a list out of the file object:

In [51]:
f=open('Snippets.txt','r')
print(list(f),end="")
f.close()

['data.describe()\n', 'data.info()\n', 'data.isnull().sum()\n', 'sns.pairplot(data)\n', 'rows=2\n', 'cols=7\n', '\n', 'fig.ax=plt.subplots(nrows=rows,ncols=cols,figsize=(16,4))\n', 'plt.tight_layout()\n', 'cols=data.columns\n', 'index=0\n', '\n', 'for i in range(rows):\n', '   for j in range(columns):\n', '\tsns.displot(data[col[index],ax[i][j])\n', '   index=index+1\n', '\n', '\n', '\n']

# Iterating Over Each Line in the File

A common thing to do while reading a file is to iterate over each line. Here’s an example of how to use the Python .readline() method to perform that iteration:

In [54]:
# File Handling using While Loop
with open('Snip.txt','r') as reader:
    line=reader.readline()
    while line!='':
        print(line)
        line=reader.readline() # read the next line; an empty string means end of file

What are the things to be improving in coding:-



1) Understanding the Data

2) Exploring the Data

3) Exploring Data Analysis

4)Which model to Use: Using Visualization and Other Techniques

5)Improving the Thinking Skills



LifeCycle Of Data Science Project:

1)Data Collection

2)Feature Engineering:Handle Missing Values



#Why are there Missing Values?





#What are types of Missing Values:-

1)Mean/Median Imputation

2)Random Sample Imputation

3)Capture NAN values with a new feature

4)End of Distribution imputation

5)Abrirtary Imputation



How to Handle Categorical Values:-

1)Frequent Category Imputation

2)Adding a variable for NAN Values

-Suppose If you we have more frequent categories,we just replace NAN with a new category





Handling Categorical Features:-

1)One Hot Encoding





Underfitting and Overfitting the Data



1)Underfitting Data- Training Accuracy will be lower and Test Accuracy will be also lower

 High Bias and High Variance



2)Overfitting the Data-

Another way you could iterate over each line in the file is to use the Python .readlines() method of the file object. Remember, .readlines() returns a list where each element in the list represents a line in the file:

In [55]:
# File Handling using For Loop
with open('Snip.txt','r') as reader:
    for line in reader.readlines():
        print(line)

What are the things to be improving in coding:-



1) Understanding the Data

2) Exploring the Data

3) Exploring Data Analysis

4)Which model to Use: Using Visualization and Other Techniques

5)Improving the Thinking Skills



LifeCycle Of Data Science Project:

1)Data Collection

2)Feature Engineering:Handle Missing Values



#Why are there Missing Values?





#What are types of Missing Values:-

1)Mean/Median Imputation

2)Random Sample Imputation

3)Capture NAN values with a new feature

4)End of Distribution imputation

5)Abrirtary Imputation



How to Handle Categorical Values:-

1)Frequent Category Imputation

2)Adding a variable for NAN Values

-Suppose If you we have more frequent categories,we just replace NAN with a new category





Handling Categorical Features:-

1)One Hot Encoding





Underfitting and Overfitting the Data



1)Underfitting Data- Training Accuracy will be lower and Test Accuracy will be also lower

 High Bias and High Variance



2)Overfitting the Data-

However, the above examples can be further simplified by iterating over the file object itself:

In [56]:
with open('Snip.txt','r') as reader:
    for line in reader:
        print(line)

What are the things to be improving in coding:-



1) Understanding the Data

2) Exploring the Data

3) Exploring Data Analysis

4)Which model to Use: Using Visualization and Other Techniques

5)Improving the Thinking Skills



LifeCycle Of Data Science Project:

1)Data Collection

2)Feature Engineering:Handle Missing Values



#Why are there Missing Values?





#What are types of Missing Values:-

1)Mean/Median Imputation

2)Random Sample Imputation

3)Capture NAN values with a new feature

4)End of Distribution imputation

5)Abrirtary Imputation



How to Handle Categorical Values:-

1)Frequent Category Imputation

2)Adding a variable for NAN Values

-Suppose If you we have more frequent categories,we just replace NAN with a new category





Handling Categorical Features:-

1)One Hot Encoding





Underfitting and Overfitting the Data



1)Underfitting Data- Training Accuracy will be lower and Test Accuracy will be also lower

 High Bias and High Variance



2)Overfitting the Data-
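The blank lines in the outputs above appear because every line read from a file keeps its trailing newline and print() adds a second one. A minimal sketch of two ways around that, with io.StringIO standing in for an opened text file and made-up sample text:

```python
import io

# io.StringIO stands in for an opened text file so the sketch is self-contained
reader = io.StringIO('first line\nsecond line\n')

raw_lines = []
clean_lines = []
for line in reader:
    raw_lines.append(line)                 # keeps the trailing '\n'
    clean_lines.append(line.rstrip('\n'))  # trailing newline removed

print(raw_lines)
for line in clean_lines:
    print(line)  # alternatively, print(raw_line, end='') on the unstripped line
```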

# Writing Files In Python

Now let’s dive into writing files. As with reading files, file objects have multiple methods that are useful for writing to a file:



Method:-

.write(string)-This writes the string to the file.

.writelines(seq)-This writes the sequence to the file. No line endings are appended to each sequence item. It’s up to you to add the appropriate line ending(s).


Here’s a quick example of using .write() and .writelines():

In [62]:
# Write Files In Python
with open('Andrew NG Notes.txt','r') as reader:
    read_file=reader.readlines()
    with open('Notes.txt','w') as writer:
        writer.writelines(read_file)

In [63]:
#Write File in Reversed Order In Python
with open('Andrew NG Notes.txt','r') as reader:
    read_file=reader.readlines()
    with open('Notes_reversed.txt','w') as writer:
        writer.writelines(reversed(read_file))
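In the cells above the lines already end with a newline because they came from .readlines(). When the items come from somewhere else, the line endings have to be added by hand, as in this sketch; the file name and list contents are invented for the illustration.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
out_path = os.path.join(tmp_dir, 'breeds.txt')  # hypothetical file name
breeds = ['Pug', 'Beagle', 'Boxer']             # illustrative data

with open(out_path, 'w') as writer:
    writer.write('Dog breeds\n')                         # a single string
    writer.writelines(breed + '\n' for breed in breeds)  # any iterable of strings

with open(out_path, 'r') as reader:
    written = reader.read()

print(written)
```

Without the added '\n', all three names would end up on one line.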

# Working With Bytes 

Sometimes, you may need to work with files using byte strings. This is done by adding the 'b' character to the mode argument. All of the same methods for the file object apply. However, each of the methods expects and returns a bytes object instead:

In [65]:
with open('Notes.txt','rb') as reader:
    print(reader.readline())

b'Internet Company:\r\n'


You can also open a binary file such as a .png image in Python and examine its contents! Since the .png file format is well defined, the header of the file is 8 bytes broken up like this:

Value	        Interpretation

0x89	        A “magic” number to indicate that this is the start of a PNG

0x50 0x4E 0x47	PNG in ASCII

0x0D 0x0A	    A DOS style line ending \r\n

0x1A	        A DOS style EOF character

0x0A	        A Unix style line ending \n


In [90]:

with open('2019-12-14 (2).png','rb') as byte_reader:
    print(byte_reader.read(1)) # 0x89, the "magic" number that starts every PNG
    print(byte_reader.read(2)) # first two letters of 'PNG'
    print(byte_reader.read(3)) # 'G' followed by the DOS style line ending \r\n
    print(byte_reader.read(4)) # DOS EOF (0x1A), Unix line ending \n, then the first bytes after the header
    print(byte_reader.read(5)) # the remaining reads continue into the IHDR chunk
    print(byte_reader.read(6))
    print(byte_reader.read(7))

b'\x89'
b'PN'
b'G\r\n'
b'\x1a\n\x00\x00'
b'\x00\rIHD'
b'R\x00\x00\x05V\x00'
b'\x00\x03\x00\x08\x06\x00\x00'
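Because the reads above grow by one byte each time, they do not line up with the header fields in the table. The sketch below slices the first 8 bytes along the field boundaries instead. The function name and the sample bytes are hypothetical; only the 8-byte signature itself comes from the table.

```python
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'  # the 8 header bytes listed in the table

def split_png_header(data):
    """Return the header fields of a PNG byte string, or None if it is not a PNG."""
    header = data[:8]
    if header != PNG_SIGNATURE:
        return None
    return {
        'magic': header[0:1],
        'name': header[1:4],
        'dos_line_ending': header[4:6],
        'dos_eof': header[6:7],
        'unix_line_ending': header[7:8],
    }

# Illustrative bytes: a valid signature followed by the start of an IHDR chunk
fields = split_png_header(PNG_SIGNATURE + b'\x00\x00\x00\rIHDR')
print(fields)

# Anything that does not start with the signature is rejected
not_png = split_png_header(b'GIF89a')
print(not_png)
```

With a real image, the data argument could be the result of reading the first 8 bytes of a file opened in 'rb' mode.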


# __file__


The __file__ attribute is a special attribute of modules, similar to __name__. It is:

“the pathname of the file from which the module was loaded, if it was loaded from a file.”


Depending on how the script was started, __file__ may hold a path relative to the current working directory rather than a full system path. os.getcwd() returns the current working directory of your executing code, which is not necessarily the folder that contains the module. To get the full path of a module, pass its __file__ to os.path.abspath().

In [115]:
import os

print(os.getcwd()) # the current working directory
print(os.__file__) # the file from which the os module was loaded
print(os.__name__) # the name of the module

E:\Python Basics
c:\users\asif\appdata\local\programs\python\python38\lib\os.py
os
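The output above shows that the working directory and the module location are unrelated. A small sketch of how the two can be obtained as full paths, using the os module itself as the example module:

```python
import os

# os.__file__ is the location the os module was loaded from
module_path = os.path.abspath(os.__file__)
module_dir = os.path.dirname(module_path)

# The current working directory is independent of where the module lives
working_dir = os.getcwd()

print(module_path)
print(module_dir)
print(working_dir)
```

The same two calls work with __file__ inside your own script, which is a common way to locate data files that sit next to the script.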


# Directory Tree Structure in Python

1.First Approach

In [107]:
import os

def list_files(startpath):
    for root, dirs, files in os.walk(startpath):
        level = root.replace(startpath, '').count(os.sep)
        indent = ' ' * 4 * (level)
        print('{}{}/'.format(indent, os.path.basename(root)))
        subindent = ' ' * 4 * (level + 1)
        for f in files:
            print('{}{}'.format(subindent, f))

In [108]:

list_files("E:\\Python Basics")

Python Basics/
    2019-12-14 (2).png
    5 and 7.png
    Andrew NG Notes.txt
    Daily Activities.txt
    debug.log
    Notes.txt
    Notes_reversed.txt
    Python Basics.ipynb
    Reading and  Writing File in Python.ipynb
    Session 1.1 - Python-Installation and Basics.ipynb
    Snip.txt
    Snippets.txt
    .ipynb_checkpoints/
        Python Basics-checkpoint
        Python Basics-checkpoint.ipynb
        Reading and  Writing File in Python-checkpoint.ipynb
        Session 1.1 - Python-Installation and Basics-checkpoint.ipynb


2.Second Approach

In [112]:
from pathlib import Path

class DisplayablePath(object):
    display_filename_prefix_middle = '├──'
    display_filename_prefix_last = '└──'
    display_parent_prefix_middle = '    '
    display_parent_prefix_last = '│   '

    def __init__(self, path, parent_path, is_last):
        self.path = Path(str(path))
        self.parent = parent_path
        self.is_last = is_last
        if self.parent:
            self.depth = self.parent.depth + 1
        else:
            self.depth = 0

    @property
    def displayname(self):
        if self.path.is_dir():
            return self.path.name + '/'
        return self.path.name

    @classmethod
    def make_tree(cls, root, parent=None, is_last=False, criteria=None):
        root = Path(str(root))
        criteria = criteria or cls._default_criteria

        displayable_root = cls(root, parent, is_last)
        yield displayable_root

        children = sorted(list(path
                               for path in root.iterdir()
                               if criteria(path)),
                          key=lambda s: str(s).lower())
        count = 1
        for path in children:
            is_last = count == len(children)
            if path.is_dir():
                yield from cls.make_tree(path,
                                         parent=displayable_root,
                                         is_last=is_last,
                                         criteria=criteria)
            else:
                yield cls(path, displayable_root, is_last)
            count += 1

    @classmethod
    def _default_criteria(cls, path):
        return True

    def displayable(self):
        if self.parent is None:
            return self.displayname

        _filename_prefix = (self.display_filename_prefix_last
                            if self.is_last
                            else self.display_filename_prefix_middle)

        parts = ['{!s} {!s}'.format(_filename_prefix,
                                    self.displayname)]

        parent = self.parent
        while parent and parent.parent is not None:
            parts.append(self.display_parent_prefix_middle
                         if parent.is_last
                         else self.display_parent_prefix_last)
            parent = parent.parent

        return ''.join(reversed(parts))

In [114]:
paths = DisplayablePath.make_tree(Path('E:\\Python Basics'))
for path in paths:
    print(path.displayable())

Python Basics/
├── .ipynb_checkpoints/
│   ├── Python Basics-checkpoint
│   ├── Python Basics-checkpoint.ipynb
│   ├── Reading and  Writing File in Python-checkpoint.ipynb
│   └── Session 1.1 - Python-Installation and Basics-checkpoint.ipynb
├── 2019-12-14 (2).png
├── 5 and 7.png
├── Andrew NG Notes.txt
├── Daily Activities.txt
├── debug.log
├── Notes.txt
├── Notes_reversed.txt
├── Python Basics.ipynb
├── Reading and  Writing File in Python.ipynb
├── Session 1.1 - Python-Installation and Basics.ipynb
├── Snip.txt
└── Snippets.txt


In [119]:
for path,dirs,files in os.walk('E:\\Python Basics'):
    print(path)
    for f in files:
        print(f)

E:\Python Basics
2019-12-14 (2).png
5 and 7.png
Andrew NG Notes.txt
Daily Activities.txt
debug.log
Notes.txt
Notes_reversed.txt
Python Basics.ipynb
Reading and  Writing File in Python.ipynb
Session 1.1 - Python-Installation and Basics.ipynb
Snip.txt
Snippets.txt
E:\Python Basics\.ipynb_checkpoints
Python Basics-checkpoint
Python Basics-checkpoint.ipynb
Reading and  Writing File in Python-checkpoint.ipynb
Session 1.1 - Python-Installation and Basics-checkpoint.ipynb


# Appending to A File

Sometimes, you may want to append to a file or start writing at the end of an already populated file. This is easily done by using the 'a' character for the mode argument:

In [121]:
with open('Andrew NG Notes.txt','a') as a_writer:
     a_writer.write('\n Talent Acquistion 2')

When you examine Andrew NG Notes.txt again, you’ll see that the beginning of the file is unchanged and the new line, Talent Acquistion 2, is now added to the end of the file:

In [125]:
with open('Andrew NG Notes.txt','r') as reader:
    print(reader.read())

Internet Company:
A/B Testing
Shipping faster
Decision making by the enginner

great AI Company

Strategic data Acquistions
Unified Data Warehouses
 Talent Acquistion
 Talent Acquistion 2


# Working With Two Files at the Same Time

There are times when you may want to read a file and write to another file at the same time. If you use the example that was shown when you were learning how to write to a file, it can actually be combined into the following:

In [129]:
first_file='Andrew NG Notes.txt'
second_file='Andrew NG Notes_reversed.txt'
with open(first_file,'r') as reader,open(second_file,'w') as writer:
    note_file=reader.readlines()
    writer.writelines(reversed(note_file))

In [136]:
first_source='Snip.txt'
second_source='Snip_r.txt'
with open(first_source,'r') as reader,open(second_source,'w') as writer:
    snip_file=reader.readlines()
    writer.writelines(reversed(snip_file))

# READING AND WRITING CSV FILES IN PYTHON

What Is a CSV File?

A CSV file (Comma Separated Values file) is a type of plain text file that uses specific structuring to arrange tabular data. Because it’s a plain text file, it can contain only actual text data—in other words, printable ASCII or Unicode characters.

The structure of a CSV file is given away by its name. Normally, CSV files use a comma to separate each specific data value.

#Examples of CSV Data

column 1 name,column 2 name, column 3 name

first row data 1,first row data 2,first row data 3

second row data 1,second row data 2,second row data 3

...


Notice how each piece of data is separated by a comma. Normally, the first line identifies each piece of data—in other words, the name of a data column. Every subsequent line after that is actual data and is limited only by file size constraints.

In general, the separator character is called a delimiter, and the comma is not the only one used. Other popular delimiters include the tab (\t), colon (:) and semi-colon (;) characters. Properly parsing a CSV file requires us to know which delimiter is being used.

Where Do CSV Files Come From?

CSV files are normally created by programs that handle large amounts of data. They are a convenient way to export data from spreadsheets and databases as well as import or use it in other programs. For example, you might export the results of a data mining program to a CSV file and then import that into a spreadsheet to analyze the data, generate graphs for a presentation, or prepare a report for publication.

CSV files are very easy to work with programmatically. Any language that supports text file input and string manipulation (like Python) can work with CSV files directly.

# Parsing CSV Files With Python’s Built-in CSV Library

The csv library provides functionality to both read from and write to CSV files. Designed to work out of the box with Excel-generated CSV files, it is easily adapted to work with a variety of CSV formats. The csv library contains objects and other code to read, write, and process data from and to CSV files.



Reading CSV Files With csv

Reading from a CSV file is done using the reader object. The CSV file is opened as a text file with Python’s built-in open() function, which returns a file object. This is then passed to the reader, which does the heavy lifting.

In [140]:
import csv

with open('Employee.txt','r') as csv_file:
    csv_reader=csv.reader(csv_file,delimiter=",")
    line_count=0
    for row in csv_reader:
        if line_count==0:
            # The first row holds the column names
            print(f'Column names are {",".join(row)}')
            line_count+=1
        else:
            print(f'\t{row[0]} works in the {row[1]} department, and was born in {row[2]}.')
            line_count += 1
    # Report the total once, after the loop has finished
    print(f'Processed {line_count} lines.')

Column names are name,department,birthday month
	John Smith works in the Accounting department, and was born in November.
	Erica Meyers works in the IT department, and was born in March.
	Adil Hussain works in the IT department, and was born in April.
Processed 4 lines.


Each row returned by the reader is a list of string elements containing the data found by removing the delimiters. The first row returned contains the column names, so the code handles it separately.

Reading CSV Files Into a Dictionary With csv

Rather than deal with a list of individual string elements, you can read CSV data directly into a dictionary as well (an OrderedDict in Python 3.6 and 3.7, a regular dict since Python 3.8).

In [146]:
import csv

with open('Employee.txt','r') as csv_file:
    csv_reader=csv.DictReader(csv_file,delimiter=",")
    line_count=0
    for row in csv_reader:
        if line_count==0:
            # Joining a dictionary joins its keys, i.e. the column names
            print(f'Column names are {",".join(row)}')
            line_count+=1
        print(f'\t{row["name"]} works in the {row["department"]} department, and was born in {row["birthday month"]}.')
        line_count += 1
    print(f'Processed {line_count} lines.')

Column names are name,department,birthday month
	John Smith works in the Accounting department, and was born in November.
	Erica Meyers works in the IT department, and was born in March.
	Adil Hussain works in the IT department, and was born in April.
Processed 4 lines.


Where did the dictionary keys come from? The first line of the CSV file is assumed to contain the keys to use to build the dictionary. If you don’t have these in your CSV file, you should specify your own keys by setting the fieldnames optional parameter to a list containing them.
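A minimal sketch of the fieldnames parameter, assuming a CSV source without a header row. io.StringIO stands in for an opened file, and the rows mirror the employee data used above.

```python
import csv
import io

# Illustrative data with no header row
headerless = 'John Smith,Accounting,November\nErica Meyers,IT,March\n'

reader = csv.DictReader(
    io.StringIO(headerless),
    fieldnames=['name', 'department', 'birthday month'],
)
records = [dict(row) for row in reader]
print(records[0]['name'], '-', records[0]['department'])
```

Because the keys are supplied explicitly, the first line of the data is treated as a record rather than as a header.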

In [188]:
# Normally csv file reading

with open('acme.csv','r')  as csv_file:
    csv_reader=csv.reader(csv_file,delimiter=",")
    for row in csv_reader:
        print(row)

['', 'month', 'market', 'acme']
['1', '1/86', '-0.061134', '0.03016']
['2', '2/86', '0.00822', '-0.165457']
['3', '3/86', '-0.007381', '0.080137']
['4', '4/86', '-0.067561', '-0.109917']
['5', '5/86', '-0.006238', '-0.114853']
['6', '6/86', '-0.044251', '-0.099254']
['7', '7/86', '-0.11207', '-0.226846']
['8', '8/86', '0.030226', '0.073445']
['9', '9/86', '-0.129556', '-0.143064']
['10', '10/86', '0.001319', '0.034776']
['11', '11/86', '-0.033679', '-0.063375']
['12', '12/86', '-0.072795', '-0.058735']
['13', '1/87', '0.073396', '0.050214']
['14', '2/87', '-0.011618', '0.111165']
['15', '3/87', '-0.026852', '-0.127492']
['16', '4/87', '-0.040356', '0.054522']
['17', '5/87', '-0.047539', '-0.072918']
['18', '6/87', '-0.001732', '-0.058979']
['19', '7/87', '-0.008899', '0.236147']
['20', '8/87', '-0.020837', '-0.094778']
['21', '9/87', '-0.084811', '-0.135669']
['22', '10/87', '-0.262077', '-0.284796']
['23', '11/87', '-0.110167', '-0.171494']
['24', '12/87', '0.034955', '0.242616']
['25

In [183]:
with open('acme.csv','r') as csv_file:
    csv_reader=csv.DictReader(csv_file,delimiter=",")
    for row in csv_reader:
        # Each row is a dictionary keyed by the column names in the first line
        print(row["market"])

-0.061134
0.00822
-0.007381
-0.067561
-0.006238
-0.044251
-0.11207
0.030226
-0.129556
0.001319
-0.033679
-0.072795
0.073396
-0.011618
-0.026852
-0.040356
-0.047539
-0.001732
-0.008899
-0.020837
-0.084811
-0.262077
-0.110167
0.034955
0.012688
-0.00217
-0.073462
-0.043419
-0.05473
-0.011755
-0.061718
-0.10171
-0.032705
-0.045334
-0.079288
-0.036233
-0.011494
-0.093729
-0.065215
-0.037113
-0.044399
-0.084412
0.003444
-0.05676
-0.07897
-0.105367
-0.038634
-0.043261
-0.139773
-0.059094
-0.057736
-0.102524
0.023881
-0.079116
-0.078965
-0.161359
-0.119376
-0.076008
-0.006444
-0.026401


Optional Python CSV reader Parameters

The reader object can handle different styles of CSV files by specifying additional parameters, some of which are shown below:

1.delimiter specifies the character used to separate each field. The default is the comma (',').

2.quotechar specifies the character used to surround fields that contain the delimiter character. The default is the double quote (").

3.escapechar specifies the character used to escape the delimiter character, in case quotes aren’t used. The default is no escape character.
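The three parameters above can be sketched with two small in-memory samples. The sample data is made up for illustration; only csv.reader and its documented delimiter, quotechar, and escapechar parameters are used.

```python
import csv
import io

# Sample 1 (hypothetical data): semicolon-delimited, single quotes around fields.
quoted = "name;address\n'Smith; John';'12 Main St'\n"
quoted_rows = list(csv.reader(io.StringIO(quoted), delimiter=";", quotechar="'"))

# Sample 2 (hypothetical data): no quotes, a backslash escapes the embedded comma.
escaped = "name,address\nJohn Smith,12 Main St\\, Apt 4\n"
escaped_rows = list(csv.reader(io.StringIO(escaped), escapechar="\\"))

print(quoted_rows[1])
print(escaped_rows[1])
```

In both cases the delimiter inside the field is kept as part of the field value rather than splitting it in two.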

# Writing CSV Files With csv

You can also write to a CSV file using a writer object and its .writerow() method.

The quotechar optional parameter tells the writer which character to use to quote fields when writing. Whether quoting is used or not, however, is determined by the quoting optional parameter:

If quoting is set to csv.QUOTE_MINIMAL, then .writerow() will quote fields only if they contain the delimiter or the quotechar. This is the default case.

If quoting is set to csv.QUOTE_ALL, then .writerow() will quote all fields.

If quoting is set to csv.QUOTE_NONNUMERIC, then .writerow() will quote all fields containing text data and leave numeric fields unquoted. (When reading with this setting, the reader converts all unquoted fields to the float data type.)

If quoting is set to csv.QUOTE_NONE, then .writerow() will escape delimiters instead of quoting them. In this case, you also must provide a value for the escapechar optional parameter.
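To compare the four quoting modes side by side, here is a small sketch that writes the same row with each setting into an in-memory buffer. The row contents and the render() helper are hypothetical and exist only for this illustration.

```python
import csv
import io

# Hypothetical row: one plain field, one field containing the delimiter, one number.
row = ["John Smith", "Sales, EMEA", 42]

def render(**writer_options):
    """Write the sample row with the given writer options and return the text."""
    buffer = io.StringIO()
    csv.writer(buffer, **writer_options).writerow(row)
    return buffer.getvalue().rstrip("\r\n")

minimal = render(quoting=csv.QUOTE_MINIMAL)
quote_all = render(quoting=csv.QUOTE_ALL)
nonnumeric = render(quoting=csv.QUOTE_NONNUMERIC)
no_quotes = render(quoting=csv.QUOTE_NONE, escapechar="\\")

for label, text in [("MINIMAL", minimal), ("ALL", quote_all),
                    ("NONNUMERIC", nonnumeric), ("NONE", no_quotes)]:
    print(f"{label:<11}{text}")
```

Only the field containing a comma is quoted under QUOTE_MINIMAL, while QUOTE_NONE falls back to the escape character for that same comma.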


In [11]:
import csv

# newline='' prevents extra blank lines between rows on Windows
with open('employee_file.csv', 'w', newline='') as employee_file:
    employee_writer = csv.writer(employee_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    employee_writer.writerow(['John Smith', 'Accounting', 'November', 122])
    employee_writer.writerow(['Erica Meyers', 'IT', 'March', 1234])

In [13]:
import csv

# Fields that consist of the quotechar itself are quoted and doubled on output
with open('acme_file.csv', 'w', newline='') as employee_writing:
    employee_writer = csv.writer(employee_writing, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    employee_writer.writerow(['Asid Farooq', '"', 'Dalal Fatima', '"', 'Satima Dafri'])
    employee_writer.writerow(['As Farooq', '"', 'Dala Fama', '"', 'Sa Daf'])

# Writing CSV File From a Dictionary With csv

Since you can read your data into a dictionary, it’s only fair that you should be able to write it out from a dictionary as well:

In [14]:
# newline='' prevents extra blank lines between rows on Windows
with open('employee_file2.csv', 'w', newline='') as csv_file:
    fieldnames = ['emp_name', 'dept', 'birth_month']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

    writer.writeheader()
    writer.writerow({'emp_name': 'John Smith', 'dept': 'Accounting', 'birth_month': 'November'})
    writer.writerow({'emp_name': 'Erica Meyers', 'dept': 'IT', 'birth_month': 'March'})

In [30]:
with open('employee_file2.csv', 'r') as csv_file:
    for line in csv_file:
        # each line already ends with a newline, so suppress print's own
        print(line, end='')

emp_name,dept,birth_month
John Smith,Accounting,November
Erica Meyers,IT,March

Unlike with DictReader, the fieldnames parameter is required when writing a dictionary. This makes sense when you think about it: without a list of fieldnames, the DictWriter can’t know which keys to use to retrieve values from your dictionaries. It also uses the keys in fieldnames to write out the first row as column names when you call .writeheader().
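The role of fieldnames can be sketched by giving a DictWriter dictionaries that do not match it exactly. The snippet writes to an in-memory buffer, and the salary key is a made-up extra field used only to trigger the error.

```python
import csv
import io

fieldnames = ["emp_name", "dept", "birth_month"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()

# A missing key is filled with the default restval, an empty string.
writer.writerow({"emp_name": "John Smith", "dept": "Accounting"})

# A key that is not in fieldnames raises ValueError by default.
try:
    writer.writerow({"emp_name": "Erica Meyers", "salary": 1234})
    error = None
except ValueError as exc:
    error = str(exc)

output_lines = buffer.getvalue().splitlines()
print(output_lines)
print(error)
```

If extra keys should be dropped silently instead, DictWriter accepts extrasaction='ignore'.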

# Parsing CSV Files With the pandas Library

Of course, the Python CSV library isn’t the only game in town. Reading CSV files is possible in pandas as well. It is highly recommended if you have a lot of data to analyze.

pandas is an open-source Python library that provides high-performance data analysis tools and easy-to-use data structures. pandas is available for all Python installations, and it is a key part of the Anaconda distribution. It works extremely well in Jupyter notebooks to share data, code, analysis results, visualizations, and narrative text.

Reading CSV Files With pandas

To show some of the power of pandas CSV capabilities, I’ve created a slightly more complicated file to read, called acme.csv.


Reading the CSV into a pandas DataFrame is quick and straightforward:

In [45]:
import pandas as pd
df=pd.read_csv('acme.csv')
print(df)

    Unnamed: 0  month    market      acme
0            1   1/86 -0.061134  0.030160
1            2   2/86  0.008220 -0.165457
2            3   3/86 -0.007381  0.080137
3            4   4/86 -0.067561 -0.109917
4            5   5/86 -0.006238 -0.114853
5            6   6/86 -0.044251 -0.099254
6            7   7/86 -0.112070 -0.226846
7            8   8/86  0.030226  0.073445
8            9   9/86 -0.129556 -0.143064
9           10  10/86  0.001319  0.034776
10          11  11/86 -0.033679 -0.063375
11          12  12/86 -0.072795 -0.058735
12          13   1/87  0.073396  0.050214
13          14   2/87 -0.011618  0.111165
14          15   3/87 -0.026852 -0.127492
15          16   4/87 -0.040356  0.054522
16          17   5/87 -0.047539 -0.072918
17          18   6/87 -0.001732 -0.058979
18          19   7/87 -0.008899  0.236147
19          20   8/87 -0.020837 -0.094778
20          21   9/87 -0.084811 -0.135669
21          22  10/87 -0.262077 -0.284796
22          23  11/87 -0.110167 -0

That’s it: three lines of code, and only one of them is doing the actual work. pandas.read_csv() opens, analyzes, and reads the CSV file provided, and stores the data in a DataFrame.

Here are a few points worth noting:

First, pandas recognized that the first line of the CSV contained column names, and used them automatically. I call this Goodness.

However, pandas is also using zero-based integer indices in the DataFrame. That’s because we didn’t tell it what our index should be.
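One way to tell pandas which column to use as the index is the index_col parameter of read_csv(). The sketch below assumes pandas is installed and, instead of opening acme.csv, uses the first three data rows from the output above as an in-memory sample.

```python
import io

import pandas as pd

# First rows of acme.csv as shown above, held in memory for illustration.
sample = (
    ",month,market,acme\n"
    "1,1/86,-0.061134,0.03016\n"
    "2,2/86,0.00822,-0.165457\n"
    "3,3/86,-0.007381,0.080137\n"
)

# index_col=0 uses the first (unnamed) column as the index instead of 0, 1, 2, ...
df = pd.read_csv(io.StringIO(sample), index_col=0)
print(df)
```

With this setting the unnamed first column no longer appears as a regular data column.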

Further, if you look at the data types of our columns, you’ll see pandas has properly converted the market and acme columns to numbers, but the month column is still a string. This is easily confirmed in interactive mode: