### A. Text Files and Lines

Recall that a Python string can be thought of as a sequence of characters. In a similar way, a text file can be thought of as a sequence of lines.

For example, consider the following sample of a text file:

    From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
    Return-Path: <postmaster@collab.sakaiproject.org>
    Date: Sat, 5 Jan 2008 09:12:18 -0500
    To: source@collab.sakaiproject.org
    From: stephen.marquard@uct.ac.za
    Subject: [sakai] svn commit: r39772 - content/branches/
    Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772

This file is in mbox format, a standard format for a file containing multiple mail messages. Lines that start with “From ” separate the messages, and lines that start with “From:” are part of the messages themselves. For more information about the mbox format, see en.wikipedia.org/wiki/Mbox.

To break the file into lines, there is a special character that represents the “end of the line” called the newline character.

### B. Newline

In Python, the newline character is written as `\n`.

*(Even though this looks like two characters, it is actually a single character.)*

In [1]:
mystr = "A\nB"
print(mystr)

A
B


In [2]:
len(mystr)

3

In [3]:
mystr

'A\nB'

In [7]:
help(open)

Help on built-in function open in module io:

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
    Open file and return a stream.  Raise IOError upon failure.
    
    file is either a text or byte string giving the name (and the path
    if the file isn't in the current working directory) of the file to
    be opened or an integer file descriptor of the file to be
    wrapped. (If a file descriptor is given, it is closed when the
    returned I/O object is closed, unless closefd is set to False.)
    
    mode is an optional string that specifies the mode in which the file
    is opened. It defaults to 'r' which means open for reading in text
    mode.  Other common values are 'w' for writing (truncating the file if
    it already exists), 'x' for creating and writing to a new file, and
    'a' for appending (which on some Unix systems, means that all writes
    append to the end of the file regardless of the current seek position

In [9]:
f = open("myfile0.txt", "w")

In [10]:
f.write('This is 1st line\n')

17

In [11]:
f.write("2nd line\n")

9

In [12]:
f.write('3rd')

3

In [13]:
f.close()

In [30]:
fr = open("myfile0.txt", 'r')

In [27]:
for i in fr:
    print(i, end='')

This is 1st line
2nd line
3rd

In [28]:
fr.read()

''

In [32]:
fr.close()
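
The open/close pairs above can also be written with a `with` statement, which closes the file automatically when the block ends, even if an error occurs inside it. A minimal sketch, assuming a scratch file called `demo_with.txt` in a temporary directory (both are illustrative names, not part of the notebook):

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory, so nothing is overwritten.
path = os.path.join(tempfile.mkdtemp(), "demo_with.txt")

# Write three lines; the file is closed automatically when the block ends.
with open(path, "w") as f:
    f.write("This is 1st line\n")
    f.write("2nd line\n")
    f.write("3rd")

# Read the lines back; again no explicit close() is needed.
with open(path) as fr:
    lines = [line.rstrip("\n") for line in fr]

print(lines)
print(fr.closed)  # True: the handle was closed on leaving the with block
```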

In [1]:
type(fr)

NameError: name 'fr' is not defined

**Note:** 
*So when we look at the lines in a file, we need to imagine that there is a special invisible character called the newline at the end of each line that marks the end of the line.*
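
One way to make the invisible newline visible is to print the `repr()` of each line. The sketch below uses `io.StringIO` as a stand-in for an open text file (an assumption made so the example runs without any file on disk):

```python
import io

# io.StringIO stands in for an open text file here (assumption for the demo).
sample = io.StringIO("This is 1st line\n2nd line\n3rd")

shown = []
for line in sample:
    shown.append(repr(line))  # repr() displays the newline as \n
    print(repr(line))
```

Every line except possibly the last one ends with `\n`, which is why `print(line)` produces a blank line after each line unless the newline is stripped or `end=''` is used.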

### C. Reading Files

In [2]:
import os

In [3]:
pwd

'C:\\Users\\PARTHI vs BHARATHI\\Downloads\\PRAXIS\\A Term 1\\Python'

In [4]:
#os.chdir('C:\\Users\\PARTHI vs BHARATHI\\Downloads')

In [9]:
#File handle
fhand = open("mbox-short.txt")

In [10]:
fhand

<_io.TextIOWrapper name='mbox-short.txt' mode='r' encoding='cp1252'>

**Note:**
*The file handle does not contain the file's data; it is an object through which the data can be read.*
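
Because the handle only tracks a position in the file, each read moves that position forward, and a handle that has been read to the end yields nothing more until it is rewound or reopened. This is why a loop that stops after ten lines leaves the handle positioned at line eleven. A small sketch, again assuming `io.StringIO` as a stand-in for `open(...)`:

```python
import io

# io.StringIO behaves like an open text file; it stands in for fhand here.
fhand = io.StringIO("line 1\nline 2\nline 3\n")

first = fhand.readline()    # reads one line and advances the position
rest = fhand.readlines()    # reads whatever is left
again = fhand.readlines()   # nothing left: the handle is exhausted

fhand.seek(0)               # rewind to the start of the data
restart = fhand.readline()

print(repr(first), rest, again, repr(restart))
```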

**1. Reading the data using a loop**

We can easily construct a for loop to read through and count each of the lines in a file:

In [8]:
#Reading all lines:
for i in fhand:
    print(i)
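
The counting part of that loop can be sketched as follows. The four-line sample text and the use of `io.StringIO` are assumptions for illustration; with the real file, `fhand` would come from `open("mbox-short.txt")`:

```python
import io

# Stand-in for an open file; in the notebook this would be open("mbox-short.txt").
fhand = io.StringIO("From: a@example.com\nSubject: hi\n\nBody text\n")

count = 0
for line in fhand:   # the for loop reads one line at a time
    count += 1

print("Line Count:", count)
```

Only one line is held in memory at a time, so this pattern works for files of any size.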

In [42]:
#print only the first 10 lines of the file:
count = 0
for line in fhand:
    print(line)
    count+=1
    if count == 10:
        break

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008

Return-Path: <postmaster@collab.sakaiproject.org>

Received: from murder (mail.umich.edu [141.211.14.90])

	 by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;

	 Sat, 05 Jan 2008 09:14:16 -0500

X-Sieve: CMU Sieve 2.3

Received: from murder ([unix socket])

	 by mail.umich.edu (Cyrus v2.2.12) with LMTPA;

	 Sat, 05 Jan 2008 09:14:16 -0500

Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])



In [14]:
fhand.readlines()

['From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n',
 'Return-Path: <postmaster@collab.sakaiproject.org>\n',
 'Received: from murder (mail.umich.edu [141.211.14.90])\n',
 '\t by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;\n',
 '\t Sat, 05 Jan 2008 09:14:16 -0500\n',
 'X-Sieve: CMU Sieve 2.3\n',
 'Received: from murder ([unix socket])\n',
 '\t by mail.umich.edu (Cyrus v2.2.12) with LMTPA;\n',
 '\t Sat, 05 Jan 2008 09:14:16 -0500\n',
 'Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])\n',
 '\tby flawless.mail.umich.edu () with ESMTP id m05EEFR1013674;\n',
 '\tSat, 5 Jan 2008 09:14:15 -0500\n',
 'Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk [194.35.219.184])\n',
 '\tBY holes.mr.itd.umich.edu ID 477F90B0.2DB2F.12494 ; \n',
 '\t 5 Jan 2008 09:14:10 -0500\n',
 'Received: from paploo.uhi.ac.uk (localhost [127.0.0.1])\n',
 '\tby paploo.uhi.ac.uk (Postfix) with ESMTP id 5F919BC2F2;\n',
 '\tSat,  5 Jan 2008 14:10:05 +0000 (GMT)\n',

In [24]:
#print only the next 10 lines, without the extra blank lines:
count = 0
for line in fhand:
    print(line, end='')
    count+=1
    if count == 10:
        break

	by flawless.mail.umich.edu () with ESMTP id m05EEFR1013674;
	Sat, 5 Jan 2008 09:14:15 -0500
Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk [194.35.219.184])
	BY holes.mr.itd.umich.edu ID 477F90B0.2DB2F.12494 ; 
	 5 Jan 2008 09:14:10 -0500
Received: from paploo.uhi.ac.uk (localhost [127.0.0.1])
	by paploo.uhi.ac.uk (Postfix) with ESMTP id 5F919BC2F2;
	Sat,  5 Jan 2008 14:10:05 +0000 (GMT)
Message-ID: <200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>
Mime-Version: 1.0


In [45]:
#print only the next 10 lines, stripping the trailing whitespace:
count = 0
for line in fhand:
    line = line.rstrip()
    print(line)
    count+=1
    if count == 10:
        break

	for <source@collab.sakaiproject.org>; Sat, 5 Jan 2008 09:12:19 -0500
Received: (from apache@localhost)
	by nakamura.uits.iupui.edu (8.12.11.20060308/8.12.11/Submit) id m05ECIaH010327
	for source@collab.sakaiproject.org; Sat, 5 Jan 2008 09:12:18 -0500
Date: Sat, 5 Jan 2008 09:12:18 -0500
To: source@collab.sakaiproject.org
From: stephen.marquard@uct.ac.za
Subject: [sakai] svn commit: r39772 - content/branches/sakai_2-5-x/content-impl/impl/src/java/org/sakaiproject/content/impl
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8


**2. Reading data using the read method for files**

In [15]:
#Reading data using read() method:
fhand = open("mbox-short.txt")

data = fhand.read()

In [16]:
#length of the data
len(data)

94626

In [17]:
#Let's see how the data is read
data[:250]

'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\nReturn-Path: <postmaster@collab.sakaiproject.org>\nReceived: from murder (mail.umich.edu [141.211.14.90])\n\t by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;\n\t Sat, 05 Jan 2008 09:14:16 '

In [18]:
print(data[:250])

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
Return-Path: <postmaster@collab.sakaiproject.org>
Received: from murder (mail.umich.edu [141.211.14.90])
	 by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
	 Sat, 05 Jan 2008 09:14:16 


**Disadvantage:**

Remember that `read()` loads the entire contents of the file into a single string, so this approach should only be used if the file data will fit comfortably in the main memory of your computer. If the file is too large to fit in main memory, you should write your program to read the file in chunks using a for or while loop.
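
Reading in chunks might look like the sketch below. The helper name `count_chars_in_chunks` and the chunk size of 1024 characters are illustrative choices, and `io.StringIO` stands in for a large file:

```python
import io

def count_chars_in_chunks(fhand, chunk_size=1024):
    """Return the total number of characters, reading chunk_size at a time."""
    total = 0
    while True:
        chunk = fhand.read(chunk_size)
        if chunk == "":   # read() returns an empty string at end of file
            break
        total += len(chunk)
    return total

fhand = io.StringIO("x" * 2500)   # stand-in for a large file
total = count_chars_in_chunks(fhand, chunk_size=1024)
print(total)
```

At no point does the program hold more than `chunk_size` characters of the file in memory.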


### D. Letting the user choose the file name

In [52]:
filename = input("Enter the file name: ")
file_hand = open(filename)
count = 0
for i in file_hand:
    print(i.strip())
    count+=1
    if count == 10:
        break

Enter the file name: mbox-short.txt
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
Return-Path: <postmaster@collab.sakaiproject.org>
Received: from murder (mail.umich.edu [141.211.14.90])
by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
Sat, 05 Jan 2008 09:14:16 -0500
X-Sieve: CMU Sieve 2.3
Received: from murder ([unix socket])
by mail.umich.edu (Cyrus v2.2.12) with LMTPA;
Sat, 05 Jan 2008 09:14:16 -0500
Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])


### E. Using try, except and open

In [27]:
import os
os.getcwd()

'C:\\Users\\PARTHI vs BHARATHI\\Downloads\\Python'

In [28]:
try:
    import os
    os.chdir("C:\\Users\\PARTHI vs BHARATHI\\Downloads\\Python")
    file = input("Enter file name: ")
    fh = open(file)
    count = 0
    for i in fh:
        print(i)
        count+=1
        if count == 10:
            break
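except OSError:  # covers FileNotFoundError and PermissionError; a bare except would hide every other error too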
    print("Please enter the correct file name.")

Enter file name: mbox-short.txt
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008

Return-Path: <postmaster@collab.sakaiproject.org>

Received: from murder (mail.umich.edu [141.211.14.90])

	 by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;

	 Sat, 05 Jan 2008 09:14:16 -0500

X-Sieve: CMU Sieve 2.3

Received: from murder ([unix socket])

	 by mail.umich.edu (Cyrus v2.2.12) with LMTPA;

	 Sat, 05 Jan 2008 09:14:16 -0500

Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])
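
A slightly tighter variant keeps only the `open` call inside the `try` block, so that unrelated errors elsewhere are not reported as a bad file name. This is a sketch, not the notebook's own method; the helper name `open_or_none` and the missing-file path are hypothetical:

```python
import os
import tempfile

def open_or_none(filename):
    """Return an open file handle, or None if the file cannot be opened."""
    try:
        return open(filename)
    except OSError as err:   # includes FileNotFoundError and PermissionError
        print("Could not open", filename, "-", err.strerror)
        return None

# A path that is guaranteed not to exist, inside a fresh temporary directory.
missing = os.path.join(tempfile.mkdtemp(), "no-such-file.txt")
result = open_or_none(missing)
print(result)
```

The caller can then check for `None` and ask the user for another name instead of crashing.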



### F. Searching through the file


a) For example, suppose we want to read a file and print only the lines that start with the prefix “From:”:

In [19]:
filename = 'mbox-short.txt'
fh = open(filename)
for line in fh:
    if line[:5] == 'From:':
        line = line.rstrip()
        print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


In [40]:
filename = 'mbox-short.txt'
fh = open(filename)
for line in fh:
    if line.startswith('From:'): 
        line = line.rstrip()
        print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


b) Can we get the list of email addresses?

In [29]:
filename = 'mbox-short.txt'
fh = open(filename)
l = []
for line in fh:
    if line.startswith('From:'): 
        line = line[len('From:'):]   # slice off the prefix; strip("From: ") would remove a set of characters, not a prefix
        l.append(line.strip())
print(l)
s = set(l)
print(s)
li = sorted(list(s))
print(li)

['stephen.marquard@uct.ac.za', 'louis@media.berkeley.edu', 'zqian@umich.edu', 'rjlowe@iupui.edu', 'zqian@umich.edu', 'rjlowe@iupui.edu', 'cwen@iupui.edu', 'cwen@iupui.edu', 'gsilver@umich.edu', 'gsilver@umich.edu', 'zqian@umich.edu', 'gsilver@umich.edu', 'wagnermr@iupui.edu', 'zqian@umich.edu', 'antranig@caret.cam.ac.uk', 'gopal.ramasammycook@gmail.com', 'david.horwitz@uct.ac.za', 'david.horwitz@uct.ac.za', 'david.horwitz@uct.ac.za', 'david.horwitz@uct.ac.za', 'stephen.marquard@uct.ac.za', 'louis@media.berkeley.edu', 'louis@media.berkeley.edu', 'ray@media.berkeley.edu', 'cwen@iupui.edu', 'cwen@iupui.edu', 'cwen@iupui.edu']
{'rjlowe@iupui.edu', 'stephen.marquard@uct.ac.za', 'david.horwitz@uct.ac.za', 'ray@media.berkeley.edu', 'louis@media.berkeley.edu', 'zqian@umich.edu', 'wagnermr@iupui.edu', 'gopal.ramasammycook@gmail.com', 'antranig@caret.cam.ac.uk', 'gsilver@umich.edu', 'cwen@iupui.edu'}
['antranig@caret.cam.ac.uk', 'cwen@iupui.edu', 'david.horwitz@uct.ac.za', 'g
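
One thing to watch for when removing a prefix: `str.strip` with an argument treats that argument as a set of characters to remove from both ends, not as a prefix. An address such as `rjlowe@iupui.edu` can therefore lose its leading `r`, because `r` is one of the characters in `"From: "`. A short demonstration on a single sample line:

```python
line = "From: rjlowe@iupui.edu\n"

# strip("From: ") removes any of the characters F, r, o, m, ':' and ' ' from both ends.
stripped = line.strip("From: ")

# Slicing removes exactly the six-character prefix; rstrip() drops the newline.
sliced = line[len("From: "):].rstrip()

print(repr(stripped))
print(repr(sliced))
```

On Python 3.9 and later, `line.removeprefix("From: ")` is another way to drop exactly that prefix.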

In [42]:
filename = 'mbox-short.txt'
fh = open(filename)
for line in fh:
    if line.startswith('From:'): 
        line = line[6:].rstrip()
        print(line)

stephen.marquard@uct.ac.za
louis@media.berkeley.edu
zqian@umich.edu
rjlowe@iupui.edu
zqian@umich.edu
rjlowe@iupui.edu
cwen@iupui.edu
cwen@iupui.edu
gsilver@umich.edu
gsilver@umich.edu
zqian@umich.edu
gsilver@umich.edu
wagnermr@iupui.edu
zqian@umich.edu
antranig@caret.cam.ac.uk
gopal.ramasammycook@gmail.com
david.horwitz@uct.ac.za
david.horwitz@uct.ac.za
david.horwitz@uct.ac.za
david.horwitz@uct.ac.za
stephen.marquard@uct.ac.za
louis@media.berkeley.edu
louis@media.berkeley.edu
ray@media.berkeley.edu
cwen@iupui.edu
cwen@iupui.edu
cwen@iupui.edu


c) Extract lines which contain the string “@uct.ac.za” (i.e., they come from the University of Cape Town in South Africa):

In [67]:
filename = 'mbox-short.txt'
fh = open(filename)
for line in fh:
    if line.find("@uct.ac.za") != -1:
        print(line.rstrip())

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From: stephen.marquard@uct.ac.za
Author: stephen.marquard@uct.ac.za
From david.horwitz@uct.ac.za Fri Jan  4 07:02:32 2008
From: david.horwitz@uct.ac.za
Author: david.horwitz@uct.ac.za
r39753 | david.horwitz@uct.ac.za | 2008-01-04 13:05:51 +0200 (Fri, 04 Jan 2008) | 1 line
From david.horwitz@uct.ac.za Fri Jan  4 06:08:27 2008
From: david.horwitz@uct.ac.za
Author: david.horwitz@uct.ac.za
From david.horwitz@uct.ac.za Fri Jan  4 04:49:08 2008
From: david.horwitz@uct.ac.za
Author: david.horwitz@uct.ac.za
From david.horwitz@uct.ac.za Fri Jan  4 04:33:44 2008
From: david.horwitz@uct.ac.za
Author: david.horwitz@uct.ac.za
From stephen.marquard@uct.ac.za Fri Jan  4 04:07:34 2008
From: stephen.marquard@uct.ac.za
Author: stephen.marquard@uct.ac.za


In [20]:
filename = 'mbox-short.txt'
fh = open(filename)
for line in fh:
    if "@uct.ac.za" in line:
        print(line.rstrip())

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From: stephen.marquard@uct.ac.za
Author: stephen.marquard@uct.ac.za
From david.horwitz@uct.ac.za Fri Jan  4 07:02:32 2008
From: david.horwitz@uct.ac.za
Author: david.horwitz@uct.ac.za
r39753 | david.horwitz@uct.ac.za | 2008-01-04 13:05:51 +0200 (Fri, 04 Jan 2008) | 1 line
From david.horwitz@uct.ac.za Fri Jan  4 06:08:27 2008
From: david.horwitz@uct.ac.za
Author: david.horwitz@uct.ac.za
From david.horwitz@uct.ac.za Fri Jan  4 04:49:08 2008
From: david.horwitz@uct.ac.za
Author: david.horwitz@uct.ac.za
From david.horwitz@uct.ac.za Fri Jan  4 04:33:44 2008
From: david.horwitz@uct.ac.za
Author: david.horwitz@uct.ac.za
From stephen.marquard@uct.ac.za Fri Jan  4 04:07:34 2008
From: stephen.marquard@uct.ac.za
Author: stephen.marquard@uct.ac.za


In [23]:
filename = 'mbox-short.txt'
fh = open(filename)
for line in fh:
    if line.startswith("From: ") and "@uct.ac.za" in line:
        address = line[6:].strip()   # drop the 6-character "From: " prefix and the trailing newline
        print(address, len(address))

stephen.marquard@uct.ac.za 26
david.horwitz@uct.ac.za 23
david.horwitz@uct.ac.za 23
david.horwitz@uct.ac.za 23
david.horwitz@uct.ac.za 23
stephen.marquard@uct.ac.za 26


d) How many emails were received from the University of Cape Town?

In [24]:
filename = 'mbox-short.txt'
fh = open(filename)
count=0
for line in fh:
    if line.find("@uct.ac.za") != -1 and line.startswith("From:"):
        count+=1
        print(line.rstrip())
print("%d emails were received from UCT" % count)

From: stephen.marquard@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
6 emails were received from UCT
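
The same count can be wrapped in a small reusable function. This is a sketch under stated assumptions: the name `count_senders_from` is hypothetical, the four sample lines are made up in the style of the mbox file, and `io.StringIO` stands in for the open file:

```python
import io

def count_senders_from(fhand, domain):
    """Count 'From:' header lines whose address ends with the given domain."""
    count = 0
    for line in fhand:
        # "From " separator lines are skipped because they lack the colon.
        if line.startswith("From:") and line.rstrip().endswith(domain):
            count += 1
    return count

sample = io.StringIO(
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n"
    "From: stephen.marquard@uct.ac.za\n"
    "From: louis@media.berkeley.edu\n"
    "From: david.horwitz@uct.ac.za\n"
)
n = count_senders_from(sample, "@uct.ac.za")
print(n)
```

Checking `endswith(domain)` rather than `in` avoids counting an address that merely contains the domain text somewhere in the middle.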


In [70]:
f = open("mbox-short.txt")
for l in f:
    if l.startswith("From: "):
        print(l)

From: stephen.marquard@uct.ac.za

From: louis@media.berkeley.edu

From: zqian@umich.edu

From: rjlowe@iupui.edu

From: zqian@umich.edu

From: rjlowe@iupui.edu

From: cwen@iupui.edu

From: cwen@iupui.edu

From: gsilver@umich.edu

From: gsilver@umich.edu

From: zqian@umich.edu

From: gsilver@umich.edu

From: wagnermr@iupui.edu

From: zqian@umich.edu

From: antranig@caret.cam.ac.uk

From: gopal.ramasammycook@gmail.com

From: david.horwitz@uct.ac.za

From: david.horwitz@uct.ac.za

From: david.horwitz@uct.ac.za

From: david.horwitz@uct.ac.za

From: stephen.marquard@uct.ac.za

From: louis@media.berkeley.edu

From: louis@media.berkeley.edu

From: ray@media.berkeley.edu

From: cwen@iupui.edu

From: cwen@iupui.edu

From: cwen@iupui.edu

