### A. Text Files and Lines

Recall that a Python string can be thought of as a sequence of characters. In a similar way, a text file can be thought of as a sequence of lines.

For example, consider the following sample of a text file:

    From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
    Return-Path: <postmaster@collab.sakaiproject.org>
    Date: Sat, 5 Jan 2008 09:12:18 -0500
    To: source@collab.sakaiproject.org
    From: stephen.marquard@uct.ac.za
    Subject: [sakai] svn commit: r39772 - content/branches/
    Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772

These files are in a standard format for a file containing multiple mail messages. The lines which start with “From ” separate the messages and the lines which start with “From:” are part of the messages. For more information about the mbox format, see en.wikipedia.org/wiki/Mbox.

To break the file into lines, there is a special character that represents the “end of the line” called the newline character.

### B. Newline

In Python, the newline character is represented by \n.

*(Even though this looks like two characters, it is actually a single character.)*

In [None]:
mystr = "A\nB"
print(mystr)

In [None]:
len(mystr)

**Note:** 
*So when we look at the lines in a file, we need to imagine that there is a special invisible character called the newline at the end of each line that marks the end of the line.*
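
To make the invisible newline visible, here is a minimal sketch. It assumes an in-memory `io.StringIO` object as a stand-in for a real file (the three sample lines are made up for illustration), and uses `repr()` so that the trailing `\n` is displayed rather than acted on.

```python
import io

# io.StringIO behaves like an open text file, so nothing is needed on disk.
# The contents are hypothetical sample data.
fake_file = io.StringIO("first line\nsecond line\nthird line\n")

raw_lines = []
for line in fake_file:
    raw_lines.append(line)
    print(repr(line))          # repr() shows the trailing \n explicitly

# rstrip() removes the newline (and any other trailing whitespace).
stripped_lines = [line.rstrip() for line in raw_lines]
print(stripped_lines)
```

Each raw line is one character longer than its stripped version, which is why `print(line)` without `rstrip()` produces blank lines between the lines of output.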

### C. Reading Files

In [None]:
import os

In [None]:
#File handle
fhand = open("datasets/mbox-short.txt")

In [7]:
fhand = open("datasets/mbox-short.txt")
count = 0

for line in fhand:
    line = line.rstrip()
    count += 1
    print(line)
    
    if count == 10:
        break

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
Return-Path: <postmaster@collab.sakaiproject.org>
Received: from murder (mail.umich.edu [141.211.14.90])
	 by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
	 Sat, 05 Jan 2008 09:14:16 -0500
X-Sieve: CMU Sieve 2.3
Received: from murder ([unix socket])
	 by mail.umich.edu (Cyrus v2.2.12) with LMTPA;
	 Sat, 05 Jan 2008 09:14:16 -0500
Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])


In [None]:
fhand = open("datasets/mbox-short.txt")
count = 0

for line in fhand:
    count += 1          # counter is incremented in each iteration
    
print(count)

In [4]:
fhand = open("datasets/mbox-short.txt")
count = 0

for line in fhand:
    count += 1
    
    if count >= 10 and count <= 20:
        print(line)                   # it will print the lines from 10 to 20
    elif count > 20:
        break                         # for line > 20 it will terminate the loop
    else:
        continue                      # it will skip the particular step   

Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])

	by flawless.mail.umich.edu () with ESMTP id m05EEFR1013674;

	Sat, 5 Jan 2008 09:14:15 -0500

Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk [194.35.219.184])

	BY holes.mr.itd.umich.edu ID 477F90B0.2DB2F.12494 ; 

	 5 Jan 2008 09:14:10 -0500

Received: from paploo.uhi.ac.uk (localhost [127.0.0.1])

	by paploo.uhi.ac.uk (Postfix) with ESMTP id 5F919BC2F2;

	Sat,  5 Jan 2008 14:10:05 +0000 (GMT)

Message-ID: <200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>

Mime-Version: 1.0



In [6]:
fhand = open("datasets/mbox-short.txt")
count = 0

for line in fhand:
    count += 1
    
    if count < 10:
        continue
  
    print(line)
    
    if count == 20:
        break

Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])

	by flawless.mail.umich.edu () with ESMTP id m05EEFR1013674;

	Sat, 5 Jan 2008 09:14:15 -0500

Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk [194.35.219.184])

	BY holes.mr.itd.umich.edu ID 477F90B0.2DB2F.12494 ; 

	 5 Jan 2008 09:14:10 -0500

Received: from paploo.uhi.ac.uk (localhost [127.0.0.1])

	by paploo.uhi.ac.uk (Postfix) with ESMTP id 5F919BC2F2;

	Sat,  5 Jan 2008 14:10:05 +0000 (GMT)

Message-ID: <200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>

Mime-Version: 1.0



In [None]:
fhand = open("datasets/mbox-short.txt")
count = 0

for line in fhand:
    count += 1
    
    if count >= 10 and count <= 20:
        print(line)                        # loop will run for entire file

**Note:**
*The file handle does not contain the data of the file; it is only the means by which the data can be read.*
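
A short sketch of what this means in practice. It assumes a small throwaway file created with `tempfile` (the contents are hypothetical). Iterating over the handle a second time yields nothing, because the handle only tracks a position in the file. This is also why the cells above call `open` again before each loop.

```python
import os
import tempfile

# Create a small temporary file with made-up contents.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("line one\nline two\nline three\n")
    sample_path = tmp.name

fhand = open(sample_path)
print(fhand)                               # a description of the handle, not the text

first_pass = sum(1 for line in fhand)      # reads all three lines
second_pass = sum(1 for line in fhand)     # handle is already at end of file
print(first_pass, second_pass)

fhand.seek(0)                              # move back to the start of the file
third_pass = sum(1 for line in fhand)
print(third_pass)

fhand.close()
os.remove(sample_path)                     # clean up the temporary file
```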

**1. Reading the data using a loop**

We can easily construct a for loop to read through and count each of the lines in a file:

In [None]:
#Reading first 10 lines:
fhand = open("datasets/mbox-short.txt")
count = 0

for line in fhand:
    line = line.rstrip()
    count += 1
    print(line)
    
    if count == 10:
        break

In [None]:
#Total number of lines present in the file:
fhand = open("datasets/mbox-short.txt")
count = 0

for line in fhand:
    count += 1          # counter is incremented in each iteration
    
print(count)


**2. Reading data using the read method for files**

In [1]:
#Reading data using read() method:
fhand = open("datasets/mbox-short.txt")

file = fhand.read()

In [2]:
#length of the data
len(file)

94626

In [3]:
#Let's see how the data is read
file[:400]

'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\nReturn-Path: <postmaster@collab.sakaiproject.org>\nReceived: from murder (mail.umich.edu [141.211.14.90])\n\t by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;\n\t Sat, 05 Jan 2008 09:14:16 -0500\nX-Sieve: CMU Sieve 2.3\nReceived: from murder ([unix socket])\n\t by mail.umich.edu (Cyrus v2.2.12) with LMTPA;\n\t Sat, 05 Jan 2008 09:14:16 -0500\nR'

**Disadvantage:**

Remember that reading the whole file at once with the read method should only be done if the file data will fit comfortably in the main memory of your computer. If the file is too large to fit in main memory, you should write your program to read the file in chunks using a for or while loop.
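
One way to read in chunks is to pass a size to `read()` inside a `while` loop. The following is a minimal sketch rather than a recommended implementation. The chunk size of 1024 characters is arbitrary, and the temporary file with a repeated made-up line stands in for a large file.

```python
import os
import tempfile

CHUNK_SIZE = 1024   # characters per read; an arbitrary choice for illustration

# Build a throwaway file of hypothetical data to read back.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("From: someone@example.com\n" * 500)
    sample_path = tmp.name

total_chars = 0
chunk_count = 0
with open(sample_path) as fhand:
    while True:
        chunk = fhand.read(CHUNK_SIZE)     # at most CHUNK_SIZE characters
        if chunk == "":                    # empty string means end of file
            break
        total_chars += len(chunk)
        chunk_count += 1

print(total_chars, chunk_count)
os.remove(sample_path)                     # clean up the temporary file
```

Only one chunk is held in memory at a time, so the same loop works for files far larger than available memory. For line-oriented work, the `for line in fhand` loop used earlier has the same property.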

### D. Letting the user choose the file name

In [16]:
# Homework D and E

file_name = input("Enter the file name : ")              # taking file name as input from user

fhand  = open(file_name)                                 # file is opened in read mode
file  = fhand.read()

print(len(file))

Enter the file name : mbox-short.txt
94626


### E. Using try, except and open

In [17]:
file_name = input("Enter the file name : ")              # taking file name as input from user

try:
    fhand = open(file_name)                              # file is opened in read mode
    file = fhand.read()
    
    print(len(file))

except FileNotFoundError:
    print("File is not present in current directory")    # runs only when open() cannot find the file

Enter the file name : fdafs
File is not present in current directory
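
Catching a specific exception keeps unrelated errors visible instead of hiding them behind one message. As a rough sketch, the same idea can be wrapped in a function. The name `count_characters` and the file name below are hypothetical and chosen only for illustration; `FileNotFoundError` and `PermissionError` are built-in Python exceptions.

```python
def count_characters(file_name):
    """Return the number of characters in file_name, or None if it cannot be opened."""
    try:
        with open(file_name) as fhand:     # the with block closes the file automatically
            return len(fhand.read())
    except FileNotFoundError:
        print("File is not present:", file_name)
    except PermissionError:
        print("No permission to read:", file_name)
    return None


# Assumes no file with this made-up name exists in the current directory.
result = count_characters("no-such-file-fdafs.txt")
print(result)
```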


### F. Searching through the file

a) For example, suppose we want to read a file and print only the lines which start with the prefix “From:”:

In [4]:
fhand = open("datasets/mbox-short.txt")

for line in fhand:
    if line.startswith("From:"):
        line = line.rstrip()
        print(line)    

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


b) Can we get the list of email IDs?

In [10]:
fhand = open("datasets/mbox-short.txt")

for line in fhand:
    if line.startswith("From:"):
        # Note: line.lstrip("From: ") strips a set of characters, not the prefix, so it would also remove the leading "r" of "rjlowe"
        line = line.rstrip()
        email = line[line.find(" ")+1 :]     # everything after the first space
        print(email)

stephen.marquard@uct.ac.za
louis@media.berkeley.edu
zqian@umich.edu
rjlowe@iupui.edu
zqian@umich.edu
rjlowe@iupui.edu
cwen@iupui.edu
cwen@iupui.edu
gsilver@umich.edu
gsilver@umich.edu
zqian@umich.edu
gsilver@umich.edu
wagnermr@iupui.edu
zqian@umich.edu
antranig@caret.cam.ac.uk
gopal.ramasammycook@gmail.com
david.horwitz@uct.ac.za
david.horwitz@uct.ac.za
david.horwitz@uct.ac.za
david.horwitz@uct.ac.za
stephen.marquard@uct.ac.za
louis@media.berkeley.edu
louis@media.berkeley.edu
ray@media.berkeley.edu
cwen@iupui.edu
cwen@iupui.edu
cwen@iupui.edu
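
A note on `lstrip("From: ")` as a way of removing the prefix: `str.lstrip` treats its argument as a set of characters to strip, not as a prefix, so it keeps removing any of `F`, `r`, `o`, `m`, `:` and space from the left. The sketch below shows the effect on one address from the file and a slicing alternative.

```python
line = "From: rjlowe@iupui.edu\n"

# lstrip removes every leading character found in the set {F, r, o, m, :, space},
# so the "r" that begins the address is stripped as well.
wrong = line.rstrip().lstrip("From: ")
print(wrong)    # jlowe@iupui.edu

# Slicing off the prefix by its length removes exactly the prefix and nothing more.
prefix = "From: "
stripped = line.rstrip()
right = stripped[len(prefix):] if stripped.startswith(prefix) else stripped
print(right)    # rjlowe@iupui.edu
```

On Python 3.9 and later, `stripped.removeprefix("From: ")` does the same job as the slicing expression.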


c) Extract lines which contain the string “@uct.ac.za” (i.e., they come from the University of Cape Town in South Africa):

In [12]:
fhand = open("datasets/mbox-short.txt")

string = '@uct.ac.za'

for line in fhand:
    if string in line:                  # checks if the given string is there in a line
        line = line.rstrip()            # removes extra spaces and \n from end of line
        print(line)

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From: stephen.marquard@uct.ac.za
Author: stephen.marquard@uct.ac.za
From david.horwitz@uct.ac.za Fri Jan  4 07:02:32 2008
From: david.horwitz@uct.ac.za
Author: david.horwitz@uct.ac.za
r39753 | david.horwitz@uct.ac.za | 2008-01-04 13:05:51 +0200 (Fri, 04 Jan 2008) | 1 line
From david.horwitz@uct.ac.za Fri Jan  4 06:08:27 2008
From: david.horwitz@uct.ac.za
Author: david.horwitz@uct.ac.za
From david.horwitz@uct.ac.za Fri Jan  4 04:49:08 2008
From: david.horwitz@uct.ac.za
Author: david.horwitz@uct.ac.za
From david.horwitz@uct.ac.za Fri Jan  4 04:33:44 2008
From: david.horwitz@uct.ac.za
Author: david.horwitz@uct.ac.za
From stephen.marquard@uct.ac.za Fri Jan  4 04:07:34 2008
From: stephen.marquard@uct.ac.za
Author: stephen.marquard@uct.ac.za


d) How many emails were received from the University of Cape Town?

In [14]:
fhand = open("datasets/mbox-short.txt")

string = '@uct.ac.za'
count = 0

for line in fhand:
    if line.startswith("From:"):
        if string in line:               # checks if the given string is there in a line
            count += 1

print(count)    

6


In [None]:
# Read the mbox file and write to a new file every email ID that comes from uct.ac.za

In [2]:
try:
    # Open the input for reading and the output for writing ("w" creates the file or empties an existing one)
    with open("datasets/mbox-short.txt") as fhand, open("datasets/output-email.txt", "w") as fwrite:
        string = '@uct.ac.za'

        for line in fhand:
            if line.startswith("From:") and string in line:
                email = line[line.find(" ")+1 :]            # text after "From: ", trailing newline included
                fwrite.write(email)                         # the output file is opened once, so each write adds a new line

    # Both files are closed here, so everything written is on disk before it is read back
    with open("datasets/output-email.txt") as fread:
        print(fread.read())

except FileNotFoundError:
    print("File is not present")

stephen.marquard@uct.ac.za
david.horwitz@uct.ac.za
david.horwitz@uct.ac.za
david.horwitz@uct.ac.za
david.horwitz@uct.ac.za
stephen.marquard@uct.ac.za

