# Working with Data Files

In Python, we must open files before we can use them and close them when we are done with them. As you might expect, once a file is opened it becomes a Python object just like all other data. Table 1 shows the functions and methods that can be used to open and close files.

| Method Name | Use | Explanation |
| --- | --- | --- |
| open | open(filename,'r') | Open a file called filename and use it for reading. This will return a reference to a file object. |
| open | open(filename,'w') | Open a file called filename and use it for writing. This will also return a reference to a file object. |
| close | filevariable.close() | File use is complete. |

# Reading a File
As an example, suppose we have a text file called olympics.txt that contains data about Olympians across different years. The contents of the file are shown at the bottom of the page.

To open this file, we call the open function. The variable fileref then holds a reference to the file object returned by open. When we are finished with the file, we close it by using the close method. After the file is closed, any further attempt to use fileref will result in an error.
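
The last point can be checked with a small sketch. It assumes a scratch file (demo_closed.txt, a name made up for this example and created in a temporary directory) so that it runs on its own. In current Python 3, reading from a closed file object raises a ValueError.

```python
import os
import tempfile

# Hypothetical scratch file, created only so the sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), "demo_closed.txt")
with open(path, "w") as out:
    out.write("Name,Sex,Age,Team,Event,Medal\n")

fileref = open(path, "r")
header = fileref.readline()   # works: the file is still open
fileref.close()

print(fileref.closed)         # the file object remembers that it was closed

try:
    fileref.read()            # any further read on the closed file fails
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```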



In [2]:
file = open('olympics.txt','r')
data = file.read()
print(data[:50])
file.close()

Name,Sex,Age,Team,Event,Medal
A Dijiang,M,24,China


In [2]:
fileref = open("olympics.txt", "r")
file = fileref.read()
print(file[:100])
## other code here that refers to variable fileref
fileref.close()

Name,Sex,Age,Team,Event,Medal
A Dijiang,M,24,China,Basketball,NA
A Lamusi,M,23,China,Judo,NA
Gunnar 


In [4]:
fileref = open("olympics.txt", "r")
## other code here that refers to variable fileref
data = fileref.readlines()
print(data[:5])
fileref.close()

['Name,Sex,Age,Team,Event,Medal\n', 'A Dijiang,M,24,China,Basketball,NA\n', 'A Lamusi,M,23,China,Judo,NA\n', 'Gunnar Nielsen Aaby,M,24,Denmark,Football,NA\n', 'Edgar Lindenau Aabye,M,34,Denmark/Sweden,Tug-Of-War,Gold\n']


# Iterating over lines in a file
We will now use this file as input in a program that will do some data processing. In the program, we will examine each line of the file and print it with some additional text. Because readlines() returns a list of lines of text, we can use the for loop to iterate through each line of the file.

A line of a file is defined to be a sequence of characters up to and including a special character called the newline character. If you evaluate a string that contains a newline character, you will see the character represented as \n. If you print a string that contains a newline, you will not see the \n; you will just see its effect (the text that follows starts on a new line).
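
A short sketch of that difference, using one line copied from olympics.txt as a string literal (the variable names are chosen only for this example):

```python
# One line of the file, including its trailing newline character
line = "A Dijiang,M,24,China,Basketball,NA\n"

# repr gives the form you see when you evaluate the string: the \n is visible
evaluated = repr(line)
print(evaluated)

# print shows only the effect of the newline, not the \n itself
print(line)

# The newline still counts as one character of the line
length_with_newline = len(line)
length_without_newline = len(line.rstrip("\n"))
print(length_with_newline - length_without_newline)
```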

As the for loop iterates through each line of the file the loop variable will contain the current line of the file as a string of characters. The general pattern for processing each line of a text file is as follows:

for line in myFile.readlines():
    statement1
    statement2
    ...

To process all of our Olympics data, we will use a for loop to iterate over the lines of the file. Using the split method, we can break each line into a list containing all the fields of interest about the athlete. We can then take the values corresponding to name, team and event to construct a simple sentence.



In [4]:
olypmicsfile = open("olympics.txt", "r")

for aline in olypmicsfile.readlines():
    line=aline.strip()
    print(line)

olypmicsfile.close()

Name,Sex,Age,Team,Event,Medal
A Dijiang,M,24,China,Basketball,NA
A Lamusi,M,23,China,Judo,NA
Gunnar Nielsen Aaby,M,24,Denmark,Football,NA
Edgar Lindenau Aabye,M,34,Denmark/Sweden,Tug-Of-War,Gold
Christine Jacoba Aaftink,F,21,Netherlands,Speed Skating,NA
Christine Jacoba Aaftink,F,21,Netherlands,Speed Skating,NA
Christine Jacoba Aaftink,F,25,Netherlands,Speed Skating,NA
Christine Jacoba Aaftink,F,25,Netherlands,Speed Skating,NA
Christine Jacoba Aaftink,F,27,Netherlands,Speed Skating,NA
Christine Jacoba Aaftink,F,27,Netherlands,Speed Skating,NA
Per Knut Aaland,M,31,United States,Cross Country Skiing,NA
Per Knut Aaland,M,31,United States,Cross Country Skiing,NA
Per Knut Aaland,M,31,United States,Cross Country Skiing,NA
Per Knut Aaland,M,31,United States,Cross Country Skiing,NA
Per Knut Aaland,M,33,United States,Cross Country Skiing,NA
Per Knut Aaland,M,33,United States,Cross Country Skiing,NA
Per Knut Aaland,M,33,United States,Cross Country Skiing,NA
Per Knut Aaland,M,33,United States,Cro

In [7]:
olypmicsfile = open("olympics.txt", "r")

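# skip the header row, then split each remaining line into its fields
for aline in olypmicsfile.readlines()[1:]:
    # strip first so the last field does not keep the trailing newline
    aaline=aline.strip().split(',')
    name=aaline[0]
    sex=aaline[1]
    age=aaline[2]
    team=aaline[3]
    event=aaline[4]
    medal=aaline[5]
    line=f'{name} is from {team} playing {event}'
    print(line)

olypmicsfile.close()
# the same sentence could be built with str.format:
#line='{} is from {} playing {}'.format(aaline[0],aaline[3],aaline[4])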

A Dijiang is from China playing Basketball
A Lamusi is from China playing Judo
Gunnar Nielsen Aaby is from Denmark playing Football
Edgar Lindenau Aabye is from Denmark/Sweden playing Tug-Of-War
Christine Jacoba Aaftink is from Netherlands playing Speed Skating
Christine Jacoba Aaftink is from Netherlands playing Speed Skating
Christine Jacoba Aaftink is from Netherlands playing Speed Skating
Christine Jacoba Aaftink is from Netherlands playing Speed Skating
Christine Jacoba Aaftink is from Netherlands playing Speed Skating
Christine Jacoba Aaftink is from Netherlands playing Speed Skating
Per Knut Aaland is from United States playing Cross Country Skiing
Per Knut Aaland is from United States playing Cross Country Skiing
Per Knut Aaland is from United States playing Cross Country Skiing
Per Knut Aaland is from United States playing Cross Country Skiing
Per Knut Aaland is from United States playing Cross Country Skiing
Per Knut Aaland is from United States playing Cross Country Skiing
P

## Write code to find out how many lines are in the file olympics.txt as shown above. Save this value to the variable num_lines. Do not use the len function.


In [10]:
lines = open("olympics.txt", "r")
num_lines =  0
for aline in lines.readlines():
    num_lines+=1
print(num_lines)
lines.close()

60


In [None]:
md = open('olympics.txt', 'r')
for line in md:
    print(line.strip())
md.close()
# continue with other code

In [11]:
with open('olympics.txt', 'r') as md:
    for line in md.readlines():
        print(line.strip())
# continue on with other code

Name,Sex,Age,Team,Event,Medal
A Dijiang,M,24,China,Basketball,NA
A Lamusi,M,23,China,Judo,NA
Gunnar Nielsen Aaby,M,24,Denmark,Football,NA
Edgar Lindenau Aabye,M,34,Denmark/Sweden,Tug-Of-War,Gold
Christine Jacoba Aaftink,F,21,Netherlands,Speed Skating,NA
Christine Jacoba Aaftink,F,21,Netherlands,Speed Skating,NA
Christine Jacoba Aaftink,F,25,Netherlands,Speed Skating,NA
Christine Jacoba Aaftink,F,25,Netherlands,Speed Skating,NA
Christine Jacoba Aaftink,F,27,Netherlands,Speed Skating,NA
Christine Jacoba Aaftink,F,27,Netherlands,Speed Skating,NA
Per Knut Aaland,M,31,United States,Cross Country Skiing,NA
Per Knut Aaland,M,31,United States,Cross Country Skiing,NA
Per Knut Aaland,M,31,United States,Cross Country Skiing,NA
Per Knut Aaland,M,31,United States,Cross Country Skiing,NA
Per Knut Aaland,M,33,United States,Cross Country Skiing,NA
Per Knut Aaland,M,33,United States,Cross Country Skiing,NA
Per Knut Aaland,M,33,United States,Cross Country Skiing,NA
Per Knut Aaland,M,33,United States,Cro

# Writing Text Files
One of the most commonly performed data processing tasks is to read data from a file, manipulate it in some way, and then write the resulting data out to a new data file to be used for other purposes later. To accomplish this, the open function discussed above can also be used to create a new file prepared for writing. Note in Table 1 that the only difference between opening a file for writing and opening a file for reading is the use of the 'w' flag instead of the 'r' flag as the second parameter. When we open a file for writing, a new, empty file with that name is created and made ready to accept our data. If an existing file has the same name, its contents are overwritten. As before, the function returns a reference to the new file object.
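
The overwriting behaviour can be seen in a minimal sketch. It assumes a scratch file (overwrite_demo.txt, a hypothetical name, created in a temporary directory) so that no real data is touched.

```python
import os
import tempfile

# Hypothetical scratch file used only for this sketch
path = os.path.join(tempfile.mkdtemp(), "overwrite_demo.txt")

outfile = open(path, "w")
outfile.write("first version\n")
outfile.close()

# Opening the same name with 'w' again starts from an empty file
outfile = open(path, "w")
outfile.write("second version\n")
outfile.close()

infile = open(path, "r")
contents = infile.read()
infile.close()

# Only the text from the second write remains
print(contents)
```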

In [14]:
file = open('vaishuf.txt','w')
for i in range(20):
    file.write(str(i)+'\n')
file.close()

In [15]:
filename = "squared_numbers.txt"
outfile = open(filename, "w")

for number in range(1, 13):
    value = number**2
    outfile.write(str(value)+'\n')
    

outfile.close()

infile = open(filename, "r")
print(infile.read()[:11])
infile.close()

1
4
9
16
25


# Hands-on exercise

In [8]:
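file = open('emails.txt')
dic = {}
for email in file:
    # only lines that start with 'From:' are counted
    if email.startswith('From:'):
        # the second whitespace-separated item on the line is the address
        email = email.strip().split()[1]
        if email not in dic:
            dic[email] = 1
        else:
            dic[email] += 1
file.close()
print(dic)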

{'stephen.marquard@uct.ac.za': 2, 'louis@media.berkeley.edu': 3, 'zqian@umich.edu': 4, 'rjlowe@iupui.edu': 2, 'cwen@iupui.edu': 5, 'gsilver@umich.edu': 3, 'wagnermr@iupui.edu': 1, 'antranig@caret.cam.ac.uk': 1, 'gopal.ramasammycook@gmail.com': 1, 'david.horwitz@uct.ac.za': 4, 'ray@media.berkeley.edu': 1}


In [21]:
outfile = open("new_email.txt", "w")
file = open('emails.txt')
dic = {}
for email in file:
    if email.startswith('From:'):
        email = email.strip().split()[1]
        if email not in dic:
            dic[email] = 1
        else:
            dic[email] += 1
# first write a header row at the top of the text file
outfile.write('Email, frequency'+'\n')
for key, value in dic.items():
    # use the .format method to combine the key and value into one line
    line='{}, {}'.format(key,value)
    # write the line to the 'new_email.txt' file
    outfile.write(line+'\n')
file.close()
outfile.close()

In [18]:
infile = open("new_email.txt", "r")
for line in infile.readlines():
    print(line.strip())
infile.close()

Email, frequency
stephen.marquard@uct.ac.za, 2
louis@media.berkeley.edu, 3
zqian@umich.edu, 4
rjlowe@iupui.edu, 2
cwen@iupui.edu, 5
gsilver@umich.edu, 3
wagnermr@iupui.edu, 1
antranig@caret.cam.ac.uk, 1
gopal.ramasammycook@gmail.com, 1
david.horwitz@uct.ac.za, 4
ray@media.berkeley.edu, 1


# CSV Format
CSV stands for Comma Separated Values. If you print out tabular data in CSV format, it can be easily imported into other programs like Excel, Google Sheets, or a statistics package (R, Stata, SPSS, etc.).

For example, we can make a file with the following contents. If you save it as a file named grades.csv, then you could import it into one of those programs. The first line gives the column names and the later lines each give the data for one row.

Name,score,grade
Jamal,98,A+
Eloise,87,B+
Madeline,99,A+
Wei,94,A
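
A rough sketch of how a program might take such contents apart again. It assumes the text above is held in a string (csv_text, a name made up for this example) rather than read from a real grades.csv on disk, and it relies on no field containing a comma of its own.

```python
# The contents shown above, held in a string for the sake of the sketch
csv_text = """Name,score,grade
Jamal,98,A+
Eloise,87,B+
Madeline,99,A+
Wei,94,A
"""

lines = csv_text.strip().split("\n")

# The first line gives the column names
header = lines[0].split(",")

# Each later line gives the data for one row
rows = []
for line in lines[1:]:
    name, score, grade = line.split(",")
    rows.append((name, int(score), grade))

print(header)
for row in rows:
    print(row)
```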

# Writing data to a CSV File
The typical pattern for writing data to a CSV file will be to write a header row and loop through the items in a list, outputting one row for each. Here we have a list of tuples, each representing one Olympian, a subset of the rows and columns from the file we have been reading from.

In [5]:
olympians = [("John Aalberg", 31, "Cross Country Skiing"),
             ("Minna Maarit Aalto", 30, "Sailing"),
             ("Win Valdemar Aaltonen", 54, "Art Competitions"),
             ("Wakako Abe", 18, "Cycling")]

outfile = open("reduced_olympics.csv", "w")
# output the header row
outfile.write('Name,Age,Sport')
outfile.write('\n')
# output each of the rows:
for olympian in olympians:
    row_string = '{},{},{}'.format(olympian[0], olympian[1], olympian[2])
    outfile.write(row_string)
    outfile.write('\n')
outfile.close()
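
As an alternative to building each row string by hand, Python's standard library includes a csv module. The sketch below is one possible way to write the same list with csv.writer. It is not the approach used in the cell above, and the output path (reduced_olympics_sketch.csv in a temporary directory) is a hypothetical name chosen so the sketch does not overwrite reduced_olympics.csv.

```python
import csv
import os
import tempfile

olympians = [("John Aalberg", 31, "Cross Country Skiing"),
             ("Minna Maarit Aalto", 30, "Sailing"),
             ("Win Valdemar Aaltonen", 54, "Art Competitions"),
             ("Wakako Abe", 18, "Cycling")]

# Hypothetical scratch location used only for this sketch
path = os.path.join(tempfile.mkdtemp(), "reduced_olympics_sketch.csv")

# The csv documentation recommends newline="" for files used with csv
with open(path, "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["Name", "Age", "Sport"])   # header row
    writer.writerows(olympians)                 # one row per tuple

# Read the file back to check what was written; every field comes back as a string
with open(path, "r", newline="") as infile:
    rows = list(csv.reader(infile))

for row in rows:
    print(row)
```

One reason to consider the module is that csv.writer quotes a field that itself contains a comma, which the plain format string above would not do.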

# Eval Function

In [24]:
a=input('enter:')
eval(a)

enter:8+28+71*3-82


167