< [Sequences II](ZSequences-II.ipynb) | [PyFinLab Index](ALWAYS-START-HERE.ipynb) | [Dictionaries](ZDictionaries.ipynb) >

<a id = "ref00"></a>

<a><img src="figures/UUBS.png" width="180" height="180" border="10" /></a>
<hr>

<h2> Notebook A7: Files</h2>

<div class="alert alert-block alert-info" style="margin-top: 20px">

<li><a href="#ref1">Aims and Objectives</a></li>
<li><a href="#ref2">Introduction</a></li>
<li><a href="#ref3">Reading a File</a></li>
<li><a href="#ref5">Iterating over lines in a file</a></li>
<li><a href="#ref6">Finding a File in your Filesystem</a></li>
<li><a href="#ref7">Using `with` for Files</a></li>
<li><a href="#ref8">Reading and Processing a File</a></li>
<li><a href="#ref9">Writing Text Files</a></li>
<li><a href="#ref10">CSV Format</a></li>
<li><a href="#ref11">Reading data from a CSV File</a></li>
<li><a href="#ref12">Writing data to a CSV File</a></li>
<li><a href="#ref13">Handling Files</a></li>
<li><a href="#ref14">Glossary</a></li>
<li><a href="#ref15">Exercises</a></li>
<br>
<p></p>
Indicative Completion Time: <strong>3 hrs</strong> (not including video viewing time)

</div>


<a id="ref1"></a>
<h3>Aims</h3>

To introduce and understand the concept and use of: 
* file system structure
* opening files with different modes
* files as another kind of iterable sequence
* read/transform/write patterns
* parallel assignment to two or three variables

<h3>Objectives</h3>

On completion of this notebook you should be able to:
* read a single value from each line in a file
* convert the line to an appropriate value
* read a line and convert it into multiple values using `split` and assignment to multiple variables
* execute a complete cycle of reading, processing and writing substantial real data to and from files



<a id="ref2"></a>
<h2>Introduction</h2>

Thus far the data we have used in this course have all been either coded directly into the program, or have been entered by the user. In everyday data analytics data reside in files; web pages, images, word processing documents, video, audio, are all examples of data that live in files. In this notebook we introduce the Python concepts necessary to use data from files in our programs.

Here is the video collection for this notebook, which introduces some important concepts related to working with files in Python. View them more than once as they lay the foundation for all that follows. Videos 1 and 2 provide an introduction and overview of the process of handling files (creating file objects, reading and writing files) with the `open` function. Videos 3 and 4 provide a more in-depth treatment of the processes involved and focus on realistic data analytics tasks involving flat files. Videos 3 and 4 can be left until you commence working through the section <a href="#ref5">Iterating over lines in a file</a>.  

<a href="http://www.youtube.com/watch?feature=player_embedded&v=k27fTx4Umz8
" target="_blank"><img src="https://i.ytimg.com/vi/Uh2ebFW8OYM/maxresdefault.jpg" 
alt="Key Value Pairs" width="200" height="180" border="10" /></a>

<p></p>
<center><b>Video 1:</b> Reading files with open (an overview)</center>

<a href="http://www.youtube.com/watch?feature=player_embedded&v=8Zp2qEEFnpI
" target="_blank"><img src="https://i.ytimg.com/vi/Uh2ebFW8OYM/maxresdefault.jpg" 
alt="Key Value Pairs" width="200" height="180" border="10" /></a>

<p></p>
<center><b>Video 2:</b> Writing files with open (an overview)</center>

<a href="http://www.youtube.com/watch?feature=player_embedded&v=UG_2gHTHvrU
" target="_blank"><img src="https://i.ytimg.com/vi/Uh2ebFW8OYM/maxresdefault.jpg" 
alt="Key Value Pairs" width="200" height="180" border="10" /></a>

<p></p>
<center><b>Video 3:</b> Processing files </center>

<a href="http://www.youtube.com/watch?feature=player_embedded&v=UANEIDwjSLc
" target="_blank"><img src="https://i.ytimg.com/vi/Uh2ebFW8OYM/maxresdefault.jpg" 
alt="Key Value Pairs" width="200" height="180" border="10" /></a>

<p></p>
<center><b>Video 4:</b> Reading files (in detail)</center>

For our purposes, we will assume that our data files are text files; files filled with characters. The Python programs that you write are stored as text files. We can create these files in any of a number of ways. For example, we could use a text editor to enter data manually and then save. We could also download the data from a website and save it in a file. Regardless of how the file is created, Python will allow us to manipulate the contents.

In Python, we must `open` files before we can use them and `close` them when we are finished with them. As you might expect, once a file is opened it becomes a Python object just like all other data. Table 1 shows the functions and methods associated with opening and closing files.


| Method Name | Use                  | Explanation                                                                                          |
|-------------|----------------------|---------------------------------------------------------------------------------------------------------|
| open        | open(filename,'r')   | Open a file called filename and use it for reading. This will return a reference to a file object.      |
| open        | open(filename,'w')   | Open a file called filename and use it for writing. This will also return a reference to a file object. |
| close       | filevariable.close() | File use is complete.                                                  |

<p></p>
<p></p>
<center> <b>Table 1:</b> Opening and closing files</center>

Suppose we have a text file called `sports.txt` that contains data on a range of athletes and their performances in World Championship athletics competitions (the file should be located in the same folder you used to access this notebook).

To open this file, we would call the `open` function. The variable, `fhand`, now holds a reference to the file object returned by `open`. When we are finished with the file, we can close it by using the `close` method. After the file is closed any further attempts to use `fhand` will result in an error. Note that `fhand` is a *handle* to the contents of the file; it does not hold the actual contents of the file per se.

In [None]:
fhand = open('sports.txt','r') # establishing a link to the file 
                               # (like dialing a phone number and getting connected)

## we can insert other code in here if we wish to 
## access/manipulate the data in the file directly

fhand.close() # here we are terminating the link 
              # (like hanging up the phone)


A common mistake is to get confused about whether you are providing a *variable name* or a *string literal* as input to the `open` function. In the code above, `'sports.txt'` is a *string literal* that should correspond to the name of a file,`sports.txt` , on your computer. If you use something like `x` without quotes, such as `open(x, 'r')`, `x` will be treated as a *variable name*. In such a case, `x` should be a variable that’s *already* been bound to a string value like `x ='sports.txt'`.
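
The difference can be sketched as follows. This is only an illustration: the file `demo.txt` and the variable `demo_name` are made up for the example, and the file is created in a temporary folder so that nothing in your own folders is touched.

```python
import os
import tempfile

# Create a throwaway file so the example can run anywhere.
# (demo.txt and demo_name are hypothetical names, not part of the course files.)
demo_name = os.path.join(tempfile.mkdtemp(), 'demo.txt')
fhand = open(demo_name, 'w')
fhand.write('hello\n')
fhand.close()

# demo_name (no quotes) is a variable that is already bound to a string,
# so open() uses the string stored in it as the file name.
fhand = open(demo_name, 'r')
first_line = fhand.readline()
fhand.close()

print(first_line)

# By contrast, open('demo_name', 'r') (with quotes) would look for a file
# that is literally called demo_name.
```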

Take some time to familiarise yourself with the `print` commands below, where we access some attributes of the file. The purpose of `print` here is solely to display the results so we can see them. What is more important is the structure and meaning of the rest of the code outside and inside the `print()` statements.

In [None]:
fhand = open('sports.txt','r')

fhand.name
print('1. The file name is',fhand.name,'\n')

fhand.mode
print('2. The file is in',fhand.mode,'mode. r means "read" mode \n')

filecontent = fhand.read() # the read() method will be fully explained in the next section

print('3.', type(filecontent),'is the class of the object filecontent \n')

print('4. And here is the file content exactly as it appears in the .txt file: \n\n',filecontent)
fhand.close() 

<a id="ref3"></a>
<h2>Reading a File</h2>

<div align="right"><a href="#ref00">back to top</a></div>

Once you have a file *object* (the thing returned by the `open` function), as depicted in this image, 

<a><img src="figures/Wk4_Fig1_20percent.png" width="400" height="180" border="10" /></a>

<center><b>Figure 1:</b> Labeled Syntax of a file object.</center> 

Python provides three methods to read data from that object. The `read()` method returns the entire contents of the file as a single string (or just some characters if you provide a number as an input parameter). The `readlines()` method returns the entire contents of the file as a list of strings, where each item in the list is one line of the file. The `readline()` method reads one line from the file and returns it as a string. The strings returned by `readlines()` or `readline()` will contain the newline character at the end. Table 2 summarises these methods and the rest of this section shows them in action.



| Method Name  | Use                    | Explanation                                                                                                                                                                                                                                                                                                                                |
|--------------|------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| write        | filevar.write(astring) | Add astring to the end of the file. filevar must refer to a file that has been opened for writing.                                                                                                                                                                                                                                         |
| read(n)      | filevar.read()         | Reads and returns a string of n characters, or the entire file as a single string if n is not provided.                                                                                                                                                                                                                                    |
| readline(n)  | filevar.readline()     | Returns the next line of the file with all text up to and including the newline character. If n is provided as a parameter then only n characters will be returned if the line is longer than n. The parameter n is rarely used in practice and you can safely ignore it. |
| readlines(n) | filevar.readlines()    | Returns a list of strings, each representing a single line of the file. If n is not provided then all lines of the file are returned. If n is provided then n characters are read but n is rounded up so that an entire line is returned. |


<p></p>
<center><b>Table 2:</b> Methods for reading files</center>
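
To compare the three reading methods side by side, here is a minimal sketch. It assumes a made-up three-line file, `three_lines.txt`, written to a temporary folder first so the example is self-contained. The built-in `repr` is used only to make the newline characters visible.

```python
import os
import tempfile

# Build a small throwaway file (the name and contents are made up).
path = os.path.join(tempfile.mkdtemp(), 'three_lines.txt')
fhand = open(path, 'w')
fhand.write('alpha\nbeta\ngamma\n')
fhand.close()

fhand = open(path, 'r')
whole = fhand.read()        # one string holding the entire file
fhand.close()

fhand = open(path, 'r')
first = fhand.readline()    # only the first line, newline included
fhand.close()

fhand = open(path, 'r')
lines = fhand.readlines()   # a list with one string per line
fhand.close()

print(repr(whole))
print(repr(first))
print(lines)
```

The file is reopened before each call so that every method starts from the beginning of the file.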


In this module, we will generally either iterate through the lines returned by `readlines()` with a `for` loop, or use `read()` to get all of the contents as a single string.

In other programming languages, where they don’t have the convenient `for` loop method of going through the lines of the file one by one, they use a different pattern which requires a different kind of loop, the `while` loop. 

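A common error that beginning programmers make is to not realise that each of these ways of reading moves the file object’s position forward. The file on disk is not changed, but once everything has been read the file object is positioned at the end of the file. After you call `readlines()`, if you call it again on the same file object you’ll get an empty list.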
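
A short sketch of this behaviour, using a made-up two-line file in a temporary folder. The `seek(0)` call is not covered elsewhere in this notebook; it is shown here only as one way to return to the start of the file without reopening it.

```python
import os
import tempfile

# A throwaway two-line file (hypothetical name and contents).
path = os.path.join(tempfile.mkdtemp(), 'two_lines.txt')
fhand = open(path, 'w')
fhand.write('one\ntwo\n')
fhand.close()

fhand = open(path, 'r')
first_pass = fhand.readlines()    # reads everything
second_pass = fhand.readlines()   # nothing left to read from this position
fhand.seek(0)                     # move back to the start of the file
third_pass = fhand.readlines()    # everything is available again
fhand.close()

print(first_pass)
print(second_pass)
print(third_pass)
```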

Suppose we wish to find the number of characters in the italicised sentence below

'*Learning how to work with files will prove very useful in the future!*'

How can we achieve that? I hope you didn't say 'count them'!

In [None]:
len('Learning how to work with files will prove very useful in the future')

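If you do actually count the letters they total `56`, but the answer given was `68` because the 12 spaces between the words are counted as legitimate characters too. (The exclamation mark was left out of the string passed to `len`; including it would give `69`.)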
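
The breakdown can be checked with a few lines of code. This is a small sketch using the same string that was passed to `len` above; the variable names are chosen just for the illustration.

```python
sentence = 'Learning how to work with files will prove very useful in the future'

total = len(sentence)           # every character, spaces included
spaces = sentence.count(' ')    # only the spaces

letters = 0
for ch in sentence:
    if ch != ' ':               # count everything that is not a space
        letters += 1

print(total, letters, spaces)   # 68 56 12
```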

What if we have many, many more characters such that we cannot input them directly into the `len` function in the form of a single string, as we just did above? Suppose we wish to count the number of characters in a long text contained in a file. Let's count the characters in the file `titanic.txt`, which by the way at a size of 4 KB is a minuscule file by data analytics standards. However, it will serve the purpose of demonstration here. NB: `titanic.txt` needs to be in the same folder as this notebook, otherwise you will need to modify the code below to locate the file. We will look at doing this in a later section but for now having the files in the same folder is recommended.

In [None]:
fhand = open('titanic.txt','r') # making a call to the file

print(len(fhand.read())) # conversation: Hi, can I please have all the 
                         # characters in this file as a single string [.read()] 
                         # so I can count them [len()] and print out the result [print()].
        
fhand.close() # thank you and goodbye.

What if instead we had wanted to find the number of lines in this file?

In [None]:
fhand = open('titanic.txt','r') 
print(len(fhand.readlines())) 
fhand.close() 

As mentioned above, a mere `15` lines, not exactly a book. But the code will work just as well on files with as much data as a book or for that matter all the books ever written!

What if we had wanted to capture only the first 70 characters?

In [None]:
fhand = open('titanic.txt','r') 
print(fhand.read(70))                      
fhand.close() 

<a id="ref5"></a>
## Iterating over lines in a file

<div align="right"><a href="#ref00">back to top</a></div>

We will now use the file `sports.txt` as input in a program that will do some data processing. In the program, we will examine each line of the file and print it with some additional text. Because `readlines()` returns a list of lines of text, we can use the `for` loop to iterate through each line of the file.

A line of a file is defined to be a sequence of characters up to and including a special character called the newline character. If you evaluate a string that contains a newline character you will see the character represented as `\n`. On the other hand, if you print a string that contains a newline character you will not see the `\n` character in the output, you will just experience its effects (a new line!).
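
The two views of the same string can be sketched like this. The built-in `repr` is used here to show in printed output what you would see when a string is evaluated rather than printed.

```python
text = 'first line\nsecond line'

print(repr(text))   # the evaluated form: the newline shows up as \n
print(text)         # the printed form: the newline takes effect
print(len(text))    # the newline counts as a single character
```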


In [None]:
string = 'Where did the newline character go?\nIt has disappeared but this sentence has been displayed as a new line'
print(string)

As the `for` loop iterates through each line of the file the loop variable will contain the current line of the file as a string of characters. The general pattern for processing each line of a text file is as follows:

```python
for line in MyFile.readlines():
    statement1
    statement2
    ...
```

To process all of the athlete data, we can use a `for` loop to iterate over the lines of the file. Using the `split` method, we can break each line into a list containing all the fields of interest about the athletes. We can then take the values corresponding to *Name*, *Age* and *Event* to construct a simple sentence. 

In [None]:
AthleteFile = open("sports.txt","r")  # sports.txt is in the same folder as this notebook

for eachline in AthleteFile.readlines():
    values = eachline.split(",")
    print(values[0], "is", values[2], "years old and competes in", values[4])
     
AthleteFile.close()


To make the code a little simpler and to allow for more efficient processing, Python provides a built-in way to iterate through the contents of a file one line at a time, without first reading them all into a list. Some beginning programmers find this confusing so we don’t recommend doing it this way until you get a little more comfortable with Python. But this idiom is preferred by Python programmers, so you should be prepared to read it. And when you start dealing with big files, you may notice the efficiency gains of using it.

In [None]:
AthleteFile = open("sports.txt","r")

for eachline in AthleteFile:   # note we have not applied the readlines() method here
    values = eachline.split(",")
    print(values[0], "is", values[2], "years old and competes in", values[4])

AthleteFile.close()
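
Instead of indexing into the list returned by `split`, the fields can also be given names in one step using parallel assignment. The sketch below uses a made-up line with exactly five fields (name first, age third, event fifth, as the loops above assume); it is not taken from `sports.txt`, and the number of variables on the left must match the number of fields on the line.

```python
# A hypothetical line in a comma-separated layout; not real data.
sample_line = 'A. Runner,Ireland,27,F,800m\n'

values = sample_line.strip().split(',')    # strip() removes the trailing newline
name, country, age, sex, event = values    # one variable per field
age = int(age)                             # convert the text '27' to the number 27

print(name, 'is', age, 'years old and competes in', event)
```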


**Have a go:**
    
1. Write code to find out how many lines are in the file `Brian_O'Driscoll.txt`, located in the same folder as this notebook. Save this value to the variable `num_lines`. Don't use the `len` function.

In [None]:
#Write your code here




<div align="right">
<a href="#Hag1" class="btn btn-default" data-toggle="collapse">Click here for the answer</a>

</div>
<div id="Hag1" class="collapse">

```python
BodFile = open("Brian_O'Driscoll.txt","r")
num_lines = 0
for eachline in BodFile.readlines():
    num_lines += 1
print('This file has',num_lines,'lines')
BodFile.close()
```
```
If the question had allowed the len() function to be used 
then the result would have been the more compact...
```

```python
BodFile = open("Brian_O'Driscoll.txt","r")
print(len(BodFile.readlines()))    
BodFile.close()

```

</div>

<a id="ref6"></a>
## Finding a File in your Filesystem

<div align="right"><a href="#ref00">back to top</a></div>

If you have installed Python on your computer (by now you will have) and you are trying to get file reading and writing operations to work, there’s a little more that you may need to understand. Computer operating systems (like Windows and Mac OS) organise files into a hierarchy of folders (aka directories), with some folders containing other folders.

<a><img src="figures/filehierarchy.png" width="280" height="180" border="10" /></a>

<center><b>Figure 2:</b> Typical folder structure </center>


If your file and your Python program are in the same directory you can simply use the filename. For example, with the file hierarchy in the diagram, the file `myPythonProgram.py` could contain the code `open('data1.txt','r')` and there will be no issues because the Python program and the file it wishes to read, `data1.txt`, are in the same folder.

If the file you wish to read and your Python program are in different directories, however, then you need to specify a **path**. You can think of the filename as the short name for a file, and the path as the full name (see Figure 1 above). Typically, you will specify a *relative* file path, which provides instructions on where to find the file you wish to read, relative to the directory where the code is running from. For example, the program `myPythonProgram.py` could contain the code `open('../myData/data2.txt','r')`. The `../` means to go up one level in the directory structure, to the containing folder `allProjects`; then `myData/` gives instructions to descend into the `myData` subfolder where the file `data2.txt` is located.

There is also an option to use an *absolute* file path. For example, suppose the file structure in Figure 2 is stored on a computer in the user’s home directory,`/Users/joebloggs/myFiles`. Then code in any Python program running from *any* file folder could open `data2.txt` via `open('/Users/joebloggs/myFiles/allProjects/myData/data2.txt','r')`. You can tell an absolute file path because it begins with a `/`. 

If you were ever to move your programs and data to another computer, it will be much more convenient if you use relative file paths rather than absolute. That way, if you preserve the folder structure when moving everything, you won’t need to change your code. If you use absolute paths, then the account on the other computer will probably not have the same home directory name; it may be `/Users/janebloggs` for instance. Note that Python pathnames follow the UNIX conventions (Mac OS is a UNIX variant), rather than the Windows file pathnames that use `:` and `\`. The Python interpreter will translate to Windows pathnames when running on a Windows machine; you should be able to share your Python program between a Windows machine and a Mac without having to rewrite the file `open` commands.
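
The difference between the two kinds of path can also be checked in code. The sketch below uses `PurePosixPath` from the standard `pathlib` module, which is not otherwise used in this notebook; it only manipulates the text of a UNIX-style path and never touches the disk, so the example runs even though these folders do not exist on your computer.

```python
from pathlib import PurePosixPath

# The relative path from the example above, built piece by piece.
relative = PurePosixPath('..') / 'myData' / 'data2.txt'

# The absolute path from the example above.
absolute = PurePosixPath('/Users/joebloggs/myFiles/allProjects/myData/data2.txt')

print(relative)                 # ../myData/data2.txt
print(relative.is_absolute())   # False
print(absolute.is_absolute())   # True
print(absolute.name)            # data2.txt
```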

**Have a go:**

2. Suppose you are in a directory called `Project`. In it you have a file containing your Python code. You would like to read data from a file called `YearlyProjections.csv` which is in a folder called `CompanyData`, which itself is inside the `Project` folder. What is the best way to open the file in your Python program?


```python
A. open("YearlyProjections.csv", "r")
B. open("../CompanyData/YearlyProjections.csv", "r")
C. open("CompanyData/YearlyProjections.csv", "r")
D. open("Project/CompanyData/YearlyProjections.csv", "r")
E. open("../YearlyProjections.csv", "r")
```

<div align="right">
<a href="#Hag2" class="btn btn-default" data-toggle="collapse">Click here for the answer</a>

</div>
<div id="Hag2" class="collapse">

```python
C. open("CompanyData/YearlyProjections.csv", "r")
```
</div>

3. Which of the following paths are relative file paths?


```python
A. Stacy/Applications/README.txt
B. /Users/Raquel/Documents/graduation_plans.doc
C. /private/tmp/swtag.txt
D. ScienceData/ProjectFive/experiment_data.csv
```

<div align="right">
<a href="#Hag3" class="btn btn-default" data-toggle="collapse">Click here for the answer</a>

</div>
<div id="Hag3" class="collapse">

```python
A. Stacy/Applications/README.txt
D. ScienceData/ProjectFive/experiment_data.csv
```
```
remember absolute paths start with `/`
```

</div>

<a id="ref7"></a>
## Using `with` for Files

<div align="right"><a href="#ref00">back to top</a></div>

Now that you have been introduced to the opening and closing of files via file handles, there is another mechanism that Python provides for us that removes the need for the often forgotten `.close()`. Forgetting to close a file does not necessarily cause a runtime error in the kinds of programs we typically write in an introductory programming course. However, if we are writing a program that may run for days or weeks at a time, and that does lots of file reading and writing, we may run into trouble.

Python has the notion of a context manager which automates the process of performing common operations at the start of a task, as well as automating certain operations at the end of the task. For reading and writing a file, the normal operation is to open the file and assign it to a variable. When finished working with the file the common operation is to make sure that file is closed.

The Python `with` statement makes using a context manager easy. The general form of a `with` statement is:

<div align="left">
    
```python
with <create some object that understands context> as <some name>:
    do some stuff with the object
    ...
```
</div>

When the program exits the `with` block, the context manager handles the common stuff that normally happens at the end, in our case closing a file. A simple example should clear up this somewhat abstract discussion of contexts. Here are the contents of a file called `mydata.txt`.

<div align="left">
    
```python
1 2 3

4 5 6
```
</div>

In [None]:
with open('mydata.txt', 'r') as md:
    for line in md:
        print(line)
# continue with other code
# here
# and here...
#.....


The first line of the `with` statement opens the file and assigns it to the variable `md`. Then we can iterate over `md` in any of the usual ways. In this case we print each line of the file using `print(line)`. When we are finished we simply stop indenting and let Python take care of closing the file and cleaning up (a gentle reminder of how indenting is a key component of Python syntax). The `with` syntax may be a little confusing initially as the file object is named after the `as` keyword. Hopefully Figure 3 below can provide some clarity.


<a><img src="figures/Wk4_Fig3.png" width="400" height="180" border="10" /></a>

  <center><b>Figure 3: </b>The syntax for opening a file using a <b>with</b> statement.</center> 

This is equivalent to code that specifically closes the file at the end (see cell below); but instead the `with` approach implicitly marks the closing of the file via the ending of the indented block. This guards against forgetting the `.close()` invocation. Using `with` also closes the file even if an exception is raised inside the block. More on exceptions later.

In [None]:
md = open('mydata.txt', 'r')
for line in md:
    print(line)
md.close()
# continue with other code
# here
# and here...
#.....
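
One way to confirm that `with` really does close the file is to inspect the file object's `closed` attribute inside and outside the block. This is a minimal sketch; it writes its own copy of the sample data to a made-up file, `mydata_demo.txt`, in a temporary folder so that it does not depend on where `mydata.txt` is stored.

```python
import os
import tempfile

# Write a throwaway file with contents like those shown earlier.
path = os.path.join(tempfile.mkdtemp(), 'mydata_demo.txt')
with open(path, 'w') as out:
    out.write('1 2 3\n4 5 6\n')

with open(path, 'r') as md:
    inside = md.closed       # False: the file is still open inside the block
    first = md.readline()

outside = md.closed          # True: leaving the block closed the file

print(inside, outside)
```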


<a id="ref8"></a>
## Reading and Processing a File

<div align="right"><a href="#ref00">back to top</a></div>

A step-by-step approach to processing the contents of a text file is listed below; if you’ve followed the previous sections you’ll understand that there are other options as well. Some of those options are preferable in some situations, and some are preferred by Python programmers for efficiency reasons. For the time being the steps below will certainly suffice:


1. Open the file using `with` and `open`.

2. Use `.readlines()` to get a list of the lines of text in the file.

3. Use a `for` loop to iterate through the list, each item in the list being a line from the file. On each iteration, process that line of text (for example by using `.split()`)

4. When you are finished extracting data from the file, continue writing your code outside the indentation. Using `with` will automatically close the file once the program exits the `with` block.

For instance:

<div align="left">
    
```python
fname = "yourfile.txt"
with open(fname, 'r') as fileref:         # step 1
    lines = fileref.readlines()           # step 2
    for line in lines:                    # step 3
        fields = line.split()             # some code that uses the variable line (still step 3)
# now outside the indentation, some other code not relying on fileref  # step 4

```
</div>

This approach may not be so good to use when you are working with big data. Imagine working with a datafile that has 10,000,000 rows of data (and many common data files do, some may have billions of rows). It would take a long time to read in all the data and then if you had to iterate over it, even more time would be necessary. This would be a case where programmers would employ another option for efficiency reasons.

One option involves iterating over the file itself while still iterating over each line in the file:


<div align="left">
    
```python
fname = "yourfile.txt"
with open(fname, 'r') as fileref:         # step 1
                                          # note the original step 2 has been dropped
    for line in fileref:                  # step 3 (lines replaced by fileref)
        fields = line.split()             # some code that uses the variable line (still step 3)
# now outside the indentation, some other code not relying on fileref  # step 4

```
</div>

Another approach that will be of value is using the `read` method. We don’t have to read the entire file contents, for example, we can read the first n characters by entering `n` as a parameter to the `read` method `.read(n)`:


In [None]:
with open("sports.txt","r") as file1:
    print(file1.read(4)) # this will output the first 4 characters

Once the method `.read(4)` is called the first 4 characters are retrieved. If we call the method again, the *next* 4 characters are retrieved. The output for the following cell will demonstrate the process for different inputs to `read()`:

In [None]:
with open("sports.txt","r") as file1:
    print(file1.read(4))
    print(file1.read(4))
    print(file1.read(7))
    print(file1.read(15))


Not particularly tidy-looking output, but the Python interpreter cannot spell or read our minds; it simply outputs exactly what we asked it to.

 The general process is illustrated in Figure 4 (*in the figure `file1.read(7)` should be `file1.read(6)`*), and each colour represents the part of the file read after each time `.read()` is called:


<a><img src="figures/Wk4_Fig5.png" width="400" height="180" border="10" /></a>
 
<p></p>
<center><b>Figure 4:</b> Illustration using <b>.read()</b> to call different characters </center> 

 Here is an example using different values, chosen to produce tidier output: 

In [None]:
with open("sports.txt","r") as file1:
    print("first 30 characters: " + file1.read(30))
    print("This is the first 30 characters, isn't it?")
   

We can also read one line of the file at a time using the method **readline()**: 

In [None]:
with open("sports.txt","r") as file1:
    print("first line: " + file1.readline())


 We can use a `for` loop to iterate through each line: 


In [None]:
with open("sports.txt","r") as file1:
    i = 0
    for line in file1:
        print("Iteration", str(i), ":", line)
        i = i + 1

We can use the `readlines` method to save the text file to a list: 

In [None]:
with open("sports.txt","r") as file1:
    FileasList=file1.readlines()

 Each element of the list corresponds to a line of text:

In [None]:
FileasList[0]

In [None]:
FileasList[1]

In [None]:
FileasList[2]

Notice how the `\n` character does not appear in the output if we request Python to `print(FileasList[2])` as opposed to `FileasList[2]`

In [None]:
print(FileasList[2])

<a id="ref9"></a>
## Writing Text Files

<div align="right"><a href="#ref00">back to top</a></div>

One of the most commonly performed data processing tasks is to read data from a file, manipulate it in some way, and then write the resulting data out to a new data file to be used for other purposes later. To accomplish this, the `open` function discussed above can also be used to create a new file prepared for writing. Note in Table 1 that the only difference between opening a file for writing and opening a file for reading is the use of the `'w'` flag instead of the `'r'` flag as the second parameter. When we open a file for writing, a new, empty file with that name is created and made ready to accept our data. If an existing file has the same name, its contents are overwritten. As before, the `open` function returns a reference to the new file object.

Table 2 shows one additional method on file objects that we have not used thus far. The `write` method allows us to add data to a text file. Recall that text files contain sequences of characters. We usually think of these character sequences as being the lines of the file where each line ends with the newline `\n` character. Be very careful to notice that the `write` method takes one parameter, a string. When invoked, the characters of the string will be added to the end of the file. This means that it is the programmer’s job to include the newline characters as part of the string if desired.

Assume that we have been asked to provide a file consisting of all the squared numbers from 1 to 12.

First, we will need to open the file. Afterwards, we will iterate through the numbers 1 to 12, and square each one of them. The resulting number will need to be converted to a string, then it can be written into the file.

The program below solves part of the problem. We first want to make sure that we’ve written the correct code to calculate the square of each number.

In [None]:
for number in range(1, 13):
    square = number * number
    print(square)

When we run this program, we see the lines of output on the screen. Once we are satisfied that it is creating the appropriate output, the next step is to add the necessary pieces to produce an output file and write the data lines to it. To begin, we need to open a new output file by calling the `open` function, `outfile = open("squared_numbers.txt",'w')`, using the `'w'` flag. We can choose any file name we like. If the file does not exist, it will be created. However, if the file does exist, it will be reinitialised as empty and you will lose any previous contents.

Once the file has been created, we just need to call the `write` method passing the string that we wish to add to the file. In this case, the string is already being printed so we will just change the print into a call to the `write` method. However, there is an additional step to take since the write method can only accept a string as input. We’ll need to convert the number to a string. Then we just need to add one extra character to the string; the newline character needs to be concatenated to the end of the line. The entire line now becomes `outfile.write(str(square)+ '\n')`. The print statement automatically outputs a newline character after whatever text it outputs, but the write method does not do that automatically. We also need to close the file when we are done.

The complete program is shown below.

In [None]:
filename = "squared_numbers.txt"
outfile = open(filename, "w")

for number in range(1, 13):
    square = number * number
    outfile.write(str(square) + "\n")
outfile.close()

# read the file back in to check what was written
infile = open(filename, "r")
print(infile.read())
infile.close()

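The same program can also be written with the `with` statement introduced earlier in this notebook. This is only a sketch of one possible alternative; it writes the same file and then reads the numbers back into a list called `squares` (a name chosen here for illustration).

```python
filename = "squared_numbers.txt"

# the file is closed automatically when the with block ends
with open(filename, "w") as outfile:
    for number in range(1, 13):
        outfile.write(str(number * number) + "\n")

squares = []
with open(filename, "r") as infile:
    for line in infile:
        squares.append(int(line))   # int() ignores the trailing newline

print(squares)
```

Because each `with` block closes its file, there is no call to `close` to forget.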

<a id="ref10"></a>
## CSV Format

<div align="right"><a href="#ref00">back to top</a></div>

CSV stands for Comma Separated Values. If you write tabular data to CSV format, it can be easily imported into other programs such as Excel, Google Sheets, or a statistics package (R, Stata, SPSS, etc.).

For instance, we could create a file with the contents below; if we save it with a file name `grades.csv`, then it could be imported directly into any of those software packages. The first line gives the column names and the following lines each give data for one row.

<div align="left">
    
```
Name,score,grade
Isabelle,93,A+
Faye,72,A
Darragh,67,B+
Martin,99,A+
Brian,57,C+
Caitlin,55,C+
   
```
</div>







<a id="ref11"></a>
## Reading data from a CSV File

<div align="right"><a href="#ref00">back to top</a></div>

We are able to read CSV files the same way we have read other text files. Because of the standardised structure of the data, there is a common pattern for processing it. To demonstrate this we will use data from `sports.txt`.

Typically, CSV files will have a header as the first line, which contains column names. Then each subsequent row in the file will contain data that corresponds to the appropriate columns. The rows are referred to as observations; each row represents one observation (e.g., one athlete).

All the file methods and approaches we have considered (`read`, `readline`, `readlines`, and simply iterating over the file object itself) work on CSV files.

In the example below we iterate over the lines. Because the values on each line are separated with commas, we can use the `.split()` method to parse each line into a collection of separate values. Spend as much time as it takes working through the cell below; pay close attention to the comments as well as the code. Code would generally not be this heavily commented but here the comments provide important clarification and direction to help you understand exactly what is happening.

In [None]:
fileconnection = open("data/sports.txt", 'r')  # creating the file object and assigning
                                          # it to the variable fileconnection
lines = fileconnection.readlines()  # creating a list whereby each item in the list is a 
                                    # line from sports.txt 
fileconnection.close()  # every line is now stored in the list, so the file can be closed
header = lines[0]   # taking the 1st list item (i.e.,1st line/1st row of file) and assigning 
                    # it to the variable header (a string variable)
field_names = header.strip().split(',') # we strip off any white space at each end of the string 
                                        # we have called header, then split the string 
                                        # (using comma(,) as the delimiter) and save it to a 
                                        # list called field_names. Thus each item in the list 
                                        # field_names is a word from the 1st line of sports.txt                                        
print(field_names)  # this sends the contents of the list field_names 
                    # to the console where we can see it
for row in lines[1:]:   # we are going to loop over every subsequent item in lines 
                        # (i.e., every subsequent line in the file sports.txt)
    vals = row.strip().split(',') # here we are stripping and splitting each subsequent line 
                                  # one at a time and assigning the results to the list variable vals
    if vals[5] != "NA": # the indented code below runs only when the vals[5] entry is 
                        # not NA, i.e., only when the athlete has won a medal.
        print("{}: {}; {}.".format(    # if the athlete has won a medal we print out 
                vals[0],              # vals[0](athlete's name),vals[4](athlete's event)  
                vals[4],              # and vals[5](colour of the athlete's medal )
                vals[5]))
        

In the above code, we open the file, `sports.txt`, which we worked with earlier in the notebook.

We split the first row to get the field names. We split the other rows to get values. Note that we specify that the split should happen on commas by passing `','` as a parameter. Also note that we first pass the row through the `strip()` method to get rid of the trailing `\n`.
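
To see what `strip` and `split` do to a single line, here is a minimal sketch. The string `row` is a sample line typed in by hand, assuming the six columns Name, Sex, Age, Team, Event and Medal; it is not read from `sports.txt`.

```python
# a hand-typed sample line, assuming six comma separated columns
row = "Edgar Lindenau Aabye,M,34,Denmark/Sweden,Tug-Of-War,Gold\n"

unstripped = row.split(',')
print(unstripped)            # the last item still carries the newline: 'Gold\n'

vals = row.strip().split(',')
print(vals)                  # the last item is now just 'Gold'

# parallel assignment: one variable for each of the six values
name, sex, age, team, event, medal = vals
age = int(age)               # split always produces strings, so convert numbers
print(name, age, medal)
```

Parallel assignment only works when the number of variables on the left matches the number of values in the list.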

Once we have parsed a line into separate values, we can use those values in the program. For example, in the code above, we select only those rows where the athlete won a medal, and we print out only three of the fields and in a different format.

Note that the trick of splitting the text for each row based on the presence of commas only works because commas are not used in any of the field values. Suppose that some of our events were more specific, and used commas. For example, `"Swimming, 100M Freestyle"`. How will a program processing a `.csv` file know when a comma is separating columns as opposed to just being part of the text string within a column?

The CSV format is actually a little more general than we have described and has a couple of solutions for that problem. One alternative format uses a different column separator, such as `|` or a tab (`\t`). Sometimes, when a tab is used, the format is called TSV, for tab-separated values. If you get a file using a different separator, you can use `.split('|')` or `.split('\t')`.
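
A short sketch of splitting on other separators is given below. Both sample strings are typed in by hand for illustration (they are not taken from a real file), and the event deliberately contains a comma.

```python
# hand-typed sample lines; the event contains a comma on purpose
pipe_row = "Christine Jacoba Aaftink|F|21|Netherlands|Speed Skating, 1500M|NA\n"
tab_row = "Christine Jacoba Aaftink\tF\t21\tNetherlands\tSpeed Skating, 1500M\tNA\n"

pipe_vals = pipe_row.strip().split('|')
tab_vals = tab_row.strip().split('\t')

print(pipe_vals)
print(pipe_vals == tab_vals)          # both approaches give the same six values

# splitting the same text on commas would cut the event in two
print(len(pipe_row.strip().split(',')))
```

With `|` or a tab as the separator, the comma inside the event name is treated as ordinary text.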

Another advanced CSV format uses commas to separate but encloses all values in double quotes.

For example, the data file might look like:


<div align="left">
    
```
"Name","Sex","Age","Team","Event","Medal"
"A Dijiang","M","24","China","Basketball","NA"
"Edgar Lindenau Aabye","M","34","Denmark/Sweden","Tug-Of-War","Gold"
"Christine Jacoba Aaftink","F","21","Netherlands","Speed Skating, 1500M","NA"
   
```
</div>



If you are reading a `.csv` file that has enclosed all values in double quotes, it’s actually a pretty tricky programming problem to split the text for one row into a list of values. You won’t want to try to do it directly. Instead, you should use the `csv` module from Python’s standard library. However, there’s a steepish learning curve for that, so it is best to get a sound understanding of reading CSV format by first learning to read the straightforward, unquoted format and split lines on commas.
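
For reference only, here is a minimal sketch of how the `csv` module copes with the quoted data shown above. To keep the cell self-contained, the text is held in a string and `io.StringIO` stands in for an opened file; with a real file you would pass the file object to `csv.reader` instead.

```python
import csv
import io

quoted_text = '''"Name","Sex","Age","Team","Event","Medal"
"A Dijiang","M","24","China","Basketball","NA"
"Edgar Lindenau Aabye","M","34","Denmark/Sweden","Tug-Of-War","Gold"
"Christine Jacoba Aaftink","F","21","Netherlands","Speed Skating, 1500M","NA"
'''

# io.StringIO makes the string behave like a file opened for reading
reader = csv.reader(io.StringIO(quoted_text))
rows = list(reader)          # each row is a list of strings, quotes removed

field_names = rows[0]
print(field_names)
for vals in rows[1:]:
    print(vals[0], '|', vals[4])
```

Each row comes back with exactly six values, and the comma inside `Speed Skating, 1500M` is left alone.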

<a id="ref12"></a>
## Writing data to a CSV File

<div align="right"><a href="#ref00">back to top</a></div>

The typical pattern for writing data to a CSV file is to write a header row and loop through the items in a list, outputting one row for each. Here we have a list of tuples, each representing one client, in the form `("Name", Age, "Company")`.

In [None]:
customers = [("John Smith", 53, "Mallaghan Finance"),   # ("John Smith", 53, "Mallaghan Finance") is an 
            ("Brian McAuley", 20, "New Money"),         # example of a tuple; a three-tuple in fact.
            ("Patrick Wallace", 44, "Clear Sky Assets"),
            ("Declan Tiernan", 48, "Oakfield Associates Ltd")]

outfile = open("best_customers.csv","w")
# output the header row
outfile.write('Name,Age,Company')
outfile.write('\n')
# output each of the rows:
for customer in customers:
    row_string = '{},{},{}'.format(customer[0], customer[1], customer[2]) # or more succinctly row_string = '{},{},{}'.format(*customer)
    outfile.write(row_string)
    outfile.write('\n')
outfile.close()


There are a few things worth noting in the code above.

Firstly, using `.format()` makes it nice and clear what we’re doing when we create the variable `row_string`. We are making a comma separated set of values; the `{}` curly brackets indicate where to substitute the actual values. The equivalent string concatenation would be hard to read. An alternative, also clear, way to do it would be with the `join` method. Note that `join` takes a single sequence of strings, so the values go in a list and the age must first be converted with `str`: `row_string = ','.join([customer[0], str(customer[1]), customer[2]])`.

Secondly, unlike the `print` function, remember that the `write` method on a file object does not automatically insert a newline. Instead, we have to explicitly add the character `\n` at the end of each line.

Thirdly, we must explicitly refer to each of the elements of `customer` when building the string to write. Note that just putting `.format(customer)` wouldn’t work because the interpreter would see only one value (a tuple) when it was expecting three values to try to substitute into the string template. In the previous notebook [Sequences II](ZSequences-II.ipynb) we saw that Python provides an advanced technique for automatically unpacking the three values from the tuple, with `.format(*customer)`.
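
The sketch below compares these ways of building one row, using the first tuple from the list above. The names `row_a`, `row_b`, `row_c` and `parts` are chosen here purely for illustration.

```python
customer = ("John Smith", 53, "Mallaghan Finance")

# naming each element explicitly
row_a = '{},{},{}'.format(customer[0], customer[1], customer[2])

# unpacking the tuple with *
row_b = '{},{},{}'.format(*customer)

# join needs a sequence of strings, so convert each item first
parts = []
for item in customer:
    parts.append(str(item))
row_c = ','.join(parts)

print(row_a)
print(row_a == row_b == row_c)

# passing the whole tuple supplies one value where three are expected
try:
    '{},{},{}'.format(customer)
except IndexError as error:
    print("format(customer) failed:", error)
```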

As described previously, if one or more columns contain text, and that text could contain commas, we need to do something to distinguish a comma in the text from a comma that is separating different values (cells in the table). If we want to enclose each value in double quotes, it can start to get a little tricky, because we would need to have the double quote character `"` inside the string output. It is doable, though. Indeed, one reason Python allows strings to be delimited with either single quotes or double quotes is so that one can be used to delimit the string and the other can be a character in the string. If you get to the point where you need to quote all of the values, I recommend learning to use Python’s `csv` module.

In [None]:
customers = [("John Smith", 53, "Mallaghan Finance"),
            ("Brian McAuley", 20, "New Money"),
            ("Patrick Wallace", 44, "Clear Sky Assets"),
            ("Declan Tiernan", 48, "Oakfield Associates Ltd")]

outfile = open("best_customers.csv","w")
# output the header row
outfile.write('Name,Age,Company')
outfile.write('\n')
# output each of the rows:
for customer in customers:
    row_string = '"{}","{}","{}"'.format(customer[0], customer[1], customer[2])
    outfile.write(row_string)
    outfile.write('\n')
outfile.close()

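For comparison, here is a minimal sketch of the same task using the `csv` module. The output file name `best_customers_quoted.csv` is an assumption made for this illustration. Passing `quoting=csv.QUOTE_ALL` asks the writer to put every value in double quotes, including the header and the ages.

```python
import csv

customers = [("John Smith", 53, "Mallaghan Finance"),
             ("Brian McAuley", 20, "New Money"),
             ("Patrick Wallace", 44, "Clear Sky Assets"),
             ("Declan Tiernan", 48, "Oakfield Associates Ltd")]

# best_customers_quoted.csv is a hypothetical file name for this illustration;
# newline="" is what the csv module documentation recommends when opening files
with open("best_customers_quoted.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    writer.writerow(["Name", "Age", "Company"])   # header row
    for customer in customers:
        writer.writerow(customer)                 # the writer adds the line ending

with open("best_customers_quoted.csv", "r") as infile:
    text = infile.read()

print(text)
```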
<a id="ref13"></a>
## Handling Files

<div align="right"><a href="#ref00">back to top</a></div>

When working with files, there are a few things to keep in mind. When naming files, it’s wiser to not include spaces. While most operating systems can handle files with spaces in their names, not all can.

Additionally, suffixes in file names, for example the `.txt` in `FileNameExample.txt`, are not magic. Instead, these suffixes are a convention. For some operating systems the suffixes have no special significance, and only have meaning when used in a program. Other operating systems infer information from the suffixes; for example, `.exe` is a suffix that means a file is executable.

It’s a good idea to follow the conventions. If a file contains CSV formatted data, name it with the extension `.csv`, not `.txt`. A Python program will be able to read it either way, but if you follow convention you will help other people ascertain what’s in the file. And you will also help the computer’s operating system to predict what application program it should open when you double-click on the file.

<a id="ref14"></a>
## Glossary

<div align="right"><a href="#ref00">back to top</a></div>

* **open:** we must open a file before we can read its contents (NB: `open` is a function, everything else in this glossary is a method).
* **close:** when we are finished with a file, we should close it.
* **read:** will read the entire contents of a file as a string. This is often used in an assignment statement so that a variable can reference the contents of the file.
* **readline:** will read a single line from the file, up to and including the first instance of the newline character.
* **readlines:** will read the entire contents of a file into a list where each line of the file is a string and is an element in the list.
* **write:** will add characters to the end of a file that has been opened for writing.
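
The glossary entries can be tried out together in one short sketch. The file name `glossary_demo.txt` is made up for this illustration.

```python
# glossary_demo.txt is a hypothetical file name used only for this illustration
outfile = open("glossary_demo.txt", "w")
outfile.write("line one\nline two\nline three\n")
outfile.close()

infile = open("glossary_demo.txt", "r")
first = infile.readline()    # a single line, including its newline
rest = infile.readlines()    # the remaining lines, as a list of strings
leftover = infile.read()     # an empty string: nothing is left to read
infile.close()

print(repr(first))
print(rest)
print(repr(leftover))
```

Each reading method carries on from where the previous one stopped, which is why `read` returns an empty string here.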

<a id="ref15"></a>
## Exercises

<div align="right"><a href="#ref00">back to top</a></div>

You will need to access the following data files for these exercises: `student_data.txt`; `travel_plans.txt`; `emotion_words.txt`; they are located in the appropriate folder. I have printed out their content below as a visual aid to planning, debugging and checking your solutions, but you will still need to `open` the actual files before using the data.

`student_data.txt`

<div align="left">
    
```
joe 10 15 20 30 40
bill 23 16 19 22
sue 8 22 17 14 32 17 24 21 2 9 11 17
grace 12 28 21 45 26 10
john 14 32 25 16 89
   
```
</div>


`travel_plans.txt`

<div align="left">
    
```
This summer I will be travelling.
I will go to...
Italy: Rome
Greece: Athens
England: London, Manchester
France: Paris, Nice, Lyon
Spain: Madrid, Barcelona, Granada
Austria: Vienna
I will probably not even want to come back!
However, I wonder how I will get by with all the different languages.
I only know English!
   
```
</div>


`emotion_words.txt`

<div align="left">
    
```
Sad upset blue down melancholy somber bitter troubled
Angry mad jittery enraged irate irritable wrathful outraged infuriated
Happy cheerful content elated joyous delighted lively glad
Confused disoriented puzzled perplexed dazed befuddled
Excited eager thrilled delighted
Scared afraid fearful panicked terrified petrified startled
Nervous anxious jittery jumpy tense uneasy apprehensive

```
</div>

1. The file `student_data.txt` contains one line for each student in an imaginary class. The student's name is the first entry on each line, followed by some exam scores. The number of scores may be different for different students. Write a program that prints out the names of students who have six or more exam scores.

In [None]:
# Write your code here.



<div align="right">
<a href="#Hag10" class="btn btn-default" data-toggle="collapse">Click here for the answer</a>

</div>
<div id="Hag10" class="collapse">

```python
f = open("student_data.txt", "r")

for aline in f:
    items = aline.strip().split()
    if len(items[1:]) >= 6:
        print(items[0])

f.close()
```
</div>

2. Create a list called `destination` using the data stored in `travel_plans.txt`. Each element of the list should contain a line from the file that lists a country and cities inside that country. Hint: each line that has this information also has a colon : in it.

In [None]:
# Write your code here





<div align="right">
<a href="#Hag11" class="btn btn-default" data-toggle="collapse">Click here for the answer</a>

</div>
<div id="Hag11" class="collapse">

```python

f = open("travel_plans.txt", "r")

destination = []
for aline in f:
    if ':' in aline:
        destination.append(aline.strip())
print(destination)

f.close()



```
</div>

3. Create a list called `j_emotions` that contains every word in `emotion_words.txt` that begins with the letter `j`.

In [None]:
# Enter your code here




<div align="right">
<a href="#Hag9912" class="btn btn-default" data-toggle="collapse">Click here for the answer</a>

</div>
<div id="Hag9912" class="collapse">

```python

f = open("emotion_words.txt", "r")

j_emotions = []
for aword in f.read().strip().split():
    if aword[0] == 'j':
        j_emotions.append(aword)
print(j_emotions)

f.close()


```
</div>

< [Sequences II](ZSequences-II.ipynb) | [PyFinLab Index](ALWAYS-START-HERE.ipynb) | [Dictionaries](ZDictionaries.ipynb) >

<div align="right"><a href="#ref00">back to top</a></div>