# BB1000 - Lecture 6: File I/O and Pandas

Josefine H. Andersen

In [None]:
import numpy as np
import pandas as pd

## File I/O

### File with text

In [None]:
# Typical call to open():
# open("filename", "mode")

A file is opened by use of the function `open()`. The syntax is as follows:
`file = open("filename", "mode")`

The `"mode"` input indicates the *mode* in which we open the file. We will use:
- `"r"`: opens the file as read-only. If the file does not exist, an error is raised.
- `"w"`: opens the file for writing. If the file does not exist, it is created. If the file does exist, *its content is deleted*, so we write to an empty file.
- `"r+"`: opens the file for both reading and writing. If the file does not exist, an error is raised.

For more info see, e.g., [this link](https://tutorial.eyehunts.com/python/python-file-modes-open-write-append-r-r-w-w-x-etc/)
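The behaviour of the three modes listed above can be checked with a small, self-contained sketch. It works in a temporary directory so that none of the lecture files are touched; the file name `modes_demo.txt` and the variable names are made up for this illustration.

```python
import os
import tempfile

# Work in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes_demo.txt")  # hypothetical file name

    # "r" on a file that does not exist raises FileNotFoundError
    try:
        open(path, "r")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True

    # "w" creates the file because it does not exist yet
    f = open(path, "w")
    f.write("first version\n")
    f.close()

    # "w" on an existing file deletes the old content before writing
    f = open(path, "w")
    f.write("second version\n")
    f.close()

    # "r+" allows both reading and writing on an existing file
    f = open(path, "r+")
    content_after_rewrite = f.read()  # only "second version" is left
    f.write("added with r+\n")        # after read() we are at the end of the file
    f.seek(0)                         # go back to the beginning
    final_content = f.read()
    f.close()

print(missing_raised)         # True
print(content_after_rewrite)
print(final_content)
```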

In [None]:
# Here, we open a file in write mode, the file does not exist yet.
# We will name the file "file_created.txt"
created_file = open("file_created.txt", "w") # open a file named "file_created.txt" in write-mode and 
                                             # "save" it in variable "created_file" 
created_file.close() # close the file again -- THIS IS IMPORTANT TO KEEP TRACK OF, TO AVOID BUGS
# you can now see the file in your file browser on the left
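As an alternative to calling `.close()` by hand, Python's `with` statement closes the file automatically when the indented block ends. This is not the approach used in the rest of this lecture, just a minimal sketch of a common pattern; the file name `with_demo.txt` is hypothetical and the file is written to a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "with_demo.txt")  # hypothetical file name

    # The file is open inside the indented block ...
    with open(path, "w") as demo_file:
        demo_file.write("hello\n")
        closed_inside = demo_file.closed

    # ... and closed automatically as soon as the block ends
    closed_after = demo_file.closed

print(closed_inside, closed_after)  # False True
```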

In [None]:
# Here, we open a file in read-only mode
# The name of the file is "text_file.txt" and you can find it on Canvas Module 6.
# Make sure that the file is in your Lecture_06 directory and that 
# you can open it here in Jupyter-lab
text_file_name = "text_file.txt" # "save" the file name as a variable
text_file = open(text_file_name,"r") # open the file in read-mode and "save" it in variable "text_file" 

There are different functions to get the content of the file. We will use `read()`.

**BEWARE** that once you have read the file, the program will be "at the end" of the file. If you run the next cell twice, nothing will be printed the second time. A solution to this is described in a later cell.

In [None]:
text_file_content = text_file.read() # read the content of the file from the variable
                                     # "text_file" which contains the file information
                                     # and save it in variable "text_file_content"
print(text_file_content) # print the text_file_content that we just read
                         # compare the output to the file you have opened here in Jupyter lab

If we want to read the file again from the top, we can reset the "position" in the file with `.seek()`

In [None]:
text_file.seek(0, 0) # move to position 0 (first input) counted from the start of the file (second input, 0),
                     # i.e. back to the very beginning of the file
# this cell will print out "0": seek() returns the new position in the file (not a line number)
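The effect of reading twice, and of resetting the position with `.seek()`, can be reproduced without the lecture file. The sketch below writes a small two-line file to a temporary directory (the file name and its content are made up) and then reads it three times.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "seek_demo.txt")  # hypothetical file name
    demo_file = open(path, "w")
    demo_file.write("line one\nline two\n")
    demo_file.close()

    demo_file = open(path, "r")
    first_read = demo_file.read()        # the whole file
    second_read = demo_file.read()       # empty string: we are already at the end
    new_position = demo_file.seek(0, 0)  # seek() returns the new position
    third_read = demo_file.read()        # the whole file again
    demo_file.close()

print(repr(second_read))         # ''
print(new_position)              # 0
print(third_read == first_read)  # True
```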

In [None]:
text_file.close() # we close the file to prepare for the next part

In [None]:
text_file = open(text_file_name,"r+") # opens the file in read+write mode

In [None]:
# we can loop through the lines in the file, similar to what we have done with arrays
for line in text_file: # loop through the file. The iterator "line" will contain the line in
                       # the file that we are currently looking at in the loop
    print(line) # print the line in each iteration. Compare to the output from the print of the file content above
                # (each line already ends with a line break, so print() leaves an empty line between the lines)

Looping through the lines in the file can be useful, for example, if we want to search for a specific string.
In the following, we will look for the line "This is the line to be printed by josefine!" (find the line in the file that you have opened here in Jupyter lab).

In [None]:
text_file.seek(0,0) # move back to the beginning of the file before we loop.

# loop through lines in file and search for a line containing the word "josefine"
for i, line in enumerate(text_file, 1): # the "enumerate" function adds a counter to the loop.
                                        # This is the "i" in "for i, line ...".
                                        # The "1" in "enumerate(text_file, 1)" means that the function
                                        # will count from 1 (and not zero, as is default)
    if 'josefine' in line: # a conditional statement. If the string is found in the line,
                           # the following code will be executed
        print(i, line) # print the counter "i" and the "line" where the word "josefine" occurs
        # now, we make a variable called "sentence" which will contain a string.
        sentence = "Josefine's line was line number " + str(i) + '\n' # in order to add the line counter "i"
                        # to the string, we must convert it to a string data type. This is done with "str()"
                        # The string '\n' is the newline character. I.e., if you print/write
                        # another string after "sentence", it will be on a new line

print(sentence) # print the sentence we defined above. Does it match what you see in the text file?
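Note that `sentence` is only defined if the word is actually found; otherwise the `print(sentence)` above raises a `NameError`. One way to guard against this is sketched below. The helper `find_line_number` and the list `demo_lines` are hypothetical stand-ins (a list of strings is used in place of the open file), and unlike the loop above the helper stops at the first match.

```python
def find_line_number(lines, word):
    """Return the 1-based number of the first line containing word, or None."""
    for i, line in enumerate(lines, 1):
        if word in line:
            return i
    return None


# Stand-in for the lines of a file (made up for this illustration)
demo_lines = [
    "first line\n",
    "This is the line to be printed by josefine!\n",
    "last line\n",
]

hit = find_line_number(demo_lines, "josefine")          # 2
miss = find_line_number(demo_lines, "not in the file")  # None

# Only build the sentence if the word was found
if hit is not None:
    sentence = "Josefine's line was line number " + str(hit) + "\n"
    print(sentence)

print(miss)  # None
```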

Now, we want to write the sentence we defined above to the file. For this, we use the function `.write()`

In [None]:
# First, we make sure we are at the end of the file
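text_file.seek(0,2) # the second input, "2", means that the position is counted from the end of the file,
                    # so with the first input 0 we move to the very end of the file.
                    # seek() returns the new position in the file (not a line number)

text_file.write(sentence) # write the variable "sentence" to the text_file
                          # write() returns the number of characters written. A notebook cell displays the
                          # value of its last expression, so this is the number shown below the cell (e.g. 35)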

In [None]:
# reset our position to the beginning of the file
text_file.seek(0,0)
# print the content. Do you see the new line in the output?
print( text_file.read() )

# To see the changes directly in the file, you might have to close and open it again in jupyter-lab

In [None]:
# close the file
text_file.close()

### Numpy arrays
Now we want to read and write numpy arrays from and to text files. To save an array, we use the numpy function `np.savetxt()` and to read into an array, we use `np.genfromtxt()`.

In [None]:
# First, we create a 2x2 array with random numbers
array2d = np.random.rand(2,2) # 2 rows, 2 columns
print(array2d)

In [None]:
# Write the array to a file named "data_file.txt"
np.savetxt('data_file.txt', array2d)

In [None]:
# We now open and read+print the file as a regular text file
file = open('data_file.txt','r') # open file named "data_file.txt" in read-mode
print( file.read() ) # print the file content
file.close() # close the file
# In the output, do you recognize your array?

In [None]:
# Here we create an np array from the data in the file "data_file.txt"
array_from_file = np.genfromtxt('data_file.txt')
print(array_from_file)
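To confirm that the array read from the file matches the array that was saved, the two can be compared with `np.allclose()`. The sketch below repeats the save/read round trip in a temporary directory; the file name `roundtrip_demo.txt` and the variable names are made up for this illustration.

```python
import os
import tempfile

import numpy as np

original = np.random.rand(2, 2)  # 2 rows, 2 columns

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip_demo.txt")  # hypothetical file name
    np.savetxt(path, original)      # write the array as text
    restored = np.genfromtxt(path)  # read it back into a new array

print(restored.shape)                   # (2, 2)
print(np.allclose(original, restored))  # True
```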

Now we want to combine the array we read from file with another array of the same dimension using `np.append`.

In [None]:
# make a new 2D array filled with random numbers
new_array2d = np.random.rand(2,2)
print(new_array2d)

In [None]:
# combine the two arrays along axis 1 (columns) 
combined_array = np.append(array_from_file, new_array2d, axis = 1)
print(combined_array.shape) # print the shape of the array, i.e. the number of rows and columns
print(combined_array) 
# Do you recognize the values?
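The `axis` argument decides how the two arrays are combined. The sketch below uses two small arrays with made-up values to compare `axis=1`, `axis=0` and leaving `axis` out; in the last case `np.append` flattens both arrays into a 1D array.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[5.0, 6.0], [7.0, 8.0]])

side_by_side = np.append(a, b, axis=1)  # new columns: shape (2, 4)
stacked = np.append(a, b, axis=0)       # new rows: shape (4, 2)
flattened = np.append(a, b)             # no axis: both arrays are flattened, shape (8,)

print(side_by_side.shape, stacked.shape, flattened.shape)  # (2, 4) (4, 2) (8,)
print(side_by_side)
```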

In [None]:
# Here we save the combined_array in a new file named "combined_array_file.txt" with non-default delimiter
# The delimiter is the "separator" between the values in the array that we are saving in the file.
np.savetxt('combined_array_file.txt', combined_array, delimiter=' , ') # write array to text file as comma-separated values

In [None]:
# open, print, close new data file
new_data = open('combined_array_file.txt','r') # read-mode
print(new_data.read())
new_data.close()
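A file written with a non-default delimiter should be read back with a matching `delimiter` argument. The sketch below is one way to do this with `np.genfromtxt()`, assuming the same `' , '` separator as above; `autostrip=True` removes the spaces around each value. The array values and the file name `delimiter_demo.txt` are made up for this illustration.

```python
import os
import tempfile

import numpy as np

values = np.array([[0.1, 0.2, 0.3, 0.4],
                   [0.5, 0.6, 0.7, 0.8]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "delimiter_demo.txt")  # hypothetical file name
    np.savetxt(path, values, delimiter=" , ")  # comma-separated, with spaces around the comma
    # Split each line at the commas and strip the surrounding spaces
    read_back = np.genfromtxt(path, delimiter=",", autostrip=True)

print(read_back.shape)                 # (2, 4)
print(np.allclose(values, read_back))  # True
```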

## Pandas
**Documentation for the pandas package**:
[https://pandas.pydata.org/docs/reference/index.html#api](https://pandas.pydata.org/docs/reference/index.html#api)


### Creating a data frame 
We want to create a pandas dataframe from our data saved in "data_file.txt".

In [None]:
# read data from file into a numpy array
array_from_file = np.genfromtxt('data_file.txt')
print(array_from_file)

In [None]:
# Here we create a pandas data frame with our array_from_file as data, and the strings 'A' and 'B'
# as column names
titles = ['A','B'] # create a list with column names.
df_mydata = pd.DataFrame(array_from_file, columns=titles) # create a pandas dataframe named "df_mydata"

In [None]:
# print the data frame
df_mydata

In [None]:
# We can extract selected columns from the data frame. Here, we print the column with name 'A'
df_mydata['A']
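Selecting with a single column name returns a pandas `Series`, while selecting with a list of names returns a new `DataFrame`. The sketch below shows the difference on a small data frame with made-up values; `demo_df` is a hypothetical stand-in for `df_mydata`.

```python
import numpy as np
import pandas as pd

# Small stand-in data frame (values made up for this illustration)
demo_df = pd.DataFrame(np.array([[1.0, 2.0], [3.0, 4.0]]), columns=["A", "B"])

col_a = demo_df["A"]             # one column name: a Series
both_cols = demo_df[["A", "B"]]  # a list of column names: a DataFrame

print(type(col_a).__name__)      # Series
print(type(both_cols).__name__)  # DataFrame
print(col_a.tolist())            # [1.0, 3.0]
```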

We now want to add a single column named 'C' to df_mydata.

In [None]:
# First, create a new array with same # of rows as data from file
new_data = np.arange(2)
print(new_data)

In [None]:
# Create a column with name "C" and save the new_data
df_mydata['C'] = new_data

In [None]:
# There are now three data columns in our data frame
df_mydata
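The new column must have the same number of rows as the data frame. The sketch below, using a hypothetical stand-in `demo_df` with made-up values, adds a column of matching length and then shows that an array with the wrong length is rejected with a `ValueError`.

```python
import numpy as np
import pandas as pd

# Small stand-in data frame (values made up for this illustration)
demo_df = pd.DataFrame(np.array([[1.0, 2.0], [3.0, 4.0]]), columns=["A", "B"])

demo_df["C"] = np.arange(2)  # 2 values for 2 rows: works
print(demo_df.shape)         # (2, 3)

try:
    demo_df["D"] = np.arange(3)  # 3 values for 2 rows: rejected
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(mismatch_raised)        # True
print(list(demo_df.columns))  # ['A', 'B', 'C']
```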

## The end