# Reading in data from files

In this section, we look at ways to read and write data. We start with Python's basic (built-in) file methods and then use the loadtxt and genfromtxt functions from numpy. We also discuss lambda functions as a way to prep data as it is being read.

### Basic File Actions: Open, do something and close
Python allows us to open a file, perform an action on it (reading, writing, appending) and then close the file.  We choose the mode of the file to be consistent with the action that we want to take.  The available modes and actions are:
* r = read only mode
* a = append mode
* w = write mode 
* r+ = read and write mode

In [12]:
#-# Open the file
# ourName = open( "filename" , "mode")
fileToRead = open( "random.txt", "r" )

#-# Perform the action (read)
contents = fileToRead.read()

#-# Close the file
fileToRead.close()

#-# Print out what was read with the type
print( type(contents) )
print( contents )

<class 'str'>
0 10
1 21
2 32
3 43
4 54
5 65
6 76
7 87
8 98
9 109
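The open, act, close pattern above can also be written with a `with` block, which closes the file automatically when the block ends. The following is a minimal sketch; the file name `with_demo.txt` and the temporary directory are assumptions made so the example does not depend on random.txt.

```python
import os
import tempfile

# Hypothetical scratch file (not part of the original notebook)
demoPath = os.path.join(tempfile.gettempdir(), "with_demo.txt")

# Write two lines; the file is closed when the with block ends
with open(demoPath, "w") as fileToWrite:
    fileToWrite.write("0 10\n1 21\n")

# Read the whole file back as one string
with open(demoPath, "r") as fileToRead:
    contents = fileToRead.read()

print(type(contents))
print(contents)
print(fileToRead.closed)  # True: no explicit close() was needed
```

This form is handy because the file is closed even if an error occurs inside the block.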



This reads our entire file in at once.  We can also read in a certain number of characters or lines of text.
* read: Reads in the rest of the file - or, if you give it a number, only that many characters
* readline: Reads in the rest of the current line
* readlines: Reads in the rest of the file, but splits up the input based on line

Note that each statement starts from where you currently are in the file, not from the beginning of the file (or even the beginning of a line).

In [7]:
#-# Open the file
# ourName = open( "filename", "mode")
fileToRead = open( "random.txt", "r" )

#-# Use readline
print( "#-# readline " )
print( fileToRead.readline() )

#-# Use read
print( "#-# read " )
print( fileToRead.read(3)  )
# Note that this prints out three characters (1, space, 2) and not 3 objects (1, space, 21)

#-# Use readlines
print( "#-# readlines " )
theRest = fileToRead.readlines() 
print( theRest )

#-# Close the file
fileToRead.close()

#-# readline 
0 10

#-# read 
1 2
#-# readlines 
['1\n', '2 32\n', '3 43\n', '4 54\n', '5 65\n', '6 76\n', '7 87\n', '8 98\n', '9 109\n']
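Because each read continues from the current position, it can help to see how to return to the start of the file. The sketch below uses the file object's `seek` method for that; the file name `position_demo.txt` is a made-up scratch file, and the three rows are just a shortened copy of the data shown above.

```python
import os
import tempfile

# Hypothetical scratch file with the first three rows of the sample data
positionPath = os.path.join(tempfile.gettempdir(), "position_demo.txt")
with open(positionPath, "w") as fileToWrite:
    fileToWrite.write("0 10\n1 21\n2 32\n")

with open(positionPath, "r") as fileToRead:
    firstLine = fileToRead.readline()  # reads "0 10\n"
    nextChars = fileToRead.read(3)     # continues on line two: "1 2"
    fileToRead.seek(0)                 # jump back to the start of the file
    allLines = fileToRead.readlines()  # now every line is returned again

print(repr(firstLine))
print(repr(nextChars))
print(allLines)
```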


We can also loop over the lines of a file. This typically uses less memory than reading the whole file in at once, and it allows us to put the data in the form that we want as we go.  In this case we build a list called dataFormatted and split each line on the space.

In [8]:
#-# Open the file
fileToRead = open( "random.txt", "r" )

#-# Read in the data line by line and put into dataFormatted
dataFormatted = []
for line in fileToRead:
    # Strip off the newline and split based on the space
    dataFormatted.append( line.strip('\n').split(' ') )
    
#-# Close the file
fileToRead.close()

#-# Print out the data
print( dataFormatted )

[['0', '10'], ['1', '21'], ['2', '32'], ['3', '43'], ['4', '54'], ['5', '65'], ['6', '76'], ['7', '87'], ['8', '98'], ['9', '109']]
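The values above are still strings. As a small sketch of doing the conversion during the same loop, the example below turns each entry into an integer as the line is read. The file name `loop_demo.txt` and the variable `dataAsInts` are assumptions for illustration only.

```python
import os
import tempfile

# Hypothetical scratch file with a few rows in the same "index value" layout
loopPath = os.path.join(tempfile.gettempdir(), "loop_demo.txt")
with open(loopPath, "w") as fileToWrite:
    fileToWrite.write("0 10\n1 21\n2 32\n")

dataAsInts = []
with open(loopPath, "r") as fileToRead:
    for line in fileToRead:
        # split() with no argument splits on whitespace and drops the newline
        dataAsInts.append([int(value) for value in line.split()])

print(dataAsInts)
```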


We can also write to a file and append to a file in a very similar fashion.  Note that opening a file in write mode erases the contents of any existing file with that name, while append mode adds to the end of it.

In [9]:
# Write to a file (using w)
fileToWrite = open( "sampleWrite.txt", "w" )
fileToWrite.write( "Does this show up?" ) 
fileToWrite.close()

# Write to the same file (using w again)
fileToWrite = open("sampleWrite.txt", "w" )
fileToWrite.write( "Is this the second line?" ) 
fileToWrite.write( "\nAdd another Line" )
fileToWrite.close()

# Append to the file (using a)
fileToAppend = open( "sampleWrite.txt", "a" )
fileToAppend.write( "\nAppended to the end" ) 
fileToAppend.close()

# Check to see what's there
fileToCheck = open( "sampleWrite.txt", "r" )
print( fileToCheck.read() )
fileToCheck.close()

Is this the second line?
Add another Line
Appended to the end
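The same write and append calls can be used to produce numeric data, one row per line. This is only a sketch: the file name `table_demo.txt` is hypothetical, and the formula `10 + 11 * i` is inferred from the random.txt listing earlier rather than taken from how that file was actually made.

```python
import os
import tempfile

tablePath = os.path.join(tempfile.gettempdir(), "table_demo.txt")

# Write mode: start a fresh file and write three rows
with open(tablePath, "w") as fileToWrite:
    for i in range(3):
        fileToWrite.write(f"{i} {10 + 11 * i}\n")

# Append mode: add one more row without erasing the first three
with open(tablePath, "a") as fileToAppend:
    fileToAppend.write(f"{3} {10 + 11 * 3}\n")

with open(tablePath, "r") as fileToCheck:
    tableText = fileToCheck.read()

print(tableText)
```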


### Reading in regular data

The read commands we used before are very flexible, but they assume you will do all of your data prep and formatting as an additional step.  The numpy module comes with two excellent functions that make it easier to read input files with known formatting: loadtxt and genfromtxt.

When using loadtxt, we do not have to open and close the file; both steps are handled inside the function.  We can provide a delimiter, which tells loadtxt what separates one value from the next. If it is left out, any whitespace is treated as the separator.  In this case, a space is used.

In [13]:
import numpy as np

# Read in similar to before
allData = np.loadtxt( "random.txt", delimiter=" " )
print( type(allData) ) 
print( allData ) 

<class 'numpy.ndarray'>
[[  0.  10.]
 [  1.  21.]
 [  2.  32.]
 [  3.  43.]
 [  4.  54.]
 [  5.  65.]
 [  6.  76.]
 [  7.  87.]
 [  8.  98.]
 [  9. 109.]]


We see that the data is now in a list-like form, called a numpy array, rather than the single string we had before.  We also note that each of the values is printed with a decimal point, because it has been read in as a float.
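A short sketch of what the array form gives you: the shape and type can be inspected directly, and rows or columns can be pulled out by index. The file name `array_demo.txt` is a hypothetical scratch file holding three rows in the same layout as random.txt.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch file with three rows of space-separated values
arrayPath = os.path.join(tempfile.gettempdir(), "array_demo.txt")
with open(arrayPath, "w") as fileToWrite:
    fileToWrite.write("0 10\n1 21\n2 32\n")

allData = np.loadtxt(arrayPath)

print(allData.shape)  # (3, 2): three rows, two columns
print(allData.dtype)  # float64

firstColumn = allData[:, 0]  # every row, column 0
secondRow = allData[1]       # row 1, both columns
print(firstColumn)
print(secondRow)
```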

One of the ways we can use loadtxt to help with data prep is to split different columns into different variables.  You do this by setting the unpack parameter to True.

In [15]:
import numpy as np

# Break into columns
colA, colB = np.loadtxt( "random.txt", unpack=True )

print( "#-# colA" ) 
print( colA ) 

print( "\n#-# colB" ) 
print( colB ) 

#-# colA
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]

#-# colB
[ 10.  21.  32.  43.  54.  65.  76.  87.  98. 109.]


The loadtxt command also allows for only a portion of the data to be read in.  The usecols parameter indicates which columns should be read. In this example, only the first column (index 0) is read in.

In [16]:
import numpy as np

# Read in a single column
firstCol = np.loadtxt( "random.txt", usecols=[0], unpack=True )
print( "#-# firstCol" ) 
print( firstCol ) 

#-# firstCol
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
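Column selection is often combined with skipping a header row. The sketch below uses loadtxt's `skiprows` parameter together with `usecols`; the file `header_demo.txt`, its column names, and its values are all invented for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical three-column file with one header line
headerPath = os.path.join(tempfile.gettempdir(), "header_demo.txt")
with open(headerPath, "w") as fileToWrite:
    fileToWrite.write("time temp pressure\n")
    fileToWrite.write("0 20.5 101.3\n1 21.0 101.1\n2 21.5 100.9\n")

# Skip the header line and keep only the first and third columns
timeVals, pressureVals = np.loadtxt(
    headerPath, skiprows=1, usecols=(0, 2), unpack=True
)

print(timeVals)
print(pressureVals)
```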


So far, loadtxt has treated every value it reads as a float (its default data type).  We can typecast the values as they are read in using the dtype parameter.  Here we read in the second column as strings.

In [17]:
import numpy as np

# Change the type
secColSt = np.loadtxt( "random.txt", dtype=str, usecols=[1], unpack=True )
print( "#-# secColSt" ) 
print( secColSt ) 

#-# secColSt
['10' '21' '32' '43' '54' '65' '76' '87' '98' '109']


Numpy's genfromtxt is used the same way, but it can also handle missing values.  This is useful if your data is incomplete or has some errors. The filling_values parameter sets what is substituted for a missing value - common choices are nan (not a number) or 0. Pick a value that lets you easily identify problems, so that you can either filter them out or know you are ok ignoring them.

In [19]:
import numpy as np
digitNum,piVal = np.genfromtxt("corruptData.dat", delimiter=",", unpack=True, missing_values=' ', filling_values=np.nan)

print( "#-# digitNum" )
print( digitNum )

print( "\n#-# piVal" )
print( piVal )

#-# digitNum
[ 0.  1.  2.  3. nan  5.  6.  7.  8.  9. 10. 11. 12.]

#-# piVal
[ 3. nan  1.  4.  1.  5.  9.  2.  6.  5.  3.  5.  9.]
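Once missing entries are marked with nan, they can be filtered out with a boolean mask. This is a minimal sketch, assuming a comma-separated file where a missing value is simply an empty field; the file name `missing_demo.dat` and its contents are made up and are not the corruptData.dat file used above.

```python
import os
import tempfile

import numpy as np

# Hypothetical file: row two is missing its second value,
# row four is missing its first value
missingPath = os.path.join(tempfile.gettempdir(), "missing_demo.dat")
with open(missingPath, "w") as fileToWrite:
    fileToWrite.write("0,3\n1,\n2,1\n,4\n")

digitNum, piVal = np.genfromtxt(
    missingPath, delimiter=",", unpack=True, filling_values=np.nan
)

# Keep only the rows where both columns hold a real number
goodRows = ~np.isnan(digitNum) & ~np.isnan(piVal)
cleanDigit = digitNum[goodRows]
cleanPi = piVal[goodRows]

print(digitNum, piVal)
print(cleanDigit, cleanPi)
```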


## Lambda functions

Lambda functions are anonymous functions which are defined by the keyword lambda. An example that multiplies some value by pi would be:
```python
multByPiLmb = lambda x: x * np.pi
```
This is named multByPiLmb, but doesn't need to have a name.  

An input variable (often x) follows the keyword lambda.  It is written as
```python
lambda x
```
The function returns the value on the right side of the colon.  This is 
```python
x * np.pi
``` 
in this case.  

In [23]:
import numpy as np
multByPiLmb = lambda x: x * np.pi
multByPiLmb(3)

9.42477796076938

Note that this is similar to writing out a full function:

In [24]:
import numpy as np
def multByPiFxn( value ):
    # "value" is used instead of "input" to avoid hiding Python's built-in input()
    return value * np.pi

multByPiFxn(3)

9.42477796076938
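A lambda does not need a name when it is handed straight to another function. As a small sketch using only built-in functions, the example below passes unnamed lambdas to `sorted` and `map`; the sample rows are invented and simply mimic the string pairs read from a file earlier.

```python
# Rows of strings, similar to what a line-by-line file read produces
rows = [["2", "32"], ["0", "10"], ["1", "21"]]

# Sort by the numeric value of the second entry; the lambda is never named
sortedRows = sorted(rows, key=lambda row: int(row[1]))

# Apply a lambda to every element of a list
doubled = list(map(lambda x: x * 2, [1, 2, 3]))

print(sortedRows)
print(doubled)
```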

However, because a lambda is a single expression, it can be written inline exactly where it is needed, such as inside a function call, which a def statement cannot. For example, let's say you wanted to transform a variable as it was being read in. We can use an unnamed lambda function as a converter for our input data. In this example the converter is applied to the first column (0) and adds 10 to each value.

In [26]:
import numpy as np
colA, colB = np.loadtxt( "random.txt", unpack=True )
print( "Original Columns: ", colA,colB )

colC, colD = np.loadtxt( "random.txt", converters = {0: lambda s: int(s)+10}, unpack=True )
print( "Updated Columns: ", colC,colD )

Original Columns:  [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.] [ 10.  21.  32.  43.  54.  65.  76.  87.  98. 109.]
Updated Columns:  [10. 11. 12. 13. 14. 15. 16. 17. 18. 19.] [ 10.  21.  32.  43.  54.  65.  76.  87.  98. 109.]
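For comparison, a converter can also be an ordinary named function; the lambda is just a more compact way to write it in place. The sketch below runs both on a hypothetical scratch file, `converter_demo.txt`, and the helper name `addTen` is made up for this example.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch file with three rows in the random.txt layout
converterPath = os.path.join(tempfile.gettempdir(), "converter_demo.txt")
with open(converterPath, "w") as fileToWrite:
    fileToWrite.write("0 10\n1 21\n2 32\n")

# Named function version of the converter
def addTen(s):
    return float(s) + 10

colNamed, colB = np.loadtxt(
    converterPath, converters={0: addTen}, unpack=True
)

# Same converter written inline as a lambda
colLambda, colD = np.loadtxt(
    converterPath, converters={0: lambda s: float(s) + 10}, unpack=True
)

print(colNamed)
print(colLambda)
```

Both calls give the same first column; the second column is untouched because no converter is registered for it.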


# Check yourself

In [29]:
# Use these variables with the first example
string1 = "For the Glory of Old State"
string2 = "For her founders strong and great."

Write both strings to a file using write and append modes. Read the file back in to verify that both strings were written.

In [30]:
# Try it here


Read in random.txt using numpy's loadtxt and typecast the values to integers as they are being read in.

In [31]:
# Try it here


Read in corruptData.dat with a zero for any erroneous data and use a lambda function to multiply the second column by 2.

In [28]:
# Try it here
