# Lab 6: Working with Data Files
In many kinds of research, we need to be able to read in and manipulate data. In this lab we will look at different ways of reading in, writing out, and manipulating data files.

## The Standard Method
In **Lab 03** we learned how to read in and write out a standard ASCII file. In general we want to put data into numpy arrays so that we can work with it, and doing that with a standard read-in requires a lot of work on our part. When you build an array one value at a time, it is computationally cheaper to append the values to a list first and convert the finished list to a numpy array at the end.
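
A rough way to see why the list-first approach is cheaper: `np.append` returns a brand-new array on every call and copies everything accumulated so far, while `list.append` grows the list in place. The sketch below uses made-up values (not the lab data) and builds the same array both ways.

```python
import numpy as np

values = [0.5 * i for i in range(1000)]  # made-up numbers for illustration

# Approach 1: grow a list, then convert once at the end
tmp_list = []
for v in values:
    tmp_list.append(v)
arr_from_list = np.array(tmp_list)

# Approach 2: grow a numpy array directly; np.append copies the whole array each time
arr_grown = np.array([])
for v in values:
    arr_grown = np.append(arr_grown, v)

# Both approaches give the same result; only the cost differs
print(np.array_equal(arr_from_list, arr_grown))
```

For a few thousand values the difference is hard to notice, but it grows quickly with the size of the file.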

In [None]:
#Get my imports dealt with
import numpy as np
import astropy.units as u

In [None]:
try:
    infile = open('data/hip_tiny.csv','r')
except IOError:
    print("File data/hip_tiny.csv could not be opened!")
    raise #Stop here, since the loop below needs the open file

#Define lists
name_list = list()
vmag_list = list()
    
for line in infile:
    #Check for header lines that begin with a # or lines that are entirely blank
    if line.startswith("#") or line.isspace():
        print(line) #Print the header
        continue
    llist = line.split(',')
    name_list.append(llist[0])        #It is okay if the name is a string
    vmag_list.append(float(llist[5])) #Remember Vmag should be a float

infile.close()

#Now convert to numpy arrays
name_arr = np.array(name_list)
vmag_arr = np.array(vmag_list)
print(name_arr)
print(vmag_arr)
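
The same loop can also be written with a `with` block, which closes the file automatically even if an error occurs partway through. This is a minimal sketch rather than the lab's required method: it writes a tiny made-up file to a temporary directory so that it runs on its own, and the two-column layout is an assumption, not the layout of `hip_tiny.csv`.

```python
import os
import tempfile
import numpy as np

# Write a small made-up CSV so the example is self-contained
tmp_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
with open(tmp_path, 'w') as outfile:
    outfile.write("#name,vmag\n")
    outfile.write("1,9.10\n")
    outfile.write("\n")
    outfile.write("2,9.27\n")

demo_names = []
demo_vmags = []
with open(tmp_path, 'r') as infile:       # the file is closed when this block ends
    for line in infile:
        if line.startswith("#") or line.isspace():
            continue                      # skip header and blank lines
        fields = line.strip().split(',')
        demo_names.append(fields[0])
        demo_vmags.append(float(fields[1]))

# Convert to numpy arrays once, after the loop
demo_name_arr = np.array(demo_names)
demo_vmag_arr = np.array(demo_vmags)
print(demo_name_arr)
print(demo_vmag_arr)
```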

## The Numpy way

### Reading

Numpy has two built-in functions for reading data files: `np.loadtxt()` and `np.genfromtxt()`. Both are called in much the same way, but `np.genfromtxt()` can handle missing data, so that is the one I generally use. These functions have several keywords for handling the data. The default column delimiter is whitespace, but our file uses commas, so we will set the *delimiter* keyword. Note the default data type is **float**.

In [None]:
#Load all the data into a 2-d array of floats
data_2darr = np.genfromtxt('data/hip_tiny.csv',delimiter=',')
print(data_2darr)
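
To see what "can handle missing data" means in practice, here is a small sketch that uses an in-memory string in place of a file (the values are made up). An empty field becomes `nan` by default, and the `filling_values` keyword lets you choose a different placeholder.

```python
import io
import numpy as np

demo_text = "1,2.5,3\n4,,6\n"   # the second row is missing its middle value

# Default behaviour: the missing entry is read as nan
with_nan = np.genfromtxt(io.StringIO(demo_text), delimiter=',')
print(with_nan)

# Choose your own placeholder for missing entries
with_fill = np.genfromtxt(io.StringIO(demo_text), delimiter=',', filling_values=-99.0)
print(with_fill)
```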

Oftentimes it is easier to work with a series of 1-d arrays, instead of one big 2-d array. We can work with 1-d arrays by setting the *unpack* keyword to True. We can also specify which columns we want using the *usecols* keyword.

In [None]:
name_arr, vmag_arr = np.genfromtxt('data/hip_tiny.csv',delimiter=',',usecols=(0,5),unpack=True)
print(name_arr)
print(vmag_arr)
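
As a quick illustration of *usecols* and *unpack* on data you can see in full, the sketch below reads columns from a made-up in-memory table (the star names and values are only for illustration). It also shows a side effect of the default float data type: a column of text comes back as `nan`.

```python
import io
import numpy as np

demo_text = ("#name,ra,dec,vmag\n"
             "Vega,279.23,38.78,0.03\n"
             "Sirius,101.29,-16.72,-1.46\n")

# Pick columns 1 and 3, and unpack them into two separate 1-d arrays
ra_demo, vmag_demo = np.genfromtxt(io.StringIO(demo_text), delimiter=',',
                                   usecols=(1, 3), unpack=True)
print(ra_demo)
print(vmag_demo)

# Column 0 holds text, which cannot be converted to float
name_demo = np.genfromtxt(io.StringIO(demo_text), delimiter=',', usecols=0)
print(name_demo)
```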

You can go further and have numpy take a name for each column from the header line by setting the *names* keyword to True. The result is a structured array with one record per row, so you can access the data by column name instead of by index.

In [None]:
data_2darr = np.genfromtxt('data/hip_tiny.csv',delimiter=',',names=True)
print(data_2darr.dtype.names)
print(data_2darr)

In [None]:
print(data_2darr['Ra_Degrees'])

In [None]:
data_2darr.dtype.names
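
When `np.genfromtxt()` reads names from a header, it cleans them up so that they are valid field names: spaces become underscores and characters such as parentheses are dropped. The sketch below uses a made-up header written in the style of this lab's files (the real header of `hip_tiny.csv` may differ) to show what the cleaned names look like.

```python
import io
import numpy as np

demo_text = ("HIP (Name),Ra (Degrees),V (mag)\n"
             "1,10.0,9.1\n"
             "2,20.0,7.5\n")

demo = np.genfromtxt(io.StringIO(demo_text), delimiter=',', names=True)
print(demo.dtype.names)      # the cleaned-up column names
print(demo.shape)            # one record per row, so the array is 1-d
print(demo['Ra_Degrees'])    # access a column by its cleaned name
```

Printing `dtype.names` first is the easiest way to find out what name to use for each column.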

### Writing

The numpy way to write your arrays to files is `np.savetxt()`. For example, let's save the array we just read in to a different file and with a different format. First, a very naive approach.

In [None]:
np.savetxt('data/hip_tiny_copy.txt', data_2darr)

In [None]:
!cat data/hip_tiny_copy.txt

As you can see, the default formatting of the file is not very pleasing to the eyes, and we also lose the header information. We can fix this by specifying both things.

In [None]:
header_str = '  '.join(data_2darr.dtype.names)
print(header_str)
np.savetxt('data/hip_tiny_copy.txt', data_2darr, header=header_str, fmt='%10f')

In [None]:
!cat data/hip_tiny_copy.txt

Now it looks better, but you can tell we had to work harder for this. You can either make your own function that takes care of all the formatting or use libraries that are more complete.
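
As one possible version of "your own function", the sketch below wraps `np.savetxt()` in a small helper. The name `save_with_header` and the fixed column width of 10 are assumptions made for this example, not part of numpy, and the data are made up.

```python
import os
import tempfile
import numpy as np

def save_with_header(path, arr, fmt='%10.4f'):
    """Hypothetical helper: save a structured array with its column names as a header."""
    # Right-align each name in a 10-character field so it lines up with the data
    header = ' '.join(f'{name:>10s}' for name in arr.dtype.names)
    # np.savetxt puts '# ' in front of the header by default
    np.savetxt(path, arr, header=header, fmt=fmt)

demo_data = np.array([(1.0, 10.5), (2.0, 20.25)],
                     dtype=[('id', 'f8'), ('vmag', 'f8')])
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
save_with_header(demo_path, demo_data)

with open(demo_path) as f:
    demo_lines = f.read().splitlines()
for line in demo_lines:
    print(line)
```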

You can read more about the formatting of data here: https://docs.python.org/3/library/string.html#format-specification-mini-language

## Astropy Tables
**astropy.table** is a module of astropy that provides a new object type called Table. Table objects are very useful for working with large amounts of data with many columns. For instance, we can do all of the above with astropy Tables, and they can read from more than just text files. For more information see [the Astropy documentation for the table module](http://docs.astropy.org/en/stable/table/).

In [None]:
from astropy.table import Table #Import in the Astropy object

The `Table.read()` method is a very easy way to read in information. It also automatically populates headers.

In [None]:
mytable = Table.read('data/hip_smaller.csv')
print(mytable) #Long tables are abbreviated: only the first and last few rows are shown

For smaller datasets, you can have direct data access to search and page through the data. **Be Careful: Large datasets can overwhelm your notebook kernel!**

In [None]:
mytable.show_in_notebook() #Only use for relatively small tables

## Reading Data
Astropy tables can read/write many different formats: http://docs.astropy.org/en/stable/io/unified.html#built-in-table-readers-writers. Sometimes, though, it needs help.

We can take a quick look at a file by using the Linux command `head`, which shows only the first ten lines of a file. We can run Linux commands from the notebook by putting `!` in front of them.

In [None]:
#Show the contents of hip_prob.txt
!head data/hip_prob.txt

In [None]:
#No format is given here, so astropy has to guess; expect this read to fail
prob_tab = Table.read('data/hip_prob.txt')

In [None]:
#Let's give it some help and suggest a format
prob_tab = Table.read('data/hip_prob.txt',format='ascii')
print(prob_tab)

**Additional note:** Be sure that the number of header columns matches the number of data columns. Also, no two columns can share the same name or the file will not read, and the error messages will be **unhelpful**!

## Accessing data in an Astropy Table
Let's learn how to get useful information about our table.

In [None]:
#Get basic info about our table including how long it is and column names
mytable.info()

In [None]:
#Get statistical information about each column
mytable.info('stats')

Let's access a single column. The columns of an astropy table are similar to numpy arrays, but they have a column name associated with them. You can transform the columns back into normal numpy arrays using `np.array()`.

In [None]:
#Access one column
b_col = mytable['B (mag)']
print(b_col[1:3])
#Do math with two columns
bminusv_col = mytable['B (mag)'] - mytable['V (mag)']
print(bminusv_col[1:3])
#Convert to a numpy array
bminusv_array = np.array(bminusv_col)
print("Now in array form:")
print(bminusv_array[1:3])

You can also access individual elements of a column, or several at once with a list of indices.

In [None]:
bmag = mytable['B (mag)'][1]
print(bmag)
bmag_col = mytable['B (mag)'][[1,3,6]]
print(bmag_col)

The real power of an astropy table is that you can use the results in one column to select values in another column.

In [None]:
#Create a table with only stars that have err_Plx < 1
new_tab = mytable[(mytable['err_Plx'] < 1)]
new_tab

You can also do complex selection using bitwise and (`&`) or bitwise or (`|`). Each condition needs its own parentheses, because `&` and `|` bind more tightly than `<` and `>`.

In [None]:
#Select stars with Error in Parallax less than 1 and V Magnitude > 7
new_tab2 = mytable[(mytable['err_Plx'] < 1) & (mytable['V (mag)'] > 7)]
new_tab2

You can also apply the same selection to an individual column.

In [None]:
name_col = mytable['#HIP (Name)'][(mytable['err_Plx'] < 1) & (mytable['V (mag)'] > 7)]
print(name_col[0:5])

## Modifying a table
You can modify a table the same way you modify a numpy array.

In [None]:
change_tab = mytable.copy() #Work on a copy so that mytable itself stays unchanged
change_tab['#HIP (Name)'][[0,1,2,5,6,7]] = [1000,19,156,208,11,16453]
change_tab

You can also add new columns, rename, or remove old ones. Just make sure that the new column has exactly the same length as the table. You can also use units with your table.

In [None]:
new_column1 = np.arange(len(mytable))
new_column2 = mytable['Plx (milliarcsec)'] / 1000.0
change_tab['index'] = new_column1
change_tab['Plx'] = new_column2*u.arcsec
change_tab.rename_column('err_B','error_B')
change_tab.remove_column('err_V')
change_tab #Note the new row that gives the unit

## Creating a table from scratch
Oftentimes you want to save a new table based on your work. Remember tables can have units too. Adding a new column works just like adding a new key to a dictionary.

In [None]:
new_tab = Table()
new_tab['Name'] = np.arange(10)
new_tab['Distance'] = new_tab['Name'] * 10 * u.km
new_tab['Distance2'] = new_tab['Distance'].to(u.m)
print(new_tab)

## Writing out a Table
Once you have a table you can write it out into any of the formats in http://docs.astropy.org/en/stable/io/unified.html#built-in-table-readers-writers. Let's write out the same table as a comma-separated file and as a pipe (`|`) separated file.

In [None]:
new_tab.write('data/distance.csv')
new_tab.write('data/distance.txt',format='ascii',delimiter='|')

In [None]:
!head data/distance.csv

In [None]:
!head data/distance.txt

When we write to ascii we lose the units. We can use one of the enhanced file types like ipac to store our information with metadata. If you are going to overwrite a file, you need to set the *overwrite* keyword to True.

In [None]:
new_tab.write('data/distance.txt',format='ascii.ipac',overwrite=True)
!head data/distance.txt

Finally, we can also store files in binary format. These cannot be read by `head`. The most common binary format in astronomy is the FITS format. To look at a FITS file you have to read it back in, for example with `Table.read()`.

In [None]:
new_tab.write('data/distance.fits') #If you need to overwrite put in overwrite=True
new2_tab = Table.read('data/distance.fits')
print(new2_tab) #Note the units survive

## Lab 6: Now it is your turn
Please answer the following questions, then print them off and turn them in. You don't need to print the whole notebook. Only print the pages starting from here.

Name: 

**Q1: Create numpy arrays from `data/hip_small.csv` for RA and DEC called ra_arr and dec_arr using the Standard Method. Print the Median of each.**

**Q2: Create a numpy array using `np.genfromtxt` from `data/hip_small.csv` for Plx called plx_arr. Print the minimum value.**

**Q3: Print off only those stars that have an RA less than 5 and a Parallax less than 20 in `mytable`.**

**Q4: Add a column to `change_tab` called `new_err_plx`, based on the `err_Plx` column, with its values and units in arcsec. (Hint: All `err` columns have the same units as the associated data columns)**

**Q5: Using `data/hip_smaller.csv` read in the file and create an astropy table called `hip_tab`. Then give the columns that have units in their name, the correct units, and then remove the units from the name of the column. Show your results.**

**Q6: Using `hip_tab`, create a new table called `north_tab` with Dec greater than 0 degrees and Plx between 10 and 50 milliarcsec. Then make Plx and plx_err have units of arcsec.**