# Lecture 12 Reading and writing files

Accessing data

To be able to access data, you need to be able to read and parse files. 

Python has built-in functions to
- open files 
- navigate through their contents
- create files
- write to files

Let's try opening a new file (this will create a new file) and put some text in it 

In [1]:
# Files need names
# Let's pick something to name our new file, including an extension 

file_name = 'hello_world.txt'

The extension of a filename is the '.something' at the end

The extension doesn't change the contents of the file, but it tells programs what kind of file to expect. 

- .txt - a basic extension for files containing just text 
- .pdf - used for PDF files
- .jpg - used for images with jpeg compression 
- .mp3 - a music file format 
- ... and many others

Now we can open (and create) a file using Python's `open` function

The `open` function has one mandatory argument, the file name

The first optional argument, and the most commonly used one, is `mode`

The `mode` the file is opened with specifies what we're allowed to do to the file

The commonly used modes are

- 'r' - open for reading (default)

- 'w' - open for writing, creating the file if it doesn't exist and erasing its contents if it does

- 'a' - open for writing, appending to the end of file if it exists

Let's open our file in 'w' mode, so we can write something in it

In [2]:
f = open(file_name, 'w')

We've now opened a file. 

Check the folder your notebook is in, is 'hello_world.txt' there?

You can do this by going to your browser tab that you opened this notebook from

Or by looking in your file manager 

- Mac: Finder
- Windows: Windows File Explorer

Now let's try writing to the file

In [3]:
# this will write Hello World to the file, it also outputs the number of characters written
f.write('Hello World')

11

Open up the file

Is the text there?

In [4]:
'no it is not'

'no it is not'

Right now the file is still "open". Python buffers what we write, so for the updates to be "saved" to the file on disk, we first need to close the file. 



In [5]:
# Close the file like so
f.close()

Check the file again. 

Do you see the text in it now?

In [6]:
'Yes now the file says "Hello World"'

'Yes now the file says "Hello World"'
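Because forgetting to close a file is an easy mistake to make, Python also offers the `with` statement, which closes the file automatically when the indented block ends. Below is a minimal sketch of the same write-then-check steps. The name `with_demo.txt` and the temporary folder are assumptions made for this example, so nothing in your notebook folder is touched.

```python
import os
import tempfile

# Hypothetical file inside a temporary folder (not the notebook folder)
demo_path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')

# The file is closed automatically when the indented block ends
with open(demo_path, 'w') as demo_f:
    demo_f.write('Hello World')

# The file object still exists, but it is already closed
print(demo_f.closed)

# Read the text back to confirm it was saved
with open(demo_path, 'r') as demo_f:
    contents = demo_f.read()

print(contents)
```

The rest of this lecture keeps using explicit `open` and `close` calls so that each step stays visible.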

Now let's try opening it in append mode

In [7]:
f = open(file_name, 'a')

Now whatever we write will start at the end of the file. 

Warning: if we opened it again in write mode, that would delete the existing contents of the file 

In [8]:
# to make a new line in a string you use '\n'
f.write('\nHello World a second time')

26

In [9]:
f.close()
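The difference between append mode and write mode can be checked with a small experiment. This is only a sketch: it uses a hypothetical file called `mode_demo.txt` in a temporary folder, so `hello_world.txt` is left alone.

```python
import os
import tempfile

# Hypothetical file inside a temporary folder
mode_path = os.path.join(tempfile.mkdtemp(), 'mode_demo.txt')

# 'w' creates the file and writes to it
with open(mode_path, 'w') as demo_f:
    demo_f.write('first')

# 'a' keeps what is there and adds to the end
with open(mode_path, 'a') as demo_f:
    demo_f.write(' second')

with open(mode_path, 'r') as demo_f:
    after_append = demo_f.read()

# 'w' again erases the old contents before writing
with open(mode_path, 'w') as demo_f:
    demo_f.write('third')

with open(mode_path, 'r') as demo_f:
    after_write = demo_f.read()

print(after_append)
print(after_write)
```

After the append the file holds both pieces of text. After the second write only the new text is left.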

Now let's open it again, but for reading this time

In [10]:
f = open(file_name, 'r')

We can read the contents of the file now with the function `read`, like so

In [11]:
f.read()

'Hello World\nHello World a second time'

What if we try that again

In [12]:
f.read()

''

We get nothing because the file object has a current position in the file (like a cursor in your text editor)

After we read everything, we're now at the end of the file

We can use the `seek` function to move the cursor to a specific position in the file (by number of characters)

In [13]:
# back to the beginning 
f.seek(0)

0

In [14]:
# Let's try again
f.read()

'Hello World\nHello World a second time'
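The cursor behaviour can also be tried without touching a file on disk. The sketch below assumes `io.StringIO` as a stand-in for an open text file (it supports the same `read`, `seek` and `tell` calls), and it also uses `tell`, which reports the current cursor position, and `read(5)`, which reads only five characters.

```python
import io

# io.StringIO acts like a text file held in memory
demo_file = io.StringIO('Hello World\nHello World a second time')

first_read = demo_file.read()     # reads everything, cursor ends up at the end
second_read = demo_file.read()    # nothing left to read

demo_file.seek(0)                 # move the cursor back to the beginning
position_at_start = demo_file.tell()
first_five = demo_file.read(5)    # read only the next five characters

print(repr(second_read))
print(position_at_start)
print(first_five)
```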

You can also read line by line

You can iterate over the file object with a for loop. This gives you one line per loop

In [15]:
# back to the beginning
f.seek(0)

# You can iterate over the file object; this gives you one line per loop
for line in f:
    print(line)

Hello World

Hello World a second time


In [16]:
f.close()
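The blank lines in the output above appear because each line still ends with its own `'\n'` and `print` then adds another one. A minimal sketch of one way around this is to strip the newline before printing. It again assumes `io.StringIO` as an in-memory stand-in for the file.

```python
import io

# In-memory stand-in for the file we wrote earlier
demo_file = io.StringIO('Hello World\nHello World a second time')

lines = []
for line in demo_file:
    # rstrip('\n') removes the newline character at the end of the line
    clean_line = line.rstrip('\n')
    lines.append(clean_line)
    print(clean_line)
```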

Random strings of text aren't the best way to store data. 

Data needs to have a specific format that's easy to interpret and is "machine readable"

machine readable - a data format that can be read by a piece of code without human editing 

One of the most common formats for saving table data is the comma-separated value (csv) file 

csv files store a table of data by having columns separated by a comma (,) and a new line for each row

Typically the first row holds the column names, and the rest of the rows hold the data values

Let's make a csv file

First we need some data

In [17]:
import numpy as np

# Let's make 2 columns of data for a table
x_data = np.linspace(0, 3, 12)
y_data = 2*x_data

Now let's open a file to write to

In [18]:
csv_fname = 'data.csv'

csv_f = open(csv_fname, 'w')

Let's start by writing the column names

In [19]:
csv_f.write('X_DATA, Y_DATA')

14

Now let's loop over the data, writing each row

In [20]:
for i in range(len(x_data)):
    csv_f.write('\n')
    line_to_write = str(x_data[i]) + ', ' + str(y_data[i])
    print(line_to_write)
    csv_f.write(line_to_write)

0.0, 0.0
0.2727272727272727, 0.5454545454545454
0.5454545454545454, 1.0909090909090908
0.8181818181818181, 1.6363636363636362
1.0909090909090908, 2.1818181818181817
1.3636363636363635, 2.727272727272727
1.6363636363636362, 3.2727272727272725
1.909090909090909, 3.818181818181818
2.1818181818181817, 4.363636363636363
2.454545454545454, 4.909090909090908
2.727272727272727, 5.454545454545454
3.0, 6.0


In [21]:
csv_f.close()
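Python's standard library also has a `csv` module that takes care of the commas and new lines. The sketch below is an alternative to the manual loop above, not a replacement for it. The file name `demo.csv`, the temporary folder and the three short rows of data are assumptions made for this example.

```python
import csv
import os
import tempfile

# Hypothetical file inside a temporary folder, so data.csv is not overwritten
csv_demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

# A small made-up data set with the same y = 2x relation
x_vals = [0.0, 1.5, 3.0]
y_vals = [2 * x for x in x_vals]

# newline='' lets the csv module handle line endings itself
with open(csv_demo_path, 'w', newline='') as out_f:
    writer = csv.writer(out_f)
    writer.writerow(['X_DATA', 'Y_DATA'])   # column names first
    for x, y in zip(x_vals, y_vals):
        writer.writerow([x, y])             # then one row per data point

# Read it back: the first row is the header, the rest are values stored as text
with open(csv_demo_path, 'r', newline='') as in_f:
    reader = csv.reader(in_f)
    header = next(reader)
    rows = [[float(value) for value in row] for row in reader]

print(header)
print(rows)
```

Note that a csv file only stores text, so the values have to be converted back to floats after reading.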

There are several packages in Python that are able to parse csv files 

Here let's use astropy's Table module

The `Table.read` function reads in a file, parses the table and returns a `Table` object whose columns are backed by numpy arrays

In [22]:
from astropy.table import Table

# use Table.read() to open a file and return a Table object
csv_table = Table.read(csv_fname)

In [23]:
# Let's look at the table
csv_table

X_DATA,Y_DATA
float64,float64
0.0,0.0
0.2727272727272727,0.5454545454545454
0.5454545454545454,1.0909090909090908
0.8181818181818181,1.6363636363636362
1.0909090909090908,2.1818181818181817
1.3636363636363635,2.727272727272727
1.6363636363636362,3.2727272727272725
1.909090909090909,3.818181818181818
2.1818181818181817,4.363636363636363
2.454545454545454,4.909090909090908


Inside the Table object, the data of each column is stored as a numpy array

In [24]:
type(csv_table['X_DATA'].data)

numpy.ndarray

Astropy's Table is able to read table data in many different formats. Here it's able to tell this is a csv file by the extension 

Another commonly used module to read and handle table data is Pandas

It has its own table type object that also uses numpy, called a DataFrame

In [25]:
import pandas as pd

csv_df = pd.read_csv(csv_fname)

In [26]:
csv_df

Unnamed: 0,X_DATA,Y_DATA
0,0.0,0.0
1,0.272727,0.545455
2,0.545455,1.090909
3,0.818182,1.636364
4,1.090909,2.181818
5,1.363636,2.727273
6,1.636364,3.272727
7,1.909091,3.818182
8,2.181818,4.363636
9,2.454545,4.909091


Pandas is commonly used in data science and is a powerful tool for analyzing data. 

In this class we're going to mainly use astropy for reading and handling data 

### FITS Data Format

The most commonly used data format in astronomy is the Flexible Image Transport System (FITS)

FITS files are able to hold both table data and image data, making them perfect for astronomy data

FITS files contain several sets of data, separated into Header/Data Units (HDUs)

Each HDU consists of a 

- Header - Information about the data in plain text (ASCII characters) 

- Data - The data itself. It can be an image (a 2D array, or possibly a 3D one) or a Table (rows, columns, and column names)


The Data is typically in Binary and not readable by eye. Software is needed to parse the data

This is done because ASCII characters take up a lot more storage than the binary representation of floats, ints and other data types. 

- One ASCII character is 8 bits (it takes eight 1s and 0s to define a character). 
- It takes about 16 digits to write out a float as text, 16 x 8 bits = 128 bits (not including the decimal point and things like the comma in a csv file)
- The same float stored in binary takes up 64 bits
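The size comparison above can be checked directly. The sketch below takes one of the values written to the csv file earlier and counts the bits needed for its text form and for its binary form. Using the standard library's `struct` module to get the binary form is a choice made for this example.

```python
import struct

# One of the values written to data.csv earlier
value = 0.2727272727272727

# Text form: one 8-bit ASCII character per digit (plus the decimal point)
text_form = str(value)
text_bits = len(text_form.encode('ascii')) * 8

# Binary form: 'd' packs the value as a 64-bit double precision float
binary_form = struct.pack('d', value)
binary_bits = len(binary_form) * 8

print(text_form, 'needs', text_bits, 'bits as text')
print('the same value needs', binary_bits, 'bits in binary')
```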

To read FITS files we're going to use the fits module from astropy

> from astropy.io import fits

Then files can be opened with 

> fits_file = fits.open(file_name)

This opens the file like the command `open` did. 

By default it's opened in read mode. 

In read mode, edits can be made, but to save them you have to save it as a new file. 

With astropy it's also possible to give a URL instead of a file name, and it will automatically download the file for you. 

In [27]:
from astropy.io import fits

# f = fits.open('https://heasarc.gsfc.nasa.gov/FTP/fermi/data/gbm/daily/2022/02/02/current/glg_ctime_n1_220202_v00.pha')
# f = fits.open('https://heasarc.gsfc.nasa.gov/FTP/swift/data/obs/2025_01/00018938038/xrt/event/sw00018938038xpcw3po_uf.evt.gz')

# Let's open the file sw01222885000b_t90.pha available at this url
# It's a spectral file giving counts in different energy bins
table_url = 'https://swift.gsfc.nasa.gov/results/batgrbcat/GRB240418A/data_product/01222885000-results/pha/sw01222885000b_t90.pha'
f = fits.open(table_url)

Now that we have the file opened let's see what's inside of it, by using `f.info()`

This will tell us all of the HDUs and their names

In [28]:
f.info()

Filename: C:\Users\Insan\.astropy\cache\download\url\9cc569736f9daa51285545992766859c\contents
No.    Name      Ver    Type      Cards   Dimensions   Format
  0  PRIMARY       1 PrimaryHDU     108   ()      
  1  SPECTRUM      1 BinTableHDU    396   80R x 4C   [I, D, D, D]   
  2  EBOUNDS       1 BinTableHDU    345   80R x 3C   [I, E, E]   
  3  STDGTI        1 BinTableHDU    144   1R x 2C   [D, D]   


We have 4 HDUs

The first is a PrimaryHDU. Every FITS file is required to have one, but it can only hold image data, so for table data it is usually empty. 

Then 3 BinTableHDUs, these are tables. 

`f` is like a list of HDUs. You can get a single HDU with its index or by using the HDU's name

`HDU1 = f[1]`

or 

`HDU1 = f['SPECTRUM']`

In [29]:
HDU1 = f[1]

Let's first check out the header of HDU1. It contains information about the data and other useful things, and sometimes a lot of less useful things

We can do that like so

In [30]:
HDU1_header = HDU1.header
HDU1_header

XTENSION= 'BINTABLE'           / binary table extension                         
BITPIX  =                    8 / 8-bit bytes                                    
NAXIS   =                    2 / 2-dimensional binary table                     
NAXIS1  =                   26 / width of table in bytes                        
NAXIS2  =                   80 / number of rows in table                        
PCOUNT  =                    0 / size of special data area                      
GCOUNT  =                    1 / one data group (required keyword)              
TFIELDS =                    4 / number of fields in each row                   
TTYPE1  = 'CHANNEL '           / Spectrum channel number                        
TFORM1  = 'I       '           / data format of field: 2-byte INTEGER           
TTYPE2  = 'RATE    '           / Spectrum rate                                  
TFORM2  = 'D       '           / data format of field: 8-byte DOUBLE            
TUNIT2  = 'count/s '        

You can see there's useful information about the data table at the top, 
- size
- column names
- data types
- units, ...

Then there's other information that's about the observation itself, 
- The start and stop times in different formats
- The instrument's name
- The coordinates of the object it was observing
- The actual pointing direction of the instrument, ...


Then there's the "History" section that looks like a whole bunch of gibberish. This is a log of what was done to create this file

Each line in the HEADER is formatted as a "KEYNAME" = a value, followed by a comment

To get a single line's value, you can do this

`HDU1_header['KEYNAME']`

In [31]:
exposure = HDU1_header['EXPOSURE']
print(exposure)

12.4279999732971


Now let's check out the data

You can do this by looking at HDU1.data

In [32]:
data_table = HDU1.data

In [33]:
# Let's take a look
data_table

FITS_rec([( 0,  3.22987416e-04, 0.00060581, 1.        ),
          ( 1,  2.34029619e-03, 0.00087367, 1.        ),
          ( 2,  4.38274589e-03, 0.0010977 , 0.99599521),
          ( 3,  2.19495939e-03, 0.0010726 , 0.19478463),
          ( 4,  2.08603583e-03, 0.00098687, 0.14393946),
          ( 5,  4.57698789e-03, 0.00094827, 0.10628071),
          ( 6,  3.21719515e-03, 0.00089184, 0.07882585),
          ( 7,  3.20677344e-03, 0.00083956, 0.05820878),
          ( 8,  2.12463943e-03, 0.00081691, 0.04      ),
          ( 9,  1.46474031e-03, 0.00072312, 0.04      ),
          (10,  5.41243921e-04, 0.00065457, 0.04      ),
          (11,  3.28067705e-04, 0.00059307, 0.04      ),
          (12,  1.71227847e-04, 0.00056931, 0.04      ),
          (13,  1.50764883e-03, 0.00054015, 0.04      ),
          (14,  1.88210466e-03, 0.00050425, 0.04      ),
          (15,  6.22583571e-04, 0.00049588, 0.04      ),
          (16,  7.08502427e-04, 0.00046948, 0.04      ),
          (17,  1.11682521e-03,

Let's try to get some easier to read information. 

Let's get some information about the columns 

In [34]:
# This will give us some information about the columns

print(data_table.columns)

# This will give us a list of column names

print(data_table.columns.names)

ColDefs(
    name = 'CHANNEL'; format = 'I'
    name = 'RATE'; format = 'D'; unit = 'count/s'
    name = 'STAT_ERR'; format = 'D'; unit = 'count/s'
    name = 'SYS_ERR'; format = 'D'
)
['CHANNEL', 'RATE', 'STAT_ERR', 'SYS_ERR']


You can get a column of data by giving it the column's name

The column is a numpy array

In [35]:
data_table['RATE']

array([ 3.22987416e-04,  2.34029619e-03,  4.38274589e-03,  2.19495939e-03,
        2.08603583e-03,  4.57698789e-03,  3.21719515e-03,  3.20677344e-03,
        2.12463943e-03,  1.46474031e-03,  5.41243921e-04,  3.28067705e-04,
        1.71227847e-04,  1.50764883e-03,  1.88210466e-03,  6.22583571e-04,
        7.08502427e-04,  1.11682521e-03,  1.26467143e-03,  8.27044941e-04,
        2.91422840e-05,  4.52992985e-04,  8.86513701e-04,  3.08192787e-04,
        2.12555744e-04,  7.01398534e-04,  4.42355062e-04,  2.19405418e-04,
        4.68456565e-04,  1.60690280e-04,  2.34348897e-04,  4.90618058e-04,
        4.20988319e-04,  2.94418943e-04,  9.97212202e-05, -2.55416800e-04,
        2.55893289e-04,  5.18692063e-04,  1.32886445e-04,  1.98817443e-04,
        2.55107181e-04,  1.15666463e-04,  2.41816953e-04, -1.68598594e-04,
       -9.05437960e-05, -1.16101536e-04, -2.42811171e-04, -7.75912002e-05,
       -1.36510084e-04, -6.26318408e-05, -4.35313627e-04, -1.47859335e-04,
        3.14398717e-04,  

What's in the other HDUs?

In [36]:
f.info()

Filename: C:\Users\Insan\.astropy\cache\download\url\9cc569736f9daa51285545992766859c\contents
No.    Name      Ver    Type      Cards   Dimensions   Format
  0  PRIMARY       1 PrimaryHDU     108   ()      
  1  SPECTRUM      1 BinTableHDU    396   80R x 4C   ['I', 'D', 'D', 'D']   
  2  EBOUNDS       1 BinTableHDU    345   80R x 3C   [I, E, E]   
  3  STDGTI        1 BinTableHDU    144   1R x 2C   [D, D]   


In [37]:
HDU2 = f['EBOUNDS']

In [38]:
HDU2.header

XTENSION= 'BINTABLE'           / binary table extension                         
BITPIX  =                    8 / 8-bit bytes                                    
NAXIS   =                    2 / 2-dimensional binary table                     
NAXIS1  =                   10 / width of table in bytes                        
NAXIS2  =                   80 / number of rows in table                        
PCOUNT  =                    0 / size of special data area                      
GCOUNT  =                    1 / one data group (required keyword)              
TFIELDS =                    3 / number of fields in each row                   
TTYPE1  = 'CHANNEL '           / Spectrum channel number                        
TFORM1  = 'I       '           / data format of field: 2-byte INTEGER           
TTYPE2  = 'E_MIN   '           / Channel lower energy bin edge                  
TFORM2  = 'E       '           / data format of field: 4-byte REAL              
TUNIT2  = 'keV     '        

This is a data table to be used in tandem with the previous table. It defines the minimum and maximum energy of each entry in the `CHANNEL` column

In [39]:
HDU2_table = HDU2.data

print(HDU2_table.columns)

ColDefs(
    name = 'CHANNEL'; format = 'I'
    name = 'E_MIN'; format = 'E'; unit = 'keV'
    name = 'E_MAX'; format = 'E'; unit = 'keV'
)


In [40]:
print(HDU2_table.columns.names)
for row in HDU2_table:
    print(row)

['CHANNEL', 'E_MIN', 'E_MAX']
(0, 0.0, 10.0)
(1, 10.0, 12.0)
(2, 12.0, 14.0)
(3, 14.0, 16.0)
(4, 16.0, 18.0)
(5, 18.0, 20.0)
(6, 20.0, 22.0)
(7, 22.0, 24.0)
(8, 24.0, 26.0)
(9, 26.0, 28.0)
(10, 28.0, 30.1)
(11, 30.1, 32.1)
(12, 32.1, 34.2)
(13, 34.2, 36.3)
(14, 36.3, 38.3)
(15, 38.3, 40.4)
(16, 40.4, 42.5)
(17, 42.5, 44.6)
(18, 44.6, 46.8)
(19, 46.8, 48.9)
(20, 48.9, 51.1)
(21, 51.1, 53.2)
(22, 53.2, 55.4)
(23, 55.4, 57.6)
(24, 57.6, 59.8)
(25, 59.8, 62.0)
(26, 62.0, 64.2)
(27, 64.2, 66.4)
(28, 66.4, 68.7)
(29, 68.7, 70.9)
(30, 70.9, 73.2)
(31, 73.2, 75.4)
(32, 75.4, 77.7)
(33, 77.7, 80.0)
(34, 80.0, 82.3)
(35, 82.3, 84.6)
(36, 84.6, 87.0)
(37, 87.0, 89.3)
(38, 89.3, 91.7)
(39, 91.7, 94.0)
(40, 94.0, 96.4)
(41, 96.4, 98.8)
(42, 98.8, 101.2)
(43, 101.2, 103.6)
(44, 103.6, 106.0)
(45, 106.0, 108.4)
(46, 108.4, 110.9)
(47, 110.9, 113.3)
(48, 113.3, 115.8)
(49, 115.8, 118.2)
(50, 118.2, 120.7)
(51, 120.7, 123.2)
(52, 123.2, 125.7)
(53, 125.7, 128.3)
(54, 128.3, 130.8)
(55, 130.8, 133.3)
(5
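Since both tables share the `CHANNEL` column, each rate in the SPECTRUM table can be matched with its energy range from the EBOUNDS table. The sketch below does this by hand for the first three channels, with the values copied from the outputs above. The variable names, and the choice to compute the center and width of each bin, are assumptions made for this example.

```python
# First three rows copied from the SPECTRUM and EBOUNDS outputs above
channels = [0, 1, 2]
rates = [3.22987416e-04, 2.34029619e-03, 4.38274589e-03]   # count/s
e_min = [0.0, 10.0, 12.0]    # keV
e_max = [10.0, 12.0, 14.0]   # keV

bin_centers = []
bin_widths = []
for ch, rate, low, high in zip(channels, rates, e_min, e_max):
    center = 0.5 * (low + high)   # middle of the energy bin
    width = high - low            # size of the energy bin
    bin_centers.append(center)
    bin_widths.append(width)
    print(ch, center, width, rate)
```

With the full tables the same idea works on the numpy columns directly, for example `0.5 * (HDU2_table['E_MIN'] + HDU2_table['E_MAX'])`.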

If you are only interested in the table and not the header, it can be more convenient to use an Astropy Table. 

The Table module can also read FITS files, but it only works for tables (no images) and it doesn't give you the header. 

The Table module is more flexible though, and has better functions to display data

In [41]:
table = Table.read(table_url, hdu=1)

In [42]:
table

CHANNEL,RATE,STAT_ERR,SYS_ERR
Unnamed: 0_level_1,ct / s,ct / s,Unnamed: 3_level_1
int16,float64,float64,float64
0,0.00032298741642329033,0.0006058113610214997,1.0
1,0.0023402961874340942,0.0008736706687293406,1.0
2,0.004382745888921846,0.0010976986354282246,0.995995213525947
3,0.0021949593945541526,0.0010726031861027505,0.194784630902251
4,0.002086035828565968,0.0009868698005033746,0.14393945834151195
5,0.0045769878910468876,0.0009482658950131615,0.10628070885571646
6,0.0032171951450284657,0.0008918407463337803,0.07882585149794728
7,0.0032067734401000107,0.0008395569351301532,0.05820877744036666
8,0.0021246394282786454,0.0008169122858670772,0.03999999910593033
...,...,...,...


In [43]:
# This may or may not work
table.show_in_notebook()

DataGrid(auto_fit_params={'area': 'all', 'padding': 30, 'numCols': None}, corner_renderer=None, default_render…

In [44]:
# This also may or may not work
table.show_in_notebook(backend='classic')



idx,CHANNEL,RATE,STAT_ERR,SYS_ERR
Unnamed: 0_level_1,Unnamed: 1_level_1,ct / s,ct / s,Unnamed: 4_level_1
0,0,0.0003229874164232,0.0006058113610214,1.0
1,1,0.002340296187434,0.0008736706687293,1.0
2,2,0.0043827458889218,0.0010976986354282,0.995995213525947
3,3,0.0021949593945541,0.0010726031861027,0.194784630902251
4,4,0.0020860358285659,0.0009868698005033,0.1439394583415119
5,5,0.0045769878910468,0.0009482658950131,0.1062807088557164
6,6,0.0032171951450284,0.0008918407463337,0.0788258514979472
7,7,0.0032067734401,0.0008395569351301,0.0582087774403666
8,8,0.0021246394282786,0.000816912285867,0.0399999991059303
9,9,0.0014647403131698,0.0007231198523321,0.0399999991059303


In [54]:
table.colnames

['CHANNEL', 'RATE', 'STAT_ERR', 'SYS_ERR']

As you may have noticed, this only gave us one table, the one in HDU 1. To get the other tables in the file, you need to give the index or the name of the HDU you want

In [55]:
table_ebins = Table.read(table_url, hdu='EBOUNDS')

In [60]:
table_ebins.show_in_notebook()

DataGrid(auto_fit_params={'area': 'all', 'padding': 30, 'numCols': None}, corner_renderer=None, default_render…

Let's not forget to close our file

In [61]:
f.close()

Image data 

In [62]:
image_url = 'http://data.astropy.org/tutorials/FITS-images/HorseHead.fits'

f = fits.open(image_url)

In [63]:
f.info()

Filename: C:\Users\Insan\.astropy\cache\download\url\ff6e0b93871033c68022ca026a956d87\contents
No.    Name      Ver    Type      Cards   Dimensions   Format
  0  PRIMARY       1 PrimaryHDU     161   (891, 893)   int16   
  1  er.mask       1 TableHDU        25   1600R x 4C   [F6.2, F6.2, F6.2, F6.2]   


In [64]:
f[0].header

SIMPLE  =                    T /FITS: Compliance                                
BITPIX  =                   16 /FITS: I*2 Data                                  
NAXIS   =                    2 /FITS: 2-D Image Data                            
NAXIS1  =                  891 /FITS: X Dimension                               
NAXIS2  =                  893 /FITS: Y Dimension                               
EXTEND  =                    T /FITS: File can contain extensions               
DATE    = '2014-01-09        '  /FITS: Creation Date                            
ORIGIN  = 'STScI/MAST'         /GSSS: STScI Digitized Sky Survey                
SURVEY  = 'SERC-ER '           /GSSS: Sky Survey                                
REGION  = 'ER768   '           /GSSS: Region Name                               
PLATEID = 'A0JP    '           /GSSS: Plate ID                                  
SCANNUM = '01      '           /GSSS: Scan Number                               
DSCNDNUM= '00      '        

In [65]:
image_data = f[0].data
print(image_data.shape)

(893, 891)


In [None]:
import matplotlib.pyplot as plt

# Let's view the image
plt.imshow(image_data, cmap='gray')
plt.colorbar()


Next time we will cover more about images and plotting them