<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

<center>
<h1><font size="+3">ASTG Python Courses</font></h1>
</center>

---

<center>
<h2>
<font color="red"> 
File Manipulation and Usage <br> 
in <br> 
Science and Engineering Applications
</font> 
</h2>
</center>

In [None]:
# data downloads for this lesson

import urllib.request

# obtain jpg file from online
url = 'https://blog.lipsumarium.com/assets/img/posts/2017-07-22-caption-memes-in-python/one-does-not-simply-make-a-good-meme-generator-in-python.jpg'
urllib.request.urlretrieve(url, "meme.jpg")

# this file contains a list of Winter Olympic Medals and details
urllib.request.urlretrieve('http://winterolympicsmedals.com/medals.csv', "medals.csv")

# obtain JSON file from online
f = 'https://services.swpc.noaa.gov/json/solar_probabilities.json'
urllib.request.urlretrieve(f, 'probabilities.json')

# 1. The Basics

---

## 1a. Manipulating ASCII/Text Files

#### Create a file
```python
f = open('filename.txt', 'w')
f.write(data) # type(data) == str
f.close()

with open('filename.txt', 'w') as f:
     f.write(data) # writelines() is available as well
```

In [None]:
lons = [105.5, 67.25, 13.75, 86.20, 45.80]
lats = [-22.72, -43.56, 30.41, 75.57, 11.60]

with open('sample_text_file.txt', 'w') as f:
     for a, b in zip(lons,lats):
         f.write(str(a)+'  '+ str(b)+'\n') 

In [None]:
!cat sample_text_file.txt


#### Read a file
```python
f = open('filename.txt', 'r')
data = f.read() # readline(s) as well
f.close()

with open('filename.txt', 'r') as f:
     data = f.read() # we can also use other modes
```

In [None]:
with open('sample_text_file.txt', 'r') as f:
     lines = f.readlines()
        
print(lines)
for line in lines:
    a,b = line.split()
    print(a,b)
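
The values returned by `split()` are still strings. As a minimal sketch (the file name `coords.txt` and the temporary directory are assumptions made so the example runs on its own), the lines can be converted to floats while iterating over the file one line at a time:

```python
import os
import tempfile

# Write a small, made-up "lon  lat" file so the sketch is self-contained
path = os.path.join(tempfile.mkdtemp(), 'coords.txt')
with open(path, 'w') as f:
    f.write("105.5  -22.72\n67.25  -43.56\n13.75  30.41\n")

lons, lats = [], []
with open(path, 'r') as f:
    for line in f:                  # a file object yields one line per iteration
        fields = line.split()
        if len(fields) != 2:        # skip blank or malformed lines
            continue
        lons.append(float(fields[0]))
        lats.append(float(fields[1]))

print(lons)
print(lats)
```

Iterating over the file object avoids loading every line into memory at once, which matters for large files.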

## 1b. Manipulating Binary Files

```python
with open('filename.bin', 'rb') as f:
     data = f.read() # read without decoding
```

If you have a binary file with mixed data types in a known format, you can use the [`struct`](https://docs.python.org/3/library/struct.html) module to decode it. Image files such as JPG or PNG can also be read directly in Python's binary mode, but decoding the bytes by hand is tedious and rarely a practical way to get at the image data.

_Warning:_ Be careful about the endianness of your files! (Big or little)
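
As a rough illustration of both points, the sketch below packs a hypothetical mixed-type record (the field layout and values are invented for this example) with an explicit byte order. The `<` and `>` prefixes in a `struct` format string select little-endian and big-endian layouts:

```python
import struct

# Hypothetical record: station id (int32), temperature (float64), 4-byte tag
station_id, temperature, tag = 42, 287.15, b'KDCA'

# '<' = little-endian, '>' = big-endian; both use standard sizes without padding
little = struct.pack('<id4s', station_id, temperature, tag)
big = struct.pack('>id4s', station_id, temperature, tag)

print(struct.calcsize('<id4s'))   # 16 bytes: 4 + 8 + 4
print(little.hex())
print(big.hex())

# Decoding with the matching byte order recovers the original values
decoded = struct.unpack('<id4s', little)
print(decoded)

# Decoding with the wrong byte order silently yields different numbers
wrong = struct.unpack('>id4s', little)
print(wrong[0])
```

Note that no error is raised in the last step, which is why the byte order of a binary file has to be known in advance.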

In [None]:
lons = [105.5, 67.25, 13.75, 86.20, 45.80]
lats = [-22.72, -43.56, 30.41, 75.57, 11.60]

import struct
with open('sample_bin_file.dat', 'wb') as f:
     for i in range(len(lons)):
         f.write(struct.pack('d', lons[i]))
         f.write(struct.pack('d', lats[i]))

In [None]:
ya = []
with open('sample_bin_file.dat', 'rb') as fid:
     i = 0
     nBytes = struct.calcsize('d')
     while True:
           rec = fid.read(nBytes)
           if len(rec) != nBytes:
              break
           (y,) = struct.unpack('d', rec)
           ya.append(y)
           i += 1

import numpy as np
ya = np.array(ya)
ya.shape = (5,2)
print(ya)
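
Reading one value at a time works, but when the whole file fits in memory the records can also be decoded in a single pass. The following is a sketch under the assumption that each record is a little-endian pair of doubles (the cell above uses the native byte order instead); the file name `pairs.dat` is made up for the example:

```python
import os
import struct
import tempfile

lons = [105.5, 67.25, 13.75, 86.20, 45.80]
lats = [-22.72, -43.56, 30.41, 75.57, 11.60]

# Write (lon, lat) records as pairs of little-endian doubles
path = os.path.join(tempfile.mkdtemp(), 'pairs.dat')
with open(path, 'wb') as f:
    for lon, lat in zip(lons, lats):
        f.write(struct.pack('<2d', lon, lat))

# Read everything, then let iter_unpack walk the buffer record by record
with open(path, 'rb') as f:
    raw = f.read()

pairs = list(struct.iter_unpack('<2d', raw))
for lon, lat in pairs:
    print(lon, lat)
```

`struct.iter_unpack` requires the buffer length to be an exact multiple of the record size, so a truncated file raises an error instead of returning partial data.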

Take, for example, the following image:

![meme](https://blog.lipsumarium.com/assets/img/posts/2017-07-22-caption-memes-in-python/one-does-not-simply-make-a-good-meme-generator-in-python.jpg)

In [None]:
with open('meme.jpg', 'rb') as f:
     data = f.read()
  
print(data[:40])

Here, we have read an image file in binary mode, but have not decoded the resulting bytes into numbers, text, or anything else. Dedicated packages do that for you: PIL (Python Imaging Library) is no longer maintained, so use its fork [Pillow](https://python-pillow.org/) instead. The more advanced and popular [OpenCV](https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_tutorials.html) also supports [image manipulation](https://docs.python-guide.org/scenarios/imaging/).

# 2. Other File Types (Standard Packages)

---

Beyond document-based reading and writing of file data, what other types of data are there?

## 2a. CSV

Comma-separated values (CSV) files hold tabular data, much like a spreadsheet, and are widely used in the engineering and financial disciplines. In Python, we can read these directly using the `csv` module:

In [None]:
# this file contains a list of Winter Olympic Medals and details
!head medals.csv

In [None]:
import csv

# there is also csv.writer to write csv files
with open('medals.csv', newline='') as f:
    cr = csv.reader(f)
    for records, row in enumerate(cr):
        if records == 10:  # limit output to the first 10 rows
            break
        print(row)
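
The writer side of the `csv` module works the same way. Here is a small sketch using `csv.DictWriter` and `csv.DictReader`; the column names and rows are invented placeholders, not values from `medals.csv`:

```python
import csv
import os
import tempfile

# Made-up rows for illustration only
fieldnames = ['year', 'city', 'medal']
rows = [
    {'year': 2001, 'city': 'Exampleville', 'medal': 'Gold'},
    {'year': 2005, 'city': 'Sampletown', 'medal': 'Silver'},
]

path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

# newline='' lets the csv module control line endings itself
with open(path, 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# DictReader uses the header row as keys for each record
with open(path, newline='') as f:
    records = list(csv.DictReader(f))

print(records[0]['city'])

# Every field comes back as a string, so numbers need explicit conversion
years = [int(r['year']) for r in records]
print(years)
```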

We can also use [NumPy](https://github.com/pytrain/numpy/blob/master/IntroNumPy.ipynb) or [Pandas](https://github.com/pytrain/pandas/blob/master/Intro_Pandas.ipynb) or other packages to read this file type.

In [None]:
# NumPy
import numpy as np
year = np.loadtxt('medals.csv', delimiter=',', 
                  usecols=(0), unpack=True, skiprows=1)
print(year)

In [None]:
import pandas as pd
# only 10 rows of data will be displayed
pd.set_option("display.max_rows", 10) 
# print floating point numbers using fixed point notation,
np.set_printoptions(suppress=True)

# Pandas
data = pd.read_csv('medals.csv')
print(data)

## 2b. JSON

---

JavaScript Object Notation (JSON) is essentially a dictionary, or a list of dictionaries, stored in an ASCII/text file or streamed directly. It is mainly used in web programming to pass data between websites and the user. As with CSV, Python ships with a standard module to read this type of data.
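
To make the dictionary analogy concrete, here is a minimal round trip through the `json` module. The records are made up for illustration and do not reflect the structure of any real data feed:

```python
import json

# Made-up records: a list of dictionaries
records = [
    {'date': '2000-01-01', 'probability': 10, 'active': True},
    {'date': '2000-01-02', 'probability': 5, 'active': False},
]

# Serialize to a JSON string (json.dump writes to an open file instead)
text = json.dumps(records, indent=2)
print(text)

# Parse the string back into Python lists and dictionaries
restored = json.loads(text)
print(restored[0]['date'])
```

Python's `True`/`False` become `true`/`false` in the JSON text and are converted back on load.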

The following data are the solar event probabilities from the Space Weather Prediction Center. They help scientists determine whether a solar event could damage space-based instruments or affect Earth-based instrumentation such as GPS.

In [None]:
!cat probabilities.json

In [None]:
import json
with open('probabilities.json') as f:
     data = json.loads(f.read())
  
print(data[0])

Or, if we wanted to have some fun, we could continually find the location of the International Space Station:


In [None]:
import json
import urllib.request
import time
import datetime as dt

i = 0
while i < 10:
  response = urllib.request.urlopen("http://api.open-notify.org/iss-now.json")
  obj = json.loads(response.read())
  
  t = dt.datetime.fromtimestamp(obj['timestamp'], tz=dt.timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
  
  print('time: ', t, ', position: (',
        obj['iss_position']['latitude'], ' ,', obj['iss_position']['longitude'],
        ')', end='')
  
  time.sleep(5)
  i += 1
  print('\r', end='')

We can also use Pandas to read this file type.

In [None]:
pd.read_json('probabilities.json')

## Exercise

---




I'd like to find the list of Airbus flights and the properties of each aircraft. Download, read, and plot the points of the aircraft track.


CSV File: https://opensky-network.org/datasets/states/airbus_tree.csv

It's hard to test your knowledge of these packages as they are so simple and reading data is usually a preliminary step for data analysis.

In [None]:
import pandas as pd

data = pd.read_csv('https://opensky-network.org/datasets/states/airbus_tree.csv')

# plotting
%matplotlib inline
import matplotlib.pyplot as plt

# plot longitude on the x-axis and latitude on the y-axis
plt.plot(data['lon'], data['lat'], '.')
plt.xlabel('longitude')
plt.ylabel('latitude')

# 3. Other File Types (Non-Standard)

---

Beyond these multi-disciplinary file types, there are file types that are specific to a research area or data source.


## 3a. Earth Science (HDF-5 / netCDF4)

Earth Science models produce time-dependent data that needs to be stored in files organized into groups or some other hierarchy. HDF-5 is the base file format for this kind of hierarchical data, and netCDF4 is built on top of HDF-5 with a simpler data model of dimensions, variables, and attributes.


### <font color="red"> Manipulating NetCDF Files </font>

In [None]:
from netCDF4 import Dataset
import numpy as np
#from numpy.random import uniform

#------------------
# Creating the file
#------------------
with Dataset('my_file.nc4', mode='w', format='NETCDF4') as ncFid:
     print(ncFid.file_format)

     #------------------------
     # Defining the dimensions
     #------------------------
     time = ncFid.createDimension('time', None)
     lev  = ncFid.createDimension('lev', 72)
     lat  = ncFid.createDimension('lat', 91)
     lon  = ncFid.createDimension('lon', 144)

     print(ncFid.dimensions)

     #------------------------------------------
     # Creating variables and Setting attributes
     #------------------------------------------
     times = ncFid.createVariable('time','f8',('time',))
     times.units = 'hours since 0001-01-01 00:00:00.0'
     times.calendar = 'gregorian'

     levels = ncFid.createVariable('lev','i4',('lev',))
     levels.units = 'hPa'

     latitudes = ncFid.createVariable('lat','f4',('lat',))
     latitudes.units = 'degrees north'

     longitudes = ncFid.createVariable('lon','f4',('lon',))
     longitudes.units = 'degrees east'

     temp = ncFid.createVariable('temp','f4',('time','lev','lat','lon',))
     temp.units = 'K'

     ncFid.description = 'Sample netCDF file'
     ncFid.source      = 'netCDF4 python tutorial'
     ncFid.history     = 'Created on June 18, 2019'

     #---------------
     # Setting values
     #---------------
     latitudes[:]  =  np.arange(-90,91,2.0)
     longitudes[:] =  np.arange(-180,180,2.5)
     levels[:]     =  np.arange(0,72,1)
     temp[0:5,:,:,:] = 300*np.random.uniform(\
         size=(5,levels.size,latitudes.size, longitudes.size))


In [None]:
with Dataset('my_file.nc4', mode='r') as ncFid:
     temp = ncFid.variables['temp'][:]

print(temp.shape)
print(np.mean(temp), np.std(temp), np.max(temp), np.min(temp))

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
cs = plt.contourf(temp[0,0,:,:])

### <font color="red"> Manipulating HDF5 Files </font>

In [None]:
import h5py
import numpy as np

# gzip compression level (0-9)
comp = 6

#------------------
# Creating the file
#------------------
with h5py.File('my_file.h5', 'w') as hFid:
     #----------------
     # File attributes
     #----------------
     hFid.attrs['source']      = 'H5Py Tutorial'
     hFid.attrs['history']     = 'Created on June 18, 2019'
     hFid.attrs['description'] = 'Sample HDF5 file'

     #------------------------
     # Defining the dimensions
     #------------------------
     lat = np.arange(-90,91,2.0)
     dset = hFid.require_dataset('lat', 
                                 shape=lat.shape, 
                                 dtype=np.float32, compression=comp)
     dset[...] = lat
     dset.attrs['name'] = 'latitude'
     dset.attrs['units'] = 'degrees north'

     lon = np.arange(-180,180,2.5)
     dset = hFid.require_dataset('lon', shape=lon.shape, dtype=np.float32, compression=comp)
     dset[...] = lon
     dset.attrs['name'] = 'longitude'
     dset.attrs['units'] = 'degrees east'

     lev = np.arange(0,72,1)
     dset = hFid.require_dataset('lev', shape=lev.shape, dtype=np.int32, compression=comp)
     dset[...] = lev
     dset.attrs['name'] = 'vertical levels'
     dset.attrs['units'] = 'hPa'

     time = np.arange(0,5,1)
     dset = hFid.require_dataset('time', shape=time.shape, maxshape=(None,), dtype=np.float32, compression=comp)
     dset[...] = time
     dset.attrs['name'] = 'time'
     dset.attrs['units'] = 'hours since 2013-01-01 00:00:00.0'
     dset.attrs['calendar'] = 'gregorian'

     #------------------------------------------
     # Creating variables and Setting attributes
     #------------------------------------------
     arr = np.zeros((5,lev.size,lat.size,lon.size))
     arr[0:5,:,:,:] = 300*np.random.uniform(
                    size=(5,lev.size,lat.size,lon.size))
     dset = hFid.require_dataset('temp', shape=arr.shape, 
                                 dtype=np.float32, compression=comp)
     dset[...] = arr
     dset.attrs['name'] = 'temperature'
     dset.attrs['units'] = 'K'

     #---------------
     # Creating Groups 
     #---------------
     gpData2D = hFid.create_group('2D_Data')
     sgpLand  = gpData2D.create_group('2D_Land')
     sgpSea   = gpData2D.create_group('2D_Sea')

     gpData3D = hFid.create_group('3D_Data')

     #----------------------
     # Write data in a group
     #----------------------
     temp = gpData3D.create_dataset('temp', data=arr)
     temp.attrs['name'] = 'temperature'
     temp.attrs['units'] = 'K'


In [None]:
with h5py.File('my_file.h5', 'r') as hFid:
     print(hFid.keys())

     # index with [()] to read an entire dataset into a NumPy array
     lev  = hFid['lev'][()]
     lat  = hFid['lat'][()]
     lon  = hFid['lon'][()]
     time = hFid['time'][()]

     temp1 = hFid['temp'][()]
     print(temp1[0,0,0,0], temp1[4,6,7,15])

     temp2 = hFid['3D_Data']['temp'][()]
     print(temp2[0,0,0,0], temp2[4,6,7,15])

In addition to these hierarchical raw data formats for Earth Science data, there are also GIS application data types.

### <font color="red"> Manipulating Shapefiles </font>

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import cartopy
import cartopy.crs as ccrs
import cartopy.io.shapereader as shpreader

# Get the file name from the natural_earth database
shpfilename = shpreader.natural_earth(resolution='110m',
                                      category='cultural',
                                      name='admin_0_countries')

In [None]:
# Read file and get countries 
reader = shpreader.Reader(shpfilename)
countries = reader.records()
next_country = next(countries)

In [None]:
print(type(next_country.attributes))

In [None]:
# Print features of a country
for key in next_country.attributes:
    print("{:} --> {:}".format(key,next_country.attributes[key]))

In [None]:
# Define a function which returns the population of a given country
population = lambda country: country.attributes['POP_EST']

# Countries sorted by population (ascending)
countries_sorted_by_population = sorted(reader.records(),
                                        key=population)

num_countries = len(countries_sorted_by_population)
n = 5

# Get the 5 most populated
most_populated = countries_sorted_by_population[num_countries-n:]

print("Most Populated Countries")
for nation in most_populated:
    print("   {:>} --> {:>}".format(nation.attributes['NAME_LONG'], \
                               nation.attributes['POP_EST']))

# Get the 5 least populated
least_populated = countries_sorted_by_population[:n]

print()
print("Least Populated Countries")
for nation in least_populated:
    print("   {:>} --> {:>}".format(nation.attributes['NAME_LONG'], \
                               nation.attributes['POP_EST']))   

In [None]:
# Plotting

# Select the map projection
#----------------------
ax = plt.axes(projection=ccrs.PlateCarree())
ax.add_feature(cartopy.feature.OCEAN)
 
# Select the area of interest
#-----------------------
ax.set_extent([-150, 60, -25, 60])
 
# request a fresh iterator: `countries` was already advanced by next()
for country in reader.records():
    # add_geometries expects an iterable of geometries
    geoms = [country.geometry]
    if country.attributes['ADM0_A3'] == 'USA':
        ax.add_geometries(geoms, ccrs.PlateCarree(),
                          facecolor=(0, 0, 1),
                          label=country.attributes['ADM0_A3'])
    else:
        ax.add_geometries(geoms, ccrs.PlateCarree(),
                          facecolor=(0, 1, 0),
                          label=country.attributes['ADM0_A3'])
 
plt.show()

## 3b. Space Science (Astronomy, Heliophysics, etc.) - FITS Files

FITS (Flexible Image Transport System) files contain imagery together with the metadata that describes it. FITS is a standard data format used within astronomy and is endorsed by [GSFC NASA](http://fits.gsfc.nasa.gov/) and the IAU (International Astronomical Union).

Most FITS files, when opened in a web browser, show an ASCII (human-readable) header that gives the details or descriptions of the data contained within the file.

> Sample Files:  
>  
> There are samples within the package AstroPy and some distributed online through GSFC. [Here](http://fits.gsfc.nasa.gov/fits_samples.html) is a link to those samples provided by GSFC.

### Reading a FITS File: Crab Nebula and Pulsar

In [None]:
from astropy.io import fits

# FITS sample file used from Chandra X-Ray Observatory:
# http://chandra.harvard.edu/photo/2009/crab/fits/crab.fits
image_file = fits.open('http://chandra.harvard.edu/photo/2009/crab/fits/crab.fits')

Our image file contains headers and data combined. Let's look at the header information first.

### FITS Headers

In [None]:
image_file[0].header

In [None]:
image_file.info()

In [None]:
image_data = image_file[0].data
print(image_data.shape)

### Plotting with AstroPy

Here, we will use matplotlib in conjunction with AstroPy to visualize this Nebula.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
from astropy.visualization import astropy_mpl_style
plt.style.use(astropy_mpl_style)

plt.figure(figsize=(20,10))
plt.imshow(image_data, cmap='gray')
plt.colorbar()

In [None]:
plt.figure(figsize=(20,10))
plt.imshow(image_data, cmap='plasma')
plt.colorbar()

You can also create a FITS file from a NumPy array using the following template:

```python
hdu = fits.PrimaryHDU(new_data)
hdu.writeto('filename.fits')
```

The metadata can be added later, but the `PrimaryHDU` class already fills in some of it for you.

## 3c. Engineering Applications (Signal Processing, Streamed Data, etc.)

Signal processing is one example of an engineering application that takes a specific data format and requires you to manipulate or modify the data in order to produce the desired physical quantities. Take, for example, a sine wave used as an audio signal.

In [None]:
# Generate a sound
import numpy as np
from IPython.display import Audio
import matplotlib.pyplot as plt
%matplotlib inline

framerate = 44100
t = np.linspace(0,5,framerate*5)
data = np.sin(2*np.pi*220*t) # one tone
plt.plot(data)
data = data + np.sin(2*np.pi*224*t) # two tones (two sine waves)
plt.plot(data)
plt.xlim(0,1000)
Audio(data,rate=framerate)

In [None]:
# Can also do stereo or more channels
dataleft = np.sin(2*np.pi*220*t)
dataright = np.sin(2*np.pi*224*t)
plt.plot(dataleft)
plt.plot(dataright)
plt.xlim(0,1000)
Audio([dataleft, dataright],rate=framerate)

In [None]:
Audio("http://www.nasa.gov/mp3/574928main_houston_problem.mp3")  # From URL
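
Since this lesson is about files, the generated signal can also be saved to disk. The sketch below uses the standard-library `wave` module to write a one-second 220 Hz tone as 16-bit mono PCM and then reads the file header back. The file name `tone.wav`, the amplitude, and the duration are arbitrary choices for the example:

```python
import math
import os
import struct
import tempfile
import wave

framerate = 44100      # samples per second, as in the cells above
duration = 1           # seconds (arbitrary choice)
frequency = 220.0      # Hz
amplitude = 0.5        # fraction of full scale (arbitrary choice)

# Scale the sine wave to signed 16-bit integers
n_samples = framerate * duration
samples = [
    int(amplitude * 32767 * math.sin(2 * math.pi * frequency * n / framerate))
    for n in range(n_samples)
]

path = os.path.join(tempfile.mkdtemp(), 'tone.wav')
with wave.open(path, 'wb') as wav:
    wav.setnchannels(1)          # mono
    wav.setsampwidth(2)          # 2 bytes per sample = 16-bit
    wav.setframerate(framerate)
    # WAV sample data is little-endian, hence the '<' prefix
    wav.writeframes(struct.pack('<{}h'.format(n_samples), *samples))

# Read the header back to confirm what was written
with wave.open(path, 'rb') as wav:
    info = (wav.getnchannels(), wav.getsampwidth(),
            wav.getframerate(), wav.getnframes())

print(info)
```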

# <font color="red">Summary</font>


| File Type | Python Package | Reader/Writer |
| --- | --- | --- |
| **text** | built-in | `open` |
| **binary** | built-in | `open` |
| **binary (pickle)** | pickle | `load`/`dump`  |
|                 | Pandas |  `read_pickle`/`to_pickle` |
| **csv**    | Pandas | `read_csv`/`to_csv` |
|        | csv    | `reader`/`writer` |
|        | NumPy  | `loadtxt`/`genfromtxt`   |
| **JSON**   | json |    `load`/`dump`  |
|        | Pandas | `read_json`/`to_json` |
| **nc4**    | netCDF4 | `Dataset` |
| **HDF5**   | h5py    |  `File`  |
|        | Pandas  | `read_hdf`/`to_hdf` |
| **FITS**   | astropy.io.fits | `open`/`writeto` |
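
The table lists pickle, which is not demonstrated elsewhere in this lesson. As a brief sketch (the dictionary contents and the file name `coords.pkl` are made up), `pickle.dump` and `pickle.load` store and restore arbitrary Python objects through a file opened in binary mode:

```python
import os
import pickle
import tempfile

# Made-up object to store
data = {'lons': [105.5, 67.25], 'lats': [-22.72, -43.56], 'source': 'demo'}

path = os.path.join(tempfile.mkdtemp(), 'coords.pkl')

# Pickle files are binary, so use 'wb' and 'rb'
with open(path, 'wb') as f:
    pickle.dump(data, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored)
```

Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.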