<center>
<table>
  <tr>
    <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/nasa-logo.svg" width="100"/> </td>
     <td><img src="https://portal.nccs.nasa.gov/datashare/astg/training/python/logos/ASTG_logo.png?raw=true" width="80"/> </td>
     <td> <img src="https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png" width="130"/> </td>
    </tr>
</table>
</center>

---

<center>
<h1>
<font color="red"> 
Overview of Reading and Writing Files in Python
</font> 
</h1>
</center>

# <font color="blue">Introduction</font>


- Files are named locations on disk to store related information.
- A file can be seen as a contiguous set of bytes used to store data.
- Data in a file are organized in a specific format and can be anything as simple as a text file or as complicated as a program executable. 

Typically, a file operation takes place in the following order:

1. Open a file
2. Read or write (perform operation)
3. Close the file

Before attempting to open an existing file, make sure that it exists.
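
One possible way to perform that check is with `pathlib`. The sketch below is only an illustration: the helper `read_if_exists` and the file names `present.txt` and `absent.txt` are hypothetical and do not appear elsewhere in this notebook.

```python
from pathlib import Path
import tempfile


def read_if_exists(path):
    """Return the text stored in `path`, or None if the file is missing."""
    path = Path(path)
    if not path.is_file():
        return None
    return path.read_text()


# Demonstration inside a temporary directory (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp_dir:
    existing = Path(tmp_dir) / 'present.txt'
    existing.write_text('hello')

    found = read_if_exists(existing)
    missing = read_if_exists(Path(tmp_dir) / 'absent.txt')

print(found, missing)
```

An alternative is to call `open()` directly and catch `FileNotFoundError`, which avoids a gap between the check and the actual open.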

# <font color="blue">Manipulating ASCII Files</font>

### Create a file

To create a file, you need to use the `open()` function that takes two arguments: the file name and the mode.

First formulation:

```python
f = open(file_name, 'w')
f.write(data) 
f.close()
```

In this second formulation, the file is automatically closed:

```python
with open(file_name, 'w') as f:
     f.write(data)
```
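
The mode argument decides what happens to content that is already in the file. As a small sketch (the file name `modes_demo.txt` and the temporary directory are assumptions made for this illustration), `'w'` truncates the file while `'a'` appends to it:

```python
import os
import tempfile

# Hypothetical file in a temporary directory
demo_file = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

with open(demo_file, 'w') as f:    # create the file
    f.write('first\n')
with open(demo_file, 'a') as f:    # append to the existing content
    f.write('second\n')
with open(demo_file, 'w') as f:    # 'w' discards everything written so far
    f.write('third\n')
with open(demo_file, 'a') as f:
    f.write('fourth\n')

with open(demo_file, 'r') as f:
    content = f.read()

print(content)
print(f.closed)   # the with statement closed the file on exit
```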

**Example**

In [None]:
lons = [105.5, 67.25, 13.75, 86.20, 45.80, 150.5, -37.2]
lats = [-22.72, -43.56, 30.41, 75.57, 11.60, 17.3, 32.98]
num_lats = len(lats)

In [None]:
file_name = 'sample_text_file.txt'

with open(file_name, 'w') as f:
     for a, b in zip(lons,lats):
         f.write('{} \t {} \n'.format(a,b)) 

In [None]:
!cat $file_name

### Read file

First formulation:

```python
f = open(file_name, 'r')
data = f.read() 
f.close()
```

Second formulation:

```python
with open(file_name, 'r') as f:
     data = f.read() 
```
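
Besides `read()`, a file object offers `readline()`, `readlines()`, and direct iteration over its lines. The following sketch compares them on a small tab-separated file; the file name `read_demo.txt` and its contents are made up for the illustration.

```python
import os
import tempfile

# Hypothetical tab-separated file in a temporary directory
demo_file = os.path.join(tempfile.mkdtemp(), 'read_demo.txt')
with open(demo_file, 'w') as f:
    f.write('105.5\t-22.72\n67.25\t-43.56\n13.75\t30.41\n')

with open(demo_file) as f:
    whole = f.read()            # the entire file as one string

with open(demo_file) as f:
    first_line = f.readline()   # a single line, newline included

with open(demo_file) as f:
    all_lines = f.readlines()   # a list with one string per line

# Iterating over the file object reads one line at a time,
# which avoids loading a large file into memory at once.
with open(demo_file) as f:
    pairs = [tuple(float(v) for v in line.split()) for line in f]

print(pairs)
```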

**Example**

In [None]:
with open(file_name, 'r') as f:
     lines = f.readlines()
        
#print(lines)
for line in lines:
    a,b = line.split()
    print(a,b)

# <font color="blue"> Manipulating Binary Files</font>

- We can use the same `open` function with the `'rb'` or `'wb'` mode to manipulate binary files.
- The <a href="https://docs.python.org/3/library/struct.html">struct</a> module in Python is used to convert native Python data types such as strings and numbers into a string of bytes and vice versa. 
- In Python, we can use the `struct` module to parse binary files that store data as C structs.
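
As a minimal sketch of the conversion in both directions, the block below packs (longitude, latitude) pairs into bytes and unpacks them again. The record layout `'<2d'` (little-endian, two 64-bit floats) and the variable names are choices made for this illustration only.

```python
import struct

# Assumed record layout: little-endian, two 64-bit floats (lon, lat)
record_format = '<2d'
record_size = struct.calcsize(record_format)   # bytes per record

demo_lons = [105.5, 67.25, 13.75]
demo_lats = [-22.72, -43.56, 30.41]

# Python numbers -> bytes
payload = b''.join(struct.pack(record_format, lon, lat)
                   for lon, lat in zip(demo_lons, demo_lats))

# bytes -> Python numbers, one tuple per record
decoded = list(struct.iter_unpack(record_format, payload))

print(record_size, len(payload))
print(decoded)
```

Stating the byte order explicitly (`<` or `>`) keeps the file readable on machines whose native byte order differs from the one that wrote it.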

**Example**

In [None]:
file_name = 'sample_text_file.bin'

import struct
with open(file_name, 'wb') as f:
     for i in range(len(lons)):
         f.write(struct.pack('d', lons[i]))
         f.write(struct.pack('d', lats[i]))

In [None]:
ya = []
with open(file_name, 'rb') as fid:
     i = 0
     nBytes = struct.calcsize('d')
     while True:
           rec = fid.read(nBytes)
           if len(rec) != nBytes:
              break
           (y,) = struct.unpack('d', rec)
           ya.append(y)
           i += 1

import numpy as np
ya = np.array(ya)
ya.shape = (num_lats,2)
print(ya)

**Manipulating `JPEG` Files**

In [None]:
import urllib.request
url = "https://raw.githubusercontent.com/astg606/py_materials/master/input_output/"
file_name = "cat.jpg"
urllib.request.urlretrieve(url+file_name, file_name)

In [None]:
with open('cat.jpg', 'rb') as f:
    data = f.readline()
print(data)

In [None]:
#':'.join('{:02x}'.format(x) for x in data)

A hex dump is useful for debugging. In a hex dump, each byte (8 bits) is represented as a two-digit hexadecimal number.
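
A hex dump can be produced with a few lines of Python. The helper `hex_dump` below is a hypothetical sketch, and `sample` is a hand-written byte string that imitates the first bytes of a JFIF-style JPEG file; it is not read from `cat.jpg`.

```python
def hex_dump(data, width=16):
    """Format `data` as lines of an offset followed by two-digit hex bytes."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = ' '.join('{:02x}'.format(byte) for byte in chunk)
        lines.append('{:08x}  {}'.format(offset, hex_part))
    return '\n'.join(lines)


# Hand-written sample bytes (assumption: typical start of a JFIF file)
sample = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00'
print(hex_dump(sample))
```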

In [None]:
with open('cat.jpg', 'rb') as f:
    data = f.read()
 
    if data.startswith(b'\xff\xd8'):
        info = 'This is a jpeg file (%d bytes long)'
    else:
        info = 'This is a random file (%d bytes long)'

    print(info % len(data))

In [None]:
from PIL import Image
jpgfile = Image.open("cat.jpg")

print(jpgfile.bits, jpgfile.size, jpgfile.format, jpgfile.mode)

In [None]:
#%matplotlib inline
#jpgfile.show()

In [None]:
from IPython.display import Image
kitty = Image(filename = 'cat.jpg')
kitty

# <font color="blue"> Manipulating CSV Files</font>

- A CSV (Comma Separated Values) file is a text file that uses specific structuring to arrange tabular data.
- CSV files usually separate data values with a comma, although other delimiters such as spaces or tabs are also common.
- They are a convenient way to manipulate data from spreadsheets and databases.

**Using the CSV Module**

In [None]:
file_name = 'sample_text_file.csv'

with open(file_name, 'w') as f:
     for a, b in zip(lons,lats):
         f.write('{}, {}\n'.format(a,b))

In [None]:
!cat $file_name

In [None]:
import csv

# Open the file in a with block so that it is closed after reading
with open(file_name, newline='') as f:
    cr = csv.reader(f)
    for line in cr:
        print(line)
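
The `csv` module can also write files, and it always returns fields as strings when reading. The sketch below writes a header plus two rows with `csv.writer`, then reads them back with `csv.DictReader` and converts the fields to floats. The file name `coords_demo.csv` and the column names `lon` and `lat` are assumptions made for this example.

```python
import csv
import os
import tempfile

# Hypothetical file in a temporary directory
demo_file = os.path.join(tempfile.mkdtemp(), 'coords_demo.csv')
rows = [(105.5, -22.72), (67.25, -43.56)]

# newline='' lets the csv module control line endings itself
with open(demo_file, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['lon', 'lat'])   # header row
    writer.writerows(rows)

with open(demo_file, newline='') as f:
    reader = csv.DictReader(f)        # uses the header row as keys
    parsed = [(float(r['lon']), float(r['lat'])) for r in reader]

print(parsed)
```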

**Using Numpy**

In [None]:
import numpy as np
coords = np.loadtxt(file_name, delimiter=',', unpack=True)
print(coords)

**Using pandas**

In [None]:
import pandas as pd
df = pd.read_csv(file_name, names=["lons", "lats"])

df

You can also read remote files with pandas.

In [None]:
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

df

In [None]:
df.describe().transpose()

In [None]:
df.Age.plot()

# <font color="blue"> Manipulating JSON Files</font>

* JSON (JavaScript Object Notation) is a popular data format used for representing structured data. 
* It is a language-independent text format that can be used from Python, Perl, and many other languages.
* JSON format is used for data communications between servers and web applications.
* It is built on two structures:

     - A collection of name/value pairs. This is realized as an object, record, dictionary, hash table, keyed list, or associative array.
     - An ordered list of values. This is realized as an array, vector, list, or sequence.
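
In Python, these two structures map onto `dict` and `list`. A short round-trip sketch with the standard `json` module follows; the `station` record is invented for the illustration.

```python
import json

# Hypothetical record: a JSON object holding a string, an array,
# a boolean, and a null value
station = {
    'name': 'Site A',
    'coords': [105.5, -22.72],
    'active': True,
    'notes': None,
}

text = json.dumps(station, indent=2, sort_keys=True)   # Python -> JSON text
restored = json.loads(text)                            # JSON text -> Python

print(text)
print(restored == station)

# JSON has a single array type, so a tuple comes back as a list
print(json.loads(json.dumps((1, 2))))
```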

In [None]:
from urllib.request import urlopen
import json
with urlopen('http://data.nba.net/prod/v2/2018/teams.json') as response:
     source = response.read()
     data = json.loads(source)
    
data

In [None]:
nba_teams = [team for team in data['league']['standard'] if team['isNBAFranchise']]
print(nba_teams)

In [None]:
with open('nba_teams.json', 'w') as f:
     json.dump(nba_teams, f, indent = 4, sort_keys = True)

In [None]:
!cat nba_teams.json

# <font color="blue"> Manipulating Excel Files</font>

**Using pandas**

In [None]:
import pandas as pd

# Create URL to Excel file (alternatively this can be a filepath)
url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.xlsx'

df = pd.read_excel(url)

df

# <font color="blue"> Manipulating Scientific Data Format Files</font>

### <font color="red"> NetCDF Files </font>

- NetCDF (network Common Data Form) is a file format for storing multidimensional scientific data (variables) such as temperature, humidity, pressure, wind speed, and direction.
- A NetCDF file contains a header which describes the layout of the rest of the file, in particular the data arrays, as well as arbitrary file metadata in the form of name/value attributes. 
- The format is platform independent.
- The data are stored in a fashion that allows efficient subsetting.

In [None]:
from netCDF4 import Dataset
import numpy as np
#from numpy.random import uniform

#------------------
# Creating the file
#------------------
with Dataset('my_file.nc4', mode='w', format='NETCDF4') as ncFid:
     print(ncFid.file_format)

     #------------------------
     # Defining the dimensions
     #------------------------
     time = ncFid.createDimension('time', None)
     lev  = ncFid.createDimension('lev', 72)
     lat  = ncFid.createDimension('lat', 91)
     lon  = ncFid.createDimension('lon', 144)

     print(ncFid.dimensions)

     #------------------------------------------
     # Creating variables and Setting attributes
     #------------------------------------------
     times = ncFid.createVariable('time','f8',('time',))
     times.units = 'hours since 0001-01-01 00:00:00.0'
     times.calendar = 'gregorian'

     levels = ncFid.createVariable('lev','i4',('lev',))
     levels.units = 'hPa'

     latitudes = ncFid.createVariable('lat','f4',('lat',))
     latitudes.units = 'degrees north'

     longitudes = ncFid.createVariable('lon','f4',('lon',))
     longitudes.units = 'degrees east'

     temp = ncFid.createVariable('temp','f4',('time','lev','lat','lon',))
     temp.units = 'K'

     ncFid.description = 'Sample netCDF file'
     ncFid.source      = 'netCDF4 python tutorial'
     ncFid.history     = 'Created on June 18, 2019'

     #---------------
     # Setting values
     #---------------
     latitudes[:]  =  np.arange(-90,91,2.0)
     longitudes[:] =  np.arange(-180,180,2.5)
     levels[:]     =  np.arange(0,72,1)
     temp[0:5,:,:,:] = 300*np.random.uniform(\
         size=(5,levels.size,latitudes.size, longitudes.size))


In [None]:
with Dataset('my_file.nc4', mode='r') as ncFid:
     temp = ncFid.variables['temp'][:]

print(temp.shape)
print(np.mean(temp), np.std(temp), np.max(temp), np.min(temp))

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
cs = plt.contourf(temp[0,0,:,:])

### <font color="red"> HDF5 Files </font>

- The Hierarchical Data Format version 5 (HDF5) is an open-source file format that supports large, complex, heterogeneous data. 
- HDF5 uses a "file directory" like structure that allows you to organize data within the file in many different structured ways, as you might do with files on your computer.
- The HDF5 format also allows for embedding of metadata making it self-describing.

![hdf5](https://www.neonscience.org/sites/default/files/images/HDF5/hdf5_structure4.jpg)
Image Source: https://www.neonscience.org/

In [None]:
import h5py
import numpy as np

# gzip compression flag
comp = 6

#------------------
# Creating the file
#------------------
with h5py.File('my_file.h5', 'w') as hFid:
     #----------------
     # File attributes
     #----------------
     hFid.attrs['source']      = 'H5Py Tutorial'
     hFid.attrs['history']     = 'Created on June 18, 2019'
     hFid.attrs['description'] = 'Sample HDF5 file'

     #------------------------
     # Defining the dimensions
     #------------------------
     lat = np.arange(-90,91,2.0)
     dset = hFid.require_dataset('lat', 
                                 shape=lat.shape, 
                                 dtype=np.float32, compression=comp)
     dset[...] = lat
     dset.attrs['name'] = 'latitude'
     dset.attrs['units'] = 'degrees north'

     lon = np.arange(-180,180,2.5)
     dset = hFid.require_dataset('lon', shape=lon.shape, dtype=np.float32, compression=comp)
     dset[...] = lon
     dset.attrs['name'] = 'longitude'
     dset.attrs['units'] = 'degrees east'

     lev = np.arange(0,72,1)
     dset = hFid.require_dataset('lev', shape=lev.shape, dtype=np.int32, compression=comp)
     dset[...] = lev
     dset.attrs['name'] = 'vertical levels'
     dset.attrs['units'] = 'hPa'

     time = np.arange(0,5,1)
     dset = hFid.require_dataset('time', shape=time.shape, maxshape=(None,), dtype=np.float32, compression=comp)
     dset[...] = time
     dset.attrs['name'] = 'time'
     dset.attrs['units'] = 'hours since 2013-01-01 00:00:00.0'
     dset.attrs['calendar'] = 'gregorian'

     #------------------------------------------
     # Creating variables and Setting attributes
     #------------------------------------------
     arr = np.zeros((5,lev.size,lat.size,lon.size))
     arr[0:5,:,:,:] = 300*np.random.uniform(
                    size=(5,lev.size,lat.size,lon.size))
     dset = hFid.require_dataset('temp', shape=arr.shape, 
                                 dtype=np.float32, compression=comp)
     dset[...] = arr
     dset.attrs['name'] = 'temperature'
     dset.attrs['units'] = 'K'

     #---------------
     # Creating Groups 
     #---------------
     gpData2D = hFid.create_group('2D_Data')
     sgpLand  = gpData2D.create_group('2D_Land')
     sgpSea   = gpData2D.create_group('2D_Sea')

     gpData3D = hFid.create_group('3D_Data')

     #----------------------
     # Write data in a group
     #----------------------
     temp = gpData3D.create_dataset('temp', data=arr)
     temp.attrs['name'] = 'temperature'
     temp.attrs['units'] = 'K'


In [None]:
with h5py.File('my_file.h5', 'r') as hFid:
     print(hFid.keys())

     # Dataset.value was removed in h5py 3.0; index with [()] to read the whole dataset
     lev  = hFid['lev'][()]
     lat  = hFid['lat'][()]
     lon  = hFid['lon'][()]
     time = hFid['time'][()]

     temp1 = hFid['temp'][()]
     print(temp1[0,0,0,0], temp1[4,6,7,15])

     temp2 = hFid['3D_Data']['temp'][()]
     print(temp2[0,0,0,0], temp2[4,6,7,15])

In addition to these hierarchical data formats for Earth science data, there are also data formats used by GIS applications.

### <font color="red"> Shapefile Files </font>

The shapefile format:
* Is a digital vector storage format for storing geometric location and associated attribute information.
* Geographic features in a shapefile can be represented by points, lines, or polygons (areas).
* Is non-topological. It does not maintain spatial relationship information such as connectivity, adjacency, and area definition.
* Because points, lines, and polygons are structured differently, each individual shapefile can only contain one vector type (all points, all lines, or all polygons). You will not find a mixture of point, line, and polygon objects in a single shapefile.
* Was introduced with ArcView GIS version 2 in the early 1990s.

#### Representation of the geographic features of a shapefile

![features](https://www.earthdatascience.org/images/courses/earth-analytics/spatial-data/points-lines-polygons-vector-data-types.png)
Image Source: Colin Williams (NEON)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import cartopy
import cartopy.crs as ccrs
import cartopy.io.shapereader as shpreader

# Get the file name from the natural_earth database
shpfilename = shpreader.natural_earth(resolution='110m',
                                      category='cultural',
                                      name='admin_0_countries')

In [None]:
# Read file and get countries 
reader = shpreader.Reader(shpfilename)
countries = reader.records()
next_country = next(countries)

In [None]:
print(type(next_country.attributes))

In [None]:
# Print features of a country
for key in next_country.attributes:
    print("{:} --> {:}".format(key,next_country.attributes[key]))

In [None]:
#### define a function which returns the population given the country
population = lambda country: country.attributes['POP_EST']

# Countries sorted by population
countries_sorted_by_population = sorted(reader.records(), \
                                         key=population)

num_countries = len(countries_sorted_by_population)
n = 5

# Get the 5 most populated
most_populated = countries_sorted_by_population[num_countries-n:]

print("Most Populated Countries")
for nation in most_populated:
    print("   {:>} --> {:>}".format(nation.attributes['NAME_LONG'], \
                               nation.attributes['POP_EST']))

# Get the 5 least populated
least_populated = countries_sorted_by_population[:n]

print()
print("Least Populated Countries")
for nation in least_populated:
    print("   {:>} --> {:>}".format(nation.attributes['NAME_LONG'], \
                               nation.attributes['POP_EST']))   

In [None]:
# Plotting

# Select the map projection
#----------------------
ax = plt.axes(projection=ccrs.PlateCarree())
ax.add_feature(cartopy.feature.OCEAN)
 
# Select the area of interest
#-----------------------
ax.set_extent([-150, 60, -25, 60])
 
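# Iterate over a fresh generator: `countries` was partly consumed by next() above
for country in reader.records():
    # add_geometries expects an iterable of geometries, so wrap each one in a list
    if country.attributes['ADM0_A3'] == 'USA':
        ax.add_geometries([country.geometry], ccrs.PlateCarree(),
                          facecolor=(0, 0, 1),
                          label=country.attributes['ADM0_A3'])
    else:
        ax.add_geometries([country.geometry],
                          ccrs.PlateCarree(),
                          facecolor=(0, 1, 0),
                          label=country.attributes['ADM0_A3'])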
 
plt.show()

# <font color="blue"> Manipulating FITS Files </font>

- FITS (Flexible Image Transport System) is a portable file standard widely used in the astronomy community to store images and tables.
- Most FITS files, when opened in a web browser, show a human-readable ASCII header that describes the data contained in the file.
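
To illustrate why the header is readable as plain text: a FITS header is a sequence of 80-character ASCII records ("cards"), padded to a multiple of 2880 bytes and terminated by an `END` card. The sketch below builds a tiny header by hand and parses it with the standard library only. The helpers `make_card` and `parse_header` and the card values are hypothetical and are not taken from `HorseHead.fits`; in practice, `astropy.io.fits` does this work.

```python
CARD_SIZE = 80      # each header record ("card") is 80 ASCII characters
BLOCK_SIZE = 2880   # the header is padded to a multiple of 2880 bytes


def make_card(keyword, value=None, comment=''):
    """Build one 80-byte card; a card without a value holds only the keyword."""
    if value is None:
        text = keyword
    else:
        text = '{:<8}= {:>20} / {}'.format(keyword, value, comment)
    return text.ljust(CARD_SIZE).encode('ascii')


# Hand-built, hypothetical header
header = b''.join([
    make_card('SIMPLE', 'T', 'conforms to FITS standard'),
    make_card('BITPIX', '16', 'bits per data value'),
    make_card('NAXIS', '2', 'number of axes'),
    make_card('END'),
]).ljust(BLOCK_SIZE, b' ')


def parse_header(raw):
    """Naive parser: returns {keyword: value string} up to the END card.

    It splits on '/', so it does not handle string values containing '/'.
    """
    values = {}
    for start in range(0, len(raw), CARD_SIZE):
        card = raw[start:start + CARD_SIZE].decode('ascii')
        keyword = card[:8].strip()
        if keyword == 'END':
            break
        if card[8:10] == '= ':
            values[keyword] = card[10:].split('/')[0].strip()
    return values


parsed = parse_header(header)
print(parsed)
```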


**Read File**

In [None]:
from astropy.io import fits
url = 'http://data.astropy.org/tutorials/FITS-images/HorseHead.fits'
fits_image = fits.open(url)

**Print Metadata**

In [None]:
fits_image.info()

In [None]:
fits_image[0].header

**Extract the Data**

In [None]:
image_data = fits_image[0].data

In [None]:
print(type(image_data))
print(image_data.shape)

**Viewing the Image Data**

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
from astropy.visualization import astropy_mpl_style
plt.style.use(astropy_mpl_style)

plt.imshow(image_data, cmap='gray')
plt.colorbar()

In [None]:
import numpy as np
print('Min:   ', np.min(image_data))
print('Max:   ', np.max(image_data))
print('Mean:  ', np.mean(image_data))
print('Stdev: ', np.std(image_data))

# <font color="blue"> Manipulating Audio Files</font>

**Playing Audio Files**

In [None]:
import urllib.request
urllib.request.urlretrieve("https://tinyurl.com/yx3k5kw5", "beat.wav")

In [None]:
import simpleaudio as sa

file_name = 'beat.wav'
wave_obj = sa.WaveObject.from_wave_file(file_name)
play_obj = wave_obj.play()
play_obj.wait_done()  # Wait until sound has finished playing

**Analyze the Audio File**

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.io import wavfile

In [None]:
samplerate, data = wavfile.read(file_name)

In [None]:
samplerate

In [None]:
data.shape

In [None]:
plt.plot(data[:4*samplerate]) #plot first 4 seconds

In [None]:
from scipy.fftpack import fft,fftfreq

datafft = fft(data)
#Get the magnitude of each complex FFT coefficient:
fftabs = abs(datafft)

In [None]:
samples = data.shape[0]
freqs = fftfreq(samples, 1/samplerate)

In [None]:
plt.plot(freqs,fftabs)

# <font color="blue">Summary</font>


| File Type | Python Package | Reader/Writer |
| --- | --- | --- |
| **text** | | `open` |
| **binary** | | `open` |
| **binary (pickle)** | pickle | `load`/`dump`  |
|                 | Pandas |  `read_pickle`/`to_pickle` |
| **csv**    | Pandas | `read_csv`/`to_csv` |
|        | csv    | `reader`/`writer` |
|        | Numpy  | `loadtxt`/`genfromtxt`   |
| **Excel** | Pandas | `read_excel`/`to_excel` |
|            | xlrd | `open_workbook` |
|            | xlwt | `Workbook` |
| **JSON**   | json |    `load`/`dump`  |
|        | Pandas | `read_json`/`to_json` |
| **nc4**    | netCDF4 | `Dataset` |
| **HDF5**   | h5py    |  `File`  |
|        | Pandas  | `read_hdf`/`to_hdf` |
| **FITS**   | fits (astropy) | `open` |
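
The table lists `pickle`, which is not demonstrated above. The following is a minimal sketch of its `dump`/`load` pair; the file name `coords_demo.pkl` and the `coords` dictionary are assumptions made for this example.

```python
import os
import pickle
import tempfile

# Hypothetical data and file name
coords = {'lons': [105.5, 67.25], 'lats': [-22.72, -43.56]}
demo_file = os.path.join(tempfile.mkdtemp(), 'coords_demo.pkl')

# Pickle files are binary, so use the 'wb' and 'rb' modes
with open(demo_file, 'wb') as f:
    pickle.dump(coords, f)

with open(demo_file, 'rb') as f:
    restored = pickle.load(f)

print(restored == coords)
```

Only unpickle data from sources you trust, because loading a pickle can execute arbitrary code.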