# Installation
CNGI documentation is located here:
[https://cngi-prototype.readthedocs.io/en/latest/index.html](https://cngi-prototype.readthedocs.io/en/latest/index.html)

Google Colab requires specific older versions of some packages such as Pandas and Dask, so we will install CNGI without its normal dependencies and then manually install each dependency afterwards.

In a normal environment, you would leave out the --no-dependencies option so that pip installs the dependencies itself.

In [2]:
import os, time

start = time.time()
print("installing cngi (takes a few minutes)...")
os.system("apt-get install libgfortran3")
os.system("pip install --extra-index-url https://casa-pip.nrao.edu/repository/pypi-group/simple casatools")
os.system("pip install cngi-prototype==0.0.8 --no-dependencies")
os.system("pip install --upgrade dask")
os.system("pip install --upgrade xarray")

elapsed = round(time.time()-start)
print(f'Finished installing after {elapsed} seconds')

print("downloading MeasurementSet from CASAguide First Look at Imaging...")
os.system("wget https://bulk.cv.nrao.edu/almadata/public/working/sis14_twhya_calibrated_flagged.ms.tar")
os.system("tar -xvf sis14_twhya_calibrated_flagged.ms.tar")
elapsed = round(time.time()-start)
print(f'Finished downloading after {elapsed} seconds')
print('complete')

installing cngi (takes a few minutes)...
Finished installing after 10 seconds
downloading MeasurementSet from CASAguide First Look at Imaging...
Finished downloading after 69 seconds
complete
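The install cell above discards the exit status that os.system returns, so a failed command would still be followed by the "Finished installing" message. As a minimal sketch (the helper name run is hypothetical, and the command shown is a harmless stand-in rather than one of the real pip or apt-get calls), the standard library's subprocess module can surface failures:

```python
import subprocess
import sys


def run(cmd):
    """Run a command given as a list of arguments and return its exit code.

    On failure, print the command and whatever it wrote to stderr.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"command failed ({result.returncode}): {' '.join(cmd)}")
        print(result.stderr)
    return result.returncode


# Stand-in command; in the notebook this would be one of the install commands,
# for example ["pip", "install", "--upgrade", "xarray"]
status = run([sys.executable, "-c", "print('ok')"])
print("exit status:", status)
```

Checking the returned status after each step makes it possible to stop early instead of continuing with a partially installed environment.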


# Initialize the Processing Environment
Colab [does not support](https://github.com/googlecolab/colabtools/issues/569) websocket connections between client and kernel, so the scheduler status dashboard is inaccessible.
It may be possible to work around this using ngrok.

In [3]:
from cngi.direct import InitializeFramework
client = InitializeFramework(workers=2,memory='6GB',processes=False)
client

Failed to start diagnostics server on port 8787. [Errno 99] Cannot assign requested address
Could not launch service 'bokeh' on port 8787. Got the following message:

[Errno 99] Cannot assign requested address
  self.scheduler.start(scheduler_address)


Client: Scheduler: inproc://172.28.0.2/655/1
Cluster: Workers: 2, Cores: 2, Memory: 12.00 GB


# Convert an MS to xarray NetCDF

In [4]:
from cngi.conversion import ms_to_ncdf

start = time.time()
ms_to_ncdf('sis14_twhya_calibrated_flagged.ms')
elapsed = round(time.time() - start)
print(f'Finished conversion in {elapsed} seconds')

processing ddi 0: chunks=1, size=53717
completed ddi 0
Complete.
Finished conversion in 12 seconds


# Open an xarray NetCDF based MS

(todo) Retrieve a summary of the xarray NetCDF MS file. 

Then create a new xarray Dataset from it.

This Dataset is the common data structure passed around to most other CNGI functions.

In [5]:
from cngi.ms import summarizeFile
from cngi.dio import read_ncdf

# returns summary as a pandas dataframe
#mssummary = summarizeFile('sis14_twhya_calibrated_flagged.pq')
#print(mssummary[['ddi','row_count_estimate','col_count','size_GB']])

# there is only one ddi in the MS, but pretend there are more and one is chosen
ddi = 0 #mssummary.ddi.values[0]

# here we create the xarray Dataset for use in other CNGI functions
xds = read_ncdf('sis14_twhya_calibrated_flagged.ncdf',ddi=ddi)

# examine the start of the dataframe 
# note that the column selection should be made into a convenience function
#cols = [col for col in ddf.columns.values if col not in list(ddf.columns.values[ddf.columns.str.match('(FLAG\d)|(R|IDATA\d)')])]
#ddf[cols].head()
xds

<xarray.Dataset>
Dimensions:         (chans: 384, pols: 2, rows: 80563, uvw: 3)
Coordinates:
  * chans           (chans) int32 0 1 2 3 4 5 6 ... 377 378 379 380 381 382 383
  * pols            (pols) int32 0 1
  * uvw             (uvw) int32 0 1 2
  * rows            (rows) int64 0 1 2 3 4 5 ... 80558 80559 80560 80561 80562
Data variables:
    ANTENNA1        (rows) int32 dask.array<chunksize=(53717,), meta=np.ndarray>
    ANTENNA2        (rows) int32 dask.array<chunksize=(53717,), meta=np.ndarray>
    ARRAY_ID        (rows) int32 dask.array<chunksize=(53717,), meta=np.ndarray>
    DATA_DESC_ID    (rows) int32 dask.array<chunksize=(53717,), meta=np.ndarray>
    EXPOSURE        (rows) float64 dask.array<chunksize=(53717,), meta=np.ndarray>
    FEED1           (rows) int32 dask.array<chunksize=(53717,), meta=np.ndarray>
    FEED2           (rows) int32 dask.array<chunksize=(53717,), meta=np.ndarray>
    FIELD_ID        (rows) int32 dask.array<chunksize=(53717,), meta=np.ndarray>
    FLA
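As a rough illustration of the kind of object that read_ncdf returns, the sketch below builds a tiny in-memory xarray Dataset using a few of the same variable and dimension names. The sizes, values, and dimension order are invented for illustration, and the real Dataset is backed by dask arrays rather than held in memory.

```python
import numpy as np
import xarray as xr

# Toy sizes; the real file has 80563 rows, 384 channels and 2 polarizations
nrows, nchans, npols = 6, 4, 2
dims = ("rows", "chans", "pols")  # dimension order is an assumption
shape = (nrows, nchans, npols)

toy = xr.Dataset(
    data_vars={
        "RDATA": (dims, np.ones(shape)),
        "IDATA": (dims, np.zeros(shape)),
        "FLAG": (dims, np.zeros(shape, dtype=bool)),
        "ANTENNA1": ("rows", np.array([0, 0, 0, 1, 1, 2], dtype="int32")),
        "ANTENNA2": ("rows", np.array([1, 2, 3, 2, 3, 3], dtype="int32")),
    },
    coords={
        "rows": np.arange(nrows),
        "chans": np.arange(nchans),
        "pols": np.arange(npols),
    },
)

# Variables are reachable by attribute or by key and carry named dimensions
print(dict(toy.sizes))
print(toy.RDATA.dims, toy["ANTENNA1"].dtype)

# Select the first two channels by position
subset = toy.isel(chans=slice(0, 2))
print(dict(subset.sizes))
```

Because every variable carries dimension names, later steps can refer to 'chans' or 'rows' directly instead of tracking axis numbers.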

Perform the sequence of calculations defined in the original [dask demo notebook](https://colab.research.google.com/github/ryanraba/casa6/blob/master/casa7experiments.ipynb):
1. apply flags (sets flagged data cells to nan)
2. average magnitude of cross products
3. subtract mean magnitude of each baseline from visibilities
4. filter out baselines with outlier mean noise
5. take the mean across channels to get a continuum
6. plot the UV space


In [0]:
original_real = xds.RDATA
original_imag = xds.IDATA

In [98]:
# 1. apply flags (sets flagged data cells to nan)
# where() keeps values where the condition is True, so the flag mask is inverted
xds['RDATA'] = xds.RDATA.where(~xds.FLAG.isin([True]), drop=False)
xds['IDATA'] = xds.IDATA.where(~xds.FLAG.isin([True]), drop=False)
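The direction of the mask matters here: xarray's where keeps the cells for which the condition is True and fills the rest with NaN. The small sketch below, on made-up values and assuming that True in FLAG marks a flagged cell, shows both directions side by side.

```python
import numpy as np
import xarray as xr

data = xr.DataArray([1.0, 2.0, 3.0, 4.0], dims="rows")
flag = xr.DataArray([False, True, False, True], dims="rows")  # True = flagged

kept_flagged = data.where(flag)   # keeps only the flagged cells
kept_good = data.where(~flag)     # keeps only the unflagged cells

print(kept_flagged.values)
print(kept_good.values)
```

Only the second form sets the flagged cells to NaN, which is what the flagging step is meant to do.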



In [0]:
# calculate mean values for each component of the complex visibilities
xds.RDATA.mean(skipna=True).values
type(xds.IDATA.mean(skipna=True).values)
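To make the role of skipna and .values concrete, here is a small sketch on a made-up array containing one NaN, such as a flagged cell would produce. With a dask-backed variable, accessing .values is also the point at which the lazy computation actually runs.

```python
import numpy as np
import xarray as xr

values = xr.DataArray([[1.0, np.nan], [3.0, 5.0]], dims=("rows", "chans"))

overall = values.mean(skipna=True)        # NaN cells are left out of the average
print(float(overall))                     # (1 + 3 + 5) / 3
print(type(overall.values))               # .values is a NumPy array, 0-d here
print(float(values.mean(skipna=False)))   # a single NaN propagates to the result
```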




In [45]:
# determine channel average
xds['RDATA'].mean(dim='chans')
xds['IDATA'].mean(dim='chans')

<xarray.DataArray 'RDATA' (pols: 2, rows: 80563)>
dask.array<mean_agg-aggregate, shape=(2, 80563), dtype=float64, chunksize=(2, 53717), chunktype=numpy.ndarray>
Coordinates:
  * pols     (pols) int32 0 1
  * rows     (rows) int64 0 1 2 3 4 5 6 ... 80557 80558 80559 80560 80561 80562

AttributeError: ignored
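A notebook cell displays only the value of its last expression, so to use both channel averages they need to be assigned to names. The following sketch does that on constant toy arrays (shapes, values, and dimension order are assumptions) and combines the two averages into an amplitude.

```python
import numpy as np
import xarray as xr

dims = ("pols", "chans", "rows")  # dimension order is an assumption
real = xr.DataArray(np.full((2, 4, 3), 3.0), dims=dims)
imag = xr.DataArray(np.full((2, 4, 3), 4.0), dims=dims)

# Averaging over the named dimension removes it from the result
real_avg = real.mean(dim="chans")
imag_avg = imag.mean(dim="chans")

# Amplitude of the channel-averaged visibility
amplitude = (real_avg ** 2 + imag_avg ** 2) ** 0.5
print(amplitude.dims, amplitude.shape)
```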