# Installation
CNGI documentation is located here:
[https://cngi-prototype.readthedocs.io/en/latest/index.html](https://cngi-prototype.readthedocs.io/en/latest/index.html)

Google Colab requires specific older versions of some packages, such as Pandas and Dask, so we install CNGI without its normal dependencies and then install each dependency manually afterwards.

Outside Colab, you would normally leave out the --no-dependencies option so that pip installs the dependencies automatically.

In [1]:
import os

print("installing cngi (takes a few minutes)...")
os.system("apt-get install -y libgfortran3")  # -y answers the confirmation prompt, since the cell cannot respond to it
os.system("pip install --extra-index-url https://casa-pip.nrao.edu/repository/pypi-group/simple casatools")
os.system("pip install cngi-prototype==0.0.6 --no-dependencies")
os.system("pip install --upgrade dask")

print("downloading MeasurementSet from CASAguide First Look at Imaging...")
os.system("wget https://bulk.cv.nrao.edu/almadata/public/working/sis14_twhya_calibrated_flagged.ms.tar")
os.system("tar -xvf sis14_twhya_calibrated_flagged.ms.tar")

print('complete')

installing cngi (takes a few minutes)...
downloading MeasurementSet from CASAguide First Look at Imaging...
complete
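
The install cell calls os.system and ignores its return value, so a failed pip or wget step would still be followed by "complete". As a minimal sketch of one alternative (not part of CNGI), the hypothetical helpers `run_step` and `run_all` below use the standard library's subprocess module to capture exit codes and stop at the first failure. The demonstration uses harmless commands in place of the real install steps.

```python
import subprocess
import sys


def run_step(args):
    """Run one command given as a list of arguments.

    Returns (exit_code, output), where output combines stdout and stderr.
    """
    try:
        result = subprocess.run(
            args,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except FileNotFoundError as err:
        # The executable itself is missing; 127 mirrors the shell convention.
        return 127, str(err)
    return result.returncode, result.stdout


def run_all(steps):
    """Run steps in order and stop at the first failure.

    Returns a list of (args, exit_code) for the steps that were attempted.
    """
    attempted = []
    for args in steps:
        code, output = run_step(args)
        attempted.append((args, code))
        if code != 0:
            print("step failed:", " ".join(args))
            print(output)
            break
    return attempted


# Harmless demonstration using the current Python interpreter.
demo = run_all([
    [sys.executable, "-c", "print('first step ok')"],
    [sys.executable, "-c", "import sys; sys.exit(3)"],
    [sys.executable, "-c", "print('never reached')"],
])
print([code for _, code in demo])  # exit codes of the attempted steps: [0, 3]
```

In the notebook, each step would be one of the commands from the cell above written as a list, for example ["pip", "install", "cngi-prototype==0.0.6", "--no-dependencies"].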


# Initialize the Processing Environment
The processing environment is somewhat limited on Colab, and the Bokeh dashboard does not work there (hence the port 8787 warnings in the output).

In [2]:
from cngi.direct import InitializeFramework
client = InitializeFramework(2,'6GB',False)
client

Failed to start diagnostics server on port 8787. [Errno 99] Cannot assign requested address
Could not launch service 'bokeh' on port 8787. Got the following message:

[Errno 99] Cannot assign requested address
  self.scheduler.start(scheduler_address)


| Client | Cluster |
|---|---|
| Scheduler: inproc://172.28.0.2/126/1 | Workers: 2, Cores: 2, Memory: 12.00 GB |
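
The notebook does not document the arguments of InitializeFramework. The output is consistent with reading the first as a worker count and the second as a per-worker memory limit, since 2 workers at '6GB' each gives the reported 12.00 GB. The sketch below only checks that arithmetic. The helpers `parse_memory` and `total_cluster_memory` are hypothetical, and decimal units (1 GB = 10**9 bytes) are an assumption.

```python
import re

# Assumption: decimal units. The notebook does not say whether 'GB'
# means 10**9 or 2**30 bytes.
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_memory(text):
    """Convert a string such as '6GB' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]B)\s*", text.upper())
    if match is None:
        raise ValueError("unrecognized memory string: %r" % text)
    value, unit = match.groups()
    return int(float(value) * UNITS[unit])


def total_cluster_memory(workers, memory_per_worker):
    """Total memory if every worker gets the same limit."""
    return workers * parse_memory(memory_per_worker)


total = total_cluster_memory(2, "6GB")
print("%.2f GB" % (total / 10**9))  # 12.00 GB
```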


# Convert an MS to Apache Parquet
The conversion takes some time.


In [3]:
from cngi.conversion import ms_to_pq

ms_to_pq('sis14_twhya_calibrated_flagged.ms')

processing ddi 0: chunks=0, size=214868
completed ddi 0
Complete.
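
Because the conversion is slow, it can be worth skipping when its output already exists. The sketch below assumes that ms_to_pq writes its result next to the input with the .ms suffix replaced by .pq, which is inferred from the file names used in this notebook rather than from the CNGI documentation. The helpers `pq_path_for` and `needs_conversion` are hypothetical.

```python
import os


def pq_path_for(ms_path):
    """Derive the Parquet output path by swapping a trailing '.ms' for '.pq'."""
    root, ext = os.path.splitext(ms_path.rstrip("/"))
    if ext.lower() != ".ms":
        raise ValueError("expected a path ending in .ms, got %r" % ms_path)
    return root + ".pq"


def needs_conversion(ms_path):
    """Return True when no converted output exists yet for ms_path."""
    return not os.path.exists(pq_path_for(ms_path))


ms = "sis14_twhya_calibrated_flagged.ms"
print(pq_path_for(ms))
if needs_conversion(ms):
    # In the notebook, this is where ms_to_pq(ms) would be called.
    print("conversion needed for %s" % ms)
```

An existence check alone does not detect a partially written output from an interrupted run, so deleting the .pq directory before retrying is the safer choice in that case.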


# Open an Apache Parquet based MS

Retrieve a summary of the Apache Parquet MS file, then create a new Dask DataFrame from it.

This DataFrame is the common data structure passed to most other CNGI functions.

In [16]:
from cngi.ms import summarizeFile
from cngi.dio import read_pq

# returns summary as a pandas dataframe
mssummary = summarizeFile('sis14_twhya_calibrated_flagged.pq')
print(mssummary[['ddi','row_count_estimate','col_count','size_GB']])

# there is only one ddi in the MS, but pretend there are more and one is chosen
ddi = mssummary.ddi.values[0]

# here we create the dask dataframe for use in other CNGI functions
ddf = read_pq('sis14_twhya_calibrated_flagged.pq',ddi=ddi)

# examine the start of the dataframe 
# note that the column selection should be made into a convenience function
# exclude the per-channel FLAGn, RDATAn and IDATAn columns so the preview stays readable
data_cols = ddf.columns.str.match(r'(FLAG\d)|([RI]DATA\d)')
cols = list(ddf.columns[~data_cols])
ddf[cols].head()

   ddi  row_count_estimate  col_count  size_GB
0    0               80563       2327     0.96


| | UVW0 | UVW1 | UVW2 | WEIGHT0 | WEIGHT1 | SIGMA0 | SIGMA1 | ANTENNA1 | ANTENNA2 | ARRAY_ID | DATA_DESC_ID | EXPOSURE | FEED1 | FEED2 | FIELD_ID | FLAG_ROW | INTERVAL | OBSERVATION_ID | PROCESSOR_ID | SCAN_NUMBER | STATE_ID | TIME | TIME_CENTROID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 95.58333 | -138.672313 | -13.694759 | 20.415682 | 26.796448 | 0.221319 | 0.19318 | 1 | 2 | 0 | 0 | 6.048 | 0 | 0 | 0 | False | 6.048 | 0 | 2 | 4 | 0 | 4860027000.0 | 4860027000.0 |
| 1 | -111.767122 | 28.948793 | 42.888321 | 19.593037 | 33.047886 | 0.225917 | 0.173951 | 1 | 3 | 0 | 0 | 6.048 | 0 | 0 | 0 | False | 6.048 | 0 | 2 | 4 | 0 | 4860027000.0 | 4860027000.0 |
| 2 | -100.502448 | -38.534069 | 51.759413 | 22.515686 | 32.499786 | 0.210745 | 0.175412 | 1 | 4 | 0 | 0 | 6.048 | 0 | 0 | 0 | False | 6.048 | 0 | 2 | 4 | 0 | 4860027000.0 | 4860027000.0 |
| 3 | 19.341554 | -7.336952 | -6.608505 | 24.278385 | 34.531357 | 0.20295 | 0.170174 | 1 | 5 | 0 | 0 | 6.048 | 0 | 0 | 0 | False | 6.048 | 0 | 2 | 4 | 0 | 4860027000.0 | 4860027000.0 |
| 4 | 33.538773 | -117.010647 | 9.465506 | 23.7983 | 32.714806 | 0.204987 | 0.174835 | 1 | 6 | 0 | 0 | 6.048 | 0 | 0 | 0 | False | 6.048 | 0 | 2 | 4 | 0 | 4860027000.0 | 4860027000.0 |
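
The code comment in the cell above suggests turning the column selection into a convenience function. One possible shape is sketched below with only the standard library, so it works on any iterable of column names, including ddf.columns. The function name `non_data_columns` is hypothetical, and the FLAGn, RDATAn and IDATAn naming is taken from the regular expression in the cell rather than from the CNGI documentation.

```python
import re

# Per-channel columns to hide: FLAG0, FLAG1, ..., RDATA0, ..., IDATA0, ...
# FLAG_ROW is kept because the character after FLAG is not a digit.
_PER_CHANNEL = re.compile(r"(FLAG\d)|([RI]DATA\d)")


def non_data_columns(columns):
    """Return the column names that are not per-channel flag or data columns."""
    # re.match anchors at the start of the name, like pandas str.match.
    return [name for name in columns if not _PER_CHANNEL.match(name)]


example = ["UVW0", "FLAG0", "FLAG_ROW", "RDATA0", "IDATA12", "TIME"]
print(non_data_columns(example))  # ['UVW0', 'FLAG_ROW', 'TIME']
```

With this helper, the preview in the notebook would become ddf[non_data_columns(ddf.columns)].head().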
