# Software Pipeline 
This Jupyter notebook documents the software pipeline that pre-processes RoboPol data for classifying artifacts and stars in an image with a Convolutional Neural Network. Each stage of the pipeline is presented with its code and a detailed explanation of how it works and what each line does. The stages of the pipeline, from beginning to end, are:
1. JSON File
2. Search and Copy 
3. IRAF Imexamine 
4. Sextractor
5. Sorting
6. Extract Stars
7. Extract Artifacts
8. FITS to PNG
9. Cutouts
10. Convolutional Neural Network

We will go through each stage in the above order. Every stage is written as a Python function. Except for the Convolutional Neural Network, each script can be executed on its own from the terminal without any input arguments. External Python libraries are avoided wherever possible and are used only when there is no other option, such as reading FITS files with astropy.

# JSON File

JSON stands for JavaScript Object Notation. It is a standard data interchange format, used mainly for transmitting data between a web application and a server. Here we use a JSON file to store the paths of the files and folders, along with the other inputs, that the code needs in order to run.

Following is the JSON File. 

'input_param' is a dictionary whose keywords are used to look up the inputs required by the codes. Each keyword is the name of a code in the pipeline, and its value is a list of arguments. Following is a description of the keywords:

1. search_copy:

The first argument here is the path of the folder where we store the RoboPol Data.

The second argument is the path of the folder where we want to store the images which contain artifacts. 

The third argument is the path of the text file that lists the images known to contain artifacts.


2. sorting:





In [None]:
import json

input_param = {
    "search_copy": ["/home/walop/Documents/data/",
                    "./artifact_images",
                    "/home/walop/Documents/bad_data.txt"],
    "sorting": ["./artifact_images/catalog/",
                "./sorted/"],
    "extract_stars": ["./extract_stars/",
                      "./sorted/"],
    "extract_artifacts": ["./extract_artifacts/",
                          "./artifact_images/output_87.log"],
    "fits_to_png": ["./artifact_images/*.fits",
                    "./png/"],
    "cutout": ["./png/*.png",
               "./extract_artifacts/",
               "./extract_stars/",
               "./cutout/reflection_training/",
               "./cutout/star_training/"],
    "cnn": ["./cutout/reflection_training/",
            "./cutout/star_training/",
            "./cutout/test/"],
    "testing_script": ["./artifact_images/catalog/",
                       "./artifact_images/"]
}

with open('input_param.json', 'w') as json_file:
    json.dump(input_param, json_file)
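
Each stage can then read its own inputs back from the file by keyword. The following is a minimal sketch of that lookup, not code taken from the pipeline: the helper name `load_stage_params` and the temporary demo file are assumptions made for illustration.

```python
import json
import os
import tempfile

def load_stage_params(stage, param_file="input_param.json"):
    """Return the list of inputs stored under one stage keyword.

    `load_stage_params` is a hypothetical helper, not part of the pipeline.
    """
    with open(param_file) as json_file:
        params = json.load(json_file)
    if stage not in params:
        raise KeyError("no entry for stage '%s' in %s" % (stage, param_file))
    return params[stage]

# Self-contained demonstration using a throwaway parameter file
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "input_param.json")
with open(demo_file, "w") as json_file:
    json.dump({"sorting": ["./artifact_images/catalog/", "./sorted/"]}, json_file)

# The order of the list entries matters: each position has a fixed meaning
catalog_dir, sorted_dir = load_stage_params("sorting", demo_file)
print(catalog_dir, sorted_dir)
```

Because the arguments are stored as positional lists, any change to the order of the entries in the JSON file must be mirrored in the code that reads them.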


# Code for Searching through directories of images for required images and copying them into a destination directory

This is the first function that we execute. Before usage check the following:

#### Inputs:
1. Enter the following paths in the search_copy keyword of the JSON File, as described in the JSON section:

a) Folder that contains the RoboPol data

b) Folder where we want to store the images containing artifacts

c) Text file that lists the names of the images known to contain artifacts


#### Usage: 
In the terminal, go to the home directory and type:

    python search_and_copy.py

#### Assumptions:
1. Each line in the text file contains part of the filename of an image, without the extension. An example text file named "bad_data.txt" is provided.
2. All images are contained in one directory, which may have multiple subdirectories. This is common when data are segregated into subdirectories by date of observation.
3. The data are FITS files.

With very small modifications, the code can be adapted to images or files of other types.
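
As an illustration of how the matching rule could be generalized to other file types, here is a hedged sketch with the extension turned into a parameter. The function name `find_matches`, the `extension` argument, and the demo filenames are all hypothetical. They are not part of the pipeline, which hard-codes ".fits".

```python
import os
import tempfile

def find_matches(source, names, extension=".fits"):
    """Return full paths of files under `source` whose filename contains
    one of the listed names immediately followed by `extension`.

    Hypothetical helper; the matching rule mirrors the assumptions above.
    """
    # Blank lines are ignored, otherwise the bare extension would match every file
    targets = [n.strip() + extension for n in names if n.strip()]
    matches = []
    for dirpath, dirnames, filenames in os.walk(source):
        for filename in sorted(filenames):
            if any(target in filename for target in targets):
                matches.append(os.path.join(dirpath, filename))
    return matches

# Demonstration on a throwaway directory tree with one dated subdirectory
root = tempfile.mkdtemp()
night = os.path.join(root, "2015-06-01")
os.makedirs(night)
for filename in ["obs_0001.fits", "obs_0002.fits", "obs_0001.png"]:
    open(os.path.join(night, filename), "w").close()

found_fits = find_matches(root, ["0001\n", "\n"])
found_png = find_matches(root, ["0001"], extension=".png")
print([os.path.basename(p) for p in found_fits])
print([os.path.basename(p) for p in found_png])
```

Separating the search from the copy step also makes it possible to inspect the list of matches before any file is copied.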



In [3]:
import os
from shutil import copy
import json

def search_and_copy():
    # Read the paths for this stage from the JSON parameter file
    with open('input_param.json') as json_file:
        data = json.load(json_file)

    source = data['search_copy'][0]    # folder where the original RoboPol data is stored
    dest = data['search_copy'][1]      # folder that receives the selected artifact images
    textfile = data['search_copy'][2]  # text file listing the names of images with artifacts

    os.makedirs(dest, exist_ok=True)   # make sure the destination folder exists

    with open(textfile, "r") as f:
        lines = f.readlines()          # one (partial) image name per line

    for line in lines:
        name = line.strip()            # strip whitespace and the trailing newline
        if not name:
            continue                   # skip blank lines; '.fits' alone would match every FITS file
        image = name + ".fits"         # build the name in a new variable so '.fits' is appended only once

        # Walk through every subdirectory of the source folder
        for dirpath, dirnames, filenames in os.walk(source):
            # Keep the files whose name contains the listed name followed by '.fits'
            for foundfile in [fi for fi in filenames if image in fi]:
                src = os.path.join(dirpath, foundfile)  # full path of the matching image
                # Copy only if a file with that name is not already in the destination
                if foundfile not in os.listdir(dest):
                    copy(src, dest)
#search_and_copy() #Uncomment this to run the code in Jupyter notebook

# IRAF Imexamine

In order to train our Convolutional Neural Network, we need to identify the artifacts and label them. We therefore first open the images containing artifacts manually and record the pixel coordinates of the center of each artifact. The Image Reduction and Analysis Facility (IRAF) has a tool called imexamine that performs several image processing tasks, including printing the coordinates and brightness of the pixel under the cursor.

#### Steps:
1. In IRAF, type epar imexamine and edit the following: 
   a) logfile - Enter the name of the output file that would contain the coordinates and pixel brightness of the center of the artifact
   b) keeplog - change to yes if it is no. 
2. Go to the folder containing the artifact images (using cd command) and then type : 
        imexamine *.fits 
3. Hover the mouse over the approximate center of the artifact and press x. Three values are printed in the terminal: x coordinate, y coordinate, pixel brightness. The same values are stored in the output logfile as well.
4. Press n to go to the next image and repeat step 3.
5. To quit, press q


#### Caution:
1. Imexamine is a fairly old tool, so perform the above steps exactly as described. For instance, pressing m instead of n prints image statistics into the logfile, and they cannot be removed during the session. If that happens, note it down and remove those lines from the file once the session is finished.
2. Imexamine loops through the images automatically: after the last image it returns to the first. Record which image you started with so that you know when to press q.
3. It is easy to run past the last image and start looping through the images from the beginning again. If that happens, open the output file and remove the repeated entries from the end of the file. Each new image name is recorded in the output file on a line beginning with the # symbol and a number enclosed in square brackets that gives its sequence number in the loop. This helps you identify whether any of the starting images were repeated.
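
The clean-up described in the third caution could also be scripted. The sketch below is only an illustration under stated assumptions: it assumes that each image header line looks like `# [N] image_name` and that the measurements for that image follow it, as described above. The function name `drop_repeated_images` and the sample log lines are hypothetical, so check them against a real logfile before relying on this.

```python
import re

# Assumed header layout: '#', a sequence number in square brackets, then the image name
HEADER = re.compile(r"^#\s*\[(\d+)\]\s*(\S+)")

def drop_repeated_images(log_lines):
    """Keep the first block of entries for every image and drop later repeats."""
    seen = set()
    kept = []
    keep_block = True  # lines before the first header are kept unchanged
    for line in log_lines:
        match = HEADER.match(line)
        if match:
            image = match.group(2)
            keep_block = image not in seen  # a repeated image starts a block to drop
            seen.add(image)
        if keep_block:
            kept.append(line)
    return kept

# Hypothetical log in which the first image was examined again after the loop wrapped
sample_log = [
    "# [1] img_a.fits",
    "512.0 498.0 1450.",
    "# [2] img_b.fits",
    "130.0 771.0 980.",
    "# [1] img_a.fits",
    "511.0 499.0 1448.",
]
cleaned = drop_repeated_images(sample_log)
print("\n".join(cleaned))
```

This keeps only the first pass over each image, so it is unsuitable if an image was deliberately examined twice.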


In [4]:
from IPython.display import Image
Image(filename='epar_imexamine.png') 
