# Project 2: Services Architecture - Initial Investigation

## Purpose

We all learn by reading and experimenting.  To get into that mode, we will
review some documentation and run some pre-existing tests. Based on the 
test results, we will make some preliminary observations on the behaviors
of interfaces and start developing ideas on how we can leverage them in
the architecture we will be developing.

## Investigation

There are three main services available that are to be used to build the 
application. These services do not need to be changed. Your application will 
build on top of them. The services are:

- A [Document Repository](http://seappserver1.rit.edu/DMService): This is where the incoming applications (scanned images of submitted applications) are stored.
- An [OCR Service](http://seappserver1.rit.edu/OCRService): This service takes documents (scanned images) and converts the scanned information to text.
- A [Parsing Service](http://seappserver1.rit.edu/ParserService): This service takes the textual information and extracts the key data for tracking purposes. 

We will start by more closely examining each of these services.  The main overview for [Forms Services](https://seappserver1.rit.edu/formsservices/) may also be useful.  

## Document Repository

The **Document Repository**, as its name implies, is where scanned image documents
reside.  We will need to be able to list the contents of the repository and 
retrieve files for processing. 

### List Files
The `List Files` test in our test application shows this being done within a C#
program.  (Run this before proceeding further)

In [2]:
# Run 1 - GetFileList

The output shows us a couple critical pieces of information that will be useful
as we go forward.  First off, it tells us the number of files in the repository.

We have constructed this as a testbed and divorced the document repository from
the document scanning function.  This means the number of documents will remain
constant throughout this project.  However, we as architects realize that
while our test environment shows a constant number of documents, our production
system will differ and constantly have new documents arriving.

There are two important implications: the number of documents will grow over 
time, and our system must somehow know which documents have been processed and
ensure that we never submit the same document twice.
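
One way to meet the second requirement is to keep a persistent record of the file names already submitted and skip them on later runs. The sketch below is a minimal illustration and not part of the provided services: the ledger file and the helper names (`load_processed`, `select_new_files`, `mark_processed`) are assumptions made for this example.

```python
import json
from pathlib import Path


def load_processed(ledger_path):
    """Return the set of file names already processed (empty if no ledger yet)."""
    path = Path(ledger_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def select_new_files(repository_files, processed):
    """Keep only files that have not been processed, preserving repository order."""
    return [name for name in repository_files if name not in processed]


def mark_processed(ledger_path, processed, file_name):
    """Record a file as processed and persist the ledger immediately."""
    processed.add(file_name)
    Path(ledger_path).write_text(json.dumps(sorted(processed)))
```

A run would call `load_processed` once, pass the result of `ListFiles` through `select_new_files`, and call `mark_processed` only after a document has been handled successfully, so that a crash mid-run does not mark unfinished work as done.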

The documentation for the `ListFiles` call (available through the *Document Repository*
link above) shows us it is a RESTful API call.  Our C# code provided the 
framework needed to make the call and demonstrated that output can be returned
in two different formats - XML or JSON.  
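
The output format of a REST call is typically chosen through HTTP content negotiation, where the client states its preference in the `Accept` request header. The sketch below only builds the two requests without sending them. It assumes the service honours the standard `application/xml` and `application/json` media types, which should be confirmed against the *Document Repository* documentation; the helper name `build_listfiles_request` is made up for this example.

```python
import urllib.request

LISTFILES_URL = "http://seappserver1.rit.edu/dmservice/api/listfiles"

# Media types are an assumption; check the service documentation.
MEDIA_TYPES = {"xml": "application/xml", "json": "application/json"}


def build_listfiles_request(fmt):
    """Build (but do not send) a GET request asking for the given format."""
    if fmt not in MEDIA_TYPES:
        raise ValueError(f"Unsupported format: {fmt}")
    return urllib.request.Request(LISTFILES_URL, headers={"Accept": MEDIA_TYPES[fmt]})


for fmt in ("xml", "json"):
    request = build_listfiles_request(fmt)
    print(fmt, "->", request.get_header("Accept"))
```

With the `requests` library used elsewhere in this workbook, the same dictionary would be passed as the `headers=` argument of `requests.get`.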

#### Practice: Output Formats

The following code block provides the basic instrumentation to call `ListFiles`.
You need to complete the call so that it returns XML, and then JSON.

In [59]:
# Framework to make a REST call in Python.  Show what needs to be done to
# get different output formats
import requests

def get_dm_files():
    url = "http://seappserver1.rit.edu/dmservice/api/listfiles"
    hdr = {} #Set the appropriate name:value pair in header to force the data format in XML
    response = requests.get(url, headers=hdr)
    print(response.text)
    #Now make another API call to get Json format. Modify the hdr appropriately
    hdr = {'accept': 'application/json'}
    response = requests.get(url, headers=hdr)
    print(response.json())

get_dm_files()

[{"fileName":"Application_L_Page_002.png"},{"fileName":"Application_L_Page_003.png"},{"fileName":"Application_L_Page_004.png"},{"fileName":"Application_L_Page_005.png"},{"fileName":"Application_L_Page_006.png"},{"fileName":"Application_L_Page_007.png"},{"fileName":"Application_L_Page_008.png"},{"fileName":"Application_L_Page_009.png"},{"fileName":"Application_L_Page_010.png"},{"fileName":"Application_L_Page_011.png"},{"fileName":"Application_L_Page_012.png"},{"fileName":"Application_L_Page_013.png"},{"fileName":"Application_L_Page_014.png"},{"fileName":"Application_L_Page_015.png"},{"fileName":"Application_L_Page_016.png"},{"fileName":"Application_L_Page_017.png"},{"fileName":"Application_L_Page_018.png"},{"fileName":"Application_L_Page_019.png"},{"fileName":"Application_L_Page_020.png"},{"fileName":"Application_L_Page_021.png"},{"fileName":"Application_L_Page_022.png"},{"fileName":"Application_L_Page_023.png"},{"fileName":"Application_L_Page_024.png"},{"fileName":"Application_L_Page_0
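
The JSON form of the response is a list of objects, each with a `fileName` key, so the names can be pulled out with a short comprehension. The sketch below works on a hard-coded two-entry sample copied from the output above instead of calling the service; `extract_file_names` is a hypothetical helper name.

```python
import json


def extract_file_names(listfiles_json):
    """Return the fileName values from a ListFiles JSON payload."""
    return [entry["fileName"] for entry in json.loads(listfiles_json)]


sample = '[{"fileName":"Application_L_Page_002.png"},{"fileName":"Application_L_Page_003.png"}]'
print(extract_file_names(sample))
```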

#### Practice: File Retrieval

Complete the code below to retrieve a single file.  You should be able to pick any file name from the prior list of files.

In [None]:
# Framework to make a REST call in Python.  Show what needs to be done to get a file
import requests


def get_one_dm_file():
    print("Getting file")
    # Pick any valid file name returned by the ListFiles API above
    fileToRetrieve = "Application_L_Page_002.png"
    url = "http://seappserver1.rit.edu/dmservice/api/downloadfile"
    # requests builds the query string (?fileName=...) from the params dictionary
    response = requests.get(url, params={"fileName": fileToRetrieve}) #The file data comes back in the response
    if response.status_code != 200:
        print("Error in retrieving file")
        return
    localFile = "./dm_file.png"
    # Write the raw bytes; the image must not be decoded as text
    with open(localFile, "wb") as file:
        file.write(response.content)
    print("Received file:" + fileToRetrieve + "; Saved as:" + localFile)




get_one_dm_file()


Getting file
Received file:Application_L_Page_002.png; Saved as:./dm_file.png


#### Practice: File Processing

Once you have an image file (.png, .jpg ...), you need to extract the data.  The OCR service provides several ways to do this.  Look at the testing app provided to experiment offline, and we will have you replicate the behaviour in this workbook.
Write Python code to take the file you retrieved from the `getfile` API and submit it for processing using the `processfile` API.  Save the text output from the service in a local file and print the contents to the console.
Starter code is provided.  Fill in the rest.  


In [None]:
import requests

def post_ocr_process_file():
    url = "http://seappserver1.rit.edu/ocrservice/api/processfile?ocrLib={std}"
    filename = "./dm_file.png"
    with open(filename, 'rb') as f:
        result = requests.post(url, files={'file': f})
        print(result.json()) #Print the returned data, parsed from JSON into a Python dict


post_ocr_process_file()

{'_text': 'Registration Application\nFirst Name: Fidel\nLast Name: Beer\nApplication Type : new\nAddress: 433 Runolfsson Corner\nCij: West Reina\nDame: 7/24/1983 l2:00:00 AM\nEmail: Enola qilli\n- amson@elfrieda.ca\nPhone: 669.748.2 l 00\nDescription: voluptate ea a eos et quiquinam iste voluptate ea a eos et quiquinam iste voluptate ea a eos\net quiquinam istevoluptate ea a eos et quiquinam istevoluptate ea a eos et quiquinam istevoluptate ea a\neos et quiquinam istevoluptate ea a eos et quiquinam istevoluptate ea a eos et quiquinam istevoluptate\nea a eos et quiquinam iste voluptate ea a eos et quiquinam iste voluptate ea a eos et quiquinam\nistevoluptate ea a eos et quiquinam istevoluptate ea a eos et quiquinam istevoluptate ea a eos et quiqui\nnam istevoluptate ea a eos et quiquinam istevoluptate ea a eos et quiquinam istevoluptate ea a eos et qui\nquinam iste voluptate ea a eos et quiquinam iste voluptate ea a eos et quiquinam istevoluptate ea a eos\net quiquinam istevoluptate ea 

#### Practice: File Processing - Asynchronous
Not all work done by software services is instantaneous.  In fact, many modern services take seconds to multiple minutes to 'do their job'.  In the prior example, we took the lazy programming approach and just made the user wait until the job was done.  This works, but it prevents the user from doing anything else.  And what if the processing work took many minutes, or even hours?  Today, vision analytics and machine learning algorithms that take in big data can easily take that much time.  To allow parallel processing, it is necessary to provide APIs that enable asynchronous behaviour, i.e. submit something for processing but don't wait for it; just come back from time to time and see if the job is completed.  
We will have to experiment with this type of behaviour.
You will use the `processfileasync` API, but will also need to implement a mechanism to check to see *when* the work is completed.
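
The submit-then-poll pattern described here can be sketched independently of any particular service. In the example below, `poll_until_ready` and `FakeJob` are illustrative names, and the fake job merely stands in for a remote service that becomes ready after a few checks. The `sleep` function is injectable so the pattern can be exercised without real waiting.

```python
import time


def poll_until_ready(check, attempts=10, delay_seconds=5, sleep=time.sleep):
    """Call check() until it returns a non-None result or attempts run out.

    Returns the result, or None if the job was never ready.
    """
    for attempt in range(attempts):
        result = check()
        if result is not None:
            return result
        if attempt < attempts - 1:
            sleep(delay_seconds)  # Wait before asking again
    return None


class FakeJob:
    """Stand-in for a remote job that is ready on the Nth status check."""

    def __init__(self, ready_after):
        self.ready_after = ready_after
        self.checks = 0

    def check(self):
        self.checks += 1
        return "done" if self.checks >= self.ready_after else None


job = FakeJob(ready_after=3)
print(poll_until_ready(job.check, sleep=lambda seconds: None))
```

Bounding the number of attempts matters: without it, a job that never completes would leave the caller polling forever.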


In [None]:
import requests
import time

"""
monitor is used to periodically check back with the server to see if a submitted job is completed.
Parameters: filePath - the file ON THE SERVER to look for
Returns: The text content of the processed file

"""
def monitor(filePath):
    count = 0
    #API for checking for a file
    url = "http://seappserver1.rit.edu/OCRService/api/GetProcessedFile?filename="
    while count < 10: #Try up to 10 times (10*5 = 50 seconds)
        response = requests.get(url + filePath)
        print(f"{count}: {response}")
        count = count + 1
        if response.status_code != 200:
            print(f"Error in processing file {filePath}")
            return ""
        status = response.json() #Parse the body once and reuse it
        if not status.get('_fileReady', False):
            print(f"[{count}]: File {filePath} is not ready")
            time.sleep(5) #Sleep 5 seconds
        else:
            return status.get('_fileData', "")
    return "" #Gave up after 10 attempts; the caller treats "" as failure


def ocr_async():
    #API for async processing
    url = "http://seappserver1.rit.edu/ocrservice/api/processfileasync"
    filename = "./dm_file.png"
    outputFile = ""
    result = ''

    with open(filename, 'rb') as f:
        #Use a post syntax from prior step to send the file to the API
        #e.g. result = xxxxxxxxxx
        result = requests.post(url, files={'file': f})
        print(result.json())
        outputFile = result.json().get('_outputFilePath', None)
        if outputFile is None:
            print("Error in processing file")
            return

    print(f"Monitor for: '{outputFile}'")
    ocrData = monitor(outputFile) #We conveniently provide you a monitor method
    if ocrData == "":
        print("OCR Failed") #We tried X number of times, but gave up
    else:
        destFile = 'async_ocr_result.txt' #Got a good result!
        print(f"OCR Result stored in {destFile}")
        with open(destFile, "w") as file: #The with block closes the file so the text is flushed to disk
            file.write(ocrData)

"""This will run the main command"""
ocr_async()


{'_inputFile': 'dm_file.png', '_inputFileLength': 152022, '_outputFilePath': 'd086e642-0762-4ae2-ba9c-861510da5d1b/dm_file.txt'}
Monitor for: 'd086e642-0762-4ae2-ba9c-861510da5d1b/dm_file.txt'
0: <Response [200]>
[1]: File d086e642-0762-4ae2-ba9c-861510da5d1b/dm_file.txt is not ready
1: <Response [200]>
[2]: File d086e642-0762-4ae2-ba9c-861510da5d1b/dm_file.txt is not ready
2: <Response [200]>
[3]: File d086e642-0762-4ae2-ba9c-861510da5d1b/dm_file.txt is not ready
3: <Response [200]>
OCR Result stored in async_ocr_result.txt


#### Putting it all together
So, we have experimented with two services:
1. A document repository (DMService) that holds a set of files
2. An OCR service that converts images to text (and we did it two different ways)

At the end of this, we have text data from pictures.  Great.  And we would want to use that data.  Not just store blobs of text, but actually put that into a more organized form.  
- How about a database?  If we look at the text result, it is a set of fields, and there is data for each field.  Sounds a lot like a database table.

We have one more service to help us take plain text, and make it more 'organized' i.e. `field: value` as a nice data structure, which will allow us to easily store into a DB (we'll leave out the actual DB for now!)

In this final exercise, use the additional API http://seappserver1.rit.edu/parserservice/api/ReadForm.  
You can read more about the API on the website.
You will now put together all the pieces ... which is how you build pretty much all applications through integration of multiple distributed components  

1. Retrieve a file using `GetFile`
2. Process a file using one of the `ProcessFile...` APIs
3. Convert the raw text into a json (name:value) list using the `ReadForm` API and save that json formatted file
    - As mentioned above, you would then store the data into a table in a real DB, but we'll leave that for your own experimentation!  


Some starter python code is provided below.  Fill in the rest using what you learned in the prior steps ... and make the integrated application work!  

In [None]:
import requests


def get_a_file(file_name="Application_L_Page_002.png", file_path="./dm_file.png"):
    #Get file from DM
    print(f"Getting file {file_name}")
    url = f"http://seappserver1.rit.edu/dmservice/api"
    command = f"/downloadfile?fileName={file_name}"
    hdr = {'accept': 'application/json'}
    response = requests.get(url + command, headers=hdr)
    if response.status_code != 200:
        print("Error in retrieving file")
        return
    with open(file_path, "wb") as file:
        file.write(response.content)
        print(f"Downloaded file {file_name} to {file_path}")


def process_a_file(process_file_path="./dm_file.png", file_path="./ocr_result.txt"):
    #Use OCR server to convert image to raw text
    print(f"Processing file {process_file_path}")
    url = "http://seappserver1.rit.edu/ocrservice/api/processfile?ocrLib={std}"
    with open(process_file_path, 'rb') as image_file:
        response = requests.post(url, files={'file': image_file})
    if response.status_code != 200:
        print("Error in processing file")
        return
    ocr_text = response.json().get('_text', None)
    if ocr_text is None:
        print("Error in processing file")
        return #Nothing to write; avoid passing None to write()
    with open(file_path, "w") as text_file:
        text_file.write(ocr_text)


def parse_a_file(ocr_file_path="./ocr_result.txt"):
    #Use ParserService to submit a raw text file to convert to name:value pairs
    print(f"Parsing file {ocr_file_path} to json")
    url = "http://seappserver1.rit.edu/parserservice/api/ReadForm"
    with open(ocr_file_path, 'r') as file:
        result = requests.post(url, files={'file': file})
        if result.status_code != 200:
            print("Error in parsing file: ", result.text)
            return
        print(result.json())




print("Running the integrated application")

get_a_file()
process_a_file()
parse_a_file()
#Print the final json output!!!

Running the integrated application
Parsing file ./ocr_result.txt to json
{'allFields': [{'fieldName': 'First Name', 'fieldValue': ' Fidel'}, {'fieldName': 'Last Name', 'fieldValue': ' Beer'}, {'fieldName': 'Application Type ', 'fieldValue': ' new'}, {'fieldName': 'Address', 'fieldValue': ' 433 Runolfsson Corner'}, {'fieldName': 'Cij', 'fieldValue': ' West Reina'}, {'fieldName': 'Email', 'fieldValue': ' Enola qilli'}, {'fieldName': 'Phone', 'fieldValue': ' 669.748.2 l 00'}, {'fieldName': 'Description', 'fieldValue': ' voluptate ea a eos et quiquinam iste voluptate ea a eos et quiquinam iste voluptate ea a eos'}]}
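
As a follow-up to the database idea raised earlier, the parsed fields can be loaded into a table with a few lines of code. The sketch below uses Python's built-in `sqlite3` module with an in-memory database and a shortened copy of the `allFields` output above. The table name, column names, and the `store_fields` helper are assumptions for illustration. Names and values are stripped because the parser output carries stray spaces.

```python
import sqlite3


def store_fields(connection, document_name, parsed):
    """Insert each fieldName/fieldValue pair of a parsed form as one row."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS form_fields ("
        "document TEXT, field_name TEXT, field_value TEXT)"
    )
    rows = [
        (document_name, field["fieldName"].strip(), field["fieldValue"].strip())
        for field in parsed.get("allFields", [])
    ]
    # Parameterized insert: values are passed separately from the SQL text
    connection.executemany("INSERT INTO form_fields VALUES (?, ?, ?)", rows)
    connection.commit()
    return len(rows)


parsed = {
    "allFields": [
        {"fieldName": "First Name", "fieldValue": " Fidel"},
        {"fieldName": "Last Name", "fieldValue": " Beer"},
        {"fieldName": "Application Type ", "fieldValue": " new"},
    ]
}

connection = sqlite3.connect(":memory:")
store_fields(connection, "Application_L_Page_002.png", parsed)
for row in connection.execute("SELECT field_name, field_value FROM form_fields"):
    print(row)
```

Storing one row per field keeps the table layout independent of which fields the OCR step happened to recognise; a fixed column per field would be an alternative once the form layout is known to be stable.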


#### Conclusions

Now that you have a working application, run the application we provided (the C# app) and run ALL the commands.  Observe the output, watch the behaviour.  Think about how the operations work (and why).  Consider the choices to be made in putting the service components together.  Add your thoughts below.

#### Student observations

My main observation is that at every step, each action is properly communicated; that is, the state of the application is always shown to the user, so there is no confusion. In my own version, as I was working, I had to start adding more error checking and error messages because I had entered something wrong and was not getting the desired output. Although these are confined working tests in the C# application, I feel like it would display errors properly as well.

As for what the system does and how it works, small component services were chosen and used together to create a larger service that solves a specific problem, one that needs all the components to work together. Because each part of the system is a small service, the system is modular and its parts are easily replaceable, which is a great thing if one of the components goes down or has some other issue. 




