# Microsoft Azure Computer Vision for Instagram image processing
## Etienne P Jacquot - ASC SYSADMIN - epj@asc.upenn.edu

### [Quickstart: Computer Vision client library for Python](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/quickstarts-sdk/python-sdk)
__________________

To [install Azure CLI](https://pypi.org/project/azure-cli/) for Python on MacOS:
- `pip install azure-cli` 
    
To [install Azure SDKs](https://docs.microsoft.com/en-us/azure/cognitive-services/Custom-Vision-Service/python-tutorial):
- `pip install azure.cognitiveservices.vision.computervision`
- `pip install azure-cognitiveservices-vision-customvision` 


### On Azure I created an `ASC-ComputerVision` endpoint w/ Free tier (20/minute, 5k per month)

Save your credentials to `configs/config.ini` and do not share!

#### *Update -->> latest version for computervision to avoid errors...*

- more info https://pypi.org/project/azure-cognitiveservices-vision-computervision/

`pip install azure-cognitiveservices-vision-computervision==0.7.0`

_________


In [2]:
import pandas as pd
import os

import configparser
import sys
import time

import requests

from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from azure.cognitiveservices.vision.computervision.models import OperationStatusCodes
from msrest.authentication import CognitiveServicesCredentials

### Using PhantomBuster to get Instgram URLs for `@blackivystories`

Save as csv in `./data/`

- https://phantombuster.com/4737542404301822/phantoms/1973743622888750/console


In [2]:
df = pd.read_csv('./data/blackivystories_PostsExtractor_03012021.csv')

## Reviewing columns, we can get our image URL as `imgUrl`

In [3]:
df.columns

Index(['postUrl', 'description', 'commentCount', 'likeCount', 'location',
       'locationId', 'pubDate', 'likedByViewer', 'isSidecar', 'type',
       'caption', 'profileUrl', 'username', 'sidePostUrl', 'imgUrl', 'postId',
       'timestamp', 'query', 'taggedFullName1', 'taggedUsername1',
       'taggedFullName2', 'taggedUsername2', 'taggedFullName3',
       'taggedUsername3', 'fullName'],
      dtype='object')

## Download each image to `./img` for long term (OPTIONAL)

- Microsoft Azure supports **remote_url** so you *can* do ocr without saving the images... I think that the `imgUrl` is an *s3content* served by facebook/instagram and really it is not supposed to be a permalink. Downloading is for long term reference.

In [4]:
mkdir ./img

mkdir: ./img: File exists


## Function to download & save as `postId`.png

In [5]:
def img_download(x):
    f = open('./img/{}.png'.format(str(x.postId)),'wb') # <-- Instagram postID as filename
    f.write(requests.get(x.imgUrl).content)             # <-- requests module get imgUrl (time sensitive content)
    f.close()
    return

In [63]:
df.apply(lambda x: img_download(x), axis=1)

0      None
1      None
2      None
3      None
4      None
       ... 
623    None
624    None
625    None
626    None
627    None
Length: 628, dtype: object

## Now run Microsoft Azure OCR to extract stories from downloaded instagram posts

In [6]:
ig_pngs = [item for item in os.listdir('./img/') if item.endswith('.png')]

### Set your Azure access key & endpoint

- this is your `./configs/config.ini`, specifically the **ASC-COMPUTERVISION** profile

In [7]:
# add azure Computer Vision key and endpoint to config.ini
azure_cred = {}
config = configparser.ConfigParser()

config.read('./configs/config.ini') # <--- add your Twitter API tokens to this file!
for item,value in config['ASC-COMPUTERVISION'].items():
    azure_cred[item]=value

In [8]:
# Azure Variables Here!
_url = azure_cred['endpoint'] # Here, paste your full endpoint from the Azure portal
_key = azure_cred['key1']  # Here, paste your primary key
_maxNumRetries = 10

## Using the Azure ComputerVision API Client

In [9]:
computervision_client = ComputerVisionClient(_url, CognitiveServicesCredentials(_key))

### Get a Computer Vision quick description - remote url

- we know it's likely text content for this instagram page.

In [10]:
# sample url for example... 
remote_image_url = df.imgUrl.iloc[-4]

In [11]:
'''
Describe an image - remote
This example describes the contents of an image with the confidence score.
'''
print("===== Describe an image - remote =====")
# Call API
description_results = computervision_client.describe_image(remote_image_url)

# Get the captions (descriptions) from the response, with confidence level
print("Description of remote image: ")
if (len(description_results.captions) == 0):
    print("No description detected.")
else:
    for caption in description_results.captions:
        print("'{}' with confidence {:.2f}%".format(caption.text, caption.confidence * 100))

===== Describe an image - remote =====
Description of remote image: 
'text' with confidence 98.24%


## Here is the sample remote url image:

![](./img/2347618041418001201.png)

__________

## Recognize Printed Text with OCR - remote URL example

Helpful list of examples here https://github.com/Azure-Samples/cognitive-services-quickstart-code/blob/master/python/ComputerVision/ComputerVisionQuickstart.py, though the text is not handwritten...



In [12]:
'''
Recognize Printed Text with OCR - local
This example will extract, using OCR, printed text in an image, then print results line by line.
'''
print("===== Detect Printed Text with OCR - local =====")

#remote_printed_text_image_url = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-sample-data-files/master/ComputerVision/Images/printed_text.jpg"
#remote_printed_text_image_url=remote_image_url
local_image_printed_text_path = "./img/{}".format(ig_pngs[-1])
local_image_printed_text = open(local_image_printed_text_path, "rb")

ocr_result_local = computervision_client.recognize_printed_text_in_stream(local_image_printed_text)
for region in ocr_result_local.regions:
    for line in region.lines:
        print("Bounding box: {}".format(line.bounding_box))
        s = ""
        for word in line.words:
            s += word.text + " "
        print(s)
print()
'''
END - Recognize Printed Text with OCR - local
'''



===== Detect Printed Text with OCR - local =====
Bounding box: 89,118,906,56
Last semester I took one of the many 
Bounding box: 106,185,868,56
race/gender/class-based discussion 
Bounding box: 74,251,932,44
courses at Cornell. The first half of the 
Bounding box: 160,318,758,44
class had a South Asian interim 
Bounding box: 145,384,793,59
professor. During one of the first 
Bounding box: 83,450,917,60
classes, we discussed a piece of work 
Bounding box: 145,517,785,57
from an Enlightenment thinker. I 
Bounding box: 198,583,686,60
brought up the fact that this 
Bounding box: 84,650,915,57
Enlightenment thinker was integral to 
Bounding box: 195,717,687,44
the construction of race and 
Bounding box: 100,783,884,59
post-enlightenment justifications for 
Bounding box: 90,850,902,56
slavery, and thus we should consider 
Bounding box: 137,917,801,58
the piece through that context... 
Bounding box: 659,992,353,39
@blackivystories 



'\nEND - Recognize Printed Text with OCR - local\n'

## This is that image:

![](./img/2359986520683068201.png)

_______

## Looking at our Example OCR content in a DataFrame

In [13]:
ocr_df = pd.DataFrame.from_dict(ocr_result_local.as_dict())

In [14]:
ocr_df

Unnamed: 0,language,text_angle,orientation,regions
0,en,0.0,Up,"{'bounding_box': '74,118,938,913', 'lines': [{..."


In [15]:
ocr_df.regions[0].keys()

dict_keys(['bounding_box', 'lines'])

### We can see all the words w/ bounding box information

In [16]:
ocr_text = []

for line in ocr_df.regions[0]['lines']:
    for word in line['words']:
        #print(word['text'])
        ocr_text.append(word['text'])
        
print(ocr_text)

['Last', 'semester', 'I', 'took', 'one', 'of', 'the', 'many', 'race/gender/class-based', 'discussion', 'courses', 'at', 'Cornell.', 'The', 'first', 'half', 'of', 'the', 'class', 'had', 'a', 'South', 'Asian', 'interim', 'professor.', 'During', 'one', 'of', 'the', 'first', 'classes,', 'we', 'discussed', 'a', 'piece', 'of', 'work', 'from', 'an', 'Enlightenment', 'thinker.', 'I', 'brought', 'up', 'the', 'fact', 'that', 'this', 'Enlightenment', 'thinker', 'was', 'integral', 'to', 'the', 'construction', 'of', 'race', 'and', 'post-enlightenment', 'justifications', 'for', 'slavery,', 'and', 'thus', 'we', 'should', 'consider', 'the', 'piece', 'through', 'that', 'context...', '@blackivystories']


_________

# RUNNING FOR ALL IMAGES W/ RATE LIMITING

Make sure to start with **AT LEAST** 60 seconds lead in time to prevent rate limits... this is a simple looping, I am guessing the computervision_client has a `rate_limiting_wait=yes` or something like that ... 

In [108]:
dfs = []

i = 0

for url in ig_pngs:
    
    print('getting ocr for -->',url)
    
    # RATE LIMITING, CHECK IN BEGINNING OF THE LOOP
    if i == 20:
        print('i --> {}'.format(i))
        print('waiting for 63 seconds...')
        i=0
        time.sleep(63)
        
    else:
        # TRY TO READ LOCAL IMAGE FILE
        try:
            local_image_printed_text_path = "./img/{}".format(url)
            local_image_printed_text = open(local_image_printed_text_path, "rb")
        except:
            print('oops! failed to open image...')
            sys.stderr.write()
            break
        # TRY TO GET AZURE OCR TEXT EXTRACTION
        try:
            ocr_result_local = computervision_client.recognize_printed_text_in_stream(local_image_printed_text)
            ocr_df = pd.DataFrame.from_dict(ocr_result_local.as_dict())
            #ocr_df['img_filename'] = url
            ocr_df['postId'] = url.strip('.png')
            dfs.append(ocr_df)
            i = i + 1
        except:
            print('oops! failed to get azure ocr...')
            sys.stderr.write()
            break

getting ocr for --> 2350613102216966045.png
getting ocr for --> 2340140894484999761.png
getting ocr for --> 2347618041418001201.png
getting ocr for --> 2393806188103257343.png
getting ocr for --> 2343348859268581716.png
getting ocr for --> 2338912215348179434.png
getting ocr for --> 2354215615625321577.png
getting ocr for --> 2362123660724497627.png
getting ocr for --> 2360747577210992175.png
getting ocr for --> 2382930617014024873.png
getting ocr for --> 2367252283412384155.png
getting ocr for --> 2374422979850657699.png
getting ocr for --> 2349874525984086923.png
getting ocr for --> 2363612331470268025.png
getting ocr for --> 2362839721379746585.png
getting ocr for --> 2430069395260057077.png
getting ocr for --> 2340997810832781473.png
getting ocr for --> 2345503318656671388.png
getting ocr for --> 2358282383087974949.png
getting ocr for --> 2343319077327950041.png
getting ocr for --> 2344794897858880330.png
i --> 20
waiting for 63 seconds...
getting ocr for --> 2344590726186435769.p

getting ocr for --> 2368414402602164737.png
getting ocr for --> 2348357136972929549.png
getting ocr for --> 2340413922787868078.png
getting ocr for --> 2435456592645585531.png
getting ocr for --> 2338953169153075283.png
getting ocr for --> 2425268245767683320.png
getting ocr for --> 2339615509221709310.png
getting ocr for --> 2429280809241780408.png
getting ocr for --> 2375084936458939021.png
i --> 20
waiting for 63 seconds...
getting ocr for --> 2346670157042540479.png
getting ocr for --> 2356258355028499940.png
getting ocr for --> 2430069395276709367.png
getting ocr for --> 2344748250403628742.png
getting ocr for --> 2342565041326525472.png
getting ocr for --> 2374505348406373809.png
getting ocr for --> 2342422486387566181.png
getting ocr for --> 2362809286335605783.png
getting ocr for --> 2376389501183794364.png
getting ocr for --> 2394697772453660948.png
getting ocr for --> 2336715794297532565.png
getting ocr for --> 2368671305106211205.png
getting ocr for --> 2333791347658297331.p

getting ocr for --> 2346743023452737565.png
getting ocr for --> 2349788573781468272.png
getting ocr for --> 2426388442528407629.png
getting ocr for --> 2343925246304020576.png
getting ocr for --> 2420053014594567050.png
getting ocr for --> 2423441911664047448.png
getting ocr for --> 2360684539606901219.png
getting ocr for --> 2370069315166040074.png
getting ocr for --> 2365767204575967760.png
getting ocr for --> 2345328829427798359.png
getting ocr for --> 2340997810815764888.png
getting ocr for --> 2359986520691407453.png
getting ocr for --> 2364171030899412613.png
getting ocr for --> 2425268245776156547.png
getting ocr for --> 2368383356431018757.png
getting ocr for --> 2368429690009707323.png
getting ocr for --> 2353355415943041815.png
getting ocr for --> 2369142546795753618.png
i --> 20
waiting for 63 seconds...
getting ocr for --> 2351890701719257708.png
getting ocr for --> 2354232088192103903.png
getting ocr for --> 2362196667685835822.png
getting ocr for --> 2377878835980265236.p

getting ocr for --> 2371509661045464400.png
getting ocr for --> 2354232088175448292.png
getting ocr for --> 2340191644722972950.png
getting ocr for --> 2423583352931245686.png
getting ocr for --> 2380788950681010364.png
getting ocr for --> 2342349126534264002.png
i --> 20
waiting for 63 seconds...
getting ocr for --> 2368475373404172667.png
getting ocr for --> 2380098885164288697.png
getting ocr for --> 2341672619744721693.png
getting ocr for --> 2375084936442323426.png
getting ocr for --> 2335111117103967439.png
getting ocr for --> 2368488071005282197.png
getting ocr for --> 2332161255416871592.png
getting ocr for --> 2353988936671920290.png
getting ocr for --> 2345414835720035476.png
getting ocr for --> 2338079904054525231.png
getting ocr for --> 2404947036555467877.png
getting ocr for --> 2424520003274680212.png
getting ocr for --> 2375712551020586988.png
getting ocr for --> 2345328829436105470.png
getting ocr for --> 2361381864323038420.png
getting ocr for --> 2368398231001739924.p

## Look at combined ocr_dfs of instagram post text extraction

- Text extracted for 602 images:

In [109]:
pd.concat(dfs)

Unnamed: 0,language,text_angle,orientation,regions,postId
0,en,0.0,Up,"{'bounding_box': '50,115,691,619', 'lines': [{...",2350613102216966045
0,en,0.0,Up,"{'bounding_box': '73,217,939,814', 'lines': [{...",2340140894484999761
0,en,0.0,Up,"{'bounding_box': '45,179,670,554', 'lines': [{...",2347618041418001201
0,en,0.0,Up,"{'bounding_box': '93,285,958,779', 'lines': [{...",2393806188103257343
0,en,0.0,Up,"{'bounding_box': '83,251,929,780', 'lines': [{...",2343348859268581716
...,...,...,...,...,...
0,en,0.0,Up,"{'bounding_box': '72,203,670,535', 'lines': [{...",2349788573764790476
0,en,0.0,Up,"{'bounding_box': '81,351,931,680', 'lines': [{...",2343172003437263671
0,en,0.0,Up,"{'bounding_box': '55,173,662,557', 'lines': [{...",2350598473239542497
0,en,0.0,Up,"{'bounding_box': '12,81,1052,934', 'lines': [{...",2332331152127719574


## *UPDATE -->> I exported the results to json, so not going to run again...*
- The above cell for OCR w/ rate limiting takes around 33 minutes to complete

In [3]:
#final_df = pd.concat(dfs)
#final_df.reset_index(drop=True,inplace=True)
#final_df.to_json('ig_ocr.json')

final_df = pd.read_json('./data/ig_ocr.json')

In [4]:
final_df

Unnamed: 0,language,text_angle,orientation,regions,postId
0,en,0.0,Up,"{'bounding_box': '50,115,691,619', 'lines': [{...",2350613102216966144
1,en,0.0,Up,"{'bounding_box': '73,217,939,814', 'lines': [{...",2340140894484999680
2,en,0.0,Up,"{'bounding_box': '45,179,670,554', 'lines': [{...",2347618041418001408
3,en,0.0,Up,"{'bounding_box': '93,285,958,779', 'lines': [{...",2393806188103257088
4,en,0.0,Up,"{'bounding_box': '83,251,929,780', 'lines': [{...",2343348859268581888
...,...,...,...,...,...
597,en,0.0,Up,"{'bounding_box': '72,203,670,535', 'lines': [{...",2349788573764790272
598,en,0.0,Up,"{'bounding_box': '81,351,931,680', 'lines': [{...",2343172003437263872
599,en,0.0,Up,"{'bounding_box': '55,173,662,557', 'lines': [{...",2350598473239542272
600,en,0.0,Up,"{'bounding_box': '12,81,1052,934', 'lines': [{...",2332331152127719424


__________

## Nested OCR text words as `txt` column

In [5]:
def ocr_expand(x):
        
    txt = []
    
    for line in x.regions['lines']:
        for word in line['words']:
            txt.append(word['text'])
            
    return txt

In [6]:
final_df['txt'] = final_df.apply(lambda x: ocr_expand(x), axis=1)

In [7]:
#final_df.drop(columns=['regions']).to_json('ig_ocr_txt.json')

In [8]:
final_df.txt

0      [Once, her, roommate, found, out, that, I, spe...
1      [I'm, an, Afro-Latina, from, a, low-income, ur...
2      [Being, a, Black, engineering, student, at, Co...
3      [I, had, a, precept, with, a, '21, male, stude...
4      [Psafety, was, called, for, a, noise, complain...
                             ...                        
597    [Needles, to, say, I, never, got, my, stuff, b...
598    [Brown, students, boast, of, their, progressiv...
599    [My, grade, was, published, while, sitting, ne...
600    [At, night,, there, was, a, guard, stationed, ...
601    [Last, semester, I, took, one, of, the, many, ...
Name: txt, Length: 602, dtype: object

### We now want to set our IvyLeague stop words to find references for each respective school ...

helpful example for logic https://stackoverflow.com/questions/6531482/how-to-check-if-a-string-contains-an-element-from-a-list-in-python


### Loop through each ivy & check if it's in the txt lower case... 

- I am guessing we will miss some examples for this, for example `upenn` or `penn's`

In [9]:
ivyleaguesToCheck = ['penn','brown','princeton','harvard','columbia','yale','columbia','cornell','dartmouth']

In [51]:
def ivyCheck(x,ivyToCheck):
    
    for txt in x.txt:
        if txt.lower() in ivyToCheck:            
            return txt.lower()
        else:
            # ignore
            continue

In [48]:
final_df['ivy'] = final_df.apply(lambda x: ivyCheck(x,ivyleaguesToCheck), axis=1)

### Our BlackIvyStories OCR college value_counts:

In [50]:
final_df.ivy.value_counts()

princeton    93
penn         57
brown        48
columbia     43
cornell      42
harvard      41
dartmouth    21
yale         14
Name: ivy, dtype: int64

In [55]:
penn_row_filter = final_df.ivy == 'penn'

In [58]:
penn_df = final_df[penn_row_filter]

In [94]:
penn_df.head()

Unnamed: 0,language,text_angle,orientation,regions,postId,txt,ivy
11,en,0.0,Up,"{'bounding_box': '68,114,944,917', 'lines': [{...",2374422979850657792,"[One, of, my, (non-Black), friends, and, I, we...",penn
17,en,0.0,Up,"{'bounding_box': '83,284,931,747', 'lines': [{...",2358282383087974912,"[Penn's, Undergraduate, Assembly, has, a, hist...",penn
20,en,0.0,Up,"{'bounding_box': '73,216,939,815', 'lines': [{...",2343993744749285376,"[Being, Black, at, Penn, means, watching, most...",penn
31,en,0.0,Up,"{'bounding_box': '80,284,932,747', 'lines': [{...",2341764308463485440,"[During, the, Africana, Summer, Institute,, a,...",penn
40,en,0.0,Up,"{'bounding_box': '74,183,938,848', 'lines': [{...",2342541965549375488,"[He, did, several, other, things, during, my, ...",penn


### Join the ocr text for further analysis... 

In [92]:
' '.join(penn_df.txt.iloc[29])

'I was followed by campus police for "trespassing" on my own campus upon leaving the library on a late night after studying. There were others who left the same time I did. None of them Black. No one else was questioned. Just me. - Penn 123 @blackivystories'

## TODO -->> 

- take the 57 penn post URLs and run commentGetter w/ docker
- postIds to reconstruct the public url