# Microsoft Azure Computer Vision for Instagram image processing
## Etienne P Jacquot - ASC SYSADMIN - epj@asc.upenn.edu

#### [Quickstart: Computer Vision client library for Python](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/quickstarts-sdk/python-sdk)
__________________

### Getting Started:

To [install Azure CLI](https://pypi.org/project/azure-cli/) for Python on MacOS:
- `pip install azure-cli` 
    
To [install Azure SDKs](https://docs.microsoft.com/en-us/azure/cognitive-services/Custom-Vision-Service/python-tutorial):
- `pip install azure.cognitiveservices.vision.computervision`
- `pip install azure-cognitiveservices-vision-customvision` 


### On Azure I created an `ASC-ComputerVision` endpoint w/ Free tier (20/minute, 5k per month)

Save your credentials to `configs/config.ini` and do not share!

#### *Update -->> latest version for computervision to avoid errors...*

- more info https://pypi.org/project/azure-cognitiveservices-vision-computervision/

`pip install azure-cognitiveservices-vision-computervision==0.7.0`

_________


In [1]:
import pandas as pd
import os

import configparser
import sys
import time

import requests

from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from azure.cognitiveservices.vision.computervision.models import OperationStatusCodes
from msrest.authentication import CognitiveServicesCredentials

### Using PhantomBuster to get Instgram URLs for `@blackivystories`

Save as csv in `./data/`

- https://phantombuster.com/4737542404301822/phantoms/1973743622888750/console


In [2]:
df = pd.read_csv('./data/blackivystories_PostsExtractor_03012021.csv')

## Reviewing columns, we can get our image URL as `imgUrl`

In [3]:
df.columns

Index(['postUrl', 'description', 'commentCount', 'likeCount', 'location',
       'locationId', 'pubDate', 'likedByViewer', 'isSidecar', 'type',
       'caption', 'profileUrl', 'username', 'sidePostUrl', 'imgUrl', 'postId',
       'timestamp', 'query', 'taggedFullName1', 'taggedUsername1',
       'taggedFullName2', 'taggedUsername2', 'taggedFullName3',
       'taggedUsername3', 'fullName'],
      dtype='object')

## Download each image to `./img` for long term (OPTIONAL)

- Microsoft Azure supports **remote_url** so you *can* do ocr without saving the images... I think that the `imgUrl` is an *s3content* served by facebook/instagram and really it is not supposed to be a permalink. Downloading is for long term reference.

In [4]:
mkdir ./img

mkdir: ./img: File exists


## Function to download & save as `postId`.png

In [5]:
def img_download(x):
    ######
    # I should have used x.postUrl.split('/')[-1] instead of the Id... 
    # for some reason I am not able to map back the postIds?    
    #f = open('./img/{}.png'.format(str(x.postUrl.split('/')[-1])),'wb') # <-- Instagram postID as filename
    
    f = open('./img/{}.png'.format(str(x.postId)),'wb') # <-- Instagram postID as filename
    f.write(requests.get(x.imgUrl).content)             # <-- requests module get imgUrl (time sensitive content)
    f.close()
    return

#### I am not going to run again, the imgUrls may have expired

In [63]:
df.apply(lambda x: img_download(x), axis=1)

0      None
1      None
2      None
3      None
4      None
       ... 
623    None
624    None
625    None
626    None
627    None
Length: 628, dtype: object

## Running Microsoft Azure OCR to extract stories from downloaded instagram posts

In [7]:
ig_pngs = [item for item in os.listdir('./img/') if item.endswith('.png')]

### Set your Azure access key & endpoint

- this is your `./configs/config.ini`, specifically the **ASC-COMPUTERVISION** profile

In [9]:
# add azure Computer Vision key and endpoint to config.ini
azure_cred = {}
config = configparser.ConfigParser()

config.read('./configs/config.ini') # <--- add your Twitter API tokens to this file!
for item,value in config['ASC-COMPUTERVISION'].items():
    azure_cred[item]=value

In [10]:
# Azure Variables Here!
_url = azure_cred['endpoint'] # Here, paste your full endpoint from the Azure portal
_key = azure_cred['key1']  # Here, paste your primary key
_maxNumRetries = 10

## Using the Azure ComputerVision API Client

- Helpful list of examples here https://github.com/Azure-Samples/cognitive-services-quickstart-code/blob/master/python/ComputerVision/ComputerVisionQuickstart.py, 

    - This works for handwritten text in addition to printed!


In [11]:
computervision_client = ComputerVisionClient(_url, CognitiveServicesCredentials(_key))

### *EXAMPLE: Get a Computer Vision quick description (remote url)*

- we know it's likely text content for this instagram page.

In [12]:
# sample url for examples... 
remote_image_url = df.imgUrl.iloc[-4]

In [14]:
'''
Describe an image - remote
This example describes the contents of an image with the confidence score.
'''
print("===== Describe an image - remote =====")
# Call API
description_results = computervision_client.describe_image(remote_image_url)

# Get the captions (descriptions) from the response, with confidence level
print("Description of remote image: ")
if (len(description_results.captions) == 0):
    print("No description detected.")
else:
    for caption in description_results.captions:
        print("'{}' with confidence {:.2f}%".format(caption.text, caption.confidence * 100))

===== Describe an image - remote =====
Description of remote image: 
'text' with confidence 98.24%


## Here is the sample remote url image:

![](./img/2347618041418001201.png)

__________

## Recognize Printed Text with OCR - remote URL example

remote URL


In [15]:
'''
Recognize Printed Text with OCR - local
This example will extract, using OCR, printed text in an image, then print results line by line.
'''
print("===== Detect Printed Text with OCR - local =====")

local_image_printed_text_path = "./img/{}".format(ig_pngs[-1])
local_image_printed_text = open(local_image_printed_text_path, "rb")

ocr_result_local = computervision_client.recognize_printed_text_in_stream(local_image_printed_text)
for region in ocr_result_local.regions:
    for line in region.lines:
        print("Bounding box: {}".format(line.bounding_box))
        s = ""
        for word in line.words:
            s += word.text + " "
        print(s)
print()
'''
END - Recognize Printed Text with OCR - local
'''



===== Detect Printed Text with OCR - local =====
Bounding box: 89,118,906,56
Last semester I took one of the many 
Bounding box: 106,185,868,56
race/gender/class-based discussion 
Bounding box: 74,251,932,44
courses at Cornell. The first half of the 
Bounding box: 160,318,758,44
class had a South Asian interim 
Bounding box: 145,384,793,59
professor. During one of the first 
Bounding box: 83,450,917,60
classes, we discussed a piece of work 
Bounding box: 145,517,785,57
from an Enlightenment thinker. I 
Bounding box: 198,583,686,60
brought up the fact that this 
Bounding box: 84,650,915,57
Enlightenment thinker was integral to 
Bounding box: 195,717,687,44
the construction of race and 
Bounding box: 100,783,884,59
post-enlightenment justifications for 
Bounding box: 90,850,902,56
slavery, and thus we should consider 
Bounding box: 137,917,801,58
the piece through that context... 
Bounding box: 659,992,353,39
@blackivystories 



'\nEND - Recognize Printed Text with OCR - local\n'

## This is that image:

![](./img/2359986520683068201.png)

_______

## *Reviewing our Example OCR content in a DataFrame*

In [16]:
ocr_df = pd.DataFrame.from_dict(ocr_result_local.as_dict())
ocr_df

Unnamed: 0,language,text_angle,orientation,regions
0,en,0.0,Up,"{'bounding_box': '74,118,938,913', 'lines': [{..."


### We can see all the words w/ bounding box information

In [16]:
ocr_text = []

for line in ocr_df.regions[0]['lines']:
    for word in line['words']:
        #print(word['text'])
        ocr_text.append(word['text'])
        
print(ocr_text)

['Last', 'semester', 'I', 'took', 'one', 'of', 'the', 'many', 'race/gender/class-based', 'discussion', 'courses', 'at', 'Cornell.', 'The', 'first', 'half', 'of', 'the', 'class', 'had', 'a', 'South', 'Asian', 'interim', 'professor.', 'During', 'one', 'of', 'the', 'first', 'classes,', 'we', 'discussed', 'a', 'piece', 'of', 'work', 'from', 'an', 'Enlightenment', 'thinker.', 'I', 'brought', 'up', 'the', 'fact', 'that', 'this', 'Enlightenment', 'thinker', 'was', 'integral', 'to', 'the', 'construction', 'of', 'race', 'and', 'post-enlightenment', 'justifications', 'for', 'slavery,', 'and', 'thus', 'we', 'should', 'consider', 'the', 'piece', 'through', 'that', 'context...', '@blackivystories']


_________

# RUNNING FOR ALL IMAGES FOR AZURE OCR W/ TIMED RATE LIMITING

- Make sure to start with **AT LEAST** 60 seconds lead in time to prevent rate limits... 
- this is a simple looping, I am guessing the computervision_client has a `rate_limiting_wait=yes` or something like that ... 

- rerunning to include filename instead of url striped value... 

In [18]:
dfs = []

i = 0

print('Azure OCR for Instagram images\n')
print('-'*50)

for url in ig_pngs:
    
    print('ocr for --> {}'.format(url))
    
    # RATE LIMITING, CHECK IN BEGINNING OF THE LOOP
    if i == 20:
        
        print('i is --> {}'.format(i))
        print('waiting for 63 seconds...')
        i=0
        time.sleep(63)
        
    else:
        # TRY TO READ LOCAL IMAGE FILE
        try:
            local_image_printed_text_path = "./img/{}".format(url)
            local_image_printed_text = open(local_image_printed_text_path, "rb")
        except:
            print('oops! failed to open image...')
            sys.stderr.write()
            break
        # TRY TO GET AZURE OCR TEXT EXTRACTION
        try:
            ocr_result_local = computervision_client.recognize_printed_text_in_stream(local_image_printed_text)
            ocr_df = pd.DataFrame.from_dict(ocr_result_local.as_dict())
            ocr_df['filename'] = url
            #ocr_df['filename'] = url.strip('.png')
            dfs.append(ocr_df)
            i = i + 1
        except:
            print('oops! failed to get azure ocr...')
            break

Azure OCR for Instagram images

--------------------------------------------------
ocr for --> 2350613102216966045.png
ocr for --> 2340140894484999761.png
ocr for --> 2347618041418001201.png
ocr for --> 2393806188103257343.png
ocr for --> 2343348859268581716.png
ocr for --> 2338912215348179434.png
ocr for --> 2354215615625321577.png
ocr for --> 2362123660724497627.png
ocr for --> 2360747577210992175.png
ocr for --> 2382930617014024873.png
ocr for --> 2367252283412384155.png
ocr for --> 2374422979850657699.png
ocr for --> 2349874525984086923.png
ocr for --> 2363612331470268025.png
ocr for --> 2362839721379746585.png
ocr for --> 2430069395260057077.png
ocr for --> 2340997810832781473.png
ocr for --> 2345503318656671388.png
ocr for --> 2358282383087974949.png
ocr for --> 2343319077327950041.png
ocr for --> 2344794897858880330.png
i is --> 20
waiting for 63 seconds...
ocr for --> 2344590726186435769.png
ocr for --> 2343993744749285462.png
ocr for --> 2375109852688305770.png
ocr for --> 236

ocr for --> 2357090857364396524.png
ocr for --> 2379480805069319749.png
ocr for --> 2352015248422411166.png
ocr for --> 2356258355112399066.png
ocr for --> 2368016749515204816.png
ocr for --> 2349788573789884120.png
ocr for --> 2335111117053777352.png
ocr for --> 2365767204567634501.png
ocr for --> 2343213371790582543.png
ocr for --> 2369962037578467862.png
ocr for --> 2352015248430732310.png
ocr for --> 2374482858707644548.png
ocr for --> 2351841377618518412.png
ocr for --> 2347395383493609677.png
ocr for --> 2362809286352377942.png
ocr for --> 2341672619778409606.png
i is --> 20
waiting for 63 seconds...
ocr for --> 2370857024625078605.png
ocr for --> 2377255424044295384.png
ocr for --> 2349874525992549603.png
ocr for --> 2353425698519369974.png
ocr for --> 2349823662741735416.png
ocr for --> 2429280809250142900.png
ocr for --> 2419871909807463066.png
ocr for --> 2346169071806241840.png
ocr for --> 2376633140921041056.png
ocr for --> 2423676819581787073.png
ocr for --> 23874272932043

ocr for --> 2355515227908825802.png
ocr for --> 2343123965964551318.png
ocr for --> 2367208342080134823.png
ocr for --> 2372085074314565133.png
ocr for --> 2335111117095606127.png
ocr for --> 2361479988001610561.png
ocr for --> 2360720290386065136.png
ocr for --> 2336023598107765663.png
i is --> 20
waiting for 63 seconds...
ocr for --> 2350297533303418931.png
ocr for --> 2351841377576643880.png
ocr for --> 2354751437608986432.png
ocr for --> 2356365673217055087.png
ocr for --> 2366410401153049210.png
ocr for --> 2335111117154412138.png
ocr for --> 2378641223671152848.png
ocr for --> 2346972879054670675.png
ocr for --> 2370721492872149981.png
ocr for --> 2351776914756570863.png
ocr for --> 2368435367931341898.png
ocr for --> 2377255424069382597.png
ocr for --> 2350297533337006144.png
ocr for --> 2368389542660287343.png
ocr for --> 2429280809216615779.png
ocr for --> 2374347343379539118.png
ocr for --> 2336817725867132042.png
ocr for --> 2380901976906222205.png
ocr for --> 23728352068448

## Look at combined ocr_dfs of instagram post text extraction

- Text extracted for 602 images:

## *UPDATE* These postIds are junk! not sure if the value was rounded or I did something wrong but yeah ... 

In [19]:
pd.concat(dfs)

Unnamed: 0,language,text_angle,orientation,regions,filename
0,en,0.0,Up,"{'bounding_box': '50,115,691,619', 'lines': [{...",2350613102216966045.png
0,en,0.0,Up,"{'bounding_box': '73,217,939,814', 'lines': [{...",2340140894484999761.png
0,en,0.0,Up,"{'bounding_box': '45,179,670,554', 'lines': [{...",2347618041418001201.png
0,en,0.0,Up,"{'bounding_box': '93,285,958,779', 'lines': [{...",2393806188103257343.png
0,en,0.0,Up,"{'bounding_box': '83,251,929,780', 'lines': [{...",2343348859268581716.png
...,...,...,...,...,...
0,en,0.0,Up,"{'bounding_box': '72,203,670,535', 'lines': [{...",2349788573764790476.png
0,en,0.0,Up,"{'bounding_box': '81,351,931,680', 'lines': [{...",2343172003437263671.png
0,en,0.0,Up,"{'bounding_box': '55,173,662,557', 'lines': [{...",2350598473239542497.png
0,en,0.0,Up,"{'bounding_box': '12,81,1052,934', 'lines': [{...",2332331152127719574.png


## *UPDATE -->> I exported the results to json, so not going to run again...*
- The above cell for OCR w/ rate limiting takes around 33 minutes to complete

In [20]:
ocr_df = pd.concat(dfs)

In [23]:
ocr_df.reset_index().to_json('./data/blackivystories_ig_ocr.json')

In [3]:
#final_df = pd.concat(dfs)
#final_df.reset_index(drop=True,inplace=True)
#final_df.to_json('ig_ocr.json')

final_df = pd.read_json('./data/ig_ocr.json')

In [25]:
final_df = ocr_df

__________

## Nested OCR text words as `txt` column

In [27]:
def ocr_expand(x):
        
    txt = []
    
    for line in x.regions['lines']:
        for word in line['words']:
            txt.append(word['text'])
            
    return txt

In [28]:
final_df['txt'] = final_df.apply(lambda x: ocr_expand(x), axis=1)

In [7]:
#final_df.drop(columns=['regions']).to_json('ig_ocr_txt.json')

In [29]:
final_df.txt

0    [Once, her, roommate, found, out, that, I, spe...
0    [I'm, an, Afro-Latina, from, a, low-income, ur...
0    [Being, a, Black, engineering, student, at, Co...
0    [I, had, a, precept, with, a, '21, male, stude...
0    [Psafety, was, called, for, a, noise, complain...
                           ...                        
0    [Needles, to, say, I, never, got, my, stuff, b...
0    [Brown, students, boast, of, their, progressiv...
0    [My, grade, was, published, while, sitting, ne...
0    [At, night,, there, was, a, guard, stationed, ...
0    [Last, semester, I, took, one, of, the, many, ...
Name: txt, Length: 602, dtype: object

### We now want to set our IvyLeague stop words to find references for each respective school ...

helpful example for logic https://stackoverflow.com/questions/6531482/how-to-check-if-a-string-contains-an-element-from-a-list-in-python


### Loop through each ivy & check if it's in the txt lower case... 

- I am guessing we will miss some examples for this, for example `upenn` or `penn's`

In [30]:
ivyleaguesToCheck = ['penn','brown','princeton','harvard','columbia','yale','columbia','cornell','dartmouth']

In [31]:
def ivyCheck(x,ivyToCheck):
    
    for txt in x.txt:
        if txt.lower() in ivyToCheck:            
            return txt.lower()
        else:
            # ignore
            continue

In [32]:
final_df['ivy'] = final_df.apply(lambda x: ivyCheck(x,ivyleaguesToCheck), axis=1)

### Our BlackIvyStories OCR college value_counts:

In [33]:
final_df.ivy.value_counts()

princeton    93
penn         57
brown        48
columbia     43
cornell      42
harvard      41
dartmouth    21
yale         14
Name: ivy, dtype: int64

In [41]:
final_df.reset_index().to_json('./data/blackivystories_ig_ocr_expanded.json')

## Specifically looking at penn stories

In [42]:
penn_row_filter = final_df.ivy == 'penn'

In [43]:
penn_df = final_df[penn_row_filter]

In [44]:
penn_df.head()

Unnamed: 0,language,text_angle,orientation,regions,filename,txt,ivy
0,en,0.0,Up,"{'bounding_box': '68,114,944,917', 'lines': [{...",2374422979850657699.png,"[One, of, my, (non-Black), friends, and, I, we...",penn
0,en,0.0,Up,"{'bounding_box': '83,284,931,747', 'lines': [{...",2358282383087974949.png,"[Penn's, Undergraduate, Assembly, has, a, hist...",penn
0,en,0.0,Up,"{'bounding_box': '73,216,939,815', 'lines': [{...",2343993744749285462.png,"[Being, Black, at, Penn, means, watching, most...",penn
0,en,0.0,Up,"{'bounding_box': '80,284,932,747', 'lines': [{...",2341764308463485199.png,"[During, the, Africana, Summer, Institute,, a,...",penn
0,en,0.0,Up,"{'bounding_box': '74,183,938,848', 'lines': [{...",2342541965549375735.png,"[He, did, several, other, things, during, my, ...",penn


In [45]:
penn_df.reset_index().to_json('./data/pennivystories.json')

### Join the ocr text for further analysis... 

In [46]:
' '.join(penn_df.txt.iloc[29])

'I was followed by campus police for "trespassing" on my own campus upon leaving the library on a late night after studying. There were others who left the same time I did. None of them Black. No one else was questioned. Just me. - Penn 123 @blackivystories'

## TODO -->> 

- take the 57 penn post URLs and run commentGetter w/ docker
- postIds to reconstruct the public url