## Selenium Web Scraping for IG Comments

In [1]:
import pandas as pd
import os

### Configurations for Docker container deployment

In [2]:
# path to repo docker deploy script
deploy_script_path = './deploy.sh' # <-- Make sure to edit for your docker mount -v

In [3]:
# IG Username (`username` in ig_df)
input_wwe_username = 'blackivystories'

In [4]:
# Our instagram df from PhantomBuster
ig_df = pd.read_csv('./data/blackivystories_PostsExtractor_03012021.csv')

### Loop through urls for BlackIvyStories IG username ...


In [5]:
# Our exported Azure OCR df with text
ocr_df = pd.read_json('./data/blackivystories_ig_ocr_expanded.json')

In [6]:
ocr_df.columns

Index(['index', 'language', 'text_angle', 'orientation', 'regions', 'filename',
       'txt', 'ivy'],
      dtype='object')

## Getting the Penn URLs worked in the jhub deployment. What about Dartmouth for the Docker deployment?

I am reverse-extracting the post ID from the OCR image filename... probably wasn't the best strategy to start with

In [8]:
#ivy_input_name = 'penn'
ivy_input_name = 'dartmouth'

In [9]:
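postid_filename = []
ivy_ocr_pid = []

# Post IDs are the OCR filenames minus their last 4 characters (the extension, e.g. '.jpg')
for pid in ocr_df[ocr_df.ivy == ivy_input_name].filename.str[:-4]:
    ivy_ocr_pid.append(int(pid))

# Keep only the IG post URLs whose postId has OCR text for the selected school
ivy_ig_post_urls = ig_df[ig_df.postId.isin(ivy_ocr_pid)].postUrl.unique()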
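
The `filename.str[:-4]` slice above only works when every OCR filename ends in a three-character extension such as `.jpg`. A minimal sketch of a more tolerant extraction, assuming filenames look like `<postId>.<ext>` (the helper name `post_id_from_filename` and the sample filenames are hypothetical, not part of this notebook):

```python
import os


def post_id_from_filename(filename: str) -> int:
    """Return the numeric post id encoded in an OCR image filename.

    Assumes the filename looks like '<postId>.<ext>'; int() raises
    ValueError if the stem is not purely numeric.
    """
    stem, _ext = os.path.splitext(os.path.basename(filename))
    return int(stem)


# Made-up filenames for illustration (not real post ids)
example_ids = [post_id_from_filename(name) for name in ["1234567890.jpg", "42.jpeg"]]
print(example_ids)
```

With pandas this could be applied as `ocr_df.filename.map(post_id_from_filename)` in place of the slice.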


#### *TOTAL NUMBER OF URLS FOR IVY IG STORIES*

In [36]:
len(ivy_ig_post_urls)

16

In [37]:
# get our ids
url_ids = []

for url in ivy_ig_post_urls:
    url_ids.append(url.split('/')[-2])
    
# check if they already ran (in case the cell fails)
url_for_commentgetter = []

for ig_url in url_ids:
    json_outfile = ig_url + '.json'
    if json_outfile not in os.listdir('./data/etl/'): # <-- the url doesn't already have a json etl outfile
        url_for_commentgetter.append('https://www.instagram.com/p/{}/'.format(ig_url))
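
The cell above takes the shortcode with `url.split('/')[-2]`, which depends on the URL ending in a slash, and it lists `./data/etl/` once per URL. A rough sketch of the same "skip what already ran" check, assuming post URLs follow the `https://www.instagram.com/p/<shortcode>/` shape seen in this notebook (both helper names are hypothetical):

```python
import os
from urllib.parse import urlparse


def shortcode_from_url(post_url: str) -> str:
    """Return the shortcode from a URL shaped like https://www.instagram.com/p/<code>/."""
    parts = [p for p in urlparse(post_url).path.split("/") if p]
    # Expect ['p', '<shortcode>']; works with or without a trailing slash.
    if len(parts) < 2 or parts[0] != "p":
        raise ValueError("unexpected post URL: {}".format(post_url))
    return parts[1]


def pending_post_urls(post_urls, etl_dir):
    """Return canonical URLs for posts that have no '<shortcode>.json' in etl_dir yet."""
    done = set(os.listdir(etl_dir))  # list the directory once, not once per URL
    pending = []
    for post_url in post_urls:
        code = shortcode_from_url(post_url)
        if code + ".json" not in done:
            pending.append("https://www.instagram.com/p/{}/".format(code))
    return pending
```

Usage would look like `url_for_commentgetter = pending_post_urls(ivy_ig_post_urls, './data/etl/')`.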

#### *HOW MANY URLS ARE YOU DEPLOYING DOCKER COMMENT GETTER FOR:*

*note -- If it's zero then you are done for that respective ivy filter!*

In [38]:
len(url_for_commentgetter)

0

__________

## DOCKER DEPLOYMENT FOR INSTAGRAM COMMENT GETTER


- *You need Docker installed on your host... more info here: https://docs.docker.com/get-docker/*

- I think Firefox is best for testing, but in the past I remember using Chrome because it was friendlier in Docker with Selenium... we will see:

_________

```bash
docker run -d -p 4444:4444 --shm-size=4g selenium/standalone-firefox
# you can rename this container

# check for running container name
docker ps
docker rename _whatever_random_docker_name_is_ selenium-firefox

# nice to have a terminal window visible with docker stats for monitoring
docker stats
```

__________
#### *RUN FOR FIRST BUILD, OR JUST RESTART*

- give it a minute to start up...
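
Instead of waiting a fixed minute, the notebook could poll the Selenium server until it reports ready. This is only a sketch under assumptions: that the standalone container answers a JSON status request (commonly at `http://localhost:4444/wd/hub/status`, or `/status` on newer images) with a `value.ready` field. The function names and default timings are hypothetical.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_status(status_url):
    """Fetch and decode the JSON status document; raises on network errors."""
    with urllib.request.urlopen(status_url, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_until_ready(status_url, timeout=60.0, interval=2.0, fetch=fetch_status):
    """Poll status_url until it reports ready or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            status = fetch(status_url)
            if (status.get("value") or {}).get("ready") is True:
                return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # server not accepting connections yet, or non-JSON reply
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A call such as `wait_until_ready("http://localhost:4444/wd/hub/status")` after `docker restart selenium-firefox` would return `True` once the server is up, or `False` after the timeout. The `fetch` parameter exists only so the polling logic can be exercised without a running container.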

In [32]:
#!docker run -d -p 4444:4444 --shm-size=4g selenium/standalone-chrome

In [33]:
!docker restart selenium-firefox

selenium-firefox


In [39]:
!docker ps

CONTAINER ID   IMAGE                         COMMAND                  CREATED             STATUS         PORTS                    NAMES
35faafb3499e   selenium/standalone-firefox   "/opt/bin/entry_poin…"   About an hour ago   Up 3 minutes   0.0.0.0:4444->4444/tcp   selenium-firefox


## Going to scale up for the remaining ~540 URLs for the rest of the Ivy schools...

Pushing to GitHub first as it'll run overnight...
__________

### Docker bash deploy worked for the 16 dartmouth URLs in testing

### Looping through the 55 penn story instagram URLs for comment web scraping ... 

Network errors seem to break this... maybe the restart-on-failure flag isn't helpful? The pipeline failed to stop a restarted container.

- I removed `--restart=on-failure` from [deploy.sh](./deploy.sh). If a container fails, we can rerun it by refreshing the URLs for the comment getter in the cells above.
- There was a runaway container that hit 150+ clicks very quickly. I think I had set a threshold of 500 clicks before it stops, but I just hit the stop button to stop that container. I probably need to rerun for just that one URL? I bumped the threshold down to 100.
- Not sure why, but the container keeps running off and clicking for https://www.instagram.com/p/CBn-zFWjDIf/... it did get comments and export to JSON, though.
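
The click threshold described above could be paired with a stall check, so a container stops early when clicking no longer loads anything new. The internals of `GetComments.py` are not shown here, so this is a hypothetical sketch: `click_load_more` and `count_comments` stand in for whatever Selenium calls the scraper really makes, and `max_stalled` is an invented parameter.

```python
def load_all_comments(click_load_more, count_comments, max_clicks=100, max_stalled=3):
    """Click 'load more' until it disappears, stalls, or max_clicks is reached.

    click_load_more: callable returning False when no button is left to click.
    count_comments: callable returning how many comments are currently loaded.
    Returns the number of clicks performed.
    """
    clicks = 0
    stalled = 0
    seen = count_comments()
    while clicks < max_clicks and stalled < max_stalled:
        if not click_load_more():
            break  # no button left: everything is loaded
        clicks += 1
        now = count_comments()
        # A click that adds no comments counts toward the stall limit.
        stalled = stalled + 1 if now <= seen else 0
        seen = max(seen, now)
    return clicks
```

With this shape, a post whose button keeps reappearing without loading comments would stop after `max_stalled` clicks rather than running to the full threshold.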

In [35]:
for url in url_for_commentgetter:

    ##################################################################
    # Clean up URL (add https:// if missing, drop any trailing slash)...
    if not url.startswith('https://'):
        url = 'https://' + url
    if url.endswith('/'):
        url = url[:-1]

    ##################################################################
    # Prepare Dockerfile (the template is re-rendered for each URL)
    with open("./configs/dockerfile_template", "rt") as dockerfile_in, \
         open("./Dockerfile", "wt") as dockerfile_out:
        # Replace the placeholder line by line -- this is how the url gets passed to the container
        # (a regex would also work)
        for line in dockerfile_in:
            dockerfile_out.write(line.replace('replace_with_url', '"{}"'.format(url)))

    ##################################################################
    # BEGIN LOOPING THROUGH URLS BY LAUNCHING DOCKER CommentGetters
    print('Getting comments for IG post --> {}'.format(url))
    try:
        # NOTE: IPython's `!` does not raise on a non-zero exit status,
        # so this except only catches Python-level errors
        !bash $deploy_script_path
    except Exception:
        print('oops! failed for {}'.format(url))
        break

Getting comments for IG post --> https://www.instagram.com/p/CBn-zFWjDIf
Error response from daemon: No such container: commentGetter
Error: No such container: commentGetter
Sending build context to Docker daemon  85.46MB
Step 1/6 : FROM python:3.8-slim
 ---> 62297c9f4e5c
Step 2/6 : COPY GetComments.py /app/GetComments.py
 ---> Using cache
 ---> 90c9796babd0
Step 3/6 : COPY configs /app/configs
 ---> Using cache
 ---> abc6089367f5
Step 4/6 : WORKDIR /app
 ---> Using cache
 ---> d8792d16e028
Step 5/6 : RUN pip install -r /app/configs/requirements.txt
 ---> Using cache
 ---> 53541ed81935
Step 6/6 : CMD ["python3", "GetComments.py", "https://www.instagram.com/p/CBn-zFWjDIf"]
 ---> Using cache
 ---> 0f5462a4c283
Successfully built 0f5462a4c283
Successfully tagged instagram_commentgetter:latest
------------------------------------------------------------
Raw html outfile prepared as: /data/raw/CBn-zFWjDIf.html
ETL outfile prepared as: /data/etl/CBn-zFWjDIf.json
Trying to click & load more c
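
Because IPython's `!` shell escape does not raise when the command exits with a non-zero status, the `try`/`except` in the loop above may never see a failed deploy. One possible alternative is to render the Dockerfile and run the script through `subprocess`, then check the return code. This is a sketch, not the notebook's method: the helper names are hypothetical, and the template line in the comment is an assumption based on the `CMD` step in the build log.

```python
import subprocess


def render_dockerfile(template_text, url):
    """Fill the 'replace_with_url' placeholder with the quoted post URL."""
    # Assumed template line: CMD ["python3", "GetComments.py", replace_with_url]
    return template_text.replace("replace_with_url", '"{}"'.format(url))


def run_deploy(command):
    """Run the deploy command (a list of arguments); True when it exits with 0."""
    result = subprocess.run(command)
    return result.returncode == 0
```

In the loop this would read `if not run_deploy(["bash", deploy_script_path]): break`. One trade-off: output from a subprocess started this way may show up in the terminal running Jupyter rather than in the cell output.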

In [17]:
!docker ps

CONTAINER ID   IMAGE                            COMMAND                  CREATED              STATUS              PORTS                    NAMES
cf28a2d6f945   instagram_commentgetter:latest   "python3 GetComments…"   About a minute ago   Up About a minute                            commentGetter
35faafb3499e   selenium/standalone-firefox      "/opt/bin/entry_poin…"   About an hour ago    Up 9 minutes        0.0.0.0:4444->4444/tcp   selenium-firefox


In [18]:
etl_files = [item for item in os.listdir('./data/etl/') if item.endswith('.json')]

### Function for getting URL from image filename

In [19]:
etl_files = [item for item in os.listdir('./data/etl/') if item.endswith('.json')]

dfs = []

for etl in etl_files:
    dfs.append(pd.read_json('./data/etl/{}'.format(etl)))
    
etl_df = pd.concat(dfs)

In [20]:
etl_df.shape

(1109, 3)

In [21]:
etl_df

| | url | author | comment |
|---|---|---|---|
| 0 | https://www.instagram.com/p/CB1vWGmJJWc | blackivystories | @uofpenn . . . . . . #blackstudentsmatter #b... |
| 1 | https://www.instagram.com/p/CB1vWGmJJWc | doreenm1 | 😡 @uofpenn @columbia enough of this racist... |
| 2 | https://www.instagram.com/p/CB1vWGmJJWc | jay_theorist | @skaijackson 36w Reply |
| 3 | https://www.instagram.com/p/CB1vWGmJJWc | View | replies (1) |
| 4 | https://www.instagram.com/p/CB1vWGmJJWc | janzibrown | Wow 36w 1 like Reply |
| ... | ... | ... | ... |
| 24 | https://www.instagram.com/p/CCl0jqGh0S1 | View | replies (1) |
| 25 | https://www.instagram.com/p/CCl0jqGh0S1 | victoriatellez | What the fuck 33w 1 like Reply |
| 26 | https://www.instagram.com/p/CCl0jqGh0S1 | kiwikiwikiwi25 | 😞 What an ignorant and horrible thing to say. ... |
| 27 | https://www.instagram.com/p/CCl0jqGh0S1 | catie.mc23 | @dartmouthcollege 33w Reply |
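
The rows above still carry page residue: "View replies (1)" links parsed as an author/comment pair, and trailing text such as `36w 1 like Reply`. A hedged sketch of a cleanup pass follows. The pattern is inferred only from the handful of rows displayed here (ages in weeks, an optional like count), so it is an assumption and would likely need extending for other age units. Both helper names are hypothetical.

```python
import re

# Trailing UI text such as ' 36w Reply' or ' 36w 1 like Reply' (assumed pattern)
_TRAILER = re.compile(r"\s*\d+w(?:\s+\d+\s+likes?)?\s+Reply$")


def is_ui_row(author, comment):
    """True for rows that are really the 'View replies (n)' link, not a comment."""
    return author == "View" and comment.startswith("replies")


def clean_comment(comment):
    """Strip the trailing age / like count / 'Reply' text from a scraped comment."""
    return _TRAILER.sub("", comment)
```

On the DataFrame this could be applied as a row filter using `is_ui_row` followed by `etl_df['comment'].map(clean_comment)`.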
