# Instagram Scraper UI

This **Notebook-as-Tool** provides a user interface for the command-line application [instagram-scraper](https://github.com/arc298/instagram-scraper). With this tool you can scrape media as well as metadata for:
1. Instagram accounts
2. Hashtags

**How to Use:**
To run or adapt this Colab Notebook you need to create a copy in your Google Drive.

1.   Select **File → Save a copy in Drive**. The copy will be stored in a folder ```Colab Notebooks```. Open this file with Google Colab and run the cells consecutively by pressing the **Play** button or pressing **shift+enter**.
2.   Instagram blocks most requests if no user credentials are provided. Since this information is sensitive, it is stored in a config file on Google Drive:
  - Download the Instagram config template from: http://tiny.cc/instagram-config-template
  - Add your information using your preferred code editor, such as [SublimeText](https://www.sublimetext.com), [Atom](https://atom.io/), or [Brackets](http://brackets.io/).
  - Upload the file to your Google Drive. The notebook assumes that you rename the template file to ```instagram_config.ini``` and place it in a folder named ```Colab_Data/Configs``` in ```MyDrive```.
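
The setup code further down reads the keys `user` and `pwd` from a section named `Instagram`, so the config file needs at least that structure. The following is a minimal sketch, assuming the template contains nothing else. The values are placeholders, and the temporary directory stands in for `Colab_Data/Configs` on Google Drive.

```python
import configparser
import os
import tempfile

# Assumed minimal layout of instagram_config.ini. The section and key names
# are the ones the setup cell reads; the values are placeholders.
template = configparser.ConfigParser()
template["Instagram"] = {
    "user": "research-account@example.com",
    "pwd": "replace-with-your-password",
}

# Written to a temporary directory here; the notebook expects the file in
# Colab_Data/Configs on Google Drive.
config_file = os.path.join(tempfile.mkdtemp(), "instagram_config.ini")
with open(config_file, "w") as outfile:
    template.write(outfile)

# Read the file back the same way the setup cell does.
config = configparser.ConfigParser()
config.read(config_file)
user_email = config["Instagram"]["user"]
user_pwd = config["Instagram"]["pwd"]
print(sorted(config["Instagram"].keys()))
```

Note that `configparser` applies value interpolation by default, so a literal `%` in the password has to be written as `%%` in the file.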

**BEWARE: Instagram might block your account if you use this tool. It is recommended to create a dedicated research account that is completely separate from any Instagram accounts you value and do not want to get blocked or deleted by the platform. This requires using a separate email address and potentially a separate cell phone number for verifying the account.**

**Important notes:**
- Code is hidden in the background of Colab forms. To view and edit the code, **double-click** the cell or select **View → Show/hide code**.
- Data will be stored in Google Drive in the folder ```Colab_Data```. Access to your Drive is authorized when you run the setup code cells. The authorization is temporary, and only your current notebook will be connected to your Drive. The connection is revoked when the notebook is terminated or when you select **Runtime → Factory reset runtime**.


**Credits:** This notebook was written by Marcus Burkhardt. It provides a user interface for the command line application [instagram-scraper](https://github.com/arc298/instagram-scraper) which is maintained by GitHub user [arc298](https://github.com/arc298/). With this tool you can use Google Colab as runtime and Google Drive as cloud storage for retrieved data.

In [None]:
#@title Setup 1: Mount Google Drive for Loading and Storing Data
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

In [None]:
#@title Setup 2: Install and Load Required Libraries and Run Setup Procedures

# Import Libraries
import os
import json
import time
import configparser
import importlib.util  # the util submodule must be imported explicitly for find_spec
import shutil
import subprocess
from tqdm.notebook import tqdm

#import pandas as pd
#from datetime import datetime

# Install Libraries
if importlib.util.find_spec('instagram_scraper') is None:
  !pip3 install instagram-scraper

# Defining path variable for config path
config_path = os.path.join("gdrive", "MyDrive", "Colab_Data", "Configs")
if not os.path.isdir(config_path):
  os.makedirs(config_path)

# Defining path variable for data path
data_path = os.path.join("gdrive", "MyDrive", "Colab_Data", "Data", "Instagram-Scraper")
if not os.path.isdir(data_path):
  os.makedirs(data_path)
pfile = os.path.join(data_path, 'protocol_queries.json')

try:
  # Reading config and setting configuration values
  config = configparser.ConfigParser()
  config.read(os.path.join(config_path, "instagram_config.ini"))

  user_email = str(config['Instagram']['user'])
  user_pwd = str(config['Instagram']['pwd'])

  # Check that both credentials are present and non-empty
  if user_email and user_pwd:
    print('Successfully parsed config data.')
  else:
    print('Config was read, but user or pwd is empty.')

except Exception as e:
  # A missing file or section surfaces here as a KeyError
  print(f'Error reading or parsing the config: {e!r}')

debug = False

In [None]:
#@title Setup 3: Definition of Core and Support Functions Used by the Tool(s)

def check_path(path):
  if not os.path.isdir(path):
    os.makedirs(path)

def load_protocol(fname):
  if os.path.isfile(fname):
    with open(fname, 'r') as infile:
      return json.load(infile)
  else:
    return []

def save_protocol(protocol, fname):
  with open(fname, 'w') as outfile:
    json.dump(protocol, outfile, indent=4)

def make_protocol(tmp_params):
  tmp = {}
  tmp['timestamp'] = time.time()
  last = None
  for item in tmp_params:
    if last and last == 'instagram-scraper':
      tmp['query'] = item
    elif last in ['-u', '-p', '-d']:
      pass
    elif last in ['--maximum', '--media-types', '--latest-stamps']:
      tmp[last[2:]] = item
    elif item.startswith('--'):
      tmp[item[2:]] = True
    last = item
  
  # The protocol records media metadata as retrieved whenever comments are requested
  if 'comments' in tmp:
    tmp['media-metadata'] = True
  return tmp
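
To illustrate what ends up in `protocol_queries.json`, here is a minimal sketch that restates `make_protocol` in condensed form and applies it to a sample parameter list. The function name `make_protocol_sketch`, the account name, the credentials, and the output path are made-up placeholders.

```python
import time

def make_protocol_sketch(params):
    """Condensed restatement of make_protocol, for illustration only."""
    entry = {"timestamp": time.time()}
    last = None
    for item in params:
        if last == "instagram-scraper":
            entry["query"] = item      # first argument: account or hashtag
        elif last in ("-u", "-p", "-d"):
            pass                       # credentials and paths are not logged
        elif last in ("--maximum", "--media-types", "--latest-stamps"):
            entry[last[2:]] = item     # options that carry a value
        elif item.startswith("--"):
            entry[item[2:]] = True     # switches without a value
        last = item
    if "comments" in entry:
        entry["media-metadata"] = True
    return entry

# Placeholder command, shaped like the lists built in the retrieval cells.
sample_params = ["instagram-scraper", "some_account",
                 "-u", "me@example.com", "-p", "secret",
                 "-d", "out/media", "--maximum", "25", "--comments"]
entry = make_protocol_sketch(sample_params)
print({key: value for key, value in entry.items() if key != "timestamp"})
```

The resulting entry contains `query`, `maximum`, `comments`, and `media-metadata` alongside the timestamp, while the values following `-u`, `-p`, and `-d` are left out.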


In [None]:
protocol = load_protocol(pfile)

#@title # **Retrieve User Data** 
#@markdown ## **Users**
#@markdown *Comma separate multiple user accounts.*
Users = "" #@param {type:"string"}
# Skip empty entries so that an empty form field does not trigger a request
Users = [user.strip() for user in Users.split(',') if user.strip()]
Include_Comments = True #@param {type:"boolean"}

#@markdown ## **Options**

#@markdown **Set Maximum to 0 for retrieving all.** *(USE THIS OPTION WITH CARE!)*
Maximum = 25 #@param {type:"slider", min:0, max:1000, step:5}

#@markdown **BEWARE:** If you set Maximum to 0 to retrieve all media for
#@markdown a given account, the option "--latest-stamps" is passed to the tool,
#@markdown allowing it to retrieve only new posts later on.
#@markdown *IF YOU WANT TO RESET THE LATEST DATA, MAKE SURE TO CHANGE THE TIMESTAMP
#@markdown FOR THE USER TO 1 IN THE FILE latest.yaml IN YOUR CLOUD DATA FOLDER.
#@markdown RENAMING OR ERASING THE FILE WORKS AS WELL.*

for user in tqdm(Users):
  print(f'Processing user: {user}')

  outpath = os.path.join(data_path, 'users', user)
  params = ["instagram-scraper", user, 
            "-u", user_email, 
            "-p", user_pwd]
  
  # Profile info
  print(f'  Retrieving profile metadata.')
  tmp_params = params.copy()
  tmp_params.append("--maximum")
  tmp_params.append(str(1))
  tmp_outpath = os.path.join(outpath, 'profile')
  check_path(tmp_outpath)
  tmp_params.append("-d")
  tmp_params.append(tmp_outpath)
  tmp_params.append("--media-types")
  tmp_params.append("none")
  tmp_params.append("--profile-metadata")
  tmp_cmd = ' '.join(tmp_params)
  if debug: 
    print(f'  {tmp_cmd}')
  subprocess.call(tmp_params)
  protocol.append(make_protocol(tmp_params))

  # Media and metadata
  print(f'  Retrieving media and associated metadata.')
  tmp_params = params.copy()
  tmp_outpath = os.path.join(outpath, 'media')
  check_path(tmp_outpath)
  tmp_params.append("-d")
  tmp_params.append(tmp_outpath)
  
  
  if isinstance(Maximum, int) and Maximum != 0:
    tmp_params.append("--maximum")
    tmp_params.append(str(Maximum))
    Retrieve_Latest = False
  else:
    Retrieve_Latest = True
    latest_file = "latest.yaml"
    latest_file_cloud = os.path.join(data_path, latest_file)
    latest_file_local = os.path.join(os.getcwd(), latest_file)
    if os.path.isfile(latest_file_local):
      os.remove(latest_file_local)
    if os.path.isfile(latest_file_cloud):
      shutil.copy(latest_file_cloud, latest_file_local)
    tmp_params.append("--latest-stamps")
    tmp_params.append("latest.yaml")
  
  if not Include_Comments:
    tmp_params.append("--media-metadata")
  else:
    tmp_params.append("--comments")
  tmp_cmd = ' '.join(tmp_params)
  if debug: 
    print(f'  {tmp_cmd}')
  subprocess.call(tmp_params)
  if Retrieve_Latest and os.path.isfile(latest_file_local):
    shutil.copy(latest_file_local, latest_file_cloud)
    os.remove(latest_file_local)
  protocol.append(make_protocol(tmp_params))
  save_protocol(protocol, pfile)
print('Done.')
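
The command assembled for the media step can be summarized in a small helper. This is only a sketch: `build_media_command` is a hypothetical function that does not exist in the notebook, and the account name, credentials, and paths in the usage lines are placeholders. It uses only the `instagram-scraper` options that already appear in the cell above.

```python
def build_media_command(user, user_email, user_pwd, outpath,
                        maximum=25, include_comments=True,
                        latest_file="latest.yaml"):
    """Return the argument list for one media retrieval (hypothetical helper)."""
    cmd = ["instagram-scraper", user,
           "-u", user_email, "-p", user_pwd,
           "-d", outpath]
    if isinstance(maximum, int) and maximum != 0:
        # Limited run: retrieve at most `maximum` items
        cmd += ["--maximum", str(maximum)]
    else:
        # Unlimited run: track timestamps so later runs fetch only new posts
        cmd += ["--latest-stamps", latest_file]
    cmd.append("--comments" if include_comments else "--media-metadata")
    return cmd

limited = build_media_command("some_account", "me@example.com", "secret", "out/media")
everything = build_media_command("some_account", "me@example.com", "secret",
                                 "out/media", maximum=0)
print(limited[8:])     # ['--maximum', '25', '--comments']
print(everything[8:])  # ['--latest-stamps', 'latest.yaml', '--comments']
```

Because the arguments are passed to `subprocess.call` as a list, no shell is involved and values containing spaces or special characters need no extra quoting.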

In [None]:
protocol = load_protocol(pfile)

#@title # **Retrieve Data for Hashtags** 
#@markdown ## **Hashtags**
#@markdown *Comma separate multiple hashtags.*
Hashtags = "" #@param {type:"string"}
# Skip empty entries so that an empty form field does not trigger a request
Hashtags = [tag.strip() for tag in Hashtags.split(',') if tag.strip()]
Include_Comments = True #@param {type:"boolean"}

#@markdown ## **Options**

#@markdown **Set Maximum to 0 for retrieving all.** *(USE THIS OPTION WITH CARE!)*
Maximum = 25 #@param {type:"slider", min:0, max:1000, step:5}

for hashtag in tqdm(Hashtags):
  print(f'Processing hashtag: {hashtag}')

  outpath = os.path.join(data_path, 'hashtags', hashtag)
  params = ["instagram-scraper", hashtag, "--tag",  
            "-u", user_email, 
            "-p", user_pwd]
  
  # Media and metadata
  print(f'  Retrieving media and associated metadata.')
  tmp_params = params.copy()
  tmp_outpath = os.path.join(outpath, 'media')
  check_path(tmp_outpath)
  tmp_params.append("-d")
  tmp_params.append(tmp_outpath)
  
  if isinstance(Maximum, int) and Maximum != 0:
    tmp_params.append("--maximum")
    tmp_params.append(str(Maximum))
  
  if not Include_Comments:
    tmp_params.append("--media-metadata")
  else:
    tmp_params.append("--comments")
  tmp_cmd = ' '.join(tmp_params)
  if debug: 
    print(f'  {tmp_cmd}')
  subprocess.call(tmp_params)
  protocol.append(make_protocol(tmp_params))

  print('\n')
  save_protocol(protocol, pfile)
print('Done.')
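
After a few runs, `protocol_queries.json` holds one entry per executed command. Below is a minimal sketch for reviewing it, assuming the entry layout produced by `make_protocol`. The sample entries and the temporary file are placeholders standing in for the file on Google Drive.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

# Placeholder entries shaped like those written by make_protocol.
sample = [
    {"timestamp": 0.0, "query": "some_account", "maximum": "25",
     "comments": True, "media-metadata": True},
    {"timestamp": 86400.0, "query": "some_hashtag", "tag": True,
     "maximum": "25", "comments": True, "media-metadata": True},
]
pfile_demo = os.path.join(tempfile.mkdtemp(), "protocol_queries.json")
with open(pfile_demo, "w") as outfile:
    json.dump(sample, outfile, indent=4)

# Load the protocol and build one readable line per query.
with open(pfile_demo) as infile:
    protocol_demo = json.load(infile)

summary = []
for entry in protocol_demo:
    when = datetime.fromtimestamp(entry["timestamp"], tz=timezone.utc)
    kind = "hashtag" if entry.get("tag") else "user"
    summary.append(f"{when:%Y-%m-%d %H:%M} UTC  {kind}: {entry['query']}")
print("\n".join(summary))
```

Hashtag queries can be told apart from user queries by the `tag` key, which `make_protocol` sets when `--tag` is part of the command.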