# Test notebook: Instagram Crawler

We experiment with Python to crawl an Instagram profile, download a post, and display the post's caption.

## Import required modules

In [1]:
import json
import requests
from bs4 import BeautifulSoup
from lxml.html.soupparser import fromstring

## Retrieve the Profile's Website

In [2]:
insta_profile_url = 'https://www.instagram.com/koloot.design/'

In [3]:
# we download the profile's website
insta_profile = requests.get(insta_profile_url, allow_redirects=True)

# error check
if insta_profile.status_code != 200:
    print("Could not download the Instagram profile")
    print("Response code: %d" %insta_profile.status_code)

In [4]:
# Let's parse the website
soup = BeautifulSoup(insta_profile.text, 'lxml')

json_data_str = ''
prefix = 'window._sharedData = '
# iterate through all scripts and find the
# first occurrence of the script containing the data
for script in soup.find_all('script'):
    # script.string is None for empty or externally loaded scripts
    script_str = script.string
    if script_str and script_str.startswith(prefix):
        json_data_str = script_str
        break

# clean data: str.strip(chars) removes a set of characters, not a
# prefix, so slice the prefix off and drop the trailing semicolon
json_data_str = json_data_str[len(prefix):].strip().rstrip(';')
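
The clean-up step can also be tried in isolation as a small helper. `str.strip` with an argument removes any of the given characters from both ends rather than a literal prefix, so this sketch slices the prefix off instead. The name `extract_shared_data` and the sample string are assumptions made for illustration; they are not part of the notebook.

```python
import json

PREFIX = 'window._sharedData = '

def extract_shared_data(script_text):
    """Return the parsed JSON payload of a `window._sharedData` script, or None."""
    if not script_text or not script_text.startswith(PREFIX):
        return None
    # Slice off the literal prefix, then drop whitespace and the trailing ';'
    payload = script_text[len(PREFIX):].strip().rstrip(';')
    return json.loads(payload)

# Hypothetical sample mimicking the shape of the script content
sample = 'window._sharedData = {"entry_data": {"ProfilePage": []}};'
print(extract_shared_data(sample))
```

Returning `None` for scripts without the prefix lets the caller loop over all `<script>` tags and stop at the first non-`None` result.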


In [5]:
# decode the JSON string into a Python dictionary
json_data = json.loads(json_data_str.replace("\n","\\n"))

In [6]:
#
# Source: https://gist.github.com/douglasmiranda/5127251#gistcomment-2398949
#
def find_json(key, dictionary):
    """Yields every value stored under a key in a (nested) dictionary.

       Arguments:
           - key: the dictionary key to search for
           - dictionary: <dictionary>, may contain nested dictionaries
             and lists of dictionaries

       Returns:
           - <iterator> over the matching values
    """
    for k, v in dictionary.items():
        if k == key:
            yield v
        elif isinstance(v, dict):
            yield from find_json(key, v)
        elif isinstance(v, list):
            for d in v:
                if isinstance(d, dict):
                    yield from find_json(key, d)
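
To see what such a generator returns, here is a minimal sketch run on a small hand-made dictionary. `find_values` is a hypothetical variant, not the function from the gist: it also accepts a list at the top level and descends into lists nested inside lists. The sample data is invented for illustration.

```python
def find_values(key, node):
    """Yield every value stored under `key` in nested dicts and lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            else:
                yield from find_values(key, v)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(key, item)

# Hypothetical data, loosely shaped like a nested profile structure
sample = {
    "user": {
        "media": [
            {"node": {"shortcode": "abc", "likes": 3}},
            {"node": {"shortcode": "def", "likes": 5}},
        ]
    }
}
print(list(find_values("shortcode", sample)))  # ['abc', 'def']
```

Because the function is a generator, wrapping the call in `list(...)` is what actually walks the structure and collects the matches.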

In [7]:
insta_posts = list(find_json('shortcode', json_data))

## Download a Single Instagram Post

In [8]:
insta_post = insta_posts[0]
insta_post_url = 'https://www.instagram.com/p/' + insta_post + '/'
print("Will download the post: %s" %insta_post_url)

Will download the post: https://www.instagram.com/p/BhRpkfqgnsf/


In [9]:
insta_post = requests.get(insta_post_url, allow_redirects=True)
soup = BeautifulSoup(insta_post.text, 'lxml') 

In [10]:
# find json data
soup.find(attrs={"type": "application/ld+json"}).string

'\n                {"@context":"http:\\/\\/schema.org","@type":"ImageObject","caption":".\\nAnafor typeface poster\\nBy erman Yilmaz\\n@_looperman_\\n\\nerman yilmaz (1985,turkey) is a graphic designer and graffiti artist based in i\\u0307stanbul, whose work focuses mainly on the arts, social and cultural sector.\\n\\nwww.ermanyilmaz.com\\n#looperman #ermanyilmaz #graphicdesign #graphic #design #designer #artist #type #artistic #Typography #vscoart #tbt #graphicdesigner #graphicart #creative #vsco #vscocam #typographer #poster #posterdesign #artwork #posters #creativity #dailyart #designeveryday #designinspiration #postereveryday #graphics #art #kolootdesign","representativeOfPage":"http:\\/\\/schema.org\\/True","uploadDate":"2018-04-07T16:22:35","author":{"@type":"Person","alternateName":"@koloot.design","mainEntityofPage":{"@type":"ProfilePage","@id":"https:\\/\\/www.instagram.com\\/koloot.design\\/"}},"comment":[{"@type":"Comment","text":"\\u0641\\u0648\\u0642 \\u0627\\u0644\\u0639\

## Instagram Post Data

The web page of an Instagram post embeds JSON data describing the post.
The data in the `application/ld+json` script tag can be parsed directly with Python's json library.

In [11]:
# parse the data from the website 
post_data_str = soup.find(attrs={"type": "application/ld+json"}).string
# remove any leading and trailing whitespace such as \n, \r, \t, \f, space
post_data_str = post_data_str.strip()
# prep string for json parsing
post_json = json.loads(post_data_str.replace("\n","\\n"))
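
The parsing step can be tried without a network request. The sketch below feeds a shortened, hand-written string in the style of the script content to `json.loads` and shows that the escaped sequences `\/` and `\n` are decoded to `/` and a newline. The sample string and the names `raw` and `data` are assumptions for illustration.

```python
import json

# Shortened, hand-written sample in the style of the ld+json script content
raw = """
    {"@context":"http:\\/\\/schema.org","@type":"ImageObject",
     "caption":"first line\\nsecond line","uploadDate":"2018-04-07T16:22:35"}
"""

data = json.loads(raw.strip())
print(data["@context"])  # http://schema.org
print(data["caption"])   # prints the caption on two lines
```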

In [12]:
# access json data content
print(post_json["caption"])

.
Anafor typeface poster
By erman Yilmaz
@_looperman_

erman yilmaz (1985,turkey) is a graphic designer and graffiti artist based in i̇stanbul, whose work focuses mainly on the arts, social and cultural sector.

www.ermanyilmaz.com
#looperman #ermanyilmaz #graphicdesign #graphic #design #designer #artist #type #artistic #Typography #vscoart #tbt #graphicdesigner #graphicart #creative #vsco #vscocam #typographer #poster #posterdesign #artwork #posters #creativity #dailyart #designeveryday #designinspiration #postereveryday #graphics #art #kolootdesign


In [13]:
post_json['uploadDate']

'2018-04-07T16:22:35'
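
The `uploadDate` value is an ISO 8601 string without a UTC offset, so it can be turned into a naive `datetime` for sorting or filtering posts. A minimal sketch, assuming Python 3.7 or later for `datetime.fromisoformat`; the variable name `upload_date` is introduced here and is not part of the notebook.

```python
from datetime import datetime

# The string as shown in the cell output above; it carries no timezone
upload_date = datetime.fromisoformat('2018-04-07T16:22:35')

print(upload_date.date())   # 2018-04-07
print(upload_date.tzinfo)   # None, i.e. a naive datetime
```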

In [14]:
#
#    <meta name="medium" content="image" />
#    <meta property="og:type" content="instapp:photo" />
#
post_content_type = soup.find('meta', attrs={"name": "medium"})['content']
post_content_type

'image'

In [15]:
post_image_url = soup.find('meta', attrs={"property": "og:image"})['content']
post_image_url

'https://scontent-frx5-1.cdninstagram.com/vp/8075278f878692f0135dfc3e300a52db/5DEF77CC/t51.2885-15/e35/29717856_423969544730130_3367786897353998336_n.jpg?_nc_ht=scontent-frx5-1.cdninstagram.com'
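
With the image URL in hand, the file itself could be saved to disk. The following is a sketch under stated assumptions rather than the notebook's method: `filename_from_url` and `download_image` are hypothetical helpers, and the standard library's `urllib.request` is used instead of `requests` so the example has no third-party dependency. Only the file name derivation is exercised here; no request is made.

```python
import os
import posixpath
from urllib.parse import urlparse
from urllib.request import urlopen

def filename_from_url(url):
    """Return the last path segment of a URL, ignoring the query string."""
    return posixpath.basename(urlparse(url).path)

def download_image(url, directory='.'):
    """Fetch `url` and write it into `directory`; return the local path."""
    path = os.path.join(directory, filename_from_url(url))
    with urlopen(url) as response, open(path, 'wb') as out:
        out.write(response.read())
    return path

# Hypothetical URL for illustration; no request is made here
example_url = 'https://example.com/media/e35/photo_n.jpg?_nc_ht=example.com'
print(filename_from_url(example_url))  # photo_n.jpg
```

Calling `download_image(post_image_url)` would then write the image next to the notebook, provided the URL is still reachable.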