# Instagram Crawler

This notebook is a scratchpad for implementing Instagram web scraping.

The goal is to design and test draft functions that get data from Instagram.

This notebook contains the implementation and tests for a snowballing effect function to get data about tags in Instagram posts.

**Some references**

https://www.eatthis.com/biggest-fast-food-chains-america/

https://medium.com/@h4t0n/instagram-data-scraping-550c5f2fb6f1

https://medium.com/@srujana.rao2/scraping-instagram-with-python-using-selenium-and-beautiful-soup-8b72c186a058

https://towardsdatascience.com/social-network-analysis-of-related-hashtags-on-instagram-using-instacrawlr-46c397cb3dbe

**Endpoint to User Information**

https://i.instagram.com/api/v1/users/{USER_ID}/info/
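
As a small illustration, the endpoint template above can be filled in with the numeric id of a post owner. The helper name `user_info_url` and the constant `USER_INFO_TEMPLATE` are hypothetical, and the sketch only builds the URL. Whether the endpoint needs extra headers or a logged-in session is not covered here.

```python
# Template copied from the endpoint above; {USER_ID} is the placeholder to fill
USER_INFO_TEMPLATE = 'https://i.instagram.com/api/v1/users/{USER_ID}/info/'


def user_info_url(user_id):
    """Return the user-information URL for a user id (hypothetical helper)."""
    return USER_INFO_TEMPLATE.format(USER_ID=user_id)


# example id; any owner id found in a post can be used the same way
print(user_info_url('1100159559'))
```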

# Requesting Data

In [1]:
# Instagram tag search base URL prefix
tagurl_prefix = 'https://www.instagram.com/explore/tags/'

In [3]:
# suffix to append to tag request url to retrieve data in JSON format
tagurl_suffix = '/?__a=1'

In [4]:
# parameter appended to the URL to request the next set of posts after a previous set
tagurl_endcursor = '&max_id='

In [5]:
# generic media post prefix (concatenate with the media shortcode to view the post)
posturl_prefix = 'https://www.instagram.com/p/'

In [6]:
# target initial tags
tags = ['bolsonaro', 'haddad', 'dilma', 'ciro', 'guedes', 'moro']

In [7]:
# target URL for the initial test
tagurl = tagurl_prefix + tags[0] + tagurl_suffix

In [8]:
# checking target url
tagurl

'https://www.instagram.com/explore/tags/bolsonaro/?__a=1'
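
The string concatenation above can also be wrapped in one helper, which makes it harder to forget the suffix or to append an empty cursor. This is only a sketch: the function name `build_tag_url` and the upper-case constants are hypothetical, and percent-encoding the tag with `urllib.parse.quote` is an added precaution for tags with non-ASCII characters, not something the notebook relies on.

```python
from urllib.parse import quote

# same values as the notebook variables, repeated so the sketch is self-contained
TAGURL_PREFIX = 'https://www.instagram.com/explore/tags/'
TAGURL_SUFFIX = '/?__a=1'
TAGURL_ENDCURSOR = '&max_id='


def build_tag_url(tag, end_cursor=''):
    """Build the JSON URL for a tag; add the cursor only when one is given."""
    url = TAGURL_PREFIX + quote(tag, safe='') + TAGURL_SUFFIX
    if end_cursor:
        # the cursor is appended unchanged, as the notebook does
        url += TAGURL_ENDCURSOR + end_cursor
    return url


print(build_tag_url('bolsonaro'))
```

For plain ASCII tags such as the ones in `tags`, the result is identical to `tagurl_prefix + tag + tagurl_suffix`.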

In [8]:
# needed module
import requests

In [9]:
# requesting JSON information
json_info = requests.get(tagurl).json()

In [10]:
# retrieving a list of posts
posts_list = json_info['graphql']['hashtag']['edge_hashtag_to_media']['edges']

In [11]:
# checking length of the list
len(posts_list)

65

In [12]:
# checking details about one media
posts_list[0]

{'node': {'__typename': 'GraphImage',
  'accessibility_caption': 'Image may contain: one or more people and text',
  'comments_disabled': False,
  'dimensions': {'height': 1080, 'width': 1080},
  'display_url': 'https://instagram.fnat1-1.fna.fbcdn.net/vp/2230dc04fde10320fb8bac0be9f26169/5DACCDC1/t51.2885-15/e35/s1080x1080/65137552_151111176028672_2662746227723959449_n.jpg?_nc_ht=instagram.fnat1-1.fna.fbcdn.net',
  'edge_liked_by': {'count': 1},
  'edge_media_preview_like': {'count': 1},
  'edge_media_to_caption': {'edges': [{'node': {'text': 'Câmara municipal do Rio livra Marcelo Crivella de Impeachment.\n🇧🇷 SIGA @PELONOVOBR'}}]},
  'edge_media_to_comment': {'count': 1},
  'id': '2074411265925452949',
  'is_video': False,
  'owner': {'id': '1100159559'},
  'shortcode': 'BzJyg4_jhSV',
  'taken_at_timestamp': 1561509114,
  'thumbnail_resources': [{'config_height': 150,
    'config_width': 150,
    'src': 'https://instagram.fnat1-1.fna.fbcdn.net/vp/cbd76321a28c832b7169cde3edba47ed/5D8A16D

This JSON data contains all the information relevant to the analysis, including:

- shortcode of the post (in case the post needs to be viewed)
- text (at edge_media_to_caption > edges[0] > node > text)
- owner (only the user id appears in this response, at owner > id)

In [13]:
# list of media dictionaries (filtered and processed information)
posts = []

for post in posts_list:
  
    node = post['node']
  
    id_post = node['id']
  
    id_owner = node['owner']['id']
  
    edges = node['edge_media_to_caption']['edges']
  
    shortcode = node['shortcode']
  
    # not every post has a caption
    text = edges[0]['node']['text'].replace('\n','') if len(edges) else ''
  
    post_url =  posturl_prefix + shortcode + '/'
  
    post_dict = {
        'id_post': id_post,
        'id_owner': id_owner,
        'shortcode': shortcode,
        'text': text,
        'post_url': post_url
    }
  
    posts.append( post_dict )

posts[0:3]

[{'id_owner': '1100159559',
  'id_post': '2074411265925452949',
  'post_url': 'https://www.instagram.com/p/BzJyg4_jhSV/',
  'shortcode': 'BzJyg4_jhSV',
  'text': 'Câmara municipal do Rio livra Marcelo Crivella de Impeachment.🇧🇷 SIGA @PELONOVOBR'},
 {'id_owner': '3545180235',
  'id_post': '2074411220988297909',
  'post_url': 'https://www.instagram.com/p/BzJygPJF4K1/',
  'shortcode': 'BzJygPJF4K1',
  'text': 'O presidente Jair Bolsonaro (PSL) voltou atrás e decidiu\xa0revogar o decreto\xa0que facilitou o porte de armas de fogo após derrota no Senado. Em reunião com senadores na tarde desta terça-feira (25), o ministro da Casa Civil, Onyx Lorenzoni, anunciou a decisão..Em maio, Bolsonaro editou dois decretos sobre posse e porte de armas de fogo e uso de munições. O pacote de mudanças foi alvo de críticas. Na semana passada o plenário do Senado aprovou parecer da Comissão de Constituição e Justiça (CCJ) que pede a suspensão dos decretos. O parecer seguiu para análise da Câmara dos Deputado

**Note**

We have a valid list of information to work with.

The goal now is to create a function that retrieves a large number of posts by following each page of results to the next one, that is, a **snowballing effect**.

**Warning**

Some attribute names in the post dictionary have changed.

This does not affect the earlier data analysis done in the next notebooks, which use the file `data.json` generated on 2019-06-11.
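
How `data.json` was written is not shown in this notebook. As a hedged sketch, a list of post dictionaries could be saved and reloaded with the standard `json` module as below. The helper names `save_posts` and `load_posts` are hypothetical, and the example writes to a temporary directory so that no existing file is overwritten.

```python
import json
import os
import tempfile


def save_posts(posts, path):
    """Write a list of post dictionaries to a JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps accented characters readable in the file
        json.dump(posts, f, ensure_ascii=False, indent=2)


def load_posts(path):
    """Read a list of post dictionaries back from a JSON file."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


# one post in the same shape as the dictionaries built above
example_posts = [{
    'id_post': '2074411265925452949',
    'id_owner': '1100159559',
    'shortcode': 'BzJyg4_jhSV',
    'text': 'Câmara municipal do Rio livra Marcelo Crivella de Impeachment.',
    'post_url': 'https://www.instagram.com/p/BzJyg4_jhSV/',
}]

example_path = os.path.join(tempfile.mkdtemp(), 'data.json')
save_posts(example_posts, example_path)
restored_posts = load_posts(example_path)
```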

# Snowballing Effect

In [14]:
# checking the end_cursor variable to iterate the search
json_info['graphql']['hashtag']['edge_hashtag_to_media']['page_info']['end_cursor']

'QVFDSk9uQ2ZJSTZVMEhhRG1zcWY5SzZJOGJqbDZkNXRCZVUxSG1ZYl9SRENYSjhOdmpBR0FBbWtSN1BZQ3M0SFRyM1BTdkdDblNHMzNNeGFvaVJzVlVmaA=='
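
A small helper can read this cursor and report when there is nothing left to fetch. The sketch below makes two assumptions that the output above does not confirm: that `page_info` may also carry a `has_next_page` flag, and that `end_cursor` is empty or `None` on the last page. The name `next_cursor` is hypothetical.

```python
def next_cursor(json_info):
    """Return the cursor for the next page, or None when there is no next page."""
    page_info = json_info['graphql']['hashtag']['edge_hashtag_to_media']['page_info']
    # 'has_next_page' is an assumed field; treat it as True when it is absent
    if not page_info.get('has_next_page', True):
        return None
    # an empty or missing cursor is also treated as "no next page"
    return page_info.get('end_cursor') or None


# minimal fake response with only the keys the helper reads
fake_page = {'graphql': {'hashtag': {'edge_hashtag_to_media': {
    'page_info': {'has_next_page': True, 'end_cursor': 'abc=='}}}}}

print(next_cursor(fake_page))
```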

In [15]:
def json2posts(json_info):
    """Convert one page of tag JSON into a list of simplified post dictionaries."""

    posts_list = json_info['graphql']['hashtag']['edge_hashtag_to_media']['edges']

    posts_dicts = []

    for post in posts_list:

        node = post['node']

        id_post = node['id']

        id_owner = node['owner']['id']

        shortcode = node['shortcode']

        edges = node['edge_media_to_caption']['edges']
        
        # not every post has a caption
        text = edges[0]['node']['text'].replace('\n','') if len(edges) else ''

        post_url = posturl_prefix + shortcode + '/'

        post_dict = {
            'id_post': id_post,
            'id_owner': id_owner,
            'shortcode': shortcode,
            'text': text,
            'post_url': post_url,
        }
    
        posts_dicts.append( post_dict )
    
    return posts_dicts

In [16]:
import time

def snowball(url, deep=1, end_cursor='', count=0, showurl=False, 
               sleep=0, forever=False, progress=False, pause=60 ):
  
    count = count + 1

    request_url = url + tagurl_endcursor + end_cursor

    if showurl :

        print(request_url)

    else:

        if progress :

            print( count )
            # if count == 1 :
            #  print( '*' * (deep-1) )
            # else:
            #  print( '*', end='' )

    # TODO: wrap the request in a try-except block
    json_info = requests.get( request_url ).json()

    end_cursor = json_info['graphql']['hashtag']['edge_hashtag_to_media']['page_info']['end_cursor']

    posts = json2posts( json_info )

    time.sleep(sleep)

    if count < deep :

        try:

            posts += snowball(
                url=url,
                deep=deep,
                end_cursor=end_cursor,
                count=count,
                showurl=showurl,
                sleep=sleep,
                forever=forever,
                progress=progress,
                pause=pause)

        except Exception:

            if forever:

                print( 'Fail, retrying in ' + str(pause) + ' seconds' )

                time.sleep(pause)

                # retry the same page once; a second failure propagates to the caller
                posts += snowball(
                    url=url,
                    deep=deep,
                    end_cursor=end_cursor,
                    count=count,
                    showurl=showurl,
                    sleep=sleep,
                    forever=forever,
                    progress=progress,
                    pause=pause)

            else:

                print( 'Fail, ' + str(count) + ' requests done' )

    return posts
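
The recursive function above keeps one stack frame per requested page. As an alternative sketch (not the approach used in this notebook), the same cursor-following idea can be written as a loop. The name `snowball_iter` is hypothetical, and so are its `fetch_json`, `parse_posts` and `wait` parameters. They are passed in as callables so that the loop can be tried without touching the network. This version simply stops at the first failed request and returns what it has collected.

```python
import time


def snowball_iter(fetch_json, parse_posts, url, deep=1, sleep=0, wait=time.sleep):
    """Collect posts from up to `deep` pages, following end_cursor in a loop.

    fetch_json:  callable that takes a URL and returns the decoded JSON dict
    parse_posts: callable that turns one JSON dict into a list of posts
    """
    posts = []
    end_cursor = ''
    for _ in range(deep):
        request_url = url + '&max_id=' + end_cursor
        try:
            json_info = fetch_json(request_url)
            page_info = json_info['graphql']['hashtag']['edge_hashtag_to_media']['page_info']
        except Exception:
            # stop at the first failure and keep what was collected so far
            break
        posts += parse_posts(json_info)
        end_cursor = page_info['end_cursor']
        if not end_cursor:
            # assumed: an empty cursor means there are no more pages
            break
        wait(sleep)
    return posts
```

With the notebook's own names, a call could look like `snowball_iter(lambda u: requests.get(u).json(), json2posts, tagurl, deep=10, sleep=1)`.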

In [17]:
posts = snowball(tagurl, deep=1)

print( str(len(posts)) + ' posts retrieved' )

65 posts retrieved


**Note**

Great! The function works. Next, search deeper.

In [18]:
posts = snowball(tagurl, deep=3)

print( str(len(posts)) + ' posts retrieved' )

195 posts retrieved


**Note**

Great! The snowballing effect works. Next, go deeper and test the limits!

In [19]:
posts = snowball(tagurl, deep=10)

print( str(len(posts)) + ' posts retrieved' )

640 posts retrieved


**Note**

Some requests can take a while. The `showurl` option prints each requested URL so that progress can be checked.

In [20]:
posts = snowball(tagurl, deep=10, showurl=True)

print( str(len(posts)) + ' posts retrieved' )

https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFDdllDM1IxaU9fNXlCSUlzYTB1aGR2TkhlUXJidVk3NzhmUHRsdnVuUWo4aGE5Z0kzeDlGbFhfSHBDU092c3R3VHQxazVGVXBtUEs5c2oxcU5VazZQRQ==
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFEU2ZXdGgtWFVHaTBYWk5BckhqVmVCZlpsR3FvNEdJOFlPU1BiSlhHcEtHTGFzMExucktKU2xoVS04VXdkYXdzS1lqWWpObUpOX1E5QkxsOUVJdGVPTw==
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFCdk9rSUJwYlVpU281ZUVaelotekdrVVRHX1g0NkM2bHdTMkJQSnktVE42djdFLXdGZFpXbUR5Vkc2SHU3b0YwTGkzdW5IZzcxOWNuTUxxcjcxNUdJSA==
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFDSDM2UGRieEhPZGFIZG83Rm5rMnp4cHZiZDBVU0NaUU1MN2VnLU1sMk1VUnQyNmxrX3ZDaXVndUNSV1EzMUtRVGswWVUwWWpwTkllVkxublRkN2N5bw==
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFDUmtQYWxDT2xpT3RUUjR4QTd3ejZGWWdaYXl1Z0R1ZnJ0dHN6THFoN3lNYTc3S0s0VWxmVC02SWxtbEo3LWw3V0hrYkIxczZFRzNhQnlYVms2RnlIVQ==
https://www.inst

**Note**

Great! But Instagram may block requests when they arrive too frequently.

The `sleep` parameter adds a delay between requests to lower the request frequency.

In [21]:
%%time

posts = snowball(tagurl, deep=20, sleep=1, showurl=True)

print( str(len(posts)) + ' posts retrieved' )

https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFBVFY0bklFZHNpWGV3Ul9jWXlyWmo4dEVBTk9qcnBFX2tXMWpXdVVhT2ktVWV3akg1WURTekRqOUNHcDQ0MVVJcFVYckpYU2RXUXR1NDhoejFEaDNWeA==
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFEMlZaMDlYYW1jMkJEdFRUdERCcVJ0NWlsRDBtZEotS2JmeVJrc2FZY2VmOFo1WnlLaTRKXzlfaldmMGRjU0VWRktMdm5sYy1CV3d4WWVfTHQzakJSSA==
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFBZmlMQTNhUmJLUUM5dmdfRnk1dDJIMEVGX1NMNHdGVzd5R0VNQ09TdHpfeVJIMmVlckRTdVRGT3lVYTBLblpqS0xUTXl6U21zQnJENC1mUGxtaW0tSQ==
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFBNTh2TFJmc3hwekE3SjFvblN4dGRIQllmQnN0a0NidUdoazlESnBXUEE1bW9BUHJRSlFxcHNXeTlkVFJNbFFkVHY1dWNwZXltS0JnckRfdlVQTS1SVQ==
https://www.instagram.com/explore/tags/bolsonaro/?__a=1&max_id=QVFBVndPaHlxeGQ3dnFlN19Oa2lOWXEzN05Cdms4cDV4SFdTNGxDaDhXSnRsLUljbUhRLVJ6R1VkZnVOcGxFTW1JS1AweFl5MDQ2cjVMVjRNbFl6M3lFcA==
https://www.inst

**Note**

Great! Going deeper and checking the limits!

In [22]:
%%time

posts = snowball(tagurl, deep=50, sleep=1, progress=True)

print( str(len(posts)) + ' posts retrieved' )

1
2
3
4
5
6
7
8
9
10
11
12
13
Fail, 12 requests done
769 posts retrieved
CPU times: user 1.07 s, sys: 93.4 ms, total: 1.16 s
Wall time: 32.8 s


**Note**

Good, but the snowballing stops at the first blocked request.

The `forever` parameter makes the function wait `pause` seconds and try again instead of stopping.
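
A fixed pause is the simplest choice. A common alternative, shown here only as a sketch and not used in this notebook, is to cap the number of retries and lengthen the pause after each failure (exponential backoff). The name `retry_with_backoff` and all of its parameters are hypothetical.

```python
import time


def retry_with_backoff(action, max_retries=5, first_pause=1.0, factor=2.0,
                       wait=time.sleep):
    """Call `action()` until it succeeds, pausing longer after each failure."""
    pause = first_pause
    for attempt in range(max_retries + 1):
        try:
            return action()
        except Exception:
            if attempt == max_retries:
                # out of retries: let the caller see the original error
                raise
            wait(pause)
            pause *= factor
```

With the notebook's names, one request could be wrapped as `retry_with_backoff(lambda: requests.get(request_url).json())`.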

In [23]:
%%time

posts = snowball(tagurl, deep=50, sleep=1, progress=True, forever=True)

print( str(len(posts)) + ' posts retrieved' )

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
Fail, retrying in 60 seconds
49
50
3176 posts retrieved
CPU times: user 4.17 s, sys: 280 ms, total: 4.45 s
Wall time: 3min 17s


**Note**

Great! Finally, one last test.

In [24]:
%%time

posts = snowball(tagurl, deep=100, sleep=0.5, progress=True, forever=True, pause=30)

print( str(len(posts)) + ' posts retrieved' )

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
Fail, retrying in 30 seconds
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
Fail, retrying in 30 seconds
92
Fail, retrying in 30 seconds
91
92
93
94
95
96
97
98
99
100
6262 posts retrieved
CPU times: user 8.63 s, sys: 629 ms, total: 9.26 s
Wall time: 5min 14s


**Note**

That is a good time for retrieving such a large number of posts!
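
To put the timings in perspective, the post counts and wall times printed by the three timed runs can be turned into a rough posts-per-second rate. The numbers below are copied from those outputs; the names `runs` and `posts_per_second` are only for illustration.

```python
# (posts retrieved, wall time in seconds) copied from the three timed runs
runs = {
    'deep=50 (stopped early)': (769, 32.8),
    'deep=50, forever': (3176, 3 * 60 + 17),
    'deep=100, forever': (6262, 5 * 60 + 14),
}


def posts_per_second(posts, seconds):
    """Average retrieval rate, including sleeps and retry pauses."""
    return posts / seconds


rates = {name: round(posts_per_second(*run), 1) for name, run in runs.items()}
print(rates)
```

The rates come out at roughly 23, 16 and 20 posts per second. The two `forever` runs are slower per post partly because their wall time includes the 60-second and 30-second retry pauses.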