<a href="https://colab.research.google.com/github/eklz/Prediction-de-likes-Instagram/blob/test/InstaScrapping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Instagram Scraping

The following code lets us extract data from Instagram to build our database. It relies on Selenium and BeautifulSoup.

In [0]:
!pip install selenium
!apt-get update 
!apt install chromium-chromedriver  # installs chromedriver in the current Colab session; Selenium needs it

In [0]:
from selenium import webdriver
import datetime as dt
from bs4 import BeautifulSoup as bs
import json
import requests
import os

In [5]:
from google.colab import drive

drive.mount('/content/gdrive')
root_path = '/content/gdrive/My Drive/InstaScrap/'
if not os.path.exists(root_path):  # create the directory if it doesn't exist
  os.makedirs(root_path)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


## The User class

`loadData(self)` fetches the JSON embedded in the JavaScript of a given Instagram profile page. This JSON contains all the user's information, as well as information about the last 12 posts.

`loadPictures(self)` goes through the last 12 posts and returns only the single pictures (no videos or groups of pictures), with their associated information.

`createUser(self)` returns a dictionary with all the information needed: `{'name', 'nbPubli', 'nbFollowers', 'nbFollow', 'typeBusiness', 'lastPosts'}`
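
To make the parsing step concrete, here is a minimal sketch of what `loadData` and `loadPictures` do once the page source is available. The `script_text` string is a made-up, heavily trimmed stand-in for the real `window._sharedData` payload, and the helper names `extract_user` and `single_pictures` are hypothetical; only the standard library is used, so it runs offline.

```python
import json

# Hypothetical, heavily trimmed stand-in for the text of the profile page's
# <script> tag; the real payload has many more fields.
script_text = (
    'window._sharedData = {"entry_data": {"ProfilePage": [{"graphql": {"user": {'
    '"edge_followed_by": {"count": 1000}, '
    '"edge_owner_to_timeline_media": {"count": 2, "edges": ['
    '{"node": {"__typename": "GraphImage", "shortcode": "abc"}}, '
    '{"node": {"__typename": "GraphVideo", "shortcode": "def"}}'
    ']}}}}]}};'
)

def extract_user(script_text):
    """Strip the JavaScript assignment and return the 'user' dictionary."""
    # Keep everything after the first ' = ' and drop the trailing semicolon.
    page_json = script_text.split(' = ', 1)[1].rstrip(';')
    page = json.loads(page_json)
    return page['entry_data']['ProfilePage'][0]['graphql']['user']

def single_pictures(user):
    """Return the nodes of posts that are single pictures (GraphImage)."""
    edges = user['edge_owner_to_timeline_media']['edges']
    return [e['node'] for e in edges if e['node']['__typename'] == 'GraphImage']

user = extract_user(script_text)
print(user['edge_followed_by']['count'])                # 1000
print([n['shortcode'] for n in single_pictures(user)])  # ['abc']
```

The video post is dropped by the `__typename` filter, which is the same test the class applies to each post.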



In [0]:
class User():

  def __init__(self, username):

    self.username=username
    self.URL='http://www.instagram.com/'+username+'/?hl=fr'
    self.data=self.loadData()
    self.nbPubli= self.data['edge_owner_to_timeline_media']['count']
    self.followed_by=self.data['edge_followed_by']['count']
    self.follow=self.data['edge_follow']['count']
    self.business=self.data['business_category_name']

  def loadData(self):

    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')  # these flags are needed to run in Colab

    browser = webdriver.Chrome(options=chrome_options)  # chromedriver is looked up on the PATH
    browser.get(self.URL)
    source = browser.page_source
    browser.quit()

    data = bs(source, 'html.parser')
    body = data.find('body')
    # the `t and ...` guard skips <script> tags that have no inline text
    script = body.find('script', string=lambda t: t and t.startswith('window._sharedData'))
    page_json = script.text.split(' = ', 1)[1].rstrip(';')
    data = json.loads(page_json)
    data = data['entry_data']['ProfilePage'][0]['graphql']['user']
    return data

  def loadPictures(self):
    
    posts=[]
    for publi in self.data['edge_owner_to_timeline_media']['edges']:
      publi=publi['node']

      if publi['__typename'] == 'GraphImage':

        link=publi['shortcode']
        analyse=publi['accessibility_caption']
        comments=publi['edge_media_to_comment']['count']
        likes=publi['edge_liked_by']['count']
        try:
          location_id=publi['location']['id']
          location_name=publi['location']['name']
        except (KeyError, TypeError):  # the post has no location
          location_id=None
          location_name=None
        timestamp=publi['taken_at_timestamp']
        time=dt.datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d %H:%M:%S')
        display_url=publi['display_url']

        carac={'link':link, 'likes':likes, 'comments':comments, 
               'location_id':location_id, 'location_name':location_name, 
               'time':time , 'analyse':analyse, 'display_url':display_url}
        posts.append(carac)

    return posts

  def createUser(self):

    dic={'name':self.username, 'nbPubli': self.nbPubli, 
        'nbFollowers':self.followed_by, 'nbFollow':self.follow, 
        'typeBusiness':self.business,'lastPosts':self.loadPictures()}

    return dic




## Dataset

The file `users.txt` contains the list of the 50 biggest Instagram accounts, extracted from Wikipedia. The dataset is saved as JSON in `dataset.txt`.

For my application I use a much larger dataset of more than 1,500 Instagram accounts, which gives me more than 12,000 posts.
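
As an illustration of the two file formats, the sketch below does a round trip in a temporary directory: it reads tab-separated usernames, builds placeholder records, saves them as JSON and reloads them to count the posts. The accounts and posts are made up, the helpers `read_usernames` and `count_posts` are hypothetical, and it assumes the posts are stored under the key `lastPosts`, as the download function at the end of this notebook expects.

```python
import json
import os
import tempfile

def read_usernames(path):
    """Read tab-separated usernames from the first line of a text file."""
    with open(path, 'r') as f:
        line = f.readline().strip()  # drop the trailing newline
    return [name for name in line.split('\t') if name]

def count_posts(dataset):
    """Total number of posts across all user dictionaries."""
    return sum(len(user.get('lastPosts', [])) for user in dataset)

# Round trip in a temporary directory, with made-up accounts and posts.
with tempfile.TemporaryDirectory() as tmp:
    users_path = os.path.join(tmp, 'users.txt')
    with open(users_path, 'w') as f:
        f.write('alice\tbob\n')

    names = read_usernames(users_path)
    dataset = [{'name': n, 'lastPosts': [{'link': n + '_1'}]} for n in names]

    dataset_path = os.path.join(tmp, 'dataset.txt')
    with open(dataset_path, 'w') as f:
        json.dump(dataset, f)
    with open(dataset_path, 'r') as f:
        reloaded = json.load(f)

print(names)                  # ['alice', 'bob']
print(count_posts(reloaded))  # 2
```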

In [0]:
with open(root_path+'users.txt','r') as f:
  text=f.readline().strip()  # usernames are tab-separated on the first line
listUsers=text.split('\t')

In [0]:
dataset=[]
for user in listUsers:
  infl=User(user)
  dataset.append(infl.createUser())

In [0]:
with open(root_path+'dataset.txt','w') as f:
  json.dump(dataset,f)

## Download the pictures associated with your dataset

With the dictionaries produced by the `User` class, it is very easy to download all of the associated pictures.

In [0]:
def downloadPictures(dataset):
  for user in dataset:
    for post in user['lastPosts']:
      r = requests.get(post['display_url'])  # fetch the image bytes
      # save as <username>_<shortcode>.jpg in the Drive folder
      with open(root_path + user['name'] + '_' + post['link'] + '.jpg', 'wb') as f:
        f.write(r.content)
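
For larger datasets, a slightly more defensive variant may be useful. The sketch below is one possible approach, not the code used to build the dataset: it skips files that already exist and ignores posts whose download fails. The names `download_pictures`, `fetch_bytes` and the `fetch` parameter are hypothetical; the default fetcher uses `urllib.request` from the standard library instead of `requests`, and the demonstration injects a fake fetcher so that it runs offline.

```python
import os
import tempfile
import urllib.request

def fetch_bytes(url, timeout=10):
    """Download a URL with the standard library and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()

def download_pictures(dataset, out_dir, fetch=fetch_bytes):
    """Save each post's picture as <name>_<link>.jpg; return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for user in dataset:
        for post in user['lastPosts']:
            path = os.path.join(out_dir, user['name'] + '_' + post['link'] + '.jpg')
            if os.path.exists(path):  # already downloaded on a previous run
                continue
            try:
                content = fetch(post['display_url'])
            except OSError:  # urllib network errors derive from OSError
                continue
            with open(path, 'wb') as f:
                f.write(content)
            written.append(path)
    return written

# Offline demonstration with a fake fetcher and a made-up dataset.
demo = [{'name': 'alice',
         'lastPosts': [{'link': 'abc', 'display_url': 'https://example.com/a.jpg'}]}]
with tempfile.TemporaryDirectory() as tmp:
    first = download_pictures(demo, tmp, fetch=lambda url: b'fake image bytes')
    second = download_pictures(demo, tmp, fetch=lambda url: b'fake image bytes')
print(len(first), len(second))  # 1 0
```

The second call writes nothing because the file from the first call is still present, which makes an interrupted run cheap to resume.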