# Lab 5 - Instagram API

In this lab, you will learn how to:
* Download content from [Instagram](https://about.instagram.com/)
* Create a pandas DataFrame from Instagram metadata files
* Identify top hashtags and words

This lab is written by Jisun AN (jisunan@smu.edu.sg) and Michelle KAN (michellekan@smu.edu.sg)


In [None]:
## Install instaloader library
!pip install instaloader

The <b>[Instaloader](https://instaloader.github.io/)</b> module is an open-source Python package for downloading pictures and videos from Instagram, along with their captions and metadata.  

<b>Key Features of Instaloader:</b>
- downloads public and private profiles, hashtags, user stories, feeds and saved media,
- downloads comments, geotags and captions of each post,
- automatically detects profile name changes and renames the target directory accordingly,
- allows fine-grained customization of filters and where to store downloaded media,
- automatically resumes previously-interrupted download iterations

To access the Instagram content, let's first initialise the Instaloader class and get authenticated.<br>
(Note: You will need to sign up for an Instagram account if you do not have one)

In [None]:
# Import the module
import instaloader

# Create an instance of Instaloader class
loader = instaloader.Instaloader(compress_json=False)

# Enter your Instagram handle and password
ACCOUNT = ''
PASSWORD = ''

# Upon successful authentication, you should see a message saying Authentication OK.
# Otherwise, check your login details
try:
    loader.login(ACCOUNT, PASSWORD)
    print("Authentication OK")
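except Exception as e:
    # Show the reason so that wrong credentials can be told apart from other failures
    print("Error during authentication:", e)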


## 1. Download Instagram Posts

Instaloader provides the __Post__ structure, which represents a picture, video or sidecar (set of multiple pictures/videos) posted in a user’s profile. Instaloader provides the following methods to download and iterate over these Posts:
<br>

a) <b>Hashtags</b> - triggers a search for posts associated with defined hashtags<br>
b) <b>Profiles</b> - provides methods for accessing profile information including user posts, number of followers etc.  
c) <b>Locations</b> - triggers a search for posts from a certain location.  

See the [full list](https://instaloader.github.io/module/structures.html#) of Instaloader structures.

### 1.1. Download by #hashtags

The Instaloader `Hashtag` class retrieves Instagram posts associated with a given hashtag (without the preceding #). Its `get_posts` method returns a generator that yields `Post` structures.

To download the posts, we iterate over the generator and pass each post to the `.download_post()` method of our core Instaloader() object.

The following code retrieves posts associated with the hashtag stored in `HASHTAG`. This can take a while when very large profiles and media files are retrieved. <br>

<img align="left" src="https://docs.google.com/uc?id=1nhCz5zbFKKD4KD-kPxrMIvGBbgET_QZG" width="25" style="vertical-align:middle;margin:0px 10px"/><b><i>We will stop after getting 5 posts in this example, otherwise it will keep collecting posts.</i></b>
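One way to cap the number of items taken from a generator is `itertools.islice` from the standard library. The sketch below uses a stand-in generator, `fake_posts` (a hypothetical name, not part of Instaloader), so that it runs without a network connection; with Instaloader you would pass `hashtag.get_posts()` to `islice` instead.

```python
from itertools import islice

def fake_posts():
    """Hypothetical stand-in for hashtag.get_posts(): yields post labels forever."""
    n = 0
    while True:
        n += 1
        yield f"post-{n}"

# islice stops after 5 items, so the endless generator is never exhausted
first_five = list(islice(fake_posts(), 5))
print(first_five)
```

Compared with a manual counter, `islice` keeps the stopping rule in one place and also ends cleanly when the generator has fewer items than requested.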

In [None]:
HASHTAG = "coranavirus"

# Create a Hashtag instance from a given hashtag name
hashtag = instaloader.Hashtag.from_name(loader.context, HASHTAG)

# Load posts with defined hashtag into a generator object
loaded_posts = hashtag.get_posts()

# To download the posts, we iterate over the generator object
for cnt_post, post in enumerate(loaded_posts):
    if cnt_post >= 5:
        break
    try:  
        loader.download_post(post, target="#"+hashtag.name)
    except:
        print("\nError in downloading. Process halted.") 
        break

<img align="left" src="https://docs.google.com/uc?id=1IegynNxVgb3GxQoXFD_HPJMRJcx8Rlmk" width="40" style="vertical-align:middle;margin:0px 6px"/>You should now observe that a folder named after the hashtag, prefixed with <b>#</b>, has been created in the same location as the current notebook. It contains the downloaded post content, including but not limited to:
- `.txt` files with the caption of every post
- `.json` files of metadata for every post (these would be `.json.xz` compressed files if `compress_json` were left at its default of `True`)
- `.json` files of comments for every post
- `.jpg`/`.mp4` media files for every post, with the creation date being the actual published date
- `.txt` files with the location’s name and a Google Maps link when available

Your files may look similar to the following, where each set of related Instagram post files shares the same {date_utc} filename prefix:<br>
<img align="middle" src="https://docs.google.com/uc?id=1_Rb5RXNXm0XDLMlZ717f6YHjjy2TiKA5" width="400" style="vertical-align:middle;margin:0px 10px"/>
<br>
<img align="left" src="https://docs.google.com/uc?id=1nhCz5zbFKKD4KD-kPxrMIvGBbgET_QZG" width="40" style="vertical-align:middle;margin:0px 10px"/><b>Important:</b> You should respect the author’s rights when you 
download copyrighted content. Do not use images/videos from Instagram for commercial purposes.
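Because related files share a filename prefix, they can be grouped per post with a few lines of Python. This is a minimal sketch: the filenames are made up for illustration, and it assumes every name contains the `_UTC` marker that appears in Instaloader's default file naming.

```python
from collections import defaultdict

# Illustrative filenames only; your folder will contain different timestamps
filenames = [
    "2020-03-15_08-30-00_UTC.json",
    "2020-03-15_08-30-00_UTC.txt",
    "2020-03-15_08-30-00_UTC.jpg",
    "2020-03-15_08-30-00_UTC_comments.json",
    "2020-03-16_11-05-42_UTC.json",
    "2020-03-16_11-05-42_UTC_1.jpg",
]

files_by_post = defaultdict(list)
for name in filenames:
    # Everything before "_UTC" is treated as the post's {date_utc} stamp
    stamp = name.split("_UTC")[0]
    files_by_post[stamp].append(name)

for stamp, files in files_by_post.items():
    print(stamp, len(files))
```

To try it on a real folder, replace the `filenames` list with the result of `os.listdir()` on that folder.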

#### Download Top Posts by Hashtags

It is often the case that one hashtag contains hundreds of posts. In that case, we would want to limit the number of posts downloaded.

In addition, we may want to download only the top posts of a hashtag. We can write a function that combines a for-loop with Instaloader's `.get_top_posts()` method. 

The following example retrieves the top 5 posts for the '#covid19' hashtag.

In [None]:
def top_posts_hashtag(hashtag_name, max_count):
    """
    A function that downloads top {max_count} posts from a hashtag
    """
    # Create a Hashtag instance from a given hashtag name
    hashtag = instaloader.Hashtag.from_name(loader.context, hashtag_name)
    # Get top posts in a generator
    posts = hashtag.get_top_posts()
    
    for i in range(max_count):
        try:
            # Download one post structure at a time using next(posts) function
            loader.download_post(next(posts), target=f'#{hashtag_name}_top_post')
        except:
            break # If there are any errors, we break out of the loop

In [None]:
HASHTAG = 'covid19'

# Calling the function to retrieve top 5 posts based on the hashtag defined
top_posts_hashtag(HASHTAG, 5)

You may observe that it takes time to download posts with very large profiles and media files. Can we choose or determine the content we wish to download?
<br>

#### Customising Content to be Downloaded

The __Instaloader__ class `instaloader.Instaloader()` takes in a list of <i>optional</i> but useful parameters. You can use these parameters to customise the content to be downloaded for each post by the Instaloader.

|Instaloader Parameters|Description|
|:---|:---|
|download_pictures<br>download_video_thumbnails<br>download_videos|Boolean variable whether to download image and video files. True by default.|
|download_geotags|Boolean variable whether to download geotags when available.<br>Geotags are stored as a text file with the location’s name and a Google Maps link. True by default.|
|download_comments|Boolean variable whether to download comments for each post. True by default.|
|save_metadata|Boolean variable whether to create a JSON file containing the metadata of each post. True by default.|
|compress_json|Boolean variable. JSON files are saved in xz-compressed format by default. <br>Set this parameter to False to save pretty-formatted JSON files instead. |
|post_metadata_txt_pattern|Template of caption text file for each Post. Caption textfile will not be created if set to empty string.|

Refer to the full list of parameters [here](https://instaloader.github.io/module/instaloader.html).
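If you keep `compress_json` at its default, the `.json.xz` files can still be read from Python with the standard-library `lzma` module. The sketch below first writes a small compressed file so that it is self-contained; the metadata content and the file name `sample.json.xz` are hypothetical.

```python
import json
import lzma
import os
import tempfile

# Hypothetical metadata, written here only so the example has a file to read
sample = {"node": {"owner": {"username": "example_user"}}}
path = os.path.join(tempfile.mkdtemp(), "sample.json.xz")
with lzma.open(path, "wt", encoding="utf-8") as f:
    json.dump(sample, f)

# Reading: lzma.open decompresses on the fly, json.load parses the text
with lzma.open(path, "rt", encoding="utf-8") as f:
    metadata = json.load(f)

print(metadata["node"]["owner"]["username"])
```

Only the second `with` block is needed when reading files that Instaloader has already written.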

So far we have only set `compress_json=False` when creating the Instaloader instance, so the default values of all other parameters were used for the '#covid19' hashtag post download. Let's now update some of the parameters to change the downloaded content:
- disable download of media content to speed up the download process
- disable download of comments file to speed up the download process
- download JSON metadata in pretty format instead of compressed format 
- disable download of caption text file since caption content can be retrieved from JSON metadata

In [None]:
import instaloader

# customise the parameter settings based on your download preference
loader = instaloader.Instaloader(download_pictures=False,
                            download_video_thumbnails=False,
                            download_videos=False,
                            download_comments=False,
                            compress_json=False,
                            post_metadata_txt_pattern="")

### Uncomment the following codes and enter your Instagram login details
### to authenticate if you have just restarted your notebook
# ACCOUNT = ''
# PASSWORD = ''
# try:
#     loader.login(ACCOUNT, PASSWORD)
#     print("Authentication OK")
# except:
#     print("Error during authentication")


You can customise the parameter settings based on your download preference. <br><b>Rerun the code that downloads Instagram posts by hashtag.</b> What difference do you observe?

### 1.2 Download by Profiles

Let's extract some valuable information from an Instagram profile using the `Profile` object. In this example, we will retrieve the profile of World Health Organization (WHO).

In [None]:
USERNAME='who' #instagram handle

# Create a Profile instance from the given username
profile = instaloader.Profile.from_username(loader.context, USERNAME)

print("Username: ", profile.username)
print("User ID: ", profile.userid)
print("Number of Posts: ", profile.mediacount)
print("Followers: ", profile.followers)
print("Followees: ", profile.followees)
print("Biography: ", profile.biography,profile.external_url)

Similarly, getting posts from any Instagram user profile requires iterating over the generator object using `.download_post()` method of our core Instaloader() object. 

The following example retrieves 5 Instagram posts by the World Health Organization (WHO).

In [None]:
def get_profile_posts(username, max_count):
    """
    A function that downloads {max_count} posts of an instagram profile
    """
    # Create a Profile instance from the given username
    profile = instaloader.Profile.from_username(loader.context, username)
    
    # Get the profile's posts in a generator
    posts = profile.get_posts()
    
    for i in range(max_count):
        try:
            # Download one post structure at a time using next(posts) function
            loader.download_post(next(posts), target=f"{profile.username}")
        except:
            print("Error in downloading post")
            break # If there are any errors, we break out of the loop

In [None]:
USERNAME = 'who' #World Health Organization

# Calling the function to retrieve 5 posts based on the USERNAME defined
get_profile_posts(USERNAME, 5)

Check that a folder named after the profile username (i.e. <b>'who'</b>) has been created in the same location as the current notebook and that it contains the downloaded metadata files. 

### 1.3 Download by Location

Let's collect Instagram data based on location. Instagram has its own ID for each location. For example, 363850430717323 is the ID of Universal Studios Singapore (read this blog post: [How to find an Instagram location ID](https://axentmedia.com/how-to-find-an-instagram-location-id/)). Once you know the location ID, you can use the following code, `loader.download_location(LOCATION_ID, max_count=5)`, to collect data based on location. 

In [None]:
# customise the parameter settings based on your download preference
# we will collect pictures, but will not compress json, will not get txt files
loader = instaloader.Instaloader(compress_json=False,
                            post_metadata_txt_pattern="")

# Enter your Instagram handle and password
ACCOUNT = ''
PASSWORD = ''

# Upon successful authentication, you should see a message saying Authentication OK.
# Otherwise, check your login details
try:
    loader.login(ACCOUNT, PASSWORD)
    print("Authentication OK")
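except Exception as e:
    # Show the reason so that wrong credentials can be told apart from other failures
    print("Error during authentication:", e)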


In [None]:
location_id = "363850430717323"  # Universal Studios Singapore 
my_max_count = 5 # you can set maximum number of posts to download from a location
loader.download_location(location_id, max_count=my_max_count)


## 2. Create a Pandas Dataframe From JSON Metadata Files

To analyse the downloaded Instagram post content (e.g. the text), it is recommended to convert the downloaded JSON metadata files into a dataframe. 

#### Enter your path where the json files are stored
(you can select any of the folders created by any of the above download process)

In [None]:
# 'mypath' variable can be changed to your local path or Google Drive path
mypath = "."

# folder name where JSON metadata files are stored e.g, who
folder_name='#coranavirus'

# set the path of JSON files
json_path = f'{mypath}/{folder_name}/'
json_path

#### Extract data from json

The following code retrieves the names of all JSON files stored in the specified folder and extracts content from the JSON structure of each file. We recommend opening a few of the JSON files to understand their structure. 
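The extraction code in this section relies on nested `try`/`except` blocks to cope with missing fields. As a hedged alternative, the sketch below reads the same keys with `dict.get` and an explicit check for an empty caption list. The `sample` dictionary is a hypothetical, heavily trimmed stand-in for a real metadata file, and `extract_post` is an illustrative helper name.

```python
from datetime import datetime, timezone

# Hypothetical, heavily trimmed post metadata using the same keys as this lab
sample = {
    "node": {
        "taken_at_timestamp": 1584261000,
        "owner": {"username": "example_user"},
        "edge_media_to_caption": {"edges": []},  # a post without a caption
    }
}

def extract_post(json_text):
    """Return the fields used in this lab, with '' for missing optional ones."""
    node = json_text["node"]
    edges = node.get("edge_media_to_caption", {}).get("edges", [])
    return {
        # tz=timezone.utc gives the same result on every machine
        "timestamp": datetime.fromtimestamp(node["taken_at_timestamp"], tz=timezone.utc),
        "username": node["owner"]["username"],
        "image_desc": node.get("accessibility_caption") or "",
        "text": edges[0]["node"]["text"] if edges else "",
    }

post = extract_post(sample)
print(post["timestamp"].isoformat(), post["username"], repr(post["text"]))
```

Note that this version converts the timestamp to UTC, whereas `datetime.datetime.fromtimestamp(unix_timestamp)` without a `tz` argument uses the local time zone of the machine running the notebook.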

In [None]:
import os, json
import pprint
import datetime

# retrieve all filenames with .json extension located in the folder that you have provided
json_files = [filename for filename in os.listdir(json_path) if filename.endswith('.json')]

# iterate through the list of JSON files
for js in json_files:
        
    # open and read each json file
    with open(os.path.join(json_path, js)) as json_file:
        json_text = json.load(json_file)
        # pprint.pprint(json_text)  # uncomment to inspect the full structure
        
        try:
            # extract Instagram post information
            unix_timestamp = json_text['node']['taken_at_timestamp']
            timestamp = datetime.datetime.fromtimestamp(unix_timestamp)

            username = json_text['node']['owner']['username']
            image_desc, text = '', ''
            
            try:
                image_desc = json_text['node']['accessibility_caption']
            except:
                pass
            try:
                text = json_text['node']['edge_media_to_caption']['edges'][0]['node']['text']
            except:
                pass
            
            print (f'Timestamp: {timestamp}')
            print (f'Username: @{username}')
            print (f'Image description: {image_desc}')
            print (f'Caption: {text}')
            print('='*80)
        except:
            continue

#### Create dataframe from json metadata files

Let's now write a function to convert the JSON metadata files found in the defined path into a dataframe:

In [None]:
import os, json
import pandas as pd
import datetime

def convert_json_to_df(path_to_json):

    # retrieve all filenames with .json extension located in path_to_json
    json_files = [filename for filename in os.listdir(path_to_json) if filename.endswith('.json')]
    
    ## initialise list to store post details
    post_list = []

    # iterate through each json file
    for js in json_files:
        
        # posts with comments have separate JSON files for the comments; we skip those
        if "comments" in js:
            continue
        
        # open and read each json file
        with open(os.path.join(path_to_json, js)) as json_file:
            json_text = json.load(json_file)

            # extract Instagram post information
            unix_timestamp = json_text['node']['taken_at_timestamp']
            timestamp = datetime.datetime.fromtimestamp(unix_timestamp)

            username = json_text['node']['owner']['username']
            image_desc, text = '', ''
            
            try:
                image_desc = json_text['node']['accessibility_caption']
            except:
                pass
            try:
                text = json_text['node']['edge_media_to_caption']['edges'][0]['node']['text']
            except:
                pass

            # append post_list with extracted post information
            post_list.append([timestamp, username, image_desc, text])
            
    # populate dataframe with the list of posts
    df = pd.DataFrame(data=post_list,columns=['timestamp','username','image_desc','text'])
    
    return df

In [None]:
# If you want to read all texts in the dataframe
pd.set_option('display.max_colwidth', 150)

In [None]:
# Call the function to combine and convert JSON files found in the json_path into a dataframe
df = convert_json_to_df(json_path)
df.head()

In [None]:
# Repeat for the location posts from section 1.3 (their folder name starts with '%')
json_path = "./%363850430717323"
df = convert_json_to_df(json_path)
df.head()

### Exercise 1

Using the `convert_json_to_df` function, add on the following details to the dataframe for each Instagram post:
- post id 
- full name of the post owner
- location name, if available
- number of likes of the post
- number of comments of the post

Hint: Open and refer to the downloaded JSON file structures. 
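To explore an unfamiliar JSON structure, it can help to list the paths of all nested keys. The helper below is a minimal sketch: `key_paths` is a hypothetical name, the `sample` fragment is made up, and lists inside the structure are deliberately not expanded.

```python
def key_paths(obj, prefix=""):
    """Yield dotted paths for every key in a nested dict (lists are not expanded)."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from key_paths(value, path)

# Hypothetical fragment; load one of your own files with json.load() instead
sample = {"node": {"owner": {"username": "example_user"}, "taken_at_timestamp": 1584261000}}

for path in key_paths(sample):
    print(path)
```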

In [None]:
## Enter your code below 





### Exercise 2 

1. Collect data from two or more locations. Your data will be stored in multiple folders now.
2. Write a function that reads data from multiple folders, extracts the following information (`postid, timestamp, username, user_fullname, location_name, nlikes, ncomments, image_desc, text`, the same fields extracted by `convert_json_to_df` in Exercise 1) and stores it in one list. You can name the function `convert_json_to_df_from_multiple_folders`; its input can be a list of input folders (`list_path_to_json`).
3. Using `convert_json_to_df_from_multiple_folders`, convert the JSON files in multiple folders into one df.
4. Compare the average number of likes of posts from different locations. Once you have identified the `nlikes` column in the df, you can use `mean()` to compute its average value.  


In [None]:
# 1-1. Collecting data from Location 1
location_id_1 = "???"
my_max_count = ???

# Write your code


In [None]:
# 1-2. Collecting data from Location 2
location_id_2 = "???"
my_max_count = ???

# Write your code


In [None]:
# 2. Function that reads data from multiple folders 

def convert_json_to_df_from_multiple_folders(list_path_to_json):

# Write your code
    
    return df



In [None]:
# 3. Convert JSON files in multiple folders into one df

# Write your code


In [None]:
# The line below shows how many posts were collected for each location name
df['location_name'].value_counts()

In [None]:
# 4. Compare the average likes of posts from different locations

# Write your code


## 3. Collect 100 Instagram posts & Do EDA

In [None]:
# We will collect jsons only without downloading picture or video
# If you receive the following error --"LoginRequiredException: Redirected to login page. Use --login."
# please run this cell. 

import instaloader

# Create the Instaloader instance with the download settings first,
# so that the login below applies to the instance we actually use
loader = instaloader.Instaloader(download_pictures=False,
                            download_video_thumbnails=False,
                            download_videos=False,
                            download_comments=False,
                            compress_json=False,
                            post_metadata_txt_pattern="")

# Enter your Instagram handle and password
ACCOUNT = ''
PASSWORD = ''

# Upon successful authentication, you should see a message saying Authentication OK.
# Otherwise, check your login details
try:
    loader.login(ACCOUNT, PASSWORD)
    print("Authentication OK")
except Exception as e:
    print("Error during authentication:", e)


In [None]:
# We will collect 100 posts. When you collect many posts, it'd be better to send queries slowly. 
# We will sleep for 3 secs in between queries using `time` library. 

import time

def posts_hashtag_with_sleep(hashtag_name, max_count):
    """
    A function that downloads {max_count} posts from a hashtag
    """
    # Create a Hashtag instance from a given hashtag name
    hashtag = instaloader.Hashtag.from_name(loader.context, hashtag_name)
    # Get posts in a generator
    posts = hashtag.get_posts()
    
    for i in range(max_count):
        try:
            # Download one post structure at a time using next(posts) function
            loader.download_post(next(posts), target=f'#{hashtag_name}')
            time.sleep(3)
        except:
            break # If there are any errors, we break out of the loop
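The fixed three-second pause can also be wrapped in a small reusable generator that adds a random extra delay between items. This is a sketch under assumptions: `throttled` is a hypothetical helper, and whether a randomised delay lowers the chance of being rate-limited is an assumption rather than documented Instagram behaviour.

```python
import random
import time

def throttled(items, min_delay=3.0, jitter=2.0):
    """Yield items one by one, sleeping a randomised interval between them."""
    for index, item in enumerate(items):
        if index > 0:  # no need to wait before the very first item
            time.sleep(min_delay + random.uniform(0, jitter))
        yield item

# Tiny delays here so the demonstration finishes quickly
for item in throttled(["a", "b", "c"], min_delay=0.01, jitter=0.01):
    print(item)
```

With Instaloader, the list would be replaced by a post generator such as `hashtag.get_posts()`, and each yielded post passed to `loader.download_post()`.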
          

In [None]:
# Calling the function to retrieve 100 posts based on the hashtag defined
# Below code will take some time. Let it run until it finishes. 
HASHTAG = "sgfood"
posts_hashtag_with_sleep(HASHTAG, 100)


In [None]:
# # If Instagram blocks you, please download the dataset using below code 
# # Download sample instagram dataset! 
# !npx degit anjisun221/css_codes/sample_data/#main jsondatasets  -f

In [None]:
# If you want to read all texts
pd.set_option('display.max_colwidth', 150)


In [None]:
# Call the function to combine and convert JSON files found in the json_path into a dataframe
json_path = "./#sgfood"
# json_path = "./jsondatasets/#sgfood" # Use this line instead if you are using the downloaded dataset
df = convert_json_to_df(json_path)
print(df.shape)
df.head()


#### Let's extract hashtags from caption using regular expression.

In the example below, we use `lambda`, which creates an anonymous function, i.e. a function that is defined without a name. 
An `addone` function is expressed with a standard Python function definition using the keyword `def` as follows:

```
def addone(x):
    return x+1
```

where addone() takes an argument x and returns x+1 upon invocation.

If you use a Python lambda construction, you get the following:

```lambda x: x+1```
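The short sketch below checks that the two forms behave the same and then applies a lambda to a single caption. The caption is made up, and `addone_lambda` and `find_hashtags` are illustrative names.

```python
import re

def addone(x):
    return x + 1

addone_lambda = lambda x: x + 1

# Both forms return the same value for the same input
print(addone(2), addone_lambda(2))

# A lambda that pulls hashtags out of a caption (the caption is made up)
find_hashtags = lambda caption: re.findall(r"#(\w+)", caption)
tags = find_hashtags("Chicken rice for lunch #sgfood #hawker")
print(tags)
```

Because the pattern captures only the group after `#`, the returned hashtags do not include the `#` character.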


In [None]:
import re
# The caption is stored in the 'text' column created by convert_json_to_df
df['hashtags'] = df['text'].apply(lambda x: re.findall(r"#(\w+)", x))
df.head()


In [None]:
# Since each entry in 'hashtags' is a list, we can simply iterate over its hashtags to count them. 
from collections import defaultdict

h2c = defaultdict(int)
for hashtags in df['hashtags']:
    for h in hashtags:
        h2c[h] += 1

In [None]:
# operator can be used to sort the keys based on their values
import operator
for (h, c) in sorted(h2c.items(), key=operator.itemgetter(1), reverse=True)[:20]:
    print (h,c)
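The same counting and ranking can be done with `collections.Counter` from the standard library, which combines the dictionary and the sorting step. A minimal sketch with made-up hashtag lists:

```python
from collections import Counter

# Made-up hashtag lists, one list per post
hashtag_lists = [["sgfood", "hawker"], ["sgfood"], ["sgfood", "foodie", "hawker"]]

counts = Counter()
for hashtags in hashtag_lists:
    counts.update(hashtags)  # adds 1 for every hashtag in the list

# most_common(n) returns the n most frequent items as (item, count) pairs
top_two = counts.most_common(2)
print(top_two)
```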

### Exercise 3 

Let's count the words and find the top 30 words by ranking the words. 

We will first clean our text using the function from Lab 4. The code below adds a `clean_text` column to our df. 

You can reuse the code for counting hashtags, but note that each `clean_text` value is a string, not a list. You will need to split the text into words using `split()`.


In [None]:
# Below is the code from Lab 4. It cleans our text and adds a clean_text column.
import re 
import string
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
stop = stopwords.words('english')

def clean_text_round1(text):
    '''Make text lowercase; remove hashtags, mentions, punctuation, words containing numbers and stopwords.'''
    text = text.lower()
    # Raw strings (r'...') keep the backslashes in the patterns intact
    text = re.sub(r'#\w*', '', text)
    text = re.sub(r'@\w*', '', text)    
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    text = ' '.join([word for word in text.split() if word not in (stop)])
    return text

# Let's take a look at the updated text
# The caption is stored in the 'text' column created by convert_json_to_df
df['clean_text'] = df['text'].apply(clean_text_round1)
df.head()


In [None]:
# Read clean_text and iterate over each row to split and count words. 

# Write your code 


In [None]:
# Print the top 30 words based on their counts. 

# Write your code 
    

### Exercise 4

Draw a scatter plot comparing the number of likes with the number of comments.

If you're using matplotlib, you can do it in one line ;) 

You can use matplotlib, seaborn, plotly, anything as you want. 


In [None]:
# Write your code 
