# InstaSpot

Our main goal is to recommend new travel destinations to users based on their interest in travel posts on Instagram. To achieve this, we will explore different ways to build recommender systems and compare the results of a content-based approach with those of a collaborative filtering approach.

## Table of Contents

1. Importing modules

2. Data processing
   
3. Data Analysis

...

## 1. Importing modules

First, let's import some libraries that we're going to use in the notebook.

In [1]:
import csv
from zipfile import ZipFile
from pyspark.rdd import RDD
from pyspark.sql import DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, isnan, when, count

Let's also initialize our Spark session.

In [2]:
def init_spark():
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
    return spark

In [3]:
spark = init_spark()



Let's define some paths and constants used throughout the notebook.

In [4]:
# To modify accordingly
DATASET_PATH = 'data/post-metadata/*.info'
INFLUENCER_TEXT_PATH = 'data/influencers.txt'
LOCATIONS = {}
# USERS = 

## 2. Data processing

Our [dataset](https://sites.google.com/site/sbkimcv/dataset#h.4eo4r5p70z10) comes from the Proceedings of The Web Conference (WWW '20), ACM, 2020, and is provided by Seungbae Kim.

The dataset classifies influencers into nine categories: *beauty, family, fashion, fitness, food, interior, pet, travel, and others*. It contains 300 posts per influencer, for a total of over 10 million Instagram posts, and each influencer is categorized based on their post metadata. Each post's metadata is stored in a JSON file.

### Retrieve travel influencers


Since we're only interested in travel influencers, we will retrieve all usernames from the travel category using <code>influencers.txt</code> which contains a list of influencers with their Instagram username, category, the number of followers, followees, and posts.

In [5]:
lines = spark.sparkContext.textFile(INFLUENCER_TEXT_PATH)

# get category and username index
headers = lines.take(2)
header = headers[0]
category_index = header.split("\t").index("Category")
username_index = header.split("\t").index("Username")
post_index = header.split("\t").index("#Posts")

# filter travel influencers
lines = lines.filter(lambda line: line not in headers)
lines = lines.map(lambda line: line.split("\t"))
travel_influencers = lines.filter(lambda line: line[category_index] == 'travel')
# get all travel influencers IG username
travel_usernames = travel_influencers.map(lambda line: line[username_index])

print('Total travel users:',travel_usernames.count())

                                                                                

Total travel users: 4210


As we can see above, 4,210 Instagram users are categorized as travel influencers.

### Extract travel post metadata 
Now let's go ahead and extract the post metadata of those users only. Each post metadata filename starts with a username followed by a post ID. 

In [6]:
# TODO: Add code to extract travel influencers info files
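
One possible way to fill in this step is sketched below in plain Python. It assumes filenames of the form `<username>-<post_id>.info`; the `-` separator is an assumption (the text above only says the username comes first), and `owner_of` and `select_travel_files` are hypothetical helper names. The sample paths are invented.

```python
from pathlib import PurePosixPath

def owner_of(info_path, sep="-"):
    """Return the username part of a post metadata filename.

    Assumes names look like '<username><sep><post_id>.info'; the separator
    is a guess and may need adjusting to the real dataset layout.
    """
    stem = PurePosixPath(info_path).stem            # drop directory and '.info'
    username, _, _post_id = stem.rpartition(sep)    # split on the LAST separator
    return username

def select_travel_files(info_paths, travel_usernames, sep="-"):
    """Keep only the paths whose owner is in the given set of usernames."""
    wanted = set(travel_usernames)
    return [p for p in info_paths if owner_of(p, sep) in wanted]

# Toy inputs (made-up filenames, not taken from the dataset).
paths = [
    "data/post-metadata/alice_travels-1875572106509410527.info",
    "data/post-metadata/bob.eats-1829719472242373040.info",
    "data/post-metadata/carol-1881175916568618668.info",
]
travel_files = select_travel_files(paths, ["alice_travels", "carol"])
print(travel_files)
```

In the notebook, the username list could come from `travel_usernames.collect()`, and the selected paths could then be passed to `spark.read.json` in place of the wildcard in `DATASET_PATH`.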


### Extract relevant fields
Since the JSON files contain more information than we need, we will extract only the relevant fields.

The following fields are the ones we found the most relevant to our project:

| Field                 | Description                                      |
| :-------------------- | :----------------------------------------------- |
| post_id               | ID of the Instagram post                         |
| owner_id              | username of the post's owner                     |
| accessibility_caption | describes what the post is about                 |
| likes_count           | number of likes the post received                |
| comments_count        | number of comments the post received             |
| commenters_id         | usernames of the users who commented on the post |
| tagged_users_id       | usernames of the users tagged in the post        |
| caption               | caption of the post                              |
| hashtags              | hashtags from the caption of the post            |
| location_id           | location ID of the post                          |
| location_name         | location name of the post                        |

### Helper functions

Below are helper functions that will extract the required fields.

In [7]:
# helper function to extract counts
def extract_counts(row, field):
    if field not in row:
        return 0
    if row[field] is None:
        return 0
    if 'count' not in row[field] or row[field]['count'] is None:
        return 0
    return row[field]['count']

# helper function to traverse user network
def extract_nodes_from_edges(row, field, secondary_fields):
    result = []
    if field not in row or row[field] is None \
    or 'edges' not in row[field] or row[field]['edges'] is None:
        return []

    for edge in row[field]['edges']:
        if 'node' in edge and edge['node']:
            no_error = True
            temp = edge['node']
            for f in secondary_fields:
                if f in temp and temp[f]:
                    temp = temp[f]
                else:
                    no_error = False
                    
            if no_error:
                result.append(temp)
 
    return result

# helper function to extract tagged users from caption
def extract_tagged_users(caption):
    tagged = []
    if caption is None or len(caption) == 0:
        return tagged
    else: 
        for word in caption[0].split():
            if word[0] == '@':
                tagged.append(word[1:])
        return tagged
    
# likes_count
def likes(row):
    return extract_counts(row, 'edge_media_preview_like')

# comments_count
def comments_count(row):
    return extract_counts(row, 'edge_media_to_parent_comment')

# tagged_users_id
def extract_tagged_users_id(row):
    return extract_nodes_from_edges(row, 'edge_media_to_tagged_user', ['user', 'username'])

# commenters_id
def extract_commenters_id(row):
    return extract_nodes_from_edges(row, 'edge_media_to_parent_comment', ['owner', 'username'])

# hashtags
def extract_hashtags(caption):
    hashtags = []
    if caption is None or len(caption) == 0:
        return hashtags
    else: 
        for word in caption[0].split():
            if word[0] == '#':
                hashtags.append(word[1:])
        return hashtags

# caption
def extract_text_from_caption(row):
    result = []
    if 'edge_media_to_caption' not in row or row['edge_media_to_caption'] is None \
    or 'edges' not in row['edge_media_to_caption'] or row['edge_media_to_caption']['edges'] is None:
        return []
    
    for edge in row['edge_media_to_caption']['edges']:
        if 'node' in edge and edge['node'] and 'text' in edge['node']:
            result.append(edge['node']['text'])
    return result

# location id, name
def extract_location(row):
    result = {
        'location_name': '',
        'location_id': ''
    }
    if 'location' in row and row['location']:
        if 'name' in row['location'] and 'id' in row['location']:
            result['location_name'] = row['location']['name']
            result['location_id']   = row['location']['id']
        
    return result

# owner_id
def extract_post_owner_id(row):
    if 'owner' not in row or row['owner'] is None:
        return ''
    
    if 'username' not in row['owner'] or row['owner']['username'] is None:
        return ''

    return row['owner']['username']

# post_id
def extract_post_id(row):
    if 'id' not in row or row['id'] is None:
        return ''
    
    return row['id']

# accessibility_caption
def extract_accessibility_caption(row):
    if 'accessibility_caption' not in row or row['accessibility_caption'] is None:
        return ''
    
    return row['accessibility_caption']
    
# builds a dictionary with the extracted fields of a single post
def create_post_as_json(row):
    post_id = extract_post_id(row)
    location = extract_location(row)
    owner_id = extract_post_owner_id(row)
    caption = extract_text_from_caption(row)
    hashtags = extract_hashtags(caption)
    likes_count = likes(row)
    tagged_users_id = extract_tagged_users_id(row) # TODO: ADD @ FROM CAPTIONS
    commenters_id = extract_commenters_id(row)
    comment_count = comments_count(row)
    accessibility_caption = extract_accessibility_caption(row)
    
    return {
        'post_id': post_id,
        'owner_id': owner_id,
        'location_id' : location['location_id'],
        'location_name' : location['location_name'],
        'likes_count': likes_count,
        'comments_count': comment_count,
        'commenters_id': commenters_id,
        'tagged_users_id': tagged_users_id,
        'caption': caption,
        'hashtags': hashtags,
        'accessibility_caption': accessibility_caption     
    }

# converts a post dictionary into a tuple that follows the DataFrame column order
def convert_json_to_tuple(row):
    post_id = row['post_id']
    location_name = row['location_name']
    location_id = row['location_id']
    likes_count = row['likes_count']
    owner_id = row['owner_id']
    caption = row['caption']
    hashtags = row['hashtags']
    tagged_users_id = row['tagged_users_id']
    commenters_id = row['commenters_id']
    accessibility_caption = row['accessibility_caption']
    comment_count = row['comments_count']
    return (post_id, owner_id, location_id, location_name, 
            likes_count, comment_count, commenters_id,
            tagged_users_id, caption, hashtags, accessibility_caption)

# CSV export doesn't allow arrays, so lists need to be converted to strings
def flatten_json_lists(row):
    row['caption'] = '. '.join(row['caption'])
    row['hashtags'] = ', '.join(row['hashtags'])
    row['tagged_users_id'] =  ', '.join(row['tagged_users_id'])
    row['commenters_id'] =  ', '.join(row['commenters_id'])
    return row

# removes carriage returns ("\r") and replaces newlines ("\n") with spaces in the caption
def remove_carry_returns(row):
    row['caption'] = row['caption'].replace('\r', '').replace('\n', ' ')
    return row
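
To make the nested layout these helpers walk more concrete, here is a small self-contained sketch on a toy record. The field names mirror the ones used by the helpers above, but every value is invented, and `dig` is a hypothetical convenience function that works on plain dictionaries only (not on Spark `Row` objects).

```python
# Toy post record: field names mirror those used by the helpers above,
# values are invented for illustration.
sample_post = {
    "id": "123",
    "owner": {"username": "some_traveller"},
    "location": {"id": "42", "name": "Somewhere"},
    "edge_media_preview_like": {"count": 10},
    "edge_media_to_parent_comment": {
        "count": 2,
        "edges": [
            {"node": {"owner": {"username": "friend_a"}}},
            {"node": {"owner": None}},  # incomplete edge, should be skipped
        ],
    },
    "edge_media_to_caption": {
        "edges": [{"node": {"text": "Sunset with @friend_a #travel #sunset"}}]
    },
}

def dig(obj, path, default=None):
    """Follow a list of keys through nested dicts, returning default on any gap."""
    for key in path:
        if not isinstance(obj, dict) or obj.get(key) is None:
            return default
        obj = obj[key]
    return obj

likes_total = dig(sample_post, ["edge_media_preview_like", "count"], 0)

# Collect commenter usernames, skipping edges with missing pieces.
commenters = []
for edge in dig(sample_post, ["edge_media_to_parent_comment", "edges"], []):
    name = dig(edge, ["node", "owner", "username"])
    if name:
        commenters.append(name)

# Only the first caption is scanned, as in the helpers above.
caption_text = dig(sample_post, ["edge_media_to_caption", "edges"], [])[0]["node"]["text"]
hashtags = [w[1:] for w in caption_text.split() if w.startswith("#")]
mentions = [w[1:] for w in caption_text.split() if w.startswith("@")]

print(likes_total, commenters, hashtags, mentions)
# prints: 10 ['friend_a'] ['travel', 'sunset'] ['friend_a']
```

The explicit per-field helpers in the notebook do the same kind of defensive traversal; the generic version is only meant to show the shape of the data.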

Now that we are all set, we will read all the JSON files into a DataFrame and take its underlying RDD.

In [8]:
df = spark.read.json(DATASET_PATH)
rdd = df.rdd


We will then map our helper functions over the RDD to extract the necessary fields.

In [9]:
#transform data to the needed format
rdd = rdd.map(lambda r: create_post_as_json(r)).\
    map(lambda r: flatten_json_lists(r)).\
    map(lambda r: remove_carry_returns(r)).\
    map(lambda r: convert_json_to_tuple(r))

Finally, we convert our RDD into a DataFrame with the following schema so that we can explore the data more easily.

In [10]:
schema = ['post_id', 'owner_id', 'location_id', 'location_name',  
          'likes_count', 'comments_count', 'commenters_id', 
          'tagged_users_id', 'caption', 'hashtags', 'accessibility_caption']

df = rdd.toDF(schema)

                                                                                

In [11]:
df.count()

                                                                                

68353

As we can see above, we collected the metadata of 68,353 posts.

Let's have a look at the first 20 rows.

In [12]:
df.limit(20).toPandas().head(20)

|   | post_id | owner_id | location_id | location_name | likes_count | comments_count | commenters_id | tagged_users_id | caption | hashtags | accessibility_caption |
| :- | :------ | :------- | :---------- | :------------ | :---------- | :------------- | :------------ | :-------------- | :------ | :------- | :-------------------- |
| 0 | 1875572106509410527 | thetravellingbeautyqueen | 567077758.0 | Mexico | 11813 | 151 | normandothemagician, waelalteen, remybaghdady,... | mexicotravel, peperlupe, camillawithlove, yuca... | My newest - 21 st magazine cover😊👸🏼👑📸❤ Mid Tim... | thetravellingbeautyqueen, lenkajosefiova, cove... | |
| 1 | 1829719472242373040 | pinnywooh | 256392895.0 | Valley of Fire State Park | 7163 | 219 | mrs_vernova, giingerann, misssebyaha, puercoes... | | Обещала вам пост, как выглядит типичный рабочи... | | |
| 2 | 1881175916568618668 | putopis | | | 936 | 41 | naturetalker, zeljka_dja, travelbookcroatia, a... | huffpost, natgeotravel, foodandwine, jetsettim... | Rovinj is a town full of beautiful colors and ... | Podravka, vegetamaestro, rovinj, istria, cooli... | |
| 3 | 1802821674903711318 | mahfamily5 | 202278920291131.0 | Edmonds Marina Beach Park | 120 | 29 | mahfamily5, mahfamily5, glampfam, mcculloughsw... | momswithcameras, king5evening, edmondsdowntown... | Mia had a great morning despite the little sle... | | |
| 4 | 1938656069423140660 | frabjous_existence | 1300521560082629.0 | The Rooftop at Pier 17 | 288 | 46 | giu_lucchi, frabjous_existence, lewisnation.lo... | nycgo, nbcnewyork, nymag, uonewyork, streeteas... | ᴛᴀsᴛʏ ᴛʜᴜʀsᴅᴀʏ, ᴀɴᴅ ᴛᴏᴅᴀʏ ᴡᴇ ᴠᴇɴᴛᴜʀᴇ ᴅᴏᴡɴ ᴛᴏ ᴘ... | skatetheskyline | |
| 5 | 1910816411687530750 | vivircorriendo | 215026825.0 | Donostia-San Sebastián, Spain | 1667 | 36 | soyloquevivo, vaboom, carloantoniobaroni, davi... | raulgomez82, odlo, igor_quijano, mariamainez, ... | EMBAJADORA 50/50/25 . Gracias a la Organizació... | bss505025, VivircorRiendo, QueAReirNoTeGaneNad... | |
| 6 | 2022997046060861046 | griffinthall | 234626259.0 | Coachella, California | 1064 | 31 | dyluxe, chasefisher, aaron.griver, markweeeene... | andiefitzgerald, griffinthall, sarah_cothren, ... | Such an incredible #Coachella weekend with the... | Coachella, livefree, puravidabracelets, pvtake... | |
| 7 | 1883554250934263592 | zitamaleki | 1481295935232058.0 | Bittersweet | 1429 | 33 | \_baran.mystyle\_, almaa_food, fafa.trv, h._zahr... | express, fendi, pierrecardintr, swarovski, cha... | حتمن یادتون میاد که یه موقعی این بحث خیلی داغ ... | | |
| 8 | 1997024571324339641 | high_vis | 3003208.0 | American Airlines Center | 373 | 0 | | dallasmavs, valerie_ramirez, cyntgm, sportsill... | Killer @dallasmavs halftime show by @inthelab2... | Truemaverick, dallasmavericks, dallasmavsshop | |
| 9 | 2013304121478023187 | viaja_inspirado | 214881134.0 | Valparaíso, Chile | 1162 | 0 | | viaja_inspirado, sientevalpo, chiletravel, fco... | Valparaiso de mi amor ❤️🎶 Que lindo es Valpara... | | |


## 3. Data Analysis

So far we have been discovering and structuring our data. Our next step is to perform a descriptive data analysis to get a better summary of our features.

Since running Spark actions on this much data can be costly, we will only focus on the most important features.

### Location

Let's have a look at the number of unique values in <code>location_id</code> and <code>location_name</code>. 

In [13]:
print("location_id: ", df.select("location_id").distinct().count())


location_id:  24853


                                                                                

In [14]:
print("location_name: ", df.select("location_name").distinct().count())



location_name:  24068
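
To see how two distinct counts like these can relate to missing values, here is a small plain-Python illustration on invented `(location_id, location_name)` pairs. It suggests that rows without a location collapse into a single empty value in each column, so a gap between the two counts would point to several IDs sharing one name rather than to the number of posts without a location.

```python
# Toy (location_id, location_name) pairs; values are invented.
pairs = [
    ("1", "Central Park"),
    ("2", "Central Park"),   # a second ID carrying the same name
    ("3", "Regent Street"),
    ("", ""),                # post without location info
    ("", ""),                # another post without location info
]

distinct_ids = {loc_id for loc_id, _ in pairs}
distinct_names = {name for _, name in pairs}
missing_rows = sum(1 for loc_id, name in pairs if loc_id == "" and name == "")

print(len(distinct_ids), len(distinct_names), missing_rows)
# prints: 4 3 2
```

The number of posts without a location therefore has to be counted directly with a filter.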


                                                                                

The numbers of unique values of <code>location_id</code> and <code>location_name</code> differ (24,853 vs. 24,068), which means that some location IDs share the same name. We also saw in the preview above that some posts have no location information at all.

Let's count how many posts are missing both <code>location_id</code> and <code>location_name</code>.

In [15]:
df.filter((col("location_name") == '') & (col("location_id") == '')).count()

                                                                                

19760

Let's also count the posts that have a <code>location_id</code> but no <code>location_name</code>, and vice versa.

In [16]:
df.filter((col("location_name") == '') & (col("location_id") != '')).count()

                                                                                

0

In [17]:
df.filter((col("location_name") != '') & (col("location_id") == '')).count()

                                                                                

0

Based on the counts above, 19,760 posts have no location information, and the two fields are always either both present or both missing. We will therefore drop these rows.

In [18]:
df = df[(df.location_name != '') & (df.location_id != '')]

In [19]:
assert df.count() == (68353-19760)
print("19760 rows dropped successfully!")



19760 rows dropped successfully!


                                                                                

We will also create a dictionary that maps each location ID to its name.

In [20]:
# collect both columns in a single pass so that each ID stays paired with its name
locations = df.select("location_id", "location_name").collect()

In [21]:
# keep the first name seen for each location ID
for loc_id, loc_name in locations:
    LOCATIONS.setdefault(loc_id, loc_name)

In [22]:
LOCATIONS

{'567077758': 'Mexico',
 '256392895': 'Valley of Fire State Park',
 '202278920291131': 'Edmonds Marina Beach Park',
 '1300521560082629': 'The Rooftop at Pier 17',
 '215026825': 'Donostia-San Sebastián, Spain',
 '234626259': 'Coachella, California',
 '1481295935232058': 'Bittersweet',
 '3003208': 'American Airlines Center',
 '214881134': 'Valparaíso, Chile',
 '12318445': 'Central Park',
 '197048104279564': 'City of Philadelphia',
 '221555314': 'Philippines',
 '258679201373942': 'KENYA,Africa',
 '1711208798927948': 'Disneyland California',
 '213011753': 'Sydney, Australia',
 '242911093': 'Monte Damavand Mt 5671 (Iran)',
 '250816764': 'Robinswood House-City of Bellevue',
 '412301973': 'Costa Rica',
 '236434121': 'Barbados',
 '645926014': 'The Mansions At Doral',
 '213927878': 'Beirut, Lebanon',
 '468034852': 'Puerto Tranquilo, Aisen Del General Carlos Ibanez Del Campo, Chile',
 '250525610': 'Regent Street',
 '213045606': "St. Paul's Cathedral",
 '214639976': 'Silver Lake, Los Angeles',
 ...

### Accessibility Caption

In [24]:
# To show accessibility caption has too many missing values

In [23]:
#save data into csv files

# clean_data.toDF(schema).write.format("com.databricks.spark.csv").save("csv_formated_data", header="true")
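
As a rough illustration of why the list fields are flattened before export, the sketch below writes a toy row with Python's standard `csv` module. The row values are invented, only a subset of the notebook's columns is used, and `to_csv_row` is a hypothetical helper; it is not the export used in this notebook, where Spark's own `df.write.csv(path, header=True)` would be the more natural choice.

```python
import csv
import io

# Invented row using a subset of the notebook's columns.
columns = ["post_id", "owner_id", "location_name", "hashtags", "caption"]
rows = [
    {"post_id": "1", "owner_id": "alice", "location_name": "Valparaíso, Chile",
     "hashtags": ["travel", "chile"], "caption": ["Line one\r\nline two"]},
]

def to_csv_row(row):
    """Join list fields into strings and strip line breaks from the caption."""
    flat = dict(row)
    flat["hashtags"] = ", ".join(row["hashtags"])
    flat["caption"] = ". ".join(row["caption"]).replace("\r", "").replace("\n", " ")
    return [flat[c] for c in columns]

buffer = io.StringIO()        # stands in for a real file opened with newline=""
writer = csv.writer(buffer)   # default quoting wraps fields that contain commas
writer.writerow(columns)
writer.writerows(to_csv_row(r) for r in rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Fields that contain commas, such as the location name or the joined hashtags, are quoted by the writer, so they can be read back as single values.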
