## Likes Report
 
#### Calvin Muller
#### September 20, 2021

Here I take a look at my personal Instagram data, specifically the complete list of every post that I have liked on Instagram. This much data can be a double-edged sword. On one hand, it can be incredibly useful and lead to real insights about both psychology and marketing strategy. On the other hand, it can just as easily be a smorgasbord of random information that yields no insight at all.

To make sense of this data, we first need to ask a few questions. Where did this data come from? Who collected it? Could the collection method introduce biases? I do not know exactly who collected it, but it is reasonable to assume that Instagram built a system that records each user's interactions on its app and website, stores them, and makes some of them available to every user, including me. The reason for collecting this data was most likely the reason for anything in the corporate world: money. Instagram likely uses this massive amount of data in what it sells to advertisers, so that they can promote products on the platform more effectively. Given these assumptions, we also have to ask how reliable the data is. What is presented to us is most likely an accurate record of what it covers, because these interactions are presumably logged automatically rather than entered by hand, so I treat it as trustworthy in that sense. However, I believe the data may be unreliable in the sense that it likely does not give the full picture. I am sure there are pieces and categories of data that Instagram holds back and does not make available to the public.

In [1]:
import pandas as pd
import json
import matplotlib.pyplot as plt

In [2]:
with open(r"C:\Users\cjmul\Downloads\School\Fall 2021\EMAT 22110\calvin.muller7_20210902\likes\liked_posts.json") as j:
    dat = json.load(j)

In the cells above, I import the packages that the rest of the code depends on (pandas, json, and matplotlib), and I load the main JSON file that this data is stored in.
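As a minimal sketch of the loading step, the snippet below writes a one-record stand-in file to a temporary folder and reads it back with an explicit encoding. The sample record mirrors the first entry shown later in this report; the temporary path and the choice of `encoding="utf-8"` are assumptions for illustration, not details taken from the real export.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the export: one record with the same shape as the real file.
sample = {
    "likes_media_likes": [
        {
            "title": "schoofsies",
            "media_list_data": [],
            "string_list_data": [
                {
                    "href": "https://www.instagram.com/p/BkqgokvBnsO/",
                    "value": "\u00f0\x9f\x91\x8d",
                    "timestamp": 1530406044,
                }
            ],
        }
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical location; the real file lives wherever the export was unzipped.
    path = Path(tmp) / "liked_posts.json"
    path.write_text(json.dumps(sample), encoding="utf-8")

    # Naming the encoding avoids depending on the operating system's default.
    with path.open(encoding="utf-8") as j:
        dat = json.load(j)

print(list(dat.keys()))
```

Building the path with `pathlib` also keeps the code from being tied to one machine's folder layout.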

In [3]:
dat.keys()

dict_keys(['likes_media_likes'])

In [None]:
dat

In the cells above, I look at the data I imported, which I named dat, and use the .keys() method to see which keys it contains.

In [5]:
df_likes = pd.DataFrame(x['title'] for x in dat['likes_media_likes'])

In [6]:
df_likes

Unnamed: 0,0
0,schoofsies
1,jordanjarc
2,frankiefrancescac
3,zrozler
4,adipotti
...,...
5306,dlofilm
5307,basketballdreamart
5308,irving11central
5309,kingljames


In the cells above, I paired the pandas DataFrame constructor with a generator expression (the same idea as a list comprehension) to make a dataframe that pulls out each instance of 'title' from the records inside 'likes_media_likes'.
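The dataframe above ends up with a single column named 0. A small variation, sketched here on two records trimmed to the one key that matters, gives the column a readable name. The trimmed sample is an assumption made to keep the example self-contained.

```python
import pandas as pd

# Two titles copied from the output above; other keys are left out for brevity.
dat = {
    "likes_media_likes": [
        {"title": "schoofsies"},
        {"title": "adipotti"},
    ]
}

# Passing a dict of {column name: list} names the column instead of leaving it as 0.
df_likes = pd.DataFrame({"title": [x["title"] for x in dat["likes_media_likes"]]})
print(df_likes)
```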

In [7]:
type(dat['likes_media_likes'])

list

In [9]:
type('string_list_data')

str

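In the cells above, I check the type of 'likes_media_likes', which is a list. The second cell, type('string_list_data'), only reports the type of the literal string I typed (str), not the type of the data stored under that key; as the record printed in the next cell shows, the value under 'string_list_data' is itself a list. Knowing these types is helpful when writing list comprehensions.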
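To check the type of the data itself rather than of a quoted key name, the key has to be looked up inside a record first. A minimal sketch, using one record with the same shape as the first entry in this export (the variable names are chosen for illustration):

```python
# One record with the same shape as the first entry in 'likes_media_likes'.
first = {
    "title": "schoofsies",
    "media_list_data": [],
    "string_list_data": [
        {
            "href": "https://www.instagram.com/p/BkqgokvBnsO/",
            "value": "\u00f0\x9f\x91\x8d",
            "timestamp": 1530406044,
        }
    ],
}

literal_type = type("string_list_data")          # type of the quoted text itself
value_type = type(first["string_list_data"])     # type of what is stored under the key
inner_type = type(first["string_list_data"][0])  # type of the first item in that list

print(literal_type.__name__, value_type.__name__, inner_type.__name__)
```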

In [10]:
dat['likes_media_likes'][0]

{'title': 'schoofsies',
 'media_list_data': [],
 'string_list_data': [{'href': 'https://www.instagram.com/p/BkqgokvBnsO/',
   'value': 'ð\x9f\x91\x8d',
   'timestamp': 1530406044}]}

In [38]:
type(dat['likes_media_likes'][0])

dict

In the cells above, I look at one item inside the 'likes_media_likes' list to see what it contains, and I confirm that it is a dictionary.
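The 'value' field prints as 'ð\x9f\x91\x8d'. One plausible explanation, which is an assumption rather than something the export documents, is that the UTF-8 bytes of an emoji were read as Latin-1 characters. Under that assumption, the text can be repaired by reversing the mistake:

```python
# The string exactly as it appears in the record ('\u00f0' is the character 'ð').
raw = "\u00f0\x9f\x91\x8d"

# Turn each character back into a single byte, then decode those bytes as UTF-8.
fixed = raw.encode("latin-1").decode("utf-8")

print(fixed)
```

If the assumption about the encoding is wrong for some value, `decode` raises a `UnicodeDecodeError` instead of silently producing bad text.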

In [11]:
test = pd.DataFrame(dat['likes_media_likes'])

In [12]:
test

Unnamed: 0,title,media_list_data,string_list_data
0,schoofsies,[],[{'href': 'https://www.instagram.com/p/Bkqgokv...
1,jordanjarc,[],[{'href': 'https://www.instagram.com/p/Bkp1Rr8...
2,frankiefrancescac,[],[{'href': 'https://www.instagram.com/p/Bkn7U4_...
3,zrozler,[],[{'href': 'https://www.instagram.com/p/BkoATQ6...
4,adipotti,[],[{'href': 'https://www.instagram.com/p/Bknxh4C...
...,...,...,...
5306,dlofilm,[],[{'href': 'https://www.instagram.com/p/CBoJDaD...
5307,basketballdreamart,[],[{'href': 'https://www.instagram.com/p/CBX07Gu...
5308,irving11central,[],[{'href': 'https://www.instagram.com/p/CBd411F...
5309,kingljames,[],[{'href': 'https://www.instagram.com/p/CBNu_nn...


In the cells above, I make a dataframe directly from the raw list stored under 'likes_media_likes'. While this does give me a dataframe with a 'title' column, which is helpful, it is still mostly unreadable because most of the data is packed into the 'string_list_data' column. To make this more readable, I need to figure out a way to unpack that group of data.
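One way to unpack a nested column like this is pandas' `json_normalize`, which is a different route from the step-by-step approach this report follows. The sketch below uses two records copied from the output above; it assumes every record has the same keys.

```python
import pandas as pd

likes = [
    {
        "title": "schoofsies",
        "media_list_data": [],
        "string_list_data": [
            {
                "href": "https://www.instagram.com/p/BkqgokvBnsO/",
                "value": "\u00f0\x9f\x91\x8d",
                "timestamp": 1530406044,
            }
        ],
    },
    {
        "title": "adipotti",
        "media_list_data": [],
        "string_list_data": [
            {
                "href": "https://www.instagram.com/p/Bknxh4Cldxt/",
                "value": "\u00f0\x9f\x91\x8d",
                "timestamp": 1530307294,
            }
        ],
    },
]

# record_path names the nested list to expand into rows;
# meta names the outer fields to repeat alongside each row.
flat = pd.json_normalize(likes, record_path="string_list_data", meta=["title"])
print(flat)
```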

In [15]:
names = [x['string_list_data'] for x in dat['likes_media_likes']]

In [None]:
names

In the cells above, I use a list comprehension to extract just 'string_list_data' from each record in 'likes_media_likes'. The result is a list of lists.

In [34]:
names[0][0]

{'href': 'https://www.instagram.com/p/BkqgokvBnsO/',
 'value': 'ð\x9f\x91\x8d',
 'timestamp': 1530406044}

In [18]:
type(names[0])

list

In [19]:
names.keys()

AttributeError: 'list' object has no attribute 'keys'

In the cells above, I am doing several things. In the last two, I check what type of data structure names[0] is, which turns out to be a list, and I try calling .keys() on names, which fails because names is itself a list and lists have no keys. To build a good dataframe with a list comprehension, though, I need to reach the dictionaries inside those lists. That is why, in the first cell, I look at the zeroth element of the zeroth element of names, which gives me the dictionary. It took a lot of trial and error to figure this out, as I initially couldn't see how to get down to this second level of the list. I kept running into errors and had to consult documentation and various forums. While the solution was much simpler than what I was searching for, it gave me good experience in troubleshooting code and in looking for similar problems and solutions online.
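When it is unclear how deeply lists and dictionaries are nested, a small helper can print the shape level by level instead of probing one index at a time. This is a sketch only: `describe` is a hypothetical helper written for this example, and it assumes every item in a list has the same shape as the first.

```python
def describe(obj, indent=0):
    """Return lines describing how lists and dicts are nested inside obj."""
    pad = "  " * indent
    if isinstance(obj, dict):
        lines = [f"{pad}dict with keys {list(obj)}"]
        for key, value in obj.items():
            lines.append(f"{pad}  {key!r}:")
            lines.extend(describe(value, indent + 2))
        return lines
    if isinstance(obj, list):
        lines = [f"{pad}list of {len(obj)} item(s)"]
        if obj:
            # Only the first item is inspected (assumes the rest look the same).
            lines.extend(describe(obj[0], indent + 1))
        return lines
    return [f"{pad}{type(obj).__name__}"]


# Same shape as 'names': a list of lists, each holding one dictionary.
names = [
    [
        {
            "href": "https://www.instagram.com/p/BkqgokvBnsO/",
            "value": "\u00f0\x9f\x91\x8d",
            "timestamp": 1530406044,
        }
    ]
]

lines = describe(names)
print("\n".join(lines))
```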

In [28]:
ts = [x[0] for x in names]

In [None]:
ts

In [30]:
tsDF = pd.DataFrame(ts)

In [32]:
tsDF

Unnamed: 0,href,value,timestamp
0,https://www.instagram.com/p/BkqgokvBnsO/,ð,1530406044
1,https://www.instagram.com/p/Bkp1Rr8F_KqQQme5xf...,ð,1530372550
2,https://www.instagram.com/p/Bkn7U4_l8pZyVqw9_l...,ð,1530316986
3,https://www.instagram.com/p/BkoATQ6lvqeBVuWJjF...,ð,1530312255
4,https://www.instagram.com/p/Bknxh4Cldxt/,ð,1530307294
...,...,...,...
5306,https://www.instagram.com/p/CBoJDaDls14/,ð,1592707282
5307,https://www.instagram.com/p/CBX07GulO62/,ð,1592707279
5308,https://www.instagram.com/p/CBd411FFyMS/,ð,1592707278
5309,https://www.instagram.com/p/CBNu_nnlYUR/,ð,1592707277


In the cells above, I use a list comprehension to take the first dictionary out of each inner list, which gives a flat list of dictionaries that converts into a sensible dataframe. However, I don't stop there, as I also want to add the 'title' column to this dataframe.

In [35]:
tsDF['title'] = test['title']

In [36]:
tsDF

Unnamed: 0,href,value,timestamp,title
0,https://www.instagram.com/p/BkqgokvBnsO/,ð,1530406044,schoofsies
1,https://www.instagram.com/p/Bkp1Rr8F_KqQQme5xf...,ð,1530372550,jordanjarc
2,https://www.instagram.com/p/Bkn7U4_l8pZyVqw9_l...,ð,1530316986,frankiefrancescac
3,https://www.instagram.com/p/BkoATQ6lvqeBVuWJjF...,ð,1530312255,zrozler
4,https://www.instagram.com/p/Bknxh4Cldxt/,ð,1530307294,adipotti
...,...,...,...,...
5306,https://www.instagram.com/p/CBoJDaDls14/,ð,1592707282,dlofilm
5307,https://www.instagram.com/p/CBX07GulO62/,ð,1592707279,basketballdreamart
5308,https://www.instagram.com/p/CBd411FFyMS/,ð,1592707278,irving11central
5309,https://www.instagram.com/p/CBNu_nnlYUR/,ð,1592707277,kingljames


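Here, I finally add the 'title' column from my test dataframe to the dataframe I have just made. This works because both dataframes were built from the same list in the same order, so their rows line up by index. The result is a good dataframe containing the data that I want.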
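An alternative that does not rely on two dataframes lining up is to build each row, title included, in a single comprehension. The sketch below also skips any record whose 'string_list_data' is empty; the third record is hypothetical and is included only to show that guard.

```python
import pandas as pd

likes = [
    {
        "title": "schoofsies",
        "media_list_data": [],
        "string_list_data": [
            {
                "href": "https://www.instagram.com/p/BkqgokvBnsO/",
                "value": "\u00f0\x9f\x91\x8d",
                "timestamp": 1530406044,
            }
        ],
    },
    {
        "title": "adipotti",
        "media_list_data": [],
        "string_list_data": [
            {
                "href": "https://www.instagram.com/p/Bknxh4Cldxt/",
                "value": "\u00f0\x9f\x91\x8d",
                "timestamp": 1530307294,
            }
        ],
    },
    # Hypothetical record with nothing to unpack; rec["string_list_data"][0] would fail on it.
    {"title": "hypothetical_empty", "media_list_data": [], "string_list_data": []},
]

# Merge the title with the first inner dictionary, keeping only non-empty records.
rows = [
    {"title": rec["title"], **rec["string_list_data"][0]}
    for rec in likes
    if rec["string_list_data"]
]
tsDF = pd.DataFrame(rows)
print(tsDF)
```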

In [57]:
tsDF.groupby(['title']).count()

Unnamed: 0_level_0,href,value,timestamp
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
21savage,11,11,11
2kmemes,1,1,1
2sportsclips,1,1,1
99wokeboi,4,4,4
_.samurai_jack._,1,1,1
...,...,...,...
zack_carson,2,2,2
zendaya,1,1,1
zionheadlines,1,1,1
zionwilliamson,4,4,4


In [62]:
tsDF.groupby(['timestamp']).min()

Unnamed: 0_level_0,href,value,title
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1513525748,https://www.instagram.com/p/BcgBvQphRlj/,ð,jenselter
1513525751,https://www.instagram.com/p/Bcqds--nyZG/,ð,jenselter
1513525882,https://www.instagram.com/p/BcyvpGjB7f9/,ð,itsbriittt_
1513526370,https://www.instagram.com/p/BcvkHzfB4wz/,ð,itsbriittt_
1514561564,https://www.instagram.com/p/BdRPJvLA05i/,ð,lauraivetteg
...,...,...,...
1630362226,https://www.instagram.com/p/CTNq3p_FWBgC6UHKa_...,ð,zrozler
1630420729,https://www.instagram.com/p/CTON2vmIacdyhS0Lzu...,ð,jordanjarc
1630420989,https://www.instagram.com/p/CTOscW-r0cO/,ð,jingzhiyong
1630510499,https://www.instagram.com/p/CTP_LaylyTe/,ð,amia_muller


In the first cell above, I group the rows by 'title', which is the account that posted, and count them, which shows how many of each user's posts I have liked. In the second cell, I try a new method: I group by 'timestamp' and take the minimum of each group. Because groupby sorts its keys, the result is ordered from the smallest timestamp to the largest, so the oldest likes appear first. Sorting by the 'timestamp' column directly would give the same ordering more simply.
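The same two questions, how many posts per user and which likes came first, can also be answered with `value_counts` and `sort_values`. The sketch below uses six title and timestamp pairs copied from the outputs in this report, and it assumes the timestamps are Unix seconds, which is why they are converted with `unit="s"`.

```python
import pandas as pd

# Rows copied from the tables above, deliberately out of order.
tsDF = pd.DataFrame(
    {
        "title": [
            "schoofsies",
            "jenselter",
            "itsbriittt_",
            "lauraivetteg",
            "jenselter",
            "itsbriittt_",
        ],
        "timestamp": [
            1530406044,
            1513525751,
            1513525882,
            1514561564,
            1513525748,
            1513526370,
        ],
    }
)

# Number of liked posts per account, largest first.
likes_per_user = tsDF["title"].value_counts()

# Readable dates (assumes the integers are seconds since 1970-01-01 UTC).
tsDF["liked_at"] = pd.to_datetime(tsDF["timestamp"], unit="s", utc=True)

# Oldest like first.
oldest_first = tsDF.sort_values("timestamp")

print(likes_per_user)
print(oldest_first)
```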

## A hypothesis regarding Instagram data

My theoretical hypothesis about this likes data, taking into account the other data files that Instagram provides, is this: the users I interact with most (by liking their posts) share more topics with me (according to Instagram's topics data), and there is a strong correlation between shared topics and interaction with posts. I believe one could build a predictive model relating the percentage of shared topics to the percentage of posts liked. A strong relationship would suggest that shared topics are something Instagram's algorithm takes into account when promoting posts in a user's feed.

The statistical hypothesis is this: at a 5 percent significance level (that is, with 95 percent confidence), a user will interact 25% more with accounts that share at least 10% of their topics than with accounts that do not, according to Instagram's topic data.
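One way to write this down formally, as a sketch: let $\mu_S$ be the mean rate of likes toward accounts that share at least 10% of topics, and $\mu_N$ the mean rate toward accounts that do not. These symbols are introduced here for illustration, and reading "25% more" as a ratio of means and the 5 percent figure as the significance level $\alpha$ are assumptions.

```latex
H_0:\ \mu_S \le 1.25\,\mu_N
\qquad\text{versus}\qquad
H_1:\ \mu_S > 1.25\,\mu_N,
\qquad \alpha = 0.05
```

Rejecting $H_0$ at this level would support the claim that the high-overlap group receives at least 25% more interaction.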

Obviously, with only one account's topic data, this would be a hard hypothesis to test. In an ideal world, however, with all the data that Instagram has, we could look at other users' topic data and compare across accounts to see whether this correlation exists. If it does, it could help tailor people's Instagram home pages toward accounts with similar interests.