# Instagram Likes: The Clean Up and Questioning

Amber Cocchiola
9/9/21

This report dissects all of the posts I have liked on Instagram. The data comes as a JSON file, which is then turned into a DataFrame for ease of comprehension. The file was provided by Instagram and contains my personal like data.

In [1]:
import pandas as pd
import json
import datetime

In [2]:
with open(r"C:\Users\frenc\EMATfall21folder\bamberwhatever_20210907\likes\liked_posts.json") as j:
    data = json.load(j)

### Discovering the DataFrame
The next steps determine which part of the loaded dictionary should be used to build the most useful DataFrame.

In [3]:
data.keys()

dict_keys(['likes_media_likes'])
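Before building a DataFrame, one way to see what sits under that single key is to look at one record directly. The sketch below uses a small hand-made sample (the name `sample` and its two records are assumptions, shaped like the records in this export) rather than the real file.

```python
# Hypothetical two-record sample shaped like the records in the export.
sample = {
    "likes_media_likes": [
        {
            "title": "olsaintdick",
            "media_list_data": [],
            "string_list_data": [
                {"href": "https://www.instagram.com/p/BHBFezOj6Bt/",
                 "value": "ð", "timestamp": 1466900196}
            ],
        },
        {
            "title": "ptxofficial",
            "media_list_data": [],
            "string_list_data": [
                {"href": "https://www.instagram.com/p/BHGGiw-D7YV/",
                 "value": "ð", "timestamp": 1466900049}
            ],
        },
    ]
}

records = sample["likes_media_likes"]

# The value under the key is a list with one dictionary per liked post.
print(type(records).__name__, len(records))

# Keys of one record, then keys of its nested string_list_data entry.
first = records[0]
record_keys = sorted(first.keys())
nested_keys = sorted(first["string_list_data"][0].keys())
print(record_keys)
print(nested_keys)
```

Seeing the nested keys up front shows that the link and timestamp live one level down, inside `string_list_data`.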

In [4]:
likes = pd.DataFrame(data['likes_media_likes'])

For the sake of the notebook viewer, I will only display the first 20 data points of the DataFrame, even though there are over 50,000 data points. The whole list can be displayed by removing .head(20) from the call.

In [5]:
likes.head(20)

,title,media_list_data,string_list_data
0,olsaintdick,[],[{'href': 'https://www.instagram.com/p/BHBFezO...
1,olsaintdick,[],[{'href': 'https://www.instagram.com/p/BHE4ln0...
2,ptxofficial,[],[{'href': 'https://www.instagram.com/p/BHGGiw-...
3,salvatrice___,[],[{'href': 'https://www.instagram.com/p/BHF56zy...
4,altpress,[],[{'href': 'https://www.instagram.com/p/BHFlOf7...
5,ptvjaime,[],[{'href': 'https://www.instagram.com/p/BHFipHA...
6,piercetheveil,[],[{'href': 'https://www.instagram.com/p/BHFtg3-...
7,fearlessrecords,[],[{'href': 'https://www.instagram.com/p/BHFdlKh...
8,altpress,[],[{'href': 'https://www.instagram.com/p/BHFVRaf...
9,jordanfish86,[],[{'href': 'https://www.instagram.com/p/BHFJFAM...


In [6]:
likes = pd.DataFrame([x['string_list_data'][0] for x in data['likes_media_likes']])

In [7]:
likes.head(20)

,href,value,timestamp
0,https://www.instagram.com/p/BHBFezOj6Bt/,ð,1466900196
1,https://www.instagram.com/p/BHE4ln0j-fG/,ð,1466900089
2,https://www.instagram.com/p/BHGGiw-D7YV/,ð,1466900049
3,https://www.instagram.com/p/BHF56zyDAGCtwE5PVI...,ð,1466889685
4,https://www.instagram.com/p/BHFlOf7BLgh/,ð,1466887802
5,https://www.instagram.com/p/BHFipHAD0vz/,ð,1466887750
6,https://www.instagram.com/p/BHFtg3-Bt69/,ð,1466887748
7,https://www.instagram.com/p/BHFdlKhjQb3/,ð,1466875086
8,https://www.instagram.com/p/BHFVRafBfU3/,ð,1466875033
9,https://www.instagram.com/p/BHFJFAMgB-A/,ð,1466875030


This seems close, but we lose the value of title.

In [8]:
title = [x['title'] for x in data['likes_media_likes']]

In [9]:
likes['title'] = title

In [10]:
likes.head(20)

,href,value,timestamp,title
0,https://www.instagram.com/p/BHBFezOj6Bt/,ð,1466900196,olsaintdick
1,https://www.instagram.com/p/BHE4ln0j-fG/,ð,1466900089,olsaintdick
2,https://www.instagram.com/p/BHGGiw-D7YV/,ð,1466900049,ptxofficial
3,https://www.instagram.com/p/BHF56zyDAGCtwE5PVI...,ð,1466889685,salvatrice___
4,https://www.instagram.com/p/BHFlOf7BLgh/,ð,1466887802,altpress
5,https://www.instagram.com/p/BHFipHAD0vz/,ð,1466887750,ptvjaime
6,https://www.instagram.com/p/BHFtg3-Bt69/,ð,1466887748,piercetheveil
7,https://www.instagram.com/p/BHFdlKhjQb3/,ð,1466875086,fearlessrecords
8,https://www.instagram.com/p/BHFVRafBfU3/,ð,1466875033,altpress
9,https://www.instagram.com/p/BHFJFAMgB-A/,ð,1466875030,jordanfish86


Now we have all data included in the dataframe, but "value" seems like it just holds gibberish.
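One possible explanation for the gibberish, which is an assumption and was not checked against this file, is that "value" holds an emoji whose UTF-8 bytes were read as Latin-1 characters. If that is the case, a round trip through those two encodings may recover the original text. The helper name `repair_mojibake` and the sample string are hypothetical.

```python
def repair_mojibake(text):
    """Reinterpret Latin-1-decoded text as UTF-8; return it unchanged on failure."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mis-decoded in this way, so leave it alone.
        return text


# Hypothetical sample: the four UTF-8 bytes of a thumbs-up emoji,
# each mis-read as a separate Latin-1 character.
garbled = "\u00f0\u009f\u0091\u008d"
repaired = repair_mojibake(garbled)
print(repaired)

# Plain ASCII text, such as an account name, passes through unchanged.
print(repair_mojibake("olsaintdick"))
```

Even if the repair works, the column may still carry little information for this analysis, so dropping it remains a reasonable choice.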

In [11]:
likes = likes.drop("value", axis = 1)

In [12]:
likes.head(20)

,href,timestamp,title
0,https://www.instagram.com/p/BHBFezOj6Bt/,1466900196,olsaintdick
1,https://www.instagram.com/p/BHE4ln0j-fG/,1466900089,olsaintdick
2,https://www.instagram.com/p/BHGGiw-D7YV/,1466900049,ptxofficial
3,https://www.instagram.com/p/BHF56zyDAGCtwE5PVI...,1466889685,salvatrice___
4,https://www.instagram.com/p/BHFlOf7BLgh/,1466887802,altpress
5,https://www.instagram.com/p/BHFipHAD0vz/,1466887750,ptvjaime
6,https://www.instagram.com/p/BHFtg3-Bt69/,1466887748,piercetheveil
7,https://www.instagram.com/p/BHFdlKhjQb3/,1466875086,fearlessrecords
8,https://www.instagram.com/p/BHFVRafBfU3/,1466875033,altpress
9,https://www.instagram.com/p/BHFJFAMgB-A/,1466875030,jordanfish86


Now all data labels with meaningful entries are included and organized. 
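The flatten, attach title, and drop steps above could also be sketched as a single call to `pd.json_normalize`, which takes a `record_path` for the nested list and a `meta` list for fields to carry over from the parent record. The sample data here is a hand-made assumption shaped like the export, not the real file.

```python
import pandas as pd

# Hypothetical two-record sample shaped like the export.
sample = {
    "likes_media_likes": [
        {"title": "olsaintdick", "media_list_data": [],
         "string_list_data": [{"href": "https://www.instagram.com/p/BHBFezOj6Bt/",
                               "value": "ð", "timestamp": 1466900196}]},
        {"title": "altpress", "media_list_data": [],
         "string_list_data": [{"href": "https://www.instagram.com/p/BHFlOf7BLgh/",
                               "value": "ð", "timestamp": 1466887802}]},
    ]
}

# One row per entry in string_list_data, with the parent's title attached.
flat = pd.json_normalize(
    sample["likes_media_likes"],
    record_path="string_list_data",
    meta=["title"],
)

# Drop the column that holds no readable information.
flat = flat.drop(columns="value")
print(flat)
```

One difference from the cells above: this keeps every entry in `string_list_data`, not only the first one, so the row count could differ if any record held more than one entry.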

### Who, When, Where, Why?


The above information is a collection of my liked Instagram posts as given to me by Instagram itself. Although I requested this data, it was stored and retrieved by Instagram. I did provide the data, in the sense that liking posts is something like an extended survey in which I answer yes or no, but I did not catalog or even remember the posts listed. I acted as a study participant rather than as the researcher. Instagram, therefore, is the researcher that collected this data.

I can only assume this data was collected in order to monetize my preferences. Instagram has a page, called the explore page, that shows you posts it expects you will like. These posts are chosen based on posts that you came across before and liked: Instagram determines which categories your previously liked posts fit into and finds more posts in those categories. As you like these posts and continue to like posts on your feed, it can match new posts to your tastes with greater accuracy, and this cycle continues as long as you use the app. With more and more data for its algorithm to learn from, it can more reliably present posts that you are going to like and interact with. This is a great way to keep you engaged in the app and satisfied with your experience, but it also lets Instagram know your interests. It can then sell those interests to advertisers, who can target you with their products.

While the method described above may seem foolproof, there are a number of ways that the interests Instagram believes you have can be incorrect. All of this analysis of liked posts is done by a computer, which may have a hard time deciding *why* you liked a certain post. For example, in my own ad interest document, "Castelli Cycling" was listed. I had never heard of it, let alone been interested in it. Scrolling through more of my interests, I found several other entries related to Italian cycling. This raises the question: what did I do to generate that response? I feel that this is the wrong question, as the answer for me or anyone else with unrelated ads is, well, nothing that a human would note. The right question, in my opinion, is "What did the computer categorize some posts as that I did not?" My explore page on Instagram is mostly memes. It is entirely possible that I liked a few memes whose reaction image showed an Italian bicycle, or that Castelli Cycling was tagged in a caption with no relation to the picture. The computer may see a meme with a bicycle and think, "Ah, this human likes bicycles!", when in fact the human in question was laughing at the words above an image that happened to include a bike. Of course, this idea depends on the assumption that Instagram categorizes pictures automatically based on what is in them rather than by some other method.

An alternate explanation comes from [this article](https://www.digitaltrends.com/social-media/instagram-ads-interests/) by Hillary K. Grigonis, where she explains that ad interests on Instagram are generated from posts you've liked, accounts you follow, websites you visit, and information found on your Facebook page. This would mean that if a cousin or an old high school classmate I follow turned their page into a cycling page, the computer might conclude that I too enjoy cycling because I follow them. A human would reason instead that I might feel obliged to follow family, or that I forgot I even followed Joe from Algebra. Regardless of how it happened, the cycling example shows that not all of the computer-generated interests line up with the human's actual interests, which makes the data unreliable.



# Counting the Likes

In [13]:
howManyLikes = likes.groupby('title').count()

howManyLikes is now a DataFrame that counts how many posts I have liked per account. It has over 10,000 rows, so I am displaying the first 20 below as an example.

In [14]:
howManyLikes.head(20)

title,href,timestamp
0.0072973525247,2,2
0.010010001,12,12
00gieb00gieenthusiast,1,1
0hmygodz,1,1
0um3n0,1,1
1.800.text.posts,10,10
10_months_older_now,1,1
11.08am,1,1
13mac_,1,1
1776campaign,1,1


# Conclusions

There are a few takeaways from this data. It appears I may spend too much time on Instagram, as there are over 58,000 entries for liked posts. The sheer size made the data somewhat difficult to visualize and look through. For example, an attempt to turn the data into a bar graph produced a chart whose labels looked like a solid black square. Also, the first and last five accounts shown when the data was grouped and counted were all accounts I had never heard of, each with only one or two posts that I had liked. Accounts that I follow, and therefore see posts from every day, would have more than one or two likes, but I could not see these in the portion of the list displayed. However, the size also means there is plenty of data for further analysis.
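One way to surface the most-liked accounts instead of the alphabetical head and tail is to sort the counts before displaying them. The sketch below is a minimal illustration on hand-made sample data (the name `likes_sample` and its rows are assumptions); `Series.value_counts()` returns counts sorted from most to least frequent.

```python
import pandas as pd

# Hypothetical sample standing in for the 'title' column of the likes DataFrame.
likes_sample = pd.DataFrame({
    "title": ["altpress", "olsaintdick", "altpress",
              "ptxofficial", "altpress", "olsaintdick"]
})

# Count likes per account, most-liked first, and keep only the top two.
top_accounts = likes_sample["title"].value_counts().head(2)
print(top_accounts)
```

Limiting the result to a small top-N before plotting would also keep a bar graph's labels readable.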

### Reflection

Limitations of the data gathering and organizing presented here include my basic understanding of coding in Python. The timestamp values are not converted to clock time, which could support additional conclusions and lead to further questions for analysis. Because I did not know how to convert a DataFrame column from a timestamp to clock time in Python, this was left undone.
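For reference, a minimal sketch of one way the conversion could be done, assuming the timestamp values are seconds since the Unix epoch (an assumption that fits the size of the numbers but is not confirmed by the export). It uses `pd.to_datetime` with `unit="s"`; the column names `liked_at` and `hour` are hypothetical, and the three timestamps are taken from the table above.

```python
import pandas as pd

# Three timestamps copied from the likes table; assumed to be Unix seconds.
likes_sample = pd.DataFrame({"timestamp": [1466900196, 1466889685, 1466875030]})

# Convert to timezone-aware datetimes in UTC.
likes_sample["liked_at"] = pd.to_datetime(
    likes_sample["timestamp"], unit="s", utc=True
)

# Pull out the hour of day for later grouping.
likes_sample["hour"] = likes_sample["liked_at"].dt.hour
print(likes_sample)
```

The result is in UTC. To reason about local habits, the column would need `.dt.tz_convert(...)` with the appropriate time zone name before extracting the hour.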

Another way to prepare data about Instagram likes would be to use a different data set altogether, one that included multiple users' likes. The data I was able to request from Instagram was only about myself, which limited the posts to the ones I have interacted with.

Next steps include continuing to format the data in a readable way as well as seeing how the addition of clock time could reveal more information about *when* I choose to like things, not only *what* I choose to like. This can lead to further questions to ask about the data that I have not thought of yet. 

### Hypothesis

A question and proposed explanation I have as a result of organizing this data is as follows:

I suspect that accounts that primarily post earlier than 6:00 PM on any day of the week will receive 30% more likes from me than accounts that primarily post after that time.

I believe that this could be the case because I typically am either in class during the day or, on days without classes, working a night shift. On a day with class, I find myself always looking at my phone between classes and during lunch breaks. Although one might think that class would limit social media use, being on campus with my phone as my only source of entertainment leads me to check my feeds more. This is in contrast to the evening, when I am at home and may be focused on a series of tasks that I am less likely to take breaks between, like cooking dinner or playing video games. On days when I waitress, I often work in the evenings, which leaves my mornings with plenty of opportunities to check my phone and leaves me unable to look at it at all later in the day. On days when I have not looked at my phone for 8 hours, I am less likely to examine and like every photo in my Instagram feed, opting instead to skim through the missed posts and like a select few so that I can get "caught up" on my feed faster.
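A rough first check of this hypothesis could be sketched as below, with two loud caveats. First, the export records when I liked a post, not when the account posted it, so like time is only a proxy. Second, the hours here are in UTC, not my local time. The 18:00 cutoff is taken from the 6:00 PM figure above; the names `CUTOFF_HOUR`, `before`, `after`, and `percent_more` are hypothetical, and the five timestamps come from the likes table.

```python
import pandas as pd

# Five timestamps copied from the likes table; assumed to be Unix seconds.
likes_sample = pd.DataFrame(
    {"timestamp": [1466900196, 1466889685, 1466887802, 1466875086, 1466875030]}
)
liked_at = pd.to_datetime(likes_sample["timestamp"], unit="s", utc=True)

CUTOFF_HOUR = 18  # 6:00 PM, in UTC here (local time would need tz_convert)

# Count likes that happened before and at-or-after the cutoff hour.
before = int((liked_at.dt.hour < CUTOFF_HOUR).sum())
after = int((liked_at.dt.hour >= CUTOFF_HOUR).sum())

# Percent by which the "before" count exceeds the "after" count.
percent_more = (before - after) / after * 100 if after else float("nan")
print(before, after, percent_more)
```

On the full data set, a value near 30 would be in line with the hypothesis, though testing it properly would require the time each post was published, which this export does not include.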