## Example of data wrangling for Instagram comment file
Several students tried to read in the data from the Instagram comments file, but its nested structure can be difficult to work with. In this notebook, I walk through some of the steps you can take to tidy the data and begin working with that file.

In [1]:
import json
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
path = r"C:\Users\dsilva2\OneDrive - Kent State University\Research_KSU_Share\SocMed-Personal-Data\instagram-technogecko-2024-08-16-vdXEFDYF\your_instagram_activity\comments\post_comments_1.json"
with open(path) as j:
    com_dict = json.load(j)

#### This is the first issue that many people ran into. The JSON file is structured as a list, not as a dictionary, so `.keys()` fails. Instead, we should use tools for inspecting lists, such as `len()` and indexing.

In [3]:
com_dict.keys()

AttributeError: 'list' object has no attribute 'keys'

In [4]:
com_list = com_dict
len(com_list)

38
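
A quick way to avoid the `AttributeError` above is to check the top-level type before calling any dictionary methods. The following is a minimal sketch, not part of the original notebook: `describe_top_level` is a hypothetical helper name, and `raw` is a one-record stand-in built from the first comment shown in this notebook rather than the real file.

```python
import json

# Hypothetical stand-in for the file contents: one record copied from this notebook.
raw = (
    '[{"media_list_data": [{"uri": ""}], '
    '"string_map_data": {"Comment": {"value": "Were you entertained?"}, '
    '"Media Owner": {"value": "reams_esq"}, '
    '"Time": {"timestamp": 1656548242}}}]'
)
data = json.loads(raw)


def describe_top_level(obj):
    """Return a short description of a parsed JSON object's outer structure."""
    if isinstance(obj, dict):
        return f"dict with keys {list(obj.keys())}"
    if isinstance(obj, list):
        inner = type(obj[0]).__name__ if obj else "nothing"
        return f"list of {len(obj)} item(s); first item is a {inner}"
    return type(obj).__name__


summary = describe_top_level(data)
print(summary)
```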

In [5]:
com_list[0]

{'media_list_data': [{'uri': ''}],
 'string_map_data': {'Comment': {'value': 'Were you entertained?'},
  'Media Owner': {'value': 'reams_esq'},
  'Time': {'timestamp': 1656548242}}}

#### The list contains dictionaries, so this is a good place to start our DataFrame.

In [6]:
com_df1 = pd.DataFrame(com_list)
com_df1.head()

| | media_list_data | string_map_data |
|---|---|---|
| 0 | [{'uri': ''}] | {'Comment': {'value': 'Were you entertained?'}... |
| 1 | [{'uri': ''}] | {'Comment': {'value': 'Hi, doctor!'}, 'Media O... |
| 2 | [{'uri': ''}] | {'Comment': {'value': 'Fins Fardashian said it... |
| 3 | [{'uri': ''}] | {'Comment': {'value': 'This speaks to me. Hope... |
| 4 | [{'uri': ''}] | {'Comment': {'value': 'â½â½â½ð'}, 'Media... |


#### That is pretty messy, but we can see two things. First, the `media_list_data` column seems empty, but we can double-check that. Second, the majority of the data is in the `string_map_data` column. We will work there to tidy the DataFrame further.

In [7]:
com_df1['media_list_data'].iloc[0]

[{'uri': ''}]

#### The next cell averages the lengths of all the lists stored in the `media_list_data` column. The average is 1.0, which suggests that each list holds exactly one element (strictly speaking, a mix of shorter and longer lists could also average to 1.0). If every list has a single element, we can extract it without worrying about losing data.

In [8]:
sum([len(x) for x in com_df1['media_list_data']])/len(com_df1['media_list_data'])

1.0
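
An average of 1.0 is strong evidence but not proof, so a stricter check may be worth running. The sketch below counts how many lists there are of each length and confirms that every list has exactly one element. The `media` Series is a hypothetical three-row stand-in for `com_df1['media_list_data']`, built from the value shown above.

```python
import pandas as pd

# Hypothetical stand-in for com_df1['media_list_data'], using the value shown above.
media = pd.Series([[{'uri': ''}], [{'uri': ''}], [{'uri': ''}]])

# Length of each list, then how many lists there are of each length.
lengths = media.apply(len)
length_counts = lengths.value_counts()
print(length_counts)

# True only if every list holds exactly one element.
all_single = bool((lengths == 1).all())
print(all_single)
```

On the real data, the same two checks could be run by replacing `media` with `com_df1['media_list_data']`.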

In [9]:
com_df1['uri'] = [x[0]['uri'] for x in com_df1['media_list_data']]
com_df1['uri'].unique()

array([''], dtype=object)

#### The `.unique()` method shows each of the distinct values stored in the new `uri` column. As it turns out, the only value stored is an empty string, so it would be safe to remove and ignore this data completely.
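
If you do want to remove the empty columns, one option is `DataFrame.drop()`. This is a minimal sketch on a hypothetical two-row miniature of `com_df1` (`demo_df` is not a name from the notebook). It drops the columns only after confirming that `uri` holds nothing but empty strings.

```python
import pandas as pd

# Hypothetical miniature of com_df1 after the 'uri' column has been added.
demo_df = pd.DataFrame({
    'media_list_data': [[{'uri': ''}], [{'uri': ''}]],
    'string_map_data': [{'Comment': {'value': 'Were you entertained?'}},
                        {'Comment': {'value': 'Hi, doctor!'}}],
    'uri': ['', ''],
})

# Drop the media columns only if 'uri' contains nothing but empty strings.
if (demo_df['uri'] == '').all():
    demo_df = demo_df.drop(columns=['media_list_data', 'uri'])

print(demo_df.columns.tolist())
```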

In [10]:
com_df1['string_map_data'].iloc[0]

{'Comment': {'value': 'Were you entertained?'},
 'Media Owner': {'value': 'reams_esq'},
 'Time': {'timestamp': 1656548242}}

In [11]:
com_df2 = pd.DataFrame(list(com_df1['string_map_data']))
com_df2.head()

| | Comment | Media Owner | Time |
|---|---|---|---|
| 0 | {'value': 'Were you entertained?'} | {'value': 'reams_esq'} | {'timestamp': 1656548242} |
| 1 | {'value': 'Hi, doctor!'} | {'value': 'jackie.cc'} | {'timestamp': 1650653282} |
| 2 | {'value': 'Fins Fardashian said it would help ... | {'value': 'newyorkercartoons'} | {'timestamp': 1623332975} |
| 3 | {'value': 'This speaks to me. Hope you are wel... | {'value': 'klorenza___'} | {'timestamp': 1581818525} |
| 4 | {'value': 'â½â½â½ð'} | {'value': 'ali_saurusrex'} | {'timestamp': 1552781566} |


#### We are getting closer, but still have a few levels of the structure to work through.

In this limited case, we can assume that each cell of these columns holds only one dictionary. This is a great use case for the pandas function `json_normalize()`. This function often does a poor job of producing a tidy DataFrame, but it works well when there is limited nesting and hierarchy in the DataFrame or JSON file.

In [12]:
com_df3 = pd.json_normalize(list(com_df1['string_map_data']))
com_df3.head()

| | Comment.value | Media Owner.value | Time.timestamp |
|---|---|---|---|
| 0 | Were you entertained? | reams_esq | 1656548242 |
| 1 | Hi, doctor! | jackie.cc | 1650653282 |
| 2 | Fins Fardashian said it would help my achy bac... | newyorkercartoons | 1623332975 |
| 3 | This speaks to me. Hope you are well, friend. | klorenza___ | 1581818525 |
| 4 | â½â½â½ð | ali_saurusrex | 1552781566 |
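
As an alternative route, `json_normalize()` can also be applied to the original list directly, which skips the intermediate DataFrames. The sketch below uses `sample_list`, a hypothetical two-record stand-in for `com_list` built from the records shown above. Nested dictionaries are flattened into dotted column names, while the list in `media_list_data` is left as a list.

```python
import pandas as pd

# Hypothetical stand-in for com_list: two records copied from the outputs above.
sample_list = [
    {'media_list_data': [{'uri': ''}],
     'string_map_data': {'Comment': {'value': 'Were you entertained?'},
                         'Media Owner': {'value': 'reams_esq'},
                         'Time': {'timestamp': 1656548242}}},
    {'media_list_data': [{'uri': ''}],
     'string_map_data': {'Comment': {'value': 'Hi, doctor!'},
                         'Media Owner': {'value': 'jackie.cc'},
                         'Time': {'timestamp': 1650653282}}},
]

# Nested dictionaries become dotted column names; lists are not expanded.
flat = pd.json_normalize(sample_list)
print(flat.columns.tolist())
```

The column names come out longer this way (for example `string_map_data.Comment.value`), so the renaming step matters even more.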


#### I am now cleaning up the column names, replacing the defaults chosen by the `json_normalize()` function with something more understandable.

I am then applying the methods discussed in class to get frequency data on how many of my comments were made on posts owned by each account.

In [13]:
com_df4 = com_df3.rename(columns={'Comment.value': 'Comment', 'Media Owner.value': 'Media Owner', 'Time.timestamp': 'Timestamp'})
com_df4.head()

| | Comment | Media Owner | Timestamp |
|---|---|---|---|
| 0 | Were you entertained? | reams_esq | 1656548242 |
| 1 | Hi, doctor! | jackie.cc | 1650653282 |
| 2 | Fins Fardashian said it would help my achy bac... | newyorkercartoons | 1623332975 |
| 3 | This speaks to me. Hope you are well, friend. | klorenza___ | 1581818525 |
| 4 | â½â½â½ð | ali_saurusrex | 1552781566 |
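
The garbled characters in row 4 look like emoji whose UTF-8 bytes were stored as individual Latin-1 characters, a problem often reported with Instagram exports. The sketch below shows one possible repair, assuming that is what happened here. `fix_mojibake` is a hypothetical helper name, and the sample string is constructed for illustration rather than taken from the file.

```python
def fix_mojibake(text):
    """Re-interpret Latin-1 characters as UTF-8 bytes; return text unchanged on failure."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except UnicodeError:
        # The text was not garbled in this particular way, so leave it alone.
        return text


# Build a garbled sample by deliberately mis-decoding a soccer-ball emoji.
garbled = '\u26bd'.encode('utf-8').decode('latin-1')
repaired = fix_mojibake(garbled)
print(repaired)

# Plain ASCII comments pass through unchanged.
print(fix_mojibake('Hi, doctor!'))
```

On the real data, this could be applied with `com_df4['Comment'].map(fix_mojibake)`, though the result should be checked by eye before relying on it.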


In [14]:
(com_df4.groupby('Media Owner')
        .count()
        .sort_values('Comment', ascending=False))

| Media Owner | Comment | Timestamp |
|---|---|---|
| technogecko | 18 | 18 |
| ali_saurusrex | 8 | 8 |
| thechadlarson | 3 | 3 |
| a_matt_silva | 1 | 1 |
| colin_storm | 1 | 1 |
| danneabreanne | 1 | 1 |
| fanatikbikeco | 1 | 1 |
| jackie.cc | 1 | 1 |
| klorenza___ | 1 | 1 |
| mtb_crohnert92 | 1 | 1 |
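
Because the `Comment` and `Timestamp` counts are identical, a single column of counts carries the same information. A minimal sketch of one way to get it uses `value_counts()`, which counts the rows per owner and sorts them in descending order by default. `demo` is a hypothetical miniature of `com_df4` with placeholder comment text.

```python
import pandas as pd

# Hypothetical miniature of com_df4; comment text is placeholder data.
demo = pd.DataFrame({
    'Comment': ['a', 'b', 'c', 'd'],
    'Media Owner': ['technogecko', 'technogecko', 'ali_saurusrex', 'technogecko'],
})

# Number of comments per account, sorted from most to fewest.
owner_counts = demo['Media Owner'].value_counts()
print(owner_counts)
```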


#### As it turns out, most of my comments are on posts from my own account. Presumably these are replies to other people's comments, but we would have to check back with the account to verify this assumption.
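
One possible next step is to make the `Timestamp` column readable. The sketch below assumes the values are Unix epoch seconds, which their magnitude suggests but the export does not state. `demo` is a hypothetical two-row miniature of `com_df4` using values from the outputs above.

```python
import pandas as pd

# Hypothetical miniature of com_df4, using values from the outputs above.
demo = pd.DataFrame({
    'Media Owner': ['reams_esq', 'jackie.cc'],
    'Timestamp': [1656548242, 1650653282],
})

# Interpret the integers as seconds since 1970-01-01 00:00:00 UTC.
demo['Datetime'] = pd.to_datetime(demo['Timestamp'], unit='s')
print(demo[['Media Owner', 'Datetime']])
```

With a datetime column in place, comments can be sorted or grouped by year or month using the `.dt` accessor.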