# Importing libraries and loading JSON file

In [341]:
import pandas as pd
import numpy as np
import json
import datetime

In [316]:
with open('chessbuds_messages.json') as j:
    chess_buds = json.load(j)

# Exploring the type of data
**I first checked that the JSON file had been loaded as a dictionary, then viewed its keys, which helped me identify the main variables.**

In [317]:
type(chess_buds)

dict

In [318]:
#chess_buds

In [319]:
chess_buds.keys()

dict_keys(['participants', 'messages', 'title', 'is_still_participant', 'thread_type', 'thread_path', 'magic_words', 'joinable_mode'])

# Checking key types
**I wrote a function that prints the type of the value stored under each key, which helped me understand the structure of the nested data. I then inspected each nested element to see what it contained and which elements could be passed to the DataFrame constructor for the tidying step. In detail, each entry in 'participants' has only a 'name' field, and the other top-level elements did not need DataFrames of their own for analysis purposes. The 'messages' data, however, contains a list of entries, each represented as a dictionary with various keys. Some of these keys combine multiple observational units in nested data structures, such as 'reactions' and 'bumped_message_metadata', which goes against Wickham's tidy data principles.**

In [320]:
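def check_key_type(d):
    # Print each top-level key with the type of its value
    # (the parameter is named d so it does not shadow the built-in dict)
    for key in d.keys():
        print(key, type(d[key]))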
        
check_key_type(chess_buds)

participants <class 'list'>
messages <class 'list'>
title <class 'str'>
is_still_participant <class 'bool'>
thread_type <class 'str'>
thread_path <class 'str'>
magic_words <class 'list'>
joinable_mode <class 'dict'>


In [321]:
chess_buds['participants']

[{'name': 'Scott Pence'},
 {'name': 'Chad Larson'},
 {'name': 'Joanna Rusch'},
 {'name': 'Angela Babbitt Pence'},
 {'name': 'David Silva'},
 {'name': 'Aaron Rusch'},
 {'name': 'Timothy Vanderpool'}]

In [269]:
chess_buds['title']

'Chess Buds'

In [270]:
chess_buds['is_still_participant']

True

In [271]:
chess_buds['magic_words']

[]

In [272]:
chess_buds['joinable_mode']

{'mode': 1, 'link': ''}

In [305]:
chess_buds['messages']
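
Because the full 'messages' output is long, one way to summarize it is to count how often each key appears across the message dictionaries. The sketch below is only an illustration: `sample_messages` is a small made-up list standing in for `chess_buds['messages']`, and its values are not taken from the real file.

```python
from collections import Counter

# Hypothetical stand-in for chess_buds['messages']; the real list is much longer.
sample_messages = [
    {"sender_name": "A", "timestamp_ms": 1, "content": "hi", "reactions": []},
    {"sender_name": "B", "timestamp_ms": 2, "content": "look", "photos": [{"uri": "p.jpg"}]},
    {"sender_name": "A", "timestamp_ms": 3, "gifs": [{"uri": "g.gif"}]},
]

# Count how many messages carry each key.
key_counts = Counter(key for message in sample_messages for key in message)

for key, count in key_counts.most_common():
    print(f"{key}: {count} of {len(sample_messages)} messages")
```

Keys that appear in only a few messages become mostly-NaN columns once the list is passed to `pd.DataFrame`, so a count like this gives an early hint about which columns will be sparse.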

# Creating a dataframe from 'messages'
**I converted 'messages' into a DataFrame 'cb_m_df' and checked the initial structure.**

In [322]:
cb_m_df = pd.DataFrame(chess_buds['messages'])
cb_m_df.head()

Unnamed: 0,sender_name,timestamp_ms,content,reactions,type,is_unsent,is_taken_down,bumped_message_metadata,share,photos,gifs,users
0,Joanna Rusch,1666374933946,Maybe he just wants to ride the publicity for ...,"[{'reaction': 'ð', 'actor': 'Chad Larson'},...",Generic,False,False,{'bumped_message': 'Maybe he just wants to rid...,,,,
1,Chad Larson,1666373448613,To be fair to Hans....no one wants to be assoc...,"[{'reaction': 'ð', 'actor': 'Scott Pence'},...",Generic,False,False,{'bumped_message': 'To be fair to Hans....no o...,,,,
2,Chad Larson,1666373216381,He would have to prove he didn't cheat and tha...,"[{'reaction': 'ð', 'actor': 'Scott Pence'},...",Generic,False,False,{'bumped_message': 'He would have to prove he ...,,,,
3,Scott Pence,1666373164883,"Yeah, no way. You over shoot and hope to get a...","[{'reaction': 'ð', 'actor': 'Chad Larson'},...",Generic,False,False,"{'bumped_message': 'Yeah, no way. You over sho...",,,,
4,Chad Larson,1666373111157,"From what I see, I don't think he could win. ...",,Generic,False,False,"{'bumped_message': 'From what I see, I don't t...",,,,


# Converting timestamps to datetime
**I converted the 'timestamp_ms' column from milliseconds to a readable datetime format, then split it into date and time and added them as new columns to 'cb_m_df'.**

In [345]:
def convert_timestamps(df, timestamp_column = 'timestamp_ms'):
    df['full_datetime'] = pd.to_datetime(df[timestamp_column], unit='ms')
    df['date'] = df['full_datetime'].dt.date
    df['time'] = df['full_datetime'].dt.strftime('%H:%M:%S')
    return df
cb_m_df = convert_timestamps(cb_m_df)
cb_m_df.head()

Unnamed: 0,sender_name,timestamp_ms,content,reactions,type,is_unsent,is_taken_down,bumped_message_metadata,share,photos,gifs,users,full_datetime,date,time
0,Joanna Rusch,1666374933946,Maybe he just wants to ride the publicity for ...,"[{'reaction': 'ð', 'actor': 'Chad Larson'},...",Generic,False,False,{'bumped_message': 'Maybe he just wants to rid...,,,,,2022-10-21 17:55:33.946,2022-10-21,17:55:33
1,Chad Larson,1666373448613,To be fair to Hans....no one wants to be assoc...,"[{'reaction': 'ð', 'actor': 'Scott Pence'},...",Generic,False,False,{'bumped_message': 'To be fair to Hans....no o...,,,,,2022-10-21 17:30:48.613,2022-10-21,17:30:48
2,Chad Larson,1666373216381,He would have to prove he didn't cheat and tha...,"[{'reaction': 'ð', 'actor': 'Scott Pence'},...",Generic,False,False,{'bumped_message': 'He would have to prove he ...,,,,,2022-10-21 17:26:56.381,2022-10-21,17:26:56
3,Scott Pence,1666373164883,"Yeah, no way. You over shoot and hope to get a...","[{'reaction': 'ð', 'actor': 'Chad Larson'},...",Generic,False,False,"{'bumped_message': 'Yeah, no way. You over sho...",,,,,2022-10-21 17:26:04.883,2022-10-21,17:26:04
4,Chad Larson,1666373111157,"From what I see, I don't think he could win. ...",,Generic,False,False,"{'bumped_message': 'From what I see, I don't t...",,,,,2022-10-21 17:25:11.157,2022-10-21,17:25:11
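
One detail worth keeping in mind is that `pd.to_datetime(..., unit='ms')` returns timezone-naive values that represent UTC, not the participants' local time. The sketch below uses the first timestamp from the table above to show the difference between the naive result and an explicitly UTC-labelled one; the variable names are illustrative only.

```python
import pandas as pd

# One timestamp taken from the first row of the table above.
ts = pd.Series([1666374933946], name="timestamp_ms")

# Without utc=True the result is timezone-naive; the clock time shown is UTC.
naive = pd.to_datetime(ts, unit="ms")

# With utc=True the same instant is labelled explicitly as UTC.
aware = pd.to_datetime(ts, unit="ms", utc=True)

print(naive.iloc[0])
print(aware.iloc[0])
```

If local times matter for the analysis, `Series.dt.tz_convert` could be applied to the UTC-aware column, but the export does not state which time zone the participants were in, so that choice would be an assumption.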


**I first checked the type of 'reactions' in a single cell and confirmed that it was structured as a list of nested records with 'reaction' and 'actor' fields. Next, I iterated over the 'reactions' column, which showed me the full range of values it holds ('reaction', 'actor', and NaN). For a simpler numeric summary per message, I decided to quantify the engagement each message received and focus on the number of reactions. Specifically, I flattened 'reactions' into a list of lists, each containing only the 'reaction' values, and then counted the number of reactions for each message.**

In [275]:
type(cb_m_df['reactions'].iloc[0])

list

In [306]:
[x for x in cb_m_df['reactions']]
#[x for x in cb_m_df['reactions'][0]]

In [323]:
reactions = [x['reaction'] for x in cb_m_df['reactions'].iloc[0]]
reactions
#len(reactions)

['ð\x9f\x91\x8d', 'ð\x9f\x91\x8d']
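
The reaction strings above look like emoji whose UTF-8 bytes were decoded as Latin-1. That is an assumption about how the export was encoded, not something the file states. Under that assumption, a minimal repair sketch is to reverse the mistake; `fix_mojibake` is a hypothetical helper name, not part of the notebook.

```python
# The string shown in the output above ('\xf0' is the character displayed as 'ð').
garbled = '\xf0\x9f\x91\x8d'

def fix_mojibake(text):
    """Re-encode as Latin-1 and decode as UTF-8; return the input unchanged if that fails."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

repaired = fix_mojibake(garbled)
print(repaired)
```

Since only the number of reactions is used later, this repair is optional for the tidy DataFrame; it would matter if the analysis distinguished between reaction types.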

# Handling NA values
**I knew there were NA values, which are not lists. I asked GPT for help skipping them; my prompt asked how to iterate over a list without failing on NA values, and it suggested the isinstance() function.
After that, I extracted 'reaction' from each original 'reactions' list and counted the number of reactions per message (setting NA values to zero). I then added the result to the DataFrame 'cb_m_df' as a new column named 'number_of_reactions'.**

In [346]:
#if no nan values
#reactions = [[y['reaction'] for y in x]
#             for x in cb_m_df['reactions']]

reactions = [[y['reaction'] for y in x] if isinstance(x, list) else []
             for x in cb_m_df['reactions']]
#reactions

In [347]:
number_of_reactions = [len(x) if isinstance(x, list) else 0
                       for x in reactions]
#number_of_reactions

In [348]:
cb_m_df['number_of_reactions'] = number_of_reactions
cb_m_df.head()

Unnamed: 0,sender_name,timestamp_ms,content,reactions,type,is_unsent,is_taken_down,bumped_message_metadata,share,photos,gifs,users,full_datetime,date,time,number_of_reactions
0,Joanna Rusch,1666374933946,Maybe he just wants to ride the publicity for ...,"[{'reaction': 'ð', 'actor': 'Chad Larson'},...",Generic,False,False,{'bumped_message': 'Maybe he just wants to rid...,,,,,2022-10-21 17:55:33.946,2022-10-21,17:55:33,2
1,Chad Larson,1666373448613,To be fair to Hans....no one wants to be assoc...,"[{'reaction': 'ð', 'actor': 'Scott Pence'},...",Generic,False,False,{'bumped_message': 'To be fair to Hans....no o...,,,,,2022-10-21 17:30:48.613,2022-10-21,17:30:48,2
2,Chad Larson,1666373216381,He would have to prove he didn't cheat and tha...,"[{'reaction': 'ð', 'actor': 'Scott Pence'},...",Generic,False,False,{'bumped_message': 'He would have to prove he ...,,,,,2022-10-21 17:26:56.381,2022-10-21,17:26:56,2
3,Scott Pence,1666373164883,"Yeah, no way. You over shoot and hope to get a...","[{'reaction': 'ð', 'actor': 'Chad Larson'},...",Generic,False,False,"{'bumped_message': 'Yeah, no way. You over sho...",,,,,2022-10-21 17:26:04.883,2022-10-21,17:26:04,2
4,Chad Larson,1666373111157,"From what I see, I don't think he could win. ...",,Generic,False,False,"{'bumped_message': 'From what I see, I don't t...",,,,,2022-10-21 17:25:11.157,2022-10-21,17:25:11,0
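
Counting reactions discards who reacted and with what. If that detail were needed, an alternative would be to treat reactions as their own observational unit and store them in a separate long table with one row per reaction. The sketch below runs on a made-up `messages` frame, and the names `reactions_long` and `message_id` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like cb_m_df: a list of dicts per message, or NaN.
messages = pd.DataFrame({
    "sender_name": ["A", "B", "C"],
    "reactions": [
        [{"reaction": "+1", "actor": "B"}, {"reaction": "+1", "actor": "C"}],
        np.nan,
        [{"reaction": "ha", "actor": "A"}],
    ],
})

# One row per reaction; messages without reactions drop out.
exploded = messages["reactions"].explode().dropna()

# Expand each dict into columns and keep the message index as a key.
reactions_long = (
    pd.DataFrame(exploded.tolist(), index=exploded.index)
    .rename_axis("message_id")
    .reset_index()
)
print(reactions_long)
```

The `message_id` column links each reaction back to its row in the messages table, so the per-message count can still be recovered with a `groupby`.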


**I then checked the type of 'bumped_message_metadata' and confirmed that it was structured as a dict with the nested fields 'bumped_message' and 'is_bumped'. For clarity of data handling, I decided to extract 'bumped_message' and 'is_bumped' and represent them as separate columns in 'cb_m_df'. However, I found that the text of 'bumped_message' was the same as that of 'content', so I dropped the 'bumped_message' data.**

In [327]:
type(cb_m_df['bumped_message_metadata'].iloc[0])

dict

In [331]:
#[x for x in cb_m_df['bumped_message_metadata'][0]]
#cb_m_df['bumped_message_metadata']
[x for x in cb_m_df['bumped_message_metadata']]

In [349]:
is_bumped = [x['is_bumped'] for x in cb_m_df['bumped_message_metadata']]
cb_m_df['is_bumped'] = is_bumped
#is_bumped
cb_m_df.head()

Unnamed: 0,sender_name,timestamp_ms,content,reactions,type,is_unsent,is_taken_down,bumped_message_metadata,share,photos,gifs,users,full_datetime,date,time,number_of_reactions,is_bumped
0,Joanna Rusch,1666374933946,Maybe he just wants to ride the publicity for ...,"[{'reaction': 'ð', 'actor': 'Chad Larson'},...",Generic,False,False,{'bumped_message': 'Maybe he just wants to rid...,,,,,2022-10-21 17:55:33.946,2022-10-21,17:55:33,2,False
1,Chad Larson,1666373448613,To be fair to Hans....no one wants to be assoc...,"[{'reaction': 'ð', 'actor': 'Scott Pence'},...",Generic,False,False,{'bumped_message': 'To be fair to Hans....no o...,,,,,2022-10-21 17:30:48.613,2022-10-21,17:30:48,2,False
2,Chad Larson,1666373216381,He would have to prove he didn't cheat and tha...,"[{'reaction': 'ð', 'actor': 'Scott Pence'},...",Generic,False,False,{'bumped_message': 'He would have to prove he ...,,,,,2022-10-21 17:26:56.381,2022-10-21,17:26:56,2,False
3,Scott Pence,1666373164883,"Yeah, no way. You over shoot and hope to get a...","[{'reaction': 'ð', 'actor': 'Chad Larson'},...",Generic,False,False,"{'bumped_message': 'Yeah, no way. You over sho...",,,,,2022-10-21 17:26:04.883,2022-10-21,17:26:04,2,False
4,Chad Larson,1666373111157,"From what I see, I don't think he could win. ...",,Generic,False,False,"{'bumped_message': 'From what I see, I don't t...",,,,,2022-10-21 17:25:11.157,2022-10-21,17:25:11,0,False


In [314]:
#bumped_message = [x['bumped_message'] if isinstance(x, dict) and 'bumped_message' in x else np.nan
#                    for x in cb_m_df['bumped_message_metadata']
#                    ]
#cb_m_df['bumped_message'] = bumped_message
#cb_m_df.head()
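
The claim that 'bumped_message' duplicates 'content' can be checked programmatically before the column is dropped. The sketch below shows one way to do that on a made-up two-row frame; `get_field` is a hypothetical helper that also tolerates non-dict cells.

```python
import pandas as pd

# Hypothetical frame shaped like cb_m_df.
df = pd.DataFrame({
    "content": ["hello", "second message"],
    "bumped_message_metadata": [
        {"bumped_message": "hello", "is_bumped": False},
        {"bumped_message": "second message", "is_bumped": False},
    ],
})

def get_field(cell, key):
    """Return cell[key] for dict cells, None for anything else (e.g. NaN)."""
    return cell.get(key) if isinstance(cell, dict) else None

bumped = pd.Series(
    [get_field(c, "bumped_message") for c in df["bumped_message_metadata"]],
    index=df.index,
)

# True only if every bumped_message equals the message content in the same row.
is_redundant = bool(bumped.eq(df["content"]).all())
n_mismatches = int((~bumped.eq(df["content"])).sum())
print(is_redundant, n_mismatches)
```

On the real data, a non-zero mismatch count would be a reason to keep 'bumped_message' as its own column.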

# Alternative tidy format
**I noticed that most values in the 'share', 'photos', 'gifs', and 'users' columns are NaN. To make this judgment objective, I counted the non-NaN entries in each column (20, 14, 16, and 4, respectively). The DataFrame has 223 rows in total, so even the 'share' column, which has the highest proportion of valid entries, does not reach 9%. My decision was to keep and clean the entries in these columns for data integrity, but an alternative tidy format would be to delete these variables, since their valid values fall below a 10% threshold.**
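
The proportions behind this decision can be worked out directly from the counts reported above. The sketch below first reproduces that arithmetic, then shows how the same summary could be computed from a DataFrame in one step; the `toy` frame and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Counts and row total as reported in the text above.
reported_counts = pd.Series({"share": 20, "photos": 14, "gifs": 16, "users": 4})
total_rows = 223
reported_share = reported_counts / total_rows
print(reported_share.round(4))

# The same summary computed from a (hypothetical) frame in one pass.
toy = pd.DataFrame({
    "share": [{"link": "https://example.com"}, np.nan, np.nan, np.nan],
    "photos": [np.nan, [{"uri": "a.jpg"}], [{"uri": "b.jpg"}], np.nan],
})
summary = pd.DataFrame({
    "non_null": toy.notna().sum(),
    "share_of_rows": toy.notna().mean(),
})

# Columns that would be dropped under a 10% threshold.
threshold = 0.10
below_threshold = summary.index[summary["share_of_rows"] < threshold].tolist()
print(summary)
print(below_threshold)
```

With the reported counts, every one of the four columns falls below the 10% threshold, which is what makes dropping them a defensible alternative.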

# Handling the 'gifs', 'share', 'users', and 'photos' variables
**My strategy was to extract the essential content (the URI, link, or name) from each cell and drop the surrounding keys.**

In [118]:
not_na_gifs = cb_m_df['gifs'].notna()
count_not_na_gifs = not_na_gifs.sum()
count_not_na_gifs

np.int64(16)

In [287]:
non_nan_gifs_df = cb_m_df[not_na_gifs]
non_nan_gifs_df['gifs'].iloc[0]
#[x for x in cb_m_df['gifs']]

[{'uri': 'messages/inbox/chessbuds_npjakt9u1g/gifs/271509378_440207271109794_8171423686120017391_n_1734383773606976.gif'}]

In [350]:
gifs_2 = [x[0]['uri'] if isinstance(x, list) and len(x) > 0 else x for x in cb_m_df['gifs']]
#gifs_2
cb_m_df['gifs_2'] = gifs_2

In [310]:
#[x for x in cb_m_df['share']]

In [134]:
not_na_share = cb_m_df['share'].notna()
count_not_na_share = not_na_share.sum()
count_not_na_share
#non_nan_share_df = cb_m_df[not_na_share]
#non_nan_share_df

np.int64(20)

In [351]:
share_2 = [x['link'] if isinstance(x, dict) and 'link' in x else x for x in cb_m_df['share']]
#share_2
cb_m_df['share_2'] = share_2

In [128]:
count_not_na_users = (cb_m_df['users'].notna()).sum()
count_not_na_users

np.int64(4)

In [129]:
count_not_na_photos = (cb_m_df['photos'].notna()).sum()
count_not_na_photos

np.int64(14)

In [352]:
#[x for x in cb_m_df['photos']]
photos_2 = [x[0]['uri'] if isinstance(x, list) and x else x for x in cb_m_df['photos']]
#photos_2
cb_m_df['photos_2'] = photos_2

In [353]:
#[x for x in cb_m_df['users']]
users_2 = [x[0]['name'] if isinstance(x, list) and x else x for x in cb_m_df['users']]
#users_2
cb_m_df['users_2'] = users_2
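
The extractions above keep only the first item of each list (`x[0]`). If a message can carry more than one photo, GIF, or user, the remaining items are lost. Whether that happens in this export is not shown here, so the sketch below is a precaution under that assumption, with `extract_all` as a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Hypothetical cells shaped like the 'photos' column: NaN, one item, several items.
photos = pd.Series([np.nan, [{"uri": "a.jpg"}], [{"uri": "b.jpg"}, {"uri": "c.jpg"}]])

def extract_all(cell, key):
    """Return every value stored under `key` in a list of dicts, or NaN for other cells."""
    if isinstance(cell, list) and cell:
        return [item[key] for item in cell if isinstance(item, dict) and key in item]
    return np.nan

# The approach used above: first item only.
first_only = [x[0]["uri"] if isinstance(x, list) and x else x for x in photos]

# Alternative: keep every URI, plus a per-message count.
all_uris = [extract_all(x, "uri") for x in photos]
n_photos = [len(x) if isinstance(x, list) else 0 for x in all_uris]
print(first_only)
print(n_photos)
```

A list of URIs in one cell is still not tidy, so a count column like `n_photos` (analogous to 'number_of_reactions') or a separate long table would be the tidier way to keep this information.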

In [354]:
cb_m_df.head()

Unnamed: 0,sender_name,timestamp_ms,content,reactions,type,is_unsent,is_taken_down,bumped_message_metadata,share,photos,...,users,full_datetime,date,time,number_of_reactions,is_bumped,gifs_2,share_2,photos_2,users_2
0,Joanna Rusch,1666374933946,Maybe he just wants to ride the publicity for ...,"[{'reaction': 'ð', 'actor': 'Chad Larson'},...",Generic,False,False,{'bumped_message': 'Maybe he just wants to rid...,,,...,,2022-10-21 17:55:33.946,2022-10-21,17:55:33,2,False,,,,
1,Chad Larson,1666373448613,To be fair to Hans....no one wants to be assoc...,"[{'reaction': 'ð', 'actor': 'Scott Pence'},...",Generic,False,False,{'bumped_message': 'To be fair to Hans....no o...,,,...,,2022-10-21 17:30:48.613,2022-10-21,17:30:48,2,False,,,,
2,Chad Larson,1666373216381,He would have to prove he didn't cheat and tha...,"[{'reaction': 'ð', 'actor': 'Scott Pence'},...",Generic,False,False,{'bumped_message': 'He would have to prove he ...,,,...,,2022-10-21 17:26:56.381,2022-10-21,17:26:56,2,False,,,,
3,Scott Pence,1666373164883,"Yeah, no way. You over shoot and hope to get a...","[{'reaction': 'ð', 'actor': 'Chad Larson'},...",Generic,False,False,"{'bumped_message': 'Yeah, no way. You over sho...",,,...,,2022-10-21 17:26:04.883,2022-10-21,17:26:04,2,False,,,,
4,Chad Larson,1666373111157,"From what I see, I don't think he could win. ...",,Generic,False,False,"{'bumped_message': 'From what I see, I don't t...",,,...,,2022-10-21 17:25:11.157,2022-10-21,17:25:11,0,False,,,,


# Removing the original complex columns
**In this final step, I dropped the original columns with nested structures and created a tidy version of DataFrame 'cb_m_df_tidy'.**

In [355]:
cb_m_df_tidy = cb_m_df.drop(columns = ['timestamp_ms', 'full_datetime', 'reactions','share','photos','gifs','users','bumped_message_metadata'])
cb_m_df_tidy.head()

Unnamed: 0,sender_name,content,type,is_unsent,is_taken_down,date,time,number_of_reactions,is_bumped,gifs_2,share_2,photos_2,users_2
0,Joanna Rusch,Maybe he just wants to ride the publicity for ...,Generic,False,False,2022-10-21,17:55:33,2,False,,,,
1,Chad Larson,To be fair to Hans....no one wants to be assoc...,Generic,False,False,2022-10-21,17:30:48,2,False,,,,
2,Chad Larson,He would have to prove he didn't cheat and tha...,Generic,False,False,2022-10-21,17:26:56,2,False,,,,
3,Scott Pence,"Yeah, no way. You over shoot and hope to get a...",Generic,False,False,2022-10-21,17:26:04,2,False,,,,
4,Chad Larson,"From what I see, I don't think he could win. ...",Generic,False,False,2022-10-21,17:25:11,0,False,,,,


In [356]:
cb_m_df_tidy.to_csv('cb_m_df_tidy.csv', index=False)

# Why my final DataFrame meets tidy data principles
**I first converted the timestamps into a readable datetime format, which is much easier for audiences to understand. I then removed unnecessary keys like "uri" and "link" from individual cells, separated multiple variables that were stored in one column, such as "bumped_message_metadata", and converted complex information into numerical variables that are easier to analyze further ("reactions" to "number_of_reactions"). In my final DataFrame, each row represents one observation (a message: who sent it, at what time, with what content, and how many reactions it received), and each column is a single variable, such as "number_of_reactions" or "is_bumped". Analysts and audiences can easily read and extract the variables or data they need from this tidy DataFrame.**
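
One quick way to support this claim is to confirm that no column in the final DataFrame still holds lists or dicts. The sketch below shows such a check on a made-up frame; `nested_columns` is a hypothetical helper, and passing this check is a necessary condition for tidiness, not proof of it.

```python
import pandas as pd

def nested_columns(df):
    """Return the names of columns that hold at least one list or dict."""
    return [
        col for col in df.columns
        if df[col].map(lambda v: isinstance(v, (list, dict))).any()
    ]

# Hypothetical frame with one nested column and two flat ones.
toy = pd.DataFrame({
    "sender_name": ["A", "B"],
    "reactions": [[{"reaction": "x", "actor": "B"}], None],
    "number_of_reactions": [1, 0],
})

print(nested_columns(toy))
print(nested_columns(toy.drop(columns=["reactions"])))
```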

# Potential visualization
## 1. Scatter plot visualization
**This scatter plot visualizes each message as a point, where the color of each point represents its sender. Each sender is assigned a unique color so that they can be easily distinguished. The x-axis is 'date', showing when each message was sent, and the y-axis is 'number_of_reactions', the number of reactions each message received. If the plot is made interactive, hovering the mouse over any point will display a box with the exact time and message content.**
## 2. Bar chart visualization
**The bar chart will visualize the total number of reactions each sender received across all their messages. Each bar on the x-axis represents a sender, and the height of the bar on the y-axis indicates the total number of reactions they received.**
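
The data behind the bar chart is a per-sender sum of 'number_of_reactions'. The sketch below shows that aggregation on a small made-up frame with the same two column names as the tidy DataFrame; the values are not from the real chat.

```python
import pandas as pd

# Hypothetical rows with the same columns as cb_m_df_tidy.
toy = pd.DataFrame({
    "sender_name": ["A", "B", "A", "C"],
    "number_of_reactions": [2, 0, 1, 4],
})

# Total reactions per sender, largest first, ready to plot as bars.
totals = (
    toy.groupby("sender_name")["number_of_reactions"]
    .sum()
    .sort_values(ascending=False)
)
print(totals)

# totals.plot.bar() would draw the chart, assuming matplotlib is installed.
```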