# Workflow: Replace usernames with pseudo data

### Steps ###
0. **Formulate Goal**
1. **Determine which variable should be selected to reach goal**
2. **Select raw data source**
3. **List available information in raw data source**
4. **Create dictionary with usernames and substitutes**
5. **Create function to search for sensitive info and replace it with pseudo data**
____________________________________________________________________________________________________________

In [1]:
import json
import pandas as pd
from pathlib import Path
import numpy as np
import emojis
import emoji
import regex
import matplotlib.pyplot as plt
from datetime import date
import re

### 0. Goal

**Develop a generic function that searches the (Instagram) .json files for labels holding sensitive info**

### 1. Variable

* Automatically find all usernames in all .json files
* Automatically create a dictionary with usernames as original key and (pseudo) anonymized username as substitute
* Automatically change sensitive user information to (pseudo) anonymized key


### 2. Raw data source

**Instagram**

Instagram download contains the following folders:

* direct  > date folders (YYDDMM) > photos sent directly to other users via 'message' on that day
* photos  > date folders (YYDDMM) > photos posted on your 'page' that day
* videos  > date folders (YYDDMM) > videos posted on your 'page' that day
* stories > date folders (YYDDMM) > photos posted on your 'story' that day
* profile > date folders (YYDDMM) > photo used as profile picture on that day

Instagram download contains the following files (not in folders):
0. **information_about_you**: your primary location (home address)
1. **searches**: your search info on instagram with corresponding timestamp
2. **autofill**: ? (*'You have no data in this section'*)
3. **checkout**: the email of the payment account (N.B. Instagram is free)
4. **connections**: all your connections with corresponding timestamps (e.g., when did you start following them or vice versa)
5. **devices**: information about the used devices
6. **likes**: likes of media posts and comments of other users with corresponding timestamp
7. **media**: caption of photo posts, video posts, and stories with corresponding timestamp and path to corresponding media (within download)
8. **seen_content**: all content (posts, videos, ads, chains) you've seen on Instagram with corresponding timestamp and author (username of poster)
9. **settings**: account settings (e.g., allow comments from)
10. **stories_activities**: your activity on story polls of other users
11. **account_history**: info of logged in devices (e.g., IP address) and registration info (e.g., name, email)
12. **comments**: your comments on other (unknown) users' posts with corresponding timestamp
13. **messages**: private messages between you and other users with corresponding timestamps, shared media, links, etc.
14. **profile**: all information about your profile (e.g., username, email, full name, start date, etc.)
15. **saved**: all saved media with corresponding timestamp and owner of media (username)
16. **uploaded_contacts**: ? (*'You have no data in this section'*)


### 3. List available information

#### What sensitive info is where?

Files containing 'usernames' 
* 1. searches.json --> username of other users (direct)
* 4. connections.json --> username of other users (direct) (your connections: people following you, or people you follow)
* 6. likes.json --> username of other users (direct)
* 7. media.json --> username of other users (indirect) (within your caption you can tag people with @username)
* 8. seen_content.json --> username of other (unknown) users (direct)
* 10. stories_activities.json --> username of other users (direct)
* 11. account_history.json --> registration_info list
* 12. comments.json --> username of other (unknown) users (direct + indirect) (within your comment you can tag people with @username)
* 13. messages.json --> username of other users (direct + indirect) (within your messages you can tag people with @username, and the full names of users are also used frequently, e.g., 'hey Kees! how are you?')
* 14. profile.json --> username of your account (direct)
* 15. saved.json --> username of other users (direct)
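
The 'indirect' usernames listed above sit inside free text (captions, comments, messages) as @username tags. A minimal sketch of how such tags could be pulled out with a regular expression is given below. The pattern and the `find_mentions` helper are assumptions made for illustration: the sketch assumes usernames consist of letters, digits, underscores and periods, and it is not part of the workflow itself.

```python
import re

# Assumed username alphabet: letters, digits, underscores and periods
MENTION_PATTERN = re.compile(r'@([A-Za-z0-9_.]{1,30})')

def find_mentions(text):
    """Return the set of usernames tagged with @ in a piece of text."""
    # A trailing period is treated as sentence punctuation and dropped
    mentions = {match.rstrip('.') for match in MENTION_PATTERN.findall(text)}
    # Discard empty leftovers such as '@.'
    mentions.discard('')
    return mentions

caption = 'Lovely day with @sophie_soof and @al.bert.0. Thanks @beberson!'
print(find_mentions(caption))
```

Note that this simple pattern also matches the domain part of an email address (e.g., the text after the @ in name@gmail.com), so email addresses would need to be handled before tags are extracted.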

Files containing other personal info
* 0. information_about_you.json --> primary location (home address)
* 3. checkout.json --> payment_account_emails
* 5. devices.json --> device_id
* 11. account_history.json --> login_history (e.g., IP address, device id) and registration_info (e.g., name, email)
* 14. profile.json --> all profile info (e.g., email, gender, name, link to profile picture, username, etc.)


### 4. Create dictionary with usernames and substitutes

#### Download data from instagram
* Go to your profile and click on the wheel.
* Click Privacy and Security.
* Scroll down to Data Download and click Request Download.
* Enter the email address where you'd like to receive a link to your data and click Next.
* Enter your Instagram account password and click Request Download.
* You'll soon receive an email titled Your Instagram Data with a link to your data. Click Download Data and follow the instructions to finish downloading your information

#### Enter relevant information
In said email, Instagram provides your data in a zip folder with the following title: 'username_currentdate'. Save this zipped folder in your *Downloads* folder. N.B. Make sure to save this workflow script in the same folder as the data!

To correctly access your instagram data, please fill in your username and name (the latter is for accessing your computer path correctly: e.g., C:/Users/*name*/Downloads).

In [2]:
username = 'roosvoor' # Enter username here
name = 'Roos' # Enter the name of your user folder on this computer here

# If you downloaded the data today (formatted as YYYYMMDD)
datum = date.today().strftime('%Y%m%d')

# If you downloaded the data on another date
# datum = '' # Enter date in format YYYYMMDD

#### Read files

In [3]:
# Set path to zip folder
folder = username+'_'+datum+'.zip'
project = Path('C:/Users/'+name+'/Downloads')
data = project / folder
data

WindowsPath('C:/Users/Roos/Downloads/roosvoor_20200605.zip')
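
If the download date is not known by heart, the archive could also be looked up by its name pattern. The sketch below assumes the zip file is named 'username_date.zip' as described above. The `find_archives` helper is hypothetical and only illustrates the idea with a temporary folder standing in for the Downloads folder.

```python
import tempfile
from pathlib import Path

def find_archives(downloads, username):
    """Return all '<username>_<date>.zip' files in a folder, oldest first."""
    # Sorting by file name orders the archives by date, assuming the
    # date part is written as YYYYMMDD
    return sorted(Path(downloads).glob(f'{username}_*.zip'))

# Usage with a throwaway folder standing in for the Downloads folder
with tempfile.TemporaryDirectory() as tmp:
    for stamp in ('20200605', '20200101'):
        (Path(tmp) / f'roosvoor_{stamp}.zip').touch()
    (Path(tmp) / 'someone_else_20200605.zip').touch()
    found = [path.name for path in find_archives(tmp, 'roosvoor')]

print(found)
```

The last element of the returned list would then be the most recent download.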

In [4]:
# importing required modules
from zipfile import ZipFile

# specifying the zip file name
file_name = data

# opening the zip file in READ mode
# (named 'archive' to avoid shadowing the built-in zip function)
with ZipFile(file_name, 'r') as archive:
    # printing all the contents of the zip file
    archive.printdir()

    # extracting all the files into the Downloads folder,
    # where the file paths in the next cell expect them
    print('Extracting all the files now...')
    archive.extractall(project)
    print('Done!')


File Name                                             Modified             Size
information_about_you.json                     2020-06-05 01:55:22           61
devices.json                                   2020-06-05 01:55:22         1429
settings.json                                  2020-06-05 01:55:22           35
seen_content.json                              2020-06-05 01:55:22        61362
stories_activities.json                        2020-06-05 01:55:22           53
connections.json                               2020-06-05 01:55:22        25268
saved.json                                     2020-06-05 01:55:22        13165
searches.json                                  2020-06-05 01:55:22         2492
messages.json                                  2020-06-05 01:55:22       347382
profile.json                                   2020-06-05 01:55:22          415
media.json                                     2020-06-05 01:55:22        53862
likes.json                              

Done!


In [5]:
# Passive files (generated by insta)
json_file_you = project / 'information_about_you.json'
json_file_autofill = project / 'autofill.json'
json_file_pay = project / 'checkout.json'
json_file_users = project / 'connections.json'
json_file_device = project / 'devices.json'
json_file_settings = project / 'settings.json'
json_file_account = project / 'account_history.json'
json_file_user = project / 'profile.json'
json_file_contact = project / 'uploaded_contacts.json'

# Interaction files (generated by users)
json_file_like = project / 'likes.json'
json_file_med = project / 'media.json'
json_file_seen = project / 'seen_content.json'
json_file_stories = project / 'stories_activities.json'
json_file_com = project / 'comments.json'
json_file_mes = project / 'messages.json'
json_file_saved = project / 'saved.json'
json_file_search = project / 'searches.json'
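
Not every file is guaranteed to be present in a download (some sections report 'You have no data in this section'). A small loading helper that tolerates missing files could look like the sketch below. The `load_json` function is a hypothetical helper that the functions in this workflow do not use, and the temporary folder only stands in for the extracted download.

```python
import json
import tempfile
from pathlib import Path

def load_json(path):
    """Load a JSON file, returning None if the file does not exist."""
    path = Path(path)
    if not path.exists():
        return None
    with open(path, encoding='utf8') as handle:
        return json.load(handle)

# Usage with a throwaway folder standing in for the extracted download
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / 'profile.json').write_text('{"username": "roosvoor"}',
                                           encoding='utf8')
    profile = load_json(demo_dir / 'profile.json')
    missing = load_json(demo_dir / 'autofill.json')

print(profile, missing)
```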

#### Find all 'explicit' usernames

In [6]:
def usernames():
    
    # Load profile.json to get username of user
    with open(json_file_user, encoding = "utf8") as json_user:
        user = json.load(json_user)
    
    user = pd.DataFrame.from_dict(user, 
        orient = 'index').T 
    
    # Load connections.json to get username of all connections
    with open(json_file_users, encoding = "utf8") as json_users:
        users = json.load(json_users)

    users = pd.DataFrame.from_dict(users, 
        orient = 'index').T 

    users = users.index.values.tolist()
    
    # Create scramble function
    from random import shuffle

    def shuffle_word(word):
        word = list(word)
        shuffle(word)
        return ''.join(word)

    # Create dictionary with original username as key
    own_name = user['username'][0]
    dictionary = {own_name: '__' + shuffle_word(own_name)}

    for name in users:
        dictionary[name] = '__' + shuffle_word(name)
    
    # look for usernames outside of connections 
    # Saved media
    with open(json_file_saved, encoding = "utf8") as json_saved:
        saved = json.load(json_saved)
    
    users = pd.DataFrame(saved['saved_media'])[1]
    
    # Likes
    with open(json_file_like, encoding = "utf8") as json_likes:
        likes = json.load(json_likes)
    
    user_like = pd.DataFrame(likes['media_likes'])[1]
    # pd.concat is used because Series.append was removed in pandas 2.0
    user_like = pd.concat([user_like, pd.DataFrame(likes['comment_likes'])[1]])
        
    # Seen content
    with open(json_file_seen, encoding = "utf8") as json_seen:
        seen = json.load(json_seen)
    
    user_seen = pd.DataFrame(seen['chaining_seen'])['username']
    # pd.concat is used because Series.append was removed in pandas 2.0
    user_seen = pd.concat([user_seen,
                           pd.DataFrame(seen['ads_seen'])['author'],
                           pd.DataFrame(seen['posts_seen'])['author'],
                           pd.DataFrame(seen['videos_watched'])['author']])
    
    # Search media
    with open(json_file_search, encoding = "utf8") as json_search:
        search = json.load(json_search)

    user_search = pd.DataFrame(search)['search_click']
    
    # Media comments
    with open(json_file_com, encoding = "utf8") as json_comments:
        comments = json.load(json_comments)

    user_com = pd.DataFrame(comments['media_comments'])[2]
    
    # Merge all usernames
    # (pd.concat is used because Series.append was removed in pandas 2.0;
    # missing values are dropped because they cannot be scrambled)
    users = pd.concat([users, user_seen, user_like, user_search, user_com])
    users = set(users.dropna())

    # Only add usernames that are not in the dictionary yet
    for name in users:
        if name not in dictionary:
            dictionary[name] = '__' + shuffle_word(name)

    return dictionary
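
The scrambling step can also be looked at in isolation. The sketch below mirrors the `shuffle_word` idea with a seeded random generator, so that the same input yields the same key on every run. The `build_key` helper and the `seed` parameter are assumptions added for illustration and are not part of the function above.

```python
import random

def shuffle_word(word, rng):
    """Return the characters of `word` in a random order."""
    letters = list(word)
    rng.shuffle(letters)
    return ''.join(letters)

def build_key(names, seed=None):
    """Map every name to '__' followed by a scrambled copy of that name."""
    # A fixed seed makes the scrambling reproducible between runs
    rng = random.Random(seed)
    return {name: '__' + shuffle_word(name, rng) for name in names}

key = build_key(['roosvoor', 'beberson'], seed=42)
print(key)
```

A shuffle keeps every character of the original and can, by chance, return the original order, so short usernames in particular may remain recognizable after scrambling.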
    

In [7]:
# N.B. this rebinds `usernames` from the function to the resulting dictionary
usernames = usernames()
usernames

{'roosvoor': '__ovrsrooo',
 'beberson': '__obrbnsee',
 'danielpolosetzky': '__oolptadlnyiezesk',
 'sophie_soof': '__ohiso_ofspe',
 'symonab': '__asmonyb',
 'mana.fazel': '__.leazaamnf',
 'evaendema': '__deaemnaev',
 'zack_from_earth': '__hearrao_fmz_ktc',
 'jboonstra73': '__artoj37obns',
 '_romyrachel': '__ahlmr_ycore',
 'lauraderooij': '__odeaiurjrloa',
 'veerlegewoon': '__oeweelnrgevo',
 'sophiejacobs1993': '__139ieocp9jhsobsa',
 'momo_schaap': '__paohomcams_',
 'mitalipoovs': '__tiiapvsmolo',
 'bonnievanderlee': '__bennlieaoeerndv',
 'agnesdesl': '__edessangl',
 'theycallmenita': '__hymealttcinlea',
 'die_ene_insta': '__dstani_n_ieee',
 'hannadohle': '__naheaholnd',
 'jurrekuin': '__jikreurnu',
 'yaramiora': '__ayoramria',
 'faraah.aulia': '__laarahua.aif',
 'bluunie': '__enuiubl',
 'tiarmaguvnor': '__umirorgatanv',
 'ingevanooijen': '__ennoivijnoega',
 'dieuweertje': '__ujterieedew',
 'al.bert.0': '__.a0t.brle',
 'annelotte2': '__enetotnla2',
 'hugomcgurran': '__gharoucmgrun',
 'an

#### Find other sensitive info

Extract name of user

In [8]:
def names():
    
    # Load account_history.json to get registration info of user
    with open(json_file_account, encoding = "utf8") as json_user:
        name = json.load(json_user)

    # Find replacement for user's username 
    with open(json_file_user, encoding = "utf8") as json_user:
        user = json.load(json_user)
    
    user = pd.DataFrame.from_dict(user, 
        orient = 'index').T 
    
    # Create dictionary
    name_dic = {name['registration_info']['registration_username']: usernames[user['username'][0]]}

    return(name_dic)

names()

{'Roos': '__ovrsrooo'}

Extract mail of user

In [9]:
def mail():
    
    # Load profile.json to get email and username of user
    with open(json_file_user, encoding = "utf8") as json_user:
        user = json.load(json_user)
    
    # Create dictionary
    mail_dic = {user['email']: usernames[user['username']]}
    
    return(mail_dic)

mail()

{'vladimirvladimirina@gmail.com': '__ovrsrooo'}

### 5. Create function to search for sensitive info and replace it with pseudo data

#### Save dictionary to key file

In [10]:
# Combine all usernames with the name and mail of the user
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame(list(usernames.items()))
df = pd.concat([df,
                pd.DataFrame(list(mail().items())),
                pd.DataFrame(list(names().items()))],
               ignore_index=True)

df = df.rename(columns={0: 'id', 1: 'subt'})
df

     id                             subt
0    roosvoor                       __ovrsrooo
1    beberson                       __obrbnsee
2    danielpolosetzky               __oolptadlnyiezesk
3    sophie_soof                    __ohiso_ofspe
4    symonab                        __asmonyb
...  ...                            ...
831  portrait_viral                 __iattroriapvr_l
832  bella.ruis                     __.ilrseluab
833  abs_at_home                    __oa_st_eamhb
834  vladimirvladimirina@gmail.com  __ovrsrooo


In [11]:
# Write the sensitive info with corresponding substitutes to csv
df.to_csv('keys.csv', index = False, encoding='utf-8')
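
To check the key file, it can be read back into a lookup dictionary. The sketch below assumes the two columns 'id' and 'subt' written above. The `read_key_file` helper is hypothetical, and an in-memory sample stands in for keys.csv.

```python
import csv
import io

def read_key_file(handle):
    """Map each original value ('id') to its substitute ('subt')."""
    return {row['id']: row['subt'] for row in csv.DictReader(handle)}

# In-memory stand-in for keys.csv with the same two columns
sample = io.StringIO('id,subt\nroosvoor,__ovrsrooo\nbeberson,__obrbnsee\n')
lookup = read_key_file(sample)
print(lookup['roosvoor'])
```

With the real file, the handle would come from `open('keys.csv', encoding='utf-8', newline='')`.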

#### Find and replace usernames in all files

In [12]:
# Look for all usernames in the instagram folder and replace them with substitutes
from anonymize import Anonymize

# Use one variable so the printed folder name matches the folder that is written
output_folder = 'anonymized_'+usernames[username]

anonymize_csv = Anonymize('keys.csv')
anonymize_csv.substitute(data, output_folder)

print('Your instagram data has been successfully anonymized!')
print('The data is saved in the following folder: '+output_folder)

Your instagram data has been successfully anonymized!
The data is saved in the following folder: anonymized___ovrsrooo
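
The internals of the `Anonymize` class are not shown in this workflow. As an illustration of the general idea only, a dictionary-based find-and-replace on a piece of text could be sketched as follows. The `substitute` function below is a hypothetical stand-in and makes no claim about how `Anonymize.substitute` actually works.

```python
import re

def substitute(text, mapping):
    """Replace every original in `mapping` found in `text` with its substitute."""
    if not mapping:
        return text
    # Longest originals first, so a long username is never partially
    # replaced by a shorter one that happens to be contained in it
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(original) for original in ordered))
    return pattern.sub(lambda match: mapping[match.group(0)], text)

mapping = {'roosvoor': '__ovrsrooo', 'sophie_soof': '__ohiso_ofspe'}
message = '{"sender": "roosvoor", "text": "hi @sophie_soof!"}'
print(substitute(message, mapping))
# prints {"sender": "__ovrsrooo", "text": "hi @__ohiso_ofspe!"}
```

All originals are combined into one pattern and replaced in a single pass, so a substitute that was just inserted is never scanned and replaced again.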


Please send this folder to researcher X via method B. Thank you!