<a href="https://colab.research.google.com/github/atilatech/atila-core-service/blob/master/notebooks/clean_and_upload_twitter_archive.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TwitterBase

Create a permanent database of your Twitter archive.

1. Upload a zip file of your Twitter archive
2. Remove private information
3. Create an S3 bucket and upload the archive to it
4. Serve with CloudFront
5. Redirect `twitter.atila.ca/<username>` -> deployed_url

## Optional: Upload Zip File to Google Drive

If you are running in Colab, first upload the file to Google Drive so that your archive can still be accessed if you restart your notebook.

I recommend creating a folder called `twitter-archive` in the root of your Google Drive and putting the zip file in there.
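
To read the zip file from Drive, the notebook first has to mount it. Here is a minimal sketch, assuming the standard Colab mount point `/content/drive` and the `twitter-archive` folder suggested above. The helper name `mount_google_drive` is my own, and outside Colab it simply returns `None`.

```python
import os


def mount_google_drive(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; return the mount point, else None."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return None
    drive.mount(mount_point)
    return mount_point


mount_point = mount_google_drive()
if mount_point is not None:
    # Assumed location: a `twitter-archive` folder in the root of your Drive.
    archive_dir = os.path.join(mount_point, "MyDrive", "twitter-archive")
    if os.path.isdir(archive_dir):
        print(os.listdir(archive_dir))
    else:
        print(f"{archive_dir} not found")
else:
    print("Not running in Colab; skipping the Drive mount.")
```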


## Understand Directory Structure

It helps to understand the directory structure.

Create a `created_by_notebook.txt` file and see where it appears in the file browser. That tells you which directory the notebook runs from (`/content` in Colab).

In [7]:
!mkdir twitter-archive
!ls
!pwd

drive  sample_data  sample.txt	twitter-archive
/content
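
The marker-file check described above can also be done from Python. This is only a sketch: the helper name `write_marker` is hypothetical, and the file name is the one suggested in the text.

```python
from pathlib import Path


def write_marker(directory=".", name="created_by_notebook.txt"):
    """Create a small marker file and return its absolute path."""
    marker = Path(directory) / name
    marker.write_text("created by the notebook\n")
    return marker.resolve()


marker = write_marker()
print(marker)         # absolute path of the marker file
print(marker.parent)  # the directory the notebook runs from
```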


## Upload Zip File of Twitter Archive

Unzipping the archive produces a lot of data and takes a long time, so a mistake is expensive. Start by creating a `test.txt` file with some `foobar` dummy data in your Google Drive folder.

Copy that file into your notebook folder and try reading it to verify that the notebook can access your Drive.
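
A rough sketch of that dry run is below. The helper `copy_and_verify` is hypothetical, and the demo uses temporary directories as stand-ins for the Drive folder and the notebook folder. In Colab you would pass your own paths instead.

```python
import shutil
import tempfile
from pathlib import Path


def copy_and_verify(source, dest_dir, expected="foobar"):
    """Copy `source` into `dest_dir` and report whether it contains `expected`."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = Path(shutil.copy(source, dest_dir))
    return expected in copied.read_text()


# Demo with temporary directories standing in for Drive and the notebook folder.
with tempfile.TemporaryDirectory() as tmp:
    drive_file = Path(tmp) / "drive" / "test.txt"
    drive_file.parent.mkdir()
    drive_file.write_text("foobar\n")
    print(copy_and_verify(drive_file, Path(tmp) / "notebook"))
```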


Then do the same for the real archive: unzip it into the Colab notebook's working directory.

Tip: [Show unzip progress](https://askubuntu.com/questions/909918/how-to-show-unzip-progress)

In [14]:
!unzip /content/drive/MyDrive/twitter-archive/twitter-archive-2022-12-21-tomiwa1a-2e6e0cef38e03f7ea2dfe285b39861e6572acb880b292cf95042d912c64d0651.zip \
    -d twitter-archive | awk 'BEGIN {ORS=" "} {if(NR%10==0)print "."}'

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
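
If you prefer to stay in Python, the standard `zipfile` module can do the same job. The sketch below roughly mirrors the shell command by printing one dot for every 10 extracted entries. The function name `unzip_with_progress` and the `every` parameter are my own choices, not part of any library.

```python
import zipfile


def unzip_with_progress(zip_path, dest_dir, every=10):
    """Extract every member of `zip_path` into `dest_dir`, printing a dot per `every` members."""
    with zipfile.ZipFile(zip_path) as archive:
        members = archive.infolist()
        for count, member in enumerate(members, start=1):
            archive.extract(member, dest_dir)
            if count % every == 0:
                print(".", end=" ", flush=True)
    print()  # finish the progress line
    return len(members)
```

It would be called with the path to your own zip file in Drive and `twitter-archive` as the destination, and it returns the number of entries extracted.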

In [None]:
twitter_archive_file_path = "archive"

In [16]:
import json

with open('twitter-archive/data/tweets.js') as dataFile:
    tweets = dataFile.read()
    tweets = tweets[tweets.find('[') : tweets.rfind(']')+1]
    tweets = json.loads(tweets)


In [None]:
tweets[0]

## Remove Personal Information

Opening the Twitter archive locally, we can see that it contains the following pieces of information that we want to remove. [todo add screenshot]

## How Personal Data is Stored

Most of the data is stored like this:

```
window.YTD.account.part0 = [
  {
    "account" : {
      "email" : "tomia@atila.ca",
      "createdVia" : "<oauth:NNNNNN>",
      "username" : "tomiwa1a",
      "accountId" : "388018813",
      "createdAt" : "2011-10-10T01:46:39.000Z",
      "accountDisplayName" : "Tomiwa 😃"
    }
  }
]
```
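
Because each file is a JavaScript assignment rather than plain JSON, it has to be split before it can be parsed. Here is a minimal sketch, assuming every data file has the shape shown above (an assignment followed by a single JSON list). The helper name `split_ytd_js` is hypothetical.

```python
import json


def split_ytd_js(text):
    """Split a `window.YTD... = [...]` file into (prefix, parsed list, suffix)."""
    start = text.find("[")
    end = text.rfind("]") + 1  # one past the closing bracket
    return text[:start], json.loads(text[start:end]), text[end:]


sample = """window.YTD.account.part0 = [
  {
    "account" : {
      "username" : "tomiwa1a"
    }
  }
]
"""
prefix, data, suffix = split_ytd_js(sample)
print(repr(prefix))
print(data[0]["account"]["username"])
```

Keeping the prefix and suffix around makes it possible to write the file back in the same format after the JSON part has been changed.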

###  Account
Note: This is a non-exhaustive list. It's based on what I can see in my own Twitter profile (@tomiwa1a). Other accounts may contain other types of personal information that I'm not privy to.
1. General information:
    1. Created via OAuth -> `account.js` `root[acc`
    1. User Creation IP
    1. email
    1. Phone number
    1. Age info
    1. Email Changes
    1. Protected History

1. Profile:
    1. None (though screen name changes might be considered private)

1. Connected Applications

1. Contacts
    1. Very personal, because this includes your own phone number and the phone numbers of everyone in your social network

1. Sessions

1. Account access history



### Direct Messages

Everything here can be considered personal

### Personalization

1. Demographics
    1. Language
    2. Gender
    3. Date of Birth
    4. Age
1. Interests
1. Advertiser lists
1. Location
1. Saved Searches

### Ads
Everything here

### Lists

1. None (Subscribed lists might be considered personal, but we can iterate based on user feedback.)

## Removing Personal Data from Account

1. Remove the `createdVia` field.

In [None]:
import json
import copy

def sensitize_creation_info():

    # load the data from file
    with open('twitter-archive/data/account.js') as file_io:
        file_data_str = file_io.read()
        # get the json part of the data
        start_replace_index = file_data_str.find('[')
        end_replace_index = file_data_str.rfind(']')+1
        json_file_data_str = file_data_str[start_replace_index:end_replace_index]
        file_data = json.loads(json_file_data_str)
    
    # sensitize the data

    sensitized_data = copy.deepcopy(file_data)
    sensitized_data[0]['account']['createdVia'] = '<redacted>'
    print('data: ', file_data)
    print('sensitized_data: ', sensitized_data)

    sensitized_data_str = json.dumps(sensitized_data, indent=4)

    # replace the original file data with sensitized information
    with open('twitter-archive/data/account_sensitized.js', 'w+') as file_io:
        # end_replace_index already points one past the closing ']',
        # so slice from it directly to keep the rest of the file intact
        file_io.write(file_data_str[:start_replace_index]
                      + sensitized_data_str
                      + file_data_str[end_replace_index:])

sensitize_creation_info()
!cat twitter-archive/data/account_sensitized.js


## Generalize logic to work for different account types

We want to define a function to which we can pass a map of the fields we want to sensitize and have it handle the sensitization logic automatically.

In [67]:

# we could make the keys be an array of files to make our code more efficient
# however, open and closing a file even 1000 times still takes 
# less than half a second (0.309s): https://stackoverflow.com/a/11349501/5405197


# utility for setting nested values: https://stackoverflow.com/a/69572347/5405197
# nested_set(data_dictionary, "user_data.phone", "123")
def nested_set(obj, path, value):
    *path, last = path.split(".")
    for bit in path:
        obj = obj.setdefault(bit, {})
    obj[last] = value


def sensitize(file_name, replace_path, replace_value):

    # load the data from file
    with open(f'twitter-archive/data/{file_name}') as file_io:
        file_data_str = file_io.read()
        # get the json part of the data
        start_replace_index = file_data_str.find('[')
        end_replace_index = file_data_str.rfind(']')+1
        json_file_data_str = file_data_str[start_replace_index:end_replace_index]
        file_data = json.loads(json_file_data_str)
    

    
    # if replace path is an empty string, that means
    # replace entire json data in the javascript file
    if replace_path:
        # copy the first entry (part0) so the original data stays untouched
        sensitized_data = copy.deepcopy(file_data[0])
        # overwrite the value at the given dotted path
        nested_set(sensitized_data, replace_path, replace_value)

        sensitized_data = [sensitized_data] # convert back into a list as part0
    else:
        # no indexing here, so files whose list is already empty still work
        sensitized_data = replace_value
    # write the sensitized information to <name>_sensitized.js
    # next to the original file
    sensitized_data_str = json.dumps(sensitized_data, indent=4)
    # note the switch to double quotes for the outer f-string
    output_name = f"{file_name.replace('.js', '')}_sensitized.js"
    with open(f'twitter-archive/data/{output_name}', 'w+') as file_io:
        # end_replace_index already points one past the closing ']',
        # so slice from it directly to keep the rest of the file intact
        file_io.write(file_data_str[:start_replace_index]
                      + sensitized_data_str
                      + file_data_str[end_replace_index:])

    # use padding to make the outputs aligned so we can visually compare them
    print('data__pad______: ', file_data)
    print('sensitized_data: ', sensitized_data)
    print(f'sensitized: {file_name}#{replace_path}\n')
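
To make the dotted-path behaviour concrete, here is a small standalone sketch of `nested_set` on sample data. The record and the misspelled `acount` path are made up for illustration. Note that because the helper uses `setdefault`, a path that does not exist is silently created rather than rejected, so a typo in `replace_path` leaves the real field untouched.

```python
def nested_set(obj, path, value):
    """Set `value` at a dotted `path`, creating intermediate dicts as needed."""
    *parents, last = path.split(".")
    for key in parents:
        obj = obj.setdefault(key, {})
    obj[last] = value


record = {"account": {"createdVia": "oauth:111111", "username": "tomiwa1a"}}

# An existing path is overwritten in place.
nested_set(record, "account.createdVia", "<redacted>")

# A misspelled path creates a new branch instead of raising an error.
nested_set(record, "acount.createdVia", "<redacted>")

print(record)
```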

## Sensitize a few fields to make sure it's working properly

In [68]:
sensitize_configs = [
    {
        'file_path': 'account.js',
        'replace_path': 'account.createdVia',
        'replace_value': '<redacted>'
    },
    {
        'file_path': 'account-creation-ip.js',
        'replace_path': 'accountCreationIp.userCreationIp',
        'replace_value': '<redacted>'
    }
]
# note config is reserved in iPython
for config_val in sensitize_configs: 
    
    sensitize(config_val['file_path'],
              config_val['replace_path'],
              config_val['replace_value'])

data__pad______:  [{'account': {'email': 'tomiwa@atila.ca', 'createdVia': 'oauth:111111', 'username': 'tomiwa1a', 'accountId': '388018813', 'createdAt': '2011-10-10T01:46:39.000Z', 'accountDisplayName': 'Tomiwa 😃'}}]
sensitized_data:  [{'account': {'email': 'tomiwa@atila.ca', 'createdVia': '<redacted>', 'username': 'tomiwa1a', 'accountId': '388018813', 'createdAt': '2011-10-10T01:46:39.000Z', 'accountDisplayName': 'Tomiwa 😃'}}]
sensitized: account.js#account.createdVia

data__pad______:  [{'accountCreationIp': {'accountId': '388018813', 'userCreationIp': '11.111.11.11'}}]
sensitized_data:  [{'accountCreationIp': {'accountId': '388018813', 'userCreationIp': '<redacted>'}}]
sensitized: account-creation-ip.js#accountCreationIp.userCreationIp
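
Beyond comparing the printed output by eye, a quick automated check can confirm that a known sensitive value no longer appears in the sanitized text. This is only a sketch: the helper `find_leaks` is hypothetical, and the sample strings are based on the placeholder values printed above.

```python
def find_leaks(text, sensitive_values):
    """Return the sensitive values that still appear in `text`."""
    return [value for value in sensitive_values if value in text]


original = '"createdVia" : "oauth:111111", "username" : "tomiwa1a"'
sanitized = original.replace("oauth:111111", "<redacted>")

print(find_leaks(original, ["oauth:111111"]))
print(find_leaks(sanitized, ["oauth:111111"]))
```

In the notebook, the same check could be run on the contents of each `*_sensitized.js` file before anything is uploaded.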



## Now Define all sensitize Configs


In [61]:
ALL_SENSITIZE_CONFIGS = [
    {
        'file_path': 'account.js',
        'replace_path': 'account.createdVia',
        'replace_value': '<redacted>'
    },
    {
        'file_path': 'account-creation-ip.js',
        'replace_path': 'accountCreationIp.userCreationIp',
        'replace_value': '<redacted>'
    },
    {
        'file_path': 'email-address-change.js',
        'replace_path': '',
        'replace_value': []
    },
    {
        'file_path': 'direct-messages.js',
        'replace_path': '',
        'replace_value': []
    }
]
# note config is reserved in iPython
for config_val in ALL_SENSITIZE_CONFIGS: 
    
    sensitize(config_val['file_path'],
              config_val['replace_path'],
              config_val['replace_value'])

sensitized: account.js#account.createdVia

sensitized: account-creation-ip.js#accountCreationIp.userCreationIp

sensitized: direct-messages.js#

