# Discord Scrape
#### Step 1

To begin data collection, we will scrape messages from an alert channel in a private Discord server.

Each message in this channel contains token metrics captured at the moment liquidity is locked for the token and trading goes live. In a real-world setting, this is all the information we have when gauging whether a token is a good investment.

This data includes:

- Token address
- Contract creation date
- Verified/Renounced status of contract
- Marketcap
- Number of Buys and Sells
- Honeypot status (a common scam)
- Tax information
- Liquidity
- Owner (Where liquidity tokens have been locked)
- Unlock date (if any)
- Holder contract link
- Deployer contract link
- Funding source link
- Amount of Ethereum held in contracts
- Maximum number of tokens allowed per transaction and per wallet
- Liquidity time (from message timestamp)
- Description
- Social media links
- Contract links

For our immediate purposes, we will focus on numeric data; links will be ignored.
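As a rough illustration of what "numeric data" means here, the sketch below pulls numbers out of field values shaped like the ones in the alert embeds (for example `'$7,956'` or `':green_circle: 5% | 5.9%'`). The helper name `parse_numbers` is hypothetical and not part of the original notebook; the actual cleaning is done in the next step.

```python
import re

def parse_numbers(value):
    """Return every number found in a field value as a list of floats.

    Hypothetical helper: assumes values look like '$7,956' or
    ':green_circle: 5% | 5.9%'. Negative numbers are not handled.
    """
    # Drop Discord emoji shortcodes such as ':green_circle:' so that any
    # digits inside a shortcode are never mistaken for data
    cleaned = re.sub(r':\w+:', ' ', value)
    # Remove thousands separators ('7,956' -> '7956')
    cleaned = cleaned.replace(',', '')
    return [float(n) for n in re.findall(r'\d+(?:\.\d+)?', cleaned)]

print(parse_numbers('$7,956'))
print(parse_numbers(':green_circle: 5% | 5.9%'))
print(parse_numbers(':red_circle: 14 | 0'))
```

Fields with no numeric content, such as the honeypot status, simply yield an empty list.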

The structure of the alert messages has changed over the years as the product team updated the available metrics. For consistency in our data, we will start the scrape from the first message ID that uses our desired structure.
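Discord message IDs are snowflakes whose upper bits encode a millisecond timestamp relative to the Discord epoch (2015-01-01 UTC). If the exact starting message ID is not known, one possible approach is to build an approximate `after` value from a date. This is a minimal sketch; the function names and the cutoff date below are illustrative assumptions, not values from this project.

```python
from datetime import datetime, timezone

DISCORD_EPOCH_MS = 1420070400000  # 2015-01-01T00:00:00Z in milliseconds

def snowflake_from_datetime(dt):
    """Smallest snowflake ID that could have been created at `dt`."""
    ms = int(dt.timestamp() * 1000)
    return (ms - DISCORD_EPOCH_MS) << 22

def datetime_from_snowflake(snowflake):
    """Creation time (UTC) encoded in a snowflake ID."""
    ms = (int(snowflake) >> 22) + DISCORD_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# Hypothetical cutoff date, used only for illustration
start = datetime(2023, 1, 1, tzinfo=timezone.utc)
after_id = snowflake_from_datetime(start)
print(after_id, datetime_from_snowflake(after_id))
```

The resulting integer can be passed as the `after` query parameter in place of a real message ID.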

There are also some limitations to keep in mind:
- 50 messages per request by default (the `limit` query parameter accepts up to 100)
- 50 requests per second (global rate limit)

More information on interacting with Discord's API as it pertains to this project can be found here:
- https://discord.com/developers/docs/resources/channel#get-channel-messages
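When a rate limit is exceeded, Discord answers with HTTP 429 and, to the best of my knowledge, a JSON body containing a `retry_after` value in seconds. The sketch below shows one way to wait and retry instead of aborting. The helper `get_with_retry` is hypothetical; it takes any zero-argument callable that returns a `requests`-style response, so it can wrap the request used in this notebook.

```python
import time

def get_with_retry(fetch, max_retries=5, sleep=time.sleep):
    """Call fetch() until it returns a non-429 response or retries run out.

    `fetch` is any zero-argument callable returning an object with a
    `status_code` attribute and a `json()` method (e.g. a requests.Response).
    """
    for attempt in range(max_retries + 1):
        resp = fetch()
        # Return on success, on any non-rate-limit error, or on the last attempt
        if resp.status_code != 429 or attempt == max_retries:
            return resp
        # Assumed body shape: {'retry_after': <seconds>}; fall back to 1s otherwise
        try:
            wait = float(resp.json().get('retry_after', 1.0))
        except (ValueError, AttributeError, TypeError):
            wait = 1.0
        sleep(wait)

# Example wiring (not executed here because it needs network access):
# resp = get_with_retry(lambda: requests.get(url, headers=headers, params=params))
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays.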

In [15]:
import json
import requests
import time

def retrieve_msgs(channel_id, after_id='000000000000', output_file='../data/unique_scraped_data_alert_time.json'):
    headers = {
        'authorization': 'YOUR_API_KEY_HERE',  # replace with your API key
    }

    url = f'https://discord.com/api/v9/channels/{channel_id}/messages'
    params = {'after': after_id}

    existing_data = set()  # Store existing 'dict0' values to check for duplicates
    
    with open(output_file, 'a') as file:
        while True:
            req = requests.get(url, headers=headers, params=params)
            if req.status_code == 200:
                json_data = req.json()
                
                if not json_data:
                    print("Reached the end of available messages.")
                    break
                
                for msg in json_data:
                    if 'embeds' in msg and msg['embeds']:
                        scraped_item_data = {'dict0': msg['embeds'][0]['description'][-43:-1]}
                        skip_values = {9, 10, 11, 18, 19, 20, 25}

                        for f, field in enumerate(msg['embeds'][0]['fields'], 1):
                            if f not in skip_values:
                                scraped_item_data[f'dict{f}'] = field

                        scraped_item_data['timestamp'] = msg['timestamp']
                        
                        # Check for duplicate 'dict0' values before writing
                        dict0_value = scraped_item_data['dict0']
                        if dict0_value not in existing_data:
                            json.dump(scraped_item_data, file)
                            file.write('\n')
                            existing_data.add(dict0_value)  # Add 'dict0' value to the set of existing data

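                # Advance the cursor to the newest ID in this batch; taking the max
                # works regardless of the order in which the API returns messages
                last_message_id = max(json_data, key=lambda m: int(m['id']))['id']
                params['after'] = last_message_id
                time.sleep(1)  # Delay in seconds between API calls, conservatively within the rate limit as this is my main account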

            else:
                print(f'Error: {req.status_code} - {req.text}')
                break

liq_lock = 'YOUR_CHANNEL_ID'  # replace with the channel id
retrieve_msgs(liq_lock)

Reached the end of available messages.
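Note that `existing_data` starts out empty on every call while the output file is opened in append mode, so running the scrape a second time could write tokens that are already in the file. One possible remedy is to seed the set from the file before scraping. The helper `load_seen_addresses` below is a hypothetical sketch, not part of the original notebook.

```python
import json
import os

def load_seen_addresses(path):
    """Collect the 'dict0' value of every valid JSON line already in `path`."""
    seen = set()
    if not os.path.exists(path):
        return seen  # first run: nothing scraped yet
    with open(path, 'r') as file:
        for line in file:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip lines that were only partially written
            if isinstance(record, dict) and 'dict0' in record:
                seen.add(record['dict0'])
    return seen
```

Inside `retrieve_msgs`, the set could then be initialised with `existing_data = load_seen_addresses(output_file)` instead of `set()`.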


Testing the JSON file's readability to check that the data looks as expected:

In [14]:
file_path = '../data/unique_scraped_data_alert_time.json'

# Initialize a list to store individual JSON objects
json_objects = []

with open(file_path, 'r') as file:
    # Read each line in the file
    for line in file:
        # Load each line as a JSON object
        try:
            json_obj = json.loads(line)
            json_objects.append(json_obj)
        except json.JSONDecodeError as e:
            print(f"Error decoding JSON: {e}")

# Check if JSON objects were found
if json_objects:
    # Access the last JSON object in the list
    last_json_object = json_objects[-1]
    print("Last JSON object:", last_json_object)
else:
    print("No valid JSON objects found in the file.")

Last JSON object: {'dict0': '0xf893f20cf0f3cf47cb8ba2f45995b92d46a19e2f', 'dict1': {'name': 'Created', 'value': '<t:1674241739:R>', 'inline': True}, 'dict2': {'name': 'Verified | Renounced', 'value': ':green_circle: True | :green_circle: Not Ownable', 'inline': True}, 'dict3': {'name': 'Marketcap', 'value': '$7,956', 'inline': True}, 'dict4': {'name': 'Buys | Sells', 'value': ':red_circle: 14 | 0', 'inline': True}, 'dict5': {'name': 'Honeypot', 'value': ':green_circle: Buy | :green_circle: Sell', 'inline': True}, 'dict6': {'name': 'Taxes', 'value': ':green_circle: 5% | 5.9%', 'inline': True}, 'dict7': {'name': 'Liquidity', 'value': ':green_circle: $12,541', 'inline': True}, 'dict8': {'name': 'Owner', 'value': ':red_circle: 100% Pinksale', 'inline': True}, 'dict12': {'name': 'Eth Balance', 'value': '-\n0.04Ξ\n0.28Ξ\n5.83Ξ\n0.02Ξ\n0.06Ξ\n0.08Ξ\n0.15Ξ\n-', 'inline': True}, 'dict13': {'name': 'Deployer', 'value': '[0xc83e…698e](https://etherscan.io/address/0xc83e6f1cc56892cf7de3550645bbeb4
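Beyond eyeballing the last record, a quick structural check can flag rows where the address slice (`description[-43:-1]`) did not yield a well-formed address or where the timestamp is missing. This is a minimal sketch; `find_problems` is a hypothetical helper that operates on a list of records such as `json_objects`.

```python
import re

# A token address is '0x' followed by 40 hexadecimal characters
ADDRESS_RE = re.compile(r'^0x[0-9a-fA-F]{40}$')

def find_problems(records):
    """Return (index, message) pairs for records that look malformed."""
    problems = []
    for i, record in enumerate(records):
        if not ADDRESS_RE.match(str(record.get('dict0', ''))):
            problems.append((i, 'dict0 is not a 42-character hex address'))
        if 'timestamp' not in record:
            problems.append((i, 'missing timestamp'))
    return problems

sample = [
    {'dict0': '0xf893f20cf0f3cf47cb8ba2f45995b92d46a19e2f', 'timestamp': 'placeholder'},
    {'dict0': 'not-an-address'},
]
print(find_problems(sample))
```

An empty list from `find_problems(json_objects)` would suggest every scraped record has the expected address and timestamp.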

Next, we will transform this JSON data into a pandas DataFrame in: **['discord_dataframe_step_2'](./discord_dataframe_step_2.ipynb)**