# Discord Dataframe Creation
#### Step 2

With our Discord data scraped, we now convert it to a pandas DataFrame so that it is workable within a data science environment. We will also use this data to scrape the additional data needed for our research.

In [1]:
import pandas as pd
import json

We start with an initial cleaning of our 'unique_scraped_data_alert_time.json' file, getting rid of:
- records whose field values contain 'ERR'
- stray '\n' characters around each line

In [3]:
input_file_path = '../data/unique_scraped_data_alert_time.json'
output_file_path = '../data/filtered_scrape_data_alert_time.json'

with open(input_file_path, 'r') as input_file, open(output_file_path, 'w') as output_file:
    # Read and process each line in the input file
    for line in input_file:
        try:
            json_data = json.loads(line.strip())

            # Check if any field value contains 'ERR'
            err_found = any('ERR' in str(value.get('value', '')) for value in json_data.values() if isinstance(value, dict))

            # Write to the new file only if 'ERR' is not found
            if not err_found:
                output_file.write(json.dumps(json_data) + '\n')

        except json.JSONDecodeError as e:
            print(f"Error decoding JSON on line: {line.strip()} - {e}")
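As a quick sanity check, the filter can be exercised on a couple of in-memory records. The sketch below is illustrative only: the `has_err` helper and both sample records are made up for the example and simply mirror the `dict<N>` / `value` layout that the cell above relies on.

```python
import json

def has_err(record):
    """Return True if any dict-valued field has 'ERR' in its 'value' entry."""
    return any(
        'ERR' in str(field.get('value', ''))
        for field in record.values()
        if isinstance(field, dict)
    )

# Hypothetical records shaped like the scraped lines (not real data).
sample_lines = [
    '{"dict0": "0xabc", "dict1": {"value": "<t:1667604491:R>"}}',
    '{"dict0": "0xdef", "dict1": {"value": "ERR"}}',
]

kept = [line for line in sample_lines if not has_err(json.loads(line))]
print(len(kept))  # 1: only the first record survives
```

Note that the check only inspects dict-valued fields, so an 'ERR' sitting in a top-level string such as `dict0` would pass through this filter.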

Here, each metric within the alert message follows a relatively consistent structure.
Each metric is represented by a 'dict' entry in the JSON data, and we extract the best approximation of the information from each one. Further cleaning may be necessary before our EDA process.
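
To make the chained `split` calls in the extraction cell easier to follow, here is a minimal sketch of how one rated field is taken apart. The raw string is an assumption reconstructed from the parsing logic (a Discord emoji shortcode followed by two values separated by `|`), and `parse_rated_pair` is a hypothetical helper, not part of the notebook.

```python
def parse_rated_pair(raw):
    """Split ':<color>_circle: <first> | <second>' into (rating, first, second)."""
    left, right = raw.split('|', 1)
    # ':green_circle: 100 ' -> ['', 'green_circle', ' 100 ']
    _, emoji, first = left.split(':', 2)
    rating = emoji.replace('_circle', '').strip()
    return rating, first.strip(), right.strip()

# Assumed shape of a buys/sells field (illustrative, not real data).
rating, buys, sells = parse_rated_pair(':green_circle: 100 | 24')
print(rating, buys, sells)  # green 100 24
```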

As this is an iterative process with sanity checks to ensure we are getting the data we want, we will track our progress with a line_number increment. 
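
One way to put such a counter to work is to report the line number whenever a record does not match the expected layout, rather than letting the whole loop stop. This is a sketch under assumptions, not the notebook's actual code: the `extract` helper, its two-field layout, and the sample lines are all hypothetical.

```python
import json

def extract(record):
    """Pull two fields from a record; raises KeyError/IndexError on odd layouts."""
    liquidity = record['dict7']['value'].split('$')[1].replace(',', '').strip()
    return [record['dict0'], liquidity]

# Hypothetical lines: the second one lacks the '$' that the parser expects.
lines = [
    '{"dict0": "0xabc", "dict7": {"value": "Liquidity: $7,958"}}',
    '{"dict0": "0xdef", "dict7": {"value": "-"}}',
]

rows, bad_lines = [], []
for line_number, line in enumerate(lines, start=1):
    try:
        rows.append(extract(json.loads(line)))
    except (KeyError, IndexError) as exc:
        # Record where parsing failed so the line can be inspected later.
        bad_lines.append(line_number)
        print(f"line {line_number}: {type(exc).__name__}: {exc}")

print(rows)       # [['0xabc', '7958']]
print(bad_lines)  # [2]
```

Collecting the failing line numbers makes it easy to go back to the source file and decide whether the parser or the record needs fixing.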

In [2]:
values_list = []
line_number = 1

with open('../data/filtered_scrape_data_alert_time.json', 'r') as file:
    for line in file:
        json_data = json.loads(line.strip())
        
        # Process the JSON data and extract necessary fields
        value_item_list = [
            json_data['dict0'],  # contract address
            json_data['timestamp'],  # liquidity lock alert timestamp
            
            json_data['dict1']['value'],  # created timestamp
            json_data['dict2']['value'].split('|')[0].split(':')[2].strip(), # verified
            json_data['dict2']['value'].split('|')[1].split(':')[2].strip(), # renounced
            json_data['dict3']['value'], # marketcap
            json_data['dict4']['value'].split('|')[0].split(':')[2].strip(), # buys
            json_data['dict4']['value'].split('|')[1].strip(), # sells
            json_data['dict4']['value'].split('|')[0].split(':')[1].replace('_circle', '').strip(), # buy/sell rating
            json_data['dict5']['value'] if json_data['dict6']['value'] == '-' else (False if json_data['dict5']['value'].startswith(':red_circle:') else (True if json_data['dict5']['value'].split('|')[1].split(':')[1].strip() == 'green_circle' else False)),  # honeypot, true if safe
            json_data['dict6']['value'] if json_data['dict6']['value'] == '-' else json_data['dict6']['value'].split('|')[0].split(':')[2].strip(), # buy tax
            json_data['dict6']['value'] if json_data['dict6']['value'] == '-' else json_data['dict6']['value'].split('|')[1].strip(), # sell tax
            json_data['dict6']['value'] if json_data['dict6']['value'] == '-' else json_data['dict6']['value'].split('|')[0].split(':')[1].replace('_circle', '').strip(), # buy/sell tax rating
            json_data['dict7']['value'].split('$')[1].replace(',', '').strip(), # liquidity
            json_data['dict8']['value'].split('%')[1].strip(), # owner
            json_data['dict8']['value'].split('|')[0].split(':')[1].replace('_circle', '').strip(), # owner rating
            json_data['dict14']['value'], # deployer balance
            json_data['dict15']['value'], # deployer transactions count
            json_data['dict16']['value'].split('\n')[0], # funding source
            json_data['dict17']['value'].split('\n')[0], # funding amount
            json_data['dict22']['value'], # max per wallet
            json_data['dict23']['value'], # max per transaction
        ]

        # add to values_list
        values_list.append(value_item_list)
        
        line_number += 1  # Increment line number counter


# create dataframe with json data stored in values_list
columns = ['address', 'alert_timestamp', 'created', 'verified', 'renounced', 'marketcap', 'buys', 'sells', 'buysell_rating', 'honeypot', 'buytax', 'selltax', 'taxrating', 'liquidity','owner', 'owner_rating', 'deployer_balance', 'deployer_tx', 'funding_source', 'funding_amount', 'max_wallet', 'max_tx']
df = pd.DataFrame(values_list, columns=columns)

Let's check the general structure of our data:

In [6]:
df.head()

Unnamed: 0,address,alert_timestamp,created,verified,renounced,marketcap,buys,sells,buysell_rating,honeypot,...,taxrating,liquidity,owner,owner_rating,deployer_balance,deployer_tx,funding_source,funding_amount,max_wallet,max_tx
0,0xa190700f5ae95de4eabf29fa9469bd85ff5a7919,2022-11-04T23:43:09.315000+00:00,<t:1667604491:R>,True,Not Ownable,"$9,755",100,24,green,True,...,green,7958,Unicrypt,green,0.06Ξ,18,:yellow_circle: [0x04af…65fd](https://ethersca...,0.03Ξ,2% | 20000000,1% | 10000000
1,0x9de736b02f3d09738ac42cdea046b014b0d54d60,2022-11-04T23:37:28.477000+00:00,<t:1667604059:R>,True,False,"$21,924",90,19,green,True,...,green,8947,Unicrypt,green,0.61Ξ,9,:green_circle: [Binance 14](https://etherscan....,0.75Ξ,2% | 2000000,100% | 100000000000000000
2,0xaaf8a1aad53c9384be3aecb5a16af6121a5ad935,2022-11-04T23:33:33.024000+00:00,<t:1667603939:R>,True,Not Ownable,"$8,951",27,7,yellow,True,...,green,7680,Team Finance,green,0.19Ξ,9,:yellow_circle: [Binance 15](https://etherscan...,1.34Ξ,-,-
3,0xa17ae9a7174cdbc5294e3fad8afbafc1be1764a3,2022-11-04T23:33:18.962000+00:00,<t:1667604239:R>,True,False,"$25,578",75,8,yellow,True,...,green,17461,Unicrypt,green,0.92Ξ,7,:yellow_circle: [Binance 15](https://etherscan...,2.50Ξ,2% | 200000,100% | 10000000000000000000000000
4,0x3b2d93677c433c191aa379c78b97e0685c3f4798,2022-11-04T23:33:02.893000+00:00,<t:1667603795:R>,False,False,"$44,448",16,10,yellow,True,...,red,11481,Unicrypt,green,0.34Ξ,8,:yellow_circle: [0xb739…0d45](https://ethersca...,3.70Ξ,-,-


Let's see the number of data points we've collected:

In [7]:
len(df)

60559

Our data is not perfectly cleaned, but there is additional data we still need to collect. A thorough cleaning process will take place once all our data is collected.
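
For a sense of what that later cleaning might involve, the sketch below converts two of the raw fields visible in the preview into typed values using only the standard library. The helper names are hypothetical, and the formats are assumed from the sample rows above (Discord's `<t:UNIX:R>` timestamp markup and a `$`-prefixed, comma-grouped market cap); the actual cleaning step may differ.

```python
import re
from datetime import datetime, timezone

def parse_discord_timestamp(raw):
    """Convert '<t:1667604491:R>' to a timezone-aware UTC datetime, or None."""
    match = re.fullmatch(r'<t:(\d+)(?::[a-zA-Z])?>', raw.strip())
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)

def parse_dollars(raw):
    """Convert '$9,755' to 9755.0, or None if the text is not a dollar amount."""
    text = raw.strip().replace('$', '').replace(',', '')
    try:
        return float(text)
    except ValueError:
        return None

print(parse_discord_timestamp('<t:1667604491:R>'))  # 2022-11-04 23:28:11+00:00
print(parse_dollars('$9,755'))                      # 9755.0
```

With pandas, helpers like these could be applied column-wise, for example `df['created'].map(parse_discord_timestamp)`.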

Let's export our base Discord metrics:

In [11]:
df.to_csv('../data/discord_data.csv', index=False)

We now have our data in a readable CSV format.

This data represents a snapshot of each token's metrics at the beginning of trading.

Next, we want to scrape some metrics at the peak of its trading, the all-time high price.

We will do that by scraping the Dune API. That process can be found here: **['dune_scrape_step_3'](./dune_scrape_step_3.ipynb)**