# Scraping Volume from Defined 

As mentioned in our Data Cleaning process, we need volume metrics to filter out potential scam tokens and determine a more accurate target class.

While initially going through this process, we realized that the Defined volume metrics need a pair_address instead of a regular address. Allow me to explain the difference:
- A regular address is a token's unique address on the blockchain.
- A pair address is used for trading. It represents a pair of regular addresses: one address for the token you trade, and another address for the asset you trade it with. These assets are typically:
    - ETH/WETH
    - USDC (a token that tracks the US dollar, also known as a 'stablecoin')

Not to worry, we don't need to do much extra work. Remember our initial Discord scrape? Although we didn't use the links in our dataframe, we did scrape them into our JSON file.

More on that process here: **['pair_extraction'](./pair_extraction.ipynb)**

Table of Contents:<a id="Contents"></a>
* [Data sanity](#sanity)
* [Scraping Data](#scrape)
* [Conclusion & Next Steps](#nextsteps)

## Load in Data<a id="sanity"></a>
**Sanity checks**
<p><a href="#Contents" style="font-size: 12px;">Back to Table of Contents</a></p>

In [3]:
import pandas as pd

import datetime as dt

In [6]:
# Getting data for target class rows only => from: 'token_cleaning.ipynb'
df = pd.read_csv('../data/volume_for_target_addresses.csv')

# volume scrape requires pair addresses rather than token addresses, => from: 'pair_extraction.ipynb'
df_pairs = pd.read_csv('../data/pair_addresses.csv')

Ensure all addresses are lowercase for merging

In [15]:
df['address'] = df['address'].str.lower()
df_pairs['address'] = df_pairs['address'].str.lower()

In [16]:
merged_df = pd.merge(df, df_pairs, on='address', how='inner')
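
An inner merge silently drops any token address that has no matching pair address. As a minimal sketch on toy data (the addresses below are made-up placeholders, not real tokens), a left merge with `indicator=True` is one way to list which rows would be dropped before committing to the inner merge:

```python
import pandas as pd

# Toy frames standing in for df and df_pairs; values are placeholders.
tokens = pd.DataFrame({"address": ["0xaaa", "0xbbb", "0xccc"]})
pairs = pd.DataFrame({
    "address": ["0xaaa", "0xccc"],
    "pair_address": ["0x111", "0x333"],
})

# how="left" keeps every token row; indicator=True adds a `_merge` column
# that says whether each row was found in both frames or only in the left one.
check = tokens.merge(pairs, on="address", how="left", indicator=True)

# Token addresses that have no pair address and would vanish in an inner merge.
unmatched = check.loc[check["_merge"] == "left_only", "address"].tolist()
print(unmatched)
```

If `unmatched` is empty, the inner merge loses nothing.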

In [17]:
merged_df

Unnamed: 0,address,alert_timestamp,ath_date,pair_address
0,0x568d77b90622e5be0ae4132edc87e07188f6c19d,2022-11-04,2022-11-06,0x123ad76dba817a1db8f564f392dc120a6f604063
1,0x50b3cdbb29351c189ddcea87556eddc29535f10c,2022-11-04,2022-11-06,0xbb8ee1a73ca54faa39b95798f22a6bb5f29dfe0e
2,0xe424c77eff453ea6be368c8480e8ae405868bda2,2022-11-04,2022-11-07,0x944945f428fc38a37e6aaafff1a287eaa38af991
3,0xfd5d8dab40f60e184426c3dd6a5edc9e7842f375,2022-11-05,2023-02-12,0xac8ffbf156583bc3c9efe1e195547a24adebf151
4,0x2dc1f8a31080d0d7d03c2c719a02955a3548e478,2022-11-05,2022-11-07,0x5277dbc69a18a7dd1d50204c84659f18379c2681
...,...,...,...,...
2403,0x1210c3e3c6296dbf2708f5a4800228236984b475,2023-12-23,2023-12-29,0x7713a5f3e126313689306529d4a5e750c4e51147
2404,0xac7945a20c3a6629749f40ad57aac77ad9c60fed,2023-12-24,2023-12-27,0xfbd2684d6399e976f53debe465eff2cacfed2528
2405,0x38e3e9512f669f7297e8d2dd1bbc1a994ee78e4b,2023-12-26,2024-01-01,0xfa4458db8fe0fcef68bc24ec2c55ba76f4d34a5e
2406,0x98e35f5599b57998900e5e0675721c90a5499327,2023-12-27,2023-12-31,0x996684e437b90e6823730d40dc3e8ff397370e11


In [18]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2408 entries, 0 to 2407
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   address          2408 non-null   object
 1   alert_timestamp  2408 non-null   object
 2   ath_date         2408 non-null   object
 3   pair_address     2408 non-null   object
dtypes: object(4)
memory usage: 75.4+ KB


In [19]:
merged_df.nunique()

address            2408
alert_timestamp     413
ath_date            419
pair_address       2408
dtype: int64

In [20]:
merged_df.isnull().sum()

address            0
alert_timestamp    0
ath_date           0
pair_address       0
dtype: int64

For our Defined API call, in addition to the pair address, we also need timestamps for historic calls.

We need to convert our date columns:

- From string to datetime
- Subtract datetime(1970,1,1) from each datetime
- Convert the result to seconds
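
The three steps above can be sketched on two of the dates from our dataframe. This is an illustrative variant, not the exact cell used in this notebook: it uses integer division by a one-second `Timedelta`, which yields 64-bit integers rather than relying on the platform's default `int` size.

```python
import pandas as pd

# Step 1: string -> datetime
dates = pd.Series(pd.to_datetime(["2022-11-04", "2022-11-06"]))

# Step 2: subtract the Unix epoch to get a timedelta per row
since_epoch = dates - pd.Timestamp("1970-01-01")

# Step 3: convert the timedelta to whole seconds
epoch_seconds = since_epoch // pd.Timedelta(seconds=1)

print(epoch_seconds.tolist())
```

The resulting values, 1667520000 and 1667692800, match the timestamps shown for those dates in the dataframe output in this notebook.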

In [21]:
merged_df['ath_date'] = pd.to_datetime(merged_df['ath_date'])
merged_df['alert_timestamp'] = pd.to_datetime(merged_df['alert_timestamp'])

Rename the column for clarity: `alert_timestamp` currently holds a date, and we want that name for the Unix timestamp.

In [22]:
merged_df.rename(columns={'alert_timestamp': 'alert_date'}, inplace=True)

In [24]:
merged_df['ath_price_timestamp'] = (merged_df['ath_date'] - dt.datetime(1970,1,1)).dt.total_seconds().astype(int)
merged_df['alert_timestamp'] = (merged_df['alert_date'] - dt.datetime(1970,1,1)).dt.total_seconds().astype(int)

In [25]:
merged_df.drop_duplicates()

Unnamed: 0,address,alert_date,ath_date,pair_address,ath_price_timestamp,alert_timestamp
0,0x568d77b90622e5be0ae4132edc87e07188f6c19d,2022-11-04,2022-11-06,0x123ad76dba817a1db8f564f392dc120a6f604063,1667692800,1667520000
1,0x50b3cdbb29351c189ddcea87556eddc29535f10c,2022-11-04,2022-11-06,0xbb8ee1a73ca54faa39b95798f22a6bb5f29dfe0e,1667692800,1667520000
2,0xe424c77eff453ea6be368c8480e8ae405868bda2,2022-11-04,2022-11-07,0x944945f428fc38a37e6aaafff1a287eaa38af991,1667779200,1667520000
3,0xfd5d8dab40f60e184426c3dd6a5edc9e7842f375,2022-11-05,2023-02-12,0xac8ffbf156583bc3c9efe1e195547a24adebf151,1676160000,1667606400
4,0x2dc1f8a31080d0d7d03c2c719a02955a3548e478,2022-11-05,2022-11-07,0x5277dbc69a18a7dd1d50204c84659f18379c2681,1667779200,1667606400
...,...,...,...,...,...,...
2403,0x1210c3e3c6296dbf2708f5a4800228236984b475,2023-12-23,2023-12-29,0x7713a5f3e126313689306529d4a5e750c4e51147,1703808000,1703289600
2404,0xac7945a20c3a6629749f40ad57aac77ad9c60fed,2023-12-24,2023-12-27,0xfbd2684d6399e976f53debe465eff2cacfed2528,1703635200,1703376000
2405,0x38e3e9512f669f7297e8d2dd1bbc1a994ee78e4b,2023-12-26,2024-01-01,0xfa4458db8fe0fcef68bc24ec2c55ba76f4d34a5e,1704067200,1703548800
2406,0x98e35f5599b57998900e5e0675721c90a5499327,2023-12-27,2023-12-31,0x996684e437b90e6823730d40dc3e8ff397370e11,1703980800,1703635200


In [26]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2408 entries, 0 to 2407
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   address              2408 non-null   object        
 1   alert_date           2408 non-null   datetime64[ns]
 2   ath_date             2408 non-null   datetime64[ns]
 3   pair_address         2408 non-null   object        
 4   ath_price_timestamp  2408 non-null   int32         
 5   alert_timestamp      2408 non-null   int32         
dtypes: datetime64[ns](2), int32(2), object(2)
memory usage: 94.2+ KB


## Scraping the Data<a id="scrape"></a>
<p><a href="#Contents" style="font-size: 12px;">Back to Table of Contents</a></p>

In [27]:
import requests
import json
import time

In addition to volume metrics on the day trading goes live (almost always the liquidity lock day), we want metrics for:

- Volume at ath date
- Volume the day before ath
- Volume the day after ath

If volume the day before and/or after is 0 or irregularly lower than on the ath day, we consider this a red flag and not a true indication of a successful investment opportunity. We've seen false ath growth from liquidity manipulation scams, in which case there is no volume leading up to the final transaction and no volume afterwards either.

In short: this will enable us to filter out false positives and scams.
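
The red-flag rule described above can be sketched as a simple filter. This is only an illustration on toy values: the column names follow the volume dataframe built later in this notebook, while `MIN_RATIO` is a hypothetical cutoff for "irregularly lower" that this notebook does not define.

```python
import pandas as pd

# Toy values; addresses are placeholders.
toy = pd.DataFrame({
    "address": ["0xaaa", "0xbbb", "0xccc"],
    "daily_volume_at_ath": [100000, 50000, 80000],
    "daily_volume_before_ath": [90000, 0, 2000],
    "daily_volume_after_ath": [60000, 0, 70000],
})

MIN_RATIO = 0.1  # hypothetical: neighbouring-day volume below 10% of ATH-day volume is suspicious

# Smaller of the two neighbouring days (day before / day after the ATH).
lowest_neighbour = toy[["daily_volume_before_ath", "daily_volume_after_ath"]].min(axis=1)

# Flag rows where a neighbouring day has no volume at all,
# or far less volume than the ATH day itself.
toy["red_flag"] = (lowest_neighbour == 0) | (
    lowest_neighbour < MIN_RATIO * toy["daily_volume_at_ath"]
)

print(toy[["address", "red_flag"]])
```

The right value for the cutoff would need to be chosen by inspecting the real data.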

To get the day before and after our timestamp, we add or subtract **86400** (the number of seconds in a day).

At the time of writing, the API does not accept a list of timestamps, so we have to make 4 calls for each row of data.
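
Since the four calls differ only in their timestamp, the query strings for one row can be generated from a single template. The sketch below is one possible way to do that. `QUERY_TEMPLATE` and `build_queries` are hypothetical names introduced for illustration; the query body is the one used in this notebook, and the example row is the first row of our dataframe.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

# Doubled braces are literal braces for str.format; {pair} and {timestamp} are filled in.
QUERY_TEMPLATE = """
query {{
  getDetailedPairStats(pairAddress: "{pair}" networkId:1 timestamp:{timestamp} tokenOfInterest:token1) {{
    tokenOfInterest
    stats_day1 {{
      statsUsd {{
        volume {{
          currentValue
          previousValue
          change
        }}
      }}
      start
      end
    }}
  }}
}}
"""


def build_queries(pair, alert, ath):
    """Return four query strings in the order: alert, ath, day before ath, day after ath."""
    timestamps = [alert, ath, ath - SECONDS_PER_DAY, ath + SECONDS_PER_DAY]
    return [QUERY_TEMPLATE.format(pair=pair, timestamp=ts) for ts in timestamps]


queries = build_queries(
    "0x123ad76dba817a1db8f564f392dc120a6f604063", 1667520000, 1667692800
)
print(len(queries))
```

Each string could then be sent with `requests.post(url, headers=headers, json={"query": query})`, with the same status check and pause between calls.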

In [30]:
defined_api_results = []

In [31]:
iterations_acc_1 = 0

In [1]:
url = "https://graph.defined.fi/graphql"

headers = {
  "Content-Type": "application/json",
  "Authorization": "api key here"
}

for index, row in merged_df.iloc[iterations_acc_1:2409].iterrows():
    row_object = []
    
    pair = row['pair_address']
    alert = row['alert_timestamp']
    ath = row['ath_price_timestamp']
    ath_before = row['ath_price_timestamp'] - 86400
    ath_after = row['ath_price_timestamp'] + 86400

    getDetailedPairStats = f"""
    query {{
        getDetailedPairStats(pairAddress: "{pair}" networkId:1 timestamp:{alert} tokenOfInterest:token1) {{
        tokenOfInterest
        stats_day1 {{
          statsUsd {{
            volume {{
              currentValue
              previousValue
              change
            }}
          }}
          start
          end
        }}
      }}
    }}
    """
    
    response = requests.post(url, headers=headers, json={"query": getDetailedPairStats})

    if response.status_code == 200:
        row_object.append(row['address'])
        row_object.append(json.loads(response.text))
        print('success')
    else:
        print(f"Request failed for index {index}. Status code: {response.status_code}")
        break

    time.sleep(0.25)
    
    # second call
    getDetailedPairStats = f"""
    query {{
        getDetailedPairStats(pairAddress: "{pair}" networkId:1 timestamp:{ath} tokenOfInterest:token1) {{
        tokenOfInterest
        stats_day1 {{
          statsUsd {{
            volume {{
              currentValue
              previousValue
              change
            }}
          }}
          start
          end
        }}
      }}
    }}
    """
    response = requests.post(url, headers=headers, json={"query": getDetailedPairStats})

    if response.status_code == 200:
        row_object.append(json.loads(response.text))
        print('success')
    else:
        print(f"Request failed for index {index}. Status code: {response.status_code}")
        break

    time.sleep(0.25)

    # third call
    getDetailedPairStats = f"""
    query {{
        getDetailedPairStats(pairAddress: "{pair}" networkId:1 timestamp:{ath_before} tokenOfInterest:token1) {{
        tokenOfInterest
        stats_day1 {{
          statsUsd {{
            volume {{
              currentValue
              previousValue
              change
            }}
          }}
          start
          end
        }}
      }}
    }}
    """
    response = requests.post(url, headers=headers, json={"query": getDetailedPairStats})

    if response.status_code == 200:
        row_object.append(json.loads(response.text))
        print('success')
    else:
        print(f"Request failed for index {index}. Status code: {response.status_code}")
        break

    time.sleep(0.25)

    # fourth call
    getDetailedPairStats = f"""
    query {{
        getDetailedPairStats(pairAddress: "{pair}" networkId:1 timestamp:{ath_after} tokenOfInterest:token1) {{
        tokenOfInterest
        stats_day1 {{
          statsUsd {{
            volume {{
              currentValue
              previousValue
              change
            }}
          }}
          start
          end
        }}
      }}
    }}
    """
    response = requests.post(url, headers=headers, json={"query": getDetailedPairStats})

    if response.status_code == 200:
        row_object.append(json.loads(response.text))
        print('success')

        defined_api_results.append(row_object)
    else:
        print(f"Request failed for index {index}. Status code: {response.status_code}")
        break

    time.sleep(0.25)

    print(len(defined_api_results))
    # track successful iterations only
    iterations_acc_1 += 1

In [36]:
len(defined_api_results)

2408

Sample look at data

In [40]:
defined_api_results[100]

['0xcc578f173851cfad2fc124e1499bd492849b2b6d',
 {'data': {'getDetailedPairStats': {'tokenOfInterest': 'token1',
    'stats_day1': {'statsUsd': {'volume': {'currentValue': '0',
       'previousValue': '0',
       'change': 0}},
     'start': 1669334400,
     'end': 1669420801}}}},
 {'data': {'getDetailedPairStats': {'tokenOfInterest': 'token1',
    'stats_day1': {'statsUsd': {'volume': {'currentValue': '102936',
       'previousValue': '91997',
       'change': 0.118906}},
     'start': 1669507200,
     'end': 1669593601}}}},
 {'data': {'getDetailedPairStats': {'tokenOfInterest': 'token1',
    'stats_day1': {'statsUsd': {'volume': {'currentValue': '91997',
       'previousValue': '0',
       'change': 1}},
     'start': 1669420800,
     'end': 1669507201}}}},
 {'data': {'getDetailedPairStats': {'tokenOfInterest': 'token1',
    'stats_day1': {'statsUsd': {'volume': {'currentValue': '60479',
       'previousValue': '102936',
       'change': -0.41246}},
     'start': 1669593600,
     'end

In [41]:
defined_api_results[100][0] #address

'0xcc578f173851cfad2fc124e1499bd492849b2b6d'

In [42]:
defined_api_results[100][1] #alert
defined_api_results[100][2] #ath
defined_api_results[100][3] #before
defined_api_results[100][4] #after

{'data': {'getDetailedPairStats': {'tokenOfInterest': 'token1',
   'stats_day1': {'statsUsd': {'volume': {'currentValue': '0',
      'previousValue': '0',
      'change': 0}},
    'start': 1669334400,
    'end': 1669420801}}}}

In [52]:
defined_api_results[100][2]['data']['getDetailedPairStats']['stats_day1']['statsUsd']['volume']['currentValue'] #ath

'102936'

In [53]:
defined_api_results[100][3]['data']['getDetailedPairStats']['stats_day1']['statsUsd']['volume']['currentValue'] #before

'91997'

In [55]:
defined_api_results[100][4]['data']['getDetailedPairStats']['stats_day1']['statsUsd']['volume']['currentValue'] #after

'60479'
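
The long chain of keys above raises an exception if any level is missing or `None`. GraphQL APIs commonly answer with HTTP 200 even when a query fails, so a defensive lookup can be useful. A minimal sketch follows, where `extract_volume` is a hypothetical helper and the assumption that a failed lookup shows up as a missing or null level has not been verified against the Defined API:

```python
VOLUME_PATH = (
    "data", "getDetailedPairStats", "stats_day1", "statsUsd", "volume", "currentValue"
)


def extract_volume(response_json):
    """Return the stats_day1 USD volume as a float, or None if any level is missing."""
    node = response_json
    for key in VOLUME_PATH:
        # Stop as soon as a level is absent, null, or not a dict.
        if not isinstance(node, dict) or node.get(key) is None:
            return None
        node = node[key]
    return float(node)


# Shaped like the ATH-day response shown in this notebook.
sample = {
    "data": {"getDetailedPairStats": {
        "tokenOfInterest": "token1",
        "stats_day1": {
            "statsUsd": {"volume": {
                "currentValue": "102936", "previousValue": "91997", "change": 0.118906,
            }},
            "start": 1669507200,
            "end": 1669593601,
        },
    }}
}

print(extract_volume(sample))
print(extract_volume({"data": {"getDetailedPairStats": None}}))
```

Returning `None` instead of raising lets the failed rows be counted and inspected afterwards.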

In [56]:
values_list = []

# Each item is [address, alert response, ath response, before-ath response, after-ath response]
for item in defined_api_results:
    value_item_list = [
        item[0],
        item[1]['data']['getDetailedPairStats']['stats_day1']['statsUsd']['volume']['currentValue'],
        item[2]['data']['getDetailedPairStats']['stats_day1']['statsUsd']['volume']['currentValue'],
        item[3]['data']['getDetailedPairStats']['stats_day1']['statsUsd']['volume']['currentValue'],
        item[4]['data']['getDetailedPairStats']['stats_day1']['statsUsd']['volume']['currentValue'],
    ]

    values_list.append(value_item_list)


Let's create our Dataframe:

In [58]:
columns = ['address', 'daily_volume_on_launch', 'daily_volume_at_ath', 'daily_volume_before_ath', 'daily_volume_after_ath']
volume_df = pd.DataFrame(values_list, columns=columns)

In [59]:
volume_df

Unnamed: 0,address,daily_volume_on_launch,daily_volume_at_ath,daily_volume_before_ath,daily_volume_after_ath
0,0x568d77b90622e5be0ae4132edc87e07188f6c19d,0,41236,89661,48167
1,0x50b3cdbb29351c189ddcea87556eddc29535f10c,0,21972,26726,21040
2,0xe424c77eff453ea6be368c8480e8ae405868bda2,0,14226,26511,27099
3,0xfd5d8dab40f60e184426c3dd6a5edc9e7842f375,0,0,0,340
4,0x2dc1f8a31080d0d7d03c2c719a02955a3548e478,0,68054,29311,108706
...,...,...,...,...,...
2403,0x1210c3e3c6296dbf2708f5a4800228236984b475,0,1481,228,1263
2404,0xac7945a20c3a6629749f40ad57aac77ad9c60fed,0,60313,203861,41105
2405,0x38e3e9512f669f7297e8d2dd1bbc1a994ee78e4b,0,31932,40029,196417
2406,0x98e35f5599b57998900e5e0675721c90a5499327,0,80330,224771,128706


In [64]:
sum(volume_df['daily_volume_on_launch'] == '0')

2247
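
Note that the comparison above is against the string `'0'`, because the API returns volumes as strings. A minimal sketch, on toy values, of converting the volume columns to numbers so that numeric comparisons and arithmetic work as expected:

```python
import pandas as pd

# Toy stand-in for volume_df; addresses are placeholders, volumes are strings as returned by the API.
toy = pd.DataFrame({
    "address": ["0xaaa", "0xbbb"],
    "daily_volume_on_launch": ["0", "41236"],
    "daily_volume_at_ath": ["102936", "0"],
})

volume_cols = [col for col in toy.columns if col.startswith("daily_volume_")]

# errors="coerce" turns anything unparseable into NaN instead of raising.
toy[volume_cols] = toy[volume_cols].apply(pd.to_numeric, errors="coerce")

zero_on_launch = int((toy["daily_volume_on_launch"] == 0).sum())
print(zero_on_launch)
```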

In [66]:
volume_df.to_csv('../data/target_addresses_volume.csv', index=False)

## Conclusion & Next Steps<a id="nextsteps"></a>
<p><a href="#Contents" style="font-size: 12px;">Back to Table of Contents</a></p>

We now have our volume data to help determine our target class.

Let's head back to our data cleaning and put this data to use: **['tokens_cleaning'](../Research/tokens_cleaning.ipynb)**