# Data Cleaning
#### Step 5

Now that we've collected our initial data, let's clean it up and work towards assigning a target class label.

Table of Contents:<a id="Contents"></a>
* [Initial Cleaning](#initialcleaning)
* [Finding Target Class](#targetclass)
* [Scrape for Additional Data](#moreinfo)
* [Conclusion & Next Steps](#nextsteps)

In [1]:
import pandas as pd
import numpy as np

import datetime

## Initial Cleaning<a id="initialcleaning"></a>
<p><a href="#Contents" style="font-size: 12px;">Back to Table of Contents</a></p>

Load in data

In [2]:
df1 = pd.read_csv('../data/discord_data.csv') # from discord ~ base metrics
df2 = pd.read_csv('../data/ath_price_data.csv') # from dune ~ ath date and price
df3 = pd.read_csv('../data/defined_data.csv') # from defined (token info only) ~ ticker and total supply

Rename columns for merging ease

In [3]:
df2 = df2.rename(columns={'token_bought_address': 'address'})
df2 = df2.rename(columns={'block_date': 'ath_date', 'price':'ath_price'})

Ensure all addresses are completely lowercase for merging

In [4]:
df1['address'] = df1['address'].str.lower()
df2['address'] = df2['address'].str.lower()
df3['address'] = df3['address'].str.lower()

Merge data

In [5]:
first_merge = pd.merge(df1, df2, on='address', how='outer')
df = pd.merge(first_merge, df3, on='address', how='outer')

Drop nulls and duplicate addresses

In [6]:
df = df.dropna()
df = df.drop_duplicates(subset='address', keep='first')
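
One consequence of the two cells above is worth spelling out: an outer merge followed by `dropna()` keeps only addresses that appear in all three sources with no missing fields, which is roughly what an inner merge would give. A minimal sketch on made-up addresses (the toy frames and values are hypothetical, not taken from the project data):

```python
import pandas as pd

# Hypothetical toy frames standing in for df1, df2 and df3
a = pd.DataFrame({"address": ["0xAA", "0xbb", "0xcc"], "marketcap": [100.0, 200.0, 300.0]})
b = pd.DataFrame({"address": ["0xaa", "0xbb"], "ath_price": [1.5, 2.5]})
c = pd.DataFrame({"address": ["0xaa", "0xdd"], "totalSupply": [1000.0, 4000.0]})

# Lowercase the join key so "0xAA" and "0xaa" are treated as the same address
for frame in (a, b, c):
    frame["address"] = frame["address"].str.lower()

# Outer merge keeps every address, then dropna() removes any row with a gap
outer = a.merge(b, on="address", how="outer").merge(c, on="address", how="outer")
outer_clean = outer.dropna().drop_duplicates(subset="address", keep="first")

# Inner merge keeps only addresses present in all three frames
inner = a.merge(b, on="address", how="inner").merge(c, on="address", how="inner")

print(sorted(outer_clean["address"]))
print(sorted(inner["address"]))
```

Both approaches leave only `0xaa` here. The outer-then-drop route is stricter on real data, since `dropna()` also removes rows that are present in every source but have a null in any column.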

Convert the 'created' column from a Discord timestamp to a datetime

In [7]:
# Discord timestamp string to datetime object
def convert_to_date(timestamp_string):

    # Extract the timestamp number (the field after the first ':')
    timestamp = int(timestamp_string.split(':')[1])

    # Convert the extracted timestamp to a datetime object (uses the machine's local time zone)
    date = datetime.datetime.fromtimestamp(timestamp)
    return pd.to_datetime(date.date())

# Apply the conversion function to the entire column
df['created'] = df['created'].apply(convert_to_date)
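
As a hedged alternative, the sketch below parses the same kind of string with a regular expression and pins the conversion to UTC, so the resulting date does not depend on the time zone of the machine running the notebook. It assumes the column holds Discord timestamp markup such as `<t:1700000000:R>`; the helper name `parse_discord_timestamp` and the sample strings are hypothetical.

```python
import datetime
import re

# Assumed format: Discord timestamp markup, e.g. "<t:1700000000:R>"
_DISCORD_TS = re.compile(r"<t:(\d+)(?::[a-zA-Z])?>")

def parse_discord_timestamp(text):
    """Return the UTC calendar date encoded in a Discord timestamp tag, or None."""
    match = _DISCORD_TS.search(text)
    if match is None:
        return None
    seconds = int(match.group(1))
    # Pin to UTC so the result is the same on every machine
    return datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc).date()

print(parse_discord_timestamp("<t:1700000000:R>"))   # 2023-11-14
print(parse_discord_timestamp("no timestamp here"))  # None
```

Returning `None` for unparseable strings, instead of raising, makes it easier to count bad rows before deciding how to handle them.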

Clean columns containing symbols that a model cannot interpret, reducing them to their base numeric form

In [8]:
df['ath_date'] = pd.to_datetime(df['ath_date'])
df['funding_source_rating'] = df['funding_source'].str.extract(r':(\w+)_circle:')
df['funding_source'] = df['funding_source'].str.extract(r'\[(.*?)\]')
df['max_wallet'] = df['max_wallet'].str.split('|').str.get(0).str.strip()
df['max_tx'] = df['max_tx'].str.split('|').str.get(0).str.strip()
df['funding_amount'] = df['funding_amount'].fillna(0)
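
To make the patterns above concrete, here is a small sketch run on made-up strings. The raw formats shown (an emoji shortcode followed by a markdown link, and a limit followed by `|` and extra detail) are assumptions inferred from the regular expressions, not confirmed samples of the Discord data.

```python
import pandas as pd

# Hypothetical raw values in the shape the regexes appear to expect
raw_funding = pd.Series([
    ":green_circle: [Binance](https://example.com/a)",
    ":red_circle: [Unknown](https://example.com/b)",
])
raw_limits = pd.Series(["2% | 20,000 tokens", "1%"])

# Colour word before "_circle:" becomes the rating
rating = raw_funding.str.extract(r":(\w+)_circle:", expand=False)
# Text inside the first pair of square brackets becomes the source name
name = raw_funding.str.extract(r"\[(.*?)\]", expand=False)
# Keep only the part before the first "|" and trim whitespace
limits = raw_limits.str.split("|").str.get(0).str.strip()

print(rating.tolist())
print(name.tolist())
print(limits.tolist())
```

Passing `expand=False` returns a Series for a single capture group, which is convenient when assigning straight into one column.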

The following columns need to be made numeric

In [9]:
to_numeric = ['buytax', 'selltax', 'max_wallet', 'max_tx']
df = df.replace('-', np.nan)

# In the Discord source docs, a max_wallet value of 'TD' simply means there's a transfer delay and the max value is the same as max_tx
df.loc[df['max_wallet'] == 'TD', 'max_wallet'] = df.loc[df['max_wallet'] == 'TD', 'max_tx']

df['marketcap'] = df['marketcap'].replace({r'\$': '', ',': ''}, regex=True)
df[to_numeric] = df[to_numeric].replace({'%': ''}, regex=True)


df[to_numeric] = df[to_numeric].apply(pd.to_numeric)
df['marketcap'] = df['marketcap'].apply(pd.to_numeric)

# These columns end with the Eth Ξ symbol, which to_numeric() cannot parse, so slice it off first
df['funding_amount'] = df['funding_amount'].str.slice(stop=-1).astype(float)
df['deployer_balance'] = df['deployer_balance'].str.slice(stop=-1).astype(float)
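
The symbol stripping above can also be wrapped in one small helper. This is only a sketch under assumptions: the helper name `to_number` and the sample values are hypothetical, and it uses `errors="coerce"` so that placeholders such as `-` become `NaN` instead of raising.

```python
import pandas as pd

def to_number(series, strip_chars):
    """Strip the given characters and convert to float; unparseable values become NaN."""
    cleaned = series.astype(str)
    for ch in strip_chars:
        # regex=False treats "$" and similar characters literally
        cleaned = cleaned.str.replace(ch, "", regex=False)
    return pd.to_numeric(cleaned.str.strip(), errors="coerce")

# Hypothetical raw values
marketcap = to_number(pd.Series(["$12,500", "$900", "-"]), "$,")
funding = to_number(pd.Series(["1.5Ξ", "0.25Ξ", "-"]), "Ξ")

print(marketcap.tolist())
print(funding.tolist())
```

Stripping the symbol by name is a little safer than slicing off the last character, because a value that happens to lack the symbol keeps its final digit.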

Now that numerics have been handled, let's check for infinite values and handle any new nulls introduced by our cleaning

In [10]:
inf_columns = df.columns[df.isin([np.inf, -np.inf]).any()]

for col in inf_columns:
    inf_values_count = df[col].isin([np.inf, -np.inf]).sum()
    print(f"Column '{col}' has {inf_values_count} infinite values.")

A null tax is the same as there being no tax, so we set the tax to 0 and the rating to green, matching other rows where buy and sell tax are both 0

In [11]:
df.loc[df['buytax'].isnull() & df['selltax'].isnull(), 'taxrating'] = 'green'
df['buytax'] = df['buytax'].fillna(0)
df['selltax'] = df['selltax'].fillna(0)

A missing max wallet or max transaction limit means there is no limit on the amount you can buy, so you can theoretically purchase 100% of the token supply

In [12]:
df['max_wallet'] = df['max_wallet'].fillna(100)
df['max_tx'] = df['max_tx'].fillna(100)

Let's also convert honeypot to a boolean. If the sell is red, the token is a honeypot scam and should NOT be bought; it will be given a value of False

In [13]:
df['honeypot'] = np.where(df['honeypot'] == 'True', True, False)

Of the non-numeric values, it appears we can one-hot encode the following columns:
- renounced
- buysell_rating
- taxrating
- owner
- owner_rating
- funding_source_rating

In [14]:
cols_to_dummy = ['renounced', 'buysell_rating', 'taxrating', 'owner', 'owner_rating', 'funding_source_rating']
dummies = pd.get_dummies(df[cols_to_dummy])
df = pd.concat([df, dummies], axis=1)
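
For readers new to one-hot encoding, here is a minimal sketch of what `pd.get_dummies` produces. The toy values (`green`/`red`, `yes`/`no`) are hypothetical stand-ins for the real categories.

```python
import pandas as pd

# Hypothetical categorical columns
toy = pd.DataFrame({
    "taxrating": ["green", "red", "green"],
    "renounced": ["yes", "no", "no"],
})

# Each category becomes its own 0/1 column named "<column>_<category>"
encoded = pd.get_dummies(toy, columns=["taxrating", "renounced"], dtype=int)

print(encoded.columns.tolist())
print(encoded["taxrating_green"].tolist())
```

Passing `columns=` replaces the original text columns with their encoded versions, whereas the `pd.concat` approach in the cell above keeps the originals alongside the new columns. Setting `dtype=int` gives 0/1 values regardless of the pandas version's default.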

## Finding our Target Class<a id="targetclass"></a>
<p><a href="#Contents" style="font-size: 12px;">Back to Table of Contents</a></p>

Let's add some updated data to our existing dataframe in order to better find our target variable:

In [15]:
df_alert = pd.read_csv('../data/partial_discord_data_time_alert.csv')
df_alert.drop(columns=['created', 'ath_date'], inplace=True)
df_alert = df_alert.drop_duplicates(subset='address', keep='first')
df = pd.merge(df, df_alert, on='address', how='left')

Alert timestamp needs to be updated to a datetime object:

In [16]:
df['alert_timestamp'] = pd.to_datetime(df['alert_timestamp'])

Now to calculate our measure of success, or target variable.

We're going to multiply the all-time high price by the token supply to get the all-time high marketcap.

We're then going to compare the marketcap at liquidity lock to the all-time high marketcap to measure growth.

In [17]:
df['ath_marketcap'] = df['ath_price'] * df['totalSupply']
df['growth'] = (df['ath_marketcap'] / df['marketcap'])
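
A small worked example of the growth calculation, using made-up numbers. It also shows one edge case worth keeping in mind: with float columns, a starting marketcap of 0 produces an infinite growth value instead of an error.

```python
import numpy as np
import pandas as pd

# Hypothetical tokens: the second one has a recorded marketcap of 0
toy = pd.DataFrame({
    "ath_price": [0.5, 0.25, 2.0],
    "totalSupply": [1000.0, 1000.0, 1000.0],
    "marketcap": [100.0, 0.0, 4000.0],
})

toy["ath_marketcap"] = toy["ath_price"] * toy["totalSupply"]
toy["growth"] = toy["ath_marketcap"] / toy["marketcap"]

# Keep only rows where growth is a real, finite number
finite = toy[np.isfinite(toy["growth"])]

print(toy["growth"].tolist())
print(len(finite))
```

The first token grew 5x, the third shrank to 0.5x, and the second is infinite because of the division by zero, so it needs to be filtered before it can be treated as a genuine gainer.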

As a standard for our initial target class, let's also look at the number of tokens whose ATH date occurred at least one day after the liquidity alert time.

Let's observe how various thresholds for the minimum time between alert and ATH affect the number of potential target class candidates in our data:

In [18]:
for i in range(1,15):
    print(f'The number of tokens that has an ATH date at least {i} days after liquidity lock time are:', len(df[(df['ath_date'] - df['alert_timestamp']).dt.days > i]))

The number of tokens that has an ATH date at least 1 days after liquidity lock time are: 2408
The number of tokens that has an ATH date at least 2 days after liquidity lock time are: 1914
The number of tokens that has an ATH date at least 3 days after liquidity lock time are: 1615
The number of tokens that has an ATH date at least 4 days after liquidity lock time are: 1441
The number of tokens that has an ATH date at least 5 days after liquidity lock time are: 1311
The number of tokens that has an ATH date at least 6 days after liquidity lock time are: 1195
The number of tokens that has an ATH date at least 7 days after liquidity lock time are: 1124
The number of tokens that has an ATH date at least 8 days after liquidity lock time are: 1054
The number of tokens that has an ATH date at least 9 days after liquidity lock time are: 989
The number of tokens that has an ATH date at least 10 days after liquidity lock time are: 946
The number of tokens that has an ATH date at least 11 days af

We need to verify our target variables using daily volume: let's take these 2408 tokens and begin a new scrape for that data:

Due to resource limitations, we are only scraping for tokens with a gap greater than 1 day

In [19]:
df_for_volume = df[(df['ath_date'] - df['alert_timestamp']).dt.days > 1][['address', 'alert_timestamp', 'ath_date']]
df_for_volume.to_csv('../data/volume_for_target_addresses.csv', index=False)
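
One subtlety in the day-gap filter: `.dt.days` returns whole days rounded down, so `> 1` selects gaps of two full days or more, while `>= 1` is the literal "at least one day". The sketch below illustrates this with made-up timestamps, assuming the alert carries a time of day and the ATH is a date only.

```python
import pandas as pd

# Hypothetical alerts at 18:00, with ATH dates 1, 2 and 3 calendar days later
alert = pd.Series(pd.to_datetime(["2023-01-01 18:00"] * 3))
ath = pd.Series(pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-04"]))

# Gaps are 6 hours, 1 day 6 hours, and 2 days 6 hours
gap_days = (ath - alert).dt.days

print(gap_days.tolist())         # [0, 1, 2]
print((gap_days > 1).tolist())   # [False, False, True]
print((gap_days >= 1).tolist())  # [False, True, True]
```

Either comparison can be a reasonable choice; the point is only that the strict `>` is more conservative than its description suggests.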

## Finding Additional Data<a id="moreinfo"></a>
<p><a href="#Contents" style="font-size: 12px;">Back to Table of Contents</a></p>

To successfully determine our target class we will need to filter out scam growth and other suspicious price action.

A common scam in these tokens is called a 'rug pull'. This is when the token developer removes liquidity from the token, which prevents anyone from buying or selling their tokens.

Even though there are liquidity lock methods, developers have gotten crafty in working around these. As a general rule of thumb, locked liquidity is better than unlocked liquidity, but you should always proceed with caution, as your tokens are never 100% secure. This is one drawback of crypto's anonymous nature.

In a rug pull, the developer still has permission to trade tokens, and they will often sell off any remaining tokens they hold into whatever liquidity is left over. Because the remaining liquidity gets used in that last transaction, the price impact is quite high. This means that in one small transaction the developer makes it appear that the token has shot up 100%-1000% in many cases. This shows up in our scraped data for all-time high price.

We can return to Defined, where we can get volume metrics. Once liquidity has been pulled, trading can no longer happen, which means volume must be 0 or unusually low. By finding volume metrics for the date of the all-time high price, as well as the day before and after, we can get a much more accurate sense of which tokens truly deserve to be in our target class.

This scraping process can be found here: **['defined_volume_scrape'](../dataScrape/defined_volume_scrape.ipynb)**

We are back from the scrape

Let's read in our scraped data

In [20]:
df_with_volume = pd.read_csv('../data/target_addresses_volume.csv')
df = pd.merge(df, df_with_volume, on='address', how='left')

Overall look at big gainer tokens

In [21]:
print(len(df[df['growth'] > 10]))
print(len(df[df['growth'] > 100]))
print(len(df[df['growth'] > 1000]))

18340
8479
7092


From domain knowledge, and given the size of our dataset, we know these numbers are far too high.

We need to filter by more than just date. We should add volume metrics as well to ensure these tokens are truly target class worthy.

Let's create our target class column

In [22]:
df['suitable_investment'] = 0

# conditions are subject to change beyond the scope of this bootcamp when introducing reinforcement learning
conditions = (df['growth'] > 15) & \
              ((df['ath_date'] - df['alert_timestamp']).dt.days > 1) & \
              (df['daily_volume_after_ath'] > 10000) & \
              (df['daily_volume_before_ath'] > 10000) & \
              (df['daily_volume_at_ath'] > 10000)

df.loc[conditions, 'suitable_investment'] = 1
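
Two behaviours of this kind of boolean mask are easy to miss, so here is a sketch on made-up rows (only two of the five conditions are reproduced, for brevity). Comparisons against `NaN` evaluate to `False`, so tokens without scraped volume are excluded automatically. Infinite growth, on the other hand, passes the `> 15` test.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: normal token, token with no volume data,
# token with near-zero volume, token with infinite growth
toy = pd.DataFrame({
    "growth": [20.0, 50.0, 30.0, np.inf],
    "daily_volume_at_ath": [25_000.0, np.nan, 40.0, 50_000.0],
})

mask = (toy["growth"] > 15) & (toy["daily_volume_at_ath"] > 10_000)
toy["suitable_investment"] = mask.astype(int)

print(toy["suitable_investment"].tolist())  # [1, 0, 0, 1]
```

The last row is flagged even though its growth is not a meaningful number, which is why infinite values need separate handling after the mask is applied.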

In [23]:
index_to_drop = df[(df['suitable_investment'] == 1) & np.isinf(df['growth'])].index
df = df.drop(index_to_drop)

In [24]:
df['suitable_investment'].sum()

665

We have 665 suitable investments according to our first iteration of requirements

Let's export for EDA

In [None]:
df.to_csv('../data/complete_df.csv', index=False)

## Cleaning Conclusion<a id="nextsteps"></a>
<p><a href="#Contents" style="font-size: 12px;">Back to Table of Contents</a></p>

We now have some data to work with.

Let's get to our data exploration here: **['data_eda'](./data_eda.ipynb)**

**Note for the future of this project** 

We will use this data for the initial iteration of insights and modelling. If we can determine that a profitable trading strategy is indeed possible we will circle back and invest more in our data collection and engineering process. Why?

As it stands, our trading strategy is based on many conservative assumptions:
- We assume a non target class investment results in a 100% loss
    - In the real world, while many may be scams, there are still growth opportunities between 2-15x, or even breakeven
- We assume target class investments cash out at 15x
    - While the average growth within our target class is greater than 15, we have not created a selling/take profit strategy, and it would be dishonest to assume that we could sell at the absolute peak of every token, every time.
- Gas fees, liquidity, and transaction limits are factors in modelling, but are not factored in when evaluating profitability
    - When the selling strategy is created, these must be accounted for
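
Under the first two assumptions above, the break-even point can be worked out directly. The sketch below is illustrative only: `precision` is a term introduced here for the fraction of purchased tokens that turn out to be target class, the function name is hypothetical, and fees and transaction limits are ignored, as in the list above.

```python
def expected_profit_per_pick(precision, payout_multiple=15.0, stake=1.0):
    """Expected profit per purchase: hits return payout_multiple * stake, misses return 0."""
    return precision * payout_multiple * stake - stake

# Break-even happens when precision * 15 == 1
break_even_precision = 1.0 / 15.0

print(round(break_even_precision, 4))
print(round(expected_profit_per_pick(0.10), 4))
print(round(expected_profit_per_pick(0.05), 4))
```

In other words, if roughly 1 in 15 purchases (about 6.7%) is a true target class token, the strategy breaks even under these assumptions; anything above that is profitable and anything below loses money.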
 
Creating specific trading strategies requires extensive data, which will likely include scraping the blockchain itself. As it stands now, we have a snapshot of the beginning and peak of each token, but to truly create a thorough strategy we need all the trading volume and moves in between.

Before we invest such resources in acquiring this data, we need a baseline case to determine if this asset class of crypto is profitable at all.