# **EDA, Feature Engineering, Hypothesis Testing, and Regression on NFT Collection Dataset**

### **Introduction**

##### What is Niftyprice?
Niftyprice is a new and small team focused on providing the most up to date and comprehensive NFT data in the market today. As you know, this is a fast moving space, so we are working hard to continually push updates, increase our coverage, and enhance our data engine to provide you the best product possible while staying up to speed on all the rapid changes taking place in the NFT world. At the end of the day, we're here for you, the NFT investors and patrons, so please let us know any feedback you have or how we can help solve your burning NFT issues. -NP [(source)](https://www.niftyprice.io/about)

##### What are NFTs?
A non-fungible token is a unit of data stored on a digital ledger, called a blockchain, that certifies a digital asset to be unique and therefore not interchangeable. NFTs can be used to represent items such as photos, videos, audio, and other types of digital files. [(source)](https://www.niftyprice.io/about)

### **Dataset Overview**

##### Feature Information

* Collection Name: The name of the NFT collection.
* Floor Purchase Price: The lowest price of any NFT in the collection in Ethereum (ETH).
* 24%: The percentage change in the floor price over the past 24 hours.
* 7d%: The percentage change in the floor price over the past 7 days.
* Total Float: The total number of minted NFTs.
* Floor Cap: The lowest market capitalization (total value of the collection's items in circulation) in Ethereum (ETH).
* Volume: The sales volume of the NFT collection in Ethereum (ETH) over the past 24 hours.
* 24h Volume%: The percentage change in volume over the past 24 hours.
* Sales: The number of sales from the NFT collection.
* 24h Owners%: The ownership percentage of all items in the collection per 24 hours.
* %Float: The percentage of the collection's NFTs that are listed.
* 24h supply%: The percentage change in supply over the past 24 hours.
* image_url: The associated image of the NFT collection.
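
As a rough illustration of how these fields relate, the sketch below assumes that Floor Cap is the floor price multiplied by Total Float. That relationship is an assumption made for illustration and is not confirmed by the source; the toy values and the `Floor Cap (assumed)` column name are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; the column names mirror the feature list above.
toy = pd.DataFrame({
    "Floor Purchase Price": [2.0, 0.5],   # ETH
    "Total Float": [10_000, 8_000],       # number of minted NFTs
})

# Assumption (not confirmed by the source): floor cap = floor price x total float.
toy["Floor Cap (assumed)"] = toy["Floor Purchase Price"] * toy["Total Float"]
print(toy)
```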

##### Source

The dataset was scraped from [niftyprice](https://www.niftyprice.io/) on February 20, 2022.
You can scrape the latest data yourself using the ['scraper.py'](https://github.com/berodimas/one-month-one-dataset/blob/master/%5B2022-02%5D%20NFT%20Price%20Analysis%20and%20Regression/data/scraper.py) Python script in the 'data' folder.


### **Section 1: Setup, Load, and Clean** 

In [None]:
import os
data_path = ['data']

In [None]:
## Import necessary libraries to load data
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [None]:
## Load in the Dataset
filepath = os.sep.join(
    data_path + ['2022-03-01_niftyprice.csv'])
df = pd.read_csv(filepath)
df = df.sort_values(by=['Floor Cap'], ascending=False, ignore_index=True)

In [None]:
df.head()

In [None]:
## Examine the information from the data
df.info()

In [None]:
## If the floor price has not changed in the past 7 days, the collection might be newly listed
## Therefore we remove rows with a 0 value in the '7d%' column
df = df[df['7d%'] != 0].copy()

In [None]:
## Fill NaN values with 0 (assignment avoids in-place edits on a filtered slice)
df['24h Volume%'] = df['24h Volume%'].fillna(value=0)

In [None]:
## Keep only the top 200 NFT collections (rows are sorted by 'Floor Cap')
df = df.drop(df.index[200:])

In [None]:
df = df.reset_index(drop=True)

In [None]:
## Drop all unnecessary columns
df.drop(columns=['image_url', '24h supply%', '7d%',
        'Floor Cap', 'Collection Name'], inplace=True)


In [None]:
## Take a quick look at the dataframe
df

In [None]:
## Re-examine the information from the data
df.info()


In [None]:
## Create range section in describe table
nft_df = df.copy()
stat_df = nft_df.describe()
stat_df.loc['range'] = stat_df.loc['max'] - stat_df.loc['min']
stat_df.T

In [None]:
## Import necessary libraries to visualize data
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_theme(style="dark")

In [None]:
plt.figure(figsize=[15, 10])
sns.heatmap(data=nft_df.corr(), annot=True)


In [None]:
## There are 3 potential columns for our target variable
## We rank them by the sum of absolute correlations with the other columns

corr_column_target = ['Volume', 'Sales', '%Float']
headers = ['Column Name', 'Sum of Correlations']
df_corr_result = pd.DataFrame(columns=headers)
i = 0

for k in corr_column_target:
    fields = list(nft_df.columns)
    fields.remove(k)
    y = (nft_df[k])
    correlations = nft_df[fields].corrwith(y)
    df_corr_result.loc[i] = [k, correlations.abs().sum()]
    i += 1

df_corr_result.sort_values(
    by=['Sum of Correlations'], ascending=False, ignore_index=True)


First Section Short Recap/Conclusion:
* '%Float' has the highest summed absolute correlation with the other columns.
* We will compare each of the three columns as a target in the Regression section.
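
The ranking metric used in this section can also be read straight off the correlation matrix. The sketch below is a toy illustration on synthetic data (the columns `a`, `b`, `c` are hypothetical): summing the absolute correlations in each column and subtracting 1 (the column's correlation with itself) gives the same figure as looping with `corrwith`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=200),   # strongly related to a
    "c": rng.normal(size=200),                  # independent noise
})

# Sum of |corr| per column, minus 1 for the column's correlation with itself.
abs_corr_sum = toy.corr().abs().sum() - 1
print(abs_corr_sum.sort_values(ascending=False))
```

On real data the column with the largest sum is the one most linearly related to the rest, which is the selection rule applied above.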

### **Section 2: Simple Exploratory Data Analysis (EDA)**

In [None]:
## Check the number of unique values in each feature
## Making sure that all of the columns are numerical

nft_df.nunique()

In [None]:
numerical_data_columns = list(nft_df.columns)

#### **Numerical Data Columns EDA**

In [None]:
## Visualize distribution on numerical features
rows = len(numerical_data_columns)
cols = 3

fig = plt.figure(1, (18, rows*3))

i = 0
for feature in numerical_data_columns:

    i += 1
    ax1 = plt.subplot(rows, cols, i)
    sns.kdeplot(data=nft_df, x=feature)
    ax1.set_xlabel(None)
    ax1.set_title(f'Distribution of {feature}')
    plt.tight_layout()

    i += 1
    ax2 = plt.subplot(rows, cols, i)
    sns.violinplot(data=nft_df, x=feature)
    ax2.set_xlabel(None)
    ax2.set_title(f'{feature} - Violin Plot')
    plt.tight_layout()

    i += 1
    ax3 = plt.subplot(rows, cols, i)
    sns.boxplot(data=nft_df, x=feature, orient='h', linewidth=2.5)
    ax3.set_xlabel(None)
    ax3.set_title(f'{feature} - Box Plot')
    plt.tight_layout()

plt.show()


In [None]:
## Find outliers using Tukey's method
def tukey_outliers(x):
    ## Tukey outliers are based on the boundaries defined by quantiles and IQR
    q1 = np.percentile(x, 25)
    q3 = np.percentile(x, 75)

    iqr = q3 - q1

    lower_boundary = q1 - (iqr * 1.5)
    upper_boundary = q3 + (iqr * 1.5)

    outliers = x[(x < lower_boundary) | (x > upper_boundary)]
    return outliers


In [None]:
## Calculate the tukey outliers
outlier_dict = {}
for num_feature in numerical_data_columns:
    outliers = tukey_outliers(nft_df[num_feature])
    if len(outliers):
        print(f"-> {num_feature} has {len(outliers)} tukey outliers")
        outlier_dict[num_feature] = outliers
    else:
        print(f"-> {num_feature} doesn't have any tukey outliers.")
        outlier_dict[num_feature] = None


In [None]:
## Show the percentage of outliers

for x in numerical_data_columns:
    outliers = outlier_dict[x]
    ## Columns without outliers are stored as None in outlier_dict
    n_outliers = 0 if outliers is None else len(outliers)
    print("{} has {}% of outliers".format(
        x, round(n_outliers / len(nft_df) * 100, 2)))


In [None]:
## Perform test whether a sample differs from a normal distribution
from scipy.stats import normaltest

ALPHA = 0.05

for col in nft_df:
    stat, p = normaltest(nft_df[col].values)
    print('{}: stat={}, p={}'.format(col, stat, p))
    if p <= ALPHA:
        print('Probably not Gaussian\n')
    else:
        print('Probably Gaussian\n')


Numerical Data Columns EDA Short Recap/Conclusion:
* '24h Owners%' has the highest percentage of outliers, at 17%.
* All of the columns are probably not normally distributed.
* We will analyze skewness in the next section.
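
To make the outlier percentages above concrete, here is a minimal worked example of Tukey's fences on a small, made-up array (the values are hypothetical and unrelated to the dataset).

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Quartiles and interquartile range: q1 = 3, q3 = 7, iqr = 4 for this array.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Tukey's fences: anything outside [q1 - 1.5*iqr, q3 + 1.5*iqr] is flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # -3.0 and 13.0

outliers = values[(values < lower) | (values > upper)]
share = 100 * len(outliers) / len(values)
print(lower, upper, outliers, round(share, 2))
```

Only the value 100 falls outside the fences, so one of nine observations is flagged.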

### **Section 3: Feature Engineering**

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [None]:
## Create a list of scalers

scalers = [
    (MinMaxScaler(), "MinMaxScaler"),
    (StandardScaler(), "StandardScaler")
]


In [None]:
## Compare the skewness of each column after applying each scaler
for scaler, scaler_desc in scalers:
    nft_df_fe = nft_df.copy()
    skew_result = []
    for column in numerical_data_columns:
        nft_df_fe[[column]] = scaler.fit_transform(nft_df_fe[[column]])
        skew_result.append({column: nft_df_fe[column].skew()})
    print("Skew Result After " + scaler_desc)
    print(skew_result)
    print("----------")
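
The comparison above can be cross-checked on toy data. The sketch below applies the min-max and standardization formulas by hand to a hypothetical right-skewed sample; because both are linear rescalings with a positive scale factor, the skewness should come out the same in all three cases.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample standing in for one numerical column.
rng = np.random.default_rng(1)
x = pd.Series(rng.exponential(scale=2.0, size=500))

# The same column-wise formulas that MinMaxScaler and StandardScaler apply.
min_max = (x - x.min()) / (x.max() - x.min())
standard = (x - x.mean()) / x.std(ddof=0)

print(x.skew(), min_max.skew(), standard.skew())
```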


In [None]:
## Skewness is unchanged by linear rescaling, so both scalers give the same result
## We use StandardScaler on all numerical columns
nft_df_fe = nft_df.copy()
nft_df_fe[numerical_data_columns] = StandardScaler().fit_transform(
    nft_df_fe[numerical_data_columns])


In [None]:
## Display statistical value after scaling
nft_df_fe.describe().T

In [None]:
## Take a quick look at the dataframe
nft_df_fe


In [None]:
skew_limit = 0.75
df_skew = nft_df_fe.copy()
skew_vals = df_skew.skew()

In [None]:
## Display the skewness value for each column above the limit
skew_cols = (skew_vals
             .sort_values(ascending=False)
             .to_frame()
             .rename(columns={0: 'Skew'})
             .query('abs(Skew) > {}'.format(skew_limit)))

skew_cols


In [None]:
## Create before-after transformation graph
skew_features = skew_cols.index.tolist()
for field in skew_features:
    # Create two "subplots" and a "figure" using matplotlib
    fig, (ax_before, ax_after) = plt.subplots(1, 2, figsize=(25, 10))

    # Create a histogram on the "ax_before" subplot
    df_skew[field].hist(ax=ax_before)

    # after_skew = np.sqrt(df_skew[field] + 0 - min(df_skew[field]))
    # Apply a cube-root transformation (numpy syntax) to this column
    after_skew = np.cbrt(df_skew[field])

    # Create a histogram on the "ax_after" subplot
    after_skew.hist(ax=ax_after)

    # Formatting of titles etc. for each subplot
    ax_before.set(title='before np.cbrt', ylabel='frequency', xlabel='value')
    ax_after.set(title='after np.cbrt', ylabel='frequency', xlabel='value')
    fig.suptitle('Field "{}"\nBefore: {} | After: {}\n'.format(
        field, df_skew[field].skew(), after_skew.skew()))


In [None]:
## Apply transformation to the feature
for column in numerical_data_columns:
    df_skew[column] = np.cbrt(df_skew[column])


In [None]:
df_skew

Feature Transformation Short Recap/Conclusion:
* Because the skewed features contain negative values, we use the cube root as the feature transformation.
* After the cube-root transformation, the skewness values appear to move closer to the 0.75 limit.
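
A short sketch of why the cube root suits standardized data: a square root is undefined for negative inputs, while `np.cbrt` is defined on the whole real line and keeps the sign. The sample below is synthetic (a hypothetical log-normal draw), not the NFT data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample, standardized so that it contains negatives.
rng = np.random.default_rng(42)
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))
scaled = (raw - raw.mean()) / raw.std()

# sqrt produces NaN for every negative input.
with np.errstate(invalid="ignore"):
    sqrt_nan = int(np.isnan(np.sqrt(scaled)).sum())

# cbrt handles negative inputs and preserves their sign.
cbrt = np.cbrt(scaled)

print("negative values:", int((scaled < 0).sum()))
print("NaN after sqrt :", sqrt_nan)
print("NaN after cbrt :", int(cbrt.isna().sum()))
print("skew before/after:", scaled.skew(), cbrt.skew())
```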

### **Section 4: Regression**

In [None]:
df_skew.head()

In [None]:
## Prepare features and targets, and define the KFold splitter
## We make 3 feature/target pairs: %Float, Sales, and Volume

from sklearn.model_selection import KFold

X_float = df_skew.drop(columns=["%Float"])
y_float = df_skew["%Float"]

X_sales = df_skew.drop(columns=["Sales"])
y_sales = df_skew["Sales"]

X_volume = df_skew.drop(columns=["Volume"])
y_volume = df_skew["Volume"]

kf = KFold(shuffle=True, random_state=72018, n_splits=4)


In [None]:
## Import necessary libraries for modelling

from sklearn.preprocessing import MinMaxScaler, StandardScaler, PolynomialFeatures
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score, mean_squared_error

import datetime


In [None]:
## Create list of transformers

transformers = [
    (PolynomialFeatures(degree=1), "PolynomialFeatures (Degree 1)"),
    (PolynomialFeatures(degree=2), "PolynomialFeatures (Degree 2)"),
    (PolynomialFeatures(degree=3), "PolynomialFeatures (Degree 3)")
]


In [None]:
## Create list of models

models = [
    Lasso(max_iter=1000),
    Ridge(max_iter=1000)
]


In [None]:
search_space_dict = {}

search_space_dict['Lasso'] = {
    'lasso__alpha': np.geomspace(0.001, 0.1, 50)
}

search_space_dict['Ridge'] = {
    'ridge__alpha': np.geomspace(0.001, 0.1, 50)
}



In [None]:
## Create pipeline matrix

pipelines_matrix = {}

for transformer, transformer_desc in transformers:
    pipelines_matrix[transformer_desc] = {}
    print(transformer_desc)
    for model in models:
        print("            ", model.__class__.__name__)
        pipelines_matrix[transformer_desc][model.__class__.__name__] = make_pipeline(
            transformer, model)


In [None]:
## Create a function that cross-validates every pipeline in the matrix
## The function returns a dataframe with one result row per pipeline

def cross_validator(X_train, y_train, pipelines_matrix):
    results = []
    for transformer in pipelines_matrix:
        print("----------------------", transformer)
        for model in pipelines_matrix[transformer]:
            print("     +++++++", model)
            startT = datetime.datetime.now()

            pipeline = pipelines_matrix[transformer][model]

            search_space = search_space_dict[model]
            regressor = GridSearchCV(pipeline,
                                search_space,
                                scoring='neg_root_mean_squared_error',
                                cv=kf)
            regressor.fit(X_train, y_train)

            ## best_score_ is the negated RMSE, so values closer to 0 are better
            print("          rmse: ", regressor.best_score_)

            results.append({
                'transformer': transformer,
                'model': model,
                'rmse': regressor.best_score_,
                'best_params': regressor.best_params_})

            print("             exec time:", datetime.datetime.now() -
                  startT, datetime.datetime.now())

    headers = ['transformer', 'model', 'rmse', 'best_params']
    return pd.DataFrame(results, columns=headers)


#### **GridSearch with '%Float' as Target**

In [None]:
grid_search_df = cross_validator(X_float, y_float, pipelines_matrix)


In [None]:
grid_search_df.sort_values(by=['rmse'], ascending=False, ignore_index=True)

In [None]:
pipeline = Pipeline(steps=[
    ('transformer', PolynomialFeatures(degree=3)),
    ('model', Lasso(alpha=0.01151395399326447, max_iter=1000))])


In [None]:
pipeline.fit(X_float, y_float)


In [None]:
## Note: these are in-sample scores (the pipeline is scored on the data it was fit on)
y_predict = pipeline.predict(X_float)
rmse = np.sqrt(mean_squared_error(y_float, y_predict))
print(f"RMSE Score for Lasso Regression: {rmse}")
print(f"R2 Score for Lasso Regression: {r2_score(y_float, y_predict)}")


In [None]:
f = plt.figure(figsize=(10, 10))
ax = plt.axes()

ax.plot(y_float, pipeline.predict(X_float),
        marker='o', ls='', ms=3.0)

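## The transformed target can be negative, so start the axes at its minimum
lim = (y_float.min(), y_float.max())

ax.set(xlabel='Actual %Float (transformed)',
       ylabel='Predicted %Float (transformed)',
       xlim=lim,
       ylim=lim,
       title='Lasso Regression Results')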


#### **GridSearch with 'Sales' as Target**

In [None]:
grid_search_df = cross_validator(X_sales, y_sales, pipelines_matrix)


In [None]:
grid_search_df.sort_values(by=['rmse'], ascending=False, ignore_index=True)

In [None]:
pipeline = Pipeline(steps=[
    ('transformer', PolynomialFeatures(degree=3)),
    ('model', Lasso(alpha=0.016768329368110076, max_iter=1000))])


In [None]:
pipeline.fit(X_sales, y_sales)


In [None]:
## Note: these are in-sample scores (the pipeline is scored on the data it was fit on)
y_predict = pipeline.predict(X_sales)
rmse = np.sqrt(mean_squared_error(y_sales, y_predict))
print(f"RMSE Score for Lasso Regression: {rmse}")
print(f"R2 Score for Lasso Regression: {r2_score(y_sales, y_predict)}")


In [None]:
f = plt.figure(figsize=(10, 10))
ax = plt.axes()

ax.plot(y_sales, pipeline.predict(X_sales),
        marker='o', ls='', ms=3.0)

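## The transformed target can be negative, so start the axes at its minimum
lim = (y_sales.min(), y_sales.max())

ax.set(xlabel='Actual Sales (transformed)',
       ylabel='Predicted Sales (transformed)',
       xlim=lim,
       ylim=lim,
       title='Lasso Regression Results')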


#### **GridSearch with 'Volume' as Target**

In [None]:
grid_search_df = cross_validator(X_volume, y_volume, pipelines_matrix)


In [None]:
grid_search_df.sort_values(by=['rmse'], ascending=False, ignore_index=True)

In [None]:
pipeline = Pipeline(steps=[
    ('transformer', PolynomialFeatures(degree=3)),
    ('model', Lasso(alpha=0.0071968567300115215, max_iter=1000))])


In [None]:
pipeline.fit(X_volume, y_volume)


In [None]:
## Note: these are in-sample scores (the pipeline is scored on the data it was fit on)
y_predict = pipeline.predict(X_volume)
rmse = np.sqrt(mean_squared_error(y_volume, y_predict))
print(f"RMSE Score for Lasso Regression: {rmse}")
print(f"R2 Score for Lasso Regression: {r2_score(y_volume, y_predict)}")


In [None]:
f = plt.figure(figsize=(10, 10))
ax = plt.axes()

ax.plot(y_volume, pipeline.predict(X_volume),
        marker='o', ls='', ms=3.0)

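## The transformed target can be negative, so start the axes at its minimum
lim = (y_volume.min(), y_volume.max())

ax.set(xlabel='Actual Volume (transformed)',
       ylabel='Predicted Volume (transformed)',
       xlim=lim,
       ylim=lim,
       title='Lasso Regression Results')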


Final Model Evaluation Short Recap/Conclusion:
* After fitting a model for each of the 3 targets, 'Volume' achieved the best result, with the highest R2 score.
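
The R2 scores compared here are computed on the same rows each pipeline was fit on. As a possible complement, the sketch below shows one way to obtain out-of-fold scores with `cross_val_predict`. It runs on synthetic stand-in data (the arrays `X` and `y` and the `alpha` value are hypothetical), so the numbers it prints say nothing about the NFT dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (hypothetical): 200 rows, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.3, size=200)

model = make_pipeline(PolynomialFeatures(degree=3),
                      Lasso(alpha=0.01, max_iter=10000))
kf = KFold(n_splits=4, shuffle=True, random_state=72018)

# In-sample: fit and score on the same rows.
in_sample_r2 = r2_score(y, model.fit(X, y).predict(X))

# Out-of-fold: each prediction comes from a model that never saw that row.
oof_pred = cross_val_predict(model, X, y, cv=kf)
oof_r2 = r2_score(y, oof_pred)
oof_rmse = np.sqrt(mean_squared_error(y, oof_pred))

print(in_sample_r2, oof_r2, oof_rmse)
```

Comparing the three targets on out-of-fold scores rather than in-sample scores would give a less optimistic basis for the ranking.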


---
**2022 | Dimas Adrian Mukti / [@berodimas](https://berodimas.netlify.app/)**