# Prediction Models for Art Blocks Collections

This notebook provides statistical and machine learning models that predict whether a collection minted before August 1st 2021 will have a resale in August 2021, and at what price the collection will sell.

## 1. Data Preparation

There are two types of predictions that can be done:

1.   **Collection level:** To predict which collections will be sold in August and for how much, e.g. among all the collections / projects in Art Blocks, based on features of each collection such as duration of the mint, number of tokens minted or average sale price.
2.   **Token level:** To predict which specific tokens within a collection will be sold in August and for how much, e.g. within the Chromie Squiggle collection, based on the traits and features of each token.

Since each collection is different and has different token-level features, a single model cannot cover both levels: the prediction has to be done at either the collection level or the token level.

This project will only focus on the collection level prediction. 

### 1.1 Data Structure


The collection data is transformed into two data structure formats, suitable for building time-series regression and non-time-series machine learning models. The models and methodologies are explained in more detail in section 2.

* Non-time-series data (`collection_level_data.csv` | *unique key*: `collection_name`): contains static, time-independent information about the collection, e.g. artist name, aspect ratio, curation status. This data structure is used for machine learning models, e.g. Decision Tree, Random Forest, XGBoost.

* Time-series data (`collection_level_data_ts.csv` | *unique key*: `collection_name`, `year_month`): contains time-dependent information about the collection, e.g. sale volume, price. This data structure is used for regression models, e.g. OLS, Poisson regression.



The SQL code used to generate the two datasets is in the SQL queries folder.

In [None]:
###################
## Load packages ##
###################
import io
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import graphviz
from sklearn.metrics import accuracy_score, balanced_accuracy_score, log_loss, confusion_matrix, mean_squared_error
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz
from sklearn.naive_bayes import BernoulliNB
from sklearn import datasets, ensemble, preprocessing, tree
from sklearn.model_selection import KFold

## Functions
def plot_confusion_matrix(cm, classes=None, perc=True, title='my title'):
    """Plot a confusion matrix as a heatmap.

    cm is expected on a 0-100 scale when perc is True and on a 0-1 scale otherwise.
    """
    # Match the colour scale to the scale of the values passed in
    vmax = 100. if perc else 1.
    labels = classes if classes is not None else 'auto'
    ax = sns.heatmap(cm, xticklabels=labels, yticklabels=labels,
                     vmin=0., vmax=vmax, annot=True, cmap='Greens')
    # Pad the y-axis; workaround for clipped top/bottom rows in some matplotlib versions
    bottom, top = ax.get_ylim()
    ax.set_ylim(bottom+0.5, top-0.5)
    if perc:
        # Append a percent sign to each annotated cell
        for t in ax.texts: t.set_text(t.get_text()+"%")
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title(title)



def plot_feature_importance(model, feature_name = None, title = 'Feature Importance'):
  feature_importance = model.feature_importances_
  sorted_idx = np.argsort(feature_importance)
  pos = np.arange(sorted_idx.shape[0]) + .5
  fig = plt.figure(figsize=(12, 6))
  plt.barh(pos, feature_importance[sorted_idx], align='center')
  plt.yticks(pos, np.array(feature_name)[sorted_idx])
  plt.title(title)

### 1.2 Data Definition
 
Fields taken directly from the Flipside database without transformation, e.g. artist, aspect_ratio, curation_status, are not included here. The definitions of the fields created / calculated from other fields are below:
* COUNT_TOKEN: total number of tokens of the collection
* DAYS_SINCE_MINT: number of days between mint date (`created_at_timestamp`) and September 1st 2021.
* FEATURE_NUMBER: number of features of the collection
* TRAITS_NUMBER: number of traits of the collection
* MINT_CURRENCY: same as tx_currency
* MINT_DURATION: number of minting days
* AUGUST_SALE_COUNT: number of sales in August 2021 from the collection
* AUGUST_SALE_PRICE: average sale price in August 2021 from the collection
* YEAR_MONTH: the year and month of `block_timestamp`
* SALE_COUNT: number of sales of the collection in the particular month
* PRICE_USD: same as `price_usd` in nft_events table
* PRICE_RANGE: difference between the minimum and maximum price_usd of the collection in the particular month
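
As a small worked example of the derived fields above, the sketch below computes `DAYS_SINCE_MINT`, `SALE_COUNT` and `PRICE_RANGE` for one made-up collection. The mint date and the prices are hypothetical values chosen for illustration; the real fields are produced by the SQL queries, not by this code.

```python
from datetime import date

# Hypothetical collection: mint date and one month's sale prices in USD.
mint_date = date(2021, 6, 15)
reference_date = date(2021, 9, 1)   # cut-off used for DAYS_SINCE_MINT
monthly_prices_usd = [120.0, 95.5, 310.0, 180.25]

# DAYS_SINCE_MINT: whole days between the mint date and September 1st 2021.
days_since_mint = (reference_date - mint_date).days

# SALE_COUNT: number of sales in the month.
sale_count = len(monthly_prices_usd)

# PRICE_RANGE: maximum minus minimum price_usd within the month.
price_range = max(monthly_prices_usd) - min(monthly_prices_usd)

print(days_since_mint, sale_count, price_range)
```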

### 1.3 Data Formatting & Cleansing
The following code formats the data type so it can be processed through models.
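
One formatting issue is that an aspect ratio can arrive either as a number or as a fraction string such as `'1/1'`. A minimal sketch of a general parser, assuming every value is either numeric or of the form `'w/h'`, is shown below. The helper name `parse_aspect_ratio` is hypothetical and is not part of the notebook.

```python
from fractions import Fraction

def parse_aspect_ratio(raw):
    """Convert an aspect ratio given as 'w/h', a decimal string or a number to a float."""
    return float(Fraction(str(raw).strip()))

# Hypothetical raw values in mixed formats
examples = ['1/1', '100/100', '1.5', 0.75, '16/9']
parsed = [parse_aspect_ratio(v) for v in examples]
print(parsed)
```

With this approach, `'1/1'` and `'100/100'` both evaluate to a ratio of 1.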

In [None]:
###################################
## Read in collection level data ##
###################################
coll_data = pd.read_csv('collection_level_data.csv')
coll_data.head()

coll_data_ts =  pd.read_csv('collection_level_data_ts.csv')
coll_data_ts.head()

## Format data type
## Convert all string fields into category
coll_data['COLLECTION_NAME'] = coll_data['COLLECTION_NAME'].astype('category')
coll_data['ARTIST'] = coll_data['ARTIST'].astype('category')
coll_data['CURATION_STATUS'] = coll_data['CURATION_STATUS'].astype('category')
coll_data['SCRIPT_TYPE'] = coll_data['SCRIPT_TYPE'].astype('category')
coll_data['MINT_CURRENCY'] = coll_data['MINT_CURRENCY'].astype('category')

## Fix different formats in aspect ratio ('1/1' and '100/100' both mean a ratio of 1)
coll_data['ASPECT_RATIO'] = np.where(coll_data['ASPECT_RATIO'].isin(['1/1', '100/100']), 1, coll_data['ASPECT_RATIO'])
coll_data['ASPECT_RATIO'] = coll_data['ASPECT_RATIO'].astype('float64')
print(coll_data.dtypes)

## Drop fields with only 1 category
coll_data = coll_data.drop(['MINT_CURRENCY','IS_DYNAMIC', 'USE_HASH'], axis = 1)


In [None]:
########################################
## Prepare non-time-series model data ##
########################################

## Create dependent binary variable y_c
# if sale count is n.a., then there is no sale
y_c = pd.DataFrame(np.where(coll_data['AUGUST_SALE_COUNT'].isna(), 0, 1))
y_c.columns = ['sale_y_n']

## Create independent variables
x_c = coll_data.drop(['COLLECTION_NAME','ARTIST','AUGUST_SALE_COUNT', 'AUGUST_SALE_PRICE'], axis = 1)

## Convert categorical to numeric (LogisticRegression package doesn't take categorical variables)
cleanup_nums = {"CURATION_STATUS":     {"curated": 4, "factory": 2, "playground": 3},
                "SCRIPT_TYPE": {"p5js": 1, "threejs": 2, "js":3, "regl": 4, "zdog": 5, "tonejs": 6, "custom": 7, "a-frame":8, "svg":9 }}

x_c = x_c.replace(cleanup_nums)

## Replace n.a. with 999 (can't exclude rows with n.a. because otherwise no Y_c = 0)
x_c = x_c.fillna(999)

## 3. Collection Level Prediction Models
### 3.1 Which collection will sell in August?
The following models predict which collections will have a resale in August 2021. The result of the prediction is a binary outcome (0 for no resale, 1 for resale). Logistic Regression, Decision Tree and Random Forest are used. Decision Tree and Random Forest both predict with 100% accuracy on the training data, but Decision Tree is easier to interpret and has a lower log loss. The most important features for predicting whether a collection will be resold in August are the number of tokens and the duration of the mint event.


#### 3.1.1 Logistic Regression

The logistic regression correctly predicts 99% of the resales and 83% of the no resales. It mis-predicts 2 resales as no resales and 1 no resale as a resale.

The top 3 most important features in the Logistic Regression are:
- mint duration
- number traits of a collection
- aspect ratio
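
The per-class percentages quoted above come from a row-normalised confusion matrix, and the balanced accuracy is the unweighted mean of those per-class recalls. The sketch below shows the arithmetic in plain Python. The counts are hypothetical and are not the notebook's actual confusion matrix.

```python
# Hypothetical confusion matrix counts (rows = actual, columns = predicted),
# in the order [no resale, resale]. These are not the notebook's real counts.
cm = [[5, 1],
      [2, 148]]

def row_normalise(matrix):
    """Divide each row by its total, so the diagonal holds the per-class recall."""
    return [[cell / sum(row) for cell in row] for row in matrix]

def accuracy(matrix):
    """Share of all observations on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

def balanced_accuracy(matrix):
    """Unweighted mean of the per-class recalls."""
    norm = row_normalise(matrix)
    recalls = [norm[i][i] for i in range(len(norm))]
    return sum(recalls) / len(recalls)

cm_pct = [[round(100 * v, 1) for v in row] for row in row_normalise(cm)]
print(cm_pct)
print(round(accuracy(cm), 3), round(balanced_accuracy(cm), 3))
```

Because the no resale class is small, a single mis-classified no resale moves the balanced accuracy far more than it moves the plain accuracy.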

In [None]:
#########################
## Logistic Regression ##
#########################
logreg = LogisticRegression(random_state=123).fit(x_c, y_c)
logreg_y = logreg.predict(x_c)
logreg_y_p = logreg.predict_proba(x_c)

display('The balanced accuracy for logistic regression is: {:.3f}'.format(balanced_accuracy_score(y_c, logreg_y)))
display('The log loss for logistic regression is: {:.3f}'.format(log_loss(y_c, logreg_y_p)))

## Mis-predicted collection
logreg_y = pd.DataFrame(logreg_y)
logreg_y.columns = ['sale_y_n_predict']

lr_out = pd.concat([coll_data['COLLECTION_NAME'], y_c, logreg_y], axis=1)
lr_out['mis_predict'] = np.where(lr_out['sale_y_n']==lr_out['sale_y_n_predict'], 0, 1)
display(lr_out[lr_out['mis_predict'] == 1])

## Confusion matrix 
lr_cm = confusion_matrix(y_c, logreg_y)
lr_cm_value = lr_cm/lr_cm.sum(axis = 1)[:, np.newaxis]

plot_confusion_matrix(lr_cm_value*100, classes = ['No Resale','Resale'], title = 'Confusion Matrix (%) - Logistic Regression')

## Feature Importance

importance = pd.concat([pd.DataFrame(x_c.columns), pd.DataFrame(logreg.coef_).T], axis=1)
importance.columns = ["feature_name", "feature_value"]

sns.barplot(x="feature_name", y="feature_value", data=importance)
plt.title("Feature Importance")
plt.xticks(rotation=90)



#### 3.1.2 Decision Tree Classifier

The Decision Tree correctly predicts 100% of the resales and no resales when the tree depth reaches 4. The top 3 most important features are:
- number of tokens
- mint duration
- number of days since the mint

The tree path shows that the fewer tokens a collection has and the shorter its minting duration, the less likely the collection is to be sold in August.

In [None]:
##############################
## Decision Tree Classifier ##
##############################

# Train Decision Tree classifier 
dt = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=123).fit(x_c, y_c)
dt_y = dt.predict(x_c)
dt_y_p = dt.predict_proba(x_c)

display('The balanced accuracy for Decision Tree is: {:.3f}'.format(balanced_accuracy_score(y_c, dt_y)))
display('The log loss for Decision Tree is: {:.3f}'.format(log_loss(y_c, dt_y_p)))

## Print mis-predicted collection
dt_y = pd.DataFrame(dt_y)
dt_y.columns = ['sale_y_n_predict']

dt_out = pd.concat([coll_data['COLLECTION_NAME'], y_c, dt_y], axis=1)
dt_out['mis_predict'] = np.where(dt_out['sale_y_n']==dt_out['sale_y_n_predict'], 0, 1)
dt_mis = dt_out[dt_out['mis_predict'] == 1]
if dt_mis.empty:
    print('There are no mis-classified collections!')
else:
    display(dt_mis)

# DOT data
dot_data = tree.export_graphviz(dt, out_file=None, 
                                feature_names=x_c.columns,  
                                class_names = ['no-resale','resale'],
                                #class_names=str(dt.classes_),
                                filled=True)

# Draw graph
graph = graphviz.Source(dot_data, format="png") 
graph

## Feature Importance
plot_feature_importance(dt, feature_name = x_c.columns, title = 'Feature Importance')

#### 3.1.3 Random Forest Classifier

Random Forest is similar to Decision Tree, except that it is an ensemble method: it builds many decision trees on sub-samples of the data and combines them to train a better model. Random Forest also predicts well, with a 100% accuracy score, but with a higher log loss than Decision Tree.

Since a simple decision tree already reaches 100% accuracy, Random Forest cannot improve model performance, so it is shown here only as an additional choice. The most important features are the same as for Decision Tree:
- mint duration
- number of tokens a collection has
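
Two models can both reach 100% accuracy and still differ in log loss, because log loss scores the predicted probabilities rather than the predicted labels. The sketch below illustrates this with hypothetical probabilities. It assumes the single tree outputs probabilities of exactly 0 or 1 (pure leaves), while the forest averages over many trees and so outputs values in between.

```python
import math

def binary_log_loss(y_true, p_pos, eps=1e-15):
    """Mean negative log-likelihood for binary labels; p_pos is P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, p_pos, threshold=0.5):
    """Share of observations whose thresholded probability matches the label."""
    return sum(int(p >= threshold) == y for y, p in zip(y_true, p_pos)) / len(y_true)

y_true = [1, 1, 1, 0]

# Hypothetical predicted probabilities of a resale
p_tree = [1.0, 1.0, 1.0, 0.0]       # confident, pure-leaf predictions
p_forest = [0.9, 0.8, 0.95, 0.1]    # averaged over many trees

for name, p in [('tree', p_tree), ('forest', p_forest)]:
    print(name, accuracy(y_true, p), round(binary_log_loss(y_true, p), 4))
```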

In [None]:
##############################
## Random Forest Classifier ##
##############################

# Train random forest classifier 
rf = RandomForestClassifier(random_state=123, criterion='entropy').fit(x_c, y_c)
rf_y = rf.predict(x_c)
rf_y_p = rf.predict_proba(x_c)

display('The balanced accuracy for Random Forest is: {:.3f}'.format(balanced_accuracy_score(y_c, rf_y)))
display('The log loss for Random Forest is: {:.3f}'.format(log_loss(y_c, rf_y_p)))

## Print mis-predicted collection
rf_y = pd.DataFrame(rf_y)
rf_y.columns = ['sale_y_n_predict']

rf_out = pd.concat([coll_data['COLLECTION_NAME'], y_c, rf_y], axis=1)
rf_out['mis_predict'] = np.where(rf_out['sale_y_n']==rf_out['sale_y_n_predict'], 0, 1)
rf_mis = rf_out[rf_out['mis_predict'] == 1]
if rf_mis.empty:
    print('There are no mis-classified collections!')
else:
    display(rf_mis)

## Confusion matrix
rf_cm = confusion_matrix(y_c, rf_y)
rf_cm_value = rf_cm/rf_cm.sum(axis = 1)[:, np.newaxis]

plot_confusion_matrix(rf_cm_value*100, classes = ['No Resale','Resale'], title = 'Confusion Matrix (%) - Random Forest')

## Feature Importance
plot_feature_importance(rf, feature_name = x_c.columns, title = 'Feature Importance')

### 3.2 At what price will the collection sell in August?

As mentioned in section 2, the prediction can be based on time-series regressions or machine learning tree-based regression models.

**Why tree-based regression model is more suitable than time-series model in this case**

The issue here is that a time-series model usually predicts the price trend of one particular item (e.g. a single stock price), in this case a collection. We have over 100 unique collections, all with different price histories, sale volumes, traits etc. So each collection would need its own model to fully incorporate its price trend in the prediction (which is a lot of models!).

Also, the time history is quite short for each collection. With less than 1 year of history and a maximum of 8 or 9 monthly data points per collection, a time-series model is unlikely to give reliable and robust results. Using daily data could increase the sample size, but not all collections have a sale every day. A time-series model often requires equal time intervals, which means the days with no sale data would have to be approximated with interpolation methods.
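
To make the interpolation point concrete, the sketch below fills days without a sale by linear interpolation between the nearest known prices. The daily prices are hypothetical and the helper `interpolate_missing` is illustrative only; the notebook does not build a daily series.

```python
def interpolate_missing(values):
    """Linearly interpolate None gaps between known points; edge gaps stay None."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

# Hypothetical daily average prices; None marks a day without any sale
daily_price = [100.0, None, None, 130.0, None, 150.0, None]
filled = interpolate_missing(daily_price)
print(filled)
```

The filled values are estimates rather than observed sales, which is one reason a daily time-series model would rest on weak data here.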

For these reasons, and because there are a lot of categorical variables in the data, tree-based regression models are more suitable than a time-series model. A summary of the 3 selected models and their performance is below. In all 3 models, the most important feature for predicting the August price turns out to be July's price.

#### 3.2.1 Data Preparation
**Transform time-series variables**

In order to use the time-series data in a non-time-series tree-based model, the monthly average price in the collection level time-series data is pivoted onto the collection level data, so that each month's average sale price becomes an additional column that serves as a feature in the regression models. The number of sales in each month is transformed in the same way.
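
As a plain-Python illustration of what this long-to-wide pivot does, the sketch below turns one row per (collection, month) into one row per collection with one column per month. The collection names and prices are hypothetical; the notebook itself uses `DataFrame.pivot` for this step.

```python
# Hypothetical long-format rows: one row per (collection, month)
rows = [
    {"COLLECTION_NAME": "A", "YEAR_MONTH": "2021-06", "PRICE_USD": 100.0},
    {"COLLECTION_NAME": "A", "YEAR_MONTH": "2021-07", "PRICE_USD": 150.0},
    {"COLLECTION_NAME": "B", "YEAR_MONTH": "2021-07", "PRICE_USD": 40.0},
]

# Column order follows the sorted month labels
months = sorted({r["YEAR_MONTH"] for r in rows})

# Wide format: one entry per collection, one value per month
wide = {}
for r in rows:
    wide.setdefault(r["COLLECTION_NAME"], {m: None for m in months})
    wide[r["COLLECTION_NAME"]][r["YEAR_MONTH"]] = r["PRICE_USD"]

for name, prices in wide.items():
    print(name, prices)
```

A month in which a collection had no sale ends up as a missing value in the wide table, so it has to be filled or otherwise handled before modelling.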

In [None]:
#############################################
## Sale price & number of sale time-series ##
#############################################
## Pivot sale price 
coll_price_pvt = coll_data_ts.pivot(index="COLLECTION_NAME", columns="YEAR_MONTH", values="PRICE_USD")
## The first month is December 2020, so it is labelled Dec20
coll_price_pvt.columns=['Dec20_price','Jan21_price','Feb21_price','Mar21_price','Apr21_price','May21_price','Jun21_price','Jul21_price','Aug21_price','Sep21_price'] 
display(coll_price_pvt.head(5))

## Pivot sale volume
coll_sale_pvt = coll_data_ts.pivot(index="COLLECTION_NAME", columns="YEAR_MONTH", values="SALE_COUNT")
coll_sale_pvt.columns=['Dec20_sale_num','Jan21_sale_num','Feb21_sale_num','Mar21_sale_num','Apr21_sale_num','May21_sale_num','Jun21_sale_num','Jul21_sale_num','Aug21_sale_num','Sep21_sale_num'] 
display(coll_sale_pvt.head(5))

**Predict the August sale price**

The price modelling data contains both the static features from `collection_level_data.csv` and the time-dependent features from `collection_level_data_ts.csv`.

Three different tree-based regression models are used here:
- [Decision Tree Regression](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html): an algorithm that predicts continuous dependent variables with a tree structure.
- [Random Forest Regression](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html): a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
- [Gradient Boosting Regression](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html): builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.
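
To illustrate the stage-wise idea behind gradient boosting, here is a minimal sketch for squared-error loss, where the negative gradient is simply the residual. It uses one-split "stumps" on a single feature instead of full regression trees, and the data are hypothetical. It is a teaching sketch under those assumptions, not scikit-learn's implementation.

```python
def fit_stump(x, residuals):
    """Find the threshold split on x that minimises squared error of the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:   # assumes at least two distinct x values
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def gradient_boost(x, y, n_stages=50, learning_rate=0.1):
    """Squared-error boosting: each stage fits a stump to the current residuals."""
    base = sum(y) / len(y)              # stage 0: predict the mean
    stumps = []
    pred = [base] * len(y)
    for _ in range(n_stages):
        # For squared error, the negative gradient is the residual y - prediction
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical data: July price as the only feature, August price as the target
july = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
august = [12.0, 19.0, 33.0, 41.0, 55.0, 58.0]

model_few = gradient_boost(july, august, n_stages=5)
model_many = gradient_boost(july, august, n_stages=200)
print(mse(august, [model_few(v) for v in july]))
print(mse(august, [model_many(v) for v in july]))
```

Adding stages keeps lowering the training error, which is also why an in-sample fit from a boosted model says little about out-of-sample accuracy.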

In [None]:
###################################
## Prepare data with time-series ##
###################################

## Join coll_data with price monthly change and sale volume monthly change data
coll_data_full = coll_data.merge(coll_price_pvt, on='COLLECTION_NAME', how='left')
coll_data_full = coll_data_full.merge(coll_sale_pvt, on='COLLECTION_NAME', how='left')

## Create dependent variable
y = coll_data_full['AUGUST_SALE_PRICE']
y = y.fillna(0)

## Create independent variables
## August and September are excluded
x = coll_data_full.drop(['COLLECTION_NAME','ARTIST','AUGUST_SALE_COUNT', 'AUGUST_SALE_PRICE','Aug21_price','Sep21_price', 'Aug21_sale_num','Sep21_sale_num'], axis = 1)

## Convert categorical to numeric (the scikit-learn models used here need numeric inputs)
cleanup_nums = {"CURATION_STATUS":     {"curated": 4, "factory": 2, "playground": 3},
                "SCRIPT_TYPE": {"p5js": 1, "threejs": 2, "js":3, "regl": 4, "zdog": 5, "tonejs": 6, "custom": 7, "a-frame":8, "svg":9 }}

x = x.replace(cleanup_nums)

## Replace n.a. with 0
x = x.fillna(0)

#### 3.2.2 Decision Tree Regression
The initial Decision Tree Regression is fitted without setting any hyper-parameters. The unconstrained tree grows to a depth of 18 and reaches an R-squared of 100%. Since such a deep tree is likely to over-fit the data, R-squared is plotted against a range of maximum tree depths, and the elbow point of the curve, a depth of 5, is taken as the optimal tree depth.

The most important features to predict the August sale price are:
- July's sale price 
- July's sale number
- December's sale number 
- curation status. 

The scatter plot shows how close the predicted price is to the actual price. A perfect prediction would form a 45-degree diagonal line. Most of the points lie on the diagonal, except for some low-price predictions.
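
For reference, the R-squared reported by a regressor's `score` method is the coefficient of determination, 1 - SS_res / SS_tot. The sketch below computes it by hand on hypothetical prices. It shows that predictions lying exactly on the 45-degree line give 1, while always predicting the mean gives 0.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical August sale prices
actual = [100.0, 200.0, 300.0, 400.0]

perfect = r_squared(actual, actual)                       # every point on the diagonal
mean_only = r_squared(actual, [250.0] * 4)                # always predict the mean
close = r_squared(actual, [110.0, 190.0, 310.0, 390.0])   # small errors

print(perfect, mean_only, close)
```

Because the score here is computed on the same data the tree was trained on, a sufficiently deep tree can reach 1 simply by memorising the training points.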

In [None]:
##############################
## Decision Tree Regression ##
##############################

# Train Decision Tree without specifying hyper-parameter
dtr = DecisionTreeRegressor(random_state=123).fit(x, y)
display('The R-squared for Decision Tree Regression is: {:.3f}'.format(dtr.score(x, y)))
display('The tree depth without hyper-parameter tuning is: {}'.format(dtr.get_depth()))

## Trials of different tree depths

max_depths = range(1, 21)
dt_rows = []

for d in max_depths:
  dt_trial = DecisionTreeRegressor(random_state = 123, max_depth = d).fit(x,y)
  rsq = dt_trial.score(x, y)
  dt_rows.append({'max_depth': d, 'R-squared': rsq})

## DataFrame.append was removed in pandas 2.0, so build the table from a list of rows
dt_table = pd.DataFrame(dt_rows, columns=['max_depth', 'R-squared'])

plt.plot(dt_table['max_depth'], dt_table['R-squared'])
plt.xlabel('max tree depth')
plt.ylabel('R-squared')
plt.title('Trials of different tree depth vs. R-squared')

## R-squared & feature importance
dtr = DecisionTreeRegressor(max_depth = 5, random_state=123).fit(x, y)
dtr_y = dtr.predict(x)
display('The R-squared for Decision Tree regression is: {:.3f}'.format(dtr.score(x, y)))
display('The optimal tree depth after hyper-parameter tuning is: {}'.format(dtr.get_depth()))

# Plot feature importance
plot_feature_importance(dtr, feature_name = x.columns, title = 'Feature Importance')

## Decision Tree Plot
# DOT data
dot_data_dtr = tree.export_graphviz(dtr, out_file=None, 
                                feature_names=x.columns,  
                                class_names = None,
                                filled=True)

# Draw graph
graph_dtr = graphviz.Source(dot_data_dtr, format="png") 
graph_dtr

## Scatter plot actual vs. predicted sale price for August
f, axs = plt.subplots(1,2,figsize=(10,5))
plt.subplot(1, 2, 1)
sns.scatterplot(x=(y), y=(dtr_y))
plt.xlabel("Actual sale price")
plt.ylabel("Predicted sale price")
plt.title("Predicted vs. actual sale price")
plt.xticks(rotation=90)

## Scatter plot (log-scale) actual vs. predicted sale price for August 
plt.subplot(1, 2, 2)
sns.scatterplot(x=np.log(y), y=np.log(dtr_y))
plt.xlabel("Actual sale price")
plt.ylabel("Predicted sale price")
plt.title("Predicted vs. actual sale price (log-scale)")

#### 3.2.3 Random Forest Regression
Random Forest Regression predicts slightly worse than Decision Tree Regression, with an R-squared of 97.3% against 99.7%. Since the simple decision tree already performs well and is easier to interpret, there is no need to explore other models to improve performance, so Random Forest is shown here only as an additional choice.

The top 3 most important features are:
- July's sale price 
- June's sale price
- May's sale price

The predicted vs. actual scatter plot shows some points away from the perfect 45-degree diagonal line, indicating that the prediction is less accurate than Decision Tree.

In [None]:
##############################
## Random Forest Regression ##
##############################

# Train Random Forest without specifying hyper-parameter
rfr = RandomForestRegressor(random_state=123).fit(x, y)
rfr_y = rfr.predict(x)
display('The R-squared for Random Forest Regression is: {:.3f}'.format(rfr.score(x, y)))

# Plot feature importance
plot_feature_importance(rfr, feature_name = x.columns, title = 'Feature Importance')

## Scatter plot actual vs. predicted sale price for August
f, axs = plt.subplots(1,2,figsize=(10,5))
plt.subplot(1, 2, 1)
sns.scatterplot(x=(y), y=(rfr_y))
plt.xlabel("Actual sale price")
plt.ylabel("Predicted sale price")
plt.title("Predicted vs. actual sale price")
plt.xticks(rotation=90)

## Scatter plot (log-scale) actual vs. predicted sale price for August 
plt.subplot(1, 2, 2)
sns.scatterplot(x=np.log(y), y=np.log(rfr_y))
plt.xlabel("Actual sale price")
plt.ylabel("Predicted sale price")
plt.title("Predicted vs. actual sale price (log-scale)")

#### 3.2.4 Gradient Boosting Regressor
Gradient Boosting Regressor is similar to Decision Tree, except that it is an ensemble method: each stage learns from the errors of the previous stages and adds a new tree to correct them. Since a simple decision tree already has an R-squared of 99.7% and is easy to interpret, there is no need to explore other models to improve performance, so Gradient Boosting is also shown here only as an additional choice.

As expected, Gradient Boosting Regressor improves the R-squared to 99.9%, owing to the greedy, stage-wise nature of its search. The top 3 most important features are:
- July's sale price 
- July's sale number
- March's sale price 

The predicted price vs. actual in the scatter plot shows almost a perfect line, except for a couple of underpredicted outliers when looking at the log-scale plot.

In [None]:
#################################
## Gradient Boosting Regressor ##
#################################

# Train GBoost without specifying hyper-parameter
gb = ensemble.GradientBoostingRegressor(random_state=123).fit(x, y)
gb_y = gb.predict(x)
display('The R-squared for Gradient Boosting Regression is: {:.3f}'.format(gb.score(x, y)))


# Plot feature importance
plot_feature_importance(gb, feature_name = x.columns, title = 'Feature Importance')

## Scatter plot actual vs. predicted sale price for August
f, axs = plt.subplots(1,2,figsize=(10,5))
plt.subplot(1, 2, 1)
sns.scatterplot(x=(y), y=(gb_y))
plt.xlabel("Actual sale price")
plt.ylabel("Predicted sale price")
plt.title("Predicted vs. actual sale price")
plt.xticks(rotation=90)

## Scatter plot (log-scale) actual vs. predicted sale price for August 
plt.subplot(1, 2, 2)
sns.scatterplot(x=np.log(y), y=np.log(gb_y))
plt.xlabel("Actual sale price")
plt.ylabel("Predicted sale price")
plt.title("Predicted vs. actual sale price (log-scale)")

## 4. Conclusion

(1) For the prediction of whether a collection will be resold in August, all the models show that the most important features are:
- the number of tokens a collection has: the fewer the tokens, the less likely a resale
- the duration of the minting event: the shorter the duration, the less likely a resale

(2) For the prediction of the resale price in August, all 3 tree-based regression models have a high R-squared of more than 97%. The most important feature is **July's sale price**.

Decision Tree is suitable for predicting both events (1) and (2). The model performance is good, with an accuracy score of 100% for event (1) and an R-squared of 99.7% for event (2).

## 5. Limitations & Future Improvements

Although the model performance is very good, this might be because only very few collections were not resold in August. With classes this unbalanced towards resales, a model could simply predict the most frequent class and still get a high accuracy score. The unbalanced class issue and the other limitations of the models are summarised below:
- Unbalanced classes with too many resales: the model tends to predict the most frequent class to achieve a high accuracy score.
- Not enough data in the no resale category for K-fold cross validation or a training/testing split, so the models cannot be tested on out-of-sample data.
- The sale history is not long enough to build a time-series model.
- Using monthly data as features means the model needs to be recalibrated every month to incorporate new data.

The above limitations can all be mitigated with more data and a longer history, which will become available over time.
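
The unbalanced class issue can be made concrete with a baseline that ignores every feature and always predicts the majority class. The sketch below uses hypothetical class counts, not the notebook's actual ones. It compares plain accuracy with balanced accuracy for such a baseline.

```python
# Hypothetical label counts: 1 = resale, 0 = no resale
labels = [1] * 150 + [0] * 6

# Baseline that ignores all features and always predicts the majority class
majority = max(set(labels), key=labels.count)
predictions = [majority] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def recall(cls):
    """Share of actual members of cls that were predicted as cls."""
    members = [p for p, y in zip(predictions, labels) if y == cls]
    return sum(p == cls for p in members) / len(members)

# Balanced accuracy: unweighted mean of the per-class recalls
balanced_accuracy = (recall(0) + recall(1)) / 2

print(round(accuracy, 3), balanced_accuracy)
```

Any model should therefore be judged against this baseline, and on balanced accuracy or log loss rather than on plain accuracy alone.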