# Random forests and collaborative filtering on bitcoin 
## Bitcoin data revisited 

The goal of this notebook is to use Latent Dirichlet Allocation (LDA) as a collaborative filtering algorithm on bitcoin data. 

In [7]:
# import data from September 2023, since the price was rather stable during this period

# the data are 15-minute candlesticks
import pandas as pd
df = pd.read_csv("btc-data/BTCUSDT_15_2023-09-01_2023-09-30.csv", names=["time","open","high","low","close","vol"], header=None)
print(df.shape)
df.head()

(2881, 6)


Unnamed: 0,time,open,high,low,close,vol
0,2023.09.01 00:00,26009.3,26016.5,25627.2,25856.8,20063.967
1,2023.09.01 00:15,25856.8,26021.2,25840.1,25910.7,5158.731
2,2023.09.01 00:30,25910.7,25955.9,25888.7,25945.7,1300.917
3,2023.09.01 00:45,25945.7,26029.7,25945.0,26013.4,1616.798
4,2023.09.01 01:00,26013.4,26013.4,25962.2,25991.8,844.671


**Now we want to recreate this blog post:** <br/>
https://towardsdatascience.com/create-a-recommendation-system-based-on-time-series-data-using-latent-dirichlet-allocation-2aa141b99e19

To do this, we need to think about our data a bit differently. 
In the blog post, LDA was used to find groups of people who watch shows with a certain probability during a 24-hour period. 
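
To make that setup concrete, here is a minimal sketch of LDA on a count matrix, using scikit-learn's `LatentDirichletAllocation`. The viewer-by-time-slot counts, the number of time slots, and the choice of two topics are made-up assumptions for illustration; they are not taken from the blog post or from our data.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy count matrix (invented numbers): one row per viewer,
# one column per time slot, entries = how often the viewer watched in that slot
counts = np.array([
    [9, 8, 0, 0],
    [8, 9, 1, 0],
    [0, 1, 9, 8],
    [0, 0, 8, 9],
])

# Two latent groups ("topics") is an arbitrary choice for this sketch
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Per-viewer mixture over the latent groups; each row sums to 1
viewer_topics = lda.fit_transform(counts)

# Per-group distribution over time slots (components_ holds unnormalized weights)
topic_slots = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

print(viewer_topics.round(2))
print(topic_slots.round(2))
```

In our case the rows and columns would have to be redefined in terms of days, hours, and price variations, which is what the following steps prepare.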

We therefore need to ask a different question of our data: during which hour of the day is which price variation most likely? 

The next step is therefore to calculate, for each day, the normalized differences relative to that day's opening price. 
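
As a rough sketch of what this could look like, the snippet below assumes that "normalized difference" means the relative deviation of each candle's close from the first open of the same day. The `toy` frame and the `norm_diff` column name are hypothetical and the prices are invented.

```python
import pandas as pd

# Made-up candles for two days (illustrative values, not real prices)
toy = pd.DataFrame({
    "time": pd.to_datetime([
        "2023-09-01 00:00", "2023-09-01 00:15",
        "2023-09-02 00:00", "2023-09-02 00:15",
    ]),
    "open": [100.0, 102.0, 200.0, 198.0],
    "close": [102.0, 101.0, 198.0, 204.0],
})

# First open of each calendar day, repeated on every row of that day
toy["daily_open"] = toy.groupby(toy["time"].dt.date)["open"].transform("first")

# Assumed definition: relative deviation of the close from the day's open
toy["norm_diff"] = (toy["close"] - toy["daily_open"]) / toy["daily_open"]

print(toy[["time", "close", "daily_open", "norm_diff"]])
```

Dividing by the daily open makes days with different price levels comparable, since a move of 1% looks the same regardless of the absolute price.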

In [10]:
# convert the time column to datetime
df["time"] = pd.to_datetime(df["time"])

# extract day of month and hour
df["day"] = df["time"].dt.day
df["hour"] = df["time"].dt.hour
df.head()

Unnamed: 0,time,open,high,low,close,vol,day,hour
0,2023-09-01 00:00:00,26009.3,26016.5,25627.2,25856.8,20063.967,1,0
1,2023-09-01 00:15:00,25856.8,26021.2,25840.1,25910.7,5158.731,1,0
2,2023-09-01 00:30:00,25910.7,25955.9,25888.7,25945.7,1300.917,1,0
3,2023-09-01 00:45:00,25945.7,26029.7,25945.0,26013.4,1616.798,1,0
4,2023-09-01 01:00:00,26013.4,26013.4,25962.2,25991.8,844.671,1,1


In [18]:
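# get the first candle of each day; its open price serves as the reference point
# (grouping by day of month is only unambiguous while the data stays within one month)
daily_open = df.groupby(df["day"]).first()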
print(df.shape)
daily_open.head()

(2881, 8)


day,time,open,high,low,close,vol,hour
1,2023-09-01,26009.3,26016.5,25627.2,25856.8,20063.967,0
2,2023-09-02,25763.6,25817.5,25733.4,25789.7,981.207,0
3,2023-09-03,25843.0,25858.0,25831.2,25851.6,275.492,0
4,2023-09-04,26037.2,26062.2,25999.2,26006.0,1046.109,0
5,2023-09-05,25822.5,25828.0,25735.9,25754.6,1639.1,0


In [None]:
# broadcast each calendar day's first open price to every row of that day
df["daily_open"] = df.groupby(df["time"].dt.date)["open"].transform("first")