# Creating a High-Signal BTC Article Suggestion Engine

![btc](https://upload.wikimedia.org/wikipedia/commons/9/9a/BTC_Logo.svg)

In [2]:
!pip install pandas
!pip install gensim
!pip install pandarallel
!pip install tables
!pip install ipywidgets
!pip install scikit-learn
!pip install seaborn
!pip install nltk
!pip install num2words
!pip install psutil




In [3]:
# Importing libraries

# Python Modules
import re
import random
import sys
import os.path
from time import time
# Data Wrangling
import json
import pandas as pd
# Widens column display so long text fields are shown in full (up to 750 characters)
pd.options.display.max_colwidth = 750
# Scientific Computing
import numpy as np
# Natural Language Processing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
import num2words
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim import models
# Visualizations
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Suppress Developer Warnings
import warnings
warnings.filterwarnings("ignore")
# Info on processes and system utilization
import psutil

In [4]:
data_path = '/Users/pretermodernist/coding-projects/bitcoin-knowledge-bot/datasets/knowledge_datasets/bitcoin_knowledge.json'

In [5]:
# Reading in our bitcoin data from a JSON Lines file with a reusable function


def wrangle_jsonl(path: str) -> pd.DataFrame:
    '''
    Reads in our bitcoin data from a JSON Lines file

    Parameters
    ----------
    path: str
        Location of the JSON Lines file (one JSON object per line)

    Returns
    -------
    df: pandas DataFrame
        Contains text data from several reputable BTC news and historical sources
    '''
    # Reading the file into a list with one raw JSON string per line
    with open(path) as f:
        lines = f.read().splitlines()

    # Loading the raw lines into an intermediary pandas DataFrame
    df_inter = pd.DataFrame(lines)
    df_inter.columns = ['json_element']

    # Decoding each row of the json_element column from a JSON string
    # into a dictionary
    decoded = df_inter['json_element'].apply(json.loads)

    # Flattening the decoded dictionaries so that JSON keys become
    # the columns of a flat table
    df = pd.json_normalize(decoded.tolist())

    # Returning the DataFrame so we can explore our data
    return df
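
The line-by-line decoding step can also be sketched with the standard library alone. This is a minimal illustration, assuming every non-empty line holds exactly one JSON object; the `parse_jsonl` helper and the two sample records are hypothetical and are not part of the dataset.

```python
import io
import json


def parse_jsonl(handle):
    """Decode one JSON object per non-empty line of a file-like object."""
    records = []
    for line in handle:
        line = line.strip()
        if not line:  # tolerate blank lines, e.g. a trailing newline
            continue
        records.append(json.loads(line))
    return records


# Hypothetical two-record sample standing in for the real file
sample = io.StringIO(
    '{"title": "A", "body": "first paragraph"}\n'
    '\n'
    '{"title": "B", "body": "second paragraph"}\n'
)
records = parse_jsonl(sample)
print(len(records), records[0]["title"])
```

For files that follow this layout, `pd.read_json(path, lines=True)` is another way to get a DataFrame in a single call.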

In [6]:
btc = wrangle_jsonl(data_path)
btc.sample(10)

Unnamed: 0,title,url,body,image
2121,On the Origins of Bitcoin:\nStages of Monetary Evolution,https://nakamotoinstitute.org/origins-of-bitcoin/,"It so happens that such strings of shell beads have characteristics that we now understand in retrospect to be essential characteristics of media of exchange and value storage. They are durable, divisible, portable, and to some degree interchangeable (though different colors and types could naturally be valued differently). Beads can be unstrung, counted, and restrung in different numerical configurations.",https://nakamotoinstitute.org/static/img/docs/on-the-origins-of-bitcoin/bitcoin.jpg
2221,Reciprocal Altruism in the Theory of Money,https://nakamotoinstitute.org/reciprocal-altruism-in-the-theory-of-money/,"To the economist, altruism simply means gaining satisfaction from benefiting other people. Nothing about that requires an economic explanation. Peoples preferences are exogenous, and if they enjoy helping others, that is not an economic issue per se. Whereas in biology, the only value is that which best spreads genes. It is easier to understand how this should produce selfishness on the part of an organism, so an organism which enjoyed helping others would be in need of an explanation.",https://nakamotoinstitute.org/static/img/docs/reciprocal-altruism-in-the-theory-of-money/matrix-1.svg
1104,The Ricardian Contract,https://nakamotoinstitute.org/the-ricardian-contract/,"A contract can encapsulate the detail. Consider the original sovereign currency contracts, in which the issuer promised to pay the bearer in ounces of precious metal. That is four datum in the contract already: which sovereign, ""pay to bearer,"" what to pay, and how much of it.",https://licensebuttons.net/l/by-sa/4.0/88x31.png
1167,Triple Entry Accounting,https://nakamotoinstitute.org/triple-entry-accounting/,Our experiences weigh in on the side of enablement. I refer to the experiences of digital money issuers. Our own first deployment of a system was with a single entry bookkeeping system. Its failure rate even though coding was tight was such that it could not sustain more than 20 accounts before errors in accounting crept in and the system lost cohesion. This occurred within weeks of initial testing and was never capable of being fielded. The replacement double entry system was fielded in early 1996 and has never lost a transaction (although there have been some close shaves[3]).,https://nakamotoinstitute.org/static/img/docs/triple-entry-accounting/interim-receipt.png
2593,Against the Minimum Majority Measure,https://nakamotoinstitute.org/mempool/against-the-minimum-majority-measure/,"A reorganization of the block chain can undo previous transactions, removing bitcoins from the wallets of the people who received those transactions.",https://nakamotoinstitute.org/static/img/mempool/against-the-minimum-majority-measure/en-confirmed-double-spend-cost.png
792,Formalizing and Securing Relationships on Public Networks,https://nakamotoinstitute.org/formalizing-securing-relationships/,"All public key operation can be performed inside an unreadable hardware board or smart card on a machine with a very narrow serial-line connection (ie, it carries only a simple single-use protocol with well-verified security) to a dedicated firewall. This is economical for high traffic servers, but may be less practical for individual users. Besides better security, it has the added advantage that hardware speeds up the public key computations.",https://nakamotoinstitute.org/static/img/docs/formalizing-securing-relationships/mutually.gif
2652,Crashes and Hyperinflation,https://nakamotoinstitute.org/mempool/crashes-and-hyperinflation/,"Because just about any good can be used as money, it is better to think of money as a behavior rather than as a thing. The money good is often something that is not very useful; it is something that is easy to count and difficult to reproduce. Money is rarely demanded to be consumed. Even something gold, which has important uses in industry or as jewelry, is more often just stored away. It is not, therefore, the nature of the specific good that is used which makes money useful. It is the way people treat it.",https://nakamotoinstitute.org/static/img/mempool/crashes-and-hyperinflation/dr-strangelove-bomb.jpg
1578,Measuring Value,https://nakamotoinstitute.org/measuring-value/,A crucial issue in transaction costs economics,https://nakamotoinstitute.org/static/img/docs/measuring-value/taxnarrowhouse.jpg
2389,Bitcoin is Not Backed by Nothing,https://nakamotoinstitute.org/mempool/bitcoin-is-not-backed-by-nothing/,"Venezuela, Argentina, and Turkey all have governments, militaries and the authority to tax, yet the currencies of each have deteriorated significantly over the past five years. While its not sufficient to prove the counterfactual, each is an example that contradicts the idea that a currency derives its value as a function of government. Each and every episode of hyperinflation should be evidence enough of the inherent flaws in fiat monetary systems, but unfortunately it is not. Rather than understanding hyperinflation as the logical end game of all fiat systems, most simply believe hyperinflation to be evidence of monetary mismanagement. This simplistic view ignores first principles, as well as the dynamics which ensure monetary debasem...",https://nakamotoinstitute.org/static/img/mempool/bitcoin-is-not-backed-by-nothing/bitcoin-price.png
2638,Crashes and Hyperinflation,https://nakamotoinstitute.org/mempool/crashes-and-hyperinflation/,"I am now in a position to talk about why money persists, despite the fact that it is merely a shared hallucination. Why dont people, so to speak, wake up and throw their money on the ground? In my previous article I explained the growth of money as a step-by-step process in which a good becomes monetized as one investor after another comes to treat it like money.",https://nakamotoinstitute.org/static/img/mempool/crashes-and-hyperinflation/dr-strangelove-bomb.jpg


In [7]:
# Cleaning text by removing URLs and spelling out numbers using regular expressions

def cleaner_regex(uncleaned):
    """
    Accepts a single text document as a string and performs
    several regex substitutions in order to clean it.

    Is only meant to be applied element-wise to the text column.

    Parameters
    ----------
    uncleaned: str

    Returns
    -------
    cleaned: str
        Lowercased text with URLs removed and digits spelled out
    """
    # square_brackets = r'\[.*?\]'
    replace_with = ""
    # removed_line = re.sub(square_brackets, replace_with, uncleaned)
    # Removing URLs first so digits inside them are not expanded into words
    half_cleaned = re.sub(r'https?://[^\s]+', replace_with, uncleaned)
    cleaned = re.sub(r"(\d+)", lambda x: num2words.num2words(int(x.group(0))), half_cleaned)

    # Applying case normalization to each body of text
    return cleaned.lower()
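
The order of the two substitutions matters whenever a URL contains digits, because spelled-out numbers introduce spaces and the URL pattern stops at the first whitespace. The sketch below illustrates this with the standard library only; `spell_digits` is a hypothetical stand-in for `num2words` (it spells each digit separately), and the sample URL is made up.

```python
import re

DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]
URL_PATTERN = r"https?://[^\s]+"


def spell_digits(text):
    """Stand-in for num2words: spell each digit of a number, space-separated."""
    return re.sub(
        r"\d+",
        lambda m: " ".join(DIGIT_NAMES[int(d)] for d in m.group(0)),
        text,
    )


raw = "see https://example.com/2019/post for details"

# Expanding numbers first splits the URL at the inserted spaces,
# so the URL pattern only removes the part before the first space.
numbers_first = re.sub(URL_PATTERN, "", spell_digits(raw))

# Removing the URL first leaves no residue behind.
urls_first = spell_digits(re.sub(URL_PATTERN, "", raw))

print(repr(numbers_first))
print(repr(urls_first))
```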

In [8]:
btc['body'] = btc['body'].apply(cleaner_regex)

In [9]:
# To apply this cleaner function without replacing the original column we can use a list comp and assign it to a new column
# This can be handy for comparing the original and cleaned versions of the same corpus

# Need to assign the column to a variable before looping through it with a list comprehension

# uncleaned_prompt = btc.body
# btc['cleaned_body'] = [cleaner_regex(corpus) for corpus in uncleaned_prompt]

In [10]:
btc.sample(5)

Unnamed: 0,title,url,body,image
192,Mastering bitcoin - Bitcoin Core: The Reference Implementation,https://github.com/bitcoinbook/bitcoinbook/blob/develop/ch03.asciidoc,"if youre reading this book and interested in developing bitcoin software, you should be running your own node.",https://github.com/bitcoinbook/bitcoinbook/blob/develop/images/cover.png?raw=true
1519,Advances in Distributed Security,https://nakamotoinstitute.org/advances-in-distributed-security/,"an unfair way to create a total order would be to impose a certain predictable rule for breaking ties. for example, we could decide on a total order for the processes and break ties in the causal order by referring to this total order.",https://licensebuttons.net/l/by-sa/4.0/88x31.png
702,Crypto Anarchy and Virtual Communities,https://nakamotoinstitute.org/virtual-communities/,"is this a good thing? mostly yes. crypto anarchy has some messy aspects, of this there can be little doubt. from relatively unimportant things like price-fixing and insider trading to more serious things like economic espionage, the undermining of corporate knowledge ownership, to extremely dark things like anonymous markets for killings.",https://licensebuttons.net/l/by-sa/4.0/88x31.png
2697,"It's Not About the Technology, It's About the Money",https://nakamotoinstitute.org/mempool/its-not-about-the-technology-its-about-the-money/,"there is an individual benefit to using money, and its very simple. the person who accepts money gets to defer his decisions about what to buy to a later time. someone who does not want to use money must have a better idea about what he is going to do with the goods he receives in payment than the person who accepts money. when one has money, then one is not committed. if i am the first person to accept money in payment and my bet on it pays off, then i have the option to choose what i want later, and i do not have to choose based on the limited information i have now. this benefit explains why someone would want something that is good for keeping in storage. if he wants to keep his options open, then he can open his vault the moment th...",https://nakamotoinstitute.org/static/img/mempool/its-not-about-the-technology-its-about-the-money/lyle-lanley-blockchain-technology.png
1487,The Cathedral and the Bazaar,https://nakamotoinstitute.org/the-cathedral-and-the-bazaar/,"having studied linus's behavior and formed a theory about why it was successful, i made a conscious decision to test this theory on my new (admittedly much less complex and ambitious) project.",https://licensebuttons.net/l/by-sa/4.0/88x31.png


In [11]:
# Exporting regex cleaned dataset under a timestamped file name
filename = pd.Timestamp.now().strftime('bitcoin_knowledge_regexed_%Y-%m-%d-%H%M.json')
folder = r'/Users/pretermodernist/coding-projects/bitcoin-knowledge-bot/datasets/knowledge_datasets'
filepath = os.path.join(folder, filename)
btc.to_json(filepath, orient="records", lines=True)

In [12]:
cores = psutil.cpu_count()
# Using a third of the available cores, but never fewer than one
cores_used = max(1, cores // 3)
print(f"This System Contains {cores} CPU Cores")
print(f"Will be using {cores_used} Cores for Data Transformation & Model Training")

This System Contains 12 CPU Cores
Will be using 4 Cores for Data Transformation & Model Training


In [13]:
# Configuring our TF-IDF vectorizer
vect = TfidfVectorizer(
                       stop_words='english',
                       ngram_range=(1, 2),       # Allowing for both unigrams & bigrams
                       max_features=10000       # Not allowing more than 10k features/dimensions in our model
                       )
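
To make `ngram_range=(1, 2)` concrete, here is a minimal sketch of how unigrams and bigrams can be generated from a token list. The `word_ngrams` helper is hypothetical and only mirrors the idea; the vectorizer's own tokenization and stop-word handling are not reproduced here.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = ["public", "transaction", "ledger"]
print(word_ngrams(tokens))
# ['public', 'transaction', 'ledger', 'public transaction', 'transaction ledger']
```

Three tokens already yield five features, which is why a cap such as `max_features` helps keep the matrix manageable once bigrams are allowed.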

In [14]:
btc_knn = pd.DataFrame(btc['body'])

## USER INPUT EXAMPLE ##

user_input = ["What's the blockchain?"]
# Inserting our user input into the dataframe for querying
btc_knn.loc[len(btc_knn.index)] = user_input
# Viewing our user input in our KNN dataframe
btc_knn.tail(1)

Unnamed: 0,body
3182,What's the blockchain?


In [15]:
dtm = vect.fit_transform(btc_knn['body'])
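# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
dtm = pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names_out())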
dtm

Unnamed: 0,ab,abacus,abandoned,abiding,abilities,ability,ability control,ability parties,able,able spend,...,zero thousand,zero years,zero zero,zeros,zimmermann,zu,zu einer,zur,zwei,zwischen
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3178,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3179,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3180,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3181,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
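
The weights in a matrix like this can be approximated by hand. The sketch below is a simplified pure-Python version, assuming whitespace tokenization, the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1`, and L2 row normalization (scikit-learn's documented defaults). The `tfidf_rows` helper and the three toy documents are hypothetical, and stop-word removal and bigrams are left out.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Compute L2-normalised TF-IDF rows for whitespace-tokenised documents."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenised for term in tokens})
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


docs = ["bitcoin is money", "blockchain secures bitcoin", "money is a behavior"]
vocab, rows = tfidf_rows(docs)
print(vocab)
print([round(w, 3) for w in rows[1]])
```

In the second document, "blockchain" receives a higher weight than "bitcoin" because it appears in fewer documents, which is the property that lets rare, topical terms drive the similarity search.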


In [16]:
%%time
nn = NearestNeighbors(
    n_neighbors = 25,
    algorithm = 'ball_tree',  # Using a ball tree index to speed up neighbor searches
    n_jobs = cores_used
    ) 
 
nn.fit(dtm)  # Fitting our KNN model on the DTM

CPU times: user 2.17 s, sys: 62.5 ms, total: 2.23 s
Wall time: 2.26 s


NearestNeighbors(algorithm='ball_tree', n_jobs=4, n_neighbors=25)

In [17]:
# There's our test user_input!
btc_knn.iloc[-1]

body    What's the blockchain?
Name: 3182, dtype: object

In [18]:
# Selecting our user input (the last row of dtm) as the query point
doc_index = -1
# Using integer location (iloc) on the document index to pull out its TF-IDF vector
doc = [dtm.iloc[doc_index].values]

# Query Using kneighbors 
neigh_dist, neigh_index = nn.kneighbors(doc)
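
Conceptually, the query returns the rows with the smallest distance to the query vector. A brute-force sketch of that idea follows, assuming plain Euclidean distance (the default metric of `NearestNeighbors`); the `k_nearest` helper and the toy vectors are hypothetical, and a ball tree reaches the same answer without scanning every row.

```python
import math


def k_nearest(rows, query, k):
    """Return (distance, row_index) pairs for the k rows closest to query."""
    scored = [(math.dist(row, query), i) for i, row in enumerate(rows)]
    scored.sort()
    return scored[:k]


rows = [
    [0.8, 0.6, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],  # this last row plays the role of the query
]
query = rows[-1]
for dist, idx in k_nearest(rows, query, k=3):
    print(idx, round(dist, 3))
```

Because the query is itself one of the fitted rows, it always comes back first at distance zero, so the first real match sits in second position.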

In [19]:
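# Looking up the neighbors of our user input for the chatbot
# neigh_index holds one row of indices per query, so we take the first row
# (the closest match is the user input itself, since it is part of the fitted data)
recommendation = btc_knn.iloc[neigh_index[0]]

recommendation.head()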

Unnamed: 0,body
3182,What's the blockchain?
2684,"yet people are running around everywhere in the bitcoin world screaming blockchain blockchain blockchain for all kinds of non-intuitive purposes until they're buried under piles of money. i can't believe how long it's taking for people to get wise to this ruse, but i hope it won't last too much longer. a blockchain does not have a wide range of applications. however, there is one applicationtwo, namely that of being a currency, which is overwhelmingly important."
2407,"without the money, there is no security and without the security, the value of the currency and the integrity of the chain both break down. it is for this reason that a blockchain is only useful within the application of money, and money does not magically grow on trees. yep, it is that simple. a blockchain is only good for one thing, removing the need for a trusted third-party which only works in the context of money. a blockchain cannot enforce anything that exists outside the network. while a blockchain would seem to be able to track ownership outside the network, it can only enforce ownership of the currency that is native to its network. bitcoin tracks ownership and enforces ownership. if a blockchain cannot do both, any records it..."
440,"another common mistake is to take transactions ""off blockchain"" in a misguided effort to reduce transaction fees or accelerate transaction processing. an ""off blockchain"" system will record transactions on an internal, centralized ledger and only occasionally synchronize them to the bitcoin blockchain. this practice, again, substitutes decentralized bitcoin security with a proprietary and centralized approach. when transactions are off blockchain, improperly secured centralized ledgers can be falsified, diverting funds and depleting reserves, unnoticed."
104,a public transaction ledger (the blockchain)


In [22]:
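def return_suggestion(recommendation, btc):
    # Position 0 is the user input itself, so position 1 is the closest article text
    closest_text = recommendation.iloc[1].values[0]
    # regex=False matches the text literally instead of as a regular expression
    knn_recommendation = btc[btc["body"].str.contains(closest_text, regex=False, na=False)]
    return knn_recommendation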

suggestion = return_suggestion(recommendation, btc)
suggestion

Unnamed: 0,title,url,body,image
2684,"It's Not About the Technology, It's About the Money",https://nakamotoinstitute.org/mempool/its-not-about-the-technology-its-about-the-money/,"yet people are running around everywhere in the bitcoin world screaming blockchain blockchain blockchain for all kinds of non-intuitive purposes until they're buried under piles of money. i can't believe how long it's taking for people to get wise to this ruse, but i hope it won't last too much longer. a blockchain does not have a wide range of applications. however, there is one applicationtwo, namely that of being a currency, which is overwhelmingly important.",https://nakamotoinstitute.org/static/img/mempool/its-not-about-the-technology-its-about-the-money/lyle-lanley-blockchain-technology.png
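
One caveat when matching a passage back to its source row by text: `str.contains` treats its argument as a regular expression unless told otherwise, and article text often contains characters such as parentheses. The sketch below uses the standard `re` module to show the effect on one of the passages returned above; it is an illustration of the pitfall rather than part of the pipeline.

```python
import re

text = "a public transaction ledger (the blockchain)"

# Treated as a regex, the parentheses form a capture group and no longer
# match the literal "(" and ")" characters in the text itself.
as_regex = re.search(text, text)

# Escaping the pattern (or using a plain substring test) matches literally.
as_literal = re.search(re.escape(text), text)

print(as_regex is None, as_literal is not None, text in text)
```

Matching literally, or looking rows up by their index label instead of by text, avoids this class of silent misses.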
