# Creating a High-Signal BTC Article Suggestion Engine

![btc](https://upload.wikimedia.org/wikipedia/commons/9/9a/BTC_Logo.svg)

In [5]:
!pip install lit_nlp tfds-nightly transformers==4.1.1
!pip install pandas
!pip install bs4
!pip install gensim
!pip install num2words
!pip install pandarallel
!pip install tables
!pip install ipywidgets
!pip install scikit-learn
!pip install seaborn
!pip install nltk
!pip install psutil
!pip install spacy


Collecting lit_nlp
  Using cached lit_nlp-0.4.1-py3-none-any.whl (746 kB)
Collecting tfds-nightly
  Using cached tfds_nightly-4.4.0.dev202201130107-py3-none-any.whl (4.2 MB)
Collecting transformers==4.1.1
  Using cached transformers-4.1.1-py3-none-any.whl (1.5 MB)
Collecting filelock
  Using cached filelock-3.4.2-py3-none-any.whl (9.9 kB)
Collecting sacremoses
  Using cached sacremoses-0.0.47-py2.py3-none-any.whl (895 kB)
Collecting tokenizers==0.9.4
  Using cached tokenizers-0.9.4.tar.gz (184 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting portpicker
  Using cached portpicker-1.5.0-py3-none-any.whl (14 kB)
Collecting ml-collections
  Using cached ml_collections-0.1.0-py3-none-any.whl (88 kB)
Collecting umap-learn
  Using cached umap_learn-0.5.2-py3-none-any.whl
Collecting sacrebleu
  Using cached sacrebleu-2.0.0-py3-none-any.whl (90 kB)
Collecting t


In [33]:
!python3 --version

Python 3.9.0


In [35]:
!python3 -m spacy download en_core_web_sm

/Library/Frameworks/Python.framework/Versions/3.9/bin/python3: No module named spacy
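
The `No module named spacy` error above suggests that the `python3` found on the shell path is not the interpreter that `!pip` installed spaCy into. A minimal sketch of one way around this, assuming the goal is to install into whichever interpreter runs the notebook kernel, is to build both commands from `sys.executable`. The helper names `pip_install_command` and `spacy_download_command` are hypothetical and only used for illustration.

```python
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this kernel."""
    return [sys.executable, "-m", "pip", "install", package]


def spacy_download_command(model):
    """Build a spaCy model download command for the same interpreter."""
    return [sys.executable, "-m", "spacy", "download", model]


# Printing the commands instead of running them, so nothing is installed by accident
print(" ".join(pip_install_command("spacy")))
print(" ".join(spacy_download_command("en_core_web_md")))
```

Passing either list to `subprocess.run(..., check=True)` would execute it. Inside IPython, the `%pip install spacy` magic is another option that targets the kernel's environment.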


In [7]:
# Importing libraries

# Python Modules
import re
import string
import random
import sys
import os
from time import time
# Web Scraping
import requests
from bs4 import BeautifulSoup
# Data Wrangling
import json
import pandas as pd
import pickle
# *Expands Dataframe so as to view it in its entirety
pd.options.display.max_colwidth = 750
# Scientific Computing
import numpy as np
# Data Preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
# Natural Language Processing
import num2words
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim import models
import spacy
spacy.util.fix_random_seed(0)
# Visualizations
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Suppress Developer Warnings
import warnings
warnings.filterwarnings("ignore")
# Info on processes and system utilization
import psutil
# Multi-process (parallel) processing
from pandarallel import pandarallel

In [8]:
data_path = '/Users/pretermodernist/coding-projects/bitcoin-knowledge-bot/datasets/knowledge_datasets/bitcoin_knowledge.json'

In [9]:
# Reading in our bitcoin data from a json lines file with a reproducible function


def wrangle_jsonl(path: str) -> pd.DataFrame:
    '''
    Reads in our bitcoin data from a json lines file

    Parameters
    ----------
    path: str
        Location of the json lines file

    Returns
    -------
    df: pandas DataFrame
        Contains text data from several reputable BTC news and historical sources
    '''
    # Reading the file in; every non-empty line holds one json object
    with open(path) as f:
        lines = [line for line in f.read().splitlines() if line.strip()]

    # Applying json loads on each line to decode the json object into a dictionary
    records = [json.loads(line) for line in lines]

    # Converting any semi-structured json data with a normalize function
    # json keys are made into a flat table
    df = pd.json_normalize(records)

    # Returning the DataFrame so we can explore our data whichever way we wish
    return df
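
To make the two steps inside a JSON lines reader concrete, here is a small standard-library sketch: decode one object per line, then flatten nested keys into dotted column names, which is similar in spirit to what `pd.json_normalize` does. The sample records, the nested `meta` key, and the helper names `read_json_lines` and `flatten` are made up for illustration and are not part of the dataset.

```python
import io
import json


def read_json_lines(handle):
    """Decode one json object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# An in-memory stand-in for a json lines file, including a blank line
sample = io.StringIO(
    '{"title": "A", "meta": {"url": "https://example.com/a"}}\n'
    '\n'
    '{"title": "B", "meta": {"url": "https://example.com/b"}}\n'
)
rows = [flatten(record) for record in read_json_lines(sample)]
print(rows)
```

Each flattened dictionary corresponds to one row of the resulting table.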

In [10]:
btc = wrangle_jsonl(data_path)
btc.sample(10)

Unnamed: 0,title,url,body,image
1094,The Ricardian Contract,https://nakamotoinstitute.org/the-ricardian-contract/,"Secondly, there is no reason to believe that the above list is complete.",https://licensebuttons.net/l/by-sa/4.0/88x31.png
641,Crypto Anarchy and Virtual Communities,https://nakamotoinstitute.org/virtual-communities/,"These virtual communities typically are ""opaque"" to outsiders. Attempts to gain access to the internals of these communities are rarely successful. Law enforcement and intelligence agencies (such as the NSA in the U.S., Chobetsu in Japan, SDECE in France, and so on, in every country) may infiltrate such groups and use electronic surveillance (ELINT) to monitor these virtual communities. Not surprisingly, these communities are early adopters of encryption technology, ranging from scrambled cellphones to full-blown PGP encryption.[6]",https://licensebuttons.net/l/by-sa/4.0/88x31.png
2849,Bitcoin's Rugged Individualism,https://nakamotoinstitute.org/mempool/bitcoins-rugged-individualism/,"An organization inherently cannot take control of its own bitcoins, so its only solution is to remove the agency of its own employees and owners. It is not enough for an organization to distribute keys between its own people, although that will help. However, it is much easier for people within an organization to collude than without, and furthermore within an organization people will tend to have similar characters so the probabilities that each is a bad actor are not independent of one another. Votes must come from different organizations on order to maximize the security of the wallet.",https://nakamotoinstitute.org/static/img/mempool/bitcoins-rugged-individualism/ministry-of-information.jpg
575,Bitcoin: A Peer-to-Peer Electronic Cash System,https://nakamotoinstitute.org/bitcoin/,"New transaction broadcasts do not necessarily need to reach all nodes. As long as they reach many nodes, they will get into a block before long. Block broadcasts are also tolerant of dropped messages. If a node does not receive a block, it will request it when it receives the next block and realizes it missed one.",https://nakamotoinstitute.org/static/img/bitcoin/transactions.svg
117,Mastering bitcoin - Introduction,https://github.com/bitcoinbook/bitcoinbook/blob/develop/ch01.asciidoc,"The Bitcoin network started in 2009, based on a reference implementation published by Nakamoto and since revised by many other programmers. The implementation of the Proof-of-Work algorithm (mining) that provides security and resilience for Bitcoin has increased in power exponentially, and now exceeds the combined processing power of the worlds top supercomputers. Bitcoins total market value has at times exceeded $1 trillion US dollars, depending on the bitcoin-to-dollar exchange rate. The largest transaction processed so far by the network was $1.1 billion US dollars, transmitted instantly and processed for a fee of only $0.68.",https://github.com/bitcoinbook/bitcoinbook/blob/develop/images/cover.png?raw=true
2299,Bitcoin is One for All,https://nakamotoinstitute.org/mempool/bitcoin-is-one-for-all/,First published on Unchained Capital Blog,https://nakamotoinstitute.org/static/img/mempool/bitcoin-is-one-for-all/Americans-with-savings.png
2712,"It's Not About the Technology, It's About the Money",https://nakamotoinstitute.org/mempool/its-not-about-the-technology-its-about-the-money/,"The fact that money has a positive feedback between demand and value implies that there cannot normally be a stable equilibrium between two moneys. Any initial imbalance between them would tend to expand. If one currency was slightly more preferred than the other, people would react to this by demanding slightly more. This makes the preferred even more preferable than before. Any two moneys will interact in this way, thus leaving one to dominate the rest.",https://nakamotoinstitute.org/static/img/mempool/its-not-about-the-technology-its-about-the-money/lyle-lanley-blockchain-technology.png
1610,Measuring Value,https://nakamotoinstitute.org/measuring-value/,"On a larger scale, the Laffer curve may be the most important economic law of political history. Adams[1] uses it to explain the rise and fall of empires. The most successful governments have been implicitly guided by their own incentives both their short-term desire for revenue and their long-term success against other governments to optimize their revenues according to the Laffer Curve. Governments that overburdened their taxpayers, such as the Soviet Union and later Roman Empire, ended up on the dust-heap of history, while governments that collected below the optimum were often conquered by their better-funded neighbors. Democratic governments may maintain high tax revenues over historical time by more peaceful means than conquerin...",https://nakamotoinstitute.org/static/img/docs/measuring-value/taxnarrowhouse.jpg
820,A Formal Language for Analyzing Contracts,https://nakamotoinstitute.org/contract-language/,"If we wish to add the constraint that right2 cannot be performed until right1 has been, we use then:",https://licensebuttons.net/l/by-sa/4.0/88x31.png
2587,Against the Minimum Majority Measure,https://nakamotoinstitute.org/mempool/against-the-minimum-majority-measure/,Balances are something that could be much more distributed today given that new addresses are free to create and free to use if youre receiving a new transaction anyway. Ideally something like confidential transactions will deployed on Bitcoin in the future so that it wont be possible for third parties to monitor balances.,https://nakamotoinstitute.org/static/img/mempool/against-the-minimum-majority-measure/en-confirmed-double-spend-cost.png


In [11]:
# Cleaning text by replacing dirty data with empty strings using regular expressions

def clean_gpt3_training_data_title(uncleaned):
    """
    Accepts a single title as a string and performs
    several regex substitutions in order to clean it.

    Is only meant to be applied to the title column.

    Parameters
    ----------
    uncleaned: str

    Returns
    -------
    cleaned: str
    """
    newline_breaks = '\n\n###\n\n'
    left_over = ' ->'
    replace_with = ""
    removed_lines = re.sub(newline_breaks, replace_with, uncleaned)
    cleaned = re.sub(left_over, replace_with, removed_lines)

    return cleaned


def clean_gpt3_training_data_text(uncleaned):
    """
    Accepts a single text document as a string and performs
    several regex substitutions in order to clean it.

    Is only meant to be applied to the body column.

    Parameters
    ----------
    uncleaned: str

    Returns
    -------
    cleaned: str
        The cleaned document, lowercased
    """
    newline_break = '\n'
    end = 'END'
    replace_with = ""
    removed_line = re.sub(newline_break, replace_with, uncleaned)
    # Stripping urls before spelling out numbers, otherwise digits inside a url
    # are expanded into words and the url is only partially removed
    no_urls = re.sub(r'https?://[^\s]+', '', removed_line)
    no_digits = re.sub(r"(\d+)", lambda x: num2words.num2words(int(x.group(0))), no_urls)
    cleaned = re.sub(end, replace_with, no_digits)

    # Applying case normalization to each body of text
    return cleaned.lower()
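
The order of the substitutions in a cleaner like this matters. The sketch below, which uses only `re`, shows what happens when numbers are spelled out before URLs are stripped. `spell_digits` is a hypothetical stand-in for `num2words` that wraps each number in a visible marker, and the sample sentence is made up.

```python
import re

URL_PATTERN = r"https?://\S+"


def spell_digits(text):
    # Hypothetical stand-in for num2words: marks each number instead of spelling it
    return re.sub(r"\d+", lambda m: f"<{m.group(0)} in words>", text)


def strip_urls(text):
    return re.sub(URL_PATTERN, "", text)


sample = "see https://example.com/2021/post for 3 reasons"

# Numbers first: the inserted spaces cut the url short, so part of it survives
numbers_first = strip_urls(spell_digits(sample))
# Urls first: the whole url is removed before any number is touched
urls_first = spell_digits(strip_urls(sample))

print(repr(numbers_first))
print(repr(urls_first))
```

Because the URL pattern stops at the first whitespace character, any step that inserts spaces into a URL should run after the URL has been removed.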

In [12]:
btc['title'] = btc['title'].apply(clean_gpt3_training_data_title)

In [14]:
# To apply this cleaner function without replacing the original column we can use a list comprehension and assign it to a new column
# This can be handy for comparing the original and cleaned versions of the same corpus
btc['cleaned_text'] = [clean_gpt3_training_data_text(doc) for doc in btc['body']]

In [16]:
btc

Unnamed: 0,title,url,body,image,cleaned_text
0,Mastering bitcoin - Preface,https://github.com/bitcoinbook/bitcoinbook/blob/develop/preface.asciidoc,"I first stumbled upon bitcoin in mid-2011. My immediate reaction was more or less ""Pfft! Nerd money!"" and I ignored it for another six months, failing to grasp its importance. This is a reaction that I have seen repeated among many of the smartest people I know, which gives me some consolation. The second time I came across bitcoin, in a mailing list discussion, I decided to read the whitepaper written by Satoshi Nakamoto to study the authoritative source and see what it was all about. I still remember the moment I finished reading those nine pages, when I realized that bitcoin was not simply a digital currency, but a network of trust that could also provide the basis for so much more than just currencies. The realization that ""this isn...",https://github.com/bitcoinbook/bitcoinbook/blob/develop/images/cover.png?raw=true,"i first stumbled upon bitcoin in mid-two thousand and eleven. my immediate reaction was more or less ""pfft! nerd money!"" and i ignored it for another six months, failing to grasp its importance. this is a reaction that i have seen repeated among many of the smartest people i know, which gives me some consolation. the second time i came across bitcoin, in a mailing list discussion, i decided to read the whitepaper written by satoshi nakamoto to study the authoritative source and see what it was all about. i still remember the moment i finished reading those nine pages, when i realized that bitcoin was not simply a digital currency, but a network of trust that could also provide the basis for so much more than just currencies. the realiza..."
1,Mastering bitcoin - Preface,https://github.com/bitcoinbook/bitcoinbook/blob/develop/preface.asciidoc,"Two years later, after creating a number of small startups to explore various bitcoin-related services and products, I decided that it was time to write my first book. Bitcoin was the topic that had driven me into a frenzy of creativity and consumed my thoughts; it was the most exciting technology I had encountered since the internet. It was now time to share my passion about this amazing technology with a broader audience.",https://github.com/bitcoinbook/bitcoinbook/blob/develop/images/cover.png?raw=true,"two years later, after creating a number of small startups to explore various bitcoin-related services and products, i decided that it was time to write my first book. bitcoin was the topic that had driven me into a frenzy of creativity and consumed my thoughts; it was the most exciting technology i had encountered since the internet. it was now time to share my passion about this amazing technology with a broader audience."
2,Mastering bitcoin - Preface,https://github.com/bitcoinbook/bitcoinbook/blob/develop/preface.asciidoc,"This book is mostly intended for coders. If you can use a programming language, this book will teach you how cryptographic currencies work, how to use them, and how to develop software that works with them. The first few chapters are also suitable as an in-depth introduction to bitcoin for noncodersthose trying to understand the inner workings of bitcoin and cryptocurrencies.",https://github.com/bitcoinbook/bitcoinbook/blob/develop/images/cover.png?raw=true,"this book is mostly intended for coders. if you can use a programming language, this book will teach you how cryptographic currencies work, how to use them, and how to develop software that works with them. the first few chapters are also suitable as an in-depth introduction to bitcoin for noncodersthose trying to understand the inner workings of bitcoin and cryptocurrencies."
3,Mastering bitcoin - Preface,https://github.com/bitcoinbook/bitcoinbook/blob/develop/preface.asciidoc,"The leafcutter ant is a species that exhibits highly complex behavior in a colony super-organism, but each individual ant operates on a set of simple rules driven by social interaction and the exchange of chemical scents (pheromones). Per Wikipedia: ""Next to humans, leafcutter ants form the largest and most complex animal societies on Earth."" Leafcutter ants dont actually eat leaves, but rather use them to farm a fungus, which is the central food source for the colony. Get that? These ants are farming!",https://github.com/bitcoinbook/bitcoinbook/blob/develop/images/cover.png?raw=true,"the leafcutter ant is a species that exhibits highly complex behavior in a colony super-organism, but each individual ant operates on a set of simple rules driven by social interaction and the exchange of chemical scents (pheromones). per wikipedia: ""next to humans, leafcutter ants form the largest and most complex animal societies on earth."" leafcutter ants dont actually eat leaves, but rather use them to farm a fungus, which is the central food source for the colony. get that? these ants are farming!"
4,Mastering bitcoin - Preface,https://github.com/bitcoinbook/bitcoinbook/blob/develop/preface.asciidoc,"Although ants form a caste-based society and have a queen for producing offspring, there is no central authority or leader in an ant colony. The highly intelligent and sophisticated behavior exhibited by a multimillion-member colony is an emergent property from the interaction of the individuals in a social network.",https://github.com/bitcoinbook/bitcoinbook/blob/develop/images/cover.png?raw=true,"although ants form a caste-based society and have a queen for producing offspring, there is no central authority or leader in an ant colony. the highly intelligent and sophisticated behavior exhibited by a multimillion-member colony is an emergent property from the interaction of the individuals in a social network."
...,...,...,...,...,...
3177,Working and Saving are Revolutionary Acts,https://nakamotoinstitute.org/mempool/working-and-saving-are-revolutionary-acts/,"The counterargument to this optimistic view is that governments will also have more resources and technology; counteracting any gains the private sector will make, i.e. economic growth fuels the growth of the State and we should all go Galt or agorist. This counterargument ignores governments completely unsustainable fiscal/monetary practices as well as its inclination to tax at rates low enough to permit economic growth. The former leads to State self-destruction, while the latter ensures the continued growth of the private economy.",https://nakamotoinstitute.org/static/img/mempool/working-and-saving-are-revolutionary-acts/work.jpg,"the counterargument to this optimistic view is that governments will also have more resources and technology; counteracting any gains the private sector will make, i.e. economic growth fuels the growth of the state and we should all go galt or agorist. this counterargument ignores governments completely unsustainable fiscal/monetary practices as well as its inclination to tax at rates low enough to permit economic growth. the former leads to state self-destruction, while the latter ensures the continued growth of the private economy."
3178,Working and Saving are Revolutionary Acts,https://nakamotoinstitute.org/mempool/working-and-saving-are-revolutionary-acts/,"Given that a private law society is inevitable over the longterm, what is left for the young libertarian activist to do?",https://nakamotoinstitute.org/static/img/mempool/working-and-saving-are-revolutionary-acts/work.jpg,"given that a private law society is inevitable over the longterm, what is left for the young libertarian activist to do?"
3179,Working and Saving are Revolutionary Acts,https://nakamotoinstitute.org/mempool/working-and-saving-are-revolutionary-acts/,"If you enjoy politics, then by all means continue to advocate ending the Fed and lowering entitlement spending. Just understand that your efforts mirror those of the White Rose; noble yet inconsequential in the grand scheme of things. There is also an opportunity cost attached to lobbying, electioneering, and otherwise yelling from street corners.",https://nakamotoinstitute.org/static/img/mempool/working-and-saving-are-revolutionary-acts/work.jpg,"if you enjoy politics, then by all means continue to advocate ending the fed and lowering entitlement spending. just understand that your efforts mirror those of the white rose; noble yet inconsequential in the grand scheme of things. there is also an opportunity cost attached to lobbying, electioneering, and otherwise yelling from street corners."
3180,Working and Saving are Revolutionary Acts,https://nakamotoinstitute.org/mempool/working-and-saving-are-revolutionary-acts/,A healthier and more fulfilling approach is to maximize capital formation by:,https://nakamotoinstitute.org/static/img/mempool/working-and-saving-are-revolutionary-acts/work.jpg,a healthier and more fulfilling approach is to maximize capital formation by:


In [18]:
cores = psutil.cpu_count()
cores_used = max(1, cores // 3)  # Never dropping below one worker on small machines
print(f"This System Contains {cores} CPU Cores")
print(f"Will be using {cores_used} Cores for Data Transformation & Model Training")
pandarallel.initialize(progress_bar=True, nb_workers=cores_used)

This System Contains 12 CPU Cores
Will be using 4 Cores for Data Transformation & Model Training
INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


In [None]:
# Initializing a multi-process library for our NLP transformations
# with spaCy
pandarallel.initialize(
                       progress_bar=True,    # Enabling this allows us to view how long it'll take for our core-distributed code to execute
                       nb_workers=2   # Each worker is a separate process, ideally one per CPU core
                       )
# Loading in spaCy's medium-sized English language model; the large model is both too big and unnecessary for this project
# The model has to be downloaded first with: python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md')

In [None]:
def tokenize(doc):
    """
    Turns a single text document into a list of lowercased lemmas,
    dropping punctuation, whitespace, and stop words.
    """
    tokens = []

    doc = nlp(doc)

    for token in doc:
        # Filtering out punctuation, whitespace, stop words, and spaCy v2's pronoun placeholder
        if (not token.is_punct and
            not token.is_space and
            not token.is_stop and
            token.lemma_ != '-PRON-'):
                # Keeping the lemma, case-normalized
                tokens.append(token.lemma_.lower())

    return tokens

In [None]:
btc

In [None]:
%%time
# Tokenizing the cleaned passages; the DataFrame has no 'text' column
btc['tokens'] = btc['cleaned_text'].parallel_apply(tokenize)

In [None]:
# Need to convert the tokens from a list to a string for our vectorizer model later on. Will get errors if we don't.
def listintostring(tokens):
    '''
    This function joins a list of tokens into a single space-separated string
    '''
    return " ".join(tokens)

In [None]:
btc['token_list'] = btc['tokens'].apply(listintostring)

In [None]:
# Create a vector representation of the articles
# The resulting doc-term matrix will be named "dtm"
from sklearn.feature_extraction.text import TfidfVectorizer
# Tuning our vectorizer model
vect = TfidfVectorizer(
                       stop_words='english',
                       ngram_range=(1,2),    # Allowing for both unigrams AND bigrams
                       max_features=10000    # Not allowing more than 10k features/dimensions in our model
                       )
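
As a rough intuition for what a TF-IDF vectorizer computes, here is a simplified pure-Python sketch. It assumes whitespace tokenization, raw term counts, and the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`. It leaves out stop-word removal, bigrams, the feature cap, and vector normalization, so its numbers will not match `TfidfVectorizer` output. The three documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (term count * smoothed idf)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents each term appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


docs = [
    "bitcoin is money",
    "bitcoin mining secures bitcoin",
    "money is a ledger",
]
vectors = tfidf_vectors(docs)
for doc, vector in zip(docs, vectors):
    print(doc, "->", {term: round(weight, 3) for term, weight in vector.items()})
```

Terms that occur in few documents, such as `mining`, receive a higher weight per occurrence than terms shared by many documents, such as `bitcoin`.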

In [None]:
# Copying the raw 'body' passages into a one-column frame named 'text'
btc_test = btc[['body']].rename(columns={'body': 'text'})
### USER INPUT EXAMPLE ###
user_input = ["What is a blockchain?"]
# Appending our user input as the last row of the dataframe for querying
btc_test.loc[len(btc_test.index)] = user_input
# Viewing our user input in our test DF
btc_test.tail(1)

In [None]:
dtm = vect.fit_transform(btc_test['text'])
# get_feature_names_out() requires scikit-learn 1.0 or newer
dtm = pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names_out())
dtm

In [None]:
from sklearn.neighbors import NearestNeighbors

# ball_tree is the index structure used to search for neighbors; distances are Euclidean by default
nn = NearestNeighbors(n_neighbors=25, algorithm='ball_tree')
nn.fit(dtm)  # Fitting our KNN model on the DTM

In [None]:
# There's our test user_input! It was appended as the last row
btc_test.iloc[-1]

In [None]:
# Using the user input (the last row of the DTM) as our query point
doc_index = len(btc_test) - 1
# Using integer location (iloc) with a list so the query stays a one-row DataFrame
doc = dtm.iloc[[doc_index]]

# Query using kneighbors
neigh_dist, neigh_index = nn.kneighbors(doc)
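
The query returns two parallel results: the distances to the nearest rows and the positions of those rows. Because the query row was itself part of the fitted matrix, it can be expected to come back as its own nearest neighbor at distance zero. The brute-force sketch below illustrates that behavior on four made-up 2D points. `k_nearest` is a hypothetical helper, not the scikit-learn implementation, and `math.dist` needs Python 3.8 or newer.

```python
import math


def k_nearest(points, query, k):
    """Return (distances, indices) of the k points closest to query, nearest first."""
    scored = sorted(
        (math.dist(point, query), index) for index, point in enumerate(points)
    )
    top = scored[:k]
    return [distance for distance, _ in top], [index for _, index in top]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
query_index = 3  # The query is one of the stored points, like the appended user input

distances, indices = k_nearest(points, points[query_index], k=3)
# Dropping the query itself so only genuine suggestions remain
suggestions = [index for index in indices if index != query_index]

print(distances, indices, suggestions)
```

Filtering the query position out of the results, as in the last step, keeps the user's own question from being shown as a suggestion.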

In [None]:
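# Loop for retrieving suggested passages based on our user input for the chatbot
# neigh_index holds one row of neighbor indices per query point, and we queried a single point
# The query itself was part of the fitted data, so it appears among its own neighbors
for idx in neigh_index:
    suggestion = btc_test.iloc[idx]
    print(suggestion)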

In [None]:
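def return_link(suggestion, btc):
  # Mapping every passage to the url of the article it came from
  url_by_body = dict(zip(btc["body"], btc["url"]))
  links = []
  for text in suggestion["text"]:
    url = url_by_body.get(text)
    # The user input has no url; repeated articles are only listed once
    if url is not None and url not in links:
      links.append(url)
  return links

links = return_link(suggestion, btc)
links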