# Creating a High-Signal BTC Article Suggestion Engine

![btc](https://upload.wikimedia.org/wikipedia/commons/9/9a/BTC_Logo.svg)

In [1]:
!pip install pandas
!pip install gensim
!pip install pandarallel
!pip install tables
!pip install ipywidgets
!pip install scikit-learn
!pip install seaborn
!pip install nltk
!pip install psutil
!pip install num2words



In [1]:
# Importing libraries

# Python Modules
import re
import random
import sys
import os.path
from time import time
# Data Wrangling
import json
import pandas as pd
# *Expands Dataframe so as to view it in its entirety
pd.options.display.max_colwidth = 750
# Scientific Computing
import numpy as np
# Natural Language Processing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics.pairwise import linear_kernel
import num2words
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim import models
# Visualizations
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Suppress Developer Warnings
import warnings
warnings.filterwarnings("ignore")
# Info on processes and system utilization
import psutil

In [2]:
data_path = '/Users/pretermodernist/coding-projects/bitcoin-knowledge-bot/datasets/knowledge_datasets/bitcoin_articles.json'

In [3]:
# Reading in our bitcoin data from a json lines file with a reproducible function


def wrangle_jsonl(path: str):
    '''
    Reads in our bitcoin data from a json lines file

    Parameters
    ----------
    path: str
        Path to the json lines file, with one JSON object per line

    Returns
    -------
    df: pandas DataFrame
        Contains text data from several reputable BTC news and historical sources
    '''
    # Reading the jsonl file in as a list of raw lines
    with open(path) as f:
        lines = f.read().splitlines()

    # Loading the raw lines into an intermediary pandas DataFrame
    df_inter = pd.DataFrame(lines)
    df_inter.columns = ['json_element']

    # Decoding each line into a dictionary with json.loads, then flattening
    # any nested json keys into a flat table with json_normalize
    df = pd.json_normalize(df_inter['json_element'].apply(json.loads))

    # Returning the flattened DataFrame so we can explore our data
    return df
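
To make the flattening step concrete, here is a minimal sketch of what `pd.json_normalize` does with nested keys. The two sample records and the nested `meta` field are invented for illustration and are not taken from the dataset.

```python
import io
import json

import pandas as pd

# Hypothetical two-line JSON Lines payload; "meta" is an invented nested field
raw = io.StringIO(
    '{"title": "Bit Gold", "meta": {"type": "article"}}\n'
    '{"title": "The Cyphernomicon", "meta": {"type": "article"}}\n'
)

# Decode one JSON object per non-empty line
records = [json.loads(line) for line in raw if line.strip()]

# Nested keys are flattened into dotted column names
sample_df = pd.json_normalize(records)
print(sample_df.columns.tolist())
```

For files without nested objects, `pd.read_json(path, lines=True)` is a one-call alternative to the manual decode.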

In [22]:
btc = wrangle_jsonl(data_path)
btc.sample(10)

Unnamed: 0,title,url,body,image,type
16162,The Cyphernomicon,https://ia600208.us.archive.org/10/items/cyphernomicon/cyphernomicon.txt,"+ steganography, the science of hiding the existence of",,article
17920,The Cyphernomicon,https://ia600208.us.archive.org/10/items/cyphernomicon/cyphernomicon.txt,"transactions/ od course not, and yet this is",,article
1206,Bitcoin Has No Intrinsic Value — and That’s Great,https://medium.com/coinmonks/bitcoin-has-no-intrinsic-value-and-thats-great-e6994adbfe0f,"In our present day, it just so happens that the best stores of value are also those that have some element of utility as a commodity. The key distinction here is that gold, real estate, or any form of commodity money, is not a store of value because of its utility as a commodity, but despite that utility!",,article
26895,Truledger in Plain English,https://nakamotoinstitute.org/truledger/,"Bob: (<bobsid>,gettime,<serverid>,<req5#>)",https://nakamotoinstitute.org/static/img/cc-4-0-by-sa.png,article
11919,The Cyphernomicon,https://ia600208.us.archive.org/10/items/cyphernomicon/cyphernomicon.txt,specifying the hardware to the vendors,,article
20633,The Cyphernomicon,https://ia600208.us.archive.org/10/items/cyphernomicon/cyphernomicon.txt,RSA algorithm is the most widely used form of public key,,article
7022,The Cyphernomicon,https://ia600208.us.archive.org/10/items/cyphernomicon/cyphernomicon.txt,"enforced, or won't be in the future.",,article
14301,The Cyphernomicon,https://ia600208.us.archive.org/10/items/cyphernomicon/cyphernomicon.txt,many others: the answer to all these questions is mostly,,article
7260,The Cyphernomicon,https://ia600208.us.archive.org/10/items/cyphernomicon/cyphernomicon.txt,"environment, paradoxical as that may sound. ""Even paranoids",,article
14222,The Cyphernomicon,https://ia600208.us.archive.org/10/items/cyphernomicon/cyphernomicon.txt,"- example: banks have cameras, by 1998, that can",,article


In [5]:
# Cleaning text by replacing dirty data with empty strings using regular expressions

def cleaner_regex(uncleaned):
    """
    Accepts a single text document as a string and performs
    several regex substitutions in order to clean it.

    Is only meant to be applied to the text column, one row at a time.

    Parameters
    ----------
    uncleaned: str

    Returns
    -------
    cleaned: str
    """
    # square_brackets = r'\[.*?\]'
    replace_with = ""
    # removed_line = re.sub(square_brackets, replace_with, uncleaned)
    # Removing urls first, so that digits inside a url are never spelled out
    half_cleaned = re.sub(r'https?://[^\s]+', replace_with, uncleaned)
    # Spelling out every remaining run of digits as words
    cleaned = re.sub(r"(\d+)", lambda x: num2words.num2words(int(x.group(0))), half_cleaned)

    # Applying case normalization to each body of text
    return cleaned.lower()
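
The order of the two substitutions matters whenever the number speller emits spaces, because the url pattern stops at the first whitespace character. The sketch below illustrates this with the standard library only. `spell_digits` is a hypothetical stand-in for `num2words` (it spells each digit separately), and the sample sentence is made up.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
DIGIT_RUN = re.compile(r"\d+")
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_digits(match):
    # Hypothetical stand-in for num2words: "42" -> "four two"
    return " ".join(DIGIT_NAMES[int(d)] for d in match.group(0))


sample = "see https://example.com/post/42/intro for 2 reasons"

# Digits first: the space inside "four two" splits the url,
# so only part of it is removed
digits_first = URL_PATTERN.sub("", DIGIT_RUN.sub(spell_digits, sample))

# Urls first: the whole url is gone before any digit is spelled out
urls_first = DIGIT_RUN.sub(spell_digits, URL_PATTERN.sub("", sample))

print(repr(digits_first))
print(repr(urls_first))
```

With digits converted first, the tail of the url (`two/intro`) survives in the cleaned text. With urls removed first, nothing of the url is left.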

In [23]:
btc['body'] = btc['body'].apply(cleaner_regex)

In [30]:
# To apply this cleaner function without replacing the original column we can use a list comp and assign it to a new column
# This can be handy for comparing the original and cleaned versions of the same corpus

# Need to assign the column to a variable before looping through it with a list comprehension

# uncleaned_prompt = btc.body
# btc['cleaned_body'] = [cleaner_regex(corpus) for corpus in uncleaned_prompt]

In [9]:
btc.sample(5)

Unnamed: 0,title,url,body,image,confidence_score
26148,Bit Gold,https://nakamotoinstitute.org/bit-gold/,"thus, bit gold will not be fungible based on a simple function of, for example, the length of the string. instead, to create fungible units dealers will have to combine different-valued pieces of bit gold into larger units of approximately equal value. this is analogous to what many commodity dealers do today to make commodity markets possible. trust is still distributed because the estimated values of such bundles can be independently verified by many other parties in a largely or entirely automated fashion.",https://licensebuttons.net/l/by-sa/4.0/88x31.png,3
10803,The Cyphernomicon,https://ia600208.us.archive.org/10/items/cyphernomicon/cyphernomicon.txt,anon.penet.fi (whose code is now a long-detached strain,,2
416,Why It’s Hard to “Get” Bitcoin,https://unchained.com/blog/blockchain-spectrum/,"using the network to measure consensus on some political decision is extremely important, because blockchains such as bitcoin have no leaders to officially represent them. this can be a weakness when decisiveness is needed but is an incredible strength when building robust political systems. there is no robert mugabe to encourage to abdicate power. there is no un security council with coveted seats for stakeholders. there is no julian assange whose (alleged) personal misdeeds can color the public perception of political goals.",,2
12243,The Cyphernomicon,https://ia600208.us.archive.org/10/items/cyphernomicon/cyphernomicon.txt,"ways of bypassing these laws, of making our own moral",,2
17502,The Cyphernomicon,https://ia600208.us.archive.org/10/items/cyphernomicon/cyphernomicon.txt,"- jefferson savings and loan association, texas",,2


In [7]:
# Exporting regex cleaned dataset
timestamp = pd.Timestamp.now().strftime('bitcoin_knowledge_regexed_%Y-%m-%d-%H%M.json')
folder = r'/Users/pretermodernist/coding-projects/bitcoin-knowledge-bot/datasets/knowledge_datasets'
filepath = os.path.join(folder, timestamp)
btc.to_json(filepath, orient="records", lines=True)

# Article Suggestion with KNN - Failed

In [65]:
cores = psutil.cpu_count()
cores_used = int(cores/3)
print(f"This System Contains {cores} CPU Cores")
print(f"Will be using {cores_used} Cores for Data Transformation & Model Training")

This System Contains 12 CPU Cores
Will be using 4 Cores for Data Transformation & Model Training


In [66]:
# Tuning our vectorizer model
vect = TfidfVectorizer(
                       stop_words='english',
                       ngram_range=(1, 2),       # Allowing for both unigrams & bigrams
                       max_features=10000       # Not allowing more than 10k features/dimensions in our model
                       )

In [67]:
btc_knn = pd.DataFrame(btc['body'])

## USER INPUT EXAMPLE ##

user_input = ["Who made Bitcoin?"]
# Inserting our user input into the dataframe for querying
btc_knn.loc[len(btc_knn.index)] = user_input
# Viewing our user input in our KNN dataframe
btc_knn.tail(1)

Unnamed: 0,body
32362,Who made Bitcoin?


In [68]:
dtm = vect.fit_transform(btc_knn['body'])
dtm = pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names_out())
dtm

Unnamed: 0,aantonop,aantonop bitcoin,ab,abandon,abandoned,aber,abiding,abilities,ability,able,...,zero thirty,zero thousand,zero years,zero zero,zeros,zimbabwe,zimmermann,zip,zu,zur
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32358,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
32359,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
32360,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
32361,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
%%time
knn = NearestNeighbors(
    n_neighbors = 10,
    algorithm = 'ball_tree',  # Using a ball tree to index the points for neighbor search
    n_jobs = cores_used
    ) 
 
knn.fit(dtm)  # Fitting our DTM to our KNN model

CPU times: user 34.3 s, sys: 315 ms, total: 34.6 s
Wall time: 34.8 s


NearestNeighbors(algorithm='kd_tree', n_jobs=4)

In [None]:
# There's our test user_input!
btc_knn.iloc[-1]

body    Who made Bitcoin?
Name: 32362, dtype: object

In [None]:
# selecting our user query (the last row of dtm) to use as our query point
doc_index = -1
# using integer location (iloc) on the document index in order to select it
doc = [dtm.iloc[doc_index].values]

# Query Using kneighbors 
neigh_dist, neigh_index = knn.kneighbors(doc)

In [None]:
# For Loop for querying articles based on our user input for the chatbot
for doc in neigh_index:
    recommendation = btc_knn.iloc[doc]

recommendation.head()

Unnamed: 0,body
22468,bitcoin:onegdknineuzphbzqzxtwoaninejfpthreedifourwebwqgmoqa?
32362,Who made Bitcoin?
30255,"bitcoin on vaarallinen paikka. loputon luettelo hakkeroinneista, huijareista ja varkauksista. bitcoin lupaa verkon, jossa luottamus on hajautettu. tiedtk miksi? koska muita bitcoinaajia on olemassa."
3708,bitcoinsettlesettlementslowspeedtpstransactions
111,"korean by hyungmok joh (part one, part two, part three, part four)."


In [143]:
def return_suggestion(recommendation, btc):
    knn_recommendation = btc[btc["body"].str.contains(recommendation.iloc[1].values[0])==True].copy()
    return knn_recommendation

suggestion = return_suggestion(recommendation, btc)
suggestion

Unnamed: 0,title,url,body,image,confidence_score
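
The empty result may be explained by the lookup itself: the pattern passed to `str.contains` is the query text ("Who made Bitcoin?"), which was only ever added to `btc_knn`, while the bodies in `btc` were lowercased. As a hedged alternative, the sketch below fits the vectorizer on the corpus only, transforms the query separately, and searches the sparse matrix with brute-force cosine distance instead of densifying it. The three-document corpus is invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy corpus standing in for btc['body']
corpus = [
    "satoshi nakamoto published the bitcoin whitepaper",
    "lightning network enables fast bitcoin payments",
    "gold has been a store of value for centuries",
]

toy_vect = TfidfVectorizer(stop_words="english")
doc_matrix = toy_vect.fit_transform(corpus)  # stays sparse, no .toarray()

# Brute-force search accepts sparse input and supports the cosine metric
toy_knn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
toy_knn.fit(doc_matrix)

# The query is transformed with the fitted vocabulary, not added to the corpus
query_vec = toy_vect.transform(["who published the bitcoin whitepaper"])
neigh_dist, neigh_index = toy_knn.kneighbors(query_vec)

for position in neigh_index[0]:
    print(corpus[position])
```

Because the query never enters the fitted corpus, it cannot come back as one of its own neighbours, and the returned positions index straight into the original rows.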


# Article Suggestion with Linear Kernel

In [24]:
# def copy_df(btc):
#     btc_backup = btc
#     return btc_backup

In [34]:
def user_input(question, btc):
    userinput = question
    # removing the previous row that included a visitor query for subsequent questions
    # (.copy() so the new row is added to an independent DataFrame rather than a slice of btc)
    btc = btc[btc['title'] != 'visitor_query'].copy()
    btc.loc[len(btc.index)] = ['visitor_query', None, userinput, None, None]
    return btc

btc2 = user_input("What's the best way to buy Bitcoin?", btc)
btc2.sample(3)

Unnamed: 0,title,url,body,image,type
29775,On the Origins of Bitcoin:\nStages of Monetary Evolution,https://nakamotoinstitute.org/on-the-origins-of-bitcoin/,technical experiment with no functioning monetary uses,https://nakamotoinstitute.org/static/img/docs/on-the-origins-of-bitcoin/bitcoin.jpg,article
18982,The Cyphernomicon,https://ia600208.us.archive.org/10/items/cyphernomicon/cyphernomicon.txt,feel certain more such pressures will be applied (after,,article
16319,The Cyphernomicon,https://ia600208.us.archive.org/10/items/cyphernomicon/cyphernomicon.txt,"- one day it's illegal to have contact with china, the next",,article


In [35]:
# Vectorizing the corpus with TF-IDF and computing pairwise cosine similarities

def preprocess(btc2):
    btcc = btc2
    content = btcc['body']
    vect = TfidfVectorizer(
                       stop_words='english',
                       strip_accents='unicode',
                       analyzer='word',
                       sublinear_tf=False,
                       norm='l2',
                       use_idf=True,
                       ngram_range=(1, 2),
                       max_features=10000       # Not allowing more than 10k features/dimensions in our model
                       )

    tfidf_matrix = vect.fit_transform(content)
    # Rows are l2-normalized, so the linear kernel (dot product) equals cosine similarity
    cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
    return vect, btcc, cosine_similarities

vect, btcc, cosine_similarities = preprocess(btc2)
cosine_similarities


array([[1.        , 0.29700865, 0.05522624, ..., 0.02754458, 0.        ,
        0.02889886],
       [0.29700865, 1.        , 0.02748335, ..., 0.0144801 , 0.        ,
        0.03038407],
       [0.05522624, 0.02748335, 1.        , ..., 0.05777229, 0.        ,
        0.03770646],
       ...,
       [0.02754458, 0.0144801 , 0.05777229, ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.02889886, 0.03038407, 0.03770646, ..., 0.        , 0.        ,
        1.        ]])
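
Only the row belonging to the visitor query is used later, yet the full matrix has one entry per pair of documents. For a corpus of roughly 32,000 rows that is on the order of 8 GB as dense float64. A minimal sketch of computing just the query row follows. The toy documents are invented, and the last one plays the role of the visitor query.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy corpus; the last entry stands in for the 'visitor_query' row
docs = [
    "the best way to buy bitcoin is from a trusted exchange",
    "gold is a store of value",
    "what is the best way to buy bitcoin",
]
toy_tfidf = TfidfVectorizer(stop_words="english", norm="l2").fit_transform(docs)

# Full N x N matrix, as in preprocess()
full = linear_kernel(toy_tfidf, toy_tfidf)

# Only the query row: a 1 x N matrix with the same values as full[-1]
query_row = linear_kernel(toy_tfidf[-1:], toy_tfidf)

print(np.allclose(full[-1], query_row[0]))
```

The slice `toy_tfidf[-1:]` keeps the query as a one-row sparse matrix, so the result has shape `(1, N)` and memory grows linearly with the corpus instead of quadratically.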

In [56]:
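# Checking whether key domain terms were dropped from the vocabulary
# (stop_words_ holds terms ignored because of max_df, min_df, or the max_features cutoff)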
tfidf_stop = list(vect.stop_words_)
term_lookup = ['bitcoin',
               'satoshi',
               'nakamoto',
               'philosophy',
               'blockchain',
               'segwit',
               'tx',
               'transaction',
               'lightning',
               'network',
               'samourai',
               'cypherpunks',
               'proof-of-work',
               'proof',
               'work',
               'decentralized',
               'deleuzian'
              ]

print([i for i in tfidf_stop if i in term_lookup])

for i in tfidf_stop:
    if i in term_lookup:
        print('tfidf may be penalizing the wrong words')

['tx', 'deleuzian']
tfidf may be penalizing the wrong words
tfidf may be penalizing the wrong words
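
A complementary check is to ask which lookup terms are absent from the fitted `vocabulary_`, which also catches terms that can never become a single token. With the default token pattern a hyphenated term such as `proof-of-work` is split at the hyphens, so it is never a feature in its own right. The sketch below uses an invented toy corpus and a deliberately tiny `max_features` to make the cutoff visible.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "bitcoin" and "network" are the two most frequent terms
toy_corpus = [
    "bitcoin bitcoin bitcoin network",
    "bitcoin network tx",
    "bitcoin proof-of-work",
]

# Keep only the 2 most frequent terms to mimic the max_features cutoff
cutoff_vect = TfidfVectorizer(max_features=2)
cutoff_vect.fit(toy_corpus)

lookup = ["bitcoin", "network", "tx", "proof-of-work"]

# Terms that did not make it into the model as features
missing = [term for term in lookup if term not in cutoff_vect.vocabulary_]
print(missing)
```

Here `tx` is missing because of the frequency cutoff, while `proof-of-work` is missing because of tokenization, two different causes that a membership test on `vocabulary_` surfaces together.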


In [31]:
def get_recommendations(df, column, value, cosine_similarities, limit=10):
    """Return a dataframe of content recommendations based on TF-IDF cosine similarity.
    
    Args:
        df (object): Pandas dataframe containing the text data. 
        column (string): Name of column used, i.e. 'title'. 
        value (string): Value in that column to get recommendations for, i.e. 'visitor_query'
        cosine_similarities (array): Cosine similarities matrix from linear_kernel
        limit (int, optional): Optional limit on number of recommendations to return. 
        
    Returns: 
        Pandas dataframe. 
    """
    
    # Map each value of the target column to its row index
    indices = pd.Series(df.index, index=df[column])

    # Get the index for the target value
    target_index = indices[value]

    # Get the cosine similarity scores for the target value
    cosine_similarity_scores = list(enumerate(cosine_similarities[target_index]))

    # Sort the cosine similarities in order of closest similarity
    cosine_similarity_scores = sorted(cosine_similarity_scores, key=lambda x: x[1], reverse=True)

    # Return tuple of the requested closest scores excluding the target item and index
    cosine_similarity_scores = cosine_similarity_scores[1:limit+1]

    # Extract the tuple values
    index = (x[0] for x in cosine_similarity_scores)
    scores = (x[1] for x in cosine_similarity_scores)    

    # Get the indices for the closest items
    recommendation_indices = [i[0] for i in cosine_similarity_scores]

    # Get the actual recommendations
    recommendations = df[column].iloc[recommendation_indices]

    # Return a dataframe
    df = pd.DataFrame(list(zip(index, recommendations, scores)), 
                      columns=['index','recommendation', 'cosine_similarity_score']) 

    return df
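
The slice `[1:limit+1]` assumes the target row is ranked first. Python's `sorted` is stable, so if an earlier row ties with the target at a similarity of 1.0 (an exact duplicate, for example), that row takes first place and the target itself stays in the results. The sketch below shows one possible way to exclude the target by index instead. `top_k_excluding` is a hypothetical helper, not part of the notebook, and the scores are made up.

```python
import numpy as np


def top_k_excluding(scores, target_index, k=10):
    """Return positions of the k highest scores, never including target_index."""
    scores = np.asarray(scores, dtype=float)
    # Stable sort on negated scores: descending order, ties kept in row order
    order = np.argsort(-scores, kind="stable")
    # Drop the target explicitly rather than assuming it is ranked first
    order = order[order != target_index]
    return order[:k]


# Row 3 is the target; row 1 is an exact duplicate with the same score
row_scores = [0.2, 1.0, 0.9, 1.0]
print(top_k_excluding(row_scores, target_index=3, k=2))
```

The returned positions can be passed to `df.iloc[...]` in the same way as `recommendation_indices`.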

In [36]:
btcc.reset_index(inplace=True)
recommendations = get_recommendations(btcc,
                                      'title',
                                      'visitor_query',
                                      cosine_similarities)

In [37]:
def clean_up_recommendations(recommendations, btcc):
    recommendations = recommendations.rename(columns = {'recommendation': 'title'})
    # Joining the recommendations back onto the corpus to recover the url, body and image of each match
    merged = recommendations.merge(btcc, on=["index", "title"], how="left", sort=False)
    return merged

article_suggestions = clean_up_recommendations(recommendations, btcc)

article_suggestions.head(10)

Unnamed: 0,index,title,cosine_similarity_score,url,body,image,type
0,6909,The Cyphernomicon,0.607428,https://ia600208.us.archive.org/10/items/cyphernomicon/cyphernomicon.txt,"- the best way to understand them is to ""just do it,"" that",,article
1,23113,Bitcoin QnA Common questions,0.578847,https://bitcoiner.guide/qna/common,where can i buy some? there are lots of online exchanges that you can buy bitcoin from although you should do your own due diligence when finding somewhere to buy from. the best way to get your first piece of bitcoin is to buy some from a trusted friend or family member. remember - you can buy a fraction of a bitcoin (there are one hundred million sats in each one)! see here for some recommendations.,,article
2,7343,The Cyphernomicon,0.487001,https://ia600208.us.archive.org/10/items/cyphernomicon/cyphernomicon.txt,- the best way to get a feel for the list is to simply read,,article
3,7308,The Cyphernomicon,0.483225,https://ia600208.us.archive.org/10/items/cyphernomicon/cyphernomicon.txt,"- reading the various newsgroups is, as always, the best way",,article
4,4946,The Rise of the Sovereign Individual,0.471171,https://medium.com/bull-bitcoin/the-rise-of-the-sovereign-individual-2201eee82f00,should you buy bitcoin as part of your retirement plan?,,article
5,15082,The Cyphernomicon,0.464043,https://ia600208.us.archive.org/10/items/cyphernomicon/cyphernomicon.txt,roadblocks in the way. it might be best to buy off the visa,,article
6,4607,Stone Ridge 2020 Shareholder Letter,0.441191,https://www.microstrategy.com/en/bitcoin/documents/stone-ridge-2020-shareholder-letter,the mystery of the missing amzn millionaires & is it too late to buy bitcoin?,,article
7,22796,Enders Game,0.363524,https://unchained-capital.com/blog/enders-game/,with a recommendation of how best to prote,,article
8,32101,How We Know Bitcoin Is Not a Bubble,0.330417,https://nakamotoinstitute.org/mempool/how-we-know-bitcoin-is-not-a-bubble/,"every time you buy bitcoin, a fairy gets its wings. now clap your hands, click your heels together three times, and believe in bitcoin! it will only take faith the size of a mustard seed.",https://nakamotoinstitute.org/static/img/mempool/how-we-know-bitcoin-is-not-a-bubble/MetcalfeGraph.png,article
9,28950,The Geodesic Market,0.330115,https://nakamotoinstitute.org/the-geodesic-market/,"but dont despair. remember that if digital bearer transactions really do something im betting my company on, sooner or later an equity index-based internet currency will in fact emerge as the best way to buy things.",https://nakamotoinstitute.org/static/img/cc-4-0-by-sa.png,article
