Where is Satoshi?

TL;DR: you can most probably find the real Satoshi Nakamoto somewhere in this data.

This repository contains:

  • 👉 500,000+ mailing list posts (10+ lists, 75,000+ authors, 1992-2020)
  • 😵 7,500,000+ Reddit comments (/r/Bitcoin, 70,000+ authors, 2009-2019)
  • 📖 All texts (posts, mails, code comments, paper) written by Satoshi
  • 📈 Stylometric analysis of words, punctuation, slang, sentences, n-grams, word lengths, & spelling checks
  • 📚 John Burrows' Delta, Jaccard Similarity, Flesch, Gunning Fog, Dale-Chall, Coleman-Liau, SMOG
  • 📊 Comparison of Satoshi's text with all other chunks of data

Download: XLS aggregates (40MB) | CSV raw (240MB)

Satoshi facts & figures

Sources: Bitcoin paper, mail communication, mailing list posts (excluding the 2 posts in 2015), forum posts, code comments, website texts

  • Words: 81,500 words, 6,000 unique words, 4,750 unique sentences
  • Characters: 460,000 w/ spaces, 385,000 w/o spaces
  • British English: realise, cheque, customisations, favour, adminning, funnelling, decentralised, colour, labelled, formalised, reorganised, sceptical, fulfil, honour, neighbour, labour, dependants, analysed, labelling, liberalising, optimised, optimisation, colours, greyed, modernised, synchronising, defence, amortised, ...GB-US spellcheck
  • Hyphenated words: re-broadcasting, multiple-invocation, un-upgraded, double-spends, pre-compiled, sub-languages, anti-embedding, non-routable, time-sharing, reverse-spamming, self-defeating, market-determined, re-requested, non-reversible, de-emphasize, pre-announce, ...and more
  • 2+ hyphens: no-priority-requirement, man-in-the-middle(d), one-person-one-vote, one-CPU-one-vote, zero-knowledge-proofs, created-but-never-used, tit-for-tat, back-of-the-envelope, stable-with-respect-to-energy, non-lower-ASCII, delete-at-will, delete-and-lose-everything-in-your-other-files, almost-release-candidate, one-time-use, free-for-all, inventory-request-data, ...and more
  • Dot decimal separator: the national standard in the US, UK, Japan, China, Switzerland, Australia, ..
  • Slang: yuck, darn, nope, heck, pay-naggy, laggy, gotta, dumbed down
  • Few typos: incomming, resurect, transfered, walkthough
  • Double-spacing: 2,000+ occurrences of double-spacing after a sentence, versus only 150 single. Consistent in all communication from day 1 (...my personal favorite; see the sketch after this list)
  • Capitalized abbreviations: CPU/cpu (63/2), P2P/p2p (19/3), JSON/json (55/18), SVN/svn (65/0), HTTP/http (19/5)
  • Punctuation: !(29) ?(145) ,(2,800) .(2,400) :(96) ;(22)
  • Pronouns: I(950), you(1040), your(270), we(370), us(20), me(50)
  • Verb suffixes (of 6,000 unique words): -ing(490), -ed(540), -ly(220) ...view all words
  • -> indicator: consistently without spaces: "Options->Change", "Options->Generate Coins", ..
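
The double-spacing habit is easy to check for yourself. A minimal sketch (my own illustration; the exact heuristic used in this repository may differ):

```python
import re

def spacing_counts(text):
    """Count single vs. double spaces after sentence-ending punctuation."""
    # Double: punctuation, exactly two spaces, then a capital letter.
    double = len(re.findall(r"[.!?] {2}(?=[A-Z])", text))
    # Single: punctuation, exactly one space, then a capital letter.
    single = len(re.findall(r"[.!?] (?=[A-Z])", text))
    return {"single": single, "double": double}

print(spacing_counts("It works.  Broadcast it.  Thanks. Bye."))
# {'single': 1, 'double': 2}
```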

Satoshi in numbers

| metric | all-text.txt | code.txt | forum.txt | mail-dustin.txt | mail-hal.txt | mail-sirius.txt | paper.txt | posts-mailinglist.txt | website.txt |
|---|---|---|---|---|---|---|---|---|---|
| SUM words | 80238 | 2049 | 47950 | 1726 | 1694 | 20866 | 2994 | 4265 | 516 |
| SUM characters | 453873 | 12421 | 269264 | 9608 | 9483 | 117964 | 17899 | 24934 | 3026 |
| AVG sentence_word_count | 18.78 | 56.64 | 17.96 | 20.36 | 18.7 | 18.91 | 25.37 | 19.75 | 14.97 |
| AVG sentence_word_median | 16.2 | 51.1 | 16.1 | 18.0 | 14.5 | 14.98 | 21.08 | 15.94 | 13.75 |
| AVG satoshi_ngram1_matches | 217.97 | 189.6 | 210.51 | 180.5 | 173.5 | 222.76 | 198.83 | 209.33 | 104.0 |
| AVG satoshi_ngram2_matches | 452.78 | 329.2 | 459.33 | 386.0 | 379.25 | 442.21 | 438.33 | 419.78 | 227.0 |
| AVG satoshi_ngram3_matches | 437.3 | 280.8 | 457.32 | 357.5 | 350.0 | 405.93 | 429.5 | 396.78 | 235.0 |
| AVG satoshi_john_burrow_delta | 0.14481 | 0.14495 | 0.14481 | 0.14492 | 0.14495 | 0.14481 | 0.14479 | 0.14483 | 0.14523 |
| AVG satoshi_jaccard_similarity | 0.03709 | 0.03226 | 0.03582 | 0.03071 | 0.02952 | 0.0379 | 0.03383 | 0.03562 | 0.0177 |
| SUM satoshi_words_slang | 10.0 | - | 3.0 | 1.0 | - | 5.0 | - | 1.0 | - |
| SUM satoshi_words_typo | 4.0 | 2.0 | 2.0 | - | - | - | - | - | - |
| SUM satoshi_words_british | 41.0 | - | 24.0 | 2.0 | - | 11.0 | 1.0 | 2.0 | 1.0 |
| SUM satoshi_words_hyphen_1 | 154 | 3 | 88 | 4 | 3 | 41 | 5 | 9 | 1 |
| SUM satoshi_words_hyphen_2 | 75 | 2 | 30 | 3 | 1 | 15 | 6 | 6 | 1 |
| SUM satoshi_words_less_frequent | 160 | 4 | 95 | 3 | 4 | 42 | 6 | 8 | 1 |
| AVG words_unique | 284.56 | 234.2 | 276.61 | 244.0 | 235.5 | 290.1 | 256.83 | 269.11 | 134.0 |
| AVG words_nostop_count | 268.03 | 264.8 | 267.93 | 225.5 | 229.75 | 263.88 | 268.67 | 258.0 | 150.5 |
| AVG words_nostop_unique | 217.97 | 189.6 | 210.51 | 180.5 | 173.5 | 222.76 | 198.83 | 209.33 | 104.0 |
| AVG words_with_hyphen_1 | 3.89 | 1.0 | 3.9 | 3.25 | 1.25 | 4.24 | 2.67 | 5.11 | 1.0 |
| AVG words_with_hyphen_2 | 0.59 | 0.4 | 0.39 | 1.0 | 0.75 | 0.4 | 1.67 | 1.11 | 1.0 |
| AVG words_fully_capitalized | 13.92 | 6.6 | 14.41 | 10.5 | 17.0 | 14.95 | 3.83 | 8.44 | 4.5 |
| AVG words_verb_ing | 12.57 | 6.6 | 11.83 | 10.75 | 10.75 | 13.64 | 15.5 | 13.22 | 6.5 |
| AVG words_verb_ed | 12.49 | 9.6 | 11.79 | 9.5 | 15.25 | 12.55 | 16.5 | 11.44 | 5.0 |
| AVG words_verb_ly | 5.98 | 3.0 | 5.69 | 6.0 | 8.0 | 6.64 | 6.5 | 5.89 | 3.5 |
| AVG punctuation_single_space | 13.55 | 5.0 | 14.42 | 10.5 | 10.25 | 12.19 | 7.5 | 12.33 | 11.5 |
| AVG punctuation_double_space | 15.58 | 2.4 | 16.02 | 12.5 | 15.0 | 16.74 | 14.5 | 13.11 | 0.0 |
| AVG punctuation_space_ratio | 1.23 | 0.93 | 1.17 | 1.26 | 1.42 | 1.53 | 2.01 | 1.12 | 0.0 |
| AVG punctuation_commas | 17.59 | 11.0 | 17.67 | 13.75 | 14.5 | 17.55 | 20.17 | 18.56 | 7.0 |
| AVG punctuation_dots | 28.09 | 7.0 | 29.56 | 22.75 | 25.5 | 27.43 | 22.0 | 26.0 | 13.5 |
| AVG punctuation_exclamation | 0.29 | 0.0 | 0.35 | 0.25 | 0.25 | 0.21 | 0.0 | 0.44 | 0.0 |
| AVG punctuation_question | 1.89 | 0.4 | 2.3 | 0.5 | 0.75 | 1.81 | 0.0 | 0.0 | 0.0 |
| AVG punctuation_colons | 2.3 | 1.6 | 2.64 | 0.5 | 1.75 | 2.02 | 1.0 | 1.33 | 3.0 |
| AVG punctuation_semicolons | 0.16 | 2.4 | 0.06 | 0.0 | 0.0 | 0.12 | 0.17 | 0.22 | 0.0 |
| AVG readability_flesch_reading_ease | 72.77 | 41.78 | 74.64 | 70.33 | 74.95 | 71.98 | 55.28 | 64.59 | 67.15 |
| AVG readability_smog_index | 9.81 | 14.7 | 9.42 | 10.25 | 9.35 | 9.91 | 13.35 | 11.14 | 5.8 |
| AVG readability_flesch_kincaid_grade | 6.82 | 17.6 | 6.28 | 7.88 | 6.1 | 7.14 | 11.25 | 8.7 | 7.0 |
| AVG readability_automated_readability_index | 8.29 | 22.9 | 7.6 | 9.05 | 7.53 | 8.6 | 13.37 | 10.39 | 7.95 |
| AVG readability_dale_chall_readability_score | 8.68 | 10.29 | 8.45 | 8.15 | 8.24 | 8.75 | 8.98 | 8.95 | 9.22 |
| AVG readability_linsear_write_formula | 8.18 | 23.35 | 7.3 | 8.62 | 5.98 | 8.83 | 11.22 | 8.48 | 6.13 |
| AVG readability_gunning_fog | 8.79 | 19.38 | 8.15 | 9.84 | 7.95 | 9.09 | 13.22 | 10.51 | 9.26 |
| AVG readability_coleman_liau_index | 8.48 | 11.91 | 8.08 | 8.13 | 8.18 | 8.49 | 10.92 | 9.75 | 9.06 |
| AVG sentence_count | 30.56 | 7.8 | 32.16 | 24.0 | 27.5 | 30.43 | 22.0 | 27.33 | 14.5 |
| AVG sentiment_subjectivity | 0.47 | 0.36 | 0.47 | 0.47 | 0.48 | 0.48 | 0.45 | 0.44 | 0.26 |
| AVG sentiment_polarity | 0.1 | 0.1 | 0.1 | 0.07 | 0.02 | 0.13 | 0.09 | 0.08 | 0.03 |
| AVG personal_pronouns_i | 7.71 | 0.2 | 7.03 | 5.5 | 13.25 | 10.76 | 0.0 | 3.67 | 1.5 |
| AVG personal_pronouns_you | 7.43 | 0.0 | 8.53 | 8.75 | 6.75 | 6.86 | 0.0 | 3.22 | 3.0 |
| AVG personal_pronouns_he | 0.32 | 0.0 | 0.18 | 0.0 | 0.0 | 0.45 | 2.67 | 0.11 | 0.0 |
| AVG personal_pronouns_she | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AVG personal_pronouns_it | 15.09 | 6.6 | 16.14 | 15.0 | 13.75 | 14.81 | 7.0 | 12.11 | 5.5 |
| AVG personal_pronouns_we | 2.75 | 2.6 | 2.75 | 1.0 | 0.0 | 3.21 | 3.33 | 1.11 | 0.0 |
| AVG personal_pronouns_they | 2.05 | 1.0 | 1.76 | 2.75 | 0.75 | 2.5 | 2.17 | 3.56 | 2.5 |
| AVG personal_pronouns_me | 0.43 | 0.2 | 0.34 | 0.75 | 0.75 | 0.64 | 0.0 | 0.11 | 0.0 |
| AVG personal_pronouns_him | 0.1 | 0.0 | 0.06 | 0.0 | 0.0 | 0.17 | 0.5 | 0.0 | 0.0 |
| AVG personal_pronouns_her | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AVG personal_pronouns_us | 0.02 | 0.0 | 0.04 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| AVG personal_pronouns_myself | 0.02 | 0.0 | 0.01 | 0.0 | 0.0 | 0.07 | 0.0 | 0.0 | 0.0 |
| AVG personal_pronouns_them | 0.93 | 0.2 | 0.9 | 1.0 | 1.25 | 1.12 | 1.33 | 0.67 | 0.0 |
| AVG personal_pronouns_yourself | 0.07 | 0.0 | 0.09 | 0.0 | 0.0 | 0.07 | 0.0 | 0.0 | 0.0 |
| AVG word_length_frequency_1 | 5.32 | 3.36 | 5.66 | 4.6 | 6.09 | 4.78 | 4.34 | 4.56 | 3.41 |
| AVG word_length_frequency_2 | 18.01 | 16.57 | 18.08 | 18.18 | 16.4 | 18.13 | 18.57 | 17.62 | 17.54 |
| AVG word_length_frequency_3 | 19.08 | 16.33 | 19.27 | 18.3 | 20.22 | 19.26 | 17.5 | 18.84 | 17.65 |
| AVG word_length_frequency_4 | 17.96 | 18.57 | 18.06 | 20.14 | 18.48 | 18.25 | 14.3 | 16.31 | 16.64 |
| AVG word_length_frequency_5 | 11.37 | 13.53 | 11.41 | 10.75 | 12.05 | 10.99 | 11.9 | 11.12 | 14.74 |
| AVG word_length_frequency_6 | 7.99 | 9.68 | 7.82 | 8.18 | 7.75 | 8.31 | 7.58 | 8.26 | 11.63 |
| AVG word_length_frequency_7 | 8.13 | 8.96 | 8.0 | 9.92 | 7.45 | 8.23 | 8.32 | 9.1 | 4.61 |
| AVG word_length_frequency_8 | 4.6 | 3.92 | 4.6 | 3.22 | 4.89 | 4.55 | 6.21 | 4.46 | 2.21 |
| AVG word_length_frequency_9 | 2.91 | 2.88 | 2.85 | 3.18 | 3.1 | 2.88 | 3.31 | 2.97 | 4.45 |
| AVG word_length_frequency_10 | 1.59 | 1.16 | 1.46 | 1.17 | 1.08 | 1.75 | 2.17 | 2.53 | 3.94 |
| AVG word_length_frequency_11 | 1.42 | 2.84 | 1.3 | 0.9 | 0.6 | 1.41 | 2.67 | 1.98 | 1.31 |
| AVG word_length_frequency_12 | 0.84 | 1.2 | 0.77 | 0.55 | 1.06 | 0.76 | 1.77 | 0.95 | 0.8 |
| AVG word_length_frequency_13 | 0.41 | 0.48 | 0.37 | 0.61 | 0.51 | 0.35 | 0.93 | 0.71 | 0.8 |
| AVG word_length_frequency_14 | 0.17 | 0.2 | 0.18 | 0.15 | 0.1 | 0.15 | 0.17 | 0.31 | 0.1 |
| AVG word_length_frequency_15 | 0.2 | 0.32 | 0.18 | 0.16 | 0.25 | 0.21 | 0.27 | 0.29 | 0.2 |

Comparison in graphs

Figure: texts of Satoshi compared to all(!) authors.

Figure: how word lengths are distributed across all texts. Outliers: 4, 5, 10.

Figure: readability scores indicate that Satoshi's texts are easy to read (lower = easier, except for Flesch reading ease, where higher = easier).
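
These readability scores come from standard formulas; a minimal sketch using the Python textstat package (assuming textstat, or something equivalent, was used — that tooling choice is my assumption):

```python
import textstat  # pip install textstat

def readability_profile(text: str) -> dict:
    """Compute the readability scores shown in the table above."""
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "smog_index": textstat.smog_index(text),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        "automated_readability_index": textstat.automated_readability_index(text),
        "dale_chall_readability_score": textstat.dale_chall_readability_score(text),
        "linsear_write_formula": textstat.linsear_write_formula(text),
        "gunning_fog": textstat.gunning_fog(text),
        "coleman_liau_index": textstat.coleman_liau_index(text),
    }
```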


Tips & Tricks:

  • N-gram analysis: Satoshi's most-used n-grams were not very common before 2010 (Bitcoin, transaction, time stamping server, blockchain, etc.). After 2010 they were used a lot, which makes n-gram/text comparison harder for authors who were only active in the pre-Bitcoin phase. Compare fairly!
  • Last chunk: filter out the last chunk if you are going to aggregate totals. The last chunk is almost never a full 500 tokens and can heavily skew your results, especially for authors with only a few chunks (see the sketch after this list).
  • Outliers: use the median/stddev instead of the mean when aggregating. Technical posts often contain code snippets, which causes peaks and skews the numbers.
  • Mean/averages: ...for lower numbers without much variation, use means.
  • ML prediction: exclude the (real) Satoshi posts from your training set?!
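
Putting the last-chunk and median tips together, a minimal pandas sketch against the raw CSV (column names are taken from the field description below; the 1-indexed chunk numbering and the filename are my assumptions):

```python
import pandas as pd

df = pd.read_csv("raw.csv")  # hypothetical name for the 240MB raw CSV export

# Drop each author's last chunk: it is almost never a full 500 tokens
# (assumes `chunk` is 1-indexed, so the last chunk has chunk == chunks).
full_chunks = df[df["chunk"] < df["chunks"]]

# Aggregate per author with the median, which is robust to chunks that
# happen to contain code snippets or other outliers.
per_author = (full_chunks
              .groupby(["source", "filename"])
              .median(numeric_only=True)
              .sort_values("satoshi_jaccard_similarity", ascending=False))

print(per_author.head(20))  # the 20 closest authors by Jaccard similarity
```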

Dataset

  • Public archives: all data is publicly available and was downloaded via public archives. Only the https://marc.info/ archive was scraped (...with respectful delays, sorry Hank!)
  • Uniqueness: some authors were active under multiple names and/or mail addresses, and/or used remailer services (to post anonymously)
  • Reply extraction: good reply extraction in mail threads is hard! There is no guarantee that a text fully belongs to the author (different ways of quotation, indentation, and replying)
  • Duplicate lines: some sources contain cross-posts across multiple lists and duplicate signatures. Duplicate lines are removed before text chunking.
  • Chunks: data was chunked per 500 tokens, without taking sentence boundaries into account (sketched below). Token regex: \b[a-zA-Z0-9-']+\b
  • Minimum chunks: files with fewer than 500 tokens were not analyzed.
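
A minimal sketch of this chunking step (my reconstruction from the description above, not the repository's actual code):

```python
import re

# Token regex from the description above
TOKEN_RE = re.compile(r"\b[a-zA-Z0-9-']+\b")

def chunks_for_analysis(text, size=500):
    """Split a text into 500-token chunks, ignoring sentence boundaries."""
    tokens = TOKEN_RE.findall(text)
    if len(tokens) < size:  # files under 500 tokens were not analyzed
        return []
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]
```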

Reddit

All Reddit comments are publicly available via Google BigQuery (2009-2019).
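
A hedged sketch of pulling those comments (the fh-bigquery.reddit_comments dataset with monthly tables was the commonly used public mirror at the time; the dataset and table names are assumptions and may have changed):

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # needs a GCP project with BigQuery enabled

sql = """
SELECT author, body, created_utc
FROM `fh-bigquery.reddit_comments.2019_08`  -- one monthly table of the mirror
WHERE subreddit = 'Bitcoin'
"""

comments = client.query(sql).to_dataframe()
print(len(comments), "comments from /r/Bitcoin in August 2019")
```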

Mailing list

Extracted texts from mailing lists where Satoshi was, or could have been, active.

  • bitcoin-dev (2011-2020)
  • cryptography (2001-2020)
  • cypherpunks-cpunks1 (1992-2013)
  • cypherpunks-cpunks2 (2013-2020)
  • cypherpunks-venona (1992-1998)
  • cypherpunks-marc (1992-2020) (includes replies to non-cypherpunks lists)
  • gnupg-users (1999-2020)
  • p2p-research (2007-2011)
  • tor-talk (2004-2020)
  • testlist-cpunks (1992-2019)
  • winpt-users (2005-2009)

Missing archives: [cypherpunks-moderated], [openssl-users] (<2014), [cryptopp-users], [tor-talks], [e-gold-list], [p2p-hackers], [gsc] (gold-silver-crypto)

Field description

| Field | Description |
|---|---|
| source | The origin of the text, such as a mailing list or Reddit. |
| filename | Identifier for the author, such as a username or email address. |
| chunks | The total number of text chunks analyzed for this author. |
| chunk | The specific chunk of text being analyzed. |
| words | The total number of words or tokens analyzed in the chunk. |
| characters | The total number of characters in the chunk. |
| satoshi_ngram1_matches | Count of single-word matches with known Satoshi writings. |
| satoshi_ngram2_matches | Count of two-word-sequence (bigram) matches with known Satoshi writings. |
| satoshi_ngram3_matches | Count of three-word-sequence (trigram) matches with known Satoshi writings. |
| satoshi_john_burrow_delta | A stylometric measure comparing the text to Satoshi's known writings using John Burrows' Delta method. Lower = closer match (sketched after this table). |
| satoshi_jaccard_similarity | The Jaccard similarity index comparing the text to Satoshi's known writings. Higher = closer match (sketched after this table). |
| satoshi_words_british | British English words matching Satoshi's writing. |
| satoshi_words_slang | Slang words that Satoshi has used. |
| satoshi_words_typo | Typos found in Satoshi's known writings. |
| satoshi_words_hyphen_1 | Single-hyphenated words matching Satoshi's writing style. |
| satoshi_words_hyphen_2 | Double(+)-hyphenated words matching Satoshi's writing style. |
| satoshi_words_less_frequent | Less frequently used words in Satoshi's writings: fewer than 100 occurrences in total across all authors. |
| words_unique | The number of unique words in the chunk. |
| words_nostop_count | The number of words in the chunk, excluding common stopwords. |
| words_nostop_unique | The number of unique words in the chunk, excluding common stopwords. |
| words_with_hyphen_1 | Count of single-hyphenated words found in the chunk. |
| words_with_hyphen_2 | Count of words with two or more hyphens found in the chunk. |
| words_fully_capitalized | Count of words that are fully capitalized. |
| punctuation_single_space | Count of occurrences of a single space after punctuation. |
| punctuation_double_space | Count of occurrences of a double space after punctuation. |
| punctuation_space_ratio | Ratio of double spacing to single spacing. |
| punctuation_commas | Count of commas in the chunk. |
| punctuation_dots | Count of dots in the chunk. |
| punctuation_exclamation | Count of exclamation marks in the chunk. |
| punctuation_question | Count of question marks in the chunk. |
| punctuation_colons | Count of colons in the chunk. |
| punctuation_semicolons | Count of semicolons in the chunk. |
| readability_flesch_reading_ease | A readability score based on the Flesch reading-ease test. |
| readability_smog_index | A readability score based on the SMOG index, indicating years of education needed to understand the text. |
| readability_flesch_kincaid_grade | A readability grade level based on the Flesch-Kincaid grade-level test. |
| readability_coleman_liau_index | A readability score based on the Coleman-Liau index, indicating the US grade level needed to understand the text. |
| readability_automated_readability_index | A readability score indicating the US grade level needed to understand the text, based on characters per word and words per sentence. |
| readability_dale_chall_readability_score | A readability score based on the Dale-Chall formula, indicating the US grade level needed to understand the text using a list of familiar words. |
| readability_linsear_write_formula | A readability score based on the Linsear Write formula, calculating text difficulty from sentence length and easy/hard word counts. |
| readability_gunning_fog | A readability score based on the Gunning fog index, indicating years of education needed to understand the text. |
| sentence_count | The number of sentences in the chunk. |
| sentence_word_count | The average word count per sentence. |
| sentence_word_median | The median word count per sentence. |
| sentiment_subjectivity | A measure of subjectivity in the text; higher values indicate more subjective text. |
| sentiment_polarity | A measure of the overall sentiment in the text, ranging from negative to positive. |
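
For the similarity fields above, a minimal sketch of the underlying math (my own simplified implementations; the repository's exact token normalization and word lists may differ):

```python
from collections import Counter
import statistics

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_matches(chunk_tokens, satoshi_tokens, n):
    # satoshi_ngramN_matches: unique n-grams shared with Satoshi's texts
    return len(set(ngrams(chunk_tokens, n)) & set(ngrams(satoshi_tokens, n)))

def jaccard_similarity(a_tokens, b_tokens):
    # satoshi_jaccard_similarity: |A intersect B| / |A union B| over unique words
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b)

def burrows_delta(a_tokens, b_tokens, corpus_stats, top_words):
    # satoshi_john_burrow_delta: mean absolute z-score difference over the
    # corpus's most frequent words; corpus_stats maps word -> (mean, stddev)
    # of its relative frequency across all texts.
    def rel_freqs(tokens):
        counts = Counter(tokens)
        return {w: counts[w] / len(tokens) for w in top_words}
    fa, fb = rel_freqs(a_tokens), rel_freqs(b_tokens)
    diffs = []
    for w in top_words:
        mean, std = corpus_stats[w]
        if std > 0:
            diffs.append(abs((fa[w] - mean) / std - (fb[w] - mean) / std))
    return statistics.mean(diffs)
```

Lower Delta and higher Jaccard both indicate a closer match, consistent with the interpretation notes in the table above.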

Background

Weeks before the Bitcoin launch, Satoshi Nakamoto was in contact with famous cypherpunks and people on the cryptography mailing list. Most probably he lurked and/or was an active user on these cryptography/cypherpunks lists. Also, Reddit (/r/Bitcoin) had a lot of in-depth Bitcoin discussions between cypherpunks in the early days.

Assuming Satoshi was already active on the internet (...under his real name), you could find the needle in the haystack in these data sources.

All fine, but who is the real Satoshi!?!!!1!

Yes, I have a short-list of suspects. No, I'm not going to drop names here because I'm not 100% sure. Proven statistical text-based author-comparison techniques were used, but that isn't enough to conclusively identify the real Satoshi; a high correlation only shows a similar language pattern --> high correlation does not imply causation.

Please do your own research, interpret everything with care and don't trust the outliers. Cheers!

Disclaimer

The plan was to brush up on my Python, since it had been a while. Well, that escalated a bit: Python -> large datasets -> Reddit comments on BigQuery -> Satoshi's texts -> stylometric analysis -> text comparison...

"The chase is better than the catch" - There is no personal interest in Satoshi's real identity. i sold my stake in 2012, way before the hype started. Unfortunately!
