Write Free Science Books to Get Famous Website
Mission: live in a world where you can learn mathematics, physics, chemistry, biology and engineering for free whenever they want from perfect open source books made for free by people who want to get famous to get better paying jobs.
- 1. Desired social impact
- 2. Key algorithms
- 3. Further features
- 4. Secondary algorithms
- 5. Prototypes
- 6. datasets
- 7. Websites we want to replace
- 8. Business model
- 9. TODO
Crush the current grossly inefficient educational system, replace today’s students + teachers + researchers with unified "online content creators / consumers".
Gamify them, and pay the best creators so they can work it full time, until some company hires for more them since they are so provenly good.
Destroy useless exams, the only metrics of society are either:
how much money you make
how high is your educational content creator reputation score
Reduce the entry barrier to education, like Uber has done for taxis.
The key innovation of the website is to use the following algorithms to rank users and posts, while avoiding concept of "elected human moderators" at all costs.
This is the central and most important algorithm of the website.
users can upvote or downvote articles
any user can create any tag or upvote and downvote tags on any article, even if the user is not the author. Upvoting a tag on an article means that "I think that this tag describes this article well"
A tag could be something like: `C` to indicate that "this article is about the C language".
From these inputs, we want to answer, using algorithms, the following questions:
what is the best content for a given tag. This allows learners to find more interesting content.
which user knows the most about a given tag, or in other words, has the most reputation for a given tag. This motivates users to contribute to become famous and get better jobs.
This is the central algorithmic innovation that we want to implement.
If an user has high reputation for a tag, say
if the user upvotes a post tagged with
C++, the upvote has more weight than if the user upvotes a post tagged with
Java. By "has more weight" we mean:
the post gets a better ranking in the tag
the author of the post gets more reputation in the tag
if the user upvotes or downvotes a
C++tag on any post, that tag vote has more weight than if the user upvotes or downvotes a
Javatag for a post
Just like for PageRank, this leads to circular chains of influence, e.g.:
user A upvotes a post for user B
user B upvotes a post for user C
user C upvotes a post for user A
And then a way to solve this problem is to model it to an Eigenvalue problem.
We do not know exactly what the algorithm, but we believe that the PageRank analogy is valid. The algorithm could look something like this.
If we forget tags to simplify, we could do a bipartite authors / posts graph:
each post and user is node in one side of the bipartite graph
if userN upvotes postN, add a link from userN to postN
link postN to it’s author userN
To consider tags without weight, in addition:
each user is represented by one node per tag userN-tagM
if userN upvotes postN, add a link from userN-tagM to postN if postN is tagged with tagM
link from postN to each userN-tagM where userN is the autor and tagM a tag of the post
We do not know exactly what the algorithm, but we believe that the PageRank analogy is valid.
On Stack Overflow for example:
the post with most upvotes goes gets the highest ranking, no matter how old it is or when the votes were made.
This means that very old posts, which gained a lot of upvotes, almost never leave the top, even if newer better posts come along.
if two users with the same reputation
We must include in our post score and user reputation a time factor, so that recent votes count more than old votes.
It would be even more awesome to have a parameter that controls how much time matters, and then this would allow us to cover a wide variety of post types:
what we call "news" are simply posts where time matters a lot
what we call "theoretical reference books" are just posts where time matters less
The Reddit ranking algorithm does this reasonably well: https://medium.com/hacking-and-gonzo/how-reddit-ranking-algorithms-work-ef111e33d0d9
Even better, would be to consider how many times users view EACH post in a single page, with some JS black magic. With that, we can just use the Wilso score interval https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval as mentioned at: https://www.evanmiller.org/how-not-to-sort-by-average-rating.html
Non SO literature:
How to mark tags
Java as being duplicates without moderators?
Possible solution: everyone can mark tags as duplicate.
Why people would waste time doing that? Because once you mark tags as duplicate, if you search for one, you will see both, so you can waste less time searching.
Then we need some algorithms that fuzzily joins all subjects that many people said are the same.
This is one of Quora’s focus: https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
The website will have GitHub-like pull requests to content.
No one can ever edit your posts unless you explicitly allow them.
This prevents edit wars which can only be resolved with moderation.
But you can make your own copy (fork) according to the required website content license (CC-BY-SA), and a make precise a suggestion, which can be merged with a single click (aka GitHub pull requests).
But then What happens if:
the writer of an answer dies, and someone makes a great pull request to his answer with 1M upvotes?
50% of users agree with a pull request, 50% don’t?
next to each answer, have a list of forks
everyone can mark an answer as the "best version" or just upvote the pull requests
The following less-algorithmic features must also be present.
It must be possible for users to create trees of posts.
When a teacher wants to create a course for example, he can just link to existing material to the course material tree.
And only if something is missing, then he may write it.
Pull requests can be made for additions to the post tree, just as for regular content.
The best way to do such tree, would be something along:
WYSIWYG text editor
user can mark some text as a heading
whenever a heading a heading is created:
the user specifies the heading level
a database entry is created, that contains the text of the child, child entries and metadata: upvotes and tags
Has out of box:
https://github.com/quilljs/quill/issues/1681 How to create unique id on each block?
It would be awesome if all tags mapped to posts.
This way, a posts would serve as the description of a tag.
For example, the tag
mathematics should map to a set of posts
mathematics, which explains what Mathematics is, and contains a tree of children nodes which are sub-subjects, e.g.
Furthermore, when an user puts the
algebra post as a child of
mathematics, this is equivalent to saying "tag my
Algebra article with the
Comments and pull requests are analogous, and stored separately from regular nodes as
Comments and pull requests are more like "meta posts, with optional titles".
Comments are like GitHub issues, which are very similar to pull requests.
Comments are tied specifically to a given post.
E.g., if user 1 and user 2 make their own page entitled
Algebra page of both users could often be a child of the
Mathematics page of either user.
Comments on the other hand, are tied to a single
Mathematics page of a single user.
Forks however should inherit all comments and pull requests.
users can never delete their content. This way, links will never break.
the amount of data (characters in posts, number of tags, etc.) each user can create may be capped to limit server usage. Make this limit proportional to user reputation
These are further algorithms that would also be worth investigating, but which are not the most critical ones in our opinion.
This would counter voting fraud, e.g. of close groups of friends which upvote each other a lot.
Malicious users, or innocent users from close-knit research communities, might end up voting each other a lot.
We would like to have an algorithm such that every time you upvote the same given person, it has less positive impact on his reputation for that tag than the previous upvote.
How to determine if something is "original research" or not?
E.g.: a genius discovers something and publishes it really badly explained.
Someone less intelligent comes, explains it better, and gets widely read.
Or someone who just posts a bunch of links to good sources.
It would be cool for a user to say: I trust this other user on given tags / all tags.
Maybe this is required. E.g., given a real network, a bot network could make an exact copy of it, and that should have the same reputation as the real one.
Such relations make per-user score of other users / posts even more important.
Rate how much one user likes other users based on his actions.
E.g.: someone who only upvotes C questions will give score 0 for someone with only Java questions.
We could be able to deduce that
dog, is a lot of articles tagged as
Very early stage:
A hard part in testing the algorithms is that it is difficult to obtain data in the first place.
Besides the possibility of bootstrapping data ourselves by Consulting, these are some existing datasets that could be used:
Likely largest database of DOI metadata. They also issue DOIs.
Data comes from multiple journals, and each one has a different metadata set. Some don’t even have cross references, most have authors by name only instead of ORCID.
You have to belong to a journal to be listed there at all.
They host the metadata only.
Smaller than Crossref since only for bio related stuff, but despite that does not even seem to be much more uniform anyways…
Download data from: https://www.nlm.nih.gov/databases/download/pubmed_medline.html
TODO how are references encoded? Example.
Most authors don’t have ORCID, just string name. ORCIDs are in an optional field.
Most journals don’t have keywords, but at least those that do have keywords nicely split in the XML.
On the other hand, has a bunch of more bio specific fields such as which chemicals the paper mentions… lol, they can’t standardize the most important data, but they can add stuff like this.
Some laugh at our ambition. So do we sometimes.
WordPress, Medium, Facebook, Twitter, Blogger, etc., etc.:
no tag convergence across blogs. Each blog is a moderated castle. So who is the best user for a given tag, or the best content for a given tag, across the entire website?
Scope too limited, and politics defined.
Imagine if you could link up-votable application examples to the useless page of a Mathematics theorem.
Imagine if you could create multiple different versions of articles explaining them in your own perfect manner to a specific audience, instead of having this encyclopedic blob.
if the living ultimate god of
C++upvotes you, you get
if the first-day newb of
Javaupvotes you, you also get
Which makes no sense.
Only very specific posts are accepted on Stack Overflow, and anything else gets downvoted, criticized and deleted.
This greatly discourages new users, who might still have added value to the project.
On our website, anyone can post anything that is legal in a given country. No one can delete your content if it is legal, no matter their reputation, only downvote it.
Then we use algorithms to rank content.
Is politics based, rather than algorithmic, and thus more imperfect, e.g.:
each post can have up to 5 tags. If people disagree, politically elected moderators or site employees decide.
Edit wars, just like Wikipedia, which require moderator intervention to solve.
Randomly split between sites like Stack Overflow vs Super User, with separate user reputations, but huge overlaps, and many questions that appears as dupes on both and never get merged.
If I were to write a book about Quantum mechanics today, I’d likely upload an asciidoc to GitHub.
But there is one major problem with that: the entry barrier for new contributors is very large.
If they submit a pull request, I have to review it, otherwise, no one will ever see it.
Out amazing website would allow the reader to add his own example of, say, The Uncertainty Principle, whenever he wants, under the appropriate section.
Then, people who want to learn more about it, would click on the "defined tag" by the article, and our amazing analytics would point them to the best such articles.
PubMed data represents this concept through the
Who might seed fund this:
https://elifesciences.org/labs by eLife open journal not for profit. Cambridge UK based.
https://www.digital-science.com/investment/catalyst-grant/ by Shuttleworth foundation.
They don’t really have money, but they could help :-)
https://www.chanzuckerberg.com/ Zuck has already invested in education previously:
Start with consulting for universities to get some cash flowing.
Help teachers create perfect courses.
At the same time, develop the website, and use the generated content to bootstrap it.
Choose a domain of knowledge, generate perfect courses for it, and find all teachers of the domain in the world who are teaching that and help them out.
Ensure that the content can be downloaded as text, so that if this project fails, we can just upload everything to GitHub, and not all is lost.
Then expand out to other domains.
TODO: which domain of knowledge should we go for? The more precise the better.
maths is perfect because it "never" changes. But does not make money.
computer science might be good, e.g. machine learning.
If enough people use it, we can let people sell content through us, to become the YouTube of courses.
Teachers have the incentive of making open source to get more students.
Students pay when they want help to learn something.
We take a cut of the transactions.
However this goes a bit against our "open content" ideal.
One solution would be to only allow content to be private for a limited amount of time. Then users would be selling early access to the content. But all content would ultimately come back to the public site.
Don’t like this very much, but if it’s the only way…
Focus on job ads like Stack Overflow.
like YouTube, pay creators proportionally to views / metrics
paid subscription to remove ads from site
education has huge inertia:
university teachers are only ranked for their innovative research, and most don’t care or are not truly good explainers / educational content generators.
pre university: only cares about making students pass the useless university entry exams, instead of doing something truly valuable for society
Stack Overflow is good enough (?), even though it could be so much better
Google PageRank worked because they could crawl the entire web and get a large dataset without everyone having to go to them in the first place.
PageRank does not work for us however, as we need to know who is the author of each post. What to do about pages where the posts of multiple people show at the same time?
If only there was a standardized metadata on HTML that said who is the author of each post.
But even then, how to standardize the tagging? Who would store that data?
most of the information that is actually useful in the world if not open, but rather closed behind patents and industrial secrets.
And you wouldn’t be able to use or advance that information without the expensive associated machinery.
Working on recreating this information in an open way, and putting it on GitHub, may be more useful than this project.
in small fields of highly advanced research, the entry barrier is already huge, and only full time researchers can make any meaningful contribution, and we already know who the best are at all times.
The entry barrier of a journal is tiny compared to working full time on a given subject.
I have to organize this part better.
General reputation systems:
shares 90% ad revenue with content creators
http://www.synereo.com/whitepapers/synereo.pdf#subsection.2.2.2 distributed social network, seems to use quality metrics to determine how much content will be hosted from each person?
TODO open source? https://github.com/synereo Where is the source?
Where does their money come from? When will it launch?
https://github.com/debiki/ed-server no tags? Best go up focus.
Mathematical problem: make a stochastic matrix graph where each entry equals:
(1 / n_links)if there is a link going out
Now calculate the steady state of the Markov process: https://en.wikipedia.org/wiki/Markov_chain#Steady-state_analysis_and_limiting_distributions which is the same as calculating the eigenvector.
Convergence of simple interactive algorithm: stochastic link matrix M iff M is both: (TODO proof):
irreducible: definition: no strongly connected components smaller than the entire matrix. You can get from any place to any place.
Or in other words, there are no sets of pages from which the surfer cannot escape. One example of this is a page without any outgoing links.
http://drops.dagstuhl.de/volltexte/2007/1072/pdf/07071.VignaSebastiano.Paper.1072.pdf the damping factor can be interpreted as a probability that the random surfer will jump to a random page. It solves in particular the problem if the page has no outgoing links.
If is the same as adding a
dumping_factor / total_n_pagesto every element of he matrix, and multiplying the actual matrix by
1 - damping_factor.
1 is always the largest eigenvalue http://math.stackexchange.com/questions/40320/proof-that-the-largest-eigenvalue-of-a-stochastic-matrix-is-1 wit Looks like 1 is the only eigenvalue: http://math.stackexchange.com/questions/351142/why-markov-matrices-always-have-1-as-an-eigenvalue
Existence of a single largest real eigenvalue is guaranteed by https://en.wikipedia.org/wiki/Perron–Frobenius_theorem
Aperiodicity is likely for the huge graph of the web, so we forget about it.
Proposal to use it on Stack Overflow:
PageRank tutorials and papers:
https://en.wikipedia.org/wiki/TrustRank Starts from a set of trusted pages. Interesting, as that could be pages / users which were upvoted.
https://en.wikipedia.org/wiki/HITS_algorithm separates author from referrer, which could be interesting to give more reputation to those who actually write material.
https://www.nayuki.io/page/computing-wikipedias-internal-pageranks Wikipedia internal PageRanks, using a simple proprietary open-source Java PageRank implementation.
topic sensitive TODO understand better. Seems to modify the damping biasing to favour some pre-determined pages, on the paper based on DMOZ human consensus classification (no upvotes, just politics)
we could use something like that but based on votes of a given user, but it could be too expensive
http://www-cs-students.stanford.edu/~taherh/papers/topic-sensitive-pagerank.pdf Contains a great explanation of PageRank.
Seems to use an arbitrary previously fixed number of topics?
Flickr 2016 only photo author can add tags
Delicious TODO down?
DOIs are identifiers for articicles, and what current research uses an identifiers.
https://arxiv.org: you need to get an endorsement by someone who has a least three published papers on a given magic category. This then gives you free DOIs, which makes your stuff visible by third party rankers like Google scholar. PDF uploads. Meh.
You can upload a bit of description text which change, but the files are unchangeable.
Forces you to select from magic tag / category list.
DOIs of type: https://doi.org/10.6084/m9.figshare.6248786.v1 and those links redirect you to the content
Magic urls have a version for multiple versions of same content, but this is just a convention done by figshare.
TODO: ORCID login?