# Age at creation for programming languages stats

Anton Antonov
[MathematicaForPrediction at WordPress](https://mathematicaforprediction.wordpress.com)
May 2024

## Introduction

In this notebook we ingest programming language creation data from ["**P**rogramming **L**anguage **D**ata**B**ase"](https://pldb.io/index.html) and visualize several of its statistics.

We do not examine the data source, and we do not want to read too much into the data based on these stats. We started this notebook just wanting to make the bubble charts (both 2D and 3D). Nevertheless, we are tempted to make and justify statements like:

- Pareto holds, as usual.

- Language creators tend to do it more than once.

- Beware the [Second system effect](https://en.wikipedia.org/wiki/Second-system_effect).

### References

Here are reference links with explanations and links to dataset files:

- [The Ages of Programming Language Creators (pldb.io)](https://pldb.io/posts/ageAtCreation.html)

    - Short note about data and related statistics; provides a link to a [TSV file](https://pldb.io/posts/age.tsv) with the data.

- [The Ages of Programming Language Creators (datawrapper.dwcdn.net)](https://datawrapper.dwcdn.net/rT0yG/1/)

    - "Just a plot"; provides a link to a CSV file with the data.

- [The Ages of Programming Language Creators](https://www.reddit.com/r/programming/comments/1cw2ri4/the_ages_of_programming_language_creators/) (Reddit)

    - Link(s) and discussion.

------

## Setup

In [1]:
use Data::Importers;
use Data::Reshapers;
use Data::Summarizers;

use JavaScript::D3;

In [2]:
#% javascript
require.config({
     paths: {
     d3: 'https://d3js.org/d3.v7.min'
}});

require(['d3'], function(d3) {
     console.log(d3);
});

In [3]:
#% js
js-d3-list-line-plot(10.rand xx 40, background => 'none')


## Data ingestion

Here we get the TSV file with the function `data-import` of the package "Data::Importers":

In [4]:
my $url = "https://pldb.io/posts/age.tsv";
my @dsDataLines = data-import($url).lines.map({ $_.split("\t") })>>.Array;
deduce-type(@dsDataLines)

Vector(Vector(Atom((Str)), 13), 186)

In [5]:
my @field-names = @dsDataLines.head.Array;
my @dsData = @dsDataLines.tail(*-2).map({ @field-names.Array Z=> $_.Array })>>.Hash;

deduce-type(@dsData)

Vector(Assoc(Atom((Str)), Atom((Str)), 13), 184)

In [6]:
@dsData = @dsData.map({
    $_<ageAtCreation> = $_<ageAtCreation>.UInt;
    $_<rank> = $_<rank>.Int;
    $_<pldbScore> = $_<pldbScore>.Int;
    $_<appeared> = $_<appeared>.Int;
    $_<numberOfUsersEstimate> = $_<numberOfUsersEstimate>.Int;
    $_<numberOfJobsEstimate> = $_<numberOfJobsEstimate>.Int;
    $_<foundationScore> = $_<foundationScore>.Int;
    $_
}).Array;

deduce-type(@dsData)

Vector(Struct([ageAtCreation, appeared, creators, filename, foundationScore, id, inboundLinks, measurements, numberOfJobsEstimate, numberOfUsersEstimate, pldbScore, rank, tags], [Int, Int, Str, Str, Int, Str, Str, Str, Int, Int, Int, Int, Str]), 184)

Show summary:

In [7]:
sink records-summary(@dsData, max-tallies => 12)

+------------------------------+------------------------+-------------------------+--------------------+-----------------------+---------------+-------------------------------+---------------------+---------------------------+-----------------------+----------------------+---------------------+----------------+
| filename                     | pldbScore              | numberOfUsersEstimate   | id                 | appeared              | inboundLinks  | creators                      | ageAtCreation       | tags                      | numberOfJobsEstimate  | rank                 | foundationScore     | measurements   |
+------------------------------+------------------------+-------------------------+--------------------+-----------------------+---------------+-------------------------------+---------------------+---------------------------+-----------------------+----------------------+---------------------+----------------+
| fp.scroll             => 1   | Min    => 15877        | Min

Focus languages to be used in the plots below:

In [8]:
my @focusLangs = ["C++", "Fortran", "Java", "Mathematica", "Perl 6", "Raku", "SQL", "Wolfram Language"];

[C++ Fortran Java Mathematica Perl 6 Raku SQL Wolfram Language]

Here we find the most important tags (used in the plots below):

In [9]:
my @topTags = @dsData.map(*<tags>).&tally.sort({ $_.value }).reverse.head(7)>>.key;

[pl textMarkup dataNotation grammarLanguage queryLanguage protocol stylesheetLanguage]

Here we add the column "group" based on the focus languages and most important tags:

In [10]:
@dsData = @dsData.map({ 
    $_<group> = do if $_<id> ∈ @focusLangs { "focus" } elsif $_<tags> ∈ @topTags { $_<tags> } else { "other" };
    $_
});

deduce-type( @dsData.head(3) )

Vector(Struct([ageAtCreation, appeared, creators, filename, foundationScore, group, id, inboundLinks, measurements, numberOfJobsEstimate, numberOfUsersEstimate, pldbScore, rank, tags], [Int, Int, Str, Str, Int, Str, Str, Str, Str, Int, Int, Int, Int, Str]), 3)

------

## Distributions

Here are the distributions of the variables/columns:

- age at creation 

    - i.e. "How old was the creator?"

- appeared

    - i.e. "In what year was the programming language announced?"

In [11]:
#% js
js-d3-histogram(@dsData.map(*<ageAtCreation>), title => 'Age at creation') 
~
js-d3-histogram(@dsData.map(*<appeared>), title => 'Appeared')



Here are corresponding Box-Whisker plots together with tables of their statistics:
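
The statistics behind a box-whisker plot are the minimum, the quartiles, the median, and the maximum of a column. Here is a minimal sketch in plain Raku, assuming quantiles computed by linear interpolation between order statistics; the helper `quantile` is hypothetical (it is not part of the packages loaded above) and the ages are toy values, not the PLDB data:

```raku
# Hypothetical helper: quantile by linear interpolation between order statistics.
sub quantile(@values, $p) {
    my @s  = @values.sort;
    my $h  = (@s.elems - 1) * $p;    # fractional position in the sorted list
    my $lo = $h.floor;
    my $hi = $h.ceiling;
    @s[$lo] + ($h - $lo) * (@s[$hi] - @s[$lo])
}

# Toy ages, not taken from the PLDB dataset.
my @ages = 22, 25, 28, 30, 31, 35, 40, 44, 52;

my %stats =
    min    => @ages.min,
    q1     => quantile(@ages, 0.25),
    median => quantile(@ages, 0.5),
    q3     => quantile(@ages, 0.75),
    max    => @ages.max;

.say for %stats.sort(*.key);
```

With the notebook data, `@ages` would be replaced by `@dsData.map(*<ageAtCreation>)` or `@dsData.map(*<appeared>)`.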


## Pareto principle manifestation

### Number of creations

Here is the Pareto principle statistic for the number of created (or renamed) programming languages per creator:

In [12]:
my %creations = @dsData.map(*<creators>).&tally;
my @paretoStats = pareto-principle-statistic(%creations);
@paretoStats.head(6)

(Niklaus Wirth => 0.043478 Breck Yunits => 0.081522 John Backus => 0.108696 Larry Wall => 0.130435 Chris Lattner => 0.152174 Tim Berners-Lee => 0.173913)

Here is the corresponding plot:

In [13]:
#% js
js-d3-list-plot( @paretoStats>>.value, 
    title => 'Pareto principle: number of languages per creators team', 
    title-color => 'Silver',
    background => 'none', 
    :grid-lines,
)

**Remark:** We can see that ≈25% of the creators correspond to ≈50% of the languages.
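
The statistic used above can be read as the cumulative share of the total, taken over the items sorted from largest to smallest. Here is a minimal sketch in plain Raku, assuming that reading of `pareto-principle-statistic` (its actual implementation is not shown in this notebook); the tallies are toy values, not the PLDB data:

```raku
# Toy tallies (creator => number of languages); not the PLDB data.
my %tally = A => 5, B => 3, C => 2;

my $total      = %tally.values.sum;
my @sorted     = %tally.sort(-*.value);            # largest tally first
my @cumulative = [\+] @sorted.map(*.value);        # running totals
my @pareto     = @sorted.map(*.key) Z=> @cumulative.map(* / $total);

.say for @pareto;
```

Each pair gives a creator and the fraction of all languages covered by that creator together with all more prolific ones, which is what the plot draws on the vertical axis.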


### Popularity

Obviously, programmers can and do use more than one programming language. Nevertheless, it is interesting to see the Pareto principle plot for the languages' "mind share" based on the *estimates* of the number of users.

In [14]:
#% js
my %users = @dsData.map({ $_<id> => $_<numberOfUsersEstimate>.Int });
my @paretoStats = pareto-principle-statistic(%users);

js-d3-list-plot( @paretoStats>>.value, 
    title => 'Pareto principle: number of users per language', 
    title-color => 'Silver',
    background => 'none', 
    :grid-lines,
)

**Remark:** Again, the plot above is "wrong" -- programmers use more than one programming language.


-----------

## Correlations

In order to see meaningful pairwise correlation plots, we take logarithms of the columns with large values:

In [15]:
my @varNames = <appeared ageAtCreation numberOfUsersEstimate numberOfJobsEstimate rank measurements pldbScore>;
# Keep the selected columns; replace the two estimate columns with log10(value + 1)
my @dsDataVar = @dsData.map({ %( |($_{@varNames}:p), numberOfUsersEstimate => log10($_<numberOfUsersEstimate> + 1), numberOfJobsEstimate => log10($_<numberOfJobsEstimate> + 1) ) });

**Remark:** Note that we "cheat" by adding 1 before taking the logarithms.

We obtain the tables of correlation plots with the recently introduced, experimental Wolfram Language function [PairwiseListPlot](https://reference.wolfram.com/language/ref/PairwiseListPlot). If we remove the rows with zeroes, some of the correlations become more obvious. Here is the corresponding tab view of the two correlation tables:

In [None]:
TabView[{
   "data" -> PairwiseListPlot[dsDataVar, PlotTheme -> "Business", ImageSize -> 800], 
   "zero-free data" -> PairwiseListPlot[dsDataVar[Select[FreeQ[Values[#], 0] &]], PlotTheme -> "Business", ImageSize -> 800]}]

![0fkn1gb70pm71](img/0fkn1gb70pm71.png)

**Remark:** Given the names of the data columns and the corresponding obvious interpretations we can say that the stronger correlations make sense.
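
The strength of a pairwise relation can also be quantified numerically. Here is a minimal sketch of the Pearson correlation coefficient in plain Raku; the sub `pearson` is a hypothetical helper written for illustration (it does not come from the packages loaded above), and the two lists are toy values rather than columns of the PLDB data:

```raku
# Hypothetical helper: Pearson correlation of two equally long numeric lists.
sub pearson(@x, @y) {
    my $mean-x = @x.sum / @x.elems;
    my $mean-y = @y.sum / @y.elems;
    my $sxy = (@x Z @y).map({ ($_[0] - $mean-x) * ($_[1] - $mean-y) }).sum;
    my $sxx = @x.map({ ($_ - $mean-x) ** 2 }).sum;
    my $syy = @y.map({ ($_ - $mean-y) ** 2 }).sum;
    $sxy / sqrt($sxx * $syy)
}

# Toy columns with an exact linear relation.
my @x = 1, 2, 3, 4;
my @y = 2, 4, 6, 8;

say pearson(@x, @y);
```

Applied to the notebook data, the arguments would be two numeric columns, e.g. the log-transformed users and jobs estimates.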

## Bubble chart 2D

In this section we make an informative 2D bubble chart with tooltips.

Here we make a dataset for the bubble chart:

In [17]:
my @dsData2 = @dsData.map({
    %( x => $_<appeared>, y => $_<ageAtCreation>, z => log($_<numberOfUsersEstimate>, 10), group => $_<group>, label => "<b>{$_<id>}</b> by {$_<creators>}")
});

deduce-type(@dsData2)

Vector(Struct([group, label, x, y, z], [Str, Str, Int, Int, Num]), 184)

Here is the bubble chart:

In [18]:
#% js
js-d3-bubble-chart(@dsData2, 
        z-range-min => 1,
        z-range-max => 16,
        title-color => 'Silver',
        title-font-size => 20,
        x-label => "appeared", 
        y-label => "age at creation",
        title => 'Age at creation',
        width => 1200,
        margins => { left => 60, bottom => 50, right => 200},
        background => 'none',
        :grid-lines
);

**Remark:** The programming language J is a clear outlier because of creators' ages.

-------

## Second system effect traces


In this section we try -- and fail -- to demonstrate that the more programming languages a team of creators makes the less successful those languages are. (Maybe, because they are more cumbersome and suffer the Second system effect?)

**Remark:** This section is mostly made "for fun." It is not true that each set of languages per creators team is made of comparable languages. For example, complementary languages can be in the same set. (See HTTP, HTML, URL.) Some sets are just made of the same language under different names. (See Perl 6 and Raku, and Mathematica and Wolfram Language.) Also, older languages would have the [First mover advantage](https://en.wikipedia.org/wiki/First-mover_advantage).

Make a creators-to-index association:

In [82]:
my %creators = @dsData.map(*<creators>).&tally.pairs.grep(*.value > 1);
my %nameToIndex = %creators.keys.sort Z=> ^%creators.elems;
%nameToIndex.elems

33

Make a bubble chart dataset with relative popularity per creators team:

In [79]:
my @nUsers = @dsData.grep({ %creators{$_<creators>}:exists });

@nUsers = |group-by(@nUsers, <creators>).map({ 

    my $m = max(1, $_.value.map(*<numberOfUsersEstimate>).max.sqrt);

    $_.value.map({ %( x => $_<appeared>, y => %nameToIndex{$_<creators>}, z => $_<numberOfUsersEstimate>.sqrt/$m, group => $_<creators>, label => "<b>{$_<id>}</b>" ) }) 
    
})>>.Array.flat;

@nUsers .= sort(*<group>);

deduce-type(@nUsers)

Vector(Struct([group, label, x, y, z], [Str, Str, Int, Int, Num]), 92)

Here is the corresponding bubble chart:

In [87]:
#% js
js-d3-bubble-chart(@nUsers, 
        title => 'Second system effect',
        title-color => 'Silver',
        title-font-size => 20,
        x-label => "appeared",
        y-label => "creators", 
        z-range-min => 3,
        z-range-max => 10,
        width => 1000,
        height => 900,
        margins => { left => 60, bottom => 50, right => 200},
        background => 'none',
        grid-lines => (Whatever, %nameToIndex.elems)
);

From the plot above we *cannot* decisively say that:

> The most recent creation of a team of programming language creators is not the team's most popular creation.

That statement, though, does hold for a fair number of cases.
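
The statement can also be checked per team instead of being read off the plot. Here is a minimal sketch in plain Raku, assuming records with the keys `creators`, `id`, `appeared`, and `users`; the records are toy values with hypothetical names, not the PLDB data:

```raku
# Toy records (hypothetical values, not from the PLDB data).
my @records =
    %( creators => 'Team A', id => 'A1', appeared => 1970, users => 1000 ),
    %( creators => 'Team A', id => 'A2', appeared => 1980, users => 200 ),
    %( creators => 'Team B', id => 'B1', appeared => 1990, users => 50 ),
    %( creators => 'Team B', id => 'B2', appeared => 2000, users => 900 );

# For each team: is the most recent language different from the most popular one?
my %holds = @records.classify(*<creators>).map({
    my $latest  = .value.max(*<appeared>);
    my $popular = .value.max(*<users>);
    .key => $latest<id> ne $popular<id>
});

.say for %holds.sort(*.key);

# Fraction of teams for which the statement holds.
say %holds.values.sum / %holds.elems;
```

With the notebook data, the same check could be run over the records selected for `@nUsers`, using `numberOfUsersEstimate` as the popularity measure.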

