# Age at creation for programming languages stats

## ⎡***Exploratory Data Analysis with Raku***⎦

Anton Antonov   
[RakuForPrediction at WordPress](https://rakuforprediction.wordpress.com)   
[MathematicaForPrediction at WordPress](https://mathematicaforprediction.wordpress.com)   
May 2024


------

## Introduction

In this notebook we ingest programming languages creation data from ["**P**rogramming **L**anguage **D**ata**B**ase"](https://pldb.io/index.html) and visualize several statistics of it.

We do not examine the data source and we do not want to reason too much about the data using the stats. We started this notebook by just wanting to make the bubble charts (both 2D and 3D). Nevertheless, we are tempted to make, and try to justify, statements like:

- Pareto holds, as usual.

- Language creators tend to do it more than once.

- Beware the [Second system effect](https://en.wikipedia.org/wiki/Second-system_effect).

### References

Here are reference links with explanations and links to dataset files:

- [The Ages of Programming Language Creators (pldb.io)](https://pldb.io/posts/ageAtCreation.html)

    - Short note about data and related statistics; provides a link to a [TSV file](https://pldb.io/posts/age.tsv) with the data.

- [The Ages of Programming Language Creators (datawrapper.dwcdn.net)](https://datawrapper.dwcdn.net/rT0yG/1/)

    - "Just a plot"; provides a link to a CSV file with the data.

- [The Ages of Programming Language Creators](https://www.reddit.com/r/programming/comments/1cw2ri4/the_ages_of_programming_language_creators/) (Reddit)

    - Link(s) and discussion.

------

## Setup

### Raku Tools

We'll leverage these Raku packages:

* **Data::Importers:**  Ingesting data from various formats (TSV in our case).
* **Data::Reshapers:** Transforming and restructuring data for analysis.
* **Data::Summarizers:**  Calculating descriptive statistics.
* **Data::Translators:** Data structures conversion into R, Python, WL, and ***HTML*** code.
* **JavaScript::D3:**  Generating interactive visualizations using D3.js.
* **Jupyter::Chatbook:**  Interactive notebook environment for code execution and presentation.

In [1]:
use Data::Importers;
use Data::Reshapers;
use Data::Summarizers;
use Data::Translators;
use Data::TypeSystem;

use JavaScript::D3;

In [2]:
use DSL::English::DataQueryWorkflows;

### JavaScript graphics

In [3]:
#% javascript
require.config({
     paths: {
     d3: 'https://d3js.org/d3.v7.min'
}});

require(['d3'], function(d3) {
     console.log(d3);
});

In [4]:
#% js
js-d3-list-line-plot(10.rand xx 80, background => 'none')

In [5]:
my $title-color = 'SlateGray';
my $stroke-color = 'Silver';
my $tooltip-color = 'Black';
my $tooltip-background-color = 'Ivory';
my $background = 'none';

none

-----

## Data ingestion


Here we ingest the TSV file:

In [6]:
my $url = "https://pldb.io/posts/age.tsv";
# data-import (from Data::Importers) downloads and parses the TSV file;
# the built-in slurp only reads local files and takes no headers option.
my @dsData = data-import($url, headers => 'auto');

deduce-type(@dsData)

Vector(Assoc(Atom((Str)), Atom((Str)), 13), 213)
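
The deduced type above says that every field arrives as a string. Here is a minimal core-Raku sketch of why that happens when tab-separated text is turned into records. It only illustrates the idea and is not how `Data::Importers` is implemented; the two-row TSV snippet and the variable names are made up for the illustration.

```raku
# A tiny tab-separated snippet (illustrative rows, not the downloaded file)
my $tsv = "id\tname\tappeared\nruby\tRuby\t1995\nerlang\tErlang\t1986";

# The first line holds the column names; the rest are data rows
my ($header, @rows) = $tsv.lines;
my @names = $header.split("\t");

# Pair each column name with the matching field of a row
my @records = @rows.map({ %( @names Z=> .split("\t") ) });

say @records[0]<appeared>.^name;      # Str
say @records[0]<appeared>.Int + 1;    # 1996
```

Splitting text always yields strings, so numeric columns have to be converted explicitly before doing arithmetic or statistics on them.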

Here we define a preferred order of the columns:

In [7]:
my @field-names = ['id', 'name', |(@dsData.head.keys (-) <id name>).keys.sort];

[id name ageAtCreation appeared creators foundationScore inboundLinksCount measurements numberOfJobsEstimate numberOfUsersEstimate pldbScore rank tags]

Here we convert values of relevant columns into integers: 

In [8]:
@dsData = @dsData.map({
    $_<ageAtCreation> = $_<ageAtCreation>.UInt;
    $_<rank> = $_<rank>.Int;
    $_<pldbScore> = $_<pldbScore>.Int;
    $_<appeared> = $_<appeared>.Int;
    $_<numberOfUsersEstimate> = $_<numberOfUsersEstimate>.Int;
    $_<numberOfJobsEstimate> = $_<numberOfJobsEstimate>.Int;
    $_<foundationScore> = $_<foundationScore>.Int;
    $_<measurements> = $_<measurements>.Int;
    $_<inboundLinksCount> = $_<inboundLinksCount>.Int;
    $_
}).Array;

deduce-type(@dsData)

Vector(Struct([ageAtCreation, appeared, creators, foundationScore, id, inboundLinksCount, measurements, name, numberOfJobsEstimate, numberOfUsersEstimate, pldbScore, rank, tags], [Int, Int, Str, Int, Str, Int, Int, Str, Int, Int, Int, Int, Str]), 213)

Here is a sample of the dataset's rows:

In [9]:
#% html
@dsData.pick(6) ==> data-translation(:@field-names, align => 'left')

| id | name | ageAtCreation | appeared | creators | foundationScore | inboundLinksCount | measurements | numberOfJobsEstimate | numberOfUsersEstimate | pldbScore | rank | tags |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| semver | Semantic Versioning | 32 | 2011 | Tom Preston-Werner | 0 | 0 | 20 | 0 | 8425 | 23015 | 237 | schema |
| modula-2 | Modula-2 | 44 | 1978 | Niklaus Wirth | 0 | 2 | 41 | 0 | 655 | 22910 | 278 | pl |
| ruby | Ruby | 30 | 1995 | Yukihiro Matsumoto | 84 | 87 | 104 | 11438 | 394798 | 24334 | 8 | pl |
| spss | SPSS | 25 | 1968 | Norman H. Nie and C. Hadlai Hull and Dale H. Bent | 0 | 2 | 30 | 9587 | 965674 | 23808 | 68 | pl |
| literate-coffeescript | Literate CoffeeScript | 32 | 2013 | Jeremy Ashkenas | 0 | 0 | 19 | 0 | 22814 | 23054 | 225 | pl |
| erlang | Erlang | 36 | 1986 | Joe Armstrong and Robert Virding and Mike Williams | 13 | 13 | 69 | 308 | 27148 | 24079 | 34 | pl |


Here is a summary of the dataset:

In [10]:
sink records-summary(@dsData, max-tallies => 7, field-names => @field-names.sort[^7]);
sink records-summary(@dsData, max-tallies => 7, field-names => @field-names.sort[7..12]);

+---------------------+-----------------------+------------------------+--------------------+--------------------------+---------------------+---------------------+
| ageAtCreation       | appeared              | creators               | foundationScore    | id                       | inboundLinksCount   | measurements        |
+---------------------+-----------------------+------------------------+--------------------+--------------------------+---------------------+---------------------+
| Min    => 16        | Min    => 1948        | Niklaus Wirth   => 8   | Min    => 0        | scroll            => 1   | Min    => 0         | Min    => 6         |
| 1st-Qu => 30        | 1st-Qu => 1978        | Breck Yunits    => 7   | 1st-Qu => 0        | gnu-emacs-editor  => 1   | 1st-Qu => 0         | 1st-Qu => 14.5      |
| Mean   => 36.643192 | Mean   => 1992.896714 | John Backus     => 5   | Mean   => 32.58216 | oberon-2          => 1   | Mean   => 34.286385 | Mean   => 32.173709 |
| Median =

Focus languages to be used in the plots below:

In [11]:
my @focusLangs = ["C++", "Fortran", "Java", "Mathematica", "Perl 6", "Raku", "SQL", "Wolfram Language"];

[C++ Fortran Java Mathematica Perl 6 Raku SQL Wolfram Language]

Here we find the most important tags (used in the plots below):

In [12]:
my @topTags = @dsData.map(*<tags>).&tally.sort({ $_.value }).reverse.head(8)>>.key;

[pl dataNotation textMarkup library grammarLanguage queryLanguage stylesheetLanguage editor]
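
For readers without `Data::Summarizers` at hand, the same "count, sort, take the top" idea can be sketched with the built-in `Bag` type. The tag values below are a made-up toy column, and the explicit tie-breaking by name is an addition of this sketch, not something the cell above does.

```raku
# Toy tag column (hypothetical values, not the PLDB data)
my @tags = <pl pl pl dataNotation dataNotation library>;

# A Bag counts occurrences; sort by descending count, then by name so that ties are stable
my @top = @tags.Bag
               .sort({ $^b.value <=> $^a.value || $^a.key cmp $^b.key })
               .head(2)>>.key;

say @top;   # [pl dataNotation]
```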

Here we add the column "group" based on the focus languages and most important tags:

In [13]:
@dsData = @dsData.map({ 
    $_<group> = do if $_<name> ∈ @focusLangs { "focus" } elsif $_<tags> ∈ @topTags { $_<tags> } else { "other" };
    $_
});

deduce-type(@dsData)

Vector(Struct([ageAtCreation, appeared, creators, foundationScore, group, id, inboundLinksCount, measurements, name, numberOfJobsEstimate, numberOfUsersEstimate, pldbScore, rank, tags], [Int, Int, Str, Int, Str, Str, Int, Int, Str, Int, Int, Int, Int, Str]), 213)

------

## Distributions

Here are the distributions of the variables/columns:

- age at creation 

    - i.e. "How old was the creator?"

- appeared

    - i.e. "In what year was the programming language announced?"

In [14]:
#% js
my %opts = :$title-color, background => 'none', bins => 40;
js-d3-histogram(@dsData.map(*<ageAtCreation>), title => 'Age at creation', |%opts) 
~
js-d3-histogram(@dsData.map(*<appeared>), title => 'Appeared', |%opts)


Here are corresponding Box-Whisker plots:

In [30]:
#% js
my %opts = :horizontal, :outliers, :$title-color, :$stroke-color, :$background, :$tooltip-color, :$tooltip-background-color, width => 400;
js-d3-box-whisker-chart(@dsData.map(*<ageAtCreation>), title => 'Age at creation', |%opts)

In [31]:
#% js
js-d3-box-whisker-chart(@dsData.map(*<appeared>), title => 'Appeared', |%opts)

-----

## Facilitation with code generation

Here is a way to _derive_ summarization code from a natural-language specification:

In [17]:
'
use @dsData; 
group by tags; 
summarize over "ageAtCreation"
'
==> ToDataQueryWorkflowCode(target => 'Raku::Reshapers')
==> copy-to-clipboard

$obj = @dsData ;
$obj = group-by($obj, "tags") ;
$obj = $obj.map({ $_.key => summarize-at($_.value, ("ageAtCreation"), (&elems, &min, &max)) })

For more details see [AA1] and [AAv1]. 
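
To make the shape of that result concrete, here is a core-Raku sketch of the same group-then-summarize steps. It uses `classify` instead of the `group-by` and `summarize-at` functions of `Data::Reshapers`, and the three records are hypothetical.

```raku
# Toy records (hypothetical values)
my @data =
    { tags => 'pl',      ageAtCreation => 30 },
    { tags => 'pl',      ageAtCreation => 44 },
    { tags => 'library', ageAtCreation => 25 };

# classify groups the records by tag; each group is then reduced to three numbers
my %summary = @data.classify(*<tags>).map({
    my @ages = .value.map(*<ageAtCreation>);
    .key => %( elems => @ages.elems, min => @ages.min, max => @ages.max )
});

say %summary<pl><elems min max>;   # (2 30 44)
```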

-------

## Pareto principle manifestation

### Number of creations

Here is the Pareto principle statistic for the number of created (or renamed) programming languages per creators team:

In [18]:
my %creations = @dsData.map(*<creators>).&tally;
my @paretoStats = pareto-principle-statistic(%creations);
@paretoStats.head(6)

(Niklaus Wirth => 0.037559 Breck Yunits => 0.070423 John Backus => 0.093897 Tim Berners-Lee => 0.112676 Chris Lattner => 0.131455 Donald Knuth => 0.14554)

Here is the corresponding plot:

In [19]:
#% js
js-d3-list-plot( @paretoStats, 
    title => 'Pareto principle: number languages per creators team', 
    :$title-color,
    background => 'none',
    :$tooltip-color,
    :$tooltip-background-color,
    :grid-lines,
)

**Remark:** We can see that ≈30% of the creators correspond to ≈50% of the languages.
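
The statistic behind the plot appears to be a cumulative share: the tallies are sorted in descending order and the running sums are divided by the total. This reading is consistent with the output above, where the first value, 0.037559, equals 8/213 (the 8 entries of Niklaus Wirth out of 213 records). Here is a minimal core-Raku sketch under that assumption; it is not the implementation of `pareto-principle-statistic`, and the tally is a toy one.

```raku
# Toy tally (hypothetical counts): creators team => number of languages
my %creations = A => 5, B => 3, C => 2;

# Sort by descending count (ties broken by name)
my @sorted = %creations.sort({ $^b.value <=> $^a.value || $^a.key cmp $^b.key });

# Running totals divided by the grand total give the cumulative shares
my $total  = %creations.values.sum;
my @pareto = @sorted.map(*.key) Z=> ([\+] @sorted.map(*.value)).map(* / $total);

say @pareto;   # [A => 0.5 B => 0.8 C => 1]
```

In the toy example one third of the teams (A alone) accounts for half of the languages, which is how the remark above is read off the plot.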


### Popularity

Obviously, programmers can and do use more than one programming language. Nevertheless, it is interesting to see the Pareto principle plot for the languages' "mind share" based on the *estimates* of the number of users.

In [20]:
#% js
my %users = @dsData.map({ $_<name> => $_<numberOfUsersEstimate>.Int });
my @paretoStats = pareto-principle-statistic(%users);
say @paretoStats.head(8);

js-d3-list-plot( @paretoStats, 
    title => 'Pareto principle: number users per language', 
    :$title-color,
    background => 'none', 
    :$tooltip-color,
    :$tooltip-background-color,
    :grid-lines,
)

(SQL => 0.124920756 JavaScript => 0.228674537 Java => 0.325793151 HTML => 0.422729511 C++ => 0.494563201 C => 0.560576923 Python => 0.611953791 CSS => 0.662099551)


**Remark:** Again, the plot above is "wrong" -- programmers use more than one programming language.


-----------

## Correlations

In order to see meaningful pairwise correlation plots, we take logarithms of the columns with large values:

In [21]:
my @corColnames = <appeared ageAtCreation numberOfUsersEstimate numberOfJobsEstimate rank measurements>;
my @dsDataVar = select-columns(@dsData, @corColnames);
@dsDataVar = @dsDataVar.map({ 
    my %h = $_.clone; 
    %h<numberOfUsersEstimate> = log(%h<numberOfUsersEstimate> + 1, 10); 
    %h<numberOfJobsEstimate> = log(%h<numberOfJobsEstimate> + 1, 10);
    %h
}).Array;

deduce-type(@dsDataVar)


Vector(Struct([ageAtCreation, appeared, measurements, numberOfJobsEstimate, numberOfUsersEstimate, rank], [Int, Int, Int, Num, Num, Int]), 213)
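
The `+ 1` inside the logarithm matters because some estimates are zero. A small core-Raku sketch of the transformation follows; the first two values are job estimates of the kind seen in the sample rows, and the third is a hypothetical large count.

```raku
# Raw estimates: zero, a mid-sized count, and a (hypothetical) large count
my @estimates = 0, 11438, 1_000_000;

# log10(x + 1) maps 0 to 0 and compresses large counts; log(0) itself would be -Inf
my @scaled = @estimates.map({ log($_ + 1, 10) });

say @scaled;
```

After the transformation the values span roughly 0 to 6 instead of 0 to a million, so the scatter plots are no longer dominated by a handful of very popular languages.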

Here we make a Cartesian product of the focus columns and make a scatter plot for each pair of that product:

In [22]:
#% js
(@corColnames X @corColnames)>>.reverse>>.Array.map( -> $c {
    # The logarithms were already taken when @dsDataVar was built, so the values are used as they are
    my @points = @dsDataVar.map({ %( x => $_{$c.head}, y => $_{$c.tail} ) });
    if $c.head eq $c.tail {
        js-d3-histogram( @points.map(*<x>), width => 230, height => 200, x-label => $c.head, :$background )
    } else {
        js-d3-list-plot( @points, width => 230, height => 200, x-label => $c.head, y-label => $c.tail, :$title-color, :$background )
    }
}).join("\n")

**Remark:** Given the names of the data columns and the corresponding obvious interpretations we can say that the stronger correlations make sense.

--------

## Bubble chart 2D

In this section we make an informative 2D bubble chart (with tooltips).

### Number of users (estimates)

Here we make a dataset for the bubble chart:

In [23]:
my @dsData2 = @dsData.map({
    %( x => $_<appeared>, y => $_<ageAtCreation>, z => log($_<numberOfUsersEstimate> + 1, 10), group => $_<group>, label => "<b>{$_<name>}</b> by {$_<creators>}")
});

deduce-type(@dsData2)

Vector(Struct([group, label, x, y, z], [Str, Str, Int, Int, Num]), 213)

Here is the bubble chart:

In [24]:
#% js
js-d3-bubble-chart(@dsData2, 
        z-range-min => 1,
        z-range-max => 16,
        :$title-color,
        title-font-size => 20,
        x-label => "appeared", 
        y-label => "ageAtCreation",
        title => 'lg(Number of users estimate)',
        width => 1200,
        margins => { left => 60, bottom => 50, right => 200},
        background => 'none',
        :$tooltip-color,
        :$tooltip-background-color,
        :grid-lines
);

**Remark:** The programming language J is a clear outlier because of creators' ages.

### Foundation score

The creator of the dataset -- Breck Yunits -- posted this comment on the Wolfram Community post ["Computational exploration for the ages of programming language creators dataset"](https://community.wolfram.com/groups/-/m/t/3180327):

> [...]`foundationScore` might be a better signal to plot. `foundationScore` answers the question "what languages do people who build languages use?". We crawl the Git repos of over 1,000 programming language projects, and look at the languages used in each. Using the same scale across languages helps me identify the most impactful ones. [...]

In [25]:
#% js
my @dsData3 = @dsData.map({
    %( x => $_<appeared>, y => $_<ageAtCreation>, z => sqrt($_<foundationScore>), group => $_<group>, tooltip => "<b>{$_<name>}</b> by {$_<creators>}")
});

js-d3-bubble-chart(@dsData3, 
        z-range-min => 3,
        z-range-max => 16,
        :$title-color,
        title-font-size => 20,
        x-label => "appeared", 
        y-label => "ageAtCreation",
        title => 'sqrt(foundationScore)',
        width => 1200,
        margins => { left => 60, bottom => 50, right => 200},
        background => 'none',
        :$tooltip-color,
        :$tooltip-background-color,
        :grid-lines
);

-------

## Second system effect traces


In this section we try -- and fail -- to demonstrate that the more programming languages a team of creators makes the less successful those languages are. (Maybe, because they are more cumbersome and suffer the Second system effect?)

**Remark:** This section is mostly made "for fun." It is not true that each set of languages per creators team is made of comparable languages. For example, complementary languages can be in the same set. (See HTTP, HTML, URL.) Some sets are just made of the same language but with different names. (See Perl 6 and Raku, and Mathematica and Wolfram Language.) Also, older languages would have the [First mover advantage](https://en.wikipedia.org/wiki/First-mover_advantage).

Make a creators-to-index association:

In [26]:
my %creators = @dsData.map(*<creators>).&tally.pairs.grep(*.value > 1);
my %nameToIndex = %creators.keys.sort Z=> ^%creators.elems;
%nameToIndex.elems

40

Make a bubble chart dataset with relative popularity per creators team:

In [27]:
my @nUsers = @dsData.grep({ %creators{$_<creators>}:exists });

@nUsers = |group-by(@nUsers, <creators>).map({ 

    my $m = max(1, $_.value.map(*<numberOfUsersEstimate>).max.sqrt);

    $_.value.map({ %( x => $_<appeared>, y => %nameToIndex{$_<creators>}, z => $_<numberOfUsersEstimate>.sqrt/$m, group => $_<creators>, tooltip => "<b>{$_<name>}</b> by {$_<creators>}" ) }) 
    
})>>.Array.flat;

@nUsers .= sort(*<group>);

deduce-type(@nUsers)

Vector(Struct([group, tooltip, x, y, z], [Str, Str, Int, Int, Num]), 109)
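
The normalization in the cell above can be sketched in core Raku as follows: within each team, every square-rooted user count is divided by the team's largest one, and `max(1, ...)` guards against a team whose estimates are all zero. The teams, names, and counts below are hypothetical, and `classify` stands in for `group-by`.

```raku
# Toy records (hypothetical teams and user counts)
my @langs =
    { name => 'L1', creators => 'Team A', users => 10_000 },
    { name => 'L2', creators => 'Team A', users =>    100 },
    { name => 'L3', creators => 'Team B', users =>      0 },
    { name => 'L4', creators => 'Team B', users =>      0 };

my @scaled = @langs.classify(*<creators>).sort(*.key).map({
    # Largest square-rooted count in the team; max(1, ...) avoids division by zero
    my $m = max(1, .value.map(*<users>).max.sqrt);
    .value.map({ %( name => $_<name>, z => $_<users>.sqrt / $m ) }).Slip
});

say $_<name>, ' ', $_<z> for @scaled;
```

The most popular language of a team with non-zero estimates gets `z` equal to 1, and the other languages of that team get a fraction of it, so bubble sizes are comparable within a team but not across teams.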

Here is the corresponding bubble chart:

In [28]:
#% js
js-d3-bubble-chart(@nUsers, 
        title => 'Second system effect',
        :$title-color,
        title-font-size => 20,
        x-label => "appeared",
        y-label => "creators", 
        z-range-min => 3,
        z-range-max => 10,
        width => 1000,
        height => 1100,
        margins => { left => 60, bottom => 50, right => 200},
        background => 'none',
        grid-lines => (Whatever, %nameToIndex.elems),
        opacity => 0.9,
);

In [29]:
#% js
my @nUsersTr = @nUsers.map({ my %h = $_.clone; %h<y> = $_<x>; %h<x> = $_<y>; %h });

js-d3-bubble-chart(@nUsersTr, 
        title => 'Second system effect',
        :$title-color,
        title-font-size => 20,
        y-label => "appeared",
        x-label => "creators", 
        z-range-min => 3,
        z-range-max => 10,
        width => 2000,
        height => 600,
        margins => { left => 60, bottom => 50, right => 200},
        background => 'none',
        grid-lines => (%nameToIndex.elems, Whatever),
        opacity => 0.9,
);

From the plot above we *cannot* decisively say that:

> The most recent creation of a team of programming language creators is not the team's most popular creation.

That statement, though, does hold for a fair number of cases.
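
The statement can also be checked by counting instead of by eye. Here is a minimal core-Raku sketch of such a check on hypothetical records; the field names `appeared` and `creators` follow the dataset, while `users` and all the values are made up for the illustration.

```raku
# Toy records (hypothetical teams, years, and user counts)
my @langs =
    { creators => 'Team A', name => 'A1', appeared => 1970, users => 500 },
    { creators => 'Team A', name => 'A2', appeared => 1980, users =>  50 },
    { creators => 'Team B', name => 'B1', appeared => 1990, users =>  10 },
    { creators => 'Team B', name => 'B2', appeared => 2000, users => 900 };

my %teams = @langs.classify(*<creators>);

# Teams whose latest language is not their most used one
my @holds = %teams.grep({
    my $latest  = .value.max(*<appeared>);
    my $popular = .value.max(*<users>);
    $latest<name> ne $popular<name>
}).map(*.key).sort;

say "{@holds.elems} of {%teams.elems} teams: @holds[]";   # 1 of 2 teams: Team A
```

Applied to the real data, the same count would be subject to the caveats in the remark at the start of this section (renamed languages, complementary languages, first-mover advantage).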


-------

## References

### Articles

[AA1] Anton Antonov, ["Introduction to data wrangling with Raku"](https://rakuforprediction.wordpress.com/2021/12/31/introduction-to-data-wrangling-with-raku/), (2021), [RakuForPrediction at WordPress](https://rakuforprediction.wordpress.com).

[AA2] Anton Antonov, ["Age at creation for programming languages stats"](https://mathematicaforprediction.wordpress.com/2024/05/22/age-at-creation-for-programming-languages-stats/), (2024), [MathematicaForPrediction at WordPress](https://mathematicaforprediction.wordpress.com).

### Notebooks

[AAn1] Anton Antonov, ["Computational exploration for the ages of programming language creators dataset"](https://community.wolfram.com/groups/-/m/t/3180327), (2024), [Wolfram Community](https://community.wolfram.com).

[AAn2] Anton Antonov, "Age at creation for programming languages stats".

### Packages

[AAp1] Anton Antonov, [Data::Importers Raku package](https://github.com/antononcube/Raku-Data-Importers), (2024), [GitHub/antononcube](https://github.com/antononcube).

[AAp2] Anton Antonov, [Data::Reshapers Raku package](https://github.com/antononcube/Raku-Data-Reshapers), (2021-2024), [GitHub/antononcube](https://github.com/antononcube).

[AAp3] Anton Antonov, [Data::Summarizers Raku package](https://github.com/antononcube/Raku-Data-Summarizers), (2021-2023), [GitHub/antononcube](https://github.com/antononcube).

[AAp4] Anton Antonov, [JavaScript::D3 Raku package](https://github.com/antononcube/Raku-JavaScript-D3), (2022-2024), [GitHub/antononcube](https://github.com/antononcube).

[AAp5] Anton Antonov, [Jupyter::Chatbook Raku package](https://github.com/antononcube/Raku-Jupyter-Chatbook), (2023-2024), [GitHub/antononcube](https://github.com/antononcube).

### Videos

[AAv1] Anton Antonov, ["TRC 2022 Implementation of ML algorithms in Raku"](https://www.youtube.com/watch?v=efRHfjYebs4&t=1293s), (2022), [YouTube/@AAA4Prediction](https://www.youtube.com/@AAA4prediction).   
  *(Part of the presentation discusses the minimalistic data wrangling approach introduced in [AA1].)*

[AAv2] Anton Antonov, "Small dataset data analysis walkthrough (Raku)", (2024), [YouTube/@AAA4Prediction](https://www.youtube.com/@AAA4prediction).