# Languages for Data Analysis


### Douglas Bates
### Emeritus Professor
### Dept. of Statistics
### U. of Wisconsin-Madison

## For the impatient

- Much of this talk will be about the [Julia](https://julialang.org) programming language

- For an overview of [Julia](https://julialang.org) for econometrics, see [quantecon.org](https://quantecon.org)
    - The [lectures](https://lectures.quantecon.org/) tab provides a [Python](https://python.org) version and a [Julia](https://julialang.org) version
    - The [About](https://lectures.quantecon.org/about_lectures.html) tab is a very good discussion of why these two languages were chosen.

## More Julia resources

- The original developers formed a company, [Julia Computing](https://juliacomputing.com), to provide consulting services and oversee the development of the language.
    - the [case studies](https://juliacomputing.com/case-studies/) section is of particular interest

- Registered Julia packages are at [pkg.julialang.org](https://pkg.julialang.org)
    - note that Julia packages are [git](https://en.wikipedia.org/wiki/Git) repositories, usually housed on [github.com](http://github.com)
    - see also [Julia Package Ecosystem Pulse](https://pkg.julialang.org/pulse.html)

- Documentation for the language itself is at [docs.julialang.org](https://docs.julialang.org)

## Language shapes the way we think - Benjamin Lee Whorf

- I have been doing data analysis programming for a long time
    - took my only computing course in 1967 - didn't like it
    - in the 70's wrote Fortran code that I hope no one ever discovers
    - late 70's read Kernighan and his co-authors, saw programming differently
    - was able to use more sophisticated languages, almost by accident, in thesis research
    - some SPSS and SAS use, Minitab for teaching
    - heard about the work on S by John Chambers and others at Bell Labs 

## What was different about S?

- an interactive language (REPL - read-eval-print-loop)

- functional language (in the sense of defining and calling functions)

- heterogeneous, self-describing, recursive, extensible data structures
    - contrast with flat-file structures in SPSS and SAS, and vectors (columns) in Minitab

- random-access memory based, not a filter
    - required a "second megabyte of memory"

- explicit interfaces to compiled code
    - original versions were more of a wrapper around numerical and graphics code bases
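
As a loose analogy for "heterogeneous, self-describing, recursive, extensible" (this is Python, not S code), the sketch below holds a model-fit result as one nested object. The name `fit`, its components, and the helper `describe` are all hypothetical, chosen only for illustration.

```python
# Hypothetical result of a model fit, held as one nested object.
# Components differ in type (string, list of floats, nested dict),
# and each is reachable by name, so the object describes itself.
fit = {
    "call": "y ~ x",
    "coefficients": [1.5, 0.25],
    "residuals": [0.1, -0.2, 0.1],
    "model": {"terms": ["(Intercept)", "x"], "nobs": 3},
}

# Extensible: a new component can be attached after the fact.
fit["r_squared"] = 0.9


def describe(obj, name="fit", depth=0):
    """Return one summary line per component, recursing into nested dicts."""
    pad = "  " * depth
    if isinstance(obj, dict):
        lines = [f"{pad}{name}: dict with {len(obj)} components"]
        for key, value in obj.items():
            lines.extend(describe(value, key, depth + 1))
        return lines
    if isinstance(obj, list):
        return [f"{pad}{name}: list of length {len(obj)}"]
    return [f"{pad}{name}: {type(obj).__name__}"]


print("\n".join(describe(fit)))
```

A flat file would need a separate table, or a separate file, for each of these components; here they travel together and can be inspected by walking the structure.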

## What was the same with S?

- explicitly designed for data analysis and graphics

- provision for missing data was built in at a very low level

- internally, data structures were always vectors, possibly with "attributes"
    - "lists" are actually vectors of pointers, not linked lists as in Lisp

- "semi-"proprietary software
    - AT&T couldn't market it (or the Unix operating system)
    - some universities became "beta-test sites"
    - U. of Washington Stats. Dept. spun off "StatSci" and marketed S-PLUS
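
The points above about built-in missing data and "vectors, possibly with attributes" can be sketched in Python. This is an illustration under stated assumptions, not how S is implemented: the class `AttrVector`, the `na_rm` flag, and the use of `None` as the missing-value marker are all hypothetical. The `dim` attribute and column-major lookup follow the convention R uses for matrices.

```python
from dataclasses import dataclass, field


@dataclass
class AttrVector:
    """A flat list of values (None marks a missing entry) plus named attributes."""
    values: list
    attributes: dict = field(default_factory=dict)

    def mean(self, na_rm=False):
        """Mean of the values; a missing entry makes the result missing unless na_rm is True."""
        present = [v for v in self.values if v is not None]
        if not present or (len(present) < len(self.values) and not na_rm):
            return None
        return sum(present) / len(present)

    def element(self, i, j):
        """Zero-based (row, column) lookup using the 'dim' attribute and column-major order."""
        nrow, ncol = self.attributes["dim"]
        if not (0 <= i < nrow and 0 <= j < ncol):
            raise IndexError("index out of bounds")
        return self.values[i + j * nrow]


heights = AttrVector([1.5, None, 2.5], {"names": ["a", "b", "c"]})
print(heights.mean())            # the missing value propagates to the result
print(heights.mean(na_rm=True))  # mean of the observed values only

# A "matrix" is the same flat vector with a dim attribute attached.
m = AttrVector([1, 2, 3, 4, 5, 6], {"dim": (2, 3)})
print(m.element(0, 1))
```

The design point is that one storage layout serves everything: names, dimensions and the like are metadata hung on a vector, and every operation has to decide what a missing entry means.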

## The 90's and R

- open-source software (Perl, Python, Emacs, GNU, MySQL, PostgreSQL) gained traction in some areas
    - servers were often characterized as LAMP (Linux, Apache, MySQL, Perl/Python)
    - first releases of both Python and Linux were in 1991

- mid-90's Ross Ihaka and Robert Gentleman started work on a language "not-unlike S"
    - S-PLUS had been ported to DOS/Windows but not to Macintosh
    - essentially a "clean room" reimplementation of the S language
    - Martin Maechler contributed so many patches they gave him an account
    - Martin encouraged release under the GPL
    - others joined the party; in 1997 R-Core was formed

## The 90's and CRAN

- Kurt Hornik and Fritz Leisch, then at TU-Wien, created the "Comprehensive R Archive Network" or CRAN
    - patterned after CTAN (TeX) and CPAN (Perl) archive networks

- encouraged users to become developers and researchers to provide a reference implementation of their methods

- promoted the development of package standards, testing etc.

- one important facility added later (Uwe Ligges) was "Win-builder" to create binary Windows packages
    - most developers were on Linux, most users on Windows

## The aughts - R flourishes

- initially R was considered to be a "poor man's S-PLUS"

- many people felt that "Freeware" couldn't possibly be as good as commercial software

- eventually the open-source model was recognized as producing high-quality code
    - Eric S. Raymond, "Given enough eyeballs, all bugs are shallow."

- CRAN went from 10's to 100's to 1000's of packages

- useR! conferences started; *R News*, later to become *The R Journal*, was founded

- papers, books and online resources became available

## The aughts - other languages enter the mix

- S and R were designed for data analysis and graphics, not general purpose programming
    - as Ross said, "we thought maybe a couple of hundred people total would use it"

- every language is a trade-off between pre-defined, high-level operations and low-level building blocks
    - R favors high-level tools: vectors, matrices and data frames are okay; scalars, loops and low-level logic not so much
    - the number of internal data types is surprisingly small (32-bit integers, 64-bit floats and character strings)

- languages like Python provided more flexibility but without the data-science-specific structures
    - required add-ons like numpy, pandas, etc.
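
A small sketch of that trade-off, using only built-in Python (no numpy or pandas): on plain lists `+` means concatenation, so elementwise arithmetic has to be assembled from low-level pieces. The helper `vadd` is hypothetical, written only for this illustration.

```python
x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]

# On built-in lists, + means concatenation, not arithmetic.
concatenated = x + y


def vadd(u, v):
    """Add two equal-length sequences element by element."""
    if len(u) != len(v):
        raise ValueError("lengths differ")
    return [a + b for a, b in zip(u, v)]


# Elementwise addition has to be spelled out explicitly.
elementwise = vadd(x, y)

print(concatenated)
print(elementwise)
```

Add-ons such as numpy supply an array type on which `+` is elementwise, which is the kind of high-level operation that S and R provide out of the box.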