These course materials are designed for an introduction to computation and working with data. They will include organized example datasets, slides, example scripts, and notebooks.
Since I'm most comfortable programming in R, this class will mostly be taught using R.
This is definitely a work in progress; if you have thoughts, suggestions, or additions, please open a GitHub issue.
The topics below are loosely grouped by theme but otherwise in no particular order. As lectures come together I'll regroup them by slide deck. Each group of topics will have a set of slides with an associated set of example scripts and data.
- Organizing and structuring datasets
- Tidy data frameworks and key-value pairs
- Web and other non-tabular formats
- Visual checks and identifying mistakes
- Unit tests and assertions for data
- Safe joining and merging of datasets
- A layered grammar of graphics, ggplot2, etc.
- Interactive plotting tools (plotly, highcharts)
- JavaScript libraries, htmlwidgets (DT, leaflet)
- Simple interactive web applications with shiny
- Structuring your project as a pipeline
- Notebook tools for exploration and writing (e.g. Jupyter, R Markdown)
- Documentation tools, managing requirements
- Version control and collaboration with Git
- Writing and documenting code
- Unit testing
- Functional programming
- Continuous integration tools
- Working with APIs
- Web data structures (JSON & XML)
- Developing packages?
Here are other good guides and tools people have put together. Most of my material is either learned from or conceptually indebted to these resources. If I've missed something, please submit an issue or pull request!
Learning R:
Specific tools:
- ggplot2 website
- pipes with magrittr
- Introduction to data.table, the best software for working with data in memory
- plotly for interactive visualization
- Time series with xts
Reproducible Research Frameworks:
- Intelligent project structure made easy
- The all-in-one solution for ensuring reproducibility
- Brilliant notebooks for shareable analysis
Coding resources:
Misc:
The content of this project itself is licensed under the Creative Commons Attribution 4.0 license, and the underlying source code used to generate that content is licensed under the MIT license.