Readings

François Briatte edited this page May 3, 2024 · 88 revisions

Below are all course readings, videos and other recommended resources, as a single list. Those are sent, with comments, through weekly course emails. I did not find the time to write a short reading list, so I wrote a 22-page one.

Introductions to the material (read those first) → Handbooks · Videos · Optional

Jump to a session:
1 (Software) · 2 (Workflow) · 3 (Data) · 4 (Visualization) · 5 (Description) · 6 (Association)
7 (Correlation) · 8 (Regression) · 9 (Nonlinearity) · 10 (Surveys) · 11 (Classification) · 12 (Extensions)
(You can also use the sidebar on the right to reach specific sections.)

Special bonus sections:
Help with R · R stuff in French · SQL · Web scraping · Learning more R · Keeping up with R · Varia

Note on availability: almost all resources below are free to read online, in one form or another. In a few exceptional cases, I will supply the readings on Google Drive.

Handbooks

Early in the course, we rely on R-focused handbooks, all of which are readable online, and on a single, short introductory statistics textbook:

  • Gerring and Christenson, Applied Social Science Methodology (Cambridge University Press, 2017)

    This handbook does not contain any R code, but it provides the simplest and most concise introduction to all statistical topics up to linear regression. Assigned to Sessions 5, 6, 7 and 8.

  • Irizarry, Introduction to Data Science (CRC Press, 2022)

    This handbook is our ‘primary’ R handbook in the first course sessions. The book has been split into two online books, Introduction to Data Science and Advanced Data Science, which we will both use. Assigned to Sessions 1, 2, 3, 4 and 5. First edition also available in Spanish.

  • Ismay and Kim, Statistical Inference via Data Science. A Modern Dive into R and the tidyverse (CRC Press, 2023)

    This handbook is our ‘secondary’ R handbook in the first course sessions. Assigned to Sessions 1, 2, 3, 4 and 5 (both as compulsory and optional).

  • Healy, Data Visualization. A Practical Introduction (Princeton University Press, 2019)

    This handbook covers various aspects of plotting things with R and the {ggplot2} package. Assigned to Sessions 4, 8 and 12.

  • Wickham et al., R for Data Science (2nd ed., O'Reilly, 2023)

    This is the ‘master’ R handbook to use throughout the course whenever needed, as a reference for both the R language and the {tidyverse} functions. Assigned to Sessions 3 and 4, but meant to be used in all subsequent sessions too.

Later in the course, we continue with R handbooks, but also turn to more research-focused ones for their statistical parts:

Videos

All sessions will come with a handful of videos to watch, if that's something that works for you. The main sources that you will encounter are:

  • Bail, SICSS Boot Camp and SICSS 2020

    Videos from the Summer Institute in Computational Social Science (SICSS). The 'Boot Camp' is a series of introductory videos on R and RStudio (cited in Sessions 1, 2, 3 and 4). The 'SICSS 2020' series is an entire summer school, which is cited in Session 12 for its introduction to text analysis.

  • Rooduijn et al., Basic and Inferential Statistics

    An excellent introductory statistics online course, from the University of Amsterdam. Cited in Sessions 5, 6, 7 and 8.

  • Pew Research Center, Methodological Research

    This institute has published some very accessible 'Methods 101' videos on specific topics. Cited in Sessions 10, on surveys, and 11, on machine learning.

  • Robinson, Tidy Tuesday R Screencasts

    A series of live R-coded screencasts that show how to perform exploratory data analysis on real-world datasets, with lots of data visualization through the {ggplot2} package. Cited in Session 4.

  • Silge, Tidy Tuesday R Screencasts

    Another series of screencasts showing how to analyse real-world datasets, with many examples of how to run machine learning algorithms through the {tidymodels} package bundle. Cited in Sessions 8, 9, 11 and 12.

  • Starmer, StatQuest

    Hundreds (literally) of very understandable videos on basic statistics and machine learning. Cited in Sessions 5, 7, 9 and 11.

  • El Khadir, Visually Explained

    Short explanatory videos that show the logic behind many machine learning techniques. Cited in Sessions 8 and 12.

Optional

Each session has a list of optional resources that will often go further than what we covered in class.

Some sessions will also recommend using R cheatsheets, although some students have reported that those can be overwhelming.

Many more helpful resources, in multiple languages, are listed on the Rzine website, and the final session contains bonus sections of where to find more R resources.


1 (Software)

Handbooks

The chapters below cover essentially the same topics. Read at least one of them, and explore the contents of all handbooks if you are curious about what is to come.

Videos

The videos for this session do not involve any R code.

  • Bail: ‘Installing R and RStudio’

    If needed. Also covers RStudio basics.

  • Rosling, ‘The best stats you’ve ever seen’ (TED, 2007)

    An inspirational video that you should definitely watch to understand why some people like me love stats (and plots). The screenshot in the slides is from another, similar video (BBC Four, 2010). You can also check the Gapminder website, and the related book, Factfulness. Rest in peace, Hans.

  • Tierney, ‘Statistics for Journalists’ (2013)

    A 30-minute video that covers the absolute basics about ‘stats and numbers’: big and little numbers, surveys (and polls), averages, uncertainty, p-values, correlation and causation, rare events, risk, and more. Should be compulsory watching for every student (journalist, policymaker, expert) around the world.

Bonus: Help with R

I predict that there will be a point where the course material will feel like it does not provide the right answer to your questions on how to use R to do specific things. When you reach that point, do the following:

  • Check the course material, again.

    The code provided in each session covers a lot of use cases, and the readings also cover a lot of ground. Also note that you can search the entire course by keyword online on GitHub (example search).

  • If your question is about a specific function or package, check the R help pages.

    The help pages are very technical: they have to cover all function arguments, and might be overwhelming, which is why it might be best in many cases to look for package vignettes. Go, for instance, to the {tidyverse} website, and check the ‘Get started’ page for the {dplyr} package.

  • Google searches will often lead you to Stack Overflow, where many users have asked R-related questions.

    Using the R code from the answers that you will find there might be a double-edged sword, as it might lead you to use more base R syntax than we will do in class, or even use different R packages (after installing them). This can be more overwhelming than helpful, but R is a language, and there are many ways to express yourself in it, so feel free to go that way if it helps you find solutions to our exercises.

Bonus: R stuff in French

This course is 100% in English, but its wiki has a page with stuff in French. The Rzine website also links to resources in other languages, like Spanish.

Optional

  • Briatte, Going with Python (2023)

    The wiki page where I provide a few pointers on how to learn data science with Python. We will not use Python at all in class, but RStudio can run Python, and it might be useful to some of you to learn some Python later on, especially if you are interested in things like Machine Learning (ML), Natural Language Processing (NLP) or Web scraping. We will come back to that in our last session.

  • Huntington-Klein, Library of Statistical Techniques (LOST) (c. 2023)

    A website that shows you how to code various things in Python, R and Stata. Useful as a reference guide, and as an introduction to the main topics that we will cover together in this course, namely: data manipulation, data visualization and linear regression (ordinary least squares), plus a quick look at logistic regression, geospatial tools and a few more things.

  • McCullough and Yalta, ‘Spreadsheets in the Cloud – Not Ready Yet’ (Journal of Statistical Software, 2013)

    This article explains one of the reasons why we are using statistical software for this course, instead of a spreadsheet editor. It's not just that spreadsheets are error-prone, that their workflows are mostly irreproducible, and that they have caused a huge amount of mistakes in the past: it's also that they have a history of being computationally inaccurate. Statistical software is more versatile, but also more numerically reliable, than spreadsheet editors.

2 (Workflow)

Handbooks

  • Irizarry, ch. 2: ‘R basics’ (again)

    This chapter covers basic R syntax, with all its oddities. Treat R as a language: do not expect to learn it in a single week!

  • Irizarry, ch. 4: ‘The tidyverse’ (up to Section 4.8)

    This chapter introduces the main bundle of packages that we will be using throughout the course.

  • Irizarry, ch. 5: ‘Importing data’

    This chapter anticipates our next session. The essential concept to take away from it at this stage is that of setting the working directory, which R needs in order to find your files (your datasets).

Videos

  • Bail: ‘R basics’

    A very easy-to-follow introduction to the core mechanics of R. Think of it as the ‘greetings’ lesson that every language course starts with: ‘hello, my name is, I am x years old, what is your name?’ and so on.

Cheatsheets

Compulsory:

  • RStudio IDE

    This cheatsheet will show you the many different parts of RStudio. You will not need to use even 10% of the software in class: just focus on setting the working directory, opening scripts, and executing code. Some keyboard shortcuts will be very useful for that: learn them as soon as possible.

  • Base R

    This cheatsheet documents the ‘base R’ syntax, which you will need to understand well enough to manipulate R objects like data frames. This course will teach you the basics. Remember to ask for explanations in class if you do not understand the syntax of a function: I need you to tell me when you need help!

Optional:

  • The multiple syntaxes of R

    This cheatsheet will show that R actually has three sub-syntaxes: ‘base R’ (with lots of hard[brackets] and $ signs), ‘formula’ syntax (of the y ~ x1 + x2:x3 form), and ‘tidy’ syntax, with lots of %>% pipes. We will use all three syntaxes in class, but will give priority to tidy syntax when possible.

  • Stata to R

    A cheatsheet for Stata users. Obviously recommended only if you are proficient enough in Stata to find this useful. For more advanced users: see also stata2r.github.io, which explains how to translate Stata into R using the {data.table} (for data manipulation) and {fixest} (for regression models) packages.

Optional

3 (Data)

Handbooks

  • Ismay and Kim, ch. 3: ‘Data wrangling’

    A simple introduction to the main ‘verbs’ (functions) of the {dplyr} package, in very similar fashion to what we did in class.

  • Irizarry, ch. 4: ‘The tidyverse’ (up to Section 4.8)

    This chapter was already assigned last week -- if you have not yet read it, do so now! It covers the {tidyverse} package bundle and shows how to subset (filter), aggregate (group_by) and summarise (summarise) your data, using functions from the {dplyr} package.

  • Irizarry, ch. 13: ‘Joining tables’

    Read this chapter to learn everything you need to know about a very common data operation: merging (or ‘joining’) datasets. This will be very helpful, very soon. The chapter also uses the {dplyr} package and its ‘join’ functions, which are inspired by the SQL language.

The next handbook readings come from the R for Data Science handbook, by some of the authors of the {tidyverse} package bundle. I am listing them here primarily for future reference, by which I mean, know that those chapters exist if you need (or rather, when you will need) help later with data management:

Furthermore, the following chapters might also be useful, for dealing with specific aspects of data management. All of them come from the ‘Transform’ section of the handbook, which has even more to offer:

Cheatsheets

  • Data import

    An overview of how to read common data formats into R. Do not worry too much about this: most of the data that we will use in class will come from CSV and TSV datasets, which are easy to read with the {readr} package. We will also use Stata and possibly SPSS datasets, which can be read with the {haven} package. Last, we will very occasionally read spatial data formats with the {sf} package.

  • Data transformation

    An overview of ‘tidyverse’ functions to perform data wrangling. Do not feel overwhelmed: you will learn many of those on the fly, as we go. Just remember this cheatsheet exists if you need a quick guide to ‘how to do x to a dataset’ (or more precisely, a ‘data frame’ or a ‘tibble’) in R.

Some cheatsheets will come in handy when you face special data formats, such as:

Videos

  • Bail: ‘Data wrangling’

    This video will introduce you to some of the functions available via the {dplyr} package, which is part of the {tidyverse} package bundle. The video is simple enough to reinforce your understanding of R basics, e.g. loading packages.

  • Bail: ‘Data visualization’

    You might have noticed that I have slipped a few plots in the course material. I usually build those with the {ggplot2} package, which comes with its own visualization syntax. This video provides a good example of how to use this package.

    As said in the previous section, visualization will be covered more at length in our next workshop.

  • There are a few more videos mentioned towards the end of my slides.

    The total runtime of those videos is too long for you to watch them all, but just like for the (many) readings above, you should note that they exist, that they are available for future reference, and that you will still be learning data wrangling by the end of this course.

Optional

  • Broman and Woo, ‘Data Organization in Spreadsheets’ (The American Statistician, 2018)

    Free to read online. While spreadsheet editors are not suitable for analysing data (see the McCullough and Yalta 2013 reading from Session 1), organising data within spreadsheets is a different topic. This article covers the basics of how to do so in order for the result to be ‘machine-readable’ (i.e. understandable by a computer for import).

  • Elff, Data Management with R (Sage, 2020)

    A full book on the topic of data wrangling, with many ‘notebooks’ to illustrate how to handle survey, spatial and text data in R. Cited in the slides. I am citing this book for reference: it actually covers more than we need (for now), and you have limited time, so you might prefer sticking to the other readings.

  • Weidmann, Data Management for Social Scientists (Cambridge University Press, 2023)

    Free to read online (open access). A book that touches all bases, from using spreadsheets to relational databases, with chapters on special (spatial, text, network) data types. Uses R, of course.

  • Wickham, ‘Tidy Data’ (Journal of Statistical Software, 2014)

    Free to read online. A paper that explains why it makes sense to strive for ‘tidy’ data, which we will cover in class. (A related argument is that ‘tidy’ data allows for ‘split-apply-combine’ operations, but that's less central to our goals right now.)

Bonus: SQL

Special section for those of you who want to understand how to use SQL, within R or on its own.

  • Baumer et al., ch. 15: ‘Database querying using SQL’

  • Baumer et al., ch. 16: ‘Database administration’

    The two chapters above cover a lot of SQL basics, and how to implement them from within R. The first chapter uses its own SQL source, but the second one explains how to create database connections.

  • Posit/RStudio, Best Practices in Working with Databases

    An absolute must-read if you are going to use databases (DBs) through RStudio, which has been extended with many features with DB users in mind.

  • Wickham et al., ch. 22: ‘Databases’

    From the handbook that you have already started reading a lot from. Introduces the two key packages beyond {dplyr} to work with databases in R, the {DBI} package, which contains ‘drivers’ to handle database connections, and the {dbplyr} package.

Bonus: Web scraping

If you hear or read about Web scraping during this class and are interested in learning more on how to get data from the Web into R, start with the following:

  • Bail: APIs

  • Bail: Web scraping

    Two short videos that introduce their respective topics very well, in case you just want a very quick overview of what lies ahead.

  • McCrain, RSelenium Tutorial (2020)

    An online tutorial on how to use the {RSelenium} package to scrape complex Web pages, where the user needs to click on elements of the Web page to access some of its content. This aspect of Web scraping relies on ‘headless browsing’ (see the description of the resource below).

  • Pittard, Web Scraping with R (online book, 2022)

    An online book that covers the essentials, including working with APIs. Very well-illustrated, but fairly limited on a key topic, ‘headless browsing’ with packages like {RSelenium}, which is a way to use a Web browser programmatically in order to render JavaScript and get data from complex Web pages (see this example for a demo, as well as the McCrain tutorial above).

4 (Visualization)

Handbooks

An alternative to the chapters above is Healy's Data Visualization handbook, which goes a bit further into how to use {ggplot2} effectively:

The Healy handbook even has a chapter on maps, which we only touched upon and will come back to at the end of the course:

Last, note that there is also an entire ‘Visualize’ section in the Wickham et al. handbook:

Cheatsheets

Videos

  • Bail: ‘Data wrangling’

    This video was already assigned last week. It is still worth watching now if you did not do so last week, as it covers the basics of data wrangling.

  • Bail: ‘Data visualization’

    This video was also already assigned last week. It shows the same kind of operations that we did interactively ('live') in class.

  • Robinson: ‘Analyzing the Kenya census in R’

    A screencast example of how to use {ggplot2} to explore a dataset, including through maps.

  • Robinson: ‘Analyzing deforestation in R’

    Another good screencast example of how to explore a survey, including through maps.

Optional

  • Chang, R Graphics Cookbook (O'Reilly, 2nd ed., 2023)

    Free to read online. A very good go-to reference handbook for producing common plots (bar plots, line graphs, etc.) with the {ggplot2} package, with dozens of examples.

  • Emaasit, {ggplot2} extensions (c. 2023)

    A gallery of {ggplot2} extensions, that is, R packages that have been built on top of the package to facilitate various types of plots. See also Erik Gahner Larsen's awesome {ggplot2} list for a nice selection of themes and color palettes to use with it.

  • Heiss, Data Visualization with R (2023)

    An entire course that covers the core principles of graphic design, and how to apply them with {ggplot2}. I do not usually recommend courses in this section, because there is a special wiki page for that, but this course is exceptionally good, with highly informative slides, video-recorded examples, and lots of good code and other resources.

  • Holtz, R Graph Gallery (c. 2023)

    Many different types of plots, all coded in R, mostly with the {ggplot2} package. Cited in the slides.

  • Munzner, Visualization Analysis and Design (CRC Press, 2014)

    An excellent book on visualization ‘theory’ -- the fundamentals on how data abstraction works. The website linked to above has an entire online course on the topic, plus many shorter talks, all of which are, like the book, extremely clear and well-illustrated. Tangential to the course, but highly recommended.

  • Tufte, The Visual Display of Quantitative Information (2nd ed., Graphics Press, 2001)

    A beautiful treatise on data visualization. ‘VDQI’ is the kind of book that will make you fall in love with a topic, and that you will keep mentioning over the years on every possible occasion. Tangential to the course, but highly recommended.

  • Wickham, ‘A Layered Grammar of Graphics’ (Journal of Computational and Graphical Statistics, 2010)

    Hadley Wickham is the main author of the {ggplot2} package, which implements the ‘grammar of graphics’ logic in R. This article explains what that grammar, which was designed by Leland Wilkinson, actually is.

5 (Description)

Handbooks

The next sessions will feature more handbook readings than the previous ones. Here are a few relevant chapters from the handbooks that we have already used:

  • Irizarry, Advanced Data Science, ch. 1: ‘Summary statistics’

    This chapter covers summary statistics (mean, median, percentiles etc.) and their relationship to density curves, boxplots, and the normal distribution. The next chapter also explains methods to identify outliers.

  • Irizarry, Advanced Data Science, ch. 5: ‘Random variables’

    This chapter explains how the theoretical properties of random variables and probability distributions can be leveraged to estimate population statistics from samples. The key concept that connects them is the standard error (SE). The chapter covers the Central Limit Theorem (CLT) and the Law of Large Numbers (LLN), which explain why we strive for large sample sizes (large ‘N’) when we collect data.

  • Irizarry, Advanced Data Science, section on ‘Statistical inference’

    This is a series of chapters that cover estimation (going from ‘sample’ to ‘population’), confidence intervals, p-values, and much more. Read whatever you need to catch up on the topics mentioned in class.

  • Ismay and Kim, Appendix A: ‘Statistical Background’

    This reading is presented as a glossary. It covers basic summary statistics (mean, median, etc.), the normal distribution, and log10 transformations. We use log-transformations at several points in this course, so take a look at that part if you are not familiar with them.
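
To make the link between the standard error, sample size and the Central Limit Theorem (from the Irizarry chapter on random variables above) more concrete, here is a minimal simulation sketch. It is written in Python rather than R, using only the standard library, and everything in it is an assumption made for illustration: the exponential ‘population’, the sample sizes, and the helper name `sample_means` do not come from the course material.

```python
import random
import statistics


def sample_means(population, n, draws, rng):
    """Draw `draws` samples of size `n` (with replacement) and return their means."""
    return [statistics.fmean(rng.choices(population, k=n)) for _ in range(draws)]


rng = random.Random(42)  # fixed seed, so that reruns give the same numbers

# A deliberately skewed, made-up population (exponential draws):
# the CLT does not require the underlying data to be normally distributed.
population = [rng.expovariate(1.0) for _ in range(20_000)]
sigma = statistics.pstdev(population)  # population standard deviation

results = {}
for n in (10, 100, 1000):
    means = sample_means(population, n, draws=1000, rng=rng)
    observed_se = statistics.stdev(means)  # spread of the simulated sample means
    theoretical_se = sigma / n ** 0.5      # SE = sigma / sqrt(n)
    results[n] = (observed_se, theoretical_se)
    print(f"n = {n:>4}: observed SE = {observed_se:.3f}, "
          f"sigma / sqrt(n) = {theoretical_se:.3f}")
```

The two columns should track each other closely, and both shrink as n grows, which is the practical reason for wanting a large ‘N’.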

At this stage, I also want to introduce another handbook, Llaudet and Imai's Data Analysis for Social Science, which goes into more detail on the statistical side of things, and which also includes lots of R code examples. I recommend reading the following chapters:

  • Llaudet and Imai, ch. 3: ‘Inferring population characteristics via survey research’
  • Llaudet and Imai, ch. 6: ‘Probability’
  • Llaudet and Imai, ch. 7: ‘Quantifying uncertainty’

(The chapters are on Google Drive.)

Videos

  • Starmer: ‘Statistics Fundamentals’

    Throughout this course, I will mostly recommend watching the Rooduijn et al. videos on various topics. However, if those do not work for you for any reason, you might be able to ‘fall back’ on this other source, which covers very similar ground. Make a note of it, as I will not systematically link to it in the next few weeks.

Optional

  • Gerring and Christenson, ch. 18: ‘Univariate statistics’

  • Gerring and Christenson, ch. 19: ‘Probability distributions’

    (On Google Drive.) A no-code introduction to descriptive statistics and distributions. Recommended if you are looking for something short: each chapter is less than 10 pages, with many graphs.

  • Ismay and Kim, ch. 7: ‘Sampling’

    A full dive into how sampling works, and what we can derive from it. Ends on an interesting section on polls, which is a bit more developed than what you might have read on that topic in Irizarry's handbook.

  • Ismay and Kim, ch. 8: ‘Bootstrapping and confidence intervals’

    This chapter goes deeper into the core topics of the session. It specifically covers bootstrapping, a method that involves resampling the data in order to produce ‘bootstrapped’ confidence intervals. To see how to use those in association tests, take a look at Appendix B of the book, ‘Inference Examples’.
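
As a rough illustration of the resampling idea described in the bootstrapping chapter above, here is a minimal sketch of a percentile bootstrap confidence interval for a mean. It is written in Python (standard library only) rather than with the {infer} package that the handbook uses, and it rests on assumptions of mine: the `bootstrap_ci` helper is hypothetical, the data are made up, and the percentile method is only one of several ways to build a bootstrapped interval.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.fmean, reps=2000, level=0.95, seed=1):
    """Percentile bootstrap confidence interval for `stat` (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample the data WITH replacement, at the original sample size,
    # and recompute the statistic on every resample.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(reps))
    # Cut the same number of estimates from each tail of the sorted list.
    tail = int((1 - level) / 2 * reps)
    return estimates[tail], estimates[reps - 1 - tail]


# Made-up sample, for illustration only
sample = [12, 15, 9, 22, 17, 14, 30, 11, 16, 19, 13, 25]

low, high = bootstrap_ci(sample)
print(f"sample mean = {statistics.fmean(sample):.2f}")
print(f"95% bootstrap CI = [{low:.2f}, {high:.2f}]")
```

Passing another function as `stat` (for example `statistics.median`) gives an interval for that statistic instead, which is much of the appeal of the method.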

6 (Association)

Handbooks

For this week, I want to recommend fewer readings, from two handbooks that I have not recommended so far:

  • Gerring and Christenson, ch. 20: ‘Statistical inference’

  • Gerring and Christenson, ch. 21: ‘Bivariate statistics’ (up to p. 322)

    (On Google Drive.) Read those if you have never studied association tests before, or if you need a refresher on how they work. The chapters are very short, to the point, and well illustrated. Focus on understanding what statistical significance really means.

  • Imai, Quantitative Social Science, ch. 7: ‘Uncertainty’

    (On Google Drive.) This chapter covers the same theoretical points as those above, and adds a bunch of useful examples with R code.

Videos

Optional

  • Gelman et al. Regression and Other Stories (Cambridge University Press, 2020), ch. 3 (‘Some basic methods in mathematics and probability’) and 4 (‘Statistical inference’)

    A fuller treatment of the basics: weighted averages, logarithms, probability distributions, standard errors, statistical significance and hypothesis testing. The authors are rightfully hostile to a lot of the language used to describe p-values and errors: read the second chapter to understand why (or read this 4-page article on the same topic). The rest of the book covers regression modelling in depth, using Bayesian inference: I will recommend it again (as an optional reading) in due time.

  • Greenland, ‘Connecting Simple and Precise p-values to Complex and Ambiguous Realities’ (Scandinavian Journal of Statistics, 2023)

    The truth about the misuse of p-values. Another good treatment of the issue is Imbens, ‘Statistical Significance, p-Values, and the Reporting of Uncertainty’ (Journal of Economic Perspectives, 2021). And although you probably do not need an entire book on the matter, it exists.

  • Ismay and Kim, ch. 9: ‘Hypothesis testing’

    This chapter is interesting because it shows how to use the {infer} package to perform association tests from bootstrapped statistics (see the optional readings of the previous session for an introduction to those). The chapter also contains a helpful section on interpreting hypothesis tests.

  • Lindeløv, Common statistical tests are linear models (2019)

    This link shows you that what we are covering right now in class can in fact be completely subsumed into our later topics: linear (and nonlinear) models. This is why, next week, I will very briefly cover linear correlation, and then skip directly to simple linear regression: because correlation is just the standardized coefficient in a bivariate regression model.
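
The claim that closes the Lindeløv entry above (correlation is the standardized coefficient in a bivariate regression) can be checked numerically. The sketch below does so in Python, with the standard library only; the data are made up, and the helper names (`standardize`, `ols_slope`, `pearson_r`) are hypothetical ones chosen for this illustration, not functions from the course or from any package.

```python
import math
import statistics


def standardize(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    m, s = statistics.fmean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]


def ols_slope(x, y):
    """Least-squares slope of y regressed on x (single predictor)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def pearson_r(x, y):
    """Pearson correlation coefficient, computed from its definition."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up data, for illustration only
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 7.4, 7.6]

r = pearson_r(x, y)
raw_slope = ols_slope(x, y)
std_slope = ols_slope(standardize(x), standardize(y))

print(f"Pearson r          = {r:.6f}")
print(f"standardized slope = {std_slope:.6f}")
print(f"raw slope          = {raw_slope:.6f}")
```

The first two printed values should coincide (up to floating-point error), while the raw slope differs because it is expressed in the original units of x and y: multiplying it by sd(x) / sd(y) recovers the correlation.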

7 (Correlation)

Handbooks

I am going to assume that you have some late readings to do from the past few weeks (such a passive-aggressive way to tell you to do your homework…), and will therefore only recommend reading a few handbook pages for this session:

  • Gerring and Christenson, ch. 21: ‘Bivariate statistics’ (p. 322-sq.)

    (On Google Drive.) You really do not need to read more on the topic than this, as we will very soon delve into something much more powerful and interesting. The important thing to take away from that part of the course is the logic behind statistical significance, which you already got last week from reading some earlier pages out of Gerring and Christenson’s handbook.

Videos

  • Rooduijn et al., Basic Statistics Module 2: ‘Correlation and regression’

    Like our class, this module quickly jumps from linear correlation to (simple) linear regression, which we will come back to next week. Video 2.2 is the one that focuses strictly on correlation, in 7 minutes.

Optional

  • Bueno de Mesquita and Fowler, Thinking Clearly with Data. A Guide to Quantitative Reasoning and Analysis (Princeton University Press, 2021)

    (On Google Drive.) If you have to read a single book on the topic of correlation and causation, read this one. Correlation is the topic of only three (excellent) chapters. The rest of the book covers 90% of what you might want to learn about causal inference: regression, samples, (randomised) experiments, regression discontinuity designs, and differences-in-differences. Oh, and there's a bonus section on measurement and quantification at the end.

  • Hijmans, ‘Spatial autocorrelation’

    Very optional reading. It seems to me that it is very often the case that people speak of correlation when looking at either time series (serial correlation, or autocorrelation) or maps (spatial correlation). You might know the former from studying econometrics. This tutorial briefly mentions both, with a focus on the latter, and shows one way to measure it, Moran’s I. We will not cover time series at any point in this course, but we will briefly come back to the topic of spatial analysis later on, even though we will probably not have the time to come back to the specific topic of spatial dependence.

  • Hirschman, ‘Stylized Facts in the Social Sciences’ (Sociological Science, 2016)

    Free to read online. Let’s pause at that stage of the course: why are we interested in relationships between things, and what are we actually looking to extract from those relationships? Read the article, and possibly this comment on the text (and its related article), to get some answers. Trigger warning: contains causal language and some implicit philosophy of science (as do my slides for this session).

  • Rohrer, ‘Thinking Clearly About Correlations and Causation: Graphical Causal Models for Observational Data’ (Advances in Methods and Practices in Psychological Science, 2018)

    Free to read online. For those who are seriously interested in what it means to formulate causal statements from statistical models. The technique (Judea Pearl's directed acyclic graphs, a.k.a. DAGs) is not difficult to understand, but it does require a bit more intellectual effort than just checking p-values after dumping every covariate in the formula.

8 (Regression)

Handbooks

Note that regression models are a vast topic, and that the handbook chapters below will almost certainly not suffice for a full understanding of how they work, or of how to perform all the operations that they require, e.g. dummies, interactions, diagnostics and marginal effects. For those, dig into the optional readings of this session.

  • Gerring and Christenson, ch. 22: ‘Regression’

    (On Google Drive.) The shortest introduction on the topic that I could find. Definitely recommended for everyone.

  • Healy, ch. 7: ‘Work with models’

    This chapter shows how to use the {broom} package to manipulate model results, as we do in class, and shows how to plot their results in great detail. The end covers working with complex survey data, which we will come back to later. For more ways to plot regression coefficients in R, also take a look at the {ggstats} and {dotwhisker} packages.

  • Imai, Quantitative Social Science, ch. 4: ‘Prediction’

    (On Google Drive.) A similar one-chapter treatment of linear regression, with slightly different vocabulary, and some R examples. This is a very long chapter that spans over 60 pages, but you can focus on the key parts, Sections 4.2, 4.3.2 and 4.3.3, which also cover linear correlation, and interaction terms.

Videos

  • El Khadir: ‘Linear Regression in 2 Minutes’

    From the Visually Explained YouTube channel, which is more focused on machine learning than statistics, hence the language used in the video (e.g. ‘features’ instead of ‘variables’). Ignore the language: start with this, to get the ‘geometry’ behind linear regression.

  • Rooduijn et al., Inferential Statistics Module 3: ‘Simple regression’

  • Rooduijn et al., Inferential Statistics Module 4: ‘Multiple regression’

    These modules cover simple regression, with a single predictor (independent variable), and multiple regression, with multiple ones. They cover all the basics you need to know (outside of the R code) on the topic.

Optional

There are a lot of optional readings for this session, but each covers different ground. Go through the list, and choose one or two at most, depending on your interests and your level of familiarity with linear models.

  • Caffo, Regression Models for Data Science in R (Leanpub, 2019)

    Free to download. This handbook covers pretty much everything you need to know on regression models, and comes with videos and coded examples. It contains all essential equations, and two chapters on models with binary and count ‘responses’ (dependent variables), which is why I recommend it if you are already familiar enough with linear models but want to revise them and push a bit further at the same time. Otherwise, go with the Hanck et al. handbook below for a more compact treatment.

  • Gelman et al., Regression and Other Stories (Cambridge University Press, 2020)

    Everything that we do in this course is done through frequentist inference, but this book will show you that there is another way to think statistically: go through it for an introduction to Bayesian reasoning and inference, an advanced topic on which there is also a dedicated page on the course wiki.

  • Hanck et al., ch. 4: ‘Linear regression with one regressor’

  • Hanck et al., ch. 6: ‘Regression models with multiple regressors’

  • Hanck et al., ch. 9: ‘Assessing studies based on multiple regression’

    This handbook covers a bit more ground, and also comes with coded examples. Chapter 9 is particularly useful to understand regression diagnostics. I recommend checking the handbook if you are also taking econometrics, and/or are interested in more advanced modelling than what we manage to do in class.

  • Rodrigues, ch. 6: ‘Statistical models’

    This chapter covers more than just linear regression: it also mentions other models (in Section 6.7), and provides a brief overview of regularization in Section 6.8, and cross-validation in Section 6.9. Read those parts if you are interested in statistical and machine learning.

  • Sanchez and Marzban, ch. 5: ‘Linear regression’

    This reading clearly shows what the intuition behind linear regression is, and also gives you its mathematical foundations in matrix algebra and a few equations. Skip those if you find them too challenging, and focus on the geometrical insights, which are perfectly understandable on their own.

  • Shalizi, The Truth About Linear Regression (2019)

    Free to read online. Delivers exactly what it says on the tin: the truth about linear regression, in 406 detailed pages. A much longer version of the Sanchez and Marzban reading above, one might say, with much harder mathematical parts. Check chapter 13 on regression diagnostics in particular.

9 (Nonlinearity)

Handbooks

  • Hanck et al., ch. 8: ‘Nonlinear regression functions’

    This chapter goes back to nonlinearity as it was mentioned in Session 7, when we introduced linear and nonlinear correlation. It shows how to carry the insights of that session over to regression models, with polynomials and logarithms, and also covers interactions.

  • Hanck et al., ch. 11: ‘Regression with a binary dependent variable’

    Covers the full spectrum of models that you need to read about: linear probability models (and their many issues), logit and probit models (focus on probit if you are an economist, ignore it otherwise), and maximum likelihood estimation (MLE).

  • Li, Appendix: ‘A Brief Introduction to Analyzing Categorical Data and Finding More Data’

    (On Google Drive.) A 20-page case study of how to perform logistic regression in R, using data from the World Values Survey. The next pages focus on finding more data online: you can skip those.

Videos

Optional

  • Boehmke and Greenwell, Hands-On Machine Learning with R (CRC Press, 2020), ch. 5: ‘Logistic regression’

    As the title indicates, this chapter comes from a machine learning book, which means that the vocabulary used will be a bit different than what we have used in class so far. The text will still make sense, at least up to Section 5.5 (and read beyond that section if you are interested in the general logic of machine learning, which we will talk about again later on in the course).

  • Sanchez and Marzban, ch. 25: ‘Logistic regression’

    This chapter explains the mathematical underpinnings of logistic regression. Read it to understand why logistic regression is a generalization of what we have covered earlier with linear regression, and to understand how the estimation of the model differs from OLS.

  • Vegetti, Introduction to Generalized Linear Modeling (GLM) (2017)

    A short course that covers the basics that you will need for this course: logit and probit models (you can skip the parts on probit), Maximum Likelihood Estimation (MLE), and how to interpret (log-)odds, odds ratios and interaction terms. The last set of slides goes beyond our scope by also covering ordinal and multinomial logit. Very much recommended as a 'GLM 101' crash course. Comes with R lab sessions: check out Day 3 in particular.

10 (Surveys)

Handbooks

  • Fugard, Using R for Social Research (2022), ch. 9: ‘Complex surveys’

    A tutorial that got turned into an online book. The rest of the chapters are also very good, but this one is remarkably clear on how to use survey weights properly.

  • Vegetti, Introduction to Survey Statistics (2018)

    A short course that covers survey methods, survey weights and measurement, in 3 sets of compact slides. Very much recommended as a 'Survey 101' crash course. Comes with R lab sessions: check out Day 2 in particular.

  • Zimmer et al., Tidy Survey Book (forthcoming)

    An online handbook on survey analysis with the {survey} and {srvyr} packages. Not yet finished, but already very helpful in its current form.

Videos

Optional

  • McNamara and Horton, ‘Wrangling Categorical Data in R’ (The American Statistician, 2018)

    Free preprint. A lot of survey data come as categorical data. Handling that kind of data in R requires an understanding of factors. This paper covers the basics, using both base R and the {forcats} package, which offers a set of helpful functions to deal with that data type.

11 (Classification)

Handbooks

Note that some of the ‘handbook’ readings below are not from the course handbooks, but from other sources.

  • Baumer et al., ch. 12: ‘Unsupervised learning’

    This chapter covers almost the same topics as we did in class, and contains an interesting example that uses voting data from the Scottish Parliament, which was collected for a senior thesis.

Videos

  • Savage, ‘The Importance of Class in an Age of Inequality’

    This talk is mentioned in my slides because it goes through a classic graph from Bourdieu’s Distinction. The graph is an example of Multiple Correspondence Analysis (MCA), a technique that is useful to reveal latent dimensions, such as social class, in high-dimensional data like survey data on cultural consumption.

  • Starmer: Principal Component Analysis (PCA), Step-by-Step

    A clear 20-minute explainer on what principal components are. The same channel has lots of other videos on the topic, as well as on many other techniques that are commonly used in machine learning for dimensionality reduction.

Optional

  • Sanchez and Marzban, ch. 4: ‘Principal Components Analysis’

  • Sanchez and Marzban, ch. 30: ‘Clustering’

  • Sanchez and Marzban, ch. 31: ‘K-Means’

  • Sanchez and Marzban, ch. 32: ‘Hierarchical Clustering’

    More detailed chapters on the various methods covered in the James et al. reading. Those chapters are very well illustrated, not with R examples, but with actual illustrations that explain the logic followed by the different methods and algorithms covered.

12 (Extensions)

Handbooks

This section is split by topic.

On spatial analysis:

  • Healy, ch. 7: ‘Draw maps’

    This chapter was already mentioned in Session 4 (on visualization). Contains many very well-coded examples.

  • Pebesma and Bivand, Spatial Data Science (2022)

    An online book coauthored by the author of the {sf} and {stars} packages that we used in class.

  • Lovelace et al., Geocomputation with R (CRC Press, 2022)

    Free to read online. A fairly advanced book on spatial analysis, with useful chapters on its application in e.g. ecology and transportation.

  • Jung, Spatial Analysis with R (2023)

    An online tutorial that covers every aspect of manipulating spatial data in R, through the {sf} package.

  • Moraga, Spatial Statistics for Data Science: Theory and Practice with R (2023)

    Free to read online. Yet another online book that covers all of the basics, from using {sf} data to drawing maps to estimating spatial models with areal data. Includes a chapter on plotting raster and vector data with the {terra} package.

On text analysis:

  • Ornstein, Text as Data (2022)

    An online course/book companion for Grimmer et al.'s Text as Data (Princeton University Press, 2021).

  • Silge and Robinson, Tidy Text Mining (O'Reilly, 2022)

    Free to read online. The book that we kind of followed in class, except for the final chapter on topic models.

  • Hvitfeldt and Silge, Supervised Machine Learning for Text Analysis in R (CRC Press, 2022)

    Free to read online. A machine learning approach to text mining, with an entire section on ‘deep learning’ through neural networks. Check it out if you are curious about large language models like ChatGPT, for instance: this is pretty much how they work. (On that topic, see also this explainer, and this example of how to build such a model. Both posts use Python.)

On going further with R:

  • Baumer et al., Modern Data Science with R (2nd ed., CRC Press, 2021)

    Free to read online. The book covers the same topics as the course (and was assigned to some of its sessions), but it pushes a bit deeper into each of them, with some bonus chapters on SQL databases, geospatial models and network data. If I had to recommend a single book to read in full after taking the course, it would be that one.

  • Boehmke and Greenwell, Hands-On Machine Learning with R (CRC Press, 2020)

    Free to read online. A book that covers many of the topics that we covered in class (Sessions 8, 9 and 11 especially), plus many more, all from a machine learning perspective. Very much recommended if you are curious about ML methods and algorithms.

  • Kuhn and Silge, Tidy Modeling with R (O'Reilly, 2023)

    Free to read online. A book that introduces the {tidymodels} package bundle, which allows you to fit many statistical and machine learning models (as also shown through Julia Silge's excellent videos).

Videos

On text analysis:

On spatial analysis:

Optional

  • Bryan et al., Happy Git with R (n.d.)

    Pretty much all you need to know to use Git (and GitHub) from R (and RStudio), in an easy-to-read online tutorial.

  • RStudio/Posit, R Markdown

  • RStudio/Posit, Quarto

    Documentation pages for the two main technologies that will allow you to produce reports, slides and other documents with a mix of text, R code, figures and tables.

  • Briatte, Going with Python

    The wiki page where I provide a few pointers on how to learn data science with Python. Already recommended in Session 1.

  • Briatte, Going Bayesian

    The wiki page where I provide a few pointers on how to learn Bayesian data analysis, using R. I might have mentioned Bayesian reasoning and Bayesian models at a few points during the course. There was no time for us to cover it, but it might (just might) serve you well to learn about it if you go deeper into statistical analysis in the future.

Bonus: Learning more R

Interested in learning more R, either through courses or by teaching yourself? Try this:

  • If your university course catalogue does not have something on offer, the Summer Institute in Computational Social Science (SICSS) offers R training through summer schools. This is one efficient way to continue learning data science in the near future, especially if you are interested in scientific research (similar workshops exist in the context of scientific conferences).

  • There are lots of online courses that use R to teach data science or quantitative methods applied to various disciplines. For a list of examples, go to the course wiki and check the other similar courses listed there. Many of them have detailed examples and lecture notes.

  • Additional online data science courses are offered through commercial providers, which can deliver course completion certificates. Take a look, for instance, at Rafael Irizarry's Professional Certificate in Data Science at edX, at the Google Data Analytics course at Coursera, and at Jeff Leek's Advanced Data Science course, formerly offered at DataCamp, now offered at Coursera.

  • For very focused/specific learning, and if you are comfortable with code, head over to GitHub, which I mentioned in class, and search for example R code using specific packages. The trick is to use the language:r search term. This is a more advanced way to find real-world examples of how to use very specific functions/models.

Bonus: Keeping up with R

Interested in following R news outside of a classroom environment? Try the following:

Bonus: Various things

Just a bunch of extra stuff that was mentioned in class or in the emails, often more than once:

P.S. Oh, and here's the Spotify playlist for the course.