big-data-and-economics/big-data-class-materials

Class Materials for Bates ECON/DCS 368: Big Data and Economics/Data Science for Economists

Full syllabus with official policies

Lectures | Goals | Other details | FAQ | License

Feedback

I am constantly trying to improve this course. Provide feedback.

Office hours:

My office hours are:

  • Tuesdays 4pm-5pm
  • Wednesdays 10:30am-11:30am

You can make an appointment here.

Getting in touch

In this course, I ask that you use GitHub Discussions and Issues to ask questions about the problem sets, final projects, presentation clarifications, and other class specifics. This is so that everyone can benefit from the answer. Also, it will encourage collaboration (and declutter my inbox). A portion of the grade is based on participation in GitHub Issues.

Course Organization Page

Every repository will be linked in the organization page. This is where you will find all the repositories for the problem sets, final projects, and other class materials. It also hosts discussions.

GitHub Discussions

The GitHub Discussions page is hosted on the overall Classroom Organization (details on Organization vs. Repository in FAQ). It is private and can be used like any traditional online forum. You can use it to ask questions, answer questions, and discuss topics related to the course. I will be monitoring the discussions and will answer questions as they come up. I encourage you to answer each other's questions as well.

I will also make announcements for the benefit of the course here.

GitHub Issues

GitHub Issues are for repository-specific questions.

If you would like to discuss a private matter, you can of course email me at kcoombs@bates.edu. I will respond as quickly as possible. If you do need to email me, please include "ECON368" in the subject line. If you email me a question that would benefit the whole class or someone else in the class could answer, I will respond by asking you to post it to GitHub Discussions or Issues.

Lectures

Note: While I have provided PDF versions of the lectures, they are best viewed as HTMLs.

The course is broken up into three rough sections.

  • Part 1 covers basics of empirical organization, data gathering, and organizing that are not "big data" specific
  • Part 2 covers data description, econometrics, and causal inference that are possible with big data
  • Part 3 covers machine learning techniques that are possible with big data

Parts 2 and 3 will highlight examples of using big data to address social problems.

This is in progress and subject to change.

Date | Day | Topic | Do before class | Due
Data Science Basics
2024-01-11 | Th | Introduction to Big Data (.html, .pdf, .Rmd) | Read and Install Ch 1, 4-8 of happygitwithr |
2024-01-16 | T | Git slides (.html, .pdf, .Rmd) | Work through Ch 9-19 of happygitwithr |
2024-01-18 | Th | Empirical Organization slides (.html, .pdf, .Rmd) | Read Code and Data for Social Sciences, Check out RStudio Projects tips | Problem Set 0 due 1/18 at 11am
2024-01-23 | T | Data Tips (.html, .pdf, .Rmd) | Read Code and Data for Social Sciences |
2024-01-25 | Th | R Basics (.html, .pdf, .Rmd), Data Tips (.html, .pdf, .Rmd) | Watch basics of RStudio by Bates alumni Eli Mokas and Ian Ramsay | Final Project Proposal due 1/25 11:59:59pm
2024-01-30 | T | Data Table (.html, .pdf, .pdf), Tidyverse (.html, .pdf, .Rmd) | Ch 1 DS4E | Problem Set 1 due 1/29 at 11:59:59pm
2024-02-01 | Th | Scraping in Research (.html, .pdf, .Rmd), APIs (.html, .pdf, .Rmd) | JSONView, Sign up and register for personal API keys from FRED and the Census |
2024-02-06 | T | CSS (.html, .pdf, .Rmd) | SelectorGadget (Chrome), ScrapeMate (Firefox), Review Cheatsheet on scraping with R | Annotated Summary due 2/5 11:59:59pm
2024-02-08 | Th | Catch-up | |
2024-02-13 | T | Opportunity Atlas (.html, .pdf, .Rmd) | Watch Geography of Upward Mobility in America starting at 39min | Problem Set 2 due 02/16 at 11:59:59pm
2024-02-15 | Th | Spatial Analysis (.html, .pdf, .Rmd) | Read Geographic Data in R Ch. 2 | Problem Set 2 due 02/16 at 11:59:59pm
2024-02-20 | T | Winter Break | |
2024-02-22 | Th | Winter Break | |
Causal Inference
2024-02-27 | T | Causal Inference (.html, .pdf, .Rmd) | Read Effect Ch 13 or Mixtape Ch 2, Watch Causal Effects of Neighborhoods |
2024-02-29 | Th | Regression Review (.html, .pdf, .Rmd), Control variables (.html, .pdf, .Rmd) | Read Effect Ch 13 or Mixtape Ch 2, Watch Causal Effects of Neighborhoods | Data Description due 2/26 11:59:59pm
2024-03-05 | T | Fixed Effects (.html, .pdf, .Rmd), Panel data and two-way fixed effects (.html, .pdf, .Rmd) | Watch first 40min of Teachers and Charter Schools |
2024-03-07 | Th | Difference-in-differences (.html, .pdf, .Rmd) | Read Effect Ch 18 |
2024-03-12 | T | Regression Discontinuity Design (.html, .pdf, .Rmd) | Read Effect Ch 20 or Mixtape Ch 6 | Problem Set 3 due 3/11 at 11:59:59pm, Problem Set 3 solutions
2024-03-14 | Th | RDD activity (.html, .pdf, .Rmd) | Read Effect Ch 20 or Mixtape Ch 6 | Problem Set 3 due 3/11 at 11:59:59pm, Problem Set 3 solutions
Machine Learning
2024-03-19 | T | Bootstrapping (.html, .pdf, .Rmd), Functions & Parallel Programming (.html, .pdf, .Rmd), Bootstrapping activity (.html, .pdf, .Rmd) | Refer to Chapters 2-4 of DS4E, Chapter 9 of R for Data Science |
2024-03-21 | Th | March Recess | |
2024-03-26 | T | Bootstrapping activity (.html, .pdf, .Rmd) | Refer to Chapters 2-4 of DS4E, Chapter 9 of R for Data Science | Problem Set 4 due 3/25 at 11:59:59pm, Research Methods Summary due 3/25 by midnight
2024-03-28 | Th | Intro to Machine Learning (.html, .pdf, .Rmd), ISLR tidymodels lab (.html), Oregon Schools Decision Tree application by Cianna Bedford-Petersen, Christopher Loan, & Brendan Cullen (.html) | Read Athey & Imbens (2019), Mullainathan and Spiess (2017), Refer to ISLR 8.1 |
2024-04-02 | T | Tree-based methods (.html, .pdf, .Rmd) | Watch Improving Judicial Decisions |
2024-04-04 | Th | Causal Forests (.html, .pdf, .Rmd), Application: Causal forests with grf or Hack-a-thon (.Rmd) | Read ISLR 8.2 |
2024-04-09 | T | Regular expressions, WordClouds (.html, .pdf, .Rmd), Tidy text activities (.html, .pdf, .Rmd) | Read Gentzkow (2019): Text as Data |
2024-04-11 | Th | Hack-a-thon presentations/bonus activity | |
If time | | Regression regularization/penalization (.html, .pdf, .Rmd), Application (.html, .pdf, .Rmd) | Refer to ISLR Ch 6.1, 6.2 |
If time | | Sentiment Analysis (.html, .pdf, .Rmd) | Read Stephens-Davidowitz (2014) |
If time | | Topics Modeling, LLMs | Read Ash and Hansen (2023): Text Algorithms | Problem Set 5 due 4/15 at 11:59:59pm, Final project due 4/18 at 11:59:59pm
If time | | AI and bias | Read Rambachan et al (2020) and Cowgill et al. (2019) |

Goals for this course

This class is about helping you build good habits for doing organized and reproducible empirical work. It is not about developing expertise in specific R packages or functions. To that end, you should work in groups, and expect to spend several coding sessions working through problems together. I expect you to be flexible with your coding and commit to learning how to solve your own problems. Moreover, once you figure out a solution, I expect you to comment and organize your code, so you can easily reproduce the fix later. (This is a good habit to get into for your own work, but also for your future collaborators and employers.)

  • Organize empirical projects that are replicable, reproducible, and collaborative using good programming practices
  • Collect and clean big or novel datasets using APIs, web scraping, and other methods
  • Use Big Data to generate key insights about economic opportunity, inequality, and other social problems
  • Understand the differences between prediction, causality, and description, and when to apply each
  • Explain what data science is, and how Big Data differs from other types of data

Navigating the course

  • All problem sets and lectures are linked above in the calendar
  • The repository for each problem set, these course materials, and your class presentations are all linked in the organization page

Expectations

This is an extremely challenging course. To help you succeed, I have outlined expectations for both you and me.

For your professor

  • Link to lecture slides and problem sets in the calendar
  • Post any software you need to download or other materials you need to prep for class in the calendar with 24 hours notice
  • Outline learning goals at the top of each lecture
  • Clearly explain the expectations for each problem set
  • Provide examples of skill sets to be used on problem sets in class
  • Grade your problem sets within two weeks (i.e. before the next problem set is due)
    • Post all problem set solutions to the repository within a week of the problem set being due
  • Check all GitHub Issues and Discussions tabs at least once per day to answer questions

For students

  • Check the calendar within 24 hours of each lecture to see any materials you need to download/review
  • Fork the main problem set repository within 48 hours of the problem set being posted
  • Open problem set data and code within 48 hours of the problem set being posted
  • Work on problem sets in groups, but turn in your own code
  • Post questions to GitHub Discussions or page-specific Issues unless it is a private matter (e.g. grades, extensions)
    • There is a GitHub Issues tab within every problem set that I create. Please post questions about a problem set directly to the Issues tab for that problem set
    • If I receive an email with a question that will benefit everyone, I will ask you to post it to GitHub Issues/Discussions
    • This is so that everyone can benefit from the answer
    • Also, it will encourage collaboration
  • Use computers in class for class-related activities only
  • Seek out solutions to coding problems you run into
    1. Read error messages and see if you can immediately solve the problem
    2. Think before going to Google/ChatGPT: "How would I read a small portion of a large dataset in R?" (Use these services proactively, not reactively)
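
As one possible answer to the sample question above ("How would I read a small portion of a large dataset in R?"), here is a minimal sketch in base R. The temporary file and its columns `id` and `value` are made up for illustration and stand in for a genuinely large file. `data.table::fread()` also accepts an `nrows` argument if you prefer that package.

```r
# Write a small illustrative CSV to a temporary file.
# This is a hypothetical stand-in for a large dataset.
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:1000, value = (1:1000) * 2),
  tmp,
  row.names = FALSE
)

# Read only the first 5 rows instead of loading the whole file
preview <- read.csv(tmp, nrows = 5)
print(preview)
```

Previewing a few rows first lets you check column names and types before committing to a full read.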

How I will run class

Most classes will (hopefully) be divided into a "lecture" and an "interactive" component. During the lecture, computers will be closed. During the interactive component, computers will be open for you to work through it.

Resources to use for class

This course is taught in R, but the goal is not for students to become experts in individual R functions and packages. That is something a person could do using generative AI, existing R vignettes and demos, and other online resources.

Still, any programming language is tough to learn and R has a few quirks that make it challenging when you're coming from Stata. Those quirks are worth learning because they buy you a lot more functionality, but they can be frustrating. As shown in Ten simple rules for teaching yourself R (Lawlor et al. 2022), this is the nonlinear process of developing comfort in R:

Taken from Lawlor et al. (2022) NIH.

With that in mind, I expect students in this course to make ample use of the countless free resources on the internet to learn R. Here are a few that I recommend:

On R

Demos

Textbooks

Cheatsheets

On R Markdown

Econometrics, Statistics, Data Science with R examples

Staying organized

Large Language Models

You are actively encouraged to use generative AI assistants in this class. These can be used to improve your code, refine your writing, iterate on your ideas, and more.

Student Academic Support Center

Scheduled hours for R held in the Student Academic Support Center (SASC) of the Library are:

  • Sunday | 7:30-9pm
  • Monday | 12-1pm, 2:30pm-4pm
  • Tuesday | 12-2:30pm, 6-7:30pm
  • Wednesday | 11am-1pm, 6-7:30pm
  • Thursday | 12-4pm, 6-7:30pm
  • Friday | 11am-12pm

Course-Attached Tutor

Charlie Berman is our Course-Attached tutor. He will host office hours in the SASC and will be available for individual appointments. His hours are:

  • SASC Drop-in Hours: MTR 2:30-4
  • Evening Help Session: R 7:30-9 (room assignment to come)

You can also schedule private office hours at the Calendar Link: https://calendar.app.google/2Abyp3LqY3NPeg4u8.

He can help you troubleshoot R. He does not have solutions to the problem sets, but he can help you figure them out.

GitHub Codespaces

Having trouble with R on your computer?

Do NOT use a school computer as these do not have Git or GitHub integrated.

To get you up and running and writing R code in no time, I have containerized this course so that you have a ready-to-use R coding environment.

For some problem sets, I will explicitly request that you work with GitHub Codespaces to minimize the amount of time you spend troubleshooting your local R installation and package versions. No more, "but it works on my computer" when I ask you why your code isn't running! On occasion, I may ask you to work on your own computer because I want you to learn how to troubleshoot on your own machine.

Dev Containers in GitHub Codespaces

Click the green "<> Code" button at the top right on this repository page, and then select "Create codespace on main". (GitHub Codespaces is available with GitHub Enterprise and GitHub Education.)

To open RStudio Server, click the Forwarded Ports "Radio" icon at the bottom of the VS Code Online window.

In the Ports tab, click the Open in Browser "World" icon that appears when you hover in the "Local Address" column for the Rstudio row.

This will launch RStudio Server in a new window. Log in with the username: rstudio and password: rstudio.

  • NOTE: Sometimes, the RStudio window may fail to open with a timeout error. If this happens, try again, or restart the Codespace.

In RStudio, use the File menu to open the file test.Rmd. Use the "Knit" submenu to "Knit as HTML" and view the rendered "R Notebook" Markdown document.

  • Note: You may be prompted to install an updated version of the markdown package. Select "Yes".

  • Note: Pushing/pulling will work a bit differently. In practice, you will use the "Source Control" icon on the right-hand side bar, where you can stage, commit, and push changes. You will need to do this to turn in your problem set. See the GitHub documentation on Source Control and Codespaces.

Other details

This is an undergraduate course taught by Kyle Coombs. Here is the course description, right out of the syllabus:

Economics is at the forefront of developing statistical methods for analyzing data collected from uncontrolled sources. Since econometrics addresses challenges in estimation such as sample selection bias and treatment effects identification, the discipline is well-suited for the analysis of large and unsystematically collected datasets. This course introduces statistical (machine) learning methods, which have been developed for analyzing such datasets but which have only recently been implemented in economic research. We will cover a variety of topics including data collection, data management, data description, causal inference, and data visualization. The course also explores how econometrics and statistical learning methods cross-fertilize and can be used to advance knowledge in the numerous domains where large volumes of data are rapidly accumulating. We will also cover the ethics of data collection and analysis. The course will be taught in R.

Grading policy

Component | Weight | Graded
6 × problem sets (12.5% each, drop two lowest) | 50% | Top 4
1 × 5-minute presentation | 5% | Top 1
1 × GitHub participation | 5% | Overall
1 × group final project | 30%-40% | In parts
1 × Lewiston Hackathon | 0%-10% | Optional
Classroom participation | Bonus up to 2.5% | Discretion
Open source material contribution | Bonus up to 2.5% | Provide evidence
Most "good-faith" posts to GitHub | Bonus 2.5% | Posts/answers in course organization

Finalized grades on each component will be posted to Lyceum. Where possible to give feedback privately, it will be posted to GitHub.
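
To make the weights in the grading table concrete, here is a rough sketch of how the components might combine. All scores are hypothetical, bonuses are left out, and the exact aggregation is an assumption based on the table rather than an official formula.

```r
# Hypothetical scores (out of 100), for illustration only
problem_sets <- c(70, 85, 90, 60, 95, 80)
presentation <- 90
github_participation <- 100
final_project <- 88
hackathon <- NA  # NA means the student did not take part in the Hack-a-thon

# Keep the top four problem sets (the two lowest are dropped)
top_four <- sort(problem_sets, decreasing = TRUE)[1:4]

# The final project counts for 40% without the Hack-a-thon,
# or 30% plus 10% for the Hack-a-thon itself
if (is.na(hackathon)) {
  project_part <- 0.40 * final_project
} else {
  project_part <- 0.30 * final_project + 0.10 * hackathon
}

# Weighted sum of the graded components (bonus points not included)
course_grade <- 0.50 * mean(top_four) +
  0.05 * presentation +
  0.05 * github_participation +
  project_part
print(course_grade)
```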

Problem sets

Throughout the course you will engage in problem sets that deal with actual data. These may seem out of step with what we do in class, but they are designed to get you to think about how to apply the tools we learn in class to real data.

  • Problem sets are coding assignments that get you to play with data using R
  • They are extremely challenging, but also extremely rewarding
  • With rare exceptions, you will not be given code to copy and paste to accomplish these data cleaning tasks. Instead, you will be given a set of instructions and asked to figure out how to write the code yourself
  • You are encouraged to work together on problem sets, but you must write up your own answers (unless it is a group assignment)
  • All problem sets will be completed and turned in as GitHub repositories
  • I will drop the two lowest problem set grades

What you will turn in:

  • Each problem set will be posted as a GitHub repository, which you will fork, set to private, and then clone to your computer (instructions provided in each problem set)
  • You will then work on the problem set on your computer or a Codespaces server, and push your code to GitHub (push often!) (Note: You have to push in Codespaces. If you delete your Codespace without pushing changes, you will lose your answers.)
  • For each problem set, you will turn in modular code (i.e. separate files do separate things) that accomplishes the tasks outlined in the problem set
  • You will also turn in a .Rmd file that contains your answers to the questions in the problem set along with a knitted .html or .pdf of your .Rmd
    • This .Rmd will "source" the code you wrote, so I can easily run your code from start to finish by knitting
  • Your problem sets will have a sensible folder structure that is easy to navigate (name folders code, data, output, etc.)
  • You will turn in your problem sets by pushing your code to GitHub.
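
The folder structure and "source" workflow described in the list above could look roughly like the sketch below. The project name `problem-set-example`, the script `01_clean.R`, and the helper `clean_values()` are hypothetical. The project is built in a temporary directory so the example runs anywhere, and the `source()` loop is what a setup chunk in your .Rmd might contain.

```r
# Build a hypothetical project layout in a temporary directory
project <- file.path(tempdir(), "problem-set-example")
for (folder in c("code", "data", "output")) {
  dir.create(file.path(project, folder), recursive = TRUE, showWarnings = FALSE)
}

# A modular script that does one thing: define a cleaning helper
writeLines(
  "clean_values <- function(x) x[!is.na(x)]",
  file.path(project, "code", "01_clean.R")
)

# In the .Rmd, a setup chunk could source every script in code/ in order
scripts <- sort(list.files(
  file.path(project, "code"),
  pattern = "\\.R$",
  full.names = TRUE
))
for (script in scripts) {
  source(script)
}

# The sourced helper is now available to later chunks
cleaned <- clean_values(c(1, NA, 3))
print(cleaned)
```

Numbering the scripts (01_, 02_, ...) is one simple way to make the run order explicit when they are sourced alphabetically.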

Grading

Your problem sets are (generally) graded on four criteria:

  1. Submission via GitHub (10%): Did you use GitHub to stage, commit, and push your code? Did you submit the assignment on time? Did you submit the assignment in the correct format?
    • If you are unable to submit via GitHub pushing/pulling, you can create a zip folder of your problem set and upload it as a single file to your repository.
    • Zip instructions: https://www.wikihow.com/Make-a-Zip-File
    • Click Add file in your repository to upload the zip file.
    • You will forgo this 10% of your grade.
  2. Quality of code (30%): Is it well-commented? Is it easy to follow? Can I run it?
    • Any scripts needed to run your code should be included in the repository and sourced in the .Rmd file
    • Write code that automates as much of the process as possible. For example, if you need to download a file, write code that downloads the file automatically
    • If you cannot figure out how to automate a step, you can write a comment explaining what I need to do to run your code (you will lose very few points)
  3. Quality of presentation of graphs and tables (30%): Are they well-labeled? Do they have titles? Do they have legends? Are they formatted well?
  4. Quality of answers (30%): Are they clear? Do they answer the question?
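
As an illustration of the automation point under "Quality of code", here is a minimal sketch of a download step that only runs when the file is missing. The helper `download_if_missing()` and the paths are hypothetical. To keep the example runnable offline, it uses a `file://` URL pointing at a temporary file (this assumes a Unix-like system); in a real problem set you would substitute the actual web address.

```r
# Hypothetical helper: download a file only if it is not already on disk
download_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
    download.file(url, destfile, mode = "wb", quiet = TRUE)
  }
  invisible(destfile)
}

# Stand-in for a remote file: a small CSV served through a file:// URL
src <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,10", "2,20"), src)
example_url <- paste0("file://", normalizePath(src))

# Download into a data/ folder and read the result
dest <- file.path(tempdir(), "data", "raw.csv")
download_if_missing(example_url, dest)
raw <- read.csv(dest)
print(raw)
```

Checking for the file first means knitting the .Rmd repeatedly does not download the data again each time.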

I will provide feedback and a grade in a feedback branch of your problem set repository. That will let me add feedback without overwriting your work in the main branch.

Solutions

The solutions are made public within a week of the problem set being due.

Improving your grade

In an effort to incentivize you to see coding as an ongoing process of learning and improvement, I will allow you to improve the coding and presentation quality portions of your grade on any problem set. However, you cannot just copy and paste the solutions.

Instead, you must provide carefully commented explanations of each step of the code -- whether from the solutions or of your own invention. This is a great way to learn, but it is also a lot of work.

Example. You might add a comment like this to the top of your code:

# Create project directories; suppress the warning raised when a directory already exists.
suppressWarnings({
    dir.create("data")
    dir.create("documentation")
    dir.create("code")
    dir.create("output")
    dir.create("writing")
})

Submission process

To be eligible to resubmit to improve your grade, you must have submitted an initial version of the problem set on time.

  1. View my feedback on the feedback branch of your problem set repository.
  2. Fix your problem set answers and comment your code as needed. Write "CORRECTED" in all caps next to any changes.
  3. Push changes to the main branch of your problem set repository.
  4. Navigate to the Issues tab of your problem set repository and create a new issue titled "Resubmission for Problem Set X". Briefly describe your changes in the body of the issue and tag my username, @kgcsport.
  5. Deadline for resubmissions: All resubmissions must be pushed within one week of the solutions being posted.

Within your own private problem set repository, you can enable the Issues tab from the repository's Settings tab. Issues there are visible only to me and any group partners.

Requests for reconsideration

On occasion, you may disagree with the grade you received on a problem set. Here are my policies for reconsideration:

  • Deadline for requests: All requests for reconsideration must be submitted within one week of the solutions being posted.
  • Full regrade: Any request for reconsideration will result in a full regrade of your problem set. This means that your grade can go up or down.
  • Regrading high scores: If you scored a 90 percent or above on a problem set, I will not change your grade. This is not because I do not want to help you, but because we both have limited time and I want to focus my efforts on cases where an incorrectly graded problem set could significantly impact your grade in the course. This does not apply to re-submissions. This only applies to the cases where you want me to review your score in full separate from corrections and re-submissions.

If you would like reconsideration, please raise an Issue in your private problem set repository. Title the issue "Reconsideration request for Problem Set X". Briefly describe your request in the body of the issue and tag my username, @kgcsport.

Presentations

Each of you will give a 5-minute presentation summarizing a key lecture reading, or an (approved) software package/platform.

Please sign up here at the start of the semester.

Final Project

You will write a final project over the course of the semester as part of a group. Further details are available here. If you participate in the Hack-a-thon, your final project will be worth 30 percent of your grade. If you do not participate in the Hack-a-thon, your final project will be worth 40 percent of your grade.

Lewiston Hack-a-thon

This semester, we will be working with the City of Lewiston to help them solve a problem using data. Specifically, we will help the city understand how to use existing administrative data to complement, and at times substitute for, survey data.

We will specifically be engaging in a Hack-a-thon. A hack-a-thon is a short (often 24 hours), intense period of collaboration between a group of people to solve a problem. Scheduling is still in the works.

The Hack-a-thon is planned to be optional and replace a quarter of the final project grade.

Data Requests

Several weeks before the hack-a-thon, we will brainstorm datasets that your group would like the City of Lewiston to provide for you. You will then write a short report on how you would use those datasets to solve a problem.

What you will do

  • Compete in groups of 3-4 to each propose solutions to the same problem
  • Present your solution to a group from the City of Lewiston
  • Write a short report on your solution
  • Maintain all code and necessary documentation for the City of Lewiston in a GitHub repository
  • Provide any additional documentation the City of Lewiston requests

Your solution may include a variety of things, including:

  • A data visualization
  • Suggestions of new databases to maintain
  • Examples from similar cities that have tackled these problems

GitHub participation

Participation on GitHub is 5 percent of your grade. Please use GitHub Discussions and Issues to ask questions about the course materials and problem sets. You can also suggest improvements to the course materials. Here are the guidelines:

  • When starting a discussion, posting an issue, or suggesting a pull request, please use a clear title (e.g. Problem Set 1: Question about Question 2) and description ("What does term X mean?")

  • If posting about an error you are encountering, follow these steps:

    • Briefly state the expected behavior
    • Write the minimal code needed to reproduce the error (a minimal reproducible example)
    • Write the full error message you are receiving
    • Write the steps you have taken to troubleshoot the error
  • If posting a clarifying question about someone's post, follow these steps:

    • Briefly clarify what you are confused about
    • Suggest potential interpretations of the post
  • If posting about a suggestion to improve the class materials, follow these steps:

    • Briefly state the improvement you are suggesting
    • Write the steps you have taken to troubleshoot the error
    • If you are suggesting a change to the course materials, please fork the repository, make the change, and submit a pull request
  • Be kind to one another. Coding is hard. We are all learning.
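
To illustrate the error-reporting steps above, here is a sketch of what a minimal reproducible example might look like in R. The data frame `prices` is made up for illustration. The `tryCatch()` call is only there so the script keeps running and captures the error message; in an actual post you would paste the code and the full error text as they appear in your console.

```r
# Expected behavior: sum() returns the total price.

# Minimal data that reproduces the problem (prices were read in as text)
prices <- data.frame(
  item = c("a", "b"),
  price = c("1.50", "2.25"),
  stringsAsFactors = FALSE
)

# Capture the full error message so it can be pasted into the post
error_message <- tryCatch(
  sum(prices$price),
  error = function(e) conditionMessage(e)
)
print(error_message)

# Troubleshooting step already tried: convert the column to numeric first
total <- sum(as.numeric(prices$price))
print(total)
```

A two-row data frame like this is usually enough for someone else to reproduce the problem without access to your full dataset.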

These guidelines are taken from stackoverflow.com. You can read more about how to ask a good question here and how to answer a question here.

I will rate participation based on the following criteria:

  • Are you posting thoughtful questions? Follow the guidelines on stackoverflow for posting a good question.
  • Are you replying to questions? Follow the guidelines on stackoverflow to write a good answer.
  • Do you pull request improvements to the course materials? This can be a typo fix, a bug fix, or a new feature.

Note: I hope to add Issue templates across the entire organization. All in good time.

For each problem set, use the Issues tab for that specific problem set. For course materials, please use the organization Discussions tab.

I will be monitoring the GitHub Issues tab for each repository and will award participation points to those who are actively engaging per the guidelines from stackoverflow. To receive full credit, you must be asking thoughtful questions and thoughtfully answering each other's questions. Thoughtful questions come after you've spent some time re-reading your code, Googling, and working with ChatGPT to try to solve the problem yourself first. Thoughtful answers may not solve the problem, but they should be clear, concise, good faith efforts to help. You will receive a lower participation grade if you do not follow the posting guidelines.

The goal is to encourage you to work together to solve problems. This is one of the most important skills you can take away from this, and really any, course. I also want to incentivize you to think carefully about how you post. Be kind and respectful, as much as you endeavor to be clear, concise, and helpful.

Furthermore, I want you to take ownership over your learning. You get more out of a course when you are actively engaged in the material. Actively engaging on this repository and suggesting changes to the course materials is a very tactile way to do that.

Bonus points:

There are several opportunities for bonus points during the semester:

  1. A 2.5% bonus on your final grade for issuing a pull request to any open source material. This can be to fix a typo or to fix a bug in the code.
  2. A 2.5% participation bonus on your final grade that I will award at my discretion.
  3. I will offer a 2.5% participation bonus to the person with most "good faith" posts/answers in GitHub Issues and Discussions within this organization. "Good faith" means:
    • The posts are made to actually ask about a problem you are having with a problem set/your final project or to answer someone else's question
    • The posts/answers follow the guidelines above
    • The posts are not duplicates of existing posts -- you have to search before you post to see if folks are already working on this problem.
    • The posts and answers are not spam -- I'm a child of the 90s. I can spot spam from a mile away.
    • The posts and answers are respectful and constructive.
    • An improvement on an existing answer is fine if it actually improves on that answer.
    The intent of this bonus is to encourage you to work together to solve technical problems in a way that resembles professional software development and data analysis on this platform. This is one of the most important skills you can take away from this, and really any, course.
  4. I offer a bonus point for each typo corrected on problem sets and solutions. This is capped at 10 points per student per problem set. You must pull request and/or raise an Issue on the corresponding GitHub repository to get credit.

I have given instructions on how to execute a pull request of a specific commit (instead of your entire commit history) in the FAQ.

Extensions

I offer extensions for two things:

  1. Major health issues
  2. Major family emergencies

Please flag either with Bates Reach and email me with the subject "[ECON 368] subject here", so I can be aware of the situation. Together we'll figure out an appropriate extension.

FAQ

How is this all structured? What's a repository? What's an organization?

This course is hosted in a GitHub Organization. An organization is a collection of repositories. A repository contains the full history of "commits" of a coding project, all its folders, etc.

Organizations are incredibly useful because all the repositories within an organization are linked. This means that you can easily navigate between them. For example, you can click on "big-data-and-economics" to get to the organization page and navigate over to other repositories under the Repositories tab.

What is a commit?

We'll go over this in more detail in later lectures, but a commit is a collection of changes made to the code that you intentionally bundle together and can label. You can think of it like saving a file under a separate file name. The difference is you don't create multiple copies of the file, but you can revert to earlier versions easily using your repository.

If you find a typo in these lecture notes

Please raise an issue or submit a pull request. For those taking this course, I offer a 2.5% bonus on your final grade for issuing a pull request to any open source material -- including these lecture notes. This can be to fix a typo or to fix a bug in the code.

How do I download this material and keep up to date with any changes?

Please note that this is a work in progress, with new material being added every week.

If you just want to read the lecture slides or HTML notebooks in your browser, then you should simply scroll up to the Lectures section at the top of this page. Completed lectures will be hyperlinked as soon as they have been added. Remember to check back in regularly to get any updates. Or, you can watch or star the repo to get notified automatically.

If you actually want to run the analysis and code on your own system (highly recommended), then you will need to download the material to your local machine. The best way to do this is to clone the repo via Git and then pull regularly to get updates. Please take a look at these slides if you are unfamiliar with Git or are unsure how to do any of that. Once that's done, you will find each lecture contained in a numbered folder (e.g. 01-intro). The lectures themselves are written in R Markdown and then exported to HTML format. Click on the HTML files if you just want to view the slides or notebooks.

I've spotted a mistake or would like to contribute

Please open a new issue. Better yet, please fork the repo and submit an upstream pull request. I'm very grateful for any contributions, but may be slow to respond while this course is still being developed. Similarly, I am unlikely to help with software troubleshooting or conceptual difficulties for non-enrolled students. Others may feel free to jump in, though.

Can I use/adapt your material for a similar course that I'm teaching?

Sure. I already borrowed half of it from Grant McDermott, Tyler Ransom, Raj Chetty, and Stephen Hansen. I have also kept everything publicly available. I ask two favours (like Grant McDermott): 1) Please let me know (email) if you do use material from this course, or have found it useful in other ways. 2) An acknowledgment somewhere in your own syllabus or notes would be much appreciated.

Pull Request of a Specific Commit

If you want to make a pull request of a specific commit (and not all changes you have made), you are best off using the command line interface for Git. There are two ways to do this thanks to a recent innovation by GitHub. The first, more traditional approach involves something called cherry picking.

Here's how you do it:

Use the command line (Git Bash, WSL, Terminal)

  1. Create a fork of this repository (called the upstream repository) if you have not before
  2. Clone the forked repo to your local computer
  3. Add the original repo as a remote called upstream (enter git remote add upstream <upstream-repo-url>)
  4. Fetch the upstream repo (git fetch upstream)
  5. Create a branch of this upstream repo (git checkout -b <pull-request-branch-name> upstream/main)
  6. Either:
    • Make the changes you want to make
    • Cherry pick the specific commit you want to merge as a pull request by typing git cherry-pick <commit-hash> into the command line
      • A commit hash is a unique combination of letters and numbers that identifies a specific commit. You can find the commit hash by running git log and copying the hash of the commit you want to make a pull request for OR by clicking on the commit history on GitHub and copying the SHA (the icon with two interlocked squares.)
  7. Push this branch to the forked repository with git push -u origin <pull-request-branch-name>
  8. Return to your forked repo's main branch with git checkout main
  9. Navigate to your forked repository on GitHub and create a pull request from the branch you just pushed (you should see a banner that says "Compare & pull request" when you navigate to your forked repo)
  10. Make sure:
    • The base repository is the upstream repo and the base is the main branch
    • The head repository is your forked repo and the compare is the branch named <pull-request-branch-name>
  11. Optional after pull request is accepted: Destroy the pull-request-branch once it has served its purpose with git branch -d <pull-request-branch-name>
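
The command-line steps above can be rehearsed end to end without touching GitHub. The sketch below is only an illustration under stated assumptions: the local directories `upstream` and `fork` are hypothetical stand-ins for the original repository and your fork on GitHub, and the branch name `fix-readme-typo`, the file names, and the commit messages are invented. Opening the pull request itself (steps 9 and 10) still happens in the browser.

```shell
# Demo identity for this sketch only (environment variables Git reads when committing)
export GIT_AUTHOR_NAME="Demo Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work="$(mktemp -d)"
cd "$work"

# Hypothetical stand-ins: "upstream" is the original repo, "fork" is your fork on GitHub
git init -q upstream
cd upstream
echo "Lecture notes" > README.md
git add README.md
git commit -q -m "Add lecture notes"
git branch -M main
cd "$work"
git clone -q "$work/upstream" "$work/fork"

# Steps 2-4: clone the fork, add the original repo as a remote called upstream, fetch it
git clone -q "$work/fork" "$work/local"
cd "$work/local"
git remote add upstream "$work/upstream"
git fetch -q upstream

# Two unrelated commits on your own main; only the typo fix belongs in the pull request
echo "draft problem set answers" > scratch.txt
git add scratch.txt
git commit -q -m "Add personal scratch work"
echo "Lecture notes (typo fixed)" > README.md
git commit -q -am "Fix typo in README"
typo_fix_hash="$(git rev-parse HEAD)"   # the hash you would otherwise copy from git log

# Step 5: create the pull request branch from upstream's main, not your own main
git checkout -q -b fix-readme-typo upstream/main

# Step 6: cherry pick only the chosen commit onto that branch
git cherry-pick "$typo_fix_hash" > /dev/null

# Step 7: push the branch to the fork
git push -q -u origin fix-readme-typo

# Step 8: return to your main branch
git checkout -q main

# The pull request branch holds exactly one commit beyond upstream/main
git log --oneline upstream/main..fix-readme-typo
```

Because the branch starts from `upstream/main`, the scratch-work commit never reaches it, so a pull request from `fix-readme-typo` would contain the typo fix and nothing else.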

The second, more recent approach involves using the GitHub Desktop interface to create a new branch that is directly aligned with the upstream repository's branch. This involves some setup, but is a bit more user-friendly (i.e. point-and-click based). This assumes you are working with a clone of an existing, forked repository.

Use GitHub Desktop

  1. Open GitHub Desktop.
  2. Navigate to your repository under the Current Repository tab. If you don't see it, you may need to Add > Add to existing repository (navigate to the repository's local path on your computer, so GitHub Desktop knows where to look).
  3. Click on the dropdown menu Repository and select Repository settings.
  4. Click on Fork behavior and select "To contribute to the parent repository" under "I'll be using this fork..." (This will mean all new branches are aligned with the upstream "parent fork" repository.)
  5. Click Save and move on.
  6. Create a new branch by clicking on the Current Branch tab and selecting New Branch.
  7. Name the branch something that is descriptive of the changes you are making, i.e. <pull-request-branch-name>
  8. Navigate to the Branch tab and select the new branch you just created
  9. Publish the branch to your forked repository by clicking Publish branch in the bottom right corner.
  10. Make the changes you want to make in the repository.
  11. Commit the changes to the branch by clicking Commit to <pull-request-branch-name> in the bottom left corner in GitHub Desktop. (Or use RStudio, etc.)
  12. Push the changes to the branch by clicking Push origin in the bottom left corner in GitHub Desktop.
  13. Navigate to your forked repository on GitHub and create a pull request from the branch you just pushed (you should see a banner that says "Compare & pull request" when you navigate to your forked repo)
  14. Make sure:
    • The base repository is the upstream repo and the base is the main branch
    • The head repository is your forked repo and the compare is the branch named <pull-request-branch-name>
  15. Optional after pull request is accepted: Destroy the pull request branch by navigating to the Branch tab in GitHub Desktop and selecting Delete <pull-request-branch-name>. Select Yes, delete this branch on the remote as well to fully kill this branch.

Setup

This organization is used to run the ECON/DCS 368 course at Bates College, commonly known as Data Science for Economists or Big Data and Economics. This readme documents how to interact with this organization as a student in the course or instructor looking to mimic it.

Students

Students should await an initial invitation via a GitHub Classroom assignment for Problem Set 0. This will allow you to enroll as a member of the organization.

End of course

About a month after the course is over, students will be moved from Members to Outside Collaborators. This will allow any students to maintain access to private course materials while losing access to (and deleting) any of their problem set repositories that are hosted within the organization. Students will keep local versions of their solutions on their machines and can create new repositories accordingly.

Professors

Use an organization like this to invite students to use GitHub Classroom. I am continuously improving how I leverage GH Classroom, which is itself under constant development. I largely use GH Classroom to automate organization enrollment and access to repositories (much faster than adding each name multiple times and managing repository access one-by-one).

After creating your organization:

  1. Create a classroom with GH Classroom within said organization. Add your students one-by-one (just once!) or link to your institution's learning management software.
  2. Create an initial problem set assignment through GH Classroom that you want to send out to students. Send the link to students so they enroll. Optional: Make it a group assignment to create a group that all students are in, which can be used to give access to repositories during the course.
  3. Optional: Make problem set repositories within the organization and ask students to fork their own repositories instead of using the GH Classroom links -- that way the students host the forks on their own repositories.
  4. Once students have submitted all work and the class is over, you have two options: (1) archive the classroom and convert them to Outside Collaborators to maintain their access to materials or (2) delete the classroom entirely. Either way they will keep local clones of repositories, but this changes their access to forked repositories/class materials after the course concludes.

License

The material in this repository is made available under the MIT license.