Full syllabus with official policies
Lectures | Goals | Other details | FAQ | License
I am constantly trying to improve this course. Provide feedback.
- Class Hours: T/Th 9:30am-11am
My office hours are:
- Tuesdays 3-4pm
- Wednesdays 10:30am-11:30am
My office is in Pettengill 161.
You can make an appointment here.
In this course, I ask that you use GitHub Discussions and Issues to ask questions about the problem sets, exercises, and other class materials. This is so that everyone can benefit from the answer. Also, it will encourage collaboration (and declutter my inbox). A portion of the grade is based on participation in GitHub Issues.
Every repository will be linked in the organization page. This is where you will find all the repositories for the problem sets, final projects, and other class materials. It also hosts discussions.
The GitHub Discussions page is hosted on the overall Classroom Organization (details on Organization vs. Repository in FAQ). It is private and can be used like any traditional online forum. You can use it to ask questions, answer questions, and discuss topics related to the course. I will be monitoring the discussions and will answer questions as they come up. I encourage you to answer each other's questions as well.
I will also make announcements for the benefit of the course here.
GitHub Issues are for repository-specific questions.
If you would like to discuss a private matter, you can of course email me at kcoombs [at] bates.edu. I will respond as quickly as possible. If you do need to email me, please include "ECON368" in the subject line. If you email me a question that would benefit the whole class or someone else in the class could answer, I will respond by asking you to post it to GitHub Discussions or Issues.
Check the discussions page for most recent updates.
Note: While I have provided PDF versions of the lectures, they are best viewed as HTMLs.
The course is broken up into three rough sections.
- Part 1 covers basics of empirical organization, data gathering, and organizing that are not "big data" specific
- Part 2 covers data description, econometrics, and causal inference that are possible with big data
- Part 3 covers machine learning techniques that are possible with big data
Parts 2 and 3 will highlight examples of using big data to address social problems.
Note: The syllabus is always subject to change based on how the course is progressing.
Thursday (2025-01-09): Introduction (.html, .pdf, .Rmd)
- Exercise due before class: Start up exercise
- In-class activity: Intro-to-R
Tuesday (2025-01-14): Empirical Organization (.html, .pdf, .Rmd)
- Due before class: Read Code and Data for the Social Sciences by Gentzkow and Shapiro and share a story of a non-reproducible workflow that caused you problems on the Discussions forum
- In-class activity: MRE
Thursday (2025-01-16): Git and Github (.html, .pdf, .Rmd)
- Exercise due before class: Introduction to Git and Introduction to Github Concepts
- In-class activity: Git Basics
- Problem Set 1 assigned: fork, clone, and make at least one commit by next class.
Tuesday (2025-01-21): R basics (.html, .pdf, .Rmd) and (Git spillover)
- Exercise due before class: Datacamp Introduction to R: R Basics and Data frames
- In-class exercise: File paths and modular files
Thursday (2025-01-23): Professor away, CAT leads class
- Exercise due before class: Datacamp: Reporting with RMarkdown, Watch STAT545 video on R Markdown (no deliverable)
- In-class exercise: R tutorial on R Markdown led by CAT
Tuesday (2025-01-28): Data Tips (.html, .pdf, .Rmd)
- Exercise due before class: Post a data quality/visualization checklist to the Discussions forum, Watch STAT545 video on Reading/Writing data
- Optional: Datacamp: Introduction to Importing Data in R, Chapter 2: readr and data.table
- In-class activity: Data checks
Thursday (2025-01-30): Tidyverse (.html, .pdf, .Rmd)
- Exercise due before class: Datacamp: Introduction to Tidyverse
- In-class activity: Tidying data
Tuesday (2025-02-04): Opportunity Atlas (.html, .pdf, .Rmd)
- Exercise due before class: Review Opportunity Atlas and/or Watch this lecture and discuss the questions on the Discussions forum
- Class activity: Code review
- Problem Set 1 due before class
- Problem Set 2 assigned
Thursday (2025-02-06): Spatial Analysis (.html, .pdf, .Rmd)
- Exercise due before class: Analyzing Census Data in R: Chapters 1 and 4 and Request and activate Census Data API Key (If you encounter issues, let me know in GitHub Discussions.)
- Class activity: Make maps!
Tuesday (2025-02-11): Scraping in Research (.html, .pdf, .Rmd)
- Exercise due before class: Datacamp: Intermediate Importing Data in R - Chapter 4
- Class activity: APIs
Thursday (2025-02-13): CSS (.html, .pdf, .Rmd)
- Exercise due before class: Datacamp: Intermediate Importing Data in R - Chapter 3
- Class activity: Scrape websites
Tuesday (2025-02-25): Causal Inference (.html, .pdf, .Rmd)
- Exercise due before class: Watch Econometrics: Inference and Identification and respond to discussion questions in Discussion post
- Class activity: Causality and simulations
- Problem Set 2 due before class
- Problem Set 3 assigned
Thursday (2025-02-27): Control Variables (.html, .pdf, .Rmd)
- Exercise due before class: Watch Regression and The Error Term and respond to discussion questions in Discussion post
- Class activity: Control variables and omitted variable bias
Tuesday (2025-03-04): Fixed Effects (.html, .pdf, .Rmd)
- Exercise due before class: Watch Fixed Effects and respond to discussion questions in Discussion post
- Class activity: Fixed Effects and Panel Data
Thursday (2025-03-06): Difference-in-differences (.html, .pdf, .Rmd)
- Exercise due before class: Watch Diff-in-diff lecture and read Baker's Diff-in-diff methodology and respond to discussion questions in Discussion post
- Class activity: Difference-in-difference exercise
Tuesday (2025-03-11): Regression Discontinuity Design (.html, .pdf, .Rmd)
- Problem Set 3 due before class
- Problem Set 4 assigned
- Exercise due before class: Watch RDD lecture and respond to discussion questions in Discussion post
- Class activity: RDD and class sizes
Thursday (2025-03-13): Catch-up, potentially power tests/experimental design
- Exercise due before class: Watch Power and the Statistical Test and respond to discussion questions in Discussion post
- Class activity: Power Analysis
Tuesday (2025-03-18): Catch-up, potentially rental assistance topic discussion
- Exercise due before class: Read the Direct Rental Assistance Pilot Document on post and respond to questions
- Class activity: Meet with Travis Heynen, select hackathon groups, complete hackathon prep questions
Thursday (2025-03-20): Break
Tuesday (2025-03-25): Hackathon Kickoff and Catchup day
- Problem Set 4 due before class
- Complete hackathon prep questions
Thursday (2025-03-27): Hackathon presentations
- Problem Set 5 assigned
Tuesday (2025-04-01): Functions & Iteration (.html, .pdf, .Rmd)
- Exercise due before class: Datacamp: Intermediate R Chapters 3 and 4
- Class activity: Function writing
Thursday (2025-04-03): Parallel Programming (.html, .pdf, .Rmd)
- Exercise due before class: Datacamp: Parallel Programming in R Chapters 1 and 3
- Class activity: Parallel programming
Tuesday (2025-04-08): Bootstrapping (.html, .pdf, .Rmd)
- Exercise due before class: Read Brownstone & Valletta (2001) and give an intuitive use case for bootstrapping and multiple imputation on discussion post
- Class activity: Bootstrapping practice
Thursday (2025-04-10): Introduction to Machine Learning (.html, .pdf, .Rmd)
- Problem Set 5 due 4/15
- Exercise due before class: Read Machine Learning: An Applied Econometric Approach and give an intuitive use case for machine learning on discussion post
- Class activity: Introduction to machine learning
- Causal Forests (.html, .pdf, .Rmd)
- Regression regularization/penalization
- Regular expressions
- Topic Modeling, LLMs
- AI and bias
This class is about helping you build good habits for doing organized and reproducible empirical work. It is not about developing expertise in specific R packages or functions. To that end, you should work in groups, and expect to spend several coding sessions working through problems together. I expect you to be flexible with your coding and commit to learning how to solve your own problems. Moreover, once you figure out a solution, I expect you to comment and organize your code, so you can easily reproduce the fix later. (This is a good habit to get into for your own work, but also for your future collaborators and employers.)
- Organize empirical projects that are replicable, reproducible, and collaborative using good programming practices
- Collect and clean big or novel datasets using APIs, web scraping, and other methods
- Use Big Data to generate key insights about economic opportunity, inequality, and other social problems
- Understand the differences between prediction, causality, and description, and when to apply each
- Explain what data science is, and how Big Data differs from other types of data
- All problem sets and lectures are linked above in the calendar
- The repository for each problem set, these course materials, and your class presentations are all linked in the organization page
This is an extremely challenging course. To help you succeed, I have outlined expectations for both you and me.
- Link to lecture slides and problem sets in the calendar
- Post any software you need to download or other materials you need to prep for class in the calendar with 24 hours notice
- Outline learning goals at the top of each lecture
- Clearly explain the expectations for each problem set
- Provide examples of skill sets to be used on problem sets in class
- Grade your problem sets within two weeks (i.e. before the next problem set is due)
- Post all problem set solutions to the repository within a week of the problem set being due
- Check all GitHub Issues and Discussions tabs at least once per day to answer questions
- Check the calendar within 24 hours of each lecture to see any materials you need to download/review
- Fork the main problem set repository within 48 hours of the problem set being posted
- Open problem set data and code within 48 hours of the problem set being posted
- Work on problem sets in groups, but turn in your own code (unless it is a group assignment)
- Post questions to GitHub Discussions or page-specific Issues unless it is a private matter (e.g. grades, extensions)
  - There is a GitHub Issues tab within every problem set that I create; please post questions about problem sets directly to the tab for each problem set
  - If I receive an email with a question that will benefit everyone, I will ask you to post it to GitHub Issues/Discussions
    - This is so that everyone can benefit from the answer
    - Also, it will encourage collaboration
- Use computers in class for class-related activities only
- Seek out solutions to coding problems you run into
- Read error messages and see if you can immediately solve the problem
- Think before going to Google/ChatGPT: "How would I read a small portion of a large dataset in R?" (Use these services proactively, not reactively)
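As one possible answer to the example question above, here is a minimal sketch in base R. The file and its columns (`id`, `value`) are made up for illustration; the only real feature being shown is the `nrows` argument of `read.csv()`, which stops reading after the requested number of rows.

```r
# Write a hypothetical "large" CSV to a temporary file so the example is self-contained.
big_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, value = (1:1000) * 2), big_file, row.names = FALSE)

# Read only the first five data rows instead of the whole file.
preview <- read.csv(big_file, nrows = 5)

print(preview)
str(preview)  # check column names and types before loading everything
```

Looking at a small preview first is usually enough to decide which columns and types you need before committing to a full read.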
Most classes will (hopefully) be divided into a "lecture" and an "interactive" component. During the lecture, computers will be closed. During the interactive component, computers will be open for you to work through it.
This course is taught in R, but the goal is not for students to become experts in individual R functions and packages. That is something a person could do using generative AI, existing R vignettes and demos, and other online resources.
Still, any programming language is tough to learn and R has a few quirks that make it challenging when you're coming from Stata. Those quirks are worth learning because they buy you a lot more functionality, but they can be frustrating. As shown in Ten simple rules for teaching yourself R (Lawlor et al. 2022), this is the nonlinear process of developing comfort in R:
The important thing to remember is that the goal is not to get rid of errors! The goal is to get comfortable solving them and finding ways to avoid the same mistake in the future. As XKCD's Randall Munroe put it:
With that in mind, I expect students in this course to make ample use of the countless free resources on the internet to learn R. Here are a few that I recommend:
- How to Ask for Programming Help
- Stack Exchange's
- Matt Gemmell's 'What have you tried?'
- How to Ask Questions the Smart Way by Eric Raymond and Rick Moen (you may need thick skin)
- Dozens of free R Courses by Harvard
- learnr package developed by Garrick Aden-Buie, Barret Schloerke, JJ Allaire, and Alexander Rossell Hayes is a great way to Learn R by making interactive demos in RMarkdown
- Awesome R Learning Resources provides all kinds of resources and is maintained by Eric Fletcher
- R Intro by Grant McDermott and Ed Rubin
- R For Data Science by Hadley Wickham and Garrett Grolemund
- Advanced R by Hadley Wickham
- Geocomputation with R by Robin Lovelace, Jakub Nowosad, and Jannes Muenchow
- R Programming for Data Science by Roger D. Peng
- The R Inferno
- Posit Cheatsheets
- Ten Simple Rules for Teaching Yourself R by Lawlor et al. (2022)
- RStudio Gallery
- R Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, and Garrett Grolemund
- An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
- Data Science for Economists and Other Animals by Grant McDermott and Ed Rubin
- Causal Inference: The Mixtape by Scott Cunningham
- The Effect by Nick Huntington-Klein
- Spatial Data Science by Edzer Pebesma and Roger Bivand
- Data Visualization: A practical introduction by Kieran Healy
- Curated List by Nathan Tefft
- Library of Statistical Techniques (LOST)
- Cheatsheet on scraping with R
- Code and Data for the Social Sciences: A Practitioner's Guide by Matthew Gentzkow and Jesse Shapiro
- Coding for Economists: A Language-Agnostic Guide
- happygitwithr by Jenny Bryan
You are actively encouraged to use generative AI assistants in this class. These can be used to improve your code, refine your writing, iterate on your ideas, and more.
- Sign up for ChatGPT
- Sign up for GitHub Copilot (Note: you do not sign up through this organization; you sign up through your own personal GitHub account as a student.)
- Tips to get better with ChatGPT
- Integration of AI with R
Scheduled hours for R held in the Student Academic Support Center (SASC) of the Library are:
- Sunday | 7:30-9pm
- Monday | 12-1pm, 2:30pm-4pm
- Tuesday | 12-2:30pm, 6-7:30pm
- Wednesday | 11am-1pm, 6-7:30pm
- Thursday | 12-4pm, 6-7:30pm
- Friday | 11am-12pm
Ethan Wu is our Course-Attached tutor. He will host office hours in the SASC and will be available for individual appointments. His hours are:
- Evening Help Session: Thursdays at 6:30pm in PGill 227
- Regular Office Hours: By appointment
He can help you troubleshoot R. He does not have solutions to the problem sets, but he can help you figure them out.
- Think of it as a more interactive version of Googling for the solution to a bug
  - It is not a replacement for you, the programmer
  - Think through the basic coding tasks first, then ask AI to fill in the blanks
- Be as specific as possible in your instructions
  - If you know the name of the variables in your dataset, use them
- Try things iteratively and in small steps
  - If you're not sure how to do something, try to break it down into smaller steps
  - This is a good tip for coding in general
- Your brain is still the most powerful tool you have
  - ChatGPT is a tool to help you, not replace you
  - You will not get much mileage if you say, "Read in the gapminder dataset and do something interesting with it"
- Often it only provides "skeleton code", so you'll need to fill in the blanks
This is an undergraduate course taught by Kyle Coombs. Here is the course description, right out of the syllabus:
Economics is at the forefront of developing statistical methods for analyzing data collected from uncontrolled sources. Since econometrics addresses challenges in estimation such as sample selection bias and treatment effects identification, the discipline is well-suited for the analysis of large and unsystematically collected datasets. This course introduces statistical (machine) learning methods, which have been developed for analyzing such datasets but which have only recently been implemented in economic research. We will cover a variety of topics including data collection, data management, data description, causal inference, and data visualization. The course also explores how econometrics and statistical learning methods cross-fertilize and can be used to advance knowledge in the numerous domains where large volumes of data are rapidly accumulating. We will also cover the ethics of data collection and analysis. The course will be taught in R.
Component | Weight | Graded |
---|---|---|
4-5 × problem sets | 30% | Top 3-4 (drop one) |
1 × presentation | 15% | Top 1 |
N × In-class exercises | 10% | Completion, drop two lowest |
N × Out-of-class exercises | 15% | Completion, two drops |
1 × Final Project | 25% | Hackathon rubric |
1 × GitHub participation | 5% | Overall |
Classroom participation | Bonus up to 2.5% | Discretion |
Open source material contribution | Bonus up to 2.5% | Provide evidence |
Research seminar bonuses | Problem set bonus | 2 points |
Finalized grades on each component will be posted to Lyceum. Where possible to give feedback privately, it will be posted to GitHub.
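To see how the weights in the table combine, here is a minimal sketch of the arithmetic in R. The component scores are hypothetical, bonuses are left out, and the calculation assumes each component is scored on a 0-100 scale; it is an illustration, not the official grade formula.

```r
# Component weights from the table above (bonuses excluded).
weights <- c(
  problem_sets  = 0.30,
  presentation  = 0.15,
  in_class      = 0.10,
  out_of_class  = 0.15,
  final_project = 0.25,
  github        = 0.05
)

# Hypothetical component scores on a 0-100 scale.
scores <- c(
  problem_sets  = 88,
  presentation  = 92,
  in_class      = 100,
  out_of_class  = 95,
  final_project = 85,
  github        = 90
)

# Confirm the weights cover the whole grade, then take the weighted sum.
stopifnot(isTRUE(all.equal(sum(weights), 1)))
final_grade <- sum(weights * scores[names(weights)])
final_grade
```

Indexing `scores` by `names(weights)` keeps each score matched to its weight even if the two vectors are written in a different order.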
There are two types of exercises in this course: in-class exercises and Datacamp exercises.
These exercises are short coding assignments that you will start in class. They are designed to help you practice the skills we are learning in class and are due by the next class.
Exercises will be graded on a 0, 1, 2 scale. A 0 means you did not turn in the assignment. A 1 means you turned in the assignment, but put in minimal effort. A 2 means you turned in the assignment with clear effort.
I will drop the lowest two in-class exercise grades.
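As a rough illustration of the drop rule, the sketch below removes the two lowest grades from a hypothetical set of exercise grades. Expressing the result as a share of the points possible is an assumption made for the example, not a stated policy.

```r
# Hypothetical in-class exercise grades on the 0/1/2 scale.
exercise_grades <- c(2, 2, 0, 1, 2, 2, 1, 2)

# Keep everything except the n_drop lowest grades.
drop_lowest <- function(grades, n_drop = 2) {
  stopifnot(length(grades) > n_drop)
  tail(sort(grades), length(grades) - n_drop)
}

kept <- drop_lowest(exercise_grades)

# Share of possible points earned on the remaining exercises (assumed aggregation).
exercise_score <- sum(kept) / (2 * length(kept))
exercise_score
```

Using `tail()` on the sorted vector avoids the negative-index pitfall in R, where dropping zero elements with `x[-seq_len(0)]` returns an empty vector.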
During the semester, I will use a variety of tools to get you (1) practicing R and Git skills outside of class and (2) thinking about the value of different data science skills ahead of the relevant class period. These will include short readings, datacamp assignments, and
I will allow you to make up two Datacamp tutorials if you do not complete them on time.
Throughout the course you will engage in problem sets that deal with actual data. These may seem out of step with what we do in class, but they are designed to get you to think about how to apply the tools we learn in class to real data.
- Problem sets are coding assignments that get you to play with data using R
- Some problem sets will be group assignments, some will be individual
- They are extremely challenging, but also extremely rewarding
- With rare exceptions: You will not be given code to copy and paste to accomplish these data cleaning tasks, but instead given a set of instructions and asked to figure out how to write code yourself
- You are actively encouraged to collaborate with your classmates even if the assignment is not a group assignment
- All problem sets will be completed and turned in as GitHub repositories
- I will drop the lowest problem set grade
I reserve the right to add a problem set if I feel it will aid your learning.
Unless otherwise indicated on the problem set, this is what you will turn in.
- Each problem set will be posted as a GitHub repository, which you will fork, set to private, and then clone to your computer (instructions provided in each problem set)
- You will then work on the problem set on your computer or a Codespaces server, and push your code to GitHub (push often!) (Note: You have to push in Codespaces. If you delete your Codespace without pushing changes, you will lose your answers.)
- For each problem set, you will turn in modular code (i.e. separate files do separate things) that accomplishes the tasks outlined in the problem set
- You will also turn in a .Rmd file that contains your answers to the questions in the problem set along with a knitted .html or .pdf of your .Rmd
  - This .Rmd will "source" the code you wrote, so I can easily run your code from start to finish by knitting
- Your problem sets will have a sensible folder structure that is easy to navigate (name folders code, data, output, etc.)
- You will turn in your problem sets by pushing your code to GitHub.
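The modular layout described above can be sketched as follows. The script names (`01_clean.R`, `02_analyze.R`) and their contents are hypothetical, and the project is built in a temporary directory only so the example runs on its own; in a real problem set the same `source()` loop would sit in a setup chunk of the .Rmd.

```r
# Build a throwaway project in a temporary directory so the example is self-contained.
project <- file.path(tempdir(), "problem-set-sketch")
for (folder in c("code", "data", "output")) {
  dir.create(file.path(project, folder), recursive = TRUE, showWarnings = FALSE)
}

# Two hypothetical scripts: one cleans data, one summarizes it.
writeLines(
  "clean_data <- data.frame(x = c(1, 2, 3, 4))",
  file.path(project, "code", "01_clean.R")
)
writeLines(
  "mean_x <- mean(clean_data$x)",
  file.path(project, "code", "02_analyze.R")
)

# Source each script in name order, as a setup chunk in the .Rmd might do.
scripts <- sort(list.files(file.path(project, "code"), pattern = "\\.R$", full.names = TRUE))
for (script in scripts) {
  source(script, local = TRUE)
}

mean_x
```

Numbering the scripts is one simple way to make the run order obvious to anyone who opens the code folder.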
Grading
Your problem sets are (generally) graded on four criteria:
- Submission via GitHub (10%): Did you use GitHub to stage, commit, and push your code? Did you submit the assignment on time? Did you submit the assignment in the correct format?
  - If you are unable to submit via GitHub pushing/pulling, you can create a zip folder of your problem set and upload it as a single file to your repository.
    - Zip instructions: https://www.wikihow.com/Make-a-Zip-File
    - Click Add file in your repository to upload the zip file.
    - You will forgo this 10% of your grade.
- Quality of code (30%): Is it well-commented? Is it easy to follow? Can I run it?
  - Any scripts needed to run your code should be included in the repository and sourced in the .Rmd file
  - Write code that automates as much of the process as possible. For example, if you need to download a file, write code that downloads the file automatically
  - If you cannot figure out how to automate a step, you can write a comment explaining what I need to do to run your code (you will lose very few points)
- Quality of presentation of graphs and tables (30%): Are they well-labeled? Do they have titles? Do they have legends? Are they formatted well?
- Quality of answers (30%): Are they clear? Do they answer the question?
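For the automation point under code quality, one common pattern is to download a file only when it is not already on disk. The sketch below is an assumption-laden illustration: `download_if_missing()` is a hypothetical helper, and a local `file://` address stands in for whatever URL a problem set actually provides.

```r
# Download a file only if it is not already on disk, so re-running the script is cheap.
download_if_missing <- function(url, dest) {
  if (file.exists(dest)) {
    message("Already downloaded: ", dest)
    return(invisible(FALSE))
  }
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  invisible(TRUE)
}

# Self-contained demonstration: a local file stands in for the remote source.
# In a problem set, source_url would be the address given in the instructions.
source_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3), source_file, row.names = FALSE)
source_url <- paste0("file://", normalizePath(source_file, winslash = "/"))

dest <- file.path(tempdir(), "data-sketch", "raw.csv")
unlink(dest)
first_run  <- download_if_missing(source_url, dest)  # downloads the file
second_run <- download_if_missing(source_url, dest)  # skips the download
```

Returning `TRUE` or `FALSE` invisibly lets a calling script log whether a fresh download happened without printing anything by default.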
I will provide feedback and a grade in a feedback branch of your problem set repository. That will let me add feedback without overwriting your work in the main branch.
On group assignments, I expect everyone to collaborate. Your group will receive one grade. Part of the course is about learning to collaborate on open-ended challenging problems. This sets you up to tackle the hackathon effectively at the end of the semester.
If there has been a clear disparity in effort as made clear by the repository commit history, I will reach out to the group members to clarify whether the disparity is due to working together in person or if there was a meaningful difference in contributions. If I deem there has been a meaningful difference in contributions, only the group members who actively participated will be eligible for resubmission and gaining points back as detailed below.
If I get sufficient evidence that a group member did not contribute to any portion of the problem set, that group member will receive a zero.
The solutions are made public within a week of the problem set being due.
In an effort to incentivize you to see coding as an ongoing process of learning and improvement, I will allow you to improve the coding and presentation quality portions of your grade on any problem set. However, you cannot just copy and paste the solutions.
Instead, you must provide carefully commented explanations of each step of the code. This is a great way to learn, but it is also a lot of work.
Example. You might add a comment like this to the top of your code:
# Create project directories; suppress the warning that a directory already exists.
suppressWarnings({
  dir.create("data")
  dir.create("documentation")
  dir.create("code")
  dir.create("output")
  dir.create("writing")
})
To be eligible to resubmit to improve your grade, you must have submitted an initial version of the problem set on time.
- View my feedback on the feedback branch of your problem set repository.
- Fix your problem set answers and comment your code as needed to explain the change. Write "# CORRECTED" in all caps next to any changes (commented out)
  - You will only receive points back on coding/figures portions of the problem set that you tried in your first submission.
- Make sure your problem set knits from raw data to final answers without any errors as do my solutions.
- Push changes to the main branch of your problem set repository.
- Navigate to the Issues tab of your problem set repository and create a new issue titled "Resubmission for Problem Set X". Briefly describe your changes in the body of the issue and tag my username, @kgcsport.
- Deadline for resubmissions: All resubmissions must be pushed within one week of the solutions being posted.
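A resubmitted fix marked with "# CORRECTED" might look something like the sketch below. The data and the original mistake are hypothetical; the point is only to show the old line left commented out next to a short explanation of the change.

```r
# Hypothetical data with a missing value.
income <- c(42000, 51000, NA, 39000)

# First submission (left commented out so the change is visible):
# mean_income <- mean(income)

# CORRECTED: mean() returns NA when any value is missing, so na.rm = TRUE
# drops the missing observation before averaging.
mean_income <- mean(income, na.rm = TRUE)
mean_income
```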
Within your own private problem set repository, you can create an Issues tab within the Settings tab for interfacing only with me and any group partners.
On group assignments, only students that actively participated will be eligible for resubmission and gaining points back.
On occasions, you may disagree with the grade you received on a problem set. Here are my policies for reconsideration:
- Deadline for requests: All requests for reconsideration must be submitted within one week of the solutions being posted.
- Full regrade: Any request for reconsideration will result in a full regrade of your problem set. This means that your grade can go up or down.
- Regrading high scores: If you scored a 90 percent or above on a problem set, I will not change your grade. This is not because I do not want to help you, but because we both have limited time and I want to focus my efforts on cases where an incorrectly graded problem set could significantly impact your grade in the course. This does not apply to re-submissions. This only applies to the cases where you want me to review your score in full separate from corrections and re-submissions.
If you would like reconsideration, please raise an Issue in your private problem set repository. Title the issue "Reconsideration request for Problem Set X". Briefly describe your request in the body of the issue and tag my username, @kgcsport.
Within your own private problem set repository, you can create an Issues tab within the Settings tab for interfacing only with me and any group partners.
Each of you will give a 10-minute presentation of either an MRE of a coding struggle you are having or on a coding skill.
Please sign up here at the start of the semester.
You will write a final project over the course of the semester as part of a group. It is likely that the final project will be a hack-a-thon in collaboration with the city of Lewiston detailed below.
This semester, we will be working with the City of Lewiston to help them solve a problem using data. Specifically, we will help the city understand how to use existing administrative data to complement, and at times substitute for, survey data.
We will specifically be engaging in a Hack-a-thon. A hack-a-thon is a short (often 24 hours), intense period of collaboration between a group of people to solve a problem. Scheduling is still in the works.
The Hack-a-thon is mandatory.
Several weeks before the hack-a-thon, we will brainstorm datasets that your group would like the City of Lewiston to provide for you. You will then write a short report on how you would use those datasets to solve a problem.
- Compete in groups of 3-5 to each propose solutions to the same problem
- Present your solution to a group from the City of Lewiston
- Write a short report on your solution
- Maintain all code and necessary documentation for the City of Lewiston in a GitHub repository
- Provide any additional documentation the City of Lewiston requests
Your solution may include a variety of things, including:
- A data visualization
- Suggestions of new databases to maintain
- Examples from similar cities that have tackled these problems
In the event that the Hack-a-thon does not occur, the final project will be a replication project with an extension of the results. You will evaluate (1) the extent that you could replicate the results and (2) the quality of the replication package provided. You will turn in a short paper explaining what you did, a replication package, and a presentation of your extension results.
Participation on GitHub is 5 percent of your grade. Please use GitHub Discussions and Issues to ask questions about the course materials and problem sets. You can also suggest improvements to the course materials. Here are the guidelines:
- When starting a discussion, posting an issue, or suggesting a pull request, please use a clear title (e.g. Problem Set 1: Question about Question 2) and description ("What does term X mean?")
- If posting about an error you are encountering, follow these steps:
  - Briefly state the expected behavior
  - Write the minimal code needed to reproduce the error (a minimally reproducible example)
  - Write the full error message you are receiving
  - Write the steps you have taken to troubleshoot the error
- If posting a clarifying question about someone's post, follow these steps:
  - Briefly clarify what you are confused about
  - Suggest potential interpretations of the post
- If posting about a suggestion to improve the class materials, follow these steps:
  - Briefly state the improvement you are suggesting
  - Write the steps you have taken to troubleshoot the error
  - If you are suggesting a change to the course materials, please fork the repository, make the change, and submit a pull request
- Be kind to one another. Coding is hard. We are all learning.
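The error-posting steps above can be sketched as a small script. Everything here is hypothetical (the data frame, the column names, the problem itself); it is one way to assemble a minimally reproducible example, capture the exact message, and record a first troubleshooting step.

```r
# Expected behavior: the average of `score` across three students.
# Minimal data that reproduces the problem (no need to share the full dataset).
df <- data.frame(
  student = c("a", "b", "c"),
  score = c("90", "85", "n/a"),
  stringsAsFactors = FALSE
)

# The failing call, wrapped so the full warning or error text can be pasted into the post.
result <- tryCatch(
  mean(df$score),
  warning = function(w) conditionMessage(w),
  error = function(e) conditionMessage(e)
)
result

# Troubleshooting so far: check the column type.
class(df$score)
```

Pasting the code, the captured message, and the output of the type check gives a reader everything needed to reproduce the problem without access to your files.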
These guidelines are taken from stackoverflow.com. You can read more about how to ask a good question here and how to answer a question here.
I will rate participation based on the following criteria:
- Are you posting thoughtful questions? Follow the guidelines on stackoverflow for posting a good question.
- Are you replying to questions? Follow the guidelines on stackoverflow to write a good answer.
- Do you pull request improvements to the course materials? This can be a typo fix, a bug fix, or a new feature.
Note: I hope to add Issue templates across the entire organization. All in good time.
For each problem set, use the Issues tab for that specific problem set. For course materials, please use the organization Discussions tab.
I will be monitoring the GitHub Issues tab for each repository and will award participation points to those who are actively engaging per the guidelines from stackoverflow. To receive full credit, you must be asking thoughtful questions and thoughtfully answering each other's questions. Thoughtful questions come after you've spent some time re-reading your code, Googling, and working with ChatGPT to try to solve the problem yourself first. Thoughtful answers may not solve the problem, but they should be clear, concise, good faith efforts to help. You will receive a lower participation grade if you do not follow the posting guidelines.
The goal is to encourage you to work together to solve problems. This is one of the most important skills you can take away from this, and really any, course. I also want to incentivize you to think carefully about how you post. Be kind and respectful, as much as you endeavor to be clear, concise, and helpful.
Furthermore, I want you to take ownership over your learning. You get more out of a course when you are actively engaged in the material. Actively engaging on this repository and suggesting changes to the course materials is a very tactile way to do that.
There are several opportunities for bonus points during the semester:
- A 2.5% bonus on your final grade for issuing a pull request to any open source material. This can be to fix a typo or to fix a bug in the code.
- A 2.5% participation bonus on your final grade that I will award at my discretion.
- I offer a problem set bonus point for each typo corrected on problem sets and solutions. This is capped at 10 points per student per problem set. You must pull request and/or raise an Issue on the corresponding GitHub repository to get credit.
- I offer a make-up of an out-of-class or in-class exercise for attending an economics research seminar and writing a short summary in the GitHub discussions page.
I have given instructions on how to execute a pull request of a specific commit (instead of your entire commit history) in the FAQ.
I offer extensions for two things:
- Major health issues
- Major family emergencies
Please flag with Bates Reach and email me with the subject "[ECON 368] subject here", so I can be aware of the situation. Together we'll figure out an appropriate extension.
If you are asking for an extension on a group assignment, include all group partners on the email. Do not email me individually and ask for an extension on behalf of your group. This practice will ensure that all group members are aware of the request.
Having trouble with R on your computer?
To get you up and running and writing R code in no time, I have containerized this workshop so that you have an R coding environment that is ready out of the box.
For some problem sets, I will explicitly request that you work with GitHub Codespaces to minimize the amount of time you spend troubleshooting your local R installation and package versions. No more, "but it works on my computer" when I ask you why your code isn't running! On occasion, I may ask you to work on your own computer because I want you to learn how to troubleshoot on your own machine.
Click the green "<> Code" button at the top right on this repository page, and then select "Create codespace on main". (GitHub Codespaces is available with GitHub Enterprise and GitHub Education.)
To open RStudio Server, click the Forwarded Ports "Radio" icon at the bottom of the VS Code Online window.
In the Ports tab, click the Open in Browser "World" icon that appears when you hover in the "Local Address" column for the Rstudio row.
This will launch RStudio Server in a new window. Log in with the username `rstudio` and the password `rstudio`.
- NOTE: Sometimes, the RStudio window may fail to open with a timeout error. If this happens, try again, or restart the Codespace.
In RStudio, use the File menu to open the file `test.Rmd`. Use the "Knit" submenu to "Knit as HTML" and view the rendered "R Notebook" Markdown document.
- Note: You may be prompted to install an updated version of the `markdown` package. Select "Yes".
- Note: Pushing/pulling will work a bit differently. In practice, you will use the icon for "Source Control" on the RHS bar where you can stage things, commit, and push them. You will need to do this to turn in your problem set. See documentation from GitHub on Source Control and Codespaces.
This course is hosted in a GitHub Organization. An organization is a collection of repositories. A repository contains the full history of "commits" of a coding project, all its folders, etc.
Organizations are incredibly useful because all the repositories within an organization are linked. This means that you can easily navigate between them. For example, you can click on "big-data-and-economics" to get to the organization page and navigate over to other repositories under the Repositories tab.
We'll go over this in more detail in later lectures, but a commit is a collection of changes made to the code that you intentionally bundle together and can label. You can think of it like saving a file under a separate file name. The difference is you don't create multiple copies of the file, but you can revert to earlier versions easily using your repository.
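To make the idea of a commit concrete, here is a minimal sketch run in a throwaway folder. The file name `analysis.R`, the commit messages, and the demo user details are all hypothetical and only serve the illustration; this is not one of the course repositories.

```shell
set -e

# Work in a temporary folder so nothing on your machine is touched
demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main        # name the first branch "main"
git config user.name "Demo Student"          # hypothetical identity for the demo
git config user.email "student@example.com"

# First commit: bundle the new file and label it
echo 'x <- 1' > analysis.R
git add analysis.R
git commit -q -m "Add first version of analysis"

# Second commit: same file, new contents, no extra copy on disk
echo 'x <- 2' > analysis.R
git commit -q -am "Change x to 2"

# The history lists both labelled commits
git log --oneline

# Bring back the file as it was one commit ago
git checkout -q HEAD~1 -- analysis.R
cat analysis.R    # prints: x <- 1
```

There is only ever one `analysis.R` in the folder; the earlier version lives in the repository history rather than in a second file.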
Please raise an issue or submit a pull request. For those taking this course, I offer a 2.5% bonus on your final grade for issuing a pull request to any open source material -- including these lecture notes. This can be to fix a typo or to fix a bug in the code.
Please note that this is a work in progress, with new material being added every week.
If you just want to read the lecture slides or HTML notebooks in your browser, then you should simply scroll up to the Lectures section at the top of this page. Completed lectures will be hyperlinked as soon as they have been added. Remember to check back in regularly to get any updates. Or, you can watch or star the repo to get notified automatically.
If you actually want to run the analysis and code on your own system (highly recommended), then you will need to download the material to your local machine. The best way to do this is to clone the repo via Git and then pull regularly to get updates. Please take a look at these slides if you are unfamiliar with Git or are unsure how to do any of that. Once that's done, you will find each lecture contained in a numbered folder (e.g. `01-intro`). The lectures themselves are written in R Markdown and then exported to HTML format. Click on the HTML files if you just want to view the slides or notebooks.
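The clone-once, pull-regularly routine can be sketched as follows. To keep the example self-contained, a local folder stands in for the course repository on GitHub; in practice you would pass the repository's URL to `git clone` instead. The folder `02-example` and the file names are hypothetical.

```shell
set -e
work=$(mktemp -d)

# Stand-in for the course repository (normally hosted on GitHub)
git init -q "$work/course-repo"
cd "$work/course-repo"
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo Instructor"       # hypothetical identity for the demo
git config user.email "instructor@example.com"
mkdir 01-intro
echo "# Intro" > 01-intro/01-intro.Rmd
git add .
git commit -q -m "Add lecture 1"

# Step 1 (done once): clone the repo to your machine
git clone -q "$work/course-repo" "$work/my-copy"

# Meanwhile, a new lecture is added to the course repository
mkdir 02-example
echo "# Example" > 02-example/02-example.Rmd
git add .
git commit -q -m "Add lecture 2"

# Step 2 (done regularly): pull to receive the update
cd "$work/my-copy"
git pull -q
ls    # both numbered lecture folders are now present
```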
Please open a new issue. Better yet, please fork the repo and submit an upstream pull request. I'm very grateful for any contributions, but may be slow to respond while this course is still being developed. Similarly, I am unlikely to help with software troubleshooting or conceptual difficulties for non-enrolled students. Others may feel free to jump in, though.
Sure. I already borrowed half of it from Grant McDermott, Tyler Ransom, Raj Chetty, and Stephen Hansen. I have also kept everything publicly available. I ask two favours (like Grant McDermott): 1) Please let me know (by email) if you do use material from this course, or have found it useful in other ways. 2) An acknowledgment somewhere in your own syllabus or notes would be much appreciated.
If you want to make a pull request of a specific commit (and not all changes you have made), you are best off using the command line interface for Git. There are two ways to do this thanks to a recent innovation by GitHub. The first, more traditional approach involves something called cherry picking. Here's how you do it:
Use the command line (Git Bash, WSL, Terminal)
- Create a fork of this repository (called the upstream repository) if you have not before
- Clone the forked repo to your local computer
- Add the original repo as a remote called `upstream` (enter `git remote add upstream <upstream-repo-url>`)
- Fetch the upstream repo (`git fetch upstream`)
- Create a branch of this upstream repo (`git checkout -b <pull-request-branch-name> upstream/main`)
- Either:
  - Make the changes you want to make and commit them
  - Cherry pick the specific commit you want to merge as a pull request by typing `git cherry-pick <commit-hash>` into the command line
    - A commit hash is a unique combination of letters and numbers that identifies a specific commit. You can find the commit hash by running `git log` and copying the hash of the commit you want to make a pull request for OR by clicking on the commit history on GitHub and copying the SHA (the icon with two interlocked squares.)
- Push this branch to the forked repository with `git push -u origin <pull-request-branch-name>`
- Return to your forked repo's main branch with `git checkout main`
- Navigate to your forked repository on GitHub and create a pull request from the branch you just pushed (you should see a banner that says "Compare & pull request" when you navigate to your forked repo)
- Make sure:
  - The base repository is the upstream repo and the base is the main branch
  - The head repository is your forked repo and the compare is the branch named `<pull-request-branch-name>`
- Optional after pull request is accepted: Destroy the pull-request-branch once it has served its purpose with `git branch -d <pull-request-branch-name>`
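The command-line steps above can be sketched end to end as follows. This is a self-contained rehearsal under stated assumptions: local folders stand in for the upstream repository and for your fork (on GitHub you would click "Fork" and use real URLs), and the branch name `typo-fix`, the file names, and the commit messages are hypothetical. The final pull request itself still has to be opened on GitHub.

```shell
set -e
work=$(mktemp -d)

# Stand-in for the upstream (course) repository
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo Instructor"       # hypothetical identities for the demo
git config user.email "instructor@example.com"
echo "Lecture notes" > README.md
git add README.md
git commit -q -m "Initial commit"

# Stand-in for your fork on GitHub, then a local clone of that fork
git clone -q --bare "$work/upstream" "$work/fork.git"
git clone -q "$work/fork.git" "$work/local"
cd "$work/local"
git config user.name "Demo Student"
git config user.email "student@example.com"

# Two commits on your fork's main branch; only the second belongs in the pull request
echo "my answers" > problem-set.txt
git add problem-set.txt
git commit -q -m "Work on problem set"
echo "Lecture notes (typo fixed)" > README.md
git commit -q -am "Fix typo in README"
fix_hash=$(git rev-parse HEAD)               # the hash you would copy from git log

# Add upstream, fetch it, and branch off upstream/main
git remote add upstream "$work/upstream"
git fetch -q upstream
git checkout -q -b typo-fix upstream/main

# Bring over only the chosen commit, push the branch to the fork, return to main
git cherry-pick "$fix_hash"
git push -q -u origin typo-fix
git checkout -q main

# The branch is exactly one commit ahead of upstream/main
git log --oneline upstream/main..typo-fix
```

Because `typo-fix` starts from `upstream/main` rather than from your own `main`, the problem set commit never enters the pull request.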
The second, more recent approach involves using the GitHub Desktop interface to create a new branch that is directly aligned with the upstream branch. This involves some setup, but is a bit more user-friendly (i.e. point-and-click based). This assumes you are working with a clone of an existing, forked repository.
Use GitHub Desktop
- Open GitHub Desktop.
- Navigate to your repository under the `Current Repository` tab. If you don't see it, you may need to `Add > Add to existing repository` (navigate to the repository's local path on your computer, so GitHub Desktop knows where to look).
- Click on the dropdown menu `Repository` and select `Repository settings`.
- Click on `Fork behavior` and select "To contribute to the parent repository" under "I'll be using this fork..." (This will mean all new branches are aligned with the upstream "parent fork" repository.)
- Click `Save` and move on.
- Create a new branch by clicking on the `Current Branch` tab and selecting `New Branch`.
- Name the branch something that is descriptive of the changes you are making, i.e. `<pull-request-branch-name>`
- Navigate to the `Branch` tab and select the new branch you just created
- Publish the branch to your forked repository by clicking `Publish branch` in the bottom right corner.
- Make the changes you want to make in the repository.
- Commit the changes to the branch by clicking `Commit to <pull-request-branch-name>` in the bottom left corner in GitHub Desktop. (Or use RStudio, etc.)
- Push the changes to the branch by clicking `Push origin` in the bottom left corner in GitHub Desktop.
- Navigate to your forked repository on GitHub and create a pull request from the branch you just pushed (you should see a banner that says "Compare & pull request" when you navigate to your forked repo)
- Make sure:
  - The base repository is the upstream repo and the base is the main branch
  - The head repository is your forked repo and the compare is the branch named `<pull-request-branch-name>`
- Optional after pull request is accepted: Destroy the pull request branch by navigating to the `Branch` tab in GitHub Desktop and selecting `Delete <pull-request-branch-name>`. Select `Yes, delete this branch on the remote` as well to fully kill this branch.
This organization is used to run the ECON/DCS 368 course at Bates College, commonly known as Data Science for Economists or Big Data and Economics. This readme documents how to interact with this organization as a student in the course or instructor looking to mimic it.
Students should await an initial invitation via a GitHub Classroom assignment. This will allow you to enroll as a member of the organization.
About a month after the course is over, students will be moved from Members to Outside Collaborators. This will allow any students to maintain access to private course materials while losing access to (and deleting) any of their problem set repositories that are hosted within the organization. Students will keep local versions of their solutions on their machines and can create new repositories accordingly.
Use an organization like this to invite students to use GitHub Classroom. I am continuously improving how I leverage GH Classroom, which is itself under constant development. I largely use GH Classroom to automate organization enrollment and access to repositories (much faster than adding each name multiple times and managing repository access one-by-one).
After creating your organization:
- Create a classroom with GH classroom within said organization. Add your students one-by-one (just once!) or link to your institution's learning management software.
- Create an initial problem set assignment through GH classroom that you want to send out to students. Send the link to students so they enroll. Optional: Make it a group assignment to create a group that all students are in, which can be used to give access to repositories during the course.
- Optional: Make problem set repositories within the organization and ask students to fork their own repositories instead of using the GH Classroom links -- that way the students host the forks on their own repositories.
- Once students have submitted all work and the class is over, you have two options: (1) archive the classroom and convert them to Outside Collaborators to maintain their access to materials or (2) delete the classroom entirely. Either way they will keep local clones of repositories, but this changes their access to forked repositories/class materials after the course concludes.
The material in this repository is made available under the MIT license.