See the HEAT course syllabus for all mark weights and course policies, textbooks, etc. The syllabus also contains the important information on course communication and schedules. Note that all three sections share the same syllabus.
Course content will be mostly based on paper discussions, with some more technical lecture sessions on specific tools. Attendance and engagement in class are mandatory and essential. Email me if you think you may miss a lecture.
Software development activities (expansively defined as requirements, testing, deployment, design, development, ...) generate a lot of raw quantitative data (and qualitative data as well, but that is a separate course). We can use what are now cheap and readily available tools (machine learning classics like regression, but also deep learning, optimization, and NLP) to try to make sense of this data. This class is essentially about how we can use data science to analyze software data in order to predict, understand, and improve software practices. This is also known as ML4SE, and might be contrasted with SE4ML, the application of software engineering practices to machine learning and data science.
Some basic questions we might start with (h/t to Greg):
- How many projects written in Python are on GitHub (filtering out student projects, dead projects, and forks)?
- Is Java better than C?
- Which programmer is more productive?
- What file in my project has the most bugs?
- How long will it take to write this code?
After this class, students will be able to:
- Use the scientific method to separate fact from fiction in software engineering claims.
- Demonstrate a grasp of basic data science workflows - acquiring data, cleaning it, analyzing it, and generating a report.
- Use and understand machine learning approaches, such as linear regression and dimensionality reduction, in the context of software engineering problems.
- Summarize SE data science papers, quickly identifying their key contributions.
- Understand the ethical implications of the data science they are doing.
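As one illustration of the workflow named in the outcomes above (acquire, clean, analyze, report), here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the CSV data is invented toy data, not from any real project, and the helper names (`acquire`, `clean`, `fit_line`, `report`) are hypothetical. It fits a simple least-squares line relating file size to bug-fix counts; it is not the method used in the course assignments.

```python
import csv
import io
from statistics import mean

# Invented toy data (NOT from a real project): file, lines of code, bug-fix commits.
RAW_CSV = """file,loc,bug_fixes
parser.py,1200,14
lexer.py,400,5
utils.py,150,
cli.py,300,3
db.py,900,11
README.md,n/a,0
"""


def acquire(text):
    """Acquire: read rows from a CSV source (here an in-memory string)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: keep only rows whose numeric fields parse as integers."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append((row["file"], int(row["loc"]), int(row["bug_fixes"])))
        except ValueError:
            continue  # drop rows with missing or non-numeric values
    return cleaned


def fit_line(xs, ys):
    """Analyze: ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def report(cleaned, slope, intercept):
    """Report: summarize the cleaned data and the fitted line as text."""
    worst_file, _, worst_count = max(cleaned, key=lambda r: r[2])
    return (
        f"Files analyzed: {len(cleaned)}\n"
        f"Most bug fixes: {worst_file} ({worst_count})\n"
        f"Fitted line: bug_fixes = {intercept:.3f} + {slope:.5f} * loc"
    )


if __name__ == "__main__":
    data = clean(acquire(RAW_CSV))
    slope, intercept = fit_line([r[1] for r in data], [r[2] for r in data])
    print(report(data, slope, intercept))
```

With six toy rows, two are dropped during cleaning because a numeric field is missing or unparseable. A real analysis would also need to ask whether dropping rows biases the result, and whether a correlation between size and bug fixes says anything causal.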
There are no mandatory texts. We will cover public tutorials and papers. I occasionally refer to chapters from the following, all available "free" from the UVic library:
- Data Science from Scratch. No libraries in this one; examples are in Python. Joel Grus. O'Reilly 2019.
- R for Data Science. A nice intro showing the R approach to importing, cleaning, and visualizing data. Hadley Wickham & Garrett Grolemund. O'Reilly 2017 (under revision). This version is freely available as HTML.
- Statistical Rethinking. Bayesian approach to statistical inference with an excellent and gentle intro. Available via the library e-collection. Also see the author's excellent YouTube recordings and GitHub material. Richard McElreath. CRC Press 2020 (2nd).
- Regression and Other Stories. Intro to regression and inference with clear examples. Gelman, Hill, Vehtari. Cambridge 2021.
Due dates are all on Brightspace. Class format will be lectures and short in-class exercises, including discussion of the readings (which you must do before class).
Class in Summer 2025 will be two 3-hour sessions each week. That's a lot of class time. The plan is to take two short breaks (5 mins) and one longer break (20 mins) at appropriate times. You should come to class prepared to follow along on your computer or tablet; most of the latter half will be in-class exercises. Each class has assigned readings for its module; those readings are to be done prior to class.
Day | Module | Due |
---|---|---|
July 4 | Intro • AI4SE | |
July 9 | Early Approaches and Problems | |
July 11 | Basic stats | Project proposal |
July 16 | Bayes | |
July 18 | Project work - no class | Assn 1 - basic DS |
July 23 | Ethics | |
July 25 | LLMs for SE | |
July 30 | LLMs for SE cont. | |
Aug 1 | Project work - no class | Assn 2 - Black Mirror • Interim project report |
Aug 6 | Traceability • Clones • Cost | |
Aug 8 | Bug Localization/Triage | |
Aug 13 | Analysing Text Discussions / Qualitative Data in SE | Assn 3 - Bayes |
Aug 15 | Spare/buffer - Productivity | |
Aug 20 | Demos/Project presentations | Project presentation |
Aug 23 | no class | Final project report |
See the Assignments page.
Each week there are several readings/videos. Everyone must do the readings before class. There will be a short quiz on the readings, focused on analysis and synthesis of the papers (i.e., not just memorizing what the paper results were). These quizzes are designed to take about 8 minutes, and everyone will have 16 minutes to complete them.
The project is a semester-long SE data science project tackling an SE-specific problem, using techniques discussed in class. See the project page for details.
Category | Value |
---|---|
Project | 45% |
Assignments | 40% |
Paper quizzes | 15% |
Category | Value |
---|---|
Project | 45% |
Assignments | 30% |
Paper presentations | 15% |
Paper quizzes | 10% |
Assignment/project expectations will be higher for graduate students than undergraduate students.
Course marks will be distributed via Brightspace.
- Neil Ernst, instructor. Please message me (Teams or email) to set up an office-hours meeting.
Please use Teams to message the TAs first with programming questions and group issues. Direct personal issues to the instructor at nernst@uvic.ca.
The class will use GitHub (course notes, slides), Teams (blogs, discussion, chat), and Brightspace (assignment posting, grades). Grades and any interviews or videos are distributed via Brightspace for privacy compliance.
This course is a synchronous, in-person-only offering. I will attempt to record the lectures, but the cameras are not great, and I will turn off the recording during class discussions and similar activities.
Many course activities (such as group design activities and chat sessions) will expect synchronous participation (i.e., at the scheduled time). Students should plan to attend all course components. The course will not be able to accommodate personal scheduling issues, including time zone variations (from Pacific Daylight Time).
I use Quarto to create the HTML slides for the course. All the notes are in Quarto Markdown, a text format, and are readable directly on GitHub. If you would like to generate PDFs instead, clone the repo to your machine and, in the top directory, run `quarto render --to pdf`. This will also re-run all the R commands, so you should also have R (v 4.4.3) installed. Running R, and then running `renv::restore()` inside R, will install all the relevant libraries. CmdStanR and rethinking need special install steps, documented in the links to those libraries.
The university and the Faculty of Engineering have a strong mandate to support Equity, Diversity and Inclusion: https://www.uvic.ca/engineering/about/equity/index.php. We as a teaching team will do what we can to create a positive, safe, and supportive environment for you to participate in all components of this course offering. I (the instructor) appreciate all feedback from you and hope that you feel free to message me to voice concerns or to arrange a time to discuss virtually or in person.
You are expected to be respectful of other students and the instructor/TAs: minimize non-class discussions and activities (such as scrolling TikTok), participate by providing input and asking questions, and use inclusive language and behavior.
Academic integrity will be strictly monitored in this course for any work submitted for marks. See course component descriptions and Course Policies and Guidelines for details on academic integrity expectations. Substantiated academic integrity violations will be referred to the Department's Academic Integrity Committee, which will determine the penalty and ensure a record of the violation is kept with the university.
You are expected to use generative AI (ChatGPT, Copilot, Claude, Cursor, etc.) in this course except where explicitly forbidden (mainly where it would get in the way of you actually learning and understanding the concepts!). Use of these assistants beyond simple grammar help must be accompanied by a dedicated section in the assignment submission outlining how they were used and why they helped.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.