DSCI 100: Introduction to Data Science
Use of Data Science tools to summarize, visualize, and analyze data. Sensible workflows and clear interpretations are emphasized. Prerequisite: MATH 12. UBC course calendar entry.
Expanded Course Description
In recent years, virtually all areas of inquiry have seen an uptake in the use of Data Science tools. Skills in assembling, analyzing, and interpreting data are more critical than ever. This course is designed as a first experience in honing such skills. Students who have completed this course will be able to implement a Data Science workflow in the R programming language by “scraping” (downloading) data from the internet, “wrangling” (managing) the data intelligently, and creating tables and/or figures that convey a justifiable story based on the data. They will be adept at using tools for finding patterns in data and making predictions about future data. There will be an emphasis on intelligent and reproducible workflows and on clear communication of findings.
Course Software Platforms
Students will learn to perform their analyses using the R programming language. Worksheets and tutorial problem sets, as well as the final project analysis, development, and reports, will be done using Jupyter Notebooks. Students will work on their own devices in lectures and tutorials (students who do not have a laptop, Chromebook, or tablet of their own can use the UBC Library's technology lending program).
By the end of the course, students will be able to:
- Download and scrape data from the World Wide Web.
- Wrangle data from their original format into a fit-for-purpose format.
- Create, and interpret, meaningful tables from wrangled data.
- Create, and interpret, impactful figures from wrangled data.
- Apply, and interpret the output of, a simple classifier.
- Make and evaluate predictions using a simple classifier.
- Apply, and interpret the output of, a simple clustering algorithm.
- Apply, and interpret the output of, a regression model.
- Make and evaluate predictions using a regression model.
- Distinguish between in-sample prediction, out-of-sample prediction, and cross-validation.
- Apply and interpret a bootstrap analysis in a regression context.
- Accomplish all of the above using workflows and communication strategies that are sensible, clear, reproducible, and shareable.
|Position||Name||Office hours||Office location|
|Instructor||Tiffany Timbers (firstname.lastname@example.org)||TBD||ESB 3152|
|Tutorial problem sets||15|
|Peer review of other groups' projects||5|
|1||Chapter 1: Introduction to Data Science||Learn to use the R programming language and Jupyter notebooks as you walk through a real-world Data Science application that includes downloading data from the web, wrangling the data into a usable format, and creating an effective data visualization.|
|2||Chapter 2: Reading in data locally and from the web||Learn to read in various kinds of data sets locally and from the web. Once read in, these data sets will be used to walk through a real-world Data Science application that includes wrangling the data into a usable format and creating an effective data visualization.|
|3||Chapter 3: Cleaning and wrangling data||This week will be centered around tools for cleaning and wrangling data. Again, this will be in the context of a real-world Data Science application, and we will continue to practice working through a whole case study that includes downloading data from the web, wrangling the data into a usable format, and creating an effective data visualization.|
|4||Chapter 4: Effective data visualization||Expand your data visualization knowledge and tool set beyond what we have seen and practiced so far. We will move beyond scatter plots and learn other effective ways to visualize data, as well as some general rules of thumb to follow when creating visualizations. All visualization tasks this week will be applied to real-world data sets. Again, this will be in the context of a real-world Data Science application, and we will continue to practice working through a whole case study that includes downloading data from the web, wrangling the data into a usable format, and creating an effective data visualization.|
|5||Transition week||Quiz #1|
|6||Chapter 5: Classification||Introduction to classification using K-nearest neighbours (k-nn)|
|7||Chapter 6: Classification, continued||Classification continued|
|8||Chapter 7: Clustering||Introduction to clustering using K-means|
|9||Transition week||Quiz #2|
|10||Chapter 8: Regression||Introduction to regression using K-nearest neighbours (k-nn). We will focus on prediction in cases where there is a response variable of interest and a single explanatory variable.|
|11||Chapter 9: Regression, continued||Continued exploration of k-nn regression in higher dimensions. We will also begin to compare k-nn to linear models in the context of regression.|
|12||Chapter 10: Bootstrap applied to regression||This week will introduce the bootstrap, first by visualizing bootstrap samples and their fitted regression lines for cases where there is a response variable of interest and a single explanatory variable. An intuitive case will be made for what the ensemble of slopes represents. Then we will work through examples from multiple regression, emphasizing the scientific interpretation and relevance of the mix of negative and positive slopes. We will emphasize that this is a jumping-off point for the study of statistical inference.|
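The bootstrap idea described for week 12 (resample the data, refit the regression line, and look at the resulting ensemble of slopes) can be sketched roughly as follows. The course itself works in R; this sketch uses plain Python purely for illustration. The synthetic data, the helper names `fit_slope` and `bootstrap_slopes`, the number of resamples, and the simple percentile interval are all assumptions made for the example, not course materials.

```python
import random


def fit_slope(xs, ys):
    """Least-squares slope of y on x for a single explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def bootstrap_slopes(xs, ys, n_boot=1000, seed=0):
    """Refit the slope on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    while len(slopes) < n_boot:
        # Draw n row indices with replacement: one bootstrap sample.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        # A line cannot be fit if every resampled x is identical; redraw.
        if len(set(bx)) < 2:
            continue
        slopes.append(fit_slope(bx, by))
    return slopes


# Hypothetical data: y is roughly 2 * x + 1 plus random noise.
data_rng = random.Random(42)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0.0, 5.0) for x in xs]

observed = fit_slope(xs, ys)
slopes = sorted(bootstrap_slopes(xs, ys))

# Rough 95% percentile interval taken from the sorted bootstrap slopes.
lower = slopes[int(0.025 * len(slopes))]
upper = slopes[int(0.975 * len(slopes)) - 1]

print("observed slope:", round(observed, 3))
print("approx. 95% interval:", round(lower, 3), "to", round(upper, 3))
```

The spread of the values in `slopes` gives an intuitive picture of how much the fitted slope could vary from sample to sample, which is the starting point for statistical inference mentioned above.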