Data Science Fundamentals (DS1): Training Resources
- This site provides additional resources for the Data Science Fundamentals training course. Trainer's notes are available at the Trainer's Git Repository (note: the trainer's notes have been updated with navigation). To stay up to date, you can follow hashtags such as #DataScience on Twitter.
- If a file does not open in the interface, use the View Raw or download links. Jupyter notebooks (.ipynb files) can be opened with [nbviewer](https://nbviewer.jupyter.org/) if they fail to open directly from GitHub.
- All material here is provided under the Creative Commons Attribution-NonCommercial license: [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
- A copy of the pre-class interview questions
- A copy of the Training outline
- This file provides access to the important resources on this site. It will be updated regularly. You can ignore the other files listed above (some are tests and may not work).
- Interesting read/view: the Future of Jobs short video and the full WEF report. See also: 5 Jobs Robots Will Take First and 5 They Will Take Last.
- Environmental data: Air Quality in Ramallah Area
- More datasets can be found at Google Research, Kaggle, UCI Machine Learning datasets, data.gov, OECD, World Bank, EU data portal, DataHub and Awesome public datasets, to name just a few. Also:
  - Traffic data: Google (with note) and Bing.
  - Airbnb and Uber: data sample. Uber Movement is for city officials only at this time.
  - Yelp: Challenge with financial rewards and academic datasets.
  - Amazon reviews.
  - Expedia.
  - PEX: historical.
  - Telecom data: Italy data.
  - Retail data: Belgium Grocery shopping and Walmart.
  - Social media data: Twitter, which requires an account and app creation (use the `tweepy` package), and Facebook.
  - Images: ImageNet.
  - Research articles and datasets from Zenodo.
  - Do not forget local and internal data: data from your organization/institution can be anonymized and used, as can data from your local supermarket, utility provider (water or electricity), mobile operator, municipal fees and services, and so on.
- Python and Jupyter: Anaconda distribution download, IPython Notebook Tutorial (Jupyter), Change Jupyter Home directory, Python basics course 1: edX, Python basics course 2: Coursera, A Whirlwind Tour of Python. Here's some markdown help. To print your working directory (where your notebook is), use `%pwd` in a code cell; `%ls` lists its contents and `%lsmagic` lists all available magic commands. See the [magic commands help](https://ipython.org/ipython-doc/3/interactive/magics.html) for more. Append `?` to a function or method name to get help on it. For a list of installed packages, use `conda list` and `pip freeze` (take the union of the two lists). See also Kaggle for both code examples and datasets. A list of Python packages has many useful tools. For more help on Jupyter notebooks, see 28 Jupyter Notebook tips, tricks and shortcuts. Here's also a video tutorial on Jupyter.
- Statistical Foundations: Statistical Thinking for Data Science and Analytics. See also Empirical CDF. A list of Common Probability Distributions is also available. See also: A Refresher on Statistical Significance. There are two books for in-depth knowledge: The Elements of Statistical Learning: Data Mining, Inference, and Prediction and An Introduction to Statistical Learning: with Applications in R.
- Mathematical Foundations: Math for Data Science: Self Starter, Essence of Linear Algebra, Calculus for Deep Learning. Most of this material is relevant for both this course and the next one. See also this tweet on Mathematics of Machine Learning.
- Two helpful introductory courses: Foundations of Data Science from Berkeley and CS 109 Data Science from Harvard. See also the additional resources from part II.
- Data Science Ethics: please see the Data Science ethics resources file in part II.
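The Python and Jupyter item above suggests taking the union of `conda list` and `pip freeze` to see every installed package. A minimal sketch of that idea follows. The function names are hypothetical, the parsing assumes the usual output shapes (`name==version` lines from pip, whitespace-separated columns from conda), and names are matched only after lowercasing, so a package that is named differently in the two tools will appear twice.

```python
import subprocess
import sys


def parse_pip_freeze(text):
    """Return the set of package names found in `pip freeze` output."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments and editable installs ("-e ...").
        if not line or line.startswith("#") or line.startswith("-e "):
            continue
        # Lines look like "name==1.2.3" or "name @ file:///path".
        name = line.split("==")[0].split(" @ ")[0]
        names.add(name.strip().lower())
    return names


def parse_conda_list(text):
    """Return the set of package names found in `conda list` output."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        # Header lines in `conda list` output start with "#".
        if not line or line.startswith("#"):
            continue
        # Columns are: name, version, build, and optionally channel.
        names.add(line.split()[0].lower())
    return names


def installed_packages(pip_text, conda_text):
    """Return the sorted union of the names reported by both tools."""
    return sorted(parse_pip_freeze(pip_text) | parse_conda_list(conda_text))


def run(cmd):
    """Run a command and return its stdout, or "" if it is unavailable."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout
    except (OSError, subprocess.CalledProcessError):
        return ""


def current_environment_packages():
    """Collect package names for the environment running this code."""
    pip_text = run([sys.executable, "-m", "pip", "freeze"])
    conda_text = run(["conda", "list"])  # empty if conda is not on the PATH
    return installed_packages(pip_text, conda_text)
```

In a notebook cell, calling `current_environment_packages()` returns the combined list. If conda is not installed, only the pip result is used.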
Projects and Homeworks
- Please create a GitHub account (if you haven't already) and upload your own homework using a clear, easy-to-navigate structure. Your notebooks must be readable (use markdown cells and code comments).
- Please submit a one-page project proposal (one per group) by February 18, 2017. State the problem, data source(s), and expected output, and list the group members. Projects will be presented in a special session one week after the last session.
Sessions 7 and 8, project presentations
- Session 7 will be on Wednesday, March 1st at 4:00 PM and session 8 on Saturday, March 4th at 8:00 AM.
- Project presentations: March 18th.
Applied Data Science (next course)
- All trainees who complete this course and present their projects are allowed to enroll.
- Other candidates must demonstrate equivalent knowledge and skills through an assessment.
- More information about the next course (part 2) is in the Applied Data Science section.
Last updated: May 14, 2017