Applied Data Science (DS2)
July 11, 2017: added a sample notebook on bank loans using real local data (anonymous).
May 21, 2017: With training concluded and projects presented yesterday, the Palestine Data Science Meetup was launched by course trainees.
May 11, 2017: Project presentations are planned for May 20th at 1:30 PM. You are expected to brief your trainer on your progress during session 8. Your near-final work should be ready on GitHub on or before the 17th of May, 2017. Make sure to document your work and add an ethics section.
May 7-11, 2017: added a new file on data science ethics resources and uploaded an example on sentiment analysis in Keras (using Word2Vec).
May 4, 2017: Session 7-8 material is online: open links from ADS (Applied Data Science). Tip: to practice, clone or download the branch.
Info and Resources
- Planned start date and time: April 8, 2017, 1:30 PM (at CCE, Masa bldg, 6th floor in Ramallah). For registration, see the ad on Ritaj.
- Trainer's notes: available at the ADS (Applied Data Science) repository. Use the ReadMe file for links to sessions and sub-sessions. Note: HTML files may display as source (markup) on GitHub. You can either download (or clone) the branch or use an online viewer like RawGit. In RawGit, paste the address of the HTML source. The RBasics notebooks above may also help as a quick R review. A good source on caret (R's counterpart to scikit-learn in Python) is the caret package site (by the author of caret). There is also an online book called R for Data Science by an active R contributor. Here's another good introduction to caret. There is also a caret wiki.
- Prerequisite: Data Science Foundations or demonstrated equivalent knowledge and skills (through assessment). You can also watch this 2-part training session for beginners. Parts of the (updated) online book Python for Scientists and Engineers should also be useful. Here's also a collection of Jupyter notebooks on different subjects. If you are starting to learn data science, watch this video and check the speaker's website and DS videos. A recent update to the PyData book by Wes McKinney is now available in Jupyter notebooks. To install packages from Jupyter, see this article for best practices.
- Preparation: code in this training will be in both R and Python (new candidates: you also need to install the Anaconda distribution of Python - see Data Science Foundations for more details). If you are unfamiliar with R, you should take these two short MOOCs before the training starts: Introduction to R for Data Science and Programming with R for Data Science. Both courses are free if you do not need a certificate. For a full R reference, check The R Book 2nd edition (available for free) and Awesome R, which has a great list of resources (it also links to Awesome ML). Install R from the CRAN site. You can also install the free RStudio IDE if you want. However, most work will be in the Jupyter notebook you already know. To enable R in Jupyter notebooks, you need to install IRKernel. Run the IRKernel installation commands from the R prompt. See this video if you need help. Another option is to use R Essentials. Python notebooks can also run R cells using rpy2. For help on Jupyter notebooks in general, see 28 Jupyter Notebook tips, tricks and shortcuts. Here's also a video tutorial on Jupyter. More on the Jupyter project - good to know. Also watch out for JupyterLab - a nice IDE. For machine learning work, check this scikit-learn and caret packages cheat sheet - see this interactive map for scikit-learn algorithms. More on scikit-learn and related projects. If you use Linux, you can also try auto-sklearn. There is also an Azure ML cheat sheet and infographic with examples. For Anaconda, here's a conda cheat sheet. For deep learning, see this collection.
- Comparing R and Python: read this InfoWorld article. See also the reply to this article by Hadley Wickham, an active R contributor. Another good source is this Stack Exchange question.
- Outline (subject to adjustments): tentative outline.
- The course (48 training hours) will focus on practical cases and will include different algorithms and data types (including text and images). Trainees also work on a project and present it at the end of the training.
- If a file does not open in the interface, use the View Raw or download links. Jupyter notebooks (.ipynb files) can be opened with [nbviewer](https://nbviewer.jupyter.org/) if they fail to open directly from GitHub.
- Datasets and general resources: see resources and datasets in the Data Science Foundations part. Also, check this Kaggle wiki for additional links (see also: ramp.studio and OpenML for code and data, and data for everyone for selected open datasets from CrowdFlower). An example of collecting data from social media - Twitter: part 1, part 2. If you haven't seen an iris, check this tweet. A notebook with real local data from bank loans is also available.
- You should have a GitHub account by now (create one if you don't). Also, make sure you follow the trainer on GitHub (access to all repositories). You can also star or watch a repository. You also need to know a bit of Markdown notation; here's a Markdown cheat sheet for Jupyter users.
- Misc resources: Data Science ethics, is your machine learning model wrong?, model evaluation metrics, how to win a data science competition, how to approach machine learning problems, a curated list of past competitions and solutions, and how logistic regression works (in Python).
- Feature extraction: for text, see the NLTK book, scikit-learn and Gensim. More text resources (global WordNets and others) can be found at the Princeton page, TALP resources, The SAI - search for Arabic!. You can also refer to the Stanford NLP with deep learning class, which includes videos. NLP GloVe software and data are also available. More Arabic resources: SLSA: A Sentiment Lexicon for Standard Arabic, OMA Project, pyArabic and Tashaphyne Python libraries, Arabic sentiment analysis, Lab41, Arabic data and resources repo and list of Arabic corpora. For images, try scikit-image and check OpenCV. For sound/audio, you can use LibROSA or PyAudio. Here's an audio dataset from Google research. The CREAM lab is also relevant. For image data and deep learning / NNs, see ImageNet, FastAI (a nice MOOC) and CS231n. This blog is very informative on Keras optimizers and NLP.
- Projects: you are advised to start working on your final project as early as possible (first week of training). Local data is preferred. Groups of 2 are preferred (groups of 1 or 3 are allowed as exceptions). The expected level of work is proportional to group size. Pay attention to the nature and distribution of the dataset, performance metrics and model explanation - try either Lime or Skater.
- Technology and real-life DS: you should familiarize yourself with computational limits and solutions (big data and distributed file systems, parallel processing, using GPUs and cloud computing). You can use cloud resources on different platforms for free (limited time and computing power). For example, Azure ML Studio offers free trials and you can start from this tutorial. Azure ML has a drag-and-drop GUI for model creation, training and publishing (prediction via API). See this screenshot. The Cortana Gallery is also a good resource of ML solutions. Machine learning is now integrated with databases like MS SQL Server 2017 - see this video for more (ex. demo at 12 min). Examples of ML as a service (API), like face recognition and translation: Microsoft cognitive services and Google cloud services. It is also a good idea to follow relevant Twitter feeds and related conferences - ex. PyCon2017. This is the TensorFlow playground and here's a series of TensorFlow tutorials. This is t-SNE in Python and R.
- All material here is provided under the Creative Commons Non Commercial License: [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
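
The note above about opening notebooks in nbviewer when they fail to render on GitHub can be sketched in a few lines of Python. This is a minimal sketch, not an official tool: the helper name `github_to_nbviewer` and the example user and repository are hypothetical, and it assumes nbviewer's usual `/github/<user>/<repo>/blob/<branch>/<path>` address layout.

```python
from urllib.parse import urlparse

# Assumption: nbviewer serves GitHub-hosted notebooks under this prefix.
NBVIEWER_BASE = "https://nbviewer.jupyter.org/github"


def github_to_nbviewer(github_url: str) -> str:
    """Turn a github.com notebook address into an nbviewer address.

    Hypothetical helper for illustration; it only rewrites the URL and
    does not check that the notebook actually exists.
    """
    parsed = urlparse(github_url)
    if parsed.netloc not in ("github.com", "www.github.com"):
        raise ValueError("expected a github.com URL")

    path = parsed.path.strip("/")
    parts = path.split("/")
    # Expected layout: <user>/<repo>/blob/<branch>/<path to notebook>.ipynb
    if len(parts) < 5 or parts[2] != "blob" or not path.endswith(".ipynb"):
        raise ValueError("expected a link to an .ipynb file under /blob/")

    return f"{NBVIEWER_BASE}/{path}"


if __name__ == "__main__":
    # Placeholder user and repository names, not a real notebook.
    print(github_to_nbviewer(
        "https://github.com/example-user/example-repo/blob/master/session1.ipynb"
    ))
```

Pasting the GitHub address into the nbviewer home page gives the same result; the helper is only handy when you have many notebook links to convert.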
Last updated: Dec 7, 2017