Applied Data Science (DS2)

Updates:

July 11, 2017: added a sample notebook on bank loans using real local data (anonymous).
May 21, 2017: Training concluded and projects were presented yesterday. The Palestine Data Science Meetup was launched by course trainees.
May 11, 2017: Project presentations are planned for May 20th at 1:30 PM. You are expected to brief your trainer on your progress during session 8. Your near-final work should be ready on GitHub on or before the 17th of May, 2017. Make sure to document your work and add an ethics section.
May 7-11, 2017: added a new file on data science ethics resources and uploaded an example on sentiment analysis in Keras (using Word2Vec).
May 4, 2017: Session 7-8 material is online: open links from ADS (Applied Data Science). Tip: to practice, clone or download the branch.

Info and Resources

Planned start date and time: April 8, 2017, 1:30PM (at CCE, Masa bldg, 6th floor in Ramallah). For registration, see the Ad at Ritaj
Trainer's notes: available at the ADS (Applied Data Science) repository. Use the ReadMe file for links to sessions and sub-sessions. Note: HTML files may display as source (markup) on GitHub. You can either download (or clone) the branch or use an online viewer like RawGit. In RawGit, paste the address of the HTML source. The RBasics notebooks in this repository may also help as a quick R review. A good source on caret (R's equivalent to Scikit-Learn in Python) is the caret package site (by the author of caret). There is also an online book called R for Data Science by an active R contributor. Here's another good introduction to caret. There is also a caret wiki.
Prerequisite: Data Science Foundations or demonstrated equivalent knowledge and skills (through assessment). You can also watch this 2-part training session for beginners. Parts of the (updated) online book Python for Scientists and Engineers should also be useful. Here's also a collection of Jupyter notebooks on different subjects. If you are starting to learn data science, watch this video and check the speaker's website and DS videos. A recent update to the PyData book by Wes McKinney is now available as Jupyter notebooks. To install packages from Jupyter, see this article for best practices.
Preparation: code in this training will be in both R and Python (new candidates: you also need to install the Anaconda distribution of Python - see Data Science Foundations for more details). If you are unfamiliar with R, you should take these two short MOOCs before the training starts: Introduction to R for Data Science and Programming with R for Data Science. Both courses are free if you do not need a certificate. For a full R reference, check The R Book 2nd edition (available for free) and Awesome R, which has a great list of resources (and also links to Awesome ML). Install R from the CRAN site. You can also install the free RStudio IDE if you want. However, most work will be in the Jupyter notebook you already know. To enable R in Jupyter notebooks, you need to install IRKernel. Run the IRKernel installation commands from the R prompt. See this video if you need help. Another option is to use R Essentials. Python notebooks can also run R cells using rpy2. For help on Jupyter notebooks in general, see 28 Jupyter Notebook tips, tricks and shortcuts. Here's also a video tutorial on Jupyter. More on the Jupyter project - good to know. Also watch out for JupyterLab - a nice IDE. For machine learning work, check this Scikit-Learn and caret packages cheat sheet - see this interactive map for scikit-learn algorithms. More on scikit-learn and related projects. If you use Linux, you can also try Auto-sklearn. There is also an Azure ML cheat sheet and infographic with examples. For Anaconda, here's a conda cheat sheet. For deep learning, see this collection.
Comparing R and Python: read this Infoworld article. See also the reply to this article by Hadley Wickham, an active R contributor. Another good source is this Stack Exchange question.
Outline (subject to adjustments): tentative outline.
The course (48 training hours) will focus on practical cases and will include different algorithms and data types (including text and images). Trainees also work on a project and present it at the end of the training.
If a file does not open in the interface, use the View Raw or download links. Jupyter notebooks (.ipynb files) can be opened with [nbviewer](https://nbviewer.jupyter.org/) if they fail to open directly from GitHub.
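As a rough illustration of the nbviewer route, the sketch below builds a viewer link for a notebook hosted on GitHub. It assumes nbviewer's `/github/<user>/<repo>/blob/<branch>/<path>` address form and a branch named `master`. Both are assumptions to check against the actual repository, and the helper name `nbviewer_url` is hypothetical.

```python
from urllib.parse import quote

NBVIEWER_BASE = "https://nbviewer.jupyter.org/github"


def nbviewer_url(user, repo, path, branch="master"):
    """Build an nbviewer link for a notebook in a public GitHub repo.

    The branch default ("master") is an assumption; pass the real one.
    """
    # Percent-encode the path but keep "/" so sub-folders still work.
    safe_path = quote(path.strip("/"), safe="/")
    return f"{NBVIEWER_BASE}/{user}/{repo}/blob/{branch}/{safe_path}"


if __name__ == "__main__":
    # Example with a notebook name that appears in this repository.
    print(nbviewer_url("abedkhooli", "ds2", "BankLoans.ipynb"))
```

Pasting the printed address into a browser should render the notebook, provided the repository is public and the branch name is right.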
Datasets and general resources: see resources and datasets in the Data Science Foundations part. Also, check this Kaggle wiki for additional links (see also: ramp.studio and openML for code and data, and data for everyone for selected open datasets from CrowdFlower). An example of collecting data from social media - Twitter: part 1, part 2. If you haven't seen an iris, check this tweet. A notebook with real local data from bank loans is also available.
You should have a GitHub account by now (create one if you don't). Also, make sure you follow the trainer on GitHub (access to all repositories). You can also star or watch a repository. You also need to know a bit of Markdown notation: here's a Markdown cheat sheet for Jupyter users.
Misc resources: Data Science ethics; is your machine learning model wrong?; model evaluation metrics; how to win a data science competition; how to approach machine learning problems; a curated list of past competitions and solutions; and how logistic regression works - in Python.
Feature extraction: for text, see the NLTK book, scikit-learn and Gensim. More text resources (global WordNets and others) can be found at the Princeton page, TALP resources and The SAI - search for Arabic! You can also refer to the Stanford NLP with deep learning class, which has videos. NLP GloVe software and data are also available. More Arabic resources: SLSA: A Sentiment Lexicon for Standard Arabic, OMA Project, pyArabic and Tashaphyne Python libraries, Arabic sentiment analysis, Lab41, Arabic data and resources repo and a list of Arabic corpora. For images, try scikit-image and check OpenCV. For sound/audio, you can use LibRosa or PyAudio. Here's an audio dataset from Google research. The CREAM lab is also relevant. For image data and deep learning / NNs, see Image-Net, FastAI (a nice MOOC) and CS231n. This blog is very informative on Keras optimizers and NLP.
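To make the idea of text feature extraction concrete, here is a minimal bag-of-words sketch using only the Python standard library. It is not how NLTK, scikit-learn or Gensim implement it; the function names and the two sample sentences are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    # \w+ is Unicode-aware in Python 3, so non-Latin letters such as
    # Arabic also match (diacritics are not handled in this sketch).
    return re.findall(r"\w+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents."""
    counts = [Counter(tokenize(d)) for d in docs]
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocab = sorted(set().union(*counts))
    # Each document becomes a vector of token counts in vocabulary order.
    vectors = [[c[w] for w in vocab] for c in counts]
    return vocab, vectors


# Made-up sample documents, for illustration only.
docs = ["Good service, good staff", "Bad service"]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```

Each row of `vectors` can then be fed to a classifier. The libraries listed above add the pieces this sketch leaves out, such as stop-word removal, TF-IDF weighting and sparse storage.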
Projects: you are advised to start working on your final project as early as possible (first week of training). Local data is preferred. Groups of 2 are preferred (groups of 1 or 3 are allowed as exceptions). The expected level of work is proportional to group size. Pay attention to dataset nature and distribution, performance metrics and model explanation - try either Lime or Skater.
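The point about dataset distribution and performance metrics can be seen in a small worked example. The sketch below uses made-up labels (95 negatives, 5 positives) and a hypothetical model that always predicts the majority class. The helper names are illustrative and not part of any library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def summary(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 (0.0 when undefined)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up, imbalanced labels: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
# A hypothetical model that always predicts the majority class.
y_pred = [0] * 100
result = summary(y_true, y_pred)
print(result)
```

Accuracy looks high here because of the class imbalance, while recall shows that no positive case was found. That is why a single metric is rarely enough for a project report.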
Technology and real-life DS: you should familiarize yourself with computational limits and solutions (big data and distributed file systems, parallel processing, using GPUs and cloud computing). You can use cloud resources on different platforms for free (limited time and computing power). For example, Azure ML Studio offers free trials and you can start from this tutorial. Azure ML has a drag-and-drop GUI for model creation, training and publishing (prediction via API). See this screenshot. The Cortana Gallery is also a good resource for ML solutions. Machine learning is now integrated with databases like MS SQL Server 2017 - see this video for more (e.g. the demo at 12 min). Examples of ML as a service (API), like face recognition and translation: Microsoft cognitive services and Google cloud services. It is also a good idea to follow relevant Twitter feeds and related conferences - e.g. PyCon2017. This is the TensorFlow playground and here's a series of TensorFlow tutorials. This is t-SNE in Python and R.
All material here is provided under the Creative Commons Non Commercial License: [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)

Last updated: Dec 7, 2017

