Collection of Python and other data analysis resources

Python and R resources for beginners...

...with special regards to science and data analysis applications, visualization and graphics

This collection has been started by Dávid Fazekas and continued by Dénes Türei. Submissions are welcome.

Here you will find a list of useful resources for learning Python or R, as well as a virtual environment where many tools are readily available. This collection should be helpful if you are a scientist who has just realized that you need computational tools for processing your data, if you have been doing this for a while and are looking for new tools and alternatives, or if you would simply like to start learning Python or R for any other purpose.

Setting up a Python environment

This repo contains a virtual environment created by Dávid Fazekas and set up for easy installation. It has been unmaintained for a couple of years, so I cannot guarantee that it works. The idea of this virtual environment is that you can start using Python without learning how to install Python and its modules, which is not always straightforward and might even require some system administration knowledge. Of course you can learn these things later, but the virtual environment provides a little help for a quick start. To install this environment on your own computer, follow the description.

Alternatively you can install Python and modules the traditional way. See below.

General points

  • Python has two major versions available: Python 2 and 3. These are incompatible, although it is possible to write code which runs both in 2 and 3. If you have both of them installed they will reside in their own directories and you install modules for them independently. It is highly recommended to use only Python 3 today. Almost all the important modules have been already ported to Python 3. Python 2 exists only to keep it possible to run old code when it's really necessary. The most important science and data analysis modules like numpy and scipy are about to end Python 2 support.
  • On some Linux distributions and on Mac OS X, Python 2 is still the default. Sometimes the python command in the shell is a link to either python2 or python3. The same holds for pip/pip2/pip3. Check your system and be aware of where your python and pip executables are, and where the site-packages or dist-packages directory is (this is where the modules are installed).
  • Always know which Python distribution you use and where you install the modules. If you just call pip install numpy and then start a Python shell from the Anaconda distribution, don't be surprised if import numpy gives a ModuleNotFoundError. If you are in doubt, you can find out with which python, or with something like ls -l /usr/bin/python and similar commands.
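The same questions can be answered from inside Python itself. Below is a minimal sketch using only the standard library; the helper name `describe_python` is made up for this illustration, and numpy is only an example of a module you might look for.

```python
import importlib.util
import sys
import sysconfig


def describe_python(module_name="numpy"):
    """Collect facts about the running interpreter (illustrative helper)."""
    spec = importlib.util.find_spec(module_name)
    return {
        # Path of the python executable that is running this code
        "executable": sys.executable,
        "version": tuple(sys.version_info[:3]),
        # Directory where this interpreter installs pure-Python modules
        "site_packages": sysconfig.get_paths()["purelib"],
        # None means this interpreter cannot import the module
        "module_origin": spec.origin if spec else None,
    }


if __name__ == "__main__":
    for key, value in describe_python().items():
        print(f"{key}: {value}")
```

If `module_origin` is empty for a module you believe you installed, it was most likely installed for a different Python than the one you are running.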

Linux

In Linux distributions you will find up-to-date and well maintained packages for both Python 2 and 3, and also for many modules. One issue is that pip and the distribution's package manager don't know about each other: each will complain if files already exist (because the other one installed them) and won't recognize a dependency that was already installed by the other. Most often I just give the force option to overwrite the files.
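To see which tool installed a given module, you can look at the package metadata. This is a sketch under assumptions: pip records its name in an INSTALLER file next to the package metadata, while packages coming from the distribution's package manager may record a different value or none at all. The helper name `installer_of` is hypothetical, and `importlib.metadata` needs Python 3.8 or newer.

```python
from importlib import metadata


def installer_of(dist_name):
    """Return the recorded installer of a distribution, or None if unknown."""
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        # Not installed for this interpreter at all
        return None
    # read_text returns None when the INSTALLER file does not exist
    text = dist.read_text("INSTALLER")
    return text.strip() if text else None


if __name__ == "__main__":
    # Example names only; use the modules you care about
    for name in ("pip", "numpy"):
        print(name, "->", installer_of(name))
```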

Mac OS X

OS X comes with a built-in Python 2, which is sometimes quite old. In any case, you will probably want to install an up-to-date Python 3 distribution. The most convenient way is to install a package manager for OS X (the most popular is HomeBrew, another one is MacPorts) and use it to install Python 3 and many modules.

Python on Windows

Install Python with the provided installer and don't forget to tick the "include in the path" box. You might also consider installing cygwin and git (the git installer actually offers cygwin too). This way you will have Bash and git, which are essential for development. See more about Windows here.

pip

pip is the most widely used package manager for Python. If pip does not come with your installation, you can install it with python -m ensurepip (on old installations, easy_install pip) or with your operating system's package manager.
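One way to make sure pip installs modules for the Python you actually run is to call it as a module of that interpreter (python -m pip) instead of relying on whichever pip is first in your path. A minimal sketch; the helper names `pip_command` and `pip_available` are made up for this example, and nothing is installed unless you execute the command yourself.

```python
import importlib.util
import sys


def pip_command(*args):
    """Build a pip command line bound to the running interpreter."""
    return [sys.executable, "-m", "pip", *args]


def pip_available():
    """Return True if the running interpreter can import pip."""
    return importlib.util.find_spec("pip") is not None


if __name__ == "__main__":
    print("pip available:", pip_available())
    # Pass this list to subprocess.run(...) to actually run it
    print("command:", pip_command("install", "numpy"))
```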

Editor

If you are about to start writing code, it is important to have a good text editor. A good text editor offers the following:

  • Syntax highlighting: it automatically colors the different elements of the language so it will be easier for you to recognize them and read the code
  • Autocompletion: it automatically offers suggestions to complete words while you are typing, so you don't need to type a long function or variable name more than once
  • Automatic indentation and automatic closing of parentheses. These features can make writing code even more convenient.
  • Nice color scheme: it is important to have a color scheme which has appropriate contrast and is gentle on your eyes (usually a scheme with a dark background).
  • Line numbering: you must have the lines numbered so you can easily find the line blamed by an error message.
  • Search and replace, including by regular expressions, and a go-to-line key combination.

On Linux this is usually not a problem, as the default editors like gedit can be tuned to be quite good. Personally I use Kate from KDE. For Mac and Windows you need to install one: for example, Notepad++ is popular on Windows and TextMate is popular on Mac. Also see the IDEs listed below, and consider the new JupyterLab, which is an IDE in the web browser.

Anaconda

Anaconda is a Python and R distribution and package manager for science and data analysis. It promises standardized package management, which makes collaboration and deployment easier. Some people like it; I've never used it. For me pip and the system's package manager have always been easy to use and sufficient.

Where to start?

Non Python specific but important resources

  • http://stackoverflow.com/ - This very important resource is worth a special mention. As you will see if you google any programming issue, in 90% of the cases you will end up on this site. SO is a Q&A (question and answer) site where anybody can ask programming related questions and answer or comment on others' questions. Users collect reputation points for their contributions, which has made it a very efficient platform for building a community around mutual help. Your question will likely be answered very quickly, but be careful not to ask something that has already been answered under another question.
  • https://bitbucket.org/ - We suggest you familiarize yourself with version control frameworks as soon as possible. Nowadays the most popular is git. Version control helps you to keep track of changes, keep your project files in order, back up your work often, avoid data loss, collaborate, and share your code in a standard and convenient way. BitBucket allows you to create more private repos than the most well known git server, http://github.com/.
  • https://maryrosecook.com/blog/post/git-from-the-inside-out - If you write code, please start using git. The sooner the better; you can commit even your random exercises to a git repo. Here is an in-depth introduction to git, starting from the basics, by Mary Rose Cook.

Interactive Python learning platforms

Programming exercises

When you write code with the aim of learning, it is often difficult to find a task: you want to code, but don't know what to code. In the Project Euler problems you find hundreds of small mathematics problems, each of which you can solve in just a few lines of code. This is ideal even if you have only half an hour for practicing. As you develop, you can return to problems you have already solved and find better and nicer implementations.
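To give an idea of the scale of these exercises, here is a sketch of the first Project Euler problem (sum all natural numbers below 1000 that are multiples of 3 or 5), once as a direct loop and once as the kind of nicer implementation you might come back to later. The function names are chosen for this example only.

```python
def sum_of_multiples(limit, divisors=(3, 5)):
    """Brute force: test every number below the limit."""
    return sum(n for n in range(limit) if any(n % d == 0 for d in divisors))


def sum_of_multiples_fast(limit):
    """Closed form for 3 and 5, using arithmetic series and inclusion-exclusion."""

    def series(d):
        k = (limit - 1) // d          # how many multiples of d are below limit
        return d * k * (k + 1) // 2   # d * (1 + 2 + ... + k)

    # Multiples of 15 are counted twice above, so subtract them once
    return series(3) + series(5) - series(15)


if __name__ == "__main__":
    print(sum_of_multiples(1000))       # 233168
    print(sum_of_multiples_fast(1000))  # 233168
```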

Python tutorials

Python resources

Here we list blogs and essays which are not primarily tutorials, but give an introduction or insight into specific topics.

Online Courses

Python environments

  • http://www.bpython-interpreter.org/ - Nice, colorful command line environment with smart autocompletion and built in help functions.
  • https://gist.github.com/lonetwin/5902720 - You can easily customize your Python shell by editing your ~/.pythonrc file. For example, copy the one in this git repo into yours, and you will have a colored shell with autocompletion.
  • https://www.pythonanywhere.com/ - A full Python environment in the cloud with lots of libraries and many Python versions available. You can write Python scripts in the browser, and even deploy your application as a webpage. Free plan is available.
  • https://jupyter.org/ - Interactive Python environment in the browser: Python runs in the background on your machine, and you write the code and get the output in the browser, in so called notebooks. Note: formerly known as IPython, they just renamed when it became language agnostic (originally it was only for Python, but now can be used also with other languages, for example R).
  • https://blog.jupyter.org/jupyterlab-is-ready-for-users-5a6f039b8906 - A complete IDE developed from Jupyter.
  • https://www.continuum.io/why-anaconda - Python environment intended for science and data analysis, with easy availability of relevant modules (at least in theory: eventually installation might be more complicated).

Python modules for data analysis

  • http://www.numpy.org/ - Efficient operations on multidimensional numeric arrays (i.e. matrices of numbers).
  • https://www.scipy.org/ - Collection of many stats and science methods, like regression, statistical tests.
  • http://pandas.pydata.org/ - Built on top of numpy, pandas provides a more convenient handling of data tables, i.e. here you can have row and column names, methods for convenient rearranging and filtering your data. You can imagine a programmable excel sheet, or something like data frames in R.
  • https://jupyter.org/ - Interactive Python environment in the browser: Python runs in the background on your machine, and you write the code and get the output in the browser, in so called notebooks. Note: this is the same as what IPython was; it was renamed when it became language agnostic (originally it was only for Python, but now it can also be used with other languages).
  • https://boltons.readthedocs.org/en/latest/ - Many useful tools for advanced Python programming

Python visualization and plotting

We have seen a number of efforts emerging in the past years with the aim of providing powerful data visualization in Python, so that scientists and data analysts would not need to envy their colleagues who use R. Perhaps the perfect ggplot2 or lattice equivalent is still to come (although two very fresh libraries, Altair and Plotnine, are promising), but each of the frameworks listed below is very good at certain tasks, and of course has its limitations. Thus, it is difficult to choose a plotting library, and you will likely try several of them.

Graphs (networks)

Alternative network visualizations

Visualization in general

Other tools for graphics and typography: post processing figures, designing slides, posters and figures, typesetting reports, papers, theses and books

  • https://github.com/gztchan/awesome-design - A curated collection of graphic design resources from Tony Chan.
  • http://inkscape.org/ - Inkscape is a professional vector graphics editor and a free alternative to Adobe Illustrator. Its default format is standard SVG, and you can import and export many other formats, including of course PDF. Of note, you really should not pay $240 per year to a company for an application only to design vector graphics. Nor should you make your institute pay it for you. This way you only chain yourself to Adobe, and the more time and energy you invest in learning this sophisticated application, the more it constrains you to keep using it and keep paying for it. What's the point of this when an excellent free and open source alternative is available? Inkscape is powerful, and what you learn and achieve using Inkscape remains yours forever.
  • http://gimp.org/ - GIMP stands for GNU Image Manipulation Program. It is a professional bitmap graphics editor and a free alternative to Adobe Photoshop. What I wrote about Inkscape and Illustrator as alternatives applies fully to GIMP and Photoshop as well.
  • https://www.latex-project.org/ - LaTeX is the best tool and the state of the art standard for scientific typesetting and publishing. It was created in the 80s by Leslie Lamport and has been developed by the scientific community with the aim of having a tool which completely fits their needs. LaTeX is free and open source software built on top of TeX, which was created by Donald Knuth, also to address the needs of scientific typography. The difference is that TeX does very basic elementary things in the background, like how to fill a line of text with symbols, while LaTeX provides macros for more complex tasks, like how to size and position items of a list or titles on a page in order to make it look good. I cannot say LaTeX is an alternative to Adobe InDesign, but for scientific publishing it is definitely superior. You can of course use it for typesetting fiction books or entertainment magazines, but you may have more difficulties, and if you just want an open source alternative for this, there is Scribus. If you need help with LaTeX, don't go to StackOverflow but to its sister site dedicated to LaTeX. One more important thing: you cannot simply download and install LaTeX. It comes packaged in many different distributions, which include different fonts, typesetting engines and macro packages. First you should look up which of these are available for your operating system. Many templates and styles are also available: for example, journals often have their own article style, and universities their own presentation and thesis styles.
  • http://www.texample.net/tikz/examples/ - PGF/TikZ is a LaTeX package for creating high quality scientific graphics, authored by Till Tantau. To find out what kind of graphics, see the examples in the gallery. If your figure involves lots of maths or algebra, or needs a lot of alignment and positioning, TikZ might be useful for you. It is worth taking a look at the 1161 page user manual of TikZ, it is really amazing! Or you can start with the short introduction.
  • https://www.wikiwand.com/en/Beamer_%28LaTeX%29 - Beamer is a LaTeX package for creating presentations. In my opinion, for scientific presentations it is much better than PowerPoint and Keynote, which are the most awful applications I have ever seen, and I am really happy I could completely avoid them in the last 15 years. Beamer is also free and open source software. Most of the default themes look old-fashioned and not very nice, but you can easily modify them to have something better looking. See examples here, here, here, here, here, here, here or here. If you work at EMBL or in the Saez-Rodriguez Group at RWTH Aachen University you can find my Beamer theme, slightly modified from PaloAlto, in my git repos there: @EMBL or @Aachen. Other notes: the final format of your slides will be PDF, which is perfectly cross-platform. You should check, or ask the tech support, what the aspect ratio of the projector in your lecture room is. Prepare wide-screen (16:9 or 16:10) slides if those fit, as you have more space this way. Also check whether your connection can transmit this resolution, and have an HDMI cable and adapter with you if necessary. VGA cables are sometimes limited to 4:3 aspect ratio and 1024x768 resolution, which is quite poor.
  • http://www.texstudio.org/ - TeXstudio is a great editor for LaTeX. It comes with autocompletion, built in help, embedded compilation tool, PDF viewer and many other handy tools.
  • http://gpick.org/ - Gpick is a nice little color picker and palette editor application for Linux by Albertas Vyšniauskas. I use it with great satisfaction to create palettes which I later use in R, Python, Inkscape, GIMP or wherever else. I don't know of alternatives for Mac or Windows, but there definitely are some.
  • https://lvdmaaten.github.io/tsne/ - Dimensionality reduction method for 2D/3D visualization.
  • https://distill.pub/2016/misread-tsne/ - Interactive insights into parameter sensitivity and artefacts in the t-SNE dimensionality reduction method.

R blogs and tutorials

Statistics

These are not Python related but generic.

P-values

Debate in Nature Methods

Others

Suggestions for new p-value

IDEs (integrated development environments)

IDEs help you to keep track of files in your project, their history, dependencies, testing, outputs, etc.

Python IDEs

R IDEs

  • http://rstudio.com/ - Very popular and powerful IDE for R. Built in editor, file browser, shell, plot viewer, manual and many other helpful tools.

Image processing

In biology computational analysis of images is often unavoidable. With high-throughput microscopy you can acquire hundreds of images just in one hour of time and obviously you can not do all the adjustments and measurements by hand. Luckily there are a number of easy to use tools around. You can identify structures and make measurements and finally come to quantitative data from images. ImageJ is very popular and can be programmed by its own macro language or many other languages including Python. But if you want to use Python maybe better to go for Python image processing modules like scikit-image, OpenCV or ITK.

Chemistry

Books

Python beginner and intermediate books

Advanced Python

R books

Lectures

Podcasts

Miscellaneous

Regular expression resources

In data analysis we process tremendous amounts of data, which is sometimes noisy, and we need to extract information from messy patterns. Sooner or later regular expressions will be essential tools for you, no matter which field and language you work with. Here are a few excellent resources to learn these small tricky things called regex:
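As a small taste of what regular expressions do, here is a sketch using Python's built-in re module to pull numbers and dates out of inconsistently formatted lines. The sample lines and the `parse` helper are invented for this illustration.

```python
import re

# Made-up sample lines, standing in for a messy data file
LINES = [
    "sample_01;  weight=12.5kg ; 2019-03-07",
    "Sample-2 weight = 8 kg,2019-3-15",
    "no measurement here",
]

# A number (optionally with decimals) after "weight =", followed by "kg"
WEIGHT = re.compile(r"weight\s*=\s*(\d+(?:\.\d+)?)\s*kg", re.IGNORECASE)
# Year-month-day, with one or two digits allowed for month and day
DATE = re.compile(r"(\d{4})-(\d{1,2})-(\d{1,2})")


def parse(line):
    """Return (weight, (year, month, day)); missing parts are None."""
    weight = WEIGHT.search(line)
    date = DATE.search(line)
    return (
        float(weight.group(1)) if weight else None,
        tuple(int(part) for part in date.groups()) if date else None,
    )


if __name__ == "__main__":
    for line in LINES:
        print(parse(line))
```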

sed, awk, grep

Unix, Linux and Bash

Introductory Bash

Basic Bash is essential whatever operating system you use. Linux and Mac have Bash as their default command line anyway. It is extremely versatile and convenient. You can have it on Windows too if you want, see the section below. If you have access to any computing cluster or file storage server at your institute or university, Bash is most likely the easiest (and often the only) way to access them.

There are many Bash tutorials for beginners. I think most of them are quite dry, and you will soon get bored or forget what you read. I suggest you aim to learn the 10-20 most essential commands first, and keep using them actively for many weeks. Then, according to your needs, you can learn more useful things, and what you read will be more digestible. Among the 3 materials below, maybe the first one is the best in style:

If you prefer to learn from podcasts and videos you can find a ton of them on youtube, for example this channel covers many topics including Linux, Bash, SSH, vim, etc:

SSH

SSH stands for secure shell. It gives you a Bash session on a remote computer (e.g. the computing cluster of your institute), and it can also copy files between your computer and the remote one (scp) and encrypt communication between software running on different computers. It is convenient to set up key based authentication for the server so you don't need to type your password any more:
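The usual steps for key based authentication are: generate a key pair, copy the public key to the server, then log in. The sketch below only assembles and prints these commands and does not run anything. The user name, host name and the helper `key_setup_commands` are placeholders invented for this example, and it assumes the standard OpenSSH tools ssh-keygen and ssh-copy-id are available on your system.

```python
import shlex
from pathlib import Path


def key_setup_commands(user, host, key_path="~/.ssh/id_ed25519"):
    """Return the commands for key based login as lists of arguments."""
    key = Path(key_path).expanduser()
    target = f"{user}@{host}"
    return [
        # 1. Generate a new key pair (asks for an optional passphrase)
        ["ssh-keygen", "-t", "ed25519", "-f", str(key)],
        # 2. Append the public key to the server's authorized keys
        ["ssh-copy-id", "-i", f"{key}.pub", target],
        # 3. Log in; the password should no longer be requested
        ["ssh", target],
    ]


if __name__ == "__main__":
    # Placeholder user and host, replace with your own
    for command in key_setup_commands("alice", "cluster.example.org"):
        print(shlex.join(command))
```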

Virtual machines

If you use a Unix-like system you will most likely want to have a Windows virtual machine (VM), and conversely, if you use Windows you will most likely need a Linux VM, for the very few tasks which can be performed only on one of these operating systems. You should not expect to have such tasks often; most probably you will start your VM only once a month.

The easiest way to create virtual machines is VirtualBox, which has a free and open source edition; the edition that is not open source is also free of charge. It integrates seamlessly with your main operating system (especially if you also install the guest additions): you can share your directories, USB devices, network, etc. If you switch the VM to full screen you will have exactly the same experience as if it were your main system.

Windows

In my opinion by using Windows you make most of the things more difficult for yourself because it is designed to restrict your insight and control over the technology. This way you do many things blindly without understanding what is happening in the background, and if you encounter a problem you will have less information available for thinking about a solution. Still many people use Windows because they feel they already invested significant time into "learning" it in the school and they don't want to learn something new.

If you use Windows as your main operating system, you might want to have some Unix compatible tools to make it more convenient to use. Alternatively you might create a virtual machine with Linux. But in this case I think you need a strong reason why you don't install Linux as your main system.

  • https://www.putty.org/ - A little SSH and Telnet client for Windows. The easiest way to log in to Unix servers (usually the computing clusters of your institute or university). Note: for an SSH login you need at least 3 pieces of information: the address of the server, your user name and your password.
  • https://mingw-w64.org/ - GCC for Windows. You need this if you want to compile software using GCC.
  • https://www.cygwin.com/ - A POSIX compatible environment for Windows. You need this to have all the very helpful tools and the environment you have by default on Linux, and also to run some software which needs a POSIX environment.
  • https://gitforwindows.org/ - Git for Windows. I think an easy quick solution is to install only this, and it will offer for you to install also Cygwin and Bash, and set up your paths. At the end you will have a nice environment.