Workshop for CBioVikings
Title: Interactive Data Analysis in Python with Pandas using Jupyter Notebook
Presented by: David Lyon, Researcher @ Novo Nordisk Fonden Center for Protein Research, University of Copenhagen
Introduction
Data comes in many forms, shapes and flavors. As tasty and free-spirited as this may sound, the diligent data analyst often spends most of their time preparing and wrangling the data itself, rather than coding or running a particular model or statistical test. This is where Python and Pandas come into play, providing high-level, flexible, and efficient tools for manipulating your data as needed.
Program
CBioVikings will get a short introduction to Jupyter Notebook (formerly IPython Notebook), an interactive computational environment that combines code execution, rich text, mathematics, plots and media. Then we’ll delve right into data analysis using Pandas, a Python library providing easy-to-use data structures and data analysis tools.
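To give a flavor of the kind of wrangling the workshop covers, here is a minimal sketch. The column names and values are made up for illustration and are not taken from the workshop material.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration only.
df = pd.DataFrame({
    "sample": ["a", "a", "b", "b"],
    "intensity": [1.0, 3.0, 2.0, None],
})

# Drop the row with a missing intensity, then average per sample.
clean = df.dropna(subset=["intensity"])
means = clean.groupby("sample")["intensity"].mean()
print(means)
```

In a Jupyter Notebook you would typically run each step in its own cell and inspect the intermediate DataFrame as you go.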
Structure
Introduction: 30-45 min
Break: 7.5 min
Exercises: 30-60 min
Prerequisites
This evening workshop is aimed at people with basic Python skills, but all levels are welcome and encouraged to attend. Please install the following software before the workshop and check that it runs (or at least download it before coming).
1.) Git https://git-scm.com/
2.) Python (2.x or 3.x), either on its own or bundled with other commonly used packages via Enthought Canopy or Anaconda
https://www.python.org/
https://www.enthought.com/canopy-subscriptions/ (Canopy Express is free and very easy to set up; recommended if you are new to Python/programming)
https://www.continuum.io/downloads
The following Python packages can be installed using "pip" (a Python package manager), or downloaded from PyPI or from their individual websites.
https://pip.pypa.io/en/stable/installing/
https://pypi.python.org/
EASY INSTALLATION using pip: enter the following in the terminal to install multiple packages at once, including all dependencies:
    pip install ipython jupyter numpy pandas matplotlib seaborn
(N.B. if pip is not available, run "easy_install pip". Depending on your installation, you might need to add python and pip to your PATH environment variable.)
3.) IPython and Jupyter http://jupyter.readthedocs.org/en/latest/install.html
4.) Numpy http://www.numpy.org/
5.) Pandas http://pandas.pydata.org/
Optional:
6.) Matplotlib http://matplotlib.org/
7.) xlrd http://www.python-excel.org/
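To check that everything is running, a small script along these lines can be used. This is a sketch, not part of the official workshop material: the helper name check_packages is hypothetical, and it assumes that Jupyter can be detected through the jupyter_core module.

```python
import importlib

# Import names for the packages listed above.
# Assumption: Jupyter is detected via its jupyter_core module.
REQUIRED = ["IPython", "jupyter_core", "numpy", "pandas"]
OPTIONAL = ["matplotlib", "xlrd"]


def check_packages(names):
    """Map each name to its version string, or None if it cannot be imported."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            # Not every module exposes __version__, so fall back to a label.
            found[name] = getattr(module, "__version__", "unknown")
    return found


if __name__ == "__main__":
    for label, names in (("required", REQUIRED), ("optional", OPTIONAL)):
        for name, version in check_packages(names).items():
            status = version if version is not None else "NOT INSTALLED"
            print("{:<9} {:<13} {}".format(label, name, status))
```

Save it as a file of your choice and run it with python. Any line marked NOT INSTALLED points to a package that still needs to be installed, for example with pip.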
RESOURCES used for this workshop
Very good (and long) tutorial.
Book by Wes McKinney
Pandas cheat sheet
https://github.com/fonnesbeck/statistical-analysis-python-tutorial
https://github.com/guipsamora/pandas_exercises
https://github.com/ajcr/100-pandas-puzzles
http://gregreda.com/2013/10/26/working-with-pandas-dataframes/
http://pandas.pydata.org/