# Big Data Module I: Introduction to Data Science with Python

## Setting up Python

Familiarize yourself with the notebook environment. It is essentially a web page with executable code cells. To run a cell, select it and click the "Run" button (it looks like a play button) or press Shift+Enter.

There are many useful keyboard shortcuts. In Jupyter Notebook, press 'H' in command mode to see a cheat sheet (the shortcut is different in JupyterLab).

A good introduction is [this video right here](https://www.youtube.com/watch?v=HW29067qVWk). 

### Imports

A central building block of Python, and especially of the Anaconda distribution you should have installed, is the ability to import additional modules, packages, or libraries into your current script with the 'import' statement.

In [14]:
import math

In [15]:
math.log(4)

1.3862943611198906

In [3]:
math.cos(math.pi)

-1.0

Sometimes you will want to use a short name for a library:

In [4]:
import math as mt

In [5]:
mt.log(4)

1.3862943611198906

Note that you have to prefix each function call with the module name ("math" or "mt"). You can also import a specific function from a module. The function can then be called by its name alone, without the module prefix:

In [6]:
from statistics import mean

In [7]:
mean([2, 5, 6, 100])

28.25
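The three import styles shown above can be compared side by side. The following is a minimal sketch that uses only the standard library; the variable names `same_function` and `value` are chosen for illustration and do not come from the course material.

```python
import math                 # full module name as prefix
import math as mt           # short alias as prefix
from math import log        # function imported directly, no prefix needed

# All three spellings refer to the very same function object.
same_function = (math.log is mt.log) and (mt.log is log)
print(same_function)        # True

# math.log computes the natural logarithm (base e).
value = log(math.e)
print(value)                # 1.0
```

A direct import saves typing, but the prefix makes it obvious where a function comes from. This matters once several libraries offer functions with the same name, such as 'mean' in both "statistics" and NumPy.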

Now that we know the basics of importing, make yourself comfortable with using multiple libraries. NumPy, Pandas, and NetworkX are just three of the libraries we will use in this course.

However, in our introductory tutorials on Python fundamentals, we will use only basic functions of Python.

### NumPy

NumPy is the fundamental package for scientific computing with Python. More information and tutorials at:

http://www.numpy.org/

In [8]:
import numpy as np

An example command:

In [9]:
x = [2, 5, 6, 100]
np.mean(x)

28.25
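NumPy functions such as 'np.mean' accept plain Python lists, but the library is built around its own array type. As a small sketch (the variable names are illustrative), converting the list to an array allows arithmetic on all elements at once:

```python
import numpy as np

# Convert the list from the example above into a NumPy array.
x = np.array([2, 5, 6, 100])

doubled = x * 2     # elementwise multiplication, no loop needed
total = x.sum()     # sum of all elements
avg = x.mean()      # same result as np.mean(x)

print(doubled)      # [  4  10  12 200]
print(total)        # 113
print(avg)          # 28.25
```

Note that multiplying a plain Python list by 2 would repeat the list instead of doubling its elements, which is one reason to work with arrays for numerical data.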

### Pandas

Pandas provides data structures and data analysis tools. More information and tutorials at:

http://pandas.pydata.org/

In [10]:
import pandas as pd

In [16]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
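The series above contains one missing value ('NaN'), which is also why its dtype is float64 rather than an integer type. The following sketch shows a few common ways to deal with missing values in a Series; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

# Count the missing values.
n_missing = int(s.isna().sum())
print(n_missing)            # 1

# By default, pandas skips NaN when aggregating: (1 + 3 + 5 + 6 + 8) / 5
mean_skipna = s.mean()
print(mean_skipna)          # 4.6

# Drop the missing entry; the original index labels are kept.
cleaned = s.dropna()
print(list(cleaned.index))  # [0, 1, 2, 4, 5]
```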

Some examples of extended Markdown features (double-click a cell to see its source):

$$e^x=\sum_{i=0}^\infty \frac{1}{i!}x^i$$

| This | is   |
|------|------|
|   a  | table|
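The series formula shown above can also be checked numerically. This is a minimal sketch, assuming that a partial sum with a fixed number of terms is accurate enough for small values of x; the function name `exp_series` and the parameter `n_terms` are made up for this illustration.

```python
import math

def exp_series(x, n_terms=20):
    """Approximate e**x by the first n_terms terms of sum(x**i / i!)."""
    return sum(x**i / math.factorial(i) for i in range(n_terms))

approx = exp_series(1.0)

# Compare the partial sum with the library value of e.
print(approx, math.e)
print(math.isclose(approx, math.e))   # True
```

For larger values of x, more terms are needed before the partial sum comes close to the value returned by 'math.exp'.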

## Introductory Tutorials for Preparation

Now you are ready to start the tutorials. They are required preparation for the course. At the beginning of the course, we will only do a short recap.

Open the first notebook, <a href='01_var_string_num.ipynb'>01_var_string_num.ipynb</a>, and then go through the other five notebooks in order. **Do the exercises** to check that you have really understood the lessons.



## Optional Materials for Preparation

Via <a href='https://notebooks.gesis.org/binder/v2/gh/jakevdp/PythonDataScienceHandbook/master?filepath=notebooks%2FIndex.ipynb'>this link</a> you can open the Python Data Science Handbook project and work through this complete data science textbook: **VanderPlas, J. (2016): *Python Data Science Handbook: Essential Tools for Working with Data*. O'Reilly Media.** The book can be found here: https://jakevdp.github.io/PythonDataScienceHandbook/

A good introduction for newcomers with a focus on data handling is: **McKinney, W. (2012): *Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython*. O'Reilly Media.** It covers the data analysis methods we will be dealing with in less depth than the book by VanderPlas.

This is a data science textbook from the perspective of the social sciences: **Foster, I., Ghani, R., Jarmin, R. S., Kreuter, F., and Lane, J. (eds.) (2016): *Big Data and Social Science: A Practical Guide to Methods and Tools*. Chapman and Hall/CRC Press.**

Finally, more basic tutorials can be found <a href='https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks#introductory-tutorials'>here</a>.

## Additional Resources (if you want to study more yourself, not mandatory)

An example machine learning notebook: https://github.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/blob/master/example-data-science-notebook/Example%20Machine%20Learning%20Notebook.ipynb

Statistics visualisations with JavaScript: http://students.brown.edu/seeing-theory/

Coursera, e.g.: https://www.coursera.org/browse/data-science

Berthold, M. and Hand, D. J. (eds.) (2002): *Intelligent Data Analysis: An Introduction*. Springer.

Bishop, C. (2006): *Pattern Recognition and Machine Learning*. Springer.

Ester, M. and Sander, J. (2000): *Knowledge Discovery in Databases: Techniken und Anwendungen*. Springer. **In German**.

Hastie, T., Tibshirani, R., and Friedman, J. (2001): *The Elements of Statistical Learning*. Springer.

Han, J. and Kamber, M. (2011): *Data Mining: Concepts and Techniques*. Morgan Kaufmann Publishers.

Mitchell, T. M. (1997): *Machine Learning*. McGraw-Hill.

Witten, I. H. and Frank, E. (2005): *Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations*. Morgan Kaufmann Publishers.