Commit
Showing 7 changed files with 9,594 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,391 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"slideshow": { | ||
"slide_type": "-" | ||
} | ||
}, | ||
"source": [ | ||
"# Big Data Module I: Introduction to Data Science with Python\n", | ||
"\n", | ||
"## Setting up Python\n", | ||
"\n", | ||
"Familiarize yourself with the notebook environment. It is essentially a web page with executable code. Run a cell by clicking the \"Run\" button (it looks like a play button).\n", | |
"\n", | ||
"There are many useful keyboard shortcuts. Press 'H' to see a cheat sheet (this works in Jupyter Notebook; the shortcut differs in JupyterLab).\n", | |
"\n", | ||
"A good introduction is [this video right here](https://www.youtube.com/watch?v=HW29067qVWk). " | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Imports\n", | ||
"\n", | ||
"A central feature of Python, and especially of the Anaconda distribution you should have installed, is the ability to import additional modules, packages, or libraries into your current script with the 'import' statement." | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 14, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import math" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 15, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"1.3862943611198906" | ||
] | ||
}, | ||
"execution_count": 15, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"math.log(4)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"-1.0" | ||
] | ||
}, | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"math.cos(math.pi)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Sometimes you will want to use a short name for a library:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import math as mt" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"1.3862943611198906" | ||
] | ||
}, | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"mt.log(4)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Note that you have to type the module name (\"math\" or \"mt\") before each function call. You can also import a specific function from a module. The module prefix is then no longer necessary:" | |
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from statistics import mean" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"28.25" | ||
] | ||
}, | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"mean([2, 5, 6, 100])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now that we know the basics of importing, make yourself comfortable with using multiple libraries. NumPy, Pandas, and NetworkX are just three of the libraries we will use in the course.\n", | |
"\n", | ||
"However, in our introductory tutorials on Python fundamentals, we will use only basic functions of Python." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### NumPy\n", | ||
"\n", | ||
"NumPy is the fundamental package for scientific computing with Python. More information and tutorials at:\n", | ||
"\n", | ||
"http://www.numpy.org/" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import numpy as np" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"An example command:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"28.25" | ||
] | ||
}, | ||
"execution_count": 9, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"x = [2, 5, 6, 100]\n", | ||
"np.mean(x)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Pandas\n", | ||
"\n", | ||
"Pandas provides data structures and data analysis tools. More information and tutorials at:\n", | ||
"\n", | ||
"http://pandas.pydata.org/" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 10, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import pandas as pd" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 16, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"0 1.0\n", | ||
"1 3.0\n", | ||
"2 5.0\n", | ||
"3 NaN\n", | ||
"4 6.0\n", | ||
"5 8.0\n", | ||
"dtype: float64" | ||
] | ||
}, | ||
"execution_count": 16, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"s = pd.Series([1, 3, 5, np.nan, 6, 8])\n", | ||
"s" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Some examples of extended Markdown features (double-click a cell to see its source):" | |
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"$$e^x=\\sum_{i=0}^\\infty \\frac{1}{i!}x^i$$" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"| This | is |\n", | ||
"|------|------|\n", | ||
"| a | table|" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Introductory Tutorials for Preparation\n", | ||
"\n", | ||
"Now you are ready to start the tutorials. They are required preparation for the course. At the beginning of the course, we will only do a short recap.\n", | ||
"\n", | ||
"Open the first notebook, <a href='01_var_string_num.ipynb'>01_var_string_num.ipynb</a>, and then go through the other five notebooks in order. **Do the exercises** to check that you have really understood the lessons." | |
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"\n", | ||
"\n", | ||
"## Optional Materials for Preparation\n", | ||
"\n", | ||
"Via <a href='https://notebooks.gesis.org/binder/v2/gh/jakevdp/PythonDataScienceHandbook/master?filepath=notebooks%2FIndex.ipynb'>this link</a> you can open the Python Data Science Handbook project and work through this complete data science textbook: **VanderPlas, J. (2016): *Python Data Science Handbook: Essential Tools for Working with Data*. O'Reilly Media.** The book can be found here: https://jakevdp.github.io/PythonDataScienceHandbook/\n", | |
"\n", | ||
"A fine introduction for newcomers with a focus on data handling is: **McKinney, W. (2012): *Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython*. O'Reilly Media.** This book does not go as deep into the data analysis we will cover as the book by VanderPlas.\n", | |
"\n", | ||
"This is a data science textbook from the perspective of the social sciences: **Foster, I., Ghani, R., Jarmin, R.S., Kreuter, F., and Lane, J. (eds) (2016): *Big Data and Social Science: A Practical Guide to Methods and Tools*. Chapman and Hall/CRC Press.**\n", | |
"\n", | ||
"Finally, more basic tutorials can be found <a href='https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks#introductory-tutorials'>here</a>.\n", | ||
"\n", | ||
"## Additional Resources (optional, for further self-study)\n", | |
"\n", | ||
"An example machine learning notebook: https://github.com/rhiever/Data-Analysis-and-Machine-Learning-Projects/blob/master/example-data-science-notebook/Example%20Machine%20Learning%20Notebook.ipynb\n", | ||
"\n", | ||
"Statistics visualisations with JavaScript: http://students.brown.edu/seeing-theory/\n", | |
"\n", | ||
"Coursera, e.g.: https://www.coursera.org/browse/data-science\n", | ||
"\n", | ||
"Berthold, M. and Hand, D. J. (eds.) (2002): *Intelligent Data Analysis: An Introduction*. Springer.\n", | ||
"\n", | ||
"Bishop, C. (2006): *Pattern Recognition and Machine Learning*. Springer.\n", | ||
"\n", | ||
"Ester, M. and Sander, J. (2000): *Knowledge Discovery in Databases: Techniken und Anwendungen*. Springer. **In German**.\n", | |
"\n", | ||
"Hastie, T., Tibshirani, R., and Friedman, J. (2001): *The Elements of Statistical Learning*. Springer.\n", | ||
"\n", | ||
"Han, J. and Kamber, M. (2011): *Data Mining: Concepts and Techniques*. Morgan Kaufmann Publishers.\n", | ||
"\n", | ||
"Mitchell, T. M. (1997): *Machine Learning*. McGraw-Hill.\n", | ||
"\n", | ||
"Witten, I. H. and Frank, E. (2005): *Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations*. Morgan Kaufmann Publishers." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernel_info": { | ||
"name": "python3" | ||
}, | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.7.3" | ||
}, | ||
"nav_menu": {}, | ||
"nteract": { | ||
"version": "0.8.4" | ||
}, | ||
"toc": { | ||
"navigate_menu": true, | ||
"number_sections": true, | ||
"sideBar": true, | ||
"threshold": 4, | ||
"toc_cell": false, | ||
"toc_section_display": "block", | ||
"toc_window_display": false | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
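
The three import styles introduced in the notebook's "Imports" section can be compared side by side. The following is a minimal sketch that uses only the standard library modules already shown in the notebook (`math` and `statistics`); the variable names `same_module`, `log_plain`, `log_alias`, and `average` are illustrative and do not appear in the notebook itself.

```python
import math                   # plain import: names are accessed as math.<name>
import math as mt             # alias: the same module under a shorter name
from statistics import mean   # direct import: mean is usable without a prefix

# Both names refer to the same module object, so their functions are identical.
same_module = mt is math

# Natural logarithm of 4, computed through either name.
log_plain = math.log(4)
log_alias = mt.log(4)

# The directly imported function needs no module prefix.
average = mean([2, 5, 6, 100])

print(same_module, log_plain == log_alias, average)
```

An alias only adds a second name for the module; it does not load the module twice. Importing a single function keeps calls short, at the cost of making it less obvious where the function comes from.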