# Course overview
* What are we doing here?
* If you're not here for Intro to Data Science in Python, you're in the wrong place!

## Course objectives
The goal of this course is to provide a high-level introduction to:
* The Python language and the scientific Python stack
* Core elements of a typical data science pipeline

## Who is this course for?
* The ideal participant:
    * Has prior programming experience in other languages (e.g., R, Matlab, etc.)
    * Is familiar with basic statistics (descriptives, probability, regression, etc.)
    * Analyzes data on a regular basis
    * Is interested in learning how to analyze data more effectively in Python
* Don't worry if you're not the ideal participant!
    * But you may have to do some extra work to catch up

## What this course will and won't do
* It will give you a basic understanding of the Python data science ecosystem
* It will help you figure out which resources to seek out next
* It won't turn you into either a data scientist or a Python developer
    * Programming and data science take time!

## Who are we?
* A bit about me
* A bit about you

## Structure of the course
The course is structured loosely around different phases of a typical data science project:
* Day 1: Setting up a data science environment
* Day 2: Importing and preprocessing the data
* Day 3: Describing, visualizing, and analyzing the data
* Day 4: Machine learning/predictive modeling

## Getting the most out of the course
* You'll get more out of the course if you interact with the code
* To run the Jupyter notebooks, you'll need the following:
    * Python
    * The core Python scientific computing stack (Numpy, SciPy, pandas, matplotlib)
    * The Jupyter Notebook
    * Various additional packages we'll install as we go
        * scikit-learn
        * seaborn
        * requests
        * beautifulsoup4
        * etc...

## Good news...

Virtually all of the packages we'll cover are included in the base Anaconda distribution, available for all platforms.

Additional packages can almost always be installed via conda or pip:

> conda install [package]

or

> pip install [package]

You can also run system commands from within the Jupyter notebook by prefixing them with a ! in a code cell. For example:

In [None]:
!ls

# Overview of Day 1
* Course overview
* What is data science?
* The Python language
* Why do data science in Python?
* Comparison with other common languages
* The scientific Python ecosystem
* The Jupyter notebook
* Numpy
* Best practices for data science

# What is data science?

<div width="600px">
<img src="images/data_science_venn.png" width="500px" style="margin-bottom: 10px;">
<a href="http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram" style="font-size: 14px;">http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram</a>
</div>

### General themes
* Mathematical/statistical sophistication (contrast with data analyst, software developer)
* Solid programming ability: can build automated data-processing pipelines
* Extracts coherent stories from large amounts of data

### Skills
The skill set people associate with data science varies wildly, and can include:
* Statistics
* Data munging
* Machine learning
* Visualization
* Expertise in Python, R, or similar
* SQL
* Distributed computing
* Optimization
* Etc. etc...

<img src="images/josh_wills_tweet.png" width="700px">

# The Python language

Python is a very widely used, very flexible, high-level, general-purpose, dynamic programming language

### High-level
Python features a high level of abstraction
* Many operations that are explicit in lower-level languages (e.g., C/C++) are implicit in Python
* E.g., memory allocation, garbage collection, etc.
* Python lets you write code faster

#### File reading in Java
```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
 
public class ReadFile {
    public static void main(String[] args) throws IOException{
        String fileContents = readEntireFile("./foo.txt");
    }
 
    private static String readEntireFile(String filename) throws IOException {
        FileReader in = new FileReader(filename);
        StringBuilder contents = new StringBuilder();
        char[] buffer = new char[4096];
        int read = 0;
        do {
            contents.append(buffer, 0, read);
            read = in.read(buffer);
        } while (read >= 0);
        return contents.toString();
    }
}
```

#### File reading in Python
```python
open(filename).read()
```
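
The one-liner above leaves closing the file to the interpreter. A slightly longer sketch, assuming a small UTF-8 text file (the file name `foo.txt` and its contents are made up so the example runs on its own), uses a `with` block so that the file is closed as soon as reading finishes:

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
# (the name and contents are hypothetical).
tmp_dir = tempfile.mkdtemp()
filename = os.path.join(tmp_dir, "foo.txt")
with open(filename, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

# Read the whole file; the 'with' block closes it automatically.
with open(filename, encoding="utf-8") as f:
    file_contents = f.read()

print(file_contents)
```

This is still far shorter than the Java version, and it is the form you will see most often in real code.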

### General-purpose
You can do almost everything in Python
* Comprehensive standard library
* Enormous ecosystem of third-party packages
* Widely used in many areas of software development (web, dev-ops, data science, etc.)

### Dynamic
Code is interpreted at run-time
* No compilation process*; code is read line-by-line when executed
* Eliminates delays between development and execution
* The downside: poorer performance compared to compiled languages

## Variables and data types
* In Python, we declare a variable by assigning it a value with the = sign
* Python supports a variety of data types:
    * booleans
    * numbers (ints, floats, etc.)
    * strings
    * lists
    * dictionaries
    * many others!
* We don't specify a variable's type at assignment: Python is dynamically typed and relies on [duck typing](https://en.wikipedia.org/wiki/Duck_typing)
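
A rough illustration (the function name `describe_length` is made up for this sketch): a variable can be rebound to a value of another type, and a function works with any object that supports the operations it uses, which is the essence of duck typing.

```python
def describe_length(thing):
    # Works for any object that supports len(), whatever its type
    return f"{type(thing).__name__} with {len(thing)} element(s)"

print(describe_length("duck"))       # a string
print(describe_length([1, 2, 3]))    # a list
print(describe_length({"a": 1}))     # a dictionary

# A variable's type is not fixed; it can change on reassignment
x = 30
x = "thirty"
print(type(x).__name__)
```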

### Examples

In [None]:
# An integer. Notice the variable naming convention.
age_in_years = 30

In [None]:
# A float
kind_of_pi = 3.14

In [None]:
# A string
apple = "A is for Apple"

In [None]:
# A boolean takes on only the values True or False
enjoying_class = True

#### Lists
* An ordered, heterogeneous collection of objects
* List elements can be accessed by position

In [None]:
random_stuff = [12, 'donut', age_in_years]

In [None]:
# We index lists by numerical position--starting at 0
random_stuff[1]

In [None]:
# We can also slice lists
random_stuff[1:-1]

In [None]:
# Append an element
random_stuff.append(kind_of_pi)
random_stuff 

#### Dictionaries (dict)
* Collection of key-to-value pairs (insertion order is preserved as of Python 3.7, but don't think of a dict as a sequence)
* dict elements can be accessed by key, but *not* by position

In [None]:
# A dictionary is a mapping from keys to values
fruit_prices = {
    'apple': 0.65,
    'mango': 1.50,
    'strawberry': '$3/lb',
    'durian': 'unavailable'
}

In [None]:
# What's the price of a mango?
fruit_prices['mango']

In [None]:
# Add a new entry
fruit_prices['pear'] = 0.75
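
A few other common dictionary operations, as a minimal sketch (it uses a shortened copy of `fruit_prices`, and the `'kiwi'` key is just an example of a missing entry):

```python
fruit_prices = {'apple': 0.65, 'mango': 1.50}

# Indexing with a missing key raises a KeyError...
try:
    fruit_prices['kiwi']
except KeyError as err:
    print("No such key:", err)

# ...whereas .get() returns a default value instead
kiwi_price = fruit_prices.get('kiwi', 'unavailable')
print(kiwi_price)

# Loop over key/value pairs
for fruit, price in fruit_prices.items():
    print(f"{fruit}: {price}")

# Test for membership with 'in'
print('mango' in fruit_prices)
```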

### Everything is an object in Python
* All of these 'data types' are actually just objects in Python
* *Everything* is an object in Python!
* The operations you can perform with a variable depend on the object's definition
* E.g., the operator * is defined for some objects but not others

In [None]:
# Multiply an int by 2
age_in_years * 2

In [None]:
# Multiply a float by 2
4.8 * 2

In [None]:
# What about a string?
'duck' * 2

In [None]:
# A list?
[3, 2, 1] * 2

In [None]:
# A dictionary?
fruit_prices * 2
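
When an operator is not defined for an object, Python raises a `TypeError`. A minimal sketch of catching it (again using a shortened copy of `fruit_prices`):

```python
fruit_prices = {'apple': 0.65, 'mango': 1.50}

try:
    fruit_prices * 2
except TypeError as err:
    # dict does not define multiplication, so Python raises a TypeError
    error_message = str(err)
    print("TypeError:", error_message)

# Strings and lists do define *, as repetition
print('duck' * 2)
print([3, 2, 1] * 2)
```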

## Control structures
* Language features that allow us to control how code is executed
* Iteration (e.g., for-loops, while statements...)
* Conditionals (if-then-else statements)
* [Etc](https://docs.python.org/3/tutorial/controlflow.html)...

In [None]:
# Count how many elements we have in our list
n_elements = len(random_stuff)

# Loop over indices of the list and print each value
for i in range(n_elements):
    val = random_stuff[i]
    # This is an "f-string": A template that allows you to
    # easily inject Python expressions into strings. Notice
    # the 'f' before the quotes!
    msg = f"At index {i}, the value is {val}"
    print(msg)

In [None]:
# We could also replace the above code with a single line
# using Python's 'list comprehension' syntax.
[print(f"At index {i}, the value is {v}") for i, v in enumerate(random_stuff)];
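
The cells above only show `for` loops. As a small sketch of the other two structures mentioned, here are a conditional and a `while` loop (the list contents mirror `random_stuff`; the label strings are made up):

```python
random_stuff = [12, 'donut', 30, 3.14]

# if / elif / else: branch on the type of each element
labels = []
for item in random_stuff:
    if isinstance(item, str):
        labels.append('text')
    elif isinstance(item, int):
        labels.append('integer')
    else:
        labels.append('something else')
print(labels)

# while: keep halving until the value drops below 1
value = 12
n_halvings = 0
while value >= 1:
    value = value / 2
    n_halvings += 1
print(n_halvings)
```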

## Imports and namespaces
* Python is very serious about maintaining orderly namespaces
* If you want to use some code outside the current scope, you need to explicitly "import" it
* Python's import system often annoys beginners, but it substantially increases code clarity
    * Almost completely eliminates naming conflicts and confusion
    * If you know R, consider the horrors wreaked by liberal use of `attach()`

In [None]:
# Three different ways to import and access the OrderedDict class
from collections import OrderedDict
a = OrderedDict()

from collections import OrderedDict as od
b = od()

import collections
c = collections.OrderedDict()

# Verify that the resulting objects are equivalent
a == b == c

## Functions
* A block of code that only runs when explicitly called
* Can accept arguments (or parameters) that alter its behavior
* Can accept any number/type of inputs and return any single object

In [None]:
# We'll need the random module for this
import random

def add_noise(x, mu=0, sd=1):
    ''' Adds gaussian noise to the input.
    
    Parameters:
        x (number): The number to add noise to
        mu (float): The mean of the gaussian noise distribution
        sd (float): The standard deviation of the noise distribution
    
    Returns: A float.
    '''
    noise = random.normalvariate(mu, sd)
    return x + noise

In [None]:
# Let's try it out..
add_noise(4, 0, 100)

### Keyword arguments
* Python functions can have optional keyword (or named) arguments
* This makes it easy to call functions that take many arguments: you only pass the ones you want to change, and you can name them explicitly
* Omitted keyword arguments will use the default value they're assigned in the definition

In [None]:
# Only the mandatory x argument
add_noise(x=10)

In [None]:
add_noise(4, sd=10)
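
Two further conveniences, sketched with a compact copy of `add_noise` (the `settings` dict is hypothetical): keyword arguments can be given in any order, and a dictionary of settings can be unpacked into keyword arguments with `**`.

```python
import random

def add_noise(x, mu=0, sd=1):
    ''' Adds gaussian noise to the input (compact copy of the function above). '''
    return x + random.normalvariate(mu, sd)

# Keyword arguments can be given in any order...
a = add_noise(4, sd=10, mu=2)

# ...and a dict of settings can be unpacked with **
settings = {'mu': 0, 'sd': 0.001}   # hypothetical settings
b = add_noise(10, **settings)
print(a, b)

# Positional arguments must come before keyword arguments;
# add_noise(mu=2, 4) would be a SyntaxError.
```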

## Classes
* A template for a particular kind of object
* A class defines the variables an object contains and what it can do with them
* To illustrate, let's define a `Circle` class...
* Note: object-oriented programming can be a bit hard to understand at first, and we're moving quickly

In [None]:
# We need pi!
import math

class Circle:
    ''' Represents a circle.
    
    Parameters:
        radius (float): The radius of the circle.
    '''
    
    def __init__(self, radius):
        self.r = radius
    
    def __mul__(self, value):
        return Circle(self.r * value)
    
    def __repr__(self):
        return "A circle with a radius of {} has an area of {}".format(self.r, self.area)

    @property
    def area(self):
        return math.pi * math.pow(self.r, 2)

    def copy(self):
        ''' Returns a new Circle with the same radius. '''
        return Circle(self.r)

In [None]:
# Initialize a circle with radius 5...

### Magic methods [advanced]
* Methods padded with `__` have a variety of special functions in Python
* E.g., `__init__` and/or `__new__` are called when an object is initialized
* All operators in Python are actually just cleverly-disguised method calls
* E.g., the code `age_in_years * 2` is actually equivalent to `age_in_years.__mul__(2)`
* Any object that implements the `__mul__` method can use the `*` operator
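
To make this concrete, here is a minimal sketch using a made-up `Vector` class (not part of the course code) that implements `__mul__`, `__eq__`, and `__repr__`:

```python
class Vector:
    ''' A hypothetical 2d vector, used only to illustrate magic methods. '''

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __mul__(self, value):
        # Called for: vector * value
        return Vector(self.x * value, self.y * value)

    def __eq__(self, other):
        # Called for: vector == other
        return (self.x, self.y) == (other.x, other.y)

    def __repr__(self):
        return f"Vector({self.x}, {self.y})"

v = Vector(1, 2)
print(v * 3)            # operator syntax
print(v.__mul__(3))     # the equivalent explicit method call

# Built-in types work the same way
age_in_years = 30
print(age_in_years * 2 == age_in_years.__mul__(2))
```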

In [None]:
# Multiply a circle by 2 and print the resulting circle's area

# Why do data science in Python?

## Easy to learn
* Readable, explicit syntax
* Most packages are very well documented
    * e.g., scikit-learn's [documentation](http://scikit-learn.org/stable/documentation.html) is widely held up as a model
* A huge number of tutorials, guides, and other educational materials
    * [Codecademy](https://www.codecademy.com/learn/learn-python-3) is a good place to start
    * Tons of questions (and answers) on [Stack Overflow](http://stackoverflow.com/questions/tagged/python)

## Comprehensive standard library
* The [Python standard library](https://docs.python.org/3.7/library/) contains a huge number of high-quality modules
* When in doubt, check the standard library first before you write your own tools!
* For example:
    * os: operating system tools
    * re: regular expressions
    * collections: useful data structures
    * multiprocessing: simple parallelization tools
    * pickle: serialization
    * json: reading and writing JSON
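
As a quick taste, the sketch below combines three of these modules to count words and round-trip the result through JSON (the sample sentence is made up):

```python
import json
import re
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the end"

# re: pull out the individual words
words = re.findall(r"[a-z]+", text)

# collections: count how often each word occurs
counts = Counter(words)
print(counts.most_common(1))

# json: serialize the counts to a string and read them back
as_json = json.dumps(counts)
restored = json.loads(as_json)
print(restored["the"])
```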

## Exceptional external libraries

* Python has very good (often best-in-class) external packages for almost everything
* Particularly important for data science, which draws on a very broad toolkit
* Package management is easy (conda, pip)
* Examples:
    * Web development: flask, Django
    * Database ORMs: SQLAlchemy, Django ORM (w/ adapters for all major DBs)
    * Scraping/parsing text/markup: beautifulsoup, scrapy
    * Natural language processing (NLP): nltk, gensim, textblob
    * Numerical computation and data analysis: numpy, scipy, pandas, xarray
    * Machine learning: scikit-learn, Tensorflow, keras
    * Image processing: pillow, scikit-image, OpenCV
    * Plotting: matplotlib, seaborn, altair, ggplot, Bokeh
    * GUI development: pyQT, wxPython
    * Testing: pytest
    * Etc. etc. etc.

## (Relatively) good performance
* Python is a high-level dynamic language—this comes at a performance cost
* For many (not all!) data scientists, performance is irrelevant most of the time
* In general, the less Python code you write yourself, the better your performance will be
    * Much of the standard library consists of Python interfaces to C functions
    * Numpy, scikit-learn, Theano, etc. all rely heavily on C/C++ or Fortran

In [None]:
# Create a list of 100,000 integers
my_list = list(range(100000))

In [None]:
# Python's built-in sum() function is pretty fast
%timeit sum(my_list)

In [None]:
# If you write your own naive implementation, it probably won't
# be nearly as fast
def ill_write_my_own_sum(values):
    s = 0
    # Loop over the argument, not the global my_list
    for elem in values:
        s += elem
    return s

%timeit ill_write_my_own_sum(my_list)
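
Outside the notebook, where the `%timeit` magic is not available, roughly the same comparison can be sketched with the standard library's `timeit` module. The function name `loop_sum` and the repetition count of 20 are arbitrary choices for this example, and the timings will differ from machine to machine:

```python
import timeit

my_list = list(range(100000))

def loop_sum(values):
    ''' A naive pure-Python sum, for comparison with the built-in. '''
    s = 0
    for elem in values:
        s += elem
    return s

# Time 20 calls of each; absolute numbers depend on the machine
t_builtin = timeit.timeit(lambda: sum(my_list), number=20)
t_loop = timeit.timeit(lambda: loop_sum(my_list), number=20)

print(f"built-in sum: {t_builtin:.4f} s")
print(f"loop sum:     {t_loop:.4f} s")
```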

### If you need more speed...
* Parallelization
* [Cython](http://cython.org) (a superset of Python) allows C type declarations, function calls
* Rapid progress on just-in-time compilers that optimize code effortlessly
    * E.g., [Numba](https://numba.pydata.org/)

# Python vs. other data science languages

* Python competes for mind share with many other languages
* Most notably, R
* To a lesser extent, Matlab, Mathematica, SAS, Julia, Java, Scala, etc.

### R
* [R](https://www.r-project.org/) is dominant in traditional statistics and some fields of science
    * Has attracted many SAS, SPSS, and Stata users
* Exceptional statistics support; hundreds of best-in-class libraries
* Designed to make data analysis and visualization as easy as possible
* Slow
* Language quirks drive many experienced software developers crazy
* Less support for most things non-data-related

### MATLAB
* A proprietary numerical computing language widely used by engineers
* Good performance and very active development, but expensive
* Closed ecosystem, relatively few third-party libraries
    * There is an open-source port (Octave)
* Not suitable for use as a general-purpose language

### Others
* [Julia](http://julialang.org/) is a performant new language for technical computing
    * Promising, but very few libraries compared to Python, R
* [SAS](https://www.sas.com/en_us/home.html) is an enterprise analytics software suite widely used in government, some industries
    * Offers a GUI and handles large datasets very well
    * Does very little other than data analysis
* Java has an enormous ecosystem and excellent performance, but is extremely verbose
* [SPSS](http://www.ibm.com/analytics/us/en/technology/spss/) is more of a cash cow for IBM than a programming language; we won't discuss it
    * But if you must use SPSS, use [JASP](https://jasp-stats.org/) or [Jamovi](https://www.jamovi.org/) instead!

## So, why Python?
Why choose Python over other languages?
* Arguably none of these offers the same combination of readability, flexibility, libraries, and performance
* Python is sometimes described as "the second best language for everything"
* Doesn't mean you should always use Python
    * Depends on your needs, community, etc.

## You can have your cake _and_ eat it!
* Many languages--particularly R--now interface seamlessly with Python
* You can work primarily in Python, fall back on R when you need it (or vice versa)
* The best of all possible worlds?

# The core Python data science stack
* The Python ecosystem contains tens of thousands of packages
* Several are very widely used in data science applications:
    * [Jupyter](http://jupyter.org): interactive notebooks
    * [Numpy](http://numpy.org): numerical computing in Python
    * [Scipy](http://scipy.org): scientific Python tools
    * [Matplotlib](http://matplotlib.org): plotting in Python
    * [pandas](http://pandas.pydata.org/): complex data structures for Python
    * [scikit-learn](http://scikit-learn.org): machine learning in Python
* We'll cover the first two today, and meet the rest later

# The Jupyter notebook
* "The [Jupyter Notebook](http://jupyter.org) is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text."
    * You can [try it online](http://jupyter.org/try)
* Formerly the IPython Notebook
* Supports [many different languages](https://github.com/jupyter/jupyter/wiki/Jupyter-kernels)
* A living document wrapped around a command prompt
* Various extensions and [widgets](http://ipywidgets.readthedocs.io/en/latest/index.html)

In [None]:
# We discuss what the following lines do elsewhere
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Interactive widgets!
from ipywidgets import interact
import ipywidgets as widgets

# Define our plotting function
def plot_normal_hist(mu, sd):
    samp = np.random.normal(mu, sd, size=10000)
    plt.hist(samp, bins=100)
    plt.xlim(-10, 20)

# Hook up our plotting function to the interactive widget
interact(plot_normal_hist, mu=5, sd=2);

## Why Jupyter?
* An easy way to write and share completely reproducible documents
* Combine code, results, and text in one place
* You can mix languages
* Completely interactive: make a change and see what happens
* Execution order matters

### Slideshow mode
* These slides are actually a live Jupyter notebook
* We can edit and execute cells on-the-fly
* The slideshow extension is installed separately; follow the instructions [here](https://github.com/damianavila/RISE)

### Built-in LaTeX support
$$ F(k) = \int_{-\infty}^{\infty} f(x) e^{-2\pi i k x} \, dx $$

### Highly customizable
* Custom key bindings
* Supports [web standards](http://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/JavaScript%20Notebook%20Extensions.html#) that enable near-limitless customization via JavaScript
* All kinds of [unofficial extensions](http://jupyter-contrib-nbextensions.readthedocs.io/en/latest/)

### Magic functions
* Jupyter/IPython includes a number of ["magic" commands](http://ipython.readthedocs.io/en/stable/interactive/magics.html) to make life easier
* Support in-line plotting, timing, debugging, calling other languages, etc.

In [None]:
# This line says we want plots displayed in cell output
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

x = np.random.normal(size=100)
y = np.random.normal(size=100)
p = plt.scatter(x, y)

### Combining R and Python with the %R magic
* In the same notebook
* And even in the same notebook cell!
* Can also use an [R kernel](https://irkernel.github.io/) with the notebook

In [None]:
# This cell won't work unless you've installed R and rpy2.
# You can get R from CRAN (https://cloud.r-project.org/),
# and install rpy2 with "conda install rpy2"
%load_ext rpy2.ipython

import seaborn as sns

df = sns.load_dataset('tips')
print("\n\nFirst few lines of the dataset in Python:\n\n", df.head())

print("\n\nPlot generated with ggplot2 in R, after passing in the Python data:")
%R -i df
# Assumes that ggplot2 is installed in R!
%R library(ggplot2)
%R p = qplot(total_bill, tip, color=time, data=df) + geom_smooth(method='lm')
%R print(p);

### Cell vs. line magics
* %R will let you write one line of R code into a Python cell
* %%R turns the whole cell into R code

In [None]:
%%R
x = rnorm(100)
y = rnorm(100)
print(qplot(x, y))
print(summary(lm(y ~ x)))

### Getting help in Jupyter
* Explore the options under the "Help" menu
* Press the 'h' key (in command mode) to see keyboard shortcuts
* Shift-tab inside Python function calls will show you the function signature/arguments
* Prefix any Python function with '?' to bring up its documentation
* Prefix any command with '!' to run it as an operating system command

In [None]:
# Running this cell will pop up the documentation for
# numpy's reshape function
import numpy as np
?np.reshape

## Numpy
* "The fundamental package for scientific computing with Python"
* The basic building block of most data analysis in Python
* Numpy arrays: N-dimensional, homogeneous, unlabeled arrays
* Working with numpy will look familiar if you've spent time in an environment like R or Matlab
    * There are handy cheat sheets for [Matlab](http://mathesaurus.sourceforge.net/matlab-numpy.html) and [R](http://mathesaurus.sourceforge.net/r-numpy.html) users
    * Suggested homework: do [a numpy tutorial](https://www.datacamp.com/community/tutorials/python-numpy-tutorial) or [two](https://www.learnpython.org/en/Numpy_Arrays)
* Numpy contains highly optimized routines for creating and manipulating arrays
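
A minimal sketch of what "N-dimensional, homogeneous, unlabeled" means in practice (the values are arbitrary):

```python
import numpy as np

# Mixing ints and floats gives a single common dtype
arr = np.array([[1, 2, 3], [4.5, 5, 6]])
print(arr.dtype)    # one dtype for every element
print(arr.shape)    # (rows, columns)
print(arr.ndim)     # number of dimensions

# Arithmetic is applied element-wise, with no explicit loop
doubled = arr * 2
print(doubled)

# Elements are addressed by integer position, not by label
print(arr[1, 0])
```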

In [None]:
# By convention, we assign numpy to the variable np for brevity
import numpy as np

# This isn't numpy-related, but we'll be plotting stuff in the notebook
import matplotlib.pyplot as plt

# Draw all plots inline in the notebook
%matplotlib inline

In [None]:
# Create a 10 x 10 array of zeros
np.zeros([10, 10])

In [None]:
# Create a 1d numpy array with values 0 through 999
a = np.arange(1000)

In [None]:
# Reshape into a 10 x 100 2d array
a = np.reshape(a, (10, 100))

In [None]:
# Add a little bit of noise
b = np.random.normal(0, 0.5, size=[10, 100])
a = a + b

In [None]:
# Inspect the first 50 elements of each of the first two rows
print(a[0:2, 0:50])

In [None]:
# Plot only the rows at indices 3 through 6 (i.e., the 4th through 7th rows)
to_plot = a[3:7, :]
for row in to_plot:
    plt.plot(row)

### Numpy exercises
Here are a few numpy exercises to get you started; if you're new to Python and/or numerical computing, these will probably require some googling.
1. Create a 10 x 10 x 10 3d array of random numbers (hint: see examples above)
2. Extract any 10 x 10 slice from that array
3. Create two 2d arrays of any size (but both with the same dimensions) and print their element-wise product
4. Create a 1d array of any length and then reverse it (so that the first element becomes the last, etc.)
5. 100 more short exercises can be found [here](https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises_no_solution.ipynb) (there are also versions [with hints](https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises_with_hint.ipynb) and [with solutions](https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises.ipynb))

In [None]:
# Write your code here

## Everything revolves around numpy arrays
* Scipy adds a bunch of useful science and engineering routines that operate on numpy arrays
    * signal processing, statistical distributions, image analysis, etc.
* pandas adds powerful methods for manipulating numpy arrays
    * Like data frames in R--but typically faster
* scikit-learn supports state-of-the-art machine learning over numpy arrays
    * Inputs and outputs of virtually all functions are numpy arrays

# Best practices

* Good data scientists borrow many best practices from software developers
* Efficiency and reproducibility are key
* Some best practices:
    * Maintain project- or domain-specific environments
    * Use version control (e.g., Git/GitHub)
    * Profile your code
    * Test your code*
    * Document everything
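
Testing does not have to be elaborate. As a minimal sketch, the (hypothetical) helper `standardize` below is checked by a plain function full of `assert` statements; a test runner such as pytest can collect functions named `test_*` automatically, but calling the function by hand works too.

```python
def standardize(values):
    ''' Rescales a list of numbers to mean 0 and (population) SD 1.
    A hypothetical helper, used only as something to test. '''
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

def test_standardize():
    result = standardize([1, 2, 3, 4, 5])
    # The rescaled values should have mean 0...
    assert abs(sum(result) / len(result)) < 1e-12
    # ...the middle value should sit exactly at the mean...
    assert result[2] == 0
    # ...and the original ordering should be preserved
    assert result == sorted(result)

test_standardize()
print("All tests passed")
```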

## Set up project-specific environments
* It's a good idea to set up a new Python environment for each project (or at least, domain)
* Prevents version conflicts and makes dependency management easy
* Conda simplifies this process (see [the documentation](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html))
* In the interest of time, we won't do this here--but you may want to try it for tomorrow

## Version control
* Most data analysts are familiar with "how did I do that analysis?" syndrome
* Version or source control involves formally tracking the history of your work
* Every major (sometimes minor) change is logged
* Repository of changes is often maintained in a central location (e.g., GitHub)

<img src="images/version_control.png" width="1000px">

## Git/GitHub
* The most widely used source control platform is [git](https://git-scm.com/downloads)
    * If you're on Linux or Mac, it's probably already installed (type `git` at command line)
    * If you're on Windows, you'll probably need to [install git](https://git-scm.com/downloads)
* GitHub is the most widely used Git hosting service
    * Hosts millions of projects (including this course)
* Learning curve can be a bit steep, but there are [interactive tutorials](https://try.github.io/levels/1/challenges/1) and installable [GUIs](https://git-scm.com/downloads/guis) to help
* A small example...

## For tomorrow...

* Install (and/or update) git
    * Note: git is different from the GitHub client!
    * You'll need the former; the latter is optional
* Clone the repository for this course. From the command line:

> git clone https://github.com/tyarkoni/SSI2019.git
   
* This will make it easier to work with the course notebooks
* Create a user account on GitHub
* Complete [this git tutorial](https://try.github.io/levels/1/challenges/1)

## Code profiling
* Code has bottlenecks
* The bottlenecks are often not where you expect them to be
* Python has many available tools for easy code profiling
* The easiest to use is the cProfile module in the standard library

In [None]:
import cProfile
from scipy.signal import convolve
import numpy as np

def pointless_array_operations():
    ''' A set of pointless array operations intended to chew up clock time. '''
    n = 100
    x = np.random.normal(size=(n, n))
    y = np.random.normal(size=(n, n))
    for i in range(200):
        z = np.dot(x, y)
        z = convolve(z, np.random.normal(size=(n, 1)))
        np.corrcoef(x, z)
        
cProfile.run('pointless_array_operations()')
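
The full `cProfile` output can be long. One way to make it more readable, sketched here with the standard library's `pstats` module and two made-up toy functions, is to sort the results by cumulative time and print only the top few entries:

```python
import cProfile
import io
import pstats

def loop_square_sum(n):
    ''' Sums squares in an explicit Python loop. '''
    total = 0
    for i in range(n):
        total += i * i
    return total

def builtin_square_sum(n):
    ''' Same result via the built-in sum(). '''
    return sum(i * i for i in range(n))

def workload():
    loop_square_sum(200000)
    builtin_square_sum(200000)

# Collect profiling data for one call of workload()
profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time and show only the 5 most expensive entries
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats('cumulative')
stats.print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time puts the functions that account for the most total run time (including the calls they make) at the top, which is usually the quickest way to spot a bottleneck.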

## ...but don't get carried away
* There are many profiling tools
* It's easy to get sucked in
* "Premature optimization is the root of all evil"
* Focus on the parts of your code that hamper performance most

## Document your code and workflow
* Get in the habit of documenting everything you do (not just code)
* Tell your users (including yourself) the story of what your analyses are doing
* The Jupyter notebook makes this easy
    * There are plenty of excellent examples on the web (e.g., [1](http://beautifuldata.net/2014/03/datalicious-notebookmania-my-favorite-7-ipython-notebooks/), [2](https://github.com/donnemartin/data-science-ipython-notebooks), [3](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks))

# Resources/further reading

There are hundreds of excellent resources online for learning Python and/or data science. A few good ones:

* Codecademy offers interactive programming courses for many languages and tools, including [Python](https://www.codecademy.com/learn/learn-python-3) and [git](https://www.codecademy.com/learn/learn-git)
* [A Whirlwind Tour of Python](http://www.oreilly.com/programming/free/files/a-whirlwind-tour-of-python.pdf) is an excellent intro to Python by [Jake VanderPlas](http://vanderplas.com/); Jupyter notebooks are available [here](https://github.com/jakevdp/WhirlwindTourOfPython)
* Another excellent and free online book is Allen Downey's ["Think Python"](http://greenteapress.com/wp/think-python-2e/)
* Jake's [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook) is also available online as a set of notebooks
* Kaggle maintains a nice list of [data science and Python tutorials](https://www.kaggle.com/learn/overview)
* GitHub offers a [hands-on introduction](https://try.github.io/levels/1/challenges/1) to git (with GitHub); the official [Pro Git](https://git-scm.com/book/en/v2/) book provides a more comprehensive guide

# Questions? Comments? Suggestions?
* I'll be here for 20 - 30 minutes before and after every class
* See you tomorrow!