# Python Data Science Handbook Notes 

This document contains notes I have taken on *Python Data Science Handbook*. I hope someone finds this helpful. 

Book citation: *Python Data Science Handbook* by Jake VanderPlas (O'Reilly). Copyright 2017 Jake VanderPlas, 978-1-491-91295-8.

# Preface

Data science comprises three distinct and overlapping areas: 
1. **Statistics**
2. **Computer science** - used for the design and use of algorithms to efficiently store, process, and visualize data
3. **Domain expertise** - necessary to formulate the right questions and to put their answers in context

Important libraries: 
1. NumPy: manipulation of homogeneous data
2. Pandas: manipulation of heterogeneous data
3. SciPy: common scientific computing tasks
4. Matplotlib: visualizations
5. Scikit-Learn: machine learning

# Chapter 1: IPython: Beyond Normal Python

Need help? Use `?` for documentation, `??` for source code, and the Tab key for autocompletion.

Every Python object contains a reference to a docstring, which in most cases holds a concise summary of the object and how to use it. Python has a built-in `help()` function that prints the docstring. This even works for functions or objects you create yourself: to give a function a docstring, place a string literal as the first statement in its body.

Shortcuts
- `Ctrl-a` to move the cursor to the beginning of the line. 
- `Ctrl-e` to move the cursor to the end of the line. 
- `Ctrl-k` to cut text from the cursor to the end of the line. 
- `Ctrl-p` to access the previous command in history.
- `Ctrl-n` to access the next command in history. 
    - Note: you can use Ctrl-p/Ctrl-n or the up/down arrow keys to search through history, but only by matching characters at the beginning of the line. 
- `Ctrl-l` to clear terminal screen. 
- `Ctrl-c` to interrupt current Python command. 

In [8]:
help(len)

Help on built-in function len in module builtins:

len(obj, /)
    Return the number of items in a container.



In [9]:
len?

In [10]:
def square(a): 
    """Return the square of a."""
    return a**2

Because Python is so readable, you can usually gain another level of insight by reading the source code of the object you're curious about. `??` gives quick insight into the under-the-hood details. Sometimes you will notice that `??` does not display source code. This is generally because the object in question is implemented not in Python but in C or some other language. In that case `??` gives you the same output as `?`. 

In [11]:
square??

Every Python object has various attributes and methods associated with it. Python has a built-in `dir` function that returns a list of these, but the tab-completion interface is much easier to use in practice. To see a list of all available attributes of an object, type the name of the object followed by a `.` and then press the Tab key. If there is only a single option, pressing the Tab key will complete the line for you. Tab completion is also useful when importing objects from packages. 

In [16]:
# wildcard matching: list every object in the namespace whose name ends with Warning
*Warning?

In [17]:
# list every string method that contains the word find somewhere in its name
# (the ? must be the last character on the line, so the comment goes above it)
str.*find*?
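
Outside IPython, the wildcard searches above can be approximated with the built-in `dir` function and a list comprehension. This is only a rough sketch: it searches `builtins` and `str` directly, whereas IPython's `?` wildcard searches whatever namespace is active in the session.

```python
import builtins

# Names in the built-in namespace that end with "Warning",
# roughly what `*Warning?` lists in IPython.
warning_names = [name for name in dir(builtins) if name.endswith("Warning")]

# String methods whose names contain "find",
# roughly what `str.*find*?` lists in IPython.
find_methods = [name for name in dir(str) if "find" in name]

print(warning_names)
print(find_methods)  # ['find', 'rfind']
```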

Magic commands are prefixed by `%`. They are designed to succinctly solve various common problems in standard data analysis. There are two kinds of magic commands: line magics (denoted by a single `%` prefix, operating on a single line of input) and cell magics (denoted by a double `%%` prefix, operating on multiple lines of input).

`%paste` pastes code into the cell and does so without indentation errors. This way you can copy code from online sources and paste with no troubles. 

`%cpaste` opens an interactive multiline prompt in which you can paste one or more chunks of code to be executed in a batch. 

`%run` executes an external script: if you have created a `myscript.py` file, you can run it from Jupyter with `%run myscript.py`. Note that any functions defined within the .py file are then available for use in the session. 

`%timeit` determines the execution time of the single-line Python statement that follows it.

`%magic` to access a general description of available magic functions 

`%lsmagic` to list all available magic functions

In [18]:
%timeit L = [n**2 for n in range(100)]

31.6 µs ± 756 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Note that list comprehensions are typically faster than the equivalent `for` loop construction. 
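
To check the comparison outside IPython, the standard-library `timeit` module can time both constructions. The sketch below uses two hypothetical helper functions (`squares_loop` and `squares_comprehension`) that are not from the book; the absolute timings depend on the machine and Python version, so no expected numbers are shown.

```python
import timeit

def squares_loop(n=100):
    """Build the list of squares with an explicit for loop."""
    result = []
    for i in range(n):
        result.append(i ** 2)
    return result

def squares_comprehension(n=100):
    """Build the same list with a list comprehension."""
    return [i ** 2 for i in range(n)]

# Both constructions must produce identical results.
same_result = squares_loop() == squares_comprehension()

# Time 10,000 calls of each function.
loop_seconds = timeit.timeit(squares_loop, number=10_000)
comp_seconds = timeit.timeit(squares_comprehension, number=10_000)
print(f"for loop:      {loop_seconds:.4f} s")
print(f"comprehension: {comp_seconds:.4f} s")
```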

The `In` object is a list that keeps track of the commands in order. The `Out` object is not a list but a dictionary mapping input numbers to their outputs (if any). 

Note that not all operations have outputs. For example, import statements and print statements don't affect `Out` (yes, print statements!). This makes sense if you consider that `print` is a function that returns `None`; for brevity, any command that returns `None` is not added to `Out`. `Out` is useful when you want to interact with past results, for example to reuse the result of a very expensive computation without running it again. 

In [19]:
import math
math.sin(2)

0.9092974268256817

In [20]:
math.cos(2)

-0.4161468365471424

In [21]:
Out[19]*Out[20]

-0.37840124765396416

Underscore shortcuts and previous outputs: 
The variable `_` (a single underscore) is kept updated with the previous output. You can use a double underscore to access the second-to-last output and a triple underscore to access the third-to-last output. It stops there! 

A shorthand for `Out[X]` is `_X` (i.e., a single underscore followed by the line number)

In [22]:
print(_)

-0.37840124765396416


In [23]:
_19

0.9092974268256817

Suppressing the output of a command: the easiest way is to add a semicolon `;` to the end of the line. Note that when you do this the result is computed silently, and it is neither displayed on the screen nor stored in the `Out` dictionary.  

IPython gives you a syntax for executing shell commands directly from within the IPython terminal. Anything appearing after `!` on a line is executed not by the Python kernel but by the system command line. The shell is a way to interact textually with your computer, and it offers much more control over advanced tasks. You can use any command that works at the command line in IPython by prefixing it with the `!` character. 

In [24]:
!ls

Coding comparisons .ipynb
Cogs 164 project.ipynb
Drawing to Learn Science
Hate-Crimes-in-SD
Python Data Science Handbook Notes .ipynb
SD_collisions
Unique-words
group044
sarah_functions.py


In [25]:
!pwd

/Users/sarahamiraslani/Documents/GitHub


Shell commands can also be made to interact with the IPython namespace. For example, you can save the output of any shell command to a Python variable using the assignment operator. Note that the result is not returned as a plain list but as a special shell return type defined in IPython. It looks and acts a lot like a Python list but has additional functionality.

In [26]:
directory = !pwd

In [27]:
print(directory)

['/Users/sarahamiraslani/Documents/GitHub']


In [28]:
type(directory)

IPython.utils.text.SList

#### Profiling and Timing Code

In the process of developing code and creating data processing pipelines, there are trade-offs you can make between various implementations. Early in developing your algorithm, it can be counterproductive to worry about these things; as Donald Knuth put it, "premature optimization is the root of all evil." But once you have your code working, it can be useful to dig into its efficiency a bit. 

- `%time` to time the execution of a single statement
- `%%timeit` to time the repeated execution of a multiline snippet of code
- `%timeit` to time the repeated execution of a single statement for more accuracy
- `%prun` to run the code with the profiler
- `%lprun` to run the code with the line-by-line profiler
- `%memit` to measure the memory use of a single statement
- `%mprun` to run code with the line-by-line memory profiler

Note: `%lprun`, `%memit`, and `%mprun` are not bundled with IPython; you need to install the `line_profiler` and `memory_profiler` extensions to use them.
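
In a plain Python script, where the magics are not available, the standard-library `cProfile` and `pstats` modules give a comparable function-level report. The sketch below is not a description of how `%prun` works internally, and `sum_of_lists` is a hypothetical workload used only for illustration.

```python
import cProfile
import io
import pstats

def sum_of_lists(n):
    """Hypothetical workload: build a few lists and sum them."""
    total = 0
    for i in range(5):
        values = [j ^ (j >> i) for j in range(n)]
        total += sum(values)
    return total

# Profile a single call to the workload.
profiler = cProfile.Profile()
profiler.enable()
result = sum_of_lists(100_000)
profiler.disable()

# Collect the report in a string so it can be printed or inspected.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)  # five most expensive entries
print(report.getvalue())
```

Sorting by cumulative time puts the functions that dominate the run at the top, which is usually the first thing to look at before optimizing anything.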

In [29]:
%timeit sum(range(100))

975 ns ± 25.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [30]:
%%timeit
total=0

for i in range(100): 
    for j in range(100): 
        total += i *(-1)**j

3.65 ms ± 93.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


# Chapter 2: Introduction to NumPy

It will help us to think of all data as arrays of numbers. No matter what the data are, the first step in making them analyzable is to transform them into arrays of numbers. In some ways, NumPy arrays are like Python's built-in list type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger. NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python.  

In [31]:
import numpy as np

Effective data-driven science and computation requires understanding how data is stored and manipulated.

The standard Python implementation is written in C. This means that every Python object is a cleverly disguised C structure. 

**A Python list is more than just a list**
- Because of Python's dynamic typing, we can create heterogeneous lists. This flexibility comes at a cost: to allow these flexible types, each item in the list must contain its own type info, reference count, and other information; that is, each item is a complete Python object. In the special case that all variables are of the same type, much of this information is redundant: it can be much more efficient to store data in a fixed-type array. 

At the implementation level, the array contains a single pointer to one contiguous block of data. The Python list, on the other hand, contains a pointer to a block of pointers, each of which in turn points to a full Python object such as a Python integer. Again, the advantage of a list is flexibility: because each list element is a full structure containing both data and type information, the list can be filled with data of any desired type. Fixed-type NumPy-style arrays lack this flexibility but are much more efficient for storing and manipulating data. 
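
The storage difference can be made concrete with a rough measurement. The sketch below assumes CPython (where `sys.getsizeof` reports per-object overhead) and fixes the array type to `np.int64` so the data size is predictable; the exact list figure varies by Python version and platform.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# A list stores a block of pointers plus one full Python object per element.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(item) for item in py_list)

# A fixed-type array stores the raw values in one contiguous buffer.
array_bytes = np_array.nbytes  # n elements * 8 bytes each for int64

print(f"list:  about {list_bytes} bytes")
print(f"array: {array_bytes} bytes of data")
```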

**Fixed-Type Arrays in Python**
Python offers several different options for storing data in efficient, fixed-type data buffers. The built-in `array` module can be used to create dense arrays of a uniform type. 

In [32]:
import array

L = list(range(10))
A = array.array('i',L)
A

array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Here `'i'` is a type code indicating the contents are integers. 

However, the `ndarray` object of the NumPy package is much more useful. Python's `array` object provides an efficient way to store array-based data, but NumPy adds efficient operations on that data. Remember that unlike Python lists, NumPy arrays are constrained to contain a single type. If the types do not match, NumPy will upcast if possible. If we want to explicitly set the data type of the resulting array, we can use the `dtype` keyword. Finally, unlike Python lists, NumPy arrays can be explicitly multidimensional.

In [33]:
#integer array
np.array([1,4,2,5,3])

array([1, 4, 2, 5, 3])

In [34]:
#Numpy will upcast if types don't match
np.array([3.14,4,2,3])

array([3.14, 4.  , 2.  , 3.  ])

In [35]:
np.array([1,2,3,4],dtype='float32')

array([1., 2., 3., 4.], dtype=float32)

In [36]:
#nested lists result in multidimensional arrays
np.array([range(i,i+3) for i in [2,4,6]])

array([[2, 3, 4],
       [4, 5, 6],
       [6, 7, 8]])

### Creating Arrays from Scratch
Especially for larger arrays, it is more efficient to create arrays from scratch using routines built into NumPy. 

In [37]:
# Create a length-10 integer array filled with zeros 
np.zeros(10,dtype=int)

# Create a 3x5 floating-point array filled with 1s
np.ones((3,5),dtype=float)

# Create a 3x5 array filled with 3.14
np.full((3,5),3.14)

# Create an array filled with a linear sequence 
# Starting at 0, ending at 20, stepping by 2
# This is similar to the built in range() function
np.arange(0,20,2)

# Create an array of five values evenly spaced between 0 and 1
np.linspace(0,1,5)

# Create a 3x3 array of uniformly distributed random values between 0 and 1
np.random.random((3,3))

# Create a 3x3 array of normally distributed random values with mean 0 and standard deviation of 1
np.random.normal(0,1,(3,3))

# Create a 3x3 array of random integers in the half-open interval [0, 10)
np.random.randint(0,10,(3,3))

# Create a 3x3 identity matrix
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

### The Basics of NumPy Arrays 
Data manipulation in Python is nearly synonymous with NumPy array manipulation: even newer tools like Pandas are built around the NumPy array. 

Basic array manipulation: 
- **attributes of arrays** - determining the size, shape, memory consumption, and data types of arrays
- **indexing of arrays** - getting and setting the value of individual array elements
    - in a one dimensional array, you can access the ith value (counting from zero) by specifying the desired index in square brackets, just as with Python lists
    - to index from the end of the array you can use negative indices
    - in a multidimensional array, you can access items using a comma-separated tuple of indices
- **slicing of arrays** - getting and setting smaller subarrays within a larger array
    - Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the slice notation, marked by the colon `:` character. 
    - `x[start:stop:step]`
        - if any of these are unspecified, they default to the values start=0, stop=size of dimension, step=1
        - a potentially confusing case is when the step value is negative: the defaults for start and stop are then swapped, which gives a convenient way to reverse an array (`x[::-1]`)
    - multidimensional slices work in the same way, with multiple slices separated by commas
    - to access a single row or column of an array you can combine indexing and slicing using an empty slice marked by a `:`
    - one important thing to know about array slices is that they return views rather than copies of the array data. 
        - this default behavior is actually quite useful: it means that when we work with large datasets, we can access and process pieces of these datasets without the need to copy the underlying data buffer
            - this is one area in which NumPy array slicing differs from Python list slicing: in lists slices will be copies 
- **reshaping of arrays** - changing the shape of a given array
- **joining and splitting of arrays** - combining multiple arrays into one, and splitting one array into many
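
The points in the list above that are easiest to get wrong (array attributes, negative indices and steps, and views versus copies) can be sketched as follows. The arrays here are hypothetical examples built with `np.arange` rather than the random arrays used elsewhere in these notes, so the results are reproducible.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)   # hypothetical 3x4 example array

# Attributes: dimensions, shape, total size, data type, and memory use
print(x.ndim, x.shape, x.size)    # 2 (3, 4) 12
print(x.dtype, x.itemsize, x.nbytes)

# Negative index combined with a comma-separated tuple of indices
last_of_first_row = x[0, -1]      # 3

# Negative step: start and stop defaults swap, so [::-1] reverses the array
v = np.arange(10)
reversed_v = v[::-1]
from_five_back = v[5::-2]         # [5, 3, 1]

# Array slices are views: writing to the slice changes the original
view = v[:3]
view[0] = 100                     # v[0] is now 100 as well

# List slices are copies: writing to the slice leaves the original alone
items = list(range(10))
piece = items[:3]
piece[0] = 100                    # items[0] is still 0
```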

In [38]:
x1 = np.random.randint(10,size=6) # one dimensional array
x2 = np.random.randint(10,size=(3,4)) # two-dimensional array
x3 = np.random.randint(10, size = (3,4,5)) # three-dimensional array

In [39]:
x1[0]
x2[0,0]

3

You can also modify values using any of the above index notation. Keep in mind that unlike Python lists, NumPy arrays have a fixed type. This means that, for example, if you attempt to insert a floating-point value into an integer array, the value will be silently truncated. 

In [40]:
x2[0,0]=12
x1[0]=3.144444 # will be truncated
x1

array([3, 2, 7, 8, 1, 6])

In [41]:
x = np.arange(10)
x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [42]:
x[:5] #first five elements
x[5:] #elements from index 5 onward
x[4:7] #middle subarray
x[::2] #every other element
x[1::2] #every other element starting at index 1

array([1, 3, 5, 7, 9])

In [43]:
x2

array([[12,  2,  4,  8],
       [ 3,  3,  4,  2],
       [ 4,  7,  4,  2]])

In [44]:
x2[:2,:3] #two rows and three columns
x2[:3,::2] #all rows, every other column

#subarray dimensions can even be reversed together
x2[::-1,::-1]

array([[ 2,  4,  7,  4],
       [ 2,  4,  3,  3],
       [ 8,  4,  2, 12]])

In [45]:
print(x2[:,0]) # first column of x2
print(x2[0,:]) # first row of x2

[12  3  4]
[12  2  4  8]


In [46]:
print(x2)

[[12  2  4  8]
 [ 3  3  4  2]
 [ 4  7  4  2]]


In [47]:
# Let's extract a subarray from this: 
x2_sub = x2[:2,:2]
print(x2_sub)

[[12  2]
 [ 3  3]]


In [48]:
#Now if we modify this subarray we will see that the original array is changed!
x2_sub[0,0]=99
print(x2)

[[99  2  4  8]
 [ 3  3  4  2]
 [ 4  7  4  2]]


### Creating Copies of Arrays
Despite the convenience of array views, it is sometimes useful to instead explicitly copy the data within an array or a subarray. This is most easily done with the `copy()` method. 

In [50]:
x2_sub_copy=x2[:2,:2].copy() #if we modify this subarray, the original array is not touched
x2

array([[99,  2,  4,  8],
       [ 3,  3,  4,  2],
       [ 4,  7,  4,  2]])
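
The cell above creates the copy but never modifies it. A minimal sketch of the difference, using a hypothetical array named `original` rather than `x2`:

```python
import numpy as np

original = np.arange(12).reshape(3, 4)

sub_view = original[:2, :2]          # a view onto the same data buffer
sub_copy = original[:2, :2].copy()   # an independent copy

sub_copy[0, 0] = 42                  # does not affect `original`
unchanged = original[0, 0]           # still 0

sub_view[0, 0] = 99                  # writes through to `original`
changed = original[0, 0]             # now 99
```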

### Reshaping Arrays

- `reshape()`: note that for this to work, the size of the initial array must match the size of the reshaped array. Whenever possible, the `reshape()` method will use a no-copy view of the initial array, but this is not always the case (for example, when the memory buffer is noncontiguous). 

- Another common reshaping pattern is the conversion of a one-dimensional array into a two-dimensional row or column matrix. You can do this with the `reshape()` method or by making use of the `newaxis` keyword within a slice operation. 
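
Whether `reshape()` returned a view or a copy can be checked with `np.shares_memory`. The sketch below uses a transposed array as one example of a case where a copy is needed; it is an illustration under that assumption, not an exhaustive rule.

```python
import numpy as np

a = np.arange(6)

# Contiguous data: reshape can return a no-copy view.
b = a.reshape(2, 3)
b_is_view = np.shares_memory(a, b)        # True

# The transpose is not contiguous in row-major order,
# so flattening it with reshape requires a copy.
t = b.T
flat = t.reshape(6)
flat_is_view = np.shares_memory(t, flat)  # False

# Row and column vectors from a one-dimensional array
row = a[np.newaxis, :]    # shape (1, 6)
col = a[:, np.newaxis]    # shape (6, 1)
```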

In [55]:
grid = np.arange(1,10).reshape((3,3))
grid

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [61]:
x=np.array([1,2,3])

# row vector via reshape
x.reshape(1,3)

# row vector via newaxis
x[np.newaxis,:]

array([[1, 2, 3]])

In [63]:
# column vector via reshape
x.reshape(3,1)

# column vector via newaxis
x[:,np.newaxis]

array([[1],
       [2],
       [3]])

### Array Concatenation and Splitting 
It is possible to combine multiple arrays into one and, conversely, to split a single array into multiple arrays. 

Concatenation
- `np.concatenate`
- `np.vstack`
- `np.hstack`

Splitting
- `np.split`
- `np.hsplit`
- `np.vsplit`
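
A short sketch of the splitting functions listed above, with the stacking functions used to put the pieces back together. The variable names are chosen for illustration only.

```python
import numpy as np

x = np.array([1, 2, 3, 99, 99, 3, 2, 1])

# Split points [3, 5] yield three subarrays: x[:3], x[3:5], x[5:]
first, middle, last = np.split(x, [3, 5])

grid = np.arange(16).reshape((4, 4))

# vsplit cuts between rows, hsplit cuts between columns
upper, lower = np.vsplit(grid, [2])
left, right = np.hsplit(grid, [2])

# vstack and hstack reassemble the pieces
rebuilt_rows = np.vstack([upper, lower])
rebuilt_cols = np.hstack([left, right])
```

Note that N split points always produce N + 1 subarrays.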

In [70]:
x = np.array([1,2,3])
y = np.array([3,2,1])
z = np.array([99,99,99])

np.concatenate([x,y])
np.concatenate([x,y,z])

# np.concatenate can also be used for two-dimensional arrays
grid = np.array([[1,2,3],[3,4,5]])
grid

array([[1, 2, 3],
       [3, 4, 5]])

In [71]:
# Concatenate along the first axis
np.concatenate([grid,grid])

array([[1, 2, 3],
       [3, 4, 5],
       [1, 2, 3],
       [3, 4, 5]])

In [72]:
# Concatenate along the second axis 
np.concatenate([grid,grid],axis=1)

array([[1, 2, 3, 1, 2, 3],
       [3, 4, 5, 3, 4, 5]])

In [None]:
# Vertically stack the arrays