# Lecture 6: Tables and Matrices in numpy and pandas

## Jupyter

But first, an introduction to Jupyter notebooks.
They have markdown cells, like this one.
To render a markdown cell, or to execute a code cell,
you can press the "play" button at the top,
press shift-enter to execute and move to the next cell,
ctrl-enter to execute and stay in place, or
alt-enter to execute and insert a new cell below.

Try all the buttons at the top to figure out what they do.
Click on the Help tab and try the User Interface tour.

See: http://jupyter-notebook.readthedocs.io/en/latest/notebook.html
for more information.


In [1]:
# we can also put comments into code cells, like this.
# we typically begin a session by importing the modules we will want, and giving them convenient names.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib notebook

## Random numbers, random vectors, random matrices, and plotting

In [3]:
# playing with small examples is the best way to learn how things work.
# so, let's play with some random vectors and matrices
# before we start, I will set the seed of the random number generator.
# This will make sure that our results are reproducible.
np.random.seed(1)
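
To see what setting the seed buys us, here is a minimal sketch: re-seeding with the same value replays the same sequence of random numbers. The names `first` and `second` are only for illustration.

```python
import numpy as np

# Seed the generator, then draw three uniform numbers.
np.random.seed(1)
first = np.random.rand(3)

# Re-seeding with the same value restarts the generator at the same point,
# so the next draw repeats the first one exactly.
np.random.seed(1)
second = np.random.rand(3)

print(np.array_equal(first, second))  # True
```

Without the second call to `np.random.seed`, the two draws would be different.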

In [4]:
# note that we will use random numbers from numpy, rather than the random module in python.
x = np.random.rand(3)
x

array([  4.17022005e-01,   7.20324493e-01,   1.14374817e-04])

In [5]:
y = np.random.rand(3)
y

array([ 0.30233257,  0.14675589,  0.09233859])

In [6]:
z = x + y
z

array([ 0.71935458,  0.86708038,  0.09245297])

In [7]:
2*z

array([ 1.43870915,  1.73416077,  0.18490594])
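
Note that `+` and `*` act on numpy arrays entry by entry, which is not how plain Python lists behave. A small sketch with made-up values (the names `a`, `b`, `s`, and `d` are just for this example):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

s = a + b   # adds matching entries
d = 2 * a   # multiplies every entry by 2
print(s)    # [11. 22. 33.]
print(d)    # [2. 4. 6.]

# Plain Python lists behave differently: + concatenates and * repeats.
print([1, 2, 3] + [10, 20, 30])  # [1, 2, 3, 10, 20, 30]
print(2 * [1, 2, 3])             # [1, 2, 3, 1, 2, 3]
```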

The most commonly used distributions of random numbers are uniform(0,1) and normal (also called Gaussian).  `rand` generates uniform random numbers between 0 and 1.
If you ever take a probability class, you will see normal random variables.
They can be both positive and negative.  Their histogram is the famous bell curve.
They have many useful properties that we will sometimes exploit.
We generate them with `randn`, which draws from the standard normal distribution (mean 0, variance 1).  In the following, I give three ways of visualizing these two distributions.
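
Before looking at pictures, a quick numerical sanity check can help. This is only a sketch: the seed and the sample size of 100000 are arbitrary choices of mine, and the sample statistics will only be close to the theoretical values, not equal to them.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, chosen only for this sketch
u_demo = np.random.rand(100000)    # uniform on [0, 1)
n_demo = np.random.randn(100000)   # standard normal

# Uniform samples never leave [0, 1), and their average is near 1/2.
print(u_demo.min() >= 0.0, u_demo.max() < 1.0)  # True True
print(u_demo.mean())

# Normal samples have mean near 0 and standard deviation near 1,
# and roughly half of them are negative.
print(n_demo.mean(), n_demo.std())
print((n_demo < 0).mean())
```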

In [8]:
# we generate 1000 numbers from the uniform and normal distributions.
u = np.random.rand(1000)
n = np.random.randn(1000)

In [9]:
# By default, python does not print these out.  
# But, if you just type the name, it will show what is in the variable.
# This could be a pain, except that Jupyter creates a little window for each output.
u

array([  1.86260211e-01,   3.45560727e-01,   3.96767474e-01,
         5.38816734e-01,   4.19194514e-01,   6.85219500e-01,
         2.04452250e-01,   8.78117436e-01,   2.73875932e-02,
         6.70467510e-01,   4.17304802e-01,   5.58689828e-01,
         1.40386939e-01,   1.98101489e-01,   8.00744569e-01,
         9.68261576e-01,   3.13424178e-01,   6.92322616e-01,
         8.76389152e-01,   8.94606664e-01,   8.50442114e-02,
         3.90547832e-02,   1.69830420e-01,   8.78142503e-01,
         9.83468338e-02,   4.21107625e-01,   9.57889530e-01,
         5.33165285e-01,   6.91877114e-01,   3.15515631e-01,
         6.86500928e-01,   8.34625672e-01,   1.82882773e-02,
         7.50144315e-01,   9.88861089e-01,   7.48165654e-01,
         2.80443992e-01,   7.89279328e-01,   1.03226007e-01,
         4.47893526e-01,   9.08595503e-01,   2.93614148e-01,
         2.87775339e-01,   1.30028572e-01,   1.93669579e-02,
         6.78835533e-01,   2.11628116e-01,   2.65546659e-01,
         4.91573159e-01,

In [10]:
# And, double-clicking on the left collapses it.
n

array([ -9.24323185e-02,  -2.37875265e-01,  -7.55662765e-01,
         1.85143789e+00,   2.09096677e-01,   1.55501599e+00,
        -5.69148654e-01,  -1.06179676e+00,   1.32247779e-01,
        -5.63236604e-01,   2.39014596e+00,   2.45422849e-01,
         1.15259914e+00,  -2.24235772e-01,  -3.26061306e-01,
        -3.09114176e-02,   3.55717262e-01,   8.49586845e-01,
        -1.22154015e-01,  -6.80851574e-01,  -1.06787658e+00,
        -7.66793627e-02,   5.72962726e-01,   4.57947076e-01,
        -1.78175491e-02,  -6.00138799e-01,   1.46765263e-01,
         5.71804879e-01,  -3.68176565e-02,   1.12368489e-01,
        -1.50504326e-01,   9.15499268e-01,  -4.38200267e-01,
         1.85535621e-01,   3.94428030e-01,   7.25522558e-01,
         1.49588477e+00,   6.75453809e-01,   5.99213235e-01,
        -1.47023709e+00,   6.06403944e-01,   2.29371761e+00,
        -8.30010986e-01,  -1.01951985e+00,  -2.14653842e-01,
         1.02124813e+00,   5.24750492e-01,  -4.77124206e-01,
        -3.59901817e-02,

In [11]:
# Let's draw some histograms.
plt.hist(u)

(array([  95.,  105.,   94.,   92.,  104.,  107.,  101.,  101.,   98.,  103.]),
 array([  4.02024891e-04,   1.00094107e-01,   1.99786190e-01,
          2.99478273e-01,   3.99170355e-01,   4.98862438e-01,
          5.98554520e-01,   6.98246603e-01,   7.97938685e-01,
          8.97630768e-01,   9.97322850e-01]),
 <a list of 10 Patch objects>)

In [12]:
plt.hist(n)

(array([  11.,   22.,   81.,  167.,  248.,  242.,  138.,   71.,   17.,    3.]),
 array([-3.15335745, -2.49475536, -1.83615327, -1.17755119, -0.5189491 ,
         0.13965299,  0.79825508,  1.45685717,  2.11545926,  2.77406134,
         3.43266343]),
 <a list of 10 Patch objects>)

Where did my other histogram go?!
The answer is that matplotlib, when we start it with `%matplotlib notebook`,
draws each new plot on top of the previous one unless you tell it not to.
To get a new plot, call `plt.figure()` first.
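
Here is a sketch of two ways to keep the histograms apart. The first is the `plt.figure()` approach just described. The second, `plt.subplots`, is an alternative that is not used elsewhere in this lecture; it puts both histograms side by side in one figure. The variable names are only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)
u_demo = np.random.rand(1000)
n_demo = np.random.randn(1000)

# Option 1: start a fresh figure before each plot.
fig_u = plt.figure()
plt.hist(u_demo)

fig_n = plt.figure()
plt.hist(n_demo)

# Option 2: one figure holding two side-by-side axes.
fig, (ax_u, ax_n) = plt.subplots(1, 2)
ax_u.hist(u_demo)
ax_n.hist(n_demo)
```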


In [13]:
plt.figure()
plt.hist(u)

(array([  95.,  105.,   94.,   92.,  104.,  107.,  101.,  101.,   98.,  103.]),
 array([  4.02024891e-04,   1.00094107e-01,   1.99786190e-01,
          2.99478273e-01,   3.99170355e-01,   4.98862438e-01,
          5.98554520e-01,   6.98246603e-01,   7.97938685e-01,
          8.97630768e-01,   9.97322850e-01]),
 <a list of 10 Patch objects>)

In [14]:
plt.figure()
plt.hist(n)

(array([  11.,   22.,   81.,  167.,  248.,  242.,  138.,   71.,   17.,    3.]),
 array([-3.15335745, -2.49475536, -1.83615327, -1.17755119, -0.5189491 ,
         0.13965299,  0.79825508,  1.45685717,  2.11545926,  2.77406134,
         3.43266343]),
 <a list of 10 Patch objects>)

In [15]:
# whenever I want to understand a distribution of numbers, I sort it and plot the result.
# this seems to be a symptom of being a matlab user.
plt.figure()
plt.plot(sorted(u))

[<matplotlib.lines.Line2D at 0x1189cccd0>]

In [16]:
plt.figure()
plt.plot(sorted(n))

[<matplotlib.lines.Line2D at 0x118df8a50>]
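
One way to read these sorted plots: the value at horizontal position k is the k-th smallest sample, so the curve traces out the empirical quantiles of the data. The sketch below checks this against `np.quantile`, which is my addition and not something the lecture uses; the index `k = 250` is an arbitrary choice.

```python
import numpy as np

np.random.seed(1)
u_demo = np.random.rand(1000)
sorted_u = np.sort(u_demo)   # same values as sorted(u_demo), but as an array

# The middle of the sorted curve sits near the median,
# which for uniform(0,1) samples is near 0.5.
print(sorted_u[500])

# Position k of the sorted array matches the k/(len-1) quantile
# (np.quantile interpolates linearly between sorted values by default).
k = 250
q = np.quantile(u_demo, k / (len(u_demo) - 1))
print(np.isclose(sorted_u[k], q))  # True
```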

In [None]:
# 1000 random points in the plane: each row is one point with uniform(0,1) coordinates.
u = np.random.rand(1000,2)
u

In [None]:
# the same, but with normally distributed coordinates.
n = np.random.randn(1000,2)


In [None]:
# scatter plot: the first column gives the x-coordinates, the second the y-coordinates.
plt.figure()
plt.plot(u[:,0],u[:,1],".")


In [None]:
plt.figure()
plt.plot(n[:,0],n[:,1],".")



In [None]:
# this code will cause an error: n has shape (1000, 2), but norms has shape (1000,)
norms = (n[:,0]**2 + n[:,1]**2)**(0.5)

unit_vecs = n * (1/norms) 

In [None]:
# to fix it, check out the shape of norms
norms.shape

In [None]:
# it is a one-dimensional vector.  We fix this by turning it into a matrix with one column.
norms = norms.reshape(1000,1)
norms.shape
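
Why does the reshape help? numpy lines up shapes starting from the last dimension. With shapes (1000, 2) and (1000,), it compares 2 against 1000 and gives up. With (1000, 2) and (1000, 1), it compares 2 against 1 (a dimension of size 1 is stretched to fit) and then 1000 against 1000. Here is a sketch on a tiny made-up example with 4 rows; `pts`, `lengths`, `col`, and `scaled` are names I chose for illustration.

```python
import numpy as np

pts = np.array([[3.0, 4.0],
                [6.0, 8.0],
                [0.0, 2.0],
                [5.0, 12.0]])                  # shape (4, 2)
lengths = (pts[:, 0]**2 + pts[:, 1]**2)**0.5   # shape (4,)

try:
    pts * (1 / lengths)          # last dimensions are 2 and 4: no match
    failed = False
except ValueError as err:
    failed = True
    print("broadcast error:", err)

col = lengths.reshape(4, 1)      # shape (4, 1): one length per row
scaled = pts * (1 / col)         # each row is divided by its own length
print(scaled[0])                 # [0.6 0.8]
```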

In [None]:
unit_vecs = n * (1/norms) 

In [None]:
plt.figure()
plt.plot(unit_vecs[:,0],unit_vecs[:,1],".")
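
Every row of `unit_vecs` has length 1, so the points in this plot should all fall on the unit circle. As an alternative to the reshape used above, here is a sketch using `np.linalg.norm` with `keepdims=True`, which returns the norms already shaped as a column. This is a different route from the one taken in the lecture, and the `_demo` names are only for illustration.

```python
import numpy as np

np.random.seed(1)
n_demo = np.random.randn(1000, 2)

# keepdims=True keeps the result as shape (1000, 1), so no reshape is needed.
norms_demo = np.linalg.norm(n_demo, axis=1, keepdims=True)
unit_demo = n_demo / norms_demo

print(norms_demo.shape)                                     # (1000, 1)
# Every row of unit_demo should now have length 1.
print(np.allclose(np.linalg.norm(unit_demo, axis=1), 1.0))  # True
```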