In [1]:
# REPL -- read, eval, print loop 
# Jupyter is based on a browser

# Jupyter's server runs on your computer (or somewhere else)
# Jupyter's client runs in your browser

# you get the illusion of running Python inside of your browser

# Course agenda

1. Jupyter (Monday)
2. NumPy  (Monday)
    - Arrays
    - Data types (dtypes)
    - Operations with NumPy arrays
    - Working with files (external data)
    - Boolean indexing
    - Searching, sorting, retrieving data
    - Plotting with Matplotlib
3. Pandas (Tuesday-Thursday)
    - Series and data frames
    - Working with data (setting, retrieving) in Pandas
    - Importing and exporting data (in various formats)
    - Filtering by rows and columns
    - Working with string data (text)
    - Indexing (regular indexes and multi-indexes)
    - Pivot tables
    - Grouping
    - Sorting
    - Joining
    - Categories
    - Working with dates and times
    - Plotting and visualization using Pandas

In [2]:
# Input in Jupyter is put into "cells"
# Each cell can contain either Python code (like this) or Markdown (like above, which creates HTML).

# If I have Python code, I can just execute it with shift+enter.

x = 10
y = 20

print(x+y)

30


In [3]:
# the entire Python backend running on my computer is one Python process
# variables and functions stick around from one cell to another

# so even though I'm in a new cell, I can still say:

print(x+y)

30


In [4]:
# A cell may contain any number of lines of Python code
# If the final line is an expression (i.e., it gives us a value back)
# then we'll see its value, even without printing

x+y

30

In [5]:
x

10

In [6]:
y

20

# Modes in Jupyter

Jupyter actually has two different "modes" -- meaning, what happens when you type.

- Edit mode (green frame around the cell, press ENTER or click inside of the cell to activate) is what you use to enter text or code.  It's what I'm using right now.
- Command mode (blue frame around the cell, press ESC or click to the left of the cell to activate) is what you use to give Jupyter commands.  

When you're in command mode, you can type many one-character commands to Jupyter:

- `c` -- copy the current cell
- `v` -- paste the most recently copied (or cut) cell below the current one
- `x` -- cut the current cell
- `h` -- get help about command mode
- `a` -- add a new, empty cell *above* the current one
- `b` -- add a new, empty cell *below* the current one
- `m` -- set the cell type to Markdown (like now) for easy-to-write HTML
- `y` -- set the cell type to code, for writing Python
- `r` -- set the cell type to "raw," meaning just text that isn't marked up or executed
- `z` -- undo the latest action

In either mode, we can use shift+Enter to execute the current cell.

# Installing and starting Jupyter

1. Download and install it with `pip install -U jupyter`.  This is a command-line command, not a Python command.
2. At the command line, type `jupyter notebook`.
3. You'll see a "new" menu on the top right, and you should choose "Python 3 notebook."
4. You can rename your notebook by clicking on the title and changing it.  (View->Header, and click on the title, to rename).

# Exercise: Starting with Jupyter

1. Start up a Jupyter server on your computer. (If you haven't yet installed it, now is a good time!)
2. Start a new notebook with Jupyter.
3. Rename it to reflect today's date
4. Write some simple Python code, and execute it in Jupyter.

# Magic commands

In Jupyter, you can type whatever Python code you want, and execute it.  In addition, Jupyter has its own "magic commands," all of which start with `%` (which isn't legal at the start of a Python statement, so Jupyter can notice it).



In [7]:
%pwd

'/Users/reuven/Courses/Current/Cisco-2022-06June-06-analytics'

In [8]:
%ls 

Cisco-2022-06June-06-analytics.ipynb


In [9]:
%ls /etc/*.conf

/etc/AFP.conf	       /etc/newsyslog.conf	    /etc/resolv.conf@
/etc/asl.conf	       /etc/nfs.conf		    /etc/rtadvd.conf
/etc/autofs.conf       /etc/notify.conf		    /etc/slpsa.conf
/etc/kern_loader.conf  /etc/ntp.conf		    /etc/syslog.conf
/etc/launchd.conf      /etc/ntp_opendirectory.conf
/etc/man.conf	       /etc/pf.conf


In [10]:
# get a list of magic commands
%magic

In [11]:
%autosave 30

Autosaving every 30 seconds


# Shell commands

You can execute a program in your computer's shell (either the Unix shell or Windows CMD) by putting `!` at the start of a line.

In [12]:
!ls /etc/*.conf

/etc/AFP.conf	       /etc/newsyslog.conf	    /etc/resolv.conf
/etc/asl.conf	       /etc/nfs.conf		    /etc/rtadvd.conf
/etc/autofs.conf       /etc/notify.conf		    /etc/slpsa.conf
/etc/kern_loader.conf  /etc/ntp.conf		    /etc/syslog.conf
/etc/launchd.conf      /etc/ntp_opendirectory.conf
/etc/man.conf	       /etc/pf.conf


In [14]:
# I will, very often, use the "cat" and "head" commands in Unix.

!head -12 /etc/passwd

##
# User Database
# 
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by
# Open Directory.
#
# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh


# Examining our environment

Jupyter defines `In`, a list of all inputs we have entered, and `Out`, a dict of all return values we've received.

In [15]:
2+2

4

In [16]:
# The %whos magic command shows me all variables and their values

%whos

Variable   Type    Data/Info
----------------------------
x          int     10
y          int     20


In [17]:
s = 'abcd'
d = {'a':1, 'b':2}

def hello(name):
    return f'Hello, {name}'

In [18]:
%whos

Variable   Type        Data/Info
--------------------------------
d          dict        n=2
hello      function    <function hello at 0x10f6711b0>
s          str         abcd
x          int         10
y          int         20


In [19]:
# if I want to see a function's source code, I can put ?? after its name
hello??

# NumPy

In [20]:
# If I have an integer in Python, how many bytes does it take up?
# if my ints are 64 bits, then they'll be 8 bytes

import sys

s = 1234
sys.getsizeof(s)

28

In [22]:
s = 12345678901234567890
sys.getsizeof(s)

36

# What is NumPy?

NumPy gives us a C-language array of numbers, with a thin wrapper of Python around it. This allows us to benefit from the best of both worlds -- we can get the speed, small size, and efficiency of C, but still work in Python.
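
To make the "small size" claim concrete, here is a rough sketch that compares the memory used by a Python list of 1,000 integers with a NumPy array holding the same values. The names `values`, `arr`, `list_bytes`, and `array_bytes` are just for illustration, and the list figure will vary with your Python version and platform.

```python
import sys
import numpy as np

# 1,000 integers stored two ways
values = list(range(1000))
arr = np.array(values, dtype=np.int64)

# A list stores references to separate int objects, so count both
# the list itself and every int object it points to.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# An int64 array stores 8 bytes per element in one contiguous block.
array_bytes = arr.nbytes

print(list_bytes)    # varies by Python version and platform
print(array_bytes)   # 8000
```

The array's data is exactly 8 bytes per element, while each Python int carries its own object overhead on top of its value.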

NumPy basically defines one thing, namely a new kind of array. The array type that's defined here is known as `ndarray`, short for "n-dimensional array."

We're going to use 1- and 2-dimensional NumPy arrays, nothing too wacky or weird.

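Normally, in Python, to create an object of type X, you call X and get back a new object. With NumPy, we don't call `ndarray` directly; instead, we call `np.array` and pass it a list of values.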

In [23]:
# Load NumPy
import numpy as np

# create a new array of integers
a = np.array([10, 20, 30, 40, 50])
a



array([10, 20, 30, 40, 50])

In [24]:
type(a)

numpy.ndarray

In [25]:
# it often looks like NumPy arrays are just like lists
a[0]

10

In [27]:
a[-1]  # final element

50

In [28]:
a[2:4]  # slice

array([30, 40])

In [29]:
# NumPy arrays are mutable
a[3] = 999
a

array([ 10,  20,  30, 999,  50])

In [30]:
# run the len function on it
len(a)

5

In [31]:
# NumPy has a bunch of methods
a.sum()

1109

In [32]:
a.mean()

221.8

In [33]:
a.std()  

388.82510207032675

In [34]:
a.min()

10

In [35]:
a.max()

999

# A few other ways to create NumPy arrays



In [36]:
np.arange(10)   # new array with 10 elements starting at 0

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [37]:
np.arange(10, 20)   # new array with 10 elements starting at 10, ending (before) 20

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [38]:
np.arange(10, 20, 3)   # new array starting at 10, ending before 20, step size 3 (so only 4 elements)

array([10, 13, 16, 19])

In [39]:
# get random integers
np.random.randint(0, 100, 5)   # 5 integers, each pulled randomly from 0-99 (the upper bound, 100, is excluded)

array([62, 89, 21, 83, 71])

In [41]:
# get random floats -- each number in the array is from 0 up to (but not including) 1
np.random.rand(10)

array([0.89531392, 0.52862809, 0.43669085, 0.02876322, 0.38268141,
       0.95171115, 0.25083314, 0.07558865, 0.95633308, 0.88245151])
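
Both `np.arange` and `np.random.randint` leave out their upper bound, just like Python's built-in `range`. Here's a small sketch to check that for ourselves; the variable names `steps` and `draws` are only for illustration.

```python
import numpy as np

# arange stops *before* the end value, like Python's range
steps = np.arange(10, 20, 3)
print(steps)        # [10 13 16 19]
print(len(steps))   # 4

# randint also excludes its upper bound, so we get integers from 0 through 99
draws = np.random.randint(0, 100, 10_000)
print(draws.min() >= 0)    # True
print(draws.max() <= 99)   # True
```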

# Exercise: Simple NumPy arrays

1. Create a NumPy array with three elements -- the year, the month, and the day of your birthday.
2. Retrieve the year. Retrieve the month.
3. Replace the year with the current year.
4. Create an array with all of the numbers from 567 to 890, skipping by 3s. What is the number at index 7? What is the mean of the entire array?
5. Create an array with 500 integers, randomly chosen from 0-100.  What is the mean? What is the standard deviation?
6. Create an array of 5 random ints, from 0 to 100.  What is the mean, and what is the standard deviation? How are these different from #5?

In [42]:
a = np.array([1970, 7, 14])

In [43]:
type(a)

numpy.ndarray

In [44]:
a

array([1970,    7,   14])

In [45]:
a[0]

1970

In [46]:
a[1]

7

In [47]:
a[2]

14

In [48]:
a[0] = 2022
a

array([2022,    7,   14])

In [50]:
# create an array with all of the numbers from 567 to 890, step size of 3
a = np.arange(567, 890, 3)
a

array([567, 570, 573, 576, 579, 582, 585, 588, 591, 594, 597, 600, 603,
       606, 609, 612, 615, 618, 621, 624, 627, 630, 633, 636, 639, 642,
       645, 648, 651, 654, 657, 660, 663, 666, 669, 672, 675, 678, 681,
       684, 687, 690, 693, 696, 699, 702, 705, 708, 711, 714, 717, 720,
       723, 726, 729, 732, 735, 738, 741, 744, 747, 750, 753, 756, 759,
       762, 765, 768, 771, 774, 777, 780, 783, 786, 789, 792, 795, 798,
       801, 804, 807, 810, 813, 816, 819, 822, 825, 828, 831, 834, 837,
       840, 843, 846, 849, 852, 855, 858, 861, 864, 867, 870, 873, 876,
       879, 882, 885, 888])

In [51]:
a[7]

588

In [52]:
a.mean()

727.5

In [54]:
a.sum() / len(a)

727.5

In [55]:
a = np.random.randint(0, 100, 500)
a

array([41, 52, 68, 28, 25, 27, 95, 89, 37, 60, 63,  3,  3, 57, 65, 73, 26,
       10, 12, 29, 46, 23, 57, 50, 28, 31, 52, 17,  3, 75, 23, 67, 98, 44,
       95, 93, 52, 18, 73, 67,  2, 21, 55, 89, 98, 55, 66, 28, 96, 52, 33,
       96, 85, 52, 24, 50, 15, 82, 32, 36, 79, 93, 48, 27, 46, 80, 48,  0,
       38, 16, 70, 97, 83, 28, 63,  7, 43, 71, 49, 51, 25, 64, 66, 54, 71,
       77, 16, 86, 57, 90, 65,  9, 21, 21, 50, 79, 96, 11, 70, 77, 44, 86,
       43, 29, 50, 59, 24, 89, 67, 69, 56, 90,  6, 39, 84, 97, 79, 29, 36,
       65, 72, 49, 17, 61, 65, 56, 99, 97,  4, 97, 63, 59, 40, 44,  4, 91,
       93, 60, 89, 69,  8, 97, 88, 55, 92, 71, 29,  5, 16, 60, 52, 70, 71,
       72, 76, 38, 84, 28, 39,  1,  7, 34, 62, 52, 22, 92, 43, 57, 61, 35,
       65, 51, 70, 21, 38,  0,  2, 16, 82, 57, 14, 66, 56, 29, 97, 48, 41,
       98, 66, 97,  7, 89, 87, 27, 33, 47, 31, 35, 89, 85, 95,  6, 38, 72,
       68, 66, 64, 69, 70, 49, 12, 83, 84, 26, 24, 44, 25, 95, 69, 26, 83,
       66, 99, 49, 57, 13

In [56]:
a.mean()

52.242

In [57]:
a.std()

29.026116447089507

In [58]:
a = np.random.randint(0, 100, 5)
a

array([49, 56, 16, 50, 76])

In [59]:
a.mean()

49.4

In [60]:
a.std()

19.324595726689857

In [61]:
a = np.random.randint(0, 100, 5000)
a.mean()

49.724

In [62]:
a = np.random.randint(0, 100, 50000)
a.mean()

49.53526

In [63]:
a = np.random.randint(0, 100, 5000000)
a.mean()

49.524672

In [64]:
a = np.random.randint(0, 100, 500000000)
a.mean()

49.500133526
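
Why do the means above creep toward 49.5 as the arrays get bigger? Because `randint(0, 100, ...)` picks from the integers 0 through 99, each equally likely, and the mean of those 100 values is 49.5. The sketch below (with illustrative names `possible` and `expected_mean`) computes that value and shows how far a few sample means land from it.

```python
import numpy as np

# Every integer that randint(0, 100, ...) can return, each equally likely
possible = np.arange(0, 100)
expected_mean = possible.mean()
print(expected_mean)   # 49.5

# Larger samples tend to land closer to the expected mean
np.random.seed(0)
for size in (5, 500, 50_000):
    sample = np.random.randint(0, 100, size)
    print(size, abs(sample.mean() - expected_mean))
```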

# Where are NumPy arrays different from lists?

In [65]:
mylist = [10, 20, 30]
mylist + mylist  # can I add a list to a list?

[10, 20, 30, 10, 20, 30]

In [67]:
a = np.array([10, 20, 30])

# NumPy arrays can be added element by element, matching up their indexes

a + a   # what will I get back now?

array([20, 40, 60])

In [68]:
a = np.array([10, 20, 30])
b = np.array([10, 20, 30, 40])

a + b

ValueError: operands could not be broadcast together with shapes (3,) (4,) 

In [69]:
a * b

ValueError: operands could not be broadcast together with shapes (3,) (4,) 

# Operations on arrays

We can use any operator on two NumPy arrays of the same length. The operation will be executed on each index. So, given arrays A and B, if we have operator x, NumPy will give us back a new array in which index 0 is `A[0] x B[0]`, and index 1 is `A[1] x B[1]`, and so forth.
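
One way to picture this: an operator between two arrays gives the same values as a loop over matching indexes. The following is only a sketch of the idea (NumPy does the work in C, not with a Python loop), and the names `looped` and `vectorized` are mine.

```python
import numpy as np

a = np.array([10, 20, 30])
b = np.array([100, 200, 300])

# What we would write by hand: pair up the elements and add each pair
looped = [int(x) + int(y) for x, y in zip(a, b)]

# What NumPy lets us write instead
vectorized = a + b

print(looped)               # [110, 220, 330]
print(vectorized.tolist())  # [110, 220, 330]
```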

In [70]:
a = np.array([10, 20, 30])
b = np.array([100, 200, 300])

a + b

array([110, 220, 330])

In [71]:
a - b

array([ -90, -180, -270])

In [72]:
a * b

array([1000, 4000, 9000])

In [73]:
a / b  # this is "truediv", always giving a float

array([0.1, 0.1, 0.1])

In [74]:
a // b  # this is "floordiv", rounding each result down to the nearest integer

array([0, 0, 0])

In [75]:
a % b  # return the remainder

array([10, 20, 30])

In [76]:
a ** b  # exponentiation -- these results are far too big for int64, so they wrap around (here, to 0)

array([0, 0, 0])
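
Those zeros are not the real answers. Here is a sketch of what happened, assuming 64-bit integers (as in this notebook): the true results are enormous, and `int64` keeps only the lowest 64 bits of each one. Python's own ints never overflow, so we can use them to check. The name `exact` is just for illustration.

```python
import numpy as np

a = np.array([10, 20, 30], dtype=np.int64)
b = np.array([100, 200, 300], dtype=np.int64)

# Python's own ints never overflow, so these are the exact answers
exact = [int(x) ** int(y) for x, y in zip(a, b)]
print(len(str(exact[0])))   # 101 -- that's how many digits 10 ** 100 has

# int64 keeps only the lowest 64 bits of each result
print([value % 2**64 for value in exact])   # [0, 0, 0]
print(a ** b)                               # [0 0 0]
```

Each exact result happens to be a multiple of 2 ** 64, which is why the wrapped value is 0 in all three cases.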

# What about operations with an array and a scalar value?

We've now seen that if we apply an operator to two arrays, the operation is handled at each index (almost as if we had a `for` loop), one by one.  The return value from index `i` will be based on `A[i]` and `B[i]`.

But if we use a scalar value, then that value is "broadcast" to each of the elements of the array.
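
You can think of broadcasting a scalar as if NumPy first stretched it into an array of the right length. This sketch compares the two; `np.full` builds the stretched-out array explicitly, and the names `threes`, `broadcast`, and `explicit` are only for illustration.

```python
import numpy as np

a = np.array([10, 20, 30])

# The scalar 3 is broadcast to every element
broadcast = a + 3

# The same thing, spelled out with an array of three 3s
threes = np.full(len(a), 3)
explicit = a + threes

print(broadcast)   # [13 23 33]
print(explicit)    # [13 23 33]
```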



In [77]:
a

array([10, 20, 30])

In [78]:
a + 3

array([13, 23, 33])

In [80]:
a  # we didn't change a

array([10, 20, 30])

In [81]:
a - 3

array([ 7, 17, 27])

In [82]:
a * 3

array([30, 60, 90])

In [83]:
a / 3

array([ 3.33333333,  6.66666667, 10.        ])

In [84]:
a // 3

array([ 3,  6, 10])

In [85]:
a % 3

array([1, 2, 0])

In [86]:
a ** 3


array([ 1000,  8000, 27000])

In [88]:
# broadcasting and random floats

# we've seen that we can get an array of random floats between 0 and 1:

# now get random floats between 0 and 10
np.random.rand(10) * 10

array([6.76927194, 5.23189611, 0.41635728, 1.71524373, 9.44062929,
       9.01420103, 0.8278796 , 1.3268744 , 8.19596229, 4.58059884])

# Exercise: Vectorized and broadcast operations

1. Create two arrays, each with 20 random integers from 0 to 1,000.
2. What are the mean and std of the array we get after adding them together?
3. Take the first array, and multiply it by 5. What is the mean of the new array you got?
4. What are the min and max values (using the `min` and `max` array methods) for each of the two arrays?

In [93]:
np.random.seed(0)   # reset the random-number generator, so that we get deterministic values back

a = np.random.randint(0, 1000, 20)
b = np.random.randint(0, 1000, 20)

# c = a+b
# c.mean()

(a+b).mean()

1107.65

In [94]:
(a+b).std()

355.2476987962061

In [96]:
a

array([684, 559, 629, 192, 835, 763, 707, 359,   9, 723, 277, 754, 804,
       599,  70, 472, 600, 396, 314, 705])

In [97]:
a*5

array([3420, 2795, 3145,  960, 4175, 3815, 3535, 1795,   45, 3615, 1385,
       3770, 4020, 2995,  350, 2360, 3000, 1980, 1570, 3525])

In [98]:
(a*5).mean()

2612.75

In [99]:
a.mean()

522.55

In [100]:
a.mean() * 5

2612.75

In [101]:
a.min()

9

In [102]:
a.max()

835

In [103]:
b.min()

72

In [104]:
b.max()

976

In [106]:
np.random.seed(0)

a = np.random.randint(0, 100, 5)
a

array([44, 47, 64, 67, 67])

In [107]:
a[2]

64

In [108]:
a[4]

67

In [109]:
# I can do "fancy indexing" -- giving a list of indexes

a[[2, 4]]  # notice: double square brackets!

array([64, 67])

In [110]:
a[[3, 0, 2, 0, 1, 0]]

array([67, 44, 64, 44, 47, 44])

In [111]:
# I can pass a list of True/False (boolean) values, and get only those elements
# where we have a True value -- this is known as a "mask index" or a "boolean index"

a[[True, False, True, False, True]]

array([44, 64, 67])

In [112]:
a

array([44, 47, 64, 67, 67])

In [113]:
# how can we create a boolean index, if not manually?
# answer: broadcasting comparison operators, which gives us back a boolean array

a + 5

array([49, 52, 69, 72, 72])

In [114]:
a + 200

array([244, 247, 264, 267, 267])

In [115]:
# what if I ask about equality?

a == 67  # broadcast the == operator, and get back a boolean index

array([False, False, False,  True,  True])

In [116]:
# let's use that boolean index as a mask index on our array

a[a == 67]

array([67, 67])

In [117]:
a[a < 50]

array([44, 47])

In [118]:
a[a > 30]

array([44, 47, 64, 67, 67])

In [119]:
a[a > a.mean()]

array([64, 67, 67])

# Exercise: Mask indexes

1. Create a NumPy array with the temperature forecast for your city over the next 10 days.
2. On how many days will the temperature be above the average?
3. On how many days will the temperature be very hot - that is, more than the mean + std?

In [120]:
a = np.array([30, 30, 32, 33, 33, 31, 30, 29, 27, 28])
a

array([30, 30, 32, 33, 33, 31, 30, 29, 27, 28])

In [121]:
a.mean() 

30.3

In [122]:
# when will the temperature be greater than a.mean()?

a > 30.3   # broadcasting > on each element of a, comparing it with 30.3

array([False, False,  True,  True,  True,  True, False, False, False,
       False])

In [129]:
# when will the temperature be greater than a.mean()?

a > a.mean()

array([False, False,  True,  True,  True,  True, False, False, False,
       False])

In [128]:
2 > 5

False

In [124]:
# show me elements of a
# where the element is greater than 30.3

# when I apply a boolean array as an index to a, we get back only those elements of a where it's True

# [30,    30,     32,     33,    33,    31,   30,    29,    27,    28]
# [False, False,  True,  True,  True,  True, False, False, False, False]

a[a > 30.3]   

array([32, 33, 33, 31])

In [125]:
# show me elements of a
# where the element is greater than a.mean()

# (1) calculate a.mean()
# (2) broadcast a > a.mean(), getting an array of True/False values
# (3) apply that boolean array as a mask index onto a, giving us a new NumPy array with those
#  elements of a that are > a.mean()

a[a > a.mean()]

array([32, 33, 33, 31])
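
The exercise asks "on how many days," so we need a count, not just the values. One way to get it (a sketch, using the same temperatures as above, with `temps` as an illustrative name) is to call `sum` on the boolean array, since `True` counts as 1 and `False` as 0.

```python
import numpy as np

temps = np.array([30, 30, 32, 33, 33, 31, 30, 29, 27, 28])

# How many days are above the average?
above_mean = temps > temps.mean()
print(above_mean.sum())    # 4

# How many days are very hot, meaning more than mean + std?
very_hot = temps > temps.mean() + temps.std()
print(very_hot.sum())      # 2
print(temps[very_hot])     # [33 33]
```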

# Next up

1. Boolean indexes and floats
2. Complex comparisons
3. Assignments based on conditions
4. Dtypes

In [130]:
# I'm going to create an array of 30 floats from 0-1,000.
# I want to find all of the numbers < the mean
# Following that, I want to find all of the numbers < the mean - 1 standard deviation

In [131]:
a = np.array([10, 10, 10, 10, 10 ,10, 10, 10])

In [132]:
a.mean()

10.0

In [133]:
a.std()

0.0

In [134]:
a = np.array([6,7,8,9,10,11,12,13,14])

In [135]:
a.mean()

10.0

In [136]:
a.std()

2.581988897471611

In [137]:
# step 1: get 30 random floats from 0-1 (we'll scale them up to 0-1,000 in the next cell)

np.random.seed(0)
a = np.random.rand(30)
a

array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ,
       0.64589411, 0.43758721, 0.891773  , 0.96366276, 0.38344152,
       0.79172504, 0.52889492, 0.56804456, 0.92559664, 0.07103606,
       0.0871293 , 0.0202184 , 0.83261985, 0.77815675, 0.87001215,
       0.97861834, 0.79915856, 0.46147936, 0.78052918, 0.11827443,
       0.63992102, 0.14335329, 0.94466892, 0.52184832, 0.41466194])

In [138]:
np.random.seed(0)
a = np.random.rand(30) * 1000
a

array([548.81350393, 715.18936637, 602.76337607, 544.883183  ,
       423.65479934, 645.89411307, 437.58721126, 891.77300078,
       963.6627605 , 383.44151883, 791.72503808, 528.89491975,
       568.04456109, 925.59663829,  71.0360582 ,  87.1292997 ,
        20.21839744, 832.61984555, 778.15675095, 870.01214825,
       978.61834223, 799.15856422, 461.47936225, 780.52917629,
       118.27442587, 639.92102133, 143.35328741, 944.66891705,
       521.84832175, 414.66193999])

In [140]:
# find numbers < mean

a < a.mean()

array([ True, False, False,  True,  True, False,  True, False, False,
        True, False,  True,  True, False,  True,  True,  True, False,
       False, False, False, False,  True, False,  True, False,  True,
       False,  True,  True])

In [141]:
# let's apply that boolean array as a mask index on a

a[a < a.mean()]

array([548.81350393, 544.883183  , 423.65479934, 437.58721126,
       383.44151883, 528.89491975, 568.04456109,  71.0360582 ,
        87.1292997 ,  20.21839744, 461.47936225, 118.27442587,
       143.35328741, 521.84832175, 414.66193999])

In [144]:
# let's find the *really* small values -- those that are 
# < a.mean() - a.std() 

a[a < a.mean() - a.std()]

array([ 71.0360582 ,  87.1292997 ,  20.21839744, 118.27442587,
       143.35328741])

# Complex comparisons

What if I have an array of 20 integers from 0-100, and I want to find even numbers?

In [145]:
np.random.seed(0)
a = np.random.randint(0, 100, 20)
a

array([44, 47, 64, 67, 67,  9, 83, 21, 36, 87, 70, 88, 88, 12, 58, 65, 39,
       87, 46, 88])

In [146]:
a%2 == 0   # if the remainder from dividing by 2 is 0, the number is even

array([ True, False,  True, False, False, False, False, False,  True,
       False,  True,  True,  True,  True,  True, False, False, False,
        True,  True])

In [149]:
# apply that boolean array as a mask index
a[a%2 == 0] # get all of the even numbers in a

array([44, 64, 36, 70, 88, 88, 12, 58, 46, 88])

In [150]:
a[a%2 == 1]  # get all of the odd numbers in a

array([47, 67, 67,  9, 83, 21, 87, 65, 39, 87])

In [151]:
# I want all of the even numbers in a, that are also < the mean

a[a%2 == 0 and a<a.mean()]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

# Boolean context

`and` (as well as `not` and `or`, and also `if` and `while`) only work with boolean (`True` and `False`) values. If one of them sees a non-boolean value, it first turns that value into a boolean.

In Python, everything is considered `True` in this "boolean context" (i.e., when we force data to be boolean) except for:

- `None`
- 0
- `False`
- anything empty 

NumPy breaks this rule a bit -- it doesn't allow you to call a NumPy array either `True` or `False` in boolean context, unless it contains 0 or 1 elements.

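As a result, don't use `and`, `or`, and `not` with NumPy arrays.  Instead, you'll use the operators `&` (for and), `|` (for or), and `~` (for not).  These operators bind more tightly than comparisons such as `==` and `<`, so wrap each comparison in parentheses.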
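
Here is a short sketch of both points, using the same seeded array as above. It also uses `np.logical_and`, a NumPy function that does the same job as `&` on boolean arrays; that function isn't part of the course material so far, and the names `with_operator`, `with_function`, and `raised` are only for illustration.

```python
import numpy as np

np.random.seed(0)
a = np.random.randint(0, 100, 20)

# Parentheses around each comparison, combined with &
with_operator = a[(a % 2 == 0) & (a < a.mean())]

# np.logical_and does the same job as a function call
with_function = a[np.logical_and(a % 2 == 0, a < a.mean())]
print((with_operator == with_function).all())   # True

# Without parentheses, Python evaluates 0 & a first, and the chained
# comparison that results needs a single True/False from a whole array
raised = False
try:
    a[a % 2 == 0 & a < a.mean()]
except ValueError as error:
    raised = True
    print('ValueError:', error)
```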

In [153]:
# let's then try & instead of "and"
# the idea is that & will operate on two NumPy arrays
# wherever both items at a given index are True, we'll get True
# if zero or one element is True, then we'll get False

a[(a%2 == 0) & (a<a.mean())]

array([44, 36, 12, 58, 46])

In [154]:
(a%2 == 0)    # generate a boolean array

array([ True, False,  True, False, False, False, False, False,  True,
       False,  True,  True,  True,  True,  True, False, False, False,
        True,  True])

In [155]:
(a<a.mean())  # generate a second boolean array

array([ True,  True, False, False, False,  True, False,  True,  True,
       False, False, False, False,  True,  True, False,  True, False,
        True, False])

In [156]:
(a%2 == 0)  & (a<a.mean()) 

array([ True, False, False, False, False, False, False, False,  True,
       False, False, False, False,  True,  True, False, False, False,
        True, False])

In [158]:
a[(a%2 == 0)  & (a<a.mean()) ]    # what even numbers, less than the mean, do we see?

array([44, 36, 12, 58, 46])

# Exercises: Complex comparisons

1. Create a NumPy array of 20 random integers from 0-100.
2. What's the smallest even number that's also greater than the mean?
3. Show all numbers that are either < mean-std or >mean+std (i.e., outliers, kind of)
4. Show odd numbers that are < mean, and even numbers that are > mean.  (This will be long and horrible looking.)

In [159]:
np.random.seed(0)

a = np.random.randint(0, 100, 20)
a

array([44, 47, 64, 67, 67,  9, 83, 21, 36, 87, 70, 88, 88, 12, 58, 65, 39,
       87, 46, 88])

In [160]:
# even numbers
# greater than the mean
# smallest of them

a%2 == 0  # boolean array indicating which are even

array([ True, False,  True, False, False, False, False, False,  True,
       False,  True,  True,  True,  True,  True, False, False, False,
        True,  True])

In [161]:
a > a.mean()   # boolean array indicating which elements of a are greater than a.mean()

array([False, False,  True,  True,  True, False,  True, False, False,
        True,  True,  True,  True, False, False,  True, False,  True,
       False,  True])

In [163]:
(a%2==0) & (a>a.mean())   #boolean array indicating which are both even and greater than a.mean()

array([False, False,  True, False, False, False, False, False, False,
       False,  True,  True,  True, False, False, False, False, False,
       False,  True])

In [164]:
# apply this boolean index to a
a[(a%2==0) & (a>a.mean())]

array([64, 70, 88, 88, 88])

In [165]:
# get the smallest of these even numbers that are > mean
a[(a%2==0) & (a>a.mean())].min()

64

In [166]:
# 3. Show all numbers that are either < mean-std or >mean+std (i.e., outliers, kind of)

a < a.mean()-a.std()  # boolean array of very small elements of a

array([False, False, False, False, False,  True, False,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False])

In [168]:
a > a.mean()+a.std()  # boolean array of very large elements of a

array([False, False, False, False, False, False, False, False, False,
        True, False,  True,  True, False, False, False, False,  True,
       False,  True])

In [169]:
# we'll use |, NumPy's "or", to get True wherever either of these is true
# (| would also give True if both were true, but no number can be both very small and very large)
(a < a.mean()-a.std()) | (a > a.mean()+a.std())

array([False, False, False, False, False,  True, False,  True, False,
        True, False,  True,  True,  True, False, False, False,  True,
       False,  True])

In [170]:
a[(a < a.mean()-a.std()) | (a > a.mean()+a.std())]

array([ 9, 21, 87, 88, 88, 12, 87, 88])

In [None]:
a[(a < a.mean()-a.std()) |
  (a > a.mean()+a.std())]

In [171]:
# Show odd numbers that are < mean, and even numbers that are > mean.  

a%2 == 1


array([False,  True, False,  True,  True,  True,  True,  True, False,
        True, False, False, False, False, False,  True,  True,  True,
       False, False])

In [172]:
a < a.mean()

array([ True,  True, False, False, False,  True, False,  True,  True,
       False, False, False, False,  True,  True, False,  True, False,
        True, False])

In [173]:
(a%2==1) & (a<a.mean())   #boolean array indicating small, odd numbers

array([False,  True, False, False, False,  True, False,  True, False,
       False, False, False, False, False, False, False,  True, False,
       False, False])

In [174]:
# get even numbers
a%2 == 0

array([ True, False,  True, False, False, False, False, False,  True,
       False,  True,  True,  True,  True,  True, False, False, False,
        True,  True])

In [175]:
a > a.mean()

array([False, False,  True,  True,  True, False,  True, False, False,
        True,  True,  True,  True, False, False,  True, False,  True,
       False,  True])

In [176]:
(a%2 == 0) & (a>a.mean())  # boolean array indicating large, even numbers 

array([False, False,  True, False, False, False, False, False, False,
       False,  True,  True,  True, False, False, False, False, False,
       False,  True])

In [177]:
# let's combine our combined boolean arrays with | 

a[((a%2==1) & (a<a.mean())) |    # small, odd
  ((a%2==0) & (a>a.mean()))]     # large, even

array([47, 64,  9, 21, 70, 88, 88, 39, 88])

# Modifying arrays



In [178]:
a

array([44, 47, 64, 67, 67,  9, 83, 21, 36, 87, 70, 88, 88, 12, 58, 65, 39,
       87, 46, 88])

In [179]:
a[2] = 99   # can I do this?
a

array([44, 47, 99, 67, 67,  9, 83, 21, 36, 87, 70, 88, 88, 12, 58, 65, 39,
       87, 46, 88])

In [180]:
# fancy indexing to assign

a[[2,4,6,8]] = 99
a

array([44, 47, 99, 67, 99,  9, 99, 21, 99, 87, 70, 88, 88, 12, 58, 65, 39,
       87, 46, 88])

In [182]:
# we can assign based on a mask index!

np.random.seed(0)
a = np.random.randint(0, 100, 20)

a[a%2==0]  # find all of the even numbers

array([44, 64, 36, 70, 88, 88, 12, 58, 46, 88])

In [183]:
a[a%2==0] = 0      # assign all even numbers to 0

In [184]:
a

array([ 0, 47,  0, 67, 67,  9, 83, 21,  0, 87,  0,  0,  0,  0,  0, 65, 39,
       87,  0,  0])
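
Assigning via a mask index changes the array in place. If you'd rather keep the original and get a new array back, one option is `np.where`, which isn't covered above: given a condition, it takes values from its second argument where the condition is `True` and from its third where it's `False`. A sketch, with illustrative names `original`, `zeroed`, `unchanged`, and `same`:

```python
import numpy as np

np.random.seed(0)
a = np.random.randint(0, 100, 20)
original = a.copy()

# np.where builds a new array: 0 where the number is even, a's value otherwise
zeroed = np.where(a % 2 == 0, 0, a)
unchanged = (a == original).all()
print(unchanged)   # True -- a itself was not modified

# Assigning via the mask index modifies a in place
a[a % 2 == 0] = 0
same = (a == zeroed).all()
print(same)        # True -- both approaches give the same values
```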

# Exercises: Assigning via indexes/conditions

1. Create a NumPy array with 40 random integers from 0-100.
2. Find all numbers within 1 standard deviation of the mean (i.e., greater than mean-std, and less than mean+std), and set them to be equal to the mean. (This won't work perfectly, because the mean is a float, and we have an array of integers, but it'll be close.)
3. Has the mean changed at all?  A lot?  Has the standard deviation changed?

In [202]:
np.random.seed(0)
a = np.random.randint(0, 100, 40)
a

array([44, 47, 64, 67, 67,  9, 83, 21, 36, 87, 70, 88, 88, 12, 58, 65, 39,
       87, 46, 88, 81, 37, 25, 77, 72,  9, 20, 80, 69, 79, 47, 64, 82, 99,
       88, 49, 29, 19, 19, 14])

In [203]:
a>a.mean()-a.std()

array([ True,  True,  True,  True,  True, False,  True, False,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True, False, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False, False, False])

In [204]:
a<a.mean()+a.std()

array([ True,  True,  True,  True,  True,  True, False,  True,  True,
       False,  True, False, False,  True,  True,  True,  True, False,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False, False,  True,
        True,  True,  True,  True])

In [205]:
(a>a.mean()-a.std()) & (a<a.mean()+a.std())

array([ True,  True,  True,  True,  True, False, False, False,  True,
       False,  True, False, False, False,  True,  True,  True, False,
        True, False,  True,  True, False,  True,  True, False, False,
        True,  True,  True,  True,  True,  True, False, False,  True,
        True, False, False, False])

In [206]:
# here are the numbers within 1 std of the mean

a[(a>a.mean()-a.std()) & 
  (a<a.mean()+a.std())]

array([44, 47, 64, 67, 67, 36, 70, 58, 65, 39, 46, 81, 37, 77, 72, 80, 69,
       79, 47, 64, 82, 49, 29])

In [207]:
a.mean()

55.625

In [208]:
a.std()

26.965429256735373

In [209]:
a[(a>a.mean()-a.std()) & 
  (a<a.mean()+a.std())] = a.mean()

In [210]:
a

array([55, 55, 55, 55, 55,  9, 83, 21, 55, 87, 55, 88, 88, 12, 55, 55, 55,
       87, 55, 88, 55, 55, 25, 55, 55,  9, 20, 55, 55, 55, 55, 55, 55, 99,
       88, 55, 55, 19, 19, 14])

In [211]:
a.mean()

53.025

In [212]:
a.std()

23.77129308640992
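
Notice that the array above contains 55, not the actual mean: an integer array can't hold the fractional part. If we want the exact mean stored, one approach (a sketch, using the `astype` method that we'll meet properly when we talk about dtypes) is to convert the array to floats first. The names `mean`, `std`, and `within` are only for illustration.

```python
import numpy as np

np.random.seed(0)
a = np.random.randint(0, 100, 40).astype(np.float64)

# Calculate these once, before we start changing the array
mean = a.mean()
std = a.std()

# Everything within 1 standard deviation of the mean
within = (a > mean - std) & (a < mean + std)
a[within] = mean

print(a.dtype)                    # float64
print((a[within] == mean).all())  # True -- nothing was cut off this time
```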

# Next up

1. Types (dtypes) -- how we can set them, change them, and why we care
2. `nan` ("not a number") -- how we can set it, use it, and why we care
3. 2-dimensional arrays
4. Sorting
5. Plotting with matplotlib

# dtypes -- types of data you can store in NumPy

In [213]:
a = np.array([10, 20, 30, 40, 50])

In [214]:
a.dtype

dtype('int64')

In [215]:
5 * 64   # 5 elements, 64 bits each: the array's data takes up 320 bits (40 bytes)

320

In [216]:
# we can tell NumPy how big we want our integers to be
# this way, we can either handle larger numbers (with more bits) or smaller numbers (and not waste memory)

# this is done by setting the dtype

In [217]:
a = np.array([10, 20, 30.5, 40, 50])

In [218]:
a.dtype

dtype('float64')

In [219]:
# let's say that I know my integers will be small
# I can use a smaller dtype, and thus waste less memory

a = np.array([10, 20, 30, 40, 50], dtype=np.int8)

In [220]:
a

array([10, 20, 30, 40, 50], dtype=int8)

In [221]:
5 * 8   # 5 elements, 8 bits each: now the data takes up only 40 bits (5 bytes)

40

In [222]:
a * 2

array([ 20,  40,  60,  80, 100], dtype=int8)

In [224]:
a * 10   # we go past the max number that int8 can handle


array([ 100,  -56,   44, -112,  -12], dtype=int8)

In [225]:
2 ** 8   # 8 bits give us 256 distinct values; for int8, that's -128 through 127

256

# Choosing a dtype

You want to choose a dtype that's large enough to handle all of the numbers, and calculations with those numbers, that you intend to do.

NumPy will generally *not* warn you when a calculation on an array overflows a too-small dtype.  It'll simply do what you've asked, within the data structure that it has.

You want to choose a dtype that's as small as possible, so that you don't waste memory needlessly.

Available dtypes:
- np.int8, np.int16, np.int32, np.int64
- np.uint8, np.uint16, np.uint32, np.uint64
- np.float16, np.float32, np.float64, np.float128 (np.float128 isn't available on every platform, e.g., Windows)
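
To help choose among the integer dtypes, you can ask NumPy for the smallest and largest values each one can hold, using `np.iinfo`. The sketch below also uses the `nbytes` attribute to compare how much memory the same five values take with two different dtypes; the names `small` and `large` are only for illustration.

```python
import numpy as np

# The range of values that each integer dtype can hold
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(dtype.__name__, info.min, info.max)

# The same five values, stored with two different dtypes
small = np.array([10, 20, 30, 40, 50], dtype=np.int8)
large = np.array([10, 20, 30, 40, 50], dtype=np.int64)
print(small.nbytes)   # 5
print(large.nbytes)   # 40
```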

In [226]:
a = np.array([10, 20, 30, 40, 50], dtype=np.int32)

In [227]:
a

array([10, 20, 30, 40, 50], dtype=int32)

In [230]:
# I want to change one of the values in a

a[2] = 12.34     # no error, but 12.34 becomes 12 -- because the dtype is np.int32

In [229]:
a

array([10, 20, 12, 40, 50], dtype=int32)

In [231]:
# what if I assign a string to one element?
a[1] = '12'

In [232]:
a

array([10, 12, 12, 40, 50], dtype=int32)

In [233]:
a[1] = 'ab'

ValueError: invalid literal for int() with base 10: 'ab'

In [241]:
a = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=np.int8)
a

array([10, 20, 30, 40, 50, 60, 70, 80], dtype=int8)

In [242]:
# now I want to change a[2] to be 1234567.
# that's too big for an 8-bit int
# so... let's change the dtype to be 32 bits.

a.dtype = np.int32 # NEVER EVER EVER EVER DO THIS!

In [243]:
a

array([ 673059850, 1346780210], dtype=int32)

In [244]:
a.dtype = np.int16

In [245]:
a

array([ 5130, 10270, 15410, 20550], dtype=int16)

# Changing dtypes

NumPy *will* allow you to set a new value for `dtype` on your array.  **NEVER EVER DO THIS.**  When you assign to `dtype`, you're not converting the values in your array.  Rather, you're changing how the underlying bytes are interpreted, which is why our 8 `int8` values turned into 2 `int32` values.

Don't change the dtype by assigning to the array's `dtype` attribute.

Rather, run the `astype` method on the array.  That returns a new array with the same values, converted to the new dtype.
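
To make the difference concrete, here's a short sketch. `astype` converts each value and gives back a new array. The `view` method (not used in this session, and shown only for contrast) is NumPy's explicit way of reinterpreting the same bytes, which is what assigning to `dtype` does behind your back.

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=np.int8)

converted = a.astype(np.int32)     # same 8 values, each now stored in 32 bits
print(converted)
print(a.dtype)                     # the original is untouched: int8

reinterpreted = a.view(np.int16)   # same 8 bytes, read as 4 16-bit numbers
print(reinterpreted.shape)         # (4,) -- the values depend on byte order
```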

In [246]:
# reset this to what it was before
a.dtype = np.int8

In [247]:
a

array([10, 20, 30, 40, 50, 60, 70, 80], dtype=int8)

In [249]:
a = a.astype(np.int32)
a

array([10, 20, 30, 40, 50, 60, 70, 80], dtype=int32)

In [250]:
a = a.astype(np.uint64)

In [251]:
2 ** 64

18446744073709551616

In [252]:
a[0] = 2 ** 64 - 1

In [253]:
a

array([18446744073709551615,                   20,                   30,
                         40,                   50,                   60,
                         70,                   80], dtype=uint64)

In [254]:
a[0] += 1

In [255]:
a

array([ 0, 20, 30, 40, 50, 60, 70, 80], dtype=uint64)

In [256]:
a = np.array([10, 20, 30, 40, 50])  # if we don't specify a dtype, NumPy makes a reasonable guess (int vs. float)

a[2] = 123.456  # cut off because the dtype is np.int64
a

array([ 10,  20, 123,  40,  50])

In [257]:
a = a.astype(np.float64)
a

array([ 10.,  20., 123.,  40.,  50.])

In [258]:
a[2] = 123.456
a

array([ 10.   ,  20.   , 123.456,  40.   ,  50.   ])

# Exercises: dtypes

1. Create a NumPy array of 10 random ints from 0-100.
2. Calculate the mean, which will be a float.  Also get the standard deviation.
3. Create a new array of floats based on our original array, and replace the outliers (< mean-std or > mean+std) with the mean you just calculated.
4. Check: Has the mean changed?   Has the standard deviation changed?
5. Create a NumPy array of 20 random *floats* from 0-100.
6. Replace those numbers whose integer portion is even with the mean. So 20.5 will be replaced by the mean, but 21.5 won't be.

In [259]:
np.random.seed(0)
a = np.random.randint(0, 100, 10)
a

array([44, 47, 64, 67, 67,  9, 83, 21, 36, 87])

In [260]:
a.mean()

52.5

In [261]:
a.std()

24.3567239176372

In [262]:
# new array of floats based on a
a = a.astype(np.float64)
a

array([44., 47., 64., 67., 67.,  9., 83., 21., 36., 87.])

In [266]:
# replace the outliers (very small and very big) with the mean

a[(a < a.mean()-a.std()) |
  (a > a.mean()+a.std())] = a.mean()

In [267]:
a

array([44. , 47. , 64. , 67. , 67. , 52.5, 52.5, 52.5, 36. , 52.5])

In [268]:
a.mean()

53.5

In [269]:
a.std()

9.578622030334008
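
Here's an alternative sketch for step 3, using `np.where` (which the session hasn't covered, so treat it as an optional extra). It builds a new array and leaves the original alone. The mean and standard deviation are stored in variables first, so that the same numbers are used throughout. The values are the ones from the seeded array above.

```python
import numpy as np

a = np.array([44, 47, 64, 67, 67, 9, 83, 21, 36, 87])

mean = a.mean()
std = a.std()
is_outlier = (a < mean - std) | (a > mean + std)

# where is_outlier is True, take the mean; otherwise keep the original value
cleaned = np.where(is_outlier, mean, a)

print(cleaned)
print(cleaned.mean(), cleaned.std())
print(a)   # the original integer array hasn't changed
```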

In [270]:
np.random.seed(0)
a = np.random.rand(20) * 100
a

array([54.88135039, 71.51893664, 60.27633761, 54.4883183 , 42.36547993,
       64.58941131, 43.75872113, 89.17730008, 96.36627605, 38.34415188,
       79.17250381, 52.88949198, 56.80445611, 92.55966383,  7.10360582,
        8.71292997,  2.02183974, 83.26198455, 77.81567509, 87.00121482])

In [271]:
# which of these numbers is even?
# we can use our trick of %2 == 0

a[a%2 == 0]

array([], dtype=float64)

In [272]:
a%2  # divide each element of a by 2, and show us the remainder

array([0.88135039, 1.51893664, 0.27633761, 0.4883183 , 0.36547993,
       0.58941131, 1.75872113, 1.17730008, 0.36627605, 0.34415188,
       1.17250381, 0.88949198, 0.80445611, 0.55966383, 1.10360582,
       0.71292997, 0.02183974, 1.26198455, 1.81567509, 1.00121482])

In [275]:
# get the (floating-point) numbers in a whose integer portion is even

a[a.astype(np.int64) % 2 == 0]

array([54.88135039, 60.27633761, 54.4883183 , 42.36547993, 64.58941131,
       96.36627605, 38.34415188, 52.88949198, 56.80445611, 92.55966383,
        8.71292997,  2.02183974])

In [277]:
# replace the numbers whose integer portion is even with the mean
a[a.astype(np.int64) % 2 == 0] = a.mean() 

In [278]:
a

array([58.15548245, 71.51893664, 58.15548245, 58.15548245, 58.15548245,
       58.15548245, 43.75872113, 89.17730008, 58.15548245, 58.15548245,
       79.17250381, 58.15548245, 58.15548245, 58.15548245,  7.10360582,
       58.15548245, 58.15548245, 83.26198455, 77.81567509, 87.00121482])
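
One caveat, sketched here with made-up values: `astype(np.int64)` chops off the fractional part (it rounds toward zero), which is exactly what we want for positive numbers like ours. With negative numbers, that's not the same as rounding down with `np.floor`.

```python
import numpy as np

a = np.array([20.5, 21.5, -1.5])

print(a.astype(np.int64))             # [20 21 -1]     -- rounds toward zero
print(np.floor(a))                    # [20. 21. -2.]  -- rounds down

print(a.astype(np.int64) % 2 == 0)    # [ True False False]
print(np.floor(a) % 2 == 0)           # [ True False  True]
```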

# What happens if some data is missing?

This is a common problem:

- A sensor malfunctioned
- Someone didn't want to answer a questionnaire
- The file containing the survey data was erased

We don't want to make up data.

In [279]:
scores = np.array([95, 93, 87, 98, 92])
scores.mean()

93.0

In [280]:
scores = np.array([95, 93, 87, 0, 92])  # representing missing data with 0 is a very bad idea
scores.mean()

73.4

# The solution: `NaN` 

`NaN` stands for "not a number."  It's a standard floating-point value (defined by IEEE 754, so not just in NumPy) that we use to say, "there should be a value here, but there isn't, and don't confuse it with 0, or the like."

In [281]:
scores = np.array([95, 93, 87, np.nan, 92])  
scores.mean()

nan

In [282]:
# let's talk about nan!

np.nan

nan

In [283]:
np.NaN   # an older alias for np.nan; it was removed in NumPy 2.0, so stick with np.nan

nan

In [284]:
# what is this thing?
type(np.nan)

float

In [285]:
scores

array([95., 93., 87., nan, 92.])

In [286]:
scores.dtype

dtype('float64')

In [287]:
# is np.nan equal to itself?
np.nan == np.nan

False

In [288]:
# np.nan isn't equal to anything else, either
np.nan == None

False

In [289]:
np.nan == 0

False

In [290]:
np.nan == ''

False

In [291]:
np.nan == 0.0

False

In [293]:
# (nearly) any math operation involving np.nan gives you np.nan back
np.nan + 5

nan

In [294]:
scores.mean()

nan

In [295]:
scores.sum()

nan

In [296]:
scores.sum() / len(scores)

nan

In [297]:
# how can we get out of this pickle?  
# let's create a boolean array, that we can apply to scores and remove the nan values

scores[scores != np.nan]  # this doesn't work!

array([95., 93., 87., nan, 92.])
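
Why doesn't that work? Since `np.nan` isn't equal to anything, including itself, `!=` gives True for every element, so the boolean array keeps everything. A quick sketch:

```python
import numpy as np

scores = np.array([95, 93, 87, np.nan, 92])

print(scores != np.nan)    # [ True  True  True  True  True]
print(scores == scores)    # [ True  True  True False  True]
```

The second line shows that nan is the only value that isn't equal to itself.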

In [298]:
# we can get rid of np.nan by combining two tools:
# (1) np.isnan, a function that returns True for np.nan and False for everything else
# (2) ~, the "not" operator for NumPy

np.isnan(scores)

array([False, False, False,  True, False])

In [299]:
# get only the np.nan values
scores[np.isnan(scores)]

array([nan])

In [301]:
# get only the non-nan values
scores[~np.isnan(scores)]

array([95., 93., 87., 92.])

In [302]:
scores[~np.isnan(scores)].mean()

91.75
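
NumPy also has nan-aware versions of several aggregation functions, such as `np.nanmean`, `np.nansum`, and `np.nanstd`, which skip the nan values for you. They weren't used in this session; a minimal sketch:

```python
import numpy as np

scores = np.array([95, 93, 87, np.nan, 92])

print(scores.mean())        # nan
print(np.nanmean(scores))   # 91.75
print(np.nansum(scores))    # 367.0
```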

In [303]:
np.isnan(scores)

array([False, False, False,  True, False])

In [304]:
np.isnan(scores).sum()   # this returns the number of nan values in our array

1

In [311]:
(~np.isnan(scores)).sum()

4

In [None]:
# In arithmetic, True is treated as 1 and False as 0.

# So if you have an array of True and False values, .sum() tells you how many True values it contains.

In [319]:
(~np.isnan(scores)).sum()  # this returns the number of non-nan values in our array

4
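
Because True counts as 1 and False as 0, a few related tricks fall out of this. A small sketch; `np.count_nonzero` is an alternative to `.sum()` that wasn't shown in the session, and the name `is_missing` is just for illustration.

```python
import numpy as np

scores = np.array([95, 93, 87, np.nan, 92])
is_missing = np.isnan(scores)

print(is_missing.sum())               # 1 -- how many values are missing
print(np.count_nonzero(is_missing))   # 1 -- same count, different spelling
print(is_missing.mean())              # 0.2 -- the fraction that's missing
print(is_missing.any())               # True -- is anything missing at all?
```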

In [315]:
~1   # careful: on a plain Python int, ~ is bitwise inversion, not "not"

-2
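
That `-2` is a reminder that `~` is really the bitwise-inversion operator. On a boolean NumPy array it flips True and False, which is why it works as "not" for us. On integers it flips every bit, giving `-x - 1`. Here's a sketch of the difference, with `np.logical_not` shown as a function that does a logical "not" whatever the dtype:

```python
import numpy as np

flags = np.array([True, False, True])
numbers = np.array([1, 0, 2])

print(~flags)                    # [False  True False]
print(~numbers)                  # [-2 -1 -3]
print(np.logical_not(numbers))   # [False  True False]
```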

# Exercises: Using `np.nan`

1. Create a NumPy array of 30 random ints from 0-1,000.
2. Find the numbers that are < mean-std or > mean+std, and set them to be `np.nan`
3. Change those `np.nan` values to be the mean of the non-`np.nan` values.

In [320]:
np.random.seed(0)
a = np.random.randint(0, 1000, 30)
a

array([684, 559, 629, 192, 835, 763, 707, 359,   9, 723, 277, 754, 804,
       599,  70, 472, 600, 396, 314, 705, 486, 551,  87, 174, 600, 849,
       677, 537, 845,  72])

In [323]:
a[(a<a.mean()-a.std()) |
  (a>a.mean()+a.std())] = np.nan

ValueError: cannot convert float NaN to integer

In [324]:
# np.nan is a float, so we first need an array with a float dtype
a = a.astype(np.float64)

a[(a<a.mean()-a.std()) |
  (a>a.mean()+a.std())] = np.nan

In [325]:
a

array([684., 559., 629.,  nan,  nan, 763., 707., 359.,  nan, 723., 277.,
       754.,  nan, 599.,  nan, 472., 600., 396., 314., 705., 486., 551.,
        nan,  nan, 600.,  nan, 677., 537.,  nan,  nan])

In [329]:
# get the mean of the non-nan values
a[~np.isnan(a)].mean()

569.6

In [330]:
# replace the nan values with the non-nan values' mean
a[np.isnan(a)] = a[~np.isnan(a)].mean()

In [331]:
a

array([684. , 559. , 629. , 569.6, 569.6, 763. , 707. , 359. , 569.6,
       723. , 277. , 754. , 569.6, 599. , 569.6, 472. , 600. , 396. ,
       314. , 705. , 486. , 551. , 569.6, 569.6, 600. , 569.6, 677. ,
       537. , 569.6, 569.6])

In [332]:
a.mean()

569.6
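
Step 3 can also be written with `np.nanmean`, the nan-aware mean, which wasn't part of the exercise. This is only a sketch, on a small made-up array:

```python
import numpy as np

a = np.array([684., 559., np.nan, 763., np.nan])

# the right-hand side is computed first, while a still contains nan values
a[np.isnan(a)] = np.nanmean(a)

print(a)
print(np.isnan(a).any())   # False
```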

# Next up:

1. Defining 2D arrays
2. Retrieving from and setting values on 2D arrays
3. Aggregation methods on 2D arrays
4. Sorting 2D arrays
5. Tiny bit of files and visualization with NumPy

Resume at :40

In [333]:
a = np.array([10, 20, 30])

In [334]:
type(a)

numpy.ndarray

In [335]:
# ndarray == n-dimensional array
# so far, n==1

In [336]:
np.random.seed(0)
a = np.random.randint(0, 100, 24)
a

array([44, 47, 64, 67, 67,  9, 83, 21, 36, 87, 70, 88, 88, 12, 58, 65, 39,
       87, 46, 88, 81, 37, 25, 77])

In [337]:
# checking the dimension(s)
a.shape

(24,)

In [339]:
a.reshape(6, 4)  # have I changed the shape of a? no! I got a new array object back

array([[44, 47, 64, 67],
       [67,  9, 83, 21],
       [36, 87, 70, 88],
       [88, 12, 58, 65],
       [39, 87, 46, 88],
       [81, 37, 25, 77]])

In [340]:
a

array([44, 47, 64, 67, 67,  9, 83, 21, 36, 87, 70, 88, 88, 12, 58, 65, 39,
       87, 46, 88, 81, 37, 25, 77])

In [341]:
a.reshape(8, 3)

array([[44, 47, 64],
       [67, 67,  9],
       [83, 21, 36],
       [87, 70, 88],
       [88, 12, 58],
       [65, 39, 87],
       [46, 88, 81],
       [37, 25, 77]])

In [342]:
a.reshape(4, 6)

array([[44, 47, 64, 67, 67,  9],
       [83, 21, 36, 87, 70, 88],
       [88, 12, 58, 65, 39, 87],
       [46, 88, 81, 37, 25, 77]])

In [343]:
a.reshape(2, 12)

array([[44, 47, 64, 67, 67,  9, 83, 21, 36, 87, 70, 88],
       [88, 12, 58, 65, 39, 87, 46, 88, 81, 37, 25, 77]])

In [344]:
# what if I try a shape that isn't legal?
a.reshape(3, 6)

ValueError: cannot reshape array of size 24 into shape (3,6)
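
The rule is that the new shape has to account for all 24 elements: 6 * 4, 8 * 3, and so on work, but 3 * 6 is only 18. As a sketch of a related convenience (not shown in the session), you can pass -1 for one dimension and let NumPy work it out. The data here comes from `np.arange`, just to have 24 values.

```python
import numpy as np

a = np.arange(24)   # 24 values, from 0 through 23

print(a.reshape(6, -1).shape)   # (6, 4)
print(a.reshape(-1, 3).shape)   # (8, 3)

try:
    a.reshape(5, -1)            # 24 isn't divisible by 5
except ValueError as e:
    print('ValueError:', e)
```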

In [345]:
a = a.reshape(6, 4)

In [346]:
a

array([[44, 47, 64, 67],
       [67,  9, 83, 21],
       [36, 87, 70, 88],
       [88, 12, 58, 65],
       [39, 87, 46, 88],
       [81, 37, 25, 77]])

In [347]:
a[0]

array([44, 47, 64, 67])

In [348]:
a[1]

array([67,  9, 83, 21])

In [349]:
a[2]

array([36, 87, 70, 88])

In [350]:
a[3]

array([88, 12, 58, 65])

In [351]:
a[4]

array([39, 87, 46, 88])

In [352]:
a[5]

array([81, 37, 25, 77])

In [353]:
a[3]

array([88, 12, 58, 65])

In [355]:
# one way to get the item in row index 3, column index 1
a[3][1]

12

In [359]:
# an even better way is to say:
a[3,1]

12

In [358]:
a.__getitem__((3,1))

12

In [360]:
a

array([[44, 47, 64, 67],
       [67,  9, 83, 21],
       [36, 87, 70, 88],
       [88, 12, 58, 65],
       [39, 87, 46, 88],
       [81, 37, 25, 77]])

In [361]:
# what if I want multiple rows?
# we can use fancy indexing again, passing a list of indexes we want

a[[1, 3]]

array([[67,  9, 83, 21],
       [88, 12, 58, 65]])

In [364]:
# what if I want just column index 1?

a[[0,1,2,3,4,5], 1]    # one option

array([47,  9, 87, 12, 87, 37])

In [365]:
a[:, 1]    # better option -- use a slice

array([47,  9, 87, 12, 87, 37])

In [366]:
# I can even use a slice if I want some rows
a[1:6:2, 1]

array([ 9, 12, 37])

In [367]:
# I want row indexes 2 and 4
# and column indexes 1 and 3

a[[2,4], [1, 3]]  # doesn't do what I want -- the lists are paired up, so I get a[2,1] and a[4,3]

array([87, 88])

In [368]:
# use slices instead!
a[2:5:2, 1:4:2]  # slices are start:stop or start:stop:step, where stop is one past the last index we want

array([[87, 88],
       [87, 88]])
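
Slices only help when the rows and columns you want are evenly spaced. For arbitrary combinations, one option (not covered in the session, so take this as a sketch) is `np.ix_`, which turns two lists into the "every row with every column" selection that we expected from `a[[2,4], [1,3]]`. Another is to index the rows first, and then the columns.

```python
import numpy as np

a = np.array([[44, 47, 64, 67],
              [67,  9, 83, 21],
              [36, 87, 70, 88],
              [88, 12, 58, 65],
              [39, 87, 46, 88],
              [81, 37, 25, 77]])

paired = a[[2, 4], [1, 3]]            # pairs the indexes: a[2,1] and a[4,3]
block = a[np.ix_([2, 4], [1, 3])]     # rows 2 and 4, crossed with columns 1 and 3
two_steps = a[[2, 4]][:, [1, 3]]      # pick the rows, then pick the columns

print(paired)      # [87 88]
print(block)
print(two_steps)
```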

In [369]:
mylist = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

mylist[2:5]   # start at index 2, go until (but not including) index 5

[30, 40, 50]

In [370]:
a

array([[44, 47, 64, 67],
       [67,  9, 83, 21],
       [36, 87, 70, 88],
       [88, 12, 58, 65],
       [39, 87, 46, 88],
       [81, 37, 25, 77]])

In [371]:
a.shape

(6, 4)

In [372]:
b = a.reshape(8, 3)
b

array([[44, 47, 64],
       [67, 67,  9],
       [83, 21, 36],
       [87, 70, 88],
       [88, 12, 58],
       [65, 39, 87],
       [46, 88, 81],
       [37, 25, 77]])

In [373]:
b[1,1] = 999  # can I assign this?
b

array([[ 44,  47,  64],
       [ 67, 999,   9],
       [ 83,  21,  36],
       [ 87,  70,  88],
       [ 88,  12,  58],
       [ 65,  39,  87],
       [ 46,  88,  81],
       [ 37,  25,  77]])

In [375]:
a   # look, we changed a as well!

array([[ 44,  47,  64,  67],
       [999,   9,  83,  21],
       [ 36,  87,  70,  88],
       [ 88,  12,  58,  65],
       [ 39,  87,  46,  88],
       [ 81,  37,  25,  77]])

In [376]:
# every NumPy array has a few "flags," True/False values that indicate
# something about the data structure

a.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

In [377]:
b.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

In [381]:
# how can we avoid this? How can we avoid sharing data?
# use the "copy" method

b = a.reshape(8, 3).copy()    # create a new data structure, so b is independent of a
b.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

In [382]:
a.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

In [383]:
a = a.copy()
a.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

In [384]:
# do two data structures share memory?
# we can find out with np.shares_memory(a, b)

np.shares_memory(a, b)

False

In [385]:
np.shares_memory(a, a)

True
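
To pull the pieces together, here's a small sketch of the view-versus-copy behavior. It uses `np.arange` for the data, and the `base` attribute, which is None for an array that owns its data and refers to the original array for a view.

```python
import numpy as np

a = np.arange(24)            # a owns its data
b = a.reshape(6, 4)          # b is a view onto a's data
c = a.reshape(6, 4).copy()   # c has its own, independent data

print(np.shares_memory(a, b))   # True
print(np.shares_memory(a, c))   # False
print(b.base is a)              # True
print(c.base is None)           # True

b[0, 0] = 999    # changes a as well
c[0, 1] = -1     # doesn't touch a
print(a[:2])     # [999   1]
```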

# Exercise: 2D arrays

1. Create a 2D array with 45 (5x9) random integers from 0-100.
2. Retrieve the elements at row index 2.
3. Retrieve the elements at column index 3
4. Retrieve the elements at row indexes 1 and 4.
5. Retrieve the elements at column indexes 1 and 4.
6. Get the mean of the even numbers in row index 4.
7. Get the mean of the odd numbers in column index 4.

In [390]:
np.random.seed(0)
a = np.random.randint(0, 100, 45)

In [391]:
a = a.reshape(5,9)  # assign the reshaped array back to the same variable
a

array([[44, 47, 64, 67, 67,  9, 83, 21, 36],
       [87, 70, 88, 88, 12, 58, 65, 39, 87],
       [46, 88, 81, 37, 25, 77, 72,  9, 20],
       [80, 69, 79, 47, 64, 82, 99, 88, 49],
       [29, 19, 19, 14, 39, 32, 65,  9, 57]])

In [392]:
a.flags

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

In [393]:
# shortcut!
np.random.seed(0)
a = np.random.randint(0, 100, [5,9])  # this creates the 5x9 NumPy array all at once!

a

array([[44, 47, 64, 67, 67,  9, 83, 21, 36],
       [87, 70, 88, 88, 12, 58, 65, 39, 87],
       [46, 88, 81, 37, 25, 77, 72,  9, 20],
       [80, 69, 79, 47, 64, 82, 99, 88, 49],
       [29, 19, 19, 14, 39, 32, 65,  9, 57]])

In [394]:
# get elements at row index 2
a[2]

array([46, 88, 81, 37, 25, 77, 72,  9, 20])

In [395]:
# get elements at column index 3

a[:, 3]     # all rows, column index 3

array([67, 88, 37, 47, 14])

In [398]:
# get items at row indexes 1 and 4
a[1:5:3]  # from 1 until (not including) 5, step size 3

array([[87, 70, 88, 88, 12, 58, 65, 39, 87],
       [29, 19, 19, 14, 39, 32, 65,  9, 57]])

In [399]:
# another solution: 

a[[1, 4]]

array([[87, 70, 88, 88, 12, 58, 65, 39, 87],
       [29, 19, 19, 14, 39, 32, 65,  9, 57]])

In [400]:
# get items at column indexes 1 and 4
a[:, [1,4]]

array([[47, 67],
       [70, 12],
       [88, 25],
       [69, 64],
       [19, 39]])

In [404]:
# mean of even numbers in row index 4
a[4][a[4] % 2 == 0].mean()

23.0

In [408]:
# mean of the odd numbers in column index 4
a[:, 4][a[:, 4]%2 == 1].mean()

43.666666666666664

# Sorting



In [409]:
np.random.seed(0)
a = np.random.randint(0, 100, 24)
a

array([44, 47, 64, 67, 67,  9, 83, 21, 36, 87, 70, 88, 88, 12, 58, 65, 39,
       87, 46, 88, 81, 37, 25, 77])

In [410]:
# how can I sort this array, from highest to lowest?


In [411]:
help(np.sort)

Help on function sort in module numpy:

sort(a, axis=-1, kind=None, order=None)
    Return a sorted copy of an array.
    
    Parameters
    ----------
    a : array_like
        Array to be sorted.
    axis : int or None, optional
        Axis along which to sort. If None, the array is flattened before
        sorting. The default is -1, which sorts along the last axis.
    kind : {'quicksort', 'mergesort', 'heapsort', 'stable'}, optional
        Sorting algorithm. The default is 'quicksort'. Note that both 'stable'
        and 'mergesort' use timsort or radix sort under the covers and, in general,
        the actual implementation will vary with data type. The 'mergesort' option
        is retained for backwards compatibility.
    
        .. versionchanged:: 1.15.0.
           The 'stable' option was added.
    
    order : str or list of str, optional
        When `a` is an array with fields defined, this argument specifies
        which fields to compare first, second, etc.  A si