# NumPy

Before we get into NumPy and Pandas, I want to show (remind?) you that we can do basic descriptive statistics without ever importing a data science library in Python.

In [None]:
some_numbers = [26, 83, 35, 39, 31, 11, 30, 44, 42, 65, 30, 2, 29, 38, 43, 68]

# minimum, maximum, range
print("Min:", min(some_numbers))
print("Max:", max(some_numbers))
print("Range:", min(some_numbers), "to", max(some_numbers))

# sum, count, average
print("Sum:", sum(some_numbers))
print("Count:", len(some_numbers))
print("Average (Mean):", sum(some_numbers)/len(some_numbers))

I guess if you want to call `statistics` a data science library, you do have to import _one_ data science library to get some of the descriptive statistics you might want to have.

In [None]:
import statistics

some_numbers = [26, 83, 35, 39, 31, 11, 30, 44, 42, 65, 30, 2, 29, 38, 43, 68]

# mean (another version), median, mode
print("Average (mean):", statistics.mean(some_numbers))
print("Median:", statistics.median(some_numbers))
# before Python 3.8, this raised a StatisticsError for a multi-modal distribution;
# in 3.8 and later, it returns the first mode it encounters
print("Mode:", statistics.mode(some_numbers))

# measures of dispersion: variance, standard deviation
print("Variance:", statistics.variance(some_numbers))
print("Standard deviation:", statistics.stdev(some_numbers))
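
If you'd like every mode rather than just one, the standard library can do that, too. Here's a small sketch, assuming Python 3.8 or later (that's when `statistics.multimode()` arrived). The `tied_numbers` list is made up for illustration.

```python
import statistics

some_numbers = [26, 83, 35, 39, 31, 11, 30, 44, 42, 65, 30, 2, 29, 38, 43, 68]
# a made-up list with two values tied for most common
tied_numbers = [1, 1, 2, 2, 3]

# mode() returns a single value; on Python 3.8+, a tie goes to the first mode encountered
first_mode = statistics.mode(tied_numbers)

# multimode() returns a list of every value tied for most common
all_modes = statistics.multimode(tied_numbers)
our_modes = statistics.multimode(some_numbers)

print("First mode:", first_mode)
print("All modes:", all_modes)
print("Modes of some_numbers:", our_modes)
```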

So... we can run all of those exciting functions on lists. Why would we need anything else?

Well.

For one thing, lists are slow. For another...

In [None]:
# 4 elements by 4 elements
some_more_numbers = [[26, 83, 35, 39], [31, 11, 30, 44], [42, 65, 30, 2], [29, 38, 43, 68]]

print(some_more_numbers)

# just a reminder that this is legal
print("First number:", some_more_numbers[0][0])
print("Last number:", some_more_numbers[3][3])

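# ...but this fails with a TypeError: mode() can't count the inner lists,
# because lists are unhashable
print("Mode:", statistics.mode(some_more_numbers))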

## NumPy
"Numerical Python" is a way for us to store and do *very fast* math with multi-dimensional arrays of numbers in Python. 

It underlies SciPy ("Scientific Python"), Pandas (which we'll talk about later tonight), Keras (a machine learning library), and TensorFlow (which our Fundamentals of Machine Learning class uses), and we get it for free with Anaconda Distribution.

It expects every element of an array to be the same type, usually an `int` or a `float`.

In [None]:
# the traditional way we import numpy
import numpy as np

We can create a NumPy array from scratch, of course, but if we're working with lists and hit a wall, as above, we can also convert those lists into NumPy arrays:

In [None]:
np_nums = np.array(some_more_numbers)

# look how nicely it prints!
print(np_nums)

# notice that the columns are right-aligned; look at the 2 especially

In [None]:
# a NumPy array is literally a different type of thing than a multi-dimensional list:
print("Our 2-D list:", type(some_more_numbers))
print("Our np array:", type(np_nums))
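
One more difference from lists: remember that every element of an array has to be the same type. Here's a quick sketch of what that means in practice (the variable names are just for illustration). When you mix ints and floats, NumPy quietly converts everything to one common type rather than complaining.

```python
import numpy as np

# all ints: the array gets an integer dtype
int_array = np.array([1, 2, 3])

# one float in the mix: every element is stored as a float
mixed_array = np.array([1, 2.5, 3])

print(int_array, int_array.dtype)
print(mixed_array, mixed_array.dtype)
```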

### Creating a NumPy array

#### Arrays from Python native types
Restating: you can make a NumPy array from a list (or list of lists). Or a tuple (or tuple of tuples)! We're just type-casting here.

In [None]:
# from a list of lists
# restating what we did above; 
# just using a hard-coded list of lists:
np_nums = np.array([[26, 83, 35, 39], [31, 11, 30, 44], [42, 65, 30, 2], [29, 38, 43, 68]])

# same as np_nums = np.array(some_more_numbers) above

# from a tuple of tuples
np_tup_nums = np.array(((1, 2, 3), (4, 5, 6), (7, 8, 9)))
# tup1 = (1, 2, 3)
# tup2 = (4, 5, 6)
# tup3 = (7, 8, 9)
# tups = (tup1, tup2, tup3)
# np_tup_nums = np.array(tups)

print(np_nums)
print("") #space
print(np_tup_nums)

#### Creating arrays full of constant values

* `np.zeros((rows, columns), dtype=int/float)` - make a whole array of zeros
* `np.full((rows, columns), value)` - make an array filled with any one value you choose

In [None]:
# make an array of zeros:
np_zeros = np.zeros((5,5), dtype=int) # could also do float

# make an array of arbitrary whatevers
# in this case, we filled it with 13s
np_unlucky = np.full((13, 13), 13)

print(np_zeros)
print("") #space
print(np_unlucky)
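
Here's a short sketch of the same idea with a couple of variations: `np.ones()` is a sibling of `np.zeros()`, and `np.full()` will happily take a float. The shapes and values are arbitrary.

```python
import numpy as np

# leave off dtype and zeros() gives you floats
np_zero_floats = np.zeros((2, 3))

# ones() works just like zeros(), only with 1s
np_ones = np.ones((2, 3), dtype=int)

# full() takes whatever fill value you hand it
np_halves = np.full((2, 3), 0.5)

print(np_zero_floats)
print("") # space
print(np_ones)
print("") # space
print(np_halves)
```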

#### Arrays with ranges of values

* `np.arange(start, end, count_by)` - specify where you start, where you end (the end value itself is left out), and **how much space between** elements
* `np.linspace(start, end, num=how_many)` - specify where you start, where you end (the end value is included), and **how many** elements

In [None]:
# from 0 to 100, counting by 5s
# defaults to ints, but can be made into floats
# has "range" in the name; what do we guess about the ending point?
np_range = np.arange(0, 100, 5)

# 15 evenly spaced points from 0 to 100, both endpoints included
# floats!
np_lspace = np.linspace(0, 100, 15)

print(np_range)
print("") #space
print(np_lspace)
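
The difference in how the two functions treat the end value is easy to trip over, so here's a small sketch that checks it directly. The `endpoint=False` line is an extra I've added: it's a real `np.linspace()` parameter, but the notes above don't use it.

```python
import numpy as np

np_range = np.arange(0, 100, 5)
np_lspace = np.linspace(0, 100, 15)

# arange stops BEFORE the end value, just like range()
print("arange:", len(np_range), "values, last one is", np_range[-1])

# linspace includes the end value by default
print("linspace:", len(np_lspace), "values, last one is", np_lspace[-1])

# endpoint=False makes linspace leave off the end value, like arange does
np_lspace_open = np.linspace(0, 100, 20, endpoint=False)
print(np_lspace_open)
```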

### Reshaping arrays

OK, those `arange()` and `linspace()` arrays are neat, but they're only one-dimensional? Sometimes, that's what you want, but often it isn't. I'll go so far as to say "usually," it isn't. It's fine, though, because we can reshape our data as needed. 

Let's fix it! 

Unsurprisingly, the command is `.reshape(rows, columns)`

In [None]:
# worth noting: reshape() does NOT change an array in place;
# it hands back a reshaped array, so we assign the result to our variable
np_range = np_range.reshape(4, 5)

np_lspace = np_lspace.reshape(5, 3)

# this one would raise a ValueError: 15 elements can't fill a 5 x 4 array
#np_lspace_nope = np_lspace.reshape(5, 4)

print(np_range)
print("") # space
print(np_lspace)
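
Two more reshaping details, sketched out below: you can pass `-1` for one dimension and let NumPy work it out, and a shape that doesn't match the number of elements raises a `ValueError`. The variable names `np_auto` and `reshape_failed` are mine.

```python
import numpy as np

np_range = np.arange(0, 100, 5)   # 20 elements

# -1 means "work this dimension out for me"
np_auto = np_range.reshape(4, -1)
print(np_auto.shape)

# our original variable still points at the 1-D array
print(np_range.shape)

# a shape that doesn't fit the element count raises a ValueError
try:
    np_range.reshape(3, 7)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("Nope:", err)
```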

#### Flattening

OK, and let's say you have a multi-dimensional array that you want to make into a one-dimensional array. Cool.

In [None]:
np_flat_range = np_range.flatten()

# subtly different: this gives a 2-D array with a single row,
# shape (1, 20), rather than a 1-D array
#np_flat_range = np_range.reshape(1, 20)

print(np_flat_range)
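
To make "subtly different" concrete, here's a sketch comparing the two results side by side. It also shows that `.flatten()` gives you a copy, so changing the flat version leaves the original alone.

```python
import numpy as np

np_range = np.arange(0, 100, 5).reshape(4, 5)

np_flat = np_range.flatten()          # 1-D
np_one_row = np_range.reshape(1, 20)  # 2-D, with a single row

print(np_flat.shape, np_flat.ndim)
print(np_one_row.shape, np_one_row.ndim)

# flatten() returns a copy, so this doesn't touch np_range
np_flat[0] = 999
print(np_range[0, 0])
```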

### Getting things out of our arrays

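We can slice NumPy arrays a lot like we'd slice lists, with one slight change in syntax. Instead of `list_name[x][y]`, we usually write `array_name[x, y]`. (The double-bracket form still works for pulling out a single value, but the comma form is the one that lets us slice columns and blocks.) Otherwise, slicing is very similar. The main trick is keeping straight which is the row and which is the column: the row comes first.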

In [None]:
# getting a single value out is slightly different:
# restating our array
np_nums = np.array([[26, 83, 35, 39], [31, 11, 30, 44], [42, 65, 30, 2], [29, 38, 43, 68]])

# comma notation instead of double-bracket notation
print("First value:", np_nums[0,0])
print("Last value:", np_nums[3,3])

In [None]:
# getting out a row
print("First row:", np_nums[0])
# same
print("First row:", np_nums[0, :])

In [None]:
# getting out a column
print("First column:", np_nums[:, 0])

In [None]:
# getting the middle four values
print("Second and third row, second and third column:\n", np_nums[1:3, 1:3])

In [None]:
# getting out non-adjacent rows
print("First and last row:\n", np_nums[[0, 3]])
# same
print("First and last row:\n", np_nums[[0, 3], :])

#getting out non-adjacent columns
print("\nFirst and last column:\n", np_nums[:, [0, 3]])
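
If you're wondering what the comma notation buys us, here's a sketch comparing a plain list of lists with an array. Getting a column out of nested lists takes a loop; the array does it in one slice.

```python
import numpy as np

some_more_numbers = [[26, 83, 35, 39], [31, 11, 30, 44], [42, 65, 30, 2], [29, 38, 43, 68]]
np_nums = np.array(some_more_numbers)

# single values: both spellings work on an array
same_value = np_nums[1][2] == np_nums[1, 2]
print("Same value either way?", same_value)

# a column from a list of lists needs a loop (here, a list comprehension)
list_column = [row[0] for row in some_more_numbers]

# a column from an array is a single slice
array_column = np_nums[:, 0]

print(list_column)
print(array_column)
```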

A cool thing about NumPy arrays is that they know about themselves, so you can interrogate them:
* How many dimensions do you have? `.ndim`
* What shape are you? `.shape`
* How many elements do you have? `.size`
* How many bytes are there in an individual element? `.itemsize`
* What data type are your elements (in C)? `.dtype`

In [None]:
print("Dimensions:", np_nums.ndim)
print("Shape, (rows, columns):", np_nums.shape)
print("Size:", np_nums.size)
print("Size of an item in bytes:", np_nums.itemsize)
print("Data type of items:", np_nums.dtype)

### Descriptive statistics

You knew we were coming back to this eventually.

In [None]:
print("Min:", np_nums.min())
print("Max:", np_nums.max())
print("Range:", np_nums.min(), "to", np_nums.max())
print("Sum:", np_nums.sum())
print("Count:", np_nums.size)
print("Average:", np_nums.mean())
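# there isn't one for mode
# note: NumPy's var() and std() default to the population versions (dividing by n),
# so these won't match statistics.variance() and statistics.stdev(), which divide by n - 1
print("Variance:", np_nums.var())
print("Standard deviation:", np_nums.std())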
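
Two options are worth a quick sketch here. Passing `ddof=1` gets you the sample variance and standard deviation (the kind the `statistics` module calculates), and passing `axis` gets you one answer per column or per row instead of one for the whole array.

```python
import statistics
import numpy as np

some_numbers = [26, 83, 35, 39, 31, 11, 30, 44, 42, 65, 30, 2, 29, 38, 43, 68]
np_nums = np.array(some_numbers).reshape(4, 4)

# ddof=1 divides by (n - 1), the sample version that statistics.variance() uses
sample_var = np_nums.var(ddof=1)
sample_std = np_nums.std(ddof=1)
print("Sample variance:", sample_var, "vs.", statistics.variance(some_numbers))
print("Sample standard deviation:", sample_std, "vs.", statistics.stdev(some_numbers))

# axis=0 collapses the rows, leaving one answer per column
column_sums = np_nums.sum(axis=0)
# axis=1 collapses the columns, leaving one answer per row
row_sums = np_nums.sum(axis=1)

print("Column sums:", column_sums)
print("Row sums:", row_sums)
```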

#### Looping through a NumPy Array
I wanted to show you how to loop through an array anyway, so let's find the mode ourselves! (I'm not claiming this is the best possible way, or especially fast, but it will do the job, as long as there actually is a mode to be found.)

In [None]:
# make an empty dictionary to hold unique values from our array
vals_dict = {}

# loop through the array row by row, then each column within a row
for row in np_nums:
    for col in row:
        # just counting occurrences of particular values in the array
        if col in vals_dict:
            vals_dict[col] += 1
        else:
            vals_dict[col] = 1

# now we're pulling the value that appeared the most often
max_count = 0
max_val = 0
for val in vals_dict:
    if vals_dict[val] > max_count:
        max_count = vals_dict[val]
        max_val = val

mode = max_val
print("Mode:", mode)
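
For comparison, here's a sketch of a loop-free way to get the same answer with `np.unique()`. One difference to be aware of: if several values tie for most common, this version picks the smallest of them, while the loop above picks whichever it counted first.

```python
import numpy as np

np_nums = np.array([[26, 83, 35, 39], [31, 11, 30, 44], [42, 65, 30, 2], [29, 38, 43, 68]])

# the unique values (sorted), plus how many times each one appears
values, counts = np.unique(np_nums, return_counts=True)

# argmax() gives the position of the biggest count
np_mode = values[counts.argmax()]
print("Mode:", np_mode)
```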

### Doing math with NumPy arrays

#### Scalars
There are a couple of different ways to do math (or "math," if you count Boolean comparisons as math) with arrays. You can apply a single value (a scalar) to every element of the array, one element at a time. NumPy calls this "broadcasting":

In [None]:
# multiply every number in the array by the same amount
np_nums2 = np_nums * 2

print(np_nums)
print("\nAnd then multiply it by 2:\n")
print(np_nums2)

In [None]:
# add the same amount to every number in the array
np_nums2 = np_nums + 2

print(np_nums)
print("\nAnd then add 2:\n")
print(np_nums2)

In [None]:
print(np_nums)

# compare every number in the array with a set value
# recall, we calculated a mode:
print("\nOur mode:", mode, "\n\nIs each element greater than the mode?\n")

# so let's see how much of our array is larger than the mode
np_nums_gt_mode = np_nums > mode
print(np_nums_gt_mode)

Here is a good NumPy array fact: you can subset an array by feeding it a same-size array holding Trues and Falses.

In [None]:
# only the elements that line up with a True end up in the new array,
# which comes back one-dimensional
np_nums3 = np_nums[np_nums_gt_mode]
print(np_nums3)
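
You can also combine comparisons before subsetting. Here's a sketch with cutoffs I picked arbitrarily (30 and 45). Note the `&` and the parentheses: Python's plain `and` doesn't work element by element on arrays.

```python
import numpy as np

np_nums = np.array([[26, 83, 35, 39], [31, 11, 30, 44], [42, 65, 30, 2], [29, 38, 43, 68]])

# & means "and" for Boolean arrays; each comparison needs its own parentheses
np_between = (np_nums > 30) & (np_nums < 45)
print(np_nums[np_between])

# Trues count as 1, so summing a Boolean array counts the matches
how_many = (np_nums > 30).sum()
print("Values greater than 30:", how_many)
```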

#### Math with multiple arrays

In [None]:
# set up two arrays, same dimensions
np_num1 = np.array(((1, 2, 3), (4, 5, 6), (7, 8, 9)))
np_num2 = np.full((3,3), 2)

print(np_num1)
print("") # space
print(np_num2)

In [None]:
# probably nothing surprising here?

## ADD
np_add = np.add(np_num1, np_num2)
print(np_add)
print("")

# exactly the same
np_add2 = np_num1 + np_num2
print(np_add2)

In [None]:
## SUBTRACT
np_sub = np.subtract(np_num1, np_num2)
print(np_sub)
print("")

# exactly the same
np_sub2 = np_num1 - np_num2
print(np_sub2)

In [None]:
## MULTIPLY
np_mult = np.multiply(np_num1, np_num2)
print(np_mult)
print("")

# exactly the same
np_mult2 = np_num1 * np_num2
print(np_mult2)

In [None]:
## DIVIDE
# (true division, so the results come back as floats even though both arrays hold ints)
np_div = np.divide(np_num1, np_num2)
print(np_div)
print("")

# exactly the same
np_div2 = np_num1 / np_num2
print(np_div2)

In [None]:
# this MIGHT be obvious, but it seems worth saying:
# we can do all of our math on subsets of arrays, too
np_num4 = np_num1[:, [0, 2]] + np_num2[:, [0, 2]]
print(np_num4)
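
The two arrays don't always have to be the same shape, either. Broadcasting also works with a single row, which gets stretched down to fit. This is a sketch; `one_row` and `shapes_clashed` are names I made up.

```python
import numpy as np

np_num1 = np.array(((1, 2, 3), (4, 5, 6), (7, 8, 9)))
one_row = np.array([10, 20, 30])

# the single row is added to every row of the 3 x 3 array
np_stretched = np_num1 + one_row
print(np_stretched)

# shapes that can't be stretched to fit raise a ValueError
try:
    np_num1 + np.array([10, 20])
    shapes_clashed = False
except ValueError as err:
    shapes_clashed = True
    print("Nope:", err)
```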

# Matplotlib

This is not an in-depth intro to Matplotlib. This is me showing you how to make a single kind of graph, and pointing you at the documentation if you're interested in doing more. (It is very well documented!)

To be clear: you aren't stuck with line charts. There are also bar charts, scatter plots, histograms, and pie charts (ew). You can do all kinds of wild things with Matplotlib.

I didn't want to neglect to show you this, though, because the book sort of makes it look like you have to use Seaborn to use Matplotlib, and that isn't true by any stretch of the imagination. Matplotlib works fine on its own. 

In [None]:
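# lets us show interactive matplotlib plots inside the notebook
# (if this magic fails in your environment, try %matplotlib inline instead)
%matplotlib notebook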

# getting just the part we need and giving it a shorter name
from matplotlib import pyplot as plt
# often you'll see this instead:
# import matplotlib.pyplot as plt

# telling matplotlib we're going to start a graph
plt.figure()

# given a 2-D array, it'll plot each COLUMN as a separate series, interestingly
plt.plot(np_mult, linestyle="--", color="#AA00AA", marker='s')
# this is more fun:
# plt.plot(np_mult[0], linestyle="dashed", color="#990033", marker='o')
# plt.plot(np_mult[1], linestyle="solid", color="#AA00AA", marker='s')
# plt.plot(np_mult[2], linestyle="dotted", color="#1100AA", marker='*')
# feed axis a list to set ranges:
# [x_start, x_end, y_start, y_end]
plt.axis([0, 2.5, 0, 20])
# put nice labels on
plt.xlabel("our x axis")
plt.ylabel("our y axis")
plt.title("A very linear graph")
# actually makes the graph show up
plt.show()

# Start here with the documentation: 
# https://matplotlib.org/3.1.0/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py
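
Here's a sketch of the same kind of chart using Matplotlib's other style, where you hold on to the figure and axes objects. It also checks what `plot()` does with a 2-D array: it draws one line per column, so to get one line per row you plot the rows one at a time. The `"Agg"` backend line is my assumption so that the code runs outside a notebook; leave it out if you want the plot to show up in one.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

np_mult = np.array(((1, 2, 3), (4, 5, 6), (7, 8, 9))) * 2

# a 2-D array handed to plot() is drawn one line per COLUMN
fig, ax = plt.subplots()
column_lines = ax.plot(np_mult, linestyle="--", marker="s")
first_line_values = column_lines[0].get_ydata().tolist()
print("First line drawn:", first_line_values)

# to draw one line per ROW, plot the rows one at a time
fig2, ax2 = plt.subplots()
row_lines = []
for i, row in enumerate(np_mult):
    row_lines.extend(ax2.plot(row, marker="o", label=f"row {i}"))

ax2.set_xlabel("position within the row")
ax2.set_ylabel("value")
ax2.set_title("One line per row")
ax2.legend()
```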


# Pandas
### The world's shortest introduction.
 
OK, but sometimes we don't want our data to all be the same type. Maybe instead of rows and columns of ints or floats, we want labels on things, floats in one place and ints in another, or perhaps categorical data of some kind. Maybe there are gaps in our data (which NumPy can't really handle). 

NumPy alone can't do that, though it serves as a useful backend of the library that can: Pandas. Also, it's worth noting that you can often use NumPy operations on Pandas objects, because the two libraries are so closely linked.

First, let's get Pandas, which (like NumPy) we got for free when we installed Anaconda.

In [None]:
import pandas as pd

We've got two main kinds of objects in Pandas: `Series` and `DataFrame`. Usually you'll want an entire `DataFrame` for your data, but you'll often pull individual rows or columns out to deal with as a `Series`. So we'll start there.

## Series

We can build a Series from a list or a tuple, just like we could with NumPy arrays. We can also build it _from a NumPy array_ if we want to. 

In [None]:
# kind of a boring series to start
some_numbers = [26, 83, 35, 39, 31, 11, 30, 44, 42, 65, 30, 2, 29, 38, 43, 68]

numbers_series = pd.Series(some_numbers)

print(numbers_series)


# np_array2 = np.array(some_numbers)
# numbers2 = pd.Series(np_array2)

# print("") #space
# print(numbers2)

Notice that we got free indices with our Series. So... sure, it's one-dimensional. Only it isn't, really: every value comes paired with an index label.

By default, the indices are numbers, starting at zero.

A cool thing about this, though: we can also create a series from a dictionary. And if we do that, it'll set our index to whatever the keys of our dictionary are, and the data to whatever the values are.

In [None]:
state_rankings = {
    1 : "Virginia", 
    2 : "Texas", 
    3 : "Colorado", 
    4 : "New York", 
    5 : "North Carolina", 
    6 : "New Jersey", 
    7 : "California", 
    8 : "Florida", 
    9 : "District of Columbia", 
    10 : "South Dakota"
}
# from https://docs.google.com/spreadsheets/d/1fSZwXMi8ARXh3XbBF1RKjMr5MDJH_RwHsy0-BF4N2Ko/edit#gid=0

In [None]:
rankings_series = pd.Series(state_rankings)
print(rankings_series)

In [None]:
# not just numbers!
capitals = {
    'Alabama' : 'Montgomery', 
    'Alaska' : 'Juneau',
    'Arizona' : 'Phoenix', 
    'Arkansas' : 'Little Rock',
    'California' : 'Sacramento', 
    'Colorado' : 'Denver',
    'Connecticut' : 'Hartford', 
    'Delaware' : 'Dover',
    'Florida' : 'Tallahassee', 
    'Georgia' : 'Atlanta',
    'Hawaii' : 'Honolulu'
}

caps_series = pd.Series(capitals)
print(caps_series)
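
You don't need a dictionary to get custom labels, either. Here's a sketch that passes the labels in separately with the `index` argument. The names and scores are made up.

```python
import pandas as pd

# made-up scores, just to have something to label
scores = pd.Series([88, 92, 79], index=["ada", "grace", "katherine"])

print(scores)
print("") # space
print("Grace's score:", scores["grace"])
print("The labels:", scores.index.tolist())
```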

### Getting items out of a Series

In [None]:
# kind of about what you'd guess, when the index is numeric

print(numbers_series[0])
print(rankings_series[1])
# this one would raise a KeyError: the index labels run from 1 to 10,
# so there is no label 0
#print(rankings_series[0])

# maybe? what you'd guess? when the index is text

print(caps_series['Hawaii'])
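
When the index labels are numbers that don't start at zero, it's easy to lose track of whether you're asking for a label or a position. A sketch of the two explicit spellings, using a shortened copy of the rankings: `.loc[]` always means "label" and `.iloc[]` always means "position."

```python
import pandas as pd

# a shortened copy of the rankings from above
rankings_series = pd.Series({1: "Virginia", 2: "Texas", 3: "Colorado"})

# .loc looks things up by index label
by_label = rankings_series.loc[1]

# .iloc looks things up by position, counting from zero
by_position = rankings_series.iloc[0]

# negative positions count from the end, like lists
last_one = rankings_series.iloc[-1]

print(by_label, by_position, last_one)
```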

### Descriptive statistics on Pandas Series

In [None]:
print("Count: ", numbers_series.count())
print("Min: ", numbers_series.min())
print("Max: ", numbers_series.max())
print("Standard deviation: ", numbers_series.std())

In [None]:
# or there's this
numbers_series.describe()

## DataFrames

A DataFrame is a two-dimensional table with labeled rows and columns. Each column is a Series. You won't generally go making your DataFrame this way, but you can create one from a dictionary if you want to:

In [None]:
data_dict = {
    "Virginia" : [14180, 4570, 10330, 13190],
    "Texas" : [8520, 4128, 6550, 7760],
    "Colorado" : [3590, 1372, 1500, 2870],
    "New York" : [6930, 2600, 5040, 6170],
    "North Carolina" : [3570, 1397, 3140, 4090],
    "New Jersey" : [3480, 1764, 1820, 3210],
    "California" : [7830, 5008, 8260, 8470],
    "Florida" : [5600, 1834, 3390, 5240],
    "District of Columbia" : [1660, 1646, 800, 1310]
} # from https://docs.google.com/spreadsheets/d/1fSZwXMi8ARXh3XbBF1RKjMr5MDJH_RwHsy0-BF4N2Ko/edit#gid=0

states = pd.DataFrame(data_dict)

print(states)

OK, that numeric index isn't _super_ helpful, is it? What are those numbers? Who can say? We probably need to label our indices. It's cool, there are several ways to pull that off.

In [None]:
states.index = ["working_now", "vacancies_now", "working_in_2013", "working_in_2017"]

# alternately:
#states = states.rename(index={0:"working_now", 1:"vacancies_now", 2:"working_in_2013", 3:"working_in_2017"})
print(states)
print("")

# and, yeah, we could have saved that step by planning ahead, _i guess_
# states2 = pd.DataFrame(data_dict, index = ["working_now", "vacancies_now", "working_in_2013", "working_in_2017"])
# print(states2)
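
Since each column is a Series, you can pull one out by name and treat it like any other Series. A sketch using three of the states from above:

```python
import pandas as pd

data_dict = {
    "Virginia" : [14180, 4570, 10330, 13190],
    "Texas" : [8520, 4128, 6550, 7760],
    "Colorado" : [3590, 1372, 1500, 2870]
}
row_labels = ["working_now", "vacancies_now", "working_in_2013", "working_in_2017"]
states = pd.DataFrame(data_dict, index=row_labels)

# one column name in brackets gives us a Series
texas = states["Texas"]
print(type(texas))
print("Texas vacancies:", texas["vacancies_now"])

# a list of column names gives us a smaller DataFrame
two_states = states[["Virginia", "Colorado"]]
print(two_states)
```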

## Making a DataFrame from a CSV

I'm not going to claim you won't ever make a DataFrame from scratch like I did above, but realistically? You're probably pulling in data from a file. And I would feel remiss if I didn't show you that.

In [None]:
trees = pd.read_csv('city_of_pittsburgh_trees.csv', index_col=0)
print(trees)
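
If you don't have the CSV handy, here's a sketch you can run anyway. It reads a few rows from a string instead of a file. The data is entirely made up (the IDs, trees, and heights are hypothetical); only the column names are borrowed from the examples in these notes.

```python
import io
import pandas as pd

# made-up rows standing in for a real file
csv_text = """id,common_name,height,neighborhood
101,Red Maple,22.0,Highland Park
102,Ginkgo,15.5,Shadyside
103,Pin Oak,40.0,Squirrel Hill
"""

# index_col=0 tells Pandas to use the first column as the row index
fake_trees = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(fake_trees)
print("") # space
print("The index:", fake_trees.index.tolist())
print("The columns:", fake_trees.columns.tolist())
```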

### Pulling data out of your DataFrame

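To show a row, use its index label with `.loc[label]` or its numeric position (counting from zero) with `.iloc[position]`. Slicing works with both, with one catch: a `.loc` slice includes its end label, while an `.iloc` slice stops just before its end position, like a list slice.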

In [None]:
# get the row that matches ID #615894354
print(trees.loc[615894354])

In [None]:
# get the 13th row
print(trees.iloc[12])

In [None]:
# the first three
print(trees.loc[1946899269:994063598])

# also the first three
#print(trees.iloc[0:3])
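
The two kinds of slices treat their endpoints differently, which is why both of the above come out as three rows. Here's a sketch on a tiny made-up DataFrame (hypothetical IDs and heights) so you can see it without the real file:

```python
import pandas as pd

# made-up data; the index plays the role of the tree IDs
fake_trees = pd.DataFrame(
    {"common_name": ["Red Maple", "Ginkgo", "Pin Oak", "Linden"],
     "height": [22.0, 15.5, 40.0, 30.5]},
    index=[101, 102, 103, 104]
)

# .loc slices by label and INCLUDES the end label
by_label = fake_trees.loc[101:103]

# .iloc slices by position and stops BEFORE the end position
by_position = fake_trees.iloc[0:3]

print(by_label)
print("") # space
print(by_position)
```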

Clearly, you aren't always going to want the whole row, though, right? So. You can specify the columns you want in a list, within the `.loc[]` selection.

In [None]:
trees.loc[1946899269:1870646392, ['common_name', 'scientific_name', 'height', 'neighborhood']]

And if you want to be _incredibly_ specific, getting only a single value, you can do that, too. Use `.at[]` with the index label and column name, or `.iat[]` with the row and column positions.

In [None]:
# use the named index and the column name
trees.at[1870646392, 'common_name']

In [None]:
# use the positions, both down and across
trees.iat[10,3]

## Descriptive statistics

I love how easy this is and also how useless some of it really is. It's just a delight. :) 

In [None]:
trees.describe()
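
Here's a sketch of what `.describe()` does on a small made-up DataFrame. My guess at the "useless" part: it summarizes every numeric column, whether or not the numbers mean anything, so the hypothetical `zip_code` column below gets a mean and a standard deviation just like `height` does.

```python
import pandas as pd

# made-up data; zip_code is a number you'd never want to average
fake_trees = pd.DataFrame({
    "common_name": ["Red Maple", "Ginkgo", "Pin Oak", "Linden"],
    "height": [22.0, 15.5, 40.0, 30.5],
    "zip_code": [15206, 15232, 15217, 15206]
})

# by default, only the numeric columns are summarized
summary = fake_trees.describe()
print(summary)

# pulling a single statistic back out of the summary
mean_height = summary.loc["mean", "height"]
print("Mean height:", mean_height)

# include="all" adds the non-numeric columns, too
print(fake_trees.describe(include="all"))
```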