# Lab 3

# Introduction to NumPy

In this lab, you'll be working through Chapter 2 to get an introduction to the numerical computing package for Python, NumPy. This notebook is made up of two sections.

- Section 1: Work through the code samples in Chapter 2
- Section 2: Exercises

# Section 1: Code Practice

In this section, you will be reading through the various chapter sections and typing out/running the code samples given in the sections. The purpose of this is for you to practice using Jupyter to run Python code as well as learn about the functionality available to you in both IPython and Jupyter.

#### Executing code in Jupyter

When typing and executing code in Jupyter, it is helpful to know the various keyboard shortcuts. You can find the full list of these by clicking **Help &rarr; Keyboard Shortcuts** in the menu. However, the three most useful keyboard shortcuts are:

- `Shift-Enter`: Execute the current cell and advance to the next cell. A new cell is created only if none exists below the current one.
- `Alt-Enter`: Execute the current cell and **create** a new cell below.
- `Control-Enter`: Execute the current cell without advancing to the next cell.

When writing your code, you will be using these commands to make sure input/output (`In`/`Out`) is consistent with what is found in the chapter. If you create a cell by mistake, you can always go to **Edit &rarr; Delete Cells** to remove it.

#### Purpose of Section 1

Your purpose in this section is to:

- **Type out** the code examples from the chapter (do not copy and paste)
- **Run** them
- **Check** to **make sure** you are getting the same results as what is contained in the chapter

---




## Computation on Arrays: Broadcasting

[Chapter/section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/02.05-Computation-on-arrays-broadcasting.ipynb)

### Introducing Broadcasting

In [None]:
import numpy as np

In [None]:
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])
a + b

In [None]:
a + 5

In [None]:
M = np.ones((3, 3))
M

In [None]:
M + a

In [None]:
a = np.arange(3)
b = np.arange(3)[:, np.newaxis]

print(a)
print(b)

In [None]:
a + b

### Rules of Broadcasting
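
As a quick reference while working through the three examples in this section, the broadcasting rules can be checked without doing any arithmetic. The sketch below is not taken from the chapter: it uses `np.broadcast` to report the shape two arrays would broadcast to, and the helper name `broadcast_shape` is made up for illustration.

```python
import numpy as np

def broadcast_shape(a, b):
    """Return the shape a and b broadcast to (hypothetical helper)."""
    return np.broadcast(a, b).shape

# Rule 1: the array with fewer dimensions is padded with ones on the left,
# so (3,) is treated as (1, 3) next to (2, 3).
shape_1 = broadcast_shape(np.ones((2, 3)), np.arange(3))

# Rule 2: any dimension of size 1 is stretched to match the other array,
# so (3, 1) and (3,) both stretch to (3, 3).
shape_2 = broadcast_shape(np.arange(3).reshape((3, 1)), np.arange(3))

# Rule 3: if the sizes disagree and neither is 1, broadcasting fails.
try:
    broadcast_shape(np.ones((3, 2)), np.arange(3))
    incompatible = False
except ValueError:
    incompatible = True

print(shape_1, shape_2, incompatible)
```

The three calls use the same shapes as broadcasting examples 1, 2, and 3, so the printed shapes can be compared against the results of those cells.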

#### Broadcasting example 1

In [None]:
M = np.ones((2, 3))
a = np.arange(3)

In [None]:
M + a

#### Broadcasting example 2

In [None]:
a = np.arange(3).reshape((3, 1))
b = np.arange(3)

In [None]:
a + b

#### Broadcasting example 3

In [None]:
M = np.ones((3, 2))
a = np.arange(3)

In [None]:
M + a

In [None]:
a[:, np.newaxis].shape

In [None]:
M + a[:, np.newaxis]

In [None]:
np.logaddexp(M, a[:, np.newaxis])

### Broadcasting in Practice

#### Centering an array

In [None]:
X = np.random.random((10, 3))

In [None]:
Xmean = X.mean(0)
Xmean

In [None]:
X_centered = X - Xmean

In [None]:
X_centered.mean(0)

#### Plotting a two-dimensional function

In [None]:
# x and y have 50 steps from 0 to 5
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)[:, np.newaxis]

z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
plt.imshow(z, origin='lower', extent=[0, 5, 0, 5],
           cmap='viridis')
plt.colorbar();

## Comparisons, Masks, and Boolean Logic

[Chapter/section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/02.06-Boolean-Arrays-and-Masks.ipynb)

### Example: Counting Rainy Days

In [1]:
def array_from_url(url, column):
    import pandas as pd
    import numpy as np
    data = pd.read_csv(url)
    return np.array(data[column])

rainfall = array_from_url('https://raw.githubusercontent.com/jakevdp/PythonDataScienceHandbook/master/notebooks/data/Seattle2014.csv','PRCP')

Start the next cell at `inches = rainfall / 254.0`
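
The divisor `254.0` is a unit conversion. The sketch below assumes, as the chapter does, that the `PRCP` column is recorded in tenths of a millimeter; the array `rainfall_demo` holds made-up values and only stands in for the real data.

```python
import numpy as np

# Toy precipitation readings in tenths of a millimeter (made-up values,
# standing in for the PRCP column loaded above).
rainfall_demo = np.array([0, 254, 127, 508])

# 1 inch = 25.4 mm = 254 tenths of a millimeter
inches_demo = rainfall_demo / 254.0
print(inches_demo)
```

Dividing by a float also turns the integer readings into a floating-point array, which is what the later comparisons such as `inches > 0.5` operate on.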

In [None]:
import numpy as np
import pandas as pd

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()  # set plot styles

### Comparison Operators as UFuncs

In [None]:
x = np.array([1, 2, 3, 4, 5])

In [None]:
x < 3  # less than

In [None]:
x > 3  # greater than

In [None]:
x <= 3  # less than or equal

In [None]:
x >= 3  # greater than or equal

In [None]:
x != 3  # not equal

In [None]:
x == 3  # equal


In [None]:
(2 * x) == (x ** 2)

In [None]:
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
x

In [None]:
x < 6

### Working with Boolean Arrays

In [None]:
print(x)

#### Counting entries

In [None]:
# how many values less than 6?
np.count_nonzero(x < 6)

In [None]:
np.sum(x < 6)


In [None]:
# how many values less than 6 in each row?
np.sum(x < 6, axis=1)

In [None]:
# are there any values greater than 8?
np.any(x > 8)

In [None]:
# are there any values less than zero?
np.any(x < 0)

In [None]:
# are all values less than 10?
np.all(x < 10)

In [None]:
# are all values equal to 6?
np.all(x == 6)

In [None]:
# are all values in each row less than 8?
np.all(x < 8, axis=1)

#### Boolean operators

In [None]:
np.sum((inches > 0.5) & (inches < 1))

In [None]:
np.sum(~( (inches <= 0.5) | (inches >= 1) ))

In [None]:
print("Number days without rain:      ", np.sum(inches == 0))
print("Number days with rain:         ", np.sum(inches != 0))
print("Days with more than 0.5 inches:", np.sum(inches > 0.5))
print("Rainy days with < 0.2 inches  :", np.sum((inches > 0) &
                                                (inches < 0.2)))

### Boolean Arrays as Masks

In [None]:
x

In [None]:
x < 5

In [None]:
x[x < 5]

In [None]:
# construct a mask of all rainy days
rainy = (inches > 0)

# construct a mask of all summer days (June 21st is the 172nd day)
days = np.arange(365)
summer = (days > 172) & (days < 262)

print("Median precip on rainy days in 2014 (inches):   ",
      np.median(inches[rainy]))
print("Median precip on summer days in 2014 (inches):  ",
      np.median(inches[summer]))
print("Maximum precip on summer days in 2014 (inches): ",
      np.max(inches[summer]))
print("Median precip on non-summer rainy days (inches):",
      np.median(inches[rainy & ~summer]))

### Aside: Using the Keywords `and`/`or` Versus the Operators `&`/`|`

In [None]:
bool(42), bool(0)

In [None]:
bool(42 and 0)

In [None]:
bool(42 or 0)

In [None]:
bin(42)

In [None]:
bin(42 & 59)

In [None]:
bin(42 | 59)


In [None]:
A = np.array([1, 0, 1, 0, 1, 0], dtype=bool)
B = np.array([1, 1, 1, 0, 1, 1], dtype=bool)
A | B

In [None]:
 A or B 

In [None]:
x = np.arange(10)
(x > 4) & (x < 8)

In [None]:
(x > 4) and (x < 8)
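
The `A or B` and `(x > 4) and (x < 8)` cells above are expected to raise an error rather than return a result. A minimal sketch of why, reusing the same `A` and `B`: `or` asks for the truth value of a whole array, which NumPy refuses to guess for an array with more than one element, while `|` works element by element.

```python
import numpy as np

A = np.array([1, 0, 1, 0, 1, 0], dtype=bool)
B = np.array([1, 1, 1, 0, 1, 1], dtype=bool)

# `|` is applied element by element
elementwise = A | B

# `or` needs bool(A), which is ambiguous for a multi-element array
try:
    A or B
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

print(elementwise)
```

Catching the exception here is only for demonstration; in the practice cells, seeing the `ValueError` is the intended outcome.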

## Fancy Indexing

[Chapter/section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/02.07-Fancy-Indexing.ipynb)

### Exploring Fancy Indexing

In [None]:
import numpy as np
rand = np.random.RandomState(42)

x = rand.randint(100, size=10)
print(x)

In [None]:
[x[3], x[7], x[2]]

In [None]:
ind = [3, 7, 4]
x[ind]

In [None]:
ind = np.array([[3, 7],
                [4, 5]])
x[ind]

In [None]:
X = np.arange(12).reshape((3, 4))
X

In [None]:
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
X[row, col]

In [None]:
X[row[:, np.newaxis], col]

In [None]:
row[:, np.newaxis] * col

### Combined Indexing

In [None]:
print(X)

In [None]:
X[2, [2, 0, 1]]


In [None]:
X[1:, [2, 0, 1]]

In [None]:
mask = np.array([1, 0, 1, 0], dtype=bool)
X[row[:, np.newaxis], mask]

### Example: Selecting Random Points

In [None]:
mean = [0, 0]
cov = [[1, 2],
       [2, 5]]
X = rand.multivariate_normal(mean, cov, 100)
X.shape

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()  # for plot styling

plt.scatter(X[:, 0], X[:, 1]);

In [None]:
indices = np.random.choice(X.shape[0], 20, replace=False)
indices

In [None]:
selection = X[indices]  # fancy indexing here
selection.shape

In [None]:
plt.scatter(X[:, 0], X[:, 1], alpha=0.3)
plt.scatter(selection[:, 0], selection[:, 1],
            facecolor='none', edgecolor='blue', s=200);


### Modifying Values with Fancy Indexing

In [None]:
x = np.arange(10)
i = np.array([2, 1, 8, 4])
x[i] = 99
print(x)

In [None]:
x[i] -= 10
print(x)

In [None]:
x = np.zeros(10)
x[[0, 0]] = [4, 6]
print(x)

In [None]:
i = [2, 3, 3, 4, 4, 4]
x[i] += 1
x

In [None]:
x = np.zeros(10)
np.add.at(x, i, 1)
print(x)

### Example: Binning Data

In [None]:
np.random.seed(42)
x = np.random.randn(100)

# compute a histogram by hand
bins = np.linspace(-5, 5, 20)
counts = np.zeros_like(bins)

# find the appropriate bin for each x
i = np.searchsorted(bins, x)

# add 1 to each of these bins
np.add.at(counts, i, 1)

In [None]:
# plot the results
# recent Matplotlib releases no longer accept linestyle='steps'
plt.plot(bins, counts, drawstyle='steps');


In [None]:
print("NumPy routine:")
%timeit counts, edges = np.histogram(x, bins)

print("Custom routine:")
%timeit np.add.at(counts, np.searchsorted(bins, x), 1)

In [None]:
x = np.random.randn(1000000)
print("NumPy routine:")
%timeit counts, edges = np.histogram(x, bins)

print("Custom routine:")
%timeit np.add.at(counts, np.searchsorted(bins, x), 1)

## Sorting Arrays

[Chapter/section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/02.08-Sorting.ipynb)

In [None]:
import numpy as np

def selection_sort(x):
    for i in range(len(x)):
        swap = i + np.argmin(x[i:])
        (x[i], x[swap]) = (x[swap], x[i])
    return x

In [None]:
x = np.array([2, 1, 4, 3, 5])
selection_sort(x)

In [None]:
def bogosort(x):
    while np.any(x[:-1] > x[1:]):
        np.random.shuffle(x)
    return x

In [None]:
x = np.array([2, 1, 4, 3, 5])
bogosort(x)

### Fast Sorting in NumPy: `np.sort` and `np.argsort`

In [None]:
x = np.array([2, 1, 4, 3, 5])
np.sort(x)

In [None]:
x.sort()
print(x)

In [None]:
x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x)
print(i)


In [None]:
x[i]

#### Sorting along rows or columns

In [None]:
rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))
print(X)

In [None]:
# sort each column of X
np.sort(X, axis=0)


In [None]:
# sort each row of X
np.sort(X, axis=1)

### Partial Sorts: Partitioning

In [None]:
x = np.array([7, 2, 3, 1, 6, 5, 4])
np.partition(x, 3)


In [None]:
np.partition(X, 2, axis=1)

### Example: k-Nearest Neighbors

In [None]:
X = rand.rand(10, 2)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # Plot styling
plt.scatter(X[:, 0], X[:, 1], s=100);


In [None]:
# for each pair of points, compute differences in their coordinates
differences = X[:, np.newaxis, :] - X[np.newaxis, :, :]
differences.shape

In [None]:
# square the coordinate differences
sq_differences = differences ** 2
sq_differences.shape


In [None]:
# sum the coordinate differences to get the squared distance
dist_sq = sq_differences.sum(-1)
dist_sq.shape

In [None]:
dist_sq.diagonal()

In [None]:
nearest = np.argsort(dist_sq, axis=1)
print(nearest)

In [None]:
K = 2
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)

In [None]:
plt.scatter(X[:, 0], X[:, 1], s=100)

# draw lines from each point to its two nearest neighbors
K = 2

for i in range(X.shape[0]):
    for j in nearest_partition[i, :K+1]:
        # plot a line from X[i] to X[j]
        # use some zip magic to make it happen:
        plt.plot(*zip(X[j], X[i]), color='black')

## Structured Data: NumPy's Structured Arrays

[Chapter/section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/02.09-Structured-Data-NumPy.ipynb)

In [None]:
import numpy as np

In [None]:
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

In [None]:
x = np.zeros(4, dtype=int)

In [None]:
# Use a compound data type for structured arrays
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                          'formats':('U10', 'i4', 'f8')})
print(data.dtype)

In [None]:
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)

In [None]:
# Get first row of data
data[0]

In [None]:
# Get the name from the last row
data[-1]['name']

In [None]:
# Get names where age is under 30
data[data['age'] < 30]['name']

### Creating Structured Arrays

In [None]:
np.dtype({'names':('name', 'age', 'weight'),
          'formats':('U10', 'i4', 'f8')})

In [None]:
np.dtype({'names':('name', 'age', 'weight'),
          'formats':((np.str_, 10), int, np.float32)})

In [None]:
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])

In [None]:
np.dtype('S10,i4,f8')

### More Advanced Compound Types

In [None]:
tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])
X = np.zeros(1, dtype=tp)
print(X[0])
print(X['mat'][0])

### RecordArrays: Structured Arrays with a Twist

In [None]:
data['age']

In [None]:
data_rec = data.view(np.recarray)
data_rec.age

In [None]:
%timeit data['age']
%timeit data_rec['age']
%timeit data_rec.age

---

# Section 2: Exercises

In this section, you will be provided a few exercises to demonstrate your understanding of the chapter contents. Each exercise will have a Markdown section describing the problem, and you will provide cells below the description with code, comments and visual demonstrations of your solution.

---

### Problem 1

Make sure you have the `array_from_url` function defined:

```python
def array_from_url(url, column):
    import pandas as pd
    import numpy as np
    data = pd.read_csv(url)
    return np.array(data[column])
```

Using the `array_from_url` function, load the following two data sets into memory using the variable names provided:

- variable: `areas`
    - URL: `"https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv"`
    - column: `"area (sq. mi)"`
- variable: `populations`
    - URL: `"https://raw.githubusercontent.com/jakevdp/PythonDataScienceHandbook/master/notebooks/data/state-population.csv"`
    - column: `"population"`

Compute a new variable: `pop_density` containing the population density of each of the states (plus D.C. and Puerto Rico). Population density is defined as the population divided by the area.

Use this NumPy array to answer the following questions.

- Which state has the highest population density and what is it?
- Which territory has the highest population density and what is it?
- What is the mean population density of just the United States in 2012?
- What is the mean population density of the United States and territories in 2012?
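
The questions above combine element-wise division with boolean masks. The following is a sketch on made-up values, not a solution: the names `names_demo`, `areas_demo`, `populations_demo`, and `is_state_demo` are hypothetical, and it assumes each position describes the same region in every array. If the population file turns out to hold more than one row per region (for example, one row per year), it would need to be filtered down to one row per region first; that is something to check against the data.

```python
import numpy as np

# Made-up values for illustration only; the real arrays come from the URLs above.
names_demo = np.array(["Alpha", "Beta", "Gamma", "Delta"])
areas_demo = np.array([100.0, 50.0, 10.0, 200.0])             # square miles
populations_demo = np.array([1000.0, 2000.0, 3000.0, 400.0])
is_state_demo = np.array([True, True, False, True])           # False marks a territory

# Element-wise division only makes sense if row i describes the same
# region in both arrays.
assert areas_demo.shape == populations_demo.shape
density_demo = populations_demo / areas_demo

# Highest density among states only: restrict to state rows, then argmax
state_idx = np.flatnonzero(is_state_demo)
densest_state = names_demo[state_idx[np.argmax(density_demo[state_idx])]]

# Means over the states only and over every region
mean_states = density_demo[is_state_demo].mean()
mean_all = density_demo.mean()
print(densest_state, mean_states, mean_all)
```

Indexing through `state_idx` keeps the `argmax` result tied to positions in the full arrays, so the same index can be used to look up the matching name.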

In [None]:
def array_from_url(url, column):
    import pandas as pd
    import numpy as np
    data = pd.read_csv(url)
    return np.array(data[column])

# Load the datasets
areas = array_from_url("https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv", "area (sq. mi)")
populations = array_from_url("https://raw.githubusercontent.com/jakevdp/PythonDataScienceHandbook/master/notebooks/data/state-population.csv", "population")

# Compute population density
pop_density = populations / areas

# Additional imports for further calculations
import pandas as pd
import numpy as np

# Load the state names to identify which state/territory corresponds to which density
state_names = pd.read_csv("https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv")['state']

# Create a DataFrame to handle population data with states and 


---

### Problem 2

Using the `array_from_url` function, load the following data set into memory using the variable name provided:

- variable: `titanic`
    - URL: `"https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"`
    - column: `"Age"`



Answer the following questions:

- What are the minimum, maximum, and mean ages of the following types of passengers on the Titanic?
    - All passengers
    - Survivors 
    - Those that died
- What is the percentage of male passengers that died?
- What is the percentage of female passengers that died?
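
If the age column contains missing values, plain reductions such as `np.mean` return `nan`. The sketch below uses made-up arrays (`age_demo` and `survived_demo` are hypothetical names) to show NaN-aware reductions and how a percentage can be read off a boolean mask; it is an illustration under that assumption, not a solution.

```python
import numpy as np

# Made-up ages and outcomes; np.nan marks a missing age
age_demo = np.array([22.0, np.nan, 38.0, 4.0, np.nan, 60.0])
survived_demo = np.array([0, 1, 1, 1, 0, 0])

# Plain reductions propagate NaN...
plain_mean = np.mean(age_demo)
# ...while the nan-aware versions skip missing entries
overall = (np.nanmin(age_demo), np.nanmax(age_demo), np.nanmean(age_demo))

# Select a subgroup with a boolean mask before reducing
survivor_ages = age_demo[survived_demo == 1]
survivor_mean = np.nanmean(survivor_ages)

# A percentage is the mean of a boolean mask times 100
pct_died = 100 * np.mean(survived_demo == 0)
print(plain_mean, overall, survivor_mean, pct_died)
```

Masks can be combined with `&` to restrict to a subgroup, in the same way the rainfall examples in Section 1 combine `rainy` and `summer`.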


In [None]:
import numpy as np

def array_from_url(url, column):
    import pandas as pd
    import numpy as np
    data = pd.read_csv(url)
    return np.array(data[column])

# Load the required data (column names are capitalized in this CSV, as in Problem 3)
titanic = array_from_url("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv", "Age")
survived = array_from_url("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv", "Survived")
sex = array_from_url("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv", "Sex")

# All passengers (nan-aware reductions skip missing ages)
all_min_age = np.nanmin(titanic)
all_max_age = np.nanmax(titanic)
all_mean_age = np.nanmean(titanic)

# Survivors
survivors_age = titanic[survived == 1]
survivors_min_age = np.nanmin(survivors_age)


---

### Problem 3

Define the following function:

```python
def titanic_structured():
    import pandas as pd, numpy as np
    data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
    cols = ['survived', 'pclass', 'sex', 'age', 'fare']
    sarray = np.zeros(len(data), dtype={'names':cols,'formats':('i4','i4','U10','f8','f8')})
    sarray['survived'] = data.Survived
    sarray['pclass'] = data.Pclass
    sarray['sex'] = data.Sex
    sarray['age'] = data.Age
    sarray['fare'] = data.Fare
    return sarray
```

Assign the output of this function to a new variable `titanic_new`, and answer the following questions:

- What is the average age of men that survived?
- What is the average age of women that survived?
- What is the [mode](https://www.mathsisfun.com/definitions/mode.html) of the class of survivors?
- What is the mode of the class of those that died?
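
One way to find a mode with NumPy alone is to count how often each distinct value occurs. This is a sketch on a made-up array (`pclass_demo` is a hypothetical name), not the required approach; `scipy.stats.mode` is an alternative.

```python
import numpy as np

# Made-up passenger classes for illustration
pclass_demo = np.array([3, 1, 3, 2, 3, 1])

# Count how often each distinct value occurs, then take the most frequent
values, counts = np.unique(pclass_demo, return_counts=True)
mode_demo = values[np.argmax(counts)]
print(mode_demo)
```

Because `np.unique` returns sorted values and `np.argmax` picks the first maximum, a tie resolves to the smallest of the tied values.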
   

In [None]:
import numpy as np
from scipy import stats

# Define the function
def titanic_structured():
    import pandas as pd, numpy as np
    data = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
    cols = ['survived', 'pclass', 'sex', 'age', 'fare']
    sarray = np.zeros(len(data), dtype={'names':cols,'formats':('i4','i4','U10','f8','f8')})
    sarray['survived'] = data.Survived
    sarray['pclass'] = data.Pclass
    sarray['sex'] = data.Sex
    sarray['age'] = data.Age
    sarray['fare'] = data.Fare
    return sarray

# Assign the output to a variable
titanic_new = titanic_structured()

# Calculate the average age of men that survived
