In [2]:
# imports, Python packages
import numpy as np
import pandas as pd

## NumPy

NumPy is a Python library for creating and manipulating **vectors** and **matrices**.

### Populate arrays with specific numbers

In [5]:
# create an 8-element vector
one_dimensional_array = np.array([1.2, 2.4, 3.5, 4.7, 6.1, 7.2, 8.3, 9.5])
print(one_dimensional_array)

# create a 3x2 matrix
two_dimensional_array = np.array([[6, 5], [11, 7], [4, 8]])
print(two_dimensional_array)

# all zeroes vector
all_zeroes_vector = np.zeros(3)
print(all_zeroes_vector)

# all zeroes matrix
all_zeroes_matrix = np.zeros((3, 5))
print(all_zeroes_matrix)

# all ones vector
all_ones_vector = np.ones(2)
print(all_ones_vector)

# all ones matrix
all_ones_matrix = np.ones((2, 4))
print(all_ones_matrix)

[1.2 2.4 3.5 4.7 6.1 7.2 8.3 9.5]
[[ 6  5]
 [11  7]
 [ 4  8]]
[0. 0. 0.]
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
[1. 1.]
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]]


### Populate arrays with sequences of numbers

In [None]:
# includes the lower bound (5) but not the upper bound (12)
sequence_of_integers = np.arange(5, 12)
print(sequence_of_integers)
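
The half-open range can be checked directly. The following is a minimal sketch that assumes only NumPy; the names `inclusive_upper` and `stepped` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

# np.arange(start, stop) includes start but excludes stop.
sequence_of_integers = np.arange(5, 12)
print(sequence_of_integers)   # [ 5  6  7  8  9 10 11]

# To include 12 in the result, pass a stop value of 13.
inclusive_upper = np.arange(5, 13)
print(inclusive_upper)

# An optional third argument sets the step between values.
stepped = np.arange(5, 12, 2)
print(stepped)                # [ 5  7  9 11]
```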

### Populate arrays with random numbers

In [28]:
# vector: populate a 6-element vector with random integers between 50 and 100
# np.random.randint excludes the high argument, so high=101 makes 100 the largest possible value
random_integers_between_50_and_100 = np.random.randint(low=50, high=101, size=6)
print(random_integers_between_50_and_100)

# matrix: populate a 6-row x 2-column matrix with random integers between 50 and 100
# this reassigns the same variable, so later cells see the matrix rather than the vector
random_integers_between_50_and_100 = np.random.randint(low=50, high=101, size=(6, 2))
print(random_integers_between_50_and_100)

# create random floating-point values in the half-open interval [0.0, 1.0)
# np.random.random is an alias for np.random.random_sample, so both calls behave the same way
random_floats_between_0_and_1 = np.random.random([6])
print(random_floats_between_0_and_1)
random_floats_between_0_and_1 = np.random.random_sample([6])
print(random_floats_between_0_and_1)

[96 97 72 91 50 99]
[[60 60]
 [89 76]
 [68 68]
 [94 69]
 [55 73]
 [56 92]]
[0.6344497  0.66448409 0.72847084 0.74583038 0.8198336  0.91088218]
[0.48820749 0.19581252 0.44706966 0.80381713 0.94261708 0.77724427]
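
NumPy 1.17 and later also provide a `Generator` interface through `np.random.default_rng`. The sketch below is an alternative to the calls above, not what this notebook uses; the seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# A seeded generator makes the random values reproducible (the seed is arbitrary).
rng = np.random.default_rng(seed=42)

# endpoint=True makes the high bound inclusive, so no "+ 1" is needed.
random_integers = rng.integers(low=50, high=100, size=6, endpoint=True)
print(random_integers)

# Floats drawn from the half-open interval [0.0, 1.0).
random_floats = rng.random(6)
print(random_floats)
```

Passing the same seed again reproduces the same values, which helps when a notebook has to be rerun with identical data.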


### Mathematical Operations on NumPy Operands

Linear algebra requires two vectors or matrices to have the same dimensions before you can add or subtract them, and it imposes strict rules on the dimensional compatibility of operands you want to multiply. NumPy relaxes these requirements with a trick called broadcasting, which virtually expands the smaller operand to dimensions compatible with the larger one. For example, the following operation uses broadcasting to add 2.0 to the value of every item in the vector of random floats created in the previous code cell:

In [11]:
# use broadcasting to add 2.0 to the value of every item in the vector 
random_floats_between_2_and_3 = random_floats_between_0_and_1 + 2.0
print(random_floats_between_2_and_3)

# relies on broadcasting to multiply each cell in the 6x2 matrix by 3
random_integers_between_150_and_300 = random_integers_between_50_and_100 * 3
print(random_integers_between_150_and_300)

[2.29373861 2.41670117 2.31839224 2.29266067 2.29389992 2.01538145]
[[171 297]
 [153 183]
 [249 264]
 [192 174]
 [213 204]
 [294 183]]
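
Broadcasting is not limited to scalars. The sketch below uses small hypothetical arrays (`matrix` and `row` are not part of this notebook) to show a row vector being stretched across a matrix, and a shape mismatch that NumPy refuses to broadcast.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])                # shape (3,)

# A scalar is stretched across every element of the matrix.
print(matrix * 3)

# The row is stretched across both rows because the trailing dimensions match (3 and 3).
print(matrix + row)

# Shapes (2, 3) and (2,) have mismatched trailing dimensions, so NumPy raises ValueError.
try:
    matrix + np.array([1, 2])
except ValueError as error:
    print('cannot broadcast:', error)
```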


### Task 1: Create a Linear Dataset

Your goal is to create a simple dataset consisting of a single feature and a label as follows:

1. Assign a sequence of integers from 6 to 20 (inclusive) to a NumPy array named feature.
2. Assign 15 values to a NumPy array named label such that:

`label = (3)(feature) + 4`

In [16]:
feature = np.arange(6, 21)
print('feature is:', feature)

label = (3 * feature) + 4
print('label is:', label)

feature is: [ 6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
label is: [22 25 28 31 34 37 40 43 46 49 52 55 58 61 64]


### Task 2: Add Some Noise to the Dataset

To make your dataset a little more realistic, insert a little random noise into each element of the label array you already created. To be more precise, modify each value assigned to label by adding a different random floating-point value between -2 and +2.

Don't rely on broadcasting. Instead, create a noise array having the same dimension as label.

In [29]:
# https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.random.html
# scale [0.0, 1.0) by the interval width (2 - (-2) = 4), then shift by the lower bound (-2)
noise = 4 * np.random.random_sample(label.shape) - 2
print(noise)

noisy_label = label + noise
print('label + noise is:', noisy_label)


[-1.62534518  1.04133897 -1.67653797  1.32551858 -0.73939619  1.82712969
  0.8616976  -0.56922894  1.32027112 -0.64536759  1.3326531  -1.77424582
  1.722513   -0.08886094 -0.51412373]
label + noise is: [20.37465482 26.04133897 26.32346203 32.32551858 33.26060381 38.82712969
 40.8616976  42.43077106 47.32027112 48.35463241 53.3326531  53.22575418
 59.722513   60.91113906 63.48587627]
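
The same noise can probably be drawn more directly with `np.random.uniform`, which takes the bounds as arguments. This is an alternative sketch rather than the approach used in the cell above; it rebuilds `feature` and `label` so that it runs on its own.

```python
import numpy as np

feature = np.arange(6, 21)
label = (3 * feature) + 4

# Draw one value per label element from the half-open interval [-2.0, 2.0).
# size=label.shape gives noise the same dimension as label, so no broadcasting is involved.
noise = np.random.uniform(low=-2.0, high=2.0, size=label.shape)
noisy_label = label + noise

print(noise.shape == label.shape)
print(noisy_label)
```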


## Pandas 

A DataFrame is similar to an in-memory spreadsheet. Like a spreadsheet:

* A DataFrame stores data in cells.
* A DataFrame has named columns (usually) and numbered rows.


### Creating a DataFrame

In [3]:
# Create and populate a 5x2 NumPy array.
my_data = np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]])

# Create a Python list that holds the names of the two columns.
my_column_names = ['temperature', 'activity']

# Create a DataFrame.
my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names)

# Print the entire DataFrame
print(my_dataframe)

   temperature  activity
0            0         3
1           10         7
2           20         9
3           30        14
4           40        15
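
A DataFrame can also be built from a Python dictionary that maps column names to lists of values. The sketch below is an alternative construction of the same data, not the method used in the cell above, and it inspects the result through the `shape`, `columns`, and `index` attributes.

```python
import pandas as pd

# Each dictionary key becomes a column name; each list becomes that column's values.
my_dataframe = pd.DataFrame({
    'temperature': [0, 10, 20, 30, 40],
    'activity': [3, 7, 9, 14, 15],
})

print(my_dataframe.shape)          # (5, 2)
print(list(my_dataframe.columns))  # ['temperature', 'activity']
print(list(my_dataframe.index))    # [0, 1, 2, 3, 4]
```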


### Adding a new column to a DataFrame

You may add a new column to an existing pandas DataFrame just by assigning values to a new column name. For example, the following code creates a third column named adjusted in my_dataframe:

In [4]:
# Create a new column named adjusted.
my_dataframe["adjusted"] = my_dataframe["activity"] + 2

# Print the entire DataFrame
print(my_dataframe)

   temperature  activity  adjusted
0            0         3         5
1           10         7         9
2           20         9        11
3           30        14        16
4           40        15        17
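
Assigning to a new column name modifies the DataFrame in place. If the original should stay untouched, `DataFrame.assign` returns a new DataFrame instead. The sketch below uses a shortened copy of the data, and the name `with_adjusted` is illustrative.

```python
import pandas as pd

my_dataframe = pd.DataFrame({'temperature': [0, 10, 20], 'activity': [3, 7, 9]})

# assign() returns a new DataFrame with the extra column; my_dataframe is unchanged.
with_adjusted = my_dataframe.assign(adjusted=my_dataframe['activity'] + 2)

print(list(with_adjusted.columns))
print(list(my_dataframe.columns))
```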


### Specifying a subset of a DataFrame

pandas provides multiple ways to isolate specific rows, columns, slices, or cells in a DataFrame:

In [5]:
print("Rows #0, #1, and #2:")
print(my_dataframe.head(3), '\n')

print("Row #2:")
print(my_dataframe.iloc[[2]], '\n')

print("Rows #1, #2, and #3:")
print(my_dataframe[1:4], '\n')

print("Column 'temperature':")
print(my_dataframe['temperature'])

Rows #0, #1, and #2:
   temperature  activity  adjusted
0            0         3         5
1           10         7         9
2           20         9        11 

Row #2:
   temperature  activity  adjusted
2           20         9        11 

Rows #1, #2, and #3:
   temperature  activity  adjusted
1           10         7         9
2           20         9        11
3           30        14        16 

Column 'temperature':
0     0
1    10
2    20
3    30
4    40
Name: temperature, dtype: int64
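
The cell above selects rows and columns but not an individual cell. A minimal sketch of cell-level access with `loc` (label-based) and `iloc` (position-based) follows; it rebuilds the two-column DataFrame so that it runs on its own.

```python
import numpy as np
import pandas as pd

my_dataframe = pd.DataFrame(
    data=np.array([[0, 3], [10, 7], [20, 9], [30, 14], [40, 15]]),
    columns=['temperature', 'activity'],
)

# Label-based: row label 2, column name 'activity'.
print(my_dataframe.loc[2, 'activity'])    # 9

# Position-based: row position 2, column position 1.
print(my_dataframe.iloc[2, 1])            # 9

# Unlike Python slices, .loc slices include the end label.
print(my_dataframe.loc[1:3, 'temperature'].tolist())   # [10, 20, 30]
```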


### Task 1: Create a DataFrame

Do the following:

1. Create a 3x4 (3 rows x 4 columns) pandas DataFrame in which the columns are named Eleanor, Chidi, Tahani, and Jason. Populate each of the 12 cells in the DataFrame with a random integer between 0 and 100, inclusive.

2. Output the following:
* the entire DataFrame
* the value in the cell of row #1 of the Eleanor column

3. Create a fifth column named Janet, which is populated with the row-by-row sums of Tahani and Jason.

To complete this task, it helps to know the NumPy basics covered in the NumPy UltraQuick Tutorial.

In [23]:
# Create and populate a 3x4 NumPy array (high=101 so that 100 is included).
my_data = np.random.randint(low=0, high=101, size=(3, 4))

# Create a Python list that holds the names of the columns.
my_column_names = ['Eleanor', 'Chidi', 'Tahani', 'Jason']

# Create a DataFrame.
my_dataframe = pd.DataFrame(data=my_data, columns=my_column_names)

# Print the entire DataFrame
print(my_dataframe)


# Print the value in the cell of row #1 of the Eleanor column.
# iloc[[1]] returns a one-row DataFrame, so indexing its Eleanor column
# prints a one-element Series rather than a bare value.
row_1 = my_dataframe.iloc[[1]]
print('The value in the cell of row #1 of the Eleanor column is', row_1['Eleanor'])

# Simpler: select the column first, then the row label, to get the scalar.
print('The value in the cell of row #1 of the Eleanor column is', my_dataframe['Eleanor'][1])


# Create a fifth column named Janet holding the row-by-row sums of Tahani and Jason.
my_dataframe['Janet'] = my_dataframe['Tahani'] + my_dataframe['Jason']
print('my_dataframe with a fifth column named Janet, which is populated with the row-by-row sums of Tahani and Jason')
print(my_dataframe)

   Eleanor  Chidi  Tahani  Jason
0       44     40      39     47
1        4     82      80     77
2       27     23      20     15
The value in the cell of row #1 of the Eleanor column is 1    4
Name: Eleanor, dtype: int64
The value in the cell of row #1 of the Eleanor column is 4
my_dataframe with a fifth column named Janet, which is populated with the row-by-row sums of Tahani and Jason
   Eleanor  Chidi  Tahani  Jason  Janet
0       44     40      39     47     86
1        4     82      80     77    157
2       27     23      20     15     35
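
The same task can be sketched with two other pandas calls: `at` for reading a single cell and `sum(axis=1)` for row-by-row sums. This is an alternative to the solution above, not a replacement for it; the name `eleanor_row_1` is illustrative, and the random values will differ from the output shown.

```python
import numpy as np
import pandas as pd

my_dataframe = pd.DataFrame(
    data=np.random.randint(low=0, high=101, size=(3, 4)),
    columns=['Eleanor', 'Chidi', 'Tahani', 'Jason'],
)

# .at reads one cell by row label and column name and returns a scalar.
eleanor_row_1 = my_dataframe.at[1, 'Eleanor']
print('The value in the cell of row #1 of the Eleanor column is', eleanor_row_1)

# Select the two columns, then sum across them for each row (axis=1).
my_dataframe['Janet'] = my_dataframe[['Tahani', 'Jason']].sum(axis=1)
print(my_dataframe)
```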
