# Practical Data Analysis Using Jupyter Notebook

## Ch. 3: Getting Started with NumPy

---

## Introduction

This is an introduction to the Python library `NumPy`, based on my notes from the excellent book *Practical Data Analysis Using Jupyter Notebook*. Please enjoy!

## Goals
* Understanding a Python NumPy array and its importance
* Differences between single and multiple dimensional arrays
* Making your first NumPy array
* Practical use cases of NumPy and arrays

## Imports

In [14]:
import numpy as np
np.__version__

'1.23.1'

---

## Understanding a Python NumPy array and its importance

According to the documentation, the original purpose of NumPy was to extend Python to allow the manipulation of large sets of objects organized in a grid-like fashion.

Core Python has no built-in multidimensional array type. The closest feature is the `list`, which has limitations in performance and scalability.

The NumPy library is all about arrays.

More formally, an array is a container that stores a list of values, or collections of values, called elements. All elements share a single data type, which is set when the array is created; converting to another type, for example with `astype()`, produces a new array rather than changing the existing one.

The most common `dtype` values are Boolean for true/false values, character/string types for words and text, `float` for decimal numbers, and `int` for integers.
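To make the "one data type per array" idea concrete, here is a minimal sketch. The variable names are made up for illustration, and the exact integer width (`int32` vs `int64`) depends on the platform, so the example only checks the general kind of each dtype.

```python
import numpy as np

# Mixing an int and a float: NumPy picks one dtype that fits every element.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype.kind)  # 'f' means a floating-point dtype

# Booleans get their own dtype.
flags = np.array([True, False, True])
print(flags.dtype.name)  # bool

# astype() returns a new array; the original keeps its dtype.
as_int = mixed.astype(int)
print(mixed.dtype.kind, as_int.dtype.kind)  # f i
print(as_int)  # [1 2 3]  (2.5 is truncated to 2)
```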


### Differences between single and multiple dimensional arrays

If the array has only one dimension, it represents a list of values in a single row or column (but not both).

In [2]:
oneD_array = ([1, 2, 3, 4, 5])

A two-dimensional array, also known as a matrix, would be any combination of multiple rows and columns.

In [3]:
twoD_array = (
  [1, 'a'],
  [2, 'b'],
  [3, 'c'],
  [4, 'e'],
  [5, 'f']
)

You may have already realized from the examples that a structured data table that is made up of rows and columns is a two-dimensional array!
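The two cells above use plain Python lists and tuples to show the layout. As a rough sketch, the same data can be wrapped in `np.array()` to get real NumPy arrays, and `ndim` and `shape` then report the dimensions. The names `one_d` and `two_d` are just for this example.

```python
import numpy as np

one_d = np.array([1, 2, 3, 4, 5])
two_d = np.array([[1, 'a'], [2, 'b'], [3, 'c'], [4, 'e'], [5, 'f']])

print(one_d.ndim, one_d.shape)  # 1 (5,)
print(two_d.ndim, two_d.shape)  # 2 (5, 2)

# All elements must share one dtype, so the integers are stored as strings here.
print(two_d.dtype.kind)  # 'U' means a Unicode string dtype
print(repr(two_d[0, 0]))
```

Note that the mixed integer/letter table ends up as an array of strings, which is one reason tabular data with mixed column types is usually easier to handle column by column.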

If the array has more than one dimension, you can reference the values along the axis (X, Y, or Z).

The core feature of the `numpy` package is the `ndarray` object. It supports any number of dimensions, which is why it is called n-dimensional.

A 3D cube with an X, Y, and Z axis can also be created using NumPy arrays.
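A small sketch of such a cube, assuming we just fill it with the numbers 0 to 23 so the positions are easy to follow. The name `cube` and the 2 x 3 x 4 shape are arbitrary choices for this example.

```python
import numpy as np

# 24 consecutive integers arranged as 2 layers, each with 3 rows and 4 columns.
cube = np.arange(24).reshape(2, 3, 4)

print(cube.ndim)   # 3
print(cube.shape)  # (2, 3, 4)

# One index per axis picks a single element: layer 1, row 2, column 3.
print(cube[1, 2, 3])  # 23

# A single index returns a whole 2D layer.
print(cube[0].shape)  # (3, 4)
```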

Some other key features of NumPy include the following:
* The ability to perform mathematical calculations against big datasets
* Using operators to compare values such as greater than and less than
* Combining values in two or more arrays together
* Referencing individual elements in the sequence from how they are stored

### Making your first NumPy array


In [4]:
my_first_array = np.array([1, 2, 3]) # 1D array
print(my_first_array)

[1 2 3]


Now that we have an array available, let's walk through how you can verify the contents.

### Useful array functions

Commands to run against any array in NumPy to give you metadata:

* `array.shape` : an attribute that provides the array dimensions

In [5]:
my_first_array.shape

(3,)

* `array.size` : an attribute that shows the number of array elements (similar to the number of cells in a table)

In [7]:
my_first_array.size # shows the number of array elements (similar to the number of cells in a table)

3

* `len()` : shows the length of the array (for a multidimensional array, the size of its first dimension)

In [6]:
len(my_first_array)

3

* `array.dtype.name` : provides the data type of the array elements

In [8]:
my_first_array.dtype.name

'int32'

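* `array.astype(int)` : returns a copy of the array converted to a different data type. In this example the target is the platform's default integer, which displays as `int64` on most Linux and macOS installs and as `int32` on Windows with NumPy 1.x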

In [9]:
my_first_array.astype(int)

array([1, 2, 3])

To reference individual elements in the array, you use square brackets along with a whole number called the array index, which starts at 0 for the first element.  **[Bracket Notation]**
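A quick sketch of bracket notation, recreating `my_first_array` from earlier so the cell runs on its own:

```python
import numpy as np

my_first_array = np.array([1, 2, 3])

# Indexes start at 0, so [0] is the first element.
print(my_first_array[0])   # 1

# Negative indexes count from the end.
print(my_first_array[-1])  # 3

# A slice returns a range of elements (the stop index is excluded).
print(my_first_array[0:2])  # [1 2]
```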

Some useful statistical functions you can run against numeric arrays that have `dtype` of `int` or `float` include the following:

* `array.sum()` : sums all of the element values

In [10]:
my_first_array.sum()

6

* `array.min()` : provides the minimum element value in the entire array

In [11]:
my_first_array.min()

1

* `array.max()` : provides the maximum element value in the entire array

In [12]:
my_first_array.max()

3

* `array.mean()` : provides the mean or average, which is the sum of the elements divided by the count of the elements

In [13]:
my_first_array.mean()

2.0

## Practical use cases of NumPy and arrays

Here's the scenario: you are a data analyst who wants to know the highest daily closing price for a stock ticker for the current Year To Date (YTD). To do this, you can use an array to store each value as an element, sort the price elements from high to low, and then print the first element, which displays the highest price as the output value.
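Before loading a whole file, the idea can be sketched with a handful of hand-typed values. The five prices below are the first five closing prices that appear in the outputs further down; the variable names are only for this example. `np.max()` is included as a cross-check, since it gives the same answer without sorting.

```python
import numpy as np

# First five closing prices, typed in by hand.
closing_prices = np.array([157.919998, 142.190002, 148.259995, 147.929993, 150.75])

# Sort ascending, then reverse to get high-to-low order.
high_to_low = np.sort(closing_prices)[::-1]

highest_price = high_to_low[0]
print(highest_price)  # 157.919998

# Same result without sorting.
print(np.max(closing_prices))  # 157.919998
```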

### Assigning values to arrays directly

A more scalable option than manually assigning values to the array is to use another NumPy command, the `genfromtxt()` function, which loads the values from a file.

The `genfromtxt()` function takes multiple required and optional parameters. Let's walk through the ones needed to answer our business question:

* The first parameter is the filename, which is assigned to the file we upload, named `AAPL_stock_price_example.csv`.
* The second parameter is the delimiter, which is a comma since that is how the input file is structured.
* The next parameter is to inform the function that our input data file has a header by assigning the `names=` parameter to `True`.
* The last parameter is `usecols=`, which defines the specific column to read the data from.

According to the `genfromtxt()` function help, column numbering for the `usecols=` parameter starts at **0**, so the first column is column 0. Since we need the `Close` column, which is the second column in our input file, we set the parameter value to **1**.

Take a look:

In [15]:
input_stock_price_array = np.genfromtxt('../data/source/AAPL_stock_price_example.csv', delimiter=',', names=True, usecols=(1))
input_stock_price_array.size

229

In [16]:
sorted_stock_price_array = np.sort(input_stock_price_array)[::-1] # descending order

print('Closing stock price in order of day traded : ', input_stock_price_array[:5])
print('Closing stock price in order from high to low : ', sorted_stock_price_array[:5])

Closing stock price in order of day traded :  [(157.919998,) (142.190002,) (148.259995,) (147.929993,) (150.75    ,)]
Closing stock price in order from high to low :  [(267.100006,) (266.369995,) (266.290009,) (265.76001 ,) (264.470001,)]


In [19]:
print('Highest closing stock price : ', sorted_stock_price_array[0])

Highest closing stock price :  (267.100006,)
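The values above print as one-element tuples such as `(267.100006,)` because `names=True` makes `genfromtxt()` return a structured array with a named `Close` field. The sketch below shows two ways to get plain floats instead. It reads from an in-memory string rather than the real file; the `Date` header and the dates are assumptions made up for the example, while the three prices come from the output above.

```python
import io
import numpy as np

# Hypothetical stand-in for the CSV file (header name "Date" and dates are assumed).
csv_text = (
    "Date,Close\n"
    "2019-01-02,157.919998\n"
    "2019-01-03,142.190002\n"
    "2019-01-04,148.259995\n"
)

# Option 1: keep names=True and pull the field out by its column name.
structured = np.genfromtxt(io.StringIO(csv_text), delimiter=',', names=True, usecols=(1,))
print(structured.dtype.names)  # ('Close',)
close_prices = structured['Close']
print(close_prices.max())  # 157.919998

# Option 2: skip the header line instead of naming the columns.
plain = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1, usecols=1)
print(plain.shape)  # (3,)
```

Either way the result is an ordinary float array, so the highest price prints as `157.919998` rather than as a tuple.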


### Assigning values to an array using a loop

Another approach, which takes more code but gives more control over data quality while populating the array, is to use a loop.

A summary of the process is as follows:
1.  Read the file into memory
2.  Loop through each individual record
3.  Strip out a value from each record
4.  Assign each value to a temporary array
5.  Clean up the array
6.  Sort the array in descending order
7.  Print the first element in the array to display the highest price

In [20]:
temp_array = []

#1. read the file into memory
with open('../data/source/AAPL_stock_price_example.csv', 'r') as input_file:
  #1.a load all the data into a variable
  all_lines_from_input_file = input_file.readlines()
  #2. loop through each individual record
  for each_individual_line in all_lines_from_input_file:
    #3. strip out a value from each record (every column after the first)
    for value_from_line in each_individual_line.split(',')[1:]:
      #3.a remove surrounding whitespace, including the trailing newline
      clean_value_from_line = value_from_line.strip()
      #4. assign each value to the new array by element
      temp_array.append(clean_value_from_line)

print(temp_array[:5])

['Close', '157.919998', '142.190002', '148.259995', '147.929993']


In [21]:
#5. clean up the array
temp_array = np.delete(temp_array, 0) # removes the header row
temp_array.size

229

In [22]:
input_stock_price_array = temp_array.astype(float) # cast as float, removes quotes on values
print(input_stock_price_array[:5])

[157.919998 142.190002 148.259995 147.929993 150.75    ]


In [23]:
#6. sort the array in descending order
sorted_stock_price_array = np.sort(input_stock_price_array)[::-1]

print('Closing stock price in order of day traded: ', input_stock_price_array[:5])
print('Closing stock price in order from high to low: ', sorted_stock_price_array[:5])

Closing stock price in order of day traded:  [157.919998 142.190002 148.259995 147.929993 150.75    ]
Closing stock price in order from high to low:  [267.100006 266.369995 266.290009 265.76001  264.470001]


In [24]:
#7. print the first element in the array to display the highest price
print('Highest closing stock price: ', sorted_stock_price_array[0])

Highest closing stock price:  267.100006
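The extra control over data quality mentioned at the start of this section could look something like the sketch below. It is not from the book: it uses Python's built-in `csv` module on an in-memory string, and the `Date` header, the dates, and the two bad rows are all made up to show how unusable values can be skipped inside the loop.

```python
import csv
import io
import numpy as np

# Hypothetical input with one empty price and one non-numeric price.
csv_text = (
    "Date,Close\n"
    "2019-01-02,157.919998\n"
    "2019-01-03,\n"
    "2019-01-04,n/a\n"
    "2019-01-07,148.259995\n"
)

clean_prices = []
skipped_rows = 0

# DictReader uses the header row as keys, so no header cleanup is needed later.
for row in csv.DictReader(io.StringIO(csv_text)):
    try:
        clean_prices.append(float(row['Close']))
    except ValueError:
        # The value could not be read as a number; count it and move on.
        skipped_rows += 1

price_array = np.array(clean_prices)
print(price_array.max())  # 157.919998
print(skipped_rows)       # 2
```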
