**Introduction to NumPy**

NumPy is a widely used Python package for numerical computing and data analysis. Using NumPy can speed up your workflow, especially for math and statistical operations, and it lets you interface with other packages in the Python ecosystem, such as scikit-learn and pandas, that use NumPy under the hood.

In [4]:
# Let's start by importing the modules that we'll need for this tutorial
import csv
import numpy as np

In [19]:
# We'll start by looking at the differences between a standard Python list and a NumPy array

data_file_name = "diabetes_num.csv"

# Instead of iterating through the csv file as we did in past tutorials and assignments,
# let's convert it into a list. We'll end up with an object that is essentially a list of lists,
# or a list of data rows from the original csv data file.
# The with statement closes the file for us once the data has been read.
with open(data_file_name, newline="") as data_file:
    list_ds = list(csv.reader(data_file))

# Print the first three data rows (index 0 is the header row, so we start at 1)
print(list_ds[1:4])



[['1000', '203', '82', '56', '3.599999905', '4.309999943', '46', '62', '121', '118', '59', '29', '38'], ['1001', '165', '97', '24', '6.900000095', '4.440000057', '29', '64', '218', '112', '68', '46', '48'], ['1002', '228', '92', '37', '6.199999809', '4.639999866', '58', '61', '256', '190', '92', '49', '57']]
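
Note that every value in the output above is a string, because csv.reader does not convert types. The short sketch below illustrates one practical difference between a list and a NumPy array, using the first three fields of the first data row shown above. The variable names are made up for this illustration.

```python
import numpy as np

# First three fields of the first data row printed above (still strings)
row = ['1000', '203', '82']

# Multiplying a list repeats it; it does not do arithmetic on the elements
doubled_list = row * 2

# Converting to a NumPy array with dtype=float parses the strings into numbers,
# and multiplication is then applied to every element
arr = np.array(row, dtype=float)
doubled_arr = arr * 2

print(doubled_list)
print(doubled_arr)
```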


**NumPy 2-Dimensional Arrays**

With NumPy, we work with multidimensional arrays. For now, we'll focus on 2-dimensional arrays. A 2-dimensional array is also known as a matrix. A matrix has rows and columns. By specifying a row number and a column number, we're able to extract an element from a matrix.
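
As a minimal sketch of extracting elements by row and column number, the example below uses a small made-up matrix rather than the diabetes data. Keep in mind that NumPy indices start at 0.

```python
import numpy as np

# A small made-up matrix with 2 rows and 3 columns (not the diabetes data)
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# The row index comes first, then the column index; both start at 0
element = matrix[1, 2]      # second row, third column
first_row = matrix[0]       # a whole row as a 1D array
second_col = matrix[:, 1]   # a whole column as a 1D array

print(element)
print(first_row)
print(second_col)
```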

In [27]:
# Convert the Python list to a 2D NumPy array.
# Pass the list of lists list_ds into the numpy.array function, which converts it into a NumPy array.
# The first slice [1:] excludes the header row; the second slice [1:5] then selects
# four data rows, starting from the second data row.
# Specify the 'dtype' argument to make sure each string element is converted to a float.

np_ds = np.array(list_ds[1:][1:5], dtype=float)
np_ds


array([[ 1001.        ,   165.        ,    97.        ,    24.        ,
            6.9000001 ,     4.44000006,    29.        ,    64.        ,
          218.        ,   112.        ,    68.        ,    46.        ,
           48.        ],
       [ 1002.        ,   228.        ,    92.        ,    37.        ,
            6.19999981,     4.63999987,    58.        ,    61.        ,
          256.        ,   190.        ,    92.        ,    49.        ,
           57.        ],
       [ 1003.        ,    78.        ,    93.        ,    12.        ,
            6.5       ,     4.63000011,    67.        ,    67.        ,
          119.        ,   110.        ,    50.        ,    33.        ,
           38.        ],
       [ 1005.        ,   249.        ,    90.        ,    28.        ,
            8.89999962,     7.71999979,    64.        ,    68.        ,
          183.        ,   138.        ,    80.        ,    44.        ,
           41.        ]])

In [28]:
# We can check the number of rows and columns in our data using the shape attribute of NumPy arrays:
np_ds.shape

(4, 13)
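
Because shape is a plain tuple, it can be unpacked directly. The sketch below uses a stand-in array with the same shape as np_ds (its contents are zeros, not the real data) and also shows the related ndim and size attributes.

```python
import numpy as np

# Stand-in array with the same shape as np_ds above; the values are not the real data
stand_in = np.zeros((4, 13))

# shape is a tuple of (rows, columns) for a 2D array
n_rows, n_cols = stand_in.shape

print(n_rows, n_cols)
print(stand_in.ndim)   # number of dimensions
print(stand_in.size)   # total number of elements (rows * columns)
```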

**Alternative NumPy Array Creation Methods**

There are a variety of methods that we can use to create NumPy arrays. Sometimes it is useful to create an array where every element is zero, for example when you need an array of a fixed size but don't have any values for it yet. The code below uses numpy.zeros to create an array with 3 rows and 5 columns, where every element is 0.

In [30]:
empty_array = np.zeros((3,5))
empty_array

array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])
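
One way the "fixed size, values come later" pattern might look in practice is sketched below. The fill rule is a made-up placeholder for whatever computation would produce the real values.

```python
import numpy as np

# Preallocate a 3x5 array of zeros, then fill it in one row at a time
results = np.zeros((3, 5))

for i in range(results.shape[0]):
    # Hypothetical computation: row i holds the multiples of (i + 1)
    results[i] = (i + 1) * np.arange(5)

print(results)
```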

Sometimes it is also useful to create an array where each element is a random number, using numpy.random.rand. Each element is drawn uniformly from the interval [0, 1). Arrays like this are often used for testing your program's logic.

In [31]:
np.random.rand(3,5)

array([[ 0.29160107,  0.07887505,  0.87400499,  0.39425393,  0.03598815],
       [ 0.76400302,  0.80766258,  0.4910861 ,  0.61800219,  0.06220531],
       [ 0.04793067,  0.40991014,  0.0889837 ,  0.30882094,  0.30143846]])
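
When random arrays are used for testing, it often helps if the same numbers come back on every run. The sketch below uses NumPy's Generator interface (numpy.random.default_rng) with a fixed seed instead of numpy.random.rand; the seed value 42 is an arbitrary choice for this illustration.

```python
import numpy as np

# Two generators created with the same (arbitrary) seed produce the same numbers
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

# Generator.random takes the shape as a tuple, unlike np.random.rand(3, 5)
sample_a = rng_a.random((3, 5))
sample_b = rng_b.random((3, 5))

print(sample_a)
print(np.array_equal(sample_a, sample_b))
```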

In [34]:
# Sum of a single row: np_ds[1] selects the second row of the array
np.sum(np_ds[1])

2132.8399996749999

In [35]:
# Row sums: axis=1 adds across the columns, giving one total per row
np.sum(np_ds, axis=1)

array([ 1883.34000015,  2132.83999967,  1681.13000011,  2006.61999941])

In [36]:
# Column sums: axis=0 adds down the rows, giving one total per column
np.sum(np_ds, axis=0)

array([ 4011.        ,   720.        ,   372.        ,   101.        ,
          28.49999952,    21.42999983,   218.        ,   260.        ,
         776.        ,   550.        ,   290.        ,   172.        ,
         184.        ])
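
The axis argument is easy to mix up, so here is a minimal sketch on a small made-up array where the sums can be checked by hand.

```python
import numpy as np

# A small made-up array with 2 rows and 3 columns
small = np.array([[1, 2, 3],
                  [4, 5, 6]])

row_sums = np.sum(small, axis=1)   # one value per row: 1+2+3 and 4+5+6
col_sums = np.sum(small, axis=0)   # one value per column: 1+4, 2+5, 3+6
total = np.sum(small)              # no axis: sum of every element

print(row_sums)
print(col_sums)
print(total)
```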

In [37]:
# Specific row sum
np_ds[1].sum()

2132.8399996749999

In [39]:
# Specific column sum
np_ds[:, 1].sum()

720.0

In [45]:
# Specific column average
np_ds[:, 1].mean()

180.0

In [46]:
# Specific column max
np_ds[:, 1].max()

249.0

In [47]:
# Specific column standard deviation
np_ds[:, 1].std()

66.509397832186096
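
By default, std() computes the population standard deviation (it divides by N). If the sample standard deviation is wanted instead (dividing by N - 1), the ddof argument can be set to 1. The sketch below re-creates column 1 from the values shown in the array output earlier, so it runs without the data file.

```python
import numpy as np

# Column 1 of np_ds, copied from the array output shown earlier
column = np.array([165.0, 228.0, 78.0, 249.0])

population_std = column.std()        # divides by N (the default, ddof=0)
sample_std = column.std(ddof=1)      # divides by N - 1

print(population_std)
print(sample_std)
```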

**Assigning Values to NumPy Arrays**

We can also use indexing to assign values to certain elements in arrays. 

In [50]:
# We can change a value in a NumPy array by assigning directly to the indexed "cell":
np_ds[1,5] = 10

np_ds[1,5]

10.0

In [51]:
# We can do the same for slices. To overwrite an entire column, we can do this:
np_ds[:,5] = 50
np_ds[:,5]

array([ 50.,  50.,  50.,  50.])
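
Assignments like these change the array in place, so the original values are lost. If they are still needed, one option is to assign into a copy instead. The sketch below uses a small made-up array; the names original and working are hypothetical.

```python
import numpy as np

# A small made-up array standing in for real data
original = np.array([[1.0, 2.0],
                     [3.0, 4.0]])

# copy() returns an independent array, so edits to it leave the original untouched
working = original.copy()
working[:, 1] = 50

print(original)
print(working)
```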