# NumPy

NumPy is a commonly used Python data analysis package. By using NumPy, you can speed up your workflow and interface with other packages in the Python ecosystem.

In this tutorial, we'll walk through using NumPy to analyze data on wine quality. The data contains information on various attributes of wines, such as pH and fixed acidity, along with a quality score between 0 and 10 for each wine. The quality score is the average of at least 3 human taste testers. As we learn how to work with NumPy, we'll try to figure out more about the perceived quality of wine.

Here are the first few rows of the winequality-red.csv file, which we'll be using throughout this tutorial:

"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5

Before using NumPy, we'll first try to work with the data using Python and the csv package. 

In [123]:
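import csv

# A raw string (r"...") keeps the backslashes in the Windows path from being read as escape sequences.
with open(r"C:\Users\hp\Documents\winequality-red.csv", "r") as f:
    csv_reader = csv.reader(f, delimiter=';')
    wines = list(csv_reader)
print(wines[0:3])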

[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality'], ['7.4', '0.7', '0', '1.9', '0.076', '11', '34', '0.9978', '3.51', '0.56', '9.4', '5'], ['7.8', '0.88', '0', '2.6', '0.098', '25', '67', '0.9968', '3.2', '0.68', '9.8', '5']]


We can find the average quality of the wines using the code below. (Later in this tutorial, the same calculation is done with NumPy.)

In [108]:
qualities = [float(item[-1]) for item in wines[1:]]

sum(qualities) / len(qualities)

5.6360225140712945

Although we were able to do the calculation we wanted, the code is fairly complex, and it won't be fun to have to do something similar every time we want to compute a quantity. Luckily, we can use NumPy to make it easier to work with our data.
Let's start by creating a NumPy array using the numpy.array function:

In [125]:
import numpy as np

# Exclude the header row by slicing the list (wines[1:]).
# Specify the keyword argument dtype to make sure each element is converted to a float.
# NumPy arrays are homogeneous: every element has the same data type.
wines = np.array(wines[1:], dtype=float)
wines

array([[ 7.4  ,  0.7  ,  0.   , ...,  0.56 ,  9.4  ,  5.   ],
       [ 7.8  ,  0.88 ,  0.   , ...,  0.68 ,  9.8  ,  5.   ],
       [ 7.8  ,  0.76 ,  0.04 , ...,  0.65 ,  9.8  ,  5.   ],
       ...,
       [ 6.3  ,  0.51 ,  0.13 , ...,  0.75 , 11.   ,  6.   ],
       [ 5.9  ,  0.645,  0.12 , ...,  0.71 , 10.2  ,  5.   ],
       [ 6.   ,  0.31 ,  0.47 , ...,  0.66 , 11.   ,  6.   ]])

In [5]:
# We can check the number of rows and columns in our data using the shape property of NumPy arrays:
wines.shape

(1599L, 12L)

Alternative NumPy Array Creation Methods

In [6]:
# The below code will create an array with 3 rows and 4 columns, where every element is 0, using numpy.zeros
import numpy as np
empty_array = np.zeros((3,4))
empty_array

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [7]:
# You can also create an array where each element is a random number using numpy.random.rand
np.random.rand(3,4)

array([[0.15944622, 0.15532579, 0.21514882, 0.99475802],
       [0.47539513, 0.68637007, 0.71241175, 0.41687567],
       [0.01846695, 0.82132293, 0.44951237, 0.73569538]])
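A few other constructors follow the same pattern as numpy.zeros and numpy.random.rand. The sketch below is illustrative only: the variable names and the seed value 0 are assumptions chosen for the example, and the seeded generator is shown simply as one way to make random arrays repeatable.

```python
import numpy as np

# numpy.ones and numpy.full work like numpy.zeros, with a different fill value.
ones = np.ones((2, 3))
sevens = np.full((2, 3), 7.0)

# numpy.arange builds a 1-dimensional range, much like Python's range().
steps = np.arange(0, 10, 2)

# Two generators created with the same seed produce the same random numbers.
random_a = np.random.default_rng(0).random((3, 4))
random_b = np.random.default_rng(0).random((3, 4))

print(ones.shape)
print(steps)
print(np.array_equal(random_a, random_b))
```

Seeding is useful when an example or test needs the same "random" array on every run.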

Using NumPy To Read In Files
Use the genfromtxt function to read in the winequality-red.csv file.

In [126]:
# Create a NumPy array from the first ten wines
wines_first_ten = np.array(wines[:10], dtype=float)
wines_first_ten

array([[7.400e+00, 7.000e-01, 0.000e+00, 1.900e+00, 7.600e-02, 1.100e+01,
        3.400e+01, 9.978e-01, 3.510e+00, 5.600e-01, 9.400e+00, 5.000e+00],
       [7.800e+00, 8.800e-01, 0.000e+00, 2.600e+00, 9.800e-02, 2.500e+01,
        6.700e+01, 9.968e-01, 3.200e+00, 6.800e-01, 9.800e+00, 5.000e+00],
       [7.800e+00, 7.600e-01, 4.000e-02, 2.300e+00, 9.200e-02, 1.500e+01,
        5.400e+01, 9.970e-01, 3.260e+00, 6.500e-01, 9.800e+00, 5.000e+00],
       [1.120e+01, 2.800e-01, 5.600e-01, 1.900e+00, 7.500e-02, 1.700e+01,
        6.000e+01, 9.980e-01, 3.160e+00, 5.800e-01, 9.800e+00, 6.000e+00],
       [7.400e+00, 7.000e-01, 0.000e+00, 1.900e+00, 7.600e-02, 1.100e+01,
        3.400e+01, 9.978e-01, 3.510e+00, 5.600e-01, 9.400e+00, 5.000e+00],
       [7.400e+00, 6.600e-01, 0.000e+00, 1.800e+00, 7.500e-02, 1.300e+01,
        4.000e+01, 9.978e-01, 3.510e+00, 5.600e-01, 9.400e+00, 5.000e+00],
       [7.900e+00, 6.000e-01, 6.000e-02, 1.600e+00, 6.900e-02, 1.500e+01,
        5.900e+01, 9.964e-01, 3.

In [120]:
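# A raw string (r"...") keeps the backslashes in the Windows path from being read as escape sequences.
wines = np.genfromtxt(r"C:\Users\hp\Documents\winequality-red.csv", delimiter=";", skip_header=1)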
wines

array([[ 7.4  ,  0.7  ,  0.   , ...,  0.56 ,  9.4  ,  5.   ],
       [ 7.8  ,  0.88 ,  0.   , ...,  0.68 ,  9.8  ,  5.   ],
       [ 7.8  ,  0.76 ,  0.04 , ...,  0.65 ,  9.8  ,  5.   ],
       ...,
       [ 6.3  ,  0.51 ,  0.13 , ...,  0.75 , 11.   ,  6.   ],
       [ 5.9  ,  0.645,  0.12 , ...,  0.71 , 10.2  ,  5.   ],
       [ 6.   ,  0.31 ,  0.47 , ...,  0.66 , 11.   ,  6.   ]])
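To try genfromtxt without the file on disk, the same call can be pointed at an in-memory text buffer. This is a minimal sketch that assumes only the header and the two data rows quoted at the top of this tutorial; the names sample and sample_wines are made up for the example.

```python
import io
import numpy as np

# The header and the first two data rows shown at the top of this tutorial.
sample = io.StringIO(
    '"fixed acidity";"volatile acidity";"citric acid";"residual sugar";'
    '"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";'
    '"pH";"sulphates";"alcohol";"quality"\n'
    "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5\n"
)

# skip_header=1 drops the column names; the remaining fields are parsed as floats.
sample_wines = np.genfromtxt(sample, delimiter=";", skip_header=1)

print(sample_wines.shape)
print(sample_wines.dtype)
```

Without skip_header=1, the header row would be read as a row of nan values, because the column names cannot be parsed as numbers.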

Indexing NumPy Arrays
Just like Python lists, NumPy arrays are zero-indexed, meaning that the index of the first row is 0, and the index of the first column is 0.

In [9]:
# Let's select the element in the third row and second column (index 2, 1) -- volatile acidity
wines[2,1]

0.76

In [10]:
# Select the element in the second row and third column (index 1, 2) -- citric acid
wines[1,2]

0.0

In [11]:
# Slicing the array -- it's similar to slicing a Python list of lists.
# To select the first three items from the fourth column (residual sugar), we can use a colon (:):
wines[0:3, 3]

array([1.9, 2.6, 2.3])

In [12]:
# Just like with list slicing, it's possible to omit the 0 to retrieve all the elements from the beginning up to, but not including, index 3:
wines[:3,3]

array([1.9, 2.6, 2.3])

In [127]:
# We can select the entire third column (citric acid), across all rows, as follows:
wines[:,2]

array([0.  , 0.  , 0.04, ..., 0.13, 0.12, 0.47])

In [14]:
# We can also select the entire third row, across all columns, as follows:
wines[2,:]

array([7.80e+00, 7.60e-01, 4.00e-02, 2.30e+00, 9.20e-02, 1.50e+01,
       5.40e+01, 9.97e-01, 3.26e+00, 6.50e-01, 9.80e+00, 5.00e+00])

In [15]:
# If we take our indexing to the extreme, we can select the entire array using two colons to select all the rows and columns 
# in wines.
wines[:,:]

array([[ 7.4  ,  0.7  ,  0.   , ...,  0.56 ,  9.4  ,  5.   ],
       [ 7.8  ,  0.88 ,  0.   , ...,  0.68 ,  9.8  ,  5.   ],
       [ 7.8  ,  0.76 ,  0.04 , ...,  0.65 ,  9.8  ,  5.   ],
       ...,
       [ 6.3  ,  0.51 ,  0.13 , ...,  0.75 , 11.   ,  6.   ],
       [ 5.9  ,  0.645,  0.12 , ...,  0.71 , 10.2  ,  5.   ],
       [ 6.   ,  0.31 ,  0.47 , ...,  0.66 , 11.   ,  6.   ]])

In [None]:
# Select the last ten wines (rows 1589 through 1598)
wines[1589:1599,:]
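Hard-coding 1589:1599 only works when the row count is known in advance. Negative indices count from the end instead, so for the 1599-row wines array, wines[-10:, :] selects the same rows. The sketch below shows the idea on a small stand-in array; grid and the other names are assumptions made for the example.

```python
import numpy as np

# A small stand-in array: 5 rows and 3 columns holding the values 0..14.
grid = np.arange(15).reshape(5, 3)

last_two_rows = grid[-2:, :]   # a negative start counts back from the end
last_column = grid[:, -1]      # -1 refers to the last column
single_value = grid[-1, 0]     # first column of the last row

print(last_two_rows)
print(last_column)
print(single_value)
```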

In [16]:
# Assigning values to a NumPy array
wines[1,2] = 5
# We can do the same for slices. To overwrite an entire column, we can do this:
wines[:, 1] = 5
wines[:,:]

array([[ 7.4 ,  5.  ,  0.  , ...,  0.56,  9.4 ,  5.  ],
       [ 7.8 ,  5.  ,  5.  , ...,  0.68,  9.8 ,  5.  ],
       [ 7.8 ,  5.  ,  0.04, ...,  0.65,  9.8 ,  5.  ],
       ...,
       [ 6.3 ,  5.  ,  0.13, ...,  0.75, 11.  ,  6.  ],
       [ 5.9 ,  5.  ,  0.12, ...,  0.71, 10.2 ,  5.  ],
       [ 6.  ,  5.  ,  0.47, ...,  0.66, 11.  ,  6.  ]])

In [19]:
# Index 3 is the fourth wine, because indexing starts at 0.
fourth_wine = wines[3,:]
print(fourth_wine)
print(fourth_wine[1])

[11.2    5.     0.56   1.9    0.075 17.    60.     0.998  3.16   0.58
  9.8    6.   ]
5.0


In [None]:
# The numpy.random.rand function can also generate a one-dimensional array:
np.random.rand(3)

N-Dimensional Array
So far we've worked with 2-dimensional arrays, but there are cases when you'll want to deal with arrays that have three or more dimensions. This doesn't happen extremely often. One way to think of such an array is as a list of lists of lists. Let's say we want to store the monthly earnings of a store, but we want to be able to quickly look up the results for a quarter, and for a year. The earnings for one year might look like this:
[500, 505, 490, 810, 450, 678, 234, 897, 430, 560, 1023, 640]

In [20]:
# We can split up the earnings into quarters as follows:
year_one = [
    [500,505,490],
    [810,450,678],
    [234,897,430],
    [560,1023,640]
]
# We can retrieve the earnings from January by calling
print(year_one[0][0])
# If we want the results for a whole quarter, we can call
print(year_one[0])

500
[500, 505, 490]


We now have a 2-dimensional array, or matrix. But what if we now want to add the results from another year? We have to add a third dimension:

In [21]:
earnings = [
            [
                [500,505,490],
                [810,450,678],
                [234,897,430],
                [560,1023,640]
            ],
            [
                [600,605,490],
                [345,900,1000],
                [780,730,710],
                [670,540,324]
            ]
          ]

# We can retrieve the earnings from January of the first year by calling 
earnings[0][0][0]

500

In [22]:
# We now need three indexes to retrieve a single element. A three-dimensional array in NumPy is much the same. 
# In fact, we can convert earnings to an array and then get the earnings for January of the first year:
earnings = np.array(earnings)
print(earnings[0,0,0])
print(earnings.shape)

500
(2L, 4L, 3L)


In [23]:
# Indexing and slicing work the exact same way with a 3-dimensional array, but now we have an extra axis to pass in. 
# If we wanted to get the earnings for January of all years, we could do this:
earnings[:,0,0]

array([500, 600])

In [24]:
# If we wanted to get first quarter earnings from both years, we could do this:
earnings[:,0,:]

array([[500, 505, 490],
       [600, 605, 490]])
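The same pattern extends to any combination of axes. As a rough sketch using the earnings figures from above (the variable names are assumptions for the example), each axis can be fixed to a single index or left open with a colon:

```python
import numpy as np

earnings = np.array([
    [[500, 505, 490], [810, 450, 678], [234, 897, 430], [560, 1023, 640]],
    [[600, 605, 490], [345, 900, 1000], [780, 730, 710], [670, 540, 324]],
])

# Axis 0 is the year, axis 1 the quarter, axis 2 the month within the quarter.
second_year = earnings[1]          # every quarter of the second year
last_quarter = earnings[:, 3, :]   # fourth quarter of both years
decembers = earnings[:, 3, 2]      # last month of the fourth quarter, both years

print(second_year.shape)
print(last_quarter.shape)
print(decembers)
```

Each axis that is fixed to a single index disappears from the result, which is why the three selections have 2, 2, and 1 dimensions respectively.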

NumPy Data Types
As we mentioned earlier, each NumPy array can store elements of a single data type. For example, wines contains only float values. NumPy stores values using its own data types, which are distinct from Python types like float and str. This is because the core of NumPy is written in a programming language called C, which stores data differently than the Python data types.

You can find the data type of a NumPy array by accessing the dtype property:

In [25]:
wines.dtype

dtype('float64')

NumPy has several different data types, which mostly map to Python data types, like float and str. A full listing is available in the NumPy documentation, but here are a few important ones:

float -- numeric floating point data.
int -- integer data.
string -- character data.
object -- Python objects.
Data types additionally end with a suffix that indicates how many bits of memory each element takes up. So int32 is a 32-bit integer data type, and float64 is a 64-bit float data type.
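To make the data types listed above concrete, the following sketch builds one small array of each kind and inspects it. The array contents and names are made up for the example; itemsize reports bytes per element, so it is the bit count in the suffix divided by 8.

```python
import numpy as np

floats = np.array([1.5, 2.5])                     # Python floats become float64
small_ints = np.array([1, 2, 3], dtype=np.int32)  # explicitly request 32-bit integers
text = np.array(["red", "white"])                 # a fixed-width Unicode string type
mixed = np.array([1, "a", None], dtype=object)    # arbitrary Python objects

print(floats.dtype, floats.itemsize)
print(small_ints.dtype, small_ints.itemsize)
print(text.dtype.kind)
print(mixed.dtype)
```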

Data Type Conversion in NumPy
You can use the numpy.ndarray.astype method to convert an array to a different type. The method will actually copy the array, and return a new array with the specified data type. For instance, we can convert wines to the int data type:

In [74]:
wines.astype(int)

array([[ 7,  0,  0, ...,  0,  9,  5],
       [ 7,  0,  0, ...,  0,  9,  5],
       [ 7,  0,  0, ...,  0,  9,  5],
       ...,
       [ 6,  0,  0, ...,  0, 11,  6],
       [ 5,  0,  0, ...,  0, 10,  5],
       [ 6,  0,  0, ...,  0, 11,  6]])

In [75]:
int_wines = wines.astype(int)
int_wines.dtype.name

'int32'

In [76]:
int_wines = wines.astype(np.int64)
int_wines.dtype.name

'int64'
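As the converted output shows, astype drops the fractional part rather than rounding (7.8 becomes 7). The sketch below illustrates this on a few made-up values and also checks that the original array is left unchanged; rounding first with numpy.rint is one option when nearest-integer values are wanted.

```python
import numpy as np

values = np.array([7.4, 0.88, 9.8, -1.7])

# astype truncates toward zero; it does not round.
as_int = values.astype(np.int64)

# astype returns a copy, so the original array keeps its float64 type.
print(values.dtype, as_int.dtype)
print(as_int)

# Round to the nearest integer first, then convert.
rounded = np.rint(values).astype(np.int64)
print(rounded)
```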

Single Array Math
If you do any of the basic mathematical operations (/, *, -, +, **) with an array and a value, it will apply the operation to each of the elements in the array.

Let's say we want to add 10 points to each quality score because we're drunk and feeling generous. Here's how we'd do that:

In [77]:
wines[:,11] + 10

array([15., 15., 15., ..., 16., 15., 16.])

Note that the above operation won't change the wines array -- it will return a new 1-dimensional array where 10 has been added to each element in the quality column of wines.

If we instead did +=, we'd modify the array in place:

In [78]:
wines[:, 11] += 10
wines[:,11]

array([15., 15., 15., ..., 16., 15., 16.])

In [79]:
# All the other operations work the same way. For example, if we want to multiply each of the quality score by 2, 
# we could do it like this:
wines[:, 11] * 2

array([30., 30., 30., ..., 32., 30., 32.])
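One detail worth checking when raising an array to a power: in Python, exponentiation is written **, while ^ is bitwise XOR and is only defined for integer and Boolean arrays. The sketch below uses a few made-up scores to show the difference.

```python
import numpy as np

scores = np.array([5.0, 6.0, 7.0])

doubled = scores * 2    # each element multiplied by 2
halved = scores / 2     # each element divided by 2
squared = scores ** 2   # each element raised to the power 2

# ^ is bitwise XOR, so it needs an integer (or Boolean) array.
int_scores = np.array([5, 6, 7])
xor_result = int_scores ^ 2

print(squared)
print(xor_result)
```

Applying ^ to the float array scores would raise a TypeError instead of squaring it.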

Multiple Array Math
It's also possible to do mathematical operations between arrays. This will apply the operation to pairs of elements. For example, if we add the quality column to itself, here's what we get:

In [80]:
wines[:,11] + wines[:,11]
# Note that this is equivalent to wines[:,11] * 2 -- this is because NumPy adds each pair of elements.
# The first element in the first array is added to the first element in the second array, the second to the second, and so on.

array([30., 30., 30., ..., 32., 30., 32.])

We can also use this to multiply arrays. Let's say we want to pick a wine that maximizes alcohol content and quality (we want to get drunk, but we're classy). We'd multiply alcohol by quality, and select the wine with the highest score:

In [81]:
wines[:, 10] * wines[:, 11]
# All of the common operations (/, *, -, +, **) will work between arrays.

array([141., 147., 147., ..., 176., 153., 176.])
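The product above gives a score for every wine, but it does not yet pick the winner. One way to do that, sketched here with three made-up wines, is numpy.argmax, which returns the index of the largest element. The arrays alcohol and quality stand in for wines[:, 10] and wines[:, 11].

```python
import numpy as np

# Three hypothetical wines; these values are not taken from the dataset.
alcohol = np.array([9.4, 12.8, 10.5])
quality = np.array([5.0, 7.0, 8.0])

score = alcohol * quality   # element-wise product, one score per wine
best = np.argmax(score)     # index of the wine with the highest score

print(best)
print(score[best])
```

With the real data, the same index could be used to look up the full row, for example wines[best, :].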

NumPy Array Methods
In addition to the common mathematical operations, NumPy also has several methods that you can use for more complex calculations on arrays. An example of this is the numpy.ndarray.sum method. This finds the sum of all the elements in an array by default:

In [82]:
wines[:, 11].sum()

25002.0

We can pass the axis keyword argument into the sum method to find sums over an axis. If we call sum on the wines matrix and pass in axis=0, we'll find the sums over the first axis of the array, which gives us the sum of all the values in every column. It may seem backwards that summing over the first axis gives the sum of each column, but one way to think about this is that the specified axis is the one "going away". So if we specify axis=0, the rows go away: the values in each column are added together down the rows, leaving one sum per column:

In [83]:
wines.sum(axis=0)

array([13303.1    ,   843.985  ,   433.29   ,  4059.55   ,   139.859  ,
       25384.     , 74302.     ,  1593.79794,  5294.47   ,  1052.38   ,
       16666.35   , 25002.     ])

In [84]:
# Reload the data first so the in-place changes made above are discarded.
# If we pass in axis=1, we'll find the sums over the second axis of the array. This will give us the sum of each row:
wines = np.genfromtxt(r"C:\Users\hp\Documents\winequality-red.csv", delimiter=";", skip_header=1)
wines.sum(axis=1)

array([ 74.5438 , 123.0548 ,  99.699  , ..., 100.48174, 105.21547,
        92.49249])
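The "axis that goes away" rule is easier to verify on an array small enough to add up by hand. The following sketch uses a made-up 2x3 table:

```python
import numpy as np

table = np.array([
    [1, 2, 3],
    [4, 5, 6],
])

column_sums = table.sum(axis=0)  # the 2 rows collapse, leaving 3 column sums
row_sums = table.sum(axis=1)     # the 3 columns collapse, leaving 2 row sums
total = table.sum()              # no axis given: every element is summed

print(column_sums)
print(row_sums)
print(total)
```

The shape tells the same story: summing a (2, 3) array over axis=0 leaves shape (3,), and over axis=1 leaves shape (2,).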

There are several other methods that behave like the sum method, including:
numpy.ndarray.mean — finds the mean of an array.
numpy.ndarray.std — finds the standard deviation of an array.
numpy.ndarray.min — finds the minimum value in an array.
numpy.ndarray.max — finds the maximum value in an array.

In [85]:
# numpy.ndarray.mean
wines[:, 11].mean()

5.6360225140712945

In [86]:
# numpy.ndarray.std
wines[:,11].std()

0.8073168769639513

In [87]:
# numpy.ndarray.min
wines[:, 11].min()

3.0

In [88]:
# numpy.ndarray.max
wines[:, 11].max()

8.0

NumPy Array Comparisons
NumPy makes it possible to test to see if rows match certain values using mathematical comparison operations like <, >, >=, <=, and ==. For example, if we want to see which wines have a quality rating higher than 5, we can do this:

In [69]:
wines[:,11] > 5

array([False, False, False, ...,  True, False,  True])

In [70]:
# We get a Boolean array that tells us which of the wines have a quality rating greater than 5. 
# We can do something similar with the other operators. For instance, we can see if any wines have a quality rating equal to 10:
wines[:, 11] == 10

array([False, False, False, ..., False, False, False])
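A Boolean array like the ones above is rarely the final answer; usually it gets counted, tested, or used to select elements. The sketch below shows those three uses on a handful of made-up quality scores (the names are assumptions for the example).

```python
import numpy as np

quality = np.array([5.0, 5.0, 6.0, 7.0, 4.0])
high_quality = quality > 5

# True counts as 1, so summing a Boolean array counts the matches.
count = high_quality.sum()

# any() answers "does at least one element match?"
has_perfect = (quality == 10).any()

# Indexing with the Boolean array keeps only the matching elements.
selected = quality[high_quality]

print(count)
print(has_perfect)
print(selected)
```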

Reshaping NumPy Arrays
We can change the shape of arrays while still preserving all of their elements. This often can make it easier to access array elements. The simplest reshaping is to flip the axes, so rows become columns, and vice versa. We can accomplish this with the numpy.transpose function:

In [71]:
np.transpose(wines).shape

(12L, 1599L)

In [72]:
# We can use the numpy.ravel function to turn an array into a one-dimensional representation. 
# It will essentially flatten an array into a long sequence of values:
wines.ravel()

array([ 7.4 ,  0.7 ,  0.  , ...,  0.66, 11.  ,  6.  ])

In [73]:
array_one = np.array(
    [
        [1, 2, 3, 4], 
        [5, 6, 7, 8]
    ]
)

array_one.ravel()

array([1, 2, 3, 4, 5, 6, 7, 8])
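Flattening can be reversed with the reshape method, as long as the new shape holds the same number of elements. This is a small sketch on array_one; the names flipped, flat, and restored are assumptions for the example.

```python
import numpy as np

array_one = np.array(
    [
        [1, 2, 3, 4],
        [5, 6, 7, 8]
    ]
)

flipped = np.transpose(array_one)  # rows become columns: shape (4, 2)
flat = array_one.ravel()           # one dimension with all 8 values
restored = flat.reshape(2, 4)      # back to 2 rows and 4 columns

print(flipped.shape)
print(restored)
```

Asking for a shape that does not fit, such as flat.reshape(3, 3), raises a ValueError because 8 values cannot fill 9 slots.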