### NumPy Data Analysis

NumPy is a commonly used Python data analysis package. 

By using NumPy, you can speed up your workflow and interface with other packages in the Python ecosystem, like scikit-learn, that use NumPy under the hood. NumPy was originally developed in the mid-2000s, and arose from an even older package called Numeric. This longevity means that almost every data analysis or machine learning package for Python leverages NumPy in some way.

In this tutorial, we’ll walk through using NumPy to analyze data on wine quality. The data contains information on various attributes of wines, such as pH and fixed acidity, along with a quality score between 0 and 10 for each wine. The quality score is the median of ratings from at least 3 human taste testers. As we learn how to work with NumPy, we’ll try to figure out more about the perceived quality of wine.

### Data 
Data is from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php),
which is available here: https://archive.ics.uci.edu/ml/datasets/Wine+Quality

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. Only physicochemical (input) and sensory (output) variables are available in the dataset; there is no data about grape types, wine brand, wine selling price, etc., due to privacy and logistic issues.

       
I downloaded both the white and red wine CSV files. The two files could be appended into one dataset if we wanted, but for this example, we will use only the red wine CSV file.

#### Data characteristics
- 1,599 rows in the red wine file (4,898 in the white wine file)
- 12 columns
- No missing values
- Donated 2009-10-07

#### Uses
- Can be used for classification or regression.
- Can be used for multivariate analysis

#### Columns
1. fixed acidity 
2. volatile acidity 
3. citric acid 
4. residual sugar 
5. chlorides 
6. free sulfur dioxide 
7. total sulfur dioxide 
8. density 
9. pH 
10. sulphates 
11. alcohol 

Output variable (based on sensory data): 
12. quality (score between 0 and 10)

In [3]:
# Set directory for the data
import os
path = 'C:\\Users\\' + os.getlogin() + '\\Documents\\Programming\\Python\\MachineLearning\\Data'
os.chdir(path)
os.getcwd()
os.listdir()

['01-ign.csv', '02-winequality-red.csv', '02-winequality-white.csv']

### Lists Of Lists for CSV Data
Before using NumPy, we’ll first try to work with the data using Python and the csv package. We can read in the file using the csv.reader object, which will allow us to read in and split up all the content from the csv file.

In the below code, we:

- Import the csv library.
- Open the 02-winequality-red.csv file.
    - With the file open, create a new csv.reader object.
        - Pass in the keyword argument delimiter=";" to make sure that the records are split up on the semicolon character instead of the default comma character.
    - Call the list type to get all the rows from the file.
    - Assign the result to wines.

In [5]:
# Import the csv library
import csv

# Open 02-winequality-red.csv and wrap it in a csv.reader object.
# delimiter=';' splits each record on semicolons instead of the
# default comma.
with open('02-winequality-red.csv', 'r') as f:  # Open in read mode
    # list() consumes the reader and produces a list of lists: each inner
    # list is one row of the file, and every item is a string.
    wines = list(csv.reader(f, delimiter=';'))
print(wines[:3])

[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality'], ['7.4', '0.7', '0', '1.9', '0.076', '11', '34', '0.9978', '3.51', '0.56', '9.4', '5'], ['7.8', '0.88', '0', '2.6', '0.098', '25', '67', '0.9968', '3.2', '0.68', '9.8', '5']]


With the data in a list of lists, we can compute the average quality of the wines:
- Extract the last element from each row after the header row.
- Convert each extracted element to a float.
- Assign all the extracted elements to the list qualities.
- Divide the sum of all the elements in qualities by the total number of elements in qualities to get the mean.


In [9]:
# Assign qualities
qualities = [float(item[-1]) for item in wines[1:]]
print(qualities[:5])

# Divide the sum of all the elements by the total number of elements 
mean_quality = sum(qualities)/len(qualities)
mean_quality

[5.0, 5.0, 5.0, 6.0, 5.0]


5.6360225140712945

Although we were able to do the calculation we wanted, the code is fairly complex, and it won’t be fun to have to do something similar every time we want to compute a quantity. Luckily, we can use NumPy to make it easier to work with our data.
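For comparison, here is a minimal sketch of the same mean computed both ways. It uses a small hand-typed sample (alcohol and quality for the first four red wines) instead of the full file, and the names rows, quality_list, and data are illustrative only. The indexing and the mean method it relies on are explained later in this tutorial.

```python
import numpy as np

# Illustrative sample: a header plus four rows of strings,
# mimicking what csv.reader returns.
rows = [
    ["alcohol", "quality"],
    ["9.4", "5"],
    ["9.8", "5"],
    ["9.8", "5"],
    ["9.8", "6"],
]

# Pure-Python approach: extract, convert, then sum and divide.
quality_list = [float(row[-1]) for row in rows[1:]]
mean_from_list = sum(quality_list) / len(quality_list)

# NumPy approach: convert once, then call a method on the last column.
data = np.array(rows[1:], dtype=float)
mean_from_array = data[:, -1].mean()

print(mean_from_list, mean_from_array)
```

Both approaches give the same number; the NumPy version simply moves the looping and arithmetic into the array.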

### NumPy 2-Dimensional Arrays (Matrices)

With NumPy, we work with multidimensional arrays. We’ll dive into all of the possible types of multidimensional arrays later on, but for now, we’ll focus on 2-dimensional arrays. 

A 2-dimensional array is also known as a matrix, and is something you should be familiar with. In fact, it’s just a different way of thinking about a list of lists. A matrix has rows and columns. By specifying a row number and a column number, we’re able to extract an element from a matrix.

Treating the list of lists wines from earlier as a table (header row included), the element in the first row and second column is volatile acidity, and the element in the third row and second column is 0.88.

In a NumPy array, the number of dimensions is called the rank, and each dimension is called an axis. So the rows are the first axis, and the columns are the second axis.
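As a small sketch of these terms, the number of dimensions and the size along each axis can be read from an array's ndim and shape attributes. The array matrix below is made up for illustration.

```python
import numpy as np

# A small 2-dimensional array: 2 rows (first axis) by 3 columns (second axis).
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

print(matrix.ndim)   # number of dimensions (axes)
print(matrix.shape)  # size along each axis: (rows, columns)
```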

Now that you understand the basics of matrices, let’s see how we can get from our list of lists to a NumPy array.

### Creating A NumPy Array
We can create a NumPy array using the numpy.array function. 

#### Pass in a list to the numpy.array() function
If we pass in a list of lists, it will automatically create a NumPy array with the same number of rows and columns. 

Because we want all of the elements in the array to be float elements for easy computation, we’ll leave off the header row, which contains strings. 

#### All elements in an array have to be of the same type.
One of the limitations of NumPy is that all the elements in an array have to be of the same type, so if we include the header row, all the elements in the array will be read in as strings. Because we want to be able to do computations like find the average quality of the wines, we need the elements to all be floats.

- Import the numpy package.
- Pass the list of lists wines into the array function, which converts it into a NumPy array.
    - Exclude the header row with list slicing.
    - Specify the keyword argument dtype to make sure each element is converted to a float. We’ll dive more into what the dtype is later on.

In [10]:
import csv
import numpy as np

with open('02-winequality-red.csv', 'r') as f:
    wines = list(csv.reader(f, delimiter=';'))

# Pass the list of lists (minus the header row) into np.array() and
# set dtype=float; the np.float alias was removed in NumPy 1.24
wines = np.array(wines[1:], dtype=float)
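
To see why the header row has to go, here is a minimal sketch using a made-up two-column excerpt (the names rows, with_header, and without_header are illustrative). Keeping the header forces every element to a string type, while dropping it lets the values convert to floats.

```python
import numpy as np

# Illustrative rows in the same style as the wine data: header plus one record.
rows = [["pH", "quality"], ["3.51", "5"]]

with_header = np.array(rows)                      # header forces a string dtype
without_header = np.array(rows[1:], dtype=float)  # numeric rows convert cleanly

print(with_header.dtype.kind)   # 'U' means Unicode string
print(without_header.dtype)     # float64
```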

In [27]:
# Print the first 3 rows
print(wines[:3])

# Second column of the first 3 rows
print(wines[0:3,1])

# Second row (index 1)
print(wines[1])

# First element of the second row, written two equivalent ways
print(wines[1,0])
print(wines[1][0])

[[7.400e+00 7.000e-01 0.000e+00 1.900e+00 7.600e-02 1.100e+01 3.400e+01
  9.978e-01 3.510e+00 5.600e-01 9.400e+00 5.000e+00]
 [7.800e+00 8.800e-01 0.000e+00 2.600e+00 9.800e-02 2.500e+01 6.700e+01
  9.968e-01 3.200e+00 6.800e-01 9.800e+00 5.000e+00]
 [7.800e+00 7.600e-01 4.000e-02 2.300e+00 9.200e-02 1.500e+01 5.400e+01
  9.970e-01 3.260e+00 6.500e-01 9.800e+00 5.000e+00]]
[0.7  0.88 0.76]
[ 7.8     0.88    0.      2.6     0.098  25.     67.      0.9968  3.2
  0.68    9.8     5.    ]
7.8
7.8


In [12]:
# Get shape
wines.shape

(1599, 12)

### Alternative NumPy Array Creation Methods

In [29]:
# Create an array with 3 rows and 4 columns with all elements as 0
import numpy as np
empty_array = np.zeros((3,4))  # np.zeros((x,y))
empty_array

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [30]:
# Create an array with random values as elements
import numpy as np
random_array = np.random.rand(3,4)  # np.random.rand()
random_array

array([[0.21064211, 0.35498152, 0.4917327 , 0.57312747],
       [0.80387808, 0.54334451, 0.65611305, 0.23077667],
       [0.85540526, 0.24779613, 0.80526936, 0.26987374]])

Creating arrays full of random numbers can be useful when you want to quickly test your code with sample arrays.

### Using NumPy To Read In Files
It’s possible to use NumPy to directly read csv or other files into arrays. We can do this using the numpy.genfromtxt function. We can use it to read in our initial data on red wines.

In the below code, we:

- Use the genfromtxt function to read in the 02-winequality-red.csv file.
- Specify the keyword argument delimiter=";" so that the fields are parsed properly.
- Specify the keyword argument skip_header=1 so that the header row is skipped.

In [39]:
# Use the genfromtxt() function to directly read csv into an array
wines = np.genfromtxt('02-winequality-red.csv', delimiter=';', skip_header=1)
wines[1]

array([ 7.8   ,  0.88  ,  0.    ,  2.6   ,  0.098 , 25.    , 67.    ,
        0.9968,  3.2   ,  0.68  ,  9.8   ,  5.    ])

The dataset will end up looking the same as if we had read it into a list and then converted it to an array of floats. NumPy will automatically pick a data type for the elements in an array based on their format.
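
If the CSV file isn't at hand, the same call can be tried on an in-memory string. The sketch below assumes a made-up three-column excerpt in the same semicolon-separated layout and wraps it in io.StringIO so that genfromtxt can read it like a file.

```python
import io
import numpy as np

# Hypothetical three-column excerpt in the same semicolon-separated layout.
text = "pH;alcohol;quality\n3.51;9.4;5\n3.2;9.8;5\n"

sample = np.genfromtxt(io.StringIO(text), delimiter=";", skip_header=1)

print(sample.shape)  # rows and columns read, header excluded
print(sample.dtype)  # data type NumPy picked for the values
```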

### Indexing NumPy Arrays
We now know how to create arrays, but unless we can retrieve results from them, there isn’t a lot we can do with NumPy. We can use array indexing to select individual elements, groups of elements, or entire rows and columns. One important thing to keep in mind is that, just like Python lists, NumPy arrays are zero-indexed: the index of the first row is 0, and the index of the first column is 0. To work with the fourth row we’d use index 3, to work with the second row we’d use index 1, and so on.

In [40]:
wines[2,3]  # specify 2 indexes to retrieve an element

2.3

Since we’re working with a 2-dimensional array in NumPy, we specify 2 indexes to retrieve an element. The first index selects the row (the first axis, axis 0), and the second index selects the column (the second axis, axis 1). Here, wines[2,3] is the element in the third row and fourth column. Any element in wines can be retrieved using 2 indexes.

### Slicing NumPy Arrays
If we instead want to select the first three items from the fourth column, we can do it using a colon (:). A colon indicates that we want to select all the elements from the starting index up to but not including the ending index. 

In [41]:
# Select the first three items from the fourth column
wines[0:3,3]

array([1.9, 2.6, 2.3])

In [42]:
# Retrieve all the elements from the beginning up to the third
wines[:3,3]

array([1.9, 2.6, 2.3])

We can select an entire column by specifying that we want all the elements, from the first to the last. We specify this by just using the colon (:), with no starting or ending indices. The below code will select the entire fourth column:

In [43]:
# Select all elements of the fourth column (index 3)
wines[:,3]

array([1.9, 2.6, 2.3, ..., 2.3, 2. , 3.6])

In [48]:
# Second element (row)
wines[1]

array([ 7.8   ,  0.88  ,  0.    ,  2.6   ,  0.098 , 25.    , 67.    ,
        0.9968,  3.2   ,  0.68  ,  9.8   ,  5.    ])

In [49]:
wines

array([[ 7.4  ,  0.7  ,  0.   , ...,  0.56 ,  9.4  ,  5.   ],
       [ 7.8  ,  0.88 ,  0.   , ...,  0.68 ,  9.8  ,  5.   ],
       [ 7.8  ,  0.76 ,  0.04 , ...,  0.65 ,  9.8  ,  5.   ],
       ...,
       [ 6.3  ,  0.51 ,  0.13 , ...,  0.75 , 11.   ,  6.   ],
       [ 5.9  ,  0.645,  0.12 , ...,  0.71 , 10.2  ,  5.   ],
       [ 6.   ,  0.31 ,  0.47 , ...,  0.66 , 11.   ,  6.   ]])

In [51]:
# Extract entire row
wines[1,:]  # wines[1] equivalent

array([ 7.8   ,  0.88  ,  0.    ,  2.6   ,  0.098 , 25.    , 67.    ,
        0.9968,  3.2   ,  0.68  ,  9.8   ,  5.    ])

In [53]:
wines[0,1:]  # First row, every column from the second onward

array([ 0.7   ,  0.    ,  1.9   ,  0.076 , 11.    , 34.    ,  0.9978,
        3.51  ,  0.56  ,  9.4   ,  5.    ])

If we take our indexing to the extreme, we can select the entire array using two colons to select all the rows and columns in wines. This is a great party trick, but doesn’t have a lot of good applications:

In [54]:
wines[:,:]

array([[ 7.4  ,  0.7  ,  0.   , ...,  0.56 ,  9.4  ,  5.   ],
       [ 7.8  ,  0.88 ,  0.   , ...,  0.68 ,  9.8  ,  5.   ],
       [ 7.8  ,  0.76 ,  0.04 , ...,  0.65 ,  9.8  ,  5.   ],
       ...,
       [ 6.3  ,  0.51 ,  0.13 , ...,  0.75 , 11.   ,  6.   ],
       [ 5.9  ,  0.645,  0.12 , ...,  0.71 , 10.2  ,  5.   ],
       [ 6.   ,  0.31 ,  0.47 , ...,  0.66 , 11.   ,  6.   ]])

### Assigning Values To NumPy Arrays
We can also use indexing to assign values to certain elements in arrays. We can do this by assigning directly to the indexed value:

In [56]:
wines[1,] # Second row (index 1)

array([ 7.8   ,  0.88  ,  0.    ,  2.6   ,  0.098 , 25.    , 67.    ,
        0.9968,  3.2   ,  0.68  ,  9.8   ,  5.    ])

In [58]:
wines[:,5] # Sixth column (index 5)

array([11., 25., 15., ..., 29., 32., 18.])

In [59]:
wines[1,5] # Second row & sixth column

25.0

In [60]:
wines[1,5] = 10

In [61]:
wines[1,5] # 25 changed to 10

10.0

In [62]:
wines[:,5] # show sixth column, the 25 is now a 10

array([11., 10., 15., ..., 29., 32., 18.])

In [63]:
wines[1,] # Second row, same as wines[1]

array([ 7.8   ,  0.88  ,  0.    ,  2.6   ,  0.098 , 10.    , 67.    ,
        0.9968,  3.2   ,  0.68  ,  9.8   ,  5.    ])

In [64]:
wines[1] # Second row, same as wines[1,]

array([ 7.8   ,  0.88  ,  0.    ,  2.6   ,  0.098 , 10.    , 67.    ,
        0.9968,  3.2   ,  0.68  ,  9.8   ,  5.    ])

### 1-Dimensional NumPy Arrays
So far, we’ve worked with 2-dimensional arrays, such as wines. However, NumPy is a package for working with multidimensional arrays. One of the most common types of multidimensional arrays is the 1-dimensional array, or vector. As you may have noticed above, when we sliced wines, we retrieved a 1-dimensional array. 

A 1-dimensional array only needs a single index to retrieve an element. Each row and column in a 2-dimensional array is a 1-dimensional array. Just like a list of lists is analogous to a 2-dimensional array, a single list is analogous to a 1-dimensional array. If we slice wines and retrieve only the fourth row (index 3), we get a 1-dimensional array:

In [65]:
# 1D array only needs a single index to retrieve an element
third_wine = wines[3,:]
third_wine

array([11.2  ,  0.28 ,  0.56 ,  1.9  ,  0.075, 17.   , 60.   ,  0.998,
        3.16 ,  0.58 ,  9.8  ,  6.   ])

In [66]:
wines[3,] # same as wines[3,:]; omitting the column index selects every column

array([11.2  ,  0.28 ,  0.56 ,  1.9  ,  0.075, 17.   , 60.   ,  0.998,
        3.16 ,  0.58 ,  9.8  ,  6.   ])

In [67]:
# Retrieve individual elements from third_wine using a single index
third_wine[1]

0.28

In [68]:
# Generate a random vector using np.random.rand()
np.random.rand(3)

array([0.46774995, 0.34988034, 0.271651  ])

In [72]:
# Pass in a shape for a 2D array
np.random.rand(3,4) # 3x4 matrix (array)

array([[0.12645679, 0.87573413, 0.81230444, 0.13758032],
       [0.22722529, 0.11634374, 0.81273009, 0.4369269 ],
       [0.94293411, 0.05208825, 0.1073651 , 0.50810366]])

When we called np.random.rand(3,4), we passed in the dimensions of a 2-dimensional array, so the result was a 2-dimensional array. When we called np.random.rand(3), we passed in a single dimension, so the result was a 1-dimensional array. The shape specifies the number of dimensions, and the size of the array in each dimension. A shape of (10,10) will be a 2-dimensional array with 10 rows and 10 columns. A shape of (10,) will be a 1-dimensional array with 10 elements.
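
A quick way to check this is to inspect the shape attribute of each result. In the sketch below, the variable names are illustrative. Note that np.random.rand takes the dimensions as separate arguments, while np.zeros takes the shape as a single tuple.

```python
import numpy as np

vector = np.random.rand(3)      # one dimension: 3 elements
matrix = np.random.rand(3, 4)   # two dimensions: 3 rows, 4 columns
zeros = np.zeros((10,))         # np.zeros takes the shape as a tuple

print(vector.shape, matrix.shape, zeros.shape)
```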

Where NumPy gets more complex is when we start to deal with arrays that have more than 2 dimensions.

### N-Dimensional NumPy Arrays
This doesn’t happen extremely often, but there are cases when you’ll want to deal with arrays that have 3 or more dimensions. One way to think of this is as a list of lists of lists. Let’s say we want to store the monthly earnings of a store, but we want to be able to quickly look up the results for a quarter, and for a year. The earnings for one year might look like this:
- [500, 505, 490, 810, 450, 678, 234, 897, 430, 560, 1023, 640]

The store earned 500 in January, 505 in February, and so on. We can split up these earnings by quarter into a list of lists:

In [73]:
# Create a list of lists, one inner list per quarter
year_one = [
    [500,505,490],
    [810,450,678],
    [234,897,430],
    [560,1023,640]
]

year_one

[[500, 505, 490], [810, 450, 678], [234, 897, 430], [560, 1023, 640]]

We can retrieve the earnings from January by calling year_one[0][0]. If we want the results for a whole quarter, we can call year_one[0] or year_one[1]. We now have a 2-dimensional array, or matrix. But what if we now want to add the results from another year? We have to add a third dimension:

In [74]:
year_one[0]

[500, 505, 490]

In [75]:
year_one[0][0]

500

In [76]:
earnings = [
    [
        [500,505,490],
        [810,450,678],
        [234,897,430],
        [560,1023,640]
    ],
    [
        [600,605,490],
        [345,900,1000],
        [780,730,710],
        [670,540,324]
    ]
]
earnings

[[[500, 505, 490], [810, 450, 678], [234, 897, 430], [560, 1023, 640]],
 [[600, 605, 490], [345, 900, 1000], [780, 730, 710], [670, 540, 324]]]

In [77]:
earnings[0]

[[500, 505, 490], [810, 450, 678], [234, 897, 430], [560, 1023, 640]]

In [79]:
earnings[0][0]

[500, 505, 490]

In [80]:
earnings[0][0][0]

500

We can retrieve the earnings from January of the first year by calling earnings[0][0][0]. We now need three indexes to retrieve a single element. A three-dimensional array in NumPy is much the same. In fact, we can convert earnings to an array and then get the earnings for January of the first year:

In [83]:
# Convert the earnings list of lists into a NumPy array
type(earnings)  # list
earnings = np.array(earnings)
earnings[0,0,0]  # vs list indexing [0][0][0]

500

In [84]:
# Get shape of the array
earnings.shape

(2, 4, 3)

Indexing and slicing work the exact same way with a 3-dimensional array, but now we have an extra axis to pass in. If we wanted to get the earnings for January of all years, we could do this:

In [86]:
earnings[:]

array([[[ 500,  505,  490],
        [ 810,  450,  678],
        [ 234,  897,  430],
        [ 560, 1023,  640]],

       [[ 600,  605,  490],
        [ 345,  900, 1000],
        [ 780,  730,  710],
        [ 670,  540,  324]]])

In [87]:
earnings[:,0,0]

array([500, 600])

In [88]:
earnings[:,0,1]

array([505, 605])

In [89]:
earnings[:,1,1]

array([450, 900])

In [90]:
earnings[:,0,0]

array([500, 600])

In [91]:
# Get earnings for the first quarter of all years
earnings[:,0,:]

array([[500, 505, 490],
       [600, 605, 490]])

Adding more dimensions can make it much easier to query your data if it’s organized in a certain way. As we go from 3-dimensional arrays to 4-dimensional and larger arrays, the same properties apply, and they can be indexed and sliced in the same ways.
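
As a recap, here is a self-contained sketch of a few 3-dimensional slices on the same earnings data, with the meaning of each axis spelled out in the comments. The result variable names are illustrative.

```python
import numpy as np

# Same earnings data as above: axis 0 = year, axis 1 = quarter, axis 2 = month.
earnings = np.array([
    [[500, 505, 490], [810, 450, 678], [234, 897, 430], [560, 1023, 640]],
    [[600, 605, 490], [345, 900, 1000], [780, 730, 710], [670, 540, 324]],
])

january_all_years = earnings[:, 0, 0]      # first month of Q1, every year
last_quarter_year_two = earnings[1, 3, :]  # all months of Q4, second year
first_months = earnings[:, :, 0]           # first month of each quarter

print(january_all_years)
print(last_quarter_year_two)
print(first_months.shape)
```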

### NumPy Data Types
As we mentioned earlier, each NumPy array can store elements of a single data type. For example, wines contains only float values. NumPy stores values using its own data types, which are distinct from Python types like float and str. This is because the core of NumPy is written in a programming language called C, which stores data differently than the Python data types. NumPy data types map between Python and C, allowing us to use NumPy arrays without any conversion hitches.

You can find the data type of a NumPy array by accessing the dtype property:

In [92]:
# Find the NumPy array type
wines.dtype

dtype('float64')

### Converting Data Types
You can use the numpy.ndarray.astype method to convert an array to a different type. The method will actually copy the array, and return a new array with the specified data type. For instance, we can convert wines to the int data type:

In [93]:
wines.astype(int)

array([[ 7,  0,  0, ...,  0,  9,  5],
       [ 7,  0,  0, ...,  0,  9,  5],
       [ 7,  0,  0, ...,  0,  9,  5],
       ...,
       [ 6,  0,  0, ...,  0, 11,  6],
       [ 5,  0,  0, ...,  0, 10,  5],
       [ 6,  0,  0, ...,  0, 11,  6]])

In [94]:
type(wines.astype(int))

numpy.ndarray

In [95]:
type(wines.astype(int).dtype)

numpy.dtype

As you can see above, all of the items in the resulting array are integers. Note that we used the Python int type instead of a NumPy data type when converting wines. This is because several Python data types, including float, int, and str, can be used with NumPy, and are automatically converted to NumPy data types.

We can check the name property of the dtype of the resulting array to see what data type NumPy mapped the resulting array to:

In [96]:
int_wines = wines.astype(int)
int_wines.dtype.name

'int32'

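On this machine, the array has been converted to a 32-bit integer data type. The integer width that NumPy picks for Python’s int depends on the platform and NumPy version, so the same code may report int64 elsewhere. A 64-bit integer allows for much larger values, but takes up twice as much memory as a 32-bit integer.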

If you want more control over how the array is stored in memory, you can directly create NumPy dtype objects like numpy.int32:

In [97]:
# np.int32 is a NumPy type that can be passed wherever a dtype is expected
np.int32

numpy.int32

In [98]:
wines.astype(np.int32)

array([[ 7,  0,  0, ...,  0,  9,  5],
       [ 7,  0,  0, ...,  0,  9,  5],
       [ 7,  0,  0, ...,  0,  9,  5],
       ...,
       [ 6,  0,  0, ...,  0, 11,  6],
       [ 5,  0,  0, ...,  0, 10,  5],
       [ 6,  0,  0, ...,  0, 11,  6]])
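
The sketch below, on a small made-up array, illustrates three points from this section: astype returns a new array and leaves the original alone, converting floats to integers drops the fractional part, and the chosen width determines how many bytes each element takes.

```python
import numpy as np

values = np.array([7.8, 0.88, 9.8])

as_int32 = values.astype(np.int32)  # fractional parts are dropped, not rounded
as_int64 = values.astype(np.int64)

print(as_int32)
print(values.dtype, as_int32.dtype)          # the original stays float64
print(as_int32.itemsize, as_int64.itemsize)  # bytes per element
```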

### NumPy Array Operations
NumPy makes it simple to perform mathematical operations on arrays. This is one of the primary advantages of NumPy, and makes it quite easy to do computations.

### Single Array Math
If you do any of the basic mathematical operations (/, *, -, +, **) with an array and a value, it will apply the operation to each of the elements in the array. Note that exponentiation is written **; in Python, ^ is bitwise XOR.

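Let’s say we want to add 10 points to each quality score because we’re drunk and feeling generous. We’d write wines[:,11] + 10, which adds 10 to every element of the quality column. The cells below inspect a few columns and then multiply one of them by a scalar in the same way: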

In [100]:
wines[:,11]

array([5., 5., 5., ..., 6., 5., 6.])

In [101]:
wines[:,1]

array([0.7  , 0.88 , 0.76 , ..., 0.51 , 0.645, 0.31 ])

In [103]:
wines[:,11]

array([5., 5., 5., ..., 6., 5., 6.])

In [104]:
# Multiply each value in the third column (citric acid) by 2
wines[:,2] * 2

array([0.  , 0.  , 0.08, ..., 0.26, 0.24, 0.94])
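
Here is a minimal sketch of the add-10 example on a made-up array of scores (the name quality is illustrative). The operation produces a new array and leaves the original untouched.

```python
import numpy as np

quality = np.array([5.0, 5.0, 6.0])  # illustrative scores

plus_ten = quality + 10   # 10 is added to every element
squared = quality ** 2    # ** is the power operator

print(plus_ten)
print(squared)
```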

### Multiple Array Math
It’s also possible to do mathematical operations between arrays. This will apply the operation to pairs of elements. For example, if we add the quality column to itself, here’s what we get:

In [105]:
wines[:,11] + wines[:,11]

array([10., 10., 10., ..., 12., 10., 12.])

Note that this is equivalent to wines[:,11] * 2. This is because NumPy adds each pair of elements: the first element in the first array is added to the first element in the second array, the second to the second, and so on.

We can also use this to multiply arrays. Let’s say we want to pick a wine that maximizes alcohol content and quality (we want to get drunk, but we’re classy). We’d multiply alcohol by quality, and select the wine with the highest score:

In [106]:
wines[:,10] * wines[:,11]

array([47., 49., 49., ..., 66., 51., 66.])
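
To actually pick the wine with the highest score, one option (an assumption on our part, not something shown in the cells above) is the argmax method, which returns the index of the largest element. The alcohol and quality values below are a small illustrative sample.

```python
import numpy as np

# Illustrative values for four wines (alcohol %, quality score).
alcohol = np.array([9.4, 9.8, 9.8, 9.8])
quality = np.array([5.0, 5.0, 5.0, 6.0])

score = alcohol * quality   # elementwise product, one score per wine
best = score.argmax()       # index of the highest score

print(score)
print(best)
```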

### Broadcasting
Unless the arrays that you’re operating on are the exact same size, it’s not possible to do elementwise operations. In cases like this, NumPy performs broadcasting to try to match up elements. Essentially, broadcasting involves a few steps:

- The last dimension of each array is compared.
    - If the dimension lengths are equal, or one of the dimensions is of length 1, then we keep going.
    - If the dimension lengths aren’t equal, and none of the dimensions have length 1, then there’s an error.
- Continue checking dimensions until the shortest array is out of dimensions.

For example, the following two shapes are compatible:
- A: (50,3)
- B: (3,)

This is because the length of the trailing dimension of array A is 3, and the length of the trailing dimension of array B is 3. They’re equal, so that dimension is okay. Array B is then out of dimensions, so we’re okay, and the arrays are compatible for mathematical operations.
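
The compatibility rules above can be sketched with the shapes from this example. The array c, with shape (2,), is an extra illustrative case whose trailing dimension matches neither 3 nor 1, so NumPy raises a ValueError.

```python
import numpy as np

a = np.ones((50, 3))
b = np.array([1.0, 2.0, 3.0])   # shape (3,)

result = a + b                  # b is stretched across all 50 rows

c = np.array([1.0, 2.0])        # shape (2,): trailing lengths 3 and 2 differ
try:
    a + c
    compatible = True
except ValueError:
    compatible = False

print(result.shape, compatible)
```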

https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html

In [108]:
array_two = np.array([4,5])
array_two

array([4, 5])

In [110]:
array_one = np.array(
[[1,2],[3,4]])
array_one

array([[1, 2],
       [3, 4]])

In [111]:
array_one + array_two

array([[5, 7],
       [7, 9]])

array_two has been broadcast across each row of array_one.

In [112]:
rand_array = np.random.randint(12)
wines + rand_array

array([[16.4  ,  9.7  ,  9.   , ...,  9.56 , 18.4  , 14.   ],
       [16.8  ,  9.88 ,  9.   , ...,  9.68 , 18.8  , 14.   ],
       [16.8  ,  9.76 ,  9.04 , ...,  9.65 , 18.8  , 14.   ],
       ...,
       [15.3  ,  9.51 ,  9.13 , ...,  9.75 , 20.   , 15.   ],
       [14.9  ,  9.645,  9.12 , ...,  9.71 , 19.2  , 14.   ],
       [15.   ,  9.31 ,  9.47 , ...,  9.66 , 20.   , 15.   ]])

In [113]:
rand_array

9

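Here rand_array is a single integer rather than an array, because np.random.randint(12) returns one random integer from 0 to 11. That scalar is broadcast to every element of wines, so 9 is added to each value. If rand_array were instead a 1-dimensional array with 12 elements, it would be broadcast over each row of wines, so the first column of wines would have the first value in rand_array added to it, and so on.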

In [114]:
wines[1]

array([ 7.8   ,  0.88  ,  0.    ,  2.6   ,  0.098 , 10.    , 67.    ,
        0.9968,  3.2   ,  0.68  ,  9.8   ,  5.    ])

In [115]:
wines[1] + rand_array

array([16.8   ,  9.88  ,  9.    , 11.6   ,  9.098 , 19.    , 76.    ,
        9.9968, 12.2   ,  9.68  , 18.8   , 14.    ])

### NumPy Array Methods
In addition to the common mathematical operations, NumPy also has several methods that you can use for more complex calculations on arrays. An example of this is the numpy.ndarray.sum method. This finds the sum of all the elements in an array by default:

In [116]:
wines[:,11].sum()

9012.0

The total of all of our quality ratings is 9012. We can pass the axis keyword argument into the sum method to find sums over an axis. If we call sum across the wines matrix, and pass in axis=0, we’ll find the sums over the first axis of the array. This will give us the sum of all the values in every column. It may seem backwards that the sums over the first axis would give us the sum of each column, but one way to think about this is that the specified axis is the one “going away”. So if we specify axis=0, we want the rows to go away, leaving one sum for each column:

In [117]:
# Sum of every value in every column
wines.sum(axis=0)

array([13303.1    ,   843.985  ,   433.29   ,  4059.55   ,   139.859  ,
       25369.     , 74302.     ,  1593.79794,  5294.47   ,  1052.38   ,
       16666.35   ,  9012.     ])

We can verify that we did the sum correctly by checking the shape. The shape should be (12,), corresponding to the number of columns:

In [118]:
wines.sum(axis=0).shape

(12,)

In [120]:
# axis = 1 to find the sums over the second axis of the array
wines.sum(axis=1)

array([ 74.5438 , 108.0548 ,  99.699  , ..., 100.48174, 105.21547,
        92.49249])
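
The "going away" idea is easier to see on a tiny array. The sketch below uses a made-up 2-by-3 array named table.

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

total = table.sum()              # every element
column_sums = table.sum(axis=0)  # rows "go away": one sum per column
row_sums = table.sum(axis=1)     # columns "go away": one sum per row

print(total)
print(column_sums)
print(row_sums)
```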

There are several other methods that behave like the sum method, including:

- numpy.ndarray.mean — finds the mean of an array.
- numpy.ndarray.std — finds the standard deviation of an array.
- numpy.ndarray.min — finds the minimum value in an array.
- numpy.ndarray.max — finds the maximum value in an array.
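
As a brief sketch on a made-up 2-by-3 array, these methods accept the same axis argument as sum:

```python
import numpy as np

table = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

print(table.mean())        # mean of all elements
print(table.mean(axis=0))  # mean of each column
print(table.min(axis=1))   # smallest value in each row
print(table.max())         # largest value overall
print(table.std())         # population standard deviation (ddof=0 by default)
```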

You can find a full list of array methods here.
- https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html

### NumPy Array Comparisons
NumPy makes it possible to test to see if rows match certain values using mathematical comparison operations like <, >, >=, <=, and ==. For example, if we want to see which wines have a quality rating higher than 5, we can do this:

In [121]:
wines[:,11] > 5

array([False, False, False, ...,  True, False,  True])

We get a Boolean array that tells us which of the wines have a quality rating greater than 5. We can do something similar with the other operators.
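
Here is a small sketch with made-up quality scores. Because True counts as 1 when summed, calling sum on a Boolean array is a handy way to count how many elements match.

```python
import numpy as np

quality = np.array([5.0, 5.0, 6.0, 7.0, 8.0])  # illustrative scores

above_five = quality > 5     # elementwise comparison
exactly_five = quality == 5

print(above_five)
print(above_five.sum())      # True counts as 1, so this counts matches
```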

### Subsetting
One of the powerful things we can do with a Boolean array and a NumPy array is select only certain rows or columns in the NumPy array. For example, the below code will only select rows in wines where the quality is over 7:

In [125]:
high_quality = wines[:,11] > 7
wines[high_quality,:][:3,:]

array([[7.900e+00, 3.500e-01, 4.600e-01, 3.600e+00, 7.800e-02, 1.500e+01,
        3.700e+01, 9.973e-01, 3.350e+00, 8.600e-01, 1.280e+01, 8.000e+00],
       [1.030e+01, 3.200e-01, 4.500e-01, 6.400e+00, 7.300e-02, 5.000e+00,
        1.300e+01, 9.976e-01, 3.230e+00, 8.200e-01, 1.260e+01, 8.000e+00],
       [5.600e+00, 8.500e-01, 5.000e-02, 1.400e+00, 4.500e-02, 1.200e+01,
        8.800e+01, 9.924e-01, 3.560e+00, 8.200e-01, 1.290e+01, 8.000e+00]])

In [128]:
wines[high_quality,:][:3]

array([[7.900e+00, 3.500e-01, 4.600e-01, 3.600e+00, 7.800e-02, 1.500e+01,
        3.700e+01, 9.973e-01, 3.350e+00, 8.600e-01, 1.280e+01, 8.000e+00],
       [1.030e+01, 3.200e-01, 4.500e-01, 6.400e+00, 7.300e-02, 5.000e+00,
        1.300e+01, 9.976e-01, 3.230e+00, 8.200e-01, 1.260e+01, 8.000e+00],
       [5.600e+00, 8.500e-01, 5.000e-02, 1.400e+00, 4.500e-02, 1.200e+01,
        8.800e+01, 9.924e-01, 3.560e+00, 8.200e-01, 1.290e+01, 8.000e+00]])

In [131]:
wines[high_quality,:][:3,:1]

array([[ 7.9],
       [10.3],
       [ 5.6]])

We select only the rows where high_quality contains a True value, and all of the columns. This subsetting makes it simple to filter arrays for certain criteria. For example, we can look for wines with a lot of alcohol and high quality. In order to specify multiple conditions, we have to place each condition in parentheses, and separate conditions with an ampersand (&):

In [135]:
high_q = (wines[:,10] > 10) & (wines[:,11] > 7)
wines[high_q,:1]

array([[ 7.9],
       [10.3],
       [ 5.6],
       [11.3],
       [ 9.4],
       [10.7],
       [10.7],
       [ 5. ],
       [ 7.8],
       [ 9.1],
       [10. ],
       [ 7.9],
       [ 8.6],
       [ 5.5],
       [ 7.2],
       [ 7.4]])

In [136]:
wines[high_q,10:]

array([[12.8,  8. ],
       [12.6,  8. ],
       [12.9,  8. ],
       [13.4,  8. ],
       [11.7,  8. ],
       [11. ,  8. ],
       [11. ,  8. ],
       [14. ,  8. ],
       [12.7,  8. ],
       [12.5,  8. ],
       [11.8,  8. ],
       [13.1,  8. ],
       [11.7,  8. ],
       [14. ,  8. ],
       [11.3,  8. ],
       [11.4,  8. ]])
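
The same pattern can be sketched on a made-up array of (alcohol, quality) pairs. The | operator for "or" is our addition here, shown alongside & for contrast. When the comparisons are written inline, each one needs its own parentheses because & and | bind more tightly than > does.

```python
import numpy as np

# Illustrative (alcohol, quality) pairs.
sample = np.array([[12.8, 8.0],
                   [9.4, 5.0],
                   [9.8, 8.0],
                   [11.0, 6.0]])

strong = sample[:, 0] > 10
excellent = sample[:, 1] > 7

both = sample[strong & excellent]     # rows meeting both conditions
either = sample[strong | excellent]   # rows meeting at least one

print(both)
print(either)
```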

In [137]:
wines[high_q,:]

array([[7.9000e+00, 3.5000e-01, 4.6000e-01, 3.6000e+00, 7.8000e-02,
        1.5000e+01, 3.7000e+01, 9.9730e-01, 3.3500e+00, 8.6000e-01,
        1.2800e+01, 8.0000e+00],
       [1.0300e+01, 3.2000e-01, 4.5000e-01, 6.4000e+00, 7.3000e-02,
        5.0000e+00, 1.3000e+01, 9.9760e-01, 3.2300e+00, 8.2000e-01,
        1.2600e+01, 8.0000e+00],
       [5.6000e+00, 8.5000e-01, 5.0000e-02, 1.4000e+00, 4.5000e-02,
        1.2000e+01, 8.8000e+01, 9.9240e-01, 3.5600e+00, 8.2000e-01,
        1.2900e+01, 8.0000e+00],
       [1.1300e+01, 6.2000e-01, 6.7000e-01, 5.2000e+00, 8.6000e-02,
        6.0000e+00, 1.9000e+01, 9.9880e-01, 3.2200e+00, 6.9000e-01,
        1.3400e+01, 8.0000e+00],
       [9.4000e+00, 3.0000e-01, 5.6000e-01, 2.8000e+00, 8.0000e-02,
        6.0000e+00, 1.7000e+01, 9.9640e-01, 3.1500e+00, 9.2000e-01,
        1.1700e+01, 8.0000e+00],
       [1.0700e+01, 3.5000e-01, 5.3000e-01, 2.6000e+00, 7.0000e-02,
        5.0000e+00, 1.6000e+01, 9.9720e-01, 3.1500e+00, 6.5000e-01,
        1.1000e+01,

In [142]:
wines[high_q,1:5] = 20

In [144]:
wines[high_q,1:5]

array([[20., 20., 20., 20.],
       [20., 20., 20., 20.],
       [20., 20., 20., 20.],
       [20., 20., 20., 20.],
       [20., 20., 20., 20.],
       [20., 20., 20., 20.],
       [20., 20., 20., 20.],
       [20., 20., 20., 20.],
       [20., 20., 20., 20.],
       [20., 20., 20., 20.],
       [20., 20., 20., 20.],
       [20., 20., 20., 20.],
       [20., 20., 20., 20.],
       [20., 20., 20., 20.],
       [20., 20., 20., 20.],
       [20., 20., 20., 20.]])

In [145]:
wines[high_q,1:6]

array([[20., 20., 20., 20., 15.],
       [20., 20., 20., 20.,  5.],
       [20., 20., 20., 20., 12.],
       [20., 20., 20., 20.,  6.],
       [20., 20., 20., 20.,  6.],
       [20., 20., 20., 20.,  5.],
       [20., 20., 20., 20.,  5.],
       [20., 20., 20., 20., 19.],
       [20., 20., 20., 20., 34.],
       [20., 20., 20., 20.,  7.],
       [20., 20., 20., 20., 42.],
       [20., 20., 20., 20.,  8.],
       [20., 20., 20., 20.,  6.],
       [20., 20., 20., 20., 28.],
       [20., 20., 20., 20., 15.],
       [20., 20., 20., 20., 17.]])

In [147]:
wines[high_q,0:6]  # Include one extra column on each side (0 and 5) to better view the 1:5 changes

array([[ 7.9, 20. , 20. , 20. , 20. , 15. ],
       [10.3, 20. , 20. , 20. , 20. ,  5. ],
       [ 5.6, 20. , 20. , 20. , 20. , 12. ],
       [11.3, 20. , 20. , 20. , 20. ,  6. ],
       [ 9.4, 20. , 20. , 20. , 20. ,  6. ],
       [10.7, 20. , 20. , 20. , 20. ,  5. ],
       [10.7, 20. , 20. , 20. , 20. ,  5. ],
       [ 5. , 20. , 20. , 20. , 20. , 19. ],
       [ 7.8, 20. , 20. , 20. , 20. , 34. ],
       [ 9.1, 20. , 20. , 20. , 20. ,  7. ],
       [10. , 20. , 20. , 20. , 20. , 42. ],
       [ 7.9, 20. , 20. , 20. , 20. ,  8. ],
       [ 8.6, 20. , 20. , 20. , 20. ,  6. ],
       [ 5.5, 20. , 20. , 20. , 20. , 28. ],
       [ 7.2, 20. , 20. , 20. , 20. , 15. ],
       [ 7.4, 20. , 20. , 20. , 20. , 17. ]])
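
Note that the assignment above overwrites the original values in wines in place. As a minimal sketch, using a small made-up array rather than the wine data (the names data, mask, and modified are hypothetical), boolean-mask assignment can be done on a copy when the original values should be kept:

```python
import numpy as np

# Small made-up array standing in for wines; the last column plays the role of quality
data = np.array([
    [1.0, 2.0, 3.0, 5.0],
    [4.0, 5.0, 6.0, 8.0],
    [7.0, 8.0, 9.0, 8.0],
])

# Boolean mask selecting rows whose last column is greater than 7
mask = data[:, 3] > 7

# Work on a copy so the original array is left untouched
modified = data.copy()
modified[mask, 1:3] = 20  # overwrite columns 1 and 2 in the selected rows only

print(modified)
print(data)  # still holds the original values
```

Without the call to copy(), the assignment would change data itself, which is what happens to wines in the cells above.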

### Reshaping NumPy Arrays
We can change the shape of arrays while still preserving all of their elements. This can often make it easier to access array elements. The simplest reshaping is to flip the axes, so rows become columns and vice versa. We can accomplish this with the numpy.transpose function:

In [148]:
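# Caution: this passes the shape tuple (1599, 12) to transpose, not the array itself.
# Transposing a 1D sequence leaves it unchanged, so the result is just the shape as an array.
np.transpose(wines.shape)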

array([1599,   12])

In [149]:
wines.shape

(1599, 12)

In [150]:
np.transpose(wines).shape

(12, 1599)
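
To see what transposing does to individual elements, here is a minimal sketch on a small made-up array (small and transposed are hypothetical names). It also shows the .T attribute, which gives the same result as np.transpose for a 2D array:

```python
import numpy as np

# A 2x3 array: 2 rows, 3 columns
small = np.array([
    [1, 2, 3],
    [4, 5, 6],
])

transposed = np.transpose(small)

print(transposed.shape)                      # rows and columns are swapped
print(np.array_equal(transposed, small.T))   # .T is shorthand for the same operation
print(transposed[2, 0] == small[0, 2])       # element (i, j) moves to (j, i)
```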

In [151]:
# Use ravel() function to turn an array into a 1D representation
# Flattens an array into a long sequence of values
wines.ravel()

array([ 7.4 ,  0.7 ,  0.  , ...,  0.66, 11.  ,  6.  ])

In [152]:
wines.ravel().shape

(19188,)

In [153]:
12*1599

19188

In [157]:
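# Ordering of np.ravel: values are flattened row by row,
# so array_one.ravel() should match new_array element for element
array_one = np.array(
[
    [1,2,3,4],
    [5,6,7,8]
])

new_array = np.array([1,2,3,4,5,6,7,8])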

In [158]:
array_one

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

In [159]:
new_array

array([1, 2, 3, 4, 5, 6, 7, 8])

In [160]:
array_one.ravel() == new_array

array([ True,  True,  True,  True,  True,  True,  True,  True])
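
The comparison above reflects the default row-by-row (C-style) ordering. As a small sketch, the order keyword argument of ravel can be used to flatten column by column instead (row_major and col_major are names introduced here for illustration):

```python
import numpy as np

array_one = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
])

# Default order="C": walk along each row in turn
row_major = array_one.ravel()

# order="F": walk down each column in turn
col_major = array_one.ravel(order="F")

print(row_major)
print(col_major)
```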

In [161]:
# Reshaping an array to a specific shape
# Below turns the second row of wines (12 values) into a 2D array with 2 rows and 6 columns
wines[1,:].reshape((2,6))

array([[ 7.8   ,  0.88  ,  0.    ,  2.6   ,  0.098 , 10.    ],
       [67.    ,  0.9968,  3.2   ,  0.68  ,  9.8   ,  5.    ]])

In [162]:
wines[1,:]

array([ 7.8   ,  0.88  ,  0.    ,  2.6   ,  0.098 , 10.    , 67.    ,
        0.9968,  3.2   ,  0.68  ,  9.8   ,  5.    ])

In [163]:
wines[1,:].shape

(12,)

In [164]:
wines[1,:].reshape((2,6)).shape

(2, 6)
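
Reshaping only works when the new shape holds exactly the same number of elements. A minimal sketch, using np.arange(12) as a stand-in for a 12-value wine row, shows a valid reshape, the -1 placeholder that lets NumPy infer one dimension, and the error raised for an incompatible shape:

```python
import numpy as np

# Stand-in for one row of wines: 12 values
row = np.arange(12)

two_by_six = row.reshape((2, 6))

# -1 asks NumPy to work out that dimension from the total size (12 / 3 = 4)
three_by_four = row.reshape((3, -1))

print(two_by_six.shape)
print(three_by_four.shape)

# 5 * 3 = 15 does not match 12 elements, so this raises a ValueError
try:
    row.reshape((5, 3))
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print(err)
```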

### Combining NumPy Arrays
With NumPy, it’s very common to combine multiple arrays into a single unified array. We can use numpy.vstack to vertically stack multiple arrays. Think of it as the second array’s items being added as new rows to the first array. We can read in the winequality-white.csv dataset, which contains information on the quality of white wines, then combine it with our existing dataset, wines, which contains information on red wines.

In the below code, we:
- Read in winequality-white.csv.
- Display the shape of white_wines.

In [166]:
white_wines = np.genfromtxt('02-winequality-white.csv', delimiter=';',skip_header=1)
white_wines.shape

(4898, 12)

As you can see, we have attributes for 4898 wines. Now that we have the white wines data, we can combine all the wine data.

In the below code, we:
- Use the vstack function to combine wines and white_wines.
- Display the shape of the result.

In [167]:
# Use np.vstack() to combine wine data
all_wines = np.vstack((wines, white_wines))
all_wines.shape

(6497, 12)

As you can see, the result has 6497 rows, which is the sum of the number of rows in wines (1599) and the number of rows in white_wines (4898).

If we want to combine arrays horizontally, where the number of rows stays constant but the columns are joined, then we can use the numpy.hstack function. The arrays we combine need to have the same number of rows for this to work.
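
As a minimal sketch of horizontal stacking, the made-up arrays below (left, right, and combined are hypothetical names) share the same number of rows, so their columns can be joined side by side:

```python
import numpy as np

# Two arrays with 3 rows each
left = np.array([
    [1, 2],
    [3, 4],
    [5, 6],
])
right = np.array([
    [10],
    [20],
    [30],
])

# Columns of right are appended after the columns of left
combined = np.hstack((left, right))

print(combined)
print(combined.shape)  # row count unchanged, column counts added together
```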

Finally, we can use numpy.concatenate as a general purpose version of hstack and vstack. If we want to concatenate two arrays, we pass them into concatenate, then specify the axis keyword argument that we want to concatenate along. Concatenating along the first axis is similar to vstack, and concatenating along the second axis is similar to hstack:

In [168]:
# Pass two arrays into concatenate function then specify axis
np.concatenate((wines, white_wines), axis=0)

array([[ 7.4 ,  0.7 ,  0.  , ...,  0.56,  9.4 ,  5.  ],
       [ 7.8 ,  0.88,  0.  , ...,  0.68,  9.8 ,  5.  ],
       [ 7.8 ,  0.76,  0.04, ...,  0.65,  9.8 ,  5.  ],
       ...,
       [ 6.5 ,  0.24,  0.19, ...,  0.46,  9.4 ,  6.  ],
       [ 5.5 ,  0.29,  0.3 , ...,  0.38, 12.8 ,  7.  ],
       [ 6.  ,  0.21,  0.38, ...,  0.32, 11.8 ,  6.  ]])
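
The relationship between concatenate, vstack, and hstack can be checked on two small made-up 2D arrays (a and b are hypothetical names, not part of the wine data):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# axis=0 joins along rows, matching vstack for 2D arrays
rows_joined = np.concatenate((a, b), axis=0)

# axis=1 joins along columns, matching hstack for 2D arrays
cols_joined = np.concatenate((a, b), axis=1)

print(rows_joined.shape, np.array_equal(rows_joined, np.vstack((a, b))))
print(cols_joined.shape, np.array_equal(cols_joined, np.hstack((a, b))))
```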

### Further Reading
You should now have a good grasp of NumPy, and how to apply it to a data set.

If you want to dive into more depth, here are some resources that may be helpful:
- NumPy Quickstart — has good code examples and covers most basic NumPy functionality.
- Python NumPy Tutorial — a great tutorial on NumPy and other Python libraries.
    - http://cs231n.github.io/python-numpy-tutorial/#numpy
- Visual NumPy Introduction — a guide that uses the game of life to illustrate NumPy concepts.
    - https://github.com/rougier/numpy-tutorial

### Using NumPy To Read In Files
It’s possible to use NumPy to directly read csv or other files into arrays. We can do this using the numpy.genfromtxt function. We can use it to read in our initial data on red wines.

In the below code, we:
- Use the genfromtxt function to read in the winequality-red.csv file.
- Specify the keyword argument delimiter=";" so that the fields are parsed properly.
- Specify the keyword argument skip_header=1 so that the header row is skipped.

In [170]:
# Read csv files into arrays by genfromtxt()
wines = np.genfromtxt('02-winequality-red.csv', delimiter=';', skip_header=1)

wines will end up looking the same as if we read it into a list then converted it to an array of floats. NumPy will automatically pick a data type for the elements in an array based on their format.
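
To try the same call without the CSV file on disk, here is a minimal sketch that feeds genfromtxt a small in-memory sample through io.StringIO. The sample text and the names csv_text and sample are made up for illustration and are not the real dataset:

```python
import io

import numpy as np

# A tiny semicolon-delimited sample with one header row (illustrative values only)
csv_text = '"fixed acidity";"pH";"quality"\n7.4;3.51;5\n7.8;3.2;5\n'

# genfromtxt accepts file-like objects as well as file names
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=";", skip_header=1)

print(sample)
print(sample.shape)
print(sample.dtype)  # floats by default, including the integer-looking quality column
```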