# Data Science with Python - Numpy Basics
---

Numpy is short for Numerical Python and is one of the most important packages for numerical processing in Python. It provides the data structures and algorithms on which most scientific Python packages that work with numeric data are built. The main features of the package include:

- A powerful multidimensional array object;
- Sophisticated math functions that operate on entire arrays without explicit `for` loops;
- Linear algebra and random number generation features.

In addition to its obvious scientific uses, the NumPy package is also widely used in data analysis as an efficient multidimensional container of generic data for transport between various Python algorithms and libraries.
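
As a minimal sketch of the three features listed above, the following cell builds a small array, applies a vectorized function, computes a matrix product, and draws random numbers. The values and variable names are made up for illustration and are not part of the course data.

```python
import numpy as np

# A two-dimensional array (2 rows, 3 columns) built from nested lists.
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# Vectorized math: the square root is applied to every element, no loop needed.
roots = np.sqrt(matrix)

# Linear algebra: matrix product of the (2, 3) array with its (3, 2) transpose.
gram = matrix @ matrix.T

# Random numbers: a seeded generator makes the draw reproducible.
rng = np.random.default_rng(seed=42)
sample = rng.random(3)  # three floats in the interval [0, 1)

print(roots.shape)    # (2, 3)
print(gram.tolist())  # [[14.0, 32.0], [32.0, 77.0]]
print(sample)
```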

**Version:** 1.23

**Installation:** https://scipy.org/install.html

**Documentation:** https://numpy.org/doc/1.23/

Within this section, we will cover:

*   Importing packages in Python
*   Techniques for creating Numpy arrays
*   Arrays of more than one dimension
*   Performance comparisons between Numpy arrays and Python lists
*   Arithmetic operations with Numpy arrays
*   Selecting items and slicing arrays
*   Indexing with boolean arrays
*   Attributes and methods of Numpy arrays
*   Generating descriptive statistics and summarizations with arrays

## Numpy Arrays - Introduction

In [1]:
import numpy as np

In [2]:
km = np.loadtxt('cars-km.txt')

In [None]:
km

In [4]:
years = np.loadtxt('cars-years.txt', dtype = int)

In [None]:
years

### Getting the average mileage per year

In [None]:
year_current = 2019

km_avg = km / (year_current - years)

In [None]:
km_avg

In [None]:
type(km_avg)

# Packages

Many Python packages are available for download. Each package targets a particular type of problem and provides new types, functions, and methods for solving it.

Some packages are widely used in a data science context, such as:

- Numpy
- Pandas
- Scikit-learn
- Matplotlib

Some packages are not shipped with the default Python installation. In this case we must install the packages we need on our system in order to use their features.
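
One way to check which of these packages are already available is to ask Python whether it can locate them. The sketch below uses the standard library's `importlib.util.find_spec`. The helper name `is_installed` is hypothetical, and note that Scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

def is_installed(package_name):
    """Return True if `package_name` can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None

# Import names of the packages listed above (Scikit-learn imports as sklearn).
for name in ["numpy", "pandas", "sklearn", "matplotlib"]:
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

A missing package can usually be installed from the command line, for example with `pip install numpy`.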

### Importing the entire package

In [11]:
import numpy

https://numpy.org/doc/1.16/reference/generated/numpy.arange.html

In [None]:
numpy.arange(10)

### Importing the entire package and assigning a new name

In [13]:
import numpy as np

In [None]:
np.arange(10)

### Importing part of the package

In [None]:
from numpy import arange

In [None]:
arange(10)

# Creating Numpy arrays

In [None]:
import numpy as np

### From lists

https://numpy.org/doc/1.16/user/basics.creation.html

In [15]:
km = np.array([1000, 2300, 4987, 1500])

In [None]:
km

In [None]:
type(km)

https://numpy.org/doc/1.16/user/basics.types.html

In [None]:
km.dtype

dtype('int64')

### From external data

https://numpy.org/doc/1.16/reference/generated/numpy.loadtxt.html

In [17]:
km = np.loadtxt(fname = 'cars-km.txt', dtype = int)

In [None]:
km

In [None]:
km.dtype

### Two-dimensional arrays

In [None]:
data = [
     ['Alloy wheels', 'Power locks', 'Autopilot', 'Leather seats', 'Air conditioning', 'Parking sensor', 'Twilight sensor', 'Rain sensor'],
     ['Multimedia center', 'Panoramic roof', 'ABS brakes', '4 X 4', 'Digital panel', 'Autopilot', 'Leather seats', 'Parking camera'],
     ['Autopilot', 'Stability control', 'Twilight sensor', 'ABS brakes', 'Automatic transmission', 'Leather seats', 'Multimedia center', 'Power windows']
]

data

In [20]:
accessories = np.array(data)

In [None]:
accessories

In [None]:
km.shape            # check km dimension

In [None]:
accessories.shape   # check accessories dimension

### Comparing performance with *lists*

In [25]:
np_array = np.arange(1000000)

In [26]:
py_list = list(range(1000000))

In [27]:
%time for _ in range(100): np_array *= 2

CPU times: user 76.7 ms, sys: 0 ns, total: 76.7 ms
Wall time: 78.2 ms


In [28]:
%time for _ in range(100): py_list = [x * 2 for x in py_list]

CPU times: user 8.6 s, sys: 2.28 s, total: 10.9 s
Wall time: 10.9 s
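
`%time` is an IPython magic command and only works inside a notebook or IPython shell. As a rough sketch of the same comparison in a plain Python script, the standard `timeit` module can be used. The array size and number of repetitions below are assumptions chosen to keep the run short, and the absolute timings depend on the machine.

```python
import timeit
import numpy as np

size = 100_000
np_array = np.arange(size)
py_list = list(range(size))

# Time 10 repetitions of doubling every element with each approach.
numpy_seconds = timeit.timeit(lambda: np_array * 2, number=10)
list_seconds = timeit.timeit(lambda: [x * 2 for x in py_list], number=10)

print(f"Numpy array: {numpy_seconds:.4f} s")
print(f"Python list: {list_seconds:.4f} s")

# Both approaches compute the same values.
same_result = (np_array * 2).tolist() == [x * 2 for x in py_list]
print(same_result)  # True
```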


# Arithmetic operations with Numpy arrays

### Operations between arrays and constants



In [29]:
km = [44410., 5712., 37123., 0., 25757.]
years = [2003, 1991, 1990, 2019, 2006]

In [30]:
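# age = 2019 - years    # left commented out: subtracting a list from a number fails

TypeError: unsupported operand type(s) for -: 'int' and 'list'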
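
The failure happens because Python lists do not define arithmetic with a number. A minimal sketch of the error, and of the element-by-element workaround that plain lists require, could look like this:

```python
years = [2003, 1991, 1990, 2019, 2006]

try:
    age = 2019 - years          # int - list is not defined
except TypeError as error:
    message = str(error)
    print("TypeError:", message)

# With a plain list, each element has to be processed explicitly.
age = [2019 - year for year in years]
print(age)   # [16, 28, 29, 0, 13]
```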

In [31]:
km = np.array([44410., 5712., 37123., 0., 25757.])
years = np.array([2003, 1991, 1990, 2019, 2006])

In [32]:
age = 2019 - years

In [33]:
age

array([16, 28, 29,  0, 13])

### Operations between arrays

In [34]:
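km_avg = km / age

**Note:** The car from 2019 has an age of 0 and a mileage of 0, so its division is `0. / 0`. Numpy emits a `RuntimeWarning` and stores `nan` for that item instead of raising an error.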


In [35]:
km_avg

array([2775.625     ,  204.        , 1280.10344828,           nan,
       1981.30769231])
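
If the `nan` is not wanted, one option is to divide only where the age is non-zero. The sketch below uses the `out` and `where` parameters of `np.divide`. Using `0.0` as the result for brand-new cars is an assumption made for illustration, not something the dataset prescribes.

```python
import numpy as np

km = np.array([44410., 5712., 37123., 0., 25757.])
years = np.array([2003, 1991, 1990, 2019, 2006])
age = 2019 - years

# Divide only where age is non-zero; positions skipped by `where`
# keep the value already stored in `out` (0.0 here, an assumed default).
km_avg = np.divide(km, age, out=np.zeros_like(km), where=age != 0)
print(km_avg)
```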

In [36]:
44410 / (2019 - 2003)

2775.625

In [37]:
5712 / (2019 - 1991)

204.0

### Operations with two-dimensional arrays

In [40]:
data = np.array([km, years])

In [None]:
data

In [None]:
data.shape

![1410-img01.png](https://caelum-online-public.s3.amazonaws.com/1410-pythondatascience/01/1410-img01.png)

In [None]:
data[0]

In [None]:
data[1]

In [None]:
km_avg = data[0] / (2019 - data[1])

In [None]:
km_avg

# Selections with Numpy arrays

![1410-img01.png](https://caelum-online-public.s3.amazonaws.com/1410-pythondatascience/01/1410-img01.png)

In [None]:
data

![1410-img02.png](https://caelum-online-public.s3.amazonaws.com/1410-pythondatascience/01/1410-img02.png)

### Indexing

**Note:** Indexing starts at zero.

In [None]:
counter = np.arange(10)
counter

In [None]:
counter[0]

In [None]:
item = 6
index = item - 1
counter[index]

In [None]:
counter[-1]

In [None]:
data[0]

In [None]:
data[1]

## **Tip:**
### *ndarray[ row ][ column ]* or *ndarray[ row, column ]*

In [None]:
data[1][2]

In [None]:
data[1, 2]
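
Both forms return the same item. As a small self-contained check (the array repeats the values used in this section):

```python
import numpy as np

data = np.array([[44410., 5712., 37123., 0., 25757.],
                 [2003., 1991., 1990., 2019., 2006.]])

chained = data[1][2]    # first takes row 1, then item 2 of that row
direct = data[1, 2]     # a single indexing operation with a (row, column) pair

print(chained, direct)  # 1990.0 1990.0
```

The comma form performs a single indexing operation, while the chained form first extracts the whole row and then indexes it.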

### Slicing
 
The syntax for slicing a Numpy array is $i : j : k$, where $i$ is the start index, $j$ is the stop index, and $k$ is the step ($k \neq 0$).
 
**Note:** In a slice, the item at index $i$ is **included** in the result and the item at index $j$ is **not included**.
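
A short sketch of how start, stop, and step interact, including a negative step, which walks the array backwards:

```python
import numpy as np

counter = np.arange(10)       # the integers 0 to 9

first = counter[2:5]          # start included, stop excluded
stepped = counter[2:9:3]      # every third item from index 2 up to, not including, 9
backwards = counter[::-1]     # a negative step traverses the array in reverse

print(first.tolist())      # [2, 3, 4]
print(stepped.tolist())    # [2, 5, 8]
print(backwards.tolist())  # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
```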

![1410-img01.png](https://caelum-online-public.s3.amazonaws.com/1410-pythondatascience/01/1410-img01.png)

In [None]:
counter = np.arange(10)
counter

In [None]:
counter[1:4]

In [None]:
counter[1:8:2]

In [None]:
counter[::2]

In [None]:
counter[1::2]

In [None]:
data

In [None]:
data[:, 1:3]

In [None]:
data[:, 1:3][0] / (2019 - data[:, 1:3][1])

In [None]:
data[0] / (2019 - data[1])

### Indexing with boolean arrays

**Note:** Selects the items (or the rows and columns) at the positions where a boolean array of matching length is `True`.

In [None]:
counter = np.arange(10)
counter

In [None]:
counter > 5

In [None]:
counter[counter > 5]

In [None]:
counter[[False, False, False, False, False, False,  True,  True,  True, True]]

In [None]:
data

In [None]:
data[1] > 2000

In [None]:
data[:, data[1] > 2000]
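
Conditions can also be combined. The sketch below, which assumes we want cars made after 2000 that have already been driven, uses `&` for "and". Each comparison needs its own parentheses because `&` binds more tightly than `>`.

```python
import numpy as np

data = np.array([[44410., 5712., 37123., 0., 25757.],
                 [2003., 1991., 1990., 2019., 2006.]])
km, years = data[0], data[1]

# Element-wise "and" of two boolean arrays; use `|` for "or".
mask = (years > 2000) & (km > 0)

print(mask.tolist())           # [True, False, False, False, True]
print(data[:, mask].tolist())  # [[44410.0, 25757.0], [2003.0, 2006.0]]
```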

# Attributes and methods of Numpy arrays

In [None]:
data = np.array([[44410., 5712., 37123., 0., 25757.],
                  [2003, 1991, 1990, 2019, 2006]])
data

### Attributes

https://numpy.org/doc/1.16/reference/arrays.ndarray.html#array-attributes

## *ndarray.shape*

Returns a tuple with the dimensions of the array.

In [None]:
data.shape

## *ndarray.ndim*

Returns the number of dimensions in the array.

In [None]:
data.ndim

## *ndarray.size*

Returns the number of elements in the array.

In [None]:
data.size

## *ndarray.dtype*

Returns the data type of array elements.

In [None]:
data.dtype

## *ndarray.T*

Returns the transposed array, that is, converts rows to columns and vice versa.

In [None]:
data.T

In [None]:
data.transpose()
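
The attributes above are related to one another. As a quick sketch using the same array: `ndim` is the length of `shape`, `size` is the product of the entries of `shape`, and transposing swaps the two entries of `shape`.

```python
import numpy as np

data = np.array([[44410., 5712., 37123., 0., 25757.],
                 [2003., 1991., 1990., 2019., 2006.]])

print(data.shape)     # (2, 5)
print(data.ndim)      # 2
print(data.size)      # 10
print(data.T.shape)   # (5, 2)

# Item (row 1, column 3) of data is item (row 3, column 1) of data.T.
print(data[1, 3] == data.T[3, 1])   # True
```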

### Methods

https://numpy.org/doc/1.16/reference/arrays.ndarray.html#array-methods

## *ndarray.tolist()*

Returns the array as a Python list.

In [None]:
data.tolist()

## *ndarray.reshape(shape[, order])*

Returns an array that contains the same data in a new shape.

In [None]:
counter = np.arange(10)
counter

In [None]:
counter.reshape((5, 2))

In [None]:
counter.reshape((5, 2), order='C')    # row-major (C-style) order, the default

In [None]:
counter.reshape((5, 2), order='F')    # column-major (Fortran-style) order
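
To make the difference between the two orders easier to see, here is a smaller sketch with six items. It also shows `-1`, which asks Numpy to infer one dimension from the total number of items.

```python
import numpy as np

counter = np.arange(6)

by_rows = counter.reshape((3, 2), order='C')     # fills row by row
by_columns = counter.reshape((3, 2), order='F')  # fills column by column
inferred = counter.reshape((3, -1))              # -1 lets Numpy infer the size

print(by_rows.tolist())     # [[0, 1], [2, 3], [4, 5]]
print(by_columns.tolist())  # [[0, 3], [1, 4], [2, 5]]
print(inferred.shape)       # (3, 2)
```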

In [72]:
km = [44410, 5712, 37123, 0, 25757]
years = [2003, 1991, 1990, 2019, 2006]

In [None]:
info_cars = km + years
info_cars

In [None]:
np.array(info_cars).reshape((2, 5))

In [None]:
np.array(info_cars).reshape((5, 2), order='F')

## *ndarray.resize(new_shape[, refcheck])*

Changes the shape and size of the array in place. When the array grows, the new entries are filled with zeros.

In [None]:
data_new = data.copy()
data_new

In [77]:
data_new.resize((3, 5), refcheck=False)

In [None]:
data_new

In [None]:
data_new[2] = data_new[0] / (2019 - data_new[1])

In [None]:
data_new
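
The method `ndarray.resize` should not be confused with the function `np.resize`, which behaves differently. The following sketch, with made-up values, contrasts the two: the method changes the array itself and pads with zeros, while the function returns a new array and repeats the original data.

```python
import numpy as np

original = np.array([[1, 2, 3],
                     [4, 5, 6]])

in_place = original.copy()
in_place.resize((3, 3), refcheck=False)   # modifies in_place; new cells are 0

repeated = np.resize(original, (3, 3))    # returns a new array; data repeats

print(in_place.tolist())   # [[1, 2, 3], [4, 5, 6], [0, 0, 0]]
print(repeated.tolist())   # [[1, 2, 3], [4, 5, 6], [1, 2, 3]]
print(original.shape)      # (2, 3), the original is unchanged
```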

# Statistics with Numpy arrays

* https://numpy.org/doc/1.16/reference/arrays.ndarray.html#calculation

* https://numpy.org/doc/1.16/reference/routines.statistics.html

* https://numpy.org/doc/1.16/reference/routines.math.html

In [78]:
years = np.loadtxt(fname = "cars-years.txt", dtype = int)
km = np.loadtxt(fname = "cars-km.txt")
value = np.loadtxt(fname = "cars-value.txt")

In [None]:
years.shape

* https://numpy.org/doc/1.16/reference/generated/numpy.column_stack.html

In [None]:
dataset = np.column_stack((years, km, value))
dataset

In [None]:
dataset.shape

## *np.mean()*

Returns the average of array elements along the specified axis.

In [None]:
np.mean(dataset)    # mean of all array elements

In [None]:
np.mean(dataset, axis = 0)    # per column

In [None]:
np.mean(dataset, axis = 1)    # per row

In [None]:
np.mean(dataset[:, 1])

In [None]:
np.mean(dataset[:, 2])
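
The effect of the `axis` argument is easier to follow on a small array. The table below is hypothetical and only serves to illustrate the calls shown above:

```python
import numpy as np

# Hypothetical 3 x 2 table: 3 rows (observations), 2 columns (variables).
table = np.array([[1., 10.],
                  [2., 20.],
                  [3., 30.]])

overall = np.mean(table)             # one number for the whole array
per_column = np.mean(table, axis=0)  # collapses the rows: one mean per column
per_row = np.mean(table, axis=1)     # collapses the columns: one mean per row

print(overall)               # 11.0
print(per_column.tolist())   # [2.0, 20.0]
print(per_row.tolist())      # [5.5, 11.0, 16.5]
```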

## *np.std()*

Returns the standard deviation of array elements along the specified axis.

In [None]:
np.std(dataset[:, 2])
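
By default, `np.std` computes the population standard deviation, dividing by the number of items $N$. The `ddof` parameter changes the divisor to $N - \text{ddof}$, so `ddof=1` gives the sample standard deviation. A small sketch with made-up values:

```python
import numpy as np

values = np.array([2., 4., 4., 4., 5., 5., 7., 9.])

population_std = np.std(values)        # divides by N (default, ddof=0)
sample_std = np.std(values, ddof=1)    # divides by N - 1

print(population_std)   # 2.0
print(sample_std)       # slightly larger than the population value
```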

## *ndarray.sum()*

Returns the sum of array elements along the specified axis.

In [None]:
dataset.sum(axis = 0)

array([  517938.        , 11480849.        , 25531812.37999999])

In [None]:
dataset[:, 1].sum()

11480849.0

## *np.sum()*

Returns the sum of array elements along the specified axis.

In [None]:
np.sum(dataset, axis = 0)

array([  517938.        , 11480849.        , 25531812.37999999])

In [None]:
np.sum(dataset[:, 2])

25531812.38