### NumPy Arrays

NumPy is a Python library for array manipulation. It performs array operations efficiently through the attributes and methods of its array type, `ndarray`.
- `ndarray` is a multidimensional container of items of the same size and type.
- NumPy's core is implemented in C.

| NumPy Array | Python List |
| --- | --- |
| consumes less memory, as it stores only the raw binary values of its items | consumes more memory, as each item is a separate Python object that stores its size, reference count, type and value |
| items share one data type, so no per-item type checking is needed when iterating | items can have different types, so each item's type is checked when it is used |
| uses contiguous memory | list items are spread across memory |

In [27]:
import numpy as np
import sys

In [28]:
%time a_list = list(range(1000000))
%time a_list = [x + 10 for x in a_list]

CPU times: total: 15.6 ms
Wall time: 37.3 ms
CPU times: total: 46.9 ms
Wall time: 152 ms


In [29]:
sys.getsizeof(a_list)

8448728

### Array Initializing using `np.array()`

To create an array, we can simply pass a list of values (or any collection of values) into the `np.array()` function.

In [30]:
# Array initialization
a_array = np.array(range(1000000))

In [31]:
%time a_array + 10

CPU times: total: 0 ns
Wall time: 3.78 ms


array([     10,      11,      12, ..., 1000007, 1000008, 1000009])

In [32]:
sys.getsizeof(a_array) / sys.getsizeof(a_list)

0.47345730623592097
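Note that `sys.getsizeof()` on a list measures only the list object and its array of references, not the integer objects those references point to, so the ratio above understates the difference. The following is a rough sketch of a fuller comparison; the variable names are illustrative and the `int64` dtype is an assumption made to keep the numbers predictable.

```python
import sys
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# Size of the list object itself: a header plus one reference per element
list_container_bytes = sys.getsizeof(py_list)
# Rough total: add the size of every int object the list refers to
list_total_bytes = list_container_bytes + sum(sys.getsizeof(v) for v in py_list)
# Size of the array's data buffer: n elements * 8 bytes each
array_data_bytes = np_array.nbytes

print(list_container_bytes, list_total_bytes, array_data_bytes)
```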

In [33]:
a = [1, 2, 4, 8, 16]
array_a = np.array(a)

In [34]:
type(array_a)

numpy.ndarray

### Creating an array of a specific data type

In [35]:
array_b = np.array(a, dtype='int8')
array_b

array([ 1,  2,  4,  8, 16], dtype=int8)
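The data type determines how many bytes each element occupies. A small sketch comparing two dtypes through the `itemsize` and `nbytes` attributes (the variable names are illustrative):

```python
import numpy as np

values = [1, 2, 4, 8, 16]
as_int8 = np.array(values, dtype=np.int8)
as_int64 = np.array(values, dtype=np.int64)

print(as_int8.itemsize, as_int8.nbytes)    # 1 byte per element, 5 bytes in total
print(as_int64.itemsize, as_int64.nbytes)  # 8 bytes per element, 40 bytes in total

# astype() returns a new array converted to the requested dtype
as_float = as_int8.astype(np.float64)
print(as_float.dtype)
```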

### `np.arange(start, stop, step)`

This function is similar to the built-in `range()` function, but it returns an array and also accepts non-integer values for `start`, `stop` and `step`. As with `range()`, the `stop` value is excluded.

In [36]:
np.arange(10, 19, 0.1)

array([10. , 10.1, 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 11. ,
       11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 11.9, 12. , 12.1,
       12.2, 12.3, 12.4, 12.5, 12.6, 12.7, 12.8, 12.9, 13. , 13.1, 13.2,
       13.3, 13.4, 13.5, 13.6, 13.7, 13.8, 13.9, 14. , 14.1, 14.2, 14.3,
       14.4, 14.5, 14.6, 14.7, 14.8, 14.9, 15. , 15.1, 15.2, 15.3, 15.4,
       15.5, 15.6, 15.7, 15.8, 15.9, 16. , 16.1, 16.2, 16.3, 16.4, 16.5,
       16.6, 16.7, 16.8, 16.9, 17. , 17.1, 17.2, 17.3, 17.4, 17.5, 17.6,
       17.7, 17.8, 17.9, 18. , 18.1, 18.2, 18.3, 18.4, 18.5, 18.6, 18.7,
       18.8, 18.9])
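With a non-integer step, floating-point rounding can affect how many elements `np.arange()` returns, which is why `np.linspace()` is often preferred when the number of points matters. A minimal sketch, using a step that is exactly representable in binary so the result is predictable:

```python
import numpy as np

quarter_steps = np.arange(0, 1, 0.25)         # the stop value 1 is excluded
same_with_linspace = np.linspace(0, 0.75, 4)  # both endpoints are included

print(quarter_steps)
print(same_with_linspace)
```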

### `array_object.reshape((dimensions))`

The `reshape()` method is used to restructure an array into the specified dimensions.

The number of elements in the array should be equal to the product of the dimensions.

In [37]:
# there are 16 elements in x
x = np.arange(0, 16)
x

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

In [38]:
# we can make an array which has 2 rows and 8 columns
x.reshape((2, 8))

array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15]])

In [39]:
# or we can make an array which has 8 rows and 2 columns
x.reshape((8, 2))

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11],
       [12, 13],
       [14, 15]])

In [40]:
# we can also make a 3-D array
x.reshape((4, 1, 4))

array([[[ 0,  1,  2,  3]],

       [[ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11]],

       [[12, 13, 14, 15]]])
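Two details of `reshape()` are worth trying out: one dimension can be given as `-1` so that NumPy infers it, and a shape whose product does not match the number of elements is rejected. A short sketch:

```python
import numpy as np

x = np.arange(16)

# -1 lets NumPy infer that dimension from the total element count
inferred = x.reshape((4, -1))
print(inferred.shape)

# A shape whose product is not 16 raises a ValueError
try:
    x.reshape((3, 5))
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```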

### NumPy Array Attributes

In [41]:
array_1 = x.reshape((2, 8))
array_2 = x.reshape((4, 1, 4))
array_1

array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15]])

### `array_object.shape`

Returns the shape of the array: a tuple giving the size of each dimension.

In [42]:
array_1.shape

(2, 8)

In [43]:
array_2.shape

(4, 1, 4)

In [44]:
x.shape

(16,)

### `array_object.ndim`

Tells the number of dimensions in the array.

In [45]:
array_1.ndim

2

In [46]:
array_2.ndim

3

In [47]:
x.ndim

1

### `array_object.T`

Transposes the array.

In [48]:
array_1

array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15]])

In [49]:
array_1.T

array([[ 0,  8],
       [ 1,  9],
       [ 2, 10],
       [ 3, 11],
       [ 4, 12],
       [ 5, 13],
       [ 6, 14],
       [ 7, 15]])

### Array Operations

#### Copies of an array

Assigning an array to a new variable does not copy it; both names refer to the same array object. Any in-place change made through the new variable therefore also changes the original array.

In [50]:
array_1

array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15]])

In [51]:
array_1_copy = array_1

In [52]:
array_1_copy += 10
array_1_copy

array([[10, 11, 12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23, 24, 25]])

In [53]:
array_1

array([[10, 11, 12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23, 24, 25]])

To get an independent copy of the array, we use the `copy()` method.

In [54]:
array_1 = x.reshape((2, 8))
array_1_copy = array_1.copy()

In [55]:
array_1

array([[10, 11, 12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23, 24, 25]])

In [56]:
array_1_copy += 10
array_1_copy

array([[20, 21, 22, 23, 24, 25, 26, 27],
       [28, 29, 30, 31, 32, 33, 34, 35]])

In [57]:
array_1

array([[10, 11, 12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23, 24, 25]])
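The same sharing applies to views. `reshape()` usually returns a view of the original data, which is why `array_1` in cell 55 still shows the values 10 to 25: the earlier in-place `+= 10` modified the data of `x` itself. A minimal sketch of the three cases, using `np.shares_memory()` to check whether two arrays share data (the variable names are illustrative):

```python
import numpy as np

original = np.arange(6)
alias = original               # same object, no data copied
view = original.reshape(2, 3)  # new array object sharing the same data
independent = original.copy()  # separate data buffer

original[0] = 99

print(alias is original)                        # True
print(np.shares_memory(original, view))         # True
print(np.shares_memory(original, independent))  # False
print(view[0, 0], independent[0])               # 99 0
```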

### Array methods

#### `array_object.transpose()`

This method transposes an array: converts rows into columns and columns into rows.

In [58]:
array_1.transpose()

array([[10, 18],
       [11, 19],
       [12, 20],
       [13, 21],
       [14, 22],
       [15, 23],
       [16, 24],
       [17, 25]])

#### `array_object.sort()`

This method sorts an array in place.
- by default it sorts along the last axis, so in a 2-D array the values within each row are sorted

In [59]:
array_4 = np.array([[7, 2, 4, 8], [5, 18, 2, -1]])
array_4

array([[ 7,  2,  4,  8],
       [ 5, 18,  2, -1]])

In [60]:
array_4.sort()
array_4

array([[ 2,  4,  7,  8],
       [-1,  2,  5, 18]])

- we can specify which axis to sort along

In [61]:
array_4.sort(axis=0)
array_4

array([[-1,  2,  5,  8],
       [ 2,  4,  7, 18]])
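When the original order should be kept, the `np.sort()` function can be used instead; it returns a sorted copy and leaves its input unchanged. A short sketch using the same values as above:

```python
import numpy as np

unsorted = np.array([[7, 2, 4, 8], [5, 18, 2, -1]])

sorted_rows = np.sort(unsorted)          # default axis=-1: sorts within each row
sorted_cols = np.sort(unsorted, axis=0)  # sorts within each column

print(sorted_rows)
print(sorted_cols)
print(unsorted)  # the input array is not modified
```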

#### `array_object.resize(r, c)`

This method changes the shape of an array in place; unlike `reshape()`, it modifies the array itself rather than returning a new one.

In [62]:
array_4.resize(4, 2)
array_4

array([[-1,  2],
       [ 5,  8],
       [ 2,  4],
       [ 7, 18]])

#### `array_object.flatten()`

Creates a one-dimensional copy of an n-dimensional array. Because it always copies the data, it uses additional memory.

In [63]:
array_4.flatten()

array([-1,  2,  5,  8,  2,  4,  7, 18])

#### `array_object.ravel()`

Similar to the `flatten()` method, but it returns a view of the original data whenever possible instead of a copy. Changes made to the returned array can therefore also change the original array.

In [64]:
array_4.ravel()

array([-1,  2,  5,  8,  2,  4,  7, 18])
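The difference between the two methods shows up when the flattened result is modified. A minimal sketch with a small contiguous array (for which `ravel()` returns a view):

```python
import numpy as np

grid = np.array([[1, 2], [3, 4]])

flat_copy = grid.flatten()  # always a copy
flat_view = grid.ravel()    # a view here, because grid is contiguous

flat_copy[0] = 100  # does not touch grid
flat_view[1] = 200  # writes through to grid

print(grid)
```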

#### `array_object.sum()`

This method calculates the sum along the specified axis. When no axis is specified, it calculates the sum of the entire array.

In [65]:
array_4.sum()

45

In [66]:
array_4.sum(axis=1)

array([ 1, 13,  6, 25])

In [67]:
array_4.sum(axis=0)

array([13, 32])
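The axis argument names the dimension that is collapsed: `axis=0` adds down the rows and gives one value per column, while `axis=1` adds across the columns and gives one value per row. A small sketch with easy-to-check numbers; `keepdims=True` is shown as an optional way to keep the result two-dimensional:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

total = m.sum()                             # 21
column_sums = m.sum(axis=0)                 # [5, 7, 9]
row_sums = m.sum(axis=1)                    # [6, 15]
row_sums_2d = m.sum(axis=1, keepdims=True)  # shape (2, 1)

print(total, column_sums, row_sums, row_sums_2d.shape)
```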

#### `np.matmul(array_object_1, array_object_2)`

Returns the result of matrix multiplication.

In [68]:
a = np.random.rand(2, 3)
b = np.random.rand(3, 2)

print(f'a:\n{a}\nb:\n{b}\na @ b:\n{np.matmul(a, b)}')

a:
[[0.58835523 0.61145144 0.64792665]
 [0.74590975 0.96217479 0.83758703]]
b:
[[0.02562016 0.65516998]
 [0.72562518 0.43103449]
 [0.61924353 0.44186192]]
a @ b:
[[0.85998271 0.93532346]
 [1.23595894 1.27352601]]


#### `@` operator

Instead of using `np.matmul()`, we can use the `@` operator to find the matrix product of two arrays.

In [69]:
print(f'a:\n{a}\nb:\n{b}\na @ b:\n{a @ b}')

a:
[[0.58835523 0.61145144 0.64792665]
 [0.74590975 0.96217479 0.83758703]]
b:
[[0.02562016 0.65516998]
 [0.72562518 0.43103449]
 [0.61924353 0.44186192]]
a @ b:
[[0.85998271 0.93532346]
 [1.23595894 1.27352601]]
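The `@` operator should not be confused with `*`, which multiplies arrays element by element. A short sketch with small integer matrices so the results can be checked by hand:

```python
import numpy as np

p = np.array([[1, 2], [3, 4]])
q = np.array([[5, 6], [7, 8]])

elementwise = p * q     # [[ 5, 12], [21, 32]]
matrix_product = p @ q  # [[19, 22], [43, 50]]

print(elementwise)
print(matrix_product)
```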


### Array Broadcasting

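Array broadcasting is the mechanism by which NumPy applies an arithmetic operation between arrays of different shapes, or between an array and a single value, by stretching the smaller operand across the larger one. In the simplest case, a single value is applied to every element of an array.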

In [70]:
list_var = [1, 2, 3, 4, 5]
list_var

[1, 2, 3, 4, 5]

To add 5 to all the elements of the above list, we would need to run a loop:

In [71]:
for i in range(len(list_var)):
    list_var[i] = list_var[i] + 5

list_var

[6, 7, 8, 9, 10]

In the case of an array, we can specify this operation without the use of loops:

In [72]:
array_var = np.array(list_var)
array_var

array([ 6,  7,  8,  9, 10])

In [73]:
array_var += 5
array_var

array([11, 12, 13, 14, 15])

In the above statement, we are "broadcasting" `5` so that it is added to every element of the array. Because the loop over the elements runs in NumPy's compiled code rather than in Python, such operations are much faster than an explicit loop. In general data analysis we usually work with pandas rather than with NumPy directly; pandas is built on NumPy, so it benefits from the same optimizations.
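Adding a single value is the simplest form of broadcasting. More generally, NumPy can combine arrays of different but compatible shapes, where dimensions are compatible if they are equal or one of them is 1. A minimal sketch (the array names are illustrative):

```python
import numpy as np

column = np.array([[0], [10], [20]])  # shape (3, 1)
row = np.array([1, 2, 3, 4])          # shape (4,)

# The column is stretched across 4 columns and the row across 3 rows
table = column + row
print(table.shape)
print(table)

# Shapes (3,) and (4,) are not compatible, so this raises a ValueError
try:
    np.ones(3) + np.ones(4)
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("broadcast failed:", err)
```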

### Other ways to initialize an array

#### `np.random.rand(r, c)`

This function creates an array with `r` rows and `c` columns, filled with random values drawn uniformly from the range [0, 1).

In [74]:
np.random.rand(3, 4)

array([[0.00560478, 0.08293671, 0.09638307, 0.73532143],
       [0.57619885, 0.44625321, 0.69472131, 0.4503342 ],
       [0.89752081, 0.52645961, 0.32182344, 0.75294222]])

#### `np.random.randn(r, c)`

This function creates an array with `r` rows and `c` columns, filled with random values drawn from the standard normal distribution (mean 0, standard deviation 1).

In [75]:
np.random.randn(3, 4)

array([[-0.91202811, -0.19131126, -2.61808894, -1.25174053],
       [-0.3498885 ,  0.24950844,  0.14569433, -0.98068742],
       [-0.26026517,  0.53697237, -0.21697246,  0.56682613]])

#### `np.random.randint(n, size=(r, c))`

This function creates an array with `r` rows and `c` columns, filled with random integers in the range [0, `n`).

In [76]:
np.random.randint(15, size=(3, 2))

array([[ 9,  7],
       [ 2,  0],
       [11,  7]])

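#### `np.random.random_sample((r, c))`

This function creates an array with `r` rows and `c` columns, filled with random floating-point numbers in the range [0, 1). Unlike `np.random.rand()`, it takes the shape as a single tuple.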

In [77]:
np.random.random_sample((3, 2))

array([[0.3345981 , 0.33763025],
       [0.65568709, 0.85394378],
       [0.62677434, 0.86812818]])
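The functions above belong to NumPy's legacy random interface. NumPy also provides a `Generator` object, created with `np.random.default_rng()`, which offers equivalent methods and can be seeded for reproducible results. A minimal sketch; the seed value 42 is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, chosen for reproducibility

uniform = rng.random((3, 4))                 # floats in [0, 1)
normal = rng.standard_normal((3, 4))         # standard normal distribution
integers = rng.integers(0, 15, size=(3, 2))  # integers in [0, 15)

print(uniform)
print(normal)
print(integers)
```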

#### `np.zeros((r, c))`

This function creates an array with `r` rows and `c` columns, filled with zeros.

In [78]:
np.zeros((4, 2))

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

#### `np.ones((r, c))`

This function creates an array of all ones, with `r` rows and `c` columns.
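A short sketch of `np.ones()`; as with `np.zeros()`, the elements are floats unless a `dtype` is given:

```python
import numpy as np

ones = np.ones((2, 3))                 # float64 by default
ones_int = np.ones((2, 3), dtype=int)  # integer ones

print(ones)
print(ones_int)
```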

#### `np.identity(n)`

This function creates an identity matrix with `n` rows and columns.

In [79]:
np.identity(7)

array([[1., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 1.]])

#### `np.eye(n)`

Same as `identity(n)`, creates an identity matrix with `n` rows and columns.

In [80]:
np.eye(7)

array([[1., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 1.]])

#### `np.full((r, c), n)`

This function creates an array with `r` rows and `c` columns, which has `n` as elements.

In [81]:
np.full((2, 3), 8)

array([[8, 8, 8],
       [8, 8, 8]])

#### `np.full_like(array, n)`

This function creates an array in the shape of another array, with `n` as its element.
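A short sketch of `np.full_like()`; the result takes both its shape and its data type from the template array (the names `template` and `filled` are illustrative):

```python
import numpy as np

template = np.array([[1, 2, 3], [4, 5, 6]])
filled = np.full_like(template, 8)

print(filled)
print(filled.shape, filled.dtype == template.dtype)
```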

#### `np.linspace(start, stop, n)`

This function creates a one-dimensional array with `n` equally spaced values from `start` to `stop`. By default, both `start` and `stop` are included.

In [82]:
np.linspace(10, 15, 50)

array([10.        , 10.10204082, 10.20408163, 10.30612245, 10.40816327,
       10.51020408, 10.6122449 , 10.71428571, 10.81632653, 10.91836735,
       11.02040816, 11.12244898, 11.2244898 , 11.32653061, 11.42857143,
       11.53061224, 11.63265306, 11.73469388, 11.83673469, 11.93877551,
       12.04081633, 12.14285714, 12.24489796, 12.34693878, 12.44897959,
       12.55102041, 12.65306122, 12.75510204, 12.85714286, 12.95918367,
       13.06122449, 13.16326531, 13.26530612, 13.36734694, 13.46938776,
       13.57142857, 13.67346939, 13.7755102 , 13.87755102, 13.97959184,
       14.08163265, 14.18367347, 14.28571429, 14.3877551 , 14.48979592,
       14.59183673, 14.69387755, 14.79591837, 14.89795918, 15.        ])
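Two optional arguments of `np.linspace()` are useful here: `retstep=True` also returns the spacing between values, and `endpoint=False` excludes the `stop` value. A small sketch with values that are easy to verify:

```python
import numpy as np

points, step = np.linspace(0, 1, 5, retstep=True)
half_open = np.linspace(0, 1, 4, endpoint=False)

print(points, step)  # 5 points from 0 to 1, spaced 0.25 apart
print(half_open)     # 4 points starting at 0, with 1 excluded
```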

### pandas

`pandas` is a Python library most commonly used for working with tabular data.

In [None]:
# importing pandas library to the current session
import pandas as pd
import warnings

# telling Python to ignore warnings, to keep the notebook clean
warnings.filterwarnings("ignore")

# telling pandas to display all columns of the dataframe in this notebook
pd.set_option("display.max_columns", None)

In [None]:
path = r"E:\data\superstore.xls"

In [None]:
# importing our table from the location and storing it in the "data" variable
# this variable can be anything, superstore_data, superstoredata, sdata
data = pd.read_excel(path)
data

In [None]:
# checking the data type of variable data
type(data)

### `dataframe_obj.info()`
- displays the number of records in the dataframe
- index of fields in the dataframe
- name of fields in the dataframe
- count of non-null values in each field
- data type of each field

In [None]:
# getting information about our table
data.info()

### `dataframe_obj.describe()`
- displays summary statistics of fields in the dataframe
- we can use `include` or `exclude` to get summary statistics for fields of a particular type
    1. `number`: for numeric fields
    2. `datetime`: for datetime fields
    3. `object`: for string fields

In [None]:
# getting summary statistics of columns
data.describe()

In [None]:
# getting summary statistics of "object" type columns
data.describe(include="object")

In [None]:
# getting summary statistics of "datetime" type columns
data.describe(include="datetime")

In [None]:
# getting summary statistics of numeric columns
data.describe(include="number")
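Since the Superstore file is not available here, the following sketch uses a small made-up dataframe (the column names and values are assumptions for illustration only) to show what `describe()` returns with and without a type filter:

```python
import pandas as pd

# Hypothetical sample data, not taken from the Superstore file
sample = pd.DataFrame({
    "Region": ["South", "East", "South", "West"],
    "Sales": [100.0, 250.0, 75.5, 300.0],
    "Profit": [20.0, -5.0, 10.5, 60.0],
})

numeric_summary = sample.describe()               # numeric columns only
text_summary = sample.describe(exclude="number")  # everything except numeric columns

print(numeric_summary)
print(text_summary)
```

For numeric columns the summary contains rows such as `count`, `mean` and the quartiles; for text columns it contains `count`, `unique`, `top` and `freq`.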

### `dataframe.columns`
- the `dataframe.columns` attribute is used to fetch the column labels (headers) of the dataframe.

In [None]:
# getting the names of columns in our table
data.columns

# notice that columns is not followed by parentheses: it is an attribute, not a method

### `dataframe.shape`
- the `dataframe.shape` attribute is used to fetch the number of rows and columns of the dataframe.

In [None]:
data.shape

### `dataframe.index`
- this attribute is used to fetch the indices of the dataframe.

In [None]:
data.index

### `dataframe.size`
- this attribute fetches the total number of cells in a dataframe.

In [None]:
# size tells us the total number of cells in the DataFrame
data.size

### `dataframe.head(n)`
- the `head` method is used to fetch `n` number of rows from the start of the dataframe. By default, it fetches 5 rows.

In [None]:
data.head(10)

### `dataframe.tail(n)`
- the `tail` method is used to fetch `n` number of rows from the end of the dataframe. By default, it fetches 5 rows.

In [None]:
data.tail()

### `dataframe.sample(n)`
- the `sample` method is used to fetch `n` number of rows, without replacement, from a dataset. It randomly selects the rows, in a way that each row has equal probability of being fetched.
- by default, it fetches 1 row.

In [None]:
data.sample(5)

### Indexing in `pandas`

In [None]:
# getting all rows of only one column
data['Region']

In [None]:
# we can also use the dot notation
data.Region

In [None]:
# but the dot notation does not work for column names that contain spaces;
# the line below raises a SyntaxError, so use data['Order Date'] instead
data.Order Date

In [None]:
type(data['Region'])

In [None]:
# getting all values of more than one column
data[['Postal Code', 'Region', 'Product ID']]

In [None]:
# another way to get data of multiple columns, using lists
cols = ['Postal Code', 'Region', 'Product ID']

data[cols]

# these are ways to slice the data: getting some data from our complete data

In [None]:
data["Discount"][16]

In [None]:
# now if we want to fetch the first 10 values in the Discount column (this is slicing)
data["Discount"][:10]

In [None]:
# getting filtered data
# let's try to fetch all transactions that belong to the South region
data[data['Region'] == 'South']
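Filtering works because the comparison produces a boolean Series, and indexing with that Series keeps only the rows where it is `True`. A minimal sketch on a made-up dataframe (the data is hypothetical; only the column names follow the notebook):

```python
import pandas as pd

# Hypothetical sample data
orders = pd.DataFrame({
    "Region": ["South", "East", "South", "West"],
    "Discount": [0.0, 0.2, 0.1, 0.0],
})

mask = orders["Region"] == "South"  # boolean Series, one value per row
south_orders = orders[mask]         # keeps the rows where mask is True

one_column = orders["Region"]       # single brackets return a Series
sub_frame = orders[["Region"]]      # a list of names returns a DataFrame

print(mask.tolist())
print(south_orders)
```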

### Keyword and Symbolic Logical Operators

| Symbol | Usage |
| --- | --- |
| `and` | used to implement AND operation when both sides of operator are single values |
| `or` | used to implement OR operation when both sides of operator are single values |
| `not` | used to implement NOT operation in case of single values |
| `&` | used to implement AND operation when both sides of operator are array-like values |
| `\|` | used to implement OR operation when both sides of operator are array-like values |
| `~` | used to implement NOT operation with array-like values |

In [None]:
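# let's try to fetch transactions in the South region and Consumer segment
# this fails: `&` binds more tightly than `==`, so 'South' & data['Segment'] is evaluated first
data[data['Region'] == 'South' & data['Segment'] == 'Consumer']

In [None]:
# wrapping each comparison in parentheses gives the intended result
data[(data['Region'] == 'South') & (data['Segment'] == 'Consumer')]

In [None]:
# this raises a ValueError: `and` needs a single True/False value, not a Series
data[data['Region'] == 'South' and data['Segment'] == 'Consumer']

In [None]:
# parentheses do not help here; `and` still cannot combine two Series
data[((data['Region'] == 'South') and (data['Segment'] == 'Consumer'))]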

In [None]:
# let's try to fetch profitable transactions in the East and South regions
data[(data['Profit'] > 0) & (data['Region'].isin(['East', 'South']))]
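The behaviour of the keyword and symbolic operators can be reproduced on a small made-up dataframe (hypothetical data; the column names follow the notebook). The sketch below combines two conditions with `&`, negates one with `~`, and shows the error raised by `and`:

```python
import pandas as pd

# Hypothetical sample data
orders = pd.DataFrame({
    "Region": ["South", "East", "South", "West"],
    "Segment": ["Consumer", "Consumer", "Corporate", "Consumer"],
})

is_south = orders["Region"] == "South"
is_consumer = orders["Segment"] == "Consumer"

south_consumer = orders[is_south & is_consumer]  # element-wise AND
not_south = orders[~is_south]                    # element-wise NOT

# `and` asks for the truth value of a whole Series, which is ambiguous
try:
    orders[is_south and is_consumer]
    and_failed = False
except ValueError as err:
    and_failed = True
    print("and failed:", err)
```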

In [None]:
# using dt functions with datetime columns
data['Order Date']

In [None]:
data['Order Date'].dt.year

In [None]:
data['Order Date'].dt.strftime('%A')

### DateTime Format Specifiers

These specifiers can be used with `.dt.strftime()` to format datetime values, and with the `format` argument of `pd.to_datetime()` to parse strings into datetime values.

| Specifier | Value |
| --- | --- |
| %m | numeric month |
| %b | short month name |
| %B | full month name |
| %y | last two digits of year |
| %Y | full year value |
| %a | short day name |
| %A | full day name |

- `strftime` converts datetime values to formatted date strings
- `strptime` (from Python's `datetime` module) parses strings into datetime values; in pandas, `pd.to_datetime()` with a `format` argument does the same for a whole column

[DateTime Documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior)
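A short sketch of both directions, formatting and parsing, on two made-up dates (the dates are arbitrary examples, and the day and month names assume the default English locale):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2023-01-15", "2024-07-04"]))

# datetime -> string
day_names = dates.dt.strftime("%A")       # full day name
month_year = dates.dt.strftime("%b %Y")   # short month name and full year

# string -> datetime, using the same specifiers to describe the input layout
raw = pd.Series(["01/15/23", "07/04/24"])
parsed = pd.to_datetime(raw, format="%m/%d/%y")

print(day_names.tolist())
print(month_year.tolist())
print(parsed)
```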

### Indexing with `dataframe.loc`

`.loc` is a dataframe attribute used for indexing a dataframe. It uses the labels (names) of rows and columns to fetch values.

#### SYNTAX:
`dataframe.loc[name_of_row, name_of_columns]`

In [None]:
data.loc[0, 'Profit']

In [None]:
data.loc[:10, ['Profit', 'Discount']]

In [None]:
summary = data.describe()
summary

In [None]:
type(summary)

In [None]:
summary['Row ID']

In [None]:
summary[['Row ID', 'Sales']]

In [None]:
summary.loc[['25%', '50%', '75%'], ['Row ID', 'Sales']]

In [None]:
summary.iloc[3:6, [0, 4]]

### Indexing with `dataframe.iloc`
`.iloc` is another attribute that can be used for indexing a dataframe. However, it uses the integer positions of rows and columns rather than their labels.

#### SYNTAX:
`dataframe.iloc[index_of_row, index_of_columns]`

In [None]:
data.iloc[0, 20]

In [None]:
data.iloc[:10, 19:]

### `.loc[]` vs `.iloc[]`

| .loc | .iloc |
| --- | --- |
| indexing is based on the labels (names) of rows and columns | indexing is based on the integer positions of rows and columns |
| slices include the upper limit | slices do not include the upper limit |
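Both differences can be seen on a small made-up dataframe (hypothetical data). After sorting, row labels and row positions no longer match, which makes the distinction visible:

```python
import pandas as pd

# Hypothetical sample data
frame = pd.DataFrame(
    {"Profit": [10, -3, 7, 0, 5], "Discount": [0.0, 0.2, 0.1, 0.3, 0.0]}
)

by_label = frame.loc[:2, "Profit"]  # labels 0, 1 and 2: three rows
by_position = frame.iloc[:2, 0]     # positions 0 and 1: two rows

shuffled = frame.sort_values("Profit")  # row labels travel with their rows
label_zero = shuffled.loc[0, "Profit"]  # the row labelled 0
first_row = shuffled.iloc[0, 0]         # the row in first position

print(len(by_label), len(by_position))
print(label_zero, first_row)
```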

In [None]:
data.loc[data['Profit'] <= 0, 'Profitable'] = False
data.loc[data['Profit'] > 0, 'Profitable'] = True
data
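As an alternative sketch, the same flag can be created in one step by assigning the comparison itself. The example uses made-up data; note that the two approaches differ for missing values, since a missing profit stays missing with the two `.loc` assignments but becomes `False` with the direct comparison.

```python
import pandas as pd

# Hypothetical sample data
frame = pd.DataFrame({"Profit": [10.0, -3.0, 0.0]})

# The comparison already yields True/False for every row
frame["Profitable"] = frame["Profit"] > 0

print(frame)
```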