<img src="http://www.digitalvidya.com/wp-content/uploads/2013/05/logoa5-300x95.png">

## Digital Vidya Data Analytics NumPy Tutorial
### Digital Vidya Copyright

## NumPy

NumPy, or Numerical Python, is the fundamental package for high-performance scientific computing and data analysis in Python.

    A powerful N-dimensional array object


### Why NumPy

    Memory Efficient
    Vectorized operations

https://www.scipy.org/scipylib/faq.html#what-advantages-do-numpy-arrays-offer-over-nested-python-lists

In [25]:
import timeit
import numpy as np
import random

np_array = np.random.random(1000)
python_list = random.sample(range(1000), 1000)


def python_code():
    return [val-32 for val in python_list]


def numpy_code():
    return np_array-32

n = 1000

t_python = timeit.timeit(python_code, number=n)
t_numpy = timeit.timeit(numpy_code, number=n)

print('Time python', t_python)
print('Time numpy', t_numpy)
print('Speed Comparison', t_python/t_numpy)

Time python 0.16143963098875247
Time numpy 0.006095745018683374
Speed Comparison 26.48398686197376



### Array: a group of elements of the same data type, indexed by a tuple of non-negative integers

Array: A NumPy array is a contiguous block of memory holding elements of a single type (e.g. integers).
You cannot change the size of an array once it is created.
Because every element has the same type, each element occupies a fixed number of bytes, e.g. 4 bytes for a 32-bit integer.

List: A Python list is an array of pointers to Python objects (an "array" of addresses). It needs at least 4 bytes per pointer plus 16 bytes for even the smallest Python object.
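A rough way to see this difference is to compare the sizes NumPy and Python report for the same 1000 integers. This is only a sketch: the variable names are made up for illustration, and the list-side byte counts depend on the platform and Python version, so no output is shown.

```python
import sys
import numpy as np

values = list(range(1000))
arr = np.array(values, dtype=np.int64)  # dtype fixed explicitly for the example

# NumPy: one contiguous buffer, every element has the same size
bytes_per_element = arr.itemsize   # 8 bytes for int64
buffer_bytes = arr.nbytes          # itemsize * number of elements

# Python list: an array of pointers, plus a separate object per integer
list_bytes = sys.getsizeof(values)                     # the pointer array only
object_bytes = sum(sys.getsizeof(v) for v in values)   # the int objects themselves

print(bytes_per_element, buffer_bytes)
print(list_bytes, object_bytes)
```

On a typical CPython build the list plus its integer objects takes several times the space of the array buffer.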


#### Create a rank 1 array (1 dimension)

In [233]:
import numpy as np
arr = np.array([1,2,3,4,5,6,7,8,9]) 
arr

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

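#### Check the data type, number of dimensions, and shape

In [234]:
arr.dtype  # element type
arr.ndim   # number of dimensions (rank)
arr.shape  # size along each dimension; only this last expression is displayed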

(9,)

### Rank 2 array

In [239]:
arr = np.array([[1,2,3], [4,5,6], [7,8,9], [10,11,12]])
arr

array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

In [240]:
arr.ndim
arr.shape

(4, 3)

### Different functions to create array

In [242]:
np.arange(0,10,2) #Array of numbers from 0 up to (but not including) 10, in steps of 2

array([0, 2, 4, 6, 8])

In [32]:
np.zeros((5,3)) #Array of all zeros

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [23]:
np.ones((5,3)) #Array of all ones

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

In [241]:
np.random.random((5,3)) #Array of random numbers

array([[ 0.8368918 ,  0.4613682 ,  0.82872401],
       [ 0.77424341,  0.15899729,  0.07205358],
       [ 0.30614151,  0.93399446,  0.80473179],
       [ 0.00638184,  0.24916616,  0.79239137],
       [ 0.1076441 ,  0.57137641,  0.26882581]])

In [244]:
#an array of equally spaced values between start, end
np.linspace(0,10,5)

array([  0. ,   2.5,   5. ,   7.5,  10. ])

### Data types
https://docs.scipy.org/doc/numpy/user/basics.types.html

    Integer: int (i),
    Unsigned integer: uint (u),
    Single precision float: float (f),
    Double precision float: double (d),
    Boolean: bool (b),
    Complex: D,
    String: S,
    Unicode: U

In [62]:
np.sctypes #available in NumPy 1.x; removed from the main namespace in NumPy 2.0

{'complex': [numpy.complex64, numpy.complex128, numpy.complex256],
 'float': [numpy.float16, numpy.float32, numpy.float64, numpy.float128],
 'int': [numpy.int8, numpy.int16, numpy.int32, numpy.int64],
 'others': [bool, object, bytes, str, numpy.void],
 'uint': [numpy.uint8, numpy.uint16, numpy.uint32, numpy.uint64]}

In [246]:
int_arr = np.array([[-1,2,3,4], [5,-6,7,8]])
int_arr.dtype

dtype('int64')

In [247]:
u_int_arr = np.array([[1,2,3,4], [5,6,7,8]], dtype = 'uint')
u_int_arr

array([[1, 2, 3, 4],
       [5, 6, 7, 8]], dtype=uint64)

In [248]:
bool_arr = np.array([True, True, False])
bool_arr

array([ True,  True, False], dtype=bool)

#### Explicitly convert or cast an array from one dtype to another

In [250]:
int_arr = int_arr.astype(np.float64)
int_arr.dtype

dtype('float64')
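Casting in the other direction can lose information, so it is worth checking what `astype` does to the values. The sketch below uses made-up sample arrays; note that `astype` always returns a new array and leaves the original unchanged.

```python
import numpy as np

float_arr = np.array([1.7, -2.3, 3.9])

# Casting float -> int drops the fractional part (truncates toward zero)
as_int = float_arr.astype(np.int64)

# Strings that hold numbers can be cast to a numeric dtype as well
as_float = np.array(['1.5', '2', '-3.25']).astype(np.float64)

print(as_int)    # [ 1 -2  3]
print(as_float)
```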

#### String vs Object:
String lengths are not fixed. A NumPy string array reserves the same number of characters for every element, so a longer value assigned later is truncated. Instead of storing the bytes of the strings in the ndarray directly, pandas uses an ndarray of dtype object, which stores pointers to Python string objects.

This helps when you need to modify the contents of a large array without knowing the maximum string length in advance.

In [254]:
a = np.array(['apples', 'bananas', 'mangoes'], dtype = 'str')
a #the longest element has seven characters, so the dtype is '<U7'

array(['apples', 'bananas', 'mangoes'], 
      dtype='<U7')

In [253]:
a[1] = 'bananananas'
a

array(['apples', 'bananan', 'mangoes'], 
      dtype='<U7')

In [255]:
a = np.array(['apples', 'bananas', 'mangoes'], dtype = object)
a[2] = 'bananananas'
a

array(['apples', 'bananas', 'bananananas'], dtype=object)

In [257]:
import pandas as pd
deliveries=pd.read_csv('/Users/vaishaligarg/Downloads/ipl/deliveries.csv')
deliveries.dtypes

match_id             int64
inning               int64
batting_team        object
bowling_team        object
over                 int64
ball                 int64
batsman             object
non_striker         object
bowler              object
is_super_over        int64
wide_runs            int64
bye_runs             int64
legbye_runs          int64
noball_runs          int64
penalty_runs         int64
batsman_runs         int64
extra_runs           int64
total_runs           int64
player_dismissed    object
dismissal_kind      object
fielder             object
dtype: object

### Unary operations: one input

In [258]:
a = np.array([9,17,51,4,25,64,36])
a.sum()

206

In [259]:
a.max()

64

In [260]:
a.min()

4

In [261]:
a.cumsum()

array([  9,  26,  77,  81, 106, 170, 206])

In [262]:
deliveries.groupby(['match_id','batting_team', 'batsman']).batsman_runs.sum().reset_index().head(20)

|   | match_id | batting_team | batsman | batsman_runs |
|---|---|---|---|---|
| 0 | 1 | Kolkata Knight Riders | BB McCullum | 158 |
| 1 | 1 | Kolkata Knight Riders | DJ Hussey | 12 |
| 2 | 1 | Kolkata Knight Riders | Mohammad Hafeez | 5 |
| 3 | 1 | Kolkata Knight Riders | RT Ponting | 20 |
| 4 | 1 | Kolkata Knight Riders | SC Ganguly | 10 |
| 5 | 1 | Royal Challengers Bangalore | AA Noffke | 9 |
| 6 | 1 | Royal Challengers Bangalore | B Akhil | 0 |
| 7 | 1 | Royal Challengers Bangalore | CL White | 6 |
| 8 | 1 | Royal Challengers Bangalore | JH Kallis | 8 |
| 9 | 1 | Royal Challengers Bangalore | MV Boucher | 7 |


### Shape Manipulation:

#### Reshape, transpose, ravel

In [264]:
arr = np.array([3,5,7,13,1,9, 19,12,5,16,31,6])
arr = arr.reshape(4,3)
arr

array([[ 3,  5,  7],
       [13,  1,  9],
       [19, 12,  5],
       [16, 31,  6]])

In [265]:
arr.T
#https://stackoverflow.com/questions/42796548/plot-per-row-in-pandas#comment72707802_42796548

array([[ 3, 13, 19, 16],
       [ 5,  1, 12, 31],
       [ 7,  9,  5,  6]])

In [148]:
arr.ravel()

array([ 3,  5,  7, 13,  1,  9, 19, 12,  5, 16, 31,  6])
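Two details of these functions are easy to miss: `reshape` can infer one dimension when it is given as `-1`, and `ravel` returns a view when the memory layout allows it, whereas `flatten` always copies. A small sketch with an illustrative array:

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to infer that dimension from the array size: 12 / 4 = 3 columns
m = arr.reshape(4, -1)

flat_view = m.ravel()     # a view here, because m is contiguous
flat_copy = m.flatten()   # always a copy

flat_view[0] = 100        # also changes m and arr, since they share memory
flat_copy[1] = -1         # leaves m untouched

print(m.shape)            # (4, 3)
print(m[0])
```

Use `flatten` (or `.copy()`) when the flattened result must not affect the original array.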

### Array Indexing

In [268]:
arr = np.arange(10)
arr[2:8]
arr[::-1]

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [131]:
arr[:3] = 100
arr

array([100, 100, 100,   3,   4,   5,   6,   7,   8,   9])

In [269]:
arr_slice = arr[6:]
arr_slice[:] = -100
arr_slice

array([-100, -100, -100, -100])

In [285]:
arr

array([ 100,  100,  100,    3,    4,    5, -100, -100, -100, -100])

#### Indexing higher dimensional array

In [201]:
arr = np.array([[1,2,3], [4,5,6], [7,8,9]])
arr

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [202]:
arr[2:]

array([[7, 8, 9]])

In [203]:
arr[:2, 1:]

array([[2, 3],
       [5, 6]])

In [204]:
arr[:, 1:]

array([[2, 3],
       [5, 6],
       [8, 9]])

In [290]:
threed_arr = np.arange(24).reshape(2,3,4)
threed_arr#[1:, :2, :2]

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
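The slice that is commented out in the cell above takes one slice per axis. The sketch below spells out what it selects, and how an integer index differs from a slice; the names `sub`, `block` and `element` are only for illustration.

```python
import numpy as np

threed_arr = np.arange(24).reshape(2, 3, 4)

# One slice per axis: blocks from index 1 on, first two rows, first two columns
sub = threed_arr[1:, :2, :2]

# An integer index drops that axis; a slice keeps it
block = threed_arr[1]          # shape (3, 4)
element = threed_arr[1, 2, 3]  # last element of the array

print(sub.shape)   # (1, 2, 2)
print(sub)
print(element)     # 23
```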

### Boolean Indexing

In [291]:
arr = np.array([9,5,2,16,7,0,4])
arr > 5

array([ True, False, False,  True,  True, False, False], dtype=bool)

In [296]:
arr[(arr > 5)&(arr < 9)]

array([7])

In [295]:
deliveries[(deliveries.total_runs == 6) | (deliveries.total_runs == 4)][:5]

|   | match_id | inning | batting_team | bowling_team | over | ball | batsman | non_striker | bowler | is_super_over | ... | bye_runs | legbye_runs | noball_runs | penalty_runs | batsman_runs | extra_runs | total_runs | player_dismissed | dismissal_kind | fielder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 1 | 1 | Kolkata Knight Riders | Royal Challengers Bangalore | 2 | 2 | BB McCullum | SC Ganguly | Z Khan | 0 | ... | 0 | 0 | 0 | 0 | 4 | 0 | 4 |  |  |  |
| 9 | 1 | 1 | Kolkata Knight Riders | Royal Challengers Bangalore | 2 | 3 | BB McCullum | SC Ganguly | Z Khan | 0 | ... | 0 | 0 | 0 | 0 | 4 | 0 | 4 |  |  |  |
| 10 | 1 | 1 | Kolkata Knight Riders | Royal Challengers Bangalore | 2 | 4 | BB McCullum | SC Ganguly | Z Khan | 0 | ... | 0 | 0 | 0 | 0 | 6 | 0 | 6 |  |  |  |
| 11 | 1 | 1 | Kolkata Knight Riders | Royal Challengers Bangalore | 2 | 5 | BB McCullum | SC Ganguly | Z Khan | 0 | ... | 0 | 0 | 0 | 0 | 4 | 0 | 4 |  |  |  |
| 16 | 1 | 1 | Kolkata Knight Riders | Royal Challengers Bangalore | 3 | 4 | BB McCullum | SC Ganguly | P Kumar | 0 | ... | 0 | 0 | 0 | 0 | 4 | 0 | 4 |  |  |  |


In [298]:
arr = np.array([[6,4,3,8], [9,0,2,5]])
arr[arr >4]
arr[(arr >4) & (arr < 9)]

array([6, 8, 5])
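A boolean mask can also be negated with `~` and used on the left-hand side of an assignment. The sketch below reuses the one-dimensional array from earlier in this section; `mask`, `kept` and `capped` are illustrative names.

```python
import numpy as np

arr = np.array([9, 5, 2, 16, 7, 0, 4])

mask = arr > 5        # boolean array with the same shape as arr
kept = arr[~mask]     # ~ negates the mask: elements <= 5

capped = arr.copy()
capped[mask] = 5      # assign through the mask (on the copy, arr is unchanged)

print(kept)     # [5 2 0 4]
print(capped)   # [5 5 2 5 5 0 4]
```

Combine conditions with `&`, `|` and `~` and wrap each condition in parentheses; the Python keywords `and`, `or` and `not` do not work element by element.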

In [157]:
arr_str = np.array(['Google', 'Apple', 'Microsoft', 'Facebook','PayPal','HP', 'LinkedIn'])
arr_str == 'Microsoft'

array([False, False,  True, False, False, False, False], dtype=bool)

In [158]:
arr_str[arr_str != 'Microsoft']

array(['Google', 'Apple', 'Facebook', 'PayPal', 'HP', 'LinkedIn'], 
      dtype='<U9')

In [159]:
arr_str[(arr_str == 'Microsoft') | (arr_str =='PayPal')]

array(['Microsoft', 'PayPal'], 
      dtype='<U9')

In [303]:
matches=pd.read_csv('/Users/vaishaligarg/Downloads/ipl/matches.csv')
matches.head()
matches[matches.player_of_match == 'MEK Hussey'].shape

(12, 18)

### Broadcasting

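Broadcasting is how NumPy applies element-by-element operations to arrays of the same shape, to an array and a scalar, or to arrays whose shapes are compatible (comparing dimensions from the last one backwards, each pair is equal or one of them is 1).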


In [307]:
# Given two lists, avg_score and innings_played, return a list with the element-wise product

avg_score = [31,21,55,18,8,14,3,95,21,6]
innings_played = [8,7,10,2,8,5,13,4,7,9]
#avg_score*innings_played # TypeError: can't multiply sequence by non-int of type 'list'
#total_runs = [s*i for s, i in zip(avg_score, innings_played)]
#total_runs
%timeit [s*i for s, i in zip(avg_score, innings_played)]

100000 loops, best of 3: 2.84 µs per loop


In [310]:
avg_score = np.array(avg_score)
innings_played = np.array(innings_played)
#avg_score*innings_played
%timeit avg_score*innings_played

The slowest run took 14.11 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.61 µs per loop


In [311]:
a1 = np.array([1,2,3,4])
a2 = np.array([5,6,7,8])
a1+a2
a1-a2
#a1/a2

array([-4, -4, -4, -4])

In [312]:
#Try the same with arrays of different lengths
a1 = np.array([1,2,3,4])
a2 = np.array([5,6,7,8,9])
a1+a2

ValueError: operands could not be broadcast together with shapes (4,) (5,) 

In [314]:
#Here the newaxis index operator inserts a new axis into a, making it a two-dimensional 4x1 array. Combining the 
#4x1 array with b, which has shape (3,), yields a 4x3 array.

a = np.array([0.0, 10.0, 20.0, 30.0])
b = np.array([1.0, 2.0, 3.0])
a = a[:, np.newaxis]
a #column vector of shape (4, 1)
a + b    

array([[  1.,   2.,   3.],
       [ 11.,  12.,  13.],
       [ 21.,  22.,  23.],
       [ 31.,  32.,  33.]])
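The shape rule behind the last two cells can be written out in a few lines. `broadcast_shape` below is a hypothetical helper written for this tutorial, not a NumPy function; it compares the two shapes from the last axis backwards and treats missing axes as length 1.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: apply the broadcasting rule to two shapes."""
    result = []
    # Walk both shapes from the last axis backwards
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        dim_a = shape_a[-i] if i <= len(shape_a) else 1  # missing axes count as 1
        dim_b = shape_b[-i] if i <= len(shape_b) else 1
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            raise ValueError(f"shapes {shape_a} and {shape_b} cannot be broadcast")
        # An axis of length 1 is stretched to match the other array
        result.append(dim_b if dim_a == 1 else dim_a)
    return tuple(reversed(result))

a = np.array([0.0, 10.0, 20.0, 30.0])[:, np.newaxis]  # shape (4, 1)
b = np.array([1.0, 2.0, 3.0])                         # shape (3,)

print(broadcast_shape(a.shape, b.shape))   # (4, 3)
print((a + b).shape)                       # (4, 3)
```

Calling `broadcast_shape((4,), (5,))` raises a `ValueError`, which mirrors the error NumPy reported for the arrays of length 4 and 5.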

### Universal Functions
    Functions that operate on ndarrays in an element-by-element fashion
https://docs.scipy.org/doc/numpy/reference/ufuncs.html

In [315]:
a1 = np.array([10,2,31,45])
a2 = np.array([53,62,7,18])

np.add(a1,a2)

array([63, 64, 38, 63])

In [96]:
np.subtract(a1,a2)

array([-43, -60,  24,  27])

In [97]:
np.divide(a1,a2)

array([ 0.18867925,  0.03225806,  4.42857143,  2.5       ])

In [98]:
np.maximum(a1,a2)

array([53, 62, 31, 45])

In [316]:
a = np.array([1,4,9,16,25])
np.sqrt(a)

array([ 1.,  2.,  3.,  4.,  5.])

In [221]:
np.exp(a)

array([  2.71828183e+00,   5.45981500e+01,   8.10308393e+03,
         8.88611052e+06,   7.20048993e+10])

In [318]:
arr = np.array([6,2,0,1,5,0,7,10,0])
np.nonzero(arr)
arr[np.nonzero(arr)]

array([ 6,  2,  1,  5,  7, 10])

In [319]:
x = np.random.randn(5)
x

array([-0.19565899, -0.1040445 ,  1.66369894, -0.83598469,  0.78541226])

In [321]:
np.floor(x)
np.ceil(x)

array([-0., -0.,  2., -0.,  1.])

#### Matrix multiplication

In [323]:
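#np.matrix is kept for backward compatibility; for new code NumPy recommends regular arrays
x = np.matrix([[10,2],[31,45]])
y = np.matrix([[53,62], [7,18]]) #the np.mat alias was removed in NumPy 2.0
x*y #for np.matrix objects, * is matrix multiplication
#np.dot(x,y) # or x.dot(y)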

matrix([[ 544,  656],
        [1958, 2732]])

In [322]:
x = np.array([[10,2],[31,45]])
y = np.array([[53,62], [7,18]])
x*y

array([[530, 124],
       [217, 810]])
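With regular arrays, `*` multiplies element by element, as the cell above shows. For the matrix product, the `@` operator (or `np.dot`) can be used on the same arrays. A short sketch:

```python
import numpy as np

x = np.array([[10, 2], [31, 45]])
y = np.array([[53, 62], [7, 18]])

elementwise = x * y    # multiplies matching entries
matrix_prod = x @ y    # matrix product; same as np.dot(x, y) for 2-D arrays

print(matrix_prod)
```

The result of `x @ y` matches the `np.matrix` product shown earlier in this section.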

### np.where(condition, value_if_true, value_if_false)


In [330]:
x = np.random.randn(5)
y = np.random.randn(5)
x

array([-0.40642063, -1.29189162,  1.37123845,  0.59009313, -0.58297879])

In [331]:
y

array([-1.68000835,  1.68077439,  0.76761372,  1.47195587,  0.26292138])

In [332]:
np.where(x > 0, x, y)

array([-1.68000835,  1.68077439,  1.37123845,  0.59009313,  0.26292138])
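`np.where` also accepts scalars in place of the two arrays, and when it is called with the condition alone it returns the indices where the condition holds. A small sketch with a made-up array:

```python
import numpy as np

arr = np.array([3, -1, 0, 7, -5])

# Three-argument form: pick from the second or third argument per element
clipped = np.where(arr > 0, arr, 0)

# One-argument form: returns a tuple of index arrays, one per dimension
(positive_idx,) = np.where(arr > 0)

print(clipped)        # [3 0 0 7 0]
print(positive_idx)   # [0 3]
```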

In [226]:
matches.head()
matches1 = matches[['id', 'season', 'team1', 'team2', 'win_by_runs', 'win_by_wickets']].copy()
matches1.head()

|   | id | season | team1 | team2 | win_by_runs | win_by_wickets |
|---|---|---|---|---|---|---|
| 0 | 1 | 2008 | Kolkata Knight Riders | Royal Challengers Bangalore | 140 | 0 |
| 1 | 2 | 2008 | Chennai Super Kings | Kings XI Punjab | 33 | 0 |
| 2 | 3 | 2008 | Rajasthan Royals | Delhi Daredevils | 0 | 9 |
| 3 | 4 | 2008 | Mumbai Indians | Royal Challengers Bangalore | 0 | 5 |
| 4 | 5 | 2008 | Deccan Chargers | Kolkata Knight Riders | 0 | 5 |


In [333]:
matches1['level'] = np.where((matches1.win_by_runs < 20) , 'Super exciting!', 'Expected')
matches1.head()

|   | id | season | team1 | team2 | win_by_runs | win_by_wickets | level |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2008 | Kolkata Knight Riders | Royal Challengers Bangalore | 140 | 0 | Expected |
| 1 | 2 | 2008 | Chennai Super Kings | Kings XI Punjab | 33 | 0 | Expected |
| 2 | 3 | 2008 | Rajasthan Royals | Delhi Daredevils | 0 | 9 | Super exciting! |
| 3 | 4 | 2008 | Mumbai Indians | Royal Challengers Bangalore | 0 | 5 | Super exciting! |
| 4 | 5 | 2008 | Deccan Chargers | Kolkata Knight Riders | 0 | 5 | Super exciting! |


In [360]:
matches1['level'] = np.where((matches1.win_by_runs < 20) & (matches1.win_by_wickets < 3), 'Super exciting!', 'Expected')
matches1.head()

|   | id | season | team1 | team2 | win_by_runs | win_by_wickets | level |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2008 | Kolkata Knight Riders | Royal Challengers Bangalore | 140 | 0 | Expected |
| 1 | 2 | 2008 | Chennai Super Kings | Kings XI Punjab | 33 | 0 | Expected |
| 2 | 3 | 2008 | Rajasthan Royals | Delhi Daredevils | 0 | 9 | Expected |
| 3 | 4 | 2008 | Mumbai Indians | Royal Challengers Bangalore | 0 | 5 | Expected |
| 4 | 5 | 2008 | Deccan Chargers | Kolkata Knight Riders | 0 | 5 | Expected |


### Statistical methods


In [27]:
arr = np.array([[5,9,3,6], [9,2,0,8]])
arr

array([[5, 9, 3, 6],
       [9, 2, 0, 8]])

#### axis 0 - computed down the rows - the rows collapse into one result per column

In [28]:
arr.mean(axis = 0)

array([ 7. ,  5.5,  1.5,  7. ])

#### axis 1 - computed across the columns - the columns collapse into one result per row

In [342]:
arr.mean(axis = 1)

array([ 5.75,  4.75])

In [344]:
np.median(arr, 0)

array([ 7. ,  5.5,  1.5,  7. ])
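One way to keep the axis convention straight is to look at the shape of the result: the axis passed to the function is the one that disappears. The sketch below reuses the array from this section; `keepdims=True` is an optional argument that keeps the reduced axis with length 1.

```python
import numpy as np

arr = np.array([[5, 9, 3, 6], [9, 2, 0, 8]])   # shape (2, 4)

col_sums = arr.sum(axis=0)   # collapses the 2 rows    -> shape (4,)
row_sums = arr.sum(axis=1)   # collapses the 4 columns -> shape (2,)

# keepdims=True keeps the reduced axis with length 1, which helps broadcasting
row_means = arr.mean(axis=1, keepdims=True)    # shape (2, 1)
centered = arr - row_means                     # each row now has mean 0

print(col_sums)   # [14 11  3 14]
print(row_sums)   # [23 19]
```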

### sort(), argmax(), argmin()
    argmax: Returns the indices of the maximum values along an axis.
    argmin: Returns the indices of the minimum values along an axis.

In [39]:
arr = np.array([6,9,4,2,10,3])
arr.sort()
arr

array([ 2,  3,  4,  6,  9, 10])

In [351]:
arr = np.array([6,9,4,2,10,3])
arr.argmax() 
arr.argmin() 
arr[arr.argmin()]

2
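Note that `arr.sort()` sorts in place, which is why the cell above rebuilds `arr` before calling `argmax` and `argmin`. `np.sort` returns a sorted copy instead, and `np.argsort` returns the indices that would sort the array. A short sketch on the same values:

```python
import numpy as np

arr = np.array([6, 9, 4, 2, 10, 3])

sorted_copy = np.sort(arr)   # new sorted array; arr itself is unchanged
order = np.argsort(arr)      # indices that would sort arr

top_two = arr[order[-2:]]    # two largest values, in ascending order

print(order)      # [3 5 2 0 1 4]
print(top_two)    # [ 9 10]
```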

### Stacking 

In [347]:
a = np.random.random((2,2))
a

array([[ 0.50443326,  0.83620898],
       [ 0.93537317,  0.46407502]])

In [348]:
b = np.random.random((2,2))
b

array([[ 0.64232903,  0.89373215],
       [ 0.71645084,  0.23584626]])

In [349]:
c = np.vstack((a,b))
c#.shape

array([[ 0.50443326,  0.83620898],
       [ 0.93537317,  0.46407502],
       [ 0.64232903,  0.89373215],
       [ 0.71645084,  0.23584626]])

In [350]:
d = np.hstack((a,b))
d#.shape

array([[ 0.50443326,  0.83620898,  0.64232903,  0.89373215],
       [ 0.93537317,  0.46407502,  0.71645084,  0.23584626]])

In [361]:
np.concatenate((a,b))
np.concatenate((a,b), axis = 1) #pd.concat([df1, df2], axis = 1)

array([[ 0.50443326,  0.83620898,  0.64232903,  0.89373215],
       [ 0.93537317,  0.46407502,  0.71645084,  0.23584626]])
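The resulting shapes are easier to follow with small integer arrays. The sketch below also shows `np.stack`, which is not used above: it joins the arrays along a new axis instead of an existing one.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

v = np.vstack((a, b))   # same as np.concatenate((a, b), axis=0)
h = np.hstack((a, b))   # same as np.concatenate((a, b), axis=1)
s = np.stack((a, b))    # adds a new leading axis

print(v.shape)   # (4, 2)
print(h.shape)   # (2, 4)
print(s.shape)   # (2, 2, 2)
```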

### Splitting

In [214]:
np.hsplit(c, 2)

[array([[ 0.50443326],
        [ 0.93537317],
        [ 0.64232903],
        [ 0.71645084]]), array([[ 0.83620898],
        [ 0.46407502],
        [ 0.89373215],
        [ 0.23584626]])]

In [215]:
np.vsplit(c, 2)

[array([[ 0.50443326,  0.83620898],
        [ 0.93537317,  0.46407502]]), array([[ 0.64232903,  0.89373215],
        [ 0.71645084,  0.23584626]])]

### Copies and views

In [273]:
a = np.arange(10)
b = a #no copy and no view: b is a second name for the same array object
b is a #identity operator

True

In [274]:
b*=2
b

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [275]:
a

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

In [276]:
a +=5
b

array([ 5,  7,  9, 11, 13, 15, 17, 19, 21, 23])

In [277]:
a == b

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)

In [278]:
a = np.arange(10)
b = a.copy()
b is a

False

In [279]:
b*=2

In [280]:
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [281]:
a+=10

In [282]:
b

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])
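Between plain assignment and a full copy there is a third case: a view, which is a new array object that shares the data of the original. Basic slicing returns a view. The sketch below shows two ways to check this, `np.shares_memory` and the `.base` attribute.

```python
import numpy as np

a = np.arange(10)

view = a[2:5]            # basic slicing returns a view on a's data
copy = a[2:5].copy()     # an independent copy

view[0] = 99             # visible through a
copy[1] = -1             # not visible through a

print(a)                          # [ 0  1 99  3  4  5  6  7  8  9]
print(view.base is a)             # True
print(np.shares_memory(a, copy))  # False
```

Boolean and integer-array indexing, in contrast, return copies rather than views.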

In [390]:

import numpy as np
a = np.array([8,5,3,1,9,7,4,2,6,8,2,0])
a = a.reshape(2,6)
#a.mean(axis = 0) #mean along axis 0 - for each column
#a.mean(axis = 1) #mean along axis 1 - for each row
a.mean() #mean of the entire array

4.583333333333333

In [393]:
a1 = np.array([9,0,5,0,2,1])
a1[np.nonzero(a1)]
a[-1]

array([4, 2, 6, 8, 2, 0])