In [1]:
import sys
import numpy as np

## _Ideal Data Analysis Process_

Data Extraction - SQL databases, scraping, distributed databases (file formats: CSV, JSON, XML)
 
Data Cleaning - Missing values and empty data, incorrect types/values, outliers.

Data Wrangling - Reshaping and transforming data, Indexing for quick access, Merging, combining and Joining data.

Data Analysis - Exploration, building statistical models, visualisations and representations, statistical analysis, reporting.

Action - (Building ETL pipelines, ML models - Data Science) 
(Dashboards, Decision making assist - Data Analysis)


## Numpy
NumPy is an array-processing library for numeric computation. Pure Python is slow at processing large amounts of numbers compared with compiled languages such as C++ or Java, so NumPy does the heavy work in compiled code.


### Why numpy over Python?

#### 1 - Difference in size

In [2]:
sys.getsizeof(1)

28

Here we can see that Python uses 28 bytes to store a single small integer, because every int is a full object. That overhead takes a lot of space and slows down processing of many numbers.

In [3]:
np.dtype(int).itemsize

4

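Here NumPy uses only 4 bytes to store a single number, because the default integer type on this system is int32. On many other systems the default is int64, which takes 8 bytes.

We can also control how many bytes each number takes by choosing the dtype -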

In [4]:
np.dtype(np.int8).itemsize

1

To get the memory size of a NumPy matrix Z -

In [5]:
Z = np.zeros((10,10))

print("%d bytes" % (Z.size * Z.itemsize))

800 bytes
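
The size difference can also be checked for a whole collection. The following is a minimal sketch, not a rigorous measurement: the names py_list, arr, list_bytes and array_bytes are illustrative, and the dtype is fixed to int64 so the array size does not depend on the platform.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)  # explicit dtype: 8 bytes per element everywhere

# A list stores references to separate int objects, so count the list and the ints
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# nbytes is the size of the array's data buffer, equal to size * itemsize
array_bytes = arr.nbytes

print(f"list : {list_bytes} bytes (approximate)")
print(f"array: {array_bytes} bytes")
```

The list figure varies with the Python version, while the array figure is always n times the item size.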


#### 2 - Difference in performance

Consider a list of 100000 elements (Python) and an array of 100000 elements (NumPy)

In [6]:
l = list(range(100000))

In [7]:
y = np.arange(100000)

In [8]:
%time sum([x**2 for x in l]) #python

CPU times: total: 31.2 ms
Wall time: 30.9 ms


333328333350000

In [9]:
%time np.sum(y**2) #numpy

CPU times: total: 0 ns
Wall time: 0 ns


216474736

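Here we can see NumPy operations are significantly faster than pure Python (the reported 0 ns only means the run was below the timer's resolution). Note that the two results differ: y is an int32 array here, so the squares overflow and the NumPy total shown above is wrong. Creating the array with dtype=np.int64 avoids this.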
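
The mismatch between 333328333350000 and 216474736 in the two outputs above comes from 32-bit integer overflow. A small sketch that reproduces it on any platform by forcing the dtypes explicitly (the variable names are illustrative):

```python
import numpy as np

y32 = np.arange(100000, dtype=np.int32)
y64 = np.arange(100000, dtype=np.int64)

# int32 arithmetic wraps around silently once a value exceeds 2**31 - 1;
# the accumulator is forced to int32 too, to mimic a system whose default int is int32
wrapped = int(np.sum(y32 * y32, dtype=np.int32))

# int64 has enough range for this sum
correct = int(np.sum(y64 * y64))

print("int32:", wrapped)
print("int64:", correct)
print("pure Python:", sum(x**2 for x in range(100000)))
```

NumPy does not warn about overflow in array operations, so choosing a wide enough dtype is up to the user.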

### Basic Numpy Arrays

In [10]:
a = np.array([1,2,3,4,5])
b = np.array([0,0.5,1.5,2])

To get the values at positions 1 and 2 (indices 0 and 1) of an array -

In [11]:
a[0] , a[1]

(1, 2)

### Indexing - 

In [12]:
a[2:]  #to show elements from index 2 to the end

array([3, 4, 5])

In [13]:
a[1:5] #to show elements from index 1 up to, but not including, index 5

array([2, 3, 4, 5])

In [14]:
a[1:-1] #To show all middle elements

array([2, 3, 4])

In [15]:
a[-1] #to show last element of an array

5

In [16]:
a[::-1] #to show elements in reverse order

array([5, 4, 3, 2, 1])

### Array Manipulation

In [17]:
a.sort() #sorts array in ascending order

In [18]:
a

array([1, 2, 3, 4, 5])

In [19]:
a[3] = 1 #sets the element at index 3 (the 4th position) to 1
a

array([1, 2, 3, 1, 5])

### Extracting elements from an array in 2 ways -

Traditional way

In [20]:
a[0],a[2],a[4]

(1, 3, 5)

Multi Indexing (creates another array instead of a tuple) -

In [21]:
a[[0,2,3]]

array([1, 3, 1])

### Array Types 

In [22]:
a.dtype

dtype('int32')

In [23]:
b.dtype

dtype('float64')

To create an array with a specific type, pass dtype. Here integers are stored as float64; the same step with dtype=np.int32 stores the values as int32.

In [24]:
np.array([1,2,3,4], dtype=np.float64)

array([1., 2., 3., 4.])
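
For an array that already exists, the type can be converted with astype, which returns a new array. A short sketch with made-up example values:

```python
import numpy as np

ints = np.array([1, 2, 3, 4], dtype=np.int32)
floats = np.array([0.0, 0.5, 1.5, 2.9])

as_float = ints.astype(np.float64)  # int32 -> float64
as_int = floats.astype(np.int32)    # float64 -> int32, the fractional part is dropped

print(as_float.dtype, as_float)
print(as_int.dtype, as_int)
print(ints.dtype)  # the original array keeps its dtype
```

Converting float to int truncates rather than rounds, so np.round may be needed first if rounding is intended.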

#### The idea of NumPy is that you can create multidimensional arrays
E.g. a 2 dimensional array

In [25]:
A = np.array([
    [1,2,3],
    [3,5,1]
])

In [26]:
A

array([[1, 2, 3],
       [3, 5, 1]])

2 rows and 3 columns

In [27]:
A.shape

(2, 3)

In [28]:
A.size

6

In [29]:
A.ndim #2 dimensional array

2

3 Dimensional Array

In [30]:
B = np.array([
    [
        [1,3,5],
        [2,4,5],
    ],
    [
        [3,1,4],
        [1,4,3]
    ]
])

In [31]:
B

array([[[1, 3, 5],
        [2, 4, 5]],

       [[3, 1, 4],
        [1, 4, 3]]])

In [32]:
B.shape


(2, 2, 3)

In [33]:
B.size

12

In [34]:
B.ndim #3 dimensional array

3

### Indexing and Slicing of Matrix

In [35]:
C = np.array([
#    0 1 2 
    [1,3,5], #0
    [4,2,6], #1
    [4,1,2]  #2
])

In [36]:
#[row][column]
C[1] #gives row 1

array([4, 2, 6])

In [37]:
C[1][0] #gives value from 2nd row, 1st column (two indexing steps)

4

In [38]:
C[1,0] #gives the same value in a single indexing step

4

In [39]:
C[1:-1,1:-1] #gives middle element of matrix

array([[2]])

In [40]:
C[0:2] #gives row 1 and row 2

array([[1, 3, 5],
       [4, 2, 6]])

In [41]:
C[:, :2] #gives column 1 and column 2

array([[1, 3],
       [4, 2],
       [4, 1]])

In [42]:
C[:, 1:3]

array([[3, 5],
       [2, 6],
       [1, 2]])

#### To assign a new array to a row in a matrix

In [43]:
C[1] = np.array([0,1,3])

In [44]:
C

array([[1, 3, 5],
       [0, 1, 3],
       [4, 1, 2]])

#### If you want to assign the same value to every element of a row

In [45]:
C[2] = 69

In [46]:
C

array([[ 1,  3,  5],
       [ 0,  1,  3],
       [69, 69, 69]])

In [47]:
C.shape

(3, 3)

In [48]:
C.size

9

#### To convert a list to a NumPy array -

In [49]:
X = [1,2,3]
print(X,type(X))

[1, 2, 3] <class 'list'>


In [50]:
Y = np.array(X)
print(Y,type(Y))

[1 2 3] <class 'numpy.ndarray'>


In [51]:
X = np.array([4,5,6])

#### To copy an array to a different array

In [52]:
Y = np.copy(X)
print(Y)

[4 5 6]
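
np.copy matters because plain assignment and slicing do not duplicate the data. A minimal sketch of the difference (the names alias, view and copy are illustrative):

```python
import numpy as np

X = np.array([4, 5, 6])

alias = X          # same array object, just another name
view = X[:]        # new array object that shares X's data
copy = np.copy(X)  # independent data

X[0] = 99          # modify the original

print(alias[0], view[0], copy[0])
print(np.shares_memory(X, view), np.shares_memory(X, copy))
```

Only the copy keeps the old value; the alias and the view both see the change.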


#### Create a numpy array with numbers from 1 to 10, in descending order.

In [53]:
np.arange(1, 11)[::-1]

array([10,  9,  8,  7,  6,  5,  4,  3,  2,  1])

### Summary Statistics -  

In [54]:
a = np.array([1,4,2,5,6])

In [55]:
a

array([1, 4, 2, 5, 6])

#### Basic Aggregate Operations - 

In [56]:
a.sum()

18

In [57]:
a.mean()

3.6

In [58]:
a.std()

1.8547236990991407

In [59]:
a.var()

3.44

In [60]:
C

array([[ 1,  3,  5],
       [ 0,  1,  3],
       [69, 69, 69]])

In [61]:
C.sum()

220

In [62]:
C.sum(axis=0)

array([70, 73, 77])

Here 70 is the sum of the first column, 73 of the second, and so on

In [63]:
C.sum(axis=1)

array([  9,   4, 207])

Here 9 is the sum of the first row, 4 of the second, and so on

Other statistical calculations (mean, std, min, max and so on) can be done the same way by replacing "sum"

In [64]:
C.ndim

2

Here C is 2 dimensional hence axis values can be either 0 or 1
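
As a sketch of the same axis argument with other aggregates, using the matrix C from above (recreated here so the snippet runs on its own):

```python
import numpy as np

C = np.array([
    [1, 3, 5],
    [0, 1, 3],
    [69, 69, 69],
])

col_means = C.mean(axis=0)  # one mean per column
row_max = C.max(axis=1)     # one maximum per row
row_min = C.min(axis=1)     # one minimum per row

print(col_means)
print(row_max)
print(row_min)
```

axis=0 collapses the rows and leaves one value per column; axis=1 collapses the columns and leaves one value per row.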

### Broadcasting and Vectorized operations
_(Main use of numpy)_

In [65]:
a

array([1, 4, 2, 5, 6])

In [66]:
a+10

array([11, 14, 12, 15, 16])

This is an example of broadcasting: the scalar 10 is added to each element of the array in a single vectorized operation

In [67]:
a

array([1, 4, 2, 5, 6])

Expressions such as a+10 return a new array and leave the original array unchanged.

To modify the array itself, we use an in-place operator such as +=

In [68]:
a += 10

In [69]:
a

array([11, 14, 12, 15, 16])

We use NumPy for array operations because in plain Python we would need a loop (or a list comprehension) to modify every value in a list

In [70]:
c = np.array([0,11,1.4,1.6,2.4])

In [71]:
a+c

array([11. , 25. , 13.4, 16.6, 18.4])
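
Broadcasting also works between arrays of different shapes, as long as the shapes are compatible. A small sketch with illustrative arrays M, row and col:

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])      # shape (2, 3)
row = np.array([10, 20, 30])   # shape (3,)
col = np.array([[100], [200]]) # shape (2, 1)

plus_row = M + row  # row is added to every row of M
plus_col = M + col  # col is added to every column of M

print(plus_row)
print(plus_col)
```

Shapes are compared from the last dimension backwards; each pair of dimensions must be equal or one of them must be 1.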

### Boolean Arrays
_Also called masks_

#### Selecting elements in an array with boolean values, i.e. True or False

In [72]:
a

array([11, 14, 12, 15, 16])

In [73]:
a[[True,False,False,True,True]]

array([11, 15, 16])

In [74]:
a<=12

array([ True, False,  True, False, False])

In [75]:
a[a<=12]

array([11, 12])

The advantage is that we can quickly filter/select elements based on a condition

In [76]:
a.mean()

13.6

In [77]:
a

array([11, 14, 12, 15, 16])

In [78]:
a[(a>a.mean())]

array([14, 15, 16])

In [79]:
a[(a<=14)&(a>=11)]

array([11, 14, 12])
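
Besides & (and), masks can be combined with | (or) and inverted with ~ (not). The sketch below also uses np.where to replace values instead of filtering them; the variable names are illustrative.

```python
import numpy as np

a = np.array([11, 14, 12, 15, 16])

outside = a[(a < 12) | (a > 15)]  # elements below 12 or above 15
odd = a[~(a % 2 == 0)]            # elements that are not even
capped = np.where(a > 14, 14, a)  # values above 14 are replaced by 14

print(outside)
print(odd)
print(capped)
```

The parentheses around each condition are required, because &, | and ~ bind more tightly than the comparison operators.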

### Linear Algebra
Dot product, cross product, transpose

In [80]:
A = np.array([
    [1,2,6],
    [4,12,4],
    [7,2,17]
])

In [81]:
A

array([[ 1,  2,  6],
       [ 4, 12,  4],
       [ 7,  2, 17]])

In [82]:
C

array([[ 1,  3,  5],
       [ 0,  1,  3],
       [69, 69, 69]])

In [83]:
A.dot(C)

array([[ 415,  419,  425],
       [ 280,  300,  332],
       [1180, 1196, 1214]])

In [84]:
A@C

array([[ 415,  419,  425],
       [ 280,  300,  332],
       [1180, 1196, 1214]])

In [85]:
A.T

array([[ 1,  4,  7],
       [ 2, 12,  2],
       [ 6,  4, 17]])

In [86]:
C.T@A

array([[ 484,  140, 1179],
       [ 490,  156, 1195],
       [ 500,  184, 1215]])
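
The cells above cover the dot product and the transpose. As a sketch of the remaining operation named in the heading, here is the cross product of two vectors, along with the difference between element-wise and matrix multiplication (small illustrative arrays):

```python
import numpy as np

u = np.array([1, 0, 0])
v = np.array([0, 1, 0])

dot_uv = np.dot(u, v)      # scalar: 1*0 + 0*1 + 0*0
cross_uv = np.cross(u, v)  # vector perpendicular to both u and v

P = np.array([[1, 2], [3, 4]])
Q = np.array([[0, 1], [1, 0]])

elementwise = P * Q  # multiplies matching entries
matmul = P @ Q       # rows of P times columns of Q

print(dot_uv, cross_uv)
print(elementwise)
print(matmul)
```

Note that * is element-wise for NumPy arrays; matrix multiplication needs @ or dot.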

# Useful Numpy Functions

### Random Function

In [87]:
np.random.random(size=2)

array([0.88370766, 0.41173763])

In [88]:
np.random.normal(size=3) #for an array of random values drawn from a normal distribution (mean 0, std 1 by default)

array([-0.27685138, -0.79146306,  1.51419511])

In [89]:
np.random.rand(2,4) #for 2d array containing random float values

array([[0.85683524, 0.79321433, 0.51344679, 0.26200417],
       [0.73784364, 0.23587921, 0.99291788, 0.55288894]])

In [90]:
np.random.randint(1,7,size=5) 
# used to simulate random events such as rolling a die (the upper bound 7 is excluded)

array([6, 5, 1, 2, 2])

In [91]:
np.random.choice([1,2,3,4,5,6],size=5) 
#same as above, but here we can set the probability of each outcome

array([6, 1, 5, 5, 3])

In [92]:
np.random.choice([0,1],size=5,p=[0.9,0.1])
#probabilities should sum to 1 and the number of probabilities must equal the number of inputs

array([0, 0, 0, 0, 0])
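
The outputs above change on every run. If reproducible results are needed, one option is a seeded generator from np.random.default_rng. This is a sketch using NumPy's newer Generator interface rather than the np.random functions used above, and the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

rolls = rng.integers(1, 7, size=5)                 # die rolls, upper bound excluded
biased = rng.choice([0, 1], size=5, p=[0.9, 0.1])  # weighted choice

# A second generator with the same seed produces the same sequence
rng_again = np.random.default_rng(seed=42)
rolls_again = rng_again.integers(1, 7, size=5)

print(rolls)
print(biased)
print(np.array_equal(rolls, rolls_again))
```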

### arange function

In [93]:
np.arange(5,10)

array([5, 6, 7, 8, 9])

In [94]:
np.arange(1,11,2) #array with a step of 2 (every 2nd number)

array([1, 3, 5, 7, 9])

In [95]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [96]:
np.arange(0,1,.1)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

In [97]:
np.arange(10).reshape(2,5)

array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

### linspace function
Returns evenly spaced numbers over the interval defined by the first two arguments; the third argument is the number of samples

In [98]:
np.linspace(0,1,5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

In [99]:
np.linspace(0,10,10)

array([ 0.        ,  1.11111111,  2.22222222,  3.33333333,  4.44444444,
        5.55555556,  6.66666667,  7.77777778,  8.88888889, 10.        ])

In [100]:
np.linspace(0, 10, 10, endpoint=False) #excludes the end value 10

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

### Ones and zeros

In [101]:
np.zeros(5) #gives 5 zeros in an array

array([0., 0., 0., 0., 0.])

In [102]:
np.empty(5) #gives an uninitialized array; its values are arbitrary (they happen to be 0 here)

array([0., 0., 0., 0., 0.])

In [103]:
np.ones(4) #gives 4 ones in an array

array([1., 1., 1., 1.])

In [104]:
np.ones((2,3))

array([[1., 1., 1.],
       [1., 1., 1.]])

### Eye and Identity function

In [105]:
np.identity(3) #gives a 3x3 identity matrix

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [106]:
np.eye(3,3) #gives a 3x3 identity matrix

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [107]:
np.eye(8, 4, k=0) #k is the diagonal offset: 0 is the main diagonal, k>0 is above it, k<0 is below it

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [108]:
np.eye(8, 4, k=-1)

array([[0., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

### Digitize Function
Returns, for each value in the input array, the index of the bin it falls into

In [109]:
bins = np.array([0,6])
#remember bins must be monotonically increasing or decreasing

In [110]:
d = np.array([1,2,6,8])

In [111]:
np.digitize(d,bins)

array([1, 1, 2, 2], dtype=int64)
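
A slightly larger sketch of digitize, using hypothetical age groups to show how the returned indices can be mapped to labels and how the right argument changes which edge is inclusive:

```python
import numpy as np

ages = np.array([3, 15, 18, 42, 70])
bins = np.array([0, 18, 65])  # bin edges, monotonically increasing

# Default (right=False): index i means bins[i-1] <= x < bins[i]
idx = np.digitize(ages, bins)

# right=True: index i means bins[i-1] < x <= bins[i]
idx_right = np.digitize(ages, bins, right=True)

# Indices start at 1 here, so subtract 1 to look up a label
labels = np.array(["child", "adult", "senior"])
groups = labels[idx - 1]

print(idx)
print(idx_right)
print(groups)
```

The value 18 lands in the second bin by default and in the first bin with right=True.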

### Repeat 
The function repeats the elements of an array. The number of repetitions is given by the second argument, repeats

In [112]:
np.repeat(3,10)

array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
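
repeat also accepts an array, and the number of repetitions can differ per element. The sketch below contrasts it with np.tile, which repeats the array as a whole (example values are illustrative):

```python
import numpy as np

base = np.array([1, 2, 3])

each_twice = np.repeat(base, 2)       # every element repeated in place
whole_twice = np.tile(base, 2)        # the whole array repeated
per_element = np.repeat([1, 2], [3, 1])  # 1 three times, 2 once

print(each_twice)
print(whole_twice)
print(per_element)
```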

### Binomial
draws samples from a binomial distribution

In [113]:
#number of successes (number of heads) in 10 coin flips
np.random.binomial(10,0.5)

2

In [114]:
#number of successes (number of heads) in 10 coin flips with 0.8 probability of heads
#we can also obtain approximate probabilities by repeating the experiment many times (100 times here)
flips = np.random.binomial(10,0.8,size=100)

In [115]:
flips.mean()

8.0
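
With more repetitions the simulated values approach the theoretical ones. A sketch assuming a seeded generator and 100000 repetitions, both arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# 100000 experiments of 10 flips each, with 0.8 probability of heads
flips = rng.binomial(10, 0.8, size=100_000)

mean_heads = flips.mean()           # should be close to n * p = 8
p_all_heads = (flips == 10).mean()  # should be close to 0.8 ** 10

print(mean_heads)
print(p_all_heads, 0.8 ** 10)
```

Taking the mean of a boolean array gives the fraction of True values, which is a convenient way to estimate a probability.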

In [116]:
a

array([11, 14, 12, 15, 16])

In [117]:
#returns the index of the maximum value (along an axis, if one is given)
max_pos = np.argmax(a)
max_pos

4
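
For a 2 dimensional array, argmax without an axis returns a position in the flattened array. A sketch with an illustrative matrix M, using np.unravel_index to turn that position back into a row and column:

```python
import numpy as np

M = np.array([
    [1, 3, 5],
    [0, 1, 3],
    [4, 9, 2],
])

flat_pos = np.argmax(M)                         # position in the flattened array
row, col = np.unravel_index(flat_pos, M.shape)  # convert to (row, column)

per_col = np.argmax(M, axis=0)  # row index of the maximum in each column
per_row = np.argmax(M, axis=1)  # column index of the maximum in each row

print(flat_pos, (row, col))
print(per_col)
print(per_row)
```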