# Data Analysis 

## Definition

Data analysis is a process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusions and supporting decision-making. 

The first part of data analysis is gathering the data, cleaning it to remove garbage values, and transforming it into a form that we can use.

Modelling data is the process of applying real-world scenarios to the data models that we create. This is done using statistical tools.

Discovering useful information means that, from all the data we have gathered and processed, we want to derive information.

Informing conclusions and supporting decision-making is the final step of data analysis. Once we have derived useful information from the data, we want to use it to make better-informed decisions and reach an informed conclusion.

## Numpy
NumPy is a Python library used to perform numerical computations. Plain Python supports numerical and array/matrix computation through its built-in types, but that approach is comparatively slow. NumPy is beneficial when we want to process very large numerical datasets, since it is written specifically for that.

The main reason NumPy provides highly efficient numerical computation is the way it stores data in memory (RAM) for quick access: it uses native C-level code and stores values as primitive types, instead of as a list of elements that are each full Python objects.

The first step is to import the numpy package. By convention we write `import numpy as np` so that we can access the library's functions through the `np` namespace.

In [3]:
import numpy as np
import sys

### Basic Numpy Array

To create a NumPy array, we use the function `np.array()`, to which we pass a sequence of numbers such as a Python list. This is important because NumPy uses its own array implementation to make computation faster and more efficient.

In [4]:
# Create a NumPy array from a Python list
np.array([1,2,3,4])

array([1, 2, 3, 4])

In [5]:
a = np.array([1, 2, 3, 4])
b = np.array([0, .5, 1, 1.5, 2])

To access an element of a one-dimensional array, we pass the index position of the value we want.

In [8]:
a[0],a[1]

(1, 2)

Similarly, we can slice a NumPy array the same way we slice Python lists. The rules are the same: in `[start:stop]`, the index position written before `:` is included and the index position written after `:` is excluded.

In [9]:
a[0:]

array([1, 2, 3, 4])

In [10]:
a[1:3]

array([2, 3])

In [11]:
a[1:-1]

array([2, 3])

In [12]:
a[::2]

array([1, 3])

In the slice above, we pass two `:` followed by a number. This defines the step for slicing, i.e., how far the index advances before picking the next element. For example, `[0:5:2]` means that we slice starting from index `0` up to, but not including, index `5`, adding `2` to the index position at each step. So the indexes returned are 0, 2 and 4.
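As a small sketch of the step syntax, the snippet below uses a hypothetical ten-element array `x` (it is not one of the arrays defined in the cells above) so that `[0:5:2]` has enough elements to work on.

```python
import numpy as np

# Hypothetical array used only for this illustration
x = np.arange(10)          # array([0, 1, ..., 9])

# Start at index 0, stop before index 5, step by 2
first = x[0:5:2]           # picks indexes 0, 2 and 4

# Omitting start and stop applies the step to the whole array
every_third = x[::3]       # picks indexes 0, 3, 6 and 9

# A negative step walks the array backwards
reversed_x = x[::-1]

print(first, every_third, reversed_x)
```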

In [13]:
b[0], b[2], b[-1]

(0.0, 1.0, 2.0)

If we want to extract multiple values from a NumPy array, we can do so in two ways. The first is the traditional method, in which we write several index expressions separated by commas (as above). The second is to pass all the index positions we want inside a single `[]`, as we see below, i.e., we pass a list of index positions for the elements we want from the array. This is NumPy functionality and is not available for Python lists.

In [14]:
b[[0, 2, -1]]

array([0., 1., 2.])
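To make the contrast with Python lists concrete, here is a minimal sketch. The names `values`, `picked` and `picked_from_list` are only illustrative.

```python
import numpy as np

values = [0, .5, 1, 1.5, 2]
b = np.array(values)

# NumPy accepts a list of index positions and returns a new array
picked = b[[0, 2, -1]]

# A plain Python list rejects the same syntax with a TypeError
try:
    values[[0, 2, -1]]
    list_supports_it = True
except TypeError:
    list_supports_it = False

# The closest list equivalent is a comprehension over the indexes
picked_from_list = [values[i] for i in [0, 2, -1]]

print(picked, list_supports_it, picked_from_list)
```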

### Array Types
As we saw above, NumPy is able to provide optimizations because of the way it stores data in memory. An array stores values of a single datatype so that it can behave optimally, unlike a list, which can hold anything. NumPy automatically assigns a datatype to each array. We can check it through the `.dtype` attribute.

In [18]:
a

array([1, 2, 3, 4])

In [19]:
a.dtype

dtype('int64')

Since the array `a` contains only integer values, NumPy assigns it the datatype `int64` (whether the default integer is 64 or 32 bits wide depends on the platform the Python code runs on).

In [20]:
b

array([0. , 0.5, 1. , 1.5, 2. ])

In [21]:
b.dtype

dtype('float64')

Since `b` has decimal numbers, it is assigned the datatype `float64`, which is NumPy's default floating-point type.

We can also explicitly define the datatype of an array we are creating by passing the `dtype` argument at creation time. For example, if we want to create an array with a dtype of `int8`, we can write the following:

In [25]:
np.array([1, 2, 3, 4], dtype=np.int8)

array([1, 2, 3, 4], dtype=int8)

We can also store string or object type data using numpy but it is not very efficient. Numpy shines when we use datatypes like int, date, boolean, etc. 

In [30]:
c = np.array(['a','b','c'])

In [31]:
c.dtype

dtype('<U1')

In [32]:
d = np.array([{'a':1}, sys])

In [33]:
d.dtype

dtype('O')

NumPy has special datatypes for string and object data in case we want to use them: `<U1` above means a Unicode string of length 1, and `O` means a generic Python object.
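The boolean and date types mentioned above can be sketched as follows. The arrays `flags` and `dates` are made-up examples, and `datetime64[D]` (day precision) is just one of the date units NumPy offers.

```python
import numpy as np

# Boolean array: each element is stored in a single byte
flags = np.array([True, False, True])

# Date array: dtype 'datetime64[D]' stores dates with day precision
dates = np.array(['2021-01-01', '2021-01-02'], dtype='datetime64[D]')

# Subtracting dates yields a timedelta64 value
gap = dates[1] - dates[0]

print(flags.dtype, flags.itemsize)
print(dates.dtype, gap)
```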

### Dimensions and Shapes
We can create arrays of any number of dimensions using NumPy. Above we created 1-d matrices, which we call arrays. We can also create 2-d and 3-d matrices.

NumPy arrays have several attributes that give us more information about the array. Some of them are -

* `shape` - The `.shape` attribute gives the shape of the matrix, i.e., how many rows and columns there are in the case of a 2d matrix. For a 3d matrix and above, it tells us how many data points we have along each axis
* `ndim` - The `.ndim` attribute gives the number of dimensions of the matrix
* `size` - The `.size` attribute gives the total number of elements in the matrix

For 2 dimensional array

In [43]:
A = np.array([
    [1,2,3],
    [4,5,6]
])

In [44]:
A

array([[1, 2, 3],
       [4, 5, 6]])

In [38]:
A.shape

(2, 3)

In [39]:
A.ndim

2

In [40]:
A.size

6

For 3 dimensional array

In [41]:
B = np.array([
    [
        [12,11,10],
        [9,8,7]
    ],
    [
        [6,5,4],
        [3,2,1]
    ]
])

In [42]:
B

array([[[12, 11, 10],
        [ 9,  8,  7]],

       [[ 6,  5,  4],
        [ 3,  2,  1]]])

In [45]:
B.shape

(2, 2, 3)

In [47]:
B.ndim

3

In [48]:
B.size

12

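If the shape is not consistent, NumPy cannot build a regular multidimensional array. Older NumPy versions fall back to a one-dimensional array of regular Python objects, as the output below shows; recent versions (1.24 and later) raise a `ValueError` unless `dtype=object` is passed explicitly.

For example, in the list below, the 2nd element contains only 1 inner list.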

In [50]:
# dtype=object is required on recent NumPy versions for ragged input
C = np.array([
    [
        [12,11,10],
        [9,8,7]
    ],
    [
        [6,5,4]
    ]
], dtype=object)

In [51]:
C

array([list([[12, 11, 10], [9, 8, 7]]), list([[6, 5, 4]])], dtype=object)

In [52]:
C.dtype

dtype('O')

In [53]:
C.shape

(2,)

In [54]:
C.size

2

In [55]:
type(C[0])

list

### Indexing and slicing of Matrices

When we create a multi dimension matrix, we need to take care of te

In [56]:
A = np.array([
#   0   1  2
    [1, 2, 3], #0
    [4, 5, 6], #1
    [7, 8, 9]  #2
])

In [57]:
A[1]

array([4, 5, 6])

Since `A` is a 2d array, `A[1]` will return the 2nd row of the matrix.

To select the element at 2nd row and 1st column, we can write

In [59]:
A[1][0]

4

But there is a better way to get an element from a multidimensional array in NumPy. We can pass the index for each dimension, comma separated, as `[d1, d2, d3, d4]` and so on. For example, the above statement can be written as

In [60]:
A[1, 0]

4
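The same comma-separated form extends to higher dimensions. As a sketch, the snippet below rebuilds the 3-d array `B` from the earlier section and reads from it with one index per axis.

```python
import numpy as np

B = np.array([
    [[12, 11, 10], [9, 8, 7]],
    [[6, 5, 4], [3, 2, 1]],
])

# One index per axis: block 1, row 0, column 2
element = B[1, 0, 2]

# Chained indexing reaches the same element, one axis at a time
same_element = B[1][0][2]

# Mixing integers and slices: row 0 of every block
first_rows = B[:, 0, :]

print(element, same_element)
print(first_rows)
```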

In a similar fashion, we can perform multidimensional slicing.

In [61]:
A[0:2]

array([[1, 2, 3],
       [4, 5, 6]])

In [62]:
A[:, :2]

array([[1, 2],
       [4, 5],
       [7, 8]])

In [63]:
A[:2, :2]

array([[1, 2],
       [4, 5]])

In [64]:
A

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

If we want to set the values of all the elements of a row, we can do so in two ways. The first is to create a new `np.array` and assign it to the row, as follows

In [65]:
A[1] = np.array([10,10,10])

In [66]:
A

array([[ 1,  2,  3],
       [10, 10, 10],
       [ 7,  8,  9]])

The other way is to put a single value directly after the assignment operator, and NumPy will fill the whole row with that value, as follows

In [67]:
A[2] = 99

In [68]:
A

array([[ 1,  2,  3],
       [10, 10, 10],
       [99, 99, 99]])

### Summary Statistics
There are multiple functions that we can call on a NumPy array. Some of them are
* sum - This adds all the elements and returns the total
* mean - This calculates the mean value of all the elements
* std - This calculates the standard deviation
* var - This calculates the variance of the values

In [69]:
a = np.array([1,2,3,4])

In [70]:
a.sum()

10

In [71]:
a.mean()

2.5

In [72]:
a.std()

1.118033988749895

In [73]:
a.var()

1.25

We can use the same functions in matrices also

In [74]:
A = np.array([
    [1,2,3],
    [4,5,6],
    [7,8,9]
])

In [75]:
A.sum()

45

In [76]:
A.mean()

5.0

In [77]:
A.std()

2.581988897471611

We can also apply these functions along a specific axis (only across rows or only across columns, or along any axis of a higher-dimensional matrix). For example -

In [78]:
A.sum(axis = 1)

array([ 6, 15, 24])

In [79]:
A.sum(axis = 0)

array([12, 15, 18])

In [81]:
A.mean(axis = 0)

array([4., 5., 6.])

In [82]:
A.mean(axis = 1)

array([2., 5., 8.])

In [83]:
A.std(axis = 0)

array([2.44948974, 2.44948974, 2.44948974])

In [84]:
A.std(axis = 1)

array([0.81649658, 0.81649658, 0.81649658])

And [many more](https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.ndarray.html#array-methods)...
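A few of those other methods, sketched on the same 3x3 matrix. The selection here (`min`, `max`, `argmax`, `cumsum`) is only a sample, not a complete list.

```python
import numpy as np

A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

smallest = A.min()             # smallest value in the whole matrix
column_max = A.max(axis=0)     # largest value in each column
flat_argmax = A.argmax()       # position of the maximum in the flattened array
row_cumsum = A.cumsum(axis=1)  # running total along each row

print(smallest, column_max, flat_argmax)
print(row_cumsum)
```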

### Broadcasting and Vectorized operations
This is an **important** concept in NumPy. We can easily apply one operation to all the elements of an array using broadcasting and vectorized operations.

In [87]:
a = np.arange(4)

In [88]:
a

array([0, 1, 2, 3])

Now let's say we want to perform an operation such as addition or multiplication on all the elements. We can simply write `a+10` or `a*10`, where `a` is a NumPy array, and the operation is broadcast to all the elements of the array. This means that when we write `a+10`, `10` is added to every element of `a`. Similarly, when we write `a*10`, every element is multiplied by `10`. This makes performing operations on NumPy arrays very easy. These operations do not modify the existing array; they return a new array.

In [91]:
a + 10

array([10, 11, 12, 13])

In [92]:
a * 10

array([ 0, 10, 20, 30])

In [93]:
a

array([0, 1, 2, 3])

To modify the existing array, we can use augmented assignment. NumPy arrays follow the syntax `variable operation= value`, i.e., `a += 10` adds `10` to every element of `a` in place, without creating a new array. The same holds for other operations.

In [94]:
a += 100

In [95]:
a

array([100, 101, 102, 103])
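One way to see what `a += 100` does, compared with `a = a + 100`, is to keep a second name for the array and check it afterwards. This is a minimal sketch; `alias` and `old_c` are illustrative names.

```python
import numpy as np

a = np.arange(4)
alias = a            # second name for the same array object

a += 100             # in place: the existing array is updated
in_place_shared = alias is a   # both names still point at one array

c = np.arange(4)
old_c = c            # second name for the original array

c = c + 100          # builds a new array and rebinds the name c
rebinding_shared = old_c is c  # old_c still refers to the untouched original

print(alias, in_place_shared)
print(old_c, c, rebinding_shared)
```

The practical consequence is that `+=` is visible through every name that refers to the array, while `c = c + 100` leaves other references unchanged.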

The plain-Python analogue of vectorized operations is the list comprehension, with which we generate a list from a given expression and condition(s). For example

In [96]:
l = [0,1,2,3]
[i * 10 for i in l]

[0, 10, 20, 30]

The broadcasting operation can be between an array and a scalar (like we saw above), or between two arrays, which we can see below

In [99]:
a = np.arange(4)

In [100]:
a

array([0, 1, 2, 3])

In [101]:
b = np.array([10,20,30,40])

In [102]:
a + b

array([10, 21, 32, 43])

In [103]:
a * b

array([  0,  20,  60, 120])
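Broadcasting also works between arrays of different but compatible shapes. The sketch below uses a made-up 2x3 matrix `M` with a row vector and a column vector, and shows that an incompatible shape is rejected.

```python
import numpy as np

M = np.arange(6).reshape(2, 3)      # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,)
col = np.array([[100], [200]])      # shape (2, 1)

# The row is added to every row of M
plus_row = M + row

# The column is added to every column of M
plus_col = M + col

# Shapes that cannot be matched raise a ValueError
try:
    M + np.array([1, 2])            # shape (2,) does not fit (2, 3)
    compatible = True
except ValueError:
    compatible = False

print(plus_row)
print(plus_col)
print(compatible)
```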

### Boolean Arrays(also called masks)

If we want to get multiple values from an array, we have already seen two ways: the traditional Python way (comma-separated index expressions) and the NumPy way (passing a list of all the index positions).

In [104]:
a = np.arange(4)

In [105]:
a

array([0, 1, 2, 3])

In [107]:
# Python way
a[0], a[-1]

(0, 3)

In [109]:
# Numpy way
a[[0, -1]]

array([0, 3])

There is one more way to get multiple elements: passing Boolean values for the indexes, i.e., for each index position whose value we want we pass `True`, otherwise we pass `False`.

For example, if we have a NumPy array `a = np.array([1,2,3,4])` and we want the values at index `0` and index `3`, we write the following -
```python
a[[True, False, False, True]]
```
which will output
```python
array([1, 4])
```
which are the values of index position `0` and `3`

In [110]:
a[[True, False, False, True]]

array([0, 3])

Now, if we have thousands or millions of records, writing `True` or `False` for each element by hand is not practical. Where the power really shines is that we can perform Boolean broadcasting, similar to the scalar and array broadcasting we saw, i.e., we can apply relational operators to our data (which return boolean values) and get or modify our data accordingly.

Let's say we have an array `a = [0,1,2,3]` and we want to get all the values that are greater than or equal to 2. Normally we would have to iterate over all the values, apply an `if` condition and return those values. But with big datasets, this is not an efficient solution.

Another way, with a NumPy array, is to broadcast the relational operator and get the values accordingly. If we apply `a >= 2`, where `a` is the above NumPy array, we get a boolean array, `array([False, False, True, True])`. We saw above that we can pass a boolean array as input when we want to get multiple values from a NumPy array, and that it returns all the values for which we pass `True` in the index position. So if we do `a[a >= 2]`, we are essentially passing `array([False, False, True, True])` as input, and we get `array([2, 3])` in response.

In [111]:
a

array([0, 1, 2, 3])

In [112]:
a >= 2

array([False, False,  True,  True])

In [113]:
a[a>=2]

array([2, 3])

So we see this is an efficient way to apply conditionals on our data and get data accordingly.
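For comparison, here is a sketch of the loop-based approach next to the mask-based one. Both select the same values; `selected_loop` and `selected_mask` are illustrative names.

```python
import numpy as np

a = np.arange(4)

# Plain Python: visit every element and test it with `if`
selected_loop = []
for value in a:
    if value >= 2:
        selected_loop.append(int(value))

# NumPy: build the mask once and index with it
mask = a >= 2
selected_mask = a[mask]

print(selected_loop, mask, selected_mask)
```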

We can also use a computed value in the condition. For example, we can ask for all the values that are greater than the mean of the array

In [114]:
a.mean()

1.5

In [115]:
a[a > a.mean()]

array([2, 3])

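We can also use boolean operations like `not`, `or` and `and`. On arrays, or is written `|`, and is written `&`, and not is written `~`. Each comparison must be wrapped in parentheses, because these operators bind more tightly than comparisons.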

In [None]:
a[~(a > a.mean())]

In [120]:
a[(a == 0) | (a == 1)]

array([0, 1])

In [121]:
a[(a <= 2) & (a % 2 == 0)]

array([0, 2])

In [122]:
A = np.random.randint(100, size =(3,3)) # We are generating a random 3x3 array

In [123]:
A

array([[48, 85, 55],
       [58, 16, 66],
       [43, 59, 77]])

In [124]:
A[np.array([
    [True, False, True],
    [False, True, False],
    [True, False, True],
])]

array([48, 55, 16, 43, 77])

In [125]:
A > 30

array([[ True,  True,  True],
       [ True, False,  True],
       [ True,  True,  True]])

In [126]:
A[A > 30]

array([48, 85, 55, 58, 66, 43, 59, 77])

So we can procedurally generate our boolean values and get our data back instead of manually performing it.
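Masks can be used to modify data as well as to read it. The sketch below hard-codes the random matrix shown above so the result is reproducible; the thresholds `30` and `60` and the fill value `-1` are arbitrary choices for illustration.

```python
import numpy as np

# Fixed copy of the random matrix shown above
A = np.array([
    [48, 85, 55],
    [58, 16, 66],
    [43, 59, 77],
])

# np.where keeps values where the condition holds and uses -1 elsewhere,
# returning a new array and leaving A untouched
flagged = np.where(A > 30, A, -1)

# Assigning through a mask changes A itself: cap every value above 60
A[A > 60] = 60

print(flagged)
print(A)
```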

### Linear Algebra
We can perform linear algebra operations on arrays, such as the dot product, matrix multiplication, transpose, etc.

In [127]:
A = np.array([
    [1,2,3],
    [4,5,6],
    [7,8,9]
])

In [128]:
B = np.array([
    [6,5],
    [4,3],
    [2,1]
])

In [129]:
# Dot product
A.dot(B)

array([[20, 14],
       [56, 41],
       [92, 68]])

In [130]:
# Matrix multiplication with the @ operator (same result as A.dot(B))
A @ B

array([[20, 14],
       [56, 41],
       [92, 68]])

In [132]:
#Transpose
B.T

array([[6, 4, 2],
       [5, 3, 1]])

In [133]:
A

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [134]:
# Matrix product of B transpose and A
B.T @ A

array([[36, 48, 60],
       [24, 33, 42]])
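Because `@`, `dot`, `*` and the vector cross product are easy to mix up, here is a short sketch of each. The vectors `u`, `v` and the 2x2 matrices are made-up inputs; `np.cross` is the NumPy function for the cross product.

```python
import numpy as np

u = np.array([1, 0, 0])
v = np.array([0, 1, 0])

dot_uv = np.dot(u, v)       # scalar: 1*0 + 0*1 + 0*0
cross_uv = np.cross(u, v)   # vector perpendicular to both u and v

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

matmul_ab = A @ B           # matrix product, same as A.dot(B)
elementwise_ab = A * B      # element-by-element product, not a matrix product

print(dot_uv, cross_uv)
print(matmul_ab)
print(elementwise_ab)
```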

## Size of objects in Memory
Python's built-in integer objects use a lot of memory (as we see below), so Python's native data types are not always optimal. A small int takes 28 bytes (224 bits), and a large integer such as `10 ** 100` (a `long` in Python 2) takes 72 bytes (576 bits). With NumPy, we can choose the element size as required, and NumPy values generally consume less space.

In [136]:
# The size of int in python is 28 bytes
sys.getsizeof(1)

28

In [138]:
# A large integer such as 10 ** 100 takes 72 bytes
sys.getsizeof(10 ** 100)

72

In [148]:
# An int8 element occupies a single byte
np.dtype(np.int8).itemsize

1

In [147]:
np.dtype(float).itemsize

8
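To compare whole containers rather than single values, the sketch below estimates the memory used by a list of 1000 integers and by an equivalent `int64` array. The list figure is a rough, CPython-specific estimate: `sys.getsizeof` on a list does not include the int objects it points to, so they are added separately.

```python
import sys
import numpy as np

n = 1000
numbers_list = list(range(n))
numbers_array = np.arange(n, dtype=np.int64)

# sys.getsizeof(list) covers only the list's own pointer table,
# so the int objects it refers to are added separately
list_bytes = sys.getsizeof(numbers_list) + sum(sys.getsizeof(x) for x in numbers_list)

# nbytes is the size of the array's data buffer: n elements * 8 bytes each
array_bytes = numbers_array.nbytes

print(list_bytes, array_bytes)
```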

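There is also a difference in performance between a native Python list and a NumPy array.

For example, if we want to sum the squares of all the elements, NumPy is much faster than plain Python, as we see below. Note that the NumPy array here holds 100,000 elements while the list holds only 10,000, so NumPy finishes sooner even with ten times more data.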

In [159]:
l = list(range(10000))

In [161]:
a = np.arange(100000)

In [162]:
%time np.sum(a ** 2)

CPU times: user 1.5 ms, sys: 0 ns, total: 1.5 ms
Wall time: 792 µs


333328333350000

In [160]:
%time sum([x ** 2 for x in l])

CPU times: user 2.15 ms, sys: 0 ns, total: 2.15 ms
Wall time: 2.16 ms


333283335000
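Outside a notebook, where `%time` is not available, a like-for-like comparison can be sketched with the standard `timeit` module. Both containers hold the same 100,000 values here. The repeat count of 10 is an arbitrary choice, and the timings will vary by machine.

```python
import timeit
import numpy as np

n = 100_000
numbers_list = list(range(n))
numbers_array = np.arange(n, dtype=np.int64)  # explicit 64-bit ints to avoid overflow

# Both expressions compute the same sum of squares
list_result = sum([x ** 2 for x in numbers_list])
array_result = int(np.sum(numbers_array ** 2))

# Time 10 runs of each; the absolute numbers depend on the machine
list_seconds = timeit.timeit(lambda: sum([x ** 2 for x in numbers_list]), number=10)
array_seconds = timeit.timeit(lambda: np.sum(numbers_array ** 2), number=10)

print(list_result == array_result)
print(f"list: {list_seconds:.4f}s, array: {array_seconds:.4f}s")
```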


## Useful Numpy functions

### `random` 

In [None]:
np.random.random(size=2)

In [None]:
np.random.normal(size=2)

In [None]:
np.random.rand(2, 4)
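The calls above return different numbers on every run. When reproducible output is needed, one option is a seeded generator from `np.random.default_rng`, sketched below. The seed `42` is an arbitrary choice.

```python
import numpy as np

# Two generators created with the same seed produce the same numbers
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

uniform_a = rng_a.random(size=2)               # floats in [0, 1)
uniform_b = rng_b.random(size=2)

normal_sample = rng_a.normal(size=2)           # standard normal distribution
int_matrix = rng_a.integers(100, size=(3, 3))  # integers in [0, 100)

print(uniform_a, uniform_b)
print(normal_sample)
print(int_matrix)
```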

---
### `arange`

In [None]:
np.arange(10)

In [None]:
np.arange(5, 10)

In [None]:
np.arange(0, 1, .1)

---
### `reshape`

In [None]:
np.arange(10).reshape(2, 5)

In [None]:
np.arange(10).reshape(5, 2)
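Two details of `reshape` that the cells above do not show, as a brief sketch: a dimension given as `-1` is inferred from the total size, and a shape that does not match the number of elements is rejected.

```python
import numpy as np

x = np.arange(10)

# -1 asks NumPy to work out that dimension from the total size
two_rows = x.reshape(2, -1)      # shape (2, 5)
two_cols = x.reshape(-1, 2)      # shape (5, 2)

# The new shape must hold exactly the same number of elements
try:
    x.reshape(3, 3)              # 9 slots cannot hold 10 elements
    fits = True
except ValueError:
    fits = False

print(two_rows.shape, two_cols.shape, fits)
```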

---
### `linspace`

In [None]:
np.linspace(0, 1, 5)

In [None]:
np.linspace(0, 1, 20)

In [None]:
np.linspace(0, 1, 20, False)

---
### `zeros`, `ones`, `empty`

In [None]:
np.zeros(5)

In [None]:
np.zeros((3, 3))

In [None]:
np.zeros((3, 3), dtype=int)  # the np.int alias was removed in NumPy 1.24

In [None]:
np.ones(5)

In [None]:
np.ones((3, 3))

In [None]:
np.empty(5)

In [None]:
np.empty((2, 2))

---
### `identity` and `eye`

In [None]:
np.identity(3)

In [None]:
np.eye(3, 3)

In [None]:
np.eye(8, 4)

In [None]:
np.eye(8, 4, k=1)

In [None]:
np.eye(8, 4, k=-3)
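The effect of the `k` argument is easier to see on a small matrix. This sketch uses 3x3 matrices only for readability.

```python
import numpy as np

# For a square matrix, identity(n) and eye(n) agree
same = np.array_equal(np.identity(3), np.eye(3))

# k=1 moves the ones to the diagonal just above the main one
upper = np.eye(3, k=1)

# k=-1 moves them to the diagonal just below it
lower = np.eye(3, k=-1)

print(same)
print(upper)
print(lower)
```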
