<a href="https://colab.research.google.com/github/bitprj/DigitalHistory/blob/Atul/Week7-Intro-to-Statistical-Analysis-and-Methods/Intro_To_Statistical_Analysis_with_NumPy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NumPy

### Creating Arrays *

NumPy allows you to work with arrays very efficiently. The array object in NumPy is called *ndarray*.

We can create a NumPy ndarray object by using the array() function.

In [None]:
import numpy as np

arr = np.array([1, 2, 3, 4, 5,6,7,8,9,10])

print(arr)

print(type(arr))

[ 1  2  3  4  5  6  7  8  9 10]
<class 'numpy.ndarray'>


**Different dimensions of arrays?**
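
As a minimal sketch of what "different dimensions" can mean, the snippet below builds a 0-D, a 1-D, and a 2-D array and checks each one with the `ndim` attribute. The variable names are illustrative only.

```python
import numpy as np

scalar = np.array(42)                      # 0-D array: a single value
vector = np.array([1, 2, 3])               # 1-D array: a flat sequence
matrix = np.array([[1, 2, 3], [4, 5, 6]])  # 2-D array: rows and columns

# ndim reports the number of dimensions of each array
print(scalar.ndim, vector.ndim, matrix.ndim)
```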

### Indexing

Indexing is the same thing as accessing an element of a list. In this case, we will be accessing an array element.

You can access an array element by referring to its **index number**. The indexes in NumPy arrays start at 0, meaning that the first element has index 0, the second has index 1, and so on.

The following example shows how you can access multiple elements of an array and perform operations on them.

In [None]:
import numpy as np

arr = np.array([1, 2, 3, 4,5,6,7,8,9,10])

print(arr[4] + arr[8])

14


### Slicing

Slicing in Python means taking a part, or subsection, of an object; in this case, an array.

We slice using this syntax: [start:end].

We can also define the step, like this: [start:end:step].

If we don't pass a start, the slice begins at the start of the array. If we don't pass an end, the slice runs to the end of the array.

If we don't pass a step, we move one element at a time. Additionally, we don't always have to step through an array from beginning to end; a negative step goes backwards. For instance:

In [None]:
# Reverse an array through backwards/negative stepping
import numpy as np
arr = np.array([3,7,9,0])

print(arr[::-1])

[0 9 7 3]


In [None]:
# Slice elements from the beginning to index 8

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7,8,9,10])

print(arr[:8])

[1 2 3 4 5 6 7 8]


You'll notice we only got to index 7. That's because the end is always *non-inclusive*. We slice up to but not including the end value. The start index on the other hand, **is** inclusive.
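
The `[start:end:step]` form described earlier can be sketched the same way. This small example, with an illustrative variable name, takes every second element between index 1 (inclusive) and index 8 (non-inclusive).

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Start at index 1, stop before index 8, and take every second element
every_other = arr[1:8:2]

print(every_other)
```

The selected indexes are 1, 3, 5, and 7, so the values 2, 4, 6, and 8 are returned.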

### Data Types

Just like base Python, NumPy has many data types available. Each one can be referred to by a single-character code. Below is a list of these codes:

* i - integer
* b - boolean
* u - unsigned integer
* f - float
* c - complex float
* m - timedelta
* M - datetime
* O - object
* S - string (byte string)
* U - unicode string
* V - fixed chunk of memory for other type ( void )

In [None]:
# Checking the data type of an array
import numpy as np

arr = np.array([5, 7, 3, 1])

print(arr.dtype)

int64


In [None]:
# How to create an array with a defined type
import numpy as np

arr = np.array([5, 7, 3, 1], dtype='S')

print(arr)
print(arr.dtype)

[b'5' b'7' b'3' b'1']
|S1


In [None]:
# How to convert between types
import numpy as np

arr = np.array([4.4, 24.1, 25.1,3.5])

newarr = arr.astype('i')

print(newarr)
print(newarr.dtype)

[ 4 24 25  3]
int32


### Copy vs. View

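In NumPy, you can work with either a copy of an array or a view of it, and it's very important that you know the difference. A copy owns its own data, so a change to the original array does not affect the copy, and a change to the copy does not affect the original. A view shares its data with the original array, so a change to either one **will** show up in the other. Here are some examples: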

In [None]:
# A Copy
import numpy as np

arr = np.array([6, 2, 1, 5, 3])
x = arr.copy()
arr[0] = 8

print(arr)
print(x)

[8 2 1 5 3]
[6 2 1 5 3]


In [None]:
# A View
import numpy as np

arr = np.array([6, 2, 1, 5, 3])
x = arr.view()
arr[0] = 8

print(arr)
print(x)

[8 2 1 5 3]
[8 2 1 5 3]
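
One way to check whether an array is a copy or a view is the `base` attribute, which is `None` for an array that owns its data, together with `np.shares_memory()`. The sketch below also assumes a basic slice, which NumPy returns as a view; the variable names are illustrative.

```python
import numpy as np

arr = np.array([6, 2, 1, 5, 3])

copied = arr.copy()   # owns its own data
viewed = arr.view()   # shares data with arr
sliced = arr[1:4]     # basic slicing also returns a view

print(copied.base is None)             # the copy has no base array
print(viewed.base is arr)              # the view's base is the original
print(np.shares_memory(arr, sliced))   # the slice shares memory with arr

# Writing through the slice changes the original array as well
sliced[0] = 99
print(arr)
```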


### Shape

All NumPy arrays have an attribute called *shape*, a tuple giving the number of elements along each dimension. This is most helpful for 2D or n-dimensional arrays; for a simple 1D array, the tuple holds a single value, the number of elements.

In [None]:
# Print the shape of an array
import numpy as np

arr = np.array([2,7,3,7])

print(arr.shape)

(4,)
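
For comparison, here is a small sketch of the shape of a 2D array. The array `table` is an illustrative example with 2 rows and 3 columns.

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

# shape is (number of rows, number of columns) for a 2D array
print(table.shape)
```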


### Reshaping Arrays *

### Iterating Through Arrays *

Iterating simply means to traverse or travel through an object. In the case of arrays, we can iterate through them by using simple for loops.

In [None]:
import numpy as np

arr = np.array([1, 5, 7])

for x in arr:
  print(x)

1
5
7


### Joining Arrays *

Joining means combining the elements of multiple arrays into one.

The basic way to do it is like this:

In [None]:
import numpy as np

arr1 = np.array([7, 1, 0])

arr2 = np.array([2, 8, 1])

arr = np.concatenate((arr1, arr2))

print(arr)

[7 1 0 2 8 1]


### Stack Functions

### Splitting Arrays

Splitting is the opposite of joining arrays. It takes one array and creates multiple from it.

In [None]:
# Split array into 4
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6,7,8])

newarr = np.array_split(arr, 4)

print(newarr)

[array([1, 2]), array([3, 4]), array([5, 6]), array([7, 8])]
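
The array above divides evenly into 4 parts. As a hedged sketch of the uneven case, `array_split()` also accepts a length that is not a multiple of the number of parts; the earlier parts then receive one extra element each.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

# 7 elements cannot be divided evenly into 3 parts,
# so the first part gets one extra element
parts = np.array_split(arr, 3)

print(parts)
```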


### Searching Arrays *

Searching an array to find a certain element is a very important and basic operation. We can do this using the *where()* function, which returns the indexes of the elements that match a condition.

In [None]:
import numpy as np

arr = np.array([1, 2, 5, 9, 5, 3, 4])

x = np.where(arr == 4)

print(x)

(array([6]),)


In [None]:
# Find all the odd numbers in an array
import numpy as np

arr = np.array([10, 20, 30, 40, 50, 60, 70, 80,99])

x = np.where(arr%2 == 1)

print(x)

(array([8]),)
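
Because `where()` returns a tuple of index arrays, a common follow-up is to unpack the first entry and use it to look up the matching values. A minimal sketch, with `positions` as an illustrative name:

```python
import numpy as np

arr = np.array([1, 2, 5, 9, 5, 3, 4])

# where() returns a tuple; its first entry holds the matching indexes
positions = np.where(arr == 5)[0]

print(positions)        # indexes where the value 5 occurs
print(arr[positions])   # the values found at those indexes
```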


### Sorting Arrays

Sorting an array is another very important and commonly used operation. NumPy has a function called sort() for this task.

In [None]:
import numpy as np

arr = np.array([4, 1, 0, 3])

print(np.sort(arr))

[0 1 3 4]


In [None]:
# Sorting a string array alphabetically
import numpy as np

arr = np.array(['zephyr', 'gate', 'match'])

print(np.sort(arr))

['gate' 'match' 'zephyr']
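
Two details are worth sketching here: `np.sort()` returns a sorted copy and leaves the original array unchanged, and a descending order can be obtained by reversing the sorted result with a negative step. The variable names are illustrative.

```python
import numpy as np

arr = np.array([4, 1, 0, 3])

ascending = np.sort(arr)        # sorted copy; arr itself is not modified
descending = ascending[::-1]    # reverse the sorted copy

print(arr)
print(ascending)
print(descending)
```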


### Filtering Arrays

Sometimes you want to create a new array from an existing one by selecting only the elements that meet a certain condition. Let's say you have an array with all integers from 1 to 10 and would like a new array with only the odd numbers. You can do this very efficiently with **filtering**. When you filter something, you keep only what you want, and the same principle applies to arrays in NumPy. NumPy uses what's called a **boolean index list** to filter: an array of True and False values, one per element of the target array, where True marks an element to keep. Using the example above, the target array would look like this:

[1,2,3,4,5,6,7,8,9,10]

And if you wanted to keep only the odd values, you would use this particular boolean index list:

[True,False,True,False,True,False,True,False,True,False]

Applying this list onto the target array will get you what you want:

[1,3,5,7,9]

A working code example is shown below:

In [None]:
import numpy as np

arr = np.array([51, 52, 53, 54])

x = [False, False, True, True]

newarr = arr[x]

print(newarr)

[53 54]


We don't need to hard-code the True and False values. As stated previously, we can filter based on conditions.

In [None]:
import numpy as np

arr = np.array([51, 52, 53, 54])

# Create an empty list
filter_arr = []

# go through each element in arr
for element in arr:
  # if the element is higher than 52, set the value to True, otherwise False:
  if element > 52:
    filter_arr.append(True)
  else:
    filter_arr.append(False)

newarr = arr[filter_arr]

print(filter_arr)
print(newarr)

[False, False, True, True]
[53 54]


Filtering is a very common task when working with data and as such, NumPy has an even more efficient way to perform it. It is possible to create a boolean index list directly from the target array and then apply it to obtain the filtered array. See the example below:

In [None]:
import numpy as np

arr = np.array([10,20,30,40,50,60,70,80,90,100])

# Named `mask` rather than `filter` to avoid shadowing Python's built-in filter()
mask = arr > 50

newarr = arr[mask]

print(mask)
print(newarr)

[False False False False False  True  True  True  True  True]
[ 60  70  80  90 100]
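
The odd-number example from the start of this section can be sketched with the same technique. Here `np.arange(1, 11)` is assumed as a convenient way to build the integers from 1 to 10, and `odd_mask` is an illustrative name.

```python
import numpy as np

arr = np.arange(1, 11)       # integers from 1 to 10

# True wherever the element leaves a remainder of 1 when divided by 2
odd_mask = arr % 2 == 1

odds = arr[odd_mask]

print(odd_mask)
print(odds)
```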


### Random Numbers and Distributions *

You will find yourself using randomly generated numbers and objects quite often in statistical analysis. This is especially true when creating simulations. As such, it is important to be familiar with both random object generation and random distributions.

### Random Generation

In [None]:
# Generate a random integer
from numpy import random

x = random.randint(50)

print(x)

23


In [None]:
# Generate a random float
from numpy import random

x = random.rand()

print(x)

0.7430470171291289


In [None]:
# Generate a random integer array of a fixed size
from numpy import random

x=random.randint(100, size=(5))

print(x)

[95 35 51 66 49]


In [None]:
# Generate a random float array of fixed size
from numpy import random

x=random.rand(10)

print(x)

[0.61790744 0.04358101 0.45571515 0.04022127 0.02050482 0.67487397
 0.91171688 0.96665611 0.57881927 0.87026867]


In [None]:
# Randomly select a number from an array
from numpy import random

x = random.choice([2, 4, 8, 16])

print(x)

4
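
The outputs above will differ every time the cells are run. When a result needs to be reproducible, one option is to seed the generator first. This is a minimal sketch using `random.seed()`; the seed value 0 is an arbitrary choice.

```python
from numpy import random

# Seeding resets the generator to a known state
random.seed(0)
first = random.randint(100, size=5)

# Seeding again with the same value replays the same sequence
random.seed(0)
second = random.randint(100, size=5)

print(first)
print(second)
print((first == second).all())
```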


### Random Distribution
A random distribution is a set of random numbers that each have a set probability of being selected. A very commonly known example of a random distribution is a 6-sided die. Each number from 1 to 6 has a certain probability of being rolled, and each roll is independent and random: one roll will not affect the next in any way.

We can generate random numbers based on defined probabilities using the choice() method of the random module. The choice() method allows us to explicitly set the probability for each value.

In [None]:
# Create an array of 20 randomly selected numbers from 4 choices. Each choice has a certain chance of being chosen.
from numpy import random

x = random.choice([2, 4, 8, 16], p=[0.2, 0.1, 0.3, 0.4], size=(20))

# The sum of all probabilities in the p vector should be 1!

print(x)

[16  8  8  2  2  2 16 16 16  8  2 16 16  2  8 16 16  8  2  8]
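
With only 20 draws, the observed counts can stray quite far from the probabilities in `p`. As a hedged sketch, drawing a much larger sample and counting each value with `np.unique()` shows the observed frequencies settling close to the requested probabilities. The sample size and seed are arbitrary assumptions.

```python
import numpy as np
from numpy import random

random.seed(1)  # arbitrary seed, only to make the run repeatable

values = [2, 4, 8, 16]
probs = [0.2, 0.1, 0.3, 0.4]

draws = random.choice(values, p=probs, size=100_000)

# Count how often each value was drawn and convert the counts to frequencies
unique, counts = np.unique(draws, return_counts=True)
freqs = counts / draws.size

for value, freq in zip(unique, freqs):
    print(value, round(float(freq), 3))
```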


### Universal Functions *


Universal functions, or ufuncs, are built-in NumPy functions that operate on NumPy array objects, or *ndarrays*. Normally you would process the elements of an array with some sort of loop, but ufuncs use a technique called **vectorization**: the operation is applied to the whole array at once in compiled code. This is a much faster, more efficient approach that makes better use of a modern CPU's capabilities. We'll discuss some of the most popular and commonly used ufuncs here.
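
To make the contrast concrete, here is a minimal sketch that adds two small arrays twice: once with an explicit Python loop and once with the `np.add()` ufunc. The arrays are illustrative, and no timing claims are made for inputs this small.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Loop version: visit each pair of elements in Python
looped = [x + y for x, y in zip(a, b)]

# Vectorized version: one ufunc call handles the whole array
vectorized = np.add(a, b)

print(looped)
print(vectorized)
```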

### Summations

In [None]:
# Sums up all the elements in arr1 
# and then adds the result to the
# sum of all elements in arr2
import numpy as np

arr1 = np.array([10,20,20])
arr2 = np.array([5,15,30])

# The arrays need to be the same length!

newarr = np.sum([arr1, arr2])

print(newarr)

100


#### Axes in Summation
The sum function in NumPy has additional arguments that you can pass to it. One of them is the `axis` argument. Depending on the value of `axis`, you can sum a 2D array along its rows or along its columns: `axis=1` adds up the elements within each row, while `axis=0` adds up the elements within each column. A 1D array has only one axis (`axis=0`), so summing it always produces a single total.

In [None]:
# Sum up each array row wise and
# create a new array with the results

import numpy as np

arr1 = np.array([10, 30, 50])
arr2 = np.array([20, 40, 60])

newarr = np.sum([arr1, arr2], axis=1)

print(newarr)

[ 90 120]
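
For comparison, a sketch of the same two arrays summed with `axis=0`, which adds the elements in each column instead of each row. The name `column_sums` is illustrative.

```python
import numpy as np

arr1 = np.array([10, 30, 50])
arr2 = np.array([20, 40, 60])

# axis=0 adds arr1 and arr2 position by position (column-wise)
column_sums = np.sum([arr1, arr2], axis=0)

print(column_sums)
```

Each result is the sum of one column: 10 + 20, 30 + 40, and 50 + 60.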


#### Cumulative Sums
NumPy also allows you to perform cumulative sums on an array. In a cumulative sum, each element is added to all of the elements before it. Here's an example.

In [None]:
# Cumulative sum of an array
import numpy as np

arr = np.array([5,7,9,11,13])

newarr = np.cumsum(arr)

print(newarr)
# The following array is [5,7+5,9+7+5,11+9+7+5,13+11+9+7+5]

[ 5 12 21 32 45]


# Introduction to Statistical Analysis

Statistics plays a central part in data science and is practically its foundation. As such, performing statistical analysis on data will be a routine part of any project you work on. Now that we've covered the basics of NumPy, we can look at how to perform standard statistical operations on data using NumPy.

## Obtaining Min and Max


In [7]:
# Calculating minima

import numpy as num 

num1 = 10
num2 = 20

arr1 = [1, 4, 10] 
arr2 = [2, 3, 20] 

print(arr1) 
print(arr2) 

min_1 = num.minimum(num1,num2) # Minimum of two numbers
min_2 = num.minimum(arr1,arr2) # Creates an array with element-wise minima from arrays

print(min_1)
print(min_2)

# Calculating maxima
# Very similar

max_1 = num.maximum(num1,num2) # Maximum of two numbers
max_2 = num.maximum(arr1,arr2) # Creates an array with element-wise maxima from arrays

print(max_1)
print(max_2)

[1, 4, 10]
[2, 3, 20]
10
[ 1  3 10]
20
[ 2  4 20]
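
The `minimum()` and `maximum()` functions compare two inputs element by element. To find the single smallest or largest value within one array, NumPy also provides `min()` and `max()`, along with `argmin()` and `argmax()` for the positions of those values. A minimal sketch, using the same sample values as the sections that follow:

```python
import numpy as np

arr = np.array([30, 1, 15, 28, 93])

print(np.min(arr), np.max(arr))        # smallest and largest value
print(np.argmin(arr), np.argmax(arr))  # indexes of those values
```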


## Mean

In [8]:
# Calculating the mean
import numpy as num

arr = [30, 1, 15, 28, 93] 

print(arr) 
print("MEAN : ", num.mean(arr)) 



[30, 1, 15, 28, 93]
MEAN :  33.4


## Median

In [9]:
# Calculating a median (odd or even)
	
import numpy as num 
	 
arr = [30, 1, 15, 28, 93] 

print(arr) 
print("MEDIAN: ", num.median(arr)) 



[30, 1, 15, 28, 93]
MEDIAN:  28.0


## Variance

In [11]:
# Calculating the variance

import numpy as num 

arr = [30, 1, 15, 28, 93] 

print(arr) 
print("VARIANCE : ", num.var(arr)) 


[30, 1, 15, 28, 93]
VARIANCE :  996.2400000000001
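
By default, `var()` divides by the number of values n, which gives the population variance. Passing `ddof=1` divides by n - 1 instead, which gives the sample variance. The sketch below shows both on the same data; the variable names are illustrative.

```python
import numpy as np

arr = [30, 1, 15, 28, 93]

population_var = np.var(arr)        # divides by n
sample_var = np.var(arr, ddof=1)    # divides by n - 1

print(population_var)
print(sample_var)
```

The same `ddof` argument is accepted by `std()`.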


## Standard Deviation

In [12]:
# Calculating the standard deviation

import numpy as num 

arr = [30, 1, 15, 28, 93] 

print(arr) 
print("STANDARD DEVIATION : ", num.std(arr)) 

[30, 1, 15, 28, 93]
STANDARD DEVIATION :  31.563269792592784


## Quantiles

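The qth quantile is the value below which a fraction q of the values in a distribution fall, where q is between 0 and 1. Multiplying q by 100 gives the equivalent percentile. For instance, the 0.33 quantile (the 33rd percentile) is a number that roughly 1/3 of the values in a dataset fall below. NumPy's quantile() function expects q as a fraction, which is why the example below passes values such as .50 and .25.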

In [15]:
# Multiple quantile calculations

import numpy as num

arr = [30, 1, 15, 28, 93] 

print(arr) 
print("50th quantile of arr : ", num.quantile(arr, .50)) 
print("25th quantile of arr : ", num.quantile(arr, .25)) 
print("75th quantile of arr : ", num.quantile(arr, .75)) 
print("33rd quantile of arr : ", num.quantile(arr, .33)) 
	


[30, 1, 15, 28, 93]
50th quantile of arr :  28.0
25th quantile of arr :  15.0
75th quantile of arr :  30.0
33rd quantile of arr :  19.16
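
NumPy also has a `percentile()` function that takes the same position on a 0 to 100 scale. As a small sketch, the 0.33 quantile and the 33rd percentile of the same data should agree:

```python
import numpy as np

arr = [30, 1, 15, 28, 93]

q = np.quantile(arr, 0.33)     # position given as a fraction
p = np.percentile(arr, 33)     # same position given as a percentage

print(q, p)
print(np.isclose(q, p))
```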


## Covariance

Sometimes you have two or more variables in your data set and would like to know how they vary together as the values change. You can measure that using covariance.

- If cov(x,y) = 0 then variables are uncorrelated
- If cov(x,y) > 0 then variables are positively correlated
- If cov(x,y) < 0 then variables are negatively correlated

In [23]:
# Covariance between multiple arrays

import numpy as num 

# An array of 3 variables (one per row) with 5 observations each
x = num.array([ [0.1, 0.3, 0.4, 0.8, 0.9],
               [3.2, 2.4, 2.4, 0.1, 5.5],
               [10., 8.2, 4.3, 2.6, 0.9]
             ])
print("Covariance matrix:\n", num.cov(x)) 


Covariance matrix:
 [[ 0.115   0.0575 -1.2325]
 [ 0.0575  3.757  -0.8775]
 [-1.2325 -0.8775 14.525 ]]


### Interpretation of Covariance Matrix

The values on the main diagonal (C_ii) are the variances of each variable: 0.115 is the variance of var_1, 3.757 of var_2, and 14.525 of var_3. The off-diagonal values tell us the direction of the relationship between each pair of variables. For instance, -1.2325 is the covariance between var_1 and var_3, so there is a negative correlation: as the values of one increase, the values of the other tend to decrease.
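
The claim about the diagonal can be checked directly. One detail to be aware of is that `cov()` divides by n - 1 by default, so the matching call is `var()` with `ddof=1`, applied along each row. A minimal sketch on the same data:

```python
import numpy as np

x = np.array([[0.1, 0.3, 0.4, 0.8, 0.9],
              [3.2, 2.4, 2.4, 0.1, 5.5],
              [10., 8.2, 4.3, 2.6, 0.9]])

cov = np.cov(x)

# Sample variance of each row (variable), using the n - 1 denominator
row_vars = np.var(x, axis=1, ddof=1)

print(np.diag(cov))
print(row_vars)
print(np.allclose(np.diag(cov), row_vars))
```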

## Cross-Correlation

A statistic very similar to covariance is the cross-correlation between variables, computed here as the correlation coefficient returned by corrcoef(). While covariance can only reliably tell us the direction of a correlation, the correlation coefficient also tells us its strength: it ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation).

In [24]:
import numpy as num

num.random.seed(30)

# 100 random integers between 0 and 75
x = num.random.randint(0, 75, 100)

# Positive Correlation with some noise
y = x + num.random.normal(0, 10, 100)

num.corrcoef(x, y)

# There is a VERY strong positive correlation between x and y here.
# Correlation coefficients range from -1 to 1.

array([[1.        , 0.91523891],
       [0.91523891, 1.        ]])

In [25]:
# Negative correlations

import numpy as num

num.random.seed(30)

# 100 random integers between 0 and 75
x = num.random.randint(0, 75, 100)

# Negative correlation with some noise
y = 200 - x + num.random.normal(0, 10, 100)

num.corrcoef(x, y)

# Not as strong, but still a very prominent negative correlation between x and y here.

array([[ 1.        , -0.89798683],
       [-0.89798683,  1.        ]])
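
The link between the two statistics can be sketched as follows: the correlation coefficient is the covariance divided by the product of the two standard deviations, which is what rescales it to the range from -1 to 1. The data below is made up purely for illustration.

```python
import numpy as np

# Illustrative data: y rises roughly in step with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

# Covariance between x and y (off-diagonal entry of the covariance matrix)
cov_xy = np.cov(x, y)[0, 1]

# Divide by both sample standard deviations (ddof=1 matches np.cov's default)
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual)
print(r_numpy)
```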

## Probability

## Hypothesis Tests