# NumPy

**NumPy** (**Num**erical **Py**thon) is a library for the Python programming language that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.

Its **high-level syntax** makes it **accessible** and productive for programmers from any background or experience level, while its core is **well-optimized C code**, combining the **flexibility** of Python with the **speed** of compiled code.

## Arrays
An **array** is the central data structure of the NumPy library. An array is a grid of values and it contains information about:
- the raw data
- how to locate an element
- how to interpret an element. 

It has a grid of elements that can be indexed in various ways. The elements are all of the same type, referred to as the array **dtype**.

The **rank** of the array is the number of dimensions. The **shape** of the array is a tuple of integers giving the size of the array along each dimension. In NumPy, dimensions are called **axes**.
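
As a small illustration of these terms, the sketch below builds a made-up 2 x 3 array (the name `grid` and its values are only for illustration) and shows how an `axis` argument selects the dimension an operation collapses.

```python
import numpy as np

# A 2 x 3 array: 2 rows (axis 0) and 3 columns (axis 1)
grid = np.array([[1, 2, 3], [4, 5, 6]])

print(grid.ndim)    # number of axes
print(grid.shape)   # size along each axis

col_sums = grid.sum(axis=0)   # collapse axis 0: one sum per column
row_sums = grid.sum(axis=1)   # collapse axis 1: one sum per row
print(col_sums, row_sums)
```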

While a Python list can contain different data types within a single list, all of the elements in a NumPy array should be **homogeneous**. The mathematical operations that are meant to be performed on arrays would be extremely inefficient otherwise. NumPy arrays are **faster** and more **compact** than Python lists. NumPy uses much **less memory** to store data.
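
The memory claim can be checked roughly with the sketch below. It assumes CPython and a 64-bit integer dtype; the list estimate is only approximate because `sys.getsizeof` does not account for objects shared between containers.

```python
import sys
import numpy as np

values = list(range(1000))
arr = np.array(values, dtype=np.int64)

# Every element takes exactly itemsize bytes in one contiguous buffer
print(arr.itemsize, arr.nbytes)

# A list stores pointers to separate Python int objects; this rough estimate
# adds the list container and the objects it references
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
print(list_bytes)
```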

There are different ways to initialize a _numpy array_:

- With regular Python lists

In [None]:
import numpy as np

a1d = np.array([1, 2, 3])
a2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8 , 9]])
a3d = np.array([[[1, 2, 3], [4, 5, 6], [7, 8 , 9]], [[1, 2, 3], [4, 5, 6], [7, 8 , 9]], [[1, 2, 3], [4, 5, 6], [7, 8 , 9]]])
print(a1d, "\n")
print(a2d, "\n")
print(a3d)

- With numpy functions to repeat data

In [None]:
import numpy as np

print([0, 0])
print(np.zeros(5))
print(np.ones(5))
print(np.arange(7, 49, 7)) # 49 not included

- With numpy functions to generate distributions: linear, geometric, logarithmic...

In [None]:
import numpy as np

print(np.linspace(1, 100, 10))
print(np.geomspace(1, 100, 10))
print(np.logspace(1, 100, 10))
print(np.logspace(0, 2, 10))
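
The difference between the three functions is easier to see with fewer points. In this sketch (the variable names are illustrative), note that `np.logspace` takes exponents of 10 as endpoints, so `np.logspace(1, 100, 10)` above runs from 10^1 up to 10^100, while `np.logspace(0, 2, ...)` covers the same range as `np.geomspace(1, 100, ...)`.

```python
import numpy as np

lin = np.linspace(1, 100, 4)    # equal differences between consecutive values
geo = np.geomspace(1, 100, 3)   # equal ratios between consecutive values
log = np.logspace(0, 2, 3)      # endpoints are exponents: from 10**0 to 10**2

print(lin)
print(geo)
print(log)
```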


## Basic array operations
NumPy arrays have attributes with information about their shape, size and dimensions.

In [None]:
import numpy as np

a1d = np.array([1, 2, 3])
a2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
a2d2 = np.array([[1], [4, 5, 6], [7, 8]], dtype = object) # ragged rows need dtype = object
a2d3 = np.array([[1, None, None], [4, 5, 6], [7, 8, None]])

print(a1d.size, a1d.ndim, a1d.shape)
print(a2d.size, a2d.ndim, a2d.shape)
print(a2d2.size, a2d2.ndim, a2d2.shape)
print(a2d3.size, a2d3.ndim, a2d3.shape)
# a2d2 is a 1-D array holding three list objects; without dtype = object, recent NumPy versions raise a ValueError

We can also join arrays, split them into pieces or change their shape.

In [None]:
import numpy as np

arr = np.array([1, 2, 3, 5, 2, 4, 3, 2, 5, 6, 1, 2, None, 5])
print(arr[4:7], arr[7:], arr[:2])
print(np.split(arr, 2))

print()

arra = np.array([1, 2, 3, 3, 2, 5])
arrb = np.array([8, 5, 2, 4, 6, 1, 2, 5])
arr2 = np.concatenate((arra, arrb))
print(arr2)
print(np.flip(arr2))
print(np.sort(arr2))
print(np.unique(arr2))
print(arr2.reshape(2, 7)) # try different numbers

## Basic numeric operations: broadcasting
**Broadcasting** describes how NumPy treats arrays of different shapes during element-by-element operations: the smaller array (or a scalar) is implicitly stretched to match the shape of the larger one. Generally speaking, all NumPy operations, not just arithmetic but also logical, bit-wise, functional, etc., behave in this implicit element-by-element fashion.

**Looping occurs in C instead of Python!**

In regular Python...

In [None]:
a = [1, 2, 3]
b = [15, 25, 30]
c = []
for i in range(len(a)):
  c.append(a[i] * b[i])
print(c)

With _NumPy_...

In [None]:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([15, 25, 30])
c = 2.35
print(a * b)
print(a * c)
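
Multiplying by the scalar `c` is the simplest case of broadcasting. The same mechanism combines arrays of different shapes, as in this sketch with made-up values, where a column of shape (3, 1) and a row of shape (3,) produce a 3 x 3 table.

```python
import numpy as np

column = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3])              # shape (3,)

# The shapes (3, 1) and (3,) are broadcast to the common shape (3, 3)
table = column + row
print(table.shape)
print(table)
```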

More examples.

In [None]:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([5, 15, 10])
c = 3
print(np.power(a, b))
print(np.power(a, c))

In [None]:
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(np.sqrt(a))
print(np.square(a))
print(np.exp(a))

In [None]:
import numpy as np

arr = np.array([-0.8, 4.1, -9.7, -8, 5])
print(np.round(arr))
print(np.ceil(arr))
print(np.floor(arr))

## Grouping
We can find some statistical data from a group: maximum, minimum, mean...

In [None]:
import numpy as np

arr = np.array([-0.8, 4.1, -9.7, -8, 5])
print('Sum:', arr.sum())
print('Max:', arr.max())
print('Min:', arr.min())
print('Mean:', arr.mean())
print('Standard deviation:', arr.std())

## Selecting
We can also select data that meets certain conditions.

In [None]:
import numpy as np

arr = np.array([1, 2, 3, 5, 2, 4, 3, 2, 5, 6, 1, 2, 5])
print(np.unique(arr))
print(np.where(arr < 5))
print(arr[np.where(arr < 5)])
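
A comparison such as `arr < 5` already produces a boolean array that can be used directly as an index. The sketch below (variable names are illustrative) shows that this gives the same selection as `np.where`, and that conditions can be combined with `&` and `|`.

```python
import numpy as np

arr = np.array([1, 2, 3, 5, 2, 4, 3, 2, 5, 6, 1, 2, 5])

mask = arr < 5        # boolean array, one flag per element
small = arr[mask]     # same selection as arr[np.where(arr < 5)]

# Each condition needs its own parentheses when combined with & or |
even_small = arr[(arr < 5) & (arr % 2 == 0)]

print(small)
print(even_small)
```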

## Random
There are several _numpy_ functions to generate random data.

In [None]:
import numpy as np

print(np.random.rand(10))
print()
print(np.random.rand(3, 6))
print()
print(np.random.randn(10))
print()
print(np.random.randint(low = 1, high = 25, size = (5, 5)))
print()

choices = [1, 10, 100, 1000]
chances = [0.5, 0.1, 0.1, 0.3]
print(np.random.choice(choices, 10, p = chances))
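
The functions above draw different numbers on every run. When reproducible results are needed, one option is a seeded generator created with `np.random.default_rng`. This is a sketch; the seed 42 is an arbitrary choice made for the example.

```python
import numpy as np

rng_a = np.random.default_rng(42)   # 42 is an arbitrary seed
rng_b = np.random.default_rng(42)

sample_a = rng_a.integers(low=1, high=25, size=5)   # high is exclusive
sample_b = rng_b.integers(low=1, high=25, size=5)
print(sample_a, sample_b)   # identical, because both generators share the seed

# Weighted choice, mirroring the np.random.choice call above
draws = rng_a.choice([1, 10, 100, 1000], size=10, p=[0.5, 0.1, 0.1, 0.3])
print(draws)
```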


## Basic statistics
Statistical data is really easy to obtain with _numpy_.

In [None]:
import numpy as np

arr = np.array([1, 2, 3, 5, 2, 4, 3, 2, 5, 6, 1, 2, 5])
print(arr.mean())
print(arr.std())
print(np.percentile(arr, 80))
print()
choices = [1, 10, 100, 1000]
chances = [0.5, 0.1, 0.1, 0.3]
print(np.average(choices, weights = chances))
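
To make the weighted average less of a black box, this sketch recomputes it by hand: each value is multiplied by its weight, and the total is divided by the sum of the weights.

```python
import numpy as np

choices = np.array([1, 10, 100, 1000])
chances = np.array([0.5, 0.1, 0.1, 0.3])

by_hand = (choices * chances).sum() / chances.sum()
weighted = np.average(choices, weights=chances)
print(by_hand, weighted)
```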

## Matrices
We can also create and operate on matrices with _numpy_.

In [None]:
import numpy as np

matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
m = np.array(matrix)
print(m, "\n")
print(m + m, "\n")
print(m * m, "\n")
print(np.diag(m), "\n")
print(np.flipud(m), "\n")
print(np.fliplr(m), "\n")
print(np.transpose(m), "\n")
print(m * np.transpose(m))
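
Note that `*` multiplies arrays element by element. For the matrix product in the linear-algebra sense, NumPy provides the `@` operator (equivalent to `np.matmul`), as this short sketch shows.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

elementwise = m * m   # multiplies matching entries
product = m @ m       # matrix product: rows of the left operand times columns of the right

print(elementwise, "\n")
print(product)
```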


# Matplotlib
_Matplotlib_ is a comprehensive library for creating static, animated, and interactive visualizations in Python.

You can find a full reference at https://matplotlib.org/.

Let's see a few cool examples.



## Correlogram
The _Seaborn_ library is a widely popular data visualization library that is built on top of Matplotlib and supports exploratory analysis. You can find a full reference at https://seaborn.pydata.org/.

A _correlogram_ is used to visually see the **correlation** metric between all possible pairs of numeric variables in a given dataframe.
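
As a reminder of what the correlation metric measures, here is a minimal sketch with made-up variables: one that rises with `x`, and one that falls as `x` rises. `np.corrcoef` returns the matrix of pairwise Pearson correlations, which is the kind of matrix a correlogram displays.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
up = 2 * x + 1    # rises linearly with x
down = -3 * x     # falls linearly as x rises

corr = np.corrcoef([x, up, down])   # 3 x 3 matrix, one row and column per variable
print(corr.round(2))
```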

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
df

In [None]:
df.select_dtypes('number').corr() # correlation is only defined for numeric columns

In [None]:
corr = df.select_dtypes('number').corr() # compute once, using numeric columns only
plt.figure(figsize = (12, 10), dpi = 80)
sns.heatmap(corr, xticklabels = corr.columns, yticklabels = corr.columns, cmap = 'RdYlGn', center = 0, annot = True)
plt.title('Correlogram of mtcars', fontsize = 22)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.show()

## Slope Chart
A slope chart is most suitable for comparing the _before_ and _after_ positions of a given item. We can test a lot of code here!

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import seaborn as sns

df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/gdppercap.csv")
df

In [None]:
def newline(p1, p2, color = 'black'):
    ax = plt.gca()
    l = mlines.Line2D([p1[0], p2[0]], [p1[1], p2[1]], color = 'red' if p1[1] > p2[1] else 'green', marker = 'o', markersize = 6)
    ax.add_line(l)
    return l

fig, ax = plt.subplots(1, 1, figsize = (14, 14), dpi = 80)

# Vertical Lines
ax.vlines(x = 1, ymin = 500, ymax = 13000, color = 'black', alpha = 0.7, linewidth = 1, linestyles = 'dotted')
ax.vlines(x = 3, ymin = 500, ymax = 13000, color = 'black', alpha = 0.7, linewidth = 1, linestyles = 'dotted')

# Points
ax.scatter(y = df['1952'], x = np.repeat(1, df.shape[0]), s = 10, color = 'black', alpha = 0.7)
ax.scatter(y = df['1957'], x = np.repeat(3, df.shape[0]), s = 10, color = 'black', alpha = 0.7)

# Line Segments and Annotation
for p1, p2, c in zip(df['1952'], df['1957'], df['continent']):
    newline([1, p1], [3, p2])
    ax.text(1 - 0.05, p1, c + ', ' + str(round(p1)), horizontalalignment = 'right', verticalalignment ='center', fontdict = {'size': 14})
    ax.text(3 + 0.05, p2, c + ', ' + str(round(p2)), horizontalalignment = 'left', verticalalignment = 'center', fontdict = {'size': 14})

# 'Before' and 'After' Annotations
ax.text(1-0.05, 13000, 'Before', horizontalalignment = 'right', verticalalignment = 'center', fontdict = {'size': 18, 'weight': 700})
ax.text(3+0.05, 13000, 'After', horizontalalignment = 'left', verticalalignment = 'center', fontdict = {'size': 18, 'weight': 700})

# Decoration
ax.set_title("Slopechart: Comparing GDP Per Capita between 1952 vs 1957", fontdict = {'size': 22})
ax.set(xlim = (0, 4), ylim = (0, 14000), ylabel = 'Mean GDP Per Capita')
ax.set_xticks([1, 3])
ax.set_xticklabels(["1952", "1957"])
plt.yticks(np.arange(500, 13000, 2000), fontsize = 12)

# Lighten borders
plt.gca().spines["top"].set_alpha(.50)
plt.gca().spines["bottom"].set_alpha(.50)
plt.gca().spines["right"].set_alpha(.50)
plt.gca().spines["left"].set_alpha(.50)
plt.show()

## Joy Plot
_JoyPy_ is a one-function Python package based on matplotlib + pandas with a single purpose: drawing ridgeline plots (joyplots). You can find it at https://github.com/leotac/joypy.

Joyplots are stacked, partially overlapping density plots. They are a useful way to visually compare distributions, especially those that change across one dimension (e.g., over time).

In [None]:
!pip install joypy

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import joypy

df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
df

In [None]:
# Draw Plot (joyplot creates its own figure, so no separate plt.figure call is needed)
fig, axes = joypy.joyplot(df, column = ['hwy', 'cty'], by = "class", ylim = 'own', figsize = (10, 6))

# Decoration
plt.title('Joy Plot of City and Highway Mileage by Class', fontsize = 22)
plt.show()

# StatsModels
*statsmodels* is a Python module that provides classes and functions for the estimation of many different **statistical models**, as well as for conducting statistical tests, and statistical data exploration. It supports ``numpy`` arrays as well as ``pandas`` dataframes.

You can find the complete list of models supported by *statsmodels* at https://www.statsmodels.org/stable/api.html.


## ARIMA
**ARIMA** is a statistical analysis model that uses time series data to either better understand the data set or to predict future trends. The name stands for **A**uto**R**egressive **I**ntegrated **M**oving **A**verage. You can find more information about ARIMA at https://www.investopedia.com/terms/a/autoregressive-integrated-moving-average-arima.asp or at https://towardsdatascience.com/machine-learning-part-19-time-series-and-autoregressive-integrated-moving-average-model-arima-c1005347b0d7.
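
The **I**ntegrated part of the name refers to differencing: the model works on the changes between consecutive values rather than on the raw series, which removes trends. The sketch below uses made-up values to show first and second differences with `np.diff`; in an ARIMA order written as (p, d, q), the middle number d is how many times the series is differenced.

```python
import numpy as np

series = np.array([10.0, 12.0, 15.0, 19.0, 24.0])   # made-up values with a growing trend

diff1 = np.diff(series)        # first difference (d = 1)
diff2 = np.diff(series, n=2)   # second difference (d = 2)
print(diff1, diff2)

# Differencing is reversible: a cumulative sum plus the first value restores the series
restored = np.concatenate(([series[0]], series[0] + np.cumsum(diff1)))
print(restored)
```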

### Data
Let's see what our time series data looks like.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

df = pd.read_csv('screenviews_by_date_country_devcat.csv', parse_dates = ['ga_date'], index_col = ['ga_date'])
aggr_df = df.groupby("ga_date").agg({"ga_screenviews": "sum"}) # total screenviews per date
plt.xlabel('Date')
plt.ylabel('Screenviews')
plt.plot(aggr_df)

### Model
Let's now create a model and see how it looks.

In [None]:
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(aggr_df, order = (3, 1, 0))
results = model.fit()
print(results.summary())
results.plot_diagnostics()
plt.show()

### Forecast
Let's check our past data and what the model predicts.

In [None]:
plt.plot(aggr_df)
plt.plot(results.fittedvalues, color = 'red')
plt.plot(results.forecast(12), color = 'green')
plt.show()
print()
print(results.forecast(2))

# SciPy
**SciPy** is a collection of mathematical algorithms and convenience functions built on the **NumPy** extension of Python. You can find a complete reference at https://scipy.org/.

This is the list of packages available in _SciPy_:
- cluster: Clustering algorithms
- constants: Physical and mathematical constants
- fft: Fast Fourier Transform routines (supersedes the legacy fftpack package)
- integrate: Integration and ordinary differential equation solvers
- interpolate: Interpolation and smoothing splines
- io: Input and Output
- linalg: Linear algebra
- ndimage: N-dimensional image processing
- odr: Orthogonal distance regression
- optimize: Optimization and root-finding routines
- signal: Signal processing
- sparse: Sparse matrices and associated routines
- spatial: Spatial data structures and algorithms
- special: Special functions
- stats: Statistical distributions and functions

As you can see, some mathematical background is needed to use the _SciPy_ library. Let's see some examples:

## Constants
Let's see some available constants. You can find the full list at https://docs.scipy.org/doc/scipy/reference/constants.html.

In [None]:
import math
import scipy.constants as c

print('Mathematical constants')
print('PI:', c.pi)
print('e:', math.e) # scipy.constants.e is the elementary charge, not Euler's number
print()
print('Physical constants')
print('Planck:', c.h)
print('Newtonian constant of gravitation:', c.G) # c.g is the standard acceleration of gravity
print('Avogadro:', c.Avogadro)
print('Speed of light:', c.speed_of_light)
print('Elementary charge:', c.elementary_charge)
print('Electron mass:', c.electron_mass)
print()
print('Conversion')
print('Radians in a degree:', c.degree)
print('Kelvins in a Fahrenheit degree:', c.degree_Fahrenheit)
print('Pascals in an atmosphere:', c.atm)
print('Meters in an inch:', c.inch)
print('Square meters in an acre:', c.acre)


## K-Means
The k-means algorithm takes as input the number of clusters to generate, k, and a set of observation vectors to cluster. It returns a set of centroids, one for each of the k clusters. An observation vector is classified with the cluster number or centroid index of the centroid closest to it.
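
To make the idea concrete, here is a minimal sketch of a single k-means iteration written with plain NumPy. It is not how `scipy.cluster.vq.kmeans` is implemented; the points and the starting centroids are made up for illustration.

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[1.0, 1.0], [9.0, 9.0]])   # arbitrary starting guesses

# Assignment step: distance from every point to every centroid, shape (4, 2)
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = distances.argmin(axis=1)   # index of the closest centroid for each point

# Update step: each centroid moves to the mean of the points assigned to it
new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(len(centroids))])

print(labels)
print(new_centroids)
```

Repeating these two steps until the centroids stop moving is the core loop of the algorithm.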

### Bidimensional guess example
Let's imagine a set of 100 points near the segment bounded by the points (-4, -4) and (4, 4).

In [None]:
import numpy as np
from scipy.cluster.vq import vq, kmeans, kmeans2
import matplotlib.pyplot as plt

rng = np.random.default_rng()
data = rng.multivariate_normal([1, 1], [[1, 0.9], [0.9, 1]], size = 100) # the covariance matrix must be symmetric positive semidefinite
plt.scatter(data[:, 0], data[:, 1])
plt.show()

Let's find two centroids.

In [None]:
codebook, d = kmeans(data, 2)
codebook2, d2 = kmeans2(data, 2)
plt.scatter(data[:, 0], data[:, 1])
plt.scatter(codebook[:, 0], codebook[:, 1], c = 'red')
plt.scatter(codebook2[:, 0], codebook2[:, 1], c = 'yellow')
plt.show()

Let's find 4 centroids.

In [None]:
codebook, d = kmeans(data, 4)
codebook2, d2 = kmeans2(data, 4)
plt.scatter(data[:, 0], data[:, 1])
plt.scatter(codebook[:, 0], codebook[:, 1], c = 'red')
plt.scatter(codebook2[:, 0], codebook2[:, 1], c = 'yellow')
plt.show()

Let's find 8 centroids.

In [None]:
codebook, d = kmeans(data, 8)
codebook2, d2 = kmeans2(data, 8)
plt.scatter(data[:, 0], data[:, 1])
plt.scatter(codebook[:, 0], codebook[:, 1], c = 'red')
plt.scatter(codebook2[:, 0], codebook2[:, 1], c = 'yellow')
plt.show()

### Letter identification
Let's encode an image that contains a character as a 3x5 pixel matrix. The digits from 0 to 3 could look like this:
```
ZERO ONE  TWO  THREE
 x     x  xxx  xx
x x   xx    x    x
x x    x   x    x
x x    x  x      x
 x     x  xxx  xx
```
Let's code each `x` as a 1 and each `blank` as a 0 and write the pixels in just one array of data:
```
zero  = [0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
one   = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
two   = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1]
three = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0]
```
Now we can encode a few observations and check which entry of the code book `vq` assigns to each of them. **Let's play!**


In [None]:
import numpy as np
from scipy.cluster.vq import vq

zero  = [0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
one   = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
two   = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1]
three = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0]
code_book = np.array([zero, one, two, three])
data = np.array([[1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1], #noisy zero
                 [1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1], #noisy two
                 [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0], #perfect three
                 [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0], #noisy one
                 [0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0], #perfect zero
                 [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1]])#very noisy three
vq(data, code_book)

### Let's play
Add code to include 4 and 5 in the model and test it.

In [None]:
# Write your code here

# Scikit Learn
This is a library which contains simple and efficient tools for predictive data analysis. It is built on NumPy, SciPy, and Matplotlib.

You can find further details at https://scikit-learn.org/stable/index.html.

## Decision tree
Let's try to solve the number identification problem with a _decision tree_. 

In [None]:
#!pip install scikit-learn

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

zero  = [0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
one   = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
two   = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1]
three = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0]
X = pd.DataFrame([zero, one, two, three])
y = pd.DataFrame({'result': [0, 1, 2, 3]})

model = DecisionTreeClassifier()
model.fit(X, y)
predictions = model.predict([[1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1], #noisy zero
                             [1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1], #noisy two
                             [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0], #perfect three
                             [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0], #noisy one
                             [0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0], #perfect zero
                             [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1]])#very noisy three
predictions


Sklearn provides a way to show graphically how our classifier decides.

In [31]:
from sklearn import tree

tree.export_graphviz(model,
                     out_file = 'decision.dot',
                     feature_names = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14'],
                     class_names = ['0', '1', '2', '3'],
                     label = 'all')

## Let's play
Add code to include 4, 5, ... 9 in the models, test them and generate a visual decision tree.

In [None]:
# Write your code here

## Nearest neighbour
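The principle behind **nearest neighbor** methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The **distance** can, in general, be any metric measure: standard Euclidean distance is the most common choice. The example below uses scikit-learn's `NearestCentroid`, which represents each class by the centroid of its training samples and predicts the class whose centroid is closest.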
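
The idea can be sketched by hand with NumPy before turning to scikit-learn. The snippet below is only an illustration: it reuses the digit vectors from this notebook, measures the Euclidean distance from two noisy samples to each training vector, and picks the label of the closest one. With a single training sample per class, the nearest neighbor and the nearest centroid coincide.

```python
import numpy as np

zero  = [0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
one   = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
two   = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1]
three = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0]
training = np.array([zero, one, two, three])
labels = np.array([0, 1, 2, 3])

samples = np.array([[1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1],   # noisy zero
                    [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0]])  # noisy one

# Euclidean distance from each sample to each training vector, shape (2, 4)
distances = np.linalg.norm(samples[:, None, :] - training[None, :, :], axis=2)
predicted = labels[distances.argmin(axis=1)]   # label of the closest training vector
print(predicted)
```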

Let's try to solve the number identification problem with this algorithm.

In [None]:
#!pip install scikit-learn

import pandas as pd
from sklearn.neighbors import NearestCentroid

zero  = [0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
one   = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
two   = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1]
three = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0]
X = pd.DataFrame([zero, one, two, three])
y = pd.DataFrame({'result': [0, 1, 2, 3]})

model = NearestCentroid()
model.fit(X, y['result']) # pass the labels as a 1-D Series rather than a column-vector DataFrame
predictions = model.predict([[1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1], #noisy zero
                             [1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1], #noisy two
                             [1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0], #perfect three
                             [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0], #noisy one
                             [0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0], #perfect zero
                             [1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1]])#very noisy three
predictions

## Let's play
Add code to include 4, 5, ... 9 in the models and test them.

In [None]:
# Write your code here

# Anaconda

Check https://anaconda.org/anaconda/jupyter.