# Working with Data - Numpy

##### `A bit about Numpy`

`Numpy` is a library in Python that is used for working with arrays. It provides a high-performance multidimensional array object and tools for working with these arrays. Here are some of its main uses:

- Mathematical and Logical Operations on Arrays: Numpy allows element-wise operations, which means you can easily perform operations (addition, subtraction, square, exponentials) on every element of the array.

- Fourier Transforms and Shape Manipulation: Numpy can compute Fourier transforms and reshape arrays.

- Matrix Operations: Numpy is capable of performing various matrix operations like multiplication, dot product, transpose, etc.

- Random Number Generation: Numpy can generate random numbers, which is useful in statistical analyses, machine learning algorithms, and simulations.

- Statistical Operations: Numpy provides many statistical methods such as mean, median, percentile, and standard deviation.

- Image Processing and Computer Graphics: Images can be manipulated via Numpy arrays. For example, you can reverse the colors of an image, convert an image to grayscale, etc.

- Scientific Computing: Numpy is used in a wide range of scientific research areas, including physics, astronomy, and engineering.

- Machine Learning: Numpy is fundamental for machine learning. Libraries such as Scikit-Learn are built on top of Numpy, and frameworks such as TensorFlow and PyTorch provide their own array types that convert to and from Numpy arrays.
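
A few of the operations listed above can be sketched in a handful of lines. The array values are arbitrary and chosen only for illustration.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])

# element-wise arithmetic: the operation is applied to every element
shifted = a + 10
squared = a ** 2

# matrix operations on a small 2x2 matrix
m = np.array([[1, 2], [3, 4]])
transposed = m.T
product = m @ m  # matrix multiplication

print("Shifted: ", shifted)
print("Squared: ", squared)
print("Transposed: \n", transposed)
print("Matrix product: \n", product)

# basic statistics
print("Mean: ", a.mean(), "Median: ", np.median(a), "Std: ", a.std())
```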

##### `What’s the difference between a Python list and a NumPy array?`

Numpy arrays and Python lists are both used for data manipulation, but they differ in several ways:

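1. Memory Efficiency: Numpy arrays are more memory efficient than Python lists. An array stores elements of a single data type in one contiguous block of memory, whereas a list stores references to separate Python objects.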

2. Performance: Numpy arrays are faster than Python lists for mathematical operations. Numpy runs whole-array operations in optimized, compiled C code, which is significantly faster than looping over the elements in Python.

3. Functionality: Numpy provides a lot of functionality that Python lists do not have. For example, you can perform arithmetic (addition, subtraction, etc.), statistical operations (mean, median, etc.), and more directly on numpy arrays. Python lists do not support these operations directly, so you usually need loops or list comprehensions.

4. Syntax: Arithmetic operators on numpy arrays behave as they do in mathematics, so `a + b` adds two arrays element by element. On Python lists the same operators mean something else: `+` concatenates two lists and `*` repeats a list.

5. Size Flexibility: Python lists are dynamically resizable, whereas numpy arrays are not. If you want to resize a numpy array, a new array will be created.

In summary, you might want to use a numpy array if you are doing complex mathematical operations, especially on large data, and use lists for simpler, more general purpose tasks.
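
The syntax and resizing differences above can be seen side by side in a minimal sketch. The variable names are illustrative only.

```python
import numpy as np

py_list = [1, 2, 3]
np_array = np.array([1, 2, 3])

# "+" concatenates lists but adds element-wise on arrays
list_sum = py_list + py_list
array_sum = np_array + np_array

# "*" repeats a list but multiplies each element of an array
list_times_two = py_list * 2
array_times_two = np_array * 2

# a list grows in place; np.append returns a new array and leaves the original unchanged
py_list.append(4)
bigger_array = np.append(np_array, 4)

print("List sum: ", list_sum)
print("Array sum: ", array_sum)
print("List after append: ", py_list)
print("Original array: ", np_array, " New array: ", bigger_array)
```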

[https://numpy.org/doc/stable/user/absolute_beginners.html](https://numpy.org/doc/stable/user/absolute_beginners.html)

##### `Numpy Arrays`

In Numpy, the rank and the shape of an array describe its dimensionality. They are available through the `ndim` and `shape` attributes.

- `Rank`: The rank of a Numpy array is the number of dimensions of the array. For example, a 1D array has a rank of 1, a 2D array has a rank of 2, and so on. In Python, you can get the rank of a Numpy array using the ndim attribute (`array.ndim`).

- `Shape`: The shape of a Numpy array is a tuple that gives the size of each dimension. For example, for a 2D array, the shape would be (n,m) where n is the number of rows and m is the number of columns. For a 3D array, the shape would be (n,m,p) where n is the number of stacks, m is the number of rows, and p is the number of columns. In Python, you can get the shape of a Numpy array using the shape attribute (`array.shape`).

##### `Import the required libraries and create an array`

In [None]:
import numpy as np

# create a 2D array
# In this example, array is a 2D array with 2 rows and 3 columns, so its rank is 2 and its shape is (2, 3).
array = np.array([[1, 2, 3], [4, 5, 6]])

print("Rank of array: ", array.ndim)  # Output: Rank of array:  2
print("Shape of array: ", array.shape)  # Output: Shape of array:  (2, 3)
print(array)
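
The same attributes work for higher dimensions. As a small sketch, the 3D array below is built with `np.arange` and `reshape` purely for illustration, giving 2 stacks of 3 rows and 4 columns.

```python
import numpy as np

# a 3D array: 2 stacks, each with 3 rows and 4 columns
array_3d = np.arange(24).reshape(2, 3, 4)

print("Rank of array: ", array_3d.ndim)
print("Shape of array: ", array_3d.shape)
print("Total elements: ", array_3d.size)

# indexing the first axis selects one stack, which is itself a 2D array
print("First stack: \n", array_3d[0])
```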

##### `Populating a Numpy Array`

Sometimes you might need to create an array but don't yet have all the data to fill it. This is where the Numpy `zeros()` and `empty()` functions can help you.

Numpy `zeros()` is particularly useful in cases where you need an array as a placeholder for data that you are going to generate, and you want all initial values to be zero.

Example uses include:

- Initializing weights to zero in machine learning algorithms.
- Creating a zero-filled array as a starting point for computations where the actual values will be filled in later.

You would use `empty()` when you want to allocate memory for an array, but you don't care about the initial values. The array then contains whatever happened to be in that memory, so its contents are arbitrary until you overwrite them. Since `empty()` does not have to populate the array with values, it can sometimes be faster than `np.zeros()` or other ways of creating and initializing an array.

Example uses include:

- When you need a large array that will be completely filled with new values, and the initial values are irrelevant.
- When you need an array as a buffer for a big computation.



In [None]:
import numpy as np

# Populate a new 2D array with zeros
# In this example, np.zeros((10, 10)) creates a 2D array with 10 rows and 10 columns, all filled with zeros. 
array = np.zeros((10, 10))
print(array)

# Create an empty 2D array 
# In this example, np.empty((10, 10)) creates a 2D array with 10 rows and 10 columns, without initializing entries. 
array = np.empty((10, 10))

print(array)
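
The buffer use case might look like the sketch below: the array is allocated with `empty()` and every element is written before anything is read back. The name `squares` is illustrative only.

```python
import numpy as np

# allocate a buffer without initializing it; its contents are arbitrary at this point
squares = np.empty(5)

# overwrite every element before reading anything back
for i in range(squares.size):
    squares[i] = i ** 2

print(squares)
```

Reading from an `empty()` array before filling it is a common source of confusing results, so `zeros()` is the safer choice when some elements may never be written.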




##### `Useful Numpy Functions`

`Reshaping`: Reshaping changes the shape of an array, for example its number of rows and columns, without changing the data. The total number of elements must stay the same.

`Slicing`: Slicing is used when you want to get a subset of an existing array.

`Dot Product`: The dot product multiplies two vectors element by element and sums the results. It is also the building block of matrix multiplication. One use case is in machine learning, where the dot product is used to calculate the weighted sum of inputs in neural networks.

<img src="images/dot-product.png" width="300">

`Square`: The square function is used when you want to compute the square of each element in the array.

`Mean`: Mean is used when you want to compute the average of the elements in the array.

In [None]:
import numpy as np
arr = np.arange(1, 10)
print("Original Array: ", arr)

# reshaping to 3 rows and 3 columns
reshaped_arr = arr.reshape(3, 3)
print("Reshaped Array: \n", reshaped_arr)
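
As a further sketch, one dimension can be left for Numpy to work out by passing `-1`, and a reshape that does not fit the number of elements raises a `ValueError`. The 12-element array is chosen only for illustration.

```python
import numpy as np

arr = np.arange(1, 13)  # 12 elements

# -1 asks Numpy to infer that dimension from the size of the array
three_cols = arr.reshape(-1, 3)
two_rows = arr.reshape(2, -1)

print("Shape with 3 columns: ", three_cols.shape)
print("Shape with 2 rows: ", two_rows.shape)

# the total number of elements must match, otherwise reshape raises ValueError
try:
    arr.reshape(5, 3)
except ValueError as err:
    print("Cannot reshape: ", err)
```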

In [None]:
import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
print("Original Array: ", arr)

# slicing from index 1 up to, but not including, index 5
sliced_arr = arr[1:5]
print("Sliced Array: ", sliced_arr)
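
Slicing also works along several dimensions at once. The sketch below, with illustrative names, additionally shows that a basic slice is a view onto the original array rather than a copy.

```python
import numpy as np

grid = np.arange(1, 10).reshape(3, 3)

# rows 0 and 1, columns 1 and 2 (the end index is exclusive)
block = grid[0:2, 1:3]

# a basic slice is a view: writing to it changes the original array
block[0, 0] = 100

# take a copy when the subset should be independent of the original
independent = grid[0:2, 1:3].copy()
independent[0, 0] = -1

print("Block: \n", block)
print("Grid after writing to the block: \n", grid)
```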

In [None]:
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# finding dot product
dot_product = np.dot(arr1, arr2)
print("Dot Product: ", dot_product)
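
The weighted-sum use case can be sketched as follows. The inputs, weights, and bias are hypothetical values made up for this example, not taken from a real model.

```python
import numpy as np

# hypothetical inputs, weights and bias for a single neuron
inputs = np.array([0.5, -1.0, 2.0])
weights = np.array([0.4, 0.3, 0.1])
bias = 0.2

# weighted sum = w1*x1 + w2*x2 + w3*x3 + bias
weighted_sum = np.dot(weights, inputs) + bias
print("Weighted sum: ", weighted_sum)

# a layer of two neurons: one row of weights per neuron
layer_weights = np.array([[0.4, 0.3, 0.1],
                          [0.2, -0.5, 0.7]])
layer_output = layer_weights @ inputs
print("Layer output: ", layer_output)
```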

In [None]:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])

# squaring each element
squared_arr = np.square(arr)
print("Squared Array: ", squared_arr)

In [None]:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])

# finding mean
mean_val = np.mean(arr)
print("Mean: ", mean_val)

##### `Visualising Data with Matplotlib`

`Matplotlib` is a plotting library in Python. It is a multi-platform, multi-purpose, and highly customizable library that allows you to create a wide variety of plots and charts in both 2D and 3D.

Here are some features and uses of Matplotlib:

- Line Plots: You can create line plots to track changes over a period of time.

- Bar Charts and Histograms: Bar charts for comparison of entities and histograms for frequency distribution can be easily plotted with Matplotlib.

- Scatter Plots: Scatter plots can be created which are useful for displaying the relationship between two variables.

- Pie Charts: Matplotlib can be used to create pie charts to represent part-to-whole relationships in the data.

- Stack Plots: Stack plots (stacked area charts) show how the contributions of several groups to a total change over time.

- 3D Plots: You can create various 3D plots including surface plots, wireframes, contour plots, etc.

- Images, Contours and Fields: You can also display image data, draw contour plots, and plot vector fields.

Matplotlib integrates well with other libraries in the scientific Python ecosystem, such as NumPy, SciPy, and Pandas. It's widely used for creating static, animated, and interactive visualizations in Python.
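
As one small illustration of the chart types listed above, the sketch below draws a histogram. The house sizes are synthetic random numbers generated only for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)  # for reproducibility

# synthetic data: 100 random house sizes
house_size = np.random.randint(500, 3500, 100)

# hist() returns the count per bin and the bin edges alongside the drawn bars
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(house_size, bins=10, edgecolor='black')

ax.set_xlabel('House Size (sq ft)')
ax.set_ylabel('Number of houses')
ax.set_title('Distribution of House Sizes')

plt.show()
```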



In [None]:
import numpy as np
import matplotlib.pyplot as plt

# generate random house sizes and prices
np.random.seed(0)  # for reproducibility

# generate an array of 100 random integers from 500 up to (but not including) 3500. These represent the sizes of the houses.
house_size = np.random.randint(500, 3500, 100)

# generate the house prices. It assumes a linear relationship between house size and price, with some added random noise.
house_price = house_size * 100 + np.random.normal(0, 50000, 100)  # linear relationship with some noise

# creates a scatter plot of house size against house price
plt.scatter(house_size, house_price, label='Data')

# fits a 1st-degree polynomial (a line) to the data. It returns the slope 'w' and y-intercept 'b' of the line.
# https://en.wikipedia.org/wiki/Degree_of_a_polynomial

w, b = np.polyfit(house_size, house_price, 1)  # w is slope, b is y-intercept

# note: np.polyfit is not deprecated, but it belongs to Numpy's older polynomial API; the numpy.polynomial package is preferred for new code and is shown in the next example

# plots the fitted line on the same plot as the scatter plot.
plt.plot(house_size, w*house_size + b, color='red', label='Fit: {:.2f}x + {:.2f}'.format(w, b))

# add axis labels and a title to the plot
plt.xlabel('House Size (sq ft)')
plt.ylabel('House Price (USD)')
plt.title('House Size vs House Price')

# adds a legend to the plot
plt.legend()

# displays the plot
plt.show()

##### `Interactive data`

To make the plot interactive, you can use a library like Plotly. Matplotlib also has an interactive mode, but it's more commonly used for static plots.

The Plotly script below, after the optional note on coefficients, creates an interactive plot where you can zoom, pan, hover over the data points to see their values, and toggle the visibility of the datasets.

##### `(Optional) What is a coefficient`

In mathematics, a coefficient is a numerical or constant quantity placed before and multiplying the variable in an algebraic expression. It is used to scale the variable or to adjust its units.

For example, in the equation y = 5x + 3:

- 5 is the coefficient of x. It scales the value of x by 5.
- The number 3 is not a coefficient in this sense, since it is not multiplying a variable; instead, it is called the constant term or y-intercept.

In the context of a polynomial function or a linear regression line (like y = wx + b), the coefficient w represents the slope of the line, which describes the relationship between y and x. In other words, w tells you how much y changes for each unit change in x. The term b is the y-intercept, or the value of y when x is zero. Libraries often return w and b together as "the coefficients" of the line.

In more complex equations or higher-degree polynomials (like y = ax^2 + bx + c), there can be more than one coefficient. Here, a and b are the coefficients of x^2 and x, respectively.
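
These examples can be written out with Numpy's `Polynomial` class as a minimal sketch. Note that `Polynomial` expects the coefficients ordered from the lowest degree to the highest, so y = 5x + 3 becomes `[3, 5]`.

```python
import numpy as np
from numpy.polynomial import Polynomial

# y = 5x + 3: coefficients are listed from the lowest degree up, so [3, 5]
line = Polynomial([3, 5])
print("Value at x = 0: ", line(0))  # the constant term
print("Value at x = 2: ", line(2))

# y = 2x^2 + 5x + 3
quadratic = Polynomial([3, 5, 2])
print("Degree: ", quadratic.degree())
print("Value at x = 2: ", quadratic(2))
```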

In [None]:
import numpy as np
import plotly.graph_objects as go

# generate random house sizes and prices
np.random.seed(0)  # for reproducibility
house_size = np.random.randint(500, 3500, 100)
house_price = house_size * 100 + np.random.normal(0, 50000, 100)  # linear relationship with some noise

# fit a line (using Polynomial API)
p = np.polynomial.Polynomial.fit(house_size, house_price, 1)  # fit a 1st degree polynomial


# The syntax [::-1] in Python is used for reversing the order of an array, list, string, or other sequence type.
# 
# The : character selects all elements in the sequence.
# The -1 after the :: is the step. It means go backwards by 1 step.

# So together, [::-1] means create a new sequence that starts from the end (the last element), 
# ends at the start (the first element), and steps backwards by 1 at each step, effectively reversing the sequence.

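# Polynomial.fit works on a scaled copy of x, so convert() is needed to get the coefficients in terms of house_size.
# The coefficients are stored from lowest to highest degree, i.e., [b, w], so [::-1] reverses them to [w, b].
w, b = p.convert().coef[::-1]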

print(w,b)

# create scatter plot
scatter = go.Scatter(x=house_size, y=house_price, mode='markers', name='Data')

# create line plot
line = go.Scatter(x=house_size, y=p(house_size), mode='lines', name='Fit: {:.2f}x + {:.2f}'.format(w, b))

# create layout
layout = go.Layout(title='House Size vs House Price', xaxis=dict(title='House Size (sq ft)'), yaxis=dict(title='House Price (USD)'))

# create figure and add traces
fig = go.Figure(data=[scatter, line], layout=layout)

# show figure
fig.show()
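
One detail of `Polynomial.fit` is easy to miss: it fits on a scaled copy of x, so `p.coef` is not the slope and intercept in the original units until `convert()` is called. The sketch below uses noise-free, made-up data on the line y = 100x + 2000 so the difference is easy to see.

```python
import numpy as np
from numpy.polynomial import Polynomial

# noise-free data on the line y = 100x + 2000 (values chosen for illustration)
x = np.array([500.0, 1000.0, 2000.0, 3500.0])
y = 100 * x + 2000

p = Polynomial.fit(x, y, 1)

# coefficients in the scaled window, where x is mapped from p.domain onto [-1, 1]
scaled = p.coef
print("Domain: ", p.domain)
print("Scaled coefficients: ", scaled)

# coefficients in terms of the original x, ordered [intercept, slope]
b, w = p.convert().coef
print("Intercept: ", b, " Slope: ", w)
```

Evaluating `p(x)` directly is unaffected, because the fitted object applies the same scaling internally.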

## Another way to display data

If you need to display and share data with others, you can look at open source tools such as `Evidence`.

[https://github.com/evidence-dev/evidence](https://github.com/evidence-dev/evidence)

From the link above:

Evidence renders a website from markdown files:

- SQL statements inside markdown files run queries against your data warehouse
- Charts and components are rendered using these query results
- Templated pages generate many pages from a single markdown template
- Loops and If / Else statements allow control of what is displayed to users

