## 3. Numerical Analysis Libraries 🔢
In any end-to-end machine learning project, understanding the data is an integral step. Libraries such as NumPy and Pandas make that step far easier.

### 3.1 NumPy
NumPy (Numerical Python) is the core library for scientific computing in Python. It deals with mathematical computation and enables users to compute on multidimensional data structures more efficiently and easily.

#### 3.1.1 Housekeeping

Before starting, make sure the NumPy package is installed by running this shell command:

In [None]:
!pip install numpy

In [None]:
# Import the NumPy library into the current Python session and refer to it as np (a convention)
import numpy as np

# Print the current version of NumPy; this is useful when referring to the documentation
print(np.__version__)

#### 3.1.2 Initializing Arrays/Matrices

NumPy offers a very intuitive way of representing matrices as multidimensional arrays. Shown below are some of the ways of initializing arrays/matrices:
- `np.array()` : Create an array/matrix by specifying entry for each element
- `np.ones()` : Create a matrix full of ones
- `np.zeros()` : Create a matrix full of zeros
- `np.eye()` : Create an identity matrix
- `np.random.random()` : Create a matrix filled with random numbers between 0 and 1
- `np.arange()` : Create an array of evenly spaced values in ascending order (for example, 0 to 9)
- `np.empty()` : Create a matrix placeholder without initializing its entries

Further documentation for NumPy is at https://numpy.org/doc/.
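
The cells below practise most of these constructors, but not `np.array()` itself or the optional arguments of `np.arange()`. A minimal sketch of both, using made-up values chosen only for illustration:

```python
import numpy as np

# Build a 2x3 matrix by listing every element explicitly (illustrative values)
A = np.array([[1, 2, 3], [4, 5, 6]])
print(A.shape)  # (2, 3)

# np.arange(start, stop, step) never includes the stop value
evens = np.arange(0, 10, 2)
print(evens)  # [0 2 4 6 8]
```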

In [None]:
# Create a 3x3 matrix full of ones

B = np.ones((3, 3))
print("Matrix B")
print(B)

In [None]:
# TODO: Create a 5x5 matrix full of zeros

# Expected output:
# [[0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0.]]


########################################################
# INSERT YOUR ANSWER HERE

C = np.zeros((5, 5))
print("Matrix C")
print(C)

########################################################

In [None]:
# TODO: Create a 2x2 identity matrix

# Expected output:
# [[1. 0.]
#  [0. 1.]]


########################################################
# INSERT YOUR ANSWER HERE

D = np.eye(2)
print("Matrix D")
print(D)

########################################################

In [None]:
# TODO: Create a 3x3 matrix filled with random numbers between 0 and 1

# Expected output (seed = 41):
#  [[0.25092362 0.04609582 0.67681624]
#  [0.04346949 0.1164237  0.60386569]
#  [0.19093066 0.66851572 0.91744785]]

# Further information on what seed is and how NumPy generates a random number is given below:
# https://www.w3schools.com/python/numpy/numpy_random.asp


########################################################
# INSERT YOUR ANSWER HERE

np.random.seed(41)
E = np.random.random((3, 3))
print("\nMatrix E")
print(E)

########################################################
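
Recent NumPy versions also provide a `Generator` interface, created with `np.random.default_rng()`, as an alternative to the global `np.random.seed()` call used above. The sketch below assumes a NumPy version that includes it; the names `rng`, `E_new`, and `E_again` are illustrative. Because the generator uses a different algorithm, the numbers will not match the expected output shown in the exercise, even with the same seed.

```python
import numpy as np

# A Generator carries its own seed instead of relying on global state
rng = np.random.default_rng(41)
E_new = rng.random((3, 3))  # uniform samples in the half-open interval [0, 1)

# Re-creating the generator with the same seed reproduces the same matrix
E_again = np.random.default_rng(41).random((3, 3))
print(np.array_equal(E_new, E_again))  # True
```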

In [None]:
# TODO: Create an array which has 0-9 as its elements in sorted order

# Expected output: [0 1 2 3 4 5 6 7 8 9]


########################################################
# INSERT YOUR ANSWER HERE

f = np.arange(10)
print("\nArray f")
print(f)

########################################################

In [None]:
# TODO: Create a 5x3 matrix placeholder, without initializing entries (elements in the matrix)

# Expected output: a 5x3 matrix of arbitrary values. np.empty() does not initialize
# its entries, so the numbers are whatever was already in memory and can differ
# between runs; seeding the random number generator has no effect on them.


########################################################
# INSERT YOUR ANSWER HERE

G = np.empty((5, 3))
print("\nMatrix G")
print(G)

########################################################
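
Since `np.empty()` only allocates memory, a placeholder is normally overwritten before it is read. A small sketch of that pattern, where the name `placeholder` and the fill value 0.5 are arbitrary choices for illustration:

```python
import numpy as np

# Allocate a 5x3 placeholder; its initial contents are unspecified
placeholder = np.empty((5, 3))
print(placeholder.shape, placeholder.dtype)  # (5, 3) float64

# Overwrite every entry before using the matrix
placeholder[:] = 0.5
print(placeholder.sum())  # 7.5
```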

#### 3.1.3 Matrix Operations

In machine learning, we will deal with a lot of matrix calculations. It is therefore good for us to get accustomed to some of the common operations we perform on them. Here is a list of the first few:

- `np.transpose(A)` : Find the transpose of array/matrix A
- `A @ B` : Calculate the matrix product (dot product) of matrices A and B
- `np.linalg.inv(A)` : Calculate the inverse of matrix A (only defined for square, non-singular matrices, whose dimension is n * n)
- `np.diagonal(A)` : Extract diagonal components of matrix A
- `np.reshape(A, (x, y))` : Reshape matrix A into the given dimension which is (x, y)


Now let's check what each of them does by filling in the cells below.

In [None]:
# Initialise the data we will use below
X = np.array([[3, 11, 1, 4], [7, 5, 2, 7], [6, 8, 9, 7], [0, 10, 4, 2]])
print(X)

In [None]:
# TODO: Transpose the matrix

# Expected output:
# [[ 3  7  6  0]
#  [11  5  8 10]
#  [ 1  2  9  4]
#  [ 4  7  7  2]]


########################################################
# INSERT YOUR ANSWER HERE

X_transposed = np.transpose(X)
print(X_transposed)

########################################################

In [None]:
# TODO: Find the dot product of two matrices: original X and X_transposed

# Expected output:
# [[147 106 143 122]
#  [106 127 149  72]
#  [143 149 230 130]
#  [122  72 130 120]]


########################################################
# INSERT YOUR ANSWER HERE

Y = X @ X_transposed
print(Y)

########################################################
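
A common slip is to use `*` where `@` is intended. The sketch below contrasts the two on small made-up matrices (`P` and `Q` are illustrative names): `*` multiplies element by element, while `@` performs the row-by-column matrix product.

```python
import numpy as np

P = np.array([[1, 2], [3, 4]])
Q = np.array([[0, 1], [1, 0]])

elementwise = P * Q  # multiplies matching entries
matrix_product = P @ Q  # rows of P combined with columns of Q

print(elementwise)     # [[0 2]
                       #  [3 0]]
print(matrix_product)  # [[2 1]
                       #  [4 3]]
```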

In [None]:
# TODO: Calculate the inverse matrix of X

# Expected output:
#    [[ 1.38095238 -1.14285714  0.80952381 -1.5952381 ]
#    [ 0.29365079 -0.21428571  0.1031746  -0.1984127 ]
#    [ 0.07142857 -0.21428571  0.21428571 -0.14285714]
#    [-1.61111111  1.5        -0.94444444  1.77777778]]


########################################################
# INSERT YOUR ANSWER HERE

X_inverse = np.linalg.inv(X)
print(X_inverse)

########################################################
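
One way to sanity-check an inverse is to multiply it by the original matrix and compare the result with the identity. The sketch below repeats the definition of `X` so that it runs on its own, and uses `np.allclose()` because floating-point round-off makes an exact comparison unreliable.

```python
import numpy as np

X = np.array([[3, 11, 1, 4], [7, 5, 2, 7], [6, 8, 9, 7], [0, 10, 4, 2]])
X_inverse = np.linalg.inv(X)

# The product is only approximately the identity because of round-off
is_identity = np.allclose(X @ X_inverse, np.eye(4))
print(is_identity)  # True
```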

In [None]:
# TODO: Extract the diagonal elements of matrix X

# Expected output: [3 5 9 2]


########################################################
# INSERT YOUR ANSWER HERE

diagonal = np.diagonal(X)
print(diagonal)

########################################################

In [None]:
# TODO: Reshape matrix X to one that has 8 rows and 2 columns

# Expected output:
#    [[ 3 11]
#    [ 1  4]
#    [ 7  5]
#    [ 2  7]
#    [ 6  8]
#    [ 9  7]
#    [ 0 10]
#    [ 4  2]]


########################################################
# INSERT YOUR ANSWER HERE

X_reshaped = X.reshape((8, 2))
print(X_reshaped)

########################################################
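
When only one dimension is known, `-1` can stand in for the other and NumPy infers it from the number of elements. A short sketch, repeating `X` so the block is self-contained (`X_auto` and `X_flat` are illustrative names):

```python
import numpy as np

X = np.array([[3, 11, 1, 4], [7, 5, 2, 7], [6, 8, 9, 7], [0, 10, 4, 2]])

# 16 elements with 2 columns means NumPy infers 8 rows
X_auto = X.reshape(-1, 2)
print(X_auto.shape)  # (8, 2)

# A single -1 flattens the matrix into a 1-D array
X_flat = X.reshape(-1)
print(X_flat.shape)  # (16,)
```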

#### 3.1.4 Statistics in NumPy

When we deal with large amounts of data, we often want summary information about the data as a whole. NumPy provides functions for this, most of which are self-explanatory:

- `np.sum(b)` : Sum of all elements in an array/matrix b (for a matrix, pass `axis=0` to sum down each column or `axis=1` to sum across each row)
- `np.max(b)` : Find the maximum element in array/matrix b
- `np.min(b)` : Find the minimum element in array/matrix b
- `np.mean(b)` : Mean of elements in an array/matrix b
- `np.median(b)` : Median value among elements in array/matrix b
- `np.var(b)` : Variance of the elements in the array/matrix b
- `np.std(b)` : Standard deviation of the elements in the array/matrix b
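
The effect of the `axis` argument is easiest to see on a small matrix. A minimal sketch with made-up values (`M` is an illustrative name); the same argument is accepted by the other functions in the list above:

```python
import numpy as np

M = np.array([[1, 2, 3], [4, 5, 6]])

total = np.sum(M)            # every element
column_sums = np.sum(M, axis=0)  # collapse the rows, one sum per column
row_sums = np.sum(M, axis=1)     # collapse the columns, one sum per row

print(total)        # 21
print(column_sums)  # [5 7 9]
print(row_sums)     # [ 6 15]
```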

As before, fill in the cells below to get used to these functions.

In [None]:
x = np.array([34, 56, 6, 3, 9, 89, 120, 12, 201], dtype=np.int32)
print(x)

In [None]:
# TODO: Find the summation of all elements
# Expected output: 530


########################################################
# INSERT YOUR ANSWER HERE

summation = np.sum(x)
print(summation)

########################################################

In [None]:
# TODO: Maximum element in the array
# Expected output: 201


########################################################
# INSERT YOUR ANSWER HERE

maximum = np.max(x)
print(maximum)

########################################################

In [None]:
# TODO: Minimum element in the array
# Expected output: 3


########################################################
# INSERT YOUR ANSWER HERE

minimum = np.min(x)
print(minimum)

########################################################

In [None]:
# TODO: Average value of elements in the array
# Expected output: 58.89 (rounded to two decimal places)


########################################################
# INSERT YOUR ANSWER HERE

mean = x.mean()
print(mean)

########################################################

In [None]:
# TODO: Median element in the array
# Expected output: 34.0


########################################################
# INSERT YOUR ANSWER HERE

median = np.median(x)
print(median)

########################################################

In [None]:
# TODO: Variance of x
# Expected output: 4008.098765432099


########################################################
# INSERT YOUR ANSWER HERE

variance = np.var(x)
print(variance)

########################################################

In [None]:
# TODO: Standard deviation of the array
# Expected output: 63.3095471902311


########################################################
# INSERT YOUR ANSWER HERE

std = np.std(x)
print(std)

########################################################
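
By default `np.var()` and `np.std()` divide by the number of elements n (the population formulas). Passing `ddof=1` divides by n - 1 instead, which gives the sample estimates. The sketch below reuses the array `x` from this section; `population_var` and `sample_var` are illustrative names.

```python
import numpy as np

x = np.array([34, 56, 6, 3, 9, 89, 120, 12, 201], dtype=np.int32)

population_var = np.var(x)      # divides by n (ddof=0 is the default)
sample_var = np.var(x, ddof=1)  # divides by n - 1

print(population_var)
print(sample_var)
```

This matters when comparing results across libraries: Pandas uses n - 1 by default for `std`, so its numbers can differ slightly from NumPy's on the same data.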

#### 3.1.5 Further Resources
Shown below are some further resources you can use to improve your knowledge of NumPy:
- [NumPy Quickstart Tutorial](https://numpy.org/devdocs/user/quickstart.html)
- [NumPy Tutorial - *by Nicolas Rougier*](https://github.com/rougier/numpy-tutorial)
- [Stanford CS231 - *by Justin Johnson*](https://cs231n.github.io/python-numpy-tutorial/)
- [Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)
- [Documentation](https://numpy.org/doc/) 

### 3.2 Pandas

Pandas is another commonly used library, built on top of NumPy. It is best known for its data wrangling capabilities and integrates easily with other data science modules in Python.

##### What Can You Do With DataFrames Using Pandas?

Pandas makes it simple to do many of the time-consuming, repetitive tasks associated with working with data, including:
- Data cleansing
- Filling in missing data
- Data normalization
- Merges and joins
- Statistical analysis
- Data inspection
- Loading and saving data
- And much more...

#### 3.2.1 Housekeeping
Make sure you've installed the Pandas library:

In [None]:
!pip install pandas

In [None]:
# Import the Pandas library into the current Python session and refer to it as pd (a convention)
import pandas as pd

# Print the current version of Pandas; this is useful when referring to the documentation
print(pd.__version__)

#### 3.2.2 Series
A Series in Pandas is a one-dimensional labeled array whose elements share a single data type.

Properties: index, values, dtype

To create a Series, use the function below:
- `pd.Series()`

In [None]:
# Creating a Series by passing a list of values along with its index:
s1 = pd.Series([1, 3, 5, np.nan, 6, 8], index=["A", "B", "C", "D", "E", "F"])
print("-----Series 1-----\n", s1, "\n")

# Creating a Series from a list stored in a variable, along with its index and a name:
data = [1, 2, 3]
index = ["X", "Y", "Z"]
s2 = pd.Series(data=data, index=index, name="series")
print("-----Series 2-----\n", s2, "\n")

# Creating a Series by passing a list of values, letting pandas create a default integer index:
s3 = pd.Series([1, 3, 5, np.nan, 6, 8])
print("-----Series 3-----\n", s3, "\n")

# Creating a Series by passing a dictionary (the keys become the index)
s4 = pd.Series(data={"a": 1, "b": 2, "c": 3})
print("-----Series 4-----\n", s4, "\n")

In [None]:
# Further information contained in 's1'
print("-----Series 1-----\n")
print("s1.name:\n", s1.name, "\n")  # None, because s1 was created without a name
print("s1.index:\n", s1.index, "\n")
print("s1.values:\n", s1.values, "\n")
print("s1.dtype:\n", s1.dtype, "\n")
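
Individual elements of a Series can be read by label or by integer position. The sketch below rebuilds `s1` so that it runs on its own; the names `by_label`, `by_position`, and `missing_count` are illustrative.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 3, 5, np.nan, 6, 8], index=["A", "B", "C", "D", "E", "F"])

by_label = s1["B"]         # look up by index label
by_position = s1.iloc[1]   # look up by integer position
missing_count = s1.isna().sum()  # number of NaN entries

print(by_label)       # 3.0 (the values are floats because of the NaN)
print(by_position)    # 3.0
print(missing_count)  # 1
```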

#### 3.2.3 DataFrame

A DataFrame is the bread and butter of Pandas. It holds two-dimensional tabular data with labeled rows and columns.

Properties: index, columns, values, dtypes

To create a DataFrame, use the function below:
- `pd.DataFrame()`

In [None]:
# Creating a DataFrame by passing a NumPy array and labeled columns:
df1 = pd.DataFrame(
    np.random.randn(6, 4), columns=["Price1", "Price2", "Price3", "Price4"]
)
print("-----DataFrame 1-----\n", df1, "\n")

# Creating a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
print("-----DataFrame 2-----\n", df2, "\n")
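
The properties listed above can be inspected directly on any DataFrame. A minimal sketch on a tiny made-up table (`df_small` and its contents are illustrative, not part of the workshop data):

```python
import pandas as pd

df_small = pd.DataFrame({"Name": ["Ann", "Bob"], "Score": [9.5, 7.0]})

print(df_small.index)             # RangeIndex(start=0, stop=2, step=1)
print(df_small.columns.tolist())  # ['Name', 'Score']
print(df_small.dtypes)            # one data type per column
print(df_small.shape)             # (2, 2)
```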

#### 3.2.4 Reading File and Export DataFrame
In a real project, we will mostly work with a provided dataset. Hence, we need a way to import that data (CSV, HDF5, or Excel) into a DataFrame so that it can be processed and manipulated later. We will also want a way to export our processed data to a file.

Functions for importing a dataset:
- `pd.read_csv()`
- `pd.read_hdf()`
- `pd.read_excel()`

Functions for exporting a DataFrame:
- `pd.DataFrame.to_csv()`
- `pd.DataFrame.to_hdf()`
- `pd.DataFrame.to_excel()`

In [None]:
df_import = pd.read_csv(
    "username.csv", sep=";"
)  # Import a CSV file called 'username.csv'
df_import.head()  # Show the first 5 rows of the dataset

In [None]:
df_import.to_csv(
    "username2.csv"
)  # Export a DataFrame into a file called 'username2.csv'
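
To try the import and export functions without a file on disk, a CSV round trip can be sketched with an in-memory buffer. Everything here is an assumption made for illustration: the two-row table, the names `df_out`, `buffer`, and `df_back`, and the choice of `;` as the separator to mirror the cell above.

```python
import io

import pandas as pd

df_out = pd.DataFrame(
    {"Username": ["alpha01", "booker12"], "Identifier": [1035, 9012]}
)

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
df_out.to_csv(buffer, sep=";", index=False)

# Rewind the buffer and read the same text back into a new DataFrame
buffer.seek(0)
df_back = pd.read_csv(buffer, sep=";")
print(df_back)
```

Without `index=False`, `to_csv()` also writes the row index as an extra unnamed first column, which then reappears as a column when the file is read back.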

#### 3.2.5 Data Manipulation
When your data is loaded into a DataFrame, there are endless things you can do with your data, such as:
- Reshaping
- Slicing 
- Querying (take a subset of the data that meets certain criteria)
- Handling missing data (very handy for data preprocessing)
- Summarizing

Below are some of the functions you can call to execute the aforementioned functionality: 
- `df.sort_index()` : Sort by the index of the DataFrame
- `df.sort_values()` : Sort the DataFrame by the values in a particular column
- `df.drop(columns=['A', 'B'])` : Drop columns from the DataFrame
- `df.head(n)` : Select first n rows; Default: n = 5
- `df.tail(n)` : Select last n rows; Default: n = 5
- `df.loc` : Access a group of rows and columns by label(s) or a boolean array
- `df.iloc` : Access a group of rows and columns by integer position
- `df.sample(n)` : Randomly select n rows
- `df[['A', 'B']]` : Select multiple columns with specific names (show only column 'A' and column 'B')
- `df.query('A > 7')` : Take the rows that meet the specified criteria (rows in which the value of column A is greater than 7)
- `df.dropna()` : Drop rows with any column having NA/null data.
- `df.fillna(value)` : Replace all NA/null data with value.
- `len(df)` : Return the number of rows in the DataFrame.
- `df.shape` : Return a tuple of the number of rows, and the number of columns in the DataFrame
- `df.describe()` : Provide a basic description and statistics for each column
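
Several of these functions are not exercised in the cells that follow, so here is a small sketch of them on a made-up table. The name `df_demo` and its values are illustrative only.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {"A": [5, 8, 9, 3], "B": [1.0, np.nan, 3.0, 4.0], "C": list("wxyz")}
)

two_columns = df_demo[["A", "C"]]         # select columns by name
without_c = df_demo.drop(columns=["C"])   # drop a column
complete_rows = df_demo.dropna()          # row 1 is removed because B is missing
filled = df_demo.fillna(0)                # the missing B becomes 0
large_a = df_demo.loc[df_demo["A"] > 7, ["A", "B"]]  # boolean mask plus column labels

print(large_a)
```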

In [None]:
# Initialise the data we will use below
df = pd.read_csv("username.csv", sep=";")  # Import a CSV file called 'username.csv'
display(df)

In [None]:
# TODO: Find the shape of the DataFrame df

# Expected output: (10, 5)


########################################################
# INSERT YOUR ANSWER HERE

print(df.shape)

########################################################

In [None]:
# TODO: Extract the first 3 rows of the DataFrame

# Expected output:
#       Username    Identifier  First name  Last name   Age
#    0  alpha01     1035        Romane      Alpha       21
#    1  booker12    9012        Rachel      Booker      13
#    2  grey07      2070        Laura       Grey        42


########################################################
# INSERT YOUR ANSWER HERE

df_extract1 = df.head(3)
display(df_extract1)

########################################################

In [None]:
# TODO: Extract data in rows 3-6

# Expected outcome:
#       Username          Identifier    First name  Last name   Age
#    3  hedgehogger14     2456          Joe         Hogger      26
#    4  johnson81         4081          Craig       Johnson     43
#    5  jenkins46         9346          Mary        Jenkins     62
#    6  smith79           5079          Jamie       Smith       17


########################################################
# INSERT YOUR ANSWER HERE

df_extract2 = df.iloc[3:7]
display(df_extract2)

########################################################
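
Slices behave differently in `iloc` and `loc`: `iloc` follows Python's convention and excludes the end position, whereas `loc` includes the end label. A sketch on a made-up DataFrame with the default integer index (`df_rows` is an illustrative name):

```python
import pandas as pd

df_rows = pd.DataFrame({"value": range(10)})  # default index 0..9

by_position = df_rows.iloc[3:7]  # positions 3, 4, 5, 6 (end excluded)
by_label = df_rows.loc[3:6]      # labels 3 through 6 (end included)

print(by_position.equals(by_label))  # True
```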

In [None]:
# TODO: Produce a basic statistical description of the DataFrame

# Expected outcome:
#       	Identifier	    Age
#   count	10.000000	    10.000000
#   mean	5220.900000	    32.500000
#   std	    3587.123529	    14.908983
#   min	    1035.000000	    13.000000
#   25%	    2166.500000	    22.000000
#   50%	    4580.000000	    30.500000
#   75%	    8827.500000	    41.750000
#   max	    9821.000000	    62.000000


########################################################
# INSERT YOUR ANSWER HERE

display(df.describe())

########################################################

In [None]:
# TODO: Extract the last 3 rows, keeping only the 'Username' and 'Identifier' columns

# Expected outcome:
#       Username	        Identifier
#   7	midnighteagle10	    8274
#   8	mavericky00     	1035
#   9	laxman10	        9821


########################################################
# INSERT YOUR ANSWER HERE

df_extract3 = df.tail(3)[["Username", "Identifier"]]
display(df_extract3)

########################################################

In [None]:
# TODO: Sort the DataFrame df by age in descending order

# Expected outcome:
#       Username        Identifier  First name  Last name   Age
#   5   jenkins46       9346        Mary        Jenkins     62
#   4   johnson81       4081        Craig       Johnson     43
#   2   grey07          2070        Laura       Grey        42
#   9   laxman10        9821        Trevor      Dos         41
#   8   mavericky00     1035        Thomas      Maverick    35
#   3   hedgehogger14   2456        Joe         Hogger      26
#   7   midnighteagle10 8274        Donald      Eagle       25
#   0   alpha01         1035        Romane      Alpha       21
#   6   smith79         5079        Jamie       Smith       17
#   1   booker12        9012        Rachel      Booker      13


########################################################
# INSERT YOUR ANSWER HERE

df_sorted = df.sort_values(by=["Age"], ascending=False)
display(df_sorted)

########################################################

In [None]:
# TODO: Find the rows in which the identifier is greater than 8000

# Expected outcome:
#   	Username	    Identifier	First name	Last name	Age
#   1	booker12	    9012	    Rachel	    Booker	    13
#   5	jenkins46	    9346	    Mary	    Jenkins	    62
#   7	midnighteagle10	8274	    Donald	    Eagle	    25
#   9	laxman10	    9821	    Trevor	    Dos	        41


########################################################
# INSERT YOUR ANSWER HERE

display(df.query("Identifier > 8000"))

########################################################
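
The same filter can also be written with a boolean mask instead of `query()`. The sketch below uses three rows taken from the table above to show that both forms select the same rows; `df_ids`, `via_query`, and `via_mask` are illustrative names.

```python
import pandas as pd

df_ids = pd.DataFrame(
    {
        "Username": ["booker12", "grey07", "laxman10"],
        "Identifier": [9012, 2070, 9821],
    }
)

via_query = df_ids.query("Identifier > 8000")     # condition as a string
via_mask = df_ids[df_ids["Identifier"] > 8000]    # condition as a boolean Series

print(via_query.equals(via_mask))  # True
```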

#### 3.2.6 Further Resources

Shown below are some further resources you can use to improve your knowledge of Pandas:
- [Pandas Workshop - *by Stefanie Molin*](https://github.com/stefmolin/pandas-workshop)
- [Pandas Cookbook - *by Julia Evans*](https://github.com/jvns/pandas-cookbook)
- [Pandas Exercises - *by Guilherme Samora*](https://github.com/guipsamora/pandas_exercises)
- [Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- [Documentation](https://pandas.pydata.org/docs/)