## 1. Basics of Numpy

Numpy (Numerical Python) is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays.

If you are already familiar with MATLAB, you might find this [tutorial](http://wiki.scipy.org/NumPy_for_Matlab_Users) useful to get started with Numpy.


## Why use Numpy?


1.  Numpy array calculations are much faster and more memory-efficient than the equivalent operations on Python lists.
2.  It has built-in functions for linear algebra, Fourier transforms, matrix calculations, etc.
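
As a rough illustration of the first point, the sketch below computes the same sum of squares with a plain Python list and with a Numpy array, then times both. The array size and the helper names `list_sum_squares` and `numpy_sum_squares` are arbitrary choices for this example, and the exact timings depend on the machine.

```python
import timeit
import numpy as np

n = 100_000
values_list = list(range(n))
values_arr = np.arange(n, dtype=np.int64)

def list_sum_squares():
    # Sum of squares with a Python-level loop over a list
    return sum(v * v for v in values_list)

def numpy_sum_squares():
    # Same computation as a single vectorized Numpy expression
    return int(np.sum(values_arr * values_arr))

# Both approaches give the same result
assert list_sum_squares() == numpy_sum_squares()

t_list = timeit.timeit(list_sum_squares, number=10)
t_numpy = timeit.timeit(numpy_sum_squares, number=10)
print(f"list: {t_list:.4f}s  numpy: {t_numpy:.4f}s")
```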



In [None]:
import numpy as np

### Arrays

A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.

We can initialize numpy arrays from nested Python lists, and access elements using square brackets:







In [None]:
list_a = [1,2,3,4,5]
a = np.array(list_a)  # Create a 1D array
print(type(a), a.shape, a[0], a[1], a[2])
a[0] = 5                 # Change an element of the array
print(a)                  

<class 'numpy.ndarray'> (5,) 1 2 3
[5 2 3 4 5]


In [None]:
list_b = [[1,2,3],[4,5,6]] # Create a nested list
b = np.array(list_b)   # Create a 2D array
print(b)
print(b.shape)
# NOTE: All inner lists must have the same length

[[1 2 3]
 [4 5 6]]
(2, 3)


In [None]:
print(b.shape)
print(b[0, 0], b[0, 1], b[1, 0])

(2, 3)
1 2 4


In [None]:
# Task
"""
Create a numpy array "c" with shape (2,2,3)
"""
list_c = [ [[1,2,3], [4,5,6]], [[3,2,1], [6,5,4]]]
c = np.array(list_c)
print(c)

[[[1 2 3]
  [4 5 6]]

 [[3 2 1]
  [6 5 4]]]
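
To connect the 3D array above with the terms rank and shape, here is a small sketch that inspects it using the standard `ndim`, `shape` and `size` attributes. The array is rebuilt so the snippet runs on its own.

```python
import numpy as np

c = np.array([[[1, 2, 3], [4, 5, 6]],
              [[3, 2, 1], [6, 5, 4]]])

print(c.ndim)      # number of dimensions (the rank): 3
print(c.shape)     # size along each dimension: (2, 2, 3)
print(c.size)      # total number of elements: 12
print(c[1, 0, 2])  # second block, first row, third column: 1
```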


Numpy also provides many functions to create arrays:

In [None]:
a = np.zeros((4,2))  # Create an array of all zeros
print(a)

[[0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]]


In [None]:
b = np.ones((1,2))   # Create an array of all ones
print(b)

[[1. 1.]]


In [None]:
c = np.full((2,2), 4) # Create a constant array
print(c)

[[4 4]
 [4 4]]


In [None]:
d = np.eye(3)        # Create a 3x3 identity matrix
print(d)

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


In [None]:
e = np.random.random((3,3)) # Create an array filled with random values
print(e)

# NOTE: The values are drawn uniformly from the half-open interval [0, 1)

[[0.96530349 0.23945709 0.69999421]
 [0.14117527 0.44927329 0.0620962 ]
 [0.1034208  0.67021154 0.15203092]]


In [None]:
f = np.random.randint(5, size=(2,4)) # Create an array of random integers from 0 to 4 (5 is excluded)
print(f)

[[3 1 2 2]
 [0 3 1 3]]


In [None]:
tp = np.random.randint(10, size=(1000))
g = np.random.choice(tp, 6, replace=False) # Randomly pick 6 entries of tp; no position is picked twice, but repeated values in tp can still appear
print(g)

[1 4 1 4 8 7]
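
The random outputs above change on every run. One way to make them reproducible is to use a seeded generator, sketched here with `np.random.default_rng`. The seed value 42 and the variable names are arbitrary choices for this example.

```python
import numpy as np

# Two generators created with the same seed produce the same numbers
rng_1 = np.random.default_rng(seed=42)
rng_2 = np.random.default_rng(seed=42)

floats_1 = rng_1.random((3, 3))            # uniform floats in [0, 1)
floats_2 = rng_2.random((3, 3))
ints = rng_1.integers(0, 5, size=(2, 4))   # integers from 0 to 4
picks = rng_1.choice(np.arange(10), size=6, replace=False)  # 6 distinct values

print(np.array_equal(floats_1, floats_2))  # True
```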


### Array indexing and slicing

Numpy offers several ways to index into arrays.

Slicing: Similar to Python lists, numpy arrays can be sliced. Since arrays may be multidimensional, you must specify a slice for each dimension of the array:

In [None]:
a = np.array([1,2,3,4,5])

b = a[1:4] # Use slicing to pull out the elements at indexes 1, 2 and 3 (the end index 4 is excluded)
print(b)

[2 3 4]


In [None]:

# Create the following 2D array with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2): 
# [[2 3]
#  [6 7]]
b = a[:2, 1:3]
print(b)

[[2 3]
 [6 7]]


A slice of an array is a view into the same data, so modifying it will modify the original array.

In [None]:
print(a[0, 1])
b[0, 0] = 77    # b[0, 0] is the same piece of data as a[0, 1]
print(a[0, 1]) 

2
77
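
If the original array must stay untouched, one option is to take an explicit copy of the slice. The sketch below contrasts a view with a copy. The variable names `view` and `copy` are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

view = a[:2, 1:3]         # basic slice: shares its data with a
copy = a[:2, 1:3].copy()  # independent copy of the same region

copy[0, 0] = -1           # does not affect a
view[0, 0] = 77           # writes through to a[0, 1]

print(a[0, 1])            # 77
print(copy[0, 0])         # -1
```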


In [None]:
# Task: Expected value if we were to execute " print(b[:, -1]) "
print(b[:, -1])

[3 7]


In [None]:
list_a = [[1,2,3,4], [5,6,7,8], [9,10,11,12]]
a = np.array(list_a)
print(a)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]


There are several ways of accessing the data in the middle row of the array.
Mixing integer indexing with slices yields an array of lower rank,
while using only slices yields an array of the same rank as the
original array:

In [None]:
row_r1 = a[1, :]    # Rank 1 view of the second row of a  
row_r2 = a[1:2, :]  # Rank 2 view of the second row of a
row_r3 = a[[1], :]  # Rank 2 copy (not a view) of the second row of a; indexing with a list copies the data
print(row_r1, row_r1.shape)
print(row_r2, row_r2.shape)
print(row_r3, row_r3.shape)

[5 6 7 8] (4,)
[[5 6 7 8]] (1, 4)
[[5 6 7 8]] (1, 4)


In [None]:
# We can make the same distinction when accessing columns of an array:
col_r1 = a[:, 1]
col_r2 = a[:, 1:2]
print(col_r1, col_r1.shape, "\n")
print(col_r2, col_r2.shape)

[ 2  6 10] (3,) 

[[ 2]
 [ 6]
 [10]] (3, 1)
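
The three row-access styles shown earlier differ not only in rank but also in whether they share memory with `a`. A minimal check, using `np.shares_memory`, could look like this:

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

row_int = a[1, :]      # integer index + slice: rank 1, view
row_slice = a[1:2, :]  # slices only: rank 2, view
row_list = a[[1], :]   # integer-array index: rank 2, copy

for name, row in [("a[1, :]", row_int),
                  ("a[1:2, :]", row_slice),
                  ("a[[1], :]", row_list)]:
    # Print the expression, the resulting shape, and whether it aliases a
    print(name, row.shape, np.shares_memory(a, row))
```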


### Datatypes

Every numpy array is a grid of elements of the same type. Numpy provides a large set of numeric datatypes that you can use to construct arrays. Numpy tries to guess a datatype when you create an array, but functions that construct arrays usually also include an optional argument to explicitly specify the datatype. Here is an example:


In [None]:
x = np.array([1, 2])  # Let numpy choose the datatype
y = np.array([1.0, 2.0])  # Let numpy choose the datatype
z = np.array([1.2, 2], dtype=np.int64)  # Force a particular datatype

print(x.dtype, y.dtype, z.dtype, z)

int64 float64 int64 [1 2]
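
Note that forcing a float value into an integer datatype drops the fractional part instead of rounding it. The sketch below shows the same effect with `astype`, and one way to round first with `np.rint`. The sample values are made up for illustration.

```python
import numpy as np

x = np.array([1.7, -1.7, 2.0])

truncated = x.astype(np.int64)         # drops the fractional part (toward zero)
rounded = np.rint(x).astype(np.int64)  # rounds to the nearest integer first

print(truncated)  # [ 1 -1  2]
print(rounded)    # [ 2 -2  2]
```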


### Array math

Basic mathematical functions operate elementwise on arrays, and are available both as operator overloads and as functions in the numpy module:

In [None]:
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Elementwise sum; both produce the same array
print(x + y)
print(np.add(x, y))

[[ 6.  8.]
 [10. 12.]]
[[ 6.  8.]
 [10. 12.]]


In [None]:
# Elementwise difference; both produce the same array
print(x - y)
print(np.subtract(x, y))

[[-4. -4.]
 [-4. -4.]]
[[-4. -4.]
 [-4. -4.]]


In [None]:
# Elementwise product; both produce the same array
print(x * y)
print(np.multiply(x, y))

[[ 5. 12.]
 [21. 32.]]
[[ 5. 12.]
 [21. 32.]]


In [None]:
# Elementwise division; both produce the array
# [[ 0.2         0.33333333]
#  [ 0.42857143  0.5       ]]
print(x / y)
print(np.divide(x, y))

[[0.2        0.33333333]
 [0.42857143 0.5       ]]
[[0.2        0.33333333]
 [0.42857143 0.5       ]]


In [None]:
# Elementwise square root; produces the array
# [[ 1.          1.41421356]
#  [ 1.73205081  2.        ]]
print(np.sqrt(x), x)

[[1.         1.41421356]
 [1.73205081 2.        ]] [[1. 2.]
 [3. 4.]]


Note that `*` is elementwise multiplication, not matrix multiplication. We instead use the `dot` function to compute inner products of vectors, to multiply a vector by a matrix, and to multiply matrices. `dot` is available both as a function in the numpy module and as an instance method of array objects:

In [None]:
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

v = np.array([9,10])
w = np.array([11, 12])

# Inner product of vectors; both produce 219
print(v.dot(w))
print(np.dot(v, w))

219
219
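
The inner product is just the sum of the elementwise products, and `dot` also covers matrix-vector products. A small sketch reusing the values of `x`, `v` and `w` from the cell above:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
v = np.array([9, 10])
w = np.array([11, 12])

inner = np.sum(v * w)  # 9*11 + 10*12 = 219, same value as v.dot(w)
mat_vec = x.dot(v)     # matrix times vector: [1*9 + 2*10, 3*9 + 4*10]
vec_mat = v.dot(x)     # vector times matrix: [9*1 + 10*3, 9*2 + 10*4]

print(inner)    # 219
print(mat_vec)  # [29 67]
print(vec_mat)  # [39 58]
```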


We can use the `matmul` function to compute matrix products. The `@` operator is shorthand for the `matmul` function.

In [None]:
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])
z = np.array([[2,3],[5,6]])

print('x',x.shape)
print('y',y.shape)
print('z',z.shape)
# Matrix multiplication of x and y
print('x times y', np.matmul(x,y)) ## xy
print('y times x', y@x) ## yx 
print('z times x',z@x) ## zx

x (2, 2)
y (2, 2)
z (2, 2)
x times y [[19 22]
 [43 50]]
y times x [[23 34]
 [31 46]]
z times x [[11 16]
 [23 34]]
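
As the outputs above suggest, matrix multiplication is not commutative, and the inner dimensions of the two operands must match. The sketch below illustrates both points. The arrays `p` and `q` are made-up examples.

```python
import numpy as np

p = np.arange(6).reshape(2, 3)  # shape (2, 3)
q = np.arange(6).reshape(3, 2)  # shape (3, 2)

print((p @ q).shape)  # (2, 3) @ (3, 2) gives (2, 2)
print((q @ p).shape)  # (3, 2) @ (2, 3) gives (3, 3)

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
print(np.array_equal(x @ y, y @ x))  # False: the order of the operands matters
```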


Numpy provides many useful functions for performing computations on arrays; one of the most useful is `sum`:

In [None]:
x = np.array([[1,2],[3,4]])
"""
[ [1,2]
  [3,4] ]
"""
print(np.sum(x))  # Compute sum of all elements; prints "10"
print(np.sum(x, axis=0))  # Compute sum of each column; prints "[4 6]"
print(np.sum(x, axis=1))  # Compute sum of each row; prints "[3 7]"

10
[4 6]
[3 7]
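
A way to remember the `axis` argument is that it names the dimension that gets collapsed. The sketch below shows this on a non-square array, where the resulting shapes make the difference visible. The array `m` is a made-up example.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)  # [[0 1 2], [3 4 5]], shape (2, 3)

col_sums = m.sum(axis=0)  # collapses the 2 rows, one sum per column
row_sums = m.sum(axis=1)  # collapses the 3 columns, one sum per row

print(col_sums, col_sums.shape)  # [3 5 7] (3,)
print(row_sums, row_sums.shape)  # [ 3 12] (2,)
```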


In [None]:
x = np.array([[1,2],[3,4]])

print(np.max(x))  # Compute max of all elements; prints "4"
print(np.max(x, axis=0))  # Compute max of each column; prints "[3 4]"
print(np.max(x, axis=1))  # Compute max of each row; prints "[2 4]"

4
[3 4]
[2 4]


In [None]:
x = np.array([[1,2],[3,4]])

print(np.mean(x))  # Compute mean of all elements;
print(np.mean(x, axis=0))  # Compute mean of each column;
print(np.mean(x, axis=1))  # Compute mean of each row;

2.5
[2. 3.]
[1.5 3.5]


Apart from computing mathematical functions using arrays, we frequently need to reshape or otherwise manipulate data in arrays. The simplest example of this type of operation is transposing a matrix; to transpose a matrix, simply use the T attribute of an array object:

In [None]:
print(x)
print("transpose\n", x.T)

[[1 2]
 [3 4]]
transpose
 [[1 3]
 [2 4]]


In [None]:
v = np.array([[1,2,3]])
print(v )
print("transpose\n", v.T)

[[1 2 3]]
transpose
 [[1]
 [2]
 [3]]


In [None]:
v = np.array([[1,2,3],[4,5,6]])
print(v.shape)
print(v.reshape(3,2))
print(v.reshape(6))

(2, 3)
[[1 2]
 [3 4]
 [5 6]]
[1 2 3 4 5 6]
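
Two details about `reshape` are worth a quick sketch: a dimension given as `-1` is inferred from the others, and reshaping to (3, 2) is not the same as transposing.

```python
import numpy as np

v = np.array([[1, 2, 3], [4, 5, 6]])

flat = v.reshape(-1)      # -1 lets numpy infer the length (6)
pairs = v.reshape(-1, 2)  # number of rows inferred: shape (3, 2)

print(flat.shape, pairs.shape)        # (6,) (3, 2)
print(pairs.tolist())                 # [[1, 2], [3, 4], [5, 6]]
print(v.T.tolist())                   # [[1, 4], [2, 5], [3, 6]]
print(np.array_equal(pairs, v.T))     # False
```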


In [None]:
## splitting an array into multiple equal-sized subarrays
x = np.arange(15)
y = np.split(x, 5)
print(y)

[array([0, 1, 2]), array([3, 4, 5]), array([6, 7, 8]), array([ 9, 10, 11]), array([12, 13, 14])]


In [None]:
## find the indices of the elements that satisfy a condition
x = np.array([ [7,4,3,5,2,1], [1,2,3,4,5,6] ])

print(np.where( x > 3 ), '\n')

y = x[np.where( x > 3 )]
print(y)

(array([0, 0, 0, 1, 1, 1]), array([0, 1, 3, 3, 4, 5])) 

[7 4 5 4 5 6]
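
The same selection can also be written with a boolean mask directly, and `np.argwhere` returns the matching positions as (row, column) pairs. A short sketch using the same array:

```python
import numpy as np

x = np.array([[7, 4, 3, 5, 2, 1], [1, 2, 3, 4, 5, 6]])

mask = x > 3                   # boolean array with the same shape as x
via_where = x[np.where(mask)]  # selection through index arrays
via_mask = x[mask]             # selection through the mask itself
coords = np.argwhere(mask)     # one (row, column) pair per match

print(np.array_equal(via_where, via_mask))  # True
print(coords[:3].tolist())                  # [[0, 0], [0, 1], [0, 3]]
```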


In [None]:
## sorting numpy arrays
x = np.array([7,4,3,5,2,1])
print(x)
print(np.sort(x))

[7 4 3 5 2 1]
[1 2 3 4 5 7]
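
`np.sort` returns a sorted copy and leaves `x` unchanged. When the sorted positions are needed rather than the values, `np.argsort` can be used, as sketched here:

```python
import numpy as np

x = np.array([7, 4, 3, 5, 2, 1])

order = np.argsort(x)          # indices that would sort x
descending = np.sort(x)[::-1]  # sort ascending, then reverse

print(order)       # [5 4 2 1 3 0]
print(x[order])    # [1 2 3 4 5 7]
print(descending)  # [7 5 4 3 2 1]
print(x)           # [7 4 3 5 2 1], the original is untouched
```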


# QM7 dataset - Class Exercise

In [None]:
import pandas as pd

### Part-1
*   Use the concepts taught in the previous tutorial to load qm7.csv
*   Keep only the first 1000 entries
*   Extract only the values of the "u0_atom" column. (What datatype is this new array?)

In [None]:
df = pd.read_csv('qm7.csv')
df_1000 = df.head(1000)
u0_vals = df_1000["u0_atom"].values
print(u0_vals[:10], type(u0_vals), u0_vals.shape)

[ -417.96  -712.42  -564.21  -404.88  -808.87  -677.16  -796.98  -860.33
 -1008.49  -861.73] <class 'numpy.ndarray'> (1000,)


### Part 2

*   Slice the elements from index 100 up to (but not including) 900
*   Create a new numpy array "int_u0_vals" containing all the elements of the sliced u0_vals, converted to the int datatype

*  Print the max, min, sum, mean of "int_u0_vals"

In [None]:
mod_u0_vals = u0_vals[100:900]
int_u0_vals = np.array(mod_u0_vals, dtype=np.int64)
print(int_u0_vals[:8], int_u0_vals.shape, "\n")

max_u0 =  np.max(int_u0_vals)
min_u0 =  np.min(int_u0_vals)
sum_u0 =  np.sum(int_u0_vals)
mean_u0 = np.mean(int_u0_vals)
print(f'Metrics of the array -\n Max:{max_u0}\n Min:{min_u0}\n Sum:{sum_u0}\n Mean:{mean_u0}')

[-1310 -1105 -1122 -1118 -1118 -1271 -1013 -1023] (800,) 

Metrics of the array -
 Max:-785
 Min:-1896
 Sum:-1120256
 Mean:-1400.32


### Part 3
* Create a function to calculate rmse (root mean squared error) between 2 numpy arrays.
* Make numpy arrays a,b by slicing "int_u0_vals" from indices (100-300) and (500-700)
* Find rmse for a,b

In [None]:
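def rmse(arr_1, arr_2):
  """Root mean squared error between two numpy arrays of the same shape."""
  err = np.subtract(arr_1, arr_2)    # elementwise error
  sqrd_err = err**2                  # squared error
  mean_sqrd_err = np.mean(sqrd_err)  # mean of the squared errors

  return np.sqrt(mean_sqrd_err)      # square root of the mean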

In [None]:
def rmse_shorthand(arr_1, arr_2):
  return np.sqrt( ((arr_1 - arr_2)**2).mean() ) # Numpy arrays can be directly subtracted

In [None]:
a,b = int_u0_vals[100:300], int_u0_vals[500:700]

print(rmse(a,b))
print(rmse_shorthand(a,b))

260.3198513367738
260.3198513367738
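
To sanity-check an RMSE function, it helps to try it on a tiny input where the answer can be worked out by hand. The arrays below are made up for that purpose, and `rmse` is redefined so the snippet runs on its own.

```python
import numpy as np

def rmse(arr_1, arr_2):
    # Root mean squared error between two arrays of the same shape
    return np.sqrt(np.mean((arr_1 - arr_2) ** 2))

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 6.0])

# errors: [0, 0, -3], squares: [0, 0, 9], mean: 3, root: sqrt(3)
print(rmse(a, b))
print(rmse(a, a))  # identical arrays give 0.0
```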


# 2. Basics of Sklearn (scikit-learn)

Scikit-learn (Sklearn) is one of the most useful and robust libraries for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.

In [None]:
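# Doing the previous exercise using a sklearn function
# NOTE: the `squared` argument was deprecated in scikit-learn 1.4 and removed in 1.6;
# on newer versions use sklearn.metrics.root_mean_squared_error(a, b) instead

from sklearn.metrics import mean_squared_error
print(mean_squared_error(a,b, squared=False))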

260.3198513367738


In [None]:
# Importing necessary libraries

from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

In [None]:
dataset = pd.read_csv('herg_mol_descriptors.csv')
dataset.dropna(inplace=True)  # drop rows with missing values

# Use every column except the SMILES string and the label as a feature
features = [col for col in dataset.columns if col not in ('smiles', 'label')]

data = dataset[features].to_numpy(dtype=np.float64)
labels = dataset['label'].to_numpy(dtype=np.float64)

In [None]:
print(data.shape)
print(labels.shape)

(588, 32)
(588,)


In [None]:
# Splitting the total dataset into a train and test set


# X_train is the input data for train set
# y_train is the label data for train set
# X_test is the input data for test set
# y_test is the label data for test set
print(features)
X_train , X_test , y_train, y_test = train_test_split(data[:], # slice the columns of data here to choose which features to include
                                                      labels[:], 
                                                      test_size=0.25, 
                                                      random_state=12)

['qed', 'MolWt', 'HeavyAtomMolWt', 'ExactMolWt', 'NumValenceElectrons', 'NumRadicalElectrons', 'MaxPartialCharge', 'MinPartialCharge', 'MaxAbsPartialCharge', 'MinAbsPartialCharge', 'FpDensityMorgan1', 'FpDensityMorgan2', 'FpDensityMorgan3', 'HeavyAtomCount', 'NHOHCount', 'NOCount', 'NumAliphaticCarbocycles', 'NumAliphaticHeterocycles', 'NumAliphaticRings', 'NumAromaticCarbocycles', 'NumAromaticHeterocycles', 'NumAromaticRings', 'NumHAcceptors', 'NumHDonors', 'NumHeteroatoms', 'NumRotatableBonds', 'NumSaturatedCarbocycles', 'NumSaturatedHeterocycles', 'NumSaturatedRings', 'RingCount', 'MolLogP', 'MolMR']


In [None]:
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

(441, 32) (147, 32)
(441,) (147,)
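
The behaviour of `train_test_split` is easier to see on a tiny array. The sketch below uses made-up toy data (8 samples, 2 features) to show the resulting sizes, that rows and labels stay aligned, and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(16).reshape(8, 2)  # toy data: row i is [2*i, 2*i + 1]
y = np.arange(8)                 # toy labels: label i belongs to row i

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=12)
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.25, random_state=12)

print(X_tr.shape, X_te.shape)             # (6, 2) (2, 2)
print(np.array_equal(X_tr[:, 0], 2 * y_tr))  # True: rows and labels stay paired
print(np.array_equal(X_te, X_te2))        # True: same seed, same split
```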


## Linear Model
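A linear model has coefficients
$w = [w_1, w_2, w_3, \dots, w_n]$

Let the input feature vector be
$X = [x_1, x_2, x_3, \dots, x_n]$

The model computes the score $w \cdot X + b$, where $b$ is the bias term. For a classifier, the predicted class depends on the sign of this score.

During training, the weights are adjusted to minimize a loss between the predictions and the actual labels in the dataset. `SGDClassifier`, used below, minimizes the hinge loss by default (a linear SVM) using stochastic gradient descent, rather than the residual sum of squares used in ordinary least-squares regression.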
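
The score computation can be sketched directly in numpy. The weights, bias and inputs below are hypothetical values chosen for illustration, not values learned from the dataset.

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])   # hypothetical weights, one per feature
b = -0.5                         # hypothetical bias term
X = np.array([[1.0, 0.0, 1.0],   # two input samples with three features each
              [2.0, 3.0, 0.0]])

scores = X @ w + b                    # w . X + b for every sample
predicted = (scores > 0).astype(int)  # positive score: class 1, otherwise class 0

print(scores)     # [ 2.  -2.5]
print(predicted)  # [1 0]
```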

In [None]:
# Defining a Linear Classifier model
model = SGDClassifier(random_state=0)

In [None]:
# Training the model on the training data
model.fit(X_train, y_train)

SGDClassifier(random_state=0)

In [None]:
# The final weights of the linear model are stored in the coef_ attribute
print('Coefficients \n', model.coef_)
print('Intercept \n', model.intercept_)

Coefficients 
 [[ -222.88965772  -129.69818294   245.58977518  -203.97760263
  -1762.65270506   -17.45200698  -105.53648591   186.58171347
   -191.06273575  -101.05546363  -610.44945865  -799.0264184
   -885.31860864   133.45652397 -1074.49611607 -2085.68593231
   -373.67826712    99.92129487  -273.75697225   788.76227629
   -109.5027889    679.25948739 -2017.58888547  -901.34483113
  -1397.52934333   752.14728125  -218.66338158   114.63573213
   -104.02764945   405.50251514  1403.34649762  3307.57146768]]
Intercept 
 [-1171.31703218]
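
To see how `coef_` and `intercept_` relate to the predictions, the sketch below recomputes the scores by hand and compares them with the model's own `decision_function` and `predict`. It uses synthetic data from `make_classification` instead of the hERG dataset, so the numbers are unrelated to the output above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic binary classification data (an assumption for this example)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = SGDClassifier(random_state=0).fit(X, y)

# Score of each sample: w . x + b
manual_scores = X @ model.coef_.ravel() + model.intercept_[0]
# Positive scores map to the second class, the rest to the first
manual_pred = model.classes_[(manual_scores > 0).astype(int)]

print(np.allclose(manual_scores, model.decision_function(X)))  # True
print(np.array_equal(manual_pred, model.predict(X)))           # True
```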


In [None]:
# Using the trained model to make prediction on the test set
predictions = model.predict(X_test)

In [None]:
print(accuracy_score(y_test, predictions))

0.7959183673469388


In [None]:
# Retrain with 20 different random seeds to see how much the accuracy varies
all_Acc = []
for i in range(20):
  model = SGDClassifier(random_state=i)
  model.fit(X_train, y_train)
  predictions = model.predict(X_test)
  all_Acc.append(accuracy_score(y_test, predictions))

In [None]:
print('Mean accuracy of the models',np.array(all_Acc).mean())

Mean accuracy of the models 0.6435374149659863
