<h1 style="text-align: center"> Basic Python for Machine Learning </h1>
<h1 style="text-align: center"> (Part 2)</h1>

In this second part, we will work with the CSV file format (https://en.wikipedia.org/wiki/Comma-separated_values). You first need to understand a few basics of this format, e.g. the file header, comments, delimiters, and quoting. We will use a dataset to demonstrate how to prepare data for machine learning. The data is freely available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.php).

# 1. The NumPy Library

NumPy is the **fundamental package for scientific computing in Python**. It provides support for **large multi-dimensional arrays** and **high-level mathematical functions** that operate on these arrays. (You can experiment with deep learning using NumPy alone, but it is not the best tool for that job.) Nevertheless, most scientific libraries rely on NumPy conventions and APIs, so it is important to have some knowledge of it.

For more details about NumPy, please refer to the official documentation available at https://numpy.org

To start, we first import NumPy in Python and check its version.

---



In [1]:
import numpy as np
print("numpy: " +np.version.version)

numpy: 1.19.5


## 1.2. The ndarray class

The fundamental class of NumPy is ndarray. It represents a table of items, with the following constraints:

• It is multidimensional (1d, 2d, 3d, ..., nd).

• It is homogeneous, i.e., all items in the table must have the same type.

NumPy provides the foundational data structures and operations for SciPy: ndarrays, which are efficient arrays that are easy to define and manipulate.
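
The homogeneity constraint can be observed directly. The short sketch below (variable names are illustrative, not part of the original notebook) shows that NumPy picks one common type when the input mixes integers and floats, and that a type can also be requested explicitly at creation time.

```python
import numpy as np

# Mixing ints and floats: NumPy stores every item with one common dtype.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)

# A dtype can also be requested explicitly when the array is created.
as_int8 = np.array([1, 2, 3], dtype=np.int8)
print(as_int8.dtype, as_int8.itemsize)  # itemsize is the size of one item in bytes
```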

In [None]:
# define an array
a = np.arange(1,4)
a

array([1, 2, 3])

In [None]:
b=np.arange(4,7)


In [None]:
# create a multi-dimensional array.
c = np.array([a,b])
c

array([[1, 2, 3],
       [4, 5, 6]])

In [None]:
# Type of c
type(c)

numpy.ndarray

In [None]:
#Check the shape (rows and columns of the array).
c.shape

(2, 3)

In [None]:
# Number of dimensions (called 'rank' in older NumPy documentation)
c.ndim

2

In [None]:
# Total number of items
c.size

6

In [None]:
# Item type
c.dtype

dtype('int64')

In [None]:
#Actual data of the table
c.data

<memory at 0x7f5e49ed5590>

In [None]:
#Create an evenly spaced array between 1 and 30 with a difference of 2.
new_array = np.arange(1,30,2)
new_array

array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29])

In [None]:
#Reshape the above array into a desired shape.
new_array.reshape(5,3)

array([[ 1,  3,  5],
       [ 7,  9, 11],
       [13, 15, 17],
       [19, 21, 23],
       [25, 27, 29]])

In [None]:
# Create an array with all elements as ones.
one_array = np.ones([2,2])
one_array

array([[1., 1.],
       [1., 1.]])

In [None]:
#Create an array filled with zeros. 
zero_array = np.zeros([3,3])
zero_array

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [None]:
#Create a diagonal matrix with diagonal values = 1
new_diago = np.zeros([2,2])
np.fill_diagonal(new_diago,new_diago.diagonal() + 1)
new_diago

array([[1., 0.],
       [0., 1.]])

In [None]:
#Extract only diagonal values from an array.
new_diago.diagonal()

array([1., 1.])

In [None]:
new_diago_2 = np.zeros([3,4])
np.fill_diagonal(new_diago_2,new_diago_2.diagonal() + 1)
new_diago_2

array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.]])

In [None]:
#Generate an evenly spaced list between the interval 1 and 5. 
#(Take a minute here to understand the difference between ‘linspace’ and ‘arange’)
new_array_2 = np.linspace(1, 5, num=20)
new_array_2

array([1.        , 1.21052632, 1.42105263, 1.63157895, 1.84210526,
       2.05263158, 2.26315789, 2.47368421, 2.68421053, 2.89473684,
       3.10526316, 3.31578947, 3.52631579, 3.73684211, 3.94736842,
       4.15789474, 4.36842105, 4.57894737, 4.78947368, 5.        ])
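
As a hint for the comparison suggested in the cell above, here is a minimal sketch (with illustrative values) of the main difference: `arange` is driven by a step and excludes the stop value, while `linspace` is driven by a number of points and, by default, includes the stop value.

```python
import numpy as np

# arange: fixed step of 0.5, the stop value (5) is excluded
by_step = np.arange(1, 5, 0.5)

# linspace: fixed number of points (9), the stop value (5) is included
by_count = np.linspace(1, 5, num=9)

print(by_step)
print(by_count)
```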

In [None]:
#Generate an array evenly spaced on a log scale (a geometric progression) between 10 and 100000.
new_array_3 = np.geomspace(10, 100000, num=20)
new_array_3

array([1.00000000e+01, 1.62377674e+01, 2.63665090e+01, 4.28133240e+01,
       6.95192796e+01, 1.12883789e+02, 1.83298071e+02, 2.97635144e+02,
       4.83293024e+02, 7.84759970e+02, 1.27427499e+03, 2.06913808e+03,
       3.35981829e+03, 5.45559478e+03, 8.85866790e+03, 1.43844989e+04,
       2.33572147e+04, 3.79269019e+04, 6.15848211e+04, 1.00000000e+05])

In [None]:
#Now, change the shape of the array. np.reshape returns a new array object and leaves new_array_2 unchanged
#(unlike the ‘resize’ method, which changes the array in place). The -1 lets NumPy infer that dimension.
new_array_2_reshape = np.reshape(new_array_2,(-1,4))
new_array_2_reshape

array([[1.        , 1.21052632, 1.42105263, 1.63157895],
       [1.84210526, 2.05263158, 2.26315789, 2.47368421],
       [2.68421053, 2.89473684, 3.10526316, 3.31578947],
       [3.52631579, 3.73684211, 3.94736842, 4.15789474],
       [4.36842105, 4.57894737, 4.78947368, 5.        ]])

In [None]:
#Create an array by repeating a whole list several times.
repeat_array = np.tile([1,2,3],5)
repeat_array

array([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3])

In [None]:
#Now, repeat each element of array n number of times using repeat function.
repeat_array_2 = np.repeat([1,2,3],3)
repeat_array_2

array([1, 1, 1, 2, 2, 2, 3, 3, 3])

In [None]:
#Generate arrays of desired shape filled with random values between 0 and 1.
random_array= np.random.rand(3,4)
random_array

array([[0.08366865, 0.45090088, 0.14127832, 0.88538094],
       [0.46133743, 0.62564014, 0.52828852, 0.72254879],
       [0.0917912 , 0.46033073, 0.76162298, 0.89445093]])

In [None]:
# Random values drawn from a standard normal distribution.
# Note: the shape is passed dimension by dimension as separate arguments, not as one tuple.
rand_n2 = np.random.randn(3,4)
rand_n2

array([[-0.2640695 , -0.90991832, -0.95561966, -0.23093901],
       [ 0.9939839 ,  0.62995271,  0.37020593, -0.17510285],
       [ 0.14537516, -1.56048317, -0.40497795, -1.49093615]])

In [None]:
#Stack the above two arrays created vertically
rand_3 = np.concatenate((random_array,rand_n2),axis=0)
rand_3

array([[ 0.08366865,  0.45090088,  0.14127832,  0.88538094],
       [ 0.46133743,  0.62564014,  0.52828852,  0.72254879],
       [ 0.0917912 ,  0.46033073,  0.76162298,  0.89445093],
       [-0.2640695 , -0.90991832, -0.95561966, -0.23093901],
       [ 0.9939839 ,  0.62995271,  0.37020593, -0.17510285],
       [ 0.14537516, -1.56048317, -0.40497795, -1.49093615]])

In [None]:
# Stack the above two arrays horizontally.
rand_4 = np.concatenate((random_array,rand_n2),axis=1)
rand_4

array([[ 0.08366865,  0.45090088,  0.14127832,  0.88538094, -0.2640695 ,
        -0.90991832, -0.95561966, -0.23093901],
       [ 0.46133743,  0.62564014,  0.52828852,  0.72254879,  0.9939839 ,
         0.62995271,  0.37020593, -0.17510285],
       [ 0.0917912 ,  0.46033073,  0.76162298,  0.89445093,  0.14537516,
        -1.56048317, -0.40497795, -1.49093615]])

## 1.3. Operations

In [None]:
# Create two random NumPy arrays
rand1= np.random.rand(2,2)
rand2= np.random.rand(2,2)
print(rand1 , "\n" , rand2)

[[0.69872638 0.33858868]
 [0.274603   0.18719013]] 
 [[0.3040998  0.87525292]
 [0.49760003 0.28290281]]


In [None]:
#element-wise addition.
rand1+rand2

array([[1.00282619, 1.21384161],
       [0.77220304, 0.47009295]])

In [None]:
#Element wise subtraction.
rand1-rand2

array([[ 0.39462658, -0.53666424],
       [-0.22299703, -0.09571268]])

In [None]:
#Element wise multiplication 
rand1*rand2

array([[0.21248256, 0.29635073],
       [0.13664246, 0.05295661]])

In [None]:
#Raise each element to the power of 2.
np.power(rand1,2)

array([[0.48821856, 0.1146423 ],
       [0.07540681, 0.03504015]])

In [None]:
# Dot product (matrix product for 2-D arrays) of rand1 and rand2.
np.dot(rand1,rand2)

array([[0.3809643 , 0.70735   ],
       [0.17665253, 0.29330369]])

In [None]:
# Transpose of rand1.
np.transpose(rand1)

array([[0.69872638, 0.274603  ],
       [0.33858868, 0.18719013]])

In [None]:
#datatype of elements in the array.
rand1.dtype

dtype('float64')

In [None]:
#Change the datatype of the array.
rand1= rand1.astype('float32')
rand1.dtype

dtype('float32')

In [None]:
#some mathematical functions in an array, starting with sum of an array.
rand1.sum()

1.4991082

In [None]:
#Maximum of the elements of an array.
rand1.max()

0.69872636

In [None]:
#Mean of the elements of the array
rand1.mean()

0.37477705

In [None]:
#Now, let’s retrieve the index of the maximum value (the index refers to the flattened array).
rand1.argmax()

0

In [None]:
#Index of the minimum value (again in the flattened array).
rand1.argmin()

3

In [None]:
#Create an array consisting of square of first ten whole numbers.
square_array = np.power(np.arange(0,10),2)
square_array

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [None]:
# Create two random NumPy arrays
rand1 = np.random.randn(4,3)
rand2 = np.random.randn(4,3)
print(rand1 , "\n" , rand2)

[[-0.2191532   0.45300024  1.51761131]
 [ 0.34074762  0.48883813  0.40802666]
 [-0.29001823 -0.58529523  0.32833784]
 [ 0.4924684   0.75322758  1.66320103]] 
 [[-1.44645723  0.38651933  0.11135126]
 [-1.57051961  1.12149458  0.46519124]
 [ 1.70265781 -0.17237293 -1.20921193]
 [-0.6298594  -0.65952506  0.61161276]]


To compute an extremum along a particular axis, pass the axis argument. As with indexing by an integer, this removes that dimension from the result. To keep the same number of dimensions, set the keepdims argument to True.

In [None]:
np.amax(rand1,axis=1)

array([1.51761131, 0.48883813, 0.32833784, 1.66320103])

In [None]:
np.amax(rand1,axis=1,keepdims=True)

array([[1.51761131],
       [0.48883813],
       [0.32833784],
       [1.66320103]])
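
One situation where keepdims helps is when the reduced result is combined with the original array again. The sketch below uses a small hand-written matrix (an illustrative example, not data from this notebook) and divides each row by its own maximum; keeping the reduced axis gives a shape that lines up with the rows of the matrix.

```python
import numpy as np

m = np.array([[1.0, 2.0, 4.0],
              [2.0, 8.0, 4.0]])

row_max = m.max(axis=1, keepdims=True)  # shape (2, 1) instead of (2,)
scaled = m / row_max                    # each row is divided by its own maximum

print(row_max.shape)
print(scaled)
```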

## 1.4. Indexation and Slicing

In [None]:
#Create an array consisting of square of first ten whole numbers.
square_array = np.power(np.arange(0,10),2)
square_array

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [None]:
# First item
square_array[0]

0

In [None]:
#Access values in the above array using index.
square_array[2]

4

In [None]:
# Last item
square_array[-1]

81

In [None]:
# From item 2 to item 5 (excluded !)
square_array[2:5]

array([ 4,  9, 16])

In [None]:
# Omitted bounds: a missing start or stop defaults to the beginning or the end
# 3 first items
square_array[:3]

array([0, 1, 4])

In [None]:
# Starting from the 4th item
square_array[3:]

array([ 9, 16, 25, 36, 49, 64, 81])

In [None]:
# All items
square_array

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [None]:
# With a step
# square_array[start:stop:step]
square_array[2:5:1]

array([ 4,  9, 16])

In [None]:
#Reverse
square_array[::-1]

array([81, 64, 49, 36, 25, 16,  9,  4,  1,  0])

In [None]:
#Select the values of the array that are greater than 20, using a boolean mask.
mask = square_array > 20
square_array[mask]

array([25, 36, 49, 64, 81])

In [2]:
#Create a multidimensional array
new_array = np.random.rand(3,4,5)
new_array

array([[[0.49931938, 0.65580344, 0.38575927, 0.07800263, 0.1979933 ],
        [0.03431004, 0.43012857, 0.04770653, 0.60835738, 0.03702095],
        [0.6453156 , 0.68900044, 0.91398537, 0.62527437, 0.63752692],
        [0.76131949, 0.65872079, 0.19515961, 0.22025879, 0.6580571 ]],

       [[0.35677245, 0.63824799, 0.49539882, 0.57337316, 0.05917601],
        [0.42743466, 0.45603249, 0.32557427, 0.41556776, 0.80834507],
        [0.41506303, 0.93265029, 0.2528518 , 0.79353211, 0.27490727],
        [0.37893331, 0.61665816, 0.96559455, 0.0604141 , 0.69169512]],

       [[0.94083551, 0.36244302, 0.37447129, 0.08706718, 0.79685212],
        [0.5810193 , 0.74030162, 0.05173164, 0.81892617, 0.63344136],
        [0.27112076, 0.69046683, 0.6721961 , 0.01311503, 0.86614991],
        [0.27017491, 0.52978485, 0.27663302, 0.42807216, 0.27741289]]])

In [None]:
# shape
new_array.shape

(3, 4, 5)

In [None]:
# First item on each axis
new_array[(0,0,0)]

0.7595024130138094

In [None]:
#Index 1 on the first axis and index 2 on the second axis: returns the 5 items along the last axis
new_array[(1,2)]

array([0.16732638, 0.4800989 , 0.86614543, 0.40640716, 0.94445266])

In [None]:
#With a step and open-ended slices
new_array[::1,1,2::]

array([[0.07534833, 0.21520799, 0.52061881],
       [0.71241139, 0.30090046, 0.91283161],
       [0.79391644, 0.9679389 , 0.44088454]])

In [4]:
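# Select index 1 on the first axis, then indices 2 to 6 on the second axis.
# The second axis only has 4 entries, so the slice is clipped and entries 2 and 3 are returned.
# Note that numbering starts at 0.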
new_array[1,2:7]

array([[0.41506303, 0.93265029, 0.2528518 , 0.79353211, 0.27490727],
       [0.37893331, 0.61665816, 0.96559455, 0.0604141 , 0.69169512]])

In [12]:
#Select the first two entries on the first axis and all but the last entry on the second axis
new_array[:2, :-1]


array([[[0.49931938, 0.65580344, 0.38575927, 0.07800263, 0.1979933 ],
        [0.03431004, 0.43012857, 0.04770653, 0.60835738, 0.03702095],
        [0.6453156 , 0.68900044, 0.91398537, 0.62527437, 0.63752692]],

       [[0.35677245, 0.63824799, 0.49539882, 0.57337316, 0.05917601],
        [0.42743466, 0.45603249, 0.32557427, 0.41556776, 0.80834507],
        [0.41506303, 0.93265029, 0.2528518 , 0.79353211, 0.27490727]]])

In [14]:
# new_array[2] is equivalent to new_array[2,:,:]
new_array[2]

array([[0.94083551, 0.36244302, 0.37447129, 0.08706718, 0.79685212],
       [0.5810193 , 0.74030162, 0.05173164, 0.81892617, 0.63344136],
       [0.27112076, 0.69046683, 0.6721961 , 0.01311503, 0.86614991],
       [0.27017491, 0.52978485, 0.27663302, 0.42807216, 0.27741289]])

In [37]:
# Ellipsis: c[1, ..., 2] is equivalent to c[1, :, :, 2] on a 4-D array (only one ... is allowed per index)
c = np.random.randn(2,2,2,3)
c

array([[[[ 0.23209278, -1.07513607, -1.64491808],
         [-0.00648128,  0.71373812, -1.80091105]],

        [[-0.52443201, -1.14358991,  0.06774028],
         [-1.00729197, -1.99064104, -0.50212937]]],


       [[[-0.11940422,  3.03113499, -0.16217507],
         [ 0.51883335, -1.06260664, -0.41520796]],

        [[-0.4216383 ,  1.77672338, -1.11051321],
         [ 1.80466906,  1.93041066, -0.20151835]]]])

In [38]:
c[1, ..., 2]

array([[-0.16217507, -0.41520796],
       [-1.11051321, -0.20151835]])

In [39]:
c[1, :, :, 2]

array([[-0.16217507, -0.41520796],
       [-1.11051321, -0.20151835]])

In [40]:
d = np.random.randn(4, 3)
d

array([[-0.46543794,  0.36561175,  0.04369021],
       [-0.38208289,  1.47005566, -0.08421098],
       [-0.64769729, -0.87546724,  1.62910088],
       [ 1.77196598,  0.27052044,  0.05251547]])

In [41]:
e = d[:, 0] 
e

array([-0.46543794, -0.38208289, -0.64769729,  1.77196598])

In [43]:
# e has shape (4,) not (4,1)
e.shape

(4,)

In [44]:
e = d[0, :]
e

array([-0.46543794,  0.36561175,  0.04369021])

In [45]:
# e has shape (3,) not (1,3)
e.shape

(3,)

In [50]:
# Meanwhile using slice and not index preserves dimension
f = d[0:1]
f

array([[-0.46543794,  0.36561175,  0.04369021]])

In [51]:
f.shape

(1, 3)

## 1.5. Assignment

In [69]:
#Assignment is performed with the = operator. A single item or a sub-array can be targeted.
a = np.array([[1,2,3],[4,5,6]])
a

array([[1, 2, 3],
       [4, 5, 6]])

In [70]:
a[0,0] = 10
a

array([[10,  2,  3],
       [ 4,  5,  6]])

In [71]:
a[:,1:]=1
a

array([[10,  1,  1],
       [ 4,  1,  1]])

In [None]:
#Take care! The dtype is fixed when the array is created; assigning a value never changes it.

In [72]:
#1.75 is truncated to 1 before assignment, because a has an integer dtype
a[1,0] = 1.75
a

array([[10,  1,  1],
       [ 1,  1,  1]])
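
If the fractional part must be kept, one option is to build a float version of the array first. This is a minimal sketch with illustrative variable names; note that `astype` returns a new array and leaves the original untouched.

```python
import numpy as np

ints = np.array([[10, 1, 1], [4, 1, 1]])
ints[1, 0] = 1.75                  # stored as 1: the fractional part is dropped

floats = ints.astype(np.float64)   # new array with a float dtype
floats[1, 0] = 1.75                # now the value is stored as-is

print(ints)
print(floats)
```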

In [73]:
#The np.resize function returns a new array with the requested shape; a itself is unchanged.
#(The ndarray.resize method, by contrast, works in place.)
b = np.resize(a,[3,2])
b

array([[10,  1],
       [ 1,  1],
       [ 1,  1]])
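
The function and the method also differ when the new shape holds more items than the original. The sketch below uses small illustrative arrays; `refcheck=False` is passed only so that the in-place call does not fail when other references to the array exist (as can happen in interactive sessions).

```python
import numpy as np

a = np.array([1, 2, 3])
grown = np.resize(a, 5)            # function: new array, filled with repeated copies of a
print(a.shape, grown)

b = np.array([1, 2, 3])
b.resize((5,), refcheck=False)     # method: in place, new items are filled with zeros
print(b)
```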

## 1.6. References, view and copy

If a and b reference the same ndarray, every operation on a also applies to b: they share both data and metadata. If c is a view of a, they share the same data but not the metadata, so their shapes, for example, can be modified separately; but if we change the first element of c, the first element of a changes too. If d is a copy of a, all data and metadata are separate.

In [15]:
a = np.random.randn(4, 3)
a

array([[-0.47533822, -0.71050967, -0.52122922],
       [ 0.67406937, -0.5562013 , -2.11426923],
       [ 0.87102457, -0.29358536, -0.09430362],
       [ 0.13472803, -1.66597512,  0.13657809]])

In [16]:
# b is a reference to a
b=a
b[0,0] = 1
b

array([[ 1.        , -0.71050967, -0.52122922],
       [ 0.67406937, -0.5562013 , -2.11426923],
       [ 0.87102457, -0.29358536, -0.09430362],
       [ 0.13472803, -1.66597512,  0.13657809]])

In [17]:
#c is a view of a
c = a.view().reshape(3,4)
c

array([[ 1.        , -0.71050967, -0.52122922,  0.67406937],
       [-0.5562013 , -2.11426923,  0.87102457, -0.29358536],
       [-0.09430362,  0.13472803, -1.66597512,  0.13657809]])

In [18]:
# Shape of a is not affected
a

array([[ 1.        , -0.71050967, -0.52122922],
       [ 0.67406937, -0.5562013 , -2.11426923],
       [ 0.87102457, -0.29358536, -0.09430362],
       [ 0.13472803, -1.66597512,  0.13657809]])

In [19]:
# But if we modify the last element of c, the last element of a is changed
c[-1,-1] = 0
a

array([[ 1.        , -0.71050967, -0.52122922],
       [ 0.67406937, -0.5562013 , -2.11426923],
       [ 0.87102457, -0.29358536, -0.09430362],
       [ 0.13472803, -1.66597512,  0.        ]])

In [20]:
# d is a copy of a
d = a.copy()
d

array([[ 1.        , -0.71050967, -0.52122922],
       [ 0.67406937, -0.5562013 , -2.11426923],
       [ 0.87102457, -0.29358536, -0.09430362],
       [ 0.13472803, -1.66597512,  0.        ]])

In [22]:
d[0,0] =3
d

array([[ 3.        , -0.71050967, -0.52122922],
       [ 0.67406937, -0.5562013 , -2.11426923],
       [ 0.87102457, -0.29358536, -0.09430362],
       [ 0.13472803, -1.66597512,  0.        ]])

In [23]:
# a was not modified by the assignment on d
a

array([[ 1.        , -0.71050967, -0.52122922],
       [ 0.67406937, -0.5562013 , -2.11426923],
       [ 0.87102457, -0.29358536, -0.09430362],
       [ 0.13472803, -1.66597512,  0.        ]])

* ndarray.resize(new_shape, refcheck=True): resize in place
* ndarray.reshape(shape, order='C'): return an array with a new shape (a view whenever possible)
* ndarray.ravel(order='C'): return a flattened array (a view whenever possible)
* ndarray.flatten(order='C'): return a flattened copy
* numpy.concatenate((a1, a2, ...), axis=0): return a concatenation of arrays along an existing axis
* numpy.stack((a1, a2, ...), axis=0): return a stack of arrays along a new axis
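
The differences listed above can be checked with `np.shares_memory`, which reports whether two arrays share underlying data. The following is a small sketch on an illustrative array, not an exhaustive test of when a view is returned.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

flat_view = a.ravel()      # a view here, because a is contiguous in memory
flat_copy = a.flatten()    # always a copy
print(np.shares_memory(a, flat_view), np.shares_memory(a, flat_copy))

joined = np.concatenate((a, a), axis=0)   # existing axis grows
stacked = np.stack((a, a), axis=0)        # a new axis is added in front
print(joined.shape, stacked.shape)
```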

## 1.7. Saving and loading data

Load an npy or npz file:

* numpy.load(file, mmap_mode=None, allow_pickle=False, fix_imports=True, encoding='ASCII')

Load a txt file:

* numpy.loadtxt(fname, dtype=float, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, unpack=False, ndmin=0)

Save one array into an npy file:

* numpy.save(file, arr, allow_pickle=True, fix_imports=True)

Save several arrays into an npz file:

* numpy.savez(file, *args, **kwds)

Save one array into a txt file:

* numpy.savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\n', header='', footer='', comments='# ')
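
A round trip through these functions might look like the sketch below. The file names and array contents are illustrative assumptions, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

import numpy as np

arr = np.arange(6, dtype=np.float64).reshape(2, 3)
other = np.ones(4)

with tempfile.TemporaryDirectory() as tmp:
    # One array in binary npy format
    npy_path = os.path.join(tmp, "arr.npy")
    np.save(npy_path, arr)
    arr_back = np.load(npy_path)

    # Several named arrays in one npz archive
    npz_path = os.path.join(tmp, "many.npz")
    np.savez(npz_path, first=arr, second=other)
    with np.load(npz_path) as archive:
        names = sorted(archive.files)
        second_back = archive["second"]

    # One array as comma-separated text
    txt_path = os.path.join(tmp, "arr.txt")
    np.savetxt(txt_path, arr, delimiter=",")
    txt_back = np.loadtxt(txt_path, delimiter=",")

print(names, arr_back.shape, txt_back.shape)
```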


## 1.8. CSV reading

Python's standard library provides the csv module, whose reader() function can be used to load CSV files. Once loaded, the data can be converted to a NumPy array and used for machine learning.
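
As a minimal sketch of this approach, the example below reads a small in-memory CSV string instead of a real file. The column names and values are made up for illustration, and the conversion assumes every data field is numeric.

```python
import csv
import io

import numpy as np

# Hypothetical CSV content standing in for a file on disk.
raw = "sepal_length,sepal_width\n5.1,3.5\n4.9,3.0\n"

reader = csv.reader(io.StringIO(raw), delimiter=",")
header = next(reader)                                # first row: column names
rows = [[float(v) for v in row] for row in reader]   # remaining rows: numbers
data = np.array(rows)

print(header)
print(data.shape)
```

With a real file, `io.StringIO(raw)` would be replaced by a file object opened with `open(path, newline="")`.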



In [24]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [31]:
import os
os.listdir('/content/gdrive')

['.shortcut-targets-by-id', '.file-revisions-by-id', 'MyDrive', '.Trash-0']

In [44]:
# Load CSV using pandas
import pandas as pd
df = pd.read_csv("/content/gdrive/MyDrive/data.csv")
df.shape

(149, 5)

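You can also load CSV data with NumPy, using the numpy.loadtxt() or numpy.genfromtxt() functions. By default, these functions assume that there is no header row and that all data has the same format. pandas.read_csv(), by contrast, treats the first line as a header by default, so the two approaches can report row counts that differ by one.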



In [45]:
# Load CSV using NumPy
from numpy import genfromtxt
my_data = genfromtxt('/content/gdrive/MyDrive/data.csv', delimiter=',')
my_data.shape



(150, 5)
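
The effect of a header row on genfromtxt can be seen on a small in-memory example. The CSV content below is made up for illustration; `skip_header` is the genfromtxt parameter that skips lines at the start of the input.

```python
import io

import numpy as np

# Hypothetical CSV content with one header line.
raw = "a,b\n1.0,2.0\n3.0,4.0\n"

# Without skip_header, the header line is parsed as data and becomes a row of nan.
with_header = np.genfromtxt(io.StringIO(raw), delimiter=",")

# Skipping the first line leaves only the numeric rows.
no_header = np.genfromtxt(io.StringIO(raw), delimiter=",", skip_header=1)

print(with_header.shape, no_header.shape)
```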

## 1.9. Your turn

Try to answer each of the following questions with a small snippet of code.

1. How to reverse a vector (1d array)?
- new_array = array[::-1]
2. How to keep dimension consistency when slicing a matrix (2d array)?
- new_array = array[i:j,k:l]
3. How to create a (5,5) array with random values and find the extreme values?
- rand1 = np.random.randn(5,5)
  rand1.max(), rand1.min()
4. With the help of broadcasting, how to produce a matrix A where A[i,j] = 2i + j? (no for loop allowed)
5. A is a (4,4) int array. I want to change the last element of A to 1.5 without losing any information. How can I do it?
- A = A.astype(float)
  A[-1,-1] = 1.5
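
For question 4, one possible answer is sketched below. The matrix size (3 rows, 4 columns) is an arbitrary choice for illustration, since the question does not fix one. A column vector of row indices and a row vector of column indices are combined, and broadcasting expands both to the full matrix shape.

```python
import numpy as np

n_rows, n_cols = 3, 4                   # illustrative sizes

i = np.arange(n_rows).reshape(-1, 1)    # column vector of row indices, shape (3, 1)
j = np.arange(n_cols)                   # row vector of column indices, shape (4,)
A = 2 * i + j                           # broadcast to shape (3, 4)

print(A)
```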