# **<u>Important Libraries in Python</u>**

---

Table of contents:

1. [Modules](#Modules)
1. [Packages](#Packages)
1. [Libraries](#Libraries)
1. [ML Libraries](#Machine-Learning-Libraries)
    1. [NumPy](#NumPy)
    1. [Pandas](#Pandas)
    1. [Scikit Learn](#Scikit-Learn)

## Modules

A module in Python is a code file (with a .py extension) that defines functions and classes.

Modules exist to support code reuse. We do not need to copy and paste code from another file to use the classes and functions defined there.

The code from a module is **imported** by using the import keyword followed by the module name. This runs the module's code, and makes it available in your program.

```python
import modulename
```
This imports a specific module.

```python
from modulename import classname
```
This imports the specified class from the module.

We can also import every public name from a module at once using the wildcard syntax:
```python
from modulename import *
```

The difference between the two is that when you use only *import modulename*, you refer to a class within the module as *modulename.classname*. The prefix avoids ambiguity when multiple modules define the same name.

When we use `from modulename import *`, we import all the public names and can use each classname directly, without the modulename prefix. The downside is that names from different modules can silently overwrite each other.

Sometimes module names are long, and it is cumbersome to type them out every time we want to access a class defined in the module. In such cases, we can give the module a shorter alias in our program, like so:

```python
import modulename as modulealias

from modulename import classname as classalias
```

Thus, we can alias (provide another name for) whatever we are importing, be it a class or a module.
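
The import styles above can be tried on a real module. The following is a small sketch that uses the standard-library `math` module purely as an illustration; the alias names `m` and `root` are arbitrary choices.

```python
# Four ways to reach the same function in the standard-library math module.
import math                      # access via the module name
from math import sqrt            # import one name directly
import math as m                 # alias the module
from math import sqrt as root    # alias the imported name

print(math.sqrt(16))   # 4.0
print(sqrt(16))        # 4.0
print(m.sqrt(16))      # 4.0
print(root(16))        # 4.0

# All four names refer to the very same function object.
print(math.sqrt is sqrt is m.sqrt is root)   # True
```

Whichever style is used, the module's code is run only once; the different forms only change which names end up in your program.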

## Packages

A Python Package is a directory of Python modules. Using them, we can organize modules in a hierarchical fashion.

They are very similar to Modules.

Because of the organizational structure, to access subpackages/submodules, we have to add a dot between the names of the modules which are in the path.

Imagine the package as a tree. The modules are leaves. The path to the leaf has to be mentioned using the names of the subpackages/submodules in the way, separated by a '.'.

```python
import packagename.modulename

import packagename.subpackagename.modulename

from packagename.subpackagename import modulename
```
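
As a concrete illustration of dotted paths, the sketch below uses the standard-library `xml` package, which contains the subpackage `etree`, which in turn contains the module `ElementTree`. The XML string is made up for the example.

```python
import xml.etree.ElementTree                 # full dotted path
from xml.etree import ElementTree            # module imported from a subpackage
import xml.etree.ElementTree as ET           # same module under a short alias

doc = "<course><topic>NumPy</topic><topic>Pandas</topic></course>"
root = ET.fromstring(doc)
topics = [t.text for t in root.findall("topic")]
print(topics)   # ['NumPy', 'Pandas']

# The three imports all point at one module object.
print(xml.etree.ElementTree is ElementTree is ET)   # True
```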

## Libraries

Libraries do not have a specific meaning in Python. Modules and packages do the job of general libraries in other languages.

A Library generically means either a module or a package.

In Python, we have the [Python Standard Library](https://docs.python.org/3/library/).

The term 'standard library' refers to the collection of modules and packages that comes bundled with the core Python distribution. The syntax and semantics of the language itself are described separately, in the language reference.

In CPython, the standard library is written partly in C and partly in Python. It handles standard functionality such as file I/O, along with the other core modules that make Python what it is. The library reference documents roughly 200 such modules.

## Machine Learning Libraries

We will study NumPy and Pandas, and take a look at Scikit Learn.

Python has a ton of ML libraries. We will go over the basics, and see how one library is used by the other.

![The Hierarchy of Libraries](https://qph.fs.quoracdn.net/main-qimg-278de31eae1000b627dc5ce4b99b9e04-c)

### NumPy

[NumPy](http://www.numpy.org/) is the fundamental package for scientific computing with Python.

NumPy, short for Numerical Python, provides support for numeric operations including a host of inbuilt functions, and support for multi-dimensional arrays, which form the basis of most computing.

NumPy arrays are much faster to operate on and take up much less memory than the same data stored in Python lists. Virtually every ML library uses NumPy in the background.

[Advantages of NumPy](https://stackoverflow.com/questions/993984/what-are-the-advantages-of-numpy-over-regular-python-lists)

NumPy’s main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy, dimensions are called axes.

**A two-dimensional array corresponds to the mathematical matrix, and a one-dimensional array to a vector.**

NumPy’s array class is called ndarray. It is also known by the alias array. 

To import numpy into your computing environment, we have to use the import statement.

```python
import numpy
```

Since numpy is so commonly used, we generally use the alias **np** to refer to numpy.

In [1]:
import numpy as np

Thus, we have imported NumPy. Now, let's declare an array in numpy.

We can create an array from:
- A list: This generates a one dimensional array.
- A list of lists: This generates a higher dimensional array.

In [2]:
numpy_arr = np.array([1,2,3])

Thus, we have created an array with 3 elements. This is a one-dimensional array.

**To check the number of dimensions, we use the *ndim* attribute.**

```python
arr_name.ndim
```

In [3]:
numpy_arr.ndim

1

**To see the exact shape of the array, we use the *shape* attribute.**

```python
arr_name.shape
```

In [4]:
numpy_arr.shape

(3,)

Note that the np.array() function takes a list (or a list of lists for a multi-dimensional array) as its argument and converts it to a numpy ndarray.

In [5]:
type(numpy_arr)

numpy.ndarray

A list of lists creates a multidimensional array.

In [6]:
arr_2d = np.array([[1,2],[3,4]])

In [7]:
print(arr_2d)

[[1 2]
 [3 4]]


In [53]:
arr_2d.ndim

2

In [54]:
arr_2d.shape

(2, 2)

**We can see the total number of elements in an array using the *size* attribute.**

```python
arr_name.size
```

In [55]:
numpy_arr.size

3

In [56]:
arr_2d.size

4

**NumPy has its own datatypes as well.**
- numpy.int32
- numpy.int16
- numpy.float64
- and many more

The number refers to the number of bits the object of that type occupies in memory.

**To find out the datatype, we use the *dtype* attribute.**

```python
arr_name.dtype
```

In [15]:
numpy_arr.dtype

dtype('int32')
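
A short sketch of how the datatype can be chosen and inspected follows. The `int32` shown above depends on the platform and NumPy version, and many systems default to `int64`, so the example sets the dtype explicitly. The variable names are illustrative only.

```python
import numpy as np

small = np.array([1, 2, 3], dtype=np.int16)
print(small.dtype, small.itemsize)        # int16 2   (2 bytes = 16 bits per element)

as_float = small.astype(np.float64)       # astype returns a converted copy
print(as_float.dtype, as_float.itemsize)  # float64 8

# Arrays are homogeneous: mixing ints and floats upcasts everything to float.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)                        # float64
```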

**To create an array of equally spaced numbers, we can use the *arange* function.**

```python
np.arange(stop)
np.arange(start, stop)
np.arange(start, stop, step)
```

Its arguments are similar to those of the range function, and it returns a numpy.ndarray object.

In [16]:
arr = np.arange(5)

In [17]:
print(arr)

[0 1 2 3 4]


In [18]:
arr2 = np.arange(1,5)

In [19]:
print(arr2)

[1 2 3 4]


In [20]:
arr = np.arange(1,10,2)

In [21]:
print(arr)

[1 3 5 7 9]


Thus, the arange() function will create a 1 dimensional array.

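**To allocate an array with a specific shape, we can call the *ndarray* constructor. The values are uninitialized, so the array holds whatever happened to be in memory, as the output below shows. The *np.empty* function is the usual way to request the same thing.**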

```python
shape = (m,n)
np.ndarray(shape)
```

In [22]:
rand_arr = np.ndarray((5,5))

In [23]:
print(rand_arr)

[[2.22523004e-307 2.78145267e-307 1.86920871e-306 1.33509389e-306
  1.78019354e-306]
 [4.45061456e-308 1.78019354e-306 1.69119330e-306 5.56296645e-307
  3.22647423e-307]
 [4.45041594e-307 2.22518658e-306 1.29061142e-306 5.11798223e-307
  1.37961370e-306]
 [1.37960012e-306 4.22802739e-307 1.24611470e-306 8.34423493e-308
  1.11261027e-306]
 [1.29061821e-306 8.90103559e-307 1.24611470e-306 6.01347002e-154
  9.13612771e+242]]


**To generate an array of a specific shape filled with standard values, we can use functions like *ones*, *zeros* and *eye*.**

```python
shape = (m,n)
np.ones(shape)
```

In [24]:
ones = np.ones((3,3))
print(ones)

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]


Note that for functions like *ones* and *zeros*, the shape argument has to be a tuple containing the shape, and not the individual numbers passed as separate arguments.

```python
shape = (m,n)
np.zeros(shape)
```

In [25]:
zeros = np.zeros((4,4))

In [26]:
print(zeros)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


```python
np.eye(n)
```

In [57]:
identity_mat = np.eye(5)

In [58]:
print(identity_mat)

[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]


This generates an identity matrix of n x n size. The diagonal elements are 1 and the rest of the elements are 0 in an identity matrix.
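
Beyond *ones*, *zeros* and *eye*, a few related constructors are worth knowing. The sketch below assumes only standard NumPy functions (`np.full`, `np.empty`, and the `k` argument of `np.eye`); the variable names are made up for the example.

```python
import numpy as np

sevens = np.full((2, 3), 7.0)      # every element equals 7.0
scratch = np.empty((2, 3))         # uninitialized: contents are arbitrary until assigned
scratch[:] = 0.5                   # fill it before reading from it
upper = np.eye(3, k=1)             # ones on the first diagonal above the main one

print(sevens)
print(scratch)
print(upper)
```

`np.empty` only makes sense when every element will be overwritten anyway; otherwise `np.zeros` or `np.full` is the safer choice.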

**Special values are:**
- inf : Infinity, the floating point result of an overflow or of dividing a nonzero number by zero.
- nan : 'Not a Number', the result of an undefined operation such as 0/0. It is also commonly used to mark missing values.
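
As a rough illustration of where these special values come from, the sketch below divides by zero on purpose. `np.errstate` is used only to silence the warnings NumPy would otherwise print.

```python
import numpy as np

values = np.array([1.0, 0.0, -1.0])
with np.errstate(divide="ignore", invalid="ignore"):  # suppress the expected warnings
    result = values / 0.0                             # gives inf, nan, -inf in that order
print(result)

print(np.nan == np.nan)       # False: nan never compares equal, not even to itself
print(np.isnan(result))       # use isnan to test for nan instead of ==
print(np.isinf(result))
```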

In [237]:
np.nan

nan

In [238]:
np.inf

inf

**Another method is *linspace*, which is used to get a certain number of equally spaced points between 2 values (including them).**

```python
np.linspace(start_val, stop_val, num_points)
```

In [27]:
nums = np.linspace(5,10,101)

In [28]:
print(nums)

[ 5.    5.05  5.1   5.15  5.2   5.25  5.3   5.35  5.4   5.45  5.5   5.55
  5.6   5.65  5.7   5.75  5.8   5.85  5.9   5.95  6.    6.05  6.1   6.15
  6.2   6.25  6.3   6.35  6.4   6.45  6.5   6.55  6.6   6.65  6.7   6.75
  6.8   6.85  6.9   6.95  7.    7.05  7.1   7.15  7.2   7.25  7.3   7.35
  7.4   7.45  7.5   7.55  7.6   7.65  7.7   7.75  7.8   7.85  7.9   7.95
  8.    8.05  8.1   8.15  8.2   8.25  8.3   8.35  8.4   8.45  8.5   8.55
  8.6   8.65  8.7   8.75  8.8   8.85  8.9   8.95  9.    9.05  9.1   9.15
  9.2   9.25  9.3   9.35  9.4   9.45  9.5   9.55  9.6   9.65  9.7   9.75
  9.8   9.85  9.9   9.95 10.  ]


In [29]:
nums.shape

(101,)

When you print an array, NumPy displays it in a similar way to nested lists, but with the following layout:

- the last axis is printed from left to right,
- the second-to-last is printed from top to bottom,
- the rest are also printed from top to bottom, with each slice separated from the next by an empty line.

***reshape* allows you to change the shape of the array. The values are retained.**

```python
shape = (m,n)
arr_name.reshape(shape)
```

In [30]:
arr = np.linspace(2,10,100)

In [31]:
print(arr)

[ 2.          2.08080808  2.16161616  2.24242424  2.32323232  2.4040404
  2.48484848  2.56565657  2.64646465  2.72727273  2.80808081  2.88888889
  2.96969697  3.05050505  3.13131313  3.21212121  3.29292929  3.37373737
  3.45454545  3.53535354  3.61616162  3.6969697   3.77777778  3.85858586
  3.93939394  4.02020202  4.1010101   4.18181818  4.26262626  4.34343434
  4.42424242  4.50505051  4.58585859  4.66666667  4.74747475  4.82828283
  4.90909091  4.98989899  5.07070707  5.15151515  5.23232323  5.31313131
  5.39393939  5.47474747  5.55555556  5.63636364  5.71717172  5.7979798
  5.87878788  5.95959596  6.04040404  6.12121212  6.2020202   6.28282828
  6.36363636  6.44444444  6.52525253  6.60606061  6.68686869  6.76767677
  6.84848485  6.92929293  7.01010101  7.09090909  7.17171717  7.25252525
  7.33333333  7.41414141  7.49494949  7.57575758  7.65656566  7.73737374
  7.81818182  7.8989899   7.97979798  8.06060606  8.14141414  8.22222222
  8.3030303   8.38383838  8.46464646  8.54545455  8.6

In [32]:
arr_2 = arr.reshape((5,20))

In [33]:
arr_2.shape

(5, 20)

In [34]:
arr.shape

(100,)
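
Two details of *reshape* are easy to miss: one dimension can be given as -1 so that NumPy infers it, and the result usually shares memory with the original array. A minimal sketch with made-up variable names:

```python
import numpy as np

flat = np.arange(12)
grid = flat.reshape(3, -1)      # -1 lets NumPy infer the missing dimension (4 here)
print(grid.shape)               # (3, 4)

# For a contiguous array like this one, reshape returns a view on the same data.
grid[0, 0] = 99
print(flat[0])                  # 99
print(np.shares_memory(flat, grid))   # True

try:
    flat.reshape(5, 5)          # 12 elements cannot fill a 5 x 5 array
except ValueError as err:
    print("reshape failed:", err)
```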

***np.random* is a module that has the methods to generate arrays based on random sampling methods.**

***np.random.rand* generates values from the uniform distribution over \[0,1)**

```python
shape = (m,n,k)
dim0, dim1, dim2 = shape
np.random.rand(dim0, dim1, dim2)
```

Note that here, we do not pass the tuple as an argument, instead we pass the individual numbers as multiple arguments.

In [62]:
rand = np.random.rand(2,3,4)

In [64]:
print(rand)

[[[0.21389172 0.28836457 0.97282246 0.31234254]
  [0.72220704 0.89284686 0.67249623 0.15604941]
  [0.10928087 0.47522022 0.34300106 0.28820993]]

 [[0.63174513 0.07819502 0.70161858 0.2593489 ]
  [0.75803988 0.18968169 0.77926265 0.01831501]
  [0.94343116 0.04065033 0.76131074 0.01617402]]]


***np.random.randn* generates values from the standard normal distribution.**

```python
shape = (m,n,k)
dim0, dim1, dim2 = shape
np.random.randn(dim0, dim1, dim2)
```

In [65]:
randn = np.random.randn(2,3,4)

In [66]:
randn

array([[[ 1.01048567,  0.44505734, -1.15996116, -0.2168455 ],
        [ 0.03257972,  0.4874862 ,  0.08720311,  2.38803234],
        [ 0.80950197, -2.05061908, -0.49792933,  0.30357246]],

       [[-1.41650957, -0.14500954, -0.24966267, -0.27319855],
        [-0.72771877, -0.58726797, -0.76436127, -0.20526553],
        [ 0.60371148, -0.23603198,  0.84714246,  0.34324538]]])

***np.random.randint* generates random integers in range \[low, high).**

```python
np.random.randint(low, high, number_of_integers_to_generate)
```

In [67]:
randint = np.random.randint(3,10)

In [68]:
print(randint)

9


In [71]:
randints = np.random.randint(5,100,10)

In [72]:
print(randints)

[42 13 58 74 68 48 12 36 33 71]
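
The functions above draw different numbers on every run. For reproducible results, NumPy's newer `Generator` interface can be seeded. The sketch below is an alternative to the `np.random.rand`-style calls shown above, not what this notebook uses; the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)     # seeded so that reruns give the same numbers

uniform = rng.random((2, 3))             # uniform on [0, 1); shape is passed as a tuple
normal = rng.standard_normal((2, 3))     # standard normal distribution
ints = rng.integers(3, 10, size=5)       # integers in [3, 10)

print(uniform.shape, normal.shape, ints)
```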


***np.max* returns the maximum value from the array along a given axis.**

```python
np.max(array, axis)
```

***np.argmax* returns the index of the maximum value from the array along a given axis.**

```python
np.argmax(array, axis)
```

In [8]:
arr = np.linspace(1,10,10).reshape(2,5)

In [9]:
print(arr)

[[ 1.  2.  3.  4.  5.]
 [ 6.  7.  8.  9. 10.]]


In [10]:
print(np.max(arr, 0))
print(np.argmax(arr, 0))

[ 6.  7.  8.  9. 10.]
[1 1 1 1 1]


In [11]:
print(np.max(arr, 1))
print(np.argmax(arr,1))

[ 5. 10.]
[4 4]


In [12]:
print(np.max(arr))
print(np.argmax(arr))

10.0
9


If no axis argument is provided, it will return the biggest value in the entire array.
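
Without an axis, *argmax* returns a position in the flattened array, which is why the cell above prints 9 for a 2 x 5 array. One way to turn that flat position back into a row and column is `np.unravel_index`, sketched here on a small made-up array:

```python
import numpy as np

arr = np.array([[1., 7., 3.],
                [9., 2., 8.]])

print(np.max(arr, axis=0))      # [9. 7. 8.]  maximum of each column
print(np.max(arr, axis=1))      # [7. 9.]     maximum of each row

flat_idx = np.argmax(arr)       # 3: position in the flattened array
row, col = np.unravel_index(flat_idx, arr.shape)
print(row, col)                 # 1 0
print(arr[row, col])            # 9.0
```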

***np.min* returns the minimum value from the array along a given axis.**

```python
np.min(array, axis)
```

***np.argmin* returns the index of the minimum value from the array along a given axis.**

```python
np.argmin(array, axis)
```

In [92]:
print(arr)

[[ 1.  2.  3.  4.  5.]
 [ 6.  7.  8.  9. 10.]]


In [94]:
print(np.min(arr, 0))
print(np.argmin(arr, 0))

[1. 2. 3. 4. 5.]
[0 0 0 0 0]


In [95]:
print(np.min(arr, 1))
print(np.argmin(arr,1))

[1. 6.]
[0 0]


In [84]:
np.min(arr)

1.0

If no axis argument is provided, it will return the smallest value in the entire array.

**Arithmetic Operations**

In NumPy, the arithmetic operations are applied elementwise. Thus, the * operator computes the elementwise product, not the dot product or the matrix product.

In [13]:
arr1 = np.ones((3,3))

In [14]:
arr2 = np.linspace(3,11,9).reshape(3,3)

In [15]:
arr1

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [16]:
arr2

array([[ 3.,  4.,  5.],
       [ 6.,  7.,  8.],
       [ 9., 10., 11.]])

In [17]:
arr2+arr1

array([[ 4.,  5.,  6.],
       [ 7.,  8.,  9.],
       [10., 11., 12.]])

In [18]:
arr2-arr1

array([[ 2.,  3.,  4.],
       [ 5.,  6.,  7.],
       [ 8.,  9., 10.]])

In [19]:
arr2*arr1

array([[ 3.,  4.,  5.],
       [ 6.,  7.,  8.],
       [ 9., 10., 11.]])

In [20]:
arr2/arr1

array([[ 3.,  4.,  5.],
       [ 6.,  7.,  8.],
       [ 9., 10., 11.]])

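Exponentiation with ** is elementwise as well. To perform a matrix multiplication, we have to use the *matmul* function (or the equivalent @ operator), as in the second cell below.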

In [21]:
arr2**3

array([[  27.,   64.,  125.],
       [ 216.,  343.,  512.],
       [ 729., 1000., 1331.]])

In [43]:
np.matmul(arr1,arr2)

array([[18., 21., 24.],
       [18., 21., 24.],
       [18., 21., 24.]])

**Functions like np.exp and np.sqrt also apply the specified operation to each element of the array.**
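
Because `arr1` above is all ones, the elementwise product and the matrix product are hard to tell apart by eye. The following sketch uses two small made-up matrices where the difference is visible.

```python
import numpy as np

a = np.array([[1., 2.],
              [3., 4.]])
b = np.array([[0., 1.],
              [1., 0.]])

print(a * b)             # elementwise product: [[0. 2.] [3. 0.]]
print(np.matmul(a, b))   # matrix product:      [[2. 1.] [4. 3.]]
print(a @ b)             # the @ operator gives the same matrix product
print(np.sqrt(a))        # square root of each element
```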

#### Indexing

One-dimensional arrays can be indexed, sliced and iterated over, much like lists and other Python sequences.

In [44]:
arr_1d = np.arange(10)**2

In [45]:
arr_1d

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81], dtype=int32)

In [46]:
arr_1d[1:5]

array([ 1,  4,  9, 16], dtype=int32)

In [47]:
arr_1d[:3]

array([0, 1, 4], dtype=int32)

**Multidimensional indexing**

To index a multidimensional array, we separate the indices of each dimension by a ','

In [22]:
arr_2d = np.random.rand(3,4)

In [23]:
print(arr_2d)

[[0.6709797  0.594772   0.47989649 0.94055614]
 [0.47555966 0.19808921 0.07890616 0.05920779]
 [0.37578    0.42629958 0.66481959 0.1020993 ]]


In [24]:
arr_2d[0,0]

0.6709796968593599

In [25]:
arr_2d[1,2]

0.07890615946622781

To get a slice, we can use the ':' as well.

In [26]:
arr_2d[0,1:3]

array([0.594772  , 0.47989649])

In [27]:
arr_2d[1,1:]

array([0.19808921, 0.07890616, 0.05920779])

If we want all elements along a particular dimension, we use ':' on its own, with no number before or after it.

In [28]:
arr_2d[:,2].reshape(3,1)

array([[0.47989649],
       [0.07890616],
       [0.66481959]])

In [130]:
arr_2d[0,:]

array([0.6709797 , 0.594772  , 0.47989649, 0.94055614])

**Broadcasting** is supported in NumPy: a scalar assigned to a slice is stretched to fill every selected element.

In [96]:
arr_1d[:5] = 10

In [97]:
print(arr_1d)

[10 10 10 10 10 25 36 49 64 81]
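
Broadcasting is not limited to scalars. Arrays of different but compatible shapes are also stretched to match each other. A minimal sketch, with array names invented for the example:

```python
import numpy as np

table = np.zeros((2, 3))
row = np.array([1., 2., 3.])          # shape (3,)
col = np.array([[10.], [20.]])        # shape (2, 1)

print(table + row)      # row is repeated for each of the 2 rows
print(table + col)      # col is repeated across the 3 columns
print(row + col)        # (3,) and (2, 1) broadcast to (2, 3)
```

Shapes are compared from the last axis backwards, and two axes are compatible when they are equal or one of them is 1.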


To access all elements of a particular array, we can use **\[:]**

In [106]:
arr_1d[:] = 5

In [107]:
print(arr_1d)

[5 5 5 5 5 5 5 5 5 5]


**In NumPy, when we want to make a copy of an array, we use the copy method. Simply assigning one variable to another does not create a separate array; it only adds another name (alias) for the same array.**

In [108]:
my_arr = arr_1d

In [109]:
print(my_arr)
print(arr_1d)

[5 5 5 5 5 5 5 5 5 5]
[5 5 5 5 5 5 5 5 5 5]


In [110]:
my_arr[:] = 10

In [111]:
print(my_arr)
print(arr_1d)

[10 10 10 10 10 10 10 10 10 10]
[10 10 10 10 10 10 10 10 10 10]


In [113]:
my_arr = arr_1d.copy()

In [114]:
print(my_arr)
print(arr_1d)

[10 10 10 10 10 10 10 10 10 10]
[10 10 10 10 10 10 10 10 10 10]


In [115]:
my_arr[:] = 5

In [116]:
print(my_arr)
print(arr_1d)

[5 5 5 5 5 5 5 5 5 5]
[10 10 10 10 10 10 10 10 10 10]
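
The same caution applies to slices: basic slicing returns a view on the original data rather than a copy. The sketch below uses `np.shares_memory` to make that visible; the variable names are illustrative.

```python
import numpy as np

original = np.arange(5)
view = original[1:4]          # basic slicing gives a view, not a copy
view[:] = -1
print(original)               # [ 0 -1 -1 -1  4]

independent = original[1:4].copy()
independent[:] = 7
print(original)               # still [ 0 -1 -1 -1  4]: the copy is separate

print(np.shares_memory(original, view))          # True
print(np.shares_memory(original, independent))   # False
```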


**Selection**

Relational operators are applied elementwise too, with the scalar broadcast against every element.

In [132]:
arr = np.linspace(1,10,10)

In [133]:
print(arr)

[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]


In [134]:
print(arr>5)

[False False False False False  True  True  True  True  True]


This will return a boolean array, which can be used to selectively filter elements from an array.

In [135]:
arr[arr>5]

array([ 6.,  7.,  8.,  9., 10.])

In [136]:
arr[arr%2==0]

array([ 2.,  4.,  6.,  8., 10.])

Thus, we can have an expression inside the \[] that evaluates to a boolean array of the same shape as the original array.
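
Boolean arrays can be combined before they are used as a filter. The sketch below assumes the elementwise operators `&` and `|` (Python's `and`/`or` do not work on arrays) and also shows `np.where` for replacing values instead of dropping them.

```python
import numpy as np

arr = np.arange(1, 11)

both = arr[(arr > 3) & (arr % 2 == 0)]     # parentheses are required around each condition
either = arr[(arr < 3) | (arr > 8)]
print(both)      # [ 4  6  8 10]
print(either)    # [ 1  2  9 10]

clipped = np.where(arr > 5, 5, arr)        # replace values above 5 by 5, keep the rest
print(clipped)   # [1 2 3 4 5 5 5 5 5 5]
```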

Time for a few exercises:

- Generate the arrays shown below.

In [29]:
np.linspace(0.1,0.9,9)

array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

In [30]:
np.linspace(2,100,50).reshape(5,10)

array([[  2.,   4.,   6.,   8.,  10.,  12.,  14.,  16.,  18.,  20.],
       [ 22.,  24.,  26.,  28.,  30.,  32.,  34.,  36.,  38.,  40.],
       [ 42.,  44.,  46.,  48.,  50.,  52.,  54.,  56.,  58.,  60.],
       [ 62.,  64.,  66.,  68.,  70.,  72.,  74.,  76.,  78.,  80.],
       [ 82.,  84.,  86.,  88.,  90.,  92.,  94.,  96.,  98., 100.]])

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])

- Generate a 4 x 5 matrix of random integers in the range 3 to 15 (15 excluded).

In [147]:
np.random.randint(3,15,20).reshape(4,5)

array([[ 8,  6, 13, 13,  5],
       [ 9, 14,  3,  7,  6],
       [ 5, 12, 11,  4, 11],
       [ 8,  3,  4,  4, 10]])

### Pandas

Pandas is like Excel for Python: it works with labelled, tabular data, offers far more functionality, and uses NumPy arrays in the background.

The basic Pandas datatype is a **Series**. It stores the kind of data that an Excel column holds.

It is built on top of a NumPy array. What differentiates the NumPy array from a Series, is that a Series can have *labels, meaning it can be indexed by a label*, instead of just a number location. Think of it like a combination of a list and a dictionary. Here, there is an order to the sequence, but you can define arbitrary labels for each value as well.

A Series also doesn't need to hold numeric data; it can hold arbitrary Python objects.

In [31]:
import pandas as pd

We can convert a list, a NumPy array or a dictionary to a Series.

In [150]:
my_list = [3,4,10]

In [151]:
arr = np.array(my_list)

In [152]:
my_dict = {'a':1,'b':3,'c':5}

***pd.Series* is a class. Its constructor takes the data (a list or NumPy array) and the labels (a list) as arguments. If we pass a dictionary, we do not need to pass the labels, because the keys are used as labels.**

```python
pd.Series(data, index)
```

In [153]:
pd.Series(my_list)

0     3
1     4
2    10
dtype: int64

In [155]:
pd.Series(arr)

0     3
1     4
2    10
dtype: int32

In [157]:
ser_dict = pd.Series(my_dict)

In [163]:
print(ser_dict)

a    1
b    3
c    5
dtype: int64


In [164]:
ser_dict.iloc[0]  # by position; plain ser_dict[0] is deprecated on a label-indexed Series

1

In [165]:
ser_dict['a']

1

In [167]:
labels = ['a','c','e']

In [168]:
pd.Series(my_list, labels)

a     3
c     4
e    10
dtype: int64

In [169]:
pd.Series(arr, labels)

a     3
c     4
e    10
dtype: int32
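
The labels are more than decoration: arithmetic between two Series matches values by label, not by position. A small sketch with made-up data:

```python
import pandas as pd

prices = pd.Series([3, 4, 10], index=["a", "c", "e"])
print(prices.loc["c"])     # 4   (access by label)
print(prices.iloc[1])      # 4   (access by position)

other = pd.Series([1, 1, 1], index=["a", "b", "c"])
total = prices + other     # values are matched by label
print(total)               # labels present in only one Series become NaN
```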

Pandas primarily works with **DataFrames**. DataFrames are like Excel sheets. An entire data table is stored in a Pandas DataFrame.

It is a collection of Pandas Series objects, which share the same index.

In [32]:
arr = np.random.rand(4,5)

We can specify the index and column names ourselves; if we leave them out, Pandas numbers the rows and columns from 0.

In [33]:
df = pd.DataFrame(arr, index = 'A B C D'.split(), columns = 'V W X Y Z'.split())

In [34]:
df

|   | V | W | X | Y | Z |
|---|---|---|---|---|---|
| A | 0.601839 | 0.803947 | 0.577212 | 0.905881 | 0.109526 |
| B | 0.8001 | 0.111203 | 0.808298 | 0.374129 | 0.1852 |
| C | 0.872048 | 0.795725 | 0.086484 | 0.474502 | 0.960768 |
| D | 0.44085 | 0.848212 | 0.278503 | 0.875002 | 0.483598 |


**Selection and Indexing**

```python
dataframe_name[Column_name]
```

In [35]:
df['W']

A    0.803947
B    0.111203
C    0.795725
D    0.848212
Name: W, dtype: float64

Thus, we can get a column by using the column name.

We can also get multiple columns by passing a list of column names.

In [36]:
df[['W','Z','X']]

|   | W | Z | X |
|---|---|---|---|
| A | 0.803947 | 0.109526 | 0.577212 |
| B | 0.111203 | 0.1852 | 0.808298 |
| C | 0.795725 | 0.960768 | 0.086484 |
| D | 0.848212 | 0.483598 | 0.278503 |


**Each DataFrame column is a series!**

In [37]:
type(df['W'])

pandas.core.series.Series

**To select rows, we have to use the loc and iloc attributes.**

```python
dataframe.loc[index_name]
dataframe.iloc[index_number]
```

In [38]:
df.loc['A']

V    0.601839
W    0.803947
X    0.577212
Y    0.905881
Z    0.109526
Name: A, dtype: float64

In [39]:
df.iloc[0]

V    0.601839
W    0.803947
X    0.577212
Y    0.905881
Z    0.109526
Name: A, dtype: float64

Just like with NumPy arrays, we can select multiple rows and columns at once. But since we have index and column names, we pass lists of those names.

In [40]:
df.loc[['A','C'],['W','Y']]

|   | W | Y |
|---|---|---|
| A | 0.803947 | 0.905881 |
| C | 0.795725 | 0.474502 |


In [41]:
df.loc['A','W']

0.8039467805771514

The conditional selection is also similar to NumPy.

In [42]:
df[df>0.5]

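|   | V | W | X | Y | Z |
|---|---|---|---|---|---|
| A | 0.601839 | 0.803947 | 0.577212 | 0.905881 | NaN |
| B | 0.8001 | NaN | 0.808298 | NaN | NaN |
| C | 0.872048 | 0.795725 | NaN | NaN | 0.960768 |
| D | NaN | 0.848212 | NaN | 0.875002 | NaN |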


For multiple conditions, we have to use **& instead of and, | instead of or.**

In [43]:
df[df['W']>0.5]

|   | V | W | X | Y | Z |
|---|---|---|---|---|---|
| A | 0.601839 | 0.803947 | 0.577212 | 0.905881 | 0.109526 |
| C | 0.872048 | 0.795725 | 0.086484 | 0.474502 | 0.960768 |
| D | 0.44085 | 0.848212 | 0.278503 | 0.875002 | 0.483598 |


This translates to: *Give me the rows where the value for attribute W is greater than 0.5*

**Thus, we can pass a condition on a specific attribute(column) value, and return the records(rows) that satisfy the given attribute condition.**

This is based on the assumption that we always store tabular data, where rows are objects and columns are the respective attribute values.
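
Combining several conditions with `&` and `|` could look like the sketch below. The small dataframe and its values are invented for the example, and each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons.

```python
import pandas as pd

df = pd.DataFrame({"W": [0.8, 0.1, 0.7, 0.9],
                   "Y": [0.9, 0.4, 0.5, 0.2]},
                  index=["A", "B", "C", "D"])

both = df[(df["W"] > 0.5) & (df["Y"] > 0.45)]     # rows A and C
either = df[(df["W"] < 0.5) | (df["Y"] < 0.3)]    # rows B and D
print(both)
print(either)
```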

**To remove a row/column from the dataframe, we use the *drop* method.**

```python
dataframe.drop(labels, axis)
```

In [44]:
df.drop('A')

|   | V | W | X | Y | Z |
|---|---|---|---|---|---|
| B | 0.8001 | 0.111203 | 0.808298 | 0.374129 | 0.1852 |
| C | 0.872048 | 0.795725 | 0.086484 | 0.474502 | 0.960768 |
| D | 0.44085 | 0.848212 | 0.278503 | 0.875002 | 0.483598 |


In [45]:
df

|   | V | W | X | Y | Z |
|---|---|---|---|---|---|
| A | 0.601839 | 0.803947 | 0.577212 | 0.905881 | 0.109526 |
| B | 0.8001 | 0.111203 | 0.808298 | 0.374129 | 0.1852 |
| C | 0.872048 | 0.795725 | 0.086484 | 0.474502 | 0.960768 |
| D | 0.44085 | 0.848212 | 0.278503 | 0.875002 | 0.483598 |


By default, we remove rows. If we want to remove a specific column, we have to add the **axis=1** argument.

In [46]:
df.drop('V', axis=1, inplace=True)

Note that by default these methods do not change the dataframe; instead, we get back a copy of the dataframe with the specified operation performed on it.

To ensure that the change is reflected in the original dataframe, we add the argument *inplace*, set to the boolean value True.
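
The copy-versus-in-place behaviour can be checked on a tiny dataframe. In this sketch the data is made up, and reassigning the result is shown as a common alternative to `inplace=True`.

```python
import pandas as pd

df = pd.DataFrame({"V": [1, 2], "W": [3, 4]}, index=["A", "B"])

without_v = df.drop("V", axis=1)     # returns a new dataframe
print(list(df.columns))              # ['V', 'W']  (df itself is unchanged)
print(list(without_v.columns))       # ['W']

df = df.drop("A")                    # reassigning is an alternative to inplace=True
print(list(df.index))                # ['B']
```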

In [47]:
df.drop('A', inplace=True)

In [48]:
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| B | 0.111203 | 0.808298 | 0.374129 | 0.1852 |
| C | 0.795725 | 0.086484 | 0.474502 | 0.960768 |
| D | 0.848212 | 0.278503 | 0.875002 | 0.483598 |


**Adding a column/row in Pandas is very simple as well. We assign values to the new column/row label as if it already existed.**

In [49]:
df['V'] = np.random.rand(3)

In [50]:
df

|   | W | X | Y | Z | V |
|---|---|---|---|---|---|
| B | 0.111203 | 0.808298 | 0.374129 | 0.1852 | 0.290404 |
| C | 0.795725 | 0.086484 | 0.474502 | 0.960768 | 0.393994 |
| D | 0.848212 | 0.278503 | 0.875002 | 0.483598 | 0.787308 |


In [230]:
df.loc['A'] = np.random.rand(5)

In [231]:
df

|   | W | X | Y | Z | A |
|---|---|---|---|---|---|
| B | 0.207488 | 0.988543 | 0.686758 | 0.490803 | 0.55891 |
| C | 0.475167 | 0.886275 | 0.85624 | 0.578451 | 0.154393 |
| D | 0.067262 | 0.755163 | 0.728991 | 0.818036 | 0.478685 |
| A | 0.255476 | 0.487666 | 0.54906 | 0.774767 | 0.833657 |


**If we want to remove the index and get it as a data column, we can use the *reset_index* method.**

In [233]:
df.reset_index(inplace=True)

In [234]:
df

|   | index | W | X | Y | Z | A |
|---|---|---|---|---|---|---|
| 0 | B | 0.207488 | 0.988543 | 0.686758 | 0.490803 | 0.55891 |
| 1 | C | 0.475167 | 0.886275 | 0.85624 | 0.578451 | 0.154393 |
| 2 | D | 0.067262 | 0.755163 | 0.728991 | 0.818036 | 0.478685 |
| 3 | A | 0.255476 | 0.487666 | 0.54906 | 0.774767 | 0.833657 |


**If we want to make a column the index, we can use the *set_index* method.**

In [235]:
df.set_index('index', inplace=True)

In [236]:
df

| index | W | X | Y | Z | A |
|---|---|---|---|---|---|
| B | 0.207488 | 0.988543 | 0.686758 | 0.490803 | 0.55891 |
| C | 0.475167 | 0.886275 | 0.85624 | 0.578451 | 0.154393 |
| D | 0.067262 | 0.755163 | 0.728991 | 0.818036 | 0.478685 |
| A | 0.255476 | 0.487666 | 0.54906 | 0.774767 | 0.833657 |


In [239]:
df = pd.DataFrame({'A':[10,np.nan,0],
                  'C':[np.nan,2,20],
                  'E':[1,5,3]})

**To handle missing values (np.nan), Pandas provides built-in methods.**

**To remove records/attributes with missing values, we use *dropna*** 

```python
dataframe.dropna(axis)
```

In [241]:
df.dropna()

|   | A | C | E |
|---|---|---|---|
| 2 | 0.0 | 20.0 | 3 |


In [242]:
df.dropna(axis=1)

|   | E |
|---|---|
| 0 | 1 |
| 1 | 5 |
| 2 | 3 |


**To fill in missing values, we use *fillna*** 

```python
dataframe.fillna(value)
```

In [251]:
df.fillna(1000, inplace=True)

In [252]:
df

|   | A | C | E |
|---|---|---|---|
| 0 | 10.0 | 1000.0 | 1 |
| 1 | 1000.0 | 2.0 | 5 |
| 2 | 0.0 | 20.0 | 3 |
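
A constant like 1000 is rarely a sensible replacement in practice. One common alternative, sketched here on the same three-column data, is to fill each column with its own mean; `isna` is used first to count the gaps.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, np.nan, 0],
                   "C": [np.nan, 2, 20],
                   "E": [1, 5, 3]})

print(df.isna().sum())               # missing values per column: A 1, C 1, E 0
filled = df.fillna(df.mean())        # each NaN gets the mean of its own column
print(filled)
```

Here the missing value in A becomes 5.0 and the one in C becomes 11.0, while `df` itself is left untouched because `inplace` is not set.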


**We can apply a function to each row/column of a dataframe, by using the *apply* method.**

```python
df.apply(function, axis)
```

In [256]:
def func(x):
    print(x)
    print(type(x))
    return 0

In [257]:
df.apply(func)

0      10.0
1    1000.0
2       0.0
Name: A, dtype: float64
<class 'pandas.core.series.Series'>
0    1000.0
1       2.0
2      20.0
Name: C, dtype: float64
<class 'pandas.core.series.Series'>
0    1.0
1    5.0
2    3.0
Name: E, dtype: float64
<class 'pandas.core.series.Series'>


A    0
C    0
E    0
dtype: int64

In [258]:
df.apply(func, axis=1)

A      10.0
C    1000.0
E       1.0
Name: 0, dtype: float64
<class 'pandas.core.series.Series'>
A    1000.0
C       2.0
E       5.0
Name: 1, dtype: float64
<class 'pandas.core.series.Series'>
A     0.0
C    20.0
E     3.0
Name: 2, dtype: float64
<class 'pandas.core.series.Series'>


0    0
1    0
2    0
dtype: int64

#### Lambda Functions in Python

In Python, we can write temporary one line functions that have no name as well.

In [262]:
lambda x: 1 if x>2 else 0

<function __main__.<lambda>(x)>

Thus, we can provide this as an argument to the apply function as well.

In [267]:
df['A'].apply((lambda x: 1 if x>2 else 0))

0    1
1    1
2    0
Name: A, dtype: int64
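
For simple rules like this one, a vectorized comparison gives the same flags without calling a Python function per element. The sketch below compares the two on made-up data and also shows a lambda applied row by row.

```python
import pandas as pd

df = pd.DataFrame({"A": [10.0, 1000.0, 0.0], "E": [1, 5, 3]})

flag_apply = df["A"].apply(lambda x: 1 if x > 2 else 0)
flag_vector = (df["A"] > 2).astype(int)          # same flags, computed on the whole column
print(flag_apply.tolist() == flag_vector.tolist())   # True

row_sum = df.apply(lambda row: row["A"] + row["E"], axis=1)   # axis=1: one call per row
print(row_sum.tolist())                          # [11.0, 1005.0, 3.0]
```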

We can also get the counts of each value in a series.

In [259]:
df.value_counts()

AttributeError: 'DataFrame' object has no attribute 'value_counts'

In [261]:
df['C'].value_counts()

20.0      1
2.0       1
1000.0    1
Name: C, dtype: int64

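Thus, in the Pandas version used here, this is only applicable to Series and not to DataFrames. Pandas 1.1 and later also provide *DataFrame.value_counts*, which counts unique rows.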
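
Since every value in column C above occurs exactly once, the counts are not very telling. A sketch on a made-up Series with repeated values shows the method more clearly, including the `normalize` option for relative frequencies.

```python
import pandas as pd

grades = pd.Series(["a", "b", "a", "c", "a", "b"])

counts = grades.value_counts()                 # sorted by count, most frequent first
shares = grades.value_counts(normalize=True)   # relative frequencies instead of counts
print(counts)
print(shares)
```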

#### We can combine dataframes in 3 ways: Merging, Concatenating and Joining

In [268]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7]) 

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8, 9, 10, 11])

In [269]:
df1

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | A0 | B0 | C0 | D0 |
| 1 | A1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | D2 |
| 3 | A3 | B3 | C3 | D3 |


In [270]:
df2

|   | A | B | C | D |
|---|---|---|---|---|
| 4 | A4 | B4 | C4 | D4 |
| 5 | A5 | B5 | C5 | D5 |
| 6 | A6 | B6 | C6 | D6 |
| 7 | A7 | B7 | C7 | D7 |


In [271]:
df3

|   | A | B | C | D |
|---|---|---|---|---|
| 8 | A8 | B8 | C8 | D8 |
| 9 | A9 | B9 | C9 | D9 |
| 10 | A10 | B10 | C10 | D10 |
| 11 | A11 | B11 | C11 | D11 |


```python
pd.concat(list_of_dataframes, axis)
```

In [276]:
pd.concat([df1, df2], axis=1)

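|   | A | B | C | D | A | B | C | D |
|---|---|---|---|---|---|---|---|---|
| 0 | A0 | B0 | C0 | D0 | NaN | NaN | NaN | NaN |
| 1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN |
| 2 | A2 | B2 | C2 | D2 | NaN | NaN | NaN | NaN |
| 3 | A3 | B3 | C3 | D3 | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | A4 | B4 | C4 | D4 |
| 5 | NaN | NaN | NaN | NaN | A5 | B5 | C5 | D5 |
| 6 | NaN | NaN | NaN | NaN | A6 | B6 | C6 | D6 |
| 7 | NaN | NaN | NaN | NaN | A7 | B7 | C7 | D7 |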


In [277]:
pd.concat([df1,df2])

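|   | A | B | C | D |
|---|---|---|---|---|
| 0 | A0 | B0 | C0 | D0 |
| 1 | A1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | D2 |
| 3 | A3 | B3 | C3 | D3 |
| 4 | A4 | B4 | C4 | D4 |
| 5 | A5 | B5 | C5 | D5 |
| 6 | A6 | B6 | C6 | D6 |
| 7 | A7 | B7 | C7 | D7 |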


In [289]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
   
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})    

```python
pd.merge(left_df, right_df, how, on_which_col)
```

In [290]:
left

|   | A | B | key |
|---|---|---|---|
| 0 | A0 | B0 | K0 |
| 1 | A1 | B1 | K1 |
| 2 | A2 | B2 | K2 |
| 3 | A3 | B3 | K3 |


In [291]:
right

|   | C | D | key |
|---|---|---|---|
| 0 | C0 | D0 | K0 |
| 1 | C1 | D1 | K1 |
| 2 | C2 | D2 | K2 |
| 3 | C3 | D3 | K3 |


In [292]:
pd.merge(left, right, how = 'inner',on = 'key')

|   | A | B | key | C | D |
|---|---|---|---|---|---|
| 0 | A0 | B0 | K0 | C0 | D0 |
| 1 | A1 | B1 | K1 | C1 | D1 |
| 2 | A2 | B2 | K2 | C2 | D2 |
| 3 | A3 | B3 | K3 | C3 | D3 |
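
In the merge above every key appears in both tables, so the choice of `how` makes no visible difference. The sketch below uses made-up tables with partly different keys, and the `indicator=True` option to show where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K0", "K2", "K3"], "C": ["C0", "C2", "C3"]})

inner = pd.merge(left, right, how="inner", on="key")                  # keys in both tables
outer = pd.merge(left, right, how="outer", on="key", indicator=True)  # keys from either table
print(inner)
print(outer)   # the _merge column reads both, left_only or right_only
```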


```python
dataframe.join(dataframe_2, how)
```

In [51]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])

In [52]:
left

|   | A | B |
|---|---|---|
| K0 | A0 | B0 |
| K1 | A1 | B1 |
| K2 | A2 | B2 |


In [53]:
right

|   | C | D |
|---|---|---|
| K0 | C0 | D0 |
| K2 | C2 | D2 |
| K3 | C3 | D3 |


In [297]:
left.join(right)

|   | A | B | C | D |
|---|---|---|---|---|
| K0 | A0 | B0 | C0 | D0 |
| K1 | A1 | B1 | NaN | NaN |
| K2 | A2 | B2 | C2 | D2 |


Inner join removes any non-overlapping records. It only keeps the records for which the indexes match.

In [298]:
left.join(right, how='inner')

|   | A | B | C | D |
|---|---|---|---|---|
| K0 | A0 | B0 | C0 | D0 |
| K2 | A2 | B2 | C2 | D2 |


Outer join keeps any non-overlapping records, and fills in the missing values with NaNs.

In [299]:
left.join(right, how='outer')

|   | A | B | C | D |
|---|---|---|---|---|
| K0 | A0 | B0 | C0 | D0 |
| K1 | A1 | B1 | NaN | NaN |
| K2 | A2 | B2 | C2 | D2 |
| K3 | NaN | NaN | C3 | D3 |


Left join keeps all records from the left(calling) dataframe. It fills the missing values with NaNs.

In [300]:
left.join(right, how='left')

|   | A | B | C | D |
|---|---|---|---|---|
| K0 | A0 | B0 | C0 | D0 |
| K1 | A1 | B1 | NaN | NaN |
| K2 | A2 | B2 | C2 | D2 |


Right join keeps all records from the right(argument) dataframe. It fills the missing values with NaNs.

In [301]:
left.join(right, how='right')

|   | A | B | C | D |
|---|---|---|---|---|
| K0 | A0 | B0 | C0 | D0 |
| K2 | A2 | B2 | C2 | D2 |
| K3 | NaN | NaN | C3 | D3 |


**In short, Pandas allows you to perform database-style operations on tables, much like a relational database.**
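
Joining and merging are closely related: *join* matches on the index, while *merge* matches on columns unless told otherwise. As a rough check of that relationship, the sketch below reproduces an inner join with `pd.merge` by asking it to use both indexes (the data is a cut-down version of the tables above).

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
right = pd.DataFrame({"C": ["C0", "C2", "C3"]}, index=["K0", "K2", "K3"])

joined = left.join(right, how="inner")
merged = pd.merge(left, right, how="inner", left_index=True, right_index=True)

print(joined)
print(joined.equals(merged))   # True
```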

### Scikit Learn

Let us explore the use of Scikit Learn in an ML problem. This will also set the tone for our discussions on Machine Learning in the next few lectures.

In [54]:
from sklearn.datasets import load_iris

In [55]:
iris_data = load_iris()

**The Iris dataset is a well-known dataset that ships with the Scikit Learn library.
Let's explore it further.**

Let's look at its description.

In [9]:
print(iris_data['DESCR'])

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

- *Number of instances* is your **dataset size** (150 examples here). The training set is carved out of it later, when we split the data.

- *Number of attributes* is your **number of features** for each example.

- *Class* is your **target variable**.
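
We can check these three numbers directly on the loaded object. The sketch below reads the shape of the feature array and counts the examples per class with `np.bincount`; the variable names `n_samples`, `n_features` and `class_counts` are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_data = load_iris()

# Rows are examples (instances), columns are features (attributes)
n_samples, n_features = iris_data['data'].shape

# The target holds one integer class label per example
class_counts = np.bincount(iris_data['target'])

print(n_samples, n_features)
print(dict(zip(iris_data['target_names'], class_counts)))
```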

In [10]:
print(iris_data['feature_names'])

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


![Iris Image](https://cdn-images-1.medium.com/max/1600/0*7H_gF1KnslexnJ3s)

In [11]:
iris_dataset = pd.DataFrame(data=iris_data['data'], columns=iris_data['feature_names'])

We make a **dataframe object** out of the data which was in a NumPy array. Now, we can use Pandas functions to get further *insights* into the dataset.

In [12]:
iris_dataset.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


The .head() method returns the first few rows of the dataframe (five by default).

In [13]:
iris_dataset['target'] = iris_data['target']

Inserting a new column is very simple in pandas. We assign to a column name that does not exist yet, and pandas creates the column and stores the data in it.

In [14]:
iris_dataset.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [15]:
iris_data['target_names']

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

These are the names for the classes:

- 0 - Setosa
- 1 - Versicolor
- 2 - Virginica

In [16]:
iris_dataset['target_name'] = np.apply_along_axis(lambda x: iris_data['target_names'][x], 0, iris_data['target'])

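NumPy has an 'apply along axis' function, which applies a function to the 1-D slices of an array along a given axis. Here the target array is already 1-D, so the lambda receives the whole array at once and uses it to index into the array of target names.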
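
Since the lambda above simply indexes one array with another, the same column can be built in other ways. The sketch below compares the `apply_along_axis` call with plain NumPy integer-array indexing and with a pandas `map` over a dictionary; the names `via_apply`, `via_indexing` and `via_map` are only illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris_data = load_iris()
target = iris_data['target']        # integer labels 0, 1, 2
names = iris_data['target_names']   # array of class names

# The approach used above
via_apply = np.apply_along_axis(lambda x: names[x], 0, target)

# Indexing an array with an integer array picks one name per label
via_indexing = names[target]

# pandas alternative: map each label through a {label: name} dictionary
via_map = pd.Series(target).map(dict(enumerate(names)))

print((via_apply == via_indexing).all())
print((via_map.to_numpy() == via_indexing).all())
```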

In [17]:
iris_dataset.tail()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | target_name |
|---|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 2 | virginica |


*Why convert an array to a dataframe?*

Because now we can perform what is known as **Exploratory Data Analysis** using only a few lines of code, and we can use Pandas and Seaborn for what they're good at: summarizing and plotting tabular data.

In [18]:
iris_dataset.describe(include='all')

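| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | target_name |
|---|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 | 150 |
| unique | NaN | NaN | NaN | NaN | NaN | 3 |
| top | NaN | NaN | NaN | NaN | NaN | virginica |
| freq | NaN | NaN | NaN | NaN | NaN | 50 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 | 1.0 | NaN |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 | 0.819232 | NaN |
| min | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 | NaN |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 | NaN |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 | NaN |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 | NaN |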


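The describe() method provides summary statistics for each column in the dataframe. With *include='all'*, it also summarizes non-numeric columns such as target_name (the unique, top and freq rows). Thus, we can quickly understand our data distribution.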
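
A common next step in exploratory analysis is to look at the same statistics per class rather than over the whole dataset. The sketch below rebuilds the dataframe so that it runs on its own, then uses `groupby` to get the mean of every feature for each species. `per_class_mean` is only an illustrative name.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_data = load_iris()
iris_dataset = pd.DataFrame(data=iris_data['data'], columns=iris_data['feature_names'])
iris_dataset['target_name'] = iris_data['target_names'][iris_data['target']]

# Mean of each feature within each class
per_class_mean = iris_dataset.groupby('target_name').mean()
print(per_class_mean)

# Number of examples per class
print(iris_dataset['target_name'].value_counts())
```

If the classes differ clearly in a feature's mean (petal length is a good one to look at), that feature is likely to be useful for telling the classes apart.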

In [19]:
iris_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
sepal length (cm)    150 non-null float64
sepal width (cm)     150 non-null float64
petal length (cm)    150 non-null float64
petal width (cm)     150 non-null float64
target               150 non-null int32
target_name          150 non-null object
dtypes: float64(4), int32(1), object(1)
memory usage: 6.5+ KB


The info() method tells us the number of *non-null values* in each column, along with the *datatype* of each column.

In [20]:
iris_dataset.corr(numeric_only=True)  # skip the text column target_name (required from pandas 2.0 onwards)

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| sepal length (cm) | 1.0 | -0.109369 | 0.871754 | 0.817954 | 0.782561 |
| sepal width (cm) | -0.109369 | 1.0 | -0.420516 | -0.356544 | -0.419446 |
| petal length (cm) | 0.871754 | -0.420516 | 1.0 | 0.962757 | 0.949043 |
| petal width (cm) | 0.817954 | -0.356544 | 0.962757 | 1.0 | 0.956464 |
| target | 0.782561 | -0.419446 | 0.949043 | 0.956464 | 1.0 |


<a id="Step 2"></a>
### Step 2: Preprocess the Data

In this dataset, we see that there are no missing values, so we can skip the step of handling them. However, many algorithms, linear models among them, perform worse when the **features are not all on the same scale**. Hence, we use normalization/scaling to bring all variables to the same scale.

In [29]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

**MinMaxScaler** scales the values using the *minimum and maximum values* of each feature, to a *given range* provided by the user (by default, 0 to 1).

**StandardScaler** scales the data to have *zero mean and unit variance*.
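
To see the difference between the two scalers, here is a small sketch on a hypothetical single-feature array (the numbers 1 to 5 are made up for illustration). MinMaxScaler computes (x - min) / (max - min), while StandardScaler computes (x - mean) / std.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature, five examples (scalers expect a 2-D array: rows x features)
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(values)
standard = StandardScaler().fit_transform(values)

print(minmax.ravel())    # smallest value maps to 0, largest to 1
print(standard.ravel())  # centred on 0 with unit variance
```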

In [30]:
X = iris_dataset.drop(['target', 'target_name'], axis=1)

In [31]:
Y = iris_dataset['target']

In [32]:
scaler = StandardScaler()

In [33]:
X_sc = scaler.fit_transform(X)

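The *fit* method computes the statistics the scaler needs from the data (for StandardScaler, the mean and standard deviation of each feature) and stores them for future use. The *transform* method then applies the transformation to the given input data using those stored statistics.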

*fit_transform* performs both in one function call.
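
Keeping *fit* and *transform* separate matters when new data arrives after fitting. In the sketch below, `train` and `new` are small hypothetical arrays: the scaler learns its mean and scale from `train` only, and the same learned values are then reused to transform `new`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])  # hypothetical data available when fitting
new = np.array([[4.0]])                  # hypothetical data that arrives later

scaler = StandardScaler()
scaler.fit(train)                        # only learns mean_ and scale_
print(scaler.mean_, scaler.scale_)

train_sc = scaler.transform(train)       # same result as fit_transform(train)
new_sc = scaler.transform(new)           # uses the statistics learned from train
print(new_sc)
```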

In [34]:
print("Minimum:",X_sc.min())
print("Maximum:",X_sc.max())
print("Mean:",X_sc.mean())
print("Standard Deviation:",X_sc.std())

Minimum: -2.438987252491842
Maximum: 3.1146839106774347
Mean: -1.3263464400855204e-15
Standard Deviation: 1.0


In [35]:
from sklearn.model_selection import train_test_split

train_test_split randomly splits the data into training and test sets. The *test_size* argument sets the fraction of the data held out for testing.

In [36]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, shuffle=True)

All four features in this dataset are measured in centimetres and have similar ranges, so we do not need scaling in this example and apply train_test_split to the original data.
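
Two optional arguments are worth knowing about, shown in the sketch below as suggestions rather than as part of the original notebook: `random_state` makes the split reproducible, and `stratify=Y` keeps the class proportions the same in both sets. The sketch also shows the usual order of operations when scaling is needed: fit the scaler on the training set only, then transform both sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, Y = load_iris(return_X_y=True)

# random_state=42 is an arbitrary choice; any fixed integer gives a repeatable split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, shuffle=True, stratify=Y, random_state=42)

print(X_train.shape, X_test.shape)
print(np.bincount(Y_train), np.bincount(Y_test))

# If scaling is needed: learn the statistics from the training set only
scaler = StandardScaler().fit(X_train)
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)
```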

<a id="Step 3"></a>
### Step 3: Use an ML Algorithm to predict the class

In [37]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

For the supervised ML algorithms in Scikit Learn, we follow the same pattern, built around the **fit and score** methods.

We always first **create an object** of the class of the algorithm, and provide parameters to it during the creation of the object.

Next, we call the *.fit()* method to train the algorithm on the data we provide as arguments to this function.

Finally, we can call *.score()* to evaluate the trained model on a dataset. For classifiers, the score is the mean accuracy.
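
To make it concrete what the score of a classifier means, the sketch below computes the accuracy by hand from the output of *.predict()* and compares it with *.score()*. The `max_iter=1000` and `random_state=0` values are assumptions chosen for this sketch, not settings from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)

# max_iter raised from the default to give the solver room to converge
model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)

# predict returns one class label per test example
Y_pred = model.predict(X_test)

# Accuracy = fraction of predictions that match the true labels
manual_accuracy = (Y_pred == Y_test).mean()

print(manual_accuracy)
print(accuracy_score(Y_test, Y_pred))
print(model.score(X_test, Y_test))
```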

In [38]:
lr = LogisticRegression()

lr.fit(X_train, Y_train)

print(lr.score(X_train, Y_train))

print(lr.score(X_test, Y_test))

0.9464285714285714
0.9210526315789473


In [39]:
from sklearn.model_selection import cross_val_score

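**cross_val_score** runs cross-validation on the given model. With *cv=5*, it trains the model 5 times, each time holding out a different fifth of the data for validation, and returns an array with the score on each held-out fold.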

In [40]:
cv = cross_val_score(lr, X, Y, cv=5)
print(cv)
print(cv.mean())

[1.         0.96666667 0.93333333 0.9        1.        ]
0.9600000000000002
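
To show what happens inside cross_val_score, here is a sketch that writes the loop out by hand. It assumes the behaviour documented for classifiers with an integer `cv`, namely that the folds come from `StratifiedKFold` without shuffling, and it uses `max_iter=1000` as an arbitrary setting for this sketch.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, Y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in folds.split(X, Y):
    fold_model = clone(model)                    # fresh, untrained copy for each fold
    fold_model.fit(X[train_idx], Y[train_idx])   # train on four fifths of the data
    manual_scores.append(fold_model.score(X[val_idx], Y[val_idx]))  # validate on the rest
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X, Y, cv=5)

print(manual_scores)
print(library_scores)
```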


In [41]:
dtc = DecisionTreeClassifier()

dtc.fit(X_train, Y_train)

print(dtc.score(X_train, Y_train))

print(dtc.score(X_test, Y_test))

1.0
0.9473684210526315


In [42]:
cv = cross_val_score(dtc, X, Y, cv=5)
print(cv)
print(cv.mean())

[0.96666667 0.96666667 0.9        0.96666667 1.        ]
0.9600000000000002


In [43]:
mlp = MLPClassifier(hidden_layer_sizes=(10,10), max_iter=3000)

mlp.fit(X_train, Y_train)

print(mlp.score(X_train, Y_train))

print(mlp.score(X_test, Y_test))

0.33035714285714285
0.34210526315789475
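
An accuracy of about 0.33 on three equally sized classes is what a model gets by predicting the same class for every example, so this particular training run most likely did not learn anything useful. Neural networks are sensitive to their random starting weights and to the scale of the input features. The sketch below is one possible remedy, not the notebook's method: it puts a StandardScaler in front of the same MLP using `make_pipeline`, and fixes `random_state` (an arbitrary choice here) so that the run is repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, stratify=Y, random_state=0)

# The pipeline fits the scaler on the training data, then trains the MLP on the scaled data
mlp_pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=3000, random_state=0),
)
mlp_pipeline.fit(X_train, Y_train)

train_acc = mlp_pipeline.score(X_train, Y_train)
test_acc = mlp_pipeline.score(X_test, Y_test)
print(train_acc)
print(test_acc)
```

Because the scaler lives inside the pipeline, calling *.score()* or *.predict()* on new data applies the same scaling automatically.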


In [44]:
cv = cross_val_score(mlp, X, Y, cv=5)
print(cv)
print(cv.mean())

[1.         0.96666667 0.93333333 0.93333333 1.        ]
0.9666666666666668


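All the above algorithms have their individual hyperparameters that need tuning. Hyperparameters are settings of the algorithm that we have to choose before training (for example, *hidden_layer_sizes* and *max_iter* for the MLP), as opposed to the parameters that the algorithm learns from the data.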
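
One common way to tune hyperparameters in Scikit Learn is a grid search with cross-validation. The sketch below tries a small hypothetical grid for the decision tree; the values in `param_grid` are arbitrary examples, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, Y = load_iris(return_X_y=True)

# Candidate values for two hyperparameters of the decision tree (illustrative only)
param_grid = {
    'max_depth': [2, 3, 4, None],
    'min_samples_leaf': [1, 5],
}

# Every combination in the grid is evaluated with 5-fold cross-validation
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, Y)

print(search.best_params_)  # the combination with the best mean validation score
print(search.best_score_)   # that mean validation score
```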

Thus, we can observe how Pandas dataframes are directly used as inputs in Scikit Learn.

The next lectures will be focused on understanding the theory behind Machine Learning.
# Thank you!