# Chapter 2 - The Python Machine Learning Ecosystem

## Numpy ndarray

+ All of the numeric functionality of numpy is orchestrated by two important constituents of the numpy package: ndarray and Ufuncs (universal functions).
+ Numpy ndarray is a multi-dimensional array object which is the core data container for all of the numpy operations.
+ Arrays (or matrices) are one of the fundamental representations of data. One important thing to keep in mind is that all the elements in an array must have the same data type.

In [1]:
import numpy as np
arr = np.array([1,3,4,5,6])
arr

array([1, 3, 4, 5, 6])

In [2]:
#The shape attribute of the array object will tell us about the dimensions of the array
arr.shape

(5,)

In [3]:
arr.dtype

dtype('int32')
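
+ Because every element must share one data type, numpy upcasts mixed inputs to a common type. The sketch below illustrates this with made-up values; the variable names are illustrative, and the integer type chosen for `ints` can differ by platform and numpy version.

```python
import numpy as np

# Integers only: numpy picks a platform-dependent integer type.
ints = np.array([1, 3, 4])

# A single float in the input forces every element to float64.
mixed = np.array([1, 3.5, 4])

# An explicit dtype overrides the inference.
as_float32 = np.array([1, 3, 4], dtype=np.float32)

print(ints.dtype, mixed.dtype, as_float32.dtype)
```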

# Creating Arrays

In [6]:
arr = np.array([[1,2,3],[2,4,6],[8,8,8]])
arr.shape

(3, 3)

In [7]:
arr

array([[1, 2, 3],
       [2, 4, 6],
       [8, 8, 8]])

In [8]:
#np.zeros: Creates a matrix of specified dimensions containing only zeroes
arr = np.zeros((2,4))
arr

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [9]:
#np.ones: Creates a matrix of specified dimension containing only ones
arr = np.ones((2,4))
arr

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [10]:
#np.identity: Creates an identity matrix of specified dimensions
arr = np.identity(3)
arr

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [11]:
#numpy.random package: to initialize an array of a specified dimension with random values
arr = np.random.randn(3,4)
arr

array([[ 7.90397743e-01,  1.25106488e+00, -2.28659486e-01,
         9.27087644e-01],
       [ 1.07779316e+00, -1.94577056e+00,  2.18356632e-01,
        -1.29940491e+00],
       [-1.84022587e+00, -6.93432288e-05,  2.29177733e-01,
         1.26063585e+00]])

# Accessing Array Elements

## Basic Indexing and Slicing

+ An ndarray supports the basic indexing operations of the list class, i.e. arr[obj].
+ If obj is an integer, a slice, or a tuple of integers and slices, then the indexing is said to be basic indexing.
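
+ A minimal sketch of the basic indexing forms, using a small illustrative array named `grid` (not part of the original examples):

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)

row = grid[1]            # list-style index: the second row
chained = grid[1][2]     # index the row, then the element
direct = grid[1, 2]      # tuple index: the same element in one step
block = grid[0:2, 1:3]   # slices on both axes give a 2x2 sub-array

print(row, chained, direct)
print(block)
```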

In [15]:
#to access the complete second row of the array in the earlier example
arr[1]

array([ 1.07779316, -1.94577056,  0.21835663, -1.29940491])

In [16]:
arr = np.arange(12).reshape(2,2,3)
arr

array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]]])

In [17]:
arr[0]

array([[0, 1, 2],
       [3, 4, 5]])

In [18]:
#slicing
arr = np.arange(10)
arr[5:]

array([5, 6, 7, 8, 9])

In [19]:
arr[5:8]

array([5, 6, 7])

In [20]:
arr[:-5]

array([0, 1, 2, 3, 4])

In [21]:
arr = np.arange(12).reshape(2,2,3)
arr

array([[[ 0,  1,  2],
        [ 3,  4,  5]],

       [[ 6,  7,  8],
        [ 9, 10, 11]]])

In [22]:
arr[1:2]

array([[[ 6,  7,  8],
        [ 9, 10, 11]]])

In [23]:
#Suppose in a three-dimensional array we want to access the value of only one column
arr = np.arange(27).reshape(3,3,3)
arr

array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])

In [24]:
#one way
arr[:,:,2]

array([[ 2,  5,  8],
       [11, 14, 17],
       [20, 23, 26]])

In [25]:
#alternate way
arr[...,2]

array([[ 2,  5,  8],
       [11, 14, 17],
       [20, 23, 26]])

## Advanced Indexing

+ The difference between advanced indexing and basic indexing comes from the type of object being used to reference the array.
+ If the object is an ndarray object (data type int or bool) or a non-tuple sequence object or a tuple object containing an ndarray (data type integer or bool), then the indexing being done on the array is said to be advanced indexing.
+ Advanced indexing always returns a copy of the original array data.
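
+ The copy behaviour can be checked directly. The sketch below contrasts it with basic slicing, which in numpy returns a view onto the same memory (a numpy fact not stated above); the names `base`, `view` and `copy` are illustrative.

```python
import numpy as np

base = np.arange(6)

view = base[1:4]          # basic slicing: shares memory with base
copy = base[[1, 2, 3]]    # advanced (integer-array) indexing: new memory

view[0] = 100             # writes through to base
copy[1] = -1              # leaves base untouched

print(base)
print(np.shares_memory(base, view), np.shares_memory(base, copy))
```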

In [26]:
#Integer array indexing
arr = np.arange(9).reshape(3,3)
arr

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [27]:
#the first part identifies the rows we want to access and the second identifies the columns which we want to address
arr[[0,1,2],[1,0,0]]

array([1, 3, 6])

In [28]:
#Boolean indexing
cities = np.array(["delhi","bangalore","mumbai","chennai","bhopal"])
city_data = np.random.randn(5,3)
city_data

array([[-0.4091209 , -0.02898378, -0.93563566],
       [-0.40796099, -1.23968273,  1.36589037],
       [ 0.20590936, -1.57393708, -0.83866211],
       [-1.50095167, -0.79036848,  0.59324417],
       [-0.30851166, -0.97639495, -0.00384455]])

In [29]:
city_data[cities =="delhi"]

array([[-0.4091209 , -0.02898378, -0.93563566]])

In [30]:
city_data[city_data >0]

array([1.36589037, 0.20590936, 0.59324417])

In [31]:
#substitute all the positive values with 0
city_data[city_data >0] = 0
city_data

array([[-0.4091209 , -0.02898378, -0.93563566],
       [-0.40796099, -1.23968273,  0.        ],
       [ 0.        , -1.57393708, -0.83866211],
       [-1.50095167, -0.79036848,  0.        ],
       [-0.30851166, -0.97639495, -0.00384455]])
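
+ Assigning through a boolean mask changes the array in place. If the original values should be kept, `np.where` builds a new array instead. A small sketch with made-up numbers:

```python
import numpy as np

data = np.array([[-0.4, 1.3], [0.2, -1.5]])

# The mask has the same shape as data and is True where the condition holds.
mask = data > 0

# Take 0 where the mask is True and the original value elsewhere.
clipped = np.where(mask, 0, data)

print(mask)
print(clipped)
```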

# Operations on Array

+ Universal functions are functions that operate on arrays in an element by element fashion. 
+ The implementation of Ufunc is vectorized, which means that the execution of Ufuncs on arrays is quite fast. 
+ The Ufuncs implemented in the numpy package are implemented in compiled C code for speed and efficiency.
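
+ As a rough illustration of "element by element", the sketch below applies `np.sqrt` to a small made-up array and compares it with an explicit Python loop. It only shows that the results agree; it does not measure speed.

```python
import math
import numpy as np

values = np.array([1.0, 4.0, 9.0, 16.0])

# Explicit loop: one Python-level call per element.
looped = np.array([math.sqrt(v) for v in values])

# Ufunc: a single call that runs over the whole array.
vectorized = np.sqrt(values)

# Arithmetic operators dispatch to ufuncs too: values + 10 calls np.add.
shifted = np.add(values, 10)

print(vectorized, shifted)
```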

In [33]:
arr = np.arange(15).reshape(3,5)
arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [34]:
arr + 5

array([[ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [35]:
arr * 2

array([[ 0,  2,  4,  6,  8],
       [10, 12, 14, 16, 18],
       [20, 22, 24, 26, 28]])

In [36]:
#the concept of broadcasting: adding two arrays of different sizes
arr1 = np.arange(15).reshape(5,3)
arr2 = np.arange(5).reshape(5,1)
arr2 + arr1

array([[ 0,  1,  2],
       [ 4,  5,  6],
       [ 8,  9, 10],
       [12, 13, 14],
       [16, 17, 18]])

In [37]:
arr1

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [38]:
arr2

array([[0],
       [1],
       [2],
       [3],
       [4]])
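
+ A sketch of what broadcasting does with the two arrays above: the size-1 axis of `arr2` is stretched to match `arr1`. The explicit `np.repeat` version is only for illustration; broadcasting itself does not make this copy.

```python
import numpy as np

arr1 = np.arange(15).reshape(5, 3)
arr2 = np.arange(5).reshape(5, 1)

# Shapes (5, 3) and (5, 1): the size-1 axis of arr2 is stretched to 3.
result_shape = np.broadcast(arr1, arr2).shape

# Making the stretch explicit gives the same sum.
stretched = np.repeat(arr2, 3, axis=1)
same = np.array_equal(arr1 + arr2, arr1 + stretched)

# Shapes (5, 3) and (5,) do not line up on the trailing axis (3 vs 5).
try:
    arr1 + np.arange(5)
    compatible = True
except ValueError:
    compatible = False

print(result_shape, same, compatible)
```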

In [39]:
arr1 = np.random.randn(5,3)
arr1

array([[ 1.37607606, -0.15050098, -1.06720174],
       [ 0.53506418, -0.09234757,  0.63692703],
       [ 0.26497111,  1.76897339,  0.61299676],
       [-0.07503773,  0.32069347, -1.2179581 ],
       [ 0.56539428,  1.84605577,  1.23502509]])

In [40]:
#The function modf will return the fractional and the integer part of the input supplied to it. Hence it will return two arrays of the same size.
np.modf(arr1)

(array([[ 0.37607606, -0.15050098, -0.06720174],
        [ 0.53506418, -0.09234757,  0.63692703],
        [ 0.26497111,  0.76897339,  0.61299676],
        [-0.07503773,  0.32069347, -0.2179581 ],
        [ 0.56539428,  0.84605577,  0.23502509]]), array([[ 1., -0., -1.],
        [ 0., -0.,  0.],
        [ 0.,  1.,  0.],
        [-0.,  0., -1.],
        [ 0.,  1.,  1.]]))

# Linear Algebra using Numpy

+ One of the most widely used operations in linear algebra is the dot product.

In [41]:
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
B = np.array([[9,8,7],[6,5,4],[1,2,3]])

In [42]:
A.dot(B)

array([[ 24,  24,  24],
       [ 72,  69,  66],
       [120, 114, 108]])
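
+ To see where an entry of the result comes from, the sketch below computes the top-left value by hand and compares the `@` operator with `dot`. Note that `A * B` is the element-wise product, not the matrix product.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
B = np.array([[9, 8, 7], [6, 5, 4], [1, 2, 3]])

# First row of A times first column of B: 1*9 + 2*6 + 3*1
top_left = sum(A[0, k] * B[k, 0] for k in range(3))

product = A @ B        # same result as A.dot(B)
elementwise = A * B    # multiplies matching positions only

print(top_left, product[0, 0], elementwise[0, 0])
```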

In [43]:
#T attribute for the transpose of a matrix
A = np.arange(15).reshape(3,5)
A.T

array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])

In [44]:
#A popular matrix factorization method is SVD, which decomposes a matrix into three different matrices.
#This can be done using the linalg.svd function
np.linalg.svd(A)

(array([[-0.15425367,  0.89974393,  0.40824829],
        [-0.50248417,  0.28432901, -0.81649658],
        [-0.85071468, -0.3310859 ,  0.40824829]]),
 array([31.74202651,  2.72832424,  0.        ]),
 array([[-0.34716018, -0.39465093, -0.44214167, -0.48963242, -0.53712316],
        [-0.69244481, -0.37980343, -0.06716206,  0.24547932,  0.55812069],
        [-0.3545375 , -0.04008557,  0.87009952, -0.20179231, -0.27368413],
        [-0.36504752,  0.35761581, -0.14090063,  0.6691439 , -0.52081157],
        [-0.37555754,  0.7553172 , -0.15190078, -0.45991989,  0.232061  ]]))
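
+ One way to check the decomposition is to multiply the three parts back together. The sketch below uses `full_matrices=False` (a choice made here, not in the cell above) so that the shapes line up for a direct product.

```python
import numpy as np

A = np.arange(15).reshape(3, 5)

# Reduced SVD: U is (3, 3), s holds 3 singular values, Vt is (3, 5).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# s is returned as a 1-D array, so it is turned into a diagonal matrix first.
rebuilt = U @ np.diag(s) @ Vt

print(U.shape, s.shape, Vt.shape)
print(np.allclose(rebuilt, A))
```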

In [45]:
#Solve the system of equations 7x + 5y - 3z = 16, 3x - 5y + 2z = -8, 5x + 3y - 7z = 0
a = np.array([[7,5,-3], [3,-5,2],[5,3,-7]])
b = np.array([16,-8,0])
x = np.linalg.solve(a, b)
x

array([1., 3., 2.])

In [46]:
#check if the solution is correct using the np.allclose function
np.allclose(np.dot(a, x), b)

True

# Data Structure of Pandas

+ Series and Dataframes are the data structures used for all data representation in pandas.
+ A Series in pandas is a one-dimensional ndarray with an axis label. In functionality, it is therefore very similar to a simple array.
+ Series objects can also be used to represent time series data. In this case, the index is a datetime object.
+ The Dataframe is the most important and useful data structure, and is used for almost all kinds of data representation and manipulation in pandas.
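
+ A small sketch of these structures. All values and labels below are made up for illustration.

```python
import pandas as pd

# A Series is a 1-D array plus an index of labels.
temps = pd.Series([21.5, 23.0, 19.8], index=["delhi", "mumbai", "bhopal"])

# With a datetime index the same structure represents a time series.
dates = pd.date_range("2020-01-01", periods=3, freq="D")
daily = pd.Series([10, 12, 9], index=dates)

# A DataFrame is a collection of Series that share one index.
frame = pd.DataFrame({"temp": temps,
                      "humidity": pd.Series([40, 70, 55], index=temps.index)})

print(temps["mumbai"])
print(daily.loc[pd.Timestamp("2020-01-02")])
print(frame.shape)
```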


## Data Retrieval

+ Pandas provides numerous ways to retrieve and read in data. We can convert data from CSV files, databases, flat files, and so on into dataframes. 
+ We can also convert a list of dictionaries (Python dict) into a dataframe.

In [47]:
#List of Dictionaries to Dataframe
import pandas as pd
d = [{'city':'Delhi',"data":1000},
     {'city':'Bangalore',"data":2000},
     {'city':'Mumbai',"data":1000}]
pd.DataFrame(d)

Unnamed: 0,city,data
0,Delhi,1000
1,Bangalore,2000
2,Mumbai,1000


In [48]:
df = pd.DataFrame(d)
df

Unnamed: 0,city,data
0,Delhi,1000
1,Bangalore,2000
2,Mumbai,1000


+ We provided a list of Python dictionaries to the DataFrame class of the pandas library and the list was converted into a DataFrame above.
+ Two important things to note here: first, the keys of the dictionaries are picked up as the column names in the dataframe.
+ Secondly, we didn’t supply an index and hence it picked up the default integer index (0, 1, 2, ...), as in normal arrays.

+ CSV Files to Dataframe: CSV (comma-separated values) files are perhaps one of the most widely used sources for creating a dataframe.
+ Databases to Dataframe: Relational databases (DBs) and data warehouses are the de facto standard of data storage in almost all organizations.
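
+ Both routes can be tried without any external files. The sketch below reads CSV text from memory and queries an in-memory SQLite database; the table name `cities` and its contents are made up for illustration.

```python
import io
import sqlite3
import pandas as pd

# CSV: read_csv accepts a path or any file-like object.
csv_text = "city,data\nDelhi,1000\nBangalore,2000\nMumbai,1000\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# Database: build a throwaway SQLite table, then query it into a dataframe.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city TEXT, data INTEGER)")
conn.executemany("INSERT INTO cities VALUES (?, ?)",
                 [("Delhi", 1000), ("Bangalore", 2000), ("Mumbai", 1000)])
from_db = pd.read_sql_query("SELECT city, data FROM cities WHERE data > 1000", conn)
conn.close()

print(from_csv)
print(from_db)
```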

In [52]:
#dataset link: https://simplemaps.com/data/world-cities
import pandas as pd
city_data = pd.read_csv('worldcities.csv')
#rows from top
city_data.head(n=10)

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
0,Malishevë,Malisheve,42.4822,20.7458,Kosovo,XK,XKS,Malishevë,admin,,1901597212
1,Prizren,Prizren,42.2139,20.7397,Kosovo,XK,XKS,Prizren,admin,,1901360309
2,Zubin Potok,Zubin Potok,42.9144,20.6897,Kosovo,XK,XKS,Zubin Potok,admin,,1901608808
3,Kamenicë,Kamenice,42.5781,21.5803,Kosovo,XK,XKS,Kamenicë,admin,,1901851592
4,Viti,Viti,42.3214,21.3583,Kosovo,XK,XKS,Viti,admin,,1901328795
5,Shtërpcë,Shterpce,42.2394,21.0272,Kosovo,XK,XKS,Shtërpcë,admin,,1901828239
6,Shtime,Shtime,42.4331,21.0397,Kosovo,XK,XKS,Shtime,admin,,1901598505
7,Vushtrri,Vushtrri,42.8231,20.9675,Kosovo,XK,XKS,Vushtrri,admin,,1901107642
8,Dragash,Dragash,42.0265,20.6533,Kosovo,XK,XKS,Dragash,admin,,1901112530
9,Podujevë,Podujeve,42.9111,21.1899,Kosovo,XK,XKS,Podujevë,admin,,1901550082


In [53]:
#last rows from the bottom
city_data.tail()

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
12954,Sturgis,Sturgis,44.4111,-103.4975,United States,US,USA,South Dakota,,6862.0,1840002174
12955,Tea,Tea,43.4515,-96.8346,United States,US,USA,South Dakota,,5415.0,1840002841
12956,Brandon,Brandon,43.5928,-96.5799,United States,US,USA,South Dakota,,9263.0,1840002650
12957,Madison,Madison,44.0062,-97.1084,United States,US,USA,South Dakota,,6983.0,1840002540
12958,Belle Fourche,Belle Fourche,44.664,-103.8564,United States,US,USA,South Dakota,,5202.0,1840002127


In [54]:
#slicing and dicing
series_es = city_data.lat
type(series_es)

pandas.core.series.Series

In [55]:
series_es[1:10:2]

1    42.2139
3    42.5781
5    42.2394
7    42.8231
9    42.9111
Name: lat, dtype: float64

In [56]:
series_es[:7]

0    42.4822
1    42.2139
2    42.9144
3    42.5781
4    42.3214
5    42.2394
6    42.4331
Name: lat, dtype: float64

In [57]:
series_es[:-7315]

0       42.4822
1       42.2139
2       42.9144
3       42.5781
4       42.3214
         ...   
5639   -34.1700
5640   -36.8300
5641   -38.2395
5642   -35.4550
5643   -18.5000
Name: lat, Length: 5644, dtype: float64

In [58]:
city_data[:7]

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
0,Malishevë,Malisheve,42.4822,20.7458,Kosovo,XK,XKS,Malishevë,admin,,1901597212
1,Prizren,Prizren,42.2139,20.7397,Kosovo,XK,XKS,Prizren,admin,,1901360309
2,Zubin Potok,Zubin Potok,42.9144,20.6897,Kosovo,XK,XKS,Zubin Potok,admin,,1901608808
3,Kamenicë,Kamenice,42.5781,21.5803,Kosovo,XK,XKS,Kamenicë,admin,,1901851592
4,Viti,Viti,42.3214,21.3583,Kosovo,XK,XKS,Viti,admin,,1901328795
5,Shtërpcë,Shterpce,42.2394,21.0272,Kosovo,XK,XKS,Shtërpcë,admin,,1901828239
6,Shtime,Shtime,42.4331,21.0397,Kosovo,XK,XKS,Shtime,admin,,1901598505


In [59]:
city_data.iloc[:5,:4]

Unnamed: 0,city,city_ascii,lat,lng
0,Malishevë,Malisheve,42.4822,20.7458
1,Prizren,Prizren,42.2139,20.7397
2,Zubin Potok,Zubin Potok,42.9144,20.6897
3,Kamenicë,Kamenice,42.5781,21.5803
4,Viti,Viti,42.3214,21.3583


In [62]:
city_data[city_data['population'] >10000000][city_data.columns[pd.Series(city_data.columns).str.startswith('l')]]

Unnamed: 0,lat,lng
1075,19.4424,-99.131
1685,14.6042,120.9822
1790,24.87,66.99
2220,55.7522,37.6155
3412,41.105,29.01
4240,-34.6025,-58.3975
4552,23.7231,90.4086
5057,-22.925,-43.225
5105,-23.5587,-46.625
5758,31.2165,121.4365


In [63]:
city_greater_10mil = city_data[city_data['population'] > 10000000]
city_greater_10mil.where(city_greater_10mil.population > 15000000)

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
1075,Mexico City,Mexico City,19.4424,-99.131,Mexico,MX,MEX,Ciudad de México,primary,19028000.0,1484248000.0
1685,,,,,,,,,,,
1790,,,,,,,,,,,
2220,,,,,,,,,,,
3412,,,,,,,,,,,
4240,,,,,,,,,,,
4552,,,,,,,,,,,
5057,,,,,,,,,,,
5105,São Paulo,Sao Paulo,-23.5587,-46.625,Brazil,BR,BRA,São Paulo,admin,18845000.0,1076533000.0
5758,,,,,,,,,,,


+ Here we see that we get the output dataframe of the same size but the rows that don’t conform to the condition are replaced with NaN.
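
+ The difference between boolean filtering and `where` is easier to see on a tiny frame. The city names and populations below are placeholders, not rows from the dataset.

```python
import pandas as pd

cities = pd.DataFrame({"city": ["A", "B", "C"],
                       "population": [12_000_000, 16_000_000, 19_000_000]})
condition = cities["population"] > 15_000_000

# Boolean filtering drops the rows that fail the condition.
filtered = cities[condition]

# where keeps the shape and fills the failing rows with NaN.
masked = cities.where(condition)

# Dropping the all-NaN rows recovers the same cities as filtering.
recovered = masked.dropna(how="all")

print(len(filtered), len(masked), len(recovered))
```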

# Data Operations
## Values Attribute

+ Vectorized operations are much faster than function-based operations on dataframes.
+ Using the values attribute of the output dataframe, we can treat it in the same way as a numpy array. This is very useful when working with feature sets in Machine Learning.

In [64]:
df = pd.DataFrame(np.random.randn(8, 3),columns=['A', 'B', 'C'])
df

Unnamed: 0,A,B,C
0,0.603969,-0.440988,0.736716
1,1.374016,0.657515,1.034392
2,0.437587,-0.45202,0.114116
3,-1.388008,1.167475,-0.648296
4,-0.517887,-1.087331,0.112693
5,-1.111624,-1.850551,-0.171002
6,-1.846387,-1.93479,0.374596
7,-0.313786,-0.293114,-0.460515


In [65]:
nparray = df.values
type(nparray)

numpy.ndarray
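
+ A sketch of the two styles on a small made-up frame: a row-wise `apply` calls a Python function once per row, while the vectorized form works on whole columns. `to_numpy()` is used here as an alternative to the values attribute.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6, dtype=float).reshape(3, 2), columns=["A", "B"])

# Function-based: the lambda runs once per row.
row_wise = df.apply(lambda row: row["A"] + row["B"], axis=1)

# Vectorized: one operation over the two columns.
vectorized = df["A"] + df["B"]

# Plain ndarray view of the data, ready to be used as a feature matrix.
features = df.to_numpy()

print(vectorized.tolist(), features.shape)
```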

## Missing Data and the fillna Function

+ One of the most common data quality issues is that of missing data. 
+ Pandas provides us with a convenient function that allows us to handle the missing values of a dataframe.

In [80]:
#introduce a missing value, then replace every NaN with 0
df.iloc[4,2] = np.nan
df
df.fillna(0)

Unnamed: 0,A,B,C
0,0.603969,-0.440988,0.736716
1,1.374016,0.657515,1.034392
2,0.437587,-0.45202,0.114116
3,-1.388008,1.167475,-0.648296
4,-0.517887,-1.087331,0.0
5,-1.111624,-1.850551,-0.171002
6,-1.846387,-1.93479,0.374596
7,-0.313786,-0.293114,-0.460515
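
+ Filling with 0 is only one option. The sketch below, on a small made-up frame, counts the missing values first and then compares a constant fill, a fill with the column means, and dropping incomplete rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, 7.0]})

# How many values are missing in each column.
missing_per_column = df.isna().sum()

zero_filled = df.fillna(0)          # constant fill
mean_filled = df.fillna(df.mean())  # each column filled with its own mean
dropped = df.dropna()               # keep only rows without any NaN

print(missing_per_column.tolist())
print(mean_filled)
```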


# Descriptive Statistics Function

+ Descriptive statistics of a dataframe give data scientists a comprehensive look into important information about any attributes and features in the dataset.

In [82]:
columns_numeric = ['lat','lng','population']
city_data[columns_numeric].mean()

lat               28.081249
lng              -19.078344
population    217727.869288
dtype: float64

In [83]:
city_data[columns_numeric].sum()

lat           3.639049e+05
lng          -2.472363e+05
population    2.458583e+09
dtype: float64

In [84]:
city_data[columns_numeric].count()

lat           12959
lng           12959
population    11292
dtype: int64

In [85]:
city_data[columns_numeric].median()

lat              36.261
lng             -46.150
population    32719.000
dtype: float64

In [86]:
city_data[columns_numeric].quantile(0.8)

lat               44.83664
lng               45.56272
population    174976.80000
Name: 0.8, dtype: float64

In [87]:
city_data[columns_numeric].sum(axis = 1)

0          63.2280
1          62.9536
2          63.6041
3          64.1584
4          63.6797
           ...    
12954    6802.9136
12955    5361.6169
12956    9210.0129
12957    6929.8978
12958    5142.8076
Length: 12959, dtype: float64

In [88]:
city_data[columns_numeric].describe()

Unnamed: 0,lat,lng,population
count,12959.0,12959.0,11292.0
mean,28.081249,-19.078344,217727.9
std,24.000243,78.412783,874665.8
min,-54.9333,-179.59,0.0
25%,15.28185,-86.19825,10580.75
50%,36.261,-46.15,32719.0
75%,42.83095,33.47485,125030.2
max,82.4833,179.3833,35676000.0
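
+ The count for population (11292) is lower than for lat and lng (12959) because these statistics skip missing values. A small sketch with a made-up series:

```python
import numpy as np
import pandas as pd

population = pd.Series([100.0, np.nan, 300.0, np.nan])

n_rows = len(population)                        # counts every row
n_present = population.count()                  # counts non-missing values only
mean_skipping_nan = population.mean()           # NaN values are ignored
mean_with_nan = population.mean(skipna=False)   # NaN propagates to the result

print(n_rows, n_present, mean_skipping_nan, mean_with_nan)
```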


# Concatenating Dataframes
+ Most Data Science projects will have data from more than one data source.
+ Pandas provides a rich set of functions that allow us to merge different data sources.

## Concatenating Using the concat Method
+ The first method to concatenate different dataframes in pandas is by using the concat method.
+ The majority of the concatenation operations on dataframes will be possible by tweaking the parameters of the concat method.
+ The simplest scenario of concatenating is when we have more than one fragment of the same dataframe.

In [89]:
city_data1 = city_data.sample(3)
city_data2 = city_data.sample(3)
city_data_combine = pd.concat([city_data1,city_data2])
city_data_combine

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
7882,Kōchi,Kochi,33.5624,133.5375,Japan,JP,JPN,Kōchi,admin,335570.0,1392086071
4889,Óbidos,Obidos,-1.91,-55.52,Brazil,BR,BRA,Pará,,27587.0,1076197849
9840,Ithaca,Ithaca,42.4442,-76.5032,United States,US,USA,New York,,55439.0,1840000442
5032,Ponte Nova,Ponte Nova,-20.4096,-42.9,Brazil,BR,BRA,Minas Gerais,,48187.0,1076567178
10237,San Marino,San Marino,34.1224,-118.1132,United States,US,USA,California,,13327.0,1840021863
9972,Carbondale,Carbondale,41.5714,-75.5048,United States,US,USA,Pennsylvania,,8447.0,1840003376


+ Another common scenario of concatenating is when we have information about the columns of same dataframe split across different dataframes.

In [90]:
df1 = pd.DataFrame({'col1': ['col10', 'col11', 'col12', 'col13'],
                     'col2': ['col20', 'col21', 'col22', 'col23'],
                     'col3': ['col30', 'col31', 'col32', 'col33'],
                     'col4': ['col40', 'col41', 'col42', 'col43']},
                     index=[0, 1, 2, 3])
df1

Unnamed: 0,col1,col2,col3,col4
0,col10,col20,col30,col40
1,col11,col21,col31,col41
2,col12,col22,col32,col42
3,col13,col23,col33,col43


In [92]:
df4 = pd.DataFrame({'col2': ['col22', 'col23', 'col26', 'col27'],
                    'Col4': ['Col42', 'Col43', 'Col46', 'Col47'],
                    'col6': ['col62', 'col63', 'col66', 'col67']},
                    index=[2, 3, 6, 7])
df4

Unnamed: 0,col2,Col4,col6
2,col22,Col42,col62
3,col23,Col43,col63
6,col26,Col46,col66
7,col27,Col47,col67


In [93]:
pd.concat([df1,df4], axis=1)

Unnamed: 0,col1,col2,col3,col4,col2.1,Col4,col6
0,col10,col20,col30,col40,,,
1,col11,col21,col31,col41,,,
2,col12,col22,col32,col42,col22,Col42,col62
3,col13,col23,col33,col43,col23,Col43,col63
6,,,,,col26,Col46,col66
7,,,,,col27,Col47,col67
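
+ The NaN cells above come from index labels that exist in only one frame. The sketch below shows two `concat` parameters that change this behaviour, `join` and `ignore_index`, on two small made-up frames.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"b": [20, 30, 40]}, index=[1, 2, 3])

# Default (outer) join keeps every index label and fills the gaps with NaN.
outer = pd.concat([left, right], axis=1)

# Inner join keeps only the labels present in both frames.
inner = pd.concat([left, right], axis=1, join="inner")

# When stacking rows, ignore_index builds a fresh 0..n-1 index.
stacked = pd.concat([left, left], ignore_index=True)

print(outer.index.tolist(), inner.index.tolist(), stacked.index.tolist())
```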


## Database Style Concatenations Using the merge Command

+ The most familiar way to concatenate data (for those acquainted with relational databases) is using the join operation provided by the databases.
+ Pandas provides a database friendly set of join operations for dataframes.
+ Joining by columns: This is the most natural way of joining two dataframes. In this method, we have two dataframes sharing a common column and we can join the two dataframes using that column.

In [94]:
country_data = city_data[['iso3','country']].drop_duplicates()
country_data.shape

(235, 2)

In [95]:
country_data.head()

Unnamed: 0,iso3,country
0,XKS,Kosovo
37,XSV,Svalbard
38,XWB,West Bank
39,YEM,Yemen
65,MYT,Mayotte


In [96]:
#drop the country column, then bring it back by joining on the shared iso3 column
del city_data['country']
city_data.merge(country_data, how='inner').head()

Unnamed: 0,city,city_ascii,lat,lng,iso2,iso3,admin_name,capital,population,id,country
0,Malishevë,Malisheve,42.4822,20.7458,XK,XKS,Malishevë,admin,,1901597212,Kosovo
1,Prizren,Prizren,42.2139,20.7397,XK,XKS,Prizren,admin,,1901360309,Kosovo
2,Zubin Potok,Zubin Potok,42.9144,20.6897,XK,XKS,Zubin Potok,admin,,1901608808,Kosovo
3,Kamenicë,Kamenice,42.5781,21.5803,XK,XKS,Kamenicë,admin,,1901851592,Kosovo
4,Viti,Viti,42.3214,21.3583,XK,XKS,Viti,admin,,1901328795,Kosovo
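
+ A sketch of the same join with the key named explicitly, plus a left join for comparison. The row with city "Atlantis" and code "ZZZ" is a made-up entry with no matching country.

```python
import pandas as pd

cities = pd.DataFrame({"city": ["Prizren", "Ithaca", "Atlantis"],
                       "iso3": ["XKS", "USA", "ZZZ"]})
countries = pd.DataFrame({"iso3": ["XKS", "USA"],
                          "country": ["Kosovo", "United States"]})

# Inner join: only rows whose key appears in both frames.
inner = cities.merge(countries, on="iso3", how="inner")

# Left join: every city is kept; indicator=True records where each row matched.
left = cities.merge(countries, on="iso3", how="left", indicator=True)

print(inner)
print(left)
```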


# Scikit-learn

+ It implements a wide range of Machine Learning algorithms covering major areas of Machine Learning like classification, clustering, regression, and so on.
+ All the mainstream Machine Learning algorithms like support vector machines, logistic regression, random forests, K-means clustering, hierarchical clustering, and many many more, are implemented efficiently in this library.

## Scikit-learn Example: Regression Models
### The Dataset

+ The diabetes dataset is one of the bundled datasets with the scikit-learn library.
+ It contains observations of 10 baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) for 442 diabetes patients.


In [98]:
from sklearn import datasets
diabetes = datasets.load_diabetes()
y = diabetes.target
X = diabetes.data
X.shape

(442, 10)

In [99]:
X[:5]

array([[ 0.03807591,  0.05068012,  0.06169621,  0.02187235, -0.0442235 ,
        -0.03482076, -0.04340085, -0.00259226,  0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, -0.02632783, -0.00844872,
        -0.01916334,  0.07441156, -0.03949338, -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, -0.00567061, -0.04559945,
        -0.03419447, -0.03235593, -0.00259226,  0.00286377, -0.02593034],
       [-0.08906294, -0.04464164, -0.01159501, -0.03665645,  0.01219057,
         0.02499059, -0.03603757,  0.03430886,  0.02269202, -0.00936191],
       [ 0.00538306, -0.04464164, -0.03638469,  0.02187235,  0.00393485,
         0.01559614,  0.00814208, -0.00259226, -0.03199144, -0.04664087]])

In [100]:
y[:10]

array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310.])

In [101]:
#Since we are using the data in the form of numpy arrays, we don’t get the name of the features in the data itself.
feature_names=['age', 'sex', 'bmi', 'bp',
               's1', 's2', 's3', 's4', 's5', 's6']

+ The Lasso model is an extension of the normal linear regression model which allows us to apply L1 regularization to the model.
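
+ A practical effect of L1 regularization is that it can push some coefficients to exactly zero. The sketch below fits two Lasso models on the diabetes data with a weak and a strong penalty; the two alpha values are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# A larger alpha means a stronger L1 penalty.
weak = Lasso(alpha=0.01).fit(X, y)
strong = Lasso(alpha=1.0).fit(X, y)

# Count the features each model actually uses.
nonzero_weak = int(np.sum(weak.coef_ != 0))
nonzero_strong = int(np.sum(strong.coef_ != 0))

print(nonzero_weak, nonzero_strong)
```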

In [102]:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

+ We will split our data into separate test and train sets of data (train is used to train the model and test is used for model performance testing and evaluation).
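
+ Slicing by position is one way to split. As an alternative sketch, scikit-learn's `train_test_split` shuffles the rows before splitting; the `test_size` and `random_state` values here are arbitrary choices, not the ones used in this chapter.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Hold out roughly 30% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape)
```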

In [104]:
diabetes = datasets.load_diabetes()
X_train = diabetes.data[:310]
y_train = diabetes.target[:310]
X_test = diabetes.data[310:]
y_test = diabetes.target[310:]
lasso = Lasso(random_state=0)
alphas = np.logspace(-4, -0.5, 30)
estimator = GridSearchCV(lasso, dict(alpha=alphas))
estimator.fit(X_train, y_train)



GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True,
                             max_iter=1000, normalize=False, positive=False,
                             precompute=False, random_state=0,
                             selection='cyclic', tol=0.0001, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'alpha': array([1.00000000e-04, 1.32035178e-04, 1.74332882e-04, 2.30180731e-04,
       3.0...
       2.80721620e-03, 3.70651291e-03, 4.89390092e-03, 6.46167079e-03,
       8.53167852e-03, 1.12648169e-02, 1.48735211e-02, 1.96382800e-02,
       2.59294380e-02, 3.42359796e-02, 4.52035366e-02, 5.96845700e-02,
       7.88046282e-02, 1.04049831e-01, 1.37382380e-01, 1.81393069e-01,
       2.39502662e-01, 3.16227766e-01])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

+ The GridSearchCV object will also score the models that we are learning and we can use the best_estimator_ attribute to identify the model and the optimal value of the hyperparameter that gave us the best score.

In [105]:
estimator.best_score_

0.4654063759023531

In [106]:
estimator.best_estimator_

Lasso(alpha=0.02592943797404667, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=0,
      selection='cyclic', tol=0.0001, warm_start=False)
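
+ The winning hyperparameter can also be read directly from `best_params_`, and the refitted model can be scored on the held-out rows. The sketch below uses a shorter grid of 10 alphas and `cv=3` to keep it quick; both are choices made here, so the numbers will not match the outputs above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)
X_train, y_train = X[:310], y[:310]
X_test, y_test = X[310:], y[310:]

alphas = np.logspace(-4, -0.5, 10)
search = GridSearchCV(Lasso(random_state=0), {"alpha": alphas}, cv=3)
search.fit(X_train, y_train)

# Best alpha found by cross-validation on the training rows.
best_alpha = search.best_params_["alpha"]

# R^2 of the refitted best model on rows it has never seen.
test_r2 = search.score(X_test, y_test)

print(best_alpha, test_r2)
```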

In [107]:
estimator.predict(X_test)

array([203.42104984, 177.6595529 , 122.62188598, 212.81136958,
       173.61633075, 114.76145025, 202.36033584, 171.70767813,
       164.28694562, 191.29091477, 191.41279009, 288.2772433 ,
       296.47009002, 234.53378413, 210.61427168, 228.62812055,
       156.74489991, 225.08834492, 191.75874632, 102.81600989,
       172.373221  , 111.20843429, 290.22242876, 178.64605207,
        78.13722832,  86.35832297, 256.41378529, 165.99622543,
       121.29260976, 153.48718848, 163.09835143, 180.0932902 ,
       161.4330553 , 155.80211635, 143.70181085, 126.13753819,
       181.06471818, 105.03679977, 131.0479936 ,  90.50606427,
       252.66486639,  84.84786067,  59.41005358, 184.51368208,
       201.46598714, 129.96333913,  90.65641478, 200.10932516,
        55.2884802 , 171.60459062, 195.40750666, 122.14139787,
       231.72783897, 159.49750022, 160.32104862, 165.53701866,
       260.73217736, 259.77213787, 204.69526082, 185.66480969,
        61.09821961, 209.9214333 , 108.50410841, 141.18

# Neural Networks and Deep Learning
## Theano
+ The first library popularly used for learning neural networks is Theano.
+ Theano allows us to symbolically define mathematical functions and automatically derive their gradient expression.

In [1]:
#pip install theano or conda install theano
import theano



In [2]:
import numpy
import theano.tensor as T
from theano import function
x = T.dscalar('x')
y = T.dscalar('y')
z = x + y
f = function([x, y], z)
f(8, 2)

array(10.)

In [3]:
#conda install keras or pip install keras
import keras

Using TensorFlow backend.


+ The epochs parameter sets the number of complete forward and backward passes over all the training examples.
+ The batch_size parameter sets the number of samples which are propagated through the NN model at a time; each batch goes through one forward and backward pass, after which the weights are updated.
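
+ As a worked example of how the two parameters interact, the arithmetic below uses the values from the next cell (340 training rows, batch_size=50, epochs=20). It assumes the final, smaller batch is also used, which is the default behaviour of Keras `fit` with numpy arrays.

```python
import math

n_samples = 340
batch_size = 50
epochs = 20

# Six full batches of 50 plus one final batch with the remaining samples.
updates_per_epoch = math.ceil(n_samples / batch_size)
last_batch_size = n_samples - (updates_per_epoch - 1) * batch_size

# Every batch triggers one weight update, in every epoch.
total_updates = updates_per_epoch * epochs

print(updates_per_epoch, last_batch_size, total_updates)
```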

In [6]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X_train = cancer.data[:340]
y_train = cancer.target[:340]
X_test = cancer.data[340:]
y_test = cancer.target[340:]
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
model = Sequential()
model.add(Dense(15, input_dim=30, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=20, batch_size=50)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x6a988badd8>

In [13]:
predictions = model.predict_classes(X_test)

In [10]:
from sklearn import metrics
print('Accuracy:', metrics.accuracy_score(y_true=y_test, y_pred=predictions))

Accuracy: 0.8777292576419214


In [15]:
print(metrics.classification_report(y_true=y_test, y_pred=predictions))

              precision    recall  f1-score   support

           0       0.70      0.85      0.77        55
           1       0.95      0.89      0.92       174

    accuracy                           0.88       229
   macro avg       0.83      0.87      0.84       229
weighted avg       0.89      0.88      0.88       229



## The Power of Deep Learning

In [18]:
model = Sequential()
model.add(Dense(15, input_dim=30, activation='relu'))
model.add(Dense(15, activation='relu'))
model.add(Dense(15, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.fit(X_train, y_train,
          epochs=20,
          batch_size=50)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x6a9d5222b0>

In [19]:
predictions = model.predict_classes(X_test)
print('Accuracy:', metrics.accuracy_score(y_true=y_test, y_pred=predictions))
print(metrics.classification_report(y_true=y_test, y_pred=predictions))

Accuracy: 0.9126637554585153
              precision    recall  f1-score   support

           0       0.86      0.76      0.81        55
           1       0.93      0.96      0.94       174

    accuracy                           0.91       229
   macro avg       0.89      0.86      0.88       229
weighted avg       0.91      0.91      0.91       229



+ We achieve an overall accuracy and weighted F1 score of 91%, and for class label 0 (malignant) the F1 score rises to 81%, compared to 77% from the previous model.
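
+ The F1 score is the harmonic mean of precision and recall. The sketch below recomputes it for class 0 from the rounded precision and recall values printed in the two classification reports above; the helper name `f1_from` is made up for this example.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 0, single hidden layer: precision 0.70, recall 0.85
shallow_f1 = round(f1_from(0.70, 0.85), 2)

# Class 0, three hidden layers: precision 0.86, recall 0.76
deep_f1 = round(f1_from(0.86, 0.76), 2)

print(shallow_f1, deep_f1)
```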

In [20]:
import nltk

In [None]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


In [None]:
nltk.download('all', halt_on_error=False)

+ Once the download is finished we will be able to use all the necessary functionalities and the bundled data of the nltk package.