## Chapter 2: Machine Learning

Chapter 1 provided an overview of some emerging industry trends around Big Data and Artificial Intelligence, and discussed how software is getting smarter through the application of AI. This chapter focuses on the most popular AI technique for infusing smarts into software: Machine Learning (ML). We see examples of using ML to find patterns in data and capture those patterns in artifacts called models. We cover the three types of ML techniques and discuss applications of each. Finally, we review some code examples that build ML models from simple datasets. The code is heavily commented, so you can start your own Colaboratory or Jupyter Notebook environment and run it.

In [1]:
# Listing 2.3
# Pandas is my favourite tool for Data loading and munging
import pandas as pd

# Read a csv file from data folder and show the records
features = pd.read_csv('data/house.price.csv')
features.head(10)

Unnamed: 0,Area,Locality,Price
0,100,4,30
1,250,5,80
2,220,5,80
3,105,6,40
4,260,6,60
5,150,8,100
6,180,9,120
7,225,4,60
8,95,5,40
9,160,9,110


In [2]:
# Listing 2.4
# We will use the K-Means algorithm
from sklearn.cluster import KMeans

# We will only consider 2 features and see if we get a pattern
cluster_Xs = features[['Area', 'Locality']]

# How many clusters we want to find
NUM_CLUSTERS = 3

# Build the K Means Clusters model
model = KMeans(n_clusters=NUM_CLUSTERS)
model.fit(cluster_Xs)

# Predict and get cluster labels - 0, 1, ... NUM_CLUSTERS - 1
predictions = model.predict(cluster_Xs)

# Add predictions to the features data frame
features['cluster'] = predictions

features.head(10)

Unnamed: 0,Area,Locality,Price,cluster
0,100,4,30,2
1,250,5,80,1
2,220,5,80,1
3,105,6,40,2
4,260,6,60,1
5,150,8,100,0
6,180,9,120,0
7,225,4,60,1
8,95,5,40,2
9,160,9,110,0


In [3]:
# Listing 2.5
features_sorted = features.sort_values('cluster')
features_sorted

Unnamed: 0,Area,Locality,Price,cluster
5,150,8,100,0
6,180,9,120,0
9,160,9,110,0
1,250,5,80,1
2,220,5,80,1
4,260,6,60,1
7,225,4,60,1
0,100,4,30,2
3,105,6,40,2
8,95,5,40,2
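
To see what each cluster represents, it can help to look at the cluster centers the model learned. The sketch below is a minimal, self-contained illustration: it re-creates the ten rows shown above inline instead of reading `data/house.price.csv`, and it sets `n_init` and `random_state` (both assumptions, not used in Listing 2.4) so repeated runs give the same grouping. The numeric label assigned to each cluster is arbitrary and may differ from the output above.

```python
import pandas as pd
from sklearn.cluster import KMeans

# The ten rows printed above, re-created inline (assumption: same values as the CSV)
houses = pd.DataFrame({
    "Area":     [100, 250, 220, 105, 260, 150, 180, 225, 95, 160],
    "Locality": [4, 5, 5, 6, 6, 8, 9, 4, 5, 9],
    "Price":    [30, 80, 80, 40, 60, 100, 120, 60, 40, 110],
})

# n_init and random_state are added here only to make the run repeatable
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
houses["cluster"] = kmeans.fit_predict(houses[["Area", "Locality"]])

# One row per cluster: the mean Area and Locality of its members
centers = pd.DataFrame(kmeans.cluster_centers_, columns=["Area", "Locality"])
print(centers)

# Price was not used for clustering, but its per-cluster mean is still informative
cluster_summary = houses.groupby("cluster")[["Area", "Locality", "Price"]].mean()
print(cluster_summary)
```

Because `Area` has a much larger numeric range than `Locality`, it dominates the distance calculation in this sketch; scaling the features first would be a reasonable variation to try.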


In [14]:
# Pandas is my favourite tool for Data loading and munging
import pandas as pd

# Read a csv file and show the records
features = pd.read_csv('data/house.price.csv')
features.head(10)

Unnamed: 0,Area,Locality,Price
0,100,4,30
1,250,5,80
2,220,5,80
3,105,6,40
4,260,6,60
5,150,8,100
6,180,9,120
7,225,4,60
8,95,5,40
9,160,9,110


In [20]:
# Listing 2.8
# Separate the first 8 points as the Training set (rows 0-7)
X_train = features[["Area","Locality"]].values[:8]
Y_train = features["Price"].values[:8]
# Separate the last 2 points as the Test set (rows 8-9)
X_test = features[["Area","Locality"]].values[8:]
Y_test = features["Price"].values[8:]

In [21]:
# Listing 2.9
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, Y_train)
print("Model weights are: ", model.coef_)
print("Model intercept is: ", model.intercept_)

Model weights are:  [ 0.20370091 13.56708023]
Model intercept is:  -46.39589056258242


In [22]:
# Predict for one point from Test set
print('Predicting for ', X_test[0])
print('Expected value ', Y_test[0])
print('Predicted value ', model.predict(X_test[:1]))

Predicting for  [95  5]
Expected value  40
Predicted value  [40.79109689]
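
The predicted value above can be reproduced by hand, which shows what a linear regression model actually stores: one weight per feature plus an intercept. The sketch below plugs in the rounded weights and intercept printed by Listing 2.9 (copied from the output, not re-trained), so the result agrees with the model's prediction only up to rounding.

```python
import numpy as np

# Values copied from the printed output of Listing 2.9
weights = np.array([0.20370091, 13.56708023])   # one weight each for Area, Locality
intercept = -46.39589056258242

# The test point used above: Area = 95, Locality = 5
x = np.array([95, 5])

# prediction = w_area * Area + w_locality * Locality + intercept
manual_prediction = float(weights @ x + intercept)
print(round(manual_prediction, 4))  # 40.7911
```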


In [8]:
# Pandas is my favourite tool for Data loading and munging
import pandas as pd

# Read a csv file and show the records
features = pd.read_csv('data/house.sale.csv')
features.head(10)

Unnamed: 0,Area,Locality,Price,Buy
0,100,4,30,0
1,250,5,80,1
2,220,5,80,1
3,105,6,40,1
4,150,8,100,0
5,180,9,120,0
6,225,4,60,0
7,95,5,40,1
8,260,6,60,1
9,160,9,110,0


In [9]:
# Listing 2.10
# Separate the first 8 points as the Training set (rows 0-7)
X_train = features[["Area","Locality","Price"]].values[:8]
Y_train = features["Buy"].values[:8]
# Separate the last 2 points as the Test set (rows 8-9)
X_test = features[["Area","Locality","Price"]].values[8:]
Y_test = features["Buy"].values[8:]

In [10]:
# Listing 2.11
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, Y_train)

# make a prediction on test data
Y_pred = model.predict(X_test)

# print expected results
print(Y_test)
# print the predictions
print(Y_pred)

[1 0]
[1 0]
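
`predict` returns only the winning class. A logistic regression model also exposes the class probabilities behind that decision through `predict_proba`. The sketch below is self-contained under a few assumptions: the ten rows of `data/house.sale.csv` are re-created inline from the table above, the variable names are illustrative, and `max_iter=1000` is added only to give the solver room to converge on these unscaled features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: Area, Locality, Price, Buy (copied from the table above)
sales = np.array([
    [100, 4, 30, 0],
    [250, 5, 80, 1],
    [220, 5, 80, 1],
    [105, 6, 40, 1],
    [150, 8, 100, 0],
    [180, 9, 120, 0],
    [225, 4, 60, 0],
    [95, 5, 40, 1],
    [260, 6, 60, 1],
    [160, 9, 110, 0],
])

# Same split as Listing 2.10: first 8 rows to train, last 2 to test
X_fit, y_fit = sales[:8, :3], sales[:8, 3]
X_holdout = sales[8:, :3]

clf = LogisticRegression(max_iter=1000)  # max_iter is an assumption, see above
clf.fit(X_fit, y_fit)

# One row per test point, one column per class in clf.classes_
probabilities = clf.predict_proba(X_holdout)
print(clf.classes_)
print(probabilities)
```

Each row of `probabilities` sums to 1, and `predict` simply picks the class with the larger value.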


In [23]:
# Listing 2.12
# Pandas is my favorite tool for Data loading and munging
import pandas as pd

# Read a csv file and show the records
features = pd.read_csv('data/winequality-red.csv')
features.describe()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


In [24]:
# Listing 2.13
# separate the Xs and Ys
X = features # all features
X = X.drop(['quality'],axis=1) # remove the quality which is a Y
Y = features[['quality']]
print("X features (Inputs): ", X.columns)
print("Y features (Outputs): ", Y.columns)

X features (Inputs):  Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol'],
      dtype='object')
Y features (Outputs):  Index(['quality'], dtype='object')


In [25]:
from sklearn.model_selection import train_test_split

# split the data into training and test datasets -> 80-20 split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,test_size=0.2)

print("Training features: X", X_train.shape, " Y", Y_train.shape)
print("Test features: X", X_test.shape, " Y", Y_test.shape)

Training features: X (1279, 11)  Y (1279, 1)
Test features: X (320, 11)  Y (320, 1)
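
`train_test_split` shuffles the rows before splitting, so each run of the cell above selects different rows. The sketch below shows one way to make the split repeatable by passing `random_state`. It uses synthetic data with the same shape as the wine dataset (1599 rows, 11 features) rather than the real file, and the seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the wine data (not the real values)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1599, 11))
y_demo = rng.integers(3, 9, size=1599)   # quality-like scores from 3 to 8

# The same random_state gives the same shuffle, and so the same split
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # (1279, 11) (320, 11)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)  # True
```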


In [27]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
# build the Model
model = LogisticRegression()
# fit our Training data (ravel() flattens Y to the 1-D shape scikit-learn expects)
model.fit(X_train, Y_train.values.ravel())
# predict Y values for X_test
Y_pred = model.predict(X_test)
# compare with Y_test and record the Precision
print("Precision for Logistic Regression: ", precision_score(Y_test, Y_pred, average='micro'))
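
The `precision_score` function used in this cell lives in `sklearn.metrics`. With `average='micro'`, true and false positives are pooled across all classes before dividing, which for a single-label multiclass problem like wine quality gives the same number as plain accuracy. The sketch below illustrates this on a small made-up set of labels; the values are hypothetical and not taken from the wine data.

```python
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical quality scores: 4 of the 6 predictions are correct
y_true_demo = [5, 6, 5, 7, 6, 5]
y_pred_demo = [5, 5, 5, 7, 6, 6]

micro_precision = precision_score(y_true_demo, y_pred_demo, average="micro")
accuracy = accuracy_score(y_true_demo, y_pred_demo)

# Both are 4/6: every wrong prediction counts as one false positive overall
print(micro_precision, accuracy)
```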

In [11]:
from sklearn.metrics import confusion_matrix

# Store the result under a new name so the imported function is not overwritten
cm = confusion_matrix(Y_test, Y_pred)
print(cm)

[[1 0]
 [0 1]]
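
For a binary problem, scikit-learn lays out the confusion matrix with true classes as rows and predicted classes as columns, so the four cells can be unpacked by name. The sketch below repeats the two test labels and predictions printed above as literal lists to stay self-contained.

```python
from sklearn.metrics import confusion_matrix

# Expected and predicted labels for the two test points shown above
y_true_demo = [1, 0]
y_pred_demo = [1, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)

# Row = true class, column = predicted class, so ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 1 0 0 1
```

With only two test points, both predicted correctly, the off-diagonal cells (false positives and false negatives) are zero.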


In [12]:
from sklearn.neural_network import MLPClassifier

model = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(2, 1))
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)
print(Y_test)
print(Y_pred)

[coef.shape for coef in model.coefs_]

[1 0]
[1 0]


[(3, 2), (2, 1), (1, 1)]
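
The shapes printed above follow directly from the network layout: 3 input features, hidden layers of 2 and 1 neurons, and 1 output neuron. Each weight matrix connects one layer to the next, so its shape is (neurons in this layer, neurons in the next). The sketch below derives the same list in plain Python, without training a model; the variable names are illustrative.

```python
# Layout of the network above: inputs, hidden layers, output
n_inputs = 3                  # Area, Locality, Price
hidden_layer_sizes = (2, 1)   # as passed to MLPClassifier
n_outputs = 1                 # a single output neuron for the binary Buy label

layer_sizes = [n_inputs, *hidden_layer_sizes, n_outputs]

# Pair each layer with the one that follows it
expected_shapes = list(zip(layer_sizes[:-1], layer_sizes[1:]))
print(expected_shapes)  # [(3, 2), (2, 1), (1, 1)]

# Total number of connection weights (bias terms not counted)
n_weights = sum(rows * cols for rows, cols in expected_shapes)
print(n_weights)  # 9
```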