https://www.geeksforgeeks.org/learning-model-building-scikit-learn-python-machine-learning-library/

pip install -U scikit-learn

# Step 1: Loading a dataset
**Loading an example dataset:** scikit-learn ships with a few small example datasets, such as the iris and digits datasets for classification. The Boston house prices dataset for regression was also bundled in older releases, but it was removed in scikit-learn 1.2.

In [3]:
# load the iris dataset as an example 
from sklearn.datasets import load_iris 

In [32]:
iris = load_iris() 
# print(iris)

# store the feature matrix (X) and response vector (y) 
X = iris.data 
y = iris.target 

# store the feature and target names 
feature_names = iris.feature_names 
target_names = iris.target_names 

# printing features and target names of our dataset 
print("Feature names:", feature_names) 
print("Target names:", target_names) 

# X and y are numpy arrays 
print("\nType of X is:", type(X)) 

# printing first 5 input rows 
print("\nFirst 5 rows of X:\n", X[:5])

print("\nFirst 5 rows of y:\n", y[:5])

Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']

Type of X is: <class 'numpy.ndarray'>

First 5 rows of X:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

First 5 rows of y:
 [0 0 0 0 0]


In [37]:
print(iris.target_names[0])
print(iris.target_names[1])
print(iris.target_names[2])

setosa
versicolor
virginica


In [7]:
X.shape

(150, 4)
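
The other bundled datasets load in the same way. As a minimal sketch, the digits dataset mentioned above can be loaded with `return_X_y=True`, which returns the feature matrix and response vector directly instead of the container object. The variable names `X_digits` and `y_digits` are illustrative only.

```python
from sklearn.datasets import load_digits

# return_X_y=True returns (data, target) directly instead of a Bunch object
X_digits, y_digits = load_digits(return_X_y=True)

# each row is a flattened 8x8 grayscale image of a handwritten digit
print("Shape of X:", X_digits.shape)
print("Shape of y:", y_digits.shape)
print("Number of classes:", len(set(y_digits)))
```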

**Loading an external dataset:** To load an external dataset, we can use the pandas library, which makes it easy to load and manipulate tabular data.

In [8]:
import pandas as pd 

In [9]:
# reading csv file 
data = pd.read_csv('weather.csv') 

# shape of dataset 
print("Shape:", data.shape) 

# column names 
print("\nFeatures:", data.columns) 

# storing the feature matrix (X) and response vector (y) 
X = data[data.columns[:-1]] 
y = data[data.columns[-1]] 

# printing first 5 rows of feature matrix 
print("\nFeature matrix:\n", X.head()) 

# printing first 5 values of response vector 
print("\nResponse vector:\n", y.head())

Shape: (14, 5)

Features: Index(['Outlook', 'Temperature', 'Humidity', 'Windy', 'Play'], dtype='object')

Feature matrix:
     Outlook Temperature Humidity  Windy
0  overcase         hot     high  False
1  overcase        cool   normal   True
2  overcase        mild     high   True
3  overcase         hot   normal  False
4     rainy        mild     high  False

Response vector:
 0    yes
1    yes
2    yes
3    yes
4    yes
Name: Play, dtype: object
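
The feature matrix above contains string columns, while scikit-learn estimators generally expect numeric input, so the columns usually need to be encoded before a model can be fitted. The sketch below shows one possible approach using `pd.get_dummies`. It reads a small in-memory CSV whose rows are hypothetical stand-ins, not the actual contents of `weather.csv`, and the names `X_encoded` and `y_encoded` are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical rows standing in for weather.csv (not the actual file contents)
csv_text = """Outlook,Temperature,Humidity,Windy,Play
sunny,hot,high,False,no
overcast,hot,high,False,yes
rainy,mild,high,False,yes
rainy,cool,normal,True,no
"""

data = pd.read_csv(io.StringIO(csv_text))

# last column is the response; every other column is a feature
X_raw = data[data.columns[:-1]]
y_raw = data[data.columns[-1]]

# one-hot encode the string columns so no text values remain
X_encoded = pd.get_dummies(X_raw, columns=["Outlook", "Temperature", "Humidity"])

# map the yes/no labels to 1/0
y_encoded = (y_raw == "yes").astype(int)

print(X_encoded.shape)
print(y_encoded.tolist())
```

One-hot encoding is only one option; for ordered categories such as temperature, an ordinal encoding may be a better fit.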


# Step 2: Splitting the dataset

In [14]:
# load the iris dataset as an example 
from sklearn.datasets import load_iris 
from sklearn.model_selection import train_test_split 

In [15]:
iris = load_iris() 

# store the feature matrix (X) and response vector (y) 
X = iris.data 
y = iris.target 

# splitting X and y into training and testing sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1) 

# printing the shapes of the new X objects 
print(X_train.shape) 
print(X_test.shape) 

# printing the shapes of the new y objects 
print(y_train.shape) 
print(y_test.shape)

(90, 4)
(60, 4)
(90,)
(60,)
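
A plain random split does not guarantee that each class appears in the same proportion in both subsets. As a hedged sketch, passing `stratify=y` to `train_test_split` preserves the class balance. The `_s` suffix on the variable names is only there to keep them apart from the split above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps each class's share the same in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.4, random_state=1, stratify=y
)

# number of samples per class in each subset
print("Train class counts:", np.bincount(y_train_s))
print("Test class counts:", np.bincount(y_test_s))
```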


# Step 3: Training the model

In [17]:
from sklearn.datasets import load_iris 
from sklearn.model_selection import train_test_split 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn import metrics 
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

In [21]:
# load the iris dataset as an example 
iris = load_iris() 

# store the feature matrix (X) and response vector (y) 
X = iris.data 
y = iris.target 

# splitting X and y into training and testing sets 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1) 

# training the model on training set 
knn = KNeighborsClassifier(n_neighbors=3) 
knn.fit(X_train, y_train) 

# making predictions on the testing set 
y_pred = knn.predict(X_test) 
print(y_pred[:5])

# comparing actual response values (y_test) with predicted response values (y_pred) 
print("kNN model accuracy:", metrics.accuracy_score(y_test, y_pred)) 

# making prediction for out of sample data 
sample = [[3, 5, 4, 2], [2, 3, 5, 4]] 
preds = knn.predict(sample) 
pred_species = [iris.target_names[p] for p in preds] 
print("Predictions:", pred_species) 

# saving the model with joblib so it can be reused later without retraining 
joblib.dump(knn, 'iris_knn.pkl') 

[0 1 1 0 2]
kNN model accuracy: 0.9833333333333333
Predictions: ['versicolor', 'virginica']


['iris_knn.pkl']
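
The accuracy above comes from a single train/test split, so it can shift with a different `random_state`. One way to get a steadier estimate is k-fold cross-validation. The following is a minimal sketch using `cross_val_score` with 5 folds; the fold count and the name `knn_cv` are illustrative choices, not part of the original walkthrough.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# same classifier settings as above, evaluated on 5 different splits
knn_cv = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(knn_cv, X, y, cv=5)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```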

In [38]:
# to load a previously saved classifier instead of retraining it:
knn = joblib.load('iris_knn.pkl')
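
To check that a saved classifier behaves like the original, the predictions before and after reloading can be compared. This is a small sketch under the assumption that the standalone `joblib` package is installed (it is a scikit-learn dependency); the file name `iris_knn_check.pkl` is hypothetical and is written to a temporary directory.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# save to and reload from a temporary directory that is removed afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iris_knn_check.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# the reloaded model should give identical predictions
same = (model.predict(X) == restored.predict(X)).all()
print("Predictions match after reload:", same)
```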