<a href="https://colab.research.google.com/github/andrybrew/pythondataanalytics/blob/master/004_machine_learning_with_scikit_learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning with Scikit-Learn

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop a conventional algorithm for effectively performing the task.


Import Libraries

In [0]:
# Import Library for Data Manipulation
import pandas as pd

In [0]:
# Import Library for Visualization
import matplotlib.pyplot as plt
import seaborn as sns

## Supervised Learning

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way.
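Generalizing to unseen instances can be illustrated with a held-out split. The snippet below is a minimal sketch on synthetic data (generated with `make_classification`, not one of this notebook's datasets; all names ending in `_demo` are made up for illustration). The model is fit on labeled training pairs only and is then scored on examples it has never seen.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic labeled data (an assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

# Hold out 30% of the examples as "unseen" data
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Learn the input-to-output mapping from the training pairs only
clf_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Accuracy on the training data vs. accuracy on the held-out data
train_acc = clf_demo.score(X_tr, y_tr)
test_acc = clf_demo.score(X_te, y_te)
print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

A large gap between the two scores would suggest that the model memorized the training data rather than generalizing from it.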

### 1.Regression Model

Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable') and one or more independent variables (often called 'predictors', 'covariates', or 'features').

#### a.Linear Regression

Linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.
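The difference between simple and multiple linear regression can be sketched on synthetic data. In the snippet below the response is built from known coefficients (2 and -3) and an intercept of 5, all of which are assumptions chosen for illustration, so the fitted values can be checked against them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Two hypothetical explanatory variables, 200 observations
X_demo = rng.normal(size=(200, 2))

# Response built from known coefficients plus a little noise
y_demo = 5 + 2 * X_demo[:, 0] - 3 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Simple linear regression: a single explanatory variable
simple = LinearRegression().fit(X_demo[:, [0]], y_demo)

# Multiple linear regression: both explanatory variables
multiple = LinearRegression().fit(X_demo, y_demo)

print("Simple coefficient:", simple.coef_)
print("Multiple coefficients:", multiple.coef_)
print("Multiple intercept:", multiple.intercept_)
```

The multiple regression should recover values close to the coefficients used to generate the data, while the simple regression has only one coefficient to estimate.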

Import Data

In [0]:
# Import Dataset
df_salary = pd.read_csv('https://raw.githubusercontent.com/andrybrew/all-files/master/dataset_python/salary_data.csv')
df_salary

In [0]:
# Prints the Dataset Information
df_salary.info()

In [0]:
# Prints Descriptive Statistics
df_salary.describe().transpose()

Modeling Linear Regression

In [0]:
# Importing Linear Regression Module
from sklearn.linear_model import LinearRegression

# Modeling Linear Regression
lr = LinearRegression()

# Select X and Y Variable
X = df_salary.iloc[:, :-1].values
Y = df_salary.iloc[:, 1].values

# Apply Model to Data
lr.fit(X, Y)

In [0]:
# Show Coefficient and Intercept
print('Coefficient = ', lr.coef_)
print('Intercept =', lr.intercept_)
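The coefficient and intercept define the fitted line, so a prediction is simply `intercept + coefficient * x`. The following sketch checks this on a tiny made-up dataset (the values are assumptions, not the salary data) by comparing a manual calculation with `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset that follows y = 1 + 2x exactly
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([3.0, 5.0, 7.0, 9.0])

model_demo = LinearRegression().fit(X_demo, y_demo)

# Manual prediction for x = 10 using the fitted line
x_new = np.array([[10.0]])
manual_pred = model_demo.intercept_ + model_demo.coef_[0] * x_new[0, 0]

# The same prediction through the estimator API
api_pred = model_demo.predict(x_new)[0]

print("Manual:", manual_pred)
print("predict():", api_pred)
```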

Visualizing Result

In [0]:
# Draw Scatter Plot with X and Y Axes
plt.scatter(X, Y)

# Show Regression Line with Green Color
plt.plot(X, lr.predict(X), color = "green")

# Set Title
plt.title("Salary vs Experience")

# Set Axes Name
plt.xlabel("Years of Experience")
plt.ylabel("Salary")

# Show Plot
plt.show()

#### b.Logistic Regression Model

Import Libraries

In [0]:
# Import Library for Data Manipulation
import pandas as pd

Import Data

In [0]:
# Import Dataset
df_churn = pd.read_csv('https://raw.githubusercontent.com/andrybrew/all-files/master/dataset_python/churn_transformed.csv')
df_churn.head(5)

In [0]:
# Prints the Dataset Information
df_churn.info()

In [0]:
# Prints Descriptive Statistics
df_churn.describe().transpose()

Data Preparation

In [0]:
# Remove the "Unnamed: 0" Column
df_churn1 = df_churn.drop("Unnamed: 0", axis=1)
df_churn1.head()

In [0]:
# Select the Features by Dropping the Target and the Unused Column
feature = ['Churn', 'TotalCharges']
train_feature = df_churn1.drop(feature, axis=1)

In [0]:
# Set the Target
train_target = df_churn1["Churn"]

In [0]:
# Split Data 70% : 30%
from sklearn.model_selection import train_test_split, cross_val_score
X_train, X_test, y_train, y_test = train_test_split(train_feature ,train_target, shuffle = True, test_size=0.3, random_state=1)

Modeling Logistic Regression

In [0]:
# Importing Logistic Regression Module
from sklearn.linear_model import LogisticRegression

# Modeling Logistic Regression
logreg = LogisticRegression()

# Apply Model to Data
logreg_model = logreg.fit(X_train,y_train)

In [0]:
# Predict the Test Data
y_predlg =logreg.predict(X_test)
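For a binary target, logistic regression passes a linear score through the sigmoid function to obtain a probability. The sketch below illustrates this on synthetic data (an assumption; it does not use the churn dataset) by recomputing the probabilities from `coef_` and `intercept_` and comparing them with `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
logreg_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score z = w . x + b for every row
z = X_demo @ logreg_demo.coef_[0] + logreg_demo.intercept_[0]

# Sigmoid turns the score into the probability of class 1
p_manual = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn
p_sklearn = logreg_demo.predict_proba(X_demo)[:, 1]

print(np.allclose(p_manual, p_sklearn))
```

`predict` then assigns class 1 whenever this probability exceeds 0.5.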

Evaluating Model

In [0]:
# Importing Metrics Module and Computing the Confusion Matrix
from sklearn import metrics

cnf_matrixlg = metrics.confusion_matrix(y_test, y_predlg)
cnf_matrixlg

In [0]:
# Show Accuracy, Precision, Recall, F1 Score, and Cohen's Kappa
acc_lg = metrics.accuracy_score(y_test, y_predlg)
prec_lg = metrics.precision_score(y_test, y_predlg)
rec_lg = metrics.recall_score(y_test, y_predlg)
f1_lg = metrics.f1_score(y_test, y_predlg)
kappa_lg = metrics.cohen_kappa_score(y_test, y_predlg)

print("Accuracy:", acc_lg )
print("Precision:", prec_lg)
print("Recall:", rec_lg)
print("F1 Score:", f1_lg)
print("Cohens Kappa Score:", kappa_lg)
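Accuracy, precision, recall, and F1 can all be derived from the four cells of a binary confusion matrix. The sketch below works through the arithmetic on made-up labels (not the churn predictions) and can be compared against the `sklearn.metrics` functions used above.

```python
import numpy as np
from sklearn import metrics

# Made-up labels for illustration (1 = positive class)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision = tp / (tp + fp)                   # how many predicted positives are real
recall = tp / (tp + fn)                      # how many real positives were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```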

### 2.Classification Model

Import Libraries

In [0]:
# Import Library for Data Manipulation
import pandas as pd

In [0]:
# Import Library for Visualization
import matplotlib.pyplot as plt

Import Data

In [0]:
# Import Dataset
df_churn = pd.read_csv('https://raw.githubusercontent.com/andrybrew/all-files/master/dataset_python/churn_transformed.csv')
df_churn.head(5)

In [0]:
# Prints the Dataset Information
df_churn.info()

In [0]:
# Prints Descriptive Statistics
df_churn.describe().transpose()

Data Preparation

In [0]:
# Remove the "Unnamed: 0" Column
df_churn1 = df_churn.drop("Unnamed: 0", axis=1)
df_churn1.head()

In [0]:
# Importing StandardScaler Module
from sklearn.preprocessing import StandardScaler 

# Set Name for StandardScaler as scaler
scaler = StandardScaler() 

# Fit the Scaler (this only learns column means and standard deviations; the data is not transformed here)
scaler.fit(df_churn1) 
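Calling `fit` only stores the statistics; the data changes only when `transform` (or `fit_transform`) is called. The sketch below shows both steps on a small made-up matrix (an assumption, not the churn data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix with very different column scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0],
                   [4.0, 400.0]])

scaler_demo = StandardScaler()

# fit() learns the per-column mean and standard deviation
scaler_demo.fit(X_demo)

# transform() applies (x - mean) / std to each column
X_scaled = scaler_demo.transform(X_demo)

print("Learned means:", scaler_demo.mean_)
print("Scaled column means:", X_scaled.mean(axis=0))
print("Scaled column stds:", X_scaled.std(axis=0))
```

Distance-based models such as k-nearest neighbors are usually sensitive to feature scale, so in practice the scaler is often fit on the training split and then applied to both splits.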

In [0]:
# Select the Features by Dropping the Target and the Unused Column
feature = ['Churn', 'TotalCharges']
train_feature = df_churn1.drop(feature, axis=1)

In [0]:
# Set the Target
train_target = df_churn1["Churn"]

In [0]:
# Split Data 70% : 30%
from sklearn.model_selection import train_test_split, cross_val_score
X_train, X_test, y_train, y_test = train_test_split(train_feature ,train_target, shuffle = True, test_size=0.3, random_state=1)

Modeling and Evaluation

##### a.Decision Tree

Modeling Decision Tree Classifier

In [0]:
# Importing Decision Tree Module
from sklearn import tree

# Modeling Decision Tree
dtc = tree.DecisionTreeClassifier(min_impurity_decrease=0.01)

# Apply Model to Data
dtc.fit(X_train, y_train)

In [0]:
# Predict the Test Data
y_preddtc = dtc.predict(X_test)

In [0]:
# Plot the Tree
tree.plot_tree(dtc, class_names=['0','1'])

In [0]:
# Visualize with graphviz

from io import StringIO
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(dtc, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True, class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())

Evaluating Decision Tree Model

In [0]:
# Importing Metrics Module and Computing the Confusion Matrix
from sklearn import metrics

cnf_matrixdtc = metrics.confusion_matrix(y_test, y_preddtc)
cnf_matrixdtc

In [0]:
# Show Accuracy, Precision, Recall, F1 Score, and Cohen's Kappa
acc_dtc = metrics.accuracy_score(y_test, y_preddtc)
prec_dtc = metrics.precision_score(y_test, y_preddtc)
rec_dtc = metrics.recall_score(y_test, y_preddtc)
f1_dtc = metrics.f1_score(y_test, y_preddtc)
kappa_dtc = metrics.cohen_kappa_score(y_test, y_preddtc)

print("Accuracy:", acc_dtc )
print("Precision:", prec_dtc)
print("Recall:", rec_dtc)
print("F1 Score:", f1_dtc)
print("Cohens Kappa Score:", kappa_dtc)

In [0]:
# Cross-validation Score
cv_iterations = 5
cv_score = cross_val_score(dtc, train_feature, train_target, cv=cv_iterations)
print('Accuracy with cross-validation (folds = {}): {} (+/- {})'
      .format(cv_iterations, round(cv_score.mean(),2), round(cv_score.std() * 2,2)))
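What `cross_val_score` does can be sketched with an explicit loop over folds. The snippet below uses synthetic data and a depth-limited tree (both assumptions for illustration); for a classifier with an integer `cv`, scikit-learn uses stratified folds without shuffling, which the manual loop mirrors with `StratifiedKFold`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Manual loop: train on four folds, score on the remaining one
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = DecisionTreeClassifier(max_depth=3, random_state=0)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

# The helper clones the estimator and repeats the same procedure
tree_demo = DecisionTreeClassifier(max_depth=3, random_state=0)
helper_scores = cross_val_score(tree_demo, X_demo, y_demo, cv=5)

print("Manual:", np.round(manual_scores, 3))
print("Helper:", np.round(helper_scores, 3))
```

Each fold serves as the test set exactly once, so the mean of the scores is a less split-dependent estimate than a single train/test split.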

In [0]:
# Set Plot
plt.rcParams['figure.figsize'] = (7, 7)
plt.style.use('ggplot')

# Visualize ROC Curve
y_pred_probadtc = dtc.predict_proba(X_test)[::,1]
fprdtc, tprdtc, _ = metrics.roc_curve(y_test,  y_pred_probadtc)
aucdtc = metrics.roc_auc_score(y_test, y_pred_probadtc)
plt.plot(fprdtc,tprdtc,label="Decision Tree, auc="+str(aucdtc))
plt.legend(loc=4)
plt.show()
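The AUC shown in the legend is the area under the ROC curve. As a minimal sketch on four made-up scores (not the model outputs above), the area computed from the curve points with `metrics.auc` can be compared with `roc_auc_score`.

```python
import numpy as np
from sklearn import metrics

# Made-up true labels and predicted scores for illustration
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Points of the ROC curve, one per threshold
fpr, tpr, thresholds = metrics.roc_curve(y_true, scores)

# Area under those points via the trapezoidal rule
auc_from_curve = metrics.auc(fpr, tpr)

# The same quantity computed directly from labels and scores
auc_direct = metrics.roc_auc_score(y_true, scores)

print(auc_from_curve, auc_direct)
```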

##### b.K-Nearest Neighbor

Modeling K-Nearest Neighbor Classifier

In [0]:
# Importing KNN Module
from sklearn.neighbors import KNeighborsClassifier

# Modeling KNN
knn = KNeighborsClassifier(n_neighbors= 71)

# Apply Model to Data
knn.fit(X_train, y_train)

In [0]:
# Predict the Test Data
y_predknn = knn.predict(X_test)

Evaluating K-Nearest Neighbor Model

In [0]:
# Importing Metrics Module and Computing the Confusion Matrix
from sklearn import metrics

cnf_matrixknn = metrics.confusion_matrix(y_test, y_predknn)
cnf_matrixknn

In [0]:
# Show Accuracy, Precision, Recall, F1 Score, and Cohen's Kappa
acc_knn = metrics.accuracy_score(y_test, y_predknn)
prec_knn = metrics.precision_score(y_test, y_predknn)
rec_knn = metrics.recall_score(y_test, y_predknn)
f1_knn = metrics.f1_score(y_test, y_predknn)
kappa_knn = metrics.cohen_kappa_score(y_test, y_predknn)

print("Accuracy:", acc_knn)
print("Precision:", prec_knn)
print("Recall:", rec_knn)
print("F1 Score:", f1_knn)
print("Cohens Kappa Score:", kappa_knn)

In [0]:
# Cross-validation Score
cv_iterations = 10
cv_score = cross_val_score(knn, train_feature, train_target, cv=cv_iterations)
print('Accuracy with cross-validation (folds = {}): {} (+/- {})'
      .format(cv_iterations, round(cv_score.mean(),2), round(cv_score.std() * 2,2)))

In [0]:
# Set Plot
plt.rcParams['figure.figsize'] = (7, 7)
plt.style.use('ggplot')

# Visualize ROC Curve
y_pred_probaknn = knn.predict_proba(X_test)[::,1]
fprknn, tprknn, _ = metrics.roc_curve(y_test,  y_pred_probaknn)
aucknn = metrics.roc_auc_score(y_test, y_pred_probaknn)
plt.plot(fprknn,tprknn,label="K-NN, auc="+str(aucknn))
plt.legend(loc=4)
plt.show()

##### c.Naive Bayes

Modeling Naive Bayes Classifier

In [0]:
# Importing Naive Bayes Module
from sklearn.naive_bayes import GaussianNB 

# Modeling Naive Bayes
gnb = GaussianNB()

# Apply Model to Data
gnb.fit(X_train, y_train)

In [0]:
# Predict the Test Data
y_predgnb= gnb.predict(X_test)

Evaluating Naive Bayes Model

In [0]:
# Importing Metrics Module and Computing the Confusion Matrix
from sklearn import metrics

cnf_matrixgnb = metrics.confusion_matrix(y_test, y_predgnb)
cnf_matrixgnb

In [0]:
# Show Accuracy, Precision, Recall, F1 Score, and Cohen's Kappa
acc_gnb = metrics.accuracy_score(y_test, y_predgnb)
prec_gnb = metrics.precision_score(y_test, y_predgnb)
rec_gnb = metrics.recall_score(y_test, y_predgnb)
f1_gnb = metrics.f1_score(y_test, y_predgnb)
kappa_gnb = metrics.cohen_kappa_score(y_test, y_predgnb)

print("Accuracy:", acc_gnb)
print("Precision:", prec_gnb)
print("Recall:", rec_gnb)
print("F1 Score:", f1_gnb)
print("Cohens Kappa Score:", kappa_gnb)

In [0]:
# Cross-validation Score
cv_iterations = 10
cv_score = cross_val_score(gnb, train_feature, train_target, cv=cv_iterations)
print('Accuracy with cross-validation (folds = {}): {} (+/- {})'
      .format(cv_iterations, round(cv_score.mean(),2), round(cv_score.std() * 2,2)))

In [0]:
# Set Plot
plt.rcParams['figure.figsize'] = (7, 7)
plt.style.use('ggplot')

# Visualize ROC Curve
y_pred_probagnb = gnb.predict_proba(X_test)[::,1]
fprgnb, tprgnb, _ = metrics.roc_curve(y_test,  y_pred_probagnb)
aucgnb = metrics.roc_auc_score(y_test, y_pred_probagnb)
plt.plot(fprgnb,tprgnb,label="Naive Bayes, auc="+str(aucgnb))
plt.legend(loc=4)
plt.show()

##### d.Compare Evaluation Result

In [0]:
# Comparing Model Performance
print("Decision Tree Accuracy =",acc_dtc)
print("Decision Tree Precision =",prec_dtc)
print("Decision Tree Recall =",rec_dtc)
print("Decision Tree F1-Score =", f1_dtc)
print("_______________________")
print("k-NN Accuracy =", acc_knn)
print("k-NN Precision =", prec_knn)
print("k-NN Recall =", rec_knn)
print("k-NN F1-Score =", f1_knn)
print("_______________________")
print("Naive Bayes Accuracy =", acc_gnb)
print("Naive Bayes Precision =", prec_gnb)
print("Naive Bayes Recall =", rec_gnb)
print("Naive Bayes F1-Score =", f1_gnb)

In [0]:
# Comparing ROC Curve
plt.plot(fprdtc,tprdtc,label="Decision Tree, auc="+str(aucdtc))
plt.plot(fprknn,tprknn,label="K-NN, auc="+str(aucknn))
plt.plot(fprgnb,tprgnb,label="Naive Bayes, auc="+str(aucgnb))
plt.legend(loc=4)
plt.show()

## Unsupervised Learning

### 3.Clustering Model

Import Libraries

In [0]:
# Import Library for Data Manipulation
import pandas as pd
import numpy as np

In [0]:
# Import Library for Visualization
import matplotlib.pyplot as plt
import seaborn as sns

Import Data

In [0]:
# Import Dataset
df_income = pd.read_csv('https://raw.githubusercontent.com/rc-dbe/bigdatacertification/master/dataset/clustering.csv')
df_income.head(5)

In [0]:
# Prints the Dataset Information
df_income.info()

In [0]:
# Prints Descriptive Statistics
df_income.describe().transpose()

Data Preparation

In [0]:
# Importing StandardScaler Module
from sklearn.preprocessing import StandardScaler 

# Set Name for StandardScaler as scaler
scaler = StandardScaler() 

# Fit Standardization
column_names = df_income.columns.tolist()
df_income[column_names] = scaler.fit_transform(df_income[column_names])
df_income.sort_index(inplace=True)
df_income.head()

In [0]:
# Styling Plot
sns.set() 
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')

# Visualising the Data
Income = df_income['INCOME'].values
Spend = df_income['SPEND'].values
X = np.array(list(zip(Income, Spend)))
plt.scatter(Income, Spend, s=50)

Modeling

##### a.K-Means Clustering

Search for the Optimum Number of Clusters

In [0]:
# Elbow Method
from sklearn.cluster import KMeans
wcss = []
for i in range(1,11):
    km=KMeans(n_clusters=i,init='k-means++', max_iter=300, n_init=10, random_state=0)
    km.fit(X)
    wcss.append(km.inertia_)
plt.plot(range(1,11),wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('wcss')
plt.show()
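The quantity plotted on the y-axis, `inertia_`, is the within-cluster sum of squares (WCSS): the squared distance from every point to its assigned centroid, summed over all points. The sketch below recomputes it by hand on synthetic blobs (an assumption; this is not the income data).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three groups (for illustration only)
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)

km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Look up the centroid assigned to each point
assigned_centers = km_demo.cluster_centers_[km_demo.labels_]

# Sum of squared distances from each point to its centroid
wcss_manual = ((X_demo - assigned_centers) ** 2).sum()

print("Manual WCSS:", wcss_manual)
print("km.inertia_:", km_demo.inertia_)
```

WCSS always decreases as the number of clusters grows, which is why the elbow method looks for the point where the decrease levels off rather than for the minimum.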

In [0]:
# Silhouette Method
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

for n_cluster in range(2, 11):
    kmeans = KMeans(n_clusters=n_cluster).fit(X)
    label = kmeans.labels_
    sil_coeff = silhouette_score(X, label, metric='euclidean')
    print("For n_clusters={}, The Silhouette Coefficient is {}".format(n_cluster, sil_coeff))

Modeling K-Means Clustering

In [0]:
# Apply the K-Means Model to the Data
km2=KMeans(n_clusters=3,init='k-means++', max_iter=300, n_init=10, random_state=0)
y_means = km2.fit_predict(X)

In [0]:
# Visualising Clusters for k=3
plt.scatter(X[y_means == 0, 0], X[y_means == 0, 1], s = 50, label = 'Cluster 1')
plt.scatter(X[y_means == 1, 0], X[y_means == 1, 1], s = 50, label = 'Cluster 2')
plt.scatter(X[y_means == 2, 0], X[y_means == 2, 1], s = 50, label = 'Cluster 3')

plt.scatter(km2.cluster_centers_[:,0], km2.cluster_centers_[:,1],s=200,marker='s', alpha=0.7, label='Centroids')
plt.title('Customer segments')
plt.xlabel('Annual income')
plt.ylabel('Annual spend')
plt.legend()
plt.show()

##### b.Hierarchical Clustering

Modeling Hierarchical Clustering

In [0]:
# Modeling and Visualizing Clusters by Dendrogram
import scipy.cluster.hierarchy as sch
dend=sch.dendrogram(sch.linkage(X, method='ward'))
plt.title("Dendrogram")
plt.xlabel('Customer')
plt.ylabel('euclidean')
plt.show()
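A dendrogram can also be cut programmatically to obtain cluster labels. The following sketch uses SciPy's `fcluster` on synthetic blobs (an assumption; not the income data) and asks for at most three flat clusters from the same kind of Ward linkage matrix.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs

# Synthetic 2-D data (for illustration only)
X_demo, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# Linkage matrix: one row per merge, n_samples - 1 merges in total
Z = sch.linkage(X_demo, method='ward')

# Cut the tree so that no more than 3 flat clusters remain
labels_demo = sch.fcluster(Z, t=3, criterion='maxclust')

print("Linkage matrix shape:", Z.shape)
print("Cluster labels found:", np.unique(labels_demo))
```

Note that `fcluster` numbers clusters starting from 1, whereas scikit-learn's clustering estimators start from 0.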

In [0]:
# Apply the Hierarchical Clustering Model to the Dataset
from sklearn.cluster import AgglomerativeClustering
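# Ward linkage uses Euclidean distance by default; the old `affinity` argument was removed in recent scikit-learn versions
hc=AgglomerativeClustering(n_clusters=3, linkage='ward')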
y_hc = hc.fit_predict(X)

In [0]:
# Visualising Clusters for k=3
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 50, label = 'Cluster 1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 50, label = 'Cluster 2')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 50, label = 'Cluster 3')
plt.title('Clusters of customers')
plt.xlabel('Annual Income')
plt.ylabel('Annual spend')
plt.legend()
plt.show()

### 4.Association Model

#### a.Market Basket Analysis

Import Libraries

In [0]:
# Import Library for Data Manipulation
import pandas as pd
import numpy as np

In [0]:
# Import Library for Visualization
import seaborn as sns

Import Dataset

In [0]:
# Import dataset
df_retail = pd.read_excel("https://github.com/rc-dbe/bigdatacertification/blob/master/dataset/Online%20Retail.xlsx?raw=true")
df_retail.head()

In [0]:
# Prints the Dataset Information
df_retail.info()

In [0]:
# Prints Descriptive Statistics
df_retail.describe().transpose()

Data Preparation

In [0]:
# Remove Additional Spaces
df_retail['Description'] = df_retail['Description'].str.strip()

# Remove Missing Values
df_retail.dropna(axis=0, subset=['InvoiceNo'], inplace=True)

# Remove Cancelled Orders
df_retail['InvoiceNo'] = df_retail['InvoiceNo'].astype('str')
df_retail = df_retail[~df_retail['InvoiceNo'].str.contains('C')]

In [0]:
# Create Encode Function
def encode_units(x):
    # 1 if the item appears at least once in the invoice, otherwise 0
    return 1 if x >= 1 else 0

def create_basket(country_filter):
    basket = (df_retail[df_retail['Country'] == country_filter]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
    return basket

In [0]:
# Select Retail Data by Country : France
country_filter = "France"
basket_french = create_basket(country_filter)
basket_sets = basket_french.applymap(encode_units)
basket_sets.drop('POSTAGE', inplace=True, axis=1)

Modeling

In [0]:
# Import Apriori Module
from mlxtend.frequent_patterns import apriori

# Search for Frequent Itemsets
frequent_itemsets = apriori(basket_sets, min_support=0.05, use_colnames=True)
frequent_itemsets.head()

In [0]:
# Import Association Rules Module
from mlxtend.frequent_patterns import association_rules

# Generate Rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules.head()

In [0]:
# Sorting the Rules by Confidence
rules.sort_values(["confidence"], axis=0, 
                 ascending=False, inplace=True) 
rules
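The `support`, `confidence`, and `lift` columns can be reproduced by hand for a single rule. The sketch below does this with plain pandas on a tiny made-up basket (the items and invoices are assumptions, not the retail data) for the rule bread -> butter.

```python
import pandas as pd

# Tiny made-up one-hot basket: rows are invoices, columns are items
baskets = pd.DataFrame({
    "bread":  [1, 1, 0, 1, 1],
    "butter": [1, 1, 0, 0, 1],
    "milk":   [0, 1, 1, 1, 0],
}).astype(bool)

# Support: share of invoices that contain the item(s)
support_bread = baskets["bread"].mean()
support_butter = baskets["butter"].mean()
support_both = (baskets["bread"] & baskets["butter"]).mean()

# Confidence of bread -> butter: P(butter | bread)
confidence = support_both / support_bread

# Lift: how much more often the pair occurs than expected under independence
lift = confidence / support_butter

print("support:", support_both)
print("confidence:", confidence)
print("lift:", lift)
```

A lift above 1 suggests the two items are bought together more often than chance alone would predict, which is why the rules above are filtered on a minimum lift.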

### 5.Dimensionality Reduction

#### a.Principal Component Analysis

Import Data

In [0]:
# Use the Built-in Breast Cancer Dataset from Scikit-Learn
from sklearn.datasets import load_breast_cancer 
  
# Instantiating 
cancer = load_breast_cancer() 

In [0]:
# Creating Dataframe 
df_pca = pd.DataFrame(cancer['data'], columns = cancer['feature_names']) 

In [0]:
# Checking Head of Data 
df_pca.head(5) 

Data Preparation

In [0]:
# Importing StandardScaler Module
from sklearn.preprocessing import StandardScaler 

# Set Name for StandardScaler as scaler
scaler = StandardScaler() 

In [0]:
# Fit Standardization
scaler.fit(df_pca) 

# Transformed Data
df_pca_scaled = pd.DataFrame(scaler.transform(df_pca))

# Checking Head of Data
df_pca_scaled.head(5) 

In [0]:
# Show Data Dimension
df_pca_scaled.shape 

Modeling PCA

In [0]:
# Importing PCA 
from sklearn.decomposition import PCA 

In [0]:
# Modeling PCA with Components = 2 
pca = PCA(n_components = 2) 

# Apply Model to Data
pca.fit(df_pca_scaled) 

# Show Result Dimension
x_pca = pca.transform(df_pca_scaled) 
x_pca.shape 
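Reducing 30 features to 2 components discards some information, and `explained_variance_ratio_` reports how much of the total variance each component keeps. The snippet below is a self-contained sketch of that check on the same built-in dataset (names ending in `_demo` are made up for illustration).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the 30 features so each contributes on the same scale
X_demo = StandardScaler().fit_transform(load_breast_cancer()['data'])

# Fit a 2-component PCA
pca_demo = PCA(n_components=2).fit(X_demo)

# Share of total variance captured by each component
ratios = pca_demo.explained_variance_ratio_

print("Per component:", ratios)
print("Total retained:", ratios.sum())
```

If the retained share is too low for the task at hand, `n_components` can be increased.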

Visualizing Result

In [0]:
# Plot 
plt.scatter(x_pca[:, 0], x_pca[:, 1], c = cancer['target'], cmap ='plasma') 

# Labeling X and Y Axes 
plt.xlabel('First Principal Component') 
plt.ylabel('Second Principal Component')

In [0]:
# Plotting Heatmap
df_comp = pd.DataFrame(pca.components_, columns = cancer['feature_names']) 
sns.heatmap(df_comp)