## Scikit-Learn Tutorial
### Machine Learning 

In [14]:
import numpy as np
import pandas as pd

In [15]:
# load the California housing dataset through scikit-learn (downloaded and cached on first use)
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

In [16]:
print(housing['DESCR'])

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [17]:
# store data into dataframe
df=pd.DataFrame(housing.data)

In [18]:
# set features as columns on the dataframe
df.columns = housing.feature_names

In [19]:
# view the first 5 observations
df.head()

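|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |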


In [20]:
# append price - target, as a new column to the dataset
df['Price'] = housing.target
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
 8   Price       20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


In [21]:
X=housing.data
Y=housing.target

### Linear Regression

In [22]:
from sklearn.linear_model import LinearRegression
lineReg = LinearRegression()

In [23]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2, random_state=30)


In [24]:
# fit the model on the training set
lineReg.fit(X_train, Y_train)

In [25]:
# print the number of fitted coefficients (one per feature)
print('The number of coefficients is %d' % len(lineReg.coef_))

The number of coefficients is 8


In [26]:
# compute R^2 (coefficient of determination) on the test set in two equivalent ways
from sklearn.metrics import r2_score
var_score=lineReg.score(X_test, Y_test)
print('Variance score is ',var_score)
test_r2 = r2_score(Y_test, lineReg.predict(X_test))
print('r2 score is: ',test_r2)

Variance score is  0.5882336691134431
r2 score is:  0.5882336691134431


Both values are identical, about 0.59, because `LinearRegression.score` returns the coefficient of determination (R²), which is the same quantity that `r2_score` computes.
An R² of 0.59 means the model explains about 59% of the variance of the target on the test set, so roughly 41% of that variance is not captured by a plain linear fit.
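
To make the definition concrete, here is a minimal sketch that computes R² by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with `r2_score` and `score`. The synthetic data and the names `X_demo`, `y_demo`, and `r2_manual` are assumptions made for illustration; they are not the housing data used above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative assumption, not the housing dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_te - y_pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# All three numbers should agree up to floating-point rounding
print(r2_manual)
print(r2_score(y_te, y_pred))
print(model.score(X_te, y_te))
```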

### Learning Models

#### Supervised learning models
* The model is trained using labeled data, where each input comes with a corresponding output.
* The goal is to learn a mapping from inputs to outputs in order to make predictions on new data.

#### Unsupervised learning models
* The model is trained using unlabeled data, which means there are no output labels.
* The goal is to find patterns, groupings, or structures within the data.
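
In scikit-learn this distinction shows up directly in the call to `fit`. The sketch below is only an illustration on synthetic blobs; the variable names and the choice of `KNeighborsClassifier` and `KMeans` as representatives are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic dataset with known labels (illustrative only)
X_demo, y_demo = make_blobs(n_samples=60, centers=3, n_features=2, random_state=0)

# Supervised: the estimator is given both the inputs and their labels
clf = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_demo)
supervised_pred = clf.predict(X_demo[:5])

# Unsupervised: the estimator is given only the inputs
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)
cluster_ids = km.labels_[:5]

print(supervised_pred)
print(cluster_ids)
```

The cluster ids returned by `KMeans` are arbitrary, so they need not use the same numbering as the true labels even when the grouping is the same.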


### Supervised Learning Models: Logistic Regression

* Logistic Regression is primarily used for binary classification problems, where the output is either 0 or 1. scikit-learn's `LogisticRegression` also handles multiclass targets, such as the three iris classes used below.
* Logistic Regression predicts the probability that a given data point belongs to a certain class.
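
The predicted probabilities can be inspected with `predict_proba`. The following is a minimal sketch on the iris data; `max_iter=1000` and the `sample` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()

# max_iter=1000 is an assumed value that gives the solver room to converge
clf = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

sample = [[5.0, 3.0, 4.0, 2.0]]
proba = clf.predict_proba(sample)  # shape (1, 3): one probability per class
pred = clf.predict(sample)         # the class with the highest probability

print(proba)
print(pred)
```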

In [27]:
# load the iris dataset bundled with scikit-learn
from sklearn.datasets import load_iris
iris_dataset = load_iris()

In [28]:
print(iris_dataset.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [29]:
X=iris_dataset.data
Y=iris_dataset.target

#### KNN model
* KNN, or k-nearest neighbors, is a machine learning algorithm used for classification tasks.
* The KNN algorithm works by finding the K nearest neighbors of a given data point according to a distance metric.
* The predicted class for a new data point is the majority class among those K neighbors; with K = 1, as in the code below, it is simply the class of the closest neighbor.
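
For K = 1 the rule is simple enough to reproduce by hand. This sketch assumes Euclidean distance (the default metric of `KNeighborsClassifier`) and uses hypothetical names such as `query` and `manual_pred`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_demo, y_demo = iris.data, iris.target
query = np.array([5.0, 3.0, 4.0, 2.0])

# Euclidean distance from the query to every training point
distances = np.linalg.norm(X_demo - query, axis=1)

# With K = 1, the prediction is the label of the single closest point
nearest_index = np.argmin(distances)
manual_pred = y_demo[nearest_index]

# The same prediction from scikit-learn's estimator
knn_demo = KNeighborsClassifier(n_neighbors=1).fit(X_demo, y_demo)
sklearn_pred = knn_demo.predict(query.reshape(1, -1))[0]

print(manual_pred, sklearn_pred)
```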

In [30]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

In [31]:
# fit the KNN estimator on the data
knn.fit(X, Y)

In [32]:
# create object with new values for prediction
X_new = [[3,5,4,1],[5,3,4,2]]
knn.predict(X_new)

array([1, 1])

In [33]:
# Use logistic regression estimator
from sklearn.linear_model import LogisticRegression
logReg = LogisticRegression()

In [34]:
logReg.fit(X, Y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [35]:
# predict the outcome using Logistic Regression estimator 
logReg.predict(X_new)

array([0, 1])
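
The convergence warning printed after `logReg.fit(X, Y)` names two remedies: raise `max_iter` or scale the data. A hedged sketch of both follows; the value 1000 and the variable names are assumptions, and whether the warning appears at all can depend on the installed scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Option 1: allow more solver iterations (1000 is an arbitrary, generous limit)
clf_more_iter = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

# Option 2: standardize the features before fitting
clf_scaled = make_pipeline(StandardScaler(), LogisticRegression())
clf_scaled.fit(iris.data, iris.target)

# n_iter_ reports how many iterations the solver actually used
print(clf_more_iter.n_iter_)
print(clf_scaled.named_steps["logisticregression"].n_iter_)

# Training accuracy of both variants
acc_more_iter = clf_more_iter.score(iris.data, iris.target)
acc_scaled = clf_scaled.score(iris.data, iris.target)
print(acc_more_iter, acc_scaled)
```

Scaling changes the effect of the default regularization, so the two models are not guaranteed to produce identical predictions.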

### Unsupervised Learning Models: Clustering

#### KMeans Clustering

In [36]:
# import KMeans class from sklearn.cluster
from sklearn.cluster import KMeans

In [37]:
# import the make_blobs data generator from sklearn.datasets
from sklearn.datasets import make_blobs

In [38]:
# generate 300 samples with 5 features; random_state=None means the data (and the labels below) change on every run
X,y = make_blobs(n_samples=300, n_features=5, random_state=None)

predict_y = KMeans(n_clusters=3, random_state=20).fit_predict(X)

# print the estimator prediction
predict_y



array([2, 2, 0, 2, 2, 0, 0, 0, 0, 2, 2, 0, 2, 1, 2, 2, 2, 0, 0, 1, 1, 2,
       1, 2, 2, 2, 1, 1, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 1, 1, 1, 1, 0, 0, 1, 2, 0, 1, 2, 0, 0, 0, 2, 2, 2, 0, 1, 0,
       0, 0, 1, 2, 0, 2, 2, 1, 1, 1, 2, 0, 1, 0, 1, 2, 1, 1, 2, 2, 1, 2,
       0, 1, 1, 2, 0, 0, 1, 0, 2, 1, 0, 1, 2, 2, 2, 0, 0, 2, 0, 1, 1, 2,
       2, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 2, 1, 0, 1, 1, 1, 2, 0, 1, 2, 2,
       0, 1, 1, 0, 0, 0, 2, 0, 1, 1, 1, 2, 0, 2, 2, 2, 1, 1, 2, 1, 1, 2,
       0, 1, 1, 2, 0, 1, 0, 2, 1, 0, 0, 0, 1, 2, 0, 0, 1, 2, 0, 0, 2, 2,
       0, 1, 1, 0, 0, 2, 2, 2, 2, 0, 0, 2, 0, 0, 0, 2, 1, 0, 1, 2, 2, 2,
       0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1,
       2, 0, 1, 2, 0, 0, 2, 1, 2, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 1, 0,
       0, 1, 2, 0, 1, 1, 0, 1, 2, 2, 1, 2, 2, 1, 1, 0, 1, 2, 0, 0, 2, 1,
       0, 2, 1, 2, 1, 1, 0, 0, 0, 0, 0, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1,
       2, 1, 2, 1, 0, 1, 2, 0, 1, 2, 2, 2, 0, 2])

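### Unsupervised Learning Models: Dimensionality Reduction
Dimensionality reduction cuts down the number of features (dimensions) while retaining as much of the information in the dataset as possible. Some information is usually lost, so the aim is to discard the least useful part.

#### Techniques used for dimensionality reduction:

* Drop data columns with missing values
* Drop data columns with low variance
* Drop data columns that are highly correlated with other columns
* Apply statistical functions - PCA (Principal Component Analysis)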
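
The first three techniques can be sketched with pandas and `VarianceThreshold`. The toy DataFrame, its column names, and the cutoffs (50% missing, zero variance, 0.95 correlation) are all assumptions chosen for illustration, not recommended defaults.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
a = rng.normal(size=100)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + rng.normal(scale=0.01, size=100),  # nearly a linear copy of "a"
    "b": rng.normal(size=100),
    "constant": np.ones(100),                               # zero variance
    "mostly_missing": np.where(rng.random(100) < 0.8, np.nan, 1.0),
})

# 1. Drop columns where more than half of the values are missing
demo = demo.loc[:, demo.isna().mean() <= 0.5]

# 2. Drop columns whose variance is zero
selector = VarianceThreshold(threshold=0.0)
selector.fit(demo)
demo = demo.loc[:, selector.get_support()]

# 3. Drop one column from each pair with absolute correlation above 0.95
corr = demo.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
demo = demo.drop(columns=to_drop)

print(list(demo.columns))
```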

#### Principal component analysis (PCA)
PCA is a tool for simplifying data by reducing the number of features while keeping the most important information.

In [39]:
# import required library PCA
from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs

In [40]:
# Generate the dataset with 10 features (dimension)
X,y = make_blobs(n_samples=20, n_features=10, random_state=20)
X.shape


(20, 10)

In [41]:
# Define the PCA estimator with number of reduced components 
pca = PCA(n_components=3)

In [42]:
# Fit the data into the PCA estimator
pca.fit(X)

`pca.explained_variance_ratio_` tells you what fraction of the original dataset's variance (information) each principal component captures.

In [43]:
print(pca.explained_variance_ratio_)  

[0.5192753  0.44713253 0.0102135 ]
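
These ratios can also guide the choice of `n_components`. The sketch below regenerates the same kind of data and picks the smallest number of components whose cumulative ratio reaches a target; the 95% target and the names `cumulative` and `n_needed` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X_demo, _ = make_blobs(n_samples=20, n_features=10, random_state=20)

# Fit with all components so the full variance spectrum is visible
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3))
print(n_needed)

# PCA also accepts a float between 0 and 1 and chooses the count itself
pca_95 = PCA(n_components=0.95).fit(X_demo)
print(pca_95.n_components_)
```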


In [44]:
# Project the data onto the fitted components using the transform method
pca_reduced= pca.transform(X)
pca_reduced.shape

(20, 3)

### Pipeline
A Pipeline is a tool to streamline the process of building and evaluating machine learning models. It allows you to chain together multiple steps into a single object, making your code cleaner and easier to manage. Each step in the pipeline can include data preprocessing, dimensionality reduction, feature selection, and the final modeling.

In [45]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA

#### Chain estimators together

In [46]:
estimator = [('dim_reduction',PCA()), ('linear_model',LinearRegression())]

#### Put the chain of estimators in a pipeline object

In [47]:
pipeline_estimator = Pipeline(estimator)
pipeline_estimator

In [48]:
pipeline_estimator.steps

[('dim_reduction', PCA()), ('linear_model', LinearRegression())]
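
The pipeline above is only constructed, never fitted. As a minimal sketch of how such a pipeline might be used end to end, the example below fits it on synthetic data; the data, `n_components=3`, and the variable names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

pipe = Pipeline([
    ("dim_reduction", PCA(n_components=3)),  # n_components=3 is an arbitrary choice
    ("linear_model", LinearRegression()),
])

# fit() applies PCA's fit_transform, then fits LinearRegression on the reduced data
pipe.fit(X_demo, y_demo)

# predict() pushes new data through the same PCA projection before predicting
predictions = pipe.predict(X_demo[:5])

# Individual steps remain accessible by name
print(pipe.named_steps["dim_reduction"].n_components_)
print(predictions.shape)
```

Because the transformation and the model live in one object, the same preprocessing is applied consistently at fit and predict time.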

### Model Persistence
Model persistence is the process of saving a trained machine learning model to disk so that it can be loaded and used later without having to retrain it.

In [49]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()

In [50]:
X = iris_dataset.data
Y = iris_dataset.target

In [51]:
# Create object with new values for prediction
X_new = [[3,5,4,1],[5,3,4,2]]

In [52]:
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()

In [53]:
logreg.fit(X,Y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [54]:
logreg.predict(X_new)

array([0, 1])

#### Persistence

In [55]:
# Use joblib.dump to persist the model to a file
import joblib
joblib.dump(logreg, 'regresfilename.pkl')

['regresfilename.pkl']

In [56]:
# Create new estimator from the saved model
new_logreg_estimator = joblib.load('regresfilename.pkl')

In [57]:
# Validate and use new estimator to predict
new_logreg_estimator.predict(X_new)

array([0, 1])