# Playing with Scikit-learn

## Defining applications for data science

* http://scikit-learn.org/stable/developers/
* http://scikit-learn.org/stable/faq.html

The California Housing dataset describes housing in California at the level of census block groups. It is derived from the 1990 U.S. census, was obtained from the StatLib repository, and is often used as a benchmark for regression. Its features include median income, house age, population, latitude, and longitude, among others. The usual goal is to predict the median house value of a block group from those features.

In the California housing dataset, the target is the median house value for each block group, which ranges from $14,999 to $500,001.

* The dataset consists of 20,640 observations on housing prices from the 1990 California census.
* It contains eight input features for each block group: median income, median house age, average number of rooms and of bedrooms per household, population, average household occupancy, latitude, and longitude.
* The target variable is the median house value for each block group.
* The data contains no missing values, and scikit-learn expresses the target in units of $100,000, so its values range from about 0.15 to 5.0.
* The dataset is commonly used in machine learning regression tasks, as it provides a good opportunity to practice feature engineering and regression modeling.
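
Because scikit-learn stores the target in units of $100,000, it can help to convert values back to dollars when reading results. The following is a minimal sketch; the helper name `to_usd` is hypothetical and not part of scikit-learn.

```python
import numpy as np

# scikit-learn stores the California housing target in units of 100,000 USD.
UNIT_USD = 100_000

def to_usd(target_units):
    """Convert target values from units of 100,000 USD to plain USD."""
    return np.asarray(target_units, dtype=float) * UNIT_USD

# The smallest and largest target values, as stored by scikit-learn.
extremes = to_usd([0.14999, 5.00001])
print(extremes.round(0))
```

The two converted values correspond to the $14,999 and $500,001 limits mentioned above.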

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing

def load_california_housing_data():
    dataset = fetch_california_housing()
    X = pd.DataFrame(data=dataset.data, 
                     columns=dataset.feature_names)
    y = pd.Series(data=dataset.target, name="target")
    return X, y

X, y = load_california_housing_data()
print(f"X:{X.shape} y:{y.shape}")

X:(20640, 8) y:(20640,)


In [2]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)
scaled_X = scaler.transform(X)
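
The scaling step above standardizes each column to zero mean and unit variance. As a rough illustration on synthetic data (the array `toy` is made up for this sketch), the same result can be reproduced by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic two-column data with different means and spreads.
rng = np.random.default_rng(0)
toy = rng.normal(loc=[10.0, -3.0], scale=[2.0, 0.5], size=(100, 2))

# Standardize with scikit-learn.
toy_scaler = StandardScaler().fit(toy)
by_sklearn = toy_scaler.transform(toy)

# Standardize by hand: subtract the column mean, divide by the
# population standard deviation (ddof=0), which StandardScaler uses.
by_hand = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(by_sklearn, by_hand))
```

Fitting once and reusing the fitted scaler, as the notebook does, ensures that new observations are scaled with the statistics of the training data.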

In [3]:
from sklearn.linear_model import LinearRegression

linear_regression = LinearRegression()
linear_regression.fit(scaled_X, y)
print(linear_regression.coef_.round(5))

[ 0.82962  0.11875 -0.26553  0.3057  -0.0045  -0.03933 -0.89989 -0.87054]


In [4]:
values = [[1.21315, 32., 3.31767135, 1.07731985, 898., 2.1424809, 37.82, -122.27]]
obs = pd.DataFrame(values, columns=X.columns)

scaled_obs = scaler.transform(obs)

pred = linear_regression.predict(scaled_obs)
# The target is expressed in units of 100,000 USD.
value = pred[0] * 100_000
print(f"Estimated median house value: {value:.2f} USD")

Estimated median house value: 141088.56 USD


In [5]:
linear_regression.score(scaled_X, y)

0.606232685199805
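
The value returned by `score` for a regressor is the coefficient of determination, R². Here it is computed on the same data used for fitting, so it likely overstates performance on unseen data. The sketch below, on synthetic data invented for illustration, shows how R² can be computed by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem: linear signal plus noise.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 3))
toy_y = toy_X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

toy_model = LinearRegression().fit(toy_X, toy_y)
toy_pred = toy_model.predict(toy_X)

# R^2 = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((toy_y - toy_pred) ** 2)
ss_tot = np.sum((toy_y - toy_y.mean()) ** 2)
r2_by_hand = 1.0 - ss_res / ss_tot

print(np.isclose(r2_by_hand, toy_model.score(toy_X, toy_y)))
```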

# Using Transformative Functions

## Handling heterogeneous data

In [6]:
from sklearn.compose \
    import ColumnTransformer, make_column_selector
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing \
    import StandardScaler, KBinsDiscretizer
from sklearn.linear_model import LinearRegression

X, y = load_california_housing_data()

In [7]:
num_cols = ['MedInc', 'HouseAge', 'AveRooms', 
            'AveBedrms', 'Population', 'AveOccup']
cords = ['Latitude', 'Longitude']

num_transformer = ColumnTransformer([
    ("scaler", StandardScaler(), num_cols)], 
    remainder="drop")

cords_transformer = ColumnTransformer([
    ("discretizer", 
     KBinsDiscretizer(n_bins=20, encode="onehot-dense"), 
     cords)])

In [8]:
preprocessor = FeatureUnion(
    transformer_list=[("num_transformer", 
                       num_transformer), 
                      ("cords_transformer", 
                       cords_transformer)])

In [9]:
preprocessor.fit_transform(X).shape

(20640, 46)
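
The 46 output columns come from 6 scaled numeric features plus 20 one-hot bins for each of the two coordinates. As a sketch of an alternative layout, a single `ColumnTransformer` can hold both transformers without a `FeatureUnion`. The data and column names below are synthetic and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

# Synthetic stand-in for the housing data: 6 numeric columns, 2 coordinates.
toy_num_cols = [f"n{i}" for i in range(6)]
toy_cords = ["lat", "lon"]
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.uniform(size=(500, 8)),
                   columns=toy_num_cols + toy_cords)

# One ColumnTransformer applies both steps and concatenates the results.
combined = ColumnTransformer([
    ("scaler", StandardScaler(), toy_num_cols),
    ("discretizer",
     KBinsDiscretizer(n_bins=20, encode="onehot-dense"),
     toy_cords)])

toy_out = combined.fit_transform(toy)

# Width = numeric columns + total number of bins actually fitted.
n_bins_total = int(combined.named_transformers_["discretizer"].n_bins_.sum())
print(toy_out.shape, len(toy_num_cols) + n_bins_total)
```

Both layouts produce the same kind of feature matrix; keeping separate transformers and joining them with `FeatureUnion`, as the notebook does, can be convenient when each branch grows into its own pipeline.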

In [10]:
predictive_pipeline = Pipeline([
    ("preprocessor", preprocessor), 
    ("model", LinearRegression())])

In [11]:
predictive_pipeline.fit(X, y)
predictive_pipeline.score(X, y)

0.6667461802611925

# Considering Timing and Performance

## Benchmarking with timeit

In [12]:
%timeit l = [k for k in range(10**6)]

69.4 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [13]:
%timeit -n 20 -r 5 l = [k for k in range(10**6)]

70.3 ms ± 1.17 ms per loop (mean ± std. dev. of 5 runs, 20 loops each)


In [14]:
%%timeit 
l = list()
for k in range(10**6):
    l.append(k)

112 ms ± 1.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [15]:
import sklearn.feature_extraction.text as txt
count_vectorizer = txt.CountVectorizer(
    binary=True, max_features=20)

texts = ["Python for data science", 
         "Python for machine learning",
         "Artificial intelligence in Python"]

count_vectorizer.fit(texts)
vectorized = count_vectorizer.transform(texts)


In [16]:
%timeit count_vectorizer.fit(texts)

575 µs ± 6.21 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [17]:
%timeit vectorized = count_vectorizer.transform(texts)

131 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [18]:
import timeit
cumulative_time = timeit.timeit(
    "vectorized = count_vectorizer.transform(texts)", 
    setup="from __main__ import count_vectorizer, texts", 
    number=10000)
print(cumulative_time / 10000.0)

0.0001340000200085342
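
Outside IPython, the `timeit` module can also time a callable directly, which avoids the `from __main__ import` setup string. The following sketch uses `timeit.repeat` to mimic the repeated runs that `%timeit` reports; the function `build_list` is a made-up example.

```python
import timeit

def build_list(n=10**5):
    """Build a list of n integers with a list comprehension."""
    return [k for k in range(n)]

# 5 independent runs, each calling build_list 20 times.
runs = timeit.repeat(build_list, repeat=5, number=20)

# Each entry is the total time of one run; divide by the number of calls.
best_per_call = min(runs) / 20
print(f"best of 5 runs: {best_per_call * 1e3:.3f} ms per call")
```

Taking the minimum of the runs is a common choice, because slower runs usually reflect interference from other processes rather than the code itself.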


## Working with the memory profiler

In [19]:
# Installation procedures
import sys
!{sys.executable} -m pip install memory_profiler

Collecting memory_profiler
  Using cached memory_profiler-0.61.0-py3-none-any.whl (31 kB)
Installing collected packages: memory_profiler
Successfully installed memory_profiler-0.61.0


In [20]:
# Initialization from IPython (to be repeated at every IPython start)
%load_ext memory_profiler

In [21]:
vectorized = count_vectorizer.transform(texts)
%memit dense_hashing = vectorized.toarray()

peak memory: 156.68 MiB, increment: 0.07 MiB


In [22]:
%%writefile example_code.py

import sklearn.feature_extraction.text as txt

def comparison_test(text):    
    count_vectorizer = txt.CountVectorizer(
        binary=True, max_features=20)
    count_vectorizer.fit(text)
    vectorized = count_vectorizer.transform(text)
    return vectorized.toarray()

Overwriting example_code.py


In [23]:
from example_code import comparison_test

texts = ["Python for data science", 
         "Python for machine learning",
         "Artificial intelligence in Python"]

%mprun -f comparison_test comparison_test(texts)
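
When `memory_profiler` is not available, the standard library's `tracemalloc` module offers a rough substitute. It is a different tool: it tracks only memory allocated by Python, so its numbers are not comparable to the process-level figures of `%memit`. A minimal sketch, with a made-up function `make_squares`:

```python
import tracemalloc

def make_squares(n=10**5):
    """Allocate a list of n squared integers."""
    return [k * k for k in range(n)]

# Start tracing Python memory allocations.
tracemalloc.start()
data = make_squares()

# Sizes are in bytes: currently traced memory and the peak since start().
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current / 2**20:.2f} MiB, peak: {peak / 2**20:.2f} MiB")
```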




# Running in Parallel on Multiple Cores

## Demonstrating multiprocessing

In [24]:
from sklearn.datasets import load_digits
digits = load_digits()

X, y = digits.data, digits.target
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

In [26]:
%timeit single_core = cross_val_score( \
    SVC(), X, y, cv=20, n_jobs=1)

1.67 s ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [27]:
%timeit multi_core = cross_val_score( \
    SVC(), X, y, cv=20, n_jobs=-1)

436 ms ± 6.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [28]:
%timeit multi_core = cross_val_score( \
    SVC(), X, y, cv=20, n_jobs=-2)

438 ms ± 5.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
