In [2]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import HashingVectorizer
import matplotlib.pyplot as plt
%matplotlib inline

---
#Stochastic Gradient Descent
---

####SGDClassifier() and SGDRegressor()
- both have a .fit() method and a .partial_fit() method. The latter is for training on batches
- partial_fit() requires the classes to be declared in the call, via the classes argument (at least on the first call). The algorithm needs to know in advance all the class codes that it can expect to see during training
- The loss parameter determines the type of model that is fitted. For classification, loss can take (among others) the following values:
1. log - logistic regression (named log_loss in recent scikit-learn releases)
2. hinge - linear support vector machine

- SGDRegressor mimics linear regression when loss is set to squared_loss (renamed squared_error in recent scikit-learn releases). The huber loss switches from the squared loss to a linear loss once the error exceeds a certain distance, epsilon. SGDRegressor can also act as a linear SVM (regression) using the epsilon_insensitive loss, or squared_epsilon_insensitive (which penalizes outliers more)

- Performance of different loss functions cannot be estimated a priori.

- If you are doing classification and need estimates of class probabilities, your choice of loss is limited to log or modified_huber
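
As a minimal sketch of the points above, the cell below trains an SGDClassifier batch by batch with partial_fit(), declaring the class labels in the call, and then asks for class probabilities. The toy data, the batch size of 100 and the parameter values are assumptions made for illustration; modified_huber is used because it is one of the two losses that support probability estimates.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)

# Toy data (hypothetical): two well-separated Gaussian blobs with 5 features.
X = np.vstack([rng.normal(-2.0, 1.0, size=(500, 5)),
               rng.normal(2.0, 1.0, size=(500, 5))])
y = np.array([0] * 500 + [1] * 500)

# Shuffle once up front, since SGD prefers shuffled data.
order = rng.permutation(len(y))
X, y = X[order], y[order]

clf = SGDClassifier(loss="modified_huber", penalty="l2", alpha=1e-4,
                    random_state=0)
all_classes = np.array([0, 1])  # every label the model will ever see

# Feed the data in batches of 100 rows.
for start in range(0, len(y), 100):
    X_batch = X[start:start + 100]
    y_batch = y[start:start + 100]
    clf.partial_fit(X_batch, y_batch, classes=all_classes)

proba = clf.predict_proba(X[:3])  # one row per sample, one column per class
accuracy = clf.score(X, y)
```

Swapping loss for hinge would give a linear SVM instead, at the cost of losing predict_proba().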

#####Other key Parameters:
- n_iter: the number of iterations (passes) over the data (replaced by max_iter in later scikit-learn releases)
- penalty: l1, l2, or elasticnet
- alpha: regularization term. Higher means more regularization
- l1_ratio: only used with the elasticnet penalty, to set the balance between L1 and L2
- learning_rate: usually invscaling for regression. If you want to use invscaling for classification, set eta0 and power_t (with invscaling, eta = eta0 / t**power_t). With invscaling you can start with a learning rate that is lower than the optimal rate, but it will decrease more slowly
- epsilon: only used if your loss is huber, epsilon_insensitive or squared_epsilon_insensitive
- shuffle: if True, the algorithm shuffles the order of the training data
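
To make the invscaling schedule concrete, here is a small sketch that simply evaluates eta0 / t**power_t for a few step counts. The function name and the values eta0 = 0.01 and power_t = 0.25 are chosen for illustration only and are not taken from the notebook.

```python
def invscaling_eta(t, eta0=0.01, power_t=0.25):
    """Learning rate at update step t under the invscaling schedule."""
    return eta0 / (t ** power_t)

# Learning rate after 1, 10, 100, 1000 and 10000 updates.
steps = (1, 10, 100, 1000, 10000)
schedule = [invscaling_eta(t) for t in steps]
```

A smaller power_t makes the rate decay more slowly; power_t = 0 would keep it constant at eta0.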

fetch_20newsgroups - the training subset contains 11314 posts, with about 200 words in each post

#####Get the 20 news groups dataset

In [3]:
ngd = fetch_20newsgroups(shuffle = True, remove = ("headers", "footers", "quotes"), random_state = 6)

In [4]:
print(np.shape(ngd.data))
np.mean([len(text.split(' ')) for text in ngd.data])

(11314,)


206.15980201520242

#####Using make_classification, make a dataset with 10**7 samples and 5 features

In [6]:
#To do - copy this cell and execute

(10000000, 5) (10000000,)
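
One possible way to build such a dataset is sketched below. This is not the notebook's own cell: the n_informative, n_redundant and random_state values are assumptions, and n_samples is reduced to 10**5 so the sketch runs quickly (set it to 10**7 to match the shapes printed above, memory permitting).

```python
from sklearn.datasets import make_classification

n_samples = 10 ** 5  # reduced for a quick run; the exercise asks for 10**7

# 5 features in total; the informative/redundant split is an assumption.
X, y = make_classification(n_samples=n_samples, n_features=5,
                           n_informative=3, n_redundant=0,
                           random_state=6)

shapes = (X.shape, y.shape)
```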


---
#Scalability with Volume
---

###Strategies to manage high volumes of data without loading it all into memory:
- incrementally update the parameters of your algorithm until all the observations have been processed at least once
- use the partial_fit() method, which is available for a certain number of supervised and unsupervised algorithms
- incremental learning: feed in chunks or batches of data to start fitting your model as soon as data arrives
- incremental learning is about:
1. Batch size - this is usually memory dependent. In general the larger the better
2. Data preprocessing - feature scaling can be extremely difficult, because you don't know the range of your data ahead of time. You either complete data collection and then re-scale, or you estimate the range of the features, scale on the fly and discard any observations that exceed the anticipated range
3. Number of passes through the data required (or not) - in general it is hard, if not impossible, to make more than one pass through the data. Data order is also very important. Stochastic Gradient Descent prefers shuffled data
4. Validation and hyper-parameter tuning - more difficult than normal. Either validate in a progressive way, or hold out some observations from every chunk/batch.


#####read_csv allows iteration over the file by reading batches or chunks of observations (10000 in this case)
#####MinMaxScaler is fitted to set the range of the data once the first batch becomes available. Later batches are trimmed after scaling, so that no values exceed or fall below the min/max values set
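
The chunked workflow described here could look roughly like the sketch below. It is an illustration under stated assumptions, not the notebook's own cell: a small temporary CSV stands in for the large file, the column names and the chunk size of 1000 are hypothetical, out-of-range values are clipped rather than discarded, and each chunk is scored before the model trains on it (progressive validation).

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the large file: a small CSV in a temp directory.
X, y = make_classification(n_samples=5000, n_features=5, n_informative=3,
                           n_redundant=0, random_state=6)
frame = pd.DataFrame(X, columns=["f%d" % i for i in range(5)])
frame["target"] = y
csv_path = os.path.join(tempfile.mkdtemp(), "stream.csv")
frame.to_csv(csv_path, index=False)

scaler = MinMaxScaler()
clf = SGDClassifier(loss="hinge", random_state=6)
acc_list = []

# Stream the file 1000 rows at a time instead of loading it whole.
for i, chunk in enumerate(pd.read_csv(csv_path, chunksize=1000)):
    X_chunk = chunk.drop("target", axis=1).values
    y_chunk = chunk["target"].values

    if i == 0:
        scaler.fit(X_chunk)  # range is estimated from the first chunk only

    # Scale, then clip so later chunks never leave the [0, 1] range.
    X_scaled = np.clip(scaler.transform(X_chunk), 0.0, 1.0)

    if i > 0:
        # Test on the new chunk before training on it.
        acc_list.append(clf.score(X_scaled, y_chunk))

    clf.partial_fit(X_scaled, y_chunk, classes=np.array([0, 1]))
```

Scoring each chunk before fitting on it gives an accuracy estimate on data the model has not yet seen, without setting aside a separate validation file.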

#####Now run through the file by chunking the stream coming from the file

In [61]:
#To do - copy this cell and execute

In [43]:
clf.predict([[0.65, 0.36, 0.49, 0.51, 0.52]])

array([ 1.])

In [62]:
#fig = plt.figure(figsize = (15, 5))
#ax = plt.subplot(111)
#ax.plot(acc_list)
#ax.set_title("SGD Logistic Regression Accuracy")
#ax.set_ylabel("Accuracy")
#ax.set_xlabel("Number of Training Samples (x10000)")

---
##Algorithm Choices offering Partial Fit

###- MultinomialNB
###- BernoulliNB
###- SGDClassifier
###- SGDRegressor


- smaller batches tend to be slower overall, because the bottleneck is usually disk or data access rather than computation

#####Handling data variability using hashing
- hash functions map any input they receive, whether it be numerical or string input, to an output in a deterministic way
- they return an integer within a certain range
- hashing is extremely fast and efficient
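
A short sketch of these three properties follows. It uses zlib.crc32 purely as a convenient deterministic stand-in hash; the bucket function and the range size of 2**20 are hypothetical names and values chosen for illustration.

```python
import zlib

N_BUCKETS = 2 ** 20  # hypothetical size of the output range


def bucket(token, n_buckets=N_BUCKETS):
    """Map a string to an integer in [0, n_buckets) deterministically."""
    # The mask keeps the checksum unsigned on Python 2 as well as Python 3.
    checksum = zlib.crc32(token.encode("utf-8")) & 0xffffffff
    return checksum % n_buckets


tokens = ["space", "galaxy", "space", "a token never seen before"]
positions = [bucket(t) for t in tokens]
```

The same token always lands in the same position, and a token that has never been seen before still gets a valid position, so no vocabulary has to be built in advance.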

#####Sparse matrices
- as we have already seen, sparse matrices only hold non-zero values
- a sparse matrix has a default value of zero
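
As a small illustration, the sketch below stores a mostly-zero array in SciPy's CSR format. The array shape and values are made up for the example.

```python
import numpy as np
from scipy import sparse

# A 3 x 1000 array in which only two entries are non-zero.
dense = np.zeros((3, 1000))
dense[0, 5] = 1.0
dense[2, 999] = 2.5

sp = sparse.csr_matrix(dense)

stored = sp.nnz        # number of values actually held in memory
missing = sp[1, 1]     # an entry that was never stored reads back as zero
```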

####hashing therefore bounds every input to a certain range, that is, to a position in a corresponding sparse matrix
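
For text, scikit-learn's HashingVectorizer (imported at the top of this notebook) combines the two ideas. The sketch below is an assumption about how it might be set up here: the n_features value of 2**10 and the two example sentences are illustrative only.

```python
from scipy import sparse
from sklearn.feature_extraction.text import HashingVectorizer

# Every token is hashed into one of 2**10 columns (an illustrative size).
my_hash = HashingVectorizer(n_features=2 ** 10)

docs = ["Space, the final frontier", "A red hybrid car"]

# The vectorizer is stateless, so transform() can be called without fit().
X_text = my_hash.transform(docs)

is_sparse = sparse.issparse(X_text)
```

Because the number of columns is fixed up front, each new batch of text maps onto the same feature space, which is what partial_fit() needs.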

In [50]:
#To do - copy this cell and execute

In [51]:
#To do - copy this cell and execute

In [63]:
#To do - copy this cell and execute

#####Try some predictions

In [60]:
new_text_A = [' A 2014 red Toyota Prius v Five with fewer than 14K miles. \
Powered by a reliable 1.8L four cylinder hybrid engine that averages 44mpg in the city and 40mpg on the highway.']

new_text_B = ['There always seems to be something unusual about the political class. \
The GOP and Democrats are poles apart in ideology']

new_text_C = ['Space, the final frontier. When one considers the number of stars within the milky way galaxy, \
then it is hard not to conceive that there may be life out there']

new_text_vector = my_hash.transform(new_text_C)

yhat = clf.predict(new_text_vector)
print(ngd.target_names[yhat[0]])

sci.space
