forked from IndicoDataSolutions/Passage
Commit
Merge branch 'master' of https://github.com/gchrupala/Passage
Showing 9 changed files with 192 additions and 15 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
Alec Radford <alec@indico.io> | ||
Madison May <madison@indico.io> | ||
Slater Victoroff <slater@indico.io> | ||
Grzegorz Chrupala <pitekus@gmail.com> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
v0.2.1, Mon Feb 16 -- Added pip package, updated README | ||
v0.2.2, Tue Feb 24 -- Updated readme, added readme to pip page | ||
v0.2.3, Tue Feb 24 -- Added setup.cfg to properly handle markdown readme on pypi page | ||
v0.2.4, Tue Feb 24 -- Changed to legitimate .rst README, removed setup.cfg |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,79 @@ | ||
Passage | ||
======= | ||
|
||
A little library for text analysis with RNNs. | ||
|
||
Warning: very alpha, work in progress. | ||
|
||
Install | ||
------- | ||
|
||
via Github (version under active development) | ||
|
||
:: | ||
|
||
    git clone http://github.com/IndicoDataSolutions/passage.git | ||
    cd passage | ||
    python setup.py develop | ||
|
||
or via pip | ||
|
||
:: | ||
|
||
    sudo pip install passage | ||
|
||
Example | ||
------- | ||
|
||
Using Passage to do binary classification of text, this example: | ||
|
||
- Tokenizes some training text, converting it to a format Passage can | ||
use. | ||
- Defines the model's structure as a list of layers. | ||
- Creates the model with that structure and a cost to be optimized. | ||
- Trains the model for one iteration over the training text. | ||
- Uses the model and tokenizer to predict on new text. | ||
- Saves and loads the model. | ||
|
||
:: | ||
|
||
    from passage.preprocessing import Tokenizer | ||
    from passage.layers import Embedding, GatedRecurrent, Dense | ||
    from passage.models import RNN | ||
    from passage.utils import save, load | ||
|
||
    tokenizer = Tokenizer() | ||
    train_tokens = tokenizer.fit_transform(train_text) | ||
|
||
    layers = [ | ||
        Embedding(size=128, n_features=tokenizer.n_features), | ||
        GatedRecurrent(size=128), | ||
        Dense(size=1, activation='sigmoid') | ||
    ] | ||
|
||
    model = RNN(layers=layers, cost='BinaryCrossEntropy') | ||
    model.fit(train_tokens, train_labels) | ||
|
||
    model.predict(tokenizer.transform(test_text)) | ||
    save(model, 'save_test.pkl') | ||
    model = load('save_test.pkl') | ||
|
||
Where: | ||
|
||
- train\_text is a list of strings ['hello world', 'foo bar'] | ||
- train\_labels is a list of labels [0, 1] | ||
- test\_text is another list of strings | ||
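To make the tokenization step concrete, here is a minimal sketch of the general idea of turning a list of strings into lists of integer ids. `ToyTokenizer` is a hypothetical illustration written for this note. It is not Passage's `Tokenizer`, and it makes no claim about how that class handles casing, punctuation, or rare words. Reserving id 0 for unseen tokens is likewise an assumption of the sketch.

```python
from collections import Counter


class ToyTokenizer(object):
    """Hypothetical stand-in: maps whitespace-separated tokens to integer ids."""

    UNK = 0  # id reserved for tokens not seen during fit (an assumption)

    def __init__(self, min_df=1):
        self.min_df = min_df  # minimum number of documents a token must appear in
        self.vocab = {}
        self.n_features = 1

    def fit(self, texts):
        # Count how many documents each token appears in.
        df = Counter()
        for text in texts:
            df.update(set(text.lower().split()))
        kept = sorted(tok for tok, n in df.items() if n >= self.min_df)
        # Ids start at 1 because 0 is reserved for unknown tokens.
        self.vocab = dict((tok, i + 1) for i, tok in enumerate(kept))
        self.n_features = len(self.vocab) + 1
        return self

    def transform(self, texts):
        return [[self.vocab.get(tok, self.UNK) for tok in text.lower().split()]
                for text in texts]

    def fit_transform(self, texts):
        return self.fit(texts).transform(texts)


train_text = ['hello world', 'foo bar']
train_labels = [0, 1]

toy = ToyTokenizer()
train_tokens = toy.fit_transform(train_text)   # one id sequence per string
test_tokens = toy.transform(['hello unseen'])  # unseen words map to UNK
```

Each string becomes one variable-length id sequence, and `n_features` is the vocabulary size that an embedding layer would need to cover.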
|
||
Datasets | ||
-------- | ||
|
||
Without sizeable datasets, RNNs have difficulty achieving results better | ||
than traditional sparse linear models. Below are a few appropriately | ||
sized datasets that are useful for experimentation. Hopefully this list | ||
will grow over time; please feel free to propose new datasets for | ||
inclusion through either an issue or a pull request. | ||
|
||
**Note**: **None of these datasets were created by indico, nor | ||
should their inclusion here indicate any kind of endorsement.** |
|
||
Blogger Dataset: http://www.cs.biu.ac.il/~koppel/blogs/blogs.zip (Age | ||
and gender data) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
**Passage Examples** | ||
=================== | ||
[Slide Deck](https://docs.google.com/presentation/d/1HYfUZLRZRJovQpv5mYxox9bz9erxj7Ak_ZovENMvM90/edit?usp=sharing) & [Video](https://www.youtube.com/watch?v=VINCQghQRuM) | ||
|
||
<a href="https://www.youtube.com/watch?v=VINCQghQRuM"><img src="http://i.imgur.com/bJC0pjy.png" height="300"></a> | ||
|
||
[Passage Gender Classification](https://github.com/IndicoDataSolutions/Passage/blob/master/examples/gender.py) With [Blogger Dataset](http://goo.gl/EbWA1u) | ||
|
||
<a href="https://github.com/IndicoDataSolutions/Passage/blob/master/examples/gender.py"><img src="http://i.imgur.com/cEmonmC.jpg" height="300"></a> | ||
|
||
[Passage Newsgroup Classification Example](https://github.com/IndicoDataSolutions/Passage/blob/master/examples/newsgroup.py) | ||
|
||
<a href="https://github.com/IndicoDataSolutions/Passage/blob/master/examples/newsgroup.py"><img src="http://i.imgur.com/ByTczHW.jpg" height="300"></a> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
from sklearn.datasets import fetch_20newsgroups | ||
categories = ['alt.atheism', 'sci.space'] | ||
newsgroups_train = fetch_20newsgroups(subset='train', | ||
                                      remove=('headers', 'footers', 'quotes'), | ||
                                      categories=categories) | ||
newsgroups_test = fetch_20newsgroups(subset='test', | ||
                                     remove=('headers', 'footers', 'quotes'), | ||
                                     categories=categories) | ||
|
||
print("%d %d" % (len(newsgroups_train.data), len(newsgroups_test.data))) | ||
|
||
from sklearn import metrics | ||
from passage.preprocessing import Tokenizer | ||
from passage.layers import Embedding, GatedRecurrent, Dense | ||
from passage.models import RNN | ||
from passage.utils import save | ||
|
||
tokenizer = Tokenizer(min_df=10, max_features=50000) | ||
X_train = tokenizer.fit_transform(newsgroups_train.data) | ||
X_test = tokenizer.transform(newsgroups_test.data) | ||
Y_train = newsgroups_train.target | ||
Y_test = newsgroups_test.target | ||
|
||
print(tokenizer.n_features) | ||
|
||
layers = [ | ||
    Embedding(size=128, n_features=tokenizer.n_features), | ||
    GatedRecurrent(size=256, activation='tanh', gate_activation='steeper_sigmoid', | ||
                   init='orthogonal', seq_output=False), | ||
    Dense(size=1, activation='sigmoid', init='orthogonal') # sigmoid for binary classification | ||
] | ||
|
||
model = RNN(layers=layers, cost='bce') # 'bce' (binary cross-entropy): loss for binary classification with a sigmoid output | ||
for i in range(2): | ||
    model.fit(X_train, Y_train, n_epochs=1) | ||
    tr_preds = model.predict(X_train[:len(Y_test)]) | ||
    te_preds = model.predict(X_test) | ||
|
||
    tr_acc = metrics.accuracy_score(Y_train[:len(Y_test)], tr_preds > 0.5) | ||
    te_acc = metrics.accuracy_score(Y_test, te_preds > 0.5) | ||
|
||
    print("%d %s %s" % (i, tr_acc, te_acc)) # dataset too small to fully utilize Passage | ||
|
||
save(model, 'model.pkl') |
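The evaluation step in the loop above thresholds the sigmoid outputs at 0.5 before scoring them. The sketch below shows that computation in plain Python, together with the binary cross-entropy that the name `'bce'` conventionally refers to. The function names and the sample numbers are made up for illustration; this is neither Passage's nor scikit-learn's code, and the clipping constant `eps` is an assumption of the sketch.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps clips probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy_at_threshold(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum(1 for y, p in zip(y_true, y_prob) if (p > threshold) == bool(y))
    return hits / float(len(y_true))


# Made-up labels and predicted probabilities, for illustration only.
labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]

acc = accuracy_at_threshold(labels, probs)   # two of the four predictions match
loss = binary_cross_entropy(labels, probs)   # lower is better
```

Accuracy only looks at which side of the threshold each prediction falls on, while the cross-entropy also penalizes confident mistakes, which is why the two can move differently between epochs.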