renamed all dspy to rosetta

dkrasner committed Nov 10, 2013
1 parent b13b5e0 commit c0b10bd7260ed997414fb502afc4e5fbe80cb552
Showing with 6,233 additions and 50 deletions.
  1. +6 −6 CONTRIBUTING.md
  2. +10 −10 LICENSE.txt
  3. +7 −7 MANIFEST.in
  4. +7 −7 README.md
  5. +7 −7 examples/vw_helpers.md
  6. +5 −5 makefile
  7. +1 −0 rosetta/__init__.py
  8. 0 rosetta/cmd/__init__.py
  9. +27 −0 rosetta/cmd/bashrc_additions
  10. +80 −0 rosetta/cmd/concat_csv.py
  11. +126 −0 rosetta/cmd/cut.py
  12. +149 −0 rosetta/cmd/files_to_vw.py
  13. +47 −0 rosetta/cmd/filter_sfile.py
  14. +149 −0 rosetta/cmd/join_csv.py
  15. +116 −0 rosetta/cmd/row_filter.py
  16. +184 −0 rosetta/cmd/split.py
  17. +121 −0 rosetta/cmd/subsample.py
  18. +233 −0 rosetta/common.py
  19. +42 −0 rosetta/common_abc.py
  20. +88 −0 rosetta/common_math.py
  21. 0 rosetta/modeling/__init__.py
  22. +53 −0 rosetta/modeling/categorical_fitter.py
  23. +285 −0 rosetta/modeling/eda.py
  24. +348 −0 rosetta/modeling/fitting.py
  25. +271 −0 rosetta/modeling/prediction_plotter.py
  26. +154 −0 rosetta/modeling/var_create.py
  27. 0 rosetta/parallel/__init__.py
  28. +182 −0 rosetta/parallel/pandas_easy.py
  29. +346 −0 rosetta/parallel/parallel_easy.py
  30. +295 −0 rosetta/tests/test_cmd.py
  31. +37 −0 rosetta/tests/test_common.py
  32. +181 −0 rosetta/tests/test_parallel.py
  33. +392 −0 rosetta/tests/test_text.py
  34. 0 rosetta/text/__init__.py
  35. +6 −0 rosetta/text/api.py
  36. +153 −0 rosetta/text/filefilter.py
  37. +210 −0 rosetta/text/gensim_helpers.py
  38. +95 −0 rosetta/text/nlp.py
  39. +335 −0 rosetta/text/streamers.py
  40. +838 −0 rosetta/text/text_processors.py
  41. +419 −0 rosetta/text/vw_helpers.py
  42. 0 rosetta/workflow/__init__.py
  43. +220 −0 rosetta/workflow/topic_seek.py
  44. +8 −8 setup.py
CONTRIBUTING.md
@@ -9,17 +9,17 @@ likelihood of your contribution being merged.**
How to contribute
-----------------
The preferred way to contribute to dspy is to fork the
[project repository](https://github.com/columbia-applied-data-science/dspy/) on
The preferred way to contribute to rosetta is to fork the
[project repository](https://github.com/columbia-applied-data-science/rosetta/) on
GitHub:
1. Fork the [project repository](https://github.com/columbia-applied-data-science/dspy/):
1. Fork the [project repository](https://github.com/columbia-applied-data-science/rosetta/):
click on the 'Fork' button near the top of the page. This creates
a copy of the code under your account on the GitHub server.
2. Clone this copy to your local disk:
$ git clone git@github.com:YourLogin/dspy.git
$ git clone git@github.com:YourLogin/rosetta.git
3. Create a branch to hold your changes:
@@ -37,7 +37,7 @@ GitHub:
$ git push -u origin my-feature
Finally, go to the web page of the your fork of the dspy repo,
Finally, go to the web page of the your fork of the rosetta repo,
and click 'Pull request' to send your changes to the maintainers for
review. request. This will send an email to the committers.
@@ -54,7 +54,7 @@ following rules before submitting a pull request:
example script in the ``examples/`` folder. Have a look at other
examples for reference. Examples should demonstrate why the new
functionality is useful in practice and, if possible, compare it
to other methods available in dspy.
to other methods available in rosetta.
- At least one paragraph of narrative documentation with links to
```` references in the literature (with PDF links when possible) and
LICENSE.txt
@@ -2,10 +2,10 @@
License
=======
DSpy is distributed under a 3-clause ("Simplified" or "New") BSD
Rosetta is distributed under a 3-clause ("Simplified" or "New") BSD
license.
DSpy license
Rosetta license
==============
Redistribution and use in source and binary forms, with or without
@@ -40,28 +40,28 @@ About the Copyright Holders
===========================
The core team that coordinates development on GitHub can be found here:
http://github.com/columbia-applied-data-science/DSpy
http://github.com/columbia-applied-data-science/Rosetta
Full credits for DSpy contributors can be found in the documentation.
Full credits for Rosetta contributors can be found in the documentation.
Our Copyright Policy
====================
DSpy uses a shared copyright model. Each contributor maintains copyright
over their contributions to DSpy. However, it is important to note that
Rosetta uses a shared copyright model. Each contributor maintains copyright
over their contributions to Rosetta. However, it is important to note that
these contributions are typically only changes to the repositories. Thus,
the DSpy source code, in its entirety, is not the copyright of any single
the Rosetta source code, in its entirety, is not the copyright of any single
person or institution. Instead, it is the collective copyright of the
entire DSpy Development Team. If individual contributors want to maintain
entire Rosetta Development Team. If individual contributors want to maintain
a record of what changes/contributions they have specific copyright on,
they should indicate their copyright in the commit message of the change
when they commit the change to one of the DSpy repositories.
when they commit the change to one of the Rosetta repositories.
With this in mind, the following banner should be used in any source code
file to indicate the copyright and license terms:
#-----------------------------------------------------------------------------
# Copyright (c) 2013, DSpy Development Team
# Copyright (c) 2013, Rosetta Development Team
# All rights reserved.
#
# Distributed under the terms of the BSD Simplified License.
MANIFEST.in
@@ -1,10 +1,10 @@
recursive-include dspy *
recursive-include dspy/cmd *
recursive-include dspy/modeling *
recursive-include dspy/parallel *
recursive-include dspy/tests *
recursive-include dspy/text *
recursive-include dspy/workflow *
recursive-include rosetta *
recursive-include rosetta/cmd *
recursive-include rosetta/modeling *
recursive-include rosetta/parallel *
recursive-include rosetta/tests *
recursive-include rosetta/text *
recursive-include rosetta/workflow *
include MANIFEST.in
include LICENSE
README.md
@@ -1,4 +1,4 @@
DSpy
Rosetta
====
Tools for data science with a focus on text processing.
@@ -32,7 +32,7 @@ See the `examples/` directory for more details.
Install
-------
Check out the dev branch or a tagged release from the [dspyrepo][dspyrepo]. Then (so long as you have `pip`).
Check out the dev branch or a tagged release from the [rosettarepo][rosettarepo]. Then (so long as you have `pip`).
make
make test
@@ -44,11 +44,11 @@ Development
You can check the latest sources with
git clone git://github.com/columbia-applied-data-science/dspy
git clone git://github.com/columbia-applied-data-science/rosetta
### Contributing
Feel free to contribute a bug report or a request by opening an [issue](https://github.com/columbia-applied-data-science/dspy/issues)
Feel free to contribute a bug report or a request by opening an [issue](https://github.com/columbia-applied-data-science/rosetta/issues)
Before contributing code, read `CONTRIBUTING.md`
@@ -57,12 +57,12 @@ Dependencies
Testing
-------
From the base repo directory, `dspy/`, you can run all tests with
From the base repo directory, `rosetta/`, you can run all tests with
make test
History
-------
The *DS* in DSpy clearly relates to *Data Science*. However, it came first from *Data Structure* and the *Dead Sea*. The tools concentrate on streaming text, and the dead sea scrolls are the most famous version of text in a stream (a lake actually...but just pretend and it's really cool).
The *DS* in Rosetta clearly relates to *Data Science*. However, it came first from *Data Structure* and the *Dead Sea*. The tools concentrate on streaming text, and the dead sea scrolls are the most famous version of text in a stream (a lake actually...but just pretend and it's really cool).
[dspyrepo]: https://github.com/columbia-applied-data-science/dspy
[rosettarepo]: https://github.com/columbia-applied-data-science/rosetta
examples/vw_helpers.md
@@ -1,9 +1,9 @@
Working with Vowpal Wabbit (VW)
===============================
To work with the `dspy` utilities you need to:
To work with the `rosetta` utilities you need to:
* Clone the [dspy repo][dspyrepo] and read `README.md`.
* Clone the [rosetta repo][rosettarepo] and read `README.md`.
Create the sparse file (sfile)
------------------------------
@@ -21,15 +21,15 @@ The `TextFileStreamer` needs a method to convert the text files to a list of str
Once you have a tokenizer, just initialize a streamer and write the VW file.
```python
from dspy import TextFileStreamer, TokenizerBasic
from rosetta import TextFileStreamer, TokenizerBasic
my_tokenizer = TokenizerBasic()
stream = TextFileStreamer(text_base_path='bodyfiles', tokenizer=my_tokenizer)
stream.to_vw('doc_tokens.vw', n_jobs=-1)
```
### Method 2: `files_to_vw.py`
`files_to_vw.py` is a fast and simple command line utility for converting files to VW format. Installing `dspy` will put these utilities in your path.
`files_to_vw.py` is a fast and simple command line utility for converting files to VW format. Installing `rosetta` will put these utilities in your path.
* Try converting the first 5 files in `my_base_path`. The following should print 5 lines of of results, in [vw format][vwinput]
@@ -150,7 +150,7 @@ The python function `filter_sfile.py` takes in `ddrs.vw` and streams a filtered
You can view the topics and predictions with this:
```python
from dspy.text.vw_helpers import LDAResults
from rosetta.text.vw_helpers import LDAResults
num_topics = 5
lda = LDAResults('topics.dat', 'prediction.dat', num_topics, 'sff_file.pkl')
lda.print_topics()
@@ -195,9 +195,9 @@ Contribute!
[vwinput]: https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format
[dspyrepo]: https://github.com/columbia-applied-data-science/dspy
[rosettarepo]: https://github.com/columbia-applied-data-science/rosetta
[vwlda]: https://github.com/JohnLangford/vowpal_wabbit/wiki/lda.pdf
[vwtricks]: www.slideshare.net/jakehofman/technical-tricks-of-vowpal-wabbit‎
[hashing]: https://github.com/JohnLangford/vowpal_wabbit/wiki/Feature-Hashing-and-Extraction
[spot]: http://en.wikipedia.org/wiki/Single_Point_of_Truth
[issue]: https://github.com/columbia-applied-data-science/dspy/issues
[issue]: https://github.com/columbia-applied-data-science/rosetta/issues
makefile
@@ -6,7 +6,7 @@ PYTHON ?= python
UNITTEST ?= unittest
CTAGS ?= ctags
TESTDIR=dspy/tests
TESTDIR=rosetta/tests
all: install test
@@ -18,7 +18,7 @@ install: clean
# Reinstall with pip
reinstall: clean
pip uninstall dspy
pip uninstall rosetta
$(PYTHON) setup.py sdist
pip install dist/*
@@ -50,14 +50,14 @@ test-cmd:
$(PYTHON) -m $(UNITTEST) discover -s $(TESTDIR) -p '*cmd*' -v
trailing-spaces:
find dspy -name "*.py" | xargs perl -pi -e 's/[ \t]*$$//'
find rosetta -name "*.py" | xargs perl -pi -e 's/[ \t]*$$//'
ctags:
# make tags for symbol based navigation in emacs and vim
# Install with: sudo apt-get install exuberant-ctags
$(CTAGS) -R *
code-analysis:
flake8 dspy | grep -v __init__ | grep -v external
pylint -E -i y dspy/ -d E1103,E0611,E1101
flake8 rosetta | grep -v __init__ | grep -v external
pylint -E -i y rosetta/ -d E1103,E0611,E1101
rosetta/__init__.py
@@ -0,0 +1 @@
from rosetta.text.api import *
rosetta/cmd/__init__.py
No changes.
rosetta/cmd/bashrc_additions
@@ -0,0 +1,27 @@
# Additions to your bashrc
#
#
###############################################################################
# INSTALLATION
###############################################################################
# Put desired sections in your ~/.bashrc (or ~/.bash_profile on macs) and then
# "source it" or close then open a new terminal.
#
###############################################################################
# Body function
###############################################################################
# This allows you to run a command on the body of the function, skipping the header
# (but still printing the header). For example,
#
# $ cat filewithheader | body sort -k1,1
#
# will sort filewithheader, using the first field, but leave the header at the top
# of the file.
body() {
IFS= read -r header
printf '%s\n' "$header"
"$@"
}
export -f body
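
The `body` helper above passes the first input line (the header) through untouched and hands the remaining lines to the wrapped command. The same idea can be sketched in Python as below. This is only an illustration: `apply_to_body` is a hypothetical name and is not part of rosetta.

```python
def apply_to_body(lines, transform):
    """Return the header line followed by transform(remaining lines).

    Hypothetical helper mirroring the shell `body` function: the first
    line is kept in place, and only the rest is passed to `transform`.
    """
    lines = iter(lines)
    try:
        header = next(lines)
    except StopIteration:
        # Empty input: nothing to print, nothing to transform
        return []
    return [header] + list(transform(lines))


# Sort the data rows while leaving the header at the top
rows = ["name,score", "carol,3", "alice,1", "bob,2"]
result = apply_to_body(rows, sorted)
```

Here `sorted` plays the role of `sort -k1,1` in the shell example; any callable that accepts an iterable of lines would work in its place.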
rosetta/cmd/concat_csv.py
@@ -0,0 +1,80 @@
#!/usr/bin/env python
"""
Concat a list of csv files in an "outer join" style.
From pandas, uses DataFrame.from_csv, DataFrame.to_csv, concat to do
reads/writes/joins. Except noted below, the default arguments are used.
"""
import argparse
import sys
import pandas as pd
def _cli():
# Text to display after help
epilog = """
EXAMPLES
Concat two files, each with a header and index, redirect output to newfile
$ python concat_csv.py --index --header file1 file2 > newfile
Concat two files, write result to newfile
$ python concat_csv.py --index --header -o newfile file1 file2
Concat all files in mydir/, write result to stdout.
$ python concat_csv.py mydir/*
"""
parser = argparse.ArgumentParser(
description=globals()['__doc__'], epilog=epilog,
formatter_class=argparse.RawDescriptionHelpFormatter)
parser.add_argument(
'paths', nargs='*', help='Concat files in this space separated list')
parser.add_argument(
'-o', '--outfile', default=sys.stdout,
type=argparse.FileType('w'),
help='Write to OUT_FILE rather than sys.stdout.')
parser.add_argument(
'-s', '--sep', default=',',
help='Delimiter to use. Regular expressions are accepted.'
' [default: %(default)s]')
parser.add_argument(
'--index', action='store_true', default=False,
help='Flag to set if files have an index (leftmost column).'
' [default: %(default)s].')
parser.add_argument(
'--header', action='store_true', default=False,
help='Flag to set if files have headers (in top row). '
'[default: %(default)s]')
parser.add_argument(
'-a', '--axis', type=int, default=0,
help='Axes along which to concatenate')
# Parse and check args
args = parser.parse_args()
# Call the module interface
_concat(
args.outfile, args.paths, args.sep, args.index, args.header, args.axis)
def _concat(outfile, paths, sep, index, header, axis):
# Read
index_col = 0 if index else False
header_row = 0 if header else False
kwargs = {'sep': sep, 'index_col': index_col, 'header': header_row}
frames = pd.concat(
(pd.DataFrame.from_csv(p, **kwargs) for p in paths), axis=axis)
# Write
kwargs = {'sep': sep, 'index': index, 'header': header}
frames.to_csv(outfile, **kwargs)
if __name__ == '__main__':
_cli()
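
To make the "outer join" style of concatenation concrete, here is a minimal standard-library sketch of row-wise stacking (the `--axis 0` case) for files that have headers. It is not how `concat_csv.py` works internally, since the script delegates to pandas; `concat_outer` is a hypothetical name, and leaving missing cells empty is an assumption made for this illustration.

```python
import csv
import io


def concat_outer(texts, sep=","):
    """Stack CSV texts row-wise, keeping the union of all columns.

    Hypothetical helper: columns missing from a given input are written
    as empty cells.
    """
    columns = []
    rows = []
    for text in texts:
        reader = csv.DictReader(io.StringIO(text), delimiter=sep)
        # Collect column names in first-seen order
        for name in reader.fieldnames or []:
            if name not in columns:
                columns.append(name)
        rows.extend(reader)

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, delimiter=sep, restval="",
        lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


file1 = "a,b\n1,2\n"
file2 = "b,c\n3,4\n"
combined = concat_outer([file1, file2])
```

With these two inputs, `combined` has the header `a,b,c`, and each data row leaves blank the column its source file did not have.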