Skip to content
This repository
Fetching contributors…

Octocat-spinner-32-eaf2f5

Cannot retrieve contributors at this time

file 387 lines (361 sloc) 18.154 kb
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386
v0.4.3, 201?-??-?? -- ???
 * runners:
   * All runners:
     * You can now set strict_protocols from mrjob.conf (#726)
       * new --no-strict-protocols command-line option

v0.4.2, 2013-11-27 -- that's one small step for a JAR
 * jobs:
   * can interpolate input and output path(s) into arguments of JarSteps,
     so they can be part of multi-step jobs (#773)
     * see mrjob/examples/mr_jar_step_example.py
   * JarStep now takes keyword arguments only (#769)
     * removed useless "name" field; "step_args" is now just "args"
   * MRJobStep (usually accessed via MRJob.mr()) is now MRStep
 * runners:
   * All runners:
     * --setup is now fully functional (#206)
       * --python-archive, --setup-cmd, and --setup-script are deprecated
     * --bootstrap option works and uses sh (#206)
       * --bootstrap-cmd, --bootstrap-file, --bootstrap-python-package,
         --bootstrap-script are deprecated
     * setup commands can no longer corrupt a task's input and output (#803)
     * sh_bin is now "sh -e" by default so setup fails fast (#810)
       * default is "/bin/sh -e" on EMR
   * EMR:
     * JarSteps work again (#763)
     * auto-uploads jars for JarSteps (#772)
       * JARs on the EMR instances can be accessed with file:/// URIs
     * ssh_cat() no longer raises an error when catting a file
       containing an error (#807)
     * Fixed SignatureDoesNotMatchError that happens with boto 2.10.0+
       with Python prior to 2.7.5 (#778)
   * Hadoop:
     * now handles JarSteps too (#770)
 * Fix to mrjob.parse.urlparse() that was breaking Python 2.5
 * mrjob.util.buffer_iterator_to_line_iterator() is now more efficient
   and uses a bounded amount of memory
 * bz2 decompression no longer discards data (#817)

v0.4.1, 2013-09-16 -- secondary sort and self-terminating job flows
 * jobs:
   * SORT_VALUES: Secondary sort by value (#240)
     * see mrjob/examples/
   * can now override jobconf() again (#656)
   * renamed mrjob.compat.get_jobconf_value() to jobconf_from_env()
   * examples:
     * bash_wrap/ (mapper/reducer_cmd() example)
     * mr_most_used_word.py (two step job)
     * mr_next_word_stats.py (SORT_VALUES example)
 * runners:
   * All runners:
     * single setup option works but is not yet documented (#206)
     * setup now uses sh rather than python internally
   * EMR runner:
     * max_hours_idle: self-terminating idle job flows (#628)
       * mins_to_end_of_hour option gives finer control over self-termination.
     * Can reuse pooled job flows where previous job failed (#633)
     * Throws IOError if output path already exists (#634)
     * Gracefully handles SSL cert issues (#621, #706)
     * Automatically infers EMR/S3 endpoints from region (#658)
     * ls() supports s3n:// schema (#672)
     * Fixed log parsing crash on JarSteps (#645)
     * visible_to_all_users works with boto <2.8.0 (#701)
     * must use --interpreter with non-Python scripts (#683)
     * cat() can decompress gzipped data (#601)
   * Hadoop runner:
     * check_input_paths: can disable input path checking (#583)
     * cat() can decompress gzipped data (#601)
   * Inline/Local runners:
     * Fixed counter parsing for multi-step jobs in inline mode
     * Supports per-step jobconf (#616)
 * Documentation revamp
 * mrjob.parse.urlparse() works consistently across Python versions (#686)
 * deprecated:
   * many constants in mrjob.emr replaced with functions in mrjob.aws
 * removed deprecated features:
   * old conf locations (~/.mrjob and in PYTHONPATH) (#747)
   * built-in protocols must be instances (#488)

v0.4.0, 2013-04-30 -- Slouching toward nirvana
 * Changes:
   * 'mrjob' command (#225)
   * Changed default runner from 'local' to 'inline' (#423)
   * Local runner no longer adds working directory to PYTHONPATH of
     subprocesses; use inline runner instead (#424)
   * Requires boto 2.2.0 or later
   * Filesystem functionality moved out of MRJobRunner into into 'fs' objects
     but forwarded from runners for backward compatibility
   * Changed exception hierarchy of mrjob.ssh (which is private but
     important)
   * Inline and local runners now inherit from the SimMRJobRunner class and thus share most
     of their implementation
   * Internal data structure for representing a step is much richer, allowing
     many cool future features (#479)
   * mrjob detects Hadoop version from EMR based on API responses instead of
     what's in the config (#611)
 * New features:
   * Support for non-Hadoop Streaming jar steps (#499)
   * Support for arbitrary commands as Hadoop Streaming
     mappers/combiners/reducers
   * mapper_pre_filter, combiner_pre_filter, and reducer_pre_filter allow
     running of a UNIX command in front of tasks to filter input outside of
     the interpreter
   * Hadoop runner uses PTY to print output from the Hadoop sub process to the
     console (#580)
   * mrjob knows how to terminate the job on cleanup (Ctrl+C closes the job).
     (#353)
   * Allow use of multiple -c flags on the command line (#420)
 * Bug fixes:
   * Silenced some incorrect warnings about ignored options in 'inline' runner
   * terminate_idle_job_flows uses the default configuration to terminate idle jobs (#559)
 * Removed deprecated functionality:
   * --hadoop-*-format
   * --*-protocol switches
   * MRJob.DEFAULT_*_PROTOCOL
   * MRJob.get_default_opts()
   * MRJob.protocols()
   * PROTOCOL_DICT
   * IF_SUCCESSFUL
   * DEFAULT_CLEANUP
   * S3Filesystem.get_s3_folder_keys()

v0.3.5, 2012-08-21 -- The Last Ride of v0.3.x[?]
 * EMR:
   * --pool-wait-minutes option lets you wait up to X minutes before creating a
     job flow (#455)
   * Job flow ID included in error messages on failure (#452)
   * JOB and JOB_FLOW cleanup options (#485, #455)
 * EMR and Hadoop:
   * Compatibility fixes related to deprecated options and Hadoop's bizarre
     non-sequential version numbers (#489, #534)
 * Other:
   * Warn when *_PROTOCOL is not a class (#490)
 * Bug fixes:
   * Unicode strings can be used when specifying interpreters (#431)
   * --enable-emr-logging no longer causes the wrong counters/logs to be parsed
     (#446)
   * TMP_DIR inserted into 'sort' environment variables (#477)
   * Setting hadoop_home in mrjob.conf works again
   * Gzipped input files work when specified with relative paths (#494)
   * Passthrough options are not re-ordered when sent to Hadoop Streaming
     (#509)

v0.3.4.1, 2012-06-12 -- The test suite doesn't catch everything...
 * Local mode doesn't try to send multiple mappers to the same output file
   when using multiple compressed files as input

v0.3.4, 2012-06-11 -- We are friendly people.
 * Experimental support for IronPython in the local and inline runners
 * set_status() and increment_counter() will encode messages/names of type
   'unicode' as UTF-8 when writing to Hadoop Streaming
 * EMR and Hadoop counter parsing is more correct
 * mrjob.tools.emr.fetch_logs fetches logs from S3 when asked instead of
   incorrectly refusing to do so
 * jobconf values can be booleans in mrjob.conf as well as 'true' and 'false'
   strings
 * hadoop_version can be a float in mrjob.conf, but a warning is printed to the
   console
 * Command line help is split across several --help-* commands
 * Local runner sorts output consistently

v0.3.3.2, 2012-04-10 -- It's a race [condition]!
 * Option parsing no longer dies when -- is used as an argument (#435)
 * Fixed race condition where two jobs can join same job flow thinking it is
   idle, delaying one of the jobs (#438)
 * Better error message when a config file contains no data for the current
   runner (#433)

v0.3.3.1, 2012-04-02 -- Hothothothothothothotfix
 * Fixed S3 locking mechanism parsing of last modified time to work around an
   inconsistency in the EMR API

v0.3.3, 2012-03-29 -- Bug...bug...bug...bug...bug...FEATURE!
 * EMR:
   * Error detection code follows symlinks in Hadoop logs (#396)
   * terminate_idle_job_flows locks job flows before terminating them (#391)
   * terminate_idle_job_flows -qq silences all output (#380)
 * Other fixes:
   * mr_tower_of_powers test no longer requires Testify (#395)
   * Various runner du() implementations no longer broken (#393, #394)
   * Hadoop counter parser regex handles long lines better (#388)
   * Hadoop counter parser regex is more correct (#305)
   * Better error when trying to parse YAML without PyYAML (#348)

v0.3.2, 2012-02-22 -- AMI versions, spot instances, and more
 * Docs:
   * 'Testing with mrjob' section in docs (includes #321)
   * MRJobRunner.counters() included in docs (#321)
   * terminate_idle_job_flows is spelled correctly in docs (#339)
 * Running jobs:
   * local mode:
     * Allow non-string jobconf values again (this changed in v0.3.0)
     * Don't split *.gz files (#333)
   * emr mode:
     * Spot instance support via ec2_*_instance_bid_price and renamed instance
       type/number options (#219)
     * ami_version option to allow switching between EMR AMIs (#306)
     * 'Error while reading from input file' displays correct file (#358)
     * python_bin used for bootstrap_python_packages instead of just 'python'
       (#355)
     * Pooling works with bootstrap_mrjob=False (#347)
     * Pooling makes sure a job flow has space for the new job before joining
       it (#324)
 * EMR tools:
   * create_job_flow no longer tries to use an option that does not exist
     (#349)
   * report_long_jobs tool alerts on jobs that have run for more than X hours
     (#345)
   * mrboss no longer spells stderr 'stsderr'
   * terminate_idle_job_flows counts jobs with pending (but not running)
     steps as idle (#365)
   * terminate_idle_job_flows can terminate job flows near the end of a
     billable hour (#319)
   * audit_usage breaks down job flows by pool (#239)
   * Various tools (e.g. audit_usage) get list of job flows correctly (#346)

v0.3.1, 2011-12-20 -- Nooooo there were bugs!
 * Instance-type command-line arguments always override mrjob.conf (Issue #311)
 * Fixed crash in mrjob.tools.emr.audit_usage (Issue #315)
 * Tests now use unittest; python setup.py test now works (Issue #292)

v0.3.0, 2011-12-07 -- Worth the wait
 * Configuration:
   * Saner mrjob.conf locations (Issue #97):
     * ~/.mrjob is deprecated in favor of ~/.mrjob.conf
     * searching in PYTHONPATH is deprecated
     * MRJOB_CONF environment variable for custom paths
 * Defining Jobs (MRJob):
   * Combiner support (Issue #74)
   * *_init() and *_final() methods for mappers, combiners, and reducers
     (Issue #124)
   * mapper/combiner/reducer methods no longer need to contain a yield
     statement if they emit no data
   * Protocols:
     * Protocols can be anything with read() and write() methods, and are
       instances by default (Issue #229)
     * Set protocols with the *_PROTOCOL attributes or by re-defining the
       *_protocol() methods
     * Built-in protocol classes cache the encoded and decoded value of the
       last key for faster decoding during reducing (Issue #230)
     * --*protocol switches and aliases are deprecated (Issue #106)
   * Set Hadoop formats with HADOOP_*_FORMAT attributes or the hadoop_*_format()
     methods (Issue #241)
     * --hadoop-*-format switches are deprecated
     * Hadoop formats can no longer be set from mrjob.conf
   * Set jobconf with JOBCONF attribute or the jobconf() method (in addition
     to --jobconf)
   * Set Hadoop partitioner class with --partitioner, PARTITIONER, or
     partitioner() (Issue #6)
   * Custom option parsing (Issue #172)
   * Use mrjob.compat.get_jobconf_value() to get jobconf values from environment
 * Running jobs:
   * All modes:
     * All runners are Hadoop-version aware and use the correct jobconf and
       combiner invocation styles (Issue #111)
     * All types of URIs can be passed through to Hadoop (Issue #53)
     * Speed up steps with no mapper by using cat (Issue #5)
     * Stream compressed files with cat() method (Issue #17)
     * hadoop_bin, python_bin, and ssh_bin can now all take switches (Issue #96)
     * job_name_prefix option is gone (was deprecated)
     * Better cleanup (Issue #10):
       * Separate cleanup_on_failure option
       * More granular cleanup options
     * Cleaner handling of passthrough options (Issue #32)
   * emr mode:
     * job flow pooling (Issue #26)
     * vastly improved log fetching via SSH (Issue #2)
       * New tool: mrjob.tools.emr.fetch_logs
     * default Hadoop version on EMR is 0.20 (was 0.18)
     * ec2_instance_type option now only sets instance type for slave nodes
       when there are multiple EC2 instances (Issue #66)
     * New tool: mrjob.tools.emr.mrboss for running commands on all nodes and
       saving output locally
   * inline mode:
     * Supports cmdenv (Issue #136)
     * Passthrough options can now affect steps list (Issue #301)
   * local mode:
     * Runs 2 mappers and 2 reducers in parallel by default (Issue #228)
     * Preliminary Hadoop simulation for some jobconf variables (Issue #86)
 * Misc:
   * boto 2.0+ is now required (Issue #92)
   * Removed debian packaging (should be handled separately)

v0.2.8, 2011-09-07 -- Bugfixes and betas
 * Fix log parsing crash dealing with timeout errors
 * Make mr_travelling_salesman.py work with simplejson
 * Add emr_additional_info option, to support EMR beta features
 * Remove debian packaging (should be handled separately)
 * Fix crash when creating tmp bucket for job in us-east-1

v0.2.7, 2011-07-12 -- Hooray for interns!
 * All runner options can be set from the command line (Issue #121)
   * Including for mrjob.tools.emr.create_job_flow (Issue #142)
 * New EMR options:
   * availability_zone (Issue #72)
   * bootstrap_actions (Issue #69)
   * enable_emr_debugging (Issue #133)
 * Read counters from EMR log files (Issue #134)
 * Clean old files out of S3 with mrjob.tools.emr.s3_tmpwatch (Issue #9)
 * EMR parses and reports job failure due to steps timing out (Issue #15)
 * EMR boostrap files are no longer made public on S3 (Issue #70)
 * mrjob.tools.emr.terminate_idle_job_flows handles custom hadoop streaming
   jars correctly (Issue #116)
 * LocalMRJobRunner separates out counters by step (Issue #28)
 * bootstrap_python_packages works regardless of tarball name (Issue #49)
 * mrjob always creates temp buckets in the correct AWS region (Issue #64)
 * Catch abuse of __main__ in jobs (Issue #78)
 * Added mr_travelling_salesman example

v0.2.6, 2011-05-24 -- Hadoop 0.20 in EMR, inline runner, and more
* Set Hadoop to run on EMR with --hadoop-version (Issue #71).
   * Default is still 0.18, but will change to 0.20 in mrjob v0.3.0.
 * New inline runner, for testing locally with a debugger
 * New --strict-protocols option, to catch unencodable data (Issue #76)
 * Added steps_python_bin option (for use with virtualenv)
 * mrjob no longer chokes when asked to run on an EMR job flow running
   Hadoop 0.20 (Issue #110)
 * mrjob no longer chokes on job flows with no LogUri (Issue #112)

v0.2.5, 2011-04-29 -- Hadoop input and output formats
 * Added hadoop_input/output_format options
 * You can now specify a custom Hadoop streaming jar (hadoop_streaming_jar)
 * extra args to hadoop now come before -mapper/-reducer on EMR, so
   that e.g. -libjar will work (worked in hadoop mode since v0.2.2)
 * hadoop mode now supports s3n:// URIs (Issue #53)

v0.2.4, 2011-03-09 -- fix bootstrapping mrjob
 * Fix bootstrapping of mrjob in hadoop and local mode (Issue #89)
 * SSH tunnels try to use the same port for the same job flow (Issue #67)
 * Added mr_postfix_bounce and mr_pegasos_svm to examples.
 * Retry on spurious 505s from EMR API

v0.2.3, 2011-02-24 -- boto compatibility
 * Fix incompatibility with boto 2.0b4 (Issue #91)

v0.2.2, 2011-02-15 -- GET/POST EMR issue
 * Use POST requests for most EMR queries (EMR was choking on large GETs)
 * find_probable_cause_of_failure() ignores transient errors (Issue #31)
 * --hadoop-arg now actually works (Issue #79)
   * on Hadoop, extra args are added first, so you can set e.g. -libjar
 * S3 buckets may now have . in their names
 * MRJob scripts now respect --quiet (Issue #84)
 * added --no-output option for MRJob scripts (Issue #81)
 * added --python-bin option (Issue #54)

v0.2.1, 2010-11-17 -- laststatechangereason bugfix
 * Don't assume EMR sets laststatechangereason

v0.2.0, 2010-11-15 -- Many bugfixes, Windows support
 * New Features/Changes:
   * EMRJobRunner now prints % of mappers and reducers completed when you
     enable the SSH tunnel.
   * Added mr_page_rank example
   * Added mrjob.tools.emr.audit_usage script (Issue #21)
   * You can specify alternate job owners with the "owner" option. Useful for
     auditing usage. (Issue #59)
   * The job_name_prefix option has been renamed to label (the old name still
     works but is deprecated)
   * bootstrap_cmds and bootstrap_scripts no longer automatically invoke sudo
 * Bugs Fixed/Cleanup:
   * bootstrap files no longer get uploaded to S3 twice (Issue #8)
   * When using add_file_option(), show_steps() can now see the local version
     of the file (Issue #45)
   * Now works on Windows (Issue #46)
   * No longer requires external jar, tar, or zip binaries (Issue #47)
   * mrjob-* scratch bucket is only created as needed (Issue #50)
   * Can now specify us-east-1 region explicitly (Issue #58)
   * mrjob.tools.emr.terminate_idle_job_flows leaves Hive jobs alone (Issue #60)

v0.1.0, 2010-10-28 -- Same code, better version. It's official!

v0.1.0-pre3, 2010-10-27 -- Pre-release to run Yelp code against
 * Added debian packaging
 * mrjob bootstrapping can now deal with symlinks in site-packages/mrjob
 * MRJobRunner.stream_output() can now be called multiple times

v0.1.0-pre2, 2010-10-25 -- Second pre-release after testing
 * Fixed small bugs that broke Python 2.5.1 and Python 2.7
 * Fixed reading mrjob.conf without yaml installed
 * Fix tests to work with modern simplejson and pipes.quote()
 * Auto-create temp bucket on S3 if we don't have one (Issue #16)
 * Auto-infer AWS region from bucket (Issue #7)
 * --steps now passes in all extra args (e.g. --protocol) (Issue #4)
 * Better docs

v0.1.0-pre1, 2010-10-21 -- Initial pre-release. YMMV!
Something went wrong with that request. Please try again.