Skip to content

benleon/NewLineRemover

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 

Repository files navigation

NewLineRemover

MapReduce Program that uses a modified LineReader to remove new line characters in Quotes ( typical data preprocessing task )

This project is a Mapreduce program on the input folder removing all new line characters between quotes.

New line characters in database text fields is a common data preparation problem. This mapreduce program uses a custom Record Reader to remove any new line characters between quotes " I.e. Text files like

1,"Ben","Ben was working

on our cluster"

2,"Karl","Karl is an hadoop engineer" ...

Would be translated to

1,"Ben","Ben was working\n on our cluster"

2,"Karl","Karl is an hadoop engineer"

To use a different quote character than " set the QUOTE value in the QuotationLineReader

private static final byte QUOTE = '"';

To change the way newline characters are handled change the processing in the mapper. Currently they are escaped. Removing them would also be valid.

line = line.replaceAll("\n", "\\n");

line = line.replaceAll("\r", "\\r");

About

MapReduce Program that uses a modified LineReader to remove new line characters in Quotes ( typical data preprocessing task )

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages