Multiple changes running 4mz in Spark 2.2 #27

Merged 4 commits on Jan 2, 2018


snoe925
Contributor

@snoe925 snoe925 commented Oct 24, 2017

Modern Hadoop does not require core-site.xml configuration for codecs: it can discover them through Java service-provider files under META-INF/services.

With these files in place, the codec works in Spark once the jar is on the classpath. For example, you can copy the jar into the Spark jars directory.

Hadoop implementations that do not use the Java services lookup ignore this META-INF data and behave the same as before.
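
The discovery step can be checked from a Spark shell. The following is a minimal sketch, not code from this pull request: `registeredProviders` is a hypothetical helper name, and it relies only on the standard `java.util.ServiceLoader` API, which reads provider class names from `META-INF/services/<interface name>` files on the classpath.

```scala
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Hypothetical helper: returns the class names of all implementations of
// `service` that are registered through META-INF/services files visible
// to the current thread's context class loader.
def registeredProviders[T](service: Class[T]): List[String] =
  ServiceLoader.load(service).iterator().asScala.map(_.getClass.getName).toList

// In spark-shell, where Hadoop is on the classpath, the codecs can be listed with:
// registeredProviders(classOf[org.apache.hadoop.io.compress.CompressionCodec]).foreach(println)
```

If the 4mc jar is on the classpath, its codec classes would be expected to appear in that list. Note that iterating a `ServiceLoader` instantiates each provider, so a codec whose constructor fails will raise an error at this point.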

@snoe925 changed the title from "Add JavaServices locators for codecs" to "Multiple changes running 4mz in Spark 2.2" on Oct 24, 2017
@snoe925
Contributor Author

snoe925 commented Oct 24, 2017

I found that these changes were required to get 4mz working with newAPIHadoopFile. Here is an example spark shell reader.

sc.newAPIHadoopFile("data.4mz", classOf[com.hadoop.mapreduce.FourMzTextInputFormat], classOf[org.apache.hadoop.io.LongWritable], classOf[org.apache.hadoop.io.Text])
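
A slightly fuller version of that reader might look like the sketch below. It assumes a spark-shell session (so `sc` already exists) with the 4mc jar on the classpath, and it reuses the sample path `data.4mz` from the line above. The conversion to `String` is an addition, not something from this pull request: Hadoop record readers may reuse the same `Text` object between records, so copying the value before collecting is the usual precaution.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import com.hadoop.mapreduce.FourMzTextInputFormat

// `sc` is the SparkContext provided by spark-shell.
// Keys are LongWritable values and the values are lines of text.
val records = sc.newAPIHadoopFile(
  "data.4mz",
  classOf[FourMzTextInputFormat],
  classOf[LongWritable],
  classOf[Text])

// Copy each Text value into an immutable String, because the underlying
// Hadoop record reader may reuse the same Text instance for every record.
val lines = records.map { case (_, line) => line.toString }

// Print the first few lines as a quick check that decompression works.
lines.take(5).foreach(println)
```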

@jordiolivares
Contributor

Why hasn't this been merged yet?

Specifically, commit f6a57e3 has a really basic fix that is necessary for ZSTD to function properly. I would add that FourMcTextInputFormat also needs the LongWritable and Text generic type parameters, like FourMzTextInputFormat in your version.

@snoe925
Contributor Author

snoe925 commented Dec 22, 2017

I can volunteer as a maintainer. I can also make an official repo if you want to avoid notifications.

@carlomedas
Collaborator

I'd like to merge the first part of this pull request.
The index changes to the 4mc CLI, however, are not clear to me.
What are they doing? The index in 4mc/4mz files is already inside the file itself.

@carlomedas
Collaborator

P.S.: I can use your help to rebuild the lib on all platforms.

@snoe925
Contributor Author

snoe925 commented Jan 2, 2018

I should have pushed the external index code to a separate branch. I was experimenting with timestamp indexing of the data in a 4mz file. Let me fix the pull request.

@snoe925
Contributor Author

snoe925 commented Jan 2, 2018

I have removed the incorrect index code commit from this pull request.

@snoe925
Contributor Author

snoe925 commented Jan 2, 2018

For platform building I will open a separate pull request for a Travis CI integration file. That can build Linux and OS X. I do not have Windows build machines.

@carlomedas
Collaborator

Yes, that would be perfect, although Linux is not an issue.
I'm going to rebuild a new version of the lib soon, and Mac is easy too.
The only issue I have now is with Windows, where you need cygwin64 to build the lib so that it works well with the latest JRE 7/8 on recent Windows versions.
Since I don't think many people are using it on Windows, we could even consider releasing without it, unless we find the time to recreate the build system that I unfortunately lost in the past year.

@carlomedas carlomedas merged commit 0ab3864 into fingltd:master Jan 2, 2018