
[BUG]: #1164

Open
johnstaveley opened this issue Jan 10, 2024 · 11 comments
Labels: bug (Something isn't working)

Comments

@johnstaveley

Describe the bug
Here it refers to installing spark and a worker process:
https://github.com/dotnet/spark/blob/main/docs/getting-started/windows-instructions.md

However, when you download both for Windows, you just get a list of files and no installer.
Please provide instructions, or a link to installation instructions, for Windows.

johnstaveley added the bug (Something isn't working) label on Jan 10, 2024
@dbeavon

dbeavon commented Jan 28, 2024

Hi @johnstaveley

You are correct that the dot-net-worker is just a set of files that you need to extract.

There isn't much an installer would need to do beyond extracting those files and creating environment variables. Apache Spark solutions rely heavily on environment variables. You need an environment variable so that the Spark core (executors) can find the worker process. Please set this variable (you will find it in the link you shared):

DOTNET_WORKER_DIR=C:\Program Files\Microsoft Spark Worker\Microsoft.Spark.Worker-2.0.0

You will soon learn that the application you are building may have .NET UDFs (transformations that are executed on individual DataFrame rows, or on vectors of rows). In order for these assemblies to be located by "Microsoft.Spark.Worker-2.0.0", you will need yet another environment variable:

DOTNET_ASSEMBLY_SEARCH_PATHS=C:\Data\Workspace\Spark\Driver\LoadSalesHistory\bin\Debug\netcoreapp3.1
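
As a quick sanity check, here is a minimal sketch (assuming the two paths above) that a driver app or notebook cell could run to confirm these variables are visible to the process; the check itself is only an illustration, not part of Microsoft.Spark:

using System;

// Illustrative check: confirm the environment variables that the Spark/.NET
// plumbing depends on are actually set for this process.
foreach (var name in new[] { "DOTNET_WORKER_DIR", "DOTNET_ASSEMBLY_SEARCH_PATHS" })
{
    var value = Environment.GetEnvironmentVariable(name);
    Console.WriteLine(string.IsNullOrEmpty(value)
        ? $"WARNING: {name} is not set"
        : $"{name} = {value}");
}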

Hopefully this is clear. To make a long story short, there are several pieces to the architecture (a minimal driver sketch follows the list):

  • your custom driver and custom UDFs (in your own .NET Core project)
  • the generic dot-net worker (Microsoft.Spark.Worker) that needs to be deployed to the worker nodes and located via the Microsoft.Spark.Worker-2.0.0 directory above
  • the NuGet package (Microsoft.Spark) that you add to your .NET project; it emits a small but critical jar (e.g. microsoft-spark-3-1_2.12-2.0.0.jar) which is injected into the Spark core as a user jar, and is also where you find "org.apache.spark.deploy.dotnet.DotnetRunner". That is the Scala class that is always launched (via spark-submit) to get your .NET code up and running on the driver node.
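
To make those pieces concrete, here is a minimal driver sketch, assuming the Microsoft.Spark 2.x NuGet package is referenced and DOTNET_WORKER_DIR is set as above; the compiled app is then launched through spark-submit with the microsoft-spark jar and the DotnetRunner class. The app name and values are made up for illustration.

using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

class Program
{
    static void Main(string[] args)
    {
        // The .NET driver: it talks to the JVM Spark core started by DotnetRunner.
        SparkSession spark = SparkSession
            .Builder()
            .AppName("HelloSparkDotnet")
            .GetOrCreate();

        DataFrame df = spark.Range(0, 5);

        // A .NET UDF: this lambda runs inside Microsoft.Spark.Worker on the
        // executors, which is why DOTNET_WORKER_DIR must be set on worker nodes.
        Func<Column, Column> doubled = Udf<long, long>(id => id * 2);

        df.Select(Col("id"), doubled(Col("id")).Alias("doubled")).Show();

        spark.Stop();
    }
}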

@GeorgeS2019

@dbeavon

Thanks for keeping this project supported.

Recently, the Microsoft team started to show interest in IKVM. I wonder if this could help Spark.NET. IKVM users have started to bring .NET support to, e.g., the Java Spring Framework.

@dbeavon

dbeavon commented Apr 17, 2024

Hi @johnstaveley, is there anything more you need? Can we close this bug?

@johnstaveley
Author

Thank you very much for your help, it was appreciated. Honestly, I have found using Spark on Windows and Spark with .NET too painful and have given up. There are no up-to-date training materials on it that work.

@dbeavon

dbeavon commented Apr 17, 2024

Hi @johnstaveley, I agree that there can be complexity. I think it is the reason why so many companies offer Spark hosting at $1 to $10 an hour. Ultimately your reward for getting it running locally is not having to pay that bill (and achieving much better productivity as well). I can easily get $10/hour worth of Spark running on my workstation (32 cores, 64 GB) and use it all week long without paying one penny to any cloud vendor.

(This is a huge contrast to certain other big data options. E.g. you often see posts in the Snowflake community about an unfortunate developer who accidentally consumed several thousand dollars' worth of resources because they were in the preliminary stages of their development work and didn't quite know what they were doing yet. As I understand it, the Snowflake product team refuses to ever allow their products to run on a customer's own hardware; it is easy to understand their financial motivation. They probably make just as much money from junior developers who are making programming errors as they make from production-ready solutions!)

Even if you use a cloud-hosted Spark offering, there will be complexity ... but the complexity will shift around and be found in other areas (you will need to have a good understanding of linux rather than windows, and you will need another language and another set of libraries). It is sort of a "pick your poison" scenario.

When using Spark on Windows (standalone cluster) with .NET, it is important to already understand Spark pretty well, for starters. It is also good to get really familiar with the Sysinternals tools (Process Explorer, TCPView, Process Monitor). It is pretty important to understand which components are launched with elevated access and which components can be launched without elevation. Another thing to remember is that you have to rely a great deal on Linux-like concepts such as environment variables, since most of these components are built for cross-platform scenarios. Whenever I work with cross-platform .NET solutions, I'm shocked by how much I need to tinker with environment variables, whereas I haven't often had to touch environment variables for .NET Framework apps. I don't think I've interacted with them this much since the days of MS-DOS batch scripts!

@johnstaveley
Author

Hi, I'm not sure a GitHub issue is the best place to raise this, but I have been trying to learn Spark. My go-to way of learning a new technology, tested over 24 years as a software engineer, is to get it running locally and then maybe use the cloud version later. I want to use Spark with .NET. I tried installing it locally and couldn't get it working. I tried it as a Docker container and couldn't get it working. I tried it in the cloud; I could get it working, but I couldn't interact with it using .NET.

Can you point me to a training course that helps me understand Spark, so that I can use .NET to get data into and out of it in the quickest way possible?

@GeorgeS2019

@dbeavon

Thanks for sharing so much to keep the interest in Spark .NET going.

I will find time to get back to this.

My problem has always been that, under .NET 6, UDFs do not work in a .NET notebook environment.

(screenshot of the UDF failure in the notebook)

@dbeavon

dbeavon commented Apr 20, 2024

Hi @GeorgeS2019

Are you a pretty experienced .NET developer? Have you also used .NET for Spark with Visual Studio? In your notebooks, do you load custom libraries, e.g.
#r "C:\path\to\your\DLL\netstandard2.0\Newtonsoft.Json.dll"

Do you know how to set the environment variable that points at your worker assembly directory?

I tried the other day and wasn't getting UDFs working in Polyglot notebooks on VS Code either. But I'm pretty sure the errors were meaningful and there is a path forward. The pieces of the puzzle are all there. Our problem is that there are three or four separate teams that may not know about each other, so you have to understand how they play together (VS Code, .NET Interactive/Polyglot, Apache Spark, and .NET for Spark).

I think you need to be more specific about your problem. It isn't that you can't run a UDF; it's that you can't run the REPL code in a notebook cell as a UDF. You should probably start by doing something a bit simpler, like compiling a DLL with your UDF, copying it to the path where your worker DLLs are stored, and using that UDF from your notebook (instead of a REPL UDF), as sketched below.
Granted, it's not convenient, but at least you will see that the REPL is the problem, rather than the project itself (.NET for Spark).
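
As a rough sketch (the assembly name, namespace, and paths here are hypothetical), the compiled library might be nothing more than:

// MyUdfs.dll: a tiny class library built in Visual Studio and copied into the
// DOTNET_ASSEMBLY_SEARCH_PATHS directory so the worker can resolve it.
namespace MyUdfs
{
    public static class Transforms
    {
        public static string Shout(string s) => (s ?? string.Empty).ToUpperInvariant();
    }
}

and the notebook cell would then load it and wrap the method as a UDF, roughly like this:

// Polyglot notebook cell; the path is hypothetical.
#r "C:\Spark\Udfs\MyUdfs.dll"

using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

var spark = SparkSession.Builder().AppName("notebook").GetOrCreate();

// Wrapping a method from a real DLL on disk (instead of a lambda defined in
// the REPL) gives the worker an assembly it can actually load.
Func<Column, Column> shout = Udf<string, string>(MyUdfs.Transforms.Shout);

spark.Sql("SELECT 'hello' AS greeting")
     .Select(shout(Col("greeting")))
     .Show();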

@GeorgeS2019

GeorgeS2019 commented Apr 20, 2024

Write and call UDFs in .NET Interactive

dotnet/docs#34013

Our problem is that there are three or four separate teams that may not know about each other

You are totally correct.

I am able to get UDFs to work (e.g. by addressing BinaryFormatter) in a non-polyglot environment.

Applying the same conditions needed to get UDFs working (e.g. the console condition) failed in Polyglot. This was many months ago. I recall that this has to do with Polyglot, and I have not figured out how to describe what is missing in Polyglot that causes the UDF failure.

Reference

I think this PR could address that.

This PR removes the need to depend on BinaryFormatter.

With the PR, I believe UDFs in Polyglot should work.

@dbeavon

dbeavon commented Apr 20, 2024

@johnstaveley

If you are a developer with 24 years of experience then you should be in a good position to get it working locally. I would start by using Python. The .NET stuff is basically the same as Python: a .NET driver runs on the cluster (external to the JVM Spark core), injects DataFrames into the cluster, and then executes UDFs (also .NET) which launch themselves from the executors (JVM) on the worker infrastructure. All of this works the same whether the language is Python or .NET. The Spark core is JVM and shouldn't care whether it interfaces with external Python or external .NET; all it needs is a consistent set of interfaces (for launching, and for exchanging data via Apache Arrow).

In other words, I think you should start by following Python tutorials from other communities in order to get Spark up and running locally on Windows. Make sure you look for an appropriate version of Spark (probably spark-3.1.2-bin-hadoop3.2), but you should have a target version of this .NET project in mind as well (probably 2.1.1 or else the master branch).

I think you should start with Python and get a simple example running. Even if you only get it going in Jupyter in VS Code (not Visual Studio), it will prove that you have a fully functional Spark environment (a standalone cluster on a single machine). Make sure you are able to run a Python sample that has UDFs or vector UDFs.

Once you can launch something in Python, you are about halfway there.

Moving to .NET for Spark, you have to decide what version of the code to use. Selecting a version of the project (2.1.1) determines a number of dependencies (in both the .NET and Java ecosystems). Because we are having trouble getting PRs through, many people are tweaking these dependencies to some degree in their own GitHub forks.

Here are the things you get with the default 2.1.1 on the .NET side:

  • .NET Core 3.1
  • BinaryFormatter

Here are the things you get on the JVM/Spark side:

  • several versions of Spark that you can select from when running spark-submit

(screenshot listing the available microsoft-spark jar versions)

  • Scala 2.12 (I'm not actually sure why this is included in the ugly jar name, since it should be implied by the Spark version; the jars are named in a strange way, e.g. microsoft-spark-3-2_2.12)

  • You get tons of libraries in Python, but you really shouldn't have to care about that ecosystem at all once you are up and running.

Once you have a full Spark environment running for some basic use of Jupyter/Python, then I think that is the best point to start opening issues in this community, and we can help get you up and running on .NET as well. For the .NET side you should be prepared to use Visual Studio 2022 as your primary development environment, not notebooks. Notebooks should be possible as well, but they aren't preferable for large-scale engineering projects.

@dbeavon

dbeavon commented Apr 20, 2024

@GeorgeS2019

I think your issue related to polyglot (REPL) UDFs is basically described here:
#619

What is your actual end goal or use case? For enterprise workloads, I don't spend much time with notebooks. I can only see myself using notebooks to analyze my Spark DataFrames after the fact, and that uses almost no UDFs (only Spark SQL). If and when I need UDFs for some reason, I could build them as an assembly and load it into Polyglot with "#r".

You don't have trouble with Spark SQL from a C# notebook, right?

You don't have trouble loading custom assemblies, right?

I agree that it should be possible to run the code in a REPL cell as a UDF, but it will be a bit challenging. The .NET Interactive team may need to cooperate on that goal. I'm guessing we would need to write some assemblies and deploy them into the search paths of the worker (DOTNET_ASSEMBLY_SEARCH_PATHS). At that point .NET Interactive becomes a miniature build environment.

I did some digging and it isn't straightforward how you would serialize the assembly from .NET Interactive. We might have to dig into their project to see how they build these.

The code below can be used in Polyglot. It looks for the assembly that contains a REPL-defined type, and you can see that its location is empty; the assembly is probably generated dynamically in memory, and therefore not accessible to a remote worker.

// A type defined in a REPL cell lives in a dynamically generated assembly.
public class A { public string B { get; set; } }

// Inspect the assembly that contains it; its Location is empty because the
// assembly only exists in memory, not on disk.
var L = System.Reflection.Assembly.GetAssembly(typeof(A));
L

(screenshot showing the empty assembly Location)

Obviously the Synapse Analytics team found a way to make this work (and so did the PySpark folks), so it isn't impossible; it just takes a bit of effort. In the meantime, just create a simple assembly in Visual Studio and load that into your notebook for the sake of your UDFs.
