This repository has been archived by the owner on Nov 8, 2018. It is now read-only.

read spark data frame #1

Closed
elenacuoco opened this issue Sep 28, 2016 · 1 comment
Comments

@elenacuoco

Why not use the DataFrame API to read the data in your example? Have you tried these lines?

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set("spark.executor.memory", "4G")
conf.set("spark.driver.memory", "2G")
conf.set("spark.executor.cores", "7")
conf.set("spark.python.worker.memory", "4G")
conf.set("spark.driver.maxResultSize", "0")
conf.set("spark.sql.crossJoin.enabled", "true")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.default.parallelism", "4")
spark = SparkSession \
    .builder.config(conf=conf) \
    .appName("test-spark").getOrCreate()
df = spark.read.csv("../input/train_numeric.csv", header=True, inferSchema=True, mode="DROPMALFORMED")
@JoeriHermans
Collaborator

This works as well, but the Databricks CSV package lets you indicate null values. For example, in this dataset they are denoted by -999. But anyway, you are right, you can do it like this. :)

JoeriHermans pushed a commit that referenced this issue Sep 10, 2017
Add temporary logging to help debugging