<a href="https://colab.research.google.com/github/ankitarm/PySpark/blob/main/Spark_Reading_files.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text files

spark.read.option("encoding", "UTF-8").text("path", lineSep="||", wholetext=True)

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Textfiles").getOrCreate()
data1 = spark.read.text("/content/textfile1.txt")
data1.show()

data2 = spark.read.text("/content/textfile2.txt")
data2.show()

+------------------+
|             value|
+------------------+
|name||address||age|
+------------------+

+------------------+
|             value|
+------------------+
|name||address||age|
|name||address||age|
+------------------+



In [None]:
data3 = spark.read.option("lineSep","||").text("/content/textfile1.txt")
data3.show()
data4 = spark.read.option("lineSep","||").text("/content/textfile2.txt")
data4.show()

+-------+
|  value|
+-------+
|   name|
|address|
|    age|
+-------+

+-----------+
|      value|
+-----------+
|       name|
|    address|
|age\r\nname|
|    address|
|        age|
+-----------+
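
The `age\r\nname` row appears because, once `lineSep` is set to `"||"`, the line break is no longer treated as a record boundary and stays inside the value. The sketch below approximates that splitting in plain Python. It assumes the file holds two identical lines joined by `\r\n`, as the wholetext output later in this notebook suggests, and it uses `str.split` only as a stand-in for Spark's record splitting.

```python
# Assumed contents of textfile2.txt: two lines joined by a Windows line break.
raw = "name||address||age\r\nname||address||age"

# With lineSep="||" the only record boundary is "||";
# "\r\n" is ordinary data and stays inside the third record.
records = raw.split("||")
print(records)
# ['name', 'address', 'age\r\nname', 'address', 'age']
```

To get clean values from a multi-line file, read it with the default line separator first and split each row on `||` afterwards.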



Options can be set with .option() on the reader, as above, or passed as keyword arguments to text(), as below.

In [None]:
data5 = spark.read.text("/content/textfile1.txt", lineSep = "||")
data5.show()

+-------+
|  value|
+-------+
|   name|
|address|
|    age|
+-------+



Encoding values include UTF-8 (the default) and ISO-8859-1 (also known as latin1).

To specify an encoding (UTF-8, ISO-8859-1, etc.), use option() with the key "encoding"; spark.read.text() does not accept encoding as a keyword argument.

In [None]:
data7 = spark.read.option("lineSep","||").option("encoding" , "UTF-8").text("/content/textfile1.txt")
data7.show()
# text() has no encoding keyword argument, so the next line raises TypeError
data6 = spark.read.text("/content/textfile1.txt", lineSep = "||", encoding = "UTF-8")
data6.show()

+-------+
|  value|
+-------+
|   name|
|address|
|    age|
+-------+



TypeError: DataFrameReader.text() got an unexpected keyword argument 'encoding'

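The cell below does not give the desired output: the whole file should come back as a single row, but it is still split into lines. Setting wholetext through .option() has no effect here, most likely because text() passes its own wholetext argument (default False), which overrides the option.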

In [None]:
data8 = spark.read.option("wholetext",True).text("/content/textfile2.txt")
data8.show()

+------------------+
|             value|
+------------------+
|name||address||age|
|name||address||age|
+------------------+



In [None]:
data9 = spark.read.wholeTextFiles("/content/textfile2.txt")
data9.show()

AttributeError: 'DataFrameReader' object has no attribute 'wholeTextFiles'
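
The error arises because `wholeTextFiles` is a method of the SparkContext, not of the DataFrameReader. It returns an RDD of (file path, file content) pairs rather than a DataFrame. A minimal sketch, assuming the `spark` session created at the top of the notebook; `read_whole_files` and the column names `path` and `content` are hypothetical names chosen for illustration.

```python
def read_whole_files(spark, path):
    """Read each file under `path` as one record (hypothetical helper)."""
    # wholeTextFiles lives on the SparkContext and yields one
    # (file path, file content) pair per file.
    pairs = spark.sparkContext.wholeTextFiles(path)
    # Turn the pair RDD into a DataFrame with named columns.
    return spark.createDataFrame(pairs, ["path", "content"])

# Usage with the notebook's session:
# data9 = read_whole_files(spark, "/content/textfile2.txt")
# data9.show(truncate=False)
```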

Passing wholetext=True as a keyword argument to text() works:


In [None]:
data10 = spark.read.text("/content/textfile2.txt", wholetext=True)
data10.show(truncate=False)

+----------------------------------------+
|value                                   |
+----------------------------------------+
|name||address||age\r\nname||address||age|
+----------------------------------------+




# CSV files


In [None]:
df4 = spark.read.options(delimiter=";", header=True).csv(path)
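
As with text(), the same CSV read can be written with keyword arguments instead of options(). A minimal sketch, assuming the `spark` session created at the top of the notebook; `read_semicolon_csv` is a hypothetical helper name, and `sep` is the keyword for which the `delimiter` option is an alias.

```python
def read_semicolon_csv(spark, path):
    """Read a semicolon-separated CSV file with a header row (hypothetical helper)."""
    # sep=";" sets the field separator; header=True takes
    # column names from the first line of the file.
    return spark.read.csv(path, sep=";", header=True)

# Usage with the notebook's session:
# df4 = read_semicolon_csv(spark, path)
# df4.show()
```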

# JSON files

In [None]:
peopleDF = spark.read.json(path)

# RDDs: parallelize the JSON strings, then pass the RDD to spark.read.json
sc = spark.sparkContext
jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()


# Read the multiline JSON
df = spark.read.option("multiline", "true").json("path/to/data.json")

{
  "id": 1,
  "name": "Alice",
  "location": {
    "city": "Austin",
    "state": "TX"
  }
}


In [32]:
data11 = spark.read.option("multiline", "true").json("/content/json1.json")
data11.show(truncate=False)

+---+------------+-----+
|id |location    |name |
+---+------------+-----+
|1  |{Austin, TX}|Alice|
+---+------------+-----+



Different ways to read JSON

1.   Single file: df = spark.read.json("path/to/file.json")
2.   Generic loader: df = spark.read.format("json").load("path/to/file.json")
3.   Multiline records: df = spark.read.option("multiLine", True).json("path/to/multiline.json")
4.   Multiple files via a glob: df = spark.read.json("path/to/files/*.json")
5.   From an RDD of JSON strings: rdd = sc.parallelize(['{"name":"John", "age":30}', '{"name":"Alice", "age":25}']) followed by df = spark.read.json(rdd)
6.   If a column contains JSON strings, parse them with from_json.
7.   As a stream: stream_df = spark.readStream.schema(schema).json("path/to/json_stream")
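
Items 6 and 7 have no code in the list, so here is a rough sketch of both. It assumes a DataFrame with a string column named `raw` holding JSON like the records in item 5, and a schema written as the DDL string `name STRING, age INT`. The helper names, the column name, and the schema are all illustrative assumptions.

```python
def parse_json_column(df, column="raw"):
    """Expand a column of JSON strings into typed columns (hypothetical helper)."""
    # from_json is also available as a SQL function, so selectExpr can call it;
    # the schema is supplied as a DDL string.
    parsed = df.selectExpr(f"from_json({column}, 'name STRING, age INT') AS parsed")
    # Pull the struct fields out into top-level columns.
    return parsed.select("parsed.name", "parsed.age")


def read_json_stream(spark, path, schema="name STRING, age INT"):
    """Open a streaming read over a directory of JSON files (hypothetical helper)."""
    # File-based streaming sources expect a schema up front by default;
    # schema() accepts a DDL string as well as a StructType.
    return spark.readStream.schema(schema).json(path)

# Usage with the notebook's session:
# df = spark.createDataFrame([('{"name":"John", "age":30}',)], ["raw"])
# parse_json_column(df).show()
# stream_df = read_json_stream(spark, "path/to/json_stream")
```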






