## Creating DataFrames from a CSV file

In [2]:
from pyspark.sql import SparkSession 

# Create a Spark Session object
spark = SparkSession.builder.getOrCreate()

In [None]:
# Create a DataFrame from persons.csv
df = spark.read.load("persons.csv",          # input file path
                     format="csv",           # input file format
                     header=True,            # the first line contains the column names
                     inferSchema=True)       # infer column data types; otherwise every column is read as a string

## Creating DataFrames from a JSON file - **Option 1 (not a standard JSON file)**
**Pay attention:** Spark SQL provides an API that creates a DataFrame directly from a text file in which each line contains one JSON object. The input file is therefore not a **standard** JSON file: it must be formatted so that each line holds exactly one complete JSON object (one tuple).
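
To make the one-object-per-line layout concrete, here is a minimal sketch that uses only the Python standard library. The sample records and the temporary file location are made up for illustration; they are not taken from the real persons.json.

```python
import json
import os
import tempfile

# Hypothetical sample records, used only for illustration
people = [
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
    {"name": "Michael", "age": None},
]

# Write one compact JSON object per line
path = os.path.join(tempfile.mkdtemp(), "persons.json")
with open(path, "w", encoding="utf-8") as f:
    for person in people:
        f.write(json.dumps(person) + "\n")

# Every line can be parsed on its own ...
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
parsed = [json.loads(line) for line in lines]

# ... but the file as a whole is not one valid JSON document
try:
    json.loads("\n".join(lines))
    whole_file_is_json = True
except json.JSONDecodeError:
    whole_file_is_json = False
```

A file written this way has the shape that the `format = "json"` reader expects when multiLine is not set.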

In [None]:
# Create a DataFrame from persons.json
df = spark.read.load("persons.json",         # input file path
                     format="json")          # input file format

## Creating DataFrames from a JSON file - **Option 2 (standard JSON file)**
The same API can also read 'standard' multiline JSON files when the argument **multiLine = True** is passed to the DataFrameReader.

**NOTE:** reading a set of small JSON files from HDFS is **very slow**

In [None]:
# Create a DataFrame from persons.json
df = spark.read.load("persons.json",         # input file path
                     format="json",          # input file format
                     multiLine=True)         # the file is a standard (multiline) JSON document
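
If a standard JSON file holds a top-level array of objects, one possible alternative to multiLine is to convert it ahead of time to the one-object-per-line layout. The sketch below does this with the standard library only; the inline sample document is hypothetical and stands in for a real file.

```python
import json

# A standard JSON document: one top-level array spanning several lines
# (hypothetical content, for illustration only)
standard_json = """[
  {"name": "Andy", "age": 30},
  {"name": "Justin", "age": 19}
]"""

# Parse the whole document, then emit one compact object per line
records = json.loads(standard_json)
json_lines = "\n".join(json.dumps(record) for record in records)
```

The resulting text can be saved to a file and read without the multiLine argument.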

## Creating DataFrames from RDDs or Python lists
The content of an RDD of tuples, or of a Python list of tuples, can be stored in a DataFrame by using:

**spark.createDataFrame(data, schema)**

In [3]:
# Create a Python list of tuples
profilesList = [(19, "Justin"), (30, "Andy"), (None, "Michael")]

# Create a DataFrame from profilesList, naming its two columns
df = spark.createDataFrame(profilesList, ["age", "name"])

## Creating an RDD from a DataFrame
The rdd member of the DataFrame class returns an RDD of Row objects containing the content of the DataFrame.

In [None]:
# Create a DataFrame from persons.csv (name, age)
df = spark.read.load("persons.csv",
                     format="csv",
                     header=True,
                     inferSchema=True)

# Define an RDD based on the content of the DataFrame
rddRows = df.rdd

# Use the map transformation to extract the name field/column
rddNames = rddRows.map(lambda row: row.name)

# Store the result (outputPath must be defined beforehand and must not already exist)
rddNames.saveAsTextFile(outputPath)