# PySpark – Create DataFrame

**To create a DataFrame from a list, we first need the data and the column names, so let’s define both.**

In [0]:
columns = ['language', 'users_count']

data = [('Java', '20000'), ('Python', '100000'), ('Scala', '3000')]

# 1. Create DataFrame from RDD

One easy way to create a PySpark DataFrame manually is from an existing RDD. First, let’s create a Spark RDD from a Python list by calling the **parallelize()** function of SparkContext. We will need this rdd object for the RDD examples below.

In [0]:
rdd = sc.parallelize(data)

## 1.1 Using toDF() function

PySpark RDD’s **toDF()** method creates a DataFrame from an existing RDD. Since an RDD has no column names, the DataFrame is created with the default column names “_1” and “_2”, as we have two columns.

In [0]:
dfFromRDD1 = rdd.toDF()

dfFromRDD1.printSchema()

root
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)



To give the DataFrame column names, pass them to the toDF() method as an argument, as shown below.

In [0]:
columns = ['language', 'users_count']

dfFromRDD1 = rdd.toDF(columns)

dfFromRDD1.printSchema()

root
 |-- language: string (nullable = true)
 |-- users_count: string (nullable = true)



**Use the show() method to display the contents of a PySpark DataFrame.**

In [0]:
dfFromRDD1.show()

+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+



By default, the data type of each column is inferred from the data (here both columns are strings because the source values are Python strings). We can change this behavior by supplying a schema, in which we specify a column name, data type, and nullable flag for each field/column.

## 1.2 Using createDataFrame() from SparkSession

Using createDataFrame() from SparkSession is another way to create a DataFrame manually; it takes the rdd object as an argument. Chain it with toDF() to name the columns.

In [0]:
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)

In [0]:
dfFromRDD2.show()

+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+



# 2. Create DataFrame from List Collection

**How to create a PySpark DataFrame from a list.**

## 2.1 Using createDataFrame() from SparkSession

Calling **createDataFrame()** from SparkSession is another way to create a PySpark DataFrame manually; here it takes a list object as an argument. Chain it with toDF() to name the columns.

In [0]:
dfFromData2 = spark.createDataFrame(data).toDF(*columns)

In [0]:
dfFromData2.show()

+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+



## 2.2 Using createDataFrame() with the Row type

**createDataFrame()** has another signature in PySpark that takes a collection of Row objects and a schema of column names as arguments. To use it, we first need to convert our “data” object from a list of tuples to a list of Row.

In [0]:
from pyspark.sql import Row

# Build a concrete list of Row objects (a bare map() is a one-shot iterator in Python 3)
rowData = [Row(*x) for x in data]
dfFromData3 = spark.createDataFrame(rowData, columns)

In [0]:
dfFromData3.show()

+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+



## 2.3 Create DataFrame with schema

If you want to specify the column names along with their data types, create the **StructType schema** first and then pass it when creating the DataFrame.

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType


data2 = [
    ("James","","Smith","36636","M",3000),
    ("Michael","Rose","","40288","M",4000),
    ("Robert","","Williams","42114","M",4000),
    ("Maria","Anne","Jones","39192","F",4000),
    ("Jen","Mary","Brown","","F",-1)
]


schema = StructType([
    StructField('firstname', StringType(), True),
    StructField('middlename', StringType(), True),
    StructField('lastname', StringType(), True),
    StructField('id', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('salary', IntegerType(), True)
])


df = spark.createDataFrame(data=data2, schema=schema)
df.printSchema()
df.show(truncate=False)

root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|id   |gender|salary|
+---------+----------+--------+-----+------+------+
|James    |          |Smith   |36636|M     |3000  |
|Michael  |Rose      |        |40288|M     |4000  |
|Robert   |          |Williams|42114|M     |4000  |
|Maria    |Anne      |Jones   |39192|F     |4000  |
|Jen      |Mary      |Brown   |     |F     |-1    |
+---------+----------+--------+-----+------+------+



In [0]:
df.show(vertical=True)

-RECORD 0--------------
 firstname  | James    
 middlename |          
 lastname   | Smith    
 id         | 36636    
 gender     | M        
 salary     | 3000     
-RECORD 1--------------
 firstname  | Michael  
 middlename | Rose     
 lastname   |          
 id         | 40288    
 gender     | M        
 salary     | 4000     
-RECORD 2--------------
 firstname  | Robert   
 middlename |          
 lastname   | Williams 
 id         | 42114    
 gender     | M        
 salary     | 4000     
-RECORD 3--------------
 firstname  | Maria    
 middlename | Anne     
 lastname   | Jones    
 id         | 39192    
 gender     | F        
 salary     | 4000     
-RECORD 4--------------
 firstname  | Jen      
 middlename | Mary     
 lastname   | Brown    
 id         |          
 gender     | F        
 salary     | -1       



# 3. Create DataFrame from Data sources


**In practice, you mostly create DataFrames from data source files such as CSV, text, JSON, and XML.**

PySpark supports many data formats out of the box without importing any extra libraries; to create a DataFrame, use the appropriate method of the DataFrameReader class.

## 3.1 Creating DataFrame from CSV

Use the **csv()** method of the DataFrameReader object to create a DataFrame from a CSV file. You can also provide options such as the delimiter, whether the data is quoted, date formats, schema inference, and many more.

In [0]:
df2 = spark.read.csv("dbfs:/FileStore/resources/small_zipcode.csv", header=True, sep=",", inferSchema=True)

In [0]:
df2.printSchema()
df2.show()


root
 |-- id: integer (nullable = true)
 |-- zipcode: integer (nullable = true)
 |-- type: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- population: integer (nullable = true)

+---+-------+--------+-------------------+-----+----------+
| id|zipcode|    type|               city|state|population|
+---+-------+--------+-------------------+-----+----------+
|  1|    704|STANDARD|               null|   PR|     30100|
|  2|    704|    null|PASEO COSTA DEL SUR|   PR|      null|
|  3|    709|    null|       BDA SAN LUIS|   PR|      3700|
|  4|  76166|  UNIQUE|  CINGULAR WIRELESS|   TX|     84000|
|  5|  76177|STANDARD|               null|   TX|      null|
+---+-------+--------+-------------------+-----+----------+



In [0]:
# Convert to pandas and render the rows as one space-padded text block
pandas_df = df2.toPandas()
pandas_df_str = pandas_df.to_string(columns=['id', 'zipcode', 'type', 'city', 'state', 'population'], header=True, index=False)

# Write the text to DBFS; the third argument overwrites an existing file
dbutils.fs.put('/tmp/small+zipcode.txt', pandas_df_str, True)
#df2.write.text('dbfs:/FileStore/resources/small_zipcode.txt')

Wrote 359 bytes.
Out[103]: True

In [0]:
# Clean-up: delete the file (section 3.2 reads it, so run this afterwards)
dbutils.fs.rm('dbfs:/tmp/small+zipcode.txt/', True)

Out[102]: True

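## 3.2 Creating DataFrame from text (TXT) file

Similarly, you can create a DataFrame by reading from a text file with the text() method of the DataFrameReader. The cell below keeps the text() call commented out and reads the file with csv() instead; because the file is space-padded rather than comma-delimited, each line ends up in a single string column.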

In [0]:
# df3 = spark.read.text('dbfs:/tmp/small+zipcode.txt')

df3 = spark.read.csv('dbfs:/tmp/small+zipcode.txt', inferSchema=True, header=True)

In [0]:
df3.printSchema()
df3.show(truncate=False)

root
 |--  id  zipcode     type                city state  population: string (nullable = true)

+-----------------------------------------------------------+
| id  zipcode     type                city state  population|
+-----------------------------------------------------------+
|  1      704 STANDARD                None    PR     30100.0|
|  2      704     None PASEO COSTA DEL SUR    PR         NaN|
|  3      709     None        BDA SAN LUIS    PR      3700.0|
|  4    76166   UNIQUE   CINGULAR WIRELESS    TX     84000.0|
|  5    76177 STANDARD                None    TX         NaN|
+-----------------------------------------------------------+



## 3.3 Creating DataFrame from JSON file

**PySpark can also process semi-structured data files such as JSON. Use the json() method of the DataFrameReader to read a JSON file into a DataFrame. Below is a simple example.**

In [0]:
df_json = spark.read.json('dbfs:/FileStore/resources/simple_zipcodes.json')

In [0]:
df_json.show()

+-------------------+-----+-----------+-------+
|               City|State|ZipCodeType|Zipcode|
+-------------------+-----+-----------+-------+
|        PARC PARQUE|   PR|   STANDARD|    704|
|PASEO COSTA DEL SUR|   PR|   STANDARD|    704|
|       BDA SAN LUIS|   PR|   STANDARD|    709|
|  CINGULAR WIRELESS|   TX|     UNIQUE|  76166|
|         FORT WORTH|   TX|   STANDARD|  76177|
|           FT WORTH|   TX|   STANDARD|  76177|
|    URB EUGENE RICE|   PR|   STANDARD|    704|
|               MESA|   AZ|   STANDARD|  85209|
|               MESA|   AZ|   STANDARD|  85210|
|           HILLIARD|   FL|   STANDARD|  32046|
+-------------------+-----+-----------+-------+



# 4. Other sources (Avro, Parquet, ORC, Kafka)

In [0]:
df_parquet = spark.read.parquet('dbfs:/FileStore/resources/zipcodes.parquet')

In [0]:
display(df_parquet)

RecordNumber,Zipcode,ZipCodeType,City,State,LocationType,Lat,Long,Xaxis,Yaxis,Zaxis,WorldRegion,Country,LocationText,Location,Decommisioned,TaxReturnsFiled,EstimatedPopulation,TotalWages,Notes
1,704,STANDARD,PARC PARQUE,PR,NOT ACCEPTABLE,17.96,-66.22,0.38,-0.87,0.3,,US,"Parc Parque, PR",NA-US-PR-PARC PARQUE,False,,,,
2,704,STANDARD,PASEO COSTA DEL SUR,PR,NOT ACCEPTABLE,17.96,-66.22,0.38,-0.87,0.3,,US,"Paseo Costa Del Sur, PR",NA-US-PR-PASEO COSTA DEL SUR,False,,,,
10,709,STANDARD,BDA SAN LUIS,PR,NOT ACCEPTABLE,18.14,-66.26,0.38,-0.86,0.31,,US,"Bda San Luis, PR",NA-US-PR-BDA SAN LUIS,False,,,,
61391,76166,UNIQUE,CINGULAR WIRELESS,TX,NOT ACCEPTABLE,32.72,-97.31,-0.1,-0.83,0.54,,US,"Cingular Wireless, TX",NA-US-TX-CINGULAR WIRELESS,False,,,,
61392,76177,STANDARD,FORT WORTH,TX,PRIMARY,32.75,-97.33,-0.1,-0.83,0.54,,US,"Fort Worth, TX",NA-US-TX-FORT WORTH,False,2126.0,4053.0,122396986.0,
61393,76177,STANDARD,FT WORTH,TX,ACCEPTABLE,32.75,-97.33,-0.1,-0.83,0.54,,US,"Ft Worth, TX",NA-US-TX-FT WORTH,False,2126.0,4053.0,122396986.0,
4,704,STANDARD,URB EUGENE RICE,PR,NOT ACCEPTABLE,17.96,-66.22,0.38,-0.87,0.3,,US,"Urb Eugene Rice, PR",NA-US-PR-URB EUGENE RICE,False,,,,
39827,85209,STANDARD,MESA,AZ,PRIMARY,33.37,-111.64,-0.3,-0.77,0.55,,US,"Mesa, AZ",NA-US-AZ-MESA,False,14962.0,26883.0,563792730.0,"no NWS data,"
39828,85210,STANDARD,MESA,AZ,PRIMARY,33.38,-111.84,-0.31,-0.77,0.55,,US,"Mesa, AZ",NA-US-AZ-MESA,False,14374.0,25446.0,471000465.0,
49345,32046,STANDARD,HILLIARD,FL,PRIMARY,30.69,-81.92,0.12,-0.85,0.51,,US,"Hilliard, FL",NA-US-FL-HILLIARD,False,3922.0,7443.0,133112149.0,


In [0]:
df_avro = spark.read.format('avro').load('dbfs:/FileStore/resources/zipcodes.avro')

In [0]:
display(df_avro)

RecordNumber,Zipcode,ZipCodeType,City,State,LocationType,Lat,Long,Xaxis,Yaxis,Zaxis,WorldRegion,Country,LocationText,Location,Decommisioned,TaxReturnsFiled,EstimatedPopulation,TotalWages,Notes
1,704,STANDARD,PARC PARQUE,PR,NOT ACCEPTABLE,17.96,-66.22,0.38,-0.87,0.3,,US,"Parc Parque, PR",NA-US-PR-PARC PARQUE,False,,,,
2,704,STANDARD,PASEO COSTA DEL SUR,PR,NOT ACCEPTABLE,17.96,-66.22,0.38,-0.87,0.3,,US,"Paseo Costa Del Sur, PR",NA-US-PR-PASEO COSTA DEL SUR,False,,,,
10,709,STANDARD,BDA SAN LUIS,PR,NOT ACCEPTABLE,18.14,-66.26,0.38,-0.86,0.31,,US,"Bda San Luis, PR",NA-US-PR-BDA SAN LUIS,False,,,,
61391,76166,UNIQUE,CINGULAR WIRELESS,TX,NOT ACCEPTABLE,32.72,-97.31,-0.1,-0.83,0.54,,US,"Cingular Wireless, TX",NA-US-TX-CINGULAR WIRELESS,False,,,,
61392,76177,STANDARD,FORT WORTH,TX,PRIMARY,32.75,-97.33,-0.1,-0.83,0.54,,US,"Fort Worth, TX",NA-US-TX-FORT WORTH,False,2126.0,4053.0,122396986.0,
61393,76177,STANDARD,FT WORTH,TX,ACCEPTABLE,32.75,-97.33,-0.1,-0.83,0.54,,US,"Ft Worth, TX",NA-US-TX-FT WORTH,False,2126.0,4053.0,122396986.0,
4,704,STANDARD,URB EUGENE RICE,PR,NOT ACCEPTABLE,17.96,-66.22,0.38,-0.87,0.3,,US,"Urb Eugene Rice, PR",NA-US-PR-URB EUGENE RICE,False,,,,
39827,85209,STANDARD,MESA,AZ,PRIMARY,33.37,-111.64,-0.3,-0.77,0.55,,US,"Mesa, AZ",NA-US-AZ-MESA,False,14962.0,26883.0,563792730.0,"no NWS data,"
39828,85210,STANDARD,MESA,AZ,PRIMARY,33.38,-111.84,-0.31,-0.77,0.55,,US,"Mesa, AZ",NA-US-AZ-MESA,False,14374.0,25446.0,471000465.0,
49345,32046,STANDARD,HILLIARD,FL,PRIMARY,30.69,-81.92,0.12,-0.85,0.51,,US,"Hilliard, FL",NA-US-FL-HILLIARD,False,3922.0,7443.0,133112149.0,
