**Topics covered**

     1) Create a DataFrame from a List of Tuples
     2) Create a DataFrame from a List of Lists
     3) Create a DataFrame from a List of Dictionaries
     4) Create a DataFrame from a Simple List
     5) Create a DataFrame with an Explicit Schema
     6) Create a DataFrame Directly from a List Using Row
     7) toDF()

In [0]:
help(spark.createDataFrame)

Help on method createDataFrame in module pyspark.sql.session:

createDataFrame(data: Union[ForwardRef('RDD[Any]'), Iterable[Any], ForwardRef('PandasDataFrameLike'), ForwardRef('ArrayLike')], schema: Union[pyspark.sql.types.AtomicType, pyspark.sql.types.StructType, str, NoneType] = None, samplingRatio: Optional[float] = None, verifySchema: bool = True) -> pyspark.sql.dataframe.DataFrame method of pyspark.sql.session.SparkSession instance
    Creates a :class:`DataFrame` from an :class:`RDD`, a list, a :class:`pandas.DataFrame`
    or a :class:`numpy.ndarray`.
    
    .. versionadded:: 2.0.0
    
    .. versionchanged:: 3.4.0
        Supports Spark Connect.
    
    Parameters
    ----------
    data : :class:`RDD` or iterable
        an RDD of any kind of SQL data representation (:class:`Row`,
        :class:`tuple`, ``int``, ``boolean``, ``dict``, etc.), or :class:`list`,
        :class:`pandas.DataFrame` or :class:`numpy.ndarray`.
    schema : :class:`pyspark.sql.types.DataType`, str

**1) Create a DataFrame from a List of Tuples**
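- If your **list** contains **tuples** where **each tuple represents a row**, pass it to createDataFrame together with the column names, given either as a **list of names** or as a **DDL-formatted schema string** (name followed by type).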

In [0]:
# List of tuples
data = [(1, "Albert", 25, "Sales", "ADF"),
        (2, "Buvan", 30, "Marketing", "Oracle"),
        (3, "Chandar", 28, "IT", "SAP"),
        (4, "Syam", 33, "Admin", "Tally"),
        (5, "Senthil", 26, "Production", "MATLAB"),
        (6, "Surya", 35, "Quality", "Excel")]

# Define column names
columns = ["ID", "Name", "Age", "Department", "Technology"]

In [0]:
# Create DataFrame
df = spark.createDataFrame(data, schema=columns)

# Display DataFrame
display(df)

ID,Name,Age,Department,Technology
1,Albert,25,Sales,ADF
2,Buvan,30,Marketing,Oracle
3,Chandar,28,IT,SAP
4,Syam,33,Admin,Tally
5,Senthil,26,Production,MATLAB
6,Surya,35,Quality,Excel


In [0]:
# Create DataFrame using a DDL-formatted schema string (column names and types)
df1 = spark.createDataFrame(data, 'ID int, Name string, Age int, Department string, Technology string')

# Display DataFrame
display(df1)

ID,Name,Age,Department,Technology
1,Albert,25,Sales,ADF
2,Buvan,30,Marketing,Oracle
3,Chandar,28,IT,SAP
4,Syam,33,Admin,Tally
5,Senthil,26,Production,MATLAB
6,Surya,35,Quality,Excel
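Column names are matched to tuple elements by position, so a row with too few or too many elements is an easy mistake to make. The following is a minimal, Spark-free sketch of a sanity check that could be run before calling createDataFrame; the names `mismatched` and `first_row_as_dict` are illustrative only and not part of any PySpark API.

```python
# A shortened copy of the sample rows used in this section.
data = [(1, "Albert", 25, "Sales", "ADF"),
        (2, "Buvan", 30, "Marketing", "Oracle"),
        (3, "Chandar", 28, "IT", "SAP")]

columns = ["ID", "Name", "Age", "Department", "Technology"]

# Indexes of rows whose length does not match the number of column names.
mismatched = [i for i, row in enumerate(data) if len(row) != len(columns)]
print(mismatched)

# Pair each column name with the value at the same position in the first row.
first_row_as_dict = dict(zip(columns, data[0]))
print(first_row_as_dict)
```

An empty `mismatched` list means every tuple has one value per column name.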


**2) Create a DataFrame from a List of Lists**
- If your **list** contains **lists instead of tuples**, createDataFrame handles it the same way: each inner list becomes one row.

In [0]:
# List of Lists
data = [[1, "Albert", 25, "Sales", "ADF"],
        [2, "Buvan", 30, "Marketing", "Oracle"],
        [3, "Chandar", 28, "IT", "SAP"],
        [4, "Syam", 33, "Admin", "Tally"]]

# Define column names
columns = ["ID", "Name", "Age", "Department", "Technology"]

df2 = spark.createDataFrame(data, schema=columns)
display(df2)

ID,Name,Age,Department,Technology
1,Albert,25,Sales,ADF
2,Buvan,30,Marketing,Oracle
3,Chandar,28,IT,SAP
4,Syam,33,Admin,Tally


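**3) Create a DataFrame from a List of Dictionaries**
- If your **list** contains **dictionaries**, the **keys become the column names** and the column types are inferred from the values, so no separate schema is needed.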

In [0]:
data = [{'Name':'Jayanth', 'ID':'A123', 'Country':'USA'},
        {'Name':'Rupesh', 'ID':'A124', 'Country':'USA'},
        {'Name':'Thrusanth', 'ID':'A125', 'Country':'IND'},
        {'Name':'Jahangeer', 'ID':'A126', 'Country':'USA'},
        {'Name':'Sowmya', 'ID':'A127', 'Country':'INA'}]

df_dict = spark.createDataFrame(data)
display(df_dict)

Country,ID,Name
USA,A123,Jayanth
USA,A124,Rupesh
IND,A125,Thrusanth
USA,A126,Jahangeer
INA,A127,Sowmya
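In the output above the columns come back as Country, ID, Name, which is alphabetical by key rather than the order in which the keys were written. If a specific column order matters, one option is to convert each dictionary into a tuple in that order and pass explicit column names. This is a minimal sketch; the `columns` order and the `rows` name are assumptions made for illustration.

```python
# Two of the dictionaries from the example above.
data = [{'Name': 'Jayanth', 'ID': 'A123', 'Country': 'USA'},
        {'Name': 'Rupesh', 'ID': 'A124', 'Country': 'USA'}]

# Desired column order (chosen for this sketch).
columns = ['ID', 'Name', 'Country']

# Build one tuple per dictionary, following the order in `columns`.
rows = [tuple(d[c] for c in columns) for d in data]
print(rows)

# In a notebook with an active `spark` session, the rows can then be used as in section 1:
# df_ordered = spark.createDataFrame(rows, schema=columns)
```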


**4) Create a DataFrame from a Simple List**
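- If your **list** holds plain values for a **single column**, you can still use createDataFrame: either pass a type string such as 'int' (the column is then named **value**), or wrap each value in a **one-element tuple** and supply the column name.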

     # Method 01
     df = spark.createDataFrame(data, 'int')

     # Method 02
     df = spark.createDataFrame([(x,) for x in data], ["Numbers"])

In [0]:
data = [15, 25, 36, 44, 57, 65, 89, 95, 9]

df3 = spark.createDataFrame(data, 'int')
display(df3)

value
15
25
36
44
57
65
89
95
9


In [0]:
df3 = df3.withColumnRenamed("value", "Numbers")
display(df3)

Numbers
15
25
36
44
57
65
89
95
9


In [0]:
data = [15, 25, 36, 44, 57, 65, 89, 95, 9]

df4 = spark.createDataFrame([(x,) for x in data], ["Numbers"])
display(df4)

Numbers
15
25
36
44
57
65
89
95
9


In [0]:
# Create a sample DataFrame
data = [(1,), (2,), (3,), (4,), (5,), (6,), (7,), (8,), (9,), (10,)]

df41 = spark.createDataFrame(data, ["id"])
display(df41)

id
1
2
3
4
5
6
7
8
9
10
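The wrapping step in Method 02 is plain Python and can be checked without Spark. The sketch below shows that `(x,)` builds a one-element tuple, while `(x)` without the trailing comma is just the value itself; the variable names are illustrative.

```python
data = [15, 25, 36, 44, 57, 65, 89, 95, 9]

# Wrap every scalar in a one-element tuple so that each row has exactly one field.
rows = [(x,) for x in data]
print(rows[:3])

# The trailing comma is what makes the tuple: (15) is simply the integer 15.
not_a_tuple = (15)
print(type(not_a_tuple).__name__, type(rows[0]).__name__)

# With an active `spark` session, `rows` can be passed on as in Method 02:
# df = spark.createDataFrame(rows, ["Numbers"])
```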


**5) Create a DataFrame with an Explicit Schema**
- You can define the **schema** explicitly using **StructType and StructField**.

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

data = [[1, "Albert", 25, "Sales", "ADF"],
        [2, "Buvan", 30, "Marketing", "Oracle"],
        [3, "Chandar", 28, "IT", "SAP"],
        [4, "Syam", 33, "Admin", "Tally"],
        [5, "Bharat", 28, "Maintenance", "Excel"]]

schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Department", StringType(), True),
    StructField("Technology", StringType(), True)
])

df_schema = spark.createDataFrame(data, schema=schema)
display(df_schema)

ID,Name,Age,Department,Technology
1,Albert,25,Sales,ADF
2,Buvan,30,Marketing,Oracle
3,Chandar,28,IT,SAP
4,Syam,33,Admin,Tally
5,Bharat,28,Maintenance,Excel


In [0]:
# create student data with 5 rows and 6 attributes
students = [['001', 'sravan', 23, 5.79, 67, 'Chennai'],
            ['002', 'ojaswi', 16, 3.79, 34, 'Hyderabad'],
            ['003', 'gnanesh chowdary', 7, 2.79, 17, 'Bangalore'],
            ['004', 'rohith', 9, 3.69, 28, 'Delhi'],
            ['005', 'sridevi', 37, 5.59, 54, 'Nasik']]

# define the schema as a DDL-formatted string: each column name followed by its type
schema = """rollno string, name string, age int, height float, weight int, address string"""

# create the DataFrame using the schema string
df_schema_string = spark.createDataFrame(students, schema=schema)
display(df_schema_string)

rollno,name,age,height,weight,address
001,sravan,23,5.79,67,Chennai
002,ojaswi,16,3.79,34,Hyderabad
003,gnanesh chowdary,7,2.79,17,Bangalore
004,rohith,9,3.69,28,Delhi
005,sridevi,37,5.59,54,Nasik


In [0]:
data = [(1, 'Sandhya', [20, 30, 40]),
        (2, 'Alex', [40, 20, 10]),
        (3, 'Joseph', []),
        (4, 'Arya', [20, 1, None])]

df_schema_def = spark.createDataFrame(data, schema="ID int, Name string, Marks array<int>")
display(df_schema_def)

ID,Name,Marks
1,Sandhya,"List(20, 30, 40)"
2,Alex,"List(40, 20, 10)"
3,Joseph,List()
4,Arya,"List(20, 1, null)"


**6) Create a DataFrame Directly from a List Using Row**
- A **Row** object carries its own **field names**, so createDataFrame takes the column names from the Row fields and no separate schema is needed.

In [0]:
from pyspark.sql import Row

data = [Row(ID=1, Name="Albert", Age=25, Department="Sales", Technology="ADF"),
        Row(ID=2, Name="Buvan", Age=30, Department="Marketing", Technology="Oracle"),
        Row(ID=3, Name="Chandar", Age=28, Department="IT", Technology="SAP"),
        Row(ID=4, Name="Syam", Age=33, Department="Admin", Technology="Tally"),
        Row(ID=5, Name="Bharat", Age=28, Department="Maintenance", Technology="Excel")]

df_row = spark.createDataFrame(data)
display(df_row)

ID,Name,Age,Department,Technology
1,Albert,25,Sales,ADF
2,Buvan,30,Marketing,Oracle
3,Chandar,28,IT,SAP
4,Syam,33,Admin,Tally
5,Bharat,28,Maintenance,Excel


**7) toDF()**
- **toDF()** returns a **new DataFrame** with the columns renamed to the given names, so it can assign column names after createDataFrame is called without a schema.

In [0]:
employees = [(1, "Santhosh", "Kumar", 1000.0, "united states", "+1 123 456 7890", "123 45 6789"),
             (2, "Hemanth", "Raju", 1250.0, "India", "+91 234 567 8901", "456 78 9123"),
             (3, "Nisha", "Kakkar", 750.0, "united KINGDOM", "+44 111 111 1111", "222 33 4444"),
             (4, "Bobby", "Deol", 1500.0, "AUSTRALIA", "+61 987 654 3210", "789 12 6118"),
             (5, "Harish", "Rao", 1650.0, "Sweden", "+91 234 567 8901", "456 78 9123"),
             (6, "Yung", "Lee", 850.0, "China", "+44 222 111 1111", "567 33 4444"),
             (7, "Bushan", "Dayal", 1770.0, "Butan", "+71 932 654 5215", "489 16 8318")]

# create the DataFrame without a schema, then name all columns with toDF()
df_todf = spark.createDataFrame(employees).toDF(
    "employee_id", "first_name", "last_name", "salary", "nationality", "phone_number", "ssn"
)
display(df_todf)

employee_id,first_name,last_name,salary,nationality,phone_number,ssn
1,Santhosh,Kumar,1000.0,united states,+1 123 456 7890,123 45 6789
2,Hemanth,Raju,1250.0,India,+91 234 567 8901,456 78 9123
3,Nisha,Kakkar,750.0,united KINGDOM,+44 111 111 1111,222 33 4444
4,Bobby,Deol,1500.0,AUSTRALIA,+61 987 654 3210,789 12 6118
5,Harish,Rao,1650.0,Sweden,+91 234 567 8901,456 78 9123
6,Yung,Lee,850.0,China,+44 222 111 1111,567 33 4444
7,Bushan,Dayal,1770.0,Butan,+71 932 654 5215,489 16 8318
