### **How to join multiple datasets?**

- `join()` combines fields from **two or more** DataFrames; to join more than two, **chain `join()` calls**.
- This notebook also shows how to **eliminate duplicate columns** in the result DataFrame.
- A join is a **wide transformation**, so it generally involves a lot of **shuffling** of data across partitions.
- This notebook discusses **inner join** and **left join**; the code cells use **left join**.

**Syntax**

     df1.join(df2, df1.emp_id == df2.emp_id, 'left')

     df1.join(df2, on=[df1['emp_id'] == df2['emp_id']], how='left')

     df1.join(df2, on=[df1['emp_id'] == df2['emp_id']], how='left') \
        .join(df3, on=["emp_dept_id"], how='left')

     df1.join(df2, ["emp_id"], 'left') \
        .join(df3, df1["emp_dept_id"] == df3["dept_id"], 'left')

In [0]:
%fs ls /FileStore/tables/

path,name,size,modificationTime
dbfs:/FileStore/tables/Flatten Nested Array.json,Flatten Nested Array.json,3756,1718618620000
dbfs:/FileStore/tables/MarketPrice-1.csv,MarketPrice-1.csv,19528,1719656512000
dbfs:/FileStore/tables/MarketPrice.csv,MarketPrice.csv,19528,1719656208000
dbfs:/FileStore/tables/RunningData_Rev02.csv,RunningData_Rev02.csv,1222,1719810609000
dbfs:/FileStore/tables/RunningData_Rev03.csv,RunningData_Rev03.csv,1216,1719810946000
dbfs:/FileStore/tables/SalesData_Rev02.csv,SalesData_Rev02.csv,472,1719810784000
dbfs:/FileStore/tables/SalesData_Rev03.csv,SalesData_Rev03.csv,460,1719810973000
dbfs:/FileStore/tables/Sales_Collect_Rev02.csv,Sales_Collect_Rev02.csv,166107,1719810826000
dbfs:/FileStore/tables/Sales_Collect_Rev03.csv,Sales_Collect_Rev03.csv,182828,1719811001000
dbfs:/FileStore/tables/StructType-1.csv,StructType-1.csv,648,1717934508000


In [0]:
import pyspark.sql.functions as f
from pyspark.sql.functions import col
from pyspark.sql.types import StringType, DoubleType, LongType

#### **Ex 01**

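- Join three datasets: **Emp**, **Address** and **Dept**.
- **Inner join** is the **default join** type and the most commonly used.
- An **inner join** matches two DataFrames on **key** columns; rows whose keys **don't match** are **dropped from both datasets**.
- The cells below pass `'left'` instead, so every row of the **left** DataFrame is kept even when no matching key exists on the right.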
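
To make the difference between the two join types concrete, here is a minimal plain-Python sketch of inner versus left join semantics over lists of dicts. It models only the matching rules, not how Spark executes a join. `join_rows` is a hypothetical helper introduced for illustration, and the Dept table is deliberately placed on the left so that the unmatched `IT` row becomes visible.

```python
def join_rows(left, right, left_key, right_key, how="inner"):
    """Model inner/left join semantics on lists of dicts (illustrative only)."""
    if how not in ("inner", "left"):
        raise ValueError("this sketch only models 'inner' and 'left'")

    # Index the right-hand rows by join key for quick lookup.
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)

    right_cols = list(right[0]) if right else []
    result = []
    for lrow in left:
        key = lrow[left_key]
        # As in SQL, a null key never matches anything.
        matches = index.get(key, []) if key is not None else []
        for rrow in matches:
            result.append({**lrow, **rrow})
        if not matches and how == "left":
            # Left join: keep the left row and fill the right columns with None.
            result.append({**lrow, **dict.fromkeys(right_cols)})
    return result


# Same values as the Dept and Emp tables used in this example.
dept_rows = [
    {"dept_name": "Finance", "dept_id": 10},
    {"dept_name": "Marketing", "dept_id": 20},
    {"dept_name": "Sales", "dept_id": 30},
    {"dept_name": "IT", "dept_id": 40},
]
emp_rows = [
    {"emp_id": 1, "name": "Smith", "emp_dept_id": 10},
    {"emp_id": 2, "name": "Rose", "emp_dept_id": 20},
    {"emp_id": 3, "name": "Williams", "emp_dept_id": 10},
    {"emp_id": 4, "name": "Jones", "emp_dept_id": 30},
]

inner_rows = join_rows(dept_rows, emp_rows, "dept_id", "emp_dept_id", how="inner")
left_rows = join_rows(dept_rows, emp_rows, "dept_id", "emp_dept_id", how="left")

print(len(inner_rows))  # 4: the IT department has no employee and is dropped
print(len(left_rows))   # 5: the IT row is kept, with None in the employee columns
```

With Emp on the left, as in the cells that follow, both join types would return the same four rows, because every `emp_dept_id` has a matching `dept_id`.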

In [0]:
# Emp Table

emp_data = [(1,"Smith",10), (2,"Rose",20), (3,"Williams",10), (4,"Jones",30)]
emp_Columns = ["emp_id","name","emp_dept_id"]

df_emp = spark.createDataFrame(emp_data, emp_Columns)
display(df_emp)

emp_id,name,emp_dept_id
1,Smith,10
2,Rose,20
3,Williams,10
4,Jones,30


In [0]:
# Address Table

add_data=[(1,"1523 Main St","SFO","CA"),
          (2,"3453 Orange St","SFO","NY"),
          (3,"34 Warner St","Jersey","NJ"),
          (4,"221 Cavalier St","Newark","DE"),
          (5,"789 Walnut St","Sandiago","CA")
         ]
add_Columns = ["emp_id","addline1","city","state"]

df_add = spark.createDataFrame(add_data, add_Columns)
display(df_add)

emp_id,addline1,city,state
1,1523 Main St,SFO,CA
2,3453 Orange St,SFO,NY
3,34 Warner St,Jersey,NJ
4,221 Cavalier St,Newark,DE
5,789 Walnut St,Sandiago,CA


In [0]:
# Dept Table

dept_data = [("Finance",10), ("Marketing",20), ("Sales",30),("IT",40)]
dept_Columns = ["dept_name","dept_id"]

df_dept = spark.createDataFrame(dept_data, dept_Columns)  
display(df_dept)

dept_name,dept_id
Finance,10
Marketing,20
Sales,30
IT,40


In [0]:
# join Employee and Address datasets
df_emp.join(df_add, df_emp.emp_id == df_add.emp_id, 'left').display()

emp_id,name,emp_dept_id,emp_id.1,addline1,city,state
1,Smith,10,1,1523 Main St,SFO,CA
2,Rose,20,2,3453 Orange St,SFO,NY
3,Williams,10,3,34 Warner St,Jersey,NJ
4,Jones,30,4,221 Cavalier St,Newark,DE


#### **Drop Duplicate Columns After Join**
- Notice that **emp_id** appears **twice** in the join result above. To remove the duplicate column, specify the join column as a **string or a list of column names** instead of a join expression.

- To join on a **list** of column names, the **join columns must have the same names** in **both DataFrames**.
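
The column layouts seen in this notebook's outputs can be summarized with a small plain-Python sketch. `joined_columns` is a hypothetical helper that only reproduces the column order observed here for inner and left joins; it is not Spark's implementation.

```python
def joined_columns(left_cols, right_cols, using=None):
    """Return the column names of a join result, as observed in this notebook."""
    if using is None:
        # Join on an expression: all columns from both sides are kept,
        # so a key that exists on both sides appears twice.
        return list(left_cols) + list(right_cols)
    # Join on column names: each key appears once, at the front,
    # followed by the remaining left columns and then the remaining right columns.
    keys = list(using)
    return (
        keys
        + [c for c in left_cols if c not in keys]
        + [c for c in right_cols if c not in keys]
    )


emp_cols = ["emp_id", "name", "emp_dept_id"]
add_cols = ["emp_id", "addline1", "city", "state"]

print(joined_columns(emp_cols, add_cols))
# ['emp_id', 'name', 'emp_dept_id', 'emp_id', 'addline1', 'city', 'state']
print(joined_columns(emp_cols, add_cols, using=["emp_id"]))
# ['emp_id', 'name', 'emp_dept_id', 'addline1', 'city', 'state']
```

The second `emp_id` in the first result is the column that `display()` shows as `emp_id.1`.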

In [0]:
# Joining on the column name keeps a single emp_id column
df_emp.join(df_add, ["emp_id"], 'left').display()

emp_id,name,emp_dept_id,addline1,city,state
1,Smith,10,1523 Main St,SFO,CA
2,Rose,20,3453 Orange St,SFO,NY
3,Williams,10,34 Warner St,Jersey,NJ
4,Jones,30,221 Cavalier St,Newark,DE


In [0]:
# Join Multiple DataFrames (Employee, Address and Department) by chaining
df_emp.join(df_add, ["emp_id"], 'left') \
      .join(df_dept, df_emp["emp_dept_id"] == df_dept["dept_id"], 'left') \
      .display()

emp_id,name,emp_dept_id,addline1,city,state,dept_name,dept_id
1,Smith,10,1523 Main St,SFO,CA,Finance,10
2,Rose,20,3453 Orange St,SFO,NY,Marketing,20
3,Williams,10,34 Warner St,Jersey,NJ,Finance,10
4,Jones,30,221 Cavalier St,Newark,DE,Sales,30


#### **Ex 02**

- Read all input .csv files
  - Sales_Collect_Rev02.csv
  - SalesData_Rev02.csv
  - RunningData_Rev02.csv

- join two datasets:
  - **Sales_Collect_Rev02** and **SalesData_Rev02**
- join three datasets:
  - **Sales_Collect_Rev02**, **SalesData_Rev02** and **RunningData_Rev02**

In [0]:
# Read "Sales_Collect_Rev02.csv"
Sales_Collect_Rev02 = spark.read.option('header',True).option("quote", "\"").option('InferSchema',True).csv("/FileStore/tables/Sales_Collect_Rev02.csv")

display(Sales_Collect_Rev02.limit(10))
Sales_Collect_Rev02.printSchema()
print("Number of Rows:", Sales_Collect_Rev02.count())

Id,dept_Id,SubDept_Id,Vehicle_Id,Vehicle_Profile_Id,Description,Vehicle_Price_Id,Vehicle_Showroom_Price,Vehicle_Showroom_Delta,Vehicle_Showroom_Payment_Date,Average,Increment,Target_Simulation_Id
257,257,1,1,0,Baleno,6,72567.98,5678.01,2023-02-20,2381.657773,0.0,1071
264,264,1,0,0,Engine_Base,90,91768.98,12678.01,2025-06-30,553.8461539,0.0,1063
265,265,1,0,0,Baleno,83,8400.123,1450.01,2023-12-27,-7199.999999,0.0,1065
266,266,1,0,0,Engine_Base,76,77345.665,3456.01,2024-04-30,7200.0,0.0,1063
267,267,1,0,0,Suzuki Swift,96,974567.11,110.01,2025-12-31,1404.878049,0.0,1063
268,268,1,1,0,Suzuki Swift,48,49.0,0.01,2023-03-20,834.1253,0.0,1068
270,270,1,0,0,Wagon R,76,77345.665,3456.01,2024-03-26,7200.0,0.0,1065
271,271,1,0,0,Engine_Base,34,35.0,12340.0123,2023-03-20,1668.2506,0.0,1068
272,272,1,1,0,Creta,29,30.0,12340.0123,2023-03-20,-2383.215143,0.0,1071
277,277,1,0,0,Brezza,73,74567.34567,3456.01,2023-12-28,7440.0,0.0,1065


root
 |-- Id: integer (nullable = true)
 |-- dept_Id: integer (nullable = true)
 |-- SubDept_Id: integer (nullable = true)
 |-- Vehicle_Id: integer (nullable = true)
 |-- Vehicle_Profile_Id: integer (nullable = true)
 |-- Description: string (nullable = true)
 |-- Vehicle_Price_Id: integer (nullable = true)
 |-- Vehicle_Showroom_Price: double (nullable = true)
 |-- Vehicle_Showroom_Delta: double (nullable = true)
 |-- Vehicle_Showroom_Payment_Date: date (nullable = true)
 |-- Average: double (nullable = true)
 |-- Increment: double (nullable = true)
 |-- Target_Simulation_Id: integer (nullable = true)

Number of Rows: 2087


In [0]:
# Read "SalesData_Rev02.csv"
SalesData_Rev02 = spark.read.option('header',True).option("quote", "\"").option('InferSchema',True).csv("/FileStore/tables/SalesData_Rev02.csv")

display(SalesData_Rev02.limit(10))
SalesData_Rev02.printSchema()
print("Number of Rows:", SalesData_Rev02.count())

Target_Event_Identity,Vehicle_Name,Sales_Currency,Transmission,Capacity,Target_Simulation_Identity
1032,SUV,INR,Manual,1300,1061
1033,SUV,INR,Manual,1300,1062
1034,SUV,INR,Manual,1300,1063
1035,SUV,INR,Manual,1300,1064
1037,SUV,INR,Manual,1300,1065
1036,SUV,INR,Manual,1300,1066
1040,SUV,INR,Manual,1300,1067
1038,SUV,INR,Manual,1300,1068
1039,SUV,INR,Manual,1300,1069
1042,SUV,INR,Manual,1300,1070


root
 |-- Target_Event_Identity: integer (nullable = true)
 |-- Vehicle_Name: string (nullable = true)
 |-- Sales_Currency: string (nullable = true)
 |-- Transmission: string (nullable = true)
 |-- Capacity: integer (nullable = true)
 |-- Target_Simulation_Identity: integer (nullable = true)

Number of Rows: 12


In [0]:
# Read "RunningData_Rev02.csv"
RunningData_Rev02 = spark.read.option('header',True).option("quote", "\"").option('InferSchema',True).csv("/FileStore/tables/RunningData_Rev02.csv")

display(RunningData_Rev02.limit(10))
RunningData_Rev02.printSchema()
print("Number of Rows:", RunningData_Rev02.count())

Target_Event_Identity,Sales_Timestamp,Method,Customer,Sales_Event,Type_of_Market,Vehicle_Delivery_Date,Vehicle_Delivery_Status,Post_Vehicle_Delivery_Status,Database
1032,1717497686,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1033,1717497687,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1034,1717497688,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1035,1717497695,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1036,1717497734,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1037,1717497711,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1038,1717497741,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1039,1717497742,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1040,1717497740,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1041,1717497749,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure


root
 |-- Target_Event_Identity: integer (nullable = true)
 |-- Sales_Timestamp: integer (nullable = true)
 |-- Method: string (nullable = true)
 |-- Customer: string (nullable = true)
 |-- Sales_Event: string (nullable = true)
 |-- Type_of_Market: string (nullable = true)
 |-- Vehicle_Delivery_Date: string (nullable = true)
 |-- Vehicle_Delivery_Status: string (nullable = true)
 |-- Post_Vehicle_Delivery_Status: string (nullable = true)
 |-- Database: string (nullable = true)

Number of Rows: 12


In [0]:
# left join "Sales_Collect_Rev02" & "SalesData_Rev02"
Sales_Collect_df_Rev02_01 = Sales_Collect_Rev02.\
                             join(SalesData_Rev02, how='left',
                                  on=[f.col('Target_Simulation_Id') == f.col('Target_Simulation_Identity')])
                        
display(Sales_Collect_df_Rev02_01.limit(100))

Id,dept_Id,SubDept_Id,Vehicle_Id,Vehicle_Profile_Id,Description,Vehicle_Price_Id,Vehicle_Showroom_Price,Vehicle_Showroom_Delta,Vehicle_Showroom_Payment_Date,Average,Increment,Target_Simulation_Id,Target_Event_Identity,Vehicle_Name,Sales_Currency,Transmission,Capacity,Target_Simulation_Identity
257,257,1,1,0,Baleno,6,72567.98,5678.01,2023-02-20,2381.657773,0.0,1071,1041,SUV,INR,Manual,1300,1071
264,264,1,0,0,Engine_Base,90,91768.98,12678.01,2025-06-30,553.8461539,0.0,1063,1034,SUV,INR,Manual,1300,1063
265,265,1,0,0,Baleno,83,8400.123,1450.01,2023-12-27,-7199.999999,0.0,1065,1037,SUV,INR,Manual,1300,1065
266,266,1,0,0,Engine_Base,76,77345.665,3456.01,2024-04-30,7200.0,0.0,1063,1034,SUV,INR,Manual,1300,1063
267,267,1,0,0,Suzuki Swift,96,974567.11,110.01,2025-12-31,1404.878049,0.0,1063,1034,SUV,INR,Manual,1300,1063
268,268,1,1,0,Suzuki Swift,48,49.0,0.01,2023-03-20,834.1253,0.0,1068,1038,SUV,INR,Manual,1300,1068
270,270,1,0,0,Wagon R,76,77345.665,3456.01,2024-03-26,7200.0,0.0,1065,1037,SUV,INR,Manual,1300,1065
271,271,1,0,0,Engine_Base,34,35.0,12340.0123,2023-03-20,1668.2506,0.0,1068,1038,SUV,INR,Manual,1300,1068
272,272,1,1,0,Creta,29,30.0,12340.0123,2023-03-20,-2383.215143,0.0,1071,1041,SUV,INR,Manual,1300,1071
277,277,1,0,0,Brezza,73,74567.34567,3456.01,2023-12-28,7440.0,0.0,1065,1037,SUV,INR,Manual,1300,1065
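
Because the two key columns have different names here, the result carries both `Target_Simulation_Id` and `Target_Simulation_Identity`. One possible way to end up with a single key column is to rename the right-hand key before joining by name. The sketch below assumes PySpark DataFrames and uses only `withColumnRenamed()` and `join()`; the helper name `left_join_on_renamed_key` is hypothetical, and the sketch assumes the right DataFrame does not already contain a column named like the left key.

```python
def left_join_on_renamed_key(left_df, right_df, left_key, right_key):
    """Left-join two DataFrames whose key columns have different names.

    The right-hand key is renamed to match the left-hand key, so the join
    can be expressed by column name and the key appears only once.
    """
    renamed_right = right_df.withColumnRenamed(right_key, left_key)
    return left_df.join(renamed_right, on=[left_key], how="left")


# Possible usage with the DataFrames read above (requires a Spark session):
# result = left_join_on_renamed_key(
#     Sales_Collect_Rev02, SalesData_Rev02,
#     "Target_Simulation_Id", "Target_Simulation_Identity",
# )
```

Ex 03 below reaches the same outcome without a rename, because its input files already use the same key name on both sides.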


In [0]:
# left join "Sales_Collect_Rev02", "SalesData_Rev02" and "RunningData_Rev02"
Sales_Collect_df_Rev02_02 = Sales_Collect_Rev02.\
                             join(SalesData_Rev02, how='left',
                                  on=[f.col('Target_Simulation_Id') == f.col('Target_Simulation_Identity')]).\
                             join(RunningData_Rev02, how='left', on=['Target_Event_Identity'])
                        
display(Sales_Collect_df_Rev02_02.limit(10))

Target_Event_Identity,Id,dept_Id,SubDept_Id,Vehicle_Id,Vehicle_Profile_Id,Description,Vehicle_Price_Id,Vehicle_Showroom_Price,Vehicle_Showroom_Delta,Vehicle_Showroom_Payment_Date,Average,Increment,Target_Simulation_Id,Vehicle_Name,Sales_Currency,Transmission,Capacity,Target_Simulation_Identity,Sales_Timestamp,Method,Customer,Sales_Event,Type_of_Market,Vehicle_Delivery_Date,Vehicle_Delivery_Status,Post_Vehicle_Delivery_Status,Database
1041,257,257,1,1,0,Baleno,6,72567.98,5678.01,2023-02-20,2381.657773,0.0,1071,SUV,INR,Manual,1300,1071,1717497749,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1034,264,264,1,0,0,Engine_Base,90,91768.98,12678.01,2025-06-30,553.8461539,0.0,1063,SUV,INR,Manual,1300,1063,1717497688,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1037,265,265,1,0,0,Baleno,83,8400.123,1450.01,2023-12-27,-7199.999999,0.0,1065,SUV,INR,Manual,1300,1065,1717497711,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1034,266,266,1,0,0,Engine_Base,76,77345.665,3456.01,2024-04-30,7200.0,0.0,1063,SUV,INR,Manual,1300,1063,1717497688,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1034,267,267,1,0,0,Suzuki Swift,96,974567.11,110.01,2025-12-31,1404.878049,0.0,1063,SUV,INR,Manual,1300,1063,1717497688,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1038,268,268,1,1,0,Suzuki Swift,48,49.0,0.01,2023-03-20,834.1253,0.0,1068,SUV,INR,Manual,1300,1068,1717497741,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1037,270,270,1,0,0,Wagon R,76,77345.665,3456.01,2024-03-26,7200.0,0.0,1065,SUV,INR,Manual,1300,1065,1717497711,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1038,271,271,1,0,0,Engine_Base,34,35.0,12340.0123,2023-03-20,1668.2506,0.0,1068,SUV,INR,Manual,1300,1068,1717497741,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1041,272,272,1,1,0,Creta,29,30.0,12340.0123,2023-03-20,-2383.215143,0.0,1071,SUV,INR,Manual,1300,1071,1717497749,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1037,277,277,1,0,0,Brezza,73,74567.34567,3456.01,2023-12-28,7440.0,0.0,1065,SUV,INR,Manual,1300,1065,1717497711,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure


#### **Ex 03**

- Read all input .csv files
  - Sales_Collect_Rev03.csv
  - SalesData_Rev03.csv
  - RunningData_Rev03.csv

- join two datasets:
  - **Sales_Collect_Rev03** and **SalesData_Rev03**
- join three datasets:
  - **Sales_Collect_Rev03**, **SalesData_Rev03** and **RunningData_Rev03**

In [0]:
# Read "Sales_Collect_Rev03.csv"
Sales_Collect_Rev03 = spark.read.option('header',True).option("quote", "\"").option('InferSchema',True).csv("/FileStore/tables/Sales_Collect_Rev03.csv")

display(Sales_Collect_Rev03.limit(10))
Sales_Collect_Rev03.printSchema()
print("Number of Rows:", Sales_Collect_Rev03.count())

Id,dept_Id,SubDept_Id,Vehicle_Id,Vehicle_Profile_Id,Description,Vehicle_Price_Id,Vehicle_Showroom_Price,Vehicle_Showroom_Delta,Vehicle_Showroom_Payment_Date,Currency,Target_Currency,Average,Increment,Target_Simulation_Id
257,257,1,1,0,Baleno,6,72567.98,5678.01,2023-02-20,INR,INR,2381.657773,0.0,1071
264,264,1,0,0,Engine_Base,90,91768.98,12678.01,2025-06-30,INR,INR,553.8461539,0.0,1063
265,265,1,0,0,Baleno,83,8400.123,1450.01,2023-12-27,INR,INR,-7199.999999,0.0,1065
266,266,1,0,0,Engine_Base,76,77345.665,3456.01,2024-04-30,INR,INR,7200.0,0.0,1063
267,267,1,0,0,Suzuki Swift,96,974567.11,110.01,2025-12-31,INR,INR,1404.878049,0.0,1063
268,268,1,1,0,Suzuki Swift,48,49.0,0.01,2023-03-20,INR,INR,834.1253,0.0,1068
270,270,1,0,0,Wagon R,76,77345.665,3456.01,2024-03-26,INR,INR,7200.0,0.0,1065
271,271,1,0,0,Engine_Base,34,35.0,12340.0123,2023-03-20,INR,INR,1668.2506,0.0,1068
272,272,1,1,0,Creta,29,30.0,12340.0123,2023-03-20,INR,INR,-2383.215143,0.0,1071
277,277,1,0,0,Brezza,73,74567.34567,3456.01,2023-12-28,INR,INR,7440.0,0.0,1065


root
 |-- Id: integer (nullable = true)
 |-- dept_Id: integer (nullable = true)
 |-- SubDept_Id: integer (nullable = true)
 |-- Vehicle_Id: integer (nullable = true)
 |-- Vehicle_Profile_Id: integer (nullable = true)
 |-- Description: string (nullable = true)
 |-- Vehicle_Price_Id: integer (nullable = true)
 |-- Vehicle_Showroom_Price: double (nullable = true)
 |-- Vehicle_Showroom_Delta: double (nullable = true)
 |-- Vehicle_Showroom_Payment_Date: date (nullable = true)
 |-- Currency: string (nullable = true)
 |-- Target_Currency: string (nullable = true)
 |-- Average: double (nullable = true)
 |-- Increment: double (nullable = true)
 |-- Target_Simulation_Id: integer (nullable = true)

Number of Rows: 2087


In [0]:
# Read "SalesData_Rev03.csv"
SalesData_Rev03 = spark.read.option('header',True).option("quote", "\"").option('InferSchema',True).csv("/FileStore/tables/SalesData_Rev03.csv")

display(SalesData_Rev03.limit(10))
SalesData_Rev03.printSchema()
print("Number of Rows:", SalesData_Rev03.count())

Target_Event_Id,Vehicle_Name,Sales_Currency,Transmission,Capacity,Target_Simulation_Id
1032,SUV,INR,Manual,1300,1061
1033,SUV,INR,Manual,1300,1062
1034,SUV,INR,Manual,1300,1063
1035,SUV,INR,Manual,1300,1064
1037,SUV,INR,Manual,1300,1065
1036,SUV,INR,Manual,1300,1066
1040,SUV,INR,Manual,1300,1067
1038,SUV,INR,Manual,1300,1068
1039,SUV,INR,Manual,1300,1069
1042,SUV,INR,Manual,1300,1070


root
 |-- Target_Event_Id: integer (nullable = true)
 |-- Vehicle_Name: string (nullable = true)
 |-- Sales_Currency: string (nullable = true)
 |-- Transmission: string (nullable = true)
 |-- Capacity: integer (nullable = true)
 |-- Target_Simulation_Id: integer (nullable = true)

Number of Rows: 12


In [0]:
# Read "RunningData_Rev03.csv"
RunningData_Rev03 = spark.read.option('header',True).option("quote", "\"").option('InferSchema',True).csv("/FileStore/tables/RunningData_Rev03.csv")

display(RunningData_Rev03.limit(10))
RunningData_Rev03.printSchema()
print("Number of Rows:", RunningData_Rev03.count())

Target_Event_Id,Sales_Timestamp,Method,Customer,Sales_Event,Type_of_Market,Vehicle_Delivery_Date,Vehicle_Delivery_Status,Post_Vehicle_Delivery_Status,Database
1032,1717497686,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1033,1717497687,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1034,1717497688,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1035,1717497695,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1036,1717497734,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1037,1717497711,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1038,1717497741,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1039,1717497742,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1040,1717497740,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1041,1717497749,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure


root
 |-- Target_Event_Id: integer (nullable = true)
 |-- Sales_Timestamp: integer (nullable = true)
 |-- Method: string (nullable = true)
 |-- Customer: string (nullable = true)
 |-- Sales_Event: string (nullable = true)
 |-- Type_of_Market: string (nullable = true)
 |-- Vehicle_Delivery_Date: string (nullable = true)
 |-- Vehicle_Delivery_Status: string (nullable = true)
 |-- Post_Vehicle_Delivery_Status: string (nullable = true)
 |-- Database: string (nullable = true)

Number of Rows: 12


In [0]:
# left join "Sales_Collect_Rev03" & "SalesData_Rev03"
Sales_Collect_df_Rev03_01 = Sales_Collect_Rev03.\
                             join(SalesData_Rev03, how='left',
                                  on=['Target_Simulation_Id'])
                        
display(Sales_Collect_df_Rev03_01.limit(10))

Target_Simulation_Id,Id,dept_Id,SubDept_Id,Vehicle_Id,Vehicle_Profile_Id,Description,Vehicle_Price_Id,Vehicle_Showroom_Price,Vehicle_Showroom_Delta,Vehicle_Showroom_Payment_Date,Currency,Target_Currency,Average,Increment,Target_Event_Id,Vehicle_Name,Sales_Currency,Transmission,Capacity
1071,257,257,1,1,0,Baleno,6,72567.98,5678.01,2023-02-20,INR,INR,2381.657773,0.0,1041,SUV,INR,Manual,1300
1063,264,264,1,0,0,Engine_Base,90,91768.98,12678.01,2025-06-30,INR,INR,553.8461539,0.0,1034,SUV,INR,Manual,1300
1065,265,265,1,0,0,Baleno,83,8400.123,1450.01,2023-12-27,INR,INR,-7199.999999,0.0,1037,SUV,INR,Manual,1300
1063,266,266,1,0,0,Engine_Base,76,77345.665,3456.01,2024-04-30,INR,INR,7200.0,0.0,1034,SUV,INR,Manual,1300
1063,267,267,1,0,0,Suzuki Swift,96,974567.11,110.01,2025-12-31,INR,INR,1404.878049,0.0,1034,SUV,INR,Manual,1300
1068,268,268,1,1,0,Suzuki Swift,48,49.0,0.01,2023-03-20,INR,INR,834.1253,0.0,1038,SUV,INR,Manual,1300
1065,270,270,1,0,0,Wagon R,76,77345.665,3456.01,2024-03-26,INR,INR,7200.0,0.0,1037,SUV,INR,Manual,1300
1068,271,271,1,0,0,Engine_Base,34,35.0,12340.0123,2023-03-20,INR,INR,1668.2506,0.0,1038,SUV,INR,Manual,1300
1071,272,272,1,1,0,Creta,29,30.0,12340.0123,2023-03-20,INR,INR,-2383.215143,0.0,1041,SUV,INR,Manual,1300
1065,277,277,1,0,0,Brezza,73,74567.34567,3456.01,2023-12-28,INR,INR,7440.0,0.0,1037,SUV,INR,Manual,1300


In [0]:
# left join "Sales_Collect_Rev03", "SalesData_Rev03" and "RunningData_Rev03"
Sales_Collect_df_Rev03_02 = Sales_Collect_Rev03.\
                             join(SalesData_Rev03, how='left',
                                  on=['Target_Simulation_Id']).\
                             join(RunningData_Rev03, how='left', on=['Target_Event_Id'])
                        
display(Sales_Collect_df_Rev03_02.limit(10))

Target_Event_Id,Target_Simulation_Id,Id,dept_Id,SubDept_Id,Vehicle_Id,Vehicle_Profile_Id,Description,Vehicle_Price_Id,Vehicle_Showroom_Price,Vehicle_Showroom_Delta,Vehicle_Showroom_Payment_Date,Currency,Target_Currency,Average,Increment,Vehicle_Name,Sales_Currency,Transmission,Capacity,Sales_Timestamp,Method,Customer,Sales_Event,Type_of_Market,Vehicle_Delivery_Date,Vehicle_Delivery_Status,Post_Vehicle_Delivery_Status,Database
1041,1071,257,257,1,1,0,Baleno,6,72567.98,5678.01,2023-02-20,INR,INR,2381.657773,0.0,SUV,INR,Manual,1300,1717497749,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1034,1063,264,264,1,0,0,Engine_Base,90,91768.98,12678.01,2025-06-30,INR,INR,553.8461539,0.0,SUV,INR,Manual,1300,1717497688,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1037,1065,265,265,1,0,0,Baleno,83,8400.123,1450.01,2023-12-27,INR,INR,-7199.999999,0.0,SUV,INR,Manual,1300,1717497711,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1034,1063,266,266,1,0,0,Engine_Base,76,77345.665,3456.01,2024-04-30,INR,INR,7200.0,0.0,SUV,INR,Manual,1300,1717497688,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1034,1063,267,267,1,0,0,Suzuki Swift,96,974567.11,110.01,2025-12-31,INR,INR,1404.878049,0.0,SUV,INR,Manual,1300,1717497688,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1038,1068,268,268,1,1,0,Suzuki Swift,48,49.0,0.01,2023-03-20,INR,INR,834.1253,0.0,SUV,INR,Manual,1300,1717497741,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1037,1065,270,270,1,0,0,Wagon R,76,77345.665,3456.01,2024-03-26,INR,INR,7200.0,0.0,SUV,INR,Manual,1300,1717497711,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1038,1068,271,271,1,0,0,Engine_Base,34,35.0,12340.0123,2023-03-20,INR,INR,1668.2506,0.0,SUV,INR,Manual,1300,1717497741,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1041,1071,272,272,1,1,0,Creta,29,30.0,12340.0123,2023-03-20,INR,INR,-2383.215143,0.0,SUV,INR,Manual,1300,1717497749,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
1037,1065,277,277,1,0,0,Brezza,73,74567.34567,3456.01,2023-12-28,INR,INR,7440.0,0.0,SUV,INR,Manual,1300,1717497711,Dealership,SRS Travels,Bulk Delivery,Commercial,4-Jan-23,DOB,DOJ,Azure
