# Big Data
Spark Dataframes/SparkSQL  
2020.03.22  

## Tasks

Answer the following using DataFrames or SQL.

1. Show the top 5 committee names by dollar amount that have contributed to “LEWIS, JASON MARK MR.”
2. Show the top 5 candidate names who have received the most money from various committees.
3. Which candidate is most popular, in terms of the number of committees they receive funds from?

## Execution

Before importing the JSON files, confirm that a Spark session is available.

In [2]:
# A Spark session (spark) and its Spark context (sc) are already initiated when the notebook is created.
# SparkSession is the entry point that wraps the older SparkContext.
sc.appName

### Read Data

In [4]:
# Create a DataFrame from each JSON file.
# Spark infers the schema from the JSON records.
data_dir = "/FileStore/tables/"  # named data_dir to avoid shadowing the built-in dir()

# candidate file
file = "CAND_MSI_UNI_JS.json"
df_candidate = spark.read.json(data_dir + file)

# committee file
file = "CMTE_MST_UNI_JS.json"
df_committee = spark.read.json(data_dir + file)

# contribution file
file = "CONTRIB_TO_CMTE_FR_CMTE_MN_JS.json"
df_contribute = spark.read.json(data_dir + file)

# preview the first two candidate rows
df_candidate.show(2)

### Explore Data

In [6]:
# Inspect the schema
# Ensure that the inferred schema is correct
df_candidate.columns

In [7]:
df_candidate.schema

In [8]:
df_candidate.printSchema()

In [9]:
df_committee.printSchema()

In [10]:
df_contribute.printSchema()
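
The printed schemas can be cross-checked against the raw records. The following is a minimal sketch in plain Python, assuming the files hold one JSON object per line (the layout `spark.read.json` reads by default). The helper `field_types` and the sample records are hypothetical and only illustrate the idea: a field whose values mix types is worth a closer look in the inferred schema.

```python
import json
from collections import defaultdict


def field_types(lines):
    """Map each field name to the set of Python type names seen across records."""
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return dict(seen)


# Hypothetical sample records, one JSON object per line (not taken from the real files).
sample = [
    '{"CAND_ID": "H0XX00001", "CAND_NAME": "DOE, JANE", "CAND_ELECTION_YR": 2018}',
    '{"CAND_ID": "H0XX00002", "CAND_NAME": "ROE, RICHARD", "CAND_ELECTION_YR": "2018"}',
]

types = field_types(sample)
for name, kinds in sorted(types.items()):
    print(name, sorted(kinds))
```

Here `CAND_ELECTION_YR` shows up as both `int` and `str`, which is the kind of inconsistency to look for before trusting an inferred column type.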

In [11]:
# View summary statistics
# most useful for numeric columns
df_candidate.describe().show(5)

### Query Data

1. Show the top 5 committee names by dollar amount that have contributed to “LEWIS, JASON MARK MR.”

Committee Name  
File = CMTE_MST_UNI    
Field = CMTE_NM  
df = df_committee  

Transaction Amount  
File = CONTRIB_TO_CMTE_FR_CMTE_MN  
Field = TRANSACTION_AMT  
df = df_contribute  

Candidate Name  
File = CAND_MST_UNI  
Field = CAND_NAME  
df = df_candidate  

Contributor/Lender/Transfer Name  
File = CONTRIB_TO_CMTE_FR_CMTE_MN  
Field = NAME  
df = df_contribute

Starting with SparkSQL: first, register the DataFrames as temporary views so they can be queried by name.

According to [this answer](https://stackoverflow.com/a/31509776/5825523), a DataFrame must be registered as a table before SQL-like queries can be run on it.

In [14]:
# Create temp views for the dataframes
df_candidate.createOrReplaceTempView("candidate")
df_committee.createOrReplaceTempView("committee")
df_contribute.createOrReplaceTempView("contribute")

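# registerTempTable() is the deprecated (since Spark 2.0) name for the same operation, so no extra call is needed.
# A temp view only binds a name to the DataFrame's query plan; it does not copy the data into memory.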

In [15]:
%sql
show tables;

database,tableName,isTemporary
default,dftable,False
,candidate,True
,committee,True
,contribute,True


In [16]:
# Use temp views to query the tables 
# find the candidate of interest
q1 = \
"""
select
  cand_id
from candidate
where
  upper(cand_name) = 'LEWIS, JASON MARK MR.'
"""
spark.sql(q1).show()

In [17]:
# preview transaction amounts
q2 = \
"""
select
  cmte_id
  , transaction_amt
from contribute
"""
spark.sql(q2).show()

In [18]:
# preview committee names. This is the bridge/junction table 
q3 = \
"""
select
  cmte_id
  , cmte_nm
  , cand_id
from committee
"""
spark.sql(q3).show()

#### Query 1
Show the top 5 committee names by dollar amount that have contributed to “LEWIS, JASON MARK MR.”

In [20]:
q = \
"""
select
  contr.cmte_id
  , comm.cmte_nm
  , contr.transaction_amt
  , cand.cand_id
  , cand.cand_name
from contribute as contr

inner join committee as comm -- inner or left join, no difference in the result
  on contr.cmte_id = comm.cmte_id

inner join candidate as cand -- inner or left join, no difference: the where clause drops unmatched rows
  on contr.other_id = cand.cand_id

where upper(cand.cand_name) = 'LEWIS, JASON MARK MR.'

order by contr.transaction_amt desc
limit 5 -- keep only the top 5 rows
"""
results = spark.sql(q)
results.show()
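
If "by dollar amount" is read as the total each committee gave rather than its largest single transaction, the amounts would need to be summed per committee before ranking. The sketch below illustrates that reading under stated assumptions: Python's built-in `sqlite3` stands in for Spark so it runs without a cluster, and every row, ID, and committee name is invented. The `select` uses only standard SQL, so it should carry over to `spark.sql` against the temp views, but that has not been verified here.

```python
import sqlite3

# Toy stand-ins for the three temp views; all rows are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
create table candidate (cand_id text, cand_name text);
create table committee (cmte_id text, cmte_nm text);
create table contribute (cmte_id text, other_id text, transaction_amt integer);

insert into candidate values
  ('H1', 'LEWIS, JASON MARK MR.'), ('H2', 'DOE, JANE');
insert into committee values
  ('C1', 'ALPHA PAC'), ('C2', 'BETA PAC'), ('C3', 'GAMMA PAC');
insert into contribute values
  ('C1', 'H1', 1000), ('C1', 'H1', 1500), ('C2', 'H1', 2000), ('C3', 'H2', 9000);
""")

# Sum per committee first, then rank the totals.
q = """
select
  comm.cmte_nm
  , sum(contr.transaction_amt) as total_amt
from contribute as contr
inner join committee as comm
  on contr.cmte_id = comm.cmte_id
inner join candidate as cand
  on contr.other_id = cand.cand_id
where upper(cand.cand_name) = 'LEWIS, JASON MARK MR.'
group by comm.cmte_id, comm.cmte_nm
order by total_amt desc
limit 5
"""
top_committees = con.execute(q).fetchall()
print(top_committees)
```

On these toy rows ALPHA PAC ranks first on its 2500 total, although BETA PAC made the largest single transaction (2000), so the two readings can give different rankings.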

#### Query 2
Show the top 5 candidate names who have received the most money from various committees.

In [22]:
q = \
"""
select
  cand.cand_id
  , cand.cand_name
  , sum(contr.transaction_amt) as total_contr -- the main question
from candidate as cand

left join contribute as contr
  on cand.cand_id = contr.other_id 

group by 
  cand.cand_id
  , cand.cand_name
  
order by total_contr desc
limit 5 -- keep only the top 5 candidates
"""
results = spark.sql(q)
results.show()
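
Because the query starts from `candidate` and uses a left join, candidates with no matching contribution rows stay in the result with a null total. A small sketch of that behaviour on invented rows follows, again using Python's built-in `sqlite3` as a stand-in for Spark (an assumption made so the example runs anywhere). The candidate names and amounts are hypothetical.

```python
import sqlite3

# Toy stand-ins for the candidate and contribute views; all rows are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
create table candidate (cand_id text, cand_name text);
create table contribute (cmte_id text, other_id text, transaction_amt integer);

insert into candidate values
  ('H1', 'CANDIDATE ONE'), ('H2', 'CANDIDATE TWO'), ('H3', 'CANDIDATE THREE');
insert into contribute values
  ('C1', 'H1', 100), ('C2', 'H1', 200), ('C1', 'H2', 500);
""")

q = """
select
  cand.cand_name
  , sum(contr.transaction_amt) as total_contr
from candidate as cand
left join contribute as contr
  on cand.cand_id = contr.other_id
group by cand.cand_id, cand.cand_name
order by total_contr desc
limit 5
"""
totals = con.execute(q).fetchall()
for name, total in totals:
    print(name, total)  # CANDIDATE THREE has no contributions, so its total is None
```

SQLite sorts nulls last in a descending sort, and Spark SQL does the same by default, so unfunded candidates fall to the bottom rather than crowding out the top of the ranking. An inner join would drop them entirely.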

#### Query 3
Which candidate is most popular, in terms of the number of committees they receive funds from?

In [24]:
q = \
"""
select
  count(distinct contr.cmte_id) as cnt -- number of distinct committees funding the candidate
  , cand.cand_name
from contribute contr

inner join candidate cand -- only rows that match a candidate matter
  on contr.other_id = cand.cand_id

group by
  cand.cand_id
  , cand.cand_name

order by cnt desc
"""
results = spark.sql(q)
results.show()
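
Counting transaction rows per (committee, candidate) pair and counting distinct committees per candidate can rank candidates differently. The sketch below contrasts the two on invented rows. As a stated assumption, Python's built-in `sqlite3` stands in for Spark so the example runs without a cluster, and all IDs and names are hypothetical.

```python
import sqlite3

# Toy stand-ins for the candidate and contribute views; all rows are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
create table candidate (cand_id text, cand_name text);
create table contribute (cmte_id text, other_id text, transaction_amt integer);

insert into candidate values ('H1', 'CANDIDATE ONE'), ('H2', 'CANDIDATE TWO');
insert into contribute values
  ('C1', 'H1', 100), ('C1', 'H1', 200), ('C1', 'H1', 300),
  ('C2', 'H2', 50), ('C3', 'H2', 60);
""")

# Grouping by the (committee, candidate) pair counts transactions per pair.
per_pair = con.execute("""
select count(contr.transaction_amt) as cnt, cand.cand_name
from contribute contr
inner join candidate cand on contr.other_id = cand.cand_id
group by contr.cmte_id, contr.other_id, cand.cand_name
order by cnt desc
""").fetchall()

# Grouping by candidate and counting distinct committees answers the question asked.
per_candidate = con.execute("""
select cand.cand_name, count(distinct contr.cmte_id) as cmte_cnt
from contribute contr
inner join candidate cand on contr.other_id = cand.cand_id
group by cand.cand_id, cand.cand_name
order by cmte_cnt desc
""").fetchall()

print(per_pair[0])       # top row when counting transactions per pair
print(per_candidate[0])  # top row when counting distinct committees
```

On these rows the per-pair count puts CANDIDATE ONE first (three transactions from one committee), while the distinct-committee count puts CANDIDATE TWO first (two different committees).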