### AIT 614 - Big Data Essentials <br>
#### Project Title: FHWA Bridge Conditions Analysis Using Big Data Techniques
#### 5. Hadoop Spark Queries
#### TEAM 4
<hr>

Course Section #: AIT 614 - 003 <br>
#### Team Members
1. Aryan Patel Kolagani - G01517560 <br>
2. Rithvik Madhavaram - G01501806 <br>
3. Chetan Muppavarapu - G01504057 <br>
4. Srivaths Nrusimha Rao Chengal - G01512113 <br>
5. Vaibhav Hasu - G01517039 <br>

### Step 1: Loading the Dataset Using Spark <br>

In [0]:
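# Load dataset (acts like HDFS input)
# header=true takes column names from the first row; without inferSchema every column is read as a string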
df = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/shared_uploads/akolagan@gmu.edu/FHWA_Bridge_Conditions_Dataset.csv")

# Show data
df.show(5)

+---------+-----+------------------+-------------------+---------+---------+------------------+-----+--------------+------------------------+----------------------+-------------+------------------+------------------+---------------------------+--------------------+------------------+----------------+-------------------+-------------------+
|Bridge_ID|State|          Latitude|          Longitude| Material|Age_Years|     Length_Meters|Lanes|Deck_Condition|Superstructure_Condition|Substructure_Condition|Daily_Traffic|  Truck_Percentage| Avg_Annual_Temp_C|Avg_Annual_Precipitation_mm|Last_Inspection_Year|   Repair_Cost_USD|Repair_Time_Days| Deterioration_Rate|Failure_Probability|
+---------+-----+------------------+-------------------+---------+---------+------------------+-----+--------------+------------------------+----------------------+-------------+------------------+------------------+---------------------------+--------------------+------------------+----------------+---------------

#### Step 2: Cleaning the Repair Cost Column

In [0]:
from pyspark.sql.functions import regexp_replace, col

df_clean = df.withColumn("Repair_Cost_USD_clean", regexp_replace("Repair_Cost_USD", "[$,]", "").cast("double"))
df_clean.createOrReplaceTempView("bridges")
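
The cleaning step strips every `$` and `,` from `Repair_Cost_USD` and casts what is left to a double. Below is a minimal plain-Python sketch of the same idea. It is not the Spark code path. The helper name `clean_cost` and the sample strings are hypothetical, and returning `None` for unparseable text assumes the cast runs in Spark's non-ANSI mode, where a failed cast yields null.

```python
import re
from typing import Optional


def clean_cost(raw: Optional[str]) -> Optional[float]:
    """Strip '$' and ',' then parse as float; return None if that fails."""
    if raw is None:
        return None
    # Inside a character class, '$' is a literal dollar sign
    stripped = re.sub(r"[$,]", "", raw)
    try:
        return float(stripped)
    except ValueError:
        return None


# Hypothetical sample values, not taken from the dataset
samples = ["$1,234,567.89", "4995193.94", "N/A", None]
cleaned = [clean_cost(s) for s in samples]
print(cleaned)
```

Values that are already plain numbers pass through unchanged, so the step is safe to apply even if only some rows carry currency formatting.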

#### Step 3: Hadoop-style Spark SQL Queries (Simulating Hive on HDFS)

#### Query 1: Count of Total Bridges

In [0]:
spark.sql("SELECT COUNT(*) AS total_bridges FROM bridges").show()

+-------------+
|total_bridges|
+-------------+
|         5000|
+-------------+



#### Query 2: Average Deck Condition by State

In [0]:
spark.sql("""
SELECT State, ROUND(AVG(Deck_Condition), 2) AS avg_condition
FROM bridges
GROUP BY State
ORDER BY avg_condition ASC
""").show()

+-----+-------------+
|State|avg_condition|
+-----+-------------+
|   IL|         5.41|
|   PA|         5.43|
|   FL|         5.46|
|   TX|         5.47|
|   MI|          5.5|
|   CA|         5.52|
|   NC|         5.54|
|   GA|         5.59|
|   NY|         5.61|
|   OH|         5.65|
+-----+-------------+
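
To experiment with the shape of this query without a cluster, the same SQL text can be run against a tiny in-memory table. The sketch below uses Python's built-in `sqlite3` as a stand-in for Spark SQL. The two engines differ in typing rules and available functions, so this only illustrates the `GROUP BY` / `AVG` / `ROUND` / `ORDER BY` pattern. The five rows are hypothetical. `Deck_Condition` is stored as text to mirror the string-typed columns produced by the CSV load, and SQLite's `AVG` converts numeric-looking strings before averaging.

```python
import sqlite3

# In-memory database standing in for the "bridges" temp view (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bridges (State TEXT, Deck_Condition TEXT)")

# Hypothetical rows, not taken from the dataset
rows = [("OH", "7"), ("OH", "6"), ("IL", "5"), ("IL", "4"), ("IL", "6")]
conn.executemany("INSERT INTO bridges VALUES (?, ?)", rows)

# Same query text as Query 2
result = conn.execute("""
SELECT State, ROUND(AVG(Deck_Condition), 2) AS avg_condition
FROM bridges
GROUP BY State
ORDER BY avg_condition ASC
""").fetchall()

print(result)
conn.close()
```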



#### Query 3: Top 5 Bridges with Highest Repair Cost

In [0]:
spark.sql("""
SELECT Bridge_ID, State, Repair_Cost_USD_clean
FROM bridges
ORDER BY Repair_Cost_USD_clean DESC
LIMIT 5
""").show()

+---------+-----+---------------------+
|Bridge_ID|State|Repair_Cost_USD_clean|
+---------+-----+---------------------+
|   101420|   CA|    4995193.941049254|
|   104400|   OH|    4994320.511339711|
|   102158|   FL|     4993752.42208293|
|   104609|   NC|    4993634.032032534|
|   100160|   FL|   4992727.6564623555|
+---------+-----+---------------------+
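
This query sorts on `Repair_Cost_USD_clean` rather than on the original column because the CSV is loaded as strings, and strings sort character by character rather than by magnitude. A small plain-Python sketch of the difference, using hypothetical cost strings:

```python
# Hypothetical cost strings, not taken from the dataset
costs = ["999.50", "4995193.94", "12000.00"]

# Sorting the raw strings compares them character by character
top_as_text = sorted(costs, reverse=True)

# Sorting on the parsed value compares magnitudes
top_as_number = sorted(costs, key=float, reverse=True)

print(top_as_text)    # text order puts "999.50" first
print(top_as_number)  # numeric order puts "4995193.94" first
```

With a text sort, a three-digit cost would rank above a seven-digit one simply because it starts with a larger digit, which is why the numeric column is the safer sort key for a top-N query.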



#### Query 4: Number of Bridges by Material

In [0]:
spark.sql("""
SELECT Material, COUNT(*) AS count
FROM bridges
GROUP BY Material
ORDER BY count DESC
""").show()

+---------+-----+
| Material|count|
+---------+-----+
|   Timber| 1034|
|    Steel| 1029|
| Concrete|  990|
|  Masonry|  981|
|Composite|  966|
+---------+-----+



#### Query 5: Average Deterioration Rate by Material

In [0]:
spark.sql("""
SELECT Material, ROUND(AVG(Deterioration_Rate), 2) AS avg_deterioration
FROM bridges
GROUP BY Material
ORDER BY avg_deterioration DESC
""").show()

+---------+-----------------+
| Material|avg_deterioration|
+---------+-----------------+
|  Masonry|             0.16|
|    Steel|             0.15|
| Concrete|             0.15|
|Composite|             0.15|
|   Timber|             0.15|
+---------+-----------------+
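
Four of the five materials show the same value after rounding to two decimals, so the order among them says nothing about their unrounded averages. The sketch below shows how rounding can collapse distinct averages into ties. The numbers are hypothetical and are not the dataset's actual averages.

```python
# Hypothetical per-material averages, not taken from the dataset
averages = {"Steel": 0.1532, "Timber": 0.1468, "Concrete": 0.1501}

# Round to two decimals, as ROUND(AVG(...), 2) does in the query
rounded = {material: round(value, 2) for material, value in averages.items()}
print(rounded)

# Count how many distinct values survive before and after rounding
distinct_raw = len(set(averages.values()))
distinct_rounded = len(set(rounded.values()))
print(distinct_raw, distinct_rounded)
```

If the ranking among close materials matters, one option is to round to more decimal places, or to order by the unrounded `AVG(Deterioration_Rate)` and round only for display.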

