<br><br><br>
<span style="color:red;font-size:60px">GraphFrames Assignment</span>
<br><br>
In this assignment, you need to do the following:
<li>Read the file 201710-citibike-tripdata.csv</li>
<li>Construct a graph with stations as vertices and trips between stations as edges</li>
<li>Vertex Ids are station numbers and Vertex attributes are station names</li>
<li>Edge attributes are trip duration (durations are in seconds)</li>
<li>Then answer the questions below</li>

<h2>NOTE</h2>
<li>There is a good chance that this won't run on your local Jupyter notebook. If that happens, create a subset of the data (you can use Python to do that), test your code locally on the subset, and then run it on the entire dataset on GCP</li>
<li>If you reboot your machine and make sure no other applications are open before you start on the assignment, you have a good shot (depending on your machine) at running it locally</li>
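The subset mentioned above could also be produced without leaving Scala. The following is a minimal sketch, not part of the assignment: the object name, the function names, and both file paths are hypothetical, and it assumes the CSV fits the simple "header plus one trip per line" layout used in this notebook.

```scala
import java.io.PrintWriter
import scala.io.Source

object SubsetSketch {
  // Keep the header line plus the first `nRows` data lines.
  def headLines(lines: Iterator[String], nRows: Int): List[String] =
    lines.take(nRows + 1).toList

  // Copy that subset from `inPath` to `outPath` (both paths are hypothetical).
  def writeSubset(inPath: String, outPath: String, nRows: Int): Unit = {
    val src = Source.fromFile(inPath)
    val out = new PrintWriter(outPath)
    try headLines(src.getLines(), nRows).foreach(line => out.println(line))
    finally { out.close(); src.close() }
  }
}
```

Because the header is kept, the subset can be read with the same header-dropping code as the full file.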

In [1]:
%%init_spark
launcher.packages= ["graphframes:graphframes:0.8.2-spark3.2-s_2.12"]

In [2]:
//GraphFrame imports
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.graphframes._


//GraphX imports
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD




Intitializing Scala interpreter ...

Spark Web UI available at http://cluster-be60-m:8088/proxy/application_1668271071267_0002
SparkContext available as 'sc' (version = 3.1.3, master = yarn, app id = application_1668271071267_0002)
SparkSession available as 'spark'


import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.graphframes._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD


<br><br><br>
<span style="color:green;font-size:xx-large">Step 1: Construct the graph</span>
<br><br>
<li>read the data file and drop the header line</li>
<li>create a vertex dataframe (the union of start stations and end stations)</li>
<li>create an edge dataframe (the trips - start station id, end station id, duration)</li>
<li>create a GraphFrame</li>

In [3]:
val text = sc.textFile("gs://wz2547-ieor4526-bucket/data/201710-citibike-tripdata.csv")
val text_nohead = text.mapPartitionsWithIndex{ (idx,iter) => if (idx==0) iter.drop(1) else iter}.map(x=>x.split(','))

//Construct vertices and edges here
val rdd_vertex_start = text_nohead.map(x=>(x(3),x(4)))
val rdd_vertex_end = text_nohead.map(x=>(x(7),x(8)))
val rdd_vertex = rdd_vertex_start.union(rdd_vertex_end).distinct
val vertices = spark.createDataFrame(rdd_vertex).toDF("id","attr")

val rdd_edge = text_nohead.map(x=>(x(3),x(7),x(0).toFloat))
val edges = spark.createDataFrame(rdd_edge).toDF("src","dst","duration_secs")


val g = GraphFrame(vertices,edges)

text: org.apache.spark.rdd.RDD[String] = gs://wz2547-ieor4526-bucket/data/201710-citibike-tripdata.csv MapPartitionsRDD[1] at textFile at <console>:37
text_nohead: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[3] at map at <console>:38
rdd_vertex_start: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[4] at map at <console>:41
rdd_vertex_end: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[5] at map at <console>:42
rdd_vertex: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[9] at distinct at <console>:43
vertices: org.apache.spark.sql.DataFrame = [id: string, attr: string]
rdd_edge: org.apache.spark.rdd.RDD[(String, String, Float)] = MapPartitionsRDD[10] at map at <console>:46
edges: org.apache.spark.sql.DataFrame = [src: string, dst...


In [4]:
// g.edges.show

<br><br><br>
<span style="color:green;font-size:xx-large">Step 2: Basic questions</span>
<br><br>

<li>How many citibike stations are there in the network?</li>
<li>How many trips were made in the month in question?</li>
<li>How many trips started and ended at the same station?</li>
<li>How many station to station connections are there (at least one edge exists between station i and station j and i is not equal to j)?</li>
<li>Your code should print:</li>
<pre>
Total number of stations: 785
Total number of trips: 1897592
Trips that started and ended at the same station: 33245
Number of station to station connections: 107524
</pre>
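The "station to station connections" count treats (i, j) and (j, i) as the same connection and ignores round trips. As an illustration only, here is that counting rule on plain Scala collections; the object and function names are made up for this sketch, and it is not the GraphFrames solution.

```scala
object ConnectionsSketch {
  // Count unordered station pairs {i, j}, i != j, joined by at least one trip.
  def countConnections(trips: Seq[(String, String)]): Int =
    trips
      .filter { case (src, dst) => src != dst }   // drop round trips
      .map { case (src, dst) =>                   // put each pair in a canonical order
        if (src < dst) (src, dst) else (dst, src)
      }
      .distinct                                   // many trips, one connection
      .size
}
```

For example, the trips a→b, b→a, a→a, a→c, a→b contain two connections: {a, b} and {a, c}.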

In [5]:
//You might need this
def make_undirected_graph(g: GraphFrame) = {
    val u_edge_df = g.find("(a)-[]->(b)")
        .select($"a.id".as("src"),$"b.id".as("dst"))
        .withColumn("swap",when(col("src")<col("dst"),col("dst")))
        .withColumn("dst",
                    when(col("swap").isNotNull,col("src"))
                    .otherwise(col("dst")))
        .withColumn("src",
                    when(col("swap").isNotNull,col("swap"))
                   .otherwise(col("src")))
        .drop(col("swap"))
        .distinct
    val u_vertices_df = g.vertices
    val u_g = GraphFrame(u_vertices_df,u_edge_df)    
    u_g
}
val total_stations = g.vertices.count()
val total_trips = g.edges.count()
val round_trips = g.find("(a)-[]->(a)").count()
val u_g = make_undirected_graph(g)
val station_to_station = u_g.find("(a)-[e]->(b)").filter("a.id != b.id").select("e.src","e.dst").count()

make_undirected_graph: (g: org.graphframes.GraphFrame)org.graphframes.GraphFrame
total_stations: Long = 785
total_trips: Long = 1897592
round_trips: Long = 33245
u_g: org.graphframes.GraphFrame = GraphFrame(v:[id: string, attr: string], e:[src: string, dst: string])
station_to_station: Long = 107524


In [6]:
println(s"Total number of stations: $total_stations")
println(s"Total number of trips: $total_trips")
println(s"Trips that started and ended at the same station: $round_trips")
println(s"Number of station to station connections: $station_to_station")

Total number of stations: 785
Total number of trips: 1897592
Trips that started and ended at the same station: 33245
Number of station to station connections: 107524


<br><br><br>
<span style="color:green;font-size:xx-large">Step 3: Find the Station from which most trips originate</span>
<br><br>
<li>Note that the graph has one edge for each trip (i.e., there are many edges between two vertices)</li>
<li>The function <span style="color:blue">outDegrees</span> returns the number of outgoing edges from every vertex</li>
<li>Print the name of the station with most originating trips</li>
<li>Your code should print:</li>
<pre>  
The station from which most trips originate is: "Pershing Square North"
</pre>
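Since every trip is its own edge, the out-degree of a station is simply the number of trips that start there. A small sketch of that idea on plain Scala collections follows; the names are hypothetical and the toy station ids are only for illustration.

```scala
object OutDegreeSketch {
  // Out-degree per station: number of trips starting at that station.
  def outDegrees(trips: Seq[(String, String)]): Map[String, Int] =
    trips.groupBy(_._1).map { case (src, ts) => src -> ts.size }

  // Station id with the largest out-degree (assumes at least one trip).
  def busiestOrigin(trips: Seq[(String, String)]): String =
    outDegrees(trips).maxBy(_._2)._1
}
```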

In [7]:
val outDegree = g.outDegrees
outDegree.createOrReplaceTempView("outDegree_v")
val most_station_id = spark.sql("select id from outDegree_v order by outDegree desc limit 1").collect()(0)(0).toString
val most_trips = g.vertices.select("attr").filter($"id" === most_station_id).collect()(0)(0).toString

outDegree: org.apache.spark.sql.DataFrame = [id: string, outDegree: int]
most_station_id: String = 519
most_trips: String = "Pershing Square North"


In [8]:
println(s"The station from which most trips originate is: $most_trips")

The station from which most trips originate is: "Pershing Square North"


<br><br><br>
<span style="color:green;font-size:xx-large">STEP 4: Proportion of trips for each station that start and end at that same station</span>
<br><br>
<li>Create a GraphX graph from the GraphFrames graph (use the method that retains datatypes)</li>
<li>Use aggregateMessages to calculate the number of trips that start and end at the same vertex (for each vertex)</li>
<li>Convert the resulting (VertexRDD) to a DataFrame</li>
<li>Using a join, add the station location column to the result df from the previous step, and then use select to create a dataframe with the schema (vertex, location, trips)</li>
<li>Join this df to the out degrees df created earlier</li>
<li>Divide the "same trips" column by the "out degrees" column</li>
<li>Sort the resulting df by this proportion in descending order</li>
<li>Your output should be the following dataframe:</li>
<pre>

+----+--------------------+-----+---------+-------------------+
|  id|            location|trips|outDegree|               prop|
+----+--------------------+-----+---------+-------------------+
|3488|  "8D QC Station 01"|    1|        1|                1.0|
|3245|"NYCBS DEPOT - DE...|    1|        2|                0.5|
|3182|"Yankee Ferry Ter...|  309|      900| 0.3433333333333333|
|3254|  "Soissons Landing"|  358|     1100|0.32545454545454544|
|3342|"Pioneer St & Ric...|   59|      299|0.19732441471571907|
|3477|"39 St & 2 Ave - ...|   45|      245| 0.1836734693877551|
|3532|"Ditmars Blvd & 1...|   70|      407|  0.171990171990172|
|3180|"Brooklyn Bridge ...|  232|     1354|0.17134416543574593|
|3423|"West Drive & Pro...|  367|     2463|0.14900527811611855|
|3636|"Expansion Wareho...|    1|        8|              0.125|
|3302|"Columbus Ave & W...|   74|      598|0.12374581939799331|
|3120|"Center Blvd & Bo...|   74|      622| 0.1189710610932476|
|3514|"Astoria Park S &...|   34|      299|0.11371237458193979|
|3479|      "Picnic Point"|   77|      712|0.10814606741573034|
|3594|"Montgomery St & ...|    9|       87|0.10344827586206896|
|3524|    "19 St & 24 Ave"|   24|      249| 0.0963855421686747|
|3349|"Grand Army Plaza...|  237|     2570|0.09221789883268483|
|3333|"Columbia St & Lo...|    5|       55|0.09090909090909091|
|3354|"3 St & Prospect ...|  142|     1572|0.09033078880407125|
|3607|    "31 Ave & 14 St"|   10|      113|0.08849557522123894|
+----+--------------------+-----+---------+-------------------+
only showing top 20 rows
</pre>
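The proportion being asked for is (trips that start and end at a station) / (all trips that start at that station). The sketch below works this out on plain Scala collections rather than with aggregateMessages, so it only illustrates the arithmetic; the object and function names are assumptions of the sketch.

```scala
object RoundTripSketch {
  // For every station with at least one round trip: round trips / out-degree.
  def roundTripShare(trips: Seq[(Long, Long)]): Map[Long, Double] = {
    val outDegree = trips.groupBy(_._1).map { case (s, ts) => s -> ts.size }
    val roundTrips = trips
      .filter { case (s, d) => s == d }
      .groupBy(_._1)
      .map { case (s, ts) => s -> ts.size }
    roundTrips.map { case (s, n) => s -> n.toDouble / outDegree(s) }
  }
}
```

Stations without any round trip are simply absent from this map, whereas a left join in the DataFrame version would keep them with a null count.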

In [9]:
val v = g.vertices.rdd.map(r => (r(0).toString.toLong,r(1).toString))
val e = g.edges.rdd.map(r => Edge(r(0).toString.toLong,r(1).toString.toLong,r(2).toString.toDouble))
val gx: Graph[String, Double] = Graph(v, e)

v: org.apache.spark.rdd.RDD[(Long, String)] = MapPartitionsRDD[69] at map at <console>:37
e: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Double]] = MapPartitionsRDD[75] at map at <console>:38
gx: org.apache.spark.graphx.Graph[String,Double] = org.apache.spark.graphx.impl.GraphImpl@2927a853


In [10]:
val trip_count = gx.aggregateMessages[Int](ec => if (ec.srcId == ec.dstId) ec.sendToDst(1)
                                           ,(a,b) => a+b)
                   .toDF("id","trips")
val trip_location = vertices.join(trip_count,g.vertices("id") === trip_count("id"),"left")
                            .select(g.vertices("id"),g.vertices("attr") as "location",trip_count("trips"))
val trip_outDegree = trip_location.join(g.outDegrees,trip_location("id") === g.outDegrees("id"),"left")
                                  .drop(g.outDegrees("id"))

val trip_prop = trip_outDegree.withColumn("prop",$"trips"/$"outDegree").orderBy(col("prop").desc)
trip_prop.show

+----+--------------------+-----+---------+-------------------+
|  id|            location|trips|outDegree|               prop|
+----+--------------------+-----+---------+-------------------+
|3488|  "8D QC Station 01"|    1|        1|                1.0|
|3245|"NYCBS DEPOT - DE...|    1|        2|                0.5|
|3182|"Yankee Ferry Ter...|  309|      900| 0.3433333333333333|
|3254|  "Soissons Landing"|  358|     1100|0.32545454545454544|
|3342|"Pioneer St & Ric...|   59|      299|0.19732441471571907|
|3477|"39 St & 2 Ave - ...|   45|      245| 0.1836734693877551|
|3532|"Ditmars Blvd & 1...|   70|      407|  0.171990171990172|
|3180|"Brooklyn Bridge ...|  232|     1354|0.17134416543574593|
|3423|"West Drive & Pro...|  367|     2463|0.14900527811611855|
|3636|"Expansion Wareho...|    1|        8|              0.125|
|3302|"Columbus Ave & W...|   74|      598|0.12374581939799331|
|3120|"Center Blvd & Bo...|   74|      622| 0.1189710610932476|
|3514|"Astoria Park S &...|   34|      2

trip_count: org.apache.spark.sql.DataFrame = [id: bigint, trips: int]
trip_location: org.apache.spark.sql.DataFrame = [id: string, location: string ... 1 more field]
trip_outDegree: org.apache.spark.sql.DataFrame = [id: string, location: string ... 2 more fields]
trip_prop: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string, location: string ... 3 more fields]


In [11]:
// round_trip_count_df.createOrReplaceTempView("trips")
// g.vertices.createOrReplaceTempView("vertices")
// spark.sql("select a.id, a.attr, b.trips from vertices as a left join trips as b on a.id = b.id")

<br><br><br>
<span style="color:green;font-size:xx-large">STEP 5: Create a new graph that contains all edges except for those between the same station</span>
<br><br>


In [12]:
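// Keep only trips whose start and end stations differ
val except_same_station = g.find("(a)-[e]->(b)").filter("a.id != b.id")
// The motif result has one row per trip, so deduplicate the stations before using them as vertices
val except_same_station_vertices = except_same_station.select("a.*").union(except_same_station.select("b.*")).distinct
val except_same_station_edges = except_same_station.select("e.*")
val new_g = GraphFrame(except_same_station_vertices,except_same_station_edges)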

except_same_station: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: struct<id: string, attr: string>, e: struct<src: string, dst: string ... 1 more field> ... 1 more field]
except_same_station_vertices: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string, attr: string]
except_same_station_edges: org.apache.spark.sql.DataFrame = [src: string, dst: string ... 1 more field]
new_g: org.graphframes.GraphFrame = GraphFrame(v:[id: string, attr: string], e:[src: string, dst: string ... 1 more field])


In [13]:
new_g.edges.printSchema

root
 |-- src: string (nullable = true)
 |-- dst: string (nullable = true)
 |-- duration_secs: float (nullable = false)



In [14]:
new_g.vertices.printSchema

root
 |-- id: string (nullable = true)
 |-- attr: string (nullable = true)



<br><br><br>
<span style="color:green;font-size:xx-large">STEP 6: Calculate the average duration between every pair of stations</span>
<br><br>
<li>use the new graph from step 5 for this</li>
<li>I'll let you figure this out but this should be really easy (think SQL)</li>
<pre>
+----+----+------------------+
| src| dst|                 m|
+----+----+------------------+
| 504| 350| 772.7647058823529|
| 433| 527| 532.9677419354839|
| 434| 470|316.52272727272725|
| 438| 151|  546.195652173913|
| 445| 507| 553.3947368421053|
|2021| 446| 827.6904761904761|
| 116| 518| 1115.857142857143|
|3435| 358| 895.6666666666666|
|3402|3414| 634.1666666666666|
| 498| 495| 801.2272727272727|
|3637| 418|             889.7|
| 380|3260|419.42105263157896|
|3360| 507|            1442.0|
| 326| 247|             713.0|
|3358| 467| 500.6666666666667|
|3164| 457|502.97241379310344|
| 498| 528| 434.3207547169811|
| 405|3256| 843.2168674698795|
| 477|2000|2672.3333333333335|
|3226|3163| 686.4444444444445|
+----+----+------------------+
only showing top 20 rows
</pre>
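The "think SQL" hint points at a group-by over (src, dst) followed by an average. As a rough illustration on plain Scala collections (names are hypothetical, and the self-trip filter stands in for the Step 5 graph):

```scala
object MeanDurationSketch {
  // Mean trip duration for every ordered (src, dst) pair with src != dst.
  def meanDuration(trips: Seq[(String, String, Double)]): Map[(String, String), Double] =
    trips
      .filter { case (s, d, _) => s != d }          // same restriction as the Step 5 graph
      .groupBy { case (s, d, _) => (s, d) }         // one group per ordered pair
      .map { case (pair, ts) => pair -> ts.map(_._3).sum / ts.size }
}
```

Note that the pairs are ordered: a→b and b→a get separate averages, as in the expected output above.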

In [15]:
new_g.edges.groupBy("src","dst").mean("duration_secs")
.withColumnRenamed("avg(duration_secs)","m")
// .filter($"src"==="504").filter("dst==350")
.show
// spark.sql("select src, dst, avg(duration_secs) from new_g.edges")

+----+----+------------------+
| src| dst|                 m|
+----+----+------------------+
| 467| 330|            1388.5|
| 467| 144| 791.6666666666666|
| 296|2021| 5554.333333333333|
|3312| 519|1330.1739130434783|
| 447|3289|            1035.4|
| 307| 454|            1307.8|
| 307| 417| 713.7407407407408|
| 307| 249|             764.5|
| 307| 498|            1260.0|
|3167|3258|            1017.0|
|3167|3577|            2074.0|
| 334| 439| 912.7391304347826|
|3553|3542|            1702.8|
|3553| 285|            2477.0|
| 334| 443|            2291.0|
|3408| 217| 968.7826086956521|
|3365|3324|             580.6|
| 442|3307|            1569.8|
| 470| 383| 518.5862068965517|
| 470|2003|          677.5625|
+----+----+------------------+
only showing top 20 rows



<br><br><br>
<span style="color:green;font-size:xx-large">STEP 7: Important stations</span><br><br>
Citibike wants to figure out how best to deploy its workers in checking whether a station is over-full (too many bikes) or needs more bikes. It figures that the best way to do this is to find out which stations are the most important in terms of flows:
<li>A station that has high bike returns and is connected to other stations with high bike returns is more likely to have too many bikes in its station and therefore should be monitored more often</li>
<li>A station that has high bike pickups and is connected to other stations with high bike pickups is more likely to be short of bikes and therefore should be monitored more often</li>
<li>Calculate the propensities for over-fullness and emptiness for every station</li>
<li>Report the 5 most important stations for over-fullness (use pageRank on the graph)</li>
<li>Report the 5 most important stations for emptiness (reverse all the edges on the graph and use pageRank)</li>
<li>Your results (Don't worry about the meaning of location names!):</li>
<li>Note: Assume a reset_probability of 0.15 and a tolerance of .0001 if you want the same results as mine</li>
<pre>
+---+--------------------+------------------+
| id|            location|          pagerank|
+---+--------------------+------------------+
|519|"Pershing Square ...| 4.930887390071603|
|426|"West St & Chambe...|3.7410934274030576|
|402|"Broadway & E 22 St"|  3.58520147183096|
|497|"E 17 St & Broadway"| 3.537658018512581|
|435|   "W 21 St & 6 Ave"| 3.438585855241344|
+---+--------------------+------------------+

+----+--------------------+------------------+
|  id|            location|          pagerank|
+----+--------------------+------------------+
|3197|      "Hs Don't Use"| 5.710640869520747|
| 519|"Pershing Square ...| 5.012823444592195|
|3480|      "WS Don't Use"| 4.272284643284593|
| 402|"Broadway & E 22 St"|3.4515211069038183|
| 497|"E 17 St & Broadway"|3.3347259745457443|
+----+--------------------+------------------+
</pre>
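To see why reversing the edges switches the question from "over-full" to "empty", it can help to look at PageRank on a toy graph. The sketch below is a plain power iteration written for illustration; it is an assumption of this note, not the GraphFrames or GraphX implementation, and its convergence rule may differ from theirs, so it will not reproduce the numbers above. Ranks start at 1.0 per vertex, and vertices without outgoing edges simply leak rank.

```scala
object PageRankSketch {
  // Power-iteration PageRank on a directed multigraph given as (src, dst) pairs.
  def pageRank(edges: Seq[(String, String)],
               resetProb: Double = 0.15,
               tol: Double = 1e-4,
               maxIter: Int = 100): Map[String, Double] = {
    require(edges.nonEmpty, "the sketch assumes at least one edge")
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val outDeg = edges.groupBy(_._1).map { case (s, es) => s -> es.size }
    var ranks: Map[String, Double] = vertices.map(v => v -> 1.0).toMap
    var delta = Double.MaxValue
    var iter = 0
    while (delta > tol && iter < maxIter) {
      // Each edge carries rank(src) / outDegree(src) to its destination.
      val incoming: Map[String, Double] = edges
        .map { case (s, d) => d -> ranks(s) / outDeg(s) }
        .groupBy(_._1)
        .map { case (d, ms) => d -> ms.map(_._2).sum }
      val next: Map[String, Double] = vertices.map { v =>
        v -> (resetProb + (1 - resetProb) * incoming.getOrElse(v, 0.0))
      }.toMap
      // Stop once no rank moved by more than `tol`.
      delta = vertices.map(v => math.abs(next(v) - ranks(v))).max
      ranks = next
      iter += 1
    }
    ranks
  }

  // Reversing every edge turns "receives many returns" into "supplies many pickups".
  def reverse(edges: Seq[(String, String)]): Seq[(String, String)] =
    edges.map(_.swap)
}
```

In this picture, `PageRankSketch.pageRank(trips)` ranks stations by incoming flow (returns), while `PageRankSketch.pageRank(PageRankSketch.reverse(trips))` ranks them by outgoing flow (pickups).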


In [21]:
val ranks = g.pageRank.resetProbability(0.15).tol(0.0001).run()
val top_fullness = ranks.vertices.orderBy(col("pagerank").desc)
top_fullness.show(5)

+---+--------------------+------------------+
| id|                attr|          pagerank|
+---+--------------------+------------------+
|519|"Pershing Square ...| 4.930887390071584|
|426|"West St & Chambe...|3.7410934274030416|
|402|"Broadway & E 22 St"| 3.585201471830956|
|497|"E 17 St & Broadway"|3.5376580185125737|
|435|   "W 21 St & 6 Ave"| 3.438585855241341|
+---+--------------------+------------------+
only showing top 5 rows



ranks: org.graphframes.GraphFrame = GraphFrame(v:[id: string, attr: string ... 1 more field], e:[src: string, dst: string ... 2 more fields])
top_fullness: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string, attr: string ... 1 more field]


In [22]:
val reverse_edges = g.edges.select($"src" as "dst",$"dst" as "src",$"duration_secs")
val reverse_g = GraphFrame(g.vertices,reverse_edges)
val reverse_ranks = reverse_g.pageRank.resetProbability(0.15).tol(0.0001).run()
val top_emptiness = reverse_ranks.vertices.orderBy(col("pagerank").desc)
top_emptiness.show(5)

+----+--------------------+------------------+
|  id|                attr|          pagerank|
+----+--------------------+------------------+
|3197|      "Hs Don't Use"| 5.710640869520743|
| 519|"Pershing Square ...| 5.012823444592204|
|3480|      "WS Don't Use"|  4.27228464328459|
| 402|"Broadway & E 22 St"| 3.451521106903815|
| 497|"E 17 St & Broadway"|3.3347259745457425|
+----+--------------------+------------------+
only showing top 5 rows



reverse_edges: org.apache.spark.sql.DataFrame = [dst: string, src: string ... 1 more field]
reverse_g: org.graphframes.GraphFrame = GraphFrame(v:[id: string, attr: string], e:[src: string, dst: string ... 1 more field])
reverse_ranks: org.graphframes.GraphFrame = GraphFrame(v:[id: string, attr: string ... 1 more field], e:[src: string, dst: string ... 2 more fields])
top_emptiness: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string, attr: string ... 1 more field]


<br><br><br>
<span style="color:green;font-size:xx-large">STEP 8: Calculate the clustering coefficient of every station</span><br><br>

<li>Find the number of triangles that each vertex belongs to in the undirected graph</li>
<li>Get the number of adjacent vertices (degrees of the undirected graph)</li>
<li>Calculate the number of possible triangles a vertex can belong to (for every vertex)</li>
<li>Divide actual triangles by possible triangles for each vertex</li>

<li>And report the top 20 stations by clustering coefficient</li>
<pre>
+----+--------------------+------------------+
|  id|            location|             coeff|
+----+--------------------+------------------+
|3040|     "GOW Tech Shop"|               1.0|
|3639|        "Harborside"|               1.0|
|3192|"Liberty Light Rail"|               1.0|
|3485| "NYCBS Depot - RIS"|               1.0|
|3647|    "48 Ave & 30 Pl"|               1.0|
|3279|       "Dixon Mills"|               1.0|
|3186|     "Grove St PATH"|               1.0|
| 153|   "E 40 St & 5 Ave"|               1.0|
| 339|"Avenue D & E 12 St"| 0.877201420748853|
|3464|"W 37 St & Broadway"|0.8679573382796197|
| 247|"Perry St & Bleec...|0.8602079768329604|
|3175|"W 70 St & Amster...|0.8592469808193227|
|3176|"W 64 St & West E...|0.8568452539928423|
|3623|"W 120 St & Clare...|0.8549019607843137|
|3491|  "E 118 St & 1 Ave"| 0.854122621564482|
| 266| "Avenue D & E 8 St"| 0.849218980253463|
|3441|   "10 Hudson Yards"|0.8482701509017299|
|3646|    "35 Ave & 10 St"|0.8333333333333334|
|3642|"E 98 St & Lexing...|             0.832|
| 444|"Broadway & W 24 St"|0.8283229697508064|
+----+--------------------+------------------+
only showing top 20 rows
</pre>
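For a vertex with k distinct neighbours, the number of possible triangles is k(k-1)/2, and the clustering coefficient is the number of actual triangles divided by that. The following toy version on plain Scala collections is only a sketch of the formula: the names are hypothetical, and it drops self-loops before counting neighbours, which is an assumption of the sketch and may not match how degrees come out of a GraphFrame that still contains round trips.

```scala
object ClusteringSketch {
  // Local clustering coefficient of every vertex in an undirected simple graph.
  def clustering(edges: Seq[(Int, Int)]): Map[Int, Double] = {
    // Ignore direction and self-loops; keep each neighbour once.
    val neighbours: Map[Int, Set[Int]] = edges
      .filter { case (a, b) => a != b }
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, ps) => v -> ps.map(_._2).toSet }

    neighbours.map { case (v, ns) =>
      val k = ns.size
      // A triangle through v is a pair of v's neighbours that are adjacent to each other.
      val triangles = ns.toSeq.combinations(2).count {
        case Seq(x, y) => neighbours(x).contains(y)
      }
      val possible = k * (k - 1) / 2.0
      v -> (if (possible == 0) 0.0 else triangles / possible)
    }
  }
}
```

On the graph with edges 1-2, 2-3, 3-1, 3-4, vertices 1 and 2 have coefficient 1.0, vertex 3 has 1/3 (one triangle out of three possible), and vertex 4 has 0.0.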

In [39]:
def make_undirected_graph(g: GraphFrame) = {
    val u_edge_df = g.find("(a)-[]->(b)")
        .select($"a.id".as("src"),$"b.id".as("dst"))
        .withColumn("swap",when(col("src")<col("dst"),col("dst")))
        .withColumn("dst",
                    when(col("swap").isNotNull,col("src"))
                    .otherwise(col("dst")))
        .withColumn("src",
                    when(col("swap").isNotNull,col("swap"))
                   .otherwise(col("src")))
        .drop(col("swap"))
        .distinct
    val u_vertices_df = g.vertices
    val u_g = GraphFrame(u_vertices_df,u_edge_df)    
    u_g
}

make_undirected_graph: (g: org.graphframes.GraphFrame)org.graphframes.GraphFrame


In [40]:
val triangles = g.triangleCount.run().withColumnRenamed("id","t_id") //Get the number of triangles each vertex belongs to
val degrees = make_undirected_graph(g).degrees //Get the number of adjacent vertices for each vertex
val possible = degrees.withColumn("possible",col("degree")*(col("degree")-1)/lit(2)) //Calculate possible triangles
val joined = triangles.select($"t_id",$"count",$"attr" as "location").join(possible,triangles("t_id")===possible("id"))
val coeff = joined.withColumn("coeff",col("count")/col("possible"))
coeff.orderBy(col("coeff").desc).select("id","location","coeff").show

+----+--------------------+------------------+
|  id|            location|             coeff|
+----+--------------------+------------------+
| 153|   "E 40 St & 5 Ave"|               1.0|
|3485| "NYCBS Depot - RIS"|               1.0|
|3040|     "GOW Tech Shop"|               1.0|
|3639|        "Harborside"|               1.0|
|3192|"Liberty Light Rail"|               1.0|
|3647|    "48 Ave & 30 Pl"|               1.0|
|3279|       "Dixon Mills"|               1.0|
|3186|     "Grove St PATH"|               1.0|
| 339|"Avenue D & E 12 St"| 0.877201420748853|
|3464|"W 37 St & Broadway"|0.8679573382796197|
| 247|"Perry St & Bleec...|0.8602079768329604|
|3175|"W 70 St & Amster...|0.8592469808193227|
|3176|"W 64 St & West E...|0.8568452539928423|
|3623|"W 120 St & Clare...|0.8549019607843137|
|3491|  "E 118 St & 1 Ave"| 0.854122621564482|
| 266| "Avenue D & E 8 St"| 0.849218980253463|
|3441|   "10 Hudson Yards"|0.8482701509017299|
|3646|    "35 Ave & 10 St"|0.8333333333333334|
|3642|"E 98 S

triangles: org.apache.spark.sql.DataFrame = [count: bigint, t_id: string ... 1 more field]
degrees: org.apache.spark.sql.DataFrame = [id: string, degree: int]
possible: org.apache.spark.sql.DataFrame = [id: string, degree: int ... 1 more field]
joined: org.apache.spark.sql.DataFrame = [t_id: string, count: bigint ... 4 more fields]
coeff: org.apache.spark.sql.DataFrame = [t_id: string, count: bigint ... 5 more fields]
