In [1]:
%%init_spark
launcher.packages= ["graphframes:graphframes:0.8.2-spark3.2-s_2.12"]

# Notes: Dataframe, ML

<h1>Sample Problem 1</h1>

<p><strong><span style="font-size: 14pt; color: #3598db;">tl;dr jump to the bottom, copy the code into a Jupyter notebook, and fill in the missing code.&nbsp;</span></strong></p>
<p>In this question, you will prepare data for text analysis. Starting with the raw text data, you need to do the necessary feature engineering to get the data ready for topic modeling by tokenizing and cleaning the text, extracting entities from the text, and then calculating the tfidf vector for the entities in each document. The operations are described below and some functions are already written for you so all you need to do is to follow the steps carefully!</p>
<p><span style="font-size: 14pt;"><strong>Step 1</strong></span>: Write a function&nbsp;<span style="color: #e03e2d;"><em>make_df</em></span> that takes a Seq of documents as an argument (each document is a <span style="color: #e03e2d;">(String,String) tuple</span> containing (document_id, document_text)) and returns a <span style="color: #e03e2d;">dataframe</span> with five columns.</p>
<p>column1: <span style="color: #e03e2d;">document_id</span> (the document id)</p>
<p>column2: <span style="color: #e03e2d;">document_text</span> (the original text of the document)</p>
<p>column3: <span style="color: #e03e2d;">cleaned_text</span> (the text with periods, \n's and commas removed)</p>
<p>column4: <span style="color: #e03e2d;">document_terms</span> (cleaned_text split on spaces)</p>
<p>column5: <span style="color: #e03e2d;">entity_terms</span> (any bi-gram in which both terms begin with uppercase letters is considered an entity, and the two terms are replaced by their concatenation; as the example below shows, terms that are not part of an entity are dropped)</p>
<p>As an example, if:</p>
<pre>val doc1 = ("d1", "New York is a city in the United States.")</pre>
<p><span style="color: #e03e2d;"><em>make_df(Seq(doc1)) </em></span>should return (note that the period has been removed in cleaned_text):</p>
<table style="border-collapse: collapse; width: 97.8346%; height: 58px;" border="1">
    <tbody>
        <tr style="height: 29px;">
            <td style="width: 7.95465%; height: 29px;">document_id</td>
            <td style="width: 29.5373%; height: 29px;">document_text</td>
            <td style="width: 22.4459%; height: 29px;">cleaned_text</td>
            <td style="width: 15.2029%; height: 29px;">document_terms</td>
            <td style="width: 24.7588%; height: 29px;">entity_terms</td>
        </tr>
        <tr style="height: 29px;">
            <td style="width: 7.95465%; height: 29px;">d1</td>
            <td style="width: 29.5373%; height: 29px;">New York is a city in the United States.</td>
            <td style="width: 22.4459%; height: 29px;">New York is a city in the United States</td>
            <td style="width: 15.2029%; height: 29px;">
                <pre>New, York, is, a, city, in, the, United, States</pre>
            </td>
            <td style="width: 24.7588%; height: 29px;">
                <pre>[NewYork, UnitedStates]</pre>
            </td>
        </tr>
    </tbody>
</table>
<p>&nbsp;</p>
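<p>As a rough illustration of how the last three columns relate to each other, the following plain-Scala sketch (no Spark) runs the example sentence through three hypothetical helpers: <em>clean</em>, <em>split</em> and <em>entities</em>. These names are assumptions made for this sketch only, and the bigram scan is written with sliding(2), which is just one possible way to do it.</p>

```scala
// Hypothetical helpers for illustration; not the functions you are asked to write.
def clean(text: String): String =
  text.replace("\n", " ").replace(".", " ").replace(",", " ").trim

def split(text: String): Array[String] = text.split("\\s+")

// Look at every pair of adjacent terms and keep the concatenation
// when both terms start with an uppercase letter.
def entities(terms: Array[String]): Array[String] =
  terms.sliding(2).collect {
    case Array(a, b)
        if a.headOption.exists(_.isUpper) && b.headOption.exists(_.isUpper) => a + b
  }.toArray

val terms = split(clean("New York is a city in the United States."))
println(terms.mkString(", "))
println(entities(terms).mkString(", "))  // NewYork, UnitedStates
```

<p>Because every adjacent pair is inspected, overlapping pairs each produce an entity, which is why the longer example further down contains terms such as YorkIt.</p>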
<p>The way to do this is to</p>
<p>1. create a <span style="color: #e03e2d;">dataframe</span> from the input sequence with the first two columns</p>
<p>2. write <span style="color: #e03e2d;">udfs</span> (user defined functions) for generating each subsequent column and then apply the udf to the dataframe using the withColumn transformation</p>
<p><strong>Step 2</strong>: Use <a class="inline_disabled" href="https://spark.apache.org/docs/latest/ml-features#countvectorizer" target="_blank" rel="noopener">CountVectorizer</a> on the <span style="color: #e03e2d;">entity_terms</span> column to generate the vocabulary and the vector of term frequencies for each document in a new column <span style="color: #e03e2d;">term_freqs</span>.</p>
<p><strong>Step 3</strong>: Use <a class="inline_disabled" href="https://spark.apache.org/docs/latest/ml-features.html#tf-idf" target="_blank" rel="noopener">IDF</a> to get the TF-IDF vector from the term frequencies, producing a dataframe with two columns: document_id and <span style="color: #e03e2d;">tfidfVec</span>.</p>
<p><strong>Example</strong>:</p>
<p>If the initial data is in the following two documents:</p>
<pre>val doc1 = ("doc 1","""<br />Columbia University is a large university in New York.<br />It has many schools including Columbia College, Engineering School, Law School, and Business School.<br />It was established in 1754<br />""")<br />val doc2 = ("doc 2","""<br />Operations Research is a department in the Engineering School of Columbia University.<br />Operations Research was established in 1919.<br />Operations Research has a BS major and offers many MS degrees.<br />Graduates of Operations Research get good jobs and have a very happy life.<br />""")</pre>
<p>then, the final idfMatrix dataframe should be:</p>
<table style="border-collapse: collapse; width: 97.8346%; height: 87px;" border="1">
    <tbody>
        <tr style="height: 29px;">
            <td style="width: 10.7925%; height: 29px;">document_id</td>
            <td style="width: 89.1069%; height: 29px;">tfidfVec</td>
        </tr>
        <tr style="height: 29px;">
            <td style="width: 10.7925%; height: 29px;">doc 1</td>
            <td style="width: 89.1069%; height: 29px;">
                <pre>(12,<br />[1,2,3,4,5,6,7,9,10,11],<br />[0.0,0.0,0.4054651081081644,0.4054651081081644,0.4054651081081644,<br />   0.4054651081081644,0.4054651081081644,0.4054651081081644,0.4054651081081644,<br />   0.4054651081081644])</pre>
            </td>
        </tr>
        <tr style="height: 29px;">
            <td style="width: 10.7925%; height: 29px;">doc 2</td>
            <td style="width: 89.1069%; height: 29px;">
                <pre>(12,[0,1,2,8],[1.6218604324326575,0.0,0.0,0.4054651081081644]) </pre>
            </td>
        </tr>
    </tbody>
</table>
<p>The term frequencies should be:</p>
<table style="border-collapse: collapse; width: 97.8346%;" border="1">
    <tbody>
        <tr>
            <td style="width: 99.8994%;">term_freqs</td>
        </tr>
        <tr>
            <td style="width: 99.8994%;">
                <pre>(12,[1,2,3,4,5,6,7,9,10,11],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])</pre>
            </td>
        </tr>
        <tr>
            <td style="width: 99.8994%;">
                <pre>(12,[0,1,2,8],[4.0,1.0,1.0,1.0])</pre>
            </td>
        </tr>
    </tbody>
</table>
<p>The entity_terms should be:</p>
<table style="border-collapse: collapse; width: 97.8346%;" border="1">
    <tbody>
        <tr>
            <td style="width: 99.8994%;">entity_terms</td>
        </tr>
        <tr>
            <td style="width: 99.8994%;">
                <pre>[ColumbiaUniversity, NewYork, YorkIt, ColumbiaCollege, CollegeEngineering, EngineeringSchool, SchoolLaw, LawSchool, BusinessSchool, SchoolIt]</pre>
            </td>
        </tr>
        <tr>
            <td style="width: 99.8994%;">
                <pre>[OperationsResearch, EngineeringSchool, ColumbiaUniversity, UniversityOperations, OperationsResearch, OperationsResearch, OperationsResearch]</pre>
            </td>
        </tr>
    </tbody>
</table>
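<p>The numbers in the tfidfVec table can be checked by hand. The sketch below assumes the smoothed formula given in the Spark ML feature documentation, idf(t) = ln((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. The helper name <em>idf</em> is an assumption for this sketch.</p>

```scala
// Smoothed inverse document frequency, as documented for Spark ML's IDF.
def idf(numDocs: Int, docFreq: Int): Double =
  math.log((numDocs + 1.0) / (docFreq + 1.0))

val m = 2                      // two documents in the example
val inBothDocs = idf(m, 2)     // ln(3/3): terms such as ColumbiaUniversity
val inOneDoc   = idf(m, 1)     // ln(3/2): terms that occur in a single document

// OperationsResearch occurs 4 times in the second document and nowhere else.
val operationsResearch = 4.0 * inOneDoc

println(inBothDocs)
println(inOneDoc)
println(operationsResearch)
```

<p>Terms that occur in both documents therefore get a weight of 0.0 regardless of how often they appear, which matches the zeros at indices 1 and 2 in both vectors.</p>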
<p>The code for this problem is outlined below. The easiest approach is to cut and paste it into your notebook and then fill in the missing parts.<br /><br /></p>
<pre>import org.apache.spark.sql.DataFrame<br />val doc1 = ("doc 1","""<br />Columbia University is a large university in New York.<br />It has many schools including Columbia College, Engineering School, Law School, and Business School.<br />It was established in 1754<br />""")<br />val doc2 = ("doc 2","""<br />Operations Research is a department in the Engineering School of Columbia University.<br />Operations Research was established in 1919.<br />Operations Research has a BS major and offers many MS degrees.<br />Graduates of Operations Research get good jobs and have a very happy life.<br />""")<br /><br />//This function takes two strings as input and returns true if both begin with an uppercase letter<br />//and false otherwise<br />//The function char.isUpper returns true if a character is an uppercase letter and false otherwise<br /><br />def both_uc(w1: String,w2: String): Boolean = //WRITE THIS FUNCTION<br /><br />//both_uc("columbia","University") returns false<br />//both_uc("Columbia","University") returns true<br />//both_uc("columbia","university") returns false<br /><br />//clean_data removes periods, \n's and commas from the text string<br />//Also use trim() to remove leading and trailing spaces<br />def clean_data(a: String): String = //WRITE THIS FUNCTION<br /><br />/*<br />val sample = """<br />Jim, Jill and John.<br />The three siblings.<br />"""<br />clean_data(sample)<br /><br />should return:<br /><br />String = Jim &nbsp;Jill and John &nbsp;The three siblings<br /><br />*/<br /><br /><br />//split_data takes a string and splits it on one or more spaces (just in case the text has extra&nbsp;<br />//spaces between words). This is written for you.&nbsp;<br />def split_data(a: String): Array[String] = a.split("\\s+")<br /><br />/*&nbsp;<br />split_data(clean_data(sample))<br /><br />should return:<br /><br />Array[String] = Array(Jim, Jill, and, John, The, three, siblings)<br /><br />*/<br /><br />//Given an Array of strings (the terms), find entities (pairs of words beginning with&nbsp;<br />// &nbsp;uppercase letters, concatenate them and replace the pair by the concatenation)<br />//I've written this for you as well!<br />def replace_entities(a: Array[String]):Array[String] = {<br />&nbsp; &nbsp; val indices = 0 to a.length-1<br />&nbsp; &nbsp; indices.slice(0,indices.length-1)<br />&nbsp; &nbsp; .flatMap(i =&gt;&nbsp;<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;if (both_uc(a(i),a(i+1))) Some(a(i)+a(i+1))<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;else None)<br />&nbsp; &nbsp; .toArray<br />}<br /><br />//Create udfs<br />val clean_data_udf =&nbsp;<br />val split_data_udf =&nbsp;<br />val replace_entities_udf =&nbsp;<br /><br />//Write the make_df function<br />//The function should return a DataFrame&nbsp;<br />//1. Write the signature. make_df takes a Seq of (document_id,document_text) tuples as argument<br />// and returns a DataFrame<br />//2. make an rdd and convert that into a DF with appropriate column names<br />//3. using withColumn transformations, clean the data (cleaned_text column),<br />// &nbsp; &nbsp; split the cleaned data (document_terms column)<br />// &nbsp; &nbsp; get the entities (entity_terms column)<br /><br />def make_df //WRITE THIS FUNCTION<br /><br />//make the df using make_df (Written for you)<br />val df = make_df(Array(doc1,doc2))<br /><br />//Using CountVectorizer, generate term_freqs column<br />import org.apache.spark.ml.feature.CountVectorizer<br />val countVectorizer = //WRITE THIS<br /><br />val vocabModel = countVectorizer.fit(df)<br />val freqs = vocabModel.transform(df)<br /><br />//Using IDF get the tfidfVec<br />import org.apache.spark.ml.feature.IDF<br /><br />val idf = //WRITE THIS<br />val idfModel = idf.fit(freqs)<br />val idfMatrix = //WRITE THIS<br /><br />idfMatrix.show(false) //The Result<br /></pre>

In [None]:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
val doc1 = ("doc 1","""
Columbia University is a large university in New York.
It has many schools including Columbia College, Engineering School, Law School, and Business School.
It was established in 1754
""")
val doc2 = ("doc 2","""
Operations Research is a department in the Engineering School of Columbia University.
Operations Research was established in 1919.
Operations Research has a BS major and offers many MS degrees.
Graduates of Operations Research get good jobs and have a very happy life.
""")

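//true only if both words start with an uppercase letter (empty strings never qualify)
def both_uc(w1: String,w2: String): Boolean =
    w1.nonEmpty && w2.nonEmpty && w1(0).isUpper && w2(0).isUpper
both_uc("columbia","University") //false

//split on one or more whitespace characters
def split_data(a: String): Array[String] = a.split("\\s+")
//replace newlines, periods and commas with spaces, then trim
def clean_data(a: String): String =
    a.replace("\n"," ").replace("."," ").replace(","," ").replace("  "," ").trim()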


def replace_entities(a: Array[String]):Array[String] = {
    val indices = 0 to a.length-1
    indices.slice(0,indices.length-1)
    .flatMap(i => 
         if (both_uc(a(i),a(i+1))) Some(a(i)+a(i+1))
         else None)
    .toArray
}

val clean_data_udf = udf(clean_data _)
val split_data_udf = udf(split_data _)
val replace_entities_udf = udf(replace_entities _)

def make_df(s: Seq[(String,String)]): DataFrame = {
    sc.parallelize(s)
        .toDF("document_id","document_text")
        .withColumn("cleaned_text",clean_data_udf($"document_text"))
        .withColumn("document_terms",split_data_udf($"cleaned_text"))
        .withColumn("entity_terms",replace_entities_udf($"document_terms"))
}


val df = make_df(Array(doc1,doc2))

import org.apache.spark.ml.feature.CountVectorizer
val countVectorizer = new CountVectorizer()
    .setInputCol("entity_terms")
    .setOutputCol("term_freqs")
    .setVocabSize(20)

val vocabModel = countVectorizer.fit(df)
val freqs = vocabModel.transform(df)

import org.apache.spark.ml.feature.IDF

val idf = new IDF()
    .setInputCol("term_freqs")
    .setOutputCol("tfidfVec")
val idfModel = idf.fit(freqs)
val idfMatrix = idfModel
                .transform(freqs)
                .select("document_id", "tfidfVec")

idfMatrix.show(false)

<h1>Sample Problem 2</h1>

<p><strong>Problem</strong>: Write a function that returns the betweenness centrality for each of the vertices in a graph g. Once again, you can skip down to the code below, cut and paste it into your notebook, and start working. Or, read the long explanation below.</p>
<h3>Betweenness centrality</h3>
<p>Betweenness Centrality is a measure of the criticality of a vertex in a graph and is generally calculated as a function of the number of shortest paths that pass through the vertex. A simple calculation is as follows:</p>
<p>1. If there are n vertices in a graph, then the number of possible shortest paths between all pairs of vertices is n*(n-1)</p>
<p>2. Calculate the shortest paths between all pairs of vertices in the graph</p>
<p>3. For any vertex k, count the number of shortest paths calculated in step 2 that go through vertex k (a shortest path from i to j, i!=k and j!=k, that contains k). Let this count be c<sub>k</sub></p>
<p>4. The betweenness centrality measure of vertex k is c<sub>k</sub>/(n*(n-1))</p>
<p>Betweenness centrality is important because keeping the vertices with the highest betweenness centrality operational is critical to keeping the graph operational. For example, United Airlines has a hub at Chicago O'Hare airport, and that airport is on the shortest path for flights between many pairs of cities in the United States (it has a high betweenness centrality). If the airport shuts down (a snowstorm), the disruptions are monumental!</p>
<h3>Betweenness Centrality given a graph g</h3>
<p>1. Write a function that returns an Array[(Int,Int)] whose elements are all vertex pairs in the graph. For example, if the graph has 3 vertices: 1, 2, 3; then the function should return (note that a graphframes graph is a directed graph)</p>
<pre>Array[(Int, Int)] = Array((1,2),(1,3),(2,1),(2,3),(3,1),(3,2))</pre>
<p>2. Write a function that, given a graph g, a vertex i, and a vertex j, returns an Array containing the intermediate nodes on the shortest path from i to j (i and j themselves are excluded). For example, for the example graph below, the shortest path from node 1 to node 8 is</p>
<pre>Array[Int] = Array(3, 5, 6)</pre>
<p>(assume that node ids are integers for the purposes of this problem)</p>
<p>3. Write a function that, given a graph g, returns a list (List[Array[Int]]) where each element is the shortest path between a pair of vertices. Construct this list recursively (does not need to be tail recursive). For the example graph below, this function should return:</p>
<pre>List[Array[Int]] = List(Array(), Array(), Array(), Array(3), Array(3, 5), Array(3, 5, 6), <br />     Array(3, 5, 6), Array(), Array(), Array(), Array(4), Array(3, 5), Array(4, 5, 6), <br />     Array(3, 5, 6), Array(), Array(), Array(), Array(), Array(5), Array(5, 6), Array(5, 6), <br />     Array(), Array(), Array(), Array(), Array(5), Array(5, 6), Array(5, 6), Array(), Array(), <br />     Array(), Array(), Array(), Array(6), Array(6), Array(), Array(), Array(), Array(), Array(), <br />     Array(), Array(), Array(), Array(), Array(), Array(), Array(), Array(), Array(), Array(), <br />     Array(), Array(), Array(), Array(), Array(), Array())</pre>
<p><strong>Note</strong>: an empty Array signifies that either there is no path between a pair of nodes (e.g., between 8 and 1) or that the path length is 1 and there are no intermediate nodes (e.g. between 1 and 3). Each array only contains the intermediate nodes</p>
<p>4. Remove (using filter) all empty arrays and then count the number of occurrences of every remaining node. For our example below, you may get (your values may be different because if there are two candidate shortest paths between a pair of nodes, the one that is included depends on the order in which worker nodes report their results!):</p>
<pre>scala.collection.immutable.Map[Int,Int] = Map(3 -&gt; 5, 5 -&gt; 12, 6 -&gt; 10, 4 -&gt; 3)</pre>
<p>(you might end up with a different data structure)</p>
<p>5. Calculate the betweenness centrality by dividing each count by the number of possible paths. For vertices that are not in the above map, the count is 0 and the betweenness centrality is 0.0. For our example, you should get something like the following (in this run vertices 3 and 4 each lie on 4 of the 56 possible paths; as noted in step 4, that split can vary between runs):</p>
<pre>Array[Double] = Array(0.0, 0.0, 0.07142857142857142, 0.07142857142857142, 0.21428571428571427, <br />0.17857142857142858, 0.0, 0.0)</pre>
<h3>EXAMPLE GRAPH</h3>
<p><img src="betweenness.png" alt="betweenness_sample_graph.png" /></p>
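<p>Steps 4 and 5 can be tried out on the driver without Spark. The sketch below takes the non-empty arrays from the List shown in step 3 as its input; the names <em>paths</em>, <em>counts</em> and <em>centrality</em> are assumptions for this illustration. It uses map rather than mapValues to build the counts, which is one way to end up with a plain Map.</p>

```scala
// Non-empty entries of the List[Array[Int]] shown in step 3.
val paths: List[Array[Int]] = List(
  Array(3), Array(3, 5), Array(3, 5, 6), Array(3, 5, 6),   // from vertex 1
  Array(4), Array(3, 5), Array(4, 5, 6), Array(3, 5, 6),   // from vertex 2
  Array(5), Array(5, 6), Array(5, 6),                      // from vertex 3
  Array(5), Array(5, 6), Array(5, 6),                      // from vertex 4
  Array(6), Array(6)                                       // from vertex 5
)

val n = 8
val denominator = n * (n - 1)   // 56 ordered vertex pairs

// Step 4: how often each vertex appears as an intermediate node.
val counts: Map[Int, Int] =
  paths.flatten.groupBy(identity).map { case (v, occurrences) => v -> occurrences.size }

// Step 5: divide by the number of possible paths; missing vertices count as 0.
val centrality = (1 to n).map(v => counts.getOrElse(v, 0).toDouble / denominator)

println(counts.toSeq.sorted)
println(centrality)
```

<p>With this particular list, vertex 3 appears 6 times and vertex 4 twice, while vertices 5 and 6 appear 12 and 10 times. Only the split between vertices 3 and 4 depends on which of the tied shortest paths out of vertex 2 were returned.</p>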
<p><strong>Cut and paste the code below into your Jupyter notebook. Quite a bit is already written for you; you only need to fill in the missing parts.</strong></p>
<pre>import org.apache.spark.sql._<br />import org.apache.spark.sql.functions._<br />import org.graphframes._<br /><br /><br />val vertexArray = Array(<br />&nbsp; (1,1),<br />&nbsp; (2,2),<br />&nbsp; (3,3),<br />&nbsp; (4,4),<br />&nbsp; (5,5),<br />&nbsp; (6,6),<br />&nbsp; (7,7),<br />&nbsp; (8,8)<br />)<br /><br /><br />val edgeArray = Array(<br />&nbsp; (1, 3),<br />&nbsp; (2, 3),<br />&nbsp; (2, 4),<br />&nbsp; (4, 5),<br />&nbsp; (3, 5),<br />&nbsp; (5, 6),<br />&nbsp; (6, 7),<br />&nbsp; (6, 8),<br />&nbsp; (7, 8)<br />)<br /><br />val vertex_df = spark.createDataFrame(vertexArray).toDF("id","v_desc")<br />val edge_df = spark.createDataFrame(edgeArray).toDF("src","dst")<br /><br />val g = GraphFrame(vertex_df, edge_df)<br /><br />//Function to get all vertex pairs. This is written for you<br />def getAllVertexPairs(g: GraphFrame): Array[(Int,Int)] = {<br />&nbsp; &nbsp; def getAllPairs(nums: Seq[Int]) =<br />&nbsp; &nbsp; &nbsp; &nbsp; nums.flatMap(x =&gt; nums.map(y =&gt; (x,y))).filter(p=&gt;p._1 != p._2)<br /><br />&nbsp; &nbsp; val col_vals = g.vertices.select("id").map(_.getInt(0)).collect.toSeq.toArray<br />&nbsp; &nbsp; val all_vertex_pairs = getAllPairs(col_vals).toArray<br />&nbsp; &nbsp; all_vertex_pairs<br />}<br /><br />//Function to get the shortest path between two vertices. This is also already written<br />//for you<br />//Note that this uses the bfs algorithm. So it will take some time to run and should<br />//not be run on large graphs!<br /><br />def getShortestPath(g: GraphFrame,i: Int, j: Int) = {<br />&nbsp; &nbsp; val path_df = g.bfs.fromExpr(s"id=$i").toExpr(s"id=$j").run()&nbsp;<br />&nbsp; &nbsp; if (path_df.count &gt; 0) {<br />&nbsp; &nbsp; &nbsp; &nbsp; val cols = path_df.columns.filter(n=&gt;n.contains("v")).map(n=&gt;col(n+".id"))<br />&nbsp; &nbsp; &nbsp; &nbsp; val a = path_df.select(cols:_*).rdd.collect()(0).toSeq.toArray.map(e =&gt; e.toString.toInt)<br />&nbsp; &nbsp; &nbsp; &nbsp; a<br />&nbsp; &nbsp; }<br />&nbsp; &nbsp; else Array[Int]()<br />}<br /><br />//Function that returns all shortest paths<br />//You need to fill in the loop function<br />//Keep in mind:<br />//if the array a is empty, you should return an empty List<br />//if the array has one element, find the shortest path for that element<br />// and return a List of one element that contains the shortest path<br />//If the array has &gt; 1 element, find the shortest path for the first element<br />// in the list and return that shortest path CONS a call to loop with the remaining elements<br />//Use array.slice(start,end) to get the remaining elements<br />def getAllShortestPaths(g: GraphFrame):List[Array[Int]] &nbsp;= {<br />&nbsp; &nbsp; def loop(a: Array[(Int,Int)]):List[Array[Int]] = {<br />&nbsp; &nbsp; &nbsp; &nbsp; //YOU NEED TO DO THIS. SEE COMMENT IMMEDIATELY ABOVE<br /><br />&nbsp; &nbsp; }<br />&nbsp; &nbsp; val all_vertex_pairs = getAllVertexPairs(g)<br />&nbsp; &nbsp; loop(all_vertex_pairs)<br />}<br /><br />def getBetweenessCentrality(g: GraphFrame) = {<br />&nbsp; &nbsp; //get all shortest paths removing empty paths<br />&nbsp; &nbsp; val all_shortest_paths = //FILL THIS PART<br />&nbsp; &nbsp; //get all vertices in the graph in an array<br />&nbsp; &nbsp; //select the "id" column<br />&nbsp; &nbsp; //convert into an rdd<br />&nbsp; &nbsp; //convert each element into an Int<br />&nbsp; &nbsp; val vertices = //FILL THIS PART<br />&nbsp; &nbsp; //Calculate the denominator for betweenness centrality<br />&nbsp; &nbsp; val denominator = //n * (n-1) where n is number of vertices<br />&nbsp; &nbsp;&nbsp;<br />&nbsp; &nbsp; //for each vertex, calculate the betweenness centrality<br />&nbsp; &nbsp; //see notes below<br />&nbsp; &nbsp; vertices.map(v =&gt; //FILL THIS PART)<br />}<br /><br />//Result<br />val b = getBetweenessCentrality(g)<br />b.collect<br /><br />/*<br />NOTES:<br />NOTE 1:<br />to get the count of the number of elements in an array, you can do the following:<br /><br />val x = Array(1,1,3,2,1,4,4,4)<br />x.groupBy(identity) groups array elements by value<br />x.groupBy(identity) returns scala.collection.immutable.Map[Int,Array[Int]] = Map(1 -&gt;&nbsp;<br />&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; Array(1, 1, 1), 3 -&gt; Array(3), 2 -&gt; Array(2), 4 -&gt; Array(4, 4, 4))<br />x.groupBy(identity).mapValues(_.size) calculates the number of each element<br />x.groupBy(identity).mapValues(_.size) returns Map(1 -&gt; 3, 3 -&gt; 1, 2 -&gt; 1, 4 -&gt; 3)<br /><br />You could also use map and reduceByKey, but this may be easier<br /><br />NOTE 2:<br />Given&nbsp;<br />val y = Map(1 -&gt; 3, 3 -&gt; 1, 2 -&gt; 1, 4 -&gt; 3)<br />y.getOrElse(3,0) returns 1 because 3 -&gt; 1 in y<br />y.getOrElse(8,0) returns 0 because 8 is not a key in y<br /><br />NOTE 3:<br />If you get a serializability error, make sure that any function you're calling,<br />&nbsp; even an anonymous one, has all the data necessary to compute a value.<br />*/<br /></pre>

In [None]:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.graphframes._


val vertexArray = Array(
  (1,1),
  (2,2),
  (3,3),
  (4,4),
  (5,5),
  (6,6),
    (7,7),
    (8,8)
)

//val vertexArray = Array(1,2,3,4,5,6,7,8)

val edgeArray = Array(
  (1, 3),
  (2, 3),
  (2, 4),
  (4, 5),
  (3, 5),
  (5, 6),
  (6, 7),
  (6, 8),
  (7, 8)
)

val vertex_df = spark.createDataFrame(vertexArray).toDF("id","v_desc").drop("v_desc")
val edge_df = spark.createDataFrame(edgeArray).toDF("src","dst")

val g = GraphFrame(vertex_df, edge_df)

//All vertex pairs
def getAllVertexPairs(g: GraphFrame): Array[(Int,Int)] = {
    def getAllPairs(nums: Seq[Int]) =
        nums.flatMap(x => nums.map(y => (x,y))).filter(p=>p._1 != p._2)

    val col_vals = g.vertices.select("id").map(_.getInt(0)).collect.toSeq.toArray
    val all_vertex_pairs = getAllPairs(col_vals).toArray
    all_vertex_pairs
}

//getAllVertexPairs(g)

def getShortestPath(g: GraphFrame,i: Int, j: Int) = {
    val path_df = g.bfs.fromExpr(s"id=$i").toExpr(s"id=$j").run() 
    if (path_df.count > 0) {
        val cols = path_df.columns.filter(n=>n.contains("v")).map(n=>col(n+".id"))
        val a = path_df.select(cols:_*).rdd.collect()(0).toSeq.toArray.map(e => e.toString.toInt)
        a
    }
    else Array[Int]()
}

def getAllShortestPaths(g: GraphFrame):List[Array[Int]]  = {
    def loop(a: Array[(Int,Int)]):List[Array[Int]] = {
        if (a.length == 0) List[Array[Int]]()
        else {
            val sp = getShortestPath(g,a(0)._1,a(0)._2)
            if (a.length == 1)
                List(sp)
            else sp ::loop(a.slice(1,a.length))
        }

    }
    val all_vertex_pairs = getAllVertexPairs(g)
    loop(all_vertex_pairs)
}

def getBetweenessCentrality(g: GraphFrame) = {
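    //get all shortest paths removing empty paths
    val all_shortest_paths = getAllShortestPaths(g).filter(p => p.length > 0)
    //count, once on the driver, how often each vertex is an intermediate node
    //(map instead of mapValues so that the result is a strict Map that can be shipped to the workers)
    val counts = all_shortest_paths.flatten.groupBy(identity).map { case (k, vs) => (k, vs.size) }
    val vertices = g.vertices.select("id").rdd.map(v=>v(0).toString.toInt)
    //n * (n-1), with a single count job
    val n = vertices.count
    val denominator = n * (n - 1)
    vertices.map(v => counts.getOrElse(v,0)*1.0/denominator)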
}


val b = getBetweenessCentrality(g)
b.collect

<h1>SAMPLE QUESTION 3</h1>

A social network stores information about each person in a Demographics object and information about each relationship in a Connection object. A Connection records the strength of the relationship (the strength attribute) and the probability that a message received by a person will be shared with that connection (the msgProbability attribute).

You have been hired by an advertising company to explore this data. The goal of your research is to identify users that the company should target to diffuse its message. A "good diffuser" has the following characteristics:

1. Must be older than 21 years

2. Must have an income over $20,000

3. Must have friends who are 21 years or older, with a relationship strength greater than or equal to 3

Write code that returns the vertex id and the number of friends for each user that satisfies the above criteria. You must use aggregateMessages for this program!

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

case class Demographics(age: Int,gender: Char, income: Double)
case class Person(name: String,demographics: Demographics)
case class Connection(strength: Int,msgProbability: Double)


//Example users and connections

val users = Array(
(1L, Person("Alice",Demographics(28,'F',150000.0))),
(2L, Person("Bob",Demographics(27,'M',50000.0))),
(3L, Person("Charlie",Demographics(65,'M',250000.0))),
(4L, Person("David",Demographics(42,'M',750000.0))),
(5L, Person("Ed",Demographics(55,'M',25000.0))),
(6L, Person("Fran",Demographics(50,'F',3150000.0))),
(7L, Person("Jack",Demographics(17,'M',5000.0))),
(8L, Person("Jill",Demographics(16,'F',1000.0)))
)

val connections = Array(
Edge(2L, 1L, Connection(7,.2)),
Edge(2L, 4L, Connection(2,.7)),
Edge(3L, 2L, Connection(4,.31)),
Edge(3L, 6L, Connection(3,.22)),
Edge(4L, 1L, Connection(1,.12)),
Edge(5L, 2L, Connection(2,.45)),
Edge(5L, 3L, Connection(8,.91))
)

val vertexRDD: RDD[(Long, Person)] = sc.parallelize(users)
val edgeRDD: RDD[Edge[Connection]] = sc.parallelize(connections)

val social_graph = Graph(vertexRDD,edgeRDD)

Your code should return an RDD containing:

Array((2,1), (3,2), (5,2))
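
As a mental model, aggregateMessages has a "send" step that inspects each edge and may emit a message to one of its endpoints, and a "merge" step that combines the messages addressed to the same vertex. The sketch below imitates that pattern with ordinary Scala collections on hypothetical toy data (not the sample graph above). ToyEdge, age and toyEdges are assumptions for this illustration, and the income condition is left out for brevity.

```scala
// Hypothetical toy data: vertex id -> age, plus directed edges with a strength.
case class ToyEdge(src: Long, dst: Long, strength: Int)

val age = Map(1L -> 30, 2L -> 25, 3L -> 18, 4L -> 40)
val toyEdges = Seq(
  ToyEdge(1L, 2L, 5), ToyEdge(1L, 3L, 4), ToyEdge(1L, 4L, 3),
  ToyEdge(2L, 1L, 2), ToyEdge(4L, 1L, 6), ToyEdge(3L, 1L, 9)
)

// "send": one message (src, 1) per edge on which every condition holds.
// The message goes to the source, so the count lands on the user being evaluated.
val messages = toyEdges.collect {
  case ToyEdge(s, d, st) if age(s) > 21 && age(d) >= 21 && st >= 3 => (s, 1)
}

// "merge": add up the messages addressed to the same vertex.
val friendCounts = messages.groupBy(_._1).map { case (id, ms) => id -> ms.map(_._2).sum }

println(friendCounts.toSeq.sorted)  // List((1,2), (4,1))
```

Vertices that receive no message simply do not appear in the result, which is also how aggregateMessages behaves.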
 


In [None]:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

case class Demographics(age: Int,gender: Char, income: Double)
case class Person(name: String,demographics: Demographics)
case class Connection(strength: Int,msgProbability: Double)


//Example users and connections

val users = Array(
(1L, Person("Alice",Demographics(28,'F',150000.0))),
(2L, Person("Bob",Demographics(27,'M',50000.0))),
(3L, Person("Charlie",Demographics(65,'M',250000.0))),
(4L, Person("David",Demographics(42,'M',750000.0))),
(5L, Person("Ed",Demographics(55,'M',25000.0))),
(6L, Person("Fran",Demographics(50,'F',3150000.0))),
(7L, Person("Jack",Demographics(17,'M',5000.0))),
(8L, Person("Jill",Demographics(16,'F',1000.0)))
)

val connections = Array(
Edge(2L, 1L, Connection(7,.2)),
Edge(2L, 4L, Connection(2,.7)),
Edge(3L, 2L, Connection(4,.31)),
Edge(3L, 6L, Connection(3,.22)),
Edge(4L, 1L, Connection(1,.12)),
Edge(5L, 2L, Connection(2,.45)),
Edge(5L, 3L, Connection(8,.91))
)

val vertexRDD: RDD[(Long, Person)] = sc.parallelize(users)
val edgeRDD: RDD[Edge[Connection]] = sc.parallelize(connections)

val social_graph = Graph(vertexRDD,edgeRDD)

//One pass is enough: every condition can be checked on the edge triplet.
//The message is sent to the source vertex so that the count is attached to the
//user whose age and income were checked, not to the friend.
val result = social_graph.aggregateMessages[Int](
  triplet => {
    if (triplet.srcAttr.demographics.age > 21
      && triplet.srcAttr.demographics.income > 20000
      && triplet.dstAttr.demographics.age >= 21
      && triplet.attr.strength >= 3) {
      triplet.sendToSrc(1)
    }
  },
  //merge: add up the qualifying friends of each user
  (a, b) => a + b
)
result.collect

Intitializing Scala interpreter ...

Spark Web UI available at http://192.168.0.149:4040
SparkContext available as 'sc' (version = 3.3.0, master = local[*], app id = local-1670800381956)
SparkSession available as 'spark'


<h1>SAMPLE QUESTION 4</h1>

In [None]:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

case class PersonData(age: Int,gender: Char, income: Double)
case class Client(name: String,data: PersonData)
case class Relationship(strength: Int,msgProbability: Double)


//Example users and connections

val people = Array(
  (1L, Client("Alice",PersonData(28,'F',150000.0))),
  (2L, Client("Bob",PersonData(27,'M',50000.0))),
  (3L, Client("Charlie",PersonData(65,'M',250000.0))),
  (4L, Client("David",PersonData(42,'M',750000.0))),
  (5L, Client("Ed",PersonData(55,'M',25000.0))),
  (6L, Client("Fran",PersonData(50,'F',3150000.0))),
  (7L, Client("Jack",PersonData(17,'M',5000.0))),
  (8L, Client("Jill",PersonData(16,'F',1000.0)))
)

val relationships = Array(
  Edge(2L, 1L, Relationship(7,.2)),
  Edge(2L, 4L, Relationship(2,.7)),
  Edge(3L, 2L, Relationship(4,.31)),
  Edge(3L, 6L, Relationship(3,.22)),
  Edge(4L, 1L, Relationship(1,.12)),
  Edge(5L, 2L, Relationship(2,.45)),
  Edge(5L, 3L, Relationship(8,.91))
)

val vertexRDD: RDD[(Long, Client)] = sc.parallelize(people)
val edgeRDD: RDD[Edge[Relationship]] = sc.parallelize(relationships)

val social_graph = Graph(vertexRDD,edgeRDD)

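//For every vertex, returns the highest probability that a message starting at sourceId
//reaches it along a single path (the product of the msgProbability values on that path),
//using at most 3 Pregel iterations
def get_max_path(social_graph: Graph[Client,Relationship],sourceId: VertexId) = {
    //the source starts with probability 1.0, every other vertex with 0.0
    val initialGraph = social_graph.mapVertices((id,_) => if (id == sourceId) 1.0 else 0.0)
    //keep the larger of the current and the incoming probability
    val vertexProgram = (id: VertexId, prob: Double, newProb: Double) => math.max(prob, newProb)
    val sendMsg = (triplet:EdgeTriplet[Double,Relationship]) => {  // Send Message
        val candidate = triplet.srcAttr * triplet.attr.msgProbability
        //only notify the destination when the path through src improves on its current value
        //(a source with probability 0.0 can never improve anything)
        if (candidate > triplet.dstAttr) Iterator((triplet.dstId,candidate))
        else Iterator.empty
      }
    //when several messages arrive at the same vertex, keep the best one
    val mrgMsg = (a: Double, b: Double) => math.max(a, b)
    
    val maxPath = initialGraph.pregel(0.0,3)(vertexProgram,sendMsg,mrgMsg)
    maxPath.vertices
}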
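
To see what the Pregel program above works out, the following plain-Scala sketch repeats the same "relax every edge, keep the maximum" rule for a fixed number of rounds on the relationship edges from this cell. It is only a local emulation under that assumption and does not use GraphX; the names edges and maxPathProb are hypothetical.

```scala
// (src, dst, msgProbability) for the relationships defined above.
val edges = Seq(
  (2L, 1L, 0.2), (2L, 4L, 0.7), (3L, 2L, 0.31), (3L, 6L, 0.22),
  (4L, 1L, 0.12), (5L, 2L, 0.45), (5L, 3L, 0.91)
)

def maxPathProb(source: Long, vertices: Seq[Long], rounds: Int): Map[Long, Double] = {
  // The source starts at 1.0, everything else at 0.0.
  val init = vertices.map(v => v -> (if (v == source) 1.0 else 0.0)).toMap
  (1 to rounds).foldLeft(init) { (prob, _) =>
    // Send: propose srcProb * edgeProb whenever it beats the destination's value.
    val proposals = edges.collect {
      case (src, dst, p) if prob(src) * p > prob(dst) => dst -> prob(src) * p
    }
    // Merge: keep the best proposal per destination, then update.
    val best = proposals.groupBy(_._1).map { case (dst, ps) => dst -> ps.map(_._2).max }
    prob ++ best
  }
}

val result = maxPathProb(5L, 1L to 8L, rounds = 3)
result.toSeq.sortBy(_._1).foreach(println)
```

Starting from vertex 5 (Ed), vertex 2 keeps the direct probability 0.45 because the detour through vertex 3 only reaches 0.91 * 0.31, and vertices 7 and 8 stay at 0.0 because nothing points to them.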

 
 

<h1>SAMPLE QUESTION 5</h1>

In [None]:
val df = spark.createDataFrame(Seq(
(1,1,3.2),
(1,2,4.3),
(2,4,1.9),
(2,2,3.3),
(2,1,4.1),
(3,15,4.5),
(3,2,4.3)))
.toDF("user_id","movie_id","rating")
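//the same rows are used for training and testing here, so the RMSE below is an in-sample error
val train_df = df
val test_df = df
//baseline model: predict each movie's mean rating in the training data
val avg_df = train_df.groupBy("movie_id").avg("rating")
//inner join: a test row whose movie_id is missing from avg_df would be dropped
val new_df = test_df.join(avg_df,Seq("movie_id")).withColumnRenamed("avg(rating)","prediction")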
import org.apache.spark.ml.evaluation.RegressionEvaluator
val evaluator = new RegressionEvaluator()
.setMetricName("rmse")
.setLabelCol("rating")
.setPredictionCol("prediction")

val rmse = evaluator.evaluate(new_df)

<h1>SAMPLE QUESTION 6</h1>

In [None]:
val dataRDD = spark.sparkContext.makeRDD(
"""[{"name":"Le Monde","reviews":{"count":14,"rating":3.2},"serves":{"alcohol":true,"vegetarian":false}} ,
{"name":"Junzi Kitchen","reviews":{"count":7,"rating":4.5},"serves":{"alcohol":false,"vegetarian":true}},
{"name":"Atlas Kitchen","reviews":{"count":9,"rating":2.9},"serves":{"alcohol":true,"vegetarian":true}}]""":: Nil)
val df = spark.read.json(dataRDD)

import org.apache.spark.sql.functions.udf
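// One point each for serving alcohol and vegetarian food, plus half the review rating
def score(alc: Boolean, vg: Boolean, ra: Double): Double = {
    val bonus = (if (alc) 1.0 else 0.0) + (if (vg) 1.0 else 0.0)
    bonus + ra / 2.0
}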
val score_udf = udf(score _)

val df2 = df.withColumn("score",score_udf($"serves.alcohol",$"serves.vegetarian",$"reviews.rating"))

df2.show

<h1>SAMPLE QUESTION 7</h1>

In [2]:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.udf

val df = spark.read.option("header","true").option("inferSchema","true").csv("AAPL.csv")
def set_label(v: Double):Double = if (v>0) 1 else 0
val label_udf = udf(set_label _)
val df_new = df.withColumn("ma8", avg(df("price")).over( Window.orderBy("date").rowsBetween(-7,0)))
    .withColumn("ma13", avg(df("price")).over( Window.orderBy("date").rowsBetween(-12,0)))
    .withColumn("diff",$"ma8"-$"ma13")
    .withColumn("label",label_udf($"diff"))
import org.apache.spark.ml.feature.QuantileDiscretizer


val discretizer = new QuantileDiscretizer()
  .setInputCol("diff")
  .setOutputCol("deciles")
  .setNumBuckets(10)

val result = discretizer.fit(df_new).transform(df_new)
result.show(100,false)

Intitializing Scala interpreter ...

Spark Web UI available at http://10.56.160.213:4040
SparkContext available as 'sc' (version = 3.3.0, master = local[*], app id = local-1671142485698)
SparkSession available as 'spark'


22/12/15 17:14:57 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

|1981-03-27 00:00:00|0.086604|0.090485625        |0.08630092307692308|0.0041847019230769195 |1.0  |6.0    |
|1981-03-30 00:00:00|0.086604|0.09004825000000001|0.08714207692307692|0.002906173076923091  |1.0  |5.0    |
|1981-03-31 00:00:00|0.085729|0.089610875        |0.08768046153846153|0.00193041346153848   |1.0  |5.0    |
|1981-04-01 00:00:00|0.084854|0.08895475         |0.08821876923076924|7.359807692307596E-4  |1.0  |4.0    |
|1981-04-02 00:00:00|0.09229 |0.08879075         |0.08909353846153847|-3.027884615384724E-4 |0.0  |4.0    |
|1981-04-03 00:00:00|0.092727|0.088736           |0.08969915384615385|-9.631538461538497E-4 |0.0  |3.0    |
|1981-04-06 00:00:00|0.090977|0.08868125         |0.08976638461538464|-0.0010851346153846336|0.0  |3.0    |
|1981-04-07 00:00:00|0.090103|0.088736           |0.08983369230769232|-0.0010976923076923273|0.0  |3.0    |
|1981-04-08 00:00:00|0.094477|0.08972012500000001|0.09017015384615383|-4.500288461538188E-4 |0.0  |4.0    |
|1981-04-09 00:00:00|0.09622

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.udf
df: org.apache.spark.sql.DataFrame = [date: timestamp, price: double]
set_label: (v: Double)Double
label_udf: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$3887/0x000000080175b040@a296009,DoubleType,List(Some(class[value[0]: double])),Some(class[value[0]: double]),None,false,true)
df_new: org.apache.spark.sql.DataFrame = [date: timestamp, price: double ... 4 more fields]
import org.apache.spark.ml.feature.QuantileDiscretizer
discretizer: org.apache.spark.ml.feature.QuantileDiscretizer = quantileDiscretizer_6e9a68e5f983
result: org.apache.spark.sql.DataFrame = [date: timestamp, price: double ... 5 more fields]
