# Analyzing Engineering Faculty Member Bios

In [1]:
%%html
<iframe width="100%" height="550" 
src="http://engineering.vanderbilt.edu/people/">
</iframe>

## Transform XML to JSON

```html
<td>
    <h4>
        <a href="/bio/michael-alles">Michael Alles</a>
    </h4>
    <br>
    <strong>Intellectual Neighborhoods:</strong> 
    Risk and Reliability,Nano Science and Technology
</td>
```

```javascript
{
    "name":"Michael Alles",
    "focus":"",
    "nhood":" Risk and Reliability,Nano Science and Technology"
}
```
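The notebook does not show the code that performs this conversion. The following is a minimal sketch of one way it might be done in plain Scala with regular expressions. The object name `BioToJson`, its helpers, and the `Research Focus` label are assumptions made for illustration; only the `Intellectual Neighborhoods` label and the bio link appear in the sample above. A real scraper would more likely use an HTML parser.

```scala
import java.util.regex.Pattern

// Hypothetical helper; the original notebook does not show its conversion code.
object BioToJson {
  // Matches the faculty name inside the bio link, e.g. <a href="/bio/...">Name</a>
  private val NamePattern = """<a href="/bio/[^"]*">([^<]*)</a>""".r

  // Returns the text that follows <strong>label:</strong>, up to the next tag,
  // with runs of whitespace collapsed. Returns "" when the label is absent.
  private def field(cell: String, label: String): String = {
    val pattern =
      ("(?s)<strong>\\s*" + Pattern.quote(label) + ":\\s*</strong>([^<]*)").r
    pattern.findFirstMatchIn(cell)
      .map(_.group(1).replaceAll("\\s+", " ").trim)
      .getOrElse("")
  }

  // Escapes backslashes and double quotes (a minimal JSON string escape).
  private def escape(s: String): String =
    s.replace("\\", "\\\\").replace("\"", "\\\"")

  // Builds a single-line JSON record from one <td> cell.
  def toJson(cell: String): String = {
    val name = NamePattern.findFirstMatchIn(cell).map(_.group(1).trim).getOrElse("")
    val focus = field(cell, "Research Focus") // assumed label, not shown in the sample
    val nhood = field(cell, "Intellectual Neighborhoods")
    s"""{"name":"${escape(name)}","focus":"${escape(focus)}","nhood":"${escape(nhood)}"}"""
  }
}
```

Applied to the cell above, `BioToJson.toJson` yields the same three fields as the JSON sample, except that this sketch trims the leading space in `nhood`. Emitting each record on a single line suits Spark's JSON reader, which expects one object per line.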

## Use Word2Vec to Find Synonyms in Research Focus

In [2]:
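// Load the faculty records; read.json expects one JSON object per line
val sqlc = sqlContext
import sqlc.implicits._
val facultyDF = sqlc.read.json("tmp/engineeringFaculty.json")
// Register the DataFrame so it can be queried with SQL as "faculty"
facultyDF.registerTempTable("faculty")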

In [3]:
%%SQL
SELECT * FROM faculty

+--------------------+-------------------+--------------------+
|               focus|               name|               nhood|
+--------------------+-------------------+--------------------+
| Risk management,...|      Mark Abkowitz| Risk and Reliabi...|
| Nonlinear struct...|      Douglas Adams| Risk and Reliabi...|
| Human-System Int...|     Julie A. Adams| Cyber Physical S...|
|         Development|     Nicholas Adams|                    |
|                    |      Michael Alles| Risk and Reliabi...|
| Magnetic resonan...|      Adam Anderson| Biomedical Imagi...|
| Drop dynamics, a...|    A. V. Anilkumar| Energy and Natur...|
|                    |     Theodore Bapty|                    |
| Solar energy con...|      Rizia Bardhan| Regenerative Med...|
| Welding and weld...|Robert Joel Barnett| Energy and Natur...|
+--------------------+-------------------+--------------------+
only showing top 10 rows



## Select and transform input data using SQL queries

In [4]:
val focusDF = sqlc.sql("SELECT focus FROM faculty WHERE LENGTH(focus)>0")

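// Replace punctuation with spaces, split on whitespace, and drop empty tokens.
// Each row becomes a Tuple1[Array[String]] so that toDF yields an array column.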
val focusSplitDF = focusDF.map(r => Tuple1(r(0).toString
  .replaceAll("""[\p{Punct}]"""," ").split("\\s+").filterNot(_ == ""))
).toDF("focus")

focusSplitDF.show()

+--------------------+
|               focus|
+--------------------+
|[Risk, management...|
|[Nonlinear, struc...|
|[Human, System, I...|
|       [Development]|
|[Magnetic, resona...|
|[Drop, dynamics, ...|
|[Solar, energy, c...|
|[Welding, and, we...|
|[Dynamic, systems...|
|[Multiscale, beha...|
|[Bioinstrumentati...|
|[Microfluidics, m...|
|[Tech, based, ent...|
|[Technology, stra...|
|[Computer, aided,...|
|[Modeling, and, a...|
|[Radiation, effec...|
|[Virtual, Environ...|
|[nanoscience, gra...|
|[Information, pro...|
+--------------------+
only showing top 20 rows
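To make the tokenization step easier to inspect outside Spark, here is a small sketch that isolates the same string expression in a standalone helper. The name `FocusTokenizer` and the sample input in the note that follows are made up for illustration and are not part of the notebook.

```scala
// Hypothetical helper that mirrors the expression used inside the map above.
object FocusTokenizer {
  def tokenize(focus: String): Array[String] =
    focus
      .replaceAll("""[\p{Punct}]""", " ") // punctuation (hyphens, commas, ...) becomes a space
      .split("\\s+")                      // split on runs of whitespace
      .filterNot(_ == "")                 // a leading space produces an empty first token
}
```

For an invented input such as `" Human-System Integration, robotics"`, the helper returns the tokens `Human`, `System`, `Integration`, and `robotics`. The final filter matters because the `focus` values in the table begin with a space, which would otherwise leave an empty string at the front of each array. Note that case is preserved, so `Modeling` and `modeling` remain distinct tokens.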



In [14]:
import org.apache.spark.ml.feature.{Word2Vec, StopWordsRemover}

// Remove stop words
val remover = new StopWordsRemover().setInputCol("focus").setOutputCol("filtered")
val dataSet = remover.transform(focusSplitDF)

// Learn a mapping from words to Vectors
val word2Vec = (new Word2Vec()
  .setInputCol("filtered")
  .setOutputCol("result")
  .setVectorSize(100)
  .setMinCount(0))
val model = word2Vec.fit(dataSet)

// Look up the 50 words whose vectors are closest to "data"
val resultDF = model.findSynonyms("data", 50)

resultDF.show()

+---------------+--------------------+
|           word|          similarity|
+---------------+--------------------+
|      Hydrology|0.009062745917056093|
|      signaling|0.007882261001329424|
|       Modeling|0.007774564754125474|
|     underwater|0.007724221027430538|
|       spectral|0.007616483427429222|
|        antigen|0.007540289727149277|
|    distributed|0.007514163031044968|
|        welding|0.007366753454299188|
|         Secure|0.007362351015667...|
|      appraisal|0.007210608123060892|
|        display|0.007104453107582285|
|  environmental| 0.00675192222738749|
|           path|0.006719895624075037|
|      batteries|0.006464656359015513|
|        cardiac|0.006458926007323657|
|electrochemical|0.006457020772064...|
|       Computer|0.006418360492989414|
|          Raman|0.006353728381891622|
|       learning|0.006349404061198687|
|       recovery|0.006339612316653455|
+---------------+--------------------+
only showing top 20 rows
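As background for reading the `similarity` column: Word2Vec synonym lookups are generally ranked by cosine similarity between word vectors. The sketch below shows that measure in plain Scala, independent of Spark. The object name `Cosine` is hypothetical, and the exact scaling applied by `findSynonyms` may differ between Spark versions, so this is an illustration of the idea and not a reimplementation of the library call.

```scala
// Hypothetical helper illustrating cosine similarity between two vectors.
object Cosine {
  // dot(a, b) / (|a| * |b|); returns 0.0 if either vector has zero length.
  def similarity(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }
}
```

Vectors pointing in the same direction score 1.0 and orthogonal vectors score 0.0, regardless of their magnitudes.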



# Analyze all faculty

In [4]:
%%html
<iframe width="100%" height="550" 
src="http://virg.vanderbilt.edu/webtools/registry/FacDetail.aspx?fname=&lname=A&school=0&dept=0">
</iframe>