# Named Entity Recognition

Named Entity Recognition (NER) is the task of extracting entities (people, organizations, locations, etc.) from natural language text. In this example, we show how to use Stanford's CoreNLP to extract named entities and build a word cloud showing their relative frequencies.

## Import Dependencies

First, we configure the Spark session (executor count, cores, and memory) and load the `sparksolrini.jar` dependency:

In [None]:
%%init_spark
launcher.conf.spark.executor.instances = 8
launcher.conf.spark.executor.cores = 32
launcher.conf.spark.executor.memory = '8G'
launcher.conf.spark.driver.memory = '4G'
launcher.jars = ["sparksolrini.jar"]

## Extract Named Entities

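Next, we extract the named entities from the Solr documents that match the query. Because the result is coalesced into one partition, the output directory contains a single file (`part-00000`) with one entity mention per line.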

In [None]:
import com.lucidworks.spark.rdd.SelectSolrRDD
import edu.stanford.nlp.pipeline.{CoreDocument, StanfordCoreNLP}
import org.apache.hadoop.fs.{FileSystem, Path}

import java.util.Properties
import collection.JavaConversions._

// Solr's ZooKeeper URL
val SOLR = "192.168.1.111:9983"

// The Solr collection
val INDEX = "core17"

// The Solr query
val QUERY = "contents:music"

// The number of partitions
val PARTITIONS = 8

// Entity type to keep (PERSON, ORGANIZATION, LOCATION, DATE, etc.); use "*" to keep all types
val ENTITY_TYPE = "PERSON"

// Maximum number of rows to process
val LIMIT = 100

// Output directory
val OUT_DIR = "ner"

// Delete old output dir
FileSystem.get(sc.hadoopConfiguration).delete(new Path(OUT_DIR), true)

val rdd = new SelectSolrRDD(SOLR, INDEX, sc, maxRows = Some(LIMIT))
    .rows(1000)
    .query(QUERY)
    .repartition(PARTITIONS)
    .mapPartitions(docs => {
        
        val props = new Properties()
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner")
        props.setProperty("ner.applyFineGrained", "false")
        props.setProperty("ner.useSUTime", "false")
        props.setProperty("threads", "8")
        
        val pipeline = new StanfordCoreNLP(props)
        val entities = docs.map(doc => {
                        
            val coreDoc = new CoreDocument(doc.get("raw").asInstanceOf[String])
            pipeline.annotate(coreDoc)
          
            if (ENTITY_TYPE.equals("*")) {
                coreDoc.entityMentions().map(x => x.toString).toList
            } else {
                coreDoc.entityMentions().filter(cem => cem.entityType().equals(ENTITY_TYPE)).map(x => x.toString).toList
            }
        })
        
        entities
                
    })
    .flatMap(x => x)
    .coalesce(1)

// Write the entities to the output directory
rdd.saveAsTextFile(OUT_DIR)

// Preview the first 10 entities
rdd.take(10)
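
The word cloud sizes each name by how often it occurs, so it can help to inspect the counts directly. The following is a minimal sketch in plain Scala, assuming the entity mentions are available as a sequence of strings, one per line as in `part-00000`. The sample values and the `countMentions` helper are hypothetical and only illustrate the idea.

```scala
// Hypothetical sample standing in for the lines of ner/part-00000.
val mentions = Seq("Bob Dylan", "Mozart", "Bob Dylan", "Madonna", "Mozart", "Bob Dylan")

// Count how often each mention occurs, most frequent first
// (ties are broken alphabetically).
def countMentions(lines: Seq[String]): Seq[(String, Int)] =
  lines
    .map(_.trim)
    .filter(_.nonEmpty)
    .groupBy(identity)
    .map { case (mention, group) => (mention, group.size) }
    .toSeq
    .sortBy { case (mention, count) => (-count, mention) }

val counts = countMentions(mentions)
counts.foreach { case (mention, count) => println(s"$mention\t$count") }
```

On the full dataset, the same counts could probably be computed on the RDD itself with `rdd.map((_, 1)).reduceByKey(_ + _)` before writing the output.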

## Generate Word Cloud

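Now we can generate the word cloud using the Python `word_cloud` package. The cell below removes any previous local copy of the output, copies the new output from HDFS to the local filesystem, and runs the `cloud.sh` script to generate the image.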

In [None]:
import sys.process._

// Remove the old output directory
"rm -rf /tmp/ner" !

// Copy new output from HDFS to local filesystem
"hdfs dfs -copyToLocal ner /tmp/ner" !

// Generate the word cloud
"./cloud.sh" !

![](ner.png)
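
The shell steps above ignore exit codes, so a failed copy from HDFS would go unnoticed until the word cloud looks wrong. As a sketch of one possible alternative, the helper below runs each command as an argument list and stops on the first failure. The names `runOrFail` and `regenerateCloud` are hypothetical. The commands are the same ones used in the cell above.

```scala
import scala.sys.process._

// Run a command given as separate arguments and fail on a non-zero exit code.
def runOrFail(cmd: Seq[String]): Unit = {
  val exitCode = cmd.!
  require(exitCode == 0, s"Command failed with exit code $exitCode: ${cmd.mkString(" ")}")
}

// The same three steps as in the cell above, stopping at the first failure.
def regenerateCloud(): Unit = {
  runOrFail(Seq("rm", "-rf", "/tmp/ner"))
  runOrFail(Seq("hdfs", "dfs", "-copyToLocal", "ner", "/tmp/ner"))
  runOrFail(Seq("./cloud.sh"))
}
```

Calling `regenerateCloud()` in a notebook cell would then raise an exception if any step fails, instead of silently continuing.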