bloom-filters-language models

Calculate n-grams from a list of Wikipedia abstracts

Autism    Autism is a disorder of neural development characterized by impaired social interaction and communication,
and by restricted and repetitive behavior. The diagnostic criteria require that symptoms become apparent before a 
child is three years old. Autism affects information processing in the brain by altering how nerve cells and their 
synapses connect and organize; how this occurs is not well understood.

Achilles  In Greek mythology, Achilles was a Greek hero of the Trojan War, the central character and the greatest
warrior of Homer's Iliad. Plato named Achilles the most handsome of the heroes assembled against Troy. Later 
legends (beginning with a poem by Statius in the 1st century AD) state that Achilles was invulnerable in all of 
his body except for his heel. As he died because of a small wound on his heel, the term Achilles' heel has come to 
mean a person's principal weakness.

Results for n = 3

t,t,t  153925
world,war,ii	52033
administrative,district,gmina	50100
village,administrative,district	46547
u00a0,u00a0,u00a0	43681
new,york,city	41405
poland,lies,approximately	34257
register,historic,places	34149
national,register,historic	34113
during,world,war	30734
2000,socorro,linear	30548
new,south,wales	23063
2001,socorro,linear	22489
world,war,i	21159
united,states,population	20912
village,has,population	20767
major,league,baseball	20070
best,known,his	18904
he,has,been	18437
until,his,death	18364

Create a language model using bloom filters

Bases on the algorithm described on Smoothed Bloom ﬁlter language models: Tera-Scale LMs on the Cheap, David Talbot and Miles Osborne, http://acl.ldc.upenn.edu/D/D07/D07-1049.pdf

Create language model

hadoop jar target/mapreduce-1.0.0-SNAPSHOT.jar mx.itam.metodos.bf.BFLMDriver -libjars guava-13.0.1.jar \
medium.txt-out-5-true N 170490 0.03 medium.txt-lm

Test locally:

hadoop dfs -get medium.txt-lm .
java  mx.itam.metodos.bf.BFTool medium.txt-lm 170490 0.03 4

Process on elastic-map-reduce

Generate an AWS credentials pair. Download the elastic-mapreduce scripts from http://aws.amazon.com/developertools/2264 and configure the credentials.

  {
    "access-id": "AAAAAAAAAAAAAAAAAAAA",
    "private-key": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
    "region": "us-east-1",
    "key-pair":      "keypair",
    "key-pair-file": "/Users/user/.ssh/keypair.pem",
    "log-uri": "s3n://[NOMBRE_DE_UN_BUCKET]/logs"
  }

Calculate 4-grams for wikipedia articles and use `Pig` to analize the results:

Crear un bucket utilizando la consola de S3 y reemplazar mi-bucket con el nombre del bucket recien creado.
Crear un job flow en EMR

elastic-mapreduce --create --alive --name "Contar NGramas" --hadoop-version 1.0.3  --ami-version 2.2 \
--num-instances 4 --instance-type m1.medium --pig-interactive --pig-versions latest

Calcular los 4-gramas utilizando un Analyzer

elastic-mapreduce --jar s3n://metodos/mapreduce-1.0.0-SNAPSHOT.jar --main-class mx.itam.metodos.mr.CountNGrams \
--arg -libjars --arg s3n://metodos/guava-13.0.1.jar,s3n://metodos/lucene-analyzers-common-4.1.0.jar,s3n://metodos/lucene-core-4.1.0.jar \
--args s3n://metodos/long_abstracts_en.txt,hdfs:///long_abstracts_en.txt-out,4,true \
--enable-debugging --jobflow j-AAAAAAAAAAAAA

Utilizar Pig para ordenar los resultados de forma descendente y sleccionar los 1000 más altos


elastic-mapreduce --pig-script s3n://metodos/get-top-k.pig \
--args "-p,INPUT=hdfs:///long_abstracts_en.txt-out-4/,-p,OUTPUT=hdfs:///long_abstracts_en.txt-top-4/,-p,K=1000" \
--enable-debugging --jobflow j-AAAAAAAAAAAAA

Copiar los resultados a S3

elastic-mapreduce --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
--args '--dest,s3://mi-bucket/long_abstracts_en.txt-top-4/,--src,hdfs:///long_abstracts_en.txt-top-4' \
--jobflow j-AAAAAAAAAAAAA

Opciones comunes para configurar Hadoop localmente:

$HADOOP_HOME/conf/mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

$HADOOP_HOME/conf/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/var/data/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/var/data/hadoop/dfs/data</value>
  </property>
</configuration>

$HADOOP_HOME/conf/core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/data/hadoop/</value>
  </property>
</configuration>

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
.settings		.settings
data		data
src/main		src/main
.classpath		.classpath
.gitignore		.gitignore
.project		.project
LICENSE		LICENSE
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.settings

.settings

data

data

src/main

src/main

.classpath

.classpath

.gitignore

.gitignore

.project

.project

LICENSE

LICENSE

README.md

README.md

pom.xml

pom.xml

Repository files navigation

bloom-filters-language models

Calculate n-grams from a list of Wikipedia abstracts

Results for n = 3

Create a language model using bloom filters

Process on elastic-map-reduce

Calculate 4-grams for wikipedia articles and use `Pig` to analize the results:

Opciones comunes para configurar Hadoop localmente:

$HADOOP_HOME/conf/mapred-site.xml

$HADOOP_HOME/conf/hdfs-site.xml

$HADOOP_HOME/conf/core-site.xml

About

Releases

Packages

Languages

License

elmer-garduno/bloom-filters-language-models

Folders and files

Latest commit

History

Repository files navigation

bloom-filters-language models

Calculate n-grams from a list of Wikipedia abstracts

Results for n = 3

Create a language model using bloom filters

Process on elastic-map-reduce

Calculate 4-grams for wikipedia articles and use Pig to analize the results:

Opciones comunes para configurar Hadoop localmente:

$HADOOP_HOME/conf/mapred-site.xml

$HADOOP_HOME/conf/hdfs-site.xml

$HADOOP_HOME/conf/core-site.xml

About

Resources

License

Stars

Watchers

Forks

Languages

Calculate 4-grams for wikipedia articles and use `Pig` to analize the results: