first commit

citiususc · May 24, 2016 · b357daa · b357daa
1 parent 9370a08
commit b357daa
Show file tree

Hide file tree

Showing 227 changed files with 6,712,897 additions and 4 deletions.
diff --git a/LICENSE b/LICENSE
@@ -1,4 +1,4 @@
-                    GNU GENERAL PUBLIC LICENSE
+                     GNU GENERAL PUBLIC LICENSE
                        Version 3, 29 June 2007
 
  Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
@@ -631,8 +631,8 @@ to attach them to the start of each source file to most effectively
 state the exclusion of warranty; and each file should have at least
 the "copyright" line and a pointer to where the full notice is found.
 
-    {one line to give the program's name and a brief idea of what it does.}
-    Copyright (C) {year}  {name of author}
+    <one line to give the program's name and a brief idea of what it does.>
+    Copyright (C) <year>  <name of author>
 
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU General Public License as published by
@@ -652,7 +652,7 @@ Also add information on how to contact you by electronic and paper mail.
   If the program does terminal interaction, make it output a short
 notice like this when it starts in an interactive mode:
 
-    {project}  Copyright (C) {year}  {fullname}
+    <program>  Copyright (C) <year>  <name of author>
     This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
     This is free software, and you are welcome to redistribute it
     under certain conditions; type `show c' for details.

diff --git a/README b/README
@@ -0,0 +1,164 @@
+LINGUAKIT, a linguistic tool containing:
+ - dependency parser (DepPattern)
+ - PoS tagger
+ - NER (named entity recognition)
+ - NEC (named entity classification)
+ - Sentiment analysis
+
+#######################################
+Author: 
+Grupo ProLNat@GE, CiTIUS
+University of Santiago de Compostela
+Galiza, Spain
+######################################
+
+DESCRIPTION
+The command 'linguakit' is able to process 4 languages (Portuguese, English, Spanish, Galician), and allows you to run the following linguistic tools:
+
+* Dependency parser (flag "dep"): It runs 4 default parsers for 4 languages: English, Spanish, Galician, and Portuguese. The parsers were implemented in PERL and are stored in the 'parsers' file. The parsers were compiled from formal grammars (more information in https://github.com/gamallo/DepPattern). The parsers are provided with several types of output: basic triplets (-a), triplets with morphological information (-fa), the same output as the input (-c) for correction purpose, and CoNLL format (-conll). 
+
+* PoS tagger (flag "tagger"): the software also provides the PoS tagger 'CitiusTools". It is provided with two submodules: NER (-ner) and NEC (-nec). The NEC module returns semantic tags for named entities: NP0SP00 (Person), NP00G00 (Location), NP00O00 (Organization), NP00V00 (Miscelaneous)
+
+* Multiword extraction (flag "mwe"): It extracts multiwords from PoS tagged text. There are several options, each one being a specific lexical association measure for ranking the candidate terms: chi square ("chi", by default measure), loglikelihood ("log"), mutual information ("mi"),  symmetrical conditional probability ("scp"), simple co-occurrences ("cooc").
+
+* Sentiment analysis (flag "sent"): It returns POSITIVE, NONE or NEGATIVE, using a polarity lexicon and a classifier trained from annotated tweets. The input should be a sentence or a small paragraph.
+
+* Language recognition (flag "recog"): It returns the language of the input text: en, es, pt, gl, gz (agal galician variety), fr, eu, ca, bn (bengali), ur (urdu), hi (hindi), ta (tamil). This module is also used to recognize the language of a text before being processed by another module (only for the four languages that can be processed: pt, en, es, gl).
+
+REQUIREMENTS
+GNU/LINUX (bash + perl)
+'Storable' Perl module. To install, you may use CPAN: 
+	   cpan>install Storable
+
+HOW TO INSTALL
+
+(1)
+> git clone https://github.com/gamallo/CitiusLinguakit.git
+(2)
+> sh install-linguakit.sh
+
+Pay attention: do not install the package in a directory whose name contain blank spaces!
+
+HOW TO USE
+ linguakit <lang> <module> <file> [options]
+
+      language = gl, es, en, pt, none
+      module = dep, tagger, mwe, recog, sent
+      file = path of the file input 
+
+      Available command-line options:
+
+      -a       by default output of the parser: simple dependency analysis (only with 'dep' module)
+      -fa      parser output: full dependency analysis  (only with 'dep' module)
+      -c       parser output: correct tagged text (only with 'dep' module)
+      -conll   parser output: CoNLL style (only with 'dep' module)
+      -noner   by default PoS tagger output: no NER or NEC is processed (only with 'tagger' module)
+      -ner     PoS tagger with Named Entity Recognition - NER (only with 'tagger' module)
+      -nec     PoS tagger with Named Entity Classification - NEC (only with 'tagger' module)
+      -chi     by default co-occurrence measure: chi-square (only with 'mwe' module)
+      -log     co-occurrence measure: loglikelihood (only with 'mwe' module)
+      -scp     co-occurrence measure: symmetrical conditional probability  (only with 'mwe' module)
+      -mi      co-occurrence measure: mutual information  (only with 'mwe' module)
+      -cooc    co-occurrence measure: co-occurrence counting  (only with 'mwe' module)
+      -s       the input <file> is just a string (only with both 'sent' and 'recog' modules)
+
+
+EXEMPLES OF USE
+
+> ./linguakit pt dep tests/pt.txt -conll
+(this returns a dependency-based analysis in CoNLL format)
+
+> ./linguakit en tagger tests/en.txt -nec
+(this returns the PoS tags with NEC information for named entities)
+
+> ./linguakit en sent "I don't like the film" -s
+(this returns a sentiment value)  
+
+> ./linguakit none mw tests/pt.txt -mi
+(this identifies the language of the input text and then makes multiword extraction ranked with mutual information)
+
+INPUT FILE
+
+The input must be in plain text format, and codified in UTF8.
+
+
+LEXICONS
+
+Lexicons (electronic dictionaries) are in tagger/$lang/lexicon/dicc.src files (where $lang is en, es, pt, gl). If you modify them, then you should compile them as follows:
+>sh lexicon_compiler.sh
+(Remember you need the 'Storalbe' Perl package)
+
+
+OUTPUT FORMAT OF THE DEPENDENCY PARSER 
+
+* Option -a means that the dp.sh generates a file with a dependency-based analysis. Each analysed sentence consists of two elements:
+
+1. a line containing the POS tagged lemmas of the sentence. This line begins with the tag SENT. The set of tags used here are listed in file TagSet.txt. All lemmas are identified by means of a position number from 1 to N, where N is the size of the sentence.
+
+2. All dependency triplets identified by the grammar. A triplet consists of:
+
+(relation;head_lemma;dependent_lemma)
+
+For instance, the sentence "I am a man." generates the following output:
+
+################
+SENT::<I_PRO_0_<number:0|lemma:I|possessor:0|case:0|genre:0|person:0|politeness:0|type:P|token:I|> am_VERB_1_<number:0|mode:0|lemma:be|genre:0|tense:0|person:0|type:S|token:am|> a_DT_2_<number:0|lemma:a|possessor:0|genre:0|person:0|type:0|token:a|> man_NOUN_3_<number:S|lemma:man|genre:0|person:3|type:C|token:man|> ._SENT>
+(Lobj;be_VERBF_1;I_PN_0)
+(Spec;man_NOM_3;a_DT_2)
+(Robj;be_VERBF_1;man_NOM_3)
+################
+
+* Option -fa gives rise to a full represention of the depedency-based analysis. Each triplet is associated with two pieces of information: morpho-syntactic features of both the head and the dependent. 
+
+* Option -c allows dp.sh to generate a file with the same input (a tagged text) but with some corrections proposed by the grammar. This option is useful to identify and correct regular errors of PoS taggers using grammatical rules. 
+
+* Option -conll  gets an output file with the format defined by CoNLL-X, inspired by Lin (1998). This format was adopted by the evaluation tasks defined in CoNLL.
+
+For more information: http://gramatica.usc.es/pln/tools/deppattern.html
+
+OUTPUT FORMAT AND TAGSET OF THE POS TAGGER
+We follow the EAGLES convention:
+https://talp-upc.gitbooks.io/freeling-user-manual/content/tagsets.html	
+
+For more information on our PoS tagger and NERC: http://gramatica.usc.es/pln/tools/CitiusTools.html
+
+SENTIMENT ANALYISIS
+The input can be either a file (by default) or a string (option -e). The output is POSITIVE, NONE, OR NEGATIVE, and a score between 0 and 1. 
+The classifier was trained with tweets, so the input should be just one sentence or a small paragraph. 
+
+For more information: http://gramatica.usc.es/pln/tools/CitiusSentiment.html
+
+LANGUAGE IDENTIFICATION:
+It returns the language of the input text: en, es, pt, gl, gz (agal galician variety), fr, eu, ca, bn (bengali), ur (urdu), hi (hindi), ta (tamil). This module is also used to recognize the language of a text before being processed by another module (only for the four languages that can be processed: pt, en, es, gl).
+
+For more information: http://gramatica.usc.es/~gamallo/quelingua/QueLingua.htm
+
+####################
+More information on the modules can be found in papers you'll find in directory ".docs".
+
+References:
+
+* Dependency analysis:
+
+Gamallo P. , González I. (2011) A Grammatical Formalism Based on Patterns of Part-of-Speech Tags , International Journal of Corpus Linguistics , 16(1), 45-71. ISNN:1384-6655 
+
+Gamallo, P. 2015. Dependency Parsing with Compression Rules, The 14th International Conference on Parsing Technologies (IWPT-2015) p. 107-117, Bilbao. ISBN 978-1-941643-98-3 
+
+Gamallo, P., González, I. 2012. DepPattern: A Multilingual Dependency Parser, Demo Session of the International Conference on Computational Processing of the Portuguese Language (PROPOR 2012) , April 17-20, Coimbra, Portugal. 
+
+* PoS tagging and NEC:
+
+Garcia, M. and Gamallo, P. 2015. Yet another suite of multilingual NLP tools, Symposium on Languages, Applications and Technologies (SLATE 2015) p. 81-90. ISBN 978-84-606-8762-7 
+
+Abuín, José Manuel, Juan Carlos Pichel, Tomás Fernández Pena, Pablo Gamallo e Marcos Garcia (2014). Perldoop: Efficient Execution of Perl Scripts on Hadoop Clusters, IEEE International Conference on Big Data (IEEE Big Data 2014).
+
+Gamallo P., Garcia, M. (2011) A Resource-Based Method for Named Entity Extraction and Classification , Lecture Notes in Computer Science, vol. 7026 , Springer-Verlag, 610-623. ISNN: 0302-9743 
+
+* Sentiment analysis:
+
+Gamallo, P. and Garcia, M. 2014. Citius: A Naive-Bayes Strategy for Sentiment Analysis on English Tweets, In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), Dublin: 171-175. 
+
+Gamallo, P., Garcia, M. and Fernández-Lanza, S. (2013). TASS: A Naive-Bayes strategy for sentiment analysis on Spanish tweets, Proceedings of XXIX Congreso de la Sociedad Española de Procesamiento de lenguaje natural. Workshop on Sentiment Analysis at SEPLN (TASS2013), Madrid. pp. 126-132. ISBN: 978-84-695-8349-4. (FIRST system in the task of polarity detection at the entity level)
+
+* Multiword extraction:
+Barcala M., E. Domínguez-Noya, P. Gamallo, M.López, E. Moscoso, G. Rojo, P. Santalla, S. Sotelo. (2007) A Corpus and Lexical Resources for Multi-word Terminology Extraction in the Field of Economy, 3rd Language & Technology Conference(LeTC'2007), Poznan, Poland (355-359). 
diff --git a/docs/dependencies/Dependency_Analysis_IWPT_2015.pdf b/docs/dependencies/Dependency_Analysis_IWPT_2015.pdf
diff --git a/docs/dependencies/Dependency_Analysis_Journal_Corpus_Linguistics.pdf b/docs/dependencies/Dependency_Analysis_Journal_Corpus_Linguistics.pdf
diff --git a/docs/mwe/term_extraction_2007.pdf b/docs/mwe/term_extraction_2007.pdf
diff --git a/docs/sentiment/Sentiment_SemEval_2014.pdf b/docs/sentiment/Sentiment_SemEval_2014.pdf
diff --git a/docs/sentiment/Sentiment_TASS_2013.pdf b/docs/sentiment/Sentiment_TASS_2013.pdf
diff --git a/docs/tagging_nec/NEC_EPIA_2011.pdf b/docs/tagging_nec/NEC_EPIA_2011.pdf
diff --git a/docs/tagging_nec/Tagger_SEPLN_2014.pdf b/docs/tagging_nec/Tagger_SEPLN_2014.pdf
diff --git a/docs/tagging_nec/Tagger_SLATE_2015.pdf b/docs/tagging_nec/Tagger_SLATE_2015.pdf
diff --git a/install-linguakit.sh b/install-linguakit.sh
@@ -0,0 +1,45 @@
+#!/bin/sh
+
+
+for file in linguakit*
+do
+    awk '$0=="MAIN_DIR=\"./CitiusLinguakit\""{print "MAIN_DIR=\"'`pwd`'\"";next}\
+        {print}' $file > $file.tmp;
+    mv $file.tmp $file;
+done
+
+
+echo 'Path variables modified in tagging scripts.'
+
+chmod 0755 linguakit
+chmod 0755 scripts/*
+chmod 0755 parsers/*
+chmod 0755 tagger/en/*
+chmod 0755 tagger/es/*
+chmod 0755 tagger/pt/*
+chmod 0755 tagger/gl/*
+chmod 0755 sentiment/en/*
+chmod 0755 sentiment/es/*
+chmod 0755 sentiment/pt/*
+chmod 0755 sentiment/gl/*
+
+echo "Permissions of execution, done!"
+
+echo "Compiling lexicons"
+echo "english lexicon"
+./tagger/en/store_lex.perl
+echo "portuguese lexicon"
+./tagger/pt/store_lex.perl
+echo "spanish lexicon"
+./tagger/es/store_lex.perl
+echo "galician lexicon"
+./tagger/gl/store_lex.perl
+./tagger/gl/store_split.perl
+
+echo "Warning: if the lexicon compilation has returned error messages, please check whether the Perl module 'Storable' is installed."
+echo "To install this Perl module, you may use CPAN:"
+echo "          cpan> install Storable"
+
+
+echo ''
+echo "Installation done!"
diff --git a/lanrecog/lanrecog.perl b/lanrecog/lanrecog.perl
@@ -0,0 +1,108 @@
+#!/usr/bin/perl
+
+#IDENTIFICA SE UM TEXTO E ESPANHOL OU GALEGO 
+
+#le um ficheiro a identificar (pipe) que foi previamente tokenizado
+#le um ficheiro com todos os lexicons disponiveis. O formato é "token ling"
+
+binmode STDIN, ':utf8';
+binmode STDOUT, ':utf8';
+
+
+my $lexicon = shift(@ARGV);
+open (L, $lexicon) or die "O ficheiro não pode ser aberto: $!\n";
+
+my $suffix = shift(@ARGV);
+open (S, $suffix) or die "O ficheiro não pode ser aberto: $!\n";
+
+my $ling_def="en";
+
+my $Separador = "[\.\,\;\:\«\»\"\&\%\+\=\$\#\(\)\<\>\!\¡\?\¿\\[\\]]" ;
+
+my $i=1;
+my $term="";
+my $suffix="";
+my $ling="";
+my %Rank;
+my %Lex;
+my %Peso;
+my %Suffix;
+
+while (my $line = <L>) {
+
+   chomp $line;
+  ($term, $ling) = split ("\t", $line);
+
+   if (!defined $Lex{$ling}) {
+       $i=1;
+   }
+   $Rank{$ling}{$term} = $i;
+   $Lex{$ling}{$term} = $term;
+   $i++;
+  #print STDERR "#$term# #$ling#\n";
+}
+
+
+while (my $line = <S>) {
+   chomp $line;
+  ($suffix, $ling) = split ("\t", $line);
+
+   $Suffix{$ling}{$suffix}++;
+  # $Lex{$ling}{$suffix} = $term;
+  #print STDERR "#$suffix# #$ling#\n";
+}
+
+my $found=0;
+while (my $line = <STDIN>) {
+   chomp $line;
+   my $token = $line;
+   ##change uppercase to lowercase:
+   $token = lc ($token);
+   if ($token !~ /$Separador/) {
+       foreach $ling (keys %Lex) {
+        #if ($Lex{$ling}{$token} =~ /^$token$/i) {
+         if (defined $Lex{$ling}{$token}) {
+           $Peso{$ling} += $i - $Rank{$ling}{$token} ;
+           # print STDERR "lex: #$ling# :: #$token# #$Peso{$ling}# #$i# # $Rank{$ling}{$token} # \n";
+           $found=1;
+         } 
+
+         else {
+
+	   foreach $s (keys %{$Suffix{$ling}}) {
+                 #print STDERR "lex: #$ling# :: #$token# #$Peso{$ling}# #$s# \n";
+                 if ($token =~ /$s$/) {
+                     $Peso{$ling} += $i - ($i/2) ;
+                     #print STDERR "lex: #$ling# :: #$token# #$Peso{$ling}# #$i# \n";
+
+		 }
+	   }
+	 }
+      }
+    }
+
+}
+
+#print STDERR "esp = $esp || gal = $gal\n";
+
+##default:
+if (!$found){
+  print $ling_def 
+}
+else {
+ my $First=0;
+ foreach $ling (sort {$Peso{$b} <=>
+                      $Peso{$a} }
+                      keys %Peso ) {
+    if (!$First) {
+      print $ling;
+      $First=1;
+
+    }
+ }
+}
+
+
+
+
+
diff --git a/lanrecog/lexicons/.es.lx.swp b/lanrecog/lexicons/.es.lx.swp