first commit
gamallo committed May 24, 2016
1 parent 9370a08 commit b357daa
Showing 227 changed files with 6,712,897 additions and 4 deletions.
8 changes: 4 additions & 4 deletions LICENSE
@@ -1,4 +1,4 @@
GNU GENERAL PUBLIC LICENSE
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007

Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
@@ -631,8 +631,8 @@ to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.

{one line to give the program's name and a brief idea of what it does.}
Copyright (C) {year} {name of author}
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@@ -652,7 +652,7 @@ Also add information on how to contact you by electronic and paper mail.
If the program does terminal interaction, make it output a short
notice like this when it starts in an interactive mode:

{project} Copyright (C) {year} {fullname}
<program> Copyright (C) <year> <name of author>
This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
164 changes: 164 additions & 0 deletions README
@@ -0,0 +1,164 @@
LINGUAKIT, a linguistic tool containing:
- dependency parser (DepPattern)
- PoS tagger
- NER (named entity recognition)
- NEC (named entity classification)
- Sentiment analysis

#######################################
Author:
Grupo ProLNat@GE, CiTIUS
University of Santiago de Compostela
Galiza, Spain
######################################

DESCRIPTION
The command 'linguakit' is able to process 4 languages (Portuguese, English, Spanish, Galician), and allows you to run the following linguistic tools:

* Dependency parser (flag "dep"): It runs the default parsers for the 4 supported languages: English, Spanish, Galician, and Portuguese. The parsers are implemented in Perl and stored in the 'parsers' directory. They were compiled from formal grammars (more information at https://github.com/gamallo/DepPattern). Several output types are available: basic triplets (-a), triplets with morphological information (-fa), the input text with tagging corrections (-c), and CoNLL format (-conll).

* PoS tagger (flag "tagger"): The software also provides the PoS tagger 'CitiusTools'. It comes with two submodules: NER (-ner) and NEC (-nec). The NEC module returns semantic tags for named entities: NP00SP0 (Person), NP00G00 (Location), NP00O00 (Organization), NP00V00 (Miscellaneous).

* Multiword extraction (flag "mwe"): It extracts multiword expressions from PoS-tagged text. Each option selects a lexical association measure for ranking the candidate terms: chi-square ("chi", the default measure), log-likelihood ("log"), mutual information ("mi"), symmetrical conditional probability ("scp"), or simple co-occurrence counts ("cooc").

* Sentiment analysis (flag "sent"): It returns POSITIVE, NONE or NEGATIVE, using a polarity lexicon and a classifier trained from annotated tweets. The input should be a sentence or a small paragraph.

* Language recognition (flag "recog"): It returns the language of the input text: en, es, pt, gl, gz (AGAL Galician variety), fr, eu, ca, bn (Bengali), ur (Urdu), hi (Hindi), ta (Tamil). This module is also used to recognize the language of a text before it is processed by another module (only for the four languages that can be processed: pt, en, es, gl).
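
As an illustration of what one of the association measures used by the multiword extraction module computes, the sketch below implements the textbook pointwise mutual information formula, log2(N * c(xy) / (c(x) * c(y))), for a candidate two-word term. The function name `pmi` and its argument order are assumptions made for this example; the scoring code inside linguakit's 'mwe' module may differ in details such as smoothing or normalisation.

```shell
#!/bin/sh
# Hypothetical helper (not part of linguakit): pointwise mutual information
# of a word pair, computed from raw corpus counts.
# usage: pmi <count_xy> <count_x> <count_y> <total_pairs>
pmi() {
    awk -v cxy="$1" -v cx="$2" -v cy="$3" -v n="$4" 'BEGIN {
        # All counts must be positive, otherwise the logarithm is undefined.
        if (cxy <= 0 || cx <= 0 || cy <= 0 || n <= 0) { print "nan"; exit 1 }
        # awk only offers the natural logarithm, so divide by log(2).
        printf "%.4f\n", log((n * cxy) / (cx * cy)) / log(2)
    }'
}
```

A pair that co-occurs more often than chance predicts gets a positive score, and a pair whose words are statistically independent gets a score of 0.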

REQUIREMENTS
GNU/LINUX (bash + perl)
'Storable' Perl module. To install, you may use CPAN:
cpan> install Storable

HOW TO INSTALL

(1)
> git clone https://github.com/gamallo/CitiusLinguakit.git
(2)
> sh install-linguakit.sh

Note: do not install the package in a directory whose name contains blank spaces!

HOW TO USE
linguakit <lang> <module> <file> [options]

language = gl, es, en, pt, none
module = dep, tagger, mwe, recog, sent
file = path of the input file

Available command-line options:

-a default parser output: simple dependency analysis (only with 'dep' module)
-fa parser output: full dependency analysis (only with 'dep' module)
-c parser output: corrected tagged text (only with 'dep' module)
-conll parser output: CoNLL style (only with 'dep' module)
-noner default PoS tagger output: neither NER nor NEC is applied (only with 'tagger' module)
-ner PoS tagger with Named Entity Recognition - NER (only with 'tagger' module)
-nec PoS tagger with Named Entity Classification - NEC (only with 'tagger' module)
-chi default co-occurrence measure: chi-square (only with 'mwe' module)
-log co-occurrence measure: log-likelihood (only with 'mwe' module)
-scp co-occurrence measure: symmetrical conditional probability (only with 'mwe' module)
-mi co-occurrence measure: mutual information (only with 'mwe' module)
-cooc co-occurrence measure: co-occurrence counting (only with 'mwe' module)
-s the <file> argument is a literal string rather than a path (only with the 'sent' and 'recog' modules)


EXAMPLES OF USE

> ./linguakit pt dep tests/pt.txt -conll
(this returns a dependency-based analysis in CoNLL format)

> ./linguakit en tagger tests/en.txt -nec
(this returns the PoS tags with NEC information for named entities)

> ./linguakit en sent "I don't like the film" -s
(this returns a sentiment value)

> ./linguakit none mwe tests/pt.txt -mi
(this identifies the language of the input text and then extracts multiwords ranked by mutual information)

INPUT FILE

The input must be plain text encoded in UTF-8.


LEXICONS

Lexicons (electronic dictionaries) are in tagger/$lang/lexicon/dicc.src files (where $lang is en, es, pt, gl). If you modify them, then you should compile them as follows:
>sh lexicon_compiler.sh
(Remember that you need the 'Storable' Perl module)


OUTPUT FORMAT OF THE DEPENDENCY PARSER

* Option -a makes dp.sh generate a file with a dependency-based analysis. Each analysed sentence consists of two elements:

1. A line containing the PoS-tagged lemmas of the sentence. This line begins with the tag SENT. The set of tags used here is listed in the file TagSet.txt. Each lemma is identified by a position number from 1 to N, where N is the size of the sentence.

2. All dependency triplets identified by the grammar. A triplet consists of:

(relation;head_lemma;dependent_lemma)

For instance, the sentence "I am a man." generates the following output:

################
SENT::<I_PRO_0_<number:0|lemma:I|possessor:0|case:0|genre:0|person:0|politeness:0|type:P|token:I|> am_VERB_1_<number:0|mode:0|lemma:be|genre:0|tense:0|person:0|type:S|token:am|> a_DT_2_<number:0|lemma:a|possessor:0|genre:0|person:0|type:0|token:a|> man_NOUN_3_<number:S|lemma:man|genre:0|person:3|type:C|token:man|> ._SENT>
(Lobj;be_VERBF_1;I_PN_0)
(Spec;man_NOM_3;a_DT_2)
(Robj;be_VERBF_1;man_NOM_3)
################

* Option -fa produces a full representation of the dependency-based analysis. Each triplet is associated with two pieces of information: the morpho-syntactic features of the head and those of the dependent.

* Option -c allows dp.sh to generate a file with the same input (a tagged text) but with some corrections proposed by the grammar. This option is useful to identify and correct regular errors of PoS taggers using grammatical rules.

* Option -conll gets an output file with the format defined by CoNLL-X, inspired by Lin (1998). This format was adopted by the evaluation tasks defined in CoNLL.
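
The triplet lines produced by the default -a option, in the (relation;head_lemma;dependent_lemma) form described above, are easy to post-process with standard tools. The following sketch converts them to tab-separated rows and skips the SENT and separator lines. `triplets_to_tsv` is a hypothetical helper, not part of linguakit, and it assumes that lemmas never contain a ';' character.

```shell
#!/bin/sh
# Hypothetical helper (not part of linguakit): read "-a" parser output on
# stdin and print one "relation<TAB>head<TAB>dependent" row per triplet.
triplets_to_tsv() {
    awk '
        # Triplet lines are fully enclosed in parentheses; SENT:: lines
        # and "####" separators do not match and are ignored.
        /^\(.*\)$/ {
            line = substr($0, 2, length($0) - 2)   # drop "(" and ")"
            n = split(line, part, ";")
            if (n == 3) print part[1] "\t" part[2] "\t" part[3]
        }
    '
}
```

Assuming the parser writes its analysis to standard output, this could be used as `./linguakit en dep tests/en.txt -a | triplets_to_tsv`.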

For more information: http://gramatica.usc.es/pln/tools/deppattern.html

OUTPUT FORMAT AND TAGSET OF THE POS TAGGER
We follow the EAGLES convention:
https://talp-upc.gitbooks.io/freeling-user-manual/content/tagsets.html

For more information on our PoS tagger and NERC: http://gramatica.usc.es/pln/tools/CitiusTools.html

SENTIMENT ANALYSIS
The input can be either a file (by default) or a string (option -s). The output is POSITIVE, NONE, or NEGATIVE, together with a score between 0 and 1.
The classifier was trained with tweets, so the input should be just one sentence or a small paragraph.

For more information: http://gramatica.usc.es/pln/tools/CitiusSentiment.html

LANGUAGE IDENTIFICATION:
It returns the language of the input text: en, es, pt, gl, gz (AGAL Galician variety), fr, eu, ca, bn (Bengali), ur (Urdu), hi (Hindi), ta (Tamil). This module is also used to recognize the language of a text before it is processed by another module (only for the four languages that can be processed: pt, en, es, gl).

For more information: http://gramatica.usc.es/~gamallo/quelingua/QueLingua.htm
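
The lanrecog/lanrecog.perl script in this repository scores each language by looking up every input token in a ranked per-language lexicon, giving higher weight to higher-ranked words, and falls back to "en" when nothing matches. The sketch below illustrates that idea in a simplified form: it weights a match by its rank within its own language's lexicon and leaves out the suffix-based scoring, so it is not a reimplementation of the original. The name `guess_lang` and the two-column "token<TAB>lang" lexicon file are assumptions for this example.

```shell
#!/bin/sh
# Hypothetical helper (simplified sketch, not the lanrecog.perl algorithm):
# usage: guess_lang <lexicon.tsv> < tokens_one_per_line
# The lexicon must be non-empty, with "token<TAB>lang" lines in rank order.
guess_lang() {
    awk -F '\t' '
        NR == FNR {                      # first input: the ranked lexicon
            rank[$2, $1] = ++size[$2]    # rank of the token within its language
            next
        }
        {                                # second input (stdin): one token per line
            tok = tolower($0)
            for (lang in size)
                if ((lang, tok) in rank)
                    # Higher-ranked (more frequent) words contribute more.
                    score[lang] += size[lang] + 1 - rank[lang, tok]
        }
        END {
            best = "en"; top = 0         # default language when nothing matches
            for (lang in score)
                if (score[lang] > top) { top = score[lang]; best = lang }
            print best
        }
    ' "$1" -
}
```

Ties between languages are resolved arbitrarily in this sketch, which is acceptable for illustration but would need a deterministic rule in practice.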

####################
More information on the modules can be found in the papers in the "docs" directory.

References:

* Dependency analysis:

Gamallo, P., González, I. (2011) A Grammatical Formalism Based on Patterns of Part-of-Speech Tags, International Journal of Corpus Linguistics, 16(1), 45-71. ISSN: 1384-6655

Gamallo, P. 2015. Dependency Parsing with Compression Rules, The 14th International Conference on Parsing Technologies (IWPT-2015) p. 107-117, Bilbao. ISBN 978-1-941643-98-3

Gamallo, P., González, I. 2012. DepPattern: A Multilingual Dependency Parser, Demo Session of the International Conference on Computational Processing of the Portuguese Language (PROPOR 2012) , April 17-20, Coimbra, Portugal.

* PoS tagging and NEC:

Garcia, M. and Gamallo, P. 2015. Yet another suite of multilingual NLP tools, Symposium on Languages, Applications and Technologies (SLATE 2015) p. 81-90. ISBN 978-84-606-8762-7

Abuín, José Manuel, Juan Carlos Pichel, Tomás Fernández Pena, Pablo Gamallo and Marcos Garcia (2014). Perldoop: Efficient Execution of Perl Scripts on Hadoop Clusters, IEEE International Conference on Big Data (IEEE Big Data 2014).

Gamallo, P., Garcia, M. (2011) A Resource-Based Method for Named Entity Extraction and Classification, Lecture Notes in Computer Science, vol. 7026, Springer-Verlag, 610-623. ISSN: 0302-9743

* Sentiment analysis:

Gamallo, P. and Garcia, M. 2014. Citius: A Naive-Bayes Strategy for Sentiment Analysis on English Tweets, In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), Dublin: 171-175.

Gamallo, P., Garcia, M. and Fernández-Lanza, S. (2013). TASS: A Naive-Bayes strategy for sentiment analysis on Spanish tweets, Proceedings of XXIX Congreso de la Sociedad Española de Procesamiento de lenguaje natural. Workshop on Sentiment Analysis at SEPLN (TASS2013), Madrid. pp. 126-132. ISBN: 978-84-695-8349-4. (FIRST system in the task of polarity detection at the entity level)

* Multiword extraction:
Barcala, M., E. Domínguez-Noya, P. Gamallo, M. López, E. Moscoso, G. Rojo, P. Santalla, S. Sotelo. (2007) A Corpus and Lexical Resources for Multi-word Terminology Extraction in the Field of Economy, 3rd Language & Technology Conference (LeTC'2007), Poznan, Poland (355-359).
Binary file not shown.
Binary file not shown.
Binary file added docs/mwe/term_extraction_2007.pdf
Binary file not shown.
Binary file added docs/sentiment/Sentiment_SemEval_2014.pdf
Binary file not shown.
Binary file added docs/sentiment/Sentiment_TASS_2013.pdf
Binary file not shown.
Binary file added docs/tagging_nec/NEC_EPIA_2011.pdf
Binary file not shown.
Binary file added docs/tagging_nec/Tagger_SEPLN_2014.pdf
Binary file not shown.
Binary file added docs/tagging_nec/Tagger_SLATE_2015.pdf
Binary file not shown.
45 changes: 45 additions & 0 deletions install-linguakit.sh
@@ -0,0 +1,45 @@
#!/bin/sh


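# Point MAIN_DIR in every linguakit* script at the current directory.
# Variables and the pwd substitution are quoted to guard against word splitting.
for file in linguakit*
do
awk '$0=="MAIN_DIR=\"./CitiusLinguakit\""{print "MAIN_DIR=\"'"`pwd`"'\"";next}\
{print}' "$file" > "$file.tmp";
mv "$file.tmp" "$file";
done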


echo 'Path variables modified in tagging scripts.'

chmod 0755 linguakit
chmod 0755 scripts/*
chmod 0755 parsers/*
chmod 0755 tagger/en/*
chmod 0755 tagger/es/*
chmod 0755 tagger/pt/*
chmod 0755 tagger/gl/*
chmod 0755 sentiment/en/*
chmod 0755 sentiment/es/*
chmod 0755 sentiment/pt/*
chmod 0755 sentiment/gl/*

echo "Permissions of execution, done!"

echo "Compiling lexicons"
echo "english lexicon"
./tagger/en/store_lex.perl
echo "portuguese lexicon"
./tagger/pt/store_lex.perl
echo "spanish lexicon"
./tagger/es/store_lex.perl
echo "galician lexicon"
./tagger/gl/store_lex.perl
./tagger/gl/store_split.perl

echo "Warning: if the lexicon compilation has returned error messages, please check whether the Perl module 'Storable' is installed."
echo "To install this Perl module, you may use CPAN:"
echo " cpan> install Storable"


echo ''
echo "Installation done!"
108 changes: 108 additions & 0 deletions lanrecog/lanrecog.perl
@@ -0,0 +1,108 @@
#!/usr/bin/perl

#IDENTIFIES WHETHER A TEXT IS SPANISH OR GALICIAN

#reads the text to identify (from a pipe), which must have been tokenized beforehand
#reads a file with all the available lexicons. The format is "token ling"

binmode STDIN, ':utf8';
binmode STDOUT, ':utf8';


my $lexicon = shift(@ARGV);
open (L, $lexicon) or die "The file cannot be opened: $!\n";

# Named $suffix_file so that it does not clash with the $suffix variable declared below.
my $suffix_file = shift(@ARGV);
open (S, $suffix_file) or die "The file cannot be opened: $!\n";

my $ling_def="en";

my $Separador = "[\.\,\;\:\«\»\"\&\%\+\=\$\#\(\)\<\>\!\¡\?\¿\\[\\]]" ;

my $i=1;
my $term="";
my $suffix="";
my $ling="";
my %Rank;
my %Lex;
my %Peso;
my %Suffix;

while (my $line = <L>) {

chomp $line;
($term, $ling) = split ("\t", $line);

if (!defined $Lex{$ling}) {
$i=1;
}
$Rank{$ling}{$term} = $i;
$Lex{$ling}{$term} = $term;
$i++;
#print STDERR "#$term# #$ling#\n";
}


while (my $line = <S>) {
chomp $line;
($suffix, $ling) = split ("\t", $line);

$Suffix{$ling}{$suffix}++;
# $Lex{$ling}{$suffix} = $term;
#print STDERR "#$suffix# #$ling#\n";
}

my $found=0;
while (my $line = <STDIN>) {
chomp $line;
my $token = $line;
##change uppercase to lowercase:
$token = lc ($token);
if ($token !~ /$Separador/) {
foreach $ling (keys %Lex) {
#if ($Lex{$ling}{$token} =~ /^$token$/i) {
if (defined $Lex{$ling}{$token}) {
$Peso{$ling} += $i - $Rank{$ling}{$token} ;
# print STDERR "lex: #$ling# :: #$token# #$Peso{$ling}# #$i# # $Rank{$ling}{$token} # \n";
$found=1;
}

else {

foreach $s (keys %{$Suffix{$ling}}) {
#print STDERR "lex: #$ling# :: #$token# #$Peso{$ling}# #$s# \n";
if ($token =~ /$s$/) {
$Peso{$ling} += $i - ($i/2) ;
#print STDERR "lex: #$ling# :: #$token# #$Peso{$ling}# #$i# \n";

}
}
}
}
}

}

#print STDERR "esp = $esp || gal = $gal\n";

##default:
if (!$found){
print $ling_def
}
else {
my $First=0;
foreach $ling (sort {$Peso{$b} <=>
$Peso{$a} }
keys %Peso ) {
if (!$First) {
print $ling;
$First=1;

}
}
}





Binary file added lanrecog/lexicons/.es.lx.swp
Binary file not shown.
