## Natural Language Processing in Julia

Julia has a rich ecosystem of tools for working with natural language. In this notebook, we will work through a few simple cases and provide pointers to other possibilities.

### Loading Data

The first step in working with a corpus of text is usually to load it into memory. Julia can process various source formats, such as PDF, XML, or CSV files. It can also load certain specific corpus formats, such as Semcor or Semeval files.

We will work with a corpus of Australian legal decisions, in PDF format. 

In [1]:
using Glob

The data is delivered as a set of PDF files, about three and a half thousand documents in total.

In [2]:
files = Glob.glob("data/corpus/pdfs/*.pdf")

3536-element Array{String,1}:
 "data/corpus/pdfs/06_100.pdf" 
 "data/corpus/pdfs/06_1001.pdf"
 "data/corpus/pdfs/06_1004.pdf"
 "data/corpus/pdfs/06_1005.pdf"
 "data/corpus/pdfs/06_1006.pdf"
 "data/corpus/pdfs/06_1015.pdf"
 "data/corpus/pdfs/06_1017.pdf"
 "data/corpus/pdfs/06_1018.pdf"
 "data/corpus/pdfs/06_102.pdf" 
 "data/corpus/pdfs/06_1021.pdf"
 "data/corpus/pdfs/06_1022.pdf"
 "data/corpus/pdfs/06_1023.pdf"
 "data/corpus/pdfs/06_1026.pdf"
 ⋮                             
 "data/corpus/pdfs/09_976.pdf" 
 "data/corpus/pdfs/09_977.pdf" 
 "data/corpus/pdfs/09_978.pdf" 
 "data/corpus/pdfs/09_979.pdf" 
 "data/corpus/pdfs/09_980.pdf" 
 "data/corpus/pdfs/09_981.pdf" 
 "data/corpus/pdfs/09_983.pdf" 
 "data/corpus/pdfs/09_984.pdf" 
 "data/corpus/pdfs/09_985.pdf" 
 "data/corpus/pdfs/09_99.pdf"  
 "data/corpus/pdfs/09_992.pdf" 
 "data/corpus/pdfs/09_996.pdf" 

We look inside one of the files to see how the data is arranged.

![Example PDF](pdfimg.png)

The `Taro` Julia package provides a robust PDF reader, based on the Java `Apache Tika` library. We use that package to load and parse the PDF files in the directory.

In [3]:
using Taro
Taro.init()

In [4]:
meta, txtdata = Taro.extract(files[1]);

Loaded /Library/Java/JavaVirtualMachines/jdk-9.0.1.jdk/Contents/Home/lib/server/libjvm.dylib



In [5]:
meta

Dict{String,String} with 20 entries:
  "access_permission:can_print"        => "true"
  "access_permission:fill_in_form"     => "true"
  "access_permission:modify_annotatio… => "true"
  "dc:format"                          => "application/pdf; version=1.4"
  "dcterms:created"                    => "2017-08-31T10:42:17Z"
  "xmpTPg:NPages"                      => "3"
  "created"                            => "Thu Aug 31 11:42:17 BST 2017"
  "Creation-Date"                      => "2017-08-31T10:42:17Z"
  "meta:creation-date"                 => "2017-08-31T10:42:17Z"
  "access_permission:assemble_documen… => "true"
  "X-Parsed-By"                        => "org.apache.tika.parser.DefaultParser"
  "access_permission:can_print_degrad… => "true"
  "access_permission:can_modify"       => "true"
  "access_permission:extract_content"  => "true"
  "pdf:encrypted"                      => "false"
  "producer"                           => "Apache FOP Version 2.0"
  "pdf:PDFVersion"                 

In [6]:
txtdata

"\nLawrance v Human Rights and Equal Opportunity\nCommission [2006] FCA 100 (9 February 2006)\n\n1 These are two applications for orders of review under the Administrative Decisions\n(Judicial Review) Act 1977 (Cth) (\"the AD(JR) Act\").\nThey concern correspondence sent to the Human Rights and Equal Opportunity\nCommission (\"the Commission\") by the applicant in late 2005.\nIn a letter dated 26 September 2005, the applicant wrote to the Commission concerning\nallegations of unlawful discrimination.\nThe Commission replied in a letter dated 7 October 2005, in which it indicated that it\nwas not able to assist her.\nOn 13 October 2005, the applicant again wrote to the Commission.\nThat letter addressed alleged breaches of human rights.\nThe applicant wrote to the Commission a third time on 7 November 2005, this time\nconcerning allegations of sexual harassment.\nThe Commission did not respond to the second and third letters it received from the\napplicant.\n2 It is now accepted in thes

In [7]:
txtdata = replace(txtdata, '\n', ' ')

" Lawrance v Human Rights and Equal Opportunity Commission [2006] FCA 100 (9 February 2006)  1 These are two applications for orders of review under the Administrative Decisions (Judicial Review) Act 1977 (Cth) (\"the AD(JR) Act\"). They concern correspondence sent to the Human Rights and Equal Opportunity Commission (\"the Commission\") by the applicant in late 2005. In a letter dated 26 September 2005, the applicant wrote to the Commission concerning allegations of unlawful discrimination. The Commission replied in a letter dated 7 October 2005, in which it indicated that it was not able to assist her. On 13 October 2005, the applicant again wrote to the Commission. That letter addressed alleged breaches of human rights. The applicant wrote to the Commission a third time on 7 November 2005, this time concerning allegations of sexual harassment. The Commission did not respond to the second and third letters it received from the applicant. 2 It is now accepted in these proceedings in w

We use the `TextAnalysis` Julia package for basic text analysis tasks.

In [8]:
using TextAnalysis
using Languages

In [9]:
getTitle(t) = TextAnalysis.sentence_tokenize(Languages.English(), t)[1]

getTitle (generic function with 1 method)

In [10]:
getTitle(txtdata) 

" Lawrance v Human Rights and Equal Opportunity Commission [2006] FCA 100 (9 February 2006)  1 These are two applications for orders of review under the Administrative Decisions (Judicial Review) Act 1977 (Cth) (\"the AD(JR) Act\")."

Next, we extract the text of each PDF document, take its first sentence as the title, and then create a `Corpus` from the resulting documents. The loop below processes the first 1,000 files.

In [11]:
docs = Any[]
for i in 1:1000
    try 
        meta,txt = Taro.extract(files[i])
        txt = replace(txt, '\n', ' ')
        title = getTitle(txt)
        dm = TextAnalysis.DocumentMetadata(Languages.English(), title, "", meta["Creation-Date"] )
        doc = StringDocument(txt, dm)
        
        push!(docs, doc)
    catch e
        @show e
    end
end

crps = Corpus(docs)

TextAnalysis.Corpus{Union{TextAnalysis.FileDocument, TextAnalysis.NGramDocument, TextAnalysis.StringDocument, TextAnalysis.TokenDocument}}(Union{TextAnalysis.FileDocument, TextAnalysis.NGramDocument, TextAnalysis.StringDocument, TextAnalysis.TokenDocument}[TextAnalysis.StringDocument{String}(" Lawrance v Human Rights and Equal Opportunity Commission [2006] FCA 100 (9 February 2006)  1 These are two applications for orders of review under the Administrative Decisions (Judicial Review) Act 1977 (Cth) (\"the AD(JR) Act\"). They concern correspondence sent to the Human Rights and Equal Opportunity Commission (\"the Commission\") by the applicant in late 2005. In a letter dated 26 September 2005, the applicant wrote to the Commission concerning allegations of unlawful discrimination. The Commission replied in a letter dated 7 October 2005, in which it indicated that it was not able to assist her. On 13 October 2005, the applicant again wrote to the Commission. That letter addressed alleged 

In [12]:
typeof(crps)

TextAnalysis.Corpus{Union{TextAnalysis.FileDocument, TextAnalysis.NGramDocument, TextAnalysis.StringDocument, TextAnalysis.TokenDocument}}

ERROR FlateFilter: stop reading corrupt stream due to a DataFormatException

In [14]:
orig_corpus = deepcopy(crps);

In [74]:
d = orig_corpus[1]
typeof(d)

TextAnalysis.StringDocument{String}

In [75]:
tokens(d)

1568-element Array{SubString{String},1}:
 "Lawrance"                                         
 "v"                                                
 "Human"                                            
 "Rights"                                           
 "and"                                              
 "Equal"                                            
 "Opportunity"                                      
 "Commission"                                       
 "["                                                
 "2006"                                             
 "]"                                                
 "FCA"                                              
 "100"                                              
 ⋮                                                  
 "Disclaimers"                                      
 "|"                                                
 "Privacy"                                          
 "Policy"                                           
 "|" 

In [76]:
ngrams(d, 2)

Dict{AbstractString,Int64} with 1521 entries:
  "the AD"              => 2
  "be affirmatively"    => 1
  "1"                   => 5
  "sexual harassment"   => 1
  "legislative scheme." => 1
  "concerning the"      => 1
  "replied in"          => 1
  "and probably"        => 1
  "seems inappropriate" => 1
  "sought by"           => 2
  "to a"                => 1
  "prescriptive"        => 1
  "her earlier"         => 1
  "both matters"        => 1
  "and Equal"           => 4
  "those"               => 1
  "exercised. 7"        => 1
  "the referral"        => 2
  "K"                   => 1
  "d )"                 => 1
  "this is"             => 1
  "applications for"    => 1
  "obligation"          => 2
  "general"             => 1
  "exercise certain"    => 1
  ⋮                     => ⋮
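
Note that the dictionary returned by `ngrams(d, 2)` above contains single tokens (such as `"those"`) alongside two-word sequences. The sketch below mimics that behaviour in plain Julia on a made-up token list. `count_ngrams` is a hypothetical helper, not part of `TextAnalysis`, and it assumes the counts cover every n-gram length from 1 up to `n`.

```julia
# Count all n-grams of length 1..n in a token vector.
function count_ngrams(tokens::Vector{<:AbstractString}, n::Integer)
    counts = Dict{String,Int}()
    for len in 1:n
        for i in 1:(length(tokens) - len + 1)
            gram = join(tokens[i:i+len-1], " ")
            counts[gram] = get(counts, gram, 0) + 1
        end
    end
    return counts
end

# Made-up tokens, for illustration only.
toks = ["human", "rights", "and", "equal", "opportunity", "and", "equal", "rights"]
grams = count_ngrams(toks, 2)

println(grams["and equal"])   # count of a bigram
println(grams["rights"])      # count of a unigram
```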

In [78]:
convert(TokenDocument, d)

TextAnalysis.TokenDocument{SubString{String}}(SubString{String}["Lawrance", "v", "Human", "Rights", "and", "Equal", "Opportunity", "Commission", "[", "2006"  …  "Privacy", "Policy", "|", "Feedback", "URL", ":", "http", ":", "//", "www.austlii.edu.au/au/cases/cth/FCA/2006/100.html"], TextAnalysis.DocumentMetadata(Languages.English(), " Lawrance v Human Rights and Equal Opportunity Commission [2006] FCA 100 (9 February 2006)  1 These are two applications for orders of review under the Administrative Decisions (Judicial Review) Act 1977 (Cth) (\"the AD(JR) Act\").", "", "2017-08-31T10:42:17Z"))

In [79]:
convert(NGramDocument, d)

TextAnalysis.NGramDocument{SubString{String}}(Dict("1"=>5,"prescriptive"=>1,"those"=>1,"K"=>1,"obligation"=>2,"general"=>1,"and"=>29,"these"=>4,"received"=>1,"//"=>1…), 1, TextAnalysis.DocumentMetadata(Languages.English(), " Lawrance v Human Rights and Equal Opportunity Commission [2006] FCA 100 (9 February 2006)  1 These are two applications for orders of review under the Administrative Decisions (Judicial Review) Act 1977 (Cth) (\"the AD(JR) Act\").", "", "2017-08-31T10:42:17Z"))

The `TextAnalysis` package contains basic utilities for pre-processing documents: removing punctuation and stop words, normalising case, and removing numbers. Finally, we also `stem` the words, using the built-in `Porter2` stemmer.

### Data Cleaning
```
 strip_patterns                
 strip_corrupt_utf8            
 strip_case                    
 stem_words                               
 strip_whitespace              
 strip_punctuation             
 strip_numbers                 
 strip_non_letters             
 strip_indefinite_articles     
 strip_definite_articles       
 strip_articles                
 strip_prepositions            
 strip_pronouns                
 strip_stopwords               
 strip_sparse_terms            
 strip_frequent_terms          
 strip_html_tags
 ```

In [15]:
prepare!(crps, strip_non_letters | strip_punctuation | strip_case | strip_stopwords)
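
The preparation options are combined with `|`, which suggests they behave like bit flags. The sketch below illustrates that pattern in plain Julia. The constant names, their values, and the `clean_text` helper are all hypothetical and are not the `TextAnalysis` definitions; the stop-word list is a tiny made-up sample.

```julia
# Hypothetical flags: each option occupies its own bit.
const DEMO_STRIP_CASE        = UInt32(1) << 0
const DEMO_STRIP_NON_LETTERS = UInt32(1) << 1
const DEMO_STRIP_STOPWORDS   = UInt32(1) << 2

# Tiny illustrative stop-word list (an assumption, not the package's list).
const DEMO_STOPWORDS = Set(["the", "to", "of", "in", "a", "and"])

# Apply only the steps whose bit is set in `flags`.
function clean_text(text::AbstractString, flags::UInt32)
    if (flags & DEMO_STRIP_CASE) != 0
        text = lowercase(text)
    end
    if (flags & DEMO_STRIP_NON_LETTERS) != 0
        text = replace(text, r"[^a-zA-Z\s]" => " ")
    end
    if (flags & DEMO_STRIP_STOPWORDS) != 0
        words = [w for w in split(text) if !(w in DEMO_STOPWORDS)]
        text = join(words, " ")
    end
    return text
end

sample = "The applicant wrote to the Commission in 2005."
println(clean_text(sample, DEMO_STRIP_CASE | DEMO_STRIP_NON_LETTERS | DEMO_STRIP_STOPWORDS))
```

Because each option is a separate bit, any subset can be requested in a single argument, which is why the call above can list several options in one `prepare!` invocation.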

In [16]:
crps[1]

TextAnalysis.StringDocument{String}(" lawrance   human rights   equal opportunity commission   fca     february            applications       review     administrative decisions  judicial review  act    cth    ad jr  act    concern correspondence sent     human rights   equal opportunity commission   commission      applicant   late       letter dated   september     applicant wrote     commission concerning allegations   unlawful discrimination    commission replied     letter dated   october         indicated         able   assist       october     applicant   wrote     commission    letter addressed alleged breaches   human rights    applicant wrote     commission   third time     november     time concerning allegations   sexual harassment    commission     respond         third letters   received     applicant          accepted     proceedings       commonwealth attorney      intervened   act     contradictor      commission   failed   respond   responded inappropriately     appli

In [17]:
stem!(crps)

Having normalised the words in the corpus, we can now start to process it. First, we generate the `lexicon` (i.e., the dictionary of all words that make up the corpus), and then create an inverse index, so that we can quickly look up, for any word, which documents it appears in.

In [18]:
update_lexicon!(crps)
update_inverse_index!(crps)

For example, we can see that the word "_injustice_" (which is stemmed to _injustic_) appears in 109 documents in the corpus.

In [19]:
crps["injustic"]

109-element Array{Int64,1}:
   7
  22
  30
  41
  46
  48
  62
  67
  88
 115
 121
 122
 133
   ⋮
 875
 892
 901
 910
 923
 933
 935
 938
 944
 956
 991
 999
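
To make the two structures concrete, here is a minimal sketch in plain Julia of how a lexicon and an inverse index could be built. The mini-corpus and the variable names `term_counts` and `inverse_index` are made up for illustration; this is not the `TextAnalysis` implementation.

```julia
# Hypothetical mini-corpus: each document is already tokenised and stemmed.
docs = [
    ["court", "injustic", "order"],
    ["applic", "order", "order"],
    ["injustic", "applic", "court"],
]

# Lexicon: term => total number of occurrences across the corpus.
term_counts = Dict{String,Int}()
# Inverse index: term => ascending list of documents containing the term.
inverse_index = Dict{String,Vector{Int}}()

for (doc_id, doc) in enumerate(docs)
    for term in doc
        term_counts[term] = get(term_counts, term, 0) + 1
        ids = get!(inverse_index, term, Int[])
        # Record each document once, even if the term repeats in it.
        if isempty(ids) || last(ids) != doc_id
            push!(ids, doc_id)
        end
    end
end

println(term_counts["order"])
println(inverse_index["injustic"])
```

Looking up a word in the index is then a single dictionary access, which is what makes a query like `crps["injustic"]` cheap.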

In [20]:
m = DocumentTermMatrix(crps)

TextAnalysis.DocumentTermMatrix(
  [2    ,     1]  =  6
  [5    ,     1]  =  1
  [7    ,     1]  =  28
  [9    ,     1]  =  15
  [10   ,     1]  =  3
  [11   ,     1]  =  7
  [15   ,     1]  =  1
  [16   ,     1]  =  4
  [17   ,     1]  =  3
  [18   ,     1]  =  1
  ⋮
  [716  , 23378]  =  1
  [716  , 23379]  =  12
  [831  , 23379]  =  1
  [716  , 23380]  =  1
  [831  , 23380]  =  1
  [69   , 23381]  =  1
  [69   , 23382]  =  45
  [69   , 23383]  =  1
  [69   , 23384]  =  1
  [69   , 23385]  =  1
  [69   , 23386]  =  2, String["a", "aa", "aaa", "aab", "aad", "aae", "aahl", "aaj", "aal", "aala"  …  "zzkand", "zzn", "zzo", "zzp", "zzq", "zzr", "zzs", "zzt", "zzu", "zzzq"], Dict("null"=>14507,"mh"=>13121,"walba"=>22466,"gout"=>8416,"coyl"=>4550,"curv"=>4830,"unoffici"=>21696,"aozhong"=>972,"addston"=>232,"bidder"=>2122…))

In [21]:
dt = dtm(m)

1000×23386 SparseMatrixCSC{Int64,Int64} with 494626 stored entries:
  [2    ,     1]  =  6
  [5    ,     1]  =  1
  [7    ,     1]  =  28
  [9    ,     1]  =  15
  [10   ,     1]  =  3
  [11   ,     1]  =  7
  [15   ,     1]  =  1
  [16   ,     1]  =  4
  [17   ,     1]  =  3
  [18   ,     1]  =  1
  ⋮
  [716  , 23378]  =  1
  [716  , 23379]  =  12
  [831  , 23379]  =  1
  [716  , 23380]  =  1
  [831  , 23380]  =  1
  [69   , 23381]  =  1
  [69   , 23382]  =  45
  [69   , 23383]  =  1
  [69   , 23384]  =  1
  [69   , 23385]  =  1
  [69   , 23386]  =  2

In [22]:
tfidf = tf_idf(m)

1000×23386 SparseMatrixCSC{Float64,Int64} with 494626 stored entries:
  [2    ,     1]  =  0.000700388
  [5    ,     1]  =  0.000930835
  [7    ,     1]  =  0.00535259
  [9    ,     1]  =  0.00446855
  [10   ,     1]  =  0.00324068
  [11   ,     1]  =  0.00713382
  [15   ,     1]  =  0.000198796
  [16   ,     1]  =  0.00201311
  [17   ,     1]  =  0.00151857
  [18   ,     1]  =  0.000299652
  ⋮
  [716  , 23378]  =  0.00425616
  [716  , 23379]  =  0.045949
  [831  , 23379]  =  0.0013798
  [716  , 23380]  =  0.00382909
  [831  , 23380]  =  0.0013798
  [69   , 23381]  =  0.0021235
  [69   , 23382]  =  0.0955576
  [69   , 23383]  =  0.0021235
  [69   , 23384]  =  0.0021235
  [69   , 23385]  =  0.0021235
  [69   , 23386]  =  0.00424701
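
For readers unfamiliar with TF-IDF, the sketch below computes one common variant by hand on a small made-up count matrix: term frequency is the count divided by the document length, and inverse document frequency is the log of the number of documents over the number of documents containing the term. The exact weighting used by `tf_idf` may differ, so treat this as an illustration rather than a description of the package.

```julia
# Hypothetical document-term counts: 3 documents (rows) by 4 terms (columns).
counts = [2 0 1 0;
          0 3 1 0;
          1 1 1 1]

n_docs = size(counts, 1)

# Term frequency: each count divided by the length of its document.
tf_demo = counts ./ sum(counts, dims=2)

# Inverse document frequency: log(documents / documents containing the term).
doc_freq = vec(sum(counts .> 0, dims=1))
idf_demo = log.(n_docs ./ doc_freq)

# TF-IDF: scale each column of `tf_demo` by the matching idf weight.
tfidf_demo = tf_demo .* idf_demo'

println(round.(tfidf_demo, digits=3))
```

Under this variant, a term that occurs in every document (the third column here) receives a weight of zero, since it does nothing to distinguish one document from another.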

In [23]:
using Clustering

In [24]:
cl = kmeans(full(tfidf'), 5)

Clustering.KmeansResult{Float64}([0.00140916 0.0 … 0.000855432 0.0; 0.000383612 0.000812118 … 0.0 0.0; … ; 2.13417e-6 0.0 … 0.0 0.0; 4.26835e-6 0.0 … 0.0 0.0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1  …  1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [0.0223038, 0.0161045, 0.0423691, 0.0319471, 0.0136629, 0.0353136, 0.0362196, 0.00900888, 0.0242804, 0.00995068  …  0.0157399, 0.0219015, 0.0838083, 0.0137136, 0.0423821, 0.0414701, 0.0921547, 0.227053, 0.0887132, 0.0466102], [995, 1, 2, 1, 1], [995.0, 1.0, 2.0, 1.0, 1.0], 36.668276890304554, 2, true)

In [25]:
cl.counts

5-element Array{Int64,1}:
 995
   1
   2
   1
   1

In [26]:
collect(keys(lexicon(crps)))[sortperm(cl.centers[:,4]; rev=true)[1:20]]

20-element Array{String,1}:
 "conway"    
 "domest"    
 "cornal"    
 "mj"        
 "juxtaposit"
 "mc"        
 "carrick"   
 "siemensma" 
 "vii"       
 "battersbi" 
 "feedlott"  
 "accost"    
 "canon"     
 "sandbank"  
 "wfm"       
 "handran"   
 "penola"    
 "cheng"     
 "withhold"  
 "imp"       

In [27]:
k = 2            # number of topics
iteration = 1000 # number of Gibbs sampling iterations
alpha = 0.1      # hyperparameter
beta = 0.1       # hyperparameter
l = lda(m, k, iteration, alpha, beta) # l is a k x word matrix.
                                      # Each value is the probability of occurrence of a word in a topic.

2×23386 SparseMatrixCSC{Float64,Int64} with 28942 stored entries:
  [1    ,     1]  =  0.00135573
  [2    ,     1]  =  0.00300947
  [1    ,     2]  =  0.000162619
  [2    ,     2]  =  0.00020264
  [1    ,     3]  =  5.21771e-6
  [2    ,     3]  =  1.77994e-5
  [1    ,     4]  =  1.56531e-5
  [2    ,     4]  =  1.36919e-5
  [1    ,     5]  =  8.69618e-7
  [2    ,     5]  =  6.84593e-6
  ⋮
  [2    , 23376]  =  1.64302e-5
  [2    , 23377]  =  1.36919e-6
  [1    , 23378]  =  8.69618e-7
  [1    , 23379]  =  1.1305e-5
  [1    , 23380]  =  1.73924e-6
  [1    , 23381]  =  8.69618e-7
  [1    , 23382]  =  3.91328e-5
  [2    , 23383]  =  1.36919e-6
  [1    , 23384]  =  8.69618e-7
  [2    , 23385]  =  1.36919e-6
  [1    , 23386]  =  1.73924e-6
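
A topic-word matrix is easier to read once each topic is reduced to its most probable words. The sketch below does that for a small made-up matrix. `top_terms`, `phi`, and `terms` are hypothetical names, and the approach assumes that the columns of the matrix follow the same term order as the document-term matrix.

```julia
# Hypothetical topic-word matrix: 2 topics (rows) by 5 vocabulary terms (columns).
terms = ["appeal", "court", "tax", "tribun", "visa"]
phi = [0.40 0.30 0.05 0.05 0.20;
       0.05 0.25 0.45 0.20 0.05]

# Return the `n` highest-probability terms for one topic (row) of `phi`.
function top_terms(phi::AbstractMatrix, terms::Vector{String}, topic::Integer, n::Integer)
    order = sortperm(collect(phi[topic, :]); rev=true)
    return terms[order[1:min(n, length(order))]]
end

for topic in 1:size(phi, 1)
    println("topic ", topic, ": ", join(top_terms(phi, terms, topic, 3), ", "))
end
```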

## Summarisation

In [28]:
c = Corpus(Any[StringDocument(t) for t in TextAnalysis.sentence_tokenize(Languages.English(), txtdata)]);

In [29]:
prepare!(c, strip_non_letters | strip_punctuation | strip_case | strip_stopwords)

In [30]:
stem!(c)

In [31]:
update_lexicon!(c)
update_inverse_index!(c)

In [32]:
tf = tf_idf(DocumentTermMatrix(c))

49×224 SparseMatrixCSC{Float64,Int64} with 512 stored entries:
  [4  ,   1]  =  0.486478
  [22 ,   2]  =  0.389182
  [9  ,   3]  =  0.178966
  [10 ,   3]  =  0.417588
  [29 ,   3]  =  0.156595
  [44 ,   3]  =  0.250553
  [10 ,   4]  =  0.380397
  [27 ,   4]  =  0.456476
  [29 ,   4]  =  0.142649
  [44 ,   4]  =  0.228238
  ⋮
  [49 , 218]  =  0.0810796
  [22 , 219]  =  0.389182
  [47 , 220]  =  0.299371
  [20 , 221]  =  1.3966
  [41 , 221]  =  0.465535
  [46 , 221]  =  0.558642
  [45 , 222]  =  0.243239
  [3  , 223]  =  0.279321
  [5  , 223]  =  0.698302
  [7  , 223]  =  0.253928
  [49 , 224]  =  0.0810796

In [33]:
tf * tf'

49×49 SparseMatrixCSC{Float64,Int64} with 1303 stored entries:
  [1 ,  1]  =  0.51813
  [2 ,  1]  =  0.109747
  [3 ,  1]  =  0.00950459
  [4 ,  1]  =  0.00659802
  [5 ,  1]  =  0.0237615
  [6 ,  1]  =  0.066431
  [7 ,  1]  =  0.00864054
  [8 ,  1]  =  0.015841
  [9 ,  1]  =  0.0173252
  [10,  1]  =  0.00704363
  ⋮
  [38, 49]  =  0.00871802
  [39, 49]  =  0.0027881
  [40, 49]  =  0.0198784
  [42, 49]  =  0.0510418
  [43, 49]  =  0.00669145
  [44, 49]  =  0.00208937
  [45, 49]  =  0.00597596
  [46, 49]  =  0.0261569
  [47, 49]  =  0.00160721
  [48, 49]  =  0.0163967
  [49, 49]  =  0.370584

In [34]:
function pagerank( A; Niter=20, damping=.15)
         Nmax = size(A, 1)
         r = rand(1,Nmax);              # Generate a random starting rank.
         r = r ./ norm(r,1);            # Normalize
         a = (1-damping) ./ Nmax;       # Create damping vector

         for i=1:Niter
             s = r * A
             scale!(s, damping)
             r = s .+ (a * sum(r,2));   # Compute PageRank.
         end

         r = r./norm(r,1);

         return r
end

pagerank (generic function with 1 method)

In [35]:
p=pagerank(tf * tf')

1×49 Array{Float64,2}:
 0.495949  0.500753  0.541316  0.551849  …  0.537261  0.51992  0.459878
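
The idea behind this step is that a sentence is important if it is similar to many other important sentences, so ranking the rows of the sentence-similarity matrix gives an extractive summary. The sketch below is a variant of that idea, not the `pagerank` function defined above: it assumes a row-normalised transition matrix, ignores self-similarity, starts from a uniform rank, and uses the conventional damping value of 0.85. `rank_sentences` and the small matrix `S` are made up for illustration.

```julia
using LinearAlgebra

# Rank the rows of a non-negative similarity matrix `S` by power iteration.
function rank_sentences(S::AbstractMatrix; iterations::Integer=50, damping::Real=0.85)
    n = size(S, 1)
    W = Matrix{Float64}(S)
    W[diagind(W)] .= 0.0                  # ignore self-similarity
    row_sums = sum(W, dims=2)
    row_sums[row_sums .== 0] .= 1.0       # avoid dividing by zero for isolated rows
    P = W ./ row_sums                     # row-normalised transition matrix
    r = fill(1.0 / n, 1, n)               # uniform starting rank (row vector)
    for _ in 1:iterations
        r = damping .* (r * P) .+ (1 - damping) / n
    end
    return vec(r ./ sum(r))
end

# Hypothetical similarity matrix for four sentences.
S = [1.0 0.6 0.1 0.0;
     0.6 1.0 0.5 0.1;
     0.1 0.5 1.0 0.0;
     0.0 0.1 0.0 1.0]

scores = rank_sentences(S)
top2 = sort(sortperm(scores; rev=true)[1:2])   # best two, restored to document order
println(top2)
```

Sorting the selected indices at the end keeps the chosen sentences in their original order, which usually reads better than listing them by score.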

In [36]:
TextAnalysis.sentence_tokenize(Languages.English(), txtdata)[sort(sortperm(vec(p), rev=true)[1:10])]

10-element Array{SubString{String},1}:
 "On 13 October 2005, the applicant again wrote to the Commission."                                                                           
 "As to NSD 2340 of 2005, it was proposed that the application be dismissed."                                                                 
 "I am of that view for two reasons."                                                                                                         
 "7 Accordingly, I propose to make an order only that the complaints be referred to the President."                                           
 "   11 Dealing with that second matter, it seems inappropriate to make the orders sought."                                                   
 "That power is enlivened where the Court considers it is necessary to do justice between the parties."                                       
 "I am not affirmatively satisfied that the orders sought are orders that are necessary to do justice b

In [37]:
[summarize(d) for d in orig_corpus[1:100]]

1000-element Array{Array{SubString{String},1},1}:
 SubString{String}["I am of that view for two reasons.", "That power is enlivened where the Court considers it is necessary to do justice between the parties.", "I am not affirmatively satisfied that the orders sought are orders that are necessary to do justice between the parties.", "As to the other orders proposed by the applicant are concerned, I am not satisfied they are necessary to do justice between the parties.", "Such an order can be made by the Court if it is of the view it is necessary in order to prevent prejudice to the administration of justice."]                            
 SubString{String}["5.", "   5.", "5.", "5.", "5."]                                                                                                                                                                                                                                                                                                               

## Sentiment Analysis

The `SentimentAnalyzer` used here is a pre-trained model, trained on the IMDB dataset.

In [38]:
s = SentimentAnalyzer()

Sentiment Analysis Model Trained on IMDB with a 88587 word corpus

In [39]:
d1 = StringDocument("A very nice thing that everyone likes.")

TextAnalysis.StringDocument{String}("A very nice thing that everyone likes.", TextAnalysis.DocumentMetadata(Languages.English(), "Unnamed Document", "Unknown Author", "Unknown Time"))

In [40]:
prepare!(d1, strip_case | strip_punctuation)

In [41]:
s(d1)

0.51831096f0

In [42]:
d2=StringDocument("a horrible thing that everyone hates")

TextAnalysis.StringDocument{String}("a horrible thing that everyone hates", TextAnalysis.DocumentMetadata(Languages.English(), "Unnamed Document", "Unknown Author", "Unknown Time"))

In [43]:
prepare!(d2, strip_case | strip_punctuation)

In [44]:
s(d2)

0.47193587f0

## A detour through Languages.jl

In [45]:
articles(Languages.English())

3-element Array{String,1}:
 "a"  
 "an" 
 "the"

In [46]:
articles(Languages.German())

3-element Array{String,1}:
 "der"
 "die"
 "das"

In [47]:
pronouns(Languages.German())

22-element Array{String,1}:
 "ich"   
 "meiner"
 "mir"   
 "mich"  
 "du"    
 "deiner"
 "dir"   
 "dich"  
 "er"    
 "seiner"
 "ihm"   
 "ihn"   
 "sie"   
 "ihrer" 
 "ihr"   
 "wir"   
 "unser" 
 "uns"   
 "euer"  
 "euch"  
 "ihnen" 
 "ihrer" 

In [48]:
hi_txt = "गणित ऐसी विद्याओं का समूह है जो संख्याओं, मात्राओं, परिमाणों, रूपों और उनके आपसी रिश्तों, गुण, स्वभाव इत्यादि का अध्ययन करती हैं। गणित एक अमूर्त या निराकार (abstract) और निगमनात्मक प्रणाली है"
Languages.detect_script(hi_txt)

In [49]:
ld = LanguageDetector()
ld(hi_txt)

(Languages.Hindi(), Languages.DevanagariScript(), 1.0)

In [50]:
mh_txt = "मोजणी, संरचना, अवकाश आणि बदल या संकल्पनांवर आधारित असलेली आणि त्यांचा अभ्यास करणारी गणित ही ज्ञानाची एक शाखा आहे. गणित हे निरपवाद निष्कर्ष काढण्याचे शास्त्र आहे असे विद्वान मानतात"
Languages.detect_script(mh_txt)

Languages.DevanagariScript()

In [51]:
ld(mh_txt)

(Languages.Marathi(), Languages.DevanagariScript(), 1.0)
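
Hindi and Marathi are both written in Devanagari, so the script alone cannot tell them apart; the language detector has to rely on other cues. The sketch below shows why script detection is the easier of the two problems: it only needs to check which Unicode block the letters fall into. `devanagari_share` is a hypothetical helper and is not how `Languages.jl` implements `detect_script`.

```julia
# Fraction of letters in `text` that fall inside the Devanagari Unicode block
# (U+0900 to U+097F). Combining vowel signs are not counted as letters.
function devanagari_share(text::AbstractString)
    letters = [c for c in text if isletter(c)]
    isempty(letters) && return 0.0
    in_block = count(c -> '\u0900' <= c <= '\u097F', letters)
    return in_block / length(letters)
end

println(devanagari_share("गणित"))          # Devanagari only
println(devanagari_share("mathematics"))   # Latin only
println(devanagari_share("गणित abstract")) # mixed text
```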

## Embedding

In [52]:
using Embeddings

In [53]:
embeddings = load_embeddings(Word2Vec) 

Embeddings.EmbeddingTable{Array{Float32,2},Array{String,1}}(Float32[0.0673199 0.0529562 … -0.21143 0.0136373; -0.0534466 0.0654598 … -0.0087888 -0.0742876; … ; -0.00733469 0.0108946 … -0.00405157 0.0156112; -0.00514565 -0.0470722 … -0.0341579 0.0396559], String["</s>", "in", "for", "that", "is", "on", "##", "The", "with", "said"  …  "#-###-PA-PARKS", "Lackmeyer", "PERVEZ", "KUNDI", "Budhadeb", "Nautsch", "Antuane", "tricorne", "VISIONPAD", "RAFFAELE"])

In [54]:
tk = tokens(orig_corpus[1])

1568-element Array{SubString{String},1}:
 "Lawrance"                                         
 "v"                                                
 "Human"                                            
 "Rights"                                           
 "and"                                              
 "Equal"                                            
 "Opportunity"                                      
 "Commission"                                       
 "["                                                
 "2006"                                             
 "]"                                                
 "FCA"                                              
 "100"                                              
 ⋮                                                  
 "Disclaimers"                                      
 "|"                                                
 "Privacy"                                          
 "Policy"                                           
 "|" 

In [55]:
emb_ind = [findfirst(embeddings.vocab, x) for x in tk]

1568-element Array{Int64,1}:
 141414
   6733
   7987
   9090
      0
  35573
  15454
   1373
      0
      0
      0
  49984
      0
      ⋮
 140232
      0
  22217
   5875
      0
  35021
  14012
      0
  83330
      0
      0
      0

In [56]:
filter!(x->x!=0, emb_ind)

1171-element Array{Int64,1}:
 141414
   6733
   7987
   9090
  35573
  15454
   1373
  49984
    527
    780
    166
    663
     20
      ⋮
  13940
  41223
    527
    780
  10249
   5875
 140232
  22217
   5875
  35021
  14012
  83330

In [57]:
embeddings.embeddings[:, emb_ind]

300×1171 Array{Float32,2}:
  0.0335272     0.0395419    0.0321659   …  -0.0264624   -0.0777286 
  0.00457627    0.00479563   0.00998984      0.0310793    0.00951224
 -0.0197502     0.0734349    0.12753        -0.0246607   -0.0143363 
 -0.0500981     0.010062     0.0507285       0.0233095    0.077185  
 -0.00227609    0.00102238  -0.0399593      -0.040313     0.0156273 
 -0.0283247     0.0576652    0.00814774  …   0.00509543   0.0434845 
 -0.00426315    0.02236      0.0225303       0.0337818    0.0164426 
 -0.0302515    -0.062608     0.0250809       0.0382861    0.0437563 
  0.0237002     0.0208301    0.00281628      0.080626    -0.00495995
  0.0743764     0.00953242   0.0453439      -0.0788243   -0.0337005 
  0.0158965    -0.0421309    0.0175708   …  -0.00332188  -0.0286726 
 -0.0847814     0.0114742    0.0563965       0.0159901   -0.0205193 
 -0.0242783    -0.0644909    0.0211133      -0.0497719   -0.0951224 
  ⋮                                      ⋱                ⋮         
  0.110

In [58]:
sum(embeddings.embeddings[:, emb_ind], 2)

300×1 Array{Float32,2}:
   2.42556 
  19.5039  
  41.4283  
  27.9364  
 -39.469   
 -28.9175  
  28.2123  
 -29.069   
  36.9225  
  22.0292  
 -31.0469  
 -36.7958  
  -9.94179 
   ⋮       
   8.50296 
   6.93094 
 -40.1263  
  -6.12996 
 -26.599   
  19.1105  
  -8.83452 
  13.9986  
  -0.167549
 -11.0462  
  20.5708  
 -17.991   

In [59]:
function generate_emb(doc::AbstractDocument)
    tk = tokens(doc)
    emb_ind = [findfirst(embeddings.vocab, x) for x in tk]
    filter!(x->x!=0, emb_ind)
    sum(embeddings.embeddings[:, emb_ind], 2)
end

generate_emb (generic function with 1 method)

In [62]:
[generate_emb(x) for x in orig_corpus[1:2]]

2-element Array{Array{Float32,2},1}:
 Float32[2.42556; 19.5039; … ; 20.5708; -17.991] 
 Float32[63.6404; 168.488; … ; 96.1504; -220.719]
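
The two summed vectors above differ greatly in magnitude, largely because the documents differ in length. A common alternative is to average the word vectors and compare documents by cosine similarity. The sketch below shows this on a tiny made-up embedding table; `mean_embedding`, `cosine`, and `word_index` are hypothetical names, and the `Dict` lookup is an assumption made here to avoid scanning the vocabulary for every token.

```julia
using LinearAlgebra
using Statistics

# Hypothetical 3-dimensional embedding table: one column per vocabulary word.
vocab = ["court", "order", "appeal", "visa"]
vectors = Float32[0.9 0.8 0.7 0.1;
                  0.1 0.2 0.3 0.9;
                  0.0 0.1 0.2 0.3]

# Map each word to its column index.
word_index = Dict(word => i for (i, word) in enumerate(vocab))

# Average the vectors of all in-vocabulary tokens; unknown tokens are skipped.
function mean_embedding(tokens, word_index, vectors)
    idx = [word_index[t] for t in tokens if haskey(word_index, t)]
    isempty(idx) && return zeros(eltype(vectors), size(vectors, 1))
    return vec(mean(vectors[:, idx], dims=2))
end

# Cosine similarity between two vectors.
cosine(a, b) = dot(a, b) / (norm(a) * norm(b))

doc_a = mean_embedding(["court", "order", "unknown"], word_index, vectors)
doc_b = mean_embedding(["appeal", "order"], word_index, vectors)
doc_c = mean_embedding(["visa"], word_index, vectors)

println(cosine(doc_a, doc_b))
println(cosine(doc_a, doc_c))
```

With these made-up vectors, the two documents that share legal-procedure words score as more similar to each other than either does to the third.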