# Words starting with a capital letter

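This notebook shows how to define an Elasticsearch analyzer that keeps only the words starting with a capital letter. The regular expressions are first tried out in Python and then applied through the simple_pattern tokenizer.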

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simplepattern-tokenizer.html

In [377]:
import re

s = "Ahoj. Cau.  Nazdar iDeal Bye"

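# 1. No anchoring: also matches "Deal" inside "iDeal".
print(re.findall("[A-Z][a-zA-Z]+", s))
# 2. Word boundary: the capital letter must start a word.
print(re.findall(r"\b[A-Z][a-zA-Z]+", s))
# 3. Lookbehind for whitespace: drops the first word of the string.
print(re.findall(r"(?<=\s)[A-Z][a-zA-Z]+", s))
# 4. Literal space not preceded by a period: the space becomes part of the match.
print(re.findall(r"(?<!\.) [A-Z][a-zA-Z]+", s))
# 5. Not preceded by a period plus one whitespace character; "Nazdar" still matches because two spaces follow "Cau.".
print(re.findall(r"(?<!\.\s)\b[A-Z][a-zA-Z]+", s))
# 6. (?<!\.) is redundant once (?<=\s) holds, so this behaves like pattern 3.
print(re.findall(r"(?<!\.)(?<=\s)[A-Z][a-zA-Z]+", s))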

['Ahoj', 'Cau', 'Nazdar', 'Deal', 'Bye']
['Ahoj', 'Cau', 'Nazdar', 'Bye']
['Cau', 'Nazdar', 'Bye']
[' Nazdar', ' Bye']
['Ahoj', 'Nazdar', 'Bye']
['Cau', 'Nazdar', 'Bye']
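
None of the patterns above skips a sentence-opening word when more than one space follows the period, because Python's re module only accepts fixed-width lookbehinds. A minimal sketch of one workaround, assuming the goal is to skip sentence-opening words: match the capitalized words first, then inspect the text before each match. The helper name mid_sentence_capitalized is hypothetical and not part of the notebook.

```python
import re

CAPITALIZED = re.compile(r"\b[A-Z][a-zA-Z]+")
# A sentence terminator followed only by whitespace up to the end of the prefix.
SENTENCE_END = re.compile(r"[.!?]\s*$")

def mid_sentence_capitalized(text):
    """Return capitalized words that do not open a sentence (hypothetical helper)."""
    words = []
    for m in CAPITALIZED.finditer(text):
        prefix = text[:m.start()]
        # Skip the first word of the text and any word that follows ., ! or ?
        if not prefix.strip() or SENTENCE_END.search(prefix):
            continue
        words.append(m.group())
    return words

print(mid_sentence_capitalized("Ahoj. Cau.  Nazdar iDeal Bye"))
```

On the sample string only "Bye" survives, since "Ahoj", "Cau" and "Nazdar" each open a sentence.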


In [378]:
import subprocess
import json

In [379]:
query = """
curl -X DELETE "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/pattern_test" -u guest:teradata
"""

#res = subprocess.getoutput(query)
#res = json.loads(res[res.find("{"):])
#res

In [380]:
query = """
curl -X DELETE "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/pattern_test" -u guest:teradata
"""

res = subprocess.getoutput(query)

query = """
curl -s -X PUT "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/pattern_test" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": [
            "_english_", "the", "be", "and", "of", "a", "in", "to", "have", "to",
            "it", "i", "that", "for", "you", "he", "with", "on", "do", "say", "this",
            "they", "at", "but", "we", "his", "from", "that", "not", "by",
            "she", "or", "as", "what", "go", "their", "can", "who", "get", "if",
            "would", "her", "all", "my", "make", "about", "know", "will", "as", "up",
            "one", "time", "there", "year", "so", "think", "when", "which", "them",
            "some", "me", "people", "take", "out", "into", "just", "see", "him",
            "your", "come", "could", "now", "than", "like", "other", "how", "then",
            "its", "our", "two", "more", "these", "want", "way", "look", "first",
            "also", "new", "because", "day", "more", "use", "no", "man", "find",
            "here", "thing", "give", "many", "well", "only", "those", "tell", "one",
            "very", "her", "even", "back", "any", "good", "had", "does", "doesn\\u0027t",
            "sunday", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday"
          ]
        }
      },
      "analyzer": {
        "test_analyzer": {
          "type": "custom",
          "tokenizer": "capital_tokenizer",
          "filter": ["trim", "lowercase", "english_stop"]
        }
      },
      "tokenizer": {
        "capital_tokenizer": {
          "type": "simple_pattern",
          "pattern": " [A-Z][a-zA-Z0-9]+"
        }
      }
    }
  }
}
' -u guest:teradata
"""

res = subprocess.getoutput(query)
res

'{"acknowledged":true,"shards_acknowledged":true,"index":"pattern_test"}'
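
The analyzer defined above can be approximated locally to see what each stage contributes. This is a rough sketch only: Python's re stands in for the Lucene regular expression used by simple_pattern, STOPWORDS is a small excerpt of the list in the index settings, and the function name analyze is hypothetical. The leading space in the pattern takes the place of a lookbehind, and the trim filter removes it again afterwards.

```python
import re

# Same pattern as the capital_tokenizer: a space, a capital letter, then letters or digits.
TOKEN_PATTERN = re.compile(r" [A-Z][a-zA-Z0-9]+")
# Assumption: a small excerpt of the english_stop list, enough for the sample below.
STOPWORDS = {"the", "it", "as", "thursday", "friday"}

def analyze(text):
    """Emulate tokenizer -> trim -> lowercase -> stop (hypothetical helper)."""
    tokens = []
    for raw in TOKEN_PATTERN.findall(text):
        token = raw.strip().lower()   # trim + lowercase filters
        if token not in STOPWORDS:    # stop filter
            tokens.append(token)
    return tokens

sample = ("Rabobank and ABN Amro were targeted by DDoS attacks on Thursday night "
          "and Friday morning. As a result")
print(analyze(sample))  # ['abn', 'amro', 'ddos']
```

"Rabobank" is skipped because no space precedes it at the start of the text. "As" does pass the tokenizer even though it opens a sentence; it is only dropped later by the stop filter.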

In [381]:
text = """Rabobank and ABN Amro were targeted by DDoS attacks on Thursday night and Friday morning. As a result their online and mobile banking, iDeal payments and websites were hard to reach or completely offline for several hours.

The attacks on Rabobank started around 6:00 p.m. on Thursday, NOS reports. The problems were resolved shortly before midnight. ABN Amro was attacked during the early hours of Friday morning.

In a DDoS attack a website is bombarded by large amounts of data, crashing its server and therefore also the site. 

A spokesperson for ABN Amro thinks that the cyber criminals are jumping from one bank to the next with these attacks. It looks like a cat-and-mouse game, he said to NOS.

The banks apologized to their customers for the inconvenience. 
"""

In [382]:
query = """
curl -s -X GET "http://a3557701c4b3211e88f8a060fa4fdbf3-427558466.eu-west-3.elb.amazonaws.com/elasticsearch/pattern_test/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "test_analyzer",
  "text":     """ + json.dumps(text.replace("\n", "")) + """
}
' -u guest:teradata
"""

res = subprocess.getoutput(query)
res = json.loads(res)
res

{'tokens': [{'end_offset': 16,
   'position': 0,
   'start_offset': 12,
   'token': 'abn',
   'type': 'word'},
  {'end_offset': 21,
   'position': 1,
   'start_offset': 16,
   'token': 'amro',
   'type': 'word'},
  {'end_offset': 43,
   'position': 2,
   'start_offset': 38,
   'token': 'ddos',
   'type': 'word'},
  {'end_offset': 245,
   'position': 6,
   'start_offset': 236,
   'token': 'rabobank',
   'type': 'word'},
  {'end_offset': 287,
   'position': 8,
   'start_offset': 283,
   'token': 'nos',
   'type': 'word'},
  {'end_offset': 352,
   'position': 10,
   'start_offset': 348,
   'token': 'abn',
   'type': 'word'},
  {'end_offset': 357,
   'position': 11,
   'start_offset': 352,
   'token': 'amro',
   'type': 'word'},
  {'end_offset': 421,
   'position': 13,
   'start_offset': 416,
   'token': 'ddos',
   'type': 'word'},
  {'end_offset': 549,
   'position': 14,
   'start_offset': 545,
   'token': 'abn',
   'type': 'word'},
  {'end_offset': 554,
   'position': 15,
   'start_offse
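
The request above is handed to the shell as a single string, so a quote character inside the text could break either the JSON or the shell quoting. A possible alternative, sketched under assumptions: serialize the body with json.dumps and pass curl its arguments as a list, which bypasses the shell. ES_URL is a placeholder rather than the notebook's host, and build_analyze_command is a hypothetical helper.

```python
import json

ES_URL = "http://localhost:9200"  # assumption: placeholder endpoint

def build_analyze_command(text, index="pattern_test", analyzer="test_analyzer"):
    """Build a curl argument list for the _analyze endpoint (hypothetical helper)."""
    # json.dumps escapes quotes and backslashes; newlines are removed as in the notebook.
    body = json.dumps({"analyzer": analyzer, "text": text.replace("\n", "")})
    return [
        "curl", "-s", "-X", "GET", f"{ES_URL}/{index}/_analyze",
        "-H", "Content-Type: application/json",
        "-d", body,
    ]

cmd = build_analyze_command('He said "it\'s down".\nABN Amro')
print(cmd[-1])
```

The resulting list can be passed to subprocess.run(cmd, capture_output=True, text=True), and its stdout parsed with json.loads as before.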

In [383]:
for token in res['tokens']:
    print(token['token'])

abn
amro
ddos
rabobank
nos
abn
amro
ddos
abn
amro
nos
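
Since the same names recur, a frequency count may be more useful than the raw list. A small sketch using collections.Counter; the tokens list is copied from the output above so that the block runs on its own, while in the notebook it could be built as [t['token'] for t in res['tokens']].

```python
from collections import Counter

# Tokens copied from the output above.
tokens = ["abn", "amro", "ddos", "rabobank", "nos",
          "abn", "amro", "ddos", "abn", "amro", "nos"]

counts = Counter(tokens)
for token, n in counts.most_common():
    print(token, n)
```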
