# @metro_madrid tweets analysis

This notebook focuses on the analysis of the tweets posted by the official metro account.

Load the libraries and set the global variables; in this case, the ID of the account (@metro_madrid).

In [1]:
import configparser
import json
import re

import spacy
from spacy.matcher import PhraseMatcher, Matcher
from spacy.tokens import Doc, Span, Token

In [2]:
nlp = spacy.load('es_core_news_md')

In [3]:
config = configparser.ConfigParser()
config.read('config.ini')

['config.ini']

In [4]:
faults_texts = config['keywords']['faults'].split(',')
metro_account_id = config['accounts']['metro_madrid']

Load the saved tweets into lists, one per category (faults, solutions and strikes).

In [5]:
with open('../data/fault_tweets.json') as json_data:
    faults = json.load(json_data)
with open('../data/solution_tweets.json') as json_data:
    solutions = json.load(json_data)
with open('../data/strikes_tweets.json') as json_data:
    strikes = json.load(json_data)

In [6]:
print(faults[0])
print(faults_texts)

{'text': 'Circulación interrumpida en L6, entre Laguna y Moncloa en ambos sentidos, por causas técnicas. Tiempo estimado más de 15 minutos.', 'date': '2018-04-05T10:08:34', 'user_id': 182764833, 'place': None, 'country': None, 'country_code': None}
['servicio interrumpido', 'circulación interrumpida', 'circulación lenta', 'los trenes no efectúan parada', 'tramo interrumpido', '']
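
Note that the printed keyword list ends with an empty string, which suggests that the configured value ends with a comma. The following is a minimal sketch of a more defensive split, assuming a config value shaped like the one printed above. `SAMPLE_CONFIG` and `parse_keywords` are hypothetical names used only for illustration, not part of the notebook.

```python
import configparser

# Hypothetical config contents; the trailing comma reproduces the empty entry.
SAMPLE_CONFIG = """
[keywords]
faults = servicio interrumpido,circulación interrumpida,circulación lenta,
"""


def parse_keywords(raw):
    """Split a comma-separated value, dropping blank entries and padding."""
    return [item.strip() for item in raw.split(',') if item.strip()]


sample = configparser.ConfigParser()
sample.read_string(SAMPLE_CONFIG)
clean_faults = parse_keywords(sample['keywords']['faults'])
print(clean_faults)
```

Dropping blank entries avoids handing an empty pattern to any matcher built from this list later on.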


## Pattern recognition
We can create recognizers that use a matcher to tag the tokens, docs and spans that contain a given pattern.
This will be useful to classify the lines, stations, problems and estimated waiting times in the tweets that the metro account
posts on Twitter.
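
Before involving spaCy, the idea of tagging a text by pattern can be illustrated with plain substring matching. This is only a sketch under simple assumptions: the keywords are the fault phrases printed earlier, and the comparison is case-insensitive. `FAULT_KEYWORDS` and `find_fault_keywords` are hypothetical names, not part of the notebook.

```python
# Fault phrases copied from the keyword list printed earlier in the notebook.
FAULT_KEYWORDS = [
    'servicio interrumpido',
    'circulación interrumpida',
    'circulación lenta',
    'los trenes no efectúan parada',
    'tramo interrumpido',
]


def find_fault_keywords(text, keywords=FAULT_KEYWORDS):
    """Return the keywords that appear in the text, ignoring case."""
    lowered = text.lower()
    return [keyword for keyword in keywords if keyword and keyword in lowered]


tweet = ('Circulación interrumpida en L6, entre Laguna y Moncloa '
         'en ambos sentidos, por causas técnicas.')
print(find_fault_keywords(tweet))
```

Substring matching says nothing about token boundaries or entities, which is what the matcher-based recognizers add.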

We are going to create a metro line recognizer that identifies the line affected by the fault. To do so, we configure the matcher with the names of the metro lines. We will use the following list, which can be found in the `config.ini` file:

```data
línea 1,línea 2,línea 3,línea 4,línea 5,línea 6,línea 7,línea 8,línea 9,línea 10,línea 11,línea 12,L1,L2,L3,L4,L5,L6,L7,L8,L9,L10,L11,L12

```

Generally, the official metro account uses the string `LX` (where X is the number of the line), but since we also want to use tweets from other accounts, they might use the complete name.
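
The keyword list shown above can also be generated instead of typed by hand. This is a small sketch assuming lines 1 to 12, as in that list. `LINE_NUMBERS` and `build_line_keywords` are hypothetical names introduced for illustration.

```python
LINE_NUMBERS = range(1, 13)  # assumption: lines 1 to 12, as in the list above


def build_line_keywords(numbers=LINE_NUMBERS):
    """Build the long ('línea N') and short ('LN') form of every line."""
    long_forms = ['línea {}'.format(n) for n in numbers]
    short_forms = ['L{}'.format(n) for n in numbers]
    return long_forms + short_forms


# Comma-separated value in the same shape as the config entry.
lines_value = ','.join(build_line_keywords())
print(lines_value)
```

A fixed list like this only covers the exact spellings it contains, so variants such as `LÍNEA 10` or `linea10` need either extra entries or a pattern-based approach.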

In [7]:
class MetroLinesRecognizer(object):
    """Pipeline component that recognises the Madrid underground system lines
    and sets entity annotations to the text that holds them. This allows the
    easy recognition and handling of the lines.
    
    The lines are labelled as FACILITY and their spans are merged into one token.
    Additionally, ._.has_metro_line and ._.is_metro_line are set on the
    Doc/Span and Token respectively.
    """
    name = 'metro_lines'  # component name, will show up in the pipeline

    def __init__(self, nlp):
        """
        Initialise the pipeline component. The shared nlp instance is used
        to initialise the matcher with the configured lines (config.ini) and
        generate Doc objects as phrase match patterns.
        
        :param nlp: spaCy nlp instance
        """
        self.label = nlp.vocab.strings['FACILITY']  # get entity label ID

        # Set up the PhraseMatcher. It takes Doc objects as patterns,
        # so it stays efficient even if the list of lines is long.
        metro_lines = config['keywords']['lines'].split(',')
        patterns = [nlp(line) for line in metro_lines]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add('METRO_LINES', None, *patterns)

        # Register attribute on the Token. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        Token.set_extension('is_metro_line', default=False)

        # Register attributes on Doc and Span via a getter that checks if one of
        # the contained tokens is set to is_metro_line == True.
        Doc.set_extension('has_metro_line', getter=self.has_metro_line)
        Span.set_extension('has_metro_line', getter=self.has_metro_line)

    def __call__(self, doc):
        """Apply the pipeline component on a Doc object and modify it if matches
        are found. Return the Doc, so it can be processed by the next component
        in the pipeline, if available.
        
        :param doc: text to be analysed
        :return: text updated with the tags and the entities matched
        """
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # Generate Span representing the entity and set label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            # Set custom attribute on each token of the entity
            for token in entity:
                token._.set('is_metro_line', True)
            # Overwrite doc.ents and add entity
            doc.ents = list(doc.ents) + [entity]
        for span in spans:
            # Iterate over all spans and merge them into one token.
            span.merge()
        return doc

    @staticmethod
    def has_metro_line(tokens):
        """
        Getter for Doc and Span attributes
        
        :param tokens: tokens of the Doc or the Span, that is, the text
        :return: True if one of the tokens is a matched element
        """
        return any([token._.get('is_metro_line') for token in tokens])


In [8]:
class MetroLinesRecognizer2(object):
    """Pipeline component that recognises the Madrid underground system lines
    and sets entity annotations to the text that holds them. This allows the
    easy recognition and handling of the lines.
    
    The lines are labelled as FACILITY and their spans are merged into one token.
    Additionally, ._.has_metro_line and ._.is_metro_line are set on the
    Doc/Span and Token respectively.
    """
    name = 'metro_lines_token'  # component name, will show up in the pipeline

    def __init__(self, nlp):
        """
        Initialise the pipeline component. The shared nlp instance is used
        to register the regex-based lexeme flags and to initialise the
        token matcher with the patterns that describe a metro line.
        
        :param nlp: spaCy nlp instance
        """
        self.label = nlp.vocab.strings['FACILITY']  # get entity label ID

        # Regex-based lexeme flags. re.match only anchors at the start of the
        # token text, so each flag is a prefix check on the token.
        line_re = re.compile(r'l([ií]nea)?', re.IGNORECASE)
        metro_line_re = re.compile(r'l([ií]nea)?[ -]?(1[0-2]|[1-9])', re.IGNORECASE)
        short_line_re = re.compile(r'l[ -]?(1[0-2]|[1-9])', re.IGNORECASE)
        IS_LINE = nlp.vocab.add_flag(lambda text: bool(line_re.match(text)))
        IS_METRO_LINE = nlp.vocab.add_flag(lambda text: bool(metro_line_re.match(text)))
        # Short form only, e.g. "L6"
        IS_LINE_SHORT = nlp.vocab.add_flag(lambda text: bool(short_line_re.match(text)))
        
        # Set up the token-based Matcher. The patterns cover a line word
        # followed by a number (with or without a space token) and
        # single-token forms such as "linea10" or "L6".
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add('METRO_LINES', None, [{IS_LINE: True}, {'IS_SPACE': True}, {'IS_DIGIT': True}], 
                         [{IS_LINE: True}, {'IS_DIGIT': True}], [{IS_METRO_LINE: True}], [{IS_LINE_SHORT: True}])

        # Register attribute on the Token. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        Token.set_extension('is_metro_line', default=False)

        # Register attributes on Doc and Span via a getter that checks if one of
        # the contained tokens is set to is_metro_line == True.
        Doc.set_extension('has_metro_line', getter=self.has_metro_line)
        Span.set_extension('has_metro_line', getter=self.has_metro_line)

    def __call__(self, doc):
        """Apply the pipeline component on a Doc object and modify it if matches
        are found. Return the Doc, so it can be processed by the next component
        in the pipeline, if available.
        
        :param doc: text to be analysed
        :return: text updated with the tags and the entities matched
        """
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # Generate Span representing the entity and set label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            # Set custom attribute on each token of the entity
            for token in entity:
                token._.set('is_metro_line', True)
            # Overwrite doc.ents and add entity
            doc.ents = list(doc.ents) + [entity]
        for span in spans:
            # Iterate over all spans and merge them into one token.
            span.merge()
        return doc

    @staticmethod
    def has_metro_line(tokens):
        """
        Getter for Doc and Span attributes
        
        :param tokens: tokens of the Doc or the Span, that is, the text
        :return: True if one of the tokens is a matched element
        """
        return any([token._.get('is_metro_line') for token in tokens])


We add the line recognizer to the pipeline so that it runs as part of the text pre-processing, which saves us from calling it separately.

In [None]:
lines_detector = MetroLinesRecognizer(nlp=nlp)
nlp.add_pipe(lines_detector, last=True)
print(nlp.pipe_names)

In [9]:
lines_detector = MetroLinesRecognizer2(nlp=nlp)
nlp.add_pipe(lines_detector, last=True)
print(nlp.pipe_names)

['tagger', 'parser', 'ner', 'metro_lines_token']


Let's try it with a tweet written in the official account's style.

In [None]:
# nlp.remove_pipe('metro_lines')

In [10]:
def print_data_about_lines(text):
    doc = nlp(text)
    print('The text contains a metro line in it: ', doc._.has_metro_line)
    print('Tokens and if it is part of a line:', [(t, t._.is_metro_line) for t in doc])
    print('Tokens that form part of a line:', [t for t in doc if t._.is_metro_line])
    print('Entities that are a facility:', [(e.text, e.label_) for e in doc.ents if e.label_ == 'FACILITY']) 

In [11]:
test = faults[0]['text']
print(test)
print_data_about_lines(test)

Circulación interrumpida en L6, entre Laguna y Moncloa en ambos sentidos, por causas técnicas. Tiempo estimado más de 15 minutos.
The text contains a metro line in it:  True
Tokens and if it is part of a line: [(Circulación, False), (interrumpida, False), (en, False), (L6, True), (,, False), (entre, False), (Laguna, False), (y, False), (Moncloa, False), (en, False), (ambos, False), (sentidos, False), (,, False), (por, False), (causas, False), (técnicas, False), (., False), (Tiempo, False), (estimado, False), (más, False), (de, False), (15, False), (minutos, False), (., False)]
Tokens that form part of a line: [L6]
Entities that are a facility: [('L6', 'FACILITY')]


As we can see, when the text contains a metro line, the value of the `has_metro_line` attribute is `True`. The matched token (`L6`) is also flagged with `is_metro_line` and labelled as a `FACILITY` entity.

In [12]:
test = faults[1]['text']
print(test)
print_data_about_lines(test)

Circulación lenta en línea 6 entre Príncipe Pío y Metropolitano, dirección andén 2, por causas técnicas.
The text contains a metro line in it:  True
Tokens and if it is part of a line: [(Circulación, False), (lenta, False), (en, False), (línea 6, True), (entre, False), (Príncipe, False), (Pío, False), (y, False), (Metropolitano, False), (,, False), (dirección, False), (andén, False), (2, False), (,, False), (por, False), (causas, False), (técnicas, False), (., False)]
Tokens that form part of a line: [línea 6]
Entities that are a facility: [('línea 6', 'FACILITY')]


In [13]:
test = faults[2]['text']
print(test)
print_data_about_lines(test)

Circulación interrumpida en LÍNEA 10 en ambos sentidos entre Nuevos Ministerios y Begoña por asistencia sanitaria a un viajero. Tiempo estimado de solución +30 min
The text contains a metro line in it:  True
Tokens and if it is part of a line: [(Circulación, False), (interrumpida, False), (en, False), (LÍNEA 10, True), (en, False), (ambos, False), (sentidos, False), (entre, False), (Nuevos, False), (Ministerios, False), (y, False), (Begoña, False), (por, False), (asistencia, False), (sanitaria, False), (a, False), (un, False), (viajero, False), (., False), (Tiempo, False), (estimado, False), (de, False), (solución, False), (+, False), (30, False), (min, False)]
Tokens that form part of a line: [LÍNEA 10]
Entities that are a facility: [('LÍNEA 10', 'FACILITY')]


In [14]:
test = faults[3]['text']
print(test)
print_data_about_lines(test)

El tramo de circulación interrumpida en linea10 por asistencia sanitaria a un viajero pasa a ser entre Begoña y Cuzco. Tiempo estimado solución + 30 min.
The text contains a metro line in it:  True
Tokens and if it is part of a line: [(El, False), (tramo, False), (de, False), (circulación, False), (interrumpida, False), (en, False), (linea10, True), (por, False), (asistencia, False), (sanitaria, False), (a, False), (un, False), (viajero, False), (pasa, False), (a, False), (ser, False), (entre, False), (Begoña, False), (y, False), (Cuzco, False), (., False), (Tiempo, False), (estimado, False), (solución, False), (+, False), (30, False), (min, False), (., False)]
Tokens that form part of a line: [linea10]
Entities that are a facility: [('linea10', 'FACILITY')]


In [15]:
test = faults[4]['text']
print(test)
print_data_about_lines(test)

Circulación lenta en Linea 1 entre Bilbao y Pinar de Chamartín, en ambos sentidos, por causas técnicas.
The text contains a metro line in it:  True
Tokens and if it is part of a line: [(Circulación, False), (lenta, False), (en, False), (Linea 1, True), (entre, False), (Bilbao, False), (y, False), (Pinar, False), (de, False), (Chamartín, False), (,, False), (en, False), (ambos, False), (sentidos, False), (,, False), (por, False), (causas, False), (técnicas, False), (., False)]
Tokens that form part of a line: [Linea 1]
Entities that are a facility: [('Linea 1', 'FACILITY')]


In [16]:
test = faults[5]['text']
print(test)
print_data_about_lines(test)

Circulación lenta en Línea 2 entre las estaciones de Goya y Avenida de Guadalajara, en ambos sentidos por causas técnicas. Tiempo estimado de solución: +30 min
The text contains a metro line in it:  True
Tokens and if it is part of a line: [(Circulación, False), (lenta, False), (en, False), (Línea 2, True), (entre, False), (las, False), (estaciones, False), (de, False), (Goya, False), (y, False), (Avenida, False), (de, False), (Guadalajara, False), (,, False), (en, False), (ambos, False), (sentidos, False), (por, False), (causas, False), (técnicas, False), (., False), (Tiempo, False), (estimado, False), (de, False), (solución, False), (:, False), (+, False), (30, False), (min, False)]
Tokens that form part of a line: [Línea 2]
Entities that are a facility: [('Línea 2', 'FACILITY')]


In [17]:
PATTERN = re.compile(r'l([ií]nea)?[ -]?(1[0-2]|[1-9])', re.IGNORECASE)

In [None]:
print(bool(PATTERN.match("linea 1")))
print(bool(PATTERN.match("linea10")))
print(bool(PATTERN.match("línea 5")))
print(bool(PATTERN.match("LÍNEA 22")))
print(bool(PATTERN.match("L 10")))
print(bool(PATTERN.match("L-1")))
print(bool(PATTERN.match("Las")))
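
The checks above use `PATTERN.match`, which only anchors at the start of the string, so an input such as `LÍNEA 22` is still accepted through its `LÍNEA 2` prefix. The following is a minimal sketch of a stricter check, assuming that only lines 1 to 12 are valid and that the whole string must be a line mention. It uses `fullmatch` and also returns the line number. `LINE_PATTERN` and `parse_line_number` are hypothetical names, and the character class is written as `[ií]`, since a `|` inside square brackets would match a literal pipe.

```python
import re

# Same idea as PATTERN, as a raw string; the only capturing group is the number.
LINE_PATTERN = re.compile(r'l(?:[ií]nea)?[ -]?(1[0-2]|[1-9])', re.IGNORECASE)


def parse_line_number(text):
    """Return the line number if the whole text is a line mention, else None."""
    found = LINE_PATTERN.fullmatch(text)
    return int(found.group(1)) if found else None


for sample in ['linea 1', 'linea10', 'línea 5', 'LÍNEA 22', 'L 10', 'L-1', 'Las']:
    print(repr(sample), parse_line_number(sample))
```

With `fullmatch`, `LÍNEA 22` and `Las` give `None`, while the remaining samples resolve to lines 1, 10, 5, 10 and 1. Returning the number also normalises the different spellings of the same line to a single value.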