# Madrid underground problems recognition

This notebook focuses on recognising the different incidents that can occur in the Madrid underground system. The official account reports three kinds of situations that can affect circulation:

- **Faults:** when they occur and when they are fixed.
- **Delays:** when they start and when circulation returns to normal.
- **Strikes:** reminders of the days on which a strike has been called.

Some examples of the phrases used in the tweets that report these disruptions are:

- **Faults:** servicio interrumpido, circulación interrumpida, circulación lenta, trenes no efectúan parada, tramo interrumpido.
- **Delays:** retrasos, min de espera, minutos de espera, minutos esperando.
- **Strikes:** huelga, servicios mínimos.
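
The phrase lists above can be read as a simple lookup from category to keywords. The following is only a minimal sketch of that idea using plain substring matching, not the spaCy-based approach built later in this notebook; the names `CATEGORY_KEYWORDS` and `classify_tweet` are hypothetical and introduced here for illustration.

```python
# Hypothetical lookup table built from the example phrases listed above.
CATEGORY_KEYWORDS = {
    'fault': ['servicio interrumpido', 'circulación interrumpida',
              'circulación lenta', 'trenes no efectúan parada',
              'tramo interrumpido'],
    'delay': ['retrasos', 'min de espera', 'minutos de espera',
              'minutos esperando'],
    'strike': ['huelga', 'servicios mínimos'],
}


def classify_tweet(text):
    """Return the categories whose keywords appear in the lowercased text."""
    lowered = text.lower()
    return [category
            for category, keywords in CATEGORY_KEYWORDS.items()
            if any(keyword in lowered for keyword in keywords)]


categories = classify_tweet('Circulación lenta en línea 6 entre Príncipe Pío y Metropolitano')
print(categories)
```

Substring matching of this kind ignores tokenisation and spelling variants such as missing accents, which is one reason a token-based matcher is a better fit for real tweets.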



In [1]:
import configparser
import json
import re

import spacy
from spacy.matcher import PhraseMatcher, Matcher
from spacy.tokens import Doc, Span, Token

In [2]:
nlp = spacy.load('es_core_news_md')

In [3]:
config = configparser.ConfigParser()
config.read('config.ini')

['config.ini']

In [4]:
faults_texts = config['keywords']['faults'].split(',')
metro_account_id = config['accounts']['metro_madrid']
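
The cell above expects `config.ini` to have a `keywords` section with a comma-separated `faults` entry and an `accounts` section with a `metro_madrid` entry. The real file is not shown in this notebook, so the sketch below only illustrates a layout that would satisfy those lookups; every value in `SAMPLE_CONFIG` is a placeholder, not the actual configuration.

```python
import configparser

# Placeholder contents: only the section and key names come from the notebook,
# the values are invented for illustration.
SAMPLE_CONFIG = """
[keywords]
faults = circulación interrumpida,circulación lenta,tramo interrumpido

[accounts]
metro_madrid = 0000000000
"""

sample = configparser.ConfigParser()
sample.read_string(SAMPLE_CONFIG)

# Same lookups as in the cell above, with whitespace stripped around each keyword
sample_faults = [keyword.strip() for keyword in sample['keywords']['faults'].split(',')]
sample_account_id = sample['accounts']['metro_madrid']
print(sample_faults, sample_account_id)
```

Note that `configparser` returns every value as a string, so the account id stays a string unless it is converted explicitly.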

In [5]:
with open('../data/fault_tweets.json') as json_data:
    faults = json.load(json_data)
with open('../data/solution_tweets.json') as json_data:
    solutions = json.load(json_data)
with open('../data/strikes_tweets.json') as json_data:
    strikes = json.load(json_data)

In [8]:
class MetroFaultsRecognizer(object):
    """Pipeline component that recognises fault phrases in tweets about the
    Madrid underground system and sets entity annotations on the text that
    holds them. This allows easy recognition and handling of the faults.

    The matched phrases are labelled as EVENT and their spans are merged into
    one token. Additionally, ._.has_metro_fault is set on the Doc/Span and
    ._.is_metro_fault on the Token.
    """
    name = 'faults_recognizer'  # component name, will show up in the pipeline

    def __init__(self, nlp):
        """
        Initialise the pipeline component. The shared nlp instance is used
        to initialise the Matcher with the token patterns of the fault phrases.

        :param nlp: spaCy nlp instance
        """
        self.label = nlp.vocab.strings['EVENT']  # get entity label ID
        
        #LINE_PATTERN = re.compile('l([i|í]nea){0,1}', re.IGNORECASE)
        #LONG_METRO_LINE_PATTERN = re.compile('l([i|í]nea){0,1}[ -]{0,1}(1[0-2]|[1-9])', re.IGNORECASE)

        #line_flag = lambda text: bool(LINE_PATTERN.match(text))
        #IS_LINE = nlp.vocab.add_flag(line_flag)
        #metro_line_flag = lambda text: bool(LONG_METRO_LINE_PATTERN.match(text))
        #IS_METRO_LINE = nlp.vocab.add_flag(metro_line_flag)
        
        # Set up the token-based Matcher with one pattern per fault phrase,
        # including the unaccented spellings that appear in some tweets
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add('METRO_FAULTS', None, [{'LOWER': 'circulación'}, {'LOWER': 'interrumpida'}],
                         [{'LOWER': 'incidencia'}],
                         [{'LOWER': 'circulacion'}, {'LOWER': 'interrumpida'}],
                         [{'LOWER': 'circulación'}, {'LOWER': 'lenta'}],
                         [{'LOWER': 'circulacion'}, {'LOWER': 'lenta'}],
                         [{'LOWER': 'tramo'}, {'LOWER': 'interrumpido'}],
                         [{'LOWER': 'tramo'}, {'LOWER': 'cortado'}],
                         [{'LOWER': 'servicio'}, {'LOWER': 'interrumpido'}],
                         [{'LOWER': 'servicio'}, {'LOWER': 'cortado'}],
                         [{'LOWER': 'trenes'}, {'LOWER': 'no'}, {'LOWER': 'efectúan'}, {'LOWER': 'parada'}], 
                         [{'LOWER': 'trenes'}, {'LOWER': 'no'}, {'LOWER': 'efectuan'}, {'LOWER': 'parada'}])
        
        # Register attribute on the Token. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        Token.set_extension('is_metro_fault', default=False)

        # Register attributes on Doc and Span via a getter that checks if one of
        # the contained tokens is set to is_metro_fault == True.
        Doc.set_extension('has_metro_fault', getter=self.has_metro_line)
        Span.set_extension('has_metro_fault', getter=self.has_metro_line)

    def __call__(self, doc):
        """Apply the pipeline component on a Doc object and modify it if matches
        are found. Return the Doc, so it can be processed by the next component
        in the pipeline, if available.
        
        :param doc: text to be analysed
        :return: text updated with the tags and the entities matched
        """
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # Generate Span representing the entity and set label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            # Set custom attribute on each token of the entity
            for token in entity:
                token._.set('is_metro_fault', True)
            # Overwrite doc.ents and add entity
            doc.ents = list(doc.ents) + [entity]
        for span in spans:
            # Iterate over all spans and merge them into one token.
            span.merge()
        return doc

    @staticmethod
    def has_metro_line(tokens):
        """
        Getter for Doc and Span attributes
        
        :param tokens: tokens of the Doc or the Span, that is, the text
        :return: True if one of the tokens is a matched element
        """
        return any([token._.get('is_metro_fault') for token in tokens])
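
The pattern list in `__init__` spells out each accented phrase twice, once with and once without the accent. As a possible alternative, the variants could be generated from a list of phrases. This is only a sketch under that assumption: `strip_accents` and `build_patterns` are hypothetical helpers, not part of spaCy, and they only produce the list-of-dicts structure that `Matcher.add` receives above.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks, e.g. 'circulación' -> 'circulacion'.

    Caveat: this also turns 'ñ' into 'n'.
    """
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')


def build_patterns(phrases):
    """Return one LOWER token pattern per phrase plus its unaccented variant."""
    patterns = []
    for phrase in phrases:
        for variant in (phrase, strip_accents(phrase)):
            pattern = [{'LOWER': word} for word in variant.lower().split()]
            if pattern not in patterns:  # skip duplicates for unaccented phrases
                patterns.append(pattern)
    return patterns


fault_patterns = build_patterns(['circulación interrumpida', 'tramo cortado'])
print(fault_patterns)
```

With the `Matcher.add` call style used in this notebook, the result could then be passed as `self.matcher.add('METRO_FAULTS', None, *fault_patterns)`. Splitting on whitespace assumes the tokenizer produces one token per word, which holds for the phrases used here but not necessarily for text containing punctuation.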


In [8]:
class MetroDelaysRecognizer(object):
    """Pipeline component that recognises delay phrases in tweets about the
    Madrid underground system and sets entity annotations on the text that
    holds them. This allows easy recognition and handling of the delays.

    The matched phrases are labelled as EVENT and their spans are merged into
    one token. Note that this component reuses the ._.has_metro_fault
    (Doc/Span) and ._.is_metro_fault (Token) extensions.
    """
    name = 'delays_recognizer'  # component name, will show up in the pipeline

    def __init__(self, nlp):
        """
        Initialise the pipeline component. The shared nlp instance is used
        to initialise the Matcher with the token patterns of the delay phrases.

        :param nlp: spaCy nlp instance
        """
        self.label = nlp.vocab.strings['EVENT']  # get entity label ID
        
        #LINE_PATTERN = re.compile('l([i|í]nea){0,1}', re.IGNORECASE)
        #LONG_METRO_LINE_PATTERN = re.compile('l([i|í]nea){0,1}[ -]{0,1}(1[0-2]|[1-9])', re.IGNORECASE)

        #line_flag = lambda text: bool(LINE_PATTERN.match(text))
        #IS_LINE = nlp.vocab.add_flag(line_flag)
        #metro_line_flag = lambda text: bool(LONG_METRO_LINE_PATTERN.match(text))
        #IS_METRO_LINE = nlp.vocab.add_flag(metro_line_flag)
        
        # Set up the token-based Matcher with one pattern per delay phrase
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add('METRO_DELAYS', None, [{'LOWER': 'retrasos'}], [{'LOWER': 'retraso'}], [{'LOWER': 'frequencia'}],
                         [{'LOWER': 'minutos'}, {'LOWER': 'de'}, {'LOWER': 'espera'}],
                         [{'LOWER': 'min'}, {'LOWER': 'de'}, {'LOWER': 'espera'}],
                         [{'LOWER': 'minutos'}, {'LOWER': 'de'}, {'LOWER': 'retraso'}],
                         [{'LOWER': 'min'}, {'LOWER': 'de'}, {'LOWER': 'retraso'}],
                         [{'LOWER': 'minutos'}, {'LOWER': 'esperando'}],
                         [{'LOWER': 'tiempo'}, {'LOWER': 'de'}, {'LOWER': 'espera'}],
                         [{'LOWER': 'tiempos'}, {'LOWER': 'de'}, {'LOWER': 'espera'}],
                         [{'LOWER': 'frecuencia'}, {'LOWER': 'de'}, {'LOWER': 'paso'}],
                         [{'LOWER': 'frecuencias'}, {'LOWER': 'de'}, {'LOWER': 'paso'}],
                         [{'LOWER': 'frecuencias'}, {'LOWER': 'de'}, {'LOWER': 'trenes'}],
                         [{'LOWER': 'servicio'}, {'LOWER': 'cortado'}],
                         [{'LOWER': 'trenes'}, {'LOWER': 'no'}, {'LOWER': 'efectúan'}, {'LOWER': 'parada'}], 
                         [{'LOWER': 'trenes'}, {'LOWER': 'no'}, {'LOWER': 'efectuan'}, {'LOWER': 'parada'}])
        
        # Register attribute on the Token. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        Token.set_extension('is_metro_fault', default=False)

        # Register attributes on Doc and Span via a getter that checks if one of
        # the contained tokens is set to is_metro_fault == True.
        Doc.set_extension('has_metro_fault', getter=self.has_metro_line)
        Span.set_extension('has_metro_fault', getter=self.has_metro_line)

    def __call__(self, doc):
        """Apply the pipeline component on a Doc object and modify it if matches
        are found. Return the Doc, so it can be processed by the next component
        in the pipeline, if available.
        
        :param doc: text to be analysed
        :return: text updated with the tags and the entities matched
        """
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # Generate Span representing the entity and set label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            # Set custom attribute on each token of the entity
            for token in entity:
                token._.set('is_metro_fault', True)
            # Overwrite doc.ents and add entity
            doc.ents = list(doc.ents) + [entity]
        for span in spans:
            # Iterate over all spans and merge them into one token.
            span.merge()
        return doc

    @staticmethod
    def has_metro_line(tokens):
        """
        Getter for Doc and Span attributes
        
        :param tokens: tokens of the Doc or the Span, that is, the text
        :return: True if one of the tokens is a matched element
        """
        return any([token._.get('is_metro_fault') for token in tokens])


In [9]:
fault_detector = MetroFaultsRecognizer(nlp=nlp)
nlp.add_pipe(fault_detector, last=True)
print(nlp.pipe_names)

['tagger', 'parser', 'ner', 'faults_recognizer']


Let's try it with tweets written in the style of the official account.

In [10]:
def print_recognizer_info(text):
    """Run the pipeline on a text and print the fault annotations found."""
    doc = nlp(text)
    print('The text contains a fault in the metro in it: ', doc._.has_metro_fault)
    print('Tokens and if it is part of a fault:', [(t, t._.is_metro_fault) for t in doc])
    print('Entities that are an event:', [(e.text, e.label_) for e in doc.ents if e.label_ == 'EVENT']) 

In [11]:
test = faults[0]['text']
print(test)
print_recognizer_info(test)

Circulación interrumpida en L6, entre Laguna y Moncloa en ambos sentidos, por causas técnicas. Tiempo estimado más de 15 minutos.
The text contains a fault in the metro in it:  True
Tokens and if it is part of a fault: [(Circulación interrumpida, True), (en, False), (L6, False), (,, False), (entre, False), (Laguna, False), (y, False), (Moncloa, False), (en, False), (ambos, False), (sentidos, False), (,, False), (por, False), (causas, False), (técnicas, False), (., False), (Tiempo, False), (estimado, False), (más, False), (de, False), (15, False), (minutos, False), (., False)]
Entities that are an event: [('Circulación interrumpida', 'EVENT')]


In [12]:
test = faults[1]['text']
print(test)
print_recognizer_info(test)

Circulación lenta en línea 6 entre Príncipe Pío y Metropolitano, dirección andén 2, por causas técnicas.
The text contains a fault in the metro in it:  True
Tokens and if it is part of a fault: [(Circulación lenta, True), (en, False), (línea, False), (6, False), (entre, False), (Príncipe, False), (Pío, False), (y, False), (Metropolitano, False), (,, False), (dirección, False), (andén, False), (2, False), (,, False), (por, False), (causas, False), (técnicas, False), (., False)]
Entities that are an event: [('Circulación lenta', 'EVENT')]


In [13]:
test = faults[2]['text']
print(test)
print_recognizer_info(test)

Circulacion interrumpida en LÍNEA 10 en ambos sentidos entre Nuevos Ministerios y Begoña por asistencia sanitaria a un viajero. Tiempo estimado de solución +30 min
The text contains a fault in the metro in it:  True
Tokens and if it is part of a fault: [(Circulacion interrumpida, True), (en, False), (LÍNEA, False), (10, False), (en, False), (ambos, False), (sentidos, False), (entre, False), (Nuevos, False), (Ministerios, False), (y, False), (Begoña, False), (por, False), (asistencia, False), (sanitaria, False), (a, False), (un, False), (viajero, False), (., False), (Tiempo, False), (estimado, False), (de, False), (solución, False), (+, False), (30, False), (min, False)]
Entities that are an event: [('Circulacion interrumpida', 'EVENT')]


In [14]:
test = faults[3]['text']
print(test)
print_recognizer_info(test)

El tramo de circulación interrumpida en linea10 por asistencia sanitaria a un viajero pasa a ser entre Begoña y Cuzco. Tiempo estimado solución + 30 min.
The text contains a fault in the metro in it:  True
Tokens and if it is part of a fault: [(El, False), (tramo, False), (de, False), (circulación interrumpida, True), (en, False), (linea10, False), (por, False), (asistencia, False), (sanitaria, False), (a, False), (un, False), (viajero, False), (pasa, False), (a, False), (ser, False), (entre, False), (Begoña, False), (y, False), (Cuzco, False), (., False), (Tiempo, False), (estimado, False), (solución, False), (+, False), (30, False), (min, False), (., False)]
Entities that are an event: [('circulación interrumpida', 'EVENT')]


In [15]:
test = faults[4]['text']
print(test)
print_recognizer_info(test)

Circulacion lenta en Linea 1 entre Bilbao y Pinar de Chamartín, en ambos sentidos, por causas técnicas.
The text contains a fault in the metro in it:  True
Tokens and if it is part of a fault: [(Circulacion lenta, True), (en, False), (Linea, False), (1, False), (entre, False), (Bilbao, False), (y, False), (Pinar, False), (de, False), (Chamartín, False), (,, False), (en, False), (ambos, False), (sentidos, False), (,, False), (por, False), (causas, False), (técnicas, False), (., False)]
Entities that are an event: [('Circulacion lenta', 'EVENT')]


In [16]:
test = faults[5]['text']
print(test)
print_recognizer_info(test)

Circulación lenta en Línea 2 entre las estaciones de Goya y Avenida de Guadalajara, en ambos sentidos por causas técnicas. Tiempo estimado de solución: +30 min
The text contains a fault in the metro in it:  True
Tokens and if it is part of a fault: [(Circulación lenta, True), (en, False), (Línea, False), (2, False), (entre, False), (las, False), (estaciones, False), (de, False), (Goya, False), (y, False), (Avenida, False), (de, False), (Guadalajara, False), (,, False), (en, False), (ambos, False), (sentidos, False), (por, False), (causas, False), (técnicas, False), (., False), (Tiempo, False), (estimado, False), (de, False), (solución, False), (:, False), (+, False), (30, False), (min, False)]
Entities that are an event: [('Circulación lenta', 'EVENT')]
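
Rather than checking the tweets one by one, the share of tweets flagged as a fault could be computed in a single pass. The sketch below assumes each tweet is a dict with a `text` key, as in the cells above; `detection_rate` and `keyword_detector` are hypothetical names. The stand-in detector only exists so the example runs without spaCy. In this notebook the detector would be `lambda text: nlp(text)._.has_metro_fault`.

```python
def detection_rate(tweets, detector):
    """Return the fraction of tweets whose 'text' the detector flags."""
    if not tweets:
        return 0.0
    hits = sum(1 for tweet in tweets if detector(tweet['text']))
    return hits / len(tweets)


def keyword_detector(text):
    """Stand-in detector: flags texts that mention 'circulación' or 'circulacion'."""
    lowered = text.lower()
    return 'circulación' in lowered or 'circulacion' in lowered


sample_tweets = [
    # The first two texts are taken from the outputs above
    {'text': 'Circulación lenta en línea 6 entre Príncipe Pío y Metropolitano, '
             'dirección andén 2, por causas técnicas.'},
    {'text': 'Circulacion lenta en Linea 1 entre Bilbao y Pinar de Chamartín, '
             'en ambos sentidos, por causas técnicas.'},
    {'text': 'Buenos días'},  # hypothetical text without a fault
]

rate = detection_rate(sample_tweets, keyword_detector)
print(rate)
```

Running the same function over `solutions` or `strikes` would give a rough idea of how often the fault patterns fire on tweets that are not fault announcements.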
