# Recognising Madrid underground service problems

This notebook focuses on recognising the different problems that can occur in the Madrid underground system. The official account reports three kinds of situations that can affect circulation:

- **Faults:** when they occur and when they are fixed.
- **Delays:** when they start and when circulation returns to normal.
- **Strikes:** reminders of the days on which a strike has been called.

Some examples of the phrases used in the tweets that announce these disruptions are:

- **Faults:** servicio interrumpido, circulación interrumpida, circulación lenta, trenes no efectúan parada, tramo interrumpido.
- **Delays:** retrasos, min de espera, minutos de espera, minutos esperando.
- **Strikes:** huelga, servicios mínimos
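
As a rough baseline, the phrase lists above can be checked with plain regular expressions before any spaCy pipeline is involved. The sketch below is only an illustration: the category names and the `categorize` helper are assumptions, and the phrases are copied from the examples above rather than from `config.ini`.

```python
import re

# Example phrases copied from the lists above; category names are illustrative.
KEYWORDS = {
    "fault": ["servicio interrumpido", "circulación interrumpida", "circulación lenta",
              "trenes no efectúan parada", "tramo interrumpido"],
    "delay": ["retrasos", "min de espera", "minutos de espera", "minutos esperando"],
    "strike": ["huelga", "servicios mínimos"],
}

# One case-insensitive regex per category, matching any of its phrases as whole words.
PATTERNS = {
    category: re.compile(r"\b(?:" + "|".join(map(re.escape, phrases)) + r")\b", re.IGNORECASE)
    for category, phrases in KEYWORDS.items()
}

def categorize(text):
    """Return the set of categories whose phrases appear in the text."""
    return {category for category, pattern in PATTERNS.items() if pattern.search(text)}

print(categorize("Circulación lenta en línea 6 entre Príncipe Pío y Metropolitano"))
```

A baseline like this has no notion of tokens or entities, which is what the spaCy components in the rest of the notebook add.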



In [1]:
import configparser
import json
import re

import spacy
from spacy.matcher import PhraseMatcher, Matcher
from spacy.tokens import Doc, Span, Token

In [2]:
nlp = spacy.load('es_core_news_md')

In [3]:
config = configparser.ConfigParser()
config.read('config.ini')

['config.ini']

In [4]:
faults_texts = config['keywords']['faults'].split(',')
metro_account_id = config['accounts']['metro_madrid']
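
The contents of `config.ini` are not shown in this notebook. The sketch below illustrates the shape the cell above appears to expect: a `[keywords]` section with a comma-separated `faults` entry and an `[accounts]` section with a `metro_madrid` entry. Every value is a placeholder, and the names `EXAMPLE_INI` and `example_config` are hypothetical.

```python
import configparser

# Hypothetical file contents: section and key names follow the cell above,
# but the values are placeholders, not the real configuration.
EXAMPLE_INI = """
[keywords]
faults = circulación interrumpida, circulación lenta, tramo interrumpido

[accounts]
metro_madrid = 0000000000
"""

example_config = configparser.ConfigParser()
example_config.read_string(EXAMPLE_INI)

# Strip the whitespace around each keyword, which a bare split(',') would keep.
example_faults = [kw.strip() for kw in example_config['keywords']['faults'].split(',')]
example_account_id = example_config['accounts']['metro_madrid']
print(example_faults, example_account_id)
```

Stripping each keyword matters if the file puts spaces after the commas, since otherwise the keywords would start with a space.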

In [5]:
with open('../data/fault_tweets.json') as json_data:
    faults = json.load(json_data)
with open('../data/solution_tweets.json') as json_data:
    solutions = json.load(json_data)
with open('../data/strikes_tweets.json') as json_data:
    strikes = json.load(json_data)

In [6]:
class MetroFaultsRecognizer(object):
    """Pipeline component that recognises Madrid underground system faults
    and sets entity annotations on the text that holds them. This allows
    easy recognition and handling of the faults.

    The faults are labelled as an EVENT and their spans are merged into one token.
    Additionally, ._.has_metro_fault is set on the Doc/Span and
    ._.is_metro_fault on the Token.
    """
    name = 'faults_recognizer'  # component name, will show up in the pipeline

    def __init__(self, nlp):
        """
        Initialise the pipeline component. The shared nlp instance is used
        to initialise the matcher, which is loaded with the hard-coded token
        patterns that describe a fault.
        
        :param nlp: spaCy nlp instance
        """
        self.label = nlp.vocab.strings['EVENT']  # get entity label ID
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add('METRO_FAULTS', None, [{'LOWER': 'circulación'}, {'LOWER': 'interrumpida'}],
                         [{'LOWER': 'incidencia'}],
                         [{'LOWER': 'circulacion'}, {'LOWER': 'interrumpida'}],
                         [{'LOWER': 'circulación'}, {'LOWER': 'lenta'}],
                         [{'LOWER': 'circulacion'}, {'LOWER': 'lenta'}],
                         [{'LOWER': 'tramo'}, {'LOWER': 'interrumpido'}],
                         [{'LOWER': 'tramo'}, {'LOWER': 'cortado'}],
                         [{'LOWER': 'servicio'}, {'LOWER': 'interrumpido'}],
                         [{'LOWER': 'servicio'}, {'LOWER': 'cortado'}],
                         [{'LOWER': 'trenes'}, {'LOWER': 'no'}, {'LOWER': 'efectúan'}, {'LOWER': 'parada'}], 
                         [{'LOWER': 'trenes'}, {'LOWER': 'no'}, {'LOWER': 'efectuan'}, {'LOWER': 'parada'}])
        
        # Register attribute on the Token. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        Token.set_extension('is_metro_fault', default=False)

        # Register attributes on Doc and Span via a getter that checks if one of
        # the contained tokens is set to is_metro_fault == True.
        Doc.set_extension('has_metro_fault', getter=self.has_metro_fault)
        Span.set_extension('has_metro_fault', getter=self.has_metro_fault)

    def __call__(self, doc):
        """Apply the pipeline component on a Doc object and modify it if matches
        are found. Return the Doc, so it can be processed by the next component
        in the pipeline, if available.
        
        :param doc: text to be analysed
        :return: text updated with the tags and the entities matched
        """
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # Generate Span representing the entity and set label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            # Set custom attribute on each token of the entity
            for token in entity:
                token._.set('is_metro_fault', True)
            # Overwrite doc.ents and add entity
            doc.ents = list(doc.ents) + [entity]
        for span in spans:
            # Iterate over all spans and merge them into one token.
            span.merge()
        return doc

    @staticmethod
    def has_metro_fault(tokens):
        """
        Getter for Doc and Span attributes
        
        :param tokens: tokens of the Doc or the Span, that is, the text
        :return: True if one of the tokens is a matched element
        """
        return any([token._.get('is_metro_fault') for token in tokens])
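
The patterns in this component list most phrases twice, with and without accents. A possible alternative, sketched here with the standard library only and not taken from the notebook, is to strip the accents from the text before comparing it, so that one spelling of each phrase is enough. The helper name `strip_accents` is an assumption.

```python
import unicodedata

def strip_accents(text):
    """Lowercase the text and remove combining accent marks."""
    # NFD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize('NFD', text.lower())
    # Category 'Mn' (non-spacing mark) covers the combining accents.
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

print(strip_accents('Circulación interrumpida'))
print(strip_accents('Trenes no efectúan parada'))
```

Note that this also turns `ñ` into `n` (for example `mañana` becomes `manana`), so the patterns would have to be written in the same normalised form.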


In [7]:
class MetroDelaysRecognizer(object):
    """Pipeline component that recognises Madrid underground system delays
    and sets entity annotations on the text that holds them. This allows
    easy recognition and handling of the delays.

    The delays are labelled as an EVENT and their spans are merged into one token.
    Additionally, ._.has_metro_delay is set on the Doc/Span and
    ._.is_metro_delay on the Token.
    """
    name = 'delays_recognizer'  # component name, will show up in the pipeline

    def __init__(self, nlp):
        """
        Initialise the pipeline component. The shared nlp instance is used
        to initialise the matcher, which is loaded with the hard-coded token
        patterns that describe a delay.
        
        :param nlp: spaCy nlp instance
        """
        self.label = nlp.vocab.strings['EVENT']  # get entity label ID
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add('METRO_DELAYS', None, [{'LOWER': 'retrasos'}], [{'LOWER': 'retraso'}], [{'LOWER': 'frequencia'}],
                         [{'LOWER': 'minutos'}, {'LOWER': 'de'}, {'LOWER': 'espera'}],
                         [{'LOWER': 'min'}, {'LOWER': 'de'}, {'LOWER': 'espera'}],
                         [{'LOWER': 'minutos'}, {'LOWER': 'de'}, {'LOWER': 'retraso'}],
                         [{'LOWER': 'min'}, {'LOWER': 'de'}, {'LOWER': 'retraso'}],
                         [{'LOWER': 'minutos'}, {'LOWER': 'esperando'}],
                         [{'LOWER': 'tiempo'}, {'LOWER': 'de'}, {'LOWER': 'espera'}],
                         [{'LOWER': 'tiempos'}, {'LOWER': 'de'}, {'LOWER': 'espera'}],
                         [{'LOWER': 'frecuencia'}, {'LOWER': 'de'}, {'LOWER': 'paso'}],
                         [{'LOWER': 'frecuencias'}, {'LOWER': 'de'}, {'LOWER': 'paso'}],
                         [{'LOWER': 'frecuencias'}, {'LOWER': 'de'}, {'LOWER': 'trenes'}])
        
        # Register attribute on the Token. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        Token.set_extension('is_metro_delay', default=False)

        # Register attributes on Doc and Span via a getter that checks if one of
        # the contained tokens is set to is_metro_delay == True.
        Doc.set_extension('has_metro_delay', getter=self.has_metro_delay)
        Span.set_extension('has_metro_delay', getter=self.has_metro_delay)

    def __call__(self, doc):
        """Apply the pipeline component on a Doc object and modify it if matches
        are found. Return the Doc, so it can be processed by the next component
        in the pipeline, if available.
        
        :param doc: text to be analysed
        :return: text updated with the tags and the entities matched
        """
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # Generate Span representing the entity and set label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            # Set custom attribute on each token of the entity
            for token in entity:
                token._.set('is_metro_delay', True)
            # Overwrite doc.ents and add entity
            doc.ents = list(doc.ents) + [entity]
        for span in spans:
            # Iterate over all spans and merge them into one token.
            span.merge()
        return doc

    @staticmethod
    def has_metro_delay(tokens):
        """
        Getter for Doc and Span attributes
        
        :param tokens: tokens of the Doc or the Span, that is, the text
        :return: True if one of the tokens is a matched element
        """
        return any([token._.get('is_metro_delay') for token in tokens])


In [8]:
class MetroSolutionsRecognizer(object):
    """Pipeline component that recognises Madrid underground system solutions to
    faults and delays and sets entity annotations on the text that holds them.
    This allows easy recognition and handling of the solutions.

    The solutions are labelled as an EVENT and their spans are merged into one token.
    Additionally, ._.has_metro_solution is set on the Doc/Span and
    ._.is_metro_solution on the Token.
    """
    name = 'solutions_recognizer'  # component name, will show up in the pipeline

    def __init__(self, nlp):
        """
        Initialise the pipeline component. The shared nlp instance is used
        to initialise the matcher, which is loaded with the hard-coded token
        patterns that describe a solution.
        
        :param nlp: spaCy nlp instance
        """
        self.label = nlp.vocab.strings['EVENT']  # get entity label ID
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add('METRO_SOLUTIONS', None, [{'LOWER': 'circulación'}, {'LOWER': 'normalizada'}],
                         [{'LOWER': 'circulacion'}, {'LOWER': 'normalizada'}],
                         [{'LOWER': 'servicio'}, {'LOWER': 'normalizado'}],
                         [{'LOWER': 'normalizado'}, {'LOWER': 'el'}, {'LOWER': 'servicio'}],
                         [{'LOWER': 'restablecido'}, {'LOWER': 'el'}, {'LOWER': 'servicio'}],
                         [{'LOWER': 'ya'}, {'LOWER': 'efectúan'}, {'LOWER': 'parada'}],
                         [{'LOWER': 'ya'}, {'LOWER': 'efectuan'}, {'LOWER': 'parada'}])
        
        # Register attribute on the Token. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        Token.set_extension('is_metro_solution', default=False)

        # Register attributes on Doc and Span via a getter that checks if one of
        # the contained tokens is set to is_metro_solution == True.
        Doc.set_extension('has_metro_solution', getter=self.has_metro_solution)
        Span.set_extension('has_metro_solution', getter=self.has_metro_solution)

    def __call__(self, doc):
        """Apply the pipeline component on a Doc object and modify it if matches
        are found. Return the Doc, so it can be processed by the next component
        in the pipeline, if available.
        
        :param doc: text to be analysed
        :return: text updated with the tags and the entities matched
        """
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # Generate Span representing the entity and set label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            # Set custom attribute on each token of the entity
            for token in entity:
                token._.set('is_metro_solution', True)
            # Overwrite doc.ents and add entity
            doc.ents = list(doc.ents) + [entity]
        for span in spans:
            # Iterate over all spans and merge them into one token.
            span.merge()
        return doc

    @staticmethod
    def has_metro_solution(tokens):
        """
        Getter for Doc and Span attributes
        
        :param tokens: tokens of the Doc or the Span, that is, the text
        :return: True if one of the tokens is a matched element
        """
        return any([token._.get('is_metro_solution') for token in tokens])


In [9]:
class MetroStrikesRecognizer(object):
    """Pipeline component that recognises Madrid underground system strikes
    and sets entity annotations on the text that holds them. This allows
    easy recognition and handling of the strikes.

    The strikes are labelled as an EVENT and their spans are merged into one token.
    Additionally, ._.has_metro_strike is set on the Doc/Span and
    ._.is_metro_strike on the Token.
    """
    name = 'strikes_recognizer'  # component name, will show up in the pipeline

    def __init__(self, nlp):
        """
        Initialise the pipeline component. The shared nlp instance is used
        to initialise the matcher, which is loaded with the hard-coded token
        patterns that describe a strike.
        
        :param nlp: spaCy nlp instance
        """
        self.label = nlp.vocab.strings['EVENT']  # get entity label ID
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add('METRO_STRIKES', None, [{'LOWER': 'huelga'}], [{'LOWER': 'paros'}],
                         [{'LOWER': 'servicios'}, {'LOWER': 'minimos'}],
                         [{'LOWER': 'servicios'}, {'LOWER': 'mínimos'}],
                         [{'LOWER': 'paros'}, {'LOWER': 'convocados'}, {'LOWER': 'para'}, {'LOWER': 'hoy'}],
                         [{'LOWER': 'paros'}, {'LOWER': 'convocados'}, {'LOWER': 'para'}, {'LOWER': 'mañana'}],
                         [{'LOWER': 'paros'}, {'LOWER': 'convocados'}, {'LOWER': 'para'}, {'LOWER': 'el'}, 
                          {'LOWER': 'dia'}, {'IS_DIGIT': True}],
                         [{'LOWER': 'paros'}, {'LOWER': 'convocados'}, {'LOWER': 'para'}, {'LOWER': 'el'}, 
                          {'LOWER': 'día'}, {'IS_DIGIT': True}],
                         [{'LOWER': 'paros'}, {'LOWER': 'convocados'}])
        
        # Register attribute on the Token. We'll be overwriting this based on
        # the matches, so we're only setting a default value, not a getter.
        Token.set_extension('is_metro_strike', default=False)

        # Register attributes on Doc and Span via a getter that checks if one of
        # the contained tokens is set to is_metro_strike == True.
        Doc.set_extension('has_metro_strike', getter=self.has_metro_strike)
        Span.set_extension('has_metro_strike', getter=self.has_metro_strike)

    def __call__(self, doc):
        """Apply the pipeline component on a Doc object and modify it if matches
        are found. Return the Doc, so it can be processed by the next component
        in the pipeline, if available.
        
        :param doc: text to be analysed
        :return: text updated with the tags and the entities matched
        """
        matches = self.matcher(doc)
        spans = []  # keep the spans for later so we can merge them afterwards
        for _, start, end in matches:
            # Generate Span representing the entity and set label
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
            # Set custom attribute on each token of the entity
            for token in entity:
                token._.set('is_metro_strike', True)
            # Overwrite doc.ents and add entity
            doc.ents = list(doc.ents) + [entity]
        for span in spans:
            # Iterate over all spans and merge them into one token.
            span.merge()
        return doc

    @staticmethod
    def has_metro_strike(tokens):
        """
        Getter for Doc and Span attributes
        
        :param tokens: tokens of the Doc or the Span, that is, the text
        :return: True if one of the tokens is a matched element
        """
        return any([token._.get('is_metro_strike') for token in tokens])
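
Several strike patterns overlap: `paros` is a prefix of `paros convocados`, which is in turn a prefix of `paros convocados para hoy`, and the matcher reports every one of them. Newer spaCy releases reject overlapping entity spans, so a filtering step that keeps only the longest match may be needed there. The sketch below shows one way to do that on plain `(start, end)` token offsets. It is not part of the notebook, and the helper name and the offsets are illustrative. spaCy ships a similar helper, `spacy.util.filter_spans`, that works on `Span` objects.

```python
def filter_overlapping(matches):
    """Keep the longest matches, dropping any that overlap an already kept one.

    `matches` is a list of (start, end) token offsets with `end` exclusive.
    """
    # Longest first; ties are broken by the earliest start.
    ordered = sorted(matches, key=lambda m: (m[1] - m[0], -m[0]), reverse=True)
    kept, used = [], set()
    for start, end in ordered:
        if not any(i in used for i in range(start, end)):
            kept.append((start, end))
            used.update(range(start, end))
    return sorted(kept)

# Illustrative offsets for "servicios mínimos", "paros", "paros convocados"
# and "paros convocados para hoy" in a single tweet.
matches = [(3, 5), (7, 8), (7, 9), (7, 11)]
print(filter_overlapping(matches))
```

With a step like this, only `servicios mínimos` and the longest `paros ...` span would be turned into entities.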


In [10]:
fault_detector = MetroFaultsRecognizer(nlp=nlp)
delay_detector = MetroDelaysRecognizer(nlp=nlp)
solution_detector = MetroSolutionsRecognizer(nlp=nlp)
strike_detector = MetroStrikesRecognizer(nlp=nlp)
nlp.add_pipe(fault_detector, last=True)
nlp.add_pipe(delay_detector, last=True)
nlp.add_pipe(solution_detector, last=True)
nlp.add_pipe(strike_detector, last=True)
print(nlp.pipe_names)

['tagger', 'parser', 'ner', 'faults_recognizer', 'delays_recognizer', 'solutions_recognizer', 'strikes_recognizer']


Let's try it with tweets written in the style of the official account.

In [11]:
def print_recognizer_info(text):
    """Run the pipeline on the text and print which events were recognised."""
    doc = nlp(text)
    print('The text contains a fault in it: ', doc._.has_metro_fault)
    print('The text contains a delay in it: ', doc._.has_metro_delay)
    print('The text contains a solution in it: ', doc._.has_metro_solution)
    print('The text contains a strike in it: ', doc._.has_metro_strike)
    print('Entities that are an event:', [(e.text, e.label_) for e in doc.ents if e.label_ == 'EVENT'])

In [12]:
test = faults[0]['text']
print(test)
print_recognizer_info(test)

Circulación interrumpida en L6, entre Laguna y Moncloa en ambos sentidos, por causas técnicas. Tiempo estimado más de 15 minutos.
The text contains a fault in it:  True
The text contains a delay in it:  False
The text contains a solution in it:  False
The text contains a strike in it:  False
Entities that are an event: [('Circulación interrumpida', 'EVENT')]


In [13]:
test = faults[1]['text']
print(test)
print_recognizer_info(test)

Circulación lenta en línea 6 entre Príncipe Pío y Metropolitano, dirección andén 2, por causas técnicas.
The text contains a fault in it:  True
The text contains a delay in it:  False
The text contains a solution in it:  False
The text contains a strike in it:  False
Entities that are an event: [('Circulación lenta', 'EVENT')]


In [14]:
test = faults[2]['text']
print(test)
print_recognizer_info(test)

Circulacion interrumpida en LÍNEA 10 en ambos sentidos entre Nuevos Ministerios y Begoña por asistencia sanitaria a un viajero. Tiempo estimado de solución +30 min
The text contains a fault in it:  True
The text contains a delay in it:  False
The text contains a solution in it:  False
The text contains a strike in it:  False
Entities that are an event: [('Circulacion interrumpida', 'EVENT')]


In [15]:
test = faults[3]['text']
print(test)
print_recognizer_info(test)

El tramo de circulación interrumpida en linea10 por asistencia sanitaria a un viajero pasa a ser entre Begoña y Cuzco. Tiempo estimado solución + 30 min.
The text contains a fault in it:  True
The text contains a delay in it:  False
The text contains a solution in it:  False
The text contains a strike in it:  False
Entities that are an event: [('circulación interrumpida', 'EVENT')]


In [16]:
test = faults[4]['text']
print(test)
print_recognizer_info(test)

Circulacion lenta en Linea 1 entre Bilbao y Pinar de Chamartín, en ambos sentidos, por causas técnicas.
The text contains a fault in it:  True
The text contains a delay in it:  False
The text contains a solution in it:  False
The text contains a strike in it:  False
Entities that are an event: [('Circulacion lenta', 'EVENT')]


In [17]:
test = faults[5]['text']
print(test)
print_recognizer_info(test)

Circulación lenta en Línea 2 entre las estaciones de Goya y Avenida de Guadalajara, en ambos sentidos por causas técnicas. Tiempo estimado de solución: +30 min
The text contains a fault in it:  True
The text contains a delay in it:  False
The text contains a solution in it:  False
The text contains a strike in it:  False
Entities that are an event: [('Circulación lenta', 'EVENT')]


In [18]:
test = solutions[0]['text']
print(test)
print_recognizer_info(test)

Circulación normalizada en L6 entre Laguna y Moncloa.
The text contains a fault in it:  False
The text contains a delay in it:  False
The text contains a solution in it:  True
The text contains a strike in it:  False
Entities that are an event: [('Circulación normalizada', 'EVENT')]


In [19]:
test = solutions[5]['text']
print(test)
print_recognizer_info(test)

Restablecido el servicio en L6 entre Metropolitano y Argüelles.
The text contains a fault in it:  False
The text contains a delay in it:  False
The text contains a solution in it:  True
The text contains a strike in it:  False
Entities that are an event: [('Restablecido el servicio', 'EVENT')]


In [20]:
test = strikes[0]['text']
print(test)
print_recognizer_info(test)

➡ RECUERDA: servicios mínimos durante los paros convocados para hoy:
The text contains a fault in it:  False
The text contains a delay in it:  False
The text contains a solution in it:  False
The text contains a strike in it:  True
Entities that are an event: [('servicios mínimos', 'EVENT'), ('paros convocados para hoy', 'EVENT')]


In [21]:
test = strikes[1]['text']
print(test)
print_recognizer_info(test)

➡ INFORMACIÓN: servicios mínimos durante los paros convocados para el día 15 de marzo:
The text contains a fault in it:  False
The text contains a delay in it:  False
The text contains a solution in it:  False
The text contains a strike in it:  True
Entities that are an event: [('servicios mínimos', 'EVENT'), ('paros convocados para el día 15', 'EVENT')]
