# Make LSTMs Great Again

# Named Entity Recognition on Twitter Data

## Reading the Data

The corpus contains tweets annotated with named entity tags. Each line of the corpus holds one token and its tag, separated by a space.

Tweets are separated from one another by a blank line.

While reading, usernames (tokens starting with `@`) are replaced with `<USR>`, and URLs (tokens starting with `http://` or `https://`) are replaced with `<URL>`.
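To make the file format and the substitution rules concrete, here is a minimal sketch that works on a tiny in-memory sample. The sample text and the helper name `normalize_token` are made up for illustration and are not part of the dataset or the notebook's own code.

```python
# Hypothetical sample in the corpus format: one "token tag" pair per line,
# with a blank line between tweets. Not taken from the real dataset.
sample = """\
@alice O
loves O
Ghostland B-musicartist
Observatory I-musicartist

see O
https://example.com O
"""


def normalize_token(token):
    """Illustrative helper: map usernames and links to placeholder tokens."""
    if token.startswith("@"):
        return "<USR>"
    if token.startswith(("http://", "https://")):
        return "<URL>"
    return token


# Split the sample into tweets on blank lines, then into (token, tag) pairs.
tweets = [
    [tuple(line.split()) for line in block.splitlines()]
    for block in sample.strip().split("\n\n")
]

# Apply the placeholder substitution to every token, leaving tags untouched.
normalized = [
    [(normalize_token(token), tag) for token, tag in tweet]
    for tweet in tweets
]
print(normalized)
```

Only the token is rewritten; the tag attached to a username or link stays as it was in the file.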

In [1]:
def read_data(file_path):
    """Read a token-per-line corpus and return parallel lists of tokens and tags."""
    tokens = []  # One list of tokens per tweet
    tags = []    # One list of tags per tweet, aligned with tokens

    tweet_tokens = []
    tweet_tags = []
    with open(file_path, encoding='utf-8') as corpus:
        for line in corpus:
            line = line.strip()  # Remove leading and trailing whitespace
            if not line:
                # A blank line marks the end of the current tweet
                if tweet_tokens:
                    tokens.append(tweet_tokens)
                    tags.append(tweet_tags)
                tweet_tokens = []
                tweet_tags = []
            else:
                token, tag = line.split()
                if token.startswith("@"):
                    token = "<USR>"  # Replace usernames with <USR>
                elif token.startswith(("http://", "https://")):
                    token = "<URL>"  # Replace links with <URL>
                tweet_tokens.append(token)
                tweet_tags.append(tag)

    # Keep the last tweet if the file does not end with a blank line
    if tweet_tokens:
        tokens.append(tweet_tokens)
        tags.append(tweet_tags)

    return tokens, tags

### Loading the Train, Validation and Test Data

In [3]:
train_tokens, train_tags = read_data('Data/train.txt')
validation_tokens, validation_tags = read_data('Data/validation.txt')
test_tokens, test_tags = read_data('Data/test.txt')

### Exploring the Data

In [9]:
for word in train_tokens[0]: print(word, end=" ")

RT <USR> : Online ticket sales for Ghostland Observatory extended until 6 PM EST due to high demand . Get them before they sell out ... 

In [10]:
for tag in train_tags[0]: print(tag, end=" ")

O O O O O O O B-musicartist I-musicartist O O O O O O O O O O O O O O O O O 
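The tags above appear to follow the common BIO convention: `B-` marks the first token of an entity, `I-` marks a continuation of it, and `O` marks a token outside any entity. As a rough sketch under that assumption, the hypothetical helper `extract_entities` below groups tagged tokens back into entity spans. It is an illustration only, not part of the notebook's pipeline.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close any open span first.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span,
            # closes it. A stray I- token is dropped in this sketch.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []

    # Close a span that runs to the end of the tweet.
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


# A short slice of the tweet shown above.
tokens = ["for", "Ghostland", "Observatory", "extended"]
tags = ["O", "B-musicartist", "I-musicartist", "O"]
print(extract_entities(tokens, tags))
# [('musicartist', 'Ghostland Observatory')]
```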

Each element of `train_tokens` is a tweet, represented as a list of tokens. The element at the same index in `train_tags` holds the tag for each of those tokens.

In [18]:
print("We have", len(train_tokens), "tweets")

We have 5795 tweets
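Beyond the tweet count, it can help to look at how often each tag occurs and how long the tweets are. The sketch below uses a small hypothetical list, `sample_tags`, as a stand-in for `train_tags` so that it runs on its own. The tag names in it are illustrative and may not match the dataset's full tag set.

```python
from collections import Counter

# Hypothetical stand-in for train_tags: one list of tags per tweet.
sample_tags = [
    ["O", "B-musicartist", "I-musicartist", "O"],
    ["B-person", "O", "O"],
]

# Count every tag across all tweets.
tag_counts = Counter(tag for tweet_tags in sample_tags for tag in tweet_tags)
for tag, count in tag_counts.most_common():
    print(tag, count)

# Tweet lengths, measured in tokens (one tag per token).
lengths = [len(tweet_tags) for tweet_tags in sample_tags]
print("Longest tweet:", max(lengths), "tokens")
```

Replacing `sample_tags` with `train_tags` would give the same summary for the real training set. In NER corpora the `O` tag usually dominates, which is worth keeping in mind when choosing evaluation metrics.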
