# Executive summary

#### Steps one to five below were followed in order to analyse tweets matching the search terms over a period of 6 months, i.e. April 2019 to September 2019.

Step 1: Scraped tweets for a period of 6 months. This resulted in approximately 43.7K tweets.

Step 2: Added a list of relevant financial institutions to the tweets.

Step 3: Extracted the relevant topics from each tweet using LDA for topic extraction.

Step 4: Sentiment extraction was performed using an LSTM-based deep learning model.

Step 5: Entity recognition was performed using the open-source 3-class Stanford NER tagger, i.e. Organization, Location and Person.

#### The results of the analysis are visualised using Tableau and are available via the link below: https://public.tableau.com/profile/prajwal3523#!/vizhome/TwitterAnalysis_15704300313050/SummaryofTweets


## Step 1: Scrape Tweet Data

In [1]:
import GetOldTweets3 as got
import pandas as pd
import warnings
import re
import numpy as np

# NLTK functions
import nltk
from nltk.corpus import stopwords
from nltk import tokenize as tok
from nltk.stem.snowball import SnowballStemmer # NLTK's Snowball stemmer
import lda # topic modelling (LDA)
import string
from nltk.tag import StanfordNERTagger

warnings.simplefilter("ignore", DeprecationWarning)
# Load the LDA model from sk-learn
from sklearn.decomposition import LatentDirichletAllocation as LDA
# Tf-Idf and Clustering packages
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer


In [2]:
search_terms = ['mortgage','current account','savings account','insurance','credit card','pension',
                'personal loan','money transfer','tax advice','investment','wealth management']

In [None]:
## scrape data from twitter for the above search terms
tweet_df_all = pd.DataFrame()
for term in search_terms:
    print(term)
    tweetCriteria = got.manager.TweetCriteria().setQuerySearch(term)\
                                               .setSince("2019-04-01")\
                                               .setUntil("2019-09-30")\
                                               .setNear("London")\
                                               .setWithin("310mi")
    tweet = got.manager.TweetManager.getTweets(tweetCriteria)
    tweet_list = [[tweet[x].id,
                  tweet[x].author_id,
                  tweet[x].text,
                  tweet[x].retweets,
                  tweet[x].permalink,
                  tweet[x].date,
                  tweet[x].formatted_date,
                  tweet[x].favorites,
                  tweet[x].mentions,
                  tweet[x].hashtags,
                  tweet[x].geo,
                  tweet[x].urls
                 ]for x in range(0, len(tweet))]
    tweet_df = pd.DataFrame(tweet_list)
    tweet_df['search_term'] = term
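    tweet_df_all = pd.concat([tweet_df_all, tweet_df])  # DataFrame.append was removed in pandas 2.0

# 12 scraped fields plus the search_term column added inside the loop
tweet_df_all.columns = ['id','author_id','text','retweets','permalink','date','formatted_date','favorites','mentions','hashtags','geo','urls','search_term']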
tweet_df_all.to_csv('../data/all_tweets.csv', index=False)
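The scraping cell builds each row as a positional list and names the columns afterwards, so the column list has to stay in step with the fields by hand. A minimal sketch of an alternative, assuming the same twelve tweet attributes, is to build one dict per tweet so every value carries its column name. `tweets_to_rows` and the `SimpleNamespace` stand-in are hypothetical and only illustrate the idea; they are not part of the scraper library.

```python
from types import SimpleNamespace

# Field names copied from the scraping loop above.
FIELDS = ['id', 'author_id', 'text', 'retweets', 'permalink', 'date',
          'formatted_date', 'favorites', 'mentions', 'hashtags', 'geo', 'urls']

def tweets_to_rows(tweets, term):
    """Hypothetical helper: one dict per tweet, keyed by column name."""
    rows = []
    for t in tweets:
        row = {field: getattr(t, field, None) for field in FIELDS}
        row['search_term'] = term
        rows.append(row)
    return rows

# Stand-in for a scraped tweet object (real ones come from the scraper).
fake_tweet = SimpleNamespace(id=1, author_id=2, text='mortgage rates are up',
                             retweets=0, permalink='', date=None,
                             formatted_date='', favorites=0, mentions='',
                             hashtags='', geo='', urls='')
rows = tweets_to_rows([fake_tweet], 'mortgage')
print(len(rows[0]))  # 13 keys: 12 tweet fields plus search_term
```

Passing such a list of dicts to `pd.DataFrame` yields named columns directly, so no separate column assignment is needed.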


In [27]:
tweet_df_all.shape

(0, 0)

## Step 2: Add List of Financial Institutions providing the above products

In [62]:
tweet_df_all = pd.read_csv('../data/all_tweets.csv')
tweet_df_all = tweet_df_all[tweet_df_all['text'].notna()]

In [63]:
print(tweet_df_all.shape);tweet_df_all.head()

(43704, 13)


Unnamed: 0,id,author_id,text,retweets,permalink,date,formatted_date,favorites,mentions,hashtags,geo,urls,search_term
0,1178457108276289536,40080176,This normalisation of no deal is horrendous. P...,0,https://twitter.com/KatarinaKeys/status/117845...,2019-09-29 23:50:43+00:00,Sun Sep 29 23:50:43 +0000 2019,0,,,,,mortgage
1,1178455823242035201,1126071201481334787,Jumbo Mortgage Program https:// conclud.com/ht...,0,https://twitter.com/Conclud2/status/1178455823...,2019-09-29 23:45:37+00:00,Sun Sep 29 23:45:37 +0000 2019,0,,,,https://conclud.com/https-www-madisonmortgageg...,mortgage
2,1178450126219685893,729387514914603009,If you have no work it's harder to feed your k...,0,https://twitter.com/cjhenrygonzo/status/117845...,2019-09-29 23:22:59+00:00,Sun Sep 29 23:22:59 +0000 2019,0,,,,,mortgage
3,1178446295985541120,1697126574,"Solution. ""You'll need to be: 18+ and a UK res...",0,https://twitter.com/blazedstorm/status/1178446...,2019-09-29 23:07:46+00:00,Sun Sep 29 23:07:46 +0000 2019,2,,,,,mortgage
4,1178446170722619393,1239955070,Kabaddi x3 UK Premier 1st show House Full Show...,0,https://twitter.com/habamoment/status/11784461...,2019-09-29 23:07:16+00:00,Sun Sep 29 23:07:16 +0000 2019,0,@Peepal,,,https://www.facebook.com/habteam/posts/1106547...,mortgage


In [64]:
# read the financial services companies list from csv
fin_inst = pd.read_csv('../data/fin_serv.csv')
tweet_df_all['text'] = tweet_df_all['text'].str.lower()
tweet_df_all['company']=''
fin_inst.head()

Unnamed: 0,fin_sector,fin_comp
0,Banks,Barclays
1,Banks,Lloyds
2,Banks,HSBC
3,Banks,Citi Bank
4,Banks,Santander


In [65]:
# Search through the data frame and find the companies
for comp in fin_inst['fin_comp'].unique():
    print(comp.lower())
    tweet_df_all.loc[tweet_df_all['text'].str.contains(comp.lower()),'company']=comp

barclays
lloyds
hsbc
citi bank
santander
nationwide
bupa
axa
allianz
zurich insurance group
qbe
direct line
monzo
starling
revolut
transfer wise
western union


In [66]:
tweet_df_comp = tweet_df_all.merge(fin_inst, how='left', left_on='company', right_on='fin_comp')
tweet_df_comp.head().T

Unnamed: 0,0,1,2,3,4
id,1178457108276289536,1178455823242035201,1178450126219685893,1178446295985541120,1178446170722619393
author_id,40080176,1126071201481334787,729387514914603009,1697126574,1239955070
text,this normalisation of no deal is horrendous. p...,jumbo mortgage program https:// conclud.com/ht...,if you have no work it's harder to feed your k...,"solution. ""you'll need to be: 18+ and a uk res...",kabaddi x3 uk premier 1st show house full show...
retweets,0,0,0,0,0
permalink,https://twitter.com/KatarinaKeys/status/117845...,https://twitter.com/Conclud2/status/1178455823...,https://twitter.com/cjhenrygonzo/status/117845...,https://twitter.com/blazedstorm/status/1178446...,https://twitter.com/habamoment/status/11784461...
date,2019-09-29 23:50:43+00:00,2019-09-29 23:45:37+00:00,2019-09-29 23:22:59+00:00,2019-09-29 23:07:46+00:00,2019-09-29 23:07:16+00:00
formatted_date,Sun Sep 29 23:50:43 +0000 2019,Sun Sep 29 23:45:37 +0000 2019,Sun Sep 29 23:22:59 +0000 2019,Sun Sep 29 23:07:46 +0000 2019,Sun Sep 29 23:07:16 +0000 2019
favorites,0,0,0,2,0
mentions,,,,,@Peepal
hashtags,,,,,


In [67]:
tweet_df_comp.groupby('search_term')['id'].count()

search_term
credit card           2226
current account        254
insurance            10814
investment           15673
money transfer          69
mortgage              4704
pension               9347
peronal loan            20
savings account        182
tax advice             146
wealth management      269
Name: id, dtype: int64

## Step 3: Topic Extraction with LDA model

In [68]:
# remove unnecessary words
# Compile all regular expressions
# Note: under re.VERBOSE, unescaped whitespace in a pattern is ignored, so the space after '://' below matches nothing
isURL = re.compile(r'http[s]?:// (?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', re.VERBOSE | re.IGNORECASE)
isRTusername = re.compile(r'^RT+[\s]+(@[\w_]+:)',re.VERBOSE | re.IGNORECASE) # leading "RT @user:" retweet prefix
isEntity = re.compile(r'@[\w_]+', re.VERBOSE | re.IGNORECASE) # @mentions

# Helper functions
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
def print_topics(model, count_vectorizer, n_top_words):
    words = count_vectorizer.get_feature_names()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]])) 

# Show top n keywords for each topic
def show_topics(vectorizer, lda_model, n_words=20):
    keywords = np.array(vectorizer.get_feature_names())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords
     
        
def clean_tweet(row):
    row = isURL.sub("",row)
    row = isRTusername.sub("",row)
    row = isEntity.sub("",row)
    return row

def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in tok.sent_tokenize(text) for word in tok.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
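The scraped tweets contain URLs written with a space after the scheme (for example `https:// conclud.com/...`), and the cleaned text further down still shows URL fragments. One likely reason is that `re.VERBOSE` ignores unescaped whitespace in the pattern. The sketch below demonstrates the effect with two simplified patterns; both patterns and the sample text are illustrative assumptions, not the notebook's exact expressions.

```python
import re

# Under re.VERBOSE the literal space after '://' is dropped from the pattern.
verbose_url = re.compile(r'https?:// \S+', re.VERBOSE)
# An explicit optional whitespace token survives without VERBOSE.
spaced_url = re.compile(r'https?://\s?\S+')

text = "jumbo mortgage program https:// conclud.com/offer today"

unchanged = verbose_url.sub("", text)
cleaned = spaced_url.sub("", text)

print(unchanged == text)     # True: the VERBOSE pattern finds nothing to remove
print("conclud" in cleaned)  # False: the URL has been removed
```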


In [69]:
# remove urls and retweets and entities from the text
tweet_df_comp['text_clean'] = tweet_df_comp['text'].apply(lambda row:clean_tweet(row))

# remove punctuation
RE_PUNCTUATION = '|'.join([re.escape(x) for x in string.punctuation])  
tweet_df_comp['text_clean'] = tweet_df_comp['text_clean'].str.replace(RE_PUNCTUATION, "", regex=True)  # regex=True is needed from pandas 2.0, where the default became False
tweet_df_comp.head()

Unnamed: 0,id,author_id,text,retweets,permalink,date,formatted_date,favorites,mentions,hashtags,geo,urls,search_term,company,fin_sector,fin_comp,text_clean
0,1178457108276289536,40080176,this normalisation of no deal is horrendous. p...,0,https://twitter.com/KatarinaKeys/status/117845...,2019-09-29 23:50:43+00:00,Sun Sep 29 23:50:43 +0000 2019,0,,,,,mortgage,,,,this normalisation of no deal is horrendous pe...
1,1178455823242035201,1126071201481334787,jumbo mortgage program https:// conclud.com/ht...,0,https://twitter.com/Conclud2/status/1178455823...,2019-09-29 23:45:37+00:00,Sun Sep 29 23:45:37 +0000 2019,0,,,,https://conclud.com/https-www-madisonmortgageg...,mortgage,,,,jumbo mortgage program https concludcomhttpsww...
2,1178450126219685893,729387514914603009,if you have no work it's harder to feed your k...,0,https://twitter.com/cjhenrygonzo/status/117845...,2019-09-29 23:22:59+00:00,Sun Sep 29 23:22:59 +0000 2019,0,,,,,mortgage,,,,if you have no work its harder to feed your ki...
3,1178446295985541120,1697126574,"solution. ""you'll need to be: 18+ and a uk res...",0,https://twitter.com/blazedstorm/status/1178446...,2019-09-29 23:07:46+00:00,Sun Sep 29 23:07:46 +0000 2019,2,,,,,mortgage,,,,solution youll need to be 18 and a uk resident...
4,1178446170722619393,1239955070,kabaddi x3 uk premier 1st show house full show...,0,https://twitter.com/habamoment/status/11784461...,2019-09-29 23:07:16+00:00,Sun Sep 29 23:07:16 +0000 2019,0,@Peepal,,,https://www.facebook.com/habteam/posts/1106547...,mortgage,,,,kabaddi x3 uk premier 1st show house full show...


In [70]:
# List of stopwords
stop_words= stopwords.words('english') #import stopwords from NLTK package
readInStopwords = pd.read_csv("../data/pre_process/twitterStopWords.csv", encoding='ISO-8859-1') # import stopwords from CSV file as pandas data frame
readInStopwords = readInStopwords.wordList.tolist() # convert pandas data frame to a list
readInStopwords.append('http')
readInStopwords.append('https')

# add in the search terms: as topic extraction is performed within each search term,
# we do not want the term or a variation of the term captured as a topic word
search_terms_revised = ['mortgages','wealthmanagement','pensions','money','transfer']
readInStopwords.extend(search_terms)
readInStopwords.extend(search_terms_revised)

stop_list = stop_words + readInStopwords # combine two lists i.e. NLTK stop words and CSV stopwords
stop_list = list(set(stop_list)) # store only unique values 
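Because the tokenizer used later emits single words, multi-word entries in the stop list such as "current account" are unlikely ever to match a token. A minimal sketch of one way to handle this, assuming it is acceptable to also drop the individual words, is to expand each phrase into its parts. `expand_stop_terms` is a hypothetical helper and the three terms are a subset used for illustration.

```python
def expand_stop_terms(terms):
    """Hypothetical helper: return each term plus its individual words, de-duplicated."""
    expanded = set()
    for term in terms:
        expanded.add(term)
        expanded.update(term.split())
    return sorted(expanded)

search_terms = ['mortgage', 'current account', 'savings account']
extra_stop_words = expand_stop_terms(search_terms)
print(extra_stop_words)
# ['account', 'current', 'current account', 'mortgage', 'savings', 'savings account']
```

Whether words such as "account" or "card" should be removed is a modelling choice, since they may also carry meaning within a topic.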

In [71]:
# parameters for LDA: I am selecting 5 topics and 5 words for each of the search terms
number_topics = 5
number_words = 5

In [72]:
tweets_all_topics= pd.DataFrame()
# term frequency modelling
for terms in tweet_df_comp['search_term'].unique():
    print(terms)
    tweets_search_topics  = tweet_df_comp[tweet_df_comp['search_term']==terms].reset_index(drop=True)
    corpus = tweets_search_topics['text_clean'].tolist()
    # print(corpus)
    tf_vectorizer = CountVectorizer(max_df=0.9, min_df=0.00, stop_words=stop_list, tokenizer=tokenize_only) # Use tf (raw term count) features for LDA.
    tf = tf_vectorizer.fit_transform(corpus)
    
    # Create and fit the LDA model
    model = LDA(n_components=number_topics, n_jobs=-1)
    id_topic = model.fit(tf)
    # Print the topics found by the LDA model
    print("Topics found via LDA:")
    topic_keywords = show_topics(vectorizer=tf_vectorizer, lda_model=model, n_words=number_words)        
    # Topic - Keywords Dataframe
    df_topic_keywords = pd.DataFrame(topic_keywords)
    df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
    df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
    df_topic_keywords = df_topic_keywords.reset_index()
    df_topic_keywords['topic_index'] = df_topic_keywords['index'].str.split(' ', n = 1, expand = True)[[1]].astype('int')
    print(df_topic_keywords)
    
    ############ get the dominant topic for each document in a data frame ###############
    # Create the document-topic matrix
    lda_output = model.transform(tf)
    # column names
    topicnames = ["Topic" + str(i) for i in range(model.n_components)]
    # index names
    docnames = ["Doc" + str(i) for i in range(len(corpus))]
    
    # Make the pandas dataframe
    df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)
    # Get dominant topic for each document
    dominant_topic = np.argmax(df_document_topic.values, axis=1)
    df_document_topic['dominant_topic'] = dominant_topic   
    df_document_topic = df_document_topic.reset_index()
        
    #combine all the search terms into one data frame
    tweets_topics = tweets_search_topics.merge(df_document_topic, left_index=True, right_index=True, how='left')
    tweets_topics_words = tweets_topics.merge(df_topic_keywords, how='left', left_on='dominant_topic', right_on='topic_index')
    tweets_all_topics = pd.concat([tweets_all_topics, tweets_topics_words])  # DataFrame.append was removed in pandas 2.0

mortgage
Topics found via LDA:
     index   Word 0    Word 1     Word 2  Word 3    Word 4  topic_index
0  Topic 0  adviser      work       jobs     pay      pass            0
1  Topic 1     week        us      right  broker  business            1
2  Topic 2      get    people        pay   years      home            2
3  Topic 3    rates  interest  financial    bank      rate            3
4  Topic 4     like       day      going     got       say            4
current account
Topics found via LDA:
     index   Word 0   Word 1 Word 2    Word 3   Word 4  topic_index
0  Topic 0  deficit     also   it’s      keep  banking            0
1  Topic 1      one  savings   help  interest     card            1
2  Topic 2     card       uk   it’s     today    using            2
3  Topic 3  deficit      get    use     great     paid            3
4  Topic 4       hi     bank    one      card   trying            4
savings account
Topics found via LDA:
     index  Word 0    Word 1 Word 2    Word 3   Word 
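Each tweet is assigned the topic with the largest weight in its row of the document-topic matrix, which is what the `np.argmax(..., axis=1)` call in the loop does. The sketch below reproduces that step in plain Python; the weights are made-up illustrative values and `dominant_topic` is a hypothetical helper.

```python
def dominant_topic(weights):
    """Index of the largest weight; ties resolve to the first occurrence, as np.argmax does."""
    return max(range(len(weights)), key=lambda i: weights[i])

# One row per document, one column per topic (illustrative values).
doc_topic = [
    [0.01, 0.01, 0.01, 0.01, 0.96],
    [0.04, 0.04, 0.04, 0.84, 0.04],
]
dominant = [dominant_topic(row) for row in doc_topic]
print(dominant)  # [4, 3]
```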

In [73]:
tweets_all_topics = tweets_all_topics.reset_index(drop=True)
print(tweets_all_topics.shape)
tweets_all_topics.head()

(43704, 31)


Unnamed: 0,id,author_id,text,retweets,permalink,date,formatted_date,favorites,mentions,hashtags,...,Topic3,Topic4,dominant_topic,index_y,Word 0,Word 1,Word 2,Word 3,Word 4,topic_index
0,1178457108276289536,40080176,this normalisation of no deal is horrendous. p...,0,https://twitter.com/KatarinaKeys/status/117845...,2019-09-29 23:50:43+00:00,Sun Sep 29 23:50:43 +0000 2019,0,,,...,0.01,0.96,4,Topic 4,like,day,going,got,say,4
1,1178455823242035201,1126071201481334787,jumbo mortgage program https:// conclud.com/ht...,0,https://twitter.com/Conclud2/status/1178455823...,2019-09-29 23:45:37+00:00,Sun Sep 29 23:45:37 +0000 2019,0,,,...,0.84,0.04,3,Topic 3,rates,interest,financial,bank,rate,3
2,1178450126219685893,729387514914603009,if you have no work it's harder to feed your k...,0,https://twitter.com/cjhenrygonzo/status/117845...,2019-09-29 23:22:59+00:00,Sun Sep 29 23:22:59 +0000 2019,0,,,...,0.01,0.01,0,Topic 0,adviser,work,jobs,pay,pass,0
3,1178446295985541120,1697126574,"solution. ""you'll need to be: 18+ and a uk res...",0,https://twitter.com/blazedstorm/status/1178446...,2019-09-29 23:07:46+00:00,Sun Sep 29 23:07:46 +0000 2019,2,,,...,0.63,0.01,3,Topic 3,rates,interest,financial,bank,rate,3
4,1178446170722619393,1239955070,kabaddi x3 uk premier 1st show house full show...,0,https://twitter.com/habamoment/status/11784461...,2019-09-29 23:07:16+00:00,Sun Sep 29 23:07:16 +0000 2019,0,@Peepal,,...,0.02,0.02,0,Topic 0,adviser,work,jobs,pay,pass,0


In [74]:
tweets_all_topics.to_csv('../processed_data/tweets_all_topics.csv', index=False)

## Step 4: Sentiment analysis using Deep Learning

I trained the model using movie review data. Details of the training can be found here: https://towardsdatascience.com/sentiment-analysis-for-text-with-deep-learning-2f0a0c6472b5

In [75]:
import keras
import h5py
from keras.models import model_from_json
from keras.models import load_model
import json
from nltk.tokenize import RegexpTokenizer


In [76]:
# read in the weights of the trained model
weight_path = '../models/dl_sentiment_model/best_weight_glove_bi_512.hdf5'
prd_model = load_model(weight_path)
prd_model.summary()
# word -> vocabulary index mapping saved at training time
with open("../models/dl_sentiment_model/word_idx.txt") as f:
    word_idx = json.load(f)


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_16 (Embedding)     (None, 56, 300)           120000300 
_________________________________________________________________
bidirectional_2 (Bidirection (None, 256)               439296    
_________________________________________________________________
dense_20 (Dense)             (None, 512)               131584    
_________________________________________________________________
dropout_6 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_21 (Dense)             (None, 10)                5130      
Total params: 120,576,310
Trainable params: 576,010
Non-trainable params: 120,000,300
_________________________________________________________________


In [77]:
def get_sentiment_DL(prd_model, text_data, word_idx):
    """Add a 'Sentiment_Score' column to text_data, scored from its 'text' column."""
    max_len = 56  # sequence length of the model (embedding output shape is (None, 56, 300))
    batchSize = len(text_data)
    # split each sentence into its words and remove any punctuation
    tokenizer = RegexpTokenizer(r'\w+')

    live_list = []
    for index, row in text_data.iterrows():
        text_data_list = tokenizer.tokenize(row['text'])

        # look up the vocabulary index of each word; unknown words map to 0
        data_index_np = np.array([word_idx[word.lower()] if word.lower() in word_idx else 0 for word in text_data_list])

        # truncate to max_len tokens and pad with zeros up to max_len
        padded_array = np.zeros(max_len)
        padded_array[:data_index_np.shape[0]] = data_index_np[:max_len]
        live_list.append(padded_array.astype(int))

    live_list_np = np.asarray(live_list)
    # one row of 10 class scores per tweet
    score = prd_model.predict(live_list_np, batch_size=batchSize, verbose=0)

    score_all = []
    for each_score in score:
        # weighted average of the three highest-scoring class indices (0-9), scaled by 1/10
        top_3_index = np.argsort(each_score)[-3:]
        top_3_scores = each_score[top_3_index]
        top_3_weights = top_3_scores/np.sum(top_3_scores)
        single_score_dot = np.round(np.dot(top_3_index, top_3_weights)/10, decimals = 2)
        score_all.append(single_score_dot)

    text_data['Sentiment_Score'] = score_all

    return text_data
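Before prediction, the loop in `get_sentiment_DL` turns each tweet into a fixed-length vector of vocabulary indices. The sketch below shows that encoding step on its own, using `re.findall` as a stand-in for NLTK's `RegexpTokenizer(r'\w+')` and a toy vocabulary. `encode_and_pad` and `toy_vocab` are hypothetical names used only for illustration.

```python
import re

def encode_and_pad(text, word_idx, max_len=56):
    """Hypothetical helper: map words to vocabulary indices, truncate, then zero-pad."""
    tokens = re.findall(r'\w+', text.lower())
    ids = [word_idx.get(token, 0) for token in tokens][:max_len]  # unknown words map to 0
    return ids + [0] * (max_len - len(ids))

toy_vocab = {'the': 1, 'worst': 2, 'insurance': 3}
encoded = encode_and_pad("The worst insurance, ever!", toy_vocab, max_len=6)
print(encoded)  # [1, 2, 3, 0, 0, 0]
```

Note that index 0 serves both for padding and for out-of-vocabulary words, so the model cannot tell the two apart.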


In [78]:
text_data =  tweets_all_topics
# Deep Learning sentiment scoring
text_out = get_sentiment_DL(prd_model, text_data, word_idx)


### Example of negative tweets


In [79]:
text_out.sort_values(by='Sentiment_Score')[['text','Sentiment_Score']].head().T

Unnamed: 0,37967,42412,30487,19344,14520
text,johnson was worse at delivery than khans and h...,"i'm just so disappointed, i made a very, very ...",snp ministers lose public £135m in bad loans a...,"48:00 ""additional pension is extremely poor va...",the very worst insurance company i have ever u...
Sentiment_Score,0.04,0.04,0.04,0.04,0.04


### Example of positive tweets

In [80]:
text_out.sort_values(by='Sentiment_Score', ascending=False)[['text','Sentiment_Score']].head().T

Unnamed: 0,34887,31525,20778,33675,6975
text,#patrizia tracks surge in first-half performan...,we are delighted to have won property manager ...,newham pensions fund awarded a ‘highly commend...,"""one of the finest clos apaltas of all time,"" ...",heartiest congratulations to team applied syst...
Sentiment_Score,0.87,0.87,0.86,0.86,0.85
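The score reported here is computed in `get_sentiment_DL` as a weighted average of the three highest-scoring class indices, divided by 10. Since the indices run from 0 to 9, the score lies between 0 and 0.9, which is consistent with the highest values above sitting just below 0.9. The sketch below restates that calculation in plain Python; `top3_score` is a hypothetical helper and the class scores are made up.

```python
def top3_score(class_scores):
    """Hypothetical helper: weighted mean of the three top class indices, scaled by 1/10."""
    # indices of the three largest scores, in ascending order of score
    top = sorted(range(len(class_scores)), key=lambda i: class_scores[i])[-3:]
    total = sum(class_scores[i] for i in top)
    weighted = sum(i * class_scores[i] / total for i in top)
    return round(weighted / 10, 2)

# Made-up model output concentrated on the most positive classes.
positive = [0.0] * 7 + [0.1, 0.2, 0.7]
print(top3_score(positive))
```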


In [81]:
# save the output files
text_out.to_csv('../processed_data/tweets_topics_sentiment.csv', index=False)

## Step 5: Named Entity Recognition

The section below applies the Stanford 3-class NER tagger, which is based on a supervised Conditional Random Field (CRF) model. Additional information on the model is available at https://nlp.stanford.edu/software/CRF-NER.html

In [330]:
def get_NER(text_data):
    #/Users/prajwalshreyas/Desktop/Singularity/dockerApps/ner-algo/stanford-ner-2015-01-30
    stanford_classifier = '../models/ner/english.all.3class.distsim.crf.ser.gz'
    stanford_ner_path = '../models/ner/stanford-ner.jar'

    #try:
        # Creating Tagger Object
    st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
    #except Exception as e:
    #       print (e)

    # Get keyword for the input data frame
    #keyword = tweetDataFrame.keyword.unique()
    # Subset column containing tweet text and convert to list
    # next insert a placeholder ' 12345678 ' to signify end of individual tweets

    #text_data = pd.read_json('/Users/prajwalshreyas/Desktop/Singularity/dockerApps/sentiment-algo/app-sentiment-algo/sample_text.json')
    print ('start get_NER')
    text_out = text_data.copy()
    doc = [ docs + ' 12345678 ' for docs in list(text_data['text'])]
    # ------------------------- Stanford Named Entity Recognition
    tokens = nltk.word_tokenize(str(doc))
    entities = st.tag(tokens) # actual tagging takes place using Stanford NER algorithm


    entities = [list(elem) for elem in entities] # Convert list of tuples to list of list
    print ('tag complete')
    for idx,element in enumerate(entities):
        try:
            if entities[idx][0] == '12345678':
                entities[idx][1] = "DOC_NUMBER"  #  Modify data by adding the tag "Doc_Number"
            #elif entities[idx][0].lower() == keyword:
            #    entities[idx][1] = "KEYWORD"
            # Combine First and Last name into a single word
            elif entities[idx][1] == "PERSON" and entities[idx + 1][1] == "PERSON":
                entities[idx + 1][0] = entities[idx][0] + '-' + entities[idx+1][0]
                entities[idx][1] = 'Combined'
            # Combine consecutive Organization names
            elif entities[idx][1] == 'ORGANIZATION' and entities[idx + 1][1] == 'ORGANIZATION':
                entities[idx + 1][0] = entities[idx][0] + '-' + entities[idx+1][0]
                entities[idx][1] = 'Combined'
        except IndexError:
            break
    print ('enumerate complete')
    # Filter list of list for the words we are interested in
    filter_list = ['DOC_NUMBER','PERSON','LOCATION','ORGANIZATION']
    entityWordList = [element for element in entities if any(i in element for i in filter_list)]

    entityString = ' '.join(str(word) for insideList in entityWordList for word in insideList) # convert list to string and concatenate it
    entitySubString = entityString.split("DOC_NUMBER") # split the string using the separator 'DOC_NUMBER'
    del entitySubString[-1] # delete the extra blank entry created by the split

    # Store the classified NERs in the main tweet data frame
    for idx,docNER in enumerate(entitySubString):
        docNER = docNER.strip().split() # split the string into word list
        # Filter for words tagged as Organization and store it in data frame
        text_out.loc[idx,'Organization'] =  ','.join([docNER[i-1]  for i,x in enumerate(docNER) if x == 'ORGANIZATION'])
        # Filter for words tagged as LOCATION and store it in data frame
        text_out.loc[idx,'Place'] = ','.join([docNER[i-1] for i,x in enumerate(docNER) if x == 'LOCATION'])
        # Filter for words tagged as PERSON and store it in data frame
        text_out.loc[idx,'Person'] = ','.join([docNER[i-1]  for i,x in enumerate(docNER) if x == 'PERSON'])

    print ('process complete')
    return text_out

In [331]:
text_ner_out = get_NER(text_out)

start get_NER
tag complete
enumerate complete
process complete
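Inside `get_NER`, consecutive tokens with the same PERSON or ORGANIZATION tag are joined with a hyphen, which is why names such as `john-healey` appear in the output. The sketch below shows that merging step in isolation on a hand-written list of (word, tag) pairs; `merge_consecutive` is a hypothetical helper that builds a new list rather than editing in place, and no tagger is called.

```python
def merge_consecutive(tagged, merge_tags=('PERSON', 'ORGANIZATION')):
    """Hypothetical helper: join runs of adjacent tokens that share a mergeable tag."""
    merged = []
    for word, tag in tagged:
        if merged and tag in merge_tags and merged[-1][1] == tag:
            # extend the previous entity instead of starting a new one
            merged[-1] = (merged[-1][0] + '-' + word, tag)
        else:
            merged.append((word, tag))
    return merged

tagged = [('john', 'PERSON'), ('healey', 'PERSON'),
          ('visited', 'O'), ('dubai', 'LOCATION')]
merged_entities = merge_consecutive(tagged)
print(merged_entities)
# [('john-healey', 'PERSON'), ('visited', 'O'), ('dubai', 'LOCATION')]
```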


In [332]:
#the outputs of the ner tagger
text_ner_out.loc[(text_ner_out['Place'] != '') | (text_ner_out['Organization'] != '')|(text_ner_out['Person'] != '')][['text','Organization','Place','Person']].head()


Unnamed: 0,text,Organization,Place,Person
73,lenders refused to offer the couple of mortgag...,,dubai,
1472,shadow housing minister to outlaw ‘profiteerin...,,,john-healey
1597,"meanwhile, back in the house of commons, back ...",,,bla-bla-bla-bla
1674,the situation in the u.k. and pakistan is not ...,,pakistan,
2397,nice to see sky's coverage of liverpool v arse...,liverpool,,


### Output the file for visualisation

In [333]:
text_ner_out.to_csv('../processed_data/tweets_topics_sentiment_ner.csv', index=False)