# Creating Lists of Words Related to a Specified Search Term

## This notebook performs the following steps:
- Imports a dataframe of tweets
- Processes tweets and tokenizes words
- Uses Word Embeddings to convert words into vectors
- Determines related words by using cosine similarity within the vector space
- Generates one list of related words for each selected time period, ordered by min-max-scaled cosine similarity.
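
As a minimal sketch of the similarity measure named above, cosine similarity is the dot product of two vectors divided by the product of their lengths. The function name and the three-dimensional vectors below are made up for illustration; real word vectors in this notebook have 100 dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-D "word vectors", invented for illustration only.
vec_storm = [1.0, 2.0, 0.0]
vec_hurricane = [2.0, 4.0, 0.0]   # points in the same direction as vec_storm
vec_pizza = [0.0, 0.0, 3.0]       # orthogonal to vec_storm

sim_same = cosine_similarity(vec_storm, vec_hurricane)   # close to 1
sim_orth = cosine_similarity(vec_storm, vec_pizza)       # close to 0
```

Vectors pointing in the same direction score near 1 regardless of their length, which is why the measure suits comparing word vectors of differing magnitude.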

### Imported Libraries
The libraries and modules below are used to load the tweets, tokenize and clean the text (removing extraneous words), convert the tokens into vectors, and scale the similarity scores.

In [2]:
import pandas as pd
import numpy as np
from gensim.models import Word2Vec
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from sklearn.preprocessing import MinMaxScaler
import string
import re

### Tweet Import
Import and parse the tweets into a dataframe, converting the timestamp strings to datetime objects and using them as the index.

In [3]:
columns = ['tweet_id','timestamp','tweet_text','user_id',
           'tweet_coords','tweet_coords_list','tweet_long','tweet_lat','location',
           'enc_url','tweet_lang','hashtags']
tweet_full = pd.read_csv(r'./tweetCoords.csv',
                         header=None,
                         names=columns,
                         parse_dates=[1],
                         infer_datetime_format=True,
                         index_col='timestamp')

### Isolating Tweets by Language (English)


In [4]:
tweet_full_en = tweet_full[tweet_full['tweet_lang'] == 'en']
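
The import and language filter above can be tried on a small in-memory sample. This is a sketch only: the two rows and the reduced four-column layout are invented for illustration, and `infer_datetime_format` is left out because it is not needed to parse these timestamps.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the real file has twelve columns.
csv_text = (
    "1,2017-09-10 08:15:00,first tweet,en\n"
    "2,2017-09-11 09:30:00,second tweet,es\n"
)
cols = ["tweet_id", "timestamp", "tweet_text", "tweet_lang"]

# Parse the timestamp column and use it as the index, as in the cell above.
df = pd.read_csv(io.StringIO(csv_text),
                 header=None,
                 names=cols,
                 parse_dates=["timestamp"],
                 index_col="timestamp")

# Keep only the rows tagged as English.
df_en = df[df["tweet_lang"] == "en"]
```

A datetime index is what later allows the tweets to be sliced by time period with `.loc`.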

### Text Cleaning and Tokenization Functions

In [5]:
tweet_stops = stopwords.words('english')
tweet_tokenizer = TweetTokenizer(strip_handles=True,preserve_case=False,reduce_len=True)

def clean_tweet(tweet):
    """Normalize a raw tweet string; returns the cleaned string."""
    # convert case:
    tweet = tweet.lower()
    # remove URLs:
    tweet = re.sub(r'https?://\S+', '', tweet)
    # remove @mentions, including those with a leading '-' or '.':
    tweet = re.sub(r'[-\.]?@\w+', '', tweet)
    # remove punctuation, but not hashtags:
    tweet = tweet.translate(str.maketrans('', '', string.punctuation.replace("#", "")))
    # remove any '#' that does not start a hashtag:
    tweet = re.sub(r'#\B', '', tweet)
    # remove 'amp', 'gt', 'lt' left over from HTML entities (&amp; &gt; &lt;):
    tweet = re.sub(r'\b(amp|gt|lt)\b', '', tweet)
    # drop words of fewer than 4 characters, then any remaining standalone numbers:
    tweet = re.sub(r'\b\w{1,3}\b', '', tweet)
    tweet = re.sub(r'\b\d+\b', '', tweet)
    return tweet

def tokens_no_stopwords(tweet_as_string):
    """Clean a tweet, tokenize it, and remove stopwords.

    Takes a string and returns a list of strings.
    """
    cleaned_tweet = clean_tweet(tweet_as_string)
    tweet_as_tokens = tweet_tokenizer.tokenize(cleaned_tweet)
    tweet_no_stops = [word for word in tweet_as_tokens if word not in tweet_stops]

    return tweet_no_stops
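
To see what the cleaning steps do to a single tweet, here is a self-contained sketch that repeats the same regular expressions on a made-up example. The name `clean_tweet_sketch` and the sample text are hypothetical, and for simplicity it splits on whitespace instead of using `TweetTokenizer` and skips stopword removal.

```python
import re
import string

def clean_tweet_sketch(tweet):
    """Standalone copy of the cleaning steps above, for experimentation."""
    tweet = tweet.lower()
    tweet = re.sub(r'https?://\S+', '', tweet)            # URLs
    tweet = re.sub(r'[-\.]?@\w+', '', tweet)              # @mentions
    keep_hash = string.punctuation.replace('#', '')
    tweet = tweet.translate(str.maketrans('', '', keep_hash))  # punctuation except '#'
    tweet = re.sub(r'#\B', '', tweet)                     # '#' not starting a hashtag
    tweet = re.sub(r'\b(amp|gt|lt)\b', '', tweet)         # HTML entity leftovers
    tweet = re.sub(r'\b\w{1,3}\b', '', tweet)             # words shorter than 4 characters
    tweet = re.sub(r'\b\d+\b', '', tweet)                 # remaining standalone numbers
    return tweet

# Invented example tweet.
sample = "Stay safe @NWS! Hurricane #Irma hits in 12 hours https://t.co/abc123"
tokens = clean_tweet_sketch(sample).split()
# tokens == ['stay', 'safe', 'hurricane', '#irma', 'hits', 'hours']
```

The mention, URL, punctuation, the short word "in", and the number "12" are dropped, while the hashtag survives as a single token.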

### Search Term
This is the term that will serve as the comparison for all later lists.

In [6]:
search_term = "irma"

### Related Words Dataframe and Time Deltas
This cell creates the empty dataframe that will hold the related words and sets the start date, end date, and length of each time period.

In [7]:
related_words = pd.DataFrame()
tweet_date = pd.to_datetime("2017-09-10 00:00:00")
date_delta = pd.Timedelta(hours=24)
end_date = pd.to_datetime("2017-09-12 00:00:00")
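
As a quick check of which periods these settings produce, the sketch below lists the start and end of each window. The variable names `start`, `end`, `step`, and `windows` are local to this example and are not part of the notebook.

```python
import pandas as pd

start = pd.to_datetime("2017-09-10 00:00:00")
end = pd.to_datetime("2017-09-12 00:00:00")
step = pd.Timedelta(hours=24)

# date_range includes both endpoints, so the end date starts a window of its own.
window_starts = pd.date_range(start=start, end=end, freq=step)
windows = [(ws, ws + step) for ws in window_starts]

for ws, we in windows:
    print(ws, "to", we)
```

With these values there are three windows, starting on September 10, 11, and 12, and the last one extends 24 hours past the end date.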

### Related Word List Size
This number specifies how many words are returned in the list for each of the time periods specified above.

In [8]:
top_num_words = 200

## Word Embeddings
### Converting Words to Vectors using gensim's `Word2Vec` model

- Iterate through each time period designated above
- Apply the tokenization and cleaning functions
- Convert the tokens to vectors using the following:
    - Minimum count = 1
    - Skip-Gram
    - Window of 3
    - 100D vectors
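
To make the skip-gram and window settings above more concrete, the following sketch lists the (center word, context word) pairs that a window of 3 yields for one tokenized tweet. It is an illustration only, not gensim's implementation: the function `skipgram_pairs` and the sample tokens are hypothetical, and it ignores training details such as negative sampling.

```python
def skipgram_pairs(tokens, window=3):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Invented token list standing in for one cleaned tweet.
tokens = ["hurricane", "irma", "makes", "landfall", "florida"]
pairs = skipgram_pairs(tokens, window=3)
```

In skip-gram training the model learns to predict each context word from its center word, so words that share contexts end up with similar vectors. Here "hurricane" and "florida" are four positions apart and therefore never form a pair.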

In [9]:
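for tweet_day in pd.date_range(start = tweet_date, end = end_date, freq = date_delta):
    # select the tweets in this period (label slicing with .loc includes both endpoints)
    tweet_text = tweet_full_en.loc[tweet_day:tweet_day + date_delta,"tweet_text"]

    # clean and tokenize each tweet, then train a skip-gram model on this period only
    tweets_tokens = tweet_text.apply(tokens_no_stopwords)
    # gensim 3.x API; in gensim >= 4.0, `size` is named `vector_size` and `wv.vocab` is `wv.key_to_index`
    vector_model = Word2Vec(tweets_tokens, min_count=1, sg=1, window=3, workers=5, size=100)
    word_matrix = vector_model.wv[vector_model.wv.vocab]

    # the top_num_words nearest words to the search term, as (word, cosine similarity) pairs
    terms_from_range = pd.DataFrame.from_records(vector_model.wv.most_similar(search_term,topn=top_num_words),
                                                 columns=[tweet_day,"Cos_Sim"])

    # append this period's word and similarity columns side by side
    related_words = pd.concat([related_words,terms_from_range],axis=1)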

### MinMax Scaling on Cosine Similarity Values
Rescale the cosine similarity values in each period's `Cos_Sim` column to a 0 to 100 range using min-max scaling.

In [11]:
cs_scaler = MinMaxScaler(feature_range=(0,100))

In [22]:
related_words.iloc[:,1::2] = cs_scaler.fit_transform(related_words.iloc[:,1::2])
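
The arithmetic behind this rescaling can be sketched with plain NumPy. `MinMaxScaler` applies it to each column separately, so within every period the most similar word maps to 100 and the least similar of the returned words maps to 0. The function `min_max_scale` and the similarity values below are made up for illustration.

```python
import numpy as np

def min_max_scale(values, low=0.0, high=100.0):
    """Linearly map values so the minimum becomes `low` and the maximum `high`."""
    values = np.asarray(values, dtype=float)
    v_min, v_max = values.min(), values.max()
    # Assumes v_max > v_min; a constant column would divide by zero here.
    return low + (values - v_min) * (high - low) / (v_max - v_min)

# Invented cosine similarities for three words in one period.
cos_sims = [0.90, 0.80, 0.70]
scaled = min_max_scale(cos_sims)   # approximately [100, 50, 0]
```

Because each column is scaled on its own, scores are comparable within a period but not across periods.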

In [23]:
related_words

| | 2017-09-10 00:00:00 | Cos_Sim | 2017-09-11 00:00:00 | Cos_Sim.1 | 2017-09-12 00:00:00 | Cos_Sim.2 |
|---|---|---|---|---|---|---|
| 0 | waiting | 100.000000 | hurricane | 100.000000 | hurricane | 100.000000 |
| 1 | ready | 95.676280 | family | 95.676280 | back | 95.676280 |
| 2 | away | 95.275459 | friends | 95.275459 | still | 95.275459 |
| 3 | #hurrcaneirma | 95.177063 | #hurricaneirma | 95.177063 | without | 95.177063 |
| 4 | hours | 94.810572 | survived | 94.810572 | everyone | 94.810572 |
| 5 | sunday | 91.720314 | made | 91.720314 | safe | 91.720314 |
| 6 | going | 89.697266 | stay | 89.697266 | home | 89.697266 |
| 7 | update | 88.424078 | damage | 88.424078 | work | 88.424078 |
| 8 | path | 88.278607 | never | 88.278607 | know | 88.278607 |
| 9 | live | 84.884662 | #irma | 84.884662 | good | 84.884662 |
