## Working with Language Processing Methods

In [1]:
import pandas as pd
import numpy as np

In [2]:
columns = ['tweet_id','timestamp','tweet_text','user_id',
           'tweet_coords','tweet_coords_list','tweet_long','tweet_lat','location',
           'enc_url','tweet_lang','hashtags']

`%timeit` reports the following for the `read_csv` call below:
`3.02 s ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)`
This is much faster than explicitly defining the date format string and calling `pd.to_datetime()`.
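For comparison, the explicit-format alternative mentioned above would look roughly like the sketch below. The format string and the sample values are assumptions, since the timestamp layout of `tweetCoords.csv` is not shown here.

```python
import pandas as pd

# Hypothetical sample values in an assumed "YYYY-MM-DD HH:MM:SS" layout.
raw_timestamps = pd.Series([
    "2017-09-01 00:00:05",
    "2017-09-01 00:59:59",
    "2017-09-01 01:00:00",
])

# Parse with an explicit format string instead of letting pandas infer it.
parsed_timestamps = pd.to_datetime(raw_timestamps, format="%Y-%m-%d %H:%M:%S")
print(parsed_timestamps)
```

Timing both approaches with `%timeit` on the real file is the only way to confirm which is faster for a given pandas version.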

In [3]:
tweet_data = pd.read_csv(r'./tweetCoords.csv',header=None,names=columns,parse_dates=[1],infer_datetime_format=True)

In [4]:
time_string = "2017-09-01 00:00:00"
delta_hours = 1
start_time = pd.to_datetime(time_string)

Select the tweets from a single hour.

In [5]:
tweet_hour = tweet_data[(tweet_data['timestamp'] >= start_time) &
                        (tweet_data['timestamp'] <= start_time + pd.Timedelta(hours=delta_hours))].copy()
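Note that the filter uses `<=` at both ends, so a tweet stamped exactly one hour after `start_time` is included. If consecutive hourly windows are ever compared, a half-open interval avoids counting such a tweet twice. A minimal sketch on made-up timestamps (the frame and variable names here are illustrative only):

```python
import pandas as pd

# Hypothetical timestamps, including one exactly on the hour boundary.
frame = pd.DataFrame({"timestamp": pd.to_datetime(pd.Series([
    "2017-09-01 00:00:00",
    "2017-09-01 00:30:00",
    "2017-09-01 01:00:00",
    "2017-09-01 01:15:00",
]))})

window_start = pd.Timestamp("2017-09-01 00:00:00")
window_end = window_start + pd.Timedelta(hours=1)

# Closed interval: both endpoints included (the behaviour of the cell above).
closed = frame[(frame["timestamp"] >= window_start) & (frame["timestamp"] <= window_end)]

# Half-open interval: the end of the window belongs to the next hour.
half_open = frame[(frame["timestamp"] >= window_start) & (frame["timestamp"] < window_end)]

print(len(closed), len(half_open))
```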

Remove the columns that are not needed for this analysis.

In [6]:
# Keep only tweet_id, timestamp and tweet_text; `axis` is redundant when `columns=` is given.
tweet_hour.drop(columns=tweet_hour.columns[3:], inplace=True)

In [7]:
tweet_hour.count()

tweet_id      1365
timestamp     1365
tweet_text    1365
dtype: int64

In [8]:
pd.options.display.max_colwidth=100

Working on functionality to clean the tweet text:
- Eliminate links.
- Drop user mentions.
- Treat *amp* as an ampersand (this may need to be removed later).
- Remove non-word characters (ASCII x21-x40, x5B-x60, x7B-x7F). Note that x21-x40 also covers the digits.
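The steps listed above could be sketched with the standard library as follows. This is not the cleaning code used to produce the data in this notebook; the function name, the regular expressions for links and mentions, and the final lowercasing are all assumptions made for illustration.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")   # assumed link pattern
MENTION_RE = re.compile(r"@\w+")       # assumed user-mention pattern
# The ASCII ranges from the list above: x21-x40, x5B-x60, x7B-x7F.
NON_WORD_RE = re.compile(r"[\x21-\x40\x5B-\x60\x7B-\x7F]")

def clean_tweet(text):
    """Hypothetical helper applying the cleaning steps in order."""
    text = URL_RE.sub(" ", text)        # eliminate links
    text = MENTION_RE.sub(" ", text)    # drop user mentions
    text = html.unescape(text)          # turn "&amp;" back into "&"
    text = NON_WORD_RE.sub(" ", text)   # remove punctuation, digits and symbols
    # Collapse whitespace; lowercasing is an assumption, not a listed step.
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_tweet("@bob Sunset &amp; beer at 7pm!! https://t.co/abc"))
```

Decoding the entity before stripping non-word characters means the ampersand itself is removed, instead of leaving a stray `amp` token behind.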


Instead of manual cleaning, work with the NLTK tweet tokenizer.

In [51]:
from nltk.tokenize import TweetTokenizer

In [52]:
tweet_tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)

Comparing a plain split with the tokenizer: notice how `reduce_len=True` shortens runs of repeated characters to three.

In [70]:
tweet_hour.loc[1340,'tweet_text'].split()

['yooooooooooooooo', 'chillll', 'that', 's', 'not', 'happening']

In [67]:
tweet_tokenizer.tokenize(tweet_hour.loc[1340,'tweet_text'])

['yooo', 'chilll', 'that', 's', 'not', 'happening']
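The shortening seen in the tokenizer output can be reproduced with a single regular expression: any character repeated three or more times is cut down to exactly three. A standalone sketch (the function name is illustrative; inside NLTK this happens as part of `TweetTokenizer` when `reduce_len=True`):

```python
import re

def shorten_repeats(text):
    """Replace runs of 3+ identical characters with exactly three."""
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)

print(shorten_repeats("yooooooooooooooo chillll that s not happening"))
```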

In [71]:
tweet_hour['tweet_text'].apply(tweet_tokenizer.tokenize).head()

0                                                                             [ocala, pm, sunset]
1              [wind, mph, ese, barometer, in, steady, temperature, f, rain, today, in, humidity]
2                                                               [where, words, fallmusic, speaks]
3                                           [first, with, my, bride, lovetampa, bucs, buccaneers]
4    [wow, that, was, rough, it, s, basically, drinking, a, shot, of, whiskey, beer, minute, ipa]
Name: tweet_text, dtype: object

In [11]:
from nltk.corpus import stopwords
# from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
stopWords = set(stopwords.words('english')) | set(stopwords.words('spanish'))

In [13]:
# Pass the stop words as a list: newer scikit-learn releases reject a set here.
tweet_vector = CountVectorizer(analyzer='word',stop_words=list(stopWords)).fit(tweet_hour['tweet_text'])

In [14]:
# vocabulary_ maps each term to its column index; get_feature_names() was removed in newer scikit-learn.
len(tweet_vector.vocabulary_)

4078
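To see roughly where the vocabulary size comes from, the sketch below approximates what `CountVectorizer` does with its default settings when fitting: lowercase the text, keep tokens of two or more word characters, and drop stop words. It is an approximation for illustration, run on made-up documents, and not a replacement for the vectorizer.

```python
import re

# Default CountVectorizer token pattern: tokens of 2+ word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def build_vocabulary(documents, stop_words):
    """Hypothetical helper: collect the sorted set of kept tokens."""
    vocabulary = set()
    for document in documents:
        for token in TOKEN_RE.findall(document.lower()):
            if token not in stop_words:
                vocabulary.add(token)
    return sorted(vocabulary)

sample_docs = ["Sunset in Ocala", "it s a sunset"]   # made-up documents
sample_stop_words = {"in", "it", "a"}                # made-up stop list
print(build_vocabulary(sample_docs, sample_stop_words))
```

Single-character tokens such as `s` never reach the vocabulary, because the token pattern requires at least two characters.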

In [49]:
# !pip install gensim

In [18]:
import nltk
from gensim import models



In [63]:
tweet_model = models.Word2Vec(sentences=tweet_hour['tweet_text'].apply(tweet_tokenizer.tokenize))

In [65]:
tweet_model.wv.similar_by_word('humidity')

[('the', 0.9961910843849182),
 ('a', 0.9960900545120239),
 ('so', 0.9960758686065674),
 ('to', 0.9960094094276428),
 ('i', 0.9958324432373047),
 ('of', 0.9957069158554077),
 ('good', 0.9957038164138794),
 ('you', 0.9956586956977844),
 ('is', 0.9956137537956238),
 ('and', 0.995611310005188)]
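The scores returned by `similar_by_word` are cosine similarities between word vectors. Every neighbour above scores about 0.996 and most are stop words, which suggests the vectors all point in nearly the same direction. A plausible reading is that one hour of tweets with default training settings gives the model too little signal, though that is an interpretation and not something verified here. The measure itself can be sketched on toy vectors:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: parallel vectors score 1, orthogonal vectors score 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```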

In [27]:
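# Count individual tokens; passing the raw column would count whole tweets instead.
nltk.FreqDist(token for tokens in tweet_hour['tweet_text'].apply(tweet_tokenizer.tokenize) for token in tokens)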
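`nltk.FreqDist` is a subclass of `collections.Counter` and counts whatever items it iterates over, so word frequencies need a flat stream of tokens instead of whole tweet strings. A minimal sketch of the idea with `Counter` on made-up token lists:

```python
from collections import Counter

# Made-up tokenized tweets standing in for the tokenizer output.
token_lists = [["sunset", "in", "ocala"], ["sunset", "beer"]]

# Flatten the lists so each token, not each tweet, is counted.
word_counts = Counter(token for tokens in token_lists for token in tokens)
print(word_counts.most_common(1))
```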

