## Working with Language Processing Methods

In [1]:
import pandas as pd
import numpy as np

In [2]:
columns = ['tweet_id','timestamp','tweet_text','user_id',
           'tweet_coords','tweet_coords_list','tweet_long','tweet_lat','location',
           'enc_url','tweet_lang','hashtags']

Timing the `read_csv` call below with `%timeit` gives:
`3.02 s ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)`
This is much faster than explicitly defining the date format string and calling `pd.to_datetime()` afterwards.

In [3]:
tweet_data = pd.read_csv(r'./tweetCoords.csv',header=None,names=columns,parse_dates=[1],infer_datetime_format=True)
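For reference, the two approaches being compared can be sketched on a tiny in-memory CSV. The rows and the `%Y-%m-%d %H:%M:%S` timestamp format are assumptions made for illustration; the real layout of `tweetCoords.csv` is not shown here, and this sketch only checks that both routes produce the same values, not how fast they are.

```python
import io

import pandas as pd

# Hypothetical sample rows; the real file's timestamp format is an assumption.
csv_text = (
    "1,2017-09-01 00:00:05,hello\n"
    "2,2017-09-01 00:30:10,world\n"
)
cols = ['tweet_id', 'timestamp', 'tweet_text']

# Route 1: let read_csv parse column 1 as dates while loading.
parsed = pd.read_csv(io.StringIO(csv_text), header=None, names=cols,
                     parse_dates=[1])

# Route 2: load the column as strings, then convert with an explicit format.
raw = pd.read_csv(io.StringIO(csv_text), header=None, names=cols)
raw['timestamp'] = pd.to_datetime(raw['timestamp'],
                                  format='%Y-%m-%d %H:%M:%S')

same_values = bool((parsed['timestamp'] == raw['timestamp']).all())
```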

In [4]:
time_string = "2017-09-01 00:00:00"
delta_hours = 1
start_time = pd.to_datetime(time_string)

Select the tweets posted within a single hour, starting at `start_time`.

In [5]:
tweet_hour = tweet_data[(tweet_data['timestamp'] >= start_time) &
                        (tweet_data['timestamp'] <= start_time + pd.Timedelta(hours=delta_hours))].copy()
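The filter above keeps both endpoints, so a tweet stamped exactly one hour after `start_time` would also appear in the next hour's window. A minimal sketch of a half-open alternative, using made-up timestamps, could look like this; whether the boundary matters depends on how the hourly windows are used later.

```python
import pandas as pd

# Made-up timestamps for illustration only.
df = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2017-08-31 23:59:59',
        '2017-09-01 00:00:00',
        '2017-09-01 00:30:00',
        '2017-09-01 01:00:00',
    ]),
    'tweet_text': ['a', 'b', 'c', 'd'],
})

start = pd.to_datetime('2017-09-01 00:00:00')
end = start + pd.Timedelta(hours=1)

# Closed interval [start, end], as in the cell above.
closed = df[(df['timestamp'] >= start) & (df['timestamp'] <= end)]

# Half-open interval [start, end): the 01:00:00 row belongs to the next hour.
half_open = df[(df['timestamp'] >= start) & (df['timestamp'] < end)]
```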

Remove the columns that are not needed for this analysis.

In [6]:
tweet_hour.drop(columns=tweet_hour.columns[3:], inplace=True)

In [7]:
tweet_hour.count()

tweet_id      1365
timestamp     1365
tweet_text    1365
dtype: int64

In [8]:
pd.options.display.max_colwidth=100

Working on functionality to clean the tweet text:
- Eliminate links.
- Drop user mentions.
- Treat *amp* as an ampersand (this may need to be removed later).
- Remove non-word characters (ASCII x21-x40, x5B-x60, x7B-x7F).
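The manual cleaning steps listed above could be sketched with regular expressions roughly as follows. The function name `clean_tweet`, the patterns, and the sample tweet are all illustrative assumptions, not code from this notebook. Note that the stated range x21-x40 also covers the digits 0-9, and that a decoded `&` is stripped again by the last step.

```python
import re

# Illustrative patterns; none of these names come from the notebook.
URL_RE = re.compile(r'https?://\S+')
MENTION_RE = re.compile(r'@\w+')
# The ASCII ranges from the list: x21-x40 (punctuation AND digits),
# x5B-x60, x7B-x7F.
NON_WORD_RE = re.compile(r'[\x21-\x40\x5B-\x60\x7B-\x7F]')


def clean_tweet(text):
    """Apply the four cleaning steps in order and collapse whitespace."""
    text = URL_RE.sub(' ', text)         # eliminate links
    text = MENTION_RE.sub(' ', text)     # drop user mentions
    text = text.replace('&amp;', '&')    # decode the HTML ampersand entity
    text = NON_WORD_RE.sub(' ', text)    # remove non-word characters
    return ' '.join(text.split())


sample = "YOOO CHILL @bob that's not happening https://t.co/abc &amp; more"
cleaned = clean_tweet(sample)
```

Links are removed first so that their punctuation is not split into fragments by the non-word step.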


Instead of cleaning manually, work with the NLTK tweet tokenizer.

In [9]:
from nltk.tokenize import TweetTokenizer

In [10]:
tweet_tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)

Comparing a plain `split()` with the tokenizer: notice how `reduce_len=True` shortens runs of three or more repeated characters to exactly three.

In [38]:
tweet_hour.loc[1340,'tweet_text'].split()

['YOOOOOOOOOOOOOOO',
 'CHILLLL',
 'that',
 's',
 'not',
 'happening',
 'https://t.co/KGhi654L0L']

In [12]:
tweet_tokenizer.tokenize(tweet_hour.loc[1340,'tweet_text'])

['YOOO', 'CHILLL', 'that', 's', 'not', 'happening', 'https://t.co/KGhi654L0L']
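The same behaviour can be reproduced without the dataset. This small sketch uses a made-up string (the handle `@someone` is hypothetical) to show both options at once: `strip_handles=True` removes the mention and `reduce_len=True` shortens the repeated letters.

```python
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)

# Made-up text; '@someone' is a hypothetical handle.
text = "@someone YOOOOOOOOOOOOOOO CHILLLL that s not happening"

plain_tokens = text.split()                # whitespace split keeps everything
tweet_tokens = tokenizer.tokenize(text)    # handle dropped, repeats reduced
```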

In [13]:
tweet_hour['tweet_text'].apply(tweet_tokenizer.tokenize).head()

0                                                                        [Ocala, :, 7:50, pm, :, sunset]
1    [Wind, 2.0, mph, ESE, ., Barometer, 30.013, in, ,, Steady, ., Temperature, 85.2, F, ., Rain, tod...
2                                                               [Where, words, fall, ..., music, speaks]
3                      [First, with, my, bride, #lovetampa, #bucs, #buccaneers, https://t.co/miYlZw6YEX]
4    [Wow, ., That, was, rough, ., It, s, basically, drinking, a, shot, of, whiskey, beer, ..., 120, ...
Name: tweet_text, dtype: object

In [14]:
from nltk.corpus import stopwords
# from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [15]:
stopWords = set(stopwords.words('english')) | set(stopwords.words('spanish'))

In [16]:
tweet_vector = CountVectorizer(analyzer='word',stop_words=list(stopWords)).fit(tweet_hour['tweet_text'])

In [17]:
len(tweet_vector.get_feature_names_out())

5488

In [18]:
# !pip install gensim

In [19]:
import nltk
from gensim import models



In [20]:
tweet_model = models.Word2Vec(sentences=tweet_hour['tweet_text'].apply(tweet_tokenizer.tokenize))

In [43]:
tweet_model.wv.similar_by_word('Sheeran')

[('m', 0.9883741140365601),
 ('re', 0.9883188605308533),
 ('an', 0.988035261631012),
 ('en', 0.9878236055374146),
 ('now', 0.9877570271492004),
 ('he', 0.9876581430435181),
 ('gt', 0.9876514673233032),
 ('I', 0.9876255989074707),
 ('me', 0.987604022026062),
 ('we', 0.9875683784484863)]

In [52]:
tweet_hour['tweet_text'].map(tweet_tokenizer.tokenize)

0                                                                           [Ocala, :, 7:50, pm, :, sunset]
1       [Wind, 2.0, mph, ESE, ., Barometer, 30.013, in, ,, Steady, ., Temperature, 85.2, F, ., Rain, tod...
2                                                                  [Where, words, fall, ..., music, speaks]
3                         [First, with, my, bride, #lovetampa, #bucs, #buccaneers, https://t.co/miYlZw6YEX]
4       [Wow, ., That, was, rough, ., It, s, basically, drinking, a, shot, of, whiskey, beer, ..., 120, ...
5       [I, can, t, even, watch, #Diana20, programmes, because, it, s, just, so, sad, She, was, an, incr...
6                                                                     [Gainesville, :, 7:51, pm, :, sunset]
7        [Exactly, 4hrs, til, my, blessings, ..., @, The, World, Famous, Original, https://t.co/RzaIK0aEIM]
8       [I, m, at, Louis, Pappas, Market, Cafe, :, Shoppes, at, Citrus, Park, in, Tampa, ,, FL, https://...
9       [Don, t, try, amp, t

In [49]:
nltk.FreqDist(list(tweet_hour['tweet_text'].apply(tweet_tokenizer.tokenize)))

TypeError: unhashable type: 'list'
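The `TypeError` arises because `FreqDist` receives a list of token lists and tries to hash each inner list as a single item. One possible fix, sketched here on made-up token lists, is to flatten them into a single stream of tokens first.

```python
from itertools import chain

from nltk import FreqDist

# Made-up token lists standing in for the tokenized tweets.
token_lists = [
    ['Ocala', ':', 'sunset'],
    ['Gainesville', ':', 'sunset'],
]

# chain.from_iterable yields each token in turn, so FreqDist counts
# hashable strings instead of unhashable lists.
freq = FreqDist(chain.from_iterable(token_lists))
```

With the notebook's data, the same idea would be to pass `chain.from_iterable(tweet_hour['tweet_text'].apply(tweet_tokenizer.tokenize))` to `nltk.FreqDist`.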