## Working with Language Processing Methods

In [1]:
import pandas as pd
import numpy as np

In [2]:
columns = ['tweet_id','timestamp','tweet_text','user_id',
           'tweet_coords','tweet_coords_list','tweet_long','tweet_lat','location',
           'enc_url','tweet_lang','hashtags']

In [4]:
pd.read_csv?

`%timeit` reports the following for reading the file with `pd.read_csv` and its `parse_dates` option:
`3.02 s ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)`
This is much faster than explicitly defining the date format string and calling `pd.to_datetime()`.
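For comparison, the explicit-format alternative mentioned above might look like the sketch below. The two sample strings and the format `%Y-%m-%d %H:%M:%S` are assumptions for illustration, not values read from the actual file.

```python
import pandas as pd

# Hypothetical raw timestamp strings (assumed format, not taken from tweetCoords.csv).
raw = pd.Series(["2017-09-01 00:00:06", "2017-09-01 00:00:10"])

# Parse after loading, with an explicitly defined format string.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S")
print(parsed.dt.second.tolist())
```

Timing both approaches on the real file with `%timeit` is the only reliable way to compare them.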

In [12]:
# infer_datetime_format is deprecated since pandas 2.0 (inference is now the default).
tweet_data = pd.read_csv(r'./tweetCoords.csv', header=None, names=columns,
                         parse_dates=[1], infer_datetime_format=True)

In [13]:
time_string = "2017-09-01 00:00:00"
delta_hours = 1
start_time = pd.to_datetime(time_string)

Select the tweets posted within a single hour, starting at `start_time`.

In [25]:
tweet_hour = tweet_data[(tweet_data['timestamp'] >= start_time) &
                        (tweet_data['timestamp'] <= start_time + pd.Timedelta(hours=delta_hours))].copy()
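The filter above is inclusive at both ends, so a tweet stamped exactly on the hour would fall into two consecutive windows. The sketch below contrasts that with a half-open window. The `demo` frame and its four rows are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for tweet_data with hand-picked timestamps.
demo = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-08-31 23:59:59",
        "2017-09-01 00:00:00",
        "2017-09-01 00:30:00",
        "2017-09-01 01:00:00",
    ]),
    "tweet_text": ["a", "b", "c", "d"],
})

start = pd.to_datetime("2017-09-01 00:00:00")
end = start + pd.Timedelta(hours=1)

# Closed interval [start, end]: same behavior as the >= / <= filter.
closed = demo[demo["timestamp"].between(start, end)]

# Half-open interval [start, end): the 01:00:00 tweet is left for the next hour.
half_open = demo[(demo["timestamp"] >= start) & (demo["timestamp"] < end)]

print(len(closed), len(half_open))
```

Which boundary convention is appropriate depends on whether consecutive hourly windows will be compared later.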

In [26]:
# Keep only tweet_id, timestamp and tweet_text (columns= already implies axis=1).
tweet_hour.drop(columns=tweet_hour.columns[3:], inplace=True)

In [27]:
tweet_hour.count()

tweet_id      1365
timestamp     1365
tweet_text    1365
dtype: int64

In [62]:
pd.options.display.max_colwidth=100

Next, work out how to clean the tweet text:
- Eliminate links.
- Drop user mentions.
- Treat the stray token *amp* as an ampersand.
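The three steps above can be sketched as a single helper. `clean_tweet` is a hypothetical name, the sample tweets and the `https://t.co/abc123` link are made up, and replacing *amp* with `&` is one reading of the last bullet (the notebook cell further down substitutes a space instead).

```python
import re
import pandas as pd

URL_RE = re.compile(r"https?:[\/\.\w]+")   # links
MENTION_RE = re.compile(r"@\S+")           # user mentions such as @name
AMP_RE = re.compile(r"\s+amp\s+")          # stray "amp" surrounded by whitespace

def clean_tweet(text: str) -> str:
    """Hypothetical helper: strip links and mentions, map a stray 'amp' to '&'."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = AMP_RE.sub(" & ", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

sample = pd.Series([
    "First @TBBuccaneers with my bride #bucs https://t.co/abc123",
    "Rock amp roll @ The World Famous Original",
])
cleaned = sample.map(clean_tweet)
print(cleaned.tolist())
```

A bare `@` followed by a space, as in location tags, is left untouched because the mention pattern requires at least one non-space character after it.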

In [83]:
# Raw string avoids invalid-escape warnings for \w, \. and \/.
tweet_hour[tweet_hour['tweet_text'].str.contains(r"https?:[\w\.\/_~=]+")]

Unnamed: 0,tweet_id,timestamp,tweet_text
3,903407078823399424,2017-09-01 00:00:06,First @TBBuccaneers with my bride @carrie_dunavant #lovetampa #bucs #buccaneers https://t.co/mi...
4,903407093855776768,2017-09-01 00:00:10,Wow. That was rough. It s basically drinking a shot of whiskey beer.... 120 Minute IPA https:/...
7,903407131998674944,2017-09-01 00:00:19,Exactly 4hrs til my blessings... @ The World Famous Original https://t.co/RzaIK0aEIM
8,903407148163588096,2017-09-01 00:00:23,"I m at Louis Pappas Market Cafe: Shoppes at Citrus Park in Tampa, FL https://t.co/qPPnbVN1jy"
10,903407153729429504,2017-09-01 00:00:24,"#Repost Great shoot today, with lululemon ambassador, dragonflyogi. Such an honor to meet and ..."
13,903407177343238144,2017-09-01 00:00:30,"#GalationsSixAndNine @ Spring Hill, FL https://t.co/8dBicwcPFh"
14,903407183206924288,2017-09-01 00:00:31,There s more #traveling in middle school girl s #basketball than on https://t.co/qSH33Yi2JQ
16,903407187526959104,2017-09-01 00:00:32,"If she cant do this on da i dont want her #MoonWalkin @ Tallahassee, Florida https://t.co/m..."
17,903407194804289538,2017-09-01 00:00:34,"Severe Thunderstorm Warning including Monticello FL, Greenville FL, Waukeenah FL until 8:45 PM E..."
18,903407203285172224,2017-09-01 00:00:36,"Severe Thunderstorm Warning including Monticello FL, Greenville FL, Waukeenah FL until 8:45 PM E..."


In [75]:
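# regex=True is required on pandas >= 2.0, where str.replace defaults to literal matching.
(tweet_hour['tweet_text']
 .str.replace(r"https?:[\/\.\w]+", "", regex=True)  # links
 .str.replace(r"@[\S]+", "", regex=True)            # mentions
 .str.replace(r"\s+amp\s+", " ", regex=True)        # stray "amp"
 .head(10))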

0                                                                                  Ocala: 7:50pm: sunset
1    Wind 2.0 mph ESE. Barometer 30.013 in, Steady. Temperature 85.2  F. Rain today 0.00 in. Humidity...
2                                                                    Where words fall....music speaks   
3                                                   First  with my bride   #lovetampa #bucs #buccaneers 
4              Wow. That was rough. It s basically drinking a shot of whiskey beer....  120 Minute IPA  
5    I can t even watch #Diana20 programmes because it s just so sad   She was an incredible person a...
6                                                                            Gainesville: 7:51pm: sunset
7                                        Exactly 4hrs til  my blessings... @ The World Famous Original  
8                                  I m at Louis Pappas Market Cafe: Shoppes at Citrus Park in Tampa, FL 
9    Don t try talk 2 me when it s convenient 4 U..I ll

In [47]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [48]:
stopWords = set(stopwords.words('english')) | set(stopwords.words('spanish'))

In [56]:
# Recent scikit-learn versions require stop_words to be a list rather than a set.
tweet_vector = TfidfVectorizer(analyzer='word', stop_words=list(stopWords)).fit_transform(tweet_hour['tweet_text'])

In [57]:
tweet_vector

<1365x5488 sparse matrix of type '<class 'numpy.float64'>'
	with 11294 stored elements in Compressed Sparse Row format>

In [58]:
tweet_vector.toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.31243485, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [59]:
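from sklearn.metrics.pairwise import linear_kernel
# TF-IDF rows are L2-normalized by default, so the linear kernel equals cosine similarity.
# Flattening the 1365 x 1365 result gives one entry per ordered pair of tweets.
cosine_similarities = linear_kernel(tweet_vector, tweet_vector).flatten()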

In [60]:
cosine_similarities.argsort()

array([ 931014, 1231258, 1710653, ..., 1590024,  396140,  549132],
      dtype=int64)
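Because the similarity matrix was flattened, the values returned by `argsort()` are flat indices into a 1365 x 1365 grid (flat index = row * 1365 + column), not tweet positions. A sketch of mapping them back to tweet pairs with `np.unravel_index`, using four made-up documents and masking out self-matches, which is an added step not present in the cell above:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy corpus; documents 0 and 2 share most of their words.
docs = [
    "severe thunderstorm warning monticello",
    "sunset over tampa bay",
    "severe thunderstorm warning greenville",
    "coffee in gainesville",
]

tfidf = TfidfVectorizer().fit_transform(docs)
sim = linear_kernel(tfidf, tfidf)  # n x n matrix of cosine similarities

# Keep the strict upper triangle: drops the diagonal (self-similarity of 1.0)
# and counts each unordered pair once.
masked = np.triu(sim, k=1)

# Sort flat indices by descending similarity, then recover (row, column).
flat_order = masked.flatten().argsort()[::-1]
rows, cols = np.unravel_index(flat_order[:1], masked.shape)
best_pair = (int(rows[0]), int(cols[0]))
print(best_pair)
```

Without the mask, the highest-ranked entries are dominated by each tweet matched against itself.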