## Working with Language Processing Methods

In [1]:
import pandas as pd
import numpy as np

In [2]:
columns = ['tweet_id','timestamp','tweet_text','user_id',
           'tweet_coords','tweet_coords_list','tweet_long','tweet_lat','location',
           'enc_url','tweet_lang','hashtags']

In [3]:
pd.read_csv?

Timing the `read_csv` call below with `%timeit` gives:
`3.02 s ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)`
This is much faster than explicitly defining the date format string and calling `pd.to_datetime()`.
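For comparison, the explicit-format alternative mentioned above might look like the sketch below. The two sample rows are made up for illustration, and the format string is an assumption based on the timestamps shown in the outputs later in this notebook.

```python
import io
import pandas as pd

# Made-up sample rows standing in for tweetCoords.csv (illustration only).
sample = io.StringIO(
    "1,2017-09-01 00:00:00,first tweet\n"
    "2,2017-09-01 00:00:02,second tweet\n"
)

# Read the timestamp column as plain strings first...
df = pd.read_csv(sample, header=None,
                 names=["tweet_id", "timestamp", "tweet_text"])

# ...then convert it with an explicit format string (assumed layout).
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d %H:%M:%S")
```

Timing both variants with `%timeit` on the real file is the only way to confirm which one is faster on a given pandas version.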

In [4]:
tweet_data = pd.read_csv(r'./tweetCoords.csv',header=None,names=columns,parse_dates=[1],infer_datetime_format=True)

In [5]:
time_string = "2017-09-01 00:00:00"
delta_hours = 1
start_time = pd.to_datetime(time_string)

Select the tweets that fall within a single hour, starting at `start_time` (both ends of the window are inclusive).

In [6]:
tweet_hour = tweet_data[(tweet_data['timestamp'] >= start_time) &
                        (tweet_data['timestamp'] <= start_time + pd.Timedelta(hours=delta_hours))].copy()

In [7]:
# Keep only tweet_id, timestamp, and tweet_text; `axis` is redundant when `columns` is given.
tweet_hour.drop(columns=tweet_hour.columns[3:], inplace=True)

In [8]:
tweet_hour.count()

tweet_id      1365
timestamp     1365
tweet_text    1365
dtype: int64

In [9]:
pd.options.display.max_colwidth=100

Working on functionality to clean the tweet text:
- Eliminate links.
- Drop user mentions.
- Treat a standalone *amp* as an ampersand.
- Remove numbers that appear as times or decimals.
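The steps listed above can be bundled into one helper. This is a minimal sketch, not the notebook's final implementation: `clean_tweet` is a hypothetical name, the patterns mirror the regular expressions tried in the cells that follow, and the final whitespace collapse is an added assumption.

```python
import re

# Patterns mirror the ones explored in this notebook.
_LINK = re.compile(r"https?:\S+")     # links
_MENTION = re.compile(r"@\S+")        # user mentions
_AMP = re.compile(r"\s+amp\s+")       # standalone "amp" token
_NUMERIC = re.compile(r"[\d:.#@]+")   # digits, colons, periods, '#', '@'
_SPACES = re.compile(r"\s+")          # assumption: also collapse leftover whitespace

def clean_tweet(text):
    """Hypothetical helper applying the cleaning steps in order."""
    text = _LINK.sub("", text)
    text = _MENTION.sub("", text)
    text = _AMP.sub(" ", text)
    text = _NUMERIC.sub("", text)
    return _SPACES.sub(" ", text).strip()
```

It could then be applied with `tweet_hour['tweet_text'].map(clean_tweet)`, storing the result in a new column so that the raw text stays available.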


In [23]:
tweet_hour[tweet_hour['tweet_text'].str.contains(r"[\d:.]+")].head()

Unnamed: 0,tweet_id,timestamp,tweet_text
0,903407051438567424,2017-09-01 00:00:00,Ocala: 7:50pm: sunset
1,903407059512721413,2017-09-01 00:00:02,"Wind 2.0 mph ESE. Barometer 30.013 in, Steady. Temperature 85.2 F. Rain today 0.00 in. Humidity..."
2,903407067968536576,2017-09-01 00:00:04,Where words fall....music speaks
3,903407078823399424,2017-09-01 00:00:06,First @TBBuccaneers with my bride @carrie_dunavant #lovetampa #bucs #buccaneers https://t.co/mi...
4,903407093855776768,2017-09-01 00:00:10,Wow. That was rough. It s basically drinking a shot of whiskey beer.... 120 Minute IPA https:/...


In [27]:
tweet_hour['tweet_text'].str.replace(r"[\d:.#@]+", "", regex=True).head()

0                                                                                        Ocala pm sunset
1                          Wind  mph ESE Barometer  in, Steady Temperature   F Rain today  in Humidity  
2                                                                        Where words fallmusic speaks   
3       First TBBuccaneers with my bride carrie_dunavant  lovetampa bucs buccaneers https//tco/miYlZwYEX
4    Wow That was rough It s basically drinking a shot of whiskey beer   Minute IPA  https//tco/AMWFBrnK
Name: tweet_text, dtype: object

In [29]:
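(tweet_hour['tweet_text']
    .str.replace(r"https?:\S+", "", regex=True)    # links
    .str.replace(r"@\S+", "", regex=True)          # user mentions
    .str.replace(r"\s+amp\s+", " ", regex=True)    # standalone "amp"
    .str.replace(r"[\d:.#@]+", "", regex=True))    # digits, ':', '.', '#', '@'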

0                                                                                           Ocala pm sunset
1                             Wind  mph ESE Barometer  in, Steady Temperature   F Rain today  in Humidity  
2                                                                           Where words fallmusic speaks   
3                                                         First  with my bride   lovetampa bucs buccaneers 
4                          Wow That was rough It s basically drinking a shot of whiskey beer   Minute IPA  
5       I can t even watch Diana programmes because it s just so sad   She was an incredible person and ...
6                                                                                     Gainesville pm sunset
7                                                Exactly hrs til  my blessings  The World Famous Original  
8                                      I m at Louis Pappas Market Cafe Shoppes at Citrus Park in Tampa, FL 
9       Don t try talk  me w

In [12]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [13]:
stopWords = set(stopwords.words('english')) | set(stopwords.words('spanish'))

In [14]:
# Recent scikit-learn versions expect stop_words as a list, not a set.
tweet_vector = TfidfVectorizer(analyzer='word', stop_words=list(stopWords)).fit_transform(tweet_hour['tweet_text'])

In [15]:
tweet_vector

<1365x5488 sparse matrix of type '<class 'numpy.float64'>'
	with 11294 stored elements in Compressed Sparse Row format>

In [16]:
tweet_vector.toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.31243485, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [17]:
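from sklearn.metrics.pairwise import linear_kernel
# TF-IDF rows are L2-normalized by default, so the dot product equals cosine similarity.
# flatten() turns the 1365x1365 similarity matrix into one array of 1,863,225 values.
cosine_similarities = linear_kernel(tweet_vector, tweet_vector).flatten()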

In [18]:
cosine_similarities.argsort()

array([ 931014, 1231258, 1710653, ..., 1590024,  396140,  549132])
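Because the similarity matrix was flattened, the indices returned by `argsort()` are positions in the flat array rather than tweet numbers. The sketch below shows one way to map them back to (row, column) pairs with `np.unravel_index`. It uses a made-up 3x3 matrix in place of the real 1365x1365 one, and masking the diagonal is an added assumption to skip each tweet's similarity with itself.

```python
import numpy as np

# Made-up symmetric similarity matrix standing in for the real one.
sim = np.array([[1.0, 0.2, 0.9],
                [0.2, 1.0, 0.1],
                [0.9, 0.1, 1.0]])

# Same steps as in the notebook: flatten, then argsort (ascending).
order = sim.flatten().argsort()

# Convert flat positions back to (row, column) index pairs.
rows, cols = np.unravel_index(order, sim.shape)

# Most similar pair of distinct items: hide the diagonal before taking argmax.
masked = sim.copy()
np.fill_diagonal(masked, -1.0)
best_i, best_j = np.unravel_index(masked.argmax(), masked.shape)
```

Here `rows[k]` and `cols[k]` identify the pair of tweets behind the k-th smallest similarity, and `best_i`, `best_j` give the closest pair that is not a self-comparison.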