## Working with Language Processing Methods

In [2]:
import pandas as pd
import numpy as np

In [3]:
columns = ['tweet_id','timestamp','tweet_text','user_id',
           'tweet_coords','tweet_coords_list','tweet_long','tweet_lat','location',
           'enc_url','tweet_lang','hashtags']

`%timeit` reports the following for the `read_csv` call below:
`3.02 s ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)`
This is much faster than explicitly defining the date format string and calling `pd.to_datetime()`.

In [5]:
tweet_data = pd.read_csv(r'./tweetCoords.csv',header=None,names=columns,parse_dates=[1],infer_datetime_format=True)
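For comparison, the explicit-format alternative mentioned above could look roughly like the sketch below. The two-row sample and the format string `%Y-%m-%d %H:%M:%S` are assumptions made for illustration; the real file's timestamp layout may differ.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for tweetCoords.csv.
sample_csv = io.StringIO(
    "1,2017-09-01 00:15:00,hello\n"
    "2,2017-09-01 01:30:00,world\n"
)
sample = pd.read_csv(sample_csv, header=None,
                     names=["tweet_id", "timestamp", "tweet_text"])

# Parse the column afterwards with an explicit format string
# (the format is an assumption about the data, not taken from the notebook).
sample["timestamp"] = pd.to_datetime(sample["timestamp"],
                                     format="%Y-%m-%d %H:%M:%S")
print(sample["timestamp"].dtype)
```

Wrapping either variant in `%timeit` on the full file is the way to confirm which one is faster for a given pandas version.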

In [64]:
time_string = "2017-09-01 00:00:00"
delta_hours = 1
start_time = pd.to_datetime(time_string)

Select the tweets that fall within a single hour.

In [51]:
tweet_hour = tweet_data[(tweet_data['timestamp'] >= start_time) &
                        (tweet_data['timestamp'] <= start_time + pd.Timedelta(hours=delta_hours))].copy()
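The cell above keeps both endpoints of the hour. If several consecutive hours are analysed later, a half-open window avoids counting a tweet stamped exactly on the hour twice. The helper `select_window` and the demo frame below are hypothetical and only illustrate that variant.

```python
import pandas as pd

def select_window(df, start, hours, column="timestamp"):
    """Return a copy of the rows whose `column` lies in [start, start + hours)."""
    end = start + pd.Timedelta(hours=hours)
    mask = (df[column] >= start) & (df[column] < end)
    return df.loc[mask].copy()

# Small made-up frame with timestamps around the window boundaries.
demo = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-08-31 23:59:59",
        "2017-09-01 00:00:00",
        "2017-09-01 00:30:00",
        "2017-09-01 01:00:00",
    ]),
    "tweet_text": ["a", "b", "c", "d"],
})

window = select_window(demo, pd.to_datetime("2017-09-01 00:00:00"), 1)
print(window["tweet_text"].tolist())  # ['b', 'c']
```

With the closed interval used in the notebook, row `d` would be included as well.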

Remove the columns that are not needed for this analysis.

In [52]:
tweet_hour.drop(columns=tweet_hour.columns[3:], inplace=True)  # keep tweet_id, timestamp, tweet_text

In [53]:
tweet_hour.count()

tweet_id      1244
timestamp     1244
tweet_text    1244
dtype: int64

In [54]:
pd.options.display.max_colwidth=100

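Working on functionality to clean tweet text:
- Eliminate links.
- Drop user mentions.
- Remove the standalone token *amp* left over from ampersands (this step may need to be removed later).
- Remove non-word characters (ASCII x21-x40, x5B-x60, x7B-x7F). Note that the first range also covers the digits.
- Collapse repeated whitespace and convert to lowercase.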


In [83]:
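# Raw strings and regex=True make the patterns explicit; recent pandas versions
# treat str.replace patterns as literal text by default.
tweet_hour['tweet_text'] = (
    tweet_hour['tweet_text']
    .str.replace(r"https?:\S+", "", regex=True)                       # links
    .str.replace(r"@\S+", "", regex=True)                             # user mentions
    .str.replace(r"\s+amp\s+", " ", regex=True)                       # stray "amp" tokens
    .str.replace(r"[\x21-\x40\x5B-\x60\x7B-\x7F]", "", regex=True)    # punctuation and digits
    .str.replace(r"\s+", " ", regex=True)                             # repeated whitespace
    .str.lower()
)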
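The cleaning steps can be checked on a single string with the standard `re` module before running them over the whole column. The function name `clean_tweet` and the sample tweet are made up for illustration; the patterns mirror the ones in the cell above.

```python
import re

def clean_tweet(text):
    """Apply the same substitutions as the pandas chain, in the same order."""
    text = re.sub(r"https?:\S+", "", text)                        # links
    text = re.sub(r"@\S+", "", text)                              # user mentions
    text = re.sub(r"\s+amp\s+", " ", text)                        # stray "amp" tokens
    text = re.sub(r"[\x21-\x40\x5B-\x60\x7B-\x7F]", "", text)     # punctuation and digits
    text = re.sub(r"\s+", " ", text)                              # repeated whitespace
    return text.lower()

# Made-up sample tweet.
sample = "Wind amp RAIN in Houston!! @NWSHouston https://t.co/abc123 #Harvey2017"
print(clean_tweet(sample))  # wind rain in houston harvey
```

Because the digit range is stripped together with the punctuation, `#Harvey2017` is reduced to `harvey`. That may or may not be desirable for hashtags.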

In [89]:
from nltk.corpus import stopwords
# from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [88]:
stopWords = set(stopwords.words('english')) | set(stopwords.words('spanish'))

In [92]:
# Recent scikit-learn versions validate stop_words and expect a list rather than a set.
tweet_vector = CountVectorizer(analyzer='word', stop_words=list(stopWords)).fit(tweet_hour['tweet_text'])

In [94]:
len(tweet_vector.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2

3912
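To make the vocabulary count easier to interpret, here is a minimal sketch of what `CountVectorizer` produces on a toy corpus. The three documents and the short stop-word list are invented stand-ins for the cleaned tweets and the NLTK English and Spanish lists.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the cleaned tweet text.
docs = ["wind rain in houston", "rain and more rain", "houston flooding"]
stop_words = ["in", "and", "more"]  # stand-in for the NLTK stop-word lists

vectorizer = CountVectorizer(analyzer="word", stop_words=stop_words).fit(docs)

# vocabulary_ maps each retained token to its column index.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)       # ['flooding', 'houston', 'rain', 'wind']
print(len(vocab))  # 4

# transform() yields one row per document and one column per vocabulary term.
counts = vectorizer.transform(docs).toarray()
print(counts[1])   # "rain" appears twice in the second document
```

The length reported in the notebook is the same quantity: the number of distinct tokens left after stop-word removal.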