### CUNY Data 620 - Web Analytics, Summer 2020  
**Final Project: Twitter Pull**   
**Prof:** Alain Ledon  
**Members:** Misha Kollontai, Amber Ferger, Zach Alexander, Subhalaxmi Rout 

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import warnings
import datetime
import time
import math
import GetOldTweets3 as got
warnings.filterwarnings('ignore')

### Functions

We'll define the following functions:
* **perdelta**: Based on a [Stack Overflow](https://stackoverflow.com/questions/10688006/generate-a-list-of-datetimes-between-an-interval) thread, this generator yields evenly spaced dates from a start date up to (but not including) an end date. We use it to build the date ranges for our Twitter pull. 
* **getTweets**: Pulls up to 1,000 tweets matching 'COVID' near a given location between two dates. 
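
As a quick illustration of how `perdelta` behaves, the sketch below repeats its definition and steps through a short interval. The interval is arbitrary and chosen only for this example; note that the end date itself is never yielded.

```python
import datetime

def perdelta(start, end, delta):
    """Yield values from start up to (but not including) end, stepping by delta."""
    curr = start
    while curr < end:
        yield curr
        curr += delta

# Hypothetical interval, for illustration only
starts = list(perdelta(datetime.date(2020, 3, 8),
                       datetime.date(2020, 4, 5),
                       datetime.timedelta(days=14)))
print(starts)
```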

In [2]:
################ date function
def perdelta(start, end, delta):
    curr = start
    while curr < end:
        yield curr
        curr += delta
        
################ get tweets function 
def getTweets(city, startDate, endDate):
    # maximum number of tweets to request per city and date range
    n = 1000
    
    tweetCriteria = got.manager.TweetCriteria().setQuerySearch('COVID')\
    .setSince(startDate)\
    .setUntil(endDate)\
    .setMaxTweets(n)\
    .setNear(city)
    
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    print(len(tweets))  # number of tweets returned for this city
    
    # one row per tweet: text, hashtags, plus the query's location and date range
    return [[t.text, t.hashtags, city, startDate, endDate] for t in tweets]

### Twitter Data

#### Largest City by State

* Read in a list of the 1,000 largest [cities](https://public.opendatasoft.com/explore/dataset/1000-largest-us-cities-by-population-with-geographic-coordinates/table/?sort=-rank) in the US
* Select the most populous city in each state and extract its geocoordinates
* Split the list of geocoordinates into 2 halves so that the pull can run on 2 separate machines (there is a max of 15 requests per 15 minutes)
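
The selection and split steps above can be sketched on a toy table. The city names, populations, and coordinates here are made up for illustration; only the column names (`State`, `Population`, `Coordinates`) are taken from the cells in this notebook.

```python
import pandas as pd

# Hypothetical data, not taken from largeCities.csv
cities = pd.DataFrame({
    'City': ['A', 'B', 'C', 'D', 'E'],
    'State': ['X', 'X', 'Y', 'Y', 'Z'],
    'Population': [100, 500, 300, 200, 50],
    'Coordinates': ['1.0,1.0', '2.0,2.0', '3.0,3.0', '4.0,4.0', '5.0,5.0'],
})

# Sort so the most populous city comes first within each state,
# then keep the first row of every state group
top_cities = (cities.sort_values(by=['State', 'Population'], ascending=False)
                    .groupby('State', sort=False)
                    .head(1))
coords = top_cities['Coordinates'].tolist()

# Split the coordinate list into two halves, one per machine
mid = len(coords) // 2
first_half, second_half = coords[:mid], coords[mid:]
print(first_half, second_half)
```

When the number of cities is odd, the second half receives the extra entry.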

In [3]:
# read in cities doc, select the most populous city from each state
# https://stackoverflow.com/questions/50415632/how-to-select-top-n-row-from-each-group-after-group-by-in-pandas
allData = pd.read_csv('largeCities.csv', delimiter=';')
final_cities = allData.sort_values(by = ['State', 'Population'], ascending=False).groupby(['State'], sort=False).head(1)
coords = final_cities['Coordinates'].values.tolist()

In [4]:
# splitting the coordinates into 2 lists
mid = math.floor(len(coords)/2)
coords1 = coords[0:mid]
coords2 = coords[mid:]

#### Date Ranges
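Next, we'll generate the date ranges for the pull. The commented-out loop below builds windowed ranges starting on a Sunday, intended to cover **3/8/2020** to **7/11/2020** in 2-week blocks. The active line instead uses a single range from **3/8/2020** to **7/15/2020**, which is what the pull in this notebook runs on.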
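
A minimal sketch of how 2-week Sunday-to-Saturday windows covering 3/8/2020 to 7/11/2020 could be generated with `perdelta`. It assumes each window ends 13 days after it starts, which is one reading of the description and not the single date range the cell below actually uses. The names `first_sunday`, `last_saturday`, and `windows` are introduced here for illustration.

```python
import datetime

def perdelta(start, end, delta):
    curr = start
    while curr < end:
        yield curr
        curr += delta

first_sunday = datetime.date(2020, 3, 8)
last_saturday = datetime.date(2020, 7, 11)

windows = []
for start in perdelta(first_sunday, last_saturday, datetime.timedelta(days=14)):
    # Assumption: a 2-week window runs from a Sunday to the second Saturday after it
    end = start + datetime.timedelta(days=13)
    windows.append((start.strftime("%Y-%m-%d"), end.strftime("%Y-%m-%d")))

for w in windows:
    print(w)
```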

In [5]:
#all_dates = []
#for result in perdelta(datetime.date(2020, 3, 8), datetime.date(2020, 7, 6), datetime.timedelta(days=14)):  
#    nextWk = result + datetime.timedelta(days=6)
#    startDt = result.strftime("%Y-%m-%d")
#    endDt = nextWk.strftime("%Y-%m-%d")   
#    all_dates.append((startDt,endDt))
    
all_dates = [(datetime.date(2020, 3, 8).strftime("%Y-%m-%d"), datetime.date(2020, 7, 15).strftime("%Y-%m-%d"))]
all_dates

[('2020-03-08', '2020-07-15')]

#### Testing Set

In [None]:
finalList = []
test = coords[0:2]

for c in test:
    print(c)
    for d in all_dates:
        ls = getTweets(c,d[0],d[1])
        finalList.extend(ls)

In [None]:
df = pd.DataFrame(finalList, columns = ['TEXT', 'HASHTAGS', 'COORDS', 'WEEK_START', 'WEEK_END']) 
df.to_csv('test_data.csv')

#### Pull Tweets

In [6]:
# Cycle through the first half of the cities (coords1)
finalList = []

for i,c in enumerate(coords1):
    print(c)
    if (i+1) % 15 == 0:
        time.sleep(930) # wait 15 min, 30 sec before continuing
    for d in all_dates:
        ls = getTweets(c,d[0],d[1])
        finalList.extend(ls)

41.1399814,-104.8202462
162
43.0389025,-87.9064736
1000
38.3498195,-81.6326234
747
47.6062095,-122.3320708
1000
36.8529263,-75.977985
1000
44.4758825,-73.212072
650
40.7607793,-111.8910474
1000
29.7604267,-95.3698028
1000
35.1495343,-90.0489801
1000
43.5445959,-96.7311034
661
34.0007104,-81.0348144
1000
41.8239891,-71.4128343
1000
39.9525839,-75.1652215
1000
45.5230622,-122.6764816
1000
35.4675602,-97.5164276
1000
39.9611755,-82.9987942
1000
46.8771863,-96.7898034
797
35.2270869,-80.8431267
1000
40.7127837,-74.0059413
1000
35.0853336,-106.6055534
1000
40.735657,-74.1723667
1000
42.9956397,-71.4547891
664
36.1699412,-115.1398296
1000
41.2523634,-95.9979883
1000
45.7832856,-108.5006904
213
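
The pause-every-15-cities pattern used in the pull loop can be factored into a small helper. This is a sketch under stated assumptions, not the notebook's implementation: `pull_in_batches`, its parameters, and the stand-in `fetch` function are hypothetical. This variant pauses after every 15 completed calls, whereas the loop above pauses just before every 15th city.

```python
import time

def pull_in_batches(items, fetch, batch_size=15, pause_seconds=930, sleep=time.sleep):
    """Call fetch(item) for every item, pausing after each full batch."""
    results = []
    for i, item in enumerate(items):
        # Pause before starting a new batch (but not before the first item)
        if i > 0 and i % batch_size == 0:
            sleep(pause_seconds)
        results.extend(fetch(item))
    return results

# Illustration with a stand-in fetch function and a sleep that only records
# the requested pauses, so the example finishes immediately
pauses = []
rows = pull_in_batches(range(31), fetch=lambda x: [x], sleep=pauses.append)
print(len(rows), pauses)
```

Passing `sleep` as a parameter keeps the real delay in production use while letting the logic be checked without waiting.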


In [7]:
df = pd.DataFrame(finalList, columns = ['TEXT', 'HASHTAGS', 'COORDS', 'WEEK_START', 'WEEK_END'])

In [10]:
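# attach city metadata to each tweet by matching on the coordinate string
finalCsv = pd.merge(left=df, right=final_cities, left_on='COORDS', right_on='Coordinates')
# drop columns not needed downstream (Coordinates duplicates COORDS)
finalCsv = finalCsv.drop(columns=['Rank', 'Growth From 2000 to 2013', 'Coordinates'])
finalCsv.to_csv('csv1.csv')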