### CUNY Data 620 - Web Analytics, Summer 2020  
**Final Project: Twitter Pull**   
**Prof:** Alain Ledon  
**Members:** Misha Kollontai, Amber Ferger, Zach Alexander, Subhalaxmi Rout 

### Import Libraries

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import warnings
import datetime
import time
import math
import GetOldTweets3 as got
warnings.filterwarnings('ignore')

### Functions

We'll define the following functions:
* **perdelta**: Based on a [Stack Overflow](https://stackoverflow.com/questions/10688006/generate-a-list-of-datetimes-between-an-interval) thread, this generator yields the start date of each date range for our Twitter pull. 
* **getTweets**: Pulls up to 10 top tweets matching the query `COVID` near a given location within a given date range. 

In [2]:
################ date function
def perdelta(start, end, delta):
    curr = start
    while curr < end:
        yield curr
        curr += delta
        
################ get tweets function 
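def getTweets(city, startDate, endDate):
    n = 10  # maximum number of tweets to pull per city and date range
    tweetCriteria = got.manager.TweetCriteria().setQuerySearch('COVID')\
    .setSince(startDate)\
    .setUntil(endDate)\
    .setMaxTweets(n)\
    .setTopTweets(1)\
    .setNear(city)
    
    ls = []

    try:
        # run the query once and reuse the result
        # (previously the query was re-run for every index in range(n))
        tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    except:
        # kept broad on purpose: any failure yields an empty result for this pull
        tweets = []

    for tweet in tweets[:n]:
        ls.append([tweet, city, startDate, endDate])
    
    return ls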
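
As a quick illustration of how `perdelta` behaves, the sketch below repeats the generator so it runs on its own and steps through a short interval in 14-day increments. The dates are chosen only for the demo, and `demo_starts` is a hypothetical name. Note that the end date acts as an exclusive bound.

```python
import datetime

def perdelta(start, end, delta):
    # Yield dates from start up to (but not including) end, stepping by delta
    curr = start
    while curr < end:
        yield curr
        curr += delta

# Demo interval (illustrative only): three start dates, 14 days apart
demo_starts = list(perdelta(datetime.date(2020, 3, 8),
                            datetime.date(2020, 4, 6),
                            datetime.timedelta(days=14)))

for d in demo_starts:
    print(d.strftime("%Y-%m-%d"))
```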

### Twitter Data

#### Largest City by State

* Read in a list of the 1,000 largest [cities](https://public.opendatasoft.com/explore/dataset/1000-largest-us-cities-by-population-with-geographic-coordinates/table/?sort=-rank) in the US
* Select top city by state, extract geocoordinates
* Split the geocoordinates data into 2 lists so that we can run on 2 separate machines (there is a max of 15 requests per 15 minutes)
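
The steps above can be sketched in plain Python on toy data. This is only an illustration of the logic, not the notebook's pandas implementation: the rows, the helper names `largest_city_per_state` and `split_in_two`, and all values are made up for the example.

```python
# Toy rows standing in for largeCities.csv (values are illustrative, not real data)
rows = [
    {"State": "A", "City": "a1", "Population": 100, "Coordinates": "1.0,-1.0"},
    {"State": "A", "City": "a2", "Population": 300, "Coordinates": "2.0,-2.0"},
    {"State": "B", "City": "b1", "Population": 200, "Coordinates": "3.0,-3.0"},
    {"State": "C", "City": "c1", "Population": 50, "Coordinates": "4.0,-4.0"},
]

def largest_city_per_state(rows):
    """Keep the most populous row for each state."""
    best = {}
    for row in rows:
        state = row["State"]
        if state not in best or row["Population"] > best[state]["Population"]:
            best[state] = row
    return best

def split_in_two(items):
    """Split a list into two parts; the second part takes any extra item."""
    mid = len(items) // 2
    return items[:mid], items[mid:]

top = largest_city_per_state(rows)
coords_demo = [top[state]["Coordinates"] for state in sorted(top)]
first_half, second_half = split_in_two(coords_demo)
print(first_half, second_half)
```

Each half can then be processed on a separate machine, so that each machine stays within its own request limit.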

In [3]:
# read in cities doc, select the most populous city from each state
# https://stackoverflow.com/questions/50415632/how-to-select-top-n-row-from-each-group-after-group-by-in-pandas
allData = pd.read_csv('largeCities.csv', delimiter=';')
final_cities = allData.sort_values(by = ['State', 'Population'], ascending=False).groupby(['State'], sort=False).head(1)
coords = final_cities['Coordinates'].values.tolist()

In [19]:
# splitting the coordinates into 2 lists
mid = math.floor(len(coords)/2)
coords1 = coords[0:mid]
coords2 = coords[mid:]

#### Date Ranges
Next, we'll generate date ranges for the pull. Each range will represent 2 weeks, running from a Sunday to the Saturday two weeks later. The total span of the analysis will go from **3/8/2020** to **7/11/2020**.

In [21]:
all_dates = []
for result in perdelta(datetime.date(2020, 3, 8), datetime.date(2020, 7, 6), datetime.timedelta(days=14)):  
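    # end each range 13 days after its Sunday start so it covers the full 2 weeks
    # (the last range then ends on 7/11/2020, matching the stated span)
    nextWk = result + datetime.timedelta(days=13)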
    startDt = result.strftime("%Y-%m-%d")
    endDt = nextWk.strftime("%Y-%m-%d")   
    all_dates.append((startDt,endDt))
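
As a sanity check on the ranges described in this section, here is a minimal standalone sketch. It assumes each window ends 13 days after its Sunday start, which is what a two-week Sunday to Saturday range and the 7/11/2020 end of the span imply. The names `windows`, `start`, `stop`, and `step` are hypothetical.

```python
import datetime

start = datetime.date(2020, 3, 8)    # first Sunday of the span
stop = datetime.date(2020, 7, 6)     # exclusive bound on window start dates
step = datetime.timedelta(days=14)   # a new window begins every two weeks

windows = []
curr = start
while curr < stop:
    # Assumption: a window covers 14 days, so it ends 13 days after it starts
    last_day = curr + datetime.timedelta(days=13)
    windows.append((curr.strftime("%Y-%m-%d"), last_day.strftime("%Y-%m-%d")))
    curr += step

for w in windows:
    print(w)
```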

#### Testing Set

In [31]:
finalList = []
test = coords[0:2]

for c in test:
    print(c)
    for d in all_dates:
        ls = getTweets(c,d[0],d[1])
        finalList.append(ls)
        
compiledLs = []

for i in finalList:
    for j in i:
        compiledLs.append([j[1], j[2], j[3], j[0].text, j[0].hashtags])

df = pd.DataFrame(compiledLs, columns = ['COORDS', 'WEEK_START', 'WEEK_END', 'TEXT', 'HASHTAGS']) 
df.to_csv('test_data.csv')

41.1399814,-104.8202462
43.0389025,-87.9064736


#### Pull Tweets

In [None]:
# Cycle through the first half of the cities (coords1)
finalList = []

for i,c in enumerate(coords1):
    # pause after every 2 cities to stay under the request limit
    if i > 0 and i % 2 == 0:
        time.sleep(930) # wait 15 min, 30 sec before continuing
    for d in all_dates:
        ls = getTweets(c,d[0],d[1])
        finalList.append(ls)
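
The pause-every-two-cities pattern in this loop can be factored into a small helper. The sketch below is one possible shape, not part of the original notebook: `run_in_batches` and its parameters are hypothetical names, and the `sleep` argument is injectable so the logic can be tried without actually waiting.

```python
import time

def run_in_batches(items, work, batch_size=2, pause_seconds=930, sleep=time.sleep):
    """Call work(item) for each item, pausing before every batch except the first."""
    results = []
    for i, item in enumerate(items):
        if i > 0 and i % batch_size == 0:
            sleep(pause_seconds)
        results.append(work(item))
    return results

# Dry run: record the requested pauses instead of sleeping
pauses = []
out = run_in_batches(["a", "b", "c", "d", "e"], str.upper, sleep=pauses.append)
print(out, pauses)
```

In the real pull, `work` would wrap the inner loop over `all_dates` for a single city, and `sleep` would be left at its default.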