### CUNY Data 620 - Web Analytics, Summer 2020  
**Final Project: Twitter Pull**   
**Prof:** Alain Ledon  
**Members:** Misha Kollontai, Amber Ferger, Zach Alexander, Subhalaxmi Rout 

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import warnings
import datetime
import time
import math
import GetOldTweets3 as got
warnings.filterwarnings('ignore')

### Functions

We'll define the following functions:
* **perdelta**: Based on a [Stack Overflow](https://stackoverflow.com/questions/10688006/generate-a-list-of-datetimes-between-an-interval) thread, this generator yields the start dates used to build the date ranges for our Twitter pull.
* **getTweets**: Pulls up to 100 tweets matching the search term `COVID` near a given location for a given date range.

In [2]:
################ date function
def perdelta(start, end, delta):
    curr = start
    while curr < end:
        yield curr
        curr += delta
        
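################ get tweets function 
def getTweets(city, startDate, endDate):
    # Pull up to n 'COVID' tweets near `city` between startDate and endDate
    n = 100  # maximum number of tweets per request
    
    tweetCriteria = got.manager.TweetCriteria().setQuerySearch('COVID')\
    .setSince(startDate)\
    .setUntil(endDate)\
    .setMaxTweets(n)\
    .setNear(city)
    
    tweets = got.manager.TweetManager.getTweets(tweetCriteria)
    print(len(tweets))  # number of tweets returned for this location and date range
    
    # One row per tweet, tagged with the query's location and date range
    ls = []
    for i in tweets:
        ls.append([i.text, i.hashtags, i.date, i.username, city, startDate, endDate])
    
    return ls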
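
As a quick illustration of how `perdelta` behaves, here is a minimal sketch that repeats the generator from above and lists the values it yields for a short interval. The three-week interval is chosen only for the example. Note that the `end` value itself is never yielded, because the loop condition is a strict `<`.

```python
import datetime

def perdelta(start, end, delta):
    """Yield start, start + delta, ... for as long as the value is strictly before end."""
    curr = start
    while curr < end:
        yield curr
        curr += delta

# Example interval (an assumption for this demo): three weekly steps
starts = list(perdelta(datetime.date(2020, 3, 8),
                       datetime.date(2020, 3, 29),
                       datetime.timedelta(days=7)))
print(starts)
```

The end date of 2020-03-29 is excluded, so the list holds only the three start dates before it.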

### Twitter Data

#### Largest City by State

* Read in a list of the top 1000 [cities](https://public.opendatasoft.com/explore/dataset/1000-largest-us-cities-by-population-with-geographic-coordinates/table/?sort=-rank) in the US
* Select top city by state, extract geocoordinates
* Split the geocoordinates data into 2 lists so that we can run on 2 separate machines (there is a max of 15 requests per 15 minutes)

In [3]:
# read in cities doc, select the most populous city from each state
# https://stackoverflow.com/questions/50415632/how-to-select-top-n-row-from-each-group-after-group-by-in-pandas
allData = pd.read_csv('largeCities.csv', delimiter=';')
final_cities = allData.sort_values(by = ['State', 'Population'], ascending=False).groupby(['State'], sort=False).head(1)
coords = final_cities['Coordinates'].values.tolist()
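
To make the sort-then-`head(1)` pattern easier to follow, here is a small sketch on a toy table. The state codes, city names, and populations are invented placeholders, not values from `largeCities.csv`. Sorting by population within each state and keeping the first row per group leaves one row per state.

```python
import pandas as pd

# Toy data: all values are made up for illustration only
toy = pd.DataFrame({
    'City': ['a1', 'a2', 'b1', 'b2'],
    'State': ['AA', 'AA', 'BB', 'BB'],
    'Population': [500, 100, 50, 900],
})

# Sort descending so the largest city comes first within each state,
# then keep the first row of every state group
top = (toy.sort_values(by=['State', 'Population'], ascending=False)
          .groupby('State', sort=False)
          .head(1))
print(top[['State', 'City', 'Population']])
```

Because `head(1)` keeps whichever row comes first in a group, the result depends on the sort running before the `groupby`.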

In [4]:
# splitting the coordinates into 2 lists
mid = len(coords) // 2
coords1 = coords[0:mid]
coords2 = coords[mid:]

#### Date Ranges
Next, we'll generate the date ranges for the pull. Each range represents 1 week, defined as Sunday - Saturday. The total span of the analysis runs from **3/8/2020** to **7/11/2020**.

In [5]:
all_dates = []
for result in perdelta(datetime.date(2020, 3, 8), datetime.date(2020, 7, 6), datetime.timedelta(days=7)):  
    nextWk = result + datetime.timedelta(days=6)
    startDt = result.strftime("%Y-%m-%d")
    endDt = nextWk.strftime("%Y-%m-%d")   
    all_dates.append((startDt,endDt))
    
#all_dates = [(datetime.date(2020, 3, 8).strftime("%Y-%m-%d"), datetime.date(2020, 7, 15).strftime("%Y-%m-%d"))]
all_dates

[('2020-03-08', '2020-03-14'),
 ('2020-03-15', '2020-03-21'),
 ('2020-03-22', '2020-03-28'),
 ('2020-03-29', '2020-04-04'),
 ('2020-04-05', '2020-04-11'),
 ('2020-04-12', '2020-04-18'),
 ('2020-04-19', '2020-04-25'),
 ('2020-04-26', '2020-05-02'),
 ('2020-05-03', '2020-05-09'),
 ('2020-05-10', '2020-05-16'),
 ('2020-05-17', '2020-05-23'),
 ('2020-05-24', '2020-05-30'),
 ('2020-05-31', '2020-06-06'),
 ('2020-06-07', '2020-06-13'),
 ('2020-06-14', '2020-06-20'),
 ('2020-06-21', '2020-06-27'),
 ('2020-06-28', '2020-07-04'),
 ('2020-07-05', '2020-07-11')]
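
The list above can be sanity-checked with a short sketch like the following, which rebuilds the ranges and asserts that every range starts on a Sunday, ends on a Saturday, and spans seven calendar days. The check itself is an addition for illustration and is not part of the original pull.

```python
import datetime

def perdelta(start, end, delta):
    curr = start
    while curr < end:
        yield curr
        curr += delta

# Rebuild the weekly (start, end) string pairs exactly as in the cell above
all_dates = []
for result in perdelta(datetime.date(2020, 3, 8), datetime.date(2020, 7, 6), datetime.timedelta(days=7)):
    nextWk = result + datetime.timedelta(days=6)
    all_dates.append((result.strftime("%Y-%m-%d"), nextWk.strftime("%Y-%m-%d")))

for start_str, end_str in all_dates:
    start = datetime.datetime.strptime(start_str, "%Y-%m-%d").date()
    end = datetime.datetime.strptime(end_str, "%Y-%m-%d").date()
    assert start.weekday() == 6     # date.weekday(): Monday is 0, so Sunday is 6
    assert end.weekday() == 5       # Saturday
    assert (end - start).days == 6  # Sunday through Saturday inclusive is 7 days

print(len(all_dates), all_dates[0], all_dates[-1])
```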

#### Pull Tweets

In [6]:
# Cycle through the first half of the cities (coords1); coords2 runs on a second machine
finalList = []

for i,c in enumerate(coords1):
    print(i)
    if (i+1) % 2 == 0:
        time.sleep(900) # wait 15 min before continuing
    for d in all_dates:
        ls = getTweets(c,d[0],d[1])
        finalList.extend(ls)

0
3
4
3
5
6
3
2
3
3
1
0
2
2
2
1
5
6
24
1


KeyboardInterrupt: 
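
Since the run above was stopped by hand, one possible way to structure a long, throttled pull is sketched below. It is an illustration under stated assumptions rather than the code used here: `pull_all`, `fetch`, `pause_every`, `pause_seconds`, and `sleep` are hypothetical names, and `fetch` stands in for any function with the same arguments as `getTweets`. Unlike the loop above, this version pauses after every `pause_every` locations rather than before every second one.

```python
import time

def pull_all(locations, date_ranges, fetch, pause_every=2, pause_seconds=900, sleep=time.sleep):
    """Collect rows for every (location, date range) pair, pausing periodically.

    fetch(location, start, end) is a hypothetical callable that returns a list of rows.
    """
    rows = []
    try:
        for i, loc in enumerate(locations):
            # Pause once each batch of `pause_every` locations has been processed
            if i > 0 and i % pause_every == 0:
                sleep(pause_seconds)
            for start, end in date_ranges:
                rows.extend(fetch(loc, start, end))
    except KeyboardInterrupt:
        # Return whatever was collected before the manual stop
        print(f"Interrupted; keeping {len(rows)} rows collected so far")
    return rows

# Demo with a fake fetch and a recording sleep, so nothing is requested or delayed
pauses = []
demo_rows = pull_all(['a', 'b', 'c'], [('s', 'e')],
                     fetch=lambda loc, s, e: [[loc, s, e]],
                     sleep=pauses.append)
print(demo_rows, pauses)
```

Passing `sleep` in as a parameter is a design choice that lets the loop be tested without waiting 15 minutes.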

In [9]:
df = pd.DataFrame(finalList, columns = ['TEXT', 'HASHTAGS', 'TWEET_DATE', 'USERNAME', 'COORDS', 'WEEK_START', 'WEEK_END'])

In [10]:
finalCsv = pd.merge(left=df, right=final_cities, left_on='COORDS', right_on='Coordinates')
finalCsv = finalCsv.drop(columns=['Rank', 'Growth From 2000 to 2013', 'Coordinates'])
finalCsv.to_csv('csvFinal.csv')
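
Because the merge joins on the coordinate string, any tweet row whose `COORDS` value has no exact match in `Coordinates` is silently dropped by the default inner join. The sketch below shows one way to surface such rows using pandas' `how='left'` and `indicator=True` options. The two small frames are invented toy data, not the project's data.

```python
import pandas as pd

# Toy frames: values are placeholders for illustration only
tweets = pd.DataFrame({'TEXT': ['t1', 't2'], 'COORDS': ['1.0,2.0', '9.0,9.0']})
cities = pd.DataFrame({'City': ['X'], 'Coordinates': ['1.0,2.0']})

# A left join keeps every tweet; the _merge column records whether a city matched
merged = pd.merge(left=tweets, right=cities, how='left',
                  left_on='COORDS', right_on='Coordinates', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(len(unmatched), "tweet rows without a matching city")
```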