## EDA-01

### Questions:
1. What rules did you put in place to refine/clean the data?
2. What number of duplicates occurred?
3. What amount of data was nonsensical?
4. What number of unique posts/posters occurred?
5. Is there surprising missingness or sparsity (e.g. by time, date, or location)?




### Field Descriptions

1. **tweet_id**: Unique identifier of the tweet.

2. **created_at**: Timestamp of when the tweet was sent.

3. **full_text**: The text of the tweet.

4. **geo**: Geo tag of the tweet. Only included if the user has location services enabled.

5. **coordinates**: Latitude and longitude of the tweet. Only included if the user has location services enabled.

6. **place**: Twitter-identified place. Only included if the user has location services enabled.

7. **retweet_count**: The number of retweets the tweet received.

8. **favorite_count**: The number of favorites the tweet received.

9. **possibly_sensitive**: Flag set by Twitter when the tweet contains sensitive material. If Twitter did not set this flag on a tweet, we imputed NOT FOUND.

10. **lang**: The language of the tweet. We filtered for English ('en').

11. **user_id**: Unique identifier of the person or entity sending the tweet.

### Data Collection Methodology 
The Twitter public API supports query-based keyword search: each request returns a random sample of up to 100 tweets from the previous 7 days that contain the keywords.

Data were collected from February 23 to March 3 by searching for tweets that contained either a Covid-19-related keyword together with a vaccine-related keyword, or a hashtag keyword on its own.

Queries were generated at random following this logic in order to obtain unique tweets per API request.
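
The query logic described above could be sketched roughly as follows. The keyword lists, the 50/50 split between the two query types, and the `random_query` helper are all assumptions made for illustration; the terms actually used for collection are not listed in this notebook.

```python
import random

# Hypothetical keyword lists; the real collection terms are not given here.
COVID_TERMS = ["covid", "covid19", "coronavirus"]
VACCINE_TERMS = ["vaccine", "vaccination", "vaccinated"]
HASHTAG_TERMS = ["#covidvaccine", "#vaccine"]


def random_query(rng):
    """Return either '<covid term> <vaccine term>' or a single hashtag."""
    # Assumed 50/50 split between the two query types.
    if rng.random() < 0.5:
        return f"{rng.choice(COVID_TERMS)} {rng.choice(VACCINE_TERMS)}"
    return rng.choice(HASHTAG_TERMS)


rng = random.Random(0)  # seeded so the sketch is reproducible
queries = [random_query(rng) for _ in range(5)]
print(queries)
```

Each generated string would then be passed as the search query of one API request.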


In [1]:
# import data libraries
import pandas as pd
pd.options.mode.chained_assignment = None  
import datetime

In [2]:
# import viz lib/settings
import matplotlib.pyplot as plt

%matplotlib inline

large = 22
med = 18
small = 16
# "axes.titlesize" was listed twice; only the later value (med) took effect, so it is kept
params = {"legend.fontsize": large,
          "figure.figsize": (20, 8),
          "axes.labelsize": med,
          "axes.titlesize": med,
          "xtick.labelsize": med,
          "ytick.labelsize": med,
          "figure.titlesize": large,
          }
plt.rcParams.update(params)
# on matplotlib >= 3.6 this style is named 'seaborn-v0_8-darkgrid'
plt.style.use('seaborn-darkgrid')

In [3]:
import os 
# set your cwd
my_cwd = "/home/mriveralanas/projects/Mar21-vaccine-uptake"
os.chdir(my_cwd)

In [4]:
# pull local twitter_text
raw_text = pd.read_csv("data/twitter_gdrive/twitter_text.csv")

In [5]:
raw_text.head()

Unnamed: 0,tweet_id,created_at,full_text,geo,coordinates,place,retweet_count,favorite_count,possibly_sensitive,lang,user_id
0,1364223054851813377,Tue Feb 23 14:38:16 +0000 2021,Here’s what's in the COVID relief package:\n \...,,,,9160,38093,NOT FOUND,en,29501250.0
1,1364381497302671362,Wed Feb 24 01:07:52 +0000 2021,Will the National Endowment for the Arts be he...,,,,6131,18560,NOT FOUND,en,1.201671e+18
2,1364609594056704002,Wed Feb 24 16:14:15 +0000 2021,"This is both anecdotal and early, but many lon...",,,,5941,63174,NOT FOUND,en,38428720.0
3,1364726798412443649,Wed Feb 24 23:59:58 +0000 2021,A Link to Professor Chossudovsky’s Analysis of...,,,,1,0,False,en,2192010000.0
4,1364726797947052038,Wed Feb 24 23:59:58 +0000 2021,Children warned over hugging grandparents even...,,,,0,2,False,en,2868190000.0
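
In the output above, `user_id` is displayed as a float (for example `1.201671e+18`), which is typically what pandas does when an integer column contains missing values. Floats cannot represent every integer above 2^53, so large IDs may not survive the round trip and joins against `twitter_user` on `user_id` could become unreliable. One possible workaround, sketched here on a small inline CSV rather than the real file, is to read ID columns as strings. The third row is made up to show a missing `user_id`.

```python
import io

import pandas as pd

# Inline stand-in for twitter_text.csv; the last row is hypothetical.
csv_text = """tweet_id,user_id
1364223054851813377,29501253
1364381497302671362,1201670995435646976
1364609594056704002,
"""

# Default parsing: the missing value forces user_id to float64.
as_float = pd.read_csv(io.StringIO(csv_text))

# Reading IDs as text keeps every digit; missing values stay NaN.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"tweet_id": str, "user_id": str})

print(as_float.dtypes)
print(as_text)
```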


In [7]:
# pull local twitter_user
raw_user = pd.read_csv("data/twitter_gdrive/twitter_user.csv")

In [8]:
raw_user.head()

Unnamed: 0,user_id,user_location,user_verified,user_followers_count
0,29501253,"Burbank, CA",True,2931387
1,1201670995435646976,"Colorado, USA",True,557873
2,38428725,,True,124131
3,2192010002,here & now 🇺🇸,False,2947
4,2868189989,Brighton UK and Bayonne France,False,2556


In [10]:
# pull local twitter_hashtag
raw_hashtag = pd.read_csv("data/twitter_gdrive/twitter_hashtag.csv")

In [11]:
raw_hashtag.head()

Unnamed: 0,hashtags,tweet_id
0,Covid,1364726798412443649
1,COVID19,1364726771606691845
2,COVID19,1364726746797367297
3,eviction,1364726728183144449
4,ChuckSchumer,1364726702027530241
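
Questions 2 and 4 (duplicates and unique posts/posters) could be approached along these lines. This is a minimal sketch on a toy table with made-up values, assuming that a duplicate means a repeated `tweet_id`; the notebook does not state which definition was used.

```python
import pandas as pd

# Toy stand-in for the tweet table; all values are made up for illustration.
tweets = pd.DataFrame({
    "tweet_id": [101, 102, 102, 103],
    "user_id": [1, 2, 2, 1],
    "full_text": ["a", "b", "b", "c"],
})

# Rows whose tweet_id already appeared earlier in the table.
n_duplicates = int(tweets.duplicated(subset="tweet_id").sum())

# Distinct posts and distinct posters.
n_unique_posts = tweets["tweet_id"].nunique()
n_unique_posters = tweets["user_id"].nunique()

# Keep the first occurrence of each tweet.
deduped = tweets.drop_duplicates(subset="tweet_id", keep="first")

print(n_duplicates, n_unique_posts, n_unique_posters, len(deduped))
```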


### What is the tweet location distribution over the 7-day window?

In [None]:
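def convert_date(d):
    """Convert a Twitter created_at string to YYYY-MM-DD, or "NOT FOUND" if missing."""
    if pd.isnull(d):
        return "NOT FOUND"
    return datetime.datetime.strptime(d, "%a %b %d %H:%M:%S +0000 %Y").strftime("%Y-%m-%d")

# raw_data is not defined in the cells above; it is assumed here to be the tweet table loaded as raw_text
raw_data = raw_text.copy()
raw_data['date'] = raw_data['created_at'].map(convert_date)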
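
As an alternative to calling `convert_date` row by row, the same conversion could be done with pandas' vectorized date parsing. The sketch below runs on a small made-up Series rather than the real `created_at` column, and assumes all timestamps follow the format shown in the table output.

```python
import numpy as np
import pandas as pd

# Sample created_at values in the format shown above, plus one missing entry.
created_at = pd.Series([
    "Tue Feb 23 14:38:16 +0000 2021",
    "Wed Feb 24 01:07:52 +0000 2021",
    np.nan,
])

# Parse to timezone-aware datetimes; unparseable or missing values become NaT.
parsed = pd.to_datetime(
    created_at, format="%a %b %d %H:%M:%S %z %Y", utc=True, errors="coerce"
)

# Format as YYYY-MM-DD and label missing dates the same way convert_date does.
dates = parsed.dt.strftime("%Y-%m-%d").fillna("NOT FOUND")
print(dates.tolist())
```

Keeping the parsed datetimes around (rather than only the formatted strings) would also make it easier to resample or sort by time later.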


In [None]:
tweet_time_geo = raw_data[['tweet_id', 'date', 'geo', 'coordinates', 'place', 'lang', 'user_id']]

In [None]:
tweet_bydate = tweet_time_geo[['date','tweet_id']].groupby(['date'])['tweet_id'].count().reset_index()

x = tweet_bydate['date']
y = tweet_bydate['tweet_id']

x_pos = [i for i, _ in enumerate(x)]
plt.bar(x_pos, y, color='green')
plt.xlabel("Date")
plt.ylabel("Number of tweets")
plt.title("Tweets collected per day")

plt.xticks(x_pos, x)

plt.show()

In [None]:
# geo field not useful
tweet_time_geo.groupby(['geo'])['place'].count()

In [None]:
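# rows with no missing values in any selected column (displayed only, not assigned)
tweet_time_geo.dropna()

In [None]:
# same check on the full tweet table
raw_data.dropna()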
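
To quantify missingness rather than only dropping incomplete rows (question 5), the share of missing values per column, and per column per day, could be computed as below. This is a sketch on a toy frame with made-up values; the column names mirror those in `tweet_time_geo`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for tweet_time_geo; all values are made up for illustration.
tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "date": ["2021-02-23", "2021-02-23", "2021-02-24", "2021-02-24"],
    "geo": [np.nan, np.nan, np.nan, "x"],
    "place": [np.nan, "p", np.nan, "q"],
})

# Fraction of missing values in each column.
missing_share = tweets.isna().mean()

# Fraction of missing values in each column, broken out by date.
missing_by_date = (
    tweets.drop(columns="date").isna().astype(int).groupby(tweets["date"]).mean()
)

print(missing_share)
print(missing_by_date)
```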