## What are you trying to do?
Articulate your objectives using absolutely no jargon (i.e. as if you were explaining to a salesperson, executive, or recruiter). Again, this should be a description of a real world problem, not an algorithm you aspire to use.

Disinformation can be thought of as a form of pollution that contaminates civil discourse. We can either regulate the polluters or filter out the disingenuous rhetoric. However, it can be difficult to discriminate between genuine and disingenuous posts. If we can't regulate the polluters, we have to scrub the environment, and before we can filter the pollution out we first have to find it.

> Civil discourse online is under attack. During the impeachment saga I invested a lot of time on /r/politics on Reddit. Ultimately, I want a tool that will combat disinformation on social media in real time.

## How has this problem been solved before?
If you feel like you are addressing a novel issue, what similar problems have been solved, and how are you borrowing from those? Note that at this point you should be citing papers, Medium articles, or other sources from the data science community. You should have links to your research in your proposal document and cite previous work.

### Same problem
#### BERT and CNN for Russian troll detection on Reddit - https://towardsdatascience.com/using-bert-and-cnns-for-russian-troll-detection-on-reddit-8ae59066b1c (AUC: 84.6%)

### Related problems
#### Latent semantic analysis on subreddits - https://fivethirtyeight.com/features/dissecting-trumps-most-rabid-online-following/
#### Building a convincing Reddit personality - https://bonkerfield.org/2020/02/reddit-bot-gpt2-bert/

## What is new about your approach, why do you think it will be successful?
It's OK to be working on a problem that has been worked on before, but you do need to have some novel contribution to the project.

## Who cares? If you're successful, what will the impact be?
It is preferable to be asked 'Who cares?' now than when you're telling someone about your project. Not every project needs to change the world, but a project whose appeal is very niche might not be the best way to spend your time and energy.

## How will you present your work?
Again, be ambitious in your proposals, but think about what is the most impactful way your project could exist.

## What are your data sources? What is the size of your dataset, and what is your storage format?
You should no longer be analyzing csv files. Is the amount of data you have limiting the ways you can approach the problem? Would you have a different approach if you could collect more? Same as last capstone, you should have all of your data on day one.

## What are potential problems with your capstone?
More importantly, what is your plan to mitigate these problems? Again, being aware of previous solutions to similar problems gives you a template for successfully working with data of this type. 

## What is the next thing you need to work on?
Getting the data, not just some, likely all? Understanding the data? Building a minimum viable product? Gauging how much signal might be in the data?

## Get the data from Twitter

In [56]:
import numpy as np
import pandas as pd

import requests
from bs4 import BeautifulSoup as BS
import re

In [81]:
# I acquired the information operations html file manually after giving my email address to the form here:
# https://transparency.twitter.com/en/information-operations.html
with open('/Users/cdavis/Downloads/twitter_information_operations.html', 'r') as fp:
    html = fp.read()
soup = BS(html, 'html.parser')

In [189]:
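# Each download link is an <a> tag whose text starts with 'Account ', 'Tweet ' or 'Media';
# find_all(string=...) returns the matching text nodes, so .parent is the enclosing <a>.
accounts = [e.parent['href'] for e in soup.find_all(string=re.compile('Account '))]
tweets = [e.parent['href'] for e in soup.find_all(string=re.compile('Tweet '))]
media = [e.parent['href'] for e in soup.find_all(string=re.compile('Media'))]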

In [163]:
def parse_event_title(event_titles):
    '''
    Input: event_titles (list of bs4.element.Tag)
    Output: list of dict containing attributes for each event
    
    e.g. 
        [<b>Egypt (February  2020) - 2541 Accounts</b>]
    returns    
        [{'country': 'Egypt',
         'month': 'February',
         'year': '2020',
         'num_accounts': '2541',
         'set': '1'}]
    '''
    keys = ['country', 'month', 'year', 'num_accounts', 'set']
    events = []
    for t in event_titles:
        temp_t = t.string
        # strip non-breaking spaces and punctuation so the title splits cleanly on spaces
        for mychars in ['\xa0', ' -', ',', ')']:
            temp_t = temp_t.replace(mychars, '')
        parsed = temp_t.split('(') # account for spaces in source of coordinated activity
        
        country = parsed[0].strip()
        other_attr = parsed[1].split(' ')[:-1] # drop the trailing word 'Accounts'
        this_set = '1' # default when the title names no set
        if len(other_attr) > 4: # 'set' appears if multiple releases from same source
            this_set = other_attr[3] # e.g. Venezuela (January 2019, set 2) - 764 accounts
            other_attr = other_attr[:2] + [other_attr[-1]]
        stripped = [country] + other_attr + [this_set]
        events.append(dict(zip(keys, stripped)))
    return events
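
The same title layout can also be written down as a single regular expression, which makes the expected format explicit and fails loudly on a title that does not fit. This is only an alternative sketch, not the parser used in this notebook: `TITLE_RE` and `parse_title` are hypothetical names, and the pattern assumes every title looks like the Egypt and Venezuela examples in the docstring and comments above.

```python
import re

# Assumed layout: "<country> (<month> <year>[, set <n>]) - <count> Accounts"
# In Python 3 str patterns, \s also matches the non-breaking space '\xa0'.
TITLE_RE = re.compile(
    r'^\s*(?P<country>.+?)\s*'
    r'\(\s*(?P<month>[A-Za-z]+)\s+(?P<year>\d{4})'
    r'(?:,\s*set\s+(?P<set>\d+))?\s*\)'
    r'\s*-\s*(?P<num_accounts>\d+)\s+accounts\s*$',
    re.IGNORECASE,
)

def parse_title(title):
    """Parse one event title string into a dict of string attributes."""
    match = TITLE_RE.match(title)
    if match is None:
        raise ValueError(f'unrecognized event title: {title!r}')
    event = match.groupdict()
    # Titles without an explicit "set N" are treated as set 1
    event['set'] = event['set'] or '1'
    return event
```

Because the country is captured as everything before the opening parenthesis, multi-word sources such as `Ghana / Nigeria` need no special handling.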

In [166]:
events_df = pd.DataFrame(parse_event_title(soup.select('b')))

In [187]:
events_df['accounts_url'] = accounts
events_df['tweets_url'] = tweets
events_df['media_url'] = media

In [188]:
events_df

| | country | month | year | num_accounts | set | accounts_url | tweets_url | media_url |
|---|---|---|---|---|---|---|---|---|
| 0 | Egypt | February | 2020 | 2541 | 1 | https://storage.cloud.google.com/twitter-elect... | https://storage.cloud.google.com/twitter-elect... | https://storage.cloud.google.com/twitter-elect... |
| 1 | Honduras | February | 2020 | 3104 | 1 | https://storage.cloud.google.com/twitter-elect... | https://storage.cloud.google.com/twitter-elect... | https://storage.cloud.google.com/twitter-elect... |
| 2 | Indonesia | February | 2020 | 795 | 1 | https://storage.cloud.google.com/twitter-elect... | https://storage.cloud.google.com/twitter-elect... | https://storage.cloud.google.com/twitter-elect... |
| 3 | Serbia | February | 2020 | 8558 | 1 | https://storage.cloud.google.com/twitter-elect... | https://storage.cloud.google.com/twitter-elect... | https://storage.cloud.google.com/twitter-elect... |
| 4 | SA_EG_AE | February | 2020 | 5350 | 1 | https://storage.cloud.google.com/twitter-elect... | https://storage.cloud.google.com/twitter-elect... | https://storage.cloud.google.com/twitter-elect... |
| 5 | Ghana / Nigeria | March | 2020 | 70 | 1 | https://storage.cloud.google.com/twitter-elect... | https://storage.cloud.google.com/twitter-elect... | https://storage.cloud.google.com/twitter-elect... |
| 6 | Saudi Arabia | October | 2019 | 5929 | 1 | https://storage.cloud.google.com/twitter-elect... | https://storage.cloud.google.com/twitter-elect... | https://storage.cloud.google.com/twitter-elect... |
| 7 | China | July | 2019 | 4301 | 3 | https://storage.googleapis.com/twitter-electio... | https://storage.googleapis.com/twitter-electio... | https://storage.cloud.google.com/twitter-elect... |
| 8 | Saudi Arabia | April | 2019 | 6 | 1 | https://storage.googleapis.com/twitter-electio... | https://storage.googleapis.com/twitter-electio... | https://storage.cloud.google.com/twitter-elect... |
| 9 | Ecuador | April | 2019 | 1019 | 1 | https://storage.cloud.google.com/twitter-elect... | https://storage.cloud.google.com/twitter-elect... | https://storage.cloud.google.com/twitter-elect... |


In [200]:
import os

def download_from_google_cloud_storage(raw_url, 
                          target_local_dir='data/twitter', 
                          bucket='twitter-election-integrity'):
    '''
    Given a raw link from Twitter's election integrity page, download the
    file from Google Cloud Storage to a local data directory.
    
    Input: raw_url (str) from Twitter's election integrity page
    Output: None (the file is written to target_local_dir)
    
    e.g.
    'https://storage.cloud.google.com/twitter-election-integrity/hashed/2020_04/egypt_022020/egypt_022020_users_csv_hashed.zip'
        downloads
    'https://twitter-election-integrity.storage.googleapis.com/hashed/2020_04/egypt_022020/egypt_022020_users_csv_hashed.zip'
    '''
    twitter_google_cloud_storage = 'https://' + bucket + '.storage.googleapis.com'

    # Splitting on '/' gives ['https:', '', host, bucket, ...]; keep only the object path
    url_parts = raw_url.split('/')
    filename = url_parts[-1]
    target_url = '/'.join([twitter_google_cloud_storage] + url_parts[4:])
    r = requests.get(target_url, allow_redirects=True)
    r.raise_for_status()  # fail loudly instead of saving an error response as a .zip
    os.makedirs(target_local_dir, exist_ok=True)  # create the data directory if missing
    with open(os.path.join(target_local_dir, filename), 'wb') as fp:
        fp.write(r.content)
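
The URL rewrite inside `download_from_google_cloud_storage` relies on the bucket name always sitting at index 3 of the split URL. A minimal sketch of the same rewrite with `urllib.parse` is shown here; it reads the bucket from the link itself, so it should handle both the `storage.cloud.google.com` and `storage.googleapis.com` hosts that appear in the table above. `to_bucket_subdomain_url` is a hypothetical helper name, and the sketch assumes every raw link has the form `https://<host>/<bucket>/<object path>`.

```python
from urllib.parse import urlsplit

def to_bucket_subdomain_url(raw_url):
    """Rewrite 'https://<host>/<bucket>/<object>' as 'https://<bucket>.storage.googleapis.com/<object>'."""
    path = urlsplit(raw_url).path.lstrip('/')
    # The first path segment is the bucket; everything after it is the object path
    bucket, _, object_path = path.partition('/')
    if not bucket or not object_path:
        raise ValueError(f'expected <host>/<bucket>/<object>, got: {raw_url!r}')
    return f'https://{bucket}.storage.googleapis.com/{object_path}'
```

The returned string could be passed to `requests.get` in place of `target_url`.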

In [201]:
raw_url = 'https://storage.cloud.google.com/twitter-election-integrity/hashed/2020_04/egypt_022020/egypt_022020_users_csv_hashed.zip'
download_from_google_cloud_storage(raw_url)
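
A response that is not the expected archive (for example an HTML page returned with a success status) would still be written to disk under a `.zip` name, so it may be worth checking each file after download. The sketch below uses only the standard library `zipfile` module; `looks_like_valid_zip` is a hypothetical helper that is not part of the original notebook.

```python
import zipfile

def looks_like_valid_zip(path):
    """Return True if path is a zip archive whose members all pass the CRC check."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first corrupt member, or None if all are fine
        return archive.testzip() is None
```

For the file downloaded above, the check would be `looks_like_valid_zip('data/twitter/egypt_022020_users_csv_hashed.zip')`.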