**Import required libraries here**

In [3]:
import numpy as np
import pandas as pd

import dask.dataframe as dd
import dask.array as da
import dask.bag as db

# 1. Loading twitter and domain data

**Run the code below to:**

Read the Twitter CSVs and the domain CSVs, clean up the data, and store the results under the `Saved` folder:
- Twitter data will be stored as parquet in `./Saved/twitter_data.parquet`
- Domain data will be stored as parquet in `./Saved/domain_data.parquet`

> *Code is under `/post_input/load_input.py`*
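The loading step can be pictured roughly as follows. This is only a minimal sketch in plain pandas, not the code in `load_input.py`: the function names `read_and_clean` and `save_parquet` are hypothetical, and the cleaning rules (drop rows without a `url`, keep each `url` once) are assumptions made for illustration.

```python
import glob

import pandas as pd


def read_and_clean(pattern):
    """Read every CSV matching `pattern` and return one cleaned DataFrame."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no CSV files match {pattern!r}")
    frames = [pd.read_csv(path, encoding="utf8") for path in paths]
    df = pd.concat(frames, ignore_index=True)
    # Drop a saved index column that pandas labels "Unnamed: 0", if present.
    df = df.drop(columns=[c for c in df.columns if c.startswith("Unnamed")])
    # Assumed cleaning rules: a record needs a URL, and each URL is kept once.
    df = df.dropna(subset=["url"]).drop_duplicates(subset="url")
    return df.set_index("url")


def save_parquet(df, out_path):
    """Write the cleaned frame to parquet (needs pyarrow or fastparquet)."""
    df.to_parquet(out_path)
```

Under these assumptions, `save_parquet(read_and_clean('./logs/*_output_*.csv'), './Saved/twitter_data.parquet')` would produce a file indexed by `url`, similar in shape to the one viewed in the next section.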

In [1]:
import post_input.load_input as load_input
# load domain and twitter data
load_input.load_twitter('./logs/*_output_*.csv') # change path to your twitter output csvs
load_input.load_domain('./DataDomain/*.csv') # change path to your domain output csvs

Skipping line 20: unexpected end of data


**You can view the saved data after it has been loaded into the post-processor**

In [5]:
# change to './Saved/domain_data.parquet' to view domain data
data = dd.read_parquet('./Saved/twitter_data.parquet') 
data.head(1)

url,id,title,author,date,html_content,article_text,domain,found_urls,type,completed
https://www.nytimes.com/2002/04/25/aponline/technology/article-2002042593816751672-no-title.html,b1beab0a-4bc6-540b-b578-6fe28c2e6aef,Article 2002042593816751672 -- No Title (Publi...,By,2002-04-25T05:00:00.000Z,"<div id=""readability-page-1"" class=""page""><art...","April 25, 2002 The New York Times: APTechnolog...",https://www.nytimes.com,"[{'title': 'Skip to content', 'url': 'https://...",article,False
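The `found_urls` column shown above holds a list rendered as text. A small helper like the one below can turn such a cell back into a plain list of URLs. It is a sketch that assumes the cell is stored as a Python-literal string; `parse_found_urls` is a hypothetical name, not part of the post-processor.

```python
import ast


def parse_found_urls(raw):
    """Turn a stringified `found_urls` cell into a list of URL strings."""
    if not isinstance(raw, str) or not raw.strip():
        return []
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # truncated or malformed cell
    if not isinstance(items, (list, tuple)):
        return []
    urls = []
    for item in items:
        if isinstance(item, dict) and "url" in item:
            urls.append(item["url"])  # entries shaped like {'title': ..., 'url': ...}
        elif isinstance(item, str):
            urls.append(item)  # entries that are bare URL strings
    return urls
```

Using `ast.literal_eval` rather than `eval` keeps the parsing limited to literals, which matters when the text comes from crawled pages.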


# 2. Post Processing
**Run the code below to:**

Find and add citation and text-alias information with respect to the citation scope.
Find and add domain information with respect to the crawl scope.
Then cross-match the domain and Twitter data to find referrals, and store the results under the `Saved` folder.

- Note: This step is independent of step one as long as `./Saved/twitter_data.parquet` and `./Saved/domain_data.parquet` exist
- Processed Twitter data will be stored as parquet in `./Saved/processed_twitter_data.parquet`
- Processed Domain data will be stored as parquet in `./Saved/processed_domain_data.parquet`

> *Code is under `/post_processor/*.py`*
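To make the citation step more concrete, here is a minimal sketch of alias matching against a citation scope. It assumes the scope can be represented as a mapping from an alias (a handle, text alias, or URL fragment) to a citation name; `find_citations` and that mapping are hypothetical and may not match how `post_processor` does it.

```python
import re


def find_citations(text, scope):
    """Return (aliases, names) for every scope alias that occurs in `text`.

    `scope` maps an alias to the citation name it stands for.
    """
    aliases, names = [], []
    for alias, name in scope.items():
        # Match the alias as a whole token, ignoring case.
        pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            aliases.append(alias)
            names.append(name)
    return aliases, names
```

With `scope = {"@USIP": "United States Institute of Peace"}`, a text that mentions `@USIP` yields the alias and its name, which corresponds to the `citation url or text alias` and `citation name` columns in the processed data.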

In [6]:
%%time
import post_processor.processor as processor
import post_input.load_input as load_input
# post process data
crawl_scope = load_input.load_scope('./crawl_scope.csv')
citation_scope = load_input.load_scope('./citation_scope.csv')
processor.process_crawler(crawl_scope, citation_scope)

CPU times: user 2min 56s, sys: 1.75 s, total: 2min 58s
Wall time: 2min 57s


**You can view processed data here**

In [7]:
# change to './Saved/processed_domain_data.parquet' to view the domain data
df = dd.read_parquet('./Saved/processed_twitter_data.parquet')
df.head(1)

url,domain,date,article_text,found_urls,Mentions,id,type,title,completed,citation url or text alias,citation name,anchor text,associated publisher,tags,name,referring record id,number of referrals,url_dup
https://twitter.com/ACMideast/status/1000009046223704064,ACMideast,2018-05-25 13:41:50+00:00,دلالات الانتخابات المحلية التونسية \nhttps://t...,['http://www.achariricenter.org/is-tunisias-gl...,[],9af4a6a6-fbd8-5866-9b59-9850edf109a7,twitter,,True,[],[],[],,Think_Tank|||US|11-20||||||,Atlantic Council,[],0,https://twitter.com/ACMideast/status/100000904...
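The `referring record id` and `number of referrals` columns come from the cross-matching step. One way to picture it, assuming a referral means "another record whose `found_urls` contains this record's URL", is the sketch below. `find_referrals` is a hypothetical helper and the real matching rules may be more involved.

```python
from collections import defaultdict


def find_referrals(records):
    """Map each record id to the ids of other records that link to its URL.

    `records` is a list of dicts with the keys "id", "url" and "found_urls"
    (the latter already parsed into a list of URL strings).
    """
    id_by_url = {rec["url"]: rec["id"] for rec in records}
    referrers = defaultdict(list)
    for rec in records:
        for link in rec["found_urls"]:
            target = id_by_url.get(link)
            # Ignore links to unknown URLs and links from a record to itself.
            if target is not None and target != rec["id"]:
                referrers[target].append(rec["id"])
    return {rec["id"]: referrers.get(rec["id"], []) for rec in records}
```

The number of referrals for a record is then simply the length of its list.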


# 3. Generate output

**Run the cell below to:**

Rename and clean up keys. Select the rows that have citations from the **citation scope** and belong to the **crawl scope**. Then write the output to `Output/output_0.csv` and to `./Saved/final_output.parquet`.

- Note: This step is independent of steps one and two as long as `./Saved/processed_twitter_data.parquet` and `./Saved/processed_domain_data.parquet` exist

> *Code is under `post_output/create_output.py`*
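The selection and renaming can be sketched in pandas as below. This is an illustration under assumptions, not the contents of `create_output.py`: it assumes `citation name` holds Python lists, that crawl-scope membership can be tested through the `name` column, and that `build_output` and `crawl_names` are hypothetical names. The two renames are taken from the column names visible in the processed and final tables.

```python
import pandas as pd

# Column renames visible when comparing the processed and final tables.
RENAMES = {"date": "date of publication", "article_text": "plain text"}


def build_output(df, crawl_names):
    """Keep rows that cite something and belong to the crawl scope."""
    has_citation = df["citation name"].apply(lambda names: len(names) > 0)
    in_crawl_scope = df["name"].isin(crawl_names)
    out = df[has_citation & in_crawl_scope]
    # The processed data is indexed by url; the final output is indexed by id.
    return out.rename(columns=RENAMES).reset_index().set_index("id")
```

A frame built this way could then be written out with `to_csv` and `to_parquet`, keeping `url` as an ordinary column as in the final output.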


In [None]:
import post_output.create_output as output
output.create_output()

**You can view final output here or in `/Output/output_0.csv`**

In [8]:
df = dd.read_parquet('./Saved/final_output.parquet')
# twitter_data = dd.read_parquet('./Saved/processed_twitter_data.parquet')
df.head(1)

id,title,author,date of publication,plain text,type,citation url or text alias,citation name,anchor text,referring record id,number of referrals,url,associated publisher,tags,name
002467b0-7a6c-5431-9cb1-013a79da4a23,,,2018-12-13 18:14:23+00:00,Zones of influence in #Syria predict short-ter...,twitter,['@USIP'],['United States Institute of Peace'],[],[],0,https://twitter.com/ACMideast/status/107327998...,,Think_Tank|||US|11-20||||||,Atlantic Council


In [3]:
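# inspect the raw domain CSVs directly; malformed lines are skipped with a warning
# note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; use on_bad_lines='skip' there
df = dd.read_csv('./DataDomain/*.csv', engine='python', encoding='utf8', error_bad_lines=False)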
df.head()

Skipping line 20: unexpected end of data


Unnamed: 0,id,title,url,author,date,html_content,article_text,domain,found_urls
0,e44265d4-84a9-50d2-b07a-a9f88bbf027a,Candidate for Lieutenant Governor Facing Scrut...,https://www.nytimes.com/2002/08/10/nyregion/ca...,RICHARD PÉREZ-PEÑA,2002-08-10T05:00:00.000Z,"<div id=""readability-page-1"" class=""page""><art...","Aug. 10, 2002See the article in its original c...",https://www.nytimes.com,"[{'title': 'Skip to content', 'url': 'https://..."
1,9b13dea8-5c68-5063-94c2-fbbf6ec10371,Bush Far Ahead of Democrats in Campaign Fund-R...,https://www.nytimes.com/2004/01/07/politics/ca...,By,2004-01-07T05:00:00.000Z,"<div id=""readability-page-1"" class=""page""><div...",Politics|Bush Far Ahead of Democrats in Campai...,https://www.nytimes.com,"[{'title': 'Skip to content', 'url': 'https://..."
2,08a1c4ec-c4a8-52e4-a087-1bf973a8b011,Taking ‘Oleanna’ into limbo (Published 2004),https://www.nytimes.com/2004/04/28/style/takin...,By,2004-04-28T05:00:00.000Z,"<div id=""readability-page-1"" class=""page""><art...","April 28, 2004LONDON— David Mamet's short, sh...",https://www.nytimes.com,"[{'title': 'Skip to content', 'url': 'https://..."
3,cae61051-2672-5a49-8aae-f5c62d597344,"A Force in the House, a Soft Voice Back Home; ...",https://www.nytimes.com/2002/07/05/nyregion/fo...,By,2002-07-05T05:00:00.000Z,"<div id=""readability-page-1"" class=""page""><art...","July 5, 2002See the article in its original co...",https://www.nytimes.com,"[{'title': 'Skip to content', 'url': 'https://..."
4,2f0945d5-8593-5ff9-a041-b76cb188d1d2,The Week Ahead; POLITICS (Published 2004),https://www.nytimes.com/2004/04/11/weekinrevie...,,2004-04-11T05:00:00.000Z,"<div id=""readability-page-1"" class=""page""><art...","April 11, 2004Annual meetings this week for tw...",https://www.nytimes.com,"[{'title': 'Skip to content', 'url': 'https://..."
