Development on post-processor framework #1

Open
kstapelfeldt opened this issue Oct 13, 2020 · 27 comments

@kstapelfeldt
Member

Given the output files of the domain crawler and the anticipated output files of the twitter crawler, how do we parse/transform the data into our output format (JSON/CSV)? This needs to be done in such a way that we can continue to add rules or modifications to the framework as needed to address things like filtering non-news or homepage content.
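
A rough sketch of what such an extensible, rule-based pipeline could look like (the registry, rule names, and record fields here are placeholders, not a decided design):

```python
# Sketch only: hypothetical filter-rule registry for the post-processor.
from typing import Callable, Dict, List
from urllib.parse import urlparse

Record = Dict  # one parsed article or tweet

RULES: List[Callable[[Record], bool]] = []

def rule(fn: Callable[[Record], bool]) -> Callable[[Record], bool]:
    """Register a filter rule; a record is kept only if every rule returns True."""
    RULES.append(fn)
    return fn

@rule
def drop_homepages(record: Record) -> bool:
    # Example rule: skip records whose URL is just the site root.
    return urlparse(record.get("url", "")).path not in ("", "/")

def apply_rules(records: List[Record]) -> List[Record]:
    return [r for r in records if all(check(r) for check in RULES)]
```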

@kstapelfeldt
Member Author

@RaiyanRahman will provide sample output from the domain crawler based on what he already has.
@danhuacai will work with the twitter output to start.

@kstapelfeldt
Member Author

Notes on a matching algorithm:

  1. Extract all possible citations from all the articles/tweets (text aliases, twitter handles, and domain names).
  2. Compare the citation list against the scope crawl data.
  3. When a citation appears in the scope, we create the referring ID link. Citations that don't appear in the scope are stored in a list and ranked by popularity.
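
A minimal sketch of those three steps, assuming the scope file gives us a lookup from each citation string (alias, @handle, or domain) to a source ID (the field shapes are assumptions):

```python
# Sketch of the matching pass; scope/citation field shapes are assumptions.
from collections import Counter

def match_citations(citations, scope):
    """citations: list of strings (aliases, @handles, domain names) from one article/tweet.
    scope: dict mapping a citation string -> source ID from the scope .csv."""
    links = []                 # referring-ID links for citations found in scope
    out_of_scope = Counter()   # popularity ranking for everything else
    for cite in citations:
        if cite in scope:
            links.append(scope[cite])
        else:
            out_of_scope[cite] += 1
    return links, out_of_scope.most_common()
```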

@kstapelfeldt kstapelfeldt assigned amygao9 and unassigned danhuacai Oct 20, 2020
@kstapelfeldt
Member Author

kstapelfeldt commented Oct 27, 2020

  • Amy can match the URLs and determine who has cited whom.
  • Still need to be able to match using text aliases (will need some type of regex or other pattern matching within article text; a rough sketch is below).
  • Still need to extract twitter handles.
  • Still need to create the extra JSON/CSV output that contains top twitter handles and domain names not included in the scope (we added a new sheet to the .csv).
  • Still need to see how this will operate given Danhua's twitter crawler output (a sample of which exists in pull request https://github.com/UTMediaCAT/mediacat-twitter-crawler/pull/1) @danhuacai
  • Push code for @danhuacai.
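
For the text-alias matching, a hedged sketch of the regex approach (the alias list and article text are placeholders, and this is not necessarily the pattern Amy used):

```python
# Sketch: find which text aliases from the scope appear in an article's text.
import re

def find_alias_citations(article_text, aliases):
    """Return the aliases (e.g. publication names) that occur in the article text."""
    hits = []
    for alias in aliases:
        # \b word boundaries so an alias does not match inside a longer word
        pattern = r"\b" + re.escape(alias) + r"\b"
        if re.search(pattern, article_text, flags=re.IGNORECASE):
            hits.append(alias)
    return hits
```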

@kstapelfeldt
Member Author

kstapelfeldt commented Nov 3, 2020

  • Regex done for text aliases and twitter handles for the domain crawler.

Still to do

  1. Twitter Crawler Code mediacat-docs#5 @danhuacai - DONE
  2. Extraction of twitter handles using the pattern that begins with the at symbol (sketch after this list) - DONE
  3. Cross-matching all references between twitter and domain output data <-- Try developing against dummy data - STARTED
  4. Need to create the extra JSON/CSV output that contains top twitter handles and domain names not included in the scope.
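
The handle extraction in item 2 could be as small as this (a sketch, not necessarily the exact pattern in the code):

```python
# Sketch: pull twitter handles (the pattern beginning with @) out of article/tweet text.
import re

HANDLE_RE = re.compile(r"@([A-Za-z0-9_]{1,15})")  # twitter handles are at most 15 characters

def extract_handles(text):
    return ["@" + h for h in HANDLE_RE.findall(text)]
```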

@danhuacai
Contributor

@kstapelfeldt
Member Author

@danhuacai has added a column for the URL and opened a pull request.

@kstapelfeldt
Member Author

TODO

  1. Cross-matching all references between twitter and domain output data <-- Try developing against dummy data - Progress made. Bugs remain.
  2. Need to create the extra JSON/CSV output that contains top twitter handles and domain names not included in the scope.

@amygao9
Contributor

amygao9 commented Nov 17, 2020

Went through a tutorial with @danhuacai on the post-processing; she will go through the code and understand it first, and then we will split up tasks.

@kstapelfeldt
Member Author

TODO

  1. Need to create the extra JSON/CSV output that contains top twitter handles and domain names not included in the scope (a sketch is after this list).
  2. Modify to handle "small JSON" file example provided by @RaiyanRahman
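
For item 1, a minimal sketch of ranking the out-of-scope citations and writing them out (the column names and file name are assumptions):

```python
# Sketch: write the top out-of-scope twitter handles / domain names to a CSV.
import csv
from collections import Counter

def write_interest_output(out_of_scope_citations, path="interest_output.csv"):
    """out_of_scope_citations: iterable of handle/domain strings not found in the scope."""
    counts = Counter(out_of_scope_citations)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["citation", "hits"])  # assumed header
        for citation, hits in counts.most_common():
            writer.writerow([citation, hits])
```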

@danhuacai
Contributor

https://databricks.com/glossary/pyspark

PySpark might be helpful for the huge amount of data we will have once we get the output from the post-processor.
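
If we go that route, the entry point would look roughly like this (a sketch, assuming the post-processor output is written as JSON; the path and the "domain" column are placeholders):

```python
# Sketch: loading post-processor output with PySpark for large-scale queries.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mediacat-postprocess").getOrCreate()

# Assumes one JSON record per line; the path and "domain" column are placeholders.
df = spark.read.json("output/postprocessor/*.json")
df.groupBy("domain").count().orderBy("count", ascending=False).show(20)
```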

@kstapelfeldt
Member Author

kstapelfeldt commented Dec 1, 2020

Item 2 is done. The logic for item 1 is there but needs real data to be tested against. Right now the final code is on the post-processor branch. We will wait until we can test against crawled twitter data to confirm before pushing.

@kstapelfeldt
Member Author

kstapelfeldt commented Dec 8, 2020

@amygaoo will make a change to accept the small .csvs, and then @danhuacai needs to test it.

@kstapelfeldt
Member Author

@amygaoo has passed the code to @danhuacai, who is running it in a virtual machine, but it is not finished yet. We will know more once it returns output. It has been running ~30 hours on a partial data set. @danhuacai needs to provide the number of files being processed in this time period, as well as how the VM is provisioned, so that we can benchmark approximately how long post-processing takes.

@kstapelfeldt
Member Author

kstapelfeldt commented Jan 5, 2021

We did not get this benchmarking done; it needs to be completed. Amy needs to add more logic to the 'interest output' to sort it by number of links.

@kstapelfeldt
Member Author

  • Added logic to interest output for sorting.
  • @amygaoo to test the post-processor on the whole twitter crawl output on the Compute Canada instance.

@kstapelfeldt
Member Author

@jacqueline-chan and @amygaoo met last night to try to find mini .csvs and run the post-processor. They are not at today's meeting, so this process is pending.

@kstapelfeldt
Member Author

Amy ran the post-processor on 10 users and it finished in 2 days. We encountered an issue running on the full output, as we are still seeing poorly formed .csvs even after running through Danhua's mini processor. For now, Amy ran this while skipping all malformed records (only about 20). The post-processor is still running and has gone through 600,000 of 5 million records since Saturday/Sunday.

@amygaoo will continue to try to find out where the errors in .csv creation are being introduced so we can resolve the issue.

@kstapelfeldt
Member Author

kstapelfeldt commented Feb 4, 2021

Last time she checked it was at 1,000,000/5,000,000; it took a week to run one million records, but then there was a connection issue. There is a problem with speed, but also an inability to pick up after the process is terminated (through things like connection problems).

Suggestions:

  1. Create a dictionary of all user handles/URLs so they can be tracked and restarting is possible (see the checkpointing sketch below)? OR use a database?
  2. Split the handles into five groups and run them simultaneously, joining the dictionaries after the several processes finish? Other possibilities for multi-threading to improve speed: look into https://dask.org.

Top priority: Make the process more robust (pick up after a break).
Second priority: Make it faster.
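
A minimal checkpointing sketch along the lines of suggestion 1 (the checkpoint file name, record shape, and flush interval are my assumptions):

```python
# Sketch: track which handles/URLs are already processed so a restart can skip them.
import json
import os

CHECKPOINT = "processed_ids.json"  # placeholder path

def process(record):
    """Stand-in for the existing per-record post-processing step."""

def load_processed():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, encoding="utf-8") as f:
            return set(json.load(f))
    return set()

def run(records):
    done = load_processed()
    for record in records:
        key = record["id"]             # assumed unique handle/URL per record
        if key in done:
            continue                   # already handled before the crash/disconnect
        process(record)
        done.add(key)
        if len(done) % 1000 == 0:      # flush the checkpoint periodically
            with open(CHECKPOINT, "w", encoding="utf-8") as f:
                json.dump(sorted(done), f)
```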

@amygao9
Contributor

amygao9 commented Feb 4, 2021

TODO:
All tweets by @Marianhouk get a hit if someone mentions @Marianhouk, so many tweets by a twitter handle have the same number of hits.
Proposed solution:

  1. Include a JSON node in the output for each twitter handle and domain, which will hold all referrals for that specific source.
  2. Each node for a specific tweet/article will only hold the referrals to that specific article/tweet.
  3. Add a field in each node that specifies whether it is a source domain or a single article/tweet.
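
A sketch of what the proposed node shapes might look like (keys and values are illustrative, not the final schema):

```python
# Sketch: illustrative node shapes for the proposed data model; keys are not final.
source_node = {
    "id": "twitter_handle:@Marianhouk",
    "type": "source",          # item 3: flags a source handle/domain vs. a single article/tweet
    "twitter_handle": "@Marianhouk",
    "referrals": [],           # item 1: every referral to this handle/domain
}

article_node = {
    "id": "tweet:1234567890",  # placeholder tweet ID
    "type": "article",         # one specific tweet/article
    "referrals": [],           # item 2: only referrals to this specific tweet/article
}
```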

@kstapelfeldt
Member Author

  • Top priority is to pick up after a break.
  • Then we will discuss the data model and use cases to find out where the data model might need to be altered.

@jacqueline-chan
Contributor

  • @amygaoo was able to catch errors and write the data that has been processed to multiple files, so that the process can pick up again after it is stopped in the middle.
  • Currently working on picking back up after the process has been stopped.

@kstapelfeldt
Member Author

  • Picking up after the process has been stopped is working (Yay!)
  • Now @amygaoo is working on the refactor.
  • Right now it's possible that a source has multiple text aliases and associated twitter handles.
  • Currently multiple text aliases are associated with a domain. These will be grouped together and separated by a pipe in the output (see the sketch after this list).
  • Currently multiple twitter handles are associated with a domain. These will be separate.
  • Amy has to create new nodes for the text aliases, twitter handles, and domains (everything that is a new article or a tweet).
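
A small sketch of the alias grouping mentioned above (the scope row shape is an assumption):

```python
# Sketch: collect each domain's text aliases and pipe-join them for the output.
from collections import defaultdict

def group_aliases(scope_rows):
    """scope_rows: iterable of (domain, alias) pairs from the scope file (assumed shape)."""
    by_domain = defaultdict(list)
    for domain, alias in scope_rows:
        by_domain[domain].append(alias)
    return {domain: "|".join(aliases) for domain, aliases in by_domain.items()}
```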

@kstapelfeldt
Member Author

kstapelfeldt commented Mar 18, 2021

@amygaoo completed the refactor for the modified output. Tested on small test data and it looks like it works. Text aliases are in list format. In-code documentation is complete. Right now, for newly created nodes:

Rules

  • If the node has type 'domain', 'twitter article', or 'text alias' and has no referrals, it is not included in the output.
  • Sometimes a domain name will appear twice. For example, if it is in the scope, but also crawled. In this instance, we would have the same data, but one would be marked as type 'domain' and one would be of type 'article.' In this case, we keep the 'article' and discard the 'domain.' If a homepage is not crawled, there will be a 'domain' record, which is kept in the output.
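
These two rules could translate to something like the following in the output step (a sketch; the node fields "type", "referrals", and "url" are assumptions):

```python
# Sketch: apply the two output rules; node fields ("type", "referrals", "url") are assumed.
def apply_output_rules(nodes):
    # Rule 2: if the same URL exists as both an 'article' and a 'domain' node,
    # keep the crawled 'article' and drop the scope-only 'domain'.
    article_urls = {n["url"] for n in nodes if n["type"] == "article"}
    kept = [n for n in nodes
            if not (n["type"] == "domain" and n["url"] in article_urls)]
    # Rule 1: drop domain / twitter article / text alias nodes that have no referrals.
    return [n for n in kept
            if n["type"] not in ("domain", "twitter article", "text alias")
            or n["referrals"]]
```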

How to run

  • Documentation is in-code, but @amygaoo will add more documentation re: files and folders expected by the script.

@kstapelfeldt
Member Author

Modifying the post-processor framework to operate more quickly (optimize)

@kstapelfeldt kstapelfeldt changed the title Create post-processor framework Development on post-processor framework Oct 14, 2021
@kstapelfeldt
Member Author

  • John has tried several approaches to optimization and discussed them with KS and Nat; the first approach is the best one, bringing the run time down from 2 days to 4 hours.
  • John spent the week looking at the alternatives and verified his original approach was the correct one to take.
  • New problem: the run ran out of memory at the very end and didn't process the twitter crawler data, so the crawler needs to be re-run. John is working on an approach to monitor size and write to disk if the process is at risk (a rough sketch is below), and is now seeking to re-run.
  • John is having a problem with the Graham cloud that he is trying to address (he keeps getting kicked out). He will write to Alejandro, who will write to Compute Canada.
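
One possible shape for the "monitor size and write to disk" step (psutil and the threshold are my assumptions, not necessarily what John implemented):

```python
# Sketch: flush partial results to disk when the process's memory use gets too high.
# psutil and the 4 GB threshold are assumptions, not necessarily John's approach.
import json
import psutil

MEMORY_LIMIT_BYTES = 4 * 1024 ** 3   # flush once resident memory passes ~4 GB

def maybe_flush(partial_results, part_number):
    rss = psutil.Process().memory_info().rss
    if rss > MEMORY_LIMIT_BYTES:
        with open(f"partial_output_{part_number}.json", "w", encoding="utf-8") as f:
            json.dump(partial_results, f)
        partial_results.clear()       # free the in-memory buffer
        return part_number + 1
    return part_number
```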

@kstapelfeldt
Member Author

Close to complete, but it still needs to be tested.

@johnguirgis
Contributor

Made modifications to periodically remove duplicates from the referrals list while executing, rather than at the end, to save memory (a sketch is below).
Ran successfully on a smaller scope; currently running with the full scope.
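
A sketch of that periodic de-duplication (the interval is an assumed value; dict.fromkeys keeps first-seen order):

```python
# Sketch: periodically de-duplicate the referrals list instead of waiting until the end.
DEDUPE_EVERY = 10_000   # assumed interval; tune for memory vs. speed

def add_referral(referrals, new_referral, counter):
    referrals.append(new_referral)
    counter += 1
    if counter % DEDUPE_EVERY == 0:
        # dict.fromkeys drops duplicates while keeping first-seen order
        referrals[:] = list(dict.fromkeys(referrals))
    return counter
```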
