Development on post-processor framework #1

Open
kstapelfeldt opened this issue Oct 13, 2020 · 27 comments

@kstapelfeldt
Member

Given the output files of the domain crawler and the anticipated output files of the twitter crawler, how do we parse/transform the data into our output format (JSON/CSV)? This needs to be done in such a way that we can continue to add rules or modifications to the framework as needed to address things like filtering non-news or homepage content.
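
A rough sketch of what such an extensible, rule-based pipeline could look like (the registry, rule names, and record fields here are placeholders, not a decided design):

```python
# Sketch only: hypothetical filter-rule registry for the post-processor.
from typing import Callable, Dict, List
from urllib.parse import urlparse

Record = Dict  # one parsed article or tweet

RULES: List[Callable[[Record], bool]] = []

def rule(fn: Callable[[Record], bool]) -> Callable[[Record], bool]:
    """Register a filter rule; a record is kept only if every rule returns True."""
    RULES.append(fn)
    return fn

@rule
def drop_homepages(record: Record) -> bool:
    # Example rule: skip records whose URL is just the site root.
    return urlparse(record.get("url", "")).path not in ("", "/")

def apply_rules(records: List[Record]) -> List[Record]:
    return [r for r in records if all(check(r) for check in RULES)]
```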

@kstapelfeldt
Member Author

@RaiyanRahman will provide sample output from the domain crawler based on what he already has.
@danhuacai will work with the twitter output to start.

@kstapelfeldt
Member Author

Notes on a matching algorithm:

  1. Extract all possible citations from all the articles/tweets (text aliases, twitter handles, and domain names).
  2. Compare the citation list against the scope crawl data.
  3. When a citation appears in the scope, we create the referring ID link. Citations that don't appear in the scope are stored in a list and ranked by popularity.
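
A minimal sketch of those three steps, assuming the scope file gives us a lookup from each citation string (alias, @handle, or domain) to a source ID (the field shapes are assumptions):

```python
# Sketch of the matching pass; scope/citation field shapes are assumptions.
from collections import Counter

def match_citations(citations, scope):
    """citations: list of strings (aliases, @handles, domain names) from one article/tweet.
    scope: dict mapping a citation string -> source ID from the scope .csv."""
    links = []                 # referring-ID links for citations found in scope
    out_of_scope = Counter()   # popularity ranking for everything else
    for cite in citations:
        if cite in scope:
            links.append(scope[cite])
        else:
            out_of_scope[cite] += 1
    return links, out_of_scope.most_common()
```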

@kstapelfeldt kstapelfeldt assigned amygao9 and unassigned danhuacai Oct 20, 2020
@kstapelfeldt
Member Author

kstapelfeldt commented Oct 27, 2020

  • Amy can match the URLs and determine who has cited whom.
  • Still need to be able to match using text aliases (will need some type of regex or other pattern matching within article text; a rough sketch is below).
  • Still need to extract twitter handles.
  • Still need to create the extra JSON/CSV output that contains top twitter handles and domain names not included in the scope (we added a new sheet to the .csv).
  • Still need to see how this will operate given Danhua's twitter crawler output (a sample of which exists in pull request https://github.com/UTMediaCAT/mediacat-twitter-crawler/pull/1) @danhuacai
  • Push code for @danhuacai.
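
For the text-alias matching, a hedged sketch of the regex approach (the alias list and article text are placeholders, and this is not necessarily the pattern Amy used):

```python
# Sketch: find which text aliases from the scope appear in an article's text.
import re

def find_alias_citations(article_text, aliases):
    """Return the aliases (e.g. publication names) that occur in the article text."""
    hits = []
    for alias in aliases:
        # \b word boundaries so an alias does not match inside a longer word
        pattern = r"\b" + re.escape(alias) + r"\b"
        if re.search(pattern, article_text, flags=re.IGNORECASE):
            hits.append(alias)
    return hits
```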

@kstapelfeldt
Member Author

kstapelfeldt commented Nov 3, 2020

  • Regex done for text aliases and twitter handles for the domain crawler.

Still to do

  1. Twitter Crawler Code mediacat-docs#5 @danhuacai - DONE
  2. Extraction of twitter handles using the pattern that begins with the at symbol (sketch after this list) - DONE
  3. Cross-matching all references between twitter and domain output data <-- Try developing against dummy data - STARTED
  4. Need to create the extra JSON/CSV output that contains top twitter handles and domain names not included in the scope.
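
The handle extraction in item 2 could be as small as this (a sketch, not necessarily the exact pattern in the code):

```python
# Sketch: pull twitter handles (the pattern beginning with @) out of article/tweet text.
import re

HANDLE_RE = re.compile(r"@([A-Za-z0-9_]{1,15})")  # twitter handles are at most 15 characters

def extract_handles(text):
    return ["@" + h for h in HANDLE_RE.findall(text)]
```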

@danhuacai
Contributor

@kstapelfeldt
Member Author

@danhuacai has added a column for the URL and opened a pull request.

@kstapelfeldt
Member Author

TODO

  1. Cross-matching all references between twitter and domain output data <-- Try developing against dummy data - Progress made. Bugs remain.
  2. Need to create the extra JSON/CSV output that contains top twitter handles and domain names not included in the scope.

@amygao9
Contributor

amygao9 commented Nov 17, 2020

Went through a tutorial with @danhuacai on the post-processing; she will go through the code and understand it first, and then we will split up tasks.

@kstapelfeldt
Member Author

TODO

  1. Need to create the extra JSON/CSV output that contains top twitter handles and domain names not included in the scope (a sketch is after this list).
  2. Modify to handle "small JSON" file example provided by @RaiyanRahman
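
For item 1, a minimal sketch of ranking the out-of-scope citations and writing them out (the column names and file name are assumptions):

```python
# Sketch: write the top out-of-scope twitter handles / domain names to a CSV.
import csv
from collections import Counter

def write_interest_output(out_of_scope_citations, path="interest_output.csv"):
    """out_of_scope_citations: iterable of handle/domain strings not found in the scope."""
    counts = Counter(out_of_scope_citations)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["citation", "hits"])  # assumed header
        for citation, hits in counts.most_common():
            writer.writerow([citation, hits])
```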

@danhuacai
Contributor

https://databricks.com/glossary/pyspark

PySpark might be helpful for the huge amount of data we will have once we get the output from the post-processor.
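
If we go that route, the entry point would look roughly like this (a sketch, assuming the post-processor output is written as JSON; the path and the "domain" column are placeholders):

```python
# Sketch: loading post-processor output with PySpark for large-scale queries.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mediacat-postprocess").getOrCreate()

# Assumes one JSON record per line; the path and "domain" column are placeholders.
df = spark.read.json("output/postprocessor/*.json")
df.groupBy("domain").count().orderBy("count", ascending=False).show(20)
```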

@kstapelfeldt
Member Author

kstapelfeldt commented Dec 1, 2020

Item 2 is done. The logic for item 1 is there but needs real data to be tested against. Right now the final code is on the post-processor branch. We will wait until we can test against crawled twitter data to confirm before pushing.

@kstapelfeldt
Member Author

kstapelfeldt commented Dec 8, 2020

@amygaoo will make a change to accept the small .csvs, and then @danhuacai needs to test it.

@kstapelfeldt
Member Author

@amygaoo has passed the code to @danhuacai, who is running it in a virtual machine, but it is not finished yet. We will know more once it returns output. It has been running ~30 hours on a partial data set. @danhuacai needs to provide the number of files being processed in this time period, as well as how the VM is provisioned, so that we can benchmark approximately how long post-processing takes.

@kstapelfeldt
Member Author

kstapelfeldt commented Jan 5, 2021

We did not get this benchmarking done; it needs to be completed. Amy needs to add more logic to the 'interest output' to sort it by number of links.

@kstapelfeldt
Member Author

  • Added logic to interest output for sorting.
  • @amygaoo to test the post-processor on the whole twitter crawl output on the Compute Canada instance.

@kstapelfeldt
Member Author

@jacqueline-chan and @amygaoo met last night to try to find mini .csvs and run the post-processor. They are not at today's meeting, so this process is pending.

@kstapelfeldt
Member Author

Amy ran the post-processor on 10 users and it finished in 2 days. We encountered an issue running on the full output, as we are still seeing poorly formed .csvs even after running through Danhua's mini processor. For now, Amy ran this while skipping all malformed records (only about 20). The post-processor is still running and has gone through 600,000 of 5 million records since Saturday/Sunday.

@amygaoo will continue to try to find out where the errors in .csv creation are being introduced so we can resolve the issue.

@kstapelfeldt
Member Author

kstapelfeldt commented Feb 4, 2021

Last time she checked it was at 1,000,000/5,000,000; it took a week to run one million records, but then there was a connection issue. There is a problem with speed, but also an inability to pick up after the process is terminated (through things like connection problems).

Suggestions:

  1. Create a dictionary of all user handles/URLs so they can be tracked and restarting is possible (see the checkpointing sketch below)? OR use a database?
  2. Split the handles into five groups and run them simultaneously, joining the dictionaries after the several processes finish? Other possibilities for multi-threading to improve speed: look into https://dask.org.

Top priority: Make the process more robust (pick up after a break).
Second priority: Make it faster.
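
A minimal checkpointing sketch along the lines of suggestion 1 (the checkpoint file name, record shape, and flush interval are my assumptions):

```python
# Sketch: track which handles/URLs are already processed so a restart can skip them.
import json
import os

CHECKPOINT = "processed_ids.json"  # placeholder path

def process(record):
    """Stand-in for the existing per-record post-processing step."""

def load_processed():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, encoding="utf-8") as f:
            return set(json.load(f))
    return set()

def run(records):
    done = load_processed()
    for record in records:
        key = record["id"]             # assumed unique handle/URL per record
        if key in done:
            continue                   # already handled before the crash/disconnect
        process(record)
        done.add(key)
        if len(done) % 1000 == 0:      # flush the checkpoint periodically
            with open(CHECKPOINT, "w", encoding="utf-8") as f:
                json.dump(sorted(done), f)
```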

@amygao9
Contributor

amygao9 commented Feb 4, 2021

TODO:
All tweets by @Marianhouk get a hit if someone mentions @Marianhouk, so many tweets by a twitter handle have the same number of hits.
Proposed solution:

  1. Include a JSON node in the output for each twitter handle and domain, which will hold all referrals for that specific source.
  2. Each node for a specific tweet/article will only hold the referrals to that specific article/tweet.
  3. Add a field in each node that specifies whether it is a source domain or a single article/tweet.
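
A sketch of what the proposed node shapes might look like (keys and values are illustrative, not the final schema):

```python
# Sketch: illustrative node shapes for the proposed data model; keys are not final.
source_node = {
    "id": "twitter_handle:@Marianhouk",
    "type": "source",          # item 3: flags a source handle/domain vs. a single article/tweet
    "twitter_handle": "@Marianhouk",
    "referrals": [],           # item 1: every referral to this handle/domain
}

article_node = {
    "id": "tweet:1234567890",  # placeholder tweet ID
    "type": "article",         # one specific tweet/article
    "referrals": [],           # item 2: only referrals to this specific tweet/article
}
```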

@kstapelfeldt
Member Author

  • Top priority is to pick up after a break.
  • Then we will discuss the data model and use cases to find out where the data model might need to be altered.

@jacqueline-chan
Contributor

  • @amygaoo was able to catch errors and write the data that has been processed to multiple files, so that the process can pick up again after it is stopped in the middle.
  • Currently working on picking back up after the process has been stopped.

@kstapelfeldt
Member Author

  • Picking up after the process has been stopped is working (Yay!)
  • Now @amygaoo is working on the refactor.
  • Right now it's possible that a source has multiple text aliases and associated twitter handles.
  • Currently multiple text aliases are associated with a domain. These will be grouped together and separated by a pipe in the output (see the sketch after this list).
  • Currently multiple twitter handles are associated with a domain. These will be separate.
  • Amy has to create new nodes for the text aliases, twitter handles, and domains (everything that is a new article or a tweet).
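
A small sketch of the alias grouping mentioned above (the scope row shape is an assumption):

```python
# Sketch: collect each domain's text aliases and pipe-join them for the output.
from collections import defaultdict

def group_aliases(scope_rows):
    """scope_rows: iterable of (domain, alias) pairs from the scope file (assumed shape)."""
    by_domain = defaultdict(list)
    for domain, alias in scope_rows:
        by_domain[domain].append(alias)
    return {domain: "|".join(aliases) for domain, aliases in by_domain.items()}
```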

@kstapelfeldt
Member Author

kstapelfeldt commented Mar 18, 2021

@amygaoo completed the refactor for the modified output. Tested on small test data and it looks like it works. Text aliases are in list format. In-code documentation is complete. Right now, for newly created nodes:

Rules

  • If the node has type 'domain', 'twitter article', or 'text alias' and has no referrals, it is not included in the output.
  • Sometimes a domain name will appear twice. For example, if it is in the scope, but also crawled. In this instance, we would have the same data, but one would be marked as type 'domain' and one would be of type 'article.' In this case, we keep the 'article' and discard the 'domain.' If a homepage is not crawled, there will be a 'domain' record, which is kept in the output.
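
These two rules could translate to something like the following in the output step (a sketch; the node fields "type", "referrals", and "url" are assumptions):

```python
# Sketch: apply the two output rules; node fields ("type", "referrals", "url") are assumed.
def apply_output_rules(nodes):
    # Rule 2: if the same URL exists as both an 'article' and a 'domain' node,
    # keep the crawled 'article' and drop the scope-only 'domain'.
    article_urls = {n["url"] for n in nodes if n["type"] == "article"}
    kept = [n for n in nodes
            if not (n["type"] == "domain" and n["url"] in article_urls)]
    # Rule 1: drop domain / twitter article / text alias nodes that have no referrals.
    return [n for n in kept
            if n["type"] not in ("domain", "twitter article", "text alias")
            or n["referrals"]]
```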

How to run

  • Documentation is in-code, but @amygaoo will add more documentation re: files and folders expected by the script.

@kstapelfeldt
Member Author

Modifying the post-processor framework to operate more quickly (optimize)

@kstapelfeldt kstapelfeldt changed the title Create post-processor framework Development on post-processor framework Oct 14, 2021
@kstapelfeldt
Member Author

  • John has tried several approaches to optimization and discussed them with KS and Nat; the first approach is the best one, bringing the run time down from 2 days to 4 hours.
  • John spent the week looking at the alternatives and verified his original approach was the correct one to take.
  • New problem: the run ran out of memory at the very end and didn't process the twitter crawler data, so the crawler needs to be re-run. John is working on an approach to monitor size and write to disk if the process is at risk (a rough sketch is below), and is now seeking to re-run.
  • John is having a problem with the Graham cloud that he is trying to address (he keeps getting kicked out). He will write to Alejandro, who will write to Compute Canada.
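
One possible shape for the "monitor size and write to disk" step (psutil and the threshold are my assumptions, not necessarily what John implemented):

```python
# Sketch: flush partial results to disk when the process's memory use gets too high.
# psutil and the 4 GB threshold are assumptions, not necessarily John's approach.
import json
import psutil

MEMORY_LIMIT_BYTES = 4 * 1024 ** 3   # flush once resident memory passes ~4 GB

def maybe_flush(partial_results, part_number):
    rss = psutil.Process().memory_info().rss
    if rss > MEMORY_LIMIT_BYTES:
        with open(f"partial_output_{part_number}.json", "w", encoding="utf-8") as f:
            json.dump(partial_results, f)
        partial_results.clear()       # free the in-memory buffer
        return part_number + 1
    return part_number
```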

@kstapelfeldt
Member Author

Close to complete, but it still needs to be tested.

@johnguirgis
Contributor

Made modifications to periodically remove duplicates from the referrals list while executing, rather than at the end, to save memory (a sketch is below).
Ran successfully on a smaller scope; currently running with the full scope.
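
A sketch of that periodic de-duplication (the interval is an assumed value; dict.fromkeys keeps first-seen order):

```python
# Sketch: periodically de-duplicate the referrals list instead of waiting until the end.
DEDUPE_EVERY = 10_000   # assumed interval; tune for memory vs. speed

def add_referral(referrals, new_referral, counter):
    referrals.append(new_referral)
    counter += 1
    if counter % DEDUPE_EVERY == 0:
        # dict.fromkeys drops duplicates while keeping first-seen order
        referrals[:] = list(dict.fromkeys(referrals))
    return counter
```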
