This is the public posting of the assignment. See Piazza for the invite link to make your submission in your own repository in the class organization.

Homework 10 - Analyzing Disinformation Domains

Due: Tuesday, May 5, 2020 by 3:30pm

10 points - all extra credit

Data Sources:

Google spreadsheets can be downloaded as CSV or TSV files using the File > Download menu option.
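Once a dataset has been downloaded, it can be loaded for the later questions with pandas. A minimal sketch, assuming a hypothetical filename d2.tsv and tab-separated values (substitute the actual filename and delimiter of the file you downloaded):

```python
# A minimal sketch of loading a downloaded dataset; "d2.tsv" and the
# tab delimiter are assumptions -- use the actual filename and format.
import pandas as pd

d2 = pd.read_csv("d2.tsv", sep="\t")
print(d2.head())
```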

Assignment

Write a report that answers the following questions and explains your answers. Support your answers by including all relevant discussion, assumptions, examples, etc. You must describe how you answered each question. Your GitHub repo should include all scripts, code, output files, and images needed to complete the assignment. If you use a Google Colab notebook, you must save a copy in your HW10 repo in GitHub.

Q1. (1 point) The D2 dataset contains some shortened URIs and proxies. Process each URI, following redirects until it resolves with a 200 or returns a 40x HTTP status.

Save a data file that contains the original URI, tweet frequency (from the original data file), final URI (many of these will be the same as the original URI), and current HTTP status.
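One way to do the redirect resolution is with the requests library, which follows redirects by default and stops at the first non-redirect status. A minimal sketch, assuming a hypothetical input file d2.tsv with columns uri and tweet_count (match these to the actual D2 column names):

```python
# A minimal sketch of resolving shortened/proxied URIs for Q1.
# The input filename and column names ("uri", "tweet_count") are
# assumptions -- adjust to match the actual D2 file.
import csv

import requests

def resolve(uri, timeout=15):
    """Follow redirects and return (final_uri, status_code)."""
    try:
        # requests follows redirects by default; r.url is the final URI
        r = requests.get(uri, timeout=timeout, allow_redirects=True)
        return r.url, r.status_code
    except requests.RequestException:
        # network errors, bad hosts, timeouts -- record and move on
        return uri, None

with open("d2.tsv") as infile, open("q1-resolved.tsv", "w", newline="") as outfile:
    reader = csv.DictReader(infile, delimiter="\t")
    writer = csv.writer(outfile, delimiter="\t")
    writer.writerow(["original_uri", "tweet_count", "final_uri", "http_status"])
    for row in reader:
        final_uri, status = resolve(row["uri"])
        writer.writerow([row["uri"], row["tweet_count"], final_uri, status])
```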

Q2. (1 point) For each of the final URIs from Q1, extract the domain from the URI. For example, given the URI https://en.wikipedia.org/wiki/Domain_name, the domain is wikipedia.org, and for the URI https://www.theregister.co.uk/2020/04/16/cloudflare_cobol/, the domain is theregister.co.uk.

How many unique domains are there?

Save a data file that contains each unique domain, the number of times that domain appeared in the dataset, and the total number of tweets the domain appeared in.
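One option for the domain extraction is the third-party tldextract package, which handles multi-part suffixes like .co.uk that simple string splitting misses. A minimal sketch, where the input filename and column names follow the Q1 sketch above and are assumptions:

```python
# A minimal sketch of Q2 domain extraction using tldextract; the input
# filename and column names are carried over from the Q1 sketch.
from collections import defaultdict
import csv

import tldextract

uri_counts = defaultdict(int)     # final URIs per domain
tweet_counts = defaultdict(int)   # total tweets per domain

with open("q1-resolved.tsv") as infile:
    for row in csv.DictReader(infile, delimiter="\t"):
        ext = tldextract.extract(row["final_uri"])
        domain = ext.registered_domain   # e.g., "theregister.co.uk"
        uri_counts[domain] += 1
        tweet_counts[domain] += int(row["tweet_count"])

print(len(uri_counts), "unique domains")

with open("q2-domains.tsv", "w", newline="") as outfile:
    writer = csv.writer(outfile, delimiter="\t")
    writer.writerow(["domain", "uri_count", "tweet_count"])
    for domain in sorted(uri_counts, key=uri_counts.get, reverse=True):
        writer.writerow([domain, uri_counts[domain], tweet_counts[domain]])
```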

Q3. (3 points) Compare the domains present in D2 (from your processed dataset in Q2) with those in D1. Is there overlap? Then compare with the domains from D3. How much overlap is there among the three datasets?

Receiving all the points for this question requires thoughtful discussion of the results.
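The overlap itself is just set arithmetic once each dataset's domains are in a Python set. A minimal sketch with placeholder sets (build the real ones from D1, the Q2 output, and D3):

```python
# Placeholder sets -- replace with the actual domains from D1, D2 (Q2), and D3.
d1_domains = {"a.example", "b.example", "c.example"}
d2_domains = {"b.example", "c.example", "d.example"}
d3_domains = {"c.example", "e.example"}

print("D1 and D2:", d1_domains & d2_domains)
print("D1 and D3:", d1_domains & d3_domains)
print("D2 and D3:", d2_domains & d3_domains)
print("in all three:", d1_domains & d2_domains & d3_domains)
print("only in D2:", d2_domains - d1_domains - d3_domains)
```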

Q4. (2 points) For each domain in D1, D2 (using results from Q2), and D3, check the archival status of the domain's main webpage using MemGator (see HW2). If a domain appears in multiple datasets, only check the status once.

You don't need to figure out the actual main webpage URI; putting just the domain in the request should work, for example: % curl http://memgator.cs.odu.edu/timemap/link/theregister.co.uk/

As with HW2, save the TimeMap for each domain in your GitHub repo.

Save a data file that contains each domain name, the dataset(s) it appears in (D1/D2/D3), datetime of its first memento, datetime of its last memento, and its total number of mementos.

Note that most of the main webpages for the domains in D3 should have at least 1 memento because the Internet Archive has created an Archive-It collection of these (see https://archive-it.org/collections/13559).
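A minimal sketch of fetching and summarizing one TimeMap from the link-format endpoint shown above; the regex-based parsing and the datetime format are assumptions that are sufficient for pulling out the memento count and first/last datetimes (remember to also save the raw TimeMap text to your repo):

```python
# A minimal sketch of summarizing a MemGator link-format TimeMap.
# Assumes memento lines look like: <uri>; rel="memento"; datetime="...",
# with RFC 1123 datetimes such as "Wed, 01 Jan 2020 00:00:00 GMT".
from datetime import datetime
import re

import requests

MEMGATOR = "http://memgator.cs.odu.edu/timemap/link/"
MEMENTO_RE = re.compile(r'rel="[^"]*memento[^"]*";\s*datetime="([^"]+)"')

def timemap_summary(domain):
    """Return (memento_count, first_datetime, last_datetime) or None if unarchived."""
    r = requests.get(MEMGATOR + domain, timeout=60)
    if r.status_code != 200:     # MemGator returns a non-200 when no mementos are found
        return None
    dates = [datetime.strptime(m, "%a, %d %b %Y %H:%M:%S %Z")
             for m in MEMENTO_RE.findall(r.text)]
    if not dates:
        return None
    # r.text is the TimeMap itself -- write it to a file in your repo as well
    return len(dates), min(dates), max(dates)

print(timemap_summary("theregister.co.uk"))
```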

Q5. (3 points) Create the following charts based on the collected data from Q4.

  1. Scatterplot of domain vs. datetime of the first and last mementos (x-axis), sorted by the datetime of the first memento. Color the dots based on the dataset (or datasets) the domain comes from. This should look similar to this chart of URIs vs. memento datetimes, but with only the first and last dot on each row plotted (since you're only plotting the datetimes of the first and last mementos).

  2. Histogram (see also Histograms at Math is Fun) of the total number of mementos per domain for all the datasets. The x-axis should be the total number of mementos (binned appropriately), and the y-axis should be the number of domains that fall into each bin. A plotting sketch for both charts is shown below.

Receiving all the points for this question requires thoughtful discussion of the results, including suggested questions for future investigation.
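A minimal plotting sketch for both charts with pandas and matplotlib, assuming the Q4 data file is a TSV with hypothetical columns domain, datasets, first_memento, last_memento, and total_mementos:

```python
# A minimal sketch of the Q5 charts; the filename and column names are
# assumptions -- adjust to match your Q4 output file.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("q4-archival.tsv", sep="\t",
                 parse_dates=["first_memento", "last_memento"])
df = df.dropna(subset=["first_memento"])            # skip unarchived domains
df = df.sort_values("first_memento").reset_index(drop=True)

# Chart 1: domain (y) vs. first/last memento datetime (x), one color per dataset combo
colors = df["datasets"].astype("category").cat.codes
fig, ax = plt.subplots(figsize=(10, 12))
ax.scatter(df["first_memento"], df.index, c=colors, s=10)
ax.scatter(df["last_memento"], df.index, c=colors, s=10)
ax.set_yticks(df.index)
ax.set_yticklabels(df["domain"], fontsize=5)
ax.set_xlabel("memento datetime")
fig.savefig("q5-scatter.png", bbox_inches="tight")

# Chart 2: histogram of total mementos per domain
fig, ax = plt.subplots()
ax.hist(df["total_mementos"], bins=20)   # choose bins appropriate to the data
ax.set_xlabel("total mementos")
ax.set_ylabel("number of domains")
fig.savefig("q5-histogram.png", bbox_inches="tight")
```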

Submission

Make sure that you have committed and pushed your local repo to GitHub before the deadline. Include "@weiglemc Ready to grade" in your commit comment.