
URL domain extraction #56

Closed
bstarling opened this issue Mar 31, 2017 · 6 comments

Comments

@bstarling
Contributor

Problem:

In order to analyze the types of links being shared, we need a reliable way to extract and count the domains that appear in a list of URLs.

Tasks:

  • There are libraries that do this, but none of them are perfect. (It's fine to leverage a library, but do your own validation on the results.)
  • Alias domains known to be associated and count them together, e.g. youtu.be and youtube.com are both youtube.
  • Make sure you're capturing the actual domain, e.g. for forums.website.com the domain is website.
  • Output should be a list of domain counts in descending order.
  • Sort out shortened links (e.g. bit.ly, t.co) and publish them as a separate file.
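As a rough starting point, the aliasing and subdomain rules above could be sketched like this. The alias table is a hypothetical example, and the "second-to-last label" heuristic is naive: it mishandles multi-part public suffixes such as .co.uk, which a library like tldextract (which consults the Public Suffix List) handles more reliably.

```python
from urllib.parse import urlparse

# Hypothetical alias table: collapse known-related domains onto one name.
ALIASES = {"youtu": "youtube"}  # youtu.be and youtube.com both count as youtube

def extract_domain(url):
    """Return the registered-domain label of a URL, e.g. 'website'
    for https://forums.website.com/thread/1.

    Naive heuristic: take the second-to-last dot-separated label of
    the hostname. Fails on suffixes like .co.uk; use a Public Suffix
    List-aware library for production-quality results.
    """
    host = urlparse(url).netloc.lower().split(":")[0]  # drop any port
    labels = host.split(".")
    domain = labels[-2] if len(labels) >= 2 else host
    return ALIASES.get(domain, domain)
```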

You do not need a full solution in order to submit a PR. If you have questions, drop in to the assemble chat and see if anyone else is interested in working on the problem.

You can download the data here or load it directly into pandas via:

import pandas as pd
df = pd.read_csv('https://s3.amazonaws.com/far-right/fourchan/youtube_urls.csv')

Post-cleaning should generate a list of domains and their counts, as well as a separate file of all shortened links where the domain is not known. (Recommend you do not try to visit these shortened links.)

youtube, 500
facebook, 200
twitter, 150
wikipedia, 100
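One way to produce counts in that shape, with shortened links split out, is a sketch along these lines. The SHORTENERS set and ALIASES table are assumed, incomplete example lists, not vetted data:

```python
from collections import Counter
from urllib.parse import urlparse

# Assumed, incomplete example lists.
SHORTENERS = {"bit.ly", "t.co", "goo.gl", "tinyurl.com"}
ALIASES = {"youtu": "youtube"}

def count_domains(urls):
    """Return (domain counts in descending order, list of shortened links)."""
    counts, shortened = Counter(), []
    for url in urls:
        host = urlparse(url).netloc.lower().split(":")[0]
        host = host[4:] if host.startswith("www.") else host
        if host in SHORTENERS:
            shortened.append(url)  # publish these separately
            continue
        labels = host.split(".")
        domain = labels[-2] if len(labels) >= 2 else host
        counts[ALIASES.get(domain, domain)] += 1
    return counts.most_common(), shortened
```

`most_common()` already yields (domain, count) pairs sorted in descending order, matching the sample output above.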

Warning: this work requires you to deal with highly explicit and offensive content from 4chan's /pol/ board. Please do not visit the links you find, as some may contain malware or offensive content.

@bstarling bstarling changed the title Top level domain extraction URL domain extraction Mar 31, 2017
@strongdan
Contributor

I'm a beginner looking to gain a bit more experience and I'd be willing to attempt this. How quickly do you expect a PR?

@bstarling
Contributor Author

Hey @strongdan, that sounds good. No time limit. The only request is that if you end up not finding time to finish, you post back here to free it up for someone else. Feel free to post a partial solution so others can collaborate, or drop by the #assemble channel with any questions.

@strongdan
Contributor

Sounds great! I'll try to get something completed this weekend and update you with what I have.

@strongdan
Contributor

I didn't see that @harish-garg already solved this one. Great job!

@bstarling
Contributor Author

Still room for improvement or alternative approaches.

@strongdan
Contributor

I had to post on Stack Overflow about sorting out short URLs: http://stackoverflow.com/questions/43219063/detecting-a-short-url-using-python

It sounds tough to implement. I can try to come up with a list of known short URLs or match on a regular expression. I will most likely need some help with cleaning and validating the results.
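For what it's worth, the two approaches mentioned (a curated list of known shorteners plus a regular-expression match on short hosts and paths) could be combined in a sketch like this. The KNOWN set and the length thresholds in the pattern are assumptions, not a vetted rule:

```python
import re
from urllib.parse import urlparse

# Assumed, incomplete list of known shortener hosts.
KNOWN = {"bit.ly", "t.co", "goo.gl", "ow.ly", "is.gd", "tinyurl.com"}

# Heuristic: very short hostname followed by a short opaque token.
# The length bounds (host <= 7 chars, path 4-10 chars) are guesses.
SHORT_PATTERN = re.compile(r"^https?://[a-z0-9.-]{1,7}/[A-Za-z0-9]{4,10}$")

def looks_shortened(url):
    """Flag a URL as likely shortened via list lookup or pattern match."""
    host = urlparse(url).netloc.lower().split(":")[0]
    return host in KNOWN or bool(SHORT_PATTERN.match(url))
```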
