
URL domain extraction #56

Closed
bstarling opened this issue Mar 31, 2017 · 6 comments

Comments

@bstarling
Contributor

Problem:

In order to analyze the types of links being shared, we need a reliable way to extract and count the domains that appear in a list of URLs.

Tasks:

  • There are libraries that do this, but none of them are perfect. (It's fine to leverage a library, but do your own validation on the results.)
  • Alias domains known to be associated and count them together, e.g. youtu.be and youtube.com are both youtube.
  • Make sure you're capturing the actual domain, e.g. for forums.website.com the domain is website.
  • Output should be a list of domain counts in descending order.
  • Sort out shortened links (e.g. bit.ly, t.co) and publish them as a separate file.
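As a rough starting point, the aliasing and subdomain rules above could be sketched like this. The alias table is a hypothetical example, and the "second-to-last label" heuristic is naive: it mishandles multi-part public suffixes such as .co.uk, which a library like tldextract (which consults the Public Suffix List) handles more reliably.

```python
from urllib.parse import urlparse

# Hypothetical alias table: collapse known-related domains onto one name.
ALIASES = {"youtu": "youtube"}  # youtu.be and youtube.com both count as youtube

def extract_domain(url):
    """Return the registered-domain label of a URL, e.g. 'website'
    for https://forums.website.com/thread/1.

    Naive heuristic: take the second-to-last dot-separated label of
    the hostname. Fails on suffixes like .co.uk; use a Public Suffix
    List-aware library for production-quality results.
    """
    host = urlparse(url).netloc.lower().split(":")[0]  # drop any port
    labels = host.split(".")
    domain = labels[-2] if len(labels) >= 2 else host
    return ALIASES.get(domain, domain)
```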

You do not need a full solution in order to submit a PR. If you have questions, drop in to the assemble chat and see if anyone else is interested in working on the problem.

You can download the data here or load it directly into pandas via:

import pandas as pd
df = pd.read_csv('https://s3.amazonaws.com/far-right/fourchan/youtube_urls.csv')

Post-cleaning should generate a list of domains and their counts, as well as a separate file of all shortened links where the domain is not known. (Recommend you do not try to visit these shortened links.)

youtube, 500
facebook, 200
twitter, 150
wikipedia, 100
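One way to produce counts in that shape, with shortened links split out, is a sketch along these lines. The SHORTENERS set and ALIASES table are assumed, incomplete example lists, not vetted data:

```python
from collections import Counter
from urllib.parse import urlparse

# Assumed, incomplete example lists.
SHORTENERS = {"bit.ly", "t.co", "goo.gl", "tinyurl.com"}
ALIASES = {"youtu": "youtube"}

def count_domains(urls):
    """Return (domain counts in descending order, list of shortened links)."""
    counts, shortened = Counter(), []
    for url in urls:
        host = urlparse(url).netloc.lower().split(":")[0]
        host = host[4:] if host.startswith("www.") else host
        if host in SHORTENERS:
            shortened.append(url)  # publish these separately
            continue
        labels = host.split(".")
        domain = labels[-2] if len(labels) >= 2 else host
        counts[ALIASES.get(domain, domain)] += 1
    return counts.most_common(), shortened
```

`most_common()` already yields (domain, count) pairs sorted in descending order, matching the sample output above.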

Warning: this work requires you to deal with highly explicit and offensive content from 4chan's /pol/ board. Please do not visit the links you find, as some may contain malware or offensive content.

@bstarling bstarling changed the title Top level domain extraction URL domain extraction Mar 31, 2017
@strongdan
Contributor

I'm a beginner looking to gain a bit more experience and I'd be willing to attempt this. How quickly do you expect a PR?

@bstarling
Contributor Author

Hey @strongdan, that sounds good. No time limit. The only request is that if you end up not finding time to finish, you post back here to free it up for someone else. Feel free to post a partial solution so others can collaborate, or drop by the #assemble channel with any questions.

@strongdan
Contributor

Sounds great! I'll try to get something completed this weekend and update you with what I have.

@strongdan
Contributor

I didn't see that @harish-garg already solved this one. Great job!

@bstarling
Contributor Author

Still room for improvement or alternative approaches.

@strongdan
Contributor

I had to post on Stack Overflow about sorting out short URLs: http://stackoverflow.com/questions/43219063/detecting-a-short-url-using-python

It sounds tough to implement. I can try to come up with a list of known short URLs or match on a regular expression. I will most likely need some help with cleaning and validating the results.
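For what it's worth, the two approaches mentioned (a curated list of known shorteners plus a regular-expression match on short hosts and paths) could be combined in a sketch like this. The KNOWN set and the length thresholds in the pattern are assumptions, not a vetted rule:

```python
import re
from urllib.parse import urlparse

# Assumed, incomplete list of known shortener hosts.
KNOWN = {"bit.ly", "t.co", "goo.gl", "ow.ly", "is.gd", "tinyurl.com"}

# Heuristic: very short hostname followed by a short opaque token.
# The length bounds (host <= 7 chars, path 4-10 chars) are guesses.
SHORT_PATTERN = re.compile(r"^https?://[a-z0-9.-]{1,7}/[A-Za-z0-9]{4,10}$")

def looks_shortened(url):
    """Flag a URL as likely shortened via list lookup or pattern match."""
    host = urlparse(url).netloc.lower().split(":")[0]
    return host in KNOWN or bool(SHORT_PATTERN.match(url))
```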
