Inspiration

After seeing people at hackathons submit the same projects repeatedly, we were a little annoyed that no one seemed to know that the cool-sounding project they saw at the last hackathon had also been done at the hackathon before that, and the one before that. The idea of using advanced natural language techniques to analyse a huge source of data also excited us. Ultimately, we were able to make a data-driven argument that supports our hypothesis: many hackathon projects are quite similar, and some are plagiarised.

What it does

Diff-post is a web portal where you submit the URL of a Devpost project that you entered at a hackathon. We compare it against our database of close to 1000 projects and compute a similarity score from three features: normalised word2vec vectors, shared tags for the technologies used, and keywords obtained using tf-idf.
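
As a rough illustration of how three such features could be folded into one score, here is a minimal sketch. It is not the code in this repository: the equal weights, the clamping of negative cosine values to zero, and the `vector`, `tags` and `keywords` field names are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def jaccard(a, b):
    """Overlap of two collections: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical equal weights; the weighting Diff-post actually used is not stated.
WEIGHTS = {"vector": 1 / 3, "tags": 1 / 3, "keywords": 1 / 3}

def similarity(p, q, weights=WEIGHTS):
    """Weighted mix of the three feature similarities, between 0 and 1."""
    parts = {
        # Negative cosine values are clamped to 0 (an assumption of this sketch).
        "vector": max(0.0, cosine(p["vector"], q["vector"])),
        "tags": jaccard(p["tags"], q["tags"]),
        "keywords": jaccard(p["keywords"], q["keywords"]),
    }
    return sum(weights[name] * value for name, value in parts.items())

# Made-up projects, for illustration only.
submitted = {"vector": [0.6, 0.8], "tags": ["python", "flask"],
             "keywords": ["scrape", "similarity"]}
other = {"vector": [0.8, 0.6], "tags": ["python", "node"],
         "keywords": ["scrape", "chat"]}

print(f"{100 * similarity(submitted, other):.1f}% similar")
```

Keeping each feature on a 0 to 1 scale before weighting means the combined value can be shown directly as a percentage.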

How we built it

We used Python to automate data collection from Devpost and stored the results in a database. Further Python scripts read from that database and extracted the features required for text mining. Another script then ranked the close to 1000 projects by how similar they were to the submitted project. We displayed the ranked projects in a dashboard that showed how similar your project was, as a percentage, and which technologies the similar projects had used. We also used t-distributed stochastic neighbour embedding (t-SNE) to project our 100-dimensional vectors down to 2 dimensions so that we could visualise how similar the projects on Devpost were to one another.
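
The ranking and dashboard steps can be sketched along these lines. This is an assumption-laden illustration rather than the project's `ranker.py`: `tag_overlap` stands in for the real similarity score, and the `name` and `tags` fields and the sample projects are hypothetical.

```python
from collections import Counter

def tag_overlap(a, b):
    """Stand-in similarity score: Jaccard overlap of the two projects' tags."""
    a, b = set(a["tags"]), set(b["tags"])
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_projects(submitted, projects, score=tag_overlap):
    """Return (name, percent) pairs, most similar project first."""
    ranked = sorted(projects, key=lambda p: score(submitted, p), reverse=True)
    return [(p["name"], round(100 * score(submitted, p), 1)) for p in ranked]

def common_technologies(projects, names, top=3):
    """Count the tags used by the named projects and return the most frequent."""
    wanted = set(names)
    counts = Counter(t for p in projects if p["name"] in wanted for t in p["tags"])
    return counts.most_common(top)

# Made-up data, for illustration only.
submitted = {"name": "mine", "tags": ["python", "flask", "d3"]}
projects = [
    {"name": "a", "tags": ["python", "flask"]},
    {"name": "b", "tags": ["swift"]},
    {"name": "c", "tags": ["python", "flask", "d3"]},
]

ranking = rank_projects(submitted, projects)
closest = [name for name, _ in ranking[:2]]
print(ranking)
print(common_technologies(projects, closest, top=2))
```

Passing the score in as a function keeps the ranking step independent of how the features are combined.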

Challenges we ran into

This was probably our most ambitious project to date, and it was fraught with challenges. For starters, scraping Devpost was tricky: the HTML pages weren't well organised, and we often had to loop over groups of similar tags to find the ones we were interested in. With the data in place, making use of the word2vec and tf-idf models was challenging. It took us nearly all night to index the database with those features, and generating the same features for every newly submitted link was challenging too. Finally, we had to make the various pieces of the puzzle play well together, and that took nearly all morning.
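
Looping over groups of similar tags to pick out the interesting ones might look roughly like the sketch below, which uses only the standard library's `html.parser`. The `cp-tag` class name and the sample markup are invented for the example and are not taken from Devpost's actual pages; the scrapers in this repository may work quite differently.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect the text of every <span> that carries a given CSS class.

    Nested <span> elements are not handled; this is only a sketch.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.inside = False
        self.found = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and self.wanted_class in classes:
            self.inside = True
            self.found.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.found[-1] += data

def extract_tags(html, wanted_class="cp-tag"):
    """Return the stripped text of all matching spans, in document order."""
    collector = TagCollector(wanted_class)
    collector.feed(html)
    collector.close()
    return [text.strip() for text in collector.found if text.strip()]

# Hypothetical markup: several similar <span> elements, only some of interest.
SAMPLE = """
<ul>
  <li><span class="meta">3 likes</span></li>
  <li><span class="cp-tag">python</span></li>
  <li><span class="cp-tag recognized">flask</span></li>
</ul>
"""

print(extract_tags(SAMPLE))
```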

Accomplishments that we are proud of

Scraping such a vast and varied data source. Using natural language processing algorithms to extract meaningful information from textual data. Building a pipeline that did this over and over again.
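
To give a feel for the keyword extraction, here is a small tf-idf sketch in plain Python. It uses one common variant (term frequency normalised by document length, inverse document frequency as `log(N / df)`), which may differ from the exact weighting we used; the tokeniser and the sample descriptions are assumptions for the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters only."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_keywords(docs, top=3):
    """Return the `top` highest-scoring tf-idf terms for each document."""
    tokenized = [tokenize(doc) for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    keywords = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top])
    return keywords

# Made-up project descriptions, for illustration only.
docs = [
    "a chat app for students",
    "a drone app that maps floods",
    "a chat bot for recipes",
]

print(tfidf_keywords(docs, top=2))
```

Terms that appear in every description, such as "a" here, score zero and so never surface as keywords.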

What we learned

We learnt a lot about how complex a field text analysis is to work in, and about the various preprocessing steps involved in dealing with text data. We also learnt to scrape responsibly, trying not to bring down Devpost, and how to use Flask.
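
One simple way to scrape responsibly is to enforce a minimum gap between requests. The sketch below shows the idea; the `Throttle` class, the two-second delay and the placeholder URLs are hypothetical and are not taken from our scrapers.

```python
import time

class Throttle:
    """Space out calls so that at least `delay` seconds separate them."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.clock = clock  # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last = None

    def wait(self):
        """Block until `delay` seconds have passed since the previous call."""
        now = self.clock()
        if self.last is not None:
            remaining = self.delay - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now

def polite_fetch_all(urls, fetch, throttle):
    """Call `fetch(url)` for each URL, pausing between calls as needed."""
    pages = []
    for url in urls:
        throttle.wait()
        pages.append(fetch(url))
    return pages

# Usage with a stand-in fetch function; a real scraper would issue an HTTP request.
pages = polite_fetch_all(
    ["page-1", "page-2"],
    fetch=lambda url: f"<html>{url}</html>",
    throttle=Throttle(delay=0.01),
)
print(pages)
```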

Folders and files

bin
config
flask
heatmap
public
scrapers
server
src
tests
trained
wordCloud
.editorconfig
.eslintignore
.eslintrc
.gitignore
.travis.yml
GOODdatabase.json
LICENSE
README.md
cleanup.py
database.json
hack2vec.py
hackathon_database_100dim_unique.json
main.py
package.json
postcss.config.js
projectsWithVectors.json
ranker.py
scatterplot.html
similarity.py
tsne_viz.py
tsneviz.csv
wordvecmodel.py
yarn.lock

achadha0111/hack-the-burgh
