BloggerDemoGraph Analyser

Finds blogger demographics from Amazon's Common Crawl corpus data. The data extracted from the corpus for analysis is the crawled blogger profile webpages. Sample Page.

Special credit to the URL index by Scott Robertson, which made the data analysis possible on just my local machine, without the need for AWS.

Some findings

  1. 25% of bloggers from Bangalore write about Music (not so surprising, as it is the rock capital of India)

  2. NY, SF and Toronto account for 40% of the bloggers.

  3. San Francisco and Dallas (17% each) account for the most bloggers with an interest in Politics

  4. San Francisco, Bangalore and Vancouver are among the top cities with bloggers whose interest is Travel

See more report charts in the SnapShots folder!

Installation

  1. Import the BloggerAuthorInfo-Total.sql file into your MySQL server.

  2. Change the connection credentials in the ConnectionUtils.java file (see the sketch after this list).

  3. Run BloggerAuthorInfoAnalyser.java in src/analyser/main (Work in progress)
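
For reference, here is a minimal sketch of what the connection setup in ConnectionUtils.java could look like. The JDBC URL, database name and credential values below are assumptions for illustration; adapt them to the actual file and your local MySQL setup.

    // Illustrative sketch only -- the real ConnectionUtils.java may differ.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class ConnectionUtils {

        // Assumed values -- change these to match your local MySQL setup.
        private static final String URL  = "jdbc:mysql://localhost:3306/blogger";
        private static final String USER = "root";
        private static final String PASS = "password";

        public static Connection getConnection() throws SQLException {
            // Requires the MySQL Connector/J driver on the classpath.
            return DriverManager.getConnection(URL, USER, PASS);
        }
    }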

About

This project aims at profiling blogger interests correlated with their demographics.

Amazon's Common-Crawl corpus was used for this purpose. I extracted the crawled data corresponding to the blogger profile webpages as the dataset for my analysis.

The selective download of the required dataset was made possible by the Common Crawl URL Index (https://github.com/trivio/common_crawl_index) by Scott Robertson (https://angel.co/srobertson).

The technology stack of the project comprises:

  1. A Python script to facilitate downloading selective chunks of ARC files from the Common-Crawl corpus
  2. Maui (http://code.google.com/p/maui-indexer/) to extract topics from raw text, used in conjunction with the AGROVOC vocabulary (http://aims.fao.org/standards/agrovoc)
  3. A MySQL database to hold the rough results of the crawl
  4. JFreeChart to display charts (a small example follows this list)
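
As a small illustration of the charting step, the sketch below builds a pie chart from per-city counts and writes it to a PNG (assuming the JFreeChart 1.0 API); the dataset values and file name are placeholders, not results produced by the project code.

    import java.io.File;
    import org.jfree.chart.ChartFactory;
    import org.jfree.chart.ChartUtilities;
    import org.jfree.chart.JFreeChart;
    import org.jfree.data.general.DefaultPieDataset;

    public class InterestChartDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical counts of bloggers interested in Politics, per city.
            DefaultPieDataset dataset = new DefaultPieDataset();
            dataset.setValue("San Francisco", 17);
            dataset.setValue("Dallas", 17);
            dataset.setValue("Other", 66);

            JFreeChart chart = ChartFactory.createPieChart(
                    "Bloggers interested in Politics, by city", dataset, true, true, false);

            // Write the chart to a PNG, similar to the images in the SnapShots folder.
            ChartUtilities.saveChartAsPNG(new File("politics-by-city.png"), chart, 640, 480);
        }
    }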

Data used

About 8000 blogger profile URLs were found using the URL index on the Common-Crawl corpus data. The resulting HTML dumps of these webpages totalled just about 100MB. These dumps were in turn used for analysis.

Thought process that led me to this implementation stack

Reason to use URL Index

Cost aversion was a factor that led me to use the URL index, allowing all the computation to be done on my local machine.
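
In practice the selective download amounts to HTTP range requests: the URL index yields, for each matching URL, the ARC file it lives in plus a byte offset and length, and only that slice is fetched. The project does this with a Python script; below is a rough Java equivalent of the same step, with the file URL, offset and length as placeholder values.

    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ArcChunkFetcher {
        public static void main(String[] args) throws Exception {
            // Placeholder values -- in practice the path, offset and length
            // come from a lookup in the Common Crawl URL index.
            String arcUrl = "https://commoncrawl.s3.amazonaws.com/path/to/segment.arc.gz";
            long offset = 123456L;
            long length = 7890L;

            HttpURLConnection conn = (HttpURLConnection) new URL(arcUrl).openConnection();
            // Ask only for the bytes holding the crawled page we want.
            conn.setRequestProperty("Range", "bytes=" + offset + "-" + (offset + length - 1));

            try (InputStream in = conn.getInputStream();
                 FileOutputStream out = new FileOutputStream("chunk.arc.gz")) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            }
        }
    }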

Reason to use Maui

Maui is a topic extractor with the ability to extract topic terms from text, even terms that are not actually present in the text. For example, it could give me "Politics" as a topic even though neither the term politics nor any of its roots is mentioned in the text. Maui consumes RDF-based vocabularies for topic extraction. I used the latest version of the AGROVOC vocabulary as it contains a broad array of topics.
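
Once Maui has run, the pipeline only needs the extracted topic terms per profile. The sketch below shows that aggregation step; it assumes Maui wrote one .key file per profile document with one extracted topic term per line, which is an assumption about the output layout rather than the project's actual code.

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class TopicCounter {
        public static void main(String[] args) throws IOException {
            Map<String, Integer> topicCounts = new HashMap<>();

            // Assumed layout: one <profile>.key file per blogger profile,
            // each line holding one topic term extracted by Maui.
            try (DirectoryStream<Path> keys = Files.newDirectoryStream(Paths.get("profiles"), "*.key")) {
                for (Path keyFile : keys) {
                    for (String topic : Files.readAllLines(keyFile)) {
                        topic = topic.trim().toLowerCase();
                        if (!topic.isEmpty()) {
                            topicCounts.merge(topic, 1, Integer::sum);
                        }
                    }
                }
            }

            topicCounts.forEach((topic, count) -> System.out.println(topic + ": " + count));
        }
    }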

Reason to use blogger profile webpages

Since I wanted to analyse blogger interests based on demographic information, these profile pages served as a rich source of that information. It was possible to infer the topics of interest to a blogger without actually having to crawl their entire blog. As this was just a Proof of Concept (PoC) project, this minimised the computation cost, as everything was done locally without the need for EC2.
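
To illustrate why the profile page alone is enough, the sketch below pulls a location string out of a downloaded profile HTML dump with a plain regular expression. The markup it assumes (a "Location" label followed by a link) is a simplified, hypothetical fragment, not the exact Blogger profile structure or the project's parsing code.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ProfileFieldDemo {
        public static void main(String[] args) {
            // A trimmed-down, hypothetical fragment of a blogger profile dump.
            String html = "<th>Location</th><td><a href=\"#\">Bangalore, Karnataka, India</a></td>";

            // Assumed markup: the location value sits in the first link after "Location".
            Pattern p = Pattern.compile("Location</th>\\s*<td>\\s*<a[^>]*>([^<]+)</a>",
                    Pattern.CASE_INSENSITIVE);
            Matcher m = p.matcher(html);
            if (m.find()) {
                System.out.println("Location: " + m.group(1));  // Bangalore, Karnataka, India
            }
        }
    }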

TODO

  1. Enrich the data by crawling the actual blog sites and the "Blogs I follow" section of the blogger profile.
