Webpage for CS839 Project

This repository contains the code, data, and report for the different stages of the Data Science project for CS 839.

Team: Akshata Bhat, Pratyush Mahapatra, Felipe Gutierrez Barragan.

Group ID: 4.

Website: https://akshatabhat.github.io/DataScience/

Contents:

Stage 1: Information Extraction From Text
Stage 2: Crawling and extracting structured data from Web pages
Stage 3: Entity Matching
Python Environment Setup

Stage 1: Information Extraction From Text

Annotated Documents: You can find the directory with the annotated documents here. Please refer to the README.md file within that directory for information on the annotations.

Browsable Directory for Set I: Set I's Browsable Directory Link

Browsable Directory for Set J: Set J's Browsable Directory Link

Browsable Directory for Code: Code

Link to compressed file with all of the above directories: To download the compressed files, go to the project repository here, click the green button labeled Clone or download, and then click Download ZIP.

Link to Project Report Report

Link to main function main.py

Stage 2: Crawling and extracting structured data from Web pages

Data Directory: Data. The directory also includes a README file that explains both tables.

Code Directory: Code

Stage 2 Report: PDF

Stage 3: Entity Matching

Matching Fodors and Zagats:

User ID: group4
Project ID: stage3_trial
Screenshot:

Blocking Results:

User ID: group4
Project ID: stage3
Screenshot:

Matching Results:

User ID: group4
Project ID: stage3
Screenshot:

Files Downloaded from CloudMatcher

Table A

Table B

Candidate Set : Size = 5401

Prediction List

Candidate Set L

cand_set_blocked_and_labeled.csv : Size = 397

Labelled Candidate Set L

cand_set_blocked_and_labeled.csv : Size = 397

Precision Recall Results for CloudMatcher's Candidate Set:

Recall = [0.982813752548643 - 1.0054215415690038]

Precision = [0.9325870283311566 - 0.9451907494466212]

Precision Recall Results for Candidate Set with OUR Blocking Rule:

Recall = [0.9941176470588236 - 0.9941176470588236]

Precision = [0.9012651926029382 - 0.9159391084723305]

Iterations:

Step 1 : Our initial Candidate Set had 5401 elements. We randomly sampled 50 elements and found only 2 matching pairs, resulting in a density of 0.04.
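The density in Step 1 is simply the fraction of sampled pairs that were labeled as matches. A minimal sketch of that calculation follows; the function name `estimate_density` and the 0/1 label encoding are assumptions made for illustration and are not taken from the project code.

```python
def estimate_density(labels):
    """Return the fraction of sampled pairs labeled as matches.

    `labels` is a sequence of 0/1 values, one per sampled pair
    (1 = match, 0 = non-match). This encoding is an assumption.
    """
    if not labels:
        raise ValueError("cannot estimate density from an empty sample")
    return sum(labels) / len(labels)


# Reproduce the numbers from Step 1: 50 sampled pairs, 2 of them matches.
sample_labels = [1] * 2 + [0] * 48
density = estimate_density(sample_labels)
```

A density this low means most pairs in the candidate set are non-matches, which is what motivated adding a stricter blocking rule in Step 2.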

Step 2: We added a Blocking Rule (description given below) which reduced the candidate set to 397 elements.

Step 3: We labeled all 397 elements in the new candidate set L.

Blocking Rule Description: Our matching task was to find the same papers in arXiv and CVPR. Our blocking rule keeps a pair only if both entries have the same number of authors. The code is here. We use the linked code inside the Jupyter Notebook where we perform our full analysis.
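The idea behind the blocking rule can be sketched in a few lines of plain Python. This is not the project's actual implementation (that is the linked code); the `authors` field name, the comma-separated author format, and all function names here are hypothetical and chosen only for illustration.

```python
def count_authors(authors):
    """Count authors in a comma-separated string (assumed format)."""
    return len([name for name in authors.split(",") if name.strip()])


def same_author_count(arxiv_row, cvpr_row):
    """Return True if both entries list the same number of authors."""
    return count_authors(arxiv_row["authors"]) == count_authors(cvpr_row["authors"])


def apply_blocking(candidate_pairs):
    """Keep only the pairs that survive the author-count rule."""
    return [(a, b) for a, b in candidate_pairs if same_author_count(a, b)]


# Toy candidate set: the first pair survives, the second is blocked.
pairs = [
    ({"title": "Paper X", "authors": "A. One, B. Two"},
     {"title": "Paper X", "authors": "A One, B Two"}),
    ({"title": "Paper Y", "authors": "C. Three"},
     {"title": "Paper Y", "authors": "C. Three, D. Four"}),
]
surviving = apply_blocking(pairs)
```

The second toy pair shows the rule's main weakness: two entries for what looks like the same paper are dropped when their author lists differ in length.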

Analysis of Blocking Rules: We observed that the blocking stage in CloudMatcher removed a large number of true matches. We verified this by running debug_blocker on the reduced candidate set (the candidate set produced by our blocking rule). We found that most (196 / 200) of the tuple pairs that were true positives had been removed by CloudMatcher. Our own blocking rule removed 4 / 200 of them because, even though the papers seemed to match, their numbers of authors differed.

Take a look at the jupyter_notebook.pdf for the results of all the steps and the code we added to the provided notebook.

Python Environment Setup

Follow the steps in this section to set up an anaconda virtual environment that contains all the dependencies/libraries required to run the code in this repository.

Install miniconda
Create anaconda environment: The following command creates an anaconda environment called dsenv with Python 3.5: conda create -n dsenv python=3.5
Activate environment: source activate dsenv
Install libraries: The code in this repository uses numpy, scipy, pandas, matplotlib, scikit-learn, and ipython (for development). To install all these dependencies run the following command: conda install numpy scipy matplotlib ipython pandas scikit-learn
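After installing, a quick way to confirm that the environment is usable is to check that each library can be found by Python. The sketch below is not part of the repository; the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names. Note that scikit-learn is imported as `sklearn` and ipython as `IPython`.

```python
import importlib.util

# Map each conda package name to the module name used at import time.
REQUIRED = {
    "numpy": "numpy",
    "scipy": "scipy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "scikit-learn": "sklearn",
    "ipython": "IPython",
}


def missing_packages(required=REQUIRED):
    """Return the sorted package names whose modules cannot be found."""
    return sorted(
        name
        for name, module in required.items()
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```

Run the script with the dsenv environment activated; any package it reports as missing can be installed with conda install.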

Note: If installing the packages directly with the above commands does not work, it is probably because different versions of the libraries were installed. If this happens, remove the environment and start over with the following steps.

Install miniconda and clone this repository.
Navigate to the folder containing this repository.
Use the dsenv.yml file to create the conda environment by running the following command: conda env create -f dsenv.yml. This command sets up the same conda environment we are currently using, with identical library versions.

For more details on how to manage a conda environment, take a look at this webpage: Managing Conda Environment.
