This project analyses textual data. It first downloads a set of published papers from a given list of URLs. The papers are then pre-processed, and a set of features is derived from them. These features are a numerical representation of each paper that can be used in downstream modelling tasks.

akchi/Text_pre_processing

Text Pre-Processing and Feature Extraction

The dataset Urls.pdf contains 200 URLs of published papers from a popular AI conference. The features extracted from the processed documents are stored in:

  1. vocab.txt: contains the unigrams and bigrams. The tokens are stored in alphabetical order as token_string:token:index.
  2. count_vectors.txt: each row contains the sparse representation of one paper, in the format paper id, token 1 index, token 1 word count, token 2 index, token 2 word count, and so on.
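
The two output formats above can be illustrated with a minimal sketch. It is not the project's actual code: the tokenisation rule, the paper ids (`PP0001`, `PP0002`), the comma delimiter, and the reading of each vocab line as `token:index` are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic words only (an assumed, simplistic rule)."""
    return re.findall(r"[a-z]+", text.lower())


def unigrams_and_bigrams(tokens):
    """Return the unigrams followed by space-joined adjacent pairs."""
    bigrams = [a + " " + b for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def build_vocab(papers):
    """Map every unigram and bigram seen in any paper to an index, alphabetically."""
    terms = set()
    for text in papers.values():
        terms.update(unigrams_and_bigrams(tokenize(text)))
    return {term: index for index, term in enumerate(sorted(terms))}


def count_vector_row(paper_id, text, vocab):
    """Build one sparse row: paper id, then (token index, count) pairs."""
    counts = Counter(unigrams_and_bigrams(tokenize(text)))
    pairs = sorted((vocab[term], n) for term, n in counts.items() if term in vocab)
    fields = [paper_id] + [str(value) for pair in pairs for value in pair]
    return ",".join(fields)


# Hypothetical toy corpus standing in for the downloaded papers.
papers = {
    "PP0001": "deep learning for text",
    "PP0002": "text mining for deep text",
}

vocab = build_vocab(papers)
vocab_lines = [f"{term}:{index}" for term, index in vocab.items()]
rows = [count_vector_row(pid, text, vocab) for pid, text in papers.items()]

print("\n".join(vocab_lines))
print("\n".join(rows))
```

Only tokens that actually occur in a paper appear in its row, which is what keeps the representation sparse: the second toy paper yields `PP0002,0,1,2,1,3,1,4,1,8,1,9,1,10,2,11,1`, where the pair `10,2` records that "text" (index 10) occurs twice.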

A preliminary analysis of the processed data is also performed, and the results are stored in stats.csv.

Please note: for Python 2, pdfminer needs to be installed; for Python 3, pdfminer.six needs to be installed (both through pip).
