## Text Extraction

This notebook shows how to use AxCell to extract text from arXiv papers with LaTeX source files.

In [4]:
from pathlib import Path
from axcell.helpers.paper_extractor import PaperExtractor

`ROOT_PATH` is the root directory under which the structure shown below is created when the dataset is built.

`SOURCES_PATH` is the directory where the downloaded source files are stored. [See here for bulk downloading arXiv source files.](https://info.arxiv.org/help/bulk_data_s3.html)

```
ROOT_PATH
├── sources                       # arXiv source files
├── unpacked_sources              # extracted LaTeX sources (generated automatically)
├── htmls                         # converted HTML files (generated automatically)
└── papers                        # extracted text and tables (generated automatically)
```
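
Before running the extractor, it can help to check which parts of this layout already exist. The following is a minimal sketch, not part of AxCell: `describe_layout` and `GENERATED_DIRS` are hypothetical names introduced here, and the subdirectory names are copied from the tree above.

```python
from pathlib import Path

# Subdirectory names copied from the layout above. Only `sources` is
# expected to exist up front; the others are generated during extraction.
GENERATED_DIRS = ("unpacked_sources", "htmls", "papers")


def describe_layout(root_path):
    """Return a dict mapping each expected subdirectory to whether it exists.

    Hypothetical helper for illustration; it only inspects the file system.
    """
    root = Path(root_path)
    names = ("sources",) + GENERATED_DIRS
    return {name: (root / name).is_dir() for name in names}
```

Calling `describe_layout(ROOT_PATH)` on a fresh dataset should report `sources` as present and the generated directories as absent until extraction has run.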

In [9]:
ROOT_PATH = Path('../data/papers_s2abel')
SOURCES_PATH = ROOT_PATH / 'sources'

In [None]:
extract = PaperExtractor(ROOT_PATH)

Extract text and tables from a single paper

In [None]:
extract(SOURCES_PATH / '1606.02891v2')

Extract all papers

In [None]:
for s in SOURCES_PATH.iterdir():
    if s.is_file():
        extract(s)
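
With many papers, a single failing source would stop the loop above. The sketch below shows one way to keep going and record what went wrong. It assumes nothing about how `PaperExtractor` reports problems: it stores whatever the callable returns and catches any exception. `extract_all` and `extract_fn` are hypothetical names introduced for illustration.

```python
from pathlib import Path


def extract_all(sources_path, extract_fn):
    """Run extract_fn on every file in sources_path.

    Returns (results, failures): results maps file name to the value
    returned by extract_fn, failures maps file name to the repr of the
    exception that was raised. Subdirectories are skipped, as in the
    loop above.
    """
    results = {}
    failures = {}
    for source in sorted(Path(sources_path).iterdir()):
        if not source.is_file():
            continue
        try:
            results[source.name] = extract_fn(source)
        except Exception as exc:  # record the error and continue with the next file
            failures[source.name] = repr(exc)
    return results, failures
```

A call such as `results, failures = extract_all(SOURCES_PATH, extract)` would then leave the failed file names in `failures` for a later retry.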

### Clean up text

In [5]:
data_dir = '../data'

Extract the raw text into a DataFrame column.

In [10]:
import json
import pandas as pd
import os

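def get_raw_text(arxiv_id):
    # Read the paper's text.json and pull the text out of its 'fragments' list.
    with open(ROOT_PATH / arxiv_id / 'text.json') as f:
        text = json.load(f)
    # Join the fragments in order, separated by newlines.
    return "\n".join(fragment['text'] for fragment in text['fragments'])

# Load the paper metadata (one JSON object per line) and attach the raw text.
papers = pd.read_json(f'{data_dir}/papers.jsonl', lines=True)
papers['raw_text'] = papers['arxiv_id'].apply(get_raw_text)
papers.to_pickle(os.path.join(data_dir, 'papers_with_text.pkl'))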
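
If extraction failed for some papers, `get_raw_text` raises `FileNotFoundError` and the whole `apply` aborts. The following sketch is a tolerant variant that returns `None` for papers without a `text.json`. It assumes the same JSON structure that the cell above reads (a `fragments` list of objects with a `text` key). `load_raw_text` and its `papers_dir` parameter are hypothetical; pass whichever directory actually holds the per-paper folders.

```python
import json
from pathlib import Path


def load_raw_text(papers_dir, arxiv_id):
    """Return the joined fragment text for one paper, or None if text.json is missing."""
    text_file = Path(papers_dir) / str(arxiv_id) / "text.json"
    if not text_file.is_file():
        return None
    with open(text_file, encoding="utf-8") as f:
        data = json.load(f)
    # Same joining rule as get_raw_text: fragments in order, one per line.
    return "\n".join(fragment["text"] for fragment in data["fragments"])
```

It could be applied as `papers['arxiv_id'].apply(lambda i: load_raw_text(ROOT_PATH, i))`, after which `papers['raw_text'].isna()` marks the papers that still need extraction.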