# Text Mining Project

Track changes in the nouns used in **The Actuary** magazine over time. Skills: file handling, PDF text extraction, tokenization, tf-idf, and visualization.

## Possible steps:

1. Create a list of paths to the PDF files of the magazine
2. Loop through each file and create a dictionary or DataFrame
    - Extract properties (year, volume, issue) 
    - Extract all text from PDF with [`pypdf`](https://github.com/py-pdf/pypdf/tree/3.0.0)
3. For each file, tokenize using [`spacy`](https://spacy.io/) and extract just nouns and proper nouns
4. Create `tf-idf` matrix using [`sklearn`](https://scikit-learn.org/stable/) — see [example](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/03-TF-IDF-Scikit-Learn.html) 
5. Be creative about visualizations and/or animations to tell the data story.

In [None]:
import glob
import os
import re

### Code snippets to get started

#### How to get file locations matching pattern

In [None]:
folder = '/Users/ccc14/Library/CloudStorage/Box-Box/Complexity Science/THE ACTUARY MAGAZINE'

In [None]:
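# Match only files with a .pdf extension; sort because glob order is arbitrary
fs = sorted(glob.glob(os.path.join(folder, '*.pdf')))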
fs[0]

#### Getting started with extracting file metadata

In [None]:
for f in fs[:3]:
    path, filename = os.path.split(f)
    name, ext = os.path.splitext(filename)
    items = name.split('-')
    print(items)
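
Once the printed `items` show how the filenames are laid out, the pieces can be turned into a metadata record. The sketch below assumes a hypothetical layout of `<year>-<volume>-<issue>-<anything>.pdf`; the real filenames may differ, so treat the pattern as a placeholder to adjust. `parse_metadata` is an illustrative helper name, not part of any library.

```python
import os
import re

# Hypothetical filename layout: <year>-<volume>-<issue>-<anything>.pdf
# Adjust this pattern after inspecting the real filenames.
PATTERN = re.compile(r'^(?P<year>\d{4})-(?P<volume>\d+)-(?P<issue>\d+)')

def parse_metadata(path):
    """Return a dict with year, volume and issue as ints, or None if the name does not match."""
    name, _ext = os.path.splitext(os.path.basename(path))
    match = PATTERN.match(name)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}

# The example path is invented for illustration.
example = parse_metadata('/some/folder/1998-32-04-the-actuary.pdf')
print(example)
```

Returning `None` for names that do not match makes it easy to list the files that need a different pattern rather than silently mislabeling them.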

#### Getting text from PDF

In [None]:
from pypdf import PdfReader

texts = []
reader = PdfReader(fs[0])
for page in reader.pages:
    text = page.extract_text()
    texts.append(text)

In [None]:
texts[0]

#### Tokenize with `spacy`

In [None]:
import spacy

# The model must be installed first: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")
doc = nlp('\n'.join(texts))

In [None]:
for ent in doc.ents:
    if ent.label_ == 'ORG':
        print(ent.text, ent.start_char, ent.end_char, ent.label_)

In [None]:
for token in doc[:20]:
    # Keep only common nouns and proper nouns
    if token.pos_ in ('NOUN', 'PROPN'):
        print(token)
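
To go from printing nouns to something usable in step 4, the same filter can feed a counter. The sketch below is one possible approach, not a required design: `noun_counts` is a hypothetical helper, and counting lowercased lemmas (so that "Actuaries" and "actuary" merge) is an assumption about what is wanted. The stand-in tokens only exist so the snippet runs without a spaCy model; in the notebook, pass `doc` instead.

```python
from collections import Counter
from types import SimpleNamespace

def noun_counts(tokens):
    """Count lowercased lemmas of alphabetic tokens tagged NOUN or PROPN."""
    return Counter(
        token.lemma_.lower()
        for token in tokens
        if token.pos_ in ('NOUN', 'PROPN') and token.is_alpha
    )

# Stand-in tokens exposing the same attributes as spaCy tokens
# (lemma_, pos_, is_alpha); the values are invented for illustration.
fake_doc = [
    SimpleNamespace(lemma_='Actuary', pos_='PROPN', is_alpha=True),
    SimpleNamespace(lemma_='publish', pos_='VERB', is_alpha=True),
    SimpleNamespace(lemma_='actuary', pos_='NOUN', is_alpha=True),
    SimpleNamespace(lemma_='risk', pos_='NOUN', is_alpha=True),
    SimpleNamespace(lemma_='1998', pos_='NOUN', is_alpha=False),
]

counts = noun_counts(fake_doc)
print(counts.most_common())
```

The `is_alpha` check drops numbers and PDF extraction debris that the tagger sometimes labels as nouns.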

## PROJECT Code