<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2023 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email author@email.address.<br />
____

# `Course Title` `1/2/3`

This is lesson `1` of 3 in the educational series on `TOPIC`. This notebook is intended `to teach XXX and introduce the concepts of XXXX`. 

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `How-To` / `Reference` / `Explanation` 

`Include the use case definition from [here](https://constellate.org/docs/documentation-categories)`

**Difficulty:** `Beginner` / `Intermediate` / `Advanced`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`
`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`
`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
* Object-oriented programming (classes, instances, inheritance)
* Regular Expressions (`re`, character classes)

These should be general skills but can mention a particular library
```

**Knowledge Recommended:**
```
* Basic file operations (open, close, read, write)
* Data cleaning with `Pandas`
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Describe and implement an XXXX for XXXX
2. Convert XXXX into XXXX for the purpose of XXXX
3. Develop a workflow in order to XXXX
4. Be familiar with XXXXX resources for pursuing the topic
```
**Research Pipeline:**
```
1. Research steps before this notebook
2. **The skills in this notebook**
3. Steps after this notebook
4. Final steps
```
___

# Required Python Libraries
`List out any libraries used and what they are used for`
* [Tesseract](https://tesseract-ocr.github.io/) for performing [optical character recognition](https://docs.constellate.org/key-terms/#ocr).
* [Pandas](https://pandas.pydata.org/) for manipulating and cleaning data.
* [Pdf2image](https://pdf2image.readthedocs.io/en/latest/) for converting PDF files into image files.

## Install Required Libraries

In [None]:
### Install Libraries ###

# Using !pip installs
!pip install pdf2image

# Using %%bash magic with apt-get and yes prompt

In [None]:
%%bash
apt-get install -y tesseract-ocr
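
After a system-level install, it can help to confirm that the program is actually reachable from the notebook. The sketch below assumes the `tesseract-ocr` package provides an executable named `tesseract`; the helper name `find_executable` is made up for this illustration.

```python
import shutil

def find_executable(name):
    """Return the full path of `name` if it is on the PATH, otherwise None."""
    return shutil.which(name)

# Assumption: the apt package installs a binary called "tesseract"
tesseract_path = find_executable("tesseract")

if tesseract_path is None:
    print("tesseract was not found on the PATH; the install step may have failed.")
else:
    print(f"tesseract found at {tesseract_path}")
```

If the message reports that the program was not found, rerun the install cell and check its output for errors before continuing.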

In [None]:
### Import Libraries ###
import urllib.request
import os

# Required Data

`List out the data sources, including their formats and a few sentences describing the data. Include a link to the data source description, if possible.`

**Data Format:** 
* image files (.jpg, .png)
* document files (.pdf)
* plain text (.txt)

**Data Source:**
* [Detroit Open Data Portal](https://data.detroitmi.gov/datasets/detroitmi::dpd-citizen-complaints/about)

**Data Quality/Bias:**
`Analysis of this data should consider the following quality and bias issues...`

**Data Description:**

`This lesson uses XXXX data in XXX format from XXXX source. Additional details about the data used.`

## Download Required Data

In [None]:
### Grab files with console `wget` and `mv` ###
!wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
!mv eng.traineddata /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata


In [None]:
### Grab a single file and supply name ###
urllib.request.urlretrieve('https://file.address.txt', 'filename.txt')

In [None]:
### Retrieve multiple files using a list ###

download_urls = [
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_01.pdf',
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_02.pdf',
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_03.pdf'
]

for url in download_urls:
    urllib.request.urlretrieve(url, url.rsplit('/', 1)[-1])
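
The loop above names each local file after the last segment of its URL. A minimal sketch of that step on its own, with an optional check that avoids downloading a file twice, might look like this. The helper names `filename_from_url` and `needs_download` are hypothetical, and nothing is downloaded here.

```python
import os
from urllib.parse import urlparse

def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    return os.path.basename(urlparse(url).path)

def needs_download(url, folder="."):
    """Return True if no file with the URL's filename exists in `folder` yet."""
    return not os.path.exists(os.path.join(folder, filename_from_url(url)))

example_url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_01.pdf'
print(filename_from_url(example_url))  # sample_01.pdf
```

Compared with `url.rsplit('/', 1)[-1]`, parsing the URL first keeps a query string such as `?download=1` out of the filename.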

In [None]:
### Retrieve multiple files using a list ###
### With data folder creation using os ###

# Files hosted somewhere else (don't store data on GitHub)

# Check if a folder exists to hold pdfs. If not, create it.
if not os.path.exists('sample_pdfs'):
    os.mkdir('sample_pdfs')

# Move into our new directory
os.chdir('sample_pdfs')

# Download the pdfs into our directory
import urllib.request
download_urls = [
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_01.pdf',
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_02.pdf',
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_03.pdf'
]

for url in download_urls:
    urllib.request.urlretrieve(url, url.rsplit('/', 1)[-1])
    
## Move back out of our directory
os.chdir('../')

## Success message
print('Folder created and pdfs added.')
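
Changing the working directory with `os.chdir` works, but if a download fails partway through, the notebook is left inside `sample_pdfs`. One possible alternative, sketched here with `pathlib`, is to build the full destination path for each file instead. The helper name `destination_for` is an assumption made for this example, and the download call is left as a comment so the cell has no side effects.

```python
from pathlib import Path

def destination_for(url, folder):
    """Build the local path where the file at `url` would be saved."""
    return Path(folder) / url.rsplit('/', 1)[-1]

data_dir = Path('sample_pdfs')
url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_01.pdf'

target_path = destination_for(url, data_dir)
print(target_path)

# To actually download, one could then run:
# data_dir.mkdir(exist_ok=True)
# urllib.request.urlretrieve(url, target_path)
```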

In [None]:
### Constellate Example ###

# Importing your dataset with a dataset ID
import constellate
# Pull in the sampled dataset (1500 documents) that matches `dataset_id`
# in the form of a gzipped JSON lines file.
# The .get_dataset() method downloads the gzipped JSONL file
# to the /data folder and returns a string for the file name and location
dataset_file = constellate.get_dataset(dataset_id)

# To download the full dataset (up to a limit of 25,000 documents),
# request it first in the builder environment. See the Constellate Client
# documentation at: https://constellate.org/docs/constellate-client
# Then use the `constellate.download` method shown below.
#dataset_file = constellate.download(dataset_id, 'jsonl')
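
Since the dataset arrives as a gzipped JSON lines file, each line holds one JSON object. The sketch below shows one way such a file could be read using only the standard library. It runs on a tiny made-up file rather than Constellate data, and the field names `id` and `title` are placeholders, not a description of the real schema.

```python
import gzip
import json
import os
import tempfile

def read_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, 'rt', encoding='utf-8') as infile:
        for line in infile:
            line = line.strip()
            if line:
                yield json.loads(line)

# Build a small demonstration file (made-up records, not real data)
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.jsonl.gz')
with gzip.open(demo_path, 'wt', encoding='utf-8') as outfile:
    outfile.write('{"id": "doc-1", "title": "First"}\n')
    outfile.write('{"id": "doc-2", "title": "Second"}\n')

documents = list(read_jsonl_gz(demo_path))
print(len(documents))
```

With a real dataset, `demo_path` would be replaced by the `dataset_file` string returned above.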


# Introduction

```
Introduce the lesson topic. Answer questions such as:
* Why is it useful? 
* Why should we learn it? 
* Who might use it? 
* Where has it been used by scholars/industry?
* What do we need to do it?
* What subjects are included in the notebooks?
* What is not in this notebook? Where should we look for it?
```

In [1]:
import time
import requests
from bs4 import BeautifulSoup
import pathlib

In [2]:
target = pathlib.Path('sci-pages')

In [3]:
for p in target.glob("*.html"):
    print(p)

sci-pages/sci-F.html
sci-pages/sci-S.html
sci-pages/sci-J.html
sci-pages/sci-K.html
sci-pages/sci-R.html
sci-pages/sci-G.html
sci-pages/sci-L.html
sci-pages/sci-Y.html
sci-pages/sci-U.html
sci-pages/sci-T.html
sci-pages/sci-A.html
sci-pages/sci-X.html
sci-pages/sci-M.html
sci-pages/sci-N.html
sci-pages/sci-W.html
sci-pages/sci-B.html
sci-pages/sci-C.html
sci-pages/sci-V.html
sci-pages/sci-O.html
sci-pages/sci-Z.html
sci-pages/sci-Q.html
sci-pages/sci-D.html
sci-pages/sci-H.html
sci-pages/sci-I.html
sci-pages/sci-E.html
sci-pages/sci-P.html


In [51]:


# Folder that will hold one plain-text file per downloaded HTML page
parsed_target = pathlib.Path('parsed_sci-pages')
parsed_target.mkdir(exist_ok = True)

for p in target.glob("*.html"):
    soup = BeautifulSoup(p.read_text(), 'html.parser')
    # Name the output file after the page, e.g. sci-F.html -> sci-F.txt
    f = parsed_target / (p.stem + '.txt')
    text = ""
    # For each link directly inside a table-cell paragraph,
    # append the text of its parent <p>
    for a in soup.select("td > p > a"):
        text += a.parent.text
    f.write_text(text)

In [25]:
out = pathlib.Path('all_species')
out.write_text(page_text)

0

In [1]:
import re
from bs4 import BeautifulSoup
import pathlib

In [43]:
parsed_target = pathlib.Path('parsed_sci-pages')
parsed_target.mkdir(exist_ok = True)

all_species = []
target = pathlib.Path('sci-pages')

for p in target.glob("*.html"):
    soup = BeautifulSoup(p.read_text(), 'html.parser')
    f = parsed_target / (p.stem + '.txt')
    # Text of every <p> that is a direct child of a table cell
    p_tags = soup.select("td > p")
    p_text = [tag.text for tag in p_tags]
    # Split each paragraph into lines and drop the empty ones
    lines = []
    for chunk in p_text:
        lines.extend(chunk.split('\n'))
    lines = [line for line in lines if len(line) > 0]
    # Write this page's lines to its own file, then add them to the full list
    with open(f, 'w', encoding = 'utf-8') as speciesout:
        for line in lines:
            speciesout.write(line + '\n')
    all_species.extend(lines)

In [39]:
len(all_species)

29814

In [42]:
with open('all_species.txt', 'w', encoding = 'utf-8') as outfile:
    for s in all_species:
        outfile.write(s + '\n')

In [45]:
with open('all_species.txt', 'r', encoding = 'utf-8') as infile:
    names = infile.read().splitlines() # another way of removing newlines
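
The comment above mentions that `splitlines()` is one of several ways to remove newlines. A small comparison on a made-up two-line string (the entries imitate the format of the species file) shows how three common approaches differ:

```python
import io

raw = 'Fagus grandifolia (19)\nFagus orientalis (1)\n'

# 1. splitlines() drops the line endings and adds no trailing empty item
via_splitlines = raw.splitlines()

# 2. Iterating line by line (as with a file object) and stripping each newline
via_rstrip = [line.rstrip('\n') for line in io.StringIO(raw)]

# 3. split('\n') leaves an empty string after the final newline
via_split = raw.split('\n')

print(via_splitlines == via_rstrip)  # True
print(via_split)  # ['Fagus grandifolia (19)', 'Fagus orientalis (1)', '']
```

The extra empty string from `split('\n')` is why the earlier parsing cell filters out lines of length zero.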

In [47]:
len(names)

29814

In [54]:
without_p = 0

for n in names:
    if ')' not in n:
        without_p += 1

In [55]:
without_p

0

In [56]:
import re

In [69]:
match_species = re.compile(r'(.+)\(([0-9]+)\)')
results = []
for n in names:
    found = match_species.findall(n)
    if len(found) != 1:
        print(found)
    else:
        results.append(found[0])
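
To see what the pattern captures, it can be tried on a few sample lines. The first two strings are reconstructed from names and counts that appear in this notebook's output; the last two are made up to show an unmatched line and a name that contains its own parentheses.

```python
import re

# Group 1: everything before the final "(digits)"; group 2: the digits
match_species = re.compile(r'(.+)\(([0-9]+)\)')

samples = [
    'Fagonia laevis (105)',
    'Fallopia japonica var. japonica (1)',
    'No count on this line',           # made-up: no match expected
    'Genus species (cultivar) (12)',   # made-up: extra parentheses in the name
]

matches = [match_species.findall(s) for s in samples]
for s, m in zip(samples, matches):
    print(repr(s), '->', m)
```

Because `.+` is greedy, the name group extends as far as possible, so the count is taken from the last parenthesised number on the line. The captured name also keeps the space that precedes the opening parenthesis.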

In [71]:
for r in results:
    if len(r) != 2:
        print(r)

In [72]:
species_pics = {name: int(count) for name, count in results}

In [73]:
species_pics

{'Fabaceae sp. ': 3,
 'Fabiana imbricata ': 5,
 'Fabronia pusilla ': 7,
 'Facelis retusa ': 3,
 'Facheiroa ulei ': 2,
 'Fagonia chilensis ': 2,
 'Fagonia laevis ': 105,
 'Fagonia pachyacantha ': 48,
 'Fagopyrum esculentum ': 24,
 'Fagopyrum tataricum ': 1,
 'Fagraea berteriana ': 10,
 'Fagus grandifolia ': 19,
 'Fagus orientalis ': 1,
 'Fagus sylvatica ': 46,
 'Faidherbia albida ': 60,
 'Faisherbia albida ': 2,
 'Falcataria moluccana ': 10,
 'Falkia repens ': 6,
 'Fallopia convolvulus ': 6,
 'Fallopia dumetorum ': 5,
 'Fallopia japonica ': 25,
 'Fallopia japonica var. japonica ': 1,
 'Fallopia sachalinensis ': 5,
 'Fallopia scandens  ': 4,
 'Falluga paradoxa ': 1,
 'Fallugia paradoxa ': 102,
 'Famatina cisandina ': 4,
 'Faradaya splendida ': 1,
 'Farfugium japonicum ': 8,
 'Farfugium japonicum var. giganteum ': 4,
 'Fascicularia bicolor ': 2,
 'Fascicularia pitcairnifolia ': 1,
 'Fatsia japonica ': 20,
 'Faucaria britteniae ': 1,
 'Faucaria tigrina ': 5,
 'Faucaria tuberculosa ': 2,
 '
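
The keys in the output above keep their trailing whitespace (for example `'Fallopia scandens  '` ends in two spaces), which can cause lookups by name to fail. One possible cleanup, sketched on three entries copied from that output, strips each name and checks that doing so does not merge two keys that were previously distinct. The variable names `species_clean` and `collisions` are introduced only for this illustration.

```python
from collections import Counter

# A few (name, count) pairs in the shape produced by the regular expression
results = [
    ('Fagus sylvatica ', '46'),
    ('Fallopia scandens  ', '4'),
    ('Fagonia laevis ', '105'),
]

# Strip surrounding whitespace from each name while converting counts to int
species_clean = {name.strip(): int(count) for name, count in results}
print(species_clean)

# A stripped name that occurs more than once would overwrite an earlier entry
stripped_names = Counter(name.strip() for name, _ in results)
collisions = [name for name, n in stripped_names.items() if n > 1]
print(collisions)  # []
```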

In [74]:
from collections import Counter

counted = Counter(species_pics)

counted.most_common(10)
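
As a small illustration of what `most_common` returns, here is the same call on four counts copied from the dictionary output above (names stripped of whitespace for readability):

```python
from collections import Counter

sample_counts = Counter({
    'Fagonia laevis': 105,
    'Fallugia paradoxa': 102,
    'Faidherbia albida': 60,
    'Fagus orientalis': 1,
})

# The n entries with the highest counts, as (name, count) tuples
top_two = sample_counts.most_common(2)
print(top_two)  # [('Fagonia laevis', 105), ('Fallugia paradoxa', 102)]

# With no argument, every entry is returned from highest to lowest count,
# so the last item is the least pictured species in this sample
least = sample_counts.most_common()[-1]
print(least)  # ('Fagus orientalis', 1)
```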

In [84]:
len(set(counted.elements()))

29814

In [85]:
counted.total()

437982
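
`Counter.total()` was added in Python 3.10, so on older versions the call above raises an `AttributeError`. A version-independent sketch, again on a few counts copied from the output above, sums the values directly. It also shows that `len()` of the counter gives the number of distinct names without expanding every element first.

```python
from collections import Counter

sample_counts = Counter({
    'Fagonia laevis': 105,
    'Fagus sylvatica': 46,
    'Fagus orientalis': 1,
})

# Same result as sample_counts.total(), but works before Python 3.10
total_pictures = sum(sample_counts.values())
print(total_pictures)  # 152

# Number of distinct names; equivalent to len(set(sample_counts.elements()))
unique_names = len(sample_counts)
print(unique_names)  # 3
```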

In [87]:
for name, count in counted.items():
    if count == 1:
        print(name)

Fagopyrum tataricum 
Fagus orientalis 
Fallopia japonica var. japonica 
Falluga paradoxa 
Faradaya splendida 
Fascicularia pitcairnifolia 
Faucaria britteniae 
Fauchea laciniata 
Faxonia pusilla 
Felicia muricata 
Fenestraria rhopalophylla 
Ferocactus acanthodes 
Ferocactus echidnae var. echidnae 
Ferocactus fordii 
Ferocactus fordii var. fordii 
Ferocactus latispinus 
Ferocactus pottsii 
Ferocactus sinuatus 
Ferocactus stainesii 
Ferraria antherosa 
Ferraria crispa ssp. nortieri 
Festuca brachyphylla ssp. coloradoensis 
Festuca californica ssp. californica 
Festuca drymeja 
Festuca roemeri 
Fibigia clypeata 
Fibigia eriocarpa 
Ficinia truncata 
Ficus altissima 
Ficus audrey 
Ficus citrifolia 
Ficus repens 
Ficus thonningii 
Ficus tinctoria subsp. tinctoria 
Ficus trigonata 
Filago vulgaris 
Filicium decipiens 
Filipendula kamtschatica 
Fimbristylis puberula var. interior 
Firmiania simplex 
Flourensia oolepis 
Fockea edulis 
Fontinalis neomexicana 
Forestiera 
Forestiera arizonica 
Fo