# Text Content Analysis - Setup 

![BSSDH](https://site-512948.mozfiles.com/files/512948/DHbaneris2.gif)

## Welcome to the Text Content Analysis Workshop!

### A few words about me

Valdis Saulespurens 

 * researcher at the National Library of Latvia, digital development department
 * 30+ years programming, 15 years with Python, 5 of those teaching Python
 * lecturer at Riga Technical University, Riga Business School
 * contact: valdis.saulespurens@lnb.lv - email, [Valdis on LinkedIn](https://www.linkedin.com/in/valdis-saulespurens), [ValRCS on Github](https://github.com/ValRCS)


### A few words about the workshop

* Two days - we meet on the 25th and 26th of July 2023 here at the National Library of Latvia.

#### Day 1 - 25th of July
* Part 1 - 11:00 - 12:30 - Introduction to Text Content Analysis
* Lunch Break 1 hour - 12:30 - 13:30
* Part 2 - 13:30 - 15:00 - Cleaning, preprocessing and tokenization
* Coffee Break 20 min 
* Part 3 - 15:20 - 16:50 - Creating embeddings 

#### Day 2 - 26th of July
* Part 4 - 11:00 - 12:30 - Topic modeling
* Lunch Break 1 hour - 12:30 - 13:30
* Part 5 - 13:30 - 15:00 - Trend analysis, visualization and interpretation
* Coffee Break 20 min
* Part 6 - 15:20 - 16:50 - Your own work in class with the help of the instructor - required for the certificate/gaining credits


## Testing your computer setup

This workshop uses the Python programming language and Jupyter Notebooks. A few minimal prerequisites are needed to run the notebooks in this repository.

### Assumptions

* You know a little bit about Python - refresher is provided in this repository
* You have a Google account (gmail) and can use Google Colab

For those new to Jupyter Notebooks - they are a way to combine text and code in a single document. You can run the code and see the results right in the notebook. You can also edit the code and run it again. This is a very convenient way to learn and experiment with Python. More on Jupyter Notebooks here : [Jupyter Notebooks](https://jupyter.org/)

Jupyter Notebooks can be run locally on your computer or in the cloud. The primary/minimal option is using Google Colab - a cloud based Jupyter Notebook environment. You will need a Google account (gmail) to use Google Colab.

As the instructor, I will be using Visual Studio Code with the Python extension and git support. This is a very powerful environment for Python development. You can install it on your computer and use it for this workshop. You will need to install Python and git on your computer - instructions were sent and are below. This is the preferred option for this workshop.

### Test minimal prerequisites

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ValRCS/BSSDH_2023_workshop/blob/main/notebooks/test_python_setup.ipynb)

You can run this same notebook in your own local environment if you have Python and git installed.


### Practice Your Python Notebook skills (includes NumPy and Pandas library refresher)

Here is a Python syntax refresher, optional if you have good working knowledge of Python.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ValRCS/BSSDH_2023_workshop/blob/main/notebooks/python_colab.ipynb)

Again you can run this notebook in your own local environment if you have Python and git installed.

### Local install instructions

Detailed local install instructions can be found in the INSTALL.md file in this repository: [INSTALL.md](https://github.com/ValRCS/BSSDH_2023_workshop/blob/main/INSTALL.md)


## Introduction to Text Analysis

### What is Text Analysis?

Text analysis is the process of transforming unstructured text documents into structured data for further analysis. It is a form of data mining that is used to identify patterns and establish relationships between words in a text-based dataset. Text analysis is also known as text mining or text analytics.

### Why is Text Analysis Important in research and academia?

Text analysis is important because it is a valuable method for extracting meaning from text-based data. It is used to quantify qualitative data, which is particularly helpful for research that involves collecting large amounts of unstructured data, such as customer feedback, open-ended survey responses, and social media comments.

Text analysis is part of discourse analysis, which is the study of language use in texts and contexts. It is used to analyze the structure of written texts and is often used in the humanities and social sciences to analyze texts such as interview transcripts, news articles, and speeches.

### Modeling Text Data after Preprocessing

Text data is often modeled as numerical data after preprocessing. This is because most machine learning algorithms require numerical data as input. The most common way to model text data is to use a bag-of-words model, which represents each document as a vector of word counts. This is a simple and effective way to represent text data, but it does not capture the order of words in a document.
We will be looking at some other ways to model text data in this workshop.
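
To make the bag-of-words idea concrete, here is a minimal sketch using only the standard library. The two example sentences are made up for illustration, and this is not necessarily how the workshop notebooks build their vectors.

```python
from collections import Counter

# Two made-up "documents", used only for illustration
documents = [
    "the prisoner stole the watch",
    "the watch was found on the prisoner",
]

# Count how often each word occurs in each document
counts = [Counter(doc.split()) for doc in documents]

# The vocabulary is the sorted set of all words seen in any document
vocabulary = sorted(set(word for doc_counts in counts for word in doc_counts))

# Each document becomes a vector with one word count per vocabulary entry
vectors = [[doc_counts[word] for word in vocabulary] for doc_counts in counts]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Note that "the prisoner stole the watch" and "the watch stole the prisoner" would produce exactly the same vector, which is what losing word order means in practice.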

## Pipeline - Plan of Attack

### 1. Data Collection

We will want to obtain data from some source. This could be a website, a database, or a file. We will be using a file for this workshop.
However not every file is ready to be analyzed. We will need to clean the data and prepare it for analysis.

Our first goal will be to read the data into a pandas dataframe. We will be using the `pandas` library for this. Pandas is a Python library that is used for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series. It also provides powerful data structures that are designed to make working with structured data fast, easy, and expressive.

The process of importing data into a tabular format can be very easy or it can take quite some effort. We will work with medium-sized datasets for this workshop. However, if you are working with truly large datasets, you may need a distributed computing framework such as Apache Spark to import the data. This places heavy demands on your hardware and is not recommended for beginners.
Also, on small datasets and few machines, Apache Spark is typically quite a bit slower than regular pandas-based workflows.

### 2. Data Preprocessing

We will want to clean the data and prepare it for analysis. This includes removing punctuation, numbers, and other non-text characters. We will also want to remove stopwords, which are common words that do not add much meaning to a sentence, such as "the", "and", and "a". We will also want to remove words that appear in too many or too few documents in the dataset. This is known as filtering by document frequency.
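
The cleaning steps described here can be sketched with the standard library alone. The sample sentences, the tiny stopword list and the `min_df` threshold below are all assumptions chosen for illustration. Real projects usually take stopword lists from a library.

```python
import re
from collections import Counter

# Made-up sample documents and a tiny hand-written stopword list (assumptions)
documents = [
    "The prisoner stole 2 watches, and a coat!",
    "The prisoner was seen near the shop.",
    "A watch and the coat were found.",
]
stopwords = {"the", "and", "a", "was", "were"}

def clean(text):
    """Lowercase, replace everything except a-z and whitespace, remove stopwords."""
    # note: this ASCII-only pattern would also strip letters such as ā or š
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [word for word in text.split() if word not in stopwords]

cleaned = [clean(doc) for doc in documents]

# Document frequency: in how many documents does each word appear?
document_frequency = Counter(word for doc in cleaned for word in set(doc))

# Keep only words that appear in at least min_df documents (hypothetical threshold)
min_df = 2
kept = [[word for word in doc if document_frequency[word] >= min_df] for doc in cleaned]
print(kept)
```

With only three documents the threshold removes most words. On a real corpus, both a lower and an upper bound on document frequency are typically tuned by inspecting what gets removed.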

### 3. Data Processing

We will want to process the data in order to extract features from it. This includes tokenization, which is the process of splitting a text document into individual words (tokens). We will also want to stem or lemmatize the words in the dataset. Stemming reduces a word to its stem by stripping affixes, so the result is not always a real word. Lemmatization reduces a word to its dictionary form (lemma). We may also want to keep only nouns, verbs, adjectives, and adverbs. This relies on part-of-speech tagging, which labels each word with its grammatical category.
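
The difference between tokenization, stemming and lemmatization is easier to see on a small example. The sketch below is deliberately a toy: `toy_stem` is a crude suffix stripper (not the Porter algorithm) and `toy_lemmas` is a hand-made, hypothetical dictionary. Real work would use an NLP library for both steps.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (a simple regex tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def toy_stem(word):
    """Crude suffix stripping, for illustration only (not the Porter algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Tiny hand-made lemma dictionary (hypothetical); real lemmatizers use large lexicons
toy_lemmas = {"stole": "steal", "watches": "watch", "was": "be", "running": "run"}

tokens = tokenize("The prisoner stole two watches and was running away.")
stems = [toy_stem(token) for token in tokens]
lemmas = [toy_lemmas.get(token, token) for token in tokens]

for token, stem, lemma in zip(tokens, stems, lemmas):
    print(f"{token:10} stem={stem:10} lemma={lemma}")
```

The toy stemmer turns "running" into "runn", which is not a word, while the dictionary lookup gives "run". It also leaves "stole" untouched, whereas a lemmatizer can map it to "steal".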

### 4. Creating Embeddings

We will want to create embeddings from the data in order to represent the words in the dataset as vectors. This is known as word embedding. We will be using the `gensim` library for this. Gensim is a Python library for topic modeling, document indexing, and similarity retrieval with large corpora. It includes a fast implementation of the Word2Vec algorithm for learning vector representations of words.
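
Once words are represented as vectors, closeness in meaning is commonly measured with cosine similarity. The sketch below shows only that comparison step. The three-dimensional vectors are made up by hand, whereas real embeddings are learned from a corpus and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors (assumption); not the output of any trained model
toy_vectors = {
    "judge": [0.9, 0.1, 0.2],
    "court": [0.8, 0.2, 0.3],
    "bread": [0.1, 0.9, 0.1],
}

sim_judge_court = cosine_similarity(toy_vectors["judge"], toy_vectors["court"])
sim_judge_bread = cosine_similarity(toy_vectors["judge"], toy_vectors["bread"])
print("judge vs court:", round(sim_judge_court, 3))
print("judge vs bread:", round(sim_judge_bread, 3))
```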

### 5. Modeling

We will want to model the data in order to extract insights from it. This includes topic modeling, which is the process of discovering topics in a text-based dataset. We might also want to perform sentiment analysis, which is the process of determining whether a text document is positive, negative, or neutral. We will be using the `scikit-learn` library for this. Scikit-learn is a Python library for machine learning. It provides a range of supervised and unsupervised learning algorithms for classification, regression, and clustering.

### 6. Visualization and Interpretation

We will want to visualize the data in order to communicate our findings to others. This includes creating interactive visualizations using the Plotly library. Plotly is a Python library for creating interactive visualizations. It provides a range of tools for creating charts, maps, and graphs. We will also want to interpret the data in order to gain insights from it. This includes using the results of our analysis to make decisions about the data.

For extra visualization of LDA models we will be using the pyLDAvis library, time permitting. It requires some extra setup and is not included in the Google Colab environment.

### 7. Your own work in class  - required for the certificate/gaining credits

In the last part of the workshop you will be tasked with obtaining your own data and performing the same steps as in the workshop. You will be able to ask questions and get help from the instructor. This is required for the certificate/gaining credits.

## 1. Data Collection

### General Considerations and sources

There are various sources of text data. Some of the most common sources are:

* Web pages - see workshop on web scraping
* Social media - generally need API access -- see issues with Twitter API
* Books - see Project Gutenberg
* News articles - see News API for organized access
* Research papers - see ArXiv API
* Wikipedia - see Wikipedia API
* Blogs - see Blogger API or web scraping
* Emails - see Enron Email Dataset as one example
* Speeches - see American Presidency Project
* Curated datasets - see Kaggle - note possible licensing issues
* Dataset search - see [Google Dataset Search](https://datasetsearch.research.google.com/) - beware Killed By Google syndrome
* US open data - see [data.gov](https://www.data.gov/)
* European open data - see [European Data Portal](https://www.europeandataportal.eu/en)
* Latvian open data - see [Latvian Open Data Portal](https://data.gov.lv/)
* Your own data - see your own data you collected or have access to

### Clarin - Reputable source of data

Clarin is a European research infrastructure for language resources and technology. It is a networked federation of centres pooling their human and technical resources to create an infrastructure. The infrastructure consists of an interconnected network of repositories, service centres and knowledge centres, offering language resources (datasets) and natural language processing (NLP) tools and expertise. The infrastructure offers widespread access to language resources and advanced tools to support researchers in the humanities and social sciences, and beyond.

[Clarin](https://www.clarin.eu/resource-families/historical-corpora)



![Old Bailey](https://www.clarin.eu/sites/default/files/styles/large/public/media/showcases/Old_Bailey_Microcosm_edited_0.jpg?itok%253Dik2HYoSp)

### Old Bailey Corpus

The Old Bailey Corpus is a collection of approximately 130 million words of text from the published proceedings of the Old Bailey, London's central criminal court, covering the years 1674 to 1913. The corpus is available for download from the Old Bailey Online website and other sources such as Clarin.

Official website: [Old Bailey Online](https://www.oldbaileyonline.org/)

Link at Clarin: [Old Bailey Corpus](https://www.clarin.eu/showcase/old-bailey-corpus-20-1720-1913)



### Downloading the Data

* Size: 134 Million Words
* Annotation: detailed sociobiographical, pragmatic and textual annotation
* Licence: CC-BY-NC-SA 4.0

Note: the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License allows you to share and adapt the work for non-commercial purposes, as long as you credit the original author, indicate if changes were made, and distribute any adaptations under the same license.

More on Creative Commons Licenses: [Creative Commons](https://creativecommons.org/licenses/)

The full corpus is around 200MB. We will be using a selection of cases from each decade covered by the full corpus: around 17MB of data uncompressed, or around 3MB compressed.


### Some considerations on downloading data

* Data is available in various formats - we will be using XML format - this is a text based format that is supposed to be human readable, and computer parsable
* Data is available in various sizes - we will be using a subset of the data - around 17MB uncompressed
* We will be downloading the data from github repository - this is a very convenient way to share data and code as long as the data is not too large - github has a 100MB limit on file size without extra setup
* If you are using a local setup, your data will already be in the data folder - you could skip the download step, but try to follow the rest of the steps
* Google Colab does not provide automatic access to your local files - you will need to upload the data to Google Colab environment - this is a bit cumbersome but doable
* We will be using a python library called `requests` to download the data - this is a very convenient way to download data from the internet

## Some Good Practices in creating notebooks

* Use markdown cells to explain what you are doing and why
* See Markdown cheatsheet at Github : [Markdown Cheatsheet](https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax)
* Markdown headings can serve as table of contents for your notebook - use only one level 1 heading
* Use comments in code cells to explain what you are doing and why
* Use descriptive variable names - avoid single letter variable names
* Use descriptive function names - avoid single letter function names
* Use descriptive file names - avoid single letter file names
* Use descriptive folder names - avoid single letter folder names - you get the idea by now :)
* Use version control - git is the most popular version control system - we will be using it in this workshop
* Note: git is not ideal for notebooks (JSON format) - it is better suited for pure code files - but it is better than nothing
* Your notebook should be able to run all the way through without errors - this is not always possible but should be the goal
* In other words your notebook should be able to be exported and run as standalone python file in ideal case - not always possible but should be the goal

Finally, remember creating notebooks is a fusion of logical and creative process and you might not always know where you will end up. :)

In [1]:
## Finally we can get started with some real coding!

## First we will want to import all the libraries needed for this particular notebook

## Note: When developing your notebook it might be okay to import libraries as you need them, but when you are done with your notebook you should move all your imports to the top of the notebook.

## First we want to import standard libraries that come with python. These are libraries that are installed with python and you don't have to install them yourself.

# we might need sys to get the version of python we are using and possibly other things
import sys
# print python version
print("Python version:", sys.version) # not really needed but good to know in case something goes wrong

# we will want Path module from pathlib to work with paths
from pathlib import Path # notice how we only import Path from pathlib and not the whole library

# we will be working with zip files so we will need zipfile
import zipfile

# we will be working with xml files so we will need xml
import xml.etree.ElementTree as ET # so we imported xml but we only imported the ElementTree module from xml
# we also renamed ElementTree to ET so we don't have to type out ElementTree every time we want to use it

# we might need some regular expression magic
import re

# we might want to deal with some json data
import json # also standard library

# we might want to deal with some dates
from datetime import datetime # also standard library
# note how we imported datetime from datetime: the module and the class inside it share the same name
# a bit of unfortunate naming but we can deal with it

## External Libraries
## These are libraries that are not installed with python and you have to install them yourself. You can install them with pip or conda. We will use pip for this class.
## Note: You can also import libraries with an alias. This is useful when you want to use a library but don't want to type out the whole name every time you use it. For example, we can import pandas as pd. This will allow us to use pandas but we only have to type pd when we want to use it.

# we will import tqdm for progress bars
from tqdm import tqdm # tqdm is a library for progress bars
# strictly speaking we don't need to import tqdm but it makes our lives easier and it's nice to have progress bars

# we will want to deal with web requests so requests provides a nice interface for that
import requests # requests is a library for making web requests GET, POST, PUT, DELETE, etc.
# it supports many features like authentication, cookies, sessions, etc.
# more on it in your web scraping workshop

import pandas as pd # pandas is a library for data analysis, commonly used in data science, machine learning, and artificial intelligence
# notice how we renamed pandas to pd so we don't have to type out pandas every time we want to use it, this is very common
# NOTE: Use common conventions when importing libraries. For example, pandas is commonly imported as pd, 
# numpy is commonly imported as np, matplotlib.pyplot is commonly imported as plt, etc.

# for external libraries they will often provide a way to check the version of the library
print("Pandas version:", pd.__version__) # this is the version of pandas we are using
# sometimes some functions might not work in older versions of the library so it's good to know what version you are using


Python version: 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]
Pandas version: 2.0.3


In [2]:
# our first step is to find out url where our data is located
# we can find this out by looking at the source code of the webpage
# in this case if you check Github repository of this project you will find that Github will 
# provide an option to download data raw
# you could do this manually by left clicking the raw button in 
# https://github.com/ValRCS/BSSDH_2023_workshop/blob/main/data/old_bailey_sample_1720_1913.zip
# alternatively you can right click raw button and click copy link address
url = "https://github.com/ValRCS/BSSDH_2023_workshop/raw/main/data/old_bailey_sample_1720_1913.zip"
# note raw part of the url, this is important because it will allow us to download the data directly
# in general there would be some way to figure out the url but it's not always easy
# let's print our url to make sure it's correct
print("URL:", url)


URL: https://github.com/ValRCS/BSSDH_2023_workshop/raw/main/data/old_bailey_sample_1720_1913.zip


In [3]:
# now we want to download the data and copy into temp folder
# first we will want to create a temp folder if it does not exist
# to keep notebooks on the same path we will use Path.cwd() to get the current working directory
# print current working directory
print("Current working directory:", Path.cwd())
# then we will use Path.mkdir() to create a folder
# we will use exist_ok=True to make sure we don't get an error if the folder already exists
Path(Path.cwd() / "temp").mkdir(exist_ok=True)
# check that we created the folder
print("Temp folder exists:", Path(Path.cwd() / "temp").exists())

Current working directory: c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks
Temp folder exists: True


In [4]:
# finally we are ready to download the data
# we will use requests to download the data
# we will use requests.get() to make a GET request to the url
# we will use stream=True to stream the data instead of downloading it all at once
# we will use allow_redirects=True to allow redirects
# we will use timeout=10 to timeout after 10 seconds

# we will use with statement to open a file
# we will use open() to open a file
# we will use "wb" to open a file in binary mode
# we will use Path() to create a path to our file
# we will use Path.cwd() to get the current working directory   
# we will use / to join paths
# we will use "temp" to create a path to our file

# we will use tqdm() to create a progress bar

# we will use .iter_content() to iterate over the content of the response
# we will use chunk_size=1024 to iterate over the content in chunks of 1024 bytes

# we will use .write() to write the content to the file

# the with statement closes the file for us, so no explicit .close() is needed

# we will use .raise_for_status() to raise an exception if the server reports an error (status 4xx or 5xx)

# we will get file name from our url it is the last part of the url after the last /
# we will use .split() to split the url by /
file_name = url.split("/")[-1]
print("Will save file as:", file_name)

# now let's download the data - we are chunking data to support large files

with requests.get(url, stream=True, allow_redirects=True, timeout=10) as response:
    response.raise_for_status() # stop here on an HTTP error, before anything is written to disk
    with open(Path.cwd() / "temp" / file_name, "wb") as file: # note wb - write binary
        # stream=True keeps memory use low: the body is fetched in chunks instead of all at once
        # chunk_size=1024 means we read 1024 bytes at a time
        # you can adjust chunk_size to your liking
        with tqdm(total=int(response.headers.get("content-length", 0)), unit="B", unit_scale=True, desc=file_name) as progress:
            for chunk in response.iter_content(chunk_size=1024):
                file.write(chunk)
                progress.update(len(chunk))
    # file.close() is not required because the with statement closes the file

Will save file as: old_bailey_sample_1720_1913.zip


old_bailey_sample_1720_1913.zip: 100%|██████████| 2.90M/2.90M [00:00<00:00, 7.87MB/s]
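
Taking the last part of the URL with `url.split("/")[-1]` works for the workshop URL, but it would keep any query string that follows the file name. A slightly more defensive variant, sketched below with the standard library, parses the URL first. The helper name `file_name_from_url` and the `example.org` URL are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def file_name_from_url(url):
    """Return the last path segment of a URL, ignoring any query string or fragment."""
    return PurePosixPath(urlparse(url).path).name

# The workshop URL has no query string, so both approaches agree
workshop_url = "https://github.com/ValRCS/BSSDH_2023_workshop/raw/main/data/old_bailey_sample_1720_1913.zip"
print(file_name_from_url(workshop_url))

# A hypothetical URL with a query string, where split("/")[-1] would keep "?download=1"
example_url = "https://example.org/files/corpus.zip?download=1"
print(file_name_from_url(example_url))
```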


In [5]:
# now those on local machine can check that the data was downloaded to temp folder and is identical to same file in data folder
# you can also check that the file size is the same
# print file size
print("File size:", Path(Path.cwd() / "temp" / file_name).stat().st_size)

File size: 2900720


In [6]:
# print file size of original file - will not work in Colab!
print("Original file size:", Path(Path.cwd().parent / "data" / file_name).stat().st_size)
## same file size is generally a good sign that the files are identical - but not always

Original file size: 2900720


In [7]:
# for more hardcore people you can use checksums to check that the files are identical
# you can use md5, sha1, sha256, etc.
# you can use hashlib to calculate checksums
import hashlib
# we will use md5
md5 = hashlib.md5()
# we will use .read_bytes() to read the file as bytes
# we will use .update() to update the md5 hash
md5.update(Path(Path.cwd() / "temp" / file_name).read_bytes())
# we will use .hexdigest() to get the md5 hash as a string
print("MD5:", md5.hexdigest())

MD5: 66c094035ceee02b306d110adf7cf658


In [8]:
# advantage of above approach is that you can check that the file is identical to the original file without having two copies of the file
# you can also check that the file is identical to the original file without having to download the original file
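
Reading the whole file with `.read_bytes()` is fine for a 3MB archive. For much larger files, a common alternative is to feed the hash in chunks. The sketch below uses SHA-256 instead of MD5 and a hypothetical helper name `file_sha256`. It demonstrates the idea on two small throwaway files rather than on the workshop data.

```python
import hashlib
import tempfile
from pathlib import Path

def file_sha256(path, chunk_size=65536):
    """Hash a file in chunks so that large files never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as file:
        # iter() with a sentinel keeps calling file.read() until it returns b""
        for chunk in iter(lambda: file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demonstration on two small throwaway files with identical content
with tempfile.TemporaryDirectory() as folder:
    first = Path(folder) / "first.bin"
    second = Path(folder) / "second.bin"
    first.write_bytes(b"Old Bailey")
    second.write_bytes(b"Old Bailey")
    same_content = file_sha256(first) == file_sha256(second)

print("Files identical:", same_content)
```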

In [9]:
# So now we have downloaded the data and we can start working with it
# first we will want to unzip the data
# the simplest would be to use zipfile and just extract all of the files in original zip file keeping folder structure

# let's do that for now

# first we will want to create a path to our zip file
zip_path = Path(Path.cwd() / "temp" / file_name)
print("We are going to unzip", zip_path) # the absolute path will be different on each machine!

# unzip folder will be same temp folder
unzip_folder = Path(Path.cwd() / "temp")
print("We are going to unzip to", unzip_folder) # the absolute path will be different on each machine!

We are going to unzip c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\old_bailey_sample_1720_1913.zip
We are going to unzip to c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp


In [10]:
# Let's unzip our zip file
# we will use zipfile.ZipFile() to create a ZipFile object

# we will use .extractall() to extract all files from the zip file to the unzip folder

with zipfile.ZipFile(zip_path) as zip_file:
    zip_file.extractall(unzip_folder) # extract all files to unzip folder keeping folder structure

# there are other recipes for working with truly large zip files where you might not want to unzip the whole file at once
# but for now we will keep it simple
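
One such recipe, sketched here under the assumption that you only need some of the members, is to read files straight from the archive without extracting anything to disk. The example builds a tiny in-memory zip with made-up file names so that it runs on its own. With the workshop data you would pass `zip_path` to `zipfile.ZipFile()` instead.

```python
import io
import zipfile

# Build a tiny in-memory zip so the example is self-contained (made-up names and content)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("obc_texts/sample-1.xml", "<doc>first</doc>")
    archive.writestr("obc_texts/sample-2.xml", "<doc>second</doc>")
    archive.writestr("README.html", "<html></html>")

# Read members directly from the archive without extracting anything to disk
contents = {}
with zipfile.ZipFile(buffer) as archive:
    xml_names = [name for name in archive.namelist() if name.endswith(".xml")]
    for name in xml_names:
        with archive.open(name) as member:  # file-like object that yields bytes
            contents[name] = member.read().decode("utf-8")

print(xml_names)
print(contents["obc_texts/sample-1.xml"])
```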

In [11]:
# we can see that there was some file structure in the zip file
# we could use our file explorer or we could list the directory in temp
# we will use Path.iterdir() to iterate over the files in the temp folder and print them
for file in Path(unzip_folder).iterdir():
    # if file is a directory we will print it as a directory
    if file.is_dir():
        print("Directory:", file)
    # if file is a file we will print it as a file
    elif file.is_file():
        print("File:", file)


File: c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\LICENSE.txt
Directory: c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts
File: c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\old_bailey_sample_1720_1913.zip
File: c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\README.html


In [12]:
# we can use Path.rglob() to recursively iterate over everything in the temp folder
# rglob("*") is the same as glob("**/*") - the ** part means "this folder and all subfolders"
# note that the result includes directories as well as files

all_files = [file for file in Path(unzip_folder).rglob("*")]
print("Number of files:", len(all_files))
# print first 10 files
print("First 10 files:", *all_files[:10], sep="\n") # the * unpacks the first 10 paths into separate arguments, printed one per line


Number of files: 24
First 10 files:
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\LICENSE.txt
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\old_bailey_sample_1720_1913.zip
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\README.html
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17200427.xml
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17310428.xml
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17420428.xml
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17540116.xml
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17620224.xml
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17731020.xml
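
The listing above contains the `obc_texts` directory itself, because `rglob("*")` returns directories as well as files. If you want files only, one option is to filter with `.is_file()`. The sketch below runs on a throwaway folder with made-up file names so that it does not depend on the workshop data.

```python
import tempfile
from pathlib import Path

# Throwaway folder that mimics the layout above (made-up file names)
with tempfile.TemporaryDirectory() as folder:
    base = Path(folder)
    (base / "obc_texts").mkdir()
    (base / "LICENSE.txt").write_text("license")
    (base / "obc_texts" / "a.xml").write_text("<doc/>")
    (base / "obc_texts" / "b.xml").write_text("<doc/>")

    everything = list(base.rglob("*"))  # files and directories
    only_files = [path for path in everything if path.is_file()]
    n_everything, n_files = len(everything), len(only_files)

print("All entries:", n_everything)
print("Files only:", n_files)
```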


In [13]:
# now we can see that we have a lot of files
# how about only getting xml files because those are in fact what we want
# let's use list comprehension to get only xml files
xml_files = [file for file in Path(unzip_folder).rglob("*.xml")]
print("Number of xml files:", len(xml_files))
# print first 10 xml files
print("First 10 xml files:", *xml_files[:10], sep="\n") # the * unpacks the first 10 paths into separate arguments, printed one per line

Number of xml files: 20
First 10 xml files:
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17200427.xml
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17310428.xml
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17420428.xml
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17540116.xml
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17620224.xml
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17731020.xml
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17841210.xml
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17961026.xml
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-18020217.xml
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-18140112.xml


In [14]:
## we could provide various ordering of files, let's sort by file name since that is the simplest
xml_files.sort(key=lambda x: x.name)
print("First 5 xml files sorted by name:", *xml_files[:5], sep="\n") # the * unpacks the first 5 paths, printed one per line

First 5 xml files sorted by name:
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17200427.xml
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17310428.xml
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17420428.xml
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17540116.xml
c:\Users\Valdis\Github\BSSDH_2023_workshop\notebooks\temp\obc_texts\OBC2-17620224.xml
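
Sorting by name happens to give chronological order here because the file names appear to end in a date. As a sketch, and assuming the eight digits really mean YYYYMMDD (this is inferred from the file names only, not from the corpus documentation), the date can be parsed and used as an explicit sort key. The helper name `session_date` is hypothetical.

```python
from datetime import datetime
from pathlib import Path

def session_date(path):
    """Parse the trailing 8 digits of a name like OBC2-17200427.xml as a date.

    Assumes the digits mean YYYYMMDD; this is inferred from the file names only.
    """
    digits = Path(path).stem.split("-")[-1]
    return datetime.strptime(digits, "%Y%m%d").date()

names = ["OBC2-17310428.xml", "OBC2-17200427.xml"]
by_date = sorted(names, key=session_date)

print(session_date("OBC2-17200427.xml"))
print(by_date)
```

The same function works on the `Path` objects in `xml_files`, for example `xml_files.sort(key=session_date)`.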


In [15]:
# looks like the first file by name is indeed the oldest file from 1720
# let's load it in and see what we have
# we will use xml.etree.ElementTree.parse() to parse the xml file
# we will use .getroot() to get the root element of the xml file
# we will use .tag to get the tag of the root element
# we will use .attrib to get the attributes of the root element

# let's get started

tree = ET.parse(xml_files[0]) # parse the xml file
root = tree.getroot() # get the root element
print("Root tag:", root.tag) # print the tag of the root element

Root tag: TEI.2
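
Before deciding what to extract, it can help to get an overview of which tags a document contains. The sketch below counts tags with `.iter()` on a tiny made-up document that only imitates the structure. With the real file you would call the same code on `root`.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A tiny made-up document that only imitates the structure (not real corpus content)
sample_xml = """<TEI.2>
  <text>
    <body>
      <div1 type="trialAccount" id="t1"><p>First <rs>account</rs>.</p></div1>
      <div1 type="frontMatter" id="f1"><p>Title page.</p></div1>
    </body>
  </text>
</TEI.2>"""

sample_root = ET.fromstring(sample_xml)

# Count every tag in the tree to get an overview of the document's structure
tag_counts = Counter(element.tag for element in sample_root.iter())
print(tag_counts.most_common())

# Direct children of the root and their attributes
for child in sample_root:
    print(child.tag, child.attrib)
```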


## TEI - Text Encoding Initiative

What is TEI?

TEI - Text Encoding Initiative is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics. Since 1994, the TEI Guidelines have been widely used by libraries, museums, publishers, and individual scholars to present texts for online research, teaching, and preservation. In addition to the Guidelines themselves, the Consortium provides a variety of resources and training events for learning TEI, information on projects using the TEI, a bibliography of TEI-related publications, and software developed for or adapted to the TEI.

### TEI and XML

![TEI](https://tei-c.org/release/doc/tei-p5-doc/en/html/Images/banner.jpg)

[TEI](https://tei-c.org/) is a standard for encoding text in XML. 

XML is a markup language that is used to describe data. XML is extremely flexible which is both a strength and a weakness. It is a strength because it allows you to create your own tags and attributes. It is a weakness because it can be difficult to understand and maintain.

### XML Tutorial

XML resembles HTML in some ways, as they share a common ancestor (SGML).

A basic tutorial on XML can be found here: [XML Tutorial](https://www.w3schools.com/xml/) - W3Schools is not officially affiliated with W3C. It used to be less reputable but has improved over the years.

For additional complexity you can use namespaces to create your own tags and attributes. This is a way to avoid name collisions with other XML tags and attributes. This can add additional complexity to your code.
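
Here is a minimal sketch of what that extra complexity looks like with `xml.etree.ElementTree`. The snippet is made up, and it uses the TEI P5 namespace URI. The root tag printed for our Old Bailey file above shows no namespace, so this is background rather than something this dataset needs.

```python
import xml.etree.ElementTree as ET

# Made-up snippet that declares the TEI P5 namespace as the default namespace
namespaced_xml = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>Hello</p></body></text>
</TEI>"""

ns_root = ET.fromstring(namespaced_xml)
print(ns_root.tag)  # the namespace URI becomes part of the tag name

# A plain search finds nothing, because the real tag name includes the namespace
without_namespace = ns_root.findall(".//p")

# Map a prefix of our choice to the namespace URI and use it in the search
namespaces = {"tei": "http://www.tei-c.org/ns/1.0"}
with_namespace = ns_root.findall(".//tei:p", namespaces)

print(len(without_namespace), len(with_namespace))
```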

TEI provides its own standard XML tags:
See TEI documentation: [TEI](https://tei-c.org/release/doc/tei-p5-doc/en/html/index.html)

A good starting point for a sample document might be here: [Default Document](https://tei-c.org/release/doc/tei-p5-doc/en/html/DS.html)

In [16]:
# We can use any standard text editor to open the xml file and see what it looks like
# I will be using this very same Visual Studio Code

# upon inspection looks like I will be interested in div1 tags of type trialAccount
# our next task will be to extract all the trialAccount div1 tags and save plain text of those div1 tags
# we will also want to have some metadata about the trialAccount div1 tags such as year, trial number, and trial date if available


In [17]:
# let's get a list of all div1 tags of type trialAccount
# we will use .findall() with an XPath expression to find the matching div1 tags

# .findall() already returns a list, so no list comprehension is needed
trial_accounts = root.findall(".//div1[@type='trialAccount']")
# .//div1[@type='trialAccount'] means: any div1 tag at any depth whose type attribute is trialAccount
# this is using xpath syntax - note that ElementTree supports only a limited subset of it
# more about xpath here: https://www.w3schools.com/xml/xpath_intro.asp
print("Number of trialAccount div1 tags:", len(trial_accounts))

Number of trialAccount div1 tags: 75


In [18]:
# if xpath is not your thing you can use .iter() to iterate over all div1 tags and check if they have a type attribute and if its value is trialAccount
# we will use .iter() to iterate over all div1 tags
# we will use .attrib to get the attributes of the div1 tag

# we will use a simple loop
trial_accounts_too = []
for div1 in root.iter("div1"): # iterate over all div1 tags
    if "type" in div1.attrib: # check if div1 tag has type attribute
        if div1.attrib["type"] == "trialAccount": # check if div1 tag has type attribute with value trialAccount
            trial_accounts_too.append(div1) # if it does append it to trial_accounts_too
print("Number of trialAccount div1 tags:", len(trial_accounts_too))


Number of trialAccount div1 tags: 75
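
With the trial accounts selected, one possible way to get their plain text is `.itertext()`, which walks an element and all of its descendants. The `.text` attribute alone would stop at the first child tag. The sketch below uses a made-up document, and the `id` attribute is an assumption for illustration. Check the real files for the attributes that are actually available.

```python
import xml.etree.ElementTree as ET

# Made-up document imitating the structure (not real corpus content)
sample_xml = """<TEI.2><text><body>
<div1 type="trialAccount" id="t1"><p>John Doe was
   indicted for <rs>theft</rs>.</p></div1>
<div1 type="frontMatter" id="f1"><p>Title page.</p></div1>
</body></text></TEI.2>"""

sample_root = ET.fromstring(sample_xml)

records = []
for div1 in sample_root.findall(".//div1[@type='trialAccount']"):
    # itertext() yields the text of the element and of all its descendants
    raw_text = "".join(div1.itertext())
    # collapse line breaks and repeated spaces into single spaces
    plain_text = " ".join(raw_text.split())
    records.append({"id": div1.get("id"), "text": plain_text})

print(records)
```

A list of dictionaries like `records` can later be passed to `pd.DataFrame()` to get one row per trial account.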
