# Coding for the Digital Humanities Workshop

- [Website link](https://dh-coding-docs.netlify.app/)
- [Literature Review CSV](DH-lit-review.csv)

## Installing and Importing

In [4]:
# install the packages
!pip3 install nltk
!pip3 install pandas
!pip3 install langdetect
!pip3 install iso_language_codes



In [1]:
# import everything and define dataframe variables
import re

import pandas as pd

import nltk
from nltk import FreqDist
from nltk.text import Text
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger_eng')

from collections import Counter

from langdetect import detect
from iso_language_codes import language_name

from string import punctuation
punctuation = list(punctuation)

ps = PorterStemmer()

stop_words = stopwords.words('english')
stop_words.extend(["n't", "'s", 'would', '—', "“", "”", '"'])

path = "DH-lit-review.csv"
full_df = pd.read_csv(path)

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/aidanpower/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aidanpower/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/aidanpower/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


## Solution #1: list of abstracts

My original solution, which works with the list of abstracts taken directly from the dataframe.

In [2]:
# get the abstracts from the dataframe, remove NaN and non-English abstracts
full_list = list(full_df["Description"])
valid_list = list(filter(lambda x: str(x) != 'nan', full_list))
en_list = list(filter(lambda x: detect(x) == 'en', valid_list))
print("Full list of abstracts: " + str(len(full_list)) + "\nFiltered without NaN: " + str(len(valid_list)) + "\nFiltered for English only: " + str(len(en_list)))

abstract_list = list(map(lambda x: x.lower(), en_list))

# merge the abstracts, tokenize the sentences and words
abstract_string = " ".join(abstract_list)
abstract_sent = sent_tokenize(abstract_string)
abstract_words = word_tokenize(abstract_string)
print("\nNumber of sentences: " + str(len(abstract_sent)) + "\nNumber of words: " + str(len(abstract_words)))

Full list of abstracts: 50
Filtered without NaN: 40
Filtered for English only: 29

Number of sentences: 179
Number of words: 6215
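
The NaN filter above relies on a missing CSV cell being loaded as a float `nan`, which stringifies to `'nan'`. The following is a minimal sketch of that check next to a type-based alternative; the sample list is invented for illustration and is not taken from the CSV.

```python
# Hypothetical sample column: two abstracts and one missing value (NaN is a float).
descriptions = [
    "Digital humanities research.",
    float("nan"),
    "Ricerca umanistica digitale.",
]

# The notebook's check: a missing value stringifies to 'nan'.
by_string = [d for d in descriptions if str(d) != "nan"]

# Alternative check: keep only real strings, which also drops any other non-text value.
by_type = [d for d in descriptions if isinstance(d, str)]

print(len(descriptions), len(by_string), len(by_type))
```

Both checks keep the same entries here. The type-based one would also keep an abstract whose text happens to be the literal string "nan", which the string comparison would drop.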


## Solution #2: working within the dataframe

My updated solution, which works within the dataframe to fill in more information, such as the publication year and language of each abstract.

In [82]:
# adds the publication year to the "Year" column (renamed from "Creation Date") by extracting it from the "Is part of" column
def add_year(df):
    new_df = df.copy()
    for index, row in df.iterrows():
        # str() keeps re.match from raising a TypeError on a missing (NaN) value
        string = str(row["Is part of"])
        # the greedy ".*" means the last 20xx year in the string is the one captured
        match = re.match(r'.*(20[0-9]{2})', string)
        if match is not None:
            new_df.loc[index, "Year"] = int(match.group(1))
    # was having an issue with the year value appearing as a float rather than an int
    new_df['Year'] = new_df['Year'].astype(int)
    return new_df

# do the same process but filter via the dataframe and add the language to the Language Note column
valid_df = full_df[(full_df["Description"].notna())].copy().reset_index(drop=True)

# rename the columns and add the language ISO code and name
valid_df = valid_df.rename(columns={"Language Note": "Language ISO", "Exhibitions Note": "Language Full", "Creation Date": "Year", "Creator": "Authors", "Description": "Abstract"})
valid_df["Language ISO"] = valid_df["Abstract"].apply(lambda x: detect(x))
valid_df["Language Full"] = valid_df["Language ISO"].apply(lambda x: language_name(x))
valid_df = add_year(valid_df)

# filter for English only (dropping empty columns is currently commented out)
en_df = valid_df[(valid_df["Language ISO"] == 'en')].copy().reset_index(drop=True)
dataframe = en_df #.dropna(axis=1, how='all').reset_index(drop=True)
print("Full list of abstracts: " + str(len(full_df.index)) + "\nFiltered without NaN: " + str(len(valid_df.index)) + "\nFiltered for English only: " + str(len(dataframe.index)))

# merge the abstracts, tokenize the sentences and words
abstract_list = list(map(lambda x: x.lower(), dataframe["Abstract"]))

abstract_string = "\n".join(abstract_list)
abstract_sent = sent_tokenize(abstract_string)
abstract_words = word_tokenize(abstract_string)
print("\nNumber of sentences: " + str(len(abstract_sent)) + "\nNumber of words: " + str(len(abstract_words)))

Full list of abstracts: 50
Filtered without NaN: 40
Filtered for English only: 29

Number of sentences: 179
Number of words: 6215
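
The year extraction inside `add_year` can be tried out on its own. The sketch below uses the same regular expression; the helper name `extract_year` and the sample strings are hypothetical, made up for illustration rather than taken from the "Is part of" column.

```python
import re

# Same pattern as add_year: the greedy ".*" means the LAST 20xx in the string is captured.
YEAR_PATTERN = re.compile(r".*(20[0-9]{2})")


def extract_year(text):
    """Return the last 20xx year in text as an int, or None if there is none."""
    if not isinstance(text, str):  # e.g. NaN from an empty CSV cell
        return None
    match = YEAR_PATTERN.match(text)
    return int(match.group(1)) if match else None


# Hypothetical "Is part of" values, invented for illustration.
samples = [
    "Journal of Examples, 2023, Vol.12 (3), p.1-20",
    "Example Review, 2019-2021, Vol.4",
    "Undated Proceedings",
    float("nan"),
]
for s in samples:
    print(repr(s), "->", extract_year(s))
```

Because the pattern only looks for `20` followed by two digits, it would also match four digits inside a longer number, so it is worth spot-checking the extracted years against the source column.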


## Language Analysis

In [40]:
# count the number of entries in each language
lang_count = Counter(valid_df["Language Full"])
lang_df = pd.DataFrame.from_dict(lang_count, orient="index").reset_index().rename(columns={"index": "Language", 0: "Count"})
lang_df.sort_values(by="Count", ascending=False).reset_index(drop=True)

| | Language | Count |
|---|---|---|
| 0 | English | 29 |
| 1 | Italian | 4 |
| 2 | Danish | 3 |
| 3 | Spanish | 2 |
| 4 | Portuguese | 1 |
| 5 | German | 1 |


## Tokenization and Frequency Distribution

In [41]:
# filter the tokenized word list and get the regular and stemmed frequency distributions 
filtered_abstract_words = []
stemmed_abstract_words = []
for word in abstract_words:
    if word not in stop_words and word not in punctuation:
        stem = ps.stem(word)
        stemmed_abstract_words.append(stem)
        filtered_abstract_words.append(word)

abstract_fdist = FreqDist(filtered_abstract_words)
stemmed_fdist = FreqDist(stemmed_abstract_words)

In [42]:
print("Abstract frequency distribution")
for value in abstract_fdist.most_common(20):
    print(str(value[1]) + ": " + str(value[0]))

Abstract frequency distribution
95: humanities
90: digital
56: research
42: dh
26: analysis
25: field
24: topic
24: text
23: study
19: new
19: reading
18: work
18: data
18: texts
16: also
15: studies
15: dhp-lclw
14: paper
13: topics
13: exploration


In [43]:
print("\nStemmed frequency distribution")
for value in stemmed_fdist.most_common(20):
    print(str(value[1]) + ": " + str(value[0]))


Stemmed frequency distribution
101: human
92: digit
59: research
42: dh
42: text
39: studi
37: topic
32: explor
32: use
26: field
26: analysi
21: read
21: visual
19: work
19: new
19: differ
18: practic
18: develop
18: data
16: also
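
The two outputs above show what stemming does to the counts: `text` (24) and `texts` (18) are separate entries in the first list but merge into the single stem `text` (42) in the second. The filtering and counting step itself can be sketched with the standard library alone, since nltk's `FreqDist` is a subclass of `collections.Counter`. The token list and stop list below are hypothetical stand-ins for `abstract_words` and `stop_words`.

```python
from collections import Counter

# Hypothetical tokens and stop list, standing in for abstract_words and stop_words.
tokens = ["the", "digital", "humanities", "and", "the", "digital", "text", ",", "texts", "."]
stops = {"the", "and"}
punct = {",", "."}

# Same filter as the notebook: drop stop words and punctuation tokens.
kept = [t for t in tokens if t not in stops and t not in punct]

# Count the remaining tokens; most_common works as it does on a FreqDist.
counts = Counter(kept)
for word, n in counts.most_common(3):
    print(f"{n}: {word}")
```

Without a stemmer, `text` and `texts` stay as two separate keys in this sketch, just as they do in `abstract_fdist`.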


## Concordance

In [44]:
# concordance
text_list = Text(abstract_words)
text_list.concordance("dhp-lclw", lines=10)

Displaying 10 of 15 matches:
 for mr. lo chia-lun 's writings ( dhp-lclw ) with and without the ata to assi
ness of text exploration using the dhp-lclw with and without the ata varied si
c of the text being explored . the dhp-lclw with the ata was found to be more 
oring historical texts , while the dhp-lclw without the ata was more suitable 
 exploring educational texts . the dhp-lclw with the dhp-lclw was found to be 
onal texts . the dhp-lclw with the dhp-lclw was found to be significantly more
s of perceived usefulness than the dhp-lclw without the ata , indicating that 
for mr. lo chia-lun ’ s writings ( dhp-lclw ) . htat can assist humanities sch
 conducting text exploration using dhp-lclw with htat or dhp-lclw with single-
ration using dhp-lclw with htat or dhp-lclw with single-layer topic analysis t
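
A concordance is a keyword-in-context listing: every occurrence of a word shown with the text around it. The sketch below illustrates the idea in plain Python. It is a simplification, not nltk's implementation: the function name `kwic` is hypothetical, and it uses a window of whole tokens where `Text.concordance` trims each line to a character width.

```python
def kwic(tokens, keyword, width=3):
    """Return each occurrence of keyword with up to `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Hypothetical token list, invented for illustration.
tokens = "the tool was tested with the tool and without the tool".split()
for line in kwic(tokens, "tool", width=2):
    print(line)
```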


## Searching the dataframe

I wrote a quick function that searches the dataframe for a word and returns the articles whose abstracts contain it.

In [83]:
# searches the df for a word in the abstracts and returns the titles, authors, publication years and full abstracts of matching articles
def search(df, word):
    subset = df.filter(["Title", "Authors", "Year", "Abstract"])
    index_list = []
    for index, row in df.iterrows():
        abstract = row["Abstract"]
        # lower-case both sides so the search is case-insensitive
        if str(abstract) != 'nan' and word.lower() in abstract.lower():
            index_list.append(index)

    # iterrows yields index labels, so select the rows by label with .loc
    return subset.loc[index_list]

search(dataframe, "historical")

| | Title | Authors | Year | Abstract |
|---|---|---|---|---|
| 12 | Communicating Uncertainty in Digital Humanitie... | Panagiotidou, Georgia ; Lamqaddam, Houda ; P... | 2023 | Due to their historical nature, humanistic dat... |
| 22 | An associative text analyzer to facilitate eff... | Chen, Chih-Ming ; Chen, Xian-Xu | 2024 | PurposeThis study aims to develop an associati... |
| 27 | Curating China's Cultural Revolution (1966–197... | Ma, Rongqian | 2022 | CR/10 is a digital oral history platform that ... |
| 28 | A hierarchical topic analysis tool to facilita... | Chen, Chih-Ming ; Ho, Szu-Yu ; Chang, Chung | 2023 | PurposeThis study aims to develop a hierarchic... |


In [85]:
# saves the dataframe of English articles, with the publication year added, to a file
path = "DH-lit-review-output.csv"
dataframe.to_csv(path, index=False)