## Introduction to Exploratory Data Analysis and Word Counts

Welcome to our third and final 😳 notebook!!

Our agenda today is the following:

- Review homework assignment and answer questions
- Intro to Exploratory Data Analysis 
- Discuss how we teach humanities data analysis and visualization
- Discuss more complex visualizations (time-permitting)

***

In [261]:
### Let's import our libraries. Would anyone be willing to share their code and discuss their visualizations?

In [262]:
import pandas as pd
import altair as alt
import numpy as np
import nltk

In [263]:
### HOMEWORK REVIEW CODE GOES HERE

***

Let's discuss Zoe's solution! I decided to stick with the Humanist listserv and wanted to build on our initial word counts to see if I could uncover some more interesting patterns about early DH discourses.

In [264]:
humanist_vols = pd.read_csv('web_scraped_humanist_listserv.csv')

In [265]:
humanist_vols['humanities_computing_counts'] = humanist_vols.text.apply(lambda x: x.count('humanities computing'))
humanist_vols['digital_humanities_counts'] = humanist_vols.text.apply(lambda x: x.count('digital humanities'))
humanist_vols.loc[:,'year'] = humanist_vols.dates.str.split('-').str[0]
humanist_vols.loc[:,'year'] = pd.to_datetime(humanist_vols.year, format='%Y', errors='ignore')

In [266]:
humanist_melted = pd.melt(humanist_vols, id_vars=['dates', 'text', 'year'])
humanist_melted = humanist_melted[['dates', 'variable', 'value', 'year']]

humanist_melted.head()

Unnamed: 0,dates,variable,value,year
0,1987-1988,humanities_computing_counts,98,1987
1,1988-1989,humanities_computing_counts,55,1988
2,1989-1990,humanities_computing_counts,107,1989
3,1990-1991,humanities_computing_counts,29,1990
4,1991-1992,humanities_computing_counts,49,1991


In [267]:
# Here's our initial chart from Wednesday
alt.Chart(humanist_melted[humanist_melted.year.str.len() == 4]).mark_line().encode(
    x='year:T',
    y='value',
    color='variable'
)

I honestly struggled to think of something more powerful than this line chart. I initially thought of counting more words but decided to try different chart styles instead.

In [268]:
alt.Chart(humanist_melted[humanist_melted.year.str.len() == 4]).mark_circle(
    opacity=0.8,
    stroke='black',
    strokeWidth=1
).encode(
    alt.X('year:O', axis=alt.Axis(labelAngle=0)),
    alt.Y('variable:N'),
    alt.Size('value:Q',
        scale=alt.Scale(range=[0, 400]),
#         legend=alt.Legend(title='')
    ),
    alt.Color('variable:N', legend=None)
).properties(
    width=450,
    height=200,
    title='Comparison of Digital Humanities to Humanities Computing'
)

This bubble chart is a bit cleaner (thinking of the data-ink ratio), but I think it's less compelling and more difficult to interpret.

In [260]:
chart = alt.Chart(humanist_vols[humanist_vols.year.str.len() == 4][['digital_humanities_counts','humanities_computing_counts', 'year']]).mark_circle(size=100).encode(
    x='digital_humanities_counts',
    y='humanities_computing_counts',
    color=alt.Color('year:O', legend=alt.Legend(columns=3, symbolLimit=0), scale=alt.Scale(scheme='plasma'))
).properties(width=250)
chart1 = alt.Chart(humanist_melted[humanist_melted.year.str.len() == 4]).mark_line().encode(
    x='year:T',
    y='value',
    color='variable'
).properties(width=250)
(chart | chart1).resolve_scale(color='independent')

Combining the original chart with this scatterplot helps show the relationship between the two terms, but I'm not sure it's much better than the line chart alone. The color scheme on the scatterplot is helpful, but it's not the most obvious to interpret either. Showing data with multiple plots is generally a good rule of thumb, but I'm not sure this is really telling us much more 🤔.

I think part of the issue is that I'm not really sure what else might be of interest in this dataset. I could try exploring more terms and seeing their counts...

In [291]:
terms = ['digital humanities', 'digital', 'humanities', 'humanities computing', 'computer']

term_counts = []
for term in terms:
    humanist_terms = humanist_vols.copy()
    humanist_terms[f'{term}_counts'] = humanist_terms.text.apply(lambda x: x.count(term))
    humanist_terms['term'] = term
    humanist_terms = humanist_terms[[f'{term}_counts', 'year', 'term']]

    humanist_pivot = pd.pivot(humanist_terms, index='term', columns='year', values=f'{term}_counts').reset_index().rename_axis(None, axis=1)
    term_counts.append(humanist_pivot)

In [318]:
# Remake our line chart
humanist_concat = pd.concat(term_counts)
humanist_melted = pd.melt(humanist_concat, id_vars=['term'])

alt.Chart(humanist_melted[humanist_melted.variable.str.len() == 4]).mark_line().encode(
    x='variable:T',
    y='value',
    color='term'
)

In [322]:
# Remake our scatterplot
humanist_pivot = pd.pivot(humanist_melted, index='variable', columns='term', values='value').rename_axis(None, axis=1).reset_index() 

alt.Chart(humanist_pivot[humanist_pivot.variable.str.len() == 4]).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative'),
    color=alt.Color('variable:T', legend=alt.Legend(columns=3, symbolLimit=0), scale=alt.Scale(scheme='plasma'))
).properties(
    width=150,
    height=150
).repeat(
    row=terms,
    column=terms
)

So here we are starting to see some interesting patterns. Though we knew that digital humanities and humanities computing did not really correlate, it seems like digital, computer, and humanities computing do. We also see that humanities actually does correlate with both humanities computing and digital. However, only one of these is likely a real signal, given how Python does string matching: `str.count` matches substrings, so every occurrence of 'humanities computing' is also counted as an occurrence of 'humanities'.
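To make the string-matching caveat concrete, here is a minimal sketch using a made-up sentence (not listserv data). The helpers `count_whole_word` and `count_standalone` are hypothetical names introduced only for this illustration, and `count_standalone` assumes each longer phrase contains the shorter term exactly once.

```python
import re

# Made-up sample text for illustration only (not taken from the listserv).
sample = (
    "humanities computing grew out of the humanities; "
    "digital humanities came later, and digitalization is a different word."
)

# str.count matches substrings, so shorter terms are also counted
# inside longer phrases and longer words.
print(sample.count("humanities computing"))  # 1
print(sample.count("humanities"))            # 3: two sit inside longer phrases
print(sample.count("digital"))               # 2: includes "digitalization"


def count_whole_word(text, term):
    """Count occurrences of `term` that start and end on a word boundary."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text))


def count_standalone(text, term, longer_phrases):
    """Whole-word count of `term` minus its occurrences inside longer phrases."""
    nested = sum(count_whole_word(text, phrase) for phrase in longer_phrases)
    return count_whole_word(text, term) - nested


print(count_whole_word(sample, "digital"))  # 1: "digitalization" no longer matches
print(count_standalone(sample, "humanities",
                       ["humanities computing", "digital humanities"]))  # 1
```

Word boundaries alone fix the "digitalization" problem but not the nesting of 'humanities' inside the two-word phrases, which is why the second helper subtracts the phrase counts.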

What else could we do with counting to explore more trends in this dataset?

***

Two very popular approaches are Term Frequency-Inverse Document Frequency (TF-IDF), which you can read more about at https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf, and Named Entity Recognition (NER).

Let's try them out!

In [323]:
# I pre-processed this dataset for NER; you can look at the code in the identify_ner.py script. It is a script rather than a notebook because it is VERY slow!
humanist_vols = pd.read_csv('humanist_vols_ner_tokens.csv')

In [324]:
humanist_vols.loc[:, 'year'] = humanist_vols.dates.str.split('-').str[0]
humanist_vols.loc[:,'datetime'] = pd.to_datetime(humanist_vols.year.astype(str) + '-' + humanist_vols.month.astype(str) + '-01', format='%Y-%m-%d', errors='ignore')

In [325]:
humanist_vols.dtypes

dates                     object
split_tokens              object
month                      int64
text_lowercase            object
ner_terms                 object
cleaned_terms             object
year                      object
datetime          datetime64[ns]
dtype: object

In [326]:
# Let's try TF-IDF. This code is from the PH Tutorial linked above
from sklearn.feature_extraction.text import TfidfVectorizer

all_docs = humanist_vols.cleaned_terms.tolist()

vectorizer = TfidfVectorizer(max_df=.7, min_df=1, stop_words=None, use_idf=True, norm=None)
transformed_documents = vectorizer.fit_transform(all_docs)

transformed_documents_as_array = transformed_documents.toarray()

all_files = humanist_vols.datetime.astype(str).tolist()
tfidf_results = []
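# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = vectorizer.get_feature_names_out()
for counter, doc in enumerate(transformed_documents_as_array):
    # construct a dataframe of (term, score) pairs for this document
    tf_idf_tuples = list(zip(feature_names, doc))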
    one_doc_as_df = pd.DataFrame.from_records(tf_idf_tuples, columns=['term', 'score']).sort_values(by='score', ascending=False).reset_index(drop=True)
    one_doc_as_df['datetime'] = all_files[counter]
    tfidf_results.append(one_doc_as_df)
#     print(one_doc_as_df[0:1])
    

![tfidf_refresh](https://miro.medium.com/max/1838/1*qQgnyPLDIkUmeZKN2_ZWbQ.png)
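As a refresher in code form, here is a rough sketch of the arithmetic behind the scores. It assumes the vectorizer settings used in this notebook (`use_idf=True`, `norm=None`) together with scikit-learn's documented default smoothing, where idf is `ln((1 + n) / (1 + df)) + 1`. The three toy documents are made up, the helper name `tfidf` is hypothetical, and the sketch does not apply the `max_df` cutoff.

```python
import math
from collections import Counter

# Three made-up toy "documents" (not listserv data).
docs = [
    "prolog prolog humanist list",
    "humanist list list oxford",
    "humanist oxford oxford oxford",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Raw term count times smoothed idf, with no length normalization."""
    counts = Counter(tokens)
    return {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }


# Print the highest-scoring term for each toy document.
for tokens in tokenized:
    scores = tfidf(tokens)
    top_term = max(scores, key=scores.get)
    print(top_term, round(scores[top_term], 3))
```

A term that appears in every document ("humanist" here) gets the minimum idf of 1, so its score is just its raw count, while a term concentrated in one document is boosted. Because there is no normalization, long documents with many repeats can produce very large scores.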

In [327]:
tfidf_df = pd.concat(tfidf_results)
tfidf_df.head()

Unnamed: 0,term,score,datetime
0,edt,135.298501,1987-01-01
1,marlene,119.274463,1987-01-01
2,prolog,114.635112,1987-01-01
3,utoronto,93.868831,1987-01-01
4,hirst,80.584459,1987-01-01


In [328]:
tfidf_df = tfidf_df.sort_values(by=['score'], ascending=False)
tfidf_df.head()

Unnamed: 0,term,score,datetime
0,ox,1845.057133,1987-02-01
0,apr,554.127199,1990-12-01
0,jul,524.45113,1990-03-01
0,mar,501.676853,1990-11-01
0,http,483.117041,1999-07-01


In [330]:
# Let's look at the ten highest-scoring (most distinctive) terms across the corpus
alt.Chart(tfidf_df[0:10]).mark_bar().encode(
    x='score:Q',
    y=alt.Y('term:N', sort='-x'),
    color='datetime:N'

)

In [331]:
# We can also split it by email chain
tfidf_grouped = tfidf_df.groupby('datetime').apply(lambda x: x.nlargest(5, 'score')).reset_index(drop=True) 
tfidf_grouped

alt.Chart(tfidf_grouped[0:100]).mark_bar().encode(
    x='score:Q',
    y=alt.Y('term:N', sort='-x'),
    color='datetime:N',
).facet(
    facet='datetime:N',
    columns=3
).resolve_scale(y='independent')


In [332]:
# Let's try out NER
humanist_vols.loc[:,'ner_tokens'] = humanist_vols.apply(lambda row: nltk.word_tokenize(row['ner_terms']), axis=1)

For more info, check out https://spacy.io/usage/linguistic-features#named-entities
![ner](https://miro.medium.com/max/1400/0*zDbB-LV-Dlm_F_PX)

In [333]:
humanist_exploded = humanist_vols.explode('ner_tokens')
humanist_exploded.head()

Unnamed: 0,dates,split_tokens,month,text_lowercase,ner_terms,cleaned_terms,year,datetime,ner_tokens
0,1987-1988,"['from', ':', 'mccarty', '@', 'utorepas', 'sub...",1,from : mccarty @ utorepas subject : date : 12 ...,york ny france hearn holland oregon york us hu...,mccarty utorepas subject date edt x humanist v...,1987,1987-01-01,york
0,1987-1988,"['from', ':', 'mccarty', '@', 'utorepas', 'sub...",1,from : mccarty @ utorepas subject : date : 12 ...,york ny france hearn holland oregon york us hu...,mccarty utorepas subject date edt x humanist v...,1987,1987-01-01,ny
0,1987-1988,"['from', ':', 'mccarty', '@', 'utorepas', 'sub...",1,from : mccarty @ utorepas subject : date : 12 ...,york ny france hearn holland oregon york us hu...,mccarty utorepas subject date edt x humanist v...,1987,1987-01-01,france
0,1987-1988,"['from', ':', 'mccarty', '@', 'utorepas', 'sub...",1,from : mccarty @ utorepas subject : date : 12 ...,york ny france hearn holland oregon york us hu...,mccarty utorepas subject date edt x humanist v...,1987,1987-01-01,hearn
0,1987-1988,"['from', ':', 'mccarty', '@', 'utorepas', 'sub...",1,from : mccarty @ utorepas subject : date : 12 ...,york ny france hearn holland oregon york us hu...,mccarty utorepas subject date edt x humanist v...,1987,1987-01-01,holland


In [334]:
humanist_grouped = humanist_exploded.groupby(['datetime', 'ner_tokens']).size().reset_index(name='frequency')
humanist_grouped.head()

Unnamed: 0,datetime,ner_tokens,frequency
0,1987-01-01,aberdeen,1
1,1987-01-01,arizona,1
2,1987-01-01,france,1
3,1987-01-01,gmt,3
4,1987-01-01,hearn,1


In [138]:
humanist_grouped.sort_values(by=['datetime', 'frequency' ], inplace=True, ascending=False)
humanist_top = humanist_grouped.sort_values(by=['datetime','frequency'],ascending = [False, False]).groupby(['datetime']).head(5).sort_index()
humanist_pivot = humanist_top.pivot(index='ner_tokens', columns='datetime', values='frequency').fillna(0)
humanist_top_terms = humanist_pivot.unstack().reset_index(name='frequency')
humanist_top_terms.sort_values(by=['frequency'], ascending=False, inplace=True)

In [178]:
top_terms = humanist_top_terms[0:70].ner_tokens.unique().tolist()

In [203]:
alt.Chart(humanist_top_terms[humanist_top_terms.ner_tokens.isin(top_terms)][['datetime', 'ner_tokens', 'frequency']]).mark_line().encode(
    x = alt.X('datetime:T', axis=alt.Axis(labelAngle=0)),
    y = alt.Y('frequency:Q'),
    color = alt.Color('ner_tokens:N', legend=alt.Legend(title='Place Name Mentioned on Humanist Listserv')),
    tooltip = [alt.Tooltip('ner_tokens', title='Place Identified'), 'frequency']
)

In [210]:
selection = alt.selection_multi(fields=['ner_tokens'], bind='legend')
alt.Chart(humanist_top_terms[humanist_top_terms.ner_tokens.isin(top_terms)][['datetime', 'ner_tokens', 'frequency']]).mark_circle(
    opacity=0.8,
    stroke='black',
    strokeWidth=1
).encode(
    x = alt.X('datetime:T', axis=alt.Axis(labelAngle=0)),
    y = alt.Y('ner_tokens:N'),
    size= alt.Size('frequency:Q',
        scale=alt.Scale(range=[0, 400]),
    ),
    color = alt.Color('ner_tokens:N', legend=alt.Legend(title='Place Name Mentioned on Humanist Listserv')),
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2)),
    tooltip = [alt.Tooltip('ner_tokens', title='Place Identified'), 'frequency']
).add_selection(
    selection
).properties(
    width=450,
    height=320,
    title=''
).configure_legend(labelLimit= 0)


In [254]:
step = 30
overlap = 1
alt.Chart(humanist_top_terms[humanist_top_terms.ner_tokens.isin(top_terms)], height=step).mark_line().encode(
    y=alt.Y('frequency:Q', scale=alt.Scale(range=[step, -step * overlap]), axis=None),
    x='datetime:T',
    color=alt.Color('ner_tokens:N', legend=None),
).facet(
    row=alt.Row(
        'ner_tokens:N',
        title=None,
        header=alt.Header(labelAngle=0, labelAlign='right'),
    )
).properties(
    bounds='flush',
    title='Frequency of Top Place Names Appearing on the Humanist Listserv, 1987-2007'
).configure_facet(
    spacing=0
).configure_view(
    stroke=None
).configure_title(
    anchor='middle'
).configure_axisY(
    labelPadding=-10, 
    labelAlign='right'
)

#### Experiments with Humanist Listserv Data

**We need to contextualize our data**


Example 1: Julianne Nyhan "In Search of Identities in the Digital Humanities: The Early History of Humanist" [https://www.researchgate.net/publication/280836611_In_Search_of_Identities_in_the_Digital_Humanities_The_Early_History_of_Humanist](https://www.researchgate.net/publication/280836611_In_Search_of_Identities_in_the_Digital_Humanities_The_Early_History_of_Humanist)


- Gives the history of the listserv, overview of the posts, and argues about the creation of disciplinary identity
- "Terms used to describe the group seem to signal how idealistic and personally involved a number of the early practitioners were. These included Suppor[t]ers of computing in the Humanities (Humanist 1:44); free people (1:80); true believer[s] (1:1035) and the lament “I thought we were all in this together” (1:661)"

---

Example 2: Rockwell, Geoffrey, and Stéfan Sinclair. “The Swallow Flies Swiftly Through: An Analysis of Humanist.” [http://hermeneuti.ca/swallow-flies](http://hermeneuti.ca/swallow-flies)

- Does some text analysis of the listserv
- Compares DH to humanities computing, rise of the web, rise of software and social media discourses
- Makes an argument about hack vs yack

---

Example 3: David McClure’s “Visualizing 27 years, 12 million words of the Humanist list” [http://humanist.dclure.org/](http://humanist.dclure.org/) and [http://dclure.org/essays/visualizing-the-humanist/](http://dclure.org/essays/visualizing-the-humanist/)


- Uses KDE (kernel density estimation) to compute patterns in the text and compare the similarity of those patterns, then uses those similarities as edges in a network
- 1980s computer hardware and textual studies
- 1990s increase in place names
- 2000s rise of administrative language
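
The KDE-then-network idea in Example 3 can be sketched roughly as follows. This is not McClure's code: the word offsets are hypothetical, the bandwidth is arbitrary, and the overlap measure is just one possible way to compare two density curves (the original project may use a different one).

```python
import math


def gaussian_kde(offsets, grid, bandwidth):
    """Estimate a density over `grid` from the word offsets of one term."""
    density = [
        sum(math.exp(-0.5 * ((x - o) / bandwidth) ** 2) for o in offsets)
        for x in grid
    ]
    area = sum(density)
    return [d / area for d in density]  # normalize so the curve sums to 1


def overlap(p, q):
    """Shared area under two normalized curves: near 1 if alike, near 0 if disjoint."""
    return sum(min(a, b) for a, b in zip(p, q))


# Hypothetical word offsets for three terms in a 1,000-word text.
offsets = {
    "mainframe": [50, 80, 120, 150],
    "terminal": [60, 100, 140],
    "website": [800, 850, 900, 950],
}
grid = range(0, 1000, 10)
curves = {
    term: gaussian_kde(positions, grid, bandwidth=50)
    for term, positions in offsets.items()
}

# Every pair of terms becomes a weighted edge in the network.
term_names = list(curves)
edges = [
    (a, b, round(overlap(curves[a], curves[b]), 3))
    for i, a in enumerate(term_names)
    for b in term_names[i + 1:]
]
for edge in edges:
    print(edge)
```

Terms that cluster in the same stretch of the text ("mainframe" and "terminal" here) end up with a heavy edge, while terms from different stretches get a weight close to zero.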
