# Word frequencies


Following the first code notebook, we now have the texts from all abstracts in two nice neat .csv files: one with all of the abstracts that we were able to identify, and one with only the subset of the identified abstracts that also contain one or more of the keywords of interest.

This second notebook starts by downloading and importing the packages we need and importing the .csv files, then moves on to the first part of the analysis.

## Get ready 

As always, we start with a couple of code cells that load up and nickname some useful packages, then check file locations, then import files and check them. 


In [None]:
%%capture

# installing the nltk text-mining package via pip
# the '%%capture' at the top of this cell suppresses the output (which is normally quite long and annoying looking). 
# You can remove or comment it out if you prefer to see the output. 
!pip install nltk


In [None]:
%%capture

import os                         # os is a module for navigating your machine (e.g., file directories).
import nltk                       # nltk stands for natural language tool kit and is useful for text-mining. 
from nltk import word_tokenize    # and some of its key functions

nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from nltk.corpus import wordnet                    # Finally, things we need for lemmatising!
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() 
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
nltk.download('averaged_perceptron_tagger')        # Like a POS-tagger...
nltk.download('wordnet')
nltk.download('webtext')
from nltk.corpus import webtext

import pandas as pd
pd.set_option('display.max_colwidth', 200)
import numpy as np
import statistics
import datetime
date = datetime.date.today()

import codecs
import csv                        # csv is for importing and working with csv files

from collections import Counter

import re                         # things we need for RegEx corrections
import matplotlib.pyplot as plt
import string 

import math 

English_punctuation = "-!\"#$%&()'*-–+,./:;<=>?@[\\]^_`{|}~''“”"     # Things for removing punctuation, stopwords and empty strings
                                                                      # (the backslash is escaped as '\\' to avoid an invalid escape sequence)
table_punctuation = str.maketrans('','', English_punctuation)
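
To make the translation table less abstract, here is a minimal sketch of how `str.maketrans` and `str.translate` strip punctuation from tokens. It uses `string.punctuation` (ASCII only) rather than the longer character list above, and the demo tokens are made up for illustration.

```python
import string

# Assumption: ASCII punctuation only; the notebook's own list also covers curly quotes and dashes.
demo_table = str.maketrans('', '', string.punctuation)   # map every punctuation character to None

demo_tokens = ["Hello,", "world!", "(p<0.05)", "..."]     # made-up tokens
cleaned = [token.translate(demo_table) for token in demo_tokens]
print(cleaned)                                            # punctuation-only tokens become empty strings

non_empty = [token for token in cleaned if token]         # so we drop the empty strings afterwards
print(non_empty)
```

This is why the later processing removes empty strings straight after removing punctuation.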

In [None]:
print(os.listdir("..\\output"))                                  # check 'output' folder is not empty/has correct stuff

## Import

Having checked the contents of the output folder and seen the files we expected to see, we can now import and check them. 
There is an additional file here, called 'ESHG_abstract_submissions.csv', which contains numbers given to us by the ESHG conference team about how many abstracts they accepted each year. This will be useful for comparing against how many abstracts we detected each year, to get a sense of how successful our process is.

In [None]:
all_texts = pd.read_csv('..\\output\\all_abstracts_no_null_texts.csv')            # one for all of the texts and then
matched_texts = pd.read_csv('..\\output\\matched_abstracts_no_null_texts.csv')    # one for just those that match the keyword
reported_abstracts = pd.read_csv('..\\output\\ESHG_abstract_submissions.csv')     # one for the number of expected abstracts

In [None]:
print (len(all_texts))                        # it is always useful to double check that the length matches your expectations
print (len(matched_texts))                    # in this case, we already know how many rows to expect in each file. 
print (len(reported_abstracts))

## Get some basic stats about how texts are spread out over time

We know that all of the rows in the files have at least two columns with contents - 'Year' and 'Text'. This means that it is probably useful to get a little schematic and/or table that counts rows according to year. Let's do that now!

In [None]:
all_counts_by_year = all_texts['Year'].value_counts()         # this creates a little table with two columns - year and count
print(all_counts_by_year)                                     # however, when we print it we can see it has no headers,
                                                              # is not in order, has the years appearing as  floats, etc. 

In [None]:
all_counts_by_year = pd.DataFrame(all_counts_by_year.rename("All"))              # Name the counts 'All' and convert to a data frame
                                                                                 # (works whether pandas calls the counts 'Year' or 'count'),
all_counts_by_year = all_counts_by_year.rename_axis('Year').reset_index()        # set the axis to 'Year' and reset the index,
all_counts_by_year = all_counts_by_year.sort_values(by=['Year']).astype('Int64') # retype 'Year' column and sort by year.
print(all_counts_by_year)                                                        # Let's just check it worked. 
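
For readers who want to see the count-then-tidy steps in isolation, here is a small sketch on a toy 'Year' column. The data are invented, and the chained style is just one possible way to write it. It names the count column explicitly, so it does not depend on what the installed pandas version calls the output of `value_counts()`.

```python
import pandas as pd

# Hypothetical toy data; the real notebook reads 'Year' from the .csv files.
toy = pd.DataFrame({"Year": [2003.0, 2001.0, 2003.0, 2002.0, 2003.0, 2001.0]})

raw_counts = toy["Year"].value_counts()      # ordered by count, years still floats
print(raw_counts)

tidy_counts = (
    raw_counts
    .rename_axis("Year")                     # label the index 'Year'
    .reset_index(name="All")                 # index becomes a column, counts are named 'All'
    .sort_values("Year")                     # chronological order
    .astype("Int64")                         # whole-number years and counts
    .reset_index(drop=True)
)
print(tidy_counts)
```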

In [None]:
matched_counts_by_year = matched_texts['Year'].value_counts()                             # repeat for the matched_texts file
matched_counts_by_year = pd.DataFrame(matched_counts_by_year.rename("Matches"))           # name the counts, turn into a data frame
matched_counts_by_year = matched_counts_by_year.rename_axis('Year').reset_index()         # rename axis, reset index
matched_counts_by_year = matched_counts_by_year.sort_values(by=['Year']).astype('Int64')  # retype and sort by value of year
print(matched_counts_by_year)                                                             # and check it looks correct. 

In [None]:
reported_abstracts             # This file wasn't created by our preparation processes, so we check it. 
                               # We don't need all the columns, just 'Year' and 'Total'. 
                               # Also, we can see that we don't need to retype the 'Year' or sort by 'Year' value. 

In [None]:
reported_abstracts_by_year = pd.DataFrame(reported_abstracts[['Year', 'Total']])            # create data frame from just 2 columns
reported_abstracts_by_year = reported_abstracts_by_year.rename(columns = {'Total':'Reported'}) # rename 'Total' column to 'Reported'
print(reported_abstracts_by_year)                                                           # Let's just check it worked. 

In [None]:
counts_year = all_counts_by_year.merge(matched_counts_by_year, on='Year', how='left') # now, combine the 1st two data frames
counts_year = counts_year.merge(reported_abstracts_by_year, on='Year', how='left')       # add the 3rd 
print(counts_year)                                                                    # and have a look at it
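
A quick sketch of what `how='left'` does here, using invented numbers: every year from the left-hand frame is kept, and years with no row in the right-hand frame get a missing value rather than being dropped.

```python
import pandas as pd

# Made-up counts purely for illustration.
all_demo = pd.DataFrame({"Year": [2001, 2002, 2003], "All": [10, 12, 15]})
matched_demo = pd.DataFrame({"Year": [2001, 2003], "Matches": [2, 4]})

merged_demo = all_demo.merge(matched_demo, on="Year", how="left")
print(merged_demo)                                    # 2002 has no matches, so 'Matches' is missing there

merged_filled = merged_demo.fillna({"Matches": 0})    # optional: treat 'no row' as zero matches
print(merged_filled)
```

Whether a missing value should be read as zero is a judgment call; it is only safe if a missing year really means no matching abstracts.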

In [None]:
counts_year['Year'] = pd.to_datetime(counts_year['Year'].astype(str), format="%Y")   # turn the integer years into dates for plotting

In [None]:
counts_year = counts_year.set_index('Year')                     # set the year as the index
print(counts_year)                                              # and have a look. Nice!

In [None]:
print(counts_year['All'].sum())                                  # I worry too much sometimes
print(counts_year['Matches'].sum())                              # but why not check that the numbers still add up? 
print(counts_year['Reported'].sum())                                  
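
One way to turn these columns into a 'how successful was our process' figure is to express the detected abstracts as a percentage of the reported ones. The sketch below uses invented numbers, and the derived column names are hypothetical; only 'All', 'Matches' and 'Reported' come from the notebook.

```python
import pandas as pd

# Invented numbers; in the notebook these columns live in counts_year.
counts_demo = pd.DataFrame(
    {"All": [90, 120], "Matches": [9, 30], "Reported": [100, 150]},
    index=pd.Index([2001, 2002], name="Year"),
)

# Share of reported abstracts that we managed to detect, per year.
counts_demo["Percent detected"] = (counts_demo["All"] / counts_demo["Reported"] * 100).round(1)
# Share of detected abstracts that contain a keyword of interest, per year.
counts_demo["Percent matched"] = (counts_demo["Matches"] / counts_demo["All"] * 100).round(1)
print(counts_demo)
```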

In [None]:
counts_year.plot()                               # create a plot from the combined, re-indexed and renamed data frame
plt.legend(frameon=False)                        # draw the legend without a frame
plt.savefig('..\\output\\abstract_count.jpg')    # save the figure before showing it; calling savefig in a later cell,
                                                 # after plt.show(), would typically save an empty figure
plt.show()                                       # have a look at the plot

## Count word frequencies - 'bag of words'

Now that we have some basic descriptive stats about how many abstracts were imported properly with text in the 'Text' column, we can get on to the actual natural language processing steps. The most basic NLP option is to count the most frequent words found in the two sets of abstracts - meaning we need to find the most frequent words found in ALL of the abstracts and then compare them to the most frequent words found in only those abstracts that contain a keyword of interest.

To this end, we use the 'bag of words' method, which whacks all of the words from all of the texts together, turns them into 'tokens', then processes them to make them as uniform as possible by removing uppercase letters, punctuation, digits, empty strings and stop words (e.g. 'the', 'and', 'for', etc.), and by stemming away word forms (e.g. pluralisations, verb endings, etc.).

Let's demo this with a simple example. If the text we want to 'bag of words' is "The cat named Cat was one of 5 cats." it would become a list of stemmed word-tokens like `[cat, name, cat, one, cat]` (the stop words 'the', 'was' and 'of' and the digit '5' are removed along the way), and the most common word would obviously be `cat`.
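
The cat example can be sketched with the standard library alone. This is not the NLTK pipeline used later in the notebook: the tiny stop word set and the `naive_stem` helper are hypothetical stand-ins for NLTK's stop word list and Porter stemmer, just enough to handle this one sentence. In this sketch 'was' is treated as a stop word and removed, while 'one' survives.

```python
import re
from collections import Counter

demo_text = "The cat named Cat was one of 5 cats."
demo_stop_words = {"the", "was", "of"}            # tiny hypothetical stop word list

def naive_stem(token):
    """Crude stand-in for a real stemmer: 'named' -> 'name', 'cats' -> 'cat'."""
    if token.endswith("ed"):
        return token[:-1]
    if token.endswith("s"):
        return token[:-1]
    return token

demo_tokens = re.findall(r"[a-z]+", demo_text.lower())               # lowercase, keep runs of letters only
demo_tokens = [t for t in demo_tokens if t not in demo_stop_words]   # drop stop words
demo_tokens = [naive_stem(t) for t in demo_tokens]                   # strip simple word endings
demo_counts = Counter(demo_tokens)                                   # count each token

print(demo_tokens)
print(demo_counts.most_common(1))
```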

Applying the 'bag of words' method to our texts is not so trivial, but should also be more enlightening. We would expect that the most common words from all of the texts would be similar to, but not identical to, the most common words from only the abstracts that contain a keyword of interest.

This bag of words approach ignores years, session codes, authors and everything else. Subsetting the texts by those things might be useful later. 

In [None]:
def bag_of_words_analysis(texts_df, how_many):  # define a 'bag of words' function with 2 arguments, an input and a quantity
    holding_string = " ".join(texts_df['Text'])                       # joins every text in the 'Text' column, separated by spaces
                                                                      # so the last word of one text can't fuse with the next
    tokens = word_tokenize(holding_string)                            # word tokenises that text
    tokens = [word.lower() for word in tokens]                        # removes uppercase letters
    tokens = [w.translate(table_punctuation) for w in tokens]         # removes punctuation
    tokens = [token for token in tokens if token]                     # removes any empty strings
    tokens = [token for token in tokens if not token.isdigit()]       # removes digits
    tokens = [token for token in tokens if token not in stop_words]   # removes stopwords
    tokens = [porter.stem(token) for token in tokens]                 # stems the word-tokens
    counts = Counter(tokens)                                          # applies the Counter function imported earlier
    return counts.most_common(how_many)                               # and returns the tokens with highest counts
                                                                      # up to the quantity specified as an argument

In [None]:
most_frequent_all = bag_of_words_analysis(all_texts, 20)   # apply bag of words function to all texts, and save the output table
                                                           # this will take a while. 

In [None]:
most_frequent_all = pd.DataFrame(most_frequent_all)                                # convert the saved output as a data frame
most_frequent_all = most_frequent_all.rename(columns={0: "Word", 1: "All count" }) # name the columns
print(most_frequent_all)                                                           # Let's just check it worked.

In [None]:
most_frequent_matched = bag_of_words_analysis(matched_texts, 20) # apply the bag of words function to matched texts, and save

In [None]:
most_frequent_matched = pd.DataFrame(most_frequent_matched)                                    # convert it to a data frame
most_frequent_matched = most_frequent_matched.rename(columns={0: "Word", 1: "Matched count" }) # name the columns
print(most_frequent_matched)                                                                   # Let's just check it worked.

In [None]:
most_frequent = most_frequent_all.merge(most_frequent_matched, on='Word', how='outer') # combine the two data frames via 'outer'
                                                                                       # to get the total list of all words
print(most_frequent)
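
With an 'outer' merge, a word that appears in only one of the two top-20 lists gets a missing value in the other count column. Here is a small sketch of how those words could be picked out, using invented words and counts and the `indicator=True` option of `merge`.

```python
import pandas as pd

# Invented stems and counts, only to show the mechanics.
top_all = pd.DataFrame({"Word": ["patient", "gene", "mutat"], "All count": [500, 450, 300]})
top_matched = pd.DataFrame({"Word": ["patient", "counsel", "gene"], "Matched count": [60, 40, 35]})

# indicator=True adds a '_merge' column saying which frame each row came from.
comparison = top_all.merge(top_matched, on="Word", how="outer", indicator=True)
print(comparison)

only_all = comparison.loc[comparison["_merge"] == "left_only", "Word"].tolist()
only_matched = comparison.loc[comparison["_merge"] == "right_only", "Word"].tolist()
print(only_all, only_matched)
```

Note that a missing count here only means the word fell outside the other list's top 20, not that it never occurs in those texts.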

In [None]:
most_frequent.to_csv('..\\output\\most_frequent_comparison.csv')  # write out the joined most frequent words to a .csv
                                                                  # again, with a clear and useful name