# Word frequencies


Now that we have the abstracts in two nice neat .csv files, we need to download/import the packages needed, import the .csv files, and then can get on with the first part of the analysis. 

## Get ready 

As always, we start with a couple of code cells that load up and nickname some useful packages, then check file locations, then import files and check them. 


In [1]:
%%capture

# installing necessary pdf conversion packages via pip
# the '%%capture' at the top of this cell suppresses the output (which is normally quite long and annoying looking). 
# You can remove or comment it out if you prefer to see the output. 
!pip install nltk


In [2]:
%%capture

import os                         # os is a module for navigating your machine (e.g., file directories).
import nltk                       # nltk stands for natural language tool kit and is useful for text-mining. 
from nltk import word_tokenize    # and some of its key functions

nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from nltk.corpus import wordnet                    # Finally, things we need for lemmatising!
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() 
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
nltk.download('averaged_perceptron_tagger')        # Like a POS-tagger...
nltk.download('wordnet')
nltk.download('webtext')
from nltk.corpus import webtext

import pandas as pd
pd.set_option('display.max_colwidth', 200)
import numpy as np
import statistics
import datetime
date = datetime.date.today()

import codecs
import csv                        # csv is for importing and working with csv files

from collections import Counter

import statistics
import re                         # things we need for RegEx corrections
import matplotlib.pyplot as plt
import string 

import math 

English_punctuation = "-!\"#$%&()'*-–+,./:;<=>?@[\]^_`{|}~''“”"      # Things for removing punctuation, stopwords and empty strings
table_punctuation = str.maketrans('','', English_punctuation)

In [None]:
print(os.listdir("..\\output")  )                                # check 'results' folder is not empty/has correct stuff

## Import

Having checked the contents of the output folder and seen the files we expected to see, we can now import and check them. 

In [3]:
batches_list = []
batches_dict = {}

for batch in os.listdir("..\\output"):
        if batch.endswith(".csv"):
            name = batch.rsplit('.', maxsplit=1)[0]
            batches_list.append(name)

for batch in batches_list:
    df = pd.read_csv('..\\output\\' + batch + ".csv").drop(['Unnamed: 0', 'Year'],axis=1)
    year = batch.rsplit('_', maxsplit=1)[1]
    batches_dict[year] = df

In [4]:
batches_dict.values()

dict_values([                                                                                                                                                                                                         Text
0     Despite the availability of most of the human genome sequence the accu rate identification of genes on the DNA sequence remains to be continu ously improved and updated. This procedure relies on t...
1     The mechanisms whereby inherited DNA mutations cause disease areonly beginning to be understood. These are best understood in the contextof knowledge of the three dimensional structure of the rele...
2     Missense mutations can knock out residues in proteins that are importantfor binding catalysis or conformational change or have subtle effects onstability or conformation. Because buried residues i...
3     The Genome Project has accelerated the availability of tools to apply phar macogenetics to the development of personalized medicines. This differsfrom the lo

## Count word frequencies - 'bag of words'

Now that we have some basic descriptive stats about how many abstracts were imported properly with text in the 'Text' column, we can get on to the actual natural language processing steps. The most basic NLP option is to count the most frequent words found in the two sets of abstracts - meaning we need to find the most frequent words found in ALL of the abstracts and then compare that to the most frequnet words found in only those abstracts that contain a keyword of interest. 

To this end, we use the 'bag of words' method which whacks all of the words from all of the texts together, turns them into 'tokens' then processes to make them as unified as possible by removing uppercase letters, punctuation, digits, empty strings, stop words (e.g. 'the', 'and', 'for', etc. ) and word forms (e.g. pluralisations, verb endings, etc. ). 

Let's demo this with a simple example. If the text we want to 'bag of words' is "The cat named Cat was one of 5 cats." it would become a list of stemmed word-tokens like 
'''[[cat]
[name]
[cat]
[be]
[cat]]''' 
and the most common word would obviously be '''[cat]'''. 

Applying the 'bag of words' method to our texts is not so trivial, but should also be more enlightening. We would expect that the most common words from all of the texts would be similar to, but not identical to, the most common words from only the abstracts that contain a keyword of interest.

This bag of words approach ignores years, session codes, authors and everything else. Subsetting the texts by those things might be useful later. 

In [5]:
REPLACEMENTS_1 = [
    ('mutations', 'mutation' ),
    ('variants', "variant"),
    ('changes', "change"),
    ('alterations', "alteration"),
    ('diseases', "disease"), 
    ('disorders', "disorder"),
    ('illnesses', "illness"), 
    ('conditions', "condition"),
    ('diagnoses', "diagnosis"), 
    ('syndromes', "syndrome"), 
    ('patients', "patient"),
    ('individuals', "individual"),
    ('people', "person"),
    ('probands', "proband"), 
    ('subjects', "subject"), 
    ('cases', "case"), 
    ('normals', "normal"), 
    ('typicals', "typical"), 
    ('wilds', "wild"), 
    ('types', "type"), 
    ('abnormals', "abnormal"), 
    ('atypicals', "atypical"), 
    ]

REPLACEMENTS_2 = [
    ('gene change', "genechange"), 
    ('gene alteration', "genealteration"), 
    ('suffering from', "sufferingfrom"), 
    ('living with', "livingwith"), 
    ('wild type', "wildtype"), 
    ('rare condition', "rarecondidion"), 
    ('rare disease', "raredisease"), 
    ('rare disorder', "raredisorder"), 
    ('orphan condition', "orphancondition"), 
    ('orphan disease type', "orphandisease"), 
    ('orphan disoredr', "orphandisorder"),]

In [5]:
print(sorted(stop_words))                                # OPTIONAL: check what counts as a stopword if you want to see

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', "isn't", 'it', "it's", 'its', 'itself', 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she', "she's", 'should', "should've", 'shouldn', "shouldn't", 'so', 'some',

In [6]:
sets = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6 ]
words = ['mutation', 'variant', 'genechange', 'genealteration', 
         'disease', 'disorder', 'illness', 'condition', 'diagnosis', 'syndrome', 
         'patient', 'individual', 'person', 'proband', 'subject', 'case', 
         'affected', 'diagnosed', 'sufferingfrom', 'impacted', 'livingwith', 
         'normal', 'typical', 'wildtype', 'abnormal', 'atypical',
        'rarecondition', 'raredisease', 'raredisorder',  'orphancondition', 'orphandisease', 'orphandisorder' ]

target_word_dataframe = pd.DataFrame(list(zip(sets, words)), columns=['set','words'])
target_word_dataframe

Unnamed: 0,set,words
0,1,mutation
1,1,variant
2,1,genechange
3,1,genealteration
4,2,disease
5,2,disorder
6,2,illness
7,2,condition
8,2,diagnosis
9,2,syndrome


In [7]:
for k,v in batches_dict.items():
    year = str(k)
    holding_string = ""                                                        # that creates a temporary variable
    for text in v['Text']:                                                 # looks at the 'Text' column for the input
        holding_string += text                                                 # fills up the temp variable with the text
    for plural, singular in REPLACEMENTS_1:
        holding_string = holding_string.replace(plural, singular)
    for multi, single in REPLACEMENTS_2:
        holding_string = holding_string.replace(multi, single)
    holding_string = word_tokenize(holding_string)                             # word tokenises that text
    holding_string = [word.lower() for word in holding_string]                 # remove uppercase letters
    holding_string = [w.translate(table_punctuation) for w in holding_string]  # removes punctuation
    holding_string = (list(filter(lambda x: x, holding_string)))               # removes andy empty strings
    holding_string = [token for token in holding_string if not token.isdigit()]  # removes digits
    holding_string = [token for token in holding_string if token not in stop_words]  # removes stopwords
    list_for_count = []                                                              # and creates an empty list        
    for token in holding_string:                                         # then iterates over the tokens
        if token in target_word_dataframe.values:
            list_for_count.append(token)                                     # appending them to the list
    counts = Counter(list_for_count)                                     # applies the Counter function imported earlier 
    temp_df = pd.DataFrame.from_records(list(dict(counts).items()), columns=['words',year])
    target_word_dataframe = pd.merge(target_word_dataframe, temp_df, on='words', how='outer')

    
print(target_word_dataframe)


    set            words    2001    2002    2003    2004    2005    2006  \
0     5         abnormal   173.0   125.0   108.0    93.0   175.0   123.0   
1     4         affected   504.0   322.0   223.0   266.0   373.0   363.0   
2     5         atypical    35.0    13.0    30.0    26.0    30.0    30.0   
3     3             case  1666.0  1088.0   794.0  1041.0  1355.0  1168.0   
4     2        condition   193.0   126.0    70.0    94.0   126.0   177.0   
5     4        diagnosed   144.0   104.0    71.0    88.0   153.0   135.0   
6     2        diagnosis   655.0   488.0   287.0   366.0   485.0   497.0   
7     2          disease  1342.0   758.0   568.0   613.0   999.0  1038.0   
8     2         disorder   646.0   425.0   309.0   334.0   554.0   581.0   
9     1   genealteration     5.0     2.0     1.0     NaN     4.0     1.0   
10    1       genechange     NaN     1.0     NaN     NaN     NaN     1.0   
11    2          illness    15.0    11.0     3.0     9.0     3.0     9.0   
12    4     

In [None]:
target_word_dataframe


In [8]:
target_word_dataframe.to_csv('..\\output\\final\\target_words_by_year.csv')

### Steps to take

* revise the bag_of_words_analysis function to group by year? ALternatively, create one bag of words for each year. 
* write a new function that scans the year-bags-of-words for all words on an input list
* save the output of that function to a list (or .csv?) 
* use the outputs to create graphs that track the popularity of all the words on that list over time

