# Word "energy" as a keyword detection algorithm
In this notebook, we calculate an "energy spectrum" for every word in a document, where the "energy levels" of a word are the positions at which it occurs in the document. We then compute statistics on the spectrum to measure how relevant or important the word is to the document. The work here is based on the paper:

>Carpena, Pedro, et al. "Level statistics of words: Finding keywords in literary texts and symbolic sequences." Physical Review E 79.3 (2009): 035102.

For sample data, I used Reddit data *(science subreddit)* found at the following GitHub repository https://github.com/linanqiu/reddit-dataset. This textual data has already been cleaned and tokenized, and the exact details can be found in the readme.md file.

### Ingest the data into a pandas dataframe

In [None]:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn

In [None]:
df = pd.read_csv('/Users/ahairston/Desktop/learning_science.csv', index_col=0)

In [None]:
df.head()

*The column header descriptions can be found in the GitHub readme, but we are only interested in the messages found in column `0`.*

Next, we drop the rows that have no message, then join all of the remaining messages into a single list of tokens, `text`, which we treat as one document. We store the unique words of the text in a list called `words`.

In [None]:
df.dropna(subset=['0'], inplace=True)

In [None]:
df.head()

### Process the words in the messages

In [None]:
msgs = df['0']

In [None]:
text = ' '.join(msgs).split()

In [None]:
text[:5]

In [None]:
len(text)

In [None]:
words = list(set(text))

In [None]:
len(words)

The dictionary `energies` will contain the "energy spectrum" for each word, that is, the list of positions at which the word occurs in `text`. We can use a rug plot to visualize the [level spacings](https://en.wikipedia.org/wiki/Level-spacing_distribution) between the energy values.

In [None]:
energies = {}

In [None]:
# Single pass over the text: record every position at which each word occurs
for i, token in enumerate(text):
    energies.setdefault(token, []).append(i)

In [None]:
print(energies['hello'])

In [None]:
seaborn.rugplot(energies['hello'], height=1)

In [None]:
seaborn.rugplot(energies['gravitational'], height=1)
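
To make the "energy spectrum" concrete, here is a minimal sketch on a toy sentence. The sentence and the names `toy_text` and `toy_energies` are made up for illustration and are not part of the Reddit data.

```python
# Toy document (an assumption for illustration, not the Reddit data)
toy_text = "the cat sat on the mat and the cat slept".split()

# Map each distinct word to the list of positions where it occurs
toy_energies = {}
for position, token in enumerate(toy_text):
    toy_energies.setdefault(token, []).append(position)

print(toy_energies["the"])  # [0, 4, 7]
print(toy_energies["cat"])  # [1, 8]
```

Each list plays the role of the word's energy levels, and the gaps between neighboring entries are the level spacings analyzed in the next section.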

### Compute statistics for the word clusters

We now quantify the spacings between consecutive occurrences of each word and compute a key parameter of the spacing distribution, `sigma`: the standard deviation of the spacings divided by their mean.

In [None]:
spaces = {}

In [None]:
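for word in words:
    positions = energies[word]
    # Spacing between each pair of consecutive occurrences
    spaces[word] = [later - earlier for earlier, later in zip(positions, positions[1:])]
# A word that occurs only once has no spacings, which would cause a division
# by zero below, so keep only the words that occur at least twice
words = [word for word in words if spaces[word]]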
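
As a small illustration of the spacing calculation, the sketch below applies it to hand-written position lists. The helper name `level_spacings` is hypothetical and only used here.

```python
def level_spacings(positions):
    """Return the gaps between consecutive positions of a word."""
    return [later - earlier for earlier, later in zip(positions, positions[1:])]

print(level_spacings([0, 4, 7]))  # [4, 3]
print(level_spacings([5]))        # [] (a single occurrence has no spacing)
```

A word with `n` occurrences always yields `n - 1` spacings, so a word that occurs once contributes nothing to the statistics that follow.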

In [None]:
mean_spaces = {}

In [None]:
for word in words:
    mean_spaces[word] = sum(spaces[word])/len(spaces[word])

In [None]:
mean_sqr_spaces = {}

In [None]:
for word in words:
    mean_sqr_spaces[word] = sum([num**2 for num in spaces[word]])/len(spaces[word])

In [None]:
sigma = {}

In [None]:
for word in words:
    sigma[word] = (math.sqrt(mean_sqr_spaces[word] - mean_spaces[word]**2))/mean_spaces[word]
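
The `sigma` computed above is the population standard deviation of the spacings divided by their mean. The sketch below reproduces it with NumPy on two made-up spacing lists to show how it separates evenly spread words from clustered ones. The function name `relative_spacing_std` and the example lists are assumptions for illustration.

```python
import numpy as np

def relative_spacing_std(spacings):
    """Standard deviation of the spacings in units of their mean."""
    s = np.asarray(spacings, dtype=float)
    # np.std uses ddof=0 by default, matching sqrt(<s^2> - <s>^2)
    return s.std() / s.mean()

evenly_spaced = [10, 10, 10, 10]  # word appears at regular intervals
clustered = [1, 1, 1, 37]         # tight burst followed by a long gap

print(relative_spacing_std(evenly_spaced))  # 0.0
print(relative_spacing_std(clustered))      # larger than 1
```

Both lists have the same mean spacing, so the difference in `sigma` comes entirely from how unevenly the occurrences are distributed.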

In [None]:
sigma_norm = {}

We normalize the `sigma` parameter based on the probability of the word occurring in the document.

In [None]:
for word in words:
    sigma_norm[word] = sigma[word]/math.sqrt(1-(len(energies[word])/len(text)))
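
The reasoning behind this normalization can be sketched as follows, assuming a random text in which a word with $n$ occurrences among $N$ tokens appears at each position independently with probability $p = n/N$ (the symbols $n$, $N$, $p$ and $s$ are introduced here for illustration). The spacings $s$ then follow a geometric distribution:

$$
\begin{aligned}
P(s) &= p\,(1-p)^{s-1}, \qquad \langle s \rangle = \frac{1}{p}, \qquad \mathrm{sd}(s) = \frac{\sqrt{1-p}}{p}, \\
\sigma_{\text{random}} &= \frac{\mathrm{sd}(s)}{\langle s \rangle} = \sqrt{1-p}, \\
\sigma_{\text{norm}} &= \frac{\sigma}{\sqrt{1-p}} .
\end{aligned}
$$

Under that assumption, dividing by $\sqrt{1-p}$ makes a randomly placed word score close to 1 regardless of how frequent it is, which is what the cell above implements with `len(energies[word])/len(text)` as $p$.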

The `sigma_norm` parameter quantifies the clustering, but it can be misleading on its own because its statistical significance depends on the word count. Thus, we compute the mean and standard deviation that `sigma_norm` is expected to have in a random text, as a function of word count.

In [None]:
mean_sigma_norm = {}

In [None]:
sd_sigma_norm = {}

In [None]:
for word in words:
    mean_sigma_norm[word] = (2*len(energies[word])-1)/(2*len(energies[word])+2)
    sd_sigma_norm[word] = 1/(math.sqrt((len(energies[word])))*(1+2.8*len(energies[word])**(-0.865)))
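
To get a feel for how these two expressions depend on the word count, the sketch below wraps them in a function and evaluates them for a few counts. The function name `expected_sigma_norm` is hypothetical; the formulas are copied from the cell above.

```python
import math

def expected_sigma_norm(n):
    """Expected mean and standard deviation of sigma_norm for a word with n occurrences."""
    mean = (2 * n - 1) / (2 * n + 2)
    sd = 1 / (math.sqrt(n) * (1 + 2.8 * n ** -0.865))
    return mean, sd

for n in (2, 10, 100, 1000):
    mean, sd = expected_sigma_norm(n)
    print(n, round(mean, 3), round(sd, 3))
```

The mean approaches 1 as the count grows and the standard deviation shrinks, so the same `sigma_norm` value is more significant for a frequent word than for a rare one.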

The `c_score` measures the deviation of `sigma_norm` from its expected value in a random text, in units of the expected standard deviation. Thus `c_score` is a z-score that depends on the word's count and combines the clustering of a word with its frequency.

In [None]:
c_score = {}

In [None]:
for word in words:
    c_score[word] = (sigma_norm[word] - mean_sigma_norm[word])/(sd_sigma_norm[word])

In [None]:
len(c_score)
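
The steps above can be condensed into a single function that maps a word's positions to its score. This is a minimal sketch that mirrors the cells in this notebook; the name `c_value`, the guard against negative rounding errors, and the two position lists are assumptions for illustration.

```python
import math

def c_value(positions, n_tokens):
    """Return the c_score of a word from its positions, or None if it occurs fewer than twice."""
    n = len(positions)
    if n < 2:
        return None  # no spacings to analyze
    spacings = [b - a for a, b in zip(positions, positions[1:])]
    mean = sum(spacings) / len(spacings)
    mean_sq = sum(s * s for s in spacings) / len(spacings)
    # max(..., 0.0) guards against tiny negative values caused by rounding
    sigma = math.sqrt(max(mean_sq - mean ** 2, 0.0)) / mean
    sigma_norm = sigma / math.sqrt(1 - n / n_tokens)
    expected_mean = (2 * n - 1) / (2 * n + 2)
    expected_sd = 1 / (math.sqrt(n) * (1 + 2.8 * n ** -0.865))
    return (sigma_norm - expected_mean) / expected_sd

# Two made-up words with 8 occurrences each in a 1000-token document
clustered = [100, 101, 103, 104, 900, 902, 903, 905]
spread = [50, 175, 300, 425, 550, 675, 800, 925]

print(c_value(clustered, 1000))  # positive: more clustered than random
print(c_value(spread, 1000))     # negative: more regular than random
```

Both words have the same frequency, so the difference in score reflects only how their occurrences are arranged.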

### Print the results

Now we sort the `c_score` values from highest to lowest to detect keywords. After inspecting the ranking, you can choose a `c_score` threshold above which a word is labeled a "keyword".

In [None]:
sorted_c_scores = sorted(c_score.items(), key=lambda kv: kv[1])

In [None]:
type(sorted_c_scores)

In [None]:
high_c_scores = list(reversed(sorted_c_scores))

The cell below writes `sigma_norm` (the clustering parameter) and `c_score` to a file for the 100 words with the highest `c_score`. A choice has to be made here as to what threshold designates a word as a "keyword", which will require a bit of exploration of the results.

In [None]:
# Write directly to the file: a variable created by %%capture is not
# available until the capturing cell has finished running
with open('reddit_scores.txt', 'w') as f:
    for word, score in high_c_scores[:100]:
        print(word, sigma_norm[word], score, file=f)
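
Once a threshold has been chosen, selecting the keywords is a simple filter over the scores. The sketch below shows one way to do it; the function name `select_keywords`, the threshold of 3.0, and the scores in `toy_scores` are all made up for illustration and are not results from the Reddit data.

```python
def select_keywords(scores, threshold):
    """Return the words whose score is at least `threshold`, highest score first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, score in ranked if score >= threshold]

# Hypothetical scores, not taken from the actual output
toy_scores = {"gravitational": 12.4, "the": -0.3, "hello": 1.1, "quantum": 7.9}

print(select_keywords(toy_scores, threshold=3.0))  # ['gravitational', 'quantum']
```

In the notebook, the same function could be applied to the `c_score` dictionary with whatever threshold the exploration of the results suggests.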