## Load libraries and candidate list

In [None]:
pip install pymediawiki

Collecting pymediawiki
  Downloading pymediawiki-0.7.0-py3-none-any.whl (23 kB)
Installing collected packages: pymediawiki
Successfully installed pymediawiki-0.7.0


In [None]:
import re
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

In [None]:
url = 'https://raw.githubusercontent.com/casszhao/FAIR/main/0901_full_list.csv'
sorted_cat = pd.read_csv(url, header=None)
sorted_cat = sorted_cat[0].to_list()

In [None]:
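# Lowercase every candidate category so matching is case-insensitive
sorted_cat_lowervob_list = [cat.lower() for cat in sorted_cat]
print(len(sorted_cat_lowervob_list))
# Deduplicated copy of the vocabulary (not used for matching below)
sorted_cat_list = list(set(sorted_cat_lowervob_list))
# This prints the lowercased list's length again;
# len(sorted_cat_list) would give the deduplicated count
print(len(sorted_cat_lowervob_list))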

2159
2159
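
To check whether lowercasing introduces duplicate entries in the candidate list, a count per term can help. The following is a minimal sketch on a made-up sample (`sample_categories` is hypothetical and stands in for the list loaded from the CSV).

```python
from collections import Counter

# Hypothetical stand-in for the candidate list loaded from the CSV
sample_categories = ["Public health", "public health", "Health care", "Demography"]

lowered = [cat.lower() for cat in sample_categories]
counts = Counter(lowered)

# Terms that occur more than once after lowercasing
duplicates = sorted(term for term, n in counts.items() if n > 1)
unique_terms = sorted(counts)

print(len(lowered), len(unique_terms))  # 4 3
print(duplicates)                       # ['public health']
```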


**Example Abstract**

The following example shows how a single abstract is processed.

In [None]:
abstract = 'how health care reform can transform the health of criminal justice involved individualsProvisions of the Affordable Care Act offer new opportunities to apply a public health and medical perspective to the complex relationship between involvement in the criminal justice system and the existence of fundamental health disparities. Incarceration can cause harm to individual and community health, but prisons and jails also hold enormous potential to play an active and beneficial role in the health care system and, ultimately, to improving health. Traditionally, incarcerated populations have been incorrectly viewed as isolated and self-contained communities with only peripheral importance to the public health at large. This misconception has resulted in missed opportunities to positively affect the health of both the individuals and the imprisoned community as a whole and potentially to mitigate risk behaviors that may contribute to incarceration. Both community and correctional health care professionals can capitalize on these opportunities by working together to advocate for the health of the criminal justice-involved population and their communities. We present a set of recommendations for the improvement of both correctional health care, such as improving systems of external oversight and quality management, and access to community-based care, including establishing strategies for postrelease care and medical record transfers. ' #@param {type:"string"}

## Step 1: Data pre-processing (remove stopwords before noun extraction).


In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Resources for stopword removal, tokenization and POS tagging
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# NLTK's English stopwords
stopwords_list = stopwords.words('english')
print(len(stopwords_list))

# Extra terms common in abstracts; 'research' and 'purpose' are listed twice,
# which is harmless for membership tests
extended = ['methodology', 'study', 'use', 'purpose', 'research', 'conclusion',
            'research', 'paper', 'background', 'dissertation', 'essays',
            'purpose', 'addition', 'elsevier']
stopwords_list = stopwords_list + extended

print(len(stopwords_list))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
179
193


In [None]:
words = [word for word in word_tokenize(abstract) if word.lower() not in stopwords_list]
nostop_abstract = " ".join(words)
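
The filtering step can be tried without downloading any NLTK data. This is only a sketch under assumptions: `toy_stopwords` and `remove_stopwords` are hypothetical names, and a simple regular expression replaces `word_tokenize`, so punctuation handling differs from the notebook.

```python
import re

# Small illustrative stopword set; the notebook uses NLTK's English list plus extras
toy_stopwords = {"the", "of", "to", "a", "and", "study", "research"}

def remove_stopwords(text, stopword_set):
    # Simple regex tokenizer, used here in place of nltk.word_tokenize
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    # Compare in lowercase so "The" and "the" are both filtered
    return " ".join(tok for tok in tokens if tok.lower() not in stopword_set)

filtered = remove_stopwords("The study of access to public health", toy_stopwords)
print(filtered)  # access public health
```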

## Step 2: Identify terms in the abstract which are also in the list of candidate categories.

In [None]:
def generate_ngrams(s, n):
    # Convert to lowercase
    s = s.lower()

    # Replace all non-alphanumeric characters with spaces
    s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)

    # Split on any whitespace; split() also drops empty tokens
    tokens = s.split()

    # Zip n shifted copies of the token list to form the n-grams,
    # then join each n-gram into a single string
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]
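
To see what the sliding window produces, here is a compact, self-contained variant of the same idea applied to a short sentence. The name `ngrams` is hypothetical and is used only so this sketch does not overwrite `generate_ngrams`.

```python
import re

def ngrams(text, n):
    # Same idea as generate_ngrams: lowercase, strip punctuation,
    # then slide a window of size n over the tokens
    tokens = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    return [" ".join(gram) for gram in zip(*[tokens[i:] for i in range(n)])]

sentence = "Public health, and community-based care."
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(bigrams)
# ['public health', 'health and', 'and community', 'community based', 'based care']
print(trigrams)
# ['public health and', 'health and community', 'and community based', 'community based care']
```

Note that the hyphen in "community-based" is replaced by a space, so the word contributes two tokens. A text with fewer than `n` tokens yields an empty list.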

In [None]:
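def get_matched_gram(abstract):
  # Candidate terms: unigrams, bigrams and trigrams from the abstract
  uni_gram = nltk.word_tokenize(abstract)
  bi_gram = generate_ngrams(abstract, 2)
  tri_gram = generate_ngrams(abstract, 3)

  # Named all_grams to avoid shadowing the built-in all()
  all_grams = uni_gram + bi_gram + tri_gram
  all_lower = [gram.lower() for gram in all_grams]

  # Keep each n-gram that appears in the candidate vocabulary;
  # a set makes the membership test fast and removes duplicates
  vocab = set(sorted_cat_lowervob_list)
  matched = {gram for gram in all_lower if gram in vocab}
  return list(matched)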


In [None]:
matched_list = get_matched_gram(abstract)

In [None]:
matched_list

['health disparities',
 'health care',
 'community',
 'health care reform',
 'health',
 'health care system',
 'public health',
 'individual',
 'community health']
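
The result contains overlapping terms such as 'health', 'health care' and 'health care reform', because every n-gram is checked independently. A minimal sketch of that behaviour, assuming a made-up vocabulary (`toy_vocab`) and a hypothetical helper `match_terms` that uses regex tokenization rather than NLTK:

```python
import re

# Hypothetical mini-vocabulary standing in for sorted_cat_lowervob_list
toy_vocab = {"health", "health care", "health care reform", "public health"}

def match_terms(text, vocab, max_n=3):
    tokens = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    matched = set()
    # Check every window of 1..max_n consecutive tokens against the vocabulary
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in vocab:
                matched.add(gram)
    return sorted(matched)

toy_matches = match_terms("Health care reform and public health", toy_vocab)
print(toy_matches)
# ['health', 'health care', 'health care reform', 'public health']
```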

## Step 3: Check Wikipedia categories associated with each noun and return those that appear in the candidate list. 

1.   Identify nouns in abstract
2.   Retrieve Wikipedia categories associated with each noun
3.   Save list of categories which also appear in the candidate list.
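
The three steps above can be sketched offline before involving Wikipedia. Everything in this block is an assumption for illustration: `toy_wiki_categories` is an invented dictionary, not real Wikipedia data, and `categories_in_candidates` is a hypothetical helper.

```python
# Hypothetical offline stand-in for Wikipedia: noun -> categories of its page
toy_wiki_categories = {
    "health": ["Health", "Public health", "Personal life"],
    "community": ["Community", "Types of organization"],
}
toy_candidates = {"health", "public health", "community"}

def categories_in_candidates(nouns, wiki_categories, candidates):
    kept = set()
    for noun in nouns:
        # Step 2: look up categories; nouns with no page contribute nothing
        for category in wiki_categories.get(noun, []):
            # Step 3: keep categories that also appear in the candidate list
            if category.lower() in candidates:
                kept.add(category.lower())
    return sorted(kept)

# Step 1 is assumed done: these nouns would come from the POS tagger
toy_nouns = ["health", "community", "misconception"]
toy_saved = categories_in_candidates(toy_nouns, toy_wiki_categories, toy_candidates)
print(toy_saved)  # ['community', 'health', 'public health']
```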

In [None]:
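# Penn Treebank noun tags (NN, NNS, NNP, NNPS) all start with 'NN'
def is_noun(pos):
  return pos[:2] == 'NN'

# POS-tag the stopword-free abstract and keep the unique nouns
tokenized = nltk.word_tokenize(nostop_abstract)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)]
nouns = list(set(nouns))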
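
The noun filter relies on the tag prefix only. The sketch below applies it to hand-written `(word, tag)` pairs in the shape that `nltk.pos_tag` returns; the tags are assigned by hand for illustration and are not actual tagger output.

```python
# Hand-written (word, tag) pairs in the shape returned by nltk.pos_tag
tagged = [
    ("prisons", "NNS"),
    ("hold", "VBP"),
    ("enormous", "JJ"),
    ("potential", "NN"),
    ("Affordable", "NNP"),
]

def is_noun(pos):
    # Penn Treebank noun tags all start with "NN": NN, NNS, NNP, NNPS
    return pos.startswith("NN")

toy_noun_words = [word for word, pos in tagged if is_noun(pos)]
print(toy_noun_words)  # ['prisons', 'potential', 'Affordable']
```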

In [None]:
from mediawiki import MediaWiki
wikipedia = MediaWiki()

In [None]:
# Look up every noun and matched n-gram on Wikipedia
search_list = list(set(nouns + matched_list))
one_text_cats_list = []
for topic in search_list:
  try:
    p = wikipedia.page(topic)
    # Collect the categories attached to the page
    one_text_cats_list.extend(p.categories)
  except Exception:
    # Missing or ambiguous pages are reported and skipped
    print('no wikipedia search result for ', topic)

no wikipedia search result for  quality
no wikipedia search result for  Act
no wikipedia search result for  access
no wikipedia search result for  relationship
no wikipedia search result for  care
no wikipedia search result for  misconception
no wikipedia search result for  Care
no wikipedia search result for  disparities
no wikipedia search result for  transfers
no wikipedia search result for  play
no wikipedia search result for  record
no wikipedia search result for  opportunities
no wikipedia search result for  perspective
no wikipedia search result for  set
no wikipedia search result for  transform
no wikipedia search result for  complex
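
Instead of printing each failure, the failed topics can be collected for later inspection. This is a sketch under assumptions: `collect_categories` and `stub_fetch` are hypothetical, and the lookup is passed in as a function so the example runs offline. With the client above, the lookup could be `lambda t: wikipedia.page(t).categories`.

```python
def collect_categories(topics, fetch_categories):
    """Gather categories per topic, recording topics whose lookup fails."""
    categories, failed = [], []
    for topic in topics:
        try:
            categories.extend(fetch_categories(topic))
        except Exception:  # e.g. a missing or ambiguous page
            failed.append(topic)
    return categories, failed

# Hypothetical stub standing in for the Wikipedia lookup
def stub_fetch(topic):
    pages = {"health": ["Health", "Public health"]}
    return pages[topic]  # raises KeyError for unknown topics

found, missing = collect_categories(["health", "set"], stub_fetch)
print(found)    # ['Health', 'Public health']
print(missing)  # ['set']
```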


**Match against the pre-defined categories**

The pre-defined vocabulary is `sorted_cat_lowervob_list`. A Wikipedia category is kept only if its lowercased name appears in this vocabulary.

In [None]:
saved_cat_list = []
for one_cat in one_text_cats_list:
  # Keep Wikipedia categories that appear in the candidate vocabulary
  if one_cat.lower() in sorted_cat_lowervob_list:
    saved_cat_list.append(one_cat.lower())
saved_cat_list = list(set(saved_cat_list))
print(saved_cat_list)

['health economics', 'medical humanities', 'sanitation', 'health care', 'health equity', 'euthenics', 'social problems in medicine', 'primary care', 'determinants of health', 'medical sociology', 'health policy', 'demography', 'economic inequality', 'organizational theory', 'community', 'health', 'public health', 'health care reform']


## Step 4: Produce combined list of categories identified in previous steps

In [None]:
combined_identical_matched = matched_list + saved_cat_list
combined_identical_matched

['health disparities',
 'health care',
 'community',
 'health care reform',
 'health',
 'health care system',
 'public health',
 'individual',
 'community health',
 'health economics',
 'medical humanities',
 'sanitation',
 'health care',
 'health equity',
 'euthenics',
 'social problems in medicine',
 'primary care',
 'determinants of health',
 'medical sociology',
 'health policy',
 'demography',
 'economic inequality',
 'organizational theory',
 'community',
 'health',
 'public health',
 'health care reform']
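
The combined list repeats terms found by both steps, for example 'health care' and 'health'. If a duplicate-free list is wanted, one option is an order-preserving deduplication; the sketch below uses short made-up lists (`toy_matched`, `toy_saved_cats`) in place of `matched_list` and `saved_cat_list`.

```python
toy_matched = ["health care", "community", "health"]
toy_saved_cats = ["health economics", "health care", "health"]

combined = toy_matched + toy_saved_cats
# dict.fromkeys keeps the first occurrence of each term, in order
deduplicated = list(dict.fromkeys(combined))
print(deduplicated)
# ['health care', 'community', 'health', 'health economics']
```

Unlike `list(set(...))`, this keeps the directly matched terms ahead of the Wikipedia-derived ones.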