<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Natural Language Processing (NLP)
## *Data Science Unit 4 Sprint 2 Assignment 1*

Analyze a corpus of text using text visualization of token frequency. Try cleaning the data as much as possible. Try the following techniques: 
- Stemming
- Lemmatization
- Custom stopword removal
- Frequency-based stopword removal

You are free to use any dataset you are interested in. Kaggle is a great place to start. Feel free to sample the data if the dataset is too large to handle in memory.
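
If a dataset is too large to load at once, one option is to sample it while streaming. The sketch below uses reservoir sampling over any iterable of records. It assumes the data can be read one record at a time (for example, a line-delimited file), which is not the case for a single JSON document like the one loaded in this notebook. The name `reservoir_sample` and its parameters are illustrative.

```python
import random

def reservoir_sample(records, k, seed=None):
    """Return up to k records chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir

# Hypothetical stand-in for a stream of descriptions read one at a time
stream = (f"description {n}" for n in range(10_000))
subset = reservoir_sample(stream, k=100, seed=42)
print(len(subset))
```

Only `k` records are held in memory at any point, regardless of how long the stream is.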

In [1]:
import pandas as pd

In [2]:
df = pd.read_json('data/summaries.json')
df = df.T
df.head()

Unnamed: 0,description,id,nct_id
0,\n J-Valve TF Compassionate Use cases app...,3654765,NCT03876964
1,\n The study is designed to assess the ro...,3647891,NCT03966794
2,\n DHA on Breath Holding Spells\n,3647953,NCT03965988
3,\n The purpose of this study is to assess...,3650218,NCT03936374
4,\n The purpose of this trial is to evalua...,3647820,NCT03967730


In [3]:
df['description'][:5]

0    \n      J-Valve TF Compassionate Use cases app...
1    \n      The study is designed to assess the ro...
2           \n      DHA on Breath Holding Spells\n    
3    \n      The purpose of this study is to assess...
4    \n      The purpose of this trial is to evalua...
Name: description, dtype: object

## Cleaning Text Data

In [22]:
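import re

def clean_text(text):
    """Lowercase the text, drop non-alphanumeric characters, and collapse whitespace."""
    text = text.lower().strip()
    # Keep only lowercase letters, digits, and spaces (the text is already lowercased)
    text = re.sub(r'[^a-z0-9 ]', '', text)
    # Collapse runs of whitespace into single spaces
    text = " ".join(text.split())
    return text

# `sample` is a single raw description string, assumed to be defined in a cell not shown here
clean_text(sample)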

'investigators have designed a pilot study involving chronic hand dermatitis chd patients who attend the dermatology clinic at the george washington medical faculty associates gw mfa in order to assess the efficacy and safety of apremilast treatment for the treatment of moderate to severe chd'
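
Deleting punctuation outright fuses hyphenated terms: "J-Valve" appears as "jvalve" in the `clean_desc` output below. A possible variant, sketched here on a made-up string, replaces unwanted characters with spaces so word boundaries survive. The function name `clean_text_keep_boundaries` is hypothetical and not part of the notebook.

```python
import re

def clean_text_keep_boundaries(text):
    """Variant of the cleaning step that turns punctuation into spaces instead of deleting it."""
    text = text.lower().strip()
    # Punctuation and newlines become spaces rather than vanishing
    text = re.sub(r'[^a-z0-9 ]', ' ', text)
    # Collapse the resulting runs of whitespace
    return " ".join(text.split())

raw = "\n      J-Valve TF: approved (by the FDA)\n    "   # made-up example string
print(clean_text_keep_boundaries(raw))
# j valve tf approved by the fda
```

Whether "j valve" or "jvalve" is preferable depends on the analysis; the point is only that the choice of replacement string changes the resulting tokens.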

In [24]:
df['clean_desc'] = df['description'].apply(clean_text)
df['clean_desc'][:5]

0    jvalve tf compassionate use cases approved on ...
1    the study is designed to assess the role of fu...
2                         dha on breath holding spells
3    the purpose of this study is to assess the eff...
4    the purpose of this trial is to evaluate the s...
Name: clean_desc, dtype: object

## Tokenizing

In [4]:
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_sm")

tokenizer = Tokenizer(nlp.vocab)

In [26]:
tokens = []

# Tokenize the raw descriptions in batches; pipe() yields one Doc per description
for doc in tokenizer.pipe(df['description'], batch_size=500):
    tokens.append([token.text for token in doc])

tokens[:5]

[['\n      ',
  'J-Valve',
  'TF',
  'Compassionate',
  'Use',
  'cases',
  'approved',
  'on',
  'a',
  'case',
  'by',
  'case',
  'basis',
  'by',
  'the',
  'FDA',
  '\n    '],
 ['\n      ',
  'The',
  'study',
  'is',
  'designed',
  'to',
  'assess',
  'the',
  'role',
  'of',
  'functional',
  'neural',
  'regeneration',
  'collagen',
  'scaffold',
  '\n      ',
  'transplantation',
  'combined',
  'with',
  'epidural',
  'electrical',
  'stimulation',
  'in',
  'spinal',
  'cord',
  'injury',
  'patients.',
  '\n    '],
 ['\n      ', 'DHA', 'on', 'Breath', 'Holding', 'Spells', '\n    '],
 ['\n      ',
  'The',
  'purpose',
  'of',
  'this',
  'study',
  'is',
  'to',
  'assess',
  'the',
  'effect',
  'of',
  'multiple',
  'dose',
  'administrations',
  'of',
  '\n      ',
  'Omeprazole',
  'on',
  'the',
  'pharmacokinetics',
  'of',
  'BMS-986205.',
  '\n    '],
 ['\n      ',
  'The',
  'purpose',
  'of',
  'this',
  'trial',
  'is',
  'to',
  'evaluate',
  'the',
  'safety',

In [27]:
df['tokens'] = tokens
df['tokens'].head()

0    [\n      , J-Valve, TF, Compassionate, Use, ca...
1    [\n      , The, study, is, designed, to, asses...
2    [\n      , DHA, on, Breath, Holding, Spells, \...
3    [\n      , The, purpose, of, this, study, is, ...
4    [\n      , The, purpose, of, this, trial, is, ...
Name: tokens, dtype: object
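
The token lists above still contain whitespace runs and trailing punctuation (`'patients.'`, `'BMS-986205.'`), because a `Tokenizer` built from only `nlp.vocab` has no prefix, suffix, or infix rules and splits on whitespace alone. One lightweight alternative, shown as a sketch and not as what spaCy does internally, is a regular-expression tokenizer. The pattern and the name `simple_tokenize` are assumptions made for illustration.

```python
import re

# Runs of letters/digits, optionally joined by internal hyphens or apostrophes
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*")

def simple_tokenize(text):
    """Lowercase the text and return word-like tokens, ignoring whitespace and punctuation."""
    return TOKEN_PATTERN.findall(text.lower())

print(simple_tokenize("\n      DHA on Breath Holding Spells\n    "))
# ['dha', 'on', 'breath', 'holding', 'spells']
print(simple_tokenize("pharmacokinetics of BMS-986205."))
# ['pharmacokinetics', 'of', 'bms-986205']
```

Keeping internal hyphens preserves identifiers such as drug codes as single tokens while still dropping the sentence-final period.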

## Stopword Removal

In [32]:
STOP_WORDS = nlp.Defaults.stop_words.union(['I','\n      '])

In [33]:
updated_tokens = []

# Keep only the tokens that are not in STOP_WORDS (the match is exact and case-sensitive)
for doc_tokens in df['tokens']:
    updated_tokens.append([token for token in doc_tokens if token not in STOP_WORDS])

df['tokens'] = updated_tokens
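
Membership in `STOP_WORDS` is exact and case-sensitive, so capitalized words such as "The" and whitespace tokens of a different length than the one listed pass through the filter. A more forgiving filter might look like the sketch below. The small `stop_words` set stands in for spaCy's list so the example runs on its own, and `remove_stopwords` is an illustrative name.

```python
def remove_stopwords(tokens, stop_words):
    """Drop stopwords case-insensitively and discard whitespace-only tokens."""
    kept = []
    for token in tokens:
        if not token.strip():
            # Whitespace-only token such as '\n    '
            continue
        if token.lower() in stop_words:
            continue
        kept.append(token)
    return kept

stop_words = {"the", "of", "is", "to", "this", "on"}   # stand-in for nlp.Defaults.stop_words
example = ['\n      ', 'The', 'purpose', 'of', 'this', 'trial', 'is', 'to', 'evaluate', '\n    ']
print(remove_stopwords(example, stop_words))
# ['purpose', 'trial', 'evaluate']
```

The original casing of kept tokens is preserved; only the comparison is lowercased.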

In [34]:
df['tokens'].sample(5)

7117      [Pupillary, diameter, monitoring, currently, r...
53101     [Invasive, mechanical, ventilation, (IMV), ass...
84439     [Alzheimer's, disease, (AD), prevalent, neurod...
109930    [AC0010MA, new,, irreversible,, Epidermal, Gro...
13154     [Contusive, cervical, spinal, cord, injury, (c...
Name: tokens, dtype: object
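
Frequency-based stopword removal, listed in the assignment, can be approached in several ways. One common option, sketched here under assumptions, is to drop tokens that occur in more than a chosen share of documents. The threshold `max_doc_frac`, the helper name `frequent_tokens`, and the toy documents are all illustrative.

```python
from collections import Counter

def frequent_tokens(docs, max_doc_frac=0.5):
    """Return the tokens that appear in more than max_doc_frac of the documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each token at most once per document
        doc_freq.update(set(doc))
    n_docs = len(docs)
    return {tok for tok, n in doc_freq.items() if n / n_docs > max_doc_frac}

docs = [
    ["study", "assess", "role", "spinal", "cord", "injury"],
    ["dha", "breath", "holding", "spells"],
    ["purpose", "study", "assess", "effect", "omeprazole"],
    ["purpose", "study", "trial", "evaluate", "safety"],
]
extra_stops = frequent_tokens(docs, max_doc_frac=0.5)
filtered = [[tok for tok in doc if tok not in extra_stops] for doc in docs]
print(extra_stops)
# {'study'}
```

On a clinical-trial corpus, words like "study" or "patients" are likely candidates, but a suitable threshold has to be chosen by inspecting the actual counts.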

## Stemming

In [36]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Example word forms that share a root (not used below)
words = ["game", "gaming", "gamed", "games"]

stemmed_tokens = []

# Stem every token of every document
for doc_tokens in df['tokens']:
    stemmed_tokens.append([ps.stem(token) for token in doc_tokens])

df['Stemmed Tokens'] = stemmed_tokens

In [39]:
df['Stemmed Tokens'].head()

0    [j-valv, TF, compassion, use, case, approv, ca...
1    [the, studi, design, assess, role, function, n...
2                   [dha, breath, hold, spell, \n    ]
3    [the, purpos, studi, assess, effect, multipl, ...
4    [the, purpos, trial, evalu, safeti, efficaci, ...
Name: Stemmed Tokens, dtype: object
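
Once tokens are normalized, the token-frequency analysis the assignment asks for reduces to counting. The sketch below uses `collections.Counter` and prints a plain-text bar chart. The input lists are the visible prefixes of the stemmed rows above, copied by hand, so the counts are illustrative only.

```python
from collections import Counter

# Hand-copied prefixes of the first five stemmed rows (the full rows are truncated in the output)
stemmed_docs = [
    ["j-valv", "TF", "compassion", "use", "case", "approv"],
    ["the", "studi", "design", "assess", "role", "function"],
    ["dha", "breath", "hold", "spell"],
    ["the", "purpos", "studi", "assess", "effect", "multipl"],
    ["the", "purpos", "trial", "evalu", "safeti", "efficaci"],
]

# Flatten all documents into one stream of tokens and count them
counts = Counter(token for doc in stemmed_docs for token in doc)

# One '#' per occurrence, most frequent tokens first
for token, n in counts.most_common(4):
    print(f"{token:<10} {'#' * n} ({n})")
```

Here "the" comes out on top, a sign that capitalized "The" slipped past the case-sensitive stopword filter before being lowercased by the stemmer.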

## Lemmatization

In [None]:
lemmas = []

# Run the spaCy pipeline on each cleaned description and keep every token's lemma
for text in df['clean_desc']:
    doc = nlp(text)
    lemmas.append([token.lemma_ for token in doc])

df['lemmas'] = lemmas

In [52]:
df['lemmas'].head()

this
be
the
start
of
-PRON-
nlp
adventure
.
-PRON-
start
here
with
spacy
.
