# **Milestone 1: Text Processing using spaCy**


### **Setting up the environment**

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


### **Importing the required modules**

In [2]:
# import libraries
import json
import spacy

### **Getting the data**

In [3]:
DATA_DIR = '/content/drive/MyDrive/SearchToolwNLP/01_Text_Search_spaCy_and_scikit-learn/data/'

In [4]:
# load a spacy language model
nlp = spacy.load("en_core_web_sm")

In [5]:
# load the json file
with open(DATA_DIR + 'data.json', 'r', encoding='utf-8') as infile:
    summaries = json.load(infile)

### **Inspecting the dataset**

In [6]:
# len of the list
print(f'The dataset comprises a list of {len(summaries)} dicts')

The dataset comprises a list of 26 dicts


In [7]:
# get the keys
print(f'Each entry contains the following {summaries[0].keys()}')

Each entry contains the following dict_keys(['title', 'text', 'url'])
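
Looking only at `summaries[0]` assumes that every entry has the same shape. A minimal sketch of a consistency check is shown below. The helper name `find_malformed` and the sample data are hypothetical; the expected keys are the ones printed above.

```python
EXPECTED_KEYS = {"title", "text", "url"}  # keys observed in the first entry

def find_malformed(entries, expected_keys=EXPECTED_KEYS):
    """Return the indices of entries whose keys differ from expected_keys."""
    return [
        i for i, entry in enumerate(entries)
        if set(entry.keys()) != set(expected_keys)
    ]

# hypothetical sample data standing in for `summaries`
sample = [
    {"title": "A", "text": "first text", "url": "https://example.org/a"},
    {"title": "B", "text": "second text"},  # missing 'url'
]
print(find_malformed(sample))
```

In the notebook, `find_malformed(summaries)` returning an empty list would suggest that all entries can be processed the same way.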


In [8]:
# print the first entry
print(summaries[0]['title'])
print('---')
print(summaries[0]['text'])
print('---')
print(summaries[0]['url'])

Pandemic
---
A pandemic (from Greek πᾶν, pan, "all" and δῆμος, demos, "people") is an epidemic of an infectious disease that has spread across a large region, for instance multiple continents or worldwide, affecting a substantial number of people. A widespread endemic disease with a stable number of infected people is not a pandemic. Widespread endemic diseases with a stable number of infected people such as recurrences of seasonal influenza are generally excluded as they occur simultaneously in large regions of the globe rather than being spread worldwide.
Throughout human history, there have been a number of pandemics of diseases such as smallpox and tuberculosis. The most fatal pandemic in recorded history was the Black Death (also known as The Plague), which killed an estimated 75–200 million people in the 14th century. The term was not used yet but was for later pandemics including the 1918 influenza pandemic (Spanish flu). Current pandemics include COVID-19 (SARS-CoV-2) and HIV/A

### **Cleaning the dataset**

In [9]:
# get the text content
text = summaries[0]['text']

# create a doc object
doc = nlp(text.lower())

# explore the attributes of each token returned by spaCy
print(doc[:20])
print('--------------------------------')
for token in doc[:5]:
    print(token.text) 
    print(token.pos_) 
    print(token.dep_)
    print('---')

a pandemic (from greek πᾶν, pan, "all" and δῆμος, demos, "people"
--------------------------------
a
DET
det
---
pandemic
NOUN
nsubj
---
(
PUNCT
punct
---
from
ADP
prep
---
greek
ADJ
amod
---


In [10]:
# identify unclassified tokens
unclassified_tokens = [(token.lemma_, token.dep_) for token in doc if token.dep_ == '']
unclassified_tokens[:10]

[('\n', '')]
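
The only unclassified token here is the newline that separates the paragraphs. The notebook drops such tokens later with the `if token.dep_` filter. An optional alternative, sketched below under the assumption that paragraph breaks are not needed downstream, is to collapse whitespace before the text reaches `nlp`. The helper name `normalize_whitespace` is hypothetical and not part of spaCy.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())

raw = "first paragraph.\nsecond   paragraph.\n"
print(normalize_whitespace(raw))
```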

In [11]:
# remove stop words and punctuation
token_without_sw = [word for word in doc if not word.is_stop and not word.is_punct]
token_without_sw[:10]

[pandemic,
 greek,
 πᾶν,
 pan,
 δῆμος,
 demos,
 people,
 epidemic,
 infectious,
 disease]

In [12]:
# lemmatize the remaining tokens, skipping those without a dependency label
token_lemmas = [token.lemma_ for token in token_without_sw if token.dep_]
token_lemmas[:10]

['pandemic',
 'greek',
 'πᾶν',
 'pan',
 'δῆμος',
 'demos',
 'people',
 'epidemic',
 'infectious',
 'disease']

In [13]:
# build a tokenizer function
def tokenizer(document):
    """
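    This function accepts a text string and:
    1. Lowercases it
    2. Removes stop words, punctuation, and tokens without a dependency label (e.g. newlines)
    3. Lemmatizes the remaining tokens and returns the lemmas as a list of strings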
    """
    doc = nlp(document.lower())
    token_without_sw = [word for word in doc if not word.is_stop and not word.is_punct]
    token_lemmas = [token.lemma_ for token in token_without_sw if token.dep_]  

    return token_lemmas
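
The function above handles one string at a time. A minimal sketch of applying a tokenizer to every entry could look like the following. `add_tokens`, the `'tokens'` key, and the whitespace stand-in tokenizer are all hypothetical; the stand-in is used only so the example runs without a spaCy model. In the notebook, the `tokenizer` function defined above would be passed in its place.

```python
def add_tokens(entries, tokenize, key="tokens"):
    """Return copies of the entries with tokenize(entry['text']) stored under `key`."""
    return [{**entry, key: tokenize(entry["text"])} for entry in entries]

def whitespace_tokenizer(document):
    """Stand-in for the spaCy-based tokenizer: lowercase, then split on whitespace."""
    return document.lower().split()

# hypothetical sample data standing in for `summaries`
sample = [{"title": "Pandemic", "text": "A pandemic is an epidemic", "url": ""}]
tokenized = add_tokens(sample, whitespace_tokenizer)
print(tokenized[0]["tokens"])
```

Returning copies rather than mutating the entries keeps the loaded data intact, which makes it easier to rerun cells in any order.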

### **Saving the dataset**

In [14]:
# save the dataset to file (`summaries` is written as loaded; apply `tokenizer` to each text first if the tokens should be stored)
with open(DATA_DIR + 'summaries.json', 'w') as outfile:
    json.dump(summaries, outfile)
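
Because the texts contain non-ASCII characters such as πᾶν, it can be worth confirming that a save and load round trip preserves them. The sketch below writes to a temporary directory rather than `DATA_DIR`, and the helper names `save_json` and `load_json` are hypothetical. Passing `ensure_ascii=False` is optional: it keeps the characters readable in the file instead of writing `\uXXXX` escapes.

```python
import json
import tempfile
from pathlib import Path

def save_json(data, path):
    """Write data as UTF-8 JSON, keeping non-ASCII characters unescaped."""
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(data, outfile, ensure_ascii=False)

def load_json(path):
    """Read UTF-8 JSON back into Python objects."""
    with open(path, "r", encoding="utf-8") as infile:
        return json.load(infile)

# hypothetical sample data standing in for `summaries`
sample = [{"title": "Pandemic", "text": "from Greek πᾶν, pan", "url": ""}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "summaries.json"
    save_json(sample, path)
    restored = load_json(path)
    raw_text = path.read_text(encoding="utf-8")

print(restored == sample)
```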