
## Importing Libraries

In [1]:
#loading the dataset
import pandas as pd

#string manipulations
import string
import re

#nltk libraries and corpus
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
# Download only the resources this notebook uses.
# (A bare nltk.download() opens the interactive downloader and blocks the cell.)
# Newer NLTK releases may also need 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')  # model used by pos_tag

True

## Step 1: Reading Data
- Read the dataset in an appropriate data structure

In [0]:
def loadData(filename):
    #loading the dataset into a dataframe 
    data = pd.read_csv(filename)

    #size of the dataframe
    print("\nDimensions of Dataset:",data.shape)

    #visualising the first few rows for the dataset (5 by default)
    print(data.head())
    
    return data

### Additional processing for Dataset 1

## Step 2: Pre-processing
- Preprocessing brings the text into a form that is predictable and analyzable for your task. It may include lowercasing, stemming, lemmatization, and stop-word removal. Tokenize the documents carefully to extract individual terms.

### Step 2(a): Removing HTML Tags
- Downloaded datasets often contain HTML tags, so removing them is an important first step.

In [0]:
def remTags(text):
    # Using regular expressions to match html tags and replace them with ''
    updated = re.sub('<[^<]+?>', '', text)
    print("Tags removed >>", updated)
    return updated
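
One thing the regular expression above does not handle is HTML entities such as `&amp;`, which remain in the text after the tags are gone. The following is a minimal sketch, assuming entity decoding is wanted; `strip_html` is a hypothetical helper name, not part of the original notebook.

```python
import html
import re


def strip_html(text):
    """Remove HTML tags, then decode entities (hypothetical helper)."""
    # Same pattern as remTags: anything between '<' and the next '>'
    no_tags = re.sub(r"<[^<]+?>", "", text)
    # html.unescape turns '&amp;' into '&', '&lt;' into '<', and so on
    return html.unescape(no_tags)


sample = "<p>Tom &amp; Jerry are <b>friends</b></p>"
cleaned = strip_html(sample)
print(cleaned)
```

Decoding happens after tag removal so that an escaped `&lt;b&gt;` in the source text is kept as literal text rather than stripped as a tag.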

### Step 2(b): Removing Punctuation
- Dataset 1 contains two columns, 'Title' and 'Body', which can be processed separately or combined first, depending on the user's requirements.

In [0]:
def remPunctuations(text):
    # table is a translation table for removing the punctuation marks from the words
    table = str.maketrans({key: None for key in string.punctuation})
    translated = text.translate(table)
    print("Punctuations removed >>", translated)
    return translated
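
The two ways of handling the 'Title' and 'Body' columns can be sketched as follows. The two-row DataFrame is made-up sample data, and the variable names are illustrative assumptions; only the column names come from the notebook.

```python
import string

import pandas as pd

# Hypothetical sample with the 'Title' and 'Body' columns described above
df = pd.DataFrame({
    "Title": ["Hello, world!", "What's new?"],
    "Body": ["First post.", None],
})

# Translation table that deletes every ASCII punctuation character
table = str.maketrans("", "", string.punctuation)

# Option 1: clean one column on its own
title_clean = df["Title"].fillna("").map(lambda s: s.translate(table))

# Option 2: combine the columns first, then clean the combined text once
combined = (df["Title"].fillna("") + " " + df["Body"].fillna("")).str.strip()
combined_clean = combined.map(lambda s: s.translate(table))

print(title_clean.tolist())
print(combined_clean.tolist())
```

Filling missing values with an empty string before concatenating avoids propagating `NaN` into the combined text.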

### Step 2(c): Tokenization function
- This function takes a string as input and returns a list of tokens

In [0]:
def tokenize(text):
    return word_tokenize(text)

### Step 2(d): Stop-words removal function 
- Remove stop words using NLTK's English stop-word list

In [0]:
def remStop(tokens):
    stop_words = set(stopwords.words('english'))
    filtered = [t for t in tokens if t not in stop_words]
    return filtered
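
NLTK's English stop-word list is lowercase, so a case-sensitive membership test keeps capitalized forms such as "The". The sketch below illustrates the difference; it uses a tiny hand-written set as a stand-in for `stopwords.words('english')` so that it runs without any downloads, and `remove_stop_words` is a hypothetical variant of `remStop`.

```python
# Tiny hand-written stand-in for stopwords.words('english') (assumption)
STOP_WORDS = {"the", "is", "on", "a"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Case-insensitive stop-word filter (hypothetical variant of remStop)."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["The", "cat", "is", "on", "the", "mat"]

# Case-sensitive test: "The" survives because only "the" is in the set
case_sensitive = [t for t in tokens if t not in STOP_WORDS]

# Case-insensitive test: both "The" and "the" are removed
case_insensitive = remove_stop_words(tokens)

print(case_sensitive)
print(case_insensitive)
```

Lowercasing the tokens before calling `remStop` achieves the same effect.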

### Step 2(e): Stemming, Lemmatization & Lowercasing
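- Functions that reduce each token to its stem, reduce each token to its lemma (dictionary root), and convert tokens to lower case. Note that `stem` and `lemmatize` return only the unique terms, so duplicates and the original token order are not preserved.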

In [0]:
def stem(tokens):
    ps = PorterStemmer()
    stemmed = [ps.stem(t) for t in tokens]
    return list(set(stemmed))

In [0]:
def lemmatize(tokens):
    lz = WordNetLemmatizer()
    lemmatized = [lz.lemmatize(t) for t in tokens]
    return list(set(lemmatized))

In [0]:
def toLower(tokens):
    return [t.lower() for t in tokens]
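
`stem` and `lemmatize` above deduplicate with `list(set(...))`, which does not keep the original token order. If order matters for a later step, one alternative is sketched below; `unique_in_order` is a hypothetical helper, not part of the original notebook.

```python
def unique_in_order(tokens):
    """Drop duplicates while keeping first-occurrence order (hypothetical helper)."""
    # dict keys preserve insertion order (guaranteed since Python 3.7)
    return list(dict.fromkeys(tokens))


# Made-up list standing in for the output of a stemmer
stemmed = ["run", "fast", "run", "dog", "fast"]
deduped = unique_in_order(stemmed)
print(deduped)
```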

## Step 3: POS Tagging


*   Function to assign a part-of-speech (POS) tag to each token
*   Pass the tokens of a document to the function; the returned value is a list of `(token, tag)` pairs, one for each token in that document



In [0]:
def POStagging(tokens):
    return nltk.pos_tag(tokens)
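
`nltk.pos_tag` returns a list of `(token, tag)` tuples, and NLTK's default English tagger uses Penn Treebank tags. The sketch below shows one way to filter such a list by tag. The tagged list is hand-written in that format so the example runs without the tagger model, and `filter_by_tag_prefix` is a hypothetical helper.

```python
# Hand-written sample in the (token, tag) format returned by nltk.pos_tag
tagged = [
    ("Apple", "NNP"),
    ("is", "VBZ"),
    ("buying", "VBG"),
    ("a", "DT"),
    ("startup", "NN"),
]


def filter_by_tag_prefix(tagged_tokens, prefix):
    """Keep tokens whose tag starts with the given prefix (hypothetical helper)."""
    return [token for token, tag in tagged_tokens if tag.startswith(prefix)]


nouns = filter_by_tag_prefix(tagged, "NN")  # NN, NNS, NNP, NNPS
verbs = filter_by_tag_prefix(tagged, "VB")  # VB, VBD, VBG, VBN, VBP, VBZ
print(nouns)
print(verbs)
```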

## Step 5: Named Entity Recognition (NER) 
(also known as entity identification, entity chunking and entity extraction) 

**What is an entity?**

A named entity is a real-world object, such as a person, location, organization, or product, that can be denoted by a proper name. For example, New York City is an instance of a place.

**Use cases:**

*   Creating related tags or linking to relevant topics
*   Understanding context in search and recommendations




---



This step uses another NLP library, **[spaCy](https://spacy.io/usage)**.

In [10]:
# Install and import spacy
!pip install -U spacy
import spacy
print("\n\nSpacy with version=={} imported.".format(spacy.__version__))

Requirement already up-to-date: spacy in /usr/local/lib/python3.6/dist-packages (2.2.4)


Spacy with version==2.2.4 imported.


**Language Model for spaCy:**
A language model is the context provider for a particular language. It contains information such as how words are related and which tags and entity types apply to them.

In [11]:
# Download the language model for spaCy
# Note: the 'en' shortcut exists only in spaCy 2.x (this notebook ran 2.2.4);
# spaCy 3+ requires the full package name 'en_core_web_sm'.
!python -m spacy download en
spacy_model = spacy.load('en')  ## spacy_model = spacy.load("en_core_web_sm")
print("\n\nSpacy model imported at: {}".format(spacy_model))

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


Spacy model imported at: <spacy.lang.en.English object at 0x7fd40f46ef98>


In [12]:
doc1 = spacy_model(u"Apple is looking at buying U.K. startup for $1 billion.")
for ent in doc1.ents:
#     print(ent, type(ent))
    print("Token [{}] belongs to Entity type [{}] which occurs in the text from index [{}] to index [{}].".format(ent.text, ent.label_, ent.start_char, ent.end_char))

Token [Apple] belongs to Entity type [ORG] which occurs in the text from index [0] to index [5].
Token [U.K.] belongs to Entity type [GPE] which occurs in the text from index [27] to index [31].
Token [$1 billion] belongs to Entity type [MONEY] which occurs in the text from index [44] to index [54].
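
The `start_char` and `end_char` values in the output above are character offsets into the original string, with the end offset exclusive. The sketch below checks this and groups the entities by label, which is one way to build tags from NER output. The tuples are copied by hand from the output above so the example runs without spaCy; the variable names are illustrative assumptions.

```python
from collections import defaultdict

text = "Apple is looking at buying U.K. startup for $1 billion."

# (text, label, start_char, end_char) tuples copied from the output above
entities = [
    ("Apple", "ORG", 0, 5),
    ("U.K.", "GPE", 27, 31),
    ("$1 billion", "MONEY", 44, 54),
]

# Slicing the source text with the offsets recovers each entity span
spans = [text[start:end] for _, _, start, end in entities]

# Group entity texts by their label
by_label = defaultdict(list)
for ent_text, label, _, _ in entities:
    by_label[label].append(ent_text)

print(spans)
print(dict(by_label))
```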


In [13]:
# Splitting the text into sentences
doc2 = spacy_model(u"""In the expression named entity, the word named restricts
                   the task to those entities for which one or many strings,
                   such as words or phrases, stands (fairly) consistently for
                   some referent. This is closely related to rigid designators,
                   as defined by Kripke[3][4], although in practice NER deals
                   with many names and referents that are not philosophically
                   'rigid'. For instance, the automotive company created by
                   Henry Ford in 1903 can be referred to as Ford or Ford Motor
                   Company, although 'Ford' can refer to many other entities as
                   well (see Ford). Rigid designators include proper names as
                   well as terms for certain biological species and substances,
                   but exclude pronouns (such as 'it'; see coreference 
                   resolution), descriptions that pick out a referent by its 
                   properties (see also De dicto and de re), and names for kinds
                   of things as opposed to individuals (for example 'Bank').""")
sentences = list(doc2.sents)
print("This is the first sentence: [{}]".format(sentences[0]))
sentences

This is the first sentence: [In the expression named entity, the word named restricts
                   the task to those entities for which one or many strings,
                   such as words or phrases, stands (fairly) consistently for
                   some referent.]


[In the expression named entity, the word named restricts
                    the task to those entities for which one or many strings,
                    such as words or phrases, stands (fairly) consistently for
                    some referent.,
 This is closely related to rigid designators,
                    as defined by Kripke[3][4], although in practice NER deals
                    with many names and referents that are not philosophically
                    'rigid'.,
 For instance, the automotive company created by
                    Henry Ford in 1903 can be referred to as Ford or Ford Motor
                    Company, although 'Ford' can refer to many other entities as
                    ,
 well (see Ford).,
 Rigid designators include proper names as
                    well as terms for certain biological species and substances,
                    but exclude pronouns (such as 'it'; see coreference 
                    resolution), descriptions that pick out a refe