# References

- [1] https://stackabuse.com/python-for-nlp-introduction-to-the-pattern-library/
- [2] [NLP Tutorial 3 - Extract Text from PDF Files in Python for NLP | PDF Writer and Reader in Python](https://youtu.be/_VSX7yd-zPE)
- [3] https://analyticsindiamag.com/hands-on-guide-to-pattern-a-python-tool-for-effective-text-processing-and-data-mining/
- [4] [General Comparison between different Python NLP Libraries](https://medium.com/towards-artificial-intelligence/natural-language-processing-nlp-with-python-tutorial-for-beginners-1f54e610a1a0)
- [5] https://textminingonline.com/getting-started-with-pattern

# Intro

- Pattern is a multipurpose library that handles the following tasks: [1]
 - NLP: tokenization, stemming, POS tagging, sentiment analysis, etc.
 - Data Mining: provides APIs to mine data from sites such as Twitter, Facebook, and Wikipedia
 - ML: contains ML models such as SVM, KNN, and perceptron, which can be used for classification, regression, and clustering tasks
- Even though it is not as popular as spaCy or NLTK, it offers functionality that other NLP libraries lack, such as finding superlatives and comparatives and detecting fact versus opinion [1]

In [2]:
## installation
# !pip install pattern

Collecting pattern
  Downloading Pattern-3.6.0.tar.gz (22.2 MB)
Collecting backports.csv
  Downloading backports.csv-1.0.7-py2.py3-none-any.whl (12 kB)
Collecting mysqlclient
  Downloading mysqlclient-2.0.3.tar.gz (88 kB)
Collecting feedparser
  Downloading feedparser-6.0.2-py3-none-any.whl (80 kB)

# Python for NLP: Introduction to the Pattern Library [1]

## Pattern Library Functions for NLP

### Tokenizing, POS Tagging, and Chunking

In [4]:
from pattern.en import parse
from pattern.en import pprint

In [6]:
pprint(parse('I drove my car to the hospital yesterday', relations=True, lemmata=True))

          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA       
                                                                 
             I   PRP    NP      SBJ    1      -      i           
         drove   VBD    VP      -      1      -      drive       
            my   PRP$   NP      OBJ    1      -      my          
           car   NN     NP ^    OBJ    1      -      car         
            to   TO     -       -      -      -      to          
           the   DT     NP      -      -      -      the         
      hospital   NN     NP ^    -      -      -      hospital    
     yesterday   NN     NP ^    -      -      -      yesterday   


In [7]:
print(parse('I drove my car to the hospital yesterday', relations=True, lemmata=True).split())

[[['I', 'PRP', 'B-NP', 'O', 'NP-SBJ-1', 'i'], ['drove', 'VBD', 'B-VP', 'O', 'VP-1', 'drive'], ['my', 'PRP$', 'B-NP', 'O', 'NP-OBJ-1', 'my'], ['car', 'NN', 'I-NP', 'O', 'NP-OBJ-1', 'car'], ['to', 'TO', 'O', 'O', 'O', 'to'], ['the', 'DT', 'B-NP', 'O', 'O', 'the'], ['hospital', 'NN', 'I-NP', 'O', 'O', 'hospital'], ['yesterday', 'NN', 'I-NP', 'O', 'O', 'yesterday']]]
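
The nested list above is organized as sentences, then tokens, then per-token fields. As a minimal sketch, assuming the field order seen in that output (word first, POS tag second, lemma last), the pairs can be pulled out with plain Python. The helper name `word_tag_pairs` is hypothetical and not part of Pattern.

```python
# Nested structure copied from the output above: sentences -> tokens -> fields.
parsed = [[
    ['I', 'PRP', 'B-NP', 'O', 'NP-SBJ-1', 'i'],
    ['drove', 'VBD', 'B-VP', 'O', 'VP-1', 'drive'],
    ['my', 'PRP$', 'B-NP', 'O', 'NP-OBJ-1', 'my'],
    ['car', 'NN', 'I-NP', 'O', 'NP-OBJ-1', 'car'],
    ['to', 'TO', 'O', 'O', 'O', 'to'],
    ['the', 'DT', 'B-NP', 'O', 'O', 'the'],
    ['hospital', 'NN', 'I-NP', 'O', 'O', 'hospital'],
    ['yesterday', 'NN', 'I-NP', 'O', 'O', 'yesterday'],
]]


def word_tag_pairs(sentences):
    """Return (word, POS tag) pairs, assuming word is field 0 and tag is field 1."""
    return [(token[0], token[1]) for sentence in sentences for token in sentence]


pairs = word_tag_pairs(parsed)
# Lemmas sit in the last field of each token in this particular output.
lemmas = [token[-1] for sentence in parsed for token in sentence]
print(pairs)
print(lemmas)
```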


### Pluralizing and Singularizing the Tokens

In [8]:
from pattern.en import pluralize, singularize

print(pluralize('leaf'))
print(singularize('theives'))

leaves
theife


### Converting Adjective to Comparative and Superlative Degrees

In [9]:
from pattern.en import comparative, superlative

print(comparative('good'))
print(superlative('good'))

better
best


### Finding N-Grams

In [10]:
from pattern.en import ngrams

print(ngrams("He goes to hospital", n=2))

[('He', 'goes'), ('goes', 'to'), ('to', 'hospital')]
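
To make the idea concrete, here is a standalone sketch of how n-grams can be built by sliding a window over the tokens. It is not Pattern's implementation: `ngrams_sketch` is a hypothetical name, and it splits on whitespace only, so punctuation handling will differ from `pattern.en.ngrams`.

```python
def ngrams_sketch(text, n=2):
    """Return consecutive n-token tuples from whitespace-separated text."""
    tokens = text.split()
    # Zip n shifted views of the token list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))


bigrams = ngrams_sketch("He goes to hospital", n=2)
print(bigrams)
```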


### Finding Sentiments

In [11]:
from pattern.en import sentiment

print(sentiment("This is an excellent movie to watch. I really love it"))

(0.75, 0.8)


Explanation:

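- 0.75 is the polarity score, which ranges from -1 (most negative) to 1 (most positive), so this sentence is highly positive
- 0.8 is the subjectivity score, which ranges from 0 (objective) to 1 (subjective), so the sentence mostly expresses the writer's personal opinion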
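
A small helper can turn the two numbers into readable labels. This is only a sketch: the function name `describe_sentiment` and both cut-offs (0.1 for polarity, 0.5 for subjectivity) are assumptions chosen for illustration, not values defined by Pattern.

```python
def describe_sentiment(polarity, subjectivity, neutral_band=0.1):
    """Map a (polarity, subjectivity) pair to coarse labels.

    neutral_band and the 0.5 subjectivity cut-off are illustrative assumptions.
    """
    if polarity > neutral_band:
        tone = 'positive'
    elif polarity < -neutral_band:
        tone = 'negative'
    else:
        tone = 'neutral'
    stance = 'subjective' if subjectivity > 0.5 else 'objective'
    return tone, stance


label = describe_sentiment(0.75, 0.8)
print(label)
```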

### Checking if a Statement is a Fact

In [12]:
from pattern.en import parse, Sentence
from pattern.en import modality

text = "Paris is the capital of France"
sent = parse(text, lemmata=True)
sent = Sentence(sent)

print(modality(sent))

1.0


In [13]:
text = "I think we can complete this task"
sent = parse(text, lemmata=True)
sent = Sentence(sent)

print(modality(sent))

0.25
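
The two scores above (1.0 for the statement about Paris, 0.25 for the "I think" sentence) can be turned into a yes/no decision with a threshold. The sketch below assumes a cut-off of 0.5; that value and the name `looks_like_fact` are assumptions for illustration and should be tuned for real data.

```python
def looks_like_fact(modality_score, threshold=0.5):
    """Treat scores above the (assumed) threshold as statements of fact."""
    return modality_score > threshold


# Scores copied from the two outputs above.
paris_is_fact = looks_like_fact(1.0)
opinion_is_fact = looks_like_fact(0.25)
print(paris_is_fact, opinion_is_fact)
```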


### Spelling Corrections

In [14]:
from pattern.en import suggest

print(suggest("Whitle"))

[('While', 0.6459209419680404), ('White', 0.2968881412952061), ('Title', 0.03280067283431455), ('Whistle', 0.023549201009251473), ('Chile', 0.0008410428931875525)]


In [15]:
from pattern.en import suggest
print(suggest("Fracture"))

[('Fracture', 1.0)]
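
Since `suggest` returns a list of `(word, confidence)` tuples, picking the most likely correction is a one-liner. The sketch below works on the list copied from the "Whitle" output above; `best_suggestion` is a hypothetical helper name.

```python
# Candidates copied from the suggest("Whitle") output above.
suggestions = [
    ('While', 0.6459209419680404),
    ('White', 0.2968881412952061),
    ('Title', 0.03280067283431455),
    ('Whistle', 0.023549201009251473),
    ('Chile', 0.0008410428931875525),
]


def best_suggestion(candidates):
    """Return the word with the highest confidence score."""
    word, _confidence = max(candidates, key=lambda pair: pair[1])
    return word


top_word = best_suggestion(suggestions)
print(top_word)
```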


### Working with Numbers

In [16]:
from pattern.en import number, numerals

print(number("one hundred and twenty two"))
print(numerals(256.390, round=2))

122
two hundred and fifty-six point thirty-nine


In [17]:
from pattern.en import quantify

print(quantify(['apple', 'apple', 'apple', 'banana', 'banana', 'banana', 'mango', 'mango']))

several bananas, several apples and a pair of mangoes


In [18]:
from pattern.en import quantify

print(quantify({'strawberry': 200, 'peach': 15}))
print(quantify('orange', amount=1200))

hundreds of strawberries and a number of peaches
thousands of oranges


## Pattern Library Functions for Data Mining

In [23]:
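# Workaround for SSL certificate errors on macOS when downloading file(s) from external sources.
# Note: this disables certificate verification for every HTTPS request in the session.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context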

### Accessing Web Pages

In [21]:
from pattern.web import download

page_html = download('https://en.wikipedia.org/wiki/Artificial_intelligence', unicode=True)

In [22]:
from pattern.web import URL, extension

page_url = URL('https://upload.wikimedia.org/wikipedia/commons/f/f1/RougeOr_football.jpg')
# Save the image under a name that keeps the URL's file extension
with open('football' + extension(page_url.page), 'wb') as file:
    file.write(page_url.download())
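
The cell above relies on `extension` to recover the file suffix from the URL. As a rough standard-library equivalent (an assumption about the behaviour being used here, not Pattern's code), the suffix can be derived with `urllib.parse` and `os.path`. The name `url_extension` is hypothetical.

```python
import os
from urllib.parse import urlparse


def url_extension(url):
    """Return the file extension (including the dot) of the URL's path."""
    path = urlparse(url).path
    return os.path.splitext(path)[1]


image_url = 'https://upload.wikimedia.org/wikipedia/commons/f/f1/RougeOr_football.jpg'
suffix = url_extension(image_url)
print('football' + suffix)
```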

### Finding URLs within Text

In [24]:
from pattern.web import find_urls

print(find_urls('To search anything, go to www.google.com', unique=True))

['www.google.com']
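
For comparison, a simplified regex-based sketch of URL extraction is shown below. It is not how `find_urls` is implemented: the pattern only recognizes links starting with `http://`, `https://`, or `www.`, and both `URL_PATTERN` and `find_urls_sketch` are hypothetical names.

```python
import re

# Simplified assumption: a URL starts with http(s):// or www. and runs until whitespace.
URL_PATTERN = re.compile(r'(?:https?://|www\.)[^\s<>"\']+')


def find_urls_sketch(text, unique=True):
    """Return URLs found in text, trimming trailing punctuation."""
    matches = [match.rstrip('.,;:!?)') for match in URL_PATTERN.findall(text)]
    if unique:
        # dict.fromkeys removes duplicates while keeping first-seen order.
        matches = list(dict.fromkeys(matches))
    return matches


found = find_urls_sketch('To search anything, go to www.google.com')
print(found)
```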


### Making Asynchronous Requests for Webpages

In [25]:
from pattern.web import asynchronous, time, Google

asyn_req = asynchronous(Google().search, 'artificial intelligence', timeout=4)
while not asyn_req.done:
    time.sleep(0.1)
    print('searching...')

print(asyn_req.value)

print(find_urls(asyn_req.value, unique=True))

searching...
searching...
searching...
searching...
searching...
searching...
searching...
searching...
searching...
searching...
[Result({'url': 'https://en.wikipedia.org/wiki/Artificial_intelligence', 'title': 'Artificial intelligence - Wikipedia', 'text': '<b>Artificial intelligence</b> (<b>AI</b>) is intelligence demonstrated by machines, unlike the <br>\nnatural intelligence displayed by humans and animals, which involves&nbsp;...'}), Result({'url': 'https://www.investopedia.com/terms/a/artificial-intelligence-ai.asp', 'title': 'Artificial Intelligence (AI) Definition', 'text': '... <b>Artificial intelligence</b> (<b>AI</b>) refers to the simulation of human intelligence in <br>\nmachines that are programmed to think like humans and mimic their&nbsp;...', 'date': 'Jan 6, 2021'}), Result({'url': 'https://builtin.com/artificial-intelligence', 'title': 'What is Artificial Intelligence? How Does AI Work? | Built In', 'text': '<b>Artificial intelligence</b> (<b>AI</b>) is wide-ranging 
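
The same poll-until-done pattern can be reproduced with the standard library, which may help when Pattern's search engines are unavailable. This sketch uses `concurrent.futures` rather than Pattern's `asynchronous`; `slow_search` is a hypothetical stand-in for a real network call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_search(query):
    """Hypothetical stand-in for a slow network request."""
    time.sleep(0.3)
    return ['result for ' + query]


with ThreadPoolExecutor(max_workers=1) as pool:
    # submit() returns immediately; the work runs in a background thread.
    future = pool.submit(slow_search, 'artificial intelligence')
    while not future.done():
        time.sleep(0.1)
        print('searching...')
    results = future.result()

print(results)
```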

### Getting Search Engine Results with APIs

#### Google

In [26]:
from pattern.web import Google

google = Google(license=None)
for search_result in google.search('artificial intelligence'):
    print(search_result.url)
    print(search_result.text)

https://en.wikipedia.org/wiki/Artificial_intelligence
<b>Artificial intelligence</b> (<b>AI</b>) is intelligence demonstrated by machines, unlike the <br>
natural intelligence displayed by humans and animals, which involves&nbsp;...
https://www.investopedia.com/terms/a/artificial-intelligence-ai.asp
... <b>Artificial intelligence</b> (<b>AI</b>) refers to the simulation of human intelligence in <br>
machines that are programmed to think like humans and mimic their&nbsp;...
https://builtin.com/artificial-intelligence
<b>Artificial intelligence</b> (<b>AI</b>) is wide-ranging branch of computer science concerned <br>
with building smart machines capable of performing tasks that typically require&nbsp;...
https://www.aaai.org/
AAAI advances the understanding of the mechanisms underlying thought and <br>
<b>intelligent</b> behavior and their embodiment in machines.
https://www.britannica.com/technology/artificial-intelligence
<b>Artificial intelligence</b>, the ability of a computer or com

#### Twitter

In [27]:
from pattern.web import Twitter

twitter = Twitter()
index = None
for j in range(3):
    for tweet in twitter.search('artificial intelligence', start=index, count=3):
        print(tweet.text)
        index = tweet.id

RT @Stevewal63: Artificial Intelligence Will Change The Way We Work Once We Get Back To The Office https://t.co/WmjSi4yt4B
RT @TheNextTech2018: 8 Tips to Use Artificial Intelligence (AI) in Mobile Apps

Read post: - https://t.co/HXoH8DeI5N

#artificialintelligence #mobileapps #artificialintelligenceai #tips #tip #intelligence #mobile #apps #illustration #fiction #animation #art #cartoons https://t.co/yOiUDnRuFR
RT @PhathaATM: @somadodafikeni Not at the time😅 until I discovered that actually the whole machine learning and to a larger extent the artificial intelligence knowledge depends on this understanding. Unbelievable!
RT @KevinClarity: “Automating Trading and Market Making With Artificial Intelligence” by @PoseysThumbs
https://t.co/bK0tcydBg9

#Machinelearning #100DaysOfCode #IoT #IIoT #Bigdata #100DaysOfMLCode #Python #flutter #cybersecurity #RStats #CodeNewbie #DataScience #DEVCommunity #RPA
RT @STPIBHOPAL: The use of data analytics, artificial intelligence, machine learnin

### Converting HTML Data to Plain Text

In [28]:
from pattern.web import URL, plaintext

html_content = URL('https://stackabuse.com/python-for-nlp-introduction-to-the-textblob-library/').download()
cleaned_page = plaintext(html_content.decode('utf-8'))
print(cleaned_page)

Python for NLP: Introduction to the TextBlob Library

Toggle navigation Stack Abuse

* JavaScript
* Python
* Java
* Jobs

Python for NLP: Introduction to the TextBlob Library

By

Usman Malik

•0 Comments

Introduction

This is the seventh article in my series of articles on Python for NLP. In my previous article, I explained how to perform topic modeling using Latent Dirichlet Allocation and Non-Negative Matrix factorization. We used the Scikit-Learn library to perform topic modeling.

In this article, we will explore TextBlob, which is another extremely powerful NLP library for Python. TextBlob is built upon NLTK and provides an easy to use interface to the NLTK library. We will see how TextBlob can be used to perform a variety of NLP tasks ranging from parts-of-speech tagging to sentiment analysis, and language translation to text classification.

The detailed download instructions for the library can be found at the official link. I would suggest that you install the TextBlob lib
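
To illustrate what HTML-to-text conversion involves, here is a minimal standard-library sketch that drops tags and skips `<script>` and `<style>` content. It is far simpler than Pattern's `plaintext` (no list bullets, no whitespace layout), and the names `TextExtractor` and `plaintext_sketch` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring script and style blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def plaintext_sketch(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.parts)


sample = '<h1>Title</h1><script>var x = 1;</script><p>Hello <b>world</b></p>'
cleaned = plaintext_sketch(sample)
print(cleaned)
```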

### Parsing PDF Documents

#### Using Pattern PDF module (doesn't work)

In [53]:
# # This doesn't work
# from pattern.web import URL, PDF

# pdf_doc = URL('http://demo.clab.cs.cmu.edu/NLP/syllabus_f18.pdf').download()
# # pdf_doc2 = URL('https://courses.cs.ut.ee/LTAT.01.001/2020_spring/uploads/Main/Lecture1_Introduction.pdf').download()
# print(PDF(pdf_doc2.decode('utf-8')))

#### Using PyPDF2 library [2]

In [40]:
## dependencies
# !pip install PyPDF2

In [51]:
import PyPDF2 as pdf

file = open('data/syllabus_f18.pdf', 'rb') # source: http://demo.clab.cs.cmu.edu/NLP/syllabus_f18.pdf
file

<_io.BufferedReader name='data/syllabus_f18.pdf'>

In [48]:
pdf_reader = pdf.PdfFileReader(file)
pdf_reader

<PyPDF2.pdf.PdfFileReader at 0x12fd6a6a0>

In [49]:
help(pdf_reader)

Help on PdfFileReader in module PyPDF2.pdf object:

class PdfFileReader(builtins.object)
 |  Initializes a PdfFileReader object.  This operation can take some time, as
 |  the PDF stream's cross-reference tables are read into memory.
 |  
 |  :param stream: A File object or an object that supports the standard read
 |      and seek methods similar to a File object. Could also be a
 |      string representing a path to a PDF file.
 |  :param bool strict: Determines whether user should be warned of all
 |      problems and also causes some correctable problems to be fatal.
 |      Defaults to ``True``.
 |  
 |  Methods defined here:
 |  
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  cacheGetIndirectObject(self, generation, idnum)
 |  
 |  cacheIndirectObject(self, generation, idnum, obj)
 |  
 |  decrypt(self, password)
 |      When using an encrypted / secured PDF file with the PDF Standard
 |      encryption 

In [50]:
pdf_reader.getIsEncrypted()

False

`getIsEncrypted()` returns `False`, so this PDF is not encrypted.

In [54]:
pdf_reader.getNumPages()

4

In [56]:
page1 = pdf_reader.getPage(0)
page1

{'/Type': '/Page',
 '/Contents': {'/Filter': '/FlateDecode'},
 '/Resources': {'/Font': {'/F16': {'/Type': '/Font',
    '/Subtype': '/Type1',
    '/BaseFont': '/IFHENN+CMR17',
    '/FontDescriptor': {'/Type': '/FontDescriptor',
     '/FontName': '/IFHENN+CMR17',
     '/Flags': 4,
     '/FontBBox': [-33, -250, 945, 749],
     '/Ascent': 694,
     '/CapHeight': 683,
     '/Descent': -195,
     '/ItalicAngle': 0,
     '/StemV': 53,
     '/XHeight': 430,
     '/CharSet': '/L/N/P/S/a/b/c/colon/e/g/i/l/n/o/r/s/t/u/y',
     '/FontFile': {'/Length1': 1629,
      '/Length2': 9058,
      '/Length3': 0,
      '/Filter': '/FlateDecode'}},
    '/FirstChar': 58,
    '/LastChar': 121,
    '/Widths': [249.6,
     249.6,
     249.6,
     719.8,
     432.5,
     432.5,
     719.8,
     693.3,
     654.3,
     667.6,
     706.6,
     628.2,
     602.1,
     726.3,
     693.3,
     327.6,
     471.5,
     719.4,
     576,
     850,
     693.3,
     719.8,
     628.2,
     719.8,
     680.5,
     510.9,
   

In [57]:
page1.extractText()

"NaturalLanguageProcessing:Syllabus\nAlanW.Black&DavidR.Mortensen\nCarnegieMellonUniversity\nFall2018\nInstructors:\nProf.AlanWBlack(\nawb@cs.cmu.edu\n)andDavidR.Mortensen(\ndmortens@\ncs.cmu.edu\n)\nTeachingassistants:\nFatimaAl-Raisi(\nfraisi@andrew.cmu.edu\n),ManishaChaurasia\n(\nmchauras@andrew.cmu.edu\n),PoojaChitkara(\npchitkar@andrew.cmu.\nedu\n),SarveshwaranDhansekar(\nsarveshd@andrew.cmu.edu\n)\nLecturetime:\nTuesdays&Thursdays,3:00{4:20\nLocation:\nWEH4623\nWebpage:\nhttp://demo.clab.cs.cmu.edu/NLP/\nFacultyehours:\nByappointment(Black);\nByappointmentat\nhttps://davidmortensen.youcanbook.me\n(Mortensen)\nTAehours:\nTBA\n1Summary\nThiscourseisaboutavarietyofwaystorepresenthumanlanguages(likeEnglishandChinese)as\ncomputationalsystems,andhowtoexploitthoserepresentationstowriteprogramsthatdouseful\nthingswithtextandspeechdata,liketranslation,summarization,extractinginformation,question\nanswering,naturalinterfacestodatabases,andconversationalagents.\nThisiscalledNaturalLanguageP

In [59]:
page2 = pdf_reader.getPage(1)
page2.extractText()

'gram.Prerequisite:FundamentalDataStructuresandAlgorithms(15-211)orequivalent;strong\nprogrammingcapabilities.\n3Evaluation\nStudentswillbeevaluatedineways:\nExams(40%)\nonein-classmidtermon\nMarch\n(20%)andonecumulativeexam(20%),date\nTBD.\nProject(30%)\nasemester-long4-personteamproject(seebelow).\nHomeworkassignments(20%)\n7pencil-and-paperorsmallprogrammingproblemsgivenroughly\nweekly.\nQuizzes(10%)\n10Canvasquizzesgivenatthebeginningofmanylectures\n1\n.\nThelowest2homeworkgradesandthelowest3quizgradeswillbedropped.\nLatePolicy\nNoworkwillbeacceptedlate.Thegradingpolicyforpopquizzesandhomework\nassignmentspermitssomeslackofanadministrativelysimplerkindthandeductingpointsfor\nlatenessormissingalecture.\nAcademicHonesty\nExamsandpopquizzesaretobecompletedindividually.Verbalcollab-\norationonhomeworkassignmentsisacceptable,but(a)youmustnotshareanycodeorother\nwrittenmaterial,(b)everythingyouturninmustbeyourownwork,and(c)youmustnotethe\nnamesof\nanyone\nyoucollaboratedwithoneachproblem
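
Rather than calling `getPage` once per cell, the pages can be processed in a loop. The sketch below assumes the legacy PyPDF2 method names used in this notebook (`getNumPages`, `getPage`, `extractText`); newer PyPDF2 releases rename these, so check the installed version. `FakePage` and `FakeReader` are hypothetical stand-ins so the example runs without a PDF file.

```python
def extract_all_text(reader):
    """Join the text of every page, using the legacy method names from the cells above."""
    page_texts = []
    for index in range(reader.getNumPages()):
        page_texts.append(reader.getPage(index).extractText())
    return '\n'.join(page_texts)


# Hypothetical stand-ins so the sketch runs without a PDF file or PyPDF2 installed.
class FakePage:
    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class FakeReader:
    def __init__(self, texts):
        self._pages = [FakePage(text) for text in texts]

    def getNumPages(self):
        return len(self._pages)

    def getPage(self, index):
        return self._pages[index]


demo_text = extract_all_text(FakeReader(['page one', 'page two']))
print(demo_text)
```

With the real reader from the cells above, the call would be `extract_all_text(pdf_reader)`.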

### Clearing the Cache

In [60]:
from pattern.web import cache

cache.clear()