# Text Processing

## Table of Contents

- [Data](#Data)
- [Text Wrap](#Text-Wrap)
- [Counting](#Counting)
- [NLTK](#NLTK)

    - [NLTK Counting](#NLTK-Counting)
    - [Filter](#Filter)
    - [Line Tokenization](#Line-Tokenization)
    - [Non-English Tokenization](#Non-English-Tokenization)
    - [Word Tokenization](#Word-Tokenization)
    - [Stopwords](#Stopwords)
    - [Wordnet](#Wordnet)
    - [Corpora](#Corpora)
    - [Tagging Words](#Tagging-Words)
    - [Text Classification](#Text-Classification)
    - [Bigrams](#Bigrams)

- [Strings](#Strings)
- [Regex](#Regex)
- [Pretty Print](#Pretty-Print)
- [Capitalization](#Capitalization)
- [Spell Check](#Spell-Check)
- [PDFs](#PDFs)
- [Word Document](#Word-Document)
- [RSS Feed](#RSS-Feed)

Source: https://www.tutorialspoint.com/python_text_processing/index.htm

## Data

In [1]:
text1 = 'In late summer 1945, guests are gathered for the wedding reception of Don Vito Corleones daughter Connie (Talia Shire) and Carlo Rizzi (Gianni Russo). Vito (Marlon Brando), the head of the Corleone Mafia family, is known to friends and associates as Godfather. He and Tom Hagen (Robert Duvall), the Corleone family lawyer, are hearing requests for favors because, according to Italian tradition, no Sicilian can refuse a request on his daughters wedding day.'

FileName = r"data\text.txt"

# The with-block closes the file automatically
with open(FileName, 'r') as file:
    data = file.readlines()

In [2]:
data

[' Summer is here.\n',
 '  Sky is bright.\n',
 '\tBirds are gone.\n',
 '\t Nests are empty.\n',
 '\t  Where is Rain?\n']

## Text Wrap

In [3]:
# The standard-library textwrap module provides both functions
from textwrap import wrap, dedent

x = wrap(text1, 30)
x

['In late summer 1945, guests',
 'are gathered for the wedding',
 'reception of Don Vito',
 'Corleones daughter Connie',
 '(Talia Shire) and Carlo Rizzi',
 '(Gianni Russo). Vito (Marlon',
 'Brando), the head of the',
 'Corleone Mafia family, is',
 'known to friends and',
 'associates as Godfather. He',
 'and Tom Hagen (Robert Duvall),',
 'the Corleone family lawyer,',
 'are hearing requests for',
 'favors because, according to',
 'Italian tradition, no Sicilian',
 'can refuse a request on his',
 'daughters wedding day.']

In [4]:
print("**Before Formatting**")
for line in data:
    print(line)

**Before Formatting**
 Summer is here.

  Sky is bright.

	Birds are gone.

	 Nests are empty.

	  Where is Rain?



In [5]:
print("**After Formatting**")
for line in data:
    # On a single line, dedent() removes the leading whitespace; strip() drops the trailing newline
    dedented_text = dedent(line).strip()
    print(dedented_text)

**After Formatting**
Summer is here.
Sky is bright.
Birds are gone.
Nests are empty.
Where is Rain?
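
As a side note, `dedent` is mainly meant for whole blocks: it removes only the whitespace common to every line, so relative indentation survives. The following is a minimal sketch using the standard-library `textwrap`; the `block` string is made up for illustration and is not the contents of `data\text.txt`.

```python
from textwrap import dedent, fill

# Hypothetical indented block (not read from the data file)
block = """\
    Summer is here.
      Sky is bright.
    Birds are gone.
"""

# Only the 4-space margin shared by all lines is removed
dedented = dedent(block)
print(dedented)

# fill() wraps like wrap() but returns one newline-joined string
wrapped = fill("Summer is here. Sky is bright. Birds are gone.", width=20)
print(wrapped)
```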


## Counting

In [6]:
with open(FileName, 'r') as file:
    lines_in_file = file.read()
    print(lines_in_file)

 Summer is here.
  Sky is bright.
	Birds are gone.
	 Nests are empty.
	  Where is Rain?



In [7]:
print(lines_in_file.split())
print("Number of Words: " , len(lines_in_file.split()))

['Summer', 'is', 'here.', 'Sky', 'is', 'bright.', 'Birds', 'are', 'gone.', 'Nests', 'are', 'empty.', 'Where', 'is', 'Rain?']
Number of Words:  15
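
Note that `split()` leaves punctuation attached (`'here.'`, `'Rain?'`). One possible way to get word frequencies with only the standard library is sketched below. Here `sample` is a hand-typed copy of the file contents shown above, and trimming `string.punctuation` is a simplifying assumption rather than real tokenization.

```python
from collections import Counter
import string

# Hand-typed copy of the sample file contents
sample = (" Summer is here.\n  Sky is bright.\n\tBirds are gone.\n"
          "\t Nests are empty.\n\t  Where is Rain?\n")

# Trim punctuation from both ends of each whitespace-separated chunk and lowercase it
words = [w.strip(string.punctuation).lower() for w in sample.split()]
counts = Counter(words)

print(len(words))
print(counts.most_common(2))
```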


## NLTK

In [8]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Corey\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### NLTK Counting

In [9]:
nltk_tokens = nltk.word_tokenize(lines_in_file)
print(nltk_tokens)
print("Number of Words: " , len(nltk_tokens))

['Summer', 'is', 'here', '.', 'Sky', 'is', 'bright', '.', 'Birds', 'are', 'gone', '.', 'Nests', 'are', 'empty', '.', 'Where', 'is', 'Rain', '?']
Number of Words:  20


### Filter

In [10]:
word_data = "The Sky is blue also the ocean is blue also Rainbow has a blue colour." 

# First Word tokenization
nltk_tokens = nltk.word_tokenize(word_data)

In [11]:
# Applying Set
no_order = list(set(nltk_tokens))

In [12]:
no_order

['ocean',
 'the',
 '.',
 'also',
 'is',
 'colour',
 'blue',
 'Rainbow',
 'has',
 'a',
 'The',
 'Sky']

In [13]:
# Preserving Order
ordered_tokens = set()
result = []
for word in nltk_tokens:
    if word not in ordered_tokens:
        ordered_tokens.add(word)
        result.append(word)

In [14]:
result

['The',
 'Sky',
 'is',
 'blue',
 'also',
 'the',
 'ocean',
 'Rainbow',
 'has',
 'a',
 'colour',
 '.']
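
Since dictionaries keep insertion order (Python 3.7+), the same order-preserving de-duplication can likely be written in one line. In this sketch the `tokens` list is typed by hand to mirror the tokenizer output for the sentence above, so it runs without NLTK.

```python
# Hand-copied tokens for the sample sentence (assumed to match word_tokenize's output)
tokens = ['The', 'Sky', 'is', 'blue', 'also', 'the', 'ocean', 'is', 'blue',
          'also', 'Rainbow', 'has', 'a', 'blue', 'colour', '.']

# dict.fromkeys keeps the first occurrence of each key, in order
unique_in_order = list(dict.fromkeys(tokens))
print(unique_in_order)
```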

### Line Tokenization

In [15]:
sentence_data = "The First sentence is about Python. The Second: about Django. You can learn Python,Django and Data Analysis here. "
nltk_tokens = nltk.sent_tokenize(sentence_data)

In [16]:
nltk_tokens

['The First sentence is about Python.',
 'The Second: about Django.',
 'You can learn Python,Django and Data Analysis here.']
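
For comparison, a naive regex splitter gives the same three sentences on this input. This is only a rough sketch and not a substitute for the punkt model: it would wrongly split after abbreviations such as "Dr." because it looks only at the punctuation character. The sample text is modeled on the cell above.

```python
import re

sentence_data = ("The First sentence is about Python. The Second: about Django. "
                 "You can learn Python,Django and Data Analysis here. ")

# Split on whitespace that follows '.', '!' or '?'
naive_sentences = re.split(r"(?<=[.!?])\s+", sentence_data.strip())
print(naive_sentences)
```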

### Non-English Tokenization

In [17]:
german_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
german_tokens = german_tokenizer.tokenize('Wie geht es Ihnen?  Gut, danke.')

In [18]:
german_tokens

['Wie geht es Ihnen?', 'Gut, danke.']

### Word Tokenization

In [19]:
word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)

In [20]:
nltk_tokens

['It',
 'originated',
 'from',
 'the',
 'idea',
 'that',
 'there',
 'are',
 'readers',
 'who',
 'prefer',
 'learning',
 'new',
 'skills',
 'from',
 'the',
 'comforts',
 'of',
 'their',
 'drawing',
 'rooms']

### Stopwords

In [21]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Corey\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [22]:
stopwords.fileids()

['arabic',
 'azerbaijani',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

In [23]:
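# stopwords.words('english') returns only the English list.
# Called without an argument, words() concatenates the lists for every language,
# so the slice below depends on the installed corpus version.
stopwords.words()[620:680]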

['your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at']

In [24]:
en_stops = set(stopwords.words('english'))

all_words = ['There', 'is', 'a', 'tree','near','the','river']

for word in all_words: 
    if word not in en_stops:
        print(word)

There
tree
near
river
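
`There` survives the filter above because the membership test is case-sensitive, while the stopword entries shown earlier are lowercase. A small sketch of a case-insensitive variant follows; `demo_stops` is a tiny hand-picked set used so the example runs without downloading the corpus, not NLTK's actual list.

```python
# Tiny hand-picked stopword set for illustration; NLTK's English list is much longer
demo_stops = {"there", "is", "a", "the"}

all_words = ['There', 'is', 'a', 'tree', 'near', 'the', 'river']

# Compare the lowercased form, but keep the original spelling in the result
kept = [word for word in all_words if word.lower() not in demo_stops]
print(kept)
```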


### Wordnet

In [25]:
nltk.download('wordnet')
from nltk.corpus import wordnet

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Corey\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [26]:
wordnet.synset('locomotive.n.01').lemma_names()

['locomotive', 'engine', 'locomotive_engine', 'railway_locomotive']

In [27]:
wordnet.synset('locomotive.n.01').definition()

'a wheeled vehicle consisting of a self-propelled engine that is used to draw trains along railway tracks'

In [28]:
wordnet.synset('good.n.01').examples()

['for your own good', "what's the good of worrying?"]

In [29]:
wordnet.lemma('horizontal.a.01.horizontal').antonyms()

[Lemma('vertical.a.01.vertical'), Lemma('inclined.a.02.inclined')]

#### Synonyms and Antonyms

In [30]:
synonyms = []
antonyms = []

for syn in wordnet.synsets("Soil"):
    for lm in syn.lemmas():
        if lm.antonyms():
            antonyms.append(lm.antonyms()[0].name())
        else:
            synonyms.append(lm.name())

In [31]:
set(synonyms)

{'begrime',
 'bemire',
 'colly',
 'dirt',
 'filth',
 'grease',
 'grime',
 'ground',
 'grunge',
 'land',
 'soil',
 'stain',
 'territory'}

In [32]:
set(antonyms)

{'clean'}

### Corpora

In [33]:
nltk.download('gutenberg')
from nltk.corpus import gutenberg

fields = gutenberg.fileids()

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\Corey\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


In [34]:
fields

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [35]:
from nltk.tokenize import sent_tokenize

sample = gutenberg.raw("blake-poems.txt")

token = sent_tokenize(sample)

for para in range(2):
    print(token[para])

[Poems by William Blake 1789]

 
SONGS OF INNOCENCE AND OF EXPERIENCE
and THE BOOK of THEL


 SONGS OF INNOCENCE
 
 
 INTRODUCTION
 
 Piping down the valleys wild,
   Piping songs of pleasant glee,
 On a cloud I saw a child,
   And he laughing said to me:
 
 "Pipe a song about a Lamb!"
So I piped with merry cheer.


#### Frequency

In [36]:
# token holds sentences (from sent_tokenize above), so these are sentence counts
wlist = token[:50]

wordfreq = [wlist.count(w) for w in wlist]
# zip() returns a lazy iterator in Python 3; wrap it in list() to print the pairs
print("Pairs\n" + str(list(zip(wlist, wordfreq))))


In [37]:
nltk.download('brown')
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(
          (genre, word)
          for genre in brown.categories()
          for word in brown.words(categories=genre))

categories = ['hobbies', 'romance','humor']
searchwords = [ 'may', 'might', 'must', 'will']

cfd.tabulate(conditions=categories, samples=searchwords)

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Corey\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


          may might  must  will 
hobbies   131    22    83   264 
romance    11    51    45    43 
  humor     8     8     9    13 
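
Conceptually, a conditional frequency distribution is one counter per condition. The sketch below imitates that idea with the standard library only; the `pairs` list is made-up data standing in for the Brown corpus, and `cfd_sketch` is a hypothetical name, not part of NLTK.

```python
from collections import Counter, defaultdict

# Made-up (genre, word) pairs standing in for the Brown corpus
pairs = [
    ("hobbies", "will"), ("hobbies", "may"), ("hobbies", "will"),
    ("romance", "might"), ("romance", "will"),
]

# One Counter per condition, created on first use
cfd_sketch = defaultdict(Counter)
for genre, word in pairs:
    cfd_sketch[genre][word] += 1

# Print a small table: one row per genre, one column per search word
for genre in ("hobbies", "romance"):
    print(genre, [cfd_sketch[genre][w] for w in ("may", "might", "will")])
```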


### Tagging Words

In [38]:
nltk.download('averaged_perceptron_tagger')

text = nltk.word_tokenize("A Python is a serpent which eats eggs from the nest")
tagged_text = nltk.pos_tag(text)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Corey\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [39]:
tagged_text

[('A', 'DT'),
 ('Python', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('serpent', 'NN'),
 ('which', 'WDT'),
 ('eats', 'VBZ'),
 ('eggs', 'NNS'),
 ('from', 'IN'),
 ('the', 'DT'),
 ('nest', 'JJS')]

In [40]:
nltk.download('tagsets')

nltk.help.upenn_tagset('DT')
nltk.help.upenn_tagset('NNP')
nltk.help.upenn_tagset('VBZ')

DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
VBZ: verb, present tense, 3rd person singular
    bases reconstructs marks mixes displeases seals carps weaves snatches
    slumps stretches authorizes smolders pictures emerges stockpiles
    seduces fizzes uses bolsters slaps speaks pleads ...


[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\Corey\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


In [41]:
# Tagging a corpus (see Corpora above)
for i in token[:2]:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)
    print(tagged)

[('[', 'JJ'), ('Poems', 'NNP'), ('by', 'IN'), ('William', 'NNP'), ('Blake', 'NNP'), ('1789', 'CD'), (']', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('AND', 'NNP'), ('OF', 'NNP'), ('EXPERIENCE', 'NNP'), ('and', 'CC'), ('THE', 'NNP'), ('BOOK', 'NNP'), ('of', 'IN'), ('THEL', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('INTRODUCTION', 'NNP'), ('Piping', 'VBG'), ('down', 'RP'), ('the', 'DT'), ('valleys', 'NN'), ('wild', 'JJ'), (',', ','), ('Piping', 'NNP'), ('songs', 'NNS'), ('of', 'IN'), ('pleasant', 'JJ'), ('glee', 'NN'), (',', ','), ('On', 'IN'), ('a', 'DT'), ('cloud', 'NN'), ('I', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('child', 'NN'), (',', ','), ('And', 'CC'), ('he', 'PRP'), ('laughing', 'VBG'), ('said', 'VBD'), ('to', 'TO'), ('me', 'PRP'), (':', ':'), ('``', '``'), ('Pipe', 'VB'), ('a', 'DT'), ('song', 'NN'), ('about', 'IN'), ('a', 'DT'), ('Lamb', 'NN'), ('!', '.'), ("''", "''")]
[('So', 'RB'), ('I', 'PRP'), ('piped', 'VBD'), ('with', 'IN'), ('m

### Text Classification

In [42]:
nltk.download('movie_reviews')

# Let's see how the movie reviews are categorized
from nltk.corpus import movie_reviews

all_cats = [w.lower() for w in movie_reviews.categories()]

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Corey\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


In [43]:
all_cats

['neg', 'pos']

In [44]:
fields = movie_reviews.fileids()

sample = movie_reviews.raw("pos/cv944_13521.txt")

token = sent_tokenize(sample)
for lines in range(4):
    print(token[lines])

meteor threat set to blow away all volcanoes & twisters !
summer is here again !
this season could probably be the most ambitious = season this decade with hollywood churning out films like deep impact , = godzilla , the x-files , armageddon , the truman show , all of which has but = one main aim , to rock the box office .
leading the pack this summer is = deep impact , one of the first few film releases from the = spielberg-katzenberg-geffen's dreamworks production company .


In [45]:
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
print(all_words.most_common(10))

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822)]


### Bigrams

In [46]:
word_data = "The best performance can bring in sky high success."
nltk_tokens = nltk.word_tokenize(word_data)

list(nltk.bigrams(nltk_tokens))

[('The', 'best'),
 ('best', 'performance'),
 ('performance', 'can'),
 ('can', 'bring'),
 ('bring', 'in'),
 ('in', 'sky'),
 ('sky', 'high'),
 ('high', 'success'),
 ('success', '.')]
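
Bigrams are just adjacent pairs, so the same result can be sketched without NLTK by zipping a list with itself shifted by one. The `tokens` list here is typed by hand to mirror the tokenizer output above.

```python
# Hand-copied tokens for the sample sentence
tokens = ['The', 'best', 'performance', 'can', 'bring', 'in', 'sky', 'high', 'success', '.']

# Pair each token with its successor; zip stops at the shorter sequence
pairs = list(zip(tokens, tokens[1:]))
print(pairs)
```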

## Strings

In [47]:
for i, line in enumerate(data):
    print("Line No- ", i)
    print(line)

Line No-  0
 Summer is here.

Line No-  1
  Sky is bright.

Line No-  2
	Birds are gone.

Line No-  3
	 Nests are empty.

Line No-  4
	  Where is Rain?



In [48]:
[i.replace('\n', '') for i in data]

[' Summer is here.',
 '  Sky is bright.',
 '\tBirds are gone.',
 '\t Nests are empty.',
 '\t  Where is Rain?']
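
Two standard alternatives to `replace('\n', '')` are sketched below: `splitlines()` for a whole string and `rstrip("\n")` for a single line. Both leave the leading whitespace alone. The `raw` string is a shortened, hand-typed stand-in for the file contents.

```python
# Shortened, hand-typed stand-in for the file contents
raw = " Summer is here.\n  Sky is bright.\n\tBirds are gone.\n"

# splitlines() drops the line endings and keeps leading whitespace
lines = raw.splitlines()
print(lines)

# rstrip("\n") trims only trailing newlines from a single line
first = " Summer is here.\n".rstrip("\n")
print(repr(first))
```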

### Reverse file order

In [49]:
# data.reverse()
# print(data)

## Regex

### Emails

In [50]:
import re

text2 = ("Please contact us at contact@tutorialspoint.com for further information."
         " You can also give feedback at feedback@tp.com")


emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", text2)

In [51]:
emails

['contact@tutorialspoint.com', 'feedback@tp.com']

### URLs

In [52]:
text3 = ('Nowadays you can learn almost anything by just visiting http://www.google.com. But if you are completely new to computers or internet then first you need to learn those fundamentals. Next '
         'you can visit a good e-learning site like - https://www.tutorialspoint.com to learn further on a variety of subjects.')

# Raw string avoids invalid-escape warnings for \w and \d.
# Note: the character class includes '.', so a sentence-ending period is captured too.
urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', text3)

In [53]:
urls

['http://www.google.com.', 'https://www.tutorialspoint.com']
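
The first match above ends with the sentence's period. One simple, admittedly heuristic, cleanup is to strip trailing punctuation from each match. The sketch below uses a shortened sample sentence and the same pattern; the set of characters passed to `rstrip` is an assumption about what should never end a URL in running text.

```python
import re

text = ("Visit http://www.google.com. Then try "
        "https://www.tutorialspoint.com to learn more.")

raw_urls = re.findall(r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+", text)

# Drop punctuation that most likely belongs to the sentence, not the URL
clean_urls = [u.rstrip(".,;:!?") for u in raw_urls]
print(clean_urls)
```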

### Search

In [54]:
re.search("tor", "Tutorial")

<re.Match object; span=(2, 5), match='tor'>

In [55]:
not re.search("^tor", "Tutorial")

True

### Match

In [56]:
re.match("Tut", "Tutorial")

<re.Match object; span=(0, 3), match='Tut'>

In [57]:
not re.match("tor", "Tutorial")

True
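
To put the two functions side by side: `re.search` scans the whole string, `re.match` only tries position 0, and `re.fullmatch` requires the pattern to consume the entire string. A short sketch, with the `re.IGNORECASE` flag added as an extra illustration:

```python
import re

word = "Tutorial"

anywhere = re.search("tor", word)                    # scans the whole string
at_start = re.match("tor", word)                     # only tries position 0
whole = re.fullmatch("Tut", word)                    # must consume the entire string
ignore_case = re.match("tut", word, re.IGNORECASE)   # case-insensitive match at the start

print(anywhere.span(), at_start, whole, ignore_case.group())
```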

### Substitute

In [58]:
import random


def replace(t):
    inner_word = list(t.group(2))
    random.shuffle(inner_word)
    return t.group(1) + "".join(inner_word) + t.group(3)

text = "Hello, You should reach the finish line."

print(re.sub(r"(\w)(\w+)(\w)", replace, text))

Hello, You sohlud rcaeh the fiisnh lnie.
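
The example above passes a function as the replacement, and its output changes on every run because of the shuffle. When the rearrangement is fixed, a template string with backreferences is usually enough. This sketch reorders a month/day/year date; the target format is chosen only for illustration.

```python
import re

start_date = "9/23/2013"

# \1, \2, \3 refer to the month, day and year groups in the pattern
year_first = re.sub(r"(\d+)/(\d+)/(\d+)", r"\3-\1-\2", start_date)
print(year_first)
```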


### Constrained Search

In [59]:
text = "The web address is https://www.tutorialspoint.com"

# Taking "://" and "." to separate the groups 
result = re.search(r'([\w.-]+)://([\w.-]+)\.([\w.-]+)', text)

if result:
    print("The main web Address: ", result.group())
    print("The protocol: ", result.group(1))
    print("The domain name: ", result.group(2))
    print("The TLD: ", result.group(3))

The main web Address:  https://www.tutorialspoint.com
The protocol:  https
The domain name:  www.tutorialspoint
The TLD:  com
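
Numbered groups get hard to read as a pattern grows. The same search can be sketched with named groups; the group names `protocol`, `domain` and `tld` are chosen here for illustration.

```python
import re

pattern = re.compile(r"(?P<protocol>[\w.-]+)://(?P<domain>[\w.-]+)\.(?P<tld>[\w.-]+)")

match = pattern.search("The web address is https://www.tutorialspoint.com")

# groupdict() maps each group name to the text it captured
parts = match.groupdict()
print(parts)
```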


## Pretty Print

In [60]:
import pprint

student_dict = {'Name': 'Tusar', 'Class': 'XII', 
     'Address': {'FLAT ':1308, 'BLOCK ':'A', 'LANE ':2, 'CITY ': 'HYD'}}

In [61]:
student_dict

{'Name': 'Tusar',
 'Class': 'XII',
 'Address': {'FLAT ': 1308, 'BLOCK ': 'A', 'LANE ': 2, 'CITY ': 'HYD'}}

In [62]:
pprint.pprint(student_dict,width=-1)

{'Address': {'BLOCK ': 'A',
             'CITY ': 'HYD',
             'FLAT ': 1308,
             'LANE ': 2},
 'Class': 'XII',
 'Name': 'Tusar'}


In [63]:
emp = {"Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru" ],
   "Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5" ],   
   "StartDate":[ "1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
      "7/30/2013","6/17/2014"],
   "Dept":[ "IT","Operations","IT","HR","Finance","IT","Operations","Finance"] }

In [64]:
emp

{'Name': ['Rick', 'Dan', 'Michelle', 'Ryan', 'Gary', 'Nina', 'Simon', 'Guru'],
 'Salary': ['623.3', '515.2', '611', '729', '843.25', '578', '632.8', '722.5'],
 'StartDate': ['1/1/2012',
  '9/23/2013',
  '11/15/2014',
  '5/11/2014',
  '3/27/2015',
  '5/21/2013',
  '7/30/2013',
  '6/17/2014'],
 'Dept': ['IT',
  'Operations',
  'IT',
  'HR',
  'Finance',
  'IT',
  'Operations',
  'Finance']}

In [65]:
x = pprint.pformat(emp, indent=1)
print(x)

{'Dept': ['IT',
          'Operations',
          'IT',
          'HR',
          'Finance',
          'IT',
          'Operations',
          'Finance'],
 'Name': ['Rick', 'Dan', 'Michelle', 'Ryan', 'Gary', 'Nina', 'Simon', 'Guru'],
 'Salary': ['623.3', '515.2', '611', '729', '843.25', '578', '632.8', '722.5'],
 'StartDate': ['1/1/2012',
               '9/23/2013',
               '11/15/2014',
               '5/11/2014',
               '3/27/2015',
               '5/21/2013',
               '7/30/2013',
               '6/17/2014']}
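
In the output above, `pprint` lists the keys alphabetically, whereas the plain `repr` keeps insertion order. On Python 3.8 or later, the `sort_dicts` parameter controls this. A minimal sketch with a shortened dictionary:

```python
import pprint

student = {'Name': 'Tusar', 'Class': 'XII'}

sorted_view = pprint.pformat(student)                        # keys sorted alphabetically
insertion_view = pprint.pformat(student, sort_dicts=False)   # requires Python 3.8+

print(sorted_view)
print(insertion_view)
```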


## Capitalization

In [66]:
import string

text4 = 'Tutorialspoint - simple easy learning.'

In [67]:
string.capwords(text4, sep=None)

'Tutorialspoint - Simple Easy Learning.'

In [68]:
text4.upper()

'TUTORIALSPOINT - SIMPLE EASY LEARNING.'
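
`str.title()` looks similar to `string.capwords()` but treats every non-letter as a word boundary, which shows up with apostrophes. A quick comparison on a made-up phrase:

```python
import string

phrase = "they're here"

titled = phrase.title()             # capitalizes after the apostrophe as well
capped = string.capwords(phrase)    # splits on whitespace only

print(titled)
print(capped)
```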

### Translate

In [69]:
transtable = text4.maketrans('tpol', 'wxyz')
transtable

{116: 119, 112: 120, 111: 121, 108: 122}

In [70]:
text4.translate(transtable)

'Tuwyriazsxyinw - simxze easy zearning.'
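
`str.maketrans` also accepts a third argument listing characters to delete, which gives a compact way to strip punctuation. A small sketch on the same sample string:

```python
import string

text4 = 'Tutorialspoint - simple easy learning.'

# Empty from/to strings: nothing is mapped, every punctuation character is deleted
delete_punct = str.maketrans('', '', string.punctuation)
no_punct = text4.translate(delete_punct)

print(repr(no_punct))
```

The two spaces around the removed hyphen are left in place; collapsing them would need a separate step such as `' '.join(no_punct.split())`.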

## Spell Check

In [71]:
from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['Let', 'us', 'wlak','on','the','groud'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))

proud
{'proud', 'groun', 'grout', 'group', 'aroud', 'grodd', 'ground'}
walk
{'wlat', 'wak', 'alak', 'flak', 'weak', 'walk', 'blak'}
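
When installing a spell-checking package is not an option, the standard library's `difflib` can rank candidates by string similarity. This is a different technique from the one `SpellChecker` uses and knows nothing about word frequency; the `vocabulary` list below is invented for the example.

```python
from difflib import get_close_matches

# Invented mini-vocabulary; a real checker would use a full word list
vocabulary = ["walk", "ground", "group", "proud", "the", "on"]

# Return up to 3 vocabulary words whose similarity ratio is at least 0.6
suggestions = get_close_matches("groud", vocabulary, n=3, cutoff=0.6)
print(suggestions)
```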


## PDFs

Source(s):
- https://automatetheboringstuff.com/chapter13/
- https://www.geeksforgeeks.org/working-with-pdf-files-in-python/
- https://realpython.com/pdf-python/

In [72]:
import PyPDF2

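# Raw string so the backslash in the Windows path is not treated as an escape
pdfFileObj = open(r'data\CA Data Science Resume.pdf', 'rb')
# PdfFileReader is the legacy class name; newer PyPDF2 releases use PdfReader
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)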

In [73]:
pdfReader.numPages

2

In [74]:
pdfReader.isEncrypted

False

In [75]:
# pdfReader.decrypt('pwd')

In [76]:
# One PDF page
page = pdfReader.getPage(0)
page_content = page.extractText()
# page_content

In [77]:
# Multiple PDF pages
for i in range(pdfReader.getNumPages()):
    page = pdfReader.getPage(i)
    print('Page No - ' + str(1 + pdfReader.getPageNumber(page)))
    page_content = page.extractText()
    # print(page_content)

Page No - 1
Page No - 2


## Word Document

Source(s):
- https://automatetheboringstuff.com/chapter13/

In [78]:
import docx  # python-docx

# Raw string so the backslash in the Windows path is not treated as an escape
file = r'data\CA Data Science Resume.docx'

doc = docx.Document(file)

In [79]:
len(doc.paragraphs)

53

In [80]:
doc.paragraphs[0].text

'Corey Atkins'

In [81]:
len(doc.paragraphs[1].runs)

14

In [82]:
doc.paragraphs[3].text

'Qualifications Summary'

In [83]:
# for i in doc.paragraphs:
#     print(i.text)

In [84]:
def getText(filename):
    """Return the text of every paragraph in a .docx file, joined by newlines."""
    doc = docx.Document(filename)
    return '\n'.join(para.text for para in doc.paragraphs)

# print(getText(file))

## RSS Feed

In [85]:
import feedparser

marvel = 'http://www.marvel.com/feeds/rss/movies_news'
test = 'https://timesofindia.indiatimes.com/rssfeedstopstories.cms'

NewsFeed = feedparser.parse(test)

In [86]:
NewsFeed

{'feed': {'language': 'en-gb',
  'links': [{'type': 'application/rss+xml',
    'rel': 'self',
    'href': 'https://timesofindia.indiatimes.com/rssfeedstopstories.cms'},
   {'rel': 'alternate',
    'type': 'text/html',
    'href': 'https://timesofindia.indiatimes.com'}],
  'title': 'Times of India',
  'title_detail': {'type': 'text/plain',
   'language': 'en-US',
   'base': 'https://timesofindia.indiatimes.com/rssfeedstopstories.cms',
   'value': 'Times of India'},
  'link': 'https://timesofindia.indiatimes.com',
  'subtitle': 'The Times of India: Breaking news, views, reviews, cricket from across India',
  'subtitle_detail': {'type': 'text/html',
   'language': 'en-US',
   'base': 'https://timesofindia.indiatimes.com/rssfeedstopstories.cms',
   'value': 'The Times of India: Breaking news, views, reviews, cricket from across India'},
  'rights': 'Copyright:(C) 2021 Bennett Coleman & Co. Ltd, http://info.indiatimes.com/terms/tou.html',
  'rights_detail': {'type': 'text/plain',
   'langua