### Text preprocessing converts raw text into well-defined sequences of linguistic components. Preprocessing techniques usually add metadata that gives meaning to those components. They aim to remove HTML tags, tokenize the text, remove unnecessary tokens (such as stopwords), expand contractions, correct spelling errors, etc. Let's see some of these techniques.

### This notebook requires TensorFlow 2.10: pip install tensorflow==2.10

### You can start by removing accented characters and converting the text to lower/upper case if you want. This helps to standardize the text for analysis. 

In [1]:
import unicodedata

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8','ignore')
    return(text)

text_port = "Eu gosto de vários Esportes e de ouvir Música com Harmonia."
text_port = remove_accented_chars(text_port).lower()
print(text_port)

eu gosto de varios esportes e de ouvir musica com harmonia.
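
#### Encoding to ASCII also deletes characters that have no accent to strip (e.g., "ß" or "ø"). The cell below is a minimal sketch of an alternative, assuming you only want to drop the combining marks; the name strip_accents is illustrative and not part of the original notebook.

```python
import unicodedata

def strip_accents(text):
    # NFKD splits an accented character into a base character
    # followed by combining marks (e.g., "á" -> "a" + acute accent)
    decomposed = unicodedata.normalize('NFKD', text)
    # drop only the combining marks and keep every base character
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Eu gosto de vários Esportes e de ouvir Música."))
print(strip_accents("Straße in Tromsø")) # characters without a decomposition are kept
```

#### Whether characters such as "ß" should be kept or transliterated depends on the application.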


### If the text is HTML, you will certainly want to remove its tags.

In [2]:
import re
import bs4 # beautiful soup
import requests

def strip_html_tags(text):
    soup = bs4.BeautifulSoup(text, "html.parser")
    for s in soup(['iframe', 'script']): # drop embedded frames and scripts
        s.extract()
    stripped_text = soup.get_text()
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text) # leverage regular expressions from Python to
                                                            # collapse runs of line breaks into one new-line
    return(stripped_text)

url = 'https://en.wikipedia.org/wiki/Parsing'
res = requests.get(url)
html_page = res.text

clean_text = strip_html_tags(html_page)
print(clean_text) 


Parsing - Wikipedia
Jump to content
Main menu
Main menu
move to sidebar
hide
		Navigation
	
Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us
		Contribute
	
HelpLearn to editCommunity portalRecent changesUpload file
Search
Search
Appearance
Donate
Create account
Log in
Personal tools
Donate Create account Log in
		Pages for logged out editors learn more
ContributionsTalk
Contents
move to sidebar
hide
(Top)
1
Human languages
Toggle Human languages subsection
1.1
Traditional methods
1.2
Computational methods
1.3
Psycholinguistics
1.4
Discourse analysis
2
Computer languages
Toggle Computer languages subsection
2.1
Parser
2.2
Overview of process
3
Types of parsers
Toggle Types of parsers subsection
3.1
Implementation
4
Parser development software
5
Lookahead
6
List of parsing algorithms
7
See also
8
References
9
Further reading
10
External links
Toggle the table of contents
Parsing
37 languages
العربيةবাংলাČeštinaDanskDeutschEestiΕλληνικάEsperantoفارسیFrançais한국어Հայեր
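
#### The text printed above still contains navigation menus and language lists, because every text node of the page is kept. One option is to keep only the text inside paragraph tags. The cell below is a minimal sketch with Python's standard html.parser; the class and function names are illustrative, and it assumes that the relevant content is inside <p> elements.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collects the text found inside <p> elements only."""
    def __init__(self):
        super().__init__()
        self.inside = False   # True while the parser is inside a <p> element
        self.paragraphs = []  # one string per paragraph

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.inside = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.paragraphs[-1] += data

def paragraphs_only(html):
    parser = ParagraphText()
    parser.feed(html)
    return [p.strip() for p in parser.paragraphs if p.strip()]

sample = "<nav>Main menu</nav><p>Parsing is <b>syntax</b> analysis.</p><script>var x;</script>"
print(paragraphs_only(sample))
```

#### The same idea works with Beautiful Soup by iterating over soup.find_all('p').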

### Expanding contractions: this is another operation that standardizes the text.

In [3]:
# Note that some contractions have multiple forms and the only way to solve them is by analyzing the text locally.
CONTRACTION_MAP = {
    "ain't": "is not",
    "aren't": "are not",
    "can't": "can not",
    "can't've": "can not have",
    "wouldn't": "would not",
    "wouldn't've": "would not have",
    "hasn't": "has not",
    "hasn't've": "has not have",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "haven't": "have not",
    "you'll've": "you will have",
    "you're": "you are",
    "you've": "you have",
    "it's": "it is",
    "he's": "he is",
    "she's": "she is",
    "they're": "they are",
    "I'm": "I am",
}

def expand_contractions(text, contraction_map=CONTRACTION_MAP):
    # case-insensitive lookup table (keys such as "I'm" are not lowercase)
    lower_map = {key.lower(): value for key, value in contraction_map.items()}
    # try the longest contractions first; otherwise "can't've" would be matched as "can't"
    keys = sorted(lower_map, key=len, reverse=True)
    contraction_patterns = re.compile('({})'.format('|'.join(re.escape(key) for key in keys)),
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction): # finds the pattern and returns its expansion
        match = contraction.group(0) # the contraction found in the text
        expanded_contraction = lower_map[match.lower()] # gets the expanded pattern
        if match[0].isupper(): # preserves the capitalization of the first character
            expanded_contraction = expanded_contraction[0].upper()+expanded_contraction[1:]
        return(expanded_contraction)
    # finds and replaces the occurrences of the contraction patterns 
    expanded_text = contraction_patterns.sub(expand_match, text) 
    expanded_text = re.sub("'","",expanded_text) # eliminates extra "'" if it remains in expanded_text
    return(expanded_text)

text = "You aren't the best choice. Especially when she's around." 
text = expand_contractions(text)
print(text)

You are not the best choice. Especially when she is around.


### Text tokenization: it separates independent and minimal textual components (tokens) that have some definite syntax and semantics.

In [4]:
import nltk # https://www.nltk.org/
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

#### by word 

In [5]:
text = "I love hiking, surfing, biking, and all types of music with harmony."
print(nltk.word_tokenize(text, "english"))
text_port = "Eu gosto de vários esportes e de ouvir música com harmonia."
print(nltk.word_tokenize(text_port, "portuguese"))

['I', 'love', 'hiking', ',', 'surfing', ',', 'biking', ',', 'and', 'all', 'types', 'of', 'music', 'with', 'harmony', '.']
['Eu', 'gosto', 'de', 'vários', 'esportes', 'e', 'de', 'ouvir', 'música', 'com', 'harmonia', '.']
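
#### To see what a word tokenizer does at its simplest, the cell below is a minimal sketch based on a single regular expression. The name simple_word_tokenize is illustrative; this is not how NLTK implements word_tokenize, which also handles cases such as contractions and abbreviations.

```python
import re

def simple_word_tokenize(text):
    # \w+     : a run of letters, digits, or underscores (Unicode-aware for str in Python 3)
    # [^\w\s] : a single character that is neither a word character nor white space
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("I love hiking, surfing, biking, and all types of music with harmony."))
print(simple_word_tokenize("Eu gosto de vários esportes e de ouvir música com harmonia."))
print(simple_word_tokenize("You aren't here.")) # the contraction is split at the apostrophe
```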


#### by sentences

In [6]:
text = "I love hiking in the winter, but I most like surfing in the summer. I also like to play classic guitar."
print(nltk.sent_tokenize(text, "english"))

['I love hiking in the winter, but I most like surfing in the summer.', 'I also like to play classic guitar.']


#### You may also download and use tokenizers, or even build, train and evaluate your own tokenizer.

In [7]:
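nltk.download('popular') # download the 'popular' collection of corpora and models
nltk.download('punkt')   # download the punkt tokenizer. The second positional argument of
                         # nltk.download is download_dir, so each resource needs its own call.
                         # Without arguments, one can select from a user 
                         # interface: models, packages, corpora (collections of texts with metadata) 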

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to punkt...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to punkt...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to punkt...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to punkt...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to punkt...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to punkt...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to punkt...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Downloading package shakespeare to punkt...
[nltk_data]    |   Package shakespeare is already

True

In [8]:
port_text = "Alice é uma menina levada. Ela subiu no telhado para pegar uma pipa. \
             O telhado estava molhado. Alice escorregou, caiu e quebrou o braço."

# load a tokenizer model pretrained for Portuguese
port_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/portuguese.pickle')
port_tokenizer.tokenize(port_text)

['Alice é uma menina levada.',
 'Ela subiu no telhado para pegar uma pipa.',
 'O telhado estava molhado.',
 'Alice escorregou, caiu e quebrou o braço.']

#### You can then create a function to tokenize sentences and then words in each sentence.

In [9]:
def tokenize_text_nltk(text):
    sentences   = nltk.sent_tokenize(text)
    sentences   = [nltk.word_tokenize(sentence) for sentence in sentences]
    return(sentences)

print(tokenize_text_nltk(text))

[['I', 'love', 'hiking', 'in', 'the', 'winter', ',', 'but', 'I', 'most', 'like', 'surfing', 'in', 'the', 'summer', '.'], ['I', 'also', 'like', 'to', 'play', 'classic', 'guitar', '.']]


#### spaCy (https://spacy.io/) is another package with fast implementations of several techniques. Its pipeline already splits a text into sentences and into the tokens of each sentence.

In [10]:
import spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 6.6 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [11]:
def text_to_sentences(text):
    nlp         = spacy.load('en_core_web_sm') # Load English tokenizer, tagger, parser 
                                               # and NER (Named Entity Recognition)
    text_obj    = nlp(text) # creates a text object with metadata, such as 
                            # sentences, tokens, and much more as we will see. 
    sentences   = [[token.text for token in sentence] for sentence in text_obj.sents]
    return(sentences)

sentences = text_to_sentences(text)
print(sentences)

[['I', 'love', 'hiking', 'in', 'the', 'winter', ',', 'but', 'I', 'most', 'like', 'surfing', 'in', 'the', 'summer', '.'], ['I', 'also', 'like', 'to', 'play', 'classic', 'guitar', '.']]


In [12]:
# The list of sentences and tokens per sentence is convenient for processing information, but 
# you may want to convert that list back into text.

def sentences_to_text(sentences):
    text = ""
    for sentence in sentences:
        if len(sentence) > 0: 
            for token in sentence:
                if token in [',', '.', ':', ';', '?', '!']:
                    text = text + token
                else:
                    text = text + " " + token
    return(text)

text = sentences_to_text(sentences)
print(text)

 I love hiking in the winter, but I most like surfing in the summer. I also like to play classic guitar.
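
#### Note the leading white space in the text printed above: a space is added before every token that is not a punctuation mark, including the first one. The cell below is a minimal sketch of a variant that avoids it by attaching punctuation to the previous token and joining the pieces at the end; the name join_tokens is illustrative.

```python
def join_tokens(sentences):
    closing = {',', '.', ':', ';', '?', '!'}
    pieces = []
    for sentence in sentences:
        for token in sentence:
            if token in closing and pieces:
                pieces[-1] += token  # attach punctuation to the previous token
            else:
                pieces.append(token)
    return ' '.join(pieces)          # single spaces, no leading space

example_sentences = [['I', 'love', 'hiking', ',', 'surfing', '.'], ['I', 'also', 'play', '.']]
print(join_tokens(example_sentences))
```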


### Stopword removal: it removes very frequent word tokens that carry little meaning for tasks such as search. There is no universal list of such word tokens. You may create your own list.

In [13]:
def remove_stopwords(text):
    sentences = text_to_sentences(text)
    stopword_list = nltk.corpus.stopwords.words('english') # get a list of stopwords from a corpus of stopwords 
    new_sentences = []
    for i, sentence in enumerate(sentences):
        new_sentences.append([])
        for word_token in sentence:
            if (word_token.lower() not in stopword_list):
                new_sentences[i].append(word_token)
    text = sentences_to_text(new_sentences)            
    return(text)

text = remove_stopwords(text)
print(text)

   love hiking winter, like surfing summer. also like play classic guitar.
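
#### Creating your own list can be as simple as defining a set of lowercase words. The cell below is a minimal sketch; CUSTOM_STOPWORDS and filter_tokens are hypothetical names, and the words in the set are only an example.

```python
# hypothetical stopword list; a set gives fast membership tests
CUSTOM_STOPWORDS = {'i', 'the', 'in', 'but', 'to', 'most', 'also'}

def filter_tokens(tokens, stopwords=CUSTOM_STOPWORDS):
    # compare in lowercase so that 'I' and 'The' are also removed
    return [token for token in tokens if token.lower() not in stopwords]

tokens = ['I', 'love', 'hiking', 'in', 'the', 'winter']
print(filter_tokens(tokens))
```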


### Special character removal: you may want to remove unnecessary symbols, characters, or even digits.

In [14]:
def remove_special_characters(text, remove_digits=False):
    sentences = text_to_sentences(text)
    if (remove_digits):
        pattern = r'[^a-zA-Z\s]'
    else:
        pattern = r'[^a-zA-Z0-9\s]'
    new_sentences = []
    for i, sentence in enumerate(sentences):
        new_sentences.append([])
        for word_token in sentence:
            word_token = re.sub(pattern,'',word_token)
            if (word_token != ''):
                new_sentences[i].append(word_token)   
    text = sentences_to_text(new_sentences)             
    return(text)

example = '13 years in prison! No way, he decided to flee.'
example = remove_special_characters(example, True)
print(example)
text    = remove_special_characters(text, False)
print(text)

 years in prison No way he decided to flee
     love hiking winter like surfing summer also like play classic guitar
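
#### The patterns above keep only the ASCII letters a-z and A-Z, so accented letters such as "ú" are also removed unless the accents were stripped first. The cell below is a minimal sketch of a Unicode-aware variant that works directly on a string; the name remove_symbols is illustrative.

```python
import re

def remove_symbols(text, remove_digits=False):
    # [^\w\s] matches anything that is not a letter, digit, underscore, or white space;
    # for str patterns in Python 3, \w also covers accented letters
    text = re.sub(r'[^\w\s]|_', '', text)
    if remove_digits:
        text = re.sub(r'\d', '', text)
    return text

print(remove_symbols("13 years in prison! No way, he decided to flee.", remove_digits=True))
print(remove_symbols("música & harmonia: 100%"))
```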


### Stemming: it reduces a word to its stem by stripping suffixes. The resulting stem is not always a valid word.

In [15]:
from nltk.stem import PorterStemmer, SnowballStemmer

stemmer = PorterStemmer()
print('the stem of {} is {}'.format('hiking',stemmer.stem('hiking')))

print('SnowballStemmer supports the following languages:', SnowballStemmer.languages)

port_stemmer = SnowballStemmer('portuguese')

print("However, a stem is not always a valid word:", port_stemmer.stem('dançando'))

the stem of hiking is hike
SnowballStemmer supports the following languages: ('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
However, a stem is not always a valid word: danc


### Lemmatization: it maps each word to its lemma, the base (dictionary) form of the word.

In [16]:
from nltk.stem import WordNetLemmatizer

def lemmatize_nltk(text):
    sentences     = text_to_sentences(text)
    lem           = WordNetLemmatizer()
    new_sentences = []
    for i, sentence in enumerate(sentences):
        new_sentences.append([])
        for word_token in sentence:
            word_token = lem.lemmatize(word_token) # without a pos argument, every token is treated as a noun
            new_sentences[i].append(word_token)
    text = sentences_to_text(new_sentences)        
    return(text)

example = 'He believes in love and then they became lovers'
example = lemmatize_nltk(example)
print(example)

text = lemmatize_nltk(text)
print(text)

 He belief in love and then they became lover
       love hiking winter like surfing summer also like play classic guitar


#### spaCy makes things a lot easier since it performs part-of-speech tagging and lemmatization for each token in a text at once.

In [17]:
def lemmatize(text):
    nlp         = spacy.load('en_core_web_sm') # Load English tokenizer, tagger, parser and NER
    text_obj    = nlp(text) # creates a text object with metadata: e.g., it can identify Parts of Speech (POS) 
                            # such as verbs, adverbs, nouns, pronouns, etc. 
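    # lemmatize whatever is not a pronoun: spaCy 2.x assigns the lemma '-PRON-' to pronouns
    # (spaCy 3.x returns a regular lemma for them, so the test below is then always true)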
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text_obj]) 
    return(text)

print(lemmatize("My system keeps crashing! His crashed yesterday, mine crashes today."))

my system keep crash ! his crash yesterday , mine crash today .


### Part-of-Speech (POS) tags: lexical categories assigned to words based on their syntactic context and role. spaCy assigns each token a coarse POS, a fine-grained TAG that follows the Penn Treebank POS tags, and a dependency label (DEP), which tells its role in the sentence.

In [18]:
nlp         = spacy.load('en_core_web_sm')
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    left_dep  = [t for t in token.lefts]
    right_dep = [t for t in token.rights]
    print("TOKEN: {:10s} POS: {:5s} TAG: {:3s} DEP: {:10s}".format(token.text, token.pos_, \
                                                                token.tag_, token.dep_))
    if (len(left_dep)>0):
        print("Dependencies on the left: {}".format(left_dep))
    if (len(right_dep)>0):
        print("Dependencies on the right: {}".format(right_dep))    

TOKEN: Apple      POS: PROPN TAG: NNP DEP: nsubj     
TOKEN: is         POS: AUX   TAG: VBZ DEP: aux       
TOKEN: looking    POS: VERB  TAG: VBG DEP: ROOT      
Dependencies on the left: [Apple, is]
Dependencies on the right: [at]
TOKEN: at         POS: ADP   TAG: IN  DEP: prep      
Dependencies on the right: [buying]
TOKEN: buying     POS: VERB  TAG: VBG DEP: pcomp     
Dependencies on the right: [startup]
TOKEN: U.K.       POS: PROPN TAG: NNP DEP: nsubj     
TOKEN: startup    POS: VERB  TAG: VBD DEP: ccomp     
Dependencies on the left: [U.K.]
Dependencies on the right: [for]
TOKEN: for        POS: ADP   TAG: IN  DEP: prep      
Dependencies on the right: [billion]
TOKEN: $          POS: SYM   TAG: $   DEP: quantmod  
TOKEN: 1          POS: NUM   TAG: CD  DEP: compound  
TOKEN: billion    POS: NUM   TAG: CD  DEP: pobj      
Dependencies on the left: [$, 1]


#### Such a dependency tree can be visualized as follows. There are other types of parsers (shallow, constituency), and such structures help reasoning in text analysis.

In [19]:
from spacy import displacy

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

displacy.render(doc, style="dep")

#### List of tags from the Penn Treebank POS tags

In [20]:
nltk.download('tagsets')
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

[nltk_data] Downloading package tagsets to /home/afalcao/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


In [21]:
nltk.pos_tag(['I', 'love', 'hiking']) # POS tags for a list of tokens

[('I', 'PRP'), ('love', 'VBP'), ('hiking', 'VBG')]

### Named Entity Recognition (NER): spaCy also identifies named entities (spans of one or more tokens) and assigns a category to each of them.

In [22]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print("Ent: {:10s} Start_position: {:2d} End_position: {:2d} Label: {:5s} Explanation: {:25s}".format(\
    ent.text, ent.start_char, ent.end_char, ent.label_, str(spacy.explain(ent.label_))))

Ent: Apple      Start_position:  0 End_position:  5 Label: ORG   Explanation: Companies, agencies, institutions, etc.
Ent: U.K.       Start_position: 27 End_position: 31 Label: GPE   Explanation: Countries, cities, states
Ent: $1 billion Start_position: 44 End_position: 54 Label: MONEY Explanation: Monetary values, including unit


### A corpus is a collection of texts with metadata. It can be useful for several tasks.

#### WordNet Corpus: it is a semantically oriented lexical database for the English language

In [23]:
from nltk.corpus import wordnet as wn

synonymous = wn.synsets('hike')
for word in synonymous:
    print(word.name(),word.pos()) # print name and part of speech
    print(word.definition()) # print definition
    print(word.examples()) # print examples

hike.n.01 n
a long walk usually for exercise or pleasure
['she enjoys a hike in her spare time']
rise.n.09 n
an increase in cost
['they asked for a 10% rise in rates']
raise.n.01 n
the amount a salary is increased
['he got a 3% raise', 'he got a wage hike']
hike.v.01 v
increase
['The landlord hiked up the rents']
hike.v.02 v
walk a long way, as for pleasure or physical exercise
['We were hiking in Colorado', 'hike the Rockies']


#### Brown Corpus: it consists of English texts from a variety of sources and categories

In [24]:
nltk.download('brown')           # download the brown corpus 

[nltk_data] Downloading package brown to /home/afalcao/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [25]:
from nltk.corpus import brown

In [26]:
print(brown.categories()) # all 15 categories

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']


In [27]:
sentence_tokens = brown.sents(categories='humor') # get tokenized sentences from category='humor'
sentences = [' '.join(token) for token in sentence_tokens] # get sentences in natural form
for i in range(3): # show the first three sentences
    print(sentences[i])
    print()

It was among these that Hinkle identified a photograph of Barco ! !

For it seems that Barco , fancying himself a ladies' man ( and why not , after seven marriages ? ?

) , had listed himself for Mormon Beard roles at the instigation of his fourth murder victim who had said : `` With your beard , dear , you ought to be in movies '' ! !



In [28]:
pos_tagged_tokens = brown.tagged_sents(categories='humor') # get tokenized sentences from category='humor' with 
                                                           # the corresponding part of speech (POS) classification
print(pos_tagged_tokens)

[[('It', 'PPS'), ('was', 'BEDZ'), ('among', 'IN'), ('these', 'DTS'), ('that', 'CS'), ('Hinkle', 'NP'), ('identified', 'VBD'), ('a', 'AT'), ('photograph', 'NN'), ('of', 'IN'), ('Barco', 'NP'), ('!', '.'), ('!', '.')], [('For', 'CS'), ('it', 'PPS'), ('seems', 'VBZ'), ('that', 'CS'), ('Barco', 'NP'), (',', ','), ('fancying', 'VBG'), ('himself', 'PPL'), ('a', 'AT'), ("ladies'", 'NNS$'), ('man', 'NN'), ('(', '('), ('and', 'CC'), ('why', 'WRB'), ('not', '*'), (',', ','), ('after', 'IN'), ('seven', 'CD'), ('marriages', 'NNS'), ('?', '.'), ('?', '.')], ...]


#### Movie Review corpus: it contains positive and negative reviews about movies

In [29]:
from nltk.corpus import movie_reviews as mr

In [30]:
print(mr.categories())

['neg', 'pos']


In [31]:
sentence_tokens = mr.sents(categories='pos') # get tokenized sentences with positive review
sentences = [' '.join(token) for token in sentence_tokens] # get sentences in natural form
for i in range(3): # show the first three sentences with positive reviews
    print(sentences[i])
    print()

films adapted from comic books have had plenty of success , whether they ' re about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there ' s never really been a comic book like from hell before .

for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid ' 80s with a 12 - part series called the watchmen .

to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd .



### We may then use a corpus to assist the removal of repeated characters in tokens. Here, WordNet works as a dictionary: a token is shortened until it becomes a known word.

In [32]:
def remove_repeated_characters(text):
    sentences = text_to_sentences(text)
    repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
    match_substitution = r'\1\2\3'
    def replace(old_word):
        if wn.synsets(old_word):
            return(old_word)
        new_word = repeat_pattern.sub(match_substitution, old_word)
        return(replace(new_word) if new_word != old_word else new_word)
    new_sentences = []
    for i, sentence in enumerate(sentences):
        new_sentences.append([])
        for word_token in sentence:
            word_token = replace(word_token)
            new_sentences[i].append(word_token)
    text = sentences_to_text(new_sentences)        
    return(text)

example = 'finalllyyy loooveee, artt aand mussicc beccamme her prriorityy'
example = remove_repeated_characters(example)
print(example)

 finally love, art and music became her priority
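
#### To follow the recursion step by step without WordNet, the cell below is a minimal sketch that replaces the corpus lookup by a small set of known words. VOCABULARY and squeeze are hypothetical names, and the set only covers the example sentence.

```python
import re

# hypothetical vocabulary that plays the role of the WordNet lookup
VOCABULARY = {'finally', 'love', 'art', 'and', 'music', 'became', 'her', 'priority'}

# the greedy (\w*) makes the pattern find the rightmost pair of equal characters
REPEAT = re.compile(r'(\w*)(\w)\2(\w*)')

def squeeze(word, vocabulary=VOCABULARY):
    if word in vocabulary:                  # stop as soon as the word is known
        return word
    shorter = REPEAT.sub(r'\1\2\3', word)   # remove one character of one repeated pair
    return squeeze(shorter, vocabulary) if shorter != word else shorter

words = 'finalllyyy loooveee artt aand mussicc beccamme her prriorityy'.split()
print([squeeze(word) for word in words])
print(squeeze('hello')) # not in the vocabulary, so its valid double letter is also removed
```

#### The last call shows the limitation of the approach: the result is only as good as the vocabulary used to stop the recursion.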


### Other examples of preprocessing tasks are spelling correction, removal of multiple white spaces, etc.
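
#### For instance, multiple white spaces, such as the ones left by the functions above, can be removed with a single regular expression. The cell below is a minimal sketch; the name normalize_whitespace is illustrative.

```python
import re

def normalize_whitespace(text):
    # replace any run of spaces, tabs, or line breaks by one space
    # and remove the white space at both ends
    return re.sub(r'\s+', ' ', text).strip()

print(normalize_whitespace("   love hiking winter,   like\tsurfing \n summer. "))
```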

### We will now learn how to acquire text data from the internet. We can use Scrapy, a fast, high-level framework to crawl websites and extract structured data from their pages (see https://docs.scrapy.org/en/latest/intro/tutorial.html). 

### In the command line, create a new project called stockmarket. The idea is to find positive sentences in the text involving names of organizations. Such information may be useful to understand the behavior of the stock market with respect to positive and negative news related to companies.  

scrapy startproject stockmarket  

### Create a crawler script, extract_links.py, as described in the next cell, that extracts urls and saves them in a csv file, links.csv. Copy your script to  stockmarket/stockmarket/spiders. 

import scrapy

import csv

class ExtractLinks(scrapy.Spider):

    name     = "extract_links" 
    
    maxdepth = 2 # crawling depth from initial urls
    
    custom_settings = {'DEPTH_LIMIT': maxdepth} # Scrapy does not follow links beyond this depth
    
    # provide the list of urls
    
    def start_requests(self):
        urls = [
            'https://finance.yahoo.com/topic/stock-market-news/',
            'https://www.cnbc.com/china-markets/'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # define the parse function that will create 
    # a list of urls from the initial ones provided above.
    
    def parse(self, response):
        """ Main function that parses downloaded pages """
        # Save link
        if (response.url.split(".")[-1] == "html"): 
            with open('links.csv', 'a', newline='') as file: # the with block closes the file
                writer = csv.writer(file)            
                writer.writerow([response.url])
        # Get the href of all the <a> tags (tags without href are skipped)
        links = response.xpath("//a/@href").getall()
        # Loop on each link
        for link in links:
            # Resolve relative links against the url of the current page
            url = response.urljoin(link)
            # Skip links that are not web pages (mailto:, javascript:, etc.)
            if url.startswith('http'):
                # Create a new Request object and return it thanks to a generator
                yield scrapy.Request(url=url, callback=self.parse)


### Execute your script extract_links.py  

cd stockmarket/stockmarket/spiders

scrapy crawl extract_links

### Create a crawler script, extract_texts.py, as described in the next cell, which saves the pages from the urls in links.csv into separate files inside a folder named after the current date. Copy extract_texts.py to stockmarket/stockmarket/spiders and execute it.


import scrapy

import os

from datetime import date

class ExtractTexts(scrapy.Spider):

    name   = "extract_texts"
    
    # create a folder named by the current date and yield the urls for parsing
    
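    def start_requests(self):
        folder = str(date.today())
        os.makedirs(folder, exist_ok=True) # portable, and it does not fail if the folder exists
        with open('links.csv', 'r') as file:
            for line in file:
                url = line.strip()
                if url: # skip empty lines
                    yield scrapy.Request(url=url, callback=self.parse)
    
    # parse the texts from those urls into html files inside that folder. 
    
    def parse(self, response):
        folder = date.today()
        # the file is named after the second-to-last segment of the url, so pages that
        # share that segment (e.g., .../news/a.html and .../news/b.html) overwrite each other
        page = response.url.split("/")[-2]
        filename = "{}/{}.html".format(folder,page)
        with open(filename, 'wb') as file: # the with block closes the file
            file.write(response.body)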
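
#### Since response.url.split("/")[-2] returns the same name for all pages under the same path, you may prefer a file name derived from the whole url. The cell below is a minimal sketch with the standard library only; the function name filename_from_url and the urls in the example are hypothetical.

```python
import re
from urllib.parse import urlparse

def filename_from_url(url):
    parsed = urlparse(url)
    # join host and path, then replace every run of characters
    # that is unsafe in a file name by an underscore
    raw = parsed.netloc + parsed.path
    safe = re.sub(r'[^A-Za-z0-9._-]+', '_', raw).strip('_')
    return safe if safe.endswith('.html') else safe + '.html'

# made-up urls, only to illustrate the naming
print(filename_from_url('https://finance.yahoo.com/news/first-story.html'))
print(filename_from_url('https://finance.yahoo.com/news/second-story.html'))
print(filename_from_url('https://www.cnbc.com/china-markets/'))
```

#### Inside the spider, the result could replace the value of page when building filename. Query strings are ignored in this sketch, so urls that differ only in their parameters would still collide.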

#### Execute your script inside stockmarket/stockmarket/spiders:

scrapy crawl extract_texts

### Now that you have created several HTML files, the first task is data preparation. You will need to remove irrelevant information from each file. For the sake of simplicity, I will show the process for a single HTML file. 

In [33]:
from datetime import date
folder   = "stockmarket/stockmarket/spiders/{}".format('2021-10-01') #date.today())
filename = "{}/news.html".format(folder) # example of one of the files
with open(filename,"r") as file: # the with block closes the file after reading
    text = file.read()
print('This page contains {} characters'.format(len(text)))
print(text)

This page contains 473727 characters
<!DOCTYPE html><html id="atomic" class="Fz(62.5%) Pos(r) desktop bktfinance-US-en-US-def ua-chrome ua-41.0 failsafe" lang="en-US"><head><script>
        window.performance.mark('PageStart');
        document.documentElement.className += ' JsEnabled jsenabled';
        /**
        * Empty darlaOnready method, to avoid JS error.
        * This can happen when Async Darla JS file is loaded earlier than Darla Proxy JS.
        * This method will be overridden by Darla Proxy
        */
        window.darlaOnready = function() {};
        </script><title>Oil Falls Ahead of OPEC+ Meeting With Supply in Focus</title><meta http-equiv="content-type" content="text/html; charset=utf-8" /><meta http-equiv="x-dns-prefetch-control" content="on" /><meta http-equiv="X-UA-Compatible" content="chrome=1" /><meta name="oath:guce:consent-host" content="guce.yahoo.com" /><meta name="news_keywords" content="Bloomberg, OPEC, Helima Croft" /><meta name="twitter:title" content="Oi

### First, you will need to use the functions defined above to preprocess and extract sentences from the text. 

In [34]:
def my_remove_stopwords(text):
    sentences = text_to_sentences(text)
    stopword_list = nltk.corpus.stopwords.words('english')
    # important for positive/negative classification
    preserve_list = ['i', 'am', 'is', 'are', 'do', 'not', 'does', 'you', 'he', 'it', 'she'] 
    new_sentences = []
    for i, sentence in enumerate(sentences):
        new_sentences.append([])
        for word in sentence:
            if ((word.lower() not in stopword_list) or 
                (word.lower() in preserve_list)):
                new_sentences[i].append(word)
    text = sentences_to_text(new_sentences)            
    return(text)

def remove_sentences_without_verbs(text):
    text_obj      = nlp(text) # apply spacy pipeline to POS tag its tokens
    new_text      = ""
    for sentence in text_obj.sents: # eliminate sentences with no verbs
        has_verb      = False
        for token in sentence:
            if ((token.pos_ == 'VERB') or (token.pos_ == 'AUX')):
                has_verb = True
                break
        if (has_verb):
            for token in sentence: 
                if token.text in [',', '.', ':', ';', '?', '!']:
                    new_text = new_text + token.text
                else:
                    new_text = new_text + " " + token.text
            
    return(new_text)                  
       
# Some of these steps might not be suitable for the application    
def preprocess(text):
    text        = strip_html_tags(text)
    text        = remove_accented_chars(text)
    text        = remove_repeated_characters(text)
    text        = expand_contractions(text)
    text        = my_remove_stopwords(text)
    text        = remove_sentences_without_verbs(text)
    text        = remove_special_characters(text, False)
    text        = lemmatize(text)
    return(text)

text = preprocess(text)
print(text)

    try it free             Yahoo Finance Sign        Mail Sign view mail         Finance       Finance        Watchlists       watchlist        Portfolio       Portfolio       Screeners       Screeners        save ScrenersSaved Screeners Equity ScrenerEquity Screener Mutual Fund ScrenerMutual Fund Screener etf ScrenerETF screener future ScrenerFuture Screener Index ScrenerIndex Screener         Yahoo Finance plus       Yahoo Finance plus        DashboardDashboard Research ReportsResearch Reports Investment IdeasInvestment Ideas Community InsightsComunity Insights BlogBlog         Markets       Markets        CryptocurenciesCryptocurencies CalendarsCalendars trend TickersTrending ticker stock ActivesStocks actives stock GainersStocks gainer stock LosersStocks Losers Top etfstop etfs FuturesFutures World IndicesWorld Indices CurenciesCurencies Top Mutual FundsTop Mutual Funds Options Highest Open InterestOptions Highest Open Interest Options Highest Implied VolatilityOptions Highest Imp

In [35]:
# from https://huggingface.co/tasks/text-classification
# from https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english
from transformers import pipeline

# device='cuda:0' requires a GPU; use device=-1 to run on the CPU
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device='cuda:0')


In [36]:
def select_positive_sentences_about_organizations(text):
    sentences = text_to_sentences(text)

    new_sentences = []
    for sentence in sentences:
        if (len(sentence) > 0):
            sentence = ' '.join(sentence) # convert list of sentence tokens to text
            # result is a list with a single dictionary as element
            # the model accepts at most 512 tokens; keeping at most the first 512 characters
            # is a simple guard that keeps typical sentences within that limit
            result = classifier(sentence[:512])
                    
            sent_obj = nlp(sentence)
            for ent in sent_obj.ents: # for each named entity in the sentence 
                if (ent.label_ == 'ORG'):
                    if (result[0]['label'] == 'POSITIVE'): 
                        new_sentences.append((ent, sentence, result[0]['score']))  
    
    return(new_sentences)


sentences = select_positive_sentences_about_organizations(text)
for sentence in sentences:
    print(sentence[0],": \n",sentence[1],"\n\n score: ",sentence[2])
    print()

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


GroupAir Canada : 
 DroughtHSBC9h agocnw GroupAir Canada begin work Enhance Cold Chain Handling Capabilities Toronto Pearson Cargo FacilityAir Canada today announce start 16 million project expand enhance Air Canada Cargo cold chain handling capability shipment pharmaceutical fresh food perishable Toronto Pearson International Airport cargo facility project be part Air Canada strategy develop cargo division also include acquisition freighter aircraft launch dedicated freighter route expansion e comerce1h agoTipRanksAlibaba outlook be weak Stock be undervalue says JP MorganOne recurrent headline appear financial news board fairly regularly past month go pretty much like be share insert chinese tech education internet company fall answer time be due crackdown course amongst nursing wound tighten chinese regulatory environment be 

 score:  0.8077647686004639

Enhance Cold Chain Handling Capabilities Toronto : 
 DroughtHSBC9h agocnw GroupAir Canada begin work Enhance Cold Chain Handling C

### The pipelines in Hugging Face convert text into numbers, creating embeddings that are suitable for machine learning operations, such as clustering and classification. We will now learn how this can be done.  