# Text processing

## Reading text

### Plain text

In [15]:
import os 

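# Pass the encoding explicitly so the result does not depend on the platform default
with open(os.path.join("data", "hieroglyph.txt"), "r", encoding="utf-8") as file: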
    text = file.read()
    print(text)

Hieroglyphic writing dates from c. 3000 BC, and is composed of hundreds of symbols. A hieroglyph can represent a word, a sound, or a silent determinative; and the same symbol can serve different purposes in different contexts. Hieroglyphs were a formal script, used on stone monuments and in tombs, that could be as detailed as individual works of art.
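
The cell above reads the whole file in one call. A minimal sketch of the same step wrapped in a reusable helper follows, assuming the file is UTF-8 encoded. The helper name `read_text_file` and the temporary sample file are hypothetical and exist only so the example runs without `data/hieroglyph.txt`.

```python
from pathlib import Path
import tempfile


def read_text_file(path, encoding="utf-8"):
    """Return the full contents of a text file, decoded with an explicit encoding."""
    return Path(path).read_text(encoding=encoding)


# Write a small temporary file so the sketch is self-contained (assumption:
# the real notebook would point at data/hieroglyph.txt instead).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("Hieroglyphs were a formal script.\n", encoding="utf-8")
    sample_text = read_text_file(sample_path)

print(sample_text)
```

Naming the encoding keeps the result independent of the platform's default, which matters once the text contains non-ASCII characters.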



### Tabular data

In [16]:
!pip install pandas



In [17]:
import pandas as pd

df = pd.read_csv(os.path.join("data", "news.csv"))
df.head()[['publisher', 'title']]

| | publisher | title |
|---|---|---|
| 0 | Livemint | Fed's Charles Plosser sees high bar for change... |
| 1 | IFA Magazine | US open: Stocks fall after Fed official hints ... |
| 2 | IFA Magazine | Fed risks falling 'behind the curve', Charles ... |
| 3 | Moneynews | Fed's Plosser: Nasty Weather Has Curbed Job Gr... |
| 4 | NASDAQ | Plosser: Fed May Have to Accelerate Tapering Pace |


In [18]:
# Convert text column to lowercase
df['title'] = df['title'].str.lower()
df.head()[['publisher', 'title']]

| | publisher | title |
|---|---|---|
| 0 | Livemint | fed's charles plosser sees high bar for change... |
| 1 | IFA Magazine | us open: stocks fall after fed official hints ... |
| 2 | IFA Magazine | fed risks falling 'behind the curve', charles ... |
| 3 | Moneynews | fed's plosser: nasty weather has curbed job gr... |
| 4 | NASDAQ | plosser: fed may have to accelerate tapering pace |
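
The same lowercasing step can be sketched with only the standard library, which may help when pandas is not available. This swaps in `csv.DictReader` for `pd.read_csv`; it is not what the notebook uses. The in-memory CSV, the "Example Publisher" row, and the helper `load_lowercased_titles` are hypothetical.

```python
import csv
import io

# Small in-memory sample with the same two columns as the table above.
# The first row is invented for illustration.
SAMPLE_CSV = """publisher,title
Example Publisher,An Illustrative HEADLINE
NASDAQ,Plosser: Fed May Have to Accelerate Tapering Pace
"""


def load_lowercased_titles(csv_text):
    """Parse CSV text and return rows whose 'title' field is lowercased."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"publisher": row["publisher"], "title": row["title"].lower()}
        for row in reader
    ]


rows = load_lowercased_titles(SAMPLE_CSV)
for row in rows:
    print(row["publisher"], "|", row["title"])
```

Only the `title` field is changed; the publisher names keep their original casing, as in the pandas cell.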


### Online resources

In [19]:
import requests
import json

# Fetch data from a REST API
r = requests.get("https://quotes.rest/qod.json")
res = r.json()
print(json.dumps(res, indent=4))

{
    "success": {
        "total": 1
    },
    "contents": {
        "quotes": [
            {
                "quote": "Successful people appear to be traveling along one continual, successful road. What is not apparent is the perseverance it takes following each defeat to keep you on that road. No one I know of has ever experienced one success after another without defeats, failures, disappointments, and frustrations galore along the way. Learning to overcome those times of agony is what separates the winners from the losers.",
                "length": "412",
                "author": "G. Kingsley Ward",
                "tags": {
                    "0": "failure",
                    "1": "inspire",
                    "2": "perseverance",
                    "3": "success",
                    "5": "winning"
                },
                "category": "inspire",
                "language": "en",
                "date": "2021-08-30",
                "permalink": "https://theys

In [20]:
# Extract relevant object and field
q = res["contents"]["quotes"][0]
print(q["quote"], "\n--", q["author"])

Successful people appear to be traveling along one continual, successful road. What is not apparent is the perseverance it takes following each defeat to keep you on that road. No one I know of has ever experienced one success after another without defeats, failures, disappointments, and frustrations galore along the way. Learning to overcome those times of agony is what separates the winners from the losers. 
-- G. Kingsley Ward
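
Indexing with `res["contents"]["quotes"][0]` raises an exception if the API returns an error payload instead of a quote. A more defensive variant might look like the sketch below. The sample payload is a shortened, made-up stand-in for the response printed above, and `first_quote` is a hypothetical helper, so no network access is needed.

```python
import json

# Hypothetical payload mirroring the nesting of the response shown above.
SAMPLE_RESPONSE = json.loads("""
{
    "contents": {
        "quotes": [
            {"quote": "An illustrative quote.", "author": "Example Author"}
        ]
    }
}
""")


def first_quote(payload):
    """Return (quote, author) for the first quote, or None if none is present."""
    quotes = payload.get("contents", {}).get("quotes", [])
    if not quotes:
        return None
    first = quotes[0]
    return first.get("quote"), first.get("author")


print(first_quote(SAMPLE_RESPONSE))
print(first_quote({"error": {"code": 429}}))  # made-up error shape
```

Returning `None` for an unexpected shape lets the caller decide whether to retry, skip, or fail.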


## Cleaning

### Using RegEx

In [21]:
import requests

# Fetch a web page
r = requests.get("https://news.ycombinator.com")
print(r.text[:2000])

<html lang="en" op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?e4TTZo68HcOy5uQfOsmc">
        <link rel="shortcut icon" href="favicon.ico">
          <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">
        <title>Hacker News</title></head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">
        <tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img src="y18.gif" width="18" height="18" style="border:1px white solid;"></a></td>
                  <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
              <a href="newest">new</a> | <a href="front">past</a> | <a href=

In [22]:
import re

# Remove HTML tags using RegEx
pattern = re.compile(r'<.*?>')  # tags look like <...>
print(pattern.sub('', r.text[:2000]))  # replace each tag with an empty string


        
          
        Hacker News
        
                  Hacker News
              new | past | comments | ask | show | jobs | submit            
                              login
                          
              

              
      1.      I nearly lost the Lightroom catalog with of all my photos<span class="sitebit comh
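
The output above still contains long runs of whitespace, and a tag cut off by the 2000-character slice survives because it has no closing `>`. The sketch below shows one possible refinement: it also decodes HTML entities and collapses whitespace. The helper `strip_tags` and the sample markup are hypothetical. Tags are replaced with a space here, which keeps adjacent cells apart but can split a word that has inline markup in the middle.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]*>")     # "<", anything except ">", then ">"
WHITESPACE_PATTERN = re.compile(r"\s+")  # any run of spaces, tabs, or newlines


def strip_tags(markup):
    """Remove complete tags, decode entities, and collapse whitespace."""
    without_tags = TAG_PATTERN.sub(" ", markup)
    decoded = html.unescape(without_tags)
    return WHITESPACE_PATTERN.sub(" ", decoded).strip()


# Invented markup for illustration only.
sample = '<td class="title"><a href="item?id=1">Tom &amp; Jerry</a>\n   <span>new</span></td>'
print(strip_tags(sample))
```

Unlike `<.*?>`, the character class `[^>]` also matches newlines, so a tag that spans several lines is removed as well. Regular expressions remain a rough tool for HTML; a real parser is more reliable on malformed pages.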


### Using BeautifulSoup

In [23]:
!pip install beautifulsoup4 html5lib



In [24]:
from bs4 import BeautifulSoup

# Remove HTML tags using Beautiful Soup library
soup = BeautifulSoup(r.text[:2000], "html5lib")
print(soup.get_text())


        
          
        Hacker News
        
                  Hacker News
              new | past | comments | ask | show | jobs | submit            
                              login
                          
              

              
      1.      I nearly lost the Lightroom catalog with of all my photos


#### Getting the articles from 'soup'

In [25]:
summaries = soup.find_all("tr", class_="athing")
summaries[0]

<tr class="athing" id="28345422">
      <td align="right" class="title" valign="top"><span class="rank">1.</span></td>      <td class="votelinks" valign="top"><center><a href="vote?id=28345422&amp;how=up&amp;goto=news" id="up_28345422"><div class="votearrow" title="upvote"></div></a></center></td><td class="title"><a class="storylink" href="https://ptrchm.com/blog/how-i-nearly-lost-all-my-photos/">I nearly lost the Lightroom catalog with of all my photos</a></td></tr>

In [29]:
# Extract title
summaries[0].find("a", class_="storylink").get_text()

'I nearly lost the Lightroom catalog with of all my photos'
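
To collect every title rather than only the first, the same idea can be applied to each matching link. The sketch below does this with the standard library's `html.parser` instead of Beautiful Soup, so it runs without extra packages; it is a swapped-in technique, not the notebook's method. The class `StoryTitleParser` and the sample rows are hypothetical, and the `storylink` class name is taken from the output above.

```python
from html.parser import HTMLParser


class StoryTitleParser(HTMLParser):
    """Collect the text of every <a class="storylink"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._inside_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "storylink" in classes:
            self._inside_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside_link:
            self.titles.append("".join(self._buffer).strip())
            self._inside_link = False


# Invented rows shaped like the <tr class="athing"> element shown above.
sample_html = (
    '<tr class="athing"><td class="title">'
    '<a class="storylink" href="https://example.com/a">First example story</a>'
    '</td></tr>'
    '<tr class="athing"><td class="title">'
    '<a class="storylink" href="https://example.com/b">Second example story</a>'
    '</td></tr>'
)

parser = StoryTitleParser()
parser.feed(sample_html)
parser.close()
print(parser.titles)
```

With Beautiful Soup, the equivalent would be a loop over `summaries` that calls `find("a", class_="storylink")` on each row, as the cell above does for the first one.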

## Normalization

In [37]:
import os

In [38]:
with open(os.path.join("data", "hieroglyph.txt"), "r", encoding="utf-8") as file:
    text = file.read()

### Case normalization

In [39]:
text = text.lower()
print(text)

hieroglyphic writing dates from c. 3000 bc, and is composed of hundreds of symbols. a hieroglyph can represent a word, a sound, or a silent determinative; and the same symbol can serve different purposes in different contexts. hieroglyphs were a formal script, used on stone monuments and in tombs, that could be as detailed as individual works of art.
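
`str.lower()` is enough for the English sample above. For text in other languages, `str.casefold()` applies a more aggressive mapping intended for caseless comparison. The short sketch below illustrates the difference; the German sample words are chosen only for illustration.

```python
# Compare lower() and casefold() on a few sample words.
samples = ["Hieroglyph", "STRASSE", "Straße"]

for word in samples:
    print(word, "->", word.lower(), "|", word.casefold())

# lower() keeps "ß", so the two spellings do not match.
lower_match = "Straße".lower() == "STRASSE".lower()
# casefold() maps "ß" to "ss", so they do.
casefold_match = "Straße".casefold() == "STRASSE".casefold()
print(lower_match, casefold_match)
```

For ASCII-only text the two methods give the same result, so the choice only matters once the corpus contains such characters.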



### Removing punctuation

In [40]:
import re

In [41]:
text = re.sub(r"[^a-zA-Z0-9]", " ", text)
print(text)

hieroglyphic writing dates from c  3000 bc  and is composed of hundreds of symbols  a hieroglyph can represent a word  a sound  or a silent determinative  and the same symbol can serve different purposes in different contexts  hieroglyphs were a formal script  used on stone monuments and in tombs  that could be as detailed as individual works of art  
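
The pattern `[^a-zA-Z0-9]` treats every non-ASCII character as punctuation, so accented letters are replaced with spaces too. A Unicode-aware alternative is sketched below; `remove_punctuation` and the sample string are hypothetical. It relies on `\w` matching Unicode letters and digits in Python 3 string patterns.

```python
import re

# Anything that is neither a word character nor whitespace, plus the
# underscore (which \w would otherwise keep).
PUNCTUATION = re.compile(r"[^\w\s]|_")


def remove_punctuation(text):
    """Replace punctuation with spaces while keeping Unicode letters and digits."""
    return PUNCTUATION.sub(" ", text)


sample = "café, naïve; c. 3000 bc!"
unicode_aware = remove_punctuation(sample)
ascii_only = re.sub(r"[^a-zA-Z0-9]", " ", sample)

print(unicode_aware.split())
print(ascii_only.split())
```

The ASCII-only pattern breaks "café" and "naïve" into fragments, while the Unicode-aware one leaves them intact.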


### Tokenization

In [42]:
words = text.split()
print(words)

['hieroglyphic', 'writing', 'dates', 'from', 'c', '3000', 'bc', 'and', 'is', 'composed', 'of', 'hundreds', 'of', 'symbols', 'a', 'hieroglyph', 'can', 'represent', 'a', 'word', 'a', 'sound', 'or', 'a', 'silent', 'determinative', 'and', 'the', 'same', 'symbol', 'can', 'serve', 'different', 'purposes', 'in', 'different', 'contexts', 'hieroglyphs', 'were', 'a', 'formal', 'script', 'used', 'on', 'stone', 'monuments', 'and', 'in', 'tombs', 'that', 'could', 'be', 'as', 'detailed', 'as', 'individual', 'works', 'of', 'art']
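
Lowercasing, punctuation removal, and whitespace splitting can also be combined in a single pass. The sketch below uses `re.findall` to do that and `collections.Counter` to count the resulting tokens. The helper `tokenize` is hypothetical, and its pattern is restricted to ASCII letters and digits, matching the cleaning step above.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


sample = "A hieroglyph can represent a word, a sound, or a silent determinative."
tokens = tokenize(sample)
print(tokens)

# Count how often each token occurs and show the most frequent one.
counts = Counter(tokens)
print(counts.most_common(1))
```

Frequency counts like this make it easy to see that short function words such as "a" dominate the text.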


## Continuation

This notebook continues in `02.Nltk.ipynb`.