# Python Text Analysis: Part 1 Solutions

In [None]:
import pandas as pd
import os
import re
import nltk
import spacy

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation

In [None]:
# Import pandas
import pandas as pd

# Use pandas to import Tweets
csv_path = '../data/airline_tweets.csv'
tweets = pd.read_csv(csv_path, sep=',')

## 🥊 Challenge 1: Preprocessing with Multiple Steps

So far we've learned a few preprocessing operations. Let's put them together in a function! It will be handy to refer back to whenever you work with messy English text and want to preprocess it in a single call.

The example text data for challenge 1 has been read in. Write a function to:
- Lowercase the text
- Remove punctuation marks
- Remove extra whitespace characters

Feel free to recycle the code we've used above!

**Code Explanation**
* This code opens a text file, reads its entire contents, and prints them to the screen.

In [20]:
import re
challenge1_path = '../data/example1.txt'

with open(challenge1_path, 'r') as file:
    challenge1 = file.read()
    
print(challenge1)



This is a text file that has some extra blankspace at the start and end. Blankspace is a catch-all term for spaces, tabs, newlines, and a bunch of other things that computers distinguish but to us all look like spaces, tabs and newlines.


The Python method called "strip" only catches blankspace at the start and end of a string. But it won't catch it in       the middle,		for example,

in this sentence.		Once again, regular expressions will

help		us    with this.





**Code Explanation**
- This function removes every character listed in string.punctuation from the text, leaving letters, digits, and whitespace.

- **Step-by-step explanation**
    * from string import punctuation
        - Imports a constant named punctuation, a string containing the ASCII punctuation characters:
        - !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    * Definition of the function remove_punct(text)
        - Input: a text string.
        - Process:
            - Iterates over each character of the text (for char in text).
            - If the character is not in punctuation, it is appended to the list no_punct.
        - Output: returns a new string (text_no_punct) built by joining (''.join) all the non-punctuation characters.
    * Result: you get the original text without any punctuation marks.

In [21]:
from string import punctuation

def remove_punct(text):
    '''Remove punctuation marks in input text'''
    
    # Select characters not in punctuation
    no_punct = []
    for char in text:
        if char not in punctuation:
            no_punct.append(char)

    # Join the characters into a string
    text_no_punct = ''.join(no_punct)   
    
    return text_no_punct
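
As an aside, the same result can be obtained without an explicit loop. The sketch below uses `str.maketrans` and `str.translate` from the standard library; the helper name `remove_punct_translate` is made up for illustration and is not part of the workshop code.

```python
from string import punctuation

# Build a translation table that deletes every character in string.punctuation
punct_table = str.maketrans('', '', punctuation)

def remove_punct_translate(text):
    '''Remove punctuation marks with str.translate (hypothetical helper name)'''
    return text.translate(punct_table)

print(remove_punct_translate("Hello, world! It's 9:30."))
# Hello world Its 930
```

Both versions drop exactly the characters in `string.punctuation`, so they can be swapped freely inside a cleaning function.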

**Code Explanation**

This code implements a basic text-cleaning function for NLP 🧹. It lowercases the text, removes punctuation, and normalizes whitespace.
- Explanation:
    
    1 Regular expression for whitespace: blankspace_pattern = r'\s+'
    - \s = any whitespace character (spaces, tabs, newlines).
    - (+) = one or more.
    - 👉 This pattern matches any run of one or more whitespace characters.
            
    2 Replacement text:
    - blankspace_repl = ' '
    - Each run of whitespace is replaced with a single space.

    3 The clean_text function:
    - Step 1: text.lower() - Converts everything to lowercase for uniformity.
    - Step 2: remove_punct(text) - Removes punctuation marks (using the function defined earlier).
    - Step 3: 
        - re.sub(...): replaces each run of whitespace with a single space.
        - .strip(): removes whitespace at the start and end.
    
    4 Returns the cleaned text ✅
    5 Runs the cleaning on challenge1, the text read from the file example1.txt.

🧪 Practical example
- If challenge1 = "Hola!!! Mundo, Esto es una PRUEBA..."
    - print(clean_text(challenge1))
- Output:
    - hola mundo esto es una prueba

In [23]:
# Write a pattern in regex
blankspace_pattern = r'\s+'

# Write a replacement for the pattern identified
blankspace_repl = ' '

def clean_text(text):

    # Step 1: Lowercase the input text
    text = text.lower()

    # Step 2: Use remove_punct to remove punctuation marks
    text = remove_punct(text)

    # Step 3: Remove extra whitespace characters
    text = re.sub(blankspace_pattern, blankspace_repl, text)
    text = text.strip()
    
    return text
    
clean_text(challenge1)

'this is a text file that has some extra blankspace at the start and end blankspace is a catchall term for spaces tabs newlines and a bunch of other things that computers distinguish but to us all look like spaces tabs and newlines the python method called strip only catches blankspace at the start and end of a string but it wont catch it in the middle for example in this sentence once again regular expressions will help us with this'
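
To see why step 3 needs the regular expression and not just `strip`, here is a small sketch on a made-up string (`messy` is an illustrative value, not the contents of example1.txt). It also shows `' '.join(text.split())`, a standard-library idiom that collapses whitespace the same way for this input.

```python
import re

messy = "  in   the\t\tmiddle,\n\nfor example  "

# strip() only trims the ends; the inner runs of whitespace remain
stripped_only = messy.strip()

# The regex collapses every run of whitespace to one space, then strip() trims the ends
normalized = re.sub(r'\s+', ' ', messy).strip()

# split() with no arguments splits on whitespace runs, so joining gives the same result here
joined = ' '.join(messy.split())

print(repr(stripped_only))
print(repr(normalized))
print(repr(joined))
```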

📌 # **The clean_text function**

* A function to:
    * Convert the text to lowercase
    * Remove punctuation marks
    * Remove extra whitespace

In [19]:
import re

# Patterns used for cleaning
blankspace_pattern = r'\s+'
blankspace_repl = " "

def remove_punct(text):
    return re.sub(r'[^\w\s]', '', text)

def clean_text(text):
    # Step 1: lowercase
    text = text.lower()
    
    # Step 2: remove punctuation
    text = remove_punct(text)
    
    # Step 3: remove extra whitespace
    text = re.sub(blankspace_pattern, blankspace_repl, text)
    text = text.strip()
    
    return text

# Usage example
challenge1 = "Hola!!!   Mundo,   esto es   una   PRUEBA..."
print(clean_text(challenge1))


hola mundo esto es una prueba
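
The regex-based `remove_punct` in this cell is not quite equivalent to the earlier version built on `string.punctuation`. A small sketch of one difference, using a made-up sample string: the underscore is listed in `string.punctuation`, but it counts as a word character for `\w`, so the regex leaves it in place.

```python
import re
from string import punctuation

sample = "snake_case names, e.g. user_id!"

# Approach 1: drop characters listed in string.punctuation (this includes "_")
by_constant = ''.join(ch for ch in sample if ch not in punctuation)

# Approach 2: drop anything that is neither a word character nor whitespace;
# "_" is a word character in regex terms, so it survives
by_regex = re.sub(r'[^\w\s]', '', sample)

print(by_constant)  # snakecase names eg userid
print(by_regex)     # snake_case names eg user_id
```

Which behavior is preferable depends on the data; for tweets, keeping underscores may help preserve user handles.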


## 🥊 Challenge 2: Remove Stop Words

We have seen how `nltk` and `spaCy` work as NLP packages. We've also demonstrated how to identify stop words with each package.

Let's write **two** functions to remove stop words from our text data. 

- Complete the function for stop words removal using `nltk`
    - The starter code requires two arguments: the raw text input and a list of predefined stop words
- Complete the function for stop words removal using `spaCy`
    - The starter code requires one argument: the raw text input
 
A friendly reminder before we dive in: both functions take raw text as input—that's a signal to perform tokenization on the raw text first!

In [31]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop = stopwords.words('english')

def remove_stopword_nltk(raw_text, stopword):
    
    # Step 1: Tokenization with nltk
    tokens = word_tokenize(raw_text)
    
    # Step 2: Filter out tokens in the stop word list
    text = [token for token in tokens if token not in stopword]
    
    return text


In [None]:
nlp = spacy.load('en_core_web_sm')

def remove_stopword_spacy(raw_text):

    # Step 1: Apply the nlp pipeline
    doc = nlp(raw_text)
    
    # Step 2: Filter out tokens in the stop word list
    text = [token.text for token in doc if not token.is_stop]

    return text

In [None]:
text = tweets['text'][7]

In [None]:
remove_stopword_nltk(text, stop)

In [None]:
remove_stopword_spacy(text)
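
One detail worth checking in the `nltk` version: the membership test `token not in stopword` is case-sensitive, and the English stop word list from `nltk` is lowercase, so a capitalized "The" is kept unless the text is lowercased first. The sketch below illustrates this without any NLP package. The tiny `toy_stopwords` set, the whitespace tokenization, and the `lowercase` parameter are all stand-ins invented for this example, not part of the challenge solution.

```python
# Stand-in stop word list (hypothetical; far smaller than nltk's)
toy_stopwords = {'the', 'is', 'a', 'to'}

def remove_stopword_simple(raw_text, stopword, lowercase=False):
    '''Whitespace-tokenize, then drop tokens found in the stop word list'''
    tokens = raw_text.split()
    if lowercase:
        tokens = [token.lower() for token in tokens]
    return [token for token in tokens if token not in stopword]

sample_text = 'The flight is a delight to board'

print(remove_stopword_simple(sample_text, toy_stopwords))
# ['The', 'flight', 'delight', 'board']
print(remove_stopword_simple(sample_text, toy_stopwords, lowercase=True))
# ['flight', 'delight', 'board']
```

Running `clean_text` before stop word removal is one way to get the lowercased behavior in the notebook's own pipeline.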

## 🥊 Challenge 3: Find the Word Boundary

Now that we know BERT tokenization often returns subwords, let's try a few more examples!

Does the result make sense to you? What do you think is the correct word boundary to split the following words into subwords? 

Also feel free to read more about the limitations of the WordPiece algorithm. For instance, [this blog post](https://medium.com/@rickbattle/weaknesses-of-wordpiece-tokenization-eb20e37fec99) dives into why it can fail, and [this one](https://tinkerd.net/blog/machine-learning/bert-tokenization/#demo-bert-tokenizer) introduces the mechanism underlying the algorithm.
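
To build some intuition for where the word boundaries come from, here is a minimal sketch of greedy longest-match-first splitting, the strategy WordPiece uses at tokenization time. The function name `wordpiece_tokenize` and the `toy_vocab` set are assumptions made for illustration; the real BERT vocabulary has roughly 30,000 entries, and the real tokenizer also normalizes text and splits on punctuation first, so actual outputs may differ.

```python
def wordpiece_tokenize(word, vocab, unk_token='[UNK]'):
    '''Greedy longest-match-first split of one word into subwords (sketch)'''
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = '##' + piece  # continuation pieces carry a '##' prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

# Hypothetical toy vocabulary, NOT the real BERT vocabulary
toy_vocab = {'hug', '##gable', '##able', 'h', '##u', '##g'}

print(wordpiece_tokenize('huggable', toy_vocab))  # ['hug', '##gable']
print(wordpiece_tokenize('xyz', toy_vocab))       # ['[UNK]']
```

Because the match is greedy, the split depends on which pieces happen to be in the vocabulary, not on the word's morphology, which is one reason the boundaries can look unintuitive.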

In [None]:
# Load BERT tokenizer in
from transformers import BertTokenizer

# Initialize the tokenizer 
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [None]:
def get_tokens(string):
    '''Tokenize the input string with BERT and print the tokens'''
    tokens = tokenizer.tokenize(string)
    print(tokens)

In [None]:
# Abbreviations
get_tokens('dlab')

# OOV
get_tokens('covid')

# Prefix
get_tokens('huggable')

# Digits
get_tokens('378')

# YOUR EXAMPLE