### References

- https://oa.upm.es/70960/1/TFG_MARIA_BELEN_ALARCON_RAMOS.pdf
- https://www.datacamp.com/es/tutorial/text-analytics-beginners-nltk
- https://anie.me/On-Torchtext/

### Imports and Setup

In [1]:
import pandas as pd
import re
import emoji
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Download the required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\andre\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\andre\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\andre\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Data Cleaning

This project cleans a dataset of climate-related tweets to prepare it for sentiment analysis. The cleaning applies several transformations, categorizations, and substitutions to the text, detailed below.

#### Data Cleaning Steps

1. **Process Hashtags**: Rather than deleting hashtags, the `#` is removed and the words they contain are kept, since they can carry valuable information for sentiment analysis. The splitting relies on capitalization: `#ClimateStrike` is intended to become `climate strike`, so that each term is processed individually. Note that `clean_text` lowercases the text before this step, so in the sample output further down `#ClimateStrike` comes out as the single token `climatestrike`.

2. **Categorize Years**: Four-digit years found in the text are grouped into categories relevant to the analysis. This helps to detect important periods in the context of climate change. For example:
   - Years before 1800 are labeled `anio_pre_industrial`.
   - Years from 1800 to 1949 are labeled `anio_industrializacion`.
   - Other key periods include `anio_post_guerra`, `anio_pre_acuerdo_paris`, and `anio_decada_actual`.

3. **Categorize Temperatures**: Temperature mentions in degrees Celsius are classified into ranges that reflect critical climate change thresholds. Example categories include:
   - `temp_bajo_cero` for negative temperatures.
   - `temp_objetivo_paris` for temperatures from 0 up to (but not including) 1.5°C, reflecting the Paris Agreement target.
   - `temp_calentamiento_extremo` for temperatures of 4°C or more.

4. **Categorize CO2 Concentrations**: CO2 concentrations (in ppm) are classified by their level of impact:
   - `co2_pre_industrial` for values below 350 ppm.
   - `co2_critico` for concentrations of 450 ppm or more.

5. **Categorize Percentages**: Percentages in the text are categorized by magnitude, which makes it easier to analyze statements about percentage increases, reductions, or commitments:
   - `porcentaje_minimo` for values below 1%.
   - `porcentaje_alto` for values from 50% up to (but not including) 90%.
   - `porcentaje_muy_alto` for values of 90% or more.

6. **Replace URLs and Usernames**: URLs are replaced with the placeholder `link_url` and Twitter user mentions are replaced with `cuenta_usuario`, since these elements are not relevant to sentiment analysis.

7. **Replace Numbers**: Any remaining numbers that were not categorized are replaced with the placeholder `numero`.

8. **Normalize Text**: The text is converted to lowercase, emojis are removed, and characters repeated three or more times in a row within a word are reduced to two (for example, `sooo happy` becomes `soo happy`).

9. **Remove Punctuation**: Punctuation marks are removed, since they are not treated as useful information for this sentiment analysis.
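
Several of these steps interact, so the order in which they run matters. The sketch below illustrates two such cases under stated assumptions: the helpers `split_hashtags`, `collapse_repeats`, and `replace_urls` are hypothetical stand-ins written for this example (they reuse the patterns described above but are not the notebook's `clean_text`), and `www.example.org` is a placeholder address.

```python
import re

# Same camelCase word pattern used by split_hashtag in this notebook
WORD_PATTERN = r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z][a-z]|\d|\W|$)|\d+'

def split_hashtags(text):
    """Hypothetical helper: replace each #Hashtag with its component words."""
    def _split(match):
        words = re.findall(WORD_PATTERN, match.group(1))
        # Fall back to the raw tag if the pattern finds no words
        return ' '.join(words) if words else match.group(1)
    return re.sub(r'#(\w+)', _split, text)

def collapse_repeats(text):
    """Hypothetical helper: reduce runs of 3+ identical letters to two."""
    return re.sub(r'([^\W\d_])\1{2,}', r'\1\1', text)

def replace_urls(text):
    """Hypothetical helper: replace URLs with the link_url placeholder."""
    return re.sub(r'http\S+|www\.\S+', 'link_url', text)

# Case 1: hashtag splitting versus lowercasing
tweet = "Join the #ClimateStrike on #FridaysForFuture"
# Splitting first keeps the capital letters that mark word boundaries
split_then_lower = split_hashtags(tweet).lower()
# Lowercasing first erases the boundaries, so each hashtag stays one token
lower_then_split = split_hashtags(tweet.lower())
print(split_then_lower)
print(lower_then_split)

# Case 2: collapsing repeated letters versus URL replacement
link = "sooo true www.example.org"
# Collapsing first turns 'www' into 'ww', so the URL is no longer recognized
print(replace_urls(collapse_repeats(link)))
# Replacing URLs first keeps the placeholder intact
print(collapse_repeats(replace_urls(link)))
```

In both cases the second ordering loses information that the first one preserves, which is worth keeping in mind when rearranging the steps.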

#### Applying the Cleaning Process

- The cleaning process is applied to the `text` column of a dataset of climate tweets.
- Both the original text and the cleaned text are kept for later analysis.
- Finally, the processed dataset is saved to a CSV file for future analysis.

In [None]:
def process_hashtags(text):
    # Split a hashtag into words at camelCase and digit boundaries
    def split_hashtag(tag):
        words = re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z][a-z]|\d|\W|$)|\d+', tag)
        # Keep the tag unchanged if the pattern finds no words (e.g. non-ASCII tags)
        return ' '.join(words) if words else tag

    # Replace each hashtag in place; a callback avoids the partial replacements
    # that str.replace makes when one hashtag is a prefix of another
    return re.sub(r'#(\w+)', lambda m: split_hashtag(m.group(1)), text)

def categorize_year(match):
    year = int(match.group())
    if year < 1800:
        return 'anio_pre_industrial'
    elif 1800 <= year < 1950:
        return 'anio_industrializacion'
    elif 1950 <= year < 1990:
        return 'anio_post_guerra'
    elif 1990 <= year < 2000:
        return 'anio_90s'
    elif 2000 <= year < 2015:
        return 'anio_pre_acuerdo_paris'
    elif 2015 <= year < 2020:
        return 'anio_post_acuerdo_paris'
    elif 2020 <= year <= 2030:
        return 'anio_decada_actual'
    elif 2030 < year <= 2050:
        return 'anio_objetivo_mediano_plazo'
    elif year > 2050:
        return 'anio_futuro_lejano'
    else:
        return 'otro_anio'
    
def categorize_temperature(match):
    temp = float(match.group(1))
    if temp < 0:
        return 'temp_bajo_cero'
    elif 0 <= temp < 1.5:
        return 'temp_objetivo_paris'
    elif 1.5 <= temp < 2:
        return 'temp_limite_paris'
    elif 2 <= temp < 4:
        return 'temp_calentamiento_alto'
    else:
        return 'temp_calentamiento_extremo'

def categorize_co2(match):
    co2 = int(match.group(1))
    if co2 < 350:
        return 'co2_pre_industrial'
    elif 350 <= co2 < 400:
        return 'co2_elevado'
    elif 400 <= co2 < 450:
        return 'co2_muy_elevado'
    else:
        return 'co2_critico'

def categorize_percentage(match):
    percent = float(match.group(1))
    if percent < 1:
        return 'porcentaje_minimo'
    elif 1 <= percent < 10:
        return 'porcentaje_bajo'
    elif 10 <= percent < 50:
        return 'porcentaje_medio'
    elif 50 <= percent < 90:
        return 'porcentaje_alto'
    else:
        return 'porcentaje_muy_alto'
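
Each `categorize_*` function above is written to be passed to `re.sub` as the replacement argument: `re.sub` calls it with the match object and inserts the label it returns. The following is a minimal sketch of that mechanism, assuming a hypothetical table-driven callback `label_temperature`. The thresholds are the ones in `categorize_temperature`; the pattern is an illustrative assumption that expects already-lowercased text and allows a leading minus sign.

```python
import re
from bisect import bisect_right

# Range boundaries and labels taken from categorize_temperature
BOUNDS = [0, 1.5, 2, 4]
LABELS = [
    'temp_bajo_cero',              # temp < 0
    'temp_objetivo_paris',         # 0 <= temp < 1.5
    'temp_limite_paris',           # 1.5 <= temp < 2
    'temp_calentamiento_alto',     # 2 <= temp < 4
    'temp_calentamiento_extremo',  # temp >= 4
]

def label_temperature(match):
    """Hypothetical replacement callback: map the captured number to a label."""
    temp = float(match.group(1))
    # bisect_right counts how many bounds are <= temp, which is the label index
    return LABELS[bisect_right(BOUNDS, temp)]

# Illustrative pattern (an assumption): optional minus sign, lowercase unit
TEMP_PATTERN = r'(?<!\w)(-?\d+(?:\.\d+)?)\s*(?:°\s*c\b|degrees?(?:\s+celsius)?)'

text = "limit warming to 1.5°c, not 4 degrees; it was -3 °c today"
print(re.sub(TEMP_PATTERN, label_temperature, text))
```

A lookup table keeps the boundaries and labels in one place, at the cost of being slightly less explicit than the `if`/`elif` chains used in the notebook.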

In [20]:
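# Function to clean the text of a single tweet
def clean_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove emojis: turn them into :name: codes, then drop the codes
    text = emoji.demojize(text)
    text = re.sub(r':[a-zA-Z_]+:', '', text)

    # Strip the '#' from hashtags (the text is already lowercase here,
    # so camelCase hashtags are not split into separate words)
    text = process_hashtags(text)

    # Replace URLs; this must run before repeated letters are collapsed,
    # otherwise 'www' becomes 'ww' and is no longer recognized
    text = re.sub(r'http\S+|www\.\S+', 'link_url', text)

    # Replace user mentions
    text = re.sub(r'@\w+', 'cuenta_usuario', text)

    # Remove entity-escaped HTML tags (&lt;...&gt;)
    text = re.sub(r'&lt;.*?&gt;?', '', text)

    # Collapse letters repeated 3+ times to two; digits are left alone
    # so that numbers such as 2000 are not altered
    text = re.sub(r'([^\W\d_])\1{2,}', r'\1\1', text)

    # Categorize numbers. Quantities with a unit go first so that values
    # such as '1000 ppm' are not mistaken for years. Units are lowercase
    # because the text was lowercased above
    text = re.sub(r'(?<!\w)(-?\d+(?:\.\d+)?)\s*(?:°\s*c\b|degrees?(?:\s+celsius)?|c\b)', categorize_temperature, text)
    text = re.sub(r'(\d+)\s*(?:ppm|parts?\s+per\s+million)', categorize_co2, text)
    text = re.sub(r'(\d+(?:\.\d+)?)\s*%', categorize_percentage, text)
    text = re.sub(r'\b\d{4}\b', categorize_year, text)
    text = re.sub(r'\b\d+(?:\.\d+)?\b', 'numero', text)

    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)

    return text.strip()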

# Load the dataset
df = pd.read_csv('./data/raw/climateTwitterData.csv')

# Apply the cleaning to the 'text' column
df['cleaned_text'] = df['text'].apply(clean_text)

# Show a few before/after examples
for i in range(5):
    print(f"Original: {df['text'].iloc[i]}")
    print(f"Limpio: {df['cleaned_text'].iloc[i]}")
    print()

df[['text', 'cleaned_text']].to_csv('./data/processed/cleanedClimateTwitterData.csv', index=False)



Original: 2020 is the year we #votethemout, the year we #climatestrike our hearts out, the year we #rebelforlife because without a liveable future nothing else matters. 2020 is the year we get shit done. (3/3)
Limpio: anio_decada_actual is the year we votethemout the year we climatestrike our hearts out the year we rebelforlife because without a liveable future nothing else matters anio_decada_actual is the year we get shit done numeronumero

Original: Winter has not stopped this group of dedicated climate activists. They are an example to follow. #climatefriday #climatestrike #ClimateAction
Limpio: winter has not stopped this group of dedicated climate activists they are an example to follow climatefriday climatestrike climateaction

Original: WEEK 55 of #ClimateStrike at the @UN. Next week, @Fridays4future heads into its 3rd year of striking. As our time on the streets gets longer, we need you to act and do something for the climate. In 2020, people must stop looking away and stop pr
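
In the first sample above, `(3/3)` ends up as `numeronumero`: both digits are replaced by `numero`, and the `/` between them is then deleted without leaving a space. A minimal sketch of one possible alternative follows, assuming it is acceptable to replace punctuation with a space and collapse the resulting whitespace. The helper name `strip_punctuation` is hypothetical and is not part of the notebook.

```python
import re

def strip_punctuation(text):
    """Hypothetical helper: replace punctuation with spaces, then tidy whitespace."""
    # Any character that is neither a word character nor whitespace becomes a space
    text = re.sub(r'[^\w\s]', ' ', text)
    # Collapse the runs of whitespace this creates
    return re.sub(r'\s+', ' ', text).strip()

sample = "nothing else matters. (numero/numero)"

# Deleting punctuation outright fuses the two placeholders into one token
print(re.sub(r'[^\w\s]', '', sample))

# Replacing it with a space keeps them apart
print(strip_punctuation(sample))
```

Keeping the placeholders separate means the later tokenization step sees two `numero` tokens instead of one unknown word.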

In [21]:
df = pd.read_csv('./data/processed/cleanedClimateTwitterData.csv')
# Tweets that clean to an empty string are read back as NaN; treat them as empty text
text_data = df['cleaned_text'].fillna('')

# Tokenization
tokenized_texts = [word_tokenize(text) for text in text_data]

# Remove stopwords
stop_words = set(stopwords.words('english'))
cleaned_texts = [[word for word in tokens if word not in stop_words] for tokens in tokenized_texts]

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_texts = [[lemmatizer.lemmatize(word) for word in tokens] for tokens in cleaned_texts]
processed_texts = [' '.join(tokens) for tokens in lemmatized_texts]

# Bag of Words
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(processed_texts)

# Reduce the sparse count matrix to 100 dense features with truncated SVD
n_components = 100
svd = TruncatedSVD(n_components=n_components)
reduced_bow = svd.fit_transform(bow_matrix)
reduced_bow_df = pd.DataFrame(reduced_bow, columns=[f'bow_feature_{i}' for i in range(n_components)])

# Add processed text and reduced BoW features to original DataFrame
df['processed_text'] = processed_texts
df = pd.concat([df, reduced_bow_df], axis=1)

df.to_csv('./data/processed/processedClimateTwitterData.csv', index=False)
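
To make the bag-of-words step concrete, the following pure-Python sketch builds the same kind of document-term count matrix on two toy sentences. It is only an illustration and not scikit-learn's implementation: the function name `bag_of_words` is hypothetical, and the token pattern mirrors the documented default of `CountVectorizer` (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

def bag_of_words(documents):
    """Hypothetical helper: return (sorted vocabulary, list of count rows)."""
    # Tokens of 2+ word characters, as in CountVectorizer's default token_pattern
    tokenized = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One column per vocabulary term, in alphabetical order
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

docs = [
    "climate strike climate action",
    "youth protest climate change",
]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

On the real dataset this matrix has one column per distinct token, which is why the notebook follows it with `TruncatedSVD` to obtain 100 dense features per tweet.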

In [25]:
display(df.head())

| | text | cleaned_text | processed_text | bow_feature_0 | bow_feature_1 | bow_feature_2 | bow_feature_3 | bow_feature_4 | bow_feature_5 | bow_feature_6 | ... | bow_feature_90 | bow_feature_91 | bow_feature_92 | bow_feature_93 | bow_feature_94 | bow_feature_95 | bow_feature_96 | bow_feature_97 | bow_feature_98 | bow_feature_99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 is the year we #votethemout, the year we ... | anio_decada_actual is the year we votethemout ... | anio_decada_actual year votethemout year clima... | 0.182222 | 0.770263 | -0.169053 | 0.567862 | 0.127018 | -0.306568 | 0.374607 | ... | 0.195437 | -0.017584 | -0.094284 | 0.072052 | -0.087219 | 0.007657 | -0.059401 | -0.142866 | 0.009915 | -0.05851 |
| 1 | Winter has not stopped this group of dedicated... | winter has not stopped this group of dedicated... | winter stopped group dedicated climate activis... | 0.213223 | 0.93851 | -0.524714 | 0.839188 | -0.295755 | 0.211378 | -0.050596 | ... | -0.02775 | 0.011625 | 0.003746 | 0.035626 | 0.005179 | -0.050737 | -0.004331 | 0.006624 | 0.004949 | 0.012931 |
| 2 | WEEK 55 of #ClimateStrike at the @UN. Next wee... | week numero of climatestrike at the cuenta_usu... | week numero climatestrike cuenta_usuario next ... | 2.335869 | 1.305344 | 0.306468 | 0.748459 | -0.506978 | 0.313775 | 0.206079 | ... | 0.054295 | 0.106906 | 0.430716 | 0.229943 | 0.069674 | -0.622509 | 0.885369 | 0.423262 | 0.094535 | 0.476259 |
| 3 | A year of resistance, as youth protests shape... | a year of resistance as youth protests shaped ... | year resistance youth protest shaped climate c... | 0.503495 | 2.148792 | -1.327881 | 0.339124 | -0.300876 | 1.059556 | -0.781069 | ... | -0.25651 | -0.137293 | 0.114234 | -0.020294 | -0.12266 | -0.298775 | -0.218554 | 0.026688 | 0.001849 | 0.339471 |
| 4 | HAPPY HOLIDAYS #greta #gretathunberg #climate... | happy holidays greta gretathunberg climatechan... | happy holiday greta gretathunberg climatechang... | 0.459755 | 1.793725 | -1.240209 | -0.024631 | 1.12666 | 0.538389 | -0.534122 | ... | -0.035914 | -0.245255 | -0.055241 | 0.039951 | -0.168081 | -0.142689 | -0.262644 | 0.06413 | 0.056814 | 0.545895 |


In [24]:
display(df[['text', 'cleaned_text', 'processed_text']].head(20))


| | text | cleaned_text | processed_text |
|---|---|---|---|
| 0 | 2020 is the year we #votethemout, the year we ... | anio_decada_actual is the year we votethemout ... | anio_decada_actual year votethemout year clima... |
| 1 | Winter has not stopped this group of dedicated... | winter has not stopped this group of dedicated... | winter stopped group dedicated climate activis... |
| 2 | WEEK 55 of #ClimateStrike at the @UN. Next wee... | week numero of climatestrike at the cuenta_usu... | week numero climatestrike cuenta_usuario next ... |
| 3 | A year of resistance, as youth protests shape... | a year of resistance as youth protests shaped ... | year resistance youth protest shaped climate c... |
| 4 | HAPPY HOLIDAYS #greta #gretathunberg #climate... | happy holidays greta gretathunberg climatechan... | happy holiday greta gretathunberg climatechang... |
| 5 | 10 Questions to Ask Politicians About Climate... | numero questions to ask politicians about clim... | numero question ask politician climate change ... |
| 6 | #climatestrike #FridaysForFuture #portraits #u... | climatestrike fridaysforfuture portraits uniqu... | climatestrike fridaysforfuture portrait unique... |
| 7 | #ClimateChangeIsReal #ClimateStrike #ClimateAc... | climatechangeisreal climatestrike climateactio... | climatechangeisreal climatestrike climateactio... |
| 8 | My oldest daughter finding inspiration and enc... | my oldest daughter finding inspiration and enc... | oldest daughter finding inspiration encouragem... |
| 9 | Our toddler #POTUS whined this week about #Tim... | our toddler potus whined this week about time ... | toddler potus whined week time magazine pickin... |
