<a href="https://colab.research.google.com/github/dieko95/blackouts-C4V/blob/master/twitter_pretagging.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training Dataset Creation - Tagging 

This notebook creates the training dataset for Code for Venezuela's Blackout Project.

The dataset will be consumed by an ML model that aims to predict:

- Whether a tweet is from Venezuela
- If so, from which state(s)
- Which public service the user is reporting on (sinluz)



## Libraries



In [None]:
import pandas as pd
import re

# For better visualization of text in Pandas DF
pd.set_option('display.max_colwidth', None)

## Accessing Data

The untagged dataset originates from tweets scraped by Code for Venezuela's Angostura ETL. A subset of 11,000 tweets was queried from the ETL for tagging. The first 4,000 tweets have already been tagged.

In [None]:
# Read CSV 
tags_df = pd.read_csv('tagging-set-original_for_jupyter_tagging.csv')



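# Fill NaNs in label_country with '0' (rows with '0' are still untagged).
# Assigning the result back avoids relying on inplace=True through attribute access.
tags_df['label_country'] = tags_df['label_country'].fillna('0')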


# Tagged Tweets
pre_tag_df = tags_df[(tags_df.label_country != '0')].copy()


# Tweets to tag
to_tag_df = tags_df[tags_df.label_country == '0'].copy()

## Cleaning Text

This helper function quickly cleans text. It:

- Converts all text to lowercase
- Strips Spanish accents

Pending:

- Strip dots and links (@ and # must remain)

In [None]:
def cleaner(df, text_col):
  # Lowercase the text
  df[text_col] = df[text_col].str.lower()

  # Replace common Spanish accented characters with plain equivalents
  replacements = {
      "ú": "u", "ù": "u", "ü": "u",
      "ó": "o", "ò": "o",
      "í": "i", "ì": "i",
      "é": "e", "è": "e",
      "á": "a", "à": "a",
      "ñ": "gn",
  }
  for accented, plain in replacements.items():
    # regex=False treats each character as a literal string
    df[text_col] = df[text_col].str.replace(accented, plain, regex=False)

  return df


to_tag_df = cleaner(to_tag_df, 'concat_text_user_description')
to_tag_df = cleaner(to_tag_df, 'full_text')
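
The pending item listed above (strip dots and links while keeping @ and #) could be handled along the following lines. This is only a sketch: `strip_links_and_dots` is a hypothetical helper that is not part of the notebook, and the URL pattern is an assumption that covers `http(s)://` and `www.` links only.

```python
import pandas as pd

# Assumed URL pattern: anything starting with http://, https:// or www.
URL_PATTERN = r"https?://\S+|www\.\S+"


def strip_links_and_dots(df, text_col):
    """Return a copy of df with links and dots removed from text_col.

    The @ and # characters are never touched, so mentions and hashtags survive.
    """
    out = df.copy()
    col = out[text_col]
    col = col.str.replace(URL_PATTERN, " ", regex=True)   # remove links first
    col = col.str.replace(".", " ", regex=False)          # then remaining dots
    col = col.str.replace(r"\s+", " ", regex=True).str.strip()  # tidy whitespace
    out[text_col] = col
    return out


demo_df = pd.DataFrame(
    {"full_text": ["#sinluz en edo. merida @usuario https://t.co/abc123"]}
)
demo_clean = strip_links_and_dots(demo_df, "full_text")
print(demo_clean.loc[0, "full_text"])
```

Links are removed before dots so that the dots inside a URL disappear together with the link instead of leaving fragments behind.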

## Sections to Tag 

- Tag label_type (service reported)
  - Extract hashtags (\#)

- Tag country
  - Matches any state?
  - Has the keyword 'edo' or 'estado' in it?
  - Follows any of the common accounts?
- Tag state
  - Match against a list of Venezuelan states
  - A list of Venezuelan cities can be used as well
### Classifying Label Type

#### Hashtags

* \#SinLuz
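
A minimal sketch of hashtag-based tagging for label_type follows. The `HASHTAG_TO_LABEL` mapping and both function names are hypothetical; only the #SinLuz hashtag comes from this notebook, and the label value `"sinluz"` is an assumption about how the column is encoded.

```python
import re

# Hypothetical mapping from lowercased hashtag to a label_type value.
# Only #SinLuz is listed in the notebook; extend as more hashtags are agreed on.
HASHTAG_TO_LABEL = {"sinluz": "sinluz"}


def extract_hashtags(text):
    """Return all hashtags in the text, lowercased and without the # sign."""
    return re.findall(r"#(\w+)", text.lower())


def label_type_from_hashtags(text):
    """Return the label of the first known hashtag, or None if none matches."""
    for tag in extract_hashtags(text):
        if tag in HASHTAG_TO_LABEL:
            return HASHTAG_TO_LABEL[tag]
    return None


example = "#Ahora Reportan más zonas #SinLuz: Comenta si hay fallas en tu zona #2Oct"
print(extract_hashtags(example))
print(label_type_from_hashtags(example))
```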






### Tagging Country

  - Matches any state?
  - Has the keyword 'edo' or 'estado' in it?
  - Follows any of the common accounts?

*Notes*
  - This section uses only the tweet's original text. Including the user description can add noise, because a user may be reporting a power outage in another state (e.g., a user from Caracas reporting a power outage in Zulia).
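
The first two checks above could be sketched as boolean flags on the cleaned tweet text. Everything named here is an assumption for illustration: `flag_country`, the flag column names, and `SAMPLE_STATES`, which is a deliberately incomplete list. The "common accounts" check is left out because it needs follower data that this notebook does not load.

```python
import re

import pandas as pd

# Illustrative, incomplete list of states (lowercased, accents already stripped).
SAMPLE_STATES = ["zulia", "miranda", "merida"]

# Whole-word match so that "estados" or "torpedo" do not trigger the keyword rule.
KEYWORD_RE = r"\b(?:edo|estado)\b"
STATE_RE = r"\b(?:" + "|".join(re.escape(s) for s in SAMPLE_STATES) + r")\b"


def flag_country(df, text_col="full_text"):
    """Return a copy of df with one boolean column per heuristic."""
    out = df.copy()
    text = out[text_col].fillna("")
    out["has_state_keyword"] = text.str.contains(KEYWORD_RE, regex=True)
    out["mentions_state"] = text.str.contains(STATE_RE, regex=True)
    # A tweet is a candidate if either heuristic fires.
    out["maybe_venezuela"] = out["has_state_keyword"] | out["mentions_state"]
    return out


demo_tweets = pd.DataFrame(
    {
        "full_text": [
            "sin luz en el edo. zulia",
            "power outage in estados unidos",
            "apagon en merida",
        ]
    }
)
country_flags = flag_country(demo_tweets)
print(country_flags[["has_state_keyword", "mentions_state", "maybe_venezuela"]])
```

Keeping one column per rule makes it easier to review which heuristic produced a tag before the flags are merged into label_country.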
  

### Tagging State


Should we include accounts that report at the national level? Are they noise, because they only repeat what other users say? Or do they capture signal, because they are reports of power outages?

~~~
print(tags_df.loc[8122,'full_text'])

#Ahora Reportan más zonas #SinLuz: 

Catia, Distrito Capital ❌💡
Guatire y Guarenas, Edo. Miranda ❌💡
Estado Mérida ❌💡
Estado Aragua ❌💡

Comenta si hay fallas en tu zona #2Oct

~~~
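
A tweet like the one above mentions several states at once, so state tagging likely needs to return a list rather than a single value. The sketch below assumes the text has already been lowercased and stripped of accents; `find_states` and `SAMPLE_STATES` are hypothetical names, and the state list is an illustrative subset rather than the full list.

```python
import re

# Illustrative subset of Venezuelan states (lowercased, no accents).
SAMPLE_STATES = ["aragua", "distrito capital", "merida", "miranda", "zulia"]

STATE_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(state) for state in SAMPLE_STATES) + r")\b"
)


def find_states(text):
    """Return the states mentioned in text, in order of first appearance."""
    found = []
    for match in STATE_PATTERN.findall(text):
        if match not in found:  # keep each state once
            found.append(match)
    return found


# Cleaned version of the example tweet (emoji omitted for brevity).
cleaned_tweet = (
    "catia, distrito capital\n"
    "guatire y guarenas, edo. miranda\n"
    "estado merida\n"
    "estado aragua"
)
states_found = find_states(cleaned_tweet)
print(states_found)
```

A name such as "miranda" can also be a surname, so matches from a plain list like this one would still benefit from manual review.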