This notebook builds features tied to certain specific words.

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)  # None disables truncation (the older -1 value is deprecated)
import string
import nltk
nltk.download('stopwords')
from nltk import word_tokenize
from nltk.corpus import stopwords
nltk.download('vader_lexicon')
from textblob import TextBlob
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\brian\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\brian\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


First, load the CSV files.

In [2]:
df = pd.read_csv('../dataset/train.csv')
df_test = pd.read_csv('../dataset/test.csv')
df_test.drop(columns=['location', 'keyword'], inplace=True)
df = df.merge(df_test, how='outer')
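To make the effect of the outer merge concrete, here is a minimal sketch on toy frames (the rows below are invented for illustration, not taken from the dataset). Since the test frame has no target column, its rows end up with NaN there, which is also why target is displayed as a float further down.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are made up)
train = pd.DataFrame({'id': [1, 4], 'text': ['a', 'b'], 'target': [1, 0]})
test = pd.DataFrame({'id': [2, 3], 'text': ['c', 'd']})

# With no `on=` argument, merge joins on the columns both frames share: id and text
combined = train.merge(test, how='outer')

# Rows coming from `test` have no target, so the column is upcast to float with NaN
print(combined)
print(combined['target'].dtype)
```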

In [3]:
df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all,1.0
1,4,,,Forest fire near La Ronge Sask. Canada,1.0
2,5,,,All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected,1.0
3,6,,,"13,000 people receive #wildfires evacuation orders in California",1.0
4,7,,,Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school,1.0


By convention with the rest of the group, we use id as the index. Since I am only going to work with the text column, I drop the other three (location, keyword and target).

In [4]:
df.set_index('id',inplace = True)
df.drop(columns=['location', 'keyword', 'target'], inplace=True)

The first idea we are going to explore is whether or not a tweet mentions a divine entity. To do that, we preprocess the text field by removing #, URLs, etc.

In [5]:
# Remove URLs (this drops everything from the URL to the end of the line)
df['text'] = df['text'].str.replace(r'https?://.*', '', regex=True)

# Remove user mentions, stopwords and punctuation (including the '#' of hashtags).
STOPWORDS = set(stopwords.words('english'))  # built once instead of once per word

def clean_text(text):
    words = text.lower().split(' ')
    text = ' '.join([word for word in words if not word.startswith('@') and word not in STOPWORDS])
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text

df['text'] = df['text'].apply(clean_text)

df['word_count'] = df['text'].apply(lambda x: len(x.split(' ')))
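A note on word_count: splitting on a single space keeps empty strings wherever the cleaned text has consecutive, leading or trailing spaces (for example, where a token consisted only of punctuation), so the count can be higher than the number of visible words. The sketch below illustrates the difference on a hypothetical cleaned tweet; whether to switch to split() is a judgment call.

```python
# Hypothetical cleaned tweet with a double space and a trailing space
cleaned = 'people receive  orders california '

tokens_single_space = cleaned.split(' ')  # keeps the empty strings
tokens_whitespace = cleaned.split()       # collapses any run of whitespace

print(tokens_single_space)
print(tokens_whitespace)
```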

In [6]:
df.head()

Unnamed: 0_level_0,text,word_count
id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,deeds reason earthquake may allah forgive us,7
4,forest fire near la ronge sask canada,7
5,residents asked shelter place notified officers evacuation shelter place orders expected,11
6,13000 people receive wildfires evacuation orders california,8
7,got sent photo ruby alaska smoke wildfires pours school,10


We will use the main religions for this: Christianity, Islam, Hinduism, Buddhism and Judaism.

In [7]:
dioses = {'god', 'allah', 'yahve', 'rog', 'shiva', 'visnu', 'ala', 'jesus', 'christ', 'jehova', 'zeus', 'poseidon'}

def chequear_mencion_divina(text):
    # Compare whole tokens so a match at the very start or end of the tweet is not missed
    return any(word in dioses for word in text.split())

df['mencion_divina'] = df['text'].apply(chequear_mencion_divina)
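As an alternative way to match whole words regardless of where they sit in the tweet, a regular expression with word boundaries could be used. This is only a sketch: `deities` is a hypothetical subset of the list above and `mentions_deity` is an illustrative name, not part of the notebook.

```python
import re

# Hypothetical subset of the deity list, used only for illustration
deities = ['god', 'allah', 'jesus']

# \b anchors the match at word boundaries, so 'god' does not match inside 'godzilla'
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, deities)) + r')\b')

def mentions_deity(text):
    return pattern.search(text) is not None

print(mentions_deity('god help us'))           # True: match at the start of the text
print(mentions_deity('may allah forgive us'))  # True: match in the middle
print(mentions_deity('godzilla attacks'))      # False: 'god' is only a prefix here
```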

Similarly, we flag the tweets that contain laughter.

In [8]:

def chequear_risa(text):
    # True for 'haha', 'hahaha', etc.
    return 'hah' in text
df['contiene_risa'] = df['text'].apply(chequear_risa)

A specific time of day in a tweet may turn out to be relevant. Working in a similar way, we can determine whether or not a tweet contains one.

In [9]:


def chequear_hora(text):
    # Match 'am'/'pm' as whole tokens, so they are also found at the start or end of the tweet
    return any(token in ('am', 'pm') for token in text.split())
df['contiene_hora_especifica'] = df['text'].apply(chequear_hora)
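Since punctuation has already been stripped at this point, a time such as "10:30 pm" arrives as "1030 pm" and "5pm" stays glued together, which a standalone am/pm check does not catch. One possible refinement, sketched here under the assumption that a time is one to four digits followed by am or pm, is a small regular expression (`has_clock_time` is a hypothetical name). It is stricter than the check above because it requires the digits.

```python
import re

# One to four digits, an optional space, then 'am' or 'pm' as the end of a word
TIME_PATTERN = re.compile(r'\b\d{1,4}\s?(?:am|pm)\b')

def has_clock_time(text):
    return TIME_PATTERN.search(text) is not None

print(has_clock_time('fire at 5pm'))            # digits glued to 'pm'
print(has_clock_time('meeting 1030 pm today'))  # '10:30 pm' after punctuation removal
print(has_clock_time('spam alert'))             # 'pm' inside a word, no digits
```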

Tweets that mention the person who wrote them tend not to be related to a disaster.

In [18]:
pronombres = ['i', 'm', 'me', 'us', 'im', 'am']
def chequear_pronombre(text):
    # Reads the module-level `pronombres` list at call time
    tokens = text.split(' ')
    return any(pronombre in tokens for pronombre in pronombres)
df['mencion_personal'] = df['text'].apply(chequear_pronombre)
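Because chequear_pronombre looks up the global pronombres every time it runs, its result depends on which cell last assigned that name. A sketch of a variant that takes the word list as an argument, so each feature states its own vocabulary (`contains_any_token` and the two set names are hypothetical):

```python
def contains_any_token(text, vocabulary):
    # True if at least one whitespace-separated token of `text` is in `vocabulary`
    return not set(text.split()).isdisjoint(vocabulary)

first_person = {'i', 'm', 'me', 'us', 'im', 'am'}
other_pronouns = {'he', 'she', 'it', 'you'}

print(contains_any_token('im top hill see fire woods', first_person))
print(contains_any_token('forest fire near la ronge sask canada', first_person))
```

With pandas this could be applied as `df['text'].apply(contains_any_token, vocabulary=first_person)`, since `Series.apply` forwards extra keyword arguments to the function.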

Use of -ing forms (gerunds and present participles; the column keeps the name Uso_infinitivos)

In [22]:
def chequear_ing(text):
    # Substring test: true whenever 'ing' appears anywhere in the text
    return 'ing' in text
df['Uso_infinitivos'] = df['text'].apply(chequear_ing)
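The substring test also fires on words that merely contain "ing", such as "springs". If only words ending in -ing are wanted, a token-level check is one option. This is a sketch with a hypothetical name (`has_ing_word`), and it still matches nouns like "building".

```python
def has_ing_word(text):
    # Require at least one character before the suffix so the bare token 'ing' is ignored
    return any(len(token) > 3 and token.endswith('ing') for token in text.split())

print(has_ing_word('im afraid tornado coming area'))  # 'coming' ends in -ing
print(has_ing_word('colorado springs areas'))         # 'springs' only contains 'ing'
print('ing' in 'colorado springs areas')              # the substring test matches here
```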

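Use of other pronouns. Note that he, she, it and you are all in NLTK's English stopword list, so clean_text has already removed them wherever they appeared as bare tokens; this feature only fires when the pronoun survived cleaning, for example because punctuation was attached to it.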

In [25]:
# chequear_pronombre reads the global `pronombres`, so reassigning it changes what gets matched
pronombres = ['he', 'she', 'it', 'you']
df['mencion_tercero'] = df['text'].apply(chequear_pronombre)

In [26]:
df

Unnamed: 0_level_0,text,word_count,mencion_divina,contiene_risa,contiene_hora_especifica,mencion_personal,Uso_infinitivos,mencion_tercero
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,deeds reason earthquake may allah forgive us,7,True,False,False,True,False,False
4,forest fire near la ronge sask canada,7,False,False,False,False,False,False
5,residents asked shelter place notified officers evacuation shelter place orders expected,11,False,False,False,False,False,False
6,13000 people receive wildfires evacuation orders california,8,False,False,False,False,False,False
7,got sent photo ruby alaska smoke wildfires pours school,10,False,False,False,False,False,False
8,rockyfire update california hwy 20 closed directions due lake county fire cafire wildfires,15,False,False,False,False,False,False
10,flood disaster heavy rain causes flash flooding streets manitou colorado springs areas,12,False,False,False,False,True,False
13,im top hill see fire woods,6,False,False,False,True,False,False
14,theres emergency evacuation happening building across street,7,False,False,False,False,True,False
15,im afraid tornado coming area,5,False,False,False,True,True,False


In [24]:
df.drop(columns=['text','word_count']).to_csv('../features/palabras_clave.csv', index=False)
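Note that index=False leaves the id index out of the file. If the features need to be joined back to the tweets by id later, one option is to keep the index when writing. The sketch below uses a made-up two-row frame and in-memory strings instead of the real file.

```python
import io
import pandas as pd

# Toy feature frame indexed by id (values are invented)
features = pd.DataFrame(
    {'contiene_risa': [False, True]},
    index=pd.Index([1, 4], name='id'),
)

without_index = features.to_csv(index=False)  # the id column is not written
with_index = features.to_csv()                # index=True is the default

# Reading the second version back restores id as the index
restored = pd.read_csv(io.StringIO(with_index), index_col='id')
print(restored)
```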