
# Features

This notebook handles all the processing of the dataset: cleaning fields where necessary and generating the features.

## Ideas for things to try

--> Generate n-grams <br>
--> improve the stopwords <br>
--> give a tweet more character based on its text <br>
--> TF-IDF <br>
--> Embeddings <br>

In [1]:
import pandas as pd
import numpy as np

import warnings 
warnings.filterwarnings('ignore')

In [2]:
# np.bool was removed from recent NumPy releases; the builtin bool is equivalent here.
train_df = pd.read_csv('../Data/train.csv', encoding='latin-1', dtype={'id': np.uint16, 'target': bool})
test_df = pd.read_csv('../Data/test.csv', encoding='latin-1',dtype={'id': np.uint16})

In [3]:
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,True
1,4,,,Forest fire near La Ronge Sask. Canada,True
2,5,,,All residents asked to 'shelter in place' are ...,True
3,6,,,"13,000 people receive #wildfires evacuation or...",True
4,7,,,Just got sent this photo from Ruby #Alaska as ...,True


In [4]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


## Data cleaning

Based on what we learned in TP1, we clean some fields of the dataframe.

In [5]:
train_df['keyword'] = train_df['keyword'].str.replace('%20',' ')
test_df['keyword'] = test_df['keyword'].str.replace('%20',' ')
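
The `%20` sequences are URL percent-encoding for a space. As a minimal sketch, assuming the keywords could contain other percent-escapes as well, the standard library's `urllib.parse.unquote` decodes all of them in one pass. The helper name `decode_keyword` is hypothetical and not part of the original notebook.

```python
from urllib.parse import unquote

def decode_keyword(keyword):
    """Decode percent-escapes such as %20; leave missing values untouched."""
    # Hypothetical helper: non-string values (e.g. NaN or None) are returned as-is.
    if not isinstance(keyword, str):
        return keyword
    return unquote(keyword)

print(decode_keyword('forest%20fire'))  # forest fire
```

It could be applied with `train_df['keyword'].map(decode_keyword)`, which would give the same result as the replacement above whenever `%20` is the only escape present.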

In [6]:
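# Assign the result back instead of using inplace=True on a column selection,
# which recent pandas versions warn about.
train_df['keyword'] = train_df['keyword'].fillna('None')
test_df['keyword'] = test_df['keyword'].fillna('None')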

In [7]:
train_df['location'] = train_df['location'].fillna('Unknown')
test_df['location'] = test_df['location'].fillna('Unknown')

## Feature creation

We regenerate the columns used in TP1, since they could be useful features for the prediction model.

###### Number of words in the tweet

In [8]:
train_df['cantidad_de_palabras'] = train_df['text'].str.count(' ') + 1
test_df['cantidad_de_palabras'] = test_df['text'].str.count(' ') + 1
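
Counting spaces and adding one is an estimate: consecutive spaces inflate the count. The sketch below compares that estimate with a whitespace split on a made-up sample string. Both function names are hypothetical and only serve the comparison.

```python
def count_words(text):
    """Count whitespace-separated tokens in a tweet."""
    # str.split() with no argument splits on any run of whitespace,
    # so repeated spaces, tabs and newlines do not inflate the count.
    return len(text.split())

def count_spaces_plus_one(text):
    """The space-counting estimate used above, for comparison."""
    return text.count(' ') + 1

sample = 'Forest  fire  near La Ronge'  # two double spaces
print(count_words(sample), count_spaces_plus_one(sample))  # 5 7
```

In pandas, the split-based count could be written as `train_df['text'].str.split().str.len()`.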

###### Tweet length

In [9]:
train_df['longitud_del_tweet'] = train_df['text'].str.len()
test_df['longitud_del_tweet'] = test_df['text'].str.len()

###### Length quartiles

In [10]:
train_df['longitud_del_tweet'].describe().to_frame().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
longitud_del_tweet,7613.0,101.336136,33.991338,7.0,78.0,107.0,134.0,163.0


In [11]:
test_df['longitud_del_tweet'].describe().to_frame().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
longitud_del_tweet,3263.0,102.429053,34.234151,5.0,79.0,109.0,134.0,169.0


In [12]:
train_df.loc[train_df['longitud_del_tweet'] < 78.0,'longitud_categ'] = "0 a 25"
train_df.loc[train_df['longitud_del_tweet'] >= 78.0,'longitud_categ'] = "25 a 50"
train_df.loc[train_df['longitud_del_tweet'] >= 107.0,'longitud_categ'] = "50 a 75"
train_df.loc[train_df['longitud_del_tweet'] >= 134.0,'longitud_categ'] = "75 a 100"

test_df.loc[test_df['longitud_del_tweet'] < 79.0,'longitud_categ'] = "0 a 25"
test_df.loc[test_df['longitud_del_tweet'] >= 79.0,'longitud_categ'] = "25 a 50"
test_df.loc[test_df['longitud_del_tweet'] >= 109.0,'longitud_categ'] = "50 a 75"
test_df.loc[test_df['longitud_del_tweet'] >= 134.0,'longitud_categ'] = "75 a 100"
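
The chained assignments above amount to looking up which quartile interval a length falls into. As a sketch of one possible alternative, assuming we wanted the train boundaries applied to both sets so that a category means the same thing in train and test, the lookup can be done with `bisect`. The names `TRAIN_QUARTILES`, `LABELS` and `length_category` are hypothetical.

```python
from bisect import bisect_right

# Quartile boundaries of the train tweet lengths (25%, 50%, 75%),
# taken from the describe() output above.
TRAIN_QUARTILES = [78.0, 107.0, 134.0]
LABELS = ['0 a 25', '25 a 50', '50 a 75', '75 a 100']

def length_category(length, boundaries=TRAIN_QUARTILES):
    """Return the quartile label for a tweet length."""
    # bisect_right counts how many boundaries are <= length, which
    # reproduces the chained `>=` comparisons used above.
    return LABELS[bisect_right(boundaries, length)]

for n in (50, 78, 120, 140):
    print(n, length_category(n))
```

A pandas version of the same idea could use `pd.cut` with `right=False`. Whether to share the train boundaries with the test set is a design choice, not something the original notebook does.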

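We use the smoothing method: each category is encoded as its target mean pulled toward the global target mean, and the fewer rows a category has, the stronger the pull. (The code is based on the following source: https://gist.github.com/marnixkoops/e68815d30474786e2b293682ed7cdb01)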

In [13]:
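def smoothing(df, column, target, weight=100):
    # Global mean of the target, used as the prior for every category.
    mean = df[target].mean()
    # Row count and target mean of each category.
    agg = df.groupby(column)[target].agg(['count', 'mean'])

    # Smoothed mean: (count * category_mean + weight * global_mean) / (count + weight)
    smoothed = (agg['count'] * agg['mean'] + weight * mean) / (agg['count'] + weight)

    return smoothed.to_dict()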
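
To make the effect of `weight` concrete, here is a small worked example of the same formula in plain Python. The counts and means are made-up illustrative numbers, not values from the dataset, and `smoothed_mean` is a hypothetical helper.

```python
def smoothed_mean(count, category_mean, global_mean, weight=100):
    """Blend a category mean with the global mean, weighted by row count."""
    return (count * category_mean + weight * global_mean) / (count + weight)

GLOBAL_MEAN = 0.43  # illustrative value only

# A rare category (5 rows, all positive) stays close to the global mean...
rare = smoothed_mean(5, 1.0, GLOBAL_MEAN)
# ...while a frequent category (2000 rows) stays close to its own mean.
frequent = smoothed_mean(2000, 0.8, GLOBAL_MEAN)

print(round(rare, 3), round(frequent, 3))
```

With `weight=100`, a category needs roughly a hundred rows before its own mean counts as much as the global one.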

In [14]:
long_categ_encoding_dic = smoothing(train_df,'longitud_categ','target')

In [15]:
train_df['longitud_categ'] = train_df['longitud_categ'].map(long_categ_encoding_dic)
test_df['longitud_categ'] = test_df['longitud_categ'].map(long_categ_encoding_dic)

###### Has mentions

In [16]:
train_df['tiene_menciones'] = train_df['text'].str.contains('@')
test_df['tiene_menciones'] = test_df['text'].str.contains('@')

###### Expressive tweet

In [17]:
# Raw strings avoid invalid escape sequences; a single pattern covers both cases.
train_df['es_expresivo'] = train_df['text'].str.contains(r'!!|\?\?')
test_df['es_expresivo'] = test_df['text'].str.contains(r'!!|\?\?')
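
The rule being applied is "the tweet contains two consecutive exclamation marks or two consecutive question marks". A standalone sketch with the `re` module, using hypothetical names and made-up sample strings, shows which inputs qualify:

```python
import re

# Two consecutive '!' or two consecutive '?' anywhere in the text.
EXPRESSIVE = re.compile(r'!!|\?\?')

def is_expressive(text):
    """Return True when the text contains '!!' or '??'."""
    return EXPRESSIVE.search(text) is not None

for text in ('Help!!!', 'What??', 'Fire!', 'Really?!'):
    print(text, is_expressive(text))
```

Mixed punctuation such as `?!` does not count under this rule, since neither character is repeated.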

###### Number of hashtags

In [18]:
train_df['cantidad_de_hashtags'] = train_df['text'].str.count('#')
test_df['cantidad_de_hashtags'] = test_df['text'].str.count('#')

###### Has links

In [19]:
train_df['tiene_links'] = train_df['text'].str.contains('http')
test_df['tiene_links'] = test_df['text'].str.contains('http')

###### Locations with a single tweet

In [20]:
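# Locations that appear exactly once, computed a single time per dataframe
# instead of on every call.
train_location_counts = train_df['location'].value_counts()
train_ubicaciones_unicas = set(train_location_counts[train_location_counts == 1].index)
test_location_counts = test_df['location'].value_counts()
test_ubicaciones_unicas = set(test_location_counts[test_location_counts == 1].index)

def train_location_unico(location):
    return location in train_ubicaciones_unicas

def test_location_unico(location):
    return location in test_ubicaciones_unicas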

In [21]:
train_df['location_unico'] = train_df['location'].map(train_location_unico)
test_df['location_unico'] = test_df['location'].map(test_location_unico)
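
The feature marks rows whose location occurs exactly once in its own dataframe. A minimal sketch of that logic with `collections.Counter`, on a made-up list of locations and with a hypothetical function name:

```python
from collections import Counter

def single_occurrence_flags(locations):
    """Flag each value that appears exactly once in the list."""
    # Count every location once, then look each row up in the counts.
    counts = Counter(locations)
    return [counts[loc] == 1 for loc in locations]

sample_locations = ['Unknown', 'London', 'Unknown', 'Lima']
print(single_occurrence_flags(sample_locations))  # [False, True, False, True]
```

Counting once and looking each row up keeps the work linear in the number of rows.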

###### Encoding the keyword field

Since the keyword column is categorical, we need a way to encode it numerically.

In [22]:
keywords_encoding_dic = smoothing(train_df,'keyword','target')

In [23]:
train_df['keyword_encoded'] = train_df['keyword'].map(keywords_encoding_dic)
test_df['keyword_encoded'] = test_df['keyword'].map(keywords_encoding_dic)
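
Because the dictionary is built from the train set only, any test keyword that never appears in train would map to a missing value. As a sketch, assuming we wanted such keywords to fall back to the global target mean, the lookup could supply a default. The encoding values below are made up for illustration and `encode_with_fallback` is a hypothetical helper.

```python
def encode_with_fallback(values, encoding, default):
    """Map each value through the encoding dict, using default when unseen."""
    return [encoding.get(value, default) for value in values]

# Illustrative numbers only, not taken from the dataset.
toy_encoding = {'wildfire': 0.71, 'None': 0.45}
toy_global_mean = 0.43

encoded = encode_with_fallback(['wildfire', 'unseen keyword'], toy_encoding, toy_global_mean)
print(encoded)  # [0.71, 0.43]
```

The pandas counterpart would be to chain `.fillna(train_df['target'].mean())` after `.map(keywords_encoding_dic)`. If every test keyword is present in train, the fallback never triggers.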

###### TF-IDF

In [24]:
import re
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from textblob import TextBlob

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

df=train_df.copy()

# Clean up the tweet text a bit
pattern_exclude = '(one|dont|cant|would|im|people|go|make|time|love|amp|get|house|update|talk'+\
                  '|want|today|know|say|us|day|crush|see|back|think|look|rigth|remember|car|shes'+\
                  '|thing|let|still|lol|much|thank|take|way|youre|road|another|really|save|hows'+\
                  '|play|even|theres|everyone|feel|year|work|check|two|great|ing|like|sink|href|hr|hs'+\
                  '|every|build|youtuve|video|n|home|body|bag|photo|stay|game|start|gt|fuck|help'+\
                  '|best|well|california|end|live|e|rt|wreck|plan|full|may|ies|u|could|many|last'+\
                  '|find|service|leave|collapse|world|war|destroy|wound|break|right|hear|school)+'

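def filter_words(tweet):
    # Remove URLs of the form scheme://... The class ' -\?' is a range from
    # space to '?', so the match can run past the end of the URL.
    tweet = re.sub(r'(\b[\w]+:\/\/[\w -\?&;#~=\.\/@]+[\w\/])', ' ', tweet)
    # Drop apostrophes so contractions collapse into one token (don't -> dont).
    tweet = re.sub(r'\'', '', tweet)
    # Remove domain-like strings ending in a common TLD. [www.] is a character
    # class, the '.' before the TLD matches any character, and [A-z] also
    # matches [, \, ], ^, _ and `, so this is looser than a strict domain match.
    return re.sub(r'[www.]*[A-z]+.(com|gov|edu|net|mil|org|io|int)+', ' ', tweet)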

def text_to_blob(tweet):
    tweet_blob = TextBlob(str(tweet))
    return ' '.join(tweet_blob.words)


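def normalization(tweet_list):
    lem = WordNetLemmatizer()
    normalized_tweet = []
    for word in tweet_list:
        word_aux = word.lower().strip()
        # re.match anchors only at the start of the word, so this skips every
        # word that begins with one of the alternatives in pattern_exclude
        # (including the single letters 'n', 'e' and 'u'), not just exact matches.
        if re.match(pattern_exclude, word_aux):
            continue
        # Lemmatize each remaining word as a verb ('v').
        normalized_text = lem.lemmatize(word_aux, 'v')
        normalized_tweet.append(normalized_text)
    return normalized_tweet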

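# Build the stopword set once instead of calling stopwords.words('english')
# for every token.
english_stopwords = set(stopwords.words('english'))

def clean(tweet):
    tweet_list = text_to_blob(filter_words(tweet)).split()
    # Keep tokens whose first character matches [A-z], which covers letters
    # plus the ASCII characters between 'Z' and 'a' ([, \, ], ^, _ and `).
    clean_tokens = [tkn for tkn in tweet_list if re.match(r'[A-z]+', tkn)]
    l_aux = normalization(clean_tokens)
    return ' '.join([word for word in l_aux if word not in english_stopwords])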

train_df['clean_text'] = train_df['text'].apply(clean)
test_df['clean_text'] = test_df['text'].apply(clean)




In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics.pairwise import linear_kernel

docs=train_df['clean_text'].to_list()

tfidf_vectorizer=TfidfVectorizer(analyzer='word', ngram_range=(1,3),sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', stop_words='english')

tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(docs)

In [26]:
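def similarity_feature(doc):
    # Sum of the similarities between doc and every document in the train set.
    # The query vectorizer reuses the fitted vocabulary but is re-fitted on the
    # single document with default settings, so its weights differ from those
    # of tfidf_vectorizer.transform([doc]).
    query=TfidfVectorizer(vocabulary=tfidf_vectorizer.vocabulary_)
    query=query.fit_transform([doc])
    s=pd.Series(linear_kernel(query,tfidf_vectorizer_vectors)[0])
    return s.sum()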
    
train_df['tfidf_score']= train_df['clean_text'].apply(similarity_feature)
test_df['tfidf_score']= test_df['clean_text'].apply(similarity_feature)
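
The score is a sum of dot products between one query vector and every train vector. Since the dot product is linear, that sum equals a single dot product against the sum of all train vectors, which only has to be computed once. The sketch below checks the identity on made-up sparse vectors stored as plain dicts. It illustrates the arithmetic only and does not reproduce the TF-IDF weights of the notebook; all names are hypothetical.

```python
def dot(a, b):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())

def sum_vectors(vectors):
    """Element-wise sum of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return total

# Made-up document vectors and query, for illustration only.
docs = [{'fire': 0.6, 'forest': 0.8}, {'fire': 1.0}, {'flood': 1.0}]
query = {'fire': 0.5, 'flood': 0.5}

one_by_one = sum(dot(query, d) for d in docs)   # one kernel value per document
all_at_once = dot(query, sum_vectors(docs))     # single product with the summed vector

print(one_by_one, all_at_once)
```

With the sparse matrix from the previous cell, the summed vector would correspond to `tfidf_vectorizer_vectors.sum(axis=0)`, so each tweet would need one product instead of one per train document.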

In [27]:
train_df['tfidf_score'].value_counts()

0.000000     94
98.399077    34
35.546405    24
37.914268    20
30.622043    19
             ..
44.211627     1
39.141489     1
50.746029     1
34.644072     1
23.557168     1
Name: tfidf_score, Length: 6361, dtype: int64

In [28]:
test_df['tfidf_score'].value_counts()

0.000000     44
98.399077    16
42.654963    11
35.546405    10
46.971357    10
             ..
62.919636     1
38.936827     1
33.165787     1
30.737806     1
49.048676     1
Name: tfidf_score, Length: 2891, dtype: int64

In [29]:
del train_df['clean_text']
del test_df['clean_text']

In [30]:
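# Index by id and drop the raw keyword, location and text columns, keeping
# target (train only) and the engineered features.
train_df = train_df.set_index('id').drop(columns=['keyword', 'location', 'text'])
test_df = test_df.set_index('id').drop(columns=['keyword', 'location', 'text'])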

We save the train and test dataframes to .csv files for use in the Algorithms notebook.

In [31]:
train_df.to_csv('../Data/train_features.csv')
test_df.to_csv('../Data/test_features.csv')