
# Loading the dataset from Kaggle

**Google QUEST Q&A Labeling Challenge**




The dataset comes from the Kaggle competition: https://www.kaggle.com/competitions/google-quest-challenge/data

In [1]:
!pip install kaggle



In [2]:
# Upload the Kaggle API token (kaggle.json)

from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"danielapavas","key":"<redacted>"}'}

In [3]:
! mkdir -p ~/.kaggle

In [5]:
! cp kaggle.json ~/.kaggle/

In [6]:
! chmod 600 ~/.kaggle/kaggle.json
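
The three shell cells above create `~/.kaggle`, copy the token into it, and restrict the file's permissions to its owner. As a minimal sketch, the same steps can be written in Python with the standard library only. `install_kaggle_token` and its `home` parameter are hypothetical names introduced here for illustration, and the demo writes a dummy token into a temporary directory instead of touching a real home directory.

```python
import os
import shutil
import tempfile
from pathlib import Path

def install_kaggle_token(src, home=None):
    """Copy a kaggle.json token to <home>/.kaggle/ with owner-only permissions."""
    home = Path(home) if home is not None else Path.home()
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p ~/.kaggle`
    dest = kaggle_dir / "kaggle.json"
    shutil.copyfile(src, dest)                     # like `cp kaggle.json ~/.kaggle/`
    os.chmod(dest, 0o600)                          # like `chmod 600 ~/.kaggle/kaggle.json`
    return dest

# Demo in a throwaway directory with a dummy token (not a real credential).
with tempfile.TemporaryDirectory() as sandbox:
    token = Path(sandbox) / "kaggle.json"
    token.write_text('{"username": "example", "key": "dummy"}')
    installed = install_kaggle_token(token, home=Path(sandbox) / "home")
    print(installed.name, oct(installed.stat().st_mode & 0o777))
```

Restricting the file to mode 600 keeps other users on the machine from reading the API key.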

In [7]:
!kaggle competitions download -c google-quest-challenge

Downloading google-quest-challenge.zip to /content
  0% 0.00/4.85M [00:00<?, ?B/s]100% 4.85M/4.85M [00:00<00:00, 50.8MB/s]
100% 4.85M/4.85M [00:00<00:00, 50.6MB/s]


In [8]:
!unzip  google-quest-challenge.zip

Archive:  google-quest-challenge.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


# Data preprocessing

In [9]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords

In [10]:
# Load the datasets

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
sample_submission_dataset = pd.read_csv('sample_submission.csv')

train.shape, test.shape, sample_submission_dataset.shape

((6079, 41), (476, 11), (476, 31))

In [11]:
# Inspect one sample data point
dp = 1000
train['question_title'][dp], train['question_body'][dp], train['answer'][dp]

('Using a Wiener Filter to Estimate a Transfer Function',
 'As a follow-on to this question about estimating a transfer function of an unknown system using a Wiener filter, \n\n\nHow would you put a minimum MSE criteria on how well the estimated filter weights matched the actual transfer function of the system? [Suppose you needed the MSE to be no more than -50dB]?\nHow would you change his formulation if you wanted poles as well as zeroes (an IIR rather than an FIR filter)?\n\n',
 "\nThe desired MSE is application dependent, so there can be no general\nrule. If the approximation doesn't satisfy your needs you can\nincrease the filter length to obtain a better match.\nThere is no straightforward way to change the FIR Wiener filter solution to an IIR solution because the IIR formulation results in a set of nonlinear equations which have no closed-form solution. The IIR solution might also be unstable, so FIR filters are a much more practical choice when computing a Wiener filter.\n\n")

When working with text, we usually apply some basic cleaning: lowercasing all words, removing special tokens (such as '%', '$', '#'), stripping HTML tags, replacing \r and \n (line breaks) with a space, and removing all special characters.
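
As a minimal sketch of those basic steps on a single string, assuming only the standard `re` module: `basic_clean` is a hypothetical helper written for illustration, not the function used later in this notebook.

```python
import re

def basic_clean(text: str) -> str:
    """Apply the basic cleaning steps to one string."""
    text = re.sub(r"<.*?>", " ", text)        # strip HTML tags
    text = re.sub(r"[\r\n]+", " ", text)      # line breaks become a space
    text = text.lower()                       # lowercase everything
    text = re.sub(r"[^a-z0-9 ]+", " ", text)  # drop special characters such as % $ #
    return re.sub(r" +", " ", text).strip()   # collapse repeated spaces

print(basic_clean("<b>Hello</b>\r\nWorld! 100%"))  # hello world 100
```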

In [12]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# Keep the negations: they change the meaning of a sentence, so they must not be dropped
stop_words -= {'no', 'not', 'nor'}

def stopwrd_removal(sent):
  """Return `sent` with every whitespace-separated stop word removed."""
  return " ".join(wrd for wrd in sent.split() if wrd not in stop_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
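
To illustrate why the negations are taken out of the stop-word list, here is a small sketch. It uses a tiny hand-written stop-word set as a stand-in for NLTK's English list (an assumption made so the example runs without downloading anything), and `remove_stop_words` is a hypothetical helper that mirrors `stopwrd_removal`.

```python
# Tiny hand-written stop-word set; an assumption standing in for NLTK's English list.
TOY_STOP_WORDS = {"this", "is", "a", "the", "no", "not", "nor"}
KEPT_NEGATIONS = {"no", "not", "nor"}

def remove_stop_words(sentence, stop_words):
    """Drop every whitespace-separated token that is in `stop_words`."""
    return " ".join(w for w in sentence.split() if w not in stop_words)

sentence = "this is not a good answer"
print(remove_stop_words(sentence, TOY_STOP_WORDS))                   # good answer
print(remove_stop_words(sentence, TOY_STOP_WORDS - KEPT_NEGATIONS))  # not good answer
```

With the full list, "not a good answer" collapses to "good answer", which reverses the meaning. Keeping the negations avoids that.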


In [13]:
def text_preprocessor(column, remove_stopwords=False, remove_specialchar=False):
  """Clean the text column `column` of `train`. Note: returns nothing; modifies `train` in place."""

  # 1. remove HTML tags, then decode the HTML comparison-operator entities
  train[column] = [re.sub(r'<.*?>', ' ', i) for i in train[column].values]
  train[column] = train[column].str.replace('&lt;', '<', regex=False)\
                               .str.replace('&gt;', '>', regex=False)\
                               .str.replace('&le;', '<=', regex=False)\
                               .str.replace('&ge;', '>=', regex=False)

  # 2. remove inline LaTeX, i.e. anything between a pair of $ signs
  train[column] = [re.sub(r'\$.*?\$', ' ', i) for i in train[column].values]

  # 3. lowercase everything
  train[column] = train[column].str.lower()

  # 4. expand contractions; order matters, so "won't" and "can't" come before the generic "n't"
  contractions = [("won't", "will not"), ("can't", "can not"), ("n't", " not"), ("'re", " are"),
                  ("'s", " is"), ("'d", " would"), ("'ll", " will"),
                  ("'t", " not"), ("'ve", " have"), ("'m", " am")]
  for old, new in contractions:
    train[column] = train[column].str.replace(old, new, regex=False)

  # 5. drop non-ASCII characters (e.g. Hebrew)
  train[column] = [i.encode("ascii", "ignore").decode() for i in train[column].values]

  # 6. optionally replace every run of non-alphanumeric characters with a space
  if remove_specialchar:
    train[column] = [re.sub(r'[^A-Za-z0-9]+', ' ', i) for i in train[column].values]

  # 7. surround each remaining special character with spaces so it becomes its own token
  all_sc = [re.findall(r'[^ A-Za-z0-9]', i) for i in train[column].values]
  special_char = np.unique([j for i in all_sc for j in i])
  for ch in special_char:
    # regex=False: characters such as '.', '?' or '$' must be matched literally
    train[column] = train[column].str.replace(ch, ' ' + ch + ' ', regex=False)

  # 8. optional stop-word removal
  if remove_stopwords:
    train[column] = [stopwrd_removal(i) for i in train[column].values]

  # 9. normalize whitespace: newlines and tabs become spaces, runs of spaces collapse
  train[column] = train[column].str.replace("\n", " ", regex=False).str.replace("\t", " ", regex=False).str.rstrip()
  train[column] = [re.sub(r'  +', ' ', i) for i in train[column].values]
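
Steps 4 and 7 are the least obvious parts of the function, so here is a minimal plain-string sketch of both, assuming only the standard `re` module. `expand_contractions` and `separate_special_chars` are hypothetical helpers for illustration. They mirror the logic of the cell above on single strings, not on the `train` DataFrame.

```python
import re

# Specific contractions come first; the generic "n't" rule would otherwise
# turn "won't" into "wo not" and "can't" into "ca not".
CONTRACTIONS = [
    ("won't", "will not"), ("can't", "can not"), ("n't", " not"),
    ("'re", " are"), ("'s", " is"), ("'d", " would"), ("'ll", " will"),
    ("'t", " not"), ("'ve", " have"), ("'m", " am"),
]

def expand_contractions(text: str) -> str:
    """Apply the replacements in order, as literal (non-regex) substitutions."""
    for old, new in CONTRACTIONS:
        text = text.replace(old, new)
    return text

def separate_special_chars(text: str) -> str:
    """Put spaces around every non-alphanumeric character, then collapse spaces."""
    text = re.sub(r"([^ A-Za-z0-9])", r" \1 ", text)
    return re.sub(r" +", " ", text).strip()

print(expand_contractions("i won't say it isn't"))           # i will not say it is not
print(separate_special_chars("sprawl/metroplex... ok?"))     # sprawl / metroplex . . . ok ?
```

Separating punctuation this way makes each symbol its own whitespace-delimited token, which is the pattern visible in the cleaned titles printed further down.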


In [14]:
train['clean_title'] = train['question_title']
train['clean_body'] = train['question_body']
train['clean_answer'] = train['answer']
text_preprocessor('clean_title',  remove_stopwords = False, remove_specialchar = False)
text_preprocessor('clean_body',  remove_stopwords = False, remove_specialchar = False)
text_preprocessor('clean_answer',  remove_stopwords = False, remove_specialchar = False)

In [15]:
text = train['question_title'].values
print('TITLE : ', text[0:5], '\n')

text = train['clean_title'].values
print('CLEAN_TITLE : ',text[0:5], '\n')

text = train['question_body'].values
print('BODY : ', text[0:5], '\n')

text = train['clean_body'].values
print('CLEAN_BODY : ',text[0:5], '\n')

text = train['answer'].values
print('ANSWER : ', text[0:5], '\n')

text = train['clean_answer'].values
print('CLEAN_ANSWER : ', text[0:5])

TITLE :  ['What am I losing when using extension tubes instead of a macro lens?'
 'What is the distinction between a city and a sprawl/metroplex... between downtown and a commercial district?'
 'Maximum protusion length for through-hole component pins'
 'Can an affidavit be used in Beit Din?'
 'How do you make a binary image in Photoshop?'] 

CLEAN_TITLE :  ['what am i losing when using extension tubes instead of a macro lens ?'
 'what is the distinction between a city and a sprawl / metroplex . . . between downtown and a commercial district ?'
 'maximum protusion length for through - hole component pins'
 'can an affidavit be used in beit din ?'
 'how do you make a binary image in photoshop ?'] 

BODY :  ['After playing around with macro photography on-the-cheap (read: reversed lens, rev. lens mounted on a straight lens, passive extension tubes), I would like to get further with this. The problems with the techniques I used is that focus is manual and aperture control is problematic a