# <center>Feature Engineering (v2)</center>

<br>
<br>
<p>Before we get started, we need to run the following two code blocks, which reproduce the earlier work on the data: downloading the dataset and saving the prepared training and test files.</p>
<br>
<br>

In [2]:
!wget -O trainingandtestdata.zip http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
print('unzipping ...')
!unzip -o -j trainingandtestdata.zip

--2019-05-08 00:10:55--  http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip [following]
--2019-05-08 00:10:55--  https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 81363704 (78M) [application/zip]
Saving to: ‘trainingandtestdata.zip’


2019-05-08 00:11:02 (11.7 MB/s) - ‘trainingandtestdata.zip’ saved [81363704/81363704]

unzipping ...
Archive:  trainingandtestdata.zip
  inflating: testdata.manual.2009.06.14.csv  
  inflating: training.1600000.processed.noemoticon.csv  


In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# Load the raw CSV files; they have no header row.
data = pd.read_csv("training.1600000.processed.noemoticon.csv", header=None, encoding='ISO-8859-1')
test = pd.read_csv("testdata.manual.2009.06.14.csv", header=None, encoding='ISO-8859-1')

# Name the columns.
data.columns = ["target", "ids", "date", "flag", "user", "text"]
test.columns = ["target", "ids", "date", "flag", "user", "text"]

# Relabel positive tweets from 4 to 1, so that 0 = negative and 1 = positive.
data["target"] = data["target"].replace(4, 1)
test["target"] = test["target"].replace(4, 1)

# Keep only the label and the tweet text.
df = data[["target", "text"]]
ts = test[["target", "text"]]

# Separate the neutral test tweets (label 2) from the binary ones.
ts_bin = ts[ts["target"]!=2]
ts_neut = ts[ts["target"]==2]

# Save the prepared datasets for later use.
df.to_csv('training_data.csv')
ts_bin.to_csv('test_data.csv')
ts_neut.to_csv('neutral_data.csv')



<br>
<br>
<p>Because the performance of the previous model was not convincing, we will try another approach. As in version 1, we will use only the training dataset, so that the two versions can be compared fairly. The new feature engineering segments each tweet into words and converts those words into numeric features, using the Keras preprocessing module.</p>
<p>Let's import the libraries we will use.</p>
<br>
<br>

In [4]:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split




Using TensorFlow backend.


<br>
<br>
<p>Let's load the training data into a pandas DataFrame and separate the tweets and the labels into two variables.</p>
<br>
<br>

In [5]:
df_m = pd.read_csv("training_data.csv")

In [6]:
labels = df_m["target"]
tweets = df_m["text"]

labels.count()

1600000

<br>
<br>
<p>First, we will use the <b>Tokenizer API</b> from <i>Keras</i>. This tool performs the tokenization, mapping each word to an integer index. Setting <code>num_words=10000</code> restricts the vocabulary to the most frequent words.</p>
<br>
<br>

In [7]:
tok = Tokenizer(num_words=10000)
tok.fit_on_texts(tweets)
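
<p>To make the idea of a word index concrete, here is a minimal pure-Python sketch of frequency-ranked tokenization. It is not the Keras implementation: the helpers <code>split_words</code>, <code>build_word_index</code> and <code>to_sequences</code> are hypothetical names, the text filter is simplified, and the sketch assumes that words ranked at or beyond <code>num_words</code> are simply dropped from the sequences.</p>

```python
import re
from collections import Counter


def split_words(text):
    # Simplified stand-in for the tokenizer's text filter (an assumption):
    # lowercase, replace punctuation with spaces, split on whitespace.
    return re.sub(r"[^\w\s']", " ", text.lower()).split()


def build_word_index(texts):
    # Count every word, then rank by frequency: the most frequent word gets 1.
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words):
    # Keep only words whose rank is below num_words; rarer words are dropped.
    return [
        [word_index[w] for w in split_words(text)
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


sample_texts = ["I love this", "I hate this", "I love love it"]
word_index = build_word_index(sample_texts)
sequences = to_sequences(sample_texts, word_index, num_words=4)

print(word_index)  # {'i': 1, 'love': 2, 'this': 3, 'hate': 4, 'it': 5}
print(sequences)   # [[1, 2, 3], [1, 3], [1, 2, 2]]
```

<p>Index 0 is never assigned to a word in this sketch, which leaves it free to serve as a padding value.</p>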


<br>
<br>
<p>Next, we convert each tweet into a sequence of token indices and pad the sequences with zeros to a common length of 30.</p>
<br>
<br>

In [8]:
tweets_seq = tok.texts_to_sequences(tweets)

In [9]:
max_length = 30

padded_tweets = pad_sequences(tweets_seq, maxlen=max_length, padding='post')

print(padded_tweets[:5])

[[  39  147   56  473  144    4 1221    7 3659   48  828   12 1955   30
     2   41    9  385    0    0    0    0    0    0    0    0    0    0
     0    0]
 [   8  818   17  111   69  565  193  536  126 2097    9    6  299  551
    85    4 2399  149   40  273 1170    0    0    0    0    0    0    0
     0    0]
 [   1  321  363   11    3 1298 1751    2  935 1164    3  493   37   31
    12    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0]
 [   5  450  851  504 3036    6   34   71   13 1169    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0]
 [  36   42   24   23   32   19  617  113   62    1   91  217    1   69
    68    7   32  135   86    0    0    0    0    0    0    0    0    0
     0    0]]
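
<p>The effect of <code>padding='post'</code> can be sketched in plain Python. The helper <code>pad_post</code> below is a hypothetical name, not part of Keras. It assumes the default <code>truncating='pre'</code> behavior of <code>pad_sequences</code>, meaning that a sequence longer than <code>maxlen</code> loses its <i>first</i> tokens.</p>

```python
def pad_post(sequences, maxlen, value=0):
    # Sketch of post-padding with pre-truncation (assumed Keras defaults).
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # keep the last maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))   # fill the tail with zeros
    return padded


example = pad_post([[1, 2, 3], [1, 2, 3, 4, 5, 6, 7]], maxlen=5)
print(example)  # [[1, 2, 3, 0, 0], [3, 4, 5, 6, 7]]
```

<p>If the beginning of a long tweet matters more than its end, <code>pad_sequences</code> also accepts <code>truncating='post'</code>.</p>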


<br>
<br>
<p>Now we will split the dataset: 80% for training and 20% for testing.</p>
<br>
<br>

In [6]:
X_train, X_test, y_train, y_test = train_test_split(padded_tweets, labels, test_size=0.2, random_state=2)
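
<p>As a rough illustration of what a seeded 80/20 split does, the sketch below shuffles row indices with a fixed seed and cuts them into two parts. The function <code>shuffled_split</code> is a hypothetical helper, and it will not reproduce the exact partition chosen by <code>train_test_split</code>; it only shows the sizes and the reproducibility that a fixed seed provides.</p>

```python
import random


def shuffled_split(n_samples, test_size=0.2, seed=2):
    # Shuffle the row indices reproducibly, then cut off the test portion.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffled_split(10)
print(len(train_idx), len(test_idx))  # 8 2

# Expected sizes for the 1,600,000 tweets used here.
n_test_full = round(1_600_000 * 0.2)
n_train_full = 1_600_000 - n_test_full
print(n_train_full, n_test_full)  # 1280000 320000
```

<p>Because the seed is fixed, running the split again yields the same partition, which keeps later experiments comparable.</p>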

<br>
<br>
<p>The feature engineering required for the new version is now complete.</p>
<br>
<br>