# <center>Feature Engineering</center>

<br>
<br>
<p>Before we get started, we need to run the following two code blocks, which reproduce the previous work done with the data.</p>
<br>
<br>

In [1]:
!wget -O trainingandtestdata.zip http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
print('unzipping ...')
!unzip -o -j trainingandtestdata.zip

--2019-05-07 02:29:25--  http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip [following]
--2019-05-07 02:29:26--  https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 81363704 (78M) [application/zip]
Saving to: ‘trainingandtestdata.zip’


2019-05-07 02:29:43 (4.45 MB/s) - ‘trainingandtestdata.zip’ saved [81363704/81363704]

unzipping ...
Archive:  trainingandtestdata.zip
  inflating: testdata.manual.2009.06.14.csv  
  inflating: training.1600000.processed.noemoticon.csv  


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


data = pd.read_csv("training.1600000.processed.noemoticon.csv", header=None, encoding='ISO-8859-1')
test = pd.read_csv("testdata.manual.2009.06.14.csv", header=None, encoding='ISO-8859-1')


data.columns = ["target", "ids", "date", "flag", "user", "text"]
test.columns = ["target", "ids", "date", "flag", "user", "text"]


data["target"] = data["target"].replace(4, 1)
test["target"] = test["target"].replace(4, 1)


df = data[["target", "text"]]
ts = test[["target", "text"]]


ts_bin = ts[ts["target"]!=2]
ts_neut = ts[ts["target"]==2]




df.to_csv('training_data.csv')
ts_bin.to_csv('test_data.csv')
ts_neut.to_csv('neutral_data.csv')



<br>
<br>
<p>In this first approach we will use only the training dataset. For the feature engineering we need to segment the tweet texts into words and then convert those words into numeric features that the selected machine learning method can work with. We will also split the dataset into train and test data.</p>
<p>Let's import the libraries we will use.</p>
<br>
<br>

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD


<br>
<br>
<p>Let's load the data into a Pandas DataFrame and separate the tweets and the labels into two variables.</p>
<br>
<br>

In [3]:
df_m = pd.read_csv("training_data.csv")

In [4]:
labels = df_m["target"]
tweets = df_m["text"]

labels.count()

1600000

<br>
<br>
<p>First, we will use <b>TfidfVectorizer</b> from <i>Scikit-learn</i>. This tool performs the tokenization, the vectorization and the TF-IDF weighting of the raw text data in a single step.</p>
<br>
<br>

In [5]:

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(tweets)
print(X.shape)


(1600000, 684047)
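
<p>To make the TF-IDF step easier to follow, here is a minimal sketch on a hypothetical three-tweet corpus. The names <i>toy_tweets</i>, <i>toy_vectorizer</i> and <i>toy_X</i> are assumptions made for this illustration and are not part of the original notebook.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
toy_tweets = [
    "I love this phone",
    "I hate this phone",
    "love love love",
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_X = toy_vectorizer.fit_transform(toy_tweets)

# English stop words such as "this" are removed, and the default token
# pattern ignores single-character tokens such as "I".
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)

# One row per tweet, one column per vocabulary term.
print(toy_X.shape)

# Each row is L2-normalized by default, so the values are weights, not counts.
print(toy_X.toarray().round(2))
```

<p>The result is a sparse matrix, which is why the full dataset can hold hundreds of thousands of columns without exhausting memory.</p>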


<br>
<br>
<p>Now we will split the dataset: 80% for training and 20% for testing.</p>
<br>
<br>

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=2)
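
<p>The following sketch shows how <b>train_test_split</b> behaves on a small sparse matrix. The toy data and the <i>stratify</i> argument are assumptions added for illustration; the notebook itself does not stratify, and stratifying is only one possible way to keep the class proportions equal in both parts.</p>

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 3 features, balanced binary labels.
toy_X = csr_matrix(np.arange(30).reshape(10, 3))
toy_y = np.array([0, 1] * 5)

# Same 80/20 split and seed as in the notebook, plus stratification
# (an assumption for this sketch) to preserve the label balance.
Xa, Xb, ya, yb = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=2, stratify=toy_y
)

# Sparse inputs are split row-wise and remain sparse.
print(Xa.shape, Xb.shape)
print(ya.mean(), yb.mean())
```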

<br>
<br>
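<p>At this point, we will standardize the data. First we fit <b>StandardScaler</b> on the X_train data and then we apply the transformation to both X_train and X_test. We have to set the parameter <i>with_mean=False</i> because we are working with sparse matrices: subtracting the mean would turn the zero entries into non-zero values and destroy the sparsity, so the features are only scaled to unit variance and not centered.</p>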
<br>
<br>

In [7]:

scaler = StandardScaler(with_mean=False)

scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
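
<p>As a small check of the <i>with_mean=False</i> requirement, the sketch below scales a hypothetical 4x2 sparse matrix. The variable names are assumptions for this illustration only.</p>

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix with mostly zero entries.
toy = csr_matrix(np.array([[0.0, 2.0],
                           [0.0, 0.0],
                           [3.0, 4.0],
                           [0.0, 0.0]]))

# Without centering, each column is only divided by its standard
# deviation, so zeros stay zero and the result is still sparse.
scaled = StandardScaler(with_mean=False).fit_transform(toy)
print(issparse(scaled), scaled.nnz == toy.nnz)

# Centering a sparse matrix is refused by scikit-learn with a ValueError.
try:
    StandardScaler(with_mean=True).fit(toy)
    centering_rejected = False
except ValueError:
    centering_rejected = True
print(centering_rejected)
```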

<br>
<br>
<p>Now it's time for dimensionality reduction. We can't use PCA for this dataset because it doesn't work well with sparse data: PCA centers the data before decomposing it, which would turn the sparse matrix into a dense one. The usual method for the type of data we are dealing with here is Latent Semantic Analysis (LSA), and we will apply it through <b>TruncatedSVD</b>. We set the parameter <i>n_components=100</i> as recommended in the Scikit-learn documentation. As we did with StandardScaler, we fit on the X_train data and then apply the transformation to both X_train and X_test.</p>
<br>
<br>

In [8]:

svd = TruncatedSVD(n_components=100)
svd.fit(X_train)

TruncatedSVD(algorithm='randomized', n_components=100, n_iter=5,
       random_state=None, tol=0.0)

In [9]:
X_train_svd = svd.transform(X_train)
X_test_svd = svd.transform(X_test)
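
<p>The same fit-then-transform pattern can be tried on a small synthetic matrix. In this sketch the random data, the helper <i>make_sparse</i>, <i>n_components=5</i> and <i>random_state=0</i> are all assumptions chosen for illustration; the notebook uses 100 components and leaves the random state unset.</p>

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

def make_sparse(n_rows, n_cols):
    """Hypothetical helper: random matrix with roughly 90% zeros."""
    dense = rng.random((n_rows, n_cols))
    dense[dense < 0.9] = 0.0
    return csr_matrix(dense)

toy_train = make_sparse(200, 50)
toy_test = make_sparse(40, 50)

# Fit on the training part only, then project both parts.
toy_svd = TruncatedSVD(n_components=5, random_state=0)
toy_train_svd = toy_svd.fit_transform(toy_train)
toy_test_svd = toy_svd.transform(toy_test)

# The output is dense, with one column per retained component.
print(toy_train_svd.shape, toy_test_svd.shape)

# Fraction of the variance kept by the retained components.
kept = toy_svd.explained_variance_ratio_.sum()
print(kept)
```

<p>Inspecting <i>explained_variance_ratio_</i> in this way is one possible check of how much information the chosen number of components retains.</p>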

<br>
<br>
<p>Now the datasets are ready for the model implementation stage.</p>
<br>
<br>