# <center>ETL Stage</center>

In [2]:
!wget -O trainingandtestdata.zip http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
print('unzipping ...')
!unzip -o -j trainingandtestdata.zip

--2019-05-05 21:47:21--  http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip [following]
--2019-05-05 21:47:21--  https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 81363704 (78M) [application/zip]
Saving to: ‘trainingandtestdata.zip’


2019-05-05 21:47:23 (62.7 MB/s) - ‘trainingandtestdata.zip’ saved [81363704/81363704]

unzipping ...
Archive:  trainingandtestdata.zip
  inflating: testdata.manual.2009.06.14.csv  
  inflating: training.1600000.processed.noemoticon.csv  


In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


data = pd.read_csv("training.1600000.processed.noemoticon.csv", header=None, encoding='ISO-8859-1')
test = pd.read_csv("testdata.manual.2009.06.14.csv", header=None, encoding='ISO-8859-1')

<br>
<br>
<p>As we saw in the Exploratory Data Analysis, the data is clean and the classes are well balanced, so no major transformations are needed. We only make a few minor modifications.</p>
<p>First, let's give each column a proper name.</p>
<br>
<br>

In [4]:
data.columns = ["target", "ids", "date", "flag", "user", "text"]
data.head()

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [5]:
test.columns = ["target", "ids", "date", "flag", "user", "text"]
test.head()

Unnamed: 0,target,ids,date,flag,user,text
0,4,3,Mon May 11 03:17:40 UTC 2009,kindle2,tpryan,@stellargirl I loooooooovvvvvveee my Kindle2. ...
1,4,4,Mon May 11 03:18:03 UTC 2009,kindle2,vcu451,Reading my kindle2... Love it... Lee childs i...
2,4,5,Mon May 11 03:18:54 UTC 2009,kindle2,chadfu,"Ok, first assesment of the #kindle2 ...it fuck..."
3,4,6,Mon May 11 03:19:04 UTC 2009,kindle2,SIX15,@kenburbary You'll love your Kindle2. I've had...
4,4,7,Mon May 11 03:21:41 UTC 2009,kindle2,yamarama,@mikefish Fair enough. But i have the Kindle2...
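
The column naming shown above can also be done while loading, using the `names` parameter of `pd.read_csv`. The following is only a minimal sketch: `sample_csv` is a hypothetical two-row stand-in for the real files (its users and texts are invented for illustration), and `COLUMNS` simply repeats the names used in the cells above.

```python
import io

import pandas as pd

# Column names taken from the cells above.
COLUMNS = ["target", "ids", "date", "flag", "user", "text"]

# Hypothetical two-row stand-in for the real CSV files (illustrative values only).
sample_csv = io.StringIO(
    "0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,user_a,first tweet\n"
    "4,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,user_b,second tweet\n"
)

# header=None says the file has no header row; names= supplies the column labels.
sample = pd.read_csv(sample_csv, header=None, names=COLUMNS)
print(sample.columns.tolist())
```

With the real files, the same call would also need `encoding='ISO-8859-1'`, as in the loading cell.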


<br>
<br>
<p>OK, done. Now we'll normalize the target values so that 0 denotes negative sentiment and 1 denotes positive sentiment (originally 4). We leave the neutral value (2) unchanged because it will not be used as a classification label.</p>
<br>
<br>

In [6]:
data["target"] = data["target"].replace(4, 1)
data.groupby("target").count()

target,ids,date,flag,user,text
0,800000,800000,800000,800000,800000
1,800000,800000,800000,800000,800000


In [7]:
test["target"] = test["target"].replace(4, 1)
test.groupby("target").count()

target,ids,date,flag,user,text
0,177,177,177,177,177
1,182,182,182,182,182
2,139,139,139,139,139
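
The label normalization above can be wrapped in a small helper that also checks for unexpected label values. This is a sketch under stated assumptions: `normalize_target` is a hypothetical helper name, not part of the original notebook, and the toy frame stands in for `data` and `test`.

```python
import pandas as pd


def normalize_target(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where label 4 (positive) is recoded as 1; 0 and 2 are kept."""
    out = frame.copy()
    out["target"] = out["target"].replace(4, 1)
    # Fail loudly if a label outside {0, 1, 2} shows up.
    unexpected = set(out["target"].tolist()) - {0, 1, 2}
    if unexpected:
        raise ValueError(f"unexpected target values: {sorted(unexpected)}")
    return out


# Toy frame standing in for the real data (hypothetical values).
toy = pd.DataFrame({"target": [0, 4, 2, 4], "text": ["a", "b", "c", "d"]})
normalized = normalize_target(toy)
print(normalized["target"].value_counts().sort_index())
```

Working on a copy leaves the input frame untouched, which makes the step safe to re-run in a notebook.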


<br>
<br>
<p>It's time to select the "target" and the "text" columns from each dataset to obtain the usable dataframes.</p>
<br>
<br>

In [8]:
df = data[["target", "text"]]
df.head()

Unnamed: 0,target,text
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


In [9]:
ts = test[["target", "text"]]
ts.head()

Unnamed: 0,target,text
0,1,@stellargirl I loooooooovvvvvveee my Kindle2. ...
1,1,Reading my kindle2... Love it... Lee childs i...
2,1,"Ok, first assesment of the #kindle2 ...it fuck..."
3,1,@kenburbary You'll love your Kindle2. I've had...
4,1,@mikefish Fair enough. But i have the Kindle2...


<br>
<br>
<p>The training data is ready. For the test data, we need to set apart the neutral rows (target 2). We will keep them as a complementary test set, to be used after the algorithm has been developed.</p>
<br>
<br>

In [10]:
ts_bin = ts[ts["target"]!=2]
print(ts_bin.shape)
ts_bin["target"].unique()

(359, 2)


array([1, 0])

In [11]:
ts_neut = ts[ts["target"]==2]
print(ts_neut.shape)
ts_neut["target"].unique()

(139, 2)


array([2])
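
The split above can also be written with a single boolean mask, which makes it easy to confirm that every row ends up in exactly one subset (in the outputs above, 359 + 139 = 498 rows). The sketch below uses a toy frame; the names `toy_ts`, `toy_bin`, `toy_neut` and `NEUTRAL` are illustrative assumptions. The `.copy()` calls are optional and only make the subsets independent of the original frame.

```python
import pandas as pd

NEUTRAL = 2  # neutral label, as it appears in the test set above

# Toy frame standing in for `ts` (hypothetical values).
toy_ts = pd.DataFrame(
    {"target": [1, 0, 2, 1, 2], "text": ["a", "b", "c", "d", "e"]}
)

# One mask, used twice: rows are either neutral or binary, never both.
is_neutral = toy_ts["target"] == NEUTRAL
toy_bin = toy_ts[~is_neutral].copy()
toy_neut = toy_ts[is_neutral].copy()

# Sanity check: the two subsets together cover the whole frame.
assert len(toy_bin) + len(toy_neut) == len(toy_ts)
print(toy_bin.shape, toy_neut.shape)
```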

<br>
<br>
<p>OK, the work is done. We now have a dataframe for training, a dataset for evaluation, and an additional dataset for a complementary test. Next we save the transformed data to new CSV files for later use.</p>
<br>
<br>

In [12]:
df.to_csv('training_data.csv')
ts_bin.to_csv('test_data.csv')
ts_neut.to_csv('neutral_data.csv')  # neutral rows only
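
By default, `to_csv` also writes the dataframe index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. If that column is not wanted, one option is to pass `index=False`. The round trip below is a minimal sketch on a toy frame; the file name `toy_data.csv` and the temporary directory are assumptions made for illustration, not part of the original pipeline.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the transformed data (hypothetical values).
toy = pd.DataFrame({"target": [0, 1], "text": ["bad day", "great day"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_data.csv")  # hypothetical file name
    toy.to_csv(path, index=False)  # skip the index column
    restored = pd.read_csv(path)

print(restored.columns.tolist())
```

Keeping the default behaviour is also fine, as long as later stages read the files with `index_col=0`.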