# Prepare Twitter Data with Row ID Prefixes

This notebook downloads the Sentiment140 Twitter data and adds a prefix to each row ID. The prefix is a zero-padded integer ranging from 0 to 511 (`0000` to `0511`). Later, we will replicate this prefixed data for our benchmark tests. The prefixes make it easy to split the Accumulo tables that hold the replicated Twitter data, which speeds up writing data into these tables.
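
As a rough illustration of what these prefixes look like, the sketch below builds the full list of prefix strings. The helper name `make_prefixes` is hypothetical, and the width of 4 characters is an assumption chosen to match the prefixes produced later in this notebook.

```python
def make_prefixes(n_splits=512, width=4):
    """Return zero-padded prefix strings, one per split (hypothetical helper)."""
    return [str(i).zfill(width) for i in range(n_splits)]

prefixes = make_prefixes()
print(prefixes[:3], prefixes[-1])  # ['0000', '0001', '0002'] 0511
```

Because the prefixes are zero-padded to a fixed width, their lexicographic order matches their numeric order, which is what makes them usable as boundaries for sorted row IDs.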

Note that you need to run this notebook in Python 3. 

In [1]:
import os
import re
from zipfile import ZipFile
import urllib.request

import math
import numpy as np
import pandas as pd

## Download Twitter Data

In [2]:
# URL to download the sentiment140 dataset and data file name
data_url = 'http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip'
data_filename = 'training.1600000.processed.noemoticon.csv'

# Column names of the data
cols = ['sentiment', 'id', 'date', 'query_string', 'user', 'text']

# Data directory
data_dir = os.path.join('.', 'data')

In [3]:
def download_data(download_url, filedir='.', filename='downloaded_data.zip'):
    """Download a zip archive from download_url and extract it into filedir."""
    # Create the target directory (and any missing parents) if needed
    os.makedirs(filedir, exist_ok=True)
    downloaded_filename = os.path.join(filedir, filename)
    print('Step 1: Downloading data')
    urllib.request.urlretrieve(download_url, downloaded_filename)
    print('Step 2: Extracting data')
    # The context manager closes the archive even if extraction fails
    with ZipFile(downloaded_filename) as archive:
        archive.extractall(filedir)

In [4]:
download_data(data_url, filedir=data_dir)

Step 1: Downloading data
Step 2: Extracting data
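
The download is large, so when re-running the notebook it may be worth skipping this step if the extracted file is already on disk. The following is a minimal sketch of such a check; `needs_download` is a hypothetical helper that is not part of the original notebook, and the file name in the demonstration is a placeholder.

```python
import os
import tempfile

def needs_download(filedir, filename):
    """Return True when the extracted file is not yet present (hypothetical helper)."""
    return not os.path.isfile(os.path.join(filedir, filename))

# Small demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    print(needs_download(tmp, 'example.csv'))  # True: nothing extracted yet
    open(os.path.join(tmp, 'example.csv'), 'w').close()
    print(needs_download(tmp, 'example.csv'))  # False: file now exists
```

In this notebook the check could guard the call above, for example `if needs_download(data_dir, data_filename): download_data(data_url, filedir=data_dir)`.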


In [5]:
data_filepath = os.path.join(data_dir, data_filename)
df = pd.read_csv(data_filepath, header=None, names=cols, encoding='iso-8859-1')
#df = pd.read_csv(data_filepath, header=None, names=cols)

In [6]:
df.head()

,sentiment,id,date,query_string,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


## Add Prefix for Table Splitting

In [7]:
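# Number of splits
n_splits = 512
# Prefix width: one digit wider than n_splits itself, i.e. 4 for 512
n_digits = len(str(n_splits)) + 1

# Store IDs as strings up front so that prefixed values fit the column dtype
df['id'] = df['id'].astype(str)

# Rows per split; the last split also takes any remainder rows
bucket_size = df.shape[0] // n_splits
for i in range(n_splits):
    print('processing split {}'.format(i))
    start = i*bucket_size
    if i < n_splits-1:
        end = (i+1)*bucket_size
    else:
        end = df.shape[0]
    idx_range = range(start, end)
    prefix = str(i).zfill(n_digits)
    df.loc[idx_range, 'id'] = prefix + '_' + df.loc[idx_range, 'id']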

processing split 0
processing split 1
processing split 2
processing split 3
processing split 4
processing split 5
processing split 6
processing split 7
processing split 8
processing split 9
processing split 10
processing split 11
processing split 12
processing split 13
processing split 14
processing split 15
processing split 16
processing split 17
processing split 18
processing split 19
processing split 20
processing split 21
processing split 22
processing split 23
processing split 24
processing split 25
processing split 26
processing split 27
processing split 28
processing split 29
processing split 30
processing split 31
processing split 32
processing split 33
processing split 34
processing split 35
processing split 36
processing split 37
processing split 38
processing split 39
processing split 40
processing split 41
processing split 42
processing split 43
processing split 44
processing split 45
processing split 46
processing split 47
processing split 48
processing split 49
...

processing split 396
processing split 397
processing split 398
processing split 399
processing split 400
processing split 401
processing split 402
processing split 403
processing split 404
processing split 405
processing split 406
processing split 407
processing split 408
processing split 409
processing split 410
processing split 411
processing split 412
processing split 413
processing split 414
processing split 415
processing split 416
processing split 417
processing split 418
processing split 419
processing split 420
processing split 421
processing split 422
processing split 423
processing split 424
processing split 425
processing split 426
processing split 427
processing split 428
processing split 429
processing split 430
processing split 431
processing split 432
processing split 433
processing split 434
processing split 435
processing split 436
processing split 437
processing split 438
processing split 439
processing split 440
processing split 441
processing split 442
...
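
The loop above assigns one bucket at a time. As an alternative, the same bucketing rule can be expressed without a Python loop. The sketch below is not the notebook's method: `add_prefix_vectorized` is a hypothetical helper, and it assumes the rows are in their original order and that there are at least `n_splits` of them.

```python
import numpy as np
import pandas as pd

def add_prefix_vectorized(ids, n_splits, n_digits):
    """Prefix each ID with its zero-padded bucket number (hypothetical helper)."""
    ids = pd.Series(ids).reset_index(drop=True)
    bucket_size = len(ids) // n_splits
    if bucket_size == 0:
        raise ValueError('need at least n_splits rows')
    # Row position -> bucket number; remainder rows fall into the last bucket
    buckets = np.minimum(np.arange(len(ids)) // bucket_size, n_splits - 1)
    prefixes = pd.Series(buckets, index=ids.index).astype(str).str.zfill(n_digits)
    return prefixes + '_' + ids.astype(str)

demo = add_prefix_vectorized([101, 102, 103, 104, 105, 106, 107],
                             n_splits=3, n_digits=4)
print(demo.tolist())
```

With 7 rows and 3 splits, the bucket size is 2, so the first two buckets get two rows each and the last bucket takes the remaining three, mirroring the `else` branch of the loop.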

In [8]:
# Save data to file
df.to_csv(os.path.join(data_dir, 'sentiment140_prefix.csv'), header=False, index=False)

In [9]:
df_prefix = pd.read_csv(os.path.join(data_dir, 'sentiment140_prefix.csv'), header=None)

In [10]:
df_prefix

,0,1,2,3,4,5
0,0,0000_1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,0000_1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,0000_1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,0000_1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,0000_1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
...,...,...,...,...,...,...
1599995,4,0511_2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599996,4,0511_2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,0511_2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,0511_2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


In [11]:
df_prefix.shape

(1600000, 6)
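
Beyond checking the shape, it can be useful to confirm that the rows are spread evenly across the prefixes. The sketch below counts rows per prefix on a tiny made-up list of IDs; `prefix_counts` is a hypothetical helper, and it assumes each ID has the form `<prefix>_<original id>`.

```python
import pandas as pd

def prefix_counts(prefixed_ids):
    """Count rows per prefix (hypothetical helper); IDs look like '0007_1467810369'."""
    prefixes = pd.Series(prefixed_ids).astype(str).str.split('_', n=1).str[0]
    return prefixes.value_counts().sort_index()

counts = prefix_counts(['0000_11', '0000_12', '0001_13', '0001_14', '0002_15'])
print(counts.to_dict())  # {'0000': 2, '0001': 2, '0002': 1}
```

Applied to this notebook as `prefix_counts(df_prefix[1])`, the result should contain 512 distinct prefixes with 3125 rows each, since 1600000 / 512 = 3125 with no remainder.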