# NLP Example, Twitter Tweets <img style="float: right; width: 310px;" src="./Data/Twitter_Logo.jpg"/>  
  
---  

### By: Heather M. Steich, M.S.
### Date: October 29$^{th}$, 2017
### Written in: Python 3.4.5

In [1]:
import sys
print(sys.version)

3.4.5 |Anaconda custom (64-bit)| (default, Jul  5 2016, 14:53:07) [MSC v.1600 64 bit (AMD64)]


---  
  
## Dataset Credit  
  
  
The dataset for this project is used with permission, on condition that the following source is cited:  

    Z. Cheng, J. Caverlee, and K. Lee. You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users. 
    In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM), Toronto, Oct 2010.

<https://archive.org/details/twitter_cikm_2010><img style="float: center;" src="./Data/paper_logo.gif">

---  

## Overview

The goal of this exercise is to extract information about concert appearances by musicians, performers, or bands. For each such tweet, we want to extract:  

 - Who the performer was  
 - When the show took place  
 - Where the show took place  
 - The Twitter user who attended it  
 - The sentiment of the tweet  
   
Not all of these fields are available in every tweet, and that’s OK.  

Each row in the dataset includes the ID of the user who sent the tweet and the tweet's timestamp. For the ‘when’ field, we are interested in the date of the show, not just the date of the tweet. We are not interested in any other tweets, including tweets about performers that don’t mention concerts.
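
As a rough illustration of the target output, the fields listed above could be held in a small record type like the one below. This is only a sketch: the class name `ConcertMention` and all of its field names are hypothetical, and `dataclasses` requires Python 3.7 or later (newer than the 3.4.5 interpreter this notebook was written in). Fields that cannot be extracted from a tweet simply stay `None`.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass
class ConcertMention:
    """One concert-related tweet and whatever fields could be extracted from it."""
    user_id: int                      # user who sent the tweet (present in every row)
    tweeted_at: datetime              # timestamp of the tweet (present in every row)
    performer: Optional[str] = None   # who performed
    show_date: Optional[date] = None  # date of the show, not of the tweet
    venue: Optional[str] = None       # where the show took place
    sentiment: Optional[str] = None   # e.g. 'positive' or 'negative'


# Made-up values for illustration; not taken from the dataset:
mention = ConcertMention(
    user_id=12345,
    tweeted_at=datetime(2009, 12, 8, 21, 0),
    performer="Some Band",
)
print(mention.venue)  # the venue was not extracted, so it stays None
```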

---  
  
# Part 1: Prepare the Data  
  
### - Step 1: Load the required libraries

In [1]:
## LOAD LIBRARIES

# Data wrangling & processing: 
import numpy as np
import pandas as pd

# Remove warning messages:
#import warnings
#warnings.filterwarnings('ignore')

---  
  
### - Step 2: Load & view the data

In [83]:
## LOAD DATA:

# Read each physical line into a single column 'A'; the tab-separated fields are split later:
train = pd.read_csv("./Data/training_set_tweets.txt", delimiter="\n", names=['A'])
test = pd.read_csv("./Data/test_set_tweets.txt", delimiter="\n", names=['A'])

# Print shapes:
print('Train Shape:', train.shape)
print('Train Column Names:', train.columns)
print('\nTest Shape:', test.shape)
print('Test Column Names:', test.columns)

Train Shape: (3822473, 1)
Train Column Names: Index(['A'], dtype='object')

Test Shape: (5150011, 1)
Test Column Names: Index(['A'], dtype='object')
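
Newer pandas releases may reject `"\n"` as a delimiter, so the `read_csv` call above might not run unchanged outside the environment this notebook was written in. A minimal alternative sketch, assuming UTF-8 input, reads the physical lines with plain Python instead. The helper name `read_raw_lines` is hypothetical, and the tiny file written here is made up so the example runs without the real dataset.

```python
import os
import tempfile


def read_raw_lines(path, encoding="utf-8"):
    """Return the non-empty physical lines of a text file, without line endings."""
    with open(path, encoding=encoding, errors="replace", newline="\n") as handle:
        # Strip only the line ending; tabs inside the line are kept for later splitting.
        stripped = (line.rstrip("\r\n") for line in handle)
        # Empty lines are dropped here, on the assumption that they carry no data.
        return [line for line in stripped if line]


# Made-up stand-in for training_set_tweets.txt:
sample_path = os.path.join(tempfile.mkdtemp(), "sample_tweets.txt")
with open(sample_path, "w", encoding="utf-8", newline="\n") as handle:
    handle.write("1\t10\thello\t2009-01-01 00:00:00\n\n")
    handle.write("2\t20\tworld\t2009-01-02 00:00:00\n")

sample_lines = read_raw_lines(sample_path)
print(len(sample_lines))
```

The resulting list can be wrapped as `pd.DataFrame({'A': sample_lines})` to obtain the same one-column layout used in the rest of this notebook.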


In [84]:
## PRINT A PREVIEW OF THE DATAFRAMES:

train.head()

Unnamed: 0,A
0,60730027\t6320951896\t@thediscovietnam coo. t...
1,60730027\t6320673258\t@thediscovietnam shit it...
2,60730027\t6319871652\t@thediscovietnam hey cod...
3,60730027\t6318151501\t@smokinvinyl dang. you ...
4,60730027\t6317932721\tmaybe i'm late in the ga...


In [85]:
test.head()

Unnamed: 0,A
0,22077441\t10538487904\tOk today I have to find...
1,22077441\t10536835844\tI am glad I'm having th...
2,22077441\t10536809086\tHonestly I don't even k...
3,22077441\t10534149786\t@LovelyJ_Janelle hey so...
4,22077441\t10530203659\tSitting infront of this...


---  
  
### - Step 3: Prepare the data

In [112]:
## DEFINE A FUNCTION TO PROPERLY STRUCTURE THE DATA:

def restructure_data(df):
    print('Starting phase 1!')
    
    # Split each raw line into at most four tab-separated fields:
    split_df = df['A'].str.split('\t', 3, expand=True)
    split_df.columns = ['UserID', 'tTweetID', 'tTweet', 'tCreatedAt']
            
    bad_rows = split_df[~split_df.UserID.str.isnumeric()].index
            
    while True:
        if bad_rows.shape[0] > 0:
            #print(' ', bad_rows.shape[0])
                    
            bad_rows = df.iloc[bad_rows, :]
            bad_rows.index = bad_rows.index - 1
                    
            good_rows = split_df[split_df.UserID.str.isnumeric()].index
            good_rows = df.iloc[good_rows, :]
                    
            combined_df = pd.concat([good_rows, bad_rows], axis=1)
            combined_df.columns = ['A', 'B']
                    
            df = pd.DataFrame((combined_df['A'].fillna('') + 
                               combined_df['B'].fillna('')), 
                               columns=['A']).reset_index(drop=True)
                    
            split_df = df['A'].str.split('\t', 3, expand=True)
            split_df.columns = ['UserID', 'tTweetID', 'tTweet', 'tCreatedAt']
            
            bad_rows = split_df[~split_df.UserID.str.isnumeric()].index
                
        else:
            print('Done with phase 1!')
                    
            bad_rows = split_df[split_df.tTweetID.isnull()].index

            while True:
                if bad_rows.shape[0] > 0:
                    #print('  ', bad_rows.shape[0])
                            
                    bad_rows = df.iloc[bad_rows, :]
                    bad_rows.index = bad_rows.index - 1

                    good_rows = split_df[~split_df.tTweetID.isnull()].index
                    good_rows = df.iloc[good_rows, :]

                    combined_df = pd.concat([good_rows, bad_rows], axis=1)
                    combined_df.columns = ['A', 'B']

                    df = pd.DataFrame((combined_df['A'].fillna('') + 
                                       combined_df['B'].fillna('')), 
                                       columns=['A']).reset_index(drop=True)

                    split_df = df['A'].str.split('\t', 3, expand=True)
                    split_df.columns = ['UserID', 'tTweetID', 'tTweet', 'tCreatedAt']

                    bad_rows = split_df[split_df.tTweetID.isnull()].index

                else:
                    print('Done with phase 2!')
                    
                    bad_rows = split_df[~split_df.tTweetID.str.isnumeric()].index

                    while True:
                        if bad_rows.shape[0] > 0:
                            #print('   ', bad_rows.shape[0])

                            bad_rows = df.iloc[bad_rows, :]
                            bad_rows.index = bad_rows.index - 1

                            good_rows = split_df[split_df.tTweetID.str.isnumeric()].index
                            good_rows = df.iloc[good_rows, :]

                            combined_df = pd.concat([good_rows, bad_rows], axis=1)
                            combined_df.columns = ['A', 'B']

                            df = pd.DataFrame((combined_df['A'].fillna('') + 
                                               combined_df['B'].fillna('')), 
                                               columns=['A']).reset_index(drop=True)

                            split_df = df['A'].str.split('\t', 3, expand=True)
                            split_df.columns = ['UserID', 'tTweetID', 'tTweet', 'tCreatedAt']

                            bad_rows = split_df[~split_df.tTweetID.str.isnumeric()].index

                        else:
                            print('Done with phase 3!')
                                    
                            strip_timestamps = df['A'].str.rsplit('\t', 1, expand=True)
                            other_columns = strip_timestamps[0].str.split('\t', 2, expand=True)
                            split_df = pd.concat([other_columns, strip_timestamps[1]], axis=1)
                            split_df.columns = ['UserID', 'tTweetID', 'tTweet', 'tCreatedAt']
                                    
                            # Cast the ID columns to integers and parse the timestamps:
                            split_df.UserID = split_df.UserID.astype(np.int64)
                            split_df.tTweetID = split_df.tTweetID.astype(np.int64)
                            split_df.tCreatedAt = pd.to_datetime(split_df.tCreatedAt)
                            
                            print('Dataset completed!\n')
                            return split_df
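
The final step of `restructure_data` splits the two IDs from the left and the timestamp from the right, so any tab characters inside the tweet text stay in the tweet. The same idea can be sketched for a single line in plain Python. The function name `parse_record` is hypothetical, and `str.isdigit` is used here as a stand-in for the pandas `str.isnumeric` checks.

```python
def parse_record(line):
    """Split one complete record into (user_id, tweet_id, text, created_at).

    Returns None when the line does not look like a complete record.
    """
    head = line.split("\t", 2)           # at most: UserID, TweetID, remainder
    if len(head) < 3 or not (head[0].isdigit() and head[1].isdigit()):
        return None
    parts = head[2].rsplit("\t", 1)      # the timestamp follows the last tab
    if len(parts) < 2:
        return None
    text, created_at = parts
    return int(head[0]), int(head[1]), text, created_at


# Made-up record whose tweet text itself contains a tab:
record = parse_record("14\t1004190858\tfirst half\tsecond half\t2008-11-13 11:55:58")
print(record)
```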

In [113]:
## APPLY THE RESTRUCTURING FUNCTION TO THE TRAINING & TEST SETS:
## (This will take several minutes to run.)

train_df = restructure_data(train)
test_df = restructure_data(test)

# Print corrected shapes:
print('Corrected Train Shape:', train_df.shape)
print('\nCorrected Test Shape:', test_df.shape)

  80516
  12604
  3386
  1302
  607
  319
  168
  91
  49
  19
  8
  6
  5
  3
  1
Done with phase 1!
Done with phase 2!
    22
    7
    5
    3
    2
    1
Done with phase 3!
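
The shrinking counts above appear to reflect how the function works: each pass appends a stray line to the row directly before it, so a tweet broken across several lines needs several passes. As an alternative sketch (not the method used in this notebook), the same merge can be done in a single pass over the raw lines in plain Python. Both function names are hypothetical, and the sample lines are made up.

```python
def looks_like_record_start(line):
    """True when the line begins with two tab-separated numeric fields."""
    fields = line.split("\t", 2)
    return len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit()


def merge_continuations(lines):
    """Append every continuation line to the record that precedes it."""
    merged = []
    for line in lines:
        if merged and not looks_like_record_start(line):
            merged[-1] += line   # direct concatenation, as in restructure_data
        else:
            merged.append(line)
    return merged


# One tweet broken across three physical lines, followed by an intact record:
raw = [
    "1\t10\thello",
    "wor",
    "ld\t2009-01-01 00:00:00",
    "2\t20\tbye\t2009-01-02 00:00:00",
]
merged = merge_continuations(raw)
print(len(merged))
```

Like the notebook's approach, this sketch would misclassify a continuation line that happens to start with two numeric fields.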


---  
  
### - Step 4: Validate that the restructured data is correct

In [138]:
## CHECK DATA TYPES:

print('Training:\n', train_df.dtypes)
print('\nTesting:\n', test_df.dtypes)

Training:
 UserID                 int64
tTweetID               int64
tTweet                object
tCreatedAt    datetime64[ns]
dtype: object

Testing:
 UserID                 int64
tTweetID               int64
tTweet                object
tCreatedAt    datetime64[ns]
dtype: object


In [133]:
## CHECK FOR MISSING VALUES:

print('Missing Data in Training Set:\n', train_df.isnull().sum().values.sum())
print('Missing Data in Testing Set:\n', test_df.isnull().sum().values.sum())

Missing Data in Training Set:
 0
Missing Data in Testing Set:
 0
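
The individual checks in this step could also be gathered into one small report. The sketch below is an assumption about what might be worth checking, not part of the original analysis: the function name `validate_tweets` is hypothetical, the duplicate-ID check is an extra that the notebook does not perform, and the rows are made up.

```python
import pandas as pd

EXPECTED_COLUMNS = ["UserID", "tTweetID", "tTweet", "tCreatedAt"]


def validate_tweets(df):
    """Return a dict of simple sanity checks for a restructured tweet table."""
    return {
        "has_expected_columns": list(df.columns) == EXPECTED_COLUMNS,
        "missing_values": int(df.isnull().sum().sum()),
        "duplicate_tweet_ids": int(df["tTweetID"].duplicated().sum()),
        "earliest": df["tCreatedAt"].min(),
        "latest": df["tCreatedAt"].max(),
    }


# Made-up rows in the same layout as train_df; tweet ID 101 is repeated on purpose:
toy = pd.DataFrame({
    "UserID": [14, 14, 5622],
    "tTweetID": [100, 101, 101],
    "tTweet": ["a", "b", "c"],
    "tCreatedAt": pd.to_datetime(["2009-01-01", "2009-01-02", "2009-01-03"]),
})
report = validate_tweets(toy)
print(report)
```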


In [162]:
## CHECK HEADS & TAILS OF EACH SORTED COLUMN:

train_df.sort_values(by='UserID').head()

Unnamed: 0,UserID,tTweetID,tTweet,tCreatedAt
259,14,1004190858,Giving 60 dictionaries to 60 highly energetic ...,2008-11-13 11:55:58
283,14,4119972522,ound like French summer.,2009-09-20 02:21:31
284,14,3881145051,fireworks blast over Santa Monica scattering s...,2009-09-09 23:44:42
285,14,3599227170,today my prototype danced for me. perfect piro...,2009-08-28 03:52:22
286,14,3324787395,If Kafka gave birth atop the trade center towe...,2009-08-15 02:09:23


In [163]:
train_df.sort_values(by='UserID').tail()

Unnamed: 0,UserID,tTweetID,tTweet,tCreatedAt
2189710,95260481,6470419090,get paid with paypal everyweek no hard work ne...,2009-12-08 12:10:01
2189709,95260481,6470548793,I get paid Online making about 12k a Month wor...,2009-12-08 12:14:34
2189708,95260481,6470585312,How is the new Jay-z album should i buy it or ...,2009-12-08 12:15:50
2189706,95260481,6470751598,What's this all about all I can say is you hav...,2009-12-08 12:22:08
2189734,95260481,6468176439,"nearly my son's birthday he loves disney cars,...",2009-12-08 10:45:57


In [164]:
train_df.sort_values(by='tTweetID').head()

Unnamed: 0,UserID,tTweetID,tTweet,tCreatedAt
119368,5622,28191,trying to figure out what this thing is.,2006-09-10 11:07:07
825210,38323,581203,Getting ready to go to the Improv Olympic to s...,2006-12-02 21:34:25
1663297,690103,3873613,Customer support is no fun at all.,2007-01-23 18:07:29
1663296,690103,3880013,I am still doing customer support,2007-01-23 19:01:33
1663295,690103,3880323,Bryan is a nice guy cause he thinks I shouldn'...,2007-01-23 19:04:58


In [165]:
train_df.sort_values(by='tTweetID').tail()

Unnamed: 0,UserID,tTweetID,tTweet,tCreatedAt
552720,22307857,6485532854,"NYC at holiday time, can't go wrong. Ate with ...",2009-12-08 21:27:36
1738094,15676578,6485597908,Spoiler: Find out who won Biggest Loser! http:...,2009-12-08 21:29:53
1672933,22097952,6485873515,"@katelynroseee hahaha. I mean, New York Style ...",2009-12-08 21:39:27
108263,34739186,6486280798,WOW! Cardio Barre was AMAZING-I had no idea u ...,2009-12-08 21:54:03
108262,34739186,6486831431,Stop tweeting while we're at dinner!!! ;) RT @...,2009-12-08 22:14:24


In [166]:
train_df.sort_values(by='tTweet').head()

Unnamed: 0,UserID,tTweetID,tTweet,tCreatedAt
166674,20192982,1251289107,!,2009-02-25 16:57:39
3353374,65642903,4123445005,!,2009-09-20 08:09:00
76889,15863302,5658023171,!,2009-11-12 14:00:34
2161550,17926620,6287028586,!,2009-12-02 18:37:42
218118,30681120,5744360824,!,2009-11-15 14:11:25


In [167]:
train_df.sort_values(by='tTweet').tail()

Unnamed: 0,UserID,tTweetID,tTweet,tCreatedAt
2822022,29392455,4164281228,１日ほんの数ページずつですが、坂の上の雲を再読中、以前は完全に龍馬がゆく派だったが、変わったかも。,2009-09-21 22:09:00
3564787,22874169,4333563740,３年前酔っぱらってリバースしたコロナビール飲んでます。酒もってこーい！！,2009-09-23 22:35:03
3036928,52706200,4118931198,４月にサンディエゴで始まる予定のアニメのイベント。その名も「アニメ漢字」ーＡＮＩＭＥ　ＣＯＮ...,2009-09-20 00:09:00
3415132,14330200,3991134830,９月１５日、火曜日。曇り。,2009-09-14 18:09:00
3532366,34374919,4119614535,�Red' Sims will always love �his' boys: During...,2009-09-20 01:09:00


In [168]:
train_df.sort_values(by='tCreatedAt').head()

Unnamed: 0,UserID,tTweetID,tTweet,tCreatedAt
119368,5622,28191,trying to figure out what this thing is.,2006-09-10 11:07:07
825210,38323,581203,Getting ready to go to the Improv Olympic to s...,2006-12-02 21:34:25
1663297,690103,3873613,Customer support is no fun at all.,2007-01-23 18:07:29
1663296,690103,3880013,I am still doing customer support,2007-01-23 19:01:33
1663295,690103,3880323,Bryan is a nice guy cause he thinks I shouldn'...,2007-01-23 19:04:58


In [169]:
train_df.sort_values(by='tCreatedAt').tail()

Unnamed: 0,UserID,tTweetID,tTweet,tCreatedAt
552720,22307857,6485532854,"NYC at holiday time, can't go wrong. Ate with ...",2009-12-08 21:27:36
1738094,15676578,6485597908,Spoiler: Find out who won Biggest Loser! http:...,2009-12-08 21:29:53
1672933,22097952,6485873515,"@katelynroseee hahaha. I mean, New York Style ...",2009-12-08 21:39:27
108263,34739186,6486280798,WOW! Cardio Barre was AMAZING-I had no idea u ...,2009-12-08 21:54:03
108262,34739186,6486831431,Stop tweeting while we're at dinner!!! ;) RT @...,2009-12-08 22:14:24


In [162]:
test_df.sort_values(by='UserID').head()



In [162]:
test_df.sort_values(by='UserID').tail()



In [162]:
test_df.sort_values(by='tTweetID').head()



In [162]:
test_df.sort_values(by='tTweetID').tail()



In [162]:
test_df.sort_values(by='tTweet').head()



In [162]:
test_df.sort_values(by='tTweet').tail()



In [162]:
test_df.sort_values(by='tCreatedAt').head()



In [170]:
test_df.sort_values(by='tCreatedAt').tail()

Unnamed: 0,UserID,tTweetID,tTweet,tCreatedAt
4425852,15342502,10617707268,"Bout to nap it up, please DND.",2010-03-17 06:49:15
2427177,50765154,10617911844,Made it to work ontime! Now I need spmeone to ...,2010-03-17 06:55:50
1384903,18782786,10618404437,"@wizwow, thanks for yet another insightful pos...",2010-03-17 07:10:55
3119490,26962064,10618517948,"@MiszTdott when I pointless text you, you poin...",2010-03-17 07:14:22
235010,15458701,10619091335,Preach Jesus Christ first and foremost as a ma...,2010-03-17 07:31:11


---  
  
### - Step 5: Write out the restructured datasets for future ingestion

In [171]:
## WRITE OUT RESTRUCTURED DATA TO CSV:

train_df.to_csv('./Data/corrected_training_set_tweets.csv')
test_df.to_csv('./Data/corrected_test_set_tweets.csv')
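
When the files are read back later, the dtypes established in Step 3 are not restored automatically: the timestamp column comes back as text unless it is parsed again. The sketch below shows one way a round trip could look on a made-up one-row table. It writes with `index=False`, which differs from the calls above; with the default index, the reloaded file would contain an extra unnamed index column unless `index_col=0` is passed to `read_csv`.

```python
import os
import tempfile

import pandas as pd

# Made-up row in the same layout as train_df:
toy = pd.DataFrame({
    "UserID": [14],
    "tTweetID": [1004190858],
    "tTweet": ["example tweet, with a comma"],
    "tCreatedAt": pd.to_datetime(["2008-11-13 11:55:58"]),
})

path = os.path.join(tempfile.mkdtemp(), "toy_tweets.csv")
toy.to_csv(path, index=False)   # skip the row index so no extra column is written

# parse_dates restores the datetime dtype for the timestamp column:
reloaded = pd.read_csv(path, parse_dates=["tCreatedAt"])
print(reloaded.dtypes)
```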