# NLP Example, Twitter Tweets <img style="float: right; width: 310px;" src="./Data/Twitter_Logo.jpg"/>  
  
---  

### By: Heather M. Steich, M.S.
### Date: October 29$^{th}$, 2017
### Written in: Python 3.4.5

In [1]:
import sys
print(sys.version)

3.4.5 |Anaconda custom (64-bit)| (default, Jul  5 2016, 14:53:07) [MSC v.1600 64 bit (AMD64)]


---  
  
## Dataset Credit  
  
  
The data for this project is used with permission, provided it is cited, from the following source:  

    Z. Cheng, J. Caverlee, and K. Lee. You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users. 
    In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM), Toronto, Oct 2010.

<https://archive.org/details/twitter_cikm_2010><img style="float: center;" src="./Data/paper_logo.gif">

---  

## Overview

The goal of this exercise is to extract information from tweets about concert appearances by musicians, performers, or bands.  For each such tweet, we want to extract:  

 - Who was the performer  
 - When was the show  
 - Where was the show  
 - The Twitter user who attended it  
 - The sentiment of the tweet  
   
Not all of these fields are available in all tweets, and that’s ok.  
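
As a rough illustration of the target output, the five fields above could be held in a small record type in which any field that cannot be extracted is left as `None`. This is only a sketch: the names `ConcertMention` and `make_mention` and the field names are hypothetical, and the performer value is a placeholder rather than something taken from the data.

```python
from collections import namedtuple

# Hypothetical record type; the field names are illustrative, not from the dataset.
ConcertMention = namedtuple(
    "ConcertMention",
    ["performer", "show_date", "venue", "user_id", "sentiment"],
)


def make_mention(user_id, performer=None, show_date=None, venue=None, sentiment=None):
    """Build a record, leaving any field that could not be extracted as None."""
    return ConcertMention(performer, show_date, venue, user_id, sentiment)


# Placeholder values: only the user ID and the performer are known here.
example = make_mention(user_id=22077441, performer="Some Band")

# List the fields that are still missing for this tweet.
missing = [name for name, value in example._asdict().items() if value is None]
print(missing)
```

Keeping missing fields as `None` rather than dropping the record matches the note above that partial information is acceptable.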

Each row in the dataset includes the ID of the user who sent the tweet and the tweet's timestamp. For the ‘when’ field, we are interested in the date of the show, not just the date of the tweet. All other tweets are out of scope, including tweets about performers that don’t mention concerts.
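
To make the difference between the tweet date and the show date concrete, here is a minimal sketch that shifts the tweet timestamp by a relative-day phrase found in the text. It assumes a deliberately tiny vocabulary; `RELATIVE_DAYS` and `resolve_show_date` are hypothetical names, the example tweet is made up, and real tweets would need far more robust date parsing.

```python
import datetime as dt
import re

# Assumed, deliberately small vocabulary: (phrase, offset in days from the tweet).
RELATIVE_DAYS = [
    ("last night", -1),
    ("yesterday", -1),
    ("tonight", 0),
    ("today", 0),
    ("tomorrow", 1),
]


def resolve_show_date(tweet_text, created_at):
    """Return the show date implied by a relative-day phrase, or None if none is found."""
    text = tweet_text.lower()
    for phrase, offset in RELATIVE_DAYS:
        # Word boundaries avoid matching the phrase inside a longer word.
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            return created_at.date() + dt.timedelta(days=offset)
    return None


# Made-up tweet text paired with a timestamp in the dataset's format.
created_at = dt.datetime(2010, 3, 15, 17, 35, 58)
show_date = resolve_show_date("Seeing the band tomorrow at the arena", created_at)
print(show_date)
```

A tweet with no recognized phrase returns `None`, which leaves the ‘when’ field empty rather than falling back to the tweet date.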

---  
  
### Part 1: Classify whether the tweets are relevant

In [2]:
## LOAD LIBRARIES

# Data wrangling & processing: 
import numpy as np
import pandas as pd

# Machine learning:
#from sklearn.preprocessing import StandardScaler
#from sklearn.model_selection import train_test_split
#from sklearn.ensemble import RandomForestClassifier as RF
#from sklearn.metrics import confusion_matrix
#from pandas_ml import ConfusionMatrix
#from sklearn.metrics import roc_curve
#from sklearn.ensemble import RandomForestRegressor as RFR

# Plotting:
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
#from IPython.display import display, HTML

# Remove warning messages:
#import warnings
#warnings.filterwarnings('ignore')

In [3]:
## ESTABLISH PLOT FORMATTING

#mpl.rcdefaults()  # Resets plot defaults

%matplotlib inline

def plt_format():
    """Set the matplotlib defaults used throughout the notebook."""
    plt.rcParams['figure.figsize'] = (16, 10)
    plt.rcParams['font.size'] = 16
    plt.rcParams['font.family'] = 'Times New Roman'
    plt.rcParams['axes.labelcolor'] = 'black'
    plt.rcParams['axes.labelsize'] = 20
    plt.rcParams['axes.labelweight'] = 'bold'
    plt.rcParams['axes.titlesize'] = 32
    plt.rcParams['axes.titleweight'] = 'bold'
    plt.rcParams['axes.linewidth'] = 1
    plt.rcParams['legend.fontsize'] = 16
    plt.rcParams['legend.markerscale'] = 4
    plt.rcParams['legend.frameon'] = False
    plt.rcParams['text.color'] = 'black'
    plt.rcParams['xtick.labelsize'] = 20
    plt.rcParams['ytick.labelsize'] = 20

#plt.rcParams.keys()  # Available rcParams
plt_format()

 - Step 2: Load, view & prepare the provided data

In [18]:
## LOAD DATA:

# Read in the files:
train = pd.read_csv("./Data/corrected_training_set_tweets.csv")
test = pd.read_csv("./Data/corrected_test_set_tweets.csv")

# Translate the timestamps to DateTime objects:
train.tCreatedAt = pd.to_datetime(train.tCreatedAt)
test.tCreatedAt = pd.to_datetime(test.tCreatedAt)

# Print shapes:
print('Train Shape:', train.shape)
print('Train Column Names:', train.columns)
print('\nTest Shape:', test.shape)
print('Test Column Names:', test.columns)

Train Shape: (3741881, 4)
Train Column Names: Index(['UserID', 'tTweetID', 'tTweet', 'tCreatedAt'], dtype='object')

Test Shape: (5125748, 4)
Test Column Names: Index(['UserID', 'tTweetID', 'tTweet', 'tCreatedAt'], dtype='object')
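
As an alternative to converting the timestamps after loading, `pd.read_csv` can parse them during the read through its `parse_dates` parameter. The sketch below uses a tiny in-memory CSV with made-up rows in place of the real files, so only the column names are taken from the dataset.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the CSV files; the rows are made up for illustration.
sample_csv = io.StringIO(
    "UserID,tTweetID,tTweet,tCreatedAt\n"
    "1,100,first example tweet,2009-12-03 18:41:07\n"
    "2,101,second example tweet,2010-03-15 17:35:58\n"
)

# Parse the timestamp column while reading instead of in a separate step.
sample = pd.read_csv(sample_csv, parse_dates=["tCreatedAt"])
print(sample.dtypes)
```

Both approaches should yield a datetime column; which one is faster on files of this size is something to measure rather than assume.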


In [15]:
## PRINT A PREVIEW OF THE DATAFRAMES:

train.head()

| | UserID | tTweetID | tTweet | tCreatedAt |
|---|---|---|---|---|
| 0 | 60730027 | 6320951896 | @thediscovietnam coo. thanks. just dropped yo... | 2009-12-03 18:41:07 |
| 1 | 60730027 | 6320673258 | @thediscovietnam shit it ain't lettin me DM yo... | 2009-12-03 18:31:01 |
| 2 | 60730027 | 6319871652 | @thediscovietnam hey cody, quick question...ca... | 2009-12-03 18:01:51 |
| 3 | 60730027 | 6318151501 | @smokinvinyl dang. you need anything? I got ... | 2009-12-03 17:00:16 |
| 4 | 60730027 | 6317932721 | maybe i'm late in the game on this one, but th... | 2009-12-03 16:52:36 |


In [16]:
test.head()

| | UserID | tTweetID | tTweet | tCreatedAt |
|---|---|---|---|---|
| 0 | 22077441 | 10538487904 | Ok today I have to find something to wear for ... | 2010-03-15 17:35:58 |
| 1 | 22077441 | 10536835844 | I am glad I'm having this show but I can't wai... | 2010-03-15 16:53:44 |
| 2 | 22077441 | 10536809086 | Honestly I don't even know what's going on any... | 2010-03-15 16:52:59 |
| 3 | 22077441 | 10534149786 | @LovelyJ_Janelle hey sorry I'm sitting infront... | 2010-03-15 15:42:07 |
| 4 | 22077441 | 10530203659 | Sitting infront of this sewing machine ... I d... | 2010-03-15 13:55:22 |


In [19]:
## CHECK DATA TYPES:

print('Training:\n', train.dtypes)
print('\nTesting:\n', test.dtypes)

Training:
 UserID                 int64
tTweetID               int64
tTweet                object
tCreatedAt    datetime64[ns]
dtype: object

Testing:
 UserID                 int64
tTweetID               int64
tTweet                object
tCreatedAt    datetime64[ns]
dtype: object
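
Beyond data types, a natural next check when preparing the data is to look for missing values and repeated tweet IDs. The following is a sketch under assumptions: `summarize_quality` is a hypothetical helper, and the demo frame contains made-up rows that only mirror the dataset's column names.

```python
import pandas as pd


def summarize_quality(df, id_col="tTweetID"):
    """Count missing values per column and rows whose ID repeats an earlier row."""
    return {
        "missing_per_column": df.isnull().sum().to_dict(),
        "duplicate_ids": int(df.duplicated(subset=id_col).sum()),
    }


# Made-up rows: one repeated tweet ID and one missing tweet text.
demo = pd.DataFrame({
    "UserID": [1, 1, 2],
    "tTweetID": [100, 100, 101],
    "tTweet": ["a tweet", "a tweet", None],
    "tCreatedAt": pd.to_datetime([
        "2009-12-03 18:41:07",
        "2009-12-03 18:41:07",
        "2010-03-15 17:35:58",
    ]),
})

report = summarize_quality(demo)
print(report)
```

The same call could be applied to `train` and `test`; whether those frames actually contain missing values or duplicates is not shown in the outputs above.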
