## Dataset Preparation For Sentiment Analysis

Here, we build the training and testing data for a sentiment classifier based on Twitter data. The classifier will then be applied to Brexit-related tweets posted between June 1, 2016 and July 15, 2016 to gauge general sentiment toward the issue.

The training and testing data come from the Sentiment140 dataset on Kaggle, which contains 1.6M tweets labeled with sentiment. It can be found [here](https://www.kaggle.com/kazanova/sentiment140/data).

In [1]:
import os
import pandas as pd
from processor.sentiment_data_processor import SentimentDataProcessor
pjoin = os.path.join

In [2]:
def build_training_set(corpus_file, cleaned_file):
    '''
    Clean the original CSV file (corpus_file), save results to new CSV file (cleaned_file)
    '''
    # Get contents from original CSV file first
    colnames = ['Sentiment', 'Tweet ID', 'DateTime', 'Query', 'Username', 'Text']
    df = pd.read_csv(corpus_file, encoding='latin-1', names=colnames)
    
    # Keep sentiment and text columns only
    clean_df = df[['Sentiment', 'Text']]

    # Save the cleaned df to new file
    clean_df.to_csv(cleaned_file, index=False)
    
    print(f'File saved: {cleaned_file}')
    
corpus_file = './data/tweet/sentiment_analysis/training_set.csv'
tweet_data_file = './data/tweet/sentiment_analysis/training_set_clean.csv'

build_training_set(corpus_file, tweet_data_file)


File saved: ./data/tweet/sentiment_analysis/training_set_clean.csv
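
Before preprocessing, it can be useful to confirm that the cleaned file has the expected class balance. The following is a minimal sketch, not part of the original notebook: `count_labels` is a hypothetical helper that reads the cleaned CSV with the standard library and tallies the `Sentiment` column. It assumes the file was written by `to_csv` with its default UTF-8 encoding and a header row.

```python
import csv
from collections import Counter


def count_labels(cleaned_file):
    """Count how many rows carry each sentiment label in the cleaned CSV.

    Hypothetical helper: assumes a header row containing a 'Sentiment' column.
    Labels are returned as strings, exactly as they appear in the file.
    """
    with open(cleaned_file, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        return Counter(row['Sentiment'] for row in reader)
```

Calling `count_labels(tweet_data_file)` returns a `Counter` keyed by label. In the Sentiment140 training file, labels are 0 (negative) and 4 (positive), so a heavily skewed count would suggest a problem in the cleaning step.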


In [2]:
# Preprocess tweets in the cleaned CSV file
tweet_data_file = './data/tweet/sentiment_analysis/training_set_clean.csv'

p = SentimentDataProcessor(tweet_data_file, num_tweets='all')

df = p.get_processed_df()
df.head()

3834it [00:00, 38337.29it/s]

Read all tweets


1600000it [00:38, 41729.65it/s]


| | Sentiment | Tweet |
|---|---|---|
| 0 | 0 | [awww, that, bummer, you, shoulda, got, david,... |
| 1 | 0 | [upset, that, can, update, his, facebook, text... |
| 2 | 0 | [dived, many, times, for, the, ball, managed, ... |
| 3 | 0 | [whole, body, feels, itchy, and, like, its, fire] |
| 4 | 0 | [not, behaving, all, mad, why, here, because, ... |
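
The source of `SentimentDataProcessor` is not shown here, so its exact steps are unknown. The sample output above looks consistent with lowercasing, stripping URLs and @-mentions, keeping alphabetic tokens only, and dropping tokens shorter than three characters. The sketch below illustrates that kind of cleaning under those assumptions; `tokenize_tweet`, the regular expressions, and `min_len` are hypothetical and are not the processor's actual implementation.

```python
import re

# Assumed patterns for URLs and @-mentions; the real processor may differ.
URL_RE = re.compile(r'https?://\S+|www\.\S+')
MENTION_RE = re.compile(r'@\w+')


def tokenize_tweet(text, min_len=3):
    """Turn a raw tweet into a list of lowercase alphabetic tokens."""
    text = URL_RE.sub(' ', text)       # drop links
    text = MENTION_RE.sub(' ', text)   # drop @user mentions
    tokens = re.findall(r'[a-z]+', text.lower())  # letters only, splits on punctuation
    return [t for t in tokens if len(t) >= min_len]  # drop very short tokens


# Made-up example tweet, for illustration only.
print(tokenize_tweet("@user Check http://example.com - It's SO good!!"))
# prints ['check', 'good']
```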


In [3]:
# Store processed dataframe in pickle file
pkl_filedir = './data/tweet/sentiment_analysis/pkl'
# Create the output directory if it does not exist yet
os.makedirs(pkl_filedir, exist_ok=True)

pkl_file = pjoin(pkl_filedir, 'processed_tweets_sentiments.pkl')

df.to_pickle(pkl_file)
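
The pickled dataframe can later be reloaded with `pd.read_pickle(pkl_file)` and divided into training and testing portions. The original notebook does not show how that split is made, so the following is only a sketch: `train_test_indices`, the 80/20 ratio, and the seed are assumptions chosen for illustration. It uses the standard library to produce a reproducible, disjoint partition of row positions.

```python
import random


def train_test_indices(n_rows, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), a shuffled, disjoint partition of range(n_rows).

    Hypothetical helper: the fraction and seed are illustrative defaults.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError('test_fraction must be strictly between 0 and 1')
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = int(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]
```

With the dataframe above, this could be applied as `train_idx, test_idx = train_test_indices(len(df))`, followed by `df.iloc[train_idx]` and `df.iloc[test_idx]`. Shuffling before splitting matters if the rows are ordered by sentiment, as the all-zero labels in the first rows of the output suggest may be the case.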
