# Machine Learning Course Project
## Identifying User Stance On Social Media via Semi-Supervised Learning

### Overview

#### Midsem Pipeline - 

 - **Read Data**: Read the text files and load all the words.
 - **Clean Data**: Remove stop words, lowercase everything, and strip the `#` from hashtags ("dehashify").
 - **Format Data**: Convert the data into the format required by each baseline method.
 - **Baseline Approaches**: LSA, pLSA, Para2Vec, and LDA topic modelling. The goal of these approaches is to create a **fixed-size**, **high-level** feature representation for variable-length tweets. These representations leverage our unlabelled data.
 - **Training**: Train a supervised classifier on the learned representations using the given labels.
 - **Evaluation**: Compare the methods above across the different datasets.
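
To make the idea of a fixed-size representation concrete, here is a minimal LSA sketch. It assumes a plain bag-of-words count matrix and NumPy's SVD rather than the gensim models imported later in the notebook. The toy tweets, the `lsa_embed` helper, and the choice of `k=2` are illustrative assumptions, not part of the project code.

```python
import numpy as np

def lsa_embed(docs, k=2):
    """Map each tokenised document to a k-dimensional LSA vector (hypothetical helper)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}

    # Document-term count matrix: one row per document, one column per word.
    counts = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for tok in doc:
            counts[row, index[tok]] += 1

    # Truncated SVD: keep the k largest singular values and their left vectors.
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :k] * s[:k]

# Toy tweets of different lengths (made up for illustration).
toy_docs = [
    ["trump", "wall", "border"],
    ["trump", "rally", "crowd", "huge", "tonight"],
    ["climate", "change", "real"],
    ["wall", "border", "security", "now"],
]
vectors = lsa_embed(toy_docs, k=2)
print(vectors.shape)  # one k-dimensional vector per tweet, whatever its length
```

Each tweet ends up with the same number of features, which is what lets a standard supervised classifier consume tweets of varying length.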

#### Endsem Approaches - 
 - LDA2Vec - https://www.datacamp.com/community/tutorials/lda2vec-topic-model
 - Gaussian LDA - https://rajarshd.github.io/papers/acl2015.pdf
 - Word Embeddings Informed Topic Models - http://proceedings.mlr.press/v77/zhao17a/zhao17a.pdf
 
#### Reference
 - https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

## Constants

In [1]:
PATH_LABELLED_DATA_TRUMP = "../semeval2016-task6-domaincorpus/data-all-annotations/testdata-taskB-all-annotations.txt"
PATH_UNLABELLED_DATA_TRUMP = "./../semeval2016-task6-domaincorpus/downloaded_Donald_Trump.txt"

## Setup

In [2]:
# SETUP
# # Run in python console
# import nltk; nltk.download('stopwords')

# # Run in terminal or command prompt
# !python -m spacy download en

In [3]:
# !pip install gensim
# !pip install pyLDAvis

In [4]:
import pandas as pd
import numpy as np
import re
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim.models import LsiModel

# spacy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

In [5]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['via'])
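
The stop-word list built above is meant for the "Clean Data" step of the pipeline. A minimal sketch of how lowercasing and stop-word removal might be applied to a single tweet is shown below. `DEMO_STOP_WORDS` is a small hypothetical stand-in for the NLTK list so that the example runs without any downloads, and `remove_stop_words` is an illustrative helper, not a function from the project.

```python
import re

# Hypothetical stand-in for the NLTK stop-word list (plus 'via').
DEMO_STOP_WORDS = {"the", "is", "a", "to", "via", "and", "of"}

def remove_stop_words(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9_@]+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]

tokens = remove_stop_words("The debate is a disaster via @ABC")
print(tokens)
```

In the notebook itself, passing `stop_words` as the second argument would use the full NLTK list instead of the demo set.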

### Read Sem-Eval Task B Data (Labelled)

Interactive Visualization - http://www.saifmohammad.com/WebPages/StanceDataset.htm

Targets - 
 - Hillary Clinton
 - Atheism
 - Climate Change
 - Donald Trump
 - Feminism
 - Abortion

In [6]:
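# Load the tab-separated annotations file
data_labelled = pd.read_csv(PATH_LABELLED_DATA_TRUMP, sep='\t', lineterminator='\r', encoding='latin1')
# where() turns rows marked 'Not Available' into NaN; dropna() then removes them
data_labelled = data_labelled.where(data_labelled.Tweet != 'Not Available')
data_labelled.dropna(how='any', inplace=True)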

### Remove Symbols

In [7]:
def clean_tweets(sent):
    sent = str(sent)

    # Collapse runs of whitespace (including newlines) into a single space
    sent = re.sub(r'\s+', ' ', sent)

    # Remove distracting single quotes
    sent = re.sub(r"'", "", sent)

    # Remove distracting double quotes
    sent = re.sub(r'"', "", sent)

    # Strip the '#' symbol but keep the hashtag text
    sent = re.sub(r"#", "", sent)

    # Remove http:// and https:// links (only the URL, not the text after it)
    sent = re.sub(r'https?://\S+', '', sent)

    return sent
    

In [8]:
data_labelled['Tweet'] = data_labelled['Tweet'].apply(clean_tweets)
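
A note on link-removal patterns in general: a pattern that ends in `.*` matches from the link to the end of the string, so any words after the URL are deleted as well, whereas `\S+` stops at the next whitespace. The comparison below is a small sketch on a made-up sample tweet.

```python
import re

sample = "Watch this http://example.com/clip then tell me what you think"

# Greedy: '.*' runs to the end of the string, so the trailing text is lost.
greedy = re.sub(r"https?://.*", "", sample)

# Bounded: '\S+' stops at the next whitespace, so only the URL is removed.
bounded = re.sub(r"https?://\S+", "", sample)

print(repr(greedy))
print(repr(bounded))
```

The bounded version leaves a double space where the URL was, so it is best run before whitespace is collapsed, or followed by a second collapse.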

#### Data Stats

In [9]:
data_labelled.count()

ID                 707
Target             707
Tweet              707
Stance             707
Opinion towards    707
Sentiment          707
dtype: int64

In [10]:
data_labelled.head()

| | ID | Target | Tweet | Stance | Opinion towards | Sentiment |
|---|---|---|---|---|---|---|
| 0 | \n20001 | Donald Trump | @2014voteblue @ChrisJZullo blindly supporting ... | NONE | OTHER | NEGATIVE |
| 1 | \n20002 | Donald Trump | @ThePimpernelX @Cameron_Gray @CalebHowe Totall... | NONE | OTHER | POSITIVE |
| 2 | \n20003 | Donald Trump | @JeffYoung @ThePatriot143 I fully support full... | NONE | OTHER | POSITIVE |
| 3 | \n20004 | Donald Trump | @ABC Stupid is as stupid does! Showedhis true ... | AGAINST | TARGET | NEGATIVE |
| 4 | \n20005 | Donald Trump | @HouseGOP we now have one political party. The... | NONE | OTHER | NEGATIVE |


In [11]:
data_labelled.where(data_labelled.Stance == 'AGAINST').count()

ID                 299
Target             299
Tweet              299
Stance             299
Opinion towards    299
Sentiment          299
dtype: int64

In [12]:
data_labelled.where(data_labelled.Stance == 'NONE').count()

ID                 260
Target             260
Tweet              260
Stance             260
Opinion towards    260
Sentiment          260
dtype: int64

In [13]:
data_labelled.where(data_labelled.Stance == 'FAVOR').count()

ID                 148
Target             148
Tweet              148
Stance             148
Opinion towards    148
Sentiment          148
dtype: int64
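
The three per-class counts above can also be read off in a single call. A minimal sketch, assuming pandas and a stand-in Series rebuilt from the counts reported above rather than the real data:

```python
import pandas as pd

# Stand-in for data_labelled.Stance, rebuilt from the counts shown above.
stance = pd.Series(
    ["AGAINST"] * 299 + ["NONE"] * 260 + ["FAVOR"] * 148, name="Stance"
)

# value_counts() tallies every distinct label, sorted from most to least frequent.
class_counts = stance.value_counts()
print(class_counts)
```

In the notebook, `data_labelled.Stance.value_counts()` would give the same summary directly.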

## Balance Dataset

In [14]:
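# Oversample each stance class (with replacement) up to the size of the largest class
grouped = data_labelled.groupby('Stance')
largest_class_size = grouped.size().max()
data_labelled = pd.DataFrame(grouped.apply(lambda x: x.sample(largest_class_size, replace=True).reset_index(drop=True)))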
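
The balancing step draws random samples, so its result changes from run to run unless a seed is fixed. The sketch below shows the same oversampling idea on made-up toy data with `random_state` set. The `toy` frame and the per-group `pd.concat` formulation are illustrative assumptions, not the notebook's exact code.

```python
import pandas as pd

# Toy labelled data (made up): three stance classes of unequal size.
toy = pd.DataFrame({
    "Tweet": [f"tweet {i}" for i in range(9)],
    "Stance": ["AGAINST"] * 5 + ["NONE"] * 3 + ["FAVOR"] * 1,
})

target_size = toy["Stance"].value_counts().max()

# Sample every class up to the size of the largest one, with replacement;
# random_state makes the draw repeatable.
balanced = pd.concat(
    [group.sample(n=target_size, replace=True, random_state=0)
     for _, group in toy.groupby("Stance")],
    ignore_index=True,
)
print(balanced["Stance"].value_counts())
```

Minority classes are filled with repeated rows, so the balanced frame contains duplicates by construction.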

In [15]:
data_labelled.where(data_labelled.Stance == 'FAVOR').count()

ID                 299
Target             299
Tweet              299
Stance             299
Opinion towards    299
Sentiment          299
dtype: int64

In [16]:
data_labelled.where(data_labelled.Stance == 'AGAINST').count()

ID                 299
Target             299
Tweet              299
Stance             299
Opinion towards    299
Sentiment          299
dtype: int64

In [17]:
data_labelled.where(data_labelled.Stance == 'NONE').count()

ID                 299
Target             299
Tweet              299
Stance             299
Opinion towards    299
Sentiment          299
dtype: int64

## Create Train-Test Split

In [18]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_labelled.Tweet, data_labelled.Stance, test_size=0.3, random_state=0, shuffle = True)
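
Because the balancing step samples with replacement before the split, copies of the same tweet can land in both the training and the test set. Assuming that is not intended, one possible alternative is to split first, using the `stratify` argument of `train_test_split` to keep class proportions similar, and then oversample only the training portion. The sketch below uses made-up toy data and is not the notebook's method.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labelled data (made up for illustration).
toy = pd.DataFrame({
    "Tweet": [f"tweet {i}" for i in range(20)],
    "Stance": ["AGAINST"] * 10 + ["NONE"] * 6 + ["FAVOR"] * 4,
})

# 1. Split first; stratify keeps the class proportions similar in both parts.
train, test = train_test_split(
    toy, test_size=0.3, random_state=0, shuffle=True, stratify=toy["Stance"]
)

# 2. Oversample only the training rows, so no test tweet is copied into training.
target_size = train["Stance"].value_counts().max()
train_balanced = pd.concat(
    [group.sample(n=target_size, replace=True, random_state=0)
     for _, group in train.groupby("Stance")],
    ignore_index=True,
)

overlap = set(train_balanced["Tweet"]) & set(test["Tweet"])
print(len(overlap))
```

With this ordering the test set keeps the original class distribution, which may or may not be what the evaluation calls for.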

In [19]:
y_train.where(y_train == 'AGAINST').count()

211

In [20]:
y_train.where(y_train == 'FAVOR').count()

207

In [21]:
y_train.where(y_train == 'NONE').count()

209

In [22]:
y_test.where(y_test == 'AGAINST').count()

88

In [23]:
y_test.where(y_test == 'FAVOR').count()

92

In [24]:
y_test.where(y_test == 'NONE').count()

90

In [25]:
X_train.to_pickle("./X_train.pkl")
X_test.to_pickle("./X_test.pkl")
y_train.to_pickle("./y_train.pkl")
y_test.to_pickle("./y_test.pkl")
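
A later notebook can reload these splits with `pd.read_pickle`. The round trip below is a minimal sketch on a toy Series written to a temporary directory, not the project's actual files.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for X_train (made up).
toy_series = pd.Series(["tweet one", "tweet two"], name="Tweet")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X_train.pkl")
    toy_series.to_pickle(path)        # same call as in the cell above
    restored = pd.read_pickle(path)   # returns the object that was pickled

print(restored.equals(toy_series))
```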