## DTM: Twitter Corpus Reader, Preprocessor, & Sample Analysis

### By. Alexander Bogdanowicz

#### This notebook walks through the newly implemented modules and their interfaces:
    1. NewTwitterReader
    2. NewTwitterPreprocessor
    3. NewTwitterCorpusLoader
    4. NewTwitterCorpusView
    5. NewTwitterCorpusTransformer
    
#### These Modules can be found here: https://github.com/akbog/urban-data
#### This notebook is also designed to run through the following steps:
    1. Establishing a local version of the github repository "urban-data" (Part 0)
    2. Structuring the file system in a Baleen-style layout so that the CorpusReaders can read it easily
    3. Running a sample LDA Model
    
#### Please see Part 0 below for directions on how to initialize a local GitHub repository

-------------------------------------------------------------------------------------------------------------------

## Part 0. Before you get started: Initializing Your Local Github Repository

##### Before attempting this, please make sure you have a valid Github account and have git installed on your machine. 

##### If not, see (Installing Git on Linux, Mac, Windows) https://gist.github.com/derhuerst/1b15ff4652a867391f03

##### To clone the github repository, navigate to the folder you would like to clone the repository into and type the following after the ">":

    > git clone https://github.com/akbog/urban-data.git

##### You should find that the files from the github repository are now in your local file system.

In [1]:
# Importing necessary modules
from Modules.NewTwitterReader import GzipStreamBackedCorpusView, NewTwitterCorpusReader, NewTwitterPickledCorpusReader
from Modules.NewTwitterPreprocessor import Preprocessor
from Modules.NewPickleCorpusView import PickleCorpusView
from Modules.NewTwitterCorpusLoader import NewTwitterCorpusLoader
from Modules.NewTwitterTransformer import TextNormalizer, GensimTfidfVectorizer, GensimTopicModels
from sklearn.pipeline import Pipeline
from gensim.sklearn_api import lsimodel, ldamodel  # requires gensim < 4.0 (gensim.sklearn_api was removed in 4.0)
import re

-------------------------------------------------------------------------------------------------------------------

## Part 1. Preprocessing Stage

##### The NewTwitterCorpusReader assumes that collections of tweets are stored in .gz files to conserve disk space (APIs often export tweets in this format).
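
##### As a rough illustration of this storage format, the sketch below streams tweets from a gzipped file using only the standard library. It assumes one JSON object per line, which the notebook does not confirm; `stream_tweets` and the sample file are hypothetical and are not part of the urban-data modules.

```python
import gzip
import json
import os
import tempfile


def stream_tweets(path):
    """Yield one tweet (a dict) per non-empty line of a gzipped JSON-lines file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny hypothetical sample file so the sketch runs on its own.
sample_tweets = [
    {"id": 1, "text": "Delays on the L train again"},
    {"id": 2, "text": "Coffee in Manhattan"},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(sample_path, mode="wt", encoding="utf-8") as handle:
    for tweet in sample_tweets:
        handle.write(json.dumps(tweet) + "\n")

# Reading the file back decompresses it lazily, one line at a time.
loaded = list(stream_tweets(sample_path))
print(len(loaded), "tweets read")
```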

##### We store our data *outside* the urban-data local github repository, as the data tends to be large and is not suited for github's code repository. In this example, our data will be stored in the following hierarchical structure:

&nbsp;&nbsp;&nbsp;&nbsp; -> urban-data </br>
&nbsp;&nbsp;&nbsp;&nbsp; -> Twitter-Data </br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -> Category1 (2019-11-07) </br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -> Raw Gzipped JSON File (Manhattan-2019-11-07-000.json.gz) </br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -> Category2 </br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -> Category3 </br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -> Category4 </br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -> Raw Gzipped JSON File </br>
&nbsp;&nbsp;&nbsp;&nbsp; -> Twitter-Data-Pkl </br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -> Category1 (2019-01-07) </br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -> Raw Gzipped JSON File (Manhattan-2019-11-07-000.pickle) </br>


##### For the purposes of this notebook, our categories are divided into separate dates, each date containing one JSON Gzipped File.

##### We also see a Twitter-Data-Pkl folder, which will contain the preprocessed (tokenized) tweets, pickled in a Python-readable format (a list of tweet dictionaries).
    

In [None]:
# Regex strings designed to match the general form of our files.
# Category folders are dated YYYY-MM-DD, so the month field is [0-1][0-9] and the day field is [0-3][0-9].
DOC_PATTERN = r'[0-2][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9]/Manhattan.*\.json\.gz$' # Document regex
root = r"../Twitter-Data" # Root data directory
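
##### To check what a document pattern of this kind selects, the sketch below tests a few made-up relative paths. It assumes YYYY-MM-DD category folders; `PATTERN` and the `candidates` list are hypothetical and only illustrate the matching logic, not how the reader applies it internally.

```python
import re

# Hypothetical pattern: a YYYY-MM-DD folder, then a Manhattan*.json.gz file.
PATTERN = re.compile(r'\d{4}-[01]\d-[0-3]\d/Manhattan.*\.json\.gz$')

candidates = [
    "2019-11-07/Manhattan-2019-11-07-000.json.gz",  # expected to match
    "2019-11-25/Manhattan-2019-11-25-000.json.gz",  # expected to match
    "2019-11-07/Brooklyn-2019-11-07-000.json.gz",   # wrong file prefix
    "2019-11-07/Manhattan-2019-11-07-000.pickle",   # wrong extension
]

# re.match anchors at the start of the string; the trailing $ anchors the end.
matched = [path for path in candidates if PATTERN.match(path)]
for path in matched:
    print("matched:", path)
```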

In [None]:
#Instantiating our Corpus Reader Object with a root directory
pre_corpus = NewTwitterCorpusReader(root = root)
print("Sample Tweet Structure")
print(pre_corpus.docs()[0])

#### The below stage may take some time as the Preprocessor tokenizes each tweet and writes them to the target directory in pickle format

##### Note. The pickled files will be larger than the gzipped originals: pickle serializes Python objects without compressing them, but the result loads directly back into Python objects.
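
##### The size difference can be seen on synthetic data. The sketch below compares gzipped JSON with a plain pickle of the same list of dictionaries; the tweets are invented and far smaller than real tweet objects, so the exact numbers are not representative.

```python
import gzip
import json
import pickle

# Hypothetical tweets; real tweet objects carry many more fields.
tweets = [
    {"id": i, "text": f"tweet number {i} about the subway in Manhattan"}
    for i in range(1000)
]

# Gzipped JSON lines, similar in spirit to the raw input files.
raw_json = "\n".join(json.dumps(t) for t in tweets).encode("utf-8")
gzipped_size = len(gzip.compress(raw_json))

# Pickle of the same list of dictionaries, with no compression applied.
pickled = pickle.dumps(tweets)
pickled_size = len(pickled)

print("gzipped JSON bytes:", gzipped_size)
print("pickled bytes:     ", pickled_size)

# Unpickling returns the original Python objects with no parsing step.
restored = pickle.loads(pickled)
```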

In [None]:
# Initializing the Preprocessor
target = r"../Twitter-Data-Pkl" # Target folder for the preprocessed corpus
preprocess = Preprocessor(corpus = pre_corpus, target = target)
CAT_PATTERN = r'[0-2][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9]' # Category regex (YYYY-MM-DD)

# Calling transform, which tokenizes the dataset and pickles the result to the target directory
docs = preprocess.transform(categories = CAT_PATTERN)
print("Done: ", len(list(docs))) # docs is a generator, so it must be consumed for the work to run
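
##### The reason the generator has to be consumed can be shown with a stand-in. In the sketch below, `lazy_transform` is a hypothetical function, not the Preprocessor's actual implementation; it only demonstrates that a generator performs no work until something iterates over it.

```python
processed = []  # records side effects, standing in for files written to disk


def lazy_transform(items):
    """Hypothetical stand-in for a transform that yields results one at a time."""
    for item in items:
        processed.append(item)  # the side effect happens only during iteration
        yield item.upper()


results = lazy_transform(["a.json.gz", "b.json.gz"])
count_before = len(processed)  # calling the function has not run its body yet

consumed = list(results)       # iterating drives the work
count_after = len(processed)

print("processed before consuming:", count_before)
print("processed after consuming: ", count_after)
```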

-------------------------------------------------------------------------------------------------------------------

## Part 2. Reading Pickled Corpus & Initial Test Analysis

##### In this section, we will run a simple analysis that demonstrates how well we have been able to abstract each step of the modeling process. We will run a simple Gensim-based Latent Dirichlet Allocation model on our tweets to try to extract key topics.

##### As this can take some time, we will only perform the analysis for the day of November 7th, 2019

In [None]:
#Specifying root directory of Pickled Tweets
pkl_root = r"../Twitter-Data-Pkl"
corpus = NewTwitterPickledCorpusReader(pkl_root) #Initializing Pickled Corpus Reader

##### The following steps occur when instantiating and fitting a GensimTopicModels Object
    1. Text Normalization (consists of stopword removal and lemmatization)
    2. Text Vectorization (in this sample, a TF-IDF matrix that weights each term's frequency within a tweet against its frequency across the corpus)
    3. Text Model (Gensim LDA Model generating 10 Topics)
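
##### The first two steps can be sketched in plain Python. The example below covers only stopword removal and a simplified TF-IDF weighting (tf * log(N / df)); lemmatization and the LDA model are omitted, the stopword list is a tiny hypothetical one, and Gensim's own weighting and normalization differ in detail.

```python
import math
from collections import Counter

STOPWORDS = {"the", "on", "in", "a", "is", "at"}  # tiny hypothetical list


def normalize(tokens):
    """Lowercase tokens and drop stopwords (lemmatization is omitted here)."""
    return [tok.lower() for tok in tokens if tok.lower() not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


tweets = [
    ["Delays", "on", "the", "subway"],
    ["Coffee", "in", "the", "park"],
    ["Subway", "is", "packed"],
]
normalized = [normalize(t) for t in tweets]
vectors = tfidf(normalized)

# "delays" appears in one tweet and "subway" in two, so "delays" weighs more.
print(vectors[0])
```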

In [None]:
CAT_PATTERN = r'2019-11-07' # Category regex restricted to November 7th, 2019
# docs object used to fit the model (the tweets in the specified category)
docs = [
    tweet for tweet in corpus.tweets(categories = CAT_PATTERN)
]
gensim_lda = GensimTopicModels(n_topics = 10)

In [None]:
#Fitting the model (This may take some time)
gensim_lda.fit(docs)

-------------------------------------------------------------------------------------------------------------------

## Part 3. Sample Visualization with pyLDAvis

#### pyLDAvis is a library for interactive visualization of the output of an LDA topic model

##### Note. This code is taken directly from Applied Text Analysis with Python by Bengfort, Bilbro & Ojeda

In [None]:
import pyLDAvis
import pyLDAvis.gensim  # renamed to pyLDAvis.gensim_models in pyLDAvis 3.x
import numpy as np

In [None]:
# Extracting the fitted LDA model from the pipeline
lda = gensim_lda.model.named_steps['model'].gensim_model
# Extracting the corpus vectors (bag-of-words form of each normalized tweet).
# Named bow_corpus so that it does not overwrite the corpus reader defined earlier.
bow_corpus = [
    gensim_lda.model.named_steps['vect'].lexicon.doc2bow(doc)
    for doc in gensim_lda.model.named_steps['norm'].transform(docs)
]
# Extracting the corresponding lexicon
lexicon = gensim_lda.model.named_steps['vect'].lexicon

# Creating formatted data for pyLDAvis
data = pyLDAvis.gensim.prepare(lda, bow_corpus, lexicon)
# Correction: the topic coordinates can come back as complex numbers; keep only the real part
data[0]["x"] = np.real(data[0]["x"])
data[0]["y"] = np.real(data[0]["y"])

In [None]:
pyLDAvis.display(data)