# Chunking Gab

This notebook demonstrates one thing: that you can chain a series of simple actions together to explore different dimensions of a data set or to carry out an analysis. None of the code here is complicated or advanced.

In [1]:
# All the imports upfront
import re
import nltk
import pandas as pd

# # And then the MPL import and settings
# import matplotlib.pyplot as plt
# # Set plt parameters
# plt.rcParams['figure.dpi'] = 300
# plt.rcParams["figure.figsize"] = (10,5)

Since I never saved my cleaned-up texts last time, I need to re-create those steps here. Before I am done, however, I will make sure to save the results to a file!

In [2]:
# Load the data from the file,
# skipping any line that starts with three dashes
with open("../queue/gab-chatlogs.txt") as f:
    chatlog = [line for line in f if not line.startswith('---')]

# Delete first four lines
# (Discovered last time)
del chatlog[0:4]

# Join our list of strings into one big string again
chats = " ".join(chatlog)

# And then split the joined string at date-time
# (raw string, so that \d is not treated as a string escape)
re_datetime = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")
splits = re.split(re_datetime, chats)

# This is the regex pattern we are going to use
# to identify the user. It reads as follows:
# ^     - start of line
#       - space (it's there)
# (.*?) - capture group
# :     - colon
re_user = re.compile("^ (.*?):")

# For loop to remove the user from the start of the line:
texts = []
for i in splits:
    text = re.sub(re_user, "", i, count=1)
    texts.append(text)

# Check our results
print(f"You've got {len(texts)} gab chats!")
print("Here's a sample:\n")
for i in texts[100:102]:
    print(i)

You've got 70596 gab chats!
Here's a sample:

 Thank you. I’m just a very lonely person. I had someone, kind of, or at least I thought I did (maybe I never had her; I don’t know), but she doesn’t want me anymore. I’ve been trying to find someone for a long time now. I’m the real deal. I like sex, but I find non-procreative sex pointless. My theory is that men were given the urge so that they’d procreate. So I understand we have that urge and need to satiate it, lest we go mad, but I find it totally pointless if it’s not for making babies. I’m not big on banging anything that moves. I used to go to pubs and clubs and did some of that, but never as much as I could’ve if I’d really wanted to. Those encounters are short-lived and leave you feeling very alone and used the next day when she’s left in a cab. I’m not about that and never was. I did it, but only because I couldn’t find a woman to marry me and give me kids. Now, every girl I meet who says she wants that is more interested in her
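
To make the split-then-strip steps easier to follow, here is a minimal sketch on a two-line toy log. The line format (timestamp, space, user name, colon, message) is an assumption inferred from the regexes above, and the names `sample`, `pieces`, and `messages` are made up for the illustration.

```python
import re

# Toy stand-in for the joined chat log (format is assumed, not taken from the real file)
sample = (
    "2021-01-01T10:00:00 alice: hello there\n"
    " 2021-01-01T10:00:05 bob: hi: how are you?\n"
)

re_datetime = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")
re_user = re.compile("^ (.*?):")

# Splitting at each timestamp leaves an empty first piece,
# because the string starts with a timestamp
pieces = re_datetime.split(sample)

# Remove the leading " user:" from each piece, then tidy whitespace
stripped = [re_user.sub("", piece, count=1).strip() for piece in pieces]

# Drop pieces that are empty after stripping
messages = [text for text in stripped if text]
print(messages)
```

Two things are worth noticing: the non-greedy `(.*?)` stops at the first colon, so a colon inside the message survives, and the empty first piece would otherwise be counted as a chat.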

In [3]:
# I'd like to save these gab chats to a text file using writelines,
# so I want to remove all newline markers within the texts
gabs = [re.sub("\n", " ", text) for text in texts]

# Annnnd then we re-introduce newline markers 
# to separate the gab chats
# with open("../queue/gabs.txt", "w") as f:
#     f.writelines([f"{gab}\n" for gab in gabs])

<div class="alert alert-block alert-info">
<b>Tip:</b> Once I have written a file from a notebook, I usually comment out the code that wrote it so I don't overwrite the file by accident. If I expect to come back to the notebook later, I also add the code to load that file back into memory just below.
</div>
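
An alternative to commenting the write out is to let the code refuse to overwrite an existing file. The sketch below is one way to do that with `pathlib`; the helper names `save_lines_once` and `load_lines` are hypothetical, and a temporary directory stands in for `../queue/`.

```python
from pathlib import Path
import tempfile


def save_lines_once(path, lines):
    """Write one line per item unless the file already exists.

    Returns True if the file was written, False if it was left alone.
    """
    path = Path(path)
    if path.exists():
        return False
    with path.open("w", encoding="utf-8") as f:
        f.writelines(f"{line}\n" for line in lines)
    return True


def load_lines(path):
    """Read the file back, dropping the trailing newline from each line."""
    with Path(path).open(encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# Demo in a throwaway directory (hypothetical file name)
demo_path = Path(tempfile.mkdtemp()) / "gabs-demo.txt"
first = save_lines_once(demo_path, ["one gab", "another gab"])
second = save_lines_once(demo_path, ["this would overwrite"])
loaded = load_lines(demo_path)
print(first, second, loaded)
```

With a guard like this the cell can be re-run safely, at the cost of having to delete the file by hand when you do want to regenerate it.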

In [4]:
with open("../queue/gabs.txt", "r") as f:
    gabs = f.readlines()

In [5]:
# Now let's tokenize our gabs
tokenized = []
for i in gabs:
    tokens = nltk.tokenize.word_tokenize(i)
    tokenized.append(tokens)

# List comprehensions are for loops
# Written somewhat differently
tagged = [nltk.pos_tag(i) for i in tokenized]
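
The equivalence between the two forms can be checked on a tiny example. In this sketch plain `str.split` stands in for the NLTK tokenizer so that it runs without any downloads; the two sample strings are taken from the output further down.

```python
sentences = ["its happening", "Have a fantastic day"]

# For-loop version: build the list one append at a time
tokens_loop = []
for sentence in sentences:
    tokens_loop.append(sentence.split())

# List-comprehension version: same result in one expression
tokens_comp = [sentence.split() for sentence in sentences]

print(tokens_loop == tokens_comp)
```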

In [6]:
tagged[14:18]

[[('I', 'PRP'),
  ('got', 'VBD'),
  ('DM', 'NNP'),
  ('’', 'NNP'),
  ('s', 'VBD'),
  ('back', 'RP'),
  ('😈😂', 'NN')],
 [('its', 'PRP$'), ('happening', 'VBG')],
 [('Squeeeeeeeeeeeee', 'NNP'), ('😆', 'NN')],
 [('Hello', 'NNP'),
  (',', ','),
  ('Beautiful', 'NNP'),
  ('Lady', 'NNP'),
  ('.', '.'),
  ('Have', 'VBP'),
  ('a', 'DT'),
  ('fantastic', 'JJ'),
  ('day', 'NN'),
  ('☺️', 'VB')]]

In [7]:
# One membership test does the work of several if conditions.
# (A chained `or` only works when every comparison is written out
# in full, which is a likely source of unexpected results.)
verb_tags = {"VB", "VBD", "VBP", "VBZ"}

verbs = []
for item in tagged:
    for word, tag in item:
        if tag in verb_tags:
            verbs.append(word)
print(f"Extracted {len(verbs)} verbs!")
print(f"Here are 20 of them: {verbs[0:20]}")

Extracted 199161 verbs!
Here are 20 of them: ['are', 'is', 'am', 'want', 'ask', 'please', 'dont', 'mind', 'worked', 'see', "'s", 'think', 'refreshed', 'wrote', 'see', 'opened', 'saw', 'expire', 'said', 'expired']
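
A common reason an `or` condition misbehaves in a filter like this is writing `tag == "VB" or "VBD"`: the second operand is a non-empty string, which Python treats as true, so every word passes. Whether that is what happened here is a guess; the sketch below shows the pitfall and a membership test that avoids it, using a few tagged tokens from the output above.

```python
tagged_demo = [
    [("I", "PRP"), ("got", "VBD"), ("back", "RP")],
    [("its", "PRP$"), ("happening", "VBG")],
]

# Pitfall: parsed as (tag == "VB") or ("VBD"), and "VBD" is always truthy
buggy = [word for sent in tagged_demo for word, tag in sent
         if tag == "VB" or "VBD"]

# Membership test: true only when the tag is one of the listed values
verb_tags = {"VB", "VBD", "VBP", "VBZ"}
fixed = [word for sent in tagged_demo for word, tag in sent
         if tag in verb_tags]

print(buggy)
print(fixed)
```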


In [8]:
from collections import Counter
tallies = Counter(verbs)

In [9]:
tallies.most_common(20)

[('is', 12502),
 ('have', 8212),
 ('are', 7822),
 ('span', 7456),
 ('<', 6304),
 ('be', 6222),
 ('do', 5968),
 ('was', 5757),
 ("'s", 4975),
 ('’', 4343),
 ('know', 4031),
 ("'m", 3827),
 ('get', 3387),
 ('am', 3112),
 ('see', 2785),
 ('think', 2205),
 ('did', 2056),
 ('has', 2009),
 ('had', 1798),
 ('want', 1797)]

Counters are useful but limited. By this point we pretty much know what the most common verbs will be: the usual suspects. What we want to explore is the "goldilocks zone": verbs that are neither so common that they tell us nothing nor so rare that they are just noise.
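
One way to look at a middle band of the counts directly is to filter the tallies by a range. This is only a sketch: the helper `in_band` and the thresholds 380 and 450 are arbitrary choices for illustration, and `demo_tallies` holds a handful of the counts shown in this notebook rather than the full Counter.

```python
from collections import Counter

# A few counts copied from the outputs in this notebook
demo_tallies = Counter({
    "is": 12502, "have": 8212, "please": 452,
    "put": 439, "remember": 391, "ask": 370,
})


def in_band(tallies, low, high):
    """Return (word, count) pairs with low <= count <= high, most common first."""
    return [(word, count) for word, count in tallies.most_common()
            if low <= count <= high]


middle = in_band(demo_tallies, low=380, high=450)
print(middle)
```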

I often find it useful to save results to a CSV file, which I can then quickly scroll through.

In [10]:
df = pd.DataFrame.from_records(list(tallies.items()),
                               columns=['verb', 'count'])
df.head()

   verb  count
0   are   7822
1    is  12502
2    am   3112
3  want   1797
4   ask    370
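
The cell above builds the table but does not show the save itself. A minimal sketch of that step follows, using the standard-library `csv` module so that it runs without pandas; with the DataFrame, `df.to_csv(path, index=False)` should do the same job. The small `demo_tallies` Counter and the in-memory buffer are stand-ins for the real tallies and a real file.

```python
import csv
import io
from collections import Counter

# A few counts copied from the outputs in this notebook
demo_tallies = Counter({"is": 12502, "have": 8212, "remember": 391})

# In-memory stand-in for open("some-file.csv", "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["verb", "count"])                # header row
writer.writerows(demo_tallies.most_common())      # one row per verb, sorted by count

csv_text = buffer.getvalue()
print(csv_text)
```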


In [11]:
# First we sort the dataframe by count
df.sort_values(by=['count'], ascending=False, inplace=True)

In [12]:
# Then we look a little further down the table
df[60:80]

         verb  count
5      please    452
53        put    439
102   thought    432
311      read    432
21       Have    419
99      check    414
219     seems    398
131      sent    392
112  remember    391
305    change    387


In [13]:
# Code to find nouns associated with the verb "remember"
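
That cell is still a stub. One possible shape for it, offered as a sketch rather than the intended implementation: walk the tagged sentences, and whenever the target verb appears, count the nouns in the few tokens that follow it. The function name `nouns_near_verb`, the window of four tokens, and the three hand-tagged demo sentences are all made up for the illustration; in the notebook the input would be `tagged`. Note that it matches the exact word form only, so "remembered" would need its own call or a lemmatizer.

```python
from collections import Counter

# Penn Treebank noun tags, as produced by nltk.pos_tag
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def nouns_near_verb(tagged_sents, verb, window=4):
    """Count nouns within `window` tokens after each verb use of `verb`."""
    counts = Counter()
    for sent in tagged_sents:
        for idx, (word, tag) in enumerate(sent):
            if word.lower() == verb and tag.startswith("VB"):
                following = sent[idx + 1: idx + 1 + window]
                counts.update(w.lower() for w, t in following if t in NOUN_TAGS)
    return counts


# Hand-tagged demo sentences (invented, not from the chat log)
demo = [
    [("I", "PRP"), ("remember", "VBP"), ("the", "DT"),
     ("old", "JJ"), ("house", "NN"), (".", ".")],
    [("Do", "VBP"), ("you", "PRP"), ("remember", "VB"),
     ("that", "DT"), ("song", "NN"), ("?", ".")],
    [("We", "PRP"), ("sold", "VBD"), ("the", "DT"),
     ("house", "NN"), (".", ".")],
]

result = nouns_near_verb(demo, "remember")
print(result.most_common())
```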