# Fess Upped

An analysis of our confessions using Python.

First create a virtual environment and install requirements as per the Readme file:

```bash
python3 -m venv venv

source venv/bin/activate

pip install --upgrade pip

pip install -r requirements.txt
```

## Setup Access to Twitter

If you don't have a Twitter account, create one.

Register for developer access by following this guide: https://www.earthdatascience.org/courses/use-data-open-source-python/intro-to-apis/twitter-data-in-python/

Store the API Key, the API Key Secret and the Bearer token on three consecutive lines of a ```secrets.txt``` file, e.g.:
```
[Paste API KEY]
[Paste API Key Secret]
[Paste Bearer Token]
```

Then, as per the guide, go into your Twitter "app" setup and generate an Access Token and Secret. Add these as lines 4 and 5 of the secrets.txt file:
```
[Paste API KEY]
[Paste API Key Secret]
[Paste Bearer Token]
[Paste Access Token]
[Paste Access Token Secret]
```

In [8]:
with open("secrets.txt") as file:
    lines = file.readlines()
    lines = [line.rstrip() for line in lines]

consumer_key = lines[0]
consumer_secret = lines[1]
bearer_token = lines[2]
access_token = lines[3]
access_token_secret = lines[4]
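
A slightly more defensive way to read the file could look like the sketch below. The `load_secrets` helper and the `FIELDS` names are hypothetical (they are not part of the notebook), and the code assumes the five-line layout described above.

```python
from pathlib import Path

# Hypothetical field names, in the order the values appear in secrets.txt.
FIELDS = (
    "consumer_key",
    "consumer_secret",
    "bearer_token",
    "access_token",
    "access_token_secret",
)

def load_secrets(path="secrets.txt"):
    """Return the five credentials as a dict, failing early if any is missing."""
    lines = [line.strip() for line in Path(path).read_text().splitlines()]
    values = [line for line in lines if line]  # ignore blank lines
    if len(values) < len(FIELDS):
        raise ValueError(
            f"{path} has {len(values)} non-empty lines, expected {len(FIELDS)}"
        )
    return dict(zip(FIELDS, values))
```

Failing early here gives a clearer error than an `IndexError` when a line is missing from the file.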

## Libraries and Code

We'll use the tweepy library to access Twitter and pandas to save the data.

In [9]:
# Import Libraries
import os
import tweepy as tw
import pandas as pd

In [10]:
# Setup programmatic API access
client = tw.Client(bearer_token)
auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)

In [15]:
# Lookup the Twitter ID for fesshole
client.get_users(usernames=["fesshole"])

Response(data=[<User id=1007749631818821638 name=Fesshole 🧻 username=fesshole>], includes={}, errors=[], meta={})

In [17]:
user_id = 1007749631818821638

In [18]:
tweets = client.get_users_tweets(id=user_id, tweet_fields=['context_annotations','created_at','geo'])

In [19]:
for tweet in tweets:
    print(tweet)

[<Tweet id=1548599650646728705 text='I saw half of an old episode of Thomas the Tank Engine recently, and someone snogged one of the engines. The idea that Percy the Green Engine has had more action than me over the past 12 months has probably sent me into a depressive spiral.'>, <Tweet id=1548584547834527744 text='I will instigate sex in the morning with my wife, knowing that she will rebuff my advances and instantly get up and make me coffee in bed'>, <Tweet id=1548569448721879040 text="I'm a 38 year old man and when I arrange the soft toys on my daughters bed while making it I'll position them in a way that they would be able to interact in case there is any consciousness in there.">, <Tweet id=1548433556682133504 text='I fell in love with a woman created by the AI account "This Person Does Not Exist" and it kills me that we can never be together.'>, <Tweet id=1548418456030105602 text="Refunding kindle books can cost authors so I never do it but I've been buying &amp; returning all 

In [20]:
tweets

Response(data=[<Tweet id=1548599650646728705 text='I saw half of an old episode of Thomas the Tank Engine recently, and someone snogged one of the engines. The idea that Percy the Green Engine has had more action than me over the past 12 months has probably sent me into a depressive spiral.'>, <Tweet id=1548584547834527744 text='I will instigate sex in the morning with my wife, knowing that she will rebuff my advances and instantly get up and make me coffee in bed'>, <Tweet id=1548569448721879040 text="I'm a 38 year old man and when I arrange the soft toys on my daughters bed while making it I'll position them in a way that they would be able to interact in case there is any consciousness in there.">, <Tweet id=1548433556682133504 text='I fell in love with a woman created by the AI account "This Person Does Not Exist" and it kills me that we can never be together.'>, <Tweet id=1548418456030105602 text="Refunding kindle books can cost authors so I never do it but I've been buying &amp; 

In [27]:
tweets.meta['next_token']

'7140dibdnow9c7btw4228myixwc23vdbddh42h184wo51'

### API Method

We need to use this method - https://docs.tweepy.org/en/stable/client.html#tweepy.Client.get_users_tweets

### Twitter Limitations on API

The "Reverse chronological home timeline" has limitations of "a per-user rate limit of 180 requests per 15-minute window and returns 800 of the most recent Tweets".

The user Tweet timeline endpoint, by contrast, can return the 3,200 most recent Tweets.

The user Tweet timeline endpoint is designed to support two common usage patterns: 

"Get a user’s historical Tweets": Requests made to user Tweet timeline in order to receive Tweets authored by the user of interest in chronological order over a specific recent timeframe. The timeframe can be set using the start_time and end_time and paginating through the full results.  In some cases, a user’s entire history of Tweets can be retrieved if the user has only authored up to 3,200 Tweets in their account. Tweets included will depend on the public availability and the authentication that is used for the requests.

"Polling for new Tweets": Requests made to user Tweet timeline on a continual basis, to retrieve new Tweets authored by a specific user. The last Tweet ID received can be set as a parameter for any new requests since the last Tweet.

We need to use pagination to cycle through 32 requests of 100 tweets each to get the 3,200 most recent tweets.

We pass a ```pagination_token``` which is returned from the previous request.
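
The token-passing pattern can be sketched offline as follows. This is only an illustration: `fetch_all_pages` and `FAKE_PAGES` are hypothetical stand-ins, and the dictionaries merely imitate the `data` and `meta` parts of a response.

```python
def fetch_all_pages(fetch_page, max_pages=32):
    """Call fetch_page(token) repeatedly, following next_token until it runs out."""
    pages = []
    token = None  # no token on the first request
    for _ in range(max_pages):
        page = fetch_page(token)
        pages.append(page)
        token = page["meta"].get("next_token")
        if token is None:  # the last page carries no next_token
            break
    return pages

# Stand-in for the API: three pages keyed by the token that requests them.
FAKE_PAGES = {
    None: {"data": ["t1", "t2"], "meta": {"next_token": "abc"}},
    "abc": {"data": ["t3", "t4"], "meta": {"next_token": "def"}},
    "def": {"data": ["t5"], "meta": {}},
}

pages = fetch_all_pages(lambda token: FAKE_PAGES[token])
all_tweets = [tweet for page in pages for tweet in page["data"]]
print(len(pages), all_tweets)
```

With tweepy, `fetch_page` would wrap `client.get_users_tweets` and read the token from `response.meta`. tweepy also ships a `Paginator` helper that follows these tokens for you.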

In [28]:
tweets2 = client.get_users_tweets(id=user_id, max_results=10, pagination_token=tweets.meta['next_token'])

In [29]:
tweets2

Response(data=[<Tweet id=1548342958591619074 text='I babysat for an awful family with two kids. The kids had a pet rabbit they abused horribly. One day I volunteered to take him to the vet, told the vet to write on his card that he was euthanized and dead.. I took the bunny home. He turns 8 this summer.'>, <Tweet id=1548327859571175424 text="My wife loves Bees. We found a dying one and gave it sugary water. Woke up the next day and it was gone. She was so happy. I don't have the heart to tell her it drowned so I chucked it next door.">, <Tweet id=1548312759518892036 text="I suffered a severe double Inguinal Hernia 20 years ago lifting a heavy bag of compost. My employer kindly paid for the very painful operation as an emergency following doctor's advice. I lied. My wife &amp; I had spent an entire Easter weekend shagging, off our faces on ecstasy.">, <Tweet id=1548297658632007685 text="When I was a kid, my parents were trying to think of a creative punishment and asked if taking the ba

In [31]:
import time

responses = list()
pagination_token = None
# Iterate over up to 32 pages of 100 tweets
for i in range(0, 32):
    if not pagination_token:
        response = client.get_users_tweets(id=user_id, max_results=100)
    else:
        response = client.get_users_tweets(id=user_id, max_results=100, pagination_token=pagination_token)
    responses.append(response)
    # The last page has no next_token, so stop there instead of raising a KeyError
    pagination_token = response.meta.get('next_token')
    if not pagination_token:
        break
    # Throttle a little to be kind
    time.sleep(0.5)

In [32]:
len(responses)

32

In [34]:
responses[0].data[0], responses[1].data[0]

(<Tweet id=1548614747280576513 text='I forgot to scan a bottle of wine at Tesco Express and realised when I got home. Walked back and told them so I could pay for it. I am an utter loser.'>,
 <Tweet id=1546440421299163138 text='My southern husband is very sniffy about "northern" water and insists on buying posh bottled water from Waitrose. Little does he know this is what gets poured into the dogs water bowl and his bottle gets filled up with dirty northern muck.'>)

There seem to be about 16 tweets per day, so 3,200 tweets cover about 200 days. Fetching twice a year would therefore be enough to keep the data up to date.
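
As a rough cross-check of the tweets-per-day estimate, the two tweet IDs printed above can be turned into timestamps. This sketch relies on assumptions that are not in the notebook: that tweet IDs are "snowflake" IDs whose bits above the lowest 22 hold milliseconds since a fixed epoch, and that the first page held a full 100 tweets. `snowflake_to_datetime` is a hypothetical helper.

```python
from datetime import datetime, timezone

# Assumption: Twitter IDs are "snowflakes" whose bits above the lowest 22
# hold milliseconds since this custom epoch (4 November 2010).
TWITTER_EPOCH_MS = 1288834974657

def snowflake_to_datetime(tweet_id):
    """Decode the creation time embedded in a tweet ID."""
    ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# First tweet of page 0 and first tweet of page 1 (assumed 100 tweets apart).
newer = snowflake_to_datetime(1548614747280576513)
older = snowflake_to_datetime(1546440421299163138)

days_per_page = (newer - older).total_seconds() / 86400
tweets_per_day = 100 / days_per_page
print(f"{days_per_page:.1f} days per 100 tweets, {tweets_per_day:.1f} tweets/day")
print(f"3200 tweets cover about {3200 / tweets_per_day:.0f} days")
```

Requesting `created_at` through `tweet_fields`, as in the earlier cell, would give the same timestamps without decoding IDs.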

In [36]:
import pickle

def SaveLists(data, filename):
    # The context manager closes the file even if pickling fails
    with open(filename, "wb") as open_file:
        pickle.dump(data, open_file)

def LoadLists(file):
    with open(file, "rb") as open_file:
        return pickle.load(open_file)

SaveLists(responses, "2022-07-17 - Saved Data")

In [38]:
# This tweet has recap photo data
tweet_id = 1543983465368018947

In [40]:
r = client.get_tweet(id=tweet_id); r 

Response(data=<Tweet id=1543983465368018947 text='Best public confessions for April 2022, what a great month that was. Do keep sending stories in and keeping us laughing. Thank you everyone, and remember to encourage your friends to FOLLOW @FESSHOLE NOW, so we can keep it all going with your most extraordinary confessions. https://t.co/iEQaPLFz5N'>, includes={}, errors=[], meta={})

In [44]:
"https://" in r.data.text, "@" in r.data.text

(True, True)

### Post Processing

We need to exclude any tweets that contain a link, a Twitter handle or attached media.
* Links: we can just look for "https://" in the tweet text. This also filters out "recap" posts with media, as these contain an https link (see above).
* Handles: look for "@" in the tweet text.
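
The two checks can be wrapped in a small predicate, which makes them easy to try on a few strings first. This is a minimal sketch: `is_plain_confession` is a hypothetical name and the sample texts are abbreviated versions of tweets shown earlier.

```python
def is_plain_confession(text):
    """True if the tweet text has no https link and no @ mention."""
    return "https://" not in text and "@" not in text

# Abbreviated sample texts for illustration only.
samples = [
    "I forgot to scan a bottle of wine at Tesco Express.",
    "Best public confessions for April 2022 https://t.co/iEQaPLFz5N",
    "Remember to FOLLOW @FESSHOLE NOW",
]
kept = [text for text in samples if is_plain_confession(text)]
print(kept)
```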

In [46]:
# Process Tweets
cleaned_data = list()
for response in responses:
    for tweet in response.data:
        if "https://" not in tweet.text and "@" not in tweet.text:
            cleaned_data.append(tweet.text)

In [48]:
cleaned_data[0:10]

['I forgot to scan a bottle of wine at Tesco Express and realised when I got home. Walked back and told them so I could pay for it. I am an utter loser.',
 'I saw half of an old episode of Thomas the Tank Engine recently, and someone snogged one of the engines. The idea that Percy the Green Engine has had more action than me over the past 12 months has probably sent me into a depressive spiral.',
 'I will instigate sex in the morning with my wife, knowing that she will rebuff my advances and instantly get up and make me coffee in bed',
 "I'm a 38 year old man and when I arrange the soft toys on my daughters bed while making it I'll position them in a way that they would be able to interact in case there is any consciousness in there.",
 'I fell in love with a woman created by the AI account "This Person Does Not Exist" and it kills me that we can never be together.',
 "Refunding kindle books can cost authors so I never do it but I've been buying &amp; returning all Tory MPs books I can

In [49]:
# Setup spacy by downloading the small web model, which should be good for tweets
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.0/en_core_web_sm-3.4.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 36.9 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [50]:
# Process Text with Spacy
import spacy

nlp = spacy.load("en_core_web_sm")
docs = list(nlp.pipe(cleaned_data))

### LDA

Gensim provides library functions to perform Latent Dirichlet Allocation (LDA), a topic-modelling technique.

We can replace some of the preprocessing functions used in the Gensim LDA tutorial (linked under Resources) with outputs from spaCy. For example, we can take the ```lemma_``` attribute of each token and lowercase it.

In [51]:
# Lemmatise the docs and take the lowercase representation
lower_lemmas = [[t.lemma_.lower() for t in doc] for doc in docs]

In [52]:
lower_lemmas[10]

['i',
 'babysat',
 'for',
 'an',
 'awful',
 'family',
 'with',
 'two',
 'kid',
 '.',
 'the',
 'kid',
 'have',
 'a',
 'pet',
 'rabbit',
 'they',
 'abuse',
 'horribly',
 '.',
 'one',
 'day',
 'i',
 'volunteer',
 'to',
 'take',
 'he',
 'to',
 'the',
 'vet',
 ',',
 'tell',
 'the',
 'vet',
 'to',
 'write',
 'on',
 'his',
 'card',
 'that',
 'he',
 'be',
 'euthanize',
 'and',
 'dead',
 '..',
 'i',
 'take',
 'the',
 'bunny',
 'home',
 '.',
 'he',
 'turn',
 '8',
 'this',
 'summer',
 '.']

In [55]:
# Spacy has an inbuilt stoplist that is available as a token attribute
for t in docs[10][0:20]:
    print(t, t.is_stop)

I True
babysat False
for True
an True
awful False
family False
with True
two True
kids False
. False
The True
kids False
had True
a True
pet False
rabbit False
they True
abused False
horribly False
. False


In [56]:
# Let's also (as per old NLP) remove stop words, punctuation and numbers.
processed_text = [[t.lemma_.lower() for t in doc if not (t.is_stop or t.is_punct or t.is_digit)] for doc in docs]

In [57]:
processed_text[10]

['babysat',
 'awful',
 'family',
 'kid',
 'kid',
 'pet',
 'rabbit',
 'abuse',
 'horribly',
 'day',
 'volunteer',
 'vet',
 'tell',
 'vet',
 'write',
 'card',
 'euthanize',
 'dead',
 'take',
 'bunny',
 'home',
 'turn',
 'summer']

In [67]:
# Configure Logging for Gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Add bigrams to the doc tokens
from gensim.models import Phrases

# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(processed_text, min_count=20)
for idx in range(len(processed_text)):
    for token in bigram[processed_text[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            processed_text[idx].append(token)

2022-07-17 12:22:29,223 : INFO : collecting all words and their counts
2022-07-17 12:22:29,224 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2022-07-17 12:22:29,268 : INFO : collected 50947 token types (unigram + bigrams) from a corpus of 51200 words and 2935 sentences
2022-07-17 12:22:29,269 : INFO : merged Phrases<50947 vocab, min_count=20, threshold=10.0, max_vocab_size=40000000>
2022-07-17 12:22:29,269 : INFO : Phrases lifecycle event {'msg': 'built Phrases<50947 vocab, min_count=20, threshold=10.0, max_vocab_size=40000000> in 0.05s', 'datetime': '2022-07-17T12:22:29.269866', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.15.0-41-generic-x86_64-with-glibc2.10', 'event': 'created'}
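
To get a feel for what the bigram step produces, here is a deliberately simplified sketch that counts adjacent token pairs and keeps the frequent ones. It is not how Gensim's `Phrases` works internally (`Phrases` also applies a scoring threshold, which is ignored here). `frequent_bigrams` is a hypothetical helper and the documents are toy data.

```python
from collections import Counter

def frequent_bigrams(docs, min_count=2):
    """Count adjacent token pairs and keep those seen at least min_count times."""
    counts = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    return {"_".join(pair): n for pair, n in counts.items() if n >= min_count}

# Toy documents for illustration only.
toy_docs = [
    ["year", "old", "man", "arrange", "soft", "toy"],
    ["year", "old", "kid", "pet", "rabbit"],
    ["wife", "love", "bee"],
]
print(frequent_bigrams(toy_docs))
```

The joined form `year_old` matches the underscore-separated bigram tokens that appear in the topic output.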


In [69]:
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(processed_text)

2022-07-17 12:23:21,838 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2022-07-17 12:23:21,895 : INFO : built Dictionary<7880 unique tokens: ['bottle', 'express', 'forgot', 'get', 'home']...> from 2935 documents (total 51698 corpus positions)
2022-07-17 12:23:21,895 : INFO : Dictionary lifecycle event {'msg': "built Dictionary<7880 unique tokens: ['bottle', 'express', 'forgot', 'get', 'home']...> from 2935 documents (total 51698 corpus positions)", 'datetime': '2022-07-17T12:23:21.895933', 'gensim': '4.2.0', 'python': '3.8.5 (default, Sep  4 2020, 07:30:14) \n[GCC 7.3.0]', 'platform': 'Linux-5.15.0-41-generic-x86_64-with-glibc2.10', 'event': 'created'}


I omitted the token filtering as this reduced the dictionary to only 470 tokens. We can maybe try this as a variation.
```
# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)
print('Number of unique tokens: %d' % len(dictionary))
```
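
The effect of these two thresholds can be illustrated with a plain-Python sketch based on document frequencies. `filter_by_document_frequency` is a hypothetical helper written for illustration and the documents are toy data. It mirrors only the `no_below` and `no_above` rules, not the rest of `filter_extremes`.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below=20, no_above=0.5):
    """Keep tokens found in at least no_below docs and at most no_above of all docs."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Toy documents for illustration only.
toy_docs = [
    ["wife", "coffee", "bed"],
    ["wife", "bee", "water"],
    ["wife", "kid", "rabbit"],
    ["kid", "bed", "toy"],
]
print(sorted(filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)))
```

With roughly 2,900 short documents, most tokens occur in fewer than 20 of them, so a threshold of 20 removes most of the vocabulary.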

In [70]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in processed_text]
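
For readers unfamiliar with the bag-of-words format, the sketch below shows the idea in plain Python: each document becomes a list of (token id, count) pairs. `build_vocab` and `to_bow` are hypothetical helpers, and the id assignment here is an assumption that may differ from the ids Gensim assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tokens in docs:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens not in vocab."""
    counts = Counter(vocab[tok] for tok in tokens if tok in vocab)
    return sorted(counts.items())

# Toy documents for illustration only.
toy_docs = [["vet", "tell", "vet", "bunny"], ["bunny", "home"]]
vocab = build_vocab(toy_docs)
print([to_bow(doc, vocab) for doc in toy_docs])
```

Word order is discarded in this representation, which is why the bigram tokens are appended to each document beforehand.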

In [71]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 10
# Changed this to 4000 to cover all the docs as they are short and fit in memory
chunksize = 4000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index to word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

2022-07-17 12:24:36,631 : INFO : using autotuned alpha, starting with [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
2022-07-17 12:24:36,633 : INFO : using serial LDA version on this node
2022-07-17 12:24:36,638 : INFO : running online (multi-pass) LDA training, 10 topics, 20 passes over the supplied corpus of 2935 documents, updating model once every 2935 documents, evaluating perplexity every 0 documents, iterating 400x with a convergence threshold of 0.001000
2022-07-17 12:24:36,639 : INFO : PROGRESS: pass 0, at document #2935/2935
2022-07-17 12:24:37,934 : INFO : optimized alpha [0.07004665, 0.07151671, 0.07835673, 0.07398079, 0.080541484, 0.07157614, 0.07575834, 0.07633582, 0.083537795, 0.07283665]
2022-07-17 12:24:37,938 : INFO : topic #0 (0.070): 0.009*"go" + 0.009*"like" + 0.007*"time" + 0.006*"try" + 0.006*"year" + 0.006*"year_old" + 0.006*"day" + 0.005*"wife" + 0.005*"buy" + 0.005*"tell"
2022-07-17 12:24:37,939 : INFO : topic #1 (0.072): 0.009*"year" + 0.008*"day" + 0.007

In [72]:
top_topics = model.top_topics(corpus)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

2022-07-17 12:25:26,453 : INFO : CorpusAccumulator accumulated stats from 1000 documents
2022-07-17 12:25:26,457 : INFO : CorpusAccumulator accumulated stats from 2000 documents


Average topic coherence: -2.9604.
[([(0.027623765, 'year_old'),
   (0.013424676, 'year'),
   (0.012604725, 'old'),
   (0.012436859, 'think'),
   (0.009284224, 'wife'),
   (0.008263561, 'work'),
   (0.0068875323, 'get'),
   (0.006652288, 'take'),
   (0.006451795, 'time'),
   (0.0061963326, 'look'),
   (0.00617845, 'tell'),
   (0.006124915, 'kid'),
   (0.005946673, 'want'),
   (0.0057988726, 'day'),
   (0.0056434027, 'night'),
   (0.005336212, 'go'),
   (0.005288253, 'like'),
   (0.005078944, 'know'),
   (0.004822383, 'home'),
   (0.004517141, 'watch')],
  -2.508778851428638),
 ([(0.009646516, 'work'),
   (0.007426429, 'year'),
   (0.0073261233, 'get'),
   (0.0063504386, 'day'),
   (0.0063478895, 'ask'),
   (0.005703999, 'go'),
   (0.0053586126, 'think'),
   (0.005344327, 'wife'),
   (0.005307524, 'leave'),
   (0.005194282, 'phone'),
   (0.005062379, 'see'),
   (0.004631838, 'house'),
   (0.0041664885, 'month'),
   (0.0039775185, 'watch'),
   (0.0038153583, 'tell'),
   (0.00360465, 'turn

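These topics don't seem that informative: generic words such as "year", "work", "wife", "get" and "day" rank highly in more than one topic.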

Resources:
* https://dev.to/twitterdev/a-comprehensive-guide-for-using-the-twitter-api-v2-using-tweepy-in-python-15d9
* Getting a Twitter ID from the username - https://commentpicker.com/twitter-id.php
* Gensim LDA Tutorial - https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html
* Multicore LDA - https://radimrehurek.com/gensim/models/ldamulticore.html#module-gensim.models.ldamulticore
* Clustering using Spacy and k-means - https://towardsdatascience.com/lovecraft-with-natural-language-processing-part-3-tf-idf-vectors-8c2d4df98621