This notebook contains some code snippets I used in writing the following blog post:

Some time ago I read a tweet by Kenneth Reitz, a very well known Python developer I follow on Twitter, asking:
{% twitter 952553176925958145 %}
Starting from this, I decided to analyze some tweets from pretty popular Python devs in order to understand a priori how they use Twitter, what they tweet about and what I can gather using data from Twitter APIs only.
Obviously you can apply the same analysis on a different list of Twitter accounts.


# Setting up the environment

For my analysis I set up a Python 3.6 virtual environment with the following main libraries:
- [Tweepy](https://github.com/tweepy/tweepy) for interaction with Twitter APIs
- [Pandas](https://pandas.pydata.org/) for data analysis and visualization
- [Beautiful Soup 4](https://pypi.org/project/beautifulsoup4/), [NLTK](https://www.nltk.org/) and [Gensim](https://radimrehurek.com/gensim/) for processing text data

Some extra libraries will be introduced later on, along with the explanation of the steps I took.

In order to access the Twitter APIs I [registered my app](https://apps.twitter.com/) and then I provided the tweepy library with the consumer_key, access_token, access_token_secret and consumer_secret.
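
The notebook imports the four credentials from a local `configs` module that is not shown here. As a rough sketch of one way to provide them, the values could be read from environment variables; the variable names and the `load_twitter_credentials` helper below are my own assumptions, not part of the original setup:

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical environment variable names; the notebook itself reads these
# values from a local `configs` module instead.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_credentials(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the four credentials, failing early if any of them is missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing Twitter credentials: {', '.join(missing)}")
    return {key: env[key] for key in REQUIRED_KEYS}
```

Failing early with the list of missing names is usually easier to debug than an authentication error coming back from the API.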

We're now ready to get some real data from Twitter!

In [None]:
import itertools
import logging
import os
import re
import ssl
import timeit
from collections import Counter
from contextlib import contextmanager
from datetime import datetime
from typing import List, Tuple

import folium
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import pyLDAvis
import pyLDAvis.gensim as gensimvis
import tweepy
from IPython.display import Markdown, display
from bs4 import BeautifulSoup
from gensim import corpora
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel, AuthorTopicModel, Phrases
from gensim.models.phrases import Phraser
from gensim.parsing.preprocessing import STOPWORDS
from geopy import Nominatim
from matplotlib import style, ticker, rcParams
from textblob import TextBlob
from wordcloud import WordCloud

from configs import twitter_consumer_key, twitter_access_token, twitter_access_token_secret, twitter_consumer_secret

pd.set_option("display.width", 1000)
pd.set_option("display.max_columns", 100)

style.use("ggplot")

%matplotlib inline
rcParams.update({'font.size': 13})
rcParams['figure.figsize'] = [20, 10]

logging.basicConfig(format="%(levelname)s : %(message)s", level=logging.WARN)
logging.root.level = logging.WARN

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context


# Choosing a list of Twitter accounts

First of all, I chose a list of 8 popular Python devs, starting from [the top 360 most-downloaded packages on PyPI](https://pythonwheels.com/) and selecting some libraries I know or use daily.

Here's my final list, including links to the Twitter accounts and the libraries (from the above mentioned list) for which those guys are known:
- [@kennethreitz](https://twitter.com/kennethreitz): requests
- [@mitsuhiko](https://twitter.com/mitsuhiko): jinja2/flask
- [@zzzeek](https://twitter.com/zzzeek): sqlalchemy
- [@teoliphant](https://twitter.com/teoliphant): numpy/scipy/numba
- [@benoitc](https://twitter.com/benoitc): gunicorn
- [@asksol](https://twitter.com/asksol): celery
- [@wesmckinn](https://twitter.com/wesmckinn): pandas
- [@cournape](https://twitter.com/cournape): scikit-learn


# Getting data from Twitter

I got all the data with two endpoints only:
- with a call to [lookup users](https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-users-lookup) I could get all the information about the accounts (creation date, description, counts, location, etc.)
- with a call to [user timeline](https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline.html) I could get the tweets from a single user and all the information related to every single tweet. I configured the call to also get retweets and replies.
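
The timeline endpoint returns tweets in pages, newest first, so the code in this notebook walks backwards by passing a `max_id` one below the oldest tweet seen so far. Here is a minimal sketch of that loop, using a fake in-memory timeline in place of the real API (`fetch_all`, `fake_fetch_page` and `FAKE_TIMELINE` are hypothetical names made up for this example):

```python
from typing import Callable, Dict, List, Optional


def fetch_all(fetch_page: Callable[[Optional[int]], List[Dict]]) -> List[Dict]:
    """Collect every page by walking backwards with max_id."""
    collected: List[Dict] = []
    max_id: Optional[int] = None
    while True:
        page = fetch_page(max_id)
        if not page:
            break
        collected.extend(page)
        # Ask only for tweets strictly older than the last one received.
        max_id = page[-1]["id"] - 1
    return collected


# Hypothetical stand-in for the API: ids 10..1, newest first.
FAKE_TIMELINE = [{"id": i} for i in range(10, 0, -1)]


def fake_fetch_page(max_id: Optional[int], count: int = 4) -> List[Dict]:
    """Return at most `count` tweets whose id is <= max_id."""
    visible = [t for t in FAKE_TIMELINE if max_id is None or t["id"] <= max_id]
    return visible[:count]


tweets = fetch_all(fake_fetch_page)
print(len(tweets))  # 10 tweets, fetched in pages of 4, 4 and 2
```

Subtracting 1 from the last id matters: `max_id` is inclusive, so without it the oldest tweet of each page would come back again at the top of the next one.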

I saved the results of the two calls in two Pandas dataframes to ease the data processing, and then into CSV files to use as a starting point for the next steps without calling the Twitter API again each time.
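
The CSV step is not shown in the cells below, so here is a small sketch of how such a cache could look. The `load_or_build` helper and the file name are assumptions of mine. Note that nested columns such as `entities` come back from CSV as strings and need to be parsed again:

```python
import ast
import os
import tempfile

import pandas as pd


def load_or_build(path, build):
    """Read the dataframe from `path` if it exists, otherwise build and save it."""
    if os.path.exists(path):
        # Keep ids as strings so that long tweet ids are not mangled.
        return pd.read_csv(path, dtype={"id_str": str})
    df = build()
    df.to_csv(path, index=False)
    return df


def build_toy_df():
    # Stand-in for the real API calls; one made-up row is enough here.
    return pd.DataFrame({"id_str": ["123"], "entities": [{"hashtags": []}]})


with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "tweets.csv")  # hypothetical file name
    load_or_build(csv_path, build_toy_df)            # first call builds and saves
    cached = load_or_build(csv_path, build_toy_df)   # second call reads the CSV
    # The dict was stored as text, so it has to be parsed back.
    entities = ast.literal_eval(cached.loc[0, "entities"])
```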


In [None]:
def _get_twitter_api() -> tweepy.API:
    auth = tweepy.OAuthHandler(twitter_consumer_key, twitter_consumer_secret)
    auth.set_access_token(twitter_access_token, twitter_access_token_secret)

    return tweepy.API(auth)


def _print_api_limits(api: tweepy.API, api_endpoint: str) -> int:

    limits = api.rate_limit_status()

    endpoint_limits = limits["resources"]["statuses"][f"/statuses/{api_endpoint}"]
    usertimeline_remain = endpoint_limits["remaining"]
    usertimeline_limit = endpoint_limits["limit"]
    usertimeline_time_local = datetime.fromtimestamp(endpoint_limits["reset"])
    print(f"Twitter API for {api_endpoint}: remain: {usertimeline_remain} / limit: {usertimeline_limit} - resetting at: {usertimeline_time_local}")
    return usertimeline_remain


def _get_df_users(api: tweepy.API, accounts: List[str]) -> pd.DataFrame:

    user_res = api.lookup_users(screen_names=accounts)
    df_users = pd.DataFrame([res._json for res in user_res])[["created_at", "description", "favourites_count",
                                                              "followers_count", "friends_count", "id_str",
                                                              "listed_count", "location", "name", "screen_name",
                                                              "statuses_count", "time_zone", "verified"]]
    df_users.rename(columns={"friends_count": "following_count", "statuses_count": "total_tweets"}, inplace=True)

    return df_users


def _get_df_tweets(api: tweepy.API, account: str) -> pd.DataFrame:

    count = 200
    include_rts = True
    exclude_replies = False
    tweet_mode = "extended"

    print(f"Getting tweets of [{account}]")
    fields = ("created_at", "entities", "favorite_count", "id_str", "lang", "possibly_sensitive", "retweet_count",
              "full_text", "truncated", "source", "geo", "coordinates", "place",
              "in_reply_to_status_id", "retweeted_status", )
    tweets = []
    
    new_tweets = api.user_timeline(screen_name=account, count=count, include_rts=include_rts,
                                   exclude_replies=exclude_replies, tweet_mode=tweet_mode)

    while len(new_tweets) > 0:
        for tweet_res in new_tweets:
            items = {k: v for k, v in tweet_res._json.items() if k in fields}
            tweets.append(items)

        print(f"We've got {len(tweets)} tweets so far for user @{account}")
        max_id = new_tweets[-1].id - 1
        new_tweets = api.user_timeline(screen_name=account, count=count, max_id=max_id, include_rts=include_rts,
                                       exclude_replies=exclude_replies, tweet_mode=tweet_mode)

    df = pd.DataFrame(tweets)
    df["username"] = account

    return df

In [None]:
accounts = [
        "kennethreitz",  # requests
        "mitsuhiko",  # jinja2/flask
        "zzzeek",  # sqlalchemy
        "teoliphant",  # numpy/scipy/numba
        "benoitc",  # gunicorn
        "asksol",  # celery
        "wesmckinn",  # pandas
        "cournape",  # scikit-learn
    ]

api = _get_twitter_api()

_print_api_limits(api, "user_timeline")

df_tweets = pd.DataFrame()

df_users = _get_df_users(api=api, accounts=accounts)

display(df_users)

for account in accounts:

    _print_api_limits(api, "user_timeline")

    df = _get_df_tweets(api=api, account=account)

    # pd.concat replaces DataFrame.append, which recent pandas versions no longer provide
    df_tweets = pd.concat([df_tweets, df])

display(df_tweets.head())


# Preprocessing tweets

The users dataframe contained all the information I needed; I just created three more columns:
- a followers/following ratio, a sort of "popularity" indicator
- a tweets per day ratio, dividing the total number of tweets by the number of days since the creation of the account
- the coordinates starting from the location, if available, using [Geopy](https://pypi.org/project/geopy/). @benoitc doesn't have a location, while @zzzeek has a generic "northeast", geolocated in Nebraska :-)
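
As a quick worked example of the first two indicators, here is a plain-Python sketch with made-up numbers (the function names are mine, and the handling of zero denominators is an assumption, since the notebook does not deal with that case):

```python
from datetime import date


def followers_following_ratio(followers: int, following: int) -> float:
    """Followers divided by following; infinite if the account follows nobody."""
    return followers / following if following else float("inf")


def tweets_per_day(total_tweets: int, created: date, today: date) -> float:
    """Total tweets divided by the account age in days."""
    days = (today - created).days
    return total_tweets / days if days > 0 else float("nan")


# Made-up account: created on 2010-01-01, observed on 2020-01-01 (3652 days).
print(followers_following_ratio(1000, 250))                         # 4.0
print(tweets_per_day(7304, date(2010, 1, 1), date(2020, 1, 1)))     # 2.0
```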


In [None]:
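df_users["created_at"] = pd.to_datetime(df_users["created_at"], utc=True)
df_users["followers/following"] = df_users["followers_count"] / df_users["following_count"]
# Days elapsed since the account was created (both timestamps are in UTC)
df_users["tweets_per_day"] = df_users["total_tweets"] / (pd.Timestamp.now(tz="UTC") - df_users["created_at"]).dt.days
geolocator = Nominatim(user_agent="python-devs-tweets")  # hypothetical app name, pick your own

def _geocode(location):
    # Geocode each location once; return None when it is missing or not found
    if pd.isnull(location) or location == "":
        return None
    found = geolocator.geocode(location)
    return [found.latitude, found.longitude] if found else None

df_users["location_coo"] = df_users["location"].apply(_geocode)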
display(df_users[["screen_name", "name", "verified", "description", "created_at", "location", "location_coo", 
                  "time_zone", "total_tweets", "favourites_count", "followers_count", "following_count", 
                  "listed_count", "followers/following", "tweets_per_day"]])

The tweets dataframe on the contrary needed some extra preprocessing.

First of all, I discovered an annoying limitation of the Twitter user timeline API: there's a maximum number of tweets that can be returned (roughly 3200, including retweets and replies). Therefore I decided to group the tweets by username and to get the oldest tweet date for each user.

Then I filtered out all the tweets created before the latest of these per-user oldest dates (2017-08-26 20:48:35).
In this dataset @kennethreitz determines the cut date because he tweets a lot more than the other users, so his most recent tweets cover the shortest period; this way we at least get the same timeframe for all the users and can compare tweets from the same period.
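
The cut date is simply "the maximum of the per-user minimums". A tiny plain-Python sketch with made-up users and dates (not the real data) shows the idea; `common_cut_date` is a hypothetical helper name:

```python
from datetime import datetime
from typing import Dict, Iterable, Tuple


def common_cut_date(tweets: Iterable[Tuple[str, datetime]]) -> datetime:
    """Return the latest of the per-user oldest tweet dates."""
    oldest: Dict[str, datetime] = {}
    for user, created_at in tweets:
        if user not in oldest or created_at < oldest[user]:
            oldest[user] = created_at
    return max(oldest.values())


# Made-up data: the heavy tweeter's history only reaches back to August.
toy_tweets = [
    ("heavy_user", datetime(2017, 8, 26)),
    ("heavy_user", datetime(2018, 1, 10)),
    ("light_user", datetime(2012, 3, 1)),
    ("light_user", datetime(2017, 12, 5)),
]
cut = common_cut_date(toy_tweets)
kept = [(user, d) for user, d in toy_tweets if d >= cut]
print(cut, len(kept))  # 2017-08-26 00:00:00 3
```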

In [None]:
def _check_condition(df: pd.DataFrame, condition: pd.core.series.Series, label: str) -> pd.DataFrame:
    before = len(df)
    df = df[condition]
    print(f"{label}: {before}-{before-len(df)}={len(df)}")
    return df

df_tweets["created_at"] = pd.to_datetime(df_tweets["created_at"])

display(df_tweets.groupby("username").agg({"created_at": ["count", "min", "max"]})["created_at"].reset_index())
max_min_created = max(df_tweets.groupby("username").agg({"created_at": ["min"]})["created_at"]["min"])
df_tweets = _check_condition(df_tweets,
                             condition=df_tweets["created_at"] >= max_min_created,
                             label="max min_created")
display(df_tweets.groupby("username").agg({"created_at": ["count", "min", "max"]})["created_at"].reset_index())

Other preprocessing steps:

- I parsed the source information using Beautiful Soup because it comes wrapped in an HTML anchor tag
- I removed smart quotes from text
- I converted the text to lower case
- I removed Twitter entities, URLs and numbers
- I filtered out all the tweets with empty text after these steps (i.e. containing only mentions, etc) and I got 10948-217=10731 tweets.

I finally created a new column containing the "tweet type" (standard, reply or retweet) and another column with the tweet length.
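
To make these steps concrete, here is a standalone sketch of the text cleaning and of the tweet type rule, applied to a made-up tweet. It uses the same regular expressions as the cell below, but the function names and the sample data are only illustrative:

```python
import re
from typing import Dict


def clean_text(text: str) -> str:
    """Lowercase, drop URLs, keep only letters plus '@' and '#', squeeze spaces."""
    text = text.lower()
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"[^@#a-zA-Z]", " ", text)
    return " ".join(text.split())


def tweet_type(tweet: Dict) -> str:
    """Retweet wins over reply, which wins over standard."""
    if tweet.get("retweeted_status") is not None:
        return "retweet"
    if tweet.get("in_reply_to_status_id") is not None:
        return "reply"
    return "standard"


sample = "Check out https://t.co/abc123 @kennethreitz's #Python 3.6 tips!"
print(clean_text(sample))  # check out @kennethreitz s #python tips
print(tweet_type({"in_reply_to_status_id": 42, "retweeted_status": None}))  # reply
```

Mentions and hashtags survive this cleaning because `@` and `#` are kept by the second regular expression.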

In [None]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

# Format source and replace smart quotes
df_tweets["source"] = df_tweets["source"].map(lambda x: BeautifulSoup(x, "lxml").a.string if pd.notnull(x) else x)
df_tweets["full_text"] = df_tweets["full_text"].str.replace("’", "'")
df_tweets["full_text"] = df_tweets["full_text"].str.replace("”", '"')
df_tweets["full_text"] = df_tweets["full_text"].str.replace("“", '"')
df_tweets["full_text"] = df_tweets["full_text"].map(lambda x: BeautifulSoup(x, "lxml").string if pd.notnull(x) else x)
display(df_tweets[['username', 'created_at', 'full_text', 'lang', 'favorite_count', 
                   'retweet_count', 'source']].head(5))

In [None]:
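import ast


def _clean_tweet(tweet: pd.core.series.Series, remove_entities: bool = False) -> str:

    # lowercase
    tweet_text = tweet["full_text"].lower()

    # remove entities (mentions, hashtags, urls, ...) using their character indices
    if remove_entities:
        entities = tweet["entities"]
        # entities is a string when the dataframe was reloaded from CSV
        if isinstance(entities, str):
            entities = ast.literal_eval(entities)
        indexes = []
        for values_list in entities.values():
            for val in values_list or []:
                indexes.append(val.get("indices"))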

        for idx in sorted(indexes, reverse=True):
            tweet_text = tweet_text[:idx[0]] + tweet_text[idx[1]:]

    # remove remaining urls
    tweet_text = re.sub("http\\S+", "", tweet_text)

    # only letters
    tweet_text = re.sub("[^@#a-zA-Z]", " ", tweet_text)

    tweet_text = " ".join(tweet_text.split())

    return tweet_text

# Clean text
df_tweets["text_clean"] = df_tweets.apply(lambda x: _clean_tweet(x), axis=1)
df_tweets = _check_condition(df_tweets,
                             condition=df_tweets["text_clean"] != "",
                             label="empty text clean")

display(df_tweets[['username', 'created_at', 'full_text', 'lang', 'favorite_count', 
                   'retweet_count', 'source', 'text_clean']].head(5))

In [None]:
# tweet type
df_tweets['tweet_type'] = "standard"
df_tweets['tweet_type'] = np.where(pd.notnull(df_tweets['in_reply_to_status_id']), "reply", df_tweets['tweet_type'])
df_tweets['tweet_type'] = np.where(pd.notnull(df_tweets['retweeted_status']), "retweet", df_tweets['tweet_type'])

display(df_tweets[['username', 'created_at', 'full_text', 'lang', 'favorite_count', 
                   'retweet_count', 'source', 'text_clean', 'tweet_type']].head(5))

In [None]:
df_tweets['tweet_len'] = df_tweets['full_text'].str.len()

display(df_tweets[['username', 'created_at', 'full_text', 'lang', 'favorite_count', 
                   'retweet_count', 'source', 'text_clean', 'tweet_type', 'tweet_len']].head(5))

# Exploratory Data Analysis

The users dataframe itself already shows some insights:

- There are only two accounts with the verified flag: @mitsuhiko and @wesmckinn
- @wesmckinn, @kennethreitz, @teoliphant and @mitsuhiko are the most popular accounts in the list (according to my "popularity" indicator):

In [None]:
def _bar_plot(df: pd.DataFrame, show_graphs: bool, **kwargs):
    print(f"bar - {kwargs.get('title', '')}")
    if show_graphs:
        df.plot.bar(**kwargs)
        plt.show()
    else:
        plt.clf()

show_graphs = True
        
kwargs = dict(legend=False, rot=45, title='Popularity indicator', x='screen_name', y='followers/following')
_bar_plot(df=df_users, show_graphs=show_graphs, **kwargs)

- Since creating his account, @kennethreitz has tweeted at least twice as often per day as the other devs in the list:

In [None]:
kwargs = dict(legend=False, rot=45, title='Tweets per day', x='screen_name', y='tweets_per_day')
_bar_plot(df=df_users, show_graphs=show_graphs, **kwargs)

- Most of the devs in the list live in the US; I used Folium to create a map showing the locations:

In [None]:
m = folium.Map(zoom_start=5)
for _, row in df_users.iterrows():
    # Skip users without coordinates (missing or empty location)
    if isinstance(row["location_coo"], list):
        folium.Marker(row["location_coo"], popup=row["screen_name"]).add_to(m)

m

The tweets dataframe needs instead some manipulation before we can gather some good insights.

First of all let's check the tweet "style" of each account. From the following chart we can see for example that @cournape is retweeting a lot, while @mitsuhiko is replying a lot:

In [None]:
df_tweets.groupby(['username', 'tweet_type']).size().unstack().plot(kind='bar', rot=45, title='Tweet types')

We can also group by username and tweet type, and show a chart with the mean tweet length. @kennethreitz for example writes replies shorter than standard tweets, while @teoliphant writes tweets longer than the other guys (exceeding the 140 chars limit):

In [None]:
ax = df_tweets.groupby(['username', 'tweet_type']).agg({'tweet_len': ["mean"]}).unstack().plot(kind='bar', rot=45, title='Tweet length')
ax.legend(["reply", "retweet", "standard"])

Ok, now let's filter out the retweets and let's focus on the machine-detected language used in standard tweets and replies. The five most common languages are: English, German, French, undefined and a rather weird Tagalog (ISO 639-1 code "tl", maybe an error in auto-detection?). Most of the tweets are in English; @mitsuhiko tweets a lot in German, while @benoitc tweets a lot in French:

In [None]:
df_tweets = _check_condition(df_tweets,
                                 condition=~df_tweets["tweet_type"].isin(["retweet"]),
                                 label="not RTs")

# Keep the five most common languages and group all the others under "other"
languages = df_tweets["lang"].values.tolist()
mc_languages = [l[0] for l in Counter(languages).most_common(5)]
print(mc_languages)
df_tweets["lang"] = df_tweets["lang"].where(df_tweets["lang"].isin(mc_languages), 'other')
df_tweets.groupby(['username', 'lang']).size().unstack().plot(kind='bar', rot=45, title='Languages')

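The grouping of rare languages under a single label can be tried on a small list as well. This is only a sketch with made-up language codes, and `bucket_rare` is a hypothetical helper, not something used in the notebook:

```python
from collections import Counter
from typing import List


def bucket_rare(values: List[str], keep: int, other: str = "other") -> List[str]:
    """Keep the `keep` most frequent labels and replace the rest with `other`."""
    most_common = {label for label, _ in Counter(values).most_common(keep)}
    return [v if v in most_common else other for v in values]


# Made-up language codes, not taken from the real dataset.
langs = ["en"] * 5 + ["de"] * 3 + ["fr"] * 2 + ["tl", "ja"]
bucketed = bucket_rare(langs, keep=3)
print(Counter(bucketed))  # en: 5, de: 3, fr: 2, other: 2
```
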
So, let's just select tweets in English or undefined: all the next charts consider only tweets and replies in English (but obviously you can tune your analysis differently).
Let's group by username and get statistics about the number of favorites/retweets per user:

In [None]:
df_tweets = _check_condition(df_tweets,
                                 condition=df_tweets["lang"].isin(["en", "und"]),
                                 label="only english/undefined")

# General stats
tmp_df = df_tweets.groupby("username").agg({"favorite_count": ["count", "max", "mean", "std"],
                                       "retweet_count": ["max", "mean", "std"]}).reset_index()
tmp_df.columns = [' '.join(col).strip() for col in tmp_df.columns.values]

tmp_df

From this table we can see that:

- @kennethreitz has the most retweeted and favorited tweet in the dataframe. Here's the [tweet](https://twitter.com/kennethreitz/status/981547972239417345)
- @wesmckinn has the second most retweeted and favorited tweet in the dataframe. Here's the [tweet](https://twitter.com/wesmckinn/status/986998077767716865)
- @wesmckinn has the highest mean value for retweet count and favorite count

Since @wesmckinn also has the highest followers count, how do these stats change if we normalize the dataframe using the followers count?
Obviously a tweet can be favorited/retweeted by non-followers too, but this normalization will probably produce fairer results, because the higher the followers count, the more the tweet will probably be viewed.

After the normalization we can see that @cournape and @teoliphant are getting higher mean values, in terms of retweets and favorites.
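
Here is the normalization in miniature, with two made-up accounts: each count is expressed as a percentage of the author's followers before averaging. The helper name and the numbers are assumptions for illustration only:

```python
from statistics import mean
from typing import Dict, List


def mean_engagement_pct(counts_by_user: Dict[str, List[int]],
                        followers: Dict[str, int]) -> Dict[str, float]:
    """Mean of per-tweet counts, each expressed as a percentage of followers."""
    return {
        user: mean(count * 100 / followers[user] for count in counts)
        for user, counts in counts_by_user.items()
    }


# Made-up numbers: the big account gets more favorites in absolute terms,
# but the small one does better relative to its audience.
favorites = {"big_account": [100, 300], "small_account": [20, 30]}
followers = {"big_account": 10000, "small_account": 500}
print(mean_engagement_pct(favorites, followers))
# {'big_account': 2.0, 'small_account': 5.0}
```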

In [None]:
# General stats normalized by followers count
tmp_df = df_tweets.merge(df_users[["screen_name", "followers_count"]], left_on="username", right_on="screen_name")
tmp_df["favorite_count perc"] = tmp_df["favorite_count"] / tmp_df["followers_count"] * 100
tmp_df["retweet_count perc"] = tmp_df["retweet_count"] / tmp_df["followers_count"] * 100
tmp_df = tmp_df.groupby("username").agg({"favorite_count perc": ["count", "max", "mean", "std"],
                                         "retweet_count perc": ["max", "mean", "std"]}).reset_index()
tmp_df.columns = [' '.join(col).strip() for col in tmp_df.columns.values]

tmp_df

We can also see how the monthly number of tweets changes over time, per user. From the following chart we can see for example that @kennethreitz tweeted a lot in September 2017 (more than 800 tweets):

In [None]:
# pd.Grouper(freq="M") replaces pd.TimeGrouper("M"), which was removed from pandas
pv = df_tweets.set_index("created_at").groupby(["username", pd.Grouper(freq="M")]).size().reset_index(name="count")
pv['created_at'] = pv['created_at'].apply(lambda x: x.strftime("%Y-%m"))
pv = pd.pivot_table(pv, index="created_at", columns="username", aggfunc="sum")
pv.columns = pv.columns.get_level_values(1)
print(pv)
kwargs = dict(rot=45, title="Monthly tweets distribution over time")
_bar_plot(df=pv, show_graphs=show_graphs, **kwargs)

Or we can even see which tools are used the most to tweet, per user; I grouped a lot of less-used tools under "Other" (Tweetbot for iOS, Twitter for iPad, OS X, Instagram, Foursquare, Facebook, LinkedIn, Squarespace, Medium, Buffer).

In [None]:
pv = pd.pivot_table(df_tweets, values="id_str", index=["username"], columns=["source"], aggfunc="count")
pv = pv[pv.sum().sort_values(ascending=False).index]
pv = pv.div(pv.sum(1) / 100, 0)
other_col = pv.columns[pv.sum() < 20]
print(f"Grouping columns under 'Other' label: {other_col}")
pv["Other"] = pv[other_col].sum(1)
pv.drop(other_col, axis=1, inplace=True)
display(pv)
kwargs = dict(stacked=True, edgecolor="black", linewidth=0.5, colormap="tab20", legend="reverse", rot=45, title="Sources")
_bar_plot(df=pv, show_graphs=show_graphs, **kwargs)

Finally, we can build a kind of punchcard chart for each user, showing an aggregation of tweets dates by day of the week and hours of the day:

In [None]:
# Timezone could be read from the df_users, but I'm lazy...
timezones = {'kennethreitz': 'US/Eastern', 'wesmckinn': 'US/Eastern', 'mitsuhiko': 'Europe/Vienna',
             'asksol': 'US/Pacific', 'benoitc': 'Europe/Paris', 'cournape': 'Europe/Amsterdam',
             'zzzeek': 'America/Guayaquil', 'teoliphant': 'US/Central'}

fig = plt.figure()
i = 0
for username, grouped_df in df_tweets.groupby("username"):
    print(f"Preparing punch_card for {username}")

    grouped_df['created_at'] = grouped_df['created_at'].dt.tz_localize('UTC').dt.tz_convert(timezones.get(username))
    grouped_df['weekday'] = grouped_df['created_at'].apply(lambda x: int(x.strftime("%w")))
    grouped_df['hour'] = grouped_df['created_at'].apply(lambda x: int(x.strftime("%H")))
    grouped_df = grouped_df.groupby(["hour", "weekday"]).size().reset_index()

    ax = fig.add_subplot(2, 4, i + 1)
    grouped_df.plot(kind='scatter', x='hour', y='weekday', s=grouped_df[0].values * 2.5, ax=ax)
    # ax.xaxis.set_major_formatter(ticker.FixedFormatter(['', '00:00', '04:00', '08:00', '12:00', '16:00', '20:00']))
    ax.yaxis.set_major_formatter(ticker.FixedFormatter(['', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']))
    ax.set_title(f"@{username}")
    i += 1
plt.show()

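The aggregation behind the punchcard can be sketched without pandas too. To keep the example dependency-free I'm assuming a fixed UTC offset (so no daylight saving time, unlike the named timezones used above); `punchcard` is a hypothetical helper and the timestamps are made up:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from typing import Iterable


def punchcard(utc_times: Iterable[datetime], utc_offset_hours: int) -> Counter:
    """Count tweets per (weekday, hour) in local time; weekday 0 is Sunday."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    counts: Counter = Counter()
    for moment in utc_times:
        local = moment.astimezone(local_tz)
        counts[(int(local.strftime("%w")), local.hour)] += 1
    return counts


# Made-up timestamps: Monday 2018-04-16 in UTC.
times = [
    datetime(2018, 4, 16, 2, 30, tzinfo=timezone.utc),
    datetime(2018, 4, 16, 2, 45, tzinfo=timezone.utc),
    datetime(2018, 4, 16, 15, 0, tzinfo=timezone.utc),
]
card = punchcard(times, utc_offset_hours=-5)
print(card)  # two tweets on Sunday at 21h, one on Monday at 10h
```

The first two tweets show why the conversion matters: they were posted on Monday in UTC, but on Sunday evening for the author.
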
# Topics

But what are the devs in the list talking about?

Let's start with a simple visualization, a word cloud.
After some basic preprocessing of the text from standard tweets only (tokenization, POS tagging, stopwords removal, bigrams, etc), I grouped the tweets by username and got the most common words for each one:

In [None]:
def _create_dictionary_corpus(df_tweets: pd.DataFrame, min_count: int, no_above: float) -> Tuple[Dictionary, List, List]:

    # lem = nltk.WordNetLemmatizer()
    tokens_list = [x.split() for x in df_tweets["text_clean"].tolist()]
    tags_list = [nltk.pos_tag(token_list) for token_list in tokens_list]
    docs = []
    for tag_list in tags_list:
        sentence = []
        for word, tag in tag_list:
            # word = lem.lemmatize(word)
            if not word.isnumeric() and len(word) > 1 and word not in STOPWORDS:
                sentence.append(word)
        docs.append(sentence)

    # p_stemmer = PorterStemmer()
    # docs = [[p_stemmer.stem(token) for token in doc] for doc in docs]

    ph = Phrases(docs, min_count=min_count)
    bigram = Phraser(ph)
    for doc in docs:
        for token in bigram[doc]:
            if "_" in token:
                doc.append(token)

    dictionary = Dictionary(docs)
    dictionary.filter_extremes(no_below=min_count, no_above=no_above)
    if dictionary:
        __ = dictionary[0]  # This is only to "load" the dictionary
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    return dictionary, corpus, docs


df_wc = _check_condition(df_tweets, 
                         condition=~df_tweets["tweet_type"].isin(["retweet", "reply"]), 
                         label="not RTs")
df_wc = _check_condition(df_wc, 
                         condition=df_wc["lang"].isin(["en", "und"]), 
                         label="only english/undefined")

tweet_wc = WordCloud(background_color="white", max_words=75, width=600, height=500,
                     normalize_plurals=False, regexp="(#\\w+|@\\w+|\\w+)", relative_scaling=0.6)

fig = plt.figure()
i = 0
for username, grouped_df in df_wc.groupby("username"):

    dictionary, corpus, docs = _create_dictionary_corpus(df_tweets=grouped_df, min_count=20, no_above=0.5)
    w_list = list(itertools.chain(*docs))
    words = " ".join(w_list)

    print(f"@{username}|{len(grouped_df)}|{Counter(w_list).most_common(6)}")
    ax = fig.add_subplot(2, 4, i + 1)
    tweet_wc.generate(words)
    ax.imshow(tweet_wc, interpolation="bilinear")
    ax.set_title(f"@{username} ({len(grouped_df)} tweets)")
    ax.axis("off")
    i += 1
plt.show()

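Stripped of NLTK and Gensim, the counting step behind each word cloud boils down to something like the sketch below. The stop word list here is a tiny made-up one (the notebook uses Gensim's `STOPWORDS`), and the sample sentences are invented:

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Tiny illustrative stop word list; the notebook relies on gensim's STOPWORDS.
TOY_STOPWORDS = {"the", "a", "is", "to", "and", "of", "in"}


def tokenize(text: str) -> List[str]:
    """Split on whitespace and drop numbers, one-letter tokens and stop words."""
    return [w for w in text.split()
            if len(w) > 1 and not w.isnumeric() and w not in TOY_STOPWORDS]


def top_words(texts: Iterable[str], n: int) -> List[Tuple[str, int]]:
    """Return the n most common tokens across all texts."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts.most_common(n)


sample_texts = ["python is the best", "a new release of requests",
                "python and requests in python"]
print(top_words(sample_texts, 2))  # [('python', 3), ('requests', 2)]
```
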
The next step is to identify real topics, using an LDA model (`LdaModel`) from Gensim. This time I used standard tweets from two accounts only (@kennethreitz and @mitsuhiko) and performed the same preprocessing used for the word clouds.
I ran the model varying two parameters: the number of topics (ranging from 2 to 14) and the alpha value (with possible values 0.2, 0.3, 0.4).
Then I chose the best model using the Gensim built-in CoherenceModel, with c_v as the metric: the best model is the one with 9 topics and alpha=0.2.
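
The selection itself is a plain grid search: score every (number of topics, alpha) pair and keep the best one. Here is a sketch of that loop with a made-up scoring function standing in for "train an `LdaModel`, then ask `CoherenceModel` for the score"; `grid_search` and `toy_score` are hypothetical names and the scores are not real coherence values:

```python
import itertools
from typing import Callable, Dict, Iterable, Tuple


def grid_search(num_topics_range: Iterable[int], alphas: Iterable[float],
                score: Callable[[int, float], float]
                ) -> Tuple[Tuple[int, float], Dict[Tuple[int, float], float]]:
    """Score every (num_topics, alpha) pair and return the best pair and all scores."""
    results = {(n, a): score(n, a)
               for n, a in itertools.product(num_topics_range, alphas)}
    best = max(results, key=results.get)
    return best, results


def toy_score(num_topics: int, alpha: float) -> float:
    # Made-up stand-in for a coherence score, peaking at 6 topics and low alpha.
    return -abs(num_topics - 6) - alpha


best, scores = grid_search(range(2, 15), (0.2, 0.3, 0.4), toy_score)
print(best, len(scores))  # (6, 0.2) 39
```

Keying the results by a tuple rather than by a formatted string avoids having to split the key again to recover the two parameters.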

In [None]:
def evaluate_topics(min_topics: int, max_topics: int, dictionary, corpus, texts, author2doc=None, coherence="c_v"):

    iterations = 100
    passes = 20
    chunksize = 10000
    eval_every = 10
    minimum_probability = 0.1
    random_state = 1

    alphas = (0.2, 0.3, 0.4)

    c_v_map = {}

    for num_topics in range(min_topics, max_topics):
        for alpha in alphas:
            lm = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary,
                          alpha=alpha, iterations=iterations, passes=passes, chunksize=chunksize,
                          eval_every=eval_every, minimum_probability=minimum_probability, random_state=random_state)

            kwargs = dict(model=lm, corpus=corpus, dictionary=dictionary, coherence=coherence)
            if coherence in ("c_v", "c_uci", "c_npmi"):
                kwargs.update(dict(texts=texts))
            cm = CoherenceModel(**kwargs)
            tc_cv = cm.get_coherence()

            c_v_map[f"{alpha}_{num_topics}"] = [lm, tc_cv]
            print([num_topics, alpha, tc_cv])

    # Show graph
    for alpha in alphas:
        plt.plot(range(min_topics, max_topics),
                 [v[1] for k, v in c_v_map.items() if k.startswith(str(alpha))],
                 label=alpha)
    plt.xlabel("num_topics")
    plt.ylabel(f"Coherence score ({coherence})")
    plt.legend(loc="best")
    plt.show()

    return c_v_map


df_topics = _check_condition(df_tweets,
                             condition=~df_tweets["tweet_type"].isin(["retweet", "reply"]),
                             label="not RTs")
df_topics = _check_condition(df_topics,
                             condition=df_topics["lang"].isin(["en", "und"]),
                             label="only english/undefined")

df_topics = _check_condition(df_topics,
                             condition=df_topics["username"].isin(["kennethreitz", "mitsuhiko"]),
                             label="only kennethreitz and mitsuhiko")

usernames = set(df_topics["username"].tolist())

min_topics = 2
max_topics = 15

min_count = 10
no_above = 0.5

dictionary, corpus, docs = _create_dictionary_corpus(df_tweets=df_topics, min_count=min_count, no_above=no_above)

print(f"Number of unique authors: {len(usernames)}")
print(f"Number of unique tokens: {len(dictionary)}")
print(f"Number of documents: {len(corpus)}")

c_v_map = evaluate_topics(dictionary=dictionary, corpus=corpus, texts=docs, 
                          min_topics=min_topics, max_topics=max_topics, coherence="c_v")

selected = max(c_v_map.keys(), key=(lambda k: c_v_map[k][1]))
selected_alpha, selected_num_topics = selected.split("_")
print(f"Selected {selected_num_topics} num_topics in range ({min_topics}-{max_topics - 1}), with alpha={selected_alpha}")
model = c_v_map[f"{selected_alpha}_{selected_num_topics}"][0]

In [None]:
for i, topic in enumerate(model.show_topics(num_topics=-1, num_words=5)):
    print(f"{i}|{topic}")

In [None]:
vis_data = gensimvis.prepare(model, corpus, dictionary)
pyLDAvis.enable_notebook()
pyLDAvis.display(vis_data)

# Conclusions and future steps

In this post I showed how to get data from Twitter APIs and how to perform some simple analysis in order to know in advance some features about an account (e.g. tweet style, statistics about tweets, topics).
Your mileage may vary depending on the initial account list and the configuration of the algorithms (especially in topics detection).

Next steps:
- Improve preprocessing using lemmatization and stemming
- Try different algorithms for topics detection using Gensim (e.g. AuthorTopicModel or LdaMallet) or scikit-learn
- Add sentiment analysis