I worked on analyzing political affiliations based on plain text, which is pretty similar to judging whether a given article is fake news or not. The main objective behind this exercise was to generate some sort of labelled data to train an ML model with.
Fake news detection is a hard problem, and it becomes harder if you don't have a region-specific fake news dataset. In this post I try to find groups or communities using their Twitter characteristics, like retweets and mentions, by constructing a network out of them. I analyze the results I get and try to make some sense of the data.
# Gathering Data
The main task in our case is gathering data. I did so by writing a multithreaded tweet crawler, which helped me crawl roughly 50 million tweets in roughly 5 days. You can check out the repository yourself and do some crawling for fun. The data was stored in a PostgreSQL database, which was also the primary dump for all related data I got.
## Software Used
I used my [HomeLab](https://bagdeabhishek.github.io/homelab/) to crawl all this data.
1. The first thing you need to do is set up PostgreSQL on your computer, which you'll use for keeping crawled data. There are two settings you should be careful about while configuring the service. First, move the data directory to the disk you'll use for storage. Postgres by default stores the data under the /var directory, which can be a problem if you have your root partition on a small-capacity SSD. Second, make sure you are able to access the database over the network. It becomes way easier to work with the data on your local system using an IDE like DataGrip.
2. The second thing is writing a Twitter crawler. I use the official API along with the Tweepy library to crawl data. I've made the code generic enough so that anyone can download it and run it using a simple configuration file. You can clone [this](https://github.com/bagdeabhishek/TweetCrawlMultiThreaded) repository and edit the configuration file. You can add the handles you want to crawl in the handles.txt file. You can update the handles to crawl in the next layer using the -r option.
3. It makes working with the data infinitely easier if you process it on the server itself. To do this there is no better tool than Jupyter Notebooks, and JupyterLab offers the same features plus many additional ones. You can install it easily using pip. Once you install it, change some parameters in the configuration file and you are ready to access it over the internet.
4. For inserting the data into the Postgres database you'll need a Python library. Psycopg2 is the most widely supported library out there and it's very easy to use.

Once everything is set up correctly, you can launch the TweetCrawler.py file and it will happily chug along, crawling Twitter handles and storing the tweets in the database. One 'bug' is that some of the columns will have curly braces in them. This is mostly due to how the Psycopg2 library inserts data into the database. You can clean the data using the following SQL command:
```sql
UPDATE tweets_cleaned as t set retweeted_status_url =  TRANSLATE (t.retweeted_status_url,'{}','' );
```
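If you would rather strip the braces on the Python side before a value ever reaches the database, the same clean-up can be sketched with `str.translate`. This is only an illustration of what the `TRANSLATE` call above does; the helper name `strip_braces` and the sample URL are made up for the example and are not part of the crawler.

```python
# Translation table that deletes '{' and '}', mirroring
# TRANSLATE(column, '{}', '') in the SQL above.
_BRACES = str.maketrans("", "", "{}")


def strip_braces(value):
    """Return value without curly braces; None is passed through unchanged."""
    if value is None:
        return None
    return value.translate(_BRACES)


# Hypothetical sample value, shaped like a brace-wrapped URL column.
print(strip_braces("{https://example.com/status/1}"))
# prints: https://example.com/status/1
```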
Once that is done, create an index on the 'tweet_from' column to make retrieving and working on the crawled data a bit easier. You can run the following SQL command to do that:
```sql
CREATE INDEX ON tweets_cleaned (tweet_from);
```
After these steps you should have data to work with. We will now work on the data we have.
# Operating on data
The data I've crawled is huge and there is no way to work on it in memory (even after upgrading my main memory to 32 GB), so pandas is out of the question. Pandas loads all the data into memory and then performs operations on it. Our data size requires more sophisticated solutions, of the kind generally used for big data.

## Apache Spark
Apache Spark has a very good DataFrame API which can actually speed up data access and processing. Though it is generally discouraged to use a single-node Apache Spark cluster, I found the advantage of operating directly on the data using the DataFrame API to be significantly greater than loading the data into memory and then working on it (which is impossible anyway given the amount of data we have).

Setting up Apache Spark is fairly straightforward. Complete the setup, add the Postgres JDBC jar file in the appropriate location and you are good to go. Installing the pyspark package will make things easier to work with if, like me, you prefer Python to Scala. The main issue I encountered is that the table I'm operating on is a single table with 50 million rows, so there is no parallelism for Spark to take advantage of, which caused ridiculously long wait times. Initially these wait times caused a lot of timeouts, and changing the settings fixed that. Still, the time taken is ridiculous.

To fix this problem I searched for ways to partition the table. Postgres allows you to partition a table using inheritance, but in our case the table already exists and there is no easy way to partition a table once it has been created. The other option is using [pg\_partman](https://github.com/pgpartman/pg_partman). I'll configure the existing tables based on the created\_at column so that the database is partitioned by time interval. This would ideally allow parallel access to the records and should speed up Spark access.
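A separate route from pg\_partman, which I have not benchmarked here, is to let Spark split the read itself: the PySpark JDBC reader accepts a list of `predicates`, and each predicate is read as its own partition. The sketch below only generates time-range predicates on `created_at`. The helper name `time_range_predicates`, the dates, and the `jdbc_url` and `connection_properties` names in the trailing comment are all assumptions for illustration. Rows with a NULL `created_at` would match none of the predicates and would be skipped.

```python
from datetime import datetime, timedelta


def time_range_predicates(start, end, step, column="created_at"):
    """Split [start, end) into consecutive half-open WHERE clauses."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    predicates = []
    lower = start
    while lower < end:
        upper = min(lower + step, end)
        predicates.append(
            f"{column} >= '{lower.isoformat(sep=' ')}' "
            f"AND {column} < '{upper.isoformat(sep=' ')}'"
        )
        lower = upper
    return predicates


# Hypothetical date range: three one-day slices.
predicates = time_range_predicates(
    datetime(2019, 1, 1), datetime(2019, 1, 4), timedelta(days=1)
)
print(len(predicates))

# Each predicate would become one Spark partition, for example:
# df = spark.read.jdbc(url=jdbc_url, table="tweets_cleaned",
#                      predicates=predicates, properties=connection_properties)
```

Smaller steps give more, smaller partitions, so the step size is the knob to tune against the number of cores available.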

# Plotting the network
Getting the crawled data into a form that users can interact with becomes challenging at the scale of data I had. The idea I had was to identify clusters in the Twitter network and then do further processing on this information.

## Twitter network graph
Constructing the Twitter network was done with the NetworkX graph library in Python. I converted the crawled data into a list of tuples with one source and one destination node. I kept only the tweets which were retweets for this analysis. In general, retweets are a stronger measure of endorsement than mentions. I've also observed that the more popular personalities' mentions carry more importance, as these personalities rarely retweet. I might scale the edge weights in the future keeping this criterion in mind, and do additional processing to get better clustering using Gephi's clustering algorithm.

The function to construct the graph is pretty simple. The graph we construct is a directed graph where the edge weight is the number of times user A retweets user B. The Python function below does exactly that and returns a directed graph G.
```python
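import networkx as nx


def create_graph(ls_tup):
    """Build a directed retweet graph from rows of crawled tweets.

    Each row must support key access for 'tweet_from' (the retweeting user)
    and 'retweeted_status_user_handle' (the user being retweeted).
    """
    G = nx.DiGraph()
    for dc in ls_tup:
        tfrom = dc['tweet_from']
        rt = dc['retweeted_status_user_handle']
        if G.has_edge(tfrom, rt):
            # Edge already exists: count one more retweet of rt by tfrom.
            G[tfrom][rt]['weight'] += 1
        else:
            G.add_edge(tfrom, rt, weight=1)
    return G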
```
We can do a lot of preprocessing in Python itself, but I've found [Gephi](https://gephi.org/) to be a much better tool that is easier to use on such large amounts of data. The visualizations are also way more engaging, and it comes with various plugins to export the network in various formats.
You can export this graph easily in the GEXF format using the NetworkX library's _write\_gexf()_ function.
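As a minimal sketch of the export step, here is a toy graph written to GEXF and read back to check that the weights survive. The handles, the weights and the output file name `retweet_network.gexf` are placeholders, not values from my dataset.

```python
import os
import tempfile

import networkx as nx

# Toy directed retweet graph; handles and weights are made up.
G = nx.DiGraph()
G.add_edge("user_a", "user_b", weight=3)
G.add_edge("user_c", "user_b", weight=1)

# Write the graph in GEXF format so Gephi can import it.
out_path = os.path.join(tempfile.gettempdir(), "retweet_network.gexf")
nx.write_gexf(G, out_path)

# Read it back as a sanity check before opening it in Gephi.
G_check = nx.read_gexf(out_path)
print(G_check.number_of_nodes(), G_check.number_of_edges())
```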


## Gephi processing
Once you have the GEXF file you can run Gephi and import the data into the tool. Before you do that, though, make sure you have installed Oracle's Java version. The difference between Oracle Java and the OpenJDK version is night and day: with OpenJDK, Gephi basically becomes unusable if the network is as large as mine was. Once installed, disable anti-aliasing in the settings to make the renders quicker as well.
### Pre-processing
Once you get the data into Gephi you can do some preprocessing. I did the following things to weed out unimportant nodes:

1. Calculate the weighted degrees of all the nodes. Trim the nodes with a weighted degree of less than 2 (you can set this threshold much higher).
2. Run the modularity algorithm, tweaking the parameters to get fewer communities. You can now see the major clusters that arise out of the data.
3. Trim out the clusters with fewer than, say, 500 nodes in them. This can be done using the partition count filter in Gephi.
4. Run the clustering algorithm once again on this reduced graph. Now you should see the clusters clearly enough. In my case all handles from a particular political party were in one cluster.
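I did all of the steps above in Gephi, but step 1 can also be approximated in NetworkX before exporting, which shrinks the file Gephi has to load. This is a rough sketch under that assumption: the function name `trim_by_weighted_degree` is hypothetical, and the weighted degree here is the sum of incoming and outgoing edge weights.

```python
import networkx as nx


def trim_by_weighted_degree(G, min_weighted_degree=2):
    """Return a copy of G without nodes below the weighted-degree threshold."""
    # For a DiGraph, degree(weight=...) sums in-edge and out-edge weights.
    weighted_degree = dict(G.degree(weight="weight"))
    keep = [n for n, d in weighted_degree.items() if d >= min_weighted_degree]
    return G.subgraph(keep).copy()


# Toy graph: 'c' has weighted degree 1 and gets trimmed.
G = nx.DiGraph()
G.add_edge("a", "b", weight=3)
G.add_edge("c", "b", weight=1)

trimmed = trim_by_weighted_degree(G)
print(sorted(trimmed.nodes()))
# prints: ['a', 'b']
```

Note that this is a single pass: removing nodes lowers the weighted degree of their neighbours, so the result can still contain nodes that fall under the threshold afterwards.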

### Visualization
Once you have the data, the next step is visualizing it. If you don't want to visualize, go to the data tab, select the columns of interest (in my case the modularity class) and export them as CSV or some other format. Visualization in Gephi is pretty confusing if you are new to it. For most purposes simply running the OpenOrd layout algorithm will suffice. That's what I did, and I tweaked the phases of the OpenOrd algorithm to give more time to the expansion phase. Increasing the number of iterations will also help for large networks, though it might take more time depending on the hardware you have.

Once you are happy with the visualization, render it and export it in any format you want. Hot tip: you can use the plugin (will add later) which will generate HTML that renders using JavaScript, for browser-friendly interactive rendering.

# Using the data to extract relevant information
Once you have the clustering information you need to filter the tweets according to the clusters you have identified. The first step is to get this clustering data out of Gephi, which is easiest if you export it directly as a CSV file. These are the steps I followed:
1. After running the clustering algorithm, i.e. getting the modularity classes, apply the partition count filter and filter by modularity class. Keep only the classes with more than some threshold number of nodes in them. This will reduce the number of irrelevant communities.
2. Go to the data tab and select the export as spreadsheet option. In the dialogue box select only the id and modularity class columns and save the file.
3. Create a separate table which will store the mapping from Twitter handle to modularity class. You can create a simple table in Postgres with two columns, one for twitter_handles and the other for the modularity class. Copying the CSV data into the table can be done simply by using the following command. Keep in mind that you'll have to manually remove the header row if your modularity_class column is of a non-text datatype like BIGINT.
```SQL 
COPY cluster_mapping FROM '/path/to/csv/clusters.csv' WITH (FORMAT csv);
```
4. To actually retrieve valuable information like (text, cluster_id) tuples, we'll need to retrieve data using SQL JOIN statements. If you have lots of data, this JOIN operation becomes basically impossible without indexes. So we create an index on the twitter_handles column of the new table; we had already created an index on the table that has all the tweets.

5. Getting this data into Python is tricky. Since my dataset of tweets was very large, I decided to upgrade my system memory to 32 GB. The other, saner way is to use Apache Spark as I've noted above. You can still use pandas in Python and load the data in chunks. I read this data into a Jupyter notebook using psycopg2, which was mainly possible because of the RAM upgrade. Running the following SQL statement will fetch the relevant data into a Python list. I use an INNER JOIN because I don't want any data which doesn't lie in the selected modularity classes.
```SQL
SELECT t.tweet_from,t.user_mentions_name,t.retweeted_status_user_handle,c.cluster,c.weighted_degree FROM tweets_cleaned AS t INNER JOIN cluster_mapping AS c ON t.tweet_from = c.id ;
```
6. Load this data into a pandas DataFrame using a simple command:
```Python
df = pd.DataFrame(ls ,columns = ["handle","mentions","retweets","cluster","importance"])
```
7. One trick I use is to store the relevant DataFrame using pickle, which reduces the time it takes to get the data from the database.
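The pickle trick from step 7 can be wrapped in a small helper so the expensive query only runs when no cached file exists. This is a sketch, not the exact code I ran: `load_or_build` and `fetch_from_database` are hypothetical names, and the builder can be any function that returns a DataFrame.

```python
from pathlib import Path

import pandas as pd


def load_or_build(cache_path, build_frame):
    """Load a pickled DataFrame if present, otherwise build it and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: skip the database round trip entirely.
        return pd.read_pickle(cache_path)
    df = build_frame()
    df.to_pickle(cache_path)
    return df


# Usage, assuming fetch_from_database() runs the JOIN and returns a DataFrame:
# df = load_or_build("mention_retweet_pandas.pkl", fetch_from_database)
```

Delete the pickle file whenever the underlying tables change, since the helper never checks whether the cache is stale. Only load pickle files you created yourself.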



```python
from collections import Counter
import string
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
```

```python
df = pd.read_pickle("mention_retweet_pandas.pkl")
```

```python
def __custom_words_accumulator(series):
    # Count comma-separated handles across one cluster's rows and
    # return the 50 most common as (handle, count) pairs.
    c = Counter()
    for sentence in series:
        if sentence:
            sent_list = sentence.split(",")
            c.update(sent_list)
    return c.most_common(50)


wf = df.groupby("cluster")["retweets"].apply(__custom_words_accumulator).reset_index()
```

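```python
def split_list(series, handleBool=True):
    """Flatten per-cluster (handle, count) lists into one flat list.

    Returns the handles when handleBool is True, otherwise the counts.
    """
    handles = []
    listNoOfX = []
    for groupList in series:
        for handle, x in groupList:
            handles.append(handle)
            listNoOfX.append(x)
    if handleBool:
        return handles
    else:
        return listNoOfX


wf2 = pd.DataFrame({
    # Repeat each cluster id once per handle it contributed; a cluster may
    # have fewer than 50 distinct handles.
    'cluster_id': np.repeat(wf['cluster'].to_numpy(), wf['retweets'].str.len()),
    'handle': split_list(wf['retweets'], True),
    'noOfX': split_list(wf['retweets'], handleBool=False)
})
clusters = wf2.cluster_id.unique()
```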

```python
sns.set(rc={'figure.figsize': (40, 10)})
f, ax = plt.subplots(len(clusters), 1, figsize=(40, 100))
f.tight_layout(pad=6.0)
# One bar chart per cluster: its most retweeted handles and their counts.
for i, cid in enumerate(clusters):
    g = sns.barplot(x="handle", y="noOfX", hue="cluster_id", data=wf2[wf2.cluster_id == cid], ax=ax[i])
    g.set_xticklabels(g.get_xticklabels(), rotation=50, horizontalalignment='right')
```