# RedditHoles - A study in internet Rabbit Holes

## An introduction to Reddit

Reddit is a social media platform organised around communities rather than individual connections. This makes the experience quite different from that of other social networks and closer, in a way, to that of forums.

### u/User
A user (also called a redditor, usually referred to as “u/” followed by the username) can make a post (also called a submission); posts can be links, videos, pictures, polls or text. The OP (Original Poster) as well as other users can comment on the post and vote on it (either upvote or downvote) depending on whether they find it interesting, so voting is not exactly like the like/dislike features of other social networks. Finally, users can give awards to posts and comments by paying with Reddit coins; these recognise the contribution of a post or comment. There are hundreds of these, from generic gold or silver awards to some that follow internet lingo, such as the “F” award (used ironically to ‘pay respect’).

Users can mark their posts with several types of flairs and tags, such as OC (Original Content), Spoilers, +18 and so on. Some subreddits might have rules for posting.

Users can be humans or bots. Bots have multiple uses, from automatic moderation to replying with quotes from movies and books or waving flags, along with other utilities and fun uses.

#### The average redditor

According to [data from the site itself](https://www.redditinc.com/advertising/audience), Reddit has more than 50 million users and more than 100 thousand communities. Users are mostly male (56%) and between 18 and 34 years old (58%).

### r/Subreddits

Reddit is structured in subreddits, communities that group users by shared interest. Subreddits range from cute pets to political parties to recipe advice. A subreddit is usually referred to as “r/” followed by its name (in our case, we will study “r/conspiracy” and related subreddits).

Subreddits can vary significantly. They can have rules for posting and different levels of moderation and bot acceptance, all depending on the nature of the subreddit. The rules themselves vary: some subreddits require the content of a post to be marked, some require sources to be linked, and some have no rules at all. This variability makes Reddit a much more decentralised and open social network, in which the crackdown on fake news and trolls seen on sites such as Facebook or Twitter has not yet happened to the same degree.



## Accessing Reddit Data
To access data from the Reddit API, we used the [`praw` Python library](https://github.com/praw-dev/praw).
To use it, we need:
- a Reddit account.
- a Reddit app (you can create one [here](https://www.reddit.com/prefs/apps)). Once the app is created, the client id appears under the app name, together with the client secret; the user agent is your app name.

Once we have those, we can access Reddit from Python by creating a `Reddit` instance in praw (it is also possible to authenticate as a user by adding a username and password if you want to use it in write mode).

In [1]:
#!pip install praw
#!pip install pandas
import praw
reddit = praw.Reddit(client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET", user_agent="little_digging") # replace the placeholders with your own app's credentials
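
Hard-coding the client secret in a notebook makes it easy to leak when the notebook is shared. Below is a minimal sketch of one alternative, assuming the credentials are exported as environment variables. The variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET` and `REDDIT_USER_AGENT`, and the helper `reddit_credentials`, are hypothetical choices made for this example and are not required by praw.

```python
import os

def reddit_credentials(env=os.environ):
    """Collect the three values praw.Reddit needs from environment variables."""
    # Hypothetical variable names: pick whatever fits your setup.
    names = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    missing = [var for var in names.values() if var not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in names.items()}

# Usage (not run here): reddit = praw.Reddit(**reddit_credentials())
```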

Now we can access information about subreddits, posts and users. As an example, we can retrieve the top 5 posts of all time on Reddit, with information about the author, the upvote ratio and the date:

In [None]:
for i in reddit.subreddit('all').top(limit=5): # this will print the top posts of all time
    author = i.author.name if i.author else '[deleted]' # the author is None when the account was deleted
    print(f'{i.title} Author: {author} Link: https://www.reddit.com{i.permalink} Subreddit: r/{i.subreddit.display_name} Upvote Ratio: {i.upvote_ratio} Date (UTC timestamp): {i.created_utc}')
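
The `created_utc` field printed above is a Unix timestamp in seconds, which is hard to read at a glance. A small sketch of how it could be turned into a readable date using only the standard library; the helper name `format_created_utc` is made up for this example.

```python
from datetime import datetime, timezone

def format_created_utc(timestamp):
    """Convert a Unix timestamp in seconds (as in `created_utc`) to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()

print(format_created_utc(1600000000))  # 2020-09-13T12:26:40+00:00
```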

## Scraping Reddit, starting from r/conspiracy

Our objective is to observe how a piece of news or a post is shared between different subreddits. While most social networks measure the shares of a post, Reddit is built so that, when a link is shared on the platform, it is possible to retrieve how widely the original link has been shared across subreddits. As an example, if an image from imgur is shared, we can use its URL to search for the same object on Reddit.
This is possible because, for the most part, Reddit is used to comment on news and multimedia in communities, which often share a common worldview (e.g. subreddits made up of people with the same political views).
We started with the posts of [r/conspiracy](https://www.reddit.com/r/conspiracy/), in particular the top 5000 posts of all time. We opted for the all-time top posts because the other types of ranking are time-bound and we wanted to observe the overall transmission of posts.

In [None]:
import pandas as pd

post_list = []
conspiracy_dict = {}

# Note: Reddit listings are capped (at roughly 1000 items), so fewer than 5000 posts may be returned.
for i in reddit.subreddit("conspiracy").top(limit=5000):
    # In praw, `url` is the link to the original resource inside the post, not the link to the post itself.
    post_list.append(i.url)

for post in post_list:
    # Search all of Reddit for submissions linking to the same resource
    for repost in reddit.subreddit('all').search('url:' + post):
        subreddit_url = "https://www.reddit.com/r/" + str(repost.subreddit)
        if subreddit_url not in conspiracy_dict:
            # Each value is [list of repost permalinks, [repost count]]
            conspiracy_dict[subreddit_url] = [[], [0]]
        conspiracy_dict[subreddit_url][0].append("https://www.reddit.com" + repost.permalink)
        conspiracy_dict[subreddit_url][1][0] += 1

df = pd.DataFrame(conspiracy_dict)
df.to_csv(r'results/conspiracy_data/conspiracy.csv', index=False)




We realised after the fact that this method also returns reposts made inside the same subreddit. We cleaned the csv manually in this instance and removed the problem in the code of the following steps.
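
The grouping step, with same-subreddit reposts filtered out, can be sketched without any API access. This is an illustration under assumptions, not the notebook's actual code: the helper `group_reposts` and the sample pairs are made up, and the search results are represented as plain `(subreddit_name, permalink)` tuples.

```python
from collections import defaultdict

def group_reposts(reposts, source_subreddit):
    """Group repost permalinks by subreddit, skipping the source subreddit itself.

    `reposts` is an iterable of (subreddit_name, permalink) pairs; in the
    notebook these would come from the search results.
    """
    grouped = defaultdict(list)
    for subreddit_name, permalink in reposts:
        if subreddit_name.lower() == source_subreddit.lower():
            continue  # a repost inside the same subreddit is not a connection
        grouped["https://www.reddit.com/r/" + subreddit_name].append(
            "https://www.reddit.com" + permalink
        )
    # The count is simply the number of permalinks collected for each subreddit
    return {url: (links, len(links)) for url, links in grouped.items()}

# Made-up sample data, for illustration only
sample = [
    ("conspiracy", "/r/conspiracy/comments/aaa/x/"),
    ("news", "/r/news/comments/bbb/y/"),
    ("news", "/r/news/comments/ccc/z/"),
]
result = group_reposts(sample, "conspiracy")
```

Deriving the count from the list length avoids keeping two pieces of state in sync.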

### Creating a network of subreddits

After scraping r/conspiracy, we moved on to the neighbouring subreddits. This creates a network of posts shared between subreddits.

First we need to look at all the files in the directory.

In [2]:
import os

def get_all_in_dir(directory):
    """Yield the path of every .csv file directly inside `directory`."""
    for filename in os.listdir(directory):
        f = os.path.join(directory, filename)
        if os.path.isfile(f) and f.endswith(".csv"):
            yield f

Now we can scrape the other subreddits that have at least 5 posts in common with r/conspiracy.

In [None]:
df1 = pd.read_csv('results/conspiracy_data/conspiracy.csv')
id_to_analyse = []
for column in df1.columns:
    value = df1[column][1][1:-1]
    if int(value) >= 5:
        ind = column.index('/r/')
        id = column[ind+3:]
        id_to_analyse.append(id)
        

for subreddit in id_to_analyse:
    post_list=list()
    subreddit_list = list()
    conspiracy_dict=dict() 
    for i in reddit.subreddit(subreddit).top(limit=5000):
        post_list.append((i.title, i.score, i.url))



    for post in post_list:
        for repost in reddit.subreddit('all').search('url:'+post[2]):
            if repost.subreddit_id != "t5_"+reddit.subreddit(subreddit).id: # TODO: what do we do here? For now, skip reposts inside the same subreddit
                subreddit_url = str(repost.subreddit)
                subreddit_url = "https://www.reddit.com/r/" + subreddit_url
                if subreddit_url in conspiracy_dict.keys():
                    conspiracy_dict[subreddit_url][0].append("https://www.reddit.com"+repost.permalink)
                    conspiracy_dict[subreddit_url][1][0] +=1
                else:
                    conspiracy_dict[subreddit_url]=[[],[1]]
                    conspiracy_dict[subreddit_url][0].append("https://www.reddit.com"+repost.permalink)
    df = pd.DataFrame(conspiracy_dict)
    df.to_csv(r'results/1st_level/'+subreddit+'.csv')
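
Because the counts are stored as one-element lists, they come back from the csv as text such as `"[5]"`, which is why the cell above slices off the brackets. A hedged sketch of the same selection step using `ast.literal_eval` instead of slicing; the helper names and the sample counts are made up for illustration.

```python
import ast

def parse_count(cell):
    """Turn a count cell such as "[5]" (a list written to CSV as text) into an int."""
    value = ast.literal_eval(cell)
    return int(value[0]) if isinstance(value, list) else int(value)

def subreddits_to_analyse(counts_by_url, threshold=5):
    """Return subreddit names whose shared-post count is at least `threshold`."""
    selected = []
    for url, cell in counts_by_url.items():
        if parse_count(cell) >= threshold:
            selected.append(url.split("/r/", 1)[1])
    return selected

# Made-up counts, for illustration only
counts = {
    "https://www.reddit.com/r/news": "[7]",
    "https://www.reddit.com/r/pics": "[2]",
    "https://www.reddit.com/r/science": "5",
}
print(subreddits_to_analyse(counts))  # ['news', 'science']
```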

We repeat the cycle one more time, but this time with just the top 500 posts, in order to speed up the process.

In [None]:
import os
import pandas as pd
top_comments_list = dict()
to_scan = [file for file in get_all_in_dir("results/1st_level_2")]
# os.path.basename extracts the file name whatever the platform's path separator is
to_avoid = [os.path.basename(file) for file in get_all_in_dir("results/1st_level")]
to_avoid.extend(os.path.basename(file) for file in get_all_in_dir("results/1st_level_2"))
to_avoid.extend(os.path.basename(file) for file in get_all_in_dir("results/2nd_level"))

for subr in to_scan:
        id_to_analyse = []
        print(f'now opening {subr}')
        # pandas reads UTF-8 by default, so a single attempt is enough
        try:
                df1 = pd.read_csv(subr, encoding='utf8')
        except Exception:
                print(f'unable to open {subr}')
                continue
        for column in df1.columns:
                try: 
                        value = df1[column][1][1:-1]
                except:
                        continue
                if int(value) >= 5:
                        ind = column.index('/r/')
                        id = column[ind+3:]
                        if not id + '.csv' in to_avoid:
                                try:
                                        post_list=list()
                                        subreddit_list = list()
                                        conspiracy_dict=dict() 
                                        for i in reddit.subreddit(id).top(limit=500):
                                                post_list.append((i.title, i.score, i.url))

                                        for post in post_list:
                                                for repost in reddit.subreddit('all').search('url:'+post[2]):
                                                        if repost.subreddit_id != "t5_"+reddit.subreddit(id).id: # TODO: what do we do here? For now, skip reposts inside the same subreddit
                                                                subreddit_url = str(repost.subreddit)
                                                                subreddit_url = "https://www.reddit.com/r/" + subreddit_url
                                                        else:
                                                                continue
                                                        if subreddit_url in conspiracy_dict.keys():
                                                                conspiracy_dict[subreddit_url][0].append("https://www.reddit.com"+repost.permalink)
                                                                conspiracy_dict[subreddit_url][1][0] +=1
                                                        else:
                                                                conspiracy_dict[subreddit_url]=[[],[1]]
                                                                conspiracy_dict[subreddit_url][0].append("https://www.reddit.com"+repost.permalink)
                                        
                                        df = pd.DataFrame(conspiracy_dict)
                                        df.to_csv(r'results/2nd_level/'+id+'.csv')
                                        with open("results/2nd_level/done_2.txt",'a', encoding = "utf-8") as text_note:
                                                text_note.write(id + "\n")
                                                text_note.close()
                                        
                                except Exception as E:
                                        print(E)
                                        with open("results/2nd_level/error_2.txt",'a', encoding = "utf-8") as text_note:
                                                text_note.write(id + "\n")
                                                text_note.close()
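
Taken together, the three scraping rounds amount to a breadth-first expansion from r/conspiracy that stops after two levels and never revisits a subreddit. A minimal sketch of that control flow, with the scraping replaced by a stand-in: `crawl_levels`, `neighbours` and the toy graph are hypothetical names and data, not part of the notebook.

```python
def crawl_levels(start, neighbours, max_depth=2):
    """Breadth-first expansion over subreddits, visiting each one only once.

    `neighbours` is any callable mapping a subreddit name to the names it
    shares enough posts with; here it stands in for the scraping step.
    """
    visited = {start}
    levels = [[start]]
    for _ in range(max_depth):
        next_level = []
        for name in levels[-1]:
            for other in neighbours(name):
                if other not in visited:  # same role as the `to_avoid` list
                    visited.add(other)
                    next_level.append(other)
        if not next_level:
            break
        levels.append(next_level)
    return levels

# Made-up graph, for illustration only
toy_graph = {
    "conspiracy": ["news", "politics"],
    "news": ["conspiracy", "worldnews"],
    "politics": ["news"],
    "worldnews": ["europe"],
}
levels = crawl_levels("conspiracy", lambda name: toy_graph.get(name, []))
print(levels)  # [['conspiracy'], ['news', 'politics'], ['worldnews']]
```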
                                

## Cleaning the data
To make the data representation steps more efficient, we decided to perform two operations:
1. We turned the number of crossposts into integers.
2. We removed subreddits with fewer than 5 posts in common.
This gave us cleaner and more usable data.

In [None]:
# Strip the brackets around repost counts and drop connections with fewer than 5 reposts

datasets = list(get_all_in_dir("results/1st_level"))
datasets.extend(get_all_in_dir("results/1st_level_2"))
datasets.extend(get_all_in_dir("results/2nd_level"))
# append, not extend: extend would add the path one character at a time
datasets.append("results/conspiracy_data/conspiracy_top_url.csv")

for file in datasets:
    # Try UTF-8 first and fall back to Latin-1 only if decoding fails
    try:
        df = pd.read_csv(file, on_bad_lines='skip', encoding='utf8')
    except UnicodeDecodeError:
        try:
            df = pd.read_csv(file, on_bad_lines='skip', encoding='latin')
        except Exception:
            print(file, 'has problems in formatting')
            continue
    except Exception:
        print(file, 'has problems in formatting')
        continue
    for col in list(df.columns):  # iterate over a copy, since columns are dropped in the loop
        if "u/" in col or 'u_' in col:  # Sometimes users end up in the columns. Remove them.
            df.drop(col, inplace=True, axis=1)
            continue
        try:
            if isinstance(df[col][1], str):
                try:
                    count = int(df[col][1][1:-1])  # "[12]" -> 12
                except ValueError:
                    continue
                df.loc[1, col] = count
                if count < 5:
                    df.drop(col, inplace=True, axis=1)
        except Exception:
            df.drop(col, inplace=True, axis=1)
    df.to_csv(file, encoding="utf8")
        

Given the data from Reddit, we can build a network of shared posts.

In [None]:
#Build the network

# TODO: we need to rethink the organisation here; there must be a csv that contains the posts shared between csv1 and csv2, stating where they come from. This will let us compare the results.

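
One possible shape for the network, sketched under assumptions rather than taken from the project: each origin subreddit maps to the subreddits it shares posts with, and every pair above the threshold becomes a directed, weighted edge. The function `build_edge_list`, the nested-dictionary input and the sample numbers are all hypothetical.

```python
def build_edge_list(shared_posts, min_weight=5):
    """Turn {origin: {target: shared_post_count}} into sorted (origin, target, weight) edges."""
    edges = []
    for origin, targets in shared_posts.items():
        for target, count in targets.items():
            # Skip self-loops and weak connections
            if origin != target and count >= min_weight:
                edges.append((origin, target, count))
    return sorted(edges)

# Made-up counts, for illustration only
shared = {
    "conspiracy": {"news": 12, "pics": 3, "conspiracy": 40},
    "news": {"conspiracy": 9},
}
edges = build_edge_list(shared)
print(edges)  # [('conspiracy', 'news', 12), ('news', 'conspiracy', 9)]
```

Triples in this form can be handed to a graph library afterwards, for example networkx's `add_weighted_edges_from`.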
## Analysing the comments
Part of our inquiry concerns the kind of language that redditors use on the website in response to posts. To study this, we employed two techniques that help characterise a text: Topic Modelling and Sentiment Analysis.

### Topic modelling
Topic modelling is a machine learning technique that estimates the distribution of abstract topics in a text, and thus reveals the hidden semantic structures within it. In particular, we used the Latent Dirichlet Allocation (LDA) method. We adapted the method used [here]().
The libraries used are [nltk](https://www.nltk.org/), [gensim](https://github.com/RaRe-Technologies/gensim/), [spacy](https://spacy.io/) and [pyLDAvis](https://pyldavis.readthedocs.io/en/latest/readme.html).

In [3]:
#!pip install nltk
#!pip install gensim
#!pip install spacy
#!pip install pyLDAvis
import nltk
nltk.download('stopwords')
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['re', 'edit']) 

import re
import os
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy
# Plotting tools
import pyLDAvis
import pyLDAvis.gensim_models as genmodels  # don't skip this
import matplotlib.pyplot as plt

import warnings
#python -m spacy download en_core_web_sm
warnings.filterwarnings("ignore",category=DeprecationWarning)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\david\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Since the number of comments can be very large, we used this code snippet to draw a random sample from the bigger comment sections.

In [None]:
import random
def iterSample(iterable, samplesize):
    results = []
    for i, v in enumerate(iterable):
        r = random.randint(0, i)
        if r < samplesize:
            if i < samplesize:
                results.insert(r, v) # add first samplesize items in random order
            else:
                results[r] = v # at a decreasing rate, replace random items

    if len(results) < samplesize:
        raise ValueError("Sample larger than population.")

    return results
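
`iterSample` works on a stream of unknown length. When the comments of a post already fit in memory, the same kind of uniform draw can be sketched with the standard library's `random.sample`. The helper `sample_up_to` is a made-up name, and unlike `iterSample` it returns everything instead of raising when there are fewer items than requested.

```python
import random

def sample_up_to(items, k, seed=None):
    """Return at most `k` items chosen uniformly at random, without replacement."""
    items = list(items)
    if len(items) <= k:
        return items  # nothing to cut: keep everything, in the original order
    return random.Random(seed).sample(items, k)

small = sample_up_to(range(3), 5)           # fewer items than requested
large = sample_up_to(range(1000), 10, seed=1)  # a seeded draw of 10 items
```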

Here we import the dataset and the comments from the posts in common between subreddits.

In [None]:
# Import Dataset
import pandas as pd
import re
to_scan= [file for file in get_all_in_dir("results/1st_level")]
to_scan.extend(file for file in get_all_in_dir("results/1st_level_2"))
to_scan.extend(file for file in get_all_in_dir("results/2nd_level"))
top_comments_list = dict()
for subreddit in to_scan:
        try: 
                df = pd.read_csv(subreddit, sep=',', encoding='utf8', on_bad_lines='skip')
        except Exception:
                print(subreddit + ' could not be read')
                continue # skip this file instead of reusing the previous DataFrame
        done = pd.read_csv('done.csv', keep_default_na=False, na_values=[""])
                
        done_col = set(done.columns)
        cols = set(df.columns[1:])
        cols = cols.difference(done_col)
        for column in cols: # insert a loop over the permalinks here
                if column not in done_col and 'Unnamed' not in column:
                        
                        list_post=[]
                        list_post= df[column][0].split(',')
                        name= subreddit + " @ " + column
                        for post_url in list_post:
                                if post_url == 0:
                                        continue
                        
                                # tmp = re.sub(r"[\[\]']", "", post_url) would strip all three characters at once; the earlier pattern "\[\]\'" only matched the literal sequence []'
                                if '[' in post_url:
                                        post_url = post_url.replace('[','')
                                if ']' in post_url:
                                        post_url = post_url.replace(']','')
                                try:
                                        post = reddit.submission(url=post_url[1:-1])
                                except Exception as e: # some posts may be removed
                                        continue
                                try:
                                        # Expand at most one "load more comments" placeholder; the remaining ones are dropped.
                                        post.comments.replace_more(limit=1)
                                except Exception as e:
                                        print(e)
                                        print(f'{post} was not accessible')
                                        continue
                                result = []
                                if len(post.comments)> 500:
                                        for comment in iterSample(post.comments, 500):
                                                #result.append(comment)
                                                result.extend(comment.replies.list())
                                else:
                                        for comment in post.comments:
                                                #result.append(comment)
                                                result.extend(comment.replies.list())


                                if name in top_comments_list.keys():
                                        top_comments_list[name].extend(result)
                                else:
                                        top_comments_list[name]=result
                        try:
                                val = top_comments_list[name]
                        except KeyError:
                                continue
                        res=pd.DataFrame({name:val})
                        res.to_csv('comments.csv',mode='a')
                        
                        done.insert(0,column,'NaN')
                        done.to_csv('done.csv' , mode='w')
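
The permalink lists are stored in the csv as the text form of a Python list, which is why the cell above strips brackets and quotes by hand. A hedged alternative sketch that parses the cell back into a list with `ast.literal_eval`; the helper `parse_permalinks` and the sample URLs are made up for illustration.

```python
import ast

def parse_permalinks(cell):
    """Parse a cell like "['https://...', 'https://...']" back into a list of URLs."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal: treat as no links
    if not isinstance(value, list):
        return []
    return [url for url in value if isinstance(url, str)]

cell = "['https://www.reddit.com/r/a/comments/1/x/', 'https://www.reddit.com/r/b/comments/2/y/']"
links = parse_permalinks(cell)
```

This also copes with URLs that themselves contain commas, which a plain `split(',')` would break apart.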


We can now perform the topic modeling: first we need to define the functions.

In [None]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuations

# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out
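
To get a feel for what the cleaning, tokenising and stop-word steps do to a single comment without installing gensim or spaCy, here is a rough standard-library approximation. It is not equivalent to `simple_preprocess` or to the lemmatisation step; the tiny stop list, the helper names and the sample comment are all made up.

```python
import re

STOP_WORDS = {"the", "a", "is", "re", "edit"}  # tiny made-up stop list

def clean_comment(text):
    text = re.sub(r"\S*@\S*\s?", "", text)  # drop e-mail addresses and @-mentions
    text = re.sub(r"\s+", " ", text)         # collapse newlines and repeated spaces
    return text.replace("'", "").strip()     # remove single quotes

def tokenize(text):
    """Lowercase alphabetic tokens of two or more letters, minus stop words."""
    words = re.findall(r"[a-z]{2,}", clean_comment(text).lower())
    return [w for w in words if w not in STOP_WORDS]

tokens = tokenize("Edit: the moon\nlanding isn't real, ask me@example.com")
print(tokens)  # ['moon', 'landing', 'isnt', 'real', 'ask']
```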


We can now write a csv for each crossposting between subreddits with the topic modelling of the common posts.

In [4]:
comments = pd.read_csv('comments.csv', encoding='utf8', on_bad_lines='skip')

In [5]:
comments = pd.read_csv('comments.csv', encoding='utf8', on_bad_lines='skip')
top_comments = dict()
latest_passage = 'results/1st_level\\1984isreality.csv @ https://www.reddit.com/r/ukpolitics'
for row in comments.iterrows():
    if 'reddit' in row[1][1]:
        latest_passage = row[1][1]
    elif latest_passage not in top_comments.keys():
        top_comments[latest_passage] = [row[1][1]]
    elif latest_passage in top_comments.keys():
        top_comments[latest_passage].append(row[1][1])

In [6]:
done = pd.read_csv('done.csv', encoding='utf8', on_bad_lines='skip')
done_set=set(done.columns)
check= set(top_comments.keys())

diff = set(check).difference(done_set)


In [None]:
print(diff)

In [7]:
# Refactor for
for key in diff:
    origin, id = key.split('@ ')
    if '1st_level_2' in origin:
        origin = origin.split('1st_level_2')[1]
        origin = origin.replace('\\','')
        origin = origin.replace('.csv','')
    if '1st_level' in origin:
        origin = origin.split('1st_level')[1]
        origin = origin.replace('\\','')
        origin = origin.replace('.csv','')
    if '2nd_level' in origin:
        origin = origin.split('2nd_level')[1]
        origin = origin.replace('\\','')
        origin = origin.replace('.csv','')

    id=id[id.index('/r/')+3:]
    if 'x0' in key:
        key = key.replace('x0','')

    # Convert to list
    data = []
    mt = top_comments[key]
    if len(mt) > 4:
        for i in mt:
            data.append(reddit.comment(i))
        # Remove e-mail addresses and @-mentions from comments
        try:
            data = [re.sub(r'\S*@\S*\s?', '', sent.body) for sent in data]
        except Exception as e:
            print(e)
            continue

        # Collapse new lines and repeated whitespace into single spaces
        data = [re.sub(r'\s+', ' ', sent) for sent in data]

        # Remove distracting single quotes
        data = [re.sub(r"'", "", sent) for sent in data]

        # Tokenizing the text
        data_words = list(sent_to_words(data))

        # Build the bigram and trigram models
        bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
        trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

        # Faster way to get a sentence clubbed as a trigram/bigram
        bigram_mod = gensim.models.phrases.Phraser(bigram)
        trigram_mod = gensim.models.phrases.Phraser(trigram)

        # Remove Stop Words
        data_words_nostops = remove_stopwords(data_words)

        # Form Bigrams
        data_words_bigrams = make_bigrams(data_words_nostops)

        # Initialize spacy 'en' model, keeping only tagger component (for efficiency)
        # python3 -m spacy download en
        nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

        # Do lemmatization keeping only noun, adj, vb, adv
        data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

        # Create Dictionary
        id2word = corpora.Dictionary(data_lemmatized)

        # Create Corpus
        texts = data_lemmatized

        # Term Document Frequency
        corpus = [id2word.doc2bow(text) for text in texts]

        # Human readable format of corpus (term-frequency)
        [[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

        if corpus != [] and corpus != [[]]:
            # Build LDA model
            lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                    id2word=id2word,
                                                    num_topics=10, 
                                                    random_state=100,
                                                    update_every=1,
                                                    chunksize=100,
                                                    passes=10,
                                                    alpha='auto',
                                                    per_word_topics=True)

            # Collect the keywords of each of the 10 topics

            index = []
            topics = {}
            # formatted=False returns (topic_id, [(word, weight), ...]) pairs, so no string parsing is needed
            for topic_id, words in lda_model.show_topics(num_topics=10, num_words=10, formatted=False):
                topics[topic_id] = [word for word, _ in words]

            
            df_topic = pd.DataFrame(topics)
            done = pd.read_csv('done.csv', encoding='utf8')
            done.insert(0,key,'NaN')
            done.to_csv('done.csv' ,encoding='utf8', mode='w')

            for row in df_topic.index:
                index.append(origin)
            df_topic['origin'] = index
            if os.path.isfile(r"results/topic_models/"+ id + '.csv'):
                df_topic.set_index('origin', inplace=True)
                df_topic.to_csv(r"results/topic_models/"+ id + '.csv', mode='a', encoding = 'utf8', header=False)
            else:
                df_topic.set_index('origin', inplace=True)
                df_topic.to_csv(r"results/topic_models/"+ id + '.csv', encoding = 'utf8')
            #pprint(lda_model.print_topics())
        else: 
            print (key + " is done.")



The resulting csv file will contain the topics of crossposts.
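
The keys used above have the form `path/to/origin.csv @ https://www.reddit.com/r/target`, with Windows-style separators in the path. A small sketch of how such a key could be split into the two subreddit names in one place; the helper `split_key` is hypothetical, and `ntpath` is used because it accepts both `/` and `\` on any platform.

```python
import ntpath

def split_key(key):
    """Split 'path/to/origin.csv @ https://www.reddit.com/r/target' into (origin, target)."""
    origin_path, target_url = key.split(" @ ", 1)
    # ntpath understands both "/" and "\\", so Windows-style keys work anywhere
    origin = ntpath.splitext(ntpath.basename(origin_path))[0]
    target = target_url.split("/r/", 1)[1]
    return origin, target

key = 'results/1st_level\\1984isreality.csv @ https://www.reddit.com/r/ukpolitics'
print(split_key(key))  # ('1984isreality', 'ukpolitics')
```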
# TODO: redo this, counting from where it starts
### Sentiment Analysis
Sentiment analysis is the computational study of people's emotions as expressed in text. In our case we used the popular VADER sentiment analyser, which was created specifically for social media text (in particular, it was based on Twitter). The result of each comment's analysis is a compound score between -1 and 1, depending on whether the comment is perceived as negative or positive.

In [None]:
#nltk.download(['stopwords', "vader_lexicon"]) # Do this the first time you run this script.

In [19]:
comments = pd.read_csv('comments.csv', encoding='utf8', on_bad_lines='skip')
top_comments = dict()
latest_passage = 'results/1st_level\\1984isreality.csv @ https://www.reddit.com/r/ukpolitics'
for row in comments.iterrows():
    if 'reddit' in row[1][1]:
        latest_passage = row[1][1]
    elif latest_passage not in top_comments.keys():
        top_comments[latest_passage] = [row[1][1]]
    elif latest_passage in top_comments.keys():
        top_comments[latest_passage].append(row[1][1])

In [51]:
done = pd.read_csv('done.csv', encoding='utf8', on_bad_lines='skip')
done_set=set(done.columns)
check= set(top_comments.keys())

diff = set(check).difference(done_set)


In [None]:
print(diff)

False

In [8]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from statistics import mean
sent = SentimentIntensityAnalyzer()
for post in diff:
    post_sentiment = [] # per-comment compound scores, averaged with mean() below
    data = []
    origin, id = post.split('@ ')
    if '1st_level_2' in origin:
        origin = origin.split('1st_level_2')[1]
        origin = origin.replace('\\','')
        origin = origin.replace('.csv','')
    if '1st_level' in origin:
        origin = origin.split('1st_level')[1]
        origin = origin.replace('\\','')
        origin = origin.replace('.csv','')
    if '2nd_level' in origin:
        origin = origin.split('2nd_level')[1]
        origin = origin.replace('\\','')
        origin = origin.replace('.csv','')
    id=id[id.index('/r/')+3:]
    if 'x0' in post:
        post = post.replace('x0','')
    mt = top_comments[post]
    if len(mt) > 4:
        for i in mt:
            data.append(reddit.comment(i))
        for comment in data:
            try:
                body = comment.body
                if not "I'm a bot" in body and not 'I am a bot' in body:
                    val = sent.polarity_scores(body)['compound']
                    post_sentiment.append(val)
            except:
                continue
        try:
            post_sentiment = mean(post_sentiment)
        except:
            post_sentiment = 0
        index = []

        sentiment_df = pd.DataFrame({'sentiment': post_sentiment}, index = [origin])
        if os.path.isfile(r"results/sentiment/"+ id + '.csv'):
            sentiment_df.to_csv(r"results/sentiment/"+ id + '.csv', mode='a', encoding = 'utf8', header=False)
        else:
            sentiment_df.to_csv(r"results/sentiment/"+ id + '.csv', encoding = 'utf8')
        done = pd.read_csv('done.csv', encoding='utf8')
        done.insert(0,post,'NaN', allow_duplicates=True)
        done.to_csv('done.csv' ,encoding='utf8', mode='w')
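
The aggregation in the cell above (skip bot comments, average the compound scores, fall back to 0) can be tried in isolation. This is a sketch under assumptions: the scorer is passed in as a plain function so that the example runs without nltk, the scores are made up, and the 0.05 cut-offs in `label` are the thresholds conventionally used with VADER's compound score rather than something the notebook applies.

```python
from statistics import mean

def is_bot_comment(body):
    return "I'm a bot" in body or "I am a bot" in body

def mean_sentiment(bodies, score):
    """Average `score(body)` over non-bot comments; 0 if none are left."""
    scores = [score(body) for body in bodies if not is_bot_comment(body)]
    return mean(scores) if scores else 0

def label(compound, threshold=0.05):
    """Conventional VADER cut-offs: >= 0.05 positive, <= -0.05 negative."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Stand-in scorer with made-up values; in the notebook this would be
# lambda body: sent.polarity_scores(body)['compound']
fake_scores = {"great post": 0.5, "this is awful": -0.25, "I am a bot, beep": 0.9}
average = mean_sentiment(fake_scores, fake_scores.get)
print(average, label(average))  # 0.125 positive
```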

# To plot with pandas etc.: invert the column, convert it to int, drop the useless columns, and go.