# Getting your data

To run any script, and to query the API in general, you will need a token. A new one is generated every time you install facebook.tracking.exposed.

You can use the test one or enter your own. Read this if you don't know how to get your token: link.

#### Use the script download_facebook.py in src/ to download the data, then locate the csv file.

In [None]:
summary_path = 'sample_data/facebook_summary.csv'

Import the necessary libraries. In this example we commented out the hierarchical configuration used to call scripts from the command line.

In [None]:
import pandas as pd
from src.lib import API, tools
# from src.lib.config import config
import datetime

print('Done!')

Now we read the csv downloaded with facebook_download.py. Remember that you can choose the number of entries to retrieve with the parameters --amount and --skip.

In [None]:
df = pd.read_csv(summary_path)
print('Done!')

This is what the data looks like:

In [None]:
from IPython.display import display

display(df.head())

## Manipulating dates

Now you can check the timeframe of the data you pulled.

In [None]:

df = tools.setDatetimeIndex(df)
maxDate = df.index.max()
minDate = df.index.min()
print('Information for timeframe: '+str(minDate)[:-6]+' to '+str(maxDate)[:-6])

Optionally, you can also trim the data to a shorter window. In this example we keep only the last 24 hours.

In [None]:
start = maxDate-datetime.timedelta(days=1)
end = maxDate
df = tools.setTimeframe(df, str(start), str(end))
print('From '+str(start)+' to '+str(end)+'\n')
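
If you are curious what these two helpers roughly amount to, here is a minimal sketch in plain pandas. It is not the actual implementation of `tools.setDatetimeIndex` or `tools.setTimeframe`. It assumes the timestamp column is called `impressionTime` (the name used in the charts further down) and runs on a tiny made-up frame.

```python
import datetime
import pandas as pd

# Made-up sample; the real summary has many more columns.
sample = pd.DataFrame({
    'impressionTime': ['2019-03-01T08:00:00Z',
                       '2019-03-02T09:30:00Z',
                       '2019-03-02T21:15:00Z'],
    'source': ['A', 'B', 'A'],
})

# Parse the timestamps and use them as a sorted index.
sample['impressionTime'] = pd.to_datetime(sample['impressionTime'], utc=True)
sample = sample.set_index('impressionTime').sort_index()

# Keep only the 24 hours before the newest entry (both ends inclusive).
newest = sample.index.max()
last_day = sample.loc[newest - datetime.timedelta(days=1):newest]
print(len(last_day), 'of', len(sample), 'rows fall in the last 24 hours')
```

Sorting the index first matters: label-based slicing on a datetime index is only reliable when the index is ordered.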

## Your stats

You can get useful insights for yourself. For example, you can estimate the time you spent on Facebook during that timeframe.

In [None]:
timelines = df.timeline.unique()
total = pd.to_timedelta(0)

for t in timelines:
    ndf = tools.filter(t, df=df, what='timeline', kind='or')
    timespent = ndf.index.max() - ndf.index.min()
    total += timespent
    
print('Time spent on Facebook in this timeframe: '+str(total))
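
The same estimate can be computed without an explicit loop. The sketch below assumes only a datetime index and a `timeline` column, as in the cell above. The data and the names `sample`, `spans` and `total_sketch` are made up for illustration.

```python
import pandas as pd

# Made-up sample: two timelines with three and two impressions.
idx = pd.to_datetime([
    '2019-03-02 09:00:00', '2019-03-02 09:04:00', '2019-03-02 09:10:00',
    '2019-03-02 20:00:00', '2019-03-02 20:05:00',
])
sample = pd.DataFrame({'timeline': ['t1', 't1', 't1', 't2', 't2']}, index=idx)

# Move the index into a regular column so it can be grouped.
flat = sample.reset_index().rename(columns={'index': 'time'})
grouped = flat.groupby('timeline')['time']

# Span of each timeline: last impression minus first impression.
spans = grouped.max() - grouped.min()
total_sketch = spans.sum()
print('Time spent on Facebook in this sample: ' + str(total_sketch))
```

Either way, the estimate is the span between the first and last impression of each timeline, so a timeline with a single impression contributes zero.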

Or the time spent watching ads.

In [None]:
nature = df.nature.value_counts()

# A missing category counts as zero instead of raising an error.
sponsored = nature.get('sponsored', 0)
posts = nature.sum()

# Share of sponsored posts among all posts seen (not sponsored vs. organic).
share = sponsored / posts if posts else 0
print('{:.2f}% of the posts are sponsored posts.'.format(share * 100))

# total_seconds() includes whole days; .seconds alone would drop them.
timeads = int(total.total_seconds() * share)
print('You spent an estimate of '+str(datetime.timedelta(seconds=timeads))+' watching ads on Facebook.')

You can also check which news sources appear most often in your feed.

In [None]:
n = 5
top = df.source.value_counts().nlargest(n)
print('Top '+str(n)+' sources of information are: \n'+top.to_string())
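
Raw counts are hard to compare across timeframes of different sizes. As a small sketch on made-up data, `value_counts(normalize=True)` returns each source's share of all posts instead of a count. The frame and the name `shares` are illustrative only.

```python
import pandas as pd

# Made-up sample: 8 posts from three sources.
sample = pd.DataFrame({'source': ['A', 'A', 'B', 'A', 'C', 'A', 'B', 'A']})

# Fraction of posts per source, keeping the two largest.
shares = sample['source'].value_counts(normalize=True).nlargest(2)
print((shares * 100).round(1).to_string())
```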

Of course, you can display this data graphically. (Run the cell twice if it doesn't work.)

In [None]:
top.plot.pie(autopct='%.2f', fontsize=13, figsize=(6, 6))

## Experimenting with altair viz tools

Exploring seen posts

In [None]:
import altair as alt

# for the notebook only (not for JupyterLab) run this command once per session
alt.renderers.enable('notebook')

alt.Chart(df).mark_circle().encode(
    x='impressionTime:T',
    y='LIKE:Q',
    color='source:N'
).interactive()

In [None]:
df['count'] = df.groupby('postId')['postId'].transform('count')

alt.Chart(df).transform_calculate(
    url='https://www.facebook.com' + alt.datum.permaLink
).mark_circle().encode(
    y='count:Q',
    x='average(impressionOrder):Q',
    color='source:N',
#     size='LIKE:Q',
    href='url:N',
    tooltip=['source:N', 'url:N']
).interactive()

## Creating wordclouds with the text of the posts

In [None]:
#import necessary modules
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from stop_words import get_stop_words

# get the stopwords for your language
stop_words = get_stop_words('es')

# define the function 
def generate_wordcloud(text):
    wordcloud = WordCloud(font_path='src/fonts/DejaVuSans.ttf',
                          relative_scaling = 1.0,
                          stopwords = stop_words # set or space-separated string
                          ).generate(text)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
    
text = df.texts.str.join(sep='').reset_index()
# .str().join(sep=',').reset_index()
print(text)
text.columns = ['date', 'words']
text = text.words.str.cat(sep=' ')

generate_wordcloud(text)