# Facebook Data Miner

This notebook showcases the features of this Python application.
It also has a secondary goal: exploring the possibilities here gives me insights for designing the app's CLI, since it is not yet entirely clear how the data relates and exactly which information can be accessed where.

## Introduction

One day I downloaded my Facebook data, and it turned out to be 8.2 GB. That's a lot. Like, how is that even possible? I started looking into the directories and found some interesting media files, chat messages and posts. I wanted to dive deeper, but I knew that browsing by hand would not scale past a certain point.

Then I googled `facebook data analyzer github`, hoping I could just download a package and do the analysis. As you can see in this [list](https://github.com/topics/facebook-data-analyzer), there is not a lot out there.

There is a nice library written in Ruby with more than 100 commits and 505 stars, but it's more than 2 years old now. The other thing is that I was looking for a solution with which I can mess around (e.g. maybe use the data getters to do some deep learning on my message history) and I didn't really want to get into Ruby (sorry Ruby users).

There are also some Python packages, but they offer little to no data interface. Most of them focus on visualizing the results with plots, and the data getters were designed around the plotter functions. That said, you may want to check out [fviz](https://github.com/itzmeanjan/fviz), which offers really nice plots, probably nicer than this library will ever have.

## Purpose

So I decided to write my own library, for two main reasons (among others):

1. I want the insights from my Facebook data, and I want them to be really specific.
2. I want to improve my Python and gain expertise in Python's data science libraries. Gaining more skills in Git and GitHub, software design and project planning would not hurt either.


## Analyzing your Facebook data
To get your Facebook data, refer to the [README.md](README.md) file.

Let's specify the path of the zip file.

In [None]:
import os
DATA_PATH = os.path.join(os.getcwd(), "data")
DATA_PATH

### Unzip and describe the data

In [None]:
# TODO unzip function
# TODO test if it is jsons and english

Let's see what's in the box.

In [None]:
!ls data

These are the categories in which Facebook ordered the data they have on you. 

In this notebook we will focus on `friends` and `messages`.

First we import all the dependencies we will need for running the notebook.

In [None]:
from datetime import datetime

from IPython.display import Image


from miner.app import App
from miner import utils

## Application
The main entrypoint of the application is a class named `App` (not so creatively). It is a CLI, which you can use from the command line, but you can also use it as a module.

This notebook will give you a good grasp of how this application works and how you can use it.

Let's dive right into it. The execution of this cell might take a while, depending on how much data you have and how fast your CPU is. Right now the constructor of this class reads in the JSON files we are analyzing and creates some basic data structures from them.

In [None]:
app = App(DATA_PATH) 

### Friends
The first feature we have is a straightforward one. We get our friends from Facebook, and the date the connection was made.

For this Facebook provides us with the `friends` directory/category. Here is what is inside.

In [None]:
!ls data/friends

In [None]:
friends = app.friends
friends

We can have a look at what kind of methods and properties this object exposes. We use a wrapper around the `dir` built-in, so we only see methods and properties that do not start with an underscore.

In [None]:
print(utils.dir_stripped(friends))
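
To make the idea of such a wrapper concrete, here is a minimal sketch of a `dir` filter. It is not the actual implementation of `utils.dir_stripped`; the function name `dir_without_private` and the `Example` class are invented for this illustration.

```python
def dir_without_private(obj):
    """List attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


class Example:
    """Toy class with one public attribute, one private attribute and one method."""

    def __init__(self):
        self.data = []
        self._cache = {}

    def describe(self):
        return "example"


# Only the public names remain; `_cache` and all dunder names are filtered out.
public_names = dir_without_private(Example())
print(public_names)
```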

The `data` property of the `friends` object is a pandas DataFrame that contains all of your Facebook friends. Each timestamp is the date and time at which the friendship was made.

Here are 5 samples from the friends DataFrame.

In [None]:
friends.data.sample(5)

Let's see what pandas can tell us about the DataFrame.

In [None]:
friends.data.describe()

In [None]:
# TODO rate of making friends
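
The rate of making friends is still a TODO. One possible way to compute it, sketched here on made-up timestamps rather than on the real `friends.data`, is to count new friendships per calendar month and average over the months that had any.

```python
from collections import Counter
from datetime import datetime

# Made-up timestamps standing in for the timestamps of the friends DataFrame.
timestamps = [
    datetime(2019, 1, 5),
    datetime(2019, 1, 20),
    datetime(2019, 3, 2),
    datetime(2020, 3, 14),
]

# Count new friendships per (year, month) bucket.
friends_per_month = Counter((ts.year, ts.month) for ts in timestamps)

# Average number of new friends per month, over the months that had any.
average_rate = sum(friends_per_month.values()) / len(friends_per_month)
```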

### Conversations

`Conversations` is a class that reads and stores conversation data, that is, both `private` and `group` messages.

Both the `private` and `group` properties of the `Conversations` class have the type `Dict[str, Conversation]`, where:
* a `key` is either a private conversation partner or a group conversation name (basically the name of the channel),
* a `value` is a `Conversation`, a container class that holds a channel's data and metadata.

The difference between `private` and `group` is not that big. The design decision to handle them separately comes down to the fact that the relation between participants and channels differs: in private messages one participant relates to exactly one channel, while in group messages one participant can relate to several channels. Symmetry is still kept by using the same classes for both (no subclasses).
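
The participant-to-channel relation described above can be sketched with plain dictionaries. `ConversationSketch` and its fields are hypothetical stand-ins, not the real `Conversation` class, and all names are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ConversationSketch:
    """Hypothetical stand-in for `Conversation`: channel metadata plus messages."""
    title: str
    participants: List[str]
    messages: List[dict] = field(default_factory=list)


# Private channels are keyed by the partner's name, group channels by the group's name.
private: Dict[str, ConversationSketch] = {
    "Alice": ConversationSketch("Alice", ["Alice", "Me"]),
    "Bob": ConversationSketch("Bob", ["Bob", "Me"]),
}
group: Dict[str, ConversationSketch] = {
    "Hiking": ConversationSketch("Hiking", ["Alice", "Bob", "Me"]),
    "Family": ConversationSketch("Family", ["Alice", "Me"]),
}


def channels_of(person, conversations):
    """Return the names of the channels in which `person` participates."""
    return [name for name, convo in conversations.items() if person in convo.participants]
```

With this toy data, `channels_of("Alice", private)` yields a single channel, while `channels_of("Alice", group)` yields two.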

In [None]:
conversations = app.conversations
conversations

In [None]:
print(utils.dir_stripped(conversations))

We have these two properties to use. We can check how many private and how many group conversations we have.

In [None]:
len(conversations.private), len(conversations.group)

### Messaging Analyzer
Now we have reached the point where we can actually *analyze* our data.

`app.analyzer` returns an instance of `MessagingAnalyzerManager`, which manages the analysis of both `private` and `group` messages, as well as the interaction between the two, e.g. getting all messages for one person (from both private and group channels).

**IMPORTANT NOTE**: To discover the features of this class and its composite classes, we need to be able to refer to at least one private conversation partner and at least one group conversation. 
In the next cell, set these variables to the name of one of your private conversation partners and the name of one of your group conversations.

In [None]:
PARTNER_NAME = "Dániel Nagy"  # TODO test whether such a name exists; if not, maybe suggest close matches ("no name like this, but maybe this?"), like git does for commands

GROUP_NAME = '420' 
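
The TODO above asks for suggestions when a name is not found. One way this might be done is with `difflib.get_close_matches` from the standard library. This is only a sketch: `suggest_names` is a made-up helper, and the list of known names is invented.

```python
import difflib


def suggest_names(query, known_names, limit=3):
    """Return up to `limit` known names that closely resemble `query`."""
    return difflib.get_close_matches(query, known_names, n=limit, cutoff=0.6)


# Invented names; in the notebook they could come from the keys of `conversations.private`.
known = ["Dániel Nagy", "Daniel Nagy Jr", "Anna Kovács"]
suggestions = suggest_names("Daniel Nagy", known)
print(suggestions)
```

The `cutoff` of 0.6 is the default of `get_close_matches`; a higher value gives fewer but closer suggestions.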

Now we can go ahead and start our discoveries. 

In [None]:
analyzer = app.analyzer

If we run the next cell, we will see that this object has a lot of methods and properties.

In [None]:
print(utils.dir_stripped(analyzer))

Let's now take the group above and see which of its participants we also have a private conversation with.

In [None]:
analyzer.get_who_i_have_private_convo_with_from_a_group(GROUP_NAME)

Now we can also check how much I speak with these participants.

In [None]:
analyzer.how_much_i_speak_in_private_with_group_members(GROUP_NAME)

Let's see whether my first message with the above partner was in a private or in a group conversation.

In [None]:
analyzer.is_priv_msg_first_then_group(PARTNER_NAME)

Now let's get all the stats for one person. This method has two possible modes: we either look at all the messages sent in the private channel and in every group the queried partner is in, including the other participants' messages, OR only at the messages sent by the partner in any of those channels.

`ConversationStats` is a versatile class; we will cover it soon.

In [None]:
stats = analyzer.get_stats_together(PARTNER_NAME, subject='all') # TODO add support for name as a subject!!!
stats

Now we can print some statistics from this object, just to see how it works. But let's not spoil too much about it yet.

In [None]:
stats.number_of_channels

In [None]:
count = min(5, stats.number_of_channels)
stats.channels[:count]

In [None]:
stats.df.sample(5)

### Private and Group Messaging Analyzer


So let's finally see the similarities and differences between `private` and `group` analyzers by looking into these objects. 

It may help to know that both of these objects are created by passing the `private` and `group` properties of `Conversations` to their constructor. That means: `analyzer.private.data = conversations.private` and `analyzer.group.data = conversations.group`.

In [None]:
private = analyzer.private
group = analyzer.group

In [None]:
assert type(private) == type(group)
type(group)

As you can see, both objects are instances of the `MessagingAnalyzer` class. Every property and method is available for both types of conversations, although some properties make more sense for one than for the other.

First we get the number of channels.

In [None]:
len(private), len(group)

Then let's list all the properties and methods we can use with these objects.

In [None]:
print(utils.dir_stripped(private))

We have this `is_group` property, which should be pretty intuitive.

In [None]:
private.is_group, group.is_group

Let's see the number of all the messages ever sent from any party in private and in group messages. We do this by inspecting the shape of the DataFrame bound to these objects. 

Note that `private.df` or `group.df` is created by stacking together all the `Conversation` DataFrames these objects contain.

In [None]:
private.df.shape[0], group.df.shape[0]

Let's see the number of participants. Note that, e.g. for me, this is not the same as `len(private), len(group)`, but only because of some weird (possibly Unicode) characters.

In [None]:
len(private.participants), len(group.participants)

Let's see the ranking of my friends by messages sent. It could be very different for private and group messages.

In [None]:
by_count, by_percent = private.get_ranking_of_senders_by_convo_stats(statistic='mc', top=10)
by_count

In [None]:
by_count, by_percent = group.get_ranking_of_senders_by_convo_stats(statistic='mc', top=10)
by_percent

Chances are your name will appear in the group ranking. This is because of an important difference in how these values get computed: the `private` object calculates the values by channel, while the `group` object calculates them by sender. It can be a bit confusing, but this seems the most intuitive solution.
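
To illustrate the by-channel versus by-sender difference, here is a toy sketch on made-up `(channel, sender)` pairs. It is one possible reading of the behaviour described above, not the library's actual code.

```python
from collections import Counter

# Made-up messages as (channel, sender) pairs; "Me" is the owner of the export.
private_messages = [("Alice", "Alice"), ("Alice", "Me"), ("Bob", "Me")]
group_messages = [("Hiking", "Alice"), ("Hiking", "Me"), ("Hiking", "Me")]

# Private ranking: count per channel, so my own messages are credited to the partner's channel.
private_ranking = Counter(channel for channel, _ in private_messages)

# Group ranking: count per sender, so "Me" shows up as a name of its own.
group_ranking = Counter(sender for _, sender in group_messages)
```

In this sketch `"Me"` never appears in `private_ranking`, but it leads `group_ranking`.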

We can also get the portion of our conversation partners' contribution in percent. Let's look at that.

In [None]:
private.get_portion_of_contribution(top=10)

Note that you can also change the metric by which we calculate the portion of contribution. Let's do a calculation based on `character count`.

In [None]:
group.get_portion_of_contribution(statistic='cc', top=10)
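
The choice of metric can change the picture quite a bit. The following sketch computes each sender's share in percent on made-up messages, once by message count and once by character count. `portion_of_contribution` here is a simplified illustration, not the method of the analyzer.

```python
def portion_of_contribution(messages, statistic="mc"):
    """Return each sender's share in percent, by message count ('mc') or character count ('cc')."""
    totals = {}
    for sender, text in messages:
        amount = 1 if statistic == "mc" else len(text)
        totals[sender] = totals.get(sender, 0) + amount
    grand_total = sum(totals.values())
    return {sender: 100 * amount / grand_total for sender, amount in totals.items()}


# Made-up (sender, text) pairs.
messages = [("Alice", "hi"), ("Alice", "ok"), ("Bob", "a much longer message")]
by_messages = portion_of_contribution(messages, statistic="mc")
by_characters = portion_of_contribution(messages, statistic="cc")
```

Alice sends more messages, but Bob writes far more characters, so the two rankings are reversed.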

We have these handy properties as well: `most_contributed`, `least_contributed`. They measure contribution by `message count`.

In [None]:
private.most_contributed, group.most_contributed

In [None]:
private.least_contributed, group.least_contributed

`get_stat_count` is a function with which you can get specific countable statistics for all of your private or group conversations. It uses the `ConversationStats` class to get the statistics from. The nice thing about this function is that you can filter the underlying `ConversationStats` by channel, sender, date range and type of message.

Let's now get the word count (aliased as `wc`) of all the messages someone else sent me (aliased as `partner`).

In [None]:
private.get_stat_count(attribute='wc', subject='partner')

In [None]:
group.get_stat_count(attribute='wc', subject='partner')
# partner here means: not me

We can also count how many conversations were started by me.

In [None]:
private.number_of_convos_created_by_me

In [None]:
group.number_of_convos_created_by_me

We also have the `{min|mean|max}_channel_size` properties, which measure the corresponding statistics for the channels registered in the current analyzer. For private messages they make less sense than for group messages, but for the sake of symmetry we have these properties for both.

In [None]:
private.min_channel_size, private.max_channel_size


In [None]:
group.min_channel_size, group.mean_channel_size, group.max_channel_size

#### Filtering
You can filter the analyzers. Both `private` and `group` conversation analyzers can be filtered along two dimensions. The first is `channels`, which is quite straightforward. The other is `senders`.

In case of `private` messages, `channels` and `senders` yield exactly the same result, although they are calculated a bit differently (I suggest using `channels`, which is faster).

In case of `group`, `channels` filters by group conversation names, while `senders` filters by people who are part of a group message.

Both parameters can be passed as a string or a list of strings.

The `filter` method will create a new `Analyzer` instance.
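
As a rough sketch of this filtering contract (string or list of strings in, new instance out), consider the following toy class. `AnalyzerSketch` and `as_list` are invented for the illustration and only mimic the behaviour described above.

```python
def as_list(value):
    """Accept None, a single string or a list of strings; return None or a list."""
    if value is None:
        return None
    return [value] if isinstance(value, str) else list(value)


class AnalyzerSketch:
    """Hypothetical stand-in for the analyzer: maps channel names to participant lists."""

    def __init__(self, data):
        self.data = data

    def filter(self, channels=None, senders=None):
        channels, senders = as_list(channels), as_list(senders)
        kept = {
            name: participants
            for name, participants in self.data.items()
            if (channels is None or name in channels)
            and (senders is None or any(s in participants for s in senders))
        }
        # Return a new object so the original analyzer stays untouched.
        return AnalyzerSketch(kept)


groups = AnalyzerSketch({"Hiking": ["Alice", "Bob"], "Family": ["Alice"], "Work": ["Carol"]})
by_channel = groups.filter(channels="Hiking")
by_sender = groups.filter(senders=["Alice"])
```

Returning a new object instead of mutating in place means the unfiltered analyzer can be reused for further queries.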

Let's first filter private messages.

In [None]:
private_filtered_by_channels = private.filter(channels=PARTNER_NAME)
private_filtered_by_channels

In [None]:
private_filtered_by_senders = private.filter(senders=PARTNER_NAME)
private_filtered_by_senders

Let's see the number of messages in both.

In [None]:
private_filtered_by_channels.df.shape, private_filtered_by_senders.df.shape

To make sure they yield the same result, we will compare some more properties of the two objects.

In [None]:
private_filtered_by_channels.participants, private_filtered_by_senders.participants

In [None]:
private_filtered_by_channels.number_of_convos_created_by_me, private_filtered_by_senders.number_of_convos_created_by_me

Here the `get_all_channels_for_one_person` method makes little sense. It returns the same as the `channels` property of the `stats` object.

In [None]:
private_filtered_by_channels.get_all_channels_for_one_person(PARTNER_NAME) == private_filtered_by_senders.stats.channels

Now we can filter the group object as well. Here the two possible filtering parameters actually make sense. 

First we filter by `channels`.

In [None]:
group_filtered_by_channels = group.filter(channels=GROUP_NAME)
group_filtered_by_channels

The `data` property should only contain one entry now.

In [None]:
len(group_filtered_by_channels.data.items())

In [None]:
group_filtered_by_channels.data.items()

On the other hand, if we filter by our chosen partner (defined in the `PARTNER_NAME` variable), we may get several groups: in fact, all the groups in which our partner has *participated* (not necessarily contributed).

In [None]:
group_filtered_by_senders = group.filter(senders=PARTNER_NAME)
group_filtered_by_senders

In [None]:
len(group_filtered_by_senders.data.items())

This should be the same as the number of channels returned by the following method of the analyzer.

In [None]:
len(group.get_all_channels_for_one_person(PARTNER_NAME))

### ConversationStats
As the name suggests, this class is a container for statistical data about conversations. The basic concept is that it does not know general conversation metadata, since it is constructed only from the messages and the metadata of the individual messages (who sent it, what kind of message it is, when it was sent). The `MessagingAnalyzer` class creates this object by passing in a DataFrame as input. The DataFrame is created from all the conversations that the analyzer holds (remember, you can filter them down to a single conversation).

So to sum it up, `MessagingAnalyzer` knows about the channels and all the metadata of the conversations, while `ConversationStats` only knows about the messages themselves.

`ConversationStats` has a lot of interesting properties and methods, and also a pretty versatile filtering function, so let's discover them.

#### Private and Group Conversation Stats
Note: we follow the strategy of showing private and group conversation stats side by side, as we did with the analyzer.

In [None]:
private_stats = private.stats
group_stats = group.stats

Do both of them have the same properties?

In [None]:
assert dir(private_stats) == dir(group_stats)

Seems so... Let's print out the useful ones.

In [None]:
print(utils.dir_stripped(private_stats))

Let's print out one channel from both.

In [None]:
private_stats.channels[0], group_stats.channels[0]

Let's see the number of channels for both of them. The numbers may look familiar: they should be the same as what we saw for the analyzer class.

In [None]:
private_stats.number_of_channels, group_stats.number_of_channels

Contributors are all the people who ever sent a message. Apparently (at least for me) some participants never did. This is because nowadays, when you make a new friend on Facebook, you get connected on Messenger automatically, regardless of whether you have exchanged messages or not.

In [None]:
private_stats.number_of_contributors, group_stats.number_of_contributors

Your first messages ever (either sent or received).

In [None]:
private_stats.start, group_stats.start

Last messages before download.

In [None]:
private_stats.end, group_stats.end

Let's show some samples of our data.

In [None]:
len(private_stats.messages), len(private_stats.df)

In [None]:
private_stats.messages.sample(5) # TODO this does not have columns

In [None]:
group_stats.messages.sample(5) # TODO 

Samples of text messages.

In [None]:
private_stats.text.sample(5)

In [None]:
group_stats.text.sample(5)

A sample of media messages.

In [None]:
private_stats.media.sample(5)

In [None]:
group_stats.media.sample(5)

For me this was really interesting to look back on. Remember, this is only a demo: feel free to check out bigger sample sizes or, if you have the skills, play around with these DataFrames. They contain more than 90% of all the data we analyze here.

We can also get all the words that have ever been sent by anyone to any channel. You could build a language model from this. :)

In [None]:
private_stats.words

In [None]:
group_stats.words

We can also calculate the average word length.

In [None]:
private_stats.average_word_length

In [None]:
group_stats.average_word_length

Now for the numbers. Let's get the message, word and character counts.

In [None]:
private_stats.mc, private_stats.wc, private_stats.cc

In [None]:
group_stats.mc, group_stats.wc, group_stats.cc

Text messages count and media message count.

In [None]:
private_stats.text_mc, private_stats.media_mc

In [None]:
group_stats.text_mc, group_stats.media_mc

Unique message count and unique word count.

In [None]:
private_stats.unique_mc, private_stats.unique_wc

In [None]:
group_stats.unique_mc, group_stats.unique_wc
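
To make the abbreviations concrete, here is how such counts could be derived from a list of text messages. This is a sketch on made-up messages. It assumes that words are split on whitespace and that the character count includes spaces, which may differ from what `ConversationStats` actually does.

```python
# Made-up text messages.
messages = ["hello there", "ok", "hello there", "see you soon"]

# Split every message on whitespace to get a flat list of words (an assumption).
words = [word for message in messages for word in message.split()]

mc = len(messages)                               # message count
wc = len(words)                                  # word count
cc = sum(len(message) for message in messages)   # character count, spaces included
unique_mc = len(set(messages))                   # number of distinct messages
unique_wc = len(set(words))                      # number of distinct words
average_word_length = sum(len(word) for word in words) / wc
```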

We can also calculate the percentage of text and media messages.

In [None]:
private_stats.percentage_of_text_messages, private_stats.percentage_of_media_messages

In [None]:
group_stats.percentage_of_text_messages, group_stats.percentage_of_media_messages

Wow! For me in groups it is a lot more likely to get a media message. What about you?

We can also get the most used messages and words in our conversations.

In [None]:
private_stats.most_used_msgs[:10] # TODO remove emoticons?!

In [None]:
group_stats.most_used_msgs[:10] # TODO remove emoticons?!

In [None]:
private_stats.most_used_words[:10]

In [None]:
group_stats.most_used_words[:10]
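
A ranking like this can be sketched with `collections.Counter`. The messages are made up, and lowercasing the words is my own assumption, not necessarily what the library does.

```python
from collections import Counter

# Made-up text messages.
messages = ["ok", "see you", "ok", "ok see you later"]

# Most frequent whole messages.
most_used_msgs = Counter(messages).most_common(2)

# Most frequent words, lowercased and split on whitespace (an assumption).
most_used_words = Counter(
    word.lower() for message in messages for word in message.split()
).most_common(3)
```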

You can also query `files`, `photos`, `videos`, `audios`, `gifs` even.

In [None]:
private_stats.photos

And of course you have downloaded those files as well with your facebook data. Let's see one of the photos.

In [None]:
Image(f"data/{private_stats.photos.iloc[2][0].get('uri')}")

And here are the other media types. Feel free to mess around with them.

In [None]:
len(private_stats.photos), len(private_stats.videos), len(private_stats.audios), len(private_stats.gifs), len(private_stats.files)

In [None]:
len(group_stats.photos), len(group_stats.videos), len(group_stats.audios), len(group_stats.gifs), len(group_stats.files)

##### Date and time related stats
The `get_grouped_time_series_data` function lets you group all stats by year, month, day or hour. Set the `period` parameter to the first letter of one of these keywords (e.g. `period='d'` for days).

In [None]:
private_stats.get_grouped_time_series_data(period='y')

In [None]:
group_stats.get_grouped_time_series_data(period='m')

The `stat_per_period` function is another interesting method. You can pass the same period values as in the `get_grouped_time_series_data` method, but here it will group by relative time range. For example, it can group all the messages into 12 months, regardless of which year they were sent in.

This gives you pretty good insight into which parts of the year, week or day you are active on Facebook Messenger.

In [None]:
private_stats.stat_per_period(period='m')

In [None]:
private_stats.stat_per_period(period='d')

In [None]:
group_stats.stat_per_period(period='h')
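
The difference between the two kinds of grouping can be sketched on a handful of made-up timestamps. This only illustrates the idea of absolute versus relative buckets and is not how the two methods are implemented.

```python
from collections import Counter
from datetime import datetime

# Made-up message timestamps (year, month, day, hour).
timestamps = [
    datetime(2018, 1, 3, 9),
    datetime(2018, 1, 20, 21),
    datetime(2019, 1, 7, 9),
    datetime(2019, 6, 1, 9),
]

# Absolute grouping: every calendar month is a bucket of its own.
per_calendar_month = Counter((ts.year, ts.month) for ts in timestamps)

# Relative grouping: months are pooled across years, hours across days.
per_month_of_year = Counter(ts.month for ts in timestamps)
per_hour_of_day = Counter(ts.hour for ts in timestamps)
```

Here January 2018 and January 2019 are separate buckets in the absolute grouping, but land in the same bucket in the relative one.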

#### Filtering
The filtering for `ConversationStats` works the same way as it does for `MessagingAnalyzer`. You can call it like this: `stats.filter(**kwargs)`, and it will return a narrower instance of itself.

With the `.filter()` method we are filtering the underlying DataFrame behind the scenes. Here you can filter along the `channels` and `senders` dimensions, and additionally by `subject`, `start`, `end` and `period`. All of these filters are now applied to the DataFrame itself.

Subject is 

In [None]:
# does not make sense for multiple channels
private_stats.creator, group_stats.creator