In [2]:
import pandas as pd

# Problem Description

Write a query to calculate the distribution of comments by the count of users that joined Meta/Facebook between 2018 and 2020, for the month of January 2020. 

The output should contain a count of comments and the corresponding number of users that made that number of comments in Jan-2020. For example, you'll be counting how many users made 1 comment, 2 comments, 3 comments, 4 comments, etc in Jan-2020. Your left column in the output will be the number of comments while your right column in the output will be the number of users. Sort the output from the least number of comments to highest.

To add some complexity, there might be a bug where a user's post is dated before the user's join date. You'll want to remove these posts from the result.

## First look at Data

In [4]:
fb_comments = pd.read_csv('fb_comments.csv')
fb_comments.head(3)

Unnamed: 0,user_id,body,created_at
0,89,Wrong set challenge guess college as position.,2020-01-16 00:00:00
1,33,Interest always door health military bag. Stor...,2019-12-31 00:00:00
2,34,Physical along born key leader various. Forwar...,2020-01-08 00:00:00


In [5]:
fb_users = pd.read_csv('fb_users.csv')
fb_users.head(3)

Unnamed: 0,id,name,joined_at,city_id,device
0,4,Ashley Sparks,2020-06-30 00:00:00,63,2185
1,8,Zachary Tucker,2018-02-18 00:00:00,78,3900
2,9,Caitlin Carpenter,2020-07-23 00:00:00,60,8592


## First Thoughts
* filter for comments made in January 2020

* delete comments that occur before join date

* groupby user_id and count comments

* groupby count_comments and count users
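
The last two steps amount to counting twice: first comments per user, then users per comment count. Here is a minimal sketch on hypothetical toy data (the `toy_comments` frame is made up for illustration and is not the notebook's CSV), using `value_counts` as one possible way to do both counts:

```python
import pandas as pd

# Toy data (hypothetical): one row per comment.
toy_comments = pd.DataFrame({"user_id": [1, 1, 2, 3, 3, 3, 4]})

# Comments per user: user 1 -> 2, user 2 -> 1, user 3 -> 3, user 4 -> 1.
per_user = toy_comments["user_id"].value_counts()

# Users per comment count, sorted by comment count ascending.
toy_dist = (
    per_user.value_counts()
    .sort_index()
    .rename_axis("n_comments")
    .reset_index(name="n_users")
)
print(toy_dist)
```

Two users made one comment each, one user made two, and one user made three, so the toy distribution has three rows.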

## Data Analysis

In [8]:
# Check for missing values and column dtypes
print(fb_comments.info())
print(fb_users.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     100 non-null    int64 
 1   body        100 non-null    object
 2   created_at  100 non-null    object
dtypes: int64(1), object(2)
memory usage: 2.5+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         24 non-null     int64 
 1   name       24 non-null     object
 2   joined_at  24 non-null     object
 3   city_id    24 non-null     int64 
 4   device     24 non-null     int64 
dtypes: int64(3), object(2)
memory usage: 1.1+ KB
None


There are no missing values, but the date columns are stored as strings (object dtype) rather than datetimes.

date -> to_datetime

In [12]:
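# The raw strings include a time component (e.g. '2020-01-16 00:00:00'), so the format must cover it
fb_comments.created_at = pd.to_datetime(fb_comments.created_at, format='%Y-%m-%d %H:%M:%S')
fb_users.joined_at = pd.to_datetime(fb_users.joined_at, format='%Y-%m-%d %H:%M:%S')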
fb_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   id         24 non-null     int64         
 1   name       24 non-null     object        
 2   joined_at  24 non-null     datetime64[ns]
 3   city_id    24 non-null     int64         
 4   device     24 non-null     int64         
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 1.1+ KB
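
One way to double-check a date conversion like this is to parse with `errors="coerce"` and count the values that came back as `NaT`. The sketch below uses a hypothetical sample series (`raw`) shaped like the CSV's date strings, with one deliberately bad entry; it is not run against the real files.

```python
import pandas as pd

# Hypothetical sample in the same shape as the CSV's date strings.
raw = pd.Series(["2020-01-16 00:00:00", "2019-12-31 00:00:00", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Number of values that failed to parse.
n_bad = int(parsed.isna().sum())
print(parsed.dtype, n_bad)
```

A count of zero on the real columns would suggest every string matched the expected format.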


Now that the data is fixed, let's start.
## Solution

In [14]:
# Filter for comments made in January 2020
YY_MM = '2020-01'
df = fb_comments[fb_comments.created_at.dt.strftime('%Y-%m') == YY_MM]

# Join both datasets and drop comments dated before the user's join date
df = pd.merge(df, fb_users[['id', 'joined_at']], left_on='user_id', right_on='id', how='left').drop('id', axis=1)
df = df[~(df.created_at < df.joined_at)]

# Count comments per user
gby1 = df.groupby('user_id').created_at.count().rename('n_comments').reset_index()

# Count users per number of comments, sorted from fewest to most comments
output = gby1.groupby('n_comments').user_id.count().rename('n_users').reset_index().sort_values('n_comments')
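
The problem statement restricts the analysis to users who joined between 2018 and 2020, a condition the code above does not apply explicitly. Also, with a left join, comments from users missing in `fb_users` get a `NaT` join date and pass the `~(created_at < joined_at)` test. Here is a possible sketch of a stricter variant on hypothetical toy data (`toy_users`, `toy_comments`), assuming "between 2018 and 2020" means join years 2018 through 2020 inclusive:

```python
import pandas as pd

# Hypothetical toy data, not the notebook's CSV files.
toy_users = pd.DataFrame({
    "id": [1, 2, 3],
    "joined_at": pd.to_datetime(["2017-05-01", "2019-03-10", "2020-01-20"]),
})
toy_comments = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 3],
    "created_at": pd.to_datetime(
        ["2020-01-05", "2020-01-06", "2020-01-07", "2020-01-10", "2020-01-25"]
    ),
})

# Assumption: keep users whose join year is 2018..2020 (between is inclusive).
eligible = toy_users[toy_users["joined_at"].dt.year.between(2018, 2020)]

# An inner join drops comments from users outside the eligible set.
merged = toy_comments.merge(
    eligible[["id", "joined_at"]], left_on="user_id", right_on="id", how="inner"
).drop(columns="id")

# Drop comments dated before the join date.
valid = merged[merged["created_at"] >= merged["joined_at"]]
per_user = valid.groupby("user_id").size()
print(per_user)
```

In the toy data, user 1 is excluded for joining in 2017 and user 3 loses the comment dated before the join date. Whether this changes the notebook's result depends on the actual contents of the CSV files.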

## Final Output

In [15]:
output

Unnamed: 0,n_comments,n_users
0,1,5
1,2,7
2,3,2
3,4,1
4,6,1
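
As a quick consistency check, the distribution can be summed back up: the `n_users` column should add up to the number of distinct commenting users, and `n_comments * n_users` to the number of comments that survived the filters. The sketch below rebuilds the table from the values shown above (the `result` name is hypothetical); in the notebook, the two totals would be expected to match `len(gby1)` and `len(df)`.

```python
import pandas as pd

# Values copied from the final output table above.
result = pd.DataFrame({"n_comments": [1, 2, 3, 4, 6], "n_users": [5, 7, 2, 1, 1]})

# Total users and total comments implied by the distribution.
total_users = int(result["n_users"].sum())
total_comments = int((result["n_comments"] * result["n_users"]).sum())
print(total_users, total_comments)  # 16 35
```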
