Preliminary code to extract the check-in messages in the Slack channel "think-biver-sunday-checkins".

Many of the steps/functions used in "AG_slack-export-data-compilation.ipynb" could be used to further clean this data (PENDING).

The main objective so far was to separate the text into the expected categories: 'project_name', 'working_on', 'progress_and_roadblocks', 'plans_for_following_week', 'meetings'. Here are some comments/considerations:

    1) When parsing the text, it is assumed that each category starts with the category_name followed by a colon. This works in most cases, but there are exceptions where another symbol, or no symbol at all, is used. These messages cannot be confidently parsed.

    2) Some entries do not correspond to a real check-in, and most of them were dropped. They can be SlackBot messages, or messages sent multiple times as a reminder of the expected format for the check-ins.

    3) Some check-in messages contain more than one project. In these cases, each project is assigned to a different row in the final dataframe (preserving all relevant info such as user, msg_id, ...).

    4) The code was generalized to not rely on having the line "Weekly report" or "Weekly update" (since most of the messages do not have them). It makes things easier if it does have it, but it won't break if a message does not have it.

    5) Different words/phrases that could refer (without ambiguity) to a category are defined in keywords_dictionary. When parsing the text, all these possible ways of writing a category_name are considered.

    6) The code was generalized to parse messages with colons in the bulk of the text other than the colons following the category name.
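
The matching rule described in the considerations above (a category keyword followed by a colon, with later colons left in the answer) can be sketched as follows. This is a minimal illustration, not the notebook's implementation: `KEYWORDS` is a trimmed-down, hypothetical stand-in for keywords_dictionary, and `normalize`, `find_category` and `answer_after_label` are names introduced here only for the example.

```python
# Hypothetical, trimmed-down keyword table (the notebook's keywords_dictionary is larger).
KEYWORDS = {
    "project_name": ["project name"],
    "working_on": ["what you are working on", "working on"],
    "meetings": ["meetings"],
}

def normalize(line):
    """Lowercase the line and drop markdown asterisks and all spaces."""
    return line.lower().replace("*", "").replace(" ", "")

def find_category(line):
    """Return the first category whose keyword, followed by ':', occurs in the line."""
    compact = normalize(line)
    for category, keywords in KEYWORDS.items():
        for keyword in keywords:
            if keyword.replace(" ", "") + ":" in compact:
                return category
    return None

def answer_after_label(line):
    """Keep everything after the first colon, so later colons stay in the answer."""
    return line.split(":", 1)[1].strip("*-•. ")

line = "4. *Meetings:* Sync at 10:30 with the AWS team"
category = find_category(line)      # category detected from the label
answer = answer_after_label(line)   # text after the label, time of day preserved
```

Splitting on the first colon only is an assumption that the label always comes first in the line; a line with a colon before the label would need extra handling.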

PENDING:
    
    1) Convert user and channel IDs to their display names when used in the messages.
    2) Modify if necessary the format of the dataframe.
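
A possible starting point for the first pending item is sketched below. It assumes that user mentions appear as `<@USERID>` (as in the sample messages in this notebook) and that channel mentions appear as `<#CHANNELID>` or `<#CHANNELID|name>`. The lookup tables `user_names` and `channel_names` are hypothetical; they might be built from the export's users.json and channels.json files, which are not read anywhere in this notebook.

```python
import re

# Hypothetical lookup tables (id -> display name); the entries are made up for illustration.
user_names = {"U07FCQXU7Q9": "alice"}
channel_names = {"C0123456789": "think-biver-sunday-checkins"}

USER_MENTION = re.compile(r"<@([A-Z0-9]+)>")
CHANNEL_MENTION = re.compile(r"<#([A-Z0-9]+)(?:\|[^>]*)?>")

def _user_repl(match):
    """Return '@name' for a known user id, or the original mention otherwise."""
    name = user_names.get(match.group(1))
    return "@" + name if name is not None else match.group(0)

def _channel_repl(match):
    """Return '#name' for a known channel id, or the original mention otherwise."""
    name = channel_names.get(match.group(1))
    return "#" + name if name is not None else match.group(0)

def replace_mentions(text):
    """Swap user and channel ids for display names; unknown ids are left untouched."""
    text = USER_MENTION.sub(_user_repl, text)
    return CHANNEL_MENTION.sub(_channel_repl, text)

example = replace_mentions("<!channel> reposting <@U07FCQXU7Q9>'s message.")
```

Special mentions such as `<!channel>` and `<!here>` are deliberately left as they are in this sketch.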

In [None]:
import pandas as pd 
import numpy as np

from os import listdir
from os.path import getmtime, exists, isdir, isfile
from pathlib import Path
import re
import sys

In [None]:
##-- Global variables:
missing_value = 'n/d'

source_path = "/home/agds/Documents/RebeccaEverleneTrust/RebeccaEverlene_Slack_export/think-biver-sunday-checkins"

In [None]:
##-- Introduce expected/possible keywords per report's category:
keywords_dictionary = {
    'header' : ['weekly report', 'report', "week's report"],
    'project_name': ['project name'],
    'working_on' : ['working on', 'working', 'what you are working on', 'worked on'],
    'progress_and_roadblocks' : ['progress and roadblocks', 'progress and roadblock', 'progress &amp; roadblocks', 'Progress/Roadblocks'],#, 'progress', 'roadblocks'],
    'progress' : ['progress'],
    'roadblocks' : ['roadblocks', 'roadblock'],
    'plans_for_following_week' : ['plans for the following week', 'plans for next week', 'following week', 'next week', 'plans for the upcoming week'],
    'meetings' : ['meetings', 'meet', 'met', "meetings you've attended", 'upcoming meetings', 'Meeting attended', 'Meetings attended']
}
all_keywords = ['project_name', 'working_on', 'progress_and_roadblocks', 'progress', 'roadblocks','plans_for_following_week', 'meetings']

In [None]:
##-- Extract messages from the Slack channel "think-biver-sunday-checkins":

def check_format_of_json_names(list_names):
    """ Iterates over all the json files in a channel's directory, and returns a list with the names of the json files 
    that have the correct format 'yyyy-mm-dd.json' """
    ##-- fullmatch anchors both ends, and '\.' matches a literal dot (a bare '.' would match any character):
    pattern = re.compile(r'\d{4}-\d{2}-\d{2}\.json')
    return [name for name in list_names if pattern.fullmatch(name) is not None]

##-- Initialize dataframe with first json file:
json_names = check_format_of_json_names(listdir(source_path))
checkins_df = pd.read_json(source_path+'/'+json_names[0])
checkins_df['json_name'] = json_names[0]

##-- Iterate over the remaining json files and concat info to checkins_df:
for file in json_names[1:]:
    file_df = pd.read_json(source_path+'/'+file)
    file_df['json_name'] = file
    checkins_df = pd.concat([checkins_df,file_df], axis=0, ignore_index=True)

##-- Keep relevant columns:
checkins_df = checkins_df[['user', 'client_msg_id', 'ts', 'json_name', 'text']]

##-- Set dtypes:
checkins_column_names = list(checkins_df.columns)
checkins_column_dtypes = ['string','string','float64','string','string']
for i in range(len(checkins_column_names)):
    checkins_df[checkins_column_names[i]] = checkins_df[checkins_column_names[i]].astype(checkins_column_dtypes[i])

##-- Collect the resulting dtype of each column (kept for inspection):
checkins_column_types = [checkins_df[feature].dtypes for feature in list(checkins_df.columns)]

checkins_df.info()
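
The file-name check at the top of the cell above only looks at the shape of the name. If stricter filtering were wanted, the date part could also be validated as a real calendar date. The sketch below is one way to do that with the standard library; `is_daily_export` and the sample `names` list are hypothetical and not part of the notebook.

```python
import re
from datetime import datetime

DAILY_EXPORT = re.compile(r"\d{4}-\d{2}-\d{2}\.json")

def is_daily_export(name):
    """True for names like '2024-05-19.json' whose date part is a real calendar date."""
    if DAILY_EXPORT.fullmatch(name) is None:
        return False
    try:
        datetime.strptime(name[:10], "%Y-%m-%d")
    except ValueError:
        # The shape is right but the date is impossible (e.g. month 13).
        return False
    return True

# Made-up file names for illustration only.
names = ["2024-05-19.json", "2024-13-45.json", "2024-05-19xjson", "canvas.json", "2024-05-12.json"]
valid_names = sorted(n for n in names if is_daily_export(n))
```

Sorting the valid names puts the files in chronological order, since the 'yyyy-mm-dd' format sorts lexicographically by date.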

In [None]:
def handle_missing_values(df, missing_value):
    df = df.replace(pd.NaT, missing_value)
    df = df.replace(np.nan, missing_value) 
    df = df.fillna(missing_value)
    return df
    
def get_indices_with_repeated_text(df):
    """ Function to get the dataframe's indices of the rows whose text is repeated in a later row
    (the last occurrence of each text is not included) """
    ##-- duplicated(keep='last') flags every occurrence except the last one:
    is_repeated = df['text'].duplicated(keep='last')
    return np.array(df.index[is_repeated])

##-- Check for messages that have repeated text:
indices_same_text = get_indices_with_repeated_text(checkins_df)
print('indices_same_text: ', np.array(indices_same_text), '\n')

##-- Messages explaining how the format of the checkins should be:
sample_format_msg_text = checkins_df.at[10,'text']
sample_format_msg_indices = checkins_df[checkins_df['text']==sample_format_msg_text].index
print('sample_format_msg_indices: ', np.array(sample_format_msg_indices), '\n')

sample_text_indices = []
sample_text_1 = '\n\n*Project Name* :Scapegoated \n1. *What you are working on:* Currently focusing on designs and other suggestions as per the discussions with other team members as well as improving the design of the website\n2. *Progress and Roadblocks:* Regularly asking for suggestions and in contact with other team members,no roadblocks so far.\n3. *Plans for the following week:* To continue to work with the frontend part\n4. *Meetings:* No meetings conducted'
sample_text_2 = 'Hey *<!here>*, I’m Deeptha from the AWS team'
sample_text_3 = "<!channel> reposting <@U07FCQXU7Q9>'s message. Please adhere to it. THANK YOU."
sample_text_4 = 'please follow this structure when posting updates'
for i in range(len(checkins_df)):
    text_i = checkins_df.at[i,'text']
    if sample_text_1 in text_i or sample_text_2 in text_i or sample_text_3 in text_i or sample_text_4 in text_i:
        sample_text_indices.append(i)
print('sample_text_indices: ', np.array(sample_text_indices), '\n')

##-- Messages from USLACKBOT:
bot_indices = checkins_df[checkins_df['user']=='USLACKBOT'].index
print('bot_indices: ', np.array(bot_indices), '\n')

##-- Joined-the-channel messages:
joined_channel_indices = []
for i in list(checkins_df.index):
    if 'has joined the channel' in checkins_df.at[i,'text']:
        joined_channel_indices.append(i)
print('joined_channel_indices: ', joined_channel_indices, '\n')

##-- Drop from dataframe:
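for msg_type_indices in [sample_format_msg_indices, sample_text_indices, bot_indices, joined_channel_indices]:
    ##-- errors='ignore' skips indices already dropped by a previous group instead of discarding the whole group:
    checkins_df = checkins_df.drop(msg_type_indices, axis=0, errors='ignore')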

##-- Remaining messages:
#indices_same_text_remaining = get_indices_with_repeated_text(checkins_df)
#print('indices_same_text_remaining: ', np.array(indices_same_text_remaining), '\n')
#checkins_df.loc[indices_same_text_remaining]

##-- Handle missing values:
checkins_df = handle_missing_values(checkins_df, missing_value)

##-- Reset indices:
checkins_df.index = np.arange(0,len(checkins_df),1)
checkins_df.info()
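
The filtering done in the cell above can be summarized as a single classification rule per message. The sketch below restates it on plain dictionaries so that it runs without the export; the labels ("bot", "joined", "reminder", "check-in"), the function `classify_message` and the sample messages are illustrative assumptions, and only two of the reminder snippets used in the notebook are included.

```python
# Two of the reminder phrases the notebook searches for.
REMINDER_SNIPPETS = [
    "please follow this structure when posting updates",
    "Please adhere to it.",
]

def classify_message(message):
    """Label a raw Slack message dict; the labels are illustrative, not from the notebook."""
    text = message.get("text") or ""
    if message.get("user") == "USLACKBOT":
        return "bot"
    if "has joined the channel" in text:
        return "joined"
    if any(snippet in text for snippet in REMINDER_SNIPPETS):
        return "reminder"
    return "check-in"

# Made-up messages for illustration only.
messages = [
    {"user": "USLACKBOT", "text": "Reminder: post your check-in"},
    {"user": "U1", "text": "<@U1> has joined the channel"},
    {"user": "U2", "text": "<!channel> reposting the format. Please adhere to it. THANK YOU."},
    {"user": "U3", "text": "*Project Name:* Alpha\n*Meetings:* none"},
]
labels = [classify_message(m) for m in messages]
kept = [m for m, label in zip(messages, labels) if label == "check-in"]
```

Keeping the label instead of dropping rows immediately would make it easy to review what was excluded and why.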

In [None]:

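def check_message_format(text):
    """
    Returns a dictionary with one key per category in all_keywords. The value is 1 if a keyword of that
    category, followed by a colon, was found in some line of the text, and 0 otherwise.
    """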
    is_format_correct = [0]*len(all_keywords)
    indices = []
    
    if text!='':
        for line in text.splitlines():
            line = line.lower().lstrip('*-•. ').rstrip('*-•. ').replace('*', '').replace(' ','')
            for i in range(len(all_keywords)):
                feature = all_keywords[i]
                for keyword in keywords_dictionary[feature]:
                    if keyword.lower().replace(' ','')+':' in line:
                        is_format_correct[i] = 1
                        indices.append(i)
                        break
                ##-- Double check 'roadblocks:' vs. 'progress and roadblocks:'
                if feature == 'roadblocks':
                    for keyword in keywords_dictionary['progress_and_roadblocks']:
                        if keyword.lower().replace(' ','')+':' in line:
                            is_format_correct[i] = 0
                            indices = indices[:-1]
    
    check_format_dict = {}
    for i in range(len(all_keywords)):
        check_format_dict[all_keywords[i]] = is_format_correct[i]

    return check_format_dict

def check_messages_format(df):
    indices_all_missing = []
    indices_review = []
    indices_all_pandr = []
    indices_all_p_r = []
    
    for i in range(len(df)):
        text = df.at[i,'text']
        
        check_format_dict = check_message_format(text)
        check_format_list = []
        for j in range(len(check_format_dict)):
            df.at[i, f"format_{all_keywords[j]}"] = check_format_dict[all_keywords[j]]
            check_format_list.append( check_format_dict[all_keywords[j]] )
        [pn, wo, pandr, p, r, nw, m] = check_format_list
        
        if 1 not in check_format_list:
            indices_all_missing.append(i)
            df.at[i,'status'] = 'Not parsed'
        
        elif pn==1 and wo==1 and pandr==1 and p==0 and r==0 and nw==1 and m==1:
            indices_all_pandr.append(i)
            df.at[i,'status'] = 'Fully parsed (p&r)'
    
        elif pn==1 and wo==1 and pandr==0 and p==1 and r==1 and nw==1 and m==1:
            indices_all_p_r.append(i)
            df.at[i,'status'] = 'Fully parsed (p,r)'
    
        else:
            indices_review.append(i)
            df.at[i,'status'] = 'review'


    print(f"All_missing: (({len(indices_all_missing)})) {indices_all_missing}", '\n')    
    print(f"Partially missing: (({len(indices_review)})) {indices_review}", '\n')
    print(f"All p&r: (({len(indices_all_pandr)})) {indices_all_pandr}", '\n')
    print(f"All p, r: (({len(indices_all_p_r)})) {indices_all_p_r}", '\n')

check_messages_format(checkins_df)
checkins_df[:10]
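
The status assigned by check_messages_format follows a small decision rule over the seven 0/1 flags. The sketch below restates that rule on its own so it can be tried on hand-written flags; `classify_status`, `CORE` and the helper `make_flags` are names introduced here for illustration.

```python
# Categories that must be present in both "fully parsed" variants.
CORE = ("project_name", "working_on", "plans_for_following_week", "meetings")

def classify_status(flags):
    """Map a dict of 0/1 format flags to a status label."""
    if not any(flags.values()):
        return "Not parsed"
    core_ok = all(flags[name] == 1 for name in CORE)
    combined = flags["progress_and_roadblocks"] == 1
    progress = flags["progress"] == 1
    roadblocks = flags["roadblocks"] == 1
    if core_ok and combined and not progress and not roadblocks:
        return "Fully parsed (p&r)"
    if core_ok and not combined and progress and roadblocks:
        return "Fully parsed (p,r)"
    return "review"

def make_flags(**present):
    """Build a full flag dict, defaulting every category to 0 (helper for the example)."""
    names = CORE + ("progress_and_roadblocks", "progress", "roadblocks")
    return {name: present.get(name, 0) for name in names}

status_combined = classify_status(make_flags(project_name=1, working_on=1,
                                             plans_for_following_week=1, meetings=1,
                                             progress_and_roadblocks=1))
status_partial = classify_status(make_flags(project_name=1, working_on=1))
```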

In [None]:
def match_to_category(line, category_name):
    """
    Returns True if the category_name, followed by a colon, was found in a line of the message.
    """
    line = line.lower().lstrip('*-•. ').rstrip('*-•. ').replace('*', '').replace(' ','')
    out = False
    for keyword in keywords_dictionary[category_name]:
        if keyword.lower().replace(' ','')+':' in line:
            out = True
    return out
            

def review_format(text):
    """
    Returns a dictionary with keys:
        project_name, working_on, progress_and_roadblocks, progress, roadblocks, plans_for_following_week and meetings.
    And values={0,1} depending on whether the above keywords were found in the text as a whole.
    """
    is_format_correct_list = [0]*len(all_keywords)
    is_format_correct_dict = {}
    
    if text!='':
        text_to_lines = text.splitlines()
        for i_line in range(len(text_to_lines)):
            line = text_to_lines[i_line]
            for i in range(len(all_keywords)):
                category_name = all_keywords[i]
                if match_to_category(line, category_name) == True:
                    is_format_correct_list[i] = 1
                    break
                ##-- No extra check for 'roadblocks:' vs. 'progress and roadblocks:' is needed here:
                ##-- 'progress_and_roadblocks' precedes 'roadblocks' in all_keywords, so the break above fires first.
    
    for i in range(len(all_keywords)):
        is_format_correct_dict[all_keywords[i]] = is_format_correct_list[i]

    return is_format_correct_dict


def get_indices_of_lines_with_category_name(text):
    """
    Returns two lists. One with the number of the line in the text where a keyword was identified and the other list with the corresponding 
    category_names
    """
    indices_start_of_category = []  
    category_names = []
    if text!='':
        text_to_lines = text.splitlines()
        for i_line in range(len(text_to_lines)):
            line = text_to_lines[i_line]
            for i in range(len(all_keywords)):
                category_name = all_keywords[i]
                if match_to_category(line, category_name) == True:
                    indices_start_of_category.append(i_line)
                    category_names.append(category_name)
                    break
                ##-- No extra check for 'roadblocks:' vs. 'progress and roadblocks:' is needed here:
                ##-- 'progress_and_roadblocks' precedes 'roadblocks' in all_keywords, so the break above fires first.
                    
    return indices_start_of_category, category_names



def group_lines(text, indices_start_of_category):
    """
    Returns a list of lists, where each element collects the content (1 or more lines) for each category in the text.
    """
    blocks = []
    begin = indices_start_of_category[0]
    end = indices_start_of_category[-1]
    text_to_lines = text.splitlines()
    for i in range(len(indices_start_of_category)-1):
        begin = indices_start_of_category[i]
        end = indices_start_of_category[i+1]
        blocks.append(text_to_lines[begin:end]) 
    blocks.append(text_to_lines[end:]) 
    return blocks

def count_projects(category_names):
    """
    Returns an integer with the number of identified projects in the text. A project is identified if the category label is:
        "Project name:"
    independently of lowercase or uppercase letters.
    """
    counter = 0
    for name in category_names:
        if name == 'project_name':
            counter += 1
    return counter

def count_weekly_report_label(df):
    """
    Returns a list of dataframe indices such that the label "Weekly report" or "Weekly update" was found in the corresponding text.
    """
    indices = []
    for i in range(len(df)):
        text = df.at[i,'text']
        for line in text.splitlines():
            line = line.lower().lstrip('*-•. ').rstrip('*-•. ').replace('*', '')
            if 'weekly report' in line or 'weekly update' in line:
                indices.append(i)
                ##-- One match per message is enough; avoid listing the same index twice:
                break
    return indices
    

def extract_answers(blocks_list):
    """
    Returns a list of strings, where each string corresponds to the "answer" of a given category.
    It removes the category_name label, and combines multiple lines if necessary.
    """
    answers = []
    for block in blocks_list:
        answer_text = ''
        for line in block:
            line_matches = False
            for category in all_keywords:
                if match_to_category(line, category)==True:
                    ##-- Split on the first colon only, so colons inside the answer (e.g. "10:30") are kept:
                    answer_text += line.split(":", 1)[1].lstrip('*-•. ').rstrip('*-•. ').replace('*', '')
                    line_matches = True
                    break
            if line_matches==False:
                answer_text += line
        answers.append(answer_text)
    return answers


def create_empty_df_with_categories(n_rows):
    """
    Returns an empty dataframe with n_rows number of rows and columns:
        project_name, working_on, progress_and_roadblocks, progress, roadblocks, plans_for_following_week, meetings, n_projects, index_
    The column index_ is for internal development of the code. It can be removed at the end.
    """
    columns=list(all_keywords)+['n_projects','index_']
    df = pd.DataFrame([[np.nan]*n_rows]*len(columns)).T
    df.columns = columns  
    df = df.astype('object')
    return df


def parse_msg_to_df(df, text, indices_start_of_category, category_names, answers):
    """
    Takes the empty dataframe created with the function "create_empty_df_with_categories(n_rows)" and fills the cells 
    with the "answers" to the categories that were correctly identified in the text.
    """    
    project_counter = -1
    for i in range(len(category_names)):
        if category_names[i] == 'project_name':
            project_counter += 1
        ##-- Categories that appear before the first "Project name:" go to row 0
        ##-- (writing to index -1 would add a spurious row to the dataframe):
        row = max(project_counter, 0)
        df.at[row, category_names[i]] = answers[i]

    return df


def get_indices_progress_roadblocks(df, missing_value):
    """
    Function to collect the dataframe's indices that contain:
        progress_roadblocks = entries that have both "Progress" and "Roadblocks"
        progress = entries that have only "Progress"
        roadblocks = entries that have only "Roadblocks"
        progress_and_roadblocks_true = entries that have the desired label "progress_and_roadblocks"
        progress_and_roadblocks_other = entries without that label but with both "Progress" and "Roadblocks"
    """
    progress = []
    roadblocks = []
    progress_roadblocks = []
    progress_and_roadblocks_true = []
    progress_and_roadblocks_other = []
    for i in range(len(df)):
        try:
            has_progress = df.at[i, 'progress'] != missing_value
            has_roadblocks = df.at[i, 'roadblocks'] != missing_value
            has_combined = df.at[i, 'progress_and_roadblocks'] != missing_value
        except KeyError:
            ##-- Skip positions that are not in the dataframe's index:
            continue
        if has_progress and has_roadblocks:
            progress_roadblocks.append(i)
        elif has_progress:
            progress.append(i)
        elif has_roadblocks:
            roadblocks.append(i)
        if has_combined:
            progress_and_roadblocks_true.append(i)
        elif has_progress and has_roadblocks:
            progress_and_roadblocks_other.append(i)
    return [progress, roadblocks, progress_roadblocks, progress_and_roadblocks_true, progress_and_roadblocks_other]


def combine_progress_and_roadblocks(df, missing_value):
    """ Combines the information in 'progress' and 'roadblocks' into 'progress_and_roadblocks', such that
    the text in progress_and_roadblocks becomes:
        "Progress: progress_text
         new_line
         Roadblocks: roadblocks_text"
    An alternative is to split 'progress_and_roadblocks' although it is much more complicated.
    """
    for i in range(len(df)):
        combined = df.at[i, 'progress_and_roadblocks']
        progress = df.at[i, 'progress']
        roadblocks = df.at[i, 'roadblocks']

        if combined != missing_value:
            ##-- The combined label was used; keep its text as is:
            pr_text = combined
        else:
            ##-- Build the combined text from whichever separate labels are present:
            parts = []
            if progress != missing_value:
                parts.append('Progress: ' + progress)
            if roadblocks != missing_value:
                parts.append('Roadblocks: ' + roadblocks)
            pr_text = '\n'.join(parts) if parts else missing_value

        df.at[i, 'progress_and_roadblocks_combined'] = pr_text
        
    return df
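
To make the overall flow of the helpers above easier to follow (find the label lines, group the lines that follow each label, and start a new row at every "Project name:"), here is a compact, self-contained sketch on a made-up two-project message. It is a simplification, not the notebook's code: `LABELS` is a hypothetical three-entry table, labels are matched exactly against the text before the first colon, and rows are plain dictionaries instead of dataframe rows.

```python
# Hypothetical label table: text before the first colon -> category name.
LABELS = {"project name": "project_name", "working on": "working_on", "meetings": "meetings"}

def split_label(line):
    """Return (category, answer) if the line starts a category, else (None, stripped line)."""
    head, sep, tail = line.partition(":")
    key = head.lower().strip("*-•. 0123456789")
    if sep and key in LABELS:
        return LABELS[key], tail.strip("*-•. ")
    return None, line.strip()

def parse_checkin(text):
    """Return one dict per project found in the message."""
    rows = []
    current = None
    for line in text.splitlines():
        category, answer = split_label(line)
        if category == "project_name":
            rows.append({"project_name": answer})   # a new project starts a new row
            current = "project_name"
        elif category is not None and rows:
            rows[-1][category] = answer
            current = category
        elif category is None and rows and line.strip():
            rows[-1][current] += "\n" + answer      # continuation of the previous answer
    return rows

message = (
    "Weekly report\n"
    "*Project Name:* Alpha\n"
    "1. *Working on:* API docs\n"
    "still drafting\n"
    "*Project Name:* Beta\n"
    "1. *Working on:* tests\n"
    "2. *Meetings:* standup at 10:30"
)
rows = parse_checkin(message)
```

In this sketch, lines before the first project name (such as the "Weekly report" header) are ignored, and continuation lines are joined with a newline.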


In [None]:
###-- MAIN ANALYSIS OF SECOND IMPLEMENTATION:

def parse_dataframe(df):
    """
    Returns a dataframe with the parsed text.
    Messages with weekly reports of more than one project are split into as many rows as projects in the report.
    The columns are:
       'user', 'client_msg_id', 'ts', 'json_name', 'text',
       'format_project_name', 'format_working_on', 'format_progress_and_roadblocks', 'format_progress', 'format_roadblocks', 'format_plans_for_following_week','format_meetings', 
       'status', 
       'project_name', 'working_on', 'progress_and_roadblocks', 'progress', 'roadblocks', 'plans_for_following_week', 'meetings', 
       'n_projects', 'index_', 'index','progress_and_roadblocks_combined'
    """
    ##-- Initialize a dataframe to collect the original and parsed information:
    checkins_parsed_df = pd.DataFrame(columns=list(df)+list(all_keywords)+['n_projects','index_'])
    
    for i in range(len(df)):
        text = df.at[i,'text']   
        
        indices_start_of_category, category_names = get_indices_of_lines_with_category_name(text)
        #print(indices_start_of_category)
        #print(category_names, '\n')
        
        if len(indices_start_of_category) == 0:   
            ##-- If no keywords were identified in the non-empty text:
            df_i_blocks = create_empty_df_with_categories(1)
            n_projects = missing_value
        elif len(indices_start_of_category) > 0:
            blocks_list = group_lines(text, indices_start_of_category)
            #print(blocks_list, '\n')
        
            answers = extract_answers(blocks_list)
            #print('answers =', answers, '\n')
        
            n_projects = count_projects(category_names)
            #print(f"n_projects = {n_projects}" , '\n')
            
            if n_projects == 0:
                ##-- If project_name was not identified:
                df_i_blocks = create_empty_df_with_categories(1)
            else:
                df_i_blocks = create_empty_df_with_categories(n_projects)
                df_i_blocks = parse_msg_to_df(df_i_blocks, text, indices_start_of_category, category_names, answers)      
            
        df_i_blocks['n_projects'] = n_projects
        df_i_blocks['index_'] = i
        df_i_blocks = handle_missing_values(df_i_blocks, missing_value)  
        #display(df_i_blocks)
            
        ##-- Dataframe with the original text. Rows are duplicated as many times as projects in the checkin:
        df_i_text = pd.DataFrame([list(df.loc[i].values)]*len(df_i_blocks))
        df_i_text.columns = df.columns
        df_i_text['index'] = i
        df_i_text = handle_missing_values(df_i_text, missing_value)
        #display(df_i_text)
        
        ##-- Concatenate df_i_text and df_i_blocks for i-th message:
        df_i_all = pd.concat([df_i_text, df_i_blocks], axis=1, ignore_index=True)
        df_i_all.columns = list(df_i_text.columns) + list(df_i_blocks.columns)
        df_i_all = handle_missing_values(df_i_all, missing_value)  
        #display(df_i_all)
        
        ##-- Concatenate to checkins_parsed_df:
        checkins_parsed_df = pd.concat([checkins_parsed_df, df_i_all], axis=0, ignore_index=True)

    ##-- Combine "Progress" and "Roadblocks":
    #checkins_parsed_df = combine_progress_and_roadblocks(checkins_parsed_df, missing_value)
    #checkins_parsed_df = handle_missing_values(checkins_parsed_df, missing_value)

    return checkins_parsed_df
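
parse_dataframe grows checkins_parsed_df by concatenating inside the loop. A common alternative, shown here only as a sketch, is to collect the per-message dataframes in a list and concatenate once at the end. The function `stack_frames` and the two small frames are hypothetical stand-ins for the df_i_all objects built in the loop.

```python
import pandas as pd

def stack_frames(frames):
    """Concatenate per-message frames once, instead of growing a dataframe in a loop."""
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, axis=0, ignore_index=True)

# Made-up per-message frames: one single-project message and one two-project message.
per_message = [
    pd.DataFrame({"index_": [0], "project_name": ["Alpha"]}),
    pd.DataFrame({"index_": [1, 1], "project_name": ["Beta", "Gamma"]}),
]
stacked = stack_frames(per_message)
```

This avoids copying the accumulated rows on every iteration and avoids concatenating onto an empty dataframe.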

In [None]:
checkins_parsed_df = parse_dataframe(checkins_df)
checkins_parsed_df.head()

In [None]:
columns_to_keep = ['user', 'client_msg_id', 'ts', 'json_name', 'text','project_name', 'working_on','progress_and_roadblocks', 'progress', 'roadblocks', 'plans_for_following_week', 'meetings', 'n_projects']
checkins_parsed_df = checkins_parsed_df[columns_to_keep]
checkins_parsed_df.head()