Preliminary code to extract the check-in messages in the Slack channel "think-biver-sunday-checkins".

Many of the steps/functions used in "AG_slack-export-data-compilation.ipynb" could be use to clean further this data (PENDING).

The main objective so far was to separate the text into the expected categories: 'project_name', 'working_on', 'progress_and_roadblocks', 'plans_for_following_week', 'meetings'.

    1) When parsing the text, it is assumed that each category starts with the category_name followed by a semicolon. It works in most of the cases but there are exceptions where another symbol, or no symbol at all, is used. PENDING to generalize the first separation of the categories.

    2) There are entries that do not correspond to a real check-in, most of these entries were dropped. They can be SlackBot messages, or messages sent multiple times as a reminder of the expected format for the check-ins.

    3) Some check-in messages contain more than one project. For these cases, each project is assign to a different row in the final dataframe (preserving all relevant info as user, msg_id, ...). PENDING to include some edge cases.

    4) Some messages split "progress" from "roadblocks". These cases where combined keeping the format:
        
        Progress: aaaaaaaaa.
        
        (new_line)
        
        Roadblocks: bbbbbbb.
    
    5) Preliminary stage. Some rows have been added to the dataframe for developing/debugging purposes.

In [1]:
import pandas as pd 
import numpy as np

from os import listdir
from os.path import getmtime, exists, isdir, isfile
from pathlib import Path
import re
import sys

In [3]:
##-- Global variables:
missing_value = 'n/d'

source_path = "/home/agds/Documents/RebeccaEverleneTrust/RebeccaEverlene_Slack_export/think-biver-sunday-checkins"

In [5]:
##-- Introduce expected/possible keywords per report's category:
keywords_dictionary = {
    'header' : ['weekly report', 'report', "week's report"],
    'project_name': ['project name', 'project'],
    'working_on' : ['working on', 'working', 'what you are working on'],#, 'this week'],
    'progress_and_roadblocks' : ['progress and roadblocks', 'progress and roadblock', 'progress &amp; roadblocks'],#, 'progress', 'roadblocks'],
    'progress' : ['progress'],
    'roadblocks' : ['roadblocks', 'roadblock'],
    'plans_for_following_week' : ['plans for the following week', 'plans for next week', 'following week', 'next week', 'plans for the upcoming week'],
    'meetings' : ['meetings', 'meet', 'met', "meetings you've attended", 'upcoming meetings'],
    'this_week' : ['this week']
}
all_keywords = ['project_name', 'working_on', 'progress_and_roadblocks', 'progress', 'roadblocks','plans_for_following_week', 'meetings', 'this_week']

In [122]:
##-- Extract messages from the Slack channel "think-biver-sunday-checkins":

def check_format_of_json_names(list_names):
    """ Iterates over all the json files in a channel's directory, and returns a list with the names of the json files 
    that have the correct format 'yyyy-mm-dd.json' """
    list_names_dates = []
    for i in range(len(list_names)):
        match = re.match(r'(\d{4})(-)(\d{2})(-)(\d{2})(.)(json)',list_names[i])
        if match!=None:
            list_names_dates.append(list_names[i])
    return list_names_dates

##-- Initialize dataframe with first json file:
json_names = check_format_of_json_names(listdir(source_path))
checkins_df = pd.read_json(source_path+'/'+json_names[0])
checkins_df['json_name'] = json_names[0]

##-- Iterate over the remaining json files and concat info to checkins_df:
for file in json_names[1:]:
    file_df = pd.read_json(source_path+'/'+file)
    file_df['json_name'] = file
    checkins_df = pd.concat([checkins_df,file_df], axis=0, ignore_index=True)

##-- Keep relevant columns:
checkins_df = checkins_df[['user', 'client_msg_id', 'ts', 'json_name', 'text']]

##-- Set dtypes:
checkins_column_names = list(checkins_df.columns)
checkins_column_dtypes = ['string','string','float64','string','string']
for i in range(len(checkins_column_names)):
    checkins_df[checkins_column_names[i]] = checkins_df[checkins_column_names[i]].astype(checkins_column_dtypes[i])

##-- Fix the dtype of each column:
checkins_column_types = [checkins_df[feature].dtypes for feature in list(checkins_df.columns)]

checkins_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 941 entries, 0 to 940
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   user           941 non-null    string 
 1   client_msg_id  913 non-null    string 
 2   ts             941 non-null    float64
 3   json_name      941 non-null    string 
 4   text           941 non-null    string 
dtypes: float64(1), string(4)
memory usage: 36.9 KB


In [124]:
def handle_missing_values(df, missing_value):
    df = df.replace(pd.NaT, missing_value)
    df = df.replace(np.nan, missing_value) 
    df = df.fillna(missing_value)
    return df
    
def get_indices_with_repeated_text(df):
    """ Function to get the dataframe's indices of the rows that have exactly the same text """
    indices_before_drop = list(df.index)
    indices_after_drop = list(df[['text']].drop_duplicates(subset=['text'], keep='last').index )
    indices_same_text = []
    for i in indices_before_drop:
        flag = False
        for j in indices_after_drop:
            if i == j and flag == False:
                flag = True
        if flag == False:
            indices_same_text.append(i)
    return np.array(indices_same_text)

##-- Check for messages that have repeated text:
indices_same_text = get_indices_with_repeated_text(checkins_df)
print('indices_same_text: ', np.array(indices_same_text), '\n')

##-- Messages explaining how the format of the checkins should be:
sample_format_msg_text = checkins_df.at[10,'text']
sample_format_msg_indices = checkins_df[checkins_df['text']==sample_format_msg_text].index
print('sample_format_msg_indices: ', np.array(sample_format_msg_indices), '\n')

sample_text_indices = []
sample_text_1 = '\n\n*Project Name* :Scapegoated \n1. *What you are working on:* Currently focusing on designs and other suggestions as per the discussions with other team members as well as improving the design of the website\n2. *Progress and Roadblocks:* Regularly asking for suggestions and in contact with other team members,no roadblocks so far.\n3. *Plans for the following week:* To continue to work with the frontend part\n4. *Meetings:* No meetings conducted'
sample_text_2 = 'Hey *<!here>*, I’m Deeptha from the AWS team'
sample_text_3 = "<!channel> reposting <@U07FCQXU7Q9>'s message. Please adhere to it. THANK YOU."
for i in range(len(checkins_df)):
    if sample_text_1 in checkins_df.at[i,'text'] or sample_text_2 in checkins_df.at[i,'text'] :
        sample_text_indices.append(i)
print('sample_text_indices: ', np.array(sample_text_indices), '\n')

##-- Messages from user explaining how the format of the checkins should be:
#sample_format_user_text = checkins_df.at[10,'user']
#sample_format_user_indices = checkins_df[checkins_df['user']==sample_format_user_text].index
#print('sample_format_user_indices: ', np.array(sample_format_user_indices), '\n')

##-- Messages from USLACKBOT:
bot_indices = checkins_df[checkins_df['user']=='USLACKBOT'].index
print('bot_indices: ', np.array(bot_indices), '\n')

##-- Joined-the-channel messages:
joined_channel_indices = []
for i in list(checkins_df.index):
    if 'has joined the channel' in checkins_df.at[i,'text']:
      joined_channel_indices.append(i)
print('joined_channel_indices: ', joined_channel_indices, '\n')

##-- Drop from dataframe:
for msg_type_indices in [sample_format_msg_indices, sample_text_indices, bot_indices, joined_channel_indices]:
    try:
        checkins_df = checkins_df.drop(msg_type_indices,axis=0)
    except:
        continue

##-- Remaining messages:
#indices_same_text_remaining = get_indices_with_repeated_text(checkins_df)
#print('indices_same_text_remaining: ', np.array(indices_same_text_remaining), '\n')
#checkins_df.loc[indices_same_text_remaining]

##-- Handle missing values:
checkins_df = handle_missing_values(checkins_df, missing_value)

##-- Reset indices:
checkins_df.index = np.arange(0,len(checkins_df),1)
checkins_df.info()

indices_same_text:  [ 10  11  13  14  18  49  50  51  52  53  56  95 112 119 122 194 212 220
 226 330 336 338 339 340 477 564 565 566 567 568 569 570 571 572 573 574
 575 576 577 578 860 862 863 864 865 866 867 868 869 870 871 872 873 874
 876 877 878 880 881 882 883 885 887] 

sample_format_msg_indices:  [ 10  11  13  14  49  50  51  52  53  56 122 860 862 863 864 865 866 867
 868 869 870 871 872 873 874 876 877 878 880 881 882 883 887 894] 

sample_text_indices:  [ 41 110 125] 

bot_indices:  [ 95 226 330 340 702] 

joined_channel_indices:  [84, 91, 115, 116, 129, 137, 138, 139, 141, 142, 144, 145, 146, 215, 222, 230, 233, 646, 780, 819, 936, 937] 

<class 'pandas.core.frame.DataFrame'>
Index: 877 entries, 0 to 876
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   user           877 non-null    string 
 1   client_msg_id  877 non-null    string 
 2   ts             877 non-null    float64
 3   json_name      8

In [166]:
def split_report_into_blocks(report_text):
    """ Separates the original text into blocks (before identifying their categories)
    Assumes that each keyword is followed by ":".
    Lines containing ":" signal the beginning of a block. The text that follows the semicolon can have multiple lines.
    """
    if report_text!='':
        lines_list = report_text.splitlines()
        report_by_blocks = [lines_list[0]]
        for line in lines_list[1:]:
            if ":" not in line:
                report_by_blocks[-1] = report_by_blocks[-1]+'\n'+line
            else:
                report_by_blocks.append(line)
        return report_by_blocks
    else:
        return missing_value
        

def identify_categories(report_by_blocks):
    """ Reads the list generated by split_report_into_blocks and matches each block to a report's category.
    Stores matching in python dictionary 
    (Needs lots of improvement)
    """
    df = pd.DataFrame(columns=all_keywords)
    project = -1
    for item in report_by_blocks:
        item_key = re.sub(r'[0-9.*-/]', '', item.partition(":")[0]).lstrip()
        item_text = item.partition(":")[2].replace('*', '').lstrip()
        ##-- Autocorrect key if necessary: (PENDING)
        ##-- Compares the block's keyword to expected keywords:
        for type in all_keywords:
            keys = [i.lower().replace(' ','') for i in keywords_dictionary[type]]
            #for key in keys:
            #    if key in item_key.lower().replace(" ", ""):
            #        if type == 'project_name':
            #            project += 1
            #        df.at[project,type] = item_text.rstrip()
            #        break 
    
    
    
            if item_key.lower().replace(" ", "") in keys:
                if type == 'project_name':
                    project += 1
                df.at[project,type] = item_text.rstrip()
                break
    return df


def print_test(df):
    """ Function use for debugging to make sure that the text was correctly parsed and 
    that entries with multiple projects were correctly separated into individual rows"""
    missmatch = []
    for i in range(len(df)):
        text = df.at[i,'text']
        project = df.at[i,'project_name']
        working = df.at[i,'working_on']
        pr = df.at[i,'progress_and_roadblocks_combined']
        plans = df.at[i,'plans_for_following_week']
        meeting = df.at[i,'meetings']
        index = df.at[i,'index']
        index_ = df.at[i,'index_']
        print(i, int(index), index_,'\n -------------------------')
        print(text,'\n -------------------------')
        print(project,'\n -------------------------')
        print(working,'\n -------------------------')
        print(pr,'\n -------------------------')
        print(plans,'\n -------------------------')
        print(meeting,'\n ======================================')
        if index != index_:
            missmatch.append(i)
    return missmatch


def get_indices_progress_roadblocks(df, missing_value):
    """
    Function to collect the dataframe's indices that contain:
        progress_roadblocks = entries that have both "Progress" and "Roadblocks"
        progress = entries that have only "Progress"
        roadblocks = entries that have only "Roadblocks"
        progress_and_roadblocks_true = entries that have the desire label "progress_and_roadblocks"
        progress_and_roadblocks_other = []    
    """
    progress = []
    roadblocks = []
    progress_roadblocks = []
    progress_and_roadblocks_true = []
    progress_and_roadblocks_other = []
    for i in range(len(df)):
        try:
            if df.at[i, 'progress'] != missing_value and df.at[i, 'roadblocks'] != missing_value:
                progress_roadblocks.append(i)
            else:
                if df.at[i, 'progress'] != missing_value :
                    progress.append(i)
                if df.at[i, 'roadblocks'] != missing_value :
                    roadblocks.append(i)
            if df.at[i, 'progress_and_roadblocks'] == missing_value and df.at[i, 'progress'] != missing_value:
                progress_and_roadblocks_true.append(i)
            if df.at[i, 'progress_and_roadblocks'] == missing_value and df.at[i, 'roadblocks'] != missing_value:
                progress_and_roadblocks_true.append(i)
            if df.at[i, 'progress_and_roadblocks'] == missing_value and df.at[i, 'progress'] != missing_value and df.at[i, 'roadblocks'] != missing_value:
                progress_and_roadblocks_other.append(i)
        except:
            continue
    return [progress, roadblocks, progress_roadblocks, progress_and_roadblocks_true, progress_and_roadblocks_other]


def combine_progress_and_roadblocks(df, missing_value):
    """ Combines the information in 'progress' and 'roadblocks' into 'progress_and_roadblocks', such that
    the text in progress_and_roadblocks becomes:
        "Progress: progress_text
         new_line
         Roadblocks: roadblocks_text"
    An alternative is to split 'progress_and_roadblocks' although it is much more complicated.
    """
    for i in range(len(df)):
        pr_text = ''
        if df.at[i, 'progress_and_roadblocks'] == missing_value and df.at[i, 'progress'] != missing_value:
            pr_text += 'Progress: ' + df.at[i, 'progress'] + '\n'
        if df.at[i, 'progress_and_roadblocks'] == missing_value and df.at[i, 'roadblocks'] != missing_value:
            pr_text += 'Roadblocks: ' + df.at[i, 'roadblocks']
        
        if df.at[i, 'progress_and_roadblocks'] == missing_value and df.at[i, 'roadblocks'] == missing_value:
            pr_text = missing_value

        if df.at[i, 'progress_and_roadblocks'] != missing_value and df.at[i, 'progress'] == missing_value and df.at[i, 'roadblocks'] == missing_value:
            pr_text = df.at[i, 'progress_and_roadblocks']
        
        df.at[i, 'progress_and_roadblocks_combined'] = pr_text
        
    return df


In [None]:
##-- MAIN ANALYSIS:

##-- Initialize a dataframe to collect the original and parsed information:
checkins_parsed_df = pd.DataFrame(columns=list(checkins_df)+list(all_keywords)+['n_projects','index_'])

for i in range(len(checkins_df)):
    ##-- Dataframe with the parsed checkin message, with as many rows as projects in the message: 
    text = checkins_df.at[i,'text']
    report_by_blocks = split_report_into_blocks(text)
    df_i_blocks = identify_categories(report_by_blocks)
    df_i_blocks['n_projects'] = len(df_i_blocks)
    df_i_blocks['index_'] = i
    if len(df_i_blocks) == 0:
        df_i_blocks.loc[0] = [missing_value]*len(df_i_blocks.columns)
        df_i_blocks['index_'] = i
    df_i_blocks = handle_missing_values(df_i_blocks, missing_value)
    
    ##-- Dataframe with the original text. Rows are dublicated as many times as projects in the checkin:
    df_i_text = pd.DataFrame([list(checkins_df.loc[i].values)]*len(df_i_blocks))
    df_i_text.columns = checkins_df.columns
    df_i_text['index'] = i
    df_i_text = handle_missing_values(df_i_text, missing_value)

    ##-- Concatenate df_i_text and df_i_blocks for i-th message:
    df_i_all = pd.concat([df_i_text, df_i_blocks], axis=1, ignore_index=True)
    df_i_all.columns = list(df_i_text.columns) + list(df_i_blocks.columns)
    df_i_all = handle_missing_values(df_i_all, missing_value)

    ##-- Concatenate to checkins_parsed_df:
    checkins_parsed_df = pd.concat([checkins_parsed_df, df_i_all], axis=0, ignore_index=True)

##-- Combine "Progress" and "Roadblocks":
checkins_parsed_df = combine_progress_and_roadblocks(checkins_parsed_df, missing_value)
checkins_parsed_df = handle_missing_values(checkins_parsed_df, missing_value)

checkins_parsed_df

In [None]:
checkins_parsed_df[['user', 'client_msg_id', 'ts', 'json_name', 'text', 'project_name', 'working_on', 'progress_and_roadblocks_combined', 'plans_for_following_week', 'meetings']].to_csv('/home/agds/Desktop/Parsed_checkins.csv')

In [130]:
def check_format(text):
    is_format_correct = [0]*len(all_keywords)
    indices = []
    
    if text!='':
        for line in text.splitlines():
            line = line.lower().lstrip('*-•. ').rstrip('*-•. ').replace('*', '').replace(' ','')
            for i in range(len(all_keywords)):
                feature = all_keywords[i]
                for keyword in keywords_dictionary[feature]:
                    if keyword.lower().replace(' ','')+':' in line:
                        is_format_correct[i] = 1
                        indices.append(i)
                        break
                ##-- Double check 'roadblocks:' vs. 'progress and roadblocks:'
                if feature == 'roadblocks':
                    for keyword in keywords_dictionary['progress_and_roadblocks']:
                        if keyword.lower().replace(' ','')+':' in line:
                            is_format_correct[i] = 0
                            indices = indices[:-1]
    
    check_format_dict = {}
    for i in range(len(all_keywords)):
        check_format_dict[all_keywords[i]] = is_format_correct[i]

    return check_format_dict
    
indices_all_missing = []
indices_review = []
indices_all_pandr = []
indices_all_p_r = []

for i in range(len(checkins_df)):
    text = checkins_df.at[i,'text']
    
    check_format_dict = check_format(text)
    check_format_list = []
    for j in range(len(check_format_dict)):
        checkins_df.at[i, f"format_{all_keywords[j]}"] = check_format_dict[all_keywords[j]]
        check_format_list.append( check_format_dict[all_keywords[j]] )
    [pn, wo, pandr, p, r, nw, m, tw] = check_format_list
    
    if 1 not in check_format_list:
        indices_all_missing.append(i)
        checkins_df.at[i,'status'] = 'Not parsed'
    
    elif pn==1 and wo==1 and pandr==1 and p==0 and r==0 and nw==1 and m==1:
        indices_all_pandr.append(i)
        checkins_df.at[i,'status'] = 'Fully parsed (p&r)'

    elif pn==1 and wo==1 and pandr==0 and p==1 and r==1 and nw==1 and m==1:
        indices_all_p_r.append(i)
        checkins_df.at[i,'status'] = 'Fully parsed (p,r)'

    else:
        indices_review.append(i)
        checkins_df.at[i,'status'] = 'review'



print(f"All_missing: (({len(indices_all_missing)})) {indices_all_missing}", '\n')

print(f"Parsially missing: (({len(indices_review)})) {indices_review}", '\n')

print(f"All p&r: (({len(indices_all_pandr)})) {indices_all_pandr}", '\n')

print(f"All p, r: (({len(indices_all_p_r)})) {indices_all_p_r}", '\n')


checkins_df.head()

All_missing: ((380)) [2, 10, 14, 19, 23, 29, 39, 45, 51, 61, 64, 76, 84, 85, 89, 90, 91, 92, 95, 96, 97, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 115, 116, 118, 134, 141, 143, 188, 192, 193, 195, 196, 198, 199, 200, 220, 261, 291, 312, 313, 316, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 333, 335, 336, 337, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 353, 354, 355, 356, 358, 359, 361, 362, 363, 365, 366, 367, 368, 370, 371, 372, 375, 377, 380, 381, 382, 384, 385, 387, 388, 390, 391, 393, 394, 396, 398, 399, 400, 402, 403, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 418, 419, 420, 421, 422, 423, 424, 426, 429, 430, 435, 436, 437, 438, 439, 440, 442, 443, 444, 445, 446, 448, 449, 451, 452, 453, 454, 455, 458, 459, 460, 461, 473, 482, 483, 517, 529, 530, 531, 532, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 550, 551, 555, 556, 557, 558, 559, 560, 562, 564, 565, 566, 570, 571, 572, 

Unnamed: 0,user,client_msg_id,ts,json_name,text,format_project_name,format_working_on,format_progress_and_roadblocks,format_progress,format_roadblocks,format_plans_for_following_week,format_meetings,format_this_week,status
0,U07JPU9HQ76,1dc116e4-7c6e-4153-a389-067407a8cb44,1731831000.0,2024-11-17.json,*Project Name:* College Aspect *Working on:* S...,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,Fully parsed (p&r)
1,U07G6Q1GSHM,5FA29A73-4A2A-4678-AC97-7E274917CF9F,1731835000.0,2024-11-17.json,*Project Name:* MedKids *Working on:* Strong ...,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,Fully parsed (p&r)
2,U07MJ05MXH8,3172a161-83c7-4965-aa39-2608603cebdd,1731835000.0,2024-11-17.json,"Hello, team. I have completed the following in...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Not parsed
3,U07HGR6G2N7,5e8489e0-d2f6-4c64-90dd-1d5584969fb5,1731837000.0,2024-11-17.json,*Project Name:* MedKids *Working on:* Fitness ...,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,Fully parsed (p&r)
4,U07K16GMK6K,2E9891BF-E329-4762-924F-8F91FF77BFB5,1731856000.0,2024-11-17.json,Weekly report: Project name: Digital Runner Wo...,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,Fully parsed (p&r)


In [162]:
def parse_dataframe(df):
    ##-- Initialize a dataframe to collect the original and parsed information:
    checkins_parsed_df = pd.DataFrame(columns=list(df)+list(all_keywords)+['n_projects','index_'])
    
    for i in range(len(df)):
        ##-- Dataframe with the parsed checkin message, with as many rows as projects in the message: 
        text = df.at[i,'text']
    
        #report_by_blocks = split_report_into_blocks(text)
        #df_i_blocks = identify_categories(report_by_blocks)
        # Use this instead to include cases where ":" are use withing the text:
        indices_start_of_category, category_names = get_indices_of_lines_with_category_name(text)
        blocks_list = group_lines(text, indices_start_of_category)
        n_projects = count_projects(category_names)
        answers = extract_answers(blocks_list)
        df_i_blocks = parse_msg_to_df(text, indices_start_of_category, category_names, answers)
        
        df_i_blocks['n_projects'] = n_projects
        df_i_blocks['index_'] = i
        if len(df_i_blocks) == 0:
            df_i_blocks.loc[0] = [missing_value]*len(df_i_blocks.columns)
            df_i_blocks['index_'] = i
        df_i_blocks = handle_missing_values(df_i_blocks, missing_value)
        
        ##-- Dataframe with the original text. Rows are dublicated as many times as projects in the checkin:
        df_i_text = pd.DataFrame([list(df.loc[i].values)]*len(df_i_blocks))
        df_i_text.columns = df.columns
        df_i_text['index'] = i
        df_i_text = handle_missing_values(df_i_text, missing_value)
    
        ##-- Concatenate df_i_text and df_i_blocks for i-th message:
        df_i_all = pd.concat([df_i_text, df_i_blocks], axis=1, ignore_index=True)
        df_i_all.columns = list(df_i_text.columns) + list(df_i_blocks.columns)
        df_i_all = handle_missing_values(df_i_all, missing_value)
    
        ##-- Concatenate to checkins_parsed_df:
        checkins_parsed_df = pd.concat([checkins_parsed_df, df_i_all], axis=0, ignore_index=True)
    
    ##-- Combine "Progress" and "Roadblocks":
    checkins_parsed_df = combine_progress_and_roadblocks(checkins_parsed_df, missing_value)
    checkins_parsed_df = handle_missing_values(checkins_parsed_df, missing_value)

    return checkins_parsed_df

In [168]:
checkins_df_a = checkins_df[checkins_df['status']=='Fully parsed (p&r)'].copy()
checkins_df_a.index = np.arange(0,len(checkins_df_a),1)
checkins_df_a.head()

checkins_parsed_df = parse_dataframe(checkins_df_a)
df_out = checkins_parsed_df[['client_msg_id', 'text', 'project_name', 'working_on', 'progress_and_roadblocks_combined', 'plans_for_following_week', 'meetings', 'status', 'n_projects']]
df_out.to_csv('/home/agds/Desktop/TEST1.csv')
df_out[:15]

  checkins_parsed_df = pd.concat([checkins_parsed_df, df_i_all], axis=0, ignore_index=True)


Unnamed: 0,client_msg_id,text,project_name,working_on,progress_and_roadblocks_combined,plans_for_following_week,meetings,status,n_projects
0,1dc116e4-7c6e-4153-a389-067407a8cb44,*Project Name:* College Aspect\n*Working on:* ...,College Aspect,Sentiment Analysis and Submitting document,Finished the Jupyter notebook and will submit ...,Starting with NPS project,,Fully parsed (p&r),1
1,5FA29A73-4A2A-4678-AC97-7E274917CF9F,*Project Name:* MedKids \n*Working on:* Strong...,MedKids,Strong Like Role Model design,finished iterating on previous work and added ...,Finishing up the hi-fidelity design of the web...,,Fully parsed (p&r),1
2,5e8489e0-d2f6-4c64-90dd-1d5584969fb5,*Project Name:* MedKids\n*Working on:* Fitness...,MedKids,Fitness,Exported some images for github,Start working on wordpress,Wednesday(11/13/2024) - sync with medkids team,Fully parsed (p&r),2
3,5e8489e0-d2f6-4c64-90dd-1d5584969fb5,*Project Name:* MedKids\n*Working on:* Fitness...,ASPECTS,Labor Unions in Oregon,"Currently at row 30, unable to find the email ...",Complete at least to row 60 by the end of next...,,Fully parsed (p&r),2
4,2E9891BF-E329-4762-924F-8F91FF77BFB5,Weekly report:\nProject name: Digital Runner\n...,Digital Runner,Security Policy Development,Finished with the information gathering stage ...,Will start drafting the primary document for t...,Had one weekly meeting with Cyber Discussions ...,Fully parsed (p&r),1
5,70998e86-8e0d-4b31-be2e-86ad00a10274,*Sumiti: Weekly Report*\n*Project Name:* Jogap...,Jogapps and Dreampad,• Developing the PRD for Dreampad and compilin...,• Initiated work on the Jogapps PRD to align w...,• Finalize the updated PRD for Jogapps.• Conti...,• *Jogapps*: Meeting held on Friday at 2:00 PM...,Fully parsed (p&r),1
6,6A5241D5-199C-4C7E-8C92-53408C27E3A4,*Weekly Update*\n\n*Project Name*: Grants\n\n*...,Grants,Updating and correcting the data in the combin...,Working on the assigned rows. Delayed due to b...,Complete the updation of all the assigned rows,No meetings attended this weekThanks.,Fully parsed (p&r),1
7,86C92585-30AA-49AA-8A22-D05E66672D05,*Weekly Update*\n\n*Project Name:* AWS automat...,AWS automation,Tracking progress of team and finalizing AWS s...,No blockers,explore apex code for salesforce integration,11/15,Fully parsed (p&r),1
8,488b426d-e2ce-420f-84f3-8067e99a5cd1,*Project Name: Strategic Planning - Brochures*...,Strategic Planning - Brochures,Drafted a brochure for MEDKids after going thr...,No significant roadblocks apart from missing i...,Collect feedback and refine already drafted br...,No meetings this week,Fully parsed (p&r),1
9,E506D93A-713E-4492-ABC0-53A8186FA738,Project Name: Landmarks\n1. Working on: Coordi...,Landmarks,"Coordinating discussions on age group, questio...",Documented key points; need clarity on structu...,Create “Playtesting” folder on Mon<http,"Attended Meetings with Trey, Nikhil and Abhina...",Fully parsed (p&r),1


In [176]:
review_indices = []
for i in range(len(df_out)):
    review_i = False
    for category_name in ['project_name', 'working_on', 'progress_and_roadblocks_combined', 'plans_for_following_week', 'meetings']:
        if df_out.at[i,category_name]==missing_value:
            review_i = True
    if review_i == True:
        review_indices.append(i)

print(np.array(review_indices),'\n')

ii = 246
for i in range(len(df_out.at[ii, 'text'].splitlines())):
    print(df_out.at[ii, 'text'].splitlines()[i])

pd.DataFrame(df_out.iloc[ii]).T


#32: uses '-', multiple_projects
#111 >> mutiple projects (incomplete)


[ 32 111 113 114 230 231 232 246 247 252 253 288 289 306 307 308] 

Hi team, here is my update for this week:
*Project Name: Curriculum Planning*
*Working on*: created 5 lesson plans on various chemistry concepts
*Roadblocks*: No roadblocks encountered
*Plans for the next week*: continue making more lesson plans
*Meetings*: No meetings for this week

*Project Name: Spaulding Daniels White Papers*
*Working on:* Sustainable Innovations in Carbon Capture and Utilization (CCU) Technologies
*Progress and Roadblocks:* currently drafting and researching the topic. No roadblocks encountered
*Plans for the following week:* continue writing the paper
*Meetings:* No meetings this week

*Project Name: Landmark*
*Working on:* Middle School data
*Progress and Roadblocks:* finished data entries from 3340-3500, some data was left empty because could not find it.
*Plans for the following week:* continue filling data for the spreadsheet
*Meetings:* No meetings.


Unnamed: 0,client_msg_id,text,project_name,working_on,progress_and_roadblocks_combined,plans_for_following_week,meetings,status,n_projects
246,7ab12462-9961-48ed-9a03-83866e263d58,"Hi team, here is my update for this week:\n*Pr...",n/d,n/d,n/d,n/d,n/d,Fully parsed (p&r),n/d


In [128]:
def match_to_category(line, category_name):
    line = line.lower().lstrip('*-•. ').rstrip('*-•. ').replace('*', '').replace(' ','')
    out = False
    for keyword in keywords_dictionary[category_name]:
        if keyword.lower().replace(' ','')+':' in line:
            out = True
    return out
            

def review_format(text):
    is_format_correct_list = [0]*len(all_keywords)
    is_format_correct_dict = {}
    
    if text!='':
        text_to_lines = text.splitlines()
        for i_line in range(len(text_to_lines)):
            line = text_to_lines[i_line]
            for i in range(len(all_keywords)):
                category_name = all_keywords[i]
                if match_to_category(line, category_name) == True:
                    is_format_correct_list[i] = 1
                    break
                ##-- Double check 'roadblocks:' vs. 'progress and roadblocks:'
                if category_name == 'roadblocks' and match_to_category(line, 'roadblocks') == True:
                    is_format_correct_list[i] = 0
    
    for i in range(len(all_keywords)):
        is_format_correct_dict[all_keywords[i]] = is_format_correct_list[i]

    return is_format_correct_dict


def get_indices_of_lines_with_category_name(text):
    indices_start_of_category = []  
    category_names = []
    if text!='':
        text_to_lines = text.splitlines()
        for i_line in range(len(text_to_lines)):
            line = text_to_lines[i_line]
            for i in range(len(all_keywords)):
                category_name = all_keywords[i]
                if match_to_category(line, category_name) == True:
                    indices_start_of_category.append(i_line)
                    category_names.append(category_name)
                    break
                ##-- Double check 'roadblocks:' vs. 'progress and roadblocks:'
                if category_name == 'roadblocks' and match_to_category(line, 'roadblocks') == True:
                    indices_start_of_category = indices_start_of_category[:-1]
                    category_names = category_names[:-1]
                    
    return indices_start_of_category, category_names



def group_lines(text, indices_start_of_category):
    blocks = []
    text_to_lines = text.splitlines()
    for i in range(len(indices_start_of_category)-1):
        begin = indices_start_of_category[i]
        end = indices_start_of_category[i+1]
        blocks.append(text_to_lines[begin:end]) 
    blocks.append(text_to_lines[end:]) 
    return blocks

def count_projects(category_names):
    counter = 0
    for name in category_names:
        if name == 'project_name':
            counter += 1
    return counter


def extract_answers(blocks_list):
    answers = []
    for block in blocks_list:
        answer_text = ''
        for line in block:
            line_matches = False
            for category in all_keywords:
                if match_to_category(line, category)==True:
                    answer_text += line.split(":")[1].lstrip('*-•. ').rstrip('*-•. ').replace('*', '')
                    line_matches = True
                    break
            if line_matches==False:
                answer_text += line
        answers.append(answer_text)
    return answers


def parse_msg_to_df(text, indices_start_of_category, category_names, answers):
    n_project = count_projects(category_names)
    columns=list(all_keywords)+['n_projects','index_']
    df = pd.DataFrame([[np.nan]*n_project]*len(columns)).T
    df.columns = columns  
    df = df.astype('object')

    project_counter = -1
    for i in range(len(category_names)):
        if category_names[i] == 'project_name':
            project_counter += 1
            #df.at[project_counter, 'text'] = text
        df.at[project_counter, category_names[i]] = answers[i]
    return df

In [65]:
text = df_out.at[4, 'text']

for i in range(len(text.splitlines())):
    print(i, text.splitlines()[i])
print()

indices_start_of_category, category_names = get_indices_of_lines_with_category_name(text)
print(indices_start_of_category, '\n')
print(category_names, '\n')

blocks_list = group_lines(text, indices_start_of_category)
print(blocks_list, '\n')

n_projects = count_projects(category_names)
print(n_projects, '\n')

answers = extract_answers(blocks_list)
print(answers)

parse_msg_to_df(text, indices_start_of_category, category_names, answers)

0 Weekly report:
1 Project name: Digital Runner
2 Working on: Security Policy Development 
3 Progress and roadblocks: Finished with the information gathering stage and will start drafting from next week.
4 Plans for next week: Will start drafting the primary document for the policy.
5 Meetings: Had one weekly meeting with Cyber Discussions team on Tuesday, 12th November, 12 PM EST.

[1, 2, 3, 4, 5] 

['project_name', 'working_on', 'progress_and_roadblocks', 'plans_for_following_week', 'meetings'] 

[['Project name: Digital Runner'], ['Working on: Security Policy Development '], ['Progress and roadblocks: Finished with the information gathering stage and will start drafting from next week.'], ['Plans for next week: Will start drafting the primary document for the policy.'], ['Meetings: Had one weekly meeting with Cyber Discussions team on Tuesday, 12th November, 12 PM EST.']] 

1 

['Digital Runner', 'Security Policy Development', 'Finished with the information gathering stage and will 

Unnamed: 0,project_name,working_on,progress_and_roadblocks,progress,roadblocks,plans_for_following_week,meetings,this_week,n_projects,index_
0,Digital Runner,Security Policy Development,Finished with the information gathering stage ...,,,Will start drafting the primary document for t...,Had one weekly meeting with Cyber Discussions ...,,,


In [126]:
sample_text = "<!channel> reposting <@U07FCQXU7Q9>'s message. Please adhere to it. THANK YOU."

for i in range(len(checkins_df)):
    if sample_text in checkins_df.at[i,'text']:
        print(i)
        display(checkins_df.loc[i])