## Preprocessing (1.1): cleaning
Goal: clean the data frame so that it can be tokenized. Split the `conversation` column into separate professor and student utterances, and strip redundant text from the prompts.

In [5]:
import pandas as pd
import re

In [6]:
# Load the dataset from the Hugging Face Hub (the hf:// path requires the huggingface_hub package)
df_full = pd.read_parquet("hf://datasets/ruggsea/stanford-encyclopedia-of-philosophy_chat_multi_turn/data/train-00000-of-00001.parquet")


In [7]:
# I recommend working on a small mock dataset (here, the first 5 rows) while writing the code
df = df_full.head(5)
df.head()

| | conversation | prompt |
|---|---|---|
| 0 | [{'content': 'Professor, I was thinking about ... | You are an expert and well-read Philosophy pro... |
| 1 | [{'content': 'Professor Phil, do we always cho... | You are an expert and well-read Philosophy pro... |
| 2 | [{'content': 'Professor Phil, I was reading th... | You are an expert and well-read Philosophy pro... |
| 3 | [{'content': 'Professor Phil, I've been wonder... | You are an expert and well-read Philosophy pro... |
| 4 | [{'content': 'Professor, I've been thinking ab... | You are an expert and well-read Philosophy pro... |


In [8]:
def extract_content(conversation):
    """
    Unwraps a conversation into the professor's and the student's utterances.
    :param conversation: sequence of turns, each a dict with 'role' and 'content' keys
    :return: tuple (professor utterances, student utterances), each a list of strings
    """
    professor = []
    user = []
    for turn in conversation:
        if turn['role'] == 'user':
            user.append(turn['content'])
        else:
            # any role other than 'user' is treated as the professor
            professor.append(turn['content'])
    return professor, user

def prepare_prompt(prompt: str) -> list:
    """
    Extracts the double-quoted parts of a prompt and drops everything else.
    Designed for use with `Series.apply`.
    :param prompt: prompt to clean
    :return: list of the strings found between double quotes (a match may span several lines)
    """
    return re.findall(r'"([\s\S]+?)"', prompt)
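
To make the behaviour of the two helpers concrete, the sketch below re-declares them in compact form and runs them on a hand-made conversation and prompt. The sample data (`toy_conversation`, `toy_prompt`) are invented for illustration and are not taken from the dataset, so the real prompts may be structured differently.

```python
import re

def extract_content(conversation):
    # Compact re-declaration of the helper above, so this block runs on its own.
    professor, user = [], []
    for turn in conversation:
        (user if turn['role'] == 'user' else professor).append(turn['content'])
    return professor, user

def prepare_prompt(prompt):
    # Non-greedy match of everything between a pair of double quotes.
    return re.findall(r'"([\s\S]+?)"', prompt)

# Hypothetical sample data (assumption: same keys as the dataset, invented text).
toy_conversation = [
    {'role': 'user', 'content': 'Professor, is free will an illusion?'},
    {'role': 'assistant', 'content': 'That depends on what you mean by free.'},
    {'role': 'user', 'content': 'Free from causes, I suppose.'},
    {'role': 'assistant', 'content': 'Then let us look at determinism.'},
]
toy_prompt = 'You are a professor. Topic: "Free Will". Text: "First line.\nSecond line."'

professor, student = extract_content(toy_conversation)
quoted = prepare_prompt(toy_prompt)

print(professor)  # the two 'assistant' turns, in order
print(student)    # the two 'user' turns, in order
print(quoted)     # one list entry per quoted span
```

Note that `prepare_prompt` returns a list of quoted spans rather than a single string, so the `prompt` column holds lists after cleaning.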


In [9]:
# Split each conversation into separate columns of df
# select the professor's utterances
df['professor'] = df['conversation'].apply(lambda x: extract_content(x)[0])

# select the student's utterances
df['student'] = df['conversation'].apply(lambda x: extract_content(x)[1])

# keep only the quoted parts of each prompt
df['prompt'] = df['prompt'].apply(lambda x: prepare_prompt(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['professor'] = df['conversation'].apply(lambda x: extract_content(x)[0])
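
The warning appears because `df` was created with `df_full.head(5)`, which pandas may treat as a slice of `df_full`, so adding columns to it is ambiguous. A common way to avoid it is to take an explicit copy before assigning. The sketch below shows this on a small stand-in frame; `toy_full` and `split_roles` are hypothetical names used only for illustration. It also unpacks both utterance lists in one pass instead of calling the helper twice per row.

```python
import pandas as pd

# Stand-in for df_full (assumption: same 'role'/'content' keys, invented text).
toy_full = pd.DataFrame({
    'conversation': [
        [{'role': 'user', 'content': 'Q1'}, {'role': 'assistant', 'content': 'A1'}],
        [{'role': 'user', 'content': 'Q2'}, {'role': 'assistant', 'content': 'A2'}],
        [{'role': 'user', 'content': 'Q3'}, {'role': 'assistant', 'content': 'A3'}],
    ],
})

def split_roles(conversation):
    # Hypothetical helper: returns (professor utterances, student utterances).
    student = [t['content'] for t in conversation if t['role'] == 'user']
    professor = [t['content'] for t in conversation if t['role'] != 'user']
    return professor, student

# .copy() makes the subset an independent DataFrame, so assignments are unambiguous.
toy = toy_full.head(2).copy()

# One pass over the conversations, then pick each half of the returned tuple.
pairs = toy['conversation'].apply(split_roles)
toy['professor'] = pairs.map(lambda p: p[0])
toy['student'] = pairs.map(lambda p: p[1])

print(toy[['professor', 'student']])
```

The original frame is left untouched, and the new columns exist only on the copy.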

In [10]:
df = df.drop('conversation', axis=1)

In [11]:
df = df.explode(["professor", "student"])  # creates one row per student-professor utterance pair
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21 entries, 0 to 4
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   prompt     21 non-null     object
 1   professor  21 non-null     object
 2   student    21 non-null     object
dtypes: object(3)
memory usage: 672.0+ bytes
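
As a small illustration of what `explode` does with two list columns, the sketch below applies it to an invented three-utterance frame (the `toy` data are an assumption, not part of the dataset). Exploding several columns at once requires pandas 1.3 or later, and the lists in each row must have the same length.

```python
import pandas as pd

# Invented data: row 0 has two utterance pairs, row 1 has one.
toy = pd.DataFrame({
    'prompt': ['topic A', 'topic B'],
    'professor': [['A1', 'A2'], ['B1']],
    'student': [['Q1', 'Q2'], ['R1']],
})

# Both list columns are unpacked in parallel; 'prompt' is repeated for each new row.
exploded = toy.explode(['professor', 'student'])
print(exploded)

# The original row labels are repeated (0, 0, 1), which matches the
# "21 entries, 0 to 4" index seen above; reset_index gives a fresh 0..n-1 index.
exploded_reset = exploded.reset_index(drop=True)

# Rows whose lists differ in length cannot be exploded together.
mismatched = pd.DataFrame({'professor': [['A1', 'A2']], 'student': [['Q1']]})
explode_failed = False
try:
    mismatched.explode(['professor', 'student'])
except ValueError as err:
    explode_failed = True
    print('explode failed:', err)
```

If a conversation ever had an unequal number of student and professor turns, the length check in the last part is where the real pipeline would stop.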


In [12]:
# The data is now clean; save it for later steps (the dfs/ directory must already exist)
df.to_csv('dfs/preprocessed-df.csv', sep=';', index=False)
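
One thing to keep in mind when loading this file later: the `prompt` column holds lists (the output of `re.findall`), and CSV stores them as their string representation. The sketch below shows one possible way to restore them with `ast.literal_eval`. It uses an in-memory buffer and invented data instead of the real file, so treat it as an illustration rather than as part of the original pipeline.

```python
import ast
import io

import pandas as pd

# Invented one-row frame with a list-valued 'prompt' column.
toy = pd.DataFrame({
    'prompt': [['Free Will', 'Some text']],
    'professor': ['A1'],
    'student': ['Q1'],
})

# Write and read back with the same separator as above, using a buffer instead of a file.
buffer = io.StringIO()
toy.to_csv(buffer, sep=';', index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, sep=';')

# After the round trip each prompt is a plain string such as "['Free Will', 'Some text']".
raw_type = type(restored['prompt'].iloc[0])

# literal_eval safely parses that string back into a Python list.
restored['prompt'] = restored['prompt'].apply(ast.literal_eval)
print(raw_type, restored['prompt'].iloc[0])
```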