<div class="alert alert-block alert-danger">

# Data Processing

Environment: Python 3.9

<div class="alert alert-block alert-info">
    
# Table of Contents

</div>

[2. Data Processing](#1) <br>
$\;\;\;\;$[2.1. Import Libraries](#11) <br>
$\;\;\;\;$[2.2. Data Wrangling](#12) <br>
$\;\;\;\;$[2.3. Data Exploration](#13) <br>
$\;\;\;\;$[2.4. Descriptive Statistics](#14) <br>

<div class="alert alert-block alert-success">
    
## Import Libraries <a class="anchor" name="11"></a>

Import the libraries needed for data wrangling and exploration.

In [16]:
import pandas as pd
import numpy as np

In [33]:
# Load the politeness-annotated posts (posts.csv is the file written by the final cell of this notebook)
df = pd.read_csv('data/stanfordMOOCForumPostsSet/posts.csv')

<div class="alert alert-block alert-success">
    
## Data Wrangling <a class="anchor" name="12"></a>

<div class="alert alert-block alert-success">
    
### Drop NA Values <a class="anchor" name="121"></a>

In [36]:
# Drop rows where 'forum_uid' is null
df = df.dropna(subset=['forum_uid'])
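
As a quick sanity check on what `dropna(subset=...)` does, the sketch below uses a small made-up frame (the `toy` values are illustrative and not taken from the dataset). Only rows missing `forum_uid` are removed; missing values in other columns are left untouched.

```python
import pandas as pd

# Hypothetical miniature of the posts table
toy = pd.DataFrame({
    "forum_uid": ["A1", None, "C3"],
    "Text": ["first post", "second post", None],
})

# Drop only the rows whose 'forum_uid' is missing
cleaned = toy.dropna(subset=["forum_uid"])

# Number of rows removed by the filter
rows_dropped = len(toy) - len(cleaned)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
```

Comparing row counts before and after the call is a cheap way to confirm that the filter removed about as many rows as expected.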

<div class="alert alert-block alert-success">
    
### Fill Missing Values <a class="anchor" name="122"></a>

In [37]:
# Fill missing CourseType
# Function to extract the value before the first '/' from 'course_display_name'
def extract_course_type(course_display_name):
    if pd.notnull(course_display_name):  # Check if course_display_name is not null
        return course_display_name.split('/')[0]  # Split and take the first part
    return None

# Apply the function to fill null values in 'CourseType' with the extracted value from 'course_display_name'
df['CourseType'] = df.apply(lambda row: extract_course_type(row['course_display_name']) if pd.isnull(row['CourseType']) else row['CourseType'], axis=1)
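
The row-wise `apply` above works, but the same fill can probably be expressed with vectorized string methods, which tend to be faster on tables of this size. The following is a minimal sketch on a made-up frame (the `toy` data and the `prefix` name are illustrative assumptions), not a drop-in replacement verified against the full dataset.

```python
import pandas as pd

# Hypothetical miniature of the posts table
toy = pd.DataFrame({
    "CourseType": ["Education", None, None],
    "course_display_name": [
        "Education/EDUC115N/How_to_Learn_Math",
        "Medicine/MedStats/Summer2014",
        None,
    ],
})

# Text before the first '/' in 'course_display_name' (missing where the name is missing)
prefix = toy["course_display_name"].str.split("/").str[0]

# Fill only the missing 'CourseType' entries; existing values are kept
toy["CourseType"] = toy["CourseType"].fillna(prefix)
print(toy["CourseType"].tolist())
```

Rows where both columns are missing stay missing, which matches the behaviour of `extract_course_type` returning `None`.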

In [40]:
# Fill 'comment_thread_id' with 'forum_post_id' where 'comment_thread_id' is null and 'post_type' is 'CommentThread'
df.loc[(df['comment_thread_id'].isnull()) & (df['post_type'] == 'CommentThread'), 'comment_thread_id'] = df['forum_post_id']
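
The masked assignment relies on pandas aligning the right-hand side by index, so only the selected rows receive a value. A small sketch with made-up IDs (all values below are hypothetical) shows which rows change:

```python
import pandas as pd

# Hypothetical miniature: one thread starter without a thread id,
# one comment without a thread id, one thread starter that already has one
toy = pd.DataFrame({
    "forum_post_id": ["p1", "p2", "p3"],
    "post_type": ["CommentThread", "Comment", "CommentThread"],
    "comment_thread_id": [None, None, "t9"],
})

# Rows that start a thread but have no thread id yet
mask = toy["comment_thread_id"].isnull() & (toy["post_type"] == "CommentThread")

# A thread starter is its own thread, so reuse its post id
toy.loc[mask, "comment_thread_id"] = toy.loc[mask, "forum_post_id"]
print(toy)
```

Only the first row is filled; the plain comment keeps its missing thread id and the existing `t9` is not overwritten.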

In [46]:
# Optionally save the cleaned DataFrame to a new file
df.to_csv('data/stanfordMOOCForumPostsSet/cleaned_posts.csv', index=False)

<div class="alert alert-block alert-success">
    
## Data Exploration <a class="anchor" name="13"></a>

In [43]:
df.dtypes

Text                         object
Opinion(1/0)                  int64
Question(1/0)                 int64
Answer(1/0)                   int64
Sentiment(1-7)              float64
Confusion(1-7)              float64
Urgency(1-7)                float64
CourseType                   object
forum_post_id                object
course_display_name          object
forum_uid                    object
created_at                   object
post_type                    object
anonymous                   float64
anonymous_to_peers          float64
up_count                    float64
comment_thread_id            object
reads                       float64
politeness_score            float64
unique_id                     int64
sum_svm_politeness_score      int64
Please                        int64
Please_start                  int64
HASHEDGE                      int64
Indirect_(btw)                int64
Hedges                        int64
Factuality                    int64
Deference                   

In [44]:
# Iterate over each column to show specific information based on data type
for column in df.columns:
    print(f"\nColumn: {column}")
    if pd.api.types.is_numeric_dtype(df[column]):
        # If the column is numeric, show the range (min, max)
        print(f"Numeric Column. Range: {df[column].min()} to {df[column].max()}")
    elif isinstance(df[column].dtype, pd.CategoricalDtype) or pd.api.types.is_object_dtype(df[column]):
        # If the column is categorical or object (string), show unique values
        print(f"Categorical Column. Unique values: {df[column].unique()}")
    else:
        print("Other data type.")


Column: Text
Categorical Column. Unique values: ['Interesting! How often we say those things to others without really understanding what we are saying. That must have been a powerful experience! Excellent!'
 'What is \\Algebra as a Math Game\\" or are you just saying you create games that incorporate algebra."'
 'I like the idea of my kids principal who says \\Smart doesn\'t mean easy, smart means working hard\\" and incorporating the idea of making mistakes into working hard."'
 ...
 '> Hello Josh,_x0007__x0007_Is this hypothesis formulation correct?_x0007__x0007_HO: Breast feeding is related to obesity_x0007_H1: Breast feeding is not related to obesity_x0007__x0007_I think I am almost there!_x0007__x0007_Best Wishes,_x0007__x0007_<nameRedac_<anon_screen_name_redacted>>'
 "Hi Josh,_x0007__x0007__x0007__x0007_Looking at the table for this question, I see that the first number in the second column is 130. how can this be if we're talking percentages? I know it doesn't matter to answer 

<div class="alert alert-block alert-success">
    
## Descriptive Statistics <a class="anchor" name="14"></a>

In [6]:
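import nltk
from collections import defaultdict
from convokit import Corpus, Speaker, Utterance, TextParser, PolitenessStrategies

# Download NLTK data for sentence tokenization (uncomment on first run)
# nltk.download('punkt')

# Load the CSV file
df = pd.read_csv('data/stanfordMOOCForumPostsSet/stanfordMOOCForumPostsSet.csv')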

In [7]:
# Create a new column called 'unique_id' and assign a unique number to each row
df['unique_id'] = range(1, len(df) + 1)

df

Unnamed: 0,Text,Opinion(1/0),Question(1/0),Answer(1/0),Sentiment(1-7),Confusion(1-7),Urgency(1-7),CourseType,forum_post_id,course_display_name,forum_uid,created_at,post_type,anonymous,anonymous_to_peers,up_count,comment_thread_id,reads,politeness_score,unique_id
0,Interesting! How often we say those things to ...,1,0,0,6.5,2.0,1.5,Education,5225177f2c501f0a00000015,Education/EDUC115N/How_to_Learn_Math,30CADB93E6DE4711193D7BD05F2AE95C,2013-09-02 22:55:59,Comment,0.0,0.0,0.0,5221a8262cfae31200000001,41.0,0.784717,1
1,"What is \Algebra as a Math Game\"" or are you j...",0,1,0,4.0,5.0,3.5,Education,5207d0e9935dfc0e0000005e,Education/EDUC115N/How_to_Learn_Math,37D8FAEE7D0B94B6CFC57D98FD3D0BA5,2013-08-11 17:59:05,Comment,0.0,0.0,0.0,520663839df35b0a00000043,55.0,0.550738,2
2,I like the idea of my kids principal who says ...,1,0,0,5.5,3.0,2.5,Education,52052c82d01fec0a00000071,Education/EDUC115N/How_to_Learn_Math,CC11480215042B3EB6E5905EAB13B733,2013-08-09 17:53:06,Comment,0.0,0.0,0.0,51e59415e339d716000001a6,25.0,0.569943,3
3,"From their responses, it seems the students re...",1,0,0,6.0,3.0,2.5,Education,5240a45e067ebf1200000008,Education/EDUC115N/How_to_Learn_Math,C717F838D10E8256D7C88B33C43623F1,2013-09-23 20:28:14,CommentThread,0.0,0.0,0.0,,0.0,0.866788,4
4,"The boys loved math, because \there is freedom...",1,0,0,7.0,2.0,3.0,Education,5212c5e2dd10251500000062,Education/EDUC115N/How_to_Learn_Math,F83887D68EA48964687C6441782CDD0E,2013-08-20 01:26:58,CommentThread,0.0,0.0,0.0,,3.0,0.980352,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29599,The p value tells us the probability of observ...,0,0,1,4.0,3.5,2.0,Medicine,53e44042bf0e2c074e000039,Medicine/MedStats/Summer2014,83C5EC9DD9319435989AB52FA7E580BC,2014-08-08 03:13:06,Comment,0.0,0.0,0.0,53e4193fbce97d56a9000026,144.0,0.963438,29600
29600,given the null hypothesis is considered true,0,0,1,4.0,3.5,1.0,Medicine,53e442dfbf0e2c8d66000034,Medicine/MedStats/Summer2014,83C5EC9DD9319435989AB52FA7E580BC,2014-08-08 03:24:15,Comment,0.0,0.0,0.0,53e4193fbce97d56a9000026,144.0,0.134566,29601
29601,"> Hello Josh,_x0007__x0007_Is this hypothesis ...",0,1,0,4.0,5.0,5.5,Medicine,53e447cbbce97d56a9000032,Medicine/MedStats/Summer2014,83C5EC9DD9319435989AB52FA7E580BC,2014-08-08 03:45:15,Comment,0.0,0.0,0.0,53e4193fbce97d56a9000026,144.0,0.850349,29602
29602,"Hi Josh,_x0007__x0007__x0007__x0007_Looking at...",0,1,0,3.5,5.0,5.5,Medicine,53e46e1cbce97d5d4300003c,Medicine/MedStats/Summer2014,673E487F9CE5343B8F32E7C7D49B6098,2014-08-08 06:28:44,Comment,0.0,0.0,0.0,53dfe280a8638d3f7a00002f,203.0,0.972872,29603


In [8]:
# Create a ConvoKit Corpus for each sentence
# Each row may have multiple sentences. Split the text and treat each sentence as a separate utterance.
utterances = []
for i, row in df.iterrows():
    if isinstance(row['Text'], str):  # Check if the text is a valid string
        sentences = nltk.sent_tokenize(row['Text'])  # Split the text into sentences
        for j, sentence in enumerate(sentences):
            utterances.append(Utterance(id=f"{i}_{j}", speaker=Speaker(id=f"Speaker_{i}"), text=sentence))

corpus = Corpus(utterances=utterances)
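
The utterance IDs follow a `"{row index}_{sentence index}"` scheme, which is what later lets each sentence be traced back to its post. The sketch below illustrates just that ID scheme using only the standard library; `naive_sentences` is a deliberately simplistic, hypothetical stand-in for `nltk.sent_tokenize`, and the `posts` dictionary is made up.

```python
import re

def naive_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' (rough stand-in for a real tokenizer)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Hypothetical posts keyed by row index; row 1 has no text
posts = {
    0: "Interesting! That must have been powerful.",
    1: None,
    2: "given the null hypothesis is considered true",
}

utterance_ids = {}
for row_index, text in posts.items():
    if isinstance(text, str):  # skip missing or non-string text, as in the cell above
        for j, sentence in enumerate(naive_sentences(text)):
            utterance_ids[f"{row_index}_{j}"] = sentence

for utt_id, sentence in utterance_ids.items():
    print(utt_id, "->", sentence)
```

Rows without text produce no utterances at all, so any later lookup by ID has to skip them in the same way.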

In [9]:
# Apply dependency parses and politeness strategies
parser = TextParser(verbosity=5000)
corpus = parser.transform(corpus)

ps = PolitenessStrategies()
corpus = ps.transform(corpus, markers=True)

5000/97566 utterances processed
10000/97566 utterances processed
15000/97566 utterances processed
20000/97566 utterances processed
25000/97566 utterances processed
30000/97566 utterances processed
35000/97566 utterances processed
40000/97566 utterances processed
45000/97566 utterances processed
50000/97566 utterances processed
55000/97566 utterances processed
60000/97566 utterances processed
65000/97566 utterances processed
70000/97566 utterances processed
75000/97566 utterances processed
80000/97566 utterances processed
85000/97566 utterances processed
90000/97566 utterances processed
95000/97566 utterances processed
97566/97566 utterances processed


In [17]:
# Use the trained classifier to predict politeness for each sentence
# NOTE: `classifier` must already be trained; its training cell is not included in this section
predictions = classifier.transform(corpus)

In [18]:
# Compute the sum of politeness scores (1 for polite, -1 for impolite) and aggregate the strategies for each row
results = []
for i, row in df.iterrows():
    if isinstance(row['Text'], str):  # Skip non-string entries in the text column
        sentences = nltk.sent_tokenize(row['Text'])
        sentence_scores = []
        sentence_strategies = defaultdict(list)

        # For each sentence, retrieve its politeness prediction and strategies
        for j, sentence in enumerate(sentences):
            utt = predictions.get_utterance(f"{i}_{j}")

            # Map the prediction to a score: 1 stays 1 (polite), 0 becomes -1 (impolite)
            politeness_prediction = utt.meta['prediction']
            if politeness_prediction == 0:
                politeness_prediction = -1
            sentence_scores.append(politeness_prediction)

            # Aggregate strategies across all sentences
            for strategy, count in utt.meta['politeness_strategies'].items():
                sentence_strategies[strategy].append(count)

        # Add the aggregated results to the list; sum() of an empty list is 0,
        # so a text with no sentences gets a score of 0
        results.append({
            "Text": row['Text'],
            "unique_id": row.get('unique_id'),
            "sum_svm_politeness_score": sum(sentence_scores),
            "svm_politeness_strategies": dict(sentence_strategies)
        })

# Convert to DataFrame
results_df = pd.DataFrame(results)
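
To make the aggregation concrete, here is a standard-library sketch for a single post with three sentences. The prediction values and marker counts are invented for illustration; in the notebook they come from `utt.meta`.

```python
from collections import defaultdict

# Hypothetical per-sentence classifier outputs (1 = polite, 0 = impolite)
sentence_predictions = [1, 0, 0]

# Hypothetical per-sentence strategy counts
sentence_markers = [
    {"Please": 1, "Hedges": 0},
    {"Please": 0, "Hedges": 2},
    {"Please": 0, "Hedges": 1},
]

# Map 0 to -1 and sum over the sentences of the post
score = sum(1 if p == 1 else -1 for p in sentence_predictions)

# Collect one list of counts per strategy, one entry per sentence
strategies = defaultdict(list)
for markers in sentence_markers:
    for name, count in markers.items():
        strategies[name].append(count)

print(score)
print(dict(strategies))
```

Because the score is a sum rather than a mean, its magnitude grows with the number of sentences, so a long post can reach values such as -6 in the output below.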

In [19]:
# Display the updated svm DataFrame
results_df

Unnamed: 0,Text,unique_id,sum_svm_politeness_score,svm_politeness_strategies
0,Interesting! How often we say those things to ...,1,0,"{'feature_politeness_==Please==': [0, 0, 0, 0]..."
1,"What is \Algebra as a Math Game\"" or are you j...",2,-1,"{'feature_politeness_==Please==': [0], 'featur..."
2,I like the idea of my kids principal who says ...,3,-1,"{'feature_politeness_==Please==': [0], 'featur..."
3,"From their responses, it seems the students re...",4,-2,"{'feature_politeness_==Please==': [0, 0], 'fea..."
4,"The boys loved math, because \there is freedom...",5,-6,"{'feature_politeness_==Please==': [0, 0, 0, 0,..."
...,...,...,...,...
29598,The p value tells us the probability of observ...,29600,-1,"{'feature_politeness_==Please==': [0], 'featur..."
29599,given the null hypothesis is considered true,29601,-1,"{'feature_politeness_==Please==': [0], 'featur..."
29600,"> Hello Josh,_x0007__x0007_Is this hypothesis ...",29602,-1,"{'feature_politeness_==Please==': [0], 'featur..."
29601,"Hi Josh,_x0007__x0007__x0007__x0007_Looking at...",29603,-2,"{'feature_politeness_==Please==': [0, 0], 'fea..."


In [20]:
# Sum the strategy lists in 'svm_politeness_strategies' and expand them into individual columns
def expand_politeness_strategies(df, strategy_column):
    # First, extract the dictionary from the 'svm_politeness_strategies' column (which should already be dictionaries)
    strategy_dicts = df[strategy_column]

    # Create a DataFrame where each strategy has its own column
    strategy_df = pd.json_normalize(strategy_dicts)

    # Sum the lists in each strategy column (Series.map avoids the deprecated DataFrame.applymap)
    strategy_df = strategy_df.apply(lambda col: col.map(lambda x: sum(x) if isinstance(x, list) else x))

    # Rename the columns to remove 'feature_politeness_==' and '=='
    strategy_df.columns = [col.replace('feature_politeness_==', '').replace('==', '') for col in strategy_df.columns]

    # Merge the new strategy columns back with the original DataFrame
    df = df.drop(strategy_column, axis=1).join(strategy_df)
    
    return df

# Apply the function to your DataFrame
results_df = expand_politeness_strategies(results_df, 'svm_politeness_strategies')
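
The effect of the expansion is easiest to see on a tiny example. The sketch below reproduces the same idea (sum each per-sentence list, strip the `feature_politeness_==...==` wrapper, join back on the index) with a plain dictionary comprehension instead of `json_normalize`; the two-row `toy` frame is made up.

```python
import pandas as pd

col = "svm_politeness_strategies"

# Hypothetical results for two posts (two sentences and one sentence)
toy = pd.DataFrame({
    "unique_id": [1, 2],
    col: [
        {"feature_politeness_==Please==": [0, 1], "feature_politeness_==Hedges==": [2, 0]},
        {"feature_politeness_==Please==": [0], "feature_politeness_==Hedges==": [1]},
    ],
})

# One row of summed counts per post, with cleaned-up strategy names
counts = pd.DataFrame(
    [
        {
            name.replace("feature_politeness_==", "").replace("==", ""): sum(values)
            for name, values in strategy_dict.items()
        }
        for strategy_dict in toy[col]
    ],
    index=toy.index,
)

# Replace the dictionary column with the per-strategy count columns
expanded = toy.drop(columns=col).join(counts)
print(expanded)
```

Each strategy ends up as an integer column holding the total count over all sentences of the post.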

In [21]:
results_df

Unnamed: 0,Text,unique_id,sum_svm_politeness_score,Please,Please_start,HASHEDGE,Indirect_(btw),Hedges,Factuality,Deference,...,1st_person_start,2nd_person,2nd_person_start,Indirect_(greeting),Direct_question,Direct_start,HASPOSITIVE,HASNEGATIVE,SUBJUNCTIVE,INDICATIVE
0,Interesting! How often we say those things to ...,1,0,0,0,1,0,0,1,2,...,0,0,0,0,1,0,3,0,0,0
1,"What is \Algebra as a Math Game\"" or are you j...",2,-1,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
2,I like the idea of my kids principal who says ...,3,-1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,1,0,0
3,"From their responses, it seems the students re...",4,-2,0,0,1,0,1,1,0,...,0,0,0,0,0,0,2,0,0,0
4,"The boys loved math, because \there is freedom...",5,-6,0,0,2,0,2,0,0,...,1,2,1,0,0,0,2,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29598,The p value tells us the probability of observ...,29600,-1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
29599,given the null hypothesis is considered true,29601,-1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
29600,"> Hello Josh,_x0007__x0007_Is this hypothesis ...",29602,-1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
29601,"Hi Josh,_x0007__x0007__x0007__x0007_Looking at...",29603,-2,0,0,0,0,0,0,0,...,1,0,0,1,1,0,0,1,0,0


In [22]:
# Drop the 'Text' column from results_df
results_df = results_df.drop('Text', axis=1)

# Display the DataFrame to confirm the column is dropped
results_df

Unnamed: 0,unique_id,sum_svm_politeness_score,Please,Please_start,HASHEDGE,Indirect_(btw),Hedges,Factuality,Deference,Gratitude,...,1st_person_start,2nd_person,2nd_person_start,Indirect_(greeting),Direct_question,Direct_start,HASPOSITIVE,HASNEGATIVE,SUBJUNCTIVE,INDICATIVE
0,1,0,0,0,1,0,0,1,2,0,...,0,0,0,0,1,0,3,0,0,0
1,2,-1,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
2,3,-1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,1,0,0
3,4,-2,0,0,1,0,1,1,0,0,...,0,0,0,0,0,0,2,0,0,0
4,5,-6,0,0,2,0,2,0,0,0,...,1,2,1,0,0,0,2,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29598,29600,-1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
29599,29601,-1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
29600,29602,-1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
29601,29603,-2,0,0,0,0,0,0,0,0,...,1,0,0,1,1,0,0,1,0,0


In [23]:
# Perform a left join between the original df and results_df on the 'unique_id' column
df = df.merge(results_df, on='unique_id', how='left')
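
A left join keeps every post, including those that were skipped during scoring because their text was not a string. The sketch below, on made-up frames, uses the `validate` and `indicator` options of `merge` to check that the key is unique and to list the posts that received no features. It also shows the likely reason the strategy counts appear as floats (`0.0`) after the merge: unmatched rows introduce `NaN`, which upcasts integer columns.

```python
import pandas as pd

# Hypothetical posts; the second one has no text and was never scored
posts = pd.DataFrame({"unique_id": [1, 2, 3], "Text": ["a", None, "c"]})
features = pd.DataFrame({"unique_id": [1, 3], "Please": [0, 2]})

merged = posts.merge(
    features,
    on="unique_id",
    how="left",
    validate="one_to_one",  # raise if 'unique_id' is duplicated on either side
    indicator=True,         # add a '_merge' column describing each row's origin
)

# Posts that have no matching feature row
unmatched = merged.loc[merged["_merge"] == "left_only", "unique_id"].tolist()
print(unmatched)
print(merged["Please"].dtype)
```

Dropping the `_merge` column afterwards keeps the saved file identical in shape to the one produced without the check.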

In [25]:
df

Unnamed: 0,Text,Opinion(1/0),Question(1/0),Answer(1/0),Sentiment(1-7),Confusion(1-7),Urgency(1-7),CourseType,forum_post_id,course_display_name,...,1st_person_start,2nd_person,2nd_person_start,Indirect_(greeting),Direct_question,Direct_start,HASPOSITIVE,HASNEGATIVE,SUBJUNCTIVE,INDICATIVE
0,Interesting! How often we say those things to ...,1,0,0,6.5,2.0,1.5,Education,5225177f2c501f0a00000015,Education/EDUC115N/How_to_Learn_Math,...,0.0,0.0,0.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0
1,"What is \Algebra as a Math Game\"" or are you j...",0,1,0,4.0,5.0,3.5,Education,5207d0e9935dfc0e0000005e,Education/EDUC115N/How_to_Learn_Math,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,I like the idea of my kids principal who says ...,1,0,0,5.5,3.0,2.5,Education,52052c82d01fec0a00000071,Education/EDUC115N/How_to_Learn_Math,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
3,"From their responses, it seems the students re...",1,0,0,6.0,3.0,2.5,Education,5240a45e067ebf1200000008,Education/EDUC115N/How_to_Learn_Math,...,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0
4,"The boys loved math, because \there is freedom...",1,0,0,7.0,2.0,3.0,Education,5212c5e2dd10251500000062,Education/EDUC115N/How_to_Learn_Math,...,1.0,2.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29599,The p value tells us the probability of observ...,0,0,1,4.0,3.5,2.0,Medicine,53e44042bf0e2c074e000039,Medicine/MedStats/Summer2014,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
29600,given the null hypothesis is considered true,0,0,1,4.0,3.5,1.0,Medicine,53e442dfbf0e2c8d66000034,Medicine/MedStats/Summer2014,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
29601,"> Hello Josh,_x0007__x0007_Is this hypothesis ...",0,1,0,4.0,5.0,5.5,Medicine,53e447cbbce97d56a9000032,Medicine/MedStats/Summer2014,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
29602,"Hi Josh,_x0007__x0007__x0007__x0007_Looking at...",0,1,0,3.5,5.0,5.5,Medicine,53e46e1cbce97d5d4300003c,Medicine/MedStats/Summer2014,...,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0


In [26]:
# Save the merged DataFrame; this posts.csv is the file loaded at the start of the notebook
df.to_csv('data/stanfordMOOCForumPostsSet/posts.csv', index=False)