Manually classify the working language and embedded language of the first 100 posts

In [1]:
import nltk
from nltk.tokenize import word_tokenize
import numpy as np
import pandas as pd

In [2]:
def get_conversation_length(conversation_id):
    """
    Get the total number of posts given a conversation_id
    """
    # Filter the DataFrame to include only rows with the given conversation_id
    filtered_df = df[df['conversation_id'] == conversation_id]
    
    # Get the total number of posts for the conversation_id
    total_posts = len(filtered_df)
    
    return total_posts
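
The helper above filters the whole DataFrame each time it is called. A possible alternative, sketched here on a hypothetical toy DataFrame (`toy_df`) that only assumes a `conversation_id` column like the one in the sample file, is to count every conversation once and look the result up afterwards. The name `get_conversation_length_cached` is illustrative, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for conv_sample_posts.csv
toy_df = pd.DataFrame({
    "conversation_id": [3, 3, 3, 5, 6],
    "post": ["a", "b", "c", "d", "e"],
})

# Count the posts in every conversation in a single pass
conversation_lengths = toy_df["conversation_id"].value_counts()

def get_conversation_length_cached(conversation_id):
    """Look up a precomputed conversation length (0 if the id is unknown)."""
    return int(conversation_lengths.get(conversation_id, 0))
```

For the 90 rows examined in this notebook the difference is negligible; the lookup mainly pays off if the same loop is run over a much larger file.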

In [3]:
def truncate_post(post, n):
    """
    Truncate a string to a maximum number of tokens (n)
    """
    # Tokenize the input string into words
    words = word_tokenize(post)
    
    # Select the first n tokens
    truncated_tokens = words[:n]
    
    # Join the selected tokens back into a string
    truncated_post = ' '.join(truncated_tokens)
    
    return truncated_post
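
`word_tokenize` expects a string and needs NLTK's tokenizer data to be installed. A defensive variant is sketched below under two assumptions that the notebook does not confirm: that some rows may have a missing `post` value, and that falling back to whitespace splitting is acceptable when NLTK or its data is unavailable. `truncate_post_safe` is a hypothetical name.

```python
def truncate_post_safe(post, n):
    """Truncate a post to at most n tokens, tolerating missing input."""
    # A missing post read from CSV arrives as NaN (a float), not a string
    if not isinstance(post, str):
        return ""
    try:
        from nltk.tokenize import word_tokenize
        words = word_tokenize(post)
    except (ImportError, LookupError):
        # NLTK or its tokenizer data is unavailable: split on whitespace instead
        words = post.split()
    return " ".join(words[:n])
```

The whitespace fallback keeps punctuation attached to words, so its output can differ slightly from the tokenized version shown in the cell output.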

In [4]:
df= pd.read_csv("conv_sample_posts.csv")
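
The labelling loop in the next cell reads the `post_working_language_actual` column, so that column has to exist before the loop runs. A minimal sketch of a guard, shown on a hypothetical toy DataFrame (`toy_df`) rather than the real file:

```python
import pandas as pd

# Hypothetical toy data standing in for conv_sample_posts.csv
toy_df = pd.DataFrame({"conversation_id": [3, 5], "post": ["x", "y"]})

label_column = "post_working_language_actual"

# Create an empty, string-friendly label column only if it is missing
if label_column not in toy_df.columns:
    toy_df[label_column] = pd.Series(pd.NA, index=toy_df.index, dtype="object")
```

Using an `object` column means string labels can be written into it later without changing the column's dtype.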

In [5]:
# iterate through a range of conversation data points
for i in range(10, 100):
    
    conversation_id= df.at[i, 'conversation_id']
    print(f"• Examining data point {i}, with conversation_id {conversation_id}")
    
    if pd.isna(df.at[i, 'post_working_language_actual']):

        # see how long the conversation is
        conversation_length= get_conversation_length(conversation_id)
        print(f"----Data point {i} is part of a conversation of length {conversation_length}")

        # initialize a set to store the languages associated with the post
        languages= set()

        # get the post content
        post= df.at[i, 'post']
        # truncate the post
        truncated_post= truncate_post(post, 25)

        # print out the truncated post
        print(f"--------Post content: {truncated_post}")

        # get the language classification for the text
        print("--------Getting the language classification...")
        # the label is typed in manually here; this is the step where an API query would otherwise take place
        classification= input("Classify this post as: > ")
        # classification= "English"
        print(f"--------Classification is: {classification}")

        # split the languages into a list, since 'classification' may be a string with languages separated by commas
        # strip surrounding whitespace so that "English, Spanish" does not yield ' Spanish'
        classification= [language.strip() for language in classification.split(",")]
        # add each non-empty language to the set of languages associated with the post
        for language in classification:
            if language:
                languages.add(language)

        # print out the set of languages associated with the data point
        print(f"--------The language(s) used in data point {i} are: {languages}")

        # store the languages for this data point (sorted, so the saved string is deterministic)
        df.loc[i, "post_working_language_actual"]= ", ".join(sorted(languages))

        print("\n")
    
    else:
        print(f"----Data point {i} already has a working language classification!")
        
# Specify the path where you want to save the CSV file
file_path = 'conv_sample_posts.csv'

# Save the DataFrame to a CSV file
df.to_csv(file_path, index=False)  # Set index=False to exclude the index column

print(f'DataFrame has been saved to {file_path}')
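
Labels typed at the prompt may carry stray spaces or repeated names, for example `English, Spanish ,English`. The comma-splitting step in the loop above could be factored into a small helper along these lines; `parse_languages` is a hypothetical name, and keeping the languages in the order they were typed is a design choice of this sketch rather than something the notebook specifies.

```python
def parse_languages(classification):
    """Normalize a comma-separated label string: strip, drop empties, de-duplicate."""
    seen = []
    for part in classification.split(","):
        language = part.strip()
        # Keep the first occurrence of each non-empty language, in typed order
        if language and language not in seen:
            seen.append(language)
    return ", ".join(seen)
```

An empty entry (pressing Enter without typing anything) produces an empty string, which pandas reads back as a missing value, so the post would be offered for labelling again on the next run.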

• Examining data point 10, with conversation_id 3
----Data point 10 is part of a conversation of length 20
--------Post content: `` Ca n't Rename , no element found '' on anything I try to rename
--------Getting the language classification...
Classify this post as: > English
--------Classification is: English
--------The language(s) used in data point 10 are: {'English'}


• Examining data point 11, with conversation_id 3
----Data point 11 is part of a conversation of length 20
--------Post content: Has clojure-lsp started correctly ?
--------Getting the language classification...
Classify this post as: > English
--------Classification is: English
--------The language(s) used in data point 11 are: {'English'}


• Examining data point 12, with conversation_id 3
----Data point 12 is part of a conversation of length 20
--------Post content: > Has clojure-lsp started correctly ? Yes
--------Getting the language classification...
Classify this post as: > English
--------Classification is: E

Classify this post as: > English
--------Classification is: English
--------The language(s) used in data point 30 are: {'English'}


• Examining data point 31, with conversation_id 5
----Data point 31 is part of a conversation of length 1
--------Post content: Bump engine.io and browser-sync . Bumps [ engine.io ] ( https : //github.com/socketio/engine.io ) to 6.2.1 and updates ancestor dependency [ browser-sync ] ( https
--------Getting the language classification...
Classify this post as: > English
--------Classification is: English
--------The language(s) used in data point 31 are: {'English'}


• Examining data point 32, with conversation_id 6
----Data point 32 is part of a conversation of length 1
--------Post content: Bump pillow from 6.0.0 to 9.3.0 . Bumps [ pillow ] ( https : //github.com/python-pillow/Pillow ) from 6.0.0 to 9.3.0 . < details > <
--------Getting the language classification...
Classify this post as: > English
--------Classification is: English
--------The languag

Classify this post as: > English
--------Classification is: English
--------The language(s) used in data point 48 are: {'English'}


• Examining data point 49, with conversation_id 8
----Data point 49 is part of a conversation of length 39
--------Post content: : hourglass : Trying commit 7738cc734546bc91ede24376970cc4dfd01580da with merge 4997a56cf2edc4b2be488f2f2b586bf5e2bb3b37 ... < ! -- homu : { `` type '' : '' TryBuildStarted '' , ''
--------Getting the language classification...
Classify this post as: > English
--------Classification is: English
--------The language(s) used in data point 49 are: {'English'}


• Examining data point 50, with conversation_id 8
----Data point 50 is part of a conversation of length 39
--------Post content: : sunny : Try build successful - [ checks-actions ] ( https : //github.com/rust-lang-ci/rust/actions/runs/3387025807/jobs/5629370226 ) Build commit : 4997a56cf2edc4b2be488f2f2b586bf5e2bb3b37 ( ` 4997a56cf2edc4b2be488f2f2b586bf5e2bb3b37 ` ) <
------

Classify this post as: > English
--------Classification is: English
--------The language(s) used in data point 66 are: {'English'}


• Examining data point 67, with conversation_id 8
----Data point 67 is part of a conversation of length 39
--------Post content: : umbrella : The latest upstream changes ( presumably # 105183 ) made this pull request unmergeable . Please [ resolve the merge conflicts ]
--------Getting the language classification...
Classify this post as: > English
--------Classification is: English
--------The language(s) used in data point 67 are: {'English'}


• Examining data point 68, with conversation_id 8
----Data point 68 is part of a conversation of length 39
--------Post content: : umbrella : The latest upstream changes ( presumably # 105644 ) made this pull request unmergeable . Please [ resolve the merge conflicts ]
--------Getting the language classification...
Classify this post as: > English
--------Classification is: English
--------The language(s) used in 

Classify this post as: > English
--------Classification is: English
--------The language(s) used in data point 83 are: {'English'}


• Examining data point 84, with conversation_id 15
----Data point 84 is part of a conversation of length 1
--------Post content: MQ Smoke test PR . autogenerated pr
--------Getting the language classification...
Classify this post as: > English
--------Classification is: English
--------The language(s) used in data point 84 are: {'English'}


• Examining data point 85, with conversation_id 16
----Data point 85 is part of a conversation of length 1
--------Post content: [ Snyk ] Upgrade snyk from 1.996.0 to 1.1044.0 . < h3 > Snyk has created this PR to upgrade snyk from 1.996.0 to 1.1044.0.
--------Getting the language classification...
Classify this post as: > English
--------Classification is: English
--------The language(s) used in data point 85 are: {'English'}


• Examining data point 86, with conversation_id 17
----Data point 86 is part of a conversa