## Parsing the CSV File into Different Labels

The original CSV file contains hundreds of email samples rated by native speakers (NS) and non-native speakers (NNS) on a scale from -12 to +12, per the abstract. Because each sample received multiple ratings, the recorded means hold the average score per sample. The normalized values range from roughly -3 to +3.

### Labeling Schemes (Based on normalized scores)

Binary (0 or 1): scores below zero are "impolite" (0); all other scores are "polite" (1)

Strong Neutral (-1, 0, 1): adds a "neutral" tag (0) for scores with an absolute value of at most 0.25; the remaining scores are "impolite" (-1) if negative and "polite" (1) if positive

Weak Neutral (-1, 0, 1): widens the "neutral" band to scores with an absolute value of at most 0.75

Scale of 5 (-2 to 2): a set of five tags: "very impolite" (-2) for score <= -1.5, "impolite" (-1) for -1.5 < score < -0.25, "neutral" (0) for -0.25 <= score <= 0.25, "polite" (1) for 0.25 < score < 1.5, and "very polite" (2) for score >= 1.5
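To make the thresholds concrete, the sketch below applies all four schemes to a handful of scores. The function name `label_all` and the sample scores are illustrative assumptions chosen to sit near the cut-offs; they are not taken from the dataset.

```python
def label_all(score):
    """Return the label for a normalized score under each of the four schemes."""
    sign = -1 if score < 0 else 1
    binary = 0 if score < 0 else 1
    strong = 0 if abs(score) <= 0.25 else sign
    weak = 0 if abs(score) <= 0.75 else sign
    # Scale of 5: the extremes get -2 / 2, everything else follows Strong Neutral
    if score <= -1.5:
        five = -2
    elif score >= 1.5:
        five = 2
    else:
        five = strong
    return {"binary": binary, "strong_neutral": strong,
            "weak_neutral": weak, "scale_of_5": five}

# Hypothetical normalized scores, not real rows from the CSV
for s in (-1.5, -0.5, 0.25, 0.5, 2.0):
    print(s, label_all(s))
```

A score of 0.5, for example, counts as "polite" under Strong Neutral but falls into the wider "neutral" band under Weak Neutral.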

In [13]:
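# Helper function for labeling; thresholds follow the labeling schemes above
def getLabel(index, value):
    """Map a normalized score to a label.

    index selects the scheme: 0 = Binary, 1 = Strong Neutral,
    2 = Weak Neutral, any other value = Scale of 5.
    """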
    if index == 0:
        # Binary labeling
        return 0 if value < 0 else 1
    elif index == 1:
        # Strong Neutral
        return 0 if abs(value) <= 0.25 else (-1 if value < 0 else 1)
    elif index == 2:
        # Weak Neutral
        return 0 if abs(value) <= 0.75 else (-1 if value < 0 else 1)
    
    # Scale of 5 (written to IntermediateLabeling.csv)
    if value <= -1.5:
        return -2
    elif value >= 1.5:
        return 2
    return 0 if abs(value) <= 0.25 else (-1 if value < 0 else 1)    

In [14]:
import csv

# newline='' lets the csv module handle line endings inside quoted fields
csv_file = open("RatingData - Sheet1.csv", "r", newline='')
csv_reader = csv.reader(csv_file)

labels = ['ID', 'Message', 'NS', 'NNS']
filenames = ["BinaryLabeling.csv", "StrongNeutralLabeling.csv",
             "WeakNeutralLabeling.csv", "IntermediateLabeling.csv"]
fileobjs = [open("LabeledData/" + i, "w", newline='') for i in filenames]
writers = [csv.writer(i) for i in fileobjs]
for i in writers:
    i.writerow(labels)

bad_rows = 0  # Number of rows skipped because they were malformed
next(csv_reader, None)  # Skip the header row
for row in csv_reader:
    # Skip rows that do not split into the expected 10 columns
    if len(row) != 10:
        bad_rows += 1
    else:
        # Grabbing normalized scores from csv
        NS_score = float(row[4])
        NNS_score = float(row[8])
        
        # Performing labeling
        for i in range(4):
            writers[i].writerow([row[0], row[1], getLabel(i, NS_score), getLabel(i, NNS_score)])
# Close the output files and the input file
for i in fileobjs:
    i.close()
csv_file.close()
print(bad_rows)

0
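Once the labeled files are written, a quick look at the label distribution can help show how each scheme splits the data. The following is a minimal sketch, not part of the original notebook: the helper `label_counts` is a hypothetical name, and the in-memory rows are made-up stand-ins for a file such as `LabeledData/WeakNeutralLabeling.csv`.

```python
import csv
import io
from collections import Counter

def label_counts(lines, column):
    """Count how often each label appears in one column of a labeled CSV."""
    reader = csv.DictReader(lines)
    return Counter(int(row[column]) for row in reader)

# Hypothetical rows using the same header the notebook writes
sample = io.StringIO(
    "ID,Message,NS,NNS\n"
    "1,\"Thanks, that works.\",1,0\n"
    "2,Send it now.,-1,-1\n"
    "3,See attached.,0,0\n"
)
counts = label_counts(sample, "NS")
print(counts)
```

To run this against the real output, pass an open file instead of the `StringIO` object, for example `open("LabeledData/WeakNeutralLabeling.csv", newline="")`.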
