# Processing qualitative survey data


**Language: Python**

This notebook shows the process used to clean the qualitative data from the Qualtrics survey (raters' rationales for their ratings) for later analysis. Before being processed in this notebook, the comments were manually annotated with tags (see dissertation section 6.2.4 for the details of this scheme).

**Notebook contents:**
- [Initial setup](#Initial-setup)
- [Data processing](#Data-processing)
- [Data analysis](#Data-analysis)

## Initial setup

In [1]:
# Import necessary modules

import pandas as pd
import pprint
from IPython.core.interactiveshell import InteractiveShell
import joblib
import csv
import collections

In [2]:
# Set preferred notebook format

InteractiveShell.ast_node_interactivity = "all" # Show all output, not just last item
pd.set_option('display.max_columns', 999) # Allow viewing of all columns

In [3]:
# Read in survey with annotations

comments = pd.read_csv('../docs/survey_annotated_comments.csv')
comments.head()

Unnamed: 0,text_id,rater_id,CEFR,soph,binary_soph,soph_type,accuracy,tags
0,1,R1,B1,low,low,mix,low,accuracy; error_severity; sophistication; dive...
1,1,R11,B1,low,low,mix,low,preposition; spelling; accuracy; sophisticatio...
2,1,R19,B1,low,low,mix,low,
3,1,R40,B1,low,low,mix,low,word_formation; diversity
4,2,R6,B1,low,low,mix,high,collocation_accuracy; spelling; sophistication...


## Data processing

In [4]:
# Remove NaN rows and reset index

len(comments)
comments = comments.loc[comments.tags.notna()]
comments = comments.reset_index(drop=True)
len(comments)

121

110
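The same filtering step can also be sketched with `dropna`. This is a minimal illustration on toy data, not the survey file: the `toy` frame and its values are hypothetical stand-ins for the annotated comments.

```python
import pandas as pd

# Hypothetical toy data standing in for the annotated survey comments
toy = pd.DataFrame({
    "rater_id": ["R1", "R11", "R19"],
    "tags": ["accuracy; spelling", "diversity", None],
})

# Drop rows whose tags are missing, then renumber the index from 0
toy_clean = toy.dropna(subset=["tags"]).reset_index(drop=True)

print(len(toy), len(toy_clean))
```

Restricting `dropna` to `subset=["tags"]` keeps rows that have missing values in other columns, which matches the behaviour of the cell above.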

In [5]:
# Split each tag string on ";" into a list of tags (leading spaces are kept at this stage)

comments.tags = comments.tags.apply(lambda x: x.split(";"))
comments.head()

Unnamed: 0,text_id,rater_id,CEFR,soph,binary_soph,soph_type,accuracy,tags
0,1,R1,B1,low,low,mix,low,"[accuracy, error_severity, sophistication, ..."
1,1,R11,B1,low,low,mix,low,"[preposition, spelling, accuracy, sophistic..."
2,1,R40,B1,low,low,mix,low,"[word_formation, diversity]"
3,2,R6,B1,low,low,mix,high,"[collocation_accuracy, spelling, sophisticat..."
4,3,R2,B1,mid,high,col,low,"[FS, FS_accuracy, sophistication, error_sev..."


In [6]:
# List all unique tags to check for variants that need standardizing

sorted({x for y in comments.tags for x in y})

[' FS',
 ' FS_accuracy',
 ' accuracy',
 ' accuracy(collocation)',
 ' accuracy(word_choice)',
 ' appropriacy',
 ' coherence',
 ' collocation',
 ' collocation_accuracy',
 ' control',
 ' diversity',
 ' error_severity',
 ' fluency',
 ' linking_words',
 ' message',
 ' preposition',
 ' register',
 ' sophistication',
 ' spelling',
 ' style',
 ' task_adequacy',
 ' tone',
 ' word_choice',
 ' word_formation',
 'FS',
 'accuracy',
 'accuracy(collocation)',
 'appropriacy',
 'collocation',
 'collocation_accuracy',
 'diversity',
 'error_gravity',
 'noun_phrase',
 'preposition',
 'register',
 'sophistication',
 'spelling',
 'style',
 'task_adequacy',
 'word_choice',
 'word_formation']
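Many of the duplicates in this list differ only by a leading space left over from splitting on `";"`. One possible way to avoid them, sketched here on hypothetical toy strings rather than the survey data, is to strip whitespace from each tag while splitting. This is an alternative to the approach taken in this notebook, which handles the spaces in the mapping dictionary instead.

```python
import pandas as pd

# Hypothetical tag strings with inconsistent spacing after the separator
toy_tags = pd.Series(["accuracy; spelling", "spelling;diversity"])

# Split on ";" and strip surrounding whitespace, skipping empty pieces
split_tags = toy_tags.apply(
    lambda s: [t.strip() for t in s.split(";") if t.strip()]
)

# Unique tags across all rows
unique_tags = sorted({t for row in split_tags for t in row})
print(unique_tags)
```

Note that stripping would merge keys such as `' word_choice'` and `'word_choice'`, so a mapping written against the unstripped tags would need to be revised accordingly.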

In [7]:
# Map variant tags (including those with a leading space) to standardized labels

tags_dict = {
    ' FS':'FS',
    ' FS_accuracy':'FS',
    ' accuracy':'accuracy',
    ' accuracy(collocation)':'accuracy(collocation)',
    ' accuracy(word_choice)':'accuracy',
    ' appropriacy':'appropriacy',
    ' coherence':'coherence',
    ' collocation':'collocation',
    ' collocation_accuracy':'accuracy(collocation)',
    ' control':'appropriacy',
    ' diversity':'range',
    ' error_severity':'error_gravity',
    ' fluency':'fluency',
    ' linking_words':'linking_words',
    ' message':'appropriacy',
    ' preposition':'preposition',
    ' register':'style',
    ' sophistication':'sophistication',
    ' spelling':'spelling',
    ' style':'style',
    ' task_adequacy':'appropriacy',
    ' tone':'style',
    ' word_choice':'accuracy(word_choice)',
    ' word_formation':'word_formation',
    'collocation_accuracy':'accuracy(collocation)',
    'task_adequacy':'appropriacy',
    'diversity':'range',
    'register':'style',
}

# Apply the mapping; tags without an entry in tags_dict are kept unchanged

comments.tags = comments.tags.apply(lambda row: [tags_dict.get(x, x) for x in row])
sorted({x for y in comments.tags for x in y})

['FS',
 'accuracy',
 'accuracy(collocation)',
 'accuracy(word_choice)',
 'appropriacy',
 'coherence',
 'collocation',
 'error_gravity',
 'fluency',
 'linking_words',
 'noun_phrase',
 'preposition',
 'range',
 'sophistication',
 'spelling',
 'style',
 'word_choice',
 'word_formation']
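Because tags without a dictionary entry pass through unchanged, it can be useful to list them and confirm that each was left as-is on purpose. The sketch below assumes a small hypothetical `mapping` (three entries copied from `tags_dict`) and made-up rows; it is an illustration of the check, not part of the original analysis.

```python
# Hypothetical rows of raw tags, as they look after splitting on ";"
raw_rows = [[" diversity", "accuracy"], ["register", " tone", "noun_phrase"]]

# Small subset of the notebook's tags_dict, for illustration only
mapping = {" diversity": "range", "register": "style", " tone": "style"}

# Standardize: look each tag up, falling back to the tag itself
normalized = [[mapping.get(t, t) for t in row] for row in raw_rows]

# Tags that had no entry and were therefore kept unchanged
unmapped = sorted({t for row in raw_rows for t in row if t not in mapping})

print(normalized)
print(unmapped)
```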

## Data analysis

In [8]:
# Counts of each tag

tags = [x for y in comments.tags for x in y]
tag_counts = collections.Counter(tags)
tag_counts

Counter({'accuracy': 45,
         'error_gravity': 16,
         'sophistication': 60,
         'range': 44,
         'preposition': 4,
         'spelling': 12,
         'accuracy(word_choice)': 2,
         'word_formation': 8,
         'accuracy(collocation)': 7,
         'appropriacy': 38,
         'FS': 24,
         'coherence': 2,
         'style': 10,
         'collocation': 38,
         'word_choice': 1,
         'fluency': 2,
         'linking_words': 1,
         'noun_phrase': 1})

In [9]:
# View the counts as (tag, count) pairs

tag_counts.items()

dict_items([('accuracy', 45), ('error_gravity', 16), ('sophistication', 60), ('range', 44), ('preposition', 4), ('spelling', 12), ('accuracy(word_choice)', 2), ('word_formation', 8), ('accuracy(collocation)', 7), ('appropriacy', 38), ('FS', 24), ('coherence', 2), ('style', 10), ('collocation', 38), ('word_choice', 1), ('fluency', 2), ('linking_words', 1), ('noun_phrase', 1)])

These counts will be used for the visualization created in R in notebook 15.
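One way the counts could be handed over to R is as a two-column CSV. The following is a sketch under stated assumptions: it uses three of the counts shown above, the column names `tag` and `count` are hypothetical, and it writes to an in-memory buffer rather than to a file, since the file name used for the R notebook is not given here.

```python
import collections
import io

import pandas as pd

# Three of the counts reported above, used as example input
example_counts = collections.Counter(
    {"sophistication": 60, "accuracy": 45, "range": 44}
)

# most_common() returns (tag, count) pairs sorted by descending count
counts_df = pd.DataFrame(example_counts.most_common(), columns=["tag", "count"])

# Write CSV text to an in-memory buffer; a file path could be passed instead
buffer = io.StringIO()
counts_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

A long-format table like this can be read in R with `read.csv` and plotted directly.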

[Back to top](#Processing-qualitative-survey-data)