<img src="..\data\images\teaching_physics.png">

# Analyzing transcriptions of physics classroom sessions

## 1. Introduction

Welcome to <a href=Introduction.ipynb>this</a> notebook!

The aim of this document is to summarize the different approaches used to analyze the transcriptions of physics classroom sessions. We recorded and transcribed classroom sessions using the <a href=https://smartspeech.azurewebsites.net>SmartSpeech app</a>. We used the transcriptions to:
- understand the relations among sessions (similarity by topic or by grade)
- analyze the use of specific words, combinations of words, and relations between concepts
- analyze the temporal dimension of the appearance of words during the classroom sessions

## 2. Dataset

The dataset consists of 56 physics classroom sessions taught by Heber, brother of Obed. Each session is saved in a .txt file (you can find them in the <a href=http://localhost:8888/tree/Documents/clusters%20ciae/data>data</a> folder).

Considering all the sessions, there are:

- Total number of words: 277,789

- Size of set of words: 14,208
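
Counts like the two above could be obtained along the following lines. This is only a minimal sketch written for Python 3: it assumes lowercased, whitespace-separated tokens and UTF-8 files, which may differ from the preprocessing actually used for the reported figures. `corpus_stats` and `read_sessions` are hypothetical helpers, not part of `packages.text_clustering`.

```python
import os


def corpus_stats(documents):
    """Return (total number of words, size of the set of words).

    Assumption: words are lowercased, whitespace-separated tokens.
    """
    total = 0
    vocabulary = set()
    for text in documents:
        words = text.lower().split()
        total += len(words)
        vocabulary.update(words)
    return total, len(vocabulary)


def read_sessions(folder):
    """Yield the content of every .txt file found under `folder`."""
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in sorted(filenames):
            if name.endswith('.txt'):
                path = os.path.join(dirpath, name)
                # Assumption: the transcriptions are stored as UTF-8 text.
                with open(path, encoding='utf-8') as handle:
                    yield handle.read()


# Toy check on two made-up "sessions" (not real data).
toy_total, toy_vocabulary = corpus_stats(['La luz es una onda',
                                          'la onda de sonido'])
print(toy_total, toy_vocabulary)
```

On the real data, the same function could be fed with `read_sessions(data_path)`, where `data_path` is the folder that holds the .txt files.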

Here is the number of sessions by grade and topic:

In [21]:
from packages.text_clustering import text_preprocessing as tp
import os
# raw string, so the backslashes of the Windows path are not read as escape sequences
root_path = r'C:\Users\CATALINA ESPINOZA\Documents\clusters ciae'
data_path = os.path.join(root_path,'data')
output_path = os.path.join(root_path,'output')
by_grade_and_content = os.path.join(data_path,'textos_ulloa_by_grade_content')



In [53]:
reload(tp)
tp.list_files(by_grade_and_content)

cuarto/
    aplicaciones campo eléctrico(1)
    campo magnético y transformadores(1)
    condensadores y magnetismo(1)
    electivo psu (luz)(1)
    electrodinamica(2)
    flujo magnético(1)
    fuerza electrostática(1)
    fuerza y campo eléctrico(1)
    psu física (luz)(1)
octavo/
    centrales eléctricas(1)
    circuitos(1)
    energía eléctrica(2)
    ley de ohm(1)
    potencia y energía eléctrica(2)
    voltaje(1)
primero/
    fenómenos de la luz(3)
    fenómenos del sonido(3)
    fenómenos ondulatorios(1)
    interferencia de la luz(1)
    luz(1)
    sonido(1)
    óptica de la luz(2)
segundo/
    aplicaciones del mrua y mrur(2)
    dinámica(1)
    fuerza de roce(1)
    leyes de newton(2)
    movimiento rectilineo uniforme(1)
    mrua(1)
    mrua y mrur(3)
    torque(1)
    torque y palancas(2)
septimo/
    el clima(1)
    elementos del clima(1)
    factores del clima(2)
    factores geográficos(2)
    presión(2)
tercero/
    fluidos(2)
    principio de pascal(1)
    torque(1)


Here is a scatter plot of the length of each session (in minutes) against its number of words:

<img src="..\data\images\descarga.png" align="left">

Here is a wordcloud with the most frequent words used by Heber during the lessons

In [None]:
# generate the image now (1 pom)

## 3. Classroom session representations

In this section we describe the different approaches used to represent the classroom session transcriptions. Since we are working with transcriptions, we refer to the classroom sessions as documents. To illustrate each approach, we use the following auxiliary document:

In [115]:
auxiliar_document = ['This document is just an example','It is not a transcription of a session','Remember it is just an example']

<i>Warning: the document above is just an example; it is not a transcription of a session.</i>

<i>Real warning: each element of the list represents a line of the document. In the real documents, a line corresponds to 5 seconds of the lesson.</i>
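
Because each line covers 5 seconds, the position of a line gives an approximate timestamp, and the number of lines gives an approximate session length. The sketch below illustrates this; it assumes that lines are contiguous and counted from 0, and `line_start_time` and `session_minutes` are hypothetical helper names introduced only for this illustration.

```python
SECONDS_PER_LINE = 5  # stated above: one line covers 5 seconds of the lesson


def line_start_time(line_index):
    """Return the start time of a 0-based line as an 'mm:ss' string."""
    minutes, seconds = divmod(line_index * SECONDS_PER_LINE, 60)
    return '{:02d}:{:02d}'.format(minutes, seconds)


def session_minutes(lines):
    """Approximate duration of a session, in minutes, from its line count."""
    return len(lines) * SECONDS_PER_LINE / 60.0


example_lines = ['This document is just an example',
                 'It is not a transcription of a session',
                 'Remember it is just an example']

print(line_start_time(2))              # third line starts 10 seconds in
print(session_minutes(example_lines))  # 3 lines * 5 s = 0.25 minutes
```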

####  General preprocessing

All the approaches presented in the following subsections work with preprocessed documents. The preprocessing tasks are described in detail in the corresponding notebooks. Here are some typical preprocessing tasks; they prepare the auxiliary document that we later use to illustrate the different approaches.

In [131]:
# work with words in lower case (kept as a list, so later cells can iterate over it again)
auxiliar_document = [line.lower() for line in auxiliar_document]

In [132]:
# work with separated words (not lines)
split_words = [j for i in auxiliar_document for j in i.split()]
print(split_words)

['this', 'document', 'is', 'just', 'an', 'example', 'it', 'is', 'not', 'a', 'transcription', 'of', 'a', 'session', 'remember', 'it', 'is', 'just', 'an', 'example']


In [133]:
# work with the set of words (no repetition)
set_of_words = sorted(set(split_words))
print(set_of_words)

['a', 'an', 'document', 'example', 'is', 'it', 'just', 'not', 'of', 'remember', 'session', 'this', 'transcription']


<a href="Preprocessing">Here</a> are the notebooks with some general preprocessing.

### 3.1 Words

Each document is transformed into a vector where each component represents the importance in the document of a specific word. The importance of a word can be measured, for example, by the frequency of the word.

In [119]:
words_frequency = []
for i in set_of_words:
    words_frequency.append(split_words.count(i))

Here is the vector with the frequency of each word

In [120]:
# it looks like a list, but it is a vector
print(words_frequency)

[2, 2, 1, 2, 3, 2, 2, 1, 1, 1, 1, 1, 1]


Here is each word with its corresponding frequency

In [121]:
for i in zip(set_of_words,words_frequency):
    print(i)

('a', 2)
('an', 2)
('document', 1)
('example', 2)
('is', 3)
('it', 2)
('just', 2)
('not', 1)
('of', 1)
('remember', 1)
('session', 1)
('this', 1)
('transcription', 1)
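
The same frequency vector can be built in a single pass with `collections.Counter` from the standard library, instead of calling `split_words.count` once per word. This is offered as an equivalent sketch, not as the method used in the analysis notebooks; the word list is repeated here so the snippet runs on its own.

```python
from collections import Counter

split_words = ['this', 'document', 'is', 'just', 'an', 'example',
               'it', 'is', 'not', 'a', 'transcription', 'of', 'a', 'session',
               'remember', 'it', 'is', 'just', 'an', 'example']

counts = Counter(split_words)    # one pass over the words
set_of_words = sorted(counts)    # alphabetical order, as in the cells above
words_frequency = [counts[w] for w in set_of_words]

print(words_frequency)
```

For a document of a few thousand words the difference is small, but over a whole corpus the single pass avoids rescanning the word list for every entry of the set of words.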


<a href="Words">Here</a> are the notebooks with the analysis using word vectors.

### 3.2 Pair of words

Each document is transformed into a matrix that holds how often each pair of words is uttered together.

First, we select an interesting group of words. For example, the complete set of words: 

In [122]:
set_of_words

['a',
 'an',
 'document',
 'example',
 'is',
 'it',
 'just',
 'not',
 'of',
 'remember',
 'session',
 'this',
 'transcription']

Then we build a matrix $connective\_matrix$ of size $n \times n$, where $n$ is the length of the set of words.

In [134]:
import numpy as np
connective_matrix = np.zeros((len(set_of_words),len(set_of_words)))

Each cell of $connective\_matrix$ holds how often a word $w_i$ is uttered before a word $w_j$ (in the same line), denoting a temporal relation $w_i\rightarrow w_j$. The diagonal holds the frequency of each word.

In [135]:
for w in set_of_words:
    matrix_i = set_of_words.index(w)
    for line in auxiliar_document:
        line_words = line.split()
        for w_i in range(len(line_words)):
            if line_words[w_i] == w:
                for w_j in range(len(line_words)):
                    if w_i < w_j:
                        matrix_j =set_of_words.index(line_words[w_j])
                        if matrix_i != matrix_j:
                            connective_matrix[matrix_i,matrix_j] += 1
        connective_matrix[matrix_i,matrix_i] += line_words.count(w)

So, this is what the pair of words representation of a document looks like:

In [136]:
import pandas as pd 
df = pd.DataFrame(connective_matrix)
df.columns = set_of_words
df.index = set_of_words
df

Unnamed: 0,a,an,document,example,is,it,just,not,of,remember,session,this,transcription
a,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,0.0,1.0
an,0.0,2.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
document,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
example,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
is,2.0,2.0,0.0,2.0,3.0,0.0,2.0,1.0,1.0,0.0,1.0,0.0,1.0
it,2.0,1.0,0.0,1.0,2.0,2.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0
just,0.0,2.0,0.0,2.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
not,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0
of,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
remember,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0


In [164]:
# Here is the document, to check the relations between words
auxiliar_document

['this document is just an example',
 'it is not a transcription of a session',
 'remember it is just an example']

In [137]:
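# the diagonal of the matrix should equal the word frequency vector
# (df.values replaces df.as_matrix(), which recent pandas versions no longer provide)
df.values.diagonal() == np.array(words_frequency)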

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True])
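
For a large set of words, most cells of the matrix are zero, so the same counts can also be kept in a dictionary keyed by the pair of words. The following is a sketch of that alternative under the same counting rule as above (a word before another word in the same line, with the word frequency on the diagonal). The name `pair_counts` is introduced here only for illustration, and the document is repeated so the snippet runs on its own.

```python
from collections import Counter

lines = ['this document is just an example',
         'it is not a transcription of a session',
         'remember it is just an example']

pair_counts = Counter()
for line in lines:
    words = line.split()
    for i, first in enumerate(words):
        # diagonal: plain frequency of the word
        pair_counts[(first, first)] += 1
        for second in words[i + 1:]:
            # off-diagonal: `first` is uttered before `second` in this line
            if second != first:
                pair_counts[(first, second)] += 1

print(pair_counts[('is', 'a')])        # row 'is', column 'a'
print(pair_counts[('a', 'session')])   # row 'a', column 'session'
print(pair_counts[('example', 'an')])  # pairs never seen count as 0
```

Looking up a pair that never occurs returns 0 without storing it, so only the observed pairs take up memory.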

<a href="Pair_of_words">Here</a> are the notebooks with the analysis using pair of words.

### 3.3 Pair of topics

Similar to the representation by pairs of words (section 3.2), the pair of topics representation uses matrices that hold the connections between topics in the documents. A topic is a group of words that are related to each other. A word can belong to several topics, and each word in a topic has a score that measures how strongly it belongs to that topic.

For example, here are two topics for the auxiliary document:

In [153]:
# topic_1 contains words regarding the session
topic_1 = ['document','session','transcription']
# topic_2 contains words used to teach
topic_2 = ['example','remember']

In [154]:
auxiliar_document

['this document is just an example',
 'it is not a transcription of a session',
 'remember it is just an example']

<i>Note: for simplicity we assume that every word has a score equal to 1 and that each word belongs to a single topic.</i>

Now we build the pair of topics matrix by counting how many times each topic is related to the other across the document. This time, the size of $connective\_matrix$ is $n \times n$, where $n$ is the number of topics.

In [159]:
topics = [topic_1,topic_2]
connective_matrix = np.zeros((len(topics),len(topics)))

For each word in each topic, we count its relations with the words of the other topics. In addition, the diagonal of $connective\_matrix$ holds the sum of the frequencies of each topic's words.

In [160]:
for topic_index in range(len(topics)):
    for w in topics[topic_index]:
        for line in auxiliar_document:
            line_words = line.split()
            if w in line_words:
                w_i = line_words.index(w)
                for w_j in range(len(line_words)):
                    if w_i < w_j and line_words[w_j] != w:
                        for other_topic_index in range(len(topics)):
                            if other_topic_index != topic_index:
                                if line_words[w_j] in topics[other_topic_index]:
                                    connective_matrix[topic_index,other_topic_index] += 1
                connective_matrix[topic_index,topic_index] += line_words.count(w)

In [161]:
df = pd.DataFrame(connective_matrix)
df.columns = ['session','teaching']
df.index = ['session','teaching']
df

| | session | teaching |
|---|---|---|
| session | 3.0 | 1.0 |
| teaching | 0.0 | 3.0 |


In [163]:
# Here is the document to check the relation between topics
auxiliar_document

['this document is just an example',
 'it is not a transcription of a session',
 'remember it is just an example']
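
The topic counts can also be obtained by first labelling each word with its topic and then counting ordered pairs of labels. This is a sketch under the simplifying assumptions of the note above (score 1 for every word, one topic per word); the names `topic_of` and `topic_counts` are introduced only for this illustration. Unlike the loop above, it counts every occurrence of a topic word in a line rather than only the first one, which gives the same result on this example.

```python
lines = ['this document is just an example',
         'it is not a transcription of a session',
         'remember it is just an example']

topics = {'session': ['document', 'session', 'transcription'],
          'teaching': ['example', 'remember']}

# word -> topic lookup (assumes each word belongs to exactly one topic)
topic_of = {word: name for name, words in topics.items() for word in words}

topic_counts = {(a, b): 0 for a in topics for b in topics}
for line in lines:
    # topic label of each word in the line, or None if it has no topic
    labels = [topic_of.get(word) for word in line.split()]
    for i, first in enumerate(labels):
        if first is None:
            continue
        # diagonal: frequency of the topic's words
        topic_counts[(first, first)] += 1
        for second in labels[i + 1:]:
            # off-diagonal: a word of `first` precedes a word of `second`
            if second is not None and second != first:
                topic_counts[(first, second)] += 1

print(topic_counts[('session', 'teaching')])
print(topic_counts[('teaching', 'session')])
```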

<a href="Pair_of_topics">Here</a> are the notebooks with the analysis using pair of topics.

## 4. Other analyses

### 4.1 Epistemic networks