# Text analysis: '***2012-13 School Data with Affect***'


## Goal
  
Clean the problem-text dataset so that it can be used as input for BERT and other NLP models.

1. Remove everything that is not part of the question text: HTML tags, URLs, etc.
2. Drop questions that do not contain enough text.

Note: the IRT estimate is calculated using all problems.  
    
## Table of contents
1. Load    
2. Transform   
3. Drop
4. Result


In [None]:
# import libraries and helper functions
import pandas as pd

from utils.text_utils import (clean_html, remove_hash, remove_newline,
                              remove_page, remove_question, remove_url,
                              sep_exp)

# show full cell content; None replaces the deprecated -1 (pandas >= 1.0)
pd.set_option('display.max_colwidth', None)

# <font color='blue'>1 Load </font> 


In [None]:
# load dataset
df = pd.read_csv(r'data/ASSISTments2012DataSet-ProblemBodies.csv')

In [None]:
inter = pd.read_csv(r'data/interactions.csv')

### Dimensionality of the raw dataset

In [None]:
row, col = df.shape

print("#Rows: ", row)
print("#Columns: ", col)

In [None]:
print("#unique problem: ", df.problem_id.nunique())
print("#unique assistment: ", df.assistment_id.nunique())
print("#unique text: ", df.body.nunique())

In [None]:
df.sample(n=5)

In [None]:
# convert body to string
df['body'] = df['body'].astype(str)

In [None]:
# count how many texts contain an image

len(df[df['body'].str.contains('<img')])

# <font color='blue'>2 Transform</font> 


In this section we do not remove problems (i.e. rows); we only remove parts of the text:

- remove HTML tags
- remove URLs
- remove newlines
- split numerical expressions
- remove page references
- remove hash references
- remove question references
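
The helpers imported from `utils.text_utils` are not shown in this notebook. As a rough illustration of the steps listed above, the sketch below uses regular expressions. Every name ending in `_sketch` and every pattern is an assumption made for illustration and may differ from the real helpers.

```python
import re

# Hypothetical stand-ins for the helpers in utils.text_utils.
# The real implementations are not shown and may behave differently.
_TAG = re.compile(r"<[^>]+>")                      # naive: also matches "< ... >" in math text
_URL = re.compile(r"(?:https?://|www\.)\S+")
_EXPR = re.compile(r"(?<=\d)\s*([+\-*/=])\s*(?=\d)")
_PAGE = re.compile(r"^Page\s+\d+\s*(?:#\d+\s*)?")
_QUESTION = re.compile(r"^Question\s+#?\d+\s*")
_HASH = re.compile(r"^#\d+\s*")


def clean_body_sketch(text):
    """Apply the transformations in the same order as the notebook cells."""
    text = _TAG.sub(" ", text)          # remove HTML tags
    text = _URL.sub(" ", text)          # remove URLs
    text = " ".join(text.split())       # remove newlines, collapse whitespace
    text = _EXPR.sub(r" \1 ", text)     # '3*4+1' -> '3 * 4 + 1'
    text = _PAGE.sub("", text)          # drop a leading 'Page 82 #2'
    text = _QUESTION.sub("", text)      # drop a leading 'Question #6'
    text = _HASH.sub("", text)          # drop a leading '#40'
    return text.strip()


page_example = clean_body_sketch("<p>Page 82 #2 do the\r\nproperties hold</p>")
expr_example = clean_body_sketch("Compute 3*4+1 see www.example.com/x")
hash_example = clean_body_sketch("<b>#40</b> find the max")
question_example = clean_body_sketch("Question #6 Determine if is a function")
```

Note that the operator rule in this sketch also splits hyphenated numbers such as `2012-13`, so a real implementation may need a stricter pattern.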

In [None]:
clean = df.copy()
# drop column 'assistment_id'
clean.drop('assistment_id', axis=1, inplace=True)

In [None]:
# remove html tags
clean['body'] = clean['body'].apply(lambda x: clean_html(x))

In [None]:
# example of questions with url
clean[clean['body'].str.contains('www.', regex=False)].sample(n=2)

In [None]:
# remove url from text
clean['body'] = clean['body'].apply(lambda x: remove_url(x))

In [None]:
# remove \r\n (carriage return + line feed) and \n (line feed)
clean['body'] = clean['body'].apply(lambda x: remove_newline(x))

In [None]:
# split numerical expression '3*4+1' to '3 * 4 + 1'
clean['body'] = clean['body'].apply(lambda x: sep_exp(x))

In [None]:
# example of questions starting with Page
clean[clean['body'].str.startswith('Page')].sample(n=2)

In [None]:
# remove initial page ref. "Page 82 #2 do the properties hold" to "do the properties hold"
clean['body'] = clean['body'].apply(lambda x: remove_page(x))

In [None]:
# example of questions starting with Question
clean[clean['body'].str.startswith('Question')].sample(n=2)

In [None]:
# remove Question + Number: "Question #6 Determine if is a function" to "Determine if is a function"
clean['body'] = clean['body'].apply(lambda x: remove_question(x))

In [None]:
# example of questions starting with hash+number
clean[clean['body'].str.startswith('#')].sample(n=2)

In [None]:
# remove first word hash+number  "#40 find the max" to "find the max"
clean['body'] = clean['body'].apply(lambda x: remove_hash(x))

### Text ambiguity
Some problems are identical when only the raw text is considered.  
Example: identical text, but a different image.  
Example: identical text that refers to something not present in the raw text.  
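
To see how many distinct problems share one text, the bodies can be grouped and the problem ids counted per group. The snippet below is a minimal sketch on toy data; the `toy` frame and its values are invented, and only the `problem_id` and `body` column names come from the dataset.

```python
import pandas as pd

# Toy data standing in for the cleaned frame (values are made up).
toy = pd.DataFrame({
    "problem_id": [1, 2, 3, 4],
    "body": ["Find x", "Find x", "Simplify", "Find x"],
})

# Number of distinct problems per text, keeping only texts used more than once.
counts = toy.groupby("body")["problem_id"].nunique()
shared = counts[counts > 1]

# Problem ids behind each shared text.
ids_per_text = (
    toy[toy["body"].isin(shared.index)]
    .groupby("body")["problem_id"]
    .apply(list)
)
```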


In [None]:
print("#problem_id: ", df.problem_id.nunique())
print("#unique text with html: ", df.body.nunique())
print("#unique text after cleaning: ", clean.body.nunique())

In [None]:
#example of duplicated text
clean[clean.duplicated(['body'], keep=False)]

# <font color='blue'>3 Drop</font> 


### Integration

From now on, consider only the text of problems with **at least 50 interactions**.
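
The interactions file used here is already restricted to problems with at least 50 interactions. As a sketch of how such a threshold could be applied to an unfiltered log, assuming one row per interaction and a `problem_id` column (the function name and the toy frames are hypothetical):

```python
import pandas as pd

MIN_INTERACTIONS = 50  # threshold stated in the text


def frequent_problem_ids(interactions, min_count=MIN_INTERACTIONS):
    """Return the ids of problems with at least `min_count` logged interactions."""
    counts = interactions["problem_id"].value_counts()
    return counts[counts >= min_count].index


# Toy log: problem 1 has 60 interactions, problem 2 has only 10.
toy_log = pd.DataFrame({"problem_id": [1] * 60 + [2] * 10})
toy_text = pd.DataFrame({"problem_id": [1, 2], "body": ["a", "b"]})

kept = toy_text[toy_text["problem_id"].isin(frequent_problem_ids(toy_log))]
```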

In [None]:
# read dataset with interactions (min interactions per problem = 50)
df_int = pd.read_csv(r'data/interactions.csv')

# keep only the text of problems with at least 50 interactions
clean = clean.loc[clean['problem_id'].isin(df_int.problem_id)]

### Text length


In [None]:
words = clean['body'].apply(lambda text: len(text.split()))
#np.count_nonzero(words.values > 5)
words.describe()

In [None]:
x = words.plot.hist(bins=50, range=(0, 200))

x.set_title("Length")
x.set_xlabel("#words");

### Bad problems

Some texts do not represent questions and should be removed:

- questions with 0 words
- questions with 1 word in which at least one character is a digit (e.g. '1D' is removed, 'Simplify' is not)
- questions that contain one of the patterns 'Sorry, that is incorrect', 'If your answer is positive', 'Submit your answer from the textbook', 'your answer from your worksheet', or 'QUESTION' (the last one only when the text has fewer than 10 words)

Fortunately, many "bad" problems have few interactions and are therefore already removed by keeping only problems with at least N interactions.
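
The removal rules can also be expressed as a single predicate, which makes them easier to test in isolation. This is a sketch rather than the notebook's actual flow: `is_bad_problem` and `BAD_PATTERNS` are hypothetical names, while the patterns and the 10-word limit for 'QUESTION' are taken from the removal cells in this section.

```python
BAD_PATTERNS = [
    "Sorry, that is incorrect",
    "If your answer is positive",
    "Submit your answer from the textbook",
    "your answer from your worksheet",
]


def is_bad_problem(body):
    """Return True if the text matches one of the removal rules."""
    words = body.split()
    if len(words) == 0:
        return True  # empty text
    if len(words) == 1 and any(ch.isdigit() for ch in body):
        return True  # single token containing a digit, e.g. '1D'
    if "QUESTION" in body and len(words) < 10:
        return True  # short placeholder text
    return any(pattern in body for pattern in BAD_PATTERNS)


examples = ["", "1D", "Simplify", "QUESTION 3",
            "Sorry, that is incorrect. Try again", "Find the value of x"]
flags = [is_bad_problem(text) for text in examples]
```

With a predicate like this, the filtering could be written as one step, for example `clean[~clean['body'].map(is_bad_problem)]`, at the cost of losing the per-pattern counts printed by the individual cells.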


In [None]:
# problems with 0 words

print("#problems with 0 words: ", len(clean[clean['body'].map(lambda x: len(x.split()) == 0)]))

In [None]:
# remove questions with 0 words.
clean = clean[clean['body'].map(lambda x: len(x.split()) > 0)]

In [None]:
# remove questions with 1 word that contains a digit.
# EXAMPLE: "1D" is deleted, "Simplify" is not deleted

p = clean[clean['body'].map(lambda x: len(x.split()) == 1)]
p = p[p['body'].map(lambda x: any(map(str.isdigit, x)))]

clean = clean.loc[~clean['problem_id'].isin(p.problem_id)]

print("random samples of removed problems:")
p.sample(n=10)

In [None]:
# 1 Useless Pattern "Sorry, that is incorrect"
pattern = 'Sorry, that is incorrect'
p = clean[clean['body'].str.contains(pattern)]
print("#Problems removed: ", p.shape[0])

# remove these problems
clean = clean.loc[~clean['problem_id'].isin(p.problem_id)]

print("random samples of removed problems:")
p.sample(n=10)

In [None]:
# 2 Useless Pattern "If your answer is positive:"

pattern = 'If your answer is positive'
p = clean[clean['body'].str.contains(pattern)]
print("#Problems removed: ", p.shape[0])

# remove these problems
clean = clean.loc[~clean['problem_id'].isin(p.problem_id)]

print("random samples of removed problems:")
p.sample(n=10)

In [None]:
# 3 Useless Pattern "Submit your answer from the textbook"

pattern = 'Submit your answer from the textbook'
p = clean[clean['body'].str.contains(pattern)]
print("#Problems removed: ", p.shape[0])

# remove these problems
clean = clean.loc[~clean['problem_id'].isin(p.problem_id)]

print("random samples:")
p.sample(n=10)

In [None]:
# 4 Useless Pattern "QUESTION" && len <10

pattern = 'QUESTION'
p = clean[clean['body'].str.contains(pattern)
          & clean['body'].map(lambda x: len(x.split()) <10)]
print("#Problems removed: ", p.shape[0])

# remove these problems
clean = clean.loc[~clean['problem_id'].isin(p.problem_id)]

print("random samples:")
p.sample(n=10)

In [None]:
# 5 Useless Pattern "your answer from your worksheet"

pattern = 'your answer from your worksheet'
p = clean[clean['body'].str.contains(pattern)]
print("#Problems removed: ", p.shape[0])

# remove these problems
clean = clean.loc[~clean['problem_id'].isin(p.problem_id)]

print("random samples:")
p.sample(n=10)

# <font color='blue'>4 Result</font> 


In [None]:
print("#problems: ", clean.problem_id.nunique())
print("#problems with different text: ", clean.body.nunique())

#unique template_id = 12'310

In [None]:
# final length distribution

words = clean['body'].apply(lambda text: len(text.split()))
x = words.plot.hist(bins=50, range=(0, 100))

x.set_title("Length")
x.set_xlabel("#words");

In [None]:
# 20 random samples
clean.sample(n=20)