# Basic and brief code for text processing

In this very short notebook, the aim is to experiment and determine whether text processing steps such as tokenisation, stop-word processing, etc. can be applied at a larger scale.

We begin by importing some of the needed packages:

In [2]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Here we read in the 'Comments' dataset and work with only a subset of the data.


In [20]:
comments = pd.read_excel("/content/drive/My Drive/Colab Notebooks/ASSIGNMENT 2/data/C4Comments.xlsx")
comments.head(n=10)

Unnamed: 0,Comment_ID,course_id,Timestamp,user,Question_ID,Comment
0,7197674,21320,2020-04-30 19:57:28,635978,4470201,"This is a good question, i do think it's a lit..."
1,7197679,21320,2020-04-30 20:00:46,635978,4468999,"This is a really good question, i like the str..."
2,7199285,21320,2020-05-01 15:43:35,636026,4471345,It is a good question. Slightly tricky but not...
3,7202419,21320,2020-05-03 13:28:57,635990,4472540,I think this is a really good question! I thin...
4,7202425,21320,2020-05-03 13:34:57,635990,4472481,This question requires the students to recall ...
5,7202447,21320,2020-05-03 13:52:47,635990,4472418,This question requires quite specific and deta...
6,7202548,21320,2020-05-03 16:38:30,636007,4468332,I think this question was quite good. The ques...
7,7204388,21320,2020-05-04 16:43:37,636020,4468400,I like the style of this question in having to...
8,7204405,21320,2020-05-04 16:54:47,636020,4469950,I think this question would benefit from rewor...
9,7205230,21320,2020-05-05 00:09:07,635993,4475427,"This question is very easy, while it is very r..."


### Key observations:

From this brief snapshot we can observe the following columns:
- the `Comment_ID` of each comment
- the `course_id` and `Timestamp` variables (not pertinent to the current analysis)
- the corresponding `Question_ID`
- the most pertinent column: `Comment`


- In the `Comment` column we can see straight away a few "stop words" that can be cleaned up immediately, such as:
  - I, is, a, the, to
  - question, think, having

- The words of interest could possibly be:
  - easy, hard, good, bad, tricky
  - recall
  - **Actionable words**:
    - benefit, rewording, misleading, confusing, difficult
    - suggest


- We will also need to find root words and words that have similar meanings to each other
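
The clean-up described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the notebook's actual pipeline: the `STOP_WORDS` and `WORDS_OF_INTEREST` sets simply copy the words listed above, and the helper names `tokenise`, `remove_stop_words` and `flag_words_of_interest` are hypothetical.

```python
import re

# Assumption: hand-picked sets copied from the observations above,
# not a complete stop-word list.
STOP_WORDS = {"i", "is", "a", "the", "to", "question", "think", "having"}
WORDS_OF_INTEREST = {
    "easy", "hard", "good", "bad", "tricky", "recall",
    "benefit", "rewording", "misleading", "confusing", "difficult", "suggest",
}


def tokenise(text):
    """Lowercase the text and split it into word tokens (keeps "it's" whole)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]


def flag_words_of_interest(tokens):
    """Keep only the tokens that appear in WORDS_OF_INTEREST."""
    return [t for t in tokens if t in WORDS_OF_INTEREST]


example = "I think this question would benefit from rewording"
tokens = tokenise(example)
kept = remove_stop_words(tokens)
flagged = flag_words_of_interest(kept)
print(kept)
print(flagged)
```

Finding root words is not covered by this sketch; a stemmer or lemmatiser from an NLP library would be one option for that step.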

In [21]:
comments['Comment'].head(n=30)

0     This is a good question, i do think it's a lit...
1     This is a really good question, i like the str...
2     It is a good question. Slightly tricky but not...
3     I think this is a really good question! I thin...
4     This question requires the students to recall ...
5     This question requires quite specific and deta...
6     I think this question was quite good. The ques...
7     I like the style of this question in having to...
8     I think this question would benefit from rewor...
9     This question is very easy, while it is very r...
10    I liked this question, however i was unsure of...
11    This question is good but only tests basic rec...
12    The question was straight-forward and requires...
13    The question has plausible options that are wo...
14    The question was a bit difficult, which is goo...
15    It would be really helpful if you further expl...
16    I think option C is quite misleading; I chose ...
17    I think this question is worded a bit conf

## Preliminary text transformation

Here I convert the `comments['Comment']` column to a list, with one entry per comment, as a pre-step to tokenisation.

In [25]:
# Convert to list
comments_ls = comments['Comment'].tolist()

# viewing an example:
comments_ls[100]



'When reading this question I got too distracted by reading the table options an trying to figure out which columns are correct and incorrect that I forgot to read the actual question again. Whoops!! Now that I read the question again the correct answer makes sense and is quite easy to identify through process of elimination of the other answers. This question requires students to apply their knowledge in a situation where the sigma subunit is impaired. The fact that you said "impaired" rather than not present at all makes the question more challenging. The transcription occurs (Yes/No) column could be a bit confusing but also allows the elimination of A and D as students that are confident with the syllabu content will know that the sigma subunit is involved in the initiation of prokaryotic transcription. This question could be changed into a "Which row of the table is True/False?" question or something along those lines which is similar to the style of exam questions.'

### Key observations:
From this example we can observe that the author of this particular comment has used:
- brackets/parentheses ()
- quotation marks " ", which also differ from comment to comment: some are enclosed by ' ' and others by " "
- punctuation marks such as commas, full stops, exclamation marks and question marks

We can also note that:
- there are a few spelling errors
- these comments are quite long and dense
- important words to consider are "distracted" and "confusing"
- a distinction needs to be made between answer options (A, B, C, D) and stop words such as 'a', which appears as 'A' when capitalised at the beginning of a sentence
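
One possible way to handle the last point is sketched below. The heuristic is an assumption made for illustration: a capital 'A' at the start of a sentence is treated as the article, while any other standalone A, B, C or D is tagged as an answer option. The function name `tokenise_keep_options` and the `OPTION_` prefix are hypothetical. Brackets, quotation marks and other punctuation are simply dropped by the regular expression.

```python
import re

ANSWER_OPTIONS = {"A", "B", "C", "D"}
SENTENCE_ENDINGS = {".", "!", "?"}


def tokenise_keep_options(text):
    """Tokenise text, tagging standalone A-D as answer options.

    Assumed heuristic: a capital 'A' that starts a sentence is the article;
    every other standalone A, B, C or D is an answer option.
    """
    tokens = []
    sentence_start = True
    # Match words (optionally with an apostrophe) or sentence-ending marks.
    for match in re.finditer(r"[A-Za-z]+(?:'[A-Za-z]+)?|[.!?]", text):
        piece = match.group()
        if piece in SENTENCE_ENDINGS:
            sentence_start = True
            continue
        is_article = piece == "A" and sentence_start
        if piece in ANSWER_OPTIONS and not is_article:
            tokens.append(f"OPTION_{piece}")
        else:
            tokens.append(piece.lower())
        sentence_start = False
    return tokens


sample = "A good question. I think option C is quite misleading; I chose A and D."
option_tokens = tokenise_keep_options(sample)
print(option_tokens)
```

The heuristic will misclassify some cases, for example a sentence that genuinely begins with answer option A, so it would need checking against real comments.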

## Next Steps

The next steps would be to build or find functions that tokenise the most valuable words, namely those that relate to the aims of the core.

We also need to remove any noise from the text: punctuation, LaTeX or HTML formatting.
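
A rough sketch of such a noise-removal step, using only the standard library, might look like the following. The function name `strip_noise` and the sample string are hypothetical, and the regular expressions are deliberately simple assumptions: inline math between dollar signs is discarded entirely, and LaTeX commands such as `\textbf{...}` keep only their argument.

```python
import html
import re


def strip_noise(text):
    """Remove HTML, simple LaTeX markup and punctuation from a comment."""
    text = html.unescape(text)                            # &amp; -> &
    text = re.sub(r"<[^>]+>", " ", text)                  # HTML tags
    text = re.sub(r"\$[^$]*\$", " ", text)                # inline math $...$
    text = re.sub(r"\\[A-Za-z]+\{([^}]*)\}", r"\1", text)  # \cmd{arg} -> arg
    text = re.sub(r"\\[A-Za-z]+", " ", text)              # bare commands
    text = re.sub(r"[^\w\s']", " ", text)                 # other punctuation
    return re.sub(r"\s+", " ", text).strip()              # tidy whitespace


raw = r'<p>The <b>sigma</b> subunit is \textbf{impaired}, so $k_{cat}$ drops &amp; "stalls" (maybe)!</p>'
cleaned = strip_noise(raw)
print(cleaned)
```

Regular expressions like these are fragile on real markup (for instance, the math pattern would also swallow text between two currency amounts), so a dedicated HTML parser may be a safer choice if the comments turn out to contain much formatting.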

Finally, we would assign some statistical weighting to these words.
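
The weighting scheme has not been chosen yet. As one candidate, the sketch below computes plain TF-IDF weights with the standard library; picking TF-IDF is an assumption made for illustration, and the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(tokenised_docs):
    """Return one {term: weight} dict per document using plain TF-IDF."""
    n_docs = len(tokenised_docs)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in tokenised_docs for term in set(doc))
    weights = []
    for doc in tokenised_docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # Term frequency times inverse document frequency.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["good", "question"],
    ["tricky", "question"],
    ["confusing", "wording"],
]
doc_weights = tf_idf(docs)
print(doc_weights[0])
```

With this scheme, a word that appears in every comment receives a weight of zero, while words that are frequent in one comment but rare overall receive the highest weights.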

