# Final Project Notebook

DS 5001 Exploratory Text Analytics | Spring 2024

# Metadata

- Full Name: Andrew Avitabile
- Userid: yaj3ma@virginia.edu
- GitHub Repo URL: https://github.com/avitabilea/DS5001-Final-Project
- UVA Box URL: https://virginia.app.box.com/folder/260794950453

# Raw Data

## Source Description (1)

The data used in this project come from a large teacher training program (TTP) in Texas between 2012 and 2019. As part of this training, preservice teachers (PSTs, or individuals pursuing a teaching certificate) take part in clinical teaching experiences. In these experiences, PSTs engage in professional activities under the mentorship of cooperating (or mentor) teachers. TTPs oversee these field experiences through field supervisors who manage relationships between PSTs and cooperating teachers, observe classroom activities, and provide written feedback to PSTs. Although field supervisors play a key role in the clinical training of PSTs, the influence of their feedback on PST growth remains unclear. In this context, field supervisors observe PSTs (often multiple times) and provide unstructured, written feedback on classroom performance.

## Source Features (1)

- UVA Box URL: https://virginia.box.com/s/p25x3as50pbgbyzm6fharyi7tbljin5c
- Number of raw documents: 18,991
- Total size of raw documents: 4,626 KB
- File format: .xlsx

## Source Document Structure (1)

The documents in this corpus are feedback provided by supervisors to PSTs after observations during their clinical teaching experiences. This feedback is provided in the file [eval_text.xlsx](https://virginia.box.com/s/p25x3as50pbgbyzm6fharyi7tbljin5c) in the column ``overallcomments``. This workbook also includes deidentified information on the supervisor (``supervisor_id``) and PST (``uin``), the date of the observation (``observationdate``), a discrete, numeric score given by supervisors to PSTs rating their performance between 1 and 4 (``overallrating_num``), the order of the observation within a supervisor and PST (``observation_order``), the program type that a PST is participating in (``program_cl``, e.g., KINE = Kinesiology), and the grade level where the clinical observation took place (``gradelevel_cl``, e.g., Elementary = an elementary school). The values of the variable ``overallrating_num`` are 1: Needs Significant Improvement, 2: Growth in Progress, 3: Proficient, and 4: Exceeds Expectations.

In a sense, these documents are letters, addressed from supervisors to PSTs. Each PST should be observed at least three times before completing their clinical teaching experience, but some are observed more than that and some drop out before completion. Supervisors may observe many PSTs. In addition to written feedback, supervisors assign numeric scores (see the variable ``avgscore``).

The amount and organization of the text in each document varies greatly. For example, the longest supervisor feedback contains over 300 sentences, whereas the shortest contains 11 characters and reads: "Good lesson." The histogram below shows the distribution of the number of characters in each document. Some documents are written as paragraphs and others as bulleted lists. Some documents have little to no formal structure.

![image.png](attachment:6df87b26-468d-4639-ab3d-bc350dc94c0d.png)

# Parsed and Annotated Data

## LIB (2)

The source documents the corpus comprises. These may be books, plays, newspaper articles, abstracts, blog posts, etc. 

Note that these are *not* documents in the sense used to describe a bag-of-words representation of a text, e.g. chapter.

- UVA Box URL: https://virginia.box.com/s/ngyvwlkuesmh1uvwq4xfb3drvzc03kz4
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/01_data_prep.ipynb
- Delimiter: Pipe
- Number of observations: 18,991
- List of features: `supervisor_id` `observationdate` `overallcomments` `overallrating_num` `observation_order` `n_psts` `n_documents` `sentence_count` `token_count` `char_count` `program_cl` `gradelevel_cl`
- Average length of each document in characters: 651.7

## CORPUS (2)

The sequence of word tokens in the corpus, indexed by their location in the corpus and document structures.

- UVA Box URL: https://virginia.box.com/s/nomwbb2t6nak1wrsqee90uupn3iosewf
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/01_data_prep.ipynb
- Delimiter: Pipe
- Number of observations: 2,386,406
- OHCO Structure: `document_id`, `sentence_num`, `token_num`
- Columns: `token_str`, `term_str`, `pos`, `pos_group`
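
To illustrate the OHCO index listed above, here is a minimal sketch that splits documents into sentences and tokens and indexes each token by `document_id`, `sentence_num`, and `token_num`. The regular expressions, the `build_corpus` helper, and the toy documents are assumptions for illustration; the project notebook may use a different tokenizer, and POS tagging is omitted here.

```python
import re
import pandas as pd

def build_corpus(docs):
    """Return a token table indexed by (document_id, sentence_num, token_num).

    `docs` maps document_id -> raw text (hypothetical input format).
    """
    rows = []
    for doc_id, text in docs.items():
        # Naive sentence split: break after ., !, or ? followed by whitespace
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        for s_num, sent in enumerate(sentences):
            # Tokens are runs of word characters or single punctuation marks
            for t_num, tok in enumerate(re.findall(r"\w+|[^\w\s]", sent)):
                rows.append((doc_id, s_num, t_num, tok, tok.lower()))
    cols = ["document_id", "sentence_num", "token_num", "token_str", "term_str"]
    return pd.DataFrame(rows, columns=cols).set_index(cols[:3])

# Toy documents standing in for the `overallcomments` column
toy_docs = {0: "Good lesson. Keep it up!", 1: "Student performed as expected."}
CORPUS = build_corpus(toy_docs)
print(CORPUS)
```

Because punctuation marks are kept as tokens in this sketch, symbols such as `-` and `!` would appear in a vocabulary built from it.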

## VOCAB (2)

The unique word types (terms) in the corpus.

- UVA Box URL: https://virginia.box.com/s/nomwbb2t6nak1wrsqee90uupn3iosewf
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/01_data_prep.ipynb and https://github.com/avitabilea/DS5001-Final-Project/blob/master/02_derived_tables.ipynb
- Delimiter: Pipe
- Number of observations: 23,101 rows
- Columns: `n` `porter_stem` `stop` `max_pos` `max_pos_group` `ngram_length` `df` `idf` `dfidf` `mean_tfidf`

Top 20 Most Significant Words

1. were
2. as
3. you
4. -
5. student
6. are
7. that
8. well
9. was
10. good
11. her
12. classroom
13. job
14. this
15. teacher
16. very
17. your
18. an
19. !
20. have
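
The VOCAB columns `n`, `df`, `idf`, and `dfidf` can be derived from a token table with a single group-by. The sketch below uses a toy table; the log base (2) is an assumption, since the notebook does not state which base was used.

```python
import numpy as np
import pandas as pd

# Toy token table: one row per token occurrence (hypothetical data)
tokens = pd.DataFrame({
    "document_id": [0, 0, 0, 1, 1, 2],
    "term_str":    ["good", "lesson", "good", "good", "job", "lesson"],
})

n_docs = tokens["document_id"].nunique()
VOCAB = tokens.groupby("term_str").agg(
    n=("document_id", "size"),      # total occurrences in the corpus
    df=("document_id", "nunique"),  # number of documents containing the term
)
VOCAB["idf"] = np.log2(n_docs / VOCAB["df"])   # log base 2 is an assumption
VOCAB["dfidf"] = VOCAB["df"] * VOCAB["idf"]
print(VOCAB.sort_values("dfidf", ascending=False))
```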

# Derived Tables

## BOW (3)

A bag-of-words representation of the CORPUS.

- UVA Box URL: https://virginia.box.com/s/f491d7f7dtmjb3d5425o79p4lscwfzeh
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/02_derived_tables.ipynb
- Delimiter: Pipe
- Bag: `document_id`
- Number of observations: 1,415,944
- Columns: `n`

## DTM (3)

A representation of the BOW as a sparse count matrix.

- UVA Box URL: https://virginia.box.com/s/nomwbb2t6nak1wrsqee90uupn3iosewf
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/02_derived_tables.ipynb
- Delimiter: Pipe
- Bag: `document_id`
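
As a small illustration of how the BOW and DTM relate, the sketch below counts terms per `document_id` and then pivots the counts into a document-term matrix. The toy data are hypothetical, and the result here is dense; for a corpus of this size a sparse representation would likely be preferable.

```python
import pandas as pd

# Toy token table (hypothetical data)
tokens = pd.DataFrame({
    "document_id": [0, 0, 0, 1, 1],
    "term_str":    ["good", "lesson", "good", "good", "job"],
})

# BOW: one row per (document, term) pair with its count n
BOW = tokens.groupby(["document_id", "term_str"]).size().to_frame("n")

# DTM: documents as rows, terms as columns, zeros where a term is absent
DTM = BOW["n"].unstack(fill_value=0)
print(DTM)
```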

## TFIDF (3)

A Document-Term matrix with TFIDF values.

- UVA Box URL: https://virginia.box.com/s/zgl6vxwxgtyywy9jxam1stmteiqblkmk
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/02_derived_tables.ipynb
- Delimiter: Pipe
- Description of TFIDF formula:

I use the following formulas to calculate the TF, IDF, and TFIDF:

$\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Maximum number of terms in document } d}$

$\text{IDF}(t) = \log \left( \frac{\text{Number of documents}}{\text{df}(t)} \right)$

$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$
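
The formulas above can be sketched in a few lines of pandas. Two assumptions are made here: the TF denominator is read as the count of the most frequent term in each document, and the logarithm is the natural log. The `tfidf` helper and the toy matrix are illustrative only.

```python
import numpy as np
import pandas as pd

def tfidf(dtm, log=np.log):
    """TF-IDF from a document-term count matrix (rows = documents)."""
    # TF: counts divided by the count of the most frequent term in each
    # document (an assumed reading of the denominator in the TF formula)
    tf = dtm.div(dtm.max(axis=1), axis=0)
    # DF: number of documents in which each term appears at least once
    df = (dtm > 0).sum(axis=0)
    # IDF: log of (number of documents / document frequency)
    idf = log(len(dtm) / df)
    return tf * idf

# Toy document-term matrix (hypothetical counts)
DTM = pd.DataFrame({"good": [2, 1], "lesson": [1, 0], "job": [0, 1]})
TFIDF = tfidf(DTM)
print(TFIDF)
```

A term that appears in every document, such as `good` in this toy matrix, gets an IDF of zero and therefore a TF-IDF of zero everywhere.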


## Reduced and Normalized TFIDF_L2 (3)

A Document-Term matrix with L2 normalized TFIDF values.

- UVA Box URL: https://virginia.box.com/s/zgl6vxwxgtyywy9jxam1stmteiqblkmk
- UVA Box URL of source TFIDF table: https://virginia.box.com/s/zgl6vxwxgtyywy9jxam1stmteiqblkmk
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/02_derived_tables.ipynb
- Delimiter: Pipe
- Number of features (i.e. significant words): 1,000
- Principle of significant word selection: 
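
For reference, L2 normalization scales each document row to unit Euclidean length. The sketch below shows only that step, on a toy matrix; the reduction to 1,000 features is not shown because the selection criterion is not stated here. The `l2_normalize` helper is a hypothetical name.

```python
import numpy as np
import pandas as pd

def l2_normalize(tfidf):
    """Scale each document row to unit Euclidean length."""
    norms = np.sqrt((tfidf ** 2).sum(axis=1))
    # Leave all-zero rows as zeros instead of dividing by zero
    return tfidf.div(norms.replace(0, 1), axis=0)

# Toy TFIDF matrix (hypothetical values); the second document is empty
TFIDF = pd.DataFrame({"a": [3.0, 0.0], "b": [4.0, 0.0]})
TFIDF_L2 = l2_normalize(TFIDF)
print(TFIDF_L2)
```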

# Models

## PCA Components (4)

- UVA Box URL: https://virginia.box.com/s/vb61w6zso7yus66xtdwsd1e0ta4pv2fv
- UVA Box URL of the source TFIDF_L2 table: https://virginia.box.com/s/zgl6vxwxgtyywy9jxam1stmteiqblkmk
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe
- Number of components: 5
- Library used to generate: sklearn.decomposition
- Top 5 positive terms for first component: objective, learning, practice, activity, praise
- Top 5 negative terms for second component: instruction, positive, student, specific, lesson
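
A minimal sketch of how a five-component PCA, its document-component matrix, and its loadings might be produced with `sklearn.decomposition`. The input here is random toy data with hypothetical term names, not the TFIDF_L2 table, and "loadings" follows this notebook's usage (the raw component weights from `components_`).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
terms = [f"term_{i}" for i in range(8)]            # hypothetical vocabulary
X = pd.DataFrame(rng.random((50, 8)), columns=terms)

pca = PCA(n_components=5)

# DCM: documents x components
DCM = pd.DataFrame(pca.fit_transform(X), index=X.index,
                   columns=[f"PC{i}" for i in range(5)])

# LOADINGS: terms x components
LOADINGS = pd.DataFrame(pca.components_.T, index=terms, columns=DCM.columns)

# Terms with the largest positive and negative weights on the first component
top_pos = LOADINGS["PC0"].nlargest(5).index.tolist()
top_neg = LOADINGS["PC0"].nsmallest(5).index.tolist()
print(top_pos, top_neg)
```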

## PCA DCM (4)

The document-component matrix generated.

- UVA Box URL: https://virginia.box.com/s/ngyvwlkuesmh1uvwq4xfb3drvzc03kz4
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe

## PCA Loadings (4)

The component-term matrix generated.

- UVA Box URL: https://virginia.box.com/s/vb61w6zso7yus66xtdwsd1e0ta4pv2fv
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe

## PCA Visualization 1 (4)

Include a scatterplot of documents in the space created by the first two components. Color the points based on a metadata feature associated with the documents.

![image.png](attachment:462d50c7-d38a-42df-9ff5-2d8178b16c8a.png)

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

![image.png](attachment:ccd50637-074e-45b7-9981-7a966de892ab.png)

Briefly describe the nature of the polarity you see in the first component:

Documents with low overall scores tend to have lower values of component 0. Overall score has a less clearly defined relationship with component 1, suggesting that differences between documents along this dimension do not translate to changes in the supervisor's overall rating of a PST's performance.

## PCA Visualization 2 (4)

Include a scatterplot of documents in the space created by the second two components. Color the points based on a metadata feature associated with the documents.

![image.png](attachment:83f00b28-fb59-49cf-9ee1-7ff1d274fb75.png)

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

![image.png](attachment:89c1d225-fcc5-45b5-8f04-19423b067d0b.png)

Briefly describe the nature of the polarity you see in the second component:

Similar to component 1, the relationship between overall score and component 2 is not well defined. This again suggests that differences between documents along this dimension do not translate to changes in the supervisor's overall rating of a PST's performance. This may be in line with prior research showing that supervisor feedback is more a function of supervisors' rating standards than of differences in PST performance.

## LDA TOPIC (4)

- UVA Box URL: 
- UVA Box URL of count matrix used to create: 
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe
- Library used to compute: sklearn
- A description of any filtering, e.g. POS (Nouns and Verbs only): Nouns and Verbs only
- Number of components: 10
- Any other parameters used: NA
- Top 5 words and best-guess labels for the top five topics by mean document weight:
    - T04:
        - Weight: 3,373
        - Top five words: classroom teaching management classroom management
        - Label: Classroom management
    - T03: 
        - Weight: 2,979
        - Top five words: class time work job kids classroom 
        - Label: Class time
    - T06: 
        - Weight: 2,291
        - Top five words: behavior continue instruction today voice
        - Label: Behavior and instruction
    - T01: 
        - Weight: 2,077
        - Top five words: job did work learning objective 
        - Label: Objective
    - T09: 
        - Weight: 1,938
        - Top five words: class st read asked activity 
        - Label: Activities

## LDA THETA (4)

- UVA Box URL: https://virginia.box.com/s/igkdlycjds75lnpul7nihc2yceo3r8zb
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe

## LDA PHI (4)

- UVA Box URL: https://virginia.box.com/s/lewcf96wk3ows1l6rzsnf77ydi2zc88u
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe

## LDA + PCA Visualization (4)

![image.png](attachment:3c8dac4c-7d53-43ec-8cf5-35bbbe8be85e.png)

Topic 8 is positively correlated with PC 2 and topic 4 is positively correlated with PC 1. These topics are distinct both from each other and from the other topics across these dimensions. Topic 4 appears to capture classroom management skill, which is generally something that new teachers lack [(Bartanen et al., 2023)](https://edworkingpapers.com/ai23-845). It is less clear what topic T08 captures. The top five words of this topic are: learning, work, behavior, objective, classroom. It also has the lowest weight of all the topics.

## Sentiment VOCAB_SENT (4)

Sentiment values associated with a subset of the VOCAB from a curated sentiment lexicon.

- UVA Box URL: https://virginia.box.com/s/5u9pul5gfq0ylz2h7c2rni9lrnvzcjkk
- UVA Box URL for source lexicon: [AFINN Sentiment Lexicon](http://corpustext.com/reference/sentiment_afinn.html)
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe

## Sentiment BOW_SENT (4)

Sentiment values from VOCAB_SENT mapped onto BOW.

- UVA Box URL: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe

## Sentiment DOC_SENT (4)

Computed sentiment per bag computed from BOW_SENT.

- UVA Box URL: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe
- Document bag expressed in terms of OHCO levels: ``document_id``
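
The three sentiment tables build on one another: lexicon values are attached to BOW rows, then aggregated per document. The sketch below shows one way to do this. The lexicon values are illustrative stand-ins in the style of AFINN, and the count-weighted mean used for the document score is an assumption, since the aggregation rule is not stated here.

```python
import pandas as pd

# Toy lexicon in the style of AFINN: integer valence per term (illustrative)
VOCAB_SENT = pd.Series({"good": 3, "great": 3, "poor": -2}, name="sentiment")

# Toy BOW (hypothetical counts)
BOW = pd.DataFrame({
    "document_id": [0, 0, 0, 1, 1],
    "term_str":    ["good", "lesson", "great", "poor", "lesson"],
    "n":           [2, 1, 1, 1, 3],
})

# BOW_SENT: keep only terms found in the lexicon and attach their valence
BOW_SENT = BOW.merge(VOCAB_SENT, left_on="term_str", right_index=True, how="inner")

# DOC_SENT: count-weighted mean valence per document (assumed aggregation)
BOW_SENT["weighted"] = BOW_SENT["n"] * BOW_SENT["sentiment"]
sums = BOW_SENT.groupby("document_id")[["weighted", "n"]].sum()
DOC_SENT = (sums["weighted"] / sums["n"]).rename("sentiment")
print(DOC_SENT)
```

Terms missing from the lexicon, such as `lesson` here, are dropped by the inner merge and do not affect the document score.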

## Sentiment Plot (4)

The plot below shows average sentiment score of documents plotted annually. The sentiment of supervisor feedback to PSTs has remained fairly stable over time.

![image.png](attachment:df384da7-5cf0-4557-8996-b79e49881f53.png)

## VOCAB_W2V (4)

A table of word2vec features associated with terms in the VOCAB table.

- UVA Box URL: https://virginia.box.com/s/lbhbzjglleme4s40oslk17e4t8965eol
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe
- Document bag expressed in terms of OHCO levels: document_id
- Number of features generated: 246
- The library used to generate the embeddings: gensim's corpora Dictionary

## Word2vec tSNE Plot (4)

Plot word embedding features in two dimensions using t-SNE.

![image.png](attachment:8835e862-2551-4b92-987b-917fe99aaee0.png)

Describe a cluster in the plot that captures your attention.

The cluster in the image above shows verbs that reflect supervisors asking PSTs to reengage or follow up with students about a concept. For example, the terms "reinforce", "repeat", and "remind" all express the need to say something again. These terms are also related to nearby verbs like "encourage", "clarify", "recommend", and "define", which are likely suggestions from supervisors on how PSTs should follow up with students about a concept. Generally, I think the algorithm did a good job of grouping related verbs.

# Riffs

## Riff 1 (5)

### Sentiment varies greatly by supervisor

The bar chart below shows that feedback sentiment varies greatly between supervisors. The x-axis shows the mean document-level sentiment score for a supervisor and the y-axis shows the supervisor IDs. For example, supervisor 68 gives feedback that is approximately 30 times more positive than supervisor 168. Below the bar chart, I provide examples of feedback from each of these supervisors. It is clear from this that there are large differences in the types of feedback given by supervisors. This is in line with prior research showing that the ratings supervisors give during clinical teaching observations largely reflect supervisor rating standards, rather than differences in PST performance [(Bartanen & Kwok, 2022)](https://journals.sagepub.com/doi/abs/10.3102/0002831221990359).

![image.png](attachment:121bf422-7a74-4d68-94a9-ceae126591c2.png)

Example feedback from Supervisor ID 68:

"Today Ms. Freels completed her third formal observation. It was also her first formal observation since beginning her second student teaching placement in seventh grade math recently. Today's lesson required students to solve volume problems involving prisms and pyramid figures. This lesson was the conclusion to earlier lessons on volume solutions using formulas. Formulas from previous lessons were brought together and attributes of each were reviewed and questioned before moving into stations each containing a volume problem to be solved for different prisms and pyramids. Students were not assigned partners to work with in the stations.  The requirement was to complete the problems within the class period and check their solutions with Ms. Freels. As station work began, students sorted themselves into groups or individual groups as they chose.  As the period moved along, students began working together in more cooperative groups and did so with exchanges of content vocabulary and exchanges of ideas as to understanding the problem at hand.  This gave Ms. Freels ample time to observe and check in with students needing additional explanation or instruction to move forward. It also allowed her time to assess individual misunderstandings or gaps in understandings to overcome.  Students were actively engaged through the learning period.  At the post-observation conference, Ms. Freels shared her many thought processes in designing this structure for room arrangement and instructional materials that supported a smooth flow of task completion for these 3D volume formulas. All the elements of the learning cycle were completed in presenting this lesson plan. All students could participate in the learning steps/tasks to the maximum of their ability.  It was recommended that she continue to structure engaging and productive lessons such as this whenever appropriate to the TEK to be studied.  
In the post-observation conference there was little to recommend for reinforcing but to continue the progress being made with her newest students in building good working relationships with students, and to continue building and expecting compliance with the expectations for listening and responding that she has begun. One small recommendation for lessons such as this involved keeping early finishers productively engaged. She and I discussed ideas that do not require advance planning and preparation, but did fit with the lesson objective of the day.
Ms. Freels spent the first half of the student teaching semester with an elementary grade level.  She has been with students as middle school level for a short time but is making adjustments to teaching an older set of students very well."

Example feedback from Supervisor ID 196:

"Student performed as expected."

## Riff 2 (5)

### Sentiment also varies by grade level, but less so than supervisor

The bar chart below shows that feedback sentiment also varies by grade level. The x-axis shows the average document sentiment score and the y-axis shows the grade level taught by a PST during the clinical teaching observation. For example, PSTs teaching middle school classes tend to receive more positive feedback than those teaching any other level.
![image.png](attachment:0ebe9457-5bac-407b-9882-abd05a543155.png)

Although these differences are non-trivial, they reflect differences in supervisors across these grade levels. To account for this, I estimate the following regression via OLS:

$\text{Sentiment}_{iso} = \beta_0 + \beta \times \mathbf{X}_i + \lambda_s + \varepsilon_{iso}$

Here, the sentiment score received by PST $i$, paired with supervisor $s$, during observation order $o$ is a function of binary variables for grade level taught (contained in the vector $\mathbf{X}_i$) and supervisor fixed effects ($\lambda_s$). I also cluster standard errors at the supervisor level. When I do this, many of the differences between elementary school teachers (the omitted group) and other grade levels disappear. This again suggests that the sentiment of feedback given to PSTs is a function of the supervisor providing the feedback, as opposed to differences across grades.
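
A minimal sketch of a regression of this form, using the within transformation to absorb supervisor fixed effects and a cluster-robust sandwich estimator for the standard errors. This is not the notebook's estimation code: the `fe_ols_clustered` helper, the simulated data, and the grade labels other than "Elementary" are assumptions, and no small-sample correction is applied to the standard errors.

```python
import numpy as np
import pandas as pd

def fe_ols_clustered(df, y_col, cat_col, group_col, omitted):
    """OLS of y on dummies for cat_col with group fixed effects and
    standard errors clustered by group (hypothetical helper)."""
    # Dummy variables for each category except the omitted reference group
    X = pd.get_dummies(df[cat_col], dtype=float).drop(columns=[omitted])
    y = df[y_col].astype(float)
    g = df[group_col]
    # Absorb the fixed effects: subtract group means from y and every dummy
    X_w = X - X.groupby(g).transform("mean")
    y_w = y - y.groupby(g).transform("mean")
    Xm, ym = X_w.to_numpy(), y_w.to_numpy()
    beta, *_ = np.linalg.lstsq(Xm, ym, rcond=None)
    resid = ym - Xm @ beta
    # Sandwich "meat": sum the score vectors X_g' u_g within each cluster
    scores = pd.DataFrame(Xm * resid[:, None]).groupby(g.to_numpy()).sum().to_numpy()
    bread = np.linalg.inv(Xm.T @ Xm)
    V = bread @ (scores.T @ scores) @ bread   # no small-sample correction
    return pd.DataFrame({"coef": beta, "se": np.sqrt(np.diag(V))}, index=X.columns)

# Simulated data: supervisor effects plus a 0.5 shift for one grade level
rng = np.random.default_rng(0)
n = 2000
toy = pd.DataFrame({
    "supervisor_id": rng.integers(0, 20, n),
    "gradelevel_cl": rng.choice(["Elementary", "Middle", "High"], n),
})
sup_effect = rng.normal(0, 2, 20)
toy["sentiment"] = (sup_effect[toy["supervisor_id"].to_numpy()]
                    + 0.5 * (toy["gradelevel_cl"] == "Middle")
                    + rng.normal(0, 0.1, n))

result = fe_ols_clustered(toy, "sentiment", "gradelevel_cl",
                          "supervisor_id", omitted="Elementary")
print(result)
```

In the simulation, the large supervisor effects are removed by demeaning, so the estimated grade-level coefficients are measured relative to the omitted group within supervisors.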

![image.png](attachment:fc9b9754-6e44-4b4f-851b-d5f962f2160b.png)

## Riff 3 (5)

### Relationship between sentiment and topics

In the scatterplots below, I explore the relationship between a document's sentiment score and the topics from the topic modeling algorithm. In each plot, the y-axis is a document's sentiment score and the x-axis is the score for one of the 10 topics (e.g., T00). I also include a line estimating the linear relationship between these two variables, estimated via OLS, and report the slope coefficient in the title of each plot. Most topics show a modest relationship between sentiment and the presence of a topic. For example, topic T04 (labeled above as "Classroom Management") has a slightly negative correlation with sentiment. As mentioned previously, research shows that classroom management is generally a key skill that is lacking for new teachers [(Bartanen et al., 2023)](https://edworkingpapers.com/ai23-845).
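
The per-topic slopes described above can be computed with a degree-one polynomial fit. This is a minimal sketch on toy data; the `topic_slopes` helper and the values are hypothetical.

```python
import numpy as np
import pandas as pd

def topic_slopes(theta, sentiment):
    """OLS slope of sentiment on each topic weight, one topic at a time."""
    slopes = {}
    for topic in theta.columns:
        # np.polyfit with deg=1 returns (slope, intercept)
        slope, _intercept = np.polyfit(theta[topic], sentiment, deg=1)
        slopes[topic] = slope
    return pd.Series(slopes, name="slope")

# Toy document-topic weights and sentiment scores (hypothetical values)
THETA = pd.DataFrame({"T00": [0.0, 0.5, 1.0], "T01": [1.0, 0.5, 0.0]})
sentiment = pd.Series([1.0, 2.0, 3.0])
slopes = topic_slopes(THETA, sentiment)
print(slopes)
```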

![image.png](attachment:31f5c489-45d4-4626-8594-f1454acbe01d.png)

Interestingly, the presence of topic T04 appears to be positively related to supervisor ratings of PSTs. The bar graphs below show the average values of topics broken out by supervisors' overall ratings of a PST's clinical evaluation. PSTs who receive feedback that has higher values of topic T04 are much more likely to get a high rating than those with low values of this topic. This might suggest that these PSTs are good at managing a classroom.

![image.png](attachment:0fdbbaae-157e-48ca-8e28-ac01f7f06147.png)

# Interpretation (4)

Describe something interesting about your corpus that you discovered during the process of completing this assignment.

At a minimum, use 250 words, but you may use more. You may also add images if you'd like.
