# Final Project Notebook

DS 5001 Exploratory Text Analytics | Spring 2024

# Metadata

- Full Name: Andrew Avitabile
- Userid: yaj3ma@virginia.edu
- GitHub Repo URL: https://github.com/avitabilea/DS5001-Final-Project
- UVA Box URL: https://virginia.app.box.com/folder/260794950453

# Raw Data

## Source Description (1)

The data used in this project come from a large teacher training program (TTP) in Texas between 2012 and 2019. As part of this training, preservice teachers (PSTs), individuals pursuing a teaching certificate, take part in clinical teaching experiences. In these experiences, PSTs engage in professional activities under the mentorship of cooperating (or mentor) teachers. TTPs oversee these field experiences through field supervisors, who manage relationships between PSTs and cooperating teachers, observe classroom activities, and provide written feedback to PSTs. Although field supervisors play a key role in the clinical training of PSTs, the influence of their feedback on PST growth remains unclear. In this context, field supervisors observe PSTs (often multiple times) and provide unstructured, written feedback on classroom performance.

## Source Features (1)

- UVA Box URL: https://virginia.box.com/s/p25x3as50pbgbyzm6fharyi7tbljin5c
- Number of raw documents: 19,008
- Total size of raw documents: 4,410 KB
- File format: .xlsx

## Source Document Structure (1)

The documents in this corpus are feedback provided by supervisors to PSTs after observations during their clinical teaching experiences. This feedback is provided in the file [eval_text.xlsx](https://virginia.box.com/s/p25x3as50pbgbyzm6fharyi7tbljin5c) in the column ``overallcomments``. This workbook also includes deidentified information on the supervisor (``supervisor_id``) and PST (``uin``), the date of the observation (``observationdate``), a discrete, numeric score given by supervisors to PSTs rating them between 1 and 4 (``overallrating_num``), and the order of the observation within a supervisor and PST (``observation_order``). The values of the variable ``overallrating_num`` are 1: Needs Significant Improvement, 2: Growth in Progress, 3: Proficient, and 4: Exceeds Expectations. Prior research shows that these ratings largely reflect differences in supervisor rating standards [(Bartanen & Kwok, 2022)](https://journals.sagepub.com/doi/abs/10.3102/0002831221990359).

In a sense, these documents are letters, addressed from supervisors to PSTs. Each PST should be observed at least three times before completing their clinical teaching experience, but some are observed more than that and some drop out before completion. Supervisors may observe many PSTs. In addition to written feedback,  (see the variable ``avgscore``).

The amount and organization of the text in each document vary greatly. For example, the longest supervisor feedback contains over 300 sentences, whereas the shortest contains 11 characters and reads: "Good lesson." The histogram below shows the distribution of the number of characters in each document. Some documents are written as paragraphs and others as bulleted lists. Some documents have little to no formal structure.

![image.png](attachment:6df87b26-468d-4639-ab3d-bc350dc94c0d.png)

# Parsed and Annotated Data

## LIB (2)

The source documents the corpus comprises. These may be books, plays, newspaper articles, abstracts, blog posts, etc. 

Note that these are *not* documents in the sense used to describe a bag-of-words representation of a text, e.g. chapter.

- UVA Box URL: https://virginia.box.com/s/ngyvwlkuesmh1uvwq4xfb3drvzc03kz4
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/01_data_prep.ipynb
- Delimiter: Pipe
- Number of observations: 19,008
- List of features: `supervisor_id` `observationdate` `overallcomments` `overallrating_num` `observation_order` `n_psts` `n_documents` `sentence_count` `token_count` `char_count`
- Average length of each document in characters: 651

## CORPUS (2)

The sequence of word tokens in the corpus, indexed by their location in the corpus and document structures.

- UVA Box URL: https://virginia.box.com/s/nomwbb2t6nak1wrsqee90uupn3iosewf
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/01_data_prep.ipynb
- Delimiter: Pipe
- Number of observations: 2,263,958
- OHCO Structure: `document_id`, `sentence_num`, `token_num`
- Columns: `token_str`, `term_str`, `pos`, `pos_group`
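
To illustrate the OHCO index listed above, the following minimal sketch splits one document into sentences and tokens using only regular expressions. The function name `tokenize_document`, the regex splitting rules, and the lowercasing used for `term_str` are assumptions made for illustration; the actual notebook may use a different tokenizer, and the `pos` and `pos_group` columns are omitted here.

```python
import re

def tokenize_document(document_id, text):
    """Return one row per token, keyed by (document_id, sentence_num, token_num)."""
    rows = []
    # Assumed rule: a sentence ends at ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    for sentence_num, sentence in enumerate(sentences):
        # Assumed rule: a token is a run of word characters or a single punctuation mark.
        tokens = re.findall(r"\w+|[^\w\s]", sentence)
        for token_num, token_str in enumerate(tokens):
            rows.append({
                "document_id": document_id,
                "sentence_num": sentence_num,
                "token_num": token_num,
                "token_str": token_str,
                "term_str": token_str.lower(),
            })
    return rows

rows = tokenize_document(0, "Good lesson. Keep it up!")
```

Keeping punctuation as separate tokens is a design choice in this sketch; it would explain how a term such as `!` can appear in a vocabulary, but the source does not state how punctuation was handled.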

## VOCAB (2)

The unique word types (terms) in the corpus.

- UVA Box URL: https://virginia.box.com/s/nomwbb2t6nak1wrsqee90uupn3iosewf
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/01_data_prep.ipynb and https://github.com/avitabilea/DS5001-Final-Project/blob/master/02_derived_tables.ipynb
- Delimiter: Pipe
- Number of observations: 26,608 rows
- Columns: `n` `porter_stem` `stop` `max_pos` `max_pos_group` `ngram_length` `df` `idf` `dfidf` `mean_tfidf`

Top 20 Most Significant Words

1. were
2. as
3. you
4. student
5. are
6. that
7. was
8. good
9. classroom
10. her
11. well
12. job
13. this
14. very
15. teacher
16. an
17. your
18. !
19. have
20. be

# Derived Tables

## BOW (3)

A bag-of-words representation of the CORPUS.

- UVA Box URL: https://virginia.box.com/s/f491d7f7dtmjb3d5425o79p4lscwfzeh
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/02_derived_tables.ipynb
- Delimiter: Pipe
- Bag: `document_id` `sentence_num`
- Number of observations: 1,380,782 
- Columns: `n`
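
As a rough illustration of how a bag-of-words table with a single count column `n` can be derived from token rows, the sketch below counts terms within each bag. The toy rows and the helper name `make_bow` are hypothetical; only the bag levels (`document_id`, `sentence_num`) and the column name `n` come from the description above.

```python
from collections import Counter

# Hypothetical CORPUS rows: (document_id, sentence_num, token_num, term_str).
corpus = [
    (0, 0, 0, "good"), (0, 0, 1, "lesson"), (0, 0, 2, "good"),
    (0, 1, 0, "good"), (1, 0, 0, "lesson"),
]

def make_bow(corpus_rows, bag_levels=2):
    """Count terms within each bag; the bag is the first `bag_levels` OHCO levels."""
    bow = Counter()
    for row in corpus_rows:
        bag = row[:bag_levels]
        term_str = row[3]
        bow[bag + (term_str,)] += 1  # the count plays the role of column `n`
    return dict(bow)

bow = make_bow(corpus)                     # bag = (document_id, sentence_num)
doc_bow = make_bow(corpus, bag_levels=1)   # bag = (document_id,)
```

Choosing a coarser bag (here `bag_levels=1`) yields fewer, larger bags, which is the form a document-level count matrix would need.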

## DTM (3)

A representation of the BOW as a sparse count matrix.

- UVA Box URL: https://virginia.box.com/s/nomwbb2t6nak1wrsqee90uupn3iosewf
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/02_derived_tables.ipynb
- Delimiter: Pipe
- Bag: `document_id`

## TFIDF (3)

A Document-Term matrix with TFIDF values.

- UVA Box URL: https://virginia.box.com/s/zgl6vxwxgtyywy9jxam1stmteiqblkmk
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/02_derived_tables.ipynb
- Delimiter: Pipe
- Description of TFIDF formula:

I use the following formulas to calculate the TF, IDF, and TFIDF:

$\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Maximum number of terms in document } d}$

$\text{IDF}(t) = \log \left( \frac{\text{Number of documents}}{\text{df}(t)} \right)$

$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$
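
The formulas above can be sketched in a few lines of Python. Two points are assumptions rather than facts from the notebook: the TF denominator is read here as the count of the most frequent term in the document, and the logarithm is taken base 2. The toy documents and the function name `tfidf` are made up for illustration.

```python
import math
from collections import Counter

# Toy documents, already tokenized into terms (made-up data).
docs = {
    "d0": ["good", "lesson", "good"],
    "d1": ["lesson", "plan"],
}

def tfidf(docs):
    """Return {(doc, term): tfidf} using max-normalized TF and log2 IDF (both assumptions)."""
    n_docs = len(docs)
    counts = {d: Counter(terms) for d, terms in docs.items()}
    # df(t): number of documents that contain term t.
    df = Counter(t for c in counts.values() for t in c)
    out = {}
    for d, c in counts.items():
        max_count = max(c.values())  # assumed reading of the TF denominator
        for t, n in c.items():
            tf = n / max_count
            idf = math.log2(n_docs / df[t])
            out[(d, t)] = tf * idf
    return out

scores = tfidf(docs)
```

In this toy corpus, "lesson" appears in every document, so its IDF and therefore its TFIDF are zero, while terms unique to one document keep a positive weight.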


## Reduced and Normalized TFIDF_L2 (3)

A Document-Term matrix with L2 normalized TFIDF values.

- UVA Box URL: https://virginia.box.com/s/zgl6vxwxgtyywy9jxam1stmteiqblkmk
- UVA Box URL of source TFIDF table: https://virginia.box.com/s/zgl6vxwxgtyywy9jxam1stmteiqblkmk
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/02_derived_tables.ipynb
- Delimiter: Pipe
- Number of features (i.e. significant words): 1,000
- Principle of significant word selection: 
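
For reference, L2 normalization rescales each document's row of TFIDF values so that its Euclidean length is 1. The sketch below shows the idea on a single row; the helper name `l2_normalize` and the choice to return all-zero rows unchanged are assumptions, not details taken from the notebook.

```python
import math

def l2_normalize(row):
    """Scale a {term: weight} row to unit Euclidean length; all-zero rows are returned unchanged."""
    norm = math.sqrt(sum(w * w for w in row.values()))
    if norm == 0:
        return dict(row)
    return {t: w / norm for t, w in row.items()}

unit = l2_normalize({"good": 3.0, "lesson": 4.0})
```

After normalization, documents of very different lengths become comparable, because only the relative weights of terms within each document remain.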

# Models

## PCA Components (4)

- UVA Box URL: https://virginia.box.com/s/vb61w6zso7yus66xtdwsd1e0ta4pv2fv
- UVA Box URL of the source TFIDF_L2 table: https://virginia.box.com/s/zgl6vxwxgtyywy9jxam1stmteiqblkmk
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe
- Number of components: 5
- Library used to generate: sklearn.decomposition
- Top 5 positive terms for first component: confident, rating, skill, watching, clear
- Top 5 negative terms for second component: history, power, kids, point, provide, world 

## PCA DCM (4)

The document-component matrix generated.

- UVA Box URL: https://virginia.box.com/s/ngyvwlkuesmh1uvwq4xfb3drvzc03kz4
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe

## PCA Loadings (4)

The component-term matrix generated.

- UVA Box URL: https://virginia.box.com/s/vb61w6zso7yus66xtdwsd1e0ta4pv2fv
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe

## PCA Visualization 1 (4)

Include a scatterplot of documents in the space created by the first two components. Color the points based on a metadata feature associated with the documents.

![image.png](attachment:5003d4b5-a602-4917-b382-5eed1f74ce1b.png)

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

![image.png](attachment:c63e836d-b5e5-434f-8b0f-7fd2687f01ff.png)

Briefly describe the nature of the polarity you see in the first component:

Documents with low overall scores seem to be very far away from those with higher overall scores. This suggests that PSTs who had poor evaluations receive comments from their supervisors that are much different from those received by PSTs with better evaluations.

## PCA Visualization 2 (4)

Include a scatterplot of documents in the space created by the second two components. Color the points based on a metadata feature associated with the documents.

![image.png](attachment:baf4ef98-2e66-405f-94b8-416012078c0d.png)

Also include a scatterplot of the loadings for the same two components. (This does not need a feature mapped onto color.)

![image.png](attachment:90e009eb-03b6-4708-8470-433cf0cb570a.png)

Briefly describe the nature of the polarity you see in the second component:

Similarly to the first two components, documents with low overall scores seem to be very far away from those with higher overall scores. This suggests that PSTs who had poor evaluations receive comments from their supervisors that are much different from those received by PSTs with better evaluations.

## LDA TOPIC (4)

- UVA Box URL: 
- UVA Box URL of count matrix used to create: 
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe
- Library used to compute: sklearn
- A description of any filtering, e.g. POS (Nouns and Verbs only): Nouns and Verbs only
- Number of components: 10
- Any other parameters used: NA
- Top 5 words and best-guess labels for top five topics by mean document weight:
  - T00: lesson classroom ms learning management teaching
  - T01: time questions lesson class know
  - T02: lesson today rating continue today
  - T03: teaching classroom mrs lessons semester 
  - T04: lesson job did did job activity engaged

## LDA THETA (4)

- UVA Box URL: https://virginia.box.com/s/igkdlycjds75lnpul7nihc2yceo3r8zb
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe

## LDA PHI (4)

- UVA Box URL: https://virginia.box.com/s/lewcf96wk3ows1l6rzsnf77ydi2zc88u
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe

## LDA + PCA Visualization (4)

Apply PCA to the PHI table and plot the topics in the space opened by the first two components.

Size the points based on the mean document weight of each topic (using the THETA table).

Color the points based on a metadata feature from the LIB table.

Provide a brief interpretation of what you see.

(INSERT IMAGE HERE)

(INSERT INTERPRETATION HERE)

## Sentiment VOCAB_SENT (4)

Sentiment values associated with a subset of the VOCAB from a curated sentiment lexicon.

- UVA Box URL: https://virginia.box.com/s/5u9pul5gfq0ylz2h7c2rni9lrnvzcjkk
- UVA Box URL for source lexicon: [AFINN Sentiment Lexicon](http://corpustext.com/reference/sentiment_afinn.html)
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe

## Sentiment BOW_SENT (4)

Sentiment values from VOCAB_SENT mapped onto BOW.

- UVA Box URL: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe

## Sentiment DOC_SENT (4)

Computed sentiment per bag computed from BOW_SENT.

- UVA Box URL: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe
- Document bag expressed in terms of OHCO levels: ``document_id``
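
One way to go from lexicon values to a per-document score is sketched below. The aggregation rule (a count-weighted mean over terms found in the lexicon) is an assumption, since the source does not state how sentiment was combined within a bag. The lexicon values are illustrative placeholders, not copied from AFINN, and `doc_sentiment` is a hypothetical helper name.

```python
# Hypothetical lexicon values for illustration; not copied from AFINN.
vocab_sent = {"good": 3, "great": 3, "poor": -2}

# Hypothetical BOW rows: (document_id, term_str) -> count.
bow = {(0, "good"): 2, (0, "lesson"): 1, (1, "poor"): 1, (1, "great"): 1}

def doc_sentiment(bow, vocab_sent):
    """Count-weighted mean sentiment per document over terms found in the lexicon."""
    totals, weights = {}, {}
    for (document_id, term_str), n in bow.items():
        if term_str not in vocab_sent:
            continue  # terms outside the lexicon carry no sentiment
        totals[document_id] = totals.get(document_id, 0) + n * vocab_sent[term_str]
        weights[document_id] = weights.get(document_id, 0) + n
    return {d: totals[d] / weights[d] for d in totals}

doc_sent = doc_sentiment(bow, vocab_sent)
```

Under this rule, documents with no lexicon terms receive no score at all, which differs from assigning them a neutral value of zero; either choice affects any later plot of sentiment against a metadata feature.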

## Sentiment Plot (4)

Plot sentiment over some metric space, such as time.

If you don't have a metric metadata feature, plot sentiment over a feature of your choice.

You may use a bar chart or a line graph.

(INSERT IMAGE HERE)

## VOCAB_W2V (4)

A table of word2vec features associated with terms in the VOCAB table.

- UVA Box URL: https://virginia.box.com/s/lbhbzjglleme4s40oslk17e4t8965eol
- GitHub URL for notebook used to create: https://github.com/avitabilea/DS5001-Final-Project/blob/master/03_models.ipynb
- Delimiter: Pipe
- Document bag expressed in terms of OHCO levels: document_id
- Number of features generated: 246
- The library used to generate the embeddings: gensim's corpora Dictionary

## Word2vec tSNE Plot (4)

Plot word embedding features in two dimensions using t-SNE.

![image.png](attachment:8835e862-2551-4b92-987b-917fe99aaee0.png)

Describe a cluster in the plot that captures your attention.

The cluster in the image above shows verbs that reflect supervisors asking PSTs to reengage or follow up with students about a concept. For example, the terms "reinforce", "repeat", and "remind" all express the need to say something again. These terms are also related to nearby verbs like "encourage", "clarify", "recommend", and "define", which are likely suggestions from supervisors on how PSTs should follow up with students about a concept.

# Riffs

Provide at least three visualizations that combine the preceding model data in interesting ways.

These should provide insight into how features in the LIB table are related. 

The nature of this relationship is left open to you -- it may be correlation, or mutual information, or something less well defined. 

In doing so, consider the following visualization types:

- Hierarchical cluster diagrams
- Heatmaps
- Scatter plots
- KDE plots
- Dispersion plots
- t-SNE plots
- etc.

## Riff 1 (5)

(INSERT IMAGE HERE)

(INSERT INTERPRETATION HERE)

## Riff 2 (5)

(INSERT IMAGE HERE)

(INSERT INTERPRETATION HERE)

## Riff 3 (5)

(INSERT IMAGE HERE)

(INSERT INTERPRETATION HERE)

# Interpretation (4)

Describe something interesting about your corpus that you discovered during the process of completing this assignment.

At a minimum, use 250 words, but you may use more. You may also add images if you'd like.

(INSERT INTERPRETATION HERE)