# Lab Session 10: Introduction to Natural Language Processing 

> Friday 11-14-2025, 9AM-11AM & 1PM-3PM & 3PM-5PM
>
> Instructors: [Jimmy Butler](https://statistics.berkeley.edu/people/james-butler) & [Sequoia Andrade](https://statistics.berkeley.edu/people/sequoia-rose-andrade)

What's planned for today:
1. **Text Data Loading**: We will review text data loading using a dataset of fake and real news. We will perform exploratory data analysis on it.
2. **Text Data Processing**: We will practice text processing using the SpaCy package, including lemmatization and word frequency counts.
3. **Topic Modeling**: We will start discussing topic modeling and demonstrate how to build topic models.

# Text Data Loading

1. First, load the "Fake" and "True" datasets from their CSV files into dataframes.
2. Make two plots:
   - One showing the distribution of subjects in the Fake dataset
   - One showing the distribution of subjects in the True dataset
3. Make one plot of the number of reports (y-axis) for each month of each year (x-axis), with one line for True and one line for Fake.
4. Make two plots:
   - One histogram of the distribution of text length in the Fake dataset
   - One histogram of the distribution of text length in the True dataset

```python
import pandas as pd
import spacy
from collections import Counter
```

```python
fake = pd.read_csv("fake-and-real-news-dataset/Fake.csv")
real = pd.read_csv("fake-and-real-news-dataset/True.csv")
```
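
The plotting tasks listed at the top of this section could be approached roughly as follows. This is a minimal sketch, not a reference solution: it assumes the CSVs contain `subject`, `date`, and `text` columns (the column names are not stated in this notebook, so check `fake.columns` first), that text length means length in characters, and that matplotlib is installed. The helper names are hypothetical.

```python
import pandas as pd


def subject_counts(df):
    """Number of articles per subject (assumes a 'subject' column)."""
    return df["subject"].value_counts()


def monthly_counts(df):
    """Number of articles per calendar month (assumes a 'date' column).

    Dates that cannot be parsed become NaT and are dropped.
    """
    dates = pd.to_datetime(df["date"], errors="coerce").dropna()
    return dates.dt.to_period("M").value_counts().sort_index()


def text_lengths(df):
    """Length of each article in characters (assumes a 'text' column)."""
    return df["text"].str.len()


def plot_eda(fake, real):
    """Draw the subject, monthly-count, and text-length plots."""
    import matplotlib.pyplot as plt  # imported here so the helpers above only need pandas

    datasets = [("Fake", fake), ("True", real)]

    # Two bar charts: subject distribution per dataset
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    for ax, (name, df) in zip(axes, datasets):
        subject_counts(df).plot.bar(ax=ax, title=f"Subjects: {name}")

    # One line chart: articles per month, one line per dataset
    fig, ax = plt.subplots(figsize=(10, 4))
    for name, df in datasets:
        counts = monthly_counts(df)
        ax.plot(counts.index.to_timestamp(), counts.values, label=name)
    ax.set_xlabel("Month")
    ax.set_ylabel("Number of articles")
    ax.legend()

    # Two histograms: text length per dataset
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    for ax, (name, df) in zip(axes, datasets):
        ax.hist(text_lengths(df).dropna(), bins=50)
        ax.set_title(f"Text length: {name}")
        ax.set_xlabel("Characters")

    plt.show()
```

Calling `plot_eda(fake, real)` after the loading cell should produce all of the plots. Using `errors="coerce"` is a design choice for robustness: any row whose date cannot be parsed is silently excluded from the monthly counts, so it is worth checking how many rows are dropped.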

# Text Data Processing with SpaCy - Tokenization, Lemmatization, Word Frequency

In this section, we follow steps similar to those in the project: simple text processing with the SpaCy package, using the same dataset as in Part 1. To install SpaCy, run the following in your environment:

```
conda install -c conda-forge spacy 
python -m spacy download en_core_web_sm
```

Some important definitions:

- *Token*: a single word or piece of a word
- *Lemma*: the base (dictionary) form of a word, e.g., "complete" is the lemma for "completed" and "completing"
- *Stop Word*: a common word that does not add semantic value, such as "a", "and", "the", etc.
- *Vectorization*: representing a document as a vector where each index corresponds to a token or word in the vocabulary and each entry is the count of that token in the document
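
To make the vectorization definition concrete, here is a minimal sketch of a count (bag-of-words) vector built with only the standard library. The whitespace tokenizer and the helper name `count_vectorize` are illustrative assumptions; in the lab itself, tokens come from SpaCy.

```python
from collections import Counter


def count_vectorize(documents):
    """Return (vocabulary, vectors) for a list of strings.

    Each vector has one entry per vocabulary word, holding that
    word's count in the corresponding document.
    """
    # Naive tokenization: lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[tok] for tok in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = count_vectorize(["the cat sat", "the cat saw the dog"])
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```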

In this section, we will explore the most common tokens and lemmas across different slices of the data. We will also develop vectorized representations of the articles.

The core steps are:

1. Process the Fake and True data separately using the SpaCy `nlp` pipeline (take a subset of the articles first!)
2. Analyze lemmas:
   - Create a list of all lemmas that are not stop words, punctuation, or spaces.
   - Display the top 25 lemmas for each of the True and Fake datasets. Is there any difference?
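
The core steps above could be sketched as follows. This is one possible approach under stated assumptions, not the official solution: the helper names are hypothetical, lemmas are lowercased before counting (a choice, not a requirement), and the `en_core_web_sm` model from the install step is assumed to be available.

```python
from collections import Counter


def content_lemmas(doc):
    """Lowercased lemmas of tokens that are not stop words, punctuation, or spaces."""
    return [
        token.lemma_.lower()
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]


def top_lemmas(docs, n=25):
    """Most common content lemmas across an iterable of processed documents."""
    counts = Counter()
    for doc in docs:
        counts.update(content_lemmas(doc))
    return counts.most_common(n)


def process_texts(texts, model="en_core_web_sm"):
    """Run the SpaCy pipeline over an iterable of strings and return the Docs."""
    import spacy  # imported here so the counting helpers have no hard dependency

    nlp = spacy.load(model)
    # nlp.pipe processes texts in batches, which is faster than calling nlp() per text
    return list(nlp.pipe(texts))
```

For example, `top_lemmas(process_texts(fake["text"].head(500)))` would give the top 25 lemmas for the first 500 Fake articles, assuming the article body lives in a `text` column; the subset size of 500 is arbitrary and only there to keep processing time manageable.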

**Resources:**
- https://realpython.com/natural-language-processing-spacy-python/
- https://www.statology.org/text-preprocessing-feature-engineering-spacy/ 
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html# 
- https://www.geeksforgeeks.org/nlp/how-to-store-a-tfidfvectorizer-for-future-use-in-scikit-learn/ 



# Topic Modeling (Optional) - LDA Topic Modeling with Gensim
- Train an LDA model with 10 topics
- Output the top 10 words for each topic
- Output the topic distribution for the first article
- Make a visualization
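
A minimal sketch of these steps with Gensim is shown below. It assumes Gensim is installed and that each document has already been reduced to a list of tokens (for example, the content lemmas from the previous section). The helper names and the `passes` and `seed` defaults are illustrative assumptions, not settings prescribed by the lab.

```python
def train_lda(tokenized_docs, num_topics=10, passes=5, seed=0):
    """Fit an LDA model on a list of token lists.

    Returns the model, the bag-of-words corpus, and the dictionary.
    """
    from gensim import corpora
    from gensim.models import LdaModel

    # Map each unique token to an integer id
    dictionary = corpora.Dictionary(tokenized_docs)
    # Convert each document to a list of (token_id, count) pairs
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=passes,
        random_state=seed,  # fixed seed for reproducible topics
    )
    return lda, corpus, dictionary


def top_words(lda, num_words=10):
    """Dictionary mapping each topic id to its highest-probability words."""
    topics = lda.show_topics(
        num_topics=lda.num_topics, num_words=num_words, formatted=False
    )
    return {topic_id: [word for word, _ in words] for topic_id, words in topics}


def topic_distribution(lda, bow):
    """List of (topic_id, probability) pairs for one bag-of-words document."""
    return lda.get_document_topics(bow, minimum_probability=0.0)
```

With `lda, corpus, dictionary = train_lda(tokenized_docs)`, calling `top_words(lda)` lists the top 10 words per topic and `topic_distribution(lda, corpus[0])` gives the topic distribution for the first document. For the visualization, one option is a bar chart of that distribution; another, assuming the separate pyLDAvis package is installed, is `pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)`.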