# 📝 6.2 Text Analysis for Qualitative Research

We’ll **prepare** text for human-led coding (cleaning, tokenisation, light structure) and add **small helper summaries** (frequencies, n-grams) that support—not replace—interpretation.

This notebook keeps the qualitative lens front-and-centre while giving you just enough NLP to work efficiently.

## 🎯 Objectives
- Clean and tokenise open-ended responses with NLTK.
- Lemmatise, remove stopwords/punctuation, handle case.
- Build **n-grams** (bigrams, trigrams) to surface phrases.
- Optional: POS tags and (careful) sentiment as exploratory aids.
- Export a tidy table ready for **manual coding** or 6.3.

In [None]:
%pip install -q pandas nltk matplotlib seaborn scikit-learn wordcloud

import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter
from wordcloud import WordCloud

sns.set_theme()

# NLTK data (robust to version differences): newer releases use 'punkt_tab' and
# 'averaged_perceptron_tagger_eng'; older ones use the unsuffixed names.
# Request both and ignore any that are unavailable.
for resource in ['punkt', 'punkt_tab', 'stopwords', 'wordnet',
                 'averaged_perceptron_tagger_eng', 'averaged_perceptron_tagger',
                 'vader_lexicon']:
    try:
        nltk.download(resource, quiet=True)
    except Exception:
        pass

print('Text/NLP environment ready.')

In [None]:
from pathlib import Path

# One response per non-empty line
txt = Path('data')/'food_preferences.txt'
responses = [r.strip() for r in txt.read_text(encoding='utf-8').splitlines() if r.strip()]
df = pd.DataFrame({'response_id': range(1, len(responses)+1), 'text': responses})
print('N responses:', len(df))
df.head(5)

## 🧼 Preprocessing pipeline
We’ll lowercase, tokenise, remove stopwords/punctuation, and **lemmatise** (carrots→carrot). This **supports** coding by removing noise.

In [None]:
stop = set(stopwords.words('english')).union({'hippo', 'h1','h2','h3'})
lem = WordNetLemmatizer()

def clean_tokens(text: str):
    words = word_tokenize(text.lower())
    words = [w for w in words if w.isalpha()]  # drop punctuation/numbers
    words = [w for w in words if w not in stop]
    words = [lem.lemmatize(w) for w in words]
    return words

df['tokens'] = df['text'].apply(clean_tokens)
df[['response_id','text','tokens']].head(6)

## 📊 Frequencies & word cloud (orientation only)

In [None]:
# Flatten the per-response token lists and count the 15 most common words
all_tokens = [t for row in df['tokens'] for t in row]
freq = Counter(all_tokens).most_common(15)
pd.DataFrame(freq, columns=['word','count'])
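Raw counts can be dominated by one respondent who repeats a word. As a complementary orientation aid, the sketch below counts how many responses mention each word at least once. The helper name `response_frequency` and the sample tokens are illustrative assumptions, not part of the notebook's dataset.

```python
from collections import Counter

def response_frequency(tokens_list):
    """Count how many responses contain each token at least once."""
    counts = Counter()
    for toks in tokens_list:
        counts.update(set(toks))  # set() so repeats within one response count once
    return counts

# Hypothetical cleaned responses
sample_tokens = [
    ['carrot', 'carrot', 'carrot', 'crunchy'],
    ['fruit', 'sweet'],
    ['fruit', 'carrot'],
]

raw = Counter(t for toks in sample_tokens for t in toks)
per_response = response_frequency(sample_tokens)

print('raw count of carrot:', raw['carrot'])
print('responses mentioning carrot:', per_response['carrot'])
```

In the notebook, `response_frequency(df['tokens']).most_common(15)` should give the equivalent table for the real responses.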

In [None]:
wc = WordCloud(width=800, height=400, background_color='white').generate(' '.join(all_tokens))
plt.figure(figsize=(10,4))
plt.imshow(wc)
plt.axis('off')
plt.title('Word Cloud')
plt.show()

## 🔗 N-grams (bigrams & trigrams)
Short phrases can reveal food pairings (e.g., *fresh fruit*, *crunchy carrot*).

In [None]:
from nltk.util import ngrams

def ngram_counts(tokens_list, n=2, top=15):
    ng = Counter()
    for toks in tokens_list:
        ng.update(ngrams(toks, n))  # n-grams never span two responses
    return pd.DataFrame(ng.most_common(top), columns=[f'{n}-gram','count'])

bigrams = ngram_counts(df['tokens'], n=2, top=15)
trigrams = ngram_counts(df['tokens'], n=3, top=10)
display(bigrams); display(trigrams)
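To make the n-gram step less of a black box, here is a dependency-free sketch of the sliding window that `nltk.util.ngrams` produces by default (no padding). The function name `sliding_ngrams` and the sample tokens are assumptions made for illustration.

```python
from collections import Counter

def sliding_ngrams(tokens, n=2):
    """Return consecutive n-token tuples; lists shorter than n yield nothing."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Hypothetical cleaned responses (stopwords already removed)
sample_tokens = [
    ['love', 'fresh', 'fruit'],
    ['fresh', 'fruit', 'crunchy', 'carrot'],
]

bigram_counts = Counter()
for toks in sample_tokens:
    bigram_counts.update(sliding_ngrams(toks, 2))

print(bigram_counts.most_common(1))
```

Because stopwords are removed before the n-grams are built, a pair such as `('fruit', 'carrot')` may come from "fruit and carrot" rather than from adjacent words, so it is worth checking the original quote before reading a bigram as a phrase.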

## 🧪 Optional aids: POS tags & VADER sentiment
Use sparingly; these are **exploratory hints**, not findings. Lexicon-based sentiment scores can be noisy on domain-specific language.

In [None]:
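try:
    from nltk import pos_tag
    # Tags are computed on the cleaned tokens, so treat them as rough hints
    df['pos'] = df['tokens'].apply(pos_tag)
    display(df[['response_id','pos']].head(4))  # display() is needed inside a try block
except Exception as e:
    print('POS tagging unavailable:', e)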

In [None]:
try:
    from nltk.sentiment import SentimentIntensityAnalyzer
    sia = SentimentIntensityAnalyzer()
    # Score the raw text: VADER uses punctuation and capitalisation as cues
    df['sentiment'] = df['text'].apply(lambda x: sia.polarity_scores(x)['compound'])
    sns.histplot(df['sentiment'])
    plt.title('Sentiment (VADER) — exploratory only')
    plt.show()
except Exception as e:
    print('VADER unavailable:', e)

## 📤 Export a coding-ready table
We create a simple structure that supports **manual coding** (e.g., in Excel/Sheets or in 6.3).

In [None]:
out = df[['response_id','text','tokens']].copy()
out['initial_code'] = ''  # analyst will fill codes
out['notes'] = ''         # memo/comments
out_path = 'qual_coding_sheet.csv'
out.to_csv(out_path, index=False)
print('Wrote:', out_path)
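One thing to watch when the sheet is read back in: a CSV has no list type, so the `tokens` column is stored as the text representation of each list. The sketch below reproduces that round trip with the standard library only (the sample row is hypothetical) and recovers the list with `ast.literal_eval`.

```python
import ast
import csv
import io

# Hypothetical row in the same shape as the exported sheet
rows = [{'response_id': 1, 'text': 'I love fresh fruit.',
         'tokens': ['love', 'fresh', 'fruit']}]

# Write to an in-memory CSV; the list is stringified on the way out
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['response_id', 'text', 'tokens'])
writer.writeheader()
writer.writerows(rows)

# Read it back: tokens is now a plain string, not a list
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_tokens = loaded[0]['tokens']
parsed_tokens = ast.literal_eval(raw_tokens)  # safely parse the list literal

print(type(raw_tokens).__name__, '->', type(parsed_tokens).__name__)
```

With pandas, passing `converters={'tokens': ast.literal_eval}` to `pd.read_csv` should apply the same parsing while loading, assuming the column has not been edited by hand in the spreadsheet.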

## 🧩 Exercises
1) **Stoplist tuning**: add domain-specific stopwords (e.g., *like, really, very*)—how do top words change?
2) **Phrase mining**: examine bigrams containing *fruit* or *carrot*; collect example quotes.
3) **Coding sheet**: add 2–4 provisional **initial codes** per 10 responses (keep them short & action-oriented).
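
As a possible starting point for exercise 2, the sketch below pulls the original quotes for every response whose cleaned tokens include a given term. The helper `quotes_with_term` and the sample records are hypothetical; the term should be given in its lemmatised form (*carrot*, not *carrots*) because the tokens were lemmatised.

```python
def quotes_with_term(records, term):
    """Return (response_id, text) pairs whose cleaned tokens include `term`."""
    return [(rid, text) for rid, text, tokens in records if term in tokens]

# Hypothetical rows in the same shape as df[['response_id', 'text', 'tokens']]
sample_records = [
    (1, 'I love fresh fruit in the morning.', ['love', 'fresh', 'fruit', 'morning']),
    (2, 'Crunchy carrots are the best snack.', ['crunchy', 'carrot', 'best', 'snack']),
    (3, 'Fruit and carrots, nothing else.', ['fruit', 'carrot', 'nothing', 'else']),
]

fruit_quotes = quotes_with_term(sample_records, 'fruit')
for rid, text in fruit_quotes:
    print(rid, text)
```

In the notebook, the same helper can be fed with `df[['response_id', 'text', 'tokens']].itertuples(index=False)`.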

## ✅ Conclusion
You prepared text for analysis and produced a coding-ready table. Next: formal **coding & thematic analysis** with reliability checks (6.3).

<details><summary>More</summary>
- NLTK docs (tokenisation, POS, stopwords)
- Practical theming workflows
</details>