<img src="../Images/DSC_Logo.png" style="width: 400px;">

# Data Analysis in Python

This notebook provides an introduction to basic qualitative data analysis in Python, building on the earlier notebooks on syntax and data handling. It demonstrates how to process and analyze text data, including searching, filtering, and counting occurrences within text. Parts of this notebook were inspired by or adapted from the Python Humanities textbook, available at [python-textbook.pythonhumanities.com](https://python-textbook.pythonhumanities.com).

## 1. Plot Data with `pandas`

`pandas` makes it easy to create quick and simple visualizations directly from a DataFrame using the `.plot()` method. This is especially useful for exploring your data before doing more complex analyses or creating polished plots with other libraries such as `matplotlib`. 

Here are three example plots created using `pandas`:

In [None]:
!pip install pandas
import pandas as pd

In [None]:
# The original Titanic dataset includes only 'male' and 'female' as gender categories.
# For demonstration purposes, we'll change the first 3 entries to 'diverse'. This also shows how indexing and assignment work in pandas.
Titanic = pd.read_csv("../Data/Titanic-Dataset.csv")
Titanic.loc[0:2, 'Sex'] = 'diverse'  # Change rows with index 0, 1, and 2

1. A bar chart showing the number of passengers in each gender category:

In [None]:
Titanic['Sex'].value_counts().plot.bar(title="Passengers by Sex")  # Bar plot

2. A pie chart showing the proportion of passengers in each travel class:

In [None]:
Titanic['Pclass'].value_counts().plot.pie() # Pie plot

3. A scatter plot to explore the relationship between the age and fare of passengers, colored by travel class.

In [None]:
Titanic.plot.scatter(x="Age", y="Fare", c="Pclass", cmap="viridis") # Scatter plot

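Now, let’s try this again by first converting `Pclass` into a `pandas` categorical type, so that the travel class is treated as a set of discrete categories rather than as a continuous number: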

In [None]:
Titanic['Pclass'] = Titanic.Pclass.astype('category')
Titanic.plot.scatter(x="Age", y="Fare", c="Pclass", cmap="viridis")

## 2. Manual Text Analysis and Pattern Matching with `re` (Regex)

Before using complex tools, we often start with manual coding - spotting phrases, concepts, or features by hand or with simple string rules. Such rules can be automated using Python’s built-in regular expressions module, `re`.

For example, **to extract dates:**

In [None]:
import re
text = "The meeting happened on 12 March, 5 April, and 23 December."
matches = re.findall(r"\b\d{1,2} [A-Z][a-z]+", text)
print(matches)

- `\b` is a word boundary; it ensures the match starts at the beginning of a word
- `\d{1,2}` matches a day number of 1 or 2 digits
- `[A-Z][a-z]+` matches any capitalized word, so it accepts every month name but does not check that the word is actually a month

Limitation: It is not realistically possible to match all date formats with a single regular expression because dates can appear in a huge variety of formats.
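One common way to live with this limitation is to keep one pattern per format you expect. The following is a minimal sketch under that assumption; the sample text, the `patterns` dictionary, and its keys are made up for illustration and are by no means a complete list of date formats.

```python
import re

text = "Deadlines: 12 March, 2024-04-05, and 23/12/2024."

# Hypothetical selection of formats: one regular expression per date style
patterns = {
    "day_month": r"\b\d{1,2} [A-Z][a-z]+",      # e.g. 12 March
    "iso": r"\b\d{4}-\d{2}-\d{2}\b",            # e.g. 2024-04-05
    "slashes": r"\b\d{1,2}/\d{1,2}/\d{4}\b",    # e.g. 23/12/2024
}

# Apply every pattern to the same text and collect the matches per format
found = {name: re.findall(pattern, text) for name, pattern in patterns.items()}
print(found)
```

Each pattern stays short and readable, and a new format can be supported by adding one more entry.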

### **Exercise 1:** 

What happens if you introduce a typo into one of the month names?

**To extract emails:**

In [None]:
text = "Please contact us at info@example.com or support@service.org for assistance."
emails = re.findall(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", text)
print(emails)

- `[...]` defines a character class
- `a-zA-Z` allows uppercase and lowercase letters
- `0-9` allows numbers
- `._%+-` allows typical symbols used in emails
- `+` means one or more characters from the preceding character class
- `@` a literal match of the @ character
- `\.` matches a literal dot before the top-level domain (TLD)
- `[a-zA-Z]{2,}` restricts the TLD to letters; `{2,}` means "at least 2 characters" (e.g., com, org, de, info)
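If you also want the parts of each address, the same pattern can be wrapped in capture groups. This is a small sketch, not a full email validator; the variable names `email_pattern`, `pairs`, and `domains` are chosen for illustration only.

```python
import re

text = "Please contact us at info@example.com or support@service.org for assistance."

# Same pattern as above, with parentheses around the user part and the domain part
email_pattern = re.compile(r"([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})")

# With two groups, findall returns one (user, domain) tuple per match
pairs = email_pattern.findall(text)
print(pairs)    # [('info', 'example.com'), ('support', 'service.org')]

# Keep only the domains, e.g. to count which providers occur in a corpus
domains = [domain for _, domain in pairs]
print(domains)  # ['example.com', 'service.org']
```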

## 3. Very Brief Introduction to Natural Language Processing (NLP)

Natural Language Processing (NLP) helps us extract structure and meaning from unstructured text, like interviews, news articles, or historical records. It powers things like keyword tagging, topic detection, and identifying people or places.

## 3.1 Tokenization and Word Frequency

This is how we can **tokenize text** with basic Python:

In [None]:
from collections import Counter
text = "I love learning Python. Python is great for data analysis and Python is also fun!"
tokens = re.findall(r"\b\w+\b", text.lower())
print(tokens)

- `\b\w+\b` matches each word
- `text.lower()` normalizes casing
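Because `\w` does not match apostrophes or hyphens, this simple pattern splits contractions and hyphenated words. The sketch below shows the effect on a made-up sentence and one possible alternative pattern; the alternative is an illustration, not a general-purpose tokenizer.

```python
import re

text = "Don't panic: state-of-the-art tools aren't magic."

# The simple pattern breaks words at apostrophes and hyphens
simple_tokens = re.findall(r"\b\w+\b", text.lower())
print(simple_tokens)
# ['don', 't', 'panic', 'state', 'of', 'the', 'art', 'tools', 'aren', 't', 'magic']

# Alternative: allow an apostrophe or hyphen between two runs of word characters
joined_tokens = re.findall(r"\b\w+(?:['-]\w+)*\b", text.lower())
print(joined_tokens)
# ["don't", 'panic', 'state-of-the-art', 'tools', "aren't", 'magic']
```

Which behavior is preferable depends on the research question, so it is worth inspecting the tokens before counting them.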

We can also **count top words**:

In [None]:
word_freq = Counter(tokens)
print(word_freq.most_common(3))

To **exclude stopwords** (such as "the", "of", "and"):

In [None]:
!pip install wordcloud
from wordcloud import STOPWORDS

In [None]:
filtered_tokens = [word for word in tokens if word not in STOPWORDS]
word_freq = Counter(filtered_tokens)
print(word_freq.most_common(3))
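A stopword list is just a collection of words, so you can also define your own, for example to add domain-specific terms. The sketch below uses a small hand-written set, `custom_stopwords`, which is an assumption for illustration and far shorter than the list shipped with `wordcloud`.

```python
import re
from collections import Counter

text = "I love learning Python. Python is great for data analysis and Python is also fun!"
tokens = re.findall(r"\b\w+\b", text.lower())

# Hypothetical, hand-written stopword set (a set makes the membership test fast)
custom_stopwords = {"i", "is", "for", "and", "also"}

# Keep only the tokens that are not stopwords, then count them
filtered = [token for token in tokens if token not in custom_stopwords]
freq = Counter(filtered)
print(freq.most_common(1))  # [('python', 3)]
```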

### **Exercise 2:** 

Count the most frequent words in a Wikipedia article of your choice, excluding stopwords. See *Section 2.3 in Notebook 02* for how to extract and clean text from a Wikipedia article using `BeautifulSoup` and `requests`. After extracting the text, tokenize it into lowercase words, remove stopwords, and display the top 10 most frequent words.

For more advanced text processing, we can use the `spaCy` library. It handles punctuation and stopwords, and even finds the **base form** of each word (called its lemma). Here, we use the headlines data again to explore how `spaCy` processes real-world text.

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm # a small English language model

In [None]:
import random
from collections import Counter

import spacy

# Load spaCy's English language model
nlp = spacy.load("en_core_web_sm")

# Read the text file with the headlines
file_path = "../Data/all_headlines.txt"

with open(file_path, "r", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]  # Clean & skip empty lines

# Select up to 1000 random lines (headlines)
random_lines = random.sample(lines, min(1000, len(lines)))
print(random_lines[:10])

# Join lines into a single string
text = " ".join(random_lines)

# Process the text
doc = nlp(text.lower())

# Extract lemmas, removing stopwords and punctuation
lemmas = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]

# Count the most frequent lemmas
word_freq = Counter(lemmas)
print(word_freq.most_common(10))

## 3.2 Word Cloud Visualization

A word cloud is a fun and simple way to visualize word frequency: The more often a word appears, the bigger it shows up in the cloud. Let's build a word cloud from the headlines:

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Generate word cloud
wordcloud = WordCloud(width=800, height=400).generate(text)

# Display it
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

## 3.3 Sentiment Analysis with `TextBlob`

`TextBlob` gives a quick and simple way to check the emotional tone of a sentence - whether it's positive, negative, or neutral. `TextBlob` analyzes the text and returns:

- Polarity: -1 (negative) → +1 (positive)
- Subjectivity: 0 (objective) → 1 (subjective)

This is based on a built-in dictionary of words with assigned sentiment scores.

In [None]:
!pip install textblob
from textblob import TextBlob

In [None]:
text = "Python is amazing. But sometimes debugging makes me sad."
blob = TextBlob(text)

print(blob.sentiment)
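To make the idea of a dictionary-based score more concrete, here is a deliberately simplified sketch. It is not how `TextBlob` works internally: the lexicon `TOY_LEXICON`, its scores, and the function `toy_polarity` are invented for illustration, and real sentiment lexicons are much larger and handle things like negation and intensifiers.

```python
import re

# Hypothetical toy lexicon with invented scores between -1 and +1
TOY_LEXICON = {"amazing": 0.8, "great": 0.6, "fun": 0.4, "sad": -0.5, "terrible": -0.9}

def toy_polarity(text):
    """Average the scores of all words found in the toy lexicon (0.0 if none match)."""
    words = re.findall(r"\b\w+\b", text.lower())
    scores = [TOY_LEXICON[word] for word in words if word in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

# "amazing" (+0.8) and "sad" (-0.5) partly cancel out, giving a mildly positive score
print(round(toy_polarity("Python is amazing. But sometimes debugging makes me sad."), 2))

# No lexicon word occurs, so the sentence counts as neutral
print(toy_polarity("The table is brown."))
```

The sketch also shows a typical weakness of this approach: words that are missing from the dictionary contribute nothing to the score.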

### **Exercise 3:** 

Try sentences or paragraphs with or without a clear emotional tone - negative or positive - and check the output.

## 4. Outlook: What's Next After the Basics?

Once you’ve mastered data handling and simple analysis with tools like `re` and `spaCy`, you’re well on your way to **more advanced text methods**. In fact, with qualitative data in Python, you very quickly enter the territory of **machine learning (ML)**.

Why? Because qualitative data - everything that’s not numbers - must be turned into a structure that code can process. Programming is built around structured data and numeric operations, so we often need to **transform language into numbers**. 
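One of the simplest ways to turn text into numbers is a bag-of-words representation, in which each document becomes a vector of word counts. The following is a minimal sketch using only the standard library; the three example documents and the helper `tokenize` are made up, and libraries such as `scikit-learn` offer more complete implementations.

```python
import re
from collections import Counter

docs = [
    "Python is fun",
    "Data analysis is fun with Python",
    "Regex can be tricky",
]

def tokenize(text):
    """Split a text into lowercase word tokens."""
    return re.findall(r"\b\w+\b", text.lower())

# The vocabulary is the sorted list of all distinct words across the documents
vocab = sorted({token for doc in docs for token in tokenize(doc)})

# Each document becomes one vector: how often does each vocabulary word occur in it?
vectors = []
for doc in docs:
    counts = Counter(tokenize(doc))
    vectors.append([counts[word] for word in vocab])

print(vocab)
for vector in vectors:
    print(vector)
```

Once documents are vectors of equal length, they can be compared, clustered, or passed to a machine learning model.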

**General-purpose libraries** like `scikit-learn` and `transformers` provide broad frameworks for many types of machine learning tasks, including but not limited to NLP. In addition, various **specialized libraries within NLP and qualitative research** exist, for example `whisper` for audio-to-text transcription or `BookNLP` for structural/narrative analysis of long texts (e.g., novels).

This opens the door to **various powerful techniques and use-cases:**

- *Named Entity Recognition (NER) & Entity-Level Structuring:* Detect and categorize names of people, places, organizations, etc.
- *Topic Discovery & Interpretation:* Identify recurring themes across documents using unsupervised methods such as Latent Dirichlet Allocation (LDA) or Non-negative Matrix Factorization (NMF).
- *Audio-to-Text Transcription:* Convert spoken interviews or discussions into text.
- *Text Classification:* Assign predefined labels (e.g., topic, urgency) using rule-based, classical ML (logistic regression, SVM), or neural approaches.
- *Semantic Text Comparison & Embeddings:* Compare meaning across texts using vector-based representations.
- *Keyword Extraction & Summarization:* Extract key phrases or generate concise summaries.
- *Dimensionality Reduction & Text Clustering:* Explore or visualize hidden patterns in text embeddings.
- *Stylometry & Authorship Analysis:* Analyze writing style to attribute or profile authorship.
- *Narrative & Discourse Structure Analysis:* Analyze how information is organized.
- *Network Analysis from Text:* Visualize relationships between entities or concepts as social or co-occurrence networks (e.g., speaker-topic links).
- *Multilingual & Cross-Lingual NLP:* Analyze or compare texts in multiple languages.
- *Long-Form Narrative Analysis:* Specialized tools for processing novels or long stories.
- *Structured & Metadata-Aware Text Processing:* Parse and analyze structured documents like XML or JSON.
- *Time-Based or Diachronic Text Analysis:* Track language or theme changes over time.
- ...

Note: When using large datasets together with advanced models (like transformer models), you'll often need more **computational resources**, such as a GPU or a computing cluster. For smaller datasets and simpler models (e.g., with scikit-learn), standard hardware is usually enough.

**For more**, check out the [Python Humanities Textbook](https://python-textbook.pythonhumanities.com), which covers some of the techniques and use cases mentioned above from a humanities and qualitative research perspective. See also Notebook 00 for more support.