# Session 9: Exploring Unstructured Data (Text)

**Unit 1: Introduction to Data Science**
**Hour: 9**
**Mode: Practical Lab**

---

### 1. Objective

This lab introduces the basics of handling unstructured data, focusing on plain text. We will perform the most fundamental task in text analysis: reading text and counting word frequencies.

**What is Unstructured Data?** Data that does not follow a predefined data model or fixed schema (such as rows and columns), like the text in a book or an email.

### 2. Setup

We will use `Counter`, a class from Python's built-in `collections` module that counts how many times each item appears in a list (or any other iterable of hashable items). We will also use the built-in `re` module for regular expressions to help us clean the text.

In [None]:
import re
from collections import Counter
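
Before applying `Counter` to text, it may help to see it on a tiny list. The sketch below uses a made-up list named `fruits` (an illustrative assumption, not part of the lab data) to show that a `Counter` behaves like a dictionary mapping each item to its count.

```python
from collections import Counter

# Hypothetical sample list, used only for illustration.
fruits = ["apple", "banana", "apple", "cherry", "banana", "apple"]

# Counter tallies how many times each distinct item appears.
fruit_counts = Counter(fruits)

print(fruit_counts)            # counts for every distinct item
print(fruit_counts["apple"])   # look up one item's count like a dict
```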

### 3. The Text Data

Let's create a simple text file to work with.

In [None]:
text_data = """Data science is a field of study. 
This field combines domain expertise, programming skills, and knowledge of mathematics and statistics. 
Data scientists use this field to extract knowledge and insights from data."""

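# Write the string to disk; an explicit encoding keeps behavior the same across platforms.
with open('sample.txt', 'w', encoding='utf-8') as f:
    f.write(text_data)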

### 4. Basic Text Processing Workflow

We will follow a simple four-step process: read the file, normalize and clean the text, tokenize it, and count word frequencies.

#### 4.1. Step 1: Read the File

We use Python's built-in file handling to open the file and read its contents into a single string variable.

In [None]:
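# Read the whole file into one string; the explicit encoding avoids platform-dependent defaults.
with open('sample.txt', 'r', encoding='utf-8') as f:
    text = f.read()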

print("Original Text:")
print(text)

#### 4.2. Step 2: Normalize and Clean

To ensure that "Data" and "data" are treated as the same word, we convert the entire string to lowercase. We will also remove punctuation using a regular expression. `re.sub(r'[^\w\s]', '', text)` finds every character that is NOT a word character (`\w`: letters, digits, and the underscore) or whitespace (`\s`) and replaces it with an empty string. Note that this also removes apostrophes and hyphens inside words, so "don't" becomes "dont".

In [None]:
text_lower = text.lower()
text_clean = re.sub(r'[^\w\s]', '', text_lower)

print("Cleaned Text:")
print(text_clean)
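
To see what this pattern keeps and what it drops, here is a minimal sketch on a made-up sentence. The variable names `sample` and `cleaned` are illustrative assumptions, chosen so they do not overwrite the lab's variables.

```python
import re

# Hypothetical sample sentence, not part of the lab's text file.
sample = "Don't split state-of-the-art models; keep snake_case, too!"

# Lowercase first, then delete every character that is not \w or \s.
cleaned = re.sub(r'[^\w\s]', '', sample.lower())

print(cleaned)
# dont split stateoftheart models keep snake_case too
```

Apostrophes and hyphens are deleted, which merges "state-of-the-art" into one token, while the underscore survives because `\w` includes it. For this lab's sample text that side effect does not matter, but it is worth keeping in mind for messier text.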

#### 4.3. Step 3: Tokenization (Splitting into Words)

**Tokenization** is breaking text into smaller pieces, or "tokens". We'll call `.split()` with no arguments, which splits the cleaned string on any run of whitespace (spaces, tabs, and newlines) and gives us a list of words.

In [None]:
words = text_clean.split()
print(words)
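
Because our text spans several lines, the choice between `.split()` and `.split(' ')` matters. The short sketch below uses a made-up string (`sample` is an illustrative name) to show the difference.

```python
# Hypothetical sample string containing a double space and a newline.
sample = "data  science\nis fun"

# No argument: split on any run of whitespace and drop empty strings.
tokens_default = sample.split()

# Explicit ' ': split on every single space only; newlines are left alone.
tokens_space = sample.split(' ')

print(tokens_default)  # ['data', 'science', 'is', 'fun']
print(tokens_space)    # ['data', '', 'science\nis', 'fun']
```

The second result contains an empty string and a token with an embedded newline, which would distort the word counts in the next step.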

#### 4.4. Step 4: Count Word Frequencies

Now we use the `Counter` object to count the occurrences of each unique word in our list.

In [None]:
word_counts = Counter(words)
print(word_counts)
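
A `Counter` can be queried like a dictionary. The sketch below uses a small made-up word list (`demo_words` and `demo_counts` are illustrative names, kept separate from the lab's variables) to show a few lookups that are often useful.

```python
from collections import Counter

# Hypothetical token list, used only for illustration.
demo_words = ["data", "science", "data", "field"]
demo_counts = Counter(demo_words)

print(demo_counts["data"])        # 2
print(demo_counts["python"])      # 0: a missing word returns 0 instead of raising KeyError
print(len(demo_counts))           # 3 distinct words
print(sum(demo_counts.values()))  # 4 words in total
```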

The `Counter` object has a helpful method, `.most_common(n)`, which returns the `n` most frequent words as a list of `(word, count)` pairs, ordered from most to least frequent.

In [None]:
# Get the 5 most common words
print(word_counts.most_common(5))
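
The top of a frequency list is often taken up by very common function words such as "and" or "of". One possible refinement, sketched below, is to drop such words before counting. The `stop_words` set here is a tiny hand-picked assumption for illustration, not a standard list, and `demo_words` is a made-up token list.

```python
from collections import Counter

# Hypothetical tokens and a small hand-picked stop-word set (assumptions).
demo_words = ["data", "and", "field", "and", "data", "of", "field", "data"]
stop_words = {"a", "and", "is", "of", "this", "to"}

# Keep only the tokens that are not in the stop-word set.
content_words = [w for w in demo_words if w not in stop_words]
content_counts = Counter(content_words)

print(content_counts.most_common(2))  # [('data', 3), ('field', 2)]
```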

### 5. Conclusion

While this was a simple example, you have just performed the foundational steps of almost any text analysis project:
1.  Read raw text data.
2.  Normalize and clean the text (lowercase, remove punctuation).
3.  Tokenize the text into words.
4.  Calculate frequencies to find the most frequent terms.
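
The normalize, tokenize, and count steps listed above can be sketched as a single reusable function. The name `count_words` and the sample sentence are illustrative assumptions; the body only combines the calls already used in this lab.

```python
import re
from collections import Counter


def count_words(text):
    """Lowercase the text, strip punctuation, split on whitespace, and count words."""
    text_clean = re.sub(r'[^\w\s]', '', text.lower())
    return Counter(text_clean.split())


# Hypothetical input sentence, used only for illustration.
demo_counts = count_words("Data science is a field. This field uses data.")

print(demo_counts.most_common(2))  # [('data', 2), ('field', 2)]
```

To run it on the lab's file, you could pass in the string returned by `f.read()`.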

This process is the starting point for more advanced topics such as sentiment analysis and topic modeling.