# Getting Started with Bible Text Analysis

## Welcome! 👋

This notebook teaches the basics of data analysis using the King James Version (KJV) of the Bible as our text dataset. Don't worry if you're new to data science - we'll go step by step!

### What You'll Learn:
1. How to load and read text data
2. Basic text analysis (word counts, frequencies)
3. Data visualization (charts and graphs)
4. Simple statistical analysis

### Let's Get Started! 🚀

## Step 1: Import Libraries

Libraries are pre-written code that helps us do complex tasks easily. Think of them as tools in a toolbox.

In [None]:
# pandas: for working with data tables
import pandas as pd

# matplotlib & seaborn: for creating charts and graphs
import matplotlib.pyplot as plt
import seaborn as sns

# For text analysis
from collections import Counter
import re

# Make graphs look nice
# (the 'seaborn-v0_8' style name needs matplotlib 3.6 or newer)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("✅ Libraries loaded successfully!")
print("Ready to analyze the Bible text! 📖")

## Step 2: Load the Bible Text

Let's read the Bible text file that's already in your data folder.

In [None]:
# Read the Bible text file
with open('../data/raw/kjv_bible.txt', 'r', encoding='utf-8') as file:
    bible_text = file.read()

# Let's see how much text we have
print(f"Total characters: {len(bible_text):,}")
print("\nFirst 500 characters:")
print(bible_text[:500])
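
If the file path is wrong, Python stops with a long error message that can be confusing at first. The sketch below wraps the loading step in a small helper that reports a clearer message. The name `load_text` and the temporary `sample.txt` file are made up for this illustration; they are not part of the notebook's data folder.

```python
from pathlib import Path
import tempfile


def load_text(path):
    """Return the file's contents as a string, or raise a clearer error."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Could not find {path}. Check the path relative to this notebook."
        )
    return path.read_text(encoding="utf-8")


# Demonstrate with a small temporary file instead of the real dataset
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text(
        "In the beginning God created the heaven and the earth.",
        encoding="utf-8",
    )
    sample_text = load_text(sample_path)

print(f"Total characters: {len(sample_text):,}")
print(sample_text[:16])
```

To try it on the real file, you could call `load_text('../data/raw/kjv_bible.txt')` in place of the `open(...)` block above.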

## Step 3: Basic Text Analysis

### Let's count words!

We'll split the text into individual words and count them.

In [None]:
# Convert to lowercase and split into words
words = bible_text.lower().split()

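# Remove everything except letters, then drop pieces that end up empty
# (for example, pieces that were only digits or punctuation)
stripped = (re.sub(r'[^a-z]', '', word) for word in words)
words_clean = [word for word in stripped if word]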

print(f"Total words in the Bible: {len(words_clean):,}")
print(f"\nFirst 20 words: {words_clean[:20]}")
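
To see what the cleaning step does, it helps to run it on one short verse. The sketch below compares the split-then-strip approach from the cell above with an alternative that pulls out runs of letters using `re.findall`. The helper names `split_and_strip` and `find_letter_runs` are made up for this illustration.

```python
import re


def split_and_strip(text):
    """Split on whitespace, then remove non-letters from each piece."""
    pieces = (re.sub(r"[^a-z]", "", w) for w in text.lower().split())
    return [p for p in pieces if p]


def find_letter_runs(text):
    """Alternative: collect every unbroken run of letters."""
    return re.findall(r"[a-z]+", text.lower())


verse = "And God said, Let there be light: and there was light."
print(split_and_strip(verse))
print(find_letter_runs(verse))

# The two approaches disagree on apostrophes and hyphens
tricky = "the LORD's well-beloved"
print(split_and_strip(tricky))   # joins the parts: 'lords', 'wellbeloved'
print(find_letter_runs(tricky))  # splits the parts: 'lord', 's', 'well', 'beloved'
```

Neither choice is "right"; it just changes what counts as a word, so it is worth knowing which one you are using when you compare counts.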

## Step 4: Find the Most Common Words

Let's discover which words appear most frequently in the Bible!

In [None]:
# Count word frequencies
word_counts = Counter(words_clean)

# Get the 20 most common words
most_common = word_counts.most_common(20)

print("Top 20 Most Frequent Words in the Bible:\n")
for rank, (word, count) in enumerate(most_common, 1):
    print(f"{rank:2d}. '{word}' appears {count:,} times")
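
Raw counts are hard to compare between texts of different sizes, so it can help to also show each word as a share of all words. Here is a minimal sketch on a tiny made-up word list (`sample_words` is invented for this example, not taken from the dataset).

```python
from collections import Counter

# A tiny made-up word list so the numbers are easy to check by hand
sample_words = ["the", "lord", "is", "my", "shepherd", "the", "lord", "the"]

counts = Counter(sample_words)
total = sum(counts.values())  # total number of words counted

# most_common(n) returns (word, count) pairs, highest count first
for rank, (word, count) in enumerate(counts.most_common(2), 1):
    share = count / total
    print(f"{rank}. '{word}' appears {count} times ({share:.1%} of all words)")
```

The same loop should work with `word_counts` from the cell above if you swap it in for `counts`.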

## Step 5: Create Your First Visualization! 📊

Let's make a bar chart showing the most common words.

In [None]:
# Get top 15 words (excluding very common words like 'the', 'and', etc.)
# These are called "stop words"
stop_words = {'the', 'and', 'of', 'to', 'a', 'in', 'that', 'he', 'is', 'for', 'it', 'with', 'as', 'his', 'i', 'was', 'be', 'they', 'him'}
meaningful_words = [(word, count) for word, count in word_counts.most_common(50) if word not in stop_words][:15]

# Prepare data for plotting
words_list = [word for word, count in meaningful_words]
counts_list = [count for word, count in meaningful_words]

# Create the bar chart
plt.figure(figsize=(12, 6))
plt.bar(words_list, counts_list, color='steelblue')
plt.xlabel('Words', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Top 15 Most Frequent Meaningful Words in the Bible', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

print("📊 Your first data visualization is complete!")

## Step 6: Analyze Word Lengths

How long are the words in the Bible on average?

In [None]:
# Calculate word lengths
word_lengths = [len(word) for word in words_clean]

# Calculate statistics
average_length = sum(word_lengths) / len(word_lengths)
shortest_length = min(word_lengths)
longest_length = max(word_lengths)

print("Word Length Statistics:")
print(f"  Average word length: {average_length:.2f} letters")
print(f"  Shortest word: {shortest_length} letter(s)")
print(f"  Longest word: {longest_length} letters")

# Find the actual longest words (sorted so the output is the same every run)
longest_words = sorted({word for word in words_clean if len(word) == longest_length})
print(f"\nExamples of longest words ({longest_length} letters): {longest_words[:5]}")
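
The average is only one way to summarize word lengths. Python's built-in `statistics` module also gives the median (the middle value) and the mode (the most common value). This is a small sketch on the words of a single verse, so the results are easy to verify by hand.

```python
import statistics

# Words of Genesis 1:1, already lowercased and cleaned
sample_words = ["in", "the", "beginning", "god", "created",
                "the", "heaven", "and", "the", "earth"]
lengths = [len(word) for word in sample_words]

mean_length = statistics.mean(lengths)      # sum of lengths / number of words
median_length = statistics.median(lengths)  # middle value when sorted
mode_length = statistics.mode(lengths)      # most frequent length

print(f"Mean:   {mean_length:.2f} letters")
print(f"Median: {median_length} letters")
print(f"Mode:   {mode_length} letters")
```

When a few very long words pull the mean upward, the median and mode often give a better feel for a "typical" word.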

## Step 7: Word Length Distribution

Let's visualize how word lengths are distributed.

In [None]:
# Create histogram
plt.figure(figsize=(12, 6))
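# Bin edges run to max + 1 so the longest words get their own bar;
# align='left' centers each bar on its word length
plt.hist(word_lengths, bins=range(1, max(word_lengths) + 2), align='left', color='coral', edgecolor='black', alpha=0.7)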
plt.xlabel('Word Length (letters)', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Word Lengths in the Bible', fontsize=14, fontweight='bold')
plt.axvline(average_length, color='red', linestyle='--', linewidth=2, label=f'Average: {average_length:.1f}')
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
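
A quick way to double-check a histogram is to count each word length directly and print the totals as text. The sketch below does this with `Counter` on one verse; the `#` bars are just a rough text version of the chart.

```python
from collections import Counter

# Words of Genesis 1:1, already lowercased and cleaned
sample_words = ["in", "the", "beginning", "god", "created",
                "the", "heaven", "and", "the", "earth"]

# Map each word length to how many words have that length
length_counts = Counter(len(word) for word in sample_words)

# Walk through every length from 1 up to the longest, including empty ones
for length in range(1, max(length_counts) + 1):
    count = length_counts[length]  # Counter gives 0 for lengths never seen
    print(f"{length:2d} | {'#' * count} {count}")
```

If the printed totals and the bar heights in the chart disagree, the bin settings are the first thing to check.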

## Step 8: Unique Words (Vocabulary)

How many different words are used in the Bible?

In [None]:
# Count unique words
unique_words = set(words_clean)
total_words = len(words_clean)
vocabulary_size = len(unique_words)

print("Vocabulary Analysis:")
print(f"  Total words: {total_words:,}")
print(f"  Unique words (vocabulary): {vocabulary_size:,}")
print(f"  Vocabulary diversity (unique / total): {(vocabulary_size/total_words)*100:.2f}%")
print(f"\nThis means the whole text is built from just {vocabulary_size:,} different words!")
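
The unique-to-total ratio is often called the type-token ratio. One thing to keep in mind is that it tends to fall as a text gets longer, because common words keep repeating. The sketch below shows this on a short passage; `type_token_ratio` is a helper name made up for this example.

```python
def type_token_ratio(words):
    """Unique words divided by total words (0.0 for an empty list)."""
    if not words:
        return 0.0
    return len(set(words)) / len(words)


# Psalm 23:1-2, already lowercased with punctuation removed
passage = ("the lord is my shepherd i shall not want he maketh me to lie down "
           "in green pastures he leadeth me beside the still waters")
sample_words = passage.split()

first_ten = sample_words[:10]
print(f"First 10 words: {type_token_ratio(first_ten):.2f}")
print(f"All {len(sample_words)} words:   {type_token_ratio(sample_words):.2f}")
```

Because of this effect, the ratio is most useful for comparing texts (or sections) of similar length.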

## Step 9: Search for Specific Words

Let's search for how many times specific words appear.

In [None]:
# Search for specific words
# (keep these lowercase: the text was lowercased in Step 3)
search_words = ['love', 'faith', 'hope', 'peace', 'joy', 'grace', 'mercy', 'god', 'lord', 'jesus']

print("Word Frequency Analysis:\n")
word_freq_data = []
for word in search_words:
    count = word_counts.get(word, 0)
    word_freq_data.append({'Word': word.capitalize(), 'Frequency': count})
    print(f"  '{word.capitalize()}' appears {count:,} times")

# Create a dataframe (data table)
df = pd.DataFrame(word_freq_data)
print("\n📊 Data Table:")
print(df)
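
Looking words up in a `Counter` counts whole words only, which is different from searching the raw text. The sketch below shows the difference on one verse: a substring search for "love" also matches "beloved" and "loveth".

```python
import re
from collections import Counter

# 1 John 4:7
verse = ("Beloved, let us love one another: for love is of God; "
         "and every one that loveth is born of God, and knoweth God.")
text = verse.lower()

# Substring search: matches 'love' anywhere, even inside longer words
substring_count = text.count("love")

# Whole-word count: split into words first, then look the word up
counts = Counter(re.findall(r"[a-z]+", text))
whole_word_count = counts["love"]

print(f"Substring matches for 'love': {substring_count}")
print(f"Whole-word matches for 'love': {whole_word_count}")
```

If you want related forms such as "loveth" or "loved" included, one option is to add them to the search list explicitly and sum their counts.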

## Step 10: Visualize Your Search Results

Let's create a nice chart for these important words!

In [None]:
# Create horizontal bar chart
plt.figure(figsize=(10, 6))
df_sorted = df.sort_values('Frequency', ascending=True)
plt.barh(df_sorted['Word'], df_sorted['Frequency'], color='mediumseagreen')
plt.xlabel('Frequency', fontsize=12)
plt.ylabel('Word', fontsize=12)
plt.title('Frequency of Important Words in the Bible', fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

## 🎉 Congratulations!

You've completed your first data analysis! Here's what you learned:

✅ **Loading data** from files  
✅ **Basic text processing** (splitting, cleaning)  
✅ **Counting and statistics** (word frequencies, averages)  
✅ **Data visualization** (bar charts, histograms)  
✅ **Using pandas** to organize data  

### Next Steps:
1. Try modifying the code above
2. Search for different words
3. Create your own visualizations
4. Move on to `01_data_acquisition.ipynb` for more advanced analysis

### Practice Ideas:
- Search for your favorite words
- Change the colors of the charts
- Try analyzing different sections of the Bible
- Count how many times specific names appear

**Remember**: Data analysis is all about exploring and asking questions! 🔍