<p align="center">
    <img src="JHU.png" width="200" alt="Johns Hopkins University logo">
</p>

## Mining Patterns in Alice in Wonderland & Building a Neural Network on MNIST Dataset

### Overview:

Estimated time needed: **60** minutes

This hands-on lab covers two topics:
1. **Apriori analysis** of the novel *Alice in Wonderland* by Lewis Carroll. You will load the text with the **nltk** library, preprocess it into transactions, and mine frequent word patterns, such as "Mock Turtle" or "White Rabbit," using the Apriori algorithm from the Python library **mlxtend**. You will also export the results to a file that other tools, such as Weka, can read.
2. **Neural Network Implementation** using the MNIST dataset to recognize handwritten digits. You will implement and evaluate a neural network using the class **NeuralNetMLP** with two hidden layers and assess its performance.

## Part 1: Apriori Analysis of Alice in Wonderland

### Dataset:
The dataset for this task is the *Alice in Wonderland* novel available in the NLTK Gutenberg corpus. You will treat each sentence as a transaction and the words within it as items.

### Step 1: Loading and Preprocessing the Novel Text

In this step, we load the raw text of *Alice in Wonderland* using the NLTK library's Gutenberg corpus. The text is then manually split into individual sentences using regular expressions. Each sentence is subsequently tokenized into words, converting them to lowercase and removing punctuation. This preprocessing allows us to represent the text as transactions, where each transaction corresponds to a sentence represented by its constituent words. The first few transactions are printed to verify the results.

In [None]:
# Import the necessary libraries
import re
from nltk.corpus import gutenberg
import pandas as pd

# Download Gutenberg corpus
import nltk
nltk.download('gutenberg')

In [None]:
# Load the raw text of the novel
alice_text = gutenberg.raw('carroll-alice.txt')

# Manually split the text into sentences using regular expressions
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', alice_text)

# Tokenize each sentence into words
# Write your code here!

# Print the first few transactions
# Write your code here!


<details><summary>Click here for the solution</summary>
 
```python
# Tokenize each sentence into words
transaction_sentences = [re.findall(r'\b\w+\b', sentence.lower()) for sentence in sentences]

# Print the first few transactions
print(transaction_sentences[:5])

```
 
</details>

**Explanation**:
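- **Regular Expressions**: The regex pattern splits the text on whitespace that follows a period or a question mark. Two negative lookbehinds prevent splits after abbreviations such as "e.g." and titles such as "Mr.". This is a heuristic rather than a full sentence segmenter: sentences ending in `!` are not split, so some transactions span more than one sentence.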
- **Tokenization**: The `re.findall` function extracts word tokens from each sentence while converting them to lowercase, preparing the data for further analysis.
- **Transaction Representation**: The resulting list, `transaction_sentences`, serves as a foundation for further steps in the analysis, such as finding frequent itemsets or generating association rules.
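
To see what these two steps produce, here is a minimal sketch on a short made-up passage. The `sample` string is only an illustration and is not taken from the corpus; the two patterns are the same ones used in the lab code. Note how the sentence ending in `!` stays attached to the one that follows it.

```python
import re

# Hypothetical sample text, used only for illustration
sample = "Alice was tired. Was the White Rabbit late? Oh dear! He ran on."

# Same sentence-splitting pattern as in the lab
split_pattern = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s'
sample_sentences = re.split(split_pattern, sample)

# Same tokenization as in the solution: lowercase, word characters only
sample_transactions = [re.findall(r'\b\w+\b', s.lower()) for s in sample_sentences]

for sentence, tokens in zip(sample_sentences, sample_transactions):
    print(repr(sentence), '->', tokens)
```

Each printed line pairs one sentence with the transaction built from it.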

### Step 2: Prepare Data for the Apriori Algorithm

To apply the **Apriori algorithm**, we need to convert the tokenized sentences into a one-hot encoded table with one row per transaction and one column per item. We'll use the **TransactionEncoder** from **mlxtend** to do this.

In [None]:
# Import the necessary libraries
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd

# Use TransactionEncoder to prepare data for Apriori
# Write your code here!
te = TransactionEncoder()


# Convert the encoded data into a DataFrame
# Write your code here!

# Show a sample of the dataframe
# Write your code here!

<details><summary>Click here for the solution</summary>
 
```python
# Use TransactionEncoder to prepare data for Apriori
te = TransactionEncoder()
te_ary = te.fit(transaction_sentences).transform(transaction_sentences)

# Convert the encoded data into a DataFrame
df_apriori = pd.DataFrame(te_ary, columns=te.columns_)

# Show a sample of the dataframe
print(df_apriori.head())
```
 
</details>

**Explanation:**
- **TransactionEncoder**: This encodes the tokenized sentences into a binary matrix where each row represents a sentence (transaction), and each column represents a unique word (item).
- **df_apriori**: The resulting DataFrame has sentences as rows and unique words as columns. The value in each cell is either `True` (if the word is present in the sentence) or `False` (if it is not).
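
The encoding itself is simple to reproduce by hand. The sketch below is a plain-Python illustration of the same idea on three made-up transactions. It is not mlxtend's implementation, and the names `toy_transactions`, `items`, and `encoded` are hypothetical.

```python
# Three made-up transactions (tokenized sentences), for illustration only
toy_transactions = [
    ['the', 'white', 'rabbit'],
    ['the', 'mock', 'turtle'],
    ['alice', 'and', 'the', 'rabbit'],
]

# Columns: every unique item, here in sorted order
items = sorted({word for transaction in toy_transactions for word in transaction})

# Rows: one boolean per column, True when the item occurs in the transaction
encoded = [[item in transaction for item in items] for transaction in toy_transactions]

print(items)
for row in encoded:
    print(row)
```

With the full novel the table is much wider, since every distinct word becomes its own column.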

### Step 3: Apply the Apriori Algorithm

Next, we will apply the **Apriori algorithm** using the **mlxtend** library to discover frequent word patterns.

In [None]:
# Import the necessary libraries
from mlxtend.frequent_patterns import apriori, association_rules

# Apply the Apriori algorithm with a minimum support threshold
# Write your code here!

# Display the frequent itemsets
# Write your code here!

<details><summary>Click here for the solution</summary>
 
```python
# Apply the Apriori algorithm with a minimum support threshold
frequent_itemsets = apriori(df_apriori, min_support=0.1, use_colnames=True)

# Display the frequent itemsets
print(frequent_itemsets.head())
```
 
</details>

**Explanation:**
- **min_support=0.1**: This sets the minimum support to 10% of the transactions (sentences). You can adjust this value based on how frequent you want the patterns to be.
- **frequent_itemsets**: This DataFrame contains the frequent word patterns and their support values.
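
To make the role of `min_support` concrete, here is a minimal level-wise sketch of the Apriori idea on a handful of made-up transactions. It is a teaching sketch under simplifying assumptions, not mlxtend's implementation; `toy_apriori` and the sample data are hypothetical. The key step is that a candidate of size k is only counted if every one of its subsets of size k-1 was already frequent.

```python
from itertools import combinations

def toy_apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items
    singles = {frozenset([item]) for t in transactions for item in t}
    current = {s: support(s) for s in singles if support(s) >= min_support}
    frequent = dict(current)

    k = 1
    while current:
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c: support(c) for c in candidates if support(c) >= min_support}
        frequent.update(current)
    return frequent

# Made-up transactions, for illustration only
toy = [
    ['white', 'rabbit', 'ran'],
    ['white', 'rabbit', 'late'],
    ['mock', 'turtle', 'sang'],
    ['white', 'rabbit', 'mock', 'turtle'],
]
result = toy_apriori(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Raising `min_support` shrinks the result; lowering it admits rarer itemsets at the cost of more candidates to count.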

### Step 4: Generate Association Rules

After finding the frequent itemsets, we generate association rules to identify interesting word combinations or patterns.

In [None]:
# Generate association rules from the frequent itemsets
# Write your code here!

# Display the first few rules
# Write your code here!


<details><summary>Click here for the solution</summary>
 
```python
# Generate association rules from the frequent itemsets
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)

# Display the first few rules
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head())
```
 
</details>

**Explanation:**
- **association_rules()**: This function generates rules from the frequent itemsets. With `metric="lift"` and `min_threshold=1`, only rules whose lift is at least 1 are kept.
- **antecedents** and **consequents**: These represent the "if-then" rules in the data. For example, if word A is present (antecedent), then word B (consequent) is likely to occur as well.
- **support**, **confidence**, and **lift**: These are key metrics used to evaluate the quality of the rules. Higher values of confidence and lift indicate stronger relationships between words.
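
The three metrics can be computed directly from counts. The following sketch does so for a single rule on made-up transactions; `rule_metrics` and the sample data are hypothetical and only meant to show the arithmetic.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Compute support, confidence and lift for the rule antecedent -> consequent.

    Assumes the antecedent and the consequent each occur in at least one transaction.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    sets = [set(t) for t in transactions]

    count_a = sum(antecedent <= t for t in sets)
    count_c = sum(consequent <= t for t in sets)
    count_both = sum((antecedent | consequent) <= t for t in sets)

    support = count_both / n             # P(antecedent and consequent)
    confidence = count_both / count_a    # P(consequent | antecedent)
    lift = confidence / (count_c / n)    # confidence relative to P(consequent)
    return support, confidence, lift

# Made-up transactions, for illustration only
toy = [
    ['white', 'rabbit', 'ran'],
    ['white', 'rabbit', 'late'],
    ['mock', 'turtle', 'sang'],
    ['white', 'rabbit', 'mock', 'turtle'],
    ['alice', 'ran'],
]
support, confidence, lift = rule_metrics(toy, ['white'], ['rabbit'])
print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

Here "rabbit" appears in every sentence that contains "white", so the confidence is 1, and the lift exceeds 1 because "rabbit" appears in only some of the sentences overall.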


### Step 5: Report Interesting Patterns

Once you have the rules generated by the Apriori algorithm, you can report interesting word patterns. Based on *Alice in Wonderland*, some common patterns might include famous character names or phrases.

In [None]:
# Display interesting word patterns (association rules with high confidence and lift)
# Write your code here!


<details><summary>Click here for the solution</summary>
 
```python
interesting_rules = rules[(rules['confidence'] > 0.6) & (rules['lift'] > 1.2)]
print(interesting_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
```
 
</details>

The output of the previous cell lists the **association rules** generated by the **Apriori algorithm** that have high confidence and lift values. Let’s break down what each column in the result means and how to interpret it.

### Breakdown of Columns:

1. **Antecedents**:
   - These are the items (words or word combinations) that appear in the left-hand side (LHS) of the association rule.
   - For example, in the first row, **(a)** is the antecedent. This means the rule is looking at sentences that contain the word **"a"**.

2. **Consequents**:
   - These are the items (words or word combinations) that appear in the right-hand side (RHS) of the association rule.
   - In the first row, the consequent is **(and)**, meaning that if the word **"a"** appears in a sentence, then the word **"and"** is likely to appear as well.

3. **Support**:
   - This indicates the proportion of sentences in the dataset that contain both the antecedent and the consequent.
   - For example, in the first row, the support is **0.307214**, meaning that around **30.72%** of the sentences contain both the word **"a"** and the word **"and"**.

4. **Confidence**:
   - This represents how often the consequent appears in sentences that contain the antecedent. In other words, it’s the likelihood that if the antecedent is present, the consequent will also be present.
   - In the first row, the confidence is **0.688022**, meaning that when the word **"a"** appears, there's a **68.80%** chance that the word **"and"** will also appear in the same sentence.

5. **Lift**:
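   - Lift measures how much more likely the consequent is to appear when the antecedent is present, compared to the consequent's overall frequency in the text. It is computed as the rule's confidence divided by the support of the consequent.
   - A **lift** value greater than **1** indicates a positive association between the antecedent and consequent, a value of exactly **1** means the two occur independently, and the further above **1** the value is, the stronger the association.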
   - In the first row, the lift is **1.298521**, meaning that the word **"and"** is about **1.30 times** more likely to appear in a sentence when the word **"a"** is present compared to its general occurrence in the text.
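
Because confidence is the joint support divided by the antecedent's support, and lift is confidence divided by the consequent's support, the three numbers quoted for the first rule are enough to recover how often each word appears on its own. A small sketch of that arithmetic, using only the values quoted above:

```python
# Values quoted above for the rule (a) -> (and)
support_both = 0.307214   # share of sentences containing both "a" and "and"
confidence = 0.688022     # share of "a" sentences that also contain "and"
lift = 1.298521           # confidence divided by the support of "and"

# confidence = support_both / support(a)  =>  support(a) = support_both / confidence
support_a = support_both / confidence

# lift = confidence / support(and)  =>  support(and) = confidence / lift
support_and = confidence / lift

print(f"support(a)   = {support_a:.4f}")
print(f"support(and) = {support_and:.4f}")
```

So "a" occurs in roughly 45% of the sentences and "and" in roughly 53%, which is why even a confidence of about 69% corresponds to only a modest lift.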

### Interpreting the Results:

- **Row 1**: If the word **"a"** appears in a sentence, there’s a **68.80%** chance that the word **"and"** will also appear in that sentence. The lift of **1.30** means **"and"** is about **30%** more likely to appear in sentences containing **"a"** than in sentences overall.

- **Row 2**: If **"as"** is present in a sentence, there's a **61.14%** chance that **"a"** will also be present. The lift of **1.37** means **"a"** is **37%** more likely to appear in sentences containing **"as"** than in sentences overall.

- **Row 6936**: Here, a multi-word pattern is involved. If the words **"the, a, to, she"** are all present in a sentence, there’s a **69.83%** chance that the words **"and, it"** will both be present as well. The lift of **2.49** means the consequent is almost **2.5 times** more likely to appear in these sentences than in sentences overall.

- **Row 6946**: When the words **"she, a, it"** appear in a sentence, there’s a **73.64%** chance that the words **"the, and, to"** will also appear. The lift of **2.15** means the consequent appears at just over twice its baseline rate.

### Key Insights:
- The rules in the results with high **confidence** and **lift** show strong associations between words or word combinations.
- Words like **"a"**, **"the"**, **"and"**, **"she"**, and **"it"** are commonly associated with each other, forming strong patterns.
- **Multi-word patterns** (antecedents and consequents with multiple words) also provide insights into frequent phrases or structures in the text. For example, combinations like **"the, a, to, she"** and **"and, it"** show frequent co-occurrence of these common words in sentences.

### How to Use This Information:
- These rules help identify frequently occurring word pairs or patterns in *Alice in Wonderland*. For instance, discovering that **"a"** and **"and"** frequently occur together could provide insight into sentence structure or commonly used phrases in the novel.
- Rules with a high lift value highlight particularly strong associations. For example, the rule **"she, a, it" → "the, and, to"** is notable for having a strong co-occurrence in sentences.
- This kind of analysis can be used to detect stylistic patterns or recurring themes in a literary text.

### Step 6: Exporting and Loading Association Rules

In this step, we export the DataFrame containing the interesting association rules to a CSV file named **`alice_association_rules.csv`** for future use. We then read the file back into a new DataFrame and print it to confirm that the rules were saved correctly.

In [None]:
# Import the necessary libraries
import pandas as pd

# Export the rules DataFrame to a CSV file
# Write your code here!

# Load the CSV file
# Write your code here!

# Print the loaded rules to confirm
# Write your code here!

<details><summary>Click here for the solution</summary>
 
```python
# Export the rules DataFrame to a CSV file
interesting_rules.to_csv('alice_association_rules.csv', index=False)

# Load the CSV file
loaded_rules = pd.read_csv('alice_association_rules.csv')

# Print the loaded rules to confirm
print(loaded_rules)
```
 
</details>
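
One thing to check after the round trip: mlxtend stores `antecedents` and `consequents` as `frozenset` objects, and a CSV file can only hold text, so they come back as strings. The sketch below shows the effect with Python's built-in `csv` module on one made-up rule, along with one possible workaround (joining the items into a plain string before export). The sample rule and the helper `items_to_text` are hypothetical.

```python
import csv
import io

def items_to_text(itemset):
    """Turn a frozenset of words into a stable, readable string such as 'a, it, she'."""
    return ', '.join(sorted(itemset))

# One made-up rule, for illustration only
antecedents = frozenset({'she', 'a', 'it'})
confidence = 0.7

buffer = io.StringIO()          # stands in for a file on disk
writer = csv.writer(buffer)
writer.writerow(['antecedents_raw', 'antecedents_text', 'confidence'])
writer.writerow([antecedents, items_to_text(antecedents), confidence])

# Read the single data row back
buffer.seek(0)
row = next(csv.DictReader(buffer))

print(type(row['antecedents_raw']).__name__, row['antecedents_raw'])
print(row['antecedents_text'].split(', '))
print(float(row['confidence']))
```

The raw column is now text that merely looks like a `frozenset`, while the joined column can be split back into individual words.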

## Part 2: Neural Network on MNIST Dataset

### Dataset:
The **MNIST dataset** is a widely-used dataset for digit classification tasks. It contains 60,000 training images and 10,000 test images of handwritten digits (0–9). Each image is 28x28 pixels in grayscale.

### Step 1: Importing necessary libraries

In [None]:
# Install and import the necessary libraries
!pip install optree
!pip install tensorflow
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

### Step 2: Load the MNIST Dataset
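Use the **tensorflow.keras** library to load the MNIST dataset. Then flatten each 28x28 image into a 784-element vector, scale the pixel values to the range [0, 1], and one-hot encode the labels.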

In [None]:
# Load MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Preprocess data: Normalize the images and one-hot encode the labels
X_train = X_train.reshape(X_train.shape[0], 28*28).astype('float32') / 255
X_test = X_test.reshape(X_test.shape[0], 28*28).astype('float32') / 255
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

<details><summary>Click here for the solution</summary>
 
```python
# Preprocess data: Normalize the images and one-hot encode the labels
X_train = X_train.reshape(X_train.shape[0], 28*28).astype('float32') / 255
X_test = X_test.reshape(X_test.shape[0], 28*28).astype('float32') / 255
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
```
 
</details>
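
To see what the preprocessing does to a single example, here is a dependency-free sketch on a made-up 2x2 "image" and three labels. It mirrors the three operations above (flatten, scale to [0, 1], one-hot encode) in plain Python. The helper names are hypothetical; the lab code itself uses NumPy arrays and `to_categorical`.

```python
def flatten_and_scale(image, max_value=255):
    """Flatten a 2D list of pixel intensities and scale each value to [0, 1]."""
    return [pixel / max_value for row in image for pixel in row]

def one_hot(label, num_classes=10):
    """Return a list with 1.0 at the label's index and 0.0 elsewhere."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

tiny_image = [[0, 255],
              [51, 102]]          # made-up 2x2 grayscale image
tiny_labels = [3, 0, 9]           # made-up digit labels

flat = flatten_and_scale(tiny_image)
encoded_labels = [one_hot(label) for label in tiny_labels]

print(flat)
for label, vector in zip(tiny_labels, encoded_labels):
    print(label, vector)
```

Scaling keeps all inputs in a small, comparable range, and one-hot labels match the 10-unit softmax output that `categorical_crossentropy` expects.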

### Step 3: Define Neural Network (NeuralNetMLP)
Here we create a simple neural network model with two hidden layers.

In [None]:
# Import the necessary libraries (optree was already installed in Step 1)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Define the NeuralNetMLP model
# Write your code here!
model = Sequential()


# Compile the model
# Write your code here!


<details><summary>Click here for the solution</summary>
 
```python
# Define the NeuralNetMLP model
model = Sequential()
model.add(Dense(128, input_shape=(28*28,), activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```
 
</details>
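
A quick way to sanity-check the architecture is to count its trainable parameters by hand. Each `Dense` layer has one weight per input-output pair plus one bias per output. The sketch below does that arithmetic for the layer sizes used in the solution; the total should match the figure reported by `model.summary()`.

```python
# Layer sizes from the solution: 784 inputs -> 128 -> 64 -> 10 outputs
layer_sizes = [28 * 28, 128, 64, 10]

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    params = n_in * n_out + n_out   # weights + biases
    print(f"{n_in:>4} -> {n_out:<4} {params:>7} parameters")
    total += params

print("Total:", total)  # 109386
```

Almost all of the parameters sit in the first layer, because it connects every one of the 784 input pixels to each of the 128 hidden units.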

### Step 4: Train and Evaluate the Model
Train the model on the MNIST dataset and report its accuracy.

In [None]:
# Train the model
# Write your code here!

# Evaluate the model
# Write your code here!

<details><summary>Click here for the solution</summary>
 
```python
# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))

# Evaluate the model
score = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {score[1]*100:.2f}%")
```
 
</details>
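
The accuracy reported by `model.evaluate` is the fraction of test images whose highest-probability class matches the true label. The sketch below reproduces that calculation, along with the softmax that turns raw scores into probabilities, on made-up numbers in plain Python. The scores, labels, and helper names are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def accuracy(probabilities, one_hot_labels):
    """Fraction of rows where the predicted class equals the true class."""
    correct = 0
    for probs, label in zip(probabilities, one_hot_labels):
        predicted = probs.index(max(probs))
        actual = label.index(max(label))
        correct += predicted == actual
    return correct / len(one_hot_labels)

# Made-up raw scores for 4 samples over 3 classes, with one-hot labels
raw_scores = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 3.0], [1.0, 0.9, 0.2]]
labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]

probabilities = [softmax(row) for row in raw_scores]
toy_accuracy = accuracy(probabilities, labels)
print(f"Accuracy: {toy_accuracy * 100:.2f}%")
```

Three of the four made-up samples are classified correctly; the last one narrowly favors the wrong class.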

### Summary:
- In Part 1, you explored the Apriori algorithm to find interesting word patterns in *Alice in Wonderland*.
- In Part 2, you implemented a neural network with two hidden layers for digit classification using the MNIST dataset and evaluated its accuracy on the test set.