# Using Jupyter Notebooks
:label:`sec_jupyter`


This section describes how to edit and run the code
in each section of this book
using the Jupyter Notebook. Make sure you have
installed Jupyter and downloaded the
code as described in
:ref:`chap_installation`.
If you want to know more about Jupyter see the excellent tutorial in
their [documentation](https://jupyter.readthedocs.io/en/latest/).


## Editing and Running the Code Locally

Suppose that the local path of the book's code is `xx/yy/d2l-en/`. Use the shell to change the directory to this path (`cd xx/yy/d2l-en`) and run the command `jupyter notebook`. If your browser does not do this automatically, open http://localhost:8888 and you will see the interface of Jupyter and all the folders containing the code of the book, as shown in :numref:`fig_jupyter00`.

![The folders containing the code of this book.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter00.png?raw=1)
:width:`600px`
:label:`fig_jupyter00`


You can access the notebook files by clicking on the folder displayed on the webpage.
They usually have the suffix ".ipynb".
For the sake of brevity, we create a temporary "test.ipynb" file.
The content displayed after you click it is
shown in :numref:`fig_jupyter01`.
This notebook includes a markdown cell and a code cell. The content in the markdown cell includes "This Is a Title" and "This is text.".
The code cell contains two lines of Python code.

![Markdown and code cells in the "text.ipynb" file.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter01.png?raw=1)
:width:`600px`
:label:`fig_jupyter01`


Double click on the markdown cell to enter edit mode.
Add a new text string "Hello world." at the end of the cell, as shown in :numref:`fig_jupyter02`.

![Edit the markdown cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter02.png?raw=1)
:width:`600px`
:label:`fig_jupyter02`


As demonstrated in :numref:`fig_jupyter03`,
click "Cell" $\rightarrow$ "Run Cells" in the menu bar to run the edited cell.

![Run the cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter03.png?raw=1)
:width:`600px`
:label:`fig_jupyter03`

After running, the markdown cell is shown in :numref:`fig_jupyter04`.

![The markdown cell after running.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter04.png?raw=1)
:width:`600px`
:label:`fig_jupyter04`


Next, click on the code cell. Multiply the elements by 2 after the last line of code, as shown in :numref:`fig_jupyter05`.

![Edit the code cell.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter05.png?raw=1)
:width:`600px`
:label:`fig_jupyter05`


You can also run the cell with a shortcut ("Ctrl + Enter" by default) and obtain the output result from :numref:`fig_jupyter06`.

![Run the code cell to obtain the output.](https://github.com/d2l-ai/d2l-en-colab/blob/master/img/jupyter06.png?raw=1)
:width:`600px`
:label:`fig_jupyter06`


When a notebook contains more cells, we can click "Kernel" $\rightarrow$ "Restart & Run All" in the menu bar to run all the cells in the entire notebook. By clicking "Help" $\rightarrow$ "Edit Keyboard Shortcuts" in the menu bar, you can edit the shortcuts according to your preferences.

## Advanced Options

Beyond local editing two things are quite important: editing the notebooks in the markdown format and running Jupyter remotely.
The latter matters when we want to run the code on a faster server.
The former matters since Jupyter's native ipynb format stores a lot of auxiliary data that is
irrelevant to the content,
mostly related to how and where the code is run.
This is confusing for Git, making
reviewing contributions very difficult.
Fortunately there is an alternative---native editing in the markdown format.

### Markdown Files in Jupyter

If you wish to contribute to the content of this book, you need to modify the
source file (md file, not ipynb file) on GitHub.
Using the notedown plugin we
can modify notebooks in the md format directly in Jupyter.


First, install the notedown plugin, run the Jupyter Notebook, and load the plugin:

```
pip install d2l-notedown  # You may need to uninstall the original notedown.
jupyter notebook --NotebookApp.contents_manager_class='notedown.NotedownContentsManager'
```

You may also turn on the notedown plugin by default whenever you run the Jupyter Notebook.
First, generate a Jupyter Notebook configuration file (if it has already been generated, you can skip this step).

```
jupyter notebook --generate-config
```

Then, add the following line to the end of the Jupyter Notebook configuration file (for Linux or macOS, usually in the path `~/.jupyter/jupyter_notebook_config.py`):

```
c.NotebookApp.contents_manager_class = 'notedown.NotedownContentsManager'
```

After that, you only need to run the `jupyter notebook` command to turn on the notedown plugin by default.

### Running Jupyter Notebooks on a Remote Server

Sometimes, you may want to run Jupyter notebooks on a remote server and access it through a browser on your local computer. If Linux or macOS is installed on your local machine (Windows can also support this function through third-party software such as PuTTY), you can use port forwarding:

```
ssh myserver -L 8888:localhost:8888
```

The above string `myserver` is the address of the remote server.
Then we can use http://localhost:8888 to access the remote server `myserver` that runs Jupyter notebooks. We will detail on how to run Jupyter notebooks on AWS instances
later in this appendix.

### Timing

We can use the `ExecuteTime` plugin to time the execution of each code cell in Jupyter notebooks.
Use the following commands to install the plugin:

```
pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --user
jupyter nbextension enable execute_time/ExecuteTime
```

## Summary

* Using the Jupyter Notebook tool, we can edit, run, and contribute to each section of the book.
* We can run Jupyter notebooks on remote servers using port forwarding.


## Exercises

1. Edit and run the code in this book with the Jupyter Notebook on your local machine.
1. Edit and run the code in this book with the Jupyter Notebook *remotely* via port forwarding.
1. Compare the running time of the operations $\mathbf{A}^\top \mathbf{B}$ and $\mathbf{A} \mathbf{B}$ for two square matrices in $\mathbb{R}^{1024 \times 1024}$. Which one is faster?


[Discussions](https://discuss.d2l.ai/t/421)


In [24]:
# Sentiment_RNN_Project.py
# Full pipeline: Data loading -> cleaning -> EDA -> features -> LSTM -> evaluation
# Run as a notebook (split cells) or script.

import os
import re
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_recall_fscore_support
from sklearn.preprocessing import LabelEncoder

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, GRU, Dense, Dropout, BatchNormalization, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from wordcloud import WordCloud

# ========== Configuration ==========
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

file_path = '/content/Mini_Project/Week 19 - Graded Mini Project - Dataset - Twitter-training.csv'
# If file has weird encoding, try: pd.read_csv(file_path, encoding='latin1', header=None)

# ========== 1. Load Dataset ==========
# Try to infer header; if not present, use header=None and then set names.
try:
    # Try reading with header first
    df = pd.read_csv(file_path)
    # If no error but columns are still numerical, it's likely no header was present
    if df.columns.tolist()[:4] == [0,1,2,3]:
         df = pd.read_csv(file_path, encoding='latin1', header=None) # Reload with no header
except Exception:
    # If initial read failed, try with encoding and no header
    df = pd.read_csv(file_path, encoding='latin1', header=None)

print("Raw shape:", df.shape)
display(df.head())

# If the CSV has no proper column names, set them (example common format: id, topic, sentiment, text)
# Based on the head output, columns seem to be: id, topic, sentiment, text
# Check if columns are numerical (indicating no header was detected)
if all(isinstance(col, int) for col in df.columns):
    # Assume the order is id, topic, sentiment, text based on the original attempt and common format
    new_column_names = ['id', 'topic', 'sentiment', 'text']
    # Only assign if we have at least these many columns
    if df.shape[1] >= len(new_column_names):
        df.columns = new_column_names + df.columns[len(new_column_names):].tolist()

# Now, select the relevant columns ('text' and 'sentiment')
# Ensure the columns exist after potential renaming and filter out numeric sentiment values
if 'text' in df.columns and 'sentiment' in df.columns:
    # Filter out rows where 'sentiment' is not one of the expected string labels
    expected_sentiments = ['Positive', 'Negative', 'Neutral', 'Irrelevant']
    df = df[df['sentiment'].isin(expected_sentiments)].copy()
    df = df[['text','sentiment']].dropna(subset=['text']).reset_index(drop=True)


    # If text is split in multiple columns because of commas after renaming, merge remaining cols
    # This part might be tricky if commas were within the original text, but let's try to handle it
    # If the text column now contains lists or tuples due to the earlier merge attempt, flatten it
    if df['text'].apply(type).nunique() > 1 or (df['text'].apply(type).iloc[0] != str and df['text'].apply(type).iloc[0] != np.object_):
         # Assuming original issue was text split across numerical columns 3 onwards
         if all(isinstance(col, int) for col in df.columns[:4]): # Check if initial load was header=None
             # Reload specifically handling this case by joining columns from index 3 onwards
             df = pd.read_csv(file_path, encoding='latin1', header=None)
             if df.shape[1] >= 4:
                  df.columns = ['id', 'topic', 'sentiment', 'text'] + [f'extra_{i}' for i in range(4, df.shape[1])]
                  # Filter again after potential reload - IMPORTANT: Apply filtering of numeric sentiment here too
                  expected_sentiments = ['Positive', 'Negative', 'Neutral', 'Irrelevant']
                  df = df[df['sentiment'].isin(expected_sentiments)].copy()

                  text_cols_to_merge = [col for col in df.columns if col.startswith('text') or col.startswith('extra_')]
                  # Ensure text_cols_to_merge are in df.columns before selection
                  text_cols_to_merge = [col for col in text_cols_to_merge if col in df.columns]
                  if text_cols_to_merge: # Only merge if there are columns to merge
                    df['text'] = df[text_cols_to_merge].astype(str).agg(' '.join, axis=1)
                  else: # If no extra text columns, just use the designated 'text' column
                    df['text'] = df['text'].astype(str)

                  df = df[['sentiment', 'text']].dropna(subset=['text']).reset_index(drop=True)
         else:
             # Fallback if merging text failed and columns weren't purely numerical initially - IMPORTANT: Apply filtering of numeric sentiment here too
              expected_sentiments = ['Positive', 'Negative', 'Neutral', 'Irrelevant']
              df = df[df['sentiment'].isin(expected_sentiments)].copy()
              df = df[['text','sentiment']].dropna(subset=['text']).reset_index(drop=True) # Revert to simple selection


    df['text'] = df['text'].astype(str).str.strip()
    print("After basic column normalization and sentiment filtering:", df.shape)
    display(df.head())
else:
    print("Error: 'text' or 'sentiment' columns not found after loading and renaming.")
    # Create an empty DataFrame or exit if essential columns are missing
    df = pd.DataFrame(columns=['text', 'sentiment'])
    print("Created empty DataFrame due to missing essential columns.")


# Remove sentiment classes with less than 2 samples for stratification
if 'sentiment' in df.columns:
    initial_shape = df.shape
    sentiment_counts = df['sentiment'].value_counts()
    classes_to_remove = sentiment_counts[sentiment_counts < 2].index.tolist()
    if classes_to_remove:
        print(f"Removing sentiment classes with less than 2 samples: {classes_to_remove}")
        df = df[~df['sentiment'].isin(classes_to_remove)].copy()
        print(f"Shape after removing small classes: {df.shape}")
    else:
        print("No sentiment classes with less than 2 samples found.")
else:
    print("Skipping removal of small sentiment classes as 'sentiment' column is missing.")


# ========== 2. Data Cleaning & Preprocessing ==========
# Download NLTK resources
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
nltk.download('punkt_tab', quiet=True) # Add this line to download 'punkt_tab'

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_tweet(text):
    """
    - Remove URLs, mentions, hashtags (or keep the word), emojis/special chars.
    - Normalize: lowercase, remove extra whitespace.
    - Tokenize & remove stopwords.
    - Lemmatize.
    """
    # Ensure text is a string
    if not isinstance(text, str):
        return "" # Return empty string for non-string inputs

    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Remove mentions (@user)
    text = re.sub(r'@\w+', '', text)
    # Remove hashtags symbol only (keep the tag word)
    text = re.sub(r'#', '', text)
    # Remove HTML entities
    text = re.sub(r'&\w+;', '', text)
    # Keep letters & common punctuation, remove others (including many emojis). If you want emojis preserved, adapt.
    text = re.sub(r'[^A-Za-z0-9\s\.,!?\'`]', ' ', text)
    # Lowercase
    text = text.lower()

    # Check if text is empty after cleaning before tokenization
    if not text.strip():
        return ""

    # Tokenize simple
    tokens = nltk.word_tokenize(text)

    # Add a check for empty tokens after tokenization and before processing
    tokens = [t for t in tokens if t and not t.isspace()]

    # Remove stopwords and short tokens
    tokens = [t for t in tokens if t not in stop_words and len(t) > 1]
    # Lemmatize
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return ' '.join(tokens)

# Drop duplicates and missing
if 'text' in df.columns and 'sentiment' in df.columns:
    df = df.drop_duplicates().reset_index(drop=True)
    print("After dropping duplicates:", df.shape)
    # Apply cleaning (this may take a while on large datasets)
    df['clean_text'] = df['text'].apply(clean_tweet)
    display(df[['text','clean_text']].head())
else:
    print("Skipping data cleaning and preprocessing due to missing columns.")
    # Create empty clean_text column if 'text' was not found
    df['clean_text'] = "" # Ensure clean_text column exists


# Optional: If you prefer stemming instead of lemmatization:
# from nltk.stem.porter import PorterStemmer
# stemmer = PorterStemmer()
# tokens = [stemmer.stem(t) for t in tokens]

# ========== 3. Feature Engineering ==========
if 'clean_text' in df.columns and 'sentiment' in df.columns and not df.empty:
    # A) TF-IDF features (for traditional ML or baseline)
    tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1,2))
    X_tfidf = tfidf.fit_transform(df['clean_text'])

    # B) Tokenizer + padded sequences for RNN
    MAX_VOCAB = 15000
    MAX_LEN = 80  # tune this

    tokenizer = Tokenizer(num_words=MAX_VOCAB, oov_token='<OOV>')
    tokenizer.fit_on_texts(df['clean_text'])
    sequences = tokenizer.texts_to_sequences(df['clean_text'])
    X_seq = pad_sequences(sequences, maxlen=MAX_LEN, padding='post', truncating='post')

    # Label encoding
    le = LabelEncoder()
    # Inspect unique values before encoding
    print("Unique values in 'sentiment' column:", df['sentiment'].unique())
    # Ensure sentiment column is string type before fitting LabelEncoder
    y = le.fit_transform(df['sentiment'].astype(str))
    print("Classes:", le.classes_)

    # ========== 5. Model Building (LSTM) ==========
    if X_seq is not None and y is not None and le is not None and X_seq.shape[0] > 0:
        # Split
        # Ensure y has enough unique values for stratification if num_classes > 1
        if len(np.unique(y)) > 1:
            # Check unique values and counts in y right before splitting
            unique_y, counts_y = np.unique(y, return_counts=True)
            print("Unique values and counts in y before splitting:", list(zip(unique_y, counts_y)))

            X_train, X_test, y_train, y_test, seq_train, seq_test = train_test_split(
                X_tfidf, y, X_seq, test_size=0.2, random_state=RANDOM_SEED, stratify=y) # Added stratify=y back as the issue might be resolved by filtering

        else:
             print("Only one class found. Skipping stratification.")
             X_train, X_test, y_train, y_test, seq_train, seq_test = train_test_split(
                X_tfidf, y, X_seq, test_size=0.2, random_state=RANDOM_SEED)


        # For RNN we use seq_train/seq_test
        # Define a simple Embedding + BiLSTM model
        vocab_size = min(MAX_VOCAB, len(tokenizer.word_index) + 1)
        embedding_dim = 100

        def build_lstm_model(vocab_size=vocab_size, embedding_dim=embedding_dim, input_length=MAX_LEN, lstm_units=128, dropout_rate=0.5, num_classes=None):
            if num_classes is None:
                num_classes = len(np.unique(y))
            model = Sequential()
            model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=input_length))
            model.add(Bidirectional(LSTM(lstm_units, return_sequences=False)))
            model.add(BatchNormalization())
            model.add(Dropout(dropout_rate))
            if num_classes == 2:
                model.add(Dense(1, activation='sigmoid'))
                loss = 'binary_crossentropy'
            else:
                model.add(Dense(num_classes, activation='softmax'))
                loss = 'sparse_categorical_crossentropy'
            model.compile(optimizer='adam', loss=loss, metrics=['accuracy'])
            return model

        num_classes = len(np.unique(y))
        model = build_lstm_model(num_classes=num_classes)
        model.summary()

        # Callbacks
        es = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
        # Optionally save best model:
        checkpoint_path = 'best_lstm.h5'
        mc = ModelCheckpoint(checkpoint_path, monitor='val_loss', save_best_only=True)

        # Train
        BATCH_SIZE = 64
        EPOCHS = 10

        history = model.fit(
            seq_train, y_train,
            validation_split=0.1,
            epochs=EPOCHS,
            batch_size=BATCH_SIZE,
            callbacks=[es, mc],
            verbose=1
        )
    else:
        print("Skipping model building due to missing data, labels, or insufficient data for splitting.")
        model = None
        history = None

else:
    print("Skipping feature engineering and model building due to missing columns or empty DataFrame.")
    # Initialize empty variables to avoid errors later
    X_tfidf = None
    X_seq = None
    y = None
    le = None
    model = None
    history = None


# ========== 4. EDA ==========
if df is not None and 'sentiment' in df.columns and 'clean_text' in df.columns and not df.empty:
    # Basic stats
    print("Total tweets:", df.shape[0])
    print(df['sentiment'].value_counts())

    # Sentiment distribution plots
    plt.figure(figsize=(6,4))
    sns.countplot(x='sentiment', data=df, order=df['sentiment'].value_counts().index)
    plt.title('Sentiment Distribution')
    plt.xlabel('Sentiment')
    plt.ylabel('Count')
    plt.tight_layout()
    plt.show()

    # Pie chart
    plt.figure(figsize=(6,6))
    df['sentiment'].value_counts().plot.pie(autopct='%1.1f%%', startangle=90, ylabel='')
    plt.title('Sentiment Proportions')
    plt.show()

    # Word clouds for positive/negative (if those classes exist)
    for sentiment in df['sentiment'].unique():
        text_blob = ' '.join(df.loc[df['sentiment']==sentiment, 'clean_text'].values)
        if len(text_blob.strip()) == 0:
            print(f"No clean text found for sentiment: {sentiment}")
            continue
        wc = WordCloud(width=800, height=400, background_color='white').generate(text_blob)
        plt.figure(figsize=(10,4))
        plt.imshow(wc, interpolation='bilinear')
        plt.axis('off')
        plt.title(f'WordCloud for {sentiment}')
        plt.show()

    # Top keywords per sentiment using TF-IDF or simple frequency
    from collections import Counter
    def top_n_words(texts, n=20):
        cnt = Counter()
        for t in texts:
            cnt.update(t.split())
        return cnt.most_common(n)

    for s in df['sentiment'].unique():
        print("\nTop words for", s)
        texts_for_sentiment = df.loc[df['sentiment']==s, 'clean_text']
        if not texts_for_sentiment.empty:
             print(top_n_words(texts_for_sentiment, 15))
        else:
            print(f"No clean text found for sentiment: {s}")


    # Relationship between tweet length and sentiment
    df['clean_len'] = df['clean_text'].apply(lambda x: len(x.split()))
    plt.figure(figsize=(8,5))
    sns.boxplot(x='sentiment', y='clean_len', data=df)
    plt.title('Tweet Length (tokens) by Sentiment')
    plt.show()
else:
    print("Skipping EDA due to missing columns, empty DataFrame, or data.")


# ========== 6. Evaluation ==========
# Check if model, test data, and labels are available and valid before evaluating
if model is not None and X_test is not None and y_test is not None and le is not None and seq_test is not None and seq_test.shape[0] > 0:
    # Predict
    y_pred_prob = model.predict(seq_test)
    if num_classes == 2:
        y_pred = (y_pred_prob.flatten() > 0.5).astype(int)
    else:
        y_pred = np.argmax(y_pred_prob, axis=1)

    print("Accuracy:", accuracy_score(y_test, y_pred))
    # Ensure target_names match the classes present in y_test if subset was used
    try:
        print("Classification Report:")
        print(classification_report(y_test, y_pred, target_names=le.classes_))
    except ValueError as e:
        print(f"Could not generate classification report with original classes. Error: {e}")
        # Fallback report without specific target names if issue with label mapping
        print(classification_report(y_test, y_pred))


    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6,5))
    # Handle potential mismatch between unique values in y_test and le.classes_ for heatmap labels
    unique_y_test = np.unique(y_test)
    cm_xticklabels = [le.classes_[i] for i in unique_y_test] if len(unique_y_test) <= len(le.classes_) else unique_y_test
    cm_yticklabels = [le.classes_[i] for i in unique_y_test] if len(unique_y_test) <= len(le.classes_) else unique_y_test

    sns.heatmap(cm, annot=True, fmt='d', xticklabels=cm_xticklabels, yticklabels=cm_yticklabels, cmap='Blues')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.title('Confusion Matrix')
    plt.show()

    # Plot learning curves
    if history is not None and history.history:
        plt.figure(figsize=(12,4))
        plt.subplot(1,2,1)
        plt.plot(history.history.get('loss'), label='train_loss')
        plt.plot(history.history.get('val_loss'), label='val_loss')
        plt.legend(); plt.title('Loss')
        plt.subplot(1,2,2)
        plt.plot(history.history.get('accuracy'), label='train_acc')
        plt.plot(history.history.get('val_accuracy'), label='val_acc')
        plt.legend(); plt.title('Accuracy')
        plt.show()
    elif history is None:
        print("Skipping plotting learning curves as history is not available.")
    else:
        print("Skipping plotting learning curves as history object is empty.")

else:
    print("Skipping evaluation due to missing model, test data, labels, or insufficient test data points.")


# ========== 7. Model Improvement & Hyperparameter Tuning (outline) ==========
# - Use GridSearch on number of LSTM units, dropout, embedding dim (but heavy to run).
# - Use pretrained embeddings (GloVe): create embedding_matrix and use Embedding(..., weights=[embedding_matrix], trainable=False/True)
# - Try Transformers (e.g., HuggingFace BERT) for transfer learning (much better results, requires additional libraries).

# Example: quick grid-search skeleton (won't run here as-is due to Keras model compatibility with sklearn)
# You can implement KerasClassifier wrapper or manual loops over params:
# for units in [64,128]:
#     for drop in [0.2,0.5]:
#         model = build_lstm_model(lstm_units=units, dropout_rate=drop)
#         history = model.fit(...)

# ========== 8. Save artifacts ==========
if tokenizer is not None and tfidf is not None and model is not None and hasattr(tokenizer, 'word_index'):
    import pickle
    try:
        with open('tokenizer.pkl','wb') as f:
            pickle.dump(tokenizer, f)
        print("Tokenizer saved.")
    except Exception as e:
        print(f"Could not save tokenizer: {e}")

    try:
        with open('tfidf.pkl','wb') as f:
            pickle.dump(tfidf, f)
        print("TF-IDF vectorizer saved.")
    except Exception as e:
        print(f"Could not save TF-IDF vectorizer: {e}")

    try:
        model.save('final_lstm_model.h5')
        print("Keras model saved.")
    except Exception as e:
        print(f"Could not save Keras model: {e}")
else:
    print("Skipping saving artifacts due to missing components or invalid tokenizer.")


# ========== 9. Quick Inference function for live demo ==========
if tokenizer is not None and model is not None and le is not None and hasattr(tokenizer, 'word_index'):
    def predict_sentiment(text):
        # Ensure num_classes is correctly inferred or passed
        num_classes_in_model = model.output_shape[-1] if model.output_shape is not None else (len(le.classes_) if le is not None else 2)

        clean = clean_tweet(text)
        seq = tokenizer.texts_to_sequences([clean])
        pad = pad_sequences(seq, maxlen=MAX_LEN, padding='post')
        prob = model.predict(pad)
        if num_classes_in_model == 1 or (le is not None and len(le.classes_) == 2): # Handle binary explicitly
            # Assuming sigmoid output for binary
            label_idx = int(prob.flatten()[0] > 0.5)
            score = float(prob.flatten()[0])
            label = le.inverse_transform([label_idx])[0] if le is not None else ("Positive" if label_idx == 1 else "Negative")
        elif num_classes_in_model > 1:
            # Assuming softmax output for multi-class
            idx = np.argmax(prob, axis=1)[0]
            label = le.inverse_transform([idx])[0] if le is not None else f"Class {idx}"
            score = float(prob[0, idx])
        else:
             return "Could not predict", 0.0 # Fallback

        return label, score

    # Demo
    samples = [
        "I love this game, it's so fun!",
        "This update ruined everything. Worst patch ever.",
        "Meh, it's okay I guess."
    ]
    print("\n--- Inference Demo ---")
    for s in samples:
        # Check if all components for prediction are available
        if tokenizer and model and le and hasattr(tokenizer, 'word_index'):
            print(s, "->", predict_sentiment(s))
        else:
             print(f"Skipping demo for '{s}' due to missing inference components.")
else:
    print("Skipping inference demo due to missing components or invalid tokenizer.")


# ========== 10. Report & Slides ==========
# For the report: export figures generated above and create a short slide deck (e.g., using python-pptx) summarizing:
# - Dataset overview
# - EDA highlights (charts + wordclouds)
# - Model architecture screenshot/summary
# - Evaluation metrics
# - Demo predictions
# (Implementation of ppt generation is optional; can be done with python-pptx.)

Raw shape: (74681, 4)


Unnamed: 0,2401,Borderlands,Positive,"im getting on borderlands and i will murder you all ,"
0,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
1,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
2,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
3,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...
4,2401,Borderlands,Positive,im getting into borderlands and i can murder y...


Error: 'text' or 'sentiment' columns not found after loading and renaming.
Created empty DataFrame due to missing essential columns.
No sentiment classes with less than 2 samples found.
After dropping duplicates: (0, 2)


Unnamed: 0,text,clean_text


Skipping feature engineering due to missing columns or empty DataFrame.
Skipping EDA due to missing columns, empty DataFrame, or data.
Skipping model building due to missing data, labels, or insufficient data for splitting.
Skipping evaluation due to missing model, test data, labels, or insufficient test data points.


NameError: name 'tokenizer' is not defined