# **CRISP-DM Framework for Twitter Sentiment Analysis**

# Business Understanding

**Goal:** Define the problem, stakeholder, and value proposition in simple terms.

*   **Stakeholder:** **Product Manager at Apple** (Google would also work, but pick one company to keep the analysis focused).
*   **Real-World Problem:** The Product Manager needs a fast, scalable way to monitor public sentiment about their latest product (e.g., a new iPhone or iOS update) on Twitter. Reading thousands of tweets by hand does not scale.
*   **Project Value:** Build a proof-of-concept model that automatically classifies tweets as **Positive, Negative, or Neutral**. This allows the stakeholder to:
    *   Quickly identify and address negative feedback (customer service issues, bugs).
    *   Gauge positive reception for marketing campaigns.
    *   Track sentiment trends over time.

**Deliverable:** This will form the **Introduction** of your notebook and presentation.

# Data Understanding

**Goal:** Load, explore, and describe the data to show its suitability.

*   **Load Data:** Load the `twitter.csv` (or similar) from data.world.
*   **Initial Exploration:**
    *   Check shape (rows, columns).
    *   Check for missing values (especially in the `text` and `sentiment` columns).
    *   Identify the features (`text`) and the target (`sentiment`).
*   **Descriptive Statistics & EDA:**
    *   **Class Distribution:** Plot a bar chart of the sentiment labels. **Crucially, note the imbalance** (e.g., more neutral tweets). This is a key data limitation.
    *   **Text Statistics:** Calculate and discuss average tweet length, unique words, etc.
    *   **Sample Inspection:** Manually read a sample of tweets for each sentiment to build intuition.
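The exploration steps above could look roughly like the sketch below. It uses a tiny made-up DataFrame so that it runs on its own; the file name `twitter.csv` and the column names `text` and `sentiment` are taken from this outline and may differ in the actual download.

```python
import pandas as pd

# Toy stand-in for the real file. With the actual data you would load it
# with something like: df = pd.read_csv("twitter.csv")  (name assumed)
df = pd.DataFrame({
    "text": [
        "I love my new iPhone!",
        "The iOS update drained my battery",
        "Apple store opens at 9",
        None,
        "Picked up an iPad today",
    ],
    "sentiment": ["positive", "negative", "neutral", "neutral", "neutral"],
})

print(df.shape)  # (rows, columns)

# Missing values in the two columns that matter for modeling
missing = df[["text", "sentiment"]].isna().sum()
print(missing)

# Class distribution; class_counts.plot(kind="bar") would draw the bar chart
class_counts = df["sentiment"].value_counts()
print(class_counts)

# Simple text statistics on the non-missing tweets
texts = df["text"].dropna()
avg_length = texts.str.len().mean()
vocabulary = set(" ".join(texts).lower().split())
print(f"average length: {avg_length:.1f} characters, {len(vocabulary)} unique tokens")
```

Splitting on whitespace is only a rough way to count unique words; the count will shift once the text has been cleaned.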

**Deliverable:** A "Data Understanding" section in your notebook with charts and commentary.

# Data Preparation

**Goal:** Clean the text data and prepare it for modeling in a reproducible, justifiable way.

## Text Preprocessing (Create a function)

*   **Justification:** We clean the text to reduce noise and help the model focus on meaningful words.
*   **Steps:**
    *   **Lowercase:** `"Apple"` and `"apple"` should be the same.
    *   **Remove URLs, User Mentions, and Hashtags:** These are often unique and don't carry general sentiment meaning. (Alternatively, you could replace them with placeholders like `[URL]`).
    *   **Remove Punctuation and Numbers:** Simplifies the text.
    *   **Tokenization:** Split text into individual words.
    *   **Remove Stopwords** (using `nltk` or `spacy`): Remove common words like "the", "and" that add little semantic value.
    *   **Lemmatization** (preferred over stemming): Reduce words to their base form (e.g., "running" -> "run") using `nltk` or `spacy`. **This is your "other Python package."**
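One possible shape for the preprocessing function is sketched below. To stay runnable without downloading any corpora, it uses a tiny hand-written stopword set and takes the lemmatizer as an optional callable; both the function name `clean_tweet` and the `lemmatize` parameter are illustrative choices, not part of the project brief.

```python
import re

# Tiny illustrative stopword list. In the real notebook you would more likely
# use nltk.corpus.stopwords.words("english") after downloading the corpus.
STOPWORDS = {"the", "and", "a", "an", "is", "to", "of", "my", "for", "in", "on", "at"}


def clean_tweet(text, lemmatize=None):
    """Return a cleaned, space-joined token string for one tweet."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)                 # mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)                # punctuation and digits
    tokens = text.split()                                 # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    if lemmatize is not None:
        tokens = [lemmatize(t) for t in tokens]
    return " ".join(tokens)


print(clean_tweet("Loving the new iPhone 15!! @Apple #iphone https://t.co/abc123"))
# prints: loving new iphone
```

To plug in NLTK, you could pass `WordNetLemmatizer().lemmatize` from `nltk.stem` as the `lemmatize` argument once the WordNet data is downloaded. Note that it treats every token as a noun by default, so "running" is only reduced to "run" when a verb part-of-speech tag is supplied.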

## Target Variable Preparation:
-   **Proof-of-Concept Path (Recommended):** Map the sentiment to a **binary** problem first (e.g., drop "neutral" tweets). This simplifies the initial model and makes it easier to achieve good performance.
-   **Advanced Path:** Keep all three classes (Positive, Negative, Neutral) for a multiclass challenge.
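The binary path could be sketched as follows. The label strings `"positive"`, `"negative"`, and `"neutral"` and the new `label` column are assumptions; check the actual values in the `sentiment` column before mapping.

```python
import pandas as pd

# Toy frame standing in for the cleaned dataset
df = pd.DataFrame({
    "cleaned_text": ["love iphone", "battery drain", "store open", "great camera"],
    "sentiment": ["positive", "negative", "neutral", "positive"],
})

# Binary path: keep only positive/negative rows, then encode them as 1/0
binary_df = df[df["sentiment"].isin(["positive", "negative"])].copy()
binary_df["label"] = binary_df["sentiment"].map({"negative": 0, "positive": 1})

print(binary_df[["sentiment", "label"]])
```

For the multiclass path, the filter is skipped and the mapping gains a third entry for the neutral class.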

## Train-Test-Validation Split:
- Split data into **Train** (70%), **Validation** (15%), and **Test** (15%) sets. Use `stratify` to preserve the class distribution. **This is your validation strategy.**
-  **Justification:** The test set is the final, untouched benchmark. The validation set is used for model selection and hyperparameter tuning during the iterative process.
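`train_test_split` only produces two pieces per call, so one common way to get a 70/15/15 split is to call it twice. The sketch below uses synthetic texts and labels, and the `random_state` value is arbitrary.

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 tweets, 70% positive (1) and 30% negative (0)
texts = [f"tweet {i}" for i in range(100)]
labels = [1] * 70 + [0] * 30

# Step 1: hold out 30% of the data for validation + test
X_train, X_temp, y_train, y_temp = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42
)

# Step 2: split the held-out 30% in half, giving 15% validation and 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```

Passing `stratify` in both calls keeps the class proportions roughly equal across the three sets.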

**Deliverable:** A well-documented "Data Preparation" section with a preprocessing function. This meets the "Advanced Data Preparation" bar.

# Modeling & Evaluation (Iterative)

**Goal:** Build multiple models, compare them, and select a final champion.

The modeling phase proceeds iteratively, one model at a time.

Steps for the modeling phase:

1. **Feature Extraction:** Convert the cleaned text into numerical features (e.g., `TfidfVectorizer`, `CountVectorizer`).
2. **Model Building:** Start with a baseline model, then try more advanced models.
3. **Model Evaluation:** Use the validation set to compare models and tune hyperparameters, and the test set only for the final evaluation.

Models, in order:
1. **Naive Bayes** (the first real model, compared against the `DummyClassifier` floor described below)
2. **Logistic Regression**
3. **Random Forest** (an optional third iteration)

- Handle class imbalance during training with class weights where the estimator supports them (for example `class_weight="balanced"` in `LogisticRegression` and `RandomForestClassifier`). `MultinomialNB` has no `class_weight` parameter, so imbalance there has to be handled differently, for example through `sample_weight` in `fit` or by resampling.

- **Note:** Modeling uses the Apple-focused subset of the data because the stakeholder is interested in Apple products. This subset is smaller than the full dataset. An alternative is to train on the full dataset and filter for Apple tweets in production, but the Apple subset is the starting point here.

Steps for feature extraction:

- Apply TF-IDF to the `cleaned_text` column.
- Fit the TF-IDF vectorizer on the training set only, then use it to transform the training, validation, and test sets. Fitting on the training set alone keeps vocabulary and IDF statistics from leaking out of the validation and test sets.



## Feature Engineering:

**Vectorization:** Convert cleaned text into numbers.
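- **Option 1:** `CountVectorizer` (Bag-of-Words): raw word counts per tweet.
- **Option 2:** `TfidfVectorizer` (Term Frequency-Inverse Document Frequency). This often performs better for text classification because it down-weights words that appear in most tweets.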
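A minimal sketch of both vectorizers on made-up cleaned text is shown below. The point to notice is that each vectorizer is fitted on the training text only and then reused to transform the other sets.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy cleaned tweets standing in for the real train/validation sets
train_texts = ["love new iphone", "battery drain terrible", "love camera"]
val_texts = ["terrible camera", "love battery life"]

bow = CountVectorizer()
tfidf = TfidfVectorizer()

# Fit on training text only, then reuse the learned vocabulary elsewhere
X_train_bow = bow.fit_transform(train_texts)
X_train_tfidf = tfidf.fit_transform(train_texts)
X_val_tfidf = tfidf.transform(val_texts)

print(sorted(tfidf.vocabulary_))
print(X_train_tfidf.shape, X_val_tfidf.shape)  # (3, 7) (2, 7)
```

The word "life" appears only in the validation text, so it is not part of the learned vocabulary and is silently ignored at transform time.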

## Baseline Model:

- **What:** A `DummyClassifier` that predicts the most frequent class.
- **Why:** It gives you a performance floor. Any real model must beat this.
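A sketch of the baseline on synthetic labels follows. The feature values are irrelevant here because `DummyClassifier` with `strategy="most_frequent"` ignores them and always predicts the majority class seen during training.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic data: features are placeholders, labels are 70% positive in training
X_train = [[0]] * 10
y_train = [1] * 7 + [0] * 3
X_val = [[0]] * 5
y_val = [1, 1, 1, 0, 0]

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

val_pred = baseline.predict(X_val)
baseline_acc = accuracy_score(y_val, val_pred)
print(baseline_acc)  # 0.6, the share of the majority class in y_val
```

On an imbalanced dataset this floor can look deceptively high, which is one reason to compare models on per-class metrics as well as accuracy.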

## Model Iteration 1: Simple & Fast
- **Model:** `MultinomialNB` (Naive Bayes) with `TfidfVectorizer`.
- **Evaluation:** Check accuracy, precision, recall, F1-score on the **validation set**. Create a confusion matrix.
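This first iteration might be sketched as below, using a scikit-learn pipeline so the vectorizer and classifier are fitted together. The tweets and labels are invented for illustration, so the printed metrics say nothing about real performance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy cleaned tweets (1 = positive, 0 = negative) standing in for the real split
train_texts = ["love new iphone", "great camera love", "amazing screen",
               "hate battery", "terrible update", "battery drain hate"]
train_labels = [1, 1, 1, 0, 0, 0]
val_texts = ["love screen", "hate update", "great battery"]
val_labels = [1, 0, 1]

# Vectorizer and classifier are fitted together on the training data only
nb_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
nb_model.fit(train_texts, train_labels)
val_pred = nb_model.predict(val_texts)

# Precision, recall, and F1 per class, plus overall accuracy
print(classification_report(val_labels, val_pred, zero_division=0))

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(val_labels, val_pred, labels=[0, 1])
print(cm)
```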

## Model Iteration 2: Powerful & Robust
- **Model:** `LogisticRegression` with `TfidfVectorizer`.
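- **Action:** Perform **Hyperparameter Tuning** (e.g., `C` for the classifier, `max_features` for the vectorizer) using `GridSearchCV` or `RandomizedSearchCV`. Both run cross-validation on the training data, so wrap the vectorizer and classifier in a `Pipeline` to tune them together, then confirm the chosen configuration on the validation set.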
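A small sketch of such a search is shown below. The grid values, the `f1_macro` scoring choice, and `cv=2` are illustrative assumptions sized for the toy data; a real search would typically use more folds and a wider grid.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy cleaned tweets (1 = positive, 0 = negative)
train_texts = ["love new iphone", "great camera love", "amazing screen love",
               "love update great", "hate battery", "terrible update hate",
               "battery drain hate", "terrible screen"]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Step names are prefixed with a double underscore to reach each parameter
param_grid = {
    "tfidf__max_features": [None, 10],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=2)
search.fit(train_texts, train_labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the pipeline refitted on all of the training data with the best parameters, ready to be scored on the validation set.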

## Model Interpretation (Key for "Exceeds"):
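- **Built-in:** For Logistic Regression, display the words with the largest positive and negative coefficients. These are the terms that push a prediction most strongly toward the Positive or Negative class, which makes the model easy to explain to a non-technical audience.
- **Advanced (LIME):** Use the `lime` package to explain *why* a single specific tweet was classified as positive or negative. A per-tweet explanation is a strong way to demonstrate that the model's decisions can be inspected.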

**Deliverable:** A "Modeling" section that clearly shows your iterative process: Baseline -> Model 1 -> Tuned Model 2. This meets and can exceed the "Advanced ML Modeling" bar.

# Final Evaluation & Conclusion

**Goal:** Justify your final model choice and explain its business implications.

*   **Final Model Selection:** Choose the best-performing model on the validation set (likely the tuned Logistic Regression).
*   **Unbiased Test:** Evaluate **only the final model** on the held-out **test set**. Report the final metrics.
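*   **Business Interpretation:** (example wording; substitute your actual test-set metrics)
    *   "Our model achieves 85% accuracy on unseen tweets. This means roughly 85 out of every 100 tweets receive the correct sentiment label." Because the classes are imbalanced, also report per-class precision and recall, since overall accuracy can hide weak performance on the minority class.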
    *   "The biggest limitation is class imbalance; the model is better at identifying positive sentiment than negative. In a real-world scenario, we would collect more negative examples."
    *   **Recommendation:** "Deploy this model to automatically scan tweets daily and deliver a sentiment dashboard to the product team. This will save dozens of hours of manual work."
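To see why accuracy alone can mislead on imbalanced data, here is a small sketch with made-up test labels and predictions (not real results). It compares overall accuracy with recall on the negative class.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up test labels: 8 positive (1) and 2 negative (0) tweets
y_test = [1] * 8 + [0] * 2
# Hypothetical predictions: all positives correct, one of two negatives missed
y_pred = [1] * 8 + [1, 0]

acc = accuracy_score(y_test, y_pred)
neg_recall = recall_score(y_test, y_pred, pos_label=0)

print(acc)         # 0.9
print(neg_recall)  # 0.5
```

In this toy case the model is right 90% of the time overall but catches only half of the negative tweets, which are the ones the Product Manager most needs to see.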

**Deliverable:** The "Evaluation" and "Conclusion" sections of your notebook and the core of your non-technical presentation.