# Project 6

You are a data scientist working for a consulting firm. You are given a dataset of tweets contained in sentiment140.csv.

Download sentiment140.csv. The data set has six columns without header:

- the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- the id of the tweet (2087)
- the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- the query (lyx). If there is no query, then this value is NO_QUERY.
- the user that tweeted (robotickilldozr)
- the text of the tweet ("Lyx is cool")

Data source: Sentiment140 (Go, A. et al., Stanford University)

 
Steps and Questions: 
Our goal is to forecast the polarity of the tweet using the text of the tweet.

1. Load the dataset of sentiment140.csv into memory.
2. Clean and preprocess the texts.
3. Build the first model based on pipeline using the support vector machines.
4. Check the first model. Is it a good model based on the selected evaluation metrics? Please justify your answer.
5. Create the second model using pipeline, grid search CV for the hyperparameters for the estimators. (Please see all the potential parameters at Scikit Learn TfidfVectorizer and Scikit Learn SVC.)
6. Tune the second model using the support vector machines and perform model diagnostics. Is it a good model? Please justify your answer.
7. Build the third model using pipeline, grid search CV, hyperparameter for the following classifiers:
    - Logistic regression
    - Random Forest
    - Support Vector Machine

8. Tune the third model and perform model diagnostics. Is it a good model? Please justify your answer.

## Load Necessary Libraries

In [1]:
import pandas as pd
import re
import string
import nltk
import spacy
from nltk.tokenize import word_tokenize
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer, TfidfTransformer
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/matthewmoore/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/matthewmoore/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1. Load the dataset of sentiment140.csv into memory

In [2]:
# Load the Dataset
df = pd.read_csv('/Users/matthewmoore/Downloads/sentiment140.csv', encoding = 'ISO-8859-1', header = None)

df

Unnamed: 0,0,1,2,3,4,5
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
...,...,...,...,...,...,...
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...


In [3]:
# Change Column Names
df.columns = ['polarity', 'id', 'date', 'query', 'user', 'text']

df.head()

Unnamed: 0,polarity,id,date,query,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column    Non-Null Count    Dtype 
---  ------    --------------    ----- 
 0   polarity  1600000 non-null  int64 
 1   id        1600000 non-null  int64 
 2   date      1600000 non-null  object
 3   query     1600000 non-null  object
 4   user      1600000 non-null  object
 5   text      1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [5]:
percent_missing = df.isnull().sum() * 100 / len(df)
percent_missing

polarity    0.0
id          0.0
date        0.0
query       0.0
user        0.0
text        0.0
dtype: float64

## 2. Clean and Preprocess the Data

In [6]:
from bs4 import BeautifulSoup

def preprocess_text(text):
    
    # Handle missing values
    if pd.isnull(text):
        return ""
    
    # Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()
    
    # Remove punctuation and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Tokenization
    words = word_tokenize(text)
    
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    
    # Join words back into a cleaned string
    return " ".join(words)

# Apply the cleaning function to the dataset
df['clean_text'] = df['text'].apply(preprocess_text)

In [7]:
df['clean_text']


0          switchfoot httptwitpiccomyzl awww thats bummer...
1          upset cant update facebook texting might cry r...
2          kenichan dived many times ball managed save re...
3                           whole body feels itchy like fire
4                   nationwideclass behaving im mad cant see
                                 ...                        
1599995                        woke school best feeling ever
1599996    thewdbcom cool hear old walt interviews httpbl...
1599997                      ready mojo makeover ask details
1599998    happy th birthday boo alll time tupac amaru sh...
1599999    happy charitytuesday thenspcc sparkscharity sp...
Name: clean_text, Length: 1600000, dtype: object
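
The sample output above shows a side effect of stripping punctuation first: a link is fused into a single token (`httptwitpiccomyzl`) and user handles such as `switchfoot` stay in the text. The following is a minimal sketch of one possible extra step, assuming it is acceptable to drop URLs and @mentions entirely before the punctuation regex runs. The helper name `strip_urls_and_mentions` and the sample tweet are hypothetical and not part of the original notebook.

```python
import re

# Assumed patterns: anything starting with http(s):// or www. counts as a URL,
# and an @ followed by word characters counts as a mention.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")

def strip_urls_and_mentions(text):
    """Remove URLs and @mentions, then collapse the leftover whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

# Made-up tweet used only for illustration
sample_tweet = "@someuser http://example.com/abc - so cool"
print(strip_urls_and_mentions(sample_tweet))
```

If this step were adopted, it would need to be called at the top of `preprocess_text`, before the `[^a-zA-Z\s]` substitution removes the characters that make URLs and mentions recognizable.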

In [8]:
df['polarity'].value_counts(normalize= True)

0    0.5
4    0.5
Name: polarity, dtype: float64

## 3. Build the first model based on pipeline using the support vector machines.

In [9]:
nlp = spacy.load("en_core_web_sm")

stop_words = spacy.lang.en.stop_words.STOP_WORDS

# Create our tokenizer function for a given sentence
def spacy_tokenizer(sentence):

    # Split the sentence into tokens/words
    mytokens = nlp(sentence)
    # Remove stop words and obtain the lemmas. Compare the token's text:
    # a spaCy Token object never matches the plain strings in stop_words.
    mytokens = [word.lemma_ for word in mytokens if word.text not in stop_words]
    return mytokens

In [10]:
X = df['clean_text']
y = df['polarity']

# Split the data into test and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
X_train.shape, X_test.shape

((1280000,), (320000,))

In [12]:
y_train.shape, y_test.shape

((1280000,), (320000,))

In [13]:
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer, ngram_range = (1,1))

# SVM classifier
svm_classifier = SVC()

In [None]:
# Create a Pipeline for SVM Classifier
pipeline_svc = Pipeline([
    ('vectorizer', tfidf_vector),
    ('classifier', svm_classifier)
])

# Train the model
pipeline_svc.fit(X_train, y_train)

# Predict
y_pred = pipeline_svc.predict(X_test)
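
The cell above fits a kernel `SVC` on 1,280,000 training tweets. The scikit-learn documentation notes that `SVC` fit time scales at least quadratically with the number of samples and suggests `LinearSVC` or `SGDClassifier` for large datasets. The sketch below shows the same pipeline shape with `LinearSVC` swapped in. It is an alternative under that assumption, not the notebook's method, and the toy texts and labels are made up so the block runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for (X_train, y_train); texts and labels are invented
toy_texts = [
    "love this phone", "great day today", "happy and excited", "best feeling ever",
    "hate this weather", "so sad and upset", "worst day ever", "feeling sick and tired",
]
toy_labels = [4, 4, 4, 4, 0, 0, 0, 0]

# Same two-step layout as pipeline_svc, with a linear SVM as the classifier
pipeline_linear_svc = Pipeline([
    ("vectorizer", TfidfVectorizer(ngram_range=(1, 1))),
    ("classifier", LinearSVC(C=1.0)),
])

pipeline_linear_svc.fit(toy_texts, toy_labels)
toy_predictions = pipeline_linear_svc.predict(["love this day", "hate feeling sick"])
print(toy_predictions)
```

On the real data, the same pipeline would take `X_train` and `y_train` directly. Drawing a random subset first (for example with `X_train.sample(...)`) is another way to keep a kernel `SVC` tractable.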

In [None]:
# Model Performance Metrics

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

## 4. Model Evaluation for SVM Model

[Fill in info]

## 5. Create the second model using pipeline, grid search CV for the hyperparameters for the estimators.

In [None]:
# Define TfidfVectorizer and SVC Parameters
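# The prefix before "__" must match the step names used in the pipeline
# ('vectorizer' and 'classifier')
param_grid = {
    'vectorizer__max_df': [0.80, 1.0],
    'vectorizer__min_df': [1, 5],
    'vectorizer__ngram_range': [(1,1), (1,2)],
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['linear', 'rbf'],
    'classifier__gamma': ['scale', 'auto']
}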
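
Before launching the search, it helps to know how many models the grid above implies. The sketch below multiplies the number of values per parameter by the number of folds. The step-name prefixes are omitted from the keys because only the list lengths matter here, and `cv_folds = 5` assumes the `cv=5` setting used in the grid search cell.

```python
from math import prod

# Value lists copied from the grid; prefixes dropped for brevity
grid_values = {
    "max_df": [0.80, 1.0],
    "min_df": [1, 5],
    "ngram_range": [(1, 1), (1, 2)],
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}
cv_folds = 5  # assumed to match cv=5 in GridSearchCV

# Each candidate is one combination of values; each is fit once per fold
n_candidates = prod(len(values) for values in grid_values.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 96 480
```

With the default `refit=True`, `GridSearchCV` adds one final fit on the full training set. Note also that `gamma` has no effect when `kernel='linear'`, so half of the linear candidates repeat each other.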

In [None]:
# Create a Pipeline 
pipeline_svc = Pipeline([
    ('vectorizer', tfidf_vector),
    ('classifier', svm_classifier)
])


In [None]:
# GridSearchCV
grid_search = GridSearchCV(pipeline_svc, param_grid, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

## 6. Tune the second model using the support vector machines and perform model diagnostics. Is it a good model? Please justify your answer.

In [None]:
# Parameter and Model Evaluation
print("Best Parameters:\n", grid_search.best_params_)

y_pred_grid = grid_search.predict(X_test)
print("\nAccuracy (Tuned Model):", accuracy_score(y_test, y_pred_grid))
print("\nClassification Report (Tuned Model):\n", classification_report(y_test, y_pred_grid))
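
The import cell brings in `confusion_matrix`, but the evaluation cells only print accuracy and the classification report. As a possible addition to the diagnostics, the sketch below shows how the matrix could be read for the two polarity labels (0 and 4). The label vectors are made up for illustration and are not results from this notebook.

```python
from sklearn.metrics import confusion_matrix

# Invented labels standing in for y_test and y_pred_grid
toy_y_true = [0, 0, 0, 4, 4, 4, 4, 0]
toy_y_pred = [0, 4, 0, 4, 4, 0, 4, 0]

# Rows are true labels, columns are predicted labels, in the order [0, 4]
cm = confusion_matrix(toy_y_true, toy_y_pred, labels=[0, 4])

# Treating 4 (positive sentiment) as the positive class
tn, fp, fn, tp = cm.ravel()
print(cm)
print("negative tweets predicted positive:", fp)
print("positive tweets predicted negative:", fn)
```

Comparing the two off-diagonal counts shows whether the model errs more toward one polarity, which a single accuracy figure cannot reveal.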

[Fill in info]

## 7. Build the third model using pipeline, grid search CV, hyperparameter for the following classifiers:
- Logistic regression
- Random Forest
- Support Vector Machine

In [None]:
# Create a Pipeline
pipeline_multi = Pipeline([
    ('tfidf', tfidf_vector),
    ('clf', svm_classifier) 
])

In [None]:
# Define the Parameters for the 3 Classifiers
param_grid_multi = [
    {
        'clf': [LogisticRegression(max_iter=1000)],
        'clf__C': [0.1, 1, 10],
        'tfidf__ngram_range': [(1,1), (1,2)]
    },
    {
        'clf': [RandomForestClassifier()],
        'clf__n_estimators': [100, 200],
        'clf__max_depth': [None, 10, 20],
        'tfidf__max_df': [0.75, 1.0]
    },
    {
        'clf': [SVC()],
        'clf__C': [0.1, 1, 10],
        'clf__kernel': ['linear', 'rbf'],
        'tfidf__min_df': [1, 5]
    }
]


In [None]:
# GridSearchCV with Pre-defined Multi-parameters
grid_search_multi = GridSearchCV(pipeline_multi, param_grid_multi, cv=5, n_jobs=-1, verbose=1)
grid_search_multi.fit(X_train, y_train)

## 8. Tune the third model and perform model diagnostics. Is it a good model? Please justify your answer.

In [None]:
# Parameter and Model Evaluation
print("Best Parameters for Multi-Classifiers:\n", grid_search_multi.best_params_)

y_pred_multi = grid_search_multi.predict(X_test)
print("\nAccuracy (Multi-Classifier Model):", accuracy_score(y_test, y_pred_multi))
print("\nClassification Report (Multi-Classifier Model):\n", classification_report(y_test, y_pred_multi))
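
`best_params_` reports only the single winning configuration. To compare the classifier families, one option is to group `cv_results_` by estimator type. The sketch below does this on a toy search. The texts, labels, the reduced grid, and `cv=2` are all assumptions chosen so the block runs quickly, and nothing here reflects scores on the Sentiment140 data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy data standing in for (X_train, y_train)
toy_texts = [
    "love this phone", "great day today", "happy and excited", "best feeling ever",
    "hate this weather", "so sad and upset", "worst day ever", "feeling sick and tired",
]
toy_labels = [4, 4, 4, 4, 0, 0, 0, 0]

toy_pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])

# The 'clf' key swaps the whole estimator, as in param_grid_multi
toy_grid = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1]},
    {"clf": [RandomForestClassifier(n_estimators=10, random_state=0)],
     "clf__max_depth": [None, 5]},
]

toy_search = GridSearchCV(toy_pipeline, toy_grid, cv=2)
toy_search.fit(toy_texts, toy_labels)

# One row per candidate; label each row with its estimator class name
results = pd.DataFrame(toy_search.cv_results_)
results["classifier"] = results["param_clf"].apply(lambda est: type(est).__name__)

# Best mean cross-validated score reached by each classifier family
best_per_classifier = results.groupby("classifier")["mean_test_score"].max()
print(best_per_classifier)
```

Applied to `grid_search_multi`, the same grouping would show whether the winning classifier is clearly ahead or only marginally better than the others.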

[Fill in info] 