# FA 3
**Predictive Analytics**

Justine Aizel Samson

### Importing Dataset



In [22]:
!pip uninstall -y tensorflow numpy
!pip install numpy==1.23.5 tensorflow==2.12.0


Found existing installation: tensorflow 2.18.0
Uninstalling tensorflow-2.18.0:
  Successfully uninstalled tensorflow-2.18.0
Found existing installation: numpy 2.0.2
Uninstalling numpy-2.0.2:
  Successfully uninstalled numpy-2.0.2
Collecting numpy==1.23.5
  Using cached numpy-1.23.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.3 kB)
Collecting tensorflow==2.12.0
  Downloading tensorflow-2.12.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting gast<=0.4.0,>=0.2.1 (from tensorflow==2.12.0)
  Downloading gast-0.4.0-py3-none-any.whl.metadata (1.1 kB)
Collecting keras<2.13,>=2.12.0 (from tensorflow==2.12.0)
  Downloading keras-2.12.0-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 (from tensorflow==2.12.0)
  Downloading protobuf-4.25.7-cp37-abi3-manylinux2014_x86_64.whl.metadata (541 bytes)
Collecting tensorboard<2.13,>=2.12 (from tensorflow==2.12




In [2]:
import numpy
import tensorflow as tf
from tensorflow.keras.models import Sequential
print(numpy.__version__)
print(tf.__version__)


1.23.5
2.12.0


In [3]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = word_tokenize(text)
    cleaned = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    return cleaned


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [4]:
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd

# If you uploaded to "My Drive"
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Tweets.csv')

# Preview
print(df.head())


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
             tweet_id airline_sentiment  airline_sentiment_confidence  \
0  570306133677760513           neutral                        1.0000   
1  570301130888122368          positive                        0.3486   
2  570301083672813571           neutral                        0.6837   
3  570301031407624196          negative                        1.0000   
4  570300817074462722          negative                        1.0000   

  negativereason  negativereason_confidence         airline  \
0            NaN                        NaN  Virgin America   
1            NaN                     0.0000  Virgin America   
2            NaN                        NaN  Virgin America   
3     Bad Flight                     0.7033  Virgin America   
4     Can't Tell                     1.0000  Virgin America   

  airline_sentiment_gold        name negativereason_g

#### Cleaning Dataset

In [5]:
# 1. Drop duplicate rows
df = df.drop_duplicates()

# 2. Drop rows with missing 'text'
df = df.dropna(subset=['text'])

# 3. Drop unnecessary columns (if present)
columns_to_drop = [
    'tweet_id', 'airline_sentiment_gold', 'negativereason_gold',
    'name', 'tweet_coord', 'tweet_location', 'user_timezone'
]
df = df.drop(columns=[col for col in columns_to_drop if col in df.columns])

# 4. Fix encoding in text (e.g., '&amp;' → '&')
df['text'] = df['text'].str.replace('&amp;', '&')

# 5. Reset index
df = df.reset_index(drop=True)

# 6. Preview cleaned data
print("\nAfter cleaning:")
print(df.info())
print(df.head())

df['tweet_created'] = pd.to_datetime(df['tweet_created'])
df['airline_sentiment'].value_counts()



After cleaning:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14604 entries, 0 to 14603
Data columns (total 8 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   airline_sentiment             14604 non-null  object 
 1   airline_sentiment_confidence  14604 non-null  float64
 2   negativereason                9159 non-null   object 
 3   negativereason_confidence     10503 non-null  float64
 4   airline                       14604 non-null  object 
 5   retweet_count                 14604 non-null  int64  
 6   text                          14604 non-null  object 
 7   tweet_created                 14604 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 912.9+ KB
None
  airline_sentiment  airline_sentiment_confidence negativereason  \
0           neutral                        1.0000            NaN   
1          positive                        0.3486            NaN   
2           

| airline_sentiment | count |
| --- | --- |
| negative | 9159 |
| neutral | 3091 |
| positive | 2354 |

In [6]:
df['tweet_created'] = pd.to_datetime(df['tweet_created'], utc=True)
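
Passing `utc=True` asks pandas to convert timezone-aware timestamps to UTC so the whole column shares one timezone. The sketch below shows the same normalization with the standard library only. The timestamp string and its format are assumptions for illustration; the actual `tweet_created` values are not visible in the truncated preview above.

```python
from datetime import datetime, timezone

def to_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM:SS +/-HHMM' string and convert it to UTC."""
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S %z")
    return local.astimezone(timezone.utc)

# Hypothetical timestamp with a -08:00 offset
print(to_utc("2015-02-24 11:35:52 -0800"))
```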


Data overview after cleaning:

* Total rows: 14,604 (as reported by `df.info()` above)
* 8 columns remain after dropping the unused ones

* `negativereason` (9,159 non-null) and `negativereason_confidence` (10,503 non-null) have missing values, as expected: a reason is recorded only for negative tweets
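
The class counts printed earlier also show that the labels are imbalanced, which matters when reading accuracy later on. A small sketch that turns those counts into shares (the counts are copied from the `value_counts()` output; the variable names are illustrative):

```python
from collections import Counter

# Counts copied from the value_counts() output above
class_counts = Counter({"negative": 9159, "neutral": 3091, "positive": 2354})
total = sum(class_counts.values())

# Share of each sentiment class in the cleaned dataset
class_shares = {label: count / total for label, count in class_counts.items()}

for label, share in class_shares.items():
    print(f"{label}: {share:.1%}")
```

Negative tweets make up well over half of the data, so a model that always predicts "negative" would already look decent on accuracy alone.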



### Text Preprocessing

In [7]:
!pip install nltk




In [8]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download required nltk data files (run once)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [9]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Remove Twitter handles (@user)
    text = re.sub(r'@\w+', '', text)
    # Remove punctuation and special characters (keep spaces and letters)
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove digits
    text = re.sub(r'\d+', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # Join tokens back to string
    return ' '.join(tokens)


In [10]:
from nltk.tokenize import word_tokenize

def preprocess_text(text):
    # ...
    tokens = word_tokenize(text)  # no extra language argument!
    # ...


In [11]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [12]:
import nltk.tokenize
print(nltk.tokenize.__file__)

from nltk.tokenize import word_tokenize
print(word_tokenize)


/usr/local/lib/python3.11/dist-packages/nltk/tokenize/__init__.py
<function word_tokenize at 0x7c581b73ef20>


In [13]:
from nltk.tokenize.treebank import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

text = "This is a sample sentence."
tokens = tokenizer.tokenize(text)
print(tokens)


['This', 'is', 'a', 'sample', 'sentence', '.']


In [14]:
import nltk
print(nltk.data.find('tokenizers/punkt'))


/root/nltk_data/tokenizers/punkt


In [15]:
from nltk.tokenize.treebank import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'[^a-z\s]', '', text)
    text = re.sub(r'\d+', '', text)

    tokens = tokenizer.tokenize(text)   # Use explicit tokenizer here
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

# Apply preprocessing
df['clean_text'] = df['text'].apply(preprocess_text)

# Check results
print(df[['text', 'clean_text']].head())


                                                text  \
0                @VirginAmerica What @dhepburn said.   
1  @VirginAmerica plus you've added commercials t...   
2  @VirginAmerica I didn't today... Must mean I n...   
3  @VirginAmerica it's really aggressive to blast...   
4  @VirginAmerica and it's a really big bad thing...   

                                          clean_text  
0                                               said  
1       plus youve added commercial experience tacky  
2       didnt today must mean need take another trip  
3  really aggressive blast obnoxious entertainmen...  
4                               really big bad thing  
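
To see what the regex steps do on their own, here is a stripped-down sketch that leaves out NLTK (no stop-word removal or lemmatization). The function name `strip_tweet_noise` is hypothetical. Two observations it relies on: `http\S+` already matches `https` links, and `[^a-z\s]` already removes digits, so the separate `https\S+` and `\d+` patterns in the cell above appear to be redundant.

```python
import re

def strip_tweet_noise(text: str) -> str:
    """Lowercase a tweet and drop URLs, @handles and non-letter characters."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)  # URLs (http and https)
    text = re.sub(r"@\w+", "", text)            # Twitter handles
    text = re.sub(r"[^a-z\s]", "", text)        # keep only letters and whitespace
    return " ".join(text.split())               # collapse repeated whitespace

print(strip_tweet_noise("@VirginAmerica What @dhepburn said."))  # what said
```

The result still contains the stop word "what", which the full pipeline removes afterwards, leaving only "said" as in the first row above.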


#### Compute TF-IDF scores and display top 10 weighted words for each class.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd

# Create TF-IDF vectorizer
tfidf = TfidfVectorizer(max_features=5000)  # limit features for speed

# Fit and transform the cleaned tweets
X_tfidf = tfidf.fit_transform(df['clean_text'])

# Get feature (word) names
feature_names = np.array(tfidf.get_feature_names_out())

# Add sentiment labels
df['airline_sentiment'] = df['airline_sentiment'].astype(str)  # ensure string type

# Function to get top n words per class based on average TF-IDF scores
def top_tfidf_words_per_class(tfidf_matrix, labels, class_name, n=10):
    # Select rows with this class
    class_indices = np.where(labels == class_name)[0]

    # Average TF-IDF vector for this class
    class_tfidf = tfidf_matrix[class_indices].mean(axis=0)

    # Convert to array
    class_tfidf_array = np.asarray(class_tfidf).flatten()

    # Get indices of top n words
    top_n_ids = class_tfidf_array.argsort()[::-1][:n]

    return feature_names[top_n_ids], class_tfidf_array[top_n_ids]

# Prepare labels array
labels = df['airline_sentiment'].values

# For each sentiment class, print top 10 weighted words
for sentiment in df['airline_sentiment'].unique():
    top_words, scores = top_tfidf_words_per_class(X_tfidf, labels, sentiment, n=10)
    print(f"\nTop 10 TF-IDF words for sentiment '{sentiment}':")
    for word, score in zip(top_words, scores):
        print(f"{word}: {score:.4f}")



Top 10 TF-IDF words for sentiment 'neutral':
flight: 0.0398
fleek: 0.0191
dm: 0.0186
fleet: 0.0183
please: 0.0175
get: 0.0168
thanks: 0.0155
need: 0.0144
help: 0.0133
tomorrow: 0.0110

Top 10 TF-IDF words for sentiment 'positive':
thanks: 0.0880
thank: 0.0815
great: 0.0328
flight: 0.0254
love: 0.0190
much: 0.0179
awesome: 0.0171
best: 0.0165
guy: 0.0161
good: 0.0151

Top 10 TF-IDF words for sentiment 'negative':
flight: 0.0477
hour: 0.0255
get: 0.0212
cancelled: 0.0206
customer: 0.0181
service: 0.0176
hold: 0.0171
time: 0.0165
bag: 0.0154
help: 0.0149
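
The ranking above averages each word's TF-IDF weight over all tweets in a class. The sketch below reproduces that idea in plain Python on a four-document toy corpus, using the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults. The corpus, labels and function names are made up for illustration, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each token
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

def top_words_for_class(vocab, rows, labels, target, n=3):
    """Rank words by their mean TF-IDF weight over documents labeled `target`."""
    class_rows = [row for row, label in zip(rows, labels) if label == target]
    means = [sum(col) / len(class_rows) for col in zip(*class_rows)]
    ranked = sorted(zip(vocab, means), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Toy corpus (hypothetical, not from the dataset)
docs = ["thanks great flight", "flight cancelled hour hold",
        "thanks love crew", "cancelled flight bag lost"]
labels = ["positive", "negative", "positive", "negative"]

vocab, rows = tfidf_rows(docs)
print(top_words_for_class(vocab, rows, labels, "positive"))
print(top_words_for_class(vocab, rows, labels, "negative"))
```

As in the real output, a word such as "flight" that appears in every class gets a lower IDF, while class-specific words like "thanks" or "cancelled" rise to the top.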


### Embedding with Word2Vec or GloVe

In [17]:
!pip install --upgrade numpy==1.24.3 scipy==1.10.1 gensim==4.3.1


Collecting numpy==1.24.3
  Using cached numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
ERROR: Operation cancelled by user

In [18]:
import gensim.downloader as api
import numpy as np

# Download and load Word2Vec Google News (300-dimensional)
word2vec = api.load("word2vec-google-news-300")

embedding_dim = 300

def document_vector(doc):
    """Create document vector by averaging word vectors for words in doc"""
    words = doc.split()
    valid_words = [word for word in words if word in word2vec.key_to_index]
    if len(valid_words) == 0:
        # If no words found in embeddings, return zero vector
        return np.zeros(embedding_dim)
    else:
        return np.mean(word2vec[valid_words], axis=0)

# Apply to your cleaned text
df['doc_vector'] = df['clean_text'].apply(document_vector)

print(df[['clean_text', 'doc_vector']].head())


                                          clean_text  \
0                                               said   
1       plus youve added commercial experience tacky   
2       didnt today must mean need take another trip   
3  really aggressive blast obnoxious entertainmen...   
4                               really big bad thing   

                                          doc_vector  
0  [-0.009094238, -0.044189453, 0.099609375, -0.0...  
1  [0.0009358724, -0.05480957, -0.04031372, 0.078...  
2  [-0.0025896344, 0.04867118, 0.0355399, 0.03494...  
3  [0.0029686822, 0.097235784, -0.018581815, 0.05...  
4  [0.11010742, 0.06271362, 0.0031738281, 0.13183...  
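
The document vector is simply the element-wise mean of the vectors of the words that exist in the embedding vocabulary, with a zero vector as the fallback. A dependency-free sketch of that logic, using a tiny two-dimensional embedding table invented for illustration (the real Word2Vec vectors have 300 dimensions):

```python
def average_vector(tokens, embeddings, dim):
    """Average the embeddings of known tokens; return zeros if none are known."""
    known = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(values) / len(known) for values in zip(*known)]

# Hypothetical 2-dimensional embeddings
toy_embeddings = {"bad": [1.0, 0.0], "thing": [0.0, 1.0]}

print(average_vector("really big bad thing".split(), toy_embeddings, dim=2))  # [0.5, 0.5]
print(average_vector(["youve"], toy_embeddings, dim=2))                       # [0.0, 0.0]
```

Tokens missing from the vocabulary, such as "youve" after apostrophes are stripped, are skipped silently. Tweets made up entirely of such tokens end up as all-zero rows, so it may be worth counting them before training.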


In [19]:
import nltk
import os

# Create a local nltk_data folder in your working directory
nltk_data_dir = './nltk_data'
os.makedirs(nltk_data_dir, exist_ok=True)

# Download punkt and stopwords to local directory
nltk.download('punkt', download_dir=nltk_data_dir)
nltk.download('stopwords', download_dir=nltk_data_dir)
nltk.download('wordnet', download_dir=nltk_data_dir)
nltk.download('omw-1.4', download_dir=nltk_data_dir)

# Tell nltk to look here first
nltk.data.path.insert(0, nltk_data_dir)

print("NLTK data paths:", nltk.data.path)


NLTK data paths: ['./nltk_data', '/root/nltk_data', '/usr/nltk_data', '/usr/share/nltk_data', '/usr/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']


[nltk_data] Downloading package punkt to ./nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to ./nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to ./nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to ./nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [20]:
import nltk
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download required resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Define tools
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Preprocessing function
def preprocess_text(text, use_lemma=True):
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+|@\w+|#\w+", "", text)
    text = re.sub(r"[{}]".format(re.escape(string.punctuation)), "", text)  # escape so ], ^, - and \ are treated literally
    text = re.sub(r"\d+", "", text)

    try:
        tokens = word_tokenize(text)
    except LookupError:
        tokens = text.split()  # Fallback if the punkt tokenizer data is missing

    tokens = [w for w in tokens if w not in stop_words and w.strip() != ""]

    if use_lemma:
        tokens = [lemmatizer.lemmatize(w) for w in tokens]
    else:
        tokens = [stemmer.stem(w) for w in tokens]

    return tokens

# Apply both versions
df['tokens_stem'] = df['text'].apply(lambda x: preprocess_text(x, use_lemma=False))
df['tokens_lemma'] = df['text'].apply(lambda x: preprocess_text(x, use_lemma=True))

# Preview
df[['text', 'tokens_stem', 'tokens_lemma']].head()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


| | text | tokens_stem | tokens_lemma |
| --- | --- | --- | --- |
| 0 | @VirginAmerica What @dhepburn said. | [said] | [said] |
| 1 | @VirginAmerica plus you've added commercials t... | [plu, youv, ad, commerci, experi, tacki] | [plus, youve, added, commercial, experience, t... |
| 2 | @VirginAmerica I didn't today... Must mean I n... | [didnt, today, must, mean, need, take, anoth, ... | [didnt, today, must, mean, need, take, another... |
| 3 | @VirginAmerica it's really aggressive to blast... | [realli, aggress, blast, obnoxi, entertain, gu... | [really, aggressive, blast, obnoxious, enterta... |
| 4 | @VirginAmerica and it's a really big bad thing... | [realli, big, bad, thing] | [really, big, bad, thing] |


In this formative assessment, I will build a model that can tell whether an airline tweet is positive or negative. I will start by cleaning the text, removing stop words, and turning the tweets into numbers. Then I will test different ways to represent the words, such as TF-IDF and word embeddings. Finally, I will train simple models to see how well they predict each tweet's sentiment and compare the results.


### Model Building

In [21]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix

# Build the feature matrix and labels (X_w2v and y were never defined in earlier cells).
# Assumption: the binary model targets positive vs. negative tweets, so neutral tweets are excluded.
# np was imported in the TF-IDF and Word2Vec cells above.
binary_df = df[df['airline_sentiment'].isin(['positive', 'negative'])]
X_w2v = np.vstack(binary_df['doc_vector'].values)
y = (binary_df['airline_sentiment'] == 'positive').astype(int).values

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_w2v, y, test_size=0.2, random_state=42)

# Build simple NN
model = Sequential()
model.add(Dense(64, input_dim=X_w2v.shape[1], activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))  # Binary output

# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train model
model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=1, validation_data=(X_test, y_test))

# Predict
y_pred_nn = model.predict(X_test).flatten()
y_pred_nn_labels = (y_pred_nn >= 0.5).astype(int)

# Evaluation
print("Neural Network Classification Report:")
print(classification_report(y_test, y_pred_nn_labels))

cm_nn = confusion_matrix(y_test, y_pred_nn_labels)
sns.heatmap(cm_nn, annot=True, fmt='d', cmap='Oranges')
plt.title("Neural Network Confusion Matrix (Word2Vec)")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
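
The evaluation step in the cell above turns sigmoid outputs into class labels with a 0.5 cutoff and then tabulates a confusion matrix. A plain-Python sketch of those two steps, with made-up probabilities and labels rather than model output, shows what the numbers in such a matrix mean. The layout (rows are actual classes, columns are predicted classes) follows the convention used by scikit-learn's `confusion_matrix`.

```python
def threshold_labels(probabilities, cutoff=0.5):
    """Map probabilities to 0/1 labels; values at or above the cutoff become 1."""
    return [1 if p >= cutoff else 0 for p in probabilities]

def binary_confusion(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are actual labels, columns predicted."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Hypothetical labels and sigmoid outputs
y_true = [1, 0, 1, 1, 0]
probs = [0.9, 0.4, 0.3, 0.5, 0.7]

y_pred = threshold_labels(probs)
cm = binary_confusion(y_true, y_pred)
tn, fp = cm[0]
fn, tp = cm[1]

print(cm)                                    # [[1, 1], [1, 2]]
print("precision:", round(tp / (tp + fp), 3))
print("recall:", round(tp / (tp + fn), 3))
```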