# Project 5

You are a data scientist working for a political consulting firm. You are given a dataset in Twitter_Data.csv with the following two columns:

- clean_text: Tweets extracted from Twitter, mainly about Modi (a 2019 Indian prime ministerial candidate) and other prime ministerial candidates.

- category: The sentiment label of the respective tweet, which takes one of three values: -1, 0, or 1.

## Load Necessary Libraries 

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import spacy



## 1. Load the Twitter Data

In [2]:
# Load Twitter Data
df = pd.read_csv("/Users/matthewmoore/Downloads/Twitter_Data (1).csv", encoding = "utf-8")

# Display Dataframe
df

Unnamed: 0,clean_text,category
0,when modi promised “minimum government maximum...,-1.0
1,talk all the nonsense and continue all the dra...,0.0
2,what did just say vote for modi welcome bjp t...,1.0
3,asking his supporters prefix chowkidar their n...,1.0
4,answer who among these the most powerful world...,1.0
...,...,...
162975,why these 456 crores paid neerav modi not reco...,-1.0
162976,dear rss terrorist payal gawar what about modi...,-1.0
162977,did you cover her interaction forum where she ...,0.0
162978,there big project came into india modi dream p...,0.0


## 2. Find the cosine similarity between the 100th and 10,000th tweets in clean_text using the dot and norm functions.

In [3]:
# Clean NaN Values
df['clean_text'] = df['clean_text'].fillna('')

In [4]:
# Cosine similarity computed from a dot product and Euclidean norms
def cosine_similarity_dot_norm(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    # Return 0.0 when either vector is all zeros (e.g. an empty tweet)
    return dot_product / norm_product if norm_product else 0.0

# TF-IDF Vectorizer
vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False)
X_tfidf = vectorizer.fit_transform(df['clean_text'])
vec1, vec2 = X_tfidf[99].toarray(), X_tfidf[9999].toarray()

# Cosine Similarity w/ Dot and Norm Function
similarity_dot_norm = cosine_similarity_dot_norm(vec1.flatten(), vec2.flatten())
print(f"Cosine Similarity using dot and norm: {similarity_dot_norm}")

Cosine Similarity using dot and norm: 0.06209565096586555
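
As a sanity check on the dot-and-norm formula, cos(θ) = (a·b) / (‖a‖‖b‖), here is a minimal sketch on small hand-made vectors. The vectors are toy values chosen so the result can be verified by hand; they are not taken from the dataset.

```python
import numpy as np

def cosine_similarity_dot_norm(vec1, vec2):
    """Cosine similarity from a dot product and two Euclidean norms."""
    dot_product = np.dot(vec1, vec2)
    norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    # Guard against an all-zero vector (e.g. an empty tweet)
    return dot_product / norm_product if norm_product else 0.0

# Toy vectors (assumed values for illustration only)
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])
zero = np.zeros(3)

sim_ab = cosine_similarity_dot_norm(a, b)        # 1 / (sqrt(2) * sqrt(2)), about 0.5
sim_aa = cosine_similarity_dot_norm(a, a)        # a vector against itself, about 1.0
sim_zero = cosine_similarity_dot_norm(a, zero)   # zero norm, so the guard returns 0.0
print(sim_ab, sim_aa, sim_zero)
```

The zero-vector guard matters in this notebook because empty tweets are filled with `''` and therefore produce all-zero TF-IDF rows.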


## 3. Find the cosine similarity between the 100th and 10,000th tweets in clean_text using the cosine function.

In [5]:
# Cosine Similarity w/ Manual Cosine Function
def cosine_function(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

similarity_cosine_function = cosine_function(vec1.flatten(), vec2.flatten())
print(f"Cosine Similarity using manual cosine function: {similarity_cosine_function}")

Cosine Similarity using manual cosine function: 0.06209565096586555
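
The cell above repeats the dot-and-norm computation under a different name. If "the cosine function" in this task refers to `scipy.spatial.distance.cosine` (an assumption; the task statement does not say), note that SciPy returns a cosine distance, so the similarity is one minus its result. A minimal sketch on toy vectors that are not from the dataset:

```python
import numpy as np
from scipy.spatial.distance import cosine

# Toy 1-D vectors (assumed values for illustration only)
u = np.array([1.0, 0.0, 1.0])
v = np.array([1.0, 1.0, 0.0])

# SciPy's cosine() is a distance: 1 - cosine similarity
cosine_distance = cosine(u, v)
similarity_scipy = 1.0 - cosine_distance

# Cross-check against the dot-and-norm formula
manual = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(similarity_scipy, manual)
```

With the notebook's variables this would read `1 - cosine(vec1.flatten(), vec2.flatten())`, since SciPy expects 1-D inputs.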


## 4. Find the cosine similarity between the 100th and 10,000th tweets in clean_text using the cosine_similarity function.

In [6]:
# Cosine Similarity w/ Cosine_Similarity Function from sklearn
similarity_sklearn = cosine_similarity(X_tfidf[99], X_tfidf[9999])[0][0]
print(f"Cosine Similarity using sklearn's cosine_similarity: {similarity_sklearn}")

Cosine Similarity using sklearn's cosine_similarity: 0.06209565096586556
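
The difference in the last digit from the earlier results is floating-point rounding. `TfidfVectorizer` L2-normalizes each row by default (`norm="l2"`), so for these vectors cosine similarity reduces to a plain dot product. A minimal sketch on a made-up three-sentence corpus (toy text, not from the dataset):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus for illustration only
toy_corpus = [
    "vote for the candidate",
    "the candidate promised jobs",
    "weather is nice today",
]

X_toy = TfidfVectorizer().fit_transform(toy_corpus)

# Each row should already have unit Euclidean length (norm="l2" is the default)
row_norms = np.asarray(np.sqrt(X_toy.multiply(X_toy).sum(axis=1))).ravel()

# On unit-length rows, the dot product equals the cosine similarity
dot_sim = X_toy[0].multiply(X_toy[1]).sum()
sk_sim = cosine_similarity(X_toy[0], X_toy[1])[0][0]

# Documents with no shared vocabulary have similarity 0
disjoint_sim = cosine_similarity(X_toy[0], X_toy[2])[0][0]
print(row_norms, dot_sim, sk_sim, disjoint_sim)
```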


## 5. Find the cosine similarity between the 100th and 10,000th tweets in clean_text using the spaCy similarity function.

In [7]:
#!python -m spacy download en_core_web_lg


In [8]:
#!python -m spacy download en_core_web_md

In [9]:
nlp = spacy.load("en_core_web_md")

# Parse the two tweets into spaCy Doc objects (positional lookup with .iloc)
doc1 = nlp(df['clean_text'].iloc[99])
doc2 = nlp(df['clean_text'].iloc[9999])

# Cosine Similarity w/ Spacy Function
similarity_spacy = doc1.similarity(doc2)
print(f"Cosine Similarity using Spacy: {similarity_spacy}")

Cosine Similarity using Spacy: 0.8365634393115299
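
This value is far higher than the TF-IDF results because the two methods compare different representations. spaCy's documentation describes `Doc.vector` as defaulting to the average of the token vectors and `Doc.similarity` as a cosine similarity over those vectors, so two texts can score highly without sharing any exact words. A minimal sketch with made-up 3-dimensional token vectors (hypothetical values, not real spaCy embeddings):

```python
import numpy as np

# Hypothetical token vectors; real en_core_web_md vectors have 300 dimensions
token_vectors = {
    "modi":     np.array([0.9, 0.1, 0.3]),
    "election": np.array([0.8, 0.2, 0.4]),
    "vote":     np.array([0.7, 0.3, 0.2]),
    "leader":   np.array([0.9, 0.2, 0.1]),
}

def doc_vector(tokens):
    """Average the token vectors, mirroring the default Doc.vector behaviour."""
    return np.mean([token_vectors[t] for t in tokens], axis=0)

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = doc_vector(["modi", "election"])
doc_b = doc_vector(["vote", "leader"])   # no words in common with doc_a

dense_sim = cosine(doc_a, doc_b)
print(f"Averaged-vector similarity: {dense_sim:.3f}")
```

A bag-of-words comparison of the same two toy documents would give 0, because they share no terms.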


## 6. Find the tweets in this dataset whose cosine similarity with the 100th tweet exceeds 65%, using spaCy.

In [10]:
# Convert tweets into document vectors (spaCy); nlp.pipe processes the texts in batches
spacy_vectors = np.array([doc.vector for doc in nlp.pipe(df['clean_text'])])

# Compute the similarity between the 100th tweet and every tweet (including itself)
doc1_vector = spacy_vectors[99].reshape(1, -1)
similarities = cosine_similarity(doc1_vector, spacy_vectors)[0]

# Tweets with Greater Than 65% Cosine Similarity
similar_tweet_indices = np.where(similarities > 0.65)[0]

# Results
print("Tweets with similarity > 65%:")
for idx in similar_tweet_indices:
    print(f"Index: {idx}, Similarity: {similarities[idx]:.2f}")


Tweets with similarity > 65%:
Index: 0, Similarity: 0.80
Index: 1, Similarity: 0.78
Index: 2, Similarity: 0.78
Index: 3, Similarity: 0.86
Index: 4, Similarity: 0.76
Index: 6, Similarity: 0.71
Index: 7, Similarity: 0.74
Index: 8, Similarity: 0.75
Index: 10, Similarity: 0.84
Index: 11, Similarity: 0.80
Index: 13, Similarity: 0.72
Index: 14, Similarity: 0.76
Index: 15, Similarity: 0.73
Index: 17, Similarity: 0.73
Index: 18, Similarity: 0.85
Index: 19, Similarity: 0.70
Index: 20, Similarity: 0.81
Index: 21, Similarity: 0.66
Index: 22, Similarity: 0.72
Index: 23, Similarity: 0.75
Index: 24, Similarity: 0.80
Index: 25, Similarity: 0.75
Index: 26, Similarity: 0.72
Index: 27, Similarity: 0.75
Index: 28, Similarity: 0.68
Index: 29, Similarity: 0.70
Index: 30, Similarity: 0.73
Index: 31, Similarity: 0.69
Index: 32, Similarity: 0.82
Index: 33, Similarity: 0.78
Index: 34, Similarity: 0.82
Index: 37, Similarity: 0.72
Index: 38, Similarity: 0.88
Index: 39, Similarity: 0.82
Index: 40, Similarity: 0.7
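
The filter above also matches the 100th tweet against itself and lists the hits in index order. One possible refinement drops the query row and sorts the remaining hits from most to least similar. It is sketched here on a small made-up similarity array; the values are hypothetical, and `top_similar` is an illustrative helper, not part of the notebook.

```python
import numpy as np

def top_similar(similarities, query_idx, threshold=0.65):
    """Return (index, similarity) pairs above threshold, best first, excluding the query."""
    sims = np.asarray(similarities, dtype=float)
    mask = sims > threshold
    mask[query_idx] = False              # drop the self-match
    hits = np.where(mask)[0]
    order = np.argsort(-sims[hits])      # sort descending by similarity
    return [(int(i), float(sims[i])) for i in hits[order]]

# Toy similarity scores (assumed values); index 2 plays the role of the query tweet
toy_similarities = np.array([0.80, 0.40, 1.00, 0.66, 0.65, 0.91])
results = top_similar(toy_similarities, query_idx=2)
for idx, sim in results:
    print(f"Index: {idx}, Similarity: {sim:.2f}")
```

With the notebook's variables the call would be `top_similar(similarities, query_idx=99)`.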

## 7. Compute the corpus vector that is equal to the average of all the document vectors, where each document corresponds to a tweet or a row in this dataset.

In [None]:
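# Compute the corpus vector as the mean of all TF-IDF document vectors.
# Averaging the sparse matrix directly avoids the dense copy that
# X_tfidf.toarray() would allocate for the whole corpus.
corpus_vector = np.asarray(X_tfidf.mean(axis=0)).ravel()

# Show only the first 10 elements to keep the output short
print("Computed corpus vector (first 10 elements):")
print("-----------------------------------------------------------")
print(corpus_vector[:10])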
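
Densifying the full TF-IDF matrix can exhaust memory on a corpus of this size, and a NumPy array has no `.head()` method, so slicing is the way to preview it. The sketch below uses a small made-up sparse matrix (toy values, not the notebook's data) to check that averaging the sparse matrix directly gives the same column means as the dense route:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy "document-term" matrix: 3 documents, 4 terms (assumed values)
toy = csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]))

# Mean over documents, computed without converting the matrix to dense
sparse_mean = np.asarray(toy.mean(axis=0)).ravel()

# Dense reference, only feasible here because the toy matrix is tiny
dense_mean = toy.toarray().mean(axis=0)

# Preview by slicing
print(sparse_mean[:10])
```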