# Final Project - Text Mining

## Name: Cailean
## *IS 5150*

The purpose of the final project is to combine skills you have learned in each of the previous units into one finished project.

**Unit 1**
* Basic Text Statistics
* **NLP Pipeline (Preprocessing \& Normalization)**

**Unit 2**
* **Feature Engineering**
* Text Classification
* **Topic Modeling**

**Unit 3**
* Document Summarization
* Text Similarity
* **Document Clustering**

**Unit 4**
* Semantic Analysis
* **Sentiment Analysis**



In [None]:
# load dependencies

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import nltk, re, pprint

from urllib import request
from bs4 import BeautifulSoup #needed for parsing HTML

# !pip install contractions
# import contractions #contractions dictionary
from string import punctuation

import spacy #used for lemmatization/stemming
#python -m spacy download en_core_web_sm
#or in Jupyter download in terminal using spacy download en_core_web_sm

from nltk.tokenize.toktok import ToktokTokenizer
tokenizer = ToktokTokenizer() #stopword removal
from nltk import word_tokenize
import os

from sklearn.decomposition import LatentDirichletAllocation             # import LatentDirichletAllocation from sklearn.decomposition

from scipy.cluster.hierarchy import fcluster, dendrogram, linkage       # import fcluster, dendrogram, and linkage from scipy.cluster.hierarchy

pd.options.display.max_colwidth = 200



In [1]:
# # load text data
# This code is commented out, since it is no longer useful, but provides insight on how I managed the data

# csv_collection = [] # creating empty list
# for filename in os.listdir('C:\\Users\\lando\\Downloads\\data'): # for loop going through csvs saved in downloads

#         fullpath = 'C:\\Users\\lando\\Downloads\\data\\' + filename
#         csv_collection.append(fullpath) #appending the csvs to the list

# import csv
# import gc
# from pathlib import Path

# columns = ["text", "location", "tweetcreatedts", "language", "is_quote_status"] # columns I am taking from the csvs


# csv_collection.sort()
# dataframe_collection = []
# for i, csvfile in enumerate(csv_collection): # enumerating over the csvs and reading them into python
#     df = pd.read_csv(csvfile, engine='python', compression='gzip',encoding='utf-8', quoting=csv.QUOTE_ALL, usecols=lambda x: x in columns)
#     df.query('language == "en"', inplace=True) # filtering the English observations
#     df.to_csv(f'clean\\df{i}.csv.gzip', compression='gzip', index=False)
#     dataframe_collection.append(df) #appending them all to the list of dataframes
#     print(csvfile)

# df = None
# del df
# gc.collect()

In [None]:
df_combined = pd.concat(dataframe_collection, axis=0) # concatenating into one dataframe

In [None]:
df_combined['language'].unique()

array(['en'], dtype=object)

In [None]:
import gc # otherwise only imported in the commented-out loading cell

dataframe_collection = None # this is done to free up space.
del dataframe_collection
gc.collect()

0

In [None]:

df_combined.drop('language', axis=1, inplace = True) #dropping language from concatenated data.

df = df_combined


In [None]:
df['date'] = pd.to_datetime(df["tweetcreatedts"]) # converting to datetime

In [None]:
df.drop('tweetcreatedts', axis=1, inplace = True) # now dropping time stamp

In [None]:
df.head()

Unnamed: 0,location,text,is_quote_status,date
0,Hawaii,⚡The Ukrainian Air Force would like to address...,,2022-04-01
1,,Chernihiv oblast. Ukrainians welcome their lib...,,2022-04-01
2,,America 🇺🇸 is preparing for something worse th...,,2022-04-01
3,International Web Zone,JUST IN: #Anonymous has hacked &amp; released ...,,2022-04-01
4,Hunter Account,***PUBLIC MINT NOW LIVE***\n\nFor \n@billionai...,,2022-04-01


In [None]:
df.isna().sum()

location            8236404
text                      0
is_quote_status    16158908
date                      0
dtype: int64

In [None]:
df.shape
#dropping nas

df.drop('is_quote_status', axis = 1, inplace = True) #dropping is quote status because there is a lot of missing data, and probably not helpful for analysis at this point.


In [None]:
# writing to parquet for convenience in future
df.to_parquet("final_project_data.parquet") #writing to parquet file, so I don't have to go through the process of data collection again.




I got the Ukraine Conflict data set from [Kaggle](https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows/discussion). The data set had roughly 20 columns, including account description, followers, userid, tweetid, following, retweet, text, location, language, and timestamp. Per its creator, the data set was intended for text analysis, more specifically sentiment analysis. Tweets are scraped every day and stored in CSV files, and this has been going on since the start of the conflict. The data set is quite large; when I first concatenated the months February, March, April, October, and November, I had over 19 million observations. If I tried to work with the data from February to the present, this number would grow considerably.

The objective I had in mind when picking this project was to see whether people's tweets on the Russia-Ukraine conflict could offer any insight into public opinion. To do this, I plan on using topic modeling and sentiment analysis. In the future, it would be interesting to see how tweets cluster by location, which topics were retweeted, and whether the retweeted topics were similar to one another.

### Unit 1: At least NLP Pipeline, optional to produce basic text statistics

In [None]:
import polars as pl
df = pl.read_parquet('final_project_data.parquet')


In [None]:
from datetime import datetime

march = datetime(2022, 3, 31, 23, 59, 59)
november = datetime(2022, 11, 1)

In [None]:
df = df.filter(
    (pl.col('date') <= pl.lit(march)) | \
    (pl.col('date') >= pl.lit(november))
)

In [None]:
df = df.lazy().sort('date').drop('__index_level_0__').select([
    pl.all().take_every(25) #Taking every 25th tweet since the data set is so huge
]).collect()
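Keeping every 25th row after sorting by date is a form of systematic sampling: it thins the data evenly across the timeline instead of favoring any one period. The sketch below illustrates the idea with plain Python on made-up rows; the `rows` list and the `systematic_sample` helper are hypothetical and not part of the notebook.

```python
from datetime import datetime, timedelta

def systematic_sample(records, step):
    """Sort records by their 'date' key and keep every `step`-th one."""
    ordered = sorted(records, key=lambda r: r["date"])
    return ordered[::step]

# Hypothetical data: 100 tweets, one per minute.
start = datetime(2022, 2, 24)
rows = [{"date": start + timedelta(minutes=i), "text": f"tweet {i}"} for i in range(100)]

sample = systematic_sample(rows, 25)
print(len(sample))                    # number of rows kept
print([r["text"] for r in sample])    # which rows survived
```

Because the rows are sorted first, the sample keeps the same relative density of early and late tweets as the full data.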

In [None]:
df.write_parquet('final_project_data.parquet')

In [None]:
# Unit 1: NLP pipeline steps, clean and normalize text data
# put into required structure e.g., vocab list, dataframe, etc.
df = df.to_pandas()
df_text = df["text"]

In [None]:
def text_cleaner(text):
    stripped_text = re.sub(r'[\r|\n|\r\n]+', '\n', text) # collapses runs of line breaks into one newline (the character class also matches a literal '|')
    stripped_text = re.sub(r'(?<!\S)#(\S+)', '', stripped_text)
    stripped_text = re.sub(r'http(\S+)', '', stripped_text) #Making sure I take out URLs
    stripped_text = re.sub(r'’', "'", stripped_text)
    stripped_text = re.sub(r"[^'\w\s\.]+", '', stripped_text) # remove non-period punctuation
    stripped_text = re.sub(r"(\s*'\s*s)", 's', stripped_text) # possessive s
    stripped_text = re.sub(r'\d+\.|\d+', '', stripped_text) # remove digits with or without a following period
    stripped_text = re.sub(r'[A-Z]\.', '', stripped_text) # remove uppercase letters with following period
    stripped_text = re.sub(r'\s+', ' ', stripped_text) #removes extra whitespaces

    return stripped_text

corpus_cleaner = np.vectorize(text_cleaner)
df['clean_text'] = corpus_cleaner(df_text)

In [None]:
print(df)

                                                     text                date  \
0       Footage of the airport bombing in Ivano-Franki... 2022-02-24 06:48:02   
1       The simpsons predicted the crisis of #Ukraine ... 2022-02-24 06:48:05   
2       I strongly condemn #Russia’s reckless attack o... 2022-02-24 06:48:08   
3       RT, SPREAD AND SHARE, YOU CAN HELP UKRAINE #Uk... 2022-02-24 06:48:10   
4       Footage of the airport bombing in Ivano-Franki... 2022-02-24 06:48:11   
...                                                   ...                 ...   
456207  Why Did Germany Side Cover Their Mouths During... 2022-12-02 23:53:59   
456208  The defeat of the 155 enemy marine brigade / P... 2022-12-02 23:55:44   
456209  The #HighLevelBridge in #Edmonton #Alberta #Ca... 2022-12-02 23:56:53   
456210  #Ukraine: 3. UA General Staff report the repel... 2022-12-02 23:58:17   
456211  #Edward #Snowden gets #Russian #passport after... 2022-12-02 23:58:49   

                           

In [None]:
df.drop('location', axis = 1, inplace = True) #dropping location since the data is still so dense

In [None]:
def lowercase(text): #function to convert all data to lowercase.
    sents_lower = text.lower()
    return sents_lower

lowercases = np.vectorize(lowercase) # using np.vectorize so it can be applied to list of tweets.
lower_text = lowercases(df.clean_text) #applying lowercase function to clean text
lower_text

array(['footage of the airport bombing in ivanofrankivsk. ',
       'the simpsons predicted the crisis of and ',
       'i strongly condemn reckless attack on which puts at risk countless civilian lives. this is a grave breach of international law amp a serious threat to euroatlantic security. allies will meet to address russias renewed aggression. ',
       ...,
       'the in will be lit in purple for womens brain health day. womensbrains ',
       ' ua general staff report the repelling of russian advances nr settlements in regions in last hours ua forces downing another orlan drone amp hitting russian control points personnel and weapons. ',
       ' gets after of guardian'], dtype='<U824')
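The output above shows the effect of the hashtag rule in `text_cleaner`: the pattern `(?<!\S)#(\S+)` deletes the whole hashtag token, so "#Ukraine" and "#Russia" disappear and the second tweet reads "the crisis of and". The sketch below isolates that rule and contrasts it with a variant that drops only the `#` and keeps the word. The helper names and the variant are illustrative assumptions, not something the notebook does.

```python
import re

HASHTAG = re.compile(r'(?<!\S)#(\S+)')

def drop_hashtags(text):
    """Behaviour of the notebook's rule: remove the entire hashtag token."""
    return HASHTAG.sub('', text)

def keep_hashtag_words(text):
    """Hypothetical alternative: strip only the '#', keep the word."""
    return HASHTAG.sub(r'\1', text)

tweet = "The simpsons predicted the crisis of #Ukraine and #Russia"
dropped = re.sub(r'\s+', ' ', drop_hashtags(tweet)).strip()
kept = re.sub(r'\s+', ' ', keep_hashtag_words(tweet)).strip()
print(dropped)
print(kept)
```

Which behavior is preferable depends on the analysis: dropping hashtags removes repetitive tags, while keeping the words preserves place names that tweets often mention only as hashtags.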

In [None]:
contractions_dict = {"I'm": 'I am',
 "I'm'a": 'I am about to',
 "I'm'o": 'I am going to',
 "I've": 'I have',
 "I'll": 'I will',
 "I'll've": 'I will have',
 "I'd": 'I would',
 "I'd've": 'I would have',
 'Whatcha': 'What are you',
 "amn't": 'am not',
 "ain't": 'are not',
 "aren't": 'are not',
 "'cause": 'because',
 "can't": 'cannot',
 "can't've": 'cannot have',
 "could've": 'could have',
 "couldn't": 'could not',
 "couldn't've": 'could not have',
 "daren't": 'dare not',
 "daresn't": 'dare not',
 "dasn't": 'dare not',
 "didn't": 'did not',
 'didn’t': 'did not',
 "don't": 'do not',
 'don’t': 'do not',
 "doesn't": 'does not',
 "e'er": 'ever',
 "everyone's": 'everyone is',
 'finna': 'fixing to',
 'gimme': 'give me',
 "gon't": 'go not',
 'gonna': 'going to',
 'gotta': 'got to',
 "hadn't": 'had not',
 "hadn't've": 'had not have',
 "hasn't": 'has not',
 "haven't": 'have not',
 "he've": 'he have',
 "he's": 'he is',
 "he'll": 'he will',
 "he'll've": 'he will have',
 "he'd": 'he would',
 "he'd've": 'he would have',
 "here's": 'here is',
 "how're": 'how are',
 "how'd": 'how did',
 "how'd'y": 'how do you',
 "how's": 'how is',
 "how'll": 'how will',
 "isn't": 'is not',
 "it's": 'it is',
 "'tis": 'it is',
 "'twas": 'it was',
 "it'll": 'it will',
 "it'll've": 'it will have',
 "it'd": 'it would',
 "it'd've": 'it would have',
 'kinda': 'kind of',
 "let's": 'let us',
 'luv': 'love',
 "ma'am": 'madam',
 "may've": 'may have',
 "mayn't": 'may not',
 "might've": 'might have',
 "mightn't": 'might not',
 "mightn't've": 'might not have',
 "must've": 'must have',
 "mustn't": 'must not',
 "mustn't've": 'must not have',
 "needn't": 'need not',
 "needn't've": 'need not have',
 "ne'er": 'never',
 "o'": 'of',
 "o'clock": 'of the clock',
 "ol'": 'old',
 "oughtn't": 'ought not',
 "oughtn't've": 'ought not have',
 "o'er": 'over',
 "shan't": 'shall not',
 "sha'n't": 'shall not',
 "shalln't": 'shall not',
 "shan't've": 'shall not have',
 "she's": 'she is',
 "she'll": 'she will',
 "she'd": 'she would',
 "she'd've": 'she would have',
 "should've": 'should have',
 "shouldn't": 'should not',
 "shouldn't've": 'should not have',
 "so've": 'so have',
 "so's": 'so is',
 "somebody's": 'somebody is',
 "someone's": 'someone is',
 "something's": 'something is',
 'sux': 'sucks',
 "that're": 'that are',
 "that's": 'that is',
 "that'll": 'that will',
 "that'd": 'that would',
 "that'd've": 'that would have',
 'em': 'them',
 "there're": 'there are',
 "there's": 'there is',
 "there'll": 'there will',
 "there'd": 'there would',
 "there'd've": 'there would have',
 "these're": 'these are',
 "they're": 'they are',
 "they've": 'they have',
 "they'll": 'they will',
 "they'll've": 'they will have',
 "they'd": 'they would',
 "they'd've": 'they would have',
 "this's": 'this is',
 "this'll": 'this will',
 "this'd": 'this would',
 "those're": 'those are',
 "to've": 'to have',
 'wanna': 'want to',
 "wasn't": 'was not',
 "we're": 'we are',
 "we've": 'we have',
 "we'll": 'we will',
 "we'll've": 'we will have',
 "we'd": 'we would',
 "we'd've": 'we would have',
 "weren't": 'were not',
 "what're": 'what are',
 "what'd": 'what did',
 "what've": 'what have',
 "what's": 'what is',
 "what'll": 'what will',
 "what'll've": 'what will have',
 "when've": 'when have',
 "when's": 'when is',
 "where're": 'where are',
 "where'd": 'where did',
 "where've": 'where have',
 "where's": 'where is',
 "which's": 'which is',
 "who're": 'who are',
 "who've": 'who have',
 "who's": 'who is',
 "who'll": 'who will',
 "who'll've": 'who will have',
 "who'd": 'who would',
 "who'd've": 'who would have',
 "why're": 'why are',
 "why'd": 'why did',
 "why've": 'why have',
 "why's": 'why is',
 "will've": 'will have',
 "won't": 'will not',
 "won't've": 'will not have',
 "would've": 'would have',
 "wouldn't": 'would not',
 "wouldn't've": 'would not have',
 "y'all": 'you all',
 "y'all're": 'you all are',
 "y'all've": 'you all have',
 "y'all'd": 'you all would',
 "y'all'd've": 'you all would have',
 "you're": 'you are',
 "you've": 'you have',
 "you'll've": 'you shall have',
 "you'll": 'you will',
 "you'd": 'you would',
 "you'd've": 'you would have',
 'to cause': 'to cause',
 'will cause': 'will cause',
 'should cause': 'should cause',
 'would cause': 'would cause',
 'can cause': 'can cause',
 'could cause': 'could cause',
 'must cause': 'must cause',
 'might cause': 'might cause',
 'shall cause': 'shall cause',
 'may cause': 'may cause',
 'jan.': 'january',
 'feb.': 'february',
 'mar.': 'march',
 'apr.': 'april',
 'jun.': 'june',
 'jul.': 'july',
 'aug.': 'august',
 'sep.': 'september',
 'oct.': 'october',
 'nov.': 'november',
 'dec.': 'december',
 'I’m': 'I am',
 'I’m’a': 'I am about to',
 'I’m’o': 'I am going to',
 'I’ve': 'I have',
 'I’ll': 'I will',
 'I’ll’ve': 'I will have',
 'I’d': 'I would',
 'I’d’ve': 'I would have',
 'amn’t': 'am not',
 'ain’t': 'are not',
 'aren’t': 'are not',
 '’cause': 'because',
 'can’t': 'cannot',
 'can’t’ve': 'cannot have',
 'could’ve': 'could have',
 'couldn’t': 'could not',
 'couldn’t’ve': 'could not have',
 'daren’t': 'dare not',
 'daresn’t': 'dare not',
 'dasn’t': 'dare not',
 'doesn’t': 'does not',
 'e’er': 'ever',
 'everyone’s': 'everyone is',
 'gon’t': 'go not',
 'hadn’t': 'had not',
 'hadn’t’ve': 'had not have',
 'hasn’t': 'has not',
 'haven’t': 'have not',
 'he’ve': 'he have',
 'he’s': 'he is',
 'he’ll': 'he will',
 'he’ll’ve': 'he will have',
 'he’d': 'he would',
 'he’d’ve': 'he would have',
 'here’s': 'here is',
 'how’re': 'how are',
 'how’d': 'how did',
 'how’d’y': 'how do you',
 'how’s': 'how is',
 'how’ll': 'how will',
 'isn’t': 'is not',
 'it’s': 'it is',
 '’tis': 'it is',
 '’twas': 'it was',
 'it’ll': 'it will',
 'it’ll’ve': 'it will have',
 'it’d': 'it would',
 'it’d’ve': 'it would have',
 'let’s': 'let us',
 'ma’am': 'madam',
 'may’ve': 'may have',
 'mayn’t': 'may not',
 'might’ve': 'might have',
 'mightn’t': 'might not',
 'mightn’t’ve': 'might not have',
 'must’ve': 'must have',
 'mustn’t': 'must not',
 'mustn’t’ve': 'must not have',
 'needn’t': 'need not',
 'needn’t’ve': 'need not have',
 'ne’er': 'never',
 'o’': 'of',
 'o’clock': 'of the clock',
 'ol’': 'old',
 'oughtn’t': 'ought not',
 'oughtn’t’ve': 'ought not have',
 'o’er': 'over',
 'shan’t': 'shall not',
 'sha’n’t': 'shall not',
 'shalln’t': 'shall not',
 'shan’t’ve': 'shall not have',
 'she’s': 'she is',
 'she’ll': 'she will',
 'she’d': 'she would',
 'she’d’ve': 'she would have',
 'should’ve': 'should have',
 'shouldn’t': 'should not',
 'shouldn’t’ve': 'should not have',
 'so’ve': 'so have',
 'so’s': 'so is',
 'somebody’s': 'somebody is',
 'someone’s': 'someone is',
 'something’s': 'something is',
 'that’re': 'that are',
 'that’s': 'that is',
 'that’ll': 'that will',
 'that’d': 'that would',
 'that’d’ve': 'that would have',
 'there’re': 'there are',
 'there’s': 'there is',
 'there’ll': 'there will',
 'there’d': 'there would',
 'there’d’ve': 'there would have',
 'these’re': 'these are',
 'they’re': 'they are',
 'they’ve': 'they have',
 'they’ll': 'they will',
 'they’ll’ve': 'they will have',
 'they’d': 'they would',
 'they’d’ve': 'they would have',
 'this’s': 'this is',
 'this’ll': 'this will',
 'this’d': 'this would',
 'those’re': 'those are',
 'to’ve': 'to have',
 'wasn’t': 'was not',
 'we’re': 'we are',
 'we’ve': 'we have',
 'we’ll': 'we will',
 'we’ll’ve': 'we will have',
 'we’d': 'we would',
 'we’d’ve': 'we would have',
 'weren’t': 'were not',
 'what’re': 'what are',
 'what’d': 'what did',
 'what’ve': 'what have',
 'what’s': 'what is',
 'what’ll': 'what will',
 'what’ll’ve': 'what will have',
 'when’ve': 'when have',
 'when’s': 'when is',
 'where’re': 'where are',
 'where’d': 'where did',
 'where’ve': 'where have',
 'where’s': 'where is',
 'which’s': 'which is',
 'who’re': 'who are',
 'who’ve': 'who have',
 'who’s': 'who is',
 'who’ll': 'who will',
 'who’ll’ve': 'who will have',
 'who’d': 'who would',
 'who’d’ve': 'who would have',
 'why’re': 'why are',
 'why’d': 'why did',
 'why’ve': 'why have',
 'why’s': 'why is',
 'will’ve': 'will have',
 'won’t': 'will not',
 'won’t’ve': 'will not have',
 'would’ve': 'would have',
 'wouldn’t': 'would not',
 'wouldn’t’ve': 'would not have',
 'y’all': 'you all',
 'y’all’re': 'you all are',
 'y’all’ve': 'you all have',
 'y’all’d': 'you all would',
 'y’all’d’ve': 'you all would have',
 'you’re': 'you are',
 'you’ve': 'you have',
 'you’ll’ve': 'you shall have',
 'you’ll': 'you will',
 'you’d': 'you would',
 'you’d’ve': 'you would have'}

In [None]:
def expand_contractions(text):
    expanded_words = [] #empty list for expanded words
    for word in text.split():
        expanded_words.append(contractions_dict.get(word, word)) #look each word up in the contractions dictionary from the cell above; unmatched words pass through unchanged
    expanded_text = ' '.join(expanded_words) #joining the expanded words back together.
    return expanded_text

expand_corpus = np.vectorize(expand_contractions)
expanded_text = expand_corpus(lower_text)

expanded_text

array(['footage of the airport bombing in ivanofrankivsk.',
       'the simpsons predicted the crisis of and',
       'i strongly condemn reckless attack on which puts at risk countless civilian lives. this is a grave breach of international law amp a serious threat to euroatlantic security. allies will meet to address russias renewed aggression.',
       ...,
       'the in will be lit in purple for womens brain health day. womensbrains',
       'ua general staff report the repelling of russian advances nr settlements in regions in last hours ua forces downing another orlan drone amp hitting russian control points personnel and weapons.',
       'gets after of guardian'], dtype='<U823')
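One detail worth noting: `expand_contractions` uses an exact dictionary lookup, and the text was lower-cased in the previous step, so capitalized keys such as "I'm" can never match a lower-cased token like "i'm". The sketch below shows the difference on a tiny stand-in dictionary. The `expand_contractions_ci` helper and the three-entry dictionary are illustrative assumptions, not the notebook's code.

```python
# Tiny stand-in for the notebook's contractions_dict (illustrative subset).
contractions = {"I'm": "I am", "don't": "do not", "it's": "it is"}

# Lower-case the keys once so lookups also work on lower-cased text.
lowered = {key.lower(): value.lower() for key, value in contractions.items()}

def expand_contractions_ci(text, mapping=lowered):
    """Expand contractions token by token, ignoring case."""
    return ' '.join(mapping.get(word.lower(), word) for word in text.split())

sentence = "i'm sure it's fine"
exact = ' '.join(contractions.get(w, w) for w in sentence.split())  # exact-match lookup
print(exact)
print(expand_contractions_ci(sentence))
```

An alternative with the same effect would be to expand contractions before lower-casing the text.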

In [None]:
nltk.download('stopwords')
stopword_list = nltk.corpus.stopwords.words('english') #After looking through tweets, I didn't feel it was important to remove any additional stopwords.


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lando\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [None]:
def remove_stopwords(text, is_lower_case=False): # Taking out words that show up often.
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

In [None]:
stop_corpus = np.vectorize(remove_stopwords)
stopped_text = stop_corpus(expanded_text, is_lower_case = True)
stopped_text


array(['footage airport bombing ivanofrankivsk .',
       'simpsons predicted crisis',
       'strongly condemn reckless attack puts risk countless civilian lives. grave breach international law amp serious threat euroatlantic security. allies meet address russias renewed aggression .',
       ..., 'lit purple womens brain health day. womensbrains',
       'ua general staff report repelling russian advances nr settlements regions last hours ua forces downing another orlan drone amp hitting russian control points personnel weapons .',
       'gets guardian'], dtype='<U756')
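The output above still contains standalone "." tokens, and with several hundred thousand tweets the membership test against a Python list is repeated for every token. A minimal sketch of one way to address both, assuming whitespace tokenization is acceptable: store the stopwords in a set and skip tokens made only of punctuation. The small `STOPWORDS` set and the `remove_stopwords_fast` helper are hypothetical; the notebook itself uses nltk's English list with `ToktokTokenizer`.

```python
import string

# Illustrative subset; the notebook uses nltk's English stopword list.
STOPWORDS = frozenset({"of", "the", "in", "and", "to", "a", "is", "this"})
PUNCT = set(string.punctuation)

def remove_stopwords_fast(text, stopwords=STOPWORDS):
    """Drop stopwords and punctuation-only tokens from whitespace-split text."""
    kept = []
    for token in text.split():
        if token.lower() in stopwords:
            continue  # set membership is a constant-time lookup
        if all(ch in PUNCT for ch in token):
            continue  # e.g. a lone "." left over from tokenization
        kept.append(token)
    return ' '.join(kept)

result = remove_stopwords_fast("footage of the airport bombing in ivanofrankivsk .")
print(result)
```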

In [None]:
df = stopped_text

In [None]:
df[:5] # Checking cleaned tweets

array(['footage airport bombing ivanofrankivsk .',
       'simpsons predicted crisis',
       'strongly condemn reckless attack puts risk countless civilian lives. grave breach international law amp serious threat euroatlantic security. allies meet address russias renewed aggression .',
       'rt spread share help ukraine',
       'footage airport bombing ivanofrankivsk .'], dtype='<U756')

In [None]:
df = pd.read_parquet('final_project_data.parquet') #reading in data

In [None]:
df.drop('location', axis='columns', inplace=True)
df['cleaned_text'] = stopped_text
df.to_parquet('final_project_data.parquet', index=False)

In [None]:
df.head()

Unnamed: 0,text,date,cleaned_text
0,Footage of the airport bombing in Ivano-Franki...,2022-02-24 06:48:02,footage airport bombing ivanofrankivsk .
1,The simpsons predicted the crisis of #Ukraine ...,2022-02-24 06:48:05,simpsons predicted crisis
2,I strongly condemn #Russia’s reckless attack o...,2022-02-24 06:48:08,strongly condemn reckless attack puts risk cou...
3,"RT, SPREAD AND SHARE, YOU CAN HELP UKRAINE #Uk...",2022-02-24 06:48:10,rt spread share help ukraine
4,Footage of the airport bombing in Ivano-Franki...,2022-02-24 06:48:11,footage airport bombing ivanofrankivsk .


### Preprocessing steps


I made sure to expand contractions and remove stopwords, special characters, links, hashtags, etc. Beyond some additional regex to make sure URLs were removed, I did not have to take any unusual cleaning steps. I was fairly thorough in the steps above, so additional cleaning should not be necessary for my analysis. The cleaned data is saved.

### Unit 2: At least feature engineering, optional to run topic modeling or text classification

In [None]:
df = pd.read_parquet('final_project_data.parquet')

In [None]:
df.head()

In [None]:
# feature engineering
from sklearn.feature_extraction.text import TfidfVectorizer
# tf = TfidfVectorizer() # todo parameters
# tf.fit_transform(df['cleaned_text']).toarray()


tv = TfidfVectorizer(min_df=0., max_df=1., norm = 'l2',
                    use_idf = True, smooth_idf = True)                                          # set parameters of tf-idf vectorizer
tv_matrix = tv.fit_transform(df.cleaned_text)                                                       # apply tf-idf to the cleaned text
vocab = tv.get_feature_names_out()                                                              # apply vocab labels




In [None]:
from datetime import datetime

In [None]:
df.set_index('date', inplace = True)

In [None]:
start = datetime.fromisoformat('2022-02-24') #Limiting the data to march instead of april
end = datetime.fromisoformat('2022-03-30')
df_1 = df.loc[start:end] #breaking up dataset

In [None]:
start = datetime.fromisoformat('2022-11-01') # limiting data to November through early December
end = datetime.fromisoformat('2022-12-04')
df_2 = df.loc[start:end] #df_2 is the data set for tweets from the present.

In [None]:
df_1.shape, df_2.shape

((400797, 2), (36439, 2))

In [None]:
# feature engineering
from sklearn.feature_extraction.text import TfidfVectorizer
tv_1 = TfidfVectorizer(min_df=0.005, max_df=1., norm = 'l2',
                    use_idf = True, smooth_idf = True,
                    ngram_range=(1, 2))  #used bigrams                                         # set parameters of tf-idf vectorizer
tv_matrix_1 = tv_1.fit_transform(df_1.cleaned_text)                                                       # apply tf-idf to the cleaned text of df_1
vocab = tv_1.get_feature_names_out()                                                              # apply vocab labels

tv_matrix_1.shape


(400797, 506)

In [None]:
vocab

In [None]:
# feature engineering
# from sklearn.feature_extraction.text import TfidfVectorizer
tv_2 = TfidfVectorizer(min_df=0.005, max_df=1., norm = 'l2',
                    use_idf = True, smooth_idf = True,
                    ngram_range = (1, 2))                                          # set parameters of tf-idf vectorizer
tv_matrix_2 = tv_2.fit_transform(df_2.cleaned_text)                                                       # apply tf-idf to the cleaned text of df_2
vocab = tv_2.get_feature_names_out()                                                              # apply vocab labels

tv_matrix_2.shape


(36439, 350)
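When `min_df` is a float, scikit-learn reads it as a proportion of documents, so `min_df=0.005` keeps only terms that appear in at least 0.5% of the tweets. That is why the vocabularies are so small (506 and 350 features). The arithmetic below turns the proportion into a minimum document count for the two shapes reported above; the `min_doc_count` helper is a hypothetical name used only for this illustration.

```python
import math

def min_doc_count(min_df, n_docs):
    """Smallest document frequency that survives a proportional min_df cutoff."""
    return math.ceil(min_df * n_docs)

cutoffs = {name: min_doc_count(0.005, n_docs)
           for name, n_docs in [("df_1", 400_797), ("df_2", 36_439)]}
print(cutoffs)
```

A unigram or bigram therefore has to occur in roughly two thousand early-war tweets, or a couple of hundred November tweets, to enter the vocabulary.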

In [None]:
vocab

### Which type of feature engineering did I do?

I decided to go with TF-IDF for a couple of reasons. The most important is that TF-IDF helps judge the topic of a piece of text from the words it contains: each word is weighted by how often it appears in a tweet and discounted by how many tweets contain it, so the weight measures relevance rather than raw frequency. Word2vec, by contrast, produces one dense vector per word, whereas TF-IDF produces one score per word per document. Since each tweet can contain at most 280 characters, discerning the weight of each word was important to me in order to perform topic modeling and sentiment analysis.

Overall, TF-IDF satisfied my analysis needs. I also decided to include bigrams since each tweet is relatively short.
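To make the weighting concrete, here is a small hand computation that follows the settings used above (`use_idf=True`, `smooth_idf=True`, `norm='l2'`): the raw term count is multiplied by `ln((1 + n) / (1 + df)) + 1` and each row is then scaled to unit length. The three-document corpus and the `tfidf_row` helper are made up for illustration, and the sketch uses whitespace tokens rather than the vectorizer's own tokenizer.

```python
import math
from collections import Counter

docs = ["russian forces near kyiv", "support ukraine", "russian forces destroyed"]  # toy corpus
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """Raw count times smoothed IDF, then L2-normalised."""
    counts = Counter(tokens)
    weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()}

row = tfidf_row(tokenized[0])
for term, weight in sorted(row.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the first toy document, "near" and "kyiv" occur in only one document and so receive a higher weight than "russian" and "forces", which occur in two.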

### Analysis One: Topic Modeling

## Topic Modeling

In [None]:
lda = LatentDirichletAllocation(n_components=5, max_iter=50, random_state=42)                     # set parameters of LDA
dt_matrix = lda.fit_transform(tv_matrix_1)                                                              # fit LDA to the TF-IDF matrix for df_1
features = pd.DataFrame(dt_matrix, columns = ['topic_1', 'topic_2', 'topic_3', 'topic_4', 'topic_5'])                       # show document-topic matrix as DF
features


Unnamed: 0,topic_1,topic_2,topic_3,topic_4,topic_5
0,0.082844,0.665201,0.083599,0.082844,0.085513
1,0.100000,0.100000,0.101217,0.101859,0.596924
2,0.229913,0.045524,0.046339,0.630101,0.048122
3,0.724357,0.068166,0.069338,0.069072,0.069068
4,0.082844,0.665201,0.083599,0.082844,0.085513
...,...,...,...,...,...
400792,0.055265,0.604104,0.228524,0.056045,0.056062
400793,0.051304,0.051635,0.451492,0.050703,0.394866
400794,0.058051,0.058199,0.058490,0.057939,0.767321
400795,0.060642,0.061002,0.061595,0.580542,0.236218


In [None]:
lda2 = LatentDirichletAllocation(n_components =5, max_iter=50, random_state=42)                     # set parameters of LDA
dt_matrix = lda2.fit_transform(tv_matrix_2)                                                              # fit LDA to the TF-IDF matrix for df_2
features_2 = pd.DataFrame(dt_matrix, columns = ['topic_1', 'topic_2', 'topic_3', 'topic_4', 'topic_5'])                       # show document-topic matrix as DF
features_2


In [None]:
tt_matrix = lda2.components_

vocab2 = np.array(tv_2.get_feature_names_out())
for i, topic_weights in enumerate(tt_matrix): #enumerating over topics to get the 15 top words in each topic.
    print("Topic #" + str(i + 1))
    biggest_weight_inds = list(reversed(np.argsort(topic_weights)[-15:]))
    print(vocab2[biggest_weight_inds])
    print()

Topic #1
['russia' 'putin' 'russian' 'like' 'ukraine' 'get' 'help' 'us' 'need'
 'ukrainian' 'attack' 'country' 'people' 'money' 'way']

Topic #2
['ukraine' 'time' 'war' 'video' 'one' 'today' 'best' 'russian' 'love'
 'archive' 'archived' 'propaganda' 'minister' 'trump' 'archive video']

Topic #3
['live' 'world' 'link' 'germany' 'vs' 'cup' 'world cup' 'soldiers'
 'ukrainian' 'people' 'live link' 'russian' 'states' 'united' 'read']

Topic #4
['amp' 'ukraine' 'region' 'support' 'war' 'stop' 'must' 'november' 'peace'
 'new' 'end' 'russia' 'well' 'nato' 'iran']

Topic #5
['russian' 'via' 'ukraine' 'day' 'power' 'good' 'war' 'forces' 'near'
 'kherson' 'video' 'new' 'check' 'morning' 'amp']
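The loop above ranks the vocabulary by weight within each row of the topic-term matrix and prints the 15 heaviest terms. The same idea on a toy matrix, written without NumPy, might look like the sketch below; the vocabulary, the weights, and the `top_terms` helper are invented for illustration.

```python
import heapq

vocab = ["russia", "putin", "world cup", "kherson", "peace"]   # toy vocabulary
topic_term = [                                                   # toy topic-term weights
    [5.0, 4.0, 0.1, 0.2, 0.3],
    [0.2, 0.1, 6.0, 0.3, 0.5],
]

def top_terms(weights, vocabulary, k):
    """Return the k highest-weighted terms for one topic, best first."""
    best = heapq.nlargest(k, range(len(weights)), key=weights.__getitem__)
    return [vocabulary[i] for i in best]

for i, weights in enumerate(topic_term, start=1):
    print(f"Topic #{i}: {top_terms(weights, vocab, 2)}")
```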



In [None]:
labels = ['russia', 'video', 'world cup', 'peace', 'power'] # applying labels to topics

corpus = np.array(df_2['text'])
labels = [labels[topic] for topic in features_2.values.argmax(axis = 1)]
corpus_df = pd.DataFrame({'Document': corpus, 'Category':labels}) #adding category of topics to corpus df

In [None]:
corpus_df

Unnamed: 0,Document,Category
0,#Ukrainian Soldiers Tell the REALITY of What Made the Kharkiv Offensive Work! https://t.co/JqyBCiCa5z \n\n#UkraineRussianWar #UkraineWillWin,power
1,"@rwalsh777 That’s exactly the whole issue, public safety vs private rights\nIts a tougher balance than most would like to admit\nEveryone has rights &amp; all have freedoms, but sometimes they cla...",russia
2,Thought provoking read. As in how can we resolve the pain for the survivors and not only how to prevent it now. Psychological trauma will be an immense issue.\n#RussiaIsATerroristState\nhttps://t....,world cup
3,Narcissists and #StandWithUkraine https://t.co/Sq5B8nr4kY via @YouTube,power
4,@Sleuthteller @lakecitygirl @liberal_party Please share far and wide ✔️\nWakey Wakey 👀 \nResearch isn't hard Enjoy mine 💯✅️\n#Ukraine thread #Canada #USA \nMake your own conclusions \n#JustinTrude...,russia
...,...,...
36434,Why Did Germany Side Cover Their Mouths During Team Photo?\n#Austria #Germany #Ukraine #Russia #البحرين https://t.co/HQKvYJtzUD,world cup
36435,The defeat of the 155 enemy marine brigade / Part 1\n#Canada #Germany #America #Ukraine\nhttps://t.co/J8hNy5I8e0,video
36436,The #HighLevelBridge in #Edmonton #Alberta #Canada will be lit in purple for Women's Brain Health Day. \n💻https://t.co/Y24e3Dhby4 #BrainHealth @womensbrains #Yeg #LightTheBridge💟 https://t.co/t8Du...,power
36437,"#Ukraine: 3. UA General Staff report the repelling of Russian advances nr 14 settlements in #Donetsk/#Luhansk regions in last 24 hours, UA forces downing another Orlan-10 drone &amp; hitting 4 Rus...",peace
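Each tweet receives the label of its largest entry in the document-topic matrix. A minimal sketch of that assignment on toy rows follows; it also returns the winning share, which the notebook does not keep. The rows and the `dominant_label` helper are hypothetical.

```python
labels = ["russia", "video", "world cup", "peace", "power"]   # labels chosen in the cell above

# Toy document-topic rows; each row sums to 1, like an LDA document-topic matrix.
doc_topic = [
    [0.08, 0.67, 0.08, 0.08, 0.09],
    [0.10, 0.10, 0.10, 0.10, 0.60],
    [0.23, 0.05, 0.05, 0.62, 0.05],
]

def dominant_label(row, names):
    """Pick the label whose topic has the largest share in this document."""
    best_index = max(range(len(row)), key=row.__getitem__)
    return names[best_index], row[best_index]

assigned = [dominant_label(row, labels) for row in doc_topic]
for label, share in assigned:
    print(label, share)
```

Keeping the share alongside the label makes weak assignments visible: a tweet split 0.45 to 0.39 between two topics is labeled far less decisively than one at 0.72.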


In [None]:
#same as above, just running for dataframe 1
tt_matrix = lda.components_

vocab = np.array(tv_1.get_feature_names_out())
for i, topic_weights in enumerate(tt_matrix):
    print("Topic #" + str(i + 1))
    biggest_weight_inds = list(reversed(np.argsort(topic_weights)[-15:]))
    print(vocab[biggest_weight_inds])
    print()

Topic #1
['needs' 'potus' 'please' 'via' 'thank' 'defend' 'stop' 'innocent'
 'weapons' 'humanitarian' 'provide' 'follow' 'rt' 'assistance' 'civilian']

Topic #2
['russian' 'ukrainian' 'forces' 'army' 'video' 'destroyed' 'soldiers'
 'one' 'killed' 'kyiv' 'city' 'near' 'captured' 'military' 'region']

Topic #3
['war' 'putin' 'amp' 'people' 'world' 'russia' 'us' 'go' 'like' 'get'
 'would' 'ukraine' 'know' 'help' 'good']

Topic #4
['ukraine' 'president' 'invasion' 'russia' 'war' 'people' 'support'
 'russian' 'today' 'amp' 'new' 'march' 'international' 'foreign' 'russias']

Topic #5
['amp' 'ukraine' 'nato' 'russia' 'un' 'russian' 'people' 'border' 'aid'
 'security' 'close' 'troops' 'refugees' 'million' 'media']



In [None]:
labels = ['peace', 'kyiv', 'putin', 'invasion', 'nato']

corpus = np.array(df_1['text'])
labels = [labels[topic] for topic in features.values.argmax(axis = 1)]
corpus_df = pd.DataFrame({'Document': corpus, 'Category':labels})

In [None]:
corpus_df

Unnamed: 0,Document,Category
0,Footage of the airport bombing in Ivano-Frankivsk. #Ukraine #Russia https://t.co/MLVuNyPItI,kyiv
1,The simpsons predicted the crisis of #Ukraine and #Russia\n\n#Kiev #RussiaUkraineConflict https://t.co/QwM9SHuJDK,nato
2,"I strongly condemn #Russia’s reckless attack on #Ukraine, which puts at risk countless civilian lives. This is a grave breach of international law &amp; a serious threat to Euro-Atlantic security....",invasion
3,"RT, SPREAD AND SHARE, YOU CAN HELP UKRAINE #Ukraine #Russia https://t.co/4Lq1Jjs6wc",peace
4,Footage of the airport bombing in Ivano-Frankivsk. #Ukraine #Russia #TerceraGuerraMundial #rusia #ucrania #UkraineRussie https://t.co/D7fPnnlUwk,kyiv
...,...,...
400792,"This woman drove six generators from Krakow to the outskirts of #Kyiv, and just urgently unloaded one for our soldiers in Irpin. “Gotta go now, it’s very loud here,” she said. My mother is unbelie...",kyiv
400793,"Today is a good time for a reminder that #Russia will do everything to appear as if they are the good guy, humanitarian etc. They killed thousands for a needless war. They cannot gain a single ...",putin
400794,"Signing an agreement on security guarantees for #Ukraine is possible only after the return of #Russian troops to their positions as of February 23, 2022, Ukrainian permanent representative to the ...",nato
400795,"If @SenateForeign &amp; @HouseForeign stand for peace &amp; democracy, they should apply in #Tigray at least a fraction of their resolve in #Ukraine. Time to pass #S3199 &amp; #HR6600\n\n@ChrisVan...",invasion


**Report any interesting findings below, describe a practical application(s) of this method in the real-world.**

In the first dataframe, df_1, 'Kyiv' is popular in topic 2, which makes sense because Russia was on the offensive in the area until it pulled its troops out in April. In df_1, topics tended to feature calls for assistance and intervention, along with noticeably emotional reactions.
In df_2 some interesting topics come up that initially don't make sense: we see the words 'archive', 'archived', 'video', and 'archive video' in topic 2. After some additional reading, I found that there is an archive compilation of videos and pictures of the Ukraine war, which might explain the uptick of this topic on social media. In topic 3, the words 'Germany', 'world cup', and 'cup' come up often. This is interesting because apparently at the World Cup the German national team put their hands over their mouths during the national anthem to protest human rights in Qatar. My guess is people used this protest as a way to bring attention to the human rights violations going on in Ukraine.
In the 4th topic, we see mention of Iran; my guess is that trending topics tend to overlap with other, seemingly unrelated trending topics. Finally, in topic 5, we see the word 'Kherson' show up. Recently, Ukraine took back the Russian-occupied city, which could very easily be why we are seeing it as a trending tweet topic.

I think this can be very useful for anyone who wants to analyze social media, and I am glad I chose topic modeling with this particular dataset. I was surprised how dense the data was, even when I tried to filter it down. Being able to see trending topics offers insight into what matters to users on social media. As people change the way they receive their news or interact politically, I think this kind of analysis will become more and more useful.

### Unit 4: Sentiment Analysis

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [None]:
def analyze_sentiment_vader_lexicon(text,
                                    threshold=0.1,
                                    verbose=False):

    # analyze the sentiment for tweet
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(text)

    # get aggregate scores and final sentiment
    agg_score = scores['compound']
    final_sentiment = 'positive' if agg_score >= threshold\
                                   else 'negative'
    if verbose:
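        # display detailed sentiment statistics
        # (format the rounded proportions as percentages; this avoids float
        # artifacts such as 28.999999999999996% from multiplying by 100)
        positive = f"{round(scores['pos'], 2):.1%}"
        final = round(agg_score, 2)
        negative = f"{round(scores['neg'], 2):.1%}"
        neutral = f"{round(scores['neu'], 2):.1%}"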
        sentiment_frame = pd.DataFrame([[final_sentiment, final, positive,
                                        negative, neutral]],
                                        columns=pd.MultiIndex(levels=[['SENTIMENT STATS:'],
                                                                      ['Predicted Sentiment', 'Polarity Score',
                                                                       'Positive', 'Negative', 'Neutral']],
                                                              codes=[[0,0,0,0,0],[0,1,2,3,4]]))
        print(sentiment_frame)

    return final_sentiment


In [None]:
for i, tweet in enumerate(df_1['text']):
    print('Tweet:', tweet)
    pred = analyze_sentiment_vader_lexicon(tweet, threshold=0.4, verbose=True)
    print('-'*60)
    if i == 50:  # stop after printing the first 51 tweets
        break


Tweet: Footage of the airport bombing in Ivano-Frankivsk. #Ukraine #Russia https://t.co/MLVuNyPItI
     SENTIMENT STATS:                                         
  Predicted Sentiment Polarity Score Positive Negative Neutral
0            negative            0.0     0.0%     0.0%  100.0%
------------------------------------------------------------
Tweet: The simpsons predicted the crisis of #Ukraine and #Russia

#Kiev #RussiaUkraineConflict https://t.co/QwM9SHuJDK
     SENTIMENT STATS:                                         
  Predicted Sentiment Polarity Score Positive Negative Neutral
0            negative          -0.62     0.0%    27.0%   73.0%
------------------------------------------------------------
Tweet: I strongly condemn #Russia’s reckless attack on #Ukraine, which puts at risk countless civilian lives. This is a grave breach of international law &amp; a serious threat to Euro-Atlantic security. #NATO Allies will meet to address Russia’s renewed aggression. https://t.co/FP

In [None]:
for i, tweet in enumerate(df_2['text']):
    print('Tweet:', tweet)
    pred = analyze_sentiment_vader_lexicon(tweet, threshold=0.4, verbose=True)
    print('-'*60)
    if i == 50:  # stop after printing the first 51 tweets
        break

Tweet: #Ukrainian Soldiers Tell the REALITY of What Made the Kharkiv Offensive Work!  https://t.co/JqyBCiCa5z 

#UkraineRussianWar #UkraineWillWin
     SENTIMENT STATS:                                         
  Predicted Sentiment Polarity Score Positive Negative Neutral
0            negative          -0.51     0.0%    19.0%   81.0%
------------------------------------------------------------
Tweet: @rwalsh777 That’s exactly the whole issue, public safety vs private rights
Its a tougher balance than most would like to admit
Everyone has rights &amp; all have freedoms, but sometimes they clash
In a functional society agreement on limits is inherent but not easy to find
#perspective #Canada
     SENTIMENT STATS:                                         
  Predicted Sentiment Polarity Score Positive Negative Neutral
0            positive           0.75    23.0%     6.0%   71.0%
------------------------------------------------------------
Tweet: Thought provoking read. As in how can we res

In [None]:
df_2['sentiment'] = df_2['text'].apply(analyze_sentiment_vader_lexicon)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_2['sentiment'] = df_2['text'].apply(analyze_sentiment_vader_lexicon)
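
The warning above usually appears when a DataFrame was created by filtering another one and a column is then added to it. How df_2 was built is not shown in this part of the notebook, so the following is only a sketch under that assumption; the `tweets` frame, its columns, and the keyword rule are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical parent frame standing in for the full tweet table.
tweets = pd.DataFrame({'text': ['good news', 'bad news', 'bonnes nouvelles'],
                       'language': ['en', 'en', 'fr']})

# Taking an explicit copy makes the subset an independent DataFrame,
# so adding a column to it does not touch (or warn about) the parent.
subset = tweets[tweets['language'] == 'en'].copy()

# Toy stand-in for the sentiment function: a simple keyword rule.
subset['sentiment'] = subset['text'].apply(
    lambda t: 'positive' if 'good' in t else 'negative')

print(subset)
```

The assignment in the cell above still works despite the warning; calling `.copy()` when the subset is created just makes the intent explicit.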


In [None]:
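# TfidfVectorizer is the vectorizer used below; CountVectorizer was used for comparison
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer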
positive_tweets = [tweet for tweet, sentiment in zip(df_2['cleaned_text'], df_2['sentiment']) if sentiment == 'positive']
ptvf = TfidfVectorizer(min_df=0.005, max_df=1., norm = 'l2',
                    use_idf = True, smooth_idf = True,
                    ngram_range = (1, 2))
ptvf_features = ptvf.fit_transform(positive_tweets)


In [None]:
lda_tf = LatentDirichletAllocation(n_components=10, random_state=0)
lda_tf.fit(ptvf_features)

LatentDirichletAllocation(random_state=0)

In [None]:
# I am only applying the topic modeling on df_2 because in the first analysis it yielded the most interesting results, and because df_1 is 11 times larger than df_2.
!pip install pyLDAvis
import pyLDAvis
# note: newer pyLDAvis releases (3.4+) rename this module to pyLDAvis.lda_model
import pyLDAvis.sklearn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(lda_tf, ptvf_features, ptvf)



In [None]:
# take the cleaned text of the negative tweets to look at the dominant topics
negative_tweets = [tweet for tweet, sentiment in zip(df_2['cleaned_text'], df_2['sentiment']) if sentiment == 'negative']
ntvf = TfidfVectorizer(min_df=0.005, max_df=1., norm = 'l2',
                    use_idf = True, smooth_idf = True,
                    ngram_range = (1, 2))
ntvf_features = ntvf.fit_transform(negative_tweets)

In [None]:
lda_tf = LatentDirichletAllocation(n_components=10, random_state=0)
lda_tf.fit(ntvf_features)


LatentDirichletAllocation(random_state=0)

In [None]:
pyLDAvis.sklearn.prepare(lda_tf, ntvf_features, ntvf)




In [None]:
df_2['sentiment'].value_counts()


negative    24538
positive    11901
Name: sentiment, dtype: int64

### Discussion

The sentiment analysis was interesting, though here it served mostly as a way to apply topic modeling separately to positive and negative tweets. Breaking down the topic modeling by positive and negative sentiment yields interesting results and provides more insight into the general sentiment of the public. I performed the topic modeling on positive and negative tweets using both the CountVectorizer and tfidf. Ultimately I decided to keep the tfidf in my project because it provided more association between words and more impactful words in each topic. It seemed to have almost a smoothing effect.
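
To make the difference between the two vectorizers concrete, here is a minimal sketch on a made-up three-document corpus (`toy_docs` is hypothetical, not taken from the tweets). It compares the weight given to a word that appears in every document ('ukraine') with one that appears in only one ('war').

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus: 'ukraine' occurs in every document.
toy_docs = ['ukraine war', 'ukraine aid', 'ukraine peace']

# Raw counts: every word that occurs once gets the same weight.
cv = CountVectorizer()
counts = cv.fit_transform(toy_docs).toarray()

# TF-IDF with the same settings used in this notebook (l2 norm, smoothed idf).
tv = TfidfVectorizer(norm='l2', use_idf=True, smooth_idf=True)
tfidf = tv.fit_transform(toy_docs).toarray()

# Look up column indices by word in each fitted vocabulary.
count_ukraine = counts[0, cv.vocabulary_['ukraine']]
count_war = counts[0, cv.vocabulary_['war']]
tfidf_ukraine = tfidf[0, tv.vocabulary_['ukraine']]
tfidf_war = tfidf[0, tv.vocabulary_['war']]

print('counts (ukraine, war):', count_ukraine, count_war)
print('tf-idf (ukraine, war):', round(tfidf_ukraine, 3), round(tfidf_war, 3))
```

In the first document both words have the same count, but TF-IDF gives the ubiquitous word a lower weight than the rarer one. This may be part of why the tfidf topics surfaced more distinctive words.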

As you look through the different topics in the positive category, you will find that the most salient words are similar (as are their meanings). Words that come up often are peace, good, best, support, help, aid, US, Russia, Ukraine, etc. Most of these words don't stand out as meaningful.

The more illuminating insights came out of the negative category. Salient words like 'live', 'link', 'germany', 'cup', 'poland', 'missile', 'propaganda', 'archive', 'video', 'kherson', and 'nato' dominated most topics. There are more than twice as many tweets categorized as negative (24,538) as positive (11,901), and this could play a role in which words are categorized as salient in each topic.
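
One possible contributor to this imbalance is that the sentiment function labels every tweet whose compound score falls below the threshold as negative, including fully neutral ones (the first tweet printed earlier scored 0.0 and was still labeled negative). A three-way split is one way to separate these cases. The sketch below is an assumption, not what the notebook ran: `label_compound` is a hypothetical helper, and the symmetric cutoff is just one reasonable choice.

```python
def label_compound(score, threshold=0.1):
    """Map a VADER-style compound score in [-1, 1] to one of three labels."""
    if score >= threshold:
        return 'positive'
    if score <= -threshold:
        return 'negative'
    return 'neutral'  # scores strictly between -threshold and +threshold


# Compound scores taken from the printed examples earlier in this section.
for s in (0.75, 0.0, -0.62):
    print(s, label_compound(s, threshold=0.4))
```

With this split the 0.0-score tweet is labeled neutral instead of negative, which would likely shrink the negative class and change which words look salient in its topics.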

### Conclusion

Overall, I found this project very interesting. I know the analysis I performed barely scratched the surface of what could be done with this dataset. In the future, I would probably try to add back in more columns, expand my analysis to include seeing what topics were retweeted the most, clustering tweets based on location, and comparing tweets in English to tweets in other languages, since this is an international issue.