In [2]:
import pandas as pd
import numpy as np

import preprocessor as p

from sklearn.feature_extraction.text import TfidfVectorizer
import sklearn.utils as utils

**Load in dataframes**

There are four data frames, each containing tweets that were scraped for a different keyword.

Later, all the tweets go into one large document-term matrix, and I use randomized Singular Value Decomposition (SVD) to try to identify the different topics in that matrix.

In [3]:
paths = ['data/alexey.csv', 'data/cheese.csv', 'data/gamestop.csv', 'data/ML.csv']

alexey = pd.read_csv(paths[0])
cheese = pd.read_csv(paths[1])
gamestop = pd.read_csv(paths[2])
ML = pd.read_csv(paths[3])

In [4]:
cheese.head()

Unnamed: 0,username,description,location,following,followers,totaltweets,retweetcount,text,hashtags
0,26lizardking,EVERYWHERE N NOWHERE BABY,S8.....EUROPE,3029,2403,33240,0,All this on foot action has given me hard skin...,[]
1,VelvetBarstool,"Interested in health, sports, fashion, busines...","Georgia, USA",246,278,16121,0,Get ready for the elitist virtual networking e...,[]
2,betting_cheese,Cheese that is here to make you 💵 Bet mainly o...,"Cashville, SC",200,101,82,0,PoD 50-38 (56.8%) \nWe are ready for another w...,[]
3,CowboyColleen,family. dogs. cabin. country. guns. conservative.,Eastern WA/North Idaho,560,566,58749,5946,The Republican Party is no longer the “wine an...,[]
4,DaVondraRamsey,To GOD Be The Glory 🙌🏾 • Isaiah 58:11 • PharmD...,In Class 🥴,740,795,11949,0,How to spot a graduate student\n\nNobody: ...\...,[]


In [5]:
alexey.shape, cheese.shape, gamestop.shape, ML.shape

((200, 9), (200, 9), (200, 9), (200, 9))

**Drop unnecessary columns**

Keep the text column (it contains the tweet) and the hashtags column (it can be used to roughly identify the topic of the tweet).

In [6]:
alexey = alexey.loc[:, 'text': 'hashtags']
cheese = cheese.loc[:, 'text': 'hashtags']
gamestop = gamestop.loc[:, 'text': 'hashtags']
ML = ML.loc[:, 'text': 'hashtags']

In [7]:
alexey.shape, cheese.shape, gamestop.shape, ML.shape

((200, 2), (200, 2), (200, 2), (200, 2))
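The label slice `'text': 'hashtags'` works here because those two columns sit next to each other at the end of each frame. A minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration) compares the slice with an explicit column list, which does not depend on column order:

```python
import pandas as pd

# Hypothetical toy frame that mimics the column order of the scraped data.
toy = pd.DataFrame({
    'username': ['user_a', 'user_b'],
    'retweetcount': [0, 3],
    'text': ['first tweet', 'second tweet'],
    'hashtags': ['[]', '[]'],
})

# Label-based slice: keeps every column from 'text' through 'hashtags', inclusive.
by_slice = toy.loc[:, 'text':'hashtags']

# Explicit list: keeps exactly the named columns, wherever they sit in the frame.
by_name = toy[['text', 'hashtags']]

print(by_slice.equals(by_name))
```

If the scraper ever wrote a column between `text` and `hashtags`, the slice would silently keep it, while the explicit list would not.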

**Join dataframes (stack them along the index)**

In [8]:
final_df = pd.concat([alexey, cheese, gamestop, ML]).reset_index(drop=True)
final_df.shape

(800, 2)
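After stacking, the only trace of which keyword a tweet came from is its row position (0-199, 200-399, and so on). One possible variation, sketched here on hypothetical stand-in frames, tags each row with its keyword before concatenating; the `keyword` column and the `frames` dict are assumptions for illustration and are not part of the notebook:

```python
import pandas as pd

# Hypothetical stand-ins for two of the per-keyword frames.
frames = {
    'alexey': pd.DataFrame({'text': ['tweet 1', 'tweet 2'], 'hashtags': ['[]', '[]']}),
    'cheese': pd.DataFrame({'text': ['tweet 3'], 'hashtags': ['[]']}),
}

# Tag each frame with its keyword, then stack the frames and renumber the rows.
# ignore_index=True has the same effect as .reset_index(drop=True) after concat.
labeled = pd.concat(
    [df.assign(keyword=name) for name, df in frames.items()],
    ignore_index=True,
)
print(labeled)
```

Such a label could later be compared against the topics found by the decomposition.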

In [9]:
final_df.head()

Unnamed: 0,text,hashtags
0,"Alexey Navalny is a survivor and, as of Tuesda...",[]
1,"Lawyer @SobolLubov, a close associate of Russi...",[]
2,I have a hunch that Yulia's moment is coming. ...,[]
3,Vladimir Putin's approval rating has fallen fr...,[]
4,"Alexey Navalny is a survivor and, as of Tuesda...",[]


Cast text column to string

In [10]:
final_df['text'] = final_df.text.astype('string')
final_df.dtypes

text        string
hashtags    object
dtype: object

**Clean tweets**

In [11]:
p.set_options(p.OPT.URL, p.OPT.EMOJI, p.OPT.SMILEY, p.OPT.NUMBER)

final_df['text'] = final_df['text'].apply(p.clean)
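To give a feel for what this cleaning step does, here is a rough regex-based approximation. It is not the `preprocessor` library's implementation: `clean_sketch` and both patterns are hypothetical, it only handles URLs and standalone numbers (emojis and smileys are left out), and its results will differ from `p.clean` in the details.

```python
import re

# Hypothetical patterns, written for illustration only.
URL_RE = re.compile(r'https?://\S+|www\.\S+')
# A run of digits that is not glued to a word, hashtag or mention.
NUMBER_RE = re.compile(r'(?<![\w#@])\d+(?:[.,]\d+)*(?!\w)')

def clean_sketch(text):
    """Drop URLs and standalone numbers, then collapse repeated whitespace."""
    text = URL_RE.sub('', text)
    text = NUMBER_RE.sub('', text)
    return ' '.join(text.split())

example = 'Rating fell from 65% to 53% in 6 months https://t.co/abc #100daysofcode'
print(clean_sketch(example))
```

Note that removing numbers leaves fragments such as `from % to %` behind, which is also visible in some of the cleaned tweets in this notebook.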

In [12]:
final_df.head()

Unnamed: 0,text,hashtags
0,"Alexey Navalny is a survivor and, as of Tuesda...",[]
1,"Lawyer @SobolLubov, a close associate of Russi...",[]
2,I have a hunch that Yulia's moment is coming.,[]
3,Vladimir Putin's approval rating has fallen fr...,[]
4,"Alexey Navalny is a survivor and, as of Tuesda...",[]


Look into some example tweets from the different themes

Alexey

In [15]:
final_df.text[10]

'Who is Alexey Navalny - the man behind the big protests against Vladimir Putin and Russias elite? #AJStartHere with @SandraGathmann explain'

In [20]:
final_df.text[110]

"The Kremlin's crackdown on Alexey Navalny risks turning him into a martyr"

In [21]:
final_df.text[168]

'Vladimir Putin\'s approval rating has fallen from % to % in months, but interestingly Alexey Navalny\'s "trust" rating has reached a record high of %. He\'s now ahead of Communist leader Zyuganov (4%), but still behind Nationalist Zhirinovsky (10%).'

Cheese

In [16]:
final_df.text[210]

'Think about how whatever side effects you get from the covid vaccine you cannot sue them for medical help. Yet, they pushing it like government cheese which makes you sick later on. $573 million and no money for the families that suffered.'

In [17]:
final_df.text[220]

'cheese burger'

In [18]:
final_df.text[300]

'@Burrite @joshtpm FREEDOM FRIES cuz refusing to pretend Iraq had anything to do with /11 made them "cheese eating surrender monkeys." The first BIG LIE of the GOP was "Have You Forgotten?" The second was WMD. And American troops died for these lies.'

Game Stop

In [19]:
final_df.text[410]

'@Malouka23B @MyUsernamesThis Well fuck I like mcyt and roblox guess that means I have to fucking quit my entire YouTube career and stop enjoying the game'

In [22]:
final_df.text[438]

'@0noriss @LiyahVII game stop really'

In [23]:
final_df.text[500]

'@ClashRoyale Ive started a new account because, I changed phones and I didnt save my progress on the old one. But now I wont stop playing against bots, and its really frustrating, I really enjoy the game but this is killing my interest in this game...'

Machine Learning (ML)

In [24]:
final_df.text[602]

'Errors of machine learning #100daysofcode #AI #ArtificialIntelligence #Analytics #MachineLearning #DataScience #devcommunity #datadriven'

In [25]:
final_df.text[679]

'Professor Brian Cox: Machines of the Future! @ProfBrianCox introduces a new school resource, produced by @BritSciAssoc on behalf of the @RoyalSociety - a @CRESTAwards Discovery Day! Students can discover the importance of machine learning: @AllAboutSTEM'

In [26]:
final_df.text[789]

'Scientific computing tools keep getting easier to use and more visual. Math Inspector is an open-source, coding environment (Python based ) for visualizing and animating math operations. Machine learning students might find this useful.'

*The themes contain quite mixed tweets, especially the "cheese" topic. I did not anticipate the use of slang when I scraped the tweets.*

**Vectorizer instance**

In [27]:
vectorizer = TfidfVectorizer(stop_words='english')

Create document-term matrix

In [28]:
dtm = vectorizer.fit_transform(final_df.text).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns a NumPy array
vocab = vectorizer.get_feature_names_out()

In [29]:
print(f'The shape of the document term matrix is : {dtm.shape} \
      and the number of tokens in the vocabulary is : {len(vocab)}.')

The shape of the document term matrix is : (800, 3443)       and the number of tokens in the vocabulary is : 3443.
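The relationship between the matrix shape and the vocabulary can be checked on a tiny example. This is a minimal sketch with three made-up documents (`docs`, `toy_vectorizer` and `toy_dtm` are hypothetical names): each row is one document, each column is one vocabulary term, and English stop words such as "with" are dropped.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents, for illustration only.
docs = [
    'cheese burger with cheese',
    'machine learning with python',
    'python loves cheese',
]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_dtm = toy_vectorizer.fit_transform(docs).toarray()

# vocabulary_ maps each term to its column index in the matrix.
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_dtm.shape)
print(toy_vocab)

# With the default settings, every document row is scaled to unit L2 norm.
row_norms = np.linalg.norm(toy_dtm, axis=1)
print(row_norms)
```

The number of columns equals the vocabulary size, which is the same relationship as between 3443 columns and 3443 tokens above.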


Look into the vocabulary

In [30]:
vocab[400:410]

array(['brave', 'bravery', 'bravest', 'brazilian', 'bread', 'breads',
       'break', 'breakfast', 'breaking', 'breast'], dtype='<U30')

Helper function to extract "top words"

In [31]:
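def show_topics(V):
    # Each row of V is one topic: a vector of weights over the vocabulary.
    # Relies on the globals `vocab` and `num_top_words`, which must be defined before the call.
    # argsort sorts ascending, so the reversed tail slice yields the largest weights first.
    top_words = lambda x: [vocab[i] for i in np.argsort(x)[:-num_top_words-1:-1]]
    topic_words = [top_words(x) for x in V]
    return [' '.join(x) for x in topic_words]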
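The slice `[:-num_top_words-1:-1]` is the least obvious part of the helper. A small sketch with made-up weights (`weights`, `terms` and `n` are hypothetical) shows that it picks the indices of the `n` largest values, largest first:

```python
import numpy as np

# Made-up topic weights over a five-term vocabulary.
terms = np.array(['bread', 'cheese', 'game', 'learning', 'putin'])
weights = np.array([0.1, 0.7, 0.05, 0.4, 0.3])
n = 3

# argsort returns indices in ascending order of weight;
# [:-n-1:-1] walks backwards from the end and stops after n items.
top_idx = np.argsort(weights)[:-n-1:-1]
top_terms = terms[top_idx]
print(top_idx, top_terms)
```

An equivalent and arguably more readable form is `np.argsort(weights)[::-1][:n]`.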

In [32]:
d = 8 # number of topics
num_top_words = 10 # number of top words

In [33]:
# V holds the right singular vectors as rows: one row of term weights per topic
U, s, V = utils.extmath.randomized_svd(dtm, d, random_state=42)
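For intuition about what a randomized SVD does, here is a minimal NumPy sketch of the general idea: project the matrix onto a small random subspace, orthonormalize, and run an exact SVD on the much smaller projected matrix. This is not scikit-learn's implementation; `randomized_svd_sketch`, its defaults and the toy matrix are assumptions for illustration.

```python
import numpy as np

def randomized_svd_sketch(A, k, n_oversamples=10, n_iter=4, seed=0):
    """Approximate the top-k singular triplets of A via random projection."""
    rng = np.random.default_rng(seed)
    # Random test matrix with a few extra columns for accuracy.
    omega = rng.standard_normal((A.shape[1], k + n_oversamples))
    Y = A @ omega
    # Power iterations sharpen the subspace; QR keeps them numerically stable.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Y)
        Z, _ = np.linalg.qr(A.T @ Q)
        Y = A @ Z
    Q, _ = np.linalg.qr(Y)
    # Exact SVD of the small projected matrix, mapped back to the original space.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

# Toy low-rank matrix (rank 5) standing in for a document-term matrix.
rng = np.random.default_rng(1)
toy = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 40))

U_approx, s_approx, Vt_approx = randomized_svd_sketch(toy, k=5)
s_exact = np.linalg.svd(toy, compute_uv=False)[:5]
print(np.allclose(s_approx, s_exact))
```

The gain comes from only ever decomposing a matrix with `k + n_oversamples` rows instead of the full one, which matters when the vocabulary has thousands of columns.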

In [34]:
show_topics(V)

['rating putin vladimir navalny alexey zyuganov record communist reached nationalist',
 'regret biographyone chapter writing prisoner marking survivor tuesday new vladimir',
 'pseudo sinking farce absurd titanic style worthy changed poisoning case',
 'learning machine machinelearning ai datascience 100daysofcode analytics python artificialintelligence devcommunity',
 'aly rubinadilaik doing game stop gony strategy jasminbasin clever contestant',
 'crackdown risks turning kremlin martyr navalny alexey amartyr rubinadilaik aly',
 'cheese party burger wine blue beer jeans longer republican eat',
 'cheese burger ajstarthere sandragathmann protests explain russias man elite big']
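The topic words above come from the term side of the decomposition. The document side can be read in a similar way: scaling the left singular vectors by the singular values gives each document a weight per topic. The sketch below uses a made-up 4x4 matrix and an exact `np.linalg.svd` as a stand-in for the randomized version; all names in it are hypothetical.

```python
import numpy as np

# Made-up document-term matrix: documents 0-1 share one pair of terms,
# documents 2-3 share another, so there are two clean "topics".
toy_dtm = np.array([
    [2.0, 2.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

U_toy, s_toy, Vt_toy = np.linalg.svd(toy_dtm, full_matrices=False)
k = 2  # number of topics to keep

# Weight of every document on every topic (rows: documents, columns: topics).
doc_topic = U_toy[:, :k] * s_toy[:k]

# Singular vectors are only defined up to sign, so compare magnitudes.
dominant_topic = np.abs(doc_topic).argmax(axis=1)
print(dominant_topic)
```

On real tweets the topics overlap, so a document's dominant topic is a rough indication rather than a hard label.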