# MIE 1624 - Group 16 

# Part 1: Extract data about tweets from Twitter API by using Tweepy

## 1. Setting up your Twitter Developers Account

1. To use the Twitter API, you first have to register as a Twitter developer on the developers’ website: https://developer.twitter.com/en
2. Once registered, you need to create a Twitter application, which generates a set of credentials. The Tweepy library uses these credentials later to authenticate you.
* 1) Go to the developer dashboard.
* 2) Select Overview in the left sidebar and click the Create App button.
* 3) Give your app a name.
* 4) This generates the following credentials. They are personal: do not share them with anybody.

 * #### API_KEY
 * #### API_KEY_SECRET
 * #### ACCESS_TOKEN
 * #### ACCESS_TOKEN_SECRET
 * #### BEARER_TOKEN

## 2. Install Tweepy and call it

In [1]:
# Tweepy is a free Python wrapper that makes it easier to authenticate and interact with the Twitter API.
# install tweepy 
#! pip install tweepy

In [2]:
# import Python modules to work with Twitter data
import tweepy
import pandas as pd

## 3. Authentication

#### To access Twitter data, you will need to authenticate your account using your API keys and tokens. 
#### We added our credentials to a txt file in advance. We will read the keys from the txt file.

In [3]:
# read Twitter authentication keys from the txt file (one key per line)
keys = []
with open('Twitter_Keys.txt') as f:
    for line in f:
        keys.append(line.strip())

In [4]:
consumer_key=keys[0] # consumer_key = 'YourConsumerKey'
consumer_secret=keys[1] # consumer_secret = 'YourConsumerSecret'
access_token=keys[2] # access_token = 'YourAccessToken'
access_token_secret=keys[3] # access_token_secret = 'YourAccessTokenSecret'
bearer_token = keys[4] # bearer_token = 'YourBearerToken'
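
Indexing into `keys` by position assigns the wrong credential without any error if the file contains a blank line or is missing an entry. The following is a minimal sketch of a stricter loader, assuming the same five-line layout of `Twitter_Keys.txt`; `load_keys` and `KEY_NAMES` are hypothetical names of our own, not part of Tweepy.

```python
# credential names in the order they are assumed to appear in the file
KEY_NAMES = ["consumer_key", "consumer_secret", "access_token",
             "access_token_secret", "bearer_token"]

def load_keys(path):
    """Read one credential per line and return them as a name -> value dict."""
    with open(path) as f:
        # skip blank lines so a stray empty line does not shift the order
        values = [line.strip() for line in f if line.strip()]
    if len(values) != len(KEY_NAMES):
        raise ValueError(f"expected {len(KEY_NAMES)} keys, found {len(values)}")
    return dict(zip(KEY_NAMES, values))
```

Because the dictionary keys match the parameter names used in `tweepy.Client` below, the result could be unpacked directly, for example `tweepy.Client(**load_keys('Twitter_Keys.txt'), wait_on_rate_limit=True)`.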

In [5]:
# Twitter API v2 Client
# put your credentials in tweepy.Client to authenticate your account
client = tweepy.Client(consumer_key=consumer_key,
                       consumer_secret=consumer_secret,
                       access_token=access_token,
                       access_token_secret = access_token_secret,
                       bearer_token=bearer_token,
                       wait_on_rate_limit=True)

## 4. Search tweets by hashtags

* Here we are going to search for tweets containing `#uoft`.
* We can do this by building a search query and passing it to the function `search_recent_tweets()`.
* By default, it only returns the tweet id and text.

In [6]:
uoft_search = client.search_recent_tweets(query="#uoft", max_results=10)
print(uoft_search)
type(uoft_search)

Response(data=[<Tweet id=1592389643613638656 text='#UofT Completed: The scheduled maintenance has been completed. https://t.co/99BqV2Rd6Q'>, <Tweet id=1592381946256330752 text='#UofT In Progress: Scheduled maintenance is currently in progress. We will provide updates as necessary. https://t.co/8pkaK6M18Z'>, <Tweet id=1592375948196073472 text="RT @UofTfamilycare: It's #transawarenessweek. Check out the FCO's blog for events across #uoft as well as resources.\nhttps://t.co/BnjKiVYpB…">, <Tweet id=1592375575896764417 text="RT @UofTFNH: It's #RockYourMocs week, November 13-19th 2022!\nCome visit the Resource Centre at First Nations House Indigenous Student Servi…">, <Tweet id=1592375539683115009 text='RT @cupe3261: Since 2020, #UofT contracted out cleaning in an additional 27 buildings at the St. George campus, cutting good jobs and creat…'>, <Tweet id=1592374830375976960 text="RT @uoftbrn: BRN Speaker Series: Join @UTSC's Caroline Hossein and Ebun Joseph (@EbunJoseph1), a lecturer @ucddub

tweepy.client.Response

## 5. Converting Information to DataFrame and Exporting as CSV

If we want to see more information about tweets, we can use `Expansions` to expand the information included in the metadata beyond the default. For this example, we also retrieve the author of each tweet (`author_id`).

By default, the Tweet object only returns the `id` and `text` fields. If you need the Tweet’s created date or public metrics, you will need to use the `tweet.fields` parameter to request them. `public_metrics` includes the retweet, reply, and like counts.


* Introduction to `Expansions`: https://developer.twitter.com/en/docs/twitter-api/expansions
* Introduction to `Fields`: https://developer.twitter.com/en/docs/twitter-api/fields
* More parameters for requesting metadata: https://docs.tweepy.org/en/stable/expansions_and_fields.html#tweet-fields-parameter

In [7]:
# Expansions enable you to request additional data objects 
# that relate to the originally returned List, Space, Tweets, or users.
uoft_search = client.search_recent_tweets(
    query="#uoft -is:retweet lang:en", # Extract non-retweeted English tweets
    max_results=100, 
    expansions=["author_id"],
    tweet_fields=["created_at", "public_metrics"])

In [8]:
# create our data set
data = []

# set the columns ('Author ID' holds the author's username, looked up via author_id)
columns = ['ID', 'Tweet', "Date Posted",'Author ID', 'Liked', 'Reply', 'Retweet']

# create a dictionary that will use the author_id field to look up more information 
# about the users
uoft_users = {user['id']: 
    user for user in uoft_search.includes['users']}

# add the data from our retrieval to the data set
for tweet in uoft_search.data:
    # .get() returns None instead of raising KeyError when the author is missing
    user = uoft_users.get(tweet.author_id)
    if user:
        data.append([tweet.id, 
                     tweet.text, 
                     tweet.created_at, 
                     user.username,  
                     tweet.public_metrics['like_count'], 
                     tweet.public_metrics['reply_count'],
                     tweet.public_metrics['retweet_count']])
    
# create the dataframe
uoft_df = pd.DataFrame(data, columns=columns )
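
The lookup pattern above pairs each tweet with its author from the expanded `users` list. A minimal sketch of the same idea follows, with plain dictionaries standing in for the API objects; the sample values are made up, and a tweet whose author is missing from the expansion is skipped rather than causing an error.

```python
# stand-in data shaped like the API response; the values are made up
tweets = [
    {"id": 1, "text": "first tweet", "author_id": 10},
    {"id": 2, "text": "second tweet", "author_id": 11},
    {"id": 3, "text": "orphan tweet", "author_id": 99},  # author not expanded
]
users = [{"id": 10, "username": "alice"}, {"id": 11, "username": "bob"}]

# index the users by id once, so each lookup is a single dict access
users_by_id = {user["id"]: user for user in users}

rows = []
for tweet in tweets:
    user = users_by_id.get(tweet["author_id"])
    if user is None:
        continue  # skip tweets whose author is missing
    rows.append([tweet["id"], tweet["text"], user["username"]])

print(rows)
```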

In [9]:
# export the data as csv
uoft_df.to_csv("uoft_tweets_current.csv")

In [10]:
# read our pre-saved uoft_tweets_Nov13.csv so that the following steps are reproducible
uoft_df = pd.read_csv('uoft_tweets_Nov13.csv')

In [11]:
uoft_df

Unnamed: 0.1,Unnamed: 0,ID,Tweet,Date Posted,Author ID,Liked,Reply,Retweet
0,0,1591977857664024576,Dont forget to donate to CIUT FM !!! The only ...,2022-11-14 02:14:33+00:00,smhimh,0,0,0
1,1,1591938939669352448,Happy first day of snow to everyone in Toronto...,2022-11-13 23:39:54+00:00,uoftmha,1,0,0
2,2,1591937427085598721,How are we feeling on the last day of reading ...,2022-11-13 23:33:54+00:00,uoftmha,0,0,0
3,3,1591936122674085890,#SelfCareSunday Since eating is an important p...,2022-11-13 23:28:43+00:00,uoftmha,0,0,0
4,4,1591925903936061440,Faculty of Fall\n\n15 sec mp4 720 × 1280 11.8 ...,2022-11-13 22:48:06+00:00,michaelalstad,5,0,0
...,...,...,...,...,...,...,...,...
95,95,1590883872027598850,Recently published: A mathematical framework t...,2022-11-11 01:47:26+00:00,sourojeet,0,0,0
96,96,1590883568099762176,Recently published: A mathematical framework t...,2022-11-11 01:46:14+00:00,sourojeet,0,0,0
97,97,1590875152988114945,Life can be challenging but there are resource...,2022-11-11 01:12:48+00:00,UTSC,2,0,1
98,98,1590862298620436480,Watch LIVE on YouTube: Contemporary Indigenous...,2022-11-11 00:21:43+00:00,UofTDaniels,1,0,0


# Part 2: Basic Feature Extraction & Basic Text Preprocessing 

## 1. Basic feature extraction

#### Number of words

In [12]:
uoft_df["Number_of_words"] = uoft_df["Tweet"].apply(lambda x: len(x.split()))

#### Number of characters

In [13]:
uoft_df["Number_of_characters"] = uoft_df["Tweet"].apply(lambda x: len(x))

#### Average word length for each tweet

In [14]:
average_word_length = []
for i in uoft_df['Tweet']:
    words = i.split()
    average = sum(len(word) for word in words) / len(words)
    average_word_length.append(average)

In [15]:
uoft_df["Average_word_length"] = average_word_length

#### Number of stopwords

In [16]:
# import the nltk package to count the stopwords
import nltk
from nltk.corpus import stopwords  
# nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

In [17]:
uoft_df['Number_of_stopwords'] = uoft_df['Tweet'].apply(lambda x: len([w for w in x.split() if w in stop_words]))
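
The four features above can also be computed in one pass. The sketch below is an illustration under stated assumptions: `basic_features` and `SAMPLE_STOPWORDS` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, and, unlike the cell above, it lowercases each word before the stopword lookup (so "The" is counted) and returns an average of 0 for empty text instead of dividing by zero.

```python
# tiny illustrative stopword set; the notebook uses NLTK's English list instead
SAMPLE_STOPWORDS = {"to", "the", "of", "in", "a", "and"}

def basic_features(text, stop_words=SAMPLE_STOPWORDS):
    """Return word count, character count, mean word length and stopword count."""
    words = text.split()
    n_words = len(words)
    # guard against empty tweets, which would otherwise divide by zero
    avg_len = sum(len(w) for w in words) / n_words if n_words else 0.0
    # lowercase before the lookup so "The" is counted as a stopword too
    n_stop = sum(1 for w in words if w.lower() in stop_words)
    return {"Number_of_words": n_words,
            "Number_of_characters": len(text),
            "Average_word_length": avg_len,
            "Number_of_stopwords": n_stop}

features = basic_features("The first day of snow in Toronto")
print(features)
```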

### Now we have some basic features for each tweet.

In [18]:
uoft_df

Unnamed: 0.1,Unnamed: 0,ID,Tweet,Date Posted,Author ID,Liked,Reply,Retweet,Number_of_words,Number_of_characters,Average_word_length,Number_of_stopwords
0,0,1591977857664024576,Dont forget to donate to CIUT FM !!! The only ...,2022-11-14 02:14:33+00:00,smhimh,0,0,0,36,257,6.166667,10
1,1,1591938939669352448,Happy first day of snow to everyone in Toronto...,2022-11-13 23:39:54+00:00,uoftmha,1,0,0,19,180,8.368421,3
2,2,1591937427085598721,How are we feeling on the last day of reading ...,2022-11-13 23:33:54+00:00,uoftmha,0,0,0,41,277,5.756098,11
3,3,1591936122674085890,#SelfCareSunday Since eating is an important p...,2022-11-13 23:28:43+00:00,uoftmha,0,0,0,34,255,6.470588,7
4,4,1591925903936061440,Faculty of Fall\n\n15 sec mp4 720 × 1280 11.8 ...,2022-11-13 22:48:06+00:00,michaelalstad,5,0,0,26,202,6.769231,1
...,...,...,...,...,...,...,...,...,...,...,...,...
95,95,1590883872027598850,Recently published: A mathematical framework t...,2022-11-11 01:47:26+00:00,sourojeet,0,0,0,24,237,8.833333,4
96,96,1590883568099762176,Recently published: A mathematical framework t...,2022-11-11 01:46:14+00:00,sourojeet,0,0,0,24,241,9.000000,4
97,97,1590875152988114945,Life can be challenging but there are resource...,2022-11-11 01:12:48+00:00,UTSC,2,0,1,33,264,6.909091,8
98,98,1590862298620436480,Watch LIVE on YouTube: Contemporary Indigenous...,2022-11-11 00:21:43+00:00,UofTDaniels,1,0,0,24,212,7.791667,4


### Looking at the raw text below, however, there is too much ‘noise’ (emojis, URLs, punctuation).
### Therefore, we need to clean the data.

In [19]:
uoft_df['Tweet'][0]

'Dont forget to donate to CIUT FM !!! The only radio station i know that doesn’t have commercials and constantly has spellbinding shows and programs throughout the week #collegeradio #toronto #canada #UofT #modern #listenersupported #spokenword #NewMusic2022'

In [20]:
uoft_df['Tweet'][1]

'Happy first day of snow to everyone in Toronto!!❄️☃️🤍\n\n#TorontoWeather #snowfall #Weather #WINTER #uoft #relax #SnowFlake #MentalHealthAwareness #positive \n\nhttps://t.co/onJLVz44oK'

In [21]:
uoft_df['Tweet'][2]

'How are we feeling on the last day of reading week😭 Time flew by and I was extremely unproductive study-wise, but hey atleast I got a relaxing week to myself! ❤️ \n#uoft #toronto #readingweek #student #studentlife #mentalhealth #studyspo #relax #positive https://t.co/UVJ6nuOeEC'

## 2. Basic text preprocessing

#### Lower casing

In [22]:
uoft_df["Tweet"] = uoft_df["Tweet"].apply(lambda x: x.lower())

In [23]:
uoft_df['Tweet'][2]

'how are we feeling on the last day of reading week😭 time flew by and i was extremely unproductive study-wise, but hey atleast i got a relaxing week to myself! ❤️ \n#uoft #toronto #readingweek #student #studentlife #mentalhealth #studyspo #relax #positive https://t.co/uvj6nuoeec'

#### Remove all emojis

In [24]:
def remove_emoji(text):
    # encode to ASCII and drop every character that cannot be encoded;
    # this removes emojis, but also any other non-ASCII character (e.g. curly quotes)
    text = text.encode('ascii', 'ignore').decode('ascii')
    return text

In [25]:
uoft_df["Tweet"] = uoft_df["Tweet"].apply(remove_emoji)

In [26]:
uoft_df["Tweet"][2]

'how are we feeling on the last day of reading week time flew by and i was extremely unproductive study-wise, but hey atleast i got a relaxing week to myself!  \n#uoft #toronto #readingweek #student #studentlife #mentalhealth #studyspo #relax #positive https://t.co/uvj6nuoeec'
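
Encoding to ASCII is a blunt tool: it also deletes accented letters and curly apostrophes. If those should survive, one alternative is to delete only characters from emoji-related Unicode ranges. The sketch below is an assumption-laden illustration; `remove_emoji_keep_text` is a hypothetical helper and the ranges are approximate rather than an exhaustive definition of "emoji".

```python
import re

EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that follows many emojis
    "\u200D"                 # zero-width joiner used in emoji sequences
    "]+"
)

def remove_emoji_keep_text(text):
    """Delete emoji characters but keep other non-ASCII text untouched."""
    return EMOJI_PATTERN.sub("", text)

cleaned = remove_emoji_keep_text("snow \u2744\ufe0f\u2603\ufe0f\U0001F90D caf\u00e9")
print(cleaned)
```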

#### Remove all URLs

* #### Import `re` to replace a pattern in a string with a given replacement. For details, refer to https://docs.python.org/3/library/re.html

In [27]:
import re

In [28]:
# using the 're.sub' to remove urls.
def remove_urls(text):
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"www\.\S+", "", text)  # escape the dot and remove the whole address
    return text

In [29]:
# Apply function to remove all urls
uoft_df["Tweet"] = uoft_df["Tweet"].apply(remove_urls)

In [30]:
uoft_df['Tweet'][2]

'how are we feeling on the last day of reading week time flew by and i was extremely unproductive study-wise, but hey atleast i got a relaxing week to myself!  \n#uoft #toronto #readingweek #student #studentlife #mentalhealth #studyspo #relax #positive '
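
When writing URL patterns, two details are easy to miss: an unescaped `.` matches any character, and `\w+` stops at the next dot, so a pattern such as `www.(\w+)` would turn `www.example.com` into `.com`. A small sketch of a single combined pattern follows; `strip_urls` and the sample sentence are our own illustration.

```python
import re

# one pattern for both forms: scheme-prefixed links and bare "www." links
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def strip_urls(text):
    """Delete every URL-like token, up to the next whitespace character."""
    return URL_PATTERN.sub("", text)

result = strip_urls("slides at www.example.com/talk and https://t.co/abc123 today")
print(repr(result))
```

Since `\S+` runs to the next whitespace, punctuation glued to the end of a link is removed along with it.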

#### Remove all punctuation

In [31]:
import re

In [32]:
# using the 're.sub' to remove punctuation.
def remove_punctuation(text):
    text = re.sub(r"[^\w\s]", "", text)
    return text

In [33]:
# Apply function to remove all punctuation
uoft_df["Tweet"] = uoft_df["Tweet"].apply(remove_punctuation)

In [34]:
uoft_df['Tweet'][2]

'how are we feeling on the last day of reading week time flew by and i was extremely unproductive studywise but hey atleast i got a relaxing week to myself  \nuoft toronto readingweek student studentlife mentalhealth studyspo relax positive '
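
Deleting punctuation outright glues hyphenated words together ('study-wise' became 'studywise' above). If that is not wanted, a possible variation is to replace punctuation with a space and then collapse the extra whitespace. This is only a sketch; `replace_punctuation` is a hypothetical helper, and note that it also turns newlines into single spaces.

```python
import re

def replace_punctuation(text):
    """Swap punctuation for spaces, then collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", " ", text)      # same character class as above
    return re.sub(r"\s+", " ", text).strip()  # tidy up the extra spaces

spaced = replace_punctuation("unproductive study-wise, but hey!  #uoft")
print(spaced)
```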

#### Remove stopwords
Stopwords are very common words, such as pronouns, prepositions, and articles, that carry little meaning on their own. Removing them helps us focus on keywords in the further analysis.

In [35]:
# import nltk package to find stopwords
import nltk
from nltk.corpus import stopwords  
# nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

In [36]:
uoft_df['Tweet'] = uoft_df['Tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))


In [37]:
uoft_df['Tweet'][2]

'feeling last day reading week time flew extremely unproductive studywise hey atleast got relaxing week uoft toronto readingweek student studentlife mentalhealth studyspo relax positive'
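
Stopword matching is an exact string comparison, so the order of the cleaning steps matters: once punctuation has been stripped, "don't" becomes "dont", which is not in the stopword list. The sketch below illustrates this with a small hand-picked subset of the list (an assumption made for brevity); the helper names are hypothetical.

```python
import re

# small subset of the English stopword list, for illustration only
SAMPLE_STOPWORDS = {"i", "the", "to", "don't", "don"}

def drop_stopwords(text, stop_words=SAMPLE_STOPWORDS):
    """Keep only the whitespace-separated words that are not stopwords."""
    return " ".join(w for w in text.split() if w not in stop_words)

def drop_punctuation(text):
    """Delete every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)

text = "i don't want to miss the lecture"

# punctuation first: "don't" turns into "dont" and slips past the filter
punct_first = drop_stopwords(drop_punctuation(text))
# stopwords first: "don't" is still intact when the filter runs
stop_first = drop_punctuation(drop_stopwords(text))

print(punct_first)
print(stop_first)
```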

In [38]:
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

#### Spelling correction

In [39]:
# using the TextBlob library to do the spelling correction
import sys
!{sys.executable} -m pip install -U textblob
from textblob import TextBlob



In [40]:
uoft_df['Tweet'] = uoft_df['Tweet'].apply(lambda txt: str(TextBlob(txt).correct()))

In [41]:
uoft_df['Tweet'][2]

'feeling last day reading week time flew extremely productive studywise hey least got relaxing week soft toronto readingweek student studentlife mentalhealth studyspo relax positive'
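
The output above shows the risk of automatic correction on tweets: 'uoft' became 'soft', 'atleast' became 'least', and 'unproductive' became 'productive', which reverses the meaning. One possible mitigation is to protect domain-specific words from the corrector. The sketch below is hypothetical: `correct_with_allowlist` is our own helper, and `toy_corrector` is a stand-in mapping so the example runs without any spell-checking library.

```python
def correct_with_allowlist(text, corrector, keep):
    """Apply `corrector` word by word, leaving words in `keep` untouched."""
    return " ".join(w if w in keep else corrector(w) for w in text.split())

# toy corrector standing in for a real spell checker (hypothetical mapping)
def toy_corrector(word):
    return {"uoft": "soft", "atleast": "least"}.get(word, word)

PROTECTED = {"uoft", "toronto", "readingweek"}

corrected = correct_with_allowlist("hey atleast uoft toronto", toy_corrector, PROTECTED)
print(corrected)
```

With TextBlob, the `corrector` argument could plausibly be `lambda w: str(TextBlob(w).correct())`; the list of protected words would have to be chosen by inspecting the data.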

#### Tokenization
Tokenizers divide strings into lists of substrings (tokens). We can then try to understand the meaning of the text by analyzing these smaller units.


In [42]:
import nltk
# nltk.download('punkt')

In [43]:
uoft_df['tokenized'] = uoft_df.Tweet.apply(lambda txt:nltk.word_tokenize(txt))

In [44]:
uoft_df['tokenized'][2]

['feeling',
 'last',
 'day',
 'reading',
 'week',
 'time',
 'flew',
 'extremely',
 'productive',
 'studywise',
 'hey',
 'least',
 'got',
 'relaxing',
 'week',
 'soft',
 'toronto',
 'readingweek',
 'student',
 'studentlife',
 'mentalhealth',
 'studyspo',
 'relax',
 'positive']
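
To make the idea of tokenization concrete, here is a minimal regex-based sketch. It is a simplified stand-in of our own, not the algorithm behind `nltk.word_tokenize`: it separates punctuation marks from words, whereas plain `str.split()` leaves them attached.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    # \w+ grabs a run of word characters; [^\w\s] grabs a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

print("relax, uoft!".split())            # ['relax,', 'uoft!']
print(simple_tokenize("relax, uoft!"))   # ['relax', ',', 'uoft', '!']
```

On text that has already been stripped of punctuation, as in this notebook, this sketch gives the same result as splitting on whitespace.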

#### Stemming
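Stemming strips suffixes from a word using fixed rules, without checking whether the result is a real word, so it sometimes produces incorrect meanings and spellings.
* For instance, a crude suffix-stripping stemmer would reduce 'Caring' to 'Car'. In the Porter stemmer output below, 'extremely' becomes 'extrem' and 'positive' becomes 'posit'.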

In [45]:
# import PorterStemmer to do stemming
from nltk.stem import PorterStemmer

In [46]:
# stem each word of the tokenized tweet and then join the stemmed words together
porter = PorterStemmer()
final = []
for i in uoft_df['tokenized']:
    stem_sentence = []
    for word in i:
        stem_sentence.append(porter.stem(word))
        stem_sentence.append(" ")
    final.append("".join(stem_sentence))

In [47]:
uoft_df['stem'] = final

In [48]:
uoft_df['stem'][2]

'feel last day read week time flew extrem product studywis hey least got relax week soft toronto readingweek student studentlif mentalhealth studyspo relax posit '
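
To show why rule-based suffix stripping can go wrong, here is a deliberately naive toy stemmer. It is not the Porter algorithm, which applies a longer sequence of rules; `naive_stem` and its suffix list are assumptions made purely for illustration.

```python
SUFFIXES = ("ing", "ly", "ed", "s")  # checked in this order

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["caring", "relaxing", "extremely", "week"]]
print(stems)  # ['car', 'relax', 'extreme', 'week']
```

The rule that correctly maps 'relaxing' to 'relax' is the same one that maps 'caring' to the unrelated word 'car'.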

#### Lemmatization
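Lemmatization converts a word to its dictionary base form (lemma), taking the part of speech into account.
* For example, lemmatizing the word ‘Caring‘ as a verb would return ‘Care‘. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed, which is why words such as 'reading' and 'relaxing' are unchanged in the output below.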

In [49]:
from nltk.stem import WordNetLemmatizer

In [50]:
wordnet_lemmatizer = WordNetLemmatizer()

In [51]:
# lemmatize each word of the tokenized tweets and then join the lemmatized words together
# nltk.download('wordnet')
# nltk.download('omw-1.4')
final_lem = []
for i in uoft_df['tokenized']:
    lem_sentence = []
    for word in i:
        lem_sentence.append(wordnet_lemmatizer.lemmatize(word))
        lem_sentence.append(" ")
    final_lem.append("".join(lem_sentence))

In [52]:
uoft_df['lemmatization'] = final_lem

In [53]:
uoft_df['lemmatization'][2]

'feeling last day reading week time flew extremely productive studywise hey least got relaxing week soft toronto readingweek student studentlife mentalhealth studyspo relax positive '

In [54]:
uoft_df

Unnamed: 0.1,Unnamed: 0,ID,Tweet,Date Posted,Author ID,Liked,Reply,Retweet,Number_of_words,Number_of_characters,Average_word_length,Number_of_stopwords,tokenized,stem,lemmatization
0,0,1591977857664024576,dont forget donate cut am radio station know d...,2022-11-14 02:14:33+00:00,smhimh,0,0,0,36,257,6.166667,10,"[dont, forget, donate, cut, am, radio, station...",dont forget donat cut am radio station know do...,dont forget donate cut am radio station know d...
1,1,1591938939669352448,happy first day snow everyone toronto torontow...,2022-11-13 23:39:54+00:00,uoftmha,1,0,0,19,180,8.368421,3,"[happy, first, day, snow, everyone, toronto, t...",happi first day snow everyon toronto torontowe...,happy first day snow everyone toronto torontow...
2,2,1591937427085598721,feeling last day reading week time flew extrem...,2022-11-13 23:33:54+00:00,uoftmha,0,0,0,41,277,5.756098,11,"[feeling, last, day, reading, week, time, flew...",feel last day read week time flew extrem produ...,feeling last day reading week time flew extrem...
3,3,1591936122674085890,selfcaresunday since eating important part sel...,2022-11-13 23:28:43+00:00,uoftmha,0,0,0,34,255,6.470588,7,"[selfcaresunday, since, eating, important, par...",selfcaresunday sinc eat import part self care ...,selfcaresunday since eating important part sel...
4,4,1591925903936061440,faculty fall 15 see may 7201280 118 mb 99100 t...,2022-11-13 22:48:06+00:00,michaelalstad,5,0,0,26,202,6.769231,1,"[faculty, fall, 15, see, may, 7201280, 118, mb...",faculti fall 15 see may 7201280 118 mb 99100 t...,faculty fall 15 see may 7201280 118 mb 99100 t...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,95,1590883872027598850,recently published mathematical framework unde...,2022-11-11 01:47:26+00:00,sourojeet,0,0,0,24,237,8.833333,4,"[recently, published, mathematical, framework,...",recent publish mathemat framework understand p...,recently published mathematical framework unde...
96,96,1590883568099762176,recently published mathematical framework unde...,2022-11-11 01:46:14+00:00,sourojeet,0,0,0,24,241,9.000000,4,"[recently, published, mathematical, framework,...",recent publish mathemat framework understand p...,recently published mathematical framework unde...
97,97,1590875152988114945,life challenging resources available help see ...,2022-11-11 01:12:48+00:00,UTSC,2,0,1,33,264,6.909091,8,"[life, challenging, resources, available, help...",life challeng resourc avail help see hard time...,life challenging resource available help see h...
98,98,1590862298620436480,watch live couture contemporary indigenous per...,2022-11-11 00:21:43+00:00,UofTDaniels,1,0,0,24,212,7.791667,4,"[watch, live, couture, contemporary, indigenou...",watch live coutur contemporari indigen perform...,watch live couture contemporary indigenous per...


## 3. Advanced text processing

#### Term Frequency (TF)

In [55]:
# import function to do term frequency 
from sklearn.feature_extraction.text import CountVectorizer

In [56]:
# using countvectorizer from sklearn to achieve term frequency count
vectorizer = CountVectorizer()

In [57]:
# fit_transform data
X = vectorizer.fit_transform(uoft_df['lemmatization'])
# extract all the words present in lemmatization column
vectorizer.get_feature_names_out()

array(['0105', '100', '1020', '102176720495471221938', '105', '11', '118',
       '12', '1200', '1200pm', '12130pm', '13', '14', '1418', '15',
       '1517', '16', '17', '18', '18669255454', '18yearold', '2020',
       '2022', '2304pm', '2330pm', '26', '27', '30', '305', '3711312',
       '3800', '400pm', '4164952891', '4306pm', '50', '500pm', '7201280',
       '99100', 'ac223', 'academic', 'academicpromotion',
       'academictwitter', 'accepting', 'access', 'accommodation',
       'account', 'across', 'activity', 'actually', 'adam', 'additional',
       'address', 'adult', 'advice', 'advising', 'affect', 'agog', 'ai',
       'alms', 'alongside', 'also', 'alum', 'alumni_utm', 'always', 'am',
       'america', 'amp', 'anatomy', 'anchordown', 'ancient', 'and',
       'anna', 'annette', 'annettemkennedy', 'announcement', 'answer',
       'antitoxin', 'apologize', 'application', 'apply', 'applying',
       'appointed', 'appreciate', 'approval', 'arc', 'architect',
       'architecture', '

In [58]:
# one row of word counts per tweet (one column per vocabulary word)
X.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
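
To see what the term-frequency matrix contains, it helps to build a tiny one by hand. The sketch below uses only the standard library and is meant to approximate, not reproduce, what `CountVectorizer` does with its default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); `term_frequency_matrix` is a hypothetical helper.

```python
import re
from collections import Counter

def term_frequency_matrix(docs):
    """Return (sorted vocabulary, one row of term counts per document)."""
    # tokens of two or more word characters, lowercased
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        rows.append([counts[term] for term in vocab])
    return vocab, rows

vocab, rows = term_frequency_matrix(["relax week relax", "week uoft"])
print(vocab)  # ['relax', 'uoft', 'week']
print(rows)   # [[2, 0, 1], [0, 1, 1]]
```

Each row corresponds to one document and each column to one vocabulary word, which is why the real matrix above is mostly zeros: any single tweet uses only a few of the words.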

In [59]:
# create the dataframe
feature_extraction1 = pd.DataFrame(X.toarray(), columns = vectorizer.get_feature_names_out())

In [60]:
uoft_df=uoft_df.join(feature_extraction1)

In [61]:
uoft_df

Unnamed: 0.1,Unnamed: 0,ID,Tweet,Date Posted,Author ID,Liked,Reply,Retweet,Number_of_words,Number_of_characters,...,working,workshop,world,worry,would,wreath,writerkait,www,year,young
0,0,1591977857664024576,dont forget donate cut am radio station know d...,2022-11-14 02:14:33+00:00,smhimh,0,0,0,36,257,...,0,0,0,0,0,0,0,0,0,0
1,1,1591938939669352448,happy first day snow everyone toronto torontow...,2022-11-13 23:39:54+00:00,uoftmha,1,0,0,19,180,...,0,0,0,0,0,0,0,0,0,0
2,2,1591937427085598721,feeling last day reading week time flew extrem...,2022-11-13 23:33:54+00:00,uoftmha,0,0,0,41,277,...,0,0,0,0,0,0,0,0,0,0
3,3,1591936122674085890,selfcaresunday since eating important part sel...,2022-11-13 23:28:43+00:00,uoftmha,0,0,0,34,255,...,0,0,0,0,0,0,0,0,0,0
4,4,1591925903936061440,faculty fall 15 see may 7201280 118 mb 99100 t...,2022-11-13 22:48:06+00:00,michaelalstad,5,0,0,26,202,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,95,1590883872027598850,recently published mathematical framework unde...,2022-11-11 01:47:26+00:00,sourojeet,0,0,0,24,237,...,0,0,0,0,0,0,0,0,0,0
96,96,1590883568099762176,recently published mathematical framework unde...,2022-11-11 01:46:14+00:00,sourojeet,0,0,0,24,241,...,0,0,0,0,0,0,0,0,0,0
97,97,1590875152988114945,life challenging resources available help see ...,2022-11-11 01:12:48+00:00,UTSC,2,0,1,33,264,...,0,0,0,0,0,0,0,0,0,0
98,98,1590862298620436480,watch live couture contemporary indigenous per...,2022-11-11 00:21:43+00:00,UofTDaniels,1,0,0,24,212,...,0,0,0,0,0,0,0,0,0,0


# Part 3: Alternative way to get Twitter data in Python

This method is much easier than extracting data through the Twitter API: you do not need to create a developer account or generate keys and tokens. With the code below, you can quickly scrape a range of information from Twitter.

In [62]:
# Install alternative method to get Twitter data in Python  
# pip install git+https://github.com/JustAnotherArchivist/snscrape.git

In [63]:
import snscrape.modules.twitter as sntwitter
import pandas as pd

#https://stackoverflow.com/questions/73485659/scrape-tweets-from-a-list-of-hashtags-using-snscrape
def tweet_scraper(query, n_tweet):
    attributes_container = []
    max_tweet = n_tweet
    for i,tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= max_tweet:  # stop once exactly n_tweet tweets have been collected
            break
        attributes_container.append([tweet.user.username,
                                 tweet.user.created,
                                 tweet.user.followersCount,
                                 tweet.user.friendsCount,
                                 tweet.retweetCount,
                                 tweet.lang,
                                 tweet.date,
                                 tweet.likeCount,
                                 tweet.sourceLabel,
                                 tweet.id,
                                 tweet.content,
                                 tweet.hashtags,
                                 tweet.conversationId,
                                 tweet.inReplyToUser,
                                 tweet.coordinates,
                                 tweet.place])
    
    return pd.DataFrame(attributes_container, columns=["User",
                                                   "Date_Created",
                                                   "Follows_Count",
                                                   "Friends_Count",
                                                   "Retweet_Count",
                                                   "Language",
                                                   "Date_Tweet",
                                                   "Number_of_Likes",
                                                   "Source_of_Tweet",
                                                   "Tweet_Id",
                                                   "Tweet",
                                                   "Hashtags",
                                                   "Conversation_Id",
                                                   "In_reply_To",
                                                   "Coordinates",
                                                   "Place"])
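
Capping a lazy generator with a manual counter is easy to get off by one (a check such as `i > n` lets n + 1 items through). A small sketch of an alternative using `itertools.islice` follows; `take` and `fake_stream` are hypothetical names, and the generator is a stand-in for a scraper's result stream.

```python
from itertools import islice

def take(items, n):
    """Return at most the first n items of any iterable as a list."""
    return list(islice(items, n))

def fake_stream():
    # stand-in for a scraper's lazy result generator (made-up data)
    i = 0
    while True:
        yield f"tweet {i}"
        i += 1

first_three = take(fake_stream(), 3)
print(first_three)  # ['tweet 0', 'tweet 1', 'tweet 2']
```

In the function above, the same idea would amount to iterating over `islice(sntwitter.TwitterSearchScraper(query).get_items(), n_tweet)` instead of counting with `enumerate`.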

In [64]:
uoft_df2 = tweet_scraper('#uoft', 100)
uoft_df2


Unnamed: 0,User,Date_Created,Follows_Count,Friends_Count,Retweet_Count,Language,Date_Tweet,Number_of_Likes,Source_of_Tweet,Tweet_Id,Tweet,Hashtags,Conversation_Id,In_reply_To,Coordinates,Place
0,CSUSUofT,2016-10-05 20:54:39+00:00,313,394,0,en,2022-11-15 04:17:42+00:00,0,Twitter Web App,1592371236038119424,"Hey #UofT community, there is still time to re...",[UofT],1592371236038119424,,,
1,EquityPubPolicy,2014-07-16 15:29:10+00:00,767,309,0,en,2022-11-15 00:15:27+00:00,0,Twitter Web App,1592310271426826240,We are so excited to commence this year's sess...,"[EDPPtalks, UofT, topoli, onpoli, cdnpoli, mun...",1592310271426826240,,,
2,uoftalumni,2009-03-10 15:27:10+00:00,7214,1310,0,en,2022-11-14 22:25:10+00:00,1,Hootsuite Inc.,1592282519021420552,Community members gathered on #UofT’s three ca...,"[UofT, UofTalumni]",1592282519021420552,,,
3,uoftlibraries,2009-10-03 18:10:17+00:00,11003,758,0,en,2022-11-14 21:43:52+00:00,2,Twitter Web App,1592272128136871936,Our Robarts Library book display for both Nove...,[UofT],1592272128136871936,,,
4,UTMBiology,2012-12-05 18:41:31+00:00,2202,2260,0,en,2022-11-14 21:03:25+00:00,1,Zoho Social,1592261946170306564,"You can't miss this Friday, Nov 18, #UTMBiolog...","[UTMBiology, UofT]",1592261946170306564,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,UTLaw,2009-01-12 14:56:57+00:00,15338,1878,0,en,2022-11-11 18:01:25+00:00,1,Hootsuite Inc.,1591128981843083266,This Monday!\n\nJoin Prof. @BedardRubin @NDesR...,[UofT],1591128981843083266,,,
97,UofTStudentLife,2010-04-07 16:44:11+00:00,20246,633,0,en,2022-11-11 18:00:14+00:00,1,Sprout Social,1591128683862949915,Explore how spiritual well-being and mental he...,[uoft],1591128683862949915,,,
98,uoftlibraries,2009-10-03 18:10:17+00:00,11003,758,5,en,2022-11-11 17:53:39+00:00,11,Twitter for iPhone,1591127025972285440,#UofT libraries was honoured to place a wreath...,"[UofT, LestWeForget, RemembranceDay]",1591127025972285440,,"Coordinates(longitude=-79.639319, latitude=43....","Place(fullName='Toronto, Ontario', name='Toron..."
99,CentreToronto,2018-12-04 15:16:48+00:00,31,174,0,zh,2022-11-11 17:31:40+00:00,0,Twitter Web App,1591121495128170496,学习策略和技能来改善您的 #MentalHealth 并更好地应对学生生活的挑战！ 我们为学...,"[MentalHealth, CBT, Markham, Toronto, UofT, Yo...",1591121495128170496,,,
