# Obtaining Data

To follow along and practice with the course resources, I will scrape Twitter data and try to apply the different tools to my own dataset.

1. I will first scrape the tweets of 4 Twitter ML accounts I like (with different audiences and styles), extracting the last 52 tweets of each account posted before Thursday, January 19, 2023, 00:00 CT. For each tweet, I extract the following information:
    - Date
    - Tweet ID
    - Tweet Content
    - Tweet Sentiment (obtained with [TextBlob](https://textblob.readthedocs.io/en/dev/))
    - Whether the Tweet contains media
    - Number of Views
    - Number of Retweets
    - Number of Replies
    - User
    - Number of Followers of the User
    - Number of Likes

2. I will then compute the sentiment of each tweet and add it to the features.

3. Lastly, I will create a CSV file that will be stored on a [Datasets Notion page](https://florentine-rayon-d99.notion.site/Datasets-88840ad9026047d09c0359327f39efd0) I prepared to store my practice Datasets. The data I'll be creating in this tutorial: [Twitter Data](https://s3.us-west-2.amazonaws.com/secure.notion-static.com/56592100-0105-4d81-869e-29ec562a1f2d/ML-AZ-tweets.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIAT73L2G45EIPT3X45%2F20230119%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20230119T054104Z&X-Amz-Expires=86400&X-Amz-Signature=66a6100d43d7003c46a926f27092e95d5c709b13c7e505ea10b0ea619f132c6c&X-Amz-SignedHeaders=host&response-content-disposition=filename%3D%22ML-AZ-tweets.csv%22&x-id=GetObject).

> Useful additional link for Sentiment Analysis: https://www.geeksforgeeks.org/twitter-sentiment-analysis-using-python/
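
TextBlob reports sentiment as a polarity score between -1 (negative) and 1 (positive). As a minimal sketch of how such a score could be turned into a coarse label, the hypothetical helper `label_polarity` below (it is not part of TextBlob or of this notebook) applies a simple sign rule. The sample scores are copied from the `sentiment` column in the output further down.

```python
def label_polarity(polarity: float) -> str:
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


# Scores copied from the sentiment column of the scraped tweets.
for score in (0.4, 0.0, 0.075):
    print(score, label_polarity(score))
```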

## Twitter Scraping

I'll use the [snscrape](https://github.com/JustAnotherArchivist/snscrape) Python library to scrape the tweets. The library is not well documented, but here are two great resources to help us use the tool:
1. [Scrape Twitter with 5 Lines of Code](https://www.youtube.com/watch?v=PUMMCLrVn8A), YouTube video by Rob Mulla.
1. [Scrape Twitter data without Twitter API using SNScrape for timeseries analysis](https://datasciencedojo.com/blog/scrape-twitter-data-using-sncrape/), article on Data Science Dojo.

In [9]:
import re
from tqdm import tqdm # Progress bar: conda install -c conda-forge tqdm
import snscrape.modules.twitter as sntwitter # pip install snscrape 
import pandas as pd
from textblob import TextBlob

def scrape_twitter(username, max_tweets = 50):

    def clean_tweet(tweet):
        # Utility function to clean tweet text by removing @mentions, links and
        # special characters using simple regex statements. The pattern is a raw
        # string so that "\w" and "\S" are not treated as string escapes.
        return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

    # Variables:
    scraper = sntwitter.TwitterSearchScraper(f"from:{username} exclude:retweets exclude:replies")
    tweets = []

    for i, tweet in tqdm(enumerate(scraper.get_items()), total = max_tweets):

        tweet_sentiment = TextBlob(clean_tweet(tweet.rawContent)).sentiment.polarity

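        tweet_data = [
            tweet.date,
            tweet.id,
            tweet.rawContent,
            tweet_sentiment,
            tweet.media is not None,  # True when the tweet has attached media
            tweet.viewCount,
            tweet.retweetCount,
            tweet.replyCount,
            tweet.user.username,
            tweet.user.followersCount,
            tweet.likeCount,
        ]
        tweets.append(tweet_data)
        # The check runs after the append and uses ">", so the loop collects
        # up to max_tweets + 2 tweets (52 with the default of 50).
        if i > max_tweets:
            break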
    
    twitter_data = pd.DataFrame(
    tweets, 
    columns = [
        'date', 
        'id', 
        'content',
        'sentiment',
        'has_media',
        'views', 
        'retweets', 
        'replies',
        'user',
        'followers',
        'likes',
    ]
    )

    return twitter_data
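
Because `clean_tweet` is nested inside `scrape_twitter`, it cannot be called on its own. The sketch below copies the same regular expression into a standalone function to show what text TextBlob actually receives. The first sample is one of the scraped tweets; the second string is made up for illustration.

```python
import re


def clean_tweet(tweet: str) -> str:
    # Same pattern as in scrape_twitter: drop @mentions, any character that is
    # not alphanumeric, space or tab, and URLs, then collapse the whitespace.
    pattern = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"
    return " ".join(re.sub(pattern, " ", tweet).split())


print(clean_tweet("We are so early. https://t.co/jKdSMlpPMS"))
# We are so early
print(clean_tweet("@svpino great thread, thanks!"))  # made-up example
# great thread thanks
```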

In [10]:
# Getting tweets from users:
tweets_tunguz = scrape_twitter('tunguz')
tweets_pau = scrape_twitter('paulabartabajo_')
tweets_qb = scrape_twitter('quantumblack')
tweets_santiago = scrape_twitter('svpino')

tweets = pd.concat([tweets_tunguz, tweets_pau, tweets_qb, tweets_santiago])

51it [00:01, 27.22it/s]                        
51it [00:01, 39.12it/s]                        
51it [00:01, 30.67it/s]                        
51it [00:01, 29.03it/s]                        
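
The calls above use the default `max_tweets = 50`, yet the loop in `scrape_twitter` appends each tweet before testing `i > max_tweets`, so it only stops once `i` reaches 51, which is 52 tweets. If an exact cap is wanted, one option is `itertools.islice`. The sketch below compares both approaches on a stand-in iterable; `take_like_original` and `take_exactly` are hypothetical helper names, and `range(1000)` merely plays the role of `scraper.get_items()`.

```python
from itertools import islice


def take_like_original(items, max_items):
    """Mimic the loop in scrape_twitter: append first, then test i > max_items."""
    taken = []
    for i, item in enumerate(items):
        taken.append(item)
        if i > max_items:
            break
    return taken


def take_exactly(items, max_items):
    """Return at most max_items elements from any iterable."""
    return list(islice(items, max_items))


fake_tweets = range(1000)  # stand-in for scraper.get_items()
print(len(take_like_original(fake_tweets, 50)))  # 52
print(len(take_exactly(fake_tweets, 50)))        # 50
```

In `scrape_twitter`, the same idea would amount to iterating over `islice(scraper.get_items(), max_tweets)` and dropping the `break`.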


In [11]:
tweets

| | date | id | content | sentiment | has_media | views | retweets | replies | user | followers | likes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-19 02:55:22+00:00 | 1615905729654702083 | Accurate. | 0.400000 | False | 3828.0 | 1 | 1 | tunguz | 90940 | 15 |
| 1 | 2023-01-19 02:21:00+00:00 | 1615897079741571072 | We are so early. https://t.co/jKdSMlpPMS | 0.100000 | False | 3879.0 | 1 | 4 | tunguz | 90940 | 35 |
| 2 | 2023-01-19 00:01:10+00:00 | 1615861890739220481 | Why cool and waste it when you can boil and ta... | 0.075000 | True | 5703.0 | 6 | 9 | tunguz | 90940 | 56 |
| 3 | 2023-01-18 23:46:44+00:00 | 1615858256471273472 | I barely know what a binary tree is. Is that l... | 0.050000 | False | 8774.0 | 0 | 11 | tunguz | 90940 | 46 |
| 4 | 2023-01-18 23:43:05+00:00 | 1615857338254241792 | Yes, gaslighting is the right term here. | 0.285714 | False | 4754.0 | 2 | 2 | tunguz | 90940 | 12 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 47 | 2022-12-28 17:43:39+00:00 | 1608156739270057985 | https://t.co/ni3SEHdGBN | 0.000000 | False | 27219.0 | 9 | 16 | svpino | 228565 | 45 |
| 48 | 2022-12-28 13:00:08+00:00 | 1608085388350062594 | 7 YouTube videos for anyone starting with mach... | 0.000000 | False | 286276.0 | 521 | 48 | svpino | 228565 | 2025 |
| 49 | 2022-12-27 16:58:45+00:00 | 1607783052297572352 | Who currently has more advanced Computer Visio... | 0.300000 | False | 24248.0 | 3 | 18 | svpino | 228565 | 18 |
| 50 | 2022-12-27 13:00:17+00:00 | 1607723041966161922 | Free AI Conference in early January!\n\nGuess ... | 0.333333 | True | 40529.0 | 26 | 4 | svpino | 228565 | 158 |


In [12]:
tweets.to_csv("ML-AZ-tweets.csv", index = False)
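
A quick way to sanity-check the saved file is to read it back and count the rows per account. The sketch below does this with the standard `csv` module; `rows_per_user` is a hypothetical helper, and the in-memory `sample` is only a stand-in for `ML-AZ-tweets.csv`, built from the header written with `index = False` plus three rows copied from the output above. Note how the tweet containing a comma is wrapped in quotes.

```python
import csv
import io
from collections import Counter


def rows_per_user(csv_file) -> Counter:
    """Count rows per value of the 'user' column in an open CSV file."""
    return Counter(row["user"] for row in csv.DictReader(csv_file))


# Stand-in for the saved file: header plus three rows from the scraped data.
sample = io.StringIO(
    "date,id,content,sentiment,has_media,views,retweets,replies,user,followers,likes\n"
    "2023-01-19 02:55:22+00:00,1615905729654702083,Accurate.,0.4,False,3828.0,1,1,tunguz,90940,15\n"
    '2023-01-18 23:43:05+00:00,1615857338254241792,"Yes, gaslighting is the right term here.",0.285714,False,4754.0,2,2,tunguz,90940,12\n'
    "2022-12-28 17:43:39+00:00,1608156739270057985,https://t.co/ni3SEHdGBN,0.0,False,27219.0,9,16,svpino,228565,45\n"
)
counts = rows_per_user(sample)
print(counts)

# With the real file, the call would look like:
# with open("ML-AZ-tweets.csv", newline="", encoding="utf-8") as f:
#     print(rows_per_user(f))
```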