# Taking the first steps in snscrape land 

**snscrape** is a scraper for social networking services (SNS). It scrapes things like user profiles, hashtags, or searches and returns the discovered items, e.g. the relevant posts.

GitHub repo: https://github.com/JustAnotherArchivist/snscrape

## Scraping tweets from the terminal

In [2]:
!pip install snscrape

Collecting snscrape
  Downloading snscrape-0.4.3.20220106-py3-none-any.whl (59 kB)
Installing collected packages: snscrape
Successfully installed snscrape-0.4.3.20220106


In [3]:
!snscrape --help


usage: snscrape [-h] [--version] [--citation] [-v] [--dump-locals] [--retry N]
                [-n N] [-f FORMAT | --jsonl] [--with-entity]
                [--since DATETIME] [--progress]
                SCRAPER ...

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --citation            Display recommended citation information and exit
                        (default: None)
  -v, --verbose, --verbosity
                        Increase output verbosity (default: 0)
  --dump-locals         Dump local variables on serious log messages (warnings
                        or higher) (default: False)
  --retry N, --retries N
                        When the connection fails or the server returns an
                        unexpected response, retry up to N times with an
                        exponential backoff (default: 3)
  -n N, --max-results N
                        Only return the first N results (default: None)
  -f FORMAT, --format FORMAT
                        Output 

snscrape provides several Twitter scrapers, each suited to a different use case.
To scrape the tweets of a particular user, the appropriate one is `twitter-user`.

In [4]:
# Return the 3 most recent tweets from @macmiller

!snscrape --max-results 3 twitter-user macmiller


https://twitter.com/MacMiller/status/1037883095196680192
https://twitter.com/MacMiller/status/1037879990090489856
https://twitter.com/MacMiller/status/1037879124868517889
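The help text above also lists a `--jsonl` flag. Assuming it emits one JSON object per line, the output could be loaded into Python roughly as sketched below. The helper names `build_command` and `parse_jsonl` are made up for this illustration, and the sample records are invented stand-ins for real CLI output.

```python
import json
import shlex


def build_command(username, max_results):
    """Build the snscrape CLI call as an argument list (global options first)."""
    return ["snscrape", "--jsonl", "--max-results", str(max_results),
            "twitter-user", username]


def parse_jsonl(text):
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Invented sample standing in for real CLI output (not actual tweets).
sample_output = (
    '{"url": "https://example.com/status/1", "id": 1}\n'
    '{"url": "https://example.com/status/2", "id": 2}\n'
)

command = build_command("macmiller", 3)
records = parse_jsonl(sample_output)

print(shlex.join(command))
print([record["id"] for record in records])
```

With snscrape installed and network access, the text returned by `subprocess.run(command, capture_output=True, text=True, check=True).stdout` could be passed to `parse_jsonl` instead of the sample string.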


## Using the official Python wrapper to scrape tweets

The CLI usage of snscrape is documented in the repo.

The Python wrapper, however, has no official documentation. As far as I can tell, usage information circulates mainly through blog posts and, probably, word of mouth.

In [1]:
# importing snscrape as a package in python

import snscrape.modules.twitter as sntwitter

In [2]:
# importing other libraries

import pandas as pd
import numpy as np
import itertools


To filter tweets by location, the search query can include a geocode filter, which takes three values: latitude, longitude, and radius.

In [3]:
# assigning the latitude and longitude of London to a variable, along with a radius value of 25km
loc = '51.50722, -0.1275, 25km'

# scraping 100 recent tweets from London that mention "mental health" and creating a pandas DataFrame 
df_coord = pd.DataFrame(itertools.islice(sntwitter.TwitterSearchScraper(
    'mental health geocode:"{}"'.format(loc)).get_items(), 100))

df_coord.head()


Unnamed: 0,url,date,content,renderedContent,id,user,replyCount,retweetCount,likeCount,quoteCount,...,media,retweetedTweet,quotedTweet,inReplyToTweetId,inReplyToUser,mentionedUsers,coordinates,place,hashtags,cashtags
0,https://twitter.com/dmcfender/status/158208235...,2022-10-17 18:53:22+00:00,Truss is beginning to show all the signs of so...,Truss is beginning to show all the signs of so...,1582082359435808770,"{'username': 'dmcfender', 'id': 3315235678, 'd...",0,0,0,0,...,,,,,,,,,,
1,https://twitter.com/SandLTH/status/15820810156...,2022-10-17 18:48:02+00:00,We like to encourage people to get outside int...,We like to encourage people to get outside int...,1582081015698591744,"{'username': 'SandLTH', 'id': 3003012933, 'dis...",0,0,0,0,...,[{'previewUrl': 'https://pbs.twimg.com/media/F...,,,,,,,,[getoutdoors],
2,https://twitter.com/felthambboy/status/1582080...,2022-10-17 18:47:09+00:00,Kanye west is a dangerous man with dangerous v...,Kanye west is a dangerous man with dangerous v...,1582080792507060224,"{'username': 'felthambboy', 'id': 989625924, '...",0,0,0,0,...,,,,,,,,,,
3,https://twitter.com/GaReth_Rutter/status/15820...,2022-10-17 18:43:32+00:00,@adamskip77 So. Sorry mate. So much I didn’t d...,@adamskip77 So. Sorry mate. So much I didn’t d...,1582079882766733312,"{'username': 'GaReth_Rutter', 'id': 397265577,...",0,0,0,0,...,,,,1.582026e+18,"{'username': 'adamskip77', 'id': 192690577, 'd...","[{'username': 'adamskip77', 'id': 192690577, '...",,,,
4,https://twitter.com/eamonnpmcmahon/status/1582...,2022-10-17 18:43:19+00:00,There are so many anonymous twitter accounts w...,There are so many anonymous twitter accounts w...,1582079827917799424,"{'username': 'eamonnpmcmahon', 'id': 106534795...",0,0,0,0,...,,,,,,,,,,
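The query string used in the cell above can also be assembled by a small helper, which makes it easier to reuse for other places and radii. This is only a sketch: `build_geocode_query` is a hypothetical name, and the range checks are an added assumption rather than anything snscrape enforces.

```python
def build_geocode_query(keywords, latitude, longitude, radius_km):
    """Combine search keywords with a geocode filter in the format used above."""
    # Basic sanity checks (an assumption of this sketch, not an snscrape rule).
    if not -90 <= latitude <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= longitude <= 180:
        raise ValueError("longitude must be between -180 and 180")
    if radius_km <= 0:
        raise ValueError("radius_km must be positive")

    loc = f"{latitude}, {longitude}, {radius_km}km"
    return f'{keywords} geocode:"{loc}"'


# Same London coordinates and 25 km radius as in the cell above.
query = build_geocode_query("mental health", 51.50722, -0.1275, 25)
print(query)
```

The resulting string can be passed to `sntwitter.TwitterSearchScraper` in place of the hand-formatted one.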


In [4]:
df_coord.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 27 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   url               100 non-null    object             
 1   date              100 non-null    datetime64[ns, UTC]
 2   content           100 non-null    object             
 3   renderedContent   100 non-null    object             
 4   id                100 non-null    int64              
 5   user              100 non-null    object             
 6   replyCount        100 non-null    int64              
 7   retweetCount      100 non-null    int64              
 8   likeCount         100 non-null    int64              
 9   quoteCount        100 non-null    int64              
 10  conversationId    100 non-null    int64              
 11  lang              100 non-null    object             
 12  source            100 non-null    object             
 13  source

We might want to filter the data further by language ("lang"), by date, or by how much traction a tweet has gained ("likeCount", "retweetCount", etc.).
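As a rough sketch, such filters can be combined as boolean masks in pandas. The frame below is a toy stand-in that reuses the column names from the scraped data; all values, the cut-off date, and the like threshold are invented for illustration.

```python
import pandas as pd

# Toy data with the same column names as the scraped frame (values invented).
tweets = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-10-15 09:00:00+00:00",
        "2022-10-17 18:00:00+00:00",
        "2022-10-17 18:30:00+00:00",
        "2022-10-17 19:00:00+00:00",
    ], utc=True),
    "lang": ["en", "en", "fr", "en"],
    "likeCount": [12, 0, 30, 7],
    "retweetCount": [3, 0, 5, 1],
})

# Keep English tweets posted on or after the cut-off with at least 5 likes.
since = pd.Timestamp("2022-10-17", tz="UTC")
mask = (
    (tweets["lang"] == "en")
    & (tweets["date"] >= since)
    & (tweets["likeCount"] >= 5)
)
filtered = tweets[mask].sort_values("likeCount", ascending=False)

print(filtered)
```

The same mask could be applied to `df_coord`, since it carries the "lang", "date", and "likeCount" columns listed above.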


The "user" column holds nested dictionaries, so it may need to be flattened with json_normalize from pandas before it can be used further.

In [5]:
# importing json_normalize from the top-level pandas namespace
# (the older pandas.io.json import path is deprecated)

from pandas import json_normalize

import warnings
warnings.filterwarnings("ignore")


In [6]:
users = json_normalize(df_coord["user"])

users.head()

Unnamed: 0,username,id,displayname,description,rawDescription,descriptionUrls,verified,created,followersCount,friendsCount,...,favouritesCount,listedCount,mediaCount,location,protected,linkUrl,linkTcourl,profileImageUrl,profileBannerUrl,label
0,dmcfender,3315235678,Dermz/#torysewagepartyout !,"Angry Pro European ..Uber driver, more people ...","Angry Pro European ..Uber driver, more people ...",,False,2015-06-09 15:40:41+00:00,825,1792,...,24278,2,24,london,False,,,https://pbs.twimg.com/profile_images/134433624...,https://pbs.twimg.com/profile_banners/33152356...,
1,SandLTH,3003012933,Strength and Learning Through Horses,We're a unique London charity helping young pe...,We're a unique London charity helping young pe...,,False,2015-01-28 11:56:11+00:00,629,518,...,3450,15,2767,"Edgware, London",False,http://www.strengthandlearningthroughhorses.org/,https://t.co/DDorXyWTnI,https://pbs.twimg.com/profile_images/130626432...,https://pbs.twimg.com/profile_banners/30030129...,
2,felthambboy,989625924,Tom,Brentford Football Club are premier League. a...,Brentford Football Club are premier League. a...,,False,2012-12-04 21:37:46+00:00,640,3042,...,3956,11,4001,South west London,False,,,https://pbs.twimg.com/profile_images/148220909...,https://pbs.twimg.com/profile_banners/98962592...,
3,GaReth_Rutter,397265577,Gareth Rutter,Founder and Creative Director @BellowStudio - ...,Founder and Creative Director @BellowStudio - ...,"[{'text': 'bellow.studio', 'url': 'http://bell...",False,2011-10-24 13:18:14+00:00,3125,4830,...,15781,58,6440,London,False,,,https://pbs.twimg.com/profile_images/155668183...,https://pbs.twimg.com/profile_banners/39726557...,
4,eamonnpmcmahon,1065347954297442304,Eamonn McMahon,"fintech / ex investment banking / Gin, Jazz an...","fintech / ex investment banking / Gin, Jazz an...",,False,2018-11-21 20:55:20+00:00,172,158,...,7168,0,170,"London, England",False,http://www.equipal.co,https://t.co/YOPQbLN96l,https://pbs.twimg.com/profile_images/157561270...,https://pbs.twimg.com/profile_banners/10653479...,


Interestingly, the "location" values for the users in the first five rows all appear to be in London.
That is good to note, since it is consistent with the geocode filter used in the search.
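To use the flattened user fields alongside the tweet fields, the normalised columns can be joined back onto the tweets. The sketch below uses a toy frame with invented values; the `user_` prefix is an arbitrary choice, and the join assumes the tweets frame has a default 0..n-1 index.

```python
import pandas as pd

# Toy frame mimicking the nested "user" column (values invented).
tweets = pd.DataFrame({
    "id": [1, 2],
    "content": ["first tweet", "second tweet"],
    "user": [
        {"username": "alice", "followersCount": 10, "location": "London"},
        {"username": "bob", "followersCount": 25, "location": "Leeds"},
    ],
})

# Flatten the dictionaries and prefix the new columns to avoid name clashes
# (for example, both tweets and users have an "id" field in the real data).
users = pd.json_normalize(tweets["user"].tolist()).add_prefix("user_")

# Column-wise join; rows line up because both frames share a default index.
flat = pd.concat([tweets.drop(columns="user"), users], axis=1)

# Count how many self-reported locations mention London.
in_london = int(flat["user_location"].str.contains("london", case=False).sum())

print(flat.columns.tolist())
print(in_london)
```

A count like `in_london` gives a quick check of the observation above across all rows rather than only the first five.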



To be continued...


References:

- https://www.kaggle.com/code/prathamsharma123/clean-raw-json-tweets-data/notebook
- https://betterprogramming.pub/how-to-scrape-tweets-with-snscrape-90124ed006af
- https://medium.com/swlh/how-to-scrape-tweets-by-location-in-python-using-snscrape-8c870fa6ec25
- https://github.com/JustAnotherArchivist/snscrape

