# Getting tweets with snscrape

## Libraries

**Prerequisite: Docker**

Run the following command to start a single-node Elasticsearch 8.3.3 instance in Docker, with security disabled for local use:
`docker run --rm -p 9200:9200 -p 9300:9300 -e "xpack.security.enabled=false" -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:8.3.3`

In [1]:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

In [2]:
es.info()

{'name': '13e00a4c77cd',
 'cluster_name': 'docker-cluster',
 'cluster_uuid': 'V41WvBpyROKDWh1tFjiMrw',
 'version': {'number': '8.3.3',
  'build_flavor': 'default',
  'build_type': 'docker',
  'build_hash': '801fed82df74dbe537f89b71b098ccaff88d2c56',
  'build_date': '2022-07-23T19:30:09.227964828Z',
  'build_snapshot': False,
  'lucene_version': '9.2.0',
  'minimum_wire_compatibility_version': '7.17.0',
  'minimum_index_compatibility_version': '7.0.0'},
 'tagline': 'You Know, for Search'}

In [3]:
import snscrape.modules.twitter as sntwitter
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

## Snscrape

**Set variables to be used below.**

In [4]:
maxTweets = 1000

In [5]:
scraper = sntwitter.TwitterSearchScraper("Balenciaga lang:en until:2022-12-02 since:2022-12-01")
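
The search string combines a keyword with Twitter search operators (`lang:`, `since:`, `until:`). As a minimal sketch, a small helper can assemble it from parts so that the date window is easy to change. `build_query` is a hypothetical name, not part of snscrape, and the helper assumes that `until:` excludes the given date, which is consistent with the timestamps in this notebook's output all falling on 2022-12-01.

```python
from datetime import date, timedelta


def build_query(keyword: str, day: date, lang: str = "en") -> str:
    """Build a search string covering a single calendar day (hypothetical helper)."""
    next_day = day + timedelta(days=1)
    return (
        f"{keyword} lang:{lang} "
        f"until:{next_day.isoformat()} since:{day.isoformat()}"
    )


query = build_query("Balenciaga", date(2022, 12, 1))
print(query)
```

The resulting string can be passed to `sntwitter.TwitterSearchScraper(query)` in place of the hard-coded literal.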

**Create list to append data to.**


In [6]:
tweets_list = []

**Use TwitterSearchScraper to scrape data and append tweets to the list.**

In [7]:
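for i, tweet in tqdm(enumerate(scraper.get_items()), total = maxTweets):
    data = [tweet.id, tweet.date, tweet.rawContent, tweet.likeCount, tweet.retweetCount, tweet.user.location]
    tweets_list.append(data)

    # Note: this check runs after the append and uses ">", so the loop
    # collects maxTweets + 2 tweets (1002 in this run), not maxTweets.
    if i > maxTweets:
        break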

  0%|          | 0/1000 [00:00<?, ?it/s]
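
Because the loop above checks `i > maxTweets` only after appending, it collects two more tweets than requested. One way to cap the count exactly is `itertools.islice`. The sketch below uses a stand-in generator, `fake_items`, in place of `scraper.get_items()` so that it runs without network access; both names introduced here are illustrative.

```python
from itertools import islice


def fake_items():
    """Stand-in for scraper.get_items(): an endless stream of items."""
    n = 0
    while True:
        yield {"id": n}
        n += 1


max_tweets = 1000
# islice stops after exactly max_tweets items
collected = list(islice(fake_items(), max_tweets))
print(len(collected))
```

With the real scraper, the same pattern would read `islice(scraper.get_items(), maxTweets)`, and the `break` check is no longer needed.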

**Create a dataframe from the tweets list.**

In [8]:
tweets_df = pd.DataFrame(tweets_list, columns=["Tweet_Id", "Datetime", "Text", "Likes", "Retweets", "Location"])

In [9]:
tweets_df.head()

| | Tweet_Id | Datetime | Text | Likes | Retweets | Location |
|---|---|---|---|---|---|---|
| 0 | 1598466952309211136 | 2022-12-01 23:59:54+00:00 | If y’all are throwing out any Balenciaga, espe... | 2 | 0 | Los Angeles, CA |
| 1 | 1598466908008980480 | 2022-12-01 23:59:43+00:00 | #75 live at 8pm EST on this page. So many thin... | 0 | 0 | Montreal |
| 2 | 1598466875947880448 | 2022-12-01 23:59:35+00:00 | @BaileyUnspoken @RealTalkPerson @KimKardashian... | 0 | 0 | Riverside Estates |
| 3 | 1598466874303463426 | 2022-12-01 23:59:35+00:00 | @kanyewest DEMNA! Lead Designer of Balenciaga!... | 0 | 0 | Houston, TX |
| 4 | 1598466848319750144 | 2022-12-01 23:59:29+00:00 | Kanye out here burying Balenciaga news tired 😴 | 0 | 0 | Manchester, England |


## Elasticsearch

### Storage

In [10]:
#tweets_list

In [11]:
#es.indices.delete(index="tweets")

In [12]:
#mappings = {
#        "properties": {
#            "Tweet_Id": {"type": "long"},
#            "Datetime": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
#            "Text": {"type": "text", "analyzer": "english"},
#            "Username": {"type": "keyword"}
#    }
#}
#
#es.indices.create(index="tweets", mappings=mappings)
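
The commented-out mapping above lists a `Username` field that the DataFrame does not contain, and it omits `Likes`, `Retweets`, and `Location`. A sketch of a mapping that matches the indexed columns might look like the following. The field types are assumptions, and `Datetime` relies on Elasticsearch's default date format instead of a custom pattern, since the stored values are ISO 8601 strings with a UTC offset.

```python
# Hypothetical mapping for the "tweets" index; the field types are assumptions.
mappings = {
    "properties": {
        "Tweet_Id": {"type": "long"},
        "Datetime": {"type": "date"},  # default format accepts ISO 8601 strings
        "Text": {"type": "text", "analyzer": "english"},
        "Likes": {"type": "integer"},
        "Retweets": {"type": "integer"},
        "Location": {"type": "keyword"},
    }
}

# To apply it (requires a running cluster and the `es` client created earlier):
# es.indices.create(index="tweets", mappings=mappings)
```

Mapping `Location` as `keyword` keeps each location as a single exact value, which suits aggregations such as counting tweets per location.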

In [13]:
def safe_value(field_val):
    return field_val if not pd.isna(field_val) else "Other"

tweets_df["Location"] = tweets_df["Location"].apply(safe_value)

In [14]:
from elasticsearch.helpers import bulk

In [15]:
bulk_data = []
for i,row in tweets_df.iterrows():
    bulk_data.append(
        {
            "_index": "tweets",
            "_id": i,
            "_source": {        
                "Tweet_Id": row["Tweet_Id"],
                "Datetime": row["Datetime"],
                "Text": row["Text"],
                "Likes": row["Likes"],
                "Retweets": row["Retweets"],
                "Location": row["Location"]
            }
        }
    )
bulk(es, bulk_data)

(1002, [])
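
`bulk` accepts any iterable of actions, so the intermediate list can be replaced by a generator. The sketch below also uses the tweet ID as the document `_id`, so that the same tweet scraped twice maps to a single document. `rows` is a stand-in for `tweets_df.to_dict("records")`, and `generate_actions` is a hypothetical helper name.

```python
def generate_actions(rows, index_name="tweets"):
    """Yield one bulk action per row, keyed by tweet ID (hypothetical helper)."""
    for row in rows:
        yield {
            "_index": index_name,
            "_id": row["Tweet_Id"],
            "_source": row,
        }


# Small made-up rows standing in for tweets_df.to_dict("records")
rows = [
    {"Tweet_Id": 1, "Text": "first", "Likes": 2},
    {"Tweet_Id": 2, "Text": "second", "Likes": 0},
]
actions = list(generate_actions(rows))
print(len(actions))

# With a running cluster: bulk(es, generate_actions(rows))
```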

In [16]:
es.indices.refresh(index="tweets")
es.cat.count(index="tweets", format="json")

[{'epoch': '1675114844', 'timestamp': '21:40:44', 'count': '1002'}]

### Querying

In [17]:
from elasticsearch_dsl import Search, connections

# Use a full URL (scheme, host, port), as the 8.x client requires
connections.create_connection(hosts=["http://localhost:9200"])

s = Search(using = es, index = "tweets")
response = s.scan()

count = 0
records = []
for hit in response:
    records.append(hit.to_dict())
    # print(hit.to_dict())  # careful: this prints every hit in the index
    count += 1

print(count)

1002
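
`scan()` without a query returns every document in the index. To retrieve a subset, a query body can be passed to the search API. The sketch below builds a `bool` query that combines a full-text match on `Text` with a minimum number of likes. `build_tweet_query` is a hypothetical helper, and the field names assume the documents indexed earlier in this notebook.

```python
def build_tweet_query(text, min_likes=0):
    """Build a bool query: full-text match on Text, filtered by Likes."""
    return {
        "bool": {
            "must": [{"match": {"Text": text}}],
            "filter": [{"range": {"Likes": {"gte": min_likes}}}],
        }
    }


query = build_tweet_query("photo shoot", min_likes=1)
print(query)

# With a running cluster and the 8.x client:
# response = es.search(index="tweets", query=query, size=10)
```

Putting the likes condition under `filter` instead of `must` means it restricts the results without affecting relevance scores.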


In [18]:
records

[{'Tweet_Id': 1598466952309211136,
  'Datetime': '2022-12-01T23:59:54+00:00',
  'Text': 'If y’all are throwing out any Balenciaga, especially size L shirts or a size 12 for shoes, please bring it to me so I can dispose of it',
  'Likes': 2,
  'Retweets': 0,
  'Location': 'Los Angeles, CA'},
 {'Tweet_Id': 1598466908008980480,
  'Datetime': '2022-12-01T23:59:43+00:00',
  'Text': '#75 live at 8pm EST on this page. So many things to talk about including Balenciaga, climate change activism etc. https://t.co/tJkfjNvyKg',
  'Likes': 0,
  'Retweets': 0,
  'Location': 'Montreal'},
 {'Tweet_Id': 1598466875947880448,
  'Datetime': '2022-12-01T23:59:35+00:00',
  'Text': '@BaileyUnspoken @RealTalkPerson @KimKardashian @RealTristan13 But she literally ONLY wears balenciaga. If Balenciaga had done something less serious I could agree with you, but this is literal CHILD PORNOGRAPHY. The campaign would have had to be approved by so many people at Balenciaga. This involves the whole company.',
  'Likes'

## Modify the queried JSON and save it locally as a DataFrame and JSON

In [19]:
records_df = pd.DataFrame.from_dict(records)

### Datetime

In [20]:
records_df["Datetime"] = pd.to_datetime(records_df["Datetime"], format = "%Y-%m-%dT%H:%M:%S%z")
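
The format string above expects an ISO 8601 timestamp with a UTC offset. As a quick standard-library check, the sketch below confirms that the `%z` directive accepts the `+00:00` offset found in the stored values (this assumes Python 3.7 or later).

```python
from datetime import datetime, timezone

raw = "2022-12-01T23:59:54+00:00"
parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S%z")
print(parsed.isoformat())

# datetime.fromisoformat parses the same string without a format argument
assert parsed == datetime.fromisoformat(raw)
```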

In [21]:
records_df.head(10)

| | Tweet_Id | Datetime | Text | Likes | Retweets | Location |
|---|---|---|---|---|---|---|
| 0 | 1598466952309211136 | 2022-12-01 23:59:54+00:00 | If y’all are throwing out any Balenciaga, espe... | 2 | 0 | Los Angeles, CA |
| 1 | 1598466908008980480 | 2022-12-01 23:59:43+00:00 | #75 live at 8pm EST on this page. So many thin... | 0 | 0 | Montreal |
| 2 | 1598466875947880448 | 2022-12-01 23:59:35+00:00 | @BaileyUnspoken @RealTalkPerson @KimKardashian... | 0 | 0 | Riverside Estates |
| 3 | 1598466874303463426 | 2022-12-01 23:59:35+00:00 | @kanyewest DEMNA! Lead Designer of Balenciaga!... | 0 | 0 | Houston, TX |
| 4 | 1598466848319750144 | 2022-12-01 23:59:29+00:00 | Kanye out here burying Balenciaga news tired 😴 | 0 | 0 | Manchester, England |
| 5 | 1598466846080262144 | 2022-12-01 23:59:28+00:00 | Thank God I didn’t end up buying them balencia... | 0 | 0 | Sydney / Nairobi / Harare |
| 6 | 1598466840992579584 | 2022-12-01 23:59:27+00:00 | @mmpadellan With the drama at Balenciaga and D... | 3 | 0 | |
| 7 | 1598466839293599745 | 2022-12-01 23:59:27+00:00 | @kanyewest We love Demna. He's not Balenciaga ... | 0 | 0 | |
| 8 | 1598466829797687298 | 2022-12-01 23:59:24+00:00 | @vikare06 Looks like a balenciaga photo shoot | 2 | 0 | Hyde Park |
| 9 | 1598466786244317184 | 2022-12-01 23:59:14+00:00 | @DestinyVaughn @M4D3R0 @ksenijapavlovic @Mikha... | 0 | 0 | New York, NY / Manahatta |


### Replace missing locations with "Other"

In [22]:
#records_df["Location"]

In [23]:
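def safe_value(field_val):
    # pd.isna() is False for empty strings, so blank locations stay blank;
    # only None/NaN values are replaced with "Other".
    return field_val if not pd.isna(field_val) else "Other"

records_df["Location"] = records_df["Location"].apply(safe_value)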

In [24]:
records_df.head(10)

| | Tweet_Id | Datetime | Text | Likes | Retweets | Location |
|---|---|---|---|---|---|---|
| 0 | 1598466952309211136 | 2022-12-01 23:59:54+00:00 | If y’all are throwing out any Balenciaga, espe... | 2 | 0 | Los Angeles, CA |
| 1 | 1598466908008980480 | 2022-12-01 23:59:43+00:00 | #75 live at 8pm EST on this page. So many thin... | 0 | 0 | Montreal |
| 2 | 1598466875947880448 | 2022-12-01 23:59:35+00:00 | @BaileyUnspoken @RealTalkPerson @KimKardashian... | 0 | 0 | Riverside Estates |
| 3 | 1598466874303463426 | 2022-12-01 23:59:35+00:00 | @kanyewest DEMNA! Lead Designer of Balenciaga!... | 0 | 0 | Houston, TX |
| 4 | 1598466848319750144 | 2022-12-01 23:59:29+00:00 | Kanye out here burying Balenciaga news tired 😴 | 0 | 0 | Manchester, England |
| 5 | 1598466846080262144 | 2022-12-01 23:59:28+00:00 | Thank God I didn’t end up buying them balencia... | 0 | 0 | Sydney / Nairobi / Harare |
| 6 | 1598466840992579584 | 2022-12-01 23:59:27+00:00 | @mmpadellan With the drama at Balenciaga and D... | 3 | 0 | |
| 7 | 1598466839293599745 | 2022-12-01 23:59:27+00:00 | @kanyewest We love Demna. He's not Balenciaga ... | 0 | 0 | |
| 8 | 1598466829797687298 | 2022-12-01 23:59:24+00:00 | @vikare06 Looks like a balenciaga photo shoot | 2 | 0 | Hyde Park |
| 9 | 1598466786244317184 | 2022-12-01 23:59:14+00:00 | @DestinyVaughn @M4D3R0 @ksenijapavlovic @Mikha... | 0 | 0 | New York, NY / Manahatta |
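
In the output above, rows 6 and 7 still have an empty `Location`: `pd.isna` returns `False` for an empty string, so `safe_value` leaves it unchanged. A sketch of a variant that also treats blank or whitespace-only strings as missing is shown below. `safe_location` is a hypothetical name, and it uses only the standard library so that it can be tested outside pandas.

```python
import math


def safe_location(value, default="Other"):
    """Return default for None, NaN, or blank strings; otherwise the stripped text."""
    if value is None:
        return default
    if isinstance(value, float) and math.isnan(value):
        return default
    text = str(value).strip()
    return text if text else default


print(safe_location(""))
print(safe_location("Montreal"))
```

It can be applied the same way as the original helper: `records_df["Location"] = records_df["Location"].apply(safe_location)`.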


### Export dataframe to JSON and CSV

In [25]:
# to_json returns None when a path is given, so the result is not assigned
records_df.to_json(path_or_buf = "./records", orient = "records")

In [26]:
records_df.to_csv("tweets.csv", sep=",", index = False) 
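
One caveat for the JSON export: tweet IDs are 19-digit integers, which exceed 2^53, the limit up to which IEEE 754 doubles represent every integer exactly. Python keeps them exact, but a consumer that parses JSON numbers as doubles (JavaScript, for example) may round them. The sketch below demonstrates the rounding with one ID from the data and shows one possible safeguard, exporting IDs as strings. That step is a suggestion, not something the original notebook does.

```python
import json

tweet_id = 1598466874303463426  # one ID taken from the scraped data

# A double cannot hold this value exactly, so the round trip changes it.
as_double = int(float(tweet_id))
print(as_double == tweet_id)  # False

# Storing the ID as a string keeps every digit intact in JSON.
record = {"Tweet_Id": str(tweet_id), "Likes": 0}
restored = json.loads(json.dumps(record))
print(restored["Tweet_Id"])
```

In the notebook, the equivalent step would be `records_df["Tweet_Id"] = records_df["Tweet_Id"].astype(str)` before calling `to_json`.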