# Data Collection

In this notebook, we describe the data collection process for our project. Because our access to the Twitter API was limited, we used [ScraperAPI](https://www.scraperapi.com/twitter-scraper/), a web scraping service founded in 2018, as an alternative. ScraperAPI claims that scraping Twitter without the official API is legal because most Twitter data is publicly accessible.

In [1]:
import requests
import json
import pandas as pd
import os

# load environment variables
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

In [2]:
api_key = os.environ['API_KEY']
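
If `API_KEY` is not defined, the lookup above fails with a bare `KeyError`. A minimal sketch of a more descriptive check follows, assuming the key is stored under the same `API_KEY` name. The helper `get_api_key` and its `env` parameter are hypothetical and not part of the original notebook.

```python
import os


def get_api_key(var_name="API_KEY", env=None):
    """Return the API key from the environment or raise a descriptive error.

    `env` defaults to os.environ; passing a plain dict makes the check easy to try out.
    """
    env = os.environ if env is None else env
    key = env.get(var_name)
    if not key:
        raise RuntimeError(
            f"Environment variable {var_name!r} is not set; add it to your .env file."
        )
    return key
```

In the notebook this would be used as `api_key = get_api_key()` after `load_dotenv` has run.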

### Scrape Data

In [3]:
# to store data in json from scraping
folder_path = "raw_data"

The `scrape_data` function below retrieves tweets through the ScraperAPI endpoint and stores the results in a JSON file. Because the number of API calls is limited, each company is scraped over several runs, and each run is saved to its own file. A single run makes 15 consecutive requests, each returning a batch of approximately 19 tweets together with a `next_cursor` value that is passed to the following request. The responses are then exported to a JSON file.

In [4]:
def scrape_data(file_name, next_cursor, user_id):
    payload = {'api_key': api_key, 'user_id': user_id, 'next_cursor': next_cursor}

    responses = []
    # 15 consecutive requests
    for _ in range(15):
        r = requests.get('https://api.scraperapi.com/structured/twitter/v2/tweets', params=payload)
        data = r.json()
        responses.append(data)

        # update the next_cursor value; stop early if the response has none
        next_cursor = data.get('next_cursor')
        if not next_cursor:
            break
        payload['next_cursor'] = next_cursor

    # a dictionary to store the name and responses
    export_data = {
        'name': file_name,
        'responses': responses
    }
    
    # make sure the output folder exists before writing
    os.makedirs(folder_path, exist_ok=True)
    file_path = os.path.join(folder_path, file_name)

    # export the data to JSON file
    with open(file_path, 'w') as file:
        json.dump(export_data, file)

    print(f"Scraping completed and responses exported to '{file_name}' file.")
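
The cursor-following pattern used in `scrape_data` can be tried without network access or an API key by separating it from the HTTP call. The sketch below is illustrative only: `collect_pages`, `fetch_page`, and the `fake_pages` data are hypothetical names, and the assumption that a response without `next_cursor` marks the end of the timeline is ours, not something ScraperAPI documents here.

```python
def collect_pages(fetch_page, start_cursor, max_pages=15):
    """Follow next_cursor values until max_pages is reached or the cursor runs out.

    fetch_page is any callable that takes a cursor and returns a response dict.
    Returns the collected responses and the cursor to resume from (None if exhausted).
    """
    responses = []
    cursor = start_cursor
    for _ in range(max_pages):
        data = fetch_page(cursor)
        responses.append(data)
        cursor = data.get("next_cursor")
        if not cursor:
            break
    return responses, cursor


# Stand-in for the real HTTP call: three fake pages, the last one without a cursor.
fake_pages = {
    "c0": {"tweets": [{"tweet_id": "1"}], "next_cursor": "c1"},
    "c1": {"tweets": [{"tweet_id": "2"}], "next_cursor": "c2"},
    "c2": {"tweets": [{"tweet_id": "3"}]},
}
pages, last_cursor = collect_pages(fake_pages.__getitem__, "c0")
```

Returning the final cursor alongside the responses makes it straightforward to start the next run where the previous one stopped.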

In [24]:
# momentive
# scrape_data("momentive-5.json", "HBaQwLDh99yXyCcAAA==", "1384685957216092161")

In [25]:
# Wazoku
# scrape_data("wazoku-3.json","HBaEgL3NzbiwgScAAA==","304002619")

In [26]:
# INPART
# scrape_data("inpart-4.json","HBb+v7Llv4G80CYAAA==","819883602")

In [27]:
# ninesigma
# scrape_data("ninesigma-3.json","HBaCwLXl9pjenB0AAA==","19405388")

In [28]:
# yet2com
# scrape_data("yet2com-4.json","HBaKwL35gaSg1B8AAA==","63966920")

In [29]:
# innoget
# scrape_data("innoget-1.json","NEXT_CURSOR_VALUE","72579390")
# scrape_data("innoget-2.json","HBaAgLaNj/jg0hMAAA==","72579390")
# scrape_data("innoget-3.json","HBaAwLe5y+eHrRAAAA==","72579390")
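
The cursor strings passed in the calls above are presumably the `next_cursor` of the last response saved in the previous file for the same account. The notebook does not say where they came from, so treat that as an assumption. Under that assumption, a small hypothetical helper (`last_cursor_in_file`, not part of the original notebook) could read the resume point from a saved file instead of copying it by hand:

```python
import json
import os
import tempfile


def last_cursor_in_file(file_path):
    """Return the next_cursor of the final saved response, or None if absent."""
    with open(file_path, "r") as file:
        saved = json.load(file)
    responses = saved.get("responses", [])
    if not responses:
        return None
    return responses[-1].get("next_cursor")


# Demo on a throwaway file that mimics the export format written by scrape_data.
demo = {"name": "demo-1.json", "responses": [{"next_cursor": "A"}, {"next_cursor": "B"}]}
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo-1.json")
    with open(demo_path, "w") as file:
        json.dump(demo, file)
    resume_cursor = last_cursor_in_file(demo_path)
```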

### Combine into a Single DataFrame

After collecting the tweets from all competitors, we extract the desired fields and combine them into one DataFrame.

In [5]:
def load_data(filename):
    print(filename)
    file_path = os.path.join(folder_path, filename)

    with open(file_path, 'r') as file:
        response_json = json.load(file)
    
    # extract the desired fields from the JSON data
    tweet_data = []
    skipped_count = 0  # Counter for skipped rows
    for item in response_json['responses']:
        if 'tweets' in item:
            tweets = item['tweets']
            for tweet in tweets:
                try:
                    user_id = tweet['user_id']
                    user_name = tweet['user_name']
                    date = tweet['date']
                    tweet_id = tweet['tweet_id']
                    text = tweet['text']
                    is_reply = tweet['is_reply']
                    replies = tweet['replies']
                    retweets = tweet['retweets']
                    quotes = tweet['quotes']
                
                    # append the extracted data to the list
                    tweet_data.append({
                        'user_id': user_id,
                        'user_name': user_name,
                        'date': date,
                        'text': text,
                        'tweet_id': tweet_id,
                        'is_reply': is_reply,
                        'replies': replies,
                        'retweets': retweets,
                        'quotes': quotes
                    })
                except KeyError:
                    # Count and skip tweets that are missing any of the required fields
                    skipped_count += 1
                    continue
    
    df = pd.DataFrame(tweet_data)
    print(f"Skipped rows: {skipped_count}")
    return df

In [6]:
# function to check for duplicated tweet_id values
def check_dup(df):
    duplicated_tweet_ids = df[df.duplicated('tweet_id', keep=False)]

    if not duplicated_tweet_ids.empty:
        print("Duplicated tweet_id values found:")
        print(duplicated_tweet_ids)
    else:
        print("No duplicated tweet_id values found.")
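
`check_dup` only reports duplicates. If any were found, they could be removed with pandas' `drop_duplicates`. A minimal sketch on made-up sample data follows; the helper name `drop_dup` is hypothetical, and keeping the first occurrence is an assumption about the desired behavior.

```python
import pandas as pd


def drop_dup(df):
    """Keep the first row for each tweet_id and renumber the index."""
    return df.drop_duplicates(subset="tweet_id", keep="first").reset_index(drop=True)


# Made-up sample with one repeated tweet_id.
sample = pd.DataFrame({"tweet_id": ["1", "2", "2"], "text": ["a", "b", "b again"]})
deduped = drop_dup(sample)
```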

In [7]:
# list all JSON files produced by the scraping step
raw_data = [f for f in os.listdir(folder_path) if f.endswith(".json")]

In [8]:
raw_data

['innoget-2.json',
 'momentive-5.json',
 'inpart-1.json',
 'ninesigma-2.json',
 'ninesigma-3.json',
 'momentive-4.json',
 'innoget-3.json',
 'wazoku-1.json',
 'momentive-3.json',
 'momentive-2.json',
 'yet2com-1.json',
 'momentive-1.json',
 'yet2com-2.json',
 'yet2com-3.json',
 'inpart-4.json',
 'wazoku-2.json',
 'inpart-3.json',
 'yet2com-4.json',
 'inpart-2.json',
 'ninesigma-1.json',
 'wazoku-3.json',
 'innoget-1.json']

In [9]:
# create a df to store combined data
df = pd.DataFrame()

In [10]:
for file in raw_data:
    temp_df = load_data(file)
    df = pd.concat([df,temp_df], ignore_index=True, sort=False)
    check_dup(df)

innoget-2.json
Skipped rows: 0
No duplicated tweet_id values found.
momentive-5.json
Skipped rows: 0
No duplicated tweet_id values found.
inpart-1.json
Skipped rows: 0
No duplicated tweet_id values found.
ninesigma-2.json
Skipped rows: 0
No duplicated tweet_id values found.
ninesigma-3.json
Skipped rows: 0
No duplicated tweet_id values found.
momentive-4.json
Skipped rows: 0
No duplicated tweet_id values found.
innoget-3.json
Skipped rows: 0
No duplicated tweet_id values found.
wazoku-1.json
Skipped rows: 0
No duplicated tweet_id values found.
momentive-3.json
Skipped rows: 0
No duplicated tweet_id values found.
momentive-2.json
Skipped rows: 0
No duplicated tweet_id values found.
yet2com-1.json
Skipped rows: 0
No duplicated tweet_id values found.
momentive-1.json
Skipped rows: 0
No duplicated tweet_id values found.
yet2com-2.json
Skipped rows: 1
No duplicated tweet_id values found.
yet2com-3.json
Skipped rows: 0
No duplicated tweet_id values found.
inpart-4.json
Skipped rows: 0
No duplicated tweet_id values found.
[remaining output truncated]
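
The loop above concatenates onto `df` once per file, which copies the accumulated data each time. An alternative sketch collects the per-file frames in a list and concatenates once. The two small frames below are made-up stand-ins for the output of `load_data`.

```python
import pandas as pd

# Made-up stand-ins for the DataFrames returned by load_data for two files.
frames = [
    pd.DataFrame({"tweet_id": ["1", "2"], "user_name": ["a", "a"]}),
    pd.DataFrame({"tweet_id": ["3"], "user_name": ["b"]}),
]

# One concat over the whole list, followed by a single duplicate check.
combined = pd.concat(frames, ignore_index=True, sort=False)
has_duplicates = combined.duplicated("tweet_id").any()
```

With the notebook's own functions, the list would be built as `[load_data(f) for f in raw_data]`. Checking for duplicates once at the end gives the same final answer as checking after every file, at the cost of not knowing which file introduced a duplicate.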

In [11]:
df.head()

| | user_id | user_name | date | text | tweet_id | is_reply | replies | retweets | quotes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 72579390 | innoget | Mon Feb 29 11:23:20 +0000 2016 | RT @PRUAB: 3,2M€ #GrantCall per a projectes #c... | 704265693152337921 | False | 0 | 1 | 0 |
| 1 | 72579390 | innoget | Fri Feb 26 11:35:21 +0000 2016 | #Global #Biotech Reagents Market 2016 Industry... | 703181551727570944 | False | 0 | 0 | 0 |
| 2 | 72579390 | innoget | Thu Feb 25 10:15:16 +0000 2016 | Tech Transfer Office in #Ohio #University help... | 702799011187658752 | False | 0 | 0 | 0 |
| 3 | 72579390 | innoget | Wed Feb 24 12:10:05 +0000 2016 | What’s your point regarding IP protection? Doe... | 702465520109559808 | False | 0 | 0 | 0 |
| 4 | 72579390 | innoget | Tue Feb 23 15:15:17 +0000 2016 | New article about #Samsung and its investment ... | 702149739492597761 | False | 0 | 0 | 0 |


In [12]:
df.value_counts('user_name')

user_name
yet2com        849
MomentiveAI    846
IN_PART        840
NineSigma      832
WazokuHq       819
innoget        798
Name: count, dtype: int64

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4984 entries, 0 to 4983
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   user_id    4984 non-null   object
 1   user_name  4984 non-null   object
 2   date       4984 non-null   object
 3   text       4984 non-null   object
 4   tweet_id   4984 non-null   object
 5   is_reply   4984 non-null   bool  
 6   replies    4984 non-null   int64 
 7   retweets   4984 non-null   int64 
 8   quotes     4984 non-null   int64 
dtypes: bool(1), int64(3), object(5)
memory usage: 316.5+ KB
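
The `info()` output shows that `date` is stored as plain strings (`object`). If later steps need real timestamps, the column can be parsed with `pd.to_datetime`. The sketch below uses two date strings taken from the `df.head()` output. The format string is inferred from those values and assumes English day and month abbreviations.

```python
import pandas as pd

# Two date strings in the same layout as the scraped data.
sample = pd.DataFrame(
    {"date": ["Mon Feb 29 11:23:20 +0000 2016", "Fri Feb 26 11:35:21 +0000 2016"]}
)

# Parse into timezone-aware UTC timestamps.
sample["date"] = pd.to_datetime(
    sample["date"], format="%a %b %d %H:%M:%S %z %Y", utc=True
)
```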


In [16]:
# export to pickle
df.to_pickle("pickle_files/all_tweets.pkl")
print("DataFrames exported successfully.")

DataFrames exported successfully.


In [17]:
# export to CSV
df.to_csv("csv_files/all_tweets.csv", index=False)
print("DataFrames exported successfully.")

DataFrames exported successfully.
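
One caveat when reading the CSV back: `user_id` and `tweet_id` are strings in the DataFrame, but `pd.read_csv` infers numeric columns by default and would load them as integers. A short sketch of keeping them as strings follows, using an in-memory buffer and a single made-up row in place of `csv_files/all_tweets.csv`.

```python
import io

import pandas as pd

# A one-row stand-in for the exported data, with IDs stored as strings.
original = pd.DataFrame(
    {"user_id": ["72579390"], "tweet_id": ["704265693152337921"], "replies": [0]}
)

# Round-trip through CSV in memory instead of writing to disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

# Force the ID columns to be read as strings; other columns are inferred as usual.
restored = pd.read_csv(buffer, dtype={"user_id": str, "tweet_id": str})
```

The pickle export does not have this issue, since it preserves the column dtypes as they are.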
