# Data Collection - Ukrainian inflow
**Author**: Andrea Cass

## 1. About this notebook

The purpose of this Jupyter notebook is to collect immigration-related tweets in English and German that were posted publicly in Germany during the Ukrainian inflow, between June 24, 2021 and October 24, 2022.

Tweets are collected via the Twitter API using my Academic Research developer account and Twarc2. To run the code successfully, you must also have an Academic Research developer account with Twitter and provide your bearer token in Section 4. You can apply for an account here: https://developer.twitter.com/en/products/twitter-api/academic-research

English and German tweets will be collected separately and saved to separate CSV files, titled:
> *01a_Data-Collection_limited_Ukrainian-eng.csv*

> *01b_Data-Collection_limited_Ukrainian-de.csv*

The two files will be merged in the subsequent Jupyter notebook, titled "02_Pre-processing".

**NOTE**: Do NOT simply run all cells without reading the instructions in each section. Some cells require you to alter the code; where that is the case, you will be told how to do so.

## 2. Imports

In [1]:
from datetime import date, datetime, timezone
import asyncio
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from twarc import Twarc2, expansions
import tweepy
import csv
import os
from pathlib import Path

## 3. Working directory & file paths

In the notebook for collecting tweets during the Syrian inflow, the working directory was established and a new folder, "CASS_thesis", was created within it. The code below sets up your working directory again and creates two objects called:
* **cwd**: the current working directory
* **CASS_thesis**: the folder where all data from my Notebooks will be saved

### 3.1. Current working directory
Use the code below to find out what your current working directory is set to.

In [2]:
# find current working directory

os.getcwd()

'/Users/andycass/Jupyterlab_main-folder/THESIS'

If your current working directory is not your desired directory, change the working directory by:
1. deciding where you would like your working directory to be (e.g., your Desktop)
2. entering the file path of your desired working directory into the code below

**NOTE**: If you are satisfied with your working directory and do NOT wish to change it, skip the block of code underneath **3.1.1. Changing current working directory** and, instead, proceed from the block of code underneath **3.1.2. Naming current working directory**.

#### 3.1.1. Changing current working directory
**NOTE**: The code below contains the path to **my** desired working directory to serve as an example. You must change it to the path of **your** desired working directory. Keep in mind that my example follows macOS path conventions; Windows paths are formatted differently.

In [3]:
# changing current working directory

os.chdir('/Users/andycass/Desktop/Thesis_data-and-code')
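Because the hard-coded path above only works on my machine, a more portable variant may be useful. The following is a minimal sketch, assuming your data folder sits on the Desktop inside your home directory; the folder names are only my example and should be replaced with your own. `pathlib` joins the parts with the correct separator on macOS, Linux, and Windows.

```python
import os
from pathlib import Path

# Example folder names (assumption); replace them with your own location.
desired = Path.home() / "Desktop" / "Thesis_data-and-code"

# Check first so that a mistyped path produces a readable message
# instead of a FileNotFoundError from os.chdir.
if desired.is_dir():
    os.chdir(desired)
else:
    print(f"Folder not found, working directory unchanged: {desired}")
```

If the folder does not exist, the working directory is left as it was and you can correct the path before continuing.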

#### 3.1.2. Naming current working directory
Now that your current working directory is established, use the code below to name it "cwd":

In [4]:
# naming the current working directory

cwd = Path.cwd()

In [5]:
# double-checking the current working directory location

cwd

PosixPath('/Users/andycass/Desktop/Thesis_data-and-code')

### 3.2. CASS_thesis
Now that the current working directory has been named, the code below stores the path of the "CASS_thesis" folder (which was already created in the notebook titled "01_Data-Collection_Syrian") in an object of the same name.

In [6]:
# naming the CASS_thesis folder

CASS_thesis = cwd / 'CASS_thesis'

In [7]:
# double-checking the CASS_thesis location

CASS_thesis

PosixPath('/Users/andycass/Desktop/Thesis_data-and-code/CASS_thesis')
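If you have not run the Syrian notebook first, the folder may not exist yet and saving the CSV files later would fail. The sketch below shows one way to create it on demand; `ensure_folder` is a hypothetical helper name, and the demonstration uses a temporary directory rather than your real working directory.

```python
import tempfile
from pathlib import Path

def ensure_folder(base, name="CASS_thesis"):
    """Create base/name if it is missing and return its path."""
    folder = Path(base) / name
    # exist_ok=True makes the call safe to repeat
    folder.mkdir(parents=True, exist_ok=True)
    return folder

# Demonstration in a throwaway directory; in this notebook you would pass cwd.
with tempfile.TemporaryDirectory() as tmp:
    demo = ensure_folder(tmp)
    created = demo.is_dir()
    ensure_folder(tmp)  # second call does nothing because the folder exists

print(created)
```

In this notebook the equivalent call would be `CASS_thesis = ensure_folder(cwd)`.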

## 4. Authentication
**NOTE**: You must insert the bearer token from your Academic Research account between the quotation marks in the following code. The code will NOT run otherwise.

In [8]:
bearer_token = "" # your bearer token
twarc_client = Twarc2(bearer_token=bearer_token)
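If you plan to share this notebook, you may prefer not to type the token into a cell at all. The following is a minimal sketch of reading it from an environment variable instead; the helper `load_bearer_token` and the variable name `TWITTER_BEARER_TOKEN` are my assumptions, not something Twarc2 requires.

```python
import os

def load_bearer_token(var_name="TWITTER_BEARER_TOKEN"):
    """Return the token stored in the given environment variable."""
    token = os.environ.get(var_name, "").strip()
    if not token:
        # Stop early with a readable message instead of a failed API call later
        raise RuntimeError(
            f"Set the {var_name} environment variable before running the collection cells."
        )
    return token

# Usage in this notebook (commented out so that the sketch runs without a token):
# bearer_token = load_bearer_token()
# twarc_client = Twarc2(bearer_token=bearer_token)
```

With this approach the token never appears in the saved notebook file.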

## 5. Search endpoint
### 5.1. English tweets

In [13]:
# empty list to store the search results

tweets = [] 

#### 5.1.1. Defining the query
The query will contain:
* keywords
* country
* geo-location
* language

The keywords include:
* refugee
* refugees
* asylum seeker
* ukrainian
* ukrainians

**NOTE**: Keyword matching is not case-sensitive.

In [9]:
# defining the query

query = '("asylum seeker" OR "refugee" OR "refugees" OR "ukrainian" OR "ukrainians") place_country:DE has:geo lang:en'
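Typing the query by hand makes it easy to drop a quotation mark or an `OR`. As a sketch, the same string can be assembled from the keyword list; `build_query` is a hypothetical helper, and the operators it appends are simply the ones used in the query above.

```python
def build_query(keywords, country, lang):
    """Join keywords with OR and append the country, geo, and language filters."""
    quoted = " OR ".join(f'"{keyword}"' for keyword in keywords)
    return f"({quoted}) place_country:{country} has:geo lang:{lang}"

english_keywords = ["asylum seeker", "refugee", "refugees", "ukrainian", "ukrainians"]

query = build_query(english_keywords, country="DE", lang="en")
print(query)
```

The result is identical to the hand-written query, so either version can be passed to the search.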


In [10]:
# start and end time of the query

# start: June 24, 2021
# end: October 24, 2022

start = datetime(2021, 6, 24, 0, 0, 0, 0, tzinfo=timezone.utc)
end = datetime(2022, 10, 24, 0, 0, 0, 0, tzinfo=timezone.utc)
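Before collecting, it can help to double-check the time window. The sketch below prints the window and its length in days. It also shows how to move the end forward by one day, which would only be needed if the API treats the end time as an exclusive bound and you want tweets posted during October 24 included; that behaviour is my assumption and should be confirmed in the Twitter API documentation.

```python
from datetime import datetime, timedelta, timezone

start = datetime(2021, 6, 24, tzinfo=timezone.utc)
end = datetime(2022, 10, 24, tzinfo=timezone.utc)

# Length of the collection window in whole days
window_days = (end - start).days
print(f"Window: {start.isoformat()} to {end.isoformat()} ({window_days} days)")

# Optional: an end bound that also covers all of October 24 (see the assumption above)
inclusive_end = end + timedelta(days=1)
print(f"End covering October 24: {inclusive_end.isoformat()}")
```

The collection cells in this notebook use `end` as defined above, not `inclusive_end`.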

#### 5.1.2. File name
All collected tweets will be saved to a file called:
> *01a_Data-Collection_limited_Ukrainian-eng.csv*

In [11]:
filename = CASS_thesis / "01a_Data-Collection_limited_Ukrainian-eng.csv"

#### 5.1.3. Collection & saving
The following code collects all tweets that match the query within the given start and end times, converts them to a dataframe, and saves the dataframe as a CSV file under the file name defined above.

In [15]:
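# search_all yields one page of results at a time; each page holds its tweets under "data"
for page in twarc_client.search_all(
    query,
    start_time=start,
    end_time=end):
    tweets.append(page)

# flatten the tweets from every page into one row per tweet
df = pd.json_normalize(tweets, record_path=['data'])

# save the dataframe to the csv file defined above
df.to_csv(filename, index=False)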
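To make the flattening step less opaque, here is a plain-Python sketch of roughly what `pd.json_normalize(..., record_path=['data'])` does: it takes the tweets stored under `"data"` in each page and turns nested fields into dotted column names. The two pages are invented mock data, and `flatten` and `rows_from_pages` are hypothetical helpers that are not part of pandas or Twarc2.

```python
def flatten(record, parent=""):
    """Turn nested dictionaries into one flat dictionary with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

def rows_from_pages(pages):
    """Collect one flat row per tweet from the "data" list of every page."""
    rows = []
    for page in pages:
        # Pages without a "data" key are skipped (a choice made in this sketch)
        for tweet in page.get("data", []):
            rows.append(flatten(tweet))
    return rows

# Mock pages (invented for illustration, not real API output)
pages = [
    {"data": [{"id": "1", "text": "first", "public_metrics": {"like_count": 2}}],
     "meta": {"result_count": 1}},
    {"data": [{"id": "2", "text": "second", "public_metrics": {"like_count": 0}},
              {"id": "3", "text": "third", "public_metrics": {"like_count": 5}}],
     "meta": {"result_count": 2}},
]

rows = rows_from_pages(pages)
print(len(rows))
print(rows[0])
```

Each row corresponds to one line of the resulting CSV file, and each dotted key to one column.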

### 5.2. German tweets

In [16]:
# empty list to store the search results

tweets = [] 

#### 5.2.1. Defining the query
The query will contain:
* keywords
* country
* geo-location
* language

The keywords include:
* flüchtling
* fluchtling
* fluechtling
* flüchtlinge
* fluchtlinge
* fluechtlinge
* asylbewerber
* asylbewerberin
* asylbewerberinnen
* asylsuchende
* asylsuchenden
* asylant
* asylantin
* asylanten
* asylantinnen
* syrer
* syrerin
* syrier
* ukrainer
* ukrainerin
* ukrainerinnen

**NOTE**: The list of German keywords is longer than the list of English keywords because German nouns are gendered and because umlauts may be omitted or spelled differently (e.g., "u" or "ue" instead of "ü").

**NOTE**: Keyword matching is not case-sensitive.
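The umlaut variants in the keyword list above were written out by hand. As a sketch, they could also be generated; `umlaut_variants` is a hypothetical helper that returns the original spelling, the spelling with the bare vowel, and the spelling with the "e" digraph.

```python
# Replacement spellings for each umlaut: (bare vowel, vowel + "e")
UMLAUT_MAP = {"ä": ("a", "ae"), "ö": ("o", "oe"), "ü": ("u", "ue")}

def umlaut_variants(word):
    """Return the word plus its umlaut-free spellings, without duplicates."""
    bare = word
    digraph = word
    for umlaut, (plain, expanded) in UMLAUT_MAP.items():
        bare = bare.replace(umlaut, plain)
        digraph = digraph.replace(umlaut, expanded)
    # dict.fromkeys keeps the order and drops repeated spellings
    return list(dict.fromkeys([word, bare, digraph]))

print(umlaut_variants("flüchtling"))
print(umlaut_variants("asylant"))
```

Words without an umlaut come back unchanged as a single-item list, so the helper can be applied to the whole keyword list.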

In [22]:
query = '("flüchtling" OR "fluchtling" OR "fluechtling" OR "flüchtlinge" OR "fluchtlinge" OR "fluechtlinge" OR "asylbewerber" OR "asylbewerberin" OR "asylbewerberinnen" OR "asylsuchende" OR "asylsuchenden" OR "asylant" OR "asylantin" OR "asylanten" OR "asylantinnen" OR "syrer" OR "syrerin" OR "syrier" OR "ukrainer" OR "ukrainerin" OR "ukrainerinnen") place_country:DE has:geo lang:de'

In [18]:
# start and end time of the query

# start: June 24, 2021
# end: October 24, 2022

start = datetime(2021, 6, 24, 0, 0, 0, 0, tzinfo=timezone.utc)
end = datetime(2022, 10, 24, 0, 0, 0, 0, tzinfo=timezone.utc)

#### 5.2.2. File name
All collected tweets will be saved to a file called:
> *01b_Data-Collection_limited_Ukrainian-de.csv*

In [19]:
filename = CASS_thesis / "01b_Data-Collection_limited_Ukrainian-de.csv"

#### 5.2.3. Collection & saving
The following code collects all tweets that match the query within the given start and end times, converts them to a dataframe, and saves the dataframe as a CSV file under the file name defined above.

In [23]:
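# search_all yields one page of results at a time; each page holds its tweets under "data"
for page in twarc_client.search_all(
    query,
    start_time=start,
    end_time=end):
    tweets.append(page)

# flatten the tweets from every page into one row per tweet
df = pd.json_normalize(tweets, record_path=['data'])

# save the dataframe to the csv file defined above
df.to_csv(filename, index=False)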
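Sections 5.1 and 5.2 repeat the same collection loop. The following is a minimal sketch of wrapping that loop in a function so it can be reused for both languages. `collect_pages` is a hypothetical helper, and `FakeClient` is a stand-in I made up so the sketch runs without a bearer token; it only imitates the `search_all` call used in this notebook.

```python
def collect_pages(client, query, start, end):
    """Return every page of results yielded by client.search_all."""
    pages = []
    for page in client.search_all(query, start_time=start, end_time=end):
        pages.append(page)
    return pages

class FakeClient:
    """Stand-in for the Twarc2 client that yields two invented pages."""

    def search_all(self, query, start_time=None, end_time=None):
        yield {"data": [{"id": "1", "text": f"match for {query}"}]}
        yield {"data": [{"id": "2", "text": "another match"}]}

pages = collect_pages(FakeClient(), "example query", None, None)
print(len(pages))
```

In this notebook you would pass `twarc_client`, `query`, `start`, and `end` instead, and hand the returned list to `pd.json_normalize` as before.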