# Pre-processing - Syrian and Ukrainian inflows
**Author**: Andrea Cass

## 1. About this notebook

The purpose of this Jupyter notebook is to pre-process the data collected in the 01_Data-collection Notebooks:
> *01a_Data-Collection_limited_Syrian_eng.csv*

> *01b_Data-Collection_limited_Syrian_de.csv*

> *01a_Data-Collection_limited_Ukrainian_eng.csv*

> *01b_Data-Collection_limited_Ukrainian_de.csv*

If you did not collect this data yourself but instead received the files from me, move them into a folder called "CASS_thesis" once that folder has been created in section **3.2. CASS_thesis** of this Notebook.

***

Goals:
* Format dates
* Create new 'language' column
* Merge English- and German-language datasets into one
* Create new 'inflow' column
* Merge Syrian inflow and Ukrainian inflow datasets into one
* Drop unnecessary columns

The output will be a single dataset saved as a CSV file titled
> *02_Pre-processed_limited_merged.csv*

## 2. Imports

In [1]:
import os
import re
from datetime import datetime, timedelta
from datetime import datetime as dt
from pathlib import Path

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from bs4 import BeautifulSoup
from matplotlib import dates as mpl_dates
from nltk.corpus import stopwords
from textblob import TextBlob, Word

## 3. Working directory & file paths

Before beginning data pre-processing, the working directory needs to be set up. Additionally, if you did not already use the Notebook titled "01_Data-Collection_Syrian" to create a folder called "CASS_thesis", code is provided here to do so. 

Two objects will be named:
* **cwd**: the current working directory (e.g., your Desktop)
* **CASS_thesis**: the folder where all data from my Notebooks will be saved

### 3.1. Current working directory
Use the code below to find out what your current working directory is set to.

In [2]:
# find current working directory

os.getcwd()

'/Users/andycass/Jupyterlab_main-folder/THESIS'

If your current working directory is not your desired directory, change it by:
1. deciding where you would like your working directory to be (e.g., your Desktop)
2. entering the file path of your desired working directory into the code below

**NOTE**: If you are satisfied with your working directory and do NOT wish to change it, skip the block of code underneath **3.1.1. Changing current working directory** and, instead, proceed from the block of code underneath **3.1.2. Naming current working directory**.

#### 3.1.1. Changing current working directory
**NOTE**: The code below contains the path to **my** desired working directory to serve as an example. You must change it to the path of **your** desired working directory. Keep in mind that my example follows macOS path conventions; Windows paths are formatted differently.

In [3]:
# changing current working directory

os.chdir('/Users/andycass/Desktop/Thesis_data-and-code')
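
Because path syntax differs between operating systems, the target folder can also be built with `pathlib` instead of being typed as a string. The following is a minimal sketch, assuming a folder named `Thesis_data-and-code` on the Desktop of the current user; `desired_dir` is a hypothetical name, and the location should be adapted to your own setup.

```python
import os
from pathlib import Path

# Assumed location: <home>/Desktop/Thesis_data-and-code (adapt to your machine)
desired_dir = Path.home() / "Desktop" / "Thesis_data-and-code"

# Only change directory if the folder actually exists
if desired_dir.is_dir():
    os.chdir(desired_dir)
    print(f"Working directory changed to {desired_dir}")
else:
    print(f"{desired_dir} does not exist; working directory left unchanged")
```

`Path.home()` resolves to the user's home folder on macOS, Linux, and Windows, so the same lines should work without rewriting the separators.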

#### 3.1.2. Naming current working directory
Now that your current working directory is established, use the code below to name it "cwd":

In [4]:
# naming the current working directory

cwd = Path.cwd()

In [5]:
# double-checking the current working directory location

cwd

PosixPath('/Users/andycass/Desktop/Thesis_data-and-code')

### 3.2. CASS_thesis
You may or may not have already created a folder named "CASS_thesis" depending on whether you ran the code from the first Data Collection Notebook. Nevertheless, the code below will work in either case. 

In [6]:
# naming the CASS_thesis folder

CASS_thesis = cwd / 'CASS_thesis'

In [None]:
# creating the CASS_thesis folder

CASS_thesis.mkdir(exist_ok=True)
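
The reason the cell works whether or not the folder already exists is the `exist_ok=True` argument of `Path.mkdir`. A small sketch in a temporary directory (so nothing is created in your real working directory; `demo_folder` is a hypothetical name) illustrates this:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    demo_folder = Path(tmp) / "CASS_thesis"
    demo_folder.mkdir(exist_ok=True)  # folder did not exist yet: it is created
    demo_folder.mkdir(exist_ok=True)  # folder already exists: no error is raised
    created = demo_folder.is_dir()

print(created)
```

Without `exist_ok=True`, the second call would raise a `FileExistsError`.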

In [7]:
# double-checking the CASS_thesis location

CASS_thesis

PosixPath('/Users/andycass/Desktop/Thesis_data-and-code/CASS_thesis')

## 4. Syrian inflow datasets
**NOTE**: Before proceeding, ensure that the folder called "CASS_thesis" contains the following files:
> *01a_Data-Collection_limited_Syrian_eng.csv*

> *01b_Data-Collection_limited_Syrian_de.csv*

> *01a_Data-Collection_limited_Ukrainian_eng.csv*

> *01b_Data-Collection_limited_Ukrainian_de.csv*
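
This check can also be done in code. The sketch below assumes the "CASS_thesis" folder sits in the current working directory and uses the file names exactly as listed above; `find_missing` and `expected_files` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# File names as listed in the note above
expected_files = [
    "01a_Data-Collection_limited_Syrian_eng.csv",
    "01b_Data-Collection_limited_Syrian_de.csv",
    "01a_Data-Collection_limited_Ukrainian_eng.csv",
    "01b_Data-Collection_limited_Ukrainian_de.csv",
]

def find_missing(folder, names):
    """Return the names that are not present as files in `folder`."""
    folder = Path(folder)
    return [name for name in names if not (folder / name).is_file()]

# Assumes the CASS_thesis folder is located in the current working directory
missing = find_missing(Path.cwd() / "CASS_thesis", expected_files)
print("Missing files:", missing if missing else "none")
```

If the files were saved under slightly different names, adjust `expected_files` so that it matches the names passed to `pd.read_csv` in the cells that follow.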

### 4.1. English-language dataset
#### 4.1.1 Loading the data

In [8]:
df_eng = pd.read_csv(CASS_thesis / "01a_Data-Collection_limited_Syrian-eng.csv")

#### 4.1.2 Viewing the dataframe

In [9]:
df_eng

Unnamed: 0,possibly_sensitive,edit_history_tweet_ids,lang,text,reply_settings,created_at,author_id,id,conversation_id,edit_controls.edits_remaining,...,attachments.media_keys,geo.coordinates.type,geo.coordinates.coordinates,context_annotations,referenced_tweets,in_reply_to_user_id,attachments.poll_ids,withheld.copyright,withheld.country_codes,withheld.scope
0,False,['722876040876580864'],en,UNHCR - Survivors report massive loss of life ...,everyone,2016-04-20T19:54:13.000Z,339833759,722876040876580864,722876040876580864,5,...,,,,,,,,,,
1,False,['722802163114733572'],en,Syrian artists are painting bright murals in t...,everyone,2016-04-20T15:00:39.000Z,412624794,722802163114733572,722802163114733572,5,...,,,,,,,,,,
2,False,['722712144354144256'],en,Paul Guest is an excellent moderator at the co...,everyone,2016-04-20T09:02:57.000Z,2863752725,722712144354144256,722712144354144256,5,...,['3_722697412549193728'],,,,,,,,,
3,False,['722592172663394304'],en,DEN : According to a Jerusalem Post review of ...,everyone,2016-04-20T01:06:13.000Z,186899860,722592172663394304,722592172663394304,5,...,,Point,"[13.46757813, 52.5913198]",,,,,,,
4,False,['722520855218008065'],en,Why #europe are sending #syrian #refugees to #...,everyone,2016-04-19T20:22:50.000Z,69418092,722520855218008065,722520855218008065,5,...,,Point,"[13.0667, 52.4]",,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5165,False,['547694737776320513'],en,"DE-News : Syrian President Bashar Assad, shown...",everyone,2014-12-24T10:06:15.000Z,186899860,547694737776320513,547694737776320513,5,...,,Point,"[13.46757813, 52.5913198]",,,,,,,
5166,False,['547609017623654400'],en,#CelioGermanyDesk German President Gauck calls...,everyone,2014-12-24T04:25:38.000Z,271776179,547609017623654400,547609017623654400,5,...,,Point,"[6.75600452, 51.22732903]",,,,,,,
5167,False,['547102939484286976'],en,#munich is colourful! welcome #refugees! no #p...,everyone,2014-12-22T18:54:40.000Z,15105772,547102939484286976,547102939484286976,5,...,['3_547102922203353088'],Point,"[11.54845144, 48.12402972]",,,,,,,
5168,False,['546699388316184577'],en,DE-News : There is little to break the monoton...,everyone,2014-12-21T16:11:06.000Z,186899860,546699388316184577,546699388316184577,5,...,,Point,"[13.46757813, 52.5913198]",,,,,,,


#### 4.1.3 Formatting dates 
The created_at column, containing information about when the tweet was posted, will be converted to datetime format and normalized so that new columns (e.g., 'year-month') can be derived from it.
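
To make the steps concrete, here is a sketch on a two-row toy dataframe (not the thesis data) that reuses two timestamps from the output above. It assumes a pandas version in which `.dt.isocalendar()` is available. The toy example also shows that the ISO week number and the `%U` week used in 'year-week' follow different counting rules and can therefore differ.

```python
import pandas as pd

# Two timestamps in the same format as the collected tweets
toy = pd.DataFrame({"created_at": ["2016-04-20T19:54:13.000Z",
                                   "2014-12-24T10:06:15.000Z"]})

toy["created_at"] = pd.to_datetime(toy["created_at"])        # string -> datetime
toy["date"] = toy["created_at"].dt.normalize()                # time set to midnight
toy["week"] = toy["created_at"].dt.isocalendar().week         # ISO week number
toy["year-week"] = toy["created_at"].dt.strftime("%Y-%U")     # week counted from Sundays
toy["year-month"] = toy["created_at"].dt.strftime("%Y-%m")

print(toy[["date", "week", "year-week", "year-month"]])
```

For 24 December 2014 the ISO week is 52, while `%U` yields `2014-51`, matching the values shown in the dataframe output of this section.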

In [10]:
# converting created_at to datetime format

df_eng["created_at"] = pd.to_datetime(df_eng["created_at"])

In [11]:
# converting it to date and creating a new column called "date"

df_eng['date'] = df_eng['created_at'].dt.normalize()

In [12]:
# creating week, month, year, year-week, and year-month columns
# note: 'week' is the ISO week number, whereas '%U' in 'year-week' counts weeks starting on Sunday, so the two can differ

df_eng['week'] = df_eng['created_at'].dt.isocalendar().week
df_eng['month'] = df_eng['created_at'].dt.month
df_eng['year'] = df_eng['created_at'].dt.year
df_eng['year-week'] = df_eng['created_at'].dt.strftime('%Y-%U')
df_eng['year-month'] = df_eng['created_at'].dt.strftime('%Y-%m')


#### 4.1.4 Viewing the dataframe

In [13]:
df_eng

Unnamed: 0,possibly_sensitive,edit_history_tweet_ids,lang,text,reply_settings,created_at,author_id,id,conversation_id,edit_controls.edits_remaining,...,attachments.poll_ids,withheld.copyright,withheld.country_codes,withheld.scope,date,week,month,year,year-week,year-month
0,False,['722876040876580864'],en,UNHCR - Survivors report massive loss of life ...,everyone,2016-04-20 19:54:13+00:00,339833759,722876040876580864,722876040876580864,5,...,,,,,2016-04-20 00:00:00+00:00,16,4,2016,2016-16,2016-04
1,False,['722802163114733572'],en,Syrian artists are painting bright murals in t...,everyone,2016-04-20 15:00:39+00:00,412624794,722802163114733572,722802163114733572,5,...,,,,,2016-04-20 00:00:00+00:00,16,4,2016,2016-16,2016-04
2,False,['722712144354144256'],en,Paul Guest is an excellent moderator at the co...,everyone,2016-04-20 09:02:57+00:00,2863752725,722712144354144256,722712144354144256,5,...,,,,,2016-04-20 00:00:00+00:00,16,4,2016,2016-16,2016-04
3,False,['722592172663394304'],en,DEN : According to a Jerusalem Post review of ...,everyone,2016-04-20 01:06:13+00:00,186899860,722592172663394304,722592172663394304,5,...,,,,,2016-04-20 00:00:00+00:00,16,4,2016,2016-16,2016-04
4,False,['722520855218008065'],en,Why #europe are sending #syrian #refugees to #...,everyone,2016-04-19 20:22:50+00:00,69418092,722520855218008065,722520855218008065,5,...,,,,,2016-04-19 00:00:00+00:00,16,4,2016,2016-16,2016-04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5165,False,['547694737776320513'],en,"DE-News : Syrian President Bashar Assad, shown...",everyone,2014-12-24 10:06:15+00:00,186899860,547694737776320513,547694737776320513,5,...,,,,,2014-12-24 00:00:00+00:00,52,12,2014,2014-51,2014-12
5166,False,['547609017623654400'],en,#CelioGermanyDesk German President Gauck calls...,everyone,2014-12-24 04:25:38+00:00,271776179,547609017623654400,547609017623654400,5,...,,,,,2014-12-24 00:00:00+00:00,52,12,2014,2014-51,2014-12
5167,False,['547102939484286976'],en,#munich is colourful! welcome #refugees! no #p...,everyone,2014-12-22 18:54:40+00:00,15105772,547102939484286976,547102939484286976,5,...,,,,,2014-12-22 00:00:00+00:00,52,12,2014,2014-51,2014-12
5168,False,['546699388316184577'],en,DE-News : There is little to break the monoton...,everyone,2014-12-21 16:11:06+00:00,186899860,546699388316184577,546699388316184577,5,...,,,,,2014-12-21 00:00:00+00:00,51,12,2014,2014-51,2014-12


### 4.2. German-language dataset
#### 4.2.1 Loading the data

In [14]:
df_de = pd.read_csv(CASS_thesis / "01b_Data-Collection_limited_Syrian-de.csv")

#### 4.2.2 Viewing the dataframe

In [15]:
df_de

Unnamed: 0,reply_settings,possibly_sensitive,id,text,conversation_id,author_id,edit_history_tweet_ids,created_at,lang,entities.urls,...,geo.coordinates.type,geo.coordinates.coordinates,in_reply_to_user_id,referenced_tweets,entities.hashtags,attachments.media_keys,entities.mentions,context_annotations,attachments.poll_ids,entities.cashtags
0,everyone,False,7.229216e+17,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",722921572810366977,4122038069,['722921572810366977'],2016-04-20T22:55:08.000Z,de,"[{'start': 92, 'end': 115, 'url': 'https://t.c...",...,,,,,,,,,,
1,everyone,False,7.228995e+17,"Habe schon lang nicht gehört, daß Flüchtling G...",722899547039473665,1179543852,['722899547039473665'],2016-04-20T21:27:37.000Z,de,,...,Point,"[7.1468836, 50.7306348]",,,,,,,,
2,everyone,False,7.228974e+17,"""Es kommen kaum noch Flüchtlinge nach Griechen...",722897370313195521,224607633,['722897370313195521'],2016-04-20T21:18:58.000Z,de,,...,,,,,,,,,,
3,everyone,False,7.228536e+17,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,722847691101880320,2480764313,['722853635751809025'],2016-04-20T18:25:11.000Z,de,"[{'start': 83, 'end': 106, 'url': 'https://t.c...",...,,,2480764313,"[{'type': 'replied_to', 'id': '722847691101880...","[{'start': 23, 'end': 35, 'tag': 'Flüchtlinge'}]",['7_722852972233912321'],,,,
4,everyone,False,7.228240e+17,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,722824011063799809,606265303,['722824011063799809'],2016-04-20T16:27:28.000Z,de,"[{'start': 92, 'end': 115, 'url': 'https://t.c...",...,,,,,"[{'start': 9, 'end': 21, 'tag': 'Flüchtlinge'}...",['3_722824009939886080'],,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19458,everyone,False,5.467140e+17,Stehe am Bahnhof und eine Frau erklärt ihren(?...,546714009261858816,19926396,['546714009261858816'],2014-12-21T17:09:11.000Z,de,,...,,,,,"[{'start': 107, 'end': 122, 'tag': 'ichkönntko...",,,,,
19459,everyone,False,5.467096e+17,"Fast 500 Flüchtlinge, Obdachlose und Heimkinde...",546709582018781184,12358112,['546709582018781184'],2014-12-21T16:51:36.000Z,de,"[{'start': 91, 'end': 114, 'url': 'https://t.c...",...,,,,,"[{'start': 83, 'end': 89, 'tag': 'Hansa'}]",['3_546709581229879297'],,,,
19460,everyone,False,5.466898e+17,@OomenBerlin @Rex_Cramer Ostsachsen noPerspekt...,546399944803090432,224651746,['546689796337573888'],2014-12-21T15:32:59.000Z,de,,...,,,75067369,"[{'type': 'replied_to', 'id': '546399944803090...","[{'start': 129, 'end': 136, 'tag': 'Pegida'}]",,"[{'start': 0, 'end': 12, 'username': 'OomenBer...",,,
19461,everyone,False,5.466522e+17,"'Nichts gegen Flüchtlinge, aber ein Gefängnis ...",546652160600317952,40870544,['546652160600317952'],2014-12-21T13:03:26.000Z,de,"[{'start': 75, 'end': 97, 'url': 'http://t.co/...",...,Point,"[13.4140765, 52.4883914]",,,,,,,,
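
The `id` column in the output above appears in scientific notation (e.g. `7.229216e+17`), which suggests that pandas parsed it as floating point. At this magnitude a float cannot represent every integer, so the last digits of a tweet ID can be lost. One possible safeguard, sketched here on a made-up two-row CSV rather than the thesis data, is to read ID columns as strings via the `dtype` argument of `pd.read_csv`.

```python
import io
import pandas as pd

# Made-up CSV: the second row has no id, which makes pandas fall back to float
csv_text = "id,text\n722921572810366977,first tweet\n,second tweet\n"

# Default parsing: the id column becomes floating point
as_float = pd.read_csv(io.StringIO(csv_text))

# Reading the id column as text keeps every digit
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

print(as_float["id"].dtype)
print(as_text.loc[0, "id"])
```

Whether this matters depends on whether the `id` column is used later on; `conversation_id` and `edit_history_tweet_ids` in the same dataframe still hold the full digits.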


#### 4.2.3 Formatting dates 

**NOTE**: When trying to convert created_at to datetime format, it was discovered that one or more observations had a value of "5bcd72da50f0ee77" for created_at, likely due to an error during collection. The following code locates, views, and drops the observation(s).

In [17]:
# locating the error

df_de.loc[df_de['created_at'] == '5bcd72da50f0ee77', 'created_at']

8643     5bcd72da50f0ee77
13733    5bcd72da50f0ee77
Name: created_at, dtype: object

The output from the code above indicates that the error is located at two indices:
* 8643
* 13733

In [18]:
# viewing the first index

df_de.iloc[8643]

reply_settings                     Das tut mir sehr leid. Mein Beileid an die Ang...
possibly_sensitive                                                662278575119273984
id                                                                      2982735723.0
text                                                          ['662281747464327169']
conversation_id                                             2015-11-05T14:54:07.000Z
author_id                                                                         de
edit_history_tweet_ids                                                           NaN
created_at                                                          5bcd72da50f0ee77
lang                                                                               5
entities.urls                                                                   True
geo.place_id                                                2015-11-05T15:24:07.000Z
edit_controls.edits_remaining                                    

Viewing index 8643 shows that its values are shifted into the wrong columns (e.g., the tweet text sits in reply_settings), so the row needs to be dropped.

In [19]:
df_de = df_de.drop(df_de.index[8643])

In [20]:
# viewing the second index

df_de.iloc[13732]

reply_settings                     Am ehemaligen Zaun vom Freihafen so gesehen. h...
possibly_sensitive                                                641895980024102912
id                                                                       317067353.0
text                                                          ['641895980024102912']
conversation_id                                             2015-09-10T08:48:22.000Z
author_id                                                                         de
edit_history_tweet_ids             [{'start': 105, 'end': 127, 'url': 'http://t.c...
created_at                                                          5bcd72da50f0ee77
lang                                                                               5
entities.urls                                                                   True
geo.place_id                                                2015-09-10T09:18:22.000Z
edit_controls.edits_remaining                                    

Viewing index 13733 (written as 13732 in the code because its position shifted by one after 8643 was dropped) shows the same kind of misalignment, so this row needs to be dropped as well.

In [21]:
df_de = df_de.drop(df_de.index[13732])
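
The position shift arises because `df_de.index[...]` looks rows up by position, and positions change after each drop while index labels do not. An alternative, sketched here on a toy dataframe with hypothetical names (`toy`, `bad_labels`), is to collect the labels of all faulty rows first and drop them in one step.

```python
import pandas as pd

toy = pd.DataFrame({"created_at": ["2015-11-05", "5bcd72da50f0ee77",
                                   "2015-09-10", "5bcd72da50f0ee77",
                                   "2014-12-21"]})

# Labels of the faulty rows (rows 1 and 3 in this toy example)
bad_labels = toy.index[toy["created_at"] == "5bcd72da50f0ee77"]

# Dropping by label removes both rows at once, so no position arithmetic is needed
toy_clean = toy.drop(index=bad_labels)

print(list(bad_labels))
print(toy_clean)
```

The remaining rows keep their original labels (0, 2, and 4), which is why a later `concat` with `ignore_index=True` is a convenient point to renumber them.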

In [None]:
# double-checking that the error is gone

df_de.loc[df_de['created_at'] == '5bcd72da50f0ee77', 'created_at']
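
The check above matches one known faulty value. A more general check, sketched below on toy data and under the assumption that any value pandas cannot parse as a timestamp is a collection error, uses `pd.to_datetime` with `errors="coerce"`, which turns unparseable entries into `NaT`.

```python
import pandas as pd

toy = pd.DataFrame({"created_at": ["2015-11-05T14:54:07.000Z",
                                   "5bcd72da50f0ee77",
                                   "2015-09-10T08:48:22.000Z"]})

# Unparseable values become NaT instead of raising an error
parsed = pd.to_datetime(toy["created_at"], errors="coerce", utc=True)

# Rows whose created_at could not be parsed
bad_rows = toy[parsed.isna()]

# Keep only the parseable rows and store the converted timestamps
clean = toy[parsed.notna()].copy()
clean["created_at"] = parsed[parsed.notna()]

print(bad_rows)
print(clean)
```

Inspecting `bad_rows` before dropping anything keeps the manual review step used in this Notebook while catching faulty values that are not known in advance.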

Now that the errors have been removed, dates can be formatted as usual.

In [23]:
# converting created_at to datetime format

df_de["created_at"] = pd.to_datetime(df_de["created_at"])

In [30]:
# converting it to date and creating a new column called "date"

df_de['date'] = df_de['created_at'].dt.normalize()

In [25]:
# creating week, month, year, year-week, and year-month columns
# note: 'week' is the ISO week number, whereas '%U' in 'year-week' counts weeks starting on Sunday, so the two can differ

df_de['week'] = df_de['created_at'].dt.isocalendar().week
df_de['month'] = df_de['created_at'].dt.month
df_de['year'] = df_de['created_at'].dt.year
df_de['year-week'] = df_de['created_at'].dt.strftime('%Y-%U')
df_de['year-month'] = df_de['created_at'].dt.strftime('%Y-%m')


#### 4.2.4 Viewing the dataframe

In [26]:
df_de

Unnamed: 0,reply_settings,possibly_sensitive,id,text,conversation_id,author_id,edit_history_tweet_ids,created_at,lang,entities.urls,...,attachments.media_keys,entities.mentions,context_annotations,attachments.poll_ids,entities.cashtags,week,month,year,year-week,year-month
0,everyone,False,7.229216e+17,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",722921572810366977,4122038069,['722921572810366977'],2016-04-20 22:55:08+00:00,de,"[{'start': 92, 'end': 115, 'url': 'https://t.c...",...,,,,,,16.0,4.0,2016.0,2016-16,2016-04
1,everyone,False,7.228995e+17,"Habe schon lang nicht gehört, daß Flüchtling G...",722899547039473665,1179543852,['722899547039473665'],2016-04-20 21:27:37+00:00,de,,...,,,,,,16.0,4.0,2016.0,2016-16,2016-04
2,everyone,False,7.228974e+17,"""Es kommen kaum noch Flüchtlinge nach Griechen...",722897370313195521,224607633,['722897370313195521'],2016-04-20 21:18:58+00:00,de,,...,,,,,,16.0,4.0,2016.0,2016-16,2016-04
3,everyone,False,7.228536e+17,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,722847691101880320,2480764313,['722853635751809025'],2016-04-20 18:25:11+00:00,de,"[{'start': 83, 'end': 106, 'url': 'https://t.c...",...,['7_722852972233912321'],,,,,16.0,4.0,2016.0,2016-16,2016-04
4,everyone,False,7.228240e+17,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,722824011063799809,606265303,['722824011063799809'],2016-04-20 16:27:28+00:00,de,"[{'start': 92, 'end': 115, 'url': 'https://t.c...",...,['3_722824009939886080'],,,,,16.0,4.0,2016.0,2016-16,2016-04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19458,everyone,False,5.467140e+17,Stehe am Bahnhof und eine Frau erklärt ihren(?...,546714009261858816,19926396,['546714009261858816'],2014-12-21 17:09:11+00:00,de,,...,,,,,,51.0,12.0,2014.0,2014-51,2014-12
19459,everyone,False,5.467096e+17,"Fast 500 Flüchtlinge, Obdachlose und Heimkinde...",546709582018781184,12358112,['546709582018781184'],2014-12-21 16:51:36+00:00,de,"[{'start': 91, 'end': 114, 'url': 'https://t.c...",...,['3_546709581229879297'],,,,,51.0,12.0,2014.0,2014-51,2014-12
19460,everyone,False,5.466898e+17,@OomenBerlin @Rex_Cramer Ostsachsen noPerspekt...,546399944803090432,224651746,['546689796337573888'],2014-12-21 15:32:59+00:00,de,,...,,"[{'start': 0, 'end': 12, 'username': 'OomenBer...",,,,51.0,12.0,2014.0,2014-51,2014-12
19461,everyone,False,5.466522e+17,"'Nichts gegen Flüchtlinge, aber ein Gefängnis ...",546652160600317952,40870544,['546652160600317952'],2014-12-21 13:03:26+00:00,de,"[{'start': 75, 'end': 97, 'url': 'http://t.co/...",...,,,,,,51.0,12.0,2014.0,2014-51,2014-12


### 4.3 Merging the English-language and German-language dataframes

#### 4.3.1 Creating language column

In [27]:
df_de['Language'] = 'German'
df_eng['Language'] = 'English'

#### 4.3.2 Merging the dataframes

In [31]:
# Creating a joint data frame

df_syr = pd.concat([df_de, df_eng], ignore_index = True)
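
As an illustration of what `pd.concat` does when the two dataframes do not share all columns, here is a sketch with made-up toy dataframes (the column names `only_de` and `only_en` are hypothetical): rows are stacked, columns missing on one side are filled with `NaN`, and `ignore_index=True` renumbers the rows from 0.

```python
import pandas as pd

# Toy stand-ins for the German and English dataframes
toy_de = pd.DataFrame({"text": ["Hallo"], "only_de": [1]})
toy_en = pd.DataFrame({"text": ["Hello", "Hi"], "only_en": [2, 3]})

# Label each source before stacking, as done above
toy_de["Language"] = "German"
toy_en["Language"] = "English"

# Stack the rows; columns present in only one dataframe are filled with NaN
toy_merged = pd.concat([toy_de, toy_en], ignore_index=True)

print(toy_merged)
```

The same behaviour explains why columns that exist in only one of the two dataframes contain `NaN` for the rows that come from the other.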

#### 4.3.3 Viewing the dataframe

In [32]:
df_syr

Unnamed: 0,reply_settings,possibly_sensitive,id,text,conversation_id,author_id,edit_history_tweet_ids,created_at,lang,entities.urls,...,month,year,year-week,year-month,Language,date,entities.annotations,withheld.copyright,withheld.country_codes,withheld.scope
0,everyone,False,7.229216e+17,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",722921572810366977,4122038069,['722921572810366977'],2016-04-20 22:55:08+00:00,de,"[{'start': 92, 'end': 115, 'url': 'https://t.c...",...,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,,,,
1,everyone,False,7.228995e+17,"Habe schon lang nicht gehört, daß Flüchtling G...",722899547039473665,1179543852,['722899547039473665'],2016-04-20 21:27:37+00:00,de,,...,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,,,,
2,everyone,False,7.228974e+17,"""Es kommen kaum noch Flüchtlinge nach Griechen...",722897370313195521,224607633,['722897370313195521'],2016-04-20 21:18:58+00:00,de,,...,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,,,,
3,everyone,False,7.228536e+17,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,722847691101880320,2480764313,['722853635751809025'],2016-04-20 18:25:11+00:00,de,"[{'start': 83, 'end': 106, 'url': 'https://t.c...",...,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,,,,
4,everyone,False,7.228240e+17,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,722824011063799809,606265303,['722824011063799809'],2016-04-20 16:27:28+00:00,de,"[{'start': 92, 'end': 115, 'url': 'https://t.c...",...,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24626,everyone,False,5.476947e+17,"DE-News : Syrian President Bashar Assad, shown...",547694737776320513,186899860,['547694737776320513'],2014-12-24 10:06:15+00:00,en,"[{'start': 117, 'end': 139, 'url': 'http://t.c...",...,12.0,2014.0,2014-51,2014-12,English,2014-12-24 00:00:00+00:00,"[{'start': 27, 'end': 38, 'probability': 0.949...",,,
24627,everyone,False,5.476090e+17,#CelioGermanyDesk German President Gauck calls...,547609017623654400,271776179,['547609017623654400'],2014-12-24 04:25:38+00:00,en,"[{'start': 96, 'end': 118, 'url': 'http://t.co...",...,12.0,2014.0,2014-51,2014-12,English,2014-12-24 00:00:00+00:00,"[{'start': 1, 'end': 16, 'probability': 0.2996...",,,
24628,everyone,False,5.471029e+17,#munich is colourful! welcome #refugees! no #p...,547102939484286976,15105772,['547102939484286976'],2014-12-22 18:54:40+00:00,en,"[{'start': 65, 'end': 87, 'url': 'http://t.co/...",...,12.0,2014.0,2014-51,2014-12,English,2014-12-22 00:00:00+00:00,"[{'start': 1, 'end': 6, 'probability': 0.3992,...",,,
24629,everyone,False,5.466994e+17,DE-News : There is little to break the monoton...,546699388316184577,186899860,['546699388316184577'],2014-12-21 16:11:06+00:00,en,"[{'start': 101, 'end': 123, 'url': 'http://t.c...",...,12.0,2014.0,2014-51,2014-12,English,2014-12-21 00:00:00+00:00,"[{'start': 3, 'end': 6, 'probability': 0.4604,...",,,


## 5. Ukrainian inflow datasets
### 5.1. English-language dataset
#### 5.1.1 Loading the data

In [33]:
df_eng = pd.read_csv(CASS_thesis / "01a_Data-Collection_limited_Ukrainian-eng.csv")

#### 5.1.2 Viewing the dataframe

In [34]:
df_eng

Unnamed: 0,reply_settings,created_at,text,edit_history_tweet_ids,id,context_annotations,lang,author_id,possibly_sensitive,in_reply_to_user_id,...,entities.hashtags,entities.urls,attachments.media_keys,referenced_tweets,entities.cashtags,geo.coordinates.type,geo.coordinates.coordinates,withheld.copyright,withheld.country_codes,attachments.poll_ids
0,everyone,2022-10-23T21:26:35.000Z,@elonmusk Elon Musk: it’s. 🆎out time you...,['1584295241993310209'],1584295241993310209,"[{'domain': {'id': '46', 'name': 'Business Tax...",en,229784080,False,4.419640e+07,...,,,,,,,,,,
1,everyone,2022-10-23T16:48:33.000Z,That was a nice visit @DOK_Leipzig the last co...,['1584225275914915842'],1584225275914915842,"[{'domain': {'id': '46', 'name': 'Business Tax...",en,930852774,False,,...,"[{'start': 230, 'end': 245, 'tag': 'festivalse...","[{'start': 288, 'end': 311, 'url': 'https://t....","['3_1584225201956454401', '3_15842252332167086...",,,,,,,
2,everyone,2022-10-23T16:36:15.000Z,The Evil of the Ukrainian Forces has No bounds...,['1584222176730750976'],1584222176730750976,"[{'domain': {'id': '123', 'name': 'Ongoing New...",en,1454177533931540482,False,,...,,"[{'start': 77, 'end': 100, 'url': 'https://t.c...",,,,,,,,
3,everyone,2022-10-23T16:26:46.000Z,@thesiriusreport The Ukrainians now boast of b...,['1584219793670176769'],1584219793670176769,,en,1272454735,False,7.017749e+17,...,,,,"[{'type': 'replied_to', 'id': '158414894324339...",,,,,,
4,everyone,2022-10-23T14:36:11.000Z,@tom_username_ DPR/LNR militia did a huge part...,['1584191961053159425'],1584191961053159425,,en,1121807798826930177,False,8.725516e+17,...,,,,"[{'type': 'replied_to', 'id': '158419130384313...",,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8006,everyone,2021-06-28T14:43:31.000Z,"For day 1 of week 2, @AnnaMariaKonsta discusse...",['1409522855814041606'],1409522855814041606,,en,110402493,False,1.104025e+08,...,"[{'start': 95, 'end': 108, 'tag': 'SocialRight...",,,"[{'type': 'replied_to', 'id': '140843450365639...",,,,,,
8007,everyone,2021-06-27T13:03:29.000Z,"@ariadneconill Europe is racist, but in a diff...",['1409135295895855104'],1409135295895855104,,en,2521808908,False,1.586954e+07,...,,,,"[{'type': 'replied_to', 'id': '140913435503797...",,,,,,
8008,everyone,2021-06-27T08:37:21.000Z,"A labour of love, inspired by Middle-earth.\n\...",['1409068320947593220'],1409068320947593220,"[{'domain': {'id': '130', 'name': 'Multimedia ...",en,563381751,False,5.633818e+08,...,,"[{'start': 272, 'end': 295, 'url': 'https://t....","['3_1409068306196205568', '3_14090683170636595...","[{'type': 'replied_to', 'id': '140906829783282...",,,,,,
8009,everyone,2021-06-26T08:03:22.000Z,@simongerman600 I must have missed the great f...,['1408697379063271425'],1408697379063271425,,en,2591892350,False,3.591885e+08,...,,,,"[{'type': 'replied_to', 'id': '140845747170700...",,,,,,


#### 5.1.3 Formatting dates 

In [35]:
# converting created_at to datetime format

df_eng["created_at"] = pd.to_datetime(df_eng["created_at"])

In [36]:
# converting it to date and creating a new column called "date"

df_eng['date'] = df_eng['created_at'].dt.normalize()

In [37]:
# creating week, month, year, year-week, and year-month columns
# note: 'week' is the ISO week number, whereas '%U' in 'year-week' counts weeks starting on Sunday, so the two can differ

df_eng['week'] = df_eng['created_at'].dt.isocalendar().week
df_eng['month'] = df_eng['created_at'].dt.month
df_eng['year'] = df_eng['created_at'].dt.year
df_eng['year-week'] = df_eng['created_at'].dt.strftime('%Y-%U')
df_eng['year-month'] = df_eng['created_at'].dt.strftime('%Y-%m')


#### 5.1.4 Viewing the dataframe

In [38]:
df_eng

Unnamed: 0,reply_settings,created_at,text,edit_history_tweet_ids,id,context_annotations,lang,author_id,possibly_sensitive,in_reply_to_user_id,...,geo.coordinates.coordinates,withheld.copyright,withheld.country_codes,attachments.poll_ids,date,week,month,year,year-week,year-month
0,everyone,2022-10-23 21:26:35+00:00,@elonmusk Elon Musk: it’s. 🆎out time you...,['1584295241993310209'],1584295241993310209,"[{'domain': {'id': '46', 'name': 'Business Tax...",en,229784080,False,4.419640e+07,...,,,,,2022-10-23 00:00:00+00:00,42,10,2022,2022-43,2022-10
1,everyone,2022-10-23 16:48:33+00:00,That was a nice visit @DOK_Leipzig the last co...,['1584225275914915842'],1584225275914915842,"[{'domain': {'id': '46', 'name': 'Business Tax...",en,930852774,False,,...,,,,,2022-10-23 00:00:00+00:00,42,10,2022,2022-43,2022-10
2,everyone,2022-10-23 16:36:15+00:00,The Evil of the Ukrainian Forces has No bounds...,['1584222176730750976'],1584222176730750976,"[{'domain': {'id': '123', 'name': 'Ongoing New...",en,1454177533931540482,False,,...,,,,,2022-10-23 00:00:00+00:00,42,10,2022,2022-43,2022-10
3,everyone,2022-10-23 16:26:46+00:00,@thesiriusreport The Ukrainians now boast of b...,['1584219793670176769'],1584219793670176769,,en,1272454735,False,7.017749e+17,...,,,,,2022-10-23 00:00:00+00:00,42,10,2022,2022-43,2022-10
4,everyone,2022-10-23 14:36:11+00:00,@tom_username_ DPR/LNR militia did a huge part...,['1584191961053159425'],1584191961053159425,,en,1121807798826930177,False,8.725516e+17,...,,,,,2022-10-23 00:00:00+00:00,42,10,2022,2022-43,2022-10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8006,everyone,2021-06-28 14:43:31+00:00,"For day 1 of week 2, @AnnaMariaKonsta discusse...",['1409522855814041606'],1409522855814041606,,en,110402493,False,1.104025e+08,...,,,,,2021-06-28 00:00:00+00:00,26,6,2021,2021-26,2021-06
8007,everyone,2021-06-27 13:03:29+00:00,"@ariadneconill Europe is racist, but in a diff...",['1409135295895855104'],1409135295895855104,,en,2521808908,False,1.586954e+07,...,,,,,2021-06-27 00:00:00+00:00,25,6,2021,2021-26,2021-06
8008,everyone,2021-06-27 08:37:21+00:00,"A labour of love, inspired by Middle-earth.\n\...",['1409068320947593220'],1409068320947593220,"[{'domain': {'id': '130', 'name': 'Multimedia ...",en,563381751,False,5.633818e+08,...,,,,,2021-06-27 00:00:00+00:00,25,6,2021,2021-26,2021-06
8009,everyone,2021-06-26 08:03:22+00:00,@simongerman600 I must have missed the great f...,['1408697379063271425'],1408697379063271425,,en,2591892350,False,3.591885e+08,...,,,,,2021-06-26 00:00:00+00:00,25,6,2021,2021-25,2021-06
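
Since the same date columns are created for each of the four datasets, the steps could be wrapped in a single helper. The following is a sketch only; `add_date_columns` is a hypothetical function that is not part of this Notebook, and it assumes a pandas version in which `.dt.isocalendar()` is available.

```python
import pandas as pd

def add_date_columns(df, source="created_at"):
    """Return a copy of `df` with date, week, month, year, year-week and year-month columns."""
    out = df.copy()
    out[source] = pd.to_datetime(out[source])
    out["date"] = out[source].dt.normalize()
    out["week"] = out[source].dt.isocalendar().week       # ISO week number
    out["month"] = out[source].dt.month
    out["year"] = out[source].dt.year
    out["year-week"] = out[source].dt.strftime("%Y-%U")   # week counted from Sundays
    out["year-month"] = out[source].dt.strftime("%Y-%m")
    return out

# Toy data reusing two timestamps from the output above
toy = pd.DataFrame({"created_at": ["2022-10-23T21:26:35.000Z",
                                   "2021-06-26T08:03:22.000Z"]})
toy_dates = add_date_columns(toy)

print(toy_dates[["week", "year-week", "year-month"]])
```

Working on a copy leaves the input dataframe unchanged, so the helper could be applied to each dataset in turn, for example `df_eng = add_date_columns(df_eng)`.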


### 5.2. German-language dataset
#### 5.2.1 Loading the data

In [39]:
df_de = pd.read_csv(CASS_thesis / "01b_Data-Collection_limited_Ukrainian-de.csv")



#### 5.2.2 Viewing the dataframe

In [40]:
df_de

Unnamed: 0,context_annotations,lang,author_id,text,reply_settings,possibly_sensitive,edit_history_tweet_ids,conversation_id,created_at,id,...,entities.mentions,entities.urls,entities.hashtags,attachments.media_keys,geo.coordinates.type,geo.coordinates.coordinates,attachments.poll_ids,withheld.copyright,withheld.country_codes,withheld.scope
0,"[{'domain': {'id': '46', 'name': 'Business Tax...",de,1508097355458961410,Die Ukraine plant eine „False Flag“ Operation ...,everyone,False,['1584313222764511232'],1.584313e+18,2022-10-23T22:38:02.000Z,1584313222764511232.0,...,,,,,,,,,,
1,,de,1498603032640167936,"@MalteKaufmann Oh wow..Ja, schlimme Zustände w...",everyone,False,['1584313179298885633'],1.584091e+18,2022-10-23T22:37:51.000Z,1584313179298885632.0,...,"[{'start': 0, 'end': 14, 'username': 'MalteKau...",,,,,,,,,
2,,de,1310099812474331147,"@HasnainKazim Ach, es ist ja so einfach mit de...",everyone,False,['1584258196608192512'],1.583888e+18,2022-10-23T18:59:22.000Z,1584258196608192512.0,...,"[{'start': 0, 'end': 13, 'username': 'HasnainK...",,,,,,,,,
3,,de,2218012226,"Hätte er auch gewonnen, wenn er kein Ukrainer ...",everyone,False,['1584253429714944000'],1.584253e+18,2022-10-23T18:40:26.000Z,1584253429714944000.0,...,,,,,,,,,,
4,"[{'domain': {'id': '10', 'name': 'Person', 'de...",de,1272454735,Unsere Politiker sind allmählich komplett verr...,everyone,False,['1584212627927941120'],1.584208e+18,2022-10-23T15:58:18.000Z,1584212627927941120.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33619,,de,1081214456099684352,"@polenz_r Sie ist nicht nur wunderbar, sondern...",everyone,False,['1408456160915689479'],1.337706e+18,2021-06-25T16:04:51.000Z,1408456160915689479,...,"[{'start': 0, 'end': 9, 'username': 'polenz_r'...",,,,,,,,,
33620,,de,1322552611241971714,Und morgen sind die Verantwortlichen als Flüch...,everyone,False,['1408386043800399873'],1.408386e+18,2021-06-25T11:26:14.000Z,1408386043800399873,...,,,,,,,,,,
33621,,de,3021093443,"4.700.000.000,- Euro\nfür Syrer. \nEU unterstü...",everyone,False,['1408289579115978758'],1.408290e+18,2021-06-25T05:02:55.000Z,1408289579115978758,...,,,,,,,,,,
33622,,de,10456882,Menschen hetzen gegen #LGBTQI und Flüchtlinge ...,everyone,False,['1408140630149173255'],1.408141e+18,2021-06-24T19:11:02.000Z,1408140630149173255,...,,,"[{'start': 22, 'end': 29, 'tag': 'LGBTQI'}]",,,,,,,


#### 5.2.3 Formatting dates 
**NOTE**: When trying to convert created_at to datetime format, it was discovered that one or more observations had a value of "True" for created_at, likely due to an error during collection. The following code locates, views, and drops the observation(s).

In [42]:
# locating the error

df_de.loc[df_de['created_at'] == 'True', 'created_at']

32768    True
Name: created_at, dtype: object

The error is located at index 32768.

In [43]:
# viewing the error

df_de.iloc[32768]

context_annotations                                          https://t.co/mY57RUt49f
lang                                                                        everyone
author_id                                                                      False
text                                                         ['1465568558868480001']
reply_settings                                                   1465568558868480001
possibly_sensitive                                          2021-11-30T06:29:08.000Z
edit_history_tweet_ids                                           1465568558868480001
conversation_id                                                                  5.0
created_at                                                                      True
id                                                          2021-11-30T06:59:08.000Z
edit_controls.edits_remaining                                       6e100b0c8dc4fa7e
edit_controls.is_edit_eligible                                   

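Upon viewing index 32768, it is apparent that the row's values are shifted into the wrong columns (for example, created_at holds "True" and id holds a timestamp), so the row does indeed need to be dropped.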

In [44]:
df_de = df_de.drop(df_de.index[32768])

In [45]:
# double-checking that the error is gone

df_de.loc[df_de['created_at'] == 'True', 'created_at']

Series([], Name: created_at, dtype: object)

Now that the problematic entry has been removed, formatting dates can continue.
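The check above searches for the literal string "True". As a complementary sketch, any value that cannot be parsed as a timestamp can be flagged by letting pd.to_datetime coerce failures to NaT, and the affected rows can then be removed with a boolean mask instead of a position. The toy dataframe and the names toy, parsed, bad_rows, and clean are illustrative assumptions, not part of the collected data.

```python
import pandas as pd

# Toy stand-in for df_de: the middle row mimics the shifted record found above
toy = pd.DataFrame({
    "created_at": ["2022-10-23T22:38:02.000Z", "True", "2021-06-25T16:04:51.000Z"],
    "text": ["first tweet", "shifted row", "third tweet"],
})

# errors="coerce" turns anything that is not a valid timestamp into NaT
parsed = pd.to_datetime(toy["created_at"], errors="coerce", utc=True)

# Rows that held a value but could not be parsed
bad_rows = toy[parsed.isna() & toy["created_at"].notna()]
print(bad_rows)

# Keep only parseable rows; the mask works regardless of row position
clean = toy[parsed.notna()].copy()
clean["created_at"] = parsed[parsed.notna()]
print(clean.dtypes)
```

This kind of check would also catch malformed values other than "True", which the equality test would miss.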

In [46]:
# converting created_at to datetime format

df_de["created_at"] = pd.to_datetime(df_de["created_at"])

In [47]:
# converting it to date and creating a new column called "date"

df_de['date'] = df_de['created_at'].dt.normalize()

In [48]:
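# creating week, month, year, year-week, and year-month columns
# .dt.isocalendar().week gives the ISO week number (.dt.week is deprecated and was removed in pandas 2.0)

df_de['week'] = df_de['created_at'].dt.isocalendar().week
df_de['month'] = df_de['created_at'].dt.month
df_de['year'] = df_de['created_at'].dt.year
# '%U' counts weeks from the first Sunday of the year, so 'year-week' can differ from the ISO-based 'week'
df_de['year-week'] = df_de['created_at'].dt.strftime('%Y-%U')
df_de['year-month'] = df_de['created_at'].dt.strftime('%Y-%m')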
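The week and year-week columns follow two different calendar conventions: the week number is an ISO week (Monday-based), while the '%U' format code counts weeks from the first Sunday of the year. The sketch below illustrates the difference on two timestamps that appear in this notebook's output. The iso_label column is a hypothetical alternative, shown only to make the comparison concrete.

```python
import pandas as pd

# Two timestamps taken from the dataframe output in this notebook (both fall on a Sunday)
ts = pd.Series(pd.to_datetime(["2022-10-23T22:38:02.000Z", "2021-06-27T13:03:29.000Z"]))

iso = ts.dt.isocalendar()                # DataFrame with ISO 'year', 'week', 'day'
sunday_based = ts.dt.strftime("%Y-%U")   # same format string as the 'year-week' column

# Hypothetical label built from the ISO year and ISO week, so it agrees with 'week'
iso_label = iso["year"].astype(str) + "-" + iso["week"].astype(str).str.zfill(2)

print(iso["week"].tolist())    # ISO week numbers
print(sunday_based.tolist())   # Sunday-based labels
print(iso_label.tolist())      # ISO-based labels
```

Whichever convention is used, grouping by a week label is only consistent if the same convention is applied to every dataframe that is later merged.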


#### 5.2.4 Viewing the dataframe

In [49]:
df_de

Unnamed: 0,context_annotations,lang,author_id,text,reply_settings,possibly_sensitive,edit_history_tweet_ids,conversation_id,created_at,id,...,attachments.poll_ids,withheld.copyright,withheld.country_codes,withheld.scope,date,week,month,year,year-week,year-month
0,"[{'domain': {'id': '46', 'name': 'Business Tax...",de,1508097355458961410,Die Ukraine plant eine „False Flag“ Operation ...,everyone,False,['1584313222764511232'],1.584313e+18,2022-10-23 22:38:02+00:00,1584313222764511232.0,...,,,,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10
1,,de,1498603032640167936,"@MalteKaufmann Oh wow..Ja, schlimme Zustände w...",everyone,False,['1584313179298885633'],1.584091e+18,2022-10-23 22:37:51+00:00,1584313179298885632.0,...,,,,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10
2,,de,1310099812474331147,"@HasnainKazim Ach, es ist ja so einfach mit de...",everyone,False,['1584258196608192512'],1.583888e+18,2022-10-23 18:59:22+00:00,1584258196608192512.0,...,,,,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10
3,,de,2218012226,"Hätte er auch gewonnen, wenn er kein Ukrainer ...",everyone,False,['1584253429714944000'],1.584253e+18,2022-10-23 18:40:26+00:00,1584253429714944000.0,...,,,,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10
4,"[{'domain': {'id': '10', 'name': 'Person', 'de...",de,1272454735,Unsere Politiker sind allmählich komplett verr...,everyone,False,['1584212627927941120'],1.584208e+18,2022-10-23 15:58:18+00:00,1584212627927941120.0,...,,,,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33619,,de,1081214456099684352,"@polenz_r Sie ist nicht nur wunderbar, sondern...",everyone,False,['1408456160915689479'],1.337706e+18,2021-06-25 16:04:51+00:00,1408456160915689479,...,,,,,2021-06-25 00:00:00+00:00,25.0,6.0,2021.0,2021-25,2021-06
33620,,de,1322552611241971714,Und morgen sind die Verantwortlichen als Flüch...,everyone,False,['1408386043800399873'],1.408386e+18,2021-06-25 11:26:14+00:00,1408386043800399873,...,,,,,2021-06-25 00:00:00+00:00,25.0,6.0,2021.0,2021-25,2021-06
33621,,de,3021093443,"4.700.000.000,- Euro\nfür Syrer. \nEU unterstü...",everyone,False,['1408289579115978758'],1.408290e+18,2021-06-25 05:02:55+00:00,1408289579115978758,...,,,,,2021-06-25 00:00:00+00:00,25.0,6.0,2021.0,2021-25,2021-06
33622,,de,10456882,Menschen hetzen gegen #LGBTQI und Flüchtlinge ...,everyone,False,['1408140630149173255'],1.408141e+18,2021-06-24 19:11:02+00:00,1408140630149173255,...,,,,,2021-06-24 00:00:00+00:00,25.0,6.0,2021.0,2021-25,2021-06


### 5.3 Merging the English-language and German-language dataframes

#### 5.3.1 Creating language column

In [52]:
df_de['Language'] = 'German'
df_eng['Language'] = 'English'

#### 5.3.2 Merging

In [53]:
# Creating a joint data frame

df_uk = pd.concat([df_de, df_eng], ignore_index = True)
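As a minimal sketch of what pd.concat does when the two inputs do not share exactly the same columns: the result contains the union of the columns, and rows from the dataframe that lacks a column receive NaN there. The toy frames below are assumptions for illustration. In the real data, entities.annotations appears to be one such column, since it is empty for the German rows in the output.

```python
import pandas as pd

# Toy frames: only the English one has an 'entities.annotations' column (assumed for illustration)
de = pd.DataFrame({"text": ["a", "b"], "Language": "German"})
en = pd.DataFrame({"text": ["c"], "entities.annotations": ["[...]"], "Language": "English"})

both = pd.concat([de, en], ignore_index=True)
print(both)

# German rows get NaN for the English-only column
print(both["entities.annotations"].isna().tolist())

# ignore_index=True renumbers the rows from 0 instead of keeping the original labels
print(both.index.tolist())
```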

#### 5.3.3 Viewing the dataframe

In [54]:
df_uk

Unnamed: 0,context_annotations,lang,author_id,text,reply_settings,possibly_sensitive,edit_history_tweet_ids,conversation_id,created_at,id,...,withheld.scope,date,week,month,year,year-week,year-month,Language,entities.annotations,entities.cashtags
0,"[{'domain': {'id': '46', 'name': 'Business Tax...",de,1508097355458961410,Die Ukraine plant eine „False Flag“ Operation ...,everyone,False,['1584313222764511232'],1.584313e+18,2022-10-23 22:38:02+00:00,1584313222764511232.0,...,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,
1,,de,1498603032640167936,"@MalteKaufmann Oh wow..Ja, schlimme Zustände w...",everyone,False,['1584313179298885633'],1.584091e+18,2022-10-23 22:37:51+00:00,1584313179298885632.0,...,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,
2,,de,1310099812474331147,"@HasnainKazim Ach, es ist ja so einfach mit de...",everyone,False,['1584258196608192512'],1.583888e+18,2022-10-23 18:59:22+00:00,1584258196608192512.0,...,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,
3,,de,2218012226,"Hätte er auch gewonnen, wenn er kein Ukrainer ...",everyone,False,['1584253429714944000'],1.584253e+18,2022-10-23 18:40:26+00:00,1584253429714944000.0,...,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,
4,"[{'domain': {'id': '10', 'name': 'Person', 'de...",de,1272454735,Unsere Politiker sind allmählich komplett verr...,everyone,False,['1584212627927941120'],1.584208e+18,2022-10-23 15:58:18+00:00,1584212627927941120.0,...,,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41629,,en,110402493,"For day 1 of week 2, @AnnaMariaKonsta discusse...",everyone,False,['1409522855814041606'],1.407592e+18,2021-06-28 14:43:31+00:00,1409522855814041606,...,,2021-06-28 00:00:00+00:00,26.0,6.0,2021.0,2021-26,2021-06,English,,
41630,,en,2521808908,"@ariadneconill Europe is racist, but in a diff...",everyone,False,['1409135295895855104'],1.409134e+18,2021-06-27 13:03:29+00:00,1409135295895855104,...,,2021-06-27 00:00:00+00:00,25.0,6.0,2021.0,2021-26,2021-06,English,"[{'start': 15, 'end': 20, 'probability': 0.963...",
41631,"[{'domain': {'id': '130', 'name': 'Multimedia ...",en,563381751,"A labour of love, inspired by Middle-earth.\n\...",everyone,False,['1409068320947593220'],1.409068e+18,2021-06-27 08:37:21+00:00,1409068320947593220,...,,2021-06-27 00:00:00+00:00,25.0,6.0,2021.0,2021-26,2021-06,English,"[{'start': 59, 'end': 63, 'probability': 0.416...",
41632,,en,2591892350,@simongerman600 I must have missed the great f...,everyone,False,['1408697379063271425'],1.408457e+18,2021-06-26 08:03:22+00:00,1408697379063271425,...,,2021-06-26 00:00:00+00:00,25.0,6.0,2021.0,2021-25,2021-06,English,"[{'start': 59, 'end': 68, 'probability': 0.670...",


## 6. Combining Syrian inflow and Ukrainian inflow dataframes

### 6.1. Creating inflow column

In [55]:
# Adding a new column, inflow, indicating Syrians or Ukrainians

df_syr['inflow'] = 'Syrians'
df_uk['inflow'] = 'Ukrainians'

### 6.2. Viewing the dataframes

In [56]:
df_syr

Unnamed: 0,reply_settings,possibly_sensitive,id,text,conversation_id,author_id,edit_history_tweet_ids,created_at,lang,entities.urls,...,year,year-week,year-month,Language,date,entities.annotations,withheld.copyright,withheld.country_codes,withheld.scope,inflow
0,everyone,False,7.229216e+17,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",722921572810366977,4122038069,['722921572810366977'],2016-04-20 22:55:08+00:00,de,"[{'start': 92, 'end': 115, 'url': 'https://t.c...",...,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,,,,,Syrians
1,everyone,False,7.228995e+17,"Habe schon lang nicht gehört, daß Flüchtling G...",722899547039473665,1179543852,['722899547039473665'],2016-04-20 21:27:37+00:00,de,,...,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,,,,,Syrians
2,everyone,False,7.228974e+17,"""Es kommen kaum noch Flüchtlinge nach Griechen...",722897370313195521,224607633,['722897370313195521'],2016-04-20 21:18:58+00:00,de,,...,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,,,,,Syrians
3,everyone,False,7.228536e+17,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,722847691101880320,2480764313,['722853635751809025'],2016-04-20 18:25:11+00:00,de,"[{'start': 83, 'end': 106, 'url': 'https://t.c...",...,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,,,,,Syrians
4,everyone,False,7.228240e+17,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,722824011063799809,606265303,['722824011063799809'],2016-04-20 16:27:28+00:00,de,"[{'start': 92, 'end': 115, 'url': 'https://t.c...",...,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,,,,,Syrians
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24626,everyone,False,5.476947e+17,"DE-News : Syrian President Bashar Assad, shown...",547694737776320513,186899860,['547694737776320513'],2014-12-24 10:06:15+00:00,en,"[{'start': 117, 'end': 139, 'url': 'http://t.c...",...,2014.0,2014-51,2014-12,English,2014-12-24 00:00:00+00:00,"[{'start': 27, 'end': 38, 'probability': 0.949...",,,,Syrians
24627,everyone,False,5.476090e+17,#CelioGermanyDesk German President Gauck calls...,547609017623654400,271776179,['547609017623654400'],2014-12-24 04:25:38+00:00,en,"[{'start': 96, 'end': 118, 'url': 'http://t.co...",...,2014.0,2014-51,2014-12,English,2014-12-24 00:00:00+00:00,"[{'start': 1, 'end': 16, 'probability': 0.2996...",,,,Syrians
24628,everyone,False,5.471029e+17,#munich is colourful! welcome #refugees! no #p...,547102939484286976,15105772,['547102939484286976'],2014-12-22 18:54:40+00:00,en,"[{'start': 65, 'end': 87, 'url': 'http://t.co/...",...,2014.0,2014-51,2014-12,English,2014-12-22 00:00:00+00:00,"[{'start': 1, 'end': 6, 'probability': 0.3992,...",,,,Syrians
24629,everyone,False,5.466994e+17,DE-News : There is little to break the monoton...,546699388316184577,186899860,['546699388316184577'],2014-12-21 16:11:06+00:00,en,"[{'start': 101, 'end': 123, 'url': 'http://t.c...",...,2014.0,2014-51,2014-12,English,2014-12-21 00:00:00+00:00,"[{'start': 3, 'end': 6, 'probability': 0.4604,...",,,,Syrians


In [57]:
df_uk

Unnamed: 0,context_annotations,lang,author_id,text,reply_settings,possibly_sensitive,edit_history_tweet_ids,conversation_id,created_at,id,...,date,week,month,year,year-week,year-month,Language,entities.annotations,entities.cashtags,inflow
0,"[{'domain': {'id': '46', 'name': 'Business Tax...",de,1508097355458961410,Die Ukraine plant eine „False Flag“ Operation ...,everyone,False,['1584313222764511232'],1.584313e+18,2022-10-23 22:38:02+00:00,1584313222764511232.0,...,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,,Ukrainians
1,,de,1498603032640167936,"@MalteKaufmann Oh wow..Ja, schlimme Zustände w...",everyone,False,['1584313179298885633'],1.584091e+18,2022-10-23 22:37:51+00:00,1584313179298885632.0,...,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,,Ukrainians
2,,de,1310099812474331147,"@HasnainKazim Ach, es ist ja so einfach mit de...",everyone,False,['1584258196608192512'],1.583888e+18,2022-10-23 18:59:22+00:00,1584258196608192512.0,...,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,,Ukrainians
3,,de,2218012226,"Hätte er auch gewonnen, wenn er kein Ukrainer ...",everyone,False,['1584253429714944000'],1.584253e+18,2022-10-23 18:40:26+00:00,1584253429714944000.0,...,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,,Ukrainians
4,"[{'domain': {'id': '10', 'name': 'Person', 'de...",de,1272454735,Unsere Politiker sind allmählich komplett verr...,everyone,False,['1584212627927941120'],1.584208e+18,2022-10-23 15:58:18+00:00,1584212627927941120.0,...,2022-10-23 00:00:00+00:00,42.0,10.0,2022.0,2022-43,2022-10,German,,,Ukrainians
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41629,,en,110402493,"For day 1 of week 2, @AnnaMariaKonsta discusse...",everyone,False,['1409522855814041606'],1.407592e+18,2021-06-28 14:43:31+00:00,1409522855814041606,...,2021-06-28 00:00:00+00:00,26.0,6.0,2021.0,2021-26,2021-06,English,,,Ukrainians
41630,,en,2521808908,"@ariadneconill Europe is racist, but in a diff...",everyone,False,['1409135295895855104'],1.409134e+18,2021-06-27 13:03:29+00:00,1409135295895855104,...,2021-06-27 00:00:00+00:00,25.0,6.0,2021.0,2021-26,2021-06,English,"[{'start': 15, 'end': 20, 'probability': 0.963...",,Ukrainians
41631,"[{'domain': {'id': '130', 'name': 'Multimedia ...",en,563381751,"A labour of love, inspired by Middle-earth.\n\...",everyone,False,['1409068320947593220'],1.409068e+18,2021-06-27 08:37:21+00:00,1409068320947593220,...,2021-06-27 00:00:00+00:00,25.0,6.0,2021.0,2021-26,2021-06,English,"[{'start': 59, 'end': 63, 'probability': 0.416...",,Ukrainians
41632,,en,2591892350,@simongerman600 I must have missed the great f...,everyone,False,['1408697379063271425'],1.408457e+18,2021-06-26 08:03:22+00:00,1408697379063271425,...,2021-06-26 00:00:00+00:00,25.0,6.0,2021.0,2021-25,2021-06,English,"[{'start': 59, 'end': 68, 'probability': 0.670...",,Ukrainians


### 6.3. Merging 

In [58]:
# Creating a joint data frame

df = pd.concat([df_syr, df_uk], ignore_index = True)
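A quick sanity check after a merge like this is to confirm that no rows were lost and to count rows per inflow and language. The sketch below uses small made-up frames (syr, uk, merged are hypothetical names) rather than the real data.

```python
import pandas as pd

# Made-up stand-ins for df_syr and df_uk
syr = pd.DataFrame({"text": ["s1", "s2", "s3"], "Language": ["German", "German", "English"]})
uk = pd.DataFrame({"text": ["u1", "u2"], "Language": ["German", "English"]})
syr["inflow"] = "Syrians"
uk["inflow"] = "Ukrainians"

merged = pd.concat([syr, uk], ignore_index=True)

# The merged frame should have exactly as many rows as the two inputs combined
assert len(merged) == len(syr) + len(uk)

# Number of tweets per inflow and language
counts = merged.groupby(["inflow", "Language"]).size()
print(counts)
```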

### 6.4. Viewing the dataframe

In [59]:
df

Unnamed: 0,reply_settings,possibly_sensitive,id,text,conversation_id,author_id,edit_history_tweet_ids,created_at,lang,entities.urls,...,year,year-week,year-month,Language,date,entities.annotations,withheld.copyright,withheld.country_codes,withheld.scope,inflow
0,everyone,False,722921572810366976.0,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",722921572810366977,4122038069,['722921572810366977'],2016-04-20 22:55:08+00:00,de,"[{'start': 92, 'end': 115, 'url': 'https://t.c...",...,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,,,,,Syrians
1,everyone,False,722899547039473664.0,"Habe schon lang nicht gehört, daß Flüchtling G...",722899547039473665,1179543852,['722899547039473665'],2016-04-20 21:27:37+00:00,de,,...,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,,,,,Syrians
2,everyone,False,722897370313195520.0,"""Es kommen kaum noch Flüchtlinge nach Griechen...",722897370313195521,224607633,['722897370313195521'],2016-04-20 21:18:58+00:00,de,,...,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,,,,,Syrians
3,everyone,False,722853635751809024.0,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,722847691101880320,2480764313,['722853635751809025'],2016-04-20 18:25:11+00:00,de,"[{'start': 83, 'end': 106, 'url': 'https://t.c...",...,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,,,,,Syrians
4,everyone,False,722824011063799808.0,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,722824011063799809,606265303,['722824011063799809'],2016-04-20 16:27:28+00:00,de,"[{'start': 92, 'end': 115, 'url': 'https://t.c...",...,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,,,,,Syrians
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66260,everyone,False,1409522855814041606,"For day 1 of week 2, @AnnaMariaKonsta discusse...",1407591830422687744.0,110402493,['1409522855814041606'],2021-06-28 14:43:31+00:00,en,,...,2021.0,2021-26,2021-06,English,2021-06-28 00:00:00+00:00,,,,,Ukrainians
66261,everyone,False,1409135295895855104,"@ariadneconill Europe is racist, but in a diff...",1409134355037970432.0,2521808908,['1409135295895855104'],2021-06-27 13:03:29+00:00,en,,...,2021.0,2021-26,2021-06,English,2021-06-27 00:00:00+00:00,"[{'start': 15, 'end': 20, 'probability': 0.963...",,,,Ukrainians
66262,everyone,False,1409068320947593220,"A labour of love, inspired by Middle-earth.\n\...",1409068228270231552.0,563381751,['1409068320947593220'],2021-06-27 08:37:21+00:00,en,"[{'start': 272, 'end': 295, 'url': 'https://t....",...,2021.0,2021-26,2021-06,English,2021-06-27 00:00:00+00:00,"[{'start': 59, 'end': 63, 'probability': 0.416...",,,,Ukrainians
66263,everyone,False,1408697379063271425,@simongerman600 I must have missed the great f...,1408457471707000832.0,2591892350,['1408697379063271425'],2021-06-26 08:03:22+00:00,en,,...,2021.0,2021-25,2021-06,English,2021-06-26 00:00:00+00:00,"[{'start': 59, 'end': 68, 'probability': 0.670...",,,,Ukrainians


## 7. Finding and dropping unnecessary columns
Several columns are unnecessary for the analysis and will be dropped.
Steps:
* View all column names
* Inspect some of the columns to see if they are needed or not
* Drop the columns identified as unnecessary
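One possible way to support the inspection step is to summarise, for every column, the share of missing values and the number of distinct values. Columns that are almost entirely empty or that hold a single constant value are natural candidates for dropping. This is only a sketch on a made-up dataframe. The names toy and summary are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up frame with one empty column, one sparse column, and one constant column
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "entities.cashtags": [np.nan, np.nan, np.nan, np.nan],
    "entities.hashtags": [np.nan, "[{'tag': 'x'}]", np.nan, "[{'tag': 'y'}]"],
    "reply_settings": ["everyone"] * 4,
})

summary = pd.DataFrame({
    "share_missing": toy.isna().mean(),      # fraction of NaN per column
    "n_unique": toy.nunique(dropna=True),    # distinct non-missing values per column
}).sort_values("share_missing", ascending=False)

print(summary)
```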

### 7.1. Viewing all column names

In [60]:
# viewing all column names

df.columns

Index(['reply_settings', 'possibly_sensitive', 'id', 'text', 'conversation_id',
       'author_id', 'edit_history_tweet_ids', 'created_at', 'lang',
       'entities.urls', 'geo.place_id', 'edit_controls.edits_remaining',
       'edit_controls.is_edit_eligible', 'edit_controls.editable_until',
       'public_metrics.retweet_count', 'public_metrics.reply_count',
       'public_metrics.like_count', 'public_metrics.quote_count',
       'public_metrics.impression_count', 'geo.coordinates.type',
       'geo.coordinates.coordinates', 'in_reply_to_user_id',
       'referenced_tweets', 'entities.hashtags', 'attachments.media_keys',
       'entities.mentions', 'context_annotations', 'attachments.poll_ids',
       'entities.cashtags', 'week', 'month', 'year', 'year-week', 'year-month',
       'Language', 'date', 'entities.annotations', 'withheld.copyright',
       'withheld.country_codes', 'withheld.scope', 'inflow'],
      dtype='object')

### 7.2. Inspecting column contents

In [61]:
# inspecting column 'reply_settings' (only looking at first 5 entries)

print(df['reply_settings'][0:5])

0    everyone
1    everyone
2    everyone
3    everyone
4    everyone
Name: reply_settings, dtype: object


In [62]:
# inspecting column 'edit_controls.edits_remaining' (only looking at first 5 entries)

print(df['edit_controls.edits_remaining'][0:5])

0    5.0
1    5.0
2    5.0
3    5.0
4    5.0
Name: edit_controls.edits_remaining, dtype: object


In [64]:
# inspecting column 'entities.hashtags' (only looking at first 20 entries)

print(df['entities.hashtags'][0:20])

0                                                   NaN
1                                                   NaN
2                                                   NaN
3      [{'start': 23, 'end': 35, 'tag': 'Flüchtlinge'}]
4     [{'start': 9, 'end': 21, 'tag': 'Flüchtlinge'}...
5                                                   NaN
6     [{'start': 0, 'end': 7, 'tag': 'Boehmi'}, {'st...
7     [{'start': 0, 'end': 8, 'tag': 'Bamberg'}, {'s...
8     [{'start': 64, 'end': 75, 'tag': 'Mittelmeer'}...
9                                                   NaN
10                                                  NaN
11                                                  NaN
12                                                  NaN
13                                                  NaN
14    [{'start': 0, 'end': 10, 'tag': 'Davutoğlu'}, ...
15    [{'start': 66, 'end': 73, 'tag': 'Bochum'}, {'...
16    [{'start': 16, 'end': 27, 'tag': 'Länderzeit'}...
17    [{'start': 0, 'end': 11, 'tag': 'EHFreibur
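The hashtag entries are printed as lists of dictionaries. Assuming they are stored as strings (which is typical after a round trip through a csv file, but is an assumption here), the tag names could be pulled out as sketched below. The helper extract_tags and the second tag in the last row ('demo') are made up for illustration.

```python
import ast
import pandas as pd

# Toy column; the first tag values mirror the output above, 'demo' is invented
toy = pd.Series([
    float("nan"),
    "[{'start': 23, 'end': 35, 'tag': 'Flüchtlinge'}]",
    "[{'start': 0, 'end': 7, 'tag': 'Boehmi'}, {'start': 8, 'end': 13, 'tag': 'demo'}]",
], name="entities.hashtags")

def extract_tags(cell):
    """Return the hashtag strings in one cell, or an empty list if the cell is missing."""
    if not isinstance(cell, str):
        return []
    # literal_eval safely parses the Python-literal string into a list of dicts
    return [entry["tag"] for entry in ast.literal_eval(cell)]

tags = toy.apply(extract_tags)
print(tags.tolist())
```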

In [65]:
# inspecting column 'entities.mentions' (only looking at first 20 entries)

print(df['entities.mentions'][0:20])

0                                                   NaN
1                                                   NaN
2                                                   NaN
3                                                   NaN
4                                                   NaN
5     [{'start': 0, 'end': 8, 'username': 'hataibu',...
6                                                   NaN
7                                                   NaN
8                                                   NaN
9                                                   NaN
10                                                  NaN
11    [{'start': 96, 'end': 101, 'username': 'welt',...
12    [{'start': 96, 'end': 101, 'username': 'welt',...
13                                                  NaN
14                                                  NaN
15    [{'start': 0, 'end': 13, 'username': 'AndreasL...
16    [{'start': 10, 'end': 14, 'username': 'DLF', '...
17                                              

In [66]:
# inspecting column 'entities.urls' (only looking at first 20 entries)

print(df['entities.urls'][0:20])

0     [{'start': 92, 'end': 115, 'url': 'https://t.c...
1                                                   NaN
2                                                   NaN
3     [{'start': 83, 'end': 106, 'url': 'https://t.c...
4     [{'start': 92, 'end': 115, 'url': 'https://t.c...
5                                                   NaN
6                                                   NaN
7     [{'start': 70, 'end': 93, 'url': 'https://t.co...
8     [{'start': 113, 'end': 136, 'url': 'https://t....
9                                                   NaN
10    [{'start': 76, 'end': 99, 'url': 'https://t.co...
11    [{'start': 68, 'end': 91, 'url': 'https://t.co...
12    [{'start': 68, 'end': 91, 'url': 'https://t.co...
13    [{'start': 110, 'end': 133, 'url': 'https://t....
14                                                  NaN
15    [{'start': 110, 'end': 133, 'url': 'https://t....
16                                                  NaN
17    [{'start': 82, 'end': 105, 'url': 'https:/

In [67]:
# inspecting column 'geo.coordinates.type' (only looking at first 5 entries)

print(df['geo.coordinates.type'][0:5])

0      NaN
1    Point
2      NaN
3      NaN
4      NaN
Name: geo.coordinates.type, dtype: object


In [68]:
# inspecting column 'attachments.media_keys' (only looking at first 5 entries)

print(df['attachments.media_keys'][0:5])

0                         NaN
1                         NaN
2                         NaN
3    ['7_722852972233912321']
4    ['3_722824009939886080']
Name: attachments.media_keys, dtype: object


In [69]:
# inspecting column 'context_annotations' (only looking at first 5 entries)

print(df['context_annotations'][0:5])

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: context_annotations, dtype: object


In [70]:
# inspecting column 'attachments.poll_ids' (only looking at first 5 entries)

print(df['attachments.poll_ids'][0:5])

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: attachments.poll_ids, dtype: object


In [71]:
# inspecting column 'entities.cashtags' (only looking at first 5 entries)

print(df['entities.cashtags'][0:5])

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: entities.cashtags, dtype: object


In [72]:
# inspecting column 'entities.annotations' (only looking at first 5 entries)

print(df['entities.annotations'][0:5])

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
Name: entities.annotations, dtype: object


### 7.3. Dropping columns

Unnecessary columns include:
* referenced_tweets
* id
* conversation_id
* edit_history_tweet_ids
* possibly_sensitive
* reply_settings
* edit_controls.edits_remaining
* edit_controls.is_edit_eligible
* edit_controls.editable_until
* entities.mentions
* entities.urls
* geo.coordinates.type
* attachments.media_keys
* context_annotations
* attachments.poll_ids
* entities.cashtags
* entities.annotations
* withheld.copyright
* withheld.country_codes
* withheld.scope

**NOTE**: Some of the remaining columns may turn out to be unnecessary, but they are kept in the dataframe in case they are needed later.

In [73]:
# dropping the columns

cols_to_drop = [
    'referenced_tweets', 'id', 'conversation_id', 'edit_history_tweet_ids',
    'possibly_sensitive', 'reply_settings', 'edit_controls.edits_remaining',
    'edit_controls.is_edit_eligible', 'edit_controls.editable_until',
    'entities.mentions', 'entities.urls', 'geo.coordinates.type',
    'attachments.media_keys', 'context_annotations', 'attachments.poll_ids',
    'entities.cashtags', 'entities.annotations', 'withheld.copyright',
    'withheld.country_codes', 'withheld.scope',
]
df.drop(columns=cols_to_drop, inplace=True)
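Dropping by name raises a KeyError if any listed column is absent, for example when the cell is run a second time. A hedged sketch of a more forgiving variant is shown below. It first reports which listed names are not present and then drops with errors="ignore". The toy frame and the names unwanted, not_found, and trimmed are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"text": ["a"], "id": [1], "withheld.scope": [None], "inflow": ["Syrians"]})

# Hypothetical drop list; 'attachments.poll_ids' is deliberately absent from the toy frame
unwanted = ["id", "withheld.scope", "attachments.poll_ids"]

# Report listed names that do not exist, which can reveal typos in the list
not_found = [c for c in unwanted if c not in toy.columns]
print("Listed but not present:", not_found)

# errors="ignore" skips absent labels instead of raising KeyError
trimmed = toy.drop(columns=unwanted, errors="ignore")
print(trimmed.columns.tolist())
```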


In [74]:
# viewing all column names

df.columns

Index(['text', 'author_id', 'created_at', 'lang', 'geo.place_id',
       'public_metrics.retweet_count', 'public_metrics.reply_count',
       'public_metrics.like_count', 'public_metrics.quote_count',
       'public_metrics.impression_count', 'geo.coordinates.coordinates',
       'in_reply_to_user_id', 'entities.hashtags', 'week', 'month', 'year',
       'year-week', 'year-month', 'Language', 'date', 'inflow'],
      dtype='object')

In [76]:
# viewing the dataframe

df

Unnamed: 0,text,author_id,created_at,lang,geo.place_id,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count,public_metrics.impression_count,...,in_reply_to_user_id,entities.hashtags,week,month,year,year-week,year-month,Language,date,inflow
0,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",4122038069,2016-04-20 22:55:08+00:00,de,06d9a7c249c59bcd,0.0,0.0,0,0,0.0,...,,,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians
1,"Habe schon lang nicht gehört, daß Flüchtling G...",1179543852,2016-04-20 21:27:37+00:00,de,e99b714fe65be4fb,0.0,0.0,0,0,0.0,...,,,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians
2,"""Es kommen kaum noch Flüchtlinge nach Griechen...",224607633,2016-04-20 21:18:58+00:00,de,3078869807f9dd36,0.0,0.0,0,0,0.0,...,,,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians
3,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,2480764313,2016-04-20 18:25:11+00:00,de,8abc99434d4f5d28,0.0,0.0,4,0,0.0,...,2480764313,"[{'start': 23, 'end': 35, 'tag': 'Flüchtlinge'}]",16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians
4,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,606265303,2016-04-20 16:27:28+00:00,de,e11a8b8e3771f9fa,0.0,1.0,0,0,0.0,...,,"[{'start': 9, 'end': 21, 'tag': 'Flüchtlinge'}...",16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66260,"For day 1 of week 2, @AnnaMariaKonsta discusse...",110402493,2021-06-28 14:43:31+00:00,en,fcbb3c6e0a7eba22,0.0,1.0,2.0,0.0,0,...,110402493.0,"[{'start': 95, 'end': 108, 'tag': 'SocialRight...",26.0,6.0,2021.0,2021-26,2021-06,English,2021-06-28 00:00:00+00:00,Ukrainians
66261,"@ariadneconill Europe is racist, but in a diff...",2521808908,2021-06-27 13:03:29+00:00,en,e385d4d639c6a423,0.0,1.0,6.0,0.0,0,...,15869538.0,,25.0,6.0,2021.0,2021-26,2021-06,English,2021-06-27 00:00:00+00:00,Ukrainians
66262,"A labour of love, inspired by Middle-earth.\n\...",563381751,2021-06-27 08:37:21+00:00,en,257640324f249a73,0.0,1.0,17.0,0.0,0,...,563381751.0,,25.0,6.0,2021.0,2021-26,2021-06,English,2021-06-27 00:00:00+00:00,Ukrainians
66263,@simongerman600 I must have missed the great f...,2591892350,2021-06-26 08:03:22+00:00,en,000b71538f35fe46,0.0,0.0,1.0,0.0,0,...,359188534.0,,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-26 00:00:00+00:00,Ukrainians


## 8. Saving the data

In [77]:
df.to_csv(CASS_thesis / "02_Pre-processed_limited_merged.csv")
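Because to_csv writes the row index as an unnamed first column by default, reading the file back without further options would produce an extra "Unnamed: 0" column. The sketch below shows one way the saved file might be reloaded in a later notebook. It uses a temporary folder and a made-up file name, and the choice to read author_id as text is an assumption rather than something this notebook prescribes.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "text": ["a", "b"],
    "author_id": ["1508097355458961410", "2218012226"],
    "created_at": pd.to_datetime(["2022-10-23T22:38:02Z", "2021-06-25T16:04:51Z"]),
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "toy_merged.csv"   # hypothetical file name
    toy.to_csv(out_path)                      # the index is written as an unnamed first column

    reloaded = pd.read_csv(
        out_path,
        index_col=0,                 # reuse the saved index instead of creating 'Unnamed: 0'
        dtype={"author_id": str},    # assumption: keep long IDs as text
        parse_dates=["created_at"],  # restore the datetime column
    )

print(reloaded.dtypes)
```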